Evaluating the utility of synthetic COVID-19 case data

LAY SUMMARY

There remains a strong need to share COVID-19 data with the research community. This study evaluates whether data synthesis can address that need. We created a synthetic version of the Ontario case database, which covers 90 514 individuals who tested positive for SARS-CoV-2. The synthesis method used sequential decision trees. A machine learning (gradient boosted trees) mortality prediction model was built on the synthetic data, and its accuracy and the relationships it detected were compared with those of a model built on the real data. The results of the real and synthetic data models were similar and the conclusions were the same. A privacy risk assessment of the synthetic data showed that the attribute and membership disclosure risks were low. We conclude that the synthetic version of the COVID-19 testing dataset can be shared more broadly, as it has high utility and low privacy risk.

Lead Researchers

Researchers

Khaled El Emam

Senior Scientist, CHEO Research Institute; Professor, Faculty of Medicine, University of Ottawa
