How much data is enough? Improving study design for machine learning in health care


Ottawa, Ontario — Tuesday May 5, 2026

Machine learning models are transforming health research and care, yet many studies overlook a critical step: determining how much data is needed to train a reliable model. When studies get this wrong, they waste research funding and place an unnecessary burden on the patients and families who contribute their time and health data.

A new study led by researchers Nicholas Mitsakakis and Khaled El Emam and postdoctoral fellow Dan Liu tested a new evidence-based sample size calculator that helps determine how much data is needed to train reliable machine learning models.

“When families take part in research, there’s an expectation that their data will be used carefully and lead to something meaningful. Getting the sample size right is one of the ways we honour that trust and help ensure the machine learning tools we build are reliable enough to inform real health decisions.” – Nicholas Mitsakakis

To develop the sample size calculator, the team trained commonly used machine learning models on subsamples of varying sizes taken from 13 large real-world healthcare datasets and compared their performance with models trained on the full datasets. They mapped how accuracy improves with larger datasets and estimated how much data a model needs to reach an acceptable level of performance.

The resulting sample size calculator was consistently more accurate than existing methods, which often overestimate data requirements. 

By helping researchers plan studies properly from the outset, this tool can support better study design, reduce research waste, and contribute to more trustworthy machine learning tools in the health care ecosystem.
