Every day, we use artificial intelligence (AI)-powered tools, with voice assistants like Alexa and Siri being among the most common. These consumer products function reasonably well—Siri understands most of what we say—but they are far from perfect. We accept their limitations and adjust how we use them until they get the right answer, or we give up. After all, if Siri or Alexa misunderstands a user's request, it usually doesn't matter much.
However, errors made by AI models that assist doctors in making clinical decisions can mean the difference between life and death. As a result, before we deploy these models, we must understand how well they actually work. Currently, published reports on this technology paint an overly optimistic picture of its accuracy, which feeds sensationalized coverage in the press: stories about algorithms that can diagnose early Alzheimer's disease with up to 74% accuracy, or that are more accurate than clinicians. These scientific papers may serve as the foundation for new companies, new investments and lines of research, and large-scale hospital system implementations. Yet most of the time, the technology is not ready for deployment.
Why do these reports overstate accuracy, and why does reported accuracy tend to fall as data set size increases? A standard safeguard is to train a model on one portion of the data and to evaluate it on a withheld portion the model has never seen. Ideally, the scientists should never look at the withheld data until the model is finished and frozen. In practice, however, scientists may unintentionally peek at that data and modify the model until it yields high accuracy, a phenomenon known as data leakage. By tuning their model against the held-out data and then testing it on that same data, the researchers virtually guarantee the system will correctly predict it, which inflates estimates of the model's true accuracy. Instead, they must test the model on genuinely new data to see whether it has actually learned and can make the correct diagnosis on cases it has never encountered before.
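To make the distinction concrete, here is a minimal sketch in Python using synthetic data and the scikit-learn library; the data set, the candidate regularization strengths and the choice of logistic regression are illustrative assumptions, not details from any published study. It contrasts a leaky workflow, in which the test set is consulted while choosing the model, with a proper workflow in which the withheld data is used exactly once:

```python
# Minimal sketch contrasting leaky model selection with a proper held-out
# evaluation. All data and settings here are synthetic/illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

# Small synthetic "clinical" data set: 200 patients, 50 mostly noisy features.
X, y = make_classification(n_samples=200, n_features=50, n_informative=5,
                           random_state=0)

# Split once: the test set stands in for the withheld data and should be
# touched exactly once, after the model is frozen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

candidate_C = [0.01, 0.1, 1.0, 10.0, 100.0]

# LEAKY: pick the regularization strength that looks best on the test set and
# report that same number as "accuracy." The test data has now shaped the
# model, so the estimate is optimistically biased.
leaky_scores = [LogisticRegression(C=C, max_iter=1000)
                .fit(X_train, y_train)
                .score(X_test, y_test) for C in candidate_C]
print("leaky 'accuracy':", max(leaky_scores))

# PROPER: select the hyperparameter with cross-validation on the training data
# only, then evaluate the single frozen model on the untouched test set.
cv_means = [cross_val_score(LogisticRegression(C=C, max_iter=1000),
                            X_train, y_train, cv=5).mean()
            for C in candidate_C]
best_C = candidate_C[int(np.argmax(cv_means))]
final_model = LogisticRegression(C=best_C, max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", final_model.score(X_test, y_test))
```

The point is not the specific numbers but the workflow: in the second version the withheld data shapes nothing and is consulted exactly once.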
While these overly optimistic accuracy estimates are published in the scientific literature, the lower-performing models are stuffed in the proverbial "file drawer," never to be seen by other researchers; or, if they are submitted for publication, they are less likely to be accepted. Data leakage and publication bias have a disproportionately large impact on models trained and evaluated on small data sets, because accuracy measured on a small test set is noisy: a lucky split or a little leakage can swing it dramatically. That is, models trained on small data sets are more likely to report inflated estimates of accuracy, and as a result we observe an unusual trend in the published literature: models trained on small data sets report higher accuracy than models trained on large data sets.
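The arithmetic behind this trend is easy to simulate. The short Python sketch below draws thousands of hypothetical studies and keeps only the ones that clear a "publication bar"; the 70% true accuracy, the 80% bar and the study counts are assumed values chosen purely for illustration:

```python
# Minimal simulation of how small test sets plus publication bias inflate
# reported accuracy. Every model's true accuracy is assumed to be 0.70 and
# only results of 0.80 or better are assumed to get published.
import numpy as np

rng = np.random.default_rng(0)
true_accuracy = 0.70
publish_threshold = 0.80
n_studies = 10_000

for test_set_size in (25, 100, 1000):
    # Each study measures accuracy on its own test set: a binomial draw.
    measured = rng.binomial(test_set_size, true_accuracy, n_studies) / test_set_size
    published = measured[measured >= publish_threshold]
    mean_published = published.mean() if len(published) else float("nan")
    print(f"n={test_set_size:4d}: {len(published)/n_studies:5.1%} of studies "
          f"clear the bar; mean published accuracy = {mean_published:.2f}")
```

With only 25 test cases, chance alone carries a meaningful fraction of these mediocre models over the bar, and the accuracy that gets "published" sits well above what the models can actually deliver; with 1,000 cases, it essentially never happens.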
We can avoid these problems by being more stringent about how we validate models and report results in the literature. After determining that developing an AI model for a specific application is ethical, an algorithm designer should ask, "Do we have enough data to model a construct as complex as human health?" If the answer is yes, scientists should devote more time to reliable model evaluation and less time to extracting every last ounce of "accuracy" from a model.

Reliable model validation begins with ensuring we have representative data. The design of training and test data is the most difficult problem in AI model development. While consumer AI companies harvest data opportunistically, clinical AI models require more care because the stakes are so high. Algorithm designers should routinely question the size and composition of the data used to train a model, to ensure that it reflects the range of a condition's presentation as well as the demographics of the people who will use the system. Every data set is flawed in some way. Researchers should work to understand the limits of the data used to train and test their models, and how those limits shape what the models can reliably do.
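Even a simple audit of a data set's composition, run before any model is trained, can surface the kinds of gaps described above. The sketch below uses the pandas library; the column names, age groups and target proportions are hypothetical placeholders, not a validated clinical checklist:

```python
# Minimal sketch of auditing whether a data set reflects the population a
# model will serve. All columns, groups and target figures are hypothetical.
import pandas as pd

# Hypothetical patient table with demographic and condition-severity columns.
patients = pd.DataFrame({
    "age_group": ["18-40", "18-40", "18-40", "18-40",
                  "41-65", "41-65", "41-65", "66+"],
    "sex":       ["F", "M", "F", "F", "M", "M", "F", "M"],
    "severity":  ["mild", "moderate", "severe", "mild",
                  "mild", "moderate", "severe", "mild"],
})

# Proportions we expect in the deployment population (assumed for illustration).
target_population = {"18-40": 0.30, "41-65": 0.40, "66+": 0.30}

observed = patients["age_group"].value_counts(normalize=True)
for group, expected in target_population.items():
    share = observed.get(group, 0.0)
    flag = "  <-- check" if abs(share - expected) > 0.10 else ""
    print(f"{group:>6}: data {share:.0%} vs target {expected:.0%}{flag}")

# The same audit should cover sex, severity, comorbidities, recording
# conditions and any other axis along which performance could differ.
print(patients.groupby(["sex", "severity"]).size())
```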