AI in Medicine Is Exaggerated

AI models for disease prediction in health care are not as accurate as published reports suggest. Here's why:

Every day, we use artificial intelligence (AI)-powered tools, with voice assistants like Alexa and Siri among the most common. These consumer products function reasonably well—Siri understands most of what we say—but they are far from perfect. We accept their limitations and adjust how we use them until they get the right answer, or we give up. After all, if Siri or Alexa misunderstands a user's request, it usually doesn't matter much.

However, errors made by AI models that assist doctors in making clinical decisions can mean the difference between life and death. As a result, before we deploy these models, we must understand how well they work. Currently, published reports on this technology paint an overly optimistic picture of its accuracy, which can lead to sensationalized stories in the press. The media are full of stories about algorithms that can diagnose early Alzheimer's disease with up to 74% accuracy or that are more accurate than clinicians. These scientific papers may serve as the foundation for new companies, new investments, new lines of research and large-scale hospital system implementations. In most cases, the technology is not ready for deployment.

Here is why: as researchers feed more data into AI models, the models are expected to improve or at least remain stable. But our research and that of others has shown that the reported accuracy of published models goes down as the size of the data set goes up.

The cause of this counterintuitive scenario lies in how scientists estimate and report a model's accuracy. Under best practices, researchers train their AI model on a subset of their data set while keeping the rest in a "lockbox." They then use that "held-out" data to validate the model. Suppose, for example, an AI program is being developed to distinguish people with dementia from those without it by analyzing how they speak. The model is trained on spoken-language samples and dementia diagnosis labels to predict whether a person has dementia based on their speech. It is then tested against held-out data of the same type to estimate its accuracy. That accuracy estimate is what gets published in academic journals; the higher the accuracy on the held-out data, the better the scientists say the algorithm performs.
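
As a minimal sketch of this "lockbox" procedure, assuming Python with scikit-learn and using randomly generated placeholder features and labels (not real speech data or any published dementia model), the workflow might look like this:

```python
# Minimal sketch of "lockbox" evaluation; X and y are random placeholders
# standing in for speech-derived features and dementia diagnosis labels.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))          # hypothetical speech features
y = rng.integers(0, 2, size=200)        # hypothetical diagnosis labels (1 = dementia)

# Put part of the data in the "lockbox" before any modeling decisions are made.
X_train, X_heldout, y_train, y_heldout = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# The held-out data is scored exactly once, after the model is finalized.
heldout_accuracy = accuracy_score(y_heldout, model.predict(X_heldout))
print(f"Held-out accuracy estimate: {heldout_accuracy:.2f}")
```

The key design choice is that the held-out split is created before any modeling decisions are made and is scored exactly once, after the model is final.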

So why does reported accuracy decrease as data set size increases? Ideally, the scientists should never see the held-out data until the model is finished and fixed. In practice, however, researchers may peek at the data, sometimes unintentionally, and modify the model until it yields high accuracy, a phenomenon known as data leakage. By using the held-out data to tweak their model and then testing it on that same data, the researchers virtually guarantee the system will predict the held-out data correctly, which inflates estimates of the model's true accuracy. Instead, they need to test the model on genuinely new data sets to see whether it has actually learned and can generalize to relatively unfamiliar data to make the correct diagnosis.
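
To make the leakage failure mode concrete, here is a hedged sketch on the same kind of placeholder data. The "leaky" version repeatedly consults the held-out set to choose a hyperparameter; the cleaner version tunes on a separate validation split and touches the lockbox only once. The model and settings are arbitrary stand-ins, not the workflow of any particular study.

```python
# Sketch of how data leakage inflates accuracy estimates (placeholder data only).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 20))
y = rng.integers(0, 2, size=200)

X_train, X_heldout, y_train, y_heldout = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# LEAKY: repeatedly "peeking" at the held-out set to pick the best hyperparameter.
leaky_estimate = max(
    accuracy_score(y_heldout, SVC(C=c).fit(X_train, y_train).predict(X_heldout))
    for c in [0.01, 0.1, 1, 10, 100]
)

# CLEANER: tune on a validation split carved out of the training data,
# then report the held-out score of the single chosen model.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=0
)
best_c = max(
    [0.01, 0.1, 1, 10, 100],
    key=lambda c: accuracy_score(y_val, SVC(C=c).fit(X_tr, y_tr).predict(X_val)),
)
clean_estimate = accuracy_score(
    y_heldout, SVC(C=best_c).fit(X_train, y_train).predict(X_heldout)
)

print(f"Leaky estimate: {leaky_estimate:.2f}  vs  clean estimate: {clean_estimate:.2f}")
```

Because the leaky estimate is the maximum of several held-out scores, it is biased upward, which is exactly the inflation described above.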

While these overly optimistic accuracy estimates are published in the scientific literature, the lower-performing models are stuffed in the proverbial "file drawer," never to be seen by other researchers; or, if submitted for publication, they are less likely to be accepted. Data leakage and publication bias have a disproportionately large impact on models trained and evaluated on small data sets. That is, models trained on small data sets are more likely to report inflated estimates of accuracy; as a result, we observe an unusual trend in the published literature in which models trained on small data sets report higher accuracy than models trained on large data sets.
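
A toy simulation can illustrate why small data sets amplify this effect. Assume, purely for illustration, that a model's true accuracy is 60 percent and that, of ten groups evaluating it on their own test sets, only the best-looking result gets published; none of these numbers comes from a real study.

```python
# Toy simulation of publication bias: only the top result among several
# hypothetical "labs" survives, and smaller test sets give noisier estimates.
import numpy as np

rng = np.random.default_rng(2)
true_accuracy = 0.60
n_labs = 10          # hypothetical number of competing studies
n_simulations = 2000

for test_size in [25, 100, 1000]:
    # Each lab's estimate is the fraction of correct predictions on its own test set.
    estimates = rng.binomial(test_size, true_accuracy,
                             size=(n_simulations, n_labs)) / test_size
    published = estimates.max(axis=1)   # only the best result per simulation is "published"
    print(f"test size {test_size:4d}: mean published accuracy = {published.mean():.3f}")
```

In this toy setup, the "published" accuracy with 25-sample test sets averages around 75 percent, while with 1,000-sample test sets it stays close to the true 60 percent, mirroring the trend in the literature where smaller studies report higher accuracy.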

We can avoid these problems by being more rigorous about how we validate models and report results in the literature. After determining that developing an AI model for a specific application is ethical, an algorithm designer should ask, "Do we have enough data to model a complex construct like human health?" If the answer is yes, scientists should spend more time on reliable model evaluation and less time trying to wring every last ounce of "accuracy" out of a model. Reliable model validation begins with ensuring that we have representative data. The design of training and test data sets is the most challenging problem in AI model development. While consumer AI companies harvest data opportunistically, clinical AI models require more care because of the high stakes. Algorithm designers should routinely question the size and composition of the data used to train a model to make sure it is representative of the range of a condition's presentation and of the demographics of the users. Every data set is flawed in some way. Researchers should aim to understand the limitations of the data used to train and test models and how those limitations affect the models' performance.
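
As one small, purely illustrative way to start questioning a data set's composition, a researcher might tabulate the training data against the population the tool is meant to serve; the column names, categories and reference percentages below are hypothetical.

```python
# Illustrative audit of training-set composition against a reference population.
# All categories and reference shares are hypothetical placeholders.
import pandas as pd

# Hypothetical training records; additional demographic columns (sex, language,
# comorbidities) would be audited the same way.
train = pd.DataFrame({
    "age_group": ["65-74", "65-74", "75-84", "85+", "75-84", "65-74"],
    "sex":       ["F", "M", "F", "F", "M", "M"],
})

# Hypothetical share of each age group in the population the tool is meant to serve.
reference = {"65-74": 0.55, "75-84": 0.33, "85+": 0.12}

observed = train["age_group"].value_counts(normalize=True)
for group, expected in reference.items():
    print(f"{group}: {observed.get(group, 0.0):.2f} of training data "
          f"vs {expected:.2f} of target population")
```

Even a crude comparison like this can flag, before any formal bias audit, whether some groups of intended users are barely represented in the training data.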

Unfortunately, there is no silver bullet for validating clinical AI models consistently. Every tool and every clinical population is unique. To arrive at satisfactory validation plans that account for real-world conditions, clinicians and patients must be involved early in the design process, with input from stakeholders such as the Food and Drug Administration. A broader discussion makes it more likely that the training data sets are representative, that the parameters for judging whether the model works are relevant and that what the AI tells a clinician is appropriate. There are lessons to be drawn from the reproducibility crisis in clinical research, where strategies such as pre-registration and patient-centered research have been proposed to increase transparency and foster trust. More broadly, a sociotechnical approach to AI model design recognizes that building trustworthy and responsible AI models for clinical applications is not solely a technical problem. It requires deep knowledge of the underlying clinical application area, recognition that these models exist within larger systems, and an understanding of the potential harms if model performance degrades once deployed.

Until such a comprehensive approach is taken, the AI hype will continue. That is unfortunate, because the technology has the potential to improve clinical outcomes and reach underserved communities. Adopting a more holistic approach to developing and testing clinical AI models will lead to more nuanced discussions about how well these models can work and about their limitations. We believe this will ultimately help the technology reach its full potential and help people benefit from it.
