AI’s Cognitive Limits in Medical Diagnostics: A Critical Assessment
The rapid adoption of artificial intelligence (AI) in medical diagnostics, fueled by its potential to swiftly analyze medical data and identify patterns, necessitates a thorough examination of its reliability. While AI’s capacity to process medical histories, X-rays, and other datasets to detect subtle anomalies is undeniable, a study published in the BMJ on December 20, 2024, highlights significant limitations.
AI’s Cognitive Performance: Strengths and Weaknesses
The BMJ study, published December 20, 2024, raises concerns about the cognitive capabilities of AI technologies, particularly large language models (LLMs) and chatbots. The research suggests that, akin to human cognitive decline, these AI systems might exhibit deteriorating cognitive functions over time. As the authors noted, “These findings challenge the assumption that artificial intelligence will soon replace human doctors,” underscoring potential implications for medical accuracy and patient confidence.
Methodology: Evaluating AI Cognition
Researchers assessed several publicly available LLMs, including OpenAI’s ChatGPT, Anthropic’s Sonnet, and Alphabet’s Gemini, using the Montreal Cognitive Assessment (MoCA). This neuropsychological test evaluates various cognitive domains, such as attention, memory, language, spatial skills, and executive functions. The MoCA, commonly used to detect cognitive decline in conditions like Alzheimer’s disease, involves tasks such as drawing a clock, serial subtraction, and verbal recall. A score of 26 out of 30 is generally considered passing for humans.
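To make the scoring concrete, here is a minimal, hypothetical sketch of how two text-adaptable MoCA items (serial subtraction and delayed recall) could be scored programmatically. The function names and the sample answers are illustrative assumptions, not the study’s actual protocol; in practice the answers would come from querying an LLM.

```python
# Hypothetical sketch of scoring two text-based MoCA items.
# In the study, responses would come from an LLM; here they are hard-coded.

MOCA_PASS_THRESHOLD = 26  # out of 30, the usual cutoff for humans cited above


def score_serial_sevens(answers, start=100, steps=5):
    """Serial-subtraction item: one point per correct 'subtract 7' step."""
    expected = [start - 7 * (i + 1) for i in range(steps)]
    return sum(1 for got, want in zip(answers, expected) if got == want)


def score_recall(recalled, target_words):
    """Delayed-recall item: one point per target word reproduced."""
    recalled_set = {w.lower() for w in recalled}
    return sum(1 for w in target_words if w.lower() in recalled_set)


# Example responses: perfect serial sevens, 3 of 5 recall words.
subtraction_points = score_serial_sevens([93, 86, 79, 72, 65])
recall_points = score_recall(
    ["face", "velvet", "church"],
    ["face", "velvet", "church", "daisy", "red"],
)
```

Items like clock drawing have no such direct text analogue, which is precisely the visuospatial gap the critics discuss below.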
While the LLMs showed competence in tasks involving naming, attention, language, and abstraction, their performance in visual-spatial skills and executive functions was considerably weaker. Interestingly, the latest ChatGPT version (version 4) achieved the highest score (26/30), whereas the older Gemini 1.0 scored only 16, suggesting a possible correlation between model “age” and cognitive performance in these AI systems.
Study Critique: Methodological Considerations
The study’s methodology and conclusions have drawn criticism. Critics argue that directly comparing AI cognitive function to human cognition using the MoCA is inappropriate. As Aya Awwad, a research fellow at Mass General Hospital in Boston, stated in a January 2 letter to the BMJ:
“The MoCA was designed to assess human cognition, including visuospatial reasoning and self-orientation — faculties that do not align with the text-based architecture of LLMs. One might reasonably ask: why evaluate LLMs on these metrics at all? Their deficiencies in these areas are irrelevant to the roles they might fulfill in clinical settings — primarily tasks involving text processing, summarizing complex medical literature, and offering decision support.”
Further criticism focuses on the lack of longitudinal studies. Aaron Sterling, CEO of EMR Data Cloud, and Roxana Daneshjou, assistant professor of biomedical sciences at Stanford, in a January 13 letter to the BMJ, emphasized the need for repeated testing of AI models over time to assess changes in cognitive function after updates. This, they argued, would provide a more comprehensive evaluation of the study’s hypothesis.
In response to the criticism, lead author Roy Dayan, a medical doctor at the Hadassah Medical Center in Jerusalem, clarified that the study’s playful title, “Age Against the Machine,” shouldn’t overshadow its serious purpose. In a January 10 letter to the BMJ, Dayan wrote:
“We also hoped to cast a critical lens at recent research at the intersection of medicine and AI, some of which posits LLMs as fully-fledged substitutes for human physicians. By administering the standard tests used to assess human cognitive impairment, we tried to draw out the ways in which human cognition differs from how LLMs process and respond to data. This is also why we queried them as we would query humans, rather than via ‘state-of-the-art prompting techniques’, as Dr. Awwad suggests.”
The Future of AI in Medical Diagnostics: Understanding Cognitive Limits
World Today News Senior Editor: Welcome to World Today News. Today, we have Dr. Emma Reynolds, a renowned expert on artificial intelligence in medical diagnostics, to discuss a recent critical assessment of AI’s cognitive limits. Dr. Reynolds, how notable are these limitations highlighted in the recent study published in the BMJ?
Dr. Emma Reynolds: Thank you for having me. The study’s importance cannot be overstated. While AI technologies like large language models have demonstrated remarkable capabilities in swiftly analyzing medical histories and imaging data, the research highlights key limitations, especially in cognitive domains assessed by the Montreal Cognitive Assessment, or MoCA. Attributes like visual-spatial skills and executive functions, essential for thorough diagnostics, remain challenging for AI. It’s crucial to understand that while AI excels in specific tasks, it still struggles in areas that mirror broader human cognitive abilities.
World Today News Senior Editor: In the study, it appears that the AI systems, much like humans, showed varying performance levels, especially with model age. Could you elaborate on the implications of this observation?
Dr. Emma Reynolds: Absolutely. The finding that newer versions of models, such as ChatGPT’s version 4, performed better than older versions reflects a developmental trajectory in AI. This suggests ongoing refinement and enhancement in AI capabilities over time. However, it’s vital for the medical community to interpret these improvements within specific applications. AI’s evolving performance, similar to enhancements observable in various software over updates, might lead to gradual improvements in practical applications but does not necessarily mean AI can replace human cognitive prowess.
World Today News Senior Editor: Critics argue that using the MoCA to assess AI models is not entirely suitable, given their text-based nature. What’s your take on using this methodology?
Dr. Emma Reynolds: Critics, including Aya Awwad from Mass General Hospital, have valid points. The MoCA was designed to gauge human cognitive abilities, such as visuospatial reasoning and self-orientation—areas that do not directly align with the functions of text-based AI models like large language models. Evaluating AI on these metrics perhaps doesn’t accurately reflect its clinical utility, which primarily lies in handling text-heavy tasks, summarizing complex medical literature, and supporting decision-making. Using the MoCA as a benchmark is more metaphorical, emphasizing the differences and limitations of current AI compared to the multifaceted nature of human cognition, rather than serving an evaluative purpose.
World Today News Senior Editor: There’s a call for longitudinal studies to evaluate AI over time. Why is this important, and how might these types of studies enhance our understanding of AI in medical diagnostics?
Dr. Emma Reynolds: Longitudinal studies are imperative to gain insights into AI’s developmental trajectory and maintain a realistic perspective of its capabilities and limitations. Regularly testing AI models over time, especially post-update, would allow researchers to observe how these systems adapt and improve, akin to tracking software across patches. Such studies enable a comprehensive examination of AI’s evolution and of which specific areas might need further progress or caution in application. This would particularly benefit predictive capabilities and error analysis in clinical environments, thereby fine-tuning AI deployment in medical diagnostics.
World Today News Senior Editor: Considering all the information and critiques surrounding this study, what would you say to medical professionals evaluating the integration of AI technologies in their practice?
Dr. Emma Reynolds: Medical professionals should approach AI with a balanced perspective, acknowledging its remarkable potential as well as its limitations. AI can substantially enhance diagnostic accuracy and efficiency by processing vast amounts of data quickly and spotting patterns that may escape human detection. However, it’s crucial to remember that AI should complement, not replace, human expertise. Incorporating AI tools should be done thoughtfully, ensuring they are used as an extension of the clinician’s capabilities, particularly in text-processing tasks, to provide comprehensive decision support while retaining the necessity for human oversight in interpreting nuanced clinical data.