AI Can Diagnose but Can’t Think Through It

April 24, 2026 by Chris Aiken, MD

AI nails the diagnosis but stumbles on more nuanced clinical reasoning

STUDY: Rao AS et al, JAMA Netw Open 2026;9(4):e264003
STUDY TYPE: Cross-sectional study
FUNDING: Independent

Background

AI models do well, and sometimes beat doctors, at diagnosis, but how do they handle the rest of the work? This study asked whether today’s leading models can actually work through a clinical case from beginning to end—differential diagnosis, workup, final diagnosis, and management—the way a clinician does.

The Study

Twenty-one AI models—including GPT-5, Claude 4.5 Opus, Gemini 3.0, and Grok 4—were tested on 29 standardized clinical vignettes involving differential diagnosis, diagnostic testing, final diagnosis, management, and miscellaneous clinical reasoning. Rather than raw accuracy, the researchers scored balanced performance across all five domains using a new metric (PrIME-LLM score), which penalizes models that excel in some areas but fail in others.

The top-scoring models (Grok 4, GPT-5, GPT-4.5, Claude 4.5 Opus, and the Gemini 3.0 models) performed similarly, with no statistically significant differences among them. Reasoning-optimized models outperformed standard models (mean 0.76 vs. 0.69; Cohen’s d = 2.60).
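For readers who want to unpack that effect size: Cohen’s d is the difference between two group means divided by their pooled standard deviation. The sketch below is a rough back-calculation from the rounded figures quoted above, not a number reported by the study, and the function name `implied_pooled_sd` is made up for illustration.

```python
def implied_pooled_sd(mean_a: float, mean_b: float, cohens_d: float) -> float:
    """Back out the pooled SD implied by two group means and a Cohen's d.

    Rearranges d = (mean_a - mean_b) / pooled_sd. Illustrative helper only;
    it is not part of the study's analysis.
    """
    if cohens_d == 0:
        raise ValueError("Cohen's d must be nonzero")
    return abs(mean_a - mean_b) / abs(cohens_d)


# Rounded values quoted in the summary above
reasoning_mean = 0.76  # reasoning-optimized models
standard_mean = 0.69   # standard models
reported_d = 2.60

pooled_sd = implied_pooled_sd(reasoning_mean, standard_mean, reported_d)
print(f"Implied pooled SD: {pooled_sd:.3f}")  # prints "Implied pooled SD: 0.027"
```

If the rounded inputs are close to the true values, the 0.07 gap is small in absolute terms, and the large d reflects how tightly the models within each group clustered.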

The starkest finding: Failure rates for differential diagnosis exceeded 80% across virtually every model, while failure rates for final diagnosis were below 40%. Models converged on a single answer when asked which diagnosis was most likely, but fell apart when asked what else it could be.

Practice Implications

AI models match patterns but don’t reason, and they can jump to conclusions. That’s fine for narrow, well-defined tasks; it’s a liability when the clinical picture is ambiguous.

— Chris Aiken, MD
Director, Psych Partners
Editor in Chief, Carlat Psychiatry Report