
General-Purpose AI Models Beat Specialized Clinical Tools in Real-World Physician Test, Study Finds

From Pepkio Team · 13 June 2026 · 4 min read

Specialized clinical AI tools are increasingly marketed to hospitals, but a new study suggests they may not outperform widely available general-purpose AI models. Scientists report today in Nature Medicine that leading general-purpose large language models (LLMs) outperformed two proprietary clinical AI tools on medical knowledge, alignment with expert clinical judgment, and, critically, on a new benchmark built from real physician queries.

The work, led by senior author Eric Karl Oermann at NYU Langone Health, with first author Krithik Vishwanath, is a rare independent, head-to-head comparison between subscription-based clinical AI products and frontier consumer models. The findings raise questions about the procurement and regulation of AI systems before they enter clinical settings.

The researchers evaluated three frontier LLMs—OpenAI’s GPT-5.2, Google’s Gemini 3.1 Pro, and Anthropic’s Claude Opus 4.6—alongside two specialized clinical tools, OpenEvidence and UpToDate Expert AI. They also included Google’s auto-enabled Search AI Overview as a real-world control that physicians frequently encounter. The evaluation spanned three stages: a 500-question medical licensing exam-style test (MedQA), 500 items from the HealthBench benchmark measuring alignment with clinical experts, and a novel Real Clinical Queries (RCQ) benchmark.

The RCQ benchmark, which the authors consider the study’s primary evidence, was built from 100 de-identified clinician queries submitted to a GPT instance during live clinical work at NYU Langone Health. Twelve blinded physicians rated the models’ anonymized responses on a four-point scale for clinical correctness, completeness, safety, and clarity, producing 1,800 individual evaluations.

On medical knowledge, Gemini achieved the highest accuracy on MedQA at 97.4%, followed by GPT at 94.2% and Claude at 90.2%. The clinical tools scored lower, with OpenEvidence reaching 89.6% and UpToDate achieving 88.4%. On the HealthBench benchmark, which measures alignment with expert clinical judgment, GPT led with a mean score of 88.0 out of 100, while OpenEvidence and UpToDate scored 62.6 and 61.3, respectively.

On the real-world physician queries, a clear two-tier pattern emerged. Frontier LLMs formed the top tier, with Gemini scoring a mean aggregate rating of 3.62 on the four-point scale, followed closely by GPT at 3.54 and Claude at 3.52. The clinical AI tools and Google AI Overview clustered in a lower tier, with OpenEvidence at 3.24, UpToDate Expert AI at 3.17, and Google AI Overview at 3.27. After adjusting for rater leniency, the odds of a clinical tool receiving a higher rating were 49 to 87 percent lower than the odds for Gemini. Notably, Google’s freely available search feature matched the performance of the specialized clinical AI products.
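For readers unfamiliar with odds-based reporting, the short Python sketch below shows how a statement such as "odds 49 to 87 percent lower" maps onto an odds ratio and what it could mean for a probability. The function names and the 50 percent baseline are illustrative assumptions. They are not taken from the study, which this article does not describe at the level of its statistical model.

```python
def odds_ratio_from_percent_lower(percent_lower: float) -> float:
    """Convert 'odds are X percent lower' into an odds ratio vs. the reference."""
    if not 0 <= percent_lower < 100:
        raise ValueError("percent_lower must be in [0, 100)")
    return 1.0 - percent_lower / 100.0


def apply_odds_ratio(baseline_prob: float, odds_ratio: float) -> float:
    """Scale the odds implied by baseline_prob, then convert back to a probability."""
    if not 0 < baseline_prob < 1:
        raise ValueError("baseline_prob must be strictly between 0 and 1")
    baseline_odds = baseline_prob / (1.0 - baseline_prob)
    new_odds = baseline_odds * odds_ratio
    return new_odds / (1.0 + new_odds)


# Hypothetical baseline: assume the reference model clears some rating
# threshold half the time. This number is NOT from the study.
ASSUMED_BASELINE = 0.5

for pct in (49, 87):  # the range reported in the article
    ratio = odds_ratio_from_percent_lower(pct)
    prob = apply_odds_ratio(ASSUMED_BASELINE, ratio)
    print(f"{pct}% lower odds -> odds ratio {ratio:.2f}, "
          f"probability {prob:.3f} at an assumed 0.5 baseline")
```

Under that assumed baseline, the reported range corresponds to odds ratios of roughly 0.51 and 0.13. The probabilities depend entirely on the baseline chosen, so they illustrate the conversion and do not restate the study's findings.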

The gaps were largest on clarity, where OpenEvidence scored the lowest, suggesting its weakness was communication rather than raw knowledge. UpToDate Expert AI refused to answer 19 percent of queries, far more than the frontier models, which refused just 1 to 3 percent. No significant differences were found between models in rates of harmful content or hallucination.

The study has several limitations. The clinical tools lack public application programming interfaces and were queried through browser interfaces, which may have introduced subtle differences in prompting or output formatting. The HealthBench benchmark was developed by OpenAI, and the authors note that potential developer overlap could favor GPT’s score on that specific test. The study is a snapshot of a rapidly evolving field, and the authors caution that deeply subspecialized medical tasks might one day favor more sophisticated, domain-specific adaptation.

“Clinical AI tools may carry institutional legitimacy and are likely safe for routine use,” the authors write, “but our results show that they are not superior to frontier models on knowledge, communication or clinical alignment.” The findings suggest that for many medical tasks, the scale and cross-domain reasoning of general-purpose models currently outweigh the benefits of specialized fine-tuning—a reality with implications for hospital purchasing, regulatory oversight, and the future design of medical AI systems.

Reference

Vishwanath K, Alyakin A, Ghosh M, et al. General-purpose large language models outperform specialized clinical AI tools on medical benchmarks. Nature Medicine (2026). https://doi.org/10.1038/s41591-026-04431-5
