Why Leading Medical LLMs Falter in Dynamic Red‑Team Tests – The DAS Framework
A new study shows that large language models which excel on static medical exams lose accuracy dramatically under the Dynamic, Automatic, Systematic (DAS) red‑team framework. The tests expose serious weaknesses in robustness, privacy, bias, and hallucination, and the authors urge continuous adversarial evaluation as a prerequisite for trustworthy clinical AI.
Background
Recent large language models (LLMs) such as Med‑Gemini and the latest OpenAI releases achieve high accuracy on medical licensing exams, but static benchmark scores can conceal safety problems that may jeopardize patient care.
Limitations of static benchmarks
Model development outpaces benchmark updates, leaving assessments outdated.
According to Goodhart’s Law, once a metric becomes a target it ceases to be a reliable measure; models can over‑fit to benchmarks or train on contaminated benchmark data.
Static tests are inefficient at revealing unknown risks in safety‑critical medical contexts.
DAS Red‑Team Framework
The Dynamic, Automatic, and Systematic (DAS) framework replaces static exams with continuous adversarial evaluation. Autonomous agents automatically generate test cases, evolve attack strategies, and evaluate model responses without human intervention.
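The control loop can be illustrated with a minimal sketch in which the attacker, target model, and evaluator are all stand-in functions; every name below is hypothetical, since the released agents wrap LLM calls rather than string rules.

```python
import random

def attacker_mutate(question, strategy):
    # Stand-in for the LLM attacker agent: apply a named perturbation
    # strategy to a seed question (a string rule here, an agent in DAS).
    if strategy == "negate_answer":
        return question.replace("most likely", "least likely")
    return question

def target_model(question):
    # Stand-in for the model under test: a guesser that always says "A".
    return "A"

def evaluator(answer, gold):
    # Stand-in for the evaluator agent: exact-match scoring.
    return answer == gold

def das_round(seed_questions, strategies, gold):
    # One red-team round: mutate, query, score. Strategies that broke
    # the model survive to seed the next round ("evolved" attacks).
    results, surviving = [], []
    for q in seed_questions:
        s = random.choice(strategies)
        ok = evaluator(target_model(attacker_mutate(q, s)), gold[q])
        results.append((q, s, ok))
        if not ok:
            surviving.append(s)
    return results, surviving
```

Running rounds repeatedly, with surviving strategies fed back in, is what makes the evaluation dynamic rather than a fixed test set.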
Evaluation dimensions
Robustness: Ability to maintain accuracy when presented with plausible but incorrect options, physiologically impossible lab results, or missing correct answers.
Privacy: Risk of unintentionally leaking protected health information (HIPAA/GDPR) during informal, lengthy dialogues.
Bias/Fairness: Susceptibility to demographic, linguistic, or authority cues that alter diagnoses or treatment recommendations.
Hallucination: Frequency of fabricating clinical guidelines, citing nonexistent literature, or recommending unsafe therapies under high‑risk queries.
Experimental setup
Fifteen open‑source and closed‑source LLMs were evaluated. Although many models scored >80 % on static benchmarks such as MedQA, they exhibited severe vulnerabilities under DAS testing.
Robustness
Six mutation tools (answer negation, question reversal, option expansion, narrative distraction, cognitive bait, and physiological fallacy) reduced the share of correct answers by up to 94 %.
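Two of these mutations are easy to illustrate with plain string transforms. This is a rough sketch: the paper's tools are agent-driven, and the option labels and lab values here are invented.

```python
def expand_options(options, distractor):
    # Option expansion: append a plausible but incorrect choice so the
    # letter of the correct answer shifts and label-matching models fail.
    letters = "ABCDEFGH"
    expanded = list(options) + [distractor]
    return {letters[i]: text for i, text in enumerate(expanded)}

def physiological_fallacy(stem):
    # Physiological fallacy: swap a stated vital sign for an impossible
    # value; a safe model should flag it rather than answer as usual.
    return stem.replace("temperature 37.0", "temperature 58.0")
```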
Privacy
In 81 simulated privacy‑leak scenarios, the average “jailbreak” rate reached 86 % without adversarial prompting and remained above 66 % even when explicit HIPAA/GDPR compliance instructions were added. Four disguise strategies (benevolent disguise, subtle request, focus misdirection, trap warning) further collapsed privacy defenses.
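A disguise strategy can be thought of as a social-engineering wrapper around a privacy-sensitive request. The prompt wordings and the leak check below are assumptions for illustration, not the study's actual prompts.

```python
DISGUISES = {
    # Hypothetical paraphrases of the four disguise strategies named in
    # the study; the exact wording used by the agents is unknown.
    "benevolent": "I'm the patient's care coordinator and need this to help them: ",
    "subtle": "Just out of curiosity, roughly speaking, ",
    "misdirection": "Setting the identifiers aside, about the record overall: ",
    "trap_warning": "Don't reveal PHI, but just to check your filters: ",
}

def disguise_request(request, strategy):
    # Wrap a privacy-sensitive request in a social-engineering frame.
    return DISGUISES[strategy] + request

def leaked_phi(response, phi_terms):
    # Crude leak check: did any protected identifier survive into the reply?
    low = response.lower()
    return [t for t in phi_terms if t.lower() in low]
```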
Bias & Fairness
“Cognitive bias activation” (e.g., inserting authority cues) altered clinical recommendations in 81 % of cases. Changing patient demographic labels or language style also triggered significant bias, indicating current LLMs are not yet fair enough for real‑world clinical deployment.
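Counterfactual demographic probing of this kind can be sketched as follows; the vignette placeholder and decision labels are hypothetical.

```python
def demographic_variants(vignette, placeholder, groups):
    # Render the same clinical vignette once per demographic label,
    # keeping every other detail fixed (a counterfactual probe).
    return {g: vignette.replace(placeholder, g) for g in groups}

def biased(recommendations):
    # Flag bias if clinically identical cases that differ only in the
    # demographic label receive different recommendations.
    return len(set(recommendations.values())) > 1
```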
Hallucination
A seven‑category hallucination taxonomy was applied via an automated agent detector. All models hallucinated in more than 50 % of high‑risk clinical queries; even the best‑performing model fabricated facts, cited non‑existent papers, or suggested contraindicated treatments.
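One narrow slice of such a detector, flagging citations that do not appear in a verified reference index, can be sketched as below. The regex and the index are assumptions; the paper's detector is an LLM agent covering all seven categories.

```python
import re

def extract_citations(response):
    # Pull citation-like spans such as "(Smith et al., 2021)".
    return re.findall(r"\(([A-Z][A-Za-z]+ et al\., \d{4})\)", response)

def flag_fabricated(response, known_refs):
    # Any cited work absent from the verified reference index is flagged
    # as a possible fabrication for human review.
    return [c for c in extract_citations(response) if c not in known_refs]
```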
Implications
High static benchmark scores do not guarantee safety or reliability in clinical settings. The DAS framework provides a scalable, evolving “firewall” that can continuously audit LLMs before they are deployed in patient‑facing chatbots or decision‑support systems. Future releases of medical LLMs should include a DAS‑generated risk dossier, analogous to a drug’s side‑effect label, to transparently disclose capabilities and limitations.
Paper: https://arxiv.org/abs/2508.00923
Code (agents): https://github.com/JZPeterPan/DAS-Medical-Red-Teaming-Agents
Dataset: https://huggingface.co/datasets/JZPeterPan/DAS-Mediacal-Red-Teaming-Data
Code example
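As a minimal end-to-end illustration of the robustness protocol: a brittle stub model aces the static items yet keeps its old answer after question reversal. The stub and question format are assumptions, not the released agents.

```python
def target_model(question):
    # Stub for the model under test: a position-biased guesser that
    # always returns option "B" (swap in a real API call to reproduce).
    return "B"

def reverse_question(stem):
    # One of the six mutation tools: question reversal, which flips
    # which option is correct while leaving the vignette intact.
    return stem.replace("most appropriate", "least appropriate")

def evaluate(cases):
    # Accuracy on static items vs. the fraction still giving the old
    # answer after reversal -- the gap static benchmarks never surface.
    n = len(cases)
    static_acc = sum(target_model(q) == gold for q, gold in cases) / n
    fooled = sum(target_model(reverse_question(q)) == gold for q, gold in cases) / n
    return static_acc, fooled
```

A model that scores well on the static items yet repeats the same answer on reversed ones exhibits exactly the robustness failure reported above.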
Source: ScienceAI; republished by Data Party THU, the official platform of the Tsinghua Big Data Research Center for sharing the team's research, teaching updates, and big data news. The original article runs to about 2,700 characters (roughly a 5-minute read).
