Responding to AI Diagnostic Failures in Emergency Medicine


By Chester Shermer | Mar 25, 2026

Every technology introduced into emergency medicine eventually produces its first sentinel event. Computerized physician order entry was supposed to eliminate medication errors — and it introduced an entirely new category of them. Sepsis alert systems were supposed to save lives — and they generated enough false positives that nurses began overriding them by reflex. The history of clinical informatics is a history of tools that solved one problem and created three more that no one anticipated.
 
AI diagnostic tools are following the same arc. The question is not whether they will fail. They already are. The question is whether the emergency physicians using them understand how they fail, can recognize failure in real time, and have a decision framework for what to do when the algorithm and the patient tell different stories.

HOW AI DIAGNOSTIC TOOLS FAIL IN THE ED

AI diagnostic failures in emergency medicine cluster into four recognizable patterns. Understanding them is the first step toward not becoming their victim.

The first is distribution shift. Every AI diagnostic tool was trained on a specific patient population. When your patient falls outside that population — different demographics, different comorbidity profile, different disease prevalence — the algorithm's outputs become less reliable, sometimes dramatically so. A chest X-ray AI trained predominantly on data from large urban academic centers may perform differently in a patient population with higher rates of endemic fungal disease, prior TB exposure, or unusual occupational lung pathology. The algorithm doesn't know it's outside its training distribution. It will still generate a probability estimate. That estimate just means less than it did for the patients it was built on.
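
For the quantitatively inclined, the guard rail here is conceptually simple. The sketch below (Python, purely illustrative) assumes the vendor publishes summary statistics for its training population and flags patients who sit far outside it; the feature names and the z-score cutoff are my assumptions, not any product's logic.

```python
# A minimal sketch of a distribution-shift guard. TRAIN_STATS stands in
# for summary statistics (mean, SD) of the tool's training population;
# features and the z-score cutoff are illustrative assumptions.
TRAIN_STATS = {"age": (54.0, 16.0), "creatinine": (1.0, 0.4)}

def out_of_distribution(patient: dict, z_cutoff: float = 3.0) -> list[str]:
    """Return the features on which this patient sits far outside the
    training population. A non-empty list means the probability
    estimate deserves extra skepticism, not that it is wrong."""
    flags = []
    for feature, (mean, sd) in TRAIN_STATS.items():
        z = abs(patient[feature] - mean) / sd
        if z > z_cutoff:
            flags.append(feature)
    return flags

# A dialysis-range creatinine sits far outside this training population.
print(out_of_distribution({"age": 96, "creatinine": 4.8}))  # ['creatinine']
```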
 
The second failure pattern is label propagation error. AI algorithms learn from historical clinical data, and historical clinical data contains the diagnostic errors, biases, and practice variation of the humans who generated it. If a training dataset systematically under-diagnosed pulmonary embolism in younger women — a well-documented historical pattern — an algorithm trained on that data will inherit that blind spot. The algorithm is not neutral. It reflects the clinical culture that generated its training labels.
 
The third pattern is threshold miscalibration. Most AI diagnostic tools output a probability estimate, and the clinical workflow converts that probability into an action recommendation based on a threshold — high risk, low risk, recommend CT, discharge safe. Those thresholds are calibrated on the training population. In your patient population, with your disease prevalence and your patient demographics, the threshold may be wrong. A tool calibrated to a high-prevalence PE population will over-call PE in a low-prevalence setting. A tool calibrated to a low-prevalence setting will under-call in a high-prevalence one. If you don't know what prevalence your tool was calibrated on, you don't know whether its thresholds are right for your patients.
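
The prevalence dependence is just Bayes' rule operating on the model's output, and it can be made concrete. A minimal sketch, assuming the model is otherwise well calibrated on its training population; all numbers are illustrative:

```python
def adjust_for_prevalence(p_model: float, prev_train: float,
                          prev_local: float) -> float:
    """Prior-shift correction: move a probability estimate from the
    training prevalence to the local prevalence via odds. Assumes the
    model is otherwise well calibrated on its training population."""
    odds = p_model / (1 - p_model)
    prior_ratio = (prev_local / (1 - prev_local)) / (prev_train / (1 - prev_train))
    adjusted = odds * prior_ratio
    return adjusted / (1 + adjusted)

# A PE tool calibrated on a 20% prevalence cohort reports 15% in a
# department where true prevalence is closer to 5% (numbers illustrative).
print(round(adjust_for_prevalence(0.15, prev_train=0.20, prev_local=0.05), 3))
# 0.036
```

A reported 15 percent becomes roughly 3.6 percent once the prior is corrected: the same output, a very different disposition conversation.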
 
The fourth pattern is adversarial fragility — the algorithm's vulnerability to unusual inputs. AI image analysis tools that perform excellently on standard-quality images can fail in unpredictable ways on technically degraded images: portable chest X-rays in a diaphoretic patient who can't hold still, ECGs with baseline artifact from a shivering hypothermic patient, CT scans with motion artifact in an agitated trauma patient. These are precisely the patients where diagnostic accuracy matters most and where AI tool performance is least reliable.
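
One defensive pattern is an input-quality gate: if the acquisition is technically degraded, suppress the confident-looking number instead of displaying it. A sketch, with hypothetical quality scores and cutoffs standing in for whatever metrics a real deployment would use:

```python
def quality_gated_output(ai_probability: float, motion_score: float,
                         artifact_score: float) -> str:
    """Surface the AI estimate only when the input meets a minimum
    technical standard; otherwise flag it rather than displaying a
    confident-looking number. Scores and cutoffs are hypothetical."""
    if motion_score > 0.4 or artifact_score > 0.4:
        return "AI estimate suppressed: degraded input, interpret clinically"
    return f"AI probability: {ai_probability:.0%}"

# Portable film on a diaphoretic patient who could not hold still:
print(quality_gated_output(0.08, motion_score=0.7, artifact_score=0.2))
```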

THE COGNITIVE TRAP: AUTOMATION BIAS

The most dangerous consequence of AI diagnostic tools is not the failure itself — it is the cognitive response to apparent algorithmic confidence. Automation bias is the well-documented human tendency to over-weight automated system outputs relative to other available information. It is not a character flaw. It is a feature of how human cognition handles information under cognitive load, and emergency physicians operating on hour fourteen of a night shift are not immune to it.
 
The clinical signature of automation bias in AI-assisted diagnosis looks like this: the algorithm says low probability, the physician updates their clinical probability downward, and the findings that would have triggered further workup — the subtle tachycardia, the slightly elevated D-dimer, the vague family history — get rationalized out of the decision process. The miss that follows is not a failure of clinical knowledge. It is a failure of the physician-algorithm interface.
 
Recognizing automation bias as a risk is the first line of defense against it. The second is a deliberate clinical practice: generate your own pre-test probability before you look at the algorithm's output. If your clinical gestalt and the algorithm's output diverge significantly, that divergence is a signal — not necessarily that the algorithm is wrong, but that one of you is seeing something the other is not. That is precisely the moment for deeper clinical reasoning, not deference.
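
That discipline can even be operationalized at the interface level: commit your estimate first, then compare. A minimal sketch, where the 20-point divergence trigger is illustrative rather than a validated cutoff:

```python
def divergence_check(gestalt: float, model: float, trigger: float = 0.20) -> str:
    """Compare a pre-committed clinical gestalt with the model output.
    The 20-point trigger is illustrative, not a validated cutoff."""
    gap = abs(gestalt - model)
    if gap >= trigger:
        return (f"DIVERGENCE ({gap:.0%}): one of you is seeing something "
                "the other is not. Reason it out before acting.")
    return "concordant: gestalt and model broadly agree"

my_estimate = 0.30  # committed before opening the AI panel
print(divergence_check(my_estimate, model=0.05))
```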

A COMMAND FRAMEWORK FOR AI DIAGNOSTIC DISAGREEMENT

Military medicine has a concept that translates directly here: the commander's critical information requirement. Before any operation, a commander defines in advance what information, if received, would require a change in the current plan. This is not reactive decision-making — it is prospective threshold-setting that allows a commander to act decisively when conditions change, without being paralyzed by ambiguity in the moment.
 
Emergency physicians can apply the same framework to AI-assisted diagnosis. For each AI tool active in your clinical workflow, define in advance: what clinical finding, if present, would cause me to disregard or override this algorithm's output regardless of its probability estimate? For a chest pain AI risk stratification tool, that threshold might be: any new ST changes, any hemodynamic instability, any prior history of aortic pathology. Write it down, or at minimum rehearse it until it is reflexive. Own it as your clinical standard.
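
The criteria are simple enough to encode, which is part of their power: if you can state them as explicit rules before the shift, you can apply them under load. A sketch for a hypothetical chest-pain tool, using the criteria named above; nothing here reflects an actual product's logic:

```python
# Prospectively defined override criteria for a hypothetical chest-pain
# risk tool. Any hit sets the algorithm's risk estimate aside entirely.
OVERRIDE_CRITERIA = {
    "new_st_changes": "any new ST changes",
    "hemodynamic_instability": "any hemodynamic instability",
    "aortic_history": "any prior history of aortic pathology",
}

def overrides_met(findings: set[str]) -> list[str]:
    """Return the pre-committed criteria this patient meets. Deciding
    them in advance is the point; the code just makes that explicit."""
    return [desc for key, desc in OVERRIDE_CRITERIA.items() if key in findings]

print(overrides_met({"hemodynamic_instability"}))
# ['any hemodynamic instability'] -> disregard the low-risk estimate
```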
 
The physician who has prospectively defined their override criteria is far less vulnerable to automation bias than the one who decides in the moment — under cognitive load, under time pressure, with an algorithm confidently displaying a low-risk probability — whether the clinical picture is concerning enough to act.
 
This is what AI-bulletproof practice looks like in the diagnostic domain. Not refusing to use the tools. Not uncritically trusting them. Using them as one input in a structured clinical reasoning process that you — not the algorithm — control.
 
 
---DR. CHET'S TAKE---
 
I've been overriding clinical tools my entire career — lab values that didn't match the patient in front of me, imaging reads that missed what I was seeing on ultrasound, risk scores that stratified a sick patient as low risk. AI diagnostic tools are the newest version of that challenge, and the override calculus is the same: the tool informs my judgment, it does not replace it.
 
What concerns me about the current moment is not that AI tools fail. Every tool fails. What concerns me is that a generation of emergency physicians is being trained in environments where AI outputs are embedded in the workflow before the culture of critical evaluation has been established. When the algorithm is always there and usually right, the skill of recognizing when it's wrong atrophies. That is the diagnostic failure no one is measuring.
 
In my programs — including air medical and critical care transport, where diagnostic errors in the field have no safety net — we train to the failure mode, not just the standard case. Every provider who operates with AI-assisted tools needs to be able to articulate: how does this tool fail, and what am I watching for? If you can't answer that, you're not using the tool. The tool is using you.
 
— Dr. Chester "Chet" Shermer, MD, FACEP is a Professor of Emergency Medicine, Medical Director for Air Medical and Critical Care Transport programs, and a military medical commander with the Army National Guard. He is the founder of Global MedOps Command and the creator of AI in Emergency Medicine: Becoming AI Bulletproof.
 
 
AI Won't Wait. Neither Should You.
 
The diagnostic failure patterns described in this post are already occurring in departments where AI tools are active in clinical workflows without a structured physician override framework. Emergency physicians who understand how these tools fail — and who have prospectively defined their own override criteria — will catch what the algorithm misses. Those who don't will eventually sign their name to an outcome the algorithm caused and the patient paid for.
 
Consider enrolling in my course: AI in Emergency Medicine: Becoming AI Bulletproof — a physician-built course covering AI diagnostic accountability, automation bias recognition, and the clinical command frameworks you need to practice confidently in an AI-integrated environment.