https://github.com/junjslee/neonatal-ai-reliability
MRMC crossover study: expertise-dependent automation bias and sentinel behavior in human-AI collaborative neonatal diagnosis
- Host: GitHub
- URL: https://github.com/junjslee/neonatal-ai-reliability
- Owner: junjslee
- License: other
- Created: 2025-05-11T08:32:27.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2026-04-10T20:05:43.000Z (about 2 months ago)
- Last Synced: 2026-04-10T21:34:31.830Z (about 2 months ago)
- Topics: artificial-intelligence, cnn, computer-vision, deep-learning, fine-tuning, foundation-models, human-ai-interaction, medical-ai, medical-imaging, medical-research, neonatal-radiology, pediatrics, rad-dino
- Language: Python
- Homepage: https://neonatal-ai-sandbox.pages.dev/
- Size: 42.8 MB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
- Citation: CITATION.cff
# Expertise modulates automation bias and sentinel behavior in human-AI collaborative diagnosis of neonatal pneumoperitoneum
**Lee et al.** — submitted to *npj Digital Medicine*; currently in revision (round 1).
> **One-line takeaway:** *AI reliability does not translate linearly into clinical benefit.* In 1,750 interpretation events across a multi-reader multi-case (MRMC) crossover study, high AI reliability paradoxically induced **automation bias in trainees**, while error-prone AI triggered **sentinel (vigilant) behavior in experts**, indicating that adversarial resilience, not standalone accuracy, is the defining metric of human-AI team performance.
---
## Abstract
Medical AI is often validated under an additive assumption that algorithmic sensitivity and clinician oversight will combine to improve care. We tested this assumption in the high-stakes diagnosis of neonatal pneumoperitoneum, a time-critical surgical emergency. In a multi-reader crossover study analyzing 1,750 interpretation events, clinicians reviewed radiographs aided by either a high-reliability model or a systematically error-injected model. We found that high AI reliability paradoxically induces automation bias in trainees, who accepted 52.0% of incorrect suggestions, while offering limited gains to experts. Conversely, when challenged by flawed AI, the three participating neonatologists exhibited a "sentinel behavior" phenotype, correctly overriding 91.7% of errors (Wilson 95% CI 83.0–96.0%) consistent with increased deliberation; given the small specialist cohort, this finding is hypothesis-generating and warrants prospective replication. We operationalize systemic resilience as the capacity to maintain diagnostic integrity under algorithmic failure and demonstrate that clinical validity depends on the human-AI team's adversarial resilience rather than standalone accuracy. To mitigate the risk of deskilling and never-skilling, we release an open-source educational sandbox designed to inoculate clinicians against automated errors.
**Keywords:** Neonatal pneumoperitoneum · Automation bias · Sentinel behavior · Artificial intelligence · Deep learning · Multi-reader multi-case study · Human-AI interaction · Radiology
---
## Educational Sandbox
> **Try it live:** [neonatal-ai-sandbox.pages.dev](https://neonatal-ai-sandbox.pages.dev/)
An open-source, web-based educational tool designed to inoculate clinicians against automated errors — simulating both reliable and error-prone AI assistance to build adversarial resilience in trainees and practicing clinicians.
---
## Model Architecture

To ensure that differences in reader behavior were driven solely by **AI reliability** (and not model capacity), both the Reliable and Error-Injected assistants use the same underlying architecture:
- **Backbone:** RAD-DINO (ViT-B/14) — a vision foundation model pre-trained on large-scale radiology datasets
- **Adaptation (LoRA):** Parameter-efficient fine-tuning via Low-Rank Adaptation ($r=12, \alpha=24$) injected into Query/Value projections and the MLP layer; only **1.36%** of parameters were trainable
- **Sampling Strategy (RFBS):** A custom Representation-Focused Batch Sampler enforcing diversity and exposure to uncommon pneumoperitoneum distributions during training
> **Error-Injected model:** Same architecture, trained on systematically poisoned labels. False positives were engineered by mislabeling clinically plausible confounders (iatrogenic devices, portal venous gas, pneumatosis intestinalis, abdominal drains) as pneumoperitoneum. The confounders were curated by a board-certified pediatric radiologist to simulate realistic deployment failures rather than random noise.
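To make the LoRA settings above concrete, the following minimal sketch counts the parameters a rank-$r$ adapter adds and shows the $\alpha/r$ scaling. The layer sizes (hidden width 768, MLP width 3072, 12 blocks) are the standard ViT-B dimensions and are an assumption here, as is the choice to adapt both MLP projections. The exact set of adapted modules lives in the repository's training configuration, so this sketch is not expected to reproduce the reported 1.36% figure exactly.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters LoRA adds to one d_in x d_out linear layer.

    The update is B @ A with A of shape (r, d_in) and B of shape (d_out, r).
    """
    return r * (d_in + d_out)


R, ALPHA = 12, 24                      # rank and alpha reported in this README
HIDDEN, MLP, LAYERS = 768, 3072, 12    # standard ViT-B sizes (assumption)

scaling = ALPHA / R                    # LoRA scales the low-rank update by alpha / r

# Query and Value projections: two HIDDEN x HIDDEN layers per block.
per_block_qv = 2 * lora_params(HIDDEN, HIDDEN, R)
# MLP: assumes both projections (HIDDEN -> MLP and MLP -> HIDDEN) are adapted.
per_block_mlp = lora_params(HIDDEN, MLP, R) + lora_params(MLP, HIDDEN, R)

total_lora = LAYERS * (per_block_qv + per_block_mlp)

print(f"scaling alpha/r = {scaling}")
print(f"LoRA parameters per block = {per_block_qv + per_block_mlp:,}")
print(f"LoRA parameters in total  = {total_lora:,}")
```

Because the adapter size grows linearly in $r$, doubling the rank would double the trainable adapter parameters while leaving the frozen backbone untouched.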
---
## Why This Matters
Neonatal pneumoperitoneum is a time-critical surgical emergency. Integrating AI into this workflow is not only a question of model accuracy: clinicians must weigh the AI's advice and confidence cues under time pressure.
This study investigates the **Human-AI Interaction (HAI) layer**:
1. **Automation Bias:** When AI is *highly capable*, does it help — or does it reduce human vigilance?
2. **The Sentinel Effect:** When AI is *systematically wrong*, do clinicians disengage, blindly follow, or become hyper-vigilant?
3. **Expertise Gradient:** Do neonatologists, radiologists, and residents react differently to the same AI signals?
4. **Never-Skilling Risk:** Does early-career reliance on reliable AI prevent trainees from developing independent pattern recognition?
---
## Study Overview
### Cohorts
| Cohort | Radiographs | Positive Cases | Source |
| :--- | :--- | :--- | :--- |
| Internal Development | 688 (from 216 patients) | 310 | Asan Medical Center |
| External Validation (Reader Study) | 125 | 40 | 11 tertiary hospitals via AI-Hub |
### Reader Study Design
- **Participants (N=14):**
  - Pediatric Radiologists: $n=6$ (mean experience 16.2 ± 4.2 years)
  - Neonatologists: $n=3$ (mean experience 10.3 ± 1.5 years)
  - Radiology Residents: $n=5$ (mean experience 2.2 ± 1.3 years)
- **Design:** Two-session, counterbalanced MRMC crossover with 6-week washout; double-masked
- **Total interpretation events:** 1,750
**Case Allocation (Stratified, N=125):**
| Condition | Cases |
| :--- | :--- |
| Unaided | 41 |
| Reliable AI | 40 |
| Error-Injected AI | 44 |
> Reliability was fixed at the *case level*. Readers were blinded to the reference standard and unaware of the two distinct AI reliability conditions.
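The design numbers above can be cross-checked with simple arithmetic. The sketch below assumes that every reader interpreted every case exactly once, which is consistent with the reported total of 1,750 events but is an inference from the counts rather than something this README states explicitly.

```python
# Reader and case counts copied from the tables above.
readers = {
    "Pediatric Radiologists": 6,
    "Neonatologists": 3,
    "Radiology Residents": 5,
}
cases = {"Unaided": 41, "Reliable AI": 40, "Error-Injected AI": 44}

n_readers = sum(readers.values())
n_cases = sum(cases.values())

# Assumption: one interpretation per reader per case.
events = n_readers * n_cases
rows_per_condition = {cond: n * n_readers for cond, n in cases.items()}

print(f"readers={n_readers}, cases={n_cases}, events={events}")
for cond, rows in rows_per_condition.items():
    print(f"{cond}: {rows} reader-case rows")
```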
---
## AI Tools Evaluated
| Model | Performance | Engineering | Purpose |
| :--- | :--- | :--- | :--- |
| **Reliable AI** | AUC 0.861 (study subset); AUC 0.948 (full external validation) | Standard training on clean labels | Test automation bias |
| **Error-Injected AI** | Balanced accuracy 0.44 (sensitivity 0.40; specificity 0.47) | Systematic label poisoning via clinically plausible confounders | Test sentinel / adversarial resilience |
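For reference, balanced accuracy is the mean of sensitivity and specificity. The short sketch below applies that definition to the Error-Injected AI values in the table; the function name is illustrative and not taken from the repository.

```python
def balanced_accuracy(sensitivity: float, specificity: float) -> float:
    """Mean of the true-positive rate and the true-negative rate."""
    return (sensitivity + specificity) / 2


# Sensitivity and specificity reported for the Error-Injected AI.
ba_error_injected = balanced_accuracy(0.40, 0.47)
print(f"balanced accuracy = {ba_error_injected:.3f}")
```

A value below 0.5 means the model performs worse than chance on a class-balanced basis, which is the intended property of the error-injected assistant.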
---
## Statistical Methodology
Primary analysis: **Crossed Random-Effects GLMM** (logit link)
- Random intercept for `Case_ID` — controls for intrinsic image difficulty (variance 3.83 on log-odds scale)
- Random intercept for `Reader_ID` — controls for individual competence (variance 0.15)
- Covariates: gestational age, birth weight (both non-significant: P=0.542, P=0.969)
- No session-order effects (Session 2 vs 1: OR 1.33, P=0.448)
- Post-hoc contrasts adjusted via Holm-Bonferroni
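The Holm-Bonferroni step-down adjustment named in the last bullet can be sketched in a few lines. The p-values below are hypothetical placeholders chosen for illustration, not results from the study, and the function is a generic implementation rather than the repository's code.

```python
def holm_bonferroni(p_values):
    """Return Holm-adjusted p-values in the original order."""
    m = len(p_values)
    # Visit hypotheses from the smallest raw p-value to the largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is multiplied by m - k + 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Adjusted p-values must be non-decreasing along the sorted order.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


raw_p = [0.010, 0.040, 0.030]   # hypothetical post-hoc contrasts
adj_p = holm_bonferroni(raw_p)
print([round(p, 3) for p in adj_p])
```

Holm's procedure controls the family-wise error rate like plain Bonferroni but is uniformly at least as powerful, because only the smallest p-value receives the full multiplier.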
---
## Key Findings
### 1. Expertise-Stratified Interaction
The primary GLMM identified a significant Condition × Expertise interaction for neonatologists under the Error-Injected AI condition:
| Contrast (vs Pediatric Radiologist) | OR | 95% CI | P-value |
| :--- | :--- | :--- | :--- |
| **Error-Injected AI × Neonatologist** | **4.16** | **1.26–13.77** | **0.020** |
Confirmed by GEE (P=0.018) and Leave-One-Neonatologist-Out sensitivity analysis (ORs 2.04–2.64 across all leave-one-out configurations).
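As a consistency check on the interaction row above, the standard error and a two-sided P-value can be recovered from the odds ratio and its interval. The sketch assumes the reported interval is a Wald interval, symmetric on the log-odds scale, and uses a normal reference distribution; the README does not state the interval type, so treat the result as approximate.

```python
import math


def wald_from_or_ci(odds_ratio, ci_low, ci_high, z_crit=1.959964):
    """Recover log-OR, its SE, the z statistic, and a two-sided normal P."""
    log_or = math.log(odds_ratio)
    # A Wald 95% CI spans 2 * z_crit standard errors on the log scale.
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z_crit)
    z = log_or / se
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return log_or, se, z, p_two_sided


# Error-Injected AI x Neonatologist contrast from the table above.
log_or, se_log_or, z_stat, p_wald = wald_from_or_ci(4.16, 1.26, 13.77)
print(f"log-OR={log_or:.3f}, SE={se_log_or:.3f}, z={z_stat:.2f}, P={p_wald:.3f}")
```

The recovered P-value lands close to the reported 0.020, which suggests the table's OR, interval, and P-value are mutually consistent under these assumptions.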
Pediatric Radiologists maintained stable accuracy across all conditions (no significant gains or losses). Radiology Residents showed patterns consistent with automation bias.
### 2. Unaided Baseline Performance
| Group | Unaided Accuracy |
| :--- | :--- |
| Pediatric Radiologists | 90.2% |
| Radiology Residents | 85.9% |
| Neonatologists | 85.4% |
### 3. Error Acceptance (Automation Bias)
Rate at which readers accepted the wrong suggestion when the AI was incorrect:
| Group | Acceptance of Incorrect AI (Reliable AI condition) |
| :--- | :--- |
| **Radiology Residents** | **52.0% (13/25)** |
| Neonatologists | 33.3% (5/15) |
| Pediatric Radiologists | 20.0% (6/30) |
Residents vs Radiologists: P=0.016 (significant after Bonferroni correction).
### 4. Sentinel Behavior (Correct Override of Flawed AI)
Rate at which readers successfully overrode the Error-Injected AI when it was wrong:
| Group | Correct Override Rate | Wilson 95% CI |
| :--- | :--- | :--- |
| **Neonatologists** | **91.7% (66/72)** | **83.0–96.0%** |
| Pediatric Radiologists | 85.4% (123/144) | 78.6–90.4% |
| Radiology Residents | 81.7% (98/120) | 73.8–87.6% |
> The 91.7% neonatologist override rate comes from three participating specialists (72 reader-case rows on which the Error-Injected AI was wrong, i.e. 3 readers × 24 cases) and is presented as a **hypothesis-generating** observation that warrants prospective replication in a larger specialist cohort.
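The Wilson score intervals in the table can be approximated from the raw counts. The sketch below is a generic implementation of the Wilson formula, not the repository's code, and it treats reader-case rows as independent Bernoulli trials, so it ignores clustering within readers and cases. Its output agrees with the tabulated intervals to roughly a tenth of a percentage point.

```python
import math


def wilson_ci(successes: int, n: int, z: float = 1.959964):
    """Wilson score interval for a binomial proportion."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return max(0.0, center - half_width), min(1.0, center + half_width)


override_counts = {
    "Neonatologists": (66, 72),
    "Pediatric Radiologists": (123, 144),
    "Radiology Residents": (98, 120),
}
for group, (k, n) in override_counts.items():
    low, high = wilson_ci(k, n)
    print(f"{group}: {k / n:.1%} ({low:.1%} to {high:.1%})")
```

Unlike the simple Wald interval, the Wilson interval stays inside [0, 1] and remains informative when the observed rate is 100%, which matters for small subgroups.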
### 5. Verification Effort (Deliberation Time)
Reading time was modeled with a linear mixed-effects model on the aided set:
```
log(reading_time_sec) ~ disagree * reliability * group + pgy_within_5
+ (1 | reader) + (1 | case)
```
Headline fixed effects (REML, Satterthwaite df via lmerTest; full table in Supplementary Table 5):
| Term | β (log s) | SE | t (df) | P | Time ratio (95% CI) |
| :--- | :--- | :--- | :--- | :--- | :--- |
| Discordance (main) | 0.892 | 0.174 | 5.12 (1043.0) | <0.001 | 2.44 (1.74–3.43) |
| Discordance × Reliability[Unreliable] | -0.610 | 0.215 | -2.84 (948.6) | 0.005 | — |
| Reliability[Unreliable] (main) | 0.172 | 0.131 | 1.32 (200.9) | 0.189 | — |
Random-effect variances: σ²(reader)=0.096, σ²(case)=0.148, σ²(residual)=0.665.
Marginal R²=0.101, Conditional R²=0.342 (Nakagawa–Schielzeth).
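The two R² values and the variance components above are linked by the Nakagawa-Schielzeth definitions: marginal R² is the fixed-effect variance over the total, and conditional R² adds the random-effect variances to the numerator. The fixed-effect variance is not reported here, so the sketch below backs it out from the marginal R² and then checks that the conditional R² follows.

```python
# Variance components and marginal R2 reported above (log-seconds scale).
var_reader, var_case, var_resid = 0.096, 0.148, 0.665
marginal_r2 = 0.101

var_random = var_reader + var_case

# marginal_r2 = var_fixed / (var_fixed + var_random + var_resid)
# Solving for var_fixed (a derived quantity, not reported in the README):
var_fixed = marginal_r2 * (var_random + var_resid) / (1 - marginal_r2)

total_var = var_fixed + var_random + var_resid
conditional_r2 = (var_fixed + var_random) / total_var

print(f"implied fixed-effect variance = {var_fixed:.3f}")
print(f"conditional R2 = {conditional_r2:.3f}")
```

Most of the explained variance therefore comes from reader and case heterogeneity rather than from the fixed effects.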
Group-specific simple slopes (disagree vs agree, marginalized over reliability):
| Group | Time ratio (disagree/agree) | 95% CI | P |
| :--- | :--- | :--- | :--- |
| Pediatric Radiologists | 1.77× | 1.44–2.18 | <0.001 |
| Neonatologists | 1.95× | 1.46–2.59 | <0.001 |
| Radiology Residents | 1.49× | 1.18–1.88 | <0.001 |
> **Statistical caveat (revised in round 1):** the omnibus disagree × group interaction was not significant (F(2, 1096)=1.32, P=0.267) — the model does not provide evidence that the *magnitude* of the slowdown differs across groups. We interpret the per-group multiplicative slowdowns as evidence that AI-induced verification effort imposed a similar workflow cost across expertise levels, with the between-group differential being operationally negligible.
>
> **Note on prior reporting:** an earlier draft summarized this analysis using absolute second-level differences (+1.2 / +3.1 / +4.6 s). Those raw second-level values were a technical error introduced during initial drafting and were not reproducible from the analytic dataset. They have been removed in favor of the model-adjusted multiplicative slowdowns above, which are traceable to Supplementary Table 5.
>
> **Within-session trust recalibration.** The significant `Discordance × Reliability[Unreliable]` interaction (β=-0.610, P=0.005) indicates that on the log-time scale, the verification cost of disagreeing with the AI was substantially attenuated under the Error-Injected condition. Once readers encountered a low-reliability AI, the cognitive cost of rejecting its output decreased, consistent with empirical recalibration of trust during the session.
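Because the outcome is log-transformed, each coefficient converts to a multiplicative time ratio by exponentiation. The sketch below reproduces the discordance ratio from its β and SE using a normal approximation, which is reasonable given degrees of freedom near 1,000. It also combines the main effect with the interaction to give a point estimate of the slowdown under the Error-Injected condition for the reference reader group. Which group is the reference is not stated in this section, and no interval is given for the combined effect because that would require the coefficient covariance.

```python
import math


def time_ratio(beta: float, se: float, z: float = 1.959964):
    """Convert a log-seconds coefficient to a time ratio with a 95% CI."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


# Discordance main effect (slowdown when disagreeing with the Reliable AI).
ratio_rel, ratio_lo, ratio_hi = time_ratio(0.892, 0.174)
print(f"Reliable AI: {ratio_rel:.2f}x ({ratio_lo:.2f} to {ratio_hi:.2f})")

# Under the Error-Injected AI the interaction term is added on the log scale.
beta_discord, beta_interaction = 0.892, -0.610
ratio_unrel = math.exp(beta_discord + beta_interaction)
print(f"Error-Injected AI: {ratio_unrel:.2f}x (point estimate only)")
```

On these numbers the disagreement slowdown shrinks from roughly 2.4x to roughly 1.3x in the reference group, which is the attenuation described in the note above.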
### 6. Error-Type Stratification (Error-Injected AI arm; added in revision)
The Error-Injected AI was wrong on **18 unique false-positive (FP)** and **6 unique false-negative (FN)** cases (336 reader-case rows across the 14 readers in the aided arm). Agreement with the wrong AI, stratified:
| Group | FP rate (n agreed / n) | FN rate (n agreed / n) | FP→FN drop |
| :--- | :--- | :--- | :--- |
| Pediatric Radiologists | 19.4% (21/108) | **0.0%** (0/36) | -19.4 pp |
| Neonatologists | 11.1% (6/54) | **0.0%** (0/18) | -11.1 pp |
| **Radiology Residents** | 18.9% (17/90) | **16.7%** (5/30) | **-2.2 pp** |
Wilson 95% CIs and full counts in `quantitative_analysis/revision_analyses/r2_8_error_type_stratification.py`.
The conventional GLMM (`agree_with_ai ~ group * error_type + (1|reader) + (1|case)`) converged but exhibited **practical separation** in the FN cell (two of three groups had zero events). We therefore report Firth penalized logistic regression as the primary inferential model:
| Term | OR | 95% CI | P (Firth penalized LRT) |
| :--- | :--- | :--- | :--- |
| **error_type[FN]** (Ped Rad ref) | **0.06** | 0.00–0.42 | **0.001** |
| group[Radiology Resident] × error_type[FN] | **16.25** | 1.51–2239 | **0.017** |
| group[Neonatologist] × error_type[FN] | 3.62 | 0.02–721 | 0.54 |
Pediatric Radiologists and Neonatologists overrode every FN case. Residents accepted the AI's "all clear" verdict on 5/30 FN reader-case rows — the FP-to-FN drop is largely absent in trainees, suggesting that the FN cases are precisely where novice over-reliance is most clinically dangerous.
> **Exploratory caveat:** only 6 unique FN cases are shared across all 14 readers; the Resident × FN interaction CI spans 1.5 to 2239, directionally informative but with wide uncertainty.
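The Firth point estimates in the table can be recovered from the cell counts alone. For a saturated model with only categorical predictors, Firth's Jeffreys-prior penalty gives the same fitted probabilities as adding 0.5 to the event and non-event counts of every cell. The sketch below relies on that equivalence and on the assumption that the reported model is the saturated `group * error_type` logistic model without random effects. It yields point estimates only; the intervals and P-values in the table come from the penalized profile likelihood and are not reproduced here.

```python
# (agreed with wrong AI, total rows) per group and error type, from the table.
counts = {
    "Pediatric Radiologists": {"FP": (21, 108), "FN": (0, 36)},
    "Neonatologists": {"FP": (6, 54), "FN": (0, 18)},
    "Radiology Residents": {"FP": (17, 90), "FN": (5, 30)},
}
REFERENCE = "Pediatric Radiologists"


def penalized_odds(agreed: int, n: int, c: float = 0.5) -> float:
    """Odds of agreement after adding c to both events and non-events."""
    return (agreed + c) / (n - agreed + c)


def fn_vs_fp_or(group: str) -> float:
    """Within-group odds ratio of agreeing on FN cases versus FP cases."""
    cells = counts[group]
    return penalized_odds(*cells["FN"]) / penalized_odds(*cells["FP"])


or_error_type_fn = fn_vs_fp_or(REFERENCE)
interaction_or = {
    g: fn_vs_fp_or(g) / or_error_type_fn for g in counts if g != REFERENCE
}

print(f"error_type[FN] (reference group): OR = {or_error_type_fn:.2f}")
for g, value in interaction_or.items():
    print(f"{g} x error_type[FN]: OR = {value:.2f}")
```

Without the 0.5 adjustment, the zero-event FN cells would give an odds ratio of exactly zero for the reference group and undefined interaction terms, which is the separation problem that motivates the penalized fit.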
### 7. Saliency Map Usage
| Group | Usage Rate (AI-incorrect cases) | Accuracy with Map | Interpretation |
| :--- | :--- | :--- | :--- |
| Radiology Residents | 53.8% (78/145) | 78.2% (vs 73.1% without; P=0.61) | Confirmatory — reinforces over-reliance |
| Pediatric Radiologists | 34.5% (60/174) | Trending lower (81.7% vs 86.0%; P=0.58) | Intermediate |
| **Neonatologists** | **17.2% (15/87)** | **100% (15/15; Wilson 95% CI 79.6–100%)** | **Refutation utility (exploratory)** |
> Neonatologist 100% accuracy is from 15 user-initiated map views and is presented as a **descriptive observation** — user-initiated access creates selection effects, and the wide Wilson CI (79.6–100%) reflects the limited subset.
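Neonatologists opened saliency maps selectively, a pattern consistent with using them to *refute* the AI; residents opened them on more than half of AI-incorrect cases without a significant accuracy gain (P=0.61), which may reinforce over-reliance.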
---
## Conclusion
> *"Ultimately, in neonatal pneumoperitoneum, AI reliability affects clinicians through verification behavior and error phenotypes rather than accuracy alone. Highly reliable AI tends to induce automation bias in trainees, whereas intentionally error-injected AI can trigger vigilance in experts. Future evaluation and deployment frameworks must explicitly measure expertise-dependent behaviors to ensure resilience in time-critical emergencies."*
---
## Limitations
1. **Simulated environment** — Cannot fully replicate the time pressures of a live NICU
2. **Small neonatologist cohort** (n=3) — Mitigated by 375 independent decision points for the subgroup and LONO sensitivity analysis
3. **Saliency map analysis is exploratory** — User-initiated access creates selection effects; randomized exposure required for causal inference
4. **Generalizability** — Replication in larger, multicenter specialist cohorts needed
---
## Code and Data Availability
- **Source code** (preprocessing, model, training, evaluation, saliency, statistical analysis): [github.com/junjslee/neonatal-ai-reliability](https://github.com/junjslee/neonatal-ai-reliability)
- **Educational sandbox:** [neonatal-ai-sandbox.pages.dev](https://neonatal-ai-sandbox.pages.dev/)
- **Primary statistical pipeline:** `quantitative_analysis/reader_study_full_analysis.py` — GLMM crossed random effects, GEE sensitivity, time/CAM mechanism analyses, HCI-type stacked bars (Type 1–4 derivation including the **Type 4 automation-bias** and **Type 3 sentinel-override** counts that underlie §3–§4).
- **Revision-round analyses:** `quantitative_analysis/revision_analyses/` — R2-6 (reading-time LMM full output), R2-7 (between-group differential omnibus + simple slopes), R2-8 (FP-vs-FN error-type stratification with Firth sensitivity).
- **Model checkpoints:** Both the Reliable AI and Error-Injected AI weights are included in this repository under `quantitative_analysis/standalone_model_performance/rad_dino/`. These are the exact checkpoints used in the reader study and can be used to reproduce inference results without retraining. Weights are derived from [microsoft/rad-dino](https://huggingface.co/microsoft/rad-dino) (MIT License, research use only — not for clinical practice).
- **Raw image data:** Cannot be publicly redistributed (IRB/licensing); external validation set available via [AI-Hub](https://www.aihub.or.kr/)
- **De-identified derived data** (reader metrics, AI predictions, consensus labels): Available upon request to corresponding authors
---
## Citation
If you use the code, findings, or the error-injection validation framework, please cite:
> Lee, J., Kim, Y., Kim, V., Park, C., Song, J. M., Kwon, J., Nam, Y., Lenehan, P., Kim, D. Y., Cho, Y. A., Kim, P. H., Hwang, J.-Y., Lee, J., Lee, B. S., Jung, E., Jung, A. Y., Choi, J., Kim, N.\* & Yoon, H. M.\* *Expertise modulates automation bias and sentinel behavior in human-AI collaborative diagnosis of neonatal pneumoperitoneum.* **npj Digit. Med.** (in revision, 2026).
>
> \*Co-corresponding authors: Namkug Kim, PhD (namkugkim@gmail.com); Hee Mang Yoon, MD, PhD (espoirhm@gmail.com).
Machine-readable metadata: see [`CITATION.cff`](CITATION.cff). A final BibTeX entry will be added on acceptance.
---
## Correspondence
- **Namkug Kim, PhD** — namkugkim@gmail.com (MI2RL, Asan Medical Center)
- **Hee Mang Yoon, MD, PhD** — espoirhm@gmail.com (Massachusetts General Hospital / Asan Medical Center)