mARC: Medicine Abstraction & Reasoning Corpus
- mARC is a comprehensive biomedical resource featuring 5,000 RCT abstracts and an adversarial benchmark to evaluate clinical reasoning in medicine.
- The dataset offers detailed, hierarchical PICO annotations with high inter-annotator agreement to support accurate evidence synthesis.
- It facilitates diverse NLP tasks such as span detection, fine-grained concept tagging, and co-reference resolution for robust medical reasoning.
The Medicine Abstraction and Reasoning Corpus (mARC) is a suite of datasets and adversarial benchmarks developed to advance language processing and reasoning in the biomedical domain. It consists of two principal components: a large, richly annotated corpus of randomized controlled trial (RCT) abstracts structured to support evidence-based medicine and NLP extraction tasks, and an adversarial clinical reasoning benchmark designed to probe cognitive flexibility and the ability to override dominant heuristic patterns in medical question answering. mARC has been adopted both as a resource for extracting structured PICO (Patient, Intervention, Comparator, Outcome) information and as a stress test for deductive reasoning in humans and LLMs (Nye et al., 2018, Shidara et al., 17 Jan 2026).
1. Corpus Composition and Structure
mARC comprises 5,000 abstracts of RCTs drawn from MEDLINE, with coverage of clinical contexts such as cardiovascular diseases, oncology, and autism. The dataset includes:
- Train/development/test splits: 4,300 abstracts for training, 500 for development, and 200 held-out gold-standard test abstracts annotated by medical experts.
- Granularity: Each abstract features a high density of annotated spans—lay annotators identified an average of 34.5 participant, 26.5 intervention, and 33.0 outcome spans per abstract, while expert annotation yielded lower but more conservative counts (21.4/14.3/26.9, respectively).
- Hierarchical annotation: PICO spans are decomposed into sub-categories (e.g., under Participants: Age, Gender, Condition, Sample Size; under Interventions: Pharmacological, Surgical, Behavioral; under Outcomes: Pain, Mortality, Adverse Effects, Mental/Behavioral Impact).
- Formal encoding: BIO tagging is used (e.g., `B-P`, `I-P`, `O`), and sub-spans carry MeSH-compatible labels. Span-type assignments are explicitly denoted as P, I, and O, with corresponding subtypes.
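The BIO encoding described above can be decoded back into labeled spans with a simple scan. The following is a minimal sketch (the function name and example sentence are illustrative, not from the released code):

```python
# Sketch: decoding BIO-tagged tokens into PICO spans.
# Tag scheme as described above: B-P/I-P (Participants), B-I/I-I
# (Interventions), B-O/I-O (Outcomes), and O for non-span tokens.

def decode_bio(tokens, tags):
    """Group (token, tag) pairs into (label, span_text) tuples."""
    spans, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append((label, " ".join(current)))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and label == tag[2:]:
            current.append(tok)
        else:  # "O" or an inconsistent continuation closes the open span
            if current:
                spans.append((label, " ".join(current)))
            current, label = [], None
    if current:
        spans.append((label, " ".join(current)))
    return spans

tokens = ["120", "adults", "received", "metformin", "daily"]
tags   = ["B-P", "I-P", "O", "B-I", "I-I"]
print(decode_bio(tokens, tags))  # [('P', '120 adults'), ('I', 'metformin daily')]
```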
The corpus therefore aligns with the principles of structured knowledge representation required for systematic evidence synthesis and downstream machine learning tasks (Nye et al., 2018).
2. Annotation Schema and Quality Assurance
Annotation proceeds in two stages:
- Stage 1: High-Level PICO Annotation. Annotators highlight text spans corresponding to Patients, Interventions/Comparators, and Outcomes. Inter-annotator agreement among medical-student experts on these spans, measured by Cohen's κ, is moderate (0.71/0.69/0.62 for P/I/O).
- Stage 2: Fine-Grained Sub-Span Labeling. Each Stage 1 span is further segmented into minimal sub-spans and assigned a hierarchical label, aligned to a MeSH descriptor. Redundant labeling and co-reference grouping are performed for recurring tokens.
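Cohen's κ, used for the Stage 1 agreement figures, corrects observed agreement for chance agreement. A self-contained sketch on made-up token labels (the data here are illustrative, not from the corpus):

```python
# Sketch: Cohen's kappa for two annotators' token-level P/I/O labels.
# kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
# p_e is the agreement expected by chance from each annotator's label rates.

from collections import Counter

def cohens_kappa(a, b):
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n       # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)    # chance agreement
    return (p_o - p_e) / (1 - p_e)

ann1 = ["P", "P", "O", "I", "O", "O"]
ann2 = ["P", "O", "O", "I", "O", "O"]
print(round(cohens_kappa(ann1, ann2), 3))  # 0.714
```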
Quality control integrates:
- Lay annotators: Sourced from Amazon Mechanical Turk (AMT) with strict quality filtering; each abstract receives at least three independent annotations per stage.
- Expert validation: 200 test abstracts are labeled by two medical-student experts (Stage 1) and three medical professionals (Stage 2).
- Aggregation: Techniques including Majority Vote, Dawid–Skene EM, and HMMCrowd (Dawid–Skene plus sequential HMM for span coherence) are systematically compared. HMMCrowd produces the highest F1 for Stage 1 spans: 0.72 (P), 0.68 (I), and 0.59 (O).
Discrepancies are reconciled by guideline refinement and expert adjudication. Evaluation uses the union of expert span sets as reference labels (Nye et al., 2018).
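Of the aggregation techniques compared, Majority Vote is the simplest; a token-level sketch follows (Dawid–Skene and HMMCrowd additionally estimate per-annotator reliability, and HMMCrowd adds a sequential HMM for span coherence, none of which is modeled here):

```python
# Sketch: token-level Majority Vote over redundant crowd annotations.
# Each annotator supplies one tag sequence; the aggregate keeps the
# most frequent tag at each token position.

from collections import Counter

def majority_vote(annotations):
    """annotations: list of per-annotator tag sequences of equal length."""
    aggregated = []
    for token_tags in zip(*annotations):
        tag, _ = Counter(token_tags).most_common(1)[0]
        aggregated.append(tag)
    return aggregated

crowd = [
    ["B-P", "I-P", "O", "B-I"],
    ["B-P", "O",   "O", "B-I"],
    ["B-P", "I-P", "O", "O"],
]
print(majority_vote(crowd))  # ['B-P', 'I-P', 'O', 'B-I']
```

Note that per-position voting can emit tag sequences that violate BIO well-formedness (an `I-` with no preceding `B-`), which is precisely the weakness the sequential component of HMMCrowd addresses.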
3. NLP Tasks and Baseline Models
mARC was designed to support multiple NLP extraction and reasoning tasks:
- Span Detection: Token-level identification of P/I/O labels (BIO tagging). Bi-LSTM–CRF models significantly outperform linear-chain CRFs (test set F1: up to 0.71 for P, 0.65 for I, 0.63 for O).
- Fine-Grained Concept Tagging: Within given stage-1 spans, assign MeSH-compatible sub-labels (e.g., "Dosage" in interventions). Baseline methods include logistic regression (F1: 0.22–0.57) and CRF (F1: 0.21–0.55).
- Repetition/Co-reference Resolution: Detect multiple spans referring to the same entity within an abstract. Logistic regression achieves moderate F1 on participants (0.44), interventions (0.45), and low performance on outcomes (0.12).
Evaluation employs standard micro-averaged precision, recall, and F1 scores. The resource is extensible to more complex tasks, such as multi-hop inference for assembling full trial representations and reading-comprehension–style question answering over multiple abstracts (Nye et al., 2018).
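For span-level evaluation, micro-averaged precision, recall, and F1 can be computed by pooling true positives over all span types. A minimal exact-match sketch (the `(label, start, end)` representation is an assumption for illustration):

```python
# Sketch: micro-averaged precision/recall/F1 for predicted vs. reference
# spans, scored by exact match on (label, start, end).

def micro_prf(pred, gold):
    pred, gold = set(pred), set(gold)
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = [("P", 0, 2), ("I", 5, 7), ("O", 9, 12)]
pred = [("P", 0, 2), ("I", 5, 6), ("O", 9, 12)]  # one boundary error
p, r, f = micro_prf(pred, gold)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.67 0.67 0.67
```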
4. Adversarial Clinical Reasoning Benchmark
To address limitations of standard medical QA benchmarks (which often fail to evaluate genuine cognitive flexibility), Shidara et al. introduce an adversarial benchmark—also named mARC—comprising 100 author-curated, USMLE-style multiple-choice vignettes. These items embed the Einstellung effect: a familiar cue triggers a high-probability heuristic response, but a "blocker" in the vignette, together with background knowledge, compels rejection of the default inference.
Key design features:
- Item Construction: Each question presents a nonspecific symptom s, a cue c that activates a default rule yielding heuristic hypothesis h, and a blocker b that negates the heuristic hypothesis when combined with background knowledge K: K ∧ c ∧ b ⊨ ¬h, whereas h remains the dominant inference from s and c alone.
- Subspecialty Distribution: 12 medical categories, each contributing to the 100-item set and validated by board-certified specialists.
- Answer Formulation: Four-option format, with exactly one answer requiring overriding the obvious heuristic; 53 items also include a "seek more data" deferral option.
A representative case is an anticoagulated patient with anencephaly, where the blocker (absence of brain tissue) renders a CT scan for intracranial hemorrhage irrelevant (Shidara et al., 17 Jan 2026).
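The item design above can be captured as a small data structure with a scoring routine. The field names below are assumptions for illustration, not the released schema:

```python
# Sketch: a minimal container for a mARC-style adversarial item and a
# plain accuracy scorer. Hypothetical schema, not the benchmark's own.

from dataclasses import dataclass

@dataclass
class MarcItem:
    vignette: str           # USMLE-style stem containing cue and blocker
    options: list           # four answer choices
    answer: int             # index of the heuristic-overriding correct option
    has_deferral: bool = False  # whether a "seek more data" option is offered

def score(items, responses):
    """Fraction of items where the chosen option index is correct."""
    correct = sum(r == item.answer for item, r in zip(items, responses))
    return correct / len(items)

items = [
    MarcItem("Anticoagulated patient with anencephaly ...", ["A", "B", "C", "D"], 2),
    MarcItem("Nonspecific fatigue with misleading cue ...", ["A", "B", "C", "D"], 0,
             has_deferral=True),
]
print(score(items, [2, 1]))  # 0.5
```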
5. Benchmarking Human and Model Performance
Evaluation protocol applies to both human clinicians and LLMs:
- Human testing: Five UCSF physicians achieved an overall accuracy of 0.66 [95% CI: 0.55, 0.75].
- LLMs: State-of-the-art models from OpenAI, Gemini, Claude, Grok, and DeepSeek scored up to 0.751 (Claude 4.1 Opus) [95% CI: 0.738, 0.763]. Chain-of-Thought (CoT) prompting and MMLU-Pro guidelines for stochastic trials were employed. No significant difference between leading models and human clinicians was evident by paired bootstrap testing.
On the 20-item “human-miss” subset (the items most frequently missed by physicians):
- Humans: 36% accuracy [26, 46].
- Claude 4.1 Opus: Decisively correct in 55.0% [39.9, 80.0], decisively wrong in 25.0% [9.9, 45.0], indeterminate in 20.0% [0.0, 35.0].
- Grok-4-Fast-Reasoning outperformed other models on “model-win” items but achieved a lower overall mARC score.
Earlier, non-reasoning models relying on rote completion perform markedly below physician accuracy (Shidara et al., 17 Jan 2026).
| Population | Total (n) | Accuracy (mean) | 95% CI |
|---|---|---|---|
| Human clinicians | 5 | 0.66 | [0.55, 0.75] |
| Claude 4.1 Opus | — | 0.751 | [0.738, 0.763] |
| GPT-5.1, Gemini 2.5, Grok-4 | — | no significant difference from humans | — |
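The paired bootstrap comparison mentioned above resamples items (not populations) so that both systems are evaluated on the same resampled set. A generic sketch under that assumption, not the paper's exact protocol:

```python
# Sketch: paired bootstrap over per-item 0/1 correctness vectors for two
# systems graded on the same items. Returns the fraction of resamples in
# which system A's accuracy strictly exceeds system B's.

import random

def paired_bootstrap(correct_a, correct_b, n_resamples=10_000, seed=0):
    rng = random.Random(seed)
    n = len(correct_a)
    wins = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]   # resample item indices
        acc_a = sum(correct_a[i] for i in idx) / n
        acc_b = sum(correct_b[i] for i in idx) / n
        wins += acc_a > acc_b
    return wins / n_resamples
```

A win fraction near 0.5 (equivalently, a two-sided p-value far from significance) corresponds to the reported finding of no significant difference between leading models and human clinicians.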
6. Implications, Access, and Extensibility
mARC demonstrates that while LLMs tuned for reasoning can approach or surpass physician performance on adversarial, flexibility-testing clinical tasks, this capacity is architecture- and prompt-dependent. Improved calibration (as measured by Brier scores and reliability plots) increases model trustworthiness for clinical deployment with deferral options. Yet, persistent overreliance on heuristics and long-tail decision failures in smaller models underscore the need for granular, targeted benchmarks such as mARC.
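The Brier score referenced above is the mean squared error between a model's stated confidence and the 0/1 outcome (lower is better). A minimal sketch with illustrative numbers:

```python
# Sketch: Brier score for calibration on binary correctness.
# probs: model confidence in its chosen answer; outcomes: 1 if correct, 0 if not.

def brier_score(probs, outcomes):
    """Mean squared difference between confidence and outcome."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

print(round(brier_score([0.9, 0.8, 0.3], [1, 1, 0]), 3))  # 0.047
```

A well-calibrated model with a deferral option can route low-confidence items to "seek more data", which is the deployment pattern the section above describes.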
All data and code—including raw abstracts, annotations, expert reference labels, aggregation code, and baseline models—are distributed in JSON and BioC formats under an OSI-compliant open license. The resource is available at http://www.ccs.neu.edu/home/bennye/EBM-NLP and is extensible for new annotation layers (e.g., risk-of-bias tagging, dosage normalization) following the staged workflow. Plans are in place to enlarge the expert-annotated gold-standard set and expand reasoning task coverage (Nye et al., 2018, Shidara et al., 17 Jan 2026).
A plausible implication is that as LLMs progress, adversarial resources such as mARC will be critical for distinguishing models exhibiting true cognitive flexibility from those that merely replicate surface patterns, thereby serving as benchmarks for safe, generalizable clinical AI.