IMLJP Dataset: Legal Judgment Data
- The IMLJP dataset is a large-scale collection of Chinese criminal judgments focused on intentional injury, with explicit annotation of principal and accomplice roles.
- It employs a hybrid human–machine annotation pipeline with oriented masking and pair-based sampling, achieving 91.05% accuracy in guilt inference.
- The dataset supports various tasks including binary classification and regression with detailed evaluation metrics, closely mirroring judicial logic.
IMLJP ("Intentional-injury Multi-defendant Legal Judgment Prediction") is a large-scale, carefully curated dataset designed to support explainable, multidefendant legal judgment prediction for intentional injury cases, with a focus on explicit role annotation, a robust preprocessing pipeline, and rigorous evaluation consistent with practical judicial logic. The dataset was assembled from public first-instance criminal judgments on China Judgment Online (2012–2020), ensuring granular, document-level supervision for the identification of defendant roles (principal vs. accomplice) and sentence prediction (Zhang et al., 19 Jan 2026).
1. Corpus Construction and Scope
IMLJP comprises 17,253 judgment documents covering a total of 34,828 defendants, all prosecuted under the single statutory charge of intentional injury. Only cases containing more than one defendant (as registered in the court’s view) were included, and inclusion was further restricted to those in which the official Court View (CV) section explicitly mentioned keywords corresponding to "principal" (主犯) or "accomplice" (从犯). This filtering ensures clear ground truth for role supervision.
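The inclusion criteria above can be sketched as a simple filter; the document structure and function name here are illustrative assumptions, not the authors' actual code.

```python
# Hypothetical sketch of IMLJP's inclusion filter: keep only judgments whose
# Court View (CV) explicitly mentions a role keyword, "principal" (主犯) or
# "accomplice" (从犯), for cases with more than one defendant.

ROLE_KEYWORDS = ("主犯", "从犯")

def eligible(document: dict) -> bool:
    """Return True if a judgment document qualifies for inclusion."""
    multi_defendant = len(document["defendants"]) > 1
    has_role_keyword = any(kw in document["court_view"] for kw in ROLE_KEYWORDS)
    return multi_defendant and has_role_keyword

docs = [
    {"defendants": ["A", "B"], "court_view": "被告人A系主犯，被告人B系从犯。"},
    {"defendants": ["C"],      "court_view": "被告人C系主犯。"},        # single defendant
    {"defendants": ["D", "E"], "court_view": "二被告人共同致人轻伤。"},  # no role keyword
]
kept = [d for d in docs if eligible(d)]
# only the first document passes both filters
```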
Table: Case/Defendant Distribution by Document
| Case Type | % of Documents | Description |
|---|---|---|
| Single-defendant | 38.8 | One defendant per case |
| Two-to-five-defendant | 58.3 | Most cases fall in this range |
| > Five defendants (max 20) | ~2.9 | Large, complex group cases |
A key characteristic is that only 31.5% of cases prosecute all suspects named in police files, reflecting selective indictment practices. Uniform-role cases (all defendants are either principals or accomplices) constitute 70.5% of IMLJP; the remaining 29.5% offer natural contrastive data for comparative learning by including both roles within one document.
2. Annotation Pipeline, Schema, and Preprocessing
A hybrid human–machine annotation pipeline was implemented. Initially, Ernie-Bot 3.5 (LLM) was used to insert principal/accomplice role labels into CV sections containing role-indicative keywords. Three professional legal annotators conducted line-by-line reviews to correct misdetections and resolve inconsistent phrasing.
Per-defendant labeling schema:
- @id: unique case identifier
- @name: anonymized defendant name
- @FD: fact description by the Procuratorate
- @CV: court view by the Judge
- @prison: fixed-term imprisonment (months)
- @probation: probation duration (months, if applicable)
- @guilt: binary role label (1: principal, 0: accomplice)
All personal identifiers are fully anonymized in the release version.
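Under this schema, one per-defendant record might look as follows; the values are fabricated placeholders for illustration only.

```python
# Illustrative record following the IMLJP per-defendant labeling schema.
# All field values are invented placeholders, not real dataset entries.
record = {
    "id": "case_000001",   # unique case identifier
    "name": "DEF_A",       # anonymized defendant name
    "FD": "...",           # fact description by the Procuratorate
    "CV": "...",           # court view by the Judge
    "prison": 36,          # fixed-term imprisonment, in months
    "probation": 48,       # probation duration in months (None if not applicable)
    "guilt": 1,            # binary role label: 1 = principal, 0 = accomplice
}
```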
Targeted preprocessing includes "oriented masking": every occurrence of the target defendant’s name in the FD or the pruned court view (CV_d, in which explicit role phrases are deleted) is replaced with the token [MASK]. Three input variants were explored: (1) Original, retaining all names; (2) Split, retaining only sentences that mention the target; (3) MASK, in which only the target defendant’s name is masked. The MASK variant delivered the best guilt-inference accuracy (91.05%) and F1 (0.9117), compared to Split and Original (Zhang et al., 19 Jan 2026).
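Oriented masking can be sketched in a few lines; the example text and names are illustrative, and simple string replacement stands in for whatever tokenization-aware procedure the authors actually used.

```python
# Minimal sketch of "oriented masking": replace every occurrence of the
# target defendant's (anonymized) name with [MASK], leaving co-defendants'
# names visible so the model can reason about the target in context.

def oriented_mask(text: str, target_name: str) -> str:
    """Mask only the target defendant; other defendants stay visible."""
    return text.replace(target_name, "[MASK]")

fd = "被告人张某持械殴打被害人，被告人李某在旁望风。"
masked_for_zhang = oriented_mask(fd, "张某")
# → "被告人[MASK]持械殴打被害人，被告人李某在旁望风。"
```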
3. Dataset Splits and Comparative Data Construction
Dataset splits adhere to an 80 / 10 / 10 ratio (train/validation/test) on a per-defendant basis. This yields approximately 7,571 / 945 / 948 examples per class for guilt-inference and guilt-identification tasks, with the full 34,828-defendant set similarly split for prison-term regression.
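A per-defendant 80/10/10 split can be sketched as follows; the shuffling procedure and seed are assumptions, only the proportions come from the text.

```python
import random

# Hedged sketch of an 80/10/10 per-defendant split. The exact shuffling and
# seeding used for IMLJP are not specified; this shows one plausible scheme.

def split_defendants(examples, seed=42):
    rng = random.Random(seed)
    items = list(examples)
    rng.shuffle(items)
    n = len(items)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

train, val, test = split_defendants(range(34828))
# roughly 27862 / 3482 / 3484 defendants
```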
To address the class-imbalanced scenario and maximize contrastive signal, the “Pairs” data construction matches each minority-class principal (n=9,464) from the 6,596 mixed-role cases to a randomly selected accomplice, forming 9,464 principal–accomplice pairs. Pairs-based construction decisively outperforms both “Random” and “Full” sampling strategies for guilt differentiation (Tables 2 and 9 in (Zhang et al., 19 Jan 2026)).
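The Pairs construction can be sketched as per-case matching; the data and the exact sampling details (e.g. whether accomplices are drawn with replacement) are assumptions for illustration.

```python
import random

# Sketch of the "Pairs" construction: within each mixed-role case, match
# every principal to one randomly drawn accomplice from the same case,
# yielding natural principal–accomplice contrastive pairs.

def build_pairs(cases, seed=0):
    rng = random.Random(seed)
    pairs = []
    for case in cases:
        principals = [d for d in case if d["guilt"] == 1]
        accomplices = [d for d in case if d["guilt"] == 0]
        if not principals or not accomplices:
            continue  # uniform-role case: no contrastive pair available
        for p in principals:
            pairs.append((p, rng.choice(accomplices)))
    return pairs

cases = [
    [{"name": "A", "guilt": 1}, {"name": "B", "guilt": 0}],  # mixed-role
    [{"name": "C", "guilt": 1}, {"name": "D", "guilt": 1}],  # all principals
]
pairs = build_pairs(cases)
# one pair, drawn from the mixed-role case only
```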
4. Formal Task Definitions and Evaluation Metrics
Primary input fields are tokenized as sequences:
- FD: the fact description, with the target defendant’s name replaced by [MASK] (oriented masking)
- CV_d: the pruned court view, with explicit role references removed
Benchmarked Task Types
- Guilt inference from the fact description (masked FD): binary classification with cross-entropy loss; metrics: Accuracy, Precision, Recall, F1.
- Guilt identification from the full court view: same structure and metrics.
- Prison-term prediction: regression with MSE loss; evaluated using ImpScore (piecewise log-error), ImpAcc (prediction within 25% of the true duration), and ImpErr (relative absolute error normalized by 180 months, the maximum term in IMLJP).
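Two of the sentencing metrics can be reconstructed directly from their prose descriptions; these are hedged sketches, and ImpScore’s exact piecewise log form is not reproduced here.

```python
# Reconstruction of ImpAcc and ImpErr as described in the text:
# ImpAcc counts predictions within 25% of the true term; ImpErr is the mean
# absolute error normalized by 180 months (the maximum term in IMLJP).
# The formulas are inferred from the prose, not copied from the paper.

MAX_TERM_MONTHS = 180

def imp_acc(preds, truths):
    hits = sum(abs(p - t) <= 0.25 * t for p, t in zip(preds, truths))
    return hits / len(preds)

def imp_err(preds, truths):
    return sum(abs(p - t) / MAX_TERM_MONTHS for p, t in zip(preds, truths)) / len(preds)

preds, truths = [30, 60], [36, 40]
# |30-36| = 6 <= 9 (hit); |60-40| = 20 > 10 (miss) -> ImpAcc = 0.5
# ImpErr = mean(6/180, 20/180) = 13/180
```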
All benchmarks use the unified BERT-base-Chinese tokenizer.
5. Model Protocols and Benchmarks
The dataset has been used to benchmark logic-guided inference pipelines, including the masked multi-stage inference (MMSI) framework, and various neural and LLM architectures (including BERT, mT5, Claude-3, GPT-3.5, GPT-4o, LLaMA3-70B, Gemini-1.5, DeepSeek-V3). Masked, pairs-based protocols yield substantial gains in both guilt-role discrimination (91.05% accuracy) and sentence prediction (Zhang et al., 19 Jan 2026).
Comparative data construction (Pairs approach) and oriented masking substantially enhance model sensitivity to role-specific context. Fine-grained benchmarks against prompt-based generative baselines, legal-domain models (Legal-BERT, Lawformer), and state-of-the-art LLMs consistently demonstrate superior performance for protocols leveraging IMLJP’s contrastive and masked design.
6. Applications, Limitations, and Prospective Extensions
IMLJP’s design fosters the development of explainable, multidefendant judgment systems able to distinguish principal/accomplice roles and predict statutory sentencing, using data and evaluation closely mirroring real-world judicial logic. The dataset’s use of explicit role extraction, document-level masking, and natural contrastive pairs addresses long-standing obstacles posed by role ambiguity and group case complexity.
Identified limitations include the focus on a single statutory charge (intentional injury), the filtering to cases with explicit principal/accomplice tags (which may exclude more nuanced real-world decisions), and dependence on Chinese criminal documents. A plausible implication is that adaptations to other crime categories or legal systems may require further schema development or additional annotation protocols.
Future extensions could generalize IMLJP’s multidefendant and masked annotation pipeline to broader offense types, increase the complexity of benchmarked judicial tasks, and exploit more sophisticated contrastive or logic-guided learning paradigms. Public release of corpus and code ensures reproducibility and transferability for judicial AI research (Zhang et al., 19 Jan 2026).