Crime Script-Aware Inference Dataset (CSID)

Updated 27 January 2026

CSID is a structured dataset that operationalizes crime script theory by mapping sequential criminal actions and intents.
It integrates multi-modal data, combining textual, visual, and acoustic features to enhance perpetrator identification and scam detection.
CSID bridges raw detection outcomes with step-wise reasoning, enabling anticipatory modeling and improved next-move predictions in adversarial scenarios.

The Crime Script-Aware Inference Dataset (CSID) is a structured, multistage resource for the computational modeling, detection, and explanation of criminal scripts in realistic narrative and scam interaction scenarios. It has two instantiations: the CSI drama dataset (Frermann et al., 2017) and the Korean phone scam corpus for LLM-based scam defense (Kim et al., 20 Jan 2026). Both operationalize crime script theory by providing systematically annotated, sequence-driven data for multi-modal sequence labeling and crime intent inference. The principal function of CSID is to bridge raw detection outcomes with step-wise reasoning about the underlying criminal intent, staging, and next move.

1. Motivation, Theoretical Foundations, and Dataset Scope

CSID originates from two core observations. First, real-world criminal activities—whether represented in crime dramas or manifested in social engineering scams—proceed according to “crime scripts,” i.e., highly structured, sequential procedures organizing criminal intent, roles, and actions over time. Second, conventional NLP and LLM systems typically treat scam or perpetrator identification as one-shot classification tasks, neglecting the incremental, anticipatory reasoning humans deploy while interpreting or defending against criminal schemes (Kim et al., 20 Jan 2026).

The CSI dataset comprises 39 episodes (59 criminal cases) from seasons 1–5 of CSI: Crime Scene Investigation. The Korean scam CSID spans 571 voice-phishing conversations supplemented by 11,356 benign police summons, resulting in 22,712 structured training instances. Both datasets capture how a crime unfolds over time, whether dramatized or real, and make explicit the mapping between script stage, observed actions, and perpetrator (or scammer) inference (Frermann et al., 2017; Kim et al., 20 Jan 2026).

2. Annotation Schema, Task Formalization, and Script Theory Integration

CSI Dataset

Each case is segmented into screenplay sentences $s = (s_1, \dots, s_N)$ and annotated via dual-pass procedures:

First-pass behavioral annotation: Annotators tag whether the perpetrator is mentioned and which case each sentence refers to, and their “first positive guess” points are logged (κ on “perpetrator mentioned” = 0.62).
Second-pass gold annotation: Each script token is labeled PERP (true perpetrator), SUS (suspect), or OTHER. A sentence is labeled positive when it contains at least one PERP token (κ(PERP) = 0.90).
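
The token-to-sentence rule in the second pass can be sketched as follows. This is a minimal illustration, not the authors' annotation tooling: the function names and the toy case are assumptions, and only the PERP/SUS/OTHER tag set and the "at least one PERP token" rule come from the description above.

```python
from typing import List

PERP, SUS, OTHER = "PERP", "SUS", "OTHER"  # tag set named in the text


def sentence_label(token_labels: List[str]) -> int:
    """Return 1 if at least one token is tagged PERP, otherwise 0."""
    return int(any(label == PERP for label in token_labels))


def label_case(sentences: List[List[str]]) -> List[int]:
    """Derive one binary label per screenplay sentence s_1..s_N."""
    return [sentence_label(tokens) for tokens in sentences]


# Toy case with three sentences (made-up tags, for illustration only).
example_case = [
    [OTHER, OTHER, SUS],   # mentions a suspect only -> negative
    [OTHER, PERP, OTHER],  # mentions the perpetrator -> positive
    [OTHER, OTHER],        # mentions neither -> negative
]
labels = label_case(example_case)
```

Note that a SUS token alone never makes a sentence positive, which is why suspect mentions act as hard negatives for a sequence labeler.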

Scam Dataset

Crime script theory is explicitly encoded as macro-stages and micro-intents:

Crime script stages: Identity confirmation, case presentation, involvement linking, investigation preparation, and voice investigation.
Intent classification: Each scammer utterance is mapped into one or more of 75 intent categories by professional crime profilers (Cohen’s κ = 0.91).
Record structure: Each CSID instance is a quadruple $(U_S,\,Y_a,\,U_{t+1},\,Y_{\text{int}})$, where $U_S$ is the partial conversation, $Y_a$ the scam label, $U_{t+1}$ the scammer’s next utterance, and $Y_{\text{int}}$ a rationale explaining the current and next intents.

This schema operationalizes script-awareness by explicitly representing action-intent transitions and stagewise progression.
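
As a rough sketch of how such quadruples could be derived from one annotated conversation, the snippet below slices a dialogue at every scammer turn. The slicing strategy, the `CSIDRecord` class, the `build_records` helper, the speaker tags, the intent names, and the sample utterances are all assumptions made for illustration; only the four fields and the "Current intent / Next intent" rationale wording follow the article.

```python
from dataclasses import dataclass
from typing import List, Tuple

# (speaker, utterance, intent); intent is "" for non-scammer turns.
Turn = Tuple[str, str, str]


@dataclass
class CSIDRecord:
    partial_conversation: List[str]  # U_S
    scam_label: str                  # Y_a
    next_utterance: str              # U_{t+1}
    intent_rationale: str            # Y_int


def build_records(turns: List[Turn], scam_label: str) -> List[CSIDRecord]:
    """Emit one record per pair of consecutive scammer turns."""
    scammer_idx = [i for i, (spk, _, _) in enumerate(turns) if spk == "scammer"]
    records = []
    for cur, nxt in zip(scammer_idx, scammer_idx[1:]):
        # Everything said before the next scammer utterance is the context.
        prefix = [f"{spk}: {utt}" for spk, utt, _ in turns[:nxt]]
        rationale = (
            f"Current intent: {turns[cur][2]}. Next intent: {turns[nxt][2]}."
        )
        records.append(CSIDRecord(prefix, scam_label, turns[nxt][1], rationale))
    return records


# Hypothetical conversation; intent names are placeholders, not the 75 categories.
turns = [
    ("scammer", "Am I speaking with the account holder?", "confirm_identity"),
    ("victim", "Yes, speaking.", ""),
    ("scammer", "Your account appeared in a fraud case.", "present_case"),
    ("victim", "What? I know nothing about that.", ""),
    ("scammer", "We must check whether you are involved.", "link_involvement"),
]
records = build_records(turns, scam_label="scam")
```

Each record pairs a growing conversation prefix with the utterance that actually followed it, which is what lets a model be trained on next-move prediction rather than on whole-conversation classification.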

3. Data Modalities, Feature Engineering, and Alignment

CSI Dataset

Three modalities are extracted per sentence for sequence labeling:

Textual: Each word is encoded as a 50-dim GloVe vector; a 1D CNN with filter widths {3, 4, 5} produces a 225-dim sentence vector.
Visual: One frame is extracted at the midpoint of each sentence span; 1536-dim features come from a pre-trained Inception-v4.
Acoustic: Audio (dialog-free) is converted to 13-dim MFCC features every 5 ms, and 5 vectors are sampled per sentence (65-dim).
Alignment: Dynamic time warping aligns closed-caption subtitles with script utterances; scenes are timestamped heuristically.
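
The acoustic step reduces a variable number of MFCC frames to a fixed 65-dim vector (5 × 13). The article does not say how the 5 vectors are chosen, so the sketch below assumes evenly spaced frames across the sentence span; `sample_mfcc` is a hypothetical helper, and the MFCC frames are taken as already computed.

```python
from typing import List


def sample_mfcc(frames: List[List[float]], n_samples: int = 5) -> List[float]:
    """Pick n_samples evenly spaced MFCC frames and flatten them.

    frames: one 13-dim MFCC vector per 5 ms step within a sentence span.
    Returns a vector of length n_samples * 13 (65 with the defaults).
    Even spacing is an assumption; the source only states the counts.
    """
    if not frames:
        raise ValueError("sentence span contains no MFCC frames")
    last = len(frames) - 1
    # Integer arithmetic keeps indices in range and includes both endpoints.
    indices = [(k * last) // (n_samples - 1) for k in range(n_samples)]
    return [value for i in indices for value in frames[i]]


# Synthetic input: 40 frames (200 ms), frame i filled with the value i.
frames = [[float(i)] * 13 for i in range(40)]
acoustic_vector = sample_mfcc(frames)
```

For spans shorter than 5 frames the same frame is reused, so the output length stays fixed at 65 regardless of sentence duration.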

Scam Dataset

All conversational data is textual, transcribed, de-identified, and augmented with intent rationale templates (“Current intent: XXX. Next intent: YYY.”). Sequence validation uses the Standardized Residual (SR) statistic to ensure high script consistency and sequence predictability.
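
To make the SR check concrete, the sketch below computes standardized residuals for stage-to-stage transitions, using the common Pearson form SR = (O − E) / √E, where O is the observed count of a transition and E the count expected if consecutive stages were independent. Whether the dataset authors used this exact form or an adjusted variant is not stated here, so treat the formula choice, the function name, and the toy sequences as assumptions.

```python
from collections import Counter
from math import sqrt
from typing import Dict, List, Tuple


def standardized_residuals(
    sequences: List[List[str]],
) -> Dict[Tuple[str, str], float]:
    """Pearson standardized residual for every (from_stage, to_stage) pair."""
    pair_counts: Counter = Counter()
    for seq in sequences:
        pair_counts.update(zip(seq, seq[1:]))  # consecutive transitions
    total = sum(pair_counts.values())

    row: Counter = Counter()  # how often each stage is the source
    col: Counter = Counter()  # how often each stage is the target
    for (src, dst), n in pair_counts.items():
        row[src] += n
        col[dst] += n

    residuals = {}
    for src in row:
        for dst in col:
            expected = row[src] * col[dst] / total
            observed = pair_counts.get((src, dst), 0)
            residuals[(src, dst)] = (observed - expected) / sqrt(expected)
    return residuals


# Toy stage sequences (made up): three follow the script order, one deviates.
ID, CASE, LINK = "identity_confirmation", "case_presentation", "involvement_linking"
sequences = [[ID, CASE, LINK]] * 3 + [[ID, LINK, CASE]]
sr = standardized_residuals(sequences)
```

A large positive residual marks a transition that occurs far more often than chance would predict, which is the sense in which a script is "consistent"; negative values mark transitions the script avoids.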

4. Dataset Construction, Splitting, and Accessibility

Dataset	Source	Size	Modality	Annotation Format
CSI	TV drama (seasons 1–5)	39 episodes, 59 cases	Text, visual, acoustic	Scripts, JSON/CSV
Scam CSID	Korean police transcripts	22,712 instances	Text	Conversation JSON

CSI dataset splits: 5-fold cross-validation, held-out final evaluation, downloadable with aligned scripts and annotations (Frermann et al., 2017).
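
One way to set up a 5-fold protocol with a held-out evaluation set is sketched below. It assumes the split is made at the episode level, so that sentences from one episode never appear on both sides, and it uses a placeholder held-out size of 6 episodes; neither detail is given above, and `split_episodes` and `cv_iterations` are hypothetical helpers.

```python
import random
from typing import Iterable, Iterator, List, Tuple


def split_episodes(
    episode_ids: Iterable[int],
    n_folds: int = 5,
    n_held_out: int = 6,  # placeholder size, not taken from the source
    seed: int = 0,
) -> Tuple[List[List[int]], List[int]]:
    """Shuffle episodes, reserve a held-out set, and deal the rest into folds."""
    ids = list(episode_ids)
    random.Random(seed).shuffle(ids)
    held_out, rest = ids[:n_held_out], ids[n_held_out:]
    folds = [rest[i::n_folds] for i in range(n_folds)]
    return folds, held_out


def cv_iterations(folds: List[List[int]]) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, validation) episode lists, one pair per fold."""
    for i, validation in enumerate(folds):
        train = [ep for j, fold in enumerate(folds) if j != i for ep in fold]
        yield train, validation


folds, held_out = split_episodes(range(39))  # 39 episodes in the CSI dataset
```

Splitting by episode rather than by sentence matters for sequence models, since neighbouring sentences of the same case are highly correlated.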

Scam dataset splits: instances are fully shuffled before test and validation sets are drawn (≈200 instances for LLM-as-a-Judge); the data is downloadable in JSON format under academic, non-commercial terms (Kim et al., 20 Jan 2026).

5. Model Architectures, Training Regimen, and Inference Protocols

CSI Perpetrator Identification

Fusion: $[x^s; x^v; x^a]$ concatenated; 300-dim ReLU transformation.
Incremental unidirectional LSTM: 128 units; strict history propagation with no lookahead.
Loss: Cross-entropy over binary mention labels.
Baselines: MLP (two-layer ReLU), CRF over sentence embeddings, PRO (rule-based pronoun model).
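
With the feature sizes from Section 3, the fused input has 225 + 1536 + 65 = 1826 components. One way to write the incremental model is given below. It assumes that the ReLU layer is applied to the concatenated vector and that the output is a single sigmoid unit; the symbols $W$, $b$, $w_o$, $b_o$, $z_t$, $h_t$, and $p_t$ are introduced here for illustration and do not come from the source.

$$
\begin{aligned}
x_t &= [x^s_t;\, x^v_t;\, x^a_t] \in \mathbb{R}^{1826}, \\
z_t &= \mathrm{ReLU}(W x_t + b) \in \mathbb{R}^{300}, \\
h_t &= \mathrm{LSTM}(z_t,\, h_{t-1}) \in \mathbb{R}^{128}, \\
p_t &= \sigma(w_o^\top h_t + b_o), \\
\mathcal{L} &= -\sum_{t=1}^{N} \big[\, y_t \log p_t + (1 - y_t) \log (1 - p_t) \,\big].
\end{aligned}
$$

Here $y_t \in \{0, 1\}$ is the sentence-level mention label for $s_t$. Because $h_t$ depends only on $s_1, \dots, s_t$, the prediction $p_t$ never uses later sentences, which is what "no lookahead" means in practice.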

Scam Detection and Reasoning

Fine-tuning: Compact LLMs (7–11B and 1–2B parameters) trained for 5 epochs via QLoRA + Paged AdamW (LR=1e-4) on two A100 GPUs.
Prompt–completion protocol: Input is conversation, output is JSON {label, next_utterance, rationale}.
Evaluation: Detection accuracy, F₁, and false positives/false negatives; next-utterance and rationale quality are assessed via an LLM-as-a-Judge pipeline (Pearson $r \geq .81$, $p < .001$).
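
The prompt–completion contract and the detection metrics can be sketched together. The three JSON keys follow the output format listed above; the label strings "scam" and "benign", the helper names, and the sample completion are assumptions for illustration.

```python
import json
from typing import Dict, List

REQUIRED_KEYS = ("label", "next_utterance", "rationale")


def parse_completion(text: str) -> Dict[str, str]:
    """Parse a model completion and check that all expected keys are present."""
    obj = json.loads(text)
    if not isinstance(obj, dict):
        raise ValueError("completion is not a JSON object")
    missing = [key for key in REQUIRED_KEYS if key not in obj]
    if missing:
        raise ValueError(f"completion is missing keys: {missing}")
    return obj


def detection_metrics(
    gold: List[str], pred: List[str], positive: str = "scam"
) -> Dict[str, float]:
    """Accuracy, F1, and raw FP/FN counts for the binary detection task."""
    pairs = list(zip(gold, pred))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    tn = sum(g != positive and p != positive for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "f1": f1, "fp": fp, "fn": fn}


# Hypothetical completion in the expected shape.
completion = (
    '{"label": "scam", "next_utterance": "Please stay on the line.", '
    '"rationale": "Current intent: present_case. Next intent: link_involvement."}'
)
parsed = parse_completion(completion)
metrics = detection_metrics(
    gold=["scam", "scam", "benign", "benign"],
    pred=["scam", "benign", "scam", "benign"],
)
```

Validating the completion before scoring keeps malformed generations from being silently counted as correct or incorrect; how such failures should be scored is a separate design decision.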

6. Experimental Findings and Insights

CSI Dataset

Performance: The best LSTM (text + visual + acoustic) reaches F₁ = 46.6 on the held-out set (minority class), against a human upper bound of 67.3. Adding the visual and acoustic modalities improved recall and overall F₁ by 2–3 points. The LSTM outperformed non-incremental methods by more than 6 F₁ points, which underlines the need for temporal modeling and script awareness.
Inference behavior: The LSTM “guesses” the correct perpetrator much earlier than humans (average first detection after 141 versus 423 sentences).
Special cases: In episodes without a perpetrator (e.g., suicide), both the model and humans generate early false positives; however, the LSTM continues predicting for longer, which suggests a learned prior.
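
The "first detection" comparison can be made concrete with a short sketch. It assumes that a detection is the first sentence at which the model's probability crosses a threshold while the gold label is also positive; the 0.5 threshold, the helper names, and the toy numbers are illustrative and not taken from the paper.

```python
from typing import List, Optional


def first_detection(
    probs: List[float], gold: List[int], threshold: float = 0.5
) -> Optional[int]:
    """1-based index of the first correct positive prediction, or None."""
    for position, (p, y) in enumerate(zip(probs, gold), start=1):
        if p >= threshold and y == 1:
            return position
    return None


def mean_first_detection(detections: List[Optional[int]]) -> Optional[float]:
    """Average over cases where a detection happened; None if there are none."""
    found = [d for d in detections if d is not None]
    return sum(found) / len(found) if found else None


# Toy cases: per-sentence model probabilities and gold mention labels.
case_a = first_detection([0.1, 0.7, 0.2, 0.9], [0, 0, 1, 1])  # fires at sentence 4
case_b = first_detection([0.6, 0.4, 0.8], [1, 0, 1])          # fires at sentence 1
case_c = first_detection([0.2, 0.3], [0, 1])                  # never fires
average = mean_first_detection([case_a, case_b, case_c])
```

In `case_a` the high score at sentence 2 is not counted because the gold label there is negative; this mirrors the distinction between an early false positive and an early correct guess.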

Scam CSID

Detection: Compact LLMs (cLLMs) fine-tuned with CSID outperform GPT-4o by 13% in accuracy, with fewer false positives and improved scammer-utterance prediction and rationale quality.
Cognitive evaluation: ScriptMind, leveraging CSID, significantly increased and sustained users' suspicion during simulated scams, enhancing real-time cognitive awareness.
Script-structure manipulation: SR sequence analysis confirmed strong macro-stage predictability (e.g., “Case Introduction → Self-Introduction”, SR=68.6).

A plausible implication is that encoding script transitions and next-utterance expectations gives LLMs both robust detection and anticipatory guidance capabilities, which supports human users in adversarial, evolving environments.

7. Accessibility, Usage Constraints, and Applications

CSID variants and accompanying code are available for academic, non-commercial use (subject to registration/IRB compliance, where required). CSI resources include full scripts, aligned video/audio, and annotation files (Frermann et al., 2017), while the scam dataset provides JSON records mapped to public Law&Order transcripts (Kim et al., 20 Jan 2026).

CSID has been deployed to train compact, cognitively adaptive LLMs, supporting structured inference tasks such as multi-modal perpetrator identification and phone scam detection/explanation. It serves as a standard-bearer for script-aware, incremental detection methodologies and offers a high-fidelity sandbox for both model evaluation and cognitive simulation-based defense.

CSID exemplifies the operationalization of crime script theory within machine learning, aligning stepwise, intent–action dynamics with detection, explanation, and next-move prediction tasks. It advances both algorithmic and user-centered research paradigms for sequence labeling and adversarial scenario understanding, leveraging rich annotation, sequence modeling, and cognitive evaluation. For researchers seeking robust, script-grounded resources for crime inference and scam defense, CSID remains a critical asset (Frermann et al., 2017, Kim et al., 20 Jan 2026).


References

Frermann et al. (2017). Whodunnit? Crime Drama as a Case for Natural Language Understanding.

Kim et al. (2026). SCRIPTMIND: Crime Script Inference and Cognitive Evaluation for LLM-based Social Engineering Scam Detection System.
