Crime Script Inference Task
- The CSIT framework integrates scam detection, next action prediction, and intent inference to analyze crime event sequences.
- It leverages transformer and recurrent neural architectures with multi-modal data for structured, explainable predictions.
- CSIT supports early intervention by providing cognitive support with interpretable crime scripts in high-stakes settings.
The Crime Script Inference Task (CSIT) is a formally defined, multi-output reasoning task designed to operationalize the detection and understanding of crime-related event sequences in both real-world and narrative contexts. Prominently applied to domains such as social engineering scam detection and crime drama analysis, CSIT requires a system to identify ongoing criminal activity, anticipate subsequent actions or utterances, and provide interpretable explanations of underlying intent within partially observed event sequences. Recent approaches to CSIT incorporate transformer-based LLMs and recurrent neural architectures, leverage multi-modal data, and emphasize alignment with criminological scripting theory for structured, explainable prediction (Kim et al., 20 Jan 2026, Frermann et al., 2017).
1. Definition and Objectives
CSIT encompasses multi-task reasoning in event-driven, adversarial contexts. Given an incomplete dialogue or event sequence—such as part of a suspicious phone call or a segment from a crime drama episode—the task requires simultaneous resolution of three sub-tasks:
- Scam or crime identification: binary classification denoted $y \in \{0, 1\}$, where $y = 1$ indicates a scam/crime and $y = 0$ otherwise;
- Next action or utterance prediction: generate the adversary's likely next move $\hat{u}_{t+1}$;
- Intent inference: explain the adversary's (or perpetrator's) underlying intent $z \in \mathcal{Z}$, selected from a taxonomy of intent classes.
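The three sub-task outputs above can be captured in a single structured result; a minimal sketch (the class and field names are illustrative assumptions, not taken from the cited papers):

```python
from dataclasses import dataclass

# Hypothetical container for the three CSIT sub-task outputs (illustrative;
# field names are assumptions, not from the cited papers).
@dataclass
class CSITOutput:
    is_scam: bool          # sub-task 1: binary scam/crime identification (y)
    next_utterance: str    # sub-task 2: predicted next adversary move
    intent: str            # sub-task 3: intent label from a fixed taxonomy

# Toy subset of an intent taxonomy, for illustration only.
INTENT_TAXONOMY = {"build_trust", "present_evidence", "demand_transfer"}

def validate(out: CSITOutput) -> bool:
    """Check that a model output is well-formed for CSIT."""
    return isinstance(out.is_scam, bool) and out.intent in INTENT_TAXONOMY

pred = CSITOutput(is_scam=True,
                  next_utterance="We need to verify your account ownership.",
                  intent="present_evidence")
print(validate(pred))  # True
```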
The principal objectives of CSIT are early detection (to intervene before harm occurs), cognitive support for human decision-makers (by surfacing intent and next-step rationales), and explainability to promote model transparency and trust—particularly in high-stakes, real-time settings such as phone scam prevention (Kim et al., 20 Jan 2026).
2. Formal Task Formulation
A canonical CSIT instance is structured as follows. Let $s$ denote the scenario type (e.g., “prosecutor impersonation”), and $C_t = (u_1, \ldots, u_t)$ the observed context comprising alternating user and adversary utterances. Given input $x = (s, C_t)$, the CSIT framework requires joint prediction $f(x) = (y, \hat{u}_{t+1}, z)$.
Here, $\mathcal{U}$ is the utterance space ($\hat{u}_{t+1} \in \mathcal{U}$); $\mathcal{Z}$ is a curated set of intent labels (15+ for social engineering scams); and structured crime scripts are viewed as sequences of (utterance, intent) tuples $(u_k, z_k)$.
In the crime drama domain, the sequence consists of sentences $s_1, \ldots, s_n$ spanning multi-modal features (text, vision, audio) and associated binary labels $l_i \in \{0, 1\}$ indicating perpetrator mentions (Frermann et al., 2017).
The training objective aggregates multiple loss functions: a binary classification loss for the scam label $y$, and cross-entropy losses for utterance and intent generation ($\mathcal{L}_{\text{utt}}$, $\mathcal{L}_{\text{int}}$), combined with task weights $\lambda$ (typically set to 1). Statistical validation employing Standardized Residuals (SR) is used to ensure script transitions ($z_k \to z_{k+1}$) are non-random and follow criminologically plausible pathways (Kim et al., 20 Jan 2026).
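The aggregated objective can be sketched with toy per-example losses; this is a minimal illustration assuming unit task weights as stated above, not the papers' actual token-level implementation:

```python
import math

# Minimal sketch of the aggregated CSIT training objective, assuming
# task weights of 1 as stated in the text. Token-level cross-entropy is
# reduced to toy per-example log-losses for illustration.
def bce(y_true: int, p: float) -> float:
    """Binary cross-entropy for the scam label y."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

def ce(p_correct: float) -> float:
    """Cross-entropy of the correct class/token, given its predicted prob."""
    return -math.log(p_correct)

def csit_loss(y, p_scam, p_next_utt, p_intent, lam=(1.0, 1.0, 1.0)):
    """Weighted sum of classification, utterance, and intent losses."""
    l_cls, l_utt, l_int = lam
    return l_cls * bce(y, p_scam) + l_utt * ce(p_next_utt) + l_int * ce(p_intent)

total = csit_loss(y=1, p_scam=0.9, p_next_utt=0.5, p_intent=0.7)
```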
3. Datasets and Annotation Schemas
CSIT instantiations are grounded in large-scale, expertly annotated datasets. In social scam detection, the CSID dataset is constructed from 571 Korean phone scam cases (LAW-ORDER benchmark), with 48,229 utterances and 23,771 intent-labeled scammer utterances. Annotation involves extracting scammer utterances, expert profiling with a fine-grained taxonomy (>15 intent classes), constructing Scammer’s Behavior Sequences, and statistically validating sequence consistency. Sliding window techniques generate “partial contexts” ($C_t$) for model supervision.
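The sliding-window generation of partial contexts can be sketched as growing dialogue prefixes; the exact windowing parameters below (minimum prefix length, step size) are assumptions for illustration:

```python
# Sketch of sliding-window "partial context" generation: each prefix of the
# dialogue becomes one supervision example. The minimum prefix length is an
# assumption; the papers' exact windowing parameters are not specified here.
def partial_contexts(utterances, min_len=2):
    """Yield growing prefixes C_t of an utterance sequence."""
    for t in range(min_len, len(utterances) + 1):
        yield utterances[:t]

dialogue = ["Hello?", "This is the prosecutor's office.",
            "We found two accounts under your name.", "Were you aware of this?"]
windows = list(partial_contexts(dialogue))
print(len(windows))  # 3 prefixes: lengths 2, 3, 4
```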
Each example includes a binary scam label ($y$), ground-truth next utterance ($u_{t+1}$), and intent label ($z$), resulting in a balanced dataset split equally between scam and benign cases (Kim et al., 20 Jan 2026).
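The statistical validation of behavior-sequence transitions mentioned above can be sketched via standardized residuals on transition counts; the |SR| > 2 threshold is a common convention assumed here, and the counts are fabricated for illustration:

```python
import math

# Toy sketch of a Standardized Residual (SR) check on intent-transition
# counts: SR = (observed - expected) / sqrt(expected). Transitions with
# |SR| > 2 are treated as non-random (the threshold is a common convention,
# assumed here, not taken from the cited papers).
def standardized_residual(observed: float, expected: float) -> float:
    return (observed - expected) / math.sqrt(expected)

# Hypothetical counts for one transition (build_trust -> demand_transfer).
obs, exp = 40, 20.0
sr = standardized_residual(obs, exp)
print(sr > 2)  # flagged as a non-random, script-consistent transition
```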
In the crime drama domain, data is sourced from 39 CSI episodes (yielding 59 cases). Scripts and video are aligned using dynamic time-warping on subtitles; scenes and entity mentions are annotated at the sentence level. Multi-modal representations are constructed by concatenating GloVe-based text features, Inception-v4-derived frame embeddings, and MFCC audio features (Frermann et al., 2017).
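The sentence-level fusion of the three modalities can be sketched as vector concatenation; the dimensionalities below (300-d GloVe, 1536-d Inception-v4, 13-d MFCC) are standard for those feature extractors but assumed here, and the vectors are random placeholders:

```python
import numpy as np

# Sketch of sentence-level multi-modal fusion by concatenation, assuming
# illustrative dimensionalities (GloVe 300-d text, Inception-v4 1536-d frame
# features, 13-d MFCC audio); the real pipeline's dimensions may differ.
rng = np.random.default_rng(0)
text_vec  = rng.standard_normal(300)    # GloVe-based sentence embedding
frame_vec = rng.standard_normal(1536)   # Inception-v4 frame embedding
audio_vec = rng.standard_normal(13)     # MFCC summary features

fused = np.concatenate([text_vec, frame_vec, audio_vec])
print(fused.shape)  # (1849,)
```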
4. Model Architectures and Training Regimens
For social scam detection, CSIT models employ open-source transformer decoders (e.g., LLaMA-3.2-1B, EXAONE-3.5-2.4B) fine-tuned using QLoRA (rank 4) adapters for parameter efficiency. Multi-task prefix prompts enforce structured outputs in JSON format (keys: “label,” “next_utterance,” “rationale”). Mixed-precision (FP16) enables longer context handling, with training on A100 GPUs and hyperparameters: batch size 32, a 1,024-token context window, and 5 epochs. The optimizer is Paged AdamW, with gradient checkpointing for memory reduction (Kim et al., 20 Jan 2026).
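Downstream consumers of the structured JSON output need to validate the enforced keys; a minimal parsing sketch (the raw string is a fabricated illustration, not real model output):

```python
import json

# Sketch of parsing the structured JSON output enforced by the multi-task
# prefix prompt (keys "label", "next_utterance", "rationale" as in the text);
# the raw string below is a fabricated illustration, not real model output.
raw = ('{"label": "scam", "next_utterance": "We must verify the transfers.", '
       '"rationale": "Scammer escalates from evidence to verification."}')

def parse_csit_json(s: str) -> dict:
    out = json.loads(s)
    required = {"label", "next_utterance", "rationale"}
    missing = required - out.keys()
    if missing:
        raise ValueError(f"malformed CSIT output, missing {missing}")
    return out

parsed = parse_csit_json(raw)
print(parsed["label"])  # scam
```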
For crime drama, the architecture centers on a unidirectional LSTM that processes fused multi-modal feature vectors at each sentence position, recurrently updating hidden and cell states ($h_t$, $c_t$). Output layers predict the probability of a perpetrator mention per sentence. This sequence labeling approach allows incremental inference, leveraging the LSTM’s capacity to accumulate and re-weight evidence over long event sequences (Frermann et al., 2017).
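The per-sentence recurrence can be sketched as a single LSTM step in NumPy; weights and dimensions below are untrained toy placeholders, not the published model:

```python
import numpy as np

# Minimal single-layer LSTM step in NumPy, illustrating how hidden and cell
# states (h_t, c_t) are updated per sentence and mapped to a perpetrator-
# mention probability. Weights are random placeholders, not trained values.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One recurrence; W, U, b pack the input, forget, output, cell gates."""
    z = W @ x + U @ h + b
    i, f, o, g = np.split(z, 4)
    i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new

rng = np.random.default_rng(1)
d_in, d_h = 8, 4                        # toy dims (fused features -> hidden)
W = rng.standard_normal((4 * d_h, d_in)) * 0.1
U = rng.standard_normal((4 * d_h, d_h)) * 0.1
b = np.zeros(4 * d_h)
w_out = rng.standard_normal(d_h) * 0.1  # output layer -> P(perpetrator mention)

h, c = np.zeros(d_h), np.zeros(d_h)
for x in rng.standard_normal((5, d_in)):  # five "sentences"
    h, c = lstm_step(x, h, c, W, U, b)
p_mention = sigmoid(w_out @ h)
print(0.0 < p_mention < 1.0)  # True
```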
5. Empirical Evaluation and Quantitative Results
The performance of CSIT models is assessed using standard metrics: accuracy, F1, false-positive/false-negative rates, as well as specialized evaluations of utterance and rationale quality (via LLM-as-a-Judge, agreement with human raters).
| Model | ACC | F1 | FP / FN rate | Next-utterance quality | Intent quality |
|---|---|---|---|---|---|
| Llama-3.2-1B-ZS | 0.55 | 0.56 | 0.24/0.21 | 0.04 | 0.17 |
| Llama-3.2-1B-FT | 0.91 | 0.92 | 0.06/0.02 | 0.39 | 0.57 |
| EXAONE-3.5-2.4B-ZS | 0.58 | 0.65 | 0.31/0.11 | 0.30 | 0.51 |
| EXAONE-3.5-2.4B-FT | 0.94 | 0.94 | 0.06/0.01 | 0.53 | 0.73 |
| EEVE-10.8B-ZS | 0.71 | 0.74 | 0.21/0.09 | 0.42 | 0.52 |
| EEVE-10.8B-FT | 0.98 | 0.98 | 0.01/0.01 | 0.68 | 0.80 |
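The detection columns above can be reproduced from confusion counts; a minimal sketch using toy counts (not the papers' data), with FP/FN normalized as the standard false-positive and false-negative rates, which may differ from the table's exact convention:

```python
# Sketch of the detection metrics used above (accuracy, F1, FP/FN rates),
# computed from toy confusion counts rather than the papers' actual data.
# FP/FN rates use the standard FPR = FP/(FP+TN), FNR = FN/(FN+TP) convention,
# which is an assumption about the table's normalization.
def detection_metrics(tp, fp, tn, fn):
    n = tp + fp + tn + fn
    acc = (tp + tn) / n
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * prec * rec / (prec + rec)
    return {"acc": acc, "f1": f1,
            "fp_rate": fp / (fp + tn), "fn_rate": fn / (fn + tp)}

m = detection_metrics(tp=47, fp=3, tn=48, fn=2)
print(round(m["acc"], 2), round(m["f1"], 2))  # 0.95 0.95
```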
Fine-tuned models raised detection accuracy by 0.28 on average and reduced the FP rate by 0.24. The quality of next-utterance and rationale outputs improved by 0.24 and 0.16, respectively. In crime drama, the best models achieved a cross-validation F1 of 44.1 and a held-out F1 of 46.6 for perpetrator mention detection; multimodal fusion delivered recall gains of up to 14 points over text-only models. LSTMs outperformed non-sequential and shallow sequence models, indicating the criticality of memory and feature fusion for CSIT (Frermann et al., 2017, Kim et al., 20 Jan 2026).
6. Illustrative Examples
A representative scam dialogue fragment (“We found two accounts under your name at [Bank]. Were you aware of this?”) prompts CSIT inference as follows:
- Gold: label: “scam”; next_utterance: “We now need to determine if you personally opened and transferred those accounts for profit or if you were a victim of identity theft.”; rationale: “Scammer is switching from presenting evidence to confirming whether the target actively participated or was impersonated.”
- Fine-tuned EXAONE-3.5-2.4B: next_utterance and rationale closely match gold, accurately tracing adversarial intent transitions.
- Zero-shot baselines perform worse, often repeating input or offering generic rationale (“Scammer wants your personal info”).
In crime drama, human annotators and models operate incrementally: the LSTM model can “lock on” to the true perpetrator earlier (average sentence 141) than human viewers (423), but at the risk of making spurious predictions when the narrative lacks an actual perpetrator, underscoring the importance of context-sensitive inference (Frermann et al., 2017).
7. Novel Aspects, Limitations, and Extensions
CSIT’s distinctiveness derives from its integrated multi-task setup (classification, generation, and explanation), script-aware data pipeline with statistical validation, and focus on user cognitive state alignment. Challenges include the labor-intensive annotation of fine-grained intent classes, managing the false positive/negative tradeoff in adversarial dialogue, and efficient multi-output learning—particularly in lower-resource languages and with partial observations.
Potential directions include expanding CSIT to multimodal domains (e.g., incorporating prosody, metadata), continual learning for emergent adversarial strategies, cross-lingual script transfer, human-in-the-loop feedback, and integration of dynamic suspicion models for optimal alerting. Further, application to narrative domains such as crime drama demonstrates that incremental, context-accumulating models best capture the evolving inference required in real-world and fictional script understanding (Kim et al., 20 Jan 2026, Frermann et al., 2017).