Movie Scripts Corpus

Updated 17 January 2026
  • Movie Scripts Corpus is a systematically curated collection of full-length screenplays with precise scene, dialogue, and action segmentation.
  • The corpus leverages detailed annotations including metadata, scene salience, and discourse graphs, formatted in XML/JSON for robust computational analysis.
  • It supports advanced research in narrative understanding, summarization, machine translation, and multimodal video–language alignment using standardized evaluation metrics.

A movie scripts corpus is a systematically constructed and curated collection of full-length movie screenplays, often paired with plot summaries, temporal alignments to audiovisual materials, structured metadata, or specialized annotations such as scene salience or discourse graphs. These corpora enable rigorous research into narrative understanding, abstractive and extractive summarization, neural machine translation, multimodal video–language alignment, saliency detection, and discourse modelling. Recent resources are distinguished by their scale (hundreds to thousands of scripts), high-fidelity formatting (XML, JSON with scene/dialogue/action markup), genre diversity, and support for advanced annotation schemes.

1. Corpus Composition and Source Material

Movie scripts corpora aggregate screenplays from publicly accessible online archives such as scriptslug.com, imsdb.com, dailyscript.com, weeklyscript.com, and simplyscripts.com. Among the larger datasets, MovieSum comprises 2,200 English-language movie scripts, each accompanied by a Wikipedia-based plot summary (Saxena et al., 2024); scripts are manually de-duplicated and filtered for completeness before formatting. Earlier resources such as ScriptBase-j and CreativeSumm contain 917–1,276 scripts spanning 23 genres and produced between 1909 and 2013 (Agarwal et al., 2022). Specialized corpora such as Movie Description provide scripts aligned to video for 31–50 movies, comprising 23,396–31,103 sentences, for video–language research (Rohrbach et al., 2015; Rohrbach et al., 2016).

Each screenplay is typically segmented into scenes, with scene headings (e.g., "INT. OFFICE – NIGHT"); character cues and dialogue blocks; and action/description paragraphs or monologues. Highly structured corpora store these using professional authoring tools (Celtx) and export to XML, enabling unambiguous extraction of each element. Scripts are accompanied by metadata including IMDb IDs, release year, title, and genre(s), facilitating linkage to external knowledge sources (posters, cast lists, box-office statistics).
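The slugline conventions above lend themselves to lightweight rule-based segmentation. The following is a minimal illustrative sketch, not the Celtx/XML pipeline the corpora actually use; the regex pattern and field names are assumptions for illustration:

```python
import re

# Scene headings ("sluglines") start with INT./EXT. and usually end
# with a time-of-day marker after a dash (hypothetical pattern; real
# scripts show more variation than this regex covers).
SLUG = re.compile(
    r"^(INT|EXT|INT/EXT)\.\s+(?P<location>.+?)"
    r"(?:\s*[-\u2013]\s*(?P<time>DAY|NIGHT|MORNING|EVENING|CONTINUOUS))?$"
)

def split_scenes(script_lines):
    """Segment a screenplay into scenes at slugline boundaries."""
    scenes, current = [], None
    for line in script_lines:
        m = SLUG.match(line.strip())
        if m:
            # Start a new scene record at each heading.
            current = {"heading": line.strip(),
                       "location": m.group("location"),
                       "time": m.group("time"),
                       "body": []}
            scenes.append(current)
        elif current is not None:
            # Dialogue/action lines accumulate under the current scene.
            current["body"].append(line)
    return scenes

scenes = split_scenes([
    "INT. OFFICE - NIGHT",
    "JANE types furiously at her desk.",
    "EXT. STREET - DAY",
    "Crowds pass by.",
])
```

A production pipeline would additionally distinguish character cues, parentheticals, and dialogue blocks within each scene body, which the structured XML exports make unnecessary to infer.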

Corpus         #Scripts      Mean Length            Summary Source   Format
MovieSum       2,200         ~29,000 words          Wikipedia        XML
ScriptBase-j   917           ~29,000 words          Wikipedia/fan    TXT
CreativeSumm   1,276 + 216   ~12,600 words (avg)    Wikipedia/fan    TXT
MovieDesc.     31–50         ~10.2 tokens/sentence  Aligned video    CSV/JSON

2. Structural Annotation and Scene–Salience Alignment

Modern corpora emphasize precise annotation of screenplay structure, supporting analysis beyond surface-level text. Each script in MovieSum is formatted to mark scene boundaries, character identifiers, parentheticals, dialogue, and actions according to industry-standard screenplay conventions (Saxena et al., 2024). The MENSA dataset introduces explicit scene salience annotation for 100 scripts: scenes are marked salient if aligned to at least one summary sentence in the corresponding Wikipedia plot, yielding 5,365 salient scenes out of 16,208 total across 100 movies (an average of ≈162.1 scenes per movie) (Saxena et al., 2024).

Raw scripts undergo human annotation in some corpora, where scene–sentence alignment is produced via a combination of automatic initialization and manual refinement. Annotator agreement is quantified by Exact Match Agreement (EMA = 52.8%), Partial Agreement (PA = 81.6%), and mean annotation distance (D = 1.21). High-fidelity collections typically store scripts as XML or scene-by-scene JSON, tagging each segment with stable IDs to support interoperability.
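Assuming each annotator's output is represented as a mapping from summary-sentence IDs to sets of aligned scene indices, agreement statistics of this kind could be computed roughly as follows (the exact definitions used by the corpus authors may differ; this is a sketch):

```python
def agreement(ann_a, ann_b):
    """Pairwise agreement between two annotators' scene alignments.

    ann_a, ann_b: dicts mapping summary-sentence id -> set of scene ids.
    Returns (exact match agreement, partial agreement, mean distance),
    where distance is the index gap between the closest chosen scenes.
    """
    exact = partial = 0
    dists = []
    keys = ann_a.keys() & ann_b.keys()
    for k in keys:
        a, b = ann_a[k], ann_b[k]
        if a == b:
            exact += 1      # identical scene sets
        if a & b:
            partial += 1    # at least one scene in common
        dists.append(min(abs(i - j) for i in a for j in b))
    n = len(keys)
    return exact / n, partial / n, sum(dists) / n

ema, pa, d = agreement(
    {0: {4}, 1: {10, 11}, 2: {20}},   # annotator A (toy data)
    {0: {4}, 1: {11}, 2: {22}},       # annotator B (toy data)
)
# ema = 1/3, pa = 2/3, d = (0 + 0 + 2) / 3
```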

This suggests a research direction toward joint models of narrative salience and summarization quality, built on gold-standard scene–summary alignments and fine-grained character/scene linkage.

3. Discourse Graphs and Computational Representation

Recent advances encode screenplay content as heterogeneous graphs, notably the Character-Aware Discourse Graph (CaD Graph; Editor’s term) formalized in DiscoGraMS (Chitale et al., 2024). Each script is represented as G = (V, E), with node types for scenes (V_s), dialogues (V_d), and characters (V_c), and edges for scene–scene, scene–dialogue, scene–character, and character–dialogue relations. Embeddings for textual nodes leverage models such as Sentence-BERT, while character nodes are refined by graph neural networks (GATConv). Edge types encode sequential, referential, and speaker/participant relationships.

Graph objects are stored as PyTorch Geometric Data instances, with explicit schema: node attribute “type” ∈ {scene, dialogue, character}; edge attribute “relation” ∈ {ss, sd, sc, cd}. All underlying scripts use XML, facilitating reproducible extraction.
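The schema can be illustrated with a plain-Python stand-in (the released code stores the same structure as PyTorch Geometric Data objects; the class and method names below are invented for this sketch, only the node/edge type labels come from the schema above):

```python
from dataclasses import dataclass, field

NODE_TYPES = {"scene", "dialogue", "character"}
EDGE_TYPES = {"ss", "sd", "sc", "cd"}  # scene-scene, scene-dialogue,
                                       # scene-character, character-dialogue

@dataclass
class CaDGraph:
    """Minimal stand-in for the CaD-graph schema (illustrative only)."""
    nodes: list = field(default_factory=list)   # (node_id, type)
    edges: list = field(default_factory=list)   # (src, dst, relation)

    def add_node(self, node_id, ntype):
        assert ntype in NODE_TYPES, f"unknown node type {ntype!r}"
        self.nodes.append((node_id, ntype))

    def add_edge(self, src, dst, relation):
        assert relation in EDGE_TYPES, f"unknown relation {relation!r}"
        self.edges.append((src, dst, relation))

g = CaDGraph()
g.add_node("s0", "scene"); g.add_node("s1", "scene")
g.add_node("d0", "dialogue"); g.add_node("c0", "character")
g.add_edge("s0", "s1", "ss")   # consecutive scenes
g.add_edge("s0", "d0", "sd")   # dialogue occurring in scene s0
g.add_edge("c0", "d0", "cd")   # character c0 speaks dialogue d0
```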

Discourse graphs are designed for late fusion with text modality (e.g., in LGAT), supporting multi-headed self-attention between structured graph and screenplay embeddings.

4. Temporal Alignment with Audiovisual Data

Certain corpora provide paired video–script resources, enabling temporal alignment between script monologues and movie frames (Rohrbach et al., 2015, Rohrbach et al., 2016). Scripts are aligned to SRT subtitle tracks via dynamic programming: tokens in script and subtitle sentences are compared for word-overlap, and global monotonic alignments are established. Filtering removes script lines not present in the released film (cast lists, soundtrack cues, off-screen text). Each aligned sentence is annotated with start/end timestamps in the video, speaker label, and whether it constitutes pure action.
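The alignment step can be sketched as a classic monotonic dynamic program over word-overlap scores. This is a simplified illustration under assumed scoring details (Jaccard overlap, a 0.5 retention threshold), not the exact procedure of the cited works:

```python
def overlap(a, b):
    """Word-overlap (Jaccard) score between two sentences."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def align(script, subs, threshold=0.5):
    """Globally monotonic script-to-subtitle alignment via DP."""
    n, m = len(script), len(subs)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j],            # skip a script sentence
                score[i][j - 1],            # skip a subtitle sentence
                score[i - 1][j - 1] + overlap(script[i - 1], subs[j - 1]),
            )
    # Backtrack, keeping only pairs whose local match clears the threshold.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        s = overlap(script[i - 1], subs[j - 1])
        if score[i][j] == score[i - 1][j - 1] + s and s >= threshold:
            pairs.append((i - 1, j - 1)); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]

pairs = align(["She opens the door slowly", "Music swells"],
              ["she opens the door", "the end credits roll"])
# keeps only the well-matched pair; "Music swells" has no counterpart
```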

The LSMDC corpus, for instance, contains 31,071–31,103 clips (from 50 movies), sentence-level alignment to video segments (average duration 3.9 s), and coverage of over 317,728 tokens. Only script sentences with local match ≥0.5 are retained. Temporal noise is reduced by manual adjustment of timestamp boundaries.

Comparison to Audio Description (AD) corpora reveals that ADs are more visual and correspond more closely to on-screen events, with script coverage judged relevant to actual frames only 34–37 % of the time.

5. Parallel Corpora from Subtitles: Multilingual Construction

Movie scripts and subtitles facilitate large-scale multilingual parallel corpora. Using MongoDB to catalogue ~14,000 IMDb titles, synchronized subtitle pairs are crawled, cleaned, and aligned at sentence level (Jafari, 2018). Synchronization is verified via time-stamp checks: only pairs for which all segments match start times within δ ≈ 0 are retained. Extracted sentence-level pairs are constructed by segmenting dialogue, matching counts on both sides, and appending aligned pairs. Contextualization is achieved by genre, year, and rating filtering during crawling, supporting domain-specific neural machine translation (NMT) experiments.
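The time-stamp synchronization check amounts to requiring near-zero start-time deltas across every segment pair. A minimal sketch, assuming subtitle tracks are represented as (start_seconds, text) lists:

```python
def synchronized(subs_a, subs_b, tolerance=0.0):
    """Check that two subtitle tracks are time-synchronized.

    subs_a, subs_b: lists of (start_seconds, text). A pair is retained
    only if every segment's start time matches within `tolerance`
    (delta close to 0 in the construction described above).
    """
    if len(subs_a) != len(subs_b):
        return False
    return all(abs(sa - sb) <= tolerance
               for (sa, _), (sb, _) in zip(subs_a, subs_b))

en = [(1.2, "Hello."), (4.0, "Where were you?")]
fa = [(1.2, "line 1"), (4.0, "line 2")]   # placeholder translated text
# synchronized(en, fa) passes; a shifted track would be rejected
```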

For English–Persian, ~682k sentence pairs are extracted from 541 synchronized movies. NMT models trained on these yield dev BLEU ≈ 6.86 after 60k steps. The methodology prioritizes informal conversational style, mirroring screen dialogue. The authors further suggest genre-specific models and context-sensitive routing via lightweight classifiers.

6. Evaluation and Downstream Applications

Corpora are benchmarked via ROUGE-F1, BERTScore, LitePyramid, and SummaC, while test setups incorporate lead-N baselines, zero-shot LLMs, and fine-tuned LLMs with long-input capacities (Saxena et al., 2024, Chitale et al., 2024). ROUGE-1 in MovieSum ranges from ~10–18 (lead-N) to 44.8 (LED 16k context); ROUGE-2 typically <10, indicating abstractive difficulty. Structure-aware fine-tuning (dialogue-only/description-only) yields near-identical performance, underscoring that models have yet to fully leverage screenplay structure. DiscoGraMS (LGAT) achieves ROUGE-1 = 49.25 and ROUGE-2 = 13.12, a material improvement over text-only models.
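For orientation, ROUGE-1 F1 is unigram-overlap F-measure between system and reference summaries. A bare-bones sketch follows; the cited evaluations use the standard ROUGE tooling, which additionally handles tokenization, stemming, and multi-reference aggregation:

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """Unigram-overlap ROUGE-1 F1 (illustrative, whitespace tokenization)."""
    c = Counter(candidate.lower().split())
    r = Counter(reference.lower().split())
    match = sum((c & r).values())      # clipped unigram matches
    if match == 0:
        return 0.0
    precision = match / sum(c.values())
    recall = match / sum(r.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("the hero saves the city",
                  "the hero rescues the city at dawn")
# 4 matched unigrams: P = 4/5, R = 4/7, F1 = 2/3
```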

Target applications span abstractive summarization, narrative salience detection, question answering, event timeline generation, multimodal video–language alignment, turning-point identification, and character–network analysis. Access to IMDb-linked metadata enables retrieval-augmented generation and genre-specific downstream studies.

Model          ROUGE-1   ROUGE-2   BERT-F1
LED            44.85     9.83      58.73
Pegasus-X      42.42     8.16      54.36
LGAT (Disc.)   49.25     13.12     81.51

A plausible implication is that graph-based screenplay representations drive substantial gains in summarization and question answering by encoding latent narrative and character dependencies.

7. Availability, Licensing, and Limitations

Most leading corpora are available for research via GitHub (MovieSum, MENSA, DiscoGraMS, CreativeSumm), subject to licensing conditions such as CC-BY-SA (Wikipedia summaries), CC-BY-NC (script/video alignment), and original site copyright restrictions for user-uploaded scripts or subtitles. Detailed schema, JSON/XML formats, and train/val/test splits are documented in dataset releases. Licensing for subtitles is predominantly Creative Commons or public domain, though verification is advised for commercial or redistribution use.

Known limitations include variability in script formatting, discrepancies between scripts and final cuts, uneven genre distribution, and wide quality variation in fan-written summaries. Inter-annotator agreement in human-annotated datasets is moderate, highlighting the complexity of salient scene identification. Corpus coverage is English-dominant, with growing but limited multilingual expansion via subtitles (Jafari, 2018).

Coverage gaps, noise in alignment, incomplete representation of visual events, and evaluation metric mismatch (low scores for news-domain metrics in creative texts) remain active areas of methodological refinement.
