
SCAND Corpus: Nordic & Robotics Data

Updated 20 January 2026
  • The designation “SCAND corpus” covers several distinct resources, including multilingual Nordic text corpora for LLM pretraining and a robotics dataset for social navigation.
  • The Nordic Pile and SWEb variants use automated pipelines to process terabytes of diverse web and print data with rigorous deduplication and quality filtering.
  • The Socially CompliAnt Navigation Dataset offers multi-sensor, imitation learning data of teleoperated robot trajectories for training humanlike social navigation.

The “SCAND corpus” refers to several distinct and significant resources bearing this designation in recent research: (1) a multilingual Nordic text corpus for language modeling and LLM pretraining, notably exemplified within “The Nordic Pile”; (2) the Socially CompliAnt Navigation Dataset (SCAND) for learning humanlike social navigation in robotics; and (3) SWEb—Scandinavian WEb—the largest open-access web-crawl corpus for Scandinavian languages, which has become pivotal for LLM scaling in smaller North Germanic languages. All utilize automated and scalable collection and filtering pipelines, support reproducible research, and offer new opportunities for supervised and unsupervised learning in both NLP and robotics domains.

1. SCAND Corpus in The Nordic Pile

The “SCAND corpus” within “The Nordic Pile” (Öhman et al., 2023) is an extensive multilingual resource constructed for pretraining LLMs in North Germanic languages (Danish, Icelandic, Norwegian, Swedish) and high-quality English. After post-processing, it comprises 1.2 TB (1,208.69 GB) of text from ≈ 659.5 million documents, drawing on diverse domains: web crawls, Wikipedia, books, academic material, conversational forums, other curated corpora, synthetic math, and code. The language composition is 40.24% English, 26.02% Swedish, 11.55% Norwegian, 10.80% Danish, and 1.59% Icelandic.

Data Collection and Processing Pipeline

The pipeline is fully automatic and consists of normalization, document-level statistical annotation, 16-criteria boolean quality filtering, exact deduplication (MD5), language segmentation, fuzzy deduplication (MinHash + LSH on character 10-grams, Jaccard similarity ≥ 0.5), and shard merging. No human-in-the-loop annotation or filtering is involved; all steps are scripted and scalable. Quality filters include alpha fraction, digit ratios, document length, ellipsis/hashtag-to-word ratios, flagged vocabulary, mean word length, n-gram repetition (with thresholds for k from 2 to 10), stop-word frequency, and supported-language constraints.
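The fuzzy-deduplication stage can be illustrated with a minimal MinHash sketch in plain Python. This is a simplified illustration, not the actual pipeline: the real system adds LSH banding to avoid pairwise comparisons, and its hash functions differ.

```python
import hashlib
import re

def shingles(text: str, n: int = 10) -> set:
    """Character n-grams (10-grams, as in the Nordic Pile pipeline)."""
    text = re.sub(r"\s+", " ", text.strip())
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}

def minhash(shingle_set: set, num_hashes: int = 128) -> list:
    """MinHash signature: for each seeded hash, keep the minimum over shingles."""
    return [
        min(int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingle_set)
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc1 = "The Nordic Pile is a multilingual corpus for North Germanic languages. " * 3
doc2 = "The Nordic Pile is a multilingual corpus for North Germanic language. " * 3
doc3 = "Completely different text about social robot navigation datasets. " * 3

sig1, sig2, sig3 = (minhash(shingles(d)) for d in (doc1, doc2, doc3))
print(estimated_jaccard(sig1, sig2) >= 0.5)  # True: near-duplicates
print(estimated_jaccard(sig1, sig3) < 0.5)   # True: unrelated documents
```

Documents whose estimated Jaccard similarity meets the 0.5 threshold would be collapsed to a single representative.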

Corpus Statistics and Structure

Language    Size (GB)   Share of tokens
English     486.36      40.24%
Swedish     314.48      26.02%
Norwegian   139.62      11.55%
Danish      130.51      10.80%
Icelandic    19.17       1.59%

The pipeline reduced the raw 1.5 TB collection by approximately 20% through filtering and a further 7% through fuzzy deduplication. Code and Wikipedia comprise 9.47% and 1.38% respectively. The mean document size is approximately 1.8 KB (1,208.69 GB across ≈ 659.5 million documents).

Licensing and Access

Due to legal restrictions (GDPR and European legislation), distribution of the complete processed corpus is not allowed. Detailed scripts for data collection and processing are available (GitHub); researchers must reproduce the full pipeline to acquire a corpus of similar quality and composition (Öhman et al., 2023).

2. Socially CompliAnt Navigation Dataset (SCAND)

The Socially CompliAnt Navigation Dataset (SCAND) (Karnan et al., 2022) is a large-scale, multi-modal robotics corpus designed for imitation learning of socially compliant navigation behaviors. It encompasses 8.7 hours (~25 miles, or ~40 km) of teleoperated driving demonstrations across indoor and outdoor environments using two distinct mobile robot platforms (Boston Dynamics Spot, Clearpath Jackal), with four human demonstrators.

Sensor Modalities and Data Structure

Each of the 138 trajectories contains ROS-bag-synchronized streams:

  • Velodyne VLP-16 3D LiDAR (10 Hz)
  • Azure Kinect RGB-D (20 Hz)
  • Monocular cameras (Spot, 5 units, 5 Hz each)
  • Stereo camera (Jackal, forward, 20 Hz)
  • 6 DOF IMU (16 Hz)
  • Wheel odometry (Jackal, 30 Hz)
  • Spot body joint/visual odometry (proprietary stream)
  • Joystick commands: linear velocity v (m/s), angular velocity ω (rad/s)
  • Full TF static frames

Each trajectory is self-contained (sensors.bag, tf_static.bag, and tags.json with 1–4 social-interaction labels drawn from a set of 12 classes, e.g., “navigating through crowds,” “street crossing,” “stairs”) and is ~300–500 MB uncompressed.

Tag                     Description                        # Trajectories
With Traffic            Navigating with oncoming traffic   74
Sidewalk                Navigating on a sidewalk           57
Passing Conversational  Passing a group in conversation    38
Street Crossing         Crossing a street                  34
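Because each stream above is logged at its own rate (LiDAR at 10 Hz, RGB-D at 20 Hz, and so on), training samples are typically built by aligning all streams to one reference clock. A minimal nearest-timestamp alignment sketch using only the standard library; in practice ROS tooling such as message_filters would handle this:

```python
import bisect

def align_to_reference(ref_times, stream_times, max_skew=0.05):
    """For each reference timestamp, return the index of the nearest message
    in another stream, or None if the closest one is more than max_skew s away."""
    aligned = []
    for t in ref_times:
        i = bisect.bisect_left(stream_times, t)
        candidates = [j for j in (i - 1, i) if 0 <= j < len(stream_times)]
        best = min(candidates, key=lambda j: abs(stream_times[j] - t))
        aligned.append(best if abs(stream_times[best] - t) <= max_skew else None)
    return aligned

# LiDAR at 10 Hz as the reference clock; a 16 Hz stream to align against it.
lidar_t = [i * 0.10 for i in range(10)]
imu_t = [i * 0.0625 for i in range(16)]
idx = align_to_reference(lidar_t, imu_t)
print(idx[0], idx[1])  # 0 2
```

The max_skew tolerance (50 ms here, an illustrative value) drops reference frames that have no sufficiently close match in the slower stream.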

Experimental and Benchmark Protocols

Author-provided splits (e.g., Spot: 75% train / 25% test) are recommended for reproducibility. Social compliance is quantified through demonstrator classification (74.48% accuracy vs. 50% chance), the Hausdorff distance d_H for path prediction (BC model: 0.26 m vs. move_base: 1.25 m), and human subject ratings for “social compliance” and “safety” (mean 4.39 and 4.71 for BC vs. 2.86 and 2.89 for move_base; p < 10⁻⁶, ANOVA).
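The path-prediction metric d_H is the standard symmetric Hausdorff distance; a minimal sketch for 2-D paths:

```python
import math

def directed_hausdorff(path_a, path_b):
    """Largest distance from any point on path A to its nearest point on B."""
    return max(min(math.dist(p, q) for q in path_b) for p in path_a)

def hausdorff(path_a, path_b):
    """Symmetric Hausdorff distance between two 2-D paths (lists of (x, y))."""
    return max(directed_hausdorff(path_a, path_b),
               directed_hausdorff(path_b, path_a))

demo = [(0.0, 0.0), (1.0, 0.1), (2.0, 0.0)]   # demonstrated trajectory
pred = [(0.0, 0.0), (1.0, 0.3), (2.0, 0.1)]   # policy-predicted trajectory
print(round(hausdorff(demo, pred), 2))  # 0.2
```

A lower d_H means the predicted path stays closer to the human demonstration everywhere along its length, which is why the BC model's 0.26 m compares favourably to move_base's 1.25 m.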

Typical pipelines leverage rasterized LiDAR bird’s-eye-view (BEV), odometry, and inertial/vision data, with Convolution+FC neural encoders in PyTorch. The dataset supports rapid development and evaluation of joint global and local navigation policies under imitation learning frameworks.
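The BEV rasterization step might look like the following sketch; the grid size and cell resolution are illustrative assumptions, not SCAND's actual parameters:

```python
def rasterize_bev(points, grid_size=64, cell_m=0.25):
    """Project (x, y, z) LiDAR points, robot at the origin, into a
    bird's-eye-view occupancy grid (z is dropped in this simple variant)."""
    half = grid_size * cell_m / 2.0
    grid = [[0] * grid_size for _ in range(grid_size)]
    for x, y, _z in points:
        if -half <= x < half and -half <= y < half:
            row = int((x + half) / cell_m)
            col = int((y + half) / cell_m)
            grid[row][col] = 1
    return grid

# A single LiDAR return 1 m ahead of the robot occupies exactly one cell.
bev = rasterize_bev([(1.0, 0.0, 0.3)])
print(sum(map(sum, bev)))  # 1
```

In a real pipeline the resulting grid (often with height or intensity channels rather than binary occupancy) is fed to the convolutional encoder.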

Access and Licensing

SCAND is openly available for research use (CC BY 4.0 or equivalent). Full dataset (~65 GB) and baseline split definitions are downloadable from the authors’ website (Karnan et al., 2022).

3. The SWEb SCAND Corpus

SWEb (“Scandinavian WEb”) (Norlund et al., 2024) constitutes the largest open-access Scandinavian web-crawl corpus: 3.6 TB of raw text (1.01 trillion tokens via GPT-SW3 tokenizer), with 1.2 billion documents originating from 98 Common Crawl WET snapshots (2013–2024). Language distribution (fastText scores): 48% Swedish (0.48 T tokens), 26% Danish (0.26 T), 20% Norwegian (0.20 T), 2.3% Icelandic (0.023 T), 3.7% other/unclassified.

By comparison, prior open SCAND corpora are markedly smaller:

Corpus        Total Tokens
SWEb          1.01 T
mC4 (SCAND)   ~0.10 T
OSCAR-23.01    0.01 T
HPLT-1.2       0.035 T

Pipeline: Collection, Extraction, and Filtering

SWEb’s workflow is divided into three primary stages:

  1. Content Selection: CCNet line-level deduplication on WET files, followed by fastText language identification; documents are retained if any Scandinavian language scores > 0.2.
  2. Content Extraction: HTML→Markdown via Pandoc; model-based line extraction (Longformer-base-16384, local attention window=256, [SEP]-annotated, F1=87) trained on 1,380 annotated pages, yielding content with low complexity and high relevance, avoiding rule-based HTML heuristics.
  3. Quality Filtering: Documents must satisfy a minimum length ≥ 100 characters, alphanumeric ratio ≥ 0.4, heading-to-body word ratio ≤ 0.05, and unigram entropy ≥ 3.0. MinHashLSH deduplication (16-codepoint shingles, 112 hashes) is done intra-snapshot; PII (emails, public IP addresses) is masked via regex.
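The four quality thresholds can be sketched as below. This is a minimal approximation: heading detection here assumes Markdown '#' headings, and the entropy uses the natural log; the paper's exact definitions and entropy base may differ.

```python
import math
import re
from collections import Counter

def passes_quality_filters(markdown_doc: str) -> bool:
    """Apply four SWEb-style thresholds to an extracted Markdown document."""
    text = markdown_doc.strip()
    if len(text) < 100:                                      # minimum length
        return False
    if sum(ch.isalnum() for ch in text) / len(text) < 0.4:   # alphanumeric ratio
        return False
    words = re.findall(r"\w+", text)
    headings = re.findall(r"^#+\s", text, flags=re.MULTILINE)
    if not words or len(headings) / len(words) > 0.05:       # headings per word
        return False
    counts = Counter(w.lower() for w in words)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy >= 3.0                                    # unigram entropy

varied = " ".join(f"word{i} something distinct here{i}" for i in range(40))
print(passes_quality_filters(varied))        # True: long, varied text
print(passes_quality_filters("spam " * 50))  # False: near-zero entropy
```

The entropy check rejects boilerplate-heavy pages (menus, repeated templates) whose word distribution is too concentrated, while the heading ratio rejects pages that are mostly navigation headings.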

Only four numeric thresholds control filtering—significantly simpler than previous pipelines. The resulting corpus is >99% Scandinavian post-extraction.

Benchmarking and Evaluation

SWEb introduces HP-MEK, a cloze-style Swedish benchmark derived from the Högskoleprovet MEK test (460 prompts, each with four candidate completions). In a controlled comparison with FineWeb, LLaMA models (1.82B parameters) trained on matched token budgets yield:

  • Lower perplexity on the SWEb test set for SWEb-trained models than for FineWeb-trained models
  • Comparable or improved transferability and accuracy (on-par HP-MEK performance), with the SWEb pipeline yielding 60% more tokens than FineWeb's
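A multiple-choice cloze benchmark like HP-MEK is commonly scored by picking the completion the model assigns the highest likelihood. The sketch below uses a toy stand-in scorer and is an assumption about the general scheme, not the paper's exact protocol:

```python
def pick_completion(prompt: str, candidates, loglik) -> int:
    """Return the index of the completion the model scores highest; loglik is
    a stand-in for a language model's log-likelihood of the full text."""
    scores = [loglik(prompt.replace("____", cand)) for cand in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)

# Toy scorer favouring one fluent phrase; real usage would sum token
# log-probabilities from a trained LM such as GPT-SW3.
def toy_loglik(text: str) -> float:
    return text.count("blå himmel") - 0.01 * len(text)

prompt = "Ovanför bergen syntes en ____."
cands = ["blå himmel", "grön himmel", "blå bil", "tyst gata"]
print(pick_completion(prompt, cands, toy_loglik))  # 0
```

Accuracy is then the fraction of the 460 prompts where the model's top-scored completion matches the answer key, against a 25% random baseline.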

Access, Licensing, and Integration

SWEb is fully open (CC0 compilation license) and released via HuggingFace; extraction code and checkpoints are available (GitHub, HuggingFace). Integration guidelines stress deduplication (MinHash), consistent tokenization (e.g., GPT-SW3), and optional harmonization of filtering criteria.

4. Comparative Analysis and Integration

All three datasets under the moniker “SCAND corpus” demonstrate scalable, reproducible, and modular collection and filtering practices, but target markedly different research domains:

  • The Nordic Pile’s SCAND corpus focuses on maximizing high-quality text diversity for LLMs in North Germanic languages, employing extensive web/print/corpus sources and strict automatic filtering, but is not openly distributed due to legal constraints (Öhman et al., 2023).
  • The SWEb SCAND corpus, with an order-of-magnitude more tokens, uses a streamlined, model-based extraction architecture well suited for next-generation LLM pretraining. Its process is fully open and designed for perpetual extension and cross-corpus harmonization (Norlund et al., 2024).
  • The Socially CompliAnt Navigation Dataset addresses social behavior modeling in robotics, delivering rich, multimodal trajectories and a strong imitation learning baseline, with direct reproducibility and open licensing (Karnan et al., 2022).

A plausible implication is that SWEb and The Nordic Pile can be combined (after deduplication and harmonization) to form extended pretraining mixes for LLM research, whereas the robotics-focused SCAND dataset is orthogonal and unrelated in content—but demonstrates the widespread adoption of the SCAND designation across modalities.

5. Applications and Future Prospects

  • NLP and LLM Pretraining: SWEb and The Nordic Pile SCAND corpora underpin Swedish, Danish, Norwegian, and Icelandic LLM scaling, closing a key resource gap for North Germanic and Nordic NLP.
  • Transfer Learning and Benchmarking: HP-MEK (SWEb) introduces the MEK-based cloze task for Swedish model validation, enabling fine-grained downstream evaluation. No downstream benchmarks are reported in The Nordic Pile (Öhman et al., 2023), underscoring the need for further open benchmarking effort.
  • Robotic Social Navigation: The SCAND navigation corpus provides a rigorous platform for training and benchmarking robot policies on real-world, multi-agent social compliance, with validated improvement over rule-based navigation.
  • Pipeline Reproducibility: Both SCAND text corpora encourage harmonization of deduplication/tokenization protocols to facilitate cross-corpus consistency, while SCAND navigation supports modular PyTorch pipelines and standardized train/test splits.

A plausible future direction is increased cross-linguistic and cross-modal fusion—combining rich web-scale text (SWEb, Nordic Pile SCAND) with dialog and perception-coupled datasets (robotics SCAND)—for more sophisticated language and embodied AI research.

6. Licensing, Usage, and Ethical Considerations

  • SWEb: Openly licensed (CC0) for structure and compilation, but downstream usage rights for web content remain the user’s responsibility. Notice/takedown channels are specified (Norlund et al., 2024).
  • The Nordic Pile SCAND: Pipeline and code are open, but no redistribution of processed data (GDPR constraints); external researchers must independently replicate the pipeline (Öhman et al., 2023).
  • SCAND Navigation: Open research license (CC BY 4.0); all usage is subject to standard research-exempt conditions (Karnan et al., 2022).

Neither text corpus (The Nordic Pile, SWEb) includes human-in-the-loop data cleaning; all language filtering and deduplication are fully automatic. Ethical usage, particularly regarding personally identifiable information (PII) and compliance with copyright and privacy law, is an explicit consideration in release protocols.
