
SCAND Corpus: Nordic & Robotics Data

Updated 20 January 2026
  • The designation “SCAND corpus” covers several distinct resources, including multilingual Nordic text corpora for LLM pretraining and a robotics dataset for social navigation.
  • The Nordic Pile and SWEb variants use automated pipelines to process terabytes of diverse web and print data with rigorous deduplication and quality filtering.
  • The Socially CompliAnt Navigation Dataset offers multi-sensor, imitation learning data of teleoperated robot trajectories for training humanlike social navigation.

The “SCAND corpus” refers to several distinct and significant resources bearing this designation in recent research: (1) a multilingual Nordic text corpus for language modeling and LLM pretraining, notably exemplified within “The Nordic Pile”; (2) the Socially CompliAnt Navigation Dataset (SCAND) for learning humanlike social navigation in robotics; and (3) SWEb—Scandinavian WEb—the largest open-access web-crawl corpus for Scandinavian languages, which has become pivotal for LLM scaling in smaller North Germanic languages. All utilize automated and scalable collection and filtering pipelines, support reproducible research, and offer new opportunities for supervised and unsupervised learning in both NLP and robotics domains.

1. SCAND Corpus in The Nordic Pile

The “SCAND corpus” within “The Nordic Pile” (Öhman et al., 2023) is an extensive multilingual resource constructed for pretraining LLMs in North Germanic languages (Danish, Icelandic, Norwegian, Swedish) and high-quality English. After post-processing, it comprises 1.2 TB (1,208.69 GB) of text from ≈ 659.5 million documents, drawing on diverse domains: web crawls, Wikipedia, books, academic material, conversational forums, other curated corpora, synthetic math, and code. The language composition is 40.24% English, 26.02% Swedish, 11.55% Norwegian, 10.80% Danish, and 1.59% Icelandic.

Data Collection and Processing Pipeline

The pipeline is fully automatic and consists of normalization, document-level statistical annotation, 16-criteria boolean quality filtering, exact deduplication (MD5), language segmentation, fuzzy deduplication (MinHash + LSH on character 10-grams, Jaccard similarity ≥ 0.5), and shard merging. No human-in-the-loop annotation or filtering is involved; all steps are scripted and scalable. Quality filters include alpha fraction, digit ratios, document length, ellipsis/hashtag-to-word ratios, flagged vocabulary, mean word length, n-gram repetition (with thresholds for k from 2 to 10), stop-word frequency, and supported-language constraints.
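The fuzzy-deduplication stage can be illustrated with a minimal MinHash sketch in plain Python. This is a simplified illustration, not the actual pipeline: the real system adds LSH banding to avoid pairwise comparisons, and its hash functions differ.

```python
import hashlib
import re

def shingles(text: str, n: int = 10) -> set:
    """Character n-grams (10-grams, as in the Nordic Pile pipeline)."""
    text = re.sub(r"\s+", " ", text.strip())
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}

def minhash(shingle_set: set, num_hashes: int = 128) -> list:
    """MinHash signature: for each seeded hash, keep the minimum over shingles."""
    return [
        min(int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingle_set)
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc1 = "The Nordic Pile is a multilingual corpus for North Germanic languages. " * 3
doc2 = "The Nordic Pile is a multilingual corpus for North Germanic language. " * 3
doc3 = "Completely different text about social robot navigation datasets. " * 3

sig1, sig2, sig3 = (minhash(shingles(d)) for d in (doc1, doc2, doc3))
print(estimated_jaccard(sig1, sig2) >= 0.5)  # True: near-duplicates
print(estimated_jaccard(sig1, sig3) < 0.5)   # True: unrelated documents
```

Documents whose estimated Jaccard similarity meets the 0.5 threshold would be collapsed to a single representative.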

Corpus Statistics and Structure

Language    Size (GB)   Share of tokens
English     486.36      40.24%
Swedish     314.48      26.02%
Norwegian   139.62      11.55%
Danish      130.51      10.80%
Icelandic    19.17       1.59%

The pipeline reduced the raw 1.5 TB collection by approximately 20% through filtering and a further 7% through fuzzy deduplication. Code and Wikipedia comprise 9.47% and 1.38% respectively. The mean document size is approximately 1.8 KB (1,208.69 GB across ≈ 659.5 million documents).

Licensing and Access

Due to legal restrictions (GDPR and European legislation), distribution of the complete processed corpus is not allowed. Detailed scripts for data collection and processing are available (GitHub); researchers must reproduce the full pipeline to acquire a corpus of similar quality and composition (Öhman et al., 2023).

2. Socially CompliAnt Navigation Dataset (SCAND)

The Socially CompliAnt Navigation Dataset (SCAND) (Karnan et al., 2022) is a large-scale, multi-modal robotics corpus designed for imitation learning of socially compliant navigation behaviors. It encompasses 8.7 hours (~25 miles, or ~40 km) of teleoperated driving demonstrations across indoor and outdoor environments using two distinct mobile robot platforms (Boston Dynamics Spot, Clearpath Jackal), with four human demonstrators.

Sensor Modalities and Data Structure

Each of the 138 trajectories contains ROS-bag-synchronized streams:

  • Velodyne VLP-16 3D LiDAR (10 Hz)
  • Azure Kinect RGB-D (20 Hz)
  • Monocular cameras (Spot, 5 units, 5 Hz each)
  • Stereo camera (Jackal, forward, 20 Hz)
  • 6 DOF IMU (16 Hz)
  • Wheel odometry (Jackal, 30 Hz)
  • Spot body joint/visual odometry (proprietary stream)
  • Joystick commands: linear velocity v (m/s), angular velocity ω (rad/s)
  • Full TF static frames

Each trajectory is self-contained (sensors.bag, tf_static.bag, and tags.json with 1–4 social-interaction labels drawn from a set of 12 classes, e.g., “navigating through crowds,” “street crossing,” “stairs”) and is ~300–500 MB uncompressed.

Tag                     Description                        # Trajectories
With Traffic            Navigating with oncoming traffic   74
Sidewalk                Navigating on a sidewalk           57
Passing Conversational  Passing a group in conversation    38
Street Crossing         Crossing a street                  34
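Because each stream above is logged at its own rate (LiDAR at 10 Hz, RGB-D at 20 Hz, and so on), training samples are typically built by aligning all streams to one reference clock. A minimal nearest-timestamp alignment sketch using only the standard library; in practice ROS tooling such as message_filters would handle this:

```python
import bisect

def align_to_reference(ref_times, stream_times, max_skew=0.05):
    """For each reference timestamp, return the index of the nearest message
    in another stream, or None if the closest one is more than max_skew s away."""
    aligned = []
    for t in ref_times:
        i = bisect.bisect_left(stream_times, t)
        candidates = [j for j in (i - 1, i) if 0 <= j < len(stream_times)]
        best = min(candidates, key=lambda j: abs(stream_times[j] - t))
        aligned.append(best if abs(stream_times[best] - t) <= max_skew else None)
    return aligned

# LiDAR at 10 Hz as the reference clock; a 16 Hz stream to align against it.
lidar_t = [i * 0.10 for i in range(10)]
imu_t = [i * 0.0625 for i in range(16)]
idx = align_to_reference(lidar_t, imu_t)
print(idx[0], idx[1])  # 0 2
```

The max_skew tolerance (50 ms here, an illustrative value) drops reference frames that have no sufficiently close match in the slower stream.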

Experimental and Benchmark Protocols

Author-provided splits (e.g., Spot: 75% train / 25% test) are recommended for reproducibility. Social compliance is quantified through demonstrator classification (74.48% accuracy vs. 50% chance), the Hausdorff distance d_H for path prediction (BC model: 0.26 m vs. move_base: 1.25 m), and human subject ratings for “social compliance” and “safety” (mean 4.39 and 4.71 for BC vs. 2.86 and 2.89 for move_base; p < 10⁻⁶, ANOVA).
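The path-prediction metric d_H is the standard symmetric Hausdorff distance; a minimal sketch for 2-D paths:

```python
import math

def directed_hausdorff(path_a, path_b):
    """Largest distance from any point on path A to its nearest point on B."""
    return max(min(math.dist(p, q) for q in path_b) for p in path_a)

def hausdorff(path_a, path_b):
    """Symmetric Hausdorff distance between two 2-D paths (lists of (x, y))."""
    return max(directed_hausdorff(path_a, path_b),
               directed_hausdorff(path_b, path_a))

demo = [(0.0, 0.0), (1.0, 0.1), (2.0, 0.0)]   # demonstrated trajectory
pred = [(0.0, 0.0), (1.0, 0.3), (2.0, 0.1)]   # policy-predicted trajectory
print(round(hausdorff(demo, pred), 2))  # 0.2
```

A lower d_H means the predicted path stays closer to the human demonstration everywhere along its length, which is why the BC model's 0.26 m compares favourably to move_base's 1.25 m.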

Typical pipelines leverage rasterized LiDAR bird’s-eye-view (BEV), odometry, and inertial/vision data, with Convolution+FC neural encoders in PyTorch. The dataset supports rapid development and evaluation of joint global and local navigation policies under imitation learning frameworks.
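The BEV rasterization step might look like the following sketch; the grid size and cell resolution are illustrative assumptions, not SCAND's actual parameters:

```python
def rasterize_bev(points, grid_size=64, cell_m=0.25):
    """Project (x, y, z) LiDAR points, robot at the origin, into a
    bird's-eye-view occupancy grid (z is dropped in this simple variant)."""
    half = grid_size * cell_m / 2.0
    grid = [[0] * grid_size for _ in range(grid_size)]
    for x, y, _z in points:
        if -half <= x < half and -half <= y < half:
            row = int((x + half) / cell_m)
            col = int((y + half) / cell_m)
            grid[row][col] = 1
    return grid

# A single LiDAR return 1 m ahead of the robot occupies exactly one cell.
bev = rasterize_bev([(1.0, 0.0, 0.3)])
print(sum(map(sum, bev)))  # 1
```

In a real pipeline the resulting grid (often with height or intensity channels rather than binary occupancy) is fed to the convolutional encoder.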

Access and Licensing

SCAND is openly available for research use (CC BY 4.0 or equivalent). Full dataset (~65 GB) and baseline split definitions are downloadable from the authors’ website (Karnan et al., 2022).

3. The SWEb SCAND Corpus

SWEb (“Scandinavian WEb”) (Norlund et al., 2024) constitutes the largest open-access Scandinavian web-crawl corpus: 3.6 TB of raw text (1.01 trillion tokens via GPT-SW3 tokenizer), with 1.2 billion documents originating from 98 Common Crawl WET snapshots (2013–2024). Language distribution (fastText scores): 48% Swedish (0.48 T tokens), 26% Danish (0.26 T), 20% Norwegian (0.20 T), 2.3% Icelandic (0.023 T), 3.7% other/unclassified.

By comparison, prior open SCAND corpora are markedly smaller:

Corpus        Total Tokens
SWEb          1.01 T
mC4 (SCAND)   ~0.10 T
OSCAR-23.01    0.01 T
HPLT-1.2       0.035 T

Pipeline: Collection, Extraction, and Filtering

SWEb’s workflow is divided into three primary stages:

  1. Content Selection: CCNet line-level deduplication on WET files, followed by fastText language identification; documents are retained if any Scandinavian language scores > 0.2.
  2. Content Extraction: HTML→Markdown via Pandoc; model-based line extraction (Longformer-base-16384, local attention window=256, [SEP]-annotated, F1=87) trained on 1,380 annotated pages, yielding content with low complexity and high relevance, avoiding rule-based HTML heuristics.
  3. Quality Filtering: Documents must satisfy a minimum length ≥ 100 characters, alphanumeric ratio ≥ 0.4, heading-to-body word ratio ≤ 0.05, and unigram entropy ≥ 3.0. MinHashLSH deduplication (16-codepoint shingles, 112 hashes) is done intra-snapshot; PII (emails, public IP addresses) is masked via regex.
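The four quality thresholds can be sketched as below. This is a minimal approximation: heading detection here assumes Markdown '#' headings, and the entropy uses the natural log; the paper's exact definitions and entropy base may differ.

```python
import math
import re
from collections import Counter

def passes_quality_filters(markdown_doc: str) -> bool:
    """Apply four SWEb-style thresholds to an extracted Markdown document."""
    text = markdown_doc.strip()
    if len(text) < 100:                                      # minimum length
        return False
    if sum(ch.isalnum() for ch in text) / len(text) < 0.4:   # alphanumeric ratio
        return False
    words = re.findall(r"\w+", text)
    headings = re.findall(r"^#+\s", text, flags=re.MULTILINE)
    if not words or len(headings) / len(words) > 0.05:       # headings per word
        return False
    counts = Counter(w.lower() for w in words)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy >= 3.0                                    # unigram entropy

varied = " ".join(f"word{i} something distinct here{i}" for i in range(40))
print(passes_quality_filters(varied))        # True: long, varied text
print(passes_quality_filters("spam " * 50))  # False: near-zero entropy
```

The entropy check rejects boilerplate-heavy pages (menus, repeated templates) whose word distribution is too concentrated, while the heading ratio rejects pages that are mostly navigation headings.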

Only four numeric thresholds control filtering—significantly simpler than previous pipelines. The resulting corpus is >99% Scandinavian post-extraction.

Benchmarking and Evaluation

SWEb introduces HP-MEK, a cloze-style Swedish benchmark derived from the Högskoleprovet MEK test (460 prompts, each with four candidate completions). In a controlled comparison with FineWeb, LLaMA models (1.82B parameters) trained on matched token budgets yield:

  • Lower perplexity on the SWEb test set for SWEb-trained models than for FineWeb-trained models
  • Comparable or improved transferability and accuracy (on-par HP-MEK performance), with the SWEb pipeline yielding 60% more tokens than FineWeb's
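A multiple-choice cloze benchmark like HP-MEK is commonly scored by picking the completion the model assigns the highest likelihood. The sketch below uses a toy stand-in scorer and is an assumption about the general scheme, not the paper's exact protocol:

```python
def pick_completion(prompt: str, candidates, loglik) -> int:
    """Return the index of the completion the model scores highest; loglik is
    a stand-in for a language model's log-likelihood of the full text."""
    scores = [loglik(prompt.replace("____", cand)) for cand in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)

# Toy scorer favouring one fluent phrase; real usage would sum token
# log-probabilities from a trained LM such as GPT-SW3.
def toy_loglik(text: str) -> float:
    return text.count("blå himmel") - 0.01 * len(text)

prompt = "Ovanför bergen syntes en ____."
cands = ["blå himmel", "grön himmel", "blå bil", "tyst gata"]
print(pick_completion(prompt, cands, toy_loglik))  # 0
```

Accuracy is then the fraction of the 460 prompts where the model's top-scored completion matches the answer key, against a 25% random baseline.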

Access, Licensing, and Integration

SWEb is fully open (CC0 compilation license) and released via HuggingFace; extraction code and checkpoints are available (GitHub, HuggingFace). Integration guidelines stress deduplication (MinHash), consistent tokenization (e.g., GPT-SW3), and optional harmonization of filtering criteria.

4. Comparative Analysis and Integration

All three datasets under the moniker “SCAND corpus” demonstrate scalable, reproducible, and modular collection and filtering practices, but target markedly different research domains:

  • The Nordic Pile’s SCAND corpus focuses on maximizing high-quality text diversity for LLMs in North Germanic languages, employing extensive web/print/corpus sources and strict automatic filtering, but is not openly distributed due to legal constraints (Öhman et al., 2023).
  • The SWEb SCAND corpus, with an order-of-magnitude more tokens, uses a streamlined, model-based extraction architecture well suited for next-generation LLM pretraining. Its process is fully open and designed for perpetual extension and cross-corpus harmonization (Norlund et al., 2024).
  • The Socially CompliAnt Navigation Dataset addresses social behavior modeling in robotics, delivering rich, multimodal trajectories and a strong imitation learning baseline, with direct reproducibility and open licensing (Karnan et al., 2022).

A plausible implication is that SWEb and The Nordic Pile can be combined (after deduplication and harmonization) to form extended pretraining mixes for LLM research, whereas the robotics-focused SCAND dataset is orthogonal and unrelated in content—but demonstrates the widespread adoption of the SCAND designation across modalities.

5. Applications and Future Prospects

  • NLP and LLM Pretraining: SWEb and The Nordic Pile SCAND corpora underpin Swedish, Danish, Norwegian, and Icelandic LLM scaling, closing a key resource gap for North Germanic and Nordic NLP.
  • Transfer Learning and Benchmarking: HP-MEK (SWEb) introduces the MEK-based cloze task for Swedish model validation, enabling fine-grained downstream evaluation. No downstream benchmarks are reported in The Nordic Pile (Öhman et al., 2023), underscoring the need for further open benchmarking effort.
  • Robotic Social Navigation: The SCAND navigation corpus provides a rigorous platform for training and benchmarking robot policies on real-world, multi-agent social compliance, with validated improvement over rule-based navigation.
  • Pipeline Reproducibility: Both SCAND text corpora encourage harmonization of deduplication/tokenization protocols to facilitate cross-corpus consistency, while SCAND navigation supports modular PyTorch pipelines and standardized train/test splits.

A plausible future direction is increased cross-linguistic and cross-modal fusion—combining rich web-scale text (SWEb, Nordic Pile SCAND) with dialog and perception-coupled datasets (robotics SCAND)—for more sophisticated language and embodied AI research.

6. Licensing, Usage, and Ethical Considerations

  • SWEb: Openly licensed (CC0) for structure and compilation, but downstream usage rights for web content remain the user’s responsibility. Notice/takedown channels are specified (Norlund et al., 2024).
  • The Nordic Pile SCAND: Pipeline and code are open, but no redistribution of processed data (GDPR constraints); external researchers must independently replicate the pipeline (Öhman et al., 2023).
  • SCAND Navigation: Open research license (CC BY 4.0); all usage is subject to standard research-exempt conditions (Karnan et al., 2022).

Neither text corpus (The Nordic Pile, SWEb) includes human-in-the-loop data cleaning; all language filtering and deduplication are fully automatic. Ethical usage, particularly regarding personally identifiable information (PII) and compliance with copyright and privacy law, is an explicit consideration in release protocols.
