SWEb SCAND Corpus
- SWEb SCAND Corpus is a comprehensive dataset for Scandinavian languages, containing over 1 trillion tokens and 1.2 billion documents.
- It employs a robust four-stage pipeline—content selection, extraction with a fine-tuned Longformer, quality filtering, and deduplication—to ensure high-quality data.
- Evaluation using the HP-MEK benchmark and perplexity comparisons demonstrates its effectiveness in enhancing language models while supporting reproducible research.
The SWEb SCAND Corpus is the largest open pretraining dataset for Scandinavian (North Germanic) languages, encompassing over one trillion tokens of web text in Swedish, Danish, Norwegian, and Icelandic. Developed to support large-scale LLM training for these languages, SWEb establishes new benchmarks in dataset scale, multilingual coverage, and automated extraction methodology. Its publicly available resources, including the HP-MEK benchmark and Longformer-based content extractor, facilitate reproducibility and independent research on Scandinavian LLMs (Norlund et al., 2024).
1. Corpus Statistics and Language Distribution
SWEb contains approximately 1.2 billion documents and 1.01 trillion tokens, as measured by the GPT-SW3 tokenizer. The distribution of documents was established via fastText document-level language identification. The breakdown is as follows:
| Language | Document Fraction (%) | Estimated #Docs |
|---|---|---|
| Swedish | 48 | ~576M |
| Danish | 26 | ~312M |
| Norwegian | 20 | ~240M |
| Icelandic | 2.3 | ~28M |
This scale, exceeding one trillion tokens, makes SWEb the largest dataset of its kind for the Scandinavian languages.
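The estimated per-language document counts follow directly from the reported totals; a quick back-of-envelope check in Python (the remaining ~3.7% of documents is not broken down in the source):

```python
TOTAL_DOCS = 1.2e9  # reported corpus size in documents

# Document fractions per language, as reported above.
fractions = {"Swedish": 0.48, "Danish": 0.26, "Norwegian": 0.20, "Icelandic": 0.023}

# Estimated document counts, rounded to whole documents.
estimates = {lang: round(TOTAL_DOCS * frac) for lang, frac in fractions.items()}
for lang, n in sorted(estimates.items(), key=lambda kv: -kv[1]):
    print(f"{lang}: ~{n / 1e6:.0f}M documents")
```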
2. Data Collection and Filtering Pipeline
The SWEb pipeline comprises four main stages: content selection, extraction and formatting, quality filtering, and deduplication.
2.1 Content Selection:
The corpus draws on 98 Common Crawl WET snapshots, from 2013-20 through 2024-26. The CCNet framework performs line-level deduplication, and a document is selected when fastText assigns a language score above 0.2 to Swedish, Danish, Norwegian, or Icelandic.
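The selection rule can be sketched as follows. The `predict` callable stands in for a fastText language-identification model; the commented `lid.176.bin` usage shows how a real model would plug in (that file name is Facebook's public LID model, an assumption here, not part of the SWEb release):

```python
TARGET_LANGS = {"sv", "da", "no", "is"}  # Swedish, Danish, Norwegian, Icelandic
THRESHOLD = 0.2                          # language-score cutoff used by SWEb

def keep_document(text, predict):
    """predict(text) -> (language_code, score). Keep the document when the
    top predicted language is a target language scoring above the threshold."""
    lang, score = predict(text)
    return lang in TARGET_LANGS and score > THRESHOLD

# Hooking up a real fastText LID model would look roughly like this:
#   import fasttext
#   model = fasttext.load_model("lid.176.bin")
#   def ft_predict(text):
#       labels, scores = model.predict(text.replace("\n", " "))
#       return labels[0].removeprefix("__label__"), float(scores[0])
```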
2.2 Content Extraction & Formatting:
Raw HTML files are re-downloaded from the corresponding WARC archives and converted to markdown with Pandoc. A model-based extractor, a fine-tuned Longformer encoder (16k-token context, attention window 256), then performs line-level filtering. Trained on 1,380 manually annotated pages, it scores each line embedding with a sigmoid head and keeps a line when its predicted probability exceeds a fixed inference threshold. Final text normalization is performed with ftfy.
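A minimal sketch of the line-level decision rule: `line_logits` stands in for the per-line scores produced by the fine-tuned Longformer, and `tau` for the inference threshold (whose exact value is not given here):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def select_lines(lines, line_logits, tau):
    """Keep line i when sigmoid(logit_i) > tau; dropped lines are
    boilerplate, navigation, ads, etc. according to the extractor."""
    return [line for line, z in zip(lines, line_logits) if sigmoid(z) > tau]
```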
2.3 Quality Filtering:
Only four minimal filters are used:
- Minimum length: 100 characters
- Minimum ratio of alphanumeric characters
- Maximum number of headings per non-heading word
- Minimum unigram entropy
This minimal set replaces 30+ rule-based heuristics previously used in pipelines such as FineWeb.
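The four filters might be implemented roughly as below. Only the 100-character minimum comes from the text; every other threshold is an illustrative placeholder, not a published value:

```python
import math
import re
from collections import Counter

HEADING = re.compile(r"^#{1,6}\s")  # markdown heading lines

def unigram_entropy(words):
    """Shannon entropy (in nats) of the word distribution."""
    counts = Counter(words)
    n = len(words)
    return -sum(c / n * math.log(c / n) for c in counts.values())

def passes_filters(markdown):
    if len(markdown) < 100:                      # minimum length (from the text)
        return False
    alnum = sum(ch.isalnum() for ch in markdown) / max(len(markdown), 1)
    if alnum < 0.4:                              # placeholder threshold
        return False
    lines = markdown.splitlines()
    headings = sum(bool(HEADING.match(line)) for line in lines)
    non_heading_words = sum(len(line.split()) for line in lines
                            if not HEADING.match(line))
    if headings / max(non_heading_words, 1) > 0.05:  # placeholder threshold
        return False
    if unigram_entropy(markdown.split()) < 1.5:  # placeholder threshold
        return False
    return True
```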
2.4 Deduplication and PII Removal:
MinHashLSH is applied for near-duplicate removal, using code-point-level 16-shingles with 14 bands × 8 hashes, deduplicating within each snapshot. Regex-based masking removes email addresses and public IPs.
3. Content Extraction: Model-Based vs. Rule-Based Approaches
SWEb replaces rule-based HTML text extraction (e.g., Trafilatura) with a line-level, transformer-based model. Unlike rule-based approaches, the model-based extractor learns to distinguish main content from boilerplate, advertisements, and navigation without hardcoded DOM heuristics. Empirical evaluation on two Common Crawl snapshots (2024-10 and 2024-18) showed:
| Exp. Dataset | #Docs | #Tokens | Tokens/Doc |
|---|---|---|---|
| SWEb | 32.3M | 25.2B | 779.7 |
| FineWeb | 19.2M | 15.8B | 820.3 |
On this identical input, SWEb yielded roughly 68% more documents and 60% more tokens than FineWeb, indicating substantially improved content recall. FineWeb's slightly higher tokens-per-document figure is attributed to its plain-text output, as opposed to SWEb's markdown tokens (Norlund et al., 2024).
4. Corpus Structure and Licensing
SWEb is distributed as approximately 3.6 TB of UTF-8 encoded markdown text. Data is sharded by Common Crawl snapshot (week identifier), and each record includes the following metadata fields:
- url: source webpage URL
- warc_path: WARC file location
- warc_date: snapshot date
- text: extracted markdown
- lang: detected language code
The corpus structure is released under CC0 ("no rights reserved"). Users must ensure compliance with rights underlying the web content; a notice and takedown policy is in place ([email protected]).
5. Evaluation Benchmark: HP-MEK
HP-MEK is a Swedish cloze-style evaluation suite derived from the MEK (sentence-completion) section of the Swedish Scholastic Aptitude Test (Högskoleprovet). Its 460 items each mask a phrase in a passage and offer four candidate completions, probing vocabulary choice, local syntax, and semantic coherence. For each item, the passage is instantiated with each option and scored by the LLM's log probability; the option with the highest log probability (argmax scoring) is taken as the model's answer. HP-MEK is designed as a fast, early-signal benchmark suitable for small-scale experiments.
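The scoring procedure reduces to an argmax over option log probabilities; in the sketch below, `logprob` stands in for an LLM scoring function that returns the summed token log probability of a text:

```python
def score_item(passage_template, options, logprob):
    """Rank the cloze options by the total log probability of the
    filled-in passage and return the model's chosen answer."""
    scores = [logprob(passage_template.format(option=o)) for o in options]
    return options[max(range(len(options)), key=scores.__getitem__)]
```

With a real model, `logprob` would run the LLM over the instantiated passage and sum per-token log probabilities.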
6. Experimental Results and Comparison to Prior Corpora
Evaluation used 1.82B-parameter Llama-style models with the GPT-SW3 tokenizer, trained with 2M-token batches and cosine learning-rate decay for 10,811 steps (one epoch on the SWEb subset, 1.6 epochs on FineWeb).
Perplexity Cross-Evaluation:
A model trained on SWEb (M_SW) achieved lower perplexity on both the SWEb and FineWeb held-out sets than a FineWeb-trained model (M_FW). Part of this gap may stem from SWEb's markdown formatting, which can make next-token prediction easier.
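Perplexity here is the exponential of the mean negative log-likelihood per token; a minimal helper:

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the mean per-token negative log-likelihood."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```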
HP-MEK Performance:
Accuracy on HP-MEK closely matched between M_SW and M_FW throughout training, with a negligible final difference. SWEb thus delivers downstream performance comparable to FineWeb, with far fewer manual filters and greater content recall.
7. Availability and Reproducibility
All data, benchmarks, and code are openly available:
- SWEb dataset: https://huggingface.co/datasets/AI-Sweden-Models/SWEb
- HP-MEK benchmark: https://huggingface.co/datasets/AI-Sweden-Models/HP-MEK
- Pipeline and extractor model: https://github.com/aidotse/SWEb
- Pretrained Longformer extractor: https://huggingface.co/severinsimmler/xlm-roberta-longformer-base-16384
All artifacts are released under open licenses to facilitate transparent and reproducible research on Scandinavian language modeling (Norlund et al., 2024).