SWEb SCAND Corpus
- SWEb SCAND Corpus is a comprehensive dataset for Scandinavian languages, containing over 1 trillion tokens and 1.2 billion documents.
- It employs a robust four-stage pipeline—content selection, extraction with a fine-tuned Longformer, quality filtering, and deduplication—to ensure high-quality data.
- Evaluation using the HP-MEK benchmark and perplexity comparisons demonstrates its effectiveness in enhancing language models while supporting reproducible research.
The SWEb SCAND Corpus is the largest open pretraining dataset for Scandinavian (North Germanic) languages, encompassing over one trillion tokens of web text in Swedish, Danish, Norwegian, and Icelandic. Developed to support large-scale LLM training for these languages, SWEb establishes new benchmarks in dataset scale, multilingual coverage, and automated extraction methodology. Its publicly available resources, including the HP-MEK benchmark and Longformer-based content extractor, facilitate reproducibility and independent research on Scandinavian LLMs (Norlund et al., 2024).
1. Corpus Statistics and Language Distribution
SWEb contains approximately 1.2 billion documents and 1.01 trillion tokens, as measured by the GPT-SW3 tokenizer. The distribution of documents was established via fastText document-level language identification. The breakdown is as follows:
| Language | Document Fraction (%) | Estimated #Docs |
|---|---|---|
| Swedish | 48 | ~576M |
| Danish | 26 | ~312M |
| Norwegian | 20 | ~240M |
| Icelandic | 2.3 | ~28M |
This scale, exceeding one trillion tokens, makes SWEb the largest dataset of its kind for the Scandinavian languages.
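The estimated per-language document counts follow directly from the reported totals; a quick back-of-envelope check in Python (the remaining ~3.7% of documents is not broken down in the source):

```python
TOTAL_DOCS = 1.2e9  # reported corpus size in documents

# Document fractions per language, as reported above.
fractions = {"Swedish": 0.48, "Danish": 0.26, "Norwegian": 0.20, "Icelandic": 0.023}

# Estimated document counts, rounded to whole documents.
estimates = {lang: round(TOTAL_DOCS * frac) for lang, frac in fractions.items()}
for lang, n in sorted(estimates.items(), key=lambda kv: -kv[1]):
    print(f"{lang}: ~{n / 1e6:.0f}M documents")
```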
2. Data Collection and Filtering Pipeline
The SWEb pipeline comprises four main stages: content selection, extraction and formatting, quality filtering, and deduplication.
2.1 Content Selection:
The corpus draws on 98 Common Crawl WET snapshots, from 2013-20 through 2024-26. The CCNet framework performs line-level deduplication, and a document is selected when fastText assigns a language score above 0.2 to Swedish, Danish, Norwegian, or Icelandic.
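The selection rule can be sketched as follows. The `predict` callable stands in for a fastText language-identification model; the commented `lid.176.bin` usage shows how a real model would plug in (that file name is Facebook's public LID model, an assumption here, not part of the SWEb release):

```python
TARGET_LANGS = {"sv", "da", "no", "is"}  # Swedish, Danish, Norwegian, Icelandic
THRESHOLD = 0.2                          # language-score cutoff used by SWEb

def keep_document(text, predict):
    """predict(text) -> (language_code, score). Keep the document when the
    top predicted language is a target language scoring above the threshold."""
    lang, score = predict(text)
    return lang in TARGET_LANGS and score > THRESHOLD

# Hooking up a real fastText LID model would look roughly like this:
#   import fasttext
#   model = fasttext.load_model("lid.176.bin")
#   def ft_predict(text):
#       labels, scores = model.predict(text.replace("\n", " "))
#       return labels[0].removeprefix("__label__"), float(scores[0])
```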
2.2 Content Extraction & Formatting:
Raw HTML files are re-downloaded from the corresponding WARC archives and converted to markdown with Pandoc. A model-based extractor, a fine-tuned Longformer encoder (16k-token context, attention window 256), then performs line-level filtering. Trained on 1,380 manually annotated pages, it scores each line embedding with a sigmoid head and keeps a line when its predicted probability exceeds a fixed inference threshold. Final text normalization is performed with ftfy.
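A minimal sketch of the line-level decision rule: `line_logits` stands in for the per-line scores produced by the fine-tuned Longformer, and `tau` for the inference threshold (whose exact value is not given here):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def select_lines(lines, line_logits, tau):
    """Keep line i when sigmoid(logit_i) > tau; dropped lines are
    boilerplate, navigation, ads, etc. according to the extractor."""
    return [line for line, z in zip(lines, line_logits) if sigmoid(z) > tau]
```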
2.3 Quality Filtering:
Only four minimal filters are used:
- Minimum length: 100 characters
- Minimum ratio of alphanumeric characters
- Maximum number of headings per non-heading word
- Minimum unigram entropy
This minimal set replaces 30+ rule-based heuristics previously used in pipelines such as FineWeb.
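The four filters might be implemented roughly as below. Only the 100-character minimum comes from the text; every other threshold is an illustrative placeholder, not a published value:

```python
import math
import re
from collections import Counter

HEADING = re.compile(r"^#{1,6}\s")  # markdown heading lines

def unigram_entropy(words):
    """Shannon entropy (in nats) of the word distribution."""
    counts = Counter(words)
    n = len(words)
    return -sum(c / n * math.log(c / n) for c in counts.values())

def passes_filters(markdown):
    if len(markdown) < 100:                      # minimum length (from the text)
        return False
    alnum = sum(ch.isalnum() for ch in markdown) / max(len(markdown), 1)
    if alnum < 0.4:                              # placeholder threshold
        return False
    lines = markdown.splitlines()
    headings = sum(bool(HEADING.match(line)) for line in lines)
    non_heading_words = sum(len(line.split()) for line in lines
                            if not HEADING.match(line))
    if headings / max(non_heading_words, 1) > 0.05:  # placeholder threshold
        return False
    if unigram_entropy(markdown.split()) < 1.5:  # placeholder threshold
        return False
    return True
```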
2.4 Deduplication and PII Removal:
MinHashLSH is applied for near-duplicate removal, using code-point-level 16-shingles with 14 bands × 8 hashes, deduplicating within each snapshot. Regex-based masking removes email addresses and public IPs.
3. Content Extraction: Model-Based vs. Rule-Based Approaches
SWEb replaces rule-based HTML text extraction (e.g., Trafilatura) with a line-level, transformer-based model. Unlike rule-based approaches, the model-based extractor learns to distinguish main content from boilerplate, advertisements, and navigation without hardcoded DOM heuristics. Empirical evaluation on two Common Crawl snapshots (2024-10 and 2024-18) showed:
| Exp. Dataset | #Docs | #Tokens | Tokens/Doc |
|---|---|---|---|
| SWEb | 32.3M | 25.2B | 779.7 |
| FineWeb | 19.2M | 15.8B | 820.3 |
On this identical input, SWEb yielded roughly 68% more documents and 60% more tokens than FineWeb, indicating substantially improved content recall. FineWeb's slightly higher tokens-per-document figure is attributed to its plain-text output, as opposed to SWEb's markdown tokens (Norlund et al., 2024).
4. Corpus Structure and Licensing
SWEb is distributed as approximately 3.6 TB of UTF-8 encoded markdown text. Data is sharded by Common Crawl snapshot (week identifier), and each record includes the following metadata fields:
- url: source webpage URL
- warc_path: WARC file location
- warc_date: snapshot date
- text: extracted markdown
- lang: detected language code
The corpus structure is released under CC0 ("no rights reserved"). Users must ensure compliance with rights underlying the web content; a notice and takedown policy is in place ([email protected]).
5. Evaluation Benchmark: HP-MEK
HP-MEK is a Swedish cloze-style evaluation suite derived from the MEK (sentence-completion) section of the Swedish Scholastic Aptitude Test (Högskoleprovet). Its 460 items each mask a phrase in a passage and offer four candidate completions, probing vocabulary choice, local syntax, and semantic coherence. For each item, the passage is instantiated with each option and scored by the LLM's log probability; the option with the highest log probability (argmax scoring) is taken as the model's answer. HP-MEK is designed as a fast, early-signal benchmark suitable for small-scale experiments.
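The scoring procedure reduces to an argmax over option log probabilities; in the sketch below, `logprob` stands in for an LLM scoring function that returns the summed token log probability of a text:

```python
def score_item(passage_template, options, logprob):
    """Rank the cloze options by the total log probability of the
    filled-in passage and return the model's chosen answer."""
    scores = [logprob(passage_template.format(option=o)) for o in options]
    return options[max(range(len(options)), key=scores.__getitem__)]
```

With a real model, `logprob` would run the LLM over the instantiated passage and sum per-token log probabilities.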
6. Experimental Results and Comparison to Prior Corpora
Evaluation used 1.82B-parameter Llama-style models with the GPT-SW3 tokenizer, trained with 2M-token batches and cosine learning-rate decay for 10,811 steps (one epoch on the SWEb subset, 1.6 epochs on FineWeb).
Perplexity Cross-Evaluation:
A model trained on SWEb (M_SW) achieved lower perplexity on both the SWEb and FineWeb held-out sets than a FineWeb-trained model (M_FW). Part of this gap may stem from SWEb's markdown formatting, which can make next-token prediction easier.
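Perplexity here is the exponential of the mean negative log-likelihood per token; a minimal helper:

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the mean per-token negative log-likelihood."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```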
HP-MEK Performance:
Accuracy on HP-MEK closely matched between M_SW and M_FW throughout training, with a negligible final difference. SWEb thus delivers downstream performance comparable to FineWeb, with far fewer manual filters and greater content recall.
7. Availability and Reproducibility
All data, benchmarks, and code are openly available:
- SWEb dataset: https://huggingface.co/datasets/AI-Sweden-Models/SWEb
- HP-MEK benchmark: https://huggingface.co/datasets/AI-Sweden-Models/HP-MEK
- Pipeline and extractor model: https://github.com/aidotse/SWEb
- Pretrained Longformer extractor: https://huggingface.co/severinsimmler/xlm-roberta-longformer-base-16384
All artifacts are released under open licenses to facilitate transparent and reproducible research on Scandinavian language modeling (Norlund et al., 2024).