
CLASSLA-web 2.0 Corpus Collection

Updated 23 January 2026
  • The paper introduces an iterative crawling and recrawling approach that yields 82% novel content with significant corpus growth.
  • The methodology leverages targeted ccTLD selection, host-level politeness, and a multi-tool language identification pipeline to ensure precise language classification.
  • The results show a 57% word count increase and robust quality control measures to filter out AI-generated and low-quality texts.

The CLASSLA-web 2.0 Corpus Collection is the largest systematically assembled general-purpose web corpus resource for South Slavic and related languages, comprising 17.0 billion words and 38.1 million texts in Bosnian, Bulgarian, Croatian, Macedonian, Montenegrin, Serbian, and Slovenian. Built via an infrastructure for continuous, iterative crawling of national and related top-level domains (TLDs), CLASSLA-web 2.0 advances the empirical and methodological state of web corpus construction for less-resourced languages and highlights emergent challenges in large-scale web data acquisition (Pungeršek et al., 16 Jan 2026).

1. Crawling Architecture and Corpus Design

CLASSLA-web 2.0 relies on iterative, national ccTLD-based crawling, supplemented by generic TLDs hosting language-relevant content. The crawling process employs a structured, multi-phase workflow:

  • Domain selection: Targeted ccTLDs—.hr, .si, .bg, .mk, .bs, .sr, .cnr—augmented with non-country TLDs (.com, .org, .net) serving national-language materials. High-priority domain seeds (e.g., major news sites, government portals) anchor the initial frontier.
  • URL frontier management: A scalable crawler (e.g., SpiderLing or MaCoCu) maintains a priority queue of URLs. The queue is partitioned per host to enforce politeness and is seeded by both explicit domain lists and links discovered in prior fetches.
  • Host-level politeness: The system honors robots.txt and crawl-delay directives, enforces a per-host delay (Δ ≥ 1 s), limits concurrent connections (1–2 threads per host), and globally caps fetch rates (≤ 100 pages/minute).
  • Batching and recrawling: The frontier is batch-processed: high-value domains receive extended quotas and are recrawled entirely each iteration, while the long tail is explored incrementally. Batch and crawl states are persisted to ensure resumption across cycles.
  • Post-crawl pipeline: Each iteration applies boilerplate removal (jusText), near-duplicate detection (Onion, MinHash), document/paragraph-level language identification (CLD2, trigram classifiers, HBS classifier), character-encoding repair, length filtering, and manual review of high-volume domains.
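
The host-level politeness rule above can be sketched as a small throttle that spaces out consecutive requests to the same host. This is an illustrative sketch, not the paper's implementation; the class and method names are hypothetical.

```python
import time
from collections import defaultdict

class HostThrottle:
    """Enforce a minimum delay between fetches to the same host
    (sketch of the per-host politeness rule; names are illustrative)."""

    def __init__(self, min_delay=1.0):
        self.min_delay = min_delay           # Δ ≥ 1 s per host
        self.last_fetch = defaultdict(float) # host -> last fetch time

    def wait(self, host):
        """Block just long enough that consecutive requests to `host`
        are at least `min_delay` seconds apart."""
        elapsed = time.monotonic() - self.last_fetch[host]
        if elapsed < self.min_delay:
            time.sleep(self.min_delay - elapsed)
        self.last_fetch[host] = time.monotonic()
```

In a real crawler this throttle would sit in front of the fetcher, keyed by the host extracted from each frontier URL, with the global rate cap enforced separately.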

2. Corpus Metrics and Formal Definitions

Key corpus metrics and overlap statistics underpin the evaluation and justification of repeated crawling cycles. For corpus iteration $C_i$:

  • $W(C_i)$: total words,
  • $T(C_i)$: total texts,
  • $|C_i| = T(C_i)$: corpus size measured in texts.

Overlap ratio between two iterations quantifies reused content:

$$\mathrm{Overlap}(C_1, C_2) = \frac{\lvert C_1 \cap C_2 \rvert}{\lvert C_2 \rvert} \times 100\%$$

Gain rate (growth factor):

$$\mathrm{Gain}(C_1 \to C_2) = \frac{\lvert C_2 \rvert - \lvert C_1 \cap C_2 \rvert}{\lvert C_1 \cup C_2 \rvert} \times 100\%$$

For CLASSLA-web 2.0, $C_2$ contains 17.0 billion words and 38.1 million texts; overlap with CLASSLA-web 1.0 is approximately 18%, demonstrating high content turnover and corpus novelty (Pungeršek et al., 16 Jan 2026).
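
Given text identifiers for two iterations (e.g., deduplicated URLs or text hashes), the two metrics are straightforward set operations. A minimal sketch, assuming identifiers are held as Python sets:

```python
def overlap_pct(c1_ids, c2_ids):
    """Overlap(C1, C2): share of C2 texts already present in C1, in %."""
    return 100.0 * len(c1_ids & c2_ids) / len(c2_ids)

def gain_pct(c1_ids, c2_ids):
    """Gain(C1 -> C2): texts new in C2, relative to the union, in %."""
    return 100.0 * (len(c2_ids) - len(c1_ids & c2_ids)) / len(c1_ids | c2_ids)

# Toy example: 2 of C2's 5 texts were already in C1.
c1 = {"u1", "u2", "u3", "u4"}
c2 = {"u3", "u4", "u5", "u6", "u7"}
print(overlap_pct(c1, c2))  # 40.0
```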

3. Impact of Iterative Recrawling

Empirical data from CLASSLA-web 1.0 (2021–22) and 2.0 (2024) demonstrate the effect of two-year recrawl intervals:

  • Text volume increased by ~46%, word count by ~57%.
  • Only 18% of CLASSLA-web 2.0 texts overlap with the previous release, indicating 82% of material is novel.
  • The combined unique size over both iterations exceeds 57 million texts.

This suggests that biennial recrawling can more than double the corpus's unique content volume within four years, a critical property for keeping web corpora relevant in rapidly changing linguistic and content landscapes (Pungeršek et al., 16 Jan 2026).

4. Content Quality: The Problem of Generated and Low-Quality Texts

Manual curation of the 250 most prolific domains per corpus revealed substantial growth in problematic sources: ≈15% of CLASSLA-web 2.0 texts are traced to “bad” domains (sites exhibiting AI-generated content or SEO-oriented junk pages), a 15× increase over 1.0. Corpus quality assurance relies on:

  • Heuristic filtering (jusText for boilerplate stripping; Onion/MinHash for near-duplicate pruning; minimum-length filtering, e.g., >75 words).
  • Explicit manual diagnosis and blacklisting of machine-generated or machine-translated domains responsible for ≥0.1% of corpus content.
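
The two cheapest of these safeguards, the minimum-length filter and the domain blacklist, can be combined into a single gate applied before a text enters the corpus. A minimal sketch; the domain names are hypothetical placeholders, and the real pipeline additionally runs jusText and Onion/MinHash upstream:

```python
# Hypothetical blacklist; in practice populated by manual diagnosis
# of domains contributing >= 0.1% of corpus content.
BAD_DOMAINS = {"spam-mill.example", "mt-junk.example"}

MIN_WORDS = 75  # minimum-length threshold from the pipeline (> 75 words)

def keep_document(text, domain):
    """Drop texts from blacklisted domains or below the length floor."""
    if domain in BAD_DOMAINS:
        return False
    return len(text.split()) > MIN_WORDS
```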

A plausible implication is that, without rigorous filtering and curation, the usefulness of large web corpora in linguistic research could be compromised by widespread diffusion of synthetic or low-impact textual material (Pungeršek et al., 16 Jan 2026).

5. Language Identification and Annotation Pipeline

To accommodate the fine-grained linguistic reality of related South Slavic languages and dialects, CLASSLA-web 2.0 integrates a multi-tool language identification routine:

  • Document and paragraph classification uses the CLD2 toolkit, a trigram-based classifier, and a specialized classifier for HBS (Bosnian/Croatian/Serbian/Montenegrin) variants.
  • The output supports genre and topic annotation, facilitating both linguistic and thematic studies.

Multi-tool identification is especially necessary to mitigate the risk of misclassification among closely related language varieties and to maximize the corpus’s value for high-resolution language modeling (Pungeršek et al., 16 Jan 2026).
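
One simple way to combine several classifiers, used here purely as an illustration, is per-paragraph majority voting. The paper does not specify its exact combination logic, so the voting rule below is a simplifying assumption; in the real pipeline the callables would wrap CLD2, the trigram classifier, and the HBS variant classifier.

```python
from collections import Counter

def vote_language(paragraph, classifiers):
    """Combine per-paragraph language guesses by strict majority vote.

    `classifiers` is a list of callables text -> language code.
    Returns 'und' (undetermined) when no label wins a majority.
    """
    votes = Counter(clf(paragraph) for clf in classifiers)
    label, count = votes.most_common(1)[0]
    return label if count > len(classifiers) // 2 else "und"
```

Paragraph-level decisions can then be aggregated to a document-level label, which is where a dedicated HBS classifier earns its keep: generic tools often cannot separate Bosnian, Croatian, Montenegrin, and Serbian reliably.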

6. Best Practices and Recommendations

CLASSLA-web 2.0 formalizes several operational principles:

  • Automate crawling and processing pipelines, but retain manual oversight for top-contributing domains to maintain corpus quality.
  • Use multi-phase language identification for fine distinctions among related languages.
  • Track overlap statistics at the URL and text level (e.g., MinHash-based text overlap, fast URL overlap as a proxy).
  • Monitor real-time health metrics: crawl success, average document length, boilerplate share, domain pollution rates.
  • Version each crawl iteration, document growth (ΔW, ΔT), and explicitly compute Overlap and Gain metrics for transparency and reproducibility.
  • Schedule recrawls every 12–24 months to balance content freshness against data engineering effort and computational cost; this frequency is empirically associated with ≈80% content renewal in the domain (Pungeršek et al., 16 Jan 2026).
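
The MinHash-based text overlap tracking recommended above can be sketched in a few lines: each text's token set is reduced to a short signature, and the fraction of matching signature slots estimates the Jaccard similarity between two texts. This is a generic MinHash sketch under illustrative parameters, not the paper's Onion configuration.

```python
import hashlib

def minhash_signature(tokens, num_perm=64):
    """MinHash signature: for each of `num_perm` seeded hash functions,
    keep the minimum 64-bit hash over the token set."""
    return [
        min(
            int.from_bytes(
                hashlib.blake2b(t.encode(), digest_size=8,
                                salt=str(seed).encode()).digest(),
                "big")
            for t in tokens)
        for seed in range(num_perm)
    ]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching slots estimates Jaccard(token sets)."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Because signatures are small and comparable slot-by-slot, they can be stored per text across crawl iterations, making the Overlap and Gain computations feasible at corpus scale without holding full texts in memory.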

7. Connection to Broader Web Domain Gathering Techniques

The CLASSLA-web 2.0 methodology is aligned with public-source-driven ccTLD census techniques, as described in "This Is a Local Domain: On Amassing Country-Code Top-Level Domains from Public Data" (Sommese et al., 2023). Both approaches emphasize iterative domain discovery, robust deduplication (e.g., Bloom filters for domain tracking), and reliance on public data sources such as Certificate Transparency logs and Common Crawl indexes. While CLASSLA-web focuses on content acquisition and linguistic annotation, web census work offers statistical coverage proxies and discovery mechanisms for domain enumeration. For example, public CT logs and Common Crawl, when used as seeds, can cover on average ≈59% of ccTLD domains, with coverage steadily increasing, and ≈90% of these extracted domains are verified as Web-active (open port 80/443) (Sommese et al., 2023). Integrating these two paradigms—domain discovery from infrastructure logs and linguistically motivated iterative content crawling—enables the scalable construction of language-specific, representative, and up-to-date web corpora for under-resourced and evolving linguistic communities.
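
The Bloom-filter domain tracking mentioned above can be sketched as follows: a fixed bit array records every domain ever enqueued, so rediscovered domains are skipped in constant memory, at the cost of a small false-positive rate. Sizes and the hashing scheme here are illustrative, not taken from either paper.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter for remembering already-seen domains."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        # One independent position per seeded hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(item.encode(), digest_size=8,
                                     salt=bytes([seed])).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # May rarely return True for unseen items (false positive),
        # but never False for an item that was added.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

For frontier management this trades exactness for memory: millions of domains fit in a few hundred kilobytes, and the occasional false positive merely skips a domain that would likely have been low-value anyway.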
