
W{content}I: Global Multi-Modal Web Dataset

Updated 21 January 2026
  • W{content}I is a comprehensive multi-modal dataset of 49,438 globally sourced web pages featuring screenshots, HTML text, and structured quantitative metrics.
  • It enables detailed studies of web page structure, error detection via CNN models (up to 97% validation accuracy), and transfer learning for subject categorization.
  • The dataset is organized by six thematic categories and openly licensed under CC-BY 4.0, supporting reproducible research in web mining and computational web science.

The W{content}I dataset, formally described in "A Large Visual, Qualitative and Quantitative Dataset of Web Pages," is a multi-modal collection comprising visual, textual, and structured quantitative features of web pages gathered worldwide. It is engineered to enable detailed, reproducible studies of web page appearance, structure, and topical content across six core thematic categories. Recognized for its breadth and explicit design for machine learning workflows, W{content}I underpins both browser-level rendering analysis and web-mining research, with open access under CC-BY 4.0 (Mejia-Escobar et al., 2021).

1. Scope and Data Modalities

The dataset contains 49,438 distinct web pages, each represented by three data modalities:

  • Visual: Full-page screenshots (webshots) in JPG, typically at 992 × 3,560 pixels, mean file size ≈ 274 KB.
  • Textual: Both raw HTML source and script-generated extracted plain text.
  • Quantitative: Structured features including download time, HTML size, and counts of HTML elements (images, script files, CSS files, tables, iframes, style tags) as well as webshot image properties (size, width, height).

Pages are sourced worldwide, covering websites from a broad range of countries across all continents. Each is annotated with one of six high-level subject labels: Arts & Entertainment; Business & Economy; Education; Government; News & Media; Science & Environment.
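A record in this layout can be sketched as a simple Python structure (field names here are illustrative assumptions, not the dataset's actual schema):

```python
from dataclasses import dataclass

# Hypothetical per-page record mirroring the three modalities described above.
@dataclass
class WebPageRecord:
    url: str
    category: str           # one of the six subject labels below
    webshot_path: str       # full-page JPG screenshot
    html: str               # raw HTML source
    plain_text: str         # script-extracted text
    download_time_ms: float
    html_size_kb: float
    num_images: int
    num_scripts: int        # -1 marks absent data

CATEGORIES = [
    "Arts & Entertainment", "Business & Economy", "Education",
    "Government", "News & Media", "Science & Environment",
]
```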

2. Data Collection and Preprocessing Protocol

The acquisition pipeline combines automated search queries with structured directory browsing:

  1. URL Collection:
    • Searching: Google queries of the form OR(keyword1,…) site:.cc ext:html across country-specific TLDs (.es, .br, etc.), yielding the top 100 results per country/query.
    • Browsing: Scraping the Best Of The Web (BOTW) directory categorically by continent and country.
  2. Feature Extraction:
    • Each URL is fetched via HTTP GET. HTML markup is parsed using Python BeautifulSoup to derive quantitative features. Example element-count calculation:
      import requests
      from bs4 import BeautifulSoup

      resp = requests.get(url, timeout=30)   # url: the page being processed
      soup = BeautifulSoup(resp.text, 'html.parser')
      num_images = len(soup.find_all('img'))                 # <img> elements
      num_scripts = len(soup.find_all('script', src=True))   # external script files
    • Webshots are captured in R (PhantomJS via the webshot package), with file size and dimensions recorded via the magick package.
  3. Debugging:
    • Pages with exceptions (timeout, HTTP errors, missing components) are flagged with −1 in the affected feature fields.
    • A CNN-based error-page classifier is applied to purge artefacts such as domain-parked or under-construction sites.

This suggests a design optimized to minimize noise and operational failures commonly encountered with large-scale web snapshots.
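A stdlib-only sketch of the fetch-and-flag logic in steps 2–3, using Python's built-in HTMLParser as an illustrative stand-in for the BeautifulSoup extraction (the authors' actual scripts live in the repository's code/ directory); the −1 sentinel mirrors the dataset's convention for absent data:

```python
from urllib.request import urlopen
from urllib.error import URLError
from html.parser import HTMLParser

MISSING = -1  # sentinel the dataset uses for absent data

class ElementCounter(HTMLParser):
    """Counts <img> tags and external <script src=...> tags."""
    def __init__(self):
        super().__init__()
        self.images = 0
        self.scripts = 0

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.images += 1
        elif tag == "script" and dict(attrs).get("src"):
            self.scripts += 1

def extract_features(url, timeout=30):
    """Fetch a page and count elements; flag any failure with -1."""
    try:
        html = urlopen(url, timeout=timeout).read().decode("utf-8", "replace")
    except (URLError, TimeoutError, UnicodeError):
        return {"html_size": MISSING, "images": MISSING, "scripts": MISSING}
    counter = ElementCounter()
    counter.feed(html)
    return {"html_size": len(html),
            "images": counter.images,
            "scripts": counter.scripts}
```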

3. Dataset Structure and Statistical Properties

Data is organized by both page category and geographic region:

Category Pages (n)
Arts & Entertainment 7,752
Business & Economy 8,438
Education 7,892
Government 7,354
News & Media 10,044
Science & Environment 7,958

Key numerical features, shown with means and standard deviations:

Parameter Browsing (mean±σ) Searching (mean±σ)
URL length 31.7 ± 12.6 chars 69.8 ± 31.0 chars
Download time 18.7 ± 17.4 ms 22.3 ± 15.8 ms
HTML size 50.5 ± 44.3 KB 46.8 ± 39.6 KB
Images/page 2.1 ± 3.5 2.5 ± 4.1
Scripts/page 3.4 ± 5.1 3.0 ± 4.4
Webshot size 302 ± 200 KB 274 ± 180 KB
Webshot width 1,059 ± 389 px 1,052 ± 380 px
Webshot height 3,859 ± 4,293 px 3,560 ± 3,918 px

A plausible implication is that the dataset permits stratified statistical studies of web page complexity, global design variation, and multimedia content densities.
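Comparable summary statistics can be recomputed directly from the released datasheets; a minimal sketch using only the standard library (the column name passed in is a placeholder for whichever quantitative feature is analysed, and −1 entries are excluded as absent):

```python
import csv
import statistics

def summarize(csv_path, column):
    """Mean and sample standard deviation of one quantitative feature,
    skipping the dataset's -1 'absent data' sentinel."""
    values = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            v = float(row[column])
            if v != -1:          # ignore flagged/missing entries
                values.append(v)
    return statistics.mean(values), statistics.stdev(values)
```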

4. Experimental Use Cases and Benchmark Models

The authors document two primary supervised tasks to demonstrate dataset utility:

4.1 Error-Page Detection

Binary classification on webshots (valid vs error) using a compact CNN:

  • Train/val split: 80/20
  • Architecture: Three Conv-ReLU-MaxPool blocks, dense output
  • Results: Train accuracy 96.6%, validation 97.2%
  • Applied to large “Searching” subset: overall 92.5% accuracy
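The evaluation protocol above (80/20 split, accuracy on held-out webshots) can be sketched in plain Python; the CNN itself is omitted, and `seed` is an arbitrary choice for reproducibility:

```python
import random

def split_80_20(items, seed=0):
    """Shuffle and split a collection into train/validation sets (80/20)."""
    rng = random.Random(seed)
    items = list(items)
    rng.shuffle(items)
    cut = int(0.8 * len(items))
    return items[:cut], items[cut:]

def accuracy(pred, true):
    """Fraction of predictions matching the ground-truth labels."""
    return sum(p == t for p, t in zip(pred, true)) / len(true)
```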

4.2 Subject Categorization

Six-way classification on category-balanced webshots using transfer learning:

  • Model: ResNet-50 (ImageNet), new classifier head (GlobalAveragePool → Dense(256) → Dropout → Dense(6) → Softmax)
  • Training accuracy: 94.3%
  • Validation accuracy: 40.4% (high inter-class confusion; only three classes >50% recall)
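Assuming ResNet-50's standard 2,048-dimensional pooled feature vector, the trainable size of the new classifier head can be worked out directly (dropout and the frozen backbone contribute no new parameters; the figure below is an illustration, not a number reported in the paper):

```python
def dense_params(n_in, n_out):
    """Weight matrix plus bias vector of one fully connected layer."""
    return n_in * n_out + n_out

# GlobalAveragePool(2048) -> Dense(256) -> Dropout -> Dense(6) -> Softmax
head_params = dense_params(2048, 256) + dense_params(256, 6)  # 526,086
```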

This suggests a strong visual signature for error-page discrimination but ambiguity in topical cues from screenshots alone.

5. Data Access, Format, and Licensing

  • Repository: https://osf.io/7ghd2/
  • Contents:
    • _images/: webshots per category
    • _datasheets/: CSV tables of URLs and quantitative features
    • code/: Python and R scripts for parsing and extraction
    • docs/: methodology and README
  • Schema: features are stored as tabular CSV values or image files; −1 indicates absent data.
  • License: CC-BY 4.0; citation required (Mejia-Escobar et al., 2021).
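A minimal loader for the CSV datasheets that maps the −1 absent-data sentinel to None (a sketch; it makes no assumptions about specific column names):

```python
import csv

def load_datasheet(path):
    """Read a datasheet CSV, replacing the -1 sentinel with None."""
    rows = []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            rows.append({k: (None if v == "-1" else v)
                         for k, v in row.items()})
    return rows
```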

6. Analytical Significance and Applications

W{content}I enables:

  • Systematic studies of visual web diversity, page structure typology, and graphical complexity across nations and markets.
  • Multi-task learning frameworks exploiting cross-modal data (image, text, quantitative metrics).
  • Development of robust page error/anomaly detectors and visual classifiers with real-world noise and heterogeneity.
  • Extension to unsupervised and semi-supervised representation learning on page-level, DOM, and domain aggregates.

This suggests the dataset serves as a baseline resource for web intelligence, content classification, accessibility evaluation, and page-quality assessment in large-scale computational web science and applied ML.

7. Limitations and Future Directions

Inter-class confusion in subject categorization highlights the need for more sophisticated model architectures or integration of textual/structural features beyond screenshots. The absence of detailed textual analysis in the initial paper points to future expansions where HTML-derived text is leveraged for complementary classifiers, retrieval, or semantic structure mining. Possible future directions include topic modeling, dynamic web evolution tracking, and page accessibility assessment.

The W{content}I dataset, by integrating high-resolution screenshots and engineered structural features at global scale, provides the foundation for cross-modal page analytics and supports reproducible research in web-driven machine learning (Mejia-Escobar et al., 2021).
