W{content}I: Global Multi-Modal Web Dataset
- W{content}I is a comprehensive multi-modal dataset of 49,438 globally sourced web pages featuring screenshots, HTML text, and structured quantitative metrics.
- It enables detailed studies of web page structure, error detection via CNN models (up to 97% validation accuracy), and transfer learning for subject categorization.
- The dataset is organized by six thematic categories and openly licensed under CC-BY 4.0, supporting reproducible research in web mining and computational web science.
The W{content}I dataset, formally described in "A Large Visual, Qualitative and Quantitative Dataset of Web Pages," is a multi-modal collection comprising visual, textual, and structured quantitative features of web pages gathered worldwide. It is engineered to enable detailed, reproducible studies of web page appearance, structure, and topical content across six core thematic categories. Recognized for its breadth and explicit design for machine learning workflows, W{content}I underpins both browser-level rendering analysis and web-mining research, with open access under CC-BY 4.0 (Mejia-Escobar et al., 2021).
1. Scope and Data Modalities
The dataset contains 49,438 distinct web pages, each represented by three data modalities:
- Visual: Full-page screenshots (webshots) in JPG, typically at 992 × 3,560 pixels, mean file size ≈ 274 KB.
- Textual: Both raw HTML source and script-generated extracted plain text.
- Quantitative: Structured features including download time, HTML size, and counts of HTML elements (images, script files, CSS files, tables, iframes, style tags) as well as webshot image properties (size, width, height).
Pages are globally sourced, covering websites across countries on every continent. Each is annotated with one of six high-level subject labels: Arts & Entertainment; Business & Economy; Education; Government; News & Media; Science & Environment.
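The three modalities above can be pictured as a single per-page record. The sketch below is a hypothetical schema for illustration only; the field names are assumptions, not the dataset's actual CSV headers.

```python
from dataclasses import dataclass

# Hypothetical per-page record combining the three modalities.
# Field names are illustrative, not the repository's exact schema.
@dataclass
class PageRecord:
    url: str
    category: str          # one of the six subject labels
    webshot_path: str      # full-page JPG screenshot
    html_path: str         # raw HTML source
    download_time: float   # quantitative features; -1 marks missing data
    html_size_kb: float
    num_images: int
    num_scripts: int

rec = PageRecord("https://example.es", "Education",
                 "images/education/0001.jpg", "html/0001.html",
                 18.7, 50.5, 2, 3)
```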
2. Data Collection and Preprocessing Protocol
The data acquisition pipeline combines automated searching with structured directory browsing:
- URL Collection:
  - Searching: Google queries of the form `OR(keyword1,…) site:.cc ext:html` across country-specific TLDs (.es, .br, etc.), yielding the top 100 results per country/query.
  - Browsing: Scraping the Best Of The Web (BOTW) directory by category, continent, and country.
- Feature Extraction:
  - Each URL is fetched via HTTP GET. HTML markup is parsed using Python BeautifulSoup to derive quantitative features. Example element-count calculation:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(resp.text, 'html.parser')       # resp: HTTP GET response
num_images = len(soup.find_all('img'))               # <img> elements
num_scripts = len(soup.find_all('script', src=True)) # external scripts only
```

  - Webshots are captured with R (PhantomJS via the webshot package), recording file size and dimensions with magick.
- Debugging:
- Pages with exceptions (timeout, HTTP errors, missing components) are marked −1.
- A CNN-based error-page classifier is applied to purge artefacts such as pages from domain-parked or under-construction sites.
This suggests a design optimized to minimize noise and operational failures commonly encountered with large-scale web snapshots.
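The search phase above can be sketched as a query-template builder. This is an illustrative reconstruction: the keyword lists and TLDs below are examples, not the authors' actual inputs.

```python
# Illustrative reconstruction of the search-phase query template;
# keywords and TLDs are made-up examples, not the paper's exact inputs.
def build_query(keywords, tld):
    """Compose a Google query restricted to one country TLD and HTML pages."""
    kw = ",".join(keywords)
    return f"OR({kw}) site:.{tld} ext:html"

queries = [build_query(["school", "university"], tld) for tld in ("es", "br")]
```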
3. Dataset Structure and Statistical Properties
Data is organized by both page category and geographic region:
| Category | Pages (n) |
|---|---|
| Arts & Entertainment | 7,752 |
| Business & Economy | 8,438 |
| Education | 7,892 |
| Government | 7,354 |
| News & Media | 10,044 |
| Science & Environment | 7,958 |

The per-category counts sum to the full 49,438 pages.
Key numerical features, shown with means and standard deviations:
| Parameter | Browsing (mean±σ) | Searching (mean±σ) |
|---|---|---|
| URL length | 31.7 ± 12.6 chars | 69.8 ± 31.0 chars |
| Download time | 18.7 ± 17.4 ms | 22.3 ± 15.8 ms |
| HTML size | 50.5 ± 44.3 KB | 46.8 ± 39.6 KB |
| Images/page | 2.1 ± 3.5 | 2.5 ± 4.1 |
| Scripts/page | 3.4 ± 5.1 | 3.0 ± 4.4 |
| Webshot size | 302 ± 200 KB | 274 ± 180 KB |
| Webshot width | 1,059 ± 389 px | 1,052 ± 380 px |
| Webshot height | 3,859 ± 4,293 px | 3,560 ± 3,918 px |
A plausible implication is that the dataset permits stratified statistical studies of web page complexity, global design variation, and multimedia content densities.
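Summaries like those in the table can be recomputed directly from the datasheets. The sketch below uses the standard-library `statistics` module on made-up download times, not actual dataset measurements; the −1 filter mirrors the dataset's missing-data convention.

```python
import statistics

# Toy recomputation of a mean ± sigma summary; the input values are
# invented download times, not real dataset measurements.
def summarize(values):
    valid = [v for v in values if v != -1]   # -1 marks failed fetches
    return statistics.mean(valid), statistics.stdev(valid)

mean, sigma = summarize([12.0, 25.0, -1, 19.0])
```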
4. Experimental Use Cases and Benchmark Models
The authors document two primary supervised tasks to demonstrate dataset utility:
4.1 Error-Page Detection
Binary classification on webshots (valid vs error) using a compact CNN:
- Train/val split: 80/20
- Architecture: Three Conv-ReLU-MaxPool blocks, dense output
- Results: Train accuracy 96.6%, validation 97.2%
- Applied to large “Searching” subset: overall 92.5% accuracy
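The spatial downsampling of the three-block architecture can be checked with simple shape arithmetic. The kernel size, pooling factor, and input resolution below are assumptions for illustration, as the paper's exact hyperparameters are not reproduced here.

```python
# Shape arithmetic for a compact CNN of the kind described (three
# Conv-ReLU-MaxPool blocks). Assumes 'same'-padded convolutions, which
# preserve H x W, so each 2x2 max-pool halves both dimensions.
def conv_pool_output(h, w, blocks=3, pool=2):
    for _ in range(blocks):
        h, w = h // pool, w // pool
    return h, w

h, w = conv_pool_output(224, 224)  # e.g. webshots resized to 224 x 224
```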
4.2 Subject Categorization
Six-way classification on category-balanced webshots using transfer learning:
- Model: ResNet-50 (ImageNet), new classifier head (GlobalAveragePool → Dense(256) → Dropout → Dense(6) → Softmax)
- Training accuracy: 94.3%
- Validation accuracy: 40.4% (high inter-class confusion; only three classes >50% recall)
This suggests a strong visual signature for error-page discrimination but ambiguity in topical cues from screenshots alone.
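The size of the transfer-learning head in 4.2 can be worked out from the layer widths given above. ResNet-50's pooled feature width of 2048 is the standard value and is an assumption here, as the paper's figure is not quoted; Dropout and Softmax add no parameters.

```python
# Parameter count for the classifier head on top of ResNet-50:
# GlobalAveragePool -> Dense(256) -> Dropout -> Dense(6) -> Softmax.
# The 2048-wide pooled feature is ResNet-50's standard output (assumed).
feat = 2048
dense1 = feat * 256 + 256   # weights + biases
dense2 = 256 * 6 + 6        # weights + biases
head_params = dense1 + dense2
```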
5. Data Access, Format, and Licensing
- Repository: https://osf.io/7ghd2/
- Contents:
  - images/: webshots per category
  - datasheets/: CSV tables of URLs and quantitative features
  - code/: Python and R scripts for parsing and extraction
  - docs/: methodology and README
- Schema: quantitative features are stored in tabular CSV files and webshots as image files, with −1 indicating absent data.
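The −1 convention means consumers should filter sentinel rows before analysis. A minimal sketch with the standard-library `csv` module follows; the column names are illustrative, not the repository's exact headers.

```python
import csv
import io

# Minimal sketch of reading a datasheet and treating -1 as missing.
# Column names are illustrative, not the repository's actual headers.
sample = io.StringIO(
    "url,download_time,num_images\n"
    "a.com,18.7,2\n"
    "b.com,-1,-1\n"          # failed fetch: sentinel values
)
rows = list(csv.DictReader(sample))
valid = [r for r in rows if float(r["download_time"]) != -1]
```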
- License: CC-BY 4.0; citation required (Mejia-Escobar et al., 2021).
6. Analytical Significance and Applications
W{content}I enables:
- Systematic studies of visual web diversity, page structure typology, and graphical complexity across nations and markets.
- Multi-task learning frameworks exploiting cross-modal data (image, text, quantitative metrics).
- Development of robust page error/anomaly detectors and visual classifiers with real-world noise and heterogeneity.
- Extension to unsupervised and semi-supervised representation learning on page-level, DOM, and domain aggregates.
This suggests the dataset serves as a baseline resource for web intelligence, content classification, accessibility evaluation, and page-quality assessment in large-scale computational web science and applied ML.
7. Limitations and Future Directions
Inter-class confusion in subject categorization highlights the need for more sophisticated model architectures or integration of textual/structural features beyond screenshots. The absence of detailed textual analysis in the initial paper points to future expansions where HTML-derived text is leveraged for complementary classifiers, retrieval, or semantic structure mining. Possible future directions include topic modeling, dynamic web evolution tracking, and page accessibility assessment.
The W{content}I dataset, by integrating high-resolution screenshots and engineered structural features at global scale, provides the foundation for cross-modal page analytics and supports reproducible research in web-driven machine learning (Mejia-Escobar et al., 2021).