W{content}I: Global Multi-Modal Web Dataset
- W{content}I is a comprehensive multi-modal dataset of 49,438 globally sourced web pages featuring screenshots, HTML text, and structured quantitative metrics.
- It enables detailed studies of web page structure, error detection via CNN models (up to 97% validation accuracy), and transfer learning for subject categorization.
- The dataset is organized by six thematic categories and openly licensed under CC-BY 4.0, supporting reproducible research in web mining and computational web science.
The W{content}I dataset, formally described in "A Large Visual, Qualitative and Quantitative Dataset of Web Pages," is a multi-modal collection comprising visual, textual, and structured quantitative features of web pages gathered worldwide. It is engineered to enable detailed, reproducible studies of web page appearance, structure, and topical content across six core thematic categories. Recognized for its breadth and explicit design for machine learning workflows, W{content}I underpins both browser-level rendering analysis and web-mining research, with open access under CC-BY 4.0 (Mejia-Escobar et al., 2021).
1. Scope and Data Modalities
The dataset contains 49,438 distinct web pages, each represented by three data modalities:
- Visual: Full-page screenshots (webshots) in JPG, typically at 992 × 3,560 pixels, mean file size ≈ 274 KB.
- Textual: Both raw HTML source and script-generated extracted plain text.
- Quantitative: Structured features including download time, HTML size, and counts of HTML elements (images, script files, CSS files, tables, iframes, style tags) as well as webshot image properties (size, width, height).
Pages are globally sourced, covering websites across countries on every continent. Each is annotated with one of six high-level subject labels: Arts & Entertainment; Business & Economy; Education; Government; News & Media; Science & Environment.
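The three modalities above can be pictured as a single per-page record. The sketch below is a hypothetical schema for illustration only; the field names are assumptions, not the dataset's actual CSV headers.

```python
from dataclasses import dataclass

# Hypothetical per-page record combining the three modalities.
# Field names are illustrative, not the repository's exact schema.
@dataclass
class PageRecord:
    url: str
    category: str          # one of the six subject labels
    webshot_path: str      # full-page JPG screenshot
    html_path: str         # raw HTML source
    download_time: float   # quantitative features; -1 marks missing data
    html_size_kb: float
    num_images: int
    num_scripts: int

rec = PageRecord("https://example.es", "Education",
                 "images/education/0001.jpg", "html/0001.html",
                 18.7, 50.5, 2, 3)
```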
2. Data Collection and Preprocessing Protocol
The data acquisition pipeline combines automated searching with structured directory browsing:
- URL Collection:
  - Searching: Google queries of the form `OR(keyword1,…) site:.cc ext:html` across country-specific TLDs (.es, .br, etc.), yielding the top 100 results per country/query.
  - Browsing: Scraping the Best Of The Web (BOTW) directory by category, continent, and country.
- Feature Extraction:
  - Each URL is fetched via HTTP GET. HTML markup is parsed using Python BeautifulSoup to derive quantitative features. Example element-count calculation:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(resp.text, 'html.parser')       # resp: HTTP GET response
num_images = len(soup.find_all('img'))               # <img> elements
num_scripts = len(soup.find_all('script', src=True)) # external scripts only
```

  - Webshots are captured with R (PhantomJS via the webshot package), recording file size and dimensions with magick.
- Debugging:
- Pages with exceptions (timeout, HTTP errors, missing components) are marked −1.
- A CNN-based error-page classifier is applied to purge artefacts such as pages from domain-parked or under-construction sites.
This suggests a design optimized to minimize noise and operational failures commonly encountered with large-scale web snapshots.
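The search phase above can be sketched as a query-template builder. This is an illustrative reconstruction: the keyword lists and TLDs below are examples, not the authors' actual inputs.

```python
# Illustrative reconstruction of the search-phase query template;
# keywords and TLDs are made-up examples, not the paper's exact inputs.
def build_query(keywords, tld):
    """Compose a Google query restricted to one country TLD and HTML pages."""
    kw = ",".join(keywords)
    return f"OR({kw}) site:.{tld} ext:html"

queries = [build_query(["school", "university"], tld) for tld in ("es", "br")]
```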
3. Dataset Structure and Statistical Properties
Data is organized by both page category and geographic region:
| Category | Pages (n) |
|---|---|
| Arts & Entertainment | 7,752 |
| Business & Economy | 8,438 |
| Education | 7,892 |
| Government | 7,354 |
| News & Media | 10,044 |
| Science & Environment | 7,958 |

The per-category counts sum to the full 49,438 pages.
Key numerical features, shown with means and standard deviations:
| Parameter | Browsing (mean±σ) | Searching (mean±σ) |
|---|---|---|
| URL length | 31.7 ± 12.6 chars | 69.8 ± 31.0 chars |
| Download time | 18.7 ± 17.4 ms | 22.3 ± 15.8 ms |
| HTML size | 50.5 ± 44.3 KB | 46.8 ± 39.6 KB |
| Images/page | 2.1 ± 3.5 | 2.5 ± 4.1 |
| Scripts/page | 3.4 ± 5.1 | 3.0 ± 4.4 |
| Webshot size | 302 ± 200 KB | 274 ± 180 KB |
| Webshot width | 1,059 ± 389 px | 1,052 ± 380 px |
| Webshot height | 3,859 ± 4,293 px | 3,560 ± 3,918 px |
A plausible implication is that the dataset permits stratified statistical studies of web page complexity, global design variation, and multimedia content densities.
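Summaries like those in the table can be recomputed directly from the datasheets. The sketch below uses the standard-library `statistics` module on made-up download times, not actual dataset measurements; the −1 filter mirrors the dataset's missing-data convention.

```python
import statistics

# Toy recomputation of a mean ± sigma summary; the input values are
# invented download times, not real dataset measurements.
def summarize(values):
    valid = [v for v in values if v != -1]   # -1 marks failed fetches
    return statistics.mean(valid), statistics.stdev(valid)

mean, sigma = summarize([12.0, 25.0, -1, 19.0])
```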
4. Experimental Use Cases and Benchmark Models
The authors document two primary supervised tasks to demonstrate dataset utility:
4.1 Error-Page Detection
Binary classification on webshots (valid vs error) using a compact CNN:
- Train/val split: 80/20
- Architecture: Three Conv-ReLU-MaxPool blocks, dense output
- Results: Train accuracy 96.6%, validation 97.2%
- Applied to large “Searching” subset: overall 92.5% accuracy
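The spatial downsampling of the three-block architecture can be checked with simple shape arithmetic. The kernel size, pooling factor, and input resolution below are assumptions for illustration, as the paper's exact hyperparameters are not reproduced here.

```python
# Shape arithmetic for a compact CNN of the kind described (three
# Conv-ReLU-MaxPool blocks). Assumes 'same'-padded convolutions, which
# preserve H x W, so each 2x2 max-pool halves both dimensions.
def conv_pool_output(h, w, blocks=3, pool=2):
    for _ in range(blocks):
        h, w = h // pool, w // pool
    return h, w

h, w = conv_pool_output(224, 224)  # e.g. webshots resized to 224 x 224
```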
4.2 Subject Categorization
Six-way classification on category-balanced webshots using transfer learning:
- Model: ResNet-50 (ImageNet), new classifier head (GlobalAveragePool → Dense(256) → Dropout → Dense(6) → Softmax)
- Training accuracy: 94.3%
- Validation accuracy: 40.4% (high inter-class confusion; only three classes >50% recall)
This suggests a strong visual signature for error-page discrimination but ambiguity in topical cues from screenshots alone.
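The size of the transfer-learning head in 4.2 can be worked out from the layer widths given above. ResNet-50's pooled feature width of 2048 is the standard value and is an assumption here, as the paper's figure is not quoted; Dropout and Softmax add no parameters.

```python
# Parameter count for the classifier head on top of ResNet-50:
# GlobalAveragePool -> Dense(256) -> Dropout -> Dense(6) -> Softmax.
# The 2048-wide pooled feature is ResNet-50's standard output (assumed).
feat = 2048
dense1 = feat * 256 + 256   # weights + biases
dense2 = 256 * 6 + 6        # weights + biases
head_params = dense1 + dense2
```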
5. Data Access, Format, and Licensing
- Repository: https://osf.io/7ghd2/
- Contents:
  - images/: webshots per category
  - datasheets/: CSV tables of URLs and quantitative features
  - code/: Python and R scripts for parsing and extraction
  - docs/: methodology and README
- Schema: quantitative features are stored in tabular CSV files and webshots as image files, with −1 indicating absent data.
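The −1 convention means consumers should filter sentinel rows before analysis. A minimal sketch with the standard-library `csv` module follows; the column names are illustrative, not the repository's exact headers.

```python
import csv
import io

# Minimal sketch of reading a datasheet and treating -1 as missing.
# Column names are illustrative, not the repository's actual headers.
sample = io.StringIO(
    "url,download_time,num_images\n"
    "a.com,18.7,2\n"
    "b.com,-1,-1\n"          # failed fetch: sentinel values
)
rows = list(csv.DictReader(sample))
valid = [r for r in rows if float(r["download_time"]) != -1]
```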
- License: CC-BY 4.0; citation required (Mejia-Escobar et al., 2021).
6. Analytical Significance and Applications
W{content}I enables:
- Systematic studies of visual web diversity, page structure typology, and graphical complexity across nations and markets.
- Multi-task learning frameworks exploiting cross-modal data (image, text, quantitative metrics).
- Development of robust page error/anomaly detectors and visual classifiers with real-world noise and heterogeneity.
- Extension to unsupervised and semi-supervised representation learning on page-level, DOM, and domain aggregates.
This suggests the dataset serves as a baseline resource for web intelligence, content classification, accessibility evaluation, and page-quality assessment in large-scale computational web science and applied ML.
7. Limitations and Future Directions
Inter-class confusion in subject categorization highlights the need for more sophisticated model architectures or integration of textual/structural features beyond screenshots. The absence of detailed textual analysis in the initial paper points to future expansions where HTML-derived text is leveraged for complementary classifiers, retrieval, or semantic structure mining. Possible future directions include topic modeling, dynamic web evolution tracking, and page accessibility assessment.
The W{content}I dataset, by integrating high-resolution screenshots and engineered structural features at global scale, provides the foundation for cross-modal page analytics and supports reproducible research in web-driven machine learning (Mejia-Escobar et al., 2021).