Amazon 2023 Review Dataset

Updated 19 December 2025

Amazon Reviews 2023 is a large-scale dataset of roughly 570M reviews and 48M items, spanning nearly three decades with detailed, multi-modal metadata.
The dataset uses the JSON Lines format, and its preprocessing (temporal splits and fixed-seed downsampling) is designed for reproducibility.
It supports research in retrieval and recommendation, with reported gains on retrieval benchmarks from contrastive text–item pretraining.

Amazon Reviews 2023 is the largest publicly released multi-domain e-commerce review corpus to date, introduced in “Bridging Language and Items for Retrieval and Recommendation” (Hou et al., 2024) for pretraining BLaIR, a family of sentence-embedding models that jointly encode user reviews and item metadata. The corpus is a multi-modal resource of more than 570 million reviews and 48 million items across 33 top-level product categories, spanning nearly three decades with detailed metadata and millisecond-precision timestamps. It supports research in retrieval, recommendation, and contrastive text–item representations (Hou et al., 2024).

1. Dataset Structure and Scope

Amazon Reviews 2023 comprises:

Reviews: 571,544,897
Distinct items: 48,185,153
Users: 54,514,264
Product categories: 33 top-level domains, reflecting Amazon’s “browse node” taxonomy (e.g., Beauty, Video Games, Office Products, Sports, Baby).
Temporal coverage: June 1996 through September 2023, with millisecond-precision timestamps.
Token counts: 30.1 billion tokens in review texts; 30.7 billion tokens in item metadata.

This dataset surpasses earlier Amazon releases in both size and metadata richness, enabling granular multi-domain evaluations and pretraining. The distribution of reviews across categories is heavily long-tailed. The overall category mean is $\mu_{\rm rev/ctg} = 571{,}544{,}897 / 33 \approx 17.32\times 10^{6}$ reviews per category, yet the sampled categories all fall far below it (e.g., Beauty: 105k reviews over 43,982 items; Video Games: 2.55M reviews over 115,815 items; Baby: 3.60M reviews over 179,133 items), which implies that a small number of very large categories account for most of the volume. The per-category standard deviation $\sigma$ is not reported; given this skew, it is plausibly on the order of tens of millions.
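The arithmetic behind these figures can be checked with a short sketch. It uses only the counts quoted in this article (the sample review counts are rounded), and the function names are illustrative, not part of any released tooling.

```python
# Headline counts as quoted in this article.
TOTAL_REVIEWS = 571_544_897
NUM_CATEGORIES = 33

# Sample categories quoted above: (reviews, items). Review counts are rounded.
SAMPLES = {
    "Beauty": (105_000, 43_982),
    "Video Games": (2_550_000, 115_815),
    "Baby": (3_600_000, 179_133),
}


def mean_reviews_per_category(total_reviews: int, num_categories: int) -> float:
    """Plain arithmetic mean of reviews over top-level categories."""
    return total_reviews / num_categories


def reviews_per_item(samples: dict) -> dict:
    """Average number of reviews per item for each sampled category."""
    return {name: reviews / items for name, (reviews, items) in samples.items()}


if __name__ == "__main__":
    mean = mean_reviews_per_category(TOTAL_REVIEWS, NUM_CATEGORIES)
    print(f"mean reviews per category: {mean / 1e6:.2f}M")
    for name, density in reviews_per_item(SAMPLES).items():
        share = SAMPLES[name][0] / mean
        print(f"{name}: {density:.1f} reviews/item, {share:.1%} of the category mean")
```

Each sampled category holds well under a quarter of the mean category size, which is what a long-tailed distribution dominated by a few very large categories would look like.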

2. Data Schema and Formats

All records use a JSON Lines format, with each line containing a single JSON object. Review records include:

reviewerID (anonymized)
itemID / ASIN
reviewTitle (string)
reviewText (string)
overall rating (1–5 stars)
timestamp (ISO 8601, millisecond precision)

Item metadata records include:

itemID / ASIN
category (top-level)
title (string)
features (bullet-point array)
description (paragraph)
images (array of URLs + resolution)
videos (array of URLs + metadata)
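A minimal sketch of reading such records follows. The key names (`reviewerID`, `itemID`, `overall`, and so on) are taken from the listing above and are an assumption about the on-disk keys; the released files may name or encode fields differently, so the keys and the timestamp parsing should be checked against the repository's schema. The two sample records are made up.

```python
import io
import json
from datetime import datetime
from typing import Iterator, TextIO


def iter_jsonl(stream: TextIO) -> Iterator[dict]:
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 string; a trailing 'Z' is mapped to an explicit UTC offset."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


# Two made-up review records using the field names listed in this article.
SAMPLE = io.StringIO(
    '{"reviewerID": "U1", "itemID": "B000TEST01", "reviewTitle": "Solid", '
    '"reviewText": "Works as described.", "overall": 5, '
    '"timestamp": "2023-01-15T08:30:00.123Z"}\n'
    '{"reviewerID": "U2", "itemID": "B000TEST01", "reviewTitle": "Meh", '
    '"reviewText": "Stopped working after a week.", "overall": 2, '
    '"timestamp": "2023-02-01T17:05:42.000Z"}\n'
)

reviews = list(iter_jsonl(SAMPLE))

if __name__ == "__main__":
    for review in reviews:
        when = parse_timestamp(review["timestamp"])
        print(review["itemID"], review["overall"], when.isoformat())
```

For a real file, `iter_jsonl(open(path, encoding="utf-8"))` streams records one at a time, which matters at this scale because the full corpus does not fit in memory on typical machines.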

Schema definitions are distributed in the GitHub repository's “schema/” directory, dictating required and optional fields. Preprocessing for BLaIR includes an 8:1:1 Train/Validation/Test temporal split, paired context/metadata formation (reviewTitle + reviewText for context; title + features + description for metadata), length filtering (≥30 characters in both context and metadata), and a fixed-seed 10% downsample for single-epoch pretraining within resource constraints.
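The preprocessing steps just described can be sketched as follows. This is an illustration under stated assumptions, not the BLaIR authors' code: the split here cuts a timestamp-sorted list at the 80% and 90% positions (the released splits are described as timestamp landmarks), the seed value is arbitrary, and the order in which the steps are applied is not specified in this article.

```python
import random
from typing import Callable, List, Sequence, Tuple

MIN_CHARS = 30  # length threshold stated in the article


def build_pair(review: dict, item: dict) -> Tuple[str, str]:
    """Context = reviewTitle + reviewText; metadata = title + features + description."""
    context_parts = [review.get("reviewTitle", ""), review.get("reviewText", "")]
    metadata_parts = [
        item.get("title", ""),
        " ".join(item.get("features", [])),
        item.get("description", ""),
    ]
    context = " ".join(p for p in context_parts if p)
    metadata = " ".join(p for p in metadata_parts if p)
    return context, metadata


def passes_length_filter(context: str, metadata: str, min_chars: int = MIN_CHARS) -> bool:
    """Keep a pair only if both sides have at least min_chars characters."""
    return len(context) >= min_chars and len(metadata) >= min_chars


def temporal_split(records: Sequence[dict], timestamp_of: Callable[[dict], object]):
    """Sort by time, then cut 8:1:1 so that validation and test are strictly later."""
    ordered = sorted(records, key=timestamp_of)
    n = len(ordered)
    train_end, valid_end = n * 8 // 10, n * 9 // 10
    return ordered[:train_end], ordered[train_end:valid_end], ordered[valid_end:]


def downsample(records: Sequence[dict], fraction: float = 0.1, seed: int = 2023) -> List[dict]:
    """Fixed-seed random subsample; the seed value here is hypothetical."""
    rng = random.Random(seed)
    k = round(len(records) * fraction)
    return rng.sample(list(records), k)


if __name__ == "__main__":
    toy = [{"t": i} for i in range(100)]
    train, valid, test = temporal_split(toy, timestamp_of=lambda r: r["t"])
    print(len(train), len(valid), len(test), len(downsample(train)))
```

Using a dedicated `random.Random(seed)` instance rather than the module-level generator keeps the subsample reproducible even when other code draws random numbers in between.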

3. Collection and Curation Methodologies

Dataset collection involved crawling raw HTML from Amazon through September 2023, parsing the HTML to JSON for both reviews and item metadata, and removing duplicate reviewID/itemID pairs. Several quality controls were implemented:

Length threshold: Minimum 30 characters for both review context and item metadata.
Timestamp-based splits: Ensures temporal integrity and prevents future-leakage in pretraining and evaluation.
Fixed random seed: Employed for reproducibility in the 10% downsampling process.
Test contamination prevention: No review text or metadata from the Test split is used in pretraining or model development.

This methodology produces a high-fidelity, temporally stratified corpus suitable for longitudinal and multi-domain modeling.
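The deduplication step mentioned above can be illustrated with a first-occurrence filter. The key fields are left as a parameter because this article names the pair as reviewID/itemID while the schema listing uses reviewerID; the default below is therefore an assumption, and the actual curation code may resolve duplicates differently.

```python
from typing import Iterable, List, Sequence


def drop_duplicates(
    records: Iterable[dict],
    key_fields: Sequence[str] = ("reviewerID", "itemID"),  # assumed key names
) -> List[dict]:
    """Keep the first record seen for each combination of key_fields."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(record.get(field) for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


if __name__ == "__main__":
    rows = [
        {"reviewerID": "U1", "itemID": "A", "overall": 5},
        {"reviewerID": "U1", "itemID": "A", "overall": 5},  # exact repeat
        {"reviewerID": "U1", "itemID": "B", "overall": 3},
    ]
    print(len(drop_duplicates(rows)))
```

At full corpus scale the `seen` set would hold hundreds of millions of keys, so a real pipeline would more likely deduplicate per category or with an external sort; the sketch only shows the logic.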

4. Comparative Analysis with Prior Amazon Releases

A scale comparison with Amazon Reviews 2018 highlights the corpus expansion:

Version	Categories	Reviews	Items	Meta Tokens
Amazon '18	29	2.33×10⁸	1.52×10⁷	7.9 B
Amazon '23	33	5.72×10⁸	4.82×10⁷	30.7 B

Compared to Amazon '18, Amazon '23 is ∼2.45× as large in reviews (5.72/2.33) and ∼3.17× as large in items (4.82/1.52), extends coverage by about five additional years (Oct 2018–Sep 2023), and nearly quadruples the metadata token count (30.7 B vs. 7.9 B, ∼3.9×). The schema is expanded with descriptive (“Description”, “Features”) and multi-modal (“Images”, “Videos”) fields, along with millisecond-precision timestamps. Benchmarking encoder models (RoBERTa, SimCSE) on this corpus yielded gains of up to 5% on retrieval metrics (R@50).
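For readers unfamiliar with the R@50 metric mentioned above, Recall@K is the fraction of a query's relevant items that appear among the top K retrieved items. The sketch below uses a common definition; the exact evaluation protocol of Hou et al. (2024) is not given in this article, and the function names are illustrative.

```python
from typing import Dict, Iterable, Sequence, Set


def recall_at_k(ranked_items: Sequence[str], relevant_items: Set[str], k: int = 50) -> float:
    """Fraction of relevant items found in the top-k of a ranked list."""
    if not relevant_items:
        return 0.0
    top_k = set(ranked_items[:k])
    hits = sum(1 for item in relevant_items if item in top_k)
    return hits / len(relevant_items)


def mean_recall_at_k(
    rankings: Dict[str, Sequence[str]],
    ground_truth: Dict[str, Set[str]],
    k: int = 50,
) -> float:
    """Average Recall@k over all queries that have ground-truth items."""
    queries: Iterable[str] = [q for q in ground_truth if ground_truth[q]]
    scores = [recall_at_k(rankings.get(q, []), ground_truth[q], k) for q in queries]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    rankings = {"q1": ["A", "B", "C"], "q2": ["D", "E", "F"]}
    truth = {"q1": {"B"}, "q2": {"Z"}}
    print(mean_recall_at_k(rankings, truth, k=2))
```

When each query has a single relevant item, as in next-item prediction, Recall@K reduces to the hit rate at K.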

5. Public Access, Licensing, and Citation Requirements

The dataset, accompanying schemas, and splits are available in the GitHub repository hyp1231/AmazonReviews2023:

data/raw/: Unprocessed JSON Lines for reviews and items.
data/schema/: JSON schemas for field specification.
data/splits/: Timestamp landmarks for train/val/test splits.

Licensing is governed by Amazon’s Customer Reviews Terms of Use and an MIT-style license on the repository. Redistribution of review text requires acceptance of Amazon’s policy. Users must cite “Bridging Language and Items for Retrieval and Recommendation,” ACL 2024, and include the GitHub URL when using the dataset.

6. Core Research Applications and Known Limitations

Use cases supported by the dataset include:

Sequential recommendation (e.g., predicting next click/purchase from historical reviews)
Conventional product search (keyword → relevant ASIN retrieval)
Complex product search (long, instruction-style queries → item retrieval)
Contrastive pretraining for item–text embeddings (as in BLaIR)
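The contrastive pretraining use case can be illustrated with a generic in-batch objective (InfoNCE), in which each review context is pulled toward the embedding of its own item and pushed away from the other items in the batch. This is a sketch of the general technique, not a confirmed reproduction of BLaIR's loss; the temperature value and the toy vectors are assumptions.

```python
import math
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce_loss(
    context_vecs: Sequence[Sequence[float]],
    item_vecs: Sequence[Sequence[float]],
    temperature: float = 0.05,  # hypothetical value
) -> float:
    """Mean cross-entropy where row i's positive is item i and other items are negatives."""
    contexts = [normalize(v) for v in context_vecs]
    items = [normalize(v) for v in item_vecs]
    losses = []
    for i, context in enumerate(contexts):
        logits = [dot(context, item) / temperature for item in items]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


if __name__ == "__main__":
    contexts = [[1.0, 0.0], [0.0, 1.0]]
    aligned = [[0.9, 0.1], [0.1, 0.9]]
    shuffled = [[0.1, 0.9], [0.9, 0.1]]
    print(info_nce_loss(contexts, aligned), info_nce_loss(contexts, shuffled))
```

The loss is small when each context is most similar to its own item and large when the pairing is scrambled, which is the signal a text–item encoder is trained on.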

Limitations include:

Popularity bias: Review counts are strongly concentrated on “head” items, especially in electronics.
Reviewer bias: The dataset reflects self-selection in review authorship.
Temporal bias: Older categories accumulate more reviews through longevity.

Mitigation strategies discussed include downsampling head/tail items, per-category re-weighting, multi-domain pretraining to improve tail performance, and timestamp-aware evaluation. A plausible implication is that models trained exclusively on head items may underperform in low-activity or evolving domains, suggesting the necessity for domain and temporal balance in evaluation protocols.
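As one illustration of per-category re-weighting, the sketch below assigns each record a sampling weight inversely proportional to the size of its category, so that with the default exponent every category receives the same total probability mass. The `power` parameter and the `category` key are assumptions made for this example; the article does not specify a weighting scheme.

```python
import random
from collections import Counter
from typing import List, Sequence


def inverse_frequency_weights(categories: Sequence[str], power: float = 1.0) -> List[float]:
    """Per-record weights proportional to 1 / count(category) ** power, normalized to sum to 1.

    power=1.0 equalizes total mass across categories; power=0.0 leaves sampling uniform.
    """
    counts = Counter(categories)
    raw = [1.0 / (counts[c] ** power) for c in categories]
    total = sum(raw)
    return [w / total for w in raw]


if __name__ == "__main__":
    # Toy corpus: a head category with 8 records and a tail category with 2.
    records = [{"category": "Electronics"}] * 8 + [{"category": "Baby"}] * 2
    weights = inverse_frequency_weights([r["category"] for r in records])
    rng = random.Random(0)
    batch = rng.choices(records, weights=weights, k=1000)
    print(Counter(r["category"] for r in batch))
```

An exponent between 0 and 1 gives a softer correction that boosts tail categories without fully flattening the distribution.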

7. Significance for Retrieval and Recommender System Research

Amazon Reviews 2023 provides an unprecedented foundation for research in large-scale product retrieval, recommendation, and representation learning. The dataset's volume, temporal scope, category diversity, and multi-modal metadata enable evaluation of model generalization, complex query handling, and cross-domain adaptation. Its deployment in pretraining BLaIR demonstrates strong capacity for text and item representation, with substantial improvements observed in conventional benchmarks (Hou et al., 2024). This suggests further opportunities in contrastive learning, sequential modeling across time, and domain-aware recommendation.

References

Hou et al. (2024). Bridging Language and Items for Retrieval and Recommendation.
