Amazon 2023 Review Dataset
- Amazon Reviews 2023 is a comprehensive dataset featuring 570M reviews, 48M items, and nearly three decades of detailed, multi-modal metadata.
- The dataset employs JSON Lines format and rigorous preprocessing—including temporal splits and fixed seed downsampling—to ensure high fidelity and replicability.
- It underpins advanced research in retrieval and recommendation, demonstrating significant performance improvements in contrastive text–item representation tasks.
Amazon Reviews 2023 is the largest publicly released multi-domain e-commerce review corpus to date. It was introduced in “Bridging Language and Items for Retrieval and Recommendation” (Hou et al., 2024) to pretrain BLaIR, a family of sentence-embedding models that jointly encode user reviews and item metadata. The corpus contains over 570 million reviews and 48 million items across 33 top-level product categories, spans nearly three decades, and carries multi-modal item metadata and millisecond-precision timestamps. It underpins research in retrieval, recommendation, and contrastive text–item representation learning (Hou et al., 2024).
1. Dataset Structure and Scope
Amazon Reviews 2023 comprises:
- Reviews: 571,544,897
- Distinct items: 48,185,153
- Users: 54,514,264
- Product categories: 33 top-level domains, reflecting Amazon’s “browse node” taxonomy (e.g., Beauty, Video Games, Office Products, Sports, Baby).
- Temporal coverage: June 1996 through September 2023, with millisecond-precision timestamps.
- Token counts: 30.1 billion tokens in review texts; 30.7 billion tokens in item metadata.
This dataset surpasses earlier Amazon releases in both size and metadata richness, enabling granular multi-domain evaluation and pretraining. The distribution of reviews across categories is heavily long-tailed, as sample figures show (e.g., Beauty: 105k reviews over 43,982 items; Video Games: 2.55M reviews over 115,815 items; Baby: 3.60M reviews over 179,133 items), against an overall mean of about 17.3 million reviews per category (571,544,897 reviews / 33 categories). The per-category standard deviation is not reported but is expected to be in the tens of millions.
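The averages implied by the headline counts can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in this section; the variable names are illustrative, and the sample category counts are the rounded values given above.

```python
# Headline counts as quoted in the list above.
N_REVIEWS = 571_544_897
N_ITEMS = 48_185_153
N_USERS = 54_514_264
N_CATEGORIES = 33

# Corpus-wide averages derived from those counts.
mean_reviews_per_category = N_REVIEWS / N_CATEGORIES  # about 17.3 million
mean_reviews_per_item = N_REVIEWS / N_ITEMS
mean_reviews_per_user = N_REVIEWS / N_USERS

# (reviews, items) for the three sample categories quoted above (rounded).
samples = {
    "Beauty": (105_000, 43_982),
    "Video Games": (2_550_000, 115_815),
    "Baby": (3_600_000, 179_133),
}
# Review density per item differs by roughly an order of magnitude
# between the sampled categories.
reviews_per_item = {name: r / i for name, (r, i) in samples.items()}
```

The three sample categories sit far below the corpus-wide mean per category, which is consistent with a long-tailed distribution in which a few very large categories hold most of the reviews.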
2. Data Schema and Formats
All records use a JSON Lines format, with each line containing a single JSON object. Review records include:
- reviewerID (anonymized)
- itemID / ASIN
- reviewTitle (string)
- reviewText (string)
- overall rating (1–5 stars)
- timestamp (ISO 8601, millisecond precision)
Item metadata records include:
- itemID / ASIN
- category (top-level)
- title (string)
- features (bullet-point array)
- description (paragraph)
- images (array of URLs + resolution)
- videos (array of URLs + metadata)
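Because every line is a standalone JSON object, records can be streamed without loading a whole file into memory. The following is a minimal sketch of such a reader. The helper name `iter_jsonl`, the sample records, and the exact key spellings (for instance `overall` for the rating) are assumptions based on the field lists above; the keys in the released files may differ.

```python
import io
import json
from typing import Iterator, TextIO


def iter_jsonl(stream: TextIO) -> Iterator[dict]:
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line_no, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_no}: invalid JSON") from exc


# Hypothetical in-memory sample; key names are assumed from the lists above.
sample = io.StringIO(
    '{"reviewerID": "u1", "itemID": "B000TEST01", "reviewTitle": "Great", '
    '"reviewText": "Works as described.", "overall": 5, '
    '"timestamp": "2023-01-15T08:30:00.123Z"}\n'
    "\n"
    '{"reviewerID": "u2", "itemID": "B000TEST02", "reviewTitle": "Meh", '
    '"reviewText": "Stopped working after a week.", "overall": 2, '
    '"timestamp": "2023-02-01T12:00:00.000Z"}\n'
)
reviews = list(iter_jsonl(sample))
```

For a file on disk, the same function can be applied to `open(path, encoding="utf-8")`, where `path` points to a downloaded review or metadata file.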
Schema definitions are distributed in the GitHub repository's “schema/” directory and specify which fields are required and which are optional. Preprocessing for BLaIR consists of four steps: an 8:1:1 train/validation/test temporal split; pairing of context and metadata (reviewTitle + reviewText as context; title + features + description as metadata); length filtering (at least 30 characters in both context and metadata); and a fixed-seed 10% downsample so that single-epoch pretraining fits within resource constraints.
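The preprocessing steps described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the function names, the seed value, the use of a space as the join separator, and the choice to split by record count after sorting on the timestamp are all hypothetical, and the order in which the steps are combined is not specified in the text.

```python
import random
from typing import Dict, List, Sequence, Tuple

MIN_CHARS = 30  # length threshold stated above


def build_pair(review: dict, item: dict) -> Tuple[str, str]:
    """Form (context, metadata): title+text vs. title+features+description."""
    context_parts = [review.get("reviewTitle", ""), review.get("reviewText", "")]
    metadata_parts = [
        item.get("title", ""),
        *item.get("features", []),
        item.get("description", ""),
    ]
    context = " ".join(p for p in context_parts if p)
    metadata = " ".join(p for p in metadata_parts if p)
    return context, metadata


def keep_pair(context: str, metadata: str, min_chars: int = MIN_CHARS) -> bool:
    """Keep a pair only if both sides reach the length threshold."""
    return len(context) >= min_chars and len(metadata) >= min_chars


def temporal_split(records: Sequence[dict]) -> Dict[str, List[dict]]:
    """8:1:1 split by time. Assumes timestamps compare chronologically
    (epoch milliseconds, or uniformly formatted UTC ISO 8601 strings)."""
    ordered = sorted(records, key=lambda r: r["timestamp"])
    n = len(ordered)
    train_end, valid_end = n * 8 // 10, n * 9 // 10
    return {
        "train": ordered[:train_end],
        "valid": ordered[train_end:valid_end],
        "test": ordered[valid_end:],
    }


def downsample(records: Sequence[dict], fraction: float = 0.1,
               seed: int = 2023) -> List[dict]:
    """Fixed-seed random subset; the seed value here is an assumption."""
    rng = random.Random(seed)
    return rng.sample(list(records), int(len(records) * fraction))


# Toy records with integer timestamps, shuffled to show that order is restored.
demo = [{"timestamp": t} for t in range(100)]
random.Random(0).shuffle(demo)
splits = temporal_split(demo)
subset_a = downsample(splits["train"])
subset_b = downsample(splits["train"])  # same seed, same subset
```

Using a dedicated `random.Random(seed)` instance rather than the module-level generator keeps the downsample reproducible even when other code draws random numbers in between.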
3. Collection and Curation Methodologies
Dataset collection involved crawling raw HTML from Amazon through September 2023, parsing the HTML to JSON for both reviews and item metadata, and removing duplicate reviewID/itemID pairs. Several quality controls were implemented:
- Length threshold: Minimum 30 characters for both review context and item metadata.
- Timestamp-based splits: Ensures temporal integrity and prevents future-leakage in pretraining and evaluation.
- Fixed random seed: Employed for reproducibility in the 10% downsampling process.
- Test contamination prevention: No review text or metadata from the Test split is used in pretraining or model development.
This methodology produces a high-fidelity, temporally-stratified corpus suitable for longitudinal and multi-domain modeling.
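Two of the controls above, duplicate removal and the check against future leakage, lend themselves to a short illustration. The sketch below is hypothetical: the helper names are invented, and the default key fields are an assumption, since the text describes the duplicate key only as reviewID/itemID pairs.

```python
from typing import Iterable, Iterator, Sequence


def drop_duplicates(
    records: Iterable[dict],
    key_fields: Sequence[str] = ("reviewerID", "itemID"),  # assumed key
) -> Iterator[dict]:
    """Yield the first record seen for each key, skipping later repeats."""
    seen = set()
    for rec in records:
        key = tuple(rec.get(f) for f in key_fields)
        if key in seen:
            continue
        seen.add(key)
        yield rec


def has_future_leakage(train: Sequence[dict], test: Sequence[dict],
                       ts_field: str = "timestamp") -> bool:
    """True if any training record is newer than the oldest test record."""
    if not train or not test:
        return False
    return max(r[ts_field] for r in train) > min(r[ts_field] for r in test)


# Toy records: the second one repeats the (reviewerID, itemID) key of the first.
raw = [
    {"reviewerID": "u1", "itemID": "i1", "timestamp": 1},
    {"reviewerID": "u1", "itemID": "i1", "timestamp": 2},
    {"reviewerID": "u2", "itemID": "i1", "timestamp": 3},
]
unique = list(drop_duplicates(raw))
```

Keeping the first occurrence is a design choice made here for simplicity; a real pipeline might instead keep the most recent copy of a duplicated review.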
4. Comparative Analysis with Prior Amazon Releases
A scale comparison with Amazon Reviews 2018 highlights the corpus expansion:
| Version | Categories | Reviews | Items | Meta Tokens |
|---|---|---|---|---|
| Amazon '18 | 29 | 2.33×10⁸ | 1.52×10⁷ | 7.9 B |
| Amazon '23 | 33 | 5.72×10⁸ | 4.82×10⁷ | 30.7 B |
Compared to Amazon '18, Amazon '23 is about 2.45× larger in reviews and about 3.2× larger in items, extends coverage by nearly five additional years (Oct 2018–Sep 2023), and increases metadata token count nearly fourfold (about 3.9×). The schema is expanded with descriptive (“Description”, “Features”) and multi-modal (“Images”, “Videos”) fields, along with millisecond-precision timestamps. Benchmarking encoder models (RoBERTa, SimCSE) on this corpus yielded gains of up to 5% on retrieval metrics such as R@50.
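The scale factors can be recomputed directly from the table. The snippet below uses the rounded table values, so its results may differ in the last digit from ratios computed on the exact counts; the dictionary names are illustrative.

```python
# Rounded values copied from the comparison table above.
amazon18 = {"reviews": 2.33e8, "items": 1.52e7, "meta_tokens": 7.9e9}
amazon23 = {"reviews": 5.72e8, "items": 4.82e7, "meta_tokens": 30.7e9}

# Growth factor of the 2023 release over the 2018 release, per column.
ratios = {name: amazon23[name] / amazon18[name] for name in amazon18}

for name, value in ratios.items():
    print(f"{name}: {value:.2f}x")
```

With these inputs the review count grows by roughly 2.45×, the item count by roughly 3.17×, and the metadata token count by roughly 3.89×.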
5. Public Access, Licensing, and Citation Requirements
The dataset, accompanying schemas, and splits are available at GitHub: hyp1231/AmazonReviews2023:
- data/raw/: Unprocessed JSON Lines for reviews and items.
- data/schema/: JSON schemas for field specification.
- data/splits/: Timestamp landmarks for train/val/test splits.
Licensing is governed by Amazon’s Customer Reviews Terms of Use and an MIT-style license on the repository. Redistribution of review text requires acceptance of Amazon’s policy. Users must cite “Bridging Language and Items for Retrieval and Recommendation,” ACL 2024, and include the GitHub URL when using the dataset.
6. Core Research Applications and Known Limitations
Use cases supported by the dataset include:
- Sequential recommendation (e.g., predicting next click/purchase from historical reviews)
- Conventional product search (keyword → relevant ASIN retrieval)
- Complex product search (long, instruction-style queries → item retrieval)
- Contrastive pretraining for item–text embeddings (as in BLaIR)
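To make the last use case concrete, the sketch below computes a generic in-batch contrastive (InfoNCE-style) loss between review-context embeddings and item-metadata embeddings, where the matching pair in each row is the positive and all other items in the batch act as negatives. This is a textbook formulation written in plain Python for illustration; BLaIR's exact objective, encoder, and temperature are not described in this text, so the function name and the temperature value are assumptions.

```python
import math
from typing import List, Sequence


def _normalize(vec: Sequence[float]) -> List[float]:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(contexts: Sequence[Sequence[float]],
             items: Sequence[Sequence[float]],
             temperature: float = 0.05) -> float:
    """Mean cross-entropy of picking item i for context i within the batch."""
    ctx = [_normalize(v) for v in contexts]
    itm = [_normalize(v) for v in items]
    losses = []
    for i, c in enumerate(ctx):
        # Cosine similarity to every item in the batch, scaled by temperature.
        logits = [sum(a * b for a, b in zip(c, m)) / temperature for m in itm]
        # Numerically stable log-sum-exp over the batch.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy 2-D embeddings: aligned pairs should score a much lower loss than
# pairs whose items have been swapped.
aligned = [[1.0, 0.0], [0.0, 1.0]]
loss_aligned = info_nce(aligned, aligned)
loss_swapped = info_nce(aligned, [[0.0, 1.0], [1.0, 0.0]])
```

In practice the embeddings would come from a trainable text encoder and the loss would be computed with an automatic-differentiation framework; the pure-Python version only illustrates the arithmetic.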
Limitations include:
- Popularity bias: Review counts are strongly concentrated on “head” items, especially in electronics.
- Reviewer bias: The dataset reflects self-selection in review authorship.
- Temporal bias: Older categories accumulate more reviews through longevity.
Mitigation strategies discussed include downsampling head/tail items, per-category re-weighting, multi-domain pretraining to improve tail performance, and timestamp-aware evaluation. A plausible implication is that models trained exclusively on head items may underperform in low-activity or evolving domains, suggesting the necessity for domain and temporal balance in evaluation protocols.
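Two of these mitigation strategies, per-category re-weighting and downsampling of head items, can be sketched in a few lines. The weighting formula (inverse frequency, normalized so the weights sum to the number of examples), the cap-based downsampling rule, the function names, and the seed are all assumptions chosen for illustration; the text does not specify how the strategies are implemented.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, List, Sequence


def category_weights(categories: Sequence[str]) -> Dict[str, float]:
    """Inverse-frequency weight per category: total / (n_categories * count)."""
    counts = Counter(categories)
    total, n_cat = len(categories), len(counts)
    return {cat: total / (n_cat * n) for cat, n in counts.items()}


def cap_reviews_per_item(reviews: Sequence[dict], max_per_item: int,
                         seed: int = 0) -> List[dict]:
    """Randomly keep at most max_per_item reviews for each item."""
    rng = random.Random(seed)
    by_item = defaultdict(list)
    for review in reviews:
        by_item[review["itemID"]].append(review)
    kept = []
    for item_id in sorted(by_item):  # sorted for a deterministic result
        group = by_item[item_id]
        if len(group) > max_per_item:
            group = rng.sample(group, max_per_item)
        kept.extend(group)
    return kept


# Toy data: category "A" is the head (8 examples), "B" the tail (2 examples).
weights = category_weights(["A"] * 8 + ["B"] * 2)

# Toy data: item "x" has five reviews, item "y" has one.
toy_reviews = [{"itemID": "x", "n": i} for i in range(5)] + [{"itemID": "y", "n": 0}]
capped = cap_reviews_per_item(toy_reviews, max_per_item=2)
```

With this normalization every category contributes the same total weight, so a tail category is no longer dominated by head categories in a weighted loss or metric.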
7. Significance for Retrieval and Recommender System Research
Amazon Reviews 2023 provides an unprecedented foundation for research in large-scale product retrieval, recommendation, and representation learning. The dataset's volume, temporal scope, category diversity, and multi-modal metadata enable evaluation of model generalization, complex query handling, and cross-domain adaptation. Its deployment in pretraining BLaIR demonstrates strong capacity for text and item representation, with substantial improvements observed in conventional benchmarks (Hou et al., 2024). This suggests further opportunities in contrastive learning, sequential modeling across time, and domain-aware recommendation.