Hierarchical Book Genre Dataset
- Hierarchical Book Genre Dataset is a structured resource offering multi-level genre labels for books, enabling precise parent-child classification and nuanced recommendations.
- It integrates multi-modal data sources like blurbs, reviews, cover images, and OCR text with standardized annotation protocols to support digital library indexing and cross-category retrieval.
- Advanced techniques such as zero-shot semantic filtering and diffusion-based image augmentation enhance minority class representation and stabilize multi-label classification metrics.
A hierarchical book genre dataset is a structured resource that captures both the coarse and fine-grained genre labels for books, facilitating robust, multi-level classification in computational literary studies, digital library science, and recommender system development. Unlike flat genre datasets, the hierarchical format explicitly models the parent-child relationships between top-level categories (e.g., Fiction vs. Non-Fiction) and their respective subgenres. Recent datasets such as those introduced by the HiGeMine (Kumar et al., 24 Dec 2025) and IMAGINE (Nareti et al., 5 May 2025) frameworks provide annotated corpora with multi-modal content and carefully standardized taxonomies, enabling precise supervised learning, multi-label reasoning, and nuanced genre mining.
1. Taxonomy Architecture and Labeling Schemes
Hierarchical book genre datasets typically utilize a two-level taxonomy:
- Level-1 (coarse): Binary categorization into Fiction ($\mathcal{F}$) and Non-Fiction ($\mathcal{N}$).
- Level-2 (fine): Multi-label assignment from a set of distinct subgenres associated with each Level-1 branch. For HiGeMine (Kumar et al., 24 Dec 2025) this comprises 28 fiction subgenres (e.g., Mystery, Romance, Science-Fiction, Fantasy, Historical) and 29 non-fiction subgenres (e.g., Biography, Memoir, Self-Help, History, Science).
IMAGINE (Nareti et al., 5 May 2025) implements nearly identical top-level branching but defines 29 subgenres under each parent, with genre lists explicitly documented in the paper (see Section II and supplementary material).
Taxonomic formalization uses the following notation (cf. (Nareti et al., 5 May 2025)):

$$\mathcal{Y} = \mathcal{Y}_{\mathcal{F}} \cup \mathcal{Y}_{\mathcal{N}},$$

where $\mathcal{Y}_{\mathcal{F}}$ and $\mathcal{Y}_{\mathcal{N}}$ are the fine label sets for fiction and non-fiction, respectively. For book $b_i$, labels are denoted $(y_i^{(1)}, y_i^{(2)})$, with $y_i^{(1)} \in \{\mathcal{F}, \mathcal{N}\}$ and $y_i^{(2)} \subseteq \mathcal{Y}_{\mathcal{F}}$ or $y_i^{(2)} \subseteq \mathcal{Y}_{\mathcal{N}}$, conditional on the coarse category.
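The two-level scheme can be represented directly in code. A minimal sketch (the subgenre sets below are illustrative excerpts, not the full 28/29-entry lists from the papers):

```python
# Two-level genre taxonomy: coarse parents mapped to fine subgenre sets.
# Subgenre lists are illustrative excerpts, not the full taxonomy.
TAXONOMY = {
    "Fiction": {"Mystery", "Romance", "Science-Fiction", "Fantasy", "Historical"},
    "Non-Fiction": {"Biography", "Memoir", "Self-Help", "History", "Science"},
}

def validate_labels(level1, level2):
    """A label pair is consistent iff every fine label belongs to the
    subgenre set of the assigned coarse parent."""
    return level1 in TAXONOMY and set(level2) <= TAXONOMY[level1]

# Fine labels must match the coarse branch:
assert validate_labels("Fiction", {"Mystery", "Fantasy"})
assert not validate_labels("Fiction", {"Biography"})  # cross-branch label
```

The subset check encodes the conditional constraint above: a book's Level-2 labels are only valid relative to its Level-1 assignment.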
2. Data Composition and Source Modalities
Hierarchical book genre datasets aggregate multiple data modalities per book:
| Modality | HiGeMine (Kumar et al., 24 Dec 2025) | IMAGINE (Nareti et al., 5 May 2025) |
|---|---|---|
| Blurb | 1 per book, crawled (Goodreads) | 1 per book, web-scraped |
| User Reviews | 9–10 per book (filtered) | Not primary modality |
| Cover Image | Not included | JPEG/PNG, all present |
| OCR Text | Not included | Gemini Pro Vision, >90% avail. |
| Metadata | Not included | Author/publisher, 95% avail. |
HiGeMine prioritizes textual sources, combining authoritative blurbs and filtered, crowd-sourced reviews. IMAGINE employs a multi-modal schema, integrating cover images (processed via Swin-Transformer), OCR-extracted text, and metadata (with missing values encoded as zero-vectors via TransD knowledge-graph embeddings).
Both datasets are sourced from Goodreads, with explicit annotation strategies and taxonomy standardization. HiGeMine filters blurbs and reviews via zero-shot semantic alignment; IMAGINE enforces data resilience through dynamic source selection and augmentation.
3. Data Volume, Splits, and Augmentation
Dataset statistics reveal substantial sample sizes and rigorous partitioning:
- HiGeMine: 5,612 fiction + 3,763 non-fiction = 9,375 books, totalling 90,499 filtered reviews; average blurb/review lengths ~100–183 words (Kumar et al., 24 Dec 2025).
- IMAGINE: 6,704 fiction + 4,598 non-fiction = 11,302 books; cover images are present for all books and OCR text for >90%; mean OCR text length ≈20 words, blurb length up to 1,786 words (Nareti et al., 5 May 2025).
Splitting conventions:
- HiGeMine: 70% train, 10% validation, 20% test.
- IMAGINE: 80% train, 10% validation, 10% test.
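Such ratio-based splits can be realized with a seeded shuffle. A minimal sketch using HiGeMine's 70/10/20 convention (the seed and the use of integer book IDs are illustrative assumptions):

```python
import random

def split_dataset(book_ids, ratios=(0.7, 0.1, 0.2), seed=42):
    """Shuffle deterministically, then cut into train/val/test by ratio."""
    ids = list(book_ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

# HiGeMine's 9,375 books under a 70/10/20 split:
train, val, test = split_dataset(range(9375))
```

Seeding the shuffle makes the partition reproducible across runs, which matters for comparing baselines on fixed splits.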
Both employ augmentation protocols to mitigate class imbalance:
- Visual: SDEdit (diffusion) doubling images in underrepresented classes (Nareti et al., 5 May 2025).
- Textual: Gemini LLM paraphrasing for long-tail subgenres (Kumar et al., 24 Dec 2025, Nareti et al., 5 May 2025).
A plausible implication is that such augmentation strategies substantially increase minority-class representation, boosting multi-label metric stability for Level-2 categories.
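The bookkeeping behind such class-balancing can be sketched as follows. The papers generate new content (SDEdit images, Gemini paraphrases); here verbatim duplication stands in for the generative step, and the median-count threshold is an illustrative choice:

```python
from collections import Counter

def augment_minority(samples, labels, factor=2, threshold=None):
    """Replicate samples whose rarest subgenre falls below a count
    threshold (median class count by default). Verbatim copies stand
    in for SDEdit / LLM-paraphrase outputs."""
    counts = Counter(l for ls in labels for l in ls)
    if threshold is None:
        threshold = sorted(counts.values())[len(counts) // 2]  # median count
    out_x, out_y = [], []
    for x, ls in zip(samples, labels):
        reps = factor if min(counts[l] for l in ls) < threshold else 1
        out_x.extend([x] * reps)
        out_y.extend([ls] * reps)
    return out_x, out_y
```

Keying the replication factor on a sample's *rarest* label targets exactly the long-tail subgenres whose Level-2 metrics are otherwise unstable.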
4. Preprocessing and Review Filtering
Text-centric datasets introduce extensive preprocessing:
- Text cleaning: Removal of hyperlinks, emojis, filler tokens, non-alphabetic characters (Kumar et al., 24 Dec 2025).
- Vocabulary construction: Tokenization of filtered blurbs and user reviews.
HiGeMine employs a zero-shot semantic filtering stage (Sec. 4.1): each review is encoded via BERT, cosine similarity is computed with the blurb, and only reviews surpassing a dynamic threshold are retained. If the blurb is too short, filtering is disabled for that sample.
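The filtering stage can be sketched end to end. HiGeMine uses BERT sentence embeddings; the toy bag-of-words `embed` below is a stand-in for them, and using the mean similarity as the dynamic threshold is an illustrative assumption:

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words vector; a stand-in for BERT sentence embeddings."""
    return Counter(text.lower().split())

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def filter_reviews(blurb, reviews, min_blurb_tokens=10):
    """Keep reviews whose similarity to the blurb clears a dynamic
    threshold (mean similarity here); skip filtering for short blurbs."""
    if len(blurb.split()) < min_blurb_tokens:
        return reviews  # blurb too short: filtering disabled
    b = embed(blurb)
    sims = [cosine(b, embed(r)) for r in reviews]
    threshold = sum(sims) / len(sims)
    return [r for r, s in zip(reviews, sims) if s >= threshold]
```

The short-blurb escape hatch mirrors the paper's behavior: when the anchor text is too thin to embed meaningfully, all reviews are retained rather than risk discarding signal.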
IMAGINE deals with incomplete or noisy modalities, designating some blurbs or OCR texts as "challenged" and adapting feature extraction accordingly.
5. Formal Dataset Representation and Loss Functions
Both datasets formalize samples as $(x_i, y_i)$, with $x_i$ denoting multi-modal input features (text, image, metadata) and $y_i = (y_i^{(1)}, y_i^{(2)})$ the hierarchical genre labels.
Label assignment for HiGeMine (Kumar et al., 24 Dec 2025):

$$y_i^{(2)} \in \{0, 1\}^{K},$$

with $K = 28$ for fiction and $K = 29$ for non-fiction.
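The multi-hot Level-2 representation is straightforward to construct. A minimal sketch (the fiction subgenre list is an illustrative excerpt, not the full 28-entry set):

```python
def multi_hot(assigned, subgenres):
    """Map a book's subgenre set to a {0,1}^K indicator vector over the
    branch's subgenre list (K = 28 fiction / 29 non-fiction in HiGeMine)."""
    return [1 if g in assigned else 0 for g in subgenres]

FICTION = ["Mystery", "Romance", "Science-Fiction", "Fantasy", "Historical"]  # excerpt
vec = multi_hot({"Mystery", "Fantasy"}, FICTION)
# -> [1, 0, 0, 1, 0]
```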
IMAGINE (Nareti et al., 5 May 2025) implements hierarchical multi-label learning with an asymmetric multi-label loss (Eq. 1) and an overall hierarchical loss (Eq. 2):

$$\mathcal{L}_{\text{ASL}} = -\sum_{k=1}^{K} \Big[ y_k (1 - p_k)^{\gamma_+} \log p_k + (1 - y_k)\, p_k^{\gamma_-} \log(1 - p_k) \Big], \qquad \mathcal{L} = \mathcal{L}_{\text{BCE}} + \lambda\, \mathcal{L}_{\text{ASL}},$$

with $\mathcal{L}_{\text{BCE}}$ the binary BCE for Level-1.
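A NumPy sketch of this loss structure follows. The focusing parameters $\gamma_+$, $\gamma_-$ and the weight $\lambda$ are illustrative defaults, and the asymmetric term is written in its standard form, which may differ in detail from the paper's exact Eq. 1:

```python
import numpy as np

def asymmetric_loss(y, p, gamma_pos=1.0, gamma_neg=4.0, eps=1e-8):
    """Asymmetric multi-label loss over Level-2 labels: easy negatives
    are down-weighted more aggressively than positives."""
    pos = y * (1 - p) ** gamma_pos * np.log(p + eps)
    neg = (1 - y) * p ** gamma_neg * np.log(1 - p + eps)
    return -(pos + neg).sum()

def hierarchical_loss(y1, p1, y2, p2, lam=1.0, eps=1e-8):
    """Level-1 binary cross-entropy plus weighted Level-2 asymmetric loss."""
    bce = -(y1 * np.log(p1 + eps) + (1 - y1) * np.log(1 - p1 + eps))
    return bce + lam * asymmetric_loss(y2, p2)
```

The high `gamma_neg` is what makes the loss asymmetric: in a 29-way multi-label setting most labels are negative per book, so confident negatives are suppressed to keep gradients focused on the few positives.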
6. Annotation, Schema, and Quality Assurance
Annotation protocols entail:
- Taxonomy finalized by expert consensus: eight student annotators plus three linguistics experts, with Krippendorff’s α = 0.83 (Nareti et al., 5 May 2025).
- Standardized files:
  - `images/`: cover art
  - `blurbs.csv`: bookID, raw blurb
  - `ocr/`: OCR text files
  - `metadata.csv`: bookID, author, publisher
  - `labels.csv`: bookID, Level-1/2 labels
- Public release planned as a ZIP archive with CSV splits and genres.json for taxonomy versioning.
- Co-occurrence graph constructed from training label overlaps; low-weight edges are pruned, high-weight edges binarized (Kumar et al., 24 Dec 2025).
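The graph construction can be sketched as follows; the pruning threshold `min_weight` is an illustrative value, not the paper's:

```python
from itertools import combinations
from collections import Counter

def cooccurrence_graph(label_sets, min_weight=2):
    """Count subgenre co-assignments over training books, prune edges
    below min_weight, and binarize the survivors (weight -> 1)."""
    weights = Counter()
    for labels in label_sets:
        for a, b in combinations(sorted(labels), 2):
            weights[(a, b)] += 1
    return {edge for edge, w in weights.items() if w >= min_weight}

train_labels = [
    {"Mystery", "Thriller"},
    {"Mystery", "Thriller", "Crime"},
    {"Romance", "Historical"},
]
edges = cooccurrence_graph(train_labels)
# Only ("Mystery", "Thriller") co-occurs often enough to survive pruning
```

Binarizing after pruning keeps only the reliable genre affinities, so downstream models see a clean adjacency structure rather than raw, noisy counts.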
This ensures that label distributions and genre relationships are systematically documented, supporting reproducibility and downstream benchmarking.
7. Applications and Evaluation Protocols
Primary applications include:
- Digital library indexing & cross-category retrieval
- Genre-aware personalized recommendation
- Collection organization for market analysis
Evaluation metrics:
- Level-1 (coarse): Accuracy, binary F1
- Level-2 (fine): micro-F1, macro-F1, balanced accuracy, Hamming loss
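Two of the Level-2 metrics can be computed directly from multi-hot arrays. A minimal sketch over a toy two-book, three-subgenre example:

```python
import numpy as np

def hamming_loss(y_true, y_pred):
    """Fraction of label slots that disagree across all books."""
    return float(np.mean(y_true != y_pred))

def micro_f1(y_true, y_pred):
    """F1 over globally pooled true positives, false positives, false negatives."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

y_true = np.array([[1, 0, 1], [0, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 1]])
# 2 of 6 slots disagree -> Hamming loss 1/3; tp=2, fp=1, fn=1 -> micro-F1 = 2/3
```

Micro-F1 pools counts across all subgenres, so it is dominated by frequent classes; macro-F1 (an unweighted mean of per-class F1) is what exposes long-tail performance, which is why both are reported together.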
Baseline models referenced include hierarchical text classifiers, pretrained language model (PLM)-based architectures, and both open- and closed-source LLMs (Kumar et al., 24 Dec 2025, Nareti et al., 5 May 2025). The use of the label co-occurrence graph and token-document graphs is recommended to model inter-genre dependencies and enhance multi-label consistency.
This suggests that integration of multi-modal and hierarchy-aware datasets significantly advances the granularity and reliability of automated genre classification, addressing limitations of flat, single-label methods and noisy user-driven signals (Kumar et al., 24 Dec 2025, Nareti et al., 5 May 2025).