Hierarchical Book Genre Dataset
- Hierarchical Book Genre Dataset is a structured resource offering multi-level genre labels for books, enabling precise parent-child classification and nuanced recommendations.
- It integrates multi-modal data sources like blurbs, reviews, cover images, and OCR text with standardized annotation protocols to support digital library indexing and cross-category retrieval.
- Advanced techniques such as zero-shot semantic filtering and diffusion-based image augmentation enhance minority class representation and stabilize multi-label classification metrics.
A hierarchical book genre dataset is a structured resource that captures both the coarse and fine-grained genre labels for books, facilitating robust, multi-level classification in computational literary studies, digital library science, and recommender system development. Unlike flat genre datasets, the hierarchical format explicitly models the parent-child relationships between top-level categories (e.g., Fiction vs. Non-Fiction) and their respective subgenres. Recent datasets such as those introduced by the HiGeMine (Kumar et al., 24 Dec 2025) and IMAGINE (Nareti et al., 5 May 2025) frameworks provide annotated corpora with multi-modal content and carefully standardized taxonomies, enabling precise supervised learning, multi-label reasoning, and nuanced genre mining.
1. Taxonomy Architecture and Labeling Schemes
Hierarchical book genre datasets typically utilize a two-level taxonomy:
- Level-1 (coarse): Binary categorization into Fiction ($\mathcal{F}$) and Non-Fiction ($\mathcal{N}$).
- Level-2 (fine): Multi-label assignment from a set of distinct subgenres associated with each Level-1 branch. For HiGeMine (Kumar et al., 24 Dec 2025) this comprises 28 fiction subgenres (e.g., Mystery, Romance, Science-Fiction, Fantasy, Historical) and 29 non-fiction subgenres (e.g., Biography, Memoir, Self-Help, History, Science).
IMAGINE (Nareti et al., 5 May 2025) implements nearly identical top-level branching but defines 29 subgenres under each parent, with genre lists explicitly documented in the paper (see Section II and supplementary material).
Taxonomic formalization uses the following notation (cf. (Nareti et al., 5 May 2025)):

$$\mathcal{Y} = \mathcal{Y}_{\mathcal{F}} \cup \mathcal{Y}_{\mathcal{N}},$$

where $\mathcal{Y}_{\mathcal{F}}$ and $\mathcal{Y}_{\mathcal{N}}$ are the fine label sets for fiction and non-fiction, respectively. For book $b_i$, labels are denoted $(y_i^{(1)}, y_i^{(2)})$, with $y_i^{(1)} \in \{\mathcal{F}, \mathcal{N}\}$ and $y_i^{(2)} \subseteq \mathcal{Y}_{\mathcal{F}}$ or $y_i^{(2)} \subseteq \mathcal{Y}_{\mathcal{N}}$, conditional on the coarse category.
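The two-level scheme can be represented directly in code. A minimal sketch (the subgenre sets below are illustrative excerpts, not the full 28/29-entry lists from the papers):

```python
# Two-level genre taxonomy: coarse parents mapped to fine subgenre sets.
# Subgenre lists are illustrative excerpts, not the full taxonomy.
TAXONOMY = {
    "Fiction": {"Mystery", "Romance", "Science-Fiction", "Fantasy", "Historical"},
    "Non-Fiction": {"Biography", "Memoir", "Self-Help", "History", "Science"},
}

def validate_labels(level1, level2):
    """A label pair is consistent iff every fine label belongs to the
    subgenre set of the assigned coarse parent."""
    return level1 in TAXONOMY and set(level2) <= TAXONOMY[level1]

# Fine labels must match the coarse branch:
assert validate_labels("Fiction", {"Mystery", "Fantasy"})
assert not validate_labels("Fiction", {"Biography"})  # cross-branch label
```

The subset check encodes the conditional constraint above: a book's Level-2 labels are only valid relative to its Level-1 assignment.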
2. Data Composition and Source Modalities
Hierarchical book genre datasets aggregate multiple data modalities per book:
| Modality | HiGeMine (Kumar et al., 24 Dec 2025) | IMAGINE (Nareti et al., 5 May 2025) |
|---|---|---|
| Blurb | 1 per book, crawled (Goodreads) | 1 per book, web-scraped |
| User Reviews | 9–10 per book (filtered) | Not primary modality |
| Cover Image | Not included | JPEG/PNG, all present |
| OCR Text | Not included | Gemini Pro Vision, >90% avail. |
| Metadata | Not included | Author/publisher, 95% avail. |
HiGeMine prioritizes textual sources, combining authoritative blurbs and filtered, crowd-sourced reviews. IMAGINE employs a multi-modal schema, integrating cover images (processed via Swin-Transformer), OCR-extracted text, and metadata (with missing values encoded as zero-vectors via TransD knowledge-graph embeddings).
Both datasets are sourced from Goodreads, with explicit annotation strategies and taxonomy standardization. HiGeMine filters blurbs and reviews via zero-shot semantic alignment; IMAGINE enforces data resilience through dynamic source selection and augmentation.
3. Data Volume, Splits, and Augmentation
Dataset statistics reveal substantial sample sizes and rigorous partitioning:
- HiGeMine: 5,612 fiction + 3,763 non-fiction = 9,375 books, totalling 90,499 filtered reviews; average blurb/review lengths ~100–183 words (Kumar et al., 24 Dec 2025).
- IMAGINE: 6,704 fiction + 4,598 non-fiction = 11,302 books; cover images are present for all books and OCR text for >90%; mean OCR text length ≈20 words, blurb length up to 1,786 words (Nareti et al., 5 May 2025).
Splitting conventions:
- HiGeMine: 70% train, 10% validation, 20% test.
- IMAGINE: 80% train, 10% validation, 10% test.
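Such ratio-based splits can be realized with a seeded shuffle. A minimal sketch using HiGeMine's 70/10/20 convention (the seed and the use of integer book IDs are illustrative assumptions):

```python
import random

def split_dataset(book_ids, ratios=(0.7, 0.1, 0.2), seed=42):
    """Shuffle deterministically, then cut into train/val/test by ratio."""
    ids = list(book_ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

# HiGeMine's 9,375 books under a 70/10/20 split:
train, val, test = split_dataset(range(9375))
```

Seeding the shuffle makes the partition reproducible across runs, which matters for comparing baselines on fixed splits.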
Both employ augmentation protocols to mitigate class imbalance:
- Visual: SDEdit (diffusion) doubling images in underrepresented classes (Nareti et al., 5 May 2025).
- Textual: Gemini LLM paraphrasing for long-tail subgenres (Kumar et al., 24 Dec 2025, Nareti et al., 5 May 2025).
A plausible implication is that such augmentation strategies substantially increase minority-class representation, boosting multi-label metric stability for Level-2 categories.
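The bookkeeping behind such class-balancing can be sketched as follows. The papers generate new content (SDEdit images, Gemini paraphrases); here verbatim duplication stands in for the generative step, and the median-count threshold is an illustrative choice:

```python
from collections import Counter

def augment_minority(samples, labels, factor=2, threshold=None):
    """Replicate samples whose rarest subgenre falls below a count
    threshold (median class count by default). Verbatim copies stand
    in for SDEdit / LLM-paraphrase outputs."""
    counts = Counter(l for ls in labels for l in ls)
    if threshold is None:
        threshold = sorted(counts.values())[len(counts) // 2]  # median count
    out_x, out_y = [], []
    for x, ls in zip(samples, labels):
        reps = factor if min(counts[l] for l in ls) < threshold else 1
        out_x.extend([x] * reps)
        out_y.extend([ls] * reps)
    return out_x, out_y
```

Keying the replication factor on a sample's *rarest* label targets exactly the long-tail subgenres whose Level-2 metrics are otherwise unstable.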
4. Preprocessing and Review Filtering
Text-centric datasets introduce extensive preprocessing:
- Text cleaning: Removal of hyperlinks, emojis, filler tokens, non-alphabetic characters (Kumar et al., 24 Dec 2025).
- Vocabulary construction: Tokenization of filtered blurbs and user reviews.
HiGeMine employs a zero-shot semantic filtering stage (Sec. 4.1): each review is encoded via BERT, cosine similarity is computed with the blurb, and only reviews surpassing a dynamic threshold are retained. If the blurb is too short, filtering is disabled for that sample.
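The filtering stage can be sketched end to end. HiGeMine uses BERT sentence embeddings; the toy bag-of-words `embed` below is a stand-in for them, and using the mean similarity as the dynamic threshold is an illustrative assumption:

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words vector; a stand-in for BERT sentence embeddings."""
    return Counter(text.lower().split())

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def filter_reviews(blurb, reviews, min_blurb_tokens=10):
    """Keep reviews whose similarity to the blurb clears a dynamic
    threshold (mean similarity here); skip filtering for short blurbs."""
    if len(blurb.split()) < min_blurb_tokens:
        return reviews  # blurb too short: filtering disabled
    b = embed(blurb)
    sims = [cosine(b, embed(r)) for r in reviews]
    threshold = sum(sims) / len(sims)
    return [r for r, s in zip(reviews, sims) if s >= threshold]
```

The short-blurb escape hatch mirrors the paper's behavior: when the anchor text is too thin to embed meaningfully, all reviews are retained rather than risk discarding signal.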
IMAGINE deals with incomplete or noisy modalities, designating some blurbs or OCR texts as "challenged" and adapting feature extraction accordingly.
5. Formal Dataset Representation and Loss Functions
Both datasets formalize samples as $(x_i, y_i)$, with $x_i$ denoting multi-modal input features (text, image, metadata) and $y_i = (y_i^{(1)}, y_i^{(2)})$ the hierarchical genre labels.
Label assignment for HiGeMine (Kumar et al., 24 Dec 2025):

$$y_i^{(2)} \in \{0, 1\}^{K},$$

with $K = 28$ for fiction and $K = 29$ for non-fiction.
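The multi-hot Level-2 representation is straightforward to construct. A minimal sketch (the fiction subgenre list is an illustrative excerpt, not the full 28-entry set):

```python
def multi_hot(assigned, subgenres):
    """Map a book's subgenre set to a {0,1}^K indicator vector over the
    branch's subgenre list (K = 28 fiction / 29 non-fiction in HiGeMine)."""
    return [1 if g in assigned else 0 for g in subgenres]

FICTION = ["Mystery", "Romance", "Science-Fiction", "Fantasy", "Historical"]  # excerpt
vec = multi_hot({"Mystery", "Fantasy"}, FICTION)
# -> [1, 0, 0, 1, 0]
```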
IMAGINE (Nareti et al., 5 May 2025) implements hierarchical multi-label learning with an asymmetric multi-label loss (Eq. 1) and an overall hierarchical loss (Eq. 2):

$$\mathcal{L}_{\text{ASL}} = -\sum_{k=1}^{K} \Big[ y_k (1 - p_k)^{\gamma_+} \log p_k + (1 - y_k)\, p_k^{\gamma_-} \log(1 - p_k) \Big], \qquad \mathcal{L} = \mathcal{L}_{\text{BCE}} + \lambda\, \mathcal{L}_{\text{ASL}},$$

with $\mathcal{L}_{\text{BCE}}$ the binary BCE for Level-1.
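A NumPy sketch of this loss structure follows. The focusing parameters $\gamma_+$, $\gamma_-$ and the weight $\lambda$ are illustrative defaults, and the asymmetric term is written in its standard form, which may differ in detail from the paper's exact Eq. 1:

```python
import numpy as np

def asymmetric_loss(y, p, gamma_pos=1.0, gamma_neg=4.0, eps=1e-8):
    """Asymmetric multi-label loss over Level-2 labels: easy negatives
    are down-weighted more aggressively than positives."""
    pos = y * (1 - p) ** gamma_pos * np.log(p + eps)
    neg = (1 - y) * p ** gamma_neg * np.log(1 - p + eps)
    return -(pos + neg).sum()

def hierarchical_loss(y1, p1, y2, p2, lam=1.0, eps=1e-8):
    """Level-1 binary cross-entropy plus weighted Level-2 asymmetric loss."""
    bce = -(y1 * np.log(p1 + eps) + (1 - y1) * np.log(1 - p1 + eps))
    return bce + lam * asymmetric_loss(y2, p2)
```

The high `gamma_neg` is what makes the loss asymmetric: in a 29-way multi-label setting most labels are negative per book, so confident negatives are suppressed to keep gradients focused on the few positives.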
6. Annotation, Schema, and Quality Assurance
Annotation protocols entail:
- Taxonomy finalized by expert consensus: eight student annotators plus three linguistics experts, with Krippendorff’s α = 0.83 (Nareti et al., 5 May 2025).
- Standardized files:
  - `images/`: cover art
  - `blurbs.csv`: bookID, raw blurb
  - `ocr/`: OCR text files
  - `metadata.csv`: bookID, author, publisher
  - `labels.csv`: bookID, Level-1/2 labels
- Public release planned as a ZIP archive with CSV splits and genres.json for taxonomy versioning.
- Co-occurrence graph constructed from training label overlaps; low-weight edges are pruned, high-weight edges binarized (Kumar et al., 24 Dec 2025).
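The graph construction can be sketched as follows; the pruning threshold `min_weight` is an illustrative value, not the paper's:

```python
from itertools import combinations
from collections import Counter

def cooccurrence_graph(label_sets, min_weight=2):
    """Count subgenre co-assignments over training books, prune edges
    below min_weight, and binarize the survivors (weight -> 1)."""
    weights = Counter()
    for labels in label_sets:
        for a, b in combinations(sorted(labels), 2):
            weights[(a, b)] += 1
    return {edge for edge, w in weights.items() if w >= min_weight}

train_labels = [
    {"Mystery", "Thriller"},
    {"Mystery", "Thriller", "Crime"},
    {"Romance", "Historical"},
]
edges = cooccurrence_graph(train_labels)
# Only ("Mystery", "Thriller") co-occurs often enough to survive pruning
```

Binarizing after pruning keeps only the reliable genre affinities, so downstream models see a clean adjacency structure rather than raw, noisy counts.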
This ensures that label distributions and genre relationships are systematically documented, supporting reproducibility and downstream benchmarking.
7. Applications and Evaluation Protocols
Primary applications include:
- Digital library indexing & cross-category retrieval
- Genre-aware personalized recommendation
- Collection organization for market analysis
Evaluation metrics:
- Level-1 (coarse): Accuracy, binary F1
- Level-2 (fine): micro-F1, macro-F1, balanced accuracy, Hamming loss
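Two of the Level-2 metrics can be computed directly from multi-hot arrays. A minimal sketch over a toy two-book, three-subgenre example:

```python
import numpy as np

def hamming_loss(y_true, y_pred):
    """Fraction of label slots that disagree across all books."""
    return float(np.mean(y_true != y_pred))

def micro_f1(y_true, y_pred):
    """F1 over globally pooled true positives, false positives, false negatives."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

y_true = np.array([[1, 0, 1], [0, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 1]])
# 2 of 6 slots disagree -> Hamming loss 1/3; tp=2, fp=1, fn=1 -> micro-F1 = 2/3
```

Micro-F1 pools counts across all subgenres, so it is dominated by frequent classes; macro-F1 (an unweighted mean of per-class F1) is what exposes long-tail performance, which is why both are reported together.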
Baseline models referenced include hierarchical text classifiers, pretrained language model (PLM)-based architectures, and both open- and closed-source LLMs (Kumar et al., 24 Dec 2025, Nareti et al., 5 May 2025). The use of the label co-occurrence graph and token-document graphs is recommended to model inter-genre dependencies and enhance multi-label consistency.
This suggests that integration of multi-modal and hierarchy-aware datasets significantly advances the granularity and reliability of automated genre classification, addressing limitations of flat, single-label methods and noisy user-driven signals (Kumar et al., 24 Dec 2025, Nareti et al., 5 May 2025).