
Hierarchical Book Genre Dataset

Updated 31 December 2025
  • Hierarchical Book Genre Dataset is a structured resource offering multi-level genre labels for books, enabling precise parent-child classification and nuanced recommendations.
  • It integrates multi-modal data sources like blurbs, reviews, cover images, and OCR text with standardized annotation protocols to support digital library indexing and cross-category retrieval.
  • Advanced techniques such as zero-shot semantic filtering and diffusion-based image augmentation enhance minority class representation and stabilize multi-label classification metrics.

A hierarchical book genre dataset is a structured resource that captures both the coarse and fine-grained genre labels for books, facilitating robust, multi-level classification in computational literary studies, digital library science, and recommender system development. Unlike flat genre datasets, the hierarchical format explicitly models the parent-child relationships between top-level categories (e.g., Fiction vs. Non-Fiction) and their respective subgenres. Recent datasets such as those introduced by the HiGeMine (Kumar et al., 24 Dec 2025) and IMAGINE (Nareti et al., 5 May 2025) frameworks provide annotated corpora with multi-modal content and carefully standardized taxonomies, enabling precise supervised learning, multi-label reasoning, and nuanced genre mining.

1. Taxonomy Architecture and Labeling Schemes

Hierarchical book genre datasets typically utilize a two-level taxonomy:

  1. Level-1 (coarse): Binary categorization into Fiction ($y^1_i = 0$) and Non-Fiction ($y^1_i = 1$).
  2. Level-2 (fine): Multi-label assignment from a set of distinct subgenres associated with each Level-1 branch. For HiGeMine (Kumar et al., 24 Dec 2025) this comprises 28 fiction subgenres (e.g., Mystery, Romance, Science-Fiction, Fantasy, Historical) and 29 non-fiction subgenres (e.g., Biography, Memoir, Self-Help, History, Science).

IMAGINE (Nareti et al., 5 May 2025) implements nearly identical top-level branching but defines 29 subgenres under each parent, with genre lists explicitly documented in the paper (see Section II and supplementary material).

Taxonomic formalization uses the following notation (cf. (Nareti et al., 5 May 2025)):

$$L = (\{0\} \times L_f) \cup (\{1\} \times L_{nf})$$

where $L_f$ and $L_{nf}$ are the fine label sets for fiction and non-fiction, respectively. For book $i$, labels are denoted $\mathcal{Y}_i = (\mathcal{Y}^i_1, \mathcal{Y}^i_2)$, with $\mathcal{Y}^i_1 \in \{0, 1\}$ and $\mathcal{Y}^i_2 \subseteq L_f$ or $\mathcal{Y}^i_2 \subseteq L_{nf}$, conditional on the coarse category.
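This label-space construction can be sketched directly; the subgenre lists below are small illustrative subsets, not the full 28/29-genre taxonomies:

```python
# Minimal sketch of the two-level label space; subgenre lists are
# illustrative placeholders drawn from the examples in Section 1.
L_f = {"Mystery", "Romance", "Science-Fiction"}   # fiction subgenres (subset)
L_nf = {"Biography", "Memoir", "Self-Help"}       # non-fiction subgenres (subset)

# L = ({0} x L_f) ∪ ({1} x L_nf): the set of all valid (coarse, fine) pairs
L = {(0, g) for g in L_f} | {(1, g) for g in L_nf}

def is_valid_label(y1, y2):
    """Check that every fine label in y2 is consistent with coarse label y1."""
    return all((y1, g) in L for g in y2)

print(is_valid_label(0, {"Mystery", "Romance"}))   # True
print(is_valid_label(1, {"Mystery"}))              # False: Mystery is fiction-only
```

The conditional structure ($\mathcal{Y}^i_2 \subseteq L_f$ or $L_{nf}$ depending on $\mathcal{Y}^i_1$) is exactly what the validity check enforces.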

2. Data Composition and Source Modalities

Hierarchical book genre datasets aggregate multiple data modalities per book:

| Modality | HiGeMine (Kumar et al., 24 Dec 2025) | IMAGINE (Nareti et al., 5 May 2025) |
|---|---|---|
| Blurb | 1 per book, crawled (Goodreads) | 1 per book, web-scraped |
| User reviews | 9–10 per book (filtered) | Not a primary modality |
| Cover image | Not included | JPEG/PNG, all present |
| OCR text | Not included | Gemini Pro Vision, >90% available |
| Metadata | Not included | Author/publisher, 95% available |

HiGeMine prioritizes textual sources, combining authoritative blurbs and filtered, crowd-sourced reviews. IMAGINE employs a multi-modal schema, integrating cover images (processed via Swin-Transformer), OCR-extracted text, and metadata (with missing values encoded as zero-vectors via TransD knowledge-graph embeddings).

Both datasets are sourced from Goodreads, with explicit annotation strategies and taxonomy standardization. HiGeMine utilizes blurb/review filtration via zero-shot semantic alignment; IMAGINE enforces data resilience by dynamic source selection and augmentation.

3. Data Volume, Splits, and Augmentation

Dataset statistics reveal substantial sample sizes and rigorous partitioning:

  • HiGeMine: 5,612 fiction + 3,763 non-fiction = 9,375 books, totalling 90,499 filtered reviews; average blurb/review lengths of roughly 100–183 words (Kumar et al., 24 Dec 2025).
  • IMAGINE: 6,704 fiction + 4,598 non-fiction = 11,302 books; cover images and OCR text are present for $>90\%$ of books, mean OCR text length is ≈20 words, and blurbs run up to 1,786 words (Nareti et al., 5 May 2025).

Splitting conventions:

  • HiGeMine: 70% train, 10% validation, 20% test.
  • IMAGINE: 80% train, 10% validation, 10% test.
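A split along these lines can be sketched as follows; the helper name and shuffling seed are illustrative, not taken from either paper:

```python
import random

def split_books(book_ids, ratios=(0.7, 0.1, 0.2), seed=42):
    """Shuffle and partition book IDs into train/val/test by the given
    ratios (HiGeMine uses 70/10/20; IMAGINE uses 80/10/10)."""
    ids = list(book_ids)
    random.Random(seed).shuffle(ids)  # deterministic shuffle for reproducibility
    n = len(ids)
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

train, val, test = split_books(range(9375))   # HiGeMine-sized corpus
print(len(train), len(val), len(test))        # 6562 937 1876
```

Rounding remainders fall into the test partition, so the three splits always cover the full corpus exactly once.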

Both employ augmentation protocols to mitigate class imbalance; in particular, diffusion-based image augmentation can synthesize additional samples for under-represented subgenres.

A plausible implication is that such augmentation strategies substantially increase minority-class representation, boosting multi-label metric stability for Level-2 categories.

4. Preprocessing and Review Filtering

Text-centric datasets introduce extensive preprocessing:

  • Text cleaning: Removal of hyperlinks, emojis, filler tokens, non-alphabetic characters (Kumar et al., 24 Dec 2025).
  • Vocabulary construction: Tokenization of filtered blurbs and user reviews.

HiGeMine employs a zero-shot semantic filtering stage (Sec. 4.1): each review is encoded via BERT, cosine similarity is computed with the blurb, and only reviews surpassing a dynamic threshold $\Psi$ are retained. If the blurb is too short, filtering is disabled for that sample.
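This filtering stage can be sketched as below, assuming blurb and review embeddings (e.g., from BERT) have already been computed; the threshold value and the minimum blurb length are hypothetical, standing in for the paper's dynamic $\Psi$:

```python
import numpy as np

def filter_reviews(blurb_vec, review_vecs, psi=0.5, blurb_tokens=50,
                   min_blurb_tokens=10):
    """Keep indices of reviews whose cosine similarity with the blurb
    embedding exceeds psi. If the blurb is too short, filtering is
    disabled and all reviews are kept, mirroring the fallback in Sec. 4.1."""
    if blurb_tokens < min_blurb_tokens:
        return list(range(len(review_vecs)))
    b = blurb_vec / np.linalg.norm(blurb_vec)
    kept = []
    for j, r in enumerate(review_vecs):
        sim = float(b @ (r / np.linalg.norm(r)))  # cosine similarity
        if sim > psi:
            kept.append(j)
    return kept

blurb = np.array([1.0, 0.0])
reviews = [np.array([1.0, 0.1]), np.array([0.0, 1.0])]
print(filter_reviews(blurb, reviews, psi=0.5))   # [0]: only the on-topic review
```

Swapping in real BERT pooled embeddings leaves the logic unchanged; only the vector dimensionality grows.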

IMAGINE deals with incomplete or noisy modalities, designating some blurbs or OCR texts as "challenged" and adapting feature extraction accordingly.

5. Formal Dataset Representation and Loss Functions

Both datasets formalize samples as $D = \{(x_i, y_i)\}_{i=1}^n$, with $x_i$ denoting multi-modal input features (text, image, metadata) and $y_i$ hierarchical genre labels.

Label assignment for HiGeMine (Kumar et al., 24 Dec 2025):

$$y_i = (y_i^1, y_i^2), \quad y_i^1 \in \{0,1\}, \quad y_i^2 \in \{0,1\}^{k}$$

with $k=28$ for fiction and $k=29$ for non-fiction.
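A minimal multi-hot encoding in this format might look as follows; the genre index is an illustrative subset of the full 28-genre fiction list:

```python
import numpy as np

# Illustrative subgenre index; the real fiction taxonomy has k = 28 entries.
FICTION_GENRES = ["Mystery", "Romance", "Science-Fiction", "Fantasy", "Historical"]

def encode_labels(y1, subgenres, genre_list):
    """Encode a label pair (y1, y2): a binary coarse label plus a
    multi-hot vector y2 in {0,1}^k over the branch's subgenre list."""
    y2 = np.zeros(len(genre_list), dtype=int)
    for g in subgenres:
        y2[genre_list.index(g)] = 1
    return y1, y2

y1, y2 = encode_labels(0, {"Mystery", "Fantasy"}, FICTION_GENRES)
print(y1, y2.tolist())   # 0 [1, 0, 0, 1, 0]
```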

IMAGINE (Nareti et al., 5 May 2025) implements hierarchical multi-label learning with asymmetric multi-label loss (Eq. 1) and overall hierarchical loss (Eq. 2):

$$\mathcal{L}_M^{F(i)} = \frac{1}{m_1} \sum_{j=1}^{m_1}\left(\ell^{ij+}_{MF} + \ell^{ij-}_{MF}\right)$$

$$\mathcal{L} = \frac{1}{n} \sum_{i=1}^{n}\left[\mathcal{L}_1^{(i)} + Y^i_1\,\mathcal{G}(\hat Y^i_1)\,\mathcal{L}_F^{2(i)} + (1-Y^i_1)\left(1 - \mathcal{G}(\hat Y^i_1)\right)\mathcal{L}_N^{2(i)}\right]$$

with $\mathcal{L}_1^{(i)}$ the binary cross-entropy (BCE) loss for Level-1.
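A simplified rendering of the overall hierarchical loss: the gate $\mathcal{G}$ is taken to be the predicted Level-1 probability, and the asymmetric Level-2 losses are replaced by plain BCE; both are simplifying assumptions relative to the paper's formulation:

```python
import numpy as np

def bce(p, y, eps=1e-7):
    """Binary cross-entropy averaged over label slots."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    y = np.asarray(y, dtype=float)
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)).mean())

def hierarchical_loss(y1, p1, y2f, p2f, y2n, p2n):
    """Sketch of the hierarchical loss: Level-1 BCE plus gated Level-2
    branch losses, with the gate G set to the Level-1 prediction itself
    (an assumption) and plain BCE standing in for the asymmetric losses."""
    n = len(y1)
    total = 0.0
    for i in range(n):
        l1 = bce([p1[i]], [y1[i]])            # Level-1 term
        g = p1[i]                             # gate G(Y-hat_1)
        lf = bce(p2f[i], y2f[i])              # L_F branch multi-label loss
        ln = bce(p2n[i], y2n[i])              # L_N branch multi-label loss
        total += l1 + y1[i] * g * lf + (1 - y1[i]) * (1 - g) * ln
    return total / n

y1, p1 = [1, 0], [0.9, 0.2]
y2f = [np.array([1, 0]), np.array([0, 0])]
p2f = [np.array([0.8, 0.1]), np.array([0.5, 0.5])]
y2n = [np.array([0, 1]), np.array([1, 0])]
p2n = [np.array([0.3, 0.3]), np.array([0.9, 0.2])]
print(round(hierarchical_loss(y1, p1, y2f, p2f, y2n, p2n), 4))
```

The gating reproduces the structure of the equation: each sample contributes only the Level-2 branch consistent with its coarse label, weighted by the model's confidence.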

6. Annotation, Schema, and Quality Assurance

Annotation protocols entail:

  • Taxonomy finalized by expert consensus: eight student annotators plus three linguistics experts, with Krippendorff’s α = 0.83 (Nareti et al., 5 May 2025).
  • Standardized files:
    • images/: cover art
    • blurbs.csv: bookID, raw blurb
    • ocr/: OCR text files
    • metadata.csv: bookID, author, publisher
    • labels.csv: bookID, Level-1/2
  • Public release planned as a ZIP archive with CSV splits and genres.json for taxonomy versioning.
  • Co-occurrence graph $A^c$ constructed from training label overlaps; low-weight edges are pruned, high-weight edges binarized (Kumar et al., 24 Dec 2025).

This ensures that label distributions and genre relationships are systematically documented, supporting reproducibility and downstream benchmarking.
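The co-occurrence graph construction described above can be sketched from a multi-hot label matrix; the pruning threshold `tau` is a hypothetical value, since the papers' actual edge-weight cutoffs are not reproduced here:

```python
import numpy as np

def cooccurrence_graph(Y, tau=2):
    """Sketch of the A^c construction: count label co-occurrences over
    the training set, prune edges with weight below tau (a hypothetical
    threshold), and binarize the rest. Y is an (n_books, k) multi-hot
    label matrix."""
    A = Y.T @ Y                       # raw co-occurrence counts
    np.fill_diagonal(A, 0)            # ignore self-loops
    return (A >= tau).astype(int)     # prune low-weight edges, binarize

Y = np.array([[1, 1, 0],
              [1, 1, 0],
              [0, 1, 1]])
print(cooccurrence_graph(Y, tau=2))
# [[0 1 0]
#  [1 0 0]
#  [0 0 0]]
```

Here genres 0 and 1 co-occur in two books and keep their edge; the single co-occurrence of genres 1 and 2 falls below `tau` and is pruned.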

7. Applications and Evaluation Protocols

Primary applications include:

  • Digital library indexing & cross-category retrieval
  • Genre-aware personalized recommendation
  • Collection organization for market analysis

Evaluation metrics:

  • Level-1 (coarse): Accuracy, binary F1
  • Level-2 (fine): micro-F1, macro-F1, balanced accuracy, Hamming loss
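Two of the Level-2 metrics (micro-F1 and Hamming loss) can be implemented minimally for multi-hot label matrices:

```python
import numpy as np

def micro_f1(Y, P):
    """Micro-averaged F1 over a multi-hot label matrix
    (true labels Y, predicted labels P, both shape (n, k))."""
    tp = np.sum((Y == 1) & (P == 1))
    fp = np.sum((Y == 0) & (P == 1))
    fn = np.sum((Y == 1) & (P == 0))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def hamming_loss(Y, P):
    """Fraction of individual label slots predicted incorrectly."""
    return float(np.mean(Y != P))

Y = np.array([[1, 0, 1], [0, 1, 0]])
P = np.array([[1, 0, 0], [0, 1, 1]])
print(micro_f1(Y, P))       # ≈0.667 (tp=2, fp=1, fn=1)
print(hamming_loss(Y, P))   # ≈0.333 (2 of 6 slots wrong)
```

Micro-F1 pools counts across all subgenres, so it tracks overall accuracy on frequent classes; macro-F1 (averaging per-class F1) is the metric that exposes minority-class degradation.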

Baseline models referenced include hierarchical text classifiers, pretrained LLM (PLM)-based architectures, and both open and closed LLMs (Kumar et al., 24 Dec 2025, Nareti et al., 5 May 2025). The use of the label co-occurrence graph $A^c$ and token-document graphs $G_b$, $G_r$ is recommended to model inter-genre dependencies and enhance multi-label consistency.

This suggests that integration of multi-modal and hierarchy-aware datasets significantly advances the granularity and reliability of automated genre classification, addressing limitations of flat, single-label methods and noisy user-driven signals (Kumar et al., 24 Dec 2025, Nareti et al., 5 May 2025).
