HiGIA: Hierarchical Granularity-Aware Interval Aggregation
- The paper introduces HiGIA, a neural aggregation method that learns multilevel probability distributions to capture subjective uncertainties in generative content evaluations.
- It replaces direct mean opinion score regression with an interval-constrained approach, enhancing calibration and interpretability by leveraging both fine and coarse score intervals.
- Empirical evaluations on song aesthetics datasets show significant improvements in metrics like MSE and correlation coefficients compared to traditional scalar regression methods.
Hierarchical Granularity-Aware Interval Aggregation (HiGIA) is a neural aggregation module designed to address the challenges of uncertainty, subjectivity, and granularity inherent in human-centric evaluations of complex generative content. Originally introduced for automated song aesthetics assessment, HiGIA replaces direct scalar regression with a multistep process that leverages hierarchical probability distributions over intervals, capturing human imprecision and cross-rater variability in scores. This approach is particularly useful in scenarios where supervised targets, such as mean opinion scores (MOS), are inherently subjective or imprecisely defined, and where the distribution of ratings carries semantic information not captured by a single point estimate (Lv et al., 18 Jan 2026).
1. Motivation and Context
Standard regression heads in automated evaluation models produce a deterministic prediction for metrics such as MOS. However, empirical evidence demonstrates that human evaluation of musical and creative works yields distributions spanning discrete score intervals, reflecting not only anchoring and context effects but also explicit uncertainty and disagreement among raters. Direct mean regression either collapses or ignores this structure, often leading to poor calibration and loss of valuable information for downstream decision-making.
HiGIA is designed to address these issues by:
- Learning score probability distributions at multiple granularities (e.g., fine, medium, coarse)
- Aggregating these distributions into an interval of plausible scores, rather than committing to a single scalar
- Regressing the final score within the most probable interval, thus aligning the prediction not only with the mean but also with the spread and modality of the human label distribution
This design enables models to better mimic the structure of human judgment and uncertainty, leading to improved calibration, robustness, and interpretability in aesthetics evaluation tasks (Lv et al., 18 Jan 2026).
2. Architectural Components
The HiGIA architecture decomposes the prediction task into three main stages:
- Multi-Granularity Probabilistic Prediction: The model predicts probability vectors over score bins at multiple levels of granularity. For example, fine-grained intervals (e.g., 0.1 increments) capture detailed nuances, while coarser intervals (e.g., full points) account for broader trends.
- Interval Aggregation: Probability vectors from each granularity are aggregated, often via a learned or deterministic hierarchical mapping, to identify the interval(s) most compatible with both fine- and coarse-level predictions.
- Interval-Constrained Regression: A regression head produces a continuous score prediction, but this regression is constrained or informed by the selected interval: the prediction is regularized to lie within (or proximal to) this interval, penalizing predictions that stray from the support suggested by the probabilistic estimates.
The following table outlines the output of each stage:
| Stage | Output | Purpose |
|---|---|---|
| Multi-Granularity Distribution | Probability vectors at each granularity | Encode uncertainty at several levels |
| Interval Aggregation | Likely interval(s) per instance | Localize prediction to most credible intervals |
| Interval-Constrained Regression | Scalar within interval | Achieve score prediction with calibrated bounds |
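As a concrete sketch of the first stage, the multi-granularity heads can be implemented as independent softmax classifiers over bin logits. The setup below (a 1–5 score range, one coarse head with one-point bins and one fine head with 0.1-point bins, and the layer sizes) is an illustrative assumption, not the paper's exact configuration:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a logit vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical setup: scores in [1, 5], a coarse head with 4 one-point
# bins and a fine head with 40 bins of width 0.1, both fed by the same
# fused feature vector.
rng = np.random.default_rng(0)
h = rng.normal(size=16)                          # stand-in fused features
W_coarse, b_coarse = rng.normal(size=(4, 16)), np.zeros(4)
W_fine, b_fine = rng.normal(size=(40, 16)), np.zeros(40)

p_coarse = softmax(W_coarse @ h + b_coarse)      # coarse distribution
p_fine = softmax(W_fine @ h + b_fine)            # fine distribution
```

Each head yields a proper probability distribution over its own partition of the score range; the subsequent stages consume these vectors jointly.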
3. Mathematical Formulation
Let $G$ denote the number of granularity levels (e.g., fine, medium, coarse). At each level $g \in \{1, \dots, G\}$, the model produces a probability vector $p^{(g)}$ over $K_g$ non-overlapping bins/intervals covering the possible score range.
- For a given input $x$, compute
$$p^{(g)} = \mathrm{softmax}\big(W^{(g)} h(x) + b^{(g)}\big), \qquad g = 1, \dots, G,$$
where $h(x)$ is the fused feature representation, and $W^{(g)}$, $b^{(g)}$ are learnable parameters.
- Aggregate interval supports across granularities. One schema is to select the (coarsest) interval with the highest probability mass, then refine to subintervals at finer levels. Denote the selected interval as $[\ell, u]$.
- Predict a continuous score $\hat{y}$ via a regression sub-head, with the loss penalized for out-of-interval predictions:
$$\mathcal{L}_{\mathrm{reg}} = (\hat{y} - \bar{y})^2 + \lambda \, d\big(\hat{y}, [\ell, u]\big),$$
where $\bar{y}$ is the label mean, $d(\cdot, [\ell, u])$ computes the distance to the closest interval boundary (zero inside the interval), and $\lambda$ is a regularization hyperparameter.
A key aspect is that interval selection is a function of the predicted distributions at all granularity levels, ensuring predictions respect evidence from both coarse and fine partitions.
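The coarse-to-fine selection schema can be sketched as follows. The function `select_interval` and its specific refinement rule (pick the most probable coarse bin, then the most probable fine sub-bin inside it) are illustrative assumptions; the paper may instead use a learned aggregation:

```python
import numpy as np

def select_interval(p_coarse, p_fine, lo=1.0, hi=5.0):
    """Coarse-to-fine interval selection (one plausible aggregation
    scheme): pick the coarse bin with the most probability mass, then
    pick the fine sub-bin with the most mass inside it."""
    n_coarse, n_fine = len(p_coarse), len(p_fine)
    ratio = n_fine // n_coarse                # fine bins per coarse bin
    c = int(np.argmax(p_coarse))              # most probable coarse bin
    sub = p_fine[c * ratio:(c + 1) * ratio]   # fine mass within that bin
    f = c * ratio + int(np.argmax(sub))       # selected fine bin index
    w = (hi - lo) / n_fine                    # fine bin width
    return lo + f * w, lo + (f + 1) * w       # interval on the score scale

# Toy example: coarse mass favors the third bin (scores 3-4),
# and the fine mass inside it favors the first sub-bin.
p_c = np.array([0.1, 0.2, 0.6, 0.1])
p_f = np.full(40, 1 / 40.0)
p_f[20] = 0.5
p_f /= p_f.sum()
l, u = select_interval(p_c, p_f)              # -> (3.0, 3.1)
```

Because the fine argmax is taken only inside the winning coarse bin, the selected interval always respects the evidence from both partitions, as the text above describes.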
4. Training and Inference Procedure
During training, the full target rating distribution (or histogram) is used to supervise the multi-granularity classification heads via cross-entropy loss, while the regression head receives a standard mean-squared error loss as above, regularized by the interval constraint. If only summary statistics are available, a label smoothing or synthetic distribution based on measurement error may be used.
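A minimal sketch of this combined objective, assuming a single granularity head and simple additive weighting (the function names and the `lam` weighting are illustrative, not the paper's exact formulation):

```python
import numpy as np

def interval_distance(y, l, u):
    """Distance from y to the interval [l, u]; zero if y lies inside."""
    return max(l - y, 0.0, y - u)

def higia_style_loss(p_pred, p_target, y_pred, y_mean, l, u, lam=1.0):
    """Illustrative training objective: cross-entropy supervising one
    granularity head's distribution, plus interval-regularized MSE on
    the regression head."""
    ce = -np.sum(p_target * np.log(p_pred + 1e-12))   # distribution term
    reg = (y_pred - y_mean) ** 2 + lam * interval_distance(y_pred, l, u)
    return ce + reg

# Toy values: uniform prediction vs. a one-hot target distribution,
# regression output inside the selected interval [3.0, 4.0].
loss = higia_style_loss(
    p_pred=np.full(4, 0.25), p_target=np.array([0.0, 0.0, 1.0, 0.0]),
    y_pred=3.5, y_mean=3.4, l=3.0, u=4.0,
)
```

With the regression output inside the interval, the penalty term vanishes and only the MSE and cross-entropy contribute; predictions outside the interval incur an extra cost growing with their distance to the nearest boundary.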
At inference, the pipeline is:
- Predict multi-granular probability vectors
- Aggregate to select the most likely score interval(s)
- Output a scalar prediction within this interval (the regression output), together with the full probabilistic support (optional for interpretability)
This process ensures the model’s output can be interpreted as both a "best estimate" and a confidence-aware interval, facilitating applications requiring uncertainty estimates.
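The final step of the pipeline can be condensed into a small sketch, assuming clipping as the mechanism that keeps the reported scalar inside the selected interval (an assumed inference-time choice, not necessarily the paper's):

```python
def clip_to_interval(y, l, u):
    """Constrain the regression output to the selected interval;
    clipping is one simple way to enforce the constraint at test time."""
    return min(max(y, l), u)

# Suppose aggregation selected the interval [3.0, 3.1] and the regression
# head produced 3.27: the reported scalar stays within the interval,
# while the interval itself serves as a confidence-aware bound.
y_hat = clip_to_interval(3.27, 3.0, 3.1)   # -> 3.1
```

Reporting `y_hat` together with the interval endpoints gives downstream consumers both the point estimate and its probabilistic support.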
5. Empirical Evaluation
HiGIA was evaluated in the context of multi-dimensional song aesthetics evaluation on both the SongEval dataset (AI-generated songs) and an internal dataset of human-created works (Lv et al., 18 Jan 2026). Empirical results demonstrate that HiGIA, when integrated with a Multi-Stem Attention Fusion (MSAF) front end, leads to statistically significant improvements over state-of-the-art baselines in metrics such as MSE (mean squared error), LCC (linear correlation coefficient), SRCC (Spearman's rank correlation coefficient), and Kendall Tau (KTAU). The use of interval-based aggregation, as opposed to direct MOS regression, was identified as a key factor in capturing evaluation nuances.
6. Significance and Relation to Broader Methodologies
HiGIA represents a shift from direct scalar regression toward probabilistic and interval-based modeling in subjective assessment tasks. This paradigm aligns with a broader movement toward uncertainty-aware neural network architectures in tasks like subjective quality assessment, where modeling annotator disagreement and confidence is critical.
HiGIA’s hierarchical granularity design is distinct from flat classification or single-level ordinal regression. By encoding fine and coarse intervals jointly, it leverages the semantic structure of score spaces and allows more robust decision-making in the presence of human uncertainty.
Comparable interval or distributional approaches have appeared in speech scoring and emotion recognition, but HiGIA is, to date, the most explicit instantiation of hierarchical, granularity-aware interval aggregation for song aesthetics evaluation. A plausible implication is that adaptations of this method could generalize to other tasks where ordinal or grouped labels with ambiguous boundaries are the norm.
7. Limitations and Future Directions
While HiGIA demonstrably improves both accuracy and interpretability, its efficacy depends on the quality and granularity of the available label distributions. In settings with severely limited or ambiguous ground-truth, the benefits may diminish. Furthermore, the choice of granularity levels and aggregation schemes typically requires empirical tuning. Future research may explore joint end-to-end learning of both granularity hierarchies and aggregation strategies, as well as extensions to tasks with multimodal, cross-domain uncertainty.
References:
Song Aesthetics Evaluation with Multi-Stem Attention and Hierarchical Uncertainty Modeling (Lv et al., 18 Jan 2026)