
Hierarchical Cross-Entropy (HXE)

Updated 14 February 2026
  • Hierarchical Cross-Entropy (HXE) is a loss function that uses tree-structured label taxonomies to assign variable penalties based on semantic proximity.
  • HXE decomposes the prediction into conditional probabilities via sibling softmax normalization along the class hierarchy, enhancing error calibration.
  • Weighted variants of HXE introduce per-node balancing to tackle data imbalance, yielding competitive performance on datasets with hierarchical labels.

Hierarchical Cross-Entropy (HXE) is a principled extension of classical cross-entropy loss for supervised classification tasks where the label space possesses nontrivial hierarchical structure. Standard cross-entropy ignores relationships among classes, treating all misclassifications as equally severe. HXE exploits a predefined class taxonomy—typically a tree or, in some cases, a directed acyclic graph (DAG)—to modulate the loss according to label proximity, yielding models that make "better mistakes" by preferring semantically closer confusions and by leveraging all available label granularity in a unified framework (Villar et al., 2023, Bertinetto et al., 2019).

1. Formal Definitions and Theoretical Framework

Let $\mathcal{C}$ denote the set of label classes, equipped with a rooted directed tree $\mathcal{H} = (V, E)$ representing the class taxonomy. Each node $v \in V$ is either a class (labels at the leaf nodes) or an internal node (superclass), with a unique path from the root $r$ to each leaf $c \in \mathcal{C}$:

$$c^{(H)} = r \to c^{(H-1)} \to \cdots \to c^{(0)} = c$$

Given a neural network that predicts logit scores for every node, the standard categorical cross-entropy (CE) loss for a one-hot ground truth $t \in \{0,1\}^C$ and softmax probabilities $p \in [0,1]^C$ is:

$$L_{\mathrm{CE}}(p, t) = -\sum_{i=1}^{C} t_i \log p_i$$

HXE factorizes the probability of a leaf label $c^{(0)}$ via the chain rule over the class hierarchy:

$$p\left(c^{(0)}\right) = \prod_{h=0}^{H-1} p\left(c^{(h)} \mid c^{(h+1)}\right)$$

where conditional probabilities are computed by normalizing softmax outputs within sibling groups at each level. The core HXE loss for a ground-truth leaf $c$ is:

$$L_{\mathrm{HXE}}(p, c) = -\sum_{h=0}^{H-1} \lambda\left(c^{(h)}\right) \log p\left(c^{(h)} \mid c^{(h+1)}\right)$$

with $\lambda(\cdot)$ a hierarchy-dependent weighting; a common choice is $\lambda(n) = \exp(-\alpha \, \mathrm{depth}(n))$, where the hyperparameter $\alpha$ regulates the emphasis on coarse (closer to the root) versus fine (closer to the leaves) errors (Bertinetto et al., 2019).
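As a concrete sketch, the HXE sum can be evaluated directly from the conditional probabilities along the ground-truth path; the toy taxonomy, path probabilities, and depth convention below are invented for illustration:

```python
import math

# Toy taxonomy: root -> {animal, vehicle}, animal -> {cat, dog}.
# Ground-truth leaf "cat" has the path root -> animal -> cat.
# Conditional probabilities along that path (hypothetical values, as
# would come from sibling softmaxes): p(animal | root), p(cat | animal).
path_conditionals = [0.7, 0.8]   # ordered from the root downward
node_depths = [1, 2]             # depth of each path node below the root

def hxe_loss(conditionals, depths, alpha=0.5):
    """Hierarchical cross-entropy: a lambda-weighted sum of the
    negative log conditional probabilities along the label path."""
    total = 0.0
    for p, d in zip(conditionals, depths):
        lam = math.exp(-alpha * d)   # exponential depth discounting
        total -= lam * math.log(p)
    return total

loss = hxe_loss(path_conditionals, node_depths)
```

With `alpha = 0` every level is weighted equally and the loss reduces to the ordinary negative log-probability of the leaf, $-\log p(\text{cat}) = -\log(0.7 \cdot 0.8)$; larger `alpha` shifts emphasis toward errors near the root.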

Weighted Hierarchical Cross-Entropy (WHXE) introduces a per-node class-rebalancing weight $W(\cdot)$ to correct data imbalance, alongside a flexible $\lambda(\cdot)$ for level weighting:

$$L_{\mathrm{WHXE}}(p, c) = -\sum_{h=0}^{H-1} W\left(c^{(h)}\right) \lambda\left(c^{(h)}\right) \log p\left(c^{(h)} \mid c^{(h+1)}\right)$$

$$W(v) = \frac{N_{\mathrm{All}}}{N_{\mathrm{Labels}}\, N_v}$$

where $N_v$ is the number of samples whose label path traverses node $v$, $N_{\mathrm{All}}$ is the total number of samples, and $N_{\mathrm{Labels}}$ is the number of labels in the hierarchy (Villar et al., 2023).
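The per-node counts $N_v$, and hence $W(v)$, can be precomputed in one pass over the training labels. A minimal sketch (the taxonomy and counts are made up for illustration):

```python
from collections import Counter

# Hypothetical leaf-label counts and parent map for a toy taxonomy.
leaf_counts = {"cat": 30, "dog": 10, "car": 60}
parents = {"cat": "animal", "dog": "animal", "car": "vehicle",
           "animal": "root", "vehicle": "root"}

# A sample "traverses" its leaf and every ancestor, so N_v for an
# internal node is the sum of counts over its descendant leaves.
node_counts = Counter()
for leaf, n in leaf_counts.items():
    node = leaf
    while node in parents:           # walk up toward the root
        node_counts[node] += n
        node = parents[node]
    node_counts[node] += n           # count the root itself

N_all = sum(leaf_counts.values())    # total number of samples
N_labels = len(node_counts)          # number of taxonomy nodes
W = {v: N_all / (N_labels * n_v) for v, n_v in node_counts.items()}
```

Rare nodes receive large weights (here `W["dog"] > W["car"]`), which is exactly the long-tail compensation WHXE is designed for.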

2. Hierarchical Probability Factorization and Taxonomy Encoding

In HXE-based training, input labels are interpreted as paths through the taxonomy. The model emits logits for each node; softmax normalization is performed over each sibling group (children of the same parent) so that probabilities sum to unity at every branch point. This design guarantees a valid, uniquely factorized probability for any leaf, leveraging the tree's structure. Each mini-batch sample's loss decomposes additively along the observed label path; errors at higher hierarchical levels are weighted by $\lambda(\cdot)$, allowing practitioners to trade off coarser versus finer distinctions as required.

The taxonomy graph must be a rooted tree or DAG—with the latter requiring unique paths from root to any leaf to ensure unique factorization. Internal nodes can represent semantic superclasses, fault categories, biological taxa, or any domain-specific hierarchy (Villar et al., 2023, Bertinetto et al., 2019).
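A minimal sketch of sibling-group normalization (the node names and logit values are assumptions): softmax is taken only over the children of each parent, so the conditionals at every branch point sum to one, and leaf probabilities follow from the chain rule:

```python
import numpy as np

# Children of each parent in a toy taxonomy.
children = {"root": ["animal", "vehicle"],
            "animal": ["cat", "dog"],
            "vehicle": ["car", "truck"]}
# One raw logit per non-root node, e.g. from the network's final layer.
logits = {"animal": 1.0, "vehicle": -1.0,
          "cat": 0.5, "dog": 0.5, "car": 2.0, "truck": 0.0}

def sibling_softmax(children, logits):
    """Softmax within each sibling group, giving p(child | parent)."""
    cond = {}
    for parent, kids in children.items():
        z = np.array([logits[k] for k in kids])
        z -= z.max()                       # numerical stability
        probs = np.exp(z) / np.exp(z).sum()
        for kid, p in zip(kids, probs):
            cond[kid] = p
    return cond

cond = sibling_softmax(children, logits)
# Chain rule for a leaf: p(cat) = p(animal | root) * p(cat | animal)
p_cat = cond["animal"] * cond["cat"]
```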

3. Algorithmic Integration into Neural Network Pipelines

  • Architecture: The model's final layer produces $|V|$ logits, one per taxonomy node.
  • Softmax Sibling Blocks: Softmax is applied to each set of sibling nodes, producing conditional probabilities required for the loss.
  • Forward Pass: For each sample, the conditional probabilities along the true label's path are collected. The batch loss is the sum of the negative $W(\cdot)\lambda(\cdot)$-weighted log-conditional probabilities, averaged over the batch.
  • Training Details: Standard optimizers such as AdamW are fully compatible; the main hyperparameter specific to HXE/WHXE is $\alpha$.
  • Imbalance Handling: $W(v)$ compensates for long-tail distributions by up-weighting rare classes and down-weighting frequently observed nodes.
  • No Architectural Overhead at Inference: Only the training objective changes; the inference pipeline is unchanged, since the network already outputs probabilities for all leaves (Villar et al., 2023, Bertinetto et al., 2019).
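The per-sample path lookup in the steps above becomes cheap once each class's path node indices and combined $W \cdot \lambda$ weights are precomputed; the tree layout, weights, and shapes below are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
batch, num_nodes = 4, 6

# cond_logp[i, v] = log p(v | parent(v)) for sample i, as produced by
# sibling softmaxes over the network's per-node logits (simulated here).
cond_logp = np.log(rng.uniform(0.1, 1.0, size=(batch, num_nodes)))

# Precomputed per-class path node indices and combined W * lambda
# weights for a hypothetical 6-node tree with three leaf classes.
path_idx = {0: [1, 3], 1: [1, 4], 2: [2, 5]}
path_w   = {0: [0.6, 0.4], 1: [0.6, 0.4], 2: [0.5, 0.5]}
labels = np.array([0, 1, 2, 0])

# WHXE per sample: negative weighted sum of log-conditionals on the path.
losses = np.array([
    -(np.asarray(path_w[c]) * cond_logp[i, path_idx[c]]).sum()
    for i, c in enumerate(labels)
])
batch_loss = losses.mean()
```

In a real PyTorch pipeline the same gather-and-weight pattern would be expressed with index tensors so the whole batch is processed in one fused operation.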

4. Empirical Performance and Experimental Findings

Empirical validation on ZTF (Zwicky Transient Facility) astrophysical transient datasets demonstrates that WHXE permits training on all available data—overcoming the inefficiency of standard "flat" classifiers which discard examples lacking fine-level labels. On variable stars (45,748 objects, 35 leaves) and supernovae (4,206 objects, 5 coarse classes in a deeper tree), WHXE achieves micro-F1 and macro-F1 scores comparable to or slightly outperforming flat cross-entropy, while utilizing 100% of the training set.

| Task     | Method   | F1 (macro)    | F1 (micro)    | Frac. Used |
|----------|----------|---------------|---------------|------------|
| VarStars | WHXE     | 0.232 ± 0.004 | 0.894 ± 0.002 | 1.00       |
| VarStars | Baseline | 0.256 ± 0.002 | 0.899 ± 0.001 | 0.33       |
| SNs      | WHXE     | 0.485 ± 0.004 | 0.855 ± 0.001 | 1.00       |
| SNs      | Baseline | 0.447 ± 0.014 | 0.835 ± 0.008 | 0.45       |

On supernovae, WHXE improves both macro- and micro-F1 over the flat baseline; on variable stars it attains comparable scores while training on roughly three times as much data (Villar et al., 2023). Similar findings are observed for HXE and soft-label variants when applied to large, hierarchical vision datasets (tieredImageNet-H, iNaturalist'19), where trade-offs between flat and hierarchical error metrics are readily tunable via $\alpha$ or related hyperparameters (Bertinetto et al., 2019).

5. Practical Implementation and Recommendations

  • Hierarchy Preprocessing: Compute per-class paths, sibling relationships, and per-node/level weights before training.
  • Loss Computation: Implemented as a vectorized operation (see provided PyTorch pseudocode in (Bertinetto et al., 2019)), iterating over the hierarchical path for each training sample.
  • Hyperparameter Tuning: Conduct a grid search over $\alpha$; monitor both flat and hierarchical metrics for selection.
  • Generalizability: Applicable to any domain with semantic or operational taxonomies. No architecture modification is required beyond the addition of per-node outputs and sibling softmax blocks.
  • Custom Weighting: $W(\cdot)$ and $\lambda(\cdot)$ may be replaced with application-specific reweighting schemes or extended via data-driven, learnable functions.
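The hierarchy preprocessing in the first bullet can be as simple as deriving root-to-leaf paths and sibling groups from a parent map (the taxonomy below is a made-up example; in practice it would be loaded from the application's taxonomy file):

```python
# Parent map defining a toy taxonomy.
parents = {"cat": "animal", "dog": "animal", "car": "vehicle",
           "animal": "root", "vehicle": "root"}

def root_path(node):
    """Return the path from the root down to `node`."""
    path = [node]
    while path[-1] in parents:
        path.append(parents[path[-1]])
    return path[::-1]

# Sibling groups: children grouped by their shared parent; these define
# the blocks over which sibling softmaxes are taken.
siblings = {}
for child, parent in parents.items():
    siblings.setdefault(parent, []).append(child)

paths = {leaf: root_path(leaf) for leaf in ("cat", "dog", "car")}
```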

6. Extensions, Limitations, and Applicability

  • Applicability: Domains including astrophysical transients (Villar et al., 2023), biological taxonomy, document classification, engineering diagnostics, and fine-grained image recognition (Bertinetto et al., 2019).
  • Extensions: WHXE generalizes to some DAGs; can support zero-shot learning where new leaves are grafted dynamically.
  • Limitations: Requires a meaningful hierarchy—arbitrary or poor-quality taxonomies degrade performance. There is intrinsic tension: improving the semantic tolerance of mistakes often reduces flat top-1 accuracy.
  • Potential Research Directions: Learnable weighting functions, support for more general graph structures, incorporation of real-valued hierarchy edge costs, or mixtures of expert architectures at internal nodes (Bertinetto et al., 2019).

7. Relationship to Prior Hierarchical Methods

HXE and its weighted variants unify a variety of past approaches for hierarchical classification: label-embedding methods, conditional classifiers (e.g., YOLO-v2 trees of softmaxes), and hyperspherical outputs using LCA similarity. Experimental results indicate that HXE and soft-label variants dominate prior methods across Pareto fronts trading top-1 accuracy and hierarchical mistake severity. Implementing HXE by using network-predicted flat leaf probabilities and computing hierarchy-derived conditionals in the loss outperforms directly predicting conditionals via sibling softmaxes (Bertinetto et al., 2019).

In summary, Hierarchical Cross-Entropy and its weighted generalizations provide an efficient, principled, and empirically validated loss design for deep learning on structured label spaces, combining data efficiency, flexible granularity, and semantically calibrated error penalization (Villar et al., 2023, Bertinetto et al., 2019).
