Macro-F1 Score Overview
- Macro-F1 averages per-class F1 scores with equal weight, giving a clear measure of performance on imbalanced datasets.
- It is computed as the arithmetic mean of the individual per-class F1 scores, which makes it important for tasks such as medical coding and NLP where rare classes matter.
- By penalizing neglect of minority classes, the metric exposes performance disparities and guides model adjustments in multi-class and multi-label settings.
The macro-F1 score is a class-decomposable metric for evaluation of classification systems, defined as the unweighted arithmetic mean of the per-class F₁ scores. Each class contributes equally, regardless of its prevalence, making macro-F1 particularly relevant for imbalanced datasets or fairness-sensitive tasks. It is employed across multi-class and multi-label settings as a primary indicator of system balance, penalizing neglect of rare or underrepresented classes just as severely as failure on dominant ones (Grandini et al., 2020, Harbecke et al., 2022, Kreuzthaler et al., 2021, Opitz, 2024, Gowda et al., 2021). Multiple disciplines in machine learning—including medical coding, activity recognition, relation classification, and machine translation—have adopted macro-F1 as a standard or supplementary metric, often alongside micro-averaged alternatives to provide a fuller picture of model behavior.
1. Formal Definition and Mathematical Properties
Let $C$ denote the number of classes. For each class $c \in \{1, \dots, C\}$, let $TP_c$, $FP_c$, and $FN_c$ be the true positive, false positive, and false negative counts, respectively, when class $c$ is treated as the positive label. The per-class precision and recall are defined as:

$$P_c = \frac{TP_c}{TP_c + FP_c}, \qquad R_c = \frac{TP_c}{TP_c + FN_c}.$$

The per-class F₁ score is the harmonic mean:

$$F1_c = \frac{2\,P_c R_c}{P_c + R_c}.$$

Macro-F₁ aggregates by averaging over all classes:

$$\text{macro-F1} = \frac{1}{C} \sum_{c=1}^{C} F1_c.$$

In multi-label settings, this averaging can be taken over labels for each instance, or across the entire batch (Lipton et al., 2014).
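The definition above can be sketched in Python, using one-vs-rest counting of $TP_c$, $FP_c$, $FN_c$ per class (function name and toy data are illustrative):

```python
def macro_f1(y_true, y_pred, classes):
    """Unweighted arithmetic mean of per-class F1 scores (one-vs-rest counts)."""
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Imbalanced toy data: the rare class "b" is always missed.
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 10
print(macro_f1(y_true, y_pred, ["a", "b"]))  # low despite 80% accuracy
```

Because the rare class contributes a per-class F₁ of zero, the macro average is pulled far below the raw accuracy, illustrating the equal-weighting property.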
Alternative but non-equivalent forms exist, notably the harmonic mean of macro-precision and macro-recall, but the class-decomposable arithmetic mean of the per-class $F1_c$ scores is the most widely adopted definition for macro-F1 reporting (Opitz et al., 2019, Opitz, 2024). The difference between these variants can be as large as 0.5 in extreme cases and is non-negligible even under moderate class imbalance (Opitz et al., 2019).
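The divergence between the two variants can be demonstrated numerically; the following sketch (illustrative helper, not code from the cited papers) constructs the extreme case where the gap reaches 0.5:

```python
def variants(per_class_pr):
    """per_class_pr: list of (precision, recall) pairs, one per class.
    Returns (mean of per-class F1s, harmonic mean of macro-P and macro-R)."""
    f1s = [2 * p * r / (p + r) if p + r else 0.0 for p, r in per_class_pr]
    mean_f1 = sum(f1s) / len(f1s)                              # class-decomposable macro-F1
    mp = sum(p for p, _ in per_class_pr) / len(per_class_pr)   # macro-precision
    mr = sum(r for _, r in per_class_pr) / len(per_class_pr)   # macro-recall
    hm = 2 * mp * mr / (mp + mr) if mp + mr else 0.0           # harmonic-mean variant
    return mean_f1, hm

# One class with perfect precision but zero recall, one with the reverse:
print(variants([(1.0, 0.0), (0.0, 1.0)]))  # → (0.0, 0.5)
```

Both per-class F₁ scores are zero, so the class-decomposable macro-F1 is 0, while the harmonic mean of macro-precision (0.5) and macro-recall (0.5) is 0.5, matching the maximal gap noted above.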
2. Rationale and Comparative Role
Macro-F₁'s core rationale is its invariance to class frequency: it forces a model to uniformly balance precision and recall across all classes. In contrast:
- Micro-F₁ aggregates the $TP_c$, $FP_c$, and $FN_c$ counts globally over all classes before computing the score, thus emphasizing overall accuracy and favoring dominant classes (Grandini et al., 2020, Harbecke et al., 2022).
- Weighted-F₁ weights each per-class $F1_c$ by class support (number of true instances), creating a compromise between macro and micro averaging (Grandini et al., 2020).
- AUC (area under the ROC curve) generally focuses on ranking and does not decompose into class-wise F₁s (Harbecke et al., 2022).
Scenarios in which macro-F₁ is preferred include those requiring equal attention to every class or robust evaluation under long-tail (rare) class distributions (Harbecke et al., 2022, Gowda et al., 2021, Liu et al., 2024). Micro-F₁, in contrast, may be more appropriate when performance on the majority class dominates real-world utility.
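The micro/macro contrast can be made concrete with a toy long-tail dataset; a minimal sketch (hypothetical helper names) computes both scores for a classifier that ignores the minority class entirely:

```python
def micro_macro(y_true, y_pred, classes):
    """Return (micro-F1, macro-F1) from one-vs-rest counts."""
    tps = fps = fns = 0
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        tps, fps, fns = tps + tp, fps + fp, fns + fn
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0)
    micro = 2 * tps / (2 * tps + fps + fns)   # global aggregation
    return micro, sum(f1s) / len(f1s)         # unweighted class average

# 95 majority examples, all correct; 5 minority examples, all missed:
y_true = ["maj"] * 95 + ["min"] * 5
y_pred = ["maj"] * 100
print(micro_macro(y_true, y_pred, ["maj", "min"]))
```

Micro-F₁ comes out near 0.95, reflecting the dominant class, while macro-F₁ falls below 0.5 because the missed minority class contributes a zero, which is exactly the behavior the comparison above describes.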
3. Applications and Experimental Usage
Macro-F₁ has seen widespread adoption in diverse fields:
- Automated disease code assignment: Macro-F₁ is used to evaluate models assigning ICD-10 codes, which present a pronounced class imbalance. In one study, macro-F₁ values of 0.83 (fastText), 0.84 (LSTM), and 0.88 (RoBERTa) were reported, with improvements attributed to deeper contextual representations and better capture of rare-code distinctions (Kreuzthaler et al., 2021).
- Fitness activity recognition: Macro-F₁ measures balanced detection across multiple activity classes with strong class imbalance (e.g., arm opener is rare). Fusion of modalities and contrastive learning improved macro-F₁ from 81.49% (IMU baseline) to 84.71% (IMU+contrastive) and 89.57% (sensor fusion) (Liu et al., 2024).
- Machine translation evaluation: Macro-F₁ over word types emphasizes strict adequacy by giving rare words the same impact as frequent function words, escaping Zipf’s law bias seen in micro-F₁, BLEU, or chrF. Macro-F₁ aligns better with human semantic adequacy judgments and downstream cross-lingual IR task success (Gowda et al., 2021).
- Relation classification and other NLP tasks: Macro-F₁ exposes long-tail performance weaknesses and enables fairer comparison of models under highly skewed label distributions (Harbecke et al., 2022).
4. Behavior Under Class Imbalance and Limitations
Macro-F₁'s equal weighting means that poor performance on rare classes can precipitously reduce the overall score, making it sensitive to a single misclassification in a minority class. This property is beneficial when rare outcomes are critical but can introduce instability if such classes are small or, more problematically, have noisy or inconsistent labels, as seen in ICD-10 coding or fitness activity detection (Kreuzthaler et al., 2021, Liu et al., 2024). Systems optimized solely for macro-F₁ may sacrifice utility on frequent classes and can be penalized if evaluation data contain annotation inconsistencies (Kreuzthaler et al., 2021).
Macro-F₁ is not invariant to prevalence shifts: artificially altering class distribution in the test set will alter macro-F₁, even if model performance per class remains unchanged (Opitz, 2024). This contrasts with micro-F₁, which is strictly proportional to the proportion of correctly classified examples.
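This prevalence sensitivity can be sketched directly from a confusion matrix: below, the per-class error rates are held fixed while class "b" is made five times more prevalent, and macro-F₁ still changes (helper name and counts are illustrative):

```python
def macro_f1_from_counts(conf):
    """conf[t][p] = number of examples with true class t predicted as p."""
    classes = list(conf)
    f1s = []
    for c in classes:
        tp = conf[c][c]
        fn = sum(v for p, v in conf[c].items() if p != c)
        fp = sum(conf[t][c] for t in classes if t != c)
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0)
    return sum(f1s) / len(f1s)

conf = {"a": {"a": 80, "b": 20},
        "b": {"a": 10, "b": 10}}
shifted = {"a": {"a": 80, "b": 20},      # identical per-class error rates,
           "b": {"a": 50, "b": 50}}      # but class "b" is 5x more prevalent
print(macro_f1_from_counts(conf), macro_f1_from_counts(shifted))
```

Both test sets have identical per-class recall (0.8 for "a", 0.5 for "b"), yet the two macro-F₁ values differ, because per-class precision depends on how many examples of the *other* classes spill into each predicted label.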
5. Thresholding, Optimization, and Reporting Practices
For probabilistic classifiers, threshold selection to maximize per-class F₁ is nontrivial. In multi-label settings, the optimal threshold for each label is half its maximal attainable F₁ if probabilities are well calibrated (Lipton et al., 2014). However, with uninformative classifiers or in rare class regimes, this can yield pathological all-positive predictions for rare classes, which may not be desirable in practice.
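In practice, a common alternative to relying on calibration is a per-label threshold sweep on held-out data; a minimal sketch (illustrative, not the exact procedure of Lipton et al.):

```python
def best_threshold(probs, labels, grid=None):
    """Sweep candidate thresholds for one label; return the F1-maximizing one."""
    grid = grid or [i / 100 for i in range(1, 100)]
    best_t, best_f1 = 0.5, -1.0
    for t in grid:
        tp = sum(1 for p, y in zip(probs, labels) if p >= t and y)
        fp = sum(1 for p, y in zip(probs, labels) if p >= t and not y)
        fn = sum(1 for p, y in zip(probs, labels) if p < t and y)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

# Held-out probabilities for one label (toy values):
t, f1 = best_threshold([0.9, 0.8, 0.4, 0.2], [1, 1, 1, 0])
```

A per-label sweep of this kind sidesteps the pathological all-positive behavior on rare classes, at the cost of requiring enough held-out positives per label for the estimate to be stable.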
Macro-F₁ must be reported alongside class support statistics and, ideally, per-class breakdowns to ensure transparency. When macro-F₁ diverges significantly from weighted or micro-F₁, this indicates issues with model coverage of rare classes (Grandini et al., 2020, Harbecke et al., 2022). In evaluation settings, it is essential to explicitly specify the precise macro-F₁ formula used to avoid ambiguity (Opitz, 2024, Opitz et al., 2019).
6. Variants, Controversies, and Best Practices
The literature documents several aggregation ambiguities:
- The standard, class-decomposable macro-F₁: the arithmetic mean of the per-class F₁ scores.
- The less common harmonic mean of macro-precision and macro-recall, which is not class-decomposable and can yield inflated scores under certain error distributions (Opitz et al., 2019, Opitz, 2024).
Best practice recommendations include unambiguously stating the metric and its formula, always motivating its choice based on deployment needs and class importance, and supplementing macro-F₁ with micro-F₁ and either per-class or weighted breakdowns (Opitz, 2024, Grandini et al., 2020). Researchers should be wary of tuning solely for macro-F₁ without monitoring shifts in overall accuracy or performance on dominant classes.
7. Extensions and Intermediate Weighting Schemes
Recent work explores intermediate weighting strategies between macro and micro, such as "dodrans" weighting, which scales each class's contribution by a fractional power of its support, and entropy-based schemes that balance sensitivity to rare classes against dominance by large ones (Harbecke et al., 2022). These alternatives offer nuanced evaluation lenses and can better reflect application-specific preferences. Nonetheless, macro-F₁ remains the canonical choice in settings demanding maximal class parity, especially when rare or tail phenomena are of intrinsic or regulatory concern.
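One possible sketch of such an intermediate scheme weights each per-class F₁ by a fractional power of its support; the exponent 0.75 below is an assumption chosen to illustrate the interpolation, not necessarily the exact dodrans formula of Harbecke et al.:

```python
def weighted_macro_f1(f1s, supports, alpha=0.75):
    """Weight per-class F1 by support**alpha. alpha=0 recovers plain macro-F1;
    alpha=1 gives the support-weighted (weighted-F1) average; intermediate
    alpha interpolates between the two. The 0.75 default is illustrative."""
    w = [s ** alpha for s in supports]
    return sum(wi * f for wi, f in zip(w, f1s)) / sum(w)

# One frequent strong class and one rare weak class:
f1s, supports = [0.9, 0.2], [100, 4]
plain_macro = weighted_macro_f1(f1s, supports, alpha=0.0)
weighted = weighted_macro_f1(f1s, supports, alpha=1.0)
dodrans_like = weighted_macro_f1(f1s, supports, alpha=0.75)
```

The intermediate score falls strictly between the macro and support-weighted extremes, damping but not eliminating the influence of the large class.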
Key References:
- “Metrics for Multi-Class Classification: an Overview” (Grandini et al., 2020)
- “Macro F1 and Macro F1” (Opitz et al., 2019)
- “A Closer Look at Classification Evaluation Metrics and a Critical Reflection of Common Evaluation Practice” (Opitz, 2024)
- “Secondary Use of Clinical Problem List Entries for Neural Network-Based Disease Code Assignment” (Kreuzthaler et al., 2021)
- “Why only Micro-F1? Class Weighting of Measures for Relation Classification” (Harbecke et al., 2022)
- “Macro-Average: Rare Types Are Important Too” (Gowda et al., 2021)
- “iMove: Exploring Bio-impedance Sensing for Fitness Activity Recognition” (Liu et al., 2024)
- “Thresholding Classifiers to Maximize F1 Score” (Lipton et al., 2014)