
ABSA-mix: Multilingual & Multi-domain Sentiment Benchmark

Updated 14 January 2026
  • ABSA-mix is a benchmark for aspect-based sentiment analysis that integrates 92 diverse domains and 6 languages using both real and synthetic annotated data.
  • It features an expanded sentiment taxonomy—including 'mixed' and 'unknown' labels—and supports both encoder-only and decoder-only protocols for robust evaluation.
  • The framework enables cross-domain, multilingual, and zero-/few-shot learning, driving universal evaluation and transfer in fine-grained sentiment research.

ABSA-mix refers primarily to a large-scale, multi-domain, and multilingual benchmark for aspect-based sentiment analysis (ABSA), as well as to unified modeling strategies for ABSA subtasks. The benchmark aggregates real and synthetic annotated texts across 92 domains and six languages, encompassing an expanded sentiment taxonomy and versatile modeling protocols. ABSA-mix is pivotal both for advancing reasoning-infused LLM architectures and for establishing universal evaluation standards in fine-grained sentiment modeling (Liskowski et al., 7 Jan 2026, Wang et al., 2022).

1. Composition, Sources, and Multilingual Coverage

ABSA-mix comprises 17 publicly available ABSA datasets (including eastwind/semeval-2016-absa-reviews, Alpaca69B/reviews_appstore_all_absa, SEntFiN, SilvioLima/absa, VocabVictor/acl2014_absa_twitter, jordiclive/FABSA, omymble/amazon-books-reviews-absa, SemEval_14_laptops, SemEval_14_restaurants, SemEval_15_restaurants, SemEval_16_restaurants, siat-nlp/MAMS-for-ABSA, stanfordnlp/imdb, stanfordnlp/sst2, Sp1786/multiclass-sentiment-analysis-dataset), collectively spanning 92 distinct application domains such as hotel reviews, app-store evaluations, financial headlines, online shopping, movie reviews, and Twitter posts.

Synthetic complements include SemEval14-synth (style-augmented and neutral-focused versions of SemEval-2014, 3,368 samples) and ABSA-synth (19,503 samples, covering 46 domains with explicitly engineered mixed and unknown aspect labels via Chain-of-Thought prompting).

To enable robust multilingual ABSA, the corpus is translated into six languages: English, French, German, Spanish, Italian, and Polish. Each language version is strictly parallel (~85,880 examples per language), sustaining dataset integrity and direct cross-lingual transfer evaluations.

Corpus totals (train+validation+test) exceed 85,000 labeled texts. Standard test splits are retained for each original dataset, and rigorous duplicate and data-leak controls are implemented.

2. Label Taxonomy and Annotation Schema

ABSA-mix expands the sentiment classification granularity beyond the canonical {positive, negative, neutral}. The annotated sentiment categories are:

  1. positive
  2. negative
  3. neutral
  4. mixed (assigned where aspect-level sentiment polarity is ambiguous or contradictory within the same text, e.g., due to source label conflicts or explicit synthetic prompt engineering)
  5. unknown (attributed to aspects explicitly absent in the text, with 25% of examples sampled and verified by LLM-based annotation "judges")

Aspect terms themselves number 21,819 from public sources and an additional 1,638 from synthetic data. These aspects cover product properties, service features, and domain-specific phenomena.

Overall sentiment for each text is jointly predicted with aspect labels, accommodating reviews that blend multiple aspect polarities.

3. Data Format, Modeling Protocols, and Output Structure

Two primary modeling protocols are supported:

  • Encoder-only models: Each instance follows

    [CLS] <review text> [SEP] <aspect term> [SEP]

    allowing per-aspect sentiment inference, suitable for fine-grained text classification architectures.
  • Decoder-only models: The prompt structure is

    Text: <review text>
    Aspects: <aspect1>, <aspect2>, ...
    Instructions: extract aspects and classify each into {positive, negative, neutral, mixed, unknown}.

    enabling simultaneous extraction and sentiment assignment over multiple aspects, compatible with generative LLMs.
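The two input layouts above can be rendered with a few lines of string formatting. The sketch below is illustrative only: the function names and the example review are assumptions, not part of the benchmark's tooling.

```python
def encoder_input(text: str, aspect: str) -> str:
    """One sequence per aspect for encoder-only models."""
    return f"[CLS] {text} [SEP] {aspect} [SEP]"

LABEL_SET = "{positive, negative, neutral, mixed, unknown}"

def decoder_prompt(text: str, aspects: list[str]) -> str:
    """Joint prompt for decoder-only models covering all aspects at once."""
    return (
        f"Text: {text}\n"
        f"Aspects: {', '.join(aspects)}\n"
        f"Instructions: extract aspects and classify each into {LABEL_SET}."
    )

review = "Battery lasts all day, but the price is steep."
print(encoder_input(review, "battery life"))
print(decoder_prompt(review, ["battery life", "price"]))
```

Note that the encoder protocol issues one forward pass per aspect, while the decoder protocol handles all aspects of a review in a single generation.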

Both protocols yield output in standardized JSON format, which can include a "thoughts" key denoting inference chains in reasoning-augmented modes.

Example JSON:

{
  "aspect_sentiments": [
    {"aspect": "battery life", "sentiment": "positive"},
    {"aspect": "price", "sentiment": "negative"}
  ],
  "overall_sentiment": "positive",
  "thoughts": "I see words like 'long-lasting' for battery, but 'expensive' for price…"
}
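Because both protocols emit this standardized JSON, downstream evaluation can share one parser. A minimal sketch, assuming only the schema shown above (the fallback to "unknown" for out-of-taxonomy labels is an assumption, not a documented benchmark rule):

```python
import json

VALID = {"positive", "negative", "neutral", "mixed", "unknown"}

def parse_absa_output(raw: str) -> dict[str, str]:
    """Map each aspect to its predicted sentiment; the optional
    'thoughts' key from reasoning-augmented modes is ignored."""
    data = json.loads(raw)
    out = {}
    for item in data.get("aspect_sentiments", []):
        label = item["sentiment"]
        out[item["aspect"]] = label if label in VALID else "unknown"
    return out

raw = '{"aspect_sentiments": [{"aspect": "price", "sentiment": "negative"}], "overall_sentiment": "negative"}'
print(parse_absa_output(raw))  # {'price': 'negative'}
```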

4. Evaluation Protocols and Metrics

Performance assessment is conducted with canonical classification metrics:

  • Accuracy: $\mathrm{Acc} = (TP + TN)/(TP + TN + FP + FN)$, per aspect or document
  • Precision, Recall, F1: Standard definitions per class
  • Macro-averaged F1: $F_1^{\mathrm{macro}} = \frac{1}{C} \sum_{c=1}^{C} F_1^c$, where $C$ is the number of classes

Each original dataset's test split serves as the in-domain evaluation set. For multilingual experiments, per-language accuracy is reported over the translated test subsets. All data are deduplicated and cross-source leak risks are mitigated by removing overlapping review texts.

5. Applications, Transfer, and Benchmarks

ABSA-mix is designed for several key use cases:

  • Cross-domain ABSA: Training across 92 domains and testing on held-out or previously unseen domains, evaluating generalization.
  • Multilingual ABSA: Training a single model that yields 87–91% accuracy across French, German, Spanish, Italian, and Polish, with no degradation on English.
  • Zero-/few-shot ABSA: Decoder-based LLMs can be prompted with minimal or no supervision to tackle new domains or languages.

The benchmark facilitates both encoder-based and decoder-based approaches, and enables reasoning-infused modeling via Chain-of-Thought fine-tuning and reasoning pretraining for encoder architectures. These techniques lead to significant generalization improvements in downstream ABSA tasks.

Benchmark Results

Model                      Accuracy
GPT-4o                     82.15%
Claude 3.5 Sonnet          83.65%
Mistral Large 2            82.77%
Llama 3.1-405B             83.08%
Llama 3.1-70B              81.03%
Llama 3.1-8B               76.65%
Arctic-Encoder             91.28%
Arctic-Encoder-thinking    91.24%
Arctic-Decoder             93.03%
Arctic-Decoder-thinking    92.99%

Top-performing models surpass GPT-4o and Claude 3.5 Sonnet by over 10 percentage points in accuracy. A single 395M multilingual encoder sustains high accuracy (87–91%) across all six languages (Liskowski et al., 7 Jan 2026).

6. UnifiedABSA and Multi-task Instruction Tuning

UnifiedABSA (sometimes referenced as ABSA-mix in literature) adopts a multi-task instruction tuning regime, recasting all 11 ABSA subtasks (including aspect term extraction, sentiment extraction, category detection, opinion extraction, and complex quad prediction) as conditional text-to-text problems. Each review is prepended with a Unified Sentiment Instruction (USI) specifying the task, options, and a natural language template, then processed by a single T5 encoder-decoder.

For a given input,

Task Name: <TASK>
Input: <review sentence>
[Sentiment Options: good, ok, bad]
[Category Options: ...]
Template: <verbalization>
the T5 model outputs structured summaries, enabling simultaneous mastery of all subtasks without task-specific heads or adapters.
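Assembling a USI prompt from these fields is straightforward string templating. A minimal sketch, assuming the field layout shown above; the function name, task name, and template text are illustrative assumptions:

```python
def build_usi(task: str, sentence: str, sentiment_opts=None,
              category_opts=None, template: str = "") -> str:
    """Render a Unified Sentiment Instruction (USI) prompt.
    Optional bracketed lines appear only when options are given."""
    lines = [f"Task Name: {task}", f"Input: {sentence}"]
    if sentiment_opts:
        lines.append(f"[Sentiment Options: {', '.join(sentiment_opts)}]")
    if category_opts:
        lines.append(f"[Category Options: {', '.join(category_opts)}]")
    lines.append(f"Template: {template}")
    return "\n".join(lines)

prompt = build_usi(
    task="aspect term extraction",
    sentence="The pasta was great but service was slow.",
    sentiment_opts=["good", "ok", "bad"],
    template="The aspect terms are <terms>.",
)
print(prompt)
```

Because the task identity lives entirely in the prompt, a single T5 checkpoint can switch among all 11 subtasks at inference time simply by swapping the USI.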

The training objective is:

$$\mathcal{L}(\theta) = -\sum_{i=1}^{T} \sum_{j=1}^{|T_i|} \log P_\theta(s_{i,j} \mid u_{i,j})$$

where $T$ is the number of subtasks, $|T_i|$ is the number of training examples for subtask $i$, $u_{i,j}$ is the input (USI + review), and $s_{i,j}$ is the expected output.

UnifiedABSA achieves higher F1 across all subtasks versus dedicated models, particularly in low-resource (32–64-shot) scenarios, benefiting from significant cross-task transfer and storage efficiency: one T5 instance replaces 11 separate models (Wang et al., 2022).

7. Significance for Sentiment Research and Future Directions

ABSA-mix provides a standardized multi-domain, multilingual reference for evaluating aspect-based sentiment models under realistic commercial and cross-lingual conditions. Its extended sentiment taxonomy and inclusive support for reasoning chains address known bottlenecks in weakly-supervised and transfer learning workflows. Recent work demonstrates that joint reasoning injection and large-scale instruction tuning enable robust generalization, closing performance gaps between task-specific paradigms and universal architectures.

Ongoing research proposes further domain expansion, annotation for fine-grained emotion types, and integration with large-scale foundation models capable of direct zero-shot transfer. A plausible implication is accelerated development of universal sentiment understanding systems applicable to heterogeneous commercial, social, and scientific domains.
