ABSA-mix: Multilingual & Multi-domain Sentiment Benchmark
- ABSA-mix is a benchmark for aspect-based sentiment analysis that integrates 92 diverse domains and 6 languages using both real and synthetic annotated data.
- It features an expanded sentiment taxonomy—including 'mixed' and 'unknown' labels—and supports both encoder-only and decoder-only protocols for robust evaluation.
- The framework enables cross-domain, multilingual, and zero-/few-shot learning, driving universal evaluation and transfer in fine-grained sentiment research.
ABSA-mix refers primarily to a large-scale, multi-domain, and multilingual benchmark for aspect-based sentiment analysis (ABSA), as well as to unified modeling strategies for ABSA subtasks. The benchmark aggregates real and synthetic annotated texts across 92 domains and six languages, encompassing an expanded sentiment taxonomy and versatile modeling protocols. ABSA-mix is pivotal both for advancing reasoning-infused LLM architectures and for establishing universal evaluation standards in fine-grained sentiment modeling (Liskowski et al., 7 Jan 2026, Wang et al., 2022).
1. Composition, Sources, and Multilingual Coverage
ABSA-mix comprises 17 publicly available ABSA datasets (including eastwind/semeval-2016-absa-reviews, Alpaca69B/reviews_appstore_all_absa, SEntFiN, SilvioLima/absa, VocabVictor/acl2014_absa_twitter, jordiclive/FABSA, omymble/amazon-books-reviews-absa, SemEval_14_laptops, SemEval_14_restaurants, SemEval_15_restaurants, SemEval_16_restaurants, siat-nlp/MAMS-for-ABSA, stanfordnlp/imdb, stanfordnlp/sst2, Sp1786/multiclass-sentiment-analysis-dataset), collectively spanning 92 distinct application domains such as hotel reviews, app-store evaluations, financial headlines, online shopping, movie reviews, and Twitter posts.
Synthetic complements include SemEval14-synth (style-augmented and neutral-focused versions of SemEval-2014, 3,368 samples) and ABSA-synth (19,503 samples, covering 46 domains with explicitly engineered mixed and unknown aspect labels via Chain-of-Thought prompting).
To enable robust multilingual ABSA, the corpus is translated into six languages: English, French, German, Spanish, Italian, and Polish. Each language version is strictly parallel (~85,880 examples per language), sustaining dataset integrity and direct cross-lingual transfer evaluations.
Corpus totals (train+validation+test) exceed 85,000 labeled texts. Standard test splits are retained for each original dataset, and rigorous duplicate and data-leak controls are implemented.
2. Label Taxonomy and Annotation Schema
ABSA-mix expands the sentiment classification granularity beyond the canonical {positive, negative, neutral}. The annotated sentiment categories are:
- positive
- negative
- neutral
- mixed (assigned where aspect-level sentiment polarity is ambiguous or contradictory within the same text, e.g., due to source label conflicts or explicit synthetic prompt engineering)
- unknown (attributed to aspects explicitly absent in the text, with 25% of examples sampled and verified by LLM-based annotation "judges")
Aspect terms themselves number 21,819 from public sources and an additional 1,638 from synthetic data. These aspects cover product properties, service features, and domain-specific phenomena.
Overall sentiment for each text is jointly predicted with aspect labels, accommodating reviews that blend multiple aspect polarities.
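As an illustration of how such annotations might be represented in code, a minimal sketch follows. The class and field names here are assumptions for illustration, not the benchmark's official schema:

```python
from dataclasses import dataclass

# The five-way sentiment taxonomy described above.
SENTIMENTS = {"positive", "negative", "neutral", "mixed", "unknown"}

@dataclass
class AspectAnnotation:
    aspect: str     # e.g. "battery life"
    sentiment: str  # one of SENTIMENTS

    def __post_init__(self):
        if self.sentiment not in SENTIMENTS:
            raise ValueError(f"invalid sentiment: {self.sentiment}")

@dataclass
class AbsaInstance:
    text: str
    aspects: list           # list of AspectAnnotation
    overall_sentiment: str  # jointly predicted overall label

# A review blending opposite aspect polarities receives a "mixed" overall label.
ex = AbsaInstance(
    text="The battery lasts forever but the price is steep.",
    aspects=[AspectAnnotation("battery life", "positive"),
             AspectAnnotation("price", "negative")],
    overall_sentiment="mixed",
)
```

The validation in `__post_init__` rejects labels outside the five-way taxonomy, mirroring the closed label set the benchmark enforces.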
3. Data Format, Modeling Protocols, and Output Structure
Two primary modeling protocols are supported:
- Encoder-only models: each instance follows the pattern `[CLS] <review text> [SEP] <aspect term> [SEP]`, allowing per-aspect sentiment inference, suitable for fine-grained text classification architectures.
- Decoder-only models: the prompt structure is `Text: <review text> Aspects: <aspect1>, <aspect2>, ... Instructions: extract aspects and classify each into {positive, negative, neutral, mixed, unknown}.`, enabling simultaneous extraction and sentiment assignment over multiple aspects, compatible with generative LLMs.
Both protocols yield output in standardized JSON format, which can include a "thoughts" key denoting inference chains in reasoning-augmented modes.
Example JSON:
```json
{
  "aspect_sentiments": [
    {"aspect": "battery life", "sentiment": "positive"},
    {"aspect": "price", "sentiment": "negative"}
  ],
  "overall_sentiment": "positive",
  "thoughts": "I see words like 'long-lasting' for battery, but 'expensive' for price…"
}
```
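A minimal sketch of both protocols, assuming string templates matching the formats above (the helper names are illustrative, not part of the benchmark's tooling):

```python
import json

def encoder_input(text, aspect):
    # Encoder-only protocol: [CLS] <review text> [SEP] <aspect term> [SEP].
    # (In practice, tokenizers insert [CLS]/[SEP] themselves; shown literally here.)
    return f"[CLS] {text} [SEP] {aspect} [SEP]"

def decoder_prompt(text, aspects):
    # Decoder-only protocol: the review, its aspects, then the instruction.
    labels = "{positive, negative, neutral, mixed, unknown}"
    return (f"Text: {text} Aspects: {', '.join(aspects)} "
            f"Instructions: extract aspects and classify each into {labels}.")

def parse_output(raw):
    # Standardized JSON output; a "thoughts" key is optional (reasoning mode).
    out = json.loads(raw)
    return {d["aspect"]: d["sentiment"] for d in out["aspect_sentiments"]}

raw = ('{"aspect_sentiments": ['
       '{"aspect": "battery life", "sentiment": "positive"},'
       '{"aspect": "price", "sentiment": "negative"}],'
       '"overall_sentiment": "positive"}')
```

Here `parse_output(raw)` would yield a per-aspect mapping such as `{"battery life": "positive", "price": "negative"}`, decoupling downstream evaluation from the generation format.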
4. Evaluation Protocols and Metrics
Performance assessment is conducted with canonical classification metrics:
- Accuracy: per aspect or document
- Precision, Recall, F1: Standard definitions per class
- Macro-averaged F1: $\text{Macro-F1} = \frac{1}{C}\sum_{c=1}^{C} \text{F1}_c$, where $C$ is the number of classes
Each original dataset's test split serves as the in-domain evaluation set. For multilingual experiments, per-language accuracy is reported over the translated test subsets. All data are deduplicated and cross-source leak risks are mitigated by removing overlapping review texts.
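The macro-averaged F1 above can be computed directly from per-class counts; a self-contained sketch over the five-way label set (no external metrics library assumed):

```python
LABELS = ["positive", "negative", "neutral", "mixed", "unknown"]

def macro_f1(y_true, y_pred, labels=LABELS):
    # Per-class F1 from per-class TP/FP/FN, then the unweighted mean over
    # classes; classes absent from both y_true and y_pred contribute 0.
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(labels)
```

Because the mean is unweighted, rare classes such as "mixed" and "unknown" influence the score as much as the majority polarities, which is the point of macro-averaging on this benchmark.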
5. Applications, Transfer, and Benchmarks
ABSA-mix is designed for several key use cases:
- Cross-domain ABSA: Training across 92 domains and testing on held-out or previously unseen domains, evaluating generalization.
- Multilingual ABSA: Training a single model that yields 87–91% accuracy on French, German, Spanish, Italian, and Polish, with no degradation on English.
- Zero-/few-shot ABSA: Decoder-based LLMs can be prompted with minimal or no supervision to tackle new domains or languages.
The benchmark facilitates both encoder-based and decoder-based approaches, and enables joint reasoning–infused modeling via Chain-of-Thought fine-tuning and reasoning-pretraining for encoder architectures. These lead to significant generalization improvements in downstream ABSA tasks.
Benchmark Results
| Model | Accuracy |
|---|---|
| GPT-4o | 82.15% |
| Claude 3.5 Sonnet | 83.65% |
| Mistral Large 2 | 82.77% |
| Llama 3.1-405B | 83.08% |
| Llama 3.1-70B | 81.03% |
| Llama 3.1-8B | 76.65% |
| Arctic-Encoder | 91.28% |
| Arctic-Encoder-thinking | 91.24% |
| Arctic-Decoder | 93.03% |
| Arctic-Decoder-thinking | 92.99% |
Top-performing models surpass GPT-4o and Claude 3.5 Sonnet by over 10 percentage points in accuracy. A single 395M multilingual encoder sustains high accuracy (87–91%) across all six languages (Liskowski et al., 7 Jan 2026).
6. UnifiedABSA and Multi-task Instruction Tuning
UnifiedABSA (sometimes referenced as ABSA-mix in literature) adopts a multi-task instruction tuning regime, recasting all 11 ABSA subtasks (including aspect term extraction, sentiment extraction, category detection, opinion extraction, and complex quad prediction) as conditional text-to-text problems. Each review is prepended with a Unified Sentiment Instruction (USI) specifying the task, options, and a natural language template, then processed by a single T5 encoder-decoder.
For a given input, the USI takes the form:

```
Task Name: <TASK>
Input: <review sentence>
[Sentiment Options: good, ok, bad]
[Category Options: ...]
Template: <verbalization>
```
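A minimal sketch of assembling such a USI prompt; the function name and exact option strings are illustrative, not from the UnifiedABSA codebase:

```python
def build_usi(task, review, sentiment_options=None, category_options=None,
              template=""):
    # Unified Sentiment Instruction: task name, input review, optional
    # option lists, and a natural-language verbalization template.
    parts = [f"Task Name: {task}", f"Input: {review}"]
    if sentiment_options:
        parts.append(f"[Sentiment Options: {', '.join(sentiment_options)}]")
    if category_options:
        parts.append(f"[Category Options: {', '.join(category_options)}]")
    parts.append(f"Template: {template}")
    return " ".join(parts)

prompt = build_usi(
    "Aspect Sentiment Classification",
    "The pasta was great but service was slow.",
    sentiment_options=["good", "ok", "bad"],
    template="The sentiment of <aspect> is <sentiment>.",
)
```

The same builder serves all 11 subtasks by varying `task`, the option lists, and the template, which is what lets one T5 instance replace 11 task-specific models.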
The training objective is the summed negative log-likelihood over all subtasks:

$$\mathcal{L} = -\sum_{k=1}^{K} \sum_{(x,\,y) \in \mathcal{D}_k} \log P_\theta(y \mid x),$$

where $K$ is the number of subtasks, $x$ is the input (USI + review), and $y$ is the expected output.
UnifiedABSA achieves higher F1 across all subtasks versus dedicated models, particularly in low-resource (32–64-shot) scenarios, benefiting from significant cross-task transfer and storage efficiency: one T5 instance replaces 11 separate models (Wang et al., 2022).
7. Significance for Sentiment Research and Future Directions
ABSA-mix provides a standardized multi-domain, multilingual reference for evaluating aspect-based sentiment models under realistic commercial and cross-lingual conditions. Its extended sentiment taxonomy and inclusive support for reasoning chains address known bottlenecks in weakly-supervised and transfer learning workflows. Recent work demonstrates that joint reasoning injection and large-scale instruction tuning enable robust generalization, closing performance gaps between task-specific paradigms and universal architectures.
Ongoing research proposes further domain expansion, annotation for fine-grained emotion types, and integration with large-scale foundation models capable of direct zero-shot transfer. A plausible implication is accelerated development of universal sentiment understanding systems applicable to heterogeneous commercial, social, and scientific domains.