ABSA-mix: Multilingual & Multi-domain Sentiment Benchmark
- ABSA-mix is a benchmark for aspect-based sentiment analysis that integrates 92 diverse domains and 6 languages using both real and synthetic annotated data.
- It features an expanded sentiment taxonomy—including 'mixed' and 'unknown' labels—and supports both encoder-only and decoder-only protocols for robust evaluation.
- The framework enables cross-domain, multilingual, and zero-/few-shot learning, driving universal evaluation and transfer in fine-grained sentiment research.
ABSA-mix refers primarily to a large-scale, multi-domain, and multilingual benchmark for aspect-based sentiment analysis (ABSA), as well as to unified modeling strategies for ABSA subtasks. The benchmark aggregates real and synthetic annotated texts across 92 domains and six languages, encompassing an expanded sentiment taxonomy and versatile modeling protocols. ABSA-mix is pivotal both for advancing reasoning-infused LLM architectures and for establishing universal evaluation standards in fine-grained sentiment modeling (Liskowski et al., 7 Jan 2026, Wang et al., 2022).
1. Composition, Sources, and Multilingual Coverage
ABSA-mix comprises 17 publicly available ABSA datasets (including eastwind/semeval-2016-absa-reviews, Alpaca69B/reviews_appstore_all_absa, SEntFiN, SilvioLima/absa, VocabVictor/acl2014_absa_twitter, jordiclive/FABSA, omymble/amazon-books-reviews-absa, SemEval_14_laptops, SemEval_14_restaurants, SemEval_15_restaurants, SemEval_16_restaurants, siat-nlp/MAMS-for-ABSA, stanfordnlp/imdb, stanfordnlp/sst2, Sp1786/multiclass-sentiment-analysis-dataset), collectively spanning 92 distinct application domains such as hotel reviews, app-store evaluations, financial headlines, online shopping, movie reviews, and Twitter posts.
Synthetic complements include SemEval14-synth (style-augmented and neutral-focused versions of SemEval-2014, 3,368 samples) and ABSA-synth (19,503 samples, covering 46 domains with explicitly engineered mixed and unknown aspect labels via Chain-of-Thought prompting).
To enable robust multilingual ABSA, the corpus is translated into six languages: English, French, German, Spanish, Italian, and Polish. Each language version is strictly parallel (~85,880 examples per language), sustaining dataset integrity and direct cross-lingual transfer evaluations.
Corpus totals (train+validation+test) exceed 85,000 labeled texts. Standard test splits are retained for each original dataset, and rigorous duplicate and data-leak controls are implemented.
2. Label Taxonomy and Annotation Schema
ABSA-mix expands the sentiment classification granularity beyond the canonical {positive, negative, neutral}. The annotated sentiment categories are:
- positive
- negative
- neutral
- mixed (assigned where aspect-level sentiment polarity is ambiguous or contradictory within the same text, e.g., due to source label conflicts or explicit synthetic prompt engineering)
- unknown (attributed to aspects explicitly absent in the text, with 25% of examples sampled and verified by LLM-based annotation "judges")
Aspect terms themselves number 21,819 from public sources and an additional 1,638 from synthetic data. These aspects cover product properties, service features, and domain-specific phenomena.
Overall sentiment for each text is jointly predicted with aspect labels, accommodating reviews that blend multiple aspect polarities.
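As an illustration of how such annotations might be represented in code, a minimal sketch follows. The class and field names here are assumptions for illustration, not the benchmark's official schema:

```python
from dataclasses import dataclass

# The five-way sentiment taxonomy described above.
SENTIMENTS = {"positive", "negative", "neutral", "mixed", "unknown"}

@dataclass
class AspectAnnotation:
    aspect: str     # e.g. "battery life"
    sentiment: str  # one of SENTIMENTS

    def __post_init__(self):
        if self.sentiment not in SENTIMENTS:
            raise ValueError(f"invalid sentiment: {self.sentiment}")

@dataclass
class AbsaInstance:
    text: str
    aspects: list           # list of AspectAnnotation
    overall_sentiment: str  # jointly predicted overall label

# A review blending opposite aspect polarities receives a "mixed" overall label.
ex = AbsaInstance(
    text="The battery lasts forever but the price is steep.",
    aspects=[AspectAnnotation("battery life", "positive"),
             AspectAnnotation("price", "negative")],
    overall_sentiment="mixed",
)
```

The validation in `__post_init__` rejects labels outside the five-way taxonomy, mirroring the closed label set the benchmark enforces.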
3. Data Format, Modeling Protocols, and Output Structure
Two primary modeling protocols are supported:
- Encoder-only models: each instance follows the pattern `[CLS] <review text> [SEP] <aspect term> [SEP]`, allowing per-aspect sentiment inference, suitable for fine-grained text classification architectures.
- Decoder-only models: the prompt structure is `Text: <review text> Aspects: <aspect1>, <aspect2>, ... Instructions: extract aspects and classify each into {positive, negative, neutral, mixed, unknown}.`, enabling simultaneous extraction and sentiment assignment over multiple aspects, compatible with generative LLMs.
Both protocols yield output in standardized JSON format, which can include a "thoughts" key denoting inference chains in reasoning-augmented modes.
Example JSON:
```json
{
  "aspect_sentiments": [
    {"aspect": "battery life", "sentiment": "positive"},
    {"aspect": "price", "sentiment": "negative"}
  ],
  "overall_sentiment": "positive",
  "thoughts": "I see words like 'long-lasting' for battery, but 'expensive' for price…"
}
```
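A minimal sketch of both protocols, assuming string templates matching the formats above (the helper names are illustrative, not part of the benchmark's tooling):

```python
import json

def encoder_input(text, aspect):
    # Encoder-only protocol: [CLS] <review text> [SEP] <aspect term> [SEP].
    # (In practice, tokenizers insert [CLS]/[SEP] themselves; shown literally here.)
    return f"[CLS] {text} [SEP] {aspect} [SEP]"

def decoder_prompt(text, aspects):
    # Decoder-only protocol: the review, its aspects, then the instruction.
    labels = "{positive, negative, neutral, mixed, unknown}"
    return (f"Text: {text} Aspects: {', '.join(aspects)} "
            f"Instructions: extract aspects and classify each into {labels}.")

def parse_output(raw):
    # Standardized JSON output; a "thoughts" key is optional (reasoning mode).
    out = json.loads(raw)
    return {d["aspect"]: d["sentiment"] for d in out["aspect_sentiments"]}

raw = ('{"aspect_sentiments": ['
       '{"aspect": "battery life", "sentiment": "positive"},'
       '{"aspect": "price", "sentiment": "negative"}],'
       '"overall_sentiment": "positive"}')
```

Here `parse_output(raw)` would yield a per-aspect mapping such as `{"battery life": "positive", "price": "negative"}`, decoupling downstream evaluation from the generation format.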
4. Evaluation Protocols and Metrics
Performance assessment is conducted with canonical classification metrics:
- Accuracy: per aspect or document
- Precision, Recall, F1: Standard definitions per class
- Macro-averaged F1: $\text{Macro-F1} = \frac{1}{C}\sum_{c=1}^{C} \text{F1}_c$, where $C$ is the number of classes
Each original dataset's test split serves as the in-domain evaluation set. For multilingual experiments, per-language accuracy is reported over the translated test subsets. All data are deduplicated and cross-source leak risks are mitigated by removing overlapping review texts.
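The macro-averaged F1 above can be computed directly from per-class counts; a self-contained sketch over the five-way label set (no external metrics library assumed):

```python
LABELS = ["positive", "negative", "neutral", "mixed", "unknown"]

def macro_f1(y_true, y_pred, labels=LABELS):
    # Per-class F1 from per-class TP/FP/FN, then the unweighted mean over
    # classes; classes absent from both y_true and y_pred contribute 0.
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(labels)
```

Because the mean is unweighted, rare classes such as "mixed" and "unknown" influence the score as much as the majority polarities, which is the point of macro-averaging on this benchmark.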
5. Applications, Transfer, and Benchmarks
ABSA-mix is designed for several key use cases:
- Cross-domain ABSA: Training across 92 domains and testing on held-out or previously unseen domains, evaluating generalization.
- Multilingual ABSA: Training a single model that yields 87–91% accuracy on French, German, Spanish, Italian, and Polish, with no degradation on English.
- Zero-/few-shot ABSA: Decoder-based LLMs can be prompted with minimal or no supervision to tackle new domains or languages.
The benchmark facilitates both encoder-based and decoder-based approaches, and enables joint reasoning–infused modeling via Chain-of-Thought fine-tuning and reasoning-pretraining for encoder architectures. These lead to significant generalization improvements in downstream ABSA tasks.
Benchmark Results
| Model | Accuracy |
|---|---|
| GPT-4o | 82.15% |
| Claude 3.5 Sonnet | 83.65% |
| Mistral Large 2 | 82.77% |
| Llama 3.1-405B | 83.08% |
| Llama 3.1-70B | 81.03% |
| Llama 3.1-8B | 76.65% |
| Arctic-Encoder | 91.28% |
| Arctic-Encoder-thinking | 91.24% |
| Arctic-Decoder | 93.03% |
| Arctic-Decoder-thinking | 92.99% |
Top-performing models surpass GPT-4o and Claude 3.5 Sonnet by over 10 percentage points in accuracy. A single 395M multilingual encoder sustains high accuracy (87–91%) across all six languages (Liskowski et al., 7 Jan 2026).
6. UnifiedABSA and Multi-task Instruction Tuning
UnifiedABSA (sometimes referenced as ABSA-mix in literature) adopts a multi-task instruction tuning regime, recasting all 11 ABSA subtasks (including aspect term extraction, sentiment extraction, category detection, opinion extraction, and complex quad prediction) as conditional text-to-text problems. Each review is prepended with a Unified Sentiment Instruction (USI) specifying the task, options, and a natural language template, then processed by a single T5 encoder-decoder.
For a given input, the USI takes the form:

```
Task Name: <TASK>
Input: <review sentence>
[Sentiment Options: good, ok, bad]
[Category Options: ...]
Template: <verbalization>
```
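A minimal sketch of assembling such a USI prompt; the function name and exact option strings are illustrative, not from the UnifiedABSA codebase:

```python
def build_usi(task, review, sentiment_options=None, category_options=None,
              template=""):
    # Unified Sentiment Instruction: task name, input review, optional
    # option lists, and a natural-language verbalization template.
    parts = [f"Task Name: {task}", f"Input: {review}"]
    if sentiment_options:
        parts.append(f"[Sentiment Options: {', '.join(sentiment_options)}]")
    if category_options:
        parts.append(f"[Category Options: {', '.join(category_options)}]")
    parts.append(f"Template: {template}")
    return " ".join(parts)

prompt = build_usi(
    "Aspect Sentiment Classification",
    "The pasta was great but service was slow.",
    sentiment_options=["good", "ok", "bad"],
    template="The sentiment of <aspect> is <sentiment>.",
)
```

The same builder serves all 11 subtasks by varying `task`, the option lists, and the template, which is what lets one T5 instance replace 11 task-specific models.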
The training objective is the summed negative log-likelihood over all subtasks:

$$\mathcal{L} = -\sum_{k=1}^{K} \sum_{(x,\,y) \in \mathcal{D}_k} \log P_\theta(y \mid x),$$

where $K$ is the number of subtasks, $x$ is the input (USI + review), and $y$ is the expected output.
UnifiedABSA achieves higher F1 across all subtasks versus dedicated models, particularly in low-resource (32–64-shot) scenarios, benefiting from significant cross-task transfer and storage efficiency: one T5 instance replaces 11 separate models (Wang et al., 2022).
7. Significance for Sentiment Research and Future Directions
ABSA-mix provides a standardized multi-domain, multilingual reference for evaluating aspect-based sentiment models under realistic commercial and cross-lingual conditions. Its extended sentiment taxonomy and inclusive support for reasoning chains address known bottlenecks in weakly-supervised and transfer learning workflows. Recent work demonstrates that joint reasoning injection and large-scale instruction tuning enable robust generalization, closing performance gaps between task-specific paradigms and universal architectures.
Ongoing research proposes further domain expansion, annotation for fine-grained emotion types, and integration with large-scale foundation models capable of direct zero-shot transfer. A plausible implication is accelerated development of universal sentiment understanding systems applicable to heterogeneous commercial, social, and scientific domains.