Multilingual Open-Set Learning and Discovery
- Multilingual Open-Set Learning and Discovery (MOSLD) Benchmark is a comprehensive suite that evaluates models' abilities to detect, cluster, and adapt to unseen classes across various modalities.
- It utilizes energy-based outlier detection, dynamic clustering, and semantic anchoring to facilitate incremental learning and cross-lingual generalization in text, vision-language, audio-video, and knowledge base tasks.
- The benchmark's rigorous evaluation protocols and multi-metric analysis advance research in open-set discovery, addressing challenges like cultural specificity, data distribution shifts, and cross-modal transfer.
Multilingual Open-Set Learning and Discovery (MOSLD) Benchmark encompasses a suite of datasets, tasks, and evaluation protocols designed to assess and advance the capacity of machine learning models to recognize, discover, and learn over unseen or unknown classes in a multilingual context. Open-set learning and active discovery are treated as generalizations of zero-shot learning, with unknown classes not prespecified and requiring on-the-fly identification and incremental adaptation. MOSLD benchmarks span text, vision-language, audio-visual, and structured knowledge base modalities, enabling rigorous quantitative analysis of generalization, detection, and adaptation across diverse languages, scripts, and data types.
1. Formal Definitions and Problem Setting
Open-set learning and discovery (OSLD) for multilingual tasks is formulated as follows: given a labeled training set drawn from a set of “known” classes C_known, test-time data may contain samples from a disjoint set of unknown classes C_unknown, not previously observed. The system must (i) detect the presence of samples belonging to C_unknown (open-set detection), (ii) cluster and assign provisional or semantic labels to these new classes (discovery), and (iii) optionally update its classifier to handle both C_known and C_unknown incrementally (Costache et al., 19 Jan 2026).
This paradigm generalizes zero-shot learning by removing the assumption that the set of candidate classes is known a priori; instead, label expansion, class discovery, and continuous learning are central to the evaluation goal. In structured KB completion tasks, open-set KBC is defined over an unbounded vocabulary of relation and object phrases, with models required to generalize to unseen tuples at test time (Mittal et al., 2022). In multimodal and audio-visual benchmarks, open-set regimes arise due to language diversity and the presence of previously unseen generative sources or data distributions (Schneider et al., 2024, Croitoru et al., 16 May 2025).
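The detect–discover–adapt protocol above can be sketched as a minimal control loop. Everything here (the DummyModel, its scoring and clustering rules, the batch) is an illustrative stand-in under simple assumptions, not the method of any cited paper:

```python
# Minimal sketch of the detect -> discover -> adapt protocol.
# DummyModel and its methods are illustrative stand-ins.
class DummyModel:
    def __init__(self, known_classes):
        self.classes = list(known_classes)

    def outlier_score(self, x):
        # Stand-in score: distance of a 1-D point from the origin.
        return abs(x)

    def cluster(self, xs):
        # Stand-in discovery: one provisional class per sign of the outlier.
        return sorted({"new_pos" if x > 0 else "new_neg" for x in xs})

def osld_step(model, batch, flag_fraction=0.15):
    # (i) open-set detection: flag the top `flag_fraction` of scores
    scored = sorted((model.outlier_score(x), x) for x in batch)
    cut = int(len(batch) * (1 - flag_fraction))
    inliers = [x for _, x in scored[:cut]]
    outliers = [x for _, x in scored[cut:]]
    # (ii) discovery: group outliers into provisional new classes
    new_classes = model.cluster(outliers)
    # (iii) incremental adaptation: expand the label space
    model.classes += new_classes
    return inliers, outliers, new_classes

model = DummyModel(["politics", "sports"])
batch = [0.1, -0.2, 0.05, 9.0, -8.5, 0.3, 0.0, -0.1, 0.2, 7.7]
inliers, outliers, new = osld_step(model, batch)
```

A real system would replace `outlier_score` with a calibrated detector and `cluster` with an embedding-space method, but the control flow is the same.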
2. Benchmark Construction Across Modalities
MOSLD evaluations integrate benchmarks for four principal modalities: text categorization, vision-language, audio-video, and structured knowledge bases.
Text Domain:
"MOSLD-Bench" (Costache et al., 19 Jan 2026) comprises 960,438 news articles covering 12 languages from diverse families and scripts. For each language, topical classes (e.g., politics, sports, technology) are split into four known classes (train/val/test) and 6–10 unknown classes partitioned into three sequential discovery stages (). Datasets include Ultimate Arabic News, L3Cube-IndicNews, THUCNews, DBpedia Ontology, and regionally scraped local news. This enables systematic evaluation under staged introduction of novel classes and typological diversity.
Vision-Language Domain:
The M5-VLOD task (Schneider et al., 2024), derived from the M5 benchmark, extends MOSLD to multilingual, multicultural visio-linguistic open-set detection. 1,440 samples are created, covering 12 languages (Amharic, Berber, Bengali, English, German, Filipino, Hausa, Hindi, Russian, Swahili, Thai, Zulu), each with custom-tailored, native-annotated single-sentence statements for image sets curated according to a topic taxonomy (∼87 topics). Annotation ensures that for each sample, a statement holds for exactly four of five images; the system is tasked to outlier-detect, i.e., choose the single image that does not conform to the statement.
Audio-Video Domain:
MAVOS-DD (Croitoru et al., 16 May 2025) includes 25,195 real and 35,169 deepfake videos (250+ hours total) across 8 languages. Fake videos are generated using seven methods spanning talking-head synthesis (EchoMimic, Memo, Sonic), portrait animation (LivePortrait), and face swapping (Inswapper, HifiFace, Roop). The open-set protocol withholds both languages and generators at test time, creating evaluation splits for in-domain, open-model, open-language, and open-full settings.
Knowledge Base Completion:
The mOKB6 benchmark (Mittal et al., 2022) constructs KBs in six languages (English, Hindi, Telugu, Spanish, Portuguese, Chinese) based on multilingual coreference resolution and Open IE from 300 aligned Wikipedia articles per language, generating ∼42,000 triples for training (5–9 K triples per language), with dense entity linkage and test/dev splits.
3. Algorithmic Frameworks and Methodologies
Text OSLD Pipeline [Editor’s term]
The canonical framework (Costache et al., 19 Jan 2026) for text classification under OSLD involves the following stages:
- Energy-based Outlier Detection: Compute the free-energy score E(x) = −log Σ_k exp(f_k(x)) for a classifier with per-class logits f_k, flagging the top 15% of samples by E(x) as outliers.
- Clustering: Extract embeddings using a frozen LLM and run k-means over a range of candidate cluster counts k, selecting k by silhouette coefficient, thereby partitioning the outliers into candidate new classes.
- Keyword Extraction and Semantic Labeling: Extract the top TF-IDF keywords for each cluster and encode them via BERT to produce cluster centroids, facilitating pseudo-labelling and, at evaluation, optimal label assignment via Hungarian matching.
- Pseudo-labeled Retraining: Expand the classifier output dimension to cover the known classes plus the newly discovered ones, selecting the 40% of points nearest to each cluster's centroid for pseudo-label assignment; fine-tune with cross-entropy (V1) or cross-entropy plus contrastive loss (V2), with V2 using the keyword centroids to anchor cluster representations.
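The energy-scoring stage can be sketched concretely. The formula is the standard free-energy score over classifier logits; the 15% flagging fraction follows the pipeline above, while the function names and toy logits are illustrative:

```python
import math

# Free-energy score from classifier logits: E(x) = -T * log(sum_k exp(f_k(x)/T)).
# Higher energy => less confident prediction => more outlier-like.
def energy_score(logits, T=1.0):
    m = max(logits)  # log-sum-exp stabilization
    return -(m + T * math.log(sum(math.exp((l - m) / T) for l in logits)))

# Flag the top `fraction` of samples by energy as candidate outliers.
def flag_outliers(all_logits, fraction=0.15):
    energies = [energy_score(l) for l in all_logits]
    k = max(1, int(len(energies) * fraction))
    cutoff = sorted(energies, reverse=True)[k - 1]
    return [i for i, e in enumerate(energies) if e >= cutoff]

# A confident prediction (one dominant logit) has low energy; a flat,
# uncertain logit vector has high energy and gets flagged.
flags = flag_outliers(
    [[10.0, 0.0, 0.0], [9.0, 0.0, 0.0], [0.1, 0.0, 0.05], [8.0, 0.0, 0.0]],
    fraction=0.25,
)
```

The temperature T is usually left at 1 for detection; the threshold (here a fixed top-15% quantile) is the main knob in practice.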
Multilingual Open KB Completion
All mOKB6 models utilize SimKGC’s two-tower architecture (mBERT encoders), with several multilingual strategies:
- MONO (per-language),
- UNION (joint, all languages),
- TRANS (monolingual on translated English triples),
- UNION+TRANS (union with English→target translation augmentation).
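Across these strategies, the shared two-tower scoring idea — encode (head, relation) in one tower, candidate tails in the other, rank by similarity — can be sketched with a toy encoder standing in for the mBERT towers; `toy_encode` is purely illustrative:

```python
import math

# Two-tower KBC scoring sketch: one tower encodes (head, relation), the other
# encodes candidate tails; score is cosine similarity. toy_encode is a
# deterministic stand-in for the mBERT encoders used by SimKGC.
def toy_encode(text, dim=8):
    # Pseudo-embedding from character codes (illustrative only).
    v = [0.0] * dim
    for i, ch in enumerate(text):
        v[i % dim] += ord(ch)
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def score(head, relation, tail):
    hr = toy_encode(head + " " + relation)
    t = toy_encode(tail)
    return sum(a * b for a, b in zip(hr, t))  # cosine (vectors are unit-norm)

def rank_tails(head, relation, candidates, gold):
    ordered = sorted(candidates, key=lambda t: -score(head, relation, t))
    return ordered.index(gold) + 1  # 1-based rank of the gold tail

r = rank_tails("Paris", "capital of", ["France", "Germany", "Spain"], "France")
```

Because both towers map phrases into a shared space, the architecture scores unseen relation and object phrases at test time — the property that makes it suitable for open-set KBC.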
Vision-Language Outlier Detection
In M5-VLOD (Schneider et al., 2024), models receive stacked images and a statement and must predict the outlier index. Zero-shot prompting is used, and evaluation leverages accuracy against a random-choice baseline. Extensions for anomaly scoring and AUROC/FPR@95% are proposed.
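The proposed anomaly-scoring extension amounts to computing AUROC and FPR@95%TPR over per-image anomaly scores. A minimal sketch, assuming higher score means more anomalous:

```python
import math

# AUROC via the rank statistic: probability that a positive (outlier) sample
# outscores a negative (inlier) sample, with ties counted as 0.5.
def auroc(scores_pos, scores_neg):
    wins = sum((p > n) + 0.5 * (p == n) for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

# FPR at the threshold where at least `tpr` of the outliers are detected.
def fpr_at_tpr(scores_pos, scores_neg, tpr=0.95):
    k = math.ceil(tpr * len(scores_pos))
    thr = sorted(scores_pos, reverse=True)[k - 1]
    return sum(n >= thr for n in scores_neg) / len(scores_neg)
```

With small evaluation sets (1,440 samples in M5-VLOD), these exact pairwise/threshold formulations are cheap and avoid binning artifacts.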
Audio-Visual Deepfake Open-Set Detection
Detectors (AVFF, MRDF, TALL-Swin) are evaluated for generalization to unseen generators and languages under several fine-tuning regimes (Croitoru et al., 16 May 2025). No special open-set loss is used, but distribution shift is the primary challenge.
4. Evaluation Metrics
MOSLD leverages a multi-faceted set of metrics suited to open-set and discovery settings.
| Metric | Description | Where Applied |
|---|---|---|
| Accuracy | Fraction correctly classified (overall, known, unknown) | Text, Audio-Video, Vision-Language |
| Macro-F1 | Macro-averaged F1 over all classes | Text (Costache et al., 19 Jan 2026) |
| AUROC | Area under ROC curve, known vs. unknown class separation | Text OSLD, Vision-Language (proposed), Audio-Video |
| Purity, NMI | Cluster quality for discovered classes | Text OSLD (Costache et al., 19 Jan 2026) |
| Hits@k, MRR | Ranking metrics for entity/relation completion | KB (Mittal et al., 2022) |
| mAP | Mean average precision for deepfake detection | Audio-Video (Croitoru et al., 16 May 2025) |
Specific baselines include random choice (M5-VLOD, 1/5 = 0.20 accuracy) and Hungarian matching as a post-hoc cluster-to-label mapping to account for unsupervised cluster discovery (text OSLD).
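Hungarian matching for unsupervised cluster evaluation finds the one-to-one cluster-to-label assignment that maximizes accuracy. For small class counts, a brute-force search over permutations (standing in for the Hungarian algorithm proper) illustrates the idea:

```python
from itertools import permutations

# Find the cluster -> label assignment maximizing accuracy. Brute force over
# permutations stands in for the Hungarian algorithm; fine for small class
# counts, and assumes no more clusters than ground-truth labels.
def best_mapping_accuracy(pred_clusters, true_labels):
    clusters = sorted(set(pred_clusters))
    labels = sorted(set(true_labels))
    best = 0
    for perm in permutations(labels, len(clusters)):
        mapping = dict(zip(clusters, perm))
        correct = sum(mapping[c] == t for c, t in zip(pred_clusters, true_labels))
        best = max(best, correct)
    return best / len(true_labels)

acc = best_mapping_accuracy([0, 0, 1, 1, 2, 2], ["a", "a", "b", "b", "c", "a"])
```

At realistic cluster counts one would use an O(k³) assignment solver (e.g., `scipy.optimize.linear_sum_assignment`) on the cluster-label contingency matrix instead of enumerating permutations.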
5. Empirical Findings and Cross-Lingual Transfer
Text OSLD:
Both V1 (cross-entropy) and V2 (cross-entropy plus contrastive loss) pipelines recover much of the F1 the baseline loses when new classes are introduced, with V2 providing consistent gains (2–3 points in early stages, and up to 85% unknown-class accuracy for Arabic) (Costache et al., 19 Jan 2026). Known/unknown class F1 gaps remain substantial (up to 40 points), indicating persistent open-set difficulty. GPT-4o demonstrates moderate ability (e.g., 0.71 accuracy in French for open-set discovery) but is outperformed by the tailored pipeline, especially at lower compute.
Vision-Language OSLD:
Only proprietary LMMs (GPT-4 Vision, Gemini) exceed random-baseline accuracy on M5-VLOD, peaking at 0.70 in English (GPT-4V) and near 0.42 for non-English languages. Open-source multimodal models consistently score at the random baseline across languages. Cross-lingual transfer is minimal in the open-set (outlier) regime, while cultural content and script differences cause significant misdetection (Schneider et al., 2024).
Audio-Visual OSLD:
Fine-tuned AVFF achieves up to 86.9% accuracy in-domain, but accuracy drops to 75.3% in the open-model setting and 77.7% in open-full, underscoring a persistent distribution-shift gap. Specialization to seen generator artifacts and poor audio representations for unseen languages are the principal error sources (Croitoru et al., 16 May 2025).
Open KB Completion:
UNION and UNION+TRANS settings increase performance over monolingual approaches (+4.6 Hits@10 / +2.8 MRR and +25.9 Hits@10 / +18.4 MRR, respectively), confirming information flow across languages. Latin-script language transfer is significant (e.g., MEMORIZE_En→Es: 50.7% H@10), but script barriers (En→Te: 11.0% H@10) impose sharp bottlenecks (Mittal et al., 2022).
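For reference, Hits@k and MRR reduce to simple functions of the gold answer's 1-based rank among the scored candidates:

```python
# Hits@k: fraction of queries whose gold answer ranks in the top k.
def hits_at_k(ranks, k):
    return sum(r <= k for r in ranks) / len(ranks)

# MRR: mean reciprocal rank of the gold answer.
def mrr(ranks):
    return sum(1.0 / r for r in ranks) / len(ranks)
```

MRR rewards exact top-1 hits much more heavily than Hits@10, which is why the two metrics can move by different amounts under the same intervention.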
6. Insights, Limitations, and Best Practices
Findings across MOSLD benchmarks suggest:
- Open-set discovery in multilingual environments presents substantial challenges not present in monolingual or fixed-label zero-shot settings (Costache et al., 19 Jan 2026, Mittal et al., 2022).
- Pipelines integrating energy-based detection, dynamic clustering, and contrastive semantic anchoring yield competitive performance, but clustering errors and outlier misdetection remain limiting factors, especially for low-resource languages.
- Cultural specificity and script diversity consistently degrade cross-lingual transfer (M5-VLOD: low-resource languages perform near chance even for advanced LMMs; knowledge transfer in KB completion is largely confined to within-script transfer).
- Overgeneration and pretraining data leakage are observed in LLM-based open-set discovery, necessitating careful prompt and model calibration.
- Current deepfake detectors are not robust to open-set shifts; substantial accuracy gains are only achieved for configurations similar to those seen in training (Croitoru et al., 16 May 2025).
Best practices include calibrating detection scores, exploiting confidence distributions (e.g., Platt or temperature scaling), expanding and balancing language/topic coverage, and integrating stronger semantic anchoring and clustering modules.
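Temperature scaling, mentioned above, fits a single scalar T > 0 on held-out validation logits to minimize negative log-likelihood. A minimal sketch, using grid search in place of gradient-based fitting:

```python
import math

# Softmax with temperature T: probabilities flatten as T grows.
def softmax(logits, T):
    m = max(logits)  # log-sum-exp stabilization
    exps = [math.exp((l - m) / T) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Negative log-likelihood of the correct labels at temperature T.
def nll(val_logits, val_labels, T):
    return -sum(math.log(softmax(l, T)[y]) for l, y in zip(val_logits, val_labels))

# Pick T minimizing validation NLL; a coarse grid stands in for an optimizer.
def fit_temperature(val_logits, val_labels, grid=None):
    grid = grid or [0.5 + 0.1 * i for i in range(46)]  # T in [0.5, 5.0]
    return min(grid, key=lambda T: nll(val_logits, val_labels, T))

# Overconfident model (one of four confident predictions is wrong) => T > 1.
T = fit_temperature([[5.0, 0.0], [5.0, 0.0], [0.0, 5.0], [5.0, 0.0]], [0, 0, 1, 1])
```

A fitted T > 1 softens overconfident logits, which directly improves the reliability of confidence-thresholded open-set detection.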
7. Future Research Directions
Identified directions for advancing MOSLD evaluation and methodology include:
- Development of improved outlier detectors based on generative model likelihoods (for text) and per-image anomaly scoring (for vision-language).
- Improved clustering approaches, including graph-based and representation learning strategies for semantic alignment among discovered classes or entities.
- Pretraining objectives and model architectures explicitly designed for open-world and open-set discovery scenarios across languages and modalities.
- Active learning schemes to reduce annotation requirements, potentially leveraging human-in-the-loop and minimal-labeled data adaptation.
- Expansion of benchmarks to cover additional scripts, language families, and domains—e.g., scaling mOKB6 beyond six languages, growing M5-VLOD example counts, and extending MAVOS-DD with novel generators and languages.
- Fairness auditing and bias monitoring to ensure equitable open-set detection performance across demographic and linguistic categories.
MOSLD benchmarks, through their modular, multilingual, and open-set discovery-oriented design, establish foundational evaluation tools for probing and improving models’ open-world adaptation and generalization capabilities (Costache et al., 19 Jan 2026, Mittal et al., 2022, Schneider et al., 2024, Croitoru et al., 16 May 2025).