NUS Datasets for Research Benchmarks

Updated 20 January 2026

The NUS Dataset is a suite of multi-domain, curated datasets developed at the National University of Singapore for reproducible academic research.
Each dataset employs rigorous acquisition, annotation, and privacy protocols, ensuring high-quality data for tasks like language modeling and medical image segmentation.
The collections serve as benchmarks in NLP, sociolinguistics, and clinical analysis, enabling studies on data sparsity, class imbalance, and transfer learning.

The term “NUS Dataset” refers to a family of publicly released, academically curated datasets developed at the National University of Singapore and intended for the advancement of research in computational linguistics, sociolinguistics, medical image analysis, software vulnerability mining, and related data-driven domains. These collections are notable for their rigorous annotation protocols, documented methodologies, transparent release cycles, and comprehensive metadata suited to reproducible, large-scale academic research.

1. Representative Datasets

Several distinct datasets carry the “NUS” designation, spanning diverse modalities and use cases. Three prominent examples are:

Dataset Name	Primary Domain	Size / Scope
NUS SMS Corpus	Personal mobile SMS	57,824 messages, English/Mandarin, 2011 snapshot
NUS ABC Codemixed Corpus	Code-mixed chat	355,641 messages, 477 chats, SE Asian languages
UNGT Dataset	Medical ultrasound	493 annotated images, 110 patients

Each is purpose-built to address an acute gap in public, high-quality data resources for its research community (Chen et al., 2011, Churina et al., 31 May 2025, Liu et al., 19 Feb 2025).

2. Data Acquisition and Curation Protocols

NUS SMS Corpus: SMS messages were collected via compensated crowdsourcing (Amazon Mechanical Turk for English, Zhubajie for Mandarin), institutional outreach, and web platforms. Only personally authored “sent” messages were eligible; chain and marketing messages were excluded. Sensitive entities (e.g., emails, URLs, phone numbers, numeric strings) are masked by semantic placeholders within the text (e.g., <EMAIL>, <URL>, <#>), with further privacy assured using DES encryption for sender/receiver identifiers. Extensive manual and automated quality validation eliminated verbatim web-copied content and mass invalid uploads. The protocol adheres to IRB exemption criteria: no personally identifying information is released in published data (Chen et al., 2011).
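
The placeholder masking described above can be illustrated with a minimal sketch. The regular expressions below are assumptions chosen for illustration and are not the corpus's actual anonymization rules; only the placeholder strings (<EMAIL>, <URL>, <#>) come from the description. The function name `mask_sensitive` is hypothetical.

```python
import re

# Hypothetical patterns for illustration only; the corpus's real rules may differ.
# Order matters: emails and URLs are masked before bare digit runs, so that
# digits inside an address are not replaced separately.
PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "<EMAIL>"),
    (re.compile(r"(?:https?://|www\.)\S+"), "<URL>"),
    (re.compile(r"\d+"), "<#>"),
]


def mask_sensitive(text: str) -> str:
    """Replace emails, URLs, and digit runs with semantic placeholders."""
    for pattern, placeholder in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


masked = mask_sensitive("mail me at bob@example.com or call 91234567")
```

Masking with typed placeholders, rather than deleting the span, keeps the message structure intact, so downstream models can still see that a number or address occurred at that position.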

NUS ABC Codemixed Corpus: Chat threads were donated by Singaporean undergraduates through a controlled data-donation drive under full IRB approval. Donors provided up to three threads across relationship strata (acquaintance, general, close), with explicit instructions to self-redact PII. All submissions underwent manual and machine-assisted anonymization. Token-level language tags were assigned via a hybrid model (rule-based, fastText/XLM-RoBERTa, prompt-guided Qwen2.5). Agreement between the LLM and the traditional models was measured with Cohen’s κ (≈ 0.25–0.26), and final labels were manually corrected for residual errors (Churina et al., 31 May 2025).
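
For readers unfamiliar with the agreement statistic mentioned above, Cohen’s κ compares observed agreement p_o with chance agreement p_e as κ = (p_o − p_e) / (1 − p_e). The sketch below computes it for two token-level tag sequences. The tag values and the function name `cohens_kappa` are illustrative assumptions; this is not the corpus's own evaluation code.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of positions where both taggers agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two taggers' marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both taggers used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical tag sequences from two taggers over the same four tokens.
tagger_a = ["en", "en", "zh", "zh"]
tagger_b = ["en", "zh", "zh", "zh"]
kappa = cohens_kappa(tagger_a, tagger_b)
```

In this toy case the taggers agree on three of four tokens (p_o = 0.75) with chance agreement p_e = 0.5, giving κ = 0.5. Values near 0.25, as reported for the corpus, indicate agreement only modestly above chance, which is consistent with the manual correction pass that follows.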

UNGT Dataset: Ultrasound images were retrospectively and prospectively collected from PACS archives and IRB-approved scans. Annotation involved a primary radiologist (10 years’ experience) and a secondary reviewer, with disagreements resolved by consensus. Four semantic regions (liver, stomach, nasogastric tube, pancreas) were labeled using Labelme. Images and ground-truth masks were consistently post-processed to a 224×224 px format using a dynamic cropping strategy (Liu et al., 19 Feb 2025).

3. Data Structure, Schema, and Metadata

NUS SMS Corpus: Distributed as XML and SQL dumps. Each message instance contains the full (anonymized) text, timestamp, sender/receiver hashes, and (optional) contributor demographics: age group, gender, country, native language, years using SMS, daily SMS count, phone model/type, and input method. Most SMS contributors completed the full demographic survey; partial non-compliance leaves demographic gaps for roughly 0.5% of English and 6.2% of Chinese messages. Message language is tagged as ‘en’ or ‘zh’ (Chen et al., 2011).

NUS ABC Codemixed Corpus: Data are released in JSON format, organized by conversation thread. Each message record includes a unique message ID, timestamp, anonymized speaker, raw text (with emojis and original casing), an array of whitespace-tokenized word segments, aligned language tags for each token, message-level and corpus-level code-mixing ratios (λₘ, λ), a transliteration flag, and an English translation. Per-thread metadata contains intimacy label, years known, trust/disclosure ratings, platform, donor age/gender, and self-reported relationship motivation (Churina et al., 31 May 2025).
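
A minimal sketch of reading one such message record follows. The field names (`message_id`, `tokens`, `lang_tags`), the tag values, and the sample message are hypothetical and do not reproduce the released schema. The ratio computed here, the share of tokens outside the message's most frequent language tag, is only an illustrative stand-in and should not be read as the corpus's definition of λₘ.

```python
import json
from collections import Counter

# Hypothetical record; field names and tag values are illustrative assumptions.
RAW = """
{"message_id": "m1", "text": "ok lah see you later",
 "tokens": ["ok", "lah", "see", "you", "later"],
 "lang_tags": ["en", "other", "en", "en", "en"]}
"""


def minority_tag_share(record: dict) -> float:
    """Share of tokens whose tag differs from the message's most frequent tag."""
    tokens, tags = record["tokens"], record["lang_tags"]
    if len(tokens) != len(tags):
        raise ValueError("tokens and lang_tags must be aligned one-to-one")
    if not tags:
        return 0.0
    dominant_count = Counter(tags).most_common(1)[0][1]
    return 1.0 - dominant_count / len(tags)


record = json.loads(RAW)
share = minority_tag_share(record)
```

The alignment check matters because each tag is only meaningful relative to its token. A record whose two arrays differ in length cannot be used for token-level language identification.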

Field	NUS SMS Corpus	ABC Codemixed Corpus	UNGT
Main Unit	Message	Message (in chat thread)	Image
Primary Labels	Language (en/zh)	Per-token language tags	Semantic mask (4 classes)
Metadata	Demographics	Intimacy, relationship	Patient, modality metadata
Sensitive Redaction	Placeholder + Hash	Self-redaction + manual	IRB ethics, de-ID

UNGT: Images are paired with per-pixel masks in four semantic classes. Clinical metadata (not publicly released) includes imaging modality, probe parameters, and acquisition notes. The 493 images are split into two experimental subsets: tube-present (321 images) and tube-absent (172 images).

4. Benchmarking, Task Design, and Metrics

The “NUS Dataset” umbrella supports a wide spectrum of benchmark tasks:

SMS Corpus: Language modeling in short-format text, linguistic register studies, code-mixing/abbreviation norm analysis, metadata-conditioned sociolinguistics, noisy-channel normalization, and privacy risk assessment (Chen et al., 2011).
ABC Codemixed Corpus: Token-level language and script identification, code-mixing detection, measurement of λₘ and λ at message/corpus levels, dialogue act classification, relationship inference from linguistic content, machine translation of informal, code-mixed utterances, and conversational response generation in multilingual style (Churina et al., 31 May 2025).
UNGT: Supervised and semi-supervised semantic segmentation (4-way) under class imbalance, binary classification (tube present vs. absent), and performance evaluation under severe label scarcity. Benchmark metrics include the Dice Coefficient (DSC), Sensitivity (SEN), and Precision (PRE), with quantitative improvements reported over CMUNet, CTCT, and other segmentation/classification baselines (Liu et al., 19 Feb 2025).
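
The three segmentation metrics named for UNGT can be sketched for a single class as follows. This assumes masks are given as flattened binary sequences, and it adopts the convention of returning 0.0 when a denominator is empty; both are assumptions made for illustration, not details taken from the benchmark's evaluation code. The function name `binary_metrics` is hypothetical.

```python
def binary_metrics(pred, truth):
    """Dice (DSC), sensitivity (SEN), and precision (PRE) for binary masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of pixels")
    tp = sum(1 for p, t in zip(pred, truth) if p and t)       # true positives
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)   # false positives
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)   # false negatives

    def ratio(num, den):
        # Assumed convention: an empty denominator yields 0.0.
        return num / den if den else 0.0

    return {
        "dsc": ratio(2 * tp, 2 * tp + fp + fn),
        "sen": ratio(tp, tp + fn),
        "pre": ratio(tp, tp + fp),
    }


# Toy flattened masks: 1 marks pixels assigned to the class of interest.
pred = [1, 1, 0, 0, 1, 0]
truth = [1, 0, 0, 1, 1, 0]
scores = binary_metrics(pred, truth)
```

For the 4-way task, the same function can be applied once per class by binarizing the predicted and ground-truth label maps against that class. Small structures such as the nasogastric tube tend to dominate the variation in per-class DSC, which is why class imbalance features so prominently in the benchmark.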

A plausible implication is that these datasets enable systematic ablation and transfer learning experiments for tasks involving data sparsity, informal linguistics, code-mixed language, and clinical annotation scenarios for which few public resources exist.

5. Distribution, Licensing, and Reproducibility

Access:

NUS SMS Corpus: Freely downloadable at http://wing.comp.nus.edu.sg/SMSCorpus/, distributed under a public-domain/CC0-equivalent license; no access limitations or registration (Chen et al., 2011).
ABC Codemixed Corpus: Released under CC BY 4.0, with dataset and annotation code at https://github.com/NUS-CTIC/ABC-Codemixed-Corpus. Researchers must cite the accompanying corpus paper (Churina et al., 31 May 2025).
UNGT Dataset: Released upon publication at https://github.com/NUS-Tim/UNGT, intended for medical image segmentation research (Liu et al., 19 Feb 2025).

Versioning:

SMS and Codemixed corpora are updated in discrete, immutable monthly or periodic snapshots to guarantee experiment reproducibility and historical comparison. All tracking pages publish growth curves and demographic breakdowns per release.

Limitations:

All three datasets exhibit selection biases: MTurk/campus overrepresentation and device skew in SMS data, undergraduate donor demographics in chat data, and single-site (institutional) imaging in medical data. This suggests domain adaptation and cross-corpus robustness studies are warranted for external validity.

6. Applications and Research Impact

The “NUS Dataset” collections have directly facilitated research in:

Low-resource text normalization and SMS-specific NLP.
Computational sociolinguistics: age, gender, and relationship-based style modeling.
Code-mixed language processing: token-level identification, transliteration detection, and code-switching ratio estimation.
Conversation analysis in multilingual communities, focusing on intimacy, conversational context, and digital social dynamics.
Medical image segmentation under extreme class imbalance, semi-supervised learning in low-data regimes, and benchmarking adaptive weighting loss functions for small or sparse targets (Chen et al., 2011, Churina et al., 31 May 2025, Liu et al., 19 Feb 2025).

A plausible implication is that the NUS datasets function as reference benchmarks both for validating end-to-end machine learning methods and for foundational quantitative sociolinguistic work, establishing ground truth at the intersection of language, relationship, and communication modality.

7. Future Directions and Ongoing Efforts

Each NUS dataset project has articulated concrete prospects for expansion:

SMS Corpus: Potential future directions include extension to additional languages/dialects, broader device coverage, and improved demographic representation.
ABC Codemixed Corpus: Ongoing live collection and annotation will expand language/relationship diversity and increase the sample size per intimacy category; enhancements to code-mixing annotation protocols and expansion of translation quality evaluation datasets are planned.
UNGT: The stated roadmap includes expanding the patient cohort, collecting additional tube positions, and implementing iterative per-class training schedules to address segmentation challenges posed by class imbalance and small sample sizes (Liu et al., 19 Feb 2025).

The persistent, open-access, and extensible design of these datasets ensures their ongoing relevance for emerging machine learning, language science, and medical imaging applications.

References (3)

Creating a Live, Public Short Message Service Corpus: The NUS SMS Corpus (2011)

Disentangling Codemixing in Chats: The NUS ABC Codemixed Corpus (2025)

UNGT: Ultrasound Nasogastric Tube Dataset for Medical Image Analysis (2025)
