NUS Datasets for Research Benchmarks

Updated 20 January 2026
  • "NUS Dataset" denotes a suite of multi-domain, curated datasets developed at the National University of Singapore for reproducible academic research.
  • Each dataset employs rigorous acquisition, annotation, and privacy protocols, ensuring high-quality data for tasks like language modeling and medical image segmentation.
  • The collections serve as benchmarks in NLP, sociolinguistics, and clinical analysis, enabling studies on data sparsity, class imbalance, and transfer learning.

The term “NUS Dataset” refers to a family of publicly released, academically curated datasets developed at the National University of Singapore and intended for the advancement of research in computational linguistics, sociolinguistics, medical image analysis, software vulnerability mining, and related data-driven domains. These collections are notable for their rigorous annotation protocols, documented methodologies, transparent release cycles, and comprehensive metadata suited to reproducible, large-scale academic research.

1. Representative Datasets

Multiple, distinct datasets carry the “NUS” designation, spanning diverse modalities and use cases. Three prominent examples include:

Dataset Name | Primary Domain | Size / Scope
NUS SMS Corpus | Personal mobile SMS | 57,824 messages, English/Mandarin, 2011 snapshot
NUS ABC Codemixed Corpus | Code-mixed chat | 355,641 messages, 477 chats, SE Asian languages
UNGT Dataset | Medical ultrasound | 493 annotated images, 110 patients

Each is purpose-built to address an acute gap in public, high-quality data resources for its research community (Chen et al., 2011; Churina et al., 31 May 2025; Liu et al., 19 Feb 2025).

2. Data Acquisition and Curation Protocols

NUS SMS Corpus: SMS messages were collected via compensated crowdsourcing (Amazon Mechanical Turk for English, Zhubajie for Mandarin), institutional outreach, and web platforms. Only personally authored “sent” messages were eligible; chain and marketing messages were excluded. Sensitive entities (e.g., emails, URLs, phone numbers, numeric strings) are masked by semantic placeholders within the text (e.g., <EMAIL>, <URL>, <#>), with further privacy assured using DES encryption for sender/receiver identifiers. Extensive manual and automated quality validation eliminated verbatim web-copied content and mass invalid uploads. The protocol adheres to IRB exemption criteria: no personally identifying information is released in published data (Chen et al., 2011).
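
The placeholder masking described above can be sketched as follows. This is a minimal illustration, not the corpus's actual pipeline: the regular expressions and the function name `mask_sensitive` are assumptions, and only the placeholder strings (<EMAIL>, <URL>, <#>) come from the description. The DES step for sender/receiver identifiers is not shown.

```python
import re

# Illustrative patterns only; the rules actually used for the corpus
# are not specified in the text above.
_PATTERNS = [
    # E-mail addresses are masked first so their parts are not
    # re-matched by the later patterns.
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "<EMAIL>"),
    # URLs starting with http(s):// or www.
    (re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE), "<URL>"),
    # Any remaining run of digits (phone numbers, amounts, codes).
    (re.compile(r"\d+"), "<#>"),
]


def mask_sensitive(text: str) -> str:
    """Replace e-mail addresses, URLs, and digit runs with placeholders."""
    for pattern, placeholder in _PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


example = mask_sensitive("Mail me at amy.tan@example.com or call 91234567")
print(example)  # Mail me at <EMAIL> or call <#>
```

The order of the patterns matters: masking digits first would break up e-mail addresses and URLs that contain numbers before they could be recognized as a whole.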

NUS ABC Codemixed Corpus: Chat threads were donated by Singaporean undergraduates through a controlled data-donation drive under full IRB approval. Donors provided up to three threads across relationship strata (acquaintance, general, close), with explicit instruction to self-redact PII. All submissions underwent manual and machine-assisted anonymization. Token-level language tags were assigned via a hybrid model (rule-based, fastText/XLM-RoBERTa, prompt-guided Qwen2.5), with inter-annotator agreement computed (Cohen’s κ ≈ 0.25–0.26 for LLM vs. traditional models), and final labels subject to manual correction for residual errors (Churina et al., 31 May 2025).
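
For reference, Cohen's κ compares the observed agreement p_o between two taggers with the agreement p_e expected by chance: κ = (p_o − p_e) / (1 − p_e). The sketch below computes it for two token-level tag sequences; the function name and the example tags are illustrative assumptions, not corpus data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: fraction of positions where both taggers agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two taggers' marginal label rates.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both taggers used the same single label throughout
    return (observed - expected) / (1.0 - expected)


# Made-up tags for four tokens from two hypothetical taggers.
tagger_1 = ["en", "en", "zh", "zh"]
tagger_2 = ["en", "zh", "zh", "zh"]
print(cohens_kappa(tagger_1, tagger_2))  # 0.5
```

Values near 0.25, as reported above, indicate agreement only modestly better than chance, which is consistent with the final labels being manually corrected.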

UNGT Dataset: Ultrasound images were retrospectively and prospectively collected from PACS archives and IRB-approved scans. Annotation involved a primary radiologist (10 years’ experience) and a secondary reviewer, with disagreements resolved by consensus. Four semantic regions (liver, stomach, nasogastric tube, pancreas) were labeled using Labelme. Images and ground-truth masks were consistently post-processed to a 224×224 px format using a dynamic cropping strategy (Liu et al., 19 Feb 2025).

3. Data Structure, Schema, and Metadata

NUS SMS Corpus: Distributed as XML and SQL dumps. Each message instance contains the full (anonymized) text, timestamp, sender/receiver hashes, and (optional) contributor demographics: age group, gender, country, native language, years using SMS, daily SMS count, phone model/type, and input method. The majority of SMS contributors provided full demographic surveys, though partial non-compliance results in minor gaps for ~0.5% (English) and ~6.2% (Chinese) messages. Message language is tagged as ‘en’ or ‘zh’ (Chen et al., 2011).

NUS ABC Codemixed Corpus: Data are released in JSON format, organized by conversation thread. Each message record includes a unique message ID, timestamp, anonymized speaker, raw text (with emojis and original casing), an array of whitespace-tokenized word segments, aligned language tags for each token, message-level and corpus-level code-mixing ratios (λₘ, λ), a transliteration flag, and an English translation. Per-thread metadata contains intimacy label, years known, trust/disclosure ratings, platform, donor age/gender, and self-reported relationship motivation (Churina et al., 31 May 2025).
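
To make the record layout concrete, here is a hypothetical message record and one way a code-mixing ratio could be derived from the aligned language tags. The field names (`message_id`, `tokens`, `lang_tags`) and the ratio definition (share of tokens outside the most frequent language) are assumptions for illustration; the corpus's own definitions of λₘ and λ are not given here and may differ.

```python
from collections import Counter


def mixing_ratio(tags):
    """Share of tokens outside the most frequent language (assumed definition)."""
    if not tags:
        return 0.0
    dominant_count = Counter(tags).most_common(1)[0][1]
    return 1.0 - dominant_count / len(tags)


def message_ratio(record):
    """Message-level ratio; checks that tags are aligned with tokens."""
    if len(record["tokens"]) != len(record["lang_tags"]):
        raise ValueError("tokens and lang_tags must have the same length")
    return mixing_ratio(record["lang_tags"])


def corpus_ratio(records):
    """Corpus-level ratio computed over the tags of all messages pooled."""
    pooled = [tag for record in records for tag in record["lang_tags"]]
    return mixing_ratio(pooled)


# Hypothetical record; field names and contents are not from the corpus.
record = {
    "message_id": "m0001",
    "tokens": ["see", "you", "at", "the", "dian"],
    "lang_tags": ["en", "en", "en", "en", "zh"],
}
print(round(message_ratio(record), 3))  # 0.2
```

Pooling tags for the corpus-level value weights long messages more heavily than averaging per-message ratios would; which of the two the corpus reports is not stated above.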

Field | NUS SMS Corpus | ABC Codemixed Corpus | UNGT
Main Unit | Message | Message (in chat thread) | Image
Primary Labels | Language (en/zh) | Per-token language tags | Semantic mask (4 classes)
Metadata | Demographics | Intimacy, relationship | Patient, modality metadata
Sensitive Redaction | Placeholder + Hash | Self-redaction + manual | IRB ethics, de-ID

UNGT: Images are paired with per-pixel masks in four semantic classes. Clinical metadata (not publicly released) includes imaging modality, probe parameters, and acquisition notes. The 493 images are divided into two experimental subsets: tube-present (321 images) and tube-absent (172 images).

4. Benchmarking, Task Design, and Metrics

The “NUS Dataset” umbrella supports a wide spectrum of benchmark tasks:

  • SMS Corpus: Language modeling in short-format text, linguistic register studies, code-mixing/abbreviation norm analysis, metadata-conditioned sociolinguistics, noisy-channel normalization, and privacy risk assessment (Chen et al., 2011).
  • ABC Codemixed Corpus: Token-level language and script identification, code-mixing detection, measurement of λₘ and λ at message/corpus levels, dialogue act classification, relationship inference from linguistic content, machine translation of informal, code-mixed utterances, and conversational response generation in multilingual style (Churina et al., 31 May 2025).
  • UNGT: Supervised and semi-supervised semantic segmentation (4-way) under class imbalance, binary classification (tube present vs. absent), and performance evaluation under severe label scarcity. Benchmark metrics include the Dice Coefficient (DSC), Sensitivity (SEN), and Precision (PRE); the source paper reports quantitative improvements over CMUNet, CTCT, and other segmentation/classification baselines (Liu et al., 19 Feb 2025).
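
The three UNGT metrics named above follow from per-class true positives (TP), false positives (FP), and false negatives (FN): DSC = 2TP / (2TP + FP + FN), SEN = TP / (TP + FN), PRE = TP / (TP + FP). The sketch below computes them for one class from flattened label maps; the function name and the toy labels are illustrative, and how the paper aggregates scores across classes or images is not specified here.

```python
def segmentation_scores(pred, truth, cls):
    """Return (dice, sensitivity, precision) for one class label.

    `pred` and `truth` are equally long flat sequences of per-pixel class ids.
    A score is NaN when its denominator is zero (class absent).
    """
    if len(pred) != len(truth):
        raise ValueError("pred and truth must have the same length")
    tp = fp = fn = 0
    for p, t in zip(pred, truth):
        if p == cls and t == cls:
            tp += 1
        elif p == cls:
            fp += 1
        elif t == cls:
            fn += 1
    nan = float("nan")
    dice = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else nan
    sensitivity = tp / (tp + fn) if (tp + fn) else nan
    precision = tp / (tp + fp) if (tp + fp) else nan
    return dice, sensitivity, precision


# Toy six-pixel example with class ids 0 (background), 1, and 2.
pred = [1, 1, 0, 2, 2, 0]
truth = [1, 0, 0, 2, 1, 0]
print(segmentation_scores(pred, truth, cls=1))  # (0.5, 0.5, 0.5)
```

For small structures such as a nasogastric tube, a few misclassified pixels move all three scores sharply, which is one reason class imbalance is emphasized in the benchmark.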

A plausible implication is that these datasets enable systematic ablation and transfer learning experiments for tasks involving data sparsity, informal linguistics, code-mixed language, and clinical annotation scenarios for which few public resources exist.

5. Distribution, Licensing, and Reproducibility

Access:

Versioning:

  • SMS and Codemixed corpora are updated in discrete, immutable monthly or periodic snapshots to guarantee experiment reproducibility and historical comparison. All tracking pages publish growth curves and demographic breakdowns per release.

Limitations:

  • All three datasets exhibit selection biases: MTurk/campus overrepresentation and device skew in SMS data, undergraduate donor demographics in chat data, and single-site (institutional) imaging in medical data. This suggests domain adaptation and cross-corpus robustness studies are warranted for external validity.

6. Applications and Research Impact

The “NUS Dataset” collections have directly facilitated research in:

  • Low-resource text normalization and SMS-specific NLP.
  • Computational sociolinguistics: age, gender, and relationship-based style modeling.
  • Code-mixed language processing: token-level identification, transliteration detection, and code-switching ratio estimation.
  • Conversation analysis in multilingual communities, focusing on intimacy, conversational context, and digital social dynamics.
  • Medical image segmentation under extreme class imbalance, semi-supervised learning in low-data regimes, and benchmarking adaptive weighting loss functions for small or sparse targets (Chen et al., 2011, Churina et al., 31 May 2025, Liu et al., 19 Feb 2025).

A plausible implication is that the NUS datasets function as reference benchmarks for both end-to-end machine learning method validation and foundational quantitative sociolinguistic work—establishing ground truth at the intersection of language, relationship, and communication modality.

7. Future Directions and Ongoing Efforts

Each NUS dataset project has articulated concrete prospects for expansion:

  • SMS Corpus: Potential future directions include extension to additional languages/dialects, broader device coverage, and improved demographic representation.
  • ABC Codemixed Corpus: Ongoing live collection and annotation will expand language/relationship diversity and increase the sample size per intimacy category; enhancements to code-mixing annotation protocols and expansion of translation quality evaluation datasets are planned.
  • UNGT: The stated roadmap includes expanding the patient cohort, collecting additional tube positions, and implementing iterative per-class training schedules to address segmentation challenges posed by class imbalance and small sample sizes (Liu et al., 19 Feb 2025).

The persistent, open-access, and extensible design of these datasets ensures their ongoing relevance for emerging machine learning, language science, and medical imaging applications.
