
LMSYS-Chat-1M Dataset Overview

Updated 5 February 2026
  • LMSYS-Chat-1M is a large-scale dataset comprising one million multi-turn dialogues between users and 25 state-of-the-art LLMs collected from diverse public platforms.
  • It provides detailed metadata including conversation transcripts, language annotations, and OpenAI moderation flags to facilitate robust safety evaluation and model benchmarking.
  • The corpus supports varied research applications such as content moderation training, jailbreak detection, and instruction tuning with empirical performance metrics.

LMSYS-Chat-1M is a corpus of one million real-world conversations between human users and 25 state-of-the-art LLMs, systematically collected from public web interfaces, specifically the Vicuna demo and Chatbot Arena platforms, over a period of approximately five months in 2023. Each conversation record consists of a unique identifier, the target model’s name, a JSON-formatted transcript (adhering to OpenAI Chat API schema), an auto-detected language annotation, and an OpenAI Moderation API flag. This dataset is distinguished by its unprecedented scale, diversity in model and language coverage, and inclusion of both standard and flagged (unsafe or jailbroken) content, providing a comprehensive resource for LLM behavior analysis, safety research, benchmarking, and instruction tuning (Zheng et al., 2023).
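The record structure described above can be sketched as a Python dict. This is a hedged illustration: the field names mirror the dataset card's schema, but the identifier, transcript, and flag values below are invented.

```python
# A hypothetical LMSYS-Chat-1M record; field names follow the dataset
# schema described above, values are invented for illustration.
record = {
    "conversation_id": "a1b2c3d4",   # unique identifier (illustrative)
    "model": "vicuna-13b",           # target model name
    "conversation": [                # OpenAI Chat API-style transcript
        {"role": "user", "content": "Explain k-means clustering briefly."},
        {"role": "assistant", "content": "k-means partitions points into k clusters."},
    ],
    "turn": 1,                       # number of user/assistant exchanges
    "language": "English",           # auto-detected language annotation
    "openai_moderation": [           # per-message Moderation API output
        {"flagged": False, "categories": {"hate": False, "violence": False}},
        {"flagged": False, "categories": {"hate": False, "violence": False}},
    ],
}

# Sanity checks on the schema sketch: transcript alternates roles and
# every message carries a moderation entry.
assert record["conversation"][0]["role"] == "user"
assert len(record["openai_moderation"]) == len(record["conversation"])
```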

1. Collection Methodology and Dataset Construction

LMSYS-Chat-1M aggregates user–LLM interaction data from two primary online sources: chat.lmsys.org, offering a single-model interface, and Chatbot Arena, where users interact with or compare multiple LLMs side-by-side. The collection spanned April to August 2023, recording every conversation from all platform interfaces without any down-sampling.

Data privacy and anonymization protocols include the removal of personally identifiable information (PII) from stored text, with no user names, email addresses, or other direct identifiers retained. Each message is further passed through the OpenAI Moderation API, with the resultant moderation flags released as part of the dataset. Participation required users to accept website Terms of Use that explicitly covered data collection and release.

2. Dataset Structure and Content

The dataset comprises 1,000,000 multi-turn dialogues, sourced from 210,479 unique IP addresses representing at least 154 languages. Models represented span both open-source and proprietary LLMs, including but not limited to Vicuna, Koala, Alpaca, ChatGLM, Llama-2, GPT-3.5-Turbo, GPT-4, and Claude-2. The average conversation contains 2.0 turns, with user prompts averaging 69.5 tokens and model responses 214.5 tokens (as measured by the Llama-2 tokenizer).

Table 1. Basic Dataset Statistics

Statistic LMSYS-Chat-1M Value
Total Conversations 1,000,000
Unique Users (by IP) 210,479
Covered Languages 154
Avg. Turns per Conversation 2.0
Avg. Prompt Tokens 69.5
Avg. Response Tokens 214.5

Dominant models by conversation volume are Vicuna, Koala, Alpaca, ChatGLM, and Llama. The five most frequent languages are English, Portuguese, Russian, Chinese, and Spanish.
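The per-conversation statistics in Table 1 can be recomputed mechanically from the transcripts. The sketch below does so over two invented records; whitespace splitting stands in for the Llama-2 tokenizer used in the paper, so the numbers are illustrative only.

```python
# Recompute average turns and prompt-token counts over a toy sample.
# Transcripts are invented; whitespace splitting stands in for the
# Llama-2 tokenizer used for the published statistics.
sample = [
    {"conversation": [
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "The capital of France is Paris."},
    ]},
    {"conversation": [
        {"role": "user", "content": "Write a haiku about rain."},
        {"role": "assistant", "content": "Soft rain on rooftops tonight."},
        {"role": "user", "content": "Now one about snow."},
        {"role": "assistant", "content": "Quiet snow is falling now."},
    ]},
]

def n_tokens(text):
    return len(text.split())  # crude stand-in tokenizer

turns, prompt_toks, resp_toks = [], [], []
for rec in sample:
    msgs = rec["conversation"]
    turns.append(sum(1 for m in msgs if m["role"] == "user"))
    prompt_toks += [n_tokens(m["content"]) for m in msgs if m["role"] == "user"]
    resp_toks += [n_tokens(m["content"]) for m in msgs if m["role"] == "assistant"]

avg_turns = sum(turns) / len(turns)                  # -> 1.5 for this toy sample
avg_prompt = sum(prompt_toks) / len(prompt_toks)     # -> 5.0 for this toy sample
avg_resp = sum(resp_toks) / len(resp_toks)
```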

Topic categorization of a 100,000-prompt English sample via Sentence-Transformers and k-means clustering (with GPT-4 annotation) reveals major content classes: coding and software assistance (largest), general knowledge/QA, business/finance, creative writing/editing, and unsafe content (sexual, violence, hate). Approximately 5% of messages are flagged as unsafe by the OpenAI API.
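The clustering step can be illustrated with a minimal k-means (Lloyd's algorithm) over toy 2-D vectors. In the actual pipeline the inputs are Sentence-Transformers embeddings of 100,000 English prompts and clusters are annotated with GPT-4; the points, cluster count, and pure-Python implementation here are stand-ins.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal Lloyd's-algorithm k-means over tuples of floats; a toy
    stand-in for clustering Sentence-Transformers prompt embeddings."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = tuple(sum(dim) / len(cl) for dim in zip(*cl))
    return centroids, clusters

# Two well-separated toy "embedding" groups (invented points).
points = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centroids, clusters = kmeans(points, k=2)
```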

3. Curation, Filtering, and Data Quality

Beyond PII scrub and application of moderation flags, LMSYS-Chat-1M is intentionally preserved in a “raw” state. No additional filtering or deduplication is applied, capturing authentic real-world user engagement including noise, repeated queries, and potentially script-generated content. Unsafe or malicious prompts, including jailbreaks, are retained to facilitate rigorous safety evaluations. The dataset’s moderation flags and harmful content categories (sexual, harassment, violence, hate, self-harm) provide granular annotations for each message.

This curation stance makes LMSYS-Chat-1M a valuable asset for developing robust moderation systems, jailbreak-detection protocols, and adversarial-prompt mitigation strategies, though its unfiltered and potentially harmful material demands caution in downstream use.
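A common first step in such downstream use is partitioning the corpus by its released moderation flags. The sketch below shows one plausible way to do this over invented records whose field names mirror the dataset's per-message moderation output.

```python
# Partition toy records into safe vs. flagged subsets using the released
# per-message OpenAI-moderation output. Records are invented; field names
# mirror the dataset schema.
records = [
    {"conversation_id": "c1",
     "openai_moderation": [{"flagged": False, "categories": {"violence": False}}]},
    {"conversation_id": "c2",
     "openai_moderation": [{"flagged": True, "categories": {"violence": True}}]},
    {"conversation_id": "c3",
     "openai_moderation": [{"flagged": False, "categories": {"violence": False}},
                           {"flagged": True, "categories": {"hate": True}}]},
]

def is_flagged(record):
    """A conversation counts as flagged if any of its messages is flagged."""
    return any(m["flagged"] for m in record["openai_moderation"])

flagged = [r["conversation_id"] for r in records if is_flagged(r)]
safe = [r["conversation_id"] for r in records if not is_flagged(r)]
# flagged == ["c2", "c3"], safe == ["c1"]
```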

4. Benchmarking and Example Use Cases

LMSYS-Chat-1M demonstrates versatility through four primary research applications, as documented in the origin paper:

  • Training Content Moderation Models: Vicuna-7B is fine-tuned into Vicuna-moderator-7B using 1,000 flagged samples per OpenAI moderation category, 1,000 non-toxic samples, and 3,000 ShareGPT samples, with GPT-4-generated explanations as supervision. On a held-out five-category moderation classification set, Vicuna-moderator-7B achieves a zero-shot micro-F1 of 0.65, closely matching GPT-4 and outperforming GPT-3.5 and baseline LLMs.

| Model                    | Zero-shot Micro-F1 | One-shot Micro-F1 |
|--------------------------|:------------------:|:-----------------:|
| GPT-4                    | 0.71               | 0.69              |
| Vicuna-moderator-7B      | 0.65               | 0.70              |
| GPT-3.5-Turbo            | 0.45               | 0.64              |
| OpenAI mod (006)         | 0.36               | –                 |

  • Safety (Jailbreak) Benchmarking: 50 high-risk prompts (derived from top jailbreak attempts) are tested across 10 LLMs, assessing the likelihood that model responses are flagged as unsafe. Success rates highlight vulnerabilities in open LLMs (Alpaca-13B: 74%, Vicuna-13B-v1.5: 66%) compared to proprietary or safety-fine-tuned models (Llama-2-13B-chat: 16%, Claude-2: 18%, GPT-3.5-Turbo: 34%, GPT-4: 34%).
  • Instruction-Following Model Training: Llama2-7B is fine-tuned on two LMSYS-Chat-1M-derived subsets:
    • HighQuality (45K OpenAI/Anthropic responses)
    • Upvote (39K upvoted open-source responses)
    • Result: HighQuality-7B nearly matches Vicuna-7B’s performance on MMLU (5-shot: 47.7 vs 49.8) and MT-Bench (6.03 vs 6.17).
  • Challenging Benchmark Question Generation: Using Chatbot Arena voting data, prompts are rated by three LLMs (GPT-3.5, Claude-2, GPT-4); 200 “hard” prompts (rated ≥9 by all graders and manually verified) form the Arena-Hard-200 benchmark. Arena-Hard-200 produces a greater performance gap between open-source and proprietary models than MT-Bench, indicating higher discriminative power for advanced model evaluation.
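Micro-F1, the metric reported for the moderation experiment above, pools true positives, false positives, and false negatives across all categories before computing a single precision/recall/F1. A minimal sketch, with invented gold and predicted label sets:

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over multi-label examples: pool TP/FP/FN across
    all categories, then compute one precision/recall/F1."""
    tp = sum(len(t & p) for t, p in zip(y_true, y_pred))
    fp = sum(len(p - t) for t, p in zip(y_true, y_pred))
    fn = sum(len(t - p) for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Invented gold vs. predicted moderation categories for three messages.
gold = [{"violence"}, {"hate", "harassment"}, {"sexual"}]
pred = [{"violence"}, {"hate"}, set()]
score = micro_f1(gold, pred)  # tp=2, fp=0, fn=2 -> P=1.0, R=0.5, F1=2/3
```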
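The Arena-Hard-200 selection rule (keep only prompts rated at least 9 by all three LLM graders) reduces to a simple filter. The prompt names and scores below are invented for illustration; the real pipeline also includes manual verification.

```python
# Keep only prompts scored >= 9 by every grader, mirroring the
# Arena-Hard-200 selection rule. Scores are invented for illustration.
ratings = {
    "prompt_a": {"gpt-3.5": 9, "claude-2": 10, "gpt-4": 9},
    "prompt_b": {"gpt-3.5": 8, "claude-2": 10, "gpt-4": 9},
    "prompt_c": {"gpt-3.5": 10, "claude-2": 9, "gpt-4": 10},
}

hard_prompts = sorted(
    p for p, scores in ratings.items() if all(s >= 9 for s in scores.values())
)
# hard_prompts == ["prompt_a", "prompt_c"]; prompt_b is dropped (one score of 8)
```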

5. Comparison to Existing LLM Datasets

LMSYS-Chat-1M exhibits an order-of-magnitude increase in conversation scale over prior real-world LLM datasets such as Anthropic HH (338,704 conversations, 1 model, 143 users), OpenAssistant (66,497 conversations, 13,500 users), and Chatbot Arena (33,000 conversations, 20 models). In contrast to these, its multi-model coverage (25 LLMs), broader language spectrum (154 languages), and preservation of unsafe content establish it as a uniquely comprehensive resource.

6. Accessibility, Licensing, and Ethical Considerations

The dataset is publicly accessible via Hugging Face (https://huggingface.co/datasets/lmsys/lmsys-chat-1m). There is no explicit open-source license; usage is governed by Terms of Use accepted by participants and the dataset is intended for non-commercial research. Released data includes only JSON-formatted transcripts and moderation metadata with direct PII removed and unsafe content retained.

Notwithstanding best-effort anonymization, ethical constraints remain due to possible residual PII and the presence of violent or otherwise harmful content. The user base consists predominantly of LLM hobbyists and researchers, presenting a non-representative sample of broader language technology consumers.

7. Limitations and Prospective Evolution

Identified limitations include a demographic skew toward technically engaged users; the absence of registration or robust deduplication, which permits bulk or repetitive content; and the non-release of human preference signals despite upvote/downvote collection on the native platforms. Known dataset noise arises from script-driven attacks and low-quality submissions. The roadmap contemplates quarterly releases to capture new models and evolving user patterns, improved deduplication, topic calibration, stratified registration or bucketing, and public release of the human preference annotations for benchmarking purposes.

LMSYS-Chat-1M represents the first publicly released, million-scale dataset of authentic human–LLM interactions encompassing diverse model families, languages, and safety contexts (Zheng et al., 2023). Its holistic coverage across safe and unsafe dialogues, broad linguistic and topical spectrum, and support for both fundamental and highly specialized research tasks establish it as a foundational resource in LLM research and evaluation.

