- The paper introduces a novel concept-level supervision that rewards predicting any token from a contextual synonym set instead of a specific target token.
- It combines the concept loss with standard next-token prediction in a tunable framework, enhancing semantic alignment and reducing content word perplexity.
- Empirical results demonstrate significant improvements on benchmarks like WordSim353 and STS-B, indicating better human-like semantic organization and robustness.
Concept-Based Supervision for Human-Aligned LLMs
Motivation and Background
The standard next-token prediction (NTP) objective for LLM training is fundamentally limited in its alignment with human-level semantic reasoning. NTP treats alternative surface forms with similar semantics as mutually exclusive prediction targets. Because humans operate at a conceptual level, recognizing synonymy and contextually interchangeable words (e.g., “browse,” “search,” “visit”), this mismatch means models are penalized for valid completions solely because they do not match the one observed token. Prior critiques [holtzman2021surface], [bender-koller-2020-climbing], and studies on semantic generalization failures [merrill2024can] highlight that successful token prediction does not necessarily entail genuine semantic understanding.
This paper introduces a novel training paradigm for LLMs: concept-level supervision. Here, concepts are operationalized as sets of contextual synonyms for content words (nouns, verbs, adjectives), derived directly from the model itself rather than external resources. The objective rewards prediction of any token within these synonym sets for a given context, and is combined with the traditional NTP loss via a tunable hyperparameter.
Concept Training Pipeline
Concept Dataset Construction
The method begins with the C4 and OpenWebText datasets. For each sequence, spaCy tagging and tokenizer alignment extract full-word tokens classified as content-bearing (nouns, verbs, adjectives). These tokens, which carry most of a sequence's semantic content, make up approximately 28% of each sequence.
Llama 3.1 8B is used to sample the top-200 plausible NTP completions for each content-word position. The next step uses Llama 3.1 8B Instruct to select contextually appropriate synonyms from this pool, defined by their ability to meaningfully substitute the original token in the current sentence context. The resulting synonym sets form operational concept classes.
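The two filtering stages above can be sketched as follows. This is a minimal illustration, not the paper's code: the spaCy/tokenizer alignment is abstracted into a pre-tagged `(token, pos)` list, and the Llama 3.1 8B Instruct synonym judgment is abstracted into a hypothetical `is_synonym` callback.

```python
CONTENT_POS = {"NOUN", "VERB", "ADJ"}  # content-bearing POS classes

def content_word_positions(tagged_tokens):
    """Select positions of full-word content tokens (nouns, verbs, adjectives).
    `tagged_tokens` is a list of (token, pos) pairs, standing in for a spaCy
    pipeline aligned to the model tokenizer (alignment details omitted)."""
    return [i for i, (_, pos) in enumerate(tagged_tokens) if pos in CONTENT_POS]

def build_concept_set(original, candidates, is_synonym):
    """Filter the model's top-k NTP completions down to contextual synonyms.
    `candidates` stands in for the top-200 completions sampled from the base
    model; `is_synonym(original, cand)` stands in for the Instruct-model
    judgment of contextual substitutability. The original token is kept."""
    keep = {original}
    keep.update(c for c in candidates if is_synonym(original, c))
    return keep
```

In the real pipeline, `is_synonym` is a prompted LLM call conditioned on the full sentence; here it is any predicate with the same interface.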
Concept Loss Function
The concept objective maximizes the aggregate likelihood that the model predicts any member of the contextual synonym set T∗ for the prefix S, rather than only the original target token:
$$\mathcal{L}_{\text{concept}}(T^* \mid S) = -\log \sum_{n=1}^{|T^*|} p(t_n \mid S, \Theta), \quad t_n \in T^*$$
This objective is linearly interpolated with the standard NTP loss:
$$\mathcal{L}_{\text{total}} = (1-\lambda)\,\mathcal{L}_{\text{NTP}} + \lambda\,\mathcal{L}_{\text{concept}}$$
with λ ∈ [0, 1]. To avoid over-optimizing once a concept is already sufficiently captured, the concept loss is zeroed whenever the cumulative probability assigned to T∗ exceeds 0.6.
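The interpolated objective and the 0.6 cutoff can be sketched for a single next-token distribution. This is a pure-Python illustration under stated assumptions: `probs` is a hypothetical probability vector over the vocabulary, and `synonym_ids` indexes the concept set T∗.

```python
import math

def concept_loss(probs, synonym_ids, threshold=0.6):
    """Concept loss: -log of the total probability mass on the synonym set.
    Zeroed once the set's cumulative probability exceeds the threshold, so
    the model is not pushed past 'sufficiently capturing' the concept."""
    mass = sum(probs[i] for i in synonym_ids)
    if mass > threshold:
        return 0.0
    return -math.log(mass)

def total_loss(probs, target_id, synonym_ids, lam=0.75):
    """Linear interpolation: (1 - lam) * L_NTP + lam * L_concept."""
    l_ntp = -math.log(probs[target_id])
    return (1 - lam) * l_ntp + lam * concept_loss(probs, synonym_ids)
```

Note that when the synonym set covers most of the mass, the gradient comes entirely from the NTP term; in a batched training setup the same gating would be applied per position before averaging.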
Training Regimen
Post-training adapters (LoRA) are used on pretrained Llama models (1B, 3B, 8B), varying concept loss weights and data sources. Baselines include the original models and their NTP-only finetuned versions. Both in-domain and out-of-distribution evaluations are performed.
Empirical Results
Human Semantic Judgments
Semantic alignment is assessed with the MEN, WordSim353, SimLex-999, and STS-B benchmarks via cosine similarity of mean-pooled final-layer embeddings, correlated with human similarity ratings. Concept-trained models consistently surpass both NTP-only and pretrained baselines in Spearman correlation across all tasks; e.g., WordSim353 correlations rise from ~0.22 (baseline) to ~0.31 (concept-trained 8B), with the best results at moderate concept loss weights (λ = 0.75 or 1.0). Notably, NTP-only tuning slightly degrades alignment, indicating that token-level optimization alone can harm semantic structure.
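The evaluation metric can be sketched with stdlib-only code. Assumptions: word embeddings are obtained elsewhere by mean-pooling final-layer hidden states, and this `spearman` ignores rank ties for brevity (benchmark scoring would use a tie-aware implementation such as `scipy.stats.spearmanr`).

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def spearman(xs, ys):
    """Spearman rank correlation between model similarities and human ratings
    (no tie handling; illustrative only)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

For each benchmark pair, `cosine` scores the model's embeddings and `spearman` correlates those scores with the human ratings.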
Content Word Perplexity
Content word perplexity—evaluating only tokens with significant semantic content—is decreased by concept training. All concept-trained models demonstrate lower perplexity and higher token accuracy on held-out and cross-domain data (improvements >5 points in perplexity and >0.03 in accuracy). NTP-only tuning yields negligible improvement or moderate degradation. As λ increases to 0.75, content-word performance improves monotonically, with saturation or slightly diminishing gains at λ=1. Statistical significance is established via bootstrap intervals.
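The restricted metric amounts to exponentiating the mean negative log-likelihood over content-word positions only. A minimal sketch, assuming per-token NLLs and a content-word mask are computed upstream:

```python
import math

def content_word_perplexity(token_nlls, content_mask):
    """Perplexity restricted to content-word positions.
    `token_nlls[i]` is the model's negative log-likelihood for token i;
    `content_mask[i]` is True at noun/verb/adjective positions."""
    selected = [nll for nll, keep in zip(token_nlls, content_mask) if keep]
    return math.exp(sum(selected) / len(selected))
```

Global perplexity is the same computation with an all-True mask, which makes the two metrics directly comparable.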
Global Perplexity
Global token-level perplexity (over all tokens) slightly increases for concept-trained models, with notable worsening at λ = 1. This is attributed to the emphasis on semantically meaningful tokens, which are harder to predict and less frequent. The tradeoff is favorable: optimizing for the concept signal meaningfully boosts semantic alignment and content-token performance while only minimally degrading overall perplexity at moderate λ.
Model Analysis and Ablations
Latent Space Geometry
Concept training sharpens latent representation boundaries. Intra-concept similarity remains high, while inter-concept similarity decreases—yielding better cluster separation. Smaller models benefit most from concept supervision, showing stronger improvements in clustering and content-word perplexity. Centroid-based measures corroborate this, indicating sharper representations for distinct concepts.
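The clustering measures can be sketched as mean within-cluster cosine similarity versus mean similarity between cluster centroids. This is an illustrative stand-in for the paper's analysis; `clusters` is a hypothetical list of embedding lists, one list per concept.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def intra_inter_similarity(clusters):
    """Mean cosine similarity within concept clusters vs. between their
    centroids. Sharper concepts show high intra- and low inter-similarity."""
    intra = []
    for vecs in clusters:
        intra += [cosine(u, v) for u, v in combinations(vecs, 2)]
    def centroid(vecs):
        return [sum(col) / len(vecs) for col in zip(*vecs)]
    cents = [centroid(vecs) for vecs in clusters]
    inter = [cosine(u, v) for u, v in combinations(cents, 2)]
    return sum(intra) / len(intra), sum(inter) / len(inter)
```

A widening gap between the two returned values over training corresponds to the sharper cluster separation reported above.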
Noise Controls
Training with randomized synonym sets substantially degrades semantic alignment and NTP performance, indicating that gains are not explained by mere label variability or regularization. Semantically coherent synonym structure is essential for improved model performance.
Supervision Proportion
Even training on only a subset of positions (e.g., the last content word per sequence) yields improvements in semantic alignment. Increasing the supervised proportion to 25% saturates content-word NTP gains, but alignment with human benchmarks continues to improve with more supervision, highlighting a dissociation between token prediction and conceptual abstraction.
Limitations and Ethical Concerns
The approach is constrained to decoder-only models and a specific subset of content words (single-token full words; nouns, verbs, adjectives). Concept sets are based on LLM-generated synonyms, which may encode biases and lack linguistic robustness. Applying the concept loss exclusively to content words can limit coverage in global tasks, and pure concept training (λ = 1) may compress the representational space excessively.
Ethically, unconstrained concept set generation risks amplifying existing biases or overgeneralizations present in training data. Model outputs are not fully examined for factuality or safety in unconstrained generation scenarios.
Implications and Future Directions
This work demonstrates that concept-level training yields more human-like semantic boundaries, improves alignment with human intuition, and better predicts challenging, semantically loaded tokens, especially for smaller LLMs. These findings support a shift from token-centric toward concept-centric objectives for language modeling.
Pragmatically, concept training improves generalization to out-of-domain data and robustness in key semantic tasks, suggesting its utility in applications requiring nuanced language understanding. Theoretically, it reaffirms the inadequacy of token-level objectives for capturing conceptual abstraction.
Future research should expand the paradigm to multi-token expressions, sequence-level concepts, multilingual data, and downstream task evaluation. Design of more principled concept extraction and supervision procedures, as well as integration with post-training alignment methods, is warranted.
Conclusion
Concept-level supervision provides a scalable, self-supervised approach for improving semantic abstraction in LLMs, yielding models that better mirror human concept organization. By incorporating contextual synonym sets into the training objective, models attain higher semantic alignment and improved prediction of meaningful tokens. These results underscore the importance of optimizing for semantic signal rather than surface token forms. The integration of concept and token-level losses is recommended for optimal performance. Adoption of content-word-centric evaluation metrics further advances the rigor of LLM assessment. The approach constitutes a meaningful step toward more human-like LLMs (2603.29123).