Papers
Topics
Authors
Recent
Search
2000 character limit reached

Chasing Shadows: Pitfalls in LLM Security Research

Published 10 Dec 2025 in cs.CR | (2512.09549v2)

Abstract: LLMs are increasingly prevalent in security research. Their unique characteristics, however, introduce challenges that undermine established paradigms of reproducibility, rigor, and evaluation. Prior work has identified common pitfalls in traditional machine learning research, but these studies predate the advent of LLMs. In this paper, we identify nine common pitfalls that have become (more) relevant with the emergence of LLMs and that can compromise the validity of research involving them. These pitfalls span the entire computation process, from data collection, pre-training, and fine-tuning to prompting and evaluation. We assess the prevalence of these pitfalls across all 72 peer-reviewed papers published at leading Security and Software Engineering venues between 2023 and 2024. We find that every paper contains at least one pitfall, and each pitfall appears in multiple papers. Yet only 15.7% of the present pitfalls were explicitly discussed, suggesting that the majority remain unrecognized. To understand their practical impact, we conduct four empirical case studies showing how individual pitfalls can mislead evaluation, inflate performance, or impair reproducibility. Based on our findings, we offer actionable guidelines to support the community in future work.

Summary

  • The paper's main contribution is the comprehensive identification of nine key pitfalls, including model ambiguity and evaluation inflation in LLM security research.
  • The study employs empirical analysis of 72 peer-reviewed papers, revealing high rates of data leakage, context truncation, and reproducibility issues.
  • Actionable recommendations are provided, urging transparent model specifications and benchmark redesign to enhance the reliability of LLM security experiments.

Pitfalls and Limitations in LLM Security Research

Introduction

The paper "Chasing Shadows: Pitfalls in LLM Security Research" (2512.09549) delivers a rigorous meta-analysis of published work on the use of LLMs in security and software engineering contexts. The authors systematically enumerate nine distinct pitfalls, evaluate their prevalence across 72 peer-reviewed papers published between 2023 and 2024, and provide in-depth empirical case studies on the real-world consequences of these methodological issues. Their results expose the ubiquity of fundamental limitations, including the systematic absence of reproducibility, misinterpretation of benchmarks, and evaluation inflation—often unrecognized by the original contributors. The study emphasizes community-level awareness and actionable recommendations for researchers and practitioners operating within the LLM security domain.

Taxonomy of Pitfalls in LLM Security Research

The authors introduce a comprehensive taxonomy encapsulating nine problematic patterns impacting LLM-based security work:

  • Data Poisoning via Internet Scraping: Insecure corpus construction exposes models to subtle data poisoning, especially given the scale and anonymity of platforms such as GitHub and Reddit.
  • LLM-Generated Label Inaccuracy: Automated annotation using LLMs risks propagation of unvalidated or flawed labels, undermining experimental reliability.
  • Data Leakage: Overlap between training and benchmark data—frequently undetected in opaque or proprietary model pipelines—leads to severe contamination and overestimation of generalization capabilities.
  • Model Collapse through Synthetic Training Data: Recursive fine-tuning on LLM-generated outputs degrades distributional diversity and accumulates error, risking catastrophic collapse.
  • Spurious Correlations: The vast parameter space and training scale facilitate memorization of shortcuts and non-causal artifacts, ultimately distorting measured performance and transferability.
  • Context Truncation: Practical context window limits force input truncation, which can mask critical code or information in vulnerability detection and secure code generation.
  • Prompt Sensitivity: Minor variations or mismatch in prompt format strongly affect outcomes, producing volatility and diminishing the reliability of comparative analysis.
  • Proxy/Surrogate Fallacies: Erroneous generalization of findings between model classes, quantization levels, or versions, without empirical validation.
  • Model Information Ambiguity: Incomplete specification of model identifiers, versions, access methods, or quantization prevents reproducibility and hinders precise interpretation.

Prevalence Assessment

The prevalence survey covers all 72 relevant papers published in major security and software engineering venues. Every paper exhibits at least one pitfall, and several pathologies (notably model information ambiguity, data leakage, and context truncation) are present in over 20% of studies. Only 15.7% of documented pitfalls were explicitly discussed, indicating systemic under-recognition of technical and methodological threats. Notably, the most prevalent pitfall, insufficient model specification (ambiguity in identifier, version, or quantization), affected 73.6% of the corpus but was not discussed in any publication.

The comparison across research topics revealed highest pitfall densities in vulnerability detection and repair, whereas fuzzing and secure code generation exhibited relatively fewer systematic failures.

Empirical Case Studies

To move beyond qualitative prevalence, the paper presents four rigorous empirical analyses quantifying practical impacts:

Model Information Ambiguity and Proxy Fallacies

Experiments on hate speech detection using different OpenAI GPT-4 snapshots and multiple open-source quantization settings demonstrate substantial shifts in precision, recall, and attack success rates. For example, accuracy dropped from 88.7% to 77.2% and recall plummeted from 82% to 41% when switching model instances, even with identical prompts and datasets. Robustness against prompt-based attacks depended critically on snapshot and quantization parameters, with attack success rates fluctuating by factors of three or more.

Data Leakage

Controlled fine-tuning of vulnerability detection models revealed near-linear increases in F1-score corresponding to leakage ratios; even 20% leakage elevated scores by 0.08–0.11, confirming that benchmark contamination can severely inflate performance metrics. Yet, targeted memorization probes on proprietary models found no evidence of test data reproduction in real-world settings, suggesting that leakage risks are instance-dependent and demand active investigation.

Context Truncation

Analysis of context windows in common vulnerability datasets revealed that, with standard 512-token contexts, nearly 50% of vulnerable functions are truncated, with 29% exceeding 1024 tokens. In concrete scenarios, the actual vulnerable code was outside the model’s input window, indicating that reliability of detection is undermined, and reported performance may be fundamentally misrepresentative.

Model Collapse

Iterative retraining on self-generated synthetic code, using Qwen2.5-Coder-0.5B-Instruct, demonstrated a monotonic increase in mean and variance of perplexity across generations, serving as a quantitative signal for model collapse. Figure 1

Figure 1: Distribution of perplexities for training samples generated across multiple model generations; both mean and variance increase, indicating stability degradation through recursive synthetic training.

Practical and Theoretical Implications

The findings necessitate substantial protocol changes for LLM-focused research. In absence of transparent model specification, reproducibility is unattainable, rendering empirical comparisons between studies technically invalid. The risks of data leakage and context truncation undermine both experimental integrity and real-world deployment credibility, especially in security-critical applications such as vulnerability detection, penetration testing, and automated code repair. Model collapse and spurious correlation exploitation further threaten the longevity of generative pipelines, as iterative model retraining or automated annotation becomes increasingly prevalent in both research and industry.

The authors’ recommendations—detailed specification of model family, version, quantization, prompt structure, and explicit evaluation for data leakage—represent minimum requirements for technical rigor. They argue for treating pitfall avoidance as fundamental design constraints, with the community urged to adopt transparent, testable methodologies aligned with regulatory movements (e.g., EU AI Act).

From a theoretical perspective, the universality of these pitfalls challenges the reliability of both "state-of-the-art" performance claims and security claims regarding attack/defense efficacy. Future directions should include benchmark redesign with explicit leakage and contamination probes, model versioning standards, robust prompt and context ablation testing, and longitudinal studies on generative drift and collapse dynamics under operational retraining regimes.

Conclusion

This paper establishes that methodologically significant pitfalls are pervasive in LLM security research, typically underappreciated and often unaddressed. Strong empirical evidence is provided for the distortion of results via model ambiguity, data contamination, context truncation, and iterative synthetic data training. The practical upshot is that reported results in high-impact venues may substantially misrepresent security, robustness, and generalization. An effective path forward requires formalization, transparency, and constant audit of these pitfalls in both academic and production-level LLM pipelines.

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a concise list of what remains missing, uncertain, or unexplored in the paper, phrased to guide concrete follow-up research.

  • Limited scope and representativeness: The prevalence study covers 72 papers from select Security/SE venues (2023–2024) only; it excludes journals, arXiv/industry reports, non-English venues, and later years. How do the findings change with broader venues, languages, and a longitudinal (multi-year) extension?
  • Search/selection bias: The paper relies on keyword filtering of titles/abstracts; relevant studies may be missed if they describe LLMs indirectly or use different terminology. What is the recall/precision of the search strategy, and how robust are prevalence estimates to alternative pipelines?
  • Inter-annotator agreement (IAA): The study uses dual reviewers and consensus, but does not report IAA metrics (e.g., Cohen’s κ) or calibration procedures. How reliable are pitfall labels, and how sensitive are results to reviewer assignment?
  • Operationalization clarity: Some pitfalls (e.g., proxy/surrogate fallacy vs model-info ambiguity) may overlap and lack crisp, testable criteria. Can the community define standardized, unambiguous operational tests/checklists for each pitfall?
  • Reliance on “Likely present”: Several conclusions hinge on plausibility (e.g., training cutoff vs publicly available evaluation data). Can stronger evidence be obtained via contamination tests (e.g., n-gram matching, hashing, canaries) or vendor collaboration?
  • External validity beyond security/SE: The taxonomy and prevalence are tailored to LLM security research; it is unclear how they transfer to other domains (e.g., healthcare, law), modalities (multimodal LMMs), or agent/tool-augmented systems. What adaptations are required?
  • Incomplete taxonomy: The nine pitfalls omit other recurrent sources of variance and failure (e.g., decoding parameters/temperature, seed control, API rate limits, tool-use variability, multilingual tokenization effects, evaluation determinism). What additional pitfalls should be incorporated and validated?
  • Effect-size generalization: Only four pitfalls receive empirical case studies; the magnitude of impact for the remaining pitfalls is not quantified systematically. Can a meta-analysis or re-evaluation across the 72 papers estimate typical effect sizes per pitfall?
  • Case study breadth: Case studies are limited to specific tasks/datasets/models; generality across vendors, model families, languages, and tasks remains untested. Which results replicate across diverse settings?
  • Pitfall co-occurrence and interactions: The study reports per-pitfall prevalence but not how pitfalls cluster or compound (e.g., prompt sensitivity masked by proxy fallacy). What are the dominant co-occurrence patterns and their joint effects on outcomes?
  • Predictors of pitfalls: The paper notes topic-level differences but does not model predictors (e.g., open vs closed models, venue, author backgrounds, dataset types). Which factors statistically predict pitfall incidence?
  • Mitigation efficacy: The paper proposes guidelines but does not evaluate whether they reduce pitfalls or improve reproducibility. Can controlled studies (before/after, randomized reviews, or natural experiments) measure guideline impact and feasibility?
  • Tooling gaps: No automated instruments are provided to detect/flag pitfalls (e.g., leakage scanners, prompt-audit checkers, context-truncation detectors, version-capture validators). What tools can be built, validated, and integrated into CI/review workflows?
  • Proprietary opacity: Data leakage and version ambiguity cannot be conclusively audited for closed models. What practical methods (e.g., membership inference, canary probes, watermarking) and disclosure standards can bridge this gap with providers?
  • Web-scale data poisoning: The study flags risk but does not quantify real-world poisoning prevalence or evaluate defense coverage at scale. Can we build poisoning audit frameworks and benchmarks for LLM-scale pipelines?
  • Spurious correlations: The prevalence is reported, but methods to expose and mitigate non-causal features (e.g., counterfactual evaluation, causal probes, representation diagnostics) are not assessed. Which techniques reliably detect and reduce spurious cues in LLMs?
  • Prompt sensitivity standardization: There is no standardized protocol for cross-model prompt optimization/reporting. What reporting standards and prompt-normalization procedures best reduce evaluation variance while preserving comparability?
  • Context truncation monitoring: The paper shows prevalence but does not evaluate systematic detection/mitigation (e.g., automatic truncation alerts, chunking strategies, retrieval augmentation) or quantify cost–performance trade-offs across methods.
  • Model collapse thresholds and safeguards: Degradation from synthetic data is demonstrated, but conditions/thresholds and mitigation efficacy (data curation, diversity controls, filtering) remain unquantified for security tasks. What safeguards work in practice?
  • Temporal drift and reproducibility: Model updates (APIs/snapshots/quantization) are a major source of variance, but no longitudinal measurement quantifies drift effects over time. How quickly do results stale, and which version-capture practices prevent irreproducibility?
  • Policy and review standards: The paper recommends best practices but does not translate them into concrete venue policies (e.g., model/version disclosure requirements, prompt/decoding reporting, contamination checks). Which policy interventions are most effective and adoptable?
  • Real-world harm linkage: The study does not connect pitfall prevalence to downstream security harm (e.g., vulnerable code shipped, failed defenses). Can incident analyses or field studies quantify real-world impact attributable to specific pitfalls?
  • Data and artifact durability: While code/data are released, the paper does not address long-term artifact stability (API changes, dataset availability). What archival practices ensure enduring reproducibility (e.g., model cards with immutable digests, dataset snapshots)?

Practical Applications

Immediate Applications

The following applications can be deployed now, using the paper’s taxonomy of LLM pitfalls, empirical findings, and accompanying resources (llmpitfalls.org, GitHub repository). Each item notes sectors, concrete tools or workflows, and feasibility assumptions.

  • Stronger LLM evaluation hygiene in CI/CD pipelines (software, robotics, finance, healthcare): Implement a reproducibility checklist that logs model name, snapshot/date, endpoint, tokenizer, system prompt, quantization level, and API parameters per run; pin model versions and capture changelogs; attach run artifacts to build releases. Tools/workflows: “LLM RunLogger” middleware, CI step templates, experiment notebooks seeded from the paper’s guidelines. Assumptions/dependencies: Model providers expose version identifiers or changelogs; teams adopt disciplined logging.
  • Data leakage pre-checks for benchmarks and fine-tuning (software, education, security labs): Before training or evaluation, run near-duplicate and source overlap scanning on test sets versus likely pretraining sources (GitHub, Wikipedia, Reddit). Use the paper’s finding that 20% leakage can inflate F1 by ≈ 0.08–0.11 to gate deployments. Tools/products: “LeakGuard” using locality-sensitive hashing (LSH), SimHash, or MinHash; dataset lineage metadata. Assumptions: Availability of probable pretraining corpus fingerprints or public web mirrors; acceptance of probabilistic contamination signals.
  • Prompt sensitivity A/B testing harness (software, genAI safety, customer support): Establish prompt suites per model (model-tailored delimiters, formatting) and run A/B/C tests to quantify variability; adopt model-specific prompt templates for production. Tools/products: “PromptLab” harness generating perturbations and measuring variance across models; prompt registries. Assumptions: Access to multiple models, modest compute budget; orgs accept small performance tradeoffs for robustness.
  • Context window analysis and safe chunking for long inputs (software security, code intelligence, legal tech): Automatically count token length for inputs; apply chunk-and-stitch or retrieval-augmented patterns to avoid truncation. Guided by the paper’s result that 49% of vulnerable functions exceed 512 tokens (29% > 1024). Tools/workflows: IDE linter that flags overflow, static token estimators, structured chunking with overlap windows. Assumptions: Chunking preserves task semantics; retrieval corpus quality is adequate.
  • Label quality assurance for LLM-as-a-judge (genAI safety, moderation, education): Use model ensembles or human adjudication for a subset; compute inter-rater agreement and calibrate confidence; maintain gold sets for spot checks. Tools/workflows: “JudgeQA” consensus framework; periodic audits; disagreement-triggered human review. Assumptions: Budget for human annotation on critical subsets; acceptance of slower but more reliable pipelines.
  • Pitfall-aware red teaming playbooks (security, compliance, cloud platforms): Build attack and evaluation scripts that vary prompts, include long-context cases, and re-run across snapshots/quantizations; explicitly test for proxy/surrogate fallacy by including multiple model families and sizes. Tools/workflows: Security runbooks; scheduled re-tests after vendor updates; scorecards mapped to the nine pitfalls. Assumptions: Access to model families; vendor updates are monitored.
  • Procurement and vendor due-diligence requirements (policy, industry procurement, finance/healthcare governance): Require model cards with snapshot IDs, tokenizer and system prompt details, quantization options, endpoint versioning, and update cadence; mandate disclosure of training data sources at category-level (licensed/public/created). Tools/workflows: Standardized AI procurement templates; compliance checklists referencing the paper’s guidelines. Assumptions: Vendors agree to disclose non-sensitive metadata; legal teams update contracts.
  • Conference and journal artifact checklists (academia): Integrate a “pitfalls compliance” badge in artifact evaluation—explicitly check version pinning, leakage mitigation, prompt documentation, and context-handling strategies. Tools/workflows: Reviewer rubrics; template sections for model versioning, dataset lineage, and prompt suites. Assumptions: PC/AE committees adopt the rubric; authors have access to model metadata.
  • Continuous drift monitoring due to model updates (software, customer-facing AI, fintech): Track precision/recall and jailbreak robustness across time; set alerts when performance changes after vendor-side updates, informed by the paper’s case study that snapshot/quantization materially alters outcomes. Tools/workflows: “LLM Driftwatch” dashboards; canary test sets; weekly scheduled evaluations. Assumptions: Stable test suites; observability pipelines in place.
  • Dataset curation and ingestion hygiene (industry data teams, model fine-tuning shops): Introduce whitelist/blacklist sources, poisoning checks, and quality filters when scraping; measure label noise rates and re-validate high-risk segments. Tools/workflows: Poisoning detectors (outlier scores, heuristic filters), source integrity checks, sampled manual reviews. Assumptions: Teams can tolerate slower ingestion; cost for periodic human quality control.

Long-Term Applications

These applications require further research, scaling, standardization, or ecosystem development. They build on the paper’s taxonomy and empirical insights to improve reliability, transparency, and safety.

  • Standardized reporting for LLM experiments (NIST/ISO/IEEE, academia, industry): Develop a formal standard that mandates snapshot IDs, tokenizer/system prompts, quantization, prompt suites, leakage checks, and cross-model validations to avoid proxy fallacies. Tools/products: A “Reproducible LLM Report” spec; validator tooling integrated into papers and CI. Assumptions/dependencies: Multi-stakeholder consensus; adoption by venues and funders.
  • Training data provenance registries and contamination scanning (policy, cloud providers, labs): Maintain hashed fingerprints and lineage metadata of common pretraining corpora; create registrar services to scan evaluation/test sets for overlap before publication or deployment. Tools/products: “DataProvenance Registry”; contamination APIs; privacy-preserving hashing. Assumptions: Vendor buy-in; careful handling of IP/privacy; legal frameworks for disclosure.
  • Synthetic data watermarking and filtering to mitigate model collapse (model providers, research labs): Embed detectable markers in synthetic outputs; build filters and curriculum strategies mixing real and diverse data to prevent recursive degradation noted in the paper’s case study. Tools/products: Watermarking algorithms; anti-collapse curricula; synthetic-data quality gates. Assumptions: Watermarks remain robust; synthetic generation is traceable; training stacks accommodate filters.
  • Proxy-robust benchmarks spanning families and sizes (academia, evaluation consortia): Create benchmark suites that explicitly assess generalization across open-source and proprietary families and sizes to counter the proxy/surrogate fallacy. Tools/products: “CrossFam Bench” with model-agnostic protocols; shared leaderboards that penalize over-generalization. Assumptions: Access to diverse models; funding for compute.
  • Automatic spurious correlation detectors for security tasks (software security, healthcare NLP): Develop causal probes, counterfactuals, and ablation tools that reveal non-causal shortcuts in code and text tasks; integrate into evaluation to avoid inflated scores. Tools/products: Causal analysis libraries; counterfactual code patchers; feature ablation engines. Assumptions: Availability of structured annotations; domain expertise to interpret causal signals.
  • Black-box model snapshot fingerprinting for auditors (compliance, cloud marketplace): Build challenge-set probes that infer effective snapshot/quantization differences from responses, enabling third-party verification even when vendors don’t disclose full details. Tools/products: “ModelID Fingerprinter”; auditor APIs; certification workflows. Assumptions: Response patterns are sufficiently discriminative; legal permission to probe.
  • Regulated versioned endpoints and changelog transparency (policy, cloud providers): Require providers to expose versioned endpoints, changelogs, and rollback options; mandate end-user notification when behavior-changing updates occur. Tools/products: API versioning standards; change impact notices; opt-in “stable channels.” Assumptions: Legislative adoption; alignment with safety/privacy requirements.
  • Training data supply-chain security (model providers): Sign and verify ingested datasets; track source integrity; deploy poisoning detection at scale for internet-scraped content. Tools/products: Data signing PKI; ingestion attestation; anomaly detectors for poisoning. Assumptions: Infrastructure investment; governance for key management.
  • IDE and notebook “Pitfall Guard” extensions (daily life, software engineering education): Offer real-time warnings for context overflow, prompt fragility, missing version metadata, and potential leakage sources while building LLM apps. Tools/products: VSCode/JetBrains extensions; Jupyter magics; suggested fixes aligned to the nine pitfalls. Assumptions: Developer adoption; sustained maintenance.
  • Continuous performance and safety monitoring platforms (industry, healthcare, finance): SaaS that detects drift from model updates, tracks jailbreak susceptibility, and flags metric inflation linked to leakage or prompt changes; aligns dashboards with risk/compliance. Tools/products: “LLM SafetyOps” platform; alerts, SLOs, incident playbooks. Assumptions: Stable governance; integration with existing monitoring stacks.
  • Education and reviewer training programs (academia, professional bodies): Course modules and reviewer bootcamps on LLM pitfalls, emphasizing reproducibility, prompt sensitivity, context handling, and leakage risks; include hands-on labs using the paper’s repository. Tools/products: Curricula, lab kits, certification credits. Assumptions: Institutional buy-in; ongoing updates.
  • Sector-specific compliance frameworks (healthcare, finance, public sector): Translate the nine pitfalls into domain controls (e.g., leakage audits for clinical notes, snapshot logging in trading systems, prompt robustness for public services). Tools/products: Control catalogs; audit procedures; mapping to existing standards (HIPAA, PCI DSS, ISO/IEC 42001). Assumptions: Harmonization with sector regulations; resource availability for audits.

Glossary

  • Alignment: Adjusting an LLM to follow desired objectives, safety constraints, or user intent. Example: "During fine-tuning or alignment, LLMs are adapted for specific downstream tasks and user-facing behaviors."
  • Alignment testing: Evaluations that probe whether an LLM adheres to intended safety or behavioral guidelines. Example: "In tasks such as jailbreak detection or alignment testing, vague prompts (e.g., ``You are a helpful AI assistant'') may underrepresent a model's true capabilities or vulnerabilities."
  • Context window: The maximum number of tokens an LLM can process in a single input. Example: "about 49% of vulnerable functions exceed common context windows of 512 tokens (29% >1024>1024 tokens)"
  • Data leakage: Unintended overlap between training and evaluation data that can inflate reported performance. Example: "Others represent familiar concerns, such as data leakage or spurious correlations, but their implications change and intensify in the context of LLMs."
  • Data poisoning: Malicious manipulation of training data to induce harmful or biased model behavior. Example: "data scraped from the Internet opens the door to data poisoning attacks, where malicious or biased content can be subtly inserted into the training data without detection."
  • Downstream tasks: Specific applications or tasks for which a pre-trained model is adapted. Example: "During fine-tuning or alignment, LLMs are adapted for specific downstream tasks and user-facing behaviors."
  • Endpoint version: The specific deployed API/backend variant of a model at a given time. Example: "specifying the exact snapshot (e., endpoint version, commit ID, or access date) is essential."
  • Fine-tuning: Further training a pre-trained model on task-specific data to specialize its behavior. Example: "Fine-tuning and alignment frequently rely on data that was itself generated by LLMs, introducing the risk of \Pmodelcollapse[]~\cite{shumailov_ai_2024}."
  • Foundational model: A broadly pre-trained model used as a base for adaptation to many tasks. Example: "Foundational models may have been trained on (a subset of) samples from the test set."
  • Fuzzing: A software testing technique that feeds programs with generated inputs to find bugs or vulnerabilities. Example: "while studies on Fuzzing and Secure Code Generation contain fewer pitfalls on average (15--18\%)."
  • Jailbreak attacks: Techniques that induce an LLM to bypass its safety policies or restrictions. Example: "and are generally vulnerable to prompt injection and jailbreak attacks~\cite{yu_jailbreaks}."
  • LLM-as-a-judge: Using an LLM to automatically evaluate or label data or outputs. Example: "a practice known as LLM-as-a-judge~\cite{zheng_judging_2023}."
  • Model collapse: Degradation in model diversity and quality when models are trained on their own or other models’ outputs. Example: "such as model collapse caused by training on synthetic data or unpredictable model behavior due to prompt sensitivity."
  • Perplexity: A measure of how well a LLM predicts text; lower values indicate better predictive performance. Example: "recursive self-training in code generation increases perplexity across generations, leading to degradation and instability."
  • Pre-training: Large-scale initial training of a model on broad corpora to learn general patterns. Example: "LLMs are typically pre-trained on large-scale datasets to capture general language patterns."
  • Prompt engineering: Crafting and formatting inputs to guide an LLM’s behavior and performance. Example: "prompt engineering (e., context sensitivity)"
  • Prompt injection: Maliciously crafted prompts that manipulate an LLM into ignoring instructions or performing unintended actions. Example: "and are generally vulnerable to prompt injection~\cite{liu_pi} and jailbreak attacks~\cite{yu_jailbreaks}."
  • Prompt sensitivity: The phenomenon where small prompt changes cause large variations in model outputs. Example: "Prompting also introduces the issue of \Ppromptsensitivity[], where minor changes in phrasing can lead to drastically different outputs and where different models can have distinct prompt preferences."
  • Quantization: Reducing numerical precision of model parameters to speed up inference or reduce memory, potentially changing behavior. Example: "snapshot versions and quantization affect robustness against jailbreaks and significantly alter precision/recall"
  • Selection bias: Systematic bias introduced when training data are not representative of the true distribution. Example: "Such features may arise from selection bias in the training data or from shortcuts the model learns using unrelated patterns."
  • Snapshot: A specific saved/versioned state of a model used in experiments. Example: "specifying the exact snapshot (e., endpoint version, commit ID, or access date) is essential."
  • Spurious correlations: Non-causal patterns that a model exploits to perform well on benchmarks but fail to generalize. Example: "Spurious Correlations often remain unidentified."
  • Stateless: Lacking memory across requests; all necessary information must be included in the current input. Example: "Because LLMs are stateless, all relevant information must be included in the current context."
  • Synthetic data: Data generated artificially (e.g., by other LLMs) rather than collected from real-world sources. Example: "To address data scarcity, synthetic data generated by other LLMs is frequently used."
  • Tokenizers: Components that convert text into tokens used as model inputs. Example: "Because even small changes in tokenizers or system prompts can affect model behavior"
  • Truncation: Cutting off input text to fit within an LLM’s context window, potentially removing critical information. Example: "The LLM's context size is not large enough for its intended task, and the input needs to be truncated."

Open Problems

We found no open problems mentioned in this paper.

Collections

Sign up for free to add this paper to one or more collections.

Tweets

Sign up for free to view the 1 tweet with 27 likes about this paper.