
Write on Paper, Wrong in Practice: Why LLMs Still Struggle with Writing Clinical Notes

Published 4 Sep 2025 in cs.HC | (2509.04340v1)

Abstract: LLMs are often proposed as tools to streamline clinical documentation, a task viewed as both high-volume and low-risk. However, even seemingly straightforward applications of LLMs raise complex sociotechnical considerations to translate into practice. This case study, conducted at KidsAbility, a pediatric rehabilitation facility in Ontario, Canada, examined the use of LLMs to support occupational therapists in reducing documentation burden. We conducted a qualitative study involving 20 clinicians who participated in pilot programs using two AI technologies: a general-purpose proprietary LLM and a bespoke model fine-tuned on proprietary historical documentation. Our findings reveal that documentation challenges are sociotechnical in nature, shaped by clinical workflows, organizational policies, and system constraints. Four key themes emerged: (1) the heterogeneity of workflows, (2) the documentation burden is systemic and not directly linked to the creation of any single type of documentation, (3) the need for flexible tools and clinician autonomy, and (4) effective implementation requires mutual learning between clinicians and AI systems. While LLMs show promise in easing documentation tasks, their success will depend on flexible, adaptive integration that supports clinician autonomy. Beyond technical performance, sustained adoption will require training programs and implementation strategies that reflect the complexity of clinical environments.

Summary

  • The paper finds that deploying LLMs in clinical documentation is hindered by individualized workflows and sociotechnical misalignments.
  • The study employed qualitative interviews and pilot deployments with 20 clinicians to evaluate LLM performance in varied clinical settings.
  • The findings emphasize that rigid AI systems clash with clinician autonomy and existing organizational processes, limiting practical benefits.

Critical Analysis of "Write on Paper, Wrong in Practice: Why LLMs Still Struggle with Writing Clinical Notes" (2509.04340)

Introduction

This paper presents a qualitative case study examining the deployment of LLMs for clinical documentation in pediatric occupational therapy. The authors challenge the prevailing assumption that automating routine documentation tasks, such as SOAP notes, is a straightforward application for LLMs. Through interviews and pilot deployments at KidsAbility, the study reveals that the challenges of integrating LLMs into clinical workflows are fundamentally sociotechnical, involving complex interactions between individual clinician practices, organizational policies, and the technical design of AI systems.

Methodological Overview

The study involved 20 clinicians across three pediatric occupational therapy programs, with 10 participating in a pilot using both a general-purpose proprietary LLM (Microsoft Copilot) and a bespoke Llama 3 8B model fine-tuned on historical SOAP notes. The qualitative methodology included 30 semi-structured interviews, analyzed using thematic coding. The pilot evaluated documentation that the LLMs generated from clinicians' scratch notes, focusing on workflow integration, user experience, and perceived utility.

Key Findings

Heterogeneity of Clinical Workflows

The study demonstrates that documentation workflows are highly individualized and context-dependent. Clinicians in different programs (in-center vs. school-based) exhibit substantial variation in how and when they create scratch notes, the structure of those notes, and their subsequent conversion to formal documentation. This heterogeneity undermines the assumption that a single LLM system can generalize across diverse clinical settings.

Systemic Nature of Documentation Burden

Contrary to the notion that automating SOAP notes would alleviate documentation burden, clinicians identified broader systemic issues as the primary source of inefficiency. These include organizational policies mandating redundant documentation, inefficiencies in electronic health record platforms, and time constraints imposed by clinical environments. The act of writing notes is not the bottleneck; rather, it is the cumulative effect of administrative requirements and workflow fragmentation.

Need for Flexibility and Clinician Autonomy

Clinicians expressed a strong preference for maintaining autonomy over their documentation practices. Many have developed personalized templates and strategies that align with their professional identity and workflow needs. LLM tools that impose rigid input/output formats or disrupt established routines are met with resistance. The pilot revealed that clinicians often bypassed the LLM system when it failed to accommodate their workflow, indicating that tool flexibility and user control are critical for adoption.

Mutual Learning and Trust Calibration

Effective integration of LLMs requires mutual adaptation: clinicians must learn to interact with AI systems, and those systems must be designed to accommodate diverse user behaviors. The pilot exposed a trust calibration issue, where clinicians overcompensated for perceived model limitations by providing overly detailed inputs, effectively negating any time savings. Concerns about clinical accuracy, legal risk, and loss of professional voice further eroded trust in AI-generated documentation.

Theoretical Implications

The authors employ the FITT (Fit between Individuals, Task, and Technology) framework to interpret their findings. The study highlights misalignments across all three dimensions:

  • Technology–Task Fit: LLMs assumed structured scratch notes, but actual inputs were sparse and variable.
  • Technology–Individual Fit: Clinicians felt the tools undermined their control and added cognitive burden.
  • Individual–Task Fit: Diverse conceptions of documentation quality and workflow priorities complicated standardization.

These misalignments resulted in poor integration, low adoption, and increased workload, despite the technical capabilities of the LLMs.

Practical Implications

The study provides several actionable insights for the deployment of LLMs in clinical documentation:

  • Customization and Adaptability: LLM systems must be highly configurable to accommodate heterogeneous workflows and documentation practices.
  • Organizational Readiness: Successful adoption requires alignment with organizational policies, reduction of redundant administrative tasks, and integration with existing health record platforms.
  • Clinician-Facing AI Literacy: Training programs should focus on building clinician competence in interacting with AI systems, emphasizing collaborative rather than prescriptive use.
  • Iterative Co-Design: Ongoing collaboration between clinicians, developers, and administrators is essential to refine system requirements and ensure sociotechnical fit.

Performance and Limitations

The bespoke Llama 3 8B model, fine-tuned with LoRA adapters and domain-adaptive pre-training, was selected for its balance between domain specificity and generalization. However, the study found no significant time savings or improvement in documentation quality compared to manual workflows. Both the bespoke and general-purpose models struggled with input variability, misclassification of content, and omission of critical information. These limitations underscore the need for more robust context modeling and input normalization strategies in future LLM deployments.
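The LoRA technique mentioned above can be illustrated with a minimal NumPy sketch. This is not the authors' training setup; dimensions, rank, and scaling are illustrative. The core idea is that a frozen pretrained weight matrix W is augmented with a trainable low-rank update B @ A, so far fewer parameters are updated during fine-tuning.

```python
import numpy as np

# Minimal sketch of the LoRA idea (illustrative only, not the paper's
# actual setup): a frozen weight matrix W gains a trainable low-rank
# update B @ A, so only r * (d_in + d_out) parameters are trained
# instead of d_in * d_out.

rng = np.random.default_rng(0)
d_in, d_out, r = 64, 64, 8                   # hypothetical layer size and rank

W = rng.standard_normal((d_out, d_in))       # frozen pretrained weights
A = rng.standard_normal((r, d_in)) * 0.01    # trainable down-projection
B = np.zeros((d_out, r))                     # trainable up-projection (zero init)
alpha = 16                                   # LoRA scaling hyperparameter

def lora_forward(x: np.ndarray) -> np.ndarray:
    """Forward pass: frozen path plus scaled low-rank adapter path."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)

# With B initialized to zero, the adapter contributes nothing at the
# start of training, so the output equals the frozen layer's output.
assert np.allclose(lora_forward(x), W @ x)

frozen_params = W.size                       # 64 * 64 = 4096
adapter_params = A.size + B.size             # 8*64 + 64*8 = 1024
print(f"frozen: {frozen_params}, trainable adapter: {adapter_params}")
```

In practice a setup like the paper's would apply such adapters inside a Llama 3 8B checkpoint via a fine-tuning library rather than raw NumPy, but the parameter-efficiency trade-off it buys is the same.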

Implications for Future AI Development

The findings suggest that the current generation of LLMs is not yet ready for seamless integration into clinical documentation workflows, even for well-scoped tasks like SOAP note generation. Future research should prioritize:

  • Context-Aware Modeling: Incorporating richer contextual signals (e.g., program type, clinician role, session environment) to improve input/output alignment.
  • Human-in-the-Loop Systems: Designing AI tools that augment rather than replace clinician judgment, supporting flexible interaction and iterative refinement.
  • Evaluation in Real-World Settings: Moving beyond benchtop performance metrics to assess utility, trust, and workflow impact in live clinical environments.
  • Ethical Considerations: Treating efficacy and alignment as ethical imperatives, given the potential for harm through inefficiency, loss of trust, or misaligned documentation.
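One lightweight way to operationalize the context-aware modeling point above is to prepend structured contextual signals (program type, clinician role, session environment) to the scratch note before generation. The sketch below is purely illustrative, not the paper's system; all field names and prompt wording are assumptions.

```python
# Illustrative sketch (not the paper's implementation): assemble a
# context-aware prompt that conditions SOAP note generation on
# structured signals about the clinical setting. Field names below
# are hypothetical.

def build_prompt(scratch_note: str, context: dict) -> str:
    """Assemble a context-aware prompt for SOAP note generation."""
    required = ("program_type", "clinician_role", "session_environment")
    missing = [k for k in required if k not in context]
    if missing:
        raise ValueError(f"missing context signals: {missing}")

    header = "\n".join(f"{k}: {context[k]}" for k in required)
    return (
        "Convert the scratch note below into a SOAP note.\n"
        "Preserve the clinician's wording where possible; do not infer "
        "diagnoses that are not stated.\n\n"
        f"[Context]\n{header}\n\n"
        f"[Scratch note]\n{scratch_note}\n"
    )

prompt = build_prompt(
    "worked on fine motor, scissors, frustrated x2, redirected ok",
    {
        "program_type": "school-based",
        "clinician_role": "occupational therapist",
        "session_environment": "classroom",
    },
)
print(prompt)
```

Requiring the context fields up front, rather than letting the model guess, addresses the input-variability failures the study observed, at the cost of a small amount of structured capture from the clinician.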

Conclusion

This study provides a rigorous, context-sensitive analysis of the challenges facing LLM adoption in clinical documentation. The authors demonstrate that technical feasibility alone is insufficient; successful integration depends on sociotechnical alignment across individual, task, and organizational dimensions. The persistent barriers identified in this well-scoped use case raise critical questions about the readiness of LLMs for more complex clinical applications. Future development and deployment strategies must prioritize flexibility, clinician autonomy, and organizational fit to realize the potential of AI in healthcare documentation.


Knowledge Gaps

The following list captures what remains missing, uncertain, or unexplored, framed to be actionable for future research:

  • External validity: findings come from a single pediatric rehabilitation center and one profession (OT) in Ontario; generalizability to other settings (e.g., acute care, adult rehab, primary care), jurisdictions, and professions is unknown.
  • Short exposure: the pilot spanned ~3 weeks per tool; long-term adoption, sustained efficacy, and behavior change over months are unmeasured.
  • Lack of quantitative outcomes: no rigorous measures of time saved, note quality, error rates, edit burden, or “pajama time” reduction; establish standardized metrics and baselines.
  • Comparative performance gap: no controlled, head-to-head quantitative comparison between the bespoke fine-tuned model and the enterprise LLM across tasks and contexts.
  • Error taxonomy and frequency: qualitative reports of misplacements and clinical misstatements (e.g., “anxious” → “has anxiety”) lack a systematic error classification with incidence rates and severity.
  • Impact heterogeneity: differential effects across programs (in-center vs school-based), clinician seniority, digital literacy, and documentation styles are not quantified.
  • Scratch-note realism: synthetic scratch notes used for fine-tuning may not reflect authentic shorthand, noise, or variability; their fidelity and impact on model behavior are unvalidated.
  • Input modality exploration: alternatives (speech-to-text, ambient scribing, digital pen/handwriting OCR, photos of paper notes) and hybrid capture strategies were not tested.
  • EHR integration: deep, workflow-native integration (auto-populating fields, reducing duplicate entry, preserving metadata) vs a standalone tool was not evaluated.
  • UI/UX for trust and control: effects of features like source highlighting, traceability to inputs, uncertainty cues, style controls, and “tracked changes” on trust and edit burden remain unexplored.
  • Voice/style personalization: mechanisms to preserve clinician “voice” (style transfer, per-clinician profiles) and their safety implications are untested.
  • Mutual learning loops: no implementation/evaluation of continuous learning from clinician edits (active learning, on-device personalization, RAG from prior notes) to reduce future errors.
  • Success criteria: explicit adoption thresholds (e.g., ≥30% time saved with ≤5% critical errors) and KPIs (waitlist reduction, staff retention, audit pass rates) are not defined.
  • Policy vs technology: the relative impact of organizational policy changes (simplifying documentation requirements, reducing duplication) compared to LLM tooling is not experimentally assessed.
  • Cognitive load: quantitative measures (e.g., NASA-TLX) of cognitive load and multitasking burden with vs without AI are absent.
  • Safety and risk management: procedures for detecting/preventing harmful edits, high-risk content flags, and escalation pathways are not specified or evaluated.
  • Legal/regulatory compliance: implications for professional liability, audit trails, authorship/provenance, and adherence to regulatory documentation standards need formal analysis.
  • Privacy/security: risks of PHI leakage in prompts, re-identification in fine-tuning data, data residency, and PHIPA/HIPAA compliance for both enterprise and bespoke models are not assessed.
  • Cost and ROI: no analysis of total cost of ownership (compute, integration, training, maintenance) versus realized benefits; financial sustainability is uncertain.
  • Benchmarking resources: no public or privacy-preserving benchmark for scratch-to-SOAP exists; lack of shared datasets, metrics, and evaluation protocols hinders reproducibility.
  • Template/context adaptation: optimal strategies for context-aware templates (school vs in-center), dynamic prompts, and structured guidance from clinicians are untested.
  • Training interventions: the content, dosage, and efficacy of clinician training/digital literacy programs on outcomes (quality, time, trust) are unmeasured.
  • Human factors in constrained environments: device availability, connectivity, noise, and in-the-wild constraints (e.g., classrooms) and their effects on AI usefulness are not characterized.
  • Patient/caregiver perspective: impacts of AI-generated documentation on trust, comprehension, and satisfaction—and acceptability of disclosing AI involvement—are unknown.
  • Equity and bias: potential disparate impacts across populations, programs, and clinician groups; bias in outputs and in who benefits (or is burdened) remain unassessed.
  • Provenance and versioning: audit-friendly co-authoring workflows (version control, edit logs, attribution between human and AI) are not designed or evaluated.
  • Automation bias: risk that clinicians over-rely on AI suggestions, especially under time pressure, and mitigation strategies are not studied.
  • Optimal insertion point: whether AI yields highest net benefit pre-session (scaffolding), in-session (capture), post-session (summarization), or for dissemination (family letters) is an open question.
  • Non-SOAP value: opportunities in tasks clinicians value (family-facing letters, individualized resources, care plans, school team communications) were identified but not prototyped or evaluated.
  • Model ablations: effects of model size, instruction/prompt tuning, RAG over local corpora, and fine-tuning recipes (LoRA vs full FT) on safety, accuracy, and edit burden are untested.
  • Personalization strategies: trade-offs among per-clinician fine-tunes, style prompting, and retrieval of prior notes for personalization (and the associated privacy risks) are unknown.
  • Documentation quality standards: agreed-upon, external scoring of completeness, accuracy, and compliance (e.g., blinded auditor ratings) is missing.
  • Data-sharing methods: feasibility of federated evaluation, synthetic-but-validated corpora, or secure enclaves to enable cross-site benchmarking remains unexplored.
  • Scalability and MLOps: lifecycle management (monitoring drift, updating models, rollback, governance) in multi-program organizations is not addressed.

