Write on Paper, Wrong in Practice: Why LLMs Still Struggle with Writing Clinical Notes
Abstract: LLMs are often proposed as tools to streamline clinical documentation, a task viewed as both high-volume and low-risk. However, even seemingly straightforward applications of LLMs raise complex sociotechnical considerations when translated into practice. This case study, conducted at KidsAbility, a pediatric rehabilitation facility in Ontario, Canada, examined the use of LLMs to support occupational therapists in reducing documentation burden. We conducted a qualitative study involving 20 clinicians who participated in pilot programs using two AI technologies: a general-purpose proprietary LLM and a bespoke model fine-tuned on proprietary historical documentation. Our findings reveal that documentation challenges are sociotechnical in nature, shaped by clinical workflows, organizational policies, and system constraints. Four key themes emerged: (1) the heterogeneity of clinical workflows; (2) the systemic nature of the documentation burden, which is not directly linked to the creation of any single type of documentation; (3) the need for flexible tools and clinician autonomy; and (4) the need for mutual learning between clinicians and AI systems during implementation. While LLMs show promise in easing documentation tasks, their success will depend on flexible, adaptive integration that supports clinician autonomy. Beyond technical performance, sustained adoption will require training programs and implementation strategies that reflect the complexity of clinical environments.
Knowledge Gaps
The following list captures the study's knowledge gaps, limitations, and open questions, framed to be actionable for future research:
- External validity: findings come from a single pediatric rehabilitation center and one profession (OT) in Ontario; generalizability to other settings (e.g., acute care, adult rehab, primary care), jurisdictions, and professions is unknown.
- Short exposure: the pilot spanned ~3 weeks per tool; long-term adoption, sustained efficacy, and behavior change over months are unmeasured.
- Lack of quantitative outcomes: no rigorous measures of time saved, note quality, error rates, edit burden, or "pajama time" reduction; standardized metrics and baselines still need to be established.
- Comparative performance gap: no controlled, head-to-head quantitative comparison between the bespoke fine-tuned model and the enterprise LLM across tasks and contexts.
- Error taxonomy and frequency: qualitative reports of misplacements and clinical misstatements (e.g., “anxious” → “has anxiety”) lack a systematic error classification with incidence rates and severity.
- Impact heterogeneity: differential effects across programs (in-center vs school-based), clinician seniority, digital literacy, and documentation styles are not quantified.
- Scratch-note realism: synthetic scratch notes used for fine-tuning may not reflect authentic shorthand, noise, or variability; their fidelity and impact on model behavior are unvalidated.
- Input modality exploration: alternatives (speech-to-text, ambient scribing, digital pen/handwriting OCR, photos of paper notes) and hybrid capture strategies were not tested.
- EHR integration: deep, workflow-native integration (auto-populating fields, reducing duplicate entry, preserving metadata) vs a standalone tool was not evaluated.
- UI/UX for trust and control: effects of features like source highlighting, traceability to inputs, uncertainty cues, style controls, and “tracked changes” on trust and edit burden remain unexplored.
- Voice/style personalization: mechanisms to preserve clinician “voice” (style transfer, per-clinician profiles) and their safety implications are untested.
- Mutual learning loops: no implementation/evaluation of continuous learning from clinician edits (active learning, on-device personalization, RAG from prior notes) to reduce future errors; a minimal sketch of such a loop appears after this list.
- Success criteria: explicit adoption thresholds (e.g., ≥30% time saved with ≤5% critical errors) and KPIs (waitlist reduction, staff retention, audit pass rates) are not defined; a go/no-go threshold check is sketched after this list.
- Policy vs technology: the relative impact of organizational policy changes (simplifying documentation requirements, reducing duplication) compared to LLM tooling is not experimentally assessed.
- Cognitive load: quantitative measures (e.g., NASA-TLX) of cognitive load and multitasking burden with vs without AI are absent.
- Safety and risk management: procedures for detecting/preventing harmful edits, high-risk content flags, and escalation pathways are not specified or evaluated.
- Legal/regulatory compliance: implications for professional liability, audit trails, authorship/provenance, and adherence to regulatory documentation standards need formal analysis.
- Privacy/security: risks of PHI leakage in prompts, re-identification in fine-tuning data, data residency, and PHIPA/HIPAA compliance for both enterprise and bespoke models are not assessed.
- Cost and ROI: no analysis of total cost of ownership (compute, integration, training, maintenance) versus realized benefits; financial sustainability is uncertain.
- Benchmarking resources: no public or privacy-preserving benchmark for scratch-to-SOAP exists; the lack of shared datasets, metrics, and evaluation protocols hinders reproducibility; a minimal scoring-harness sketch follows this list.
- Template/context adaptation: optimal strategies for context-aware templates (school vs in-center), dynamic prompts, and structured guidance from clinicians are untested.
- Training interventions: the content, dosage, and efficacy of clinician training/digital literacy programs on outcomes (quality, time, trust) are unmeasured.
- Human factors in constrained environments: device availability, connectivity, noise, and in-the-wild constraints (e.g., classrooms) and their effects on AI usefulness are not characterized.
- Patient/caregiver perspective: impacts of AI-generated documentation on trust, comprehension, and satisfaction—and acceptability of disclosing AI involvement—are unknown.
- Equity and bias: potential disparate impacts across populations, programs, and clinician groups, as well as bias in outputs and in who benefits (or is burdened), remain unassessed.
- Provenance and versioning: audit-friendly co-authoring workflows (version control, edit logs, attribution between human and AI) are not designed or evaluated.
- Automation bias: risk that clinicians over-rely on AI suggestions, especially under time pressure, and mitigation strategies are not studied.
- Optimal insertion point: whether AI yields highest net benefit pre-session (scaffolding), in-session (capture), post-session (summarization), or for dissemination (family letters) is an open question.
- Non-SOAP value: opportunities in tasks clinicians value (family-facing letters, individualized resources, care plans, school team communications) were identified but not prototyped or evaluated.
- Model ablations: effects of model size, instruction/prompt tuning, RAG over local corpora, and fine-tuning recipes (LoRA vs full FT) on safety, accuracy, and edit burden are untested; a LoRA configuration sketch follows this list.
- Personalization strategies: trade-offs among per-clinician fine-tunes, style prompting, and retrieval of prior notes for personalization (and the associated privacy risks) are unknown.
- Documentation quality standards: agreed-upon, external scoring of completeness, accuracy, and compliance (e.g., blinded auditor ratings) is missing.
- Data-sharing methods: feasibility of federated evaluation, synthetic-but-validated corpora, or secure enclaves to enable cross-site benchmarking remains unexplored.
- Scalability and MLOps: lifecycle management (monitoring drift, updating models, rollback, governance) in multi-program organizations is not addressed.
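One way the mutual-learning gap above could be prototyped is to log the diff between each AI draft and the clinician's final note, then replay the most heavily edited pairs as few-shot corrections in later prompts. The following is a minimal sketch using only the Python standard library; the field names and the `edit_log.jsonl` store are illustrative assumptions, not artifacts from the study.

```python
import difflib
import json
from dataclasses import dataclass, asdict
from pathlib import Path

# Hypothetical store of (AI draft, clinician-final) note pairs; not from the paper.
EDIT_LOG = Path("edit_log.jsonl")

@dataclass
class NotePair:
    draft: str          # text produced by the LLM
    final: str          # text after the clinician's edits
    edit_burden: float  # 0.0 = accepted verbatim, 1.0 = fully rewritten

def compute_edit_burden(draft: str, final: str) -> float:
    """Normalized edit distance between draft and final (1 - similarity ratio)."""
    return 1.0 - difflib.SequenceMatcher(None, draft, final).ratio()

def log_edit(draft: str, final: str) -> None:
    """Append a draft/final pair so future prompts can learn from clinician edits."""
    pair = NotePair(draft, final, compute_edit_burden(draft, final))
    with EDIT_LOG.open("a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(pair)) + "\n")

def exemplars_for_prompt(k: int = 3) -> list:
    """Return the k most heavily edited pairs to include as few-shot corrections."""
    if not EDIT_LOG.exists():
        return []
    pairs = [NotePair(**json.loads(line))
             for line in EDIT_LOG.read_text(encoding="utf-8").splitlines() if line.strip()]
    return sorted(pairs, key=lambda p: p.edit_burden, reverse=True)[:k]
```

Nothing here retrains a model; it only treats clinician edits as implicit supervision for prompting, and the same log could later feed active learning or per-clinician fine-tunes.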
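The undefined success criteria above could be made operational as a simple go/no-go gate over pilot metrics. This sketch reuses the example thresholds from the list item (at least 30% time saved, at most 5% critical errors); the metric names are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class PilotMetrics:
    baseline_minutes_per_note: float   # pre-pilot documentation time
    assisted_minutes_per_note: float   # documentation time with the LLM tool
    notes_reviewed: int                # notes audited for errors
    critical_errors: int               # clinically significant errors found

def meets_adoption_criteria(m: PilotMetrics,
                            min_time_saved: float = 0.30,
                            max_critical_error_rate: float = 0.05) -> bool:
    """Go/no-go gate: >=30% time saved and <=5% critical-error rate (example thresholds)."""
    time_saved = 1.0 - (m.assisted_minutes_per_note / m.baseline_minutes_per_note)
    error_rate = m.critical_errors / m.notes_reviewed
    return time_saved >= min_time_saved and error_rate <= max_critical_error_rate

# Example: 12 -> 8 minutes per note (33% saved), 3 critical errors in 80 audited notes (3.75%).
print(meets_adoption_criteria(PilotMetrics(12.0, 8.0, 80, 3)))  # True
```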
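The missing scratch-to-SOAP benchmark implies an evaluation harness as much as a shared dataset. Below is a minimal sketch of how a paired (reference SOAP, generated SOAP) example might be scored for section coverage and unigram overlap; the SOAP header spellings and metric choices are assumptions, and a real harness would add blinded auditor ratings for accuracy and compliance.

```python
import re
from collections import Counter

# Assumed SOAP section headers; a real benchmark schema would fix these explicitly.
SECTIONS = ("Subjective", "Objective", "Assessment", "Plan")

def section_coverage(note: str) -> float:
    """Fraction of expected SOAP sections that appear as headers in the note."""
    found = sum(1 for s in SECTIONS
                if re.search(rf"^\s*{s}\s*:", note, re.MULTILINE | re.IGNORECASE))
    return found / len(SECTIONS)

def token_f1(reference: str, generated: str) -> float:
    """Unigram-overlap F1 between reference and generated notes (a crude content proxy)."""
    ref, gen = Counter(reference.lower().split()), Counter(generated.lower().split())
    overlap = sum((ref & gen).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / sum(gen.values()), overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def score_example(reference_soap: str, generated_soap: str) -> dict:
    """Score a single benchmark example on structure and content overlap."""
    return {
        "section_coverage": section_coverage(generated_soap),
        "token_f1": token_f1(reference_soap, generated_soap),
    }
```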
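The fine-tuning-recipe ablation (LoRA vs full fine-tuning) noted above could be parameterized as shown below. This is a sketch assuming the Hugging Face `transformers` and `peft` libraries; the base model name and hyperparameters are placeholders, not choices reported by the paper.

```python
# Sketch only: base model and hyperparameters are illustrative placeholders.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

BASE_MODEL = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM checkpoint

def build_model(use_lora: bool = True):
    model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)
    if not use_lora:
        return model  # full fine-tuning: all weights remain trainable
    lora_config = LoraConfig(
        r=8,                                  # low-rank adapter dimension
        lora_alpha=16,
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],  # attention projections to adapt
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()        # LoRA typically trains well under 1% of weights
    return model
```

Running the same training loop and evaluation harness over both `build_model(True)` and `build_model(False)` would give the head-to-head safety, accuracy, and edit-burden comparison the list identifies as missing.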