Large Language Models Approach Expert Pedagogical Quality in Math Tutoring but Differ in Instructional and Linguistic Profiles
Abstract: Recent work has explored the use of LLMs for generating tutoring responses in mathematics, yet it remains unclear how closely their instructional behavior aligns with expert human practice. We examine this question using a controlled, turn-level comparison in which expert human tutors, novice human tutors, and multiple LLMs respond to the same set of math remediation conversation turns. We examine both instructional strategies and linguistic characteristics of tutoring responses, including restating and revoicing, pressing for accuracy, lexical diversity, readability, politeness, and agency. We find that LLMs approach expert levels of perceived pedagogical quality on average but exhibit systematic differences in their instructional and linguistic profiles. In particular, LLMs tend to underuse restating and revoicing strategies characteristic of expert human tutors, while producing longer, more lexically diverse, and more polite responses. Statistical analyses show that restating and revoicing, lexical diversity, and pressing for accuracy are positively associated with perceived pedagogical quality, whereas higher levels of agentic and polite language are negatively associated. Overall, recent LLMs exhibit levels of perceived pedagogical quality comparable to expert human tutors, while relying on different instructional and linguistic strategies. These findings underscore the value of analyzing instructional strategies and linguistic characteristics when evaluating tutoring responses across human tutors and intelligent tutoring systems.
Explain it Like I'm 14
Overview
This paper asks a simple question: When AI “tutors” help students fix math mistakes, do they teach like expert human tutors? The authors compare expert teachers, beginner teachers, and several AI models to see how each one gives feedback and which styles actually lead to better tutoring.
The Big Questions
The study focuses on three easy-to-grasp questions:
- Do experts, beginners, and AI tutors use different teaching moves and writing styles?
- Whose feedback is judged as higher quality: experts, beginners, or AIs?
- Which specific teaching moves and writing choices are linked to higher-quality tutoring?
How the Study Was Done
Think of a science fair where every judge scores the same projects. Here, every tutor (expert, beginner, and AI) responded to the same student math mistakes so the comparison would be fair.
Here’s the approach, step by step:
- The authors used hundreds of short math conversations where a student made a mistake. For each student turn, multiple tutors (humans and AIs) wrote a reply.
- Each reply was scored on four things: Did it find the mistake? Point out where it happened? Give helpful guidance? Tell the student what to do next? Scores were combined into one “pedagogical quality” score.
- They also measured teaching moves and writing style:
- Teaching moves:
- Restating/revoicing: the tutor repeats the student’s idea in their own words to check understanding (like “So you added instead of multiplied, right?”).
- Pressing for accuracy: the tutor pushes the student to double-check or correct their work (like “Are you sure 6×7 is 42?”).
- Writing style:
- Length: how long the reply is.
- Lexical diversity: how many different words are used (more variety = higher diversity).
- Readability: how easy the reply is to read (shorter sentences and simpler words = easier).
- Politeness: how much the tutor uses softening or very courteous language.
- Agency: how much the language sounds directive or in-control (who’s “driving” the action: the tutor or the student).
- To be fair, they compared replies to the same student mistake (apples to apples) and then analyzed which features were tied to higher-quality feedback.
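The writing-style measures above can be sketched in a few lines of Python. This is a rough illustration using simple heuristics (a vowel-group syllable counter, a one-direction MTLD), not the paper's exact pipeline, and the example reply is made up:

```python
import math
import re

def flesch_reading_ease(text):
    """Flesch Reading Ease: higher = easier to read.
    Uses a simple vowel-group heuristic to count syllables."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    def syllables(w):
        return max(1, len(re.findall(r"[aeiouy]+", w.lower())))
    total_syll = sum(syllables(w) for w in words)
    return (206.835
            - 1.015 * len(words) / max(1, len(sentences))
            - 84.6 * total_syll / max(1, len(words)))

def mtld(tokens, threshold=0.72):
    """Measure of Textual Lexical Diversity (one-direction variant):
    count how many 'factors' it takes for the running type-token ratio
    to drop below the threshold; MTLD = tokens / factors."""
    factors, types, count = 0.0, set(), 0
    for tok in tokens:
        count += 1
        types.add(tok)
        if len(types) / count < threshold:
            factors += 1
            types, count = set(), 0
    if count:  # partial factor for the leftover segment
        ttr = len(types) / count
        factors += (1 - ttr) / (1 - threshold)
    return len(tokens) / factors if factors else float(len(tokens))

reply = ("Let's check that step. You added 3 and 4 before multiplying. "
         "What does order of operations say?")
tokens = re.findall(r"[a-z']+", reply.lower())
features = {
    "log_length": math.log(len(tokens)),        # length, log-transformed
    "readability": flesch_reading_ease(reply),  # Flesch Reading Ease
    "lexical_diversity": mtld(tokens),          # MTLD
}
```

Classifier-based features (restating/revoicing, politeness, agency) would come from pretrained models and are not reproduced here.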
What They Found
Here are the main takeaways, explained in plain language:
- Overall quality:
- Expert human tutors scored the highest on average.
- Beginner tutors scored the lowest.
- The best AI tutors came close to expert-level quality on average, but not all AIs were that strong.
- Teaching moves:
- Experts often restated or revoiced the student's reasoning; both AIs and beginners did so less often. This move was strongly linked to better tutoring quality.
- AIs tended to “press for accuracy” more than experts. This move also helped quality, but not as strongly as restating/revoicing.
- Writing style differences:
- AIs wrote longer replies with more varied vocabulary. Longer or fancier writing did not automatically mean better quality.
- AIs’ writing was generally harder to read than experts’ writing.
- AIs were more polite than experts. Surprisingly, very high politeness was linked to lower tutoring quality. (Being respectful is good; over-softening can sometimes water down clear guidance.)
- More “agentic” or strongly directive language was also linked to lower quality. (Overly taking charge can crowd out student thinking.)
- What mattered most for high-quality feedback:
- Strong positive links: restating/revoicing the student’s reasoning, using a good variety of words.
- Smaller positive link: pressing for accuracy.
- No clear link: making replies longer or easier to read.
- Negative links: very high politeness and very high “agency.”
Why It Matters
This study suggests that how a tutor talks is just as important as what they say. The most helpful math feedback doesn’t have to be long or super polite—it should:
- Reflect the student’s thinking back to them (restating/revoicing),
- Nudge them to check their accuracy,
- Use words clearly and with some variety,
- Keep the focus on helping the student think, not the tutor “taking over.”
For AI tutors, this means designers should train models to use expert teaching moves (especially restating/revoicing) and aim for clear, student-centered guidance rather than longer, more polite, or more forceful answers. For teachers and students, it’s a reminder that the best feedback helps you see your own thinking and fix it yourself.
Note on limits: The study looked at single-turn replies in English math tutoring and rated “perceived quality,” not actual learning gains over time. Future work should test multi-turn conversations and measure what students learn afterward.
Knowledge Gaps
The following list identifies specific unresolved issues, limitations, and concrete directions for future research based on the paper's methods, analyses, and scope.
- Replicability of LLM behavior is unclear because prompting, system instructions, sampling temperature, and decoding settings for each model are not reported; future work should specify and systematically vary these to assess sensitivity.
- Validity of the instructional and linguistic classifiers (talk moves, politeness, agency) in math-tutoring discourse is unverified; conduct domain-specific calibration, accuracy benchmarking, and error analysis on this corpus.
- Inter-annotator reliability for the four pedagogical dimensions (mistake identification, location, guidance, actionability) is not reported; quantify agreement and examine how annotation uncertainty affects results.
- Mathematical correctness of tutor feedback is not directly measured; add explicit correctness/validity checks of guidance and solutions alongside pedagogical qualities.
- The analysis is limited to single-turn responses; extend to multi-turn interactions to capture adaptation, scaffolding over time, and longitudinal learning dynamics.
- Student learning outcomes are absent; run controlled studies linking feature configurations to downstream learning gains, persistence, and transfer.
- Only two human tutors (one expert, one novice) are used; expand to larger, more diverse tutor pools to improve generalizability of “expert” and “novice” profiles.
- Negative association between politeness and pedagogical quality is not unpacked; decompose politeness into subtypes (e.g., hedging, mitigation, indirectness) and test which facets hinder/enable effective correction.
- The agency measure may conflate tutor agency with student-empowering language; distinguish locus of agency (tutor vs student) and evaluate their separate effects on quality.
- Heterogeneity by error type or mathematical topic is not analyzed; incorporate an error taxonomy (e.g., arithmetic vs algebra, procedural vs conceptual) to test feature–quality associations across error classes.
- Associations are estimated across all sources pooled; fit subgroup models (human-only, LLM-only, expert vs novice) to examine whether correlates differ by tutor type.
- Potential multicollinearity among features (e.g., length, MTLD, politeness) is not assessed; report diagnostics (VIF), test interactions, and consider non-linearities (e.g., U-shaped effects of length/readability).
- Readability is measured only via Flesch–Kincaid Reading Ease; compare alternative readability indices (e.g., Dale–Chall, SMOG) and math-specific readability measures that account for symbols and equations.
- Stability of MTLD on short texts is assumed; validate lexical diversity with alternative metrics (e.g., HD-D, vocd-D) and bootstrapped reliability analyses on short responses.
- Detection of restating/revoicing relies solely on a classifier; add human validation and qualitative coding to confirm model outputs and capture nuanced revoicing strategies.
- Differences across LLMs are described but not linked to architecture/training data/parameter count; perform controlled comparisons to isolate capacity vs stylistic defaults.
- Dataset description has inconsistencies (e.g., “~300 conversations” vs “2,476 individual conversations”); clarify counts, splits, and sampling to support reproducibility.
- Preprocessing details (tokenization, sentence segmentation, syllable counting) used for length/readability/MTLD are not specified; document pipelines to prevent metric artifacts across human vs LLM text.
- No intervention tests are conducted to alter LLM strategies (e.g., prompts that enforce revoicing or reduce excessive politeness); run manipulation experiments to evaluate causal effects on pedagogical quality.
- Formatting and structure (bullets, stepwise hints, equation rendering) are not considered; analyze how textual organization influences readability and error correction.
- The study is observational; implement randomized controlled trials that prescribe specific instructional/linguistic features to establish causal impacts on quality and learning.
- Equity and fairness are not examined; test whether feature effectiveness varies across student demographics, prior achievement, or language proficiency.
- English-only math context limits generalization; replicate in other languages and evaluate cross-cultural pragmatics (politeness norms, directive styles).
- Domain generalization is unknown; assess whether findings hold in other subjects (science, writing, programming) where feedback norms differ.
- Trade-offs with affective outcomes are not measured; evaluate how optimizing for pedagogical quality affects rapport, motivation, and perceived support.
- Potential selection bias from curated tutoring datasets is not addressed; validate findings on real-world classroom or platform data with natural error distributions.
- Dependence structure across tutors is not modeled; consider mixed-effects models with random effects for tutor and conversation to account for repeated measures.
- Releasing code, features, and annotations is not mentioned; provide open materials and detailed protocols to facilitate replication and meta-analyses.
- The mechanism behind the negative agency–quality link is unclear; test finer-grained agency constructs (directive vs autonomy-supportive language) and their causal roles.
- Explore whether the relation between readability and quality is non-linear or thresholded; evaluate quadratic terms and piecewise models to detect diminishing returns.
- Investigate interactions among instructional moves (e.g., restating × pressing for accuracy) to identify synergistic combinations that maximize pedagogical quality.
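One of the diagnostics proposed above, a variance-inflation-factor (VIF) check for multicollinearity among features, can be sketched as follows. The data is synthetic and the feature names are illustrative, chosen to mirror the paper's features:

```python
import numpy as np

def vif(X):
    """Variance inflation factor per column: regress each feature on the
    others; VIF_j = 1 / (1 - R^2_j). Values above ~5-10 flag collinearity."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # add intercept
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        out.append(1.0 / (1.0 - r2))
    return out

rng = np.random.default_rng(0)
length = rng.normal(size=200)
mtld_score = 0.9 * length + 0.1 * rng.normal(size=200)  # nearly collinear
politeness = rng.normal(size=200)                       # independent
vifs = vif(np.column_stack([length, mtld_score, politeness]))
# length and mtld_score show inflated VIFs; politeness stays near 1
```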
Glossary
- Actionability: A pedagogical dimension indicating whether feedback tells the student what to do next. "Actionability: The tutor's response should inform the student on what they should do next."
- Agency: The extent to which language expresses intention, control, or action orientation. "Agency is operationalized using a transformer-based model that estimates the degree of agentic expression in text"
- Agentic language: Language that conveys assertiveness or control, often associated with taking action. "higher levels of agentic and polite language are negatively associated."
- Composite pedagogical quality score: An aggregated measure combining multiple annotated dimensions of feedback quality. "We aggregated scores assigned to each pedagogical dimension to derive a composite pedagogical quality score."
- Confidence intervals: Statistical ranges indicating uncertainty around estimated values. "with error bars indicating 95% confidence intervals."
- Compound feedback: Feedback that bundles multiple suggestions or actions together rather than focusing on a single step. "LLMs more often provide compound feedback, whereas human tutors typically deliver focused, single-action interventions"
- Demeaning (within-conversation): Subtracting the within-group mean from each value to isolate variation inside a group. "By demeaning both the outcome and predictors within each conversation, this specification isolates within-conversation variation"
- Discourse organisation: The structural arrangement and coherence of text across sentences and paragraphs. "differences in lexical choice, syntactic patterns, and discourse organisation"
- Epistemic framing: How feedback positions knowledge, certainty, and evidence within an explanation. "other aspects of tutor responses such as discourse structure, epistemic framing, or mathematical specificity may also be relevant"
- Flesch–Kincaid Reading Ease: A readability metric based on sentence length and word syllables. "Readability is assessed using the Flesch–Kincaid Reading Ease score"
- Fixed effects model: A regression approach controlling for unobserved, constant differences across groups or contexts. "we estimated a linear fixed effects model at the conversation level."
- Hint-based guidance: Feedback that nudges students toward solutions via hints rather than providing answers directly. "Both tend to rely on hint-based guidance rather than direct solutions"
- Intelligent tutoring systems: Computer-based systems that provide personalized instruction or feedback. "human tutoring, intelligent tutoring systems, and other tutoring systems"
- Log-transformed: Applying a logarithm to a variable to reduce skew and the impact of outliers. "Response length is measured as the total number of tokens in each response and log-transformed to reduce skew and limit the influence of outliers."
- Measure of Textual Lexical Diversity (MTLD): A length-insensitive metric capturing the variety of vocabulary used. "Lexical diversity is quantified using the Measure of Textual Lexical Diversity (MTLD), which captures the range of vocabulary use independently of text length."
- Ordinary least squares: A standard linear regression technique that minimizes squared residuals. "Coefficients from an ordinary least squares model predicting perceived pedagogical quality at the tutor response level."
- Pedagogical dimensions: Annotated aspects of feedback quality such as mistake identification, location, guidance, and actionability. "The dataset consists of human annotation of responses using a set of pedagogical dimensions described in prior work (Maurya et al., 2024)."
- Pedagogical quality: The perceived effectiveness of feedback in addressing errors and guiding correction. "We evaluate tutor responses in terms of pedagogical quality, operationalized through structured annotations of error handling and guidance"
- Politeness: Pragmatic markers of interpersonal tone such as mitigation and respectfulness. "Politeness is estimated using a transformer-based classifier"
- Pressing for accuracy: An instructional move that challenges the correctness of a student’s answer to promote precision. "Pressing for accuracy indicates whether a response explicitly challenges or prompts the student to reconsider the correctness of their answer"
- Probabilistic outputs (classifier probabilities as continuous features): Using model probability scores directly as numeric features rather than thresholded labels. "We use the probabilistic outputs of classifiers as continuous features to represent the degree to which each response exhibits a given instructional or linguistic feature"
- Relative pedagogical quality: A response’s quality measured relative to other responses to the same prompt. "This transformation yields a measure of relative pedagogical quality that captures how a tutor’s response compares to alternative responses to the same instructional prompt."
- Remediation: Targeted feedback to address and correct student mistakes. "math remediation conversation turns."
- Restating/Revoicing: Reformulating a student’s reasoning or answer to clarify and highlight misconceptions. "Restating or revoicing captures whether the tutor reformulates the student’s reasoning or answer in their own words"
- Scaffolding: Structured support that helps learners progress toward understanding or skill mastery. "Evaluations of tutoring interactions often focus on traits such as engagement, empathy, scaffolding, and conciseness"
- Standard errors (clustered): Error estimates adjusted for correlated observations within groups. "standard errors were clustered at the conversation level to account for dependence among responses within the same interaction."
- Stylistic variability: Variation in writing style across texts or authors. "characteristic patterns of lexical choice and reduced stylistic variability."
- Transformer-based classifier: A neural text classifier built on transformer architectures. "Politeness is estimated using a transformer-based classifier"
- Turn-centered measure: A metric computed within each dialogue turn to control for prompt-specific factors. "we construct a turn-centered measure of pedagogical quality."
- Turn-level comparison: Comparing responses that address the same specific student turn for controlled evaluation. "We examine this question using a controlled, turn-level comparison"
- Within-conversation regression: A regression that models variation among responses inside the same conversation. "Within-conversation regression predicting relative pedagogical quality from instructional and linguistic features."
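Several glossary entries (demeaning, fixed effects model, within-conversation regression) describe pieces of one estimation procedure. A minimal numpy sketch on synthetic data, with invented variable names and an invented effect size of 0.6 purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n_conv, per_conv = 50, 8                        # 8 tutor responses per conversation
conv = np.repeat(np.arange(n_conv), per_conv)   # conversation id per response
revoicing = rng.uniform(size=conv.size)         # e.g. a classifier probability
difficulty = rng.normal(size=n_conv)[conv]      # conversation-level confounder
quality = 0.6 * revoicing + difficulty + 0.1 * rng.normal(size=conv.size)

def demean_within(x, groups):
    """Subtract each group's mean: isolates within-conversation variation,
    absorbing anything constant within a conversation (e.g. difficulty)."""
    means = np.bincount(groups, weights=x) / np.bincount(groups)
    return x - means[groups]

y = demean_within(quality, conv)
x = demean_within(revoicing, conv)
slope = (x @ y) / (x @ x)   # within-conversation (fixed-effects) estimate
# slope recovers ~0.6 even though raw quality is dominated by difficulty
```

In practice one would also cluster standard errors at the conversation level, as the paper does; a full implementation would use a library such as statsmodels rather than this bare least-squares step.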
Practical Applications
Immediate Applications
The following applications can be deployed now, leveraging the paper’s findings on instructional strategies and linguistic correlates of pedagogical quality, as well as its evaluation methodology.
- Education/EdTech — AI Tutor QA and Monitoring Pipeline
- Description: Build a “Pedagogical Quality Meter” that scores AI tutor responses using continuous features (restating/revoicing, pressing for accuracy, lexical diversity, politeness, agency) to flag low-quality feedback and surface improvement suggestions.
- Tools/products/workflows: LMS plugin or SDK that ingests tutor transcripts, computes MTLD/readability, runs pretrained classifiers for talk moves, politeness, and agency, and displays tutor-level dashboards with turn-centered baselines.
- Assumptions/dependencies: Availability and reliability of feature classifiers (e.g., TalkMoves, TyDiP, BERTAgent), English-language coverage, suitable privacy controls for student data, alignment to math remediation tasks.
- Education/EdTech — Prompt and Style Controllers for AI Tutors
- Description: Introduce prompt templates and system-level style controls to increase restating/revoicing and pressing for accuracy while reducing overly polite or highly agentic phrasing; expose toggles like “Revoice student reasoning,” “Press for accuracy,” “Tone calibration.”
- Tools/products/workflows: Prompt linter, style tuner APIs, “Revoice This” button embedded in tutor chat UIs.
- Assumptions/dependencies: LLM responsiveness to prompt-level control and guardrail policies; monitoring to avoid unintended drops in correctness.
- Teacher Professional Development — Evidence-Based Feedback Training
- Description: Create PD modules that train novice tutors to adopt expert-like moves (restating/revoicing, pressing for accuracy) and calibrate linguistic choices (lexical diversity without excessive politeness/agency).
- Tools/products/workflows: Tutor analytics dashboards highlighting these features; automated post-session reports with examples and targeted practice.
- Assumptions/dependencies: Access to session transcripts, consent procedures, and school/district buy-in.
- Education Policy and Procurement — Evaluation Criteria for AI Tutors
- Description: Establish procurement rubrics requiring vendors to report tutor performance using the paper’s turn-centered evaluation and feature-based profiles, prioritizing models that approach expert-level quality.
- Tools/products/workflows: RFP checklists; vendor compliance reports; independent audits using shared benchmarks.
- Assumptions/dependencies: Agreement on standardized metrics and third-party evaluation protocols; policy capacity for audits.
- EdTech Content Authoring — Feedback Style Guidelines
- Description: Authoring guides for AI and human-generated feedback emphasizing restating/revoicing and pressing for accuracy; caution against over-politeness/agency; avoid optimizing readability or length alone (not predictive here).
- Tools/products/workflows: “Feedback Linter” integrated into content creation tools.
- Assumptions/dependencies: Domain specificity (math remediation); need to validate in other subjects.
- Research Methods — Turn-Centered Evaluation Design
- Description: Adopt the paper’s within-conversation fixed-effects framework to control for prompt difficulty when comparing tutors or model variants in offline experiments.
- Tools/products/workflows: Evaluation scripts; standardized annotation schema for mistake identification/location, guidance, actionability.
- Assumptions/dependencies: Access to multi-response datasets per turn; annotator training; consistent scales.
- Customer Support and Corporate Training — Clarity-First Scripting
- Description: Adapt feedback scripts to emphasize restating the customer’s issue and pressing for accuracy over excessive politeness, improving corrective clarity in problem resolution.
- Sector links: Customer support, operations training.
- Assumptions/dependencies: Generalization of findings beyond math; alignment with brand tone and cultural norms.
- Daily Life — Personal Study Assistants with Feedback Style Controls
- Description: Let learners configure their AI study assistant’s feedback style (e.g., “more revoicing,” “less apologetic tone,” “ask me to verify steps”) to improve corrective guidance in math practice.
- Tools/products/workflows: Consumer app settings for feedback moves and tone.
- Assumptions/dependencies: LLM control reliability; user education to avoid misinterpreting tone as correctness.
- Software/Developer Tools — Pedagogical Feature Extractor Library
- Description: Release an open-source library that computes the paper’s features and exposes a scoring API for downstream integration in tutoring and coaching apps.
- Tools/products/workflows: Python package with MTLD, readability, and classifier wrappers; CI hooks for automated feedback QA.
- Assumptions/dependencies: Model licenses (Hugging Face), maintenance of classifiers, and documentation.
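A "Pedagogical Quality Meter" of the kind described above might combine classifier probabilities into a single score and flag. The weights, threshold, and helper names below are purely illustrative, not values or components reported in the paper; only the signs follow its reported associations:

```python
from dataclasses import dataclass

@dataclass
class ResponseFeatures:
    """Continuous features: classifier probabilities in [0, 1] plus a
    z-scored lexical diversity. Upstream feature models are assumed given."""
    revoicing: float
    pressing_accuracy: float
    lexical_diversity_z: float
    politeness: float
    agency: float

# Signs mirror the paper's findings (revoicing, pressing, diversity:
# positive; politeness, agency: negative); magnitudes are made up.
WEIGHTS = {
    "revoicing": 1.0,
    "pressing_accuracy": 0.4,
    "lexical_diversity_z": 0.5,
    "politeness": -0.6,
    "agency": -0.6,
}

def quality_score(f: ResponseFeatures) -> float:
    return sum(w * getattr(f, name) for name, w in WEIGHTS.items())

def flag_low_quality(f: ResponseFeatures, threshold: float = 0.0) -> bool:
    """Flag responses scoring below an (illustrative) threshold for review."""
    return quality_score(f) < threshold

good = ResponseFeatures(0.8, 0.6, 0.5, 0.3, 0.2)   # revoices, presses
weak = ResponseFeatures(0.1, 0.1, -0.5, 0.9, 0.9)  # over-polite, agentic
```

A deployed meter would calibrate such weights against annotated quality scores (e.g. via the within-conversation regression the paper describes) rather than fixing them by hand.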
Long-Term Applications
These applications require additional research, scaling, cross-domain validation, or integration with outcome measures and governance.
- Education/EdTech — Reward Models and Fine-Tuning for Pedagogical Moves
- Description: Train reward models (RLHF/RLAIF) that explicitly optimize for restating/revoicing, pressing for accuracy, and lexical diversity while discouraging overly polite/agentic language; validate on multi-turn interactions.
- Tools/products/workflows: Tutor-specific alignment datasets; multi-turn simulations; guardrails for correctness and safety.
- Assumptions/dependencies: Large labeled datasets with outcome signals (learning gains), compute resources, and robust safety layers.
- Multi-Subject and Multilingual Expansion
- Description: Extend feature detectors and evaluation to other subjects (science, writing, languages) and languages, adapting politeness and agency classifiers to cultural-linguistic norms.
- Sector links: Global education providers, language learning platforms.
- Assumptions/dependencies: Cross-cultural annotation standards, multilingual classifier development, subject-specific talk move taxonomies.
- Intelligent Tutoring Systems with Learning Outcome Integration
- Description: Close the loop by measuring downstream learning gains; personalize which instructional moves and linguistic profiles work best for each learner.
- Tools/products/workflows: Adaptive ITS that track student progress and vary feedback strategies; causal experimentation at scale.
- Assumptions/dependencies: Longitudinal trials, ethical data use, robust experimental design.
- Regulatory Standards and Certification for AI Tutors
- Description: Create certification regimes (e.g., “Pedagogically Aligned Tutor” seals) using standardized metrics and audits to protect learners from fluent-but-inaccurate feedback.
- Sector links: Education policy, accreditation bodies.
- Assumptions/dependencies: Multi-stakeholder consensus, oversight capacity, transparent reporting.
- Real-Time Coaching for Human Tutors and Teachers
- Description: Build “Tutor Copilots” that provide live suggestions (e.g., revoice student reasoning; press for accuracy now) during sessions, with minimal disruption.
- Tools/products/workflows: In-class assistance via wearable or desktop coach; post-hoc reflection reports.
- Assumptions/dependencies: Real-time inference, user acceptance, data privacy, and co-design with educators.
- Healthcare and Patient Education — Clarification-Focused Chatbots
- Description: Patient-facing bots that revoice patient explanations and press for accuracy to clarify medication instructions or symptom descriptions without excessive politeness that obscures corrective clarity.
- Sector links: Healthcare communication, digital therapeutics.
- Assumptions/dependencies: Clinical safety validation, medical liability frameworks, domain-specific tuning.
- Workforce Training and Safety-Critical Domains
- Description: AI coaches for apprenticeships (e.g., technical trades, compliance training) that prioritize revoicing and accuracy pressing to prevent error propagation.
- Sector links: Manufacturing, utilities, aviation, logistics.
- Assumptions/dependencies: Domain adaptation, scenario-based evaluation, safety cases.
- Authoring IDEs for Feedback — “Feedback Linter Pro”
- Description: Professional authoring tools that flag feedback likely to be pedagogically weak (e.g., too polite, too agentic, insufficient revoicing) and suggest targeted rewrites.
- Tools/products/workflows: Integrated editor extensions; batch QA for large content pipelines.
- Assumptions/dependencies: Industry adoption, cross-domain generalization, maintainable rule sets.
- Fairness, Safety, and Bias Auditing of Tutor Language
- Description: Auditing frameworks that assess whether tone adjustments (politeness/agency) affect different learner groups disparately and whether pressing for accuracy is applied equitably.
- Sector links: Governance, compliance.
- Assumptions/dependencies: Representative datasets, fairness metrics tailored to tutoring contexts, legal/ethical guidance.