
Automated Questionnaire Generation

Updated 1 February 2026
  • Automated questionnaire generation is a process that utilizes machine learning and large language models to automatically create structured Q&A pairs for tasks like assessment and benchmark construction.
  • The methodology involves rigorous preprocessing, metadata extraction, and retrieval-augmented generation to ensure precise, context-relevant question generation across technical domains.
  • Empirical evaluations show that integrating RAG techniques can boost question-answering accuracy from around 50% to 71–75%, while also supporting Responsible AI measures such as fairness and safety.

Automated questionnaire generation refers to the use of algorithmic and machine learning techniques—especially LLMs—to produce structured question sets for downstream tasks such as knowledge assessment, data gathering, system evaluation, or benchmark construction. This process encompasses the automatic creation of question-answer (Q&A) pairs, multiple-choice items, and associated labeling schemas, often with task- or domain-specific constraints. Approaches vary from fully generative pipelines to retrieval-augmented workflows and can be deployed in domains ranging from telecom standards to e-commerce and Responsible AI evaluation.

1. Data Sources and Corpus Construction

Automated questionnaire generation critically depends on the availability and structure of source corpora. One approach exemplified by "TSpec-LLM: An Open-source Dataset for LLM Understanding of 3GPP Specifications" (Nikbakht et al., 2024) involves exhaustively curating domain-specific documents—in this case, the entirety of 3GPP technical specifications, reports, and supporting documents from Release 8 through 19 (1999–2023). Document types are stratified as Technical Specifications (TS), Technical Reports (TR), Change Requests (CR), and Study Items, organized by series (e.g., 36.xxx for LTE, 38.xxx for 5G NR).

Rigorous preprocessing protocols are required. The pipeline involves:

  1. Automated bulk download of all versions and revisions (using the "download3gpp" tool, v0.7.0).
  2. Batch document conversion into markdown, preserving tables, LaTeX-formatted equations, figure captions, and footnotes.
  3. Structured metadata extraction (release, version, document type, series).
  4. Tokenization and chunking, using distributed text splitters to enforce a maximum chunk size ($W \in [512, 1024]$ tokens).
  5. Embedding and indexing (e.g., OpenAI's text-embedding-ada-002, dimension $D = 1536$).
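Steps 4–5 above can be sketched as follows. This is a minimal illustration, not the paper's implementation: whitespace tokens stand in for model tokens, and `embed_fn` is a placeholder for a real embedding model such as text-embedding-ada-002.

```python
# Sketch of steps 4-5: fixed-size chunking followed by embedding.
from typing import Callable, List


def chunk_text(text: str, max_tokens: int = 512) -> List[str]:
    """Split text into chunks of at most max_tokens whitespace tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]


def build_index(docs: List[str],
                embed_fn: Callable[[str], List[float]],
                max_tokens: int = 512):
    """Chunk every document and pair each chunk with its embedding."""
    index = []
    for doc in docs:
        for chunk in chunk_text(doc, max_tokens):
            index.append((chunk, embed_fn(chunk)))
    return index
```

In a production pipeline the chunker would count model tokens (not words) and the resulting (chunk, embedding) pairs would be written to a vector store rather than held in a Python list.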

A plausible implication is that high-quality preprocessing and the preservation of tables and formulas are necessary to support precise, knowledge-intensive automated question generation—especially in engineering domains.

2. Questionnaire Generation Methodologies

Automated question generation pipelines often integrate LLMs at the core of candidate item creation, supplemented by iterative filtering and labeling protocols. For example, in (Nikbakht et al., 2024), the pipeline for generating telecom technical questions is as follows:

  • Draft question creation and difficulty tagging (“Easy,” “Intermediate,” “Hard”) via prompt-engineered GPT-4.
  • LLM-based verification, employing Mixtral 8×7B for independent assessment (with a reported 78% agreement).
  • Human adjudication for resolving ambiguous or borderline cases.
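The draft → verify → adjudicate loop above can be sketched as simple control flow. Here `draft_model` and `verify_model` are hypothetical stand-ins for the GPT-4 drafter and the verifier LLM; they are plain callables so the routing logic is runnable.

```python
# Sketch of the draft -> verify -> adjudicate pipeline.
from typing import Callable, Dict, List


def generate_questions(contexts: List[str],
                       draft_model: Callable[[str], Dict],
                       verify_model: Callable[[Dict], bool]) -> Dict[str, List[Dict]]:
    """Route each drafted item to 'accepted' or a human-review queue."""
    accepted, needs_review = [], []
    for ctx in contexts:
        item = draft_model(ctx)        # e.g. {'question', 'answer', 'difficulty'}
        if verify_model(item):         # independent LLM check
            accepted.append(item)
        else:
            needs_review.append(item)  # human adjudication for borderline cases
    return {"accepted": accepted, "needs_human_review": needs_review}
```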

In the context of Responsible AI benchmarks, question/prompt template instantiation is elaborated in (Sagae et al., 23 Oct 2025), where prompts incorporate cross-parameterization along axes such as demographic attributes, product categories, and gendered adjectives. The automated pipeline:

  1. Forms the full cross product of input parameters.
  2. Heuristically filters nonsensical or irrelevant combinations.
  3. Automatically retrieves task-relevant context (e.g., e-commerce product feature lists).
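Steps 1–2 of this pipeline can be sketched with a cross product over parameter axes plus a heuristic filter. The axes, template, and filter rule below are illustrative, not taken from the source dataset.

```python
# Sketch of steps 1-2: exhaustive parameterization plus heuristic filtering.
from itertools import product


def instantiate_prompts(axes, template, keep):
    """Cross all parameter axes; drop combinations rejected by keep()."""
    names = list(axes)
    combos = product(*(axes[n] for n in names))
    prompts = []
    for values in combos:
        params = dict(zip(names, values))
        if keep(params):                  # heuristic filter for nonsense combos
            prompts.append(template.format(**params))
    return prompts
```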

A core principle is that prompt templates embody structured metadata, supporting the downstream evaluation of dimensions such as fairness and safety.

3. Integration with Retrieval-Augmented Generation

A decisive approach for automated and contextually-relevant question or answer generation is retrieval-augmented generation (RAG), where LLMs are coupled to dense vector stores for domain-contextualized prompting. In (Nikbakht et al., 2024), the formal workflow is:

  • Document chunks $D = \{d_i\}_{i=1}^{N}$ are embedded as $e_i = \phi(d_i)$.
  • For a query $q$, its embedding $e_q = \phi(q)$ is computed.
  • Top-$K$ relevant chunks are retrieved via cosine similarity:

$r(q) = \text{arg top-}K_{e \in D}\, \operatorname{sim}(e_q, e)$

  • Retrieved chunks are concatenated with the query to form the RAG prompt:

$\text{prompt}_{\text{RAG}} = [ q ; "\n\n---\n\n" ; \text{concat}_{c \in r(q)} c ]$

  • The LLM then generates an answer conditioned on this augmented prompt.
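The retrieval and prompt-assembly steps above can be sketched as follows. This is a minimal version: embeddings are plain lists, $\phi$ is assumed to be computed externally, and a real system would use an approximate-nearest-neighbor index rather than a full sort.

```python
# Sketch of top-K cosine-similarity retrieval and RAG prompt assembly.
import math


def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


def retrieve(query_emb, index, k=3):
    """index: list of (chunk_text, embedding). Return the top-k chunks."""
    ranked = sorted(index, key=lambda pair: cosine(query_emb, pair[1]),
                    reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_rag_prompt(query, chunks):
    """Concatenate the query and retrieved chunks with a separator."""
    return query + "\n\n---\n\n" + "\n\n".join(chunks)
```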

The empirical improvement is significant: accuracy for telecom technical questions increases from approximately 50% (LLM-only) to 71–75% (LLM + naive RAG) for competitive models such as GPT-3.5, Gemini 1.0 Pro, and GPT-4 (Nikbakht et al., 2024).

4. Evaluation Protocols and Responsible AI Dimensions

Automated questionnaire generation is foundational for systematic LLM evaluation. In (Nikbakht et al., 2024), a multiple-choice Q&A set (100 items, evenly distributed across difficulty levels) is constructed to benchmark core LLM reasoning with and without RAG.

For Responsible AI, dataset-driven frameworks such as in (Sagae et al., 23 Oct 2025) structure evaluation around four quantitative dimensions:

  • Quality (semantic alignment): BERTScore $F_1$, on a $[0,1]$ scale.
  • Veracity (informational precision and recall): BERTScore precision ($BS_P$) and recall ($BS_R$).
  • Safety: Detoxify toxicity probability $T(\hat{y}) \in [0,1]$.
  • Fairness: metric disparity across cohort groups:

$\mathrm{Disparity}(m) = \dfrac{\max_{c \in \mathcal{C}} \mathbb{E}_{i \mid \mathrm{cohort}=c}[m_i]}{\min_{c \in \mathcal{C}} \mathbb{E}_{i \mid \mathrm{cohort}=c}[m_i]}$
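The disparity ratio (max over min of a metric's per-cohort mean) can be computed directly. The cohort labels and scores in the test are illustrative, not drawn from the referenced dataset.

```python
# Sketch of the per-cohort disparity ratio for a metric m.
from collections import defaultdict


def disparity(scores, cohorts):
    """scores[i] is metric m_i for item i; cohorts[i] is its cohort label."""
    sums, counts = defaultdict(float), defaultdict(int)
    for s, c in zip(scores, cohorts):
        sums[c] += s
        counts[c] += 1
    means = [sums[c] / counts[c] for c in sums]
    return max(means) / min(means)
```

Note that the ratio is undefined when the minimum cohort mean is zero; metrics bounded away from zero (e.g., smoothed toxicity probabilities) avoid this edge case.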

A plausible implication is that automated generation pipelines, when parameterized with structured metadata, enable granular measurement of LLMs’ strengths and deficiencies in both knowledge and Responsible AI metrics.

5. Dataset Composition and Implementation Examples

The referenced literature details pipeline statistics and practical dataset instances:

| Corpus | #Docs | Words (M) | Task | Q&A Items |
|---|---|---|---|---|
| TSpec-LLM (Nikbakht et al., 2024) | 30,137 | 535 | 3GPP standard Q&A (retrieval-augmented) | 100 |
| Responsible-AI Prompt (Sagae et al., 23 Oct 2025) | 7,047 | — | Product-description generation | 7,047 |

Qualitative sample: (Nikbakht et al., 2024), telecom Q&A—“What is the maximum directional gain of an antenna element?” (A: 8 dBi from TR 38.901 V16.1.0).

In Responsible AI (Sagae et al., 23 Oct 2025), automated prompts integrate product, adjective, category, and identity group; resulting LLM outputs are scored for relevance, toxicity, and cohort disparity. For instance, Disparity(toxicity) reaches 645× between highest-risk and lowest-risk categories.

6. Practical Recommendations and Limitations

Effective adoption of automated questionnaire generation pipelines requires:

  • Open-source, comprehensive corpora (e.g., Hugging Face hosting for TSpec-LLM).
  • Preservation of complex document elements (tables, mathematical formulas).
  • Reusable prompt templates with domain-appropriate parameterization axes.
  • Integration with robust embedding and retrieval infrastructure (e.g., LlamaIndex, FAISS).

Limitations include dependency on the quality and representativeness of ground-truth text (e.g., potential biases in human-generated product descriptions), lack of comprehensive manual annotation for sensitive dimensions (toxicity, cohort fairness), and monolingual focus in some datasets (Sagae et al., 23 Oct 2025). For fine-tuning, in-context Q&A pairs can serve as training targets in SFT/LoRA pipelines.
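Converting generated Q&A pairs into SFT training targets, as mentioned above, typically means serializing them as JSONL records. The `prompt`/`completion` field names below are a common convention, not prescribed by the source.

```python
# Sketch: serialize generated Q&A pairs as JSONL records for SFT/LoRA.
import json


def qa_to_sft_records(qa_pairs):
    """qa_pairs: list of (question, answer) tuples -> JSONL string."""
    lines = []
    for question, answer in qa_pairs:
        record = {"prompt": question, "completion": answer}
        lines.append(json.dumps(record))
    return "\n".join(lines)
```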

7. Impact and Benchmarking Utility

Automated questionnaire generation is shown to transform the evaluation landscape for domain-specific LLM performance and Responsible AI assessment. In the telecom domain, the TSpec-LLM dataset, paired with a simple RAG framework, yields a 20–30 percentage point improvement in multiple-choice question-answering over LLM-only baselines, outperforming earlier, less comprehensive benchmarks such as SPEC5G (Nikbakht et al., 2024). In Responsible AI, prompt libraries systematically surface fairness and safety disparities, enabling direct, head-to-head model comparison (Sagae et al., 23 Oct 2025).

A plausible implication is that automated question generation protocols—paired with comprehensive, well-labeled corpora and structured evaluation—enable reproducible, domain-robust benchmarks and facilitate rapid iteration in high-stakes application domains.
