Automated Questionnaire Generation
- Automated questionnaire generation uses machine learning and large language models to automatically create structured Q&A pairs for tasks such as assessment and benchmark construction.
- The methodology involves rigorous preprocessing, metadata extraction, and retrieval-augmented generation to ensure precise, context-relevant question generation across technical domains.
- Empirical evaluations show that integrating RAG techniques can boost question-answering accuracy from around 50% to 71–75%, while also supporting Responsible AI measures such as fairness and safety.
Automated questionnaire generation refers to the use of algorithmic and machine learning techniques—especially LLMs—to produce structured question sets for downstream tasks such as knowledge assessment, data gathering, system evaluation, or benchmark construction. This process encompasses the automatic creation of question-answer (Q&A) pairs, multiple-choice items, and associated labeling schemas, often with task- or domain-specific constraints. Approaches vary from fully generative pipelines to retrieval-augmented workflows and can be deployed in domains ranging from telecom standards to e-commerce and Responsible AI evaluation.
1. Data Sources and Corpus Construction
Automated questionnaire generation critically depends on the availability and structure of source corpora. One approach exemplified by "TSpec-LLM: An Open-source Dataset for LLM Understanding of 3GPP Specifications" (Nikbakht et al., 2024) involves exhaustively curating domain-specific documents—in this case, the entirety of 3GPP technical specifications, reports, and supporting documents from Release 8 through 19 (1999–2023). Document types are stratified as Technical Specifications (TS), Technical Reports (TR), Change Requests (CR), and Study Items, organized by series (e.g., 36.xxx for LTE, 38.xxx for 5G NR).
Rigorous preprocessing protocols are required. The pipeline involves:
- Automated bulk download of all versions and revisions (using the “download3gpp” tool, v0.7.0).
- Batch document conversion into markdown, preserving tables, LaTeX-formatted equations, figure captions, and footnotes.
- Structured metadata extraction (release, version, document type, series).
- Tokenization and chunking, using distributed text splitters to enforce a maximum chunk size in tokens.
- Embedding and indexing (e.g., OpenAI’s text-embedding-ada-002, which produces 1536-dimensional vectors).
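The chunk-and-embed steps above can be sketched as follows. This is a minimal illustration, not the paper's pipeline: whitespace token counting stands in for a real tokenizer, and `embed_fn` is a hypothetical callable standing in for an embedding API.

```python
from typing import Callable, List, Tuple

def chunk_text(text: str, max_tokens: int = 512) -> List[str]:
    """Split text into chunks of at most max_tokens whitespace tokens.

    A production pipeline would use the embedding model's own tokenizer;
    whitespace splitting stands in for it here.
    """
    tokens = text.split()
    return [" ".join(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]

def build_index(docs: List[str],
                embed_fn: Callable[[str], List[float]],
                max_tokens: int = 512) -> List[Tuple[str, List[float]]]:
    """Chunk every document and pair each chunk with its embedding."""
    index = []
    for doc in docs:
        for chunk in chunk_text(doc, max_tokens):
            index.append((chunk, embed_fn(chunk)))
    return index
```

In practice the resulting (chunk, vector) pairs would be written to a vector store rather than held in a Python list.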
A plausible implication is that high-quality preprocessing and the preservation of tables and formulas are necessary to support precise, knowledge-intensive automated question generation—especially in engineering domains.
2. Questionnaire Generation Methodologies
Automated question generation pipelines often integrate LLMs at the core of candidate item creation, supplemented by iterative filtering and labeling protocols. For example, in (Nikbakht et al., 2024), the pipeline for generating telecom technical questions is as follows:
- Draft question creation and difficulty tagging (“Easy,” “Intermediate,” “Hard”) via prompt-engineered GPT-4.
- LLM-based verification, employing Mixtral 8×7B for independent assessment (with a reported 78% agreement).
- Human adjudication for resolving ambiguous or borderline cases.
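The verification and adjudication steps reduce to comparing two sets of labels: items on which the generator and the verifier agree pass through, and disagreements are routed to a human. A minimal sketch (the function names are illustrative, not from the paper):

```python
from typing import List

def agreement_rate(labels_a: List[str], labels_b: List[str]) -> float:
    """Fraction of items on which two independent labelers agree."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must be the same length")
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)

def needs_adjudication(labels_a: List[str], labels_b: List[str]) -> List[int]:
    """Indices where the two labelers disagree, to be routed to a human."""
    return [i for i, (a, b) in enumerate(zip(labels_a, labels_b)) if a != b]
```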
In the context of Responsible AI benchmarks, question/prompt template instantiation is elaborated in (Sagae et al., 23 Oct 2025), where prompts incorporate cross-parameterization along axes such as demographic attributes, product categories, and gendered adjectives. The automated pipeline:
- Forms the full cross product of input parameters.
- Heuristically filters nonsensical or irrelevant combinations.
- Automatically retrieves task-relevant context (e.g., e-commerce product feature lists).
A core principle is that prompt templates embody structured metadata, supporting the downstream evaluation of dimensions such as fairness and safety.
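The cross-product-then-filter pipeline above can be sketched with `itertools.product`. The template syntax and the filter are illustrative assumptions, not the source's implementation:

```python
from itertools import product
from typing import Callable, Dict, List

def instantiate_prompts(template: str,
                        axes: Dict[str, List[str]],
                        is_sensible: Callable[[Dict[str, str]], bool]) -> List[str]:
    """Form the full cross product of parameter axes, drop combinations
    rejected by the heuristic filter, and fill the prompt template."""
    names = list(axes)
    prompts = []
    for values in product(*(axes[name] for name in names)):
        combo = dict(zip(names, values))
        if is_sensible(combo):
            prompts.append(template.format(**combo))
    return prompts
```

Because every prompt is derived from a named parameter combination, the same metadata can later be used to slice evaluation results by cohort.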
3. Integration with Retrieval-Augmented Generation
A decisive approach for automated and contextually-relevant question or answer generation is retrieval-augmented generation (RAG), where LLMs are coupled to dense vector stores for domain-contextualized prompting. In (Nikbakht et al., 2024), the formal workflow is:
- Each document chunk $c_i$ is embedded as $e_i = E(c_i)$.
- For a query $q$, its embedding $e_q = E(q)$ is computed.
- The top-$k$ relevant chunks $r(q)$ are retrieved via cosine similarity $\text{sim}(e_q, e_i) = \frac{e_q \cdot e_i}{\lVert e_q \rVert \, \lVert e_i \rVert}$.
- Retrieved chunks are concatenated with the query to form the RAG prompt:
$\text{prompt}_{\text{RAG}} = [ q ; "\n\n---\n\n" ; \text{concat}_{c \in r(q)} c ]$
- The LLM then generates an answer conditioned on this augmented prompt.
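The retrieve-and-concatenate workflow can be sketched in a few lines; this is a generic naive-RAG illustration under the stated cosine-similarity definition, not the paper's code:

```python
from math import sqrt
from typing import List, Tuple

def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def retrieve_top_k(query_emb: List[float],
                   index: List[Tuple[str, List[float]]],
                   k: int = 3) -> List[str]:
    """Return the k chunks whose embeddings are most similar to the query."""
    ranked = sorted(index, key=lambda item: cosine(query_emb, item[1]),
                    reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def build_rag_prompt(query: str, chunks: List[str]) -> str:
    """Concatenate the query with retrieved context, separated by a divider."""
    return query + "\n\n---\n\n" + "\n".join(chunks)
```

A production system would delegate the similarity search to a vector index (e.g., FAISS) rather than sorting all chunks per query.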
The empirical improvement is significant: accuracy for telecom technical questions increases from approximately 50% (LLM-only) to 71–75% (LLM + naive RAG) for competitive models such as GPT-3.5, Gemini 1.0 Pro, and GPT-4 (Nikbakht et al., 2024).
4. Evaluation Protocols and Responsible AI Dimensions
Automated questionnaire generation is foundational for systematic LLM evaluation. In (Nikbakht et al., 2024), a multiple-choice Q&A set (100 items, evenly distributed across difficulty levels) is constructed to benchmark core LLM reasoning with and without RAG.
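Scoring such a difficulty-stratified multiple-choice set amounts to per-tag accuracy. A minimal sketch, assuming items are (difficulty, predicted choice, gold choice) triples (the item format is an assumption, not the paper's):

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def accuracy_by_difficulty(items: List[Tuple[str, str, str]]) -> Dict[str, float]:
    """Score a multiple-choice set, grouped by difficulty tag.

    Each item is (difficulty, predicted_choice, gold_choice).
    """
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for difficulty, pred, gold in items:
        total[difficulty] += 1
        correct[difficulty] += int(pred == gold)
    return {d: correct[d] / total[d] for d in total}
```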
For Responsible AI, dataset-driven frameworks such as in (Sagae et al., 23 Oct 2025) structure evaluation around four quantitative dimensions:
- Quality (semantic alignment): BertScore $F_1$, on a $[0,1]$ scale.
- Veracity (informational precision and recall): BertScore precision ($P$) and recall ($R$).
- Safety: Detoxify toxicity probability $p_{\text{tox}} \in [0,1]$.
- Fairness: metric disparity across cohort groups, $\text{Disparity}(m) = \max_{g} m_g - \min_{g} m_g$.
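The fairness dimension can be sketched as a max-minus-min gap of a per-group metric; the exact disparity definition used in the source may differ, so treat this as an assumption:

```python
from typing import Dict

def disparity(metric_by_group: Dict[str, float]) -> float:
    """Gap between the best- and worst-scoring cohort for one metric."""
    values = list(metric_by_group.values())
    return max(values) - min(values)
```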
A plausible implication is that automated generation pipelines, when parameterized with structured metadata, enable granular measurement of LLMs’ strengths and deficiencies in both knowledge and Responsible AI metrics.
5. Dataset Composition and Implementation Examples
The referenced literature details pipeline statistics and practical dataset instances:
| Corpus | #Docs | Words (M) | Task | Q&A Items |
|---|---|---|---|---|
| TSpec-LLM (Nikbakht et al., 2024) | 30,137 | 535 | 3GPP standard Q&A (retrieval-aug) | 100 |
| Responsible-AI Prompt (Sagae et al., 23 Oct 2025) | 7,047 | – | Product-description generation | 7,047 |
Qualitative sample: (Nikbakht et al., 2024), telecom Q&A—“What is the maximum directional gain of an antenna element?” (A: 8 dBi from TR 38.901 V16.1.0).
In Responsible AI (Sagae et al., 23 Oct 2025), automated prompts integrate product, adjective, category, and identity group; the resulting LLM outputs are scored for relevance, toxicity, and cohort disparity. For instance, Disparity(toxicity) is largest between the highest-risk and lowest-risk product categories.
6. Practical Recommendations and Limitations
Effective adoption of automated questionnaire generation pipelines requires:
- Open-source, comprehensive corpora (e.g., Hugging Face hosting for TSpec-LLM).
- Preservation of complex document elements (tables, mathematical formulas).
- Reusable prompt templates with domain-appropriate parameterization axes.
- Integration with robust embedding and retrieval infrastructure (e.g., LlamaIndex, FAISS).
Limitations include dependency on the quality and representativeness of ground-truth text (e.g., potential biases in human-generated product descriptions), lack of comprehensive manual annotation for sensitive dimensions (toxicity, cohort fairness), and monolingual focus in some datasets (Sagae et al., 23 Oct 2025). For fine-tuning, in-context Q&A pairs can serve as training targets in SFT/LoRA pipelines.
7. Impact and Benchmarking Utility
Automated questionnaire generation is shown to transform the evaluation landscape for domain-specific LLM performance and Responsible AI assessment. In the telecom domain, the TSpec-LLM dataset, paired with a simple RAG framework, yields a 20–30 percentage point improvement in multiple-choice question-answering over LLM-only baselines, outperforming earlier, less comprehensive benchmarks such as SPEC5G (Nikbakht et al., 2024). In Responsible AI, prompt libraries systematically surface fairness and safety disparities, enabling direct, head-to-head model comparison (Sagae et al., 23 Oct 2025).
A plausible implication is that automated question generation protocols—paired with comprehensive, well-labeled corpora and structured evaluation—enable reproducible, domain-robust benchmarks and facilitate rapid iteration in high-stakes application domains.