
Neural Question Generation Framework

Updated 13 February 2026
  • Neural Question Generation (NQG) is an end-to-end framework that uses sequence-to-sequence models with attention to generate contextually relevant questions from text and answer spans.
  • The framework employs bidirectional LSTM encoders and unidirectional LSTM decoders with beam search and UNK-copy mechanisms to enhance fluency and answer alignment.
  • Recent extensions incorporate answer-aware encodings, pointer-generator networks, and transformer-based models to further improve the quality and relevance of generated questions.

Neural Question Generation (NQG) Framework

Neural Question Generation (NQG) refers to the set of techniques and models that, given an input text (typically a sentence or passage) and an answer span, generate coherent and relevant natural-language questions that are contextually answerable from the input. Unlike earlier rule-based systems, neural approaches utilize end-to-end trainable sequence learning, leveraging deep encoders, attention mechanisms, and large-scale supervisory signals.

1. Core Architecture and Methodology

The canonical NQG framework, as introduced in "Learning to Ask: Neural Question Generation for Reading Comprehension," is a sequence-to-sequence (seq2seq) model with attention, designed to map input sentences (optionally augmented with context paragraphs) to question sequences (Du et al., 2017). The key components include:

  • Encoder: A stack of bidirectional LSTM layers (two layers, hidden size 600), processing either just the sentence $x_1 \dots x_M$ or both the sentence and its containing paragraph $z_1 \dots z_L$. Each encoder produces context-dependent embeddings. For the sentence-level encoder:
    • Forward: $\overrightarrow{b_t} = \overrightarrow{\mathrm{LSTM}_2}(x_t, \overrightarrow{b_{t-1}})$
    • Backward: $\overleftarrow{b_t} = \overleftarrow{\mathrm{LSTM}_2}(x_t, \overleftarrow{b_{t+1}})$
    • Concatenation: $b_t = [\overrightarrow{b_t}; \overleftarrow{b_t}]$

The final sentence representation for decoder initialization is $s = [\overrightarrow{b_M}; \overleftarrow{b_1}]$. If additional paragraph context is used, a paragraph vector $s'$ is computed analogously, and $h_0 = [s; s']$ forms the decoder's initial state in a "Y-shaped" configuration.
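As a shape-level sketch of the concatenations above, with randomly initialized arrays standing in for the LSTM outputs (and a small hidden size rather than the paper's 600):

```python
import numpy as np

# Toy dimensions: M tokens, hidden size d per direction.
M, d = 5, 4
rng = np.random.default_rng(0)

# Stand-ins for the per-token forward / backward LSTM states.
fwd = rng.standard_normal((M, d))   # fwd[t] plays the role of the forward b_t
bwd = rng.standard_normal((M, d))   # bwd[t] plays the role of the backward b_t

# b_t = [forward b_t ; backward b_t] -- context-dependent token embeddings.
b = np.concatenate([fwd, bwd], axis=1)   # shape (M, 2d)

# Sentence vector s = [forward b_M ; backward b_1]
# (last forward state, first backward state).
s = np.concatenate([fwd[-1], bwd[0]])    # shape (2d,)

# With a paragraph encoder, s' is built the same way, and the decoder's
# initial state is h_0 = [s ; s'] -- the "Y-shaped" configuration.
s_prime = np.zeros_like(s)               # placeholder paragraph vector
h0 = np.concatenate([s, s_prime])        # shape (4d,)
```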

  • Decoder: A two-layer unidirectional LSTM (hidden size 600), which generates questions token-by-token, attending over the encoder outputs. At each step $t$:
    • Recurrent update: $h_t = \mathrm{LSTM}_1(y_{t-1}, h_{t-1})$
    • Attention scores: $e_{i,t} = h_t^\top W_b b_i$, $\alpha_{i,t} = \frac{\exp(e_{i,t})}{\sum_j \exp(e_{j,t})}$
    • Context vector: $c_t = \sum_i \alpha_{i,t} b_i$
    • Prediction: $P(y_t \mid x, y_{<t}) = \mathrm{softmax}\bigl(W_s \tanh(W_t [h_t; c_t])\bigr)$
  • Input Representation: 300-dimensional GloVe embeddings (fixed during training), vocabularies of 45k (source) and 28k (target) most frequent tokens, handling OOVs as UNK. Each example is paired with an answer span; however, no explicit span-indicator embedding is used in this architecture (Du et al., 2017).
  • Optimization: The task is formulated as minimizing the negative log-likelihood of the gold-standard question conditioned on the input sentence and, where applicable, the paragraph:

$$\mathcal{L} = -\sum_{i=1}^{N} \sum_{t=1}^{|y^{(i)}|} \log P\bigl(y_t^{(i)} \mid x^{(i)},\, y_{<t}^{(i)}\bigr)$$

  • Inference: Beam search (beam size = 3) with UNK replacement via token-level attention, copying the input word assigned highest attention at each UNK.
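A minimal NumPy sketch of one decoding step's attention and the UNK-copy rule; all tensors here are illustrative stand-ins, not trained weights:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# Stand-in encoder states b_i (M tokens x 2d dims) and decoder state h_t.
M, d2 = 4, 6
rng = np.random.default_rng(1)
b = rng.standard_normal((M, d2))
h_t = rng.standard_normal(d2)
W_b = rng.standard_normal((d2, d2))

# e_{i,t} = h_t^T W_b b_i, normalized with a softmax over source positions.
e = b @ W_b.T @ h_t                 # shape (M,)
alpha = softmax(e)                  # attention weights over source tokens

# c_t = sum_i alpha_{i,t} b_i  -- the context vector.
c_t = alpha @ b                     # shape (d2,)

# UNK-copy at inference: if the decoder emits UNK, replace it with the
# source token receiving the highest attention weight at this step.
source_tokens = ["tesla", "was", "born", "in"]
predicted = "UNK"
if predicted == "UNK":
    predicted = source_tokens[int(np.argmax(alpha))]
```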

2. Data, Preprocessing, and Supervision

NQG training is supervised, requiring answer spans aligned with candidate questions. The SQuAD v1.0 dataset, comprising ≈90k question–answer pairs from 536 Wikipedia articles, is used. Preprocessing includes:

  • Sentence and word tokenization, lowercasing.
  • Locating the sentence containing the answer span and generating (sentence, question) pairs.
  • Excluding pairs with zero non-stopword overlap between source and reference (≈6.6% of raw, to filter misaligned annotations).
  • Annotating sentence boundaries (<SOS>, <EOS>).
  • Article-level splits: ≈70k train, 10.5k dev, 11.8k test. Average sentence length is ≈33 tokens; average question length is ≈11 tokens.

No position-indicator or answer tag embedding is introduced in this formulation, making the model reliant on contextual clues for answer focus.
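The non-stopword overlap filter used in preprocessing can be sketched as follows; the stopword list here is a tiny illustrative subset, not the one used in the original pipeline:

```python
# Keep a (sentence, question) pair only if source and reference share at
# least one non-stopword token; pairs with zero overlap are treated as
# misaligned annotations and dropped.
STOPWORDS = {"the", "a", "an", "of", "in", "is", "was", "what", "who",
             "where", "when", "to", "and", "?", ".", ","}

def has_content_overlap(sentence_tokens, question_tokens):
    src = {t.lower() for t in sentence_tokens} - STOPWORDS
    ref = {t.lower() for t in question_tokens} - STOPWORDS
    return len(src & ref) > 0

pairs = [
    (["tesla", "was", "born", "in", "1856", "."],
     ["when", "was", "tesla", "born", "?"]),     # overlap: tesla, born
    (["the", "sky", "is", "blue", "."],
     ["who", "was", "the", "president", "?"]),   # no content overlap
]
kept = [p for p in pairs if has_content_overlap(*p)]
```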

3. Evaluation Metrics and Empirical Results

Evaluation of NQG systems relies primarily on automatic n-gram matching and sequence-overlap metrics:

Metric Value (Sentence-only)
BLEU-1 43.09
BLEU-2 25.96
BLEU-3 17.50
BLEU-4 12.28
METEOR 16.62
ROUGE-L 39.75
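BLEU-n is a modified n-gram precision with a brevity penalty. A minimal single-reference BLEU-1 (unigram) sketch, omitting the multi-reference handling and smoothing of the full metric:

```python
import math
from collections import Counter

def bleu1(candidate, reference):
    """Modified unigram precision with brevity penalty (single reference)."""
    cand_counts = Counter(candidate)
    ref_counts = Counter(reference)
    # Clip each candidate token's count by its count in the reference.
    clipped = sum(min(c, ref_counts[tok]) for tok, c in cand_counts.items())
    precision = clipped / max(len(candidate), 1)
    # Brevity penalty: penalize candidates shorter than the reference.
    if len(candidate) >= len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))
    return bp * precision

ref = "what is the capital of france ?".split()
hyp = "what is the capital city of france ?".split()
score = bleu1(hyp, ref)   # 7 clipped matches over 8 tokens, bp = 1 -> 0.875
```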

Human evaluation dimensions include:

  • Naturalness (grammar/fluency): 3.36 (NQG) vs. 2.95 (baseline) vs. 3.91 (human).
  • Difficulty (divergence/reasoning): 3.03 (NQG) vs. 1.94 (baseline) vs. 2.63 (human).

Judges selected NQG outputs as best in 38.4% of pairwise comparisons, versus 20.2% for the prior rule-based system. Inter-rater agreement: Krippendorff’s α = 0.236 (Du et al., 2017).

4. Notable Innovations and Impact

Principal innovations of this framework include:

  • End-to-end trainable architecture eliminating fixed rules, feature engineering, or manual pipeline components.
  • A global attention mechanism that enables fine-grained alignment between generated question tokens and the input sentence.
  • Optional Y-shaped context encoder structure with paragraph-level context.
  • Pretrained, fixed word embeddings and a robust UNK-copy mechanism at decoding enhance rare-word and entity reproduction.
  • Strong empirical performance, yielding roughly +25% relative BLEU-4 improvement over previous systems and producing questions that human judges rate as more fluent and challenging than earlier baselines (Du et al., 2017).

This design informed subsequent NQG studies, which expanded with answer-span encodings, auxiliary objectives (semantic and span-matching), RL or adversarial training, pointer-copy and coverage mechanisms, and transformer-based models (Ma et al., 2019, Murakhovs'ka et al., 2021, Guo et al., 2024).

5. Extensions and Variations in Recent Work

Subsequent research introduced various modifications atop the base NQG architecture:

  • Feature-rich and answer-aware encodings: Later models encode the answer position via BIO/bit flags or additional answer-span branches, sometimes fusing the answer vector into the decoder’s initial state (Ma et al., 2019).
  • Auxiliary objectives: Multi-task loss terms encourage (i) semantic alignment between question and source (sentence-level semantic matching); and/or (ii) accurate prediction of the answer span via attention-guided span-prediction modules (answer position inferring) (Ma et al., 2019).
  • Pointer-generator/copy mechanism: Facilitates copying of OOV or domain-specific tokens, significantly improving reproduction of rare entities, especially in open-domain or fine-grained contexts (Subramanian et al., 2017, Zhou et al., 2017).
  • Paragraph-level or multi-passage context: Additional context encoders may be introduced for multi-sentence passages or full paragraphs, generally concatenating learned context vectors for richer source representation.
  • Decoding strategies: More recent NQG frameworks utilize larger beam sizes, advanced length penalties, and attention-guided copying to further boost output quality and handle length/coverage tradeoffs (Shahidi et al., 2019).
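The BIO answer-position encoding mentioned above can be sketched as follows; the inclusive token-index span convention is an assumption for illustration:

```python
def bio_tags(tokens, answer_start, answer_end):
    """Tag each token: B = answer start, I = inside answer, O = outside.
    answer_start / answer_end are inclusive token indices."""
    tags = []
    for i, _ in enumerate(tokens):
        if i == answer_start:
            tags.append("B")
        elif answer_start < i <= answer_end:
            tags.append("I")
        else:
            tags.append("O")
    return tags

# Answer span "1856" at token index 6.
tokens = ["tesla", "was", "born", "in", "smiljan", "in", "1856"]
tags = bio_tags(tokens, answer_start=6, answer_end=6)
```

In answer-aware encoders, these tags are typically embedded and concatenated with the word embeddings so the model knows which span the question should target.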

6. Limitations, Open Challenges, and Research Directions

A number of challenges and directions persist across NQG research lines:

  • Answer signaling: While the original model uses answer supervision via paired examples, explicit span-encoding (position features, BIO tags, masking, or multi-channel encoders) is now standard and improves precision in answer-focused question generation (Zhou et al., 2017, Kumar et al., 2018).
  • Global semantics and answer irrelevance: Early seq2seq models frequently generate questions that drift off-topic or fail to anchor on the correct answer span (Ma et al., 2019). Auxiliary semantic matching losses and span-prediction modules address these issues.
  • Data and annotation: Performance is sensitive to sentence-level alignment, stopword overlap criteria, and answer localization in preprocessing. The framework relies on gold-standard answer spans, which affects applicability to answer-agnostic settings.
  • Evaluation: BLEU, METEOR, and ROUGE remain standard but do not fully capture answerability or reasoning difficulty; human studies and new metrics (BERTScore, QG-specific rewards, downstream QA performance) address these gaps (Du et al., 2017, Kumar et al., 2018).
  • Generality: The same core framework is found to generalize with minimal changes to a variety of input types and broader passage contexts, as later extended to transformer-based models and multi-task objectives (Guo et al., 2024, Murakhovs'ka et al., 2021).

7. Implementation Details and Best Practices

  • Software stack: The original implementation is in Torch7, integrated with OpenNMT; modern replications use PyTorch or TensorFlow.
  • Batch size: 64.
  • Regularization: Dropout $p = 0.3$ between LSTM layers; gradients clipped to norm ≤ 5.
  • Training time: Approximately 2 hours on a single GPU for up to 15 epochs, with model selection by minimum dev-set perplexity.
  • Practical inference: UNK replacement via attention copying and a moderate beam size (e.g., 3) for a fluency–diversity balance.
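The global-norm gradient clipping step (norm ≤ 5) can be sketched in NumPy; the gradient values below are illustrative stand-ins:

```python
import numpy as np

def clip_grad_norm(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so their global L2 norm does not
    exceed max_norm (the paper clips to norm <= 5)."""
    total_norm = float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm

# Stand-in gradients with global norm 10 -> rescaled to norm 5.
grads = [np.array([6.0, 8.0])]          # L2 norm = 10
clipped, norm_before = clip_grad_norm(grads)
```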

The NQG framework described here—attention-based seq2seq with learned global alignment—established the foundation for nearly all subsequent developments in neural question generation for reading comprehension and beyond (Du et al., 2017).
