FiLM Generator Architecture
- The FiLM generator maps symbolic language inputs (typically natural-language questions) to feature-wise affine modulation parameters for CNNs, enabling conditional visual reasoning.
- It employs a GRU to encode word embeddings into a fixed-length question embedding, which is then linearly projected to yield scaling and shifting vectors.
- Empirical ablations show that flexible, unrestricted scaling (γ) is the dominant factor in performance, while the architecture is otherwise robust across compositional visual tasks.
The FiLM generator architecture is the functional module responsible for mapping symbolic inputs (typically natural-language questions) to feature-wise affine modulation parameters used in FiLM (Feature-wise Linear Modulation) layers. This architecture is a central component in models designed for conditional visual reasoning, particularly where conditioning information must be seamlessly integrated with convolutional neural networks (CNNs) via structured affine modulation. The principal mechanism is to compute, per network block, scaling (γ) and shifting (β) vectors conditioned on a linguistic embedding, with the generator comprising a dedicated pipeline for encoding symbolic information and mapping it to these parameters (Perez et al., 2017).
1. Symbolic Input Encoding
The generator begins by embedding each token in the symbolic input (such as the words in a question) into a fixed 200-dimensional vector space. Let $(w_1, \ldots, w_T)$ denote the sequence of word embeddings for a $T$-word question. This sequence is processed by a single-layer Gated Recurrent Unit (GRU) with hidden size 4096. The GRU state recursion is:

$$h_t = \mathrm{GRU}(w_t, h_{t-1}), \quad t = 1, \ldots, T, \quad h_0 = 0.$$

The final hidden state $h_T$ serves as the fixed-length "question embedding" summarizing the conditioning context (Perez et al., 2017).
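The encoding stage can be sketched in numpy with toy dimensions (the paper's actual sizes are 200-dimensional embeddings and a 4096-unit GRU; `gru_step` below is the standard GRU update, not the paper's code):

```python
import numpy as np

def gru_step(x, h, W_z, U_z, b_z, W_r, U_r, b_r, W_h, U_h, b_h):
    """One standard GRU recursion step: h_t = (1 - z_t) * h_{t-1} + z_t * h~_t."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = sigmoid(W_z @ x + U_z @ h + b_z)               # update gate
    r = sigmoid(W_r @ x + U_r @ h + b_r)               # reset gate
    h_tilde = np.tanh(W_h @ x + U_h @ (r * h) + b_h)   # candidate state
    return (1 - z) * h + z * h_tilde

# Toy dimensions for illustration (the paper uses d_emb=200, d_hid=4096).
d_emb, d_hid, T = 8, 16, 5
rng = np.random.default_rng(0)
params = [rng.normal(scale=0.1, size=s) for s in
          [(d_hid, d_emb), (d_hid, d_hid), (d_hid,)] * 3]

words = rng.normal(size=(T, d_emb))   # (w_1, ..., w_T) word embeddings
h = np.zeros(d_hid)
for w in words:
    h = gru_step(w, h, *params)
# h is now the fixed-length question embedding h_T
print(h.shape)  # (16,)
```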
2. Projection to Modulation Parameters
For a visual neural network containing $N$ FiLM-modulated residual blocks, each with $C$ feature maps, the generator must produce two $C$-dimensional vectors $(\gamma^n, \beta^n)$ per block, i.e., the scale and shift for each feature channel within that block. For block $n$, this is performed via independent affine projections of $h_T$:

$$\gamma^n = W_\gamma^n h_T + b_\gamma^n$$

and

$$\beta^n = W_\beta^n h_T + b_\beta^n.$$

Here, $W_\gamma^n, W_\beta^n \in \mathbb{R}^{C \times 4096}$ and $b_\gamma^n, b_\beta^n \in \mathbb{R}^{C}$. In practice, these weights and biases may be concatenated into matrices of shape $2C × 4096$ and vectors of length $2C$, respectively. There are no hidden layers or nonlinearities between $h_T$ and the modulation parameters, resulting in a simple and direct conditioning mechanism.
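The per-block projections amount to one matrix multiply per block. A minimal numpy sketch, using the paper's C = 128 feature maps and N = 4 blocks but a toy embedding size:

```python
import numpy as np

rng = np.random.default_rng(0)
d_hid, C, N = 16, 128, 4   # toy GRU size; C=128 maps, N=4 blocks as in the paper

h_T = rng.normal(size=d_hid)            # question embedding from the GRU

# One concatenated affine projection per block: W_n is (2C x d_hid), b_n is (2C,)
W = rng.normal(scale=0.01, size=(N, 2 * C, d_hid))
b = np.zeros((N, 2 * C))

film_params = W @ h_T + b               # shape (N, 2C): all scales and shifts at once
gammas, betas = film_params[:, :C], film_params[:, C:]
print(gammas.shape, betas.shape)        # (4, 128) (4, 128)
```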
3. FiLM Layer Application
Given feature activations $F_{i,c}$ for the $n$th block (for instance, example $i$ in a mini-batch and channel $c$), FiLM applies channel-wise affine modulation as:

$$\mathrm{FiLM}(F_{i,c} \mid \gamma^n_{i,c}, \beta^n_{i,c}) = \gamma^n_{i,c} \, F_{i,c} + \beta^n_{i,c}.$$

This operation is performed independently at each spatial location of each feature map. The parameters $\gamma^n_i$ and $\beta^n_i$ are unique to each example and block, computed by projecting the GRU-derived question embedding (Perez et al., 2017).
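The channel-wise modulation is a single broadcast multiply-add; a minimal numpy sketch with toy shapes:

```python
import numpy as np

def film(F, gamma, beta):
    """Apply FiLM(F | gamma, beta) = gamma * F + beta channel-wise.

    F: (batch, C, H, W); gamma, beta: (batch, C). The per-channel parameters
    are broadcast over the spatial dimensions H and W.
    """
    return gamma[:, :, None, None] * F + beta[:, :, None, None]

rng = np.random.default_rng(0)
F = rng.normal(size=(2, 4, 3, 3))       # toy batch of feature maps
gamma = rng.normal(size=(2, 4))         # per-example, per-channel scale
beta = rng.normal(size=(2, 4))          # per-example, per-channel shift

out = film(F, gamma, beta)
# Every spatial location of channel c in example i gets the same (gamma, beta):
assert np.allclose(out[0, 1], gamma[0, 1] * F[0, 1] + beta[0, 1])
```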
4. Parameterization, Ablations, and Architectural Variants
Extensive ablation studies were conducted on the output parameterization of the generator:
- Zeroing β: Setting $\beta = 0$ (thus only applying scaling) reduced CLEVR dataset accuracy by approximately 0.5%.
- Fixing γ = 1: Removing scaling by fixing $\gamma = 1$ (thus only applying shift) led to a ∼1.5% drop in accuracy.
- Restricting γ's Range:
  - $\gamma := \sigma(\gamma)$ (so $\gamma \in (0, 1)$) yields 95.9% accuracy (vs. 97.7% unrestricted),
  - $\gamma := \tanh(\gamma)$ (so $\gamma \in (-1, 1)$): 96.3%,
  - $\gamma := \exp(\gamma)$ (so $\gamma > 0$): 96.3%.
  - These results show that allowing $\gamma$ to take negative values and large magnitudes is important for task performance.
- Test-time Noise: Replacing $\beta$ at test time with its training-set mean reduces accuracy by ~1%. Substituting $\gamma$ with its mean reduces accuracy by approximately 65%, confirming that scaling is the dominant factor.
- Depth of Conditioning: Varying the number of FiLM blocks from 1 to 12 demonstrates that a single block yields 93.5% CLEVR accuracy, but 4–6 blocks are required to achieve the maximal (97.4%) accuracy (Perez et al., 2017).
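The γ-range restrictions in the ablations correspond to squashing the raw generator output before modulation. A minimal numpy sketch, assuming sigmoid, tanh, and exp as the squashing functions for the three restricted ranges:

```python
import numpy as np

# Squashing functions for the gamma-range ablations (assumed: sigmoid, tanh,
# exp are the standard choices for these ranges):
restrict = {
    "sigmoid": lambda g: 1.0 / (1.0 + np.exp(-g)),   # gamma in (0, 1)
    "tanh":    np.tanh,                              # gamma in (-1, 1)
    "exp":     np.exp,                               # gamma > 0
    "none":    lambda g: g,                          # unrestricted (best accuracy)
}

raw_gamma = np.array([-2.0, -0.5, 0.0, 1.5])
for name, f in restrict.items():
    print(f"{name:8s}", np.round(f(raw_gamma), 3))
# Only the unrestricted variant can produce negative values of large magnitude.
```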
5. Dimensionality and Parameter Counts
The architecture is parameterized as follows:
| Component | Shape / Size | Notes |
|---|---|---|
| Word embeddings | $200 × \lvert V \rvert$ | $\lvert V \rvert$ = vocabulary size |
| GRU | $4096$ hidden units | ∼52.8 million parameters |
| Per-block affine projections | $2 × (128 × 4096) + 2 × 128$ per block ($N = 4$ blocks) | ≈4.2 million total |
This minimal network design enables effective, context-sensitive modulation of deep CNNs, supporting multi-step visual reasoning (Perez et al., 2017).
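These parameter counts follow from the stated dimensions; a back-of-the-envelope check (bias terms included):

```python
# Dimensions stated above: 200-dim embeddings, 4096-unit GRU,
# C = 128 feature maps per block, N = 4 blocks.
d_emb, d_hid, C, N = 200, 4096, 128, 4

# GRU: three gates, each with input-to-hidden and hidden-to-hidden weights
# plus a bias vector.
gru_params = 3 * (d_hid * d_emb + d_hid * d_hid + d_hid)

# Per-block FiLM projections: gamma and beta heads, weights plus biases.
proj_params = N * (2 * (C * d_hid) + 2 * C)

print(f"GRU: {gru_params/1e6:.1f}M, projections: {proj_params/1e6:.2f}M")
# GRU: 52.8M, projections: 4.20M
```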
6. Synthesis of Generator Design and Performance
The FiLM generator's core design consists of three stages: (1) word tokenization and embedding, (2) global sequence encoding by a large GRU, and (3) per-block affine linear projections to obtain (γⁿ, βⁿ). The generator is minimal—lacking intermediate nonlinearities—yet highly expressive, as evidenced by robust empirical performance and insensitivity to most architectural perturbations except those affecting the range and specificity of γ. The architecture demonstrates that large-capacity linear conditioning is key to feature selection and multi-step reasoning in visual question answering and related domains (Perez et al., 2017).
7. Context, Limitations, and Significance
Feature-wise linear modulation via a dedicated generator has enabled state-of-the-art reasoning in compositional vision and language settings, notably halving error rates on the CLEVR benchmark compared to prior methods, while exhibiting strong robustness to ablations and generalizing to few-shot and zero-shot settings. The simplicity and effectiveness of the generator’s design underscores the power of linear, global conditioning for complex neural architectures. A plausible implication is that variants of this generator architecture may be adapted to other modalities or hierarchical tasks where conditioning via featurewise affine transformations is effective (Perez et al., 2017).