
FiLM Generator Architecture

Updated 8 February 2026
  • FiLM Generator Architecture is a design that maps symbolic language inputs to feature-wise affine modulation parameters for CNNs, enhancing visual reasoning.
  • It employs a GRU to encode word embeddings into a fixed-length question embedding, which is then linearly projected to yield scaling and shifting vectors.
  • Empirical ablation shows that allowing flexible scaling (γ) is critical, confirming the architecture's robustness and effectiveness in compositional visual tasks.

The FiLM generator architecture is the functional module responsible for mapping symbolic inputs (typically natural-language questions) to feature-wise affine modulation parameters used in FiLM (Feature-wise Linear Modulation) layers. This architecture is a central component in models designed for conditional visual reasoning, particularly where conditioning information must be seamlessly integrated with convolutional neural networks (CNNs) via structured affine modulation. The principal mechanism is to compute, per network block, scaling (γ) and shifting (β) vectors conditioned on a linguistic embedding, with the generator comprising a dedicated pipeline for encoding symbolic information and mapping it to these parameters (Perez et al., 2017).

1. Symbolic Input Encoding

The generator begins by embedding each token in the symbolic input (such as the words in a question) into a fixed 200-dimensional vector space. Let w_1, …, w_T ∈ ℝ^{200} denote the sequence of word embeddings for a T-word question. This sequence is processed by a single-layer Gated Recurrent Unit (GRU) with hidden size 4096. The GRU state recursion is:

h_0 = 0,  h_t = GRU(w_t, h_{t−1}),  t = 1, …, T

The final hidden state q = h_T ∈ ℝ^{4096} serves as the fixed-length "question embedding" summarizing the conditioning context (Perez et al., 2017).
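This encoding stage can be sketched in plain NumPy (a minimal illustration with toy dimensions; the actual model uses 200-dimensional embeddings and a 4096-unit GRU, and the helper names here are not from the paper):

```python
import numpy as np

def gru_encode(embeddings, params):
    """Run a single-layer GRU over word embeddings and return the
    final hidden state h_T, used as the question embedding q."""
    Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh = params
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    h = np.zeros(Uz.shape[0])                         # h_0 = 0
    for w in embeddings:                              # t = 1 .. T
        z = sigmoid(Wz @ w + Uz @ h + bz)             # update gate
        r = sigmoid(Wr @ w + Ur @ h + br)             # reset gate
        h_cand = np.tanh(Wh @ w + Uh @ (r * h) + bh)  # candidate state
        h = (1 - z) * h + z * h_cand
    return h                                          # q = h_T

# Toy sizes for the demo (the paper's model uses d = 200, hidden = 4096)
rng = np.random.default_rng(0)
d, hidden, T = 8, 16, 5
params = (rng.standard_normal((hidden, d)), rng.standard_normal((hidden, hidden)), np.zeros(hidden),
          rng.standard_normal((hidden, d)), rng.standard_normal((hidden, hidden)), np.zeros(hidden),
          rng.standard_normal((hidden, d)), rng.standard_normal((hidden, hidden)), np.zeros(hidden))
q = gru_encode(rng.standard_normal((T, d)), params)
print(q.shape)  # (16,)
```

Only the final state is kept: the sequence is compressed into a single vector regardless of question length.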

2. Projection to Modulation Parameters

For a visual neural network containing N FiLM-modulated residual blocks, each with C feature maps, the generator must produce two C-dimensional vectors (γⁿ, βⁿ) per block, i.e., the scale and shift for each feature channel within that block. For block n, this is performed via independent affine projections of q:

Δγⁿ = W_γⁿ q + b_γⁿ,  Δγⁿ ∈ ℝ^C

βⁿ = W_βⁿ q + b_βⁿ,  βⁿ ∈ ℝ^C

and

γⁿ = 1 + Δγⁿ

Here,

W_γⁿ, W_βⁿ ∈ ℝ^{C×4096},  b_γⁿ, b_βⁿ ∈ ℝ^C

In practice, these weights and biases may be concatenated into a single matrix of shape 2C × 4096 and a single vector of length 2C per block, respectively. There are no hidden layers or nonlinearities between q and the modulation parameters, resulting in a simple and direct conditioning mechanism.
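The projection step can be sketched as a single affine map whose output is sliced into per-block parameters (an illustrative NumPy helper, not code from the paper; the zero-weight check below confirms that γ defaults to the identity scaling):

```python
import numpy as np

def film_generator(q, W, b, N, C):
    """Map question embedding q to per-block FiLM parameters.
    W has shape (2*N*C, len(q)); b has shape (2*N*C,).
    Returns lists gammas[n], betas[n], each a length-C vector."""
    out = W @ q + b                        # single affine map, no nonlinearity
    gammas, betas = [], []
    for n in range(N):
        block = out[2 * n * C : 2 * (n + 1) * C]
        dgamma, beta = block[:C], block[C:]
        gammas.append(1.0 + dgamma)        # gamma^n = 1 + Δgamma^n
        betas.append(beta)
    return gammas, betas

# With zero weights and biases, gamma = 1 and beta = 0 (identity modulation)
q = np.ones(4096)
N, C = 4, 128
gammas, betas = film_generator(q, np.zeros((2 * N * C, 4096)), np.zeros(2 * N * C), N, C)
print(gammas[0][:3], betas[0][:3])  # [1. 1. 1.] [0. 0. 0.]
```

The 1 + Δγ parameterization means an untrained (zero-initialized) head leaves the visual features unchanged.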

3. FiLM Layer Application

Given feature activations Fⁿ_i ∈ ℝ^{C×H×W} for the nth block (for instance i in a mini-batch), FiLM applies channel-wise affine modulation as:

FiLM(Fⁿ_{i,c} | γⁿ_{i,c}, βⁿ_{i,c}) = γⁿ_{i,c} · Fⁿ_{i,c} + βⁿ_{i,c},  for all c = 1, …, C

This operation is performed independently at each spatial location of each feature map. The parameters γⁿ_{i,c} and βⁿ_{i,c} are unique to each example and block, computed by projecting the GRU-derived question embedding (Perez et al., 2017).
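The modulation itself reduces to a broadcasted multiply-and-add over a (C, H, W) tensor, as in this minimal NumPy sketch:

```python
import numpy as np

def film(F, gamma, beta):
    """Apply FiLM to a (C, H, W) feature tensor: every spatial location
    of channel c is scaled by gamma[c] and shifted by beta[c]."""
    return gamma[:, None, None] * F + beta[:, None, None]

C, H, W = 3, 4, 4
F = np.ones((C, H, W))
gamma = np.array([2.0, 0.0, -1.0])  # note: negative scaling is allowed
beta = np.array([0.0, 5.0, 1.0])
out = film(F, gamma, beta)
print(out[0, 0, 0], out[1, 0, 0], out[2, 0, 0])  # 2.0 5.0 0.0
```

Because γ and β are per-channel, a single FiLM layer can amplify, suppress, negate, or bias entire feature maps conditioned on the question.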

4. Parameterization, Ablations, and Architectural Variants

Extensive ablation studies were conducted on the output parameterization of the generator:

  • Zeroing β: Setting β ≡ 0 (thus applying only scaling) reduced CLEVR dataset accuracy by approximately 0.5%.
  • Fixing γ = 1: Removing scaling by fixing γ ≡ 1 (thus applying only the shift) led to a ~1.5% drop in accuracy.
  • Restricting γ's Range:
    • γ ← sigmoid(Δγ) (so γ ∈ (0, 1)) yields 95.9% accuracy (vs. 97.7% unrestricted),
    • γ ← tanh(Δγ) (so γ ∈ (−1, 1)): 96.3%,
    • γ ← exp(Δγ) (so γ > 0): 96.3%.
    • These results show that allowing γ to take negative values and large magnitudes is important for task performance.
  • Test-time Noise: Replacing β at test time with its training-set mean reduces accuracy by ~1%; substituting γ with its mean reduces accuracy by approximately 65%, confirming that scaling is the dominant factor.
  • Depth of Conditioning: Varying the number of FiLM blocks N from 1 to 12 demonstrates that a single block yields ≈93.5% CLEVR accuracy, while 4–6 blocks are required to achieve the maximal (~97.4%) accuracy (Perez et al., 2017).
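The γ-range restrictions compared in the ablations amount to swapping the output activation applied to Δγ. A small illustrative helper (not code from the paper) makes the ranges concrete:

```python
import numpy as np

def gamma_from_dgamma(dgamma, mode="unrestricted"):
    """Alternative parameterizations of gamma tested in the ablations."""
    if mode == "sigmoid":
        return 1.0 / (1.0 + np.exp(-dgamma))  # gamma in (0, 1): 95.9%
    if mode == "tanh":
        return np.tanh(dgamma)                # gamma in (-1, 1): 96.3%
    if mode == "exp":
        return np.exp(dgamma)                 # gamma > 0: 96.3%
    return 1.0 + dgamma                       # unrestricted: 97.7% (best)

dg = np.array([-3.0, 0.0, 3.0])
print(gamma_from_dgamma(dg, "tanh"))          # bounded in (-1, 1)
print(gamma_from_dgamma(dg))                  # unbounded, can be negative
```

Only the unrestricted form lets γ reach both negative values and magnitudes above 1, which matches its accuracy advantage in the table of ablations.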

5. Dimensionality and Parameter Counts

The architecture is parameterized as follows:

  • Word embeddings: 200 × |V|, where |V| is the vocabulary size.
  • GRU: 4096 hidden units; roughly 3(4096² + 4096 × 200) ≈ 52.8M weights across its update gate, reset gate, and candidate state.
  • Per-block affine projections: 2 × (128 × 4096) weights plus 2 × 128 biases per block; with N = 4 blocks, ≈ 4.2 million parameters in total.
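These counts can be reproduced with a few lines of arithmetic (assuming the standard three-gate GRU parameterization with input and recurrent matrices plus biases; exact totals depend on the bias convention):

```python
d, h = 200, 4096   # embedding dimension, GRU hidden size
C, N = 128, 4      # feature maps per block, number of FiLM blocks

# GRU: update gate, reset gate, and candidate state, each with an
# input matrix (h x d), a recurrent matrix (h x h), and a bias (h).
gru_params = 3 * (h * h + h * d + h)

# FiLM generator head: per block, a 2C x 4096 weight and a 2C bias.
per_block = 2 * (C * h) + 2 * C
film_head = N * per_block

print(gru_params)  # 52801536
print(film_head)   # 4195328 (≈ 4.2 million)
```

The generator head is thus an order of magnitude smaller than the GRU encoder that feeds it.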

This minimal network design enables effective, context-sensitive modulation of deep CNNs, supporting multi-step visual reasoning (Perez et al., 2017).

6. Synthesis of Generator Design and Performance

The FiLM generator's core design consists of three stages: (1) word tokenization and embedding, (2) global sequence encoding by a large GRU, and (3) per-block affine linear projections to obtain (Δγⁿ, βⁿ). The generator is minimal—lacking intermediate nonlinearities—yet highly expressive, as evidenced by robust empirical performance and insensitivity to most architectural perturbations except those affecting the range and specificity of γ. The architecture demonstrates that large-capacity linear conditioning is key to feature selection and multi-step reasoning in visual question answering and related domains (Perez et al., 2017).

7. Context, Limitations, and Significance

Feature-wise linear modulation via a dedicated generator has enabled state-of-the-art reasoning in compositional vision and language settings, notably halving error rates on the CLEVR benchmark compared to prior methods, while exhibiting strong robustness to ablations and generalizing to few-shot and zero-shot settings. The simplicity and effectiveness of the generator's design underscore the power of linear, global conditioning for complex neural architectures. A plausible implication is that variants of this generator architecture may be adapted to other modalities or hierarchical tasks where conditioning via feature-wise affine transformations is effective (Perez et al., 2017).

References

Perez, E., Strub, F., de Vries, H., Dumoulin, V., & Courville, A. (2017). FiLM: Visual Reasoning with a General Conditioning Layer. arXiv:1709.07871.
