FiLM Generator Architecture
- The FiLM generator maps symbolic language inputs (typically natural-language questions) to feature-wise affine modulation parameters for CNNs, enabling conditional visual reasoning.
- It employs a GRU to encode word embeddings into a fixed-length question embedding, which is then linearly projected to yield scaling and shifting vectors.
- Empirical ablations show that flexible, unrestricted scaling (γ) is the dominant factor in performance, while the architecture is otherwise robust across compositional visual tasks.
The FiLM generator architecture is the functional module responsible for mapping symbolic inputs (typically natural-language questions) to feature-wise affine modulation parameters used in FiLM (Feature-wise Linear Modulation) layers. This architecture is a central component in models designed for conditional visual reasoning, particularly where conditioning information must be seamlessly integrated with convolutional neural networks (CNNs) via structured affine modulation. The principal mechanism is to compute, per network block, scaling (γ) and shifting (β) vectors conditioned on a linguistic embedding, with the generator comprising a dedicated pipeline for encoding symbolic information and mapping it to these parameters (Perez et al., 2017).
1. Symbolic Input Encoding
The generator begins by embedding each token in the symbolic input (such as the words in a question) into a fixed 200-dimensional vector space. Let $(w_1, \ldots, w_T)$ denote the sequence of word embeddings for a $T$-word question. This sequence is processed by a single-layer Gated Recurrent Unit (GRU) with hidden size 4096. The GRU state recursion is:

$$h_t = \mathrm{GRU}(w_t, h_{t-1}), \quad t = 1, \ldots, T, \quad h_0 = 0.$$

The final hidden state $h_T$ serves as the fixed-length "question embedding" summarizing the conditioning context (Perez et al., 2017).
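The encoding stage can be sketched in numpy with toy dimensions (the paper's actual sizes are 200-dimensional embeddings and a 4096-unit GRU; `gru_step` below is the standard GRU update, not the paper's code):

```python
import numpy as np

def gru_step(x, h, W_z, U_z, b_z, W_r, U_r, b_r, W_h, U_h, b_h):
    """One standard GRU recursion step: h_t = (1 - z_t) * h_{t-1} + z_t * h~_t."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = sigmoid(W_z @ x + U_z @ h + b_z)               # update gate
    r = sigmoid(W_r @ x + U_r @ h + b_r)               # reset gate
    h_tilde = np.tanh(W_h @ x + U_h @ (r * h) + b_h)   # candidate state
    return (1 - z) * h + z * h_tilde

# Toy dimensions for illustration (the paper uses d_emb=200, d_hid=4096).
d_emb, d_hid, T = 8, 16, 5
rng = np.random.default_rng(0)
params = [rng.normal(scale=0.1, size=s) for s in
          [(d_hid, d_emb), (d_hid, d_hid), (d_hid,)] * 3]

words = rng.normal(size=(T, d_emb))   # (w_1, ..., w_T) word embeddings
h = np.zeros(d_hid)
for w in words:
    h = gru_step(w, h, *params)
# h is now the fixed-length question embedding h_T
print(h.shape)  # (16,)
```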
2. Projection to Modulation Parameters
For a visual neural network containing $N$ FiLM-modulated residual blocks, each with $C$ feature maps, the generator must produce two $C$-dimensional vectors $(\gamma^n, \beta^n)$ per block, i.e., the scale and shift for each feature channel within that block. For block $n$, this is performed via independent affine projections of $h_T$:

$$\gamma^n = W_\gamma^n h_T + b_\gamma^n$$

and

$$\beta^n = W_\beta^n h_T + b_\beta^n.$$

Here, $W_\gamma^n, W_\beta^n \in \mathbb{R}^{C \times 4096}$ and $b_\gamma^n, b_\beta^n \in \mathbb{R}^{C}$. In practice, these weights and biases may be concatenated into matrices of shape $2C × 4096$ and vectors of length $2C$, respectively. There are no hidden layers or nonlinearities between $h_T$ and the modulation parameters, resulting in a simple and direct conditioning mechanism.
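The per-block projections amount to one matrix multiply per block. A minimal numpy sketch, using the paper's C = 128 feature maps and N = 4 blocks but a toy embedding size:

```python
import numpy as np

rng = np.random.default_rng(0)
d_hid, C, N = 16, 128, 4   # toy GRU size; C=128 maps, N=4 blocks as in the paper

h_T = rng.normal(size=d_hid)            # question embedding from the GRU

# One concatenated affine projection per block: W_n is (2C x d_hid), b_n is (2C,)
W = rng.normal(scale=0.01, size=(N, 2 * C, d_hid))
b = np.zeros((N, 2 * C))

film_params = W @ h_T + b               # shape (N, 2C): all scales and shifts at once
gammas, betas = film_params[:, :C], film_params[:, C:]
print(gammas.shape, betas.shape)        # (4, 128) (4, 128)
```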
3. FiLM Layer Application
Given feature activations $F_{i,c}$ for the $n$th block (for instance, example $i$ in a mini-batch and channel $c$), FiLM applies channel-wise affine modulation as:

$$\mathrm{FiLM}(F_{i,c} \mid \gamma^n_{i,c}, \beta^n_{i,c}) = \gamma^n_{i,c} \, F_{i,c} + \beta^n_{i,c}.$$

This operation is performed independently at each spatial location of each feature map. The parameters $\gamma^n_i$ and $\beta^n_i$ are unique to each example and block, computed by projecting the GRU-derived question embedding (Perez et al., 2017).
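The channel-wise modulation is a single broadcast multiply-add; a minimal numpy sketch with toy shapes:

```python
import numpy as np

def film(F, gamma, beta):
    """Apply FiLM(F | gamma, beta) = gamma * F + beta channel-wise.

    F: (batch, C, H, W); gamma, beta: (batch, C). The per-channel parameters
    are broadcast over the spatial dimensions H and W.
    """
    return gamma[:, :, None, None] * F + beta[:, :, None, None]

rng = np.random.default_rng(0)
F = rng.normal(size=(2, 4, 3, 3))       # toy batch of feature maps
gamma = rng.normal(size=(2, 4))         # per-example, per-channel scale
beta = rng.normal(size=(2, 4))          # per-example, per-channel shift

out = film(F, gamma, beta)
# Every spatial location of channel c in example i gets the same (gamma, beta):
assert np.allclose(out[0, 1], gamma[0, 1] * F[0, 1] + beta[0, 1])
```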
4. Parameterization, Ablations, and Architectural Variants
Extensive ablation studies were conducted on the output parameterization of the generator:
- Zeroing β: Setting $\beta = 0$ (thus only applying scaling) reduced CLEVR dataset accuracy by approximately 0.5%.
- Fixing γ = 1: Removing scaling by fixing $\gamma = 1$ (thus only applying shift) led to a ∼1.5% drop in accuracy.
- Restricting γ's Range:
  - $\gamma := \sigma(\gamma)$ (so $\gamma \in (0, 1)$) yields 95.9% accuracy (vs. 97.7% unrestricted),
  - $\gamma := \tanh(\gamma)$ (so $\gamma \in (-1, 1)$): 96.3%,
  - $\gamma := \exp(\gamma)$ (so $\gamma > 0$): 96.3%.
  - These results show that allowing $\gamma$ to take negative values and large magnitudes is important for task performance.
- Test-time Noise: Replacing $\beta$ at test time with its training-set mean reduces accuracy by ~1%. Substituting $\gamma$ with its mean reduces accuracy by approximately 65%, confirming that scaling is the dominant factor.
- Depth of Conditioning: Varying the number of FiLM blocks from 1 to 12 demonstrates that a single block yields 93.5% CLEVR accuracy, but 4–6 blocks are required to achieve the maximal (97.4%) accuracy (Perez et al., 2017).
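The γ-range restrictions in the ablations correspond to squashing the raw generator output before modulation. A minimal numpy sketch, assuming sigmoid, tanh, and exp as the squashing functions for the three restricted ranges:

```python
import numpy as np

# Squashing functions for the gamma-range ablations (assumed: sigmoid, tanh,
# exp are the standard choices for these ranges):
restrict = {
    "sigmoid": lambda g: 1.0 / (1.0 + np.exp(-g)),   # gamma in (0, 1)
    "tanh":    np.tanh,                              # gamma in (-1, 1)
    "exp":     np.exp,                               # gamma > 0
    "none":    lambda g: g,                          # unrestricted (best accuracy)
}

raw_gamma = np.array([-2.0, -0.5, 0.0, 1.5])
for name, f in restrict.items():
    print(f"{name:8s}", np.round(f(raw_gamma), 3))
# Only the unrestricted variant can produce negative values of large magnitude.
```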
5. Dimensionality and Parameter Counts
The architecture is parameterized as follows:
| Component | Shape / Size | Notes |
|---|---|---|
| Word embeddings | $200 × \lvert V \rvert$ | $\lvert V \rvert$ = vocabulary size |
| GRU | $4096$ hidden units | ∼52.8 million parameters |
| Per-block affine projections | $2 × (128 × 4096) + 2 × 128$ per block ($N = 4$ blocks) | ≈4.2 million total |
This minimal network design enables effective, context-sensitive modulation of deep CNNs, supporting multi-step visual reasoning (Perez et al., 2017).
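These parameter counts follow from the stated dimensions; a back-of-the-envelope check (bias terms included):

```python
# Dimensions stated above: 200-dim embeddings, 4096-unit GRU,
# C = 128 feature maps per block, N = 4 blocks.
d_emb, d_hid, C, N = 200, 4096, 128, 4

# GRU: three gates, each with input-to-hidden and hidden-to-hidden weights
# plus a bias vector.
gru_params = 3 * (d_hid * d_emb + d_hid * d_hid + d_hid)

# Per-block FiLM projections: gamma and beta heads, weights plus biases.
proj_params = N * (2 * (C * d_hid) + 2 * C)

print(f"GRU: {gru_params/1e6:.1f}M, projections: {proj_params/1e6:.2f}M")
# GRU: 52.8M, projections: 4.20M
```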
6. Synthesis of Generator Design and Performance
The FiLM generator's core design consists of three stages: (1) word tokenization and embedding, (2) global sequence encoding by a large GRU, and (3) per-block affine linear projections to obtain (γⁿ, βⁿ). The generator is minimal—lacking intermediate nonlinearities—yet highly expressive, as evidenced by robust empirical performance and insensitivity to most architectural perturbations except those affecting the range and specificity of γ. The architecture demonstrates that large-capacity linear conditioning is key to feature selection and multi-step reasoning in visual question answering and related domains (Perez et al., 2017).
7. Context, Limitations, and Significance
Feature-wise linear modulation via a dedicated generator has enabled state-of-the-art reasoning in compositional vision and language settings, notably halving error rates on the CLEVR benchmark compared to prior methods, while exhibiting strong robustness to ablations and generalizing to few-shot and zero-shot settings. The simplicity and effectiveness of the generator’s design underscores the power of linear, global conditioning for complex neural architectures. A plausible implication is that variants of this generator architecture may be adapted to other modalities or hierarchical tasks where conditioning via featurewise affine transformations is effective (Perez et al., 2017).