Image-Based Event Sequencing
- Image-based Event Sequencing is a computational task that requires inferring a minimal, causally ordered move sequence to transform a source image into a target image.
- The method decomposes the task into visual perception and event sequencing, leveraging CNNs for configuration encoding and logic engines for precise move reconstruction.
- Empirical results on the BIRD dataset show that modular approaches, especially those using ILP, achieve perfect accuracy compared to traditional end-to-end deep learning models.
Image-based Event Sequencing (IES) constitutes the computational challenge of inferring a minimal-length sequence of actions necessary to transform an object arrangement depicted in a source image into the configuration shown in a target image. IES is distinguished from related perceptual reasoning tasks by its emphasis on causally precise, temporally ordered event reconstruction, as well as its demand for robust inductive generalization beyond training sequence complexity. Rigorous formalization, dataset curation, and methodological comparison have been central to its study, notably through the Blocksworld Image Reasoning Dataset (BIRD) and associated modeling protocols (Gokhale et al., 2019).
1. Formal Task Specification
IES is defined by the tuple (I_s, I_t, E), where I_s and I_t are the input source and target images, and E = (m_1, …, m_k) is the event sequence (sequence of moves) transforming I_s into I_t. Each move comprises m_i = (b_i, d_i), with b_i ∈ C, the set of block colors, and d_i ∈ D, the set of possible destinations (blocks, "table", "out"). The end-to-end mapping is formalized as f: (I_s, I_t) → E, encoding up to 8 moves as 16-bit one-hot vectors each (a 128-bit output), reflecting the combinatorially large space of possible sequences. Inductive generalization is rigorously defined: a solver generalizes if it correctly outputs E with |E| = L for sequences whose length L exceeds any in the training set.
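The fixed-width output encoding described above can be sketched as follows. The color list, the bit layout within each 16-bit slot, and the helper names (`encode_move`, `encode_sequence`) are illustrative assumptions, not the BIRD reference implementation:

```python
# Sketch: one-hot encode a (block, destination) move into a 16-bit slot,
# and pack up to 8 moves into the 128-bit target vector (zero-padded).
COLORS = ["red", "green", "blue", "yellow", "white"]   # blocks that can move
DESTINATIONS = COLORS + ["table", "out"]               # possible destinations

MOVE_BITS = 16    # bits reserved per move (8 x 16 = 128-bit output)
MAX_MOVES = 8

def encode_move(block: str, dest: str) -> list:
    """One-hot encode a single move (block, destination) into MOVE_BITS bits."""
    vec = [0] * MOVE_BITS
    vec[COLORS.index(block)] = 1                      # bits 0-4: which block moves
    vec[len(COLORS) + DESTINATIONS.index(dest)] = 1   # bits 5-11: where it goes
    return vec                                        # remaining bits stay zero

def encode_sequence(moves: list) -> list:
    """Pack up to MAX_MOVES moves into one 128-bit vector, zero-padded."""
    assert len(moves) <= MAX_MOVES
    bits = []
    for block, dest in moves:
        bits += encode_move(block, dest)
    bits += [0] * (MOVE_BITS * (MAX_MOVES - len(moves)))
    return bits

seq = encode_sequence([("red", "table"), ("blue", "red")])
assert len(seq) == 128
```

Padding unused slots with zeros is what lets one fixed-width vector represent sequences of any length up to 8, at the cost of the exponential label space the end-to-end models struggle with.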
2. Blocksworld Image Reasoning Dataset (BIRD)
BIRD is the canonical dataset for IES, constructed with 7,267 real photographs of up to five distinct-colored blocks in various tabletop arrangements under uniform lighting, with no duplicate colors per image. Blocks may be stacked or in contact, differentiating BIRD from synthetic datasets such as CLEVR. Each image is annotated with a 5×5 arrangement vector and a 5×3 color vector (3-bit encoding per block). Source–target pairs are constructed algorithmically, and minimal-length move sequences (up to 8 steps) are exhaustively enumerated using the logic programming solver clingo with background axioms (exogeneity, freedom, inertia, sequentialism), yielding approximately one million unique triplets, uniformly sampled over sequence lengths.
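The dual annotation (a 5×5 arrangement grid plus a 5×3 color matrix) can be illustrated with a small encoder. The specific 3-bit color codes and the helper name `encode_configuration` are assumptions for this sketch, not BIRD's actual codebook:

```python
# Sketch: encode a block configuration as a 5x5 occupancy grid (1 where a
# block sits in a cell) and a 5x3 matrix of 3-bit color codes, one row per block.
COLOR_CODES = {                 # hypothetical 3-bit codes, one per block color
    "red":    [1, 0, 0],
    "green":  [0, 1, 0],
    "blue":   [0, 0, 1],
    "yellow": [1, 1, 0],
    "white":  [1, 1, 1],
}

def encode_configuration(blocks):
    """blocks: list of (color, row, col) placements on a 5x5 grid."""
    arrangement = [[0] * 5 for _ in range(5)]     # 5x5 occupancy grid
    colors = [[0, 0, 0] for _ in range(5)]        # 5x3 color matrix
    for slot, (color, row, col) in enumerate(blocks):
        arrangement[row][col] = 1
        colors[slot] = list(COLOR_CODES[color])
    return arrangement, colors

# A blue block stacked on a red block in the leftmost column:
arr, col = encode_configuration([("red", 4, 0), ("blue", 3, 0)])
```

This structured configuration, rather than raw pixels, is what the Stage-2 solvers below consume.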
3. End-to-End Deep Sequence Prediction Approaches
End-to-end methods using deep convolutional architectures for direct event sequence prediction have been evaluated extensively. Architectures include:
- ResNet-50: Standard deep backbone.
- PSPNet: Incorporates pyramidal spatial context pooling.
- Relation Network (RN): Adapted for image pair input with dual CNN modules.
Input is a concatenated stack of the source and target images (I_s, I_t), and output is a 128-bit vector ŷ compared to the ground-truth vector y via binary cross-entropy loss: L = −Σ_i [y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i)]. Performance on BIRD is limited (ResNet50: FSA 30.5%, SLA 36.3%; PSPNet: FSA 35.0%, SLA 56.7%; RN: FSA 34.4%, SLA 52.1%). All architectures demonstrate negligible inductive generalization, failing to generate correct sequences longer than those seen during training, and struggle due to the exponentially large output space and poor modeling of temporal dependencies.
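A minimal sketch of the per-bit binary cross-entropy objective over the 128-bit output, written in plain Python for clarity (any deep-learning framework provides an equivalent built-in):

```python
import math

def bce_loss(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy over the output bits.

    y_true: ground-truth bits (0/1); y_pred: predicted probabilities in [0, 1].
    """
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)   # clamp for numerical stability
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)
```

Treating each of the 128 bits as an independent binary label is exactly what makes this objective blind to the temporal structure of the move sequence, which the text above identifies as a key failure mode.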
4. Modular Perception and Sequencing Approach
A two-stage modular pipeline divides IES into visual perception and event sequencing:
- Stage 1: Visual Perception
- Arrangement encoder: 8-layer CNN mapping the input image to a 5×5 grid-occupancy encoding.
- Color grounding: ResNet-50 mapping the input image to a 5×3 per-block color encoding.
- The two outputs combine into an interpretable configuration vector for each image.
- Stage 2: Event Sequencing
- Logic engine applies each move to the current state vector to produce the successor state: s_{t+1} = T(s_t, m_t).
- Three solver variants:
- Fully-connected network (FC): 5-layer MLP trained with a multi-label binary cross-entropy loss.
- Q-Learning (QL): MDP formulation, off-policy value learning.
- Inductive Logic Programming (ILP): Learns ASP rules mapping moves to block configuration changes.
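A toy version of the Stage-2 state transition helps make the axioms concrete: the sketch below encodes freedom (only unobstructed blocks move), inertia (untouched blocks stay put), and sequentialism (one move at a time). The dictionary-based state representation, where each block maps to its support, is a simplification for illustration, not the paper's ASP encoding:

```python
# Sketch: a state maps each block to what it rests on ("table", "out", or
# another block); a move (b, d) re-assigns b's support if b is unobstructed.
def is_clear(state: dict, block: str) -> bool:
    """Freedom axiom: a block may move only if nothing rests on top of it."""
    return all(support != block for support in state.values())

def apply_move(state: dict, block: str, dest: str) -> dict:
    assert is_clear(state, block), f"{block} is obstructed"
    if dest not in ("table", "out"):
        assert is_clear(state, dest), f"destination {dest} is occupied"
    new_state = dict(state)          # inertia: untouched blocks keep their place
    new_state[block] = dest
    return new_state

def apply_sequence(state: dict, moves):
    for block, dest in moves:        # sequentialism: one move at a time
        state = apply_move(state, block, dest)
    return state

s0 = {"red": "table", "blue": "red"}                          # blue on red
s1 = apply_sequence(s0, [("blue", "table"), ("red", "blue")])  # swap the stack
assert s1 == {"red": "blue", "blue": "table"}
```

The ILP variant learns rules of exactly this transition form as ASP clauses, which is why it extrapolates to sequence lengths never seen in training: the learned rule applies identically at every step.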
This modular decomposition dramatically improves performance and generalization. Table 1 in (Gokhale et al., 2019) summarizes key results:
| Approach | FSA | SLA |
|---|---|---|
| End-to-End (ResNet50) | 30.5% | 36.3% |
| Stage1+Stage2 (FC, PR) | 68.9% | 72.6% |
| Stage1+Stage2 (QL, PR) | 84.1% | 87.8% |
| Stage1+Stage2 (ILP, PR) | 100% | 100% |
ILP achieves perfect inductive generalization for all tested sequence lengths, even beyond the maximum length seen in training, while deep learning and RL solvers improve only partially as training-sequence length increases.
5. Evaluation Metrics and Empirical Results
Performance in IES is measured via Full Sequence Accuracy (FSA) (exact match across entire sequence) and Step-Level Accuracy (SLA) (mean per-move correctness). Human ceiling is 100% for both metrics. End-to-end CNNs exhibit low accuracy (<36.3% SLA), while modular pipelines with logic programming and RL achieve substantially higher rates (ILP: 100%, QL: up to 87.8%).
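The two metrics can be sketched directly; padding and length-mismatch handling are simplified here (sequences are compared position by position):

```python
# Sketch: FSA counts a prediction as correct only if every move matches the
# ground truth; SLA averages per-move correctness across each sequence.
def fsa(preds, targets):
    """Full Sequence Accuracy: fraction of exactly-matched sequences."""
    return sum(p == t for p, t in zip(preds, targets)) / len(targets)

def sla(preds, targets):
    """Step-Level Accuracy: mean fraction of correct moves per sequence."""
    scores = []
    for p, t in zip(preds, targets):
        correct = sum(pm == tm for pm, tm in zip(p, t))
        scores.append(correct / max(len(t), 1))
    return sum(scores) / len(targets)

preds   = [[("red", "table")], [("blue", "red"), ("red", "out")]]
targets = [[("red", "table")], [("blue", "red"), ("red", "table")]]
assert fsa(preds, targets) == 0.5    # second sequence has one wrong move
assert sla(preds, targets) == 0.75   # (1/1 + 1/2) / 2
```

By construction SLA ≥ FSA, which is consistent with every row of the results tables above.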
Inductive generalization is a central benchmark: end-to-end networks fail on sequences longer than seen at training, while ILP provides stable generalization across increasing target lengths (Enc+ILP: FSA 83.6% beyond training max; PR+ILP: 100%). FC and QL models demonstrate gradual improvement as training length increases but do not reach ILP performance.
6. Extension to Natural Images and Cross-domain Transfer
IES methodology has been extended to real-world image contexts. Using Mask-RCNN, objects from 30 photographic pairs are mapped one-to-one as block colors, enabling direct use of BIRD-trained sequencing modules. Table 3 from (Gokhale et al., 2019) presents results:
| Approach | FSA (%) | SLA (%) |
|---|---|---|
| PR+FC | 55.3 | 61.1 |
| PR+QL | 92.2 | 96.4 |
| PR+ILP | 100 | 100 |
| MaskRCNN+FC | 47.5 | 51.7 |
| MaskRCNN+QL | 64.3 | 69.2 |
| MaskRCNN+ILP | 75.6 | 80.6 |
A plausible implication is that modular sequencing approaches (QL, ILP) enable robust zero-shot transfer to more complex domains, with accuracy scaling according to perceptual recognition fidelity and sequencing solver generalization.
7. Significance, Limitations, and Future Directions
IES crystallizes a core problem in scene understanding: causal inference of event sequences from raw perceptual data. The substantial combinatorial complexity of event-space renders direct end-to-end methods ineffective, demanding strategies which separate state representation from reasoning. The synthesis of visual encoding and logic-driven event solution, exemplified by the modular approach and ASP-based ILP, yields high accuracy and generalizes to novel sequence lengths—suggesting that explicit reasoning modules are essential for hard temporal tasks in vision (Gokhale et al., 2019).
Limitations include reliance on accurate perceptual encoding for transfer to natural images and the need for explicit domain axioms in logic engines. Ongoing advances in scene parsing and relational reasoning may further bridge perception and temporal reasoning, extending IES applicability to real-world manipulation and autonomous agents.