UniLight: Lighting & Traffic Signal Control
- UniLight is a dual-contribution framework that unifies multi-modal lighting representation and decentralized traffic signal control using cross-modal embeddings and minimal communication.
- The lighting component employs modality-specific encoders and a shared latent space with contrastive alignment and spherical harmonics loss to achieve high-fidelity cross-modal transfer.
- The traffic signal control component utilizes multi-agent reinforcement learning with analytically grounded uni-modal communication (UniComm) to enhance traffic throughput and reduce congestion.
UniLight denotes two distinct research contributions with authoritative impact in the domains of computational lighting representation and multi-agent reinforcement learning for traffic signal control. Below, each system is expounded in technical depth, reflecting its foundational principles, architectural details, evaluation protocols, and quantified achievements as described in the source works (Zhang et al., 3 Dec 2025, Jiang et al., 2022).
1. Definition and Scope
UniLight (Lighting Representation) (Zhang et al., 3 Dec 2025):
A unified latent-embedding space for representing lighting, capable of jointly modeling environment maps, images, irradiance maps, and text descriptions. The approach leverages modality-specific encoders, a shared embedding mechanism, and contrastive-alignment objectives with auxiliary spherical harmonics supervision to achieve cross-modal transfer and manipulation of lighting cues.
UniLight (Traffic Signal Control) (Jiang et al., 2022):
A multi-agent reinforcement learning network designed for decentralized traffic signal control, paired with a universal communication protocol ("UniComm"). It focuses on efficient inter-agent messaging by sharing analytically grounded predicted upstream flows, resulting in improved traffic throughput and reduced congestion in simulation and real-world datasets.
2. UniLight for Unified Lighting Representation
2.1. Unified Embedding Architecture
Lighting is encoded in a shared latent representation $Z \in \mathbb{R}^{N \times d}$, where $N$ is the number of learnable query tokens and $d$ is the embedding dimension. Each modality—environment maps, perspective images, irradiance, and descriptive text—is encoded by a backbone network (primarily DINOv2-B for vision, Qwen3-Embedding for text), followed by a modality-specific "summary" module. This module introduces $N$ learnable query tokens per modality, which interact with backbone tokens via multi-head attention, yielding aligned representations suitable for direct cross-modal comparison and downstream integration.
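The query-token summary mechanism can be sketched as follows; the single-head attention, the dimensions, and the random weights are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

# Hypothetical sketch of a "summary" module: learnable query tokens attend
# over backbone tokens via scaled dot-product attention (single head here
# for brevity; the paper uses multi-head attention).

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def summarize(backbone_tokens, queries, W_k, W_v):
    """Cross-attention pooling: queries (N, d) attend to tokens (T, d)."""
    K = backbone_tokens @ W_k                        # (T, d) keys
    V = backbone_tokens @ W_v                        # (T, d) values
    scores = queries @ K.T / np.sqrt(K.shape[-1])    # (N, T) attention logits
    return softmax(scores, axis=-1) @ V              # (N, d) summary tokens

rng = np.random.default_rng(0)
T, N, d = 50, 8, 32                  # token count, queries, embed dim (illustrative)
tokens = rng.normal(size=(T, d))     # backbone output
queries = rng.normal(size=(N, d))    # learnable in a real model
W_k, W_v = rng.normal(size=(d, d)), rng.normal(size=(d, d))
summary = summarize(tokens, queries, W_k, W_v)
print(summary.shape)                 # (8, 32)
```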
2.2. Modality-Specific Encoders
A. Environment-Map Encoder:
Utilizes DINOv2-B with three input channels: a Reinhard tone-mapped LDR image, a logarithmically encoded HDR image, and per-pixel direction vectors. Robustness is enhanced by random dropout of input channels and heterogeneous sampling between ground-truth and DiffusionLight-Turbo estimates.
B. Image Encoder:
Processes perspective crops (90° FOV) with Reinhard auto-exposure tone-mapping.
C. Irradiance Encoder:
Consumes per-pixel irradiance estimated by Prism (intrinsic image decomposition).
D. Text Encoder:
Employs Qwen3-Embedding (600M parameters), with structured prompts focusing on dominant light sites, color temperature, and brightness.
2.3. Contrastive Alignment and Spherical Harmonics Loss
Contrastive training employs an InfoNCE-style loss over all pairwise modality combinations within a batch:

$$\mathcal{L}_{\text{con}} = -\sum_{(m,\, m')} \sum_{i} \log \frac{\exp\!\big(\mathrm{sim}(z_i^{m}, z_i^{m'}) / \tau\big)}{\sum_{j} \exp\!\big(\mathrm{sim}(z_i^{m}, z_j^{m'}) / \tau\big)},$$

where $z_i^{m}$ is the mean-pooled, $\ell_2$-normalized representation from modality $m$ for sample $i$, $\mathrm{sim}(\cdot,\cdot)$ denotes cosine similarity, and $\tau$ is a temperature.
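A minimal sketch of the InfoNCE-style term for a single modality pair follows; the temperature value and the synthetic embeddings are assumptions, and the full objective would sum this over all modality pairs:

```python
import numpy as np

# Sketch of an InfoNCE-style contrastive loss between two modalities.
# Positives share the same batch index; all other pairs are negatives.

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def info_nce(z_a, z_b, tau=0.07):
    """z_a, z_b: (B, d) mean-pooled embeddings; matched pairs share an index."""
    sim = l2norm(z_a) @ l2norm(z_b).T / tau        # (B, B) scaled cosine sims
    logits = sim - sim.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))             # diagonal = positive pairs

rng = np.random.default_rng(1)
z_img = rng.normal(size=(16, 64))
loss_mismatched = info_nce(z_img, rng.normal(size=(16, 64)))
loss_matched = info_nce(z_img, z_img + 0.01 * rng.normal(size=(16, 64)))
print(loss_matched < loss_mismatched)  # aligned embeddings give a lower loss
```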
An auxiliary spherical harmonics (SH) regression anchors the embedding to explicit lighting directions. Ground-truth SH coefficients up to degree 3 ($16$ coefficients) are regressed from the flattened embeddings:

$$\mathcal{L}_{\text{SH}} = \big\| \hat{c}(Z) - c \big\|_2^2.$$

The total objective combines these components with a weighting coefficient $\lambda$ on the SH term:

$$\mathcal{L} = \mathcal{L}_{\text{con}} + \lambda\, \mathcal{L}_{\text{SH}}.$$
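The coefficient count follows directly from the basis size: a real SH expansion up to degree $L$ has $(L+1)^2$ basis functions, so degree 3 yields 16 coefficients:

```python
# Real spherical harmonics up to degree L number (L + 1)**2 basis functions;
# degree 3 gives the 16 coefficients regressed by the auxiliary loss.
def sh_coefficient_count(degree):
    return (degree + 1) ** 2

print(sh_coefficient_count(3))  # 16
```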
3. Multi-Modal Data Pipeline
The UniLight training dataset comprises 8,020 HDR environment maps from ULaval Outdoor/Indoor, Polyhaven, Domeble, and custom Theta Z1 captures. Each environment map yields nine random 90° crops (totaling 72,180 samples). Data augmentation includes:
- Reinhard tone-mapping to produce the LDR channel
- Log-encoded HDR and per-pixel directional projections for the environment-map encoder
- Per-pixel irradiance via Prism
- Spherical harmonics fitting (degree 3)
- Structured text generation (InternVL3-38B, prompting for dominant illumination sources)
- Additional environment map estimation per crop using DiffusionLight-Turbo
This design ensures diverse, high-fidelity supervision for both direct lighting appearance and abstracted semantic descriptions.
4. Evaluation Tasks and Results (Lighting Representation)
Evaluation spans three primary tasks:
A. Cross-Modal Lighting Retrieval:
Retrieval is evaluated on 603 environment-map crops using pairwise cosine similarity between embeddings. Performance metrics include R@1, R@5, R@10, MRR, and median and mean rank. UniLight surpasses a CLIP ViT-B/32 baseline (image↔text only): R@1=24.9 vs. 2.6, R@5=49.0 vs. 10.8, MRR=0.367 vs. 0.077.
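The retrieval metrics can be computed from the pairwise cosine-similarity matrix as sketched below; the embedding shapes and the noise model for matched pairs are illustrative:

```python
import numpy as np

# Sketch of cross-modal retrieval scoring: queries and gallery items with the
# same index are matched pairs; R@k and mean reciprocal rank (MRR) are derived
# from each query's 1-based rank of its true match.

def retrieval_metrics(q, g, ks=(1, 5, 10)):
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    g = g / np.linalg.norm(g, axis=1, keepdims=True)
    sim = q @ g.T                                        # (N, N) cosine sims
    order = np.argsort(-sim, axis=1)                     # best match first
    # position of the true match (same index) in each sorted row, 1-based
    ranks = 1 + np.argmax(order == np.arange(len(q))[:, None], axis=1)
    recall = {f"R@{k}": float(np.mean(ranks <= k)) for k in ks}
    return recall, float(np.mean(1.0 / ranks))           # recalls, MRR

rng = np.random.default_rng(2)
gallery = rng.normal(size=(100, 32))
queries = gallery + 0.1 * rng.normal(size=(100, 32))     # noisy matched pairs
recall, mrr = retrieval_metrics(queries, gallery)
print(recall["R@1"], mrr)
```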
B. Environment-Map Generation:
Stable Diffusion 3.5 Medium is fine-tuned, replacing its text conditioning with a cross-attention block on UniLight embeddings. Any modality's embedding can drive generation of LDR environment maps.
C. Lighting Control in Image Synthesis:
In the X↔RGB (Stable Diffusion 3.5) framework, UniLight embedding replaces irradiance as lighting control. The system supports relighting from arbitrary modalities, with quantitative evaluation (for outdoor scenes) via sky coherence estimation and qualitative comparison to LumiNet, DiffusionRenderer, and original X↔RGB models.
Ablation studies demonstrate sensitivity to token count (only marginal gains beyond a moderate number of query tokens), the necessity of the SH loss (its omission causes a dramatic drop in R@1), and the value of the directional encoding (cosine similarity decreases monotonically as input maps are rotated about the vertical axis).
5. UniLight for Traffic Signal Control
5.1. Problem Formulation and Notation
The multi-intersection traffic signal control problem is formalized as a decentralized partially observable Markov game (Dec-POMDP), with each intersection modeled as an agent $i$. The state space $\mathcal{S}$, joint action space $\mathcal{A} = \mathcal{A}_1 \times \cdots \times \mathcal{A}_n$, and local observations $o_i$ encapsulate phase selection and per-movement vehicle counts. The reward at each timestep is the negative average queue length per intersection, and the objective is to maximize the global discounted return $\mathbb{E}\big[\sum_{t} \gamma^{t} r_t\big]$.
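A toy illustration of the reward and discounted return under these definitions (the queue traces and discount value are made up):

```python
# The per-step reward is the negative average queue length across
# intersections; training maximizes the discounted sum of these rewards.

def reward(queue_lengths):
    """Negative mean queue length over all intersections at one timestep."""
    return -sum(queue_lengths) / len(queue_lengths)

def discounted_return(rewards, gamma=0.99):
    """Backward accumulation: G_t = r_t + gamma * G_{t+1}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

queues_per_step = [[4, 2, 7], [3, 2, 5], [1, 1, 2]]  # 3 intersections, 3 steps
rewards = [reward(q) for q in queues_per_step]
print(rewards[0], discounted_return(rewards))
```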
5.2. UniLight Architecture and Universal Communication (UniComm)
A. State Encoding:
Each intersection partitions its observation into movement-level feature tuples, each mapped through a shared FC+ReLU layer to a 32-dimensional representation.
B. UniComm Module:
Communication is analytically grounded: only the predicted future arriving volume (approaching flow) is exchanged between neighbors.
- Phase permission is predicted via self-attention and linear-sigmoid layers, supervised with BCELoss against stored phase data.
- Approaching flow per outgoing lane is accumulated using predicted phase permissions; supervised regressor loss aligns predicted volume with actual (replayed) arrival counts.
- Each agent transmits scalar predicted flow to adjacent intersections, ensuring minimal yet sufficient communication.
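A simplified sketch of this message construction, assuming soft sigmoid permissions and a hypothetical movement-to-neighbor routing map:

```python
import numpy as np

# Illustrative sketch of UniComm's message: predicted phase permissions gate
# per-movement volumes, which accumulate into one scalar predicted approaching
# flow per downstream intersection. The logits and routing are made up.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def approaching_flow(movement_volumes, permission_logits, downstream_of):
    """Scalar predicted arrival volume per downstream intersection."""
    permit = sigmoid(permission_logits)      # soft permission per movement
    gated = movement_volumes * permit        # expected released volume
    msg = {}
    for m, dst in enumerate(downstream_of):  # movement m feeds intersection dst
        msg[dst] = msg.get(dst, 0.0) + gated[m]
    return msg                               # one scalar per neighbor

volumes = np.array([5.0, 2.0, 8.0, 1.0])    # vehicles queued per movement
logits = np.array([3.0, -3.0, 3.0, -3.0])   # predicted phase permissions
messages = approaching_flow(
    volumes, logits, downstream_of=["east", "east", "north", "north"])
print(messages)
```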
C. Q-Value Computation:
Q-values are computed per candidate phase by partitioning movement vectors into permitted and blocked groups, aggregating each group with learned weights, and feeding the result through phase-aware linear heads.
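A schematic version of this phase-wise Q-head, with random rather than learned weights and hypothetical phase/movement assignments:

```python
import numpy as np

# Sketch of the phase-wise Q-head: movement embeddings are split into
# permitted and blocked groups per candidate phase, pooled with separate
# group weights, and scored by a linear head. All weights are random here;
# in UniLight they are learned.

def phase_q_value(movement_feats, permitted_mask, w_perm, w_block, head):
    perm = movement_feats[permitted_mask].mean(axis=0)    # pooled permitted
    block = movement_feats[~permitted_mask].mean(axis=0)  # pooled blocked
    return float(head @ (w_perm * perm + w_block * block))

rng = np.random.default_rng(3)
feats = rng.normal(size=(8, 32))       # 8 movements, 32-d embeddings
head = rng.normal(size=32)
phases = {                             # which movements each phase permits
    "NS_through": np.array([1, 1, 0, 0, 0, 0, 0, 0], dtype=bool),
    "EW_through": np.array([0, 0, 1, 1, 0, 0, 0, 0], dtype=bool),
}
q = {p: phase_q_value(feats, m, 0.7, 0.3, head) for p, m in phases.items()}
print(max(q, key=q.get))               # greedy phase choice
```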
5.3. Training Protocol and Loss Structure
The framework is trained using a Double Dueling DQN backbone with agent-specific parameters. The objective combines the canonical Bellman Q-value loss with the auxiliary losses from UniComm's phase-permission and approaching-flow predictions. Training uses replay buffers (size 8,000), minibatch size 30, target-network updates every 5 steps, a discount factor $\gamma$, and $\varepsilon$-greedy exploration annealed from 0.9 to 0.02.
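The exploration schedule can be sketched as below; linear annealing between the stated endpoints is an assumption, since the source gives only the start and end values:

```python
import random

# Epsilon-greedy exploration annealed from 0.9 to 0.02 (linear schedule
# assumed), plus the stated target-network sync cadence of every 5 steps.

def epsilon(step, total_steps, eps_start=0.9, eps_end=0.02):
    frac = min(step / total_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def select_action(q_values, eps, rng=random):
    if rng.random() < eps:                  # explore: uniform random phase
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)  # exploit

total = 1000
print(epsilon(0, total), epsilon(total, total))
sync_steps = [s for s in range(20) if s % 5 == 0]  # target-network updates
print(sync_steps)                                  # [0, 5, 10, 15]
```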
6. Experimental Evaluation (Traffic Signal Control)
UniLight with UniComm is benchmarked in the CityFlow simulator over 240,000 training frames, tested on 10 runs. Datasets include public (Hangzhou, Jinan, New York) and real-world (Shanghai-Taxi) networks, with varying intersection counts and configurations. Baselines span both traditional and DRL-based controllers.
Results (average travel time in seconds):
| Dataset | Best Baseline | UniLight w/o Comm. | UniLight + UniComm |
|---|---|---|---|
| Jinan (JN) | PressLight 335.9 | 335.85 | 325.47 (~3%↓) |
| Hangzhou (HZ) | MPLight 334.0 | 324.24 | 323.01 (~3%↓) |
| New York (NY) | CoLight 244.6 | 186.85 | 180.72 (~26%↓) |
| Shanghai-1 | SOTL 2362.7 | 2326.29 | 159.88 (~93%↓) |
| Shanghai-2 | PressLight 322.5 | 209.89 | 208.06 (~35%↓) |
Ablation studies reveal that replacing UniComm’s analytically justified scalar flow with a higher-dimensional learned state (as in CoLight) degrades performance, underlining the efficiency of UniComm’s minimalism.
7. Theoretical Insights, Strengths, and Limitations
Lighting Representation:
UniLight’s representation is robust to modality, expressive of both directionality and global lighting properties (as evidenced by monotonic response to environment map rotation and high-fidelity SH coefficient prediction). The architecture’s embedding supports direct transfer among text, images, and physical lighting encodings, exceeding classical cross-modal baselines by substantial margins.
Traffic Signal Control:
UniComm ensures the only information transmitted between agents is the precise predicted impacting flow for downstream intersections, achieving minimal message size with theoretical justification. The grouping and weighting of traffic-movement features in UniLight’s Q-head architecture accelerates training and generalization, especially in large, complex urban grids.
Limitations (Traffic):
Explicitly, performance guarantees presume no lane spillback and that vehicles cannot traverse an entire road segment within a single action interval; relaxing these conditions necessitates either multi-step prediction or richer dynamical modeling. UniComm currently handles only one-step-ahead prediction and homogeneous traffic conditions, leaving more complex, multi-modal scenarios as opportunities for future expansion.
Summary:
UniLight in both domains exemplifies the convergence of architectural minimalism, domain-theoretic analysis, and multi-modal integration, setting quantitative state-of-the-art and providing an extensible basis for subsequent advances in lighting representation and decentralized coordination (Zhang et al., 3 Dec 2025, Jiang et al., 2022).