
Horizon Activation Mapping for Forecasting

Updated 12 January 2026
  • Horizon Activation Mapping (HAM) is a dual-method approach combining gradient-based interpretability and activation tuning to optimize neural forecasting.
  • HAM uses gradient norm curves and activation curvature metrics to reveal model bias toward short- or long-range dependencies in forecast subseries.
  • HAM facilitates architecture selection and hyperparameter tuning by mapping activation curvature and state entropy to extend forecast horizon persistence.

Horizon Activation Mapping (HAM) is both a quantitative interpretability framework for neural forecasting models and a principled algorithmic tool for optimizing nonlinear architectures in time series prediction. The term encompasses two distinct but thematically unified methodological streams: (1) a gradient-based interpretability approach for delineating which forecast horizon subseries most influence parameter updates in general neural forecasting architectures (Krupakar et al., 5 Jan 2026), and (2) a systematic recipe for tuning node activation functions in reservoir computers to maximize the time interval over which accurate predictions are possible, via curvature and entropy metrics (Hurley et al., 2023). Both approaches resolve the key challenge of understanding and shaping how neural models distribute computational “effort” over prediction time, facilitating both architecture selection and hyperparameter optimization.

1. Defining Horizon Activation Mapping in Forecasting

In model-agnostic interpretability, HAM formalizes the attribution of parameter update magnitudes to temporal forecast subseries, replacing the spatial attention maps of Grad-CAM with temporal “horizon subseries” (Krupakar et al., 5 Jan 2026). In each mode (causal or anti-causal), HAM computes, for every possible subseries cutoff, the aggregate $L_2$ gradient norm of the masked loss with respect to parameters. The result is a curve $G_m(h)$, which expresses how much each early (causal) or late (anti-causal) segment of the horizon influences effective training. Comparing these curves to lines of proportionality (ideal uniform contributions across the horizon) exposes model bias toward short- or long-range dependencies.

In reservoir computing, HAM refers to the workflow of choosing and tuning activation functions (e.g., Swish($\beta$), Shifted tanh($b$), or a library of 16 canonical choices) to optimize the forecast horizon (FH)—the dynamical timescale for which reservoir predictions retain fidelity on chaotic benchmarks such as the Lorenz attractor. The method quantifies how activation curvature $K$ and state entropy (ASE) jointly influence FH, enabling principled exploration of activation parameter space to maximize predictive stability (Hurley et al., 2023).

2. Mathematical Formulation and Methodology

A. HAM in General Neural Forecasting

Let $f_\theta$ be a forecasting model with horizon $H$ and per-timestep loss $\ell_h$, so that

$$\mathcal{L}(f_\theta(x), y) = \frac{1}{H} \sum_{h=1}^H \ell_h(f_\theta(x), y)_h.$$

Define binary masks $M_c(\hat{h}, H)$ and $M_a(\hat{h}, H)$ for causal and anti-causal modes. The masked subseries loss is

$$\mathcal{L}_m(\hat{h}) = \frac{1}{H} \sum_{h=1}^H M_m(\hat{h}, H)_h \cdot \ell_h(f_\theta(x), y)_h$$

and the key metric is the mean gradient norm over examples,

$$G_m(\hat{h}) = \frac{1}{N} \sum_{i=1}^N \left\| \nabla_\theta \mathcal{L}_m^{(i)}(\hat{h}) \right\|_2.$$

The “line of proportionality” $L_m(h)$ is a scaled baseline:

$$L_c(h) = \frac{h}{H} G, \qquad L_a(h) = \frac{H - h}{H} G$$

where $G$ is the maximal observed gradient norm. Deviations between $G_m(h)$ and $L_m(h)$ reflect model architecture or training regime bias.

Auxiliary analyses include the “gradient-equivariant point” $h^*$ where $G_c(h^*) \approx G_a(h^*)$ (demarcating equal investment in early vs. late subseries) and the signed, normalized difference plot

$$d(h) = \frac{G_c(h) - G_a(h)}{\max_t |G_c(t) - G_a(t)|}$$

which emphasizes short- vs long-horizon dominance.
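The full HAM sweep reduces to computing masked-loss gradient norms for each cutoff and mode. The sketch below illustrates this for a toy linear forecaster with an analytic gradient; the model, shapes, and data are illustrative stand-ins, not the setup of Krupakar et al.:

```python
import numpy as np

rng = np.random.default_rng(0)
H, D, N = 8, 4, 32          # horizon, input dim, number of examples

# Toy linear forecaster: y_hat[h] = W[h] @ x  (one weight row per horizon step)
W = rng.normal(size=(H, D))
X = rng.normal(size=(N, D))
Y = rng.normal(size=(N, H))

def grad_norm(x, y, mask):
    """L2 norm of the gradient of the masked MSE loss w.r.t. W."""
    resid = W @ x - y                                          # (H,)
    grad = (2.0 / H) * (mask * resid)[:, None] * x[None, :]    # (H, D)
    return np.linalg.norm(grad)        # Frobenius norm over all parameters

def G_curve(mode):
    """Mean masked-gradient norm G_m(h_cut) over the dataset."""
    steps = np.arange(1, H + 1)
    curve = np.empty(H)
    for h_cut in steps:
        # causal mask keeps the early segment [1..h_cut];
        # anti-causal keeps the late segment [h_cut..H]
        mask = (steps <= h_cut) if mode == "causal" else (steps >= h_cut)
        curve[h_cut - 1] = np.mean([grad_norm(X[i], Y[i], mask)
                                    for i in range(N)])
    return curve

G_c, G_a = G_curve("causal"), G_curve("anti")
G = max(G_c.max(), G_a.max())
h = np.arange(1, H + 1)
L_c, L_a = h / H * G, (H - h) / H * G          # lines of proportionality
d = (G_c - G_a) / np.max(np.abs(G_c - G_a))    # signed, normalized difference
h_star = h[np.argmin(np.abs(G_c - G_a))]       # gradient-equivariant point
```

Note that the causal curve at $\hat{h} = H$ and the anti-causal curve at $\hat{h} = 1$ both mask in the whole horizon, so they coincide by construction, which is a useful sanity check on any implementation.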

B. HAM in Reservoir Computer Optimization

The reservoir FH is mathematically defined as

$$\mathrm{FH} = \lambda_L \cdot \inf\{ t > 0 : |x(t) - \hat{x}(t)| > \Delta \}$$

with $\lambda_L$ as the Lyapunov time and $\Delta$ typically set to $5$ (15% of the Lorenz $x$-range). For each activation function $f(x)$, the weighted curvature is

$$K = \int \kappa(x)\, p(x)\, dx$$

with

$$\kappa(x) = \frac{|f''(x)|}{\left[ 1 + \big(f'(x)\big)^2 \right]^{3/2}}$$

and $p(x)$ the empirical node input distribution. Average State Entropy (ASE) is computed by averaging the instantaneous entropy (via Gaussian kernel density estimates) over all time steps preceding the FH error crossing.
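Both quantities admit straightforward numerical estimates. The following is a minimal sketch, assuming the error trajectory and node-input samples are already available as arrays; function names, the finite-difference scheme, and the Monte-Carlo treatment of $p(x)$ are illustrative choices, not the authors' implementation:

```python
import numpy as np

def forecast_horizon(t, err, lam, delta=5.0):
    """First time the prediction error exceeds delta, scaled by lam,
    following the definition FH = lambda_L * inf{t : |x - x_hat| > Delta}.
    Returns the full span if the threshold is never crossed."""
    crossed = np.nonzero(np.asarray(err) > delta)[0]
    t_fail = t[crossed[0]] if crossed.size else t[-1]
    return lam * t_fail

def weighted_curvature(f, samples, eps=1e-4):
    """Monte-Carlo estimate of K = integral of kappa(x) p(x) dx, with p(x)
    represented by empirical node-input samples and f', f'' taken by
    central differences."""
    x = np.asarray(samples, dtype=float)
    f1 = (f(x + eps) - f(x - eps)) / (2 * eps)            # f'(x)
    f2 = (f(x + eps) - 2 * f(x) + f(x - eps)) / eps**2    # f''(x)
    kappa = np.abs(f2) / (1 + f1**2) ** 1.5
    return kappa.mean()           # expectation under the empirical p(x)
```

A quick consistency check: a linear activation has zero second derivative, so its weighted curvature should vanish, while any smooth saturating nonlinearity (e.g. tanh) should yield $K > 0$.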

3. Empirical Outcomes and Benchmark Comparisons

A. Gradient-based HAM Across Neural Families

HAM was applied to a diverse suite of modern multivariate forecasters (MLP-based CycleNet, N-Linear, N-HITS; self-attention-based FEDformer, Pyraformer; SSM-based SpaceTime; diffusion-based Multi-Resolution DDPM) on the ETTm2 dataset with forecast horizons $H \in \{96, 192, 336, 720\}$ (Krupakar et al., 5 Jan 2026).

Observed phenomena include:

  • Dropout in N-HITS significantly amplifies overall and late-horizon gradient norms.
  • Larger batch sizes increase the magnitudes of $G_c$ and $G_a$, realign gradient curves toward proportionality, and shift attention bias toward early subseries.
  • Early-stopped models show overall dampened gradients and loss of strong horizon bias, with the equivariant point close to $H/2$.
  • Normalization in N-Linear modulates early- vs. late-horizon gradient allocation.
  • In SpaceTime, longer forecast horizons transition $G_c(h)$ from linear to exponential due to state-space model dynamics.

B. Activation-Driven HAM in Reservoir Computers

A systematic survey of 16 activation functions (7 non-monotonic, 9 monotonic) found notable FH differences:

| Function | Type | FH ($N=300$) |
|---|---|---|
| Logish | Non-monotonic | 12.31 ± 0.27 |
| Swish($\beta=0.6$) | Non-monotonic | 9.63 ± 0.32 |
| Shifted tanh($b=1$) | Monotonic | 10.53 ± 0.21 |
| Hard-tanh | Monotonic | 2.46 ± 0.24 |
| Hard-Sigmoid | Monotonic | 0.24 ± 0.02 |

FH increases monotonically with activation curvature $K$ for Swish($\beta$) and generally across the library. Maximum FH is found at intermediate ASE; excessively high or low state entropy leads to shorter horizons (Hurley et al., 2023).

4. Interpretation and Diagnostic Utilities

HAM visualizations provide several crucial diagnostics:

  • Curves $G_m(h)$ above the proportional baseline indicate subseries of heightened parameter-update sensitivity; curves below it indicate underweighting.
  • The signed area between $G_m(h)$ and $L_m(h)$ quantifies net attention bias toward short or long subseries.
  • The difference plot $d(h)$ enables scale-free comparisons, e.g., for cross-family benchmarking in model selection.
  • The equivariant point $h^*$ immediately signals the horizon regime where model focus transitions from early to late.
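The signed-area diagnostic can be computed directly from the two curves. A small helper, assuming both are sampled on the same unit-spaced horizon grid (the function name is illustrative):

```python
import numpy as np

def attention_bias_area(G_m, L_m):
    """Signed area between G_m(h) and its proportionality baseline L_m(h),
    via the trapezoid rule on a unit-spaced horizon grid. Positive values
    mean the masked-in subseries receives more gradient than proportional."""
    diff = np.asarray(G_m, dtype=float) - np.asarray(L_m, dtype=float)
    return np.sum((diff[:-1] + diff[1:]) / 2.0)
```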

In reservoir networks, rapid quantitative mapping of FH over (activation parameter, reservoir size) space exposes narrow “sweet spots” where chosen activation and entropy/curvature metrics lead to substantial predictive persistence.

5. Practical Application and Optimization Workflow

A. Interpretability and Selection

HAM facilitates model-agnostic selection on the validation set by identifying which architectures exhibit the desired gradient attenuation or persistence across the forecast horizon. For long-term planning tasks, models with persistent $G_a(h)$ and $d(h) < 0$ beyond $H/2$ are preferred (Krupakar et al., 5 Jan 2026).

Batch size and early-stopping can be tuned using HAM to avoid regimes where all gradient curves merely scale or collapse, which would indicate diminished learning signal differentiation across the horizon.

B. Reservoir Computer Optimization

The HAM guideline for activation tuning is:

  1. Set target FH and select system (e.g., Lorenz).
  2. Choose activation family or fixed function library.
  3. For each candidate and parameter, train and assess FH, KK, ASE.
  4. Map $\mathrm{FH} = F(\text{parameter}, N)$ and locate maximal ridges.
  5. Ensure $K \gtrsim 0.05$ (Lorenz) and ASE in the intermediate regime.
  6. Fix optimal activation, re-tune other hyperparameters if desired.

This approach reduces the high-dimensional search over activation functions to a low-dimensional manifold corresponding to empirically verified “sweet spots” of forecast persistence (Hurley et al., 2023).
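Steps 3–5 of the recipe amount to a filtered grid search. The sketch below assumes a hypothetical callback `train_and_score(param, n)` that trains one reservoir and returns its (FH, $K$, ASE) triple; the `k_min` default echoes the $K \gtrsim 0.05$ Lorenz guideline, while `ase_band` is an illustrative placeholder for the intermediate-entropy regime:

```python
import numpy as np

def ham_tune(candidates, train_and_score, k_min=0.05, ase_band=(0.2, 0.8)):
    """Evaluate each (activation parameter, reservoir size) pair, discard
    low-curvature or extreme-entropy regimes, and keep the configuration
    with the longest forecast horizon."""
    best, best_fh = None, -np.inf
    for param, n in candidates:
        fh, curv, ase = train_and_score(param, n)
        if curv < k_min or not (ase_band[0] <= ase <= ase_band[1]):
            continue                    # outside the empirical sweet spot
        if fh > best_fh:                # track the maximal FH ridge
            best, best_fh = (param, n), fh
    return best, best_fh
```

Because FH, $K$, and ASE are scalar summaries, the search over activation families collapses to a cheap filter-and-rank loop rather than a full architecture sweep.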

6. Limitations and Extensions

HAM is restricted to aggregate gradient norm information and does not resolve the contributions of specific parameters or internal layers, although layer-wise extensions are plausible. Computational cost scales as $O(N H)$ backpropagation steps per dataset epoch, implying a need for efficient approximations or subsampling for very large datasets or horizons. In reservoir computing, results have benchmark specificity (e.g., Lorenz system), but the methodology is transferable given calibration (Hurley et al., 2023). For general neural models, future directions include layer-wise HAM, integration with probabilistic forecast diagnostics, and combinations with internal attention-based interpretations.

A plausible implication is that by combining HAM’s diagnostic metrics with traditional validation losses, practitioners can achieve more robust architecture design and tuning for tasks with variable forecast horizon requirements, and cross-family benchmarking is enhanced by the scale-free normalization afforded by difference plots.

7. Significance in Time Series Forecasting Research

Horizon Activation Mapping synthesizes architectural interpretability and principled hyperparameter tuning into a unified analytic toolkit. It underpins both model-agnostic comparative evaluation and activation function selection—core processes in long-horizon and multivariate time-series forecasting. Both variants of HAM have demonstrated capability to extract actionable insight from models as diverse as MLPs, SSMs, attention models, and diffusion forecasters, and to expose nontrivial dependencies of predictive longevity on both low-level nonlinearities and high-level architectural design (Krupakar et al., 5 Jan 2026, Hurley et al., 2023). The formalization of horizon-oriented mapping thus represents a significant advance in the analysis and optimization of temporal learning systems.
