DSST: Efficient Scale-Adaptive Object Tracking
- The paper introduces DSST, which decouples translation and scale estimation using separate correlation filters to enhance tracking efficiency and robustness.
- DSST employs a dedicated 1D scale filter updated in the Fourier domain, significantly reducing computational complexity compared to joint filtering methods.
- Empirical evaluations on benchmarks like OTB-50 and VOT 2014 demonstrate DSST’s superior accuracy, real-time speed, and effective handling of scale variations.
The Discriminative Scale Space Tracker (DSST) is a scale-adaptive visual object tracking method within the tracking-by-detection paradigm. DSST achieves robust and accurate scale estimation by learning separate discriminative correlation filters for translation and scale, providing a computationally efficient solution compared to exhaustive joint search or full 3D filtering. The DSST framework explicitly models scale change using a dedicated 1D correlation filter, updating both translation and scale filters online via exponential averaging in the Fourier domain. DSST attains state-of-the-art tracking performance and real-time speeds, with superior robustness across standardized benchmarks (Danelljan et al., 2016).
1. Translation Filter: Formulation and Solution
The spatial translation filter in DSST is formulated as a multi-channel ridge regression/correlation filter. Given a $d$-channel feature patch with channels $f^1, \dots, f^d$ (e.g., HOG and grayscale) centered around the target in frame $t$, and a desired output $g$ (typically Gaussian-shaped), the filter $h$ is found by minimizing:

$$\varepsilon = \Big\| \sum_{l=1}^{d} h^l \star f^l - g \Big\|^2 + \lambda \sum_{l=1}^{d} \big\| h^l \big\|^2,$$
where $\star$ denotes circular correlation and $\lambda \geq 0$ is a regularization parameter. By exploiting the circulant structure and Parseval's theorem, the optimization decouples in the Fourier domain into $d$ independent pointwise systems with closed-form solution

$$H^l = \frac{\bar{G}\, F^l}{\sum_{k=1}^{d} \bar{F^k}\, F^k + \lambda}, \qquad l = 1, \dots, d,$$

where capital letters denote DFTs (e.g., $F = \mathcal{F}\{f\}$), the bar denotes complex conjugation, and all products are pointwise. For online tracking, exponentially windowed averaging of the cross- and auto-spectra is employed:

$$A_t^l = (1-\eta)\, A_{t-1}^l + \eta\, \bar{G}_t\, F_t^l, \qquad B_t = (1-\eta)\, B_{t-1} + \eta \sum_{k=1}^{d} \bar{F_t^k}\, F_t^k,$$

with learning rate $\eta$. The filter at time $t$ is then $H_t^l = A_t^l / (B_t + \lambda)$.
Translation in a new frame is determined by extracting a patch $z$ at the previous target position, computing its DFT $Z = \mathcal{F}\{z\}$, and correlating it with the learned filter:

$$y_t = \mathcal{F}^{-1}\Bigg\{ \frac{\sum_{l=1}^{d} \bar{A_{t-1}^l}\, Z_t^l}{B_{t-1} + \lambda} \Bigg\}.$$

The new target location is the argmax of $y_t$ (Danelljan et al., 2016).
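A minimal, single-channel sketch of this update/detection cycle (raw-pixel features and the parameter values `eta` and `lam` are illustrative assumptions; the paper uses multi-channel HOG features):

```python
# Single-channel sketch of the DSST translation filter: exponential
# averaging of numerator/denominator spectra, then FFT-domain detection.
import numpy as np

def make_gaussian_label(shape, sigma=2.0):
    """Desired output g: 2D Gaussian, rolled so its peak sits at shift (0, 0)."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    g = np.exp(-((ys - h // 2) ** 2 + (xs - w // 2) ** 2) / (2 * sigma ** 2))
    return np.roll(g, (-(h // 2), -(w // 2)), axis=(0, 1))

def update_filter(A, B, patch, G, eta=0.025):
    """A_t = (1-eta) A_{t-1} + eta conj(G) F;  B_t likewise with conj(F) F."""
    F = np.fft.fft2(patch)
    return ((1 - eta) * A + eta * np.conj(G) * F,
            (1 - eta) * B + eta * np.conj(F) * F)

def detect(A, B, patch, lam=1e-2):
    """Correlate the learned filter with a new patch; return the argmax shift."""
    Z = np.fft.fft2(patch)
    response = np.real(np.fft.ifft2(np.conj(A) * Z / (B + lam)))
    return np.unravel_index(np.argmax(response), response.shape)

rng = np.random.default_rng(0)
patch = rng.standard_normal((32, 32))
G = np.fft.fft2(make_gaussian_label(patch.shape))
A, B = update_filter(np.zeros((32, 32)), np.zeros((32, 32)), patch, G)

print(detect(A, B, patch))                                # training patch: shift (0, 0)
print(detect(A, B, np.roll(patch, (3, 5), axis=(0, 1)))) # shifted copy: shift (3, 5)
```

Note that the label `g` is rolled so that a perfectly aligned patch produces its response peak at shift $(0, 0)$, which makes the argmax directly interpretable as a displacement.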
2. Scale Filter: One-Dimensional Learning and Estimation
For scale estimation, DSST learns a separate one-dimensional filter $h_{\mathrm{scale}}$. At each frame $t$, $S$ patches of size $a^n P \times a^n R$ are extracted, where $n \in \{\lfloor -\tfrac{S-1}{2} \rfloor, \dots, \lfloor \tfrac{S-1}{2} \rfloor\}$, $a$ is the scale factor between adjacent levels, and $P \times R$ is the current target size. Each patch is resized to a fixed template size and represented by a $d'$-dimensional feature vector $f_{\mathrm{scale}}(n)$. The scale filter minimizes the same ridge-regression objective as the translation filter, now over the one-dimensional scale variable:

$$\varepsilon = \Big\| \sum_{l=1}^{d'} h_{\mathrm{scale}}^l \star f_{\mathrm{scale}}^l - g_{\mathrm{scale}} \Big\|^2 + \lambda \sum_{l=1}^{d'} \big\| h_{\mathrm{scale}}^l \big\|^2,$$

where $g_{\mathrm{scale}}$ is a 1D Gaussian response peaking at the current scale ($n = 0$). In the frequency domain, the closed-form solution has the same form as for translation:

$$H_{\mathrm{scale}}^l = \frac{\bar{G}_{\mathrm{scale}}\, F_{\mathrm{scale}}^l}{\sum_{k=1}^{d'} \bar{F_{\mathrm{scale}}^k}\, F_{\mathrm{scale}}^k + \lambda}.$$
Updates follow the same exponential windowing as for translation. Scale detection is accomplished by applying $h_{\mathrm{scale}}$ to the feature vectors of a new scale sample and selecting the scale index $n$ that maximizes the filter response (Danelljan et al., 2016).
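The scale-sample construction and 1D filtering can be sketched as follows (the synthetic image, raw-pixel "features", nearest-neighbor resizing, and parameter values are illustrative assumptions; the paper uses HOG features of each resized patch):

```python
# Sketch of DSST's 1D scale filter on a synthetic feature pyramid:
# one feature vector per scale level, FFTs taken along the scale axis.
import numpy as np

S, a, lam = 33, 1.02, 1e-2           # number of scales, scale step, regularizer
n = np.arange(S) - (S - 1) // 2      # scale exponents, centered on n = 0

def scale_sample(image, center, base_size, template=16):
    """Stack one feature vector per scale a**n into a (d', S) matrix."""
    cy, cx = center
    cols = []
    for e in n:
        half = max(2, int(round(base_size * a ** e)) // 2)
        patch = image[cy - half:cy + half, cx - half:cx + half]
        # crude nearest-neighbor resize to a fixed template edge length
        idx = np.linspace(0, patch.shape[0] - 1, template).astype(int)
        cols.append(patch[np.ix_(idx, idx)].ravel())
    return np.stack(cols, axis=1)            # shape (template**2, S)

# 1D Gaussian label over scale indices, peaked at the current scale (n = 0)
g = np.exp(-0.5 * (n / 1.5) ** 2)
G = np.fft.fft(g)

rng = np.random.default_rng(1)
image = rng.standard_normal((200, 200))
f = scale_sample(image, (100, 100), base_size=40)

F = np.fft.fft(f, axis=1)                    # per-channel 1D FFTs along scale
A = np.conj(G)[None, :] * F                  # numerator spectra
B = np.sum(np.conj(F) * F, axis=0)           # denominator spectrum

# Detection: correlate a new scale sample, take the argmax scale index
Z = np.fft.fft(scale_sample(image, (100, 100), base_size=40), axis=1)
resp = np.real(np.fft.ifft(np.sum(np.conj(A) * Z, axis=0) / (B + lam)))
print(n[np.argmax(resp)])                    # same position and size: n = 0
```

The key point the sketch illustrates is that all FFTs here are one-dimensional over the $S$ scale levels, with the $d'$ feature dimensions acting as channels.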
3. Decoupled Translation and Scale Estimation
A primary innovation of DSST is the explicit separation of translation and scale estimation, as opposed to exhaustive or joint approaches. The sequence for each frame is as follows:
- Translation estimation: Apply the spatial filter to a patch extracted at the previous position $p_{t-1}$ and scale $s_{t-1}$ to obtain the new position $p_t$ via the maximum response.
- Scale estimation: Around $p_t$, extract $S$ scale-sampled patches and apply the scale filter to obtain the new scale $s_t$.
- Model update: At $(p_t, s_t)$, extract new samples for translation and scale, updating $A_t^l$, $B_t$ and their scale counterparts with learning rate $\eta$ (Danelljan et al., 2016).
This decoupling avoids the high cost of full 3D or multi-resolution search and allows DSST to operate efficiently while preserving accuracy.
4. Frame-by-Frame Algorithm and Implementation
The framewise operations executed by DSST can be summarized:
- Translation
  - Extract a patch $z_{\mathrm{trans}}$ around $(p_{t-1}, s_{t-1})$ and compute $Z_{\mathrm{trans}} = \mathcal{F}\{z_{\mathrm{trans}}\}$
  - Compute the spatial score $y_{\mathrm{trans}} = \mathcal{F}^{-1}\big\{ \sum_l \bar{A_{t-1}^l}\, Z_{\mathrm{trans}}^l \,/\, (B_{t-1} + \lambda) \big\}$
  - Set $p_t = \arg\max y_{\mathrm{trans}}$
- Scale
  - Extract the scale sample $z_{\mathrm{scale}}$ at $p_t$ (one feature vector per scale level $n$) and compute $Z_{\mathrm{scale}} = \mathcal{F}\{z_{\mathrm{scale}}\}$
  - Compute the scale score $y_{\mathrm{scale}} = \mathcal{F}^{-1}\big\{ \sum_l \bar{A_{\mathrm{scale},t-1}^l}\, Z_{\mathrm{scale}}^l \,/\, (B_{\mathrm{scale},t-1} + \lambda) \big\}$
  - Set $n^\ast = \arg\max y_{\mathrm{scale}}$, $s_t = a^{n^\ast} s_{t-1}$
- Model Update
  - Update $A_t^l$, $B_t$ and the scale-filter spectra at $(p_t, s_t)$ with newly extracted samples (Danelljan et al., 2016).
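The ordering of these framewise operations can be demonstrated end to end on a toy, fully 1D problem: a Gaussian "target" drifting along a synthetic signal, with translation detection, scale detection, and model update in sequence. All sizes, the raw-sample features, and parameter values below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

L, S, a, lam, eta = 32, 11, 1.05, 1e-2, 0.05
ns = np.arange(S) - S // 2                     # relative scale exponents

def bump(x, pos, width):
    return np.exp(-0.5 * ((x - pos) / width) ** 2)

def window(signal, pos):
    idx = (np.arange(-L // 2, L // 2) + pos) % signal.size
    return signal[idx]                         # circular crop around pos

def scale_sample(signal, pos, width):
    rows = []
    for e in ns:                               # one resampled window per scale
        half = 2.0 * width * a ** e
        xs = np.linspace(pos - half, pos + half, L)
        rows.append(np.interp(xs, np.arange(signal.size), signal))
    return np.stack(rows, axis=1)              # (L channels, S scales)

def spectra(F, G):
    return np.conj(G) * F, np.conj(F) * F      # numerator, denominator

# labels: translation label rolled so its peak sits at shift 0,
# scale label peaking at ns = 0 (index S // 2)
G_t = np.fft.fft(np.roll(bump(np.arange(L), L // 2, 2.0), -L // 2))
G_s = np.fft.fft(bump(ns, 0, 1.0))

x = np.arange(256)
pos, width = 100.0, 4.0                        # ground truth
est_pos, est_w = 100, 4.0                      # tracker state

frame = bump(x, pos, width)
A_t, B_t = spectra(np.fft.fft(window(frame, est_pos)), G_t)
Fs = np.fft.fft(scale_sample(frame, est_pos, est_w), axis=1)
A_s = np.conj(G_s)[None, :] * Fs
B_s = np.sum(np.conj(Fs) * Fs, axis=0)

for _ in range(3):
    pos += 3                                   # target drifts right
    frame = bump(x, pos, width)
    # 1) translation estimation at the OLD position/scale
    Z = np.fft.fft(window(frame, est_pos))
    r = np.real(np.fft.ifft(np.conj(A_t) * Z / (B_t + lam)))
    k = int(np.argmax(r))
    est_pos += k if k < L // 2 else k - L      # map argmax to signed shift
    # 2) scale estimation at the NEW position
    Zs = np.fft.fft(scale_sample(frame, est_pos, est_w), axis=1)
    rs = np.real(np.fft.ifft(np.sum(np.conj(A_s) * Zs, axis=0) / (B_s + lam)))
    est_w *= a ** ns[int(np.argmax(rs))]
    # 3) model update at the new state (exponential averaging)
    An, Bn = spectra(np.fft.fft(window(frame, est_pos)), G_t)
    A_t, B_t = (1 - eta) * A_t + eta * An, (1 - eta) * B_t + eta * Bn
    Fs = np.fft.fft(scale_sample(frame, est_pos, est_w), axis=1)
    A_s = (1 - eta) * A_s + eta * np.conj(G_s)[None, :] * Fs
    B_s = (1 - eta) * B_s + eta * np.sum(np.conj(Fs) * Fs, axis=0)

print(est_pos, est_w)                          # tracker follows the drifting target
```

Because the target keeps a constant width here, the scale filter's argmax stays at $n = 0$ on every frame while the translation filter absorbs the drift.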
5. Computational Complexity
DSST achieves substantial computational efficiency compared to joint or exhaustive multi-scale search methods. The dominant operations are:
- Translation filter: $O(dMN \log(MN))$ per update/detection, for $d$ feature channels and an $M \times N$ patch.
- Scale filter: $O(d'S \log S)$ for $S$ scale levels and feature dimension $d'$.
- Comparative costs:
- Multi-resolution search applies the translation filter at $S$ scales: $O(SdMN \log(MN))$
- Full 3D correlation filter: $O(dMNS \log(MNS))$
- DSST, with separate translation and 1D scale filtering, is roughly $S$ times faster than multi-resolution search and more than $S$ times faster than full 3D filtering for reasonable values of $S$ and the patch size (e.g., $S = 33$) (Danelljan et al., 2016).
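A back-of-envelope comparison of these cost terms, using illustrative (assumed) sizes for the patch, channel counts, scale levels, and scale-feature dimension:

```python
# Plug assumed sizes into the complexity terms above and compare.
# All concrete numbers here are illustrative, not the paper's accounting.
import math

M = N = 48            # patch size (assumed, in HOG cells)
d = 28                # translation feature channels (assumed)
S = 33                # scale levels
d_scale = 496         # scale feature dimension d' (assumed)

translation = d * M * N * math.log2(M * N)     # 2D FFTs over d channels
scale = d_scale * S * math.log2(S)             # 1D FFTs over d' channels
dsst = translation + scale
multi_res = S * translation                    # translation filter at S scales
joint_3d = d * M * N * S * math.log2(M * N * S)  # one 3D FFT volume

print(f"DSST vs multi-resolution: {multi_res / dsst:.1f}x")
print(f"DSST vs joint 3D filtering: {joint_3d / dsst:.1f}x")
```

With these sizes the 1D scale filter adds only a small overhead to the translation filter, so the speedup over multi-resolution search lands near the factor $S$, and the 3D-filter ratio is larger still because of the $\log(MNS)$ term.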
6. Experimental Results and Benchmarks
Key empirical findings from DSST and its compressed/faster variant fDSST:
| Tracker (Configuration) | Overlap Precision (%) | Distance Precision (%) | Speed (FPS) |
|---|---|---|---|
| Baseline DCF (translation only) | 57.7 | 70.8 | 57.3 |
| Multi-resolution DCF (5 scales) | 65.2 | 74.8 | 16.9 |
| Joint 3D DCF (33 scales) | 63.2 | 72.1 | 1.46 |
| Iterative Joint DCF | 64.1 | 74.2 | 1.01 |
| DSST (translation+scale, 33 scales) | 67.7 | 75.7 | 25.4 |
| fDSST (PCA+FFT interp) | 74.3 | 80.2 | 54.3 |
On the OTB-50 benchmark, DSST achieves a 6.6% AUC improvement over the baseline, and fDSST outperforms the previous best (SAMF) by 2.6%. fDSST runs in real time at 54 FPS and remains robust under both temporal (TRE) and spatial (SRE) initialization perturbations, consistently outperforming KCF, SAMF, Struck, and similar trackers.
In the VOT 2014 challenge (25 videos), fDSST achieved the best composite rank (accuracy and robustness) among 38 trackers, attaining the top average overlap and the fewest failures per sequence. Attribute-based analysis showed fDSST winning in 7 out of 11 categories, notably excelling in scale variation, fast motion, and background clutter (Danelljan et al., 2016).
7. Notation Summary and Key Formulas
Key notation and formulae:
- $h^l, f^l$: $l$-th feature channel of the translation filter and input patch
- $A_t^l, B_t$: numerator and denominator spectra for the translation filter at time $t$
- $\lambda$: regularization parameter, $\eta$: learning rate, $\bar{(\cdot)}$: complex conjugation
- Filter: $H_t^l = A_t^l / (B_t + \lambda)$; Detection: $y_t = \mathcal{F}^{-1}\big\{ \sum_{l=1}^{d} \bar{A_{t-1}^l}\, Z_t^l \,/\, (B_{t-1} + \lambda) \big\}$
- Scale filter $h_{\mathrm{scale}}$, samples $f_{\mathrm{scale}}$, desired response $g_{\mathrm{scale}}$: update and spectrum calculations identical in form, with learning rate $\eta$
DSST’s primary methodological contribution is the decoupled scale-adaptive correlation filter framework, providing state-of-the-art accuracy and speed for generic object tracking (Danelljan et al., 2016).