
DSST: Efficient Scale-Adaptive Object Tracking

Updated 14 February 2026
  • The paper introduces DSST, which decouples translation and scale estimation using separate correlation filters to enhance tracking efficiency and robustness.
  • DSST employs a dedicated 1D scale filter updated in the Fourier domain, significantly reducing computational complexity compared to joint filtering methods.
  • Empirical evaluations on benchmarks like OTB-50 and VOT 2014 demonstrate DSST’s superior accuracy, real-time speed, and effective handling of scale variations.

The Discriminative Scale Space Tracker (DSST) is a scale-adaptive visual object tracking method within the tracking-by-detection paradigm. DSST achieves robust and accurate scale estimation by learning separate discriminative correlation filters for translation and scale, providing a computationally efficient solution compared to exhaustive joint search or full 3D filtering. The DSST framework explicitly models scale change using a dedicated 1D correlation filter, updating both translation and scale filters online via exponential averaging in the Fourier domain. DSST attains state-of-the-art tracking performance and real-time speeds, with superior robustness across standardized benchmarks (Danelljan et al., 2016).

1. Translation Filter: Formulation and Solution

The spatial translation filter in DSST is formulated as a multi-channel ridge regression/correlation filter. Given a set of feature patches $x_i \in \mathbb{R}^{M \times N \times d}$ (e.g., HOG and grayscale) centered on the target in frame $i$, and desired outputs $y_i \in \mathbb{R}^{M \times N}$ (typically Gaussian-shaped), the filter $h = (h^1, \ldots, h^d)$ is found as:

$$\min_h \sum_{i} \lVert h \star x_i - y_i \rVert_2^2 + \lambda \sum_{l=1}^d \lVert h^l \rVert_2^2$$

where $\star$ denotes circular correlation and $\lambda > 0$ is a regularization parameter. By exploiting the circulant structure and Parseval’s theorem, the optimization decouples in the Fourier domain into $M \times N$ independent $d \times d$ systems. For online tracking, exponentially weighted averaging of the auto- and cross-spectra is employed:

$$A_t^l = (1-\eta)\, A_{t-1}^l + \eta\, \overline{G} \odot X_t^l$$

$$B_t = (1-\eta)\, B_{t-1} + \eta \sum_{k=1}^{d} \overline{X_t^k} \odot X_t^k$$

where $X_t^l = \text{FFT}\{x_t^l\}$, $G = \text{FFT}\{y\}$, and $\odot$ denotes pointwise multiplication. The filter is updated as:

$$H_t^l = \frac{A_t^l}{B_t + \lambda}$$

Translation in a new frame is determined by extracting a patch at the previous target position, computing its FFT, and correlating with the learned filter. The new target location is the argmax of the inverse FFT of the response (Danelljan et al., 2016).
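The translation filter update and detection above can be sketched in a few lines of NumPy. This is an illustrative reimplementation of the stated formulas, not the authors' code; the array shapes, the Gaussian width `sigma`, and the function names are choices made for the sketch:

```python
import numpy as np

def gaussian_response(M, N, sigma=2.0):
    """Desired correlation output: FFT of a Gaussian peak centered on the patch."""
    ys, xs = np.mgrid[0:M, 0:N]
    g = np.exp(-((ys - M // 2) ** 2 + (xs - N // 2) ** 2) / (2 * sigma ** 2))
    return np.fft.fft2(np.fft.ifftshift(g))  # shift so the peak maps to index (0, 0)

def update_filter(x, A, B, G, eta=0.025, first=False):
    """Exponentially averaged numerator/denominator spectra A_t, B_t.

    x : (M, N, d) feature patch, A : (M, N, d), B : (M, N).
    """
    X = np.fft.fft2(x, axes=(0, 1))                 # per-channel 2D FFT
    A_new = np.conj(G)[..., None] * X               # cross-spectrum  G̅ ⊙ X
    B_new = np.sum(np.conj(X) * X, axis=2).real     # auto-spectrum  Σ_k X̅ᵏ ⊙ Xᵏ
    if first:
        return A_new, B_new
    return (1 - eta) * A + eta * A_new, (1 - eta) * B + eta * B_new

def detect(z, A, B, lam=1e-2):
    """Correlate a test patch z with the filter; return the response peak.

    (0, 0) means zero displacement; larger indices wrap around (circular shift).
    """
    Z = np.fft.fft2(z, axes=(0, 1))
    Y = np.sum(np.conj(A) * Z, axis=2) / (B + lam)  # Σ_l A̅ˡ ⊙ Zˡ / (B + λ)
    y = np.fft.ifft2(Y).real
    return np.unravel_index(np.argmax(y), y.shape)
```

By construction, correlating the filter with its own training patch yields a peak at zero displacement, which is a quick sanity check for the sign and conjugation conventions.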

2. Scale Filter: One-Dimensional Learning and Estimation

For scale estimation, DSST learns a separate one-dimensional filter $f$. At each frame $t$, $S$ patches $I_n$ of varying sizes $(a^n P, a^n R)$ are extracted, where $a \approx 1.02$ is the scale step, $P \times R$ is the current target size, and $n \in \left\{-\lfloor(S-1)/2\rfloor, \ldots, \lfloor(S-1)/2\rfloor\right\}$. Each patch is resized to a fixed template size and represented by a $d$-dimensional feature vector $s_n$. The scale filter minimizes:

$$\min_f \sum_n \lVert f \star s_n - g_n \rVert_2^2 + \mu \lVert f \rVert_2^2$$

where $g_n$ is a Gaussian response peaking at $n = 0$. In the frequency domain, the closed-form solution is:

$$F(\omega) = \frac{\overline{G}(\omega) \sum_n S_n(\omega)}{\sum_n \lvert S_n(\omega) \rvert^2 + \mu}$$

Updates follow the same exponential averaging as for the translation filter. Scale detection is performed by applying $f$ to the $S$ feature vectors and selecting the scale index with the maximum output score (Danelljan et al., 2016).
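A minimal NumPy sketch of the 1D scale filter follows. The `(S, d)` descriptor matrix, the Gaussian width `sigma`, and the helper names are illustrative assumptions; in the actual tracker each row would be the feature descriptor of a resized patch:

```python
import numpy as np

def train_scale_filter(s, sigma=1.0):
    """Closed-form 1D scale filter spectra from one (S, d) sample matrix s.

    Row s[n] is the d-dim descriptor of the patch at relative scale a^n.
    """
    S, d = s.shape
    n = np.arange(S) - (S - 1) // 2
    g = np.exp(-n ** 2 / (2 * sigma ** 2))           # Gaussian peaking at n = 0
    G = np.fft.fft(np.fft.ifftshift(g))              # peak mapped to index 0
    Sf = np.fft.fft(s, axis=0)                       # FFT along the scale axis
    num = np.conj(G)[:, None] * Sf                   # numerator  G̅ ⊙ S, per channel
    den = np.sum(np.conj(Sf) * Sf, axis=1).real      # denominator Σ_k |Sᵏ|²
    return num, den

def detect_scale(z, num, den, mu=1e-2):
    """Apply the scale filter to a new (S, d) sample; return the best index n*."""
    Zf = np.fft.fft(z, axis=0)
    y = np.fft.ifft(np.sum(np.conj(num) * Zf, axis=1) / (den + mu)).real
    k = int(np.argmax(y))
    S = z.shape[0]
    return k if k <= (S - 1) // 2 else k - S         # map FFT index back to signed n
```

Because correlation with a circularly shifted copy of the training sample moves the response peak by the same shift, a sample whose rows are rolled by $m$ positions yields $n^* = m$, mirroring how a scale change moves the peak along the scale axis.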

3. Decoupled Translation and Scale Estimation

A primary innovation of DSST is the explicit separation of translation and scale estimation, as opposed to exhaustive or joint approaches. The sequence for each frame is as follows:

  • Translation estimation: Apply the spatial filter $h$ to a patch extracted at the previous position and scale $(p_{t-1}, s_{t-1})$ to obtain the new position $p_t$ via the maximum response.
  • Scale estimation: Around $p_t$, extract scale-sampled patches $\{s_n\}$ and apply the scale filter $f$ to obtain the new scale $s_t$.
  • Model update: At $(p_t, s_t)$, extract new patches for translation and scale, updating $A$, $B$ and their scale counterparts with learning rate $\eta$ (Danelljan et al., 2016).

This decoupling avoids the high cost of full 3D or multi-resolution search and allows DSST to operate efficiently while preserving accuracy.

4. Frame-by-Frame Algorithm and Implementation

The framewise operations executed by DSST can be summarized:

  1. Translation
    • Extract a patch around $(p_{t-1}, s_{t-1})$ and compute $Z = \text{FFT}\{z_t^x\}$
    • Compute $Y = \sum_{l} \overline{A_{t-1}^l} \odot Z^l / (B_{t-1} + \lambda)$ and the spatial score $y = \text{IFFT}\{Y\}$
    • Set $p_t = \operatorname{arg\,max}(y)$
  2. Scale
    • For each $n$, extract $s_{t,n}$ at scale $a^n s_{t-1}$ and compute its spectrum $S_n$
    • Compute $Y^s = \overline{G^s} \odot \sum_n S_n / (B_{t-1}^s + \mu)$ and $y^s = \text{IFFT}\{Y^s\}$
    • Set $n^* = \operatorname{arg\,max}(y^s)$ and $s_t = a^{n^*} s_{t-1}$
  3. Model Update
    • Update $A, B$ and the scale arrays $A^s, B^s$ at $(p_t, s_t)$ with newly extracted data (Danelljan et al., 2016).
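The per-frame procedure above can be sketched as a single function. The `extract_trans`/`extract_scale` feature callbacks, the signed-shift decoding, and the model shapes are illustrative assumptions rather than the paper's implementation, and the model-update step (3) is omitted to keep the sketch short:

```python
import numpy as np

def dsst_step(p, s, extract_trans, extract_scale, trans_model, scale_model,
              lam=1e-2, mu=1e-2, a=1.02, S=17):
    """One DSST frame: translation first, then scale, with decoupled filters.

    extract_trans(p, s) -> (M, N, d) patch; extract_scale(p, s, n) -> (d,) descriptor.
    trans_model = (A, B) and scale_model = (num, den) are the learned spectra.
    """
    A, B = trans_model
    # 1. Translation: correlate the patch at the old position/scale.
    z = extract_trans(p, s)
    Z = np.fft.fft2(z, axes=(0, 1))
    y = np.fft.ifft2(np.sum(np.conj(A) * Z, axis=2) / (B + lam)).real
    dy, dx = np.unravel_index(np.argmax(y), y.shape)
    M, N = y.shape
    dy = dy if dy <= M // 2 else dy - M        # decode wrapped FFT indices
    dx = dx if dx <= N // 2 else dx - N        # as signed pixel shifts
    p = (p[0] + dy * s, p[1] + dx * s)         # displacement scaled by current scale
    # 2. Scale: build the S-level sample around the new position.
    num, den = scale_model
    zs = np.stack([extract_scale(p, s, n) for n in range(-(S // 2), S // 2 + 1)])
    ys = np.fft.ifft(np.sum(np.conj(num) * np.fft.fft(zs, axis=0), axis=1)
                     / (den + mu)).real
    k = int(np.argmax(ys))
    n_star = k if k <= S // 2 else k - S
    return p, s * a ** n_star
```

With callbacks that always return the training data, the step leaves position and scale unchanged, which confirms the two argmax decodings agree with the training conventions.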

5. Computational Complexity

DSST achieves substantial computational efficiency compared to joint or exhaustive multi-scale search methods. The dominant operations are:

  • Translation filter: $O(dMN \log(MN))$ per update/detection, for $d$ feature channels and patch size $M \times N$.
  • Scale filter: $O(dS \log S)$ for $S$ scale levels and feature dimension $d$.
  • Comparative costs:
    • Multi-resolution search applies the translation filter at $T$ scales: $O(TdMN \log(MN))$
    • Full 3D correlation filter: $O(MNS \log(MNS))$
    • DSST, with separate translation and 1D scale filtering, is approximately $T$ times faster than multi-resolution search and roughly $S$ times faster than 3D filtering for typical values of $T$ and $S$ (e.g., $S \approx 17$) (Danelljan et al., 2016).
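As a back-of-the-envelope check of these ratios, one can compare FFT-dominated operation counts per feature channel (constants dropped). The concrete values of $M$, $N$, $T$ below are illustrative choices, not taken from the paper; $S = 17$ matches the stated scale count:

```python
import math

# Illustrative sizes: M x N translation patch, T multi-resolution scales,
# S scale-filter levels.
M, N, S, T = 48, 48, 17, 17

mn = M * N
dsst = mn * math.log(mn) + S * math.log(S)   # 2D translation FFT + tiny 1D scale FFT
multi = T * mn * math.log(mn)                # translation filter run at T scales
full3d = mn * S * math.log(mn * S)           # joint 3D correlation filter

# Ratios come out near T (~17x) for multi-resolution and somewhat above S for 3D.
print(f"multi-resolution / DSST ~ {multi / dsst:.0f}x")
print(f"full 3D filter   / DSST ~ {full3d / dsst:.0f}x")
```

The 1D scale term is negligible next to the 2D translation FFT, which is why adding scale adaptivity costs DSST almost nothing over a translation-only tracker.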

6. Experimental Results and Benchmarks

Key empirical findings from DSST and its compressed/faster variant fDSST:

| Tracker (Configuration) | Overlap Precision (%) | Distance Precision (%) | Speed (FPS) |
|---|---|---|---|
| Baseline DCF (translation only) | 57.7 | 70.8 | 57.3 |
| Multi-resolution DCF (5 scales) | 65.2 | 74.8 | 16.9 |
| Joint 3D DCF (33 scales) | 63.2 | 72.1 | 1.46 |
| Iterative joint DCF | 64.1 | 74.2 | 1.01 |
| DSST (translation + scale, 33 scales) | 67.7 | 75.7 | 25.4 |
| fDSST (PCA + FFT interpolation) | 74.3 | 80.2 | 54.3 |

On the OTB-50 benchmark, DSST achieves a 6.6% AUC improvement over the baseline, and fDSST outperforms the previous best method (SAMF) by 2.6%. fDSST runs in real time at 54 FPS and is robust to initialization perturbations, consistently outperforming KCF, SAMF, Struck, and similar trackers under both the TRE and SRE protocols.

In the VOT 2014 challenge (25 videos), fDSST achieved the best composite rank (accuracy and robustness) among 38 trackers, attaining an average overlap of $\approx 0.52$ and the fewest failures per sequence ($\approx 1.1$). Attribute-based analysis showed fDSST winning in 7 of 11 categories, notably excelling under scale variation, fast motion, and background clutter (Danelljan et al., 2016).

7. Notation Summary and Key Formulas

Key notation and formulae:

  • $h^l, x^l$: $l$-th feature channel of the translation filter and input patch
  • $A_t^l, B_t$: numerator and denominator spectra of the translation filter at time $t$
  • $H_t^l = \text{FFT}\{h_t^l\}$, $X_t^l = \text{FFT}\{x_t^l\}$, $G = \text{FFT}\{\text{desired response}\}$
  • Numerator update: $A_t^l = (1-\eta) A_{t-1}^l + \eta\, \overline{G} \odot X_t^l$
  • Denominator update: $B_t = (1-\eta) B_{t-1} + \eta \sum_k \overline{X_t^k} \odot X_t^k$
  • Filter: $H_t^l = A_t^l / (B_t + \lambda)$; detection: $Y_t = \sum_l \overline{A_{t-1}^l} \odot Z_t^l / (B_{t-1} + \lambda)$
  • Scale filter $f$, samples $s_n$, desired responses $g_n$: updates and spectra identical in form, with regularization parameter $\mu$

DSST’s primary methodological contribution is the decoupled scale-adaptive correlation filter framework, providing state-of-the-art accuracy and speed for generic object tracking (Danelljan et al., 2016).

