DSST: Efficient Scale-Adaptive Object Tracking
- The paper introduces DSST, which decouples translation and scale estimation using separate correlation filters to enhance tracking efficiency and robustness.
- DSST employs a dedicated 1D scale filter updated in the Fourier domain, significantly reducing computational complexity compared to joint filtering methods.
- Empirical evaluations on benchmarks like OTB-50 and VOT 2014 demonstrate DSST’s superior accuracy, real-time speed, and effective handling of scale variations.
The Discriminative Scale Space Tracker (DSST) is a scale-adaptive visual object tracking method within the tracking-by-detection paradigm. DSST achieves robust and accurate scale estimation by learning separate discriminative correlation filters for translation and scale, providing a computationally efficient solution compared to exhaustive joint search or full 3D filtering. The DSST framework explicitly models scale change using a dedicated 1D correlation filter, updating both translation and scale filters online via exponential averaging in the Fourier domain. DSST attains state-of-the-art tracking performance and real-time speeds, with superior robustness across standardized benchmarks (Danelljan et al., 2016).
1. Translation Filter: Formulation and Solution
The spatial translation filter in DSST is formulated as a multi-channel ridge regression/correlation filter. Given a $d$-channel feature patch with channels $f^1, \dots, f^d$ (e.g., HOG and grayscale) centered around the target in frame $t$, and a desired output $g$ (typically Gaussian-shaped), the filter $h$ is found by minimizing:

$$\varepsilon = \Big\| \sum_{l=1}^{d} h^l \star f^l - g \Big\|^2 + \lambda \sum_{l=1}^{d} \big\| h^l \big\|^2,$$
where $\star$ denotes circular correlation and $\lambda \geq 0$ is a regularization parameter. By exploiting the circulant structure and Parseval's theorem, the optimization decouples in the Fourier domain into $d$ independent pointwise systems with closed-form solution

$$H^l = \frac{\bar{G}\, F^l}{\sum_{k=1}^{d} \bar{F^k}\, F^k + \lambda}, \qquad l = 1, \dots, d,$$

where capital letters denote DFTs (e.g., $F = \mathcal{F}\{f\}$), the bar denotes complex conjugation, and all products are pointwise. For online tracking, exponentially windowed averaging of the cross- and auto-spectra is employed:

$$A_t^l = (1-\eta)\, A_{t-1}^l + \eta\, \bar{G}_t\, F_t^l, \qquad B_t = (1-\eta)\, B_{t-1} + \eta \sum_{k=1}^{d} \bar{F_t^k}\, F_t^k,$$

with learning rate $\eta$. The filter at time $t$ is then $H_t^l = A_t^l / (B_t + \lambda)$.
Translation in a new frame is determined by extracting a patch $z$ at the previous target position, computing its DFT $Z = \mathcal{F}\{z\}$, and correlating it with the learned filter:

$$y_t = \mathcal{F}^{-1}\Bigg\{ \frac{\sum_{l=1}^{d} \bar{A_{t-1}^l}\, Z_t^l}{B_{t-1} + \lambda} \Bigg\}.$$

The new target location is the argmax of $y_t$ (Danelljan et al., 2016).
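A minimal, single-channel sketch of this update/detection cycle (raw-pixel features and the parameter values `eta` and `lam` are illustrative assumptions; the paper uses multi-channel HOG features):

```python
# Single-channel sketch of the DSST translation filter: exponential
# averaging of numerator/denominator spectra, then FFT-domain detection.
import numpy as np

def make_gaussian_label(shape, sigma=2.0):
    """Desired output g: 2D Gaussian, rolled so its peak sits at shift (0, 0)."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    g = np.exp(-((ys - h // 2) ** 2 + (xs - w // 2) ** 2) / (2 * sigma ** 2))
    return np.roll(g, (-(h // 2), -(w // 2)), axis=(0, 1))

def update_filter(A, B, patch, G, eta=0.025):
    """A_t = (1-eta) A_{t-1} + eta conj(G) F;  B_t likewise with conj(F) F."""
    F = np.fft.fft2(patch)
    return ((1 - eta) * A + eta * np.conj(G) * F,
            (1 - eta) * B + eta * np.conj(F) * F)

def detect(A, B, patch, lam=1e-2):
    """Correlate the learned filter with a new patch; return the argmax shift."""
    Z = np.fft.fft2(patch)
    response = np.real(np.fft.ifft2(np.conj(A) * Z / (B + lam)))
    return np.unravel_index(np.argmax(response), response.shape)

rng = np.random.default_rng(0)
patch = rng.standard_normal((32, 32))
G = np.fft.fft2(make_gaussian_label(patch.shape))
A, B = update_filter(np.zeros((32, 32)), np.zeros((32, 32)), patch, G)

print(detect(A, B, patch))                                # training patch: shift (0, 0)
print(detect(A, B, np.roll(patch, (3, 5), axis=(0, 1)))) # shifted copy: shift (3, 5)
```

Note that the label `g` is rolled so that a perfectly aligned patch produces its response peak at shift $(0, 0)$, which makes the argmax directly interpretable as a displacement.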
2. Scale Filter: One-Dimensional Learning and Estimation
For scale estimation, DSST learns a separate one-dimensional filter $h_{\mathrm{scale}}$. At each frame $t$, $S$ patches of size $a^n P \times a^n R$ are extracted, where $n \in \{\lfloor -\tfrac{S-1}{2} \rfloor, \dots, \lfloor \tfrac{S-1}{2} \rfloor\}$, $a$ is the scale factor between adjacent levels, and $P \times R$ is the current target size. Each patch is resized to a fixed template size and represented by a $d'$-dimensional feature vector $f_{\mathrm{scale}}(n)$. The scale filter minimizes the same ridge-regression objective as the translation filter, now over the one-dimensional scale variable:

$$\varepsilon = \Big\| \sum_{l=1}^{d'} h_{\mathrm{scale}}^l \star f_{\mathrm{scale}}^l - g_{\mathrm{scale}} \Big\|^2 + \lambda \sum_{l=1}^{d'} \big\| h_{\mathrm{scale}}^l \big\|^2,$$

where $g_{\mathrm{scale}}$ is a 1D Gaussian response peaking at the current scale ($n = 0$). In the frequency domain, the closed-form solution has the same form as for translation:

$$H_{\mathrm{scale}}^l = \frac{\bar{G}_{\mathrm{scale}}\, F_{\mathrm{scale}}^l}{\sum_{k=1}^{d'} \bar{F_{\mathrm{scale}}^k}\, F_{\mathrm{scale}}^k + \lambda}.$$
Updates follow the same exponential windowing as for translation. Scale detection is accomplished by applying $h_{\mathrm{scale}}$ to the feature vectors of a new scale sample and selecting the scale index $n$ that maximizes the filter response (Danelljan et al., 2016).
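The scale-sample construction and 1D filtering can be sketched as follows (the synthetic image, raw-pixel "features", nearest-neighbor resizing, and parameter values are illustrative assumptions; the paper uses HOG features of each resized patch):

```python
# Sketch of DSST's 1D scale filter on a synthetic feature pyramid:
# one feature vector per scale level, FFTs taken along the scale axis.
import numpy as np

S, a, lam = 33, 1.02, 1e-2           # number of scales, scale step, regularizer
n = np.arange(S) - (S - 1) // 2      # scale exponents, centered on n = 0

def scale_sample(image, center, base_size, template=16):
    """Stack one feature vector per scale a**n into a (d', S) matrix."""
    cy, cx = center
    cols = []
    for e in n:
        half = max(2, int(round(base_size * a ** e)) // 2)
        patch = image[cy - half:cy + half, cx - half:cx + half]
        # crude nearest-neighbor resize to a fixed template edge length
        idx = np.linspace(0, patch.shape[0] - 1, template).astype(int)
        cols.append(patch[np.ix_(idx, idx)].ravel())
    return np.stack(cols, axis=1)            # shape (template**2, S)

# 1D Gaussian label over scale indices, peaked at the current scale (n = 0)
g = np.exp(-0.5 * (n / 1.5) ** 2)
G = np.fft.fft(g)

rng = np.random.default_rng(1)
image = rng.standard_normal((200, 200))
f = scale_sample(image, (100, 100), base_size=40)

F = np.fft.fft(f, axis=1)                    # per-channel 1D FFTs along scale
A = np.conj(G)[None, :] * F                  # numerator spectra
B = np.sum(np.conj(F) * F, axis=0)           # denominator spectrum

# Detection: correlate a new scale sample, take the argmax scale index
Z = np.fft.fft(scale_sample(image, (100, 100), base_size=40), axis=1)
resp = np.real(np.fft.ifft(np.sum(np.conj(A) * Z, axis=0) / (B + lam)))
print(n[np.argmax(resp)])                    # same position and size: n = 0
```

The key point the sketch illustrates is that all FFTs here are one-dimensional over the $S$ scale levels, with the $d'$ feature dimensions acting as channels.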
3. Decoupled Translation and Scale Estimation
A primary innovation of DSST is the explicit separation of translation and scale estimation, as opposed to exhaustive or joint approaches. The sequence for each frame is as follows:
- Translation estimation: Apply the spatial filter to a patch extracted at the previous position $p_{t-1}$ and scale $s_{t-1}$ to obtain the new position $p_t$ via the maximum response.
- Scale estimation: Around $p_t$, extract $S$ scale-sampled patches and apply the scale filter to obtain the new scale $s_t$.
- Model update: At $(p_t, s_t)$, extract new samples for translation and scale, updating $A_t^l$, $B_t$ and their scale counterparts with learning rate $\eta$ (Danelljan et al., 2016).
This decoupling avoids the high cost of full 3D or multi-resolution search and allows DSST to operate efficiently while preserving accuracy.
4. Frame-by-Frame Algorithm and Implementation
The framewise operations executed by DSST can be summarized:
- Translation
  - Extract a patch $z_{\mathrm{trans}}$ around $(p_{t-1}, s_{t-1})$ and compute $Z_{\mathrm{trans}} = \mathcal{F}\{z_{\mathrm{trans}}\}$
  - Compute the spatial score $y_{\mathrm{trans}} = \mathcal{F}^{-1}\big\{ \sum_l \bar{A_{t-1}^l}\, Z_{\mathrm{trans}}^l \,/\, (B_{t-1} + \lambda) \big\}$
  - Set $p_t = \arg\max y_{\mathrm{trans}}$
- Scale
  - Extract the scale sample $z_{\mathrm{scale}}$ at $p_t$ (one feature vector per scale level $n$) and compute $Z_{\mathrm{scale}} = \mathcal{F}\{z_{\mathrm{scale}}\}$
  - Compute the scale score $y_{\mathrm{scale}} = \mathcal{F}^{-1}\big\{ \sum_l \bar{A_{\mathrm{scale},t-1}^l}\, Z_{\mathrm{scale}}^l \,/\, (B_{\mathrm{scale},t-1} + \lambda) \big\}$
  - Set $n^\ast = \arg\max y_{\mathrm{scale}}$, $s_t = a^{n^\ast} s_{t-1}$
- Model Update
  - Update $A_t^l$, $B_t$ and the scale-filter spectra at $(p_t, s_t)$ with newly extracted samples (Danelljan et al., 2016).
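The ordering of these framewise operations can be demonstrated end to end on a toy, fully 1D problem: a Gaussian "target" drifting along a synthetic signal, with translation detection, scale detection, and model update in sequence. All sizes, the raw-sample features, and parameter values below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

L, S, a, lam, eta = 32, 11, 1.05, 1e-2, 0.05
ns = np.arange(S) - S // 2                     # relative scale exponents

def bump(x, pos, width):
    return np.exp(-0.5 * ((x - pos) / width) ** 2)

def window(signal, pos):
    idx = (np.arange(-L // 2, L // 2) + pos) % signal.size
    return signal[idx]                         # circular crop around pos

def scale_sample(signal, pos, width):
    rows = []
    for e in ns:                               # one resampled window per scale
        half = 2.0 * width * a ** e
        xs = np.linspace(pos - half, pos + half, L)
        rows.append(np.interp(xs, np.arange(signal.size), signal))
    return np.stack(rows, axis=1)              # (L channels, S scales)

def spectra(F, G):
    return np.conj(G) * F, np.conj(F) * F      # numerator, denominator

# labels: translation label rolled so its peak sits at shift 0,
# scale label peaking at ns = 0 (index S // 2)
G_t = np.fft.fft(np.roll(bump(np.arange(L), L // 2, 2.0), -L // 2))
G_s = np.fft.fft(bump(ns, 0, 1.0))

x = np.arange(256)
pos, width = 100.0, 4.0                        # ground truth
est_pos, est_w = 100, 4.0                      # tracker state

frame = bump(x, pos, width)
A_t, B_t = spectra(np.fft.fft(window(frame, est_pos)), G_t)
Fs = np.fft.fft(scale_sample(frame, est_pos, est_w), axis=1)
A_s = np.conj(G_s)[None, :] * Fs
B_s = np.sum(np.conj(Fs) * Fs, axis=0)

for _ in range(3):
    pos += 3                                   # target drifts right
    frame = bump(x, pos, width)
    # 1) translation estimation at the OLD position/scale
    Z = np.fft.fft(window(frame, est_pos))
    r = np.real(np.fft.ifft(np.conj(A_t) * Z / (B_t + lam)))
    k = int(np.argmax(r))
    est_pos += k if k < L // 2 else k - L      # map argmax to signed shift
    # 2) scale estimation at the NEW position
    Zs = np.fft.fft(scale_sample(frame, est_pos, est_w), axis=1)
    rs = np.real(np.fft.ifft(np.sum(np.conj(A_s) * Zs, axis=0) / (B_s + lam)))
    est_w *= a ** ns[int(np.argmax(rs))]
    # 3) model update at the new state (exponential averaging)
    An, Bn = spectra(np.fft.fft(window(frame, est_pos)), G_t)
    A_t, B_t = (1 - eta) * A_t + eta * An, (1 - eta) * B_t + eta * Bn
    Fs = np.fft.fft(scale_sample(frame, est_pos, est_w), axis=1)
    A_s = (1 - eta) * A_s + eta * np.conj(G_s)[None, :] * Fs
    B_s = (1 - eta) * B_s + eta * np.sum(np.conj(Fs) * Fs, axis=0)

print(est_pos, est_w)                          # tracker follows the drifting target
```

Because the target keeps a constant width here, the scale filter's argmax stays at $n = 0$ on every frame while the translation filter absorbs the drift.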
5. Computational Complexity
DSST achieves substantial computational efficiency compared to joint or exhaustive multi-scale search methods. The dominant operations are:
- Translation filter: $O(dMN \log(MN))$ per update/detection, for $d$ feature channels and an $M \times N$ patch.
- Scale filter: $O(d'S \log S)$ for $S$ scale levels and feature dimension $d'$.
- Comparative costs:
- Multi-resolution search applies the translation filter at $S$ scales: $O(SdMN \log(MN))$
- Full 3D correlation filter: $O(dMNS \log(MNS))$
- DSST, with separate translation and 1D scale filtering, is roughly $S$ times faster than multi-resolution search and more than $S$ times faster than full 3D filtering for reasonable values of $S$ and the patch size (e.g., $S = 33$) (Danelljan et al., 2016).
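A back-of-envelope comparison of these cost terms, using illustrative (assumed) sizes for the patch, channel counts, scale levels, and scale-feature dimension:

```python
# Plug assumed sizes into the complexity terms above and compare.
# All concrete numbers here are illustrative, not the paper's accounting.
import math

M = N = 48            # patch size (assumed, in HOG cells)
d = 28                # translation feature channels (assumed)
S = 33                # scale levels
d_scale = 496         # scale feature dimension d' (assumed)

translation = d * M * N * math.log2(M * N)     # 2D FFTs over d channels
scale = d_scale * S * math.log2(S)             # 1D FFTs over d' channels
dsst = translation + scale
multi_res = S * translation                    # translation filter at S scales
joint_3d = d * M * N * S * math.log2(M * N * S)  # one 3D FFT volume

print(f"DSST vs multi-resolution: {multi_res / dsst:.1f}x")
print(f"DSST vs joint 3D filtering: {joint_3d / dsst:.1f}x")
```

With these sizes the 1D scale filter adds only a small overhead to the translation filter, so the speedup over multi-resolution search lands near the factor $S$, and the 3D-filter ratio is larger still because of the $\log(MNS)$ term.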
6. Experimental Results and Benchmarks
Key empirical findings from DSST and its compressed/faster variant fDSST:
| Tracker (Configuration) | Overlap Precision (%) | Distance Precision (%) | Speed (FPS) |
|---|---|---|---|
| Baseline DCF (translation only) | 57.7 | 70.8 | 57.3 |
| Multi-resolution DCF (5 scales) | 65.2 | 74.8 | 16.9 |
| Joint 3D DCF (33 scales) | 63.2 | 72.1 | 1.46 |
| Iterative Joint DCF | 64.1 | 74.2 | 1.01 |
| DSST (translation+scale, 33 scales) | 67.7 | 75.7 | 25.4 |
| fDSST (PCA+FFT interp) | 74.3 | 80.2 | 54.3 |
On the OTB-50 benchmark, DSST achieves a 6.6% AUC improvement over the baseline, and fDSST outperforms the previous best (SAMF) by 2.6%. fDSST runs in real time at 54 FPS and remains robust under both temporal (TRE) and spatial (SRE) initialization perturbations, consistently outperforming KCF, SAMF, Struck, and similar trackers.
In the VOT 2014 challenge (25 videos), fDSST achieved the best composite rank (accuracy and robustness) among 38 trackers, attaining the top average overlap and the fewest failures per sequence. Attribute-based analysis showed fDSST winning in 7 out of 11 categories, notably excelling in scale variation, fast motion, and background clutter (Danelljan et al., 2016).
7. Notation Summary and Key Formulas
Key notation and formulae:
- $h^l, f^l$: $l$-th feature channel of the translation filter and input patch
- $A_t^l, B_t$: numerator and denominator spectra for the translation filter at time $t$
- $\lambda$: regularization parameter, $\eta$: learning rate, $\bar{(\cdot)}$: complex conjugation
- Filter: $H_t^l = A_t^l / (B_t + \lambda)$; Detection: $y_t = \mathcal{F}^{-1}\big\{ \sum_{l=1}^{d} \bar{A_{t-1}^l}\, Z_t^l \,/\, (B_{t-1} + \lambda) \big\}$
- Scale filter $h_{\mathrm{scale}}$, samples $f_{\mathrm{scale}}$, desired response $g_{\mathrm{scale}}$: update and spectrum calculations identical in form, with learning rate $\eta$
DSST’s primary methodological contribution is the decoupled scale-adaptive correlation filter framework, providing state-of-the-art accuracy and speed for generic object tracking (Danelljan et al., 2016).