Spectrogram Token Skip-Gram Pipeline
- The paper introduces the STSG pipeline that leverages unsupervised clustering and Word2Vec skip-gram modeling to transform spectrogram frames into compact static token embeddings for fast inference.
- It employs a systematic approach with audio preprocessing, PCA reduction, Faiss K-means tokenization, and mean-pooled embeddings to address bioacoustic classification under strict CPU constraints.
- The design balances extreme inference speed and low memory footprint against reduced classification accuracy compared to CNN-based models, highlighting key trade-offs for resource-constrained settings.
The Spectrogram Token Skip-Gram (STSG) pipeline is a lightweight, sequence-modeling approach to bioacoustic classification with a focus on extreme CPU inference speed. Developed for the BirdCLEF+ 2025 challenge, STSG translates continuous audio waveforms into compact, static token embeddings via unsupervised clustering and context modeling. This pipeline demonstrates that static codebook-based embeddings can achieve practical classification performance under strict resource budgets, although with notable trade-offs in accuracy compared to convolutional neural network (CNN) baselines (Miyaguchi et al., 11 Jul 2025).
1. Audio Preprocessing and Frame Extraction
The input to the STSG pipeline is raw audio, resampled to 32 kHz to standardize temporal resolution and frequency content. A Mel-spectrogram is computed with a 0.25 s frame length (8 000 samples) and a 50% overlap (4 000 sample hop length), producing exactly 8 frames per second. Each frame contains 768 Mel bands, capturing fine spectral granularity. The v2 pipeline introduces normalization (e.g., L2-norm across bands) to reduce level dependence. Principal Component Analysis (PCA) is performed on all normalized frames within the training set, retaining 128 principal components (≈87% of variance) to denoise and accelerate downstream clustering.
| Parameter | Unit | Value |
|---|---|---|
| Sample rate | Hz | 32 000 |
| Frame size | samples (seconds) | 8 000 (0.25 s) |
| Hop size | samples (seconds) | 4 000 (0.125 s) |
| Mel bands | Number | 768 |
| PCA dimensions | Number | 128 |
This configuration yields discrete, denoised spectrogram frames suitable for quantization.
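The preprocessing stages above can be sketched in numpy. This is a simplified illustration, not the paper's implementation: the 768-band Mel filterbank is stood in for by raw magnitude spectra, and PCA is done via SVD rather than a fitted library transformer. The frame, hop, and PCA sizes follow the table above.

```python
import numpy as np

SR = 32_000        # sample rate (Hz)
FRAME = 8_000      # 0.25 s frames
HOP = 4_000        # 50% overlap -> 8 frames per second
N_PCA = 128        # retained principal components

def frame_audio(wave: np.ndarray) -> np.ndarray:
    """Slice a waveform into overlapping 0.25 s frames (8 per second)."""
    n = 1 + (len(wave) - FRAME) // HOP
    idx = np.arange(FRAME)[None, :] + HOP * np.arange(n)[:, None]
    return wave[idx]

def spectral_frames(wave: np.ndarray) -> np.ndarray:
    """Per-frame magnitude spectra (a stand-in for the 768-band Mel step),
    L2-normalized across bands as in the v2 pipeline."""
    spec = np.abs(np.fft.rfft(frame_audio(wave), axis=1))
    return spec / (np.linalg.norm(spec, axis=1, keepdims=True) + 1e-9)

def pca_reduce(frames: np.ndarray, k: int = N_PCA):
    """PCA via SVD on mean-centered frames; returns (projected, components)."""
    mu = frames.mean(axis=0)
    _, _, vt = np.linalg.svd(frames - mu, full_matrices=False)
    comps = vt[:k]
    return (frames - mu) @ comps.T, comps

# 30 s of synthetic audio stands in for a training recording.
wave = np.random.default_rng(0).standard_normal(SR * 30)
frames = spectral_frames(wave)          # (239, 4001) spectral frames
reduced, _ = pca_reduce(frames)         # (239, 128) PCA-reduced frames
```

In the real pipeline, PCA is fit once over all ≈3.7 million training frames and then applied to every recording, so the reduced frames feed directly into the clustering step.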
2. Tokenization via Faiss K-means Clustering
All PCA-reduced spectrogram frames from the training set (approximately 3.7 million) are aggregated. Faiss K-means clustering (L2 distance) is applied to learn the centroids that serve as the quantization codebook, sized so that each token fits a compact 2-byte representation. Optional Approximate Nearest Neighbor (ANN) search (HNSW) accelerates assignment during clustering and inference. At embedding or inference time, each spectrogram frame is quantized:
$$t_i = \operatorname*{arg\,min}_{k} \, \lVert x_i - c_k \rVert_2$$

where $x_i$ is the PCA-reduced frame and $c_k$ denotes centroid $k$. Within any 5-second interval, the pipeline extracts $8 \times 5 = 40$ tokens.
3. Unsupervised Token Embedding with Word2Vec Skip-Gram
The sequence of discrete frame tokens for each recording is treated as an integer-valued "sentence." Contextual representations are learned using gensim's Word2Vec skip-gram with negative sampling (SGNS), parameterized as follows: vector size ($1024$), context window ($80$ tokens, i.e., 10 s on each side at 8 tokens/s), a fixed number of negative samples per positive pair, a uniform negative sampling distribution (ns_exponent $= 0$), frequent-token subsampling, and minimum token frequency ($1$, no pruning).
The embedding objective is:
$$\mathcal{L} = \sum_{t=1}^{T} \sum_{\substack{-c \le j \le c \\ j \ne 0}} \left[ \log \sigma\!\left({v'_{w_{t+j}}}^{\!\top} v_{w_t}\right) + \sum_{i=1}^{n} \mathbb{E}_{w_i \sim P_n(w)} \log \sigma\!\left(-{v'_{w_i}}^{\!\top} v_{w_t}\right) \right]$$

where $v$ and $v'$ are input and output token embeddings, $P_n(w)$ is the negative sampling distribution, $c$ is the context window size, and $n$ the number of negative samples.
4. Feature Aggregation and Downstream Classification
At inference time, for each 5 s interval, the 40 contiguous tokens are mapped through the learned embedding matrix, each yielding a $1024$-dimensional vector. These vectors are averaged (mean pooling), producing a single feature vector for downstream classification. The classifier head is a two-layer MLP (Linear, ReLU, Linear), trained with CrossEntropyLoss over a surrogate multi-class task. No further temporal modeling, masking, or attention mechanisms are incorporated; the pipeline relies solely on static lookup, mean aggregation, and shallow classification.
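The full inference path per interval is a lookup, a mean, and two matrix products, which is why it runs so fast on CPU. Below is a numpy sketch under stated assumptions: the 16 K vocabulary is taken as 16 384 entries, and the hidden width (256) and class count (200) are illustrative placeholders, not values from the paper.

```python
import numpy as np

def classify_interval(tokens, emb, w1, b1, w2, b2):
    """Mean-pool the 40 token embeddings of a 5 s interval, then apply a
    two-layer MLP head (Linear -> ReLU -> Linear) to get class logits."""
    pooled = emb[tokens].mean(axis=0)        # (1024,) static lookup + mean
    h = np.maximum(w1 @ pooled + b1, 0.0)    # ReLU
    return w2 @ h + b2                       # class logits

rng = np.random.default_rng(0)
n_tokens, dim, hidden, classes = 40, 1024, 256, 200
vocab = 16_384                               # assumed size of the 16 K codebook
emb = rng.standard_normal((vocab, dim)).astype(np.float32)
w1 = (rng.standard_normal((hidden, dim)) * 0.01).astype(np.float32)
b1 = np.zeros(hidden, dtype=np.float32)
w2 = (rng.standard_normal((classes, hidden)) * 0.01).astype(np.float32)
b2 = np.zeros(classes, dtype=np.float32)

tokens = rng.integers(0, vocab, size=n_tokens)   # one 5 s interval
logits = classify_interval(tokens, emb, w1, b1, w2, b2)
```

Note that the only model state needed at inference is the codebook, the embedding matrix, and the two MLP layers, which is the small-memory-footprint property discussed below.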
5. Performance Benchmarking and Comparative Evaluation
The pipeline was validated in the BirdCLEF+ 2025 challenge, which restricts inference to 90 minutes CPU-only for 700 minutes of test data. STSG (v2.1) achieves an inference time of ~0.5 s per one-minute file (total ~6 minutes for 700 files), outperforming baseline models in speed:
| Model | CPU Inference / File | Leaderboard ROC-AUC (public/private) |
|---|---|---|
| STSG (v2.1) | 0.5 s (~6 min total) | 0.559 / 0.520 |
| Perch-TFLite | 1.4 s (~16 min total) | 0.729 / 0.711 |
| BirdSetEffNetB1 | 2.21 s (~26 min total) | 0.810 / 0.778 |
While STSG is approximately three times faster than the fastest baseline (Perch-TFLite), it incurs a significant drop in ROC-AUC (~0.15–0.25), largely attributable to information loss from quantization and static embeddings.
6. Design Trade-offs and Limitations
Advantages of the STSG pipeline include extreme inference speed (relying on matrix lookup, nearest-neighbor search, and mean pooling), a small memory footprint (just the codebook and embedding matrices), and full CPU compatibility without dependencies on large-batch GPU computation. Limitations are pronounced: classification accuracy is notably lower than that of state-of-the-art CNN backbones (AUC 0.52 vs. 0.71–0.78), static token embeddings cannot capture fine temporal structure or complex co-occurrence statistics, and quantization plus averaging inherently discard locational and amplitude details.
7. Future Directions in Spectrogram Tokenization
The pipeline suggests extensions and improvements in multiple directions: replacing the K-means tokenization with neural codecs (e.g., EnCodec) or Transformer/CNN-based quantizers as in wav2vec 2.0, utilizing shallow Transformers for token sequence modeling now that the vocabulary (16 K tokens) is tractable, refining clustering via methods such as HDBSCAN or information-theoretic codebooks, and hybridizing STSG features with small CNN or 1D-CNN architectures to recover temporal sequential order. A plausible implication is that combining static token embeddings with shallow neural sequence models may offer a more favorable trade-off between inference speed and classification performance in future bioacoustic tasks (Miyaguchi et al., 11 Jul 2025).