Spectrogram Token Skip-Gram Pipeline
- The paper introduces the STSG pipeline that leverages unsupervised clustering and Word2Vec skip-gram modeling to transform spectrogram frames into compact static token embeddings for fast inference.
- It employs a systematic approach with audio preprocessing, PCA reduction, Faiss K-means tokenization, and mean-pooled embeddings to address bioacoustic classification under strict CPU constraints.
- The design balances extreme inference speed and low memory footprint against reduced classification accuracy compared to CNN-based models, highlighting key trade-offs for resource-constrained settings.
The Spectrogram Token Skip-Gram (STSG) pipeline is a lightweight, sequence-modeling approach to bioacoustic classification with a focus on extreme CPU inference speed. Developed for the BirdCLEF+ 2025 challenge, STSG translates continuous audio waveforms into compact, static token embeddings via unsupervised clustering and context modeling. This pipeline demonstrates that static codebook-based embeddings can achieve practical classification performance under strict resource budgets, although with notable trade-offs in accuracy compared to convolutional neural network (CNN) baselines (Miyaguchi et al., 11 Jul 2025).
1. Audio Preprocessing and Frame Extraction
The input to the STSG pipeline is raw audio, resampled to 32 kHz to standardize temporal resolution and frequency content. A Mel-spectrogram is computed with a 0.25 s frame length (8 000 samples) and a 50% overlap (4 000 sample hop length), producing exactly 8 frames per second. Each frame contains 768 Mel bands, capturing fine spectral granularity. The v2 pipeline introduces normalization (e.g., L2-norm across bands) to reduce level dependence. Principal Component Analysis (PCA) is performed on all normalized frames within the training set, retaining 128 principal components (≈87% of variance) to denoise and accelerate downstream clustering.
| Parameter | Unit | Value |
|---|---|---|
| Sample rate | Hz | 32 000 |
| Frame size | samples (seconds) | 8 000 (0.25 s) |
| Hop size | samples (seconds) | 4 000 (0.125 s) |
| Mel bands | Number | 768 |
| PCA dimensions | Number | 128 |
This configuration yields discrete, denoised spectrogram frames suitable for quantization.
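The preprocessing stages above can be sketched in numpy. This is a simplified illustration, not the paper's implementation: the 768-band Mel filterbank is stood in for by raw magnitude spectra, and PCA is done via SVD rather than a fitted library transformer. The frame, hop, and PCA sizes follow the table above.

```python
import numpy as np

SR = 32_000        # sample rate (Hz)
FRAME = 8_000      # 0.25 s frames
HOP = 4_000        # 50% overlap -> 8 frames per second
N_PCA = 128        # retained principal components

def frame_audio(wave: np.ndarray) -> np.ndarray:
    """Slice a waveform into overlapping 0.25 s frames (8 per second)."""
    n = 1 + (len(wave) - FRAME) // HOP
    idx = np.arange(FRAME)[None, :] + HOP * np.arange(n)[:, None]
    return wave[idx]

def spectral_frames(wave: np.ndarray) -> np.ndarray:
    """Per-frame magnitude spectra (a stand-in for the 768-band Mel step),
    L2-normalized across bands as in the v2 pipeline."""
    spec = np.abs(np.fft.rfft(frame_audio(wave), axis=1))
    return spec / (np.linalg.norm(spec, axis=1, keepdims=True) + 1e-9)

def pca_reduce(frames: np.ndarray, k: int = N_PCA):
    """PCA via SVD on mean-centered frames; returns (projected, components)."""
    mu = frames.mean(axis=0)
    _, _, vt = np.linalg.svd(frames - mu, full_matrices=False)
    comps = vt[:k]
    return (frames - mu) @ comps.T, comps

# 30 s of synthetic audio stands in for a training recording.
wave = np.random.default_rng(0).standard_normal(SR * 30)
frames = spectral_frames(wave)          # (239, 4001) spectral frames
reduced, _ = pca_reduce(frames)         # (239, 128) PCA-reduced frames
```

In the real pipeline, PCA is fit once over all ≈3.7 million training frames and then applied to every recording, so the reduced frames feed directly into the clustering step.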
2. Tokenization via Faiss K-means Clustering
All PCA-reduced spectrogram frames from the training set (approximately 3.7 million) are aggregated. Faiss K-means clustering (L2 distance) is applied to learn the centroids that serve as the quantization codebook, sized so that each token fits a compact 2-byte representation. Optional Approximate Nearest Neighbor (ANN) search (HNSW) accelerates assignment during clustering and inference. At embedding or inference time, each spectrogram frame is quantized:
$$t_i = \operatorname*{arg\,min}_{k} \, \lVert x_i - c_k \rVert_2$$

where $x_i$ is the PCA-reduced frame and $c_k$ denotes centroid $k$. Within any 5-second interval, the pipeline extracts $8 \times 5 = 40$ tokens.
3. Unsupervised Token Embedding with Word2Vec Skip-Gram
The sequence of discrete frame tokens for each recording is treated as an integer-valued "sentence." Contextual representations are learned using gensim's Word2Vec skip-gram with negative sampling (SGNS), parameterized as follows: vector size ($1024$), context window ($80$ tokens, i.e., 10 s on each side at 8 tokens/s), a fixed number of negative samples per positive pair, a uniform negative sampling distribution (ns_exponent $= 0$), frequent-token subsampling, and minimum token frequency ($1$, no pruning).
The embedding objective is:
$$\mathcal{L} = \sum_{t=1}^{T} \sum_{\substack{-c \le j \le c \\ j \ne 0}} \left[ \log \sigma\!\left({v'_{w_{t+j}}}^{\!\top} v_{w_t}\right) + \sum_{i=1}^{n} \mathbb{E}_{w_i \sim P_n(w)} \log \sigma\!\left(-{v'_{w_i}}^{\!\top} v_{w_t}\right) \right]$$

where $v$ and $v'$ are input and output token embeddings, $P_n(w)$ is the negative sampling distribution, $c$ is the context window size, and $n$ the number of negative samples.
4. Feature Aggregation and Downstream Classification
At inference time, for each 5 s interval, the 40 contiguous tokens are mapped through the learned embedding matrix, each yielding a $1024$-dimensional vector. These vectors are averaged (mean pooling), producing a single feature vector for downstream classification. The classifier head is a two-layer MLP (Linear, ReLU, Linear), trained with CrossEntropyLoss over a surrogate multi-class task. No further temporal modeling, masking, or attention mechanisms are incorporated; the pipeline relies solely on static lookup, mean aggregation, and shallow classification.
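The full inference path per interval is a lookup, a mean, and two matrix products, which is why it runs so fast on CPU. Below is a numpy sketch under stated assumptions: the 16 K vocabulary is taken as 16 384 entries, and the hidden width (256) and class count (200) are illustrative placeholders, not values from the paper.

```python
import numpy as np

def classify_interval(tokens, emb, w1, b1, w2, b2):
    """Mean-pool the 40 token embeddings of a 5 s interval, then apply a
    two-layer MLP head (Linear -> ReLU -> Linear) to get class logits."""
    pooled = emb[tokens].mean(axis=0)        # (1024,) static lookup + mean
    h = np.maximum(w1 @ pooled + b1, 0.0)    # ReLU
    return w2 @ h + b2                       # class logits

rng = np.random.default_rng(0)
n_tokens, dim, hidden, classes = 40, 1024, 256, 200
vocab = 16_384                               # assumed size of the 16 K codebook
emb = rng.standard_normal((vocab, dim)).astype(np.float32)
w1 = (rng.standard_normal((hidden, dim)) * 0.01).astype(np.float32)
b1 = np.zeros(hidden, dtype=np.float32)
w2 = (rng.standard_normal((classes, hidden)) * 0.01).astype(np.float32)
b2 = np.zeros(classes, dtype=np.float32)

tokens = rng.integers(0, vocab, size=n_tokens)   # one 5 s interval
logits = classify_interval(tokens, emb, w1, b1, w2, b2)
```

Note that the only model state needed at inference is the codebook, the embedding matrix, and the two MLP layers, which is the small-memory-footprint property discussed below.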
5. Performance Benchmarking and Comparative Evaluation
The pipeline was validated in the BirdCLEF+ 2025 challenge, which restricts inference to 90 minutes CPU-only for 700 minutes of test data. STSG (v2.1) achieves an inference time of ~0.5 s per one-minute file (total ~6 minutes for 700 files), outperforming baseline models in speed:
| Model | CPU Inference / File | Leaderboard ROC-AUC (public/private) |
|---|---|---|
| STSG (v2.1) | 0.5 s (~6 min total) | 0.559 / 0.520 |
| Perch-TFLite | 1.4 s (~16 min total) | 0.729 / 0.711 |
| BirdSetEffNetB1 | 2.21 s (~26 min total) | 0.810 / 0.778 |
While STSG is approximately three times faster than the fastest baseline (Perch-TFLite), it incurs a significant drop in ROC-AUC (~0.15–0.25), largely attributable to information loss from quantization and static embeddings.
6. Design Trade-offs and Limitations
Advantages of the STSG pipeline include extreme inference speed (relying on matrix lookup, nearest-neighbor search, and mean pooling), a small memory footprint (just the codebook and embedding matrices), and full CPU compatibility without dependencies on large-batch GPU computation. Limitations are pronounced: classification accuracy is notably lower than that of state-of-the-art CNN backbones (AUC 0.52 vs. 0.71–0.78), static token embeddings cannot capture fine temporal structure or complex co-occurrence statistics, and quantization plus averaging inherently discard locational and amplitude details.
7. Future Directions in Spectrogram Tokenization
The pipeline suggests extensions and improvements in multiple directions: replacing the K-means tokenization with neural codecs (e.g., EnCodec) or Transformer/CNN-based quantizers as in wav2vec 2.0, utilizing shallow Transformers for token sequence modeling now that the vocabulary (16 K tokens) is tractable, refining clustering via methods such as HDBSCAN or information-theoretic codebooks, and hybridizing STSG features with small CNN or 1D-CNN architectures to recover temporal sequential order. A plausible implication is that combining static token embeddings with shallow neural sequence models may offer a more favorable trade-off between inference speed and classification performance in future bioacoustic tasks (Miyaguchi et al., 11 Jul 2025).