Non-Semantic Financial Data Encoding
- Non-semantic financial data encoding is a collection of domain-agnostic techniques that transform raw numerical, categorical, or temporal financial data into normalized and structured representations.
- Techniques such as Quantile Linear Encoding, symbolic quantization with n-gram analysis, and pseudo-token replacements enable models to handle heterogeneous, high-cardinality, and sequential data effectively.
- These encoding methods are practically applied in credit scoring, fraud detection, financial entity recognition, and anomaly detection, leading to robust improvements in predictive performance.
Non-semantic financial data encoding refers to a collection of techniques for representing financial data—numerical, categorical, or sequential—in forms that do not leverage explicit semantic or domain knowledge, but instead transform raw data into normalized, quantized, or structurally arranged representations. These encodings enable machine learning and statistical models to efficiently process high-cardinality, heterogeneous, or high-dimensional financial data for tasks such as credit scoring, fraud detection, numerical entity recognition, and time-series analysis without explicit modeling of financial semantics.
1. Principles and Motivations
Non-semantic encodings are designed to address key challenges characteristic of large-scale financial datasets:
- Heterogeneous distributions: Financial features such as balances, incomes, or transaction amounts often span orders of magnitude.
- High cardinality/sparsity: Categorical fields like account codes can take hundreds of possible values, leading to extremely sparse one-hot vectors.
- Sequential/temporal patterns: In time-series or sequential transaction data, symbolizing the series enables higher-order structure discovery.
- Numerical representation robustness: Preserving both global scaling and local magnitudes is essential for prediction accuracy, especially in tasks like credit scoring where magnitude differences within quantile bins hold signal.
- Model compatibility: Many encodings avoid increasing input dimensionality, minimize computational and memory load, and enable plug-and-play integration with neural, tree, or statistical prediction architectures.
These constraints motivate compact, numerically stable, and completely domain-agnostic data transformation strategies (Zhang et al., 2024, Borovikov et al., 2013, Bakumenko et al., 2024, Loukas et al., 2022, Wang et al., 2020).
2. Core Encoding Techniques
2.1 Quantile Linear Encoding (QLE) for Numerical Features
QLE uniformly transforms real-valued attributes into [0, 1], combining quantile binning with intra-bin linear interpolation to preserve both the regularization of quantile transforms and fine-grained magnitude distinctions. For a value x falling in bin i of B quantile bins with boundaries q_i ≤ x < q_{i+1}:

QLE(x) = (i + (x - q_i) / (q_{i+1} - q_i)) / B

Bins and quantile boundaries are precomputed on training data; within-bin interpolation avoids precision loss. QLE is “non-semantic” as it is feature-wise, requires no external domain knowledge, and does not introduce extra input dimensions (Zhang et al., 2024).
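The QLE transform described above can be sketched in a few lines. This is a minimal illustration, not the reference implementation; the helper names `fit_qle` and `qle` are hypothetical, and the boundary-fitting step assumes simple empirical quantiles over the training values.

```python
import bisect

def fit_qle(values, n_bins):
    """Precompute B+1 quantile bin boundaries from training values (illustrative)."""
    s = sorted(values)
    edges = [s[min(int(i * len(s) / n_bins), len(s) - 1)] for i in range(n_bins)]
    edges.append(s[-1])
    return edges

def qle(x, edges):
    """Quantile Linear Encoding: bin index plus intra-bin linear
    interpolation, scaled to [0, 1]. Out-of-range values are clamped."""
    B = len(edges) - 1
    if x <= edges[0]:
        return 0.0
    if x >= edges[-1]:
        return 1.0
    i = bisect.bisect_right(edges, x) - 1       # locate the quantile bin
    lo, hi = edges[i], edges[i + 1]
    frac = (x - lo) / (hi - lo) if hi > lo else 0.0  # within-bin interpolation
    return (i + frac) / B
```

Because the output is a single scalar per feature, the encoding adds no input dimensions and can feed directly into a downstream network.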
2.2 Symbolic Quantization and n-gram Dictionary Construction
For sequential/temporal data, such as returns time-series:
- Binary encoding: the log-return r_t is mapped to symbol 1 if r_t > 0, else 0.
- General quantization: r_t is binned into an alphabet of size K, e.g., via thresholding at empirical quantiles.
- n-gram dictionaries: contiguous symbol subsequences of length n are counted; empirical frequencies p_n(w) and maximum-entropy reference distributions q_n(w) are constructed.
- Relative entropy (information capacity): D_n = Σ_w p_n(w) log( p_n(w) / q_n(w) ), highlighting non-random structure as a function of n.
This approach is entirely non-semantic, reducing continuous-valued time series to symbolic “texts” amenable to information-theoretic anomaly/event analysis (Borovikov et al., 2013).
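A compact sketch of the binarize-then-measure pipeline, under the assumption of a uniform maximum-entropy reference over the K^n possible n-grams (function names are illustrative, not from the cited work):

```python
import math
from collections import Counter

def binarize(returns):
    """Map each log-return to symbol '1' if positive, else '0'."""
    return "".join("1" if r > 0 else "0" for r in returns)

def ngram_relative_entropy(symbols, n, alphabet_size=2):
    """Relative entropy (bits) of the empirical n-gram distribution
    versus the uniform (maximum-entropy) reference over alphabet_size**n words."""
    counts = Counter(symbols[i:i + n] for i in range(len(symbols) - n + 1))
    total = sum(counts.values())
    q = 1.0 / alphabet_size ** n  # uniform reference probability per n-gram
    return sum((c / total) * math.log2((c / total) / q) for c in counts.values())
```

A perfectly alternating series concentrates its bigram mass on two of the four possible words, so its relative entropy is large, whereas a series whose n-grams are uniformly distributed yields zero.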
2.3 Pseudo-Token and Shape-Based Numeric Encoding in NLP
In tasks like XBRL tagging:
- Pseudo-token ([NUM]): All numeric strings are replaced by a single token, eliminating subword fragmentation in BERT-style models.
- Shape-based tokens ([X...]): Digits are replaced by ‘X’, preserving length and punctuation (e.g., “40,200.5” → “[XX,XXX.X]”).
- Vocabulary management: Either a single extra token or a small set (e.g., 214 shapes) is added to the tokenizer; embeddings are trained or fine-tuned (Loukas et al., 2022).
These methods neutralize the uninformative lexical content of numbers while retaining contextual and coarse magnitude information.
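The two replacement schemes can be sketched with regular expressions. This is a simplified illustration (the numeric-string pattern and function names are assumptions, not the tokenizer logic of the cited models):

```python
import re

NUMERIC_RE = re.compile(r"\d[\d,\.]*")  # simplistic numeric-string pattern

def to_shape_token(number_text):
    """Replace every digit with 'X', keeping length and punctuation,
    and bracket the result so it can be added as one atomic token."""
    return "[" + re.sub(r"\d", "X", number_text) + "]"

def replace_numbers(text, mode="shape"):
    """Substitute numeric strings with [NUM] or shape-based pseudo-tokens."""
    if mode == "num":
        return NUMERIC_RE.sub("[NUM]", text)
    return NUMERIC_RE.sub(lambda m: to_shape_token(m.group()), text)
```

The shape variant keeps coarse magnitude information (digit count and separators), which is why it tends to outperform the magnitude-agnostic [NUM] token.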
2.4 Dense Embeddings for Non-Semantic Categorical Data
For tabular records or ledger entries with categorical fields:
- Textual verbalization: Key-value pairs are converted to pseudo-natural language (“Source: A, Account_DC: 123…”).
- Sentence-Transformer encoding: Pre-trained models (e.g., all-MiniLM-L6-v2, all-distilroberta-v1, all-mpnet-base-v2) embed the concatenated string or per-transaction snippets into dense, fixed-dimensional vectors via mean-pooling.
- Normalization: Embedding vectors are feature-wise standardized.
- Model-agnostic: No explicit semantics are imposed on field values; the approach allows for variable numbers of transactions per record without padding (Bakumenko et al., 2024).
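The verbalization and feature-wise standardization steps can be sketched as follows. The embedding step itself would call a pre-trained Sentence-Transformer (e.g., `SentenceTransformer("all-MiniLM-L6-v2").encode(...)`), omitted here to keep the sketch dependency-free; `verbalize` and `standardize` are hypothetical helper names:

```python
import statistics

def verbalize(record):
    """Turn a key-value transaction record into a pseudo-natural-language string."""
    return ", ".join(f"{k}: {v}" for k, v in record.items())

def standardize(vectors):
    """Feature-wise standardization of embedding vectors (zero mean, unit std).
    A zero-variance dimension is left unscaled to avoid division by zero."""
    dims = list(zip(*vectors))
    means = [statistics.fmean(d) for d in dims]
    stds = [statistics.pstdev(d) or 1.0 for d in dims]
    return [[(x - m) / s for x, m, s in zip(v, means, stds)] for v in vectors]
```

Because each record is reduced to one fixed-dimensional vector regardless of how many fields or transactions it contains, no padding scheme is needed downstream.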
2.5 Image Encoding for Numeric Vectors
When applying 2D CNNs to 1D tabular or financial-ratio data:
- Sequential Arrangement (SA): The feature vector is reshaped in row-major order into a square 2D image, zero-padded as needed.
- Category Chunk Arrangement (CCA): Feature subsets grouped by accounting/semantic role, packed into blocks/chunks, then tiled.
- Hilbert Vector Arrangement (HVA): Space-filling Hilbert curve maps feature index to 2D image grid, preserving adjacency.
These spatially non-semantic encodings are applied when local spatial patterns are potentially exploitable by 2D CNNs—particularly for engineered financial ratio vectors (Wang et al., 2020).
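The simplest of the three arrangements, SA, amounts to a row-major reshape into the smallest square grid that fits the feature vector (a minimal sketch; the function name is illustrative):

```python
import math

def sequential_arrangement(features, fill=0.0):
    """Row-major reshape of a 1D feature vector into the smallest square
    2D grid, zero-padding the tail (SA encoding)."""
    side = math.ceil(math.sqrt(len(features)))
    padded = list(features) + [fill] * (side * side - len(features))
    return [padded[r * side:(r + 1) * side] for r in range(side)]
```

CCA differs only in that features are first grouped into semantically related chunks before packing, and HVA replaces the row-major index map with a Hilbert space-filling curve to better preserve 1D adjacency in 2D.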
3. Empirical Outcomes and Comparative Performance
Tabular Numerical Data
- QLE in TKGMLP outperformed centered log-ratio (CLR), pure quantile transform, and high-dimensional piecewise-linear embeddings (PLE) in credit scoring: test AUC 95.04 vs. 94.91–94.96, KS 76.08 vs. 75.45–75.82. Though numerically small, such gains are material at the scale of real-world lending portfolios (Zhang et al., 2024).
NLP: Financial Numeric Entity Recognition
- Pseudo-token ([NUM]/shape) approaches improved micro-F1 by 3–6.4 pp compared to vanilla BERT (e.g., SEC-BERT: 82.1 with shape tokens vs 75.7) (Loukas et al., 2022).
- Shape tokens preserve coarse magnitude, outperforming magnitude-agnostic [NUM] tokens in all tested configurations.
Categorical Data Anomaly Detection
- Sentence-Transformer embeddings substantially improved macro-average recall in logistic regression and neural networks for transaction anomaly detection (e.g., recall up to 0.9920 with all-MiniLM-L6-v2 vs 0.9280 baseline one-hot); for tree-based and SVM classifiers, improvements depend on the model choice (Bakumenko et al., 2024).
Image Encodings for CNNs
- 2D image encoding (SA, CCA) improved accuracy by up to ∼5–10% on financial ratio data, outperforming 1D CNNs/MLP. For raw fundamental variables, no 2D encoding outperformed direct 1D methods (Wang et al., 2020).
4. Implementation and Complexity Considerations
| Encoding | Input Dimensionality | Primary Complexity | Key Hyperparameters |
|---|---|---|---|
| QLE | No increase (one scalar per feature) | O(log B) bin lookup per value | Number of bins B |
| Symbolic n-gram | Up to K^n dictionary entries | Linear counting pass per series | Quantizer cardinality K, n-gram length n |
| Pseudo-token | Few extra tokens | Linear-time preprocessing | Vocabulary size |
| SBERT Embedding | Fixed (384–768) | Linear in sequence length | Embedding model choice |
| SA/CCA/HVA | d features → ⌈√d⌉ × ⌈√d⌉ images | O(d) index mapping | Image shape, chunk size |
- Memory and runtime scale with the number of bins B (QLE), the n-gram dictionary size K^n (symbolic encodings), and the image grid size (imaging encodings).
- Most schemes avoid trainable parameters except for embedding-based pipelines, which may fine-tune new pseudo-token or SBERT embeddings.
5. Broader Implications and Future Directions
Non-semantic encoding techniques are applicable across a wide array of machine learning pipelines in financial settings. Key observations include:
- Robustness to data heterogeneity: By abstracting away from domain semantics, these approaches generalize across diverse financial institutions, products, or localities.
- Avoidance of “curse of dimensionality”: Compact scalars or dense vectors replace sparse, high-cardinality representations.
- Seamless neural integration: Most encodings yield inputs compatible with standard dense or sequential network architectures.
- Extension opportunities: Possible future avenues include trainable quantile boundaries (QLE), higher-order within-bin interpolation, hierarchical token-prompt engineering for complex records, and increased synergy between non-semantic and domain-informed feature engineering.
6. Application Scenarios and Practical Guidelines
- Tabular classification (e.g., credit scoring): Use QLE or similar per-feature scalars with grid-tuned bin count; batch normalize before downstream networks (Zhang et al., 2024).
- Anomaly detection in ledgers: Textualize record fields and embed via light-weight SBERT or related transformers, especially to address high-cardinality categorical fields (Bakumenko et al., 2024).
- Entity recognition in regulatory text: Replace numbers by shape tokens processed as atomic tokens in the NLP pipeline, mitigating tokenization artifacts (Loukas et al., 2022).
- Financial time-series anomaly/event analysis: Apply symbolic quantization and n-gram entropy analysis to discover implicit dynamics free of semantic labeling (Borovikov et al., 2013).
- Neural image analysis of ratios: Rasterize ratio vectors for 2D CNNs; favor row-major (SA) or chunked (CCA) arrangements for maximum empirical gain (Wang et al., 2020).
These approaches exemplify the operationalization of non-semantic encoding, providing strong, empirically validated tools for a range of financial machine learning tasks.