SEC Form 4 Insider Purchase Filings

Updated 10 February 2026

SEC Form 4 insider purchase filings are mandated disclosures of insider transactions that serve as a record for regulatory compliance and market surveillance.
These filings aggregate detailed data on insider roles, trade specifics, and filing delays to enable precise empirical asset pricing and forensic analysis.
Machine learning models such as MaBoost integrate time-series encoding with tree-based classifiers to detect filing violations with high accuracy, while a separate gradient boosting classifier predicts abnormal returns after microcap purchase disclosures.

SEC Form 4 insider purchase filings are disclosures mandated by U.S. securities regulations, recording equity transactions by corporate insiders such as officers, directors, and significant beneficial owners. These filings, governed principally by Rule 16a-3 under the Securities Exchange Act of 1934 and reinforced by the Sarbanes-Oxley Act of 2002 (“SOX”), serve as a critical input for regulatory compliance, market surveillance, and empirical asset pricing research. Recent advances in dataset construction, machine learning classification, and empirical microstructure analysis have enabled systematic investigations into compliance behavior and predictive signal extraction from these filings (Huang et al., 27 Jul 2025, Zhao, 5 Feb 2026).

1. Dataset Construction and Regulatory Parameters

SEC Form 4 datasets aggregate structured transaction information from EDGAR filings for open-market purchases (TRANCODE = P) and sales (TRANCODE = S) by insiders. The "Insider Filing Delay" (IFD) benchmark encompasses 4,051,143 transactions spanning 2002–2025 and captures approximately 50.5% purchases and 49.5% sales. Essential metadata fields include insider roles (e.g., CEO, VP), transaction date (TRANDATE), transaction code, acquired/disposed indicator (ACQDISP), share quantities (SHARES), transaction prices (TPRICE), post-trade holdings, key company identifiers, and an extensive set of quality control flags (AMEND, CLEANSE).

Timeliness of disclosure is strictly defined. The SEC mandates that Form 4 be filed within two business days of trade execution. Filing compliance is therefore measured as:

$$\text{delay\_days} = \text{filed\_date} - \text{transaction\_date}$$

A compliant filing has $\text{delay\_days} \leq 2$; a violation ($y_i = 1$) occurs if $\text{delay\_days} > 2$. Temporal annotations use the SEC business-day calendar and distinguish on-time filings from “oversight” violations (delay $\leq 3$ days) and “intentional” violations (delay $\geq 4$ days, repeat) (Huang et al., 27 Jul 2025).
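The delay rule can be sketched in a few lines of Python. This is a minimal illustration, assuming delays are counted in business days after the trade date. The `holidays` argument is a hypothetical stand-in for the SEC business-day calendar, which the source does not enumerate.

```python
from datetime import date, timedelta

def business_days_between(start, end, holidays=frozenset()):
    """Count business days after `start` up to and including `end`.

    Assumption: weekends and any dates in `holidays` are skipped; a real
    pipeline would supply the actual SEC/federal holiday calendar.
    """
    days = 0
    current = start
    while current < end:
        current += timedelta(days=1)
        if current.weekday() < 5 and current not in holidays:
            days += 1
    return days

def label_filing(transaction_date, filed_date, holidays=frozenset()):
    """Return (delay_days, y) where y = 1 marks a violation (delay > 2)."""
    delay = business_days_between(transaction_date, filed_date, holidays)
    return delay, int(delay > 2)

# A Friday trade filed the following Tuesday is two business days late at most.
on_time = label_filing(date(2024, 3, 1), date(2024, 3, 5))
# The same trade filed on Wednesday crosses the two-day threshold.
late = label_filing(date(2024, 3, 1), date(2024, 3, 6))
```

Counting business days rather than calendar days matters here: the Friday-to-Tuesday filing spans four calendar days but is still compliant.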

2. Feature Engineering and Labeling of Insider Purchases

IFD datasets deploy 52 constructed features across five domains:

Insider history: Metrics such as InsiderRatio (historical violations/trades by an individual) and FirmRatio (firm-level aggregate) capture behavioral tendencies over time.
Trade characteristics: Notably, TradeValue (trade amount/market cap), raw delay, and the log distance between firm HQ and major metro centers provide microstructure and frictions context.
Governance and ownership: BlockholderRatio (large-holder shares/total), Herfindahl index (HHI), and related measures describe internal power distribution.
Firm-level financials: Features include return-on-assets (ROA), leverage, book/market, R&D/Assets, and Tobin’s Q, allowing firm-value contextualization.
Spatio-temporal context: Variables such as gap days to earnings, calendar distance, and regulatory timing support detection of opportunistic behavior.

Binary and subcategorical labeling rely on the compliant/violation taxonomy, further distinguishing sporadic from systematic offenders. These extensive features underpin both compliance forensics and predictive modeling (Huang et al., 27 Jul 2025).
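A feature such as InsiderRatio can be sketched as a running violation rate per insider. The version below is an assumption about how the ratio might be computed: it uses only trades that precede the current one, so that the label being predicted does not leak into its own feature. The source does not specify this detail, and the function name is hypothetical.

```python
from collections import defaultdict

def insider_ratio_features(trades):
    """Compute a historical violation ratio for each trade.

    `trades` is a chronologically sorted list of (insider_id, violation_label)
    pairs. For each trade, the feature is violations / trades over that
    insider's earlier trades only (0.0 when there is no history).
    """
    history = defaultdict(lambda: [0, 0])  # insider_id -> [violations, trades]
    ratios = []
    for insider_id, label in trades:
        violations, count = history[insider_id]
        ratios.append(violations / count if count else 0.0)
        # Update the history only after the feature has been emitted.
        history[insider_id][0] += label
        history[insider_id][1] += 1
    return ratios

example = [("a", 1), ("a", 0), ("b", 0), ("a", 1)]
ratios = insider_ratio_features(example)
```

A firm-level FirmRatio could follow the same pattern with a firm identifier as the grouping key.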

3. Machine Learning Frameworks for Violation Detection

The MaBoost architecture integrates a Mamba-based state-space sequence encoder with XGBoost regression-tree classification. The Mamba encoder models a transaction history $\mathbf{X}_i = \{\mathbf{x}_i^{(1)},\ldots,\mathbf{x}_i^{(T)}\}$ for insider $i$, with the update equations:

$$\mathbf{h}_t = \mathbf{A}_t\,\mathbf{h}_{t-1} + \mathbf{B}_t\,\mathbf{x}_i^{(t)}, \qquad \mathbf{z}_t = \mathbf{C}_t\,\mathbf{h}_t$$

Aggregated sequence embeddings ($\mathbf{h}_i$) inform the XGBoost classifier:

$$\hat{y}_i = \sum_{m=1}^M f_m(\mathbf{h}_i), \qquad \hat p_i = \sigma(\hat y_i), \qquad \hat y_i^{\rm label} = \mathbb{I}(\hat p_i\ge\tau)$$

with regularized logistic loss. Baseline comparisons include linear/logistic regression, decision/ensemble tree methods, sequence models (RNN, LSTM, Transformer), and LLM embeddings.
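The recurrence and the boosted prediction can be sketched in plain Python. This is a minimal illustration, not the MaBoost implementation: the matrices are fixed rather than input-dependent as in Mamba, the last output $\mathbf{z}_T$ stands in for the aggregated embedding, and the `trees` are hypothetical hand-written stumps rather than fitted XGBoost trees.

```python
import math

def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def ssm_encode(xs, A, B, C):
    """Run h_t = A h_{t-1} + B x_t and z_t = C h_t; return the final z_T.

    Assumption: A, B, C are constant over time (a simplification of the
    selective, input-dependent parameters used by Mamba).
    """
    h = [0.0] * len(A)
    z = [0.0] * len(C)
    for x in xs:
        h = [a + b for a, b in zip(matvec(A, h), matvec(B, x))]
        z = matvec(C, h)
    return z

def predict(embedding, trees, tau=0.5):
    """Sum tree outputs, apply the sigmoid, and threshold at tau."""
    score = sum(f(embedding) for f in trees)
    prob = 1.0 / (1.0 + math.exp(-score))
    return prob, int(prob >= tau)

# Toy one-dimensional example with two time steps.
A, B, C = [[0.5]], [[1.0]], [[2.0]]
embedding = ssm_encode([[1.0], [1.0]], A, B, C)

# Hypothetical stumps standing in for the fitted trees f_m.
trees = [lambda e: 1.0 if e[0] > 2.0 else -1.0, lambda e: 0.5]
prob, label = predict(embedding, trees)
```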

MaBoost achieves an F1-score of 99.47% (Precision = 99.09%, Recall = 99.85%) under SEC-defined constraint conditions, outperforming the XGBoost (97.65%) and Transformer (98.29%) baselines. Ablation studies show a drop of more than seven F1 points when Insider history or Spatio-temporal features are omitted (Huang et al., 27 Jul 2025).
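As a quick consistency check, the reported F1-score follows from the reported precision and recall through the standard harmonic-mean definition:

$$F_1 = \frac{2PR}{P + R} = \frac{2 \times 0.9909 \times 0.9985}{0.9909 + 0.9985} \approx 0.9947$$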

4. Predictive Signal in Microcap Insider Purchases

Extended analysis of 17,237 open-market purchases (2018–2024) in microcap equities (market cap \$30M–\$500M) employs a gradient boosting classifier to detect positive return opportunities following Form 4 purchase disclosure dates. The model is trained on features covering insider identity, trade specifics, and contemporaneous market conditions:

Title score (CEO=5, CFO=4, etc.), transaction value, first-in-12-months indicator, purchase deviations from usual size.
Price deviation from trade to disclosure date, distance from 52-week high/low, month-to-date return, short-term volatility, filing-date market capitalization.
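Two of the price-based features can be sketched as simple ratios. The exact definitions below are assumptions, since the source names the features without giving formulas: price deviation is taken as the relative move from the trade price to the disclosure-date price, and distance from the 52-week high as the relative gap to the maximum of a trailing price list.

```python
def price_deviation(trade_price, disclosure_price):
    """Relative price change from the insider's trade to the disclosure date.

    Assumed definition: disclosure_price / trade_price - 1.
    """
    return disclosure_price / trade_price - 1.0

def distance_from_high(price, trailing_prices):
    """Relative gap between `price` and the highest trailing price.

    Assumed definition: price / max(trailing_prices) - 1, so the value is
    zero at the high and negative below it. `trailing_prices` is expected
    to cover roughly the prior 52 weeks.
    """
    return price / max(trailing_prices) - 1.0

run_up = price_deviation(10.0, 11.5)
gap = distance_from_high(9.0, [8.0, 12.0, 10.0])
```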

The cumulative abnormal return (CAR) over the $[1,30]$-day post-disclosure window is used to define abnormal returns:

$$\text{CAR}_{[1,30]} = \sum_{s=1}^{30} AR_{t+s}$$

where the abnormal return is

$$AR_{t} = R_{t} - \hat{\alpha} - \hat{\beta}_\text{MKT}(R_{\text{MKT},t} - R_{f,t}) - \hat{\beta}_\text{SMB}\,\text{SMB}_t - \hat{\beta}_\text{HML}\,\text{HML}_t$$

with factor loadings estimated from the prior 252 trading days.
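The CAR definition translates directly into code. The sketch below assumes the factor loadings have already been estimated over the prior 252 trading days; the dictionary keys and function names are hypothetical.

```python
def abnormal_return(r, mkt_excess, smb, hml, alpha, b_mkt, b_smb, b_hml):
    """Three-factor abnormal return for one day, following the formula above.

    `mkt_excess` is the market return minus the risk-free rate.
    """
    return r - alpha - b_mkt * mkt_excess - b_smb * smb - b_hml * hml

def cumulative_abnormal_return(days, loadings):
    """Sum daily abnormal returns over the post-disclosure window.

    `days` holds one dict per trading day s = 1..30 with keys
    r, mkt_excess, smb, hml. `loadings` holds alpha, b_mkt, b_smb, b_hml.
    """
    return sum(
        abnormal_return(d["r"], d["mkt_excess"], d["smb"], d["hml"], **loadings)
        for d in days
    )

# Toy window: the stock beats a beta-one market exposure by 10 bps per day.
loadings = {"alpha": 0.0, "b_mkt": 1.0, "b_smb": 0.0, "b_hml": 0.0}
window = [{"r": 0.002, "mkt_excess": 0.001, "smb": 0.0, "hml": 0.0}] * 30
car = cumulative_abnormal_return(window, loadings)
```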

On 2024 out-of-sample data, the classifier (tuned with $n_\text{estimators} = 1000$, learning rate $= 0.01$, max_depth $= 3$) achieves an AUC of 0.70, with optimal precision of 0.38 and recall of 0.69 at a threshold of 0.20. The mean unconditional 30-day CAR across all filings is approximately 3.5%. For filings disclosed after a price run-up of more than 10%, the mean CAR rises to 6.3% and the probability of more than 10% outperformance reaches 36.7%, contrary to naive mean-reversion expectations (Zhao, 5 Feb 2026).
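Precision and recall at a fixed probability threshold can be computed as follows. This is a generic sketch on made-up scores, not a reproduction of the reported results. Only the default threshold of 0.20 is taken from the text.

```python
def precision_recall(probs, labels, threshold=0.20):
    """Precision and recall when predicting positive for prob >= threshold."""
    tp = fp = fn = 0
    for prob, label in zip(probs, labels):
        predicted = prob >= threshold
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif label:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical model scores and outcome labels.
probs = [0.10, 0.25, 0.30, 0.15, 0.50]
labels = [0, 1, 0, 1, 1]
precision, recall = precision_recall(probs, labels)
```

A low threshold such as 0.20 trades precision for recall, which matches the pattern in the reported figures (recall well above precision).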

5. Feature Importance, Interpretation, and Behavioral Regularities

Tree-based feature importance shows that “distance from 52-week high” accounts for 36% of total feature importance in microcap purchase filings, followed by month-to-date return (8.1%), 30-day volatility (7.2%), and market cap at filing (6.6%). The empirical evidence does not support traditional mean-reversion heuristics: purchase filings disclosed after significant upward price moves (momentum) outperform those following price weakness, a pattern robust to winsorization and alternate holding periods (20-day and 60-day windows).

Interpretability analysis in violation detection studies identifies InsiderRatio and FirmRatio as the strongest predictors, followed by trade characteristics and governance metrics. Removing historical or temporal features reduces model performance by more than 7 F1 points, underscoring the behavioral and contextual dimension of insider compliance (Huang et al., 27 Jul 2025, Zhao, 5 Feb 2026).

6. Surveillance, Forensic, and Trading Applications

High-fidelity Form 4 datasets and robust ML classifiers support several system-critical applications:

Regulatory surveillance: Real-time dashboards that flag high-risk/delayed filings for further scrutiny under SEC two-business-day rules.
Compliance automation: Legal and corporate teams can leverage interpretable outputs to identify systematic violators or oversight trends.
Forensic analytics: Linking of delayed purchase filings to post-disclosure return patterns supports regulatory forensics and event study research.
Quantitative trading: Screening microcap purchases for price momentum preceding disclosure and high model probability identifies positive-alpha opportunities, even when accounting for estimated slippage and capacity. Actionable screens involve computing price deviations and thresholding on CAR probability, e.g., initiating a 30-day long if the “run-up” and model conditions are met (Zhao, 5 Feb 2026).
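The trading screen described above can be sketched as a two-condition filter. The thresholds reuse figures quoted earlier (a run-up above 10% and a model threshold of 0.20), but combining them in this way, the run-up formula, and the function name are all assumptions for illustration. The source does not give an exact entry rule.

```python
def should_enter_long(trade_price, disclosure_price, model_prob,
                      run_up_min=0.10, prob_min=0.20):
    """Return True when both screen conditions hold.

    Assumptions: run-up is the relative price move from the insider's
    trade to the disclosure date, and `model_prob` is the classifier's
    predicted probability of a positive abnormal return.
    """
    run_up = disclosure_price / trade_price - 1.0
    return run_up > run_up_min and model_prob >= prob_min

# Hypothetical filings: (trade price, disclosure price, model probability).
candidates = [(10.0, 11.5, 0.30), (10.0, 10.5, 0.90), (10.0, 12.0, 0.10)]
signals = [should_enter_long(*c) for c in candidates]
```

Only the first candidate passes: the second lacks the run-up and the third lacks model support. A position opened on a signal would be held for the 30-day window used in the CAR definition.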

A plausible implication is that the integration of large-scale Form 4 data, sophisticated time-series models, and market-microstructure features enables both improved regulatory enforcement and systematic extraction of abnormal returns in asset pricing and trading contexts.

7. Research Extensions and Open Challenges

IFD provides the first large-scale, publicly available, behavior-rich benchmark enabling reproducible, comparative research in insider filing compliance and its capital market impact. Extensions include: cross-jurisdictional filings, modeling of unstructured textual narratives in 10-Ks, and application of causal inference for proactive regulatory intervention. Persistent challenges include data quality, detection of strategic vs. oversight violations, and adaptation to high-frequency disclosure environments (Huang et al., 27 Jul 2025).

The intersection of regulatory data science and empirical asset pricing, as concretized by SEC Form 4 analyses, remains an active and methodologically rich research frontier.


References

Huang et al. (2025). IFD: A Large-Scale Benchmark for Insider Filing Violation Detection.

Zhao (2026). Insider Purchase Signals in Microcap Equities: Gradient Boosting Detection of Abnormal Returns.
