TMV-Hunter: TLS MitM Vulnerability Detection
- TMV-Hunter is a dynamic analysis tool that detects TLS certificate validation flaws in Android apps through automated MitM attack simulation.
- It employs a foundation model–driven GUI agent, per-app VPN traffic interception, and sequential MitM testing to achieve high coverage across large app corpora.
- Empirical results on nearly 40,000 apps reveal a 22.42% vulnerability rate, underscoring persistent TLS security flaws and the need for prompt remediation.
TMV-Hunter is the dynamic-analysis detection component of the Okara framework, designed for large-scale detection of Transport Layer Security (TLS) Man-in-the-Middle Vulnerabilities (TMVs) in Android applications. TMV-Hunter leverages foundation-model–driven graphical user interface (GUI) exploration and automated network-level MitM attack simulation to identify flaws in TLS certificate validation, achieving high coverage and scalability across market-sized app corpora (Yang et al., 30 Jan 2026).
1. System Architecture
TMV-Hunter operates as a standalone dynamic analysis tool that integrates into the Okara pipeline as its detection stage. Its architecture is organized around three core modules orchestrated by a centralized Test Orchestrator:
- GUI Agent: Automates interaction with the app's UI to trigger possible TLS flows.
- Traffic Forwarding Module: Sets up per-app VPN-based traffic interception and forwarding, enabling transparent capture and manipulation of encrypted flows.
- MitM Test Module: Performs active man-in-the-middle probing on observed TLS flows to assess certificate validation robustness.
The Test Orchestrator receives an APK file and a set of testing parameters, and outputs a vulnerability report listing all TLS flows found susceptible to the MitM-T1, MitM-T2, and MitM-T3 attack variants. The full workflow is formalized in Algorithm 1, which prescribes sequential installation, traffic interception, GUI exploration, and iterative MitM testing on discovered flows.
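As a rough illustration of this sequential workflow, the following Python sketch wires the four stages together. The callables `install`, `start_interception`, `explore`, and `probe`, as well as the `Flow` and `Report` types, are hypothetical placeholders introduced here and are not TMV-Hunter's actual interfaces.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# The three attack variants named in the text.
TESTS = ("MitM-T1", "MitM-T2", "MitM-T3")


@dataclass(frozen=True)
class Flow:
    """Hypothetical record for one observed TLS flow."""
    fqdn: str
    port: int = 443


@dataclass
class Report:
    """Hypothetical report: which (flow, test) combinations succeeded."""
    apk: str
    vulnerable: List[Tuple[Flow, str]] = field(default_factory=list)


def run_orchestrator(apk, install, start_interception, explore, probe):
    """Install, intercept, explore, then test every discovered flow in turn.

    All four callables are assumptions supplied by the caller:
      install(apk), start_interception(apk) -> None
      explore(apk) -> iterable of Flow
      probe(apk, flow, test) -> True if the app accepted the forged certificate
    """
    install(apk)
    start_interception(apk)
    # Deduplicate flows while preserving discovery order.
    flows = list(dict.fromkeys(explore(apk)))
    report = Report(apk)
    for flow in flows:
        for test in TESTS:
            if probe(apk, flow, test):
                report.vulnerable.append((flow, test))
    return report
```

Passing the stages in as callables keeps the control flow testable without a device; a real deployment would bind them to an emulator, a VPN service, and a MitM proxy.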
2. Foundation Model-Driven GUI Exploration
At the center of TMV-Hunter's scalability is its GUI Agent, which improves on purely random and rule-based crawlers by using foundation models for high-coverage interaction. The agent accepts as input the current UI observation (encompassing the UI hierarchy and optional screenshots), the historical interaction trace, and task instructions focused on maximal TLS flow discovery. At each step it selects a discrete action from its action space, parameterized by the UI element the action targets.
Three decision strategies are implemented:
- Random: Uniform random selection over the legal action-element pairs.
- General LLM: One-shot prompting with a 32B-parameter vision-LLM (Qwen2.5-VL-Instruct) using whole-session context.
- Specialized LLM: Multi-turn UI-specific interaction via a 7B-parameter UI-TARS model, leveraging session-based alternation and screenshot inputs.
The agent runs against local vLLM inference servers. System prompts guide it to exhaust visible elements, navigate back if stuck, and systematically attempt text input fields. Specialized LLM prompts further encode heuristics to reveal hidden or conditional screens, such as login dialogs and pop-ups. An interaction wait parameter gives asynchronous content time to load before the next action.
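The exploration loop can be sketched as follows, using the Random strategy as the policy. This is a minimal illustration under stated assumptions: the `env` interface (`observe`, `act`), the observation dictionary layout, and `ToyEnv` are all hypothetical, and the defaults of 50 steps and a 4-second wait simply mirror the budget reported in the performance section.

```python
import random
import time


def random_policy(observation, history, rng):
    """Random strategy: uniform choice over the legal (action, element) pairs."""
    return rng.choice(observation["legal_pairs"])


def explore(env, policy, step_budget=50, wait_s=4.0, rng=None, sleep=time.sleep):
    """Run one exploration session and return the executed pairs in order.

    `env` is a hypothetical object exposing observe() and act(pair).
    """
    rng = rng or random.Random(0)
    history = []
    observation = env.observe()
    for _ in range(step_budget):
        if not observation["legal_pairs"]:
            break  # nothing left to interact with
        pair = policy(observation, history, rng)
        env.act(pair)
        sleep(wait_s)  # let asynchronous content load before observing again
        history.append(pair)
        observation = env.observe()
    return history


class ToyEnv:
    """Stand-in for a device: two screens, each with a fixed set of legal pairs."""

    SCREENS = {
        "home": [("tap", "login_button"), ("tap", "news_tab")],
        "login": [("type", "username_field"), ("back", None)],
    }

    def __init__(self):
        self.screen = "home"

    def observe(self):
        return {"screen": self.screen, "legal_pairs": self.SCREENS[self.screen]}

    def act(self, pair):
        action, element = pair
        if element == "login_button":
            self.screen = "login"
        elif action == "back":
            self.screen = "home"
```

For example, `explore(ToyEnv(), random_policy, step_budget=5, sleep=lambda s: None)` returns five legal pairs without actually waiting. An LLM-backed strategy would plug in as another `policy` with the same signature, using `observation` and `history` to build its prompt.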
Coverage is quantified by intersection ratios with manually collected ground-truth UI screens and FQDNs, by their "novel" complements, which measure previously unseen discoveries, and by a high-level coverage formula.
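As a hedged illustration of ratio-style metrics of this kind, the Python sketch below computes an intersection ratio and a novel ratio over sets of FQDNs (the same functions apply to screen identifiers). The function names and the choice of denominators (ground truth for the intersection ratio, discovered items for the novel ratio) are assumptions made for this example, not definitions taken from the source.

```python
def intersection_ratio(found, truth):
    """Share of the ground-truth items that the agent also discovered.

    Assumption: normalized by the size of the ground-truth set.
    """
    truth = set(truth)
    if not truth:
        return 0.0
    return len(set(found) & truth) / len(truth)


def novel_ratio(found, truth):
    """Share of the agent's discoveries that are absent from the ground truth.

    Assumption: normalized by the number of items the agent discovered.
    """
    found = set(found)
    if not found:
        return 0.0
    return len(found - set(truth)) / len(found)


# Example: the agent finds three FQDNs, two of which a human tester also saw.
agent_fqdns = {"api.example.com", "cdn.example.com", "ads.example.net"}
manual_fqdns = {"api.example.com", "cdn.example.com",
                "pay.example.com", "login.example.com"}
```

Here `intersection_ratio(agent_fqdns, manual_fqdns)` is 2/4 and `novel_ratio(agent_fqdns, manual_fqdns)` is 1/3, since one of the three discovered FQDNs was never reached manually.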
3. Automated MitM Vulnerability Testing Methodology
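The MitM Test Module executes three attack protocols per observed TLS flow, each defined in terms of the flow's server endpoint and the certificate presented to the app:
- MitM-T1 (Untrusted-CA Test): Presents an otherwise valid certificate chained to a self-generated, untrusted CA; a vulnerability is signaled if the app accepts the certificate and completes the handshake.
- MitM-T2 (Domain-Mismatch Test): Replaces the subject of the certificate with a domain that differs from the server endpoint while retaining CA validity; a vulnerability is signaled if the app still accepts the certificate.
- MitM-T3 (Pinning-Bypass Test): Installs the attacker's CA in the device trust store; apps without robust certificate pinning will accept arbitrary certificates signed by that CA (i.e., the app's trust manager does not throw a certificate-validation exception for the attacker-signed chain).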
Flows meeting vulnerability criteria are added to the aggregate report along with relevant metadata.
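To make the three test conditions concrete, the toy Python model below reduces an app's validation behavior to three boolean checks and evaluates which forged certificates it would accept. `ClientPolicy`, `accepts`, and `run_mitm_tests` are hypothetical names introduced for this sketch; real detection relies on observing live handshakes, not on a model like this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClientPolicy:
    """Toy model of an app's certificate checks (an assumption, not real TLS)."""
    validates_chain: bool    # rejects chains that do not end in a trusted CA
    checks_hostname: bool    # rejects certificates issued for another domain
    pins_certificate: bool   # rejects anything except the pinned certificate


def accepts(policy, *, chain_trusted, hostname_matches, matches_pin):
    """Return True if a client with this policy would accept the certificate."""
    if policy.validates_chain and not chain_trusted:
        return False
    if policy.checks_hostname and not hostname_matches:
        return False
    if policy.pins_certificate and not matches_pin:
        return False
    return True


def run_mitm_tests(policy):
    """Map each test to True when the forged certificate would be accepted."""
    return {
        # T1: correct hostname, but chained to an untrusted CA.
        "MitM-T1": accepts(policy, chain_trusted=False,
                           hostname_matches=True, matches_pin=False),
        # T2: trusted chain, but issued for a different domain.
        "MitM-T2": accepts(policy, chain_trusted=True,
                           hostname_matches=False, matches_pin=False),
        # T3: attacker CA is in the trust store, so chain and hostname pass;
        # only pinning can still reject it.
        "MitM-T3": accepts(policy, chain_trusted=True,
                           hostname_matches=True, matches_pin=False),
    }
```

Under this model, a trust-all client fails every test, a client with default validation but no pinning fails only MitM-T3, and a client that also pins passes all three.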
4. Empirical Results and Scale
TMV-Hunter was evaluated over a deduplicated dataset of 39,876 unique Android apps, sampled from Google Play (AndroZoo, 20,000 APKs) and the AppChina third-party store (20,000 APKs, latest from March 2025). The dynamic execution environment leveraged 8 parallel Android emulators (redroid on AWS Graviton2/Alibaba Ampere) and three high-end GPUs for model inference, achieving an average per-app analysis time of 144.75 seconds.
Key findings are summarized below:
| Entity | AppChina (Count/%) | AndroZoo (Count/%) | Combined (Count/%) |
|---|---|---|---|
| Apps | 7.82K (39.40%) | 0.558K (3.19%) | 8.37K (22.42%) |
| Flows | 80K (9.94%) | 6.43K (0.77%) | 86K (5.25%) |
| FQDNs | 5.04K (17.11%) | 0.919K (4.69%) | 5.88K (12.16%) |
| App-FQDN Pairs | 30K (19.42%) | 1.61K (1.23%) | 32K (11.08%) |
Of 37,349 analyzed apps, 8,374 (22.42%) exhibited at least one MitM-vulnerable TLS flow, spanning 5,881 unique vulnerable FQDNs and roughly 86,000 of 1.64 million tested flows. Vulnerability prevalence is reported as uniform across popularity levels and app categories, with category-wise Jensen–Shannon divergence of 0.0499 (AppChina) and 0.2433 (AndroZoo) indicating minimal skew.
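The headline rate and the divergence measure can be reproduced in a few lines of Python. The sketch below recomputes 8,374 / 37,349 and implements Jensen–Shannon divergence between two discrete distributions. The base-2 logarithm (which bounds the value in [0, 1]) and the example category shares are assumptions for illustration; the source does not state which base or which pair of distributions was used.

```python
import math


def kl_divergence(p, q, base=2.0):
    """Kullback-Leibler divergence D(p || q); terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q, base=2.0):
    """Jensen-Shannon divergence: mean KL divergence to the midpoint mixture."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m, base) + 0.5 * kl_divergence(q, m, base)


# Headline rate from the reported counts.
vulnerable_rate = round(100 * 8374 / 37349, 2)

# Hypothetical category shares: all apps versus vulnerable apps.
all_apps = [0.40, 0.35, 0.25]
vulnerable_apps = [0.38, 0.36, 0.26]
skew = js_divergence(all_apps, vulnerable_apps)
```

Identical distributions give a divergence of 0 and fully disjoint ones give 1 under base 2, so values near 0 indicate that vulnerable apps are spread across categories much like the overall population.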
TLS 1.3 dominates among vulnerable flows (78.98%, versus 21.02% for TLS 1.2), and all vulnerable flows run over TCP. A plausible implication is that the vulnerabilities are not confined to deprecated protocol versions but affect the contemporary ecosystem.
Critical functionalities are recurrently affected. In a 100-app case study:
| Category | % Flows Vulnerable | % Apps w/≥1 Vuln Flow |
|---|---|---|
| Content Delivery | 61.28% | 56.00% |
| Telemetry/Analytics | 27.70% | 61.00% |
| Executable Code | 6.19% | 27.00% |
| Authentication | 4.06% | 39.00% |
| Financial Transactions | 0.75% | 13.00% |
Longitudinal analysis (100 apps, five years of version history sampled at three-month intervals) reveals that vulnerabilities are highly persistent, with a median vulnerable span of 1,384 days, a median app lifespan of 1,901 days, and a median remediation delay of 330 days.
5. Performance, Limitations, and Scalability
TMV-Hunter's coverage and detection quality depend on both the GUI agent and the MitM test module. Empirically, per-app coverage and runtime vary with agent strategy: random (95 s), general LLM (532 s), and specialized LLM (334 s), measured with a 4 s step wait and a 50-step budget; the reported end-to-end mean is 144.75 s per app.
Principal sources of error include:
- False negatives: Caused by incomplete GUI coverage and thus missing live flows.
- False positives: Stemming from heuristic mapping of flows to code regions; benign flows may be misattributed.
Scalability challenges are associated with LLM inference cost/latency and the instrumentation coverage of non-debuggable/native-code apps. Proposed mitigations include the deployment of smaller specialized models, parameterized step/time budgets to fine-tune exploration, extension to native libraries via eBPF and Frida-ART-TI hybrids, and further GUI exploration enhancements using multimodal memory and RL-based coverage guidance.
6. Context and Implications within TLS Security Research
TMV-Hunter’s integration of foundation model–driven exploration with practical MitM probing distinguishes it from prior UI crawlers, which were constrained by low coverage and high manual effort. Its design allows efficient, market-scale scanning and systematic measurement of TLS certificate validation weaknesses, which were found to be widespread (22.42% of tested apps) and persistent over multi-year intervals. This suggests that, despite the adoption of improved TLS standards, implementation-level flaws remain pervasive across device and store boundaries. TMV-Hunter’s outputs enable subsequent code-level attribution and mitigation, supporting responsible disclosure and further research (Yang et al., 30 Jan 2026).