
TMV-Hunter: TLS MitM Vulnerability Detection

Updated 6 February 2026
  • TMV-Hunter is a dynamic analysis tool that detects TLS certificate validation flaws in Android apps through automated MitM attack simulation.
  • It employs a foundation model–driven GUI agent, per-app VPN traffic interception, and sequential MitM testing to achieve high coverage across large app corpora.
  • Empirical results on nearly 40,000 apps reveal a 22.42% vulnerability rate, underscoring persistent TLS security flaws and the need for prompt remediation.

TMV-Hunter is the dynamic-analysis detection component of the Okara framework, designed for large-scale detection of Transport Layer Security (TLS) Man-in-the-Middle Vulnerabilities (TMVs) in Android applications. TMV-Hunter leverages foundation-model–driven graphical user interface (GUI) exploration and automated network-level MitM attack simulation to identify flaws in TLS certificate validation, achieving high coverage and scalability across market-sized app corpora (Yang et al., 30 Jan 2026).

1. System Architecture

TMV-Hunter operates as a standalone dynamic analysis tool that integrates into the Okara pipeline as its detection stage. Its architecture is organized around three core modules orchestrated by a centralized Test Orchestrator:

  • GUI Agent: Automates interaction with the app's UI to trigger possible TLS flows.
  • Traffic Forwarding Module: Sets up per-app VPN-based traffic interception and forwarding, enabling transparent capture and manipulation of encrypted flows.
  • MitM Test Module: Performs active man-in-the-middle probing on observed TLS flows to assess certificate validation robustness.

The Test Orchestrator receives an APK file and a set of testing parameters $\{S_{\mathrm{GUI}}, N_{\mathrm{steps}}, T_{\max}, T_{\mathrm{wait}}, P_{\mathrm{MitM}}\}$, and outputs a vulnerability report of all TLS flows found susceptible to the MitM-T1, MitM-T2, and MitM-T3 attack variants. The full workflow is formalized in Algorithm 1, which prescribes sequential installation, traffic interception, GUI exploration, and iterative MitM testing on discovered flows.
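The sequential workflow described above can be sketched as a minimal orchestration loop. This is an illustrative reconstruction, not the tool's actual API: the parameter names mirror the ones in the text, while the injected helpers (`install`, `capture_flows`, `probe`) are hypothetical stubs standing in for the real installation, interception/exploration, and probing stages.

```python
# Hypothetical sketch of the Algorithm 1 orchestration loop; helper
# callables are stubs, since the source does not specify their interfaces.
from dataclasses import dataclass

@dataclass
class TestParams:
    s_gui: str = "specialized_llm"      # GUI exploration strategy (S_GUI)
    n_steps: int = 50                   # interaction step budget (N_steps)
    t_max: float = 600.0                # wall-clock limit in seconds (T_max)
    t_wait: float = 4.0                 # post-action wait in seconds (T_wait)
    p_mitm: tuple = ("T1", "T2", "T3")  # MitM test variants to run (P_MitM)

def run_detection(apk, params, install, capture_flows, probe):
    """Sequential pipeline: install -> intercept/explore -> probe each flow."""
    install(apk)
    flows = capture_flows(params)       # TLS flows observed during exploration
    report = []
    for flow in flows:
        for variant in params.p_mitm:
            if probe(flow, variant):    # True => app accepted a forged cert
                report.append({"flow": flow, "variant": variant})
    return report

# Toy usage with stub behavior: one of two flows accepts the T1 certificate.
params = TestParams()
report = run_detection(
    "app.apk", params,
    install=lambda apk: None,
    capture_flows=lambda p: ["api.example.com:443", "cdn.example.com:443"],
    probe=lambda flow, v: flow.startswith("cdn") and v == "T1",
)
print(report)  # [{'flow': 'cdn.example.com:443', 'variant': 'T1'}]
```

The per-flow inner loop mirrors the "iterative MitM testing on discovered flows" step; in the real system the probe stage would re-drive the app through the intercepted flow once per attack variant.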

2. Foundation Model-Driven GUI Exploration

At the center of TMV-Hunter's scalability is its GUI Agent, which replaces random and rule-based crawlers with foundation-model-driven, high-coverage interaction. The agent accepts as input the current UI observation $o_i$ (the UI hierarchy and optional screenshots), historical interaction traces $\{(o_j, a_j)\}_{j<i}$, and task instructions focused on maximal TLS flow discovery. It selects discrete actions $a_i$ from $\{\text{click}, \text{long\_click}, \text{type}, \text{scroll}, \text{drag}, \text{back}, \text{wait}, \text{finish}\}$, parameterized for specific UI elements.

Three decision strategies are implemented:

  • Random: Uniform random selection over legal $(a_i, o_i)$ pairs.
  • General LLM: One-shot prompting with a 32B-parameter vision-LLM (Qwen2.5-VL-Instruct) using whole-session context.
  • Specialized LLM: Multi-turn UI-specific interaction via a 7B-parameter UI-TARS model, leveraging session-based alternation and screenshot inputs.

The agent operates on local vLLM inference servers. System prompts guide the agent to exhaust visible elements, employ back-navigation if stuck, and systematically attempt text input fields. Specialized LLM prompts further encode heuristics to reveal hidden or conditional screens, such as login dialogs and pop-ups. An interaction wait parameter $T_{\mathrm{wait}}$ ensures asynchronous content is realized before subsequent actions.
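The Random baseline strategy above can be expressed as a small policy loop. This is a hedged sketch: the action vocabulary comes from the text, but the observation format, element representation, and loop structure are assumptions made for illustration.

```python
# Illustrative sketch of the agent's decision loop under the Random strategy.
# Observation/element structures are invented; the action set is from the text.
import random

ACTIONS = ("click", "long_click", "type", "scroll", "drag",
           "back", "wait", "finish")

def random_policy(observation, history, rng):
    """Uniform choice over legal (action, element) pairs (Random strategy)."""
    elements = observation["elements"]
    element = rng.choice(elements) if elements else None
    action = rng.choice(ACTIONS)
    return action, element

def explore(observations, policy, n_steps, rng):
    """Run the agent for up to n_steps, stopping early on 'finish'."""
    history = []
    for i in range(n_steps):
        obs = observations[min(i, len(observations) - 1)]
        action, element = policy(obs, history, rng)
        history.append((obs, action, element))
        if action == "finish":
            break
    return history

rng = random.Random(0)
screens = [{"elements": ["login_btn", "search_box"]}, {"elements": ["submit"]}]
trace = explore(screens, random_policy, n_steps=5, rng=rng)
```

The LLM-backed strategies would replace `random_policy` with a model call that conditions on $o_i$, the trace $\{(o_j, a_j)\}_{j<i}$, and the task instructions.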

Coverage is quantified by metrics including $C_{\mathrm{UI}}$ and $C_{\mathrm{TLS}}$ (intersection ratios with manually collected ground-truth UI screens and FQDNs), their "novel" complements measuring previously unseen discoveries, and by a high-level coverage formula:

$$\text{Coverage} = \frac{\#\,\text{UI elements interacted}}{\#\,\text{total UI elements}} \times 100\%$$
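The intersection-ratio metrics reduce to simple set arithmetic. The sketch below computes a $C_{\mathrm{TLS}}$-style ratio and its "novel" complement; the FQDN sets are invented for illustration.

```python
# Intersection-ratio coverage against a manual ground truth, plus the
# "novel" complement (discoveries absent from the ground truth).
def coverage(discovered: set, ground_truth: set) -> float:
    """|discovered ∩ truth| / |truth|, as in C_UI / C_TLS."""
    if not ground_truth:
        return 0.0
    return len(discovered & ground_truth) / len(ground_truth)

def novel(discovered: set, ground_truth: set) -> int:
    """Count of discoveries the ground truth missed."""
    return len(discovered - ground_truth)

# Invented example data: 4 ground-truth FQDNs, 3 discovered.
truth_fqdns = {"api.example.com", "cdn.example.com",
               "auth.example.com", "t.example.com"}
found_fqdns = {"api.example.com", "cdn.example.com", "ads.example.net"}

print(f"C_TLS = {coverage(found_fqdns, truth_fqdns):.0%}")  # C_TLS = 50%
print(f"novel = {novel(found_fqdns, truth_fqdns)}")         # novel = 1
```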

3. Automated MitM Vulnerability Testing Methodology

The MitM Test Module executes three attack protocols per observed TLS flow $f$, with server endpoint $d$ and certificate $C$:

  • MitM-T1 (Untrusted-CA Test): Presents a certificate $C$ chained to a self-generated, untrusted CA; a vulnerability is signaled if $\neg\mathrm{IsTrustedAnchor}(C) \wedge \mathrm{AppAccepts}(C)$.
  • MitM-T2 (Domain-Mismatch Test): Substitutes the subject of $C$ with a domain $d' \neq d$ while retaining CA validity; a vulnerability occurs if $\mathrm{ChainValid}(C) \wedge (\mathrm{subject}(C) \neq d) \wedge \mathrm{AppAccepts}(C)$.
  • MitM-T3 (Pinning-Bypass Test): Installs the attacker's CA in the device trust store; apps without robust certificate pinning accept arbitrary CA-signed certificates (i.e., for trust manager $m$, there exists some $C_{\mathrm{bad}}$ such that $m.\texttt{checkServerTrusted}(C_{\mathrm{bad}}, \ldots)$ does not throw a $\texttt{CertificateException}$).

Flows meeting vulnerability criteria are added to the aggregate report $\mathcal{R}$ along with relevant metadata.
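The three vulnerability predicates translate directly into code. The sketch below uses a simplified certificate model; the field names and the "accept-all" app are assumptions for illustration, with a Python exception standing in for Java's `CertificateException`.

```python
# The three MitM vulnerability predicates over a simplified certificate model.
from dataclasses import dataclass

@dataclass
class Cert:
    subject: str          # domain the certificate claims
    trusted_anchor: bool  # chains to a CA in the system trust store
    chain_valid: bool     # signature chain verifies

def mitm_t1(cert, app_accepts):
    """Untrusted-CA: vulnerable if the app accepts a cert with no trusted anchor."""
    return (not cert.trusted_anchor) and app_accepts(cert)

def mitm_t2(cert, endpoint, app_accepts):
    """Domain-mismatch: valid chain, wrong subject, yet accepted."""
    return cert.chain_valid and cert.subject != endpoint and app_accepts(cert)

def mitm_t3(check_server_trusted, bad_cert):
    """Pinning-bypass: attacker CA trusted; vulnerable if the trust manager
    does not reject an arbitrary CA-signed certificate."""
    try:
        check_server_trusted(bad_cert)
        return True                 # no exception raised => no pinning
    except Exception:
        return False                # rejection analogue of CertificateException

# A broken TrustManager analogue that accepts any certificate,
# versus a pinning checker that rejects the forged one.
accept_all = lambda cert: True
def pinned_checker(cert):
    raise ValueError("certificate not pinned")

forged = Cert(subject="attacker.example", trusted_anchor=False, chain_valid=True)
print(mitm_t1(forged, accept_all))                     # True
print(mitm_t2(forged, "api.example.com", accept_all))  # True
print(mitm_t3(pinned_checker, forged))                 # False
```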

4. Empirical Results and Scale

TMV-Hunter was evaluated on a deduplicated dataset of 39,876 unique Android apps, sampled from Google Play (via AndroZoo, 20,000 APKs) and the AppChina third-party store (20,000 APKs, latest versions as of March 2025). The dynamic execution environment used 8 parallel Android emulators (redroid on AWS Graviton2 / Alibaba Ampere instances) and three high-end GPUs for model inference, achieving an average per-app analysis time of 144.75 seconds.

Key findings are summarized below:

| Entity | AppChina (Count / %) | AndroZoo (Count / %) | Combined (Count / %) |
|---|---|---|---|
| Apps | 7.82K (39.40%) | 0.558K (3.19%) | 8.37K (22.42%) |
| Flows | 80K (9.94%) | 6.43K (0.77%) | 86K (5.25%) |
| FQDNs | 5.04K (17.11%) | 0.919K (4.69%) | 5.88K (12.16%) |
| App-FQDN pairs | 30K (19.42%) | 1.61K (1.23%) | 32K (11.08%) |

Of 37,349 analyzed apps, 8,374 (22.42%) exhibited at least one MitM-vulnerable TLS flow, across 5,881 unique vulnerable FQDNs and 86,000 of 1.64 million tested flows. Vulnerability prevalence is uniform across popularity and app categories ($r_{pb} \approx 0$), with category-wise Jensen–Shannon divergences of 0.0499 (AppChina) and 0.2433 (AndroZoo) indicating minimal skew.

TLS 1.3 dominates among vulnerable flows (78.98% vs. 21.02% for TLS 1.2), and all vulnerable flows run over TCP. A plausible implication is that the vulnerabilities are not confined to deprecated transport versions but affect the contemporary ecosystem.

Critical functionalities are recurrently affected. In a 100-app case study:

| Category | % Flows Vulnerable | % Apps w/ ≥1 Vuln Flow |
|---|---|---|
| Content Delivery | 61.28% | 56.00% |
| Telemetry/Analytics | 27.70% | 61.00% |
| Executable Code | 6.19% | 27.00% |
| Authentication | 4.06% | 39.00% |
| Financial Transactions | 0.75% | 13.00% |

Longitudinal analysis (100 apps, five-year history sampled at three-month intervals) reveals that vulnerabilities are highly persistent: a median vulnerable span of 1,384 days, a median app lifespan of 1,901 days, and a median remediation delay of 330 days.
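The span statistics above can be derived from a sampled version history by back-of-envelope date arithmetic. The sketch below estimates one app's vulnerable span from per-sample vulnerability flags; the dates and flags are invented for illustration, and real sampling at three-month granularity bounds each estimate accordingly.

```python
# Estimating the vulnerable span from a coarsely sampled version history.
# (release_date, has_vulnerable_flow) pairs are invented example data.
from datetime import date

history = [
    (date(2020, 1, 1), True),
    (date(2020, 4, 1), True),
    (date(2021, 7, 1), True),
    (date(2022, 10, 1), False),  # first sample observed fixed
    (date(2024, 1, 1), False),
]

first_vuln = min(d for d, vuln in history if vuln)
last_vuln = max(d for d, vuln in history if vuln)

# Lower-bound estimate: from first to last sample observed vulnerable.
vulnerable_span = (last_vuln - first_vuln).days
print(vulnerable_span)  # 547
```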

5. Performance, Limitations, and Scalability

TMV-Hunter's coverage and detection quality are conditioned by both the GUI agent and the MitM test module. Empirically, per-app coverage and runtime depend on the agent strategy: random (~95 s), general LLM (~532 s), and specialized LLM (~334 s), measured with a 4 s step wait and a 50-step budget; the reported end-to-end mean is 144.75 s per app.

Principal sources of error include:

  • False negatives: Caused by incomplete GUI coverage and thus missing live flows.
  • False positives: Stemming from heuristic mapping of flows to code regions; benign flows may be misattributed.

Scalability challenges are associated with LLM inference cost/latency and the instrumentation coverage of non-debuggable/native-code apps. Proposed mitigations include the deployment of smaller specialized models, parameterized step/time budgets to fine-tune exploration, extension to native libraries via eBPF and Frida-ART-TI hybrids, and further GUI exploration enhancements using multimodal memory and RL-based coverage guidance.

6. Context and Implications within TLS Security Research

TMV-Hunter’s approach of integrating foundation model–driven exploration with practical MitM probing distinguishes it from prior UI crawlers constrained by low coverage and high manual effort. Its design allows for efficient, market-scale scanning and systematic measurement of TLS certificate validation weaknesses, found to be widespread (22.42% of tested apps) and persistent over multi-year intervals. This suggests that, despite the adoption of improved TLS standards, implementation-level flaws remain pervasive across device and store boundaries. TMV-Hunter’s outputs enable subsequent code-level attribution and mitigation, contributing to ongoing responsible disclosure and research ecosystem support (Yang et al., 30 Jan 2026).
