Gradient-informed Black-Box Signatures
- Gradient-informed black-box signatures are methods that estimate gradient information from query outputs alone; such estimates are essential for adversarial attacks, IP verification, and model interpretability.
- They employ techniques such as stochastic finite differences, binary sign recovery, and domain-aware approximations to build robust model fingerprints.
- These methods offer theoretical guarantees on completeness, linearity, and robustness while optimizing query efficiency in high-dimensional settings.
Gradient-informed black-box signatures represent a class of methodologies that recover or leverage gradient-related information from models accessible only via input-output queries, with no access to internal parameters or true analytic gradients. These signatures underpin query-based adversarial attacks, model fingerprinting, intellectual property (IP) verification, attribution/explanation, and the end-to-end integration of black-box functions. Across security, interpretability, and model management, the development of robust and information-rich black-box signatures has advanced via stochastic gradient estimation, sign recovery, frequency-domain priors, and query-efficient combinatorial strategies.
1. Mathematical Foundations and Problem Setting
The unifying abstraction is a parameterized function f_θ: ℝ^n → ℝ (or a discrete-output classifier), for which only the output value f_θ(x), or in label-only settings the predicted class argmax_y f_θ(x)_y, can be queried. The fundamental challenge is to reconstruct or utilize information about the gradient ∇_x f_θ(x), or in discrete settings a structural surrogate of it, based solely on black-box access. Typical objectives include:
- Estimating attributions or saliency via (approximated) gradients, subject to properties like completeness and linearity (Cai et al., 2023).
- Recovering the gradient sign vector sign(∇_x f(x)) ∈ {−1, +1}^n to guide query-efficient adversarial perturbations, or for use as a compact, local "signature" of model sensitivities (Al-Dujaili et al., 2019).
- Embedding IP-related signatures in gradient structures for watermarking/fingerprinting, extractable via zeroth-order oracle queries (Aramoon et al., 2021, Shao et al., 8 Oct 2025).
- Constructing gradient-based interfaces for integrating fixed, non-differentiable programmatic functions into end-to-end neural architectures (Jacovi et al., 2019).
Critical constraints include high input dimension, limited query budgets, absence of internal scores (label-only settings), and frequent need for query perturbations that respect input domain semantics (e.g., word-level in text models).
2. Gradient Estimation and Signature Construction Techniques
A variety of Monte Carlo and combinatorial approaches instantiate these signatures under black-box query regimes:
Stochastic Finite Differences and Smoothing
The GEEX framework (Cai et al., 2023) estimates gradients by Monte Carlo averaging over a Gaussian-smoothed surrogate, ∇f_σ(x) = E_{u∼N(0,I)}[f(x + σu) · u / σ], approximated by the sample mean over a finite number of probes.
Path-integral variants, which integrate these estimates along a straight-line path from a baseline x' to the input x, yield black-box analogues of white-box integrated gradients, enabling attribution with strong theoretical guarantees (insensitivity, implementation invariance, and completeness). The output signature encodes the average local sensitivity of f to each input coordinate, behaving as a robust, noise-regularized proxy for internal gradient flows.
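A minimal sketch of such a smoothed zeroth-order estimator (not GEEX's exact estimator; the function name and antithetic-sampling choice are illustrative):

```python
import numpy as np

def smoothed_grad(f, x, sigma=0.1, n_samples=500, rng=None):
    """Monte Carlo zeroth-order gradient estimate of a black-box f at x.

    Uses the Gaussian-smoothing identity
        grad f_sigma(x) = E[ f(x + sigma*u) * u / sigma ],  u ~ N(0, I),
    here with antithetic pairs (x + sigma*u, x - sigma*u) to reduce variance.
    """
    rng = rng or np.random.default_rng(0)
    d = x.shape[0]
    g = np.zeros(d)
    for _ in range(n_samples):
        u = rng.standard_normal(d)
        # Symmetric finite difference along the random direction u.
        g += (f(x + sigma * u) - f(x - sigma * u)) / (2 * sigma) * u
    return g / n_samples

# Sanity check on a smooth quadratic, where the true gradient is 2*x.
f = lambda z: float(np.sum(z ** 2))
x = np.array([1.0, -2.0, 0.5])
g = smoothed_grad(f, x, sigma=0.05, n_samples=2000)
```

With enough samples the estimate concentrates around the true gradient; the per-coordinate variance grows with the input dimension, which is why structural priors matter in high-dimensional settings.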
Binary Sign Recovery and Gradient-informed Fingerprints
Sign-based methods focus on estimating the gradient sign vector sign(∇_x f(x)), employing only direction queries and loss-oracle access. Properties established in (Al-Dujaili et al., 2019) reveal that, for a fixed gradient, the directional derivative along a candidate sign vector is affine in the (magnitude-weighted) Hamming distance to the true sign vector. Efficient algorithms like SignHunter perform divide-and-conquer bit-flip queries to reconstruct the entire sign vector in O(n) queries for n-dimensional inputs; the result acts as a binary "gradient-informed black-box signature." This form is minimal, discrete, and empirically sufficient for robust adversarial attacks (e.g., FGSM-style perturbation using the recovered sign vector yields high-confidence evasion on MNIST, CIFAR, and ImageNet with minimal queries).
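The divide-and-conquer idea can be sketched as follows; this is a simplified variant, not SignHunter's exact bookkeeping, and the `query` oracle is assumed to score alignment with the true gradient (e.g., a finite-difference directional derivative):

```python
import numpy as np

def sign_hunter(query, n):
    """Divide-and-conquer recovery of a gradient sign vector.

    `query(s)` returns a score that grows with the alignment of the
    candidate sign vector s to the true gradient sign. Flips of
    contiguous chunks are kept only when they improve the score;
    the chunk size halves each round, giving O(n) total queries.
    """
    s = np.ones(n)
    best = query(s)
    chunk = n
    while chunk >= 1:
        for start in range(0, n, chunk):
            s[start:start + chunk] *= -1        # tentative flip
            val = query(s)
            if val > best:
                best = val                      # keep the flip
            else:
                s[start:start + chunk] *= -1    # revert
        chunk //= 2
    return s

# Toy oracle: directional derivative of a linear model, g . s.
rng = np.random.default_rng(1)
g = rng.standard_normal(32)
oracle = lambda s: float(g @ s)
s_hat = sign_hunter(oracle, 32)
```

Because the toy oracle is separable, the final chunk-size-1 pass flips each coordinate independently and the recovered vector matches sign(g) exactly.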
Hard-label and Label-only Gradient Structure Estimation
Recent work (Liu et al., 17 Jan 2026) demonstrates that even in regimes where only the predicted class label is observable, runs of sign-coherent directions (block-coherent gradient sign structures) can be uncovered. Techniques such as zero-query Block-DCT variance priors, followed by Pattern-Driven Optimization (PDO), produce sign initializations and iteratively refine them over atomic runs (contiguous regions with constant sign), yielding signatures with high expected alignment to the true gradient sign and favorable boundary proximity, even under highly constrained feedback.
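One way such a zero-query, block-coherent initialization could look is sketched below; this is a loose illustration of the frequency-prior idea, not the cited method, and the function and parameter names (`dct_basis_sign_init`, `ku`, `kv`) are hypothetical:

```python
import numpy as np

def dct_basis_sign_init(h, w, ku=1, kv=1):
    """Zero-query, block-coherent sign initialization (illustrative).

    Takes the sign of a low-frequency 2-D DCT-II basis function: low
    frequencies vary slowly across the image plane, so the resulting
    +/-1 pattern consists of large contiguous runs, matching the
    block-coherent structure that natural-image gradients tend to show.
    """
    y = np.arange(h)
    x = np.arange(w)
    basis = np.outer(np.cos(np.pi * ku * (y + 0.5) / h),
                     np.cos(np.pi * kv * (x + 0.5) / w))
    return np.where(basis >= 0, 1.0, -1.0)

# For ku = kv = 1 on an 8x8 grid this yields four constant quadrants.
init = dct_basis_sign_init(8, 8)
```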
Domain-aware Zeroth-order Approximations
In discrete domains (e.g., text), infinitesimal vector perturbations are replaced by semantics-preserving transformations, such as word substitutions, enabling local Jacobian estimation (Shao et al., 8 Oct 2025). For each base input x, perturbed samples are generated, and the resulting differences in output embeddings are regressed onto the corresponding input-embedding differences, yielding a local Jacobian estimate that operates as a high-dimensional fingerprint capturing local model behavior.
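The regression step can be sketched in a few lines; the `model` and `embed` interfaces are assumed stand-ins for an output-embedding query and an input embedder, and the toy check uses a linear map so the recovered Jacobian is known:

```python
import numpy as np

def local_jacobian(model, embed, x0, perturbations):
    """Estimate a local Jacobian fingerprint from discrete perturbations.

    Each perturbation is a semantics-preserving edit of base input x0
    (e.g. a word substitution). Output-embedding differences are
    regressed onto input-embedding differences:
        dY ~ dX @ J^T   =>   J^T = lstsq(dX, dY).
    """
    dX = np.stack([embed(x) - embed(x0) for x in perturbations])
    dY = np.stack([model(x) - model(x0) for x in perturbations])
    Jt, *_ = np.linalg.lstsq(dX, dY, rcond=None)
    return Jt.T  # rows: output dims, cols: input dims

# Toy check: a linear "model" y = A e(x) has local Jacobian A everywhere.
rng = np.random.default_rng(2)
A = rng.standard_normal((4, 6))
embed = lambda x: x                      # inputs already embedded here
model = lambda x: A @ x
x0 = rng.standard_normal(6)
perts = [x0 + 0.1 * rng.standard_normal(6) for _ in range(20)]
J = local_jacobian(model, embed, x0, perts)
```

For a nonlinear model the same regression returns a first-order local fit, which is exactly the fingerprint role described above.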
3. Algorithmic Frameworks and Training Schemes
Below is a comparative snapshot of algorithmic strategies for gradient-informed black-box signatures:
| Method | Query Type | Signature Structure | Application |
|---|---|---|---|
| GEEX | Value (f(x)) | Real-valued attribution vector | Interpretability/Attribution |
| SignHunter | Loss-oracle | Binary sign vector s ∈ {−1, +1}^n | Adversarial attack, fingerprinting |
| GradSigns | Loss-oracle | Encoded bits in gradient half-spaces | Model watermarking (ownership) |
| ZeroPrint | Textual query | Local Jacobian estimate J | LLM fingerprinting and auditing |
| DPAttack/PDO | Label-only | Block-coherent sign vector | Hard-label attacks, security evaluation |
Hybrid schemes (e.g., Estimate-and-Replace for function interfaces (Jacovi et al., 2019)) introduce differentiable proxies (estimators) for black-box functions during training to enable end-to-end gradient flow, then swap in the true black-box implementation at inference, leveraging the compliance of the trained interface signature.
4. Theoretical Guarantees and Structural Properties
Analysis across these works establishes information-theoretic and structural guarantees for the expressivity and reliability of gradient-informed black-box signatures:
- Completeness and Linear Attribution: Attributions sum to f(x) − f(x'), the output difference between the input and its baseline, and are linear in the attributed function (Cai et al., 2023).
- Alignment and Query Complexity: Binary sign signatures maximize the directional derivative among sign vectors; optimal sign recovery can be performed in O(n) queries, near the information-theoretic lower bound (Al-Dujaili et al., 2019).
- Gradient Structure Correlations: Block-DCT initialization yields provably positive expected alignment to true signs under Gaussian correlation models, leading to lower initial boundary distances (Liu et al., 17 Jan 2026).
- Information Content: Fisher information analysis confirms that gradients (or Jacobians) contain more parameter-specific detail than outputs, especially after nonlinearity-induced compression (Shao et al., 8 Oct 2025). This underpins the superiority of gradient-based fingerprints for model identification and ownership validation.
- Robustness: GradSigns-embedded signatures are resilient to parameter pruning, quantization, adversarial retraining, and query obfuscation; statistical verification bounds the probability of false positive identification (Aramoon et al., 2021).
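The alignment property above admits a short numerical illustration (my own check under equal gradient magnitudes, not taken from the cited papers): for a sign vector q, the directional value g · q equals ||g||_1 minus twice the mass of mismatched coordinates, which is affine in the Hamming distance when the |g_i| are equal.

```python
import numpy as np

# Gradient with equal coordinate magnitudes 0.5 and random signs.
rng = np.random.default_rng(4)
g = 0.5 * np.sign(rng.standard_normal(16))
s = np.sign(g)                              # true sign vector

for hamming in range(17):
    q = s.copy()
    q[:hamming] *= -1                       # flip `hamming` coordinates
    # g . q = ||g||_1 - 2 * sum over mismatched |g_i| = 8 - hamming here.
    expected = np.abs(g).sum() - 2 * 0.5 * hamming
    assert np.isclose(g @ q, expected)
```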
5. Empirical Performance and Application Domains
Empirical studies validate the practical efficacy of gradient-informed black-box signatures in key ML domains:
- Adversarial Attacks: SignHunter achieves high evasion rates (e.g., on MNIST, 12 queries for 100% evasion; outperforms NES/ZO/Bandits by factors of 2.5–3.8 in query count and failure rate) in both ℓ∞ and ℓ2 regimes (Al-Dujaili et al., 2019). Pattern-driven hard-label attacks maintain success under strict query caps and evade adaptive detectors (Liu et al., 17 Jan 2026).
- Watermarking and IP Protection: GradSigns enables embedding and remote verification of 16–64 bit watermarks in deep models with 1% accuracy loss, scalable extraction costs (500 queries per bit), and resilience to adaptive attacks (Aramoon et al., 2021).
- Explaining and Auditing Models: GEEX produces attribution maps on vision tasks with state-of-the-art deletion-AOPC (e.g., 0.949 on MNIST, outperforming both white- and black-box baselines) and visual equivalence to integrated gradients (Cai et al., 2023).
- LLM Fingerprinting: ZeroPrint achieves AUC = 0.720 in distinguishing related from unrelated LLMs (a substantial improvement over prior black-box baselines), is robust to paraphrasing and logit perturbation, and operates within practical query/compute budgets (Shao et al., 8 Oct 2025).
- Hybrid Model Integration: Estimate-and-Replace enables neural networks to interface reliably with non-differentiable programmatic functions, generalizing with higher data efficiency than RL-based or naive end-to-end approaches (Jacovi et al., 2019).
6. Limitations, Trade-offs, and Open Problems
Gradient-informed black-box signature methods are inherently subject to the limitations of query complexity, input-dimension-dependent variance, and feedback granularity:
- High-dimensional problems (e.g., ImageNet) may require tens of thousands of queries for low-variance estimates unless structural priors or adaptive sampling are employed (Cai et al., 2023, Liu et al., 17 Jan 2026).
- In the hard-label setting, gradient sign recovery is more challenging; frequency- or block-structure priors, as well as adaptive pattern-driven search, are required for tractable alignment (Liu et al., 17 Jan 2026).
- Domain-specific perturbations (especially for text and structured data) can complicate the definition of "local" gradients, necessitating embedding-based or semantically-preserving substitution strategies (Shao et al., 8 Oct 2025).
- For watermarking, an adversary retraining a model with a new key (watermark) requires access to the full original dataset to mount a forgery; otherwise, the false-positive risk rises as the available data shrinks (Aramoon et al., 2021).
Future directions include exploring optimal perturbation strategies for variance minimization, extending these techniques to graphs/time-series, tightening theoretical bias–variance bounds, and active query selection to further reduce cost without loss of fidelity (Cai et al., 2023, Shao et al., 8 Oct 2025).
7. Cross-disciplinary Impact and Outlook
Gradient-informed black-box signatures have become central to several areas:
- Security and Robustness: They enable highly query-efficient adversarial evaluation and defense circumvention, even against models with restricted or stateful interfaces (Al-Dujaili et al., 2019, Liu et al., 17 Jan 2026).
- Model Ownership and Audit: IP owners can embed and verify robust, high-capacity signatures in deployed models and LLMs, with rigorous statistical verification and resilience to adversarial attempts at watermark removal or obfuscation (Aramoon et al., 2021, Shao et al., 8 Oct 2025).
- Interpretable AI: Model-agnostic attributions and debugging tools rely on accurate black-box gradient proxies, extending interpretability guarantees to opaque or non-cooperative systems (Cai et al., 2023).
- Hybrid and Modular ML: Deep learning architectures can be reliably interfaced with external black-box modules by optimizing over differentiable surrogates during training, then replacing them at inference without loss of validity (Jacovi et al., 2019).
As the frontier between white-box transparency and black-box opacity shifts due to privacy, deployment, and scalability considerations, these techniques anchor a principled toolkit for model interrogation, defense, and ownership in modern machine learning ecosystems.