Capturing P: On the Expressive Power and Efficient Evaluation of Boolean Retrieval

Published 26 Jan 2026 in cs.IR, cs.AI, cs.CC, cs.CL, and cs.DB | (2601.18747v1)

Abstract: Modern information retrieval is transitioning from simple document filtering to complex, neuro-symbolic reasoning workflows. However, current retrieval architectures face a fundamental efficiency dilemma when handling the rigorous logical and arithmetic constraints required by this new paradigm. Standard iterator-based engines (Document-at-a-Time) do not natively support complex, nested logic graphs; forcing them to execute such queries typically results in intractable runtime performance. Conversely, naive recursive approaches (Term-at-a-Time), while capable of supporting these structures, suffer from prohibitive memory consumption when enforcing broad logical exclusions. In this paper, we propose that a retrieval engine must be capable of "Capturing $\mathbf{P}$" -- evaluating any polynomial-time property directly over its index in a computationally efficient manner. We define a formal Retrieval Language ($\mathcal{L}_R$) based on Directed Acyclic Graphs (DAGs) and prove it precisely captures the complexity class $\mathbf{P}$. We introduce \texttt{ComputePN}, a novel evaluation algorithm that makes $\mathcal{L}_R$ tractable. By combining native DAG traversal with a memory-efficient "Positive-Negative" response mechanism, \texttt{ComputePN} ensures the efficient evaluation of any query in $\mathcal{L}_R$. This work establishes the theoretical foundation for turning the search index into a general-purpose computational engine.


Authors (1)

Amir Aavani

Summary

The paper establishes a DAG-based retrieval language that captures P, proving its ability to represent any polynomial-time Boolean logic.
It systematically contrasts TAAT, DAAT, and embedding-based methods to highlight their limitations in efficiently executing complex Boolean queries.
The paper introduces ComputePN, a memory-efficient algorithm that evaluates Boolean query DAGs in sub-second time and avoids universe-scale intermediate computations.

Capturing P in Boolean Retrieval: Expressive Power and Efficient Computation

Introduction

The paper "Capturing P: On the Expressive Power and Efficient Evaluation of Boolean Retrieval" (2601.18747) rigorously formalizes the computational boundaries of information retrieval systems. The work responds to the increasingly neuro-symbolic demands on retrieval architectures, driven by the proliferation of agentic workflows, LLM-augmented search, and complex constraint satisfaction. The author demonstrates that traditional retrieval paradigms (TAAT, DAAT, and embedding-based methods) are fundamentally limited in representing and efficiently executing arbitrary polynomial-time logic over large document corpora. To resolve these bottlenecks, the paper introduces a Retrieval Language based on Directed Acyclic Graphs (DAGs), establishes its equivalence to the complexity class $\mathbf{P}$, and presents ComputePN, a tractable algorithm that uses positive-negative response logic to evaluate these expressive queries efficiently.

Computational Framework for Retrieval

Theoretical Foundations

The paper defines a Retrieval Language $\mathcal{L}_R$ in which Boolean queries are represented as DAGs, allowing for the reuse of subgraphs and the expression of deeply nested, non-tree logic. The expressive power of this formalism is validated by showing that the decision problem over $\mathcal{L}_R$ (whether the result set is non-empty) is P-hard via a logspace reduction from the Circuit Value Problem (CVP), the canonical P-complete problem. The construction demonstrates that any Boolean circuit can be simulated by a retrieval query DAG. In the other direction, a naive term-at-a-time algorithm evaluates every query in polynomial time, so the decision problem is contained in $\mathbf{P}$. Together, the two directions make it P-complete.
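The idea behind the reduction can be illustrated with a toy encoding. The sketch below is an assumption about how such a construction might look, not the paper's actual proof: it uses a single-document universe, maps each true circuit input to a term whose posting list contains that document (and each false input to an empty list), and maps gates to set operations. The circuit evaluates to true exactly when the result set is non-empty. All names (`UNIVERSE`, `circuit_to_result_set`, `circuit_is_true`) are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Assumption for this sketch: a corpus containing exactly one document, id 0.
UNIVERSE: Set[int] = {0}

# A gate is (operator, argument names); operators are IN, AND, OR, NOT.
Gate = Tuple[str, Tuple[str, ...]]


def circuit_to_result_set(gates: Dict[str, Gate], order: List[str],
                          inputs: Dict[str, bool], output: str) -> Set[int]:
    """Evaluate a circuit as a set-valued query; `order` must be topological."""
    value: Dict[str, Set[int]] = {}
    for name in order:
        op, args = gates[name]
        if op == "IN":
            # A true input is a term that matches the document; false matches nothing.
            value[name] = set(UNIVERSE) if inputs[args[0]] else set()
        elif op == "AND":
            value[name] = value[args[0]] & value[args[1]]
        elif op == "OR":
            value[name] = value[args[0]] | value[args[1]]
        elif op == "NOT":
            value[name] = UNIVERSE - value[args[0]]
        else:
            raise ValueError(f"unknown operator: {op}")
    return value[output]


# XOR(a, b) = (a OR b) AND NOT (a AND b), written as a small circuit.
gates: Dict[str, Gate] = {
    "a": ("IN", ("a",)),
    "b": ("IN", ("b",)),
    "or_ab": ("OR", ("a", "b")),
    "and_ab": ("AND", ("a", "b")),
    "not_and": ("NOT", ("and_ab",)),
    "xor": ("AND", ("or_ab", "not_and")),
}
order = ["a", "b", "or_ab", "and_ab", "not_and", "xor"]


def circuit_is_true(a: bool, b: bool) -> bool:
    """The circuit outputs 1 exactly when the query's result set is non-empty."""
    return bool(circuit_to_result_set(gates, order, {"a": a, "b": b}, "xor"))
```

With this encoding, asking whether the circuit outputs 1 is the same question as asking whether the query returns any document, which is the shape of the non-emptiness decision problem described above.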

This result is significant—a retrieval engine supporting $\mathcal{L}_R$ is provably powerful enough to express and solve any property or constraint that an agent might impose which is computable within polynomial time. This far exceeds the typical logic supported by keyword-based or vector retrieval systems.

Structural and Practical Gaps in Existing Systems

The analysis contrasts three primary paradigms:

Term-at-a-Time (TAAT): Materializes intermediate sets, supports DAG reuse and logic, but incurs prohibitive space and linear-in-universe costs for dense intermediaries (especially severe for negated queries).
Document-at-a-Time (DAAT): Optimized for conjunctive trees via iterator hierarchies, scales with posting list length, but fundamentally incapable of efficient DAG execution due to tree-expansion bottlenecks and stateful iterators; suffers exponential blow-up in execution time for queries with repeated subexpressions.
Embedding-based Retrieval: Provides semantic breadth but is mathematically incapable of representing general Boolean logic (e.g., parity, negation) due to fixed latent dimensions and limitations in sign-rank separation; cannot reliably enforce strict exclusion logic.
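The tree-expansion bottleneck attributed to DAAT can be made concrete by counting nodes. The following sketch uses a hypothetical chain-shaped query in which every node references the node below it twice; it is a toy chosen to show the growth rate, not a query from the paper. The DAG has one node per level, while the equivalent tree doubles at every level.

```python
from typing import Dict, Tuple


def build_chain(depth: int) -> Dict[int, Tuple[int, ...]]:
    """Build a DAG where node i lists node i-1 twice as its children."""
    dag: Dict[int, Tuple[int, ...]] = {0: ()}  # node 0 is a leaf term
    for i in range(1, depth + 1):
        dag[i] = (i - 1, i - 1)
    return dag


def dag_size(dag: Dict[int, Tuple[int, ...]]) -> int:
    """Number of distinct nodes: the work when each node is evaluated once."""
    return len(dag)


def tree_size(dag: Dict[int, Tuple[int, ...]], root: int) -> int:
    """Number of nodes after expanding shared subgraphs into a tree."""
    sizes: Dict[int, int] = {}
    # Node ids are topologically ordered by construction in build_chain.
    for node in sorted(dag):
        sizes[node] = 1 + sum(sizes[child] for child in dag[node])
    return sizes[root]


chain = build_chain(20)
shared_nodes = dag_size(chain)         # 21
expanded_nodes = tree_size(chain, 20)  # 2**21 - 1 = 2097151
```

An engine that can only execute trees has to walk the expanded form, so a 21-node query turns into roughly two million iterator nodes.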

The paper identifies strong failure modes in commodity IR engines when confronted with neuro-symbolic logic, non-monotonic arithmetic constraints, and complex agentic queries—a finding corroborated by analytical and empirical stress tests on the MS MARCO corpus.

The ComputePN Algorithm

ComputePN is proposed as a memory- and compute-efficient algorithm for evaluating arbitrary Boolean query DAGs over an inverted index. Its central technical device is the Positive-Negative (PN) response tuple, a dual representation for intermediate document sets that stores an explicit set of document IDs together with a polarity flag (POS or NEG): POS means the result is the stored set, NEG means the result is its complement. Negation merely flips the flag without materializing a universe-sized buffer, and the algebraic rules for conjunction and disjunction keep every intermediate operand output-sensitive, bounded by the union of active posting lists rather than by the universe size. The universe complement is materialized only at the output, if the final result is negative, and never during intermediate computation.
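A minimal sketch of such a positive-negative algebra is shown below. It is an illustration built from the description above, not the paper's implementation: the `PN` class, the function names, and the choice to derive disjunction from conjunction through De Morgan's law are all assumptions. A negative response with stored set S stands for "every document except S".

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Set


@dataclass(frozen=True)
class PN:
    """Stored doc ids plus a polarity flag; negative means 'all docs except these'."""
    docs: FrozenSet[int]
    positive: bool = True


def pn_not(x: PN) -> PN:
    # Negation only flips the flag; no universe-sized buffer is built.
    return PN(x.docs, not x.positive)


def pn_and(a: PN, b: PN) -> PN:
    if a.positive and b.positive:
        return PN(a.docs & b.docs, True)   # A and B
    if a.positive:
        return PN(a.docs - b.docs, True)   # A and not B
    if b.positive:
        return PN(b.docs - a.docs, True)   # B and not A
    return PN(a.docs | b.docs, False)      # not A and not B = not (A or B)


def pn_or(a: PN, b: PN) -> PN:
    # De Morgan: A or B = not (not A and not B).
    return pn_not(pn_and(pn_not(a), pn_not(b)))


def materialize(x: PN, universe: Iterable[int]) -> Set[int]:
    """Only here, at the output, is the universe touched."""
    return set(x.docs) if x.positive else set(universe) - x.docs


# Hypothetical posting lists over a ten-document corpus.
python_docs = PN(frozenset({1, 2, 3}))
java_docs = PN(frozenset({3, 4}))

neither = pn_and(pn_not(python_docs), pn_not(java_docs))
python_only = pn_and(python_docs, pn_not(java_docs))
neither_docs = materialize(neither, range(10))
```

In every case the stored set is built from the operands' stored sets alone, so its size never exceeds their union, whatever the size of the corpus.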

Crucially, ComputePN executes each DAG node once via topological sorting and memoization, exploiting structural sharing and eliminating redundant evaluation. This contrasts asymptotically with the $O(2^{|Q|})$ exponential cost that DAAT incurs on deeply nested, reentrant queries. The stated complexity is $O(|V| \cdot |U_{active}|)$, where $|V|$ is the number of DAG nodes and $|U_{active}|$ is the cumulative size of the active posting lists.
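The evaluate-once property can be sketched with Python's standard `graphlib` module. This is a simplified illustration under stated assumptions: it handles only positive AND/OR over plain sets (polarity handling is left out to keep it short), and the query encoding, the `POSTINGS` data, and the `evaluate` function are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Dict, Set, Tuple

# Hypothetical posting lists: term -> matching document ids.
POSTINGS: Dict[str, Set[int]] = {
    "a": {1, 2, 3, 4},
    "b": {2, 3, 5},
    "c": {3, 4, 5},
}

# Query DAG: node -> (operator, children). "shared" is referenced twice.
QUERY: Dict[str, Tuple[str, Tuple[str, ...]]] = {
    "a": ("TERM", ()),
    "b": ("TERM", ()),
    "c": ("TERM", ()),
    "shared": ("AND", ("a", "b")),
    "left": ("OR", ("shared", "c")),
    "right": ("AND", ("shared", "c")),
    "root": ("AND", ("left", "right")),
}


def evaluate(query, postings, root: str):
    """Evaluate every node once, children first; return (result, evaluations)."""
    deps = {node: set(children) for node, (_, children) in query.items()}
    memo: Dict[str, Set[int]] = {}
    evaluations = 0
    # static_order() yields each node after all of its children.
    for node in TopologicalSorter(deps).static_order():
        op, children = query[node]
        if op == "TERM":
            memo[node] = set(postings[node])
        elif op == "AND":
            memo[node] = set.intersection(*(memo[c] for c in children))
        elif op == "OR":
            memo[node] = set.union(*(memo[c] for c in children))
        else:
            raise ValueError(f"unknown operator: {op}")
        evaluations += 1
    return memo[root], evaluations


result, evaluations = evaluate(QUERY, POSTINGS, "root")
```

The `shared` node feeds both `left` and `right` but is computed a single time, so the number of evaluations equals the number of DAG nodes rather than the size of the expanded tree.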

The architecture supports task-level parallelism, adaptive polarity assignment for leaves (ensuring sparsity), and common subexpression elimination. Notably, ComputePN enables efficient realization of arithmetic and comparative logic (e.g., counting, summation, ripple-carry circuits) directly in the retrieval layer, a capability previously relegated to post-filter or ranking stages.
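Counting inside the retrieval layer can be illustrated with a bit-sliced counter built from half-adders over document sets. The sketch below is an assumption about how such a circuit might be laid out, not the paper's construction: slice `i` of the counter holds the documents whose match count has bit `i` set, and adding a posting list ripples a carry set through the slices. It uses Python's set XOR for brevity; in a Boolean DAG that operator would be expanded into AND, OR, and NOT nodes. All names are hypothetical.

```python
from typing import List, Set


def add_posting(counter: List[Set[int]], posting: Set[int]) -> List[Set[int]]:
    """Add 1 to the per-document count for every doc in `posting`."""
    carry = set(posting)
    out: List[Set[int]] = []
    for bit in counter:
        out.append(bit ^ carry)  # half-adder sum bit
        carry = bit & carry      # half-adder carry into the next slice
    if carry:
        out.append(carry)        # grow the counter by one slice when needed
    return out


def count_matches(postings: List[Set[int]]) -> List[Set[int]]:
    """Return bit slices of 'how many of the posting lists contain each doc'."""
    counter: List[Set[int]] = []
    for posting in postings:
        counter = add_posting(counter, posting)
    return counter


# Hypothetical terms and their posting lists.
cheap = {1, 2, 3}
fast = {2, 3, 4}
quiet = {3, 5}

counter = count_matches([cheap, fast, quiet])
# A count of at least 2 means some slice above slice 0 is set.
at_least_two = set().union(*counter[1:])
# With three inputs the maximum count is 3, which is binary 11.
all_three = counter[0] & counter[1]
```

Here document 3 matches all three terms and document 2 matches two of them, so a "matches at least two of three" constraint is answered entirely with set operations over posting lists.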

Implications for Neuro-Symbolic and LLM-Augmented Retrieval

Practical Impact

The demonstrated expressiveness and efficiency of $\mathcal{L}_R$ and ComputePN underpin a transformation in retrieval system architecture. LLMs transition from heavy-weight, iterative, agentic filtering over broad candidate sets to acting as high-trust compilers—translating user intent and context into precise constraints encoded in query DAGs, which are then executed natively and efficiently by the engine. This design drastically reduces context overhead, candidate set sizes, and agentic planning latency.

Empirical evaluation shows that complex arithmetic constraints expressible in DAGs (e.g., weighted net-positive filtering via digital logic circuits) can be executed in sub-second time (0.8s on MS MARCO with 500-node DAGs), in stark contrast to iterator-based engines that either fail due to combinatorial explosion or revert to universal scan at linear cost.

Theoretical Consequences

The paper argues that capturing $\mathbf{P}$ is both necessary and sufficient for practical neuro-symbolic retrieval. Restriction to acyclic logic circuits (DAGs) preserves polynomial-time bounds, safely ruling out unbounded or Turing-complete queries that would violate service-level guarantees. This positions the index as a high-throughput SIMD-like logic processor supporting agentic reasoning, field-aware exclusion, and native arithmetic operations.

Further, the compositional uniformity of the DAG formalism enables the incorporation and nesting of vector search or semantic gadgets as macros (HyperNodes), promoting deep logical entanglement between symbolic and dense retrieval at arbitrary query graph depth. This is a sharp theoretical and practical differentiation from ad-hoc semantic filtering or top-level API hooks in commercial systems.

Future Developments

By formalizing the retrieval engine as a co-processor capable of simulating any polynomial-time circuit over document attributes, the work creates a foundation for universally expressive and predictably efficient computational indices. Future directions include extending the DAG Boolean logic to support continuous fields and floating-point constraints natively, exploiting bit-sliced indexing for SIMD-like parallelism, and optimizing macro expansion (HyperNode compilation) for dense/ranking signals.

This paradigm is likely to inform the design of future agentic search stacks, hybrid neuro-symbolic retrieval, and context-aware LLM orchestration, further integrating computational logic, vector search, and high-level agentic planning in large-scale, interactive, multi-modal information systems.

Conclusion

The formalization and realization of Boolean retrieval languages that capture $\mathbf{P}$ redefines the expressive boundary and practical capabilities of search engines. Standard IR architectures are provably and empirically inadequate for complex, agentic, polynomial-time reasoning, and attempts to patch them with semantic filtering lack deep composability. The ComputePN algorithm and its underpinning DAG logic architecture offer strict output-sensitive efficiency, compositional flexibility, and polynomial-time guarantees for high-trust, neuro-symbolic queries, transforming the index into a general-purpose computational engine. This advances both the theory and practice of agentic retrieval, directly facilitating the next generation of LLM- and agent-powered search systems.