Split-Correctness in Information Extraction

Published 8 Oct 2018 in cs.DB | (1810.03367v2)

Abstract: Programs for extracting structured information from text, namely information extractors, often operate separately on document segments obtained from a generic splitting operation such as sentences, paragraphs, k-grams, HTTP requests, and so on. An automated detection of this behavior of extractors, which we refer to as split-correctness, would allow text analysis systems to devise query plans with parallel evaluation on segments for accelerating the processing of large documents. Other applications include the incremental evaluation on dynamic content, where re-evaluation of information extractors can be restricted to revised segments, and debugging, where developers of information extractors are informed about potential boundary crossing of different semantic components. We propose a new formal framework for split-correctness within the formalism of document spanners. Our analysis studies the complexity of split-correctness over regular spanners. We also discuss different variants of split-correctness, for instance, in the presence of black-box extractors with split constraints.




Summary

The paper presents a formal definition of split-correctness: the property that applying an IE program to a whole document yields the same result as applying it to the document's individual segments.
It rigorously analyzes computational problems and complexity bounds, including PSPACE-complete and PTIME cases under specific splitter conditions.
The study demonstrates significant practical speedups for parallel and incremental information extraction in large-scale text processing systems.

Split-Correctness in Information Extraction: A Formal and Complexity-Theoretic Analysis

Introduction and Motivation

The paper "Split-Correctness in Information Extraction" (1810.03367) formulates and investigates split-correctness: the problem of determining whether an information extraction (IE) program, modeled as a document spanner, can be correctly decomposed with respect to a given splitting operation (a splitter) on unstructured text. This property is fundamental in declarative text analysis systems, where it enables parallelization, incremental maintenance, and systematic debugging of IE pipelines.

A concrete motivation arises from practical IE development scenarios, where text is naturally segmented (e.g., into sentences, paragraphs, or log entries). If an extraction program is guaranteed to be split-correct relative to a known segmentation, meaning that it can be evaluated segment by segment without changing its output, then system-level optimizations such as parallel evaluation or incremental recomputation become possible, improving throughput and resource utilization.

Formal Framework

The study builds on the semantic model of document spanners—formal objects that extract tuples of spans (substrings defined by their start and end indices) from a document. Spanners encompass and generalize traditional regular expression-based extractors, as well as more expressive models involving relational algebra, automata with capture variables, and Datalog variants.

A splitter is a unary spanner that decomposes a document into segments (spans) according to some criterion, such as sentences, paragraphs, or $N$-grams. The notion of split-correctness (or self-splittability) for a spanner $P$ and splitter $S$ demands that for every document $d$, applying $P$ to $d$ yields the same set of tuples as the (index-adjusted) union of applying $P$ to the substrings specified by $S(d)$.

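A more general problem asks whether $P$ is splittable by $S$ via some other spanner $P_S$; that is, whether there exists a $P_S$ such that $P(d) = \bigcup_{s \in S(d)} P_S(d_s)$ for every document $d$, where $d_s$ is the segment corresponding to span $s$ in $d$ and the spans in $P_S(d_s)$ are shifted to their positions in $d$. Splittability is strictly more general than self-splittability: a spanner may be splittable by $S$ via some $P_S$ without being self-splittable.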
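The definitions above can be made concrete with a small sketch. It assumes 0-based half-open spans and models spanners and splitters as plain Python functions; the names `shift`, `compose`, `sentence_splitter`, and `capitalized_words` are hypothetical and chosen for illustration only, not taken from the paper.

```python
import re
from typing import Callable, Set, Tuple

Span = Tuple[int, int]          # half-open [start, end), 0-based (an assumption)
SpanTuple = Tuple[Span, ...]    # one output tuple of a spanner
Spanner = Callable[[str], Set[SpanTuple]]
Splitter = Callable[[str], Set[Span]]


def shift(t: SpanTuple, s: Span) -> SpanTuple:
    """Translate a segment-local tuple into positions of the whole document."""
    return tuple((a + s[0], b + s[0]) for a, b in t)


def compose(p_s: Spanner, splitter: Splitter) -> Spanner:
    """Build the spanner that runs p_s on every segment and unions the shifted results."""
    def composed(doc: str) -> Set[SpanTuple]:
        out: Set[SpanTuple] = set()
        for s in splitter(doc):
            segment = doc[s[0]:s[1]]
            out |= {shift(t, s) for t in p_s(segment)}
        return out
    return composed


def sentence_splitter(doc: str) -> Set[Span]:
    """Hypothetical splitter: maximal runs of non-period characters."""
    return {m.span() for m in re.finditer(r"[^.]+", doc)}


def capitalized_words(doc: str) -> Set[SpanTuple]:
    """Hypothetical unary spanner: words that start with an uppercase letter."""
    return {(m.span(),) for m in re.finditer(r"\b[A-Z][a-z]*\b", doc)}


doc = "Alice met Bob. Carol left."
whole = capitalized_words(doc)
split = compose(capitalized_words, sentence_splitter)(doc)
```

On this one document the two evaluations agree, which is what self-splittability requires for every document. Agreement on a single input is of course only evidence, not a proof.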

Main Computational Problems and Definitions

Three central decision problems are formalized:

Split-Correctness (Split $\{\mathcal{C}\}$): Given spanners $P$ and $P_S$ and a splitter $S$, does $P = P_S \circ S$ hold?
Splittability ($\mathcal{C}$-Splittability): Given $P$ and $S$, does there exist a spanner $P_S$ such that $P = P_S \circ S$?
Self-Splittability ($\mathcal{C}$-Self): The special case of split-correctness where $P_S = P$; that is, does $P = P \circ S$ hold?
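To illustrate what a negative answer to the first problem looks like, the sketch below searches all documents up to a bounded length for one on which $P$ and $P_S \circ S$ disagree. This is not the paper's decision procedure, which works on automata representations; a bounded search can only refute split-correctness, never establish it. All function names and the three-letter alphabet are assumptions made for the example.

```python
import re
from itertools import product


def compose(p_s, splitter):
    """Run p_s on each segment and shift the resulting spans back into the document."""
    def composed(doc):
        out = set()
        for a, b in splitter(doc):
            out |= {tuple((x + a, y + a) for x, y in t) for t in p_s(doc[a:b])}
        return out
    return composed


def dot_splitter(doc):
    """Hypothetical splitter: maximal runs of non-period characters."""
    return {m.span() for m in re.finditer(r"[^.]+", doc)}


def every_a(doc):
    """Extracts each occurrence of the letter 'a'; needs no context outside a segment."""
    return {(m.span(),) for m in re.finditer(r"a", doc)}


def leading_a(doc):
    """Extracts an 'a' only at the very start of the input; depends on global position."""
    return {(m.span(),) for m in re.finditer(r"^a", doc)}


def find_counterexample(p, p_s, splitter, alphabet, max_len):
    """Return the first document (shortest first) where p and p_s composed with splitter differ."""
    split_version = compose(p_s, splitter)
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            doc = "".join(chars)
            if p(doc) != split_version(doc):
                return doc
    return None


ok = find_counterexample(every_a, every_a, dot_splitter, "ab.", 4)
bad = find_counterexample(leading_a, leading_a, dot_splitter, "ab.", 4)
```

Here `leading_a` fails self-splittability on the document `.a`: evaluated on the whole document it finds nothing, while evaluated on the segment `a` it reports a match.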

These problems are parameterized by the class $\mathcal{C}$ of spanner representations, including regex-formulas, variable-set automata (VSetA), and their sequential, unambiguous, and deterministic subclasses.

Essential combinatorial notions are formalized for tractable analysis:

Cover Condition: Every output tuple of $P$ is covered by some splitter segment.
Highlander Condition: Every output tuple is covered by at most one splitter segment; automatically satisfied if the splitter yields pairwise-disjoint spans (e.g., sentences, paragraphs, assuming proper spanners).
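A minimal sketch of the two conditions, checked on a single document, may help fix the intuition. The paper states them over all documents, so the functions below only test one instance. The sketch assumes that a tuple is "covered" by a segment when every span of the tuple lies inside that segment; the helper names are hypothetical.

```python
import re


def covers(segment, t):
    """True if every span of tuple t lies inside the segment."""
    return all(segment[0] <= a and b <= segment[1] for a, b in t)


def cover_condition(p, splitter, doc):
    """Every output tuple of p on doc is covered by at least one segment."""
    segments = splitter(doc)
    return all(any(covers(s, t) for s in segments) for t in p(doc))


def highlander_condition(p, splitter, doc):
    """Every output tuple of p on doc is covered by at most one segment."""
    segments = splitter(doc)
    return all(sum(covers(s, t) for s in segments) <= 1 for t in p(doc))


def is_disjoint(segments):
    """True if no two segments overlap (checked on start-sorted neighbours)."""
    ordered = sorted(segments)
    return all(prev[1] <= nxt[0] for prev, nxt in zip(ordered, ordered[1:]))


def dot_splitter(doc):
    """Hypothetical disjoint splitter: maximal runs of non-period characters."""
    return {m.span() for m in re.finditer(r"[^.]+", doc)}


def char_bigrams(doc):
    """Hypothetical overlapping splitter: every window of two characters."""
    return {(i, i + 2) for i in range(len(doc) - 1)}


def every_a(doc):
    """Hypothetical unary spanner: each occurrence of the letter 'a'."""
    return {(m.span(),) for m in re.finditer(r"a", doc)}
```

With the overlapping bigram splitter on the document `bab`, the single `a` is covered by both windows, so the cover condition holds but the highlander condition fails. With the disjoint period-based splitter on `a.a`, both conditions hold.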

Expressiveness and Regular Spanner Models

The authors methodically situate various spanner formalisms—regex formulas, VSet-automata, and their normal forms—in an expressiveness hierarchy. Sequential or unambiguous VSet-automata subsume regex formulas, while full regular spanner expressiveness demands closure under certain relational algebraic operations. An explicit variable order condition is required for tractable containment and equivalence tests among automata-based spanner representations.

Complexity Results

The paper delivers an exhaustive and fine-grained complexity landscape for split-correctness and its variants:

General Case: Split-correctness and self-splittability are PSPACE-complete for both regex formulas and VSet-automata, while splittability is in EXPSPACE and PSPACE-hard.
Tractable Fragment: If spanners are proper and splitters are disjoint, or if the highlander condition holds, split-correctness and splittability drop to PTIME for unambiguous and sequential VSet-automata.
Containment: Demonstrated to be PSPACE-complete for regex formulas and (weakly) deterministic VSet-automata. The authors introduce a strictly stronger determinism notion to enable PTIME or NL (nondeterministic logspace) containment checks for unambiguous or deterministic automata, resolving gaps and correcting misclassifications in previous literature.
Extensions: Black-box extractors with split constraints and splittability conditioned on document schemas (regular languages) maintain the above complexity, subject to conditions on the schema constraints and on the splitters involved.

Technical Contributions

A notable portion of the work is devoted to efficient constructions and to an algebraic characterization of composition, showing that associativity, transitivity, and distributivity properties hold under certain conditions. These properties are critical for automated query planning and optimization in relational-style IE pipelines. The construction of canonical split-spanners for arbitrary regular spanners is achieved via an elaborate exponential-size monoid recognizing the set of tuples that can safely be extracted via the splitter decomposition.

The analysis establishes a tight connection between splittability and the classical language primality problem in formal language theory, revealing that the latter’s still open complexity status constrains progress on more advanced splittability questions.

Empirical and Practical Implications

The proposed framework provides criteria for statically identifying when an IE program can be parallelized or incrementally maintained by splitting the input, which is practically critical for high-throughput or dynamic document collections (Wikipedia edits, log files, social media streams). The paper features empirical timing data demonstrating significant speedups for sentence and $N$-gram-based splitting in real-world datasets, confirming theoretical motivations.
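The parallel evaluation plan that split-correctness licenses can be sketched as follows. This is an illustrative outline rather than the system used for the paper's measurements: the extractor, the splitter, and the function name `evaluate_in_parallel` are assumptions. It uses a thread pool to stay self-contained; in CPython a process pool or a distributed engine would be needed for real CPU-bound speedups.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def split_sentences(doc):
    """Hypothetical splitter: maximal runs of non-period characters, in document order."""
    return sorted(m.span() for m in re.finditer(r"[^.]+", doc))


def extract_numbers(text):
    """Hypothetical unary extractor: maximal runs of digits."""
    return {(m.span(),) for m in re.finditer(r"\d+", text)}


def evaluate_in_parallel(extractor, splitter, doc, workers=4):
    """Evaluate the extractor on each segment concurrently and merge the shifted results.

    The merged result equals extractor(doc) only if the extractor is
    self-splittable by the splitter; that property is assumed, not checked here.
    """
    spans = splitter(doc)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        local_results = list(pool.map(extractor, (doc[a:b] for a, b in spans)))
    merged = set()
    for (a, _), tuples in zip(spans, local_results):
        merged |= {tuple((x + a, y + a) for x, y in t) for t in tuples}
    return merged


doc = "Order 12 shipped. Invoice 345 paid."
parallel_result = evaluate_in_parallel(extract_numbers, split_sentences, doc)
```

For this extractor and splitter the merged result matches a single pass over the whole document, since a run of digits never crosses a period.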

From a methods engineering viewpoint, the ability to detect split-correctness automatically allows system-level optimizations, pushing declarative extraction systems closer to the ideal in which end-users specify what they want extracted, leaving how (including parallel and incremental computation) to the optimizer.
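As one illustration of the incremental side, the sketch below caches per-segment results so that only revised segments are re-evaluated when a document changes. The class name `IncrementalExtractor`, the cache keyed by segment text, and the example extractor are all assumptions; the approach is sound only when the extractor is self-splittable by the splitter.

```python
import re


def split_sentences(doc):
    """Hypothetical splitter: maximal runs of non-period characters."""
    return [m.span() for m in re.finditer(r"[^.]+", doc)]


def extract_numbers(text):
    """Hypothetical unary extractor: maximal runs of digits."""
    return {(m.span(),) for m in re.finditer(r"\d+", text)}


class IncrementalExtractor:
    """Re-evaluates the extractor only on segments whose text has not been seen before."""

    def __init__(self, extractor, splitter):
        self.extractor = extractor
        self.splitter = splitter
        self.cache = {}          # segment text -> segment-local output tuples
        self.evaluations = 0     # number of actual extractor calls, for inspection

    def run(self, doc):
        result = set()
        for a, b in self.splitter(doc):
            segment = doc[a:b]
            if segment not in self.cache:
                self.cache[segment] = self.extractor(segment)
                self.evaluations += 1
            # Shift cached segment-local spans to their position in this document.
            result |= {tuple((x + a, y + a) for x, y in t) for t in self.cache[segment]}
        return result


engine = IncrementalExtractor(extract_numbers, split_sentences)
first = engine.run("Order 12 shipped. Invoice 345 paid.")
calls_after_first = engine.evaluations
second = engine.run("Order 12 shipped. Invoice 346 paid.")
calls_after_second = engine.evaluations
```

After the edit, only the second sentence is re-evaluated; the first sentence is served from the cache.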

Further Theoretical Impact and Open Problems

The theoretical analysis is foundational for formal verification and optimization of program transformations in IE engines, especially in heterogeneous or black-box settings (e.g., incorporating neural IE modules where only coarse split constraints are known).

Several open directions are delineated:

Establishing tight complexity bounds for general splittability (especially for unambiguous VSet-automata under the highlander condition).
Extending the framework to more expressive spanner classes (e.g., core or context-free spanners).
Advancing automated reasoning about splitting in sophisticated pipelines with schema constraints, black-box modules, and user-defined splitters.
Clarifying connections to the language primality problem for broader formal language classes.

Conclusion

The paper systematically formalizes and analyzes the split-correctness problem for information extraction, establishing both expressive power characterizations and tight complexity bounds in the setting of regular spanners and splitters. It reveals a correspondence between well-behaved splitter-spanner pairs (notably, those satisfying the highlander condition) and tractable parallelizable extraction, while highlighting the inherent computational cost for the general case. These results have implications for both theory and system design, guiding future development of high-performance, declarative information extraction engines.
