Dynamic Continuous Indexing (DCI)
- Dynamic Continuous Indexing (DCI) is a family of algorithms and data structures enabling real-time, update-friendly indexing for both document retrieval and vector k-NN search.
- It employs innovations like fixed-size block arrays and Double VByte compression to achieve O(1) update costs and up to 36% space savings in document indexing.
- For k-NN search, DCI leverages random projections and prioritized candidate retrieval to optimize query speed and memory usage compared to traditional methods like LSH.
Dynamic Continuous Indexing (DCI) encompasses a family of dynamic data structures and algorithms for high-throughput, immediate-access indexing in information retrieval systems, as well as efficient approximate or exact $k$-nearest neighbor (k-NN) search in high-dimensional spaces. DCI explicitly seeks to address the requirements of real-time ingestion, instantaneous queryability, and minimal space overhead, simultaneously supporting dynamic insertions and deletions with rigorous theoretical and empirical guarantees. Notably, DCI has been developed along two major lines: (1) immediate-access term-based document indexing (Moffat et al., 2022) and (2) vector-space dynamic nearest-neighbor search (Li et al., 2015, Li et al., 2017). Both share core principles of continuous, incremental construction and efficient, update-friendly query routines, but their domains and algorithmic specifics differ.
1. Immediate-Access Dynamic Indexing: System Architecture and Data Structures
DCI for term-based document indexing maintains all postings in RAM within a single, contiguous array of fixed-size blocks, each of size $b$ bytes (Moffat et al., 2022). The global block array is preallocated, enabling $O(1)$-time append operations and eliminating array resizing costs. Each unique term is mapped via an in-memory hash table (of size $2v$, where $v$ is the vocabulary size) to a "head block," which anchors a singly linked chain of blocks comprising the postings list for that term.
Each postings list has the following structure:
- Head block: contains the term’s string, document frequency, "last document ID," the offset of the next free byte in the current tail block, the posting count for the term, and pointers to the first full block and current tail block.
- Full and tail blocks: store compressed $\langle d\text{-gap}, f \rangle$ posting tuples, where the $d$-gap is the difference between consecutive document IDs and $f$ is the term frequency in the document. Each full block devotes all of its bytes to postings except for a fixed-width in-band "next" pointer. The tail block is partially filled, and unused bytes are zeroed for decoding termination.
The architecture avoids pointer-heavy structures by embedding all control data in-band, keeping the pointer-to-payload ratio low. The system is designed to maximize in-RAM density, enable fast parallel ingestion, and provide instantaneous queryability over all newly inserted documents.
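The block layout above can be sketched in a few lines of Python. Everything here is illustrative rather than taken from the paper: the names (`BlockArray`, `PostingsList`), the 64-byte block size, and the uncompressed fixed-width payloads (real DCI stores compressed postings and in-band pointers inside the byte array itself).

```python
# Sketch of DCI's fixed-block storage: one preallocated contiguous
# byte array of b-byte blocks, with each term's postings held in a
# chain of blocks and appended in amortized O(1) time.

BLOCK_SIZE = 64  # bytes per block (b); illustrative value

class BlockArray:
    def __init__(self, num_blocks):
        self.data = bytearray(num_blocks * BLOCK_SIZE)  # contiguous storage
        self.next_free = 0                              # bump allocator

    def alloc(self):
        """O(1) block allocation: hand out the next free block index."""
        idx = self.next_free
        self.next_free += 1
        return idx

class PostingsList:
    """Per-term head state: block chain, tail write offset, posting count."""
    def __init__(self, blocks):
        self.blocks = blocks
        self.chain = [blocks.alloc()]  # chain of block indices for this term
        self.offset = 0                # next free byte in the tail block
        self.count = 0                 # postings stored so far

    def append(self, payload: bytes):
        """Amortized O(1) append; links a new block when the tail fills."""
        if self.offset + len(payload) > BLOCK_SIZE:
            self.chain.append(self.blocks.alloc())
            self.offset = 0
        start = self.chain[-1] * BLOCK_SIZE + self.offset
        self.blocks.data[start:start + len(payload)] = payload
        self.offset += len(payload)
        self.count += 1
```

Because the global array is preallocated and blocks are handed out by a bump allocator, growing a postings list never triggers a copy or reallocation; the tail block simply links to a freshly allocated one.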
2. Compression Techniques: Double VByte and Block Extensibility
The Double VByte encoding scheme is a core innovation in DCI’s document indexing instantiation (Moffat et al., 2022). Unlike standard VByte, which encodes the $d$-gap $g$ and the frequency $f$ in two separate steps, Double VByte combines small frequency values ($0 < f < F$, for a fixed threshold $F$) with $d$-gaps into a single integer, encoded as one VByte value $x = g \cdot F + f$.
Decoding separates $g$ and $f$ by quotient and modulus operations. If $f \geq F$, an additional VByte code stores the outlier value. This method achieves an average posting size of approximately 1.46 bytes on the WSJ1 test collection, representing a 36% savings relative to standard VByte and enabling a total space overhead—including pointers and vocabulary—of about 2 bytes per posting on large real-world datasets such as Wikipedia and WSJ1.
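A runnable sketch of this scheme, under the assumption that the combined value packs the pair as $g \cdot F + f$ and that a zero residue escapes to a separate VByte code carrying the outlier frequency; the escape convention and the value of $F$ here are assumptions for illustration, not taken from the paper.

```python
# Sketch of Double VByte: pack a d-gap g and a small frequency f
# (0 < f < F) into one integer g*F + f, then VByte-encode it.
# Residue 0 is used here as the escape for f >= F (assumed convention).

F = 16  # frequency threshold (illustrative value)

def vbyte_encode(x: int) -> bytes:
    """Standard VByte: 7 payload bits per byte; high bit marks the last byte."""
    out = bytearray()
    while x >= 128:
        out.append(x & 0x7F)
        x >>= 7
    out.append(x | 0x80)  # terminator byte carries the high bit
    return bytes(out)

def vbyte_decode(buf: bytes, pos: int):
    """Decode one VByte value starting at pos; return (value, new_pos)."""
    x, shift = 0, 0
    while True:
        b = buf[pos]
        pos += 1
        x |= (b & 0x7F) << shift
        if b & 0x80:
            return x, pos
        shift += 7

def double_vbyte_encode(gap: int, f: int) -> bytes:
    if 0 < f < F:
        return vbyte_encode(gap * F + f)            # common case: one code
    return vbyte_encode(gap * F) + vbyte_encode(f)  # outlier: escape, then f

def double_vbyte_decode(buf: bytes, pos: int):
    x, pos = vbyte_decode(buf, pos)
    gap, f = divmod(x, F)          # quotient recovers g, modulus recovers f
    if f == 0:                     # escape: outlier frequency follows
        f, pos = vbyte_decode(buf, pos)
    return gap, f, pos
```

In the common case a whole $\langle d\text{-gap}, f \rangle$ posting costs a single VByte code, which is where the savings over encoding the two fields separately comes from.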
By allocating fixed-size blocks and using in-band pointers, DCI achieves dynamic extensibility with minimal space and time overhead, supporting block allocation without indirection or reallocation.
3. Incremental Ingestion and Query Algorithms
DCI supports immediate insertion and continuous queryability through a set of streamlined algorithms (Moffat et al., 2022):
- Document insertion: For each $\langle \text{term}, f \rangle$ pair in a document, the term is hashed to its head block. If absent, a new head block is allocated and initialized. The posting is added to the current tail block, allocating a new block if necessary; all operations are amortized $O(1)$ per posting.
- Conjunctive Boolean queries (DAAT with skip-blocks): Given a set of query terms, the system fetches their head blocks, collects all skip pointers, and then advances block-wise using $d$-gaps to efficiently intersect document lists. Query time depends on the number of query terms plus the sum of the number of blocks per term.
- Top-$k$ scored queries: As documents are identified across terms, their scores (e.g., BM25) are accumulated, and a min-heap is used to track the top $k$ candidates.
The algorithms allow concurrent ingestion and querying, with mean query latencies below 1 ms on 100K-document collections and ingestion throughput of approximately 2 GB/minute on Wikipedia-scale corpora.
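The top-$k$ accumulation step can be sketched with a size-bounded min-heap. The input format and function name are illustrative: the per-term contributions below stand in for real BM25 partial scores computed during postings traversal.

```python
import heapq
from collections import defaultdict

def top_k_scores(term_postings, k):
    """Accumulate per-term score contributions per document, then keep
    the k best documents in a size-bounded min-heap. `term_postings`
    maps a query term to a list of (doc_id, score_contribution) pairs;
    both the name and the format are illustrative."""
    acc = defaultdict(float)
    for postings in term_postings.values():
        for doc_id, contrib in postings:
            acc[doc_id] += contrib
    heap = []  # min-heap of (score, doc_id), never larger than k
    for doc_id, score in acc.items():
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            # new document beats the current k-th best: replace the root
            heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True)  # best-scoring documents first
```

The heap root is always the weakest of the current top $k$, so each scored document costs at most $O(\log k)$, independent of collection size.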
4. DCI for Vector k-Nearest Neighbor Search
Dynamic Continuous Indexing introduces a fundamentally different approach to high-dimensional nearest neighbor search, replacing space partitioning with the construction of "continuous indices" via random projections (Li et al., 2015, Li et al., 2017). For a dataset of $n$ points in $\mathbb{R}^d$, $m \times L$ random unit vectors $u_{jl}$ are sampled, and each data point $p$ is projected onto each vector:

$\bar{p}_{jl} = \langle p, u_{jl} \rangle$
Each simple index is a balanced search structure over these scalar projections. Composite indices group $m$ such structures, and the number of composite indices $L$ is chosen to control the probability of failure.
At query time, a point $q$ is projected onto the same set of vectors. Candidates are identified by "walking outwards" from $q$'s projection in each simple index, incrementing a per-point counter each time that point is encountered among the nearest projections. Once a point has been encountered in all $m$ simple indices of a composite index, it is added to that composite index's candidate set. The process halts after a fixed number of candidates has been gathered (in the data-independent version) or when a data-dependent hypothesis test on the $k$th candidate’s true distance passes.
The final $k$-NN set is extracted by selecting the $k$ points with minimal true Euclidean distance among all candidates identified across composite indices.
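A simplified, data-independent sketch of DCI construction and querying: flat NumPy arrays and a fixed per-index visit budget stand in for the paper's balanced search trees and outward-walking traversal, and all names and parameters are illustrative.

```python
import numpy as np

def build_dci(X, m, L, seed=0):
    """Sample m*L random unit projection directions and project every
    data point onto each; the (L, m, n) projection array stands in for
    the balanced search trees used in the actual algorithm."""
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((L, m, X.shape[1]))
    U /= np.linalg.norm(U, axis=2, keepdims=True)  # random unit vectors
    proj = np.einsum('lmd,nd->lmn', U, X)          # scalar projections
    return U, proj

def dci_query(X, U, proj, q, k, visit_per_index):
    """Data-independent query: in each simple index, visit the points
    whose projections are closest to the query's; a point joins the
    candidate set of composite index l once it has been seen in all m
    of l's simple indices. True distances select the final k-NN."""
    L, m, n = proj.shape
    qproj = np.einsum('lmd,d->lm', U, q)
    candidates = set()
    for l in range(L):
        counts = np.zeros(n, dtype=int)
        for j in range(m):
            # stand-in for "walking outwards" from q's projection
            nearest = np.argsort(np.abs(proj[l, j] - qproj[l, j]))
            for i in nearest[:visit_per_index]:
                counts[i] += 1
                if counts[i] == m:
                    candidates.add(int(i))
    cand = np.array(sorted(candidates))
    true_dist = np.linalg.norm(X[cand] - q, axis=1)
    return cand[np.argsort(true_dist)[:k]]
```

With `visit_per_index = n` every point becomes a candidate and the result is exact; shrinking the budget trades recall for speed, which is the dial the data-independent stopping rule turns.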
The query time achieves

$O\!\left(dk \max\!\left(\log(n/k),\ (n/k)^{1-1/d'}\right)\right)$

where $n$ is the dataset size and $d'$ is the intrinsic (expansion) dimension, which is related to the doubling property of the data.
Standard DCI supports fine-grained trade-offs between speed and recall by adjusting $m$ and $L$, with per-query dynamic stopping in the data-dependent setting. Empirical evaluation demonstrates strong improvements over LSH, with DCI requiring 61–79% fewer candidates for comparable recall on benchmark datasets and using substantially less memory.
5. Prioritized DCI: Accelerated Retrieval in High Intrinsic Dimension
Prioritized DCI refines the candidate exploration schedule of DCI for k-NN search by globally prioritizing proximity in the projected spaces (Li et al., 2017). For each composite index, a priority queue tracks the nearest unvisited projections across its $m$ simple indices. At each step, the projection closest to the query’s projection is selected, its associated candidate is explored, and the queue is updated accordingly. This approach ensures that points are retrieved in ascending order of their maximum projected distance to the query.
The theoretical performance of Prioritized DCI achieves query time

$O\!\left(dk \max\!\left(\log(n/k),\ (n/k)^{1-m/d'}\right)\right)$

up to lower-order terms in $m$ and $\log n$.
A crucial property is that the exponent $1-m/d'$ may be made arbitrarily small by increasing $m$, the number of simple indices per composite index. This enables Prioritized DCI to counteract an exponential increase in neighborhood size (i.e., increased local intrinsic dimension) with a linear increase in space. Empirically, Prioritized DCI affords 14–116× reductions in distance calculations and 21–55× lower memory consumption compared to LSH on image datasets such as CIFAR-100 and MNIST. It is particularly effective on datasets with high intrinsic dimension.
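The prioritized schedule for a single composite index can be sketched as a heap-merged traversal of its $m$ projection streams. This is a simplification of the paper's algorithm (the per-index orderings are precomputed here rather than walked incrementally), and the names are illustrative.

```python
import heapq

def prioritized_candidates(proj, qproj, max_visits):
    """One composite index of Prioritized DCI, sketched with plain lists.
    proj[j][i] is point i's projection in simple index j; qproj[j] is the
    query's projection there. A single priority queue keyed on projected
    distance merges the m per-index streams, so the globally closest
    unvisited projection is always expanded next. A point becomes a
    candidate once it has been visited in all m simple indices."""
    m, n = len(proj), len(proj[0])
    # per simple index: point ids ordered by projected distance to the query
    ranked = [sorted(range(n), key=lambda i: abs(proj[j][i] - qproj[j]))
              for j in range(m)]
    heap = [(abs(proj[j][ranked[j][0]] - qproj[j]), j, 0) for j in range(m)]
    heapq.heapify(heap)
    counts = [0] * n
    out = []  # candidates, in the order they complete all m visits
    for _ in range(max_visits):
        if not heap:
            break
        _, j, pos = heapq.heappop(heap)
        i = ranked[j][pos]
        counts[i] += 1
        if counts[i] == m:          # seen in every simple index
            out.append(i)
        if pos + 1 < n:             # advance this index's stream
            nxt = ranked[j][pos + 1]
            heapq.heappush(heap, (abs(proj[j][nxt] - qproj[j]), j, pos + 1))
    return out
```

Because the heap always pops the globally smallest projected distance, candidates complete their $m$ visits roughly in order of maximum projected distance to the query, which is the ordering property the theory relies on.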
6. Complexity, Space, and Dynamic Update Guarantees
The following table summarizes key complexity and space results across DCI variants:
| Variant | Update Cost | Query Cost | Space (auxiliary) |
|---|---|---|---|
| Immediate-access DCI | $O(1)$ per posting (amortized) | AND: block-wise intersection; TOPK: decoding + heap | ≈2 bytes/posting |
| DCI k-NN (std) | $O(mL(d + \log n))$ per insertion | $O(dk \max(\log(n/k), (n/k)^{1-1/d'}))$ (Thm. 8, Li et al., 2015) | $O(mLn)$ |
| Prioritized DCI | $O(mL(d + \log n))$ per insertion | $O(dk \max(\log(n/k), (n/k)^{1-m/d'}))$ (Li et al., 2017) | $O(mLn)$ |
A central theme is the ability of DCI to balance time and space expenditures via choice of parameters such as $m$ (simple indices per composite index), $L$ (composite indices), the block size $b$, and the threshold $F$ in Double VByte.
In the document-indexing setting, DCI achieves contiguous ingestion and strong compression; in the k-NN setting, it supports dynamic insertion and deletion with space strictly linear in the dataset size.
7. Practical Considerations and Empirical Performance
Across both domains, DCI is distinguished by real-time ingestion speeds and immediate queryability. Typical performance figures for immediate-access document DCI include:
- Indexing throughput: 2GB/min on Wikipedia (6.5M docs) with 2.09 bytes/posting (all overheads included) (Moffat et al., 2022).
- Query latencies: mean 4.1 ms (13.9 ms at 95th percentile) for conjunctive queries on Wikipedia; mean 0.55 ms for WSJ1 (100K docs).
- Conversion to static index: <10 s for multi-gigabyte in-memory shards.
For k-NN search, standard and Prioritized DCI provide dynamic updates, fine-tunable speed-accuracy profiles, and resource usage significantly improved over LSH. On MNIST and CIFAR-100, Prioritized DCI reduces wall-clock time for 100 queries by two orders of magnitude and requires order-of-magnitude less memory (Li et al., 2017).
Parameter selection is guided by intrinsic dataset characteristics: $m$ is set near the expansion dimension $d'$, $L$ is tuned for the desired failure probability, and stopping criteria are adaptively chosen per query to minimize candidate evaluations while preserving recall.
In summary, DCI unifies principles of continuous, update-friendly indexing and randomized multidimensional projections, offering scalable, high-performance solutions to both term-based document retrieval and high-dimensional nearest neighbor search (Moffat et al., 2022, Li et al., 2015, Li et al., 2017).