
Dynamic Continuous Indexing (DCI)

Updated 17 February 2026
  • Dynamic Continuous Indexing (DCI) is a family of algorithms and data structures enabling real-time, update-friendly indexing for both document retrieval and vector k-NN search.
  • It employs innovations like fixed-size block arrays and Double VByte compression to achieve O(1) update costs and up to 36% space savings in document indexing.
  • For k-NN search, DCI leverages random projections and prioritized candidate retrieval to optimize query speed and memory usage compared to traditional methods like LSH.

Dynamic Continuous Indexing (DCI) encompasses a family of dynamic data structures and algorithms for high-throughput, immediate-access indexing in information retrieval systems, as well as efficient approximate or exact k-nearest neighbor (k-NN) search in high-dimensional spaces. DCI explicitly addresses the requirements of real-time ingestion, instantaneous queryability, and minimal space overhead, while supporting dynamic insertions and deletions with rigorous theoretical and empirical guarantees. Notably, DCI has been developed along two major lines: (1) immediate-access term-based document indexing (Moffat et al., 2022) and (2) vector-space dynamic nearest-neighbor search (Li et al., 2015, Li et al., 2017). Both share core principles of continuous, incremental construction and efficient, update-friendly query routines, but their domains and algorithmic specifics differ.

1. Immediate-Access Dynamic Indexing: System Architecture and Data Structures

DCI for term-based document indexing maintains all postings in RAM within a single, contiguous array of fixed-size blocks, each of B bytes (Moffat et al., 2022). The global block array is preallocated, enabling O(1)-time append operations and eliminating array resizing costs. Each unique term is mapped via an in-memory hash table (of size 2v, where v is the vocabulary size) to a "head block," which anchors a singly linked chain of blocks comprising the postings list for that term.

Each postings list has the following structure:

  • Head block: contains the term's string, document frequency, "last document ID," the offset of the next free byte in the current tail block, the posting count for the term, and pointers to the first full block and current tail block.
  • Full and tail blocks: store compressed ⟨d-gap, f⟩ posting tuples, where d-gap is the difference between consecutive document IDs and f is the term frequency in the document. Each full block contains exactly B − h bytes of postings (with an h-byte in-band "next" pointer). The tail block is partially filled, and unused bytes are zeroed for decoding termination.

The architecture avoids pointer-heavy structures by embedding all control data in-band, resulting in a pointer-to-payload ratio of h/(B − h). The system is designed to maximize in-RAM density, enable fast parallel ingestion, and provide instantaneous queryability over all newly inserted documents.
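The block-chain layout above can be sketched in a few lines of Python. This is an illustrative mock-up under assumed values of B and h (the names `BlockArray` and `PostingsList` are not from the paper), not the implementation from Moffat et al. (2022):

```python
# Hypothetical sketch of the fixed-size block-array postings layout.
# One preallocated bytearray holds B-byte blocks; each term's postings
# form a chain linked by an in-band H-byte "next" pointer per block.

B = 64   # block size in bytes (assumed value)
H = 4    # in-band next-pointer width; B - H payload bytes per block

class BlockArray:
    def __init__(self, num_blocks):
        self.mem = bytearray(num_blocks * B)   # preallocated: never resized
        self.next_free = 0                     # bump-pointer allocator

    def alloc(self):
        """O(1) block allocation: return index of a fresh block."""
        idx = self.next_free
        self.next_free += 1
        return idx

    def link(self, block, nxt):
        """Store the H-byte in-band pointer at the front of `block`."""
        off = block * B
        self.mem[off:off + H] = nxt.to_bytes(H, "little")

class PostingsList:
    """Per-term head state: head block, tail block, and tail free offset."""
    def __init__(self, arr):
        self.arr = arr
        self.head = self.tail = arr.alloc()
        self.tail_off = H                      # payload starts after pointer
        self.count = 0                         # postings stored so far

    def append(self, payload: bytes):
        for b in payload:
            if self.tail_off == B:             # tail full: chain a new block
                nxt = self.arr.alloc()
                self.arr.link(self.tail, nxt)
                self.tail, self.tail_off = nxt, H
            self.arr.mem[self.tail * B + self.tail_off] = b
            self.tail_off += 1
        self.count += 1

vocab = {}   # term -> PostingsList (stand-in for the 2v-slot hash table)
```

Because blocks are allocated by bumping a pointer into the preallocated array, an append never triggers reallocation or copying, which is the source of the O(1) amortized update cost.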

2. Compression Techniques: Double VByte and Block Extensibility

The Double VByte encoding scheme is a core innovation in DCI's document indexing instantiation (Moffat et al., 2022). Unlike standard VByte, which encodes the d-gap and f in two separate steps, Double VByte combines small f values (0 < f < F, with a typical F = 4) with d-gaps into a single integer, encoded as

g' = (g − 1) × F + f

Decoding separates g and f by modulus and quotient operations. If f ≥ F, an additional VByte code stores the outlier value. This method achieves an average posting size of approximately 1.46 bytes on the WSJ1 test collection, representing a 36% savings relative to standard VByte and enabling a total space overhead (including pointers and vocabulary) of about 2 bytes per posting on large real-world datasets such as Wikipedia and WSJ1.
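A possible realization of this encode/decode arithmetic is sketched below. The escape convention (a zero f slot signalling that f ≥ F follows as a separate VByte) is an assumption consistent with the description above, not necessarily the paper's exact layout:

```python
# Hedged sketch of Double VByte encoding/decoding.
F = 4  # frequencies 1..F-1 are folded into the gap code

def vbyte(n: int) -> bytes:
    """Standard VByte: 7 payload bits per byte, high bit marks the last byte."""
    out = bytearray()
    while n >= 128:
        out.append(n & 0x7F)
        n >>= 7
    out.append(n | 0x80)
    return bytes(out)

def vbyte_decode(buf, pos):
    n, shift = 0, 0
    while True:
        b = buf[pos]
        pos += 1
        n |= (b & 0x7F) << shift
        if b & 0x80:
            return n, pos
        shift += 7

def encode_posting(d_gap: int, f: int) -> bytes:
    if 0 < f < F:
        return vbyte((d_gap - 1) * F + f)        # g' = (g - 1) * F + f
    return vbyte((d_gap - 1) * F) + vbyte(f)     # escape: f slot = 0, f follows

def decode_posting(buf, pos):
    g_prime, pos = vbyte_decode(buf, pos)
    f = g_prime % F                               # modulus recovers f
    d_gap = g_prime // F + 1                      # quotient recovers the gap
    if f == 0:                                    # outlier: f stored separately
        f, pos = vbyte_decode(buf, pos)
    return d_gap, f, pos
```

For example, a posting with d-gap 5 and f = 3 folds into g' = 4 × 4 + 3 = 19 and fits in a single byte, whereas standard VByte would spend one byte on each component.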

By allocating fixed-size blocks and using in-band pointers, DCI achieves dynamic extensibility with minimal space and time overhead, supporting O(1) block allocation without indirection or reallocation.

3. Incremental Ingestion and Query Algorithms

DCI supports immediate insertion and continuous queryability through a set of streamlined algorithms (Moffat et al., 2022):

  • Document insertion: For each (t, f) pair in a document, the term is hashed to its head block. If absent, a new block is allocated and initialized. The posting ⟨d, f⟩ is added to the current tail block, allocating a new block if necessary; all operations are O(1) amortized per posting.
  • Conjunctive Boolean queries (DAAT with skip-blocks): Given a set of terms, the system fetches their head blocks, collects all skip pointers, and then advances block-wise using d-gaps to efficiently intersect document lists. Query time depends on |Q| log v plus the sum of the number of blocks per term.
  • Top-k scored queries: As documents are identified across terms, their scores (e.g., log(1 + f) · log(1 + N/f_t) or BM25) are accumulated, and a min-heap is used to track the top k candidates.
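The top-k scoring step can be sketched as follows, using the log(1 + f) · log(1 + N/f_t) model from the text; document-at-a-time traversal and skip pointers are omitted for brevity, so this is an illustration of the accumulate-then-heap pattern rather than the paper's query engine:

```python
import heapq
import math

def top_k(postings_by_term, N, k):
    """postings_by_term: {term: [(doc_id, f), ...]}; N: collection size.
    Returns the k highest-scoring (score, doc_id) pairs, best first."""
    scores = {}
    for term, plist in postings_by_term.items():
        idf = math.log(1 + N / len(plist))        # f_t = document frequency
        for doc, f in plist:
            scores[doc] = scores.get(doc, 0.0) + math.log(1 + f) * idf
    heap = []                                      # size-k min-heap of (score, doc)
    for doc, s in scores.items():
        if len(heap) < k:
            heapq.heappush(heap, (s, doc))
        elif s > heap[0][0]:                       # better than current worst
            heapq.heapreplace(heap, (s, doc))
    return sorted(heap, reverse=True)
```

The min-heap keeps only k entries at any time, so memory stays O(k) regardless of how many documents match.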

The algorithms allow concurrent ingestion and querying, with mean query latencies below 1 ms on 100K-document collections and ingestion throughput of approximately 2 GB/minute on Wikipedia-scale corpora.

4. Continuous Indexing for k-NN Search: Random Projections

Dynamic Continuous Indexing introduces a fundamentally different approach to high-dimensional nearest neighbor search, replacing space partitioning with the construction of "continuous indices" via random projections (Li et al., 2015, Li et al., 2017). For a dataset D ⊂ ℝ^d of n points, mL random unit vectors u_{jl} are sampled, and each data point p^i is projected onto each vector:

\overline{p}^i_{jl} = \langle p^i, \mathbf{u}_{jl} \rangle

Each simple index T_{jl} is a balanced search structure over these scalar projections. Composite indices group m such structures, and L is chosen to control the probability of failure.

At query time, a point q is projected onto the same set of vectors. Candidates are identified by "walking outwards" from q's projection in each T_{jl}, incrementing a counter C_l[h] for each point h found among the i nearest projections. Once C_l[h] = m, h is added to the candidate set for composite index l. The process halts after a fixed number of candidates (k̃ in the data-independent version) or when a data-dependent hypothesis test on the k-th candidate's true distance passes.

The final k-NN set is extracted by selecting the k points with minimal true Euclidean distance among all candidates identified across the L composite indices.
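A minimal, illustrative rendition of this construction and query procedure is given below. Sorted projections stand in for balanced search structures, and the outward walk is simulated with a per-query sort (which a real implementation would avoid via two pointers into a sorted array); this is a sketch of the idea, not the authors' implementation:

```python
import numpy as np

def build_dci(X, m, L, rng):
    """Sample m*L random unit vectors and project all n points onto them."""
    U = rng.standard_normal((L, m, X.shape[1]))
    U /= np.linalg.norm(U, axis=2, keepdims=True)       # random unit vectors
    proj = np.einsum('lmd,nd->lmn', U, X)               # proj[l, j, i] = <x_i, u_jl>
    return U, proj

def query_dci(X, U, proj, q, k, k_tilde):
    """Data-independent stopping: visit k_tilde nearest projections per index."""
    n = X.shape[0]
    L, m, _ = U.shape
    candidates = set()
    for l in range(L):
        counts = np.zeros(n, dtype=int)                 # counters C_l[h]
        qp = U[l] @ q                                   # q's m projections
        for j in range(m):
            # outward walk from q's projection, simulated by sorting distances
            walk = np.argsort(np.abs(proj[l, j] - qp[j]))
            for i in walk[:k_tilde]:
                counts[i] += 1
                if counts[i] == m:                      # seen in all m indices
                    candidates.add(int(i))
    cand = np.array(sorted(candidates))
    dists = np.linalg.norm(X[cand] - q, axis=1)         # true distances
    return cand[np.argsort(dists)[:k]]                  # final k-NN among candidates
```

Shrinking `k_tilde` trades recall for speed; with `k_tilde = n` every point becomes a candidate and the query degenerates to exact brute-force k-NN.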

The query time achieves

T_{\mathrm{query}} = O\left(\max\left\{d\,k\,\log\frac{n}{k},\; d\,k\,\left(\frac{n}{k}\right)^{1-1/d'}\right\}\right)

where d' = 1/\log_2 \gamma is the intrinsic (expansion) dimension and γ is related to the doubling property of the data.

Standard DCI supports fine-grained trade-offs between speed and recall by adjusting k̃ and L, with per-query dynamic stopping in the data-dependent setting. Empirical evaluation demonstrates strong improvements over LSH, with DCI requiring 61–79% fewer candidates for comparable recall on benchmark datasets and using substantially less memory.

5. Prioritized DCI: Accelerated Retrieval in High Intrinsic Dimension

Prioritized DCI refines the candidate exploration schedule of DCI for k-NN search by globally prioritizing proximity in the projected spaces (Li et al., 2017). For each composite index, a priority queue P_l tracks the nearest unvisited projections across the m simple indices. At each step, the projection closest to the query's projection is selected, its associated candidate is explored, and the queue is updated accordingly. This ensures that points are retrieved in ascending order of their maximum projected distance to the query.
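This schedule can be sketched for a single composite index as a heap-based merge of the m per-index outward walks. The per-index walks are precomputed by sorting, and the candidate-emission rule (emit when a point's counter reaches m) follows the counter scheme described above; this is an illustrative simplification, not the paper's implementation:

```python
import heapq
import numpy as np

def prioritized_candidates(proj_l, qp, m_required, num_candidates):
    """proj_l: (m, n) projections for one composite index; qp: (m,) query
    projections. Emits points in ascending order of max projected distance."""
    m, n = proj_l.shape
    # Per-index ascending streams of (projected distance, point id): each
    # stream is one simple index's outward walk from q's projection.
    streams = []
    for j in range(m):
        idx = np.argsort(np.abs(proj_l[j] - qp[j]))
        streams.append([(abs(proj_l[j, i] - qp[j]), int(i)) for i in idx])
    # Priority queue over the heads of all m streams.
    heap = [(streams[j][0][0], j, 0) for j in range(m)]
    heapq.heapify(heap)
    counts = np.zeros(n, dtype=int)
    out = []
    while heap and len(out) < num_candidates:
        _, j, r = heapq.heappop(heap)          # globally closest projection
        point = streams[j][r][1]
        counts[point] += 1
        if counts[point] == m_required:        # seen in enough simple indices
            out.append(point)
        if r + 1 < n:                          # advance stream j's walk
            heapq.heappush(heap, (streams[j][r + 1][0], j, r + 1))
    return out
```

Because a point is emitted exactly when its farthest projection is popped, and the heap pops projections in globally ascending distance, candidates emerge in ascending order of their maximum projected distance, as stated above.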

The theoretical performance of Prioritized DCI achieves query time

T_{\text{P-DCI}}(n, d, d', m, k) = O\Bigl( d\,k\,\max\{\log(n/k),\,(n/k)^{1-m/d'}\} + m\,k\,\log m\,\max\{\log(n/k),\,(n/k)^{1-1/d'}\} \Bigr)

A crucial property is that the exponent 1 − m/d' can be made arbitrarily small by taking m ≈ c·d' for c > 1. This enables Prioritized DCI to counteract an exponential increase in neighborhood size (i.e., increased local intrinsic dimension) with only a linear increase in space. Empirically, Prioritized DCI affords 14–116× reductions in distance calculations and 21–55× lower memory consumption compared to LSH on image datasets such as CIFAR-100 and MNIST. It is particularly effective on datasets with high intrinsic dimension.

6. Complexity, Space, and Dynamic Update Guarantees

The following table summarizes key complexity and space results across DCI variants:

| Variant | Update Cost | Query Cost | Space (auxiliary) |
|---|---|---|---|
| Immediate-access DCI | O(1) per posting (amortized) | AND: O(\|Q\| log v + Σ_t ⌈f_t/(B−h)⌉); TOPK: O(\|Q\| log v + decoding + heap) | ~2 bytes/posting |
| DCI k-NN (standard) | O(d + log n) | O(d k max{log(n/k), (n/k)^{1−1/d'}}) (Thm. 8, Li et al., 2015) | O(n) |
| Prioritized DCI | O(m(d + log n)) | O(d k max{log(n/k), (n/k)^{1−m/d'}} + m k log m · max{log(n/k), (n/k)^{1−1/d'}}) (Li et al., 2017) | O(mn) |

A central theme is the ability of DCI to balance time and space expenditure via parameter choices such as m (simple indices per composite), L (composite indices), and the block size B or threshold F in Double VByte.

In the document-indexing setting, DCI achieves contiguous ingestion and strong compression; in the k-NN setting, it supports dynamic insertion and deletion with strictly linear space in the dataset size.

7. Practical Considerations and Empirical Performance

Across both domains, DCI is distinguished by real-time ingestion speeds and immediate queryability. Typical performance figures for immediate-access document DCI include:

  • Indexing throughput: 2GB/min on Wikipedia (6.5M docs) with 2.09 bytes/posting (all overheads included) (Moffat et al., 2022).
  • Query latencies: mean 4.1 ms (13.9 ms at 95th percentile) for conjunctive queries on Wikipedia; mean 0.55 ms for WSJ1 (100K docs).
  • Conversion to static index: <10 s for multi-gigabyte in-memory shards.

For k-NN search, standard and Prioritized DCI provide dynamic updates, fine-tunable speed-accuracy profiles, and resource usage significantly improved over LSH. On MNIST and CIFAR-100, Prioritized DCI reduces wall-clock time for 100 queries by two orders of magnitude and requires order-of-magnitude less memory (Li et al., 2017).

Parameter selection is guided by intrinsic dataset characteristics: m is set near the expansion dimension d', L is tuned for the desired failure probability, and stopping criteria are adaptively chosen per-query to minimize candidate evaluations while preserving recall.

In summary, DCI unifies principles of continuous, update-friendly indexing and randomized multidimensional projections, offering scalable, high-performance solutions to both term-based document retrieval and high-dimensional nearest neighbor search (Moffat et al., 2022, Li et al., 2015, Li et al., 2017).
