
How optimize_indices Improves Query Performance in Lance

The article explains the purpose and inner workings of Lance's optimize_indices function, detailing how it incorporates newly appended data into existing indexes, merges delta indexes, and manages partition adjustments to maintain fast vector and scalar query performance without full re‑training.

Big Data Technology Tribe

In Lance, the optimize_indices function serves two main purposes: (1) it incorporates newly appended data that has not been indexed into the existing index, restoring query speed that would otherwise degrade to a costly scan of unindexed fragments; and (2) it merges multiple delta indexes into a single index version, reducing the number of indexes and the overhead of multi‑way merges during queries.

Function entry points and options

Rust: Dataset::optimize_indices(options)

Python: dataset.optimize.optimize_indices(**kwargs)

Key options (OptimizeOptions):

num_indices_to_merge: the number of delta indexes to merge. None lets the implementation decide; Some(0) only appends new data; Some(N) merges the most recent N indexes.

index_names: specific index names to optimize; None targets all non-system indexes.

retrain: whether to retrain the entire index with all data (supported only for some v3 vector indexes).
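The merge-selection semantics of num_indices_to_merge can be modeled in a few lines. This is a hypothetical Python sketch, not Lance's actual code: OptimizeOptions and indices_to_merge here are stand-ins, and treating None as "merge everything" is a simplification of the real internal heuristic.

```python
# Toy model of OptimizeOptions.num_indices_to_merge semantics.
# All names here are illustrative, not the real Lance API.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class OptimizeOptions:
    num_indices_to_merge: Optional[int] = None  # None = implementation decides
    index_names: Optional[List[str]] = None     # None = all non-system indexes
    retrain: bool = False                       # full retrain (some v3 vector indexes)


def indices_to_merge(deltas: List[str], opts: OptimizeOptions) -> List[str]:
    """Select which existing delta indexes to merge with the new data."""
    n = opts.num_indices_to_merge
    if opts.retrain or n is None:
        # Simplification: merge every delta; the real heuristic is internal.
        return list(deltas)
    # Most recent N deltas, or none at all for Some(0).
    return deltas[-n:] if n > 0 else []


deltas = ["idx-v1", "idx-v2", "idx-v3"]
print(indices_to_merge(deltas, OptimizeOptions(num_indices_to_merge=0)))
print(indices_to_merge(deltas, OptimizeOptions(num_indices_to_merge=2)))
```

With Some(0) the existing deltas are untouched and only a new delta for unindexed data is produced; with Some(2) the two newest deltas are folded into the new index.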

Process overview

The optimization proceeds as follows:

1. Select indexes based on index_names and group versions that share the same name.

2. For each group, call merge_indices to obtain a new UUID, a list of old indexes to delete, a new fragment bitmap, and updated index details.

3. Commit the changes with Transaction::CreateIndex { new_indices, removed_indices }, writing new metadata and removing the merged indexes.

4. Finalize the new version with apply_commit.
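The loop above can be sketched in miniature. This is a hypothetical Python model: optimize, the (name, uuid) tuples, and the transaction dicts are stand-ins for Lance's internal types, and the uuid4 call stands in for the real merge_indices work.

```python
# Toy model of the optimize_indices driver loop: group delta-index
# versions by name, then emit one create/remove transaction per group.
# Names and shapes here are illustrative, not the real Lance internals.
from collections import defaultdict
from uuid import uuid4


def optimize(indexes, index_names=None):
    """indexes: list of (name, uuid) delta-index versions."""
    groups = defaultdict(list)
    for name, uid in indexes:
        if index_names is None or name in index_names:
            groups[name].append(uid)

    transactions = []
    for name, versions in groups.items():
        new_uuid = str(uuid4())  # merge_indices would build the new index here
        transactions.append({
            "name": name,
            "new_indices": [new_uuid],
            "removed_indices": versions,  # old deltas slated for deletion
        })
    return transactions


txs = optimize([("vec_idx", "u1"), ("vec_idx", "u2"), ("btree_idx", "u3")])
print([t["name"] for t in txs])
```

Each transaction pairs the freshly written index with the delta versions it replaces, which is what Transaction::CreateIndex commits atomically.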

Implementation details

1. Retrieving unindexed fragments

The Rust method Dataset::unindexed_fragments(index_name) loads the fragment bitmap of all indexes under the given name, computes the union to identify already indexed fragments, and returns the fragments not present in that union – the data that must be added to the index.
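The bitmap-union logic is simple enough to sketch with Python sets as a stand-in for the fragment bitmaps (the real implementation uses roaring bitmaps); unindexed_fragments here is a hypothetical model of the Rust method described above.

```python
# Sketch of unindexed-fragment discovery: union the fragment bitmaps of
# all delta indexes under a name, then return the fragments outside it.
# Python sets stand in for Lance's roaring fragment bitmaps.
def unindexed_fragments(all_fragment_ids, index_bitmaps):
    indexed = set().union(*index_bitmaps) if index_bitmaps else set()
    return sorted(f for f in all_fragment_ids if f not in indexed)


# Two delta indexes cover fragments {0, 1} and {2}; 3 and 4 are new appends.
print(unindexed_fragments([0, 1, 2, 3, 4], [{0, 1}, {2}]))
```

If no index exists yet for the name, the union is empty and every fragment is returned, which degenerates into a full index build.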

2. Vector index optimization (IVF family)

The core routine merge_indices opens all indexes for the column and name. For vector indexes it invokes optimize_vector_indices (v2 path) or the older v1 implementation.

Key steps of the v2 path:

Extract IVF model components (centroids, quantizer, metric, partition count, sub‑index type) from the first existing index.

Reuse the same IVF and quantizer without retraining, ensuring new and old data share the same partitioning and quantization.

Shuffle unindexed data per partition using IvfShuffler.

Stream unindexed fragments with dataset.scan().with_fragments(unindexed).with_row_id().project([column]) and feed them to the builder via shuffle_data(unindexed).

Pass existing indexes to the builder via with_existing_indices(existing_indices); the builder extracts vectors from matching partitions and merges them with the new data.

Build the new incremental index with IvfIndexBuilder::new_incremental(...).with_ivf(...).with_quantizer(...).with_existing_indices(...).shuffle_data(unindexed).build().await, which:

Selects which existing indexes to merge based on num_indices_to_merge or a full retrain.

For each partition writes a new sub‑index containing both existing vectors and newly assigned unindexed vectors.

Performs partition split or join if a partition becomes too large or too small, potentially triggering a merge of all delta indexes.

Writes the result to a new UUID directory and records the new fragment bitmap in the manifest.
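The central idea of the v2 path, reusing the existing IVF partitioning instead of retraining, can be illustrated with a toy sketch. Everything below is hypothetical: nearest_partition and merge_partition_data are illustrative stand-ins, and plain Euclidean distance stands in for the index's configured metric.

```python
# Toy sketch of IVF reuse: new vectors are assigned to the *existing*
# centroids (no retraining), then merged per partition with the vectors
# pulled out of the old sub-indexes. Names here are illustrative only.
import math


def nearest_partition(vec, centroids):
    # Euclidean distance stands in for the index's configured metric.
    return min(range(len(centroids)), key=lambda i: math.dist(vec, centroids[i]))


def merge_partition_data(centroids, existing_partitions, new_vectors):
    merged = {i: list(v) for i, v in existing_partitions.items()}
    for vec in new_vectors:
        merged.setdefault(nearest_partition(vec, centroids), []).append(vec)
    return merged


centroids = [(0.0, 0.0), (10.0, 10.0)]
existing = {0: [(0.1, 0.2)], 1: [(9.8, 10.1)]}
out = merge_partition_data(centroids, existing, [(0.3, 0.1), (10.2, 9.9)])
print({pid: len(vs) for pid, vs in out.items()})
```

Because old and new vectors go through the same centroids, each rebuilt sub-index stays consistent with queries that probe partitions via the shared IVF model.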

3. Scalar index optimization

When index_type.is_scalar() is true, the process does not treat delta indexes separately. It unions the bitmap of all currently indexed fragments, adds the unindexed fragments, and calls index.update(new_data_stream, &new_store) to create a new scalar index UUID that incorporates the new data.
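A scalar update can be pictured as appending posting entries and widening the covered-fragment set. This is a hypothetical sketch: update_scalar_index and the value-to-row-ids dict are toy stand-ins for Lance's scalar index stores, not the actual index.update implementation.

```python
# Toy model of a scalar index update: the old index plus rows streamed
# from unindexed fragments yields a new index version whose coverage is
# the union of old and new fragments. Illustrative names and shapes only.
def update_scalar_index(old_index, old_fragments, new_rows, new_fragments):
    """old_index: value -> row ids; new_rows: (row_id, value) pairs."""
    new_index = {value: list(rows) for value, rows in old_index.items()}
    for row_id, value in new_rows:
        new_index.setdefault(value, []).append(row_id)
    covered = set(old_fragments) | set(new_fragments)  # new fragment bitmap
    return new_index, covered


idx, covered = update_scalar_index({"a": [0]}, {0}, [(100, "b"), (101, "a")], {1})
print(idx, covered)
```

Note that, unlike the vector path, there is no per-delta merge decision: the result is always a single new index version covering all fragments.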

4. Merge count and partition adjustments

Some(0): only add unindexed data; do not merge existing indexes.

Some(N): merge the most recent N indexes plus the unindexed data into a single new index.

None: the implementation decides; if a partition split or join is needed (e.g., a partition is too large or too small), all delta indexes for that name are merged and the partitions are rewritten.

During building, the system may split or join partitions based on current data volume, updating IVF centroids accordingly.
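The split/join trigger reduces to size thresholds per partition. The sketch below is purely illustrative: the thresholds and the partition_actions helper are invented for the example, since Lance's actual heuristics are internal.

```python
# Toy model of the split/join decision: partitions above a size ceiling
# are split, those below a floor are joined into neighbors. The
# thresholds and function name here are illustrative, not Lance's.
def partition_actions(partition_sizes, min_size=100, max_size=10_000):
    actions = {}
    for pid, size in partition_sizes.items():
        if size > max_size:
            actions[pid] = "split"   # too large: rewrite as several partitions
        elif size < min_size:
            actions[pid] = "join"    # too small: fold into a neighbor
    return actions


print(partition_actions({0: 50, 1: 5_000, 2: 20_000}))
```

When any partition needs such a rewrite, the centroids change, which is why all delta indexes for that name must be merged in the same pass.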

In summary, optimize_indices performs incremental index updates by adding new data to existing indexes and optionally merging delta indexes, thereby preserving high‑performance vector and scalar queries in a continuously ingesting environment without requiring full re‑training.

Tags: Lance, vector index, IVF, optimize_indices
Written by

Big Data Technology Tribe

Focused on computer science and cutting‑edge tech, we distill complex knowledge into clear, actionable insights. We track tech evolution, share industry trends and deep analysis, helping you keep learning, boost your technical edge, and ride the digital wave forward.
