How Lance Builds Scalar and Vector Indexes: A Deep Dive into create_index
This article explains how Lance's Python API creates scalar and vector indexes, walks through the internal Rust implementation of the create_index workflow, and details the transaction, commit, and error‑handling mechanisms that ensure atomic and consistent index creation.
API Usage Examples
Scalar index example using the Python lance_ray API:
# Assume a LanceDataset with a numeric column "id" exists at this path
import lance_ray as lr
updated_dataset = lr.create_scalar_index(
    uri="path/to/dataset",
    column="id",
    index_type="BTREE",
    name="btree_multiple_fragment_idx",
    replace=False,
    num_workers=4,
)

# Example queries
updated_dataset.scanner(filter="id = 100", columns=["id", "text"]).to_table()
updated_dataset.scanner(filter="id >= 200 AND id < 800", columns=["id", "text"]).to_table()

Vector index examples (distributed IVF_PQ, IVF_SQ, IVF_FLAT):
import lance_ray as lr
# Build a distributed IVF_PQ index
updated_dataset = lr.create_index(
    uri="path/to/dataset.lance",
    column="vector",
    index_type="IVF_PQ",
    name="idx_ivf_pq",
    num_workers=4,
    num_partitions=256,
    num_sub_vectors=16,
    metric="l2",
)

# Build a distributed IVF_SQ index
updated_dataset = lr.create_index(
    uri="path/to/dataset.lance",
    column="vector",
    index_type="IVF_SQ",
    name="idx_ivf_sq",
    num_workers=4,
    num_partitions=256,
)

# Build a distributed IVF_FLAT index
updated_dataset = lr.create_index(
    uri="path/to/dataset.lance",
    column="vector",
    index_type="IVF_FLAT",
    name="idx_ivf_flat",
    num_workers=4,
    num_partitions=256,
)

create_index Implementation Overview
The Rust create_index function follows a multi‑step workflow:
create_index (index.rs)
│
├─> create_index_builder (index.rs)
│     └─> CreateIndexBuilder::new
│
└─> builder.replace(replace).await
      │
      └─> CreateIndexBuilder::execute (create.rs)
            │
            ├─> execute_uncommitted
            │     └─> match (index_type, params)
            │           ├─> Scalar Index → build_scalar_index
            │           └─> Vector Index → build_vector_index / build_distributed_vector_index / build_empty_vector_index
            │
            └─> apply_commit (commit transaction)

The call builder.replace(replace).await works because CreateIndexBuilder implements the IntoFuture trait. The compiler rewrites the expression to std::future::IntoFuture::into_future(builder.replace(replace)).await, which internally calls self.execute().
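To make the mechanism concrete, here is a minimal, self-contained sketch of the IntoFuture pattern on a toy builder. The struct, the string result, and the Ready future are stand-ins chosen to keep the example dependency-free; they are not Lance's actual definitions, which box an async execute() call.

```rust
use std::future::{self, IntoFuture, Ready};

struct CreateIndexBuilder {
    replace: bool,
}

impl CreateIndexBuilder {
    // Builder setters return Self so calls can be chained before `.await`.
    fn replace(mut self, replace: bool) -> Self {
        self.replace = replace;
        self
    }

    // Stands in for the real async execute(); completes immediately here.
    fn execute(self) -> Result<String, String> {
        Ok(format!("index created (replace = {})", self.replace))
    }
}

// Implementing IntoFuture is what makes `builder.replace(true).await` legal:
// the compiler rewrites it to `IntoFuture::into_future(...).await`.
impl IntoFuture for CreateIndexBuilder {
    type Output = Result<String, String>;
    type IntoFuture = Ready<Self::Output>;

    fn into_future(self) -> Self::IntoFuture {
        future::ready(self.execute())
    }
}
```

The design lets a builder double as a future, so callers can either keep chaining configuration methods or `.await` the builder directly.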
execute Method Details
The async execute function builds the index and then commits it. Core steps:

1. Call execute_uncommitted().await? to build the index without committing.
2. Validate column existence, check for an empty dataset, load existing indexes, and generate or parse a UUID.
3. Dispatch to the appropriate builder based on index type (scalar → build_scalar_index, vector → build_vector_index or build_distributed_vector_index).
4. Save the new index UUID for later lookup.
5. Create a Transaction with the dataset version and an Operation::CreateIndex payload.
6. Apply the transaction via self.dataset.apply_commit(transaction, &Default::default(), &Default::default()).await?, which writes a new manifest and atomically updates the _latest.manifest pointer.
7. Reload index metadata with self.dataset.load_indices().await? to capture any automatic actions (index merging, delta indices, optimizations) triggered during the commit.
8. Find the just-created index by matching the saved UUID and return its IndexMetadata, or raise a clear internal error if it is not found.
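The steps above can be condensed into a schematic. This is not Lance's real code: Dataset, the string errors, and the uuid-for-* naming are invented stand-ins that model the build → commit → reload → lookup-by-UUID shape.

```rust
#[derive(Clone)]
struct IndexMetadata {
    uuid: String,
    name: String,
}

struct Dataset {
    version: u64,
    indices: Vec<IndexMetadata>,
}

// Phase 1: build the index files without touching the manifest.
fn execute_uncommitted(name: &str) -> IndexMetadata {
    IndexMetadata {
        uuid: format!("uuid-for-{name}"),
        name: name.to_string(),
    }
}

fn execute(dataset: &mut Dataset, name: &str) -> Result<IndexMetadata, String> {
    let new_idx = execute_uncommitted(name);
    let index_uuid = new_idx.uuid.clone(); // saved for the post-commit lookup

    // Phase 2: commit — register the index in a new manifest version.
    dataset.indices.push(new_idx);
    dataset.version += 1;

    // Reload and find the index by UUID; a miss here is an internal error.
    dataset
        .indices
        .iter()
        .find(|idx| idx.uuid == index_uuid)
        .cloned()
        .ok_or_else(|| format!("Index with UUID {index_uuid} not found after commit"))
}
```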
Transaction and Commit Mechanics
The transaction records the dataset version for optimistic concurrency control; if another writer changes the dataset during index construction, the version mismatch causes the commit to fail, forcing the client to retry. apply_commit receives three arguments: the Transaction, a ManifestWriteConfig (default), and a CommitConfig (default). It writes a new manifest, updates the dataset’s internal state, and returns Ok(()).
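A toy model of this optimistic check follows. The types are heavily simplified: the real apply_commit also takes a ManifestWriteConfig and CommitConfig and persists a manifest file, which this in-memory sketch omits.

```rust
struct Manifest {
    version: u64,
    indices: Vec<String>,
}

struct Dataset {
    manifest: Manifest,
}

struct Transaction {
    read_version: u64, // dataset version captured when the build started
    new_index: String,
}

impl Dataset {
    // Commit succeeds only if no other writer advanced the version since the
    // transaction captured it (optimistic concurrency control).
    fn apply_commit(&mut self, tx: Transaction) -> Result<(), String> {
        if tx.read_version != self.manifest.version {
            return Err(format!(
                "conflict: dataset moved from v{} to v{}",
                tx.read_version, self.manifest.version
            ));
        }
        self.manifest.indices.push(tx.new_index);
        self.manifest.version += 1; // write a new manifest and bump the pointer
        Ok(())
    }
}
```

The version check is the entire concurrency story: no locks are held during the (potentially long) index build, and conflicts surface only at commit time.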
Loading Index Metadata
load_indices() first checks an in-memory cache; on a miss it reads the manifest from storage, handles special fragment-reuse logic, and returns the full list of IndexMetadata objects.
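The cache-then-storage behavior can be sketched as follows; the miss counter and the read_manifest_from_storage stub are illustrative additions, not part of Lance.

```rust
use std::collections::HashMap;

struct IndexStore {
    cache: HashMap<u64, Vec<String>>, // index lists keyed by manifest version
    misses: u32,
}

impl IndexStore {
    fn load_indices(&mut self, version: u64) -> Vec<String> {
        // Fast path: serve repeated lookups from the in-memory cache.
        if let Some(indices) = self.cache.get(&version) {
            return indices.clone();
        }
        // Slow path: read the manifest from storage and populate the cache.
        self.misses += 1;
        let indices = read_manifest_from_storage(version);
        self.cache.insert(version, indices.clone());
        indices
    }
}

// Stand-in for the manifest read Lance performs against object storage.
fn read_manifest_from_storage(_version: u64) -> Vec<String> {
    vec!["btree_idx".to_string(), "ivf_pq_idx".to_string()]
}
```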
Error Handling
If the UUID lookup fails after a successful commit, the code returns an Error::Internal with a formatted message containing the missing UUID and the source location, making debugging straightforward.
Key Design Points
Two‑Phase Commit
// Phase 1: Build index (no commit)
let new_idx = self.execute_uncommitted().await?;
// Phase 2: Commit transaction
self.dataset.apply_commit(transaction, ...).await?;

This ensures that a failed build does not corrupt dataset state and allows pre-commit validation and rollback.
Transaction Isolation
Transaction::new(
    new_idx.dataset_version,
    Operation::CreateIndex { new_indices: vec![new_idx], removed_indices: vec![] },
    None,
);

If the dataset changes between the two phases, the commit aborts, preserving consistency.
Reload for Consistency
let indices = self.dataset.load_indices().await?;

Reloading captures automatic actions (e.g., index merging) triggered during the commit.
Error‑Handling Strategy
.ok_or_else(|| Error::Internal {
    message: format!("Index with UUID {} not found after commit", index_uuid),
    location: location!(),
})

Potential Issues
Concurrent conflicts: Simultaneous creation of the same index name leads to the first succeeding and the second failing due to version mismatch; the client must retry.
Partial failures: If execute_uncommitted succeeds but apply_commit fails, the index file exists but is not registered in the manifest, requiring garbage‑collection.
Performance overhead: load_indices may read the entire manifest from disk, which can be costly for datasets with many indexes; cache hit rate becomes critical.
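For the concurrent-conflict case, client-side handling typically reduces to a bounded retry loop. A generic sketch follows; commit_with_retry is a name invented here for illustration, not a Lance API.

```rust
// Retry a commit closure up to `max_retries` extra times. Each Err models a
// version-mismatch conflict, after which a real client would refresh to the
// latest dataset version and rebuild the transaction before the next attempt.
// Returns the zero-based attempt number that succeeded.
fn commit_with_retry<F>(mut attempt: F, max_retries: u32) -> Result<u32, String>
where
    F: FnMut() -> Result<(), String>,
{
    let mut last_err = String::from("no attempts made");
    for try_no in 0..=max_retries {
        match attempt() {
            Ok(()) => return Ok(try_no),
            Err(e) => last_err = e, // conflict: refresh state, then loop
        }
    }
    Err(last_err)
}
```

Because index construction itself is expensive, clients should cap retries and rebuild only the transaction, not the index files, when the conflict does not invalidate the built artifact.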