How Lance Builds Scalar and Vector Indexes: A Deep Dive into create_index
This article explains how Lance's Python API creates scalar and vector indexes, walks through the internal Rust implementation of the create_index workflow, and details the transaction, commit, and error‑handling mechanisms that ensure atomic and consistent index creation.
API Usage Examples
Scalar index example using the Python lance_ray API:
# Assume a LanceDataset with a numeric column "id" exists at this path
import lance_ray as lr
updated_dataset = lr.create_scalar_index(
    uri="path/to/dataset",
    column="id",
    index_type="BTREE",
    name="btree_multiple_fragment_idx",
    replace=False,
    num_workers=4,
)

# Example queries
updated_dataset.scanner(filter="id = 100", columns=["id", "text"]).to_table()
updated_dataset.scanner(filter="id >= 200 AND id < 800", columns=["id", "text"]).to_table()

Vector index examples (distributed IVF_PQ, IVF_SQ, IVF_FLAT):
import lance_ray as lr
# Build a distributed IVF_PQ index
updated_dataset = lr.create_index(
    uri="path/to/dataset.lance",
    column="vector",
    index_type="IVF_PQ",
    name="idx_ivf_pq",
    num_workers=4,
    num_partitions=256,
    num_sub_vectors=16,
    metric="l2",
)

# Build a distributed IVF_SQ index
updated_dataset = lr.create_index(
    uri="path/to/dataset.lance",
    column="vector",
    index_type="IVF_SQ",
    name="idx_ivf_sq",
    num_workers=4,
    num_partitions=256,
)

# Build a distributed IVF_FLAT index
updated_dataset = lr.create_index(
    uri="path/to/dataset.lance",
    column="vector",
    index_type="IVF_FLAT",
    name="idx_ivf_flat",
    num_workers=4,
    num_partitions=256,
)

create_index Implementation Overview
The Rust create_index function follows a multi‑step workflow:
create_index (index.rs)
│
├─> create_index_builder (index.rs)
│     └─> CreateIndexBuilder::new
│
└─> builder.replace(replace).await
      │
      └─> CreateIndexBuilder::execute (create.rs)
            │
            ├─> execute_uncommitted
            │     └─> match (index_type, params)
            │           ├─> Scalar Index → build_scalar_index
            │           └─> Vector Index → build_vector_index / build_distributed_vector_index / build_empty_vector_index
            │
            └─> apply_commit (commit transaction)

The call builder.replace(replace).await works because CreateIndexBuilder implements the IntoFuture trait. The compiler rewrites the expression to std::future::IntoFuture::into_future(builder.replace(replace)).await, which internally calls self.execute().
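To make the mechanism concrete, here is a minimal, self-contained sketch of the IntoFuture pattern on a toy builder. The struct, the string result, and the Ready future are stand-ins chosen to keep the example dependency-free; they are not Lance's actual definitions, which box an async execute() call.

```rust
use std::future::{self, IntoFuture, Ready};

struct CreateIndexBuilder {
    replace: bool,
}

impl CreateIndexBuilder {
    // Builder setters return Self so calls can be chained before `.await`.
    fn replace(mut self, replace: bool) -> Self {
        self.replace = replace;
        self
    }

    // Stands in for the real async execute(); completes immediately here.
    fn execute(self) -> Result<String, String> {
        Ok(format!("index created (replace = {})", self.replace))
    }
}

// Implementing IntoFuture is what makes `builder.replace(true).await` legal:
// the compiler rewrites it to `IntoFuture::into_future(...).await`.
impl IntoFuture for CreateIndexBuilder {
    type Output = Result<String, String>;
    type IntoFuture = Ready<Self::Output>;

    fn into_future(self) -> Self::IntoFuture {
        future::ready(self.execute())
    }
}
```

The design lets a builder double as a future, so callers can either keep chaining configuration methods or `.await` the builder directly.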
execute Method Details
The async execute function builds the index and then commits it. Core steps:

1. Call execute_uncommitted().await? to build the index without committing.
2. Validate column existence, check for an empty dataset, load existing indexes, and generate or parse a UUID.
3. Dispatch to the appropriate builder based on index type (scalar → build_scalar_index, vector → build_vector_index or build_distributed_vector_index).
4. Save the new index UUID for later lookup.
5. Create a Transaction with the dataset version and an Operation::CreateIndex payload.
6. Apply the transaction via self.dataset.apply_commit(transaction, &Default::default(), &Default::default()).await?, which writes a new manifest and atomically updates the _latest.manifest pointer.
7. Reload index metadata with self.dataset.load_indices().await? to capture any automatic actions (index merging, delta indices, optimizations) triggered during the commit.
8. Find the just-created index by matching the saved UUID and return its IndexMetadata, or raise a clear internal error if it is not found.
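The steps above can be condensed into a schematic. This is not Lance's real code: Dataset, the string errors, and the uuid-for-* naming are invented stand-ins that model the build → commit → reload → lookup-by-UUID shape.

```rust
#[derive(Clone)]
struct IndexMetadata {
    uuid: String,
    name: String,
}

struct Dataset {
    version: u64,
    indices: Vec<IndexMetadata>,
}

// Phase 1: build the index files without touching the manifest.
fn execute_uncommitted(name: &str) -> IndexMetadata {
    IndexMetadata {
        uuid: format!("uuid-for-{name}"),
        name: name.to_string(),
    }
}

fn execute(dataset: &mut Dataset, name: &str) -> Result<IndexMetadata, String> {
    let new_idx = execute_uncommitted(name);
    let index_uuid = new_idx.uuid.clone(); // saved for the post-commit lookup

    // Phase 2: commit — register the index in a new manifest version.
    dataset.indices.push(new_idx);
    dataset.version += 1;

    // Reload and find the index by UUID; a miss here is an internal error.
    dataset
        .indices
        .iter()
        .find(|idx| idx.uuid == index_uuid)
        .cloned()
        .ok_or_else(|| format!("Index with UUID {index_uuid} not found after commit"))
}
```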
Transaction and Commit Mechanics
The transaction records the dataset version for optimistic concurrency control; if another writer changes the dataset during index construction, the version mismatch causes the commit to fail, forcing the client to retry. apply_commit receives three arguments: the Transaction, a ManifestWriteConfig (default), and a CommitConfig (default). It writes a new manifest, updates the dataset’s internal state, and returns Ok(()).
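A toy model of this optimistic check follows. The types are heavily simplified: the real apply_commit also takes a ManifestWriteConfig and CommitConfig and persists a manifest file, which this in-memory sketch omits.

```rust
struct Manifest {
    version: u64,
    indices: Vec<String>,
}

struct Dataset {
    manifest: Manifest,
}

struct Transaction {
    read_version: u64, // dataset version captured when the build started
    new_index: String,
}

impl Dataset {
    // Commit succeeds only if no other writer advanced the version since the
    // transaction captured it (optimistic concurrency control).
    fn apply_commit(&mut self, tx: Transaction) -> Result<(), String> {
        if tx.read_version != self.manifest.version {
            return Err(format!(
                "conflict: dataset moved from v{} to v{}",
                tx.read_version, self.manifest.version
            ));
        }
        self.manifest.indices.push(tx.new_index);
        self.manifest.version += 1; // write a new manifest and bump the pointer
        Ok(())
    }
}
```

The version check is the entire concurrency story: no locks are held during the (potentially long) index build, and conflicts surface only at commit time.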
Loading Index Metadata
load_indices() first checks an in-memory cache; on a miss it reads the manifest from storage, handles special fragment-reuse logic, and returns the full list of IndexMetadata objects.
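The cache-then-storage behavior can be sketched as follows; the miss counter and the read_manifest_from_storage stub are illustrative additions, not part of Lance.

```rust
use std::collections::HashMap;

struct IndexStore {
    cache: HashMap<u64, Vec<String>>, // index lists keyed by manifest version
    misses: u32,
}

impl IndexStore {
    fn load_indices(&mut self, version: u64) -> Vec<String> {
        // Fast path: serve repeated lookups from the in-memory cache.
        if let Some(indices) = self.cache.get(&version) {
            return indices.clone();
        }
        // Slow path: read the manifest from storage and populate the cache.
        self.misses += 1;
        let indices = read_manifest_from_storage(version);
        self.cache.insert(version, indices.clone());
        indices
    }
}

// Stand-in for the manifest read Lance performs against object storage.
fn read_manifest_from_storage(_version: u64) -> Vec<String> {
    vec!["btree_idx".to_string(), "ivf_pq_idx".to_string()]
}
```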
Error Handling
If the UUID lookup fails after a successful commit, the code returns an Error::Internal with a formatted message containing the missing UUID and the source location, making debugging straightforward.
Key Design Points
Two‑Phase Commit
// Phase 1: Build index (no commit)
let new_idx = self.execute_uncommitted().await?;
// Phase 2: Commit transaction
self.dataset.apply_commit(transaction, ...).await?;

This ensures that a failed build does not corrupt dataset state and allows pre-commit validation and rollback.
Transaction Isolation
Transaction::new(
    new_idx.dataset_version,
    Operation::CreateIndex { new_indices: vec![new_idx], removed_indices: vec![] },
    None,
);

If the dataset changes between the two phases, the commit aborts, preserving consistency.
Reload for Consistency
let indices = self.dataset.load_indices().await?;

Reloading captures automatic actions (e.g., index merging) triggered during the commit.
Error‑Handling Strategy
.ok_or_else(|| Error::Internal {
    message: format!("Index with UUID {} not found after commit", index_uuid),
    location: location!(),
})

Potential Issues
Concurrent conflicts: Simultaneous creation of the same index name leads to the first succeeding and the second failing due to version mismatch; the client must retry.
Partial failures: If execute_uncommitted succeeds but apply_commit fails, the index file exists but is not registered in the manifest, requiring garbage‑collection.
Performance overhead: load_indices may read the entire manifest from disk, which can be costly for datasets with many indexes; cache hit rate becomes critical.
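For the concurrent-conflict case, client-side handling typically reduces to a bounded retry loop. A generic sketch follows; commit_with_retry is a name invented here for illustration, not a Lance API.

```rust
// Retry a commit closure up to `max_retries` extra times. Each Err models a
// version-mismatch conflict, after which a real client would refresh to the
// latest dataset version and rebuild the transaction before the next attempt.
// Returns the zero-based attempt number that succeeded.
fn commit_with_retry<F>(mut attempt: F, max_retries: u32) -> Result<u32, String>
where
    F: FnMut() -> Result<(), String>,
{
    let mut last_err = String::from("no attempts made");
    for try_no in 0..=max_retries {
        match attempt() {
            Ok(()) => return Ok(try_no),
            Err(e) => last_err = e, // conflict: refresh state, then loop
        }
    }
    Err(last_err)
}
```

Because index construction itself is expensive, clients should cap retries and rebuild only the transaction, not the index files, when the conflict does not invalidate the built artifact.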