How Lance Implements Merge‑Insert: Upserts, Deletes, and Deduplication Explained
This article explains Lance's merge‑insert operation: its SQL‑like semantics, typical use cases such as bulk upserts and conditional deletes, the underlying DataFusion planning and execution flow, how the __action column is generated, and how duplicate source rows are handled.
What is merge‑insert? In Lance, merge‑insert mirrors SQL MERGE: it aligns new rows (source) with existing rows (target) on key columns and then decides, per row, whether to insert, update, delete, or leave unchanged. The operation is defined in merge_insert.rs and requires a meaningful key column for matching.
Typical capabilities include:
Bulk update: set when_matched to update existing rows (e.g., UpdateAll).
Find‑or‑create: default behavior keeps matched rows unchanged and inserts unmatched rows.
Upsert: update when matched, insert when not matched.
Conditional delete‑and‑replace: use when_not_matched_by_source to delete or predicate‑delete rows not covered by the source, then insert new rows.
The default builder semantics (as documented in the Rust comments) are: keep matched rows, insert new rows, and retain old rows that are not matched, which can be altered via the builder chain to achieve upsert or delete behavior.
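The default-versus-upsert clause combinations above can be sketched as a small decision table. This is an illustrative model, not Lance's actual builder: the enum and struct names below (MergeClauses, WhenMatched, etc.) are hypothetical stand-ins for the clauses the Rust comments describe.

```rust
// Illustrative sketch of merge-insert clause combinations; names are
// hypothetical, not Lance's real builder types.
#[derive(Debug, PartialEq)]
enum WhenMatched { DoNothing, UpdateAll }
#[derive(Debug, PartialEq)]
enum WhenNotMatched { InsertAll, DoNothing }
#[derive(Debug, PartialEq)]
enum WhenNotMatchedBySource { Keep, Delete }

struct MergeClauses {
    when_matched: WhenMatched,
    when_not_matched: WhenNotMatched,
    when_not_matched_by_source: WhenNotMatchedBySource,
}

impl Default for MergeClauses {
    // Documented defaults: keep matched rows unchanged, insert new rows,
    // and retain old rows not matched by the source (find-or-create).
    fn default() -> Self {
        Self {
            when_matched: WhenMatched::DoNothing,
            when_not_matched: WhenNotMatched::InsertAll,
            when_not_matched_by_source: WhenNotMatchedBySource::Keep,
        }
    }
}

impl MergeClauses {
    /// Upsert: update when matched, insert when not matched.
    fn upsert() -> Self {
        Self { when_matched: WhenMatched::UpdateAll, ..Self::default() }
    }
}
```

Altering only the relevant clause from the defaults, as the builder chain does, is what turns find-or-create into an upsert or a delete-and-replace.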
Implementation details
Planning and execution
Lance builds logical and physical plans with DataFusion, introducing a custom node MergeInsertWriteNode. The node receives the following columns:
/// Expects input schema:
/// * `source.{col1, col2, ...}` - columns from the source relation
/// * `target.{col1, col2, ...}` - columns from the target relation
/// * `target._rowaddr` - special column to locate existing rows in the target
/// * `__action` - unqualified column that describes the action to perform.
/// See [`super::assign_action::merge_insert_action`]
Generating the __action column
The function assign_action::merge_insert_action translates the user‑specified clauses ( when_matched, when_not_matched, when_not_matched_by_source) into a CASE expression that yields an Action enum value such as Nothing, UpdateAll, Insert, Delete, or Fail. To disambiguate NULL handling and outer‑join semantics, a sentinel literal MERGE_SOURCE_SENTINEL (lit(true)) is injected on the source side, allowing the engine to decide whether a source row truly exists rather than relying on ON‑column NULLness.
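The sentinel trick can be sketched in plain Rust rather than as a DataFusion CASE expression. This is a hedged sketch: the function and string labels below are illustrative, but the logic mirrors the description above, where a non-NULL sentinel (rather than a non-NULL ON column) proves a source row exists.

```rust
// Sketch of the CASE expression built by merge_insert_action, written as
// plain Rust; names are illustrative. After the outer join, only real
// source rows carry the constant-true sentinel, so a None sentinel means
// "no source row matched" even when the ON key itself is NULL.
fn case_action(sentinel: Option<bool>, target_rowaddr: Option<u64>) -> &'static str {
    let source_exists = sentinel == Some(true); // MERGE_SOURCE_SENTINEL check
    let target_exists = target_rowaddr.is_some(); // target._rowaddr present
    match (source_exists, target_exists) {
        (true, true) => "update_all", // when_matched => UpdateAll
        (true, false) => "insert",    // when_not_matched => Insert
        (false, true) => "delete",    // when_not_matched_by_source => Delete
        (false, false) => "nothing",
    }
}
```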
Write‑back phase
The physical executor, e.g., FullSchemaMergeInsertExec, consumes streams containing _rowaddr and __action. For each action it performs:
Update: delete the old row address and write the new data.
Insert: write a new data fragment.
Delete: record the deletion.
These operations cooperate with transaction/manifest handling and optional scalar indexes for equality‑based look‑ups. Conflict handling is supported via a conflict_retries retry mechanism.
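The dispatch loop above can be modeled as a minimal sketch. This is not the real executor (FullSchemaMergeInsertExec operates on Arrow record batch streams); the types and function below are illustrative, assuming each row arrives as a (_rowaddr, action) pair.

```rust
// Illustrative sketch of the write-back dispatch; not Lance's actual code.
#[derive(Clone, Copy)]
enum Action { Nothing, UpdateAll, Insert, Delete }

#[derive(Default, Debug, PartialEq)]
struct WriteStats {
    deleted_addrs: Vec<u64>, // row addresses marked deleted in the manifest
    inserted: usize,         // rows written as new data fragments
}

fn apply_actions(rows: &[(Option<u64>, Action)]) -> WriteStats {
    let mut stats = WriteStats::default();
    for &(rowaddr, action) in rows {
        match action {
            // Update = delete the old row address, then write the new data.
            Action::UpdateAll => {
                stats.deleted_addrs.push(rowaddr.expect("update needs _rowaddr"));
                stats.inserted += 1;
            }
            Action::Insert => stats.inserted += 1,
            Action::Delete => {
                stats.deleted_addrs.push(rowaddr.expect("delete needs _rowaddr"));
            }
            Action::Nothing => {}
        }
    }
    stats
}
```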
Deduplication (source_dedupe)
When multiple source rows match the same target row within a single merge, the SourceDedupeBehavior enum controls the outcome:
pub enum SourceDedupeBehavior {
/// Fail the operation if duplicates are found (default)
#[default]
Fail,
/// Keep the first seen value and skip subsequent duplicates
FirstSeen,
}
Fail (default): the operation aborts when a duplicate is detected.
FirstSeen: the first matching source row updates the target; later duplicates are skipped and counted in the num_skipped_duplicates metric.
During the write path, for UpdateAll actions the executor checks whether a row_id has already been processed:
// Check for duplicate _rowid in the current merge operation
if !self.processed_row_ids.insert(row_id) {
match self.source_dedupe_behavior {
SourceDedupeBehavior::Fail => {
return Err(create_duplicate_row_error(batch, row_idx, &self.on_columns));
}
SourceDedupeBehavior::FirstSeen => {
self.metrics.num_skipped_duplicates.add(1);
return Ok(None); // Skip this duplicate row
}
}
}
Users can set the desired behavior via the Rust API MergeInsertBuilder::source_dedupe_behavior(...). Because FirstSeen keeps whichever duplicate the executor happens to encounter first, the documentation advises sorting the source beforehand to make the retained row deterministic.
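The interaction between sorting and FirstSeen can be demonstrated with a small self-contained sketch (a HashSet-based model, not Lance's executor; the function name is illustrative): if the source is pre-sorted so the preferred row per key comes first, "first seen" retains exactly that row.

```rust
use std::collections::HashSet;

// Illustrative model of FirstSeen dedupe over a source stream, keyed by the
// ON column. Later rows with an already-seen key are skipped, and the skip
// count plays the role of the num_skipped_duplicates metric.
fn dedupe_first_seen(rows: &[(u64, &'static str)]) -> (Vec<(u64, &'static str)>, usize) {
    let mut seen = HashSet::new(); // analogous to processed_row_ids
    let mut kept = Vec::new();
    let mut skipped = 0;
    for &(key, val) in rows {
        if seen.insert(key) {
            kept.push((key, val)); // first occurrence wins
        } else {
            skipped += 1; // duplicate: counted, not applied
        }
    }
    (kept, skipped)
}
```

Sorting the source by (key, preference) before this pass is what makes the retained duplicate predictable.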
Big Data Technology Tribe
