Syncing Vectors with Changing Documents: Add, Update, Delete Made Simple
This article walks through why keeping a vector store consistent with a mutable knowledge base is challenging, explains the three failure points, introduces hash‑based incremental syncing, shows idempotent add, proper update and soft‑delete workflows, covers embedding model upgrades, and presents a production‑grade event‑driven architecture with common pitfalls and remedies.
Triangular Inconsistency in Vector Sync
A RAG knowledge base consists of three storage layers: raw documents (e.g., S3 or local disk), a vector database (Milvus, Pinecone, Qdrant), and a metadata store (PostgreSQL/SQLite) that records doc_id, version, and hash. If any layer diverges from the others, the retriever can return stale or contradictory chunks, which leads to hallucinated answers or compliance violations.
Typical loss points
Only new vectors are written while old vectors remain (orphan vectors).
Documents are deleted from the source but their vectors stay in the vector store.
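The loss points above can be detected by comparing the metadata store against the vector store. The following is a minimal sketch under stated assumptions: MetaRecord, DriftReport, and detectDrift are hypothetical names, and the list of indexed doc IDs is assumed to come from whatever query your vector store offers. It only detects drift at the document level, not stale versions of individual chunks.

```typescript
// Hypothetical shapes for illustration; not taken from any library.
interface MetaRecord {
  docId: string;
  isDeleted: boolean;
}

interface DriftReport {
  orphanVectors: string[]; // doc IDs that have vectors but no active metadata record
  missingVectors: string[]; // active documents that have no vectors at all
}

function detectDrift(metaRecords: MetaRecord[], vectorDocIds: Iterable<string>): DriftReport {
  const active = new Set(metaRecords.filter((r) => !r.isDeleted).map((r) => r.docId));
  const indexed = new Set(vectorDocIds);
  return {
    orphanVectors: Array.from(indexed).filter((id) => !active.has(id)).sort(),
    missingVectors: Array.from(active).filter((id) => !indexed.has(id)).sort(),
  };
}

// "b" was deleted at the source and "z" has no metadata record, so both are orphans;
// "c" is active but was never indexed.
const example = detectDrift(
  [
    { docId: "a", isDeleted: false },
    { docId: "b", isDeleted: true },
    { docId: "c", isDeleted: false },
  ],
  ["a", "b", "z"],
);
console.log(example);
```

Orphans are candidates for deletion and missing entries are candidates for re-sync; what to do with each is a policy decision outside this sketch.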
Hash‑Based Deduplication for Incremental Sync
Re‑embedding every chunk of a changed document is expensive. LangChain’s Index API computes a hash for each chunk (covering both its text and its metadata), stores it in a record manager, and skips embedding when the hash matches the stored value.
document chunk → compute content hash → compare with record manager
↓
hash matches → skip (no re‑embedding)
hash differs → re‑embed → write to vector store
TypeScript implementation:
import { index } from "langchain/indexes";
// The JS package exports SQLiteRecordManager; SQLRecordManager with a dbUrl is the Python naming.
import { SQLiteRecordManager } from "@langchain/community/indexes/sqlite";
import { OpenAIEmbeddings } from "@langchain/openai";
import { Milvus } from "@langchain/community/vectorstores/milvus";
import { Document } from "@langchain/core/documents";
import * as crypto from "crypto";

const embeddings = new OpenAIEmbeddings({ model: "text-embedding-3-large" });
const vectorStore = await Milvus.fromExistingCollection(embeddings, { collectionName: "knowledge_base" });
// Option names below are an assumption; check SQLiteRecordManagerOptions in your installed version.
const recordManager = new SQLiteRecordManager("milvus/knowledge_base", {
  tableName: "upsertion_records",
  localPath: "record_manager.db",
});
await recordManager.createSchema();

// Index a batch of chunks: unchanged chunks are skipped, old versions are cleaned up per source.
async function syncDocuments(docs: Document[]) {
  const result = await index({
    docsSource: docs,
    recordManager,
    vectorStore,
    options: { cleanup: "incremental", sourceIdKey: "source" },
  });
  console.log(`Added: ${result.numAdded}, Skipped: ${result.numSkipped}, Deleted: ${result.numDeleted}`);
}

// SHA-256 of a chunk's text, used for the per-chunk hash kept in metadata.
function contentHash(text: string): string {
  return crypto.createHash("sha256").update(text).digest("hex");
}
Cleanup Modes
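none: No automatic cleanup; old versions and removed documents must be cleaned up manually.
incremental: Cleans up old versions of a source as new ones are written; does not remove documents that were deleted from the source.
full: After the batch, cleans up old versions and every document absent from the batch, so the batch must contain the complete corpus.
scoped_full: Like full, but cleanup is limited to the sources seen in the current batch, so documents deleted from the source are not removed.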
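To make the difference between the modes concrete, here is a toy in-memory model of hash-based skipping plus cleanup. It is only a sketch: ToyIndex, Chunk, and SyncResult are hypothetical names, the map stands in for both the record manager and the vector store, and the real Index API tracks timestamps and batches rather than a single map. scoped_full is left out to keep the example short.

```typescript
import { createHash } from "node:crypto";

type CleanupMode = "none" | "incremental" | "full";

interface Chunk {
  source: string;
  text: string;
}

interface SyncResult {
  numAdded: number;
  numSkipped: number;
  numDeleted: number;
}

// Toy stand-in for a record manager plus vector store: chunk hash -> source.
class ToyIndex {
  private records = new Map<string, string>();

  sync(chunks: Chunk[], cleanup: CleanupMode): SyncResult {
    const seenHashes = new Set<string>();
    const seenSources = new Set<string>();
    let numAdded = 0;
    let numSkipped = 0;
    let numDeleted = 0;

    for (const chunk of chunks) {
      const hash = createHash("sha256").update(`${chunk.source}\u0000${chunk.text}`).digest("hex");
      if (seenHashes.has(hash)) continue; // duplicate inside the same batch
      seenHashes.add(hash);
      seenSources.add(chunk.source);
      if (this.records.has(hash)) {
        numSkipped++; // unchanged chunk: no embedding call, no write
      } else {
        this.records.set(hash, chunk.source); // new or changed chunk: embed and write
        numAdded++;
      }
    }

    // Cleanup: "full" drops everything not in this batch,
    // "incremental" only drops old chunks of sources that appear in this batch.
    for (const [hash, source] of Array.from(this.records)) {
      if (seenHashes.has(hash)) continue;
      const stale = cleanup === "full" || (cleanup === "incremental" && seenSources.has(source));
      if (stale) {
        this.records.delete(hash);
        numDeleted++;
      }
    }
    return { numAdded, numSkipped, numDeleted };
  }

  size(): number {
    return this.records.size;
  }
}

const toy = new ToyIndex();
const first = [
  { source: "a.md", text: "a v1 part 1" },
  { source: "a.md", text: "a v1 part 2" },
  { source: "b.md", text: "b v1" },
];
console.log(toy.sync(first, "incremental")); // everything is new
console.log(toy.sync(first, "incremental")); // redelivery: everything is skipped
console.log(toy.sync([{ source: "a.md", text: "a v2" }], "incremental")); // old a.md chunks removed, b.md kept
console.log(toy.sync([{ source: "a.md", text: "a v2" }], "full")); // b.md is absent from the batch, so it is removed
```

The last call shows why full mode must always receive the complete corpus: anything missing from the batch is treated as deleted.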
Idempotent Adds (Duplicate Delivery Protection)
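Repeated deliveries (e.g., from message‑queue retries) can cause the same document to be indexed multiple times. Attaching a per‑chunk content hash and relying on the Index API’s built‑in deduplication skips re‑adds when the content is unchanged. Because the Index API hashes a chunk's metadata together with its text, every metadata field must be identical across deliveries; a value such as the current timestamp makes each delivery look like new content.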
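import { stat } from "node:fs/promises";
import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";

interface ChunkMetadata {
  doc_id: string;
  chunk_id: string;
  // Named chunk_sha256 rather than content_hash: the Index API treats hash_,
  // content_hash, and metadata_hash as reserved metadata keys.
  chunk_sha256: string;
  version_id: number;
  source: string; // original path/URL, required by Index API
  source_type: string; // "pdf" | "confluence" | "notion"
  embedding_model: string;
  created_at: string;
  is_deleted: boolean;
}

// Assumed helper (not defined in the article): a stable ID derived from the source path.
const generateDocId = (filePath: string): string => contentHash(filePath).slice(0, 16);

async function addDocument(filePath: string) {
  const docs = await new PDFLoader(filePath).load();
  const splitter = new RecursiveCharacterTextSplitter({ chunkSize: 512, chunkOverlap: 50 });
  const chunks = await splitter.splitDocuments(docs);
  const docId = generateDocId(filePath);
  // Use the file's modification time, not the current time: metadata is part of
  // the hash, so a fresh timestamp on every delivery would defeat deduplication.
  const createdAt = (await stat(filePath)).mtime.toISOString();
  const chunksWithMeta = chunks.map((chunk, i) => new Document({
    pageContent: chunk.pageContent,
    metadata: {
      ...chunk.metadata,
      doc_id: docId,
      chunk_id: `${docId}-chunk-${i}`,
      chunk_sha256: contentHash(chunk.pageContent),
      version_id: 1,
      source: filePath,
      source_type: "pdf",
      embedding_model: "text-embedding-3-large",
      created_at: createdAt,
      is_deleted: false,
    } as ChunkMetadata,
  }));
  await syncDocuments(chunksWithMeta);
}
Updating Documents – Mandatory Old‑Chunk Cleanup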
When a document changes, the incremental mode uses the unchanged source field to locate its old chunks. Chunks whose hash is unchanged are skipped, new or modified chunks are embedded and written, and the older chunks recorded under the same source are then deleted.
async function updateDocument(filePath: string, newContent: string) {
  // splitContent is the article's assumed helper: it chunks the text and stamps each chunk's metadata.
  const newChunks = await splitContent(newContent, { source: filePath }); // keep source unchanged
  const result = await index({
    docsSource: newChunks,
    recordManager,
    vectorStore,
    options: { cleanup: "incremental", sourceIdKey: "source" },
  });
  console.log(result); // e.g., { numAdded: 8, numDeleted: 10 }
}
If the chunk count changes (e.g., from 10 to 6), incremental still removes every stale chunk because it matches on source, not on chunk_id.
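The point about matching on source rather than chunk_id can be illustrated with a small in-memory comparison. This is a sketch with hypothetical names (Store, upsertByChunkId, replaceBySource) and is not how the Index API is implemented internally; it only shows why an upsert keyed on positional chunk IDs leaves orphans when a document shrinks.

```typescript
// Chunk ID -> stored entry. Hypothetical in-memory stand-in for a vector store.
type Store = Map<string, { source: string; text: string }>;

// Naive update: overwrite chunks by positional ID and never delete anything.
function upsertByChunkId(store: Store, source: string, chunks: string[]): void {
  chunks.forEach((text, i) => store.set(`${source}#${i}`, { source, text }));
}

// Source-keyed update: write the new chunks, then drop every other chunk of the same source.
function replaceBySource(store: Store, source: string, chunks: string[]): void {
  const written = new Set(chunks.map((_, i) => `${source}#${i}`));
  upsertByChunkId(store, source, chunks);
  for (const [id, entry] of Array.from(store)) {
    if (entry.source === source && !written.has(id)) store.delete(id);
  }
}

const makeChunks = (n: number, label: string): string[] =>
  Array.from({ length: n }, (_, i) => `${label} chunk ${i}`);

const naive: Store = new Map();
upsertByChunkId(naive, "guide.pdf", makeChunks(10, "v1"));
upsertByChunkId(naive, "guide.pdf", makeChunks(6, "v2"));
console.log(naive.size); // 10: chunks 6 to 9 still hold v1 text

const keyed: Store = new Map();
replaceBySource(keyed, "guide.pdf", makeChunks(10, "v1"));
replaceBySource(keyed, "guide.pdf", makeChunks(6, "v2"));
console.log(keyed.size); // 6: no v1 chunks remain
```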
Soft Delete + Delayed Physical Cleanup
Hard deletion through full cleanup mode removes vectors immediately but requires passing the complete list of documents to keep. Soft delete instead marks is_deleted = true in the metadata store, filters the document out at query time, and schedules a physical purge after a configurable grace period (e.g., 30 days), which provides an audit window and a way to recover from accidental deletion.
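The grace-period logic described above can be sketched with plain functions. Everything here is a hypothetical illustration: DocRecord, softDelete, restore, isVisible, and duePurge are not part of any library, and timestamps are passed in explicitly so the behavior is easy to test.

```typescript
interface DocRecord {
  docId: string;
  isDeleted: boolean;
  deletedAt?: number; // epoch milliseconds, set only while soft-deleted
}

const GRACE_PERIOD_MS = 30 * 24 * 60 * 60 * 1000; // 30 days

// Mark a record as deleted without touching its vectors.
function softDelete(record: DocRecord, now: number): DocRecord {
  return { ...record, isDeleted: true, deletedAt: now };
}

// Undo an accidental deletion while the grace period is still running.
function restore(record: DocRecord): DocRecord {
  return { docId: record.docId, isDeleted: false };
}

// Query-time filter: soft-deleted documents must never reach the retriever output.
function isVisible(record: DocRecord): boolean {
  return !record.isDeleted;
}

// IDs whose grace period has elapsed and whose vectors can be physically purged.
function duePurge(records: DocRecord[], now: number, graceMs: number = GRACE_PERIOD_MS): string[] {
  return records
    .filter((r) => r.isDeleted && r.deletedAt !== undefined && now - r.deletedAt >= graceMs)
    .map((r) => r.docId);
}
```

A periodic job would call duePurge with the current time and then delete the returned documents' vectors and metadata rows; running it from a scheduler rather than a per-document timer means a worker restart does not lose pending purges.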
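async function softDeleteDocument(docId: string) {
  // 1. Flag the document in the metadata store (metaDB is the article's assumed ORM handle).
  await metaDB.update(
    { is_deleted: true, deleted_at: new Date().toISOString() },
    { where: { doc_id: docId } },
  );
  // 2. Schedule the physical purge after the 30-day grace period.
  await schedulePhysicalCleanup(docId, 30 * 24 * 60 * 60 * 1000);
}
// Build the retriever once, outside the delete path, so every query excludes deleted chunks.
// This only works if is_deleted is also kept up to date in the chunk metadata of the vector store.
// Filter syntax is store-specific; the Milvus integration takes a boolean expression string.
const retriever = vectorStore.asRetriever({ filter: "is_deleted == false", k: 5 });
Embedding Model Upgrade – Blue‑Green Switch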
Different embedding models produce vectors in incompatible spaces (e.g., text-embedding-3-small 1536‑dim vs text-embedding-3-large 3072‑dim). Storing the model name and version in metadata enables a pre‑search validation step that throws an error if the query model differs from the index model.
interface EmbeddingMetadata {
  embedding_model: string; // e.g., "text-embedding-3-large"
  embedding_model_version: string; // e.g., "2025-01-15"
  embedding_dimension: number; // e.g., 3072
}
async function safeSearch(query: string, expectedModel: string) {
  const indexMeta = await getIndexMetadata();
  if (indexMeta.embedding_model !== expectedModel) {
    throw new Error(`Model mismatch! Index uses ${indexMeta.embedding_model}, query uses ${expectedModel}. Rebuild the index.`);
  }
  return await vectorStore.similaritySearch(query, 5);
}
Recommended upgrade workflow:
Create a new collection (e.g., knowledge_base_v2) using the new model and re‑index all data.
Run both the old and new collections in parallel for a week, comparing recall.
Swap the alias to point to the new collection.
Retain the old collection for a rollback period, then delete it.
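The alias swap and rollback in the workflow above can be sketched as a small in-memory registry. This is a hypothetical model (Collection, AliasRegistry, and assertQueryCompatible are illustrative names); in practice the alias would be a collection alias managed by the vector database itself, a feature that Milvus and Qdrant both offer, rather than application state.

```typescript
interface Collection {
  name: string;
  embeddingModel: string;
  dimension: number;
}

// Maps a stable alias to the collection currently serving traffic.
class AliasRegistry {
  private current = new Map<string, Collection>();
  private previous = new Map<string, Collection>();

  resolve(alias: string): Collection {
    const target = this.current.get(alias);
    if (!target) throw new Error(`Unknown alias: ${alias}`);
    return target;
  }

  // Point the alias at the new collection and remember the old one for rollback.
  swap(alias: string, next: Collection): void {
    const old = this.current.get(alias);
    if (old) this.previous.set(alias, old);
    this.current.set(alias, next);
  }

  // Return to the collection that served traffic before the last swap.
  rollback(alias: string): void {
    const old = this.previous.get(alias);
    if (!old) throw new Error(`Nothing to roll back for alias: ${alias}`);
    this.current.set(alias, old);
    this.previous.delete(alias);
  }
}

// Guard against querying a collection with vectors from a different model.
function assertQueryCompatible(target: Collection, queryModel: string, queryVector: number[]): void {
  if (target.embeddingModel !== queryModel || target.dimension !== queryVector.length) {
    throw new Error(
      `Query (${queryModel}, ${queryVector.length}-dim) does not match ` +
        `${target.name} (${target.embeddingModel}, ${target.dimension}-dim)`,
    );
  }
}

const registry = new AliasRegistry();
registry.swap("knowledge_base", { name: "knowledge_base_v1", embeddingModel: "text-embedding-3-small", dimension: 1536 });
registry.swap("knowledge_base", { name: "knowledge_base_v2", embeddingModel: "text-embedding-3-large", dimension: 3072 });
console.log(registry.resolve("knowledge_base").name); // knowledge_base_v2
registry.rollback("knowledge_base");
console.log(registry.resolve("knowledge_base").name); // knowledge_base_v1
```

Because queries always go through the alias, the query embedder has to be switched in the same deployment step as the alias; the compatibility check catches the case where the two get out of step.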
Production‑Grade Sync Architecture – Event‑Driven + Compensation
In production, document changes arrive continuously. A robust pipeline consists of:
Document source (Confluence/Notion/S3)
↓ change events (Webhook / CDC / polling)
Message queue (Kafka / Redis Queue)
↓ at‑least‑once consumer
Sync worker → parse → hash‑dedup → embed (only changed chunks) → write vector store & update metadata
↓
Reconciliation (periodic) → scan for mismatches → auto‑compensate
Worker code ensures idempotency by recording processing status and skipping already‑completed events.
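// Event shape assumed from the fields the worker reads.
interface DocumentEvent {
  eventId: string;
  type: "created" | "updated" | "deleted";
  source: string;
  docId: string;
}

async function processDocumentEvent(event: DocumentEvent) {
  const { type, source, docId } = event;
  // At-least-once delivery: skip events that have already finished.
  const status = await getProcessingStatus(event.eventId);
  if (status === "completed") return;
  await markProcessing(event.eventId);
  switch (type) {
    case "created":
    case "updated":
      await syncDocumentToVectorStore(source, { cleanup: "incremental" });
      break;
    case "deleted":
      await softDeleteDocument(docId);
      break;
  }
  await markCompleted(event.eventId);
}

// Periodic reconciliation: re-sync active documents that have no vectors.
async function reconcile() {
  const activeDocs = await metaDB.findAll({ is_deleted: false });
  // listDocIds is not a standard vector store method; it stands for whatever
  // query returns the doc_ids currently present in the index.
  const vectorDocIds = new Set<string>(await vectorStore.listDocIds());
  const missing = activeDocs.filter((d) => !vectorDocIds.has(d.doc_id));
  for (const doc of missing) {
    console.log(`Compensating sync: ${doc.source}`);
    await syncDocumentToVectorStore(doc.source, { cleanup: "incremental" });
  }
}
Common Pitfalls (90% of Users Encounter These)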
Inconsistent source values: Use absolute paths or globally unique IDs.
Chunk‑size change without full rebuild: Record chunking parameters in metadata and trigger a full rebuild when they change.
Permission changes not re‑indexed: Treat ACL updates as content updates and re‑index.
incremental mode ignores deletions: Use full mode or explicit source‑based deletions for removed documents.
Embedding model swap without re‑index: Store model version in metadata and enforce a full rebuild on change.
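Two of the pitfalls above (chunking parameters and embedding model changing without a rebuild) can be caught with a single configuration fingerprint stored next to the index. The sketch below is an illustration under assumptions: IndexConfig, configFingerprint, and needsFullRebuild are hypothetical names, and where the fingerprint is persisted is left to the metadata store.

```typescript
import { createHash } from "node:crypto";

// The settings that make existing vectors incomparable with new ones when they change.
interface IndexConfig {
  chunkSize: number;
  chunkOverlap: number;
  embeddingModel: string;
}

// Serialize the fields in a fixed order so the fingerprint does not depend on key order.
function configFingerprint(config: IndexConfig): string {
  const canonical = JSON.stringify([config.chunkSize, config.chunkOverlap, config.embeddingModel]);
  return createHash("sha256").update(canonical).digest("hex");
}

// A rebuild is needed only when an index already exists and was built with different settings.
function needsFullRebuild(storedFingerprint: string | undefined, current: IndexConfig): boolean {
  return storedFingerprint !== undefined && storedFingerprint !== configFingerprint(current);
}

const builtWith: IndexConfig = { chunkSize: 512, chunkOverlap: 50, embeddingModel: "text-embedding-3-large" };
const stored = configFingerprint(builtWith);
console.log(needsFullRebuild(stored, builtWith)); // false
console.log(needsFullRebuild(stored, { ...builtWith, chunkSize: 1024 })); // true
```

Checking the fingerprint at the start of every sync run turns a silent quality regression into an explicit rebuild decision.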
Summary
Consistent RAG knowledge bases require synchronizing three layers—raw documents, metadata, and vector store. Hash‑based deduplication enables cheap incremental updates; the incremental mode safely handles adds and updates; soft delete with delayed physical cleanup protects compliance; embedding model upgrades demand a blue‑green deployment; and a production pipeline should be event‑driven with periodic reconciliation to automatically fix drift.
James' Growth Diary
I am James, focusing on AI Agent learning and growth. I continuously update two series: “AI Agent Mastery Path,” which systematically outlines core theories and practices of agents, and “Claude Code Design Philosophy,” which deeply analyzes the design thinking behind top AI tools. Helping you build a solid foundation in the AI era.