
From Full Sync to Real‑Time CDC: Building Scalable Order Data Pipelines

An e‑commerce junior developer tackles the challenge of regularly syncing order data to a data warehouse, evolving from naïve full‑table copies to incremental sync, batch processing, cursor‑based pagination, performance tuning, and finally a real‑time CDC‑plus‑message‑queue architecture, while addressing reliability, ordering, and scaling issues.


Full Sync

Data synchronization means periodically copying data from one database to another, similar to copying photos between phones. The simplest implementation queries the entire orders table each day and inserts all rows into the warehouse, regardless of changes.

# Query all orders from the source database
orders = db.query("SELECT * FROM orders")
# Clear old data in the warehouse
warehouse.execute("DELETE FROM orders")
# Re-insert every row, changed or not
for order in orders:
    warehouse.insert(order)

This approach works for a small dataset (e.g., 10,000 rows) but becomes inefficient as data volume grows.

Basic Alert

When a scheduled sync fails (e.g., due to network issues), the failure may go unnoticed until the next day, causing business impact. Adding error logging, email notifications, and transaction rollback ensures the program reports problems promptly and can be re‑executed manually.
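A minimal sketch of that safety net, reusing the article's db/warehouse handles and assuming a hypothetical send_alert_email helper for the notification:

import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("order_sync")

def sync_orders(db, warehouse):
    try:
        orders = db.query("SELECT * FROM orders")
        warehouse.execute("BEGIN")           # wrap the reload in one transaction
        warehouse.execute("DELETE FROM orders")
        for order in orders:
            warehouse.insert(order)
        warehouse.execute("COMMIT")
        log.info("Synced %d orders", len(orders))
    except Exception:
        warehouse.execute("ROLLBACK")        # leave the warehouse untouched on failure
        log.exception("Order sync failed")
        send_alert_email("Order sync failed, please re-run manually")  # assumed notifier
        raise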

Incremental Sync

To avoid copying unchanged data, the developer switches to incremental sync: each nightly run copies only the rows created since the previous run. This reduces the data volume dramatically.

However, using created_time alone misses updates to existing orders (e.g., refunds). Switching to an updated_time column captures both new rows and modifications, ensuring all changes are synchronized.
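A sketch of the incremental pull over a one-day window, assuming the warehouse exposes an upsert (insert-or-update by order id) and that db.query accepts bound parameters:

from datetime import datetime, timedelta

def sync_incremental(db, warehouse, run_date: datetime):
    # Pull everything created *or* updated during the previous day
    start = (run_date - timedelta(days=1)).strftime("%Y-%m-%d")
    end = run_date.strftime("%Y-%m-%d")
    changed = db.query(
        "SELECT * FROM orders WHERE updated_time >= %s AND updated_time < %s",
        (start, end),
    )
    for order in changed:
        warehouse.upsert(order)   # refunds update the existing row instead of duplicating it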

Batch Processing

During a sales surge, syncing millions of rows at once caused out-of-memory (OOM) crashes. The solution is to process the data in batches (e.g., 100 rows per batch): fetch a page, insert it, then move on to the next. This caps memory usage and allows partial retries on failure.
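One way to sketch the batched loop, still using offset pagination at this point (the cursor fix comes in the next section); the batch size and the db/warehouse handles follow the article's pseudo-API:

BATCH_SIZE = 100

def sync_in_batches(db, warehouse, start, end):
    offset = 0
    while True:
        batch = db.query(
            "SELECT * FROM orders "
            "WHERE updated_time >= %s AND updated_time < %s "
            "ORDER BY updated_time LIMIT %s OFFSET %s",
            (start, end, BATCH_SIZE, offset),
        )
        if not batch:
            break                     # no rows left in this window
        for order in batch:
            warehouse.upsert(order)
        offset += BATCH_SIZE          # only this batch is held in memory at a time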

Cursor Mechanism

When new rows are inserted during pagination, OFFSET-based paging can skip or duplicate records. The fix is to replace offsets with a cursor: store the last processed primary key (or timestamp) and query WHERE id > last_id for the next batch. This eliminates data loss and avoids deep-offset performance penalties. The offset-based query below shows the page pattern the cursor replaces:

SELECT * FROM orders
WHERE updated_time >= '2025-09-08' AND updated_time < '2025-09-09'
ORDER BY updated_time
LIMIT 2 OFFSET 0;  -- offset-based first page: later pages shift when new rows arrive
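The cursor-based variant keeps only the id where the previous batch stopped, so newly inserted rows cannot shift the pages; a sketch, assuming id is an auto-increment primary key:

BATCH_SIZE = 100

def sync_with_cursor(db, warehouse, start, end):
    last_id = 0
    while True:
        batch = db.query(
            "SELECT * FROM orders "
            "WHERE updated_time >= %s AND updated_time < %s AND id > %s "
            "ORDER BY id LIMIT %s",
            (start, end, last_id, BATCH_SIZE),
        )
        if not batch:
            break
        for order in batch:
            warehouse.upsert(order)
        last_id = batch[-1]["id"]     # resume from here, no OFFSET scan needed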

Performance Optimization

As order volume reaches millions and the business demands a sync every two hours, the developer optimizes further:

Replace row‑by‑row inserts with bulk insert statements, reducing round‑trips to the database.

Parallelize batch processing by launching multiple threads, each handling a separate batch, turning the pipeline from serial to concurrent.

These changes cut execution time from hours to minutes.
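A sketch of both optimizations together, collapsing each batch into a single multi-row insert and fanning batches out to a thread pool; insert_many and the worker count are assumptions, not the article's exact code:

from concurrent.futures import ThreadPoolExecutor

BULK_SIZE = 1000
WORKERS = 4

def load_batch(warehouse, batch):
    # One multi-row INSERT instead of len(batch) round-trips
    warehouse.insert_many("orders", batch)        # assumed bulk-insert helper

def sync_parallel(warehouse, batches):
    # Each worker loads a different batch concurrently
    with ThreadPoolExecutor(max_workers=WORKERS) as pool:
        futures = [pool.submit(load_batch, warehouse, b) for b in batches]
        for f in futures:
            f.result()                            # surface any per-batch failure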

Real‑Time Sync

For real‑time monitoring, the periodic job is insufficient. The solution combines Change Data Capture (CDC) with a message queue:

CDC monitors the source database and emits a change event for every INSERT, UPDATE, or DELETE.

The message queue acts as a durable buffer; a consumer reads events and writes them to the warehouse.

This architecture delivers updates within ~100 ms of an order being placed and includes monitoring to alert on message backlog.
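A sketch of the consuming side, assuming change events arrive as dicts carrying the operation type, the order id, and the row data; the queue object here is a stand-in for whatever broker is used, not a specific client library:

def handle_event(warehouse, event):
    # event = {"op": "INSERT" | "UPDATE" | "DELETE", "id": ..., "row": {...}}
    if event["op"] == "DELETE":
        warehouse.execute("DELETE FROM orders WHERE id = %s", (event["id"],))
    else:
        warehouse.upsert(event["row"])    # INSERT and UPDATE both become an upsert

def run_consumer(queue, warehouse):
    for event in queue.poll():            # blocking iterator over CDC change events
        handle_event(warehouse, event)
        queue.ack(event)                  # acknowledge only after the warehouse write succeeds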

Refining the Solution & Common Pitfalls

During a massive promotion, the real‑time pipeline exhibited three issues:

Duplicate messages – solved by assigning a unique identifier to each event and implementing idempotent processing, as sketched after this list.

Out‑of‑order processing – solved by partitioning messages per order so that all events for the same order are consumed in sequence.

Message backlog – solved by scaling the consumer cluster and enabling dynamic auto‑scaling of processing nodes.
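A sketch of the first two fixes, assuming each event carries a unique event_id and an order_id, and that a processed_events bookkeeping table exists in the warehouse (both are assumptions): recorded event ids make redelivered messages no-ops, and routing by order_id keeps one order's events in sequence.

NUM_PARTITIONS = 8

def partition_for(order_id: int) -> int:
    # Every event for the same order lands in the same partition,
    # so that partition's consumer sees them in production order
    return order_id % NUM_PARTITIONS

def handle_event_idempotent(warehouse, event):
    seen = warehouse.query(
        "SELECT 1 FROM processed_events WHERE event_id = %s", (event["event_id"],)
    )
    if seen:
        return                    # duplicate delivery: safely ignore
    warehouse.upsert(event["row"])
    warehouse.execute(
        "INSERT INTO processed_events (event_id) VALUES (%s)", (event["event_id"],)
    )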

The developer also notes that mature open‑source tools (e.g., DataX, Canal, Debezium) can handle many of these concerns out of the box, reducing the need for custom code.

Tags: Batch · Cursor · data-sync · CDC · incremental · message-queue