How to Solve ES Latency in MySQL‑Canal Sync and Indexing Scenarios?

This article dissects the interview question about ES latency in a MySQL→Canal→Elasticsearch pipeline, explains the root causes across four system layers, and presents a four‑layer optimization strategy, end‑to‑end observability, routing‑based degradation, and a Java‑based LatencyProbe component to measure and control the delay.

Tech Freedom Circle

Problem Essence

The interview question asks candidates to treat the binlog→ES indexing path as an end‑to‑end distributed system. Latency there exposes four core concerns: data consistency, peak‑traffic mitigation, fault recovery, and full‑chain observability. The goal is to keep latency within business‑acceptable bounds and to make it transparent.

Latency Bottlenecks

Canal pull: milliseconds to seconds, caused by a single‑point bottleneck and a limited thread pool.

Kafka transport: millisecond‑level, affected by batch size, compression, and ACK settings.

Indexer consumption: 10 ms to minutes, limited by the batch strategy, ES load, and back‑pressure.

ES refresh: 1 s by default, governed by refresh_interval.

Failure recovery: minutes or more, due to missing downstream degradation and retry mechanisms.

Four‑Layer Optimization

1. Collection Layer (Canal)

Bottleneck analysis: binlog→JSON conversion is CPU‑intensive, and early Canal versions parse serially.

Solutions

Deploy multiple Canal‑Server nodes with Zookeeper leader election for HA.

Split instances by database/table to enable parallel processing.

Use protobuf + snappy to reduce network I/O by ~60%.

Configure instance.filter.regex to drop unnecessary tables (a configuration sketch follows this list).
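A minimal configuration sketch of these settings, assuming Canal 1.1.x property names; the ZooKeeper addresses and the table regex are placeholders.

# canal.properties (server level)
canal.zkServers = zk1:2181,zk2:2181,zk3:2181         # ZooKeeper-based HA and leader election
canal.instance.parser.parallel = true                # parallel binlog parsing (1.1.x+)
canal.instance.parser.parallelThreadSize = 16

# instance.properties (one instance per database/table group)
canal.instance.filter.regex = order_db\\.orders.*    # sync only the tables ES needs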

2. Transport Layer (Kafka)

Bottleneck analysis: linger.ms adds cumulative delay; partition skew arises from uneven primary‑key distribution.

Solutions

Hash partition by primary key to preserve order.

Set acks=all and min.insync.replicas=2 for reliability.

Create two topics: tp_order_normal (latency < 1 s) and tp_order_large for large‑batch async handling.

Route messages that fail three times to a dead‑letter queue for compensation (a producer sketch follows this list).
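A minimal Java sketch of the producer side under these settings; the broker address, topic names, and the three‑retry threshold are illustrative, and min.insync.replicas=2 is assumed to be set on the topic.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class OrderProducer {
    private final KafkaProducer<String, String> producer;

    public OrderProducer() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka1:9092");
        props.put(ProducerConfig.ACKS_CONFIG, "all");            // paired with min.insync.replicas=2
        props.put(ProducerConfig.LINGER_MS_CONFIG, "5");         // small batching window to cap added delay
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        this.producer = new KafkaProducer<>(props);
    }

    // Keying by primary key hashes every change of a row to the same partition, preserving per-row order.
    public void send(String pk, String payload, int attempt) {
        producer.send(new ProducerRecord<>("tp_order_normal", pk, payload), (meta, ex) -> {
            if (ex == null) return;
            if (attempt >= 3) {                                  // third failure: hand off for compensation
                producer.send(new ProducerRecord<>("tp_order_dlq", pk, payload));
            } else {
                send(pk, payload, attempt + 1);                  // simple retry; real code would back off
            }
        });
    }
}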

3. Compute Layer (Indexer)

Bottleneck analysis: producer speed exceeds consumer speed, causing backlog; bulk parameters are sub‑optimal.

Solutions

Use ES _id = table+pk for idempotent writes.

Adjust bulk size dynamically between 100 and 5000 based on load.

Integrate Sentinel for rate limiting and block Kafka polls when back‑pressure is detected.

Assign hot indexes to dedicated consumer groups for resource isolation (a bulk‑write sketch follows this list).
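A minimal Java sketch of the idempotent _id scheme and adaptive batch sizing, assuming the 7.x high‑level REST client; the index name, the latency thresholds, and the Row shape are illustrative.

import java.util.List;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.common.xcontent.XContentType;

public class OrderIndexer {
    record Row(String table, String pk, String json) {}

    private int bulkSize = 1000;  // adapted between 100 and 5000 at runtime

    // _id = table + primary key: replayed binlog events overwrite instead of duplicating.
    BulkRequest buildBulk(List<Row> rows) {
        BulkRequest bulk = new BulkRequest();
        for (Row r : rows.subList(0, Math.min(rows.size(), bulkSize))) {
            bulk.add(new IndexRequest("order_index")
                    .id(r.table() + ":" + r.pk())
                    .source(r.json(), XContentType.JSON));
        }
        return bulk;
    }

    // Shrink batches when the last bulk was slow, grow them when ES has headroom.
    void adjustBulkSize(long lastBulkMillis) {
        if (lastBulkMillis > 500) {
            bulkSize = Math.max(100, bulkSize / 2);
        } else if (lastBulkMillis < 50) {
            bulkSize = Math.min(5000, bulkSize * 2);
        }
    }
}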

4. Storage Layer (Elasticsearch)

Bottleneck analysis: the refresh interval adds visibility delay; bulk queue buildup reduces throughput.

Solutions

Reduce refresh_interval from 1 s to 300 ms.

Set translog.durability=async to improve performance with a slight risk of data loss.

Hot‑cold tiering: route hot shards to SSD nodes with high‑CPU, warm shards to HDD nodes.

Pre‑create indexes and shrink them before rollover to avoid performance spikes (a settings sketch follows this list).
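A minimal sketch of the index‑settings request for the refresh and translog changes; the index name and the sync interval are illustrative.

PUT /order_index/_settings
{
  "index": {
    "refresh_interval": "300ms",
    "translog.durability": "async",
    "translog.sync_interval": "5s"
  }
}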

End‑to‑End Observability & Self‑Healing

Monitor Canal delay, Kafka lag, and ES refresh delay in Grafana. Define a latency alarm (e.g., > 3 s) and link it to a Kubernetes HPA to trigger automatic scaling.
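Expressed in PromQL against the latency histogram that the LatencyProbe below emits, the alarm condition is simply the p99 quantile compared against the 3 s threshold:

histogram_quantile(0.99, rate(canal_es_latency_duration_seconds_bucket[5m])) > 3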

Query Routing & Degradation

Real‑time latency t is collected; the gateway/SDK routes queries as follows:

If t < SLA (e.g., 500 ms) → query ES.

If t ≥ SLA → fall back to MySQL to guarantee correctness (a routing sketch follows).
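A minimal Java sketch of this routing rule; LatencyProvider, EsClient, and MySqlClient are hypothetical stand‑ins for the gateway's real dependencies.

public class QueryRouter {
    interface LatencyProvider { long currentLatencyMillis(); }  // hypothetical: live value of t
    interface EsClient { Object search(String query); }         // hypothetical ES facade
    interface MySqlClient { Object select(String query); }      // hypothetical MySQL facade

    private static final long SLA_MILLIS = 500;  // business SLA on ES freshness

    private final LatencyProvider latency;
    private final EsClient es;
    private final MySqlClient mysql;

    public QueryRouter(LatencyProvider latency, EsClient es, MySqlClient mysql) {
        this.latency = latency;
        this.es = es;
        this.mysql = mysql;
    }

    // Serve from ES while the pipeline is fresh enough; otherwise fall back to MySQL.
    public Object query(String q) {
        return latency.currentLatencyMillis() < SLA_MILLIS ? es.search(q) : mysql.select(q);
    }
}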

LatencyProbe Component

Design

Inject a traceId and an emit timestamp on the Canal side, compute t = indexTime − emitTime after a successful bulk request, and publish t to a Kafka topic and Prometheus.

Implementation

The Canal side adds a CanalTrace (UUID, emitTime, target index, docId) to each entry.

The Indexer's bulk‑success hook reads the trace, calculates t, writes the docId back, sends the trace to latency.topic, and records t in a Micrometer timer.

Prometheus query

histogram_quantile(0.99, rate(canal_es_latency_duration_seconds_bucket[5m]))

This provides the p99 latency; Grafana visualizes p50/p99 and triggers alerts when t > 3 s.

public final class CanalTrace {
    private String traceId;   // UUID
    private long emitTime;    // emit timestamp, ms since epoch
    private String index;     // target ES index
    private String docId;     // filled in by the Indexer after the bulk succeeds
    // all-args constructor, getters, and setters omitted for brevity
}
@Component
public class TraceInjector implements CanalEventParser.PostInterceptor {
    // Simplified hook: attach a trace to every parsed entry at emit time.
    @Override
    public void postProcess(CanalEntry.Entry entry, List<CanalEntry.RowData> rows) {
        CanalTrace trace = new CanalTrace(
                UUID.randomUUID().toString(),                 // traceId
                System.currentTimeMillis(),                   // emitTime
                calcIndex(entry.getHeader().getTableName()),  // map table -> target index
                null);                                        // docId is unknown until indexing
        entry.getProps().put("trace", JsonUtil.toJson(trace));
    }
}
The trace is stored in the entry’s extension field; no DB write is required.
@Component
public class LatencyRecorder {
    private final KafkaTemplate<String, CanalTrace> kafka;
    private final MeterRegistry registry;

    public LatencyRecorder(KafkaTemplate<String, CanalTrace> kafka, MeterRegistry registry) {
        this.kafka = kafka;
        this.registry = registry;
    }

    @EventListener
    public void onBulkSuccess(BulkSuccessEvent event) {  // application-level event fired after a bulk succeeds
        for (DocWriteRequest<?> req : event.getRequests()) {
            CanalTrace trace = (CanalTrace) req.getHeaders().get("trace");  // trace carried with the request
            long t = System.currentTimeMillis() - trace.getEmitTime();      // end-to-end delay t
            trace.setDocId(event.getId(req));                               // write the final docId back
            kafka.send("latency.topic", trace.getTraceId(), trace);         // publish for offline analysis
            registry.timer("canal.es.latency", "index", trace.getIndex())
                    .record(t, TimeUnit.MILLISECONDS);                      // feeds the Prometheus histogram
        }
    }
}

Dual‑Write Consistency Guarantee

For ultra‑real‑time scenarios (e.g., inventory), combine CDC with business‑layer double‑write: write to MySQL and ES simultaneously, route failed writes to a delay queue, and run a periodic reconciliation task that compares MySQL and ES and repairs mismatches. This provides eventual consistency with higher reliability at the cost of added complexity.
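A minimal Java sketch of the double‑write path; Order, OrderDao, EsWriter, and DelayQueue are hypothetical stand‑ins for the business layer's real components.

public class InventoryWriter {
    record Order(String id, int stock) {}
    interface OrderDao { void upsert(Order o); }                  // hypothetical MySQL DAO
    interface EsWriter { void index(Order o) throws Exception; }  // hypothetical ES facade
    interface DelayQueue { void enqueue(Order o); }               // hypothetical delay queue

    private final OrderDao mysql;
    private final EsWriter es;
    private final DelayQueue retryQueue;

    public InventoryWriter(OrderDao mysql, EsWriter es, DelayQueue retryQueue) {
        this.mysql = mysql;
        this.es = es;
        this.retryQueue = retryQueue;
    }

    // MySQL remains the source of truth; the ES write is best-effort with compensation.
    public void write(Order order) {
        mysql.upsert(order);              // must succeed, or the whole write fails
        try {
            es.index(order);              // near-real-time visibility in ES
        } catch (Exception e) {
            retryQueue.enqueue(order);    // delayed retry; periodic reconciliation repairs the rest
        }
    }
}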

Technical Selection Guide

Second‑level latency (≈30% of cases): Canal parallelism + Kafka partition tuning + Indexer batch processing.

Hundred‑millisecond latency (≈30% of cases): optimize refresh_interval, use async translog, and apply hot‑cold tiering.

Strong real‑time (≈10% of cases): business‑layer double‑write + compensation.

All options require monitoring, alerting, and automation to form a complete governance system.
