How to Solve ES Latency in MySQL‑Canal Sync and Indexing Scenarios?
The article dissects the interview question about ES latency in a MySQL‑Canal‑to‑Elasticsearch pipeline, explains the root causes across four system layers, and presents a four‑layer optimization plan, end‑to‑end observability, routing‑based degradation, and a Java‑based LatencyProbe component to measure and control delay.
Problem Essence
The interview question asks candidates to treat the binlog→ES indexing path as an end‑to‑end distributed system. Latency exposes four core concerns: data consistency, peak‑traffic mitigation, fault recovery, and full‑chain observability. The goal is to keep latency within business‑acceptable bounds and to make it observable.
Latency Bottlenecks
Canal pull: milliseconds to seconds, caused by a single‑point bottleneck and a limited thread pool.
Kafka transport: millisecond‑level, affected by batch size, compression, ACK.
Indexer consumption: 10 ms to minutes, limited by batch strategy, ES load, back‑pressure.
ES refresh: default 1 s, governed by refresh_interval.
Failure recovery: minutes or more, due to missing downstream degradation and retry mechanisms.
Four‑Layer Optimization
1. Collection Layer (Canal)
Bottleneck analysis: binlog→JSON conversion is CPU‑intensive, and early Canal versions process events serially.
Solutions
Deploy multiple Canal‑Server nodes with Zookeeper leader election for HA.
Split instances by database/table to enable parallel processing.
Use protobuf + snappy to reduce network I/O by ~60%.
Configure instance.filter.regex to drop unnecessary tables.
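The filtering and HA points above map onto Canal's standard configuration files. A sketch of the relevant properties (the ZooKeeper addresses and schema/table names are placeholders, not from the article):

```properties
# canal.properties — shared settings, assuming a ZooKeeper-backed HA cluster
canal.zkServers = zk1:2181,zk2:2181,zk3:2181

# instance.properties — one instance per database/table group for parallelism;
# keep only the tables the index actually needs (example schema/table names)
canal.instance.filter.regex = shop\\.orders,shop\\.order_items
```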
2. Transport Layer (Kafka)
Bottleneck analysis: linger.ms adds cumulative delay; partition skew from uneven primary‑key distribution.
Solutions
Hash partition by primary key to preserve order.
Set acks=all and min.insync.replicas=2 for reliability.
Create two topics: tp_order_normal (latency < 1 s) and tp_order_large for large‑batch async handling.
Route messages that fail three times to a dead‑letter queue for compensation.
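The ordering guarantee above hinges on every event for a given primary key landing in the same partition. A stdlib‑only sketch of the idea (Kafka's default partitioner uses murmur2; plain `hashCode()` stands in here purely for illustration), together with the reliability settings as a `Properties` object — the broker address is a placeholder, and `min.insync.replicas` is really a broker/topic setting:

```java
import java.util.Properties;

public class KeyPartitioning {
    // Same key -> same partition, so per-row update order is preserved.
    // Kafka's default partitioner uses murmur2; hashCode() stands in here.
    static int partitionFor(String primaryKey, int numPartitions) {
        return (primaryKey.hashCode() & 0x7fffffff) % numPartitions;
    }

    static Properties producerProps() {
        Properties p = new Properties();
        p.put("bootstrap.servers", "kafka1:9092"); // placeholder address
        p.put("acks", "all");                      // wait for all in-sync replicas
        p.put("min.insync.replicas", "2");         // set on the broker/topic in practice
        p.put("linger.ms", "5");                   // small batch window caps added latency
        p.put("compression.type", "snappy");
        return p;
    }

    public static void main(String[] args) {
        // Two events for the same order always hash to the same partition.
        int p1 = partitionFor("order-1001", 12);
        int p2 = partitionFor("order-1001", 12);
        System.out.println(p1 == p2); // true: updates to order-1001 stay ordered
    }
}
```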
3. Compute Layer (Indexer)
Bottleneck analysis: producer speed exceeds consumer speed, causing backlog; bulk parameters are sub‑optimal.
Solutions
Use ES _id = table+pk for idempotent writes.
Adjust bulk size dynamically between 100 and 5000 based on load.
Integrate Sentinel for rate limiting and block Kafka polls when back‑pressure is detected.
Assign hot indexes to dedicated consumer groups for resource isolation.
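Two of the points above are easy to sketch in plain Java: building the idempotent `_id` from table name plus primary key, and nudging the bulk size up or down within the 100–5000 bounds based on observed bulk latency. This is a minimal additive‑increase/multiplicative‑decrease sketch; the 200 ms threshold and the step sizes are assumed tuning values, not from the article:

```java
public class IndexerTuning {
    static final int MIN_BULK = 100, MAX_BULK = 5000;

    // _id = table + pk: a re-delivered binlog event overwrites the same doc
    // instead of creating a duplicate, making writes idempotent.
    static String docId(String table, long pk) {
        return table + "_" + pk;
    }

    // Grow gently while ES responds fast; back off hard when it is slow.
    static int nextBulkSize(int current, long lastBulkMillis) {
        int next = lastBulkMillis < 200 ? current + 100 : current / 2;
        return Math.max(MIN_BULK, Math.min(MAX_BULK, next));
    }
}
```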
4. Storage Layer (Elasticsearch)
Bottleneck analysis: the refresh interval adds visibility delay; bulk queue buildup reduces throughput.
Solutions
Reduce refresh_interval from 1 s to 300 ms.
Set translog.durability=async to improve performance with a slight risk of data loss.
Hot‑cold tiering: route hot shards to SSD nodes with high‑CPU, warm shards to HDD nodes.
Pre‑create index and shrink before rollover to avoid performance spikes.
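The refresh and translog tweaks above correspond to index settings along these lines (a sketch, applied via a settings update request; the 5 s `sync_interval` is Elasticsearch's default, shown here only for completeness):

```json
{
  "index": {
    "refresh_interval": "300ms",
    "translog": {
      "durability": "async",
      "sync_interval": "5s"
    }
  }
}
```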
End‑to‑End Observability & Self‑Healing
Monitor Canal delay, Kafka lag, and ES refresh delay in Grafana. Define a latency alarm (e.g., > 3 s) and link it to a Kubernetes HPA to trigger automatic scaling.
Query Routing & Degradation
Real‑time latency t is collected; gateway/SDK routes queries as follows:
If t < SLA (e.g., 500 ms) → query ES.
If t ≥ SLA → fall back to MySQL to guarantee correctness.
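The routing rule reduces to a single comparison. A minimal sketch, assuming the gateway already has the probed sync latency in milliseconds (the enum and constant names are illustrative):

```java
public class QueryRouter {
    enum Backend { ES, MYSQL }

    static final long SLA_MILLIS = 500; // business SLA from the text

    // Route to ES only while measured sync latency is inside the SLA;
    // otherwise fall back to MySQL, the source of truth, for correctness.
    static Backend route(long measuredLatencyMillis) {
        return measuredLatencyMillis < SLA_MILLIS ? Backend.ES : Backend.MYSQL;
    }
}
```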
LatencyProbe Component
Design
Inject a TraceId and emit timestamp at the Canal side, compute t = indexTime – emitTime after a successful bulk request, and publish t to a Kafka topic and Prometheus.
Implementation
Canal side adds CanalTrace (UUID, emitTime, target index, docId) to each entry.
Indexer bulk‑success hook reads the trace, calculates t, writes back docId, sends the trace to latency.topic, and records t in a Micrometer timer.
Prometheus query
`histogram_quantile(0.99, rate(canal_es_latency_duration_seconds_bucket[5m]))` provides p99 latency; Grafana visualizes p50/p99 and triggers alerts when t > 3 s.
```java
public final class CanalTrace {
    private String traceId; // UUID
    private long emitTime;  // ms
    private String index;
    private String docId;

    public CanalTrace(String traceId, long emitTime, String index, String docId) {
        this.traceId = traceId;
        this.emitTime = emitTime;
        this.index = index;
        this.docId = docId;
    }
    // getters and setters omitted for brevity
}
```

```java
@Component
public class TraceInjector implements CanalEventParser.PostInterceptor {
    @Override
    public void postProcess(CanalEntry.Entry entry, List<CanalEntry.RowData> rows) {
        // Attach the trace to the entry itself; docId is filled in by the indexer.
        CanalTrace trace = new CanalTrace(
                UUID.randomUUID().toString(),
                System.currentTimeMillis(),
                calcIndex(entry.getHeader().getTableName()),
                null);
        entry.getProps().put("trace", JsonUtil.toJson(trace));
    }
}
```

The trace is stored in the entry's extension field; no DB write is required.
```java
@Component
public class LatencyRecorder {
    private final KafkaTemplate<String, CanalTrace> kafka;
    private final MeterRegistry registry;

    public LatencyRecorder(KafkaTemplate<String, CanalTrace> kafka,
                           MeterRegistry registry) {
        this.kafka = kafka;
        this.registry = registry;
    }

    @EventListener
    public void onBulkSuccess(BulkSuccessEvent event) {
        for (DocWriteRequest<?> req : event.getRequests()) {
            CanalTrace trace = (CanalTrace) req.getHeaders().get("trace");
            long t = System.currentTimeMillis() - trace.getEmitTime(); // end-to-end delay
            trace.setDocId(event.getId(req));
            kafka.send("latency.topic", trace.getTraceId(), trace);
            registry.timer("canal.es.latency", "index", trace.getIndex())
                    .record(t, TimeUnit.MILLISECONDS);
        }
    }
}
```

Dual‑Write Consistency Guarantee
For ultra‑real‑time scenarios (e.g., inventory), combine CDC with business‑layer double‑write: write to MySQL and ES simultaneously, route failed writes to a delay queue, and run a periodic reconciliation task that compares MySQL and ES and repairs mismatches. This provides eventual consistency with higher reliability at the cost of added complexity.
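A stdlib‑only sketch of the double‑write path described above: MySQL remains the source of truth, the ES write is best‑effort, and failures land in a queue that a periodic reconciliation task drains. The `Store` interface and the in‑memory queue are placeholders for the real clients and a real delay queue/MQ:

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class DualWriter {
    // Placeholder for the real MySQL/ES clients.
    interface Store { boolean write(String docId, String json); }

    private final Store mysql, es;
    // Failed ES writes queue here; a real system would use a delay queue or MQ.
    final Queue<String[]> pendingEsRepairs = new ArrayDeque<>();

    DualWriter(Store mysql, Store es) { this.mysql = mysql; this.es = es; }

    boolean save(String docId, String json) {
        if (!mysql.write(docId, json)) return false; // source of truth must succeed
        if (!es.write(docId, json)) {
            pendingEsRepairs.add(new String[]{docId, json}); // repair later
        }
        return true;
    }

    // Periodic reconciliation: replay failed ES writes (idempotent via docId).
    void reconcile() {
        int n = pendingEsRepairs.size(); // bound one pass; re-queue what still fails
        for (int i = 0; i < n; i++) {
            String[] item = pendingEsRepairs.poll();
            if (!es.write(item[0], item[1])) pendingEsRepairs.add(item);
        }
    }
}
```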
Technical Selection Guide
Second‑level latency (≈30% of cases) : Canal parallelism + Kafka partition tuning + Indexer batch processing.
Hundred‑millisecond latency (≈30% of cases) : Optimize refresh_interval, use async translog, and apply hot‑cold tiering.
Strong real‑time (≈10% of cases) : Business‑layer double‑write + compensation.
All options require monitoring, alerting, and automation to form a complete governance system.