
Improving Log Replay Efficiency with Flink and Elasticsearch at Ctrip Ticket Frontend

The article describes how Ctrip's ticket front‑end team replaced a slow, manual log‑pulling process with a Flink‑based real‑time pipeline that streams Kafka data, indexes it in Elasticsearch, and enables second‑level log retrieval for automated scenario replay, dramatically reducing CI cycle time.

Ctrip Technology

Background

As Ctrip's ticket business grew, manual regression testing could no longer keep up with the increasing number of test cases. The existing log‑pulling solution, which cached logs in Redis, took half a day per release and became a bottleneck for continuous integration.

Introduction

The team introduced a CI pipeline that includes unit tests, traffic replay, and case verification. Traffic replay requires realistic online request results, which are achieved by mocking third‑party services using large volumes of online logs.

Scenario Replay

To cover online business scenarios, user reservation flows are instrumented with trace points that record logs. These logs are then used to mock SOA interfaces and A/B test results, allowing the system to replay and verify responses against real online data.
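The mocking step above can be sketched as a lookup of recorded responses keyed by trace ID and interface name. This is a minimal illustration, not the team's actual implementation; the class and field names (`MockStore`, `trace_id`, `interface`) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MockStore:
    """In-memory stand-in for recorded SOA responses captured from
    online trace-point logs, keyed by (trace_id, interface)."""
    records: dict = field(default_factory=dict)

    def record(self, trace_id: str, interface: str, response: dict) -> None:
        # Capture phase: store the real online response for later replay.
        self.records[(trace_id, interface)] = response

    def replay(self, trace_id: str, interface: str):
        # Replay phase: the SOA client is redirected here instead of
        # calling the real third-party service; returns None if the
        # scenario was never recorded.
        return self.records.get((trace_id, interface))


store = MockStore()
store.record("trace-001", "FlightSearchService", {"flights": 3})
assert store.replay("trace-001", "FlightSearchService") == {"flights": 3}
```

During verification, the replayed response can then be compared field by field against the recorded online result.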

Refactoring Plan

The new solution uses Flink to consume Kafka streams in real time. Each request receives a unique ID, which links the main service logs with SOA logs. The combined logs are transformed into searchable keywords and stored in Elasticsearch, leveraging its Lucene‑based inverted index for fast retrieval. A fallback indexing strategy is also prepared in case Elasticsearch does not meet expectations.
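The join on the unique request ID can be illustrated outside Flink with plain Python; this is a hedged sketch of the grouping logic only, assuming hypothetical field names (`request_id`, `main`, `soa`), not the actual streaming job.

```python
from collections import defaultdict


def join_logs_by_request_id(main_logs, soa_logs):
    """Group main-service and SOA log entries by their shared unique
    request ID, yielding one combined document per request that is
    ready to be turned into keywords and indexed."""
    soa_by_id = defaultdict(list)
    for entry in soa_logs:
        soa_by_id[entry["request_id"]].append(entry)

    docs = []
    for entry in main_logs:
        rid = entry["request_id"]
        docs.append({
            "request_id": rid,
            "main": entry,                 # main service log entry
            "soa": soa_by_id.get(rid, []), # all SOA calls for this request
        })
    return docs


docs = join_logs_by_request_id(
    [{"request_id": "r1", "path": "/book"}],
    [{"request_id": "r1", "interface": "PayService"}],
)
assert docs[0]["soa"][0]["interface"] == "PayService"
```

In the real pipeline this grouping runs continuously inside Flink over the Kafka stream rather than over in-memory lists.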

Example of log tag transformation:

{"CaseTag": "11|0|0|0|1|3|1"}

After processing, the tags become:

{"c_cus_ct_0": "[1];[2];[8];[11];",
 "c_cus_ct_1": "[0];",
 "c_cus_ct_2": "[0];",
 "c_cus_ct_3": "[0];",
 "c_cus_ct_4": "[1];[1];",
 "c_cus_ct_5": "[1];[2];[3];",
 "c_cus_ct_6": "[1];[1];"}

Effect of the New Scheme

Previously, preparing logs for traffic replay took over four hours. With the new indexing approach, log retrieval for each scenario is completed in seconds, enabling near‑real‑time replay and significantly improving release efficiency.

Considerations When Using Flink and Elasticsearch

Flink’s Stream API can cause memory overflow when processing 1–2 TB of daily logs; therefore, the TaskManager JVM heap size should be set to around 7 GB, and YARN mode is recommended for cluster reliability. Elasticsearch can auto‑create mappings via dynamic mapping, but field names containing dots must be handled carefully to avoid type conflicts. Since Elasticsearch runs on the JVM, keep the heap below about 32 GB so the JVM can still use compressed object pointers (compressed oops); larger heaps disable them and waste memory on pointer overhead.
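The dotted-field pitfall arises because Elasticsearch interprets a name like `"user.id"` as an object `user` with a sub-field `id`, which can collide with an existing scalar field `user`. One common workaround, sketched here as a hypothetical pre-indexing step (not from the article), is to rewrite dots before sending documents to Elasticsearch:

```python
def flatten_dotted_fields(doc: dict, sep: str = "_") -> dict:
    """Replace dots in top-level field names before indexing, so
    dynamic mapping does not turn "user.id" into an object field
    "user" that conflicts with a scalar field of the same name."""
    return {key.replace(".", sep): value for key, value in doc.items()}


print(flatten_dotted_fields({"user.id": 42, "user": "guest"}))
# {'user_id': 42, 'user': 'guest'}
```

Alternatively, an explicit index mapping can be defined up front so dynamic mapping never has to guess.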
