Evolution of Tongcheng Log System Architecture
The article chronicles the development of Tongcheng's centralized log system from early file‑based logging through a MongoDB‑based solution to the current multi‑layer architecture using Flume, Elasticsearch, and Hadoop, highlighting design decisions, challenges, and future improvement plans.
With the rapid growth of company services, the number of applications increased dramatically, making centralized log collection, storage, and querying essential for quick issue diagnosis. A qualified log system must provide high availability, reliability, and scalability.
1. Background Introduction
As the business expanded, developers and operations engineers needed a unified way to collect and analyze the logs that applications generate at runtime. Duplicated logging effort across projects prompted the creation of a centralized logging platform.
2. Architecture Design
The architecture evolved through three major phases:
Phase 1 (pre‑2012 – "Stone Age"): Logs were stored locally as plain files; accessing them was time‑consuming and error‑prone.
Phase 2 (2012 – first unified log system): The team introduced a centralized solution using MongoDB as the backend store. Although MongoDB performed well at tens of millions of records, scaling to billions caused instability and data‑balancing issues, revealing the need for deeper expertise in the chosen technology.
Phase 3 (2014 – Tianwang Log Component V1): A completely redesigned four‑layer architecture was released:
Client layer – lightweight agents that monitor files without requiring application code changes.
Collection layer – Apache Flume, customized to write ORC files to Hadoop and optionally forward events to an internal MQ.
Storage layer – Elasticsearch for real‑time queries and Hadoop for long‑term, massive‑scale storage, with routing and hot‑cold index separation.
Query layer – web UI and REST API for interactive and programmatic access.
The client agents now operate entirely on the Linux side, listening to log files and parsing them according to flexible, user‑defined rules (see configuration screenshot).
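The article does not publish the agents' parsing implementation, but a user-defined rule can be pictured as a regex with named capture groups applied to each tailed line. The rule below and the sample line are illustrative assumptions, not Tongcheng's actual format:

```python
import re

# Hypothetical parsing rule in the spirit of the agents' user-defined rules:
# named groups extract structured fields from a raw log line.
SAMPLE_RULE = re.compile(r"(?P<ts>\S+ \S+) (?P<level>[A-Z]+) (?P<msg>.*)")

def parse_line(line, rule):
    """Apply one rule to a line; return the extracted fields, or None on mismatch."""
    m = rule.match(line)
    return m.groupdict() if m else None

event = parse_line("2016-07-01 12:00:01 ERROR connection refused", SAMPLE_RULE)
# event -> {'ts': '2016-07-01 12:00:01', 'level': 'ERROR', 'msg': 'connection refused'}
```

Because the rules live in configuration rather than code, applications can change log formats without redeploying the agent.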
Flume ensures reliable delivery via transactional semantics and supports failover and load‑balancing sink groups. Custom sinks enable direct ORC writes to Hadoop and optional MQ forwarding for offline analysis.
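A failover sink group of the kind described can be sketched with standard Flume agent properties. The agent, channel, and sink names below are placeholders, and the custom ORC sink class is an assumption standing in for Tongcheng's internal implementation:

```properties
# Hypothetical Flume agent: one channel fanned out to a failover sink group.
a1.channels = c1
a1.sinks = hdfs_sink backup_sink
a1.sinkgroups = g1

# Failover processor: events go to the highest-priority healthy sink.
a1.sinkgroups.g1.sinks = hdfs_sink backup_sink
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.hdfs_sink = 10
a1.sinkgroups.g1.processor.priority.backup_sink = 5
a1.sinkgroups.g1.processor.maxpenalty = 10000

# Custom sink class (assumed name) that writes ORC files to Hadoop.
a1.sinks.hdfs_sink.type = com.example.flume.OrcHdfsSink
a1.sinks.hdfs_sink.channel = c1
a1.sinks.backup_sink.channel = c1
```

Swapping `processor.type` to `load_balance` would give the load-balancing behavior mentioned above instead of failover.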
Elasticsearch provides fast search and alerting capabilities, with routing and index lifecycle management to handle growing data volumes.
Hadoop serves as the durable, scalable storage for full‑history logs, allowing linear expansion as needed.
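Tying the layers together, the query layer's REST API presumably translates user requests into Elasticsearch query DSL. The field names (`service`, `message`, `@timestamp`) below are illustrative assumptions about the index mapping:

```python
import json

def build_query(service, keyword, size=20):
    """Build an Elasticsearch _search body of the kind a query-layer
    REST API might issue; field names are assumptions."""
    return json.dumps({
        "size": size,
        "sort": [{"@timestamp": "desc"}],
        "query": {
            "bool": {
                "filter": [{"term": {"service": service}}],
                "must": [{"match": {"message": keyword}}],
            }
        },
    })

body = build_query("order-api", "timeout")
```

A web UI and programmatic clients can then share the same endpoint, differing only in how they render the hits.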
3. Future Plans
After several iterations, the Tianwang log system now reliably supports the company's peak loads, yet continuous improvement remains a priority. Planned enhancements for the second half of 2016 include cross‑platform file collection, data‑center awareness and disaster recovery, Docker‑based storage with auto‑scaling, and other features to further strengthen the platform.
Tongcheng Travel Technology Center