
Design and Implementation of a Unified Log Framework for Ctrip Payment Center

The article describes the design, architecture, and operational details of a unified logging framework at Ctrip's payment center, covering log production via a Log4j2 extension, Kafka‑Camus collection, Hive/ORC storage, MapReduce parsing optimizations, and governance strategies for TB‑scale daily data volumes.

Ctrip Technology

Background

Ctrip's payment center handles transaction, identity verification, and account services, requiring long‑term storage of intermediate data for audit and compliance. Diverse applications generate massive, heterogeneous logs, creating challenges for storage, retrieval, and analysis, prompting the development of a unified log framework.

Overall Architecture

The framework consists of three core modules: log production, log collection, and log parsing. Applications send logs to Kafka via a custom Log4j2 appender, Camus jobs periodically consume Kafka topics and write data to HDFS, and a T+1 MapReduce job loads the data into Hive tables.

Log Production – Unified Log Component

Developers integrate a Log4j2‑based component that forwards logs to Kafka. The component uses AOP for low‑intrusiveness, defines rich Java annotations for configurable output (class name, method name, parameters, return values, exceptions, and sensitive‑field masking), and supports multiple underlying log frameworks (CLOG, CAT, ES).
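The annotation-driven, low-intrusion idea can be sketched with plain JDK dynamic proxies. This is a minimal illustration, not the actual Ctrip component: the annotation name, its attributes, and the service interface are all hypothetical, and the real component serializes the record to JSON and hands it to a Kafka appender rather than printing it.

```java
import java.lang.annotation.*;
import java.lang.reflect.*;
import java.util.Arrays;

// Hypothetical annotation mirroring the component's configurable output:
// whether to log parameters and return values, and which fields to mask.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface UnifiedLog {
    boolean logParams() default true;
    boolean logReturn() default true;
    String[] maskFields() default {};
}

interface PaymentService {
    @UnifiedLog(maskFields = {"cardNo"})
    String pay(String orderId, String cardNo);
}

public class LoggingProxyDemo {
    // Wraps any interface implementation in a JDK dynamic proxy that emits a
    // structured log line for methods annotated with @UnifiedLog; the caller's
    // business code stays untouched, which is the "low-intrusiveness" point.
    @SuppressWarnings("unchecked")
    static <T> T withLogging(Class<T> iface, T target) {
        return (T) Proxy.newProxyInstance(iface.getClassLoader(), new Class<?>[]{iface},
            (proxy, method, args) -> {
                UnifiedLog cfg = method.getAnnotation(UnifiedLog.class);
                Object result = method.invoke(target, args);
                if (cfg != null) {
                    // The real component would also mask fields named in
                    // cfg.maskFields() and forward JSON to Kafka; we just print.
                    System.out.println("tag={service=" + iface.getSimpleName()
                        + ", method=" + method.getName() + "}"
                        + (cfg.logParams() ? " args=" + Arrays.toString(args) : "")
                        + (cfg.logReturn() ? " return=" + result : ""));
                }
                return result;
            });
    }

    public static void main(String[] args) {
        PaymentService svc = withLogging(PaymentService.class,
            (orderId, cardNo) -> "OK:" + orderId);
        svc.pay("T1001", "6222***");
    }
}
```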

Advantages:

Extensible Appender mapping for various log frameworks.

Low‑intrusive AOP integration.

Rich annotation‑driven configuration enabling detailed, sanitized logs.

Issues identified:

Inconsistent log formats across hundreds of services.

Short storage retention (online CLOG only recent days, Hive tables limited to T+2).

To address these, the team extended the component with a unified logging schema.

3.1 Unified Log – Field Design

Logs are structured as JSON with two main parts: tag (a Map for searchable key‑value pairs) and message (raw log). Standard fields include serviceName, tag, message, request, response, requesttime, responsetime, etc. Additional flexible fields such as version, platform, and reference number are also defined.

Common fields automatically captured without developer code include applicationId and logTime.
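A record under this schema might look like the following. All values are illustrative; only the field names come from the design above.

```json
{
  "applicationId": "100012345",
  "logTime": "2020-05-06 10:15:30.123",
  "serviceName": "paymentVerifyService",
  "tag": { "orderId": "O9876543210", "platform": "APP", "version": "1.0" },
  "request": "{ \"orderId\": \"O9876543210\" }",
  "response": "{ \"code\": 0 }",
  "requesttime": "2020-05-06 10:15:30.001",
  "responsetime": "2020-05-06 10:15:30.120",
  "message": "raw log payload"
}
```

The `tag` map is what makes records searchable by business key; `message` preserves the raw log untouched.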

Log Collection

The collection layer builds on LinkedIn's open‑source Camus, which reads Kafka data via MapReduce and writes to HDFS. Customizations include:

Custom decoder/partitioner to generate business‑meaningful HDFS paths instead of coarse date‑based directories.

Custom provider to write ORC files (instead of large text files), reducing storage footprint and improving query performance.
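The path-generation idea behind the custom partitioner can be sketched as below. This is not Camus's actual partitioner API; the class and method names are illustrative, showing only how a topic/service/time layout replaces a coarse date directory.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Sketch: build a business-meaningful HDFS path (topic and serviceName in the
// directory tree) instead of a single coarse date-based directory, so each
// business line lands in its own subtree and can be loaded independently.
public class BusinessPathPartitioner {
    private static final DateTimeFormatter HOUR =
        DateTimeFormatter.ofPattern("yyyy/MM/dd/HH");

    static String partitionPath(String baseDir, String topic,
                                String serviceName, LocalDateTime eventTime) {
        return String.join("/", baseDir, topic, serviceName,
                           eventTime.format(HOUR));
    }

    public static void main(String[] args) {
        System.out.println(partitionPath("/data/camus", "payment_log",
            "verifyService", LocalDateTime.of(2020, 5, 6, 10, 0)));
        // → /data/camus/payment_log/verifyService/2020/05/06/10
    }
}
```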

4.1 Camus Job Execution

The job must run often enough that data is consumed before Kafka's retention limits (3 days, or 10 GB per partition) expire it; otherwise data is lost. Conversely, if one run's duration exceeds the scheduling interval, overlapping executions can misalign committed offsets and require manual cleanup, surfacing errors such as:

The earliest offset was found to be more than the current offset

Task overlap may also leave two attempts writing the same target file, producing size-mismatch errors:

Error: java.io.IOException: target exists.the file size(614490 vs 616553) is not the same.
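A defensive check for the first error, where the saved offset has aged out of Kafka's retention window, can be sketched as follows. The method and class names are illustrative, not Camus internals; the point is that consumption resumes from the earliest retained offset instead of failing.

```java
// Sketch: reconcile the offset saved by the previous run against what Kafka
// still retains. If retention already deleted savedOffset..earliest, that
// range is a data-loss window and we must restart from earliest.
public class OffsetReconciler {
    static long resolveStartOffset(long savedOffset, long earliest, long latest) {
        if (savedOffset < earliest) {
            // "The earliest offset was found to be more than the current offset"
            return earliest;
        }
        // Never start past the newest message Kafka has.
        return Math.min(savedOffset, latest);
    }
}
```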

4.2 Controlling Camus Output File Size

Imbalanced writes across Kafka partitions can produce oversized HDFS files that a single mapper must then process, hindering parallelism. Mitigations include distributing logs evenly across Kafka partitions at produce time and using splittable input formats downstream.
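One way to spread a single service's logs evenly at produce time is to salt the partition choice with a rotating counter instead of keying purely on serviceName. A minimal sketch, with hypothetical names; a real Kafka producer partitioner would implement the `Partitioner` interface:

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch: a hot serviceName no longer pins all its records to one partition;
// the rotating salt cycles through every partition, so no single Camus mapper
// receives an outsized file.
public class SpreadingPartitioner {
    private final AtomicLong counter = new AtomicLong();

    int choosePartition(String serviceName, int numPartitions) {
        int base = Math.floorMod(serviceName.hashCode(), numPartitions);
        int salt = (int) (counter.getAndIncrement() % numPartitions);
        return (base + salt) % numPartitions;
    }
}
```

The trade-off is losing per-key ordering in Kafka, which is acceptable here because records carry their own timestamps.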

4.3 ORC File Writing Considerations

ORC writes may exceed the default 600-second task timeout:

AttemptID:attempt_1587545556983_2611216_m_000001_0 Timed out after 600 secs

Solution: raise the task timeout (the value is in milliseconds; 1200000 ms = 20 minutes):

mapreduce.task.timeout=1200000

OOM errors during ORC writes require more mapper memory:

beyond physical memory limits. Current usage: 2.5 GB of 2.5 GB physical memory used; 4.2 GB of 5.3 GB virtual memory used. Killing container.

Adjusted settings:

mapreduce.map.memory.mb=8096
mapreduce.map.java.opts=-Xmx6000m

Log Parsing

Parsing runs on the Map side of MapReduce, allowing easy scaling by adjusting the number of mappers. Optimizations include:

InputSplit optimization: managing file count and splittability to avoid excessive small files or single‑mapper bottlenecks.

Shuffle optimization: salting keys with timestamps or using custom partitioners to reduce data skew.

Batch parsing: using MultipleInputs/MultipleOutputs to handle hundreds of business processes in a single job.

Empty-file handling: employing LazyOutputFormat to prevent zero‑size files that stress the NameNode.

File-duplication avoidance: catching exceptions on the Reduce side to delete duplicate files and abort faulty tasks.
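The shuffle optimization above, salting a hot key with a timestamp so its records spread over several reducers, can be sketched as follows. Names are hypothetical; the reducer strips the salt before writing output.

```java
// Sketch: a high-volume key like one busy serviceName is split into up to
// `buckets` sub-keys on the map side, so its records no longer pile onto a
// single reducer; unsalt() recovers the original key on the reduce side.
public class SaltedKey {
    static String salt(String key, long timestampMs, int buckets) {
        return key + "#" + (timestampMs % buckets);
    }

    static String unsalt(String saltedKey) {
        int i = saltedKey.lastIndexOf('#');
        return i < 0 ? saltedKey : saltedKey.substring(0, i);
    }
}
```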

These measures reduced daily full‑log parsing time from several hours to about 25 minutes, and after switching to ORC+Snappy and implementing a CombineFileInputFormat, performance stabilized.

Log Governance

Daily new ORC data reaches terabytes, so logs are tiered by purpose (debug, audit, analysis) with different retention policies. Business‑based partitioning creates many small files; a TTL and file‑merge process mitigates NameNode pressure.
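The purpose-based tiering can be expressed as a simple TTL table consulted by the cleanup job. The day counts below are illustrative assumptions, not Ctrip's actual policy; only the three tiers come from the text above.

```java
import java.util.Map;

// Sketch: map each log tier to a retention window; a daily cleanup job asks
// expired() before dropping a partition. Day counts are assumed, not sourced.
public class RetentionPolicy {
    enum Tier { DEBUG, ANALYSIS, AUDIT }

    private static final Map<Tier, Integer> TTL_DAYS = Map.of(
        Tier.DEBUG, 7,        // troubleshooting only, cheapest to drop
        Tier.ANALYSIS, 90,    // offline analytics
        Tier.AUDIT, 365 * 5   // long-term compliance retention
    );

    static boolean expired(Tier tier, long ageDays) {
        return ageDays > TTL_DAYS.get(tier);
    }
}
```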

Conclusion and Outlook

Current pipelines process TB‑scale logs within 30 minutes. Future plans include migrating the log system to ClickHouse for real‑time analytics to support fine‑grained operational insights.

Tags: Big Data, MapReduce, data governance, Hadoop, ORC, log processing, Camus