Big Data 16 min read

How Ctrip Built a Scalable Unified Log Framework for Payment Data

Facing massive, heterogeneous logs from numerous payment services, Ctrip’s data team designed a unified logging framework that extends log4j2, streams logs via Kafka to HDFS using a customized Camus pipeline, partitions and stores data in ORC for efficient Hive analysis, while addressing format, storage, and performance challenges.

dbaplus Community
dbaplus Community
dbaplus Community
How Ctrip Built a Scalable Unified Log Framework for Payment Data

Background

Payment Center at Ctrip handles transactions, real‑name binding, accounts and acquiring, requiring long‑term storage of intermediate data for audit and compliance. The sheer volume and varied formats of logs from many services created storage and usability challenges, prompting the development of a unified log framework.

Overall Architecture

The core modules are log production, log collection, and log parsing. The flow is:

Applications integrate a log4j2‑based unified component that forwards logs to Kafka.

A periodic Camus job consumes the Kafka topic and writes the data to HDFS.

On T+1, a MapReduce job reads the Camus output from HDFS and loads it into Hive tables.

Log Production – Unified Log Component

The team created custom Log4j2 appenders that send application logs to Kafka, where a log_process_service normalizes them and forwards them to Ctrip’s common logging back‑ends such as CLOG, CAT and Elasticsearch. Applications do not need to know the underlying logging infrastructure.

Different log frameworks map to distinct appenders, simplifying integration of new frameworks.

AOP programming keeps the intrusion level low for developers.

Rich Java annotations enable configurable output of class name, method name, parameters, return values, exceptions, and support for sensitive‑field masking.

Problems identified:

Inconsistent log formats across hundreds of services hinder analysis.

Short retention periods: online CLOG stores only recent days, ES holds slightly longer, and offline Hive tables can keep data only up to T+2, insufficient for T+1 reporting.

To address these, the payment data team built a unified log framework on top of the custom component.

Unified Log – Instrumentation Design

Given hundreds of services (routing, authentication, card service, order, wallet, etc.) across app, H5, online and offline projects, a consistent logging schema was essential for reliable BI analysis. Logs are split into tag (a Map) and message (raw string) with the format [[$tag]]$message. Two categories of fields are defined:

Normative Fields (required)

serviceName (string): name of the invoked service.

tag (Map): key‑value metadata.

message (string): original log content.

request (string): API request parameters.

response (string): API response value.

requesttime (string): request timestamp.

responsetime (string): response timestamp.

Common Fields (optional, JSON‑style)

version (string): app version.

plat (string): platform identifier.

refno (string): transaction reference number.

Partition / Bucketing Definition

Hive partitioning heavily influences query performance. For payment checkout, hundreds of scenarios exist, each with varying call frequencies. A unique scene ID is used as a partition key to balance data size while avoiding an explosion of small files; bucketing further refines distribution.

Log Collection

The collection pipeline builds on LinkedIn’s open‑source Camus, which reads Kafka via MapReduce and writes directly to HDFS without a reduce phase, minimizing data skew. Customizations include:

Custom decoder/partitioner to extract business‑specific partition fields and generate meaningful HDFS paths, enabling efficient back‑fill for specific time ranges.

Custom provider that writes ORC files instead of plain text, reducing storage footprint and allowing parallel splits for downstream parsing.

Camus Job Execution

Execution frequency must match Kafka retention (3 days, 10 GB per partition) to avoid data loss.

Jobs run as singletons; after completion they commit Kafka offsets. Overlapping executions cause size mismatches and require manual cleanup of the output path.

Controlling Camus File Size

Uneven data distribution across Kafka partitions can produce oversized HDFS files that are not splittable, throttling parallelism. Solutions:

Temporarily disperse logs across multiple Kafka partitions.

Adopt a splittable input format (e.g., ORC) to enable data sharding.

ORC File Write Considerations

Timeouts: increase mapreduce.task.timeout from 600 s to 1200 s.

OOM: raise mapper memory with mapreduce.map.memory.mb=8096 and JVM options -Xmx6000m.

Log Parsing

Parsing runs on the Map side of MapReduce, allowing control over map parallelism. Optimizations include:

InputSplit Optimization

File count: without CombineFileInputFormat, each small file spawns a map, hurting performance.

File size & splitability: large, splittable files generate multiple maps.

Earlier runs processed a full day’s logs in ~25 minutes; later a mis‑partitioned batch caused some maps to run for hours.

Switching from text+snappy to orc+snappy and adding a custom CombineFileInputFormat reduced small‑file overhead and improved concurrency.

Shuffle Optimization

Custom partitioner or timestamp‑augmented keys prevent reducer skew.

Batch Parsing

MultipleInputs/MultipleOutputs enable a single job to handle diverse business‑specific parsing and write unified results to Hive external tables.

Empty File Handling

Use LazyOutputFormat to avoid creating zero‑size files that burden the NameNode.

Duplicate File Creation

Catch write exceptions on the reducer side, delete the partially written file, and abort the task; MapReduce’s retry mechanism prevents endless duplicate file generation.

Reducer Count Adjustment

Dynamic calculation of reducer numbers based on input file count and total size balances memory usage and small‑file proliferation.

Log Governance

Daily ORC compressed raw data reaches terabytes, prompting tiered log retention: troubleshooting logs, audit logs, analytical logs, each with distinct lifecycle policies. Camus‑written files are merged and assigned TTL to protect the NameNode.

Conclusion and Outlook

Current daily TB‑scale logs are parsed within 30 minutes. Future plans include migrating the log system to ClickHouse for low‑latency, real‑time analytics to support fine‑grained operational insights.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Big DataKafkaHiveHadoopORCLog ProcessingCamus
dbaplus Community
Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.