
Flume Practice at YouZan: Data Collection and Pipeline Construction in Big Data Scenarios

YouZan's experience with Flume shows how the at-least-once delivery model, combined with FileChannel storage and custom extensions (an NsqSource, an hourly-rolling HdfsEventSink, a metric reporting server, and a timestamp interceptor), can reliably move MySQL binlog data to HDFS. Tuning transaction batch size and channel capacity further improves throughput and stability, paving the way for a unified management platform.

Youzan Coder

Flume is a distributed, highly reliable, and scalable data collection service. At YouZan, Flume has played a stable and reliable role as a "data mover" for big data business operations. This article shares YouZan's practical experience with Flume and their understanding of the framework.

Delivery Guarantee

Understanding Flume's event delivery reliability guarantee is crucial when deciding whether to use it. There are three levels of delivery guarantee: At-least-once, At-most-once, and Exactly-once. While users typically want Exactly-once, few tools achieve it without significant cost trade-offs. Flume chooses an At-least-once strategy to preserve stability and throughput. This guarantee applies to message delivery between the Source, Channel, and Sink components. With MemoryChannel, data in the channel can be lost if the process crashes, so FileChannel is necessary for strict At-least-once requirements.
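As a concrete illustration, the difference comes down to one line of agent configuration. The sketch below uses standard Flume property keys; the agent and component names (`a1`, `r1`, `c1`, `k1`) and the directories are placeholders:

```properties
# Hypothetical agent "a1": FileChannel persists events to disk, so a
# process crash does not lose events already committed to the channel.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /data/flume/checkpoint
a1.channels.c1.dataDirs = /data/flume/data

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

Swapping `type = file` for `type = memory` trades this durability for lower latency, which is exactly the MemoryChannel risk described above.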

Flume's At-least-once implementation is based on its Transaction mechanism, with four lifecycle functions: start, commit, rollback, and close. When a Source batches events into a Channel, it starts a transaction, puts the events, commits on success or rolls back on failure, then closes the transaction and acknowledges the upstream service. A Sink follows the same pattern but commits only after successfully writing to the destination.
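To make the sink-side pattern concrete, here is a minimal self-contained model of it. This is not Flume's actual Transaction API, just a sketch of the logic: take events inside a transaction, commit only after the destination accepted them, and roll taken events back into the channel on failure so they are redelivered:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Simplified model of Flume's sink-side transaction pattern.
class TxSketch {
    static final Deque<String> channel = new ArrayDeque<>(); // buffered events
    static final Deque<String> taken = new ArrayDeque<>();   // taken this tx

    static void begin() { taken.clear(); }

    static String take() {
        String e = channel.pollFirst();
        if (e != null) taken.addLast(e);
        return e;
    }

    // commit: taken events are done; rollback: return them to the channel.
    static void commit() { taken.clear(); }

    static void rollback() {
        while (!taken.isEmpty()) channel.addFirst(taken.pollLast());
    }

    // Drain the channel to a destination; destHealthy simulates write failures.
    static boolean drainTo(StringBuilder dest, boolean destHealthy) {
        begin();
        try {
            String e;
            while ((e = take()) != null) {
                if (!destHealthy) throw new RuntimeException("write failed");
                dest.append(e);
            }
            commit();       // only after the destination accepted everything
            return true;
        } catch (RuntimeException ex) {
            rollback();     // events stay in the channel -> at-least-once
            return false;
        }
    }
}
```

A failed write leaves every taken event back in the channel, so a retry can deliver it again; duplicates on retry are precisely why the guarantee is at-least-once rather than exactly-once.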

Application Scenario: MySQL Binlog-based Data Warehouse Incremental Sync

A classic Flume use case at YouZan is MySQL binlog-based incremental data warehouse synchronization (datay). This scenario requires delivering NSQ binlog messages to HDFS without data loss. Since Flume follows At-least-once semantics but must still survive various failure scenarios, MemoryChannel was insufficient and FileChannel was used instead. Because NSQ is the message broker, YouZan extended an NsqSource based on their NSQ SDK. A second version further extended an NsqChannel for better reliability, with benefits including single-transaction message passing (better performance) and avoiding additional Kafka infrastructure.

Custom Extensions

Flume provides excellent extensibility for component customization:

NsqSource: Custom Source implementation extending abstract base classes with lifecycle method implementations. Note that Flume components may be restarted when configuration changes, requiring proper resource cleanup.
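A skeleton of such a source, written against Flume's public `AbstractSource`/`Configurable` lifecycle hooks, might look as follows. This is a non-runnable sketch: the `NsqConsumer` client and its methods stand in for YouZan's NSQ SDK and are hypothetical.

```java
// Sketch of a custom NSQ source. configure/start/stop are Flume's real
// lifecycle hooks; NsqConsumer is a hypothetical NSQ SDK client.
public class NsqSource extends AbstractSource
        implements EventDrivenSource, Configurable {

    private String topic;
    private NsqConsumer consumer;  // hypothetical NSQ SDK client

    @Override
    public void configure(Context context) {
        topic = context.getString("topic");
    }

    @Override
    public synchronized void start() {
        consumer = new NsqConsumer(topic, message ->
                // Hand each NSQ message to the channel inside a transaction.
                getChannelProcessor().processEvent(
                        EventBuilder.withBody(message.getBody())));
        consumer.connect();
        super.start();
    }

    @Override
    public synchronized void stop() {
        // Flume restarts components on config changes, so stop() must
        // release the connection or restarted instances will leak clients.
        if (consumer != null) consumer.close();
        super.stop();
    }
}
```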

HdfsEventSink Extension: Modified roll file logic to be based on hourly time boundaries rather than first event time, solving issues where low-volume data periods caused delayed file closure affecting offline task data availability.
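For context, the stock HDFS sink's `hdfs.rollInterval` counts from when a file is opened (i.e. from the first event written to it), which is the behavior this extension replaces with rolling on wall-clock hour boundaries. A hedged config sketch of the relevant stock knobs (agent/component names and paths are placeholders):

```properties
# Stock HDFS sink: roll by elapsed time since file open, disable
# size- and count-based rolling. Under low traffic the file can stay
# open well past the hour it belongs to -- the problem YouZan fixed.
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /warehouse/binlog/%Y%m%d/%H
a1.sinks.k1.hdfs.rollInterval = 3600
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 0
```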

MetricsReportServer: Custom HTTP-based metric reporting service that periodically pushes metrics to a centralized web service, addressing port allocation challenges when deploying multiple Flume instances on single machines.
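For comparison, Flume's built-in monitoring is pull-based: each agent exposes JSON metrics on its own HTTP port, and `flume.monitoring.type` can also point at a custom MonitorService implementation. The class name in the second command is hypothetical:

```shell
# Built-in pull-based HTTP metrics: each agent needs its own free port,
# which becomes awkward with many agents per machine.
flume-ng agent -n a1 -c conf -f conf/a1.properties \
  -Dflume.monitoring.type=http -Dflume.monitoring.port=34545

# A push-based reporter can instead be plugged in as a custom
# monitoring class (name hypothetical), avoiding per-agent ports:
flume-ng agent -n a1 -c conf -f conf/a1.properties \
  -Dflume.monitoring.type=com.youzan.flume.MetricsReportServer
```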

Event Timestamp Interceptor: Extended Interceptor to use message timestamp fields instead of system time for HDFS directory partitioning, ensuring events within the same hour are stored together.
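Wiring such an interceptor in is a configuration concern: the HDFS sink's `%Y%m%d/%H` path escapes resolve against the event's `timestamp` header, so an interceptor that fills that header from the message body controls partitioning. A sketch with a hypothetical interceptor class name:

```properties
# Custom interceptor (class name hypothetical) that copies the
# message's own timestamp field into the "timestamp" event header.
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = com.youzan.flume.EventTimestampInterceptor$Builder

# The sink partitions by the header, not the agent's local clock.
a1.sinks.k1.hdfs.path = /warehouse/binlog/%Y%m%d/%H
a1.sinks.k1.hdfs.useLocalTimeStamp = false
```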

Performance Tuning

Key tuning parameters include transaction batch size and channel capacity. Increasing batch sizes reduces CPU consumption and network I/O wait time. Channel capacity directly bounds source and sink event throughput: a larger capacity improves throughput but increases memory consumption (for MemoryChannel) and the amount of data at risk on process failure. Different channel types require different considerations, so trade-offs are inevitable.
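These knobs map onto standard Flume properties. The values below are placeholders for illustration, not recommendations:

```properties
# transactionCapacity bounds how many events one put/take transaction
# (i.e. one batch) may carry; capacity bounds total buffered events.
a1.channels.c1.capacity = 100000
a1.channels.c1.transactionCapacity = 1000

# Sink-side batch size should not exceed transactionCapacity.
a1.sinks.k1.hdfs.batchSize = 1000
```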

Summary and Outlook

Flume has proven to be a very stable service in production. Its clear architectural design with multiple component implementations and extension points makes it easy to find or customize data pipeline solutions. Future plans include building a unified platform for centralized Flume instance management to reduce user costs, optimize resource allocation, and improve monitoring reliability.

Tags: Big Data, data collection, data pipeline, performance tuning, HDFS, Flume, NSQ, at-least-once
Written by Youzan Coder

Official Youzan tech channel, delivering technical insights and occasional daily updates from the Youzan tech team.