How Weidian Built a Scalable Big Data Platform for Mobile Commerce
This article outlines the design and implementation of Weidian’s end‑to‑end big data processing platform, covering dataset definition, data collection via Flume‑based DataAgent, transmission through Databus, storage options such as HDFS, Kafka and Elasticsearch, and the monitoring and resource‑integration strategies that support massive mobile commerce logs.
Abstract: Weidian, a leading mobile e‑commerce network with over 30 million merchants, needed a robust big data platform to handle massive log data generated by its services. This article describes the core technologies used for data collection, transmission, storage, and analysis.
What Is a Dataset?
A dataset (DATASET) provides a unified name and usage guidelines for data that flows through the collection, transmission, storage, and analysis stages, reducing communication overhead among developers, operators, and analysts.
The dataset includes attributes such as owner, product line, project, data type (e.g., access log, application log, MySQL log), collection method, storage method, estimated scale, and retention period.
Data Collection Layer
Data collection is performed mainly by a custom Flume‑based agent called DataAgent . The workflow is:
Business developers request a dataset.
The big‑data team releases a DataAgent configuration.
Operations deploy DataAgent on the service machines.
DataAgent tails log files in real time and forwards events.
Key features of DataAgent include multi‑dataset configuration, checkpointing, rate limiting, inode‑based file lookup, handling of symbolic links, custom header injection, DualChannel (MemChannel + FileChannel), KafkaChannel support, and built‑in Ganglia metrics.
Data Transmission Layer
Collected events enter the Databus , which routes data to storage back‑ends. The architecture consists of:
Flume‑based Avro receiver for DataAgent output.
KafkaChannel for real‑time Kafka consumption.
Syslog collector for network device logs.
HadoopLoader for Rsync‑pushed files.
API endpoint for direct POSTs.
Supported storage targets are HDFS, Kafka, Elasticsearch, and third‑party stores. Each dataset (or its category identifier) maps to a unique Kafka topic.
Data Storage and Analysis Layer
We use Hadoop 2.6.0 for distributed storage. Multiple clusters (HDFS, Spark, GPU, UDC) share a unified 300‑node HDFS pool, while compute workloads are isolated per cluster. A portal provides unified access control, metadata browsing, job submission (MapReduce, Hive, Spark, etc.), and resource monitoring.
Data is written to HDFS via a customized Flume HdfsSink that adds a random suffix to avoid filename collisions and optimizes path parsing, achieving >40% throughput improvement. LZO compression is applied for efficient storage.
HadoopLoader periodically scans Rsync‑delivered directories, validates files with MD5, moves them to an uploading queue, and uploads them to HDFS, recording metadata in a database.
Monitoring Layer
Ganglia collects metrics from all components, while Nagios handles alerting. Monitoring dashboards display the health of each layer and per‑business‑group DataAgent metrics.
The platform continues to evolve, with plans to simplify Databus configuration and further integrate resource management.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
21CTO
21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
