Umeng’s Mobile Big Data Platform: Architecture, Challenges & Insights
The article details Umeng’s mobile big‑data platform architecture, describing its Lambda‑style hybrid design, data ingestion pipeline with dual Kafka clusters, offline and real‑time processing using Hadoop, Spark, Storm, and storage layers such as HDFS, HBase, MongoDB and Elasticsearch, while also discussing challenges in data collection, cleaning, computation, security, and value‑added services.
Wu Lei, head of data platform at Umeng, has over ten years of software development experience, previously working on large‑scale communication systems, search engines and massive data analysis.
Mobile Big Data Analysis Platform Architecture and Evolution
Umeng’s architecture adopts a Lambda‑style hybrid model that balances low‑latency real‑time computation with full‑batch processing, aggregating the two views to serve external services. The Lambda architecture consists of a batch layer, a speed layer, and a serving layer.
Overall System Design
Data from SDKs embedded in apps (mobile, tablet, set‑top box) is sent to Nginx, load‑balanced, and received by a Finagle‑based log collector. Two Kafka clusters handle ingestion: the upstream cluster feeds real‑time consumers, the downstream cluster feeds offline consumers, synchronized via Kafka Mirror to separate I/O loads.
The computation layer is split into offline and real‑time parts. Offline processing uses Hadoop MapReduce jobs, Hive for data warehousing, Pig (moving to Spark) for data mining, and stores results in HDFS and HBase (with Elasticsearch for secondary indexing). Real‑time processing uses Storm, with results stored in MongoDB. Both layers expose data through a unified REST service.
Challenges in Data Collection, Cleansing, and Computation
Data Collection
Massive traffic: Handles billions of log events daily.
High concurrency: Peaks during commuting, lunch, and night hours.
Scalability: Requires horizontal scaling to accommodate rapid mobile growth.
Initially built on Ruby on Rails with Resque, the platform migrated to a Finagle server to achieve higher throughput and stateless horizontal scaling.
Data Cleansing
Three aspects: unique identifiers, standardization, and format.
Unique identifiers: Android devices provide IMEI, MAC, and Android ID, but fragmentation and custom ROMs cause duplicates or missing IDs. Umeng combines these IDs with a deterministic algorithm and applies machine‑learning‑based blacklisting to generate a unified ID.
Standardization: Normalizes device models, geographic regions (using IP heuristics), and timestamps (handling client‑side clock drift).
Format: Supports JSON, Thrift, Protobuf.
Computation
Real‑time computation must stay lightweight to meet latency requirements; spikes are handled by scaling the Lambda architecture. Offline computation deals with data skew and resource scheduling, using custom Hadoop fair‑scheduler tweaks. Near‑real‑time tasks use mini‑batch MapReduce or Spark Streaming.
Storage and Security
Online storage: MongoDB for balanced read/write workloads; HBase for write‑heavy offline workloads.
Offline storage: HDFS with compression and lifecycle management.
Cache: Redis cluster for horizontal scaling and pre‑loading.
Umeng does not collect personally identifiable information such as phone numbers or precise location. Access to computed results requires authentication, and data is isolated between teams to ensure privacy.
Data Value‑Added Services
Umeng leverages its data platform to provide cross‑business data integration, user profiling, device rating, and app health assessments, helping developers improve user engagement and evaluate channel effectiveness.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
ITFLY8 Architecture Home
ITFLY8 Architecture Home - focused on architecture knowledge sharing and exchange, covering project management and product design. Includes large-scale distributed website architecture (high performance, high availability, caching, message queues...), design patterns, architecture patterns, big data, project management (SCRUM, PMP, Prince2), product design, and more.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
