Big Data 13 min read

Umeng’s Mobile Big Data Platform: Architecture, Challenges & Insights

The article details Umeng’s mobile big‑data platform architecture, describing its Lambda‑style hybrid design, data ingestion pipeline with dual Kafka clusters, offline and real‑time processing using Hadoop, Spark, Storm, and storage layers such as HDFS, HBase, MongoDB and Elasticsearch, while also discussing challenges in data collection, cleaning, computation, security, and value‑added services.

ITFLY8 Architecture Home

Dec 13, 2016

Umeng’s Mobile Big Data Platform: Architecture, Challenges & Insights

Wu Lei, head of data platform at Umeng, has over ten years of software development experience, previously working on large‑scale communication systems, search engines and massive data analysis.

Mobile Big Data Analysis Platform Architecture and Evolution

Umeng’s architecture adopts a Lambda‑style hybrid model that balances low‑latency real‑time computation with full‑batch processing, aggregating the two views to serve external services. The Lambda architecture consists of a batch layer, a speed layer, and a serving layer.

Overall System Design

Data from SDKs embedded in apps (mobile, tablet, set‑top box) is sent to Nginx, load‑balanced, and received by a Finagle‑based log collector. Two Kafka clusters handle ingestion: the upstream cluster feeds real‑time consumers, the downstream cluster feeds offline consumers, synchronized via Kafka Mirror to separate I/O loads.

The computation layer is split into offline and real‑time parts. Offline processing uses Hadoop MapReduce jobs, Hive for data warehousing, Pig (moving to Spark) for data mining, and stores results in HDFS and HBase (with Elasticsearch for secondary indexing). Real‑time processing uses Storm, with results stored in MongoDB. Both layers expose data through a unified REST service.

Challenges in Data Collection, Cleansing, and Computation

Data Collection

Massive traffic: Handles billions of log events daily.

High concurrency: Peaks during commuting, lunch, and night hours.

Scalability: Requires horizontal scaling to accommodate rapid mobile growth.

Initially built on Ruby on Rails with Resque, the platform migrated to a Finagle server to achieve higher throughput and stateless horizontal scaling.

Data Cleansing

Three aspects: unique identifiers, standardization, and format.

Unique identifiers: Android devices provide IMEI, MAC, and Android ID, but fragmentation and custom ROMs cause duplicates or missing IDs. Umeng combines these IDs with a deterministic algorithm and applies machine‑learning‑based blacklisting to generate a unified ID.

Standardization: Normalizes device models, geographic regions (using IP heuristics), and timestamps (handling client‑side clock drift).

Format: Supports JSON, Thrift, Protobuf.

Computation

Real‑time computation must stay lightweight to meet latency requirements; spikes are handled by scaling the Lambda architecture. Offline computation deals with data skew and resource scheduling, using custom Hadoop fair‑scheduler tweaks. Near‑real‑time tasks use mini‑batch MapReduce or Spark Streaming.

Storage and Security

Online storage: MongoDB for balanced read/write workloads; HBase for write‑heavy offline workloads.

Offline storage: HDFS with compression and lifecycle management.

Cache: Redis cluster for horizontal scaling and pre‑loading.

Umeng does not collect personally identifiable information such as phone numbers or precise location. Access to computed results requires authentication, and data is isolated between teams to ensure privacy.

Data Value‑Added Services

Umeng leverages its data platform to provide cross‑business data integration, user profiling, device rating, and app health assessments, helping developers improve user engagement and evaluate channel effectiveness.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Kafka mobile analytics Hadoop Data Architecture Lambda architecture

Written by

ITFLY8 Architecture Home

ITFLY8 Architecture Home - focused on architecture knowledge sharing and exchange, covering project management and product design. Includes large-scale distributed website architecture (high performance, high availability, caching, message queues...), design patterns, architecture patterns, big data, project management (SCRUM, PMP, Prince2), product design, and more.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.