TalkingData’s Journey to Building a Mobile Big Data Platform with Spark and YARN
This article recounts how TalkingData progressively introduced Spark into its Hadoop‑YARN based mobile big‑data platform, detailing early architectures, migration challenges, performance gains, the fully Spark‑centric redesign with Kafka and Spark Streaming, encountered pitfalls, and future plans for further optimization.
In recent years Spark has gained wide recognition in China, with events such as Spark Summit China 2014 and multiple Spark Meetups in major cities. TalkingData, an early adopter of Spark in the mobile internet big‑data services space, actively participated in these community activities and shared its Spark experiences.
First Encounter with Spark
While reviewing the Strata 2013 lecture notes, the team was drawn to a tutorial titled “An Introduction to the Berkeley Data Analytics Stack”, which highlighted Spark’s in‑memory RDD model, machine‑learning support, and unified batch and stream processing. Compared with Impala, Spark’s RDD‑centered ecosystem seemed a better fit for a fast‑growing, data‑driven startup.
Exploring Spark
In mid‑2013, as the volume of mobile device data surged, TalkingData decided to build its own data center to aggregate logs from various business platforms for deeper analysis. The initial data‑center functions included cross‑market Android app ranking and user‑interest‑based app recommendation, and the architecture was based on Hadoop 2.0 (Cloudera CDH4.3) with MapReduce for offline batch jobs.
As real‑time analysis needs grew, the team introduced Hive for interactive queries and Spark 0.8.0 (later 0.8.1) on YARN to support iterative machine‑learning workloads. The new architecture isolated Spark machine‑learning tasks, MapReduce ranking jobs, and Hive interactive jobs using separate YARN queues.
Because the default Spark binaries did not support CDH 4.3, the team compiled Spark from source using Maven on an AWS instance, adjusting JVM memory settings to avoid OOM errors. After successful compilation, the packaged Spark distribution was deployed on Hadoop client machines and verified with the SparkPi example.
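The article doesn’t record the exact commands, but following the Spark 0.8‑era build documentation, a compile against CDH 4.3 would have looked roughly like this (versions and paths are illustrative):

```shell
# Give Maven enough heap and PermGen; the Spark build is known to OOM
# with default JVM settings
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M"

# Build against the CDH 4.3 Hadoop artifacts with YARN support enabled,
# instead of the default Apache Hadoop version
mvn -Phadoop2-yarn -Dhadoop.version=2.0.0-cdh4.3.0 -DskipTests clean package

# Smoke-test the resulting distribution with the bundled SparkPi example
./run-example org.apache.spark.examples.SparkPi local
```

The key point is `-Dhadoop.version`: without it, the resulting jars link against the wrong Hadoop client libraries and fail at runtime on a CDH cluster.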
Developing Spark applications in Scala proved dramatically more concise and faster than equivalent MapReduce jobs, especially for iterative machine‑learning algorithms, leading to the migration of all data‑mining workloads to Spark.
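To illustrate the conciseness the team saw, here is a sketch in the style of Spark’s classic iterative logistic‑regression example; the input path, parsing, and dimensions are hypothetical, not TalkingData’s actual code:

```scala
import scala.math.exp
import scala.util.Random
import org.apache.spark.SparkContext

case class Point(x: Array[Double], y: Double)

val sc = new SparkContext("yarn-standalone", "IterativeLR")

// Parse the training samples once and pin the RDD in memory
val points = sc.textFile("hdfs:///mining/samples")   // hypothetical input
  .map { line =>
    val cols = line.split(',').map(_.toDouble)
    Point(cols.init, cols.last)
  }
  .cache()

val dims = 10
var w = Array.fill(dims)(Random.nextDouble())
for (_ <- 1 to 100) {
  // Each pass reuses the cached RDD instead of re-reading HDFS, which is
  // why Spark beats MapReduce so badly on iterative algorithms
  val gradient = points.map { p =>
    val dot = (w, p.x).zipped.map(_ * _).sum
    p.x.map(_ * ((1.0 / (1.0 + exp(-p.y * dot)) - 1.0) * p.y))
  }.reduce((a, b) => (a, b).zipped.map(_ + _))
  w = (w, gradient).zipped.map(_ - _)
}
```

An equivalent MapReduce implementation would re-read the input from HDFS and launch a new job on every iteration, in addition to the mapper/reducer boilerplate.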
Fully Embracing Spark
By 2014, data volume had multiplied several times, making Hive‑based batch jobs too slow for weekly or monthly analyses. After attending Spark Summit China 2014, the team decided to migrate the entire data‑center to a Spark‑centric architecture built on YARN.
The new platform aimed to provide near‑real‑time ETL, streaming processing, efficient offline computation, fast multidimensional analysis, real‑time analytics, machine‑learning capabilities, unified data access, unified data view, and flexible task scheduling.
Kafka was introduced as the log‑collection channel, feeding raw JSON logs into HDFS, where Spark Streaming performed log preservation, cleaning, conversion to Parquet, and downstream processing such as tag generation stored in MongoDB.
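The article doesn’t show the pipeline code, but a minimal Spark Streaming sketch of this Kafka‑to‑Parquet stage, using the 1.x‑era APIs, might look like the following (broker addresses, topic names, and paths are assumptions):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.sql.SQLContext

// One-minute micro-batches; endpoints are illustrative
val ssc = new StreamingContext("yarn-standalone", "LogETL", Seconds(60))

// Consume raw JSON logs from Kafka via ZooKeeper; keep only the payload
val logs = KafkaUtils
  .createStream(ssc, "zk1:2181", "etl-group", Map("app-logs" -> 4))
  .map(_._2)

logs.foreachRDD { (rdd, time) =>
  val sqlContext = new SQLContext(rdd.sparkContext)
  // Infer a schema from the JSON records and persist each batch as Parquet
  sqlContext.jsonRDD(rdd)
    .saveAsParquetFile(s"hdfs:///warehouse/logs/batch-${time.milliseconds}")
}

ssc.start()
ssc.awaitTermination()
```

Downstream jobs (tag generation, analysis) then read the Parquet files rather than re-parsing raw JSON.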
Ranking jobs rewritten on Spark achieved a six‑fold speedup even when data volume tripled. Spark SQL and Parquet columnar storage reduced a previously hour‑long multidimensional analysis to minutes, and a month‑scale analysis to a couple of hours.
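A sketch of the multidimensional‑analysis pattern described above, assuming an existing SparkContext `sc` (table and column names are invented for illustration):

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Load the columnar Parquet data produced by the ETL stage
val events = sqlContext.parquetFile("hdfs:///warehouse/logs")
events.registerTempTable("events")

// Only the three referenced columns are read from disk -- Parquet's
// column pruning is where the minutes-instead-of-hours speedup comes from
val rollup = sqlContext.sql("""
  SELECT app_id, os, COUNT(DISTINCT device_id) AS devices
  FROM events
  GROUP BY app_id, os
""")
rollup.collect().foreach(println)
```

The same query over raw JSON in Hive would scan every byte of every record, which is why the columnar rewrite paid off so dramatically.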
Additional components included a Bitmap engine migrated to YARN for fast indexed queries, an HCatalog‑based metadata service for a unified data view, and a custom DAG‑based distributed scheduler integrated with YARN for batch, real‑time, and pipeline tasks.
Pitfalls Encountered
The team documented several recurring issues: large‑scale shuffles raising "org.apache.spark.SparkException: Error communicating with MapOutputTracker", mitigated by setting spark.shuffle.consolidateFiles=true to reduce the number of shuffle files; frequent fetch failures that required inspecting executor logs and tuning parameters; and other runtime problems resolved by reading the Spark source and drawing on community support.
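In SparkConf form, the consolidation fix from the article, alongside two commonly recommended companion settings of that era (the numeric values are illustrative starting points, not tuned values from the article):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Merge shuffle outputs per core: far fewer files on large shuffles
  .set("spark.shuffle.consolidateFiles", "true")
  // Allow larger map-status messages between driver and executors
  .set("spark.akka.frameSize", "64")
  // Extra headroom before MapOutputTracker communication times out
  .set("spark.akka.askTimeout", "120")
```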
Future Plans
Looking ahead, TalkingData plans to adopt Spark 1.3’s DataFrame API, integrate the Tachyon distributed cache as a shared storage layer, deploy SSDs for Spark shuffle output, and leverage Sparkling‑Water (Spark + H2O) for deep‑learning‑enabled machine learning.
Conclusion
From the early MapReduce era to a modern Spark‑centric architecture, TalkingData’s two‑year evolution mirrors the broader shift in the big‑data industry, illustrating how Spark’s performance, flexibility, and ecosystem have become the foundation for next‑generation data processing.