
TalkingData’s Journey to Building a Mobile Big Data Platform with Spark and YARN

This article recounts how TalkingData progressively introduced Spark into its Hadoop/YARN-based mobile big-data platform: the early architecture, migration challenges, performance gains, the fully Spark-centric redesign around Kafka and Spark Streaming, the pitfalls encountered along the way, and plans for further optimization.


In recent years Spark has gained wide recognition in China, with events such as Spark Summit China 2014 and multiple Spark Meetups in major cities. TalkingData, an early adopter of Spark in the mobile internet big‑data services space, actively participated in these community activities and shared its Spark experiences.

First Encounter with Spark

While reviewing the Strata 2013 lecture notes, the team was attracted by a tutorial titled “An Introduction on the Berkeley Data Analytics Stack” which highlighted Spark’s in‑memory RDD model, machine‑learning support, and unified batch‑stream processing. Compared with Impala, Spark’s ecosystem built around RDDs seemed more suitable for a fast‑growing data‑driven startup.

Exploring Spark

In mid‑2013, as the volume of mobile device data surged, TalkingData decided to build its own data center to aggregate logs from various business platforms for deeper analysis. The initial data‑center functions included cross‑market Android app ranking and user‑interest‑based app recommendation, and the architecture was based on Hadoop 2.0 (Cloudera CDH4.3) with MapReduce for offline batch jobs.

As real‑time analysis needs grew, the team introduced Hive for interactive queries and Spark 0.8.0 (later 0.8.1) on YARN to support iterative machine‑learning workloads. The new architecture isolated Spark machine‑learning tasks, MapReduce ranking jobs, and Hive interactive jobs using separate YARN queues.
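
As a rough sketch of that isolation, a Spark job can be pinned to a dedicated YARN queue through its configuration. The queue name below is a placeholder, and the SparkConf API shown is from later releases for readability (Spark 0.8 configured this through system properties and the YARN client's arguments); on a modern deployment the same effect comes from spark-submit's --queue flag:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Pin this machine-learning job to its own YARN scheduler queue so it
// cannot starve the MapReduce ranking queue or the Hive interactive queue.
// "ml" is a hypothetical queue name, not TalkingData's actual configuration.
val conf = new SparkConf()
  .setAppName("InterestModelTraining")
  .set("spark.yarn.queue", "ml")

val sc = new SparkContext(conf)
```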

Because the default Spark binaries did not support CDH 4.3, the team compiled Spark from source using Maven on an AWS instance, adjusting JVM memory settings to avoid OOM errors. After successful compilation, the packaged Spark distribution was deployed on Hadoop client machines and verified with the SparkPi example.

Developing Spark applications in Scala proved dramatically more concise and faster than equivalent MapReduce jobs, especially for iterative machine‑learning algorithms, leading to the migration of all data‑mining workloads to Spark.
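
To illustrate the difference in scale, a complete ranking-style aggregation fits in roughly a dozen lines of Spark Scala code, where MapReduce needs a mapper, a reducer, and a driver class. The input layout and paths here are illustrative, not TalkingData's actual schema:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // pair-RDD implicits (pre-1.3 Spark)

object AppRanking {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("AppRanking"))

    // Hypothetical log layout: "<market>\t<appId>\t<deviceId>" per line.
    val topApps = sc.textFile("hdfs:///logs/installs/")
      .map(_.split("\t"))
      .map(fields => (fields(1), 1L))    // key each record by appId
      .reduceByKey(_ + _)                // count installs per app
      .top(100)(Ordering.by(_._2))       // top 100 apps by install count

    topApps.foreach(println)
    sc.stop()
  }
}
```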

Fully Embracing Spark

By 2014, data volume had multiplied several times, making Hive-based batch jobs too slow for weekly or monthly analyses. After attending Spark Summit China 2014, the team decided to migrate the entire data center to a Spark-centric architecture built on YARN.

The new platform aimed to provide near-real-time ETL, stream processing, efficient offline computation, fast multidimensional analysis, real-time analytics, machine-learning capabilities, unified data access, a unified data view, and flexible task scheduling.

Kafka was introduced as the log-collection channel; Spark Streaming consumed the raw JSON logs from Kafka, preserving them to HDFS, cleaning and converting them to Parquet, and feeding downstream processing such as tag generation, with the resulting tags stored in MongoDB.
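
A minimal sketch of that pipeline, using the receiver-based Kafka API that shipped with spark-streaming-kafka in the 1.x era; the ZooKeeper quorum, consumer group, topic, and HDFS paths are placeholders, and the tag-generation step into MongoDB is omitted:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(new SparkConf().setAppName("LogETL"), Seconds(60))
val sqlContext = new SQLContext(ssc.sparkContext)

// Consume raw JSON logs from Kafka (addresses and topic are placeholders).
val lines = KafkaUtils
  .createStream(ssc, "zk1:2181,zk2:2181", "etl-group", Map("app-logs" -> 2))
  .map(_._2)  // drop the Kafka message key, keep the JSON payload

lines.foreachRDD { (rdd, time) =>
  // 1. Preserve the raw logs on HDFS for replay and offline batch jobs.
  rdd.saveAsTextFile(s"hdfs:///raw/logs/${time.milliseconds}")
  // 2. Parse the JSON and persist a cleaned, columnar Parquet snapshot.
  sqlContext.jsonRDD(rdd).saveAsParquetFile(s"hdfs:///warehouse/logs/${time.milliseconds}")
}

ssc.start()
ssc.awaitTermination()
```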

Ranking jobs rewritten on Spark achieved a six-fold speedup even as data volume tripled. Spark SQL and Parquet columnar storage cut a previously hour-long multidimensional analysis to minutes, and a month-scale analysis to a couple of hours.
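
The pattern behind those numbers looks roughly like this: register the Parquet-backed logs once, then let Spark SQL scan only the columns a query touches. The table and column names are hypothetical:

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Expose the Parquet files as a queryable table (1.x-era API).
sqlContext.parquetFile("hdfs:///warehouse/logs/").registerTempTable("device_logs")

// A multidimensional slice: daily active devices per platform for a month.
val dau = sqlContext.sql("""
  SELECT dt, platform, COUNT(DISTINCT device_id) AS dau
  FROM device_logs
  WHERE dt BETWEEN '2014-11-01' AND '2014-11-30'
  GROUP BY dt, platform
""")
dau.collect().foreach(println)
```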

Additional components included a Bitmap engine migrated to YARN for fast indexed queries, an HCatalog‑based metadata service for a unified data view, and a custom DAG‑based distributed scheduler integrated with YARN for batch, real‑time, and pipeline tasks.

Pitfalls Encountered

The team documented several recurring issues: large-scale shuffles triggering "org.apache.spark.SparkException: Error communicating with MapOutputTracker", which was mitigated by setting spark.shuffle.consolidateFiles=true; frequent fetch failures that required log inspection and parameter tuning; and other runtime problems resolved through source-code review and community support.
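
In configuration terms, the cited fix looks like the first setting below; the other two are knobs commonly tuned against fetch failures on same-era Spark clusters, included as assumptions rather than settings the article documents:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("LargeShuffleJob")
  // The fix cited in the article: merge the many small per-map shuffle
  // files produced by the 1.x hash-based shuffle into consolidated files.
  .set("spark.shuffle.consolidateFiles", "true")
  // Same-era tuning assumptions, not from the article: enlarge the Akka
  // frame size (MB) for large map-output statuses, and give slow executors
  // more time to acknowledge before a fetch is declared failed.
  .set("spark.akka.frameSize", "128")
  .set("spark.core.connection.ack.wait.timeout", "600")
```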

Future Plans

Looking ahead, TalkingData plans to adopt Spark 1.3’s DataFrame API, integrate the Tachyon distributed cache as a shared storage layer, deploy SSDs for Spark shuffle output, and leverage Sparkling‑Water (Spark + H2O) for deep‑learning‑enabled machine learning.
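
For reference, the DataFrame API they planned to adopt replaces hand-written SQL strings with composable, optimizer-friendly operators. A Spark 1.3-style sketch, assuming an existing SparkContext sc and hypothetical paths and columns:

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.desc

val sqlContext = new SQLContext(sc)

// Spark 1.3's generic loader; the path and format are illustrative.
val logs = sqlContext.load("hdfs:///warehouse/logs/", "parquet")

logs.filter(logs("platform") === "android")  // column-expression filter
    .groupBy("app_id")
    .count()                                 // produces a "count" column
    .orderBy(desc("count"))
    .show(20)
```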

Conclusion

From the early MapReduce era to a modern Spark‑centric architecture, TalkingData’s two‑year evolution mirrors the broader shift in the big‑data industry, illustrating how Spark’s performance, flexibility, and ecosystem have become the foundation for next‑generation data processing.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: Machine Learning, Data Platform, YARN, Spark, Hadoop, Spark Streaming
Written by

Art of Distributed System Architecture Design

Introductions to large-scale distributed system architectures; insights and knowledge sharing on large-scale internet system architecture; front-end web architecture overviews; practical tips and experiences with PHP, JavaScript, Erlang, C/C++ and other languages in large-scale internet system development.
