Spark Latest Features, Tungsten Project, and Hulu’s Production Practices
This article reviews Spark's evolution from version 1.2 to 1.6, explains the DataFrame and Tungsten projects, shares Hulu's real‑world Spark deployments, and discusses performance‑related challenges such as stack overflows, streaming receiver latency, and class‑loader deadlocks.
The article begins with an overview of Spark’s version history (1.2 → 1.5.2 → 1.6), highlighting major milestones such as Shuffle optimizations, the introduction of the DataFrame API in 1.3, the start of the Tungsten project in 1.4, and its first‑stage completion in 1.5.
The DataFrame API, introduced in Spark 1.3, provides a schema‑aware abstraction over RDDs, enabling unified handling of structured data through SQL, Hive, and R interfaces. Queries are parsed into logical plans, optimized by Catalyst, turned into physical plans, and finally compiled into executable RDD code.
The article shows a simple Spark SQL example: val count = sqlContext.sql("SELECT COUNT(*) FROM records").collect().head.getLong(0), and notes that DataFrames support both SQL‑style expressions and RDD‑like transformations.
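The two styles mentioned above can be sketched side by side. This is a minimal illustration against the Spark 1.x SQLContext API, assuming the same "records" table as in the SQL example is already registered; the key column name is illustrative.

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._

// Assumes an existing SQLContext and a registered "records" table,
// as in the article's SQL example.
def twoStyles(sqlContext: SQLContext) = {
  // SQL-style expression
  val viaSql = sqlContext.sql(
    "SELECT key, COUNT(*) AS cnt FROM records GROUP BY key")

  // Equivalent RDD-like (DSL) transformations on the same table
  val viaDsl = sqlContext.table("records")
    .groupBy("key")
    .agg(count("*").as("cnt"))

  (viaSql, viaDsl)
}
```

Both forms produce the same logical plan, so they go through the same Catalyst optimization pipeline.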
Related to DataFrames is the DataSource API, which offers a uniform read/write interface for formats such as JSON, JDBC, ORC, and Parquet.
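The uniform surface of the DataSource API looks roughly like this in Spark 1.x; the paths and formats below are placeholders, not from the article.

```scala
import org.apache.spark.sql.SQLContext

// Read and write through the same format-agnostic interface:
// swapping "json" for "parquet", "orc", or "jdbc" changes only the
// format string and its options, not the surrounding code.
def roundTrip(sqlContext: SQLContext): Unit = {
  val df = sqlContext.read
    .format("json")
    .load("hdfs:///data/events.json")        // placeholder path

  df.write
    .format("parquet")
    .mode("overwrite")
    .save("hdfs:///data/events.parquet")     // placeholder path
}
```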
The Tungsten project aims to dramatically improve Spark’s memory and CPU efficiency by using binary formats, cache‑friendly computation, and runtime code generation. Its first phase (Spark 1.5) focused on memory management, binary data handling, and code generation; the second phase continues these optimizations.
Key Tungsten optimizations include:
Memory management and binary format handling to reduce GC pressure and avoid Java object overhead.
Cache‑friendly computation for faster sorting, hashing, aggregation, joins, and shuffle.
Runtime code generation to accelerate expression evaluation and serialization.
Hulu’s production environment runs Spark on YARN alongside HDFS, HBase, Hive, Cassandra, and Presto. Spark is used for both long‑running streaming jobs (via Kafka) and short‑lived batch jobs.
To address stack‑overflow crashes caused by deep RDD lineage, Hulu rewrote the RDD serialization logic to replace recursive serialization with an iterative approach using two mapping tables (rddId → list of depIds, and depId → Dependency), moving the traversal state to the heap and eliminating stack‑frame limits.
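The core idea, stripped of Spark specifics, is to replace the call stack with an explicit work list on the heap. The sketch below uses a toy Node type standing in for an RDD and its dependencies; all names are illustrative, not Hulu's actual code.

```scala
import scala.collection.mutable

// Toy stand-in for an RDD: an id plus the ids of its dependencies.
case class Node(id: Int, deps: List[Int])

// Flatten a lineage graph iteratively into an id -> dependency-ids map.
// The pending stack lives on the heap, so traversal depth is bounded by
// memory rather than by JVM stack-frame limits.
def collectLineage(root: Node, byId: Map[Int, Node]): Map[Int, List[Int]] = {
  val rddToDeps = mutable.Map.empty[Int, List[Int]]
  val pending = mutable.Stack(root)
  while (pending.nonEmpty) {
    val n = pending.pop()
    if (!rddToDeps.contains(n.id)) {
      rddToDeps(n.id) = n.deps
      n.deps.foreach(d => byId.get(d).foreach(pending.push))
    }
  }
  rddToDeps.toMap
}
```

A lineage chain tens of thousands of nodes deep would overflow the stack under naive recursion, but traverses cleanly here.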
For Spark Streaming, Hulu introduced a custom initialization task that runs on each executor before any normal work, preventing memory buildup in receivers when executors are not yet ready to process data.
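One way to express that warm-up idea is a trivial job that touches every executor before the streaming context starts. This is a hypothetical sketch of the technique, not Hulu's implementation; numExecutors and the oversubscription factor are illustrative.

```scala
import org.apache.spark.SparkContext

// Run a no-op task on (with high probability) every executor, so that
// executors are registered and ready before receivers begin buffering
// data from Kafka.
def warmUpExecutors(sc: SparkContext, numExecutors: Int): Unit = {
  // Oversubscribing partitions relative to executors is a simple way to
  // make it likely that each executor runs at least one task.
  sc.parallelize(1 to numExecutors * 10, numExecutors * 10)
    .foreachPartition(_ => ())
}
```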
Hulu also encountered deadlocks in Spark's ChildFirstURLClassLoader under high concurrency; the fix was to remove the fine‑grained class‑loader lock, since Scala companion objects do not provide static‑initialization hooks comparable to Java's.
Looking ahead to Spark 1.6, the article highlights two major features: unified memory management (merging execution and storage memory pools) and the upcoming Dataset API, which combines the type safety of RDDs with the performance of DataFrames via encoders and Catalyst logical plans.
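A small sketch of what the Dataset API looks like in Spark 1.6: typed like an RDD, but executed through encoders and Catalyst plans like a DataFrame. The Person class and sample data are illustrative.

```scala
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

def datasetExample(sqlContext: SQLContext): Array[String] = {
  import sqlContext.implicits._  // brings the encoders into scope

  val ds = Seq(Person("a", 30), Person("b", 15)).toDS()

  // Transformations are checked at compile time against Person's fields,
  // while execution still goes through an optimized Catalyst plan.
  ds.filter(_.age >= 18).map(_.name).collect()
}
```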
Finally, a Q&A section answers practical questions about Hulu’s Spark cluster size, tool choices (Zeppelin vs. spark‑notebook.io), Hive‑on‑Spark vs. Spark SQL, Spark shuffle vs. YARN shuffle, image‑processing capabilities, Tungsten’s production readiness, and a brief comparison between Storm and Spark for streaming workloads.
High Availability Architecture
Official account for High Availability Architecture.