Inside Spark 1.2: New APIs, In‑Memory Columnar Storage, and Baidu’s High‑Performance Shuffle

This article reviews Spark 1.2’s major enhancements—including the External Data Source API, column pruning, predicate pushdown, and in‑memory columnar storage—while also detailing Baidu’s large‑scale Spark deployments, its custom high‑performance Shuffle service, and the integration of Spark with the Tachyon memory file system.

Baidu Tech Salon

Jan 13, 2015

Databricks Engineer Lian Cheng – Spark SQL 1.2 Enhancements

Spark SQL 1.2 introduced four major improvements: an External Data Source API that abstracts external systems as relational tables, enhanced in‑memory columnar storage, stronger Parquet support, and stronger Hive support.

The External Data Source API gives uniform access to formats such as JSON, Avro, CSV, Parquet, and ORC, and to external systems such as JDBC databases and HBase. It enables two key optimizations: column pruning, which skips columns a query does not need and so reduces I/O, and predicate pushdown, which hands filter conditions to the data source so that non-matching rows are discarded before they cost disk or network I/O.
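The two optimizations can be illustrated with a toy data source. This is a minimal sketch in plain Python, not the Spark API: the `COLUMNS` table, the `scan` function, and its `(column, predicate)` filter format are all hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, Iterator, List, Tuple

# Hypothetical toy table stored column by column; not a Spark structure.
COLUMNS: Dict[str, List] = {
    "id": [1, 2, 3, 4],
    "country": ["CN", "US", "CN", "DE"],
    "clicks": [10, 25, 7, 40],
    "payload": ["a" * 8, "b" * 8, "c" * 8, "d" * 8],  # wide column the query does not need
}

# A filter is (column name, predicate on that column's value). Assumed format.
Filter = Tuple[str, Callable[[object], bool]]


def scan(required_columns: List[str], filters: List[Filter]) -> Iterator[tuple]:
    """Yield only the requested columns of the rows that pass every filter."""
    n_rows = len(COLUMNS["id"])
    for i in range(n_rows):
        # Predicate pushdown: the source evaluates the filters itself,
        # so rejected rows never leave the source.
        if all(pred(COLUMNS[col][i]) for col, pred in filters):
            # Column pruning: only the required columns are read;
            # "payload" is never touched for this query.
            yield tuple(COLUMNS[col][i] for col in required_columns)


# Roughly: SELECT id, clicks FROM t WHERE country = 'CN'
rows = list(scan(["id", "clicks"], [("country", lambda v: v == "CN")]))
print(rows)  # [(1, 10), (3, 7)]
```

In a real source the saving comes from never reading the pruned columns or filtered rows from disk or the network; here it is only visible as the columns and rows the loop skips.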

Enhanced in‑memory columnar storage unifies the semantics of SchemaRDD.cache() and SQLContext.cacheTable(), introduces eager cache materialization, and adds the SQL statement CACHE [LAZY] TABLE tbl [AS SELECT …]. It also adds table statistics that enable predicate pushdown and automatic broadcast joins, and it mitigates OOM issues when caching large tables.
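The difference between eager and lazy cache materialization can be sketched in a few lines of plain Python. `CachedTable` and `load` are hypothetical stand-ins, not Spark classes; the sketch only assumes that an eager cache is filled when the cache request is made and a lazy one on first read.

```python
from typing import Callable, List, Optional


class CachedTable:
    """Hypothetical stand-in for a cached table; not a Spark class."""

    def __init__(self, compute: Callable[[], List[tuple]], lazy: bool = False) -> None:
        self._compute = compute
        self._data: Optional[List[tuple]] = None
        if not lazy:
            # Eager: pay the materialization cost at cache time.
            self._data = compute()

    @property
    def materialized(self) -> bool:
        return self._data is not None

    def rows(self) -> List[tuple]:
        if self._data is None:
            # Lazy: the first read materializes the cache.
            self._data = self._compute()
        return self._data


calls: List[int] = []


def load() -> List[tuple]:
    calls.append(1)  # record each real read of the underlying source
    return [(1, "a"), (2, "b")]


eager = CachedTable(load)            # analogous to CACHE TABLE tbl
lazy = CachedTable(load, lazy=True)  # analogous to CACHE LAZY TABLE tbl
print(eager.materialized, lazy.materialized)  # True False

lazy.rows()
lazy.rows()        # second read is served from the cache
print(len(calls))  # 2: one load per table, no matter how often it is read
```

Eager materialization makes the cost predictable (it is paid once, up front), while the lazy form defers it until a query actually needs the table.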

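The table statistics added in Spark 1.2 are what make predicate pushdown and automatic broadcast joins possible: a scan can skip data that the statistics show cannot match a filter, and the planner can broadcast a table that is known to be small. Both improve scan and join performance for columnar formats.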
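Both uses of statistics can be sketched under simple assumptions. The code below is illustrative plain Python, not Spark internals: `Batch`, `scan_greater_than`, and `choose_join` are hypothetical names, the statistics are just a per-batch minimum and maximum, and the 10 MB threshold is an assumed value for the example.

```python
from typing import List, Tuple


class Batch:
    """Hypothetical cached column batch carrying min/max statistics."""

    def __init__(self, values: List[int]) -> None:
        self.values = values
        # Statistics are collected once, when the batch is cached.
        self.min = min(values)
        self.max = max(values)


def scan_greater_than(batches: List[Batch], threshold: int) -> Tuple[List[int], int]:
    """Return (matching values, number of batches actually read)."""
    hits: List[int] = []
    batches_read = 0
    for batch in batches:
        if batch.max <= threshold:
            continue  # statistics prove no value can match, so skip the batch
        batches_read += 1
        hits.extend(v for v in batch.values if v > threshold)
    return hits, batches_read


def choose_join(left_bytes: int, right_bytes: int,
                broadcast_threshold: int = 10 * 1024 * 1024) -> str:
    """Pick a broadcast join when the smaller side fits under the threshold."""
    if min(left_bytes, right_bytes) <= broadcast_threshold:
        return "broadcast"
    return "shuffle"


batches = [Batch([1, 5, 9]), Batch([12, 15, 20]), Batch([21, 30, 42])]
print(scan_greater_than(batches, 20))               # ([21, 30, 42], 1)
print(choose_join(5 * 1024 * 1024, 2 * 1024 ** 3))  # broadcast
print(choose_join(1024 ** 3, 2 * 1024 ** 3))        # shuffle
```

Only one of the three batches is read for the filter `value > 20`, and the join strategy is decided from sizes alone, before any data moves.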

Baidu Engineer Zhen Peng – Spark on Baidu Cloud BMR

Baidu has operated Spark since 2011 and has integrated it into Baidu MapReduce (BMR), a cloud‑native data analysis service built on HDFS, BOS, and HBase. BMR supports Spark, MapReduce, Pig, Hive, Streaming, GraphX, and MLlib.

Baidu chose Spark for its high performance (tasks are scheduled onto a thread pool and much of the computation stays in memory), its multi‑language APIs, its expressive programming model, and its mature ecosystem of components.

BMR offers on‑demand Spark clusters that can be provisioned in 3–5 minutes, with both short‑lived and long‑running modes, and provides a web console and SDK for job management.

Baidu Architect Sun Yaoguang – High‑Performance Shuffle Service

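Shuffle is the data redistribution phase between map and reduce tasks. Traditional Hadoop and Spark shuffles use a disk‑based pull model: map tasks write their output to local disk and reduce tasks fetch it later. This incurs high latency and many disk seeks, and map output is lost when the node holding it fails.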

Baidu’s “New Shuffle” adopts an in‑memory push model: map outputs are pushed directly to shuffle workers, eliminating disk writes and reducing network I/O. The architecture separates shuffle from compute, uses a master‑worker design for horizontal scalability, and introduces mechanisms for handling slow nodes and data duplication.
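The push model can be sketched as follows. This is a single-process illustration in plain Python, not Baidu's implementation: `ShuffleWorker`, `run_map_task`, and `run_reduce_task` are hypothetical names, the CRC32 partitioner is an arbitrary deterministic choice, and deduplicating by `(map_id, partition)` is an assumed way of handling duplicate pushes from retried map tasks.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Record = Tuple[str, int]


class ShuffleWorker:
    """Hypothetical in-memory shuffle worker owning some reduce partitions."""

    def __init__(self) -> None:
        self.partitions: Dict[int, List[Record]] = defaultdict(list)
        self._seen: Set[Tuple[int, int]] = set()  # (map_id, partition) already accepted

    def push(self, map_id: int, partition: int, records: List[Record]) -> bool:
        # A retried map task may push the same block twice; keep only the first copy.
        if (map_id, partition) in self._seen:
            return False
        self._seen.add((map_id, partition))
        self.partitions[partition].extend(records)
        return True


def run_map_task(map_id: int, records: List[Record],
                 workers: List[ShuffleWorker], num_partitions: int) -> None:
    """Partition map output by key and push each block straight to its worker."""
    blocks: Dict[int, List[Record]] = defaultdict(list)
    for key, value in records:
        partition = zlib.crc32(key.encode("utf-8")) % num_partitions
        blocks[partition].append((key, value))
    for partition, block in blocks.items():
        # No local disk write: the block goes directly into worker memory.
        workers[partition % len(workers)].push(map_id, partition, block)


def run_reduce_task(partition: int, workers: List[ShuffleWorker]) -> Dict[str, int]:
    """Sum the values per key for one partition, read from the owning worker."""
    worker = workers[partition % len(workers)]
    totals: Dict[str, int] = defaultdict(int)
    for key, value in worker.partitions.get(partition, []):
        totals[key] += value
    return dict(totals)


NUM_PARTITIONS = 4
workers = [ShuffleWorker(), ShuffleWorker()]
run_map_task(0, [("spark", 1), ("shuffle", 1), ("spark", 1)], workers, NUM_PARTITIONS)
run_map_task(1, [("spark", 1), ("tachyon", 1)], workers, NUM_PARTITIONS)
run_map_task(1, [("spark", 1), ("tachyon", 1)], workers, NUM_PARTITIONS)  # retry, ignored

result: Dict[str, int] = {}
for p in range(NUM_PARTITIONS):
    result.update(run_reduce_task(p, workers))
print(sorted(result.items()))  # [('shuffle', 1), ('spark', 3), ('tachyon', 1)]
```

Because the workers are separate from the map and reduce tasks, they could in principle be scaled independently of compute, which is the separation the architecture above describes.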

Challenges include slow shuffle nodes, data duplication, resource sharing, and pressure on the underlying DFS.

Silicon Valley Architect Liu Shaoshan – Fast Big Data Analytics with Spark on Tachyon

Tachyon is a distributed memory file system that sits between storage and compute frameworks, allowing Spark jobs to share data in memory instead of writing to HDFS.

The key problems Tachyon solves for Spark are excessive disk I/O when jobs exchange data, loss of cached data when a Spark process fails, and redundant caching of the same data by several jobs, which increases memory pressure.

The Tachyon architecture consists of a Master (managing inodes and worker information), Workers (storing data locally and interfacing with HDFS), and Clients (providing transparent file access via TachyonFS). At Baidu, Tachyon is deployed on about 50 machines to accelerate ad‑hoc queries by caching remote data locally.
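The idea of caching remote data locally can be sketched as a read-through cache. This is plain Python under stated assumptions, not the Tachyon API: `UnderStore` and `MemoryWorker` are hypothetical classes, whole files are cached rather than blocks, and LRU eviction is an assumed replacement policy chosen for simplicity.

```python
from collections import OrderedDict
from typing import Dict


class UnderStore:
    """Stand-in for a remote store such as HDFS; counts how often it is read."""

    def __init__(self, files: Dict[str, bytes]) -> None:
        self.files = files
        self.reads = 0

    def read(self, path: str) -> bytes:
        self.reads += 1
        return self.files[path]


class MemoryWorker:
    """Hypothetical worker that keeps whole files in memory with LRU eviction."""

    def __init__(self, under: UnderStore, capacity_bytes: int) -> None:
        self.under = under
        self.capacity = capacity_bytes
        self.cache: "OrderedDict[str, bytes]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, path: str) -> bytes:
        if path in self.cache:
            self.hits += 1
            self.cache.move_to_end(path)  # mark as most recently used
            return self.cache[path]
        self.misses += 1
        data = self.under.read(path)      # fall back to the remote store
        self._admit(path, data)
        return data

    def _admit(self, path: str, data: bytes) -> None:
        if len(data) > self.capacity:
            return  # larger than the whole cache, so serve it uncached
        while sum(len(v) for v in self.cache.values()) + len(data) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used file
        self.cache[path] = data


under = UnderStore({"/a": b"x" * 4, "/b": b"y" * 4, "/c": b"z" * 4})
worker = MemoryWorker(under, capacity_bytes=8)
for path in ["/a", "/b", "/a", "/a", "/c", "/a"]:
    worker.read(path)

hit_rate = worker.hits / (worker.hits + worker.misses)
print(worker.hits, worker.misses, under.reads, hit_rate)  # 3 3 3 0.5
```

Every hit is a read that never reaches the remote store, so the hit rate directly determines how much a deployment like this can speed up repeated ad-hoc queries.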

Practical challenges include incomplete block caching, low cache hit rates, and the need for careful workload profiling to achieve optimal performance.

Future work at Baidu includes hierarchical storage features and improved cache replacement policies.
