Design, Evolution, and Performance Evaluation of the PINGO Distributed Interactive Query Platform
This article details the motivation, architectural iterations, caching strategies, SparkSQL enhancements, and performance benchmarks of Baidu's PINGO platform, illustrating how it transformed from a Hive‑based QueryEngine into a high‑performance, Spark‑driven interactive query system for large‑scale data analysis.
PINGO is a distributed interactive query platform jointly developed by Baidu's Big Data Department and its U.S. research center, created to overcome the limitations of the Hive‑based QueryEngine for interactive workloads.
The original QueryEngine required users to provision Hadoop resources and suffered long startup times for short queries, making it unsuitable for sub‑two‑minute latency requirements.
To address these issues, Baidu designed PINGO using SparkSQL as the execution engine, leveraging Spark’s in‑memory computation, resident service capability, machine‑learning library support, and unified processing for diverse workloads.
Several architectural versions were released:
PINGO 1.0 employed Spark Standalone clusters with an Operation Manager (a Spark driver) handling queries dispatched from Magi Service.
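The PINGO 1.0 pattern of a resident driver that accepts dispatched queries can be sketched as follows. This is a minimal illustrative sketch, not PINGO's actual code: the class name `OperationManager`, the `submit`/`run_once` methods, and the queue-based dispatch are all assumptions standing in for the real Spark-driver service that receives work from Magi Service. The point it shows is that the executor stays alive between queries, so short queries skip per-job cluster startup cost.

```python
from queue import Queue

class OperationManager:
    """Illustrative resident query executor: it stays alive between
    queries, so each request avoids cluster/JVM startup latency."""

    def __init__(self):
        self.jobs = Queue()

    def submit(self, sql):
        # In the real system, Magi Service would dispatch queries
        # to the resident driver; here we just enqueue a string.
        self.jobs.put(sql)

    def run_once(self):
        sql = self.jobs.get()
        # A real implementation would hand the SQL to a long-lived
        # SparkSession; we return a placeholder result instead.
        return f"executed: {sql}"

mgr = OperationManager()
mgr.submit("SELECT count(*) FROM logs")
print(mgr.run_once())
```

The design choice illustrated is the same one the article credits for the latency win: amortizing startup cost across many queries by keeping the execution context resident.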
PINGO 1.1 introduced a cache layer built on Tachyon and a ViewManager to manage hot data, modifying SparkSQL’s Catalyst planner to route reads to either cached or original storage, achieving significant performance gains without altering user queries.
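The read-routing idea behind the 1.1 cache layer can be sketched in a few lines. This is a hedged illustration, not SparkSQL's Catalyst API: the `ViewManager` class and its `register`/`resolve` methods are invented names approximating the role the article describes, where a hot-data registry transparently redirects table reads to a Tachyon-backed copy when one exists, leaving user queries untouched.

```python
class ViewManager:
    """Illustrative hot-data registry: maps original storage paths
    to cache (Tachyon-style) locations when a cached copy exists."""

    def __init__(self):
        self.cached = {}  # original path -> cache path

    def register(self, original, cache_path):
        self.cached[original] = cache_path

    def resolve(self, original):
        # Route the read to the cache if present, otherwise fall
        # back to original storage -- the user's SQL never changes.
        return self.cached.get(original, original)

vm = ViewManager()
vm.register("hdfs://warehouse/sales", "tachyon://cache/sales")
print(vm.resolve("hdfs://warehouse/sales"))   # cached copy is used
print(vm.resolve("hdfs://warehouse/clicks"))  # no cache: original path
```

In the real system this decision happens inside the modified Catalyst planner at plan time, which is what makes the redirection invisible to users.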
PINGO 1.2 separated scheduling from execution, adding a PingoMaster that orchestrates multiple Spark applications across clusters (including YARN), enabling data‑locality‑aware and size‑aware scheduling strategies to improve reliability and scalability.
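The locality-aware and size-aware routing that PingoMaster performs can be sketched as a scoring function. Everything here is an assumption for illustration: the function name `choose_app`, the dictionary fields, and the 1 TiB threshold are invented, standing in for whatever policy the real scheduler uses to pick among Spark applications across clusters.

```python
def choose_app(query, apps):
    """Illustrative PingoMaster-style routing: prefer an application
    co-located with the query's data (locality-aware), and send very
    large scans to the highest-capacity cluster (size-aware)."""
    local = [a for a in apps if a["cluster"] == query["data_cluster"]]
    candidates = local or apps  # fall back to any app if no local one
    if query["scan_bytes"] > 1 << 40:  # >1 TiB: favor raw capacity
        return max(candidates, key=lambda a: a["capacity"])
    return min(candidates, key=lambda a: a["queued"])  # least loaded

apps = [
    {"name": "app-a", "cluster": "yarn-east", "capacity": 100, "queued": 3},
    {"name": "app-b", "cluster": "yarn-west", "capacity": 400, "queued": 9},
]
query = {"data_cluster": "yarn-west", "scan_bytes": 2 << 40}
print(choose_app(query, apps)["name"])  # routed to the co-located app
```

Separating this decision into a dedicated master is what lets execution engines (Standalone or YARN-hosted Spark applications) be added, drained, or replaced without touching the query entry point.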
Performance evaluations show that Spark alone outperforms Hive by 2–3× on complex queries, while adding the Tachyon cache yields 30–50× speedups over uncached Hive. In production, PINGO raised the proportion of queries finishing within two minutes from ~1% (Hive+MR) to over 50%.
The authors conclude that PINGO has dramatically lowered interactive query latency from tens of minutes to under two minutes and outline future work: expanding cache coverage, improving prefetch and replacement policies, and accelerating SQL operators (e.g., joins) with FPGA hardware.