Design, Evolution, and Performance Evaluation of the PINGO Distributed Interactive Query Platform
This article details the motivation, architectural iterations, caching strategies, SparkSQL enhancements, and performance benchmarks of Baidu's PINGO platform, illustrating how it transformed from a Hive‑based QueryEngine into a high‑performance, Spark‑driven interactive query system for large‑scale data analysis.
PINGO is a distributed interactive query platform jointly developed by Baidu's Big Data Department and its U.S. research center, created to overcome the limitations of the Hive‑based QueryEngine for interactive workloads.
The original QueryEngine required users to provision Hadoop resources and suffered long startup times for short queries, making it unsuitable for sub‑two‑minute latency requirements.
To address these issues, Baidu designed PINGO using SparkSQL as the execution engine, leveraging Spark’s in‑memory computation, resident service capability, machine‑learning library support, and unified processing for diverse workloads.
Several architectural versions were released:
PINGO 1.0 employed Spark Standalone clusters with an Operation Manager (a Spark driver) handling queries dispatched from Magi Service.
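The PINGO 1.0 pattern of a resident driver that accepts dispatched queries can be sketched as follows. This is a minimal illustrative sketch, not PINGO's actual code: the class name `OperationManager`, the `submit`/`run_once` methods, and the queue-based dispatch are all assumptions standing in for the real Spark-driver service that receives work from Magi Service. The point it shows is that the executor stays alive between queries, so short queries skip per-job cluster startup cost.

```python
from queue import Queue

class OperationManager:
    """Illustrative resident query executor: it stays alive between
    queries, so each request avoids cluster/JVM startup latency."""

    def __init__(self):
        self.jobs = Queue()

    def submit(self, sql):
        # In the real system, Magi Service would dispatch queries
        # to the resident driver; here we just enqueue a string.
        self.jobs.put(sql)

    def run_once(self):
        sql = self.jobs.get()
        # A real implementation would hand the SQL to a long-lived
        # SparkSession; we return a placeholder result instead.
        return f"executed: {sql}"

mgr = OperationManager()
mgr.submit("SELECT count(*) FROM logs")
print(mgr.run_once())
```

The design choice illustrated is the same one the article credits for the latency win: amortizing startup cost across many queries by keeping the execution context resident.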
PINGO 1.1 introduced a cache layer built on Tachyon and a ViewManager to manage hot data, modifying SparkSQL’s Catalyst planner to route reads to either cached or original storage, achieving significant performance gains without altering user queries.
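The read-routing idea behind the 1.1 cache layer can be sketched in a few lines. This is a hedged illustration, not SparkSQL's Catalyst API: the `ViewManager` class and its `register`/`resolve` methods are invented names approximating the role the article describes, where a hot-data registry transparently redirects table reads to a Tachyon-backed copy when one exists, leaving user queries untouched.

```python
class ViewManager:
    """Illustrative hot-data registry: maps original storage paths
    to cache (Tachyon-style) locations when a cached copy exists."""

    def __init__(self):
        self.cached = {}  # original path -> cache path

    def register(self, original, cache_path):
        self.cached[original] = cache_path

    def resolve(self, original):
        # Route the read to the cache if present, otherwise fall
        # back to original storage -- the user's SQL never changes.
        return self.cached.get(original, original)

vm = ViewManager()
vm.register("hdfs://warehouse/sales", "tachyon://cache/sales")
print(vm.resolve("hdfs://warehouse/sales"))   # cached copy is used
print(vm.resolve("hdfs://warehouse/clicks"))  # no cache: original path
```

In the real system this decision happens inside the modified Catalyst planner at plan time, which is what makes the redirection invisible to users.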
PINGO 1.2 separated scheduling from execution, adding a PingoMaster that orchestrates multiple Spark applications across clusters (including YARN), enabling data‑locality‑aware and size‑aware scheduling strategies to improve reliability and scalability.
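The locality-aware and size-aware routing that PingoMaster performs can be sketched as a scoring function. Everything here is an assumption for illustration: the function name `choose_app`, the dictionary fields, and the 1 TiB threshold are invented, standing in for whatever policy the real scheduler uses to pick among Spark applications across clusters.

```python
def choose_app(query, apps):
    """Illustrative PingoMaster-style routing: prefer an application
    co-located with the query's data (locality-aware), and send very
    large scans to the highest-capacity cluster (size-aware)."""
    local = [a for a in apps if a["cluster"] == query["data_cluster"]]
    candidates = local or apps  # fall back to any app if no local one
    if query["scan_bytes"] > 1 << 40:  # >1 TiB: favor raw capacity
        return max(candidates, key=lambda a: a["capacity"])
    return min(candidates, key=lambda a: a["queued"])  # least loaded

apps = [
    {"name": "app-a", "cluster": "yarn-east", "capacity": 100, "queued": 3},
    {"name": "app-b", "cluster": "yarn-west", "capacity": 400, "queued": 9},
]
query = {"data_cluster": "yarn-west", "scan_bytes": 2 << 40}
print(choose_app(query, apps)["name"])  # routed to the co-located app
```

Separating this decision into a dedicated master is what lets execution engines (Standalone or YARN-hosted Spark applications) be added, drained, or replaced without touching the query entry point.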
Performance evaluations show that Spark alone outperforms Hive by 2–3× on complex queries, while adding the Tachyon cache yields 30–50× speedups over uncached Hive. In production, PINGO raised the proportion of queries finishing within two minutes from ~1% (Hive+MR) to over 50%.
The authors conclude that PINGO has dramatically lowered interactive query latency from tens of minutes to under two minutes and outline future work: expanding cache coverage, improving prefetch and replacement policies, and accelerating SQL operators (e.g., joins) with FPGA hardware.