Design and Implementation of the "Little Boy" Greenplum Optimization and Operations Platform

This article introduces the architecture, key modules, and implementation details of the Little Boy platform, a Greenplum optimization and operations system for large-scale data warehouses. The platform parses SQL, tunes indexes and distribution keys, and manages resources. The article closes with planned enhancements.

Baidu Waimai Technology Team

Mar 23, 2017

The Baidu Waimai big-data team developed a Greenplum optimization and operations platform, named "Little Boy", to address the performance and stability limitations of Hive and Impala warehouses. The platform continuously tunes Greenplum databases at the table and column level, building on Greenplum's PostgreSQL-based design for parallel processing.

Architecture and Modules – The system consists of three major parts: an optimization module (SQL parsing, index creation, slow‑query analysis, table statistics, distribution‑key tuning, table‑level expansion), an operations module (web‑based automation, dynamic resource queues, command and configuration management), and a system module (global parameters, notification push).

SQL Parsing Module – Parses incoming SQL to extract field usage frequency across select, join, where, order, group, and having contexts. It uses JSqlParser for generic parsing and identifies plain SELECT statements, sub‑queries, join clauses, and expressions, handling each clause (FROM, JOIN, WHERE, SELECT, ORDER, GROUP, HAVING) with a unified expression‑analysis step.
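The counting step can be illustrated with a small sketch. This is not the platform's JSqlParser-based implementation: it is a regex-based stand-in in Python that handles only a plain SELECT statement whose columns are written as `alias.column`, and it would be confused by sub-queries, comma joins, or keywords inside string literals. The names `count_field_usage`, `CLAUSE_PATTERN`, and the context labels are assumptions made for illustration.

```python
import re
from collections import Counter

# Clause keywords of a plain SELECT statement (sub-queries are not handled).
CLAUSE_PATTERN = re.compile(
    r"\b(select|from|join|on|where|group\s+by|having|order\s+by)\b",
    re.IGNORECASE,
)
# Usage context recorded for columns found in each clause body.
CONTEXT = {"select": "select", "on": "join", "where": "where",
           "group by": "group", "having": "having", "order by": "order"}
# Words that can trail a FROM/JOIN body without being a table or alias.
JOIN_MODIFIERS = {"as", "inner", "left", "right", "full", "outer", "cross"}
# Assumption: every column reference is qualified, e.g. o.shop_id.
COLUMN_PATTERN = re.compile(r"\b([A-Za-z_]\w*)\.([A-Za-z_]\w*)\b")


def count_field_usage(sql):
    """Return a Counter keyed by (table, column, context)."""
    parts = CLAUSE_PATTERN.split(sql)
    # parts is [prefix, keyword, body, keyword, body, ...]
    clauses = [(" ".join(k.lower().split()), body)
               for k, body in zip(parts[1::2], parts[2::2])]

    # First pass: map aliases to table names from FROM and JOIN bodies.
    aliases = {}
    for keyword, body in clauses:
        if keyword in ("from", "join"):
            tokens = [t for t in body.split() if t.lower() not in JOIN_MODIFIERS]
            if tokens:
                aliases[tokens[-1]] = tokens[0]

    # Second pass: one shared expression scan for every column-bearing clause.
    counts = Counter()
    for keyword, body in clauses:
        context = CONTEXT.get(keyword)
        if context is None:
            continue  # FROM and JOIN bodies name tables, not columns
        for alias, column in COLUMN_PATTERN.findall(body):
            counts[(aliases.get(alias, alias), column, context)] += 1
    return counts
```

Calling `count_field_usage` on each logged statement and summing the returned counters gives a per-column, per-context usage table of the kind the later modules consume. A real parser is needed for anything beyond this simple statement shape.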

Multithreaded Parsing – To process the 60,000 to 70,000 queries issued each day, the parser runs in multiple threads. SQL statements are distributed evenly across the threads, while all database access stays in a single main thread. This reduces total parsing time to about ten minutes.
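The work split described here can be sketched as follows. The sketch assumes a hypothetical `parse_statement` function standing in for the real parser and a `save_counts` function standing in for the single database writer. It shows the division of labour only: in CPython, threads would not speed up CPU-bound parsing the way they do on the JVM.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

CLAUSE_WORDS = {"where", "join", "group", "order", "having"}


def parse_statement(sql):
    """Hypothetical stand-in for the real parser: counts clause keywords."""
    return Counter(w for w in sql.lower().split() if w in CLAUSE_WORDS)


def parse_batch(statements):
    """Runs on a worker thread; touches no shared state and no database."""
    total = Counter()
    for sql in statements:
        total.update(parse_statement(sql))
    return total


def split_evenly(statements, workers):
    """Round-robin slices whose sizes differ by at most one statement."""
    return [statements[i::workers] for i in range(workers)]


def parse_all(statements, workers=4):
    """Parse in worker threads, then merge the partial results in the caller."""
    merged = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(parse_batch, split_evenly(statements, workers)):
            merged.update(partial)  # merging happens on the calling thread
    return merged


def save_counts(counts, store):
    """Stand-in for the database write, called only from the main thread."""
    store.update(counts)
```

Keeping workers free of database access means no connection has to be shared or locked across threads. The main thread writes once, after all partial results are merged.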

Index Module – Analyzes field usage in where, join, order, and group scenarios, adding or dropping single‑column B‑tree indexes based on usage thresholds, while avoiding over‑indexing that could degrade performance.
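A threshold rule of this kind might look like the sketch below. The threshold values, the `idx_<table>_<column>` naming convention, and the shape of the `usage` input are all assumptions; the article does not state the thresholds the platform uses.

```python
# Contexts in which a single-column B-tree index can help a query.
INDEX_CONTEXTS = ("where", "join", "order", "group")


def plan_indexes(usage, existing, create_at=100, drop_below=10):
    """Return CREATE/DROP INDEX statements.

    usage:    {(table, column): {context: count}} from the SQL parsing step
    existing: set of (table, column) pairs that already have an index
    create_at / drop_below: hypothetical thresholds, not the platform's values
    """
    statements = []
    # Include indexed columns that no longer appear in any query.
    for table, column in sorted(set(usage) | set(existing)):
        by_context = usage.get((table, column), {})
        score = sum(by_context.get(c, 0) for c in INDEX_CONTEXTS)
        name = f"idx_{table}_{column}"  # assumed naming convention
        if score >= create_at and (table, column) not in existing:
            statements.append(
                f"CREATE INDEX {name} ON {table} USING btree ({column});")
        elif score < drop_below and (table, column) in existing:
            statements.append(f"DROP INDEX {name};")
    return statements
```

Leaving a gap between the two thresholds keeps a column whose usage hovers near one value from being indexed and dropped on alternate runs. Columns used only in the select list score zero and are never indexed.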

Distribution‑Key Module – Evaluates data distribution uniformity and field usage in join/where contexts, selecting keys that balance segment load and minimize data reshuffling.
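One way to combine the two criteria is sketched below, under stated assumptions. Skew is estimated by hashing sampled column values into a fixed number of buckets with MD5, which is not Greenplum's own hash function. The `max_skew` limit and the rule "most-used column among the evenly distributed ones" are hypothetical choices, not the platform's documented logic.

```python
import hashlib
from collections import Counter


def segment_skew(values, segments):
    """Largest bucket size divided by the mean bucket size (1.0 is even)."""
    def bucket(value):
        digest = hashlib.md5(str(value).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % segments

    per_segment = Counter(bucket(v) for v in values)
    mean = len(values) / segments
    return max(per_segment.values()) / mean


def choose_distribution_key(rows, usage, segments=8, max_skew=1.5):
    """Pick the most-used join/where column whose sample spreads evenly.

    rows:  sampled rows as dicts, e.g. [{"user_id": 1, "city": "bj"}, ...]
    usage: {column: join + where usage count} from the SQL parsing step
    """
    candidates = []
    for column, uses in usage.items():
        skew = segment_skew([row[column] for row in rows], segments)
        if skew <= max_skew:
            # Prefer higher usage, then lower skew.
            candidates.append((uses, -skew, column))
    return max(candidates)[2] if candidates else None
```

A low-cardinality column such as a city code is rejected even when it is the most frequently joined, because a few segments would hold most of the rows. On a live cluster the actual spread can be checked with `SELECT gp_segment_id, count(*) FROM some_table GROUP BY 1`.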

Statistics Field Module – Refreshes column statistics for frequently used fields in join, where, order, group, and having clauses, ensuring accurate query planning and preventing unnecessary data redistribution.
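As a rough illustration, the refresh step could reduce to generating column-level `ANALYZE` statements, which Greenplum inherits from PostgreSQL. The `min_uses` threshold and the shape of the `usage` input are assumptions for this sketch.

```python
# Contexts in which the planner's row estimates depend on column statistics.
STATS_CONTEXTS = ("join", "where", "order", "group", "having")


def build_analyze_statements(usage, min_uses=50):
    """Return one ANALYZE statement per table, limited to busy columns.

    usage:    {(table, column): {context: count}} from the SQL parsing step
    min_uses: hypothetical threshold for "frequently used"
    """
    columns_by_table = {}
    for (table, column), by_context in usage.items():
        uses = sum(by_context.get(c, 0) for c in STATS_CONTEXTS)
        if uses >= min_uses:
            columns_by_table.setdefault(table, []).append(column)
    return [
        f"ANALYZE {table} ({', '.join(sorted(columns))});"
        for table, columns in sorted(columns_by_table.items())
    ]
```

Restricting `ANALYZE` to the listed columns avoids sampling every column of a wide table when only a few of them influence the planner's join and filter estimates.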

Dynamic Resource Queue – Provides configurable resource queues that adapt to workload patterns (e.g., daytime consumption vs. nighttime production) and special periods like month‑end reporting, improving overall cluster utilization.
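A time-based profile switch might be sketched as follows. The queue names, limits, daytime hours, and month-end window are all hypothetical. Only the `ALTER RESOURCE QUEUE ... WITH (ACTIVE_STATEMENTS=...)` statement form comes from Greenplum itself.

```python
import calendar
from datetime import datetime

# Hypothetical profiles: queue name -> allowed concurrent statements.
PROFILES = {
    "day": {"adhoc_queue": 20, "etl_queue": 5},        # analysts query by day
    "night": {"adhoc_queue": 5, "etl_queue": 20},      # batch jobs run at night
    "month_end": {"adhoc_queue": 10, "etl_queue": 10, "report_queue": 15},
}


def pick_profile(now, month_end_window=2):
    """Choose a profile from the clock; the hours and window are assumptions."""
    last_day = calendar.monthrange(now.year, now.month)[1]
    if now.day > last_day - month_end_window:
        return "month_end"
    return "day" if 8 <= now.hour < 20 else "night"


def build_queue_statements(profile):
    """Return the ALTER RESOURCE QUEUE statements for one profile."""
    return [
        f"ALTER RESOURCE QUEUE {queue} WITH (ACTIVE_STATEMENTS={limit});"
        for queue, limit in sorted(PROFILES[profile].items())
    ]
```

A scheduler could call `build_queue_statements(pick_profile(datetime.now()))` at each boundary and apply the statements, so that capacity follows the workload instead of being fixed for the peak.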

Future Work – Remaining modules (slow‑query analysis, table‑level statistics, expansion tables, command/config management, cost estimation) are in integration testing and will be released soon. Plans include improving SQL‑parser success rates, adding multi‑column and bitmap index support, and open‑sourcing the platform to the community.
