Big Data 24 min read

SuperSQL: A Cross‑Engine, Cross‑DC High‑Performance Big Data SQL Middleware

The article presents SuperSQL, a high‑performance big‑data SQL middleware that enables cross‑engine and cross‑data‑center query processing, detailing its architecture, metadata management, cost‑based optimization, operator push‑down, distributed execution, performance benchmarks, and future roadmap within modern data‑intensive environments.

DataFunSummit

SuperSQL is introduced as a one‑stop, high‑performance big‑data SQL middleware designed to handle heterogeneous data sources, different engine versions, and multi‑DC environments, addressing challenges such as data locality, consistency, and unified query access.

The solution reports concrete business outcomes: less data‑migration effort, query performance improved by more than 3×, and a single query entry point that eliminates ETL overhead.

A competitive analysis compares SuperSQL with open‑source engines (SparkSQL, Presto, Drill) and commercial products (Google F1, Alibaba DLA, Huawei Pollux, Contiamo), highlighting its broader functionality, rule‑based operator push‑down, and metadata capabilities.

The system architecture is built on Apache Calcite, using its Volcano planner for cost‑based optimization (CBO) and multi‑stage planning. It supports a wide range of data sources (Hive, SparkSQL, PostgreSQL, MySQL, Oracle) and compute engines (Spark, Flink, Presto).
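To make the idea of cost‑based plan selection concrete, here is a minimal sketch that enumerates left‑deep join orders and keeps the cheapest one. It is not SuperSQL's or Calcite's planner: the cost function (sum of intermediate row counts), the table names, and the selectivity values are all assumptions chosen for illustration.

```python
from itertools import permutations
from typing import Dict, FrozenSet, Tuple


def left_deep_cost(
    order: Tuple[str, ...],
    rows: Dict[str, int],
    selectivity: Dict[FrozenSet[str], float],
) -> float:
    """Toy cost: total estimated rows of all intermediate join results."""
    joined = [order[0]]
    current = float(rows[order[0]])
    cost = 0.0
    for table in order[1:]:
        # Combine the selectivity of every join predicate that links the
        # new table to a table already joined; 1.0 means "no predicate".
        sel = 1.0
        for other in joined:
            sel *= selectivity.get(frozenset((other, table)), 1.0)
        current = current * rows[table] * sel
        cost += current
        joined.append(table)
    return cost


def cheapest_order(
    rows: Dict[str, int],
    selectivity: Dict[FrozenSet[str], float],
) -> Tuple[str, ...]:
    """Exhaustively search join orders and return the lowest-cost one."""
    return min(
        permutations(sorted(rows)),
        key=lambda order: left_deep_cost(order, rows, selectivity),
    )


# Hypothetical statistics, e.g. as they might be gathered by ANALYZE.
rows = {"orders": 1_000_000, "customers": 50_000, "regions": 10}
selectivity = {
    frozenset(("orders", "customers")): 1 / 50_000,
    frozenset(("customers", "regions")): 1 / 10,
}
best = cheapest_order(rows, selectivity)
print(best, left_deep_cost(best, rows, selectivity))
```

With these made‑up numbers the search defers the large orders table to the last join, because joining the two small tables first keeps intermediate results small. A real Volcano‑style planner explores a much larger space of equivalent plans and prunes it rather than enumerating everything.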

Metadata management employs a Trie‑based hierarchical model stored in Hive Metastore, with CBO statistics collection via ANALYZE commands, enabling accurate cost estimation and plan selection.
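A trie keyed on the dotted name is one natural way to model a source/database/table hierarchy. The sketch below is an assumption about what such a structure could look like, not SuperSQL's actual schema: the class names, the three‑level path, and the `stats` dictionary are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class MetaNode:
    """One level of the hierarchy: a data source, a database, or a table."""
    name: str
    children: Dict[str, "MetaNode"] = field(default_factory=dict)
    stats: Optional[dict] = None  # e.g. row counts collected by ANALYZE


class MetadataTrie:
    def __init__(self) -> None:
        self.root = MetaNode(name="")

    def register(self, path: str, stats: Optional[dict] = None) -> MetaNode:
        """Insert a dotted path, creating intermediate levels as needed."""
        node = self.root
        for part in path.split("."):
            if part not in node.children:
                node.children[part] = MetaNode(name=part)
            node = node.children[part]
        if stats is not None:
            node.stats = stats
        return node

    def lookup(self, path: str) -> Optional[MetaNode]:
        """Walk the path one level at a time; None if any level is missing."""
        node = self.root
        for part in path.split("."):
            child = node.children.get(part)
            if child is None:
                return None
            node = child
        return node


catalog = MetadataTrie()
catalog.register("hive_dc1.sales.orders", stats={"row_count": 1_000_000})
catalog.register("mysql_dc2.crm.customers", stats={"row_count": 50_000})

node = catalog.lookup("hive_dc1.sales.orders")
print(node.stats if node else "not found")
```

Because tables that share a source or database share a prefix, listing everything under one source is a walk of a single subtree, and the statistics needed for cost estimation sit on the same node as the table they describe.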

Operator push‑down covers common SQL constructs (aggregates such as AVG, MAX, and MIN, plus sort, UNION ALL, and join) and advanced techniques such as concurrent sub‑query slicing, which reduce network I/O and balance load across engines.
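The two ideas in this paragraph can be combined in a small sketch: a key range is cut into slices, each slice runs as its own sub‑query, and only partial aggregates (SUM and COUNT) travel back to be merged into an AVG. The in‑memory `ROWS` list stands in for a remote source engine, and the function names and slicing scheme are assumptions for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

# Stand-in for a remote table with columns (id, amount).
ROWS = [(i, float(i % 10)) for i in range(1, 1001)]


def make_slices(lo: int, hi: int, n: int) -> List[Tuple[int, int]]:
    """Split the inclusive key range [lo, hi] into up to n contiguous slices."""
    total = hi - lo + 1
    base, extra = divmod(total, n)
    slices, start = [], lo
    for i in range(n):
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue
        slices.append((start, start + size - 1))
        start += size
    return slices


def run_pushed_down(key_range: Tuple[int, int]) -> Tuple[float, int]:
    """Simulates the source running:
    SELECT SUM(amount), COUNT(*) FROM t WHERE id BETWEEN lo AND hi
    so only two numbers per slice cross the network, not the rows."""
    lo, hi = key_range
    amounts = [amount for (row_id, amount) in ROWS if lo <= row_id <= hi]
    return sum(amounts), len(amounts)


def sliced_avg(lo: int, hi: int, n: int) -> float:
    """Run the slices concurrently and merge partial results into AVG."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        partials = list(pool.map(run_pushed_down, make_slices(lo, hi, n)))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else float("nan")


print(make_slices(1, 10, 3))
print(sliced_avg(1, 1000, 4))
```

AVG cannot be merged by averaging per‑slice averages when slices differ in size, which is why the sketch pushes down SUM and COUNT and divides only once at the end.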

Cross‑DC query optimization incorporates bandwidth awareness, multi‑stage planning, and optimal engine selection, dynamically adapting to network constraints and resource availability.
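As a rough illustration of bandwidth awareness, the sketch below picks the data center in which to run a cross‑DC join by estimating how long it would take to ship the non‑local tables there. The cost model (bytes divided by link bandwidth, compute cost ignored), the table sizes, and the bandwidth figures are assumptions, not values from the talk.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class TableRef:
    name: str
    dc: str          # data center where the table lives
    size_bytes: int  # estimated bytes that would have to be shipped


# Bandwidth in bytes per second for each directed link (src, dst).
Bandwidth = Dict[Tuple[str, str], float]


def transfer_seconds(table: TableRef, dst: str, bandwidth: Bandwidth) -> float:
    """Estimated time to move one table to dst; local tables cost nothing."""
    if table.dc == dst:
        return 0.0
    return table.size_bytes / bandwidth[(table.dc, dst)]


def pick_execution_dc(
    tables: List[TableRef], bandwidth: Bandwidth
) -> Tuple[str, Dict[str, float]]:
    """Return the DC with the lowest total transfer time, plus all costs."""
    candidates = sorted({t.dc for t in tables})
    costs = {
        dc: sum(transfer_seconds(t, dc, bandwidth) for t in tables)
        for dc in candidates
    }
    best = min(candidates, key=lambda dc: costs[dc])
    return best, costs


# Hypothetical inputs: a large fact table in dc1, a small dimension in dc2.
tables = [
    TableRef("orders", "dc1", 10_000_000_000),
    TableRef("customers", "dc2", 500_000_000),
]
bandwidth = {("dc1", "dc2"): 100_000_000.0, ("dc2", "dc1"): 100_000_000.0}

best_dc, costs = pick_execution_dc(tables, bandwidth)
print(best_dc, costs)
```

In this example it is far cheaper to move the small table to the large one than the reverse. If the measured bandwidth of a link changes, rerunning the same function with updated numbers can flip the decision, which is the sense in which such a planner adapts to network constraints.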

Performance evaluations on a six‑node cluster (128 GB RAM, 48 CPU cores) demonstrate up to 26× single‑source speedup and 5× cross‑source improvement over baseline SparkSQL/JDBC, with significant gains in concurrent sub‑query scenarios.

The future roadmap includes HBO (history‑based optimization) using SQL fingerprinting, metadata service refactoring, multi‑cost‑factor optimization, and integration with OLAP engines (Elasticsearch, ClickHouse) to provide a unified intelligent big‑data platform.
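SQL fingerprinting generally means reducing a query to its shape so that runs differing only in literal values share one key. The sketch below shows one simple way to do that and to look up past runtimes by fingerprint. The regex‑based normalization, the truncated SHA‑256 key, and the in‑memory history are assumptions for illustration, and a production system would more likely normalize the parsed query tree than the raw text.

```python
import hashlib
import re
from collections import defaultdict
from typing import Dict, List, Optional

_STRING = re.compile(r"'(?:[^']|'')*'")      # quoted string literals
_NUMBER = re.compile(r"\b\d+(?:\.\d+)?\b")   # standalone numeric literals
_SPACE = re.compile(r"\s+")


def sql_fingerprint(sql: str) -> str:
    """Replace literals with '?', normalize case and spacing, then hash."""
    text = _STRING.sub("?", sql)
    text = _NUMBER.sub("?", text)
    text = _SPACE.sub(" ", text).strip().lower()
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


# Observed runtimes in seconds, keyed by fingerprint (hypothetical store).
history: Dict[str, List[float]] = defaultdict(list)


def record_run(sql: str, seconds: float) -> None:
    history[sql_fingerprint(sql)].append(seconds)


def expected_seconds(sql: str) -> Optional[float]:
    """Average runtime of earlier queries with the same shape, if any."""
    runs = history.get(sql_fingerprint(sql))
    return sum(runs) / len(runs) if runs else None


record_run("SELECT * FROM t1 WHERE id = 42 AND name = 'bob'", 2.0)
record_run("select *   from t1 where id = 7 and name = 'alice'", 4.0)
print(expected_seconds("SELECT * FROM t1 WHERE id = 99 AND name = 'eve'"))
```

Digits inside identifiers such as `t1` are left alone because the number pattern only matches at word boundaries, so queries against different tables keep different fingerprints.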

Tags: big data, Query Optimization, distributed computing, metadata management, SuperSQL, Cross-DC, SQL Middleware
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
