Databases 14 min read

Why SQL Fails at Multi‑Group & Top‑N Queries and How SPL Fixes It

The article explains how conventional SQL struggles with executing multiple grouping and Top‑N aggregations on massive tables, leading to repeated full scans and poor performance, and demonstrates how the SPL compute engine can perform these operations in a single pass with parallelism, improving speed and scalability.

Java Backend Technology
Java Backend Technology
Java Backend Technology
Why SQL Fails at Multi‑Group & Top‑N Queries and How SPL Fixes It

It is well known that many big‑data computations are implemented with SQL, and when they run slowly developers must tune the SQL, often encountering baffling performance problems.

Multiple Grouping Queries

Consider three statements that each group a huge table T (billions of rows) in different ways:

select a,b, sum(x) from T group by a,b where ...;
select c,d, max(y) from T group by c,d where ...;
select a,c, avg(y), min(z) from T group by a,c where ...;

Each statement forces a full scan of T. Traversing the table three times is costly; the CPU work is negligible compared to disk I/O.

If SQL could express a single‑pass syntax that returns multiple result sets, the disk reads would be reduced dramatically.

SQL lacks such syntax, so a common workaround is to create a temporary table with a finer‑grained grouping (e.g., group by a,b,c,d) and then compute the desired aggregates from it:

create table T_temp as
  select a,b,c,d,
    sum(case when ... then x else 0 end) sumx,
    max(case when ... then y else null end) maxy,
    sum(case when ... then y else 0 end) sumy,
    count(case when ... then 1 else null end) county,
    min(case when ... then z else null end) minz
  group by a,b,c,d;

select a,b, sum(sumx) from T_temp group by a,b where ...;
select c,d, max(maxy) from T_temp group by c,d where ...;
select a,c, sum(sumy)/sum(county), min(minz) from T_temp group by a,c where ...;

This reduces the scans to one, but it complicates the query, inflates the intermediate result size, and often still requires additional passes.

Top‑N Aggregation

Standard SQL implements a Top‑5 query by sorting the entire table, which is extremely expensive for billions of rows:

select * from (select x from T order by x desc) where rownum <= 5;

A more efficient approach keeps a small in‑memory heap of the best N rows while scanning once. SQL cannot express this directly, so developers resort to large sorts or engine‑specific tricks.

Join Performance

When joining a large fact table with several small dimension tables, a typical hash‑join requires a separate hash computation for each join, leading to multiple passes over the fact table.

Even if dimension keys are replaced by numeric IDs to enable array‑based lookups, relational databases still treat them as unordered sets and fall back to hash joins.

High‑Concurrency Account Queries

Even with an index on the id column, heavy concurrent access suffers because rows for the same account are not stored contiguously on disk, causing excessive I/O.

Introducing SPL

Because SQL and traditional RDBMS cannot implement the desired single‑pass, set‑based algorithms efficiently, the open‑source SPL engine is proposed. SPL supports richer data types and allows concise expression of the algorithms.

One‑Pass Multi‑Group Example

A1 = file("T.ctx").open().cursor(a,b,c,d,x,y,z);
cursorA1 = A1.select(...).groups(a,b; sum(x));
cursorA2 = A1.select(...).groups(c,d; max(y));
cursorA3 = A1.select(...).groupx(a,c; avg(y), min(z));
// ... compute the three result sets in a single traversal

Top‑5 with Parallelism

// All‑data Top‑5 (both max and min)
A = file("T.ctx").open();
result = A.cursor@m(x).total(top(-5,x), top(5,x));

Join Using Numeric IDs

// System initialization – load small tables into memory
env(city, file("city.btx").import@b()), env(employee, file("employee.btx").import@b()), ...;

// Query – replace foreign keys with numeric IDs
orders = file("orders.ctx").open().cursor(cid,eid,...).switch(cid,city:#; eid,employee:#; ...);
filtered = orders.select(cid.state='New York' && eid.title=="manager" ...);

High‑Concurrency Account Query in SPL

// Pre‑sort and index by account id
A1 = file("T-original.ctx").open().cursor(id,tdate,amt,...);
A2 = A1.sortx(id);
B = file("T.ctx");
B2 = B.create@r(#id,tdate,amt,...).append@i(A2);
B2.open().index(index_id;id);

// Query
result = B2.icursor(; id==10100 && tdate>=date("2021-01-10") && tdate<date("2021-01-25"), index_id).fetch();

SPL can also handle ordered merges for fact‑detail joins, multi‑dimensional analysis, bitmap counting, boolean‑set filtering, and time‑series grouping, offering a versatile platform for high‑performance data processing.

PerformanceSQLbig-dataSPLquery-optimizationtop-n
Java Backend Technology
Written by

Java Backend Technology

Focus on Java-related technologies: SSM, Spring ecosystem, microservices, MySQL, MyCat, clustering, distributed systems, middleware, Linux, networking, multithreading. Occasionally cover DevOps tools like Jenkins, Nexus, Docker, and ELK. Also share technical insights from time to time, committed to Java full-stack development!

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.