How to Efficiently Retrieve Top N Rows per Group in PostgreSQL
This article explains how to extract the top N records for each group in PostgreSQL, compares the slow window‑function approach with a fast recursive‑CTE and set‑returning function solution, and shows performance results dropping from over 20 seconds to under half a second.
Background
Extracting the top N rows for each group (e.g., top 10 songs per singer) is a common requirement.
Traditional approach using window functions
The usual solution uses a window query with row_number() over (partition by c1 order by c2) and filters on rn <= 10. Example setup creates a table tbl(c1 int, c2 int, c3 int) with an index on (c1,c2), inserts 10 000 000 rows spread across 10 000 groups, and runs the query.
postgres=# create table tbl(c1 int, c2 int, c3 int);
postgres=# create index idx1 on tbl(c1,c2);
postgres=# insert into tbl select mod(trunc(random()*10000)::int,10000), trunc(random()*10000000) from generate_series(1,10000000);
postgres=# explain select * from (select row_number() over(partition by c1 order by c2) as rn, * from tbl) t where t.rn <= 10;
QUERY PLAN
----------------------------------------------------------------------------------------
Subquery Scan on t (cost=0.43..770563.03 rows=3333326 width=20)
Filter: (t.rn <= 10)
-> WindowAgg (cost=0.43..645563.31 rows=9999977 width=12)
-> Index Scan using idx1 on tbl (cost=0.43..470563.72 rows=9999977 width=12)
(4 rows)
Execution time: 20833.747 msOptimized method with recursive CTE and set‑returning function
When the number of groups is moderate, a recursive CTE can enumerate distinct group identifiers and an index can fetch the top N rows per group. PostgreSQL does not allow ORDER BY inside the recursive seed, so the CTE is wrapped in a PL/pgSQL function that returns a set of rows.
with recursive t1 as (
select min(c1) c1 from tbl
union all
select (select min(tbl.c1) from tbl where tbl.c1 > t.c1) c1
from t1 t where t.c1 is not null
);
create or replace function f() returns setof tbl as $$
declare v int;
begin
for v in with recursive t1 as (
select min(c1) c1 from tbl
union all
select (select min(tbl.c1) from tbl where tbl.c1 > t.c1) c1
from t1 t where t.c1 is not null
) loop
return query select * from tbl where c1 = v order by c2 limit 10;
end loop;
return;
end;
$$ language plpgsql strict;Calling the function returns the same top‑10 rows per group, but the execution plan shows a dramatic speed improvement.
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from f();
QUERY PLAN
---------------------------------------------------------------
Function Scan on public.f (cost=0.25..10.25 rows=1000 width=12) (actual time=419.218..444.810 rows=100000 loops=1)
Buffers: shared hit=170407, temp read=221 written=220
Planning time: 0.037 ms
Execution time: 464.257 msKey points
The window‑function method scans the entire table; on 10 million rows it takes about 20 seconds.
With a limited number of groups, a recursive CTE combined with an index can retrieve the top N per group in sub‑second time.
PostgreSQL currently does not support ordering in the recursive seed or using ORDER BY directly inside the recursive part, so a wrapper function is required.
Source: Database Kernel Monthly, http://www.kancloud.cn/taobaomysql/monthly/181768
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
