Master Greenplum Table Design & Performance Optimization: Practical Tips
This article explains what Greenplum is and how its MPP shared‑nothing architecture works, then gives concrete table‑design principles, distribution‑column strategies, indexing guidance, vacuum and table‑rebuilding techniques, and SQL, join, insert, update/delete, and resource‑queue optimizations for better performance.
What is Greenplum
Greenplum is a relational database built for data‑warehouse workloads, based on PostgreSQL and accessed via standard SQL. It runs as a cluster of independent database services, using a shared‑nothing architecture that distributes table rows across segment nodes according to a distribution column.
The cluster consists of multiple segment nodes and a master node that stores metadata and generates execution plans.
Greenplum Table Design and Optimization
Table design follows the same principles as other relational databases: choose appropriate data types and ensure join columns have matching types.
Distribution‑column design is critical. Choose a column with many distinct values (e.g., a primary key) that also appears in joins; if no such column exists, fall back to DISTRIBUTED RANDOMLY. When the join column is the distribution key, each segment can join its own rows locally, avoiding data motion across the interconnect.
Real‑world cases have shown roughly a 50% performance improvement when each segment can join locally.
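As a sketch of the two strategies (table and column names here are hypothetical):

```sql
-- Hash-distribute on a high-cardinality key that is also used in joins
create table orders (
    order_id bigint,
    user_id  bigint,
    amount   numeric(12,2)
) distributed by (order_id);

-- No suitable key: fall back to round-robin distribution
create table click_log (
    ts      timestamp,
    payload text
) distributed randomly;
```

DISTRIBUTED RANDOMLY guarantees even data placement but forces motion on every join, so it is a last resort rather than a default.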
Be aware of data skew when a distribution column has a highly uneven value distribution; you can inspect the per‑segment row counts with:

select gp_segment_id, count(*) from table_name group by 1 order by 1;

Index Recommendations
Official docs advise minimizing indexes because MPP sequential scans are fast, but for typical OLTP‑style point queries you still need composite indexes such as:

create index idx_name on user(name, status);

Vacuum and Table Rebuilding (Hole Reclamation)
Frequent deletes leave dead space that is not reclaimed efficiently. Rebuilding the table is fast in MPP; for example, a 9 TB table can be rebuilt in minutes.
Rebuild steps (example for user table):
create table user_new as select * from user;
alter table user rename to user_bak;
alter table user_new rename to user;

Note that CREATE TABLE AS does not copy indexes, privileges, or the original DISTRIBUTED BY clause, so specify the distribution key explicitly and recreate indexes and grants on the new table.

SQL‑Level Optimizations
For memory‑intensive queries like select count(distinct m3) from up_old_full; you may hit ERROR: insufficient memory reserved for statement. Rewrite it as a two‑step aggregation:

select count(*) from (select m3 from up_old_full group by m3) a;

Always filter data as early as possible, preferably on a single segment, to avoid costly data motion.
Join Optimizations
Ensure join columns have the same data type.
Prefer that join columns are the distribution key.
This reduces global motion and keeps joins local to a segment.
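A minimal illustration (table and column names are hypothetical): when both sides are hash‑distributed on the join column, each segment joins only its local rows and no redistribute motion is needed.

```sql
-- Both tables distributed on user_id, the join column,
-- so the join runs segment-locally with no redistribute motion
create table users  (user_id bigint, name text)       distributed by (user_id);
create table orders (order_id bigint, user_id bigint) distributed by (user_id);

select u.name, count(*)
from users u
join orders o on o.user_id = u.user_id  -- same type, same distribution key
group by u.name;
```

If one table were distributed on a different column, the planner would have to redistribute or broadcast it across segments before joining.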
Insert Optimizations
High insert rates can saturate the resource queue. Setting the server parameter resource_select_only to on makes resource queues govern only SELECT‑type statements, so INSERTs bypass the queue limits and complete faster.
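For context, resource queues themselves are created and assigned in SQL; a brief sketch (queue and role names are hypothetical):

```sql
-- Cap the number of concurrently running statements for ad-hoc users
create resource queue adhoc_queue with (active_statements = 10);

-- Attach a role to the queue; its statements then count against the cap
alter role report_user resource queue adhoc_queue;
```

With resource_select_only enabled, only SELECT‑style statements from report_user would be held back by adhoc_queue.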
For massive inserts, use the custom gpstall tool developed by the HULK DBA team.
Update/Delete Considerations
Updates and deletes are very slow in Greenplum; avoid them when possible.
Prepared Statements
Server‑side prepares can be slow. In PHP PDO, set PDO::ATTR_EMULATE_PREPARES to true so that statements are emulated on the client side.
Network Card Recommendations
Prefer 10 Gbps NICs or multiple NICs to prevent network bottlenecks during segment communication.
Segment Mirror Strategies
Greenplum supports group mirroring (the default) and spread mirroring. With group mirroring, all mirrors for one host's primaries are placed on a single other host, so a host failure roughly doubles the surviving host's load; spread mirroring distributes those mirrors across many hosts, spreading the extra load and reducing the cluster‑wide performance degradation when a host fails.
360 Zhihui Cloud Developer