How to Slash Cloud Data Warehouse Costs with ADB PG Disk Optimization
This article explains how enterprises can dramatically reduce cloud‑native data‑warehouse expenses by understanding ADB PG/Greenplum architecture, applying disk‑reservation and lock‑write safeguards, and implementing practical optimizations such as table compression, hot‑cold tiering, vacuuming, redundant‑index cleanup, replication conversion, and isolated temporary‑table spaces.
Background
Enterprises store core data in two‑dimensional tables; under the push for domestic or open‑source databases, Greenplum and AnalyticDB PostgreSQL (ADB PG) are increasingly used for OLAP data warehouses and data‑mid platforms. Customers care about the cost of running a cloud‑native data warehouse, prompting a systematic cost‑optimization study.
ADB PG & Greenplum Architecture
ADB PG (AnalyticDB PostgreSQL) is a cloud‑native MPP database derived from open‑source Greenplum. A Master node handles connections, SQL parsing, optimization and dispatches work to Segment nodes, each a standalone PostgreSQL instance storing a portion of the data. Reducing cost mainly means minimizing the number of Segment servers while preserving compute power and storage capacity.
Disk Management Basics
Key technical points
Business data resides on Segment nodes.
Each Segment has Primary and Mirror, so usable space is roughly half of the server’s total.
Greenplum’s MVCC creates dead tuples after DML.
Replicated tables store a full copy on every Segment; distributed tables spread data by a distribution key.
Append‑Only column tables support compression, trading CPU for space.
Database control disk reservation
When a Segment instance is created, the control layer reserves about 12 % of the physical disk (e.g., a 20 TB server reports 17.6 TB usable). Monitoring uses the reserved size as the denominator, providing a first‑level safety margin.
Additional space is needed for WAL and error logs, so the reported usage may be higher than the actual data footprint.
Lock‑write protection
If disk usage reaches 90 % of the reserved capacity, an automatic task triggers a “lock‑write” state, causing all DML to fail. The threshold is configurable.
Disk‑Space Optimization Strategies
1. Table storage format and compression
Greenplum supports row, Append‑Only row, and Append‑Only column storage. Column‑store tables with appendonly=true and a compression algorithm (zstd, zlib, lz4) can reduce size by several times. Example DDL:
CREATE TABLE bar (id integer, name text)
WITH (appendonly=true, orientation=column, COMPRESSTYPE=zstd, COMPRESSLEVEL=5)
DISTRIBUTED BY (id);zstd offers a balanced trade‑off; lz4 is faster for query‑intensive workloads; compression level 5‑9 is commonly used.
2. Hot‑cold data tiering
Cold data can be stored in HDFS/Hive and accessed via PXF external tables, which consume no ADB PG disk space. Alternatively, the built‑in hot‑cold storage feature can convert external tables to internal tables when joins are needed.
Steps to enable PXF:
-- Enable PXF service in the console
-- Create extension
create extension pxf_fdw;
-- Create external table
CREATE EXTERNAL TABLE pxf_hdfs_textsimple(
location text, month text, num_orders int, total_sales float8)
LOCATION ('pxf://data/pxf_examples/pxf_hdfs_simple.txt?PROFILE=hdfs:text&SERVER=23')
FORMAT 'TEXT' (delimiter=',');3. Vacuuming dead tuples
Automatic vacuum reclaims space when dead‑tuple ratios exceed a threshold, but large tables often require manual VACUUM or VACUUM FULL. Example queries to find top‑size tables and high‑bloat tables are provided, followed by manual vacuum commands.
4. Redundant index cleanup
Indexes on large tables can consume terabytes. Queries identify indexes with many columns, large size, or never used (idx_scan=0). Removing redundant indexes reclaimed up to 24 TB in a case study.
5. Converting replicated tables to distributed tables
Replicated tables on large datasets waste space; converting them to hash‑distributed tables can save several terabytes.
6. Isolating temporary‑table space
Creating a dedicated tablespace for temporary objects allows separate monitoring and prevents temp files from inflating the main data footprint.
Optimization Benefits
The described techniques reduced a 1.5 PB source warehouse to roughly 300 TB stored in ADB PG, saving the cost of about 100 servers and avoiding disk‑water‑mark‑induced write locks. The combined approach delivered both financial savings and operational stability.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
