Databases 25 min read

How to Slash Cloud Data Warehouse Costs with ADB PG Disk Optimization

This article explains how enterprises can dramatically reduce cloud‑native data‑warehouse expenses by understanding ADB PG/Greenplum architecture, applying disk‑reservation and lock‑write safeguards, and implementing practical optimizations such as table compression, hot‑cold tiering, vacuuming, redundant‑index cleanup, replication conversion, and isolated temporary‑table spaces.

Alibaba Cloud Developer

Jan 12, 2022

How to Slash Cloud Data Warehouse Costs with ADB PG Disk Optimization

Background

Enterprises store core data in two‑dimensional tables; under the push for domestic or open‑source databases, Greenplum and AnalyticDB PostgreSQL (ADB PG) are increasingly used for OLAP data warehouses and data‑mid platforms. Customers care about the cost of running a cloud‑native data warehouse, prompting a systematic cost‑optimization study.

ADB PG & Greenplum Architecture

ADB PG (AnalyticDB PostgreSQL) is a cloud‑native MPP database derived from open‑source Greenplum. A Master node handles connections, SQL parsing, optimization and dispatches work to Segment nodes, each a standalone PostgreSQL instance storing a portion of the data. Reducing cost mainly means minimizing the number of Segment servers while preserving compute power and storage capacity.

Disk Management Basics

Key technical points

Business data resides on Segment nodes.

Each Segment has Primary and Mirror, so usable space is roughly half of the server’s total.

Greenplum’s MVCC creates dead tuples after DML.

Replicated tables store a full copy on every Segment; distributed tables spread data by a distribution key.

Append‑Only column tables support compression, trading CPU for space.

Database control disk reservation

When a Segment instance is created, the control layer reserves about 12 % of the physical disk (e.g., a 20 TB server reports 17.6 TB usable). Monitoring uses the reserved size as the denominator, providing a first‑level safety margin.

Additional space is needed for WAL and error logs, so the reported usage may be higher than the actual data footprint.

Lock‑write protection

If disk usage reaches 90 % of the reserved capacity, an automatic task triggers a “lock‑write” state, causing all DML to fail. The threshold is configurable.

Disk‑Space Optimization Strategies

1. Table storage format and compression

Greenplum supports row, Append‑Only row, and Append‑Only column storage. Column‑store tables with appendonly=true and a compression algorithm (zstd, zlib, lz4) can reduce size by several times. Example DDL:

CREATE TABLE bar (id integer, name text)
    WITH (appendonly=true, orientation=column, COMPRESSTYPE=zstd, COMPRESSLEVEL=5)
    DISTRIBUTED BY (id);

zstd offers a balanced trade‑off; lz4 is faster for query‑intensive workloads; compression level 5‑9 is commonly used.

2. Hot‑cold data tiering

Cold data can be stored in HDFS/Hive and accessed via PXF external tables, which consume no ADB PG disk space. Alternatively, the built‑in hot‑cold storage feature can convert external tables to internal tables when joins are needed.

Steps to enable PXF:

-- Enable PXF service in the console
-- Create extension
create extension pxf_fdw;
-- Create external table
CREATE EXTERNAL TABLE pxf_hdfs_textsimple(
    location text, month text, num_orders int, total_sales float8)
LOCATION ('pxf://data/pxf_examples/pxf_hdfs_simple.txt?PROFILE=hdfs:text&SERVER=23')
FORMAT 'TEXT' (delimiter=',');

3. Vacuuming dead tuples

Automatic vacuum reclaims space when dead‑tuple ratios exceed a threshold, but large tables often require manual VACUUM or VACUUM FULL. Example queries to find top‑size tables and high‑bloat tables are provided, followed by manual vacuum commands.

4. Redundant index cleanup

Indexes on large tables can consume terabytes. Queries identify indexes with many columns, large size, or never used (idx_scan=0). Removing redundant indexes reclaimed up to 24 TB in a case study.

5. Converting replicated tables to distributed tables

Replicated tables on large datasets waste space; converting them to hash‑distributed tables can save several terabytes.

6. Isolating temporary‑table space

Creating a dedicated tablespace for temporary objects allows separate monitoring and prevents temp files from inflating the main data footprint.

Optimization Benefits

The described techniques reduced a 1.5 PB source warehouse to roughly 300 TB stored in ADB PG, saving the cost of about 100 servers and avoiding disk‑water‑mark‑induced write locks. The combined approach delivered both financial savings and operational stability.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Data Warehouse compression cold data cost reduction Greenplum Vacuum disk optimization ADB PG

Written by

Alibaba Cloud Developer

Alibaba's official tech channel, featuring all of its technology innovations.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.