Greenplum Standby Failure Diagnosis and Recovery Procedure
This article shows how to diagnose and recover failed Greenplum segments: inspect segment status, compare files between healthy and failed nodes, generate a recovery configuration, run gprecoverseg, and verify that all segments return to normal operation without interrupting service.
1. Background
The customer found that, although Greenplum was running, some data segments were in an abnormal state. When a primary segment fails, its mirror takes over automatically, but the failed segment must still be repaired promptly to preserve high availability.
2. Simulated Customer Failure Environment
[gpadmin@master ~]$ psql -c "select * from gp_segment_configuration order by content asc,dbid;"
dbid | content | role | preferred_role | mode | status | port | hostname | address | datadir
------+---------+------+----------------+------+--------+-------+--------------+--------------+-----------------------------------
44 | -1 | p | p | s | u | 5432 | master | master | /greenplum/gpdata/master/gpseg-1
... (42 rows total)

The query shows that 11 of the 42 segment nodes are in a down state.
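The same check can be scripted outside psql. A minimal sketch (the helper name and sample rows are our own, not from the article) that counts down segments from the pipe-separated output of `psql -At -c "select * from gp_segment_configuration"`:

```python
# Count down segments from pipe-separated gp_segment_configuration rows.
# Row format: dbid|content|role|preferred_role|mode|status|port|hostname|address|datadir

def summarize_segments(rows):
    """Return (total, down) counts; a segment is down when status == 'd'."""
    total = down = 0
    for line in rows:
        line = line.strip()
        if not line:
            continue
        fields = line.split("|")
        total += 1
        if fields[5] == "d":   # status column
            down += 1
    return total, down

sample = [
    "44|-1|p|p|s|u|5432|master|master|/greenplum/gpdata/master/gpseg-1",
    "2|0|m|p|n|d|55000|data01|data01|/greenplum/gpdata/primary/gpseg0",
]
print(summarize_segments(sample))  # -> (2, 1)
```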
3. Fault Analysis and Resolution
3.1 Data Inspection
Inspect the data files on a down node (example shown for one node):
[gpadmin@data02 gpseg16]$ pwd
/greenplum/gpdata/primary/gpseg16
[gpadmin@data02 gpseg16]$ ls
base pg_hba.conf pg_serial pg_utilitymodedtmredo
fts_probe_file.bak pg_ident.conf pg_snapshots PG_VERSION
global pg_log pg_stat pg_xlog
... (other files listed)

The postmaster.pid file is missing.
For comparison, a healthy node shows the same directory listing but also contains postmaster.pid:
[gpadmin@data01 gpseg1]$ pwd
/greenplum/gpdata/primary/gpseg1
[gpadmin@data01 gpseg1]$ ls
... (same files) ...
postmaster.pid
[gpadmin@data01 gpseg1]$ cat postmaster.pid
20517
/greenplum/gpdata/primary/gpseg1
1652022705
55001
/tmp
*
55001001 393219

Simply restarting the cluster may not resolve the issue and can cause a service interruption.
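postmaster.pid is how the postmaster advertises itself. In PostgreSQL 9.4, the release Greenplum 6 is based on, the lines are the postmaster PID, the data directory, the start-time epoch, the port, the socket directory, the listen address, and the shared-memory key. A hedged sketch (function names are our own) that parses the file and probes whether the recorded PID is still alive:

```python
import os

# Parse a PostgreSQL 9.4-style postmaster.pid. Line order: pid, data directory,
# start time (epoch), port, socket dir, listen address, shared-memory key.
# The dictionary keys below are our own labels, not an official API.

def read_postmaster_pid(path):
    with open(path) as f:
        lines = [l.rstrip("\n") for l in f]
    return {
        "pid": int(lines[0]),
        "datadir": lines[1],
        "start_time": int(lines[2]),
        "port": int(lines[3]),
    }

def pid_alive(pid):
    """True if a process with this PID exists (signal 0 probes without killing)."""
    try:
        os.kill(pid, 0)
        return True
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
```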
3.2 Generate Recovery Configuration
[gpadmin@master ~]$ gprecoverseg -o ./recover
20221127:22:10:22:020909 gprecoverseg:master:gpadmin-[INFO]:-Starting gprecoverseg with args: -o ./recover
... (version info) ...
20221127:22:10:22:020909 gprecoverseg:master:gpadmin-[INFO]:-Configuration file output to ./recover successfully.
[gpadmin@master ~]$ ls
gpAdminLogs recover
[gpadmin@master ~]$ more recover
data04|55000|/greenplum/gpdata/primary/gpseg9
... (list of segment mappings) ...

3.3 Apply Recovery Configuration
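Before feeding the file back with `-i`, it can be worth sanity-checking each line, especially after hand edits. In Greenplum 6 the format is `host|port|datadir` per failed segment, optionally followed by a space-separated `host|port|datadir` target when recovering to a different host. A small sketch (our own helper, not part of gprecoverseg):

```python
# Sanity-check lines of a gprecoverseg -i config file.
# Expected per-line form (Greenplum 6): host|port|datadir, with an optional
# space-separated host|port|datadir target for recovery onto a new host.

def check_recover_line(line):
    """Return True if every |-triplet on the line looks well formed."""
    for triplet in line.strip().split():
        parts = triplet.split("|")
        if len(parts) != 3:
            return False
        host, port, datadir = parts
        if not host or not port.isdigit() or not datadir.startswith("/"):
            return False
    return True

assert check_recover_line("data04|55000|/greenplum/gpdata/primary/gpseg9")
assert not check_recover_line("data04|notaport|/greenplum/gpdata/primary/gpseg9")
```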
[gpadmin@master ~]$ gprecoverseg -i ./recover -F
Continue with segment recovery procedure Yy|Nn (default=N):
> y

3.4 Verify Data
[gpadmin@master ~]$ psql -c "select * from gp_segment_configuration order by content asc,dbid;"
... (output shows all segments with status "u" and role correctly set) ...

All segment nodes are now in the up state.
[gpadmin@master ~]$ psql
psql (9.4.24)
# \c test
You are now connected to database "test" as user "gpadmin".
# select gp_segment_id,count(*) from test_yw group by gp_segment_id;
gp_segment_id | count
---------------+-------
... (counts per segment) ...

Data exists on every segment and is normal.
Note that some segment roles were swapped (a primary became a mirror and vice versa).
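Which segments need rebalancing can be read straight from gp_segment_configuration: any row whose role differs from preferred_role is running in its non-preferred role. A sketch (helper and sample rows are our own) over the pipe-separated row format `psql -At` emits:

```python
# List segments running in a non-preferred role (candidates for gprecoverseg -r).
# Row format: dbid|content|role|preferred_role|mode|status|port|hostname|address|datadir

def swapped_roles(rows):
    out = []
    for line in rows:
        f = line.strip().split("|")
        if len(f) >= 4 and f[2] != f[3]:        # role vs preferred_role
            out.append((int(f[0]), f[2], f[3]))  # (dbid, role, preferred_role)
    return out

rows = [
    "2|0|m|p|s|u|55000|data01|data01|/greenplum/gpdata/primary/gpseg0",
    "3|1|p|p|s|u|55001|data01|data01|/greenplum/gpdata/primary/gpseg1",
]
print(swapped_roles(rows))  # -> [(2, 'm', 'p')]
```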
3.5 Fix Segment Role Status
[gpadmin@master ~]$ gprecoverseg -r
Continue with segment rebalance procedure Yy|Nn (default=N):
> y

3.6 Re-check
[gpadmin@master ~]$ psql -c "select * from gp_segment_configuration order by content asc,dbid;"
... (output shows correct role and preferred_role for all segments) ...

All segment statuses and roles are now correct.
Inspecting the previously faulty node now shows the missing postmaster.pid file restored and a new recovery.done file present:
[gpadmin@data02 gpseg16]$ ls
... (list includes postmaster.pid and recovery.done) ...
[gpadmin@data02 gpseg16]$ more postmaster.pid
19572
/greenplum/gpdata/primary/gpseg16
1669556066
55003
/tmp
*
55003001 327680
[gpadmin@data02 gpseg16]$ more recovery.done
standby_mode = 'on'
primary_conninfo = 'user=gpadmin host=data03 port=56003 sslmode=prefer sslcompression=1 krbsrvname=postgres application_name=gp_walreceiver'
primary_slot_name = 'internal_wal_replication_slot'

Final queries confirm that data on all segments is normal.
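recovery.done is the recovery.conf file that PostgreSQL renames once recovery completes, so its presence confirms this segment finished catching up from its peer (data03 here). If many segments need auditing, the key = 'value' lines can be parsed mechanically; a minimal sketch (our own helper):

```python
# Parse key = 'value' lines from a PostgreSQL recovery.done/recovery.conf file.

def parse_recovery_conf(text):
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")   # split on the first '=' only
        conf[key.strip()] = value.strip().strip("'")
    return conf

sample = """standby_mode = 'on'
primary_conninfo = 'user=gpadmin host=data03 port=56003 application_name=gp_walreceiver'
primary_slot_name = 'internal_wal_replication_slot'"""
print(parse_recovery_conf(sample)["standby_mode"])  # -> on
```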
Aikesheng Open Source Community
The Aikesheng Open Source Community provides stable, enterprise‑grade MySQL open‑source tools and services, releases a premium open‑source component each year (1024), and continuously operates and maintains them.