Greenplum Standby Failure Diagnosis and Recovery Procedure
This article shows how to diagnose and recover failed Greenplum segments: inspect segment status, compare files between healthy and failed nodes, generate a recovery configuration, run gprecoverseg, and verify that all segments return to normal operation without interrupting service.
1. Background
The customer found that, although Greenplum was running, some data segments were in an abnormal state. When a primary segment fails, its mirror takes over automatically, but the failed segment must still be repaired promptly to preserve high availability.
2. Simulated Customer Failure Environment
[gpadmin@master ~]$ psql -c "select * from gp_segment_configuration order by content asc,dbid;"
dbid | content | role | preferred_role | mode | status | port | hostname | address | datadir
------+---------+------+----------------+------+--------+-------+--------------+--------------+-----------------------------------
44 | -1 | p | p | s | u | 5432 | master | master | /greenplum/gpdata/master/gpseg-1
... (42 rows total)

The query shows that 11 of the 42 segment nodes are in a down state.
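The same check can be scripted outside psql. A minimal sketch (the helper name and sample rows are our own, not from the article) that counts down segments from the pipe-separated output of `psql -At -c "select * from gp_segment_configuration"`:

```python
# Count down segments from pipe-separated gp_segment_configuration rows.
# Row format: dbid|content|role|preferred_role|mode|status|port|hostname|address|datadir

def summarize_segments(rows):
    """Return (total, down) counts; a segment is down when status == 'd'."""
    total = down = 0
    for line in rows:
        line = line.strip()
        if not line:
            continue
        fields = line.split("|")
        total += 1
        if fields[5] == "d":   # status column
            down += 1
    return total, down

sample = [
    "44|-1|p|p|s|u|5432|master|master|/greenplum/gpdata/master/gpseg-1",
    "2|0|m|p|n|d|55000|data01|data01|/greenplum/gpdata/primary/gpseg0",
]
print(summarize_segments(sample))  # -> (2, 1)
```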
3. Fault Analysis and Resolution
3.1 Data Inspection
Inspect the data files on a down node (example shown for one node):
[gpadmin@data02 gpseg16]$ pwd
/greenplum/gpdata/primary/gpseg16
[gpadmin@data02 gpseg16]$ ls
base pg_hba.conf pg_serial pg_utilitymodedtmredo
fts_probe_file.bak pg_ident.conf pg_snapshots PG_VERSION
global pg_log pg_stat pg_xlog
... (other files listed)

The postmaster.pid file is missing.
For comparison, a healthy node shows the same directory listing but also contains postmaster.pid:
[gpadmin@data01 gpseg1]$ pwd
/greenplum/gpdata/primary/gpseg1
[gpadmin@data01 gpseg1]$ ls
... (same files) ...
postmaster.pid
[gpadmin@data01 gpseg1]$ cat postmaster.pid
20517
/greenplum/gpdata/primary/gpseg1
1652022705
55001
/tmp
*
55001001 393219

Simply restarting the cluster may not resolve the issue and can cause a service interruption.
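postmaster.pid is how the postmaster advertises itself. In PostgreSQL 9.4, the release Greenplum 6 is based on, the lines are the postmaster PID, the data directory, the start-time epoch, the port, the socket directory, the listen address, and the shared-memory key. A hedged sketch (function names are our own) that parses the file and probes whether the recorded PID is still alive:

```python
import os

# Parse a PostgreSQL 9.4-style postmaster.pid. Line order: pid, data directory,
# start time (epoch), port, socket dir, listen address, shared-memory key.
# The dictionary keys below are our own labels, not an official API.

def read_postmaster_pid(path):
    with open(path) as f:
        lines = [l.rstrip("\n") for l in f]
    return {
        "pid": int(lines[0]),
        "datadir": lines[1],
        "start_time": int(lines[2]),
        "port": int(lines[3]),
    }

def pid_alive(pid):
    """True if a process with this PID exists (signal 0 probes without killing)."""
    try:
        os.kill(pid, 0)
        return True
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
```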
3.2 Generate Recovery Configuration
[gpadmin@master ~]$ gprecoverseg -o ./recover
20221127:22:10:22:020909 gprecoverseg:master:gpadmin-[INFO]:-Starting gprecoverseg with args: -o ./recover
... (version info) ...
20221127:22:10:22:020909 gprecoverseg:master:gpadmin-[INFO]:-Configuration file output to ./recover successfully.
[gpadmin@master ~]$ ls
gpAdminLogs recover
[gpadmin@master ~]$ more recover
data04|55000|/greenplum/gpdata/primary/gpseg9
... (list of segment mappings) ...

3.3 Apply Recovery Configuration
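Before feeding the file back with `-i`, it can be worth sanity-checking each line, especially after hand edits. In Greenplum 6 the format is `host|port|datadir` per failed segment, optionally followed by a space-separated `host|port|datadir` target when recovering to a different host. A small sketch (our own helper, not part of gprecoverseg):

```python
# Sanity-check lines of a gprecoverseg -i config file.
# Expected per-line form (Greenplum 6): host|port|datadir, with an optional
# space-separated host|port|datadir target for recovery onto a new host.

def check_recover_line(line):
    """Return True if every |-triplet on the line looks well formed."""
    for triplet in line.strip().split():
        parts = triplet.split("|")
        if len(parts) != 3:
            return False
        host, port, datadir = parts
        if not host or not port.isdigit() or not datadir.startswith("/"):
            return False
    return True

assert check_recover_line("data04|55000|/greenplum/gpdata/primary/gpseg9")
assert not check_recover_line("data04|notaport|/greenplum/gpdata/primary/gpseg9")
```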
[gpadmin@master ~]$ gprecoverseg -i ./recover -F
Continue with segment recovery procedure Yy|Nn (default=N):
> y

3.4 Verify Data
[gpadmin@master ~]$ psql -c "select * from gp_segment_configuration order by content asc,dbid;"
... (output shows all segments with status "u" and role correctly set) ...

All segment nodes are now in the up state.
[gpadmin@master ~]$ psql
psql (9.4.24)
# \c test
You are now connected to database "test" as user "gpadmin".
# select gp_segment_id,count(*) from test_yw group by gp_segment_id;
gp_segment_id | count
---------------+-------
... (counts per segment) ...

Data exists on every segment and is normal.
Note that some segment roles were swapped (a primary became a mirror and vice versa).
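Which segments need rebalancing can be read straight from gp_segment_configuration: any row whose role differs from preferred_role is running in its non-preferred role. A sketch (helper and sample rows are our own) over the pipe-separated row format `psql -At` emits:

```python
# List segments running in a non-preferred role (candidates for gprecoverseg -r).
# Row format: dbid|content|role|preferred_role|mode|status|port|hostname|address|datadir

def swapped_roles(rows):
    out = []
    for line in rows:
        f = line.strip().split("|")
        if len(f) >= 4 and f[2] != f[3]:        # role vs preferred_role
            out.append((int(f[0]), f[2], f[3]))  # (dbid, role, preferred_role)
    return out

rows = [
    "2|0|m|p|s|u|55000|data01|data01|/greenplum/gpdata/primary/gpseg0",
    "3|1|p|p|s|u|55001|data01|data01|/greenplum/gpdata/primary/gpseg1",
]
print(swapped_roles(rows))  # -> [(2, 'm', 'p')]
```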
3.5 Fix Segment Role Status
[gpadmin@master ~]$ gprecoverseg -r
Continue with segment rebalance procedure Yy|Nn (default=N):
> y

3.6 Re-check
[gpadmin@master ~]$ psql -c "select * from gp_segment_configuration order by content asc,dbid;"
... (output shows correct role and preferred_role for all segments) ...

All segment statuses and roles are now correct.
Inspecting the previously faulty node now shows the missing postmaster.pid file restored and a new recovery.done file present:
[gpadmin@data02 gpseg16]$ ls
... (list includes postmaster.pid and recovery.done) ...
[gpadmin@data02 gpseg16]$ more postmaster.pid
19572
/greenplum/gpdata/primary/gpseg16
1669556066
55003
/tmp
*
55003001 327680
[gpadmin@data02 gpseg16]$ more recovery.done
standby_mode = 'on'
primary_conninfo = 'user=gpadmin host=data03 port=56003 sslmode=prefer sslcompression=1 krbsrvname=postgres application_name=gp_walreceiver'
primary_slot_name = 'internal_wal_replication_slot'

Final queries confirm that data on all segments is normal.
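recovery.done is the recovery.conf file that PostgreSQL renames once recovery completes, so its presence confirms this segment finished catching up from its peer (data03 here). If many segments need auditing, the key = 'value' lines can be parsed mechanically; a minimal sketch (our own helper):

```python
# Parse key = 'value' lines from a PostgreSQL recovery.done/recovery.conf file.

def parse_recovery_conf(text):
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")   # split on the first '=' only
        conf[key.strip()] = value.strip().strip("'")
    return conf

sample = """standby_mode = 'on'
primary_conninfo = 'user=gpadmin host=data03 port=56003 application_name=gp_walreceiver'
primary_slot_name = 'internal_wal_replication_slot'"""
print(parse_recovery_conf(sample)["standby_mode"])  # -> on
```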
Aikesheng Open Source Community
The Aikesheng Open Source Community provides stable, enterprise‑grade MySQL open‑source tools and services, releases a premium open‑source component each year (1024), and continuously operates and maintains them.