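One of our in-house Greenplum clusters ran into a small problem: the primary segments on one segment host failed and their mirrors were automatically promoted to primary, while another host reported that the database PID did not exist even though everything on it was working normally. This post records how we fixed it.
(Some content has been redacted, so parts of the output are incomplete.)
Symptoms
gpstate is the command we use most during routine inspections; it displays information about a running Greenplum Database instance.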
$ gpstate -m
20250320:07:00:02:024403 gpstate:[INFO]:-Starting gpstate with args: -m
20250320:07:00:02:024403 gpstate:[INFO]:-local Greenplum Version: 'postgres (Greenplum Database) 6.8.1 build
20250320:07:00:02:024403 gpstate:[INFO]:-master Greenplum Version: 'PostgreSQL 9.4.24 (Greenplum Database 6.8.1 build on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 6.4.0, 64-b3:23:56'
20250320:07:00:02:024403 gpstate:[INFO]:-Obtaining Segment details from master...
20250320:07:00:02:024403 gpstate:[INFO]:--------------------------------------------------------------
20250320:07:00:02:024403 gpstate:[INFO]:--Current GPDB mirror list and status
20250320:07:00:02:024403 gpstate:[INFO]:--Type = Group
20250320:07:00:02:024403 gpstate:[INFO]:--------------------------------------------------------------
20250320:07:00:02:024403 gpstate:[INFO]:- Mirror Datadir Port Status Data Status
20250320:07:00:02:024403 gpstate:[INFO]:- mpp-02 /data1/m1/gpseg0 43000 Passive Synchronized
20250320:07:00:02:024403 gpstate:[INFO]:- mpp-02 /data1/m2/gpseg1 43001 Passive Synchronized
20250320:07:00:02:024403 gpstate:[INFO]:- mpp-03 /data1/m1/gpseg2 43000 Passive Synchronized
20250320:07:00:02:024403 gpstate:[INFO]:- mpp-03 /data1/m2/gpseg3 43001 Passive Synchronized
20250320:07:00:02:024403 gpstate:[INFO]:- mpp-04 /data1/m1/gpseg4 43000 Passive Synchronized
20250320:07:00:02:024403 gpstate:[INFO]:- mpp-04 /data1/m2/gpseg5 43001 Passive Synchronized
20250320:07:00:02:024403 gpstate:[INFO]:- mpp-05 /data1/m1/gpseg6 43000 Passive Synchronized
20250320:07:00:02:024403 gpstate:[INFO]:- mpp-05 /data1/m2/gpseg7 43001 Passive Synchronized
20250320:07:00:02:024403 gpstate:[INFO]:- mpp-06 /data1/m1/gpseg8 43000 Acting as Primary Not In Sync
20250320:07:00:02:024403 gpstate:[INFO]:- mpp-06 /data1/m2/gpseg9 43001 Acting as Primary Not In Sync
20250320:07:00:02:024403 gpstate:[INFO]:- mpp-01 /data1/m1/gpseg10 43000 Passive Synchronized
20250320:07:00:02:024403 gpstate:[INFO]:- mpp-01 /data1/m2/gpseg11 43001 Passive Synchronized
20250320:07:00:02:024403 gpstate:[INFO]:--------------------------------------------------------------
20250320:07:00:02:024403 gpstate:[WARNING]:-2 segment(s) configured as mirror(s) are acting as primaries
20250320:07:00:02:024403 gpstate:[WARNING]:-2 mirror segment(s) acting as primaries are not synchronized
$ gpstate -e
20250320:07:00:02:024653 gpstate:[INFO]:-Starting gpstate with args: -e
20250320:07:00:02:024653 gpstate:[INFO]:-local Greenplum Version: 'postgres (Greenplum Database) 6.8.1 build
20250320:07:00:02:024653 gpstate:[INFO]:-master Greenplum Version: 'PostgreSQL 9.4.24 (Greenplum Database 6.8.1 build on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 6.4.0, 64-b3:23:56'
20250320:07:00:02:024653 gpstate:[INFO]:-Obtaining Segment details from master...
20250320:07:00:02:024653 gpstate:[INFO]:-Gathering data from segments...
20250320:07:00:06:024653 gpstate:[WARNING]:-pg_stat_replication shows no standby connections
20250320:07:00:06:024653 gpstate:[WARNING]:-pg_stat_replication shows no standby connections
20250320:07:00:06:024653 gpstate:[INFO]:-----------------------------------------------------
20250320:07:00:06:024653 gpstate:[INFO]:-Segment Mirroring Status Report
20250320:07:00:06:024653 gpstate:[INFO]:-----------------------------------------------------
20250320:07:00:06:024653 gpstate:[INFO]:-Segments with Primary and Mirror Roles Switched
20250320:07:00:06:024653 gpstate:[INFO]:- Current Primary Port Mirror Port
20250320:07:00:06:024653 gpstate:[INFO]:- mpp-06 43000 znhcy-edcmpp-05 42000
20250320:07:00:06:024653 gpstate:[INFO]:- mpp-06 43001 znhcy-edcmpp-05 42001
20250320:07:00:06:024653 gpstate:[INFO]:-----------------------------------------------------
20250320:07:00:06:024653 gpstate:[INFO]:-Unsynchronized Segment Pairs
20250320:07:00:06:024653 gpstate:[INFO]:- Current Primary Port Mirror Port
20250320:07:00:06:024653 gpstate:[INFO]:- mpp-06 43000 znhcy-edcmpp-05 42000
20250320:07:00:06:024653 gpstate:[INFO]:- mpp-06 43001 znhcy-edcmpp-05 42001
20250320:07:00:06:024653 gpstate:[INFO]:-----------------------------------------------------
20250320:07:00:06:024653 gpstate:[INFO]:-Downed Segments (may include segments where status could not be retrieved)
20250320:07:00:06:024653 gpstate:[INFO]:- Segment Port Config status Status
20250320:07:00:06:024653 gpstate:[INFO]:- mpp-02 43000 Up Process error -- database process may be down
20250320:07:00:06:024653 gpstate:[INFO]:- mpp-02 43001 Up Process error -- database process may be down
20250320:07:00:06:024653 gpstate:[INFO]:- mpp-02 42000 Up Process error -- database process may be down
20250320:07:00:06:024653 gpstate:[INFO]:- mpp-02 42001 Up Process error -- database process may be down
20250320:07:00:06:024653 gpstate:[INFO]:- mpp-05 42000 Down Down in configuration
20250320:07:00:06:024653 gpstate:[INFO]:- mpp-05 42001 Down Down in configuration
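From this output we can see:
- The database PID check on mpp-02 reports a problem, but mirror replication between mpp-02 and its partner hosts is normal.
- The primaries gpseg8 and gpseg9 on mpp-05 are marked DOWN, and their mirrors on mpp-06 have been promoted to primary.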
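Reading these findings out of several screens of gpstate output by eye is error-prone. The following is a minimal sketch, not something we ran during the incident, that keeps only the lines signalling trouble. The function name gpstate_problems is hypothetical, and the patterns are simply the status strings that appear in the output above.

```shell
# Hypothetical helper: read gpstate output on stdin and print only the
# lines that indicate a problem. Prints nothing when no pattern matches.
gpstate_problems() {
  grep -E 'Acting as Primary|Not In Sync|Process error|Down in configuration|\[WARNING\]' || true
}

# Usage on the master host:
#   gpstate -m | gpstate_problems
#   gpstate -e | gpstate_problems
```

An empty result only means that none of these strings appeared; it is not a substitute for reading the full report.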
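Repair
In Greenplum, the gprecoverseg utility recovers primary or mirror segment instances that have been marked as down. There is one prerequisite: the cluster must have mirroring enabled.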
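Before running gprecoverseg it can help to see exactly which instances the master considers down, switched, or out of sync. The sketch below assumes the Greenplum 6 gp_segment_configuration catalog, where status is 'u' or 'd', role and preferred_role are 'p' or 'm', and mode is 's' or 'n'. The function name and the database name in the usage line are assumptions.

```shell
# Hypothetical helper: print a query that lists segment instances which are
# down, not in their preferred role, or not synchronized.
segment_status_sql() {
  cat <<'SQL'
SELECT content, role, preferred_role, mode, status, hostname, port, datadir
FROM gp_segment_configuration
WHERE content >= 0
  AND (status = 'd' OR role <> preferred_role OR mode <> 's')
ORDER BY content, role;
SQL
}

# Usage on the master host (the database name "postgres" is an assumption):
#   psql -d postgres -Atc "$(segment_status_sql)"
```

If the catalog contains no rows with role = 'm' at all, the cluster has no mirrors and gprecoverseg has nothing to recover from.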
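On mpp-02 we checked the processes with ps -ef|grep postgres.
The postgres processes were there, yet gpstate still reported that the database PID did not exist. Since every gpseg on mpp-02 had a corresponding mirror in a synchronized state, we killed the postgres processes and rebooted mpp-02. In the post-incident review we concluded that this step was probably unnecessary, because it led to another problem later on.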
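A less disruptive check than killing processes is to compare the PID recorded in each segment's postmaster.pid with what is actually running. This is a sketch we did not run at the time; it relies only on the PostgreSQL convention that the first line of postmaster.pid in the data directory holds the postmaster PID. The function name is hypothetical, and we do not know whether gpstate performs the same comparison internally.

```shell
# Hypothetical helper: report whether the PID recorded in a segment's
# postmaster.pid refers to a live process. Run it as the database owner,
# because kill -0 fails for processes owned by another user.
check_postmaster_pid() {
  local datadir=$1
  local pidfile="$datadir/postmaster.pid"
  if [ ! -r "$pidfile" ]; then
    echo "missing-pidfile"
    return 1
  fi
  local pid
  pid=$(head -n 1 "$pidfile")
  # kill -0 sends no signal; it only tests that the process exists.
  if kill -0 "$pid" 2>/dev/null; then
    echo "running $pid"
  else
    echo "stale $pid"
    return 1
  fi
}

# Usage on the segment host, with a data directory from the gpstate output:
#   check_postmaster_pid /data1/m1/gpseg0
```

A "stale" or "missing-pidfile" result while ps still shows postgres processes would point at the pid file rather than the database itself, which is worth knowing before rebooting a host.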
Recovering the failed segments with gprecoverseg
Once mpp-02 had rebooted, we ran gprecoverseg to recover the gpseg instances on mpp-02 and mpp-05.
$ gprecoverseg
20250321:20:05:58:022067 gprecoverseg:-[INFO]:-Starting gprecoverseg with args:
20250321:20:05:58:022067 gprecoverseg:-[INFO]:-local Greenplum Version: 'postgres (Greenplum Database) 6.8.1 build
20250321:20:05:58:022067 gprecoverseg:-[INFO]:-master Greenplum Version: 'PostgreSQL 9.4.24 (Greenplum Database 6.8.1 build on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 6.4.0, 64-bit compiled on Jun 11 2020 03:23:56'
20250321:20:05:58:022067 gprecoverseg:-[INFO]:-Obtaining Segment details from master...
20250321:20:05:58:022067 gprecoverseg:-[INFO]:-Heap checksum setting is consistent between master and the segments that are candidates for recoverseg
20250321:20:05:58:022067 gprecoverseg:-[INFO]:-Greenplum instance recovery parameters
20250321:20:05:58:022067 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:05:58:022067 gprecoverseg:-[INFO]:-Recovery type = Standard
20250321:20:05:58:022067 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:05:58:022067 gprecoverseg:-[INFO]:-Recovery 1 of 6
20250321:20:05:58:022067 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Synchronization mode = Incremental
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Failed instance host = mpp-02
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Failed instance address = mpp-02
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Failed instance directory = /data1/m1/gpseg0
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Failed instance port = 43000
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Recovery Source instance host = mpp-01
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Recovery Source instance address = mpp-01
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Recovery Source instance directory = /data1/p1/gpseg0
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Recovery Source instance port = 42000
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Recovery Target = in-place
20250321:20:05:58:022067 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:05:58:022067 gprecoverseg:-[INFO]:-Recovery 2 of 6
20250321:20:05:58:022067 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Synchronization mode = Incremental
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Failed instance host = mpp-02
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Failed instance address = mpp-02
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Failed instance directory = /data1/m2/gpseg1
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Failed instance port = 43001
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Recovery Source instance host = mpp-01
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Recovery Source instance address = mpp-01
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Recovery Source instance directory = /data1/p2/gpseg1
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Recovery Source instance port = 42001
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Recovery Target = in-place
20250321:20:05:58:022067 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:05:58:022067 gprecoverseg:-[INFO]:-Recovery 3 of 6
20250321:20:05:58:022067 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Synchronization mode = Incremental
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Failed instance host = mpp-02
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Failed instance address = mpp-02
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Failed instance directory = /data1/p1/gpseg2
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Failed instance port = 42000
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Recovery Source instance host = mpp-03
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Recovery Source instance address = mpp-03
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Recovery Source instance directory = /data1/m1/gpseg2
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Recovery Source instance port = 43000
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Recovery Target = in-place
20250321:20:05:58:022067 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:05:58:022067 gprecoverseg:-[INFO]:-Recovery 4 of 6
20250321:20:05:58:022067 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Synchronization mode = Incremental
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Failed instance host = mpp-02
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Failed instance address = mpp-02
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Failed instance directory = /data1/p2/gpseg3
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Failed instance port = 42001
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Recovery Source instance host = mpp-03
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Recovery Source instance address = mpp-03
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Recovery Source instance directory = /data1/m2/gpseg3
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Recovery Source instance port = 43001
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Recovery Target = in-place
20250321:20:05:58:022067 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:05:58:022067 gprecoverseg:-[INFO]:-Recovery 5 of 6
20250321:20:05:58:022067 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Synchronization mode = Incremental
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Failed instance host = mpp-05
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Failed instance address = mpp-05
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Failed instance directory = /data1/p1/gpseg8
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Failed instance port = 42000
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Recovery Source instance host = mpp-06
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Recovery Source instance address = mpp-06
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Recovery Source instance directory = /data1/m1/gpseg8
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Recovery Source instance port = 43000
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Recovery Target = in-place
20250321:20:05:58:022067 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:05:58:022067 gprecoverseg:-[INFO]:-Recovery 6 of 6
20250321:20:05:58:022067 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Synchronization mode = Incremental
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Failed instance host = mpp-05
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Failed instance address = mpp-05
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Failed instance directory = /data1/p2/gpseg9
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Failed instance port = 42001
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Recovery Source instance host = mpp-06
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Recovery Source instance address = mpp-06
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Recovery Source instance directory = /data1/m2/gpseg9
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Recovery Source instance port = 43001
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Recovery Target = in-place
20250321:20:05:58:022067 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:06:01:022067 gprecoverseg:-[INFO]:-6 segment(s) to recover
20250321:20:06:01:022067 gprecoverseg:-[INFO]:-Ensuring 6 failed segment(s) are stopped
20250321:20:06:05:022067 gprecoverseg:-[INFO]:-3033: /data1/p1/gpseg8
20250321:20:06:08:022067 gprecoverseg:-[INFO]:-3035: /data1/p2/gpseg9
20250321:20:06:23:022067 gprecoverseg:-[INFO]:-Ensuring that shared memory is cleaned up for stopped segments
20250321:20:06:24:022067 gprecoverseg:-[INFO]:-Updating configuration with new mirrors
20250321:20:06:24:022067 gprecoverseg:-[INFO]:-Updating mirrors
20250321:20:06:24:022067 gprecoverseg:-[INFO]:-Running pg_rewind on required mirrors
20250321:20:13:45:022067 gprecoverseg:-[INFO]:-Starting mirrors
20250321:20:13:45:022067 gprecoverseg:-[INFO]:-era is None
20250321:20:13:45:022067 gprecoverseg:-[INFO]:-Commencing parallel segment instance startup, please wait...
20250321:20:18:38:022067 gprecoverseg:-[INFO]:-Process results...
20250321:20:18:38:022067 gprecoverseg:-[INFO]:-Triggering FTS probe
20250321:20:18:38:022067 gprecoverseg:-[INFO]:-******************************************************************
20250321:20:18:38:022067 gprecoverseg:-[INFO]:-Updating segments for streaming is completed.
20250321:20:18:38:022067 gprecoverseg:-[INFO]:-For segments updated successfully, streaming will continue in the background.
20250321:20:18:38:022067 gprecoverseg:-[INFO]:-Use gpstate -s to check the streaming progress.
20250321:20:18:38:022067 gprecoverseg:-[INFO]:-******************************************************************
$ gprecoverseg -r
20250321:20:19:45:023457 gprecoverseg:-[INFO]:-Starting gprecoverseg with args: -r
20250321:20:19:45:023457 gprecoverseg:-[INFO]:-local Greenplum Version: 'postgres (Greenplum Database) 6.8.1 build commit:7118e8aca825b743dd9477d19406fcc06fa53852'
20250321:20:19:45:023457 gprecoverseg:-[INFO]:-master Greenplum Version: 'PostgreSQL 9.4.24 (Greenplum Database 6.8.1 build commit:7118e8aca825b743dd9477d19406fcc06fa53852) on x86_64-unknown-linux-gnu, compiled by piled on Jun 11 2020 03:23:56'
20250321:20:19:45:023457 gprecoverseg:-[INFO]:-Obtaining Segment details from master...
20250321:20:19:45:023457 gprecoverseg:-[INFO]:-Greenplum instance recovery parameters
20250321:20:19:45:023457 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:19:45:023457 gprecoverseg:-[INFO]:-Recovery type = Rebalance
20250321:20:19:45:023457 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:19:45:023457 gprecoverseg:-[INFO]:-Unbalanced segment 1 of 8
20250321:20:19:45:023457 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Unbalanced instance host = mpp-03
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Unbalanced instance address = mpp-03
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Unbalanced instance directory = /data1/m1/gpseg2
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Unbalanced instance port = 43000
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Balanced role = Mirror
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Current role = Primary
20250321:20:19:45:023457 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:19:45:023457 gprecoverseg:-[INFO]:-Unbalanced segment 2 of 8
20250321:20:19:45:023457 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Unbalanced instance host = mpp-02
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Unbalanced instance address = mpp-02
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Unbalanced instance directory = /data1/p1/gpseg2
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Unbalanced instance port = 42000
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Balanced role = Primary
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Current role = Mirror
20250321:20:19:45:023457 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:19:45:023457 gprecoverseg:-[INFO]:-Unbalanced segment 3 of 8
20250321:20:19:45:023457 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Unbalanced instance host = mpp-03
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Unbalanced instance address = mpp-03
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Unbalanced instance directory = /data1/m2/gpseg3
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Unbalanced instance port = 43001
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Balanced role = Mirror
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Current role = Primary
20250321:20:19:45:023457 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:19:45:023457 gprecoverseg:-[INFO]:-Unbalanced segment 4 of 8
20250321:20:19:45:023457 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Unbalanced instance host = mpp-02
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Unbalanced instance address = mpp-02
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Unbalanced instance directory = /data1/p2/gpseg3
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Unbalanced instance port = 42001
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Balanced role = Primary
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Current role = Mirror
20250321:20:19:45:023457 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:19:45:023457 gprecoverseg:-[INFO]:-Unbalanced segment 5 of 8
20250321:20:19:45:023457 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Unbalanced instance host = mpp-06
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Unbalanced instance address = mpp-06
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Unbalanced instance directory = /data1/m1/gpseg8
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Unbalanced instance port = 43000
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Balanced role = Mirror
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Current role = Primary
20250321:20:19:45:023457 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:19:45:023457 gprecoverseg:-[INFO]:-Unbalanced segment 6 of 8
20250321:20:19:45:023457 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Unbalanced instance host = mpp-05
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Unbalanced instance address = mpp-05
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Unbalanced instance directory = /data1/p1/gpseg8
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Unbalanced instance port = 42000
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Balanced role = Primary
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Current role = Mirror
20250321:20:19:45:023457 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:19:45:023457 gprecoverseg:-[INFO]:-Unbalanced segment 7 of 8
20250321:20:19:45:023457 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Unbalanced instance host = mpp-06
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Unbalanced instance address = mpp-06
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Unbalanced instance directory = /data1/m2/gpseg9
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Unbalanced instance port = 43001
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Balanced role = Mirror
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Current role = Primary
20250321:20:19:45:023457 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:19:45:023457 gprecoverseg:-[INFO]:-Unbalanced segment 8 of 8
20250321:20:19:45:023457 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Unbalanced instance host = mpp-05
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Unbalanced instance address = mpp-05
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Unbalanced instance directory = /data1/p2/gpseg9
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Unbalanced instance port = 42001
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Balanced role = Primary
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Current role = Mirror
20250321:20:19:45:023457 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:19:45:023457 gprecoverseg:-[WARNING]:-This operation will cancel queries that are currently executing.
20250321:20:19:45:023457 gprecoverseg:-[WARNING]:-Connections to the database however will not be interrupted.
20250321:20:19:47:023457 gprecoverseg:-[INFO]:-Getting unbalanced segments
20250321:20:19:47:023457 gprecoverseg:-[INFO]:-Stopping unbalanced primary segments...
20250321:20:20:48:023457 gprecoverseg:-[INFO]:-Triggering segment reconfiguration
20250321:20:20:52:023457 gprecoverseg:-[INFO]:-Starting segment synchronization
20250321:20:20:52:023457 gprecoverseg:-[INFO]:-=============================START ANOTHER RECOVER=========================================
20250321:20:20:52:023457 gprecoverseg:-[INFO]:-local Greenplum Version: 'postgres (Greenplum Database) 6.8.1 build commit:7118e8aca825b743dd9477d19406fcc06fa53852'
20250321:20:20:52:023457 gprecoverseg:-[INFO]:-master Greenplum Version: 'PostgreSQL 9.4.24 (Greenplum Database 6.8.1 build commit:7118e8aca825b743dd9477d19406fcc06fa53852) on x86_64-unknown-linux-gnu, compiled by piled on Jun 11 2020 03:23:56'
20250321:20:20:52:023457 gprecoverseg:-[INFO]:-Obtaining Segment details from master...
20250321:20:20:52:023457 gprecoverseg:-[INFO]:-Heap checksum setting is consistent between master and the segments that are candidates for recoverseg
20250321:20:20:52:023457 gprecoverseg:-[INFO]:-Greenplum instance recovery parameters
20250321:20:20:52:023457 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:20:52:023457 gprecoverseg:-[INFO]:-Recovery type = Standard
20250321:20:20:52:023457 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:20:52:023457 gprecoverseg:-[INFO]:-Recovery 1 of 4
20250321:20:20:52:023457 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Synchronization mode = Incremental
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Failed instance host = mpp-03
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Failed instance address = mpp-03
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Failed instance directory = /data1/m1/gpseg2
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Failed instance port = 43000
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Recovery Source instance host = mpp-02
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Recovery Source instance address = mpp-02
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Recovery Source instance directory = /data1/p1/gpseg2
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Recovery Source instance port = 42000
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Recovery Target = in-place
20250321:20:20:52:023457 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:20:52:023457 gprecoverseg:-[INFO]:-Recovery 2 of 4
20250321:20:20:52:023457 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Synchronization mode = Incremental
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Failed instance host = mpp-03
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Failed instance address = mpp-03
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Failed instance directory = /data1/m2/gpseg3
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Failed instance port = 43001
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Recovery Source instance host = mpp-02
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Recovery Source instance address = mpp-02
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Recovery Source instance directory = /data1/p2/gpseg3
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Recovery Source instance port = 42001
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Recovery Target = in-place
20250321:20:20:52:023457 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:20:52:023457 gprecoverseg:-[INFO]:-Recovery 3 of 4
20250321:20:20:52:023457 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Synchronization mode = Incremental
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Failed instance host = mpp-06
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Failed instance address = mpp-06
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Failed instance directory = /data1/m1/gpseg8
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Failed instance port = 43000
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Recovery Source instance host = mpp-05
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Recovery Source instance address = mpp-05
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Recovery Source instance directory = /data1/p1/gpseg8
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Recovery Source instance port = 42000
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Recovery Target = in-place
20250321:20:20:52:023457 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:20:52:023457 gprecoverseg:-[INFO]:-Recovery 4 of 4
20250321:20:20:52:023457 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Synchronization mode = Incremental
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Failed instance host = mpp-06
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Failed instance address = mpp-06
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Failed instance directory = /data1/m2/gpseg9
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Failed instance port = 43001
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Recovery Source instance host = mpp-05
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Recovery Source instance address = mpp-05
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Recovery Source instance directory = /data1/p2/gpseg9
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Recovery Source instance port = 42001
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Recovery Target = in-place
20250321:20:20:52:023457 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:20:52:023457 gprecoverseg:-[INFO]:-4 segment(s) to recover
20250321:20:20:52:023457 gprecoverseg:-[INFO]:-Ensuring 4 failed segment(s) are stopped
20250321:20:20:56:023457 gprecoverseg:-[INFO]:-Ensuring that shared memory is cleaned up for stopped segments
20250321:20:20:56:023457 gprecoverseg:-[INFO]:-Updating configuration with new mirrors
20250321:20:20:56:023457 gprecoverseg:-[INFO]:-Updating mirrors
20250321:20:20:56:023457 gprecoverseg:-[INFO]:-Running pg_rewind on required mirrors
20250321:20:21:03:023457 gprecoverseg:-[INFO]:-Starting mirrors
20250321:20:21:03:023457 gprecoverseg:-[INFO]:-era is None
20250321:20:21:03:023457 gprecoverseg:-[INFO]:-Commencing parallel segment instance startup, please wait...
20250321:20:21:05:023457 gprecoverseg:-[INFO]:-Process results...
20250321:20:21:05:023457 gprecoverseg:-[INFO]:-Triggering FTS probe
20250321:20:21:05:023457 gprecoverseg:-[INFO]:-******************************************************************
20250321:20:21:05:023457 gprecoverseg:-[INFO]:-Updating segments for streaming is completed.
20250321:20:21:05:023457 gprecoverseg:-[INFO]:-For segments updated successfully, streaming will continue in the background.
20250321:20:21:05:023457 gprecoverseg:-[INFO]:-Use gpstate -s to check the streaming progress.
20250321:20:21:05:023457 gprecoverseg:-[INFO]:-******************************************************************
20250321:20:21:05:023457 gprecoverseg:-[INFO]:-==============================END ANOTHER RECOVER==========================================
20250321:20:21:05:023457 gprecoverseg:-[INFO]:-******************************************************************
20250321:20:21:05:023457 gprecoverseg:-[INFO]:-The rebalance operation has completed successfully.
20250321:20:21:05:023457 gprecoverseg:-[INFO]:-There is a resynchronization running in the background to bring all
20250321:20:21:05:023457 gprecoverseg:-[INFO]:-segments in sync.
20250321:20:21:05:023457 gprecoverseg:-[INFO]:-Use gpstate -e to check the resynchronization progress.
20250321:20:21:05:023457 gprecoverseg:-[INFO]:-**********************************************************************
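There is a lot of log output here, so to summarize: six segments were abnormal (the reboot of mpp-02 took down its two primaries and two mirrors, on top of the two primaries already down on mpp-05). We ran gprecoverseg to bring the failed segment instances back, then gprecoverseg -r to return the segments to the preferred roles assigned to them at system initialization.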
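Because the gprecoverseg plan is so verbose, it can be condensed to one line per recovery. The awk sketch below is an illustration based only on the field labels visible in the log above ("Failed instance host", "Failed instance directory", "Recovery Source instance host", "Recovery Target"); the function name is hypothetical.

```shell
# Hypothetical helper: read gprecoverseg output on stdin and print one line
# per planned recovery in the form "failed_host:failed_dir <- source_host".
summarize_recoveries() {
  awk '
    /Failed instance host/          { host = $NF }
    /Failed instance directory/     { dir  = $NF }
    /Recovery Source instance host/ { src  = $NF }
    /Recovery Target/               { print host ":" dir " <- " src }
  '
}

# Usage, assuming the output was saved to a file named recover.log:
#   summarize_recoveries < recover.log
```

Each "Recovery Target" line closes one entry in the plan, so that is where the collected fields are printed.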
At this point gpstate shows that the cluster is fully back to normal.
$ gpstate -m
20250321:21:47:01:016024 gpstate:mpp-01:-[INFO]:-Starting gpstate with args: -m
20250321:21:47:01:016024 gpstate:mpp-01:-[INFO]:-local Greenplum Version: 'postgres (Greenplum Database) 6.8.1 build
20250321:21:47:01:016024 gpstate:mpp-01:-[INFO]:-master Greenplum Version: 'PostgreSQL 9.4.24 (Greenplum Database 6.8.1 build on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 6.4.0, 64-bit compiled on Jun 11 2020102123:56'
20250321:21:47:01:016024 gpstate:mpp-01:-[INFO]:-Obtaining Segment details from master...
20250321:21:47:01:016024 gpstate:mpp-01:-[INFO]:--------------------------------------------------------------
20250321:21:47:01:016024 gpstate:mpp-01:-[INFO]:--Current GPDB mirror list and status
20250321:21:47:01:016024 gpstate:mpp-01:-[INFO]:--Type = Group
20250321:21:47:01:016024 gpstate:mpp-01:-[INFO]:--------------------------------------------------------------
20250321:21:47:01:016024 gpstate:mpp-01:-[INFO]:- Mirror Datadir Port Status Data Status
20250321:21:47:01:016024 gpstate:mpp-01:-[INFO]:- mpp-02 /data1/m1/gpseg0 43000 Passive Synchronized
20250321:21:47:01:016024 gpstate:mpp-01:-[INFO]:- mpp-02 /data1/m2/gpseg1 43001 Passive Synchronized
20250321:21:47:01:016024 gpstate:mpp-01:-[INFO]:- mpp-03 /data1/m1/gpseg2 43000 Passive Synchronized
20250321:21:47:01:016024 gpstate:mpp-01:-[INFO]:- mpp-03 /data1/m2/gpseg3 43001 Passive Synchronized
20250321:21:47:01:016024 gpstate:mpp-01:-[INFO]:- mpp-04 /data1/m1/gpseg4 43000 Passive Synchronized
20250321:21:47:01:016024 gpstate:mpp-01:-[INFO]:- mpp-04 /data1/m2/gpseg5 43001 Passive Synchronized
20250321:21:47:01:016024 gpstate:mpp-01:-[INFO]:- mpp-05 /data1/m1/gpseg6 43000 Passive Synchronized
20250321:21:47:01:016024 gpstate:mpp-01:-[INFO]:- mpp-05 /data1/m2/gpseg7 43001 Passive Synchronized
20250321:21:47:01:016024 gpstate:mpp-01:-[INFO]:- mpp-06 /data1/m1/gpseg8 43000 Passive Synchronized
20250321:21:47:01:016024 gpstate:mpp-01:-[INFO]:- mpp-06 /data1/m2/gpseg9 43001 Passive Synchronized
20250321:21:47:01:016024 gpstate:mpp-01:-[INFO]:- mpp-01 /data1/m1/gpseg10 43000 Passive Synchronized
20250321:21:47:01:016024 gpstate:mpp-01:-[INFO]:- mpp-01 /data1/m2/gpseg11 43001 Passive Synchronized
20250321:21:47:01:016024 gpstate:mpp-01:-[INFO]:--------------------------------------------------------------
$ gpstate -c
20250321:21:47:11:016943 gpstate:mpp-01:-[INFO]:-Starting gpstate with args: -c
20250321:21:47:11:016943 gpstate:mpp-01:-[INFO]:-local Greenplum Version: 'postgres (Greenplum Database) 6.8.1 build
20250321:21:47:11:016943 gpstate:mpp-01:-[INFO]:-master Greenplum Version: 'PostgreSQL 9.4.24 (Greenplum Database 6.8.1 build on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 6.4.0, 64-bit compiled on Jun 11 2020102123:56'
20250321:21:47:11:016943 gpstate:mpp-01:-[INFO]:-Obtaining Segment details from master...
20250321:21:47:11:016943 gpstate:mpp-01:-[INFO]:--------------------------------------------------------------
20250321:21:47:11:016943 gpstate:mpp-01:-[INFO]:--Current GPDB mirror list and status
20250321:21:47:11:016943 gpstate:mpp-01:-[INFO]:--Type = Group
20250321:21:47:11:016943 gpstate:mpp-01:-[INFO]:--------------------------------------------------------------
20250321:21:47:11:016943 gpstate:mpp-01:-[INFO]:- Status Data State Primary Datadir Port Mirror Datadir Port
20250321:21:47:11:016943 gpstate:mpp-01:-[INFO]:- Primary Active, Mirror Available Synchronized mpp-01 /data1/p1/gpseg0 42000 mpp-02 /data1/m1/gpseg0 43000
20250321:21:47:11:016943 gpstate:mpp-01:-[INFO]:- Primary Active, Mirror Available Synchronized mpp-01 /data1/p2/gpseg1 42001 mpp-02 /data1/m2/gpseg1 43001
20250321:21:47:11:016943 gpstate:mpp-01:-[INFO]:- Primary Active, Mirror Available Synchronized mpp-02 /data1/p1/gpseg2 42000 mpp-03 /data1/m1/gpseg2 43000
20250321:21:47:11:016943 gpstate:mpp-01:-[INFO]:- Primary Active, Mirror Available Synchronized mpp-02 /data1/p2/gpseg3 42001 mpp-03 /data1/m2/gpseg3 43001
20250321:21:47:11:016943 gpstate:mpp-01:-[INFO]:- Primary Active, Mirror Available Synchronized mpp-03 /data1/p1/gpseg4 42000 mpp-04 /data1/m1/gpseg4 43000
20250321:21:47:11:016943 gpstate:mpp-01:-[INFO]:- Primary Active, Mirror Available Synchronized mpp-03 /data1/p2/gpseg5 42001 mpp-04 /data1/m2/gpseg5 43001
20250321:21:47:11:016943 gpstate:mpp-01:-[INFO]:- Primary Active, Mirror Available Synchronized mpp-04 /data1/p1/gpseg6 42000 mpp-05 /data1/m1/gpseg6 43000
20250321:21:47:11:016943 gpstate:mpp-01:-[INFO]:- Primary Active, Mirror Available Synchronized mpp-04 /data1/p2/gpseg7 42001 mpp-05 /data1/m2/gpseg7 43001
20250321:21:47:11:016943 gpstate:mpp-01:-[INFO]:- Primary Active, Mirror Available Synchronized mpp-05 /data1/p1/gpseg8 42000 mpp-06 /data1/m1/gpseg8 43000
20250321:21:47:11:016943 gpstate:mpp-01:-[INFO]:- Primary Active, Mirror Available Synchronized mpp-05 /data1/p2/gpseg9 42001 mpp-06 /data1/m2/gpseg9 43001
20250321:21:47:11:016943 gpstate:mpp-01:-[INFO]:- Primary Active, Mirror Available Synchronized mpp-06 /data1/p1/gpseg10 42000 mpp-01 /data1/m1/gpseg10 43000
20250321:21:47:11:016943 gpstate:mpp-01:-[INFO]:- Primary Active, Mirror Available Synchronized mpp-06 /data1/p2/gpseg11 42001 mpp-01 /data1/m2/gpseg11 43001
20250321:21:47:11:016943 gpstate:mpp-01:-[INFO]:--------------------------------------------------------------
$ gpstate -e
20250321:21:47:14:017039 gpstate:mpp-01:-[INFO]:-Starting gpstate with args: -e
20250321:21:47:14:017039 gpstate:mpp-01:-[INFO]:-local Greenplum Version: 'postgres (Greenplum Database) 6.8.1 build
20250321:21:47:14:017039 gpstate:mpp-01:-[INFO]:-master Greenplum Version: 'PostgreSQL 9.4.24 (Greenplum Database 6.8.1 build on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 6.4.0, 64-bit compiled on Jun 11 2020102123:56'
20250321:21:47:14:017039 gpstate:mpp-01:-[INFO]:-Obtaining Segment details from master...
20250321:21:47:14:017039 gpstate:mpp-01:-[INFO]:-Gathering data from segments...
20250321:21:47:14:017039 gpstate:mpp-01:-[INFO]:-----------------------------------------------------
20250321:21:47:14:017039 gpstate:mpp-01:-[INFO]:-Segment Mirroring Status Report
20250321:21:47:14:017039 gpstate:mpp-01:-[INFO]:-----------------------------------------------------
20250321:21:47:14:017039 gpstate:mpp-01:-[INFO]:-All segments are running normally
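Just as we were about to call it a day, a colleague on the monitoring team reported that port 5432 on node mpp-02 was still DOWN. Wait, what?
We checked right away.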
linux
$ gpstate -f
20250321:21:56:14:009720 gpstate:mpp-01:-[INFO]:-Starting gpstate with args: -f
20250321:21:56:14:009720 gpstate:mpp-01:-[INFO]:-local Greenplum Version: 'postgres (Greenplum Database) 6.8.1 build
20250321:21:56:14:009720 gpstate:mpp-01:-[INFO]:-master Greenplum Version: 'PostgreSQL 9.4.24 (Greenplum Database 6.8.1 build on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 6.4.0, 64-bit compiled on Jun 20250321:21:56:14:009720
20250321:21:56:14:009720 gpstate:mpp-01:-[INFO]:-Obtaining Segment details from master...
20250321:21:56:14:009720 gpstate:mpp-01:-[INFO]:-Standby master details
20250321:21:56:14:009720 gpstate:mpp-01:-[INFO]:-----------------------
20250321:21:56:14:009720 gpstate:mpp-01:-[INFO]:- Standby address = mpp-02
20250321:21:56:14:009720 gpstate:mpp-01:-[INFO]:- Standby data directory = /data1/master/gpseg-1
20250321:21:56:14:009720 gpstate:mpp-01:-[INFO]:- Standby port = 5432
20250321:21:56:14:009720 gpstate:mpp-01:-[WARNING]:-Standby PID = 0 <<<<<<<<
20250321:21:56:14:009720 gpstate:mpp-01:-[WARNING]:-Standby status = Standby process not running <<<<<<<<
20250321:21:56:14:009720 gpstate:mpp-01:-[INFO]:--------------------------------------------------------------
20250321:21:56:14:009720 gpstate:mpp-01:-[INFO]:--pg_stat_replication
20250321:21:56:14:009720 gpstate:mpp-01:-[INFO]:--------------------------------------------------------------
20250321:21:56:14:009720 gpstate:mpp-01:-[INFO]:-No entries found.
20250321:21:56:14:009720 gpstate:mpp-01:-[INFO]:--------------------------------------------------------------
Sure enough, the standby master was not running. According to the official documentation, gpinitstandby can be used to bring it back.
linux
$ gpinitstandby -n
20250321:57:21:09:010314 gpinitstandby:mpp-01:-[INFO]:-Validating environment and parameters for standby initialization...
20250321:57:21:09:010314 gpinitstandby:mpp-01:-[INFO]:-Checking for data directory /data1/master/gpseg-1 on mpp-02
20250321:57:21:09:010314 gpinitstandby:mpp-01:-[INFO]:------------------------------------------------------
20250321:57:21:09:010314 gpinitstandby:mpp-01:-[INFO]:-Greenplum standby master initialization parameters
20250321:57:21:09:010314 gpinitstandby:mpp-01:-[INFO]:------------------------------------------------------
20250321:57:21:09:010314 gpinitstandby:mpp-01:-[INFO]:-Greenplum master hostname = mpp-01
20250321:57:21:09:010314 gpinitstandby:mpp-01:-[INFO]:-Greenplum master data directory = /data1/master/gpseg-1
20250321:57:21:09:010314 gpinitstandby:mpp-01:-[INFO]:-Greenplum master port = 5432
20250321:57:21:09:010314 gpinitstandby:mpp-01:-[INFO]:-Greenplum standby master hostname = mpp-02
20250321:57:21:09:010314 gpinitstandby:mpp-01:-[INFO]:-Greenplum standby master port = 5432
20250321:57:21:09:010314 gpinitstandby:mpp-01:-[INFO]:-Greenplum standby master data directory = /data1/master/gpseg-1
20250321:57:21:09:010314 gpinitstandby:mpp-01:-[INFO]:-Greenplum update system catalog = On
20250321:57:21:11:010314 gpinitstandby:mpp-01:-[INFO]:-Syncing Greenplum Database extensions to standby
20250321:57:21:12:010314 gpinitstandby:mpp-01:-[INFO]:-The packages on mpp-02 are consistent.
20250321:57:21:12:010314 gpinitstandby:mpp-01:-[INFO]:-Adding standby master to catalog...
20250321:57:21:12:010314 gpinitstandby:mpp-01:-[INFO]:-Database catalog updated successfully.
20250321:57:21:12:010314 gpinitstandby:mpp-01:-[INFO]:-Updating pg_hba.conf file...
20250321:57:21:13:010314 gpinitstandby:mpp-01:-[INFO]:-pg_hba.conf files updated successfully.
20250321:57:21:16:010314 gpinitstandby:mpp-01:-[INFO]:-Starting standby master
20250321:57:21:16:010314 gpinitstandby:mpp-01:-[INFO]:-Checking if standby master is running on host: mpp-02 in directory: /data1/master/gpseg-1
20250321:57:22:42:010314 gpinitstandby:mpp-01:-[WARNING]:-Could not start standby master
20250321:57:22:42:010314 gpinitstandby:mpp-01:-[INFO]:-Cleaning up pg_hba.conf backup files...
20250321:57:22:43:010314 gpinitstandby:mpp-01:-[INFO]:-Backup files of pg_hba.conf cleaned up successfully.
20250321:57:22:43:010314 gpinitstandby:mpp-01:-[INFO]:-Successfully created standby master on mpp-02
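No luck: re-activating the standby master had no effect. Grasping at straws, we figured that if activation did not work we would remove the standby master and then add mpp-02 back as the standby node. The result was the same: the master process on mpp-02 still would not start!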
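Note that the gpinitstandby output above ends with "Successfully created standby master" even though it had just warned "Could not start standby master", so the last line alone is not a reliable signal. A minimal sketch of a follow-up check, assuming only the failure text that `gpstate -f` printed in this incident (the function name is hypothetical):

```shell
#!/usr/bin/env bash
# Sketch only: reads `gpstate -f` output on stdin and succeeds unless it
# contains the failure text seen in this post.
standby_is_running() {
  ! grep -q 'Standby process not running'
}

# Usage (not executed here):
#   gpinitstandby -n
#   gpstate -f | standby_is_running \
#     || echo "standby still down, check pg_log on the standby host"
```

The check is deliberately narrow: it matches one known failure string and says nothing about replication lag or a cluster with no standby configured.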
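The pitfall caused by pg_hba.conf
The standby master still would not start, but the system itself was running normally, so our manager, seeing how late it was and that we were no longer thinking clearly, sent us home.
The next day happened to be the weekend, but I refused to accept that the standby could not be brought up. I went through pg_log on mpp-01 and mpp-02 and found the clue.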
linux
]# more gpdb-2025-03-21_212117.csv
2025-03-21 21:21:17.736282 CST,,,p25377,th167159936,,,,0,,,seg-1,,,,,"LOG","F0000","invalid authentication method ""0.0.0.0/0""",,,,,"line 107 of configuration file ""/data1/master/gpseg-1/pg_hba.conf""",,0,,"hba.c",1206,
2025-03-21 21:21:17.736434 CST,,,p25377,th167159936,,,,0,,,seg-1,,,,,"FATAL","F0000","could not load pg_hba.conf",,,,,,,0,,"postmaster.c",1460,
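Finding these two lines by eye in a large CSV log is tedious. A rough sketch of a filter follows, assuming the severity column appears as `,"FATAL",` or `,"PANIC",` as in the excerpt above; the function name is hypothetical. It also prints the line before each hit, because here the actual cause was logged at LOG level just before the FATAL entry.

```shell
#!/usr/bin/env bash
# Sketch only: print FATAL/PANIC rows from Greenplum CSV logs together
# with the line immediately before each of them.
fatal_entries() {
  grep -h -B1 -E ',"(FATAL|PANIC)",' "$@"
}

# Usage (not executed here):
#   fatal_entries /data1/master/gpseg-1/pg_log/gpdb-2025-03-21_212117.csv
```

This is a plain text match rather than a CSV parser, so a message that spans several lines will only show its last line as context.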
The pg_hba.conf file was misconfigured! Here is the entry in question:
sql
...
local all backup 0.0.0.0/0 md5
...
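Sure enough, the entry was wrong: a local record matches Unix-domain socket connections and takes no address field, so 0.0.0.0/0 was parsed as the authentication method, which is exactly what the log complained about. After asking the colleagues involved, we learned that the entry had been added when a backup appliance was being set up, and in the end that appliance was never used to back up this database. If not for this chain of accidents, the bad entry might have stayed there indefinitely. We then commented out the entry and ran gpinitstandby -n again to re-activate the standby master.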
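A mistake like this can sit unnoticed until the next time the file has to be loaded from scratch, so a quick lint before restarts may help. The following is a heuristic sketch, not a full pg_hba.conf parser: it flags active local records whose fourth field contains a "/", i.e. something that looks like a CIDR address sitting where the authentication method belongs. The function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: heuristic check for `local` records that carry an address.
# Prints one line per suspicious record and returns non-zero if any is found.
lint_local_hba() {
  awk '$1 == "local" && index($4, "/") > 0 {
         printf "line %d: local record has address-like method field: %s\n", NR, $4
         bad = 1
       }
       END { exit bad + 0 }' "$1"
}

# Usage (not executed here):
#   lint_local_hba /data1/master/gpseg-1/pg_hba.conf
```

Commented-out lines are ignored because their first field is not `local`. Records of type host are left alone, since an address is expected there.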
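Retrospective
Back when we first found that the database PID was reported missing on node mpp-02, could we also have resolved the fault with gprecoverseg?
I took over this Greenplum cluster midway myself. If anyone more experienced knows a better way to handle this problem, please send me a private message. Thank you.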