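GaussDB fault diagnosis methods

Logs
Views
WDR reports
Error codes
Core files
ffic logs

Common methods for diagnosing GaussDB faults include logs (system logs, operation logs, and health check logs), system views, WDR reports, error codes, core files, and ffic logs.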

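Logs

| Log category | Log path | File name format | Log content |
| --- | --- | --- | --- |
| System log | `$GAUSSLOG/gs_log/<DN_NAME>` | `gaussdb-%Y-%m-%d_%H%M%S.log` | DN log: database runtime log, including checkpoint and autovacuum information |
| System log | `$GAUSSLOG/cm/cm_server` | `cm_server-%Y-%m-%d_%H%M%S-current.log` | cm_server runtime log, including cluster arbitration and primary/standby switching |
| System log | `$GAUSSLOG/cm/cm_agent` | `cm_agent-%Y-%m-%d_%H%M%S-current.log` | cm_agent log: liveness of the DN and CMS components and of the gaussdb process, system checks, and log compression |
| System log | `$GAUSSLOG/cm/cm_agent` | `system_call-current.log` | cm_agent log: messages about the arbitration commands issued by CM Server |
| System log | `$GAUSSLOG/cm/om_monitor` | `om_monitor-%Y-%m-%d_%H%M%S-current.log` | om_monitor log: liveness of etcd and CM Agent |
| System log | `$GAUSSLOG/cm/etcd` | `etcd_-current.log` | ETCD cluster runtime log |
| Operation log | `$GAUSSLOG/cm/cm_ctl` | `cm_ctl-%Y-%m-%d_%H%M%S-current.log` | cm_ctl commands issued and their results |
| Operation log | `$GAUSSLOG/bin/gs_guc` | `gs_guc-%Y-%m-%d_%H%M%S-current.log` | gs_guc commands issued and their results |
| Operation log | `$GAUSSLOG/bin/gs_ctl` | `gs_ctl-%Y-%m-%d_%H%M%S-current.log` | gs_ctl commands issued and their results |
| Audit log | `$GAUSSLOG/pg_audit/<DN_NAME>` | `<number>_adt` | Database audit log |
| WAL | `<instance data directory>/pg_xlog` | 24 hexadecimal digits, for example `00000001000000000000001C` | Content depends on the type of transaction recorded; the WAL is used for recovery after a system crash |
| Performance log | `$GAUSSLOG/gs_profile` | `gaussdb-%Y-%m-%d_%H%M%S.prf` | Runtime checks of physical resource status, plus performance measurements taken when the database accesses external resources such as disks |
| Installation log | `$GAUSSLOG/om` | `gs_install-%Y-%m-%d_%H%M%S.log` | Database cluster installation log |
| Health check log | `$GAUSSLOG/om` | `gs_check-%Y-%m-%d_%H%M%S.log` | Database health check log |

For the cm_server, cm_agent (including system_call), and om_monitor logs, a file whose name contains `current` is the log currently being written; files without it are historical logs.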
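
The file name formats in the table can be decoded mechanically, which helps when sorting a log directory by creation time or picking out the file that is still being written. The following is a minimal sketch, not a GaussDB tool: the function name `parse_log_name` and the regular expression are illustrative assumptions, and only names of the form `<prefix>-%Y-%m-%d_%H%M%S[-current].log` or `.prf` are handled.

```python
import re
from datetime import datetime

# Matches names such as cm_server-2025-01-09_093827-current.log
# or gaussdb-2025-01-09_093827.prf (pattern is an assumption for this sketch).
LOG_NAME = re.compile(
    r"^(?P<prefix>[A-Za-z_]+)-"
    r"(?P<stamp>\d{4}-\d{2}-\d{2}_\d{6})"
    r"(?P<current>-current)?\.(?:log|prf)$"
)

def parse_log_name(name):
    """Return (prefix, creation time, has_current_marker), or None if the
    name does not follow the timestamped format."""
    match = LOG_NAME.match(name)
    if match is None:
        return None
    created = datetime.strptime(match.group("stamp"), "%Y-%m-%d_%H%M%S")
    return match.group("prefix"), created, match.group("current") is not None

print(parse_log_name("cm_server-2025-01-09_093827-current.log"))
print(parse_log_name("system_call-current.log"))  # no timestamp, so None
```

The DN runtime log format in the table carries no `current` marker, so for `gaussdb-*.log` files the third field is always `False` and the newest timestamp is the better indicator of the active file.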

📖 Typical order for checking logs: cm_ctl/gs_guc/gs_ctl logs => om_monitor log => cm_agent log => cm_server log => database runtime log
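
The checking order can be turned into a list of directories to visit. This is only a sketch: the relative paths are taken from the log table, while the function name `log_dirs`, the example `$GAUSSLOG` value, and the DN directory name `dn_6001` are assumptions for illustration.

```python
import posixpath

# (log name, path relative to $GAUSSLOG), in the suggested checking order.
# "<DN_NAME>" is a placeholder for the real DN log directory name.
SEARCH_ORDER = [
    ("cm_ctl", "cm/cm_ctl"),
    ("gs_guc", "bin/gs_guc"),
    ("gs_ctl", "bin/gs_ctl"),
    ("om_monitor", "cm/om_monitor"),
    ("cm_agent", "cm/cm_agent"),
    ("cm_server", "cm/cm_server"),
    ("gaussdb", "gs_log/<DN_NAME>"),
]

def log_dirs(gausslog, dn_name):
    """Return (log name, absolute directory) pairs in checking order."""
    return [
        (name, posixpath.join(gausslog, rel.replace("<DN_NAME>", dn_name)))
        for name, rel in SEARCH_ORDER
    ]

# Hypothetical values; on a real node read $GAUSSLOG from the environment.
for name, path in log_dirs("/var/log/gaussdb/omm", "dn_6001"):
    print(f"{name:<11}{path}")
```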

Failure to reach the CM Server on another node, as recorded in the cm_agent log:

2025-01-09 09:38:21.181 tid=365425 AgentConnServerMain_2 ASYN ERROR: connect to cm server failed! The 1st of cm server node id is = 2, listenCount(1: 1).
2025-01-09 09:38:21.181 tid=365424 AgentConnServerMain_1 ASYN ERROR: 309: connect to cm_server failed, host=22.12.73.130 port=30200 localhost=22.12.73.132 connect_timeout=1 node_id=2 node_name=22.12.73.132 remote_type=7. could not connect to server:
        Is the server running on host "22.12.73.130" and accepting
        TCP/IP connections on port 30200?
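
When a cm_agent log is long, it can help to pull out only the connection failures together with the local and remote addresses. The sketch below works on the line format shown above; the function name `find_connect_failures` and the regular expression are assumptions written for this example, and other GaussDB versions may word the message differently.

```python
import re

# Matches the "309: connect to cm_server failed, host=... port=..." form.
CONNECT_FAILED = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) .*"
    r"connect to cm_server failed, host=(?P<host>\S+) port=(?P<port>\d+) "
    r"localhost=(?P<local>\S+)"
)

def find_connect_failures(lines):
    """Return (timestamp, local address, remote host, remote port) tuples."""
    failures = []
    for line in lines:
        match = CONNECT_FAILED.search(line)
        if match:
            failures.append((
                match.group("ts"),
                match.group("local"),
                match.group("host"),
                int(match.group("port")),
            ))
    return failures

sample = [
    "2025-01-09 09:38:21.181 tid=365425 AgentConnServerMain_2 ASYN ERROR: connect to cm server failed! The 1st of cm server node id is = 2, listenCount(1: 1).",
    "2025-01-09 09:38:21.181 tid=365424 AgentConnServerMain_1 ASYN ERROR: 309: connect to cm_server failed, host=22.12.73.130 port=30200 localhost=22.12.73.132 connect_timeout=1 node_id=2 node_name=22.12.73.132 remote_type=7. could not connect to server:",
]
print(find_connect_failures(sample))
```

Only the second sample line matches, because the first one reports the failure without the host and port fields.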

Promotion of the standby ETCD instance to primary, as recorded in the cm_server log:

2025-01-09 09:38:27.198 tid=3522231 ETCD_HA ASYN LOG: idx is 0, primary(1) ha heartbeat is 0.
2025-01-09 09:38:27.198 tid=3522231 ETCD_HA ASYN LOG: cmserver on node(1) is down, heartbeat_of_primary=0, and then choose to promte primary.
2025-01-09 09:38:27.198 tid=3522231 ETCD_HA ASYN LOG: g_pre_agent_conn_count=1, primaryNodeId is 1, curInstIdx is 1, g_cmRole is 2, g_delayTimeout is 13.
2025-01-09 09:38:27.198 tid=3522231 ETCD_HA ASYN LOG: idx=0, node=1, instId is 1, heartbeat=0, etcdHeartbeat=50895, primaryNodeId=1.
2025-01-09 09:38:27.198 tid=3522231 ETCD_HA ASYN LOG: idx=1, node=2, instId is 2, heartbeat=6, etcdHeartbeat=49822, primaryNodeId=1.
2025-01-09 09:38:27.198 tid=3522231 ETCD_HA ASYN LOG: minNodeId=2, nodeIndex=1, currentNode=2, role=2, minNodeIdForCmId=2.
2025-01-09 09:38:27.198 tid=3522231 ETCD_HA ASYN LOG: find the min cm id=2, the cm could be the best primary.
2025-01-09 09:38:27.199 tid=3522231 ETCD_HA ASYN LOG: cm_delay_arbitrate_time_out End: server_node_index = 1, g_pre_agent_conn_count=1, g_cmRole is 2.
2025-01-09 09:38:27.199 tid=3522231 ETCD_HA ASYN LOG: [CmSetPrimary2Etcd]: node(2) role is 2, ready to promote.
2025-01-09 09:38:27.203 tid=3522231 ETCD_HA ASYN LOG: [CmSetPrimary2Etcd]: node(2) last role is 2, promote to primary.
2025-01-09 09:38:27.203 tid=3522231 ETCD_HA ASYN LOG: [CmSetPrimary2Etcd]: pre_agent_count is 1, node(2) cm role is 2, to primary.

Promotion of the standby CMS instance to primary, as recorded in the cm_server log:

2025-01-09 09:38:29.070 tid=3522281 HA_MAIN ASYN LOG: current node is 2, change it's role to primary.
2025-01-09 09:38:29.071 tid=3522281 HA_MAIN ASYN LOG: cm_server_current_role is 1. cm_server_last_role is 2.
2025-01-09 09:38:29.071 tid=3522281 HA_MAIN ASYN LOG: Promoted to PRIMARY. Do variable reset and reload.
2025-01-09 09:38:29.071 tid=3522281 HA_MAIN ASYN LOG: cms change to primary, will clean finish redo time.
2025-01-09 09:38:29.071 tid=3522281 HA_MAIN ASYN LOG: cms change to primary, will clean cma fault time.
2025-01-09 09:38:29.071 tid=3522281 HA_MAIN ASYN LOG: cms change to primary, will clean switchover command.
2025-01-09 09:38:29.071 tid=3522281 HA_MAIN ASYN LOG: Setting arbitration_majority_reelection_timeout to 10.

Failover, as recorded in the cm_server log:

2025-01-09 09:38:34.062 tid=3522280 AGENT_WORKER ASYN LOG: [ChangeDnPrimaryMemberIndex]: line 54: instd(6002) instTypePur is (1: Primary), instTypeSor is (2: Standby), peerInstId is 0.
2025-01-09 09:38:34.062 tid=3522280 AGENT_WORKER ASYN LOG: [ChangeDnPrimaryMemberIndex]: 67: instance(6001) static role(Primary) will change to be Standby.
2025-01-09 09:38:34.062 tid=3522280 AGENT_WORKER ASYN LOG: [ChangeDnPrimaryMemberIndex]: 60: instance(6002) static role(Standby) will change to be Primary.
...
2025-01-09 09:38:34.069 tid=3522280 AGENT_WORKER ASYN LOG: [KeyEvent: KEY_EVENT_FAILOVER] [Instance: 6002] [Details: Failover message has sent to instance 6002, term 103, sendFailoverTimes is 0.]
2025-01-09 09:38:34.069 tid=3522280 AGENT_WORKER ASYN LOG: line 185: new arbitra node 0, instanceId 6001, static_role 2=Standby, local_dynamic_role 0=Unknown, local_term=0, local_redo_finished = 0, local_last_xlog_location=0/0, local_db_state 0=Unknown, local_sync_state=0, build_reason 0=Normal, double_restarting=0, group_term=103, sendFailoverTimes=0
2025-01-09 09:38:34.069 tid=3522280 AGENT_WORKER ASYN LOG: line 185: new arbitra node 1, instanceId 6002, static_role 1=Primary, local_dynamic_role 2=Standby, local_term=4, local_redo_finished = 1, local_last_xlog_location=0/44FA5010, local_db_state 2=Need repair, local_sync_state=0, build_reason 2=Disconnected, double_restarting=0, group_term=103, sendFailoverTimes=0
2025-01-09 09:38:34.069 tid=3522280 AGENT_WORKER ASYN LOG: line 185: new arbitra node 2, instanceId 6003, static_role 2=Standby, local_dynamic_role 2=Standby, local_term=4, local_redo_finished = 1, local_last_xlog_location=0/44FA5010, local_db_state 2=Need repair, local_sync_state=0, build_reason 2=Disconnected, double_restarting=0, group_term=103, sendFailoverTimes=0
2025-01-09 09:38:34.069 tid=3522280 AGENT_WORKER ASYN WARNING: this cluster has no coordinator, no need to notify cn.
2025-01-09 09:38:34.069 tid=3522280 AGENT_WORKER ASYN LOG: [SendFailoverQuarm], line 2139: Failover message has sent to instance 6002 in reduce standy condition(0), local promoting.
2025-01-09 09:38:34.069 tid=3522274 AGENT_IO ASYN LOG: cmserver send msg to node 2, msgtype: MSG_CM_AGENT_FAILOVER
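
The `[KeyEvent: ...]` entries are the quickest way to find failover and switchover decisions in a large cm_server log. The following sketch extracts them from lines in the format shown above; the function name `key_events` and the regular expression are assumptions for illustration, not part of any GaussDB tooling.

```python
import re

# Matches "[KeyEvent: NAME] [Instance: ID] [Details: text]" at the end of a line.
KEY_EVENT = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) .*?"
    r"\[KeyEvent: (?P<event>\w+)\] \[Instance: (?P<instance>\d+)\] "
    r"\[Details: (?P<details>.*)\]\s*$"
)

def key_events(lines):
    """Return (timestamp, event name, instance id, details) per key event."""
    events = []
    for line in lines:
        match = KEY_EVENT.match(line)
        if match:
            events.append((
                match.group("ts"),
                match.group("event"),
                int(match.group("instance")),
                match.group("details"),
            ))
    return events

sample = [
    "2025-01-09 09:38:34.069 tid=3522280 AGENT_WORKER ASYN LOG: [KeyEvent: KEY_EVENT_FAILOVER] [Instance: 6002] [Details: Failover message has sent to instance 6002, term 103, sendFailoverTimes is 0.]",
    "2025-01-09 09:38:34.069 tid=3522274 AGENT_IO ASYN LOG: cmserver send msg to node 2, msgtype: MSG_CM_AGENT_FAILOVER",
]
for event in key_events(sample):
    print(event)
```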

Heartbeat timeout of an instance on another node, as recorded in the cm_server log:

2025-01-09 09:38:38.435 tid=3522266 Monitor ASYN LOG: arbitration_majority_reelection_timeout elapsed into 0. Majority re-election enabled now.
2025-01-09 09:38:38.435 tid=3522266 Monitor ASYN LOG: instance(6001) heartbeat timeout, heartbeat:11, threshold:6
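
Every line in these logs starts with a millisecond timestamp, so the time between the first symptom and later events can be computed directly. The sketch below measures the delay from the first cm_agent connection error to three later cm_server events quoted in this section; the helper names `log_time` and `elapsed_seconds` are assumptions for this example.

```python
from datetime import datetime

def log_time(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS.mmm' timestamp of a log line."""
    return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S.%f")

def elapsed_seconds(first_line, later_line):
    """Seconds between the timestamps of two log lines."""
    return (log_time(later_line) - log_time(first_line)).total_seconds()

connect_error = "2025-01-09 09:38:21.181 tid=365425 AgentConnServerMain_2 ASYN ERROR: connect to cm server failed! The 1st of cm server node id is = 2, listenCount(1: 1)."
cms_promoted = "2025-01-09 09:38:29.071 tid=3522281 HA_MAIN ASYN LOG: Promoted to PRIMARY. Do variable reset and reload."
failover_sent = "2025-01-09 09:38:34.069 tid=3522274 AGENT_IO ASYN LOG: cmserver send msg to node 2, msgtype: MSG_CM_AGENT_FAILOVER"
heartbeat_timeout = "2025-01-09 09:38:38.435 tid=3522266 Monitor ASYN LOG: instance(6001) heartbeat timeout, heartbeat:11, threshold:6"

for label, line in [
    ("CMS promoted", cms_promoted),
    ("failover sent", failover_sent),
    ("heartbeat timeout logged", heartbeat_timeout),
]:
    print(f"{label}: {elapsed_seconds(connect_error, line):.3f} s")
# CMS promoted: 7.890 s
# failover sent: 12.888 s
# heartbeat timeout logged: 17.254 s
```

The cm_agent and cm_server lines may come from different nodes, so these intervals are only as accurate as the clock synchronization between the hosts.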

CM Agent and ETCD startup after the failed node recovers, as recorded in the om_monitor log:

2025-01-09 09:39:02.288 tid=2826  LOG: The CM Agent startup check is complete: cluster_manual_start=0, agent_config_file_r=1, agent_binary_file_x=1, config_change_flag=0, previous_status=0, start_count=0.
2025-01-09 09:39:02.288 tid=2826  LOG: cm_agent start, pid is 2827
[cm_agent]: cmserverNum is 3, and cmserver info is [0 node:1, cmserverId:1, cmServerIndex:0], [1 node:2, cmserverId:2, cmServerIndex:1], [2 node:3, cmserverId:3, cmServerIndex:2], .
2025-01-09 09:39:02.383 tid=2826  LOG: run check etcd log-outputs command: /gauss/app/cluster/core/app/bin/etcd --help | grep "\-\-log-outputs" success
2025-01-09 09:39:02.384 tid=2826  LOG: run etcd command: umask=`umask`;umask 0077;/gauss/app/cluster/core/app/bin/etcd  -name ...

Switchover, as recorded in the cm_server log on the primary node:

2025-01-09 14:24:31.568 tid=3522276 CTL_WORKER ASYN LOG: auto switchover instanceid=6001, wait_seconds=120.
2025-01-09 14:24:31.568 tid=3522276 CTL_WORKER ASYN LOG: [SetSwitchoverInSwitchoverProcess], instd(6001) localRole is (2: Standby), cmd[cmdPur(1: Primary), cmdSour(2: Standby), cmdPur(0: Unknown), peerIdx: 6002] timeout is 120, delayTime is 0.
2025-01-09 14:24:31.568 tid=3522276 CTL_WORKER ASYN LOG: dn instanceid=6003, primaryFlushLocation=00000000/45BDE228, standbyReplayLocation=00000000/45BDE228, curGap=00000000/00000000, maxGap=00000001/2C000000.
2025-01-09 14:24:31.568 tid=3522276 CTL_WORKER ASYN LOG: dn instanceid=6001, primaryFlushLocation=00000000/45BDE228, standbyReplayLocation=00000000/45BDE228, curGap=00000000/00000000, maxGap=00000001/2C000000.
2025-01-09 14:24:31.568 tid=3522276 CTL_WORKER ASYN LOG: [SwitchoverDone]: inst(6001) is doing switchover.
2025-01-09 14:24:31.568 tid=3522276 CTL_WORKER ASYN LOG: the balance state is 3 by DN.
2025-01-09 14:24:31.951 tid=3522280 AGENT_WORKER ASYN LOG: instId(6002) may be doing switchover, switchoverInstId is 6001.
2025-01-09 14:24:31.951 tid=3522280 AGENT_WORKER ASYN LOG: [Primary], 6002: another instance (6001) is doing[3/11], pendStatus count is 1, cannot to do arbitrate.
2025-01-09 14:24:32.008 tid=3522279 AGENT_WORKER ASYN LOG: send switchover to instance(6001) for [1/4] times.
2025-01-09 14:24:32.008 tid=3522279 AGENT_WORKER ASYN LOG: [KeyEvent: KEY_EVENT_SWITCHOVER] [Instance: 6001] [Details: send switchover message, node=1, instance=6001]
2025-01-09 14:24:32.008 tid=3522273 AGENT_IO ASYN LOG: cmserver send msg to node 1, msgtype: MSG_CM_AGENT_SWITCHOVER
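
The `primaryFlushLocation`, `standbyReplayLocation`, `curGap`, and `maxGap` values are written as two hexadecimal numbers separated by a slash. Assuming they follow the PostgreSQL convention, where the part before the slash is the high 32 bits and the part after it is the low 32 bits of a 64-bit log position, the replay gap can be computed in bytes. The function names below are illustrative.

```python
def parse_lsn(text):
    """Convert 'HHHHHHHH/LLLLLLLL' (hex high/low halves) to a byte position.

    Assumes the PostgreSQL-style layout: high 32 bits before the slash.
    """
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)

def replay_gap(primary_flush, standby_replay):
    """Bytes of WAL the standby still has to replay."""
    return parse_lsn(primary_flush) - parse_lsn(standby_replay)

# Values from the switchover log lines for instance 6001.
print(replay_gap("00000000/45BDE228", "00000000/45BDE228"))  # 0
print(parse_lsn("00000001/2C000000"))  # 5033164800
```

Under this reading, the logged `curGap` of `00000000/00000000` means the standby had replayed everything the primary had flushed when the switchover started, and the logged `maxGap` corresponds to 5033164800 bytes. The shorter form used in the failover log, such as `0/44FA5010`, parses the same way.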

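Views

pg_stat_activity: shows the state of each session on the current instance.
pg_thread_wait_status: shows the wait events of each thread on the current instance.
pg_locks: shows the lock state on the current instance.

Examples: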

select datname,sessionid,usename,application_name,client_addr,client_hostname,state,query_id,query from pg_stat_activity;

select * from pg_thread_wait_status where wait_status<>'none';

select locktype,database,relation,virtualxid,transactionid,objid,sessionid,mode,granted from pg_locks;

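WDR reports

WDR reports can be generated to analyze the state of the database at the database level and at the node level.

-- List the snapshots that already exist
select * from snapshot.snapshot;

-- Create a snapshot manually
select create_wdr_snapshot();

-- Generate a node-level report
-- Check the current node name first
select * from pg_node_env;
-- Switch gsql to unaligned, tuples-only output and redirect it to the report file
\a
\t
\o /home/omm/wdr_20241122_node.html
-- 504 and 505 are example begin and end snapshot IDs taken from snapshot.snapshot
select generate_wdr_report(504,505,'all','node','dn_6001');
-- Stop redirecting and restore the default output format
\o
\a
\t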

Error codes

📖 For the meaning of GaussDB error codes, see: https://support.huaweicloud.com/errorcode-dws/dws_08_0003.html

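Core files

The core file produced when the database crashes is essential for locating where and why the program crashed. If a process core dumps at runtime, collect the core file immediately.

🕷 Enabling core files has some performance cost, and the cost grows when processes fail frequently.

Check whether core dump generation is enabled: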

gs_guc check -Z datanode -N all -I all -c "enable_bbox_dump"

Enable core dump generation:

gs_guc set -Z datanode -N all -I all -c "enable_bbox_dump=on"

# Configure the directory where core files are written
mkdir /gauss/corefiles
chmod 750 /gauss/corefiles
gs_guc set -Z datanode -N all -I all -c "bbox_dump_path='/gauss/corefiles'"

# Limit the number of core files kept, so that they do not fill the disk
gs_guc set -Z datanode -N all -I all -c "bbox_dump_count=4"

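ffic logs

GaussDB processes can fail at runtime for a variety of unexpected reasons. Where strict RTO requirements rule out enabling bbox (generating a core with bbox typically blocks the process for minutes), ffic can be used to collect key information from just before the crash for fault diagnosis.

Check whether ffic is enabled: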

gs_guc check -Z datanode -N all -I all -c "enable_ffic_log"

Enable ffic logging:

gs_guc set -Z datanode -N all -I all -c "enable_ffic_log=on"

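By default, ffic logs are written to $GAUSSLOG/ffic.

An ffic log contains the following information:

The signal that caused the database process to fail;
The call stack of the failed thread;
The unique SQL ID of the SQL statement that triggered the failure;
CPU register contents at the time of the failure;
The PC (program counter) of the failed thread;
The memory mapping at the time of the failure;
Database parameter settings.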