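GaussDB Fault Diagnosis Methods
Common GaussDB fault diagnosis tools include logs (system logs, operation logs, and health check logs), system views, WDR reports, error codes, core files, and ffic logs.
Logs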
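Log type | Log path | File name format | Log content |
---|---|---|---|
System log | $GAUSSLOG/gs_log/<DN_NAME> | gaussdb-%Y-%m-%d_%H%M%S.log | DN node log (database run log), including checkpoint and autovacuum information |
System log | $GAUSSLOG/cm/cm_server | cm_server-%Y-%m-%d_%H%M%S-current.log | cm_server run log, covering cluster arbitration, primary/standby switchover, and similar events. Files with the current marker are the active log; files without it are historical logs. |
System log | $GAUSSLOG/cm/cm_agent | cm_agent-%Y-%m-%d_%H%M%S-current.log | cm_agent log, recording the liveness of the DN and CMS components and the gaussdb process, system checks, log compression, and similar information. Files with the current marker are the active log; files without it are historical logs. |
System log | $GAUSSLOG/cm/cm_agent | system_call-current.log | cm_agent log, recording messages about arbitration commands issued by CM Server. Files with the current marker are the active log; files without it are historical logs. |
System log | $GAUSSLOG/cm/om_monitor | om_monitor-%Y-%m-%d_%H%M%S-current.log | om_monitor log, recording the liveness of etcd and CM Agent. Files with the current marker are the active log; files without it are historical logs. |
System log | $GAUSSLOG/cm/etcd | etcd_-current.log | ETCD cluster run log |
Operation log | $GAUSSLOG/cm/cm_ctl | cm_ctl-%Y-%m-%d_%H%M%S-current.log | Commands issued through cm_ctl and their results |
Operation log | $GAUSSLOG/bin/gs_guc | gs_guc-%Y-%m-%d_%H%M%S-current.log | Commands issued through gs_guc and their results |
Operation log | $GAUSSLOG/bin/gs_ctl | gs_ctl-%Y-%m-%d_%H%M%S-current.log | Commands issued through gs_ctl and their results |
Audit log | $GAUSSLOG/pg_audit/<DN_NAME> | <number>_adt | Database audit log |
WAL log | <instance data directory>/pg_xlog | 24 hexadecimal characters, for example 00000001000000000000001C | WAL content depends on the type of transaction recorded; after a system crash, the WAL can be used for recovery |
Performance log | $GAUSSLOG/gs_profile | gaussdb-%Y-%m-%d_%H%M%S.prf | Runtime checks on physical resources: performance measurements taken when the database accesses external resources such as disks |
Installation log | $GAUSSLOG/om | gs_install-%Y-%m-%d_%H%M%S.log | Database cluster installation log |
Health check log | $GAUSSLOG/om | gs_check-%Y-%m-%d_%H%M%S.log | Database health check log |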
📖 Typical log review order: cm_ctl/gs_guc/gs_ctl logs => om_monitor log => cm_agent log => cm_server log => database run log
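The review order above can be sketched as a small shell helper that prints the newest .log file in each directory, in that order. The directory layout follows the table; the function name latest_logs_in_order is hypothetical, and the DN log directory name is passed as an argument because it depends on the deployment.

```shell
# Print the newest .log file from each directory, following the review order:
# operation logs -> om_monitor -> cm_agent -> cm_server -> database run log.
# Usage: latest_logs_in_order <GAUSSLOG directory> <DN_NAME>
latest_logs_in_order() {
    local root="$1" dn="$2" dir newest
    for dir in cm/cm_ctl bin/gs_guc bin/gs_ctl cm/om_monitor cm/cm_agent cm/cm_server "gs_log/$dn"; do
        [ -d "$root/$dir" ] || continue    # skip directories that do not exist
        # ls -t sorts by modification time, newest first
        newest=$(ls -t "$root/$dir"/*.log 2>/dev/null | head -n 1)
        [ -n "$newest" ] && printf '%s\n' "$newest"
    done
    return 0
}
```

A call such as `latest_logs_in_order "$GAUSSLOG" dn_6001` gives a reading list for one node; dn_6001 is only an example directory name.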
Failed connections to the CM Server on another node, as recorded in the cm_agent log:
bash
2025-01-09 09:38:21.181 tid=365425 AgentConnServerMain_2 ASYN ERROR: connect to cm server failed! The 1st of cm server node id is = 2, listenCount(1: 1).
2025-01-09 09:38:21.181 tid=365424 AgentConnServerMain_1 ASYN ERROR: 309: connect to cm_server failed, host=22.12.73.130 port=30200 localhost=22.12.73.132 connect_timeout=1 node_id=2 node_name=22.12.73.132 remote_type=7. could not connect to server:
Is the server running on host "22.12.73.130" and accepting
TCP/IP connections on port 30200?
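When the cm_agent log is long, it can help to summarize which CM Server endpoints are unreachable. The following is a minimal sketch that assumes the line format shown in the excerpt above ("connect to cm_server failed, host=... port=..."); the function name cms_connect_failures is hypothetical.

```shell
# Count failed cm_server connection attempts per target host:port.
# Reads cm_agent log text on stdin and prints "<host>:<port> <count>" per endpoint.
cms_connect_failures() {
    sed -n 's/.*connect to cm_server failed, host=\([^ ]*\) port=\([0-9][0-9]*\).*/\1:\2/p' \
        | sort | uniq -c | awk '{print $2, $1}'
}
```

It is meant to be fed a log file, for example `cms_connect_failures < cm_agent-...-current.log`, where the file name is whichever cm_agent log is under review.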
Promotion of the standby ETCD instance to primary, as recorded in the cm_server log:
bash
2025-01-09 09:38:27.198 tid=3522231 ETCD_HA ASYN LOG: idx is 0, primary(1) ha heartbeat is 0.
2025-01-09 09:38:27.198 tid=3522231 ETCD_HA ASYN LOG: cmserver on node(1) is down, heartbeat_of_primary=0, and then choose to promte primary.
2025-01-09 09:38:27.198 tid=3522231 ETCD_HA ASYN LOG: g_pre_agent_conn_count=1, primaryNodeId is 1, curInstIdx is 1, g_cmRole is 2, g_delayTimeout is 13.
2025-01-09 09:38:27.198 tid=3522231 ETCD_HA ASYN LOG: idx=0, node=1, instId is 1, heartbeat=0, etcdHeartbeat=50895, primaryNodeId=1.
2025-01-09 09:38:27.198 tid=3522231 ETCD_HA ASYN LOG: idx=1, node=2, instId is 2, heartbeat=6, etcdHeartbeat=49822, primaryNodeId=1.
2025-01-09 09:38:27.198 tid=3522231 ETCD_HA ASYN LOG: minNodeId=2, nodeIndex=1, currentNode=2, role=2, minNodeIdForCmId=2.
2025-01-09 09:38:27.198 tid=3522231 ETCD_HA ASYN LOG: find the min cm id=2, the cm could be the best primary.
2025-01-09 09:38:27.199 tid=3522231 ETCD_HA ASYN LOG: cm_delay_arbitrate_time_out End: server_node_index = 1, g_pre_agent_conn_count=1, g_cmRole is 2.
2025-01-09 09:38:27.199 tid=3522231 ETCD_HA ASYN LOG: [CmSetPrimary2Etcd]: node(2) role is 2, ready to promote.
2025-01-09 09:38:27.203 tid=3522231 ETCD_HA ASYN LOG: [CmSetPrimary2Etcd]: node(2) last role is 2, promote to primary.
2025-01-09 09:38:27.203 tid=3522231 ETCD_HA ASYN LOG: [CmSetPrimary2Etcd]: pre_agent_count is 1, node(2) cm role is 2, to primary.
Promotion of the standby CMS instance to primary, as recorded in the cm_server log:
bash
2025-01-09 09:38:29.070 tid=3522281 HA_MAIN ASYN LOG: current node is 2, change it's role to primary.
2025-01-09 09:38:29.071 tid=3522281 HA_MAIN ASYN LOG: cm_server_current_role is 1. cm_server_last_role is 2.
2025-01-09 09:38:29.071 tid=3522281 HA_MAIN ASYN LOG: Promoted to PRIMARY. Do variable reset and reload.
2025-01-09 09:38:29.071 tid=3522281 HA_MAIN ASYN LOG: cms change to primary, will clean finish redo time.
2025-01-09 09:38:29.071 tid=3522281 HA_MAIN ASYN LOG: cms change to primary, will clean cma fault time.
2025-01-09 09:38:29.071 tid=3522281 HA_MAIN ASYN LOG: cms change to primary, will clean switchover command.
2025-01-09 09:38:29.071 tid=3522281 HA_MAIN ASYN LOG: Setting arbitration_majority_reelection_timeout to 10.
Failover, as recorded in the cm_server log:
bash
2025-01-09 09:38:34.062 tid=3522280 AGENT_WORKER ASYN LOG: [ChangeDnPrimaryMemberIndex]: line 54: instd(6002) instTypePur is (1: Primary), instTypeSor is (2: Standby), peerInstId is 0.
2025-01-09 09:38:34.062 tid=3522280 AGENT_WORKER ASYN LOG: [ChangeDnPrimaryMemberIndex]: 67: instance(6001) static role(Primary) will change to be Standby.
2025-01-09 09:38:34.062 tid=3522280 AGENT_WORKER ASYN LOG: [ChangeDnPrimaryMemberIndex]: 60: instance(6002) static role(Standby) will change to be Primary.
...
2025-01-09 09:38:34.069 tid=3522280 AGENT_WORKER ASYN LOG: [KeyEvent: KEY_EVENT_FAILOVER] [Instance: 6002] [Details: Failover message has sent to instance 6002, term 103, sendFailoverTimes is 0.]
2025-01-09 09:38:34.069 tid=3522280 AGENT_WORKER ASYN LOG: line 185: new arbitra node 0, instanceId 6001, static_role 2=Standby, local_dynamic_role 0=Unknown, local_term=0, local_redo_finished = 0, local_last_xlog_location=0/0, local_db_state 0=Unknown, local_sync_state=0, build_reason 0=Normal, double_restarting=0, group_term=103, sendFailoverTimes=0
2025-01-09 09:38:34.069 tid=3522280 AGENT_WORKER ASYN LOG: line 185: new arbitra node 1, instanceId 6002, static_role 1=Primary, local_dynamic_role 2=Standby, local_term=4, local_redo_finished = 1, local_last_xlog_location=0/44FA5010, local_db_state 2=Need repair, local_sync_state=0, build_reason 2=Disconnected, double_restarting=0, group_term=103, sendFailoverTimes=0
2025-01-09 09:38:34.069 tid=3522280 AGENT_WORKER ASYN LOG: line 185: new arbitra node 2, instanceId 6003, static_role 2=Standby, local_dynamic_role 2=Standby, local_term=4, local_redo_finished = 1, local_last_xlog_location=0/44FA5010, local_db_state 2=Need repair, local_sync_state=0, build_reason 2=Disconnected, double_restarting=0, group_term=103, sendFailoverTimes=0
2025-01-09 09:38:34.069 tid=3522280 AGENT_WORKER ASYN WARNING: this cluster has no coordinator, no need to notify cn.
2025-01-09 09:38:34.069 tid=3522280 AGENT_WORKER ASYN LOG: [SendFailoverQuarm], line 2139: Failover message has sent to instance 6002 in reduce standy condition(0), local promoting.
2025-01-09 09:38:34.069 tid=3522274 AGENT_IO ASYN LOG: cmserver send msg to node 2, msgtype: MSG_CM_AGENT_FAILOVER
Heartbeat timeouts of instances on other nodes, as recorded in the cm_server log:
bash
2025-01-09 09:38:38.435 tid=3522266 Monitor ASYN LOG: arbitration_majority_reelection_timeout elapsed into 0. Majority re-election enabled now.
2025-01-09 09:38:38.435 tid=3522266 Monitor ASYN LOG: instance(6001) heartbeat timeout, heartbeat:11, threshold:6
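To see which instances timed out and by how much, the heartbeat lines can be reduced to a compact list. This sketch assumes the exact wording of the "heartbeat timeout" line in the excerpt above; the function name heartbeat_timeouts is hypothetical.

```shell
# List heartbeat timeouts from cm_server log text on stdin.
# Prints: <date> <time> instance=<id> heartbeat=<value> threshold=<value>
heartbeat_timeouts() {
    sed -n 's/^\([0-9-]* [0-9:.]*\) .*instance(\([0-9]*\)) heartbeat timeout, heartbeat:\([0-9]*\), threshold:\([0-9]*\).*/\1 instance=\2 heartbeat=\3 threshold=\4/p'
}
```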
Startup of CM Agent and ETCD after the failed node recovers, as recorded in the om_monitor log:
bash
2025-01-09 09:39:02.288 tid=2826 LOG: The CM Agent startup check is complete: cluster_manual_start=0, agent_config_file_r=1, agent_binary_file_x=1, config_change_flag=0, previous_status=0, start_count=0.
2025-01-09 09:39:02.288 tid=2826 LOG: cm_agent start, pid is 2827
[cm_agent]: cmserverNum is 3, and cmserver info is [0 node:1, cmserverId:1, cmServerIndex:0], [1 node:2, cmserverId:2, cmServerIndex:1], [2 node:3, cmserverId:3, cmServerIndex:2], .
2025-01-09 09:39:02.383 tid=2826 LOG: run check etcd log-outputs command: /gauss/app/cluster/core/app/bin/etcd --help | grep "\-\-log-outputs" success
2025-01-09 09:39:02.384 tid=2826 LOG: run etcd command: umask=`umask`;umask 0077;/gauss/app/cluster/core/app/bin/etcd -name ...
Switchover, as recorded in the cm_server log on the primary:
bash
2025-01-09 14:24:31.568 tid=3522276 CTL_WORKER ASYN LOG: auto switchover instanceid=6001, wait_seconds=120.
2025-01-09 14:24:31.568 tid=3522276 CTL_WORKER ASYN LOG: [SetSwitchoverInSwitchoverProcess], instd(6001) localRole is (2: Standby), cmd[cmdPur(1: Primary), cmdSour(2: Standby), cmdPur(0: Unknown), peerIdx: 6002] timeout is 120, delayTime is 0.
2025-01-09 14:24:31.568 tid=3522276 CTL_WORKER ASYN LOG: dn instanceid=6003, primaryFlushLocation=00000000/45BDE228, standbyReplayLocation=00000000/45BDE228, curGap=00000000/00000000, maxGap=00000001/2C000000.
2025-01-09 14:24:31.568 tid=3522276 CTL_WORKER ASYN LOG: dn instanceid=6001, primaryFlushLocation=00000000/45BDE228, standbyReplayLocation=00000000/45BDE228, curGap=00000000/00000000, maxGap=00000001/2C000000.
2025-01-09 14:24:31.568 tid=3522276 CTL_WORKER ASYN LOG: [SwitchoverDone]: inst(6001) is doing switchover.
2025-01-09 14:24:31.568 tid=3522276 CTL_WORKER ASYN LOG: the balance state is 3 by DN.
2025-01-09 14:24:31.951 tid=3522280 AGENT_WORKER ASYN LOG: instId(6002) may be doing switchover, switchoverInstId is 6001.
2025-01-09 14:24:31.951 tid=3522280 AGENT_WORKER ASYN LOG: [Primary], 6002: another instance (6001) is doing[3/11], pendStatus count is 1, cannot to do arbitrate.
2025-01-09 14:24:32.008 tid=3522279 AGENT_WORKER ASYN LOG: send switchover to instance(6001) for [1/4] times.
2025-01-09 14:24:32.008 tid=3522279 AGENT_WORKER ASYN LOG: [KeyEvent: KEY_EVENT_SWITCHOVER] [Instance: 6001] [Details: send switchover message, node=1, instance=6001]
2025-01-09 14:24:32.008 tid=3522273 AGENT_IO ASYN LOG: cmserver send msg to node 1, msgtype: MSG_CM_AGENT_SWITCHOVER
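Both the failover and switchover excerpts contain a [KeyEvent: ...] [Instance: ...] marker, which makes it possible to pull a short timeline out of a large cm_server log. The sketch below assumes that marker format; the function name key_events is hypothetical.

```shell
# Extract a timeline of key events (failover, switchover, ...) from cm_server log text on stdin.
# Prints: <date> <time> <event name> <instance id>
key_events() {
    sed -n 's/^\([0-9-]* [0-9:.]*\) .*\[KeyEvent: \([A-Z_]*\)] \[Instance: \([0-9]*\)].*/\1 \2 \3/p'
}
```

Lines without a KeyEvent marker are dropped, so the output contains only the moments when CM Server sent a failover or switchover message.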
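Views
- pg_stat_activity: shows the state of each session on the current instance.
- pg_thread_wait_status: shows the wait events of each thread on the current instance.
- pg_locks: shows the lock state on the current instance.
Example: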
sql
select datname,sessionid,usename,application_name,client_addr,client_hostname,state,query_id,query from pg_stat_activity;
select * from pg_thread_wait_status where wait_status<>'none';
select locktype,database,relation,virtualxid,transactionid,objid,sessionid,mode,granted from pg_locks;
WDR reports
A WDR report can be generated to analyze status at the database level and at the node level.
sql
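-- List the snapshots that have already been generated
select * from snapshot.snapshot;
-- Create a snapshot manually
select create_wdr_snapshot();
-- Generate a node-level report
select * from pg_node_env; -- check the current node name
-- Switch to unaligned, tuples-only output so the file contains only the report HTML
\a
\t
\o /home/omm/wdr_20241122_node.html
select generate_wdr_report(504,505,'all','node','dn_6001');
-- Restore terminal output and the default display format
\o
\a
\t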
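If reports are generated regularly, the gsql commands can be produced by a small helper instead of being typed by hand. This is a sketch under assumptions: the function name wdr_report_script is hypothetical, the snapshot IDs must be taken from snapshot.snapshot, and the \a and \t toggles are included on the assumption that the report file should contain only the HTML.

```shell
# Print a gsql script that writes a node-level WDR report to a file.
# Usage: wdr_report_script <begin_snap_id> <end_snap_id> <node_name> <output_file>
wdr_report_script() {
    local begin="$1" end="$2" node="$3" out="$4" id
    for id in "$begin" "$end"; do
        case "$id" in
            ''|*[!0-9]*) echo "snapshot IDs must be non-negative integers" >&2; return 1 ;;
        esac
    done
    # The begin snapshot must precede the end snapshot
    [ "$begin" -lt "$end" ] || { echo "begin snapshot must be smaller than end snapshot" >&2; return 1; }
    # printf '%s\n' prints each argument literally, so the backslash commands stay intact
    printf '%s\n' \
        '\a' \
        '\t' \
        "\\o $out" \
        "select generate_wdr_report($begin,$end,'all','node','$node');" \
        '\o' \
        '\a' \
        '\t'
}
```

The output is intended to be piped into a gsql session connected to the target database, with the connection options chosen for the environment.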
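Error codes
📖 For the meaning of GaussDB error codes, see: https://support.huaweicloud.com/errorcode-dws/dws_08_0003.html
Core files
The core file produced when the database crashes is essential for finding the cause and location of the crash. If a process core dumps while running, collect the core file immediately.
🕷 Enabling core files has some performance impact, and the impact grows when processes fail frequently.
Check whether core dumps are enabled: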
bash
gs_guc check -Z datanode -N all -I all -c "enable_bbox_dump"
Enable core dumps:
bash
gs_guc set -Z datanode -N all -I all -c "enable_bbox_dump=on"
# Set the directory where core files are written
mkdir -p /gauss/corefiles
chmod 750 /gauss/corefiles
gs_guc set -Z datanode -N all -I all -c "bbox_dump_path='/gauss/corefiles'"
# Limit the number of core files kept, so they cannot fill the disk
gs_guc set -Z datanode -N all -I all -c "bbox_dump_count=4"
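ffic logs
GaussDB processes can fail at runtime for many unexpected reasons. Where strict RTO requirements rule out enabling bbox (generating a core with bbox typically blocks the process for minutes), ffic can be used to capture key information from just before the crash for fault diagnosis.
Check whether ffic is enabled: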
bash
gs_guc check -Z datanode -N all -I all -c "enable_ffic_log"
Enable ffic logs:
bash
gs_guc set -Z datanode -N all -I all -c "enable_ffic_log=on"
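By default, ffic logs are stored in $GAUSSLOG/ffic.
An ffic log contains the following information:
- the signal that caused the database process to fail;
- the call stack of the failed thread;
- the unique SQL ID of the SQL statement that triggered the failure;
- CPU register contents at the time of the failure;
- the PC (program counter) of the failed thread;
- the memory map at the time of the failure;
- the database parameter configuration.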