GaussDB Database Fault Location Methods

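Common tools for locating GaussDB database faults include logs (system logs, operation logs, health check logs), system views, WDR reports, error codes, core files, and FFIC logs.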

Logs

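| Log type | Log path | File name format | Log contents |
|---|---|---|---|
| System log | $GAUSSLOG/gs_log/<DN_NAME> | gaussdb-%Y-%m-%d_%H%M%S.log | DN node log: the database runtime log, including checkpoint and autovacuum information |
| System log | $GAUSSLOG/cm/cm_server | cm_server-%Y-%m-%d_%H%M%S-current.log | cm_server runtime log, including cluster arbitration and primary/standby switchover information |
| System log | $GAUSSLOG/cm/cm_agent | cm_agent-%Y-%m-%d_%H%M%S-current.log | cm_agent log: liveness of the DN and CMS components and the gaussdb process, system checks, log compression, and similar information |
| System log | $GAUSSLOG/cm/cm_agent | system_call-current.log | cm_agent log: messages about the arbitration commands issued by CM Server |
| System log | $GAUSSLOG/cm/om_monitor | om_monitor-%Y-%m-%d_%H%M%S-current.log | om_monitor log: liveness of etcd and CM Agent |
| System log | $GAUSSLOG/cm/etcd | etcd_-current.log | ETCD cluster runtime log |
| Operation log | $GAUSSLOG/cm/cm_ctl | cm_ctl-%Y-%m-%d_%H%M%S-current.log | cm_ctl commands issued and their results |
| Operation log | $GAUSSLOG/bin/gs_guc | gs_guc-%Y-%m-%d_%H%M%S-current.log | gs_guc commands issued and their results |
| Operation log | $GAUSSLOG/bin/gs_ctl | gs_ctl-%Y-%m-%d_%H%M%S-current.log | gs_ctl commands issued and their results |
| Audit log | $GAUSSLOG/pg_audit/<DN_NAME> | <number>_adt | Database audit log |
| WAL log | <instance data directory>/pg_xlog | 24 hexadecimal digits, for example 00000001000000000000001C | Content depends on the type of transaction recorded; the WAL is used for recovery after a system crash |
| Performance log | $GAUSSLOG/gs_profile | gaussdb-%Y-%m-%d_%H%M%S.prf | Runtime checks of physical resource state, that is, performance measurements taken when the database accesses external resources such as disks |
| Installation log | $GAUSSLOG/om | gs_install-%Y-%m-%d_%H%M%S.log | Database cluster installation log |
| Health check log | $GAUSSLOG/om | gs_check-%Y-%m-%d_%H%M%S.log | Database health check log |

For the cm_server, cm_agent, system_call, and om_monitor logs, a file whose name contains "current" is the log currently being written; files without it are historical logs.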
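The 24-digit WAL file name in the table can be split into fields. The sketch below assumes GaussDB keeps the PostgreSQL naming convention (8 hex digits each for timeline ID, log ID, and segment number); the function name `parse_wal_name` is made up for illustration.

```python
def parse_wal_name(name: str) -> dict:
    """Split a 24-hex-digit WAL segment name into three 8-digit fields.

    Assumes the PostgreSQL layout: timeline ID, log ID, segment number.
    """
    if len(name) != 24:
        raise ValueError(f"expected 24 hex digits, got {len(name)} characters")
    # int(..., 16) raises ValueError if a field is not valid hexadecimal.
    return {
        "timeline": int(name[0:8], 16),
        "log_id": int(name[8:16], 16),
        "segment": int(name[16:24], 16),
    }


info = parse_wal_name("00000001000000000000001C")
print(info)  # {'timeline': 1, 'log_id': 0, 'segment': 28}
```

Comparing the decoded segment numbers of the oldest and newest files in pg_xlog gives a quick sense of how much WAL has accumulated.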

📖 Typical order for checking logs: cm_ctl/gs_guc/gs_ctl logs => om_monitor log => cm_agent log => cm_server log => database runtime log
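The checking order above can be scripted. The following is a minimal sketch that walks the directories from the table in that order and returns the most recently modified `.log` file in each; the function name `newest_logs` and the `per_dir` parameter are illustrative, and the directory list simply mirrors the paths given in this article.

```python
import os
from pathlib import Path

# Directories relative to $GAUSSLOG, in the checking order described above.
TRIAGE_DIRS = [
    "cm/cm_ctl",
    "bin/gs_guc",
    "bin/gs_ctl",
    "cm/om_monitor",
    "cm/cm_agent",
    "cm/cm_server",
    "gs_log",  # contains one subdirectory per DN, so it is searched recursively
]


def newest_logs(gausslog: str, per_dir: int = 1) -> list:
    """Return the newest *.log files of each directory, in checking order."""
    root = Path(gausslog)
    found = []
    for rel in TRIAGE_DIRS:
        directory = root / rel
        if not directory.is_dir():
            continue  # component not deployed on this node
        files = [p for p in directory.rglob("*.log") if p.is_file()]
        files.sort(key=lambda p: p.stat().st_mtime, reverse=True)
        found.extend(files[:per_dir])
    return found


if __name__ == "__main__":
    for path in newest_logs(os.environ.get("GAUSSLOG", ".")):
        print(path)
```

Sorting by modification time instead of parsing the timestamp in the file name keeps the sketch independent of the exact naming pattern.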

Failures to reach the CM Server on another node, as recorded in the cm_agent log:
2025-01-09 09:38:21.181 tid=365425 AgentConnServerMain_2 ASYN ERROR: connect to cm server failed! The 1st of cm server node id is = 2, listenCount(1: 1).
2025-01-09 09:38:21.181 tid=365424 AgentConnServerMain_1 ASYN ERROR: 309: connect to cm_server failed, host=22.12.73.130 port=30200 localhost=22.12.73.132 connect_timeout=1 node_id=2 node_name=22.12.73.132 remote_type=7. could not connect to server:
        Is the server running on host "22.12.73.130" and accepting
        TCP/IP connections on port 30200?
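The CM log excerpts in this article share one line layout: timestamp, `tid=`, an optional thread name followed by `ASYN`, a level, and the message. A minimal parser sketch based only on the lines shown here follows; the regular expression is inferred from these samples, not from a documented format, and `parse_cm_line` is a made-up name.

```python
import re
from typing import NamedTuple, Optional


class CmLogLine(NamedTuple):
    timestamp: str
    tid: int
    thread: Optional[str]  # None for lines without a thread name (om_monitor)
    level: str
    message: str


# Layout inferred from the sample lines in this article (an assumption).
_LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"tid=(?P<tid>\d+)\s+"
    r"(?:(?P<thread>\S+) ASYN )?"
    r"(?P<level>[A-Z]+): (?P<msg>.*)$"
)


def parse_cm_line(line: str) -> Optional[CmLogLine]:
    """Parse one log line; return None for continuation or unknown lines."""
    m = _LINE_RE.match(line.strip())
    if m is None:
        return None
    return CmLogLine(m["ts"], int(m["tid"]), m["thread"], m["level"], m["msg"])


sample = (
    "2025-01-09 09:38:21.181 tid=365425 AgentConnServerMain_2 ASYN ERROR: "
    "connect to cm server failed! The 1st of cm server node id is = 2, "
    "listenCount(1: 1)."
)
entry = parse_cm_line(sample)
print(entry.timestamp, entry.level, entry.thread)
```

Filtering parsed lines on `level == "ERROR"` is then a one-line list comprehension; continuation lines such as the indented "Is the server running..." text return `None` and can be appended to the previous entry if needed.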

Promotion of the standby ETCD instance to primary, as recorded in the cm_server log:
2025-01-09 09:38:27.198 tid=3522231 ETCD_HA ASYN LOG: idx is 0, primary(1) ha heartbeat is 0.
2025-01-09 09:38:27.198 tid=3522231 ETCD_HA ASYN LOG: cmserver on node(1) is down, heartbeat_of_primary=0, and then choose to promte primary.
2025-01-09 09:38:27.198 tid=3522231 ETCD_HA ASYN LOG: g_pre_agent_conn_count=1, primaryNodeId is 1, curInstIdx is 1, g_cmRole is 2, g_delayTimeout is 13.
2025-01-09 09:38:27.198 tid=3522231 ETCD_HA ASYN LOG: idx=0, node=1, instId is 1, heartbeat=0, etcdHeartbeat=50895, primaryNodeId=1.
2025-01-09 09:38:27.198 tid=3522231 ETCD_HA ASYN LOG: idx=1, node=2, instId is 2, heartbeat=6, etcdHeartbeat=49822, primaryNodeId=1.
2025-01-09 09:38:27.198 tid=3522231 ETCD_HA ASYN LOG: minNodeId=2, nodeIndex=1, currentNode=2, role=2, minNodeIdForCmId=2.
2025-01-09 09:38:27.198 tid=3522231 ETCD_HA ASYN LOG: find the min cm id=2, the cm could be the best primary.
2025-01-09 09:38:27.199 tid=3522231 ETCD_HA ASYN LOG: cm_delay_arbitrate_time_out End: server_node_index = 1, g_pre_agent_conn_count=1, g_cmRole is 2.
2025-01-09 09:38:27.199 tid=3522231 ETCD_HA ASYN LOG: [CmSetPrimary2Etcd]: node(2) role is 2, ready to promote.
2025-01-09 09:38:27.203 tid=3522231 ETCD_HA ASYN LOG: [CmSetPrimary2Etcd]: node(2) last role is 2, promote to primary.
2025-01-09 09:38:27.203 tid=3522231 ETCD_HA ASYN LOG: [CmSetPrimary2Etcd]: pre_agent_count is 1, node(2) cm role is 2, to primary.

Promotion of the standby CMS instance to primary, as recorded in the cm_server log:
2025-01-09 09:38:29.070 tid=3522281 HA_MAIN ASYN LOG: current node is 2, change it's role to primary.
2025-01-09 09:38:29.071 tid=3522281 HA_MAIN ASYN LOG: cm_server_current_role is 1. cm_server_last_role is 2.
2025-01-09 09:38:29.071 tid=3522281 HA_MAIN ASYN LOG: Promoted to PRIMARY. Do variable reset and reload.
2025-01-09 09:38:29.071 tid=3522281 HA_MAIN ASYN LOG: cms change to primary, will clean finish redo time.
2025-01-09 09:38:29.071 tid=3522281 HA_MAIN ASYN LOG: cms change to primary, will clean cma fault time.
2025-01-09 09:38:29.071 tid=3522281 HA_MAIN ASYN LOG: cms change to primary, will clean switchover command.
2025-01-09 09:38:29.071 tid=3522281 HA_MAIN ASYN LOG: Setting arbitration_majority_reelection_timeout to 10.

Failover information recorded in the cm_server log:
2025-01-09 09:38:34.062 tid=3522280 AGENT_WORKER ASYN LOG: [ChangeDnPrimaryMemberIndex]: line 54: instd(6002) instTypePur is (1: Primary), instTypeSor is (2: Standby), peerInstId is 0.
2025-01-09 09:38:34.062 tid=3522280 AGENT_WORKER ASYN LOG: [ChangeDnPrimaryMemberIndex]: 67: instance(6001) static role(Primary) will change to be Standby.
2025-01-09 09:38:34.062 tid=3522280 AGENT_WORKER ASYN LOG: [ChangeDnPrimaryMemberIndex]: 60: instance(6002) static role(Standby) will change to be Primary.
...
2025-01-09 09:38:34.069 tid=3522280 AGENT_WORKER ASYN LOG: [KeyEvent: KEY_EVENT_FAILOVER] [Instance: 6002] [Details: Failover message has sent to instance 6002, term 103, sendFailoverTimes is 0.]
2025-01-09 09:38:34.069 tid=3522280 AGENT_WORKER ASYN LOG: line 185: new arbitra node 0, instanceId 6001, static_role 2=Standby, local_dynamic_role 0=Unknown, local_term=0, local_redo_finished = 0, local_last_xlog_location=0/0, local_db_state 0=Unknown, local_sync_state=0, build_reason 0=Normal, double_restarting=0, group_term=103, sendFailoverTimes=0
2025-01-09 09:38:34.069 tid=3522280 AGENT_WORKER ASYN LOG: line 185: new arbitra node 1, instanceId 6002, static_role 1=Primary, local_dynamic_role 2=Standby, local_term=4, local_redo_finished = 1, local_last_xlog_location=0/44FA5010, local_db_state 2=Need repair, local_sync_state=0, build_reason 2=Disconnected, double_restarting=0, group_term=103, sendFailoverTimes=0
2025-01-09 09:38:34.069 tid=3522280 AGENT_WORKER ASYN LOG: line 185: new arbitra node 2, instanceId 6003, static_role 2=Standby, local_dynamic_role 2=Standby, local_term=4, local_redo_finished = 1, local_last_xlog_location=0/44FA5010, local_db_state 2=Need repair, local_sync_state=0, build_reason 2=Disconnected, double_restarting=0, group_term=103, sendFailoverTimes=0
2025-01-09 09:38:34.069 tid=3522280 AGENT_WORKER ASYN WARNING: this cluster has no coordinator, no need to notify cn.
2025-01-09 09:38:34.069 tid=3522280 AGENT_WORKER ASYN LOG: [SendFailoverQuarm], line 2139: Failover message has sent to instance 6002 in reduce standy condition(0), local promoting.
2025-01-09 09:38:34.069 tid=3522274 AGENT_IO ASYN LOG: cmserver send msg to node 2, msgtype: MSG_CM_AGENT_FAILOVER

Heartbeat timeouts of instances on other nodes, as recorded in the cm_server log:
2025-01-09 09:38:38.435 tid=3522266 Monitor ASYN LOG: arbitration_majority_reelection_timeout elapsed into 0. Majority re-election enabled now.
2025-01-09 09:38:38.435 tid=3522266 Monitor ASYN LOG: instance(6001) heartbeat timeout, heartbeat:11, threshold:6

After the failed node recovers, the om_monitor log records the startup of CM Agent and ETCD:
2025-01-09 09:39:02.288 tid=2826  LOG: The CM Agent startup check is complete: cluster_manual_start=0, agent_config_file_r=1, agent_binary_file_x=1, config_change_flag=0, previous_status=0, start_count=0.
2025-01-09 09:39:02.288 tid=2826  LOG: cm_agent start, pid is 2827
[cm_agent]: cmserverNum is 3, and cmserver info is [0 node:1, cmserverId:1, cmServerIndex:0], [1 node:2, cmserverId:2, cmServerIndex:1], [2 node:3, cmserverId:3, cmServerIndex:2], .
 2025-01-09 09:39:02.383 tid=2826  LOG: run check etcd log-outputs command: /gauss/app/cluster/core/app/bin/etcd --help | grep "\-\-log-outputs" success
2025-01-09 09:39:02.384 tid=2826  LOG: run etcd command: umask=`umask`;umask 0077;/gauss/app/cluster/core/app/bin/etcd  -name ...

Switchover information recorded in the cm_server log on the primary:
2025-01-09 14:24:31.568 tid=3522276 CTL_WORKER ASYN LOG: auto switchover instanceid=6001, wait_seconds=120.
2025-01-09 14:24:31.568 tid=3522276 CTL_WORKER ASYN LOG: [SetSwitchoverInSwitchoverProcess], instd(6001) localRole is (2: Standby), cmd[cmdPur(1: Primary), cmdSour(2: Standby), cmdPur(0: Unknown), peerIdx: 6002] timeout is 120, delayTime is 0.
2025-01-09 14:24:31.568 tid=3522276 CTL_WORKER ASYN LOG: dn instanceid=6003, primaryFlushLocation=00000000/45BDE228, standbyReplayLocation=00000000/45BDE228, curGap=00000000/00000000, maxGap=00000001/2C000000.
2025-01-09 14:24:31.568 tid=3522276 CTL_WORKER ASYN LOG: dn instanceid=6001, primaryFlushLocation=00000000/45BDE228, standbyReplayLocation=00000000/45BDE228, curGap=00000000/00000000, maxGap=00000001/2C000000.
2025-01-09 14:24:31.568 tid=3522276 CTL_WORKER ASYN LOG: [SwitchoverDone]: inst(6001) is doing switchover.
2025-01-09 14:24:31.568 tid=3522276 CTL_WORKER ASYN LOG: the balance state is 3 by DN.
2025-01-09 14:24:31.951 tid=3522280 AGENT_WORKER ASYN LOG: instId(6002) may be doing switchover, switchoverInstId is 6001.
2025-01-09 14:24:31.951 tid=3522280 AGENT_WORKER ASYN LOG: [Primary], 6002: another instance (6001) is doing[3/11], pendStatus count is 1, cannot to do arbitrate.
2025-01-09 14:24:32.008 tid=3522279 AGENT_WORKER ASYN LOG: send switchover to instance(6001) for [1/4] times.
2025-01-09 14:24:32.008 tid=3522279 AGENT_WORKER ASYN LOG: [KeyEvent: KEY_EVENT_SWITCHOVER] [Instance: 6001] [Details: send switchover message, node=1, instance=6001]
2025-01-09 14:24:32.008 tid=3522273 AGENT_IO ASYN LOG: cmserver send msg to node 1, msgtype: MSG_CM_AGENT_SWITCHOVER
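In the cm_server excerpts above, failover and switchover are both marked by a `[KeyEvent: ...]` line. The sketch below pulls those lines out of a log so that role changes can be listed in time order. The pattern is inferred from the two KeyEvent lines shown in this article, and `extract_key_events` is a made-up name.

```python
import re

# Pattern inferred from the KeyEvent lines quoted above (an assumption).
_KEY_EVENT_RE = re.compile(
    r"\[KeyEvent: (?P<event>\w+)\] "
    r"\[Instance: (?P<instance>\d+)\] "
    r"\[Details: (?P<details>.*)\]\s*$"
)


def extract_key_events(lines):
    """Return (timestamp, event, instance_id, details) for each KeyEvent line."""
    events = []
    for line in lines:
        m = _KEY_EVENT_RE.search(line)
        if m:
            # The timestamp is the first 23 characters: YYYY-MM-DD HH:MM:SS.mmm
            events.append((line[:23], m["event"], int(m["instance"]), m["details"]))
    return events


sample_lines = [
    "2025-01-09 09:38:34.069 tid=3522280 AGENT_WORKER ASYN LOG: [KeyEvent: KEY_EVENT_FAILOVER] [Instance: 6002] [Details: Failover message has sent to instance 6002, term 103, sendFailoverTimes is 0.]",
    "2025-01-09 09:38:34.069 tid=3522274 AGENT_IO ASYN LOG: cmserver send msg to node 2, msgtype: MSG_CM_AGENT_FAILOVER",
    "2025-01-09 14:24:32.008 tid=3522279 AGENT_WORKER ASYN LOG: [KeyEvent: KEY_EVENT_SWITCHOVER] [Instance: 6001] [Details: send switchover message, node=1, instance=6001]",
]
for ts, event, instance, details in extract_key_events(sample_lines):
    print(ts, event, instance)
```

Starting from the KeyEvent timestamps and then reading the surrounding lines is usually faster than scanning the whole cm_server log from the top.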

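Views

  • pg_stat_activity: shows the state of each session on the current instance.
  • pg_thread_wait_status: shows the wait events of each thread on the current instance.
  • pg_locks: shows the lock state on the current instance.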

Examples:
select datname,sessionid,usename,application_name,client_addr,client_hostname,state,query_id,query from pg_stat_activity;

select * from pg_thread_wait_status where wait_status<>'none';

select locktype,database,relation,virtualxid,transactionid,objid,sessionid,mode,granted from pg_locks;
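Once the pg_locks rows have been fetched, waiting sessions can be paired with the sessions holding the lock they wait for. The sketch below does this on plain dictionaries keyed by the columns of the query above. It matches on the lock target only and does not check whether the lock modes actually conflict, so treat the result as a list of candidates. The sample rows and session IDs are made up.

```python
def find_blockers(rows):
    """Pair each ungranted lock with granted locks on the same target.

    Returns (waiting_session, holding_session, wanted_mode, held_mode) tuples.
    """
    def target(row):
        # Lock identity, as far as the selected columns can tell.
        return (row["locktype"], row["database"], row["relation"],
                row["virtualxid"], row["transactionid"], row["objid"])

    holders = {}
    for row in rows:
        if row["granted"]:
            holders.setdefault(target(row), []).append(row)

    pairs = []
    for row in rows:
        if row["granted"]:
            continue
        for holder in holders.get(target(row), []):
            if holder["sessionid"] != row["sessionid"]:
                pairs.append((row["sessionid"], holder["sessionid"],
                              row["mode"], holder["mode"]))
    return pairs


def make_row(locktype, sessionid, mode, granted, **ids):
    """Build a sample row; unspecified identifier columns default to None."""
    row = {"locktype": locktype, "database": None, "relation": None,
           "virtualxid": None, "transactionid": None, "objid": None,
           "sessionid": sessionid, "mode": mode, "granted": granted}
    row.update(ids)
    return row


# Hypothetical rows: session 200 waits for the transaction held by session 100.
sample_rows = [
    make_row("transactionid", 100, "ExclusiveLock", True, transactionid=5001),
    make_row("transactionid", 200, "ShareLock", False, transactionid=5001),
    make_row("relation", 200, "RowExclusiveLock", True, database=16384, relation=24576),
]
print(find_blockers(sample_rows))
```

The holding session ID can then be looked up in pg_stat_activity to see which query it is running.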

WDR Reports

Generating a WDR report lets you analyze database state at the cluster level and at the node level.

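-- List the snapshots that already exist
select * from snapshot.snapshot;

-- Create a snapshot manually
select create_wdr_snapshot();

-- Generate a node-level report
select * from pg_node_env;   -- check the current node name
-- Switch to unaligned, tuples-only output so the file contains only the HTML
\a
\t
\o /home/omm/wdr_20241122_node.html
select generate_wdr_report(504,505,'all','node','dn_6001');   -- 504 and 505 are the begin and end snapshot IDs
-- Stop writing to the file and restore the default output format
\o
\a
\t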
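The two snapshot IDs passed to generate_wdr_report should bracket the period being investigated. The sketch below picks them from a list of (snapshot_id, start_ts) pairs, assuming snapshot.snapshot exposes a snapshot ID and a start timestamp per row. The function name `pick_snapshots` and the sample timestamps are made up for illustration.

```python
from datetime import datetime


def pick_snapshots(snapshots, incident_start, incident_end):
    """Return (begin_id, end_id) of the snapshots bracketing an incident.

    snapshots: iterable of (snapshot_id, start_ts) tuples.
    Returns None if the incident is not fully covered by two snapshots.
    """
    before = [s for s in snapshots if s[1] <= incident_start]
    after = [s for s in snapshots if s[1] >= incident_end]
    if not before or not after:
        return None
    begin = max(before, key=lambda s: s[1])  # latest snapshot before the incident
    end = min(after, key=lambda s: s[1])     # earliest snapshot after it
    if begin[0] == end[0]:
        return None
    return begin[0], end[0]


# Hypothetical hourly snapshots.
sample_snapshots = [
    (503, datetime(2024, 11, 22, 8, 0)),
    (504, datetime(2024, 11, 22, 9, 0)),
    (505, datetime(2024, 11, 22, 10, 0)),
    (506, datetime(2024, 11, 22, 11, 0)),
]
pair = pick_snapshots(
    sample_snapshots,
    datetime(2024, 11, 22, 9, 30),
    datetime(2024, 11, 22, 9, 45),
)
print(pair)  # (504, 505)
```

Choosing the tightest bracketing pair keeps unrelated workload out of the report.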

Error Codes

📖 For the meaning of GaussDB error codes, see: https://support.huaweicloud.com/errorcode-dws/dws_08_0003.html

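Core Files

The core file produced when the database crashes is essential for finding the cause and location of the crash. If a process core dumps at run time, collect the core file immediately.

🕷 Enabling core files has some performance cost, and the cost grows when processes fail frequently.

Check whether core dumps are enabled: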
gs_guc check -Z datanode -N all -I all -c "enable_bbox_dump"

Enable core dumps:
gs_guc set -Z datanode -N all -I all -c "enable_bbox_dump=on"

# Set the directory where core files are written
mkdir -p /gauss/corefiles
chmod 750 /gauss/corefiles
gs_guc set -Z datanode -N all -I all -c "bbox_dump_path='/gauss/corefiles'"

# Cap the number of core files kept so they cannot fill the disk
gs_guc set -Z datanode -N all -I all -c "bbox_dump_count=4"
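Because core files are large, it is worth checking the dump directory from time to time. The sketch below reports how many files the directory holds, how much space they use, and how much is free on that file system. It counts every regular file rather than assuming a core file naming pattern, and `core_dir_summary` is a made-up name.

```python
import shutil
from pathlib import Path


def core_dir_summary(dump_dir: str) -> dict:
    """Summarize the files in the core dump directory and the free disk space."""
    path = Path(dump_dir)
    files = [p for p in path.iterdir() if p.is_file()]
    return {
        "files": len(files),
        "bytes_used": sum(p.stat().st_size for p in files),
        "bytes_free": shutil.disk_usage(path).free,
    }


if __name__ == "__main__":
    # /gauss/corefiles is the bbox_dump_path configured above.
    if Path("/gauss/corefiles").is_dir():
        print(core_dir_summary("/gauss/corefiles"))
```

If `files` stays at the configured bbox_dump_count while crashes continue, older core files are being rotated out and should be copied elsewhere before they are lost.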

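FFIC Logs

GaussDB processes can fail at run time for a variety of unexpected reasons. Where strict RTO requirements rule out bbox (generating a core file with bbox typically blocks the process for minutes), FFIC can collect key information from just before the crash for fault location.

Check whether FFIC is enabled: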
gs_guc check -Z datanode -N all -I all -c "enable_ffic_log"

Enable FFIC logging:
gs_guc set -Z datanode -N all -I all -c "enable_ffic_log=on"

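By default, FFIC logs are written to $GAUSSLOG/ffic.

An FFIC log contains the following information:

  • the signal that caused the database process to fail;
  • the call stack of the failing thread;
  • the unique SQL ID of the SQL statement that triggered the failure;
  • CPU register contents at the time of the failure;
  • the PC (program counter) of the failing thread;
  • the memory mappings at the time of the failure;
  • the database parameter configuration.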