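A Complete Guide to Diagnosing PostgreSQL Performance Problems
Introduction
In production, PostgreSQL performance problems usually show up as high CPU usage, abnormal memory consumption, disk I/O bottlenecks, or slow SQL queries. This article walks through how to locate and analyze each of these so you can find the bottleneck quickly and plan a fix.
What this article offers
- A systematic method: a complete troubleshooting path from the operating system down to the database
- A practical focus: every scenario comes with concrete commands and SQL examples
- A problem-driven approach: it targets issues commonly seen in real production environments
- Ready-to-use material: diagnostic scripts you can adapt and run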
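1. PostgreSQL Configuration Files
Performance tuning often requires changing PostgreSQL configuration. This section covers where the configuration files live and how to modify them.
1.1 Configuration file paths
The main PostgreSQL configuration file is postgresql.conf. Its location depends on the operating system and installation method:
Linux:
| Installation method | Configuration file path |
|---|---|
| Built from source (default) | /usr/local/pgsql/data/postgresql.conf |
| Debian/Ubuntu (apt) | /etc/postgresql/{version}/main/postgresql.conf |
| CentOS/RHEL (yum) | /var/lib/pgsql/{version}/data/postgresql.conf |
| Docker container | /var/lib/postgresql/data/postgresql.conf |
Example paths for common versions: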
bash
# PostgreSQL 15 on Ubuntu/Debian
/etc/postgresql/15/main/postgresql.conf
# PostgreSQL 14 on CentOS/RHEL
/var/lib/pgsql/14/data/postgresql.conf
# PostgreSQL 16 built from source
/usr/local/pgsql/data/postgresql.conf
Windows:
| Installation method | Configuration file path |
|---|---|
| Default installation | C:\Program Files\PostgreSQL\{version}\data\postgresql.conf |
| Custom path | {install directory}\data\postgresql.conf |
Ways to find the configuration file:
bash
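# Method 1: ask the running server through psql
psql -U postgres -c "SHOW config_file;"
# Method 2: read the data directory (-D) from the postmaster command line
# (on Debian/Ubuntu the config file may live outside the data directory)
ps aux | grep postgres | grep -o '\-D [^ ]*' | head -1
# Method 3: search the filesystem
find / -name "postgresql.conf" 2>/dev/null
# Method 4: pg_config prints build-time options only (for example --prefix),
# which hints at the default install location but not the live config file
pg_config --configure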
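1.2 Other important configuration files
| File | Description | Purpose |
|---|---|---|
| postgresql.conf | Main configuration file | Database parameter settings |
| pg_hba.conf | Client authentication configuration | Access control and authentication rules |
| pg_ident.conf | User name mapping configuration | Maps operating system users to database users |
| postmaster.pid | PID file | Records the main process PID (generated automatically) |
| postmaster.opts | Startup options file | Records the command-line options used at startup (generated automatically) |
Querying the configuration file paths: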
sql
-- List the configuration file paths from within psql
SELECT name, setting FROM pg_settings
WHERE name IN ('config_file', 'hba_file', 'ident_file', 'data_directory');
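1.3 How to change the configuration
Method 1: edit the configuration file directly
bash
# Edit the configuration file
sudo vim /etc/postgresql/15/main/postgresql.conf
# After editing, reload or restart for the change to take effect
# Reload the configuration (does not interrupt service)
sudo systemctl reload postgresql
# or
pg_ctl reload -D /var/lib/pgsql/15/data
# Restart the service (drops existing connections)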
sudo systemctl restart postgresql
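Method 2: use ALTER SYSTEM (PostgreSQL 9.4+)
sql
-- Change parameters (written to postgresql.auto.conf automatically)
ALTER SYSTEM SET shared_buffers = '8GB';
ALTER SYSTEM SET work_mem = '32MB';
-- Reload so the changes take effect (shared_buffers still needs a restart)
SELECT pg_reload_conf();
-- List parameters that are waiting for a restart
SELECT name, setting, pending_restart
FROM pg_settings
WHERE pending_restart = true;
-- Remove the ALTER SYSTEM override so the parameter falls back to postgresql.conf or the built-in default
ALTER SYSTEM RESET work_mem;
Method 3: temporary session-level change
sql
-- Applies to the current session only
SET work_mem = '64MB';
-- Applies to the current transaction only
BEGIN;
SET LOCAL work_mem = '128MB';
-- run the query here
COMMIT;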
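1.4 How parameter changes take effect
Different parameters take effect in different ways after a change:
| How it takes effect | Description | Example parameters |
|---|---|---|
| pg_reload_conf() | Takes effect after the configuration is reloaded | work_mem, log_min_duration_statement |
| Restart | Takes effect only after the server is restarted | shared_buffers, max_connections |
| Session level | Applies to the current connection only | work_mem, maintenance_work_mem |
Checking how a parameter takes effect:
sql
-- Show each parameter's context (how a change takes effect)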
SELECT name, context, setting, source
FROM pg_settings
WHERE name IN ('shared_buffers', 'work_mem', 'max_connections', 'log_min_duration_statement');
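-- Meaning of the context column:
-- internal: fixed at build or initdb time; cannot be changed
-- postmaster: requires a server restart
-- sighup: requires a configuration reload
-- backend: applies to new connections only
-- user: can be changed per session
-- superuser: can be changed per session by a superuser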
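The context values above can be turned into a small lookup for use in scripts. The sketch below is illustrative only: `apply_action` is a hypothetical helper name, not something PostgreSQL ships, and the wording of each action is an assumption based on the list above.

```bash
#!/bin/bash
# Hypothetical helper (not part of PostgreSQL): translate a pg_settings.context
# value into the step needed for a changed setting to take effect.
apply_action() {
    case "$1" in
        internal)                  echo "cannot be changed" ;;
        postmaster)                echo "restart the server" ;;
        sighup)                    echo "reload the configuration" ;;
        backend|superuser-backend) echo "open a new connection" ;;
        user|superuser)            echo "SET in the session, or reload for the server-wide default" ;;
        *)                         echo "unknown context: $1" ;;
    esac
}

# Example: the two contexts that matter most when editing postgresql.conf
for ctx in postmaster sighup; do
    printf '%s -> %s\n' "$ctx" "$(apply_action "$ctx")"
done
```

In practice the argument would come from a query such as `SELECT context FROM pg_settings WHERE name = 'shared_buffers'`.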
1.5 Configuration file best practices
ini
# Suggested layout for postgresql.conf
#------------------------------------------------------------------------------
# Connection settings
#------------------------------------------------------------------------------
listen_addresses = '*' # listen address ('*' accepts all interfaces; restrict access in pg_hba.conf)
port = 5432 # listen port
max_connections = 200 # maximum number of connections
#------------------------------------------------------------------------------
# Memory settings
#------------------------------------------------------------------------------
shared_buffers = 8GB # shared buffer cache
work_mem = 32MB # memory per sort/hash operation
maintenance_work_mem = 1GB # memory for maintenance operations
effective_cache_size = 24GB # cache size assumed by the planner
#------------------------------------------------------------------------------
# WAL settings
#------------------------------------------------------------------------------
wal_level = replica
max_wal_size = 4GB
min_wal_size = 1GB
#------------------------------------------------------------------------------
# Logging
#------------------------------------------------------------------------------
log_min_duration_statement = 1000
log_checkpoints = on
log_lock_waits = on
#------------------------------------------------------------------------------
# Autovacuum
#------------------------------------------------------------------------------
autovacuum = on
autovacuum_vacuum_cost_limit = 200
Tip: back up the original file before editing it:
bash
cp /etc/postgresql/15/main/postgresql.conf /etc/postgresql/15/main/postgresql.conf.bak
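2. Performance Problem Categories and Troubleshooting Approach
2.1 Common types of performance problems
| Problem type | Typical symptoms | Impact |
|---|---|---|
| High CPU usage | Database processes stay above 80% CPU | Slow query responses, high system load |
| High memory usage | Memory use above 90%, frequent swapping | Stalled queries, risk of OOM kills |
| High disk I/O | High iowait, disk throughput at its limit | Query timeouts, unresponsive system |
| Slow SQL queries | A single query takes longer than expected | Slow application responses, poor user experience |
2.2 Overall troubleshooting approach
Performance troubleshooting workflow
│
├─ Step 1: Confirm the symptoms
│ ├─ Check system resource usage
│ └─ Identify when the problem occurs
│
├─ Step 2: Locate the source
│ ├─ Operating system level analysis
│ └─ Database level analysis
│
├─ Step 3: Find the specific cause
│ ├─ Pin down the problem SQL
│ └─ Analyze the execution plan
│
└─ Step 4: Plan the fix
├─ SQL optimization
├─ Index optimization
└─ Parameter tuning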
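3. Diagnosing High CPU Usage
3.1 Confirm CPU usage
Linux commands:
bash
# Overall CPU usage
top -c
# Per-CPU detail
mpstat -P ALL 1
# CPU usage of PostgreSQL processes
ps aux | grep postgres | grep -v grep | sort -k3 -rn | head -10
# Interactive real-time monitoring
htop
Reading the key metrics:
- %usr: CPU time spent in user space (a high value may indicate compute-heavy SQL)
- %sys: CPU time spent in the kernel (a high value may indicate frequent system calls)
- %iowait: share of CPU time spent waiting for I/O (a high value points to a disk bottleneck)
3.2 Find the processes consuming CPU
bash
# Find the PostgreSQL processes using the most CPU
ps -eo pid,ppid,user,%cpu,%mem,cmd --sort=-%cpu | grep postgres
# Sample output
# 12345 1234 postgres 85.2 5.3 postgres: worker process
3.3 Map a process PID to its SQL
Method 1: pg_stat_activity
sql
-- Find the SQL a given process is running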
SELECT pid,
usename,
application_name,
client_addr,
state,
query,
query_start,
now() - query_start AS duration
FROM pg_stat_activity
WHERE pid = 12345; -- replace with the actual PID
-- All statements currently executing, longest-running first
SELECT pid,
usename,
state,
query,
now() - query_start AS duration
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY duration DESC;
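Method 2: the pg_stat_statements extension
sql
-- Enable the extension (requires superuser; pg_stat_statements must also be listed in shared_preload_libraries)
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
-- Statements with the highest total execution time, a useful proxy for CPU cost
-- (total_exec_time exists in PostgreSQL 13+; older versions call it total_time)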
SELECT query,
calls,
total_exec_time,
mean_exec_time,
rows,
100.0 * shared_blks_hit /
nullif(shared_blks_hit + shared_blks_read, 0) AS hit_percent
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
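3.4 Common causes of high CPU and their fixes
Cause 1: missing indexes leading to full table scans
sql
-- Tables read mostly by sequential scans (candidates for a missing index)
SELECT schemaname,
relname,
seq_scan,
seq_tup_read,
idx_scan
FROM pg_stat_user_tables
WHERE seq_scan > 0
ORDER BY seq_tup_read DESC
LIMIT 10;
-- Confirm with the execution plan of a suspect query
EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 12345;
-- Fix: create a suitable index
CREATE INDEX idx_orders_customer_id ON orders(customer_id);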
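Cause 2: expensive computation and sorting
sql
-- Problem SQL: sorts a large result set in memory
SELECT * FROM orders
ORDER BY created_at DESC
LIMIT 100000 OFFSET 0;
-- Fix: add an index that matches the sort order
CREATE INDEX idx_orders_created_desc ON orders(created_at DESC);
-- Or a covering index (INCLUDE needs PostgreSQL 11+); it only enables index-only scans
-- when the query selects just the indexed and included columns instead of SELECT *
CREATE INDEX idx_orders_cover ON orders(created_at DESC) INCLUDE (order_no, customer_id, amount);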
Cause 3: too many parallel queries
sql
-- Count the parallel worker processes currently running (PostgreSQL 10+)
SELECT count(*) as parallel_workers
FROM pg_stat_activity
WHERE backend_type = 'parallel worker';
-- Adjust the parallel settings in postgresql.conf:
-- /etc/postgresql/15/main/postgresql.conf (Ubuntu/Debian)
-- /var/lib/pgsql/15/data/postgresql.conf (CentOS/RHEL)
-- then run SELECT pg_reload_conf(); or systemctl reload postgresql
-- max_parallel_workers_per_gather = 2 # cap the parallelism of each query
-- max_parallel_workers = 4 # total number of parallel workers
Or use ALTER SYSTEM:
sql
ALTER SYSTEM SET max_parallel_workers_per_gather = 2;
ALTER SYSTEM SET max_parallel_workers = 4;
SELECT pg_reload_conf();
3.5 CPU monitoring script
bash
#!/bin/bash
# cpu_monitor.sh - PostgreSQL CPU monitoring script
LOG_FILE="/var/log/pg_cpu_monitor.log"
THRESHOLD=80
# Make sure the log directory exists
mkdir -p "$(dirname "$LOG_FILE")"
while true; do
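# Sum %CPU across PostgreSQL processes (ps reports each process's average over its
# lifetime, and the sum can exceed 100 on multi-core hosts, so tune THRESHOLD accordingly)
CPU_USAGE=$(ps aux | grep 'postgres:' | grep -v grep | awk '{sum+=$3} END {print sum+0}')
# Compare against the threshold (awk handles the decimal comparison)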
if [ -n "$CPU_USAGE" ] && [ "$CPU_USAGE" != "0" ]; then
IS_HIGH=$(awk "BEGIN {print ($CPU_USAGE > $THRESHOLD) ? 1 : 0}")
if [ "$IS_HIGH" -eq 1 ]; then
echo "[$(date)] CPU High: $CPU_USAGE%" >> "$LOG_FILE"
# Record the currently active SQL
psql -U postgres -c "
SELECT pid, query, now() - query_start as duration
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY duration DESC
LIMIT 5;
" >> "$LOG_FILE" 2>&1
fi
fi
sleep 60
done
Note: run this script as root or as the postgres user so that psql can connect. To run it in the background:
bash
nohup bash cpu_monitor.sh >/dev/null 2>&1 &
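4. Diagnosing High Memory Usage
4.1 Confirm memory usage
bash
# Overall memory usage
free -h
# Memory usage of PostgreSQL processes (RSS includes the shared_buffers pages
# each backend has touched, so per-process figures overlap)
ps aux | grep postgres | grep -v grep | sort -k4 -rn | head -10
# Detailed memory map of one process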
pmap -x <pid>
# Shared memory segments
ipcs -m
4.2 PostgreSQL memory parameters
Core memory parameters:
sql
-- Show the current memory configuration
SELECT name,
setting,
unit,
short_desc,
source
FROM pg_settings
WHERE name IN (
'shared_buffers',
'work_mem',
'maintenance_work_mem',
'effective_cache_size',
'temp_buffers',
'max_connections'
);
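-- Notes on common settings
-- shared_buffers: shared buffer cache, commonly set to about 25% of system memory
-- work_mem: memory per sort or hash operation, default 4MB
-- maintenance_work_mem: memory for maintenance operations such as VACUUM
-- effective_cache_size: the cache size the query planner assumes is available
Estimating total memory consumption:
Estimated total memory
= shared_buffers
+ (max_connections × work_mem × number of sort/hash operations per query)
+ (autovacuum_max_workers × maintenance_work_mem)
+ wal_buffers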
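The estimate above is simple integer arithmetic, so it can be scripted. The sketch below is a minimal worked example, assuming all sizes are passed in MB; `estimate_total_mem_mb` is a hypothetical helper name, and the sample values are the ones used elsewhere in this article (autovacuum_max_workers = 3, one sort/hash operation per query).

```bash
#!/bin/bash
# Hypothetical helper: worst-case memory estimate in MB, following the formula above.
# Arguments (integers; sizes in MB):
#   $1 shared_buffers   $2 max_connections   $3 work_mem
#   $4 sort/hash operations per query
#   $5 autovacuum_max_workers   $6 maintenance_work_mem   $7 wal_buffers
estimate_total_mem_mb() {
    local shared_buffers=$1 max_conn=$2 work_mem=$3 ops=$4
    local av_workers=$5 maint_mem=$6 wal_buffers=$7
    echo $(( shared_buffers + max_conn * work_mem * ops + av_workers * maint_mem + wal_buffers ))
}

# 8GB shared_buffers, 200 connections, 32MB work_mem, 1 operation per query,
# 3 autovacuum workers, 1GB maintenance_work_mem, 64MB wal_buffers
estimate_total_mem_mb 8192 200 32 1 3 1024 64   # prints 17728
```

With these inputs the worst case is about 17.3GB on a 32GB server. Raising the operations-per-query argument shows how quickly complex queries can push the total past physical memory.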
4.3 SQL for diagnosing memory problems
Inspecting current memory use:
sql
-- PostgreSQL 13+: shared memory allocations from pg_shmem_allocations
SELECT name,
off,
size,
allocated_size
FROM pg_shmem_allocations
ORDER BY size DESC
LIMIT 10;
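-- Temporary file usage per database (heavy use may mean work_mem is too small)
SELECT datname,
temp_files,
pg_size_pretty(temp_bytes) AS temp_size
FROM pg_stat_database
WHERE temp_files > 0
ORDER BY temp_bytes DESC
LIMIT 10;
-- Statements that write the most temporary blocks (requires pg_stat_statements)
SELECT query,
calls,
temp_blks_written
FROM pg_stat_statements
WHERE temp_blks_written > 0
ORDER BY temp_blks_written DESC
LIMIT 10;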
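Looking for abnormal memory growth:
sql
-- On-disk size of each database (this is storage, not memory; useful context when sizing shared_buffers)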
SELECT datname,
pg_size_pretty(pg_database_size(datname)) AS db_size,
pg_database_size(datname) AS size_bytes
FROM pg_database
WHERE datistemplate = false
ORDER BY size_bytes DESC;
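-- Sessions that are not idle, oldest query first; match their PIDs against
-- per-process memory from ps or pmap, because pg_stat_activity does not report memory
SELECT pid,
usename,
application_name,
client_addr,
state,
query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start;
-- PostgreSQL 14+: write one backend's memory contexts to the server log
-- SELECT pg_log_backend_memory_contexts(12345); -- replace with the actual PID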
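4.4 Memory tuning recommendations
Sizing rules of thumb:
ini
# Assume a server with 32GB of RAM
# shared_buffers: 25% of system memory
shared_buffers = 8GB
# effective_cache_size: 75% of system memory
effective_cache_size = 24GB
# work_mem: scale with the connection count
# Formula: (total memory - shared_buffers) / (max_connections × 4)
# With max_connections = 200: (32GB - 8GB) / 800 ≈ 30MB, rounded to 32MB here
work_mem = 32MB
# maintenance_work_mem: used by VACUUM and other maintenance operations
maintenance_work_mem = 1GB
# wal_buffers: WAL buffer
wal_buffers = 64MB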
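The work_mem rule of thumb above can be checked with a few lines of shell. This is a sketch under the same assumptions as the formula (sizes in MB, integer division); `suggest_work_mem_mb` is a hypothetical helper name.

```bash
#!/bin/bash
# Hypothetical helper: work_mem suggestion in MB using the rule of thumb
#   (total memory - shared_buffers) / (max_connections * 4)
# Arguments: $1 total RAM in MB, $2 shared_buffers in MB, $3 max_connections
suggest_work_mem_mb() {
    local total_mb=$1 shared_buffers_mb=$2 max_conn=$3
    if [ "$max_conn" -le 0 ]; then
        echo "max_connections must be positive" >&2
        return 1
    fi
    echo $(( (total_mb - shared_buffers_mb) / (max_conn * 4) ))
}

# 32GB server, 8GB shared_buffers, 200 connections
suggest_work_mem_mb 32768 8192 200   # prints 30
```

The result of 30MB is in line with the 32MB used in the configuration examples.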
A complete example:
ini
# postgresql.conf - memory tuning
# Configuration file paths:
# Ubuntu/Debian: /etc/postgresql/15/main/postgresql.conf
# CentOS/RHEL: /var/lib/pgsql/15/data/postgresql.conf
# Windows: C:\Program Files\PostgreSQL\15\data\postgresql.conf
#
# How to change:
# Option 1: sudo vim /etc/postgresql/15/main/postgresql.conf
# Option 2: ALTER SYSTEM SET shared_buffers = '8GB';
#
# Taking effect: shared_buffers requires a server restart
# sudo systemctl restart postgresql
# Server memory: 32GB
# Maximum connections: 200
shared_buffers = 8GB
effective_cache_size = 24GB
work_mem = 32MB
maintenance_work_mem = 1GB
wal_buffers = 64MB
huge_pages = try # use huge pages when available to reduce TLB misses
min_wal_size = 1GB
max_wal_size = 4GB
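4.5 Detecting work_mem spills
sql
-- Temporary files per database (PostgreSQL spills to disk when work_mem is too small)
SELECT datname,
temp_files,
pg_size_pretty(temp_bytes) as temp_size,
pg_size_pretty(blks_read * current_setting('block_size')::bigint) as disk_read
FROM pg_stat_database
WHERE datname NOT IN ('template0', 'template1')
ORDER BY temp_bytes DESC;
-- If temp_files is large, consider raising work_mem or rewriting the SQL
Monitoring temporary files:
bash
# Log every temporary file by setting this in postgresql.conf (0 = no size threshold):
#   log_temp_files = 0
# Then look for temporary file records in the log
grep "temporary file" /var/log/postgresql/*.log | tail -20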
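5. Diagnosing High Disk I/O
5.1 Confirm disk I/O usage
bash
# Disk I/O statistics
iostat -x 1
# Real-time I/O by process
iotop -o
# Disk space usage
df -h
# Size of the PostgreSQL data directory
du -sh "$PGDATA"/*
# Tablespace sizes
psql -c "SELECT spcname, pg_size_pretty(pg_tablespace_size(oid)) FROM pg_tablespace;"
Key iostat metrics:
| Metric | Description | Warning level |
|---|---|---|
| %util | Device utilization | >80% suggests a bottleneck |
| await | Average I/O wait time (split into r_await and w_await in newer sysstat) | >20ms deserves attention |
| svctm | Average service time (deprecated and dropped from recent sysstat releases) | >10ms needs tuning |
| r/s, w/s | Reads and writes per second | Watch for values near the device limit |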
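The warning levels in the table can be applied mechanically once the numbers have been read from iostat. The sketch below only does the comparison; `classify_io` is a hypothetical helper, and extracting the values from `iostat -x` is left out because column positions differ between sysstat versions.

```bash
#!/bin/bash
# Hypothetical helper: compare %util and await (ms) against the warning
# levels from the table above (80% and 20ms).
classify_io() {
    local util="$1" await_ms="$2"
    # awk is used because the inputs are decimals
    awk -v u="$util" -v a="$await_ms" 'BEGIN {
        printf "util:%s await:%s\n", (u > 80 ? "HIGH" : "ok"), (a > 20 ? "HIGH" : "ok")
    }'
}

classify_io 85.0 5.2    # busy device, latency still fine
classify_io 40.0 35.7   # lightly used device with slow responses
```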
5.2 Find I/O-intensive SQL
sql
-- Use pg_stat_statements to find I/O-intensive SQL
SELECT query,
calls,
shared_blks_read + temp_blks_read AS total_blks_read,
shared_blks_hit AS total_blks_hit,
pg_size_pretty((shared_blks_read + temp_blks_read) * current_setting('block_size')::bigint) as read_size,
round(100.0 * shared_blks_hit /
nullif(shared_blks_hit + shared_blks_read, 0), 2) as hit_ratio
FROM pg_stat_statements
WHERE shared_blks_read > 0
ORDER BY shared_blks_read DESC
LIMIT 20;
-- A low hit ratio suggests a missing index or too little memory
5.3 Check table and index sizes
sql
-- Largest tables
SELECT schemaname,
relname,
pg_size_pretty(pg_total_relation_size(relid)) as total_size,
pg_size_pretty(pg_relation_size(relid)) as table_size,
pg_size_pretty(pg_indexes_size(relid)) as index_size
FROM pg_stat_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 20;
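-- Measure bloat precisely (requires the pgstattuple extension; it scans each table
-- in full, so the subquery limits it to the 10 tables with the most dead tuples)
CREATE EXTENSION IF NOT EXISTS pgstattuple;
SELECT t.schemaname,
t.relname,
s.dead_tuple_count as dead_tuples,
s.dead_tuple_percent as dead_percent
FROM (SELECT schemaname, relname, relid
FROM pg_stat_user_tables
WHERE n_live_tup > 0
ORDER BY n_dead_tup DESC
LIMIT 10) t,
LATERAL pgstattuple(t.relid::regclass) s
ORDER BY s.dead_tuple_percent DESC;
-- Or use the estimates in pg_stat_user_tables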
SELECT schemaname,
relname,
n_live_tup,
n_dead_tup,
round(100.0 * n_dead_tup / nullif(n_live_tup + n_dead_tup, 0), 2) as dead_ratio
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY n_dead_tup DESC;
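5.4 Check checkpoints and WAL
Checkpoint frequency affects I/O:
sql
-- Checkpoint statistics (PostgreSQL 17 moves these counters to pg_stat_checkpointer)
SELECT * FROM pg_stat_bgwriter;
-- Key columns:
-- checkpoints_timed: checkpoints triggered by checkpoint_timeout
-- checkpoints_req: requested checkpoints (max_wal_size reached, or a manual CHECKPOINT)
-- checkpoint_write_time: time spent writing (ms)
-- checkpoint_sync_time: time spent syncing (ms)
-- A high share of checkpoints_req means WAL fills up before the timeout; consider raising max_wal_size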
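To make "high" concrete, the share of requested checkpoints can be computed from the two counters. The sketch below is a minimal example; `checkpoint_req_pct` is a hypothetical helper, and the 30% guideline is the one used in this article's monitoring table rather than an official limit.

```bash
#!/bin/bash
# Hypothetical helper: percentage of checkpoints that were requested
# (rather than timed), from the checkpoints_timed and checkpoints_req counters.
checkpoint_req_pct() {
    awk -v timed="$1" -v req="$2" 'BEGIN {
        total = timed + req
        if (total == 0) { print "0.0"; exit }   # no checkpoints recorded yet
        printf "%.1f\n", 100 * req / total
    }'
}

checkpoint_req_pct 70 30   # prints 30.0
```

The inputs could come from `psql -At -c "SELECT checkpoints_timed, checkpoints_req FROM pg_stat_bgwriter"` on versions that still expose those columns.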
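WAL and replication monitoring:
sql
-- Replication lag per standby, in bytes (this is not the WAL generation rate;
-- to measure that, sample pg_current_wal_lsn() twice and diff with pg_wal_lsn_diff)
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) as pending_bytes,
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) as replication_lag
FROM pg_stat_replication;
-- Size of the WAL directory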
SELECT pg_size_pretty(sum(size)) as wal_size
FROM pg_ls_waldir();
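5.5 I/O optimization strategies
Strategy 1: tune checkpoint parameters
ini
# postgresql.conf - checkpoint tuning
# Configuration file paths:
# Ubuntu/Debian: /etc/postgresql/15/main/postgresql.conf
# CentOS/RHEL: /var/lib/pgsql/15/data/postgresql.conf
#
# Apply with: SELECT pg_reload_conf();
# Note: these settings take effect on reload; no restart is needed
# Lengthen the checkpoint interval to flatten I/O spikes
checkpoint_timeout = 15min # default 5min
max_wal_size = 4GB # soft limit on WAL accumulated between checkpoints (default 1GB)
checkpoint_completion_target = 0.9 # spread writes out; default is 0.9 since PostgreSQL 14, 0.5 before
# full_page_writes adds WAL volume after each checkpoint
full_page_writes = on # keep on for crash safety
Or use ALTER SYSTEM: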
sql
ALTER SYSTEM SET checkpoint_timeout = '15min';
ALTER SYSTEM SET max_wal_size = '4GB';
ALTER SYSTEM SET checkpoint_completion_target = 0.9;
SELECT pg_reload_conf();
Strategy 2: tune VACUUM
sql
-- Manual VACUUM (does not block reads or writes)
VACUUM ANALYZE large_table;
-- For severe bloat, VACUUM FULL rewrites the table
-- Warning: it takes an exclusive lock on the table
VACUUM FULL ANALYZE large_table;
-- Or use the pg_repack extension (online rebuild)
-- pg_repack --table=large_table -d mydb
ini
# postgresql.conf - autovacuum tuning
# Configuration file path: /etc/postgresql/15/main/postgresql.conf (Ubuntu/Debian)
# /var/lib/pgsql/15/data/postgresql.conf (CentOS/RHEL)
# Apply with: SELECT pg_reload_conf(); or sudo systemctl reload postgresql
autovacuum = on
autovacuum_max_workers = 3
autovacuum_naptime = 1min
autovacuum_vacuum_threshold = 50
autovacuum_vacuum_scale_factor = 0.1
autovacuum_vacuum_cost_limit = 200 # I/O budget per round; raising it speeds up autovacuum at the cost of more I/O
autovacuum_vacuum_cost_delay = 2ms
Or use ALTER SYSTEM:
sql
ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 200;
ALTER SYSTEM SET autovacuum_vacuum_cost_delay = '2ms';
SELECT pg_reload_conf();
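Strategy 3: use SSDs and adjust the I/O scheduler
bash
# Show the available schedulers (the active one is in brackets)
cat /sys/block/sda/queue/scheduler
# For SSDs, use noop or deadline; on blk-mq kernels (5.0+) these are named none and mq-deadline
echo noop | sudo tee /sys/block/sda/queue/scheduler
# Or set it at boot in /etc/rc.local
echo noop > /sys/block/sda/queue/scheduler
5.6 Disk I/O monitoring script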
bash
#!/bin/bash
# io_monitor.sh - PostgreSQL I/O monitoring script
LOG_FILE="/var/log/pg_io_monitor.log"
PGDATA="/var/lib/postgresql/15/main"
THRESHOLD=20
# Make sure the log directory exists
mkdir -p "$(dirname "$LOG_FILE")"
while true; do
# Read the iowait percentage (works across Linux distributions)
# iostat output format: avg-cpu: %user %nice %system %iowait %steal %idle
IO_WAIT=$(iostat -c 1 2 | grep -A1 "avg-cpu" | tail -1 | awk '{print $4}')
# Continue only if a valid number was read
if [ -n "$IO_WAIT" ] && [[ "$IO_WAIT" =~ ^[0-9.]+$ ]]; then
IS_HIGH=$(awk "BEGIN {print ($IO_WAIT > $THRESHOLD) ? 1 : 0}")
if [ "$IS_HIGH" -eq 1 ]; then
echo "[$(date)] IO Wait High: $IO_WAIT%" >> "$LOG_FILE"
# Record I/O-heavy processes (requires root)
if command -v iotop &> /dev/null; then
iotop -b -n 1 -o >> "$LOG_FILE" 2>&1
else
echo "iotop not installed, skipping process IO info" >> "$LOG_FILE"
fi
# Record the active SQL
psql -U postgres -c "
SELECT pid, query, now() - query_start as duration
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY duration DESC
LIMIT 5;
" >> "$LOG_FILE" 2>&1
fi
fi
sleep 60
done
Notes:
- iostat requires the sysstat package: sudo apt install sysstat or sudo yum install sysstat
- iotop must be installed separately and needs root: sudo apt install iotop
- To run the script: nohup sudo bash io_monitor.sh >/dev/null 2>&1 &
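6. Diagnosing Slow SQL Queries
6.1 Enable the slow query log
sql
-- Option 1: set this in postgresql.conf (logs statements that run longer than 1 second)
-- log_min_duration_statement = 1000
-- Option 2: change it dynamically
ALTER SYSTEM SET log_min_duration_statement = 1000;
SELECT pg_reload_conf();
-- Check the current setting
SHOW log_min_duration_statement;
Recommended configuration: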
ini
# postgresql.conf - slow query logging
# Configuration file paths:
# Ubuntu/Debian: /etc/postgresql/15/main/postgresql.conf
# CentOS/RHEL: /var/lib/pgsql/15/data/postgresql.conf
# Windows: C:\Program Files\PostgreSQL\15\data\postgresql.conf
#
# Apply with: SELECT pg_reload_conf(); or sudo systemctl reload postgresql
# All of these take effect on reload; no restart is needed
log_min_duration_statement = 1000 # 1000 ms
log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d,app=%a,client=%h '
log_checkpoints = on
log_connections = on
log_disconnections = on
log_lock_waits = on
log_temp_files = 0
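6.2 Analyze with pg_stat_statements
sql
-- Step 1: add these lines to postgresql.conf, then restart PostgreSQL
-- shared_preload_libraries = 'pg_stat_statements'
-- pg_stat_statements.track = all
-- Step 2: create the extension in the database you will query it from
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
-- Statements with the highest total execution time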
SELECT query,
calls,
total_exec_time / 1000 / 60 as total_minutes,
mean_exec_time as avg_ms,
min_exec_time,
max_exec_time,
rows,
100.0 * shared_blks_hit /
nullif(shared_blks_hit + shared_blks_read, 0) AS hit_percent
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
-- Statements with the highest average execution time
SELECT query,
calls,
round(mean_exec_time, 2) as avg_ms,
rows,
shared_blks_read
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 20;
-- Statements executed most often
SELECT query,
calls,
round(mean_exec_time, 2) as avg_ms,
total_exec_time / 1000 / 60 as total_minutes
FROM pg_stat_statements
ORDER BY calls DESC
LIMIT 20;
-- Reset the statistics
SELECT pg_stat_statements_reset();
6.3 Analyze execution plans
Using EXPLAIN ANALYZE:
sql
-- Show the plan with actual run times (note: EXPLAIN ANALYZE really executes the statement)
EXPLAIN ANALYZE
SELECT o.order_id, c.customer_name, o.amount
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.created_at >= '2024-01-01'
ORDER BY o.amount DESC
LIMIT 100;
-- More detailed output
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM orders WHERE customer_id = 12345;
-- Output in JSON format
EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON)
SELECT * FROM orders WHERE customer_id = 12345;
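Key items in an execution plan:
| Item | Meaning | Warning sign |
|---|---|---|
| Seq Scan | Full table scan | On a large table with a selective filter, consider adding an index |
| Hash Join | Hash join | On large tables it may run short of memory |
| Merge Join | Merge join | Needs sorted input, which can be slow |
| Nested Loop | Nested loop | The inner side is scanned repeatedly |
| Filter | Row filter | The condition is not served by an index |
| Sort | Sort | Spills to disk when the data exceeds work_mem |
| HashAggregate | Hash aggregation | Spills to disk when memory runs short |
| Buffers: shared hit | Pages found in shared_buffers | A low share of hits suggests too little cache |
| Buffers: shared read | Pages read from outside shared_buffers | High values indicate I/O-heavy work |
Common problem patterns: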
sql
-- Problem 1: full table scan
EXPLAIN ANALYZE SELECT * FROM orders WHERE status = 'pending';
-- Output: Seq Scan on orders (cost=0.00..12345.67 rows=100 width=100)
-- Fix: CREATE INDEX idx_orders_status ON orders(status);
-- Problem 2: index not used
EXPLAIN ANALYZE SELECT * FROM orders WHERE LOWER(customer_name) = 'john';
-- A plain index cannot serve a column wrapped in a function
-- Fix (expression index): CREATE INDEX idx_orders_customer_name_lower ON orders(LOWER(customer_name));
-- Problem 3: inefficient join
EXPLAIN ANALYZE
SELECT * FROM orders o
JOIN order_items i ON o.order_id = i.order_id
WHERE o.customer_id = 12345;
-- Check whether the join columns are indexed
-- Fix: CREATE INDEX idx_order_items_order_id ON order_items(order_id);
-- Problem 4: slow sort
EXPLAIN ANALYZE SELECT * FROM orders ORDER BY created_at DESC LIMIT 100;
-- If the plan has a Sort node, consider an index that provides the order
-- Fix: CREATE INDEX idx_orders_created_desc ON orders(created_at DESC);
6.4 View currently active queries
sql
-- Non-idle sessions whose current statement started more than 5 minutes ago
SELECT pid,
now() - pg_stat_activity.query_start AS duration,
usename,
client_addr,
state,
query
FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > interval '5 minutes'
AND state != 'idle'
ORDER BY duration DESC;
-- Sessions that are currently waiting, with their wait events
SELECT pid,
now() - query_start AS duration,
wait_event_type,
wait_event,
state,
query
FROM pg_stat_activity
WHERE wait_event IS NOT NULL
ORDER BY duration DESC;
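-- Stop a long-running query (replace 12345 with the actual PID)
SELECT pg_cancel_backend(12345); -- cancel the current query, keep the connection
SELECT pg_terminate_backend(12345); -- terminate the whole connection
6.5 Lock wait analysis
sql
-- Show blocked sessions together with the sessions blocking them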
SELECT blocked_locks.pid AS blocked_pid,
blocked_activity.usename AS blocked_user,
blocking_locks.pid AS blocking_pid,
blocking_activity.usename AS blocking_user,
blocked_activity.query AS blocked_statement,
blocking_activity.query AS current_statement_in_blocking_process,
blocked_activity.application_name AS blocked_application
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks ON blocking_locks.locktype = blocked_locks.locktype
AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;
-- Show table-level locks
SELECT relation::regclass,
locktype,
mode,
pid,
granted
FROM pg_locks
WHERE relation IS NOT NULL
ORDER BY relation;
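6.6 SQL optimization case studies
Case 1: optimizing pagination
sql
-- Problem SQL: pagination with a large offset
SELECT * FROM orders ORDER BY id LIMIT 10 OFFSET 100000;
-- Execution time: 2.5 s
-- Fix: keyset pagination; filter on the last id of the previous page
-- (100000 stands in for that value; it matches OFFSET 100000 only if ids have no gaps)
SELECT * FROM orders
WHERE id > 100000
ORDER BY id
LIMIT 10;
-- Execution time: 0.02 s
-- Or a deferred join that pages through the index first
SELECT o.* FROM orders o
JOIN (SELECT id FROM orders ORDER BY id LIMIT 10 OFFSET 100000) t
ON o.id = t.id;
-- Execution time: 0.15 s
Case 2: optimizing COUNT queries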
sql
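-- Problem SQL: COUNT on a large table
SELECT COUNT(*) FROM orders WHERE status = 'completed';
-- Execution time: 5 s
-- Option 1: index the filter column (allows an index-only scan when the visibility map is current)
CREATE INDEX idx_orders_status ON orders(status);
-- Option 2: use the planner's estimate (row count of the whole table as of the
-- last VACUUM/ANALYZE; it does not apply the status filter)
SELECT reltuples::bigint AS estimate_count
FROM pg_class
WHERE relname = 'orders';
-- Option 3: maintain a counter table
CREATE TABLE order_counts (
status VARCHAR(20) PRIMARY KEY,
count INTEGER
);
-- Keep the counts up to date with triggers
Case 3: optimizing a complex query
sql
-- Problem SQL: multiple IN subqueries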
SELECT * FROM orders
WHERE customer_id IN (SELECT customer_id FROM customers WHERE region = 'Asia')
AND product_id IN (SELECT product_id FROM products WHERE category = 'Electronics');
-- Option: rewrite as JOINs (DISTINCT guards against duplicate rows if a join key is not unique)
SELECT DISTINCT o.*
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN products p ON o.product_id = p.product_id
WHERE c.region = 'Asia' AND p.category = 'Electronics';
-- Or use EXISTS
SELECT * FROM orders o
WHERE EXISTS (SELECT 1 FROM customers c WHERE c.customer_id = o.customer_id AND c.region = 'Asia')
AND EXISTS (SELECT 1 FROM products p WHERE p.product_id = o.product_id AND p.category = 'Electronics');
7. Combined Diagnostic Scripts
7.1 System resource check script
bash
#!/bin/bash
# pg_health_check.sh - PostgreSQL health check script
# Usage: chmod +x pg_health_check.sh && ./pg_health_check.sh
echo "=== PostgreSQL Health Check Report ==="
echo "Date: $(date)"
echo ""
# CPU usage (works across Linux distributions)
echo "=== CPU Usage ==="
if command -v mpstat &> /dev/null; then
# Use mpstat to read the idle percentage
IDLE=$(mpstat 1 1 | tail -1 | awk '{print $NF}')
CPU_USED=$(awk "BEGIN {printf \"%.1f\", 100 - $IDLE}")
echo "CPU Usage: ${CPU_USED}%"
else
# Fallback: read /proc/stat (counters are cumulative since boot)
CPU_LINE=$(head -1 /proc/stat)
CPU_USED=$(echo "$CPU_LINE" | awk '{usage=100-($5*100)/($2+$3+$4+$5+$6+$7+$8); printf "%.1f", usage}')
echo "CPU Usage: ${CPU_USED}% (approximate)"
fi
echo ""
# Memory usage (works across Linux distributions)
echo "=== Memory Usage ==="
if command -v free &> /dev/null; then
MEM_INFO=$(free -m | grep "Mem:")
TOTAL=$(echo "$MEM_INFO" | awk '{print $2}')
USED=$(echo "$MEM_INFO" | awk '{print $3}')
if [ -n "$TOTAL" ] && [ "$TOTAL" -gt 0 ]; then
MEM_PCT=$(awk "BEGIN {printf \"%.1f\", ($USED/$TOTAL)*100}")
echo "Memory: ${USED}MB / ${TOTAL}MB (${MEM_PCT}%)"
else
echo "Memory: Unable to parse free output"
fi
else
echo "Memory: 'free' command not available"
fi
echo ""
# Disk space usage
echo "=== Disk Usage ==="
df -h | grep -E '^/dev|^Filesystem'
echo ""
# Disk I/O statistics
echo "=== Disk IO ==="
if command -v iostat &> /dev/null; then
iostat -x 1 2 | tail -n +4
else
echo "iostat not available. Install sysstat package."
fi
echo ""
# PostgreSQL connection counts
echo "=== PostgreSQL Connections ==="
if command -v psql &> /dev/null; then
psql -U postgres -c "
SELECT count(*) as total_connections,
count(*) FILTER (WHERE state = 'active') as active,
count(*) FILTER (WHERE state = 'idle') as idle,
count(*) FILTER (WHERE state = 'idle in transaction') as idle_in_transaction
FROM pg_stat_activity;
" 2>/dev/null || echo "Failed to connect to PostgreSQL"
else
echo "psql not found in PATH"
fi
echo ""
# Long-running queries
echo "=== Long Running Queries (> 1min) ==="
psql -U postgres -c "
SELECT pid, now() - query_start as duration, usename, state, left(query, 100) as query_preview
FROM pg_stat_activity
WHERE now() - query_start > interval '1 minute'
AND state != 'idle'
ORDER BY duration DESC
LIMIT 10;
" 2>/dev/null
echo ""
# Table bloat check (based on dead tuple counts)
echo "=== Table Bloat (Top 10) ==="
psql -U postgres -c "
SELECT schemaname, relname,
n_dead_tup,
round(100.0 * n_dead_tup / nullif(n_live_tup + n_dead_tup, 0), 2) as dead_ratio
FROM pg_stat_user_tables
WHERE n_dead_tup > 1000
ORDER BY n_dead_tup DESC
LIMIT 10;
" 2>/dev/null
echo ""
# Unused index check
echo "=== Unused Indexes ==="
psql -U postgres -c "
SELECT schemaname, relname, indexrelname,
idx_scan,
idx_tup_read,
idx_tup_fetch,
pg_size_pretty(pg_relation_size(indexrelid)) as index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC
LIMIT 10;
" 2>/dev/null
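Dependencies: make sure the following packages are installed before running the script:
bash
# Ubuntu/Debian
sudo apt install sysstat postgresql-client
# CentOS/RHEL
sudo yum install sysstat postgresql
7.2 Real-time monitoring script
bash
#!/bin/bash
# pg_realtime_monitor.sh - real-time monitoring script
# Usage: chmod +x pg_realtime_monitor.sh && ./pg_realtime_monitor.sh
# Exit: Ctrl+C
# Trap Ctrl+C so the script exits cleanly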
trap 'echo -e "\n\nMonitoring stopped."; exit 0' INT TERM
while true; do
clear
echo "=== PostgreSQL Realtime Monitor ($(date)) ==="
echo "Press Ctrl+C to exit"
echo ""
# CPU and memory
echo "--- System Resources ---"
PG_STATS=$(ps aux | grep 'postgres:' | grep -v grep | awk '
BEGIN {cpu=0; mem=0; count=0}
{cpu+=$3; mem+=$4; count++}
END {
if (count > 0) {
printf "PostgreSQL Processes: %d\nCPU: %.1f%% | Memory: %.1f%%", count, cpu, mem
} else {
printf "No PostgreSQL processes found"
}
}')
echo "$PG_STATS"
echo ""
# Connection counts
echo "--- Connections ---"
psql -U postgres -t -c "
SELECT 'Active: ' || COALESCE(count(*) FILTER (WHERE state = 'active'), 0) ||
' | Idle: ' || COALESCE(count(*) FILTER (WHERE state = 'idle'), 0) ||
' | Total: ' || count(*)
FROM pg_stat_activity;" 2>/dev/null || echo "Connection failed"
echo ""
# Active queries
echo "--- Active Queries (> 1s) ---"
psql -U postgres -c "
SELECT pid,
extract(epoch from (now() - query_start))::int as sec,
left(query, 60) as query_preview
FROM pg_stat_activity
WHERE state = 'active'
AND now() - query_start > interval '1 second'
AND query NOT LIKE '%pg_stat_activity%'
ORDER BY query_start
LIMIT 5;" 2>/dev/null
echo ""
# Lock waits
echo "--- Lock Waits ---"
psql -U postgres -t -c "
SELECT COALESCE(count(*), 0) || ' lock wait(s)'
FROM pg_stat_activity
WHERE wait_event IS NOT NULL
AND wait_event_type = 'Lock';" 2>/dev/null
echo ""
# Checkpoint
echo "--- Checkpoint Stats ---"
psql -U postgres -t -c "
SELECT 'Timed: ' || checkpoints_timed || ' | Requested: ' || checkpoints_req
FROM pg_stat_bgwriter;" 2>/dev/null
echo ""
# Table bloat warning
echo "--- Table Bloat Warning ---"
psql -U postgres -t -c "
SELECT count(*) || ' tables with >10% dead tuples'
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
AND 100.0 * n_dead_tup / nullif(n_live_tup + n_dead_tup, 0) > 10;" 2>/dev/null
sleep 5
done
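Usage tips:
- Run it inside a tmux or screen session so that a dropped SSH connection does not stop it
- To keep a copy of the output in a file: ./pg_realtime_monitor.sh | tee monitor.log
- For production, prefer a dedicated monitoring stack such as Prometheus + Grafana
8. Troubleshooting Quick Reference
8.1 Symptoms at a glance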
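| Symptom | Likely causes | What to check |
|---|---|---|
| Sustained high CPU | Missing indexes, expensive computation | top, pg_stat_statements |
| Memory shortage | work_mem too large, too many connections | free, pg_stat_activity |
| High disk I/O | Full table scans, frequent checkpoints | iostat, pg_stat_bgwriter |
| Slow queries | Missing indexes, lock waits | EXPLAIN ANALYZE, pg_locks |
| Too many connections | Connection leaks, misconfigured connection pool | pg_stat_activity, max_connections |
| Database appears hung | Lock contention, full disk | pg_locks, df -h |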
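8.2 Key monitoring metrics
| Category | Metric | Healthy value | Action when abnormal |
|---|---|---|---|
| Connections | Active connections | < 80% of max_connections | Check for connection leaks |
| Connections | idle in transaction | < 10 | Look for uncommitted transactions |
| Cache | Hit ratio | > 95% | Increase shared_buffers |
| Locks | Lock waits | < 10/min | Shorten transactions, add indexes |
| Table bloat | Dead tuple ratio | < 10% | Run VACUUM |
| Indexes | Unused indexes | = 0 | Drop indexes that are confirmed unused |
| Checkpoint | checkpoints_req / total | < 30% | Tune checkpoint parameters |
8.3 Summary of common diagnostic SQL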
sql
-- 1. Active connections
SELECT pid, usename, client_addr, state, query, now() - query_start as duration
FROM pg_stat_activity WHERE state != 'idle' ORDER BY duration DESC;
-- 2. Lock waits
SELECT * FROM pg_stat_activity WHERE wait_event_type = 'Lock';
-- 3. Table sizes
SELECT schemaname, relname, pg_size_pretty(pg_total_relation_size(relid))
FROM pg_stat_user_tables ORDER BY pg_total_relation_size(relid) DESC LIMIT 20;
-- 4. Index usage
SELECT schemaname, relname, indexrelname, idx_scan FROM pg_stat_user_indexes ORDER BY idx_scan;
-- 5. Slow queries
SELECT query, calls, mean_exec_time FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10;
-- 6. Table bloat
SELECT schemaname, relname, n_dead_tup FROM pg_stat_user_tables WHERE n_dead_tup > 10000;
-- 7. Cache hit ratio (nullif avoids division by zero on a fresh server)
SELECT round(sum(heap_blks_hit) / nullif(sum(heap_blks_hit) + sum(heap_blks_read), 0), 4) AS hit_ratio FROM pg_statio_user_tables;
-- 8. Database sizes
SELECT datname, pg_size_pretty(pg_database_size(datname)) FROM pg_database;
9. Best Practices
9.1 Building a monitoring system
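- Basic monitoring
  - Deploy a Prometheus + Grafana monitoring platform
  - Configure system resource alerts (CPU, memory, disk)
  - Set up PostgreSQL metric collection
- Database monitoring
  - Enable the pg_stat_statements extension
  - Configure the slow query log
  - Collect statistics regularly
- Alerting rules
  - CPU usage > 80% sustained for 5 minutes
  - Memory usage > 90%
  - Disk I/O wait > 20%
  - A sudden rise in the number of slow queries
  - Connection count close to the limit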
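A rule such as "above 80% sustained for 5 minutes" differs from a single threshold check: every sample in the window has to exceed the limit. The sketch below shows one way to express that in shell, assuming one sample per minute; `sustained_above` is a hypothetical helper, and a real deployment would normally leave this to the alerting system.

```bash
#!/bin/bash
# Hypothetical helper: succeed (exit status 0) only if every sample exceeds
# the threshold. Usage: sustained_above THRESHOLD SAMPLE [SAMPLE...]
sustained_above() {
    local threshold="$1"; shift
    [ "$#" -gt 0 ] || return 1   # no samples means nothing is sustained
    local s
    for s in "$@"; do
        # awk handles decimal comparison; it exits 0 when s > threshold
        awk -v s="$s" -v t="$threshold" 'BEGIN { exit !(s > t) }' || return 1
    done
    return 0
}

# Five one-minute CPU samples
if sustained_above 80 85.2 91.0 88.4 83.9 95.1; then
    echo "ALERT: CPU above 80% for the whole window"
fi
```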
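9.2 Regular maintenance tasks
sql
-- Weekly: refresh table statistics
ANALYZE;
-- Monthly: review index usage
SELECT schemaname, relname, indexrelname, idx_scan
FROM pg_stat_user_indexes WHERE idx_scan = 0;
-- Regularly: VACUUM ANALYZE large tables
VACUUM ANALYZE large_table;
-- Quarterly: rebuild heavily bloated tables
-- Note: VACUUM FULL locks the table, so run it during off-peak hours
9.3 Configuration tuning checklist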
ini
# postgresql.conf tuning suggestions
# ============================================================
# Configuration file paths:
# Ubuntu/Debian: /etc/postgresql/{version}/main/postgresql.conf
# CentOS/RHEL: /var/lib/pgsql/{version}/data/postgresql.conf
# Windows: C:\Program Files\PostgreSQL\{version}\data\postgresql.conf
#
# How to change:
# Option 1: edit the file, then run pg_reload_conf() or restart the service
# Option 2: ALTER SYSTEM SET name = 'value'; then pg_reload_conf()
#
# Find the configuration file: psql -c "SHOW config_file;"
# ============================================================
# Memory (adjust to the server's RAM)
# Note: changing shared_buffers requires a server restart
shared_buffers = 8GB # 25% of system memory
effective_cache_size = 24GB # 75% of system memory
work_mem = 32MB # scale with the connection count; no restart needed
maintenance_work_mem = 1GB # used by VACUUM and similar operations; no restart needed
# Checkpoints
checkpoint_timeout = 15min # no restart needed
max_wal_size = 4GB # no restart needed
checkpoint_completion_target = 0.9 # no restart needed
# Logging (no restart needed)
log_min_duration_statement = 1000
log_checkpoints = on
log_lock_waits = on
log_temp_files = 0
# Statistics collection (no restart needed)
track_activities = on
track_counts = on
track_io_timing = on
track_functions = all
# Autovacuum (no restart needed)
autovacuum = on
autovacuum_vacuum_cost_limit = 200
# Parallel query (no restart needed)
max_parallel_workers_per_gather = 2
max_parallel_workers = 4
Applying the configuration quickly:
bash
# Option 1: ALTER SYSTEM (recommended; no file editing). shared_buffers only takes effect after a restart
psql -U postgres << EOF
ALTER SYSTEM SET shared_buffers = '8GB';
ALTER SYSTEM SET effective_cache_size = '24GB';
ALTER SYSTEM SET work_mem = '32MB';
SELECT pg_reload_conf();
EOF
# Option 2: edit the configuration file directly
sudo vim /etc/postgresql/15/main/postgresql.conf
sudo systemctl reload postgresql # or: sudo systemctl restart postgresql
# List the parameters that still need a restart
psql -U postgres -c "SELECT name, setting, pending_restart FROM pg_settings WHERE pending_restart = true;"
10. Summary
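Diagnosing PostgreSQL performance problems means looking at several layers:
- System level: use top, iostat, free, and similar tools to confirm which resource is the bottleneck
- Database level: use the pg_stat_* views to analyze connections, locks, and cache behavior
- SQL level: use EXPLAIN ANALYZE to analyze execution plans
Core troubleshooting paths:
- High CPU → find the processes consuming CPU → map them to SQL → analyze the execution plan
- High memory → review the memory parameters → analyze work_mem and the connection count → look for abnormal memory growth
- High I/O → check checkpoint frequency → analyze table bloat → find the I/O-intensive SQL
- Slow queries → check the execution plan → add suitable indexes → rewrite the SQL
Remember that performance tuning is an ongoing process: keeping a database in good shape takes a solid monitoring setup and regular maintenance.
References
- PostgreSQL official documentation: https://www.postgresql.org/docs/current/monitoring.html
- pg_stat_statements extension: https://www.postgresql.org/docs/current/pgstatstatements.html
- PostgreSQL Wiki, Performance Optimization: https://wiki.postgresql.org/wiki/Performance_Optimization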