Troubleshooting a "deadlock" across the JVM and PostgreSQL

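1. Discovering the problem

While operating the system from its web page, the front end sent a request but the back end never answered. The request stayed in the pending state for several minutes without any response.

The class that handles this business operation uses a global lock:

```java
private static final ReentrantLock GLOBAL_LOCK = new ReentrantLock();
```

The business logic makes this lock mandatory: external requests from different systems modify the same row, or even the same batch of rows. My first reaction was that this had to be a deadlock.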
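Since every request funnels through this single lock, one stalled holder leaves all other callers pending with no error. One way to surface that early is to acquire the lock with a timeout. The following is only a minimal sketch, not the original code: the class `GuardedOperation`, the method `runWithLock`, and the timeout values are assumptions made for illustration.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

// Hypothetical helper; only the GLOBAL_LOCK declaration mirrors the article.
final class GuardedOperation {
    private static final ReentrantLock GLOBAL_LOCK = new ReentrantLock();

    /**
     * Runs the action under GLOBAL_LOCK, but gives up with an exception
     * if the lock cannot be acquired within the given timeout.
     */
    static <T> T runWithLock(Supplier<T> action, long timeout, TimeUnit unit)
            throws InterruptedException {
        if (!GLOBAL_LOCK.tryLock(timeout, unit)) {
            throw new IllegalStateException(
                    "GLOBAL_LOCK still busy after " + timeout + " " + unit);
        }
        try {
            return action.get();
        } finally {
            // Always release, even if the action throws.
            GLOBAL_LOCK.unlock();
        }
    }
}
```

With this shape a blocked request fails after a bounded wait and leaves a log line behind, instead of hanging until someone takes a thread dump. It does not shorten the time the lock is held; it only makes the symptom visible.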

2. Checking for a JVM deadlock

```bash
jstack -l <pid> > stack.log
```

The dump contained none of the typical deadlock markers:

❌ Found one Java-level deadlock

❌ waiting to lock <xxx> held by ... forming a cycle
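The same check can be made from inside the JVM with the standard `java.lang.management` API, which is handy for a health endpoint or a scheduled probe. This is a sketch under assumptions: the class name `DeadlockCheck` and the output format are made up for illustration.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that reports deadlock cycles the JVM itself can detect.
final class DeadlockCheck {
    /** Returns one line per deadlocked thread; the list is empty when no cycle exists. */
    static List<String> deadlockedThreads() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // Covers both object monitors and ownable synchronizers such as ReentrantLock.
        // Returns null when there is no deadlock.
        long[] ids = mx.findDeadlockedThreads();
        List<String> result = new ArrayList<>();
        if (ids == null) {
            return result;
        }
        for (ThreadInfo info : mx.getThreadInfo(ids)) {
            if (info != null) { // a thread may have exited in the meantime
                result.add(info.getThreadName() + " waits for " + info.getLockName()
                        + " held by " + info.getLockOwnerName());
            }
        }
        return result;
    }
}
```

An empty result only rules out a lock cycle inside this JVM. A thread that holds a lock while waiting on an external system will not show up here.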

Next, locate the thread serving the page's request:

```log
"http-nio-0.0.0.0-8081-exec-340" Id=1848 cpuUsage=0.0% deltaTime=0ms time=28ms WAITING on java.util.concurrent.locks.ReentrantLock$NonfairSync@ed361db owned by "DomainForceCommandTranslator" Id=206
    at java.base@17.0.11/jdk.internal.misc.Unsafe.park(Native Method)
    -  waiting on java.util.concurrent.locks.ReentrantLock$NonfairSync@ed361db
    at java.base@17.0.11/java.util.concurrent.locks.LockSupport.park(LockSupport.java:211)
    at java.base@17.0.11/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:715)
    at java.base@17.0.11/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:938)
    at java.base@17.0.11/java.util.concurrent.locks.ReentrantLock$Sync.lock(ReentrantLock.java:153)
    at java.base@17.0.11/java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:322)
```

It is waiting for the thread with Id=206 to release the ReentrantLock GLOBAL_LOCK, and a large number of other threads are queued on the same lock.
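The "owned by" part of that dump line is also available programmatically: `ThreadInfo` reports the owner of the lock a thread is parked on. A small sketch, assuming a hypothetical helper class `LockOwnerProbe`:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;

// Hypothetical helper: asks the JVM which thread owns the lock another thread waits for.
final class LockOwnerProbe {
    /**
     * Returns the name of the thread that owns the lock `waiter` is blocked on,
     * or null if the thread is not waiting on an owned lock (or no longer exists).
     */
    static String ownerOfLockAwaitedBy(Thread waiter) {
        ThreadInfo info = ManagementFactory.getThreadMXBean().getThreadInfo(waiter.getId());
        return info == null ? null : info.getLockOwnerName();
    }
}
```

Grouping all waiting threads by this owner name is a quick way to confirm that a whole pool is queued behind a single thread, as was the case here.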

So what is thread 206 doing?

```log
"DomainForceCommandTranslator" Id=206 cpuUsage=1.52% deltaTime=3ms time=444450ms RUNNABLE (in native)
    at java.base@17.0.11/sun.nio.ch.Net.poll(Native Method)
    at java.base@17.0.11/sun.nio.ch.NioSocketImpl.park(NioSocketImpl.java:186)
    at java.base@17.0.11/sun.nio.ch.NioSocketImpl.timedRead(NioSocketImpl.java:290)
    at java.base@17.0.11/sun.nio.ch.NioSocketImpl.implRead(NioSocketImpl.java:314)
    at java.base@17.0.11/sun.nio.ch.NioSocketImpl.read(NioSocketImpl.java:355)
    at java.base@17.0.11/sun.nio.ch.NioSocketImpl$1.read(NioSocketImpl.java:808)
    at java.base@17.0.11/java.net.Socket$SocketInputStream.read(Socket.java:966)
    at org.postgresql.core.VisibleBufferedInputStream.readMore(VisibleBufferedInputStream.java:161)
    at org.postgresql.core.VisibleBufferedInputStream.ensureBytes(VisibleBufferedInputStream.java:128)
    at org.postgresql.core.VisibleBufferedInputStream.ensureBytes(VisibleBufferedInputStream.java:113)
    at org.postgresql.core.VisibleBufferedInputStream.read(VisibleBufferedInputStream.java:73)
    at org.postgresql.core.PGStream.receiveChar(PGStream.java:465)
    at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2155)
    at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:368)
    at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:498)
    at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:415)
    at org.postgresql.jdbc.PgPreparedStatement.executeWithFlags(PgPreparedStatement.java:190)
    at org.postgresql.jdbc.PgPreparedStatement.execute(PgPreparedStatement.java:177)
```

❗ The thread is waiting for PostgreSQL to return a result.

3. Investigating why PostgreSQL was not responding

```sql
postgres=# SELECT pid,client_addr,state,now() - xact_start AS tx_duration,query FROM pg_stat_activity WHERE xact_start IS NOT NULL ORDER BY tx_duration DESC;
  pid   | client_addr | state  |   tx_duration   |                                                                       query
--------+-------------+--------+-----------------+----------------------------------------------------------------------------------------------------------------------------------------------------
 120657 | 127.0.0.1   | active | 00:06:26.896642 |                                                                                                                                                   +
        |             |        |                 |             UPDATE domain_force_strategy_record                                                                                                   +
        |             |        |                 |             SET                                                                                                                                   +
        |             |        |                 |             version = $1,                                                                                                                         +
        |             |        |                 |             cancel_command_id = $2,                                                                                                               +
        |             |        |                 |             operation_type = $3,                                                                                                                  +
        |             |        |                 |             send_status = $4,                                                                                                                     +
        |             |        |                 |             send_message = $5,                                                                                                                    +
        |             |        |                 |             ack_status = $6,                                                                                                                      +
        |             |        |                 |             ack_message = $7,                                                                                                                     +
        |             |        |                 |             updated_at = NOW()                                                                                                                    +
        |             |        |                 |             WHERE id = $8                                                                                                                         +
        |             |        |                 |
 212857 | 127.0.0.1   | active | 00:00:00        | SELECT pid,client_addr,state,now() - xact_start AS tx_duration,query FROM pg_stat_activity WHERE xact_start IS NOT NULL ORDER BY tx_duration DESC;
(2 rows)
```

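The transaction running UPDATE domain_force_strategy_record had been open for more than 6 minutes, yet an update by primary key should never be that slow. The execution plan confirmed that the statement itself runs in milliseconds.

tx_duration is very long.

💥 Working hypothesis:

❗ Some transaction has not committed → it is holding locks → thread 206 is waiting on it

Next, check whether anything is waiting on a lock: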

```sql
postgres=# SELECT pid,state,wait_event_type,wait_event,query,now() - query_start AS duration FROM pg_stat_activity WHERE state <> 'idle' ORDER BY duration DESC;
  pid   | state  | wait_event_type |   wait_event   |                                                                         query                                                                         |    duration
--------+--------+-----------------+----------------+-------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------
 212979 | active | Timeout         | VacuumTruncate | autovacuum: VACUUM public.domain_force_strategy_record                                                                                                | 00:01:08.821759
 212857 | active |                 |                | SELECT pid,state,wait_event_type,wait_event,query,now() - query_start AS duration FROM pg_stat_activity WHERE state <> 'idle' ORDER BY duration DESC; | 00:00:00
 120657 | active |                 |                |                                                                                                                                                      +| -00:00:00.00069
        |        |                 |                |             UPDATE domain_force_strategy_record                                                                                                      +|
        |        |                 |                |             SET                                                                                                                                      +|
        |        |                 |                |             version = $1,                                                                                                                            +|
        |        |                 |                |             cancel_command_id = $2,                                                                                                                  +|
        |        |                 |                |             operation_type = $3,                                                                                                                     +|
        |        |                 |                |             send_status = $4,                                                                                                                        +|
        |        |                 |                |             send_message = $5,                                                                                                                       +|
        |        |                 |                |             ack_status = $6,                                                                                                                         +|
        |        |                 |                |             ack_message = $7,                                                                                                                        +|
        |        |                 |                |             updated_at = NOW()                                                                                                                       +|
        |        |                 |                |             WHERE id = $8                                                                                                                            +|
        |        |                 |                |                                                                                                                                                       |
(3 rows)
```

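wait_event_type = null for pid 120657

✅ Not a lock wait

A lock wait would show wait_event_type = Lock.

✅ Not blocked on I/O

An I/O wait would show wait_event_type = IO.

Note also that duration for pid 120657 is effectively zero. It is slightly negative only because now() is fixed at the start of the monitoring query's own transaction, so a statement issued a moment later appears to start "in the future". In other words, this particular UPDATE had only just been sent, even though its transaction was already minutes old.

To double-check for lock waits: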

```sql
-- Pair every waiting session with the sessions that block it.
SELECT 
    blocked.pid AS blocked_pid,
    blocked.query AS blocked_query,
    blocking.pid AS blocking_pid,
    blocking.query AS blocking_query,
    blocked.wait_event_type,
    blocked.wait_event
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking 
ON blocked.wait_event IS NOT NULL
AND blocking.pid = ANY(pg_blocking_pids(blocked.pid));

-- Ask directly who blocks the backend that owns the long transaction.
SELECT pg_blocking_pids(120657);
```

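Both queries came back empty, which confirms again that there is no lock wait. Putting the evidence together:

```
tx_duration      = 6 min
state            = active (a statement is executing at the instant of sampling, but query_start ≈ now(), so it had only just been issued; a session waiting for the client's next command would show "idle in transaction")
execution plan   = 0.2 ms
pg_blocking_pids = {} (nothing is blocking it)
```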

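👉 Which means:

❗ Each SQL statement finishes almost immediately, but the transaction never ends (it is stuck "inside the transaction")

Most likely the application layer simply has not committed yet.

👉 From PostgreSQL's point of view:

Individual statements complete ✅

The transaction is still open ❗

xact_start keeps counting ⏱

👉 This explains all three observations:

Why the execution plan is fast

Why nothing is blocking

Why the duration is so long

The problem lies in the application's transaction control (a long transaction).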
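The reasoning above boils down to a small decision table over four columns of pg_stat_activity. The sketch below encodes it in Java purely to make the logic explicit. The class `SessionDiagnosis`, its labels, and the threshold are assumptions for illustration and are not part of any PostgreSQL or JDBC API.

```java
import java.time.Duration;

// Hypothetical classifier for one sampled pg_stat_activity row.
final class SessionDiagnosis {
    /**
     * @param state         value of the state column, e.g. "active"
     * @param waitEventType value of wait_event_type, or null when empty
     * @param txAge         now() - xact_start
     * @param statementAge  now() - query_start
     * @param longTx        threshold above which a transaction counts as long (assumed)
     */
    static String classify(String state, String waitEventType,
                           Duration txAge, Duration statementAge, Duration longTx) {
        if ("Lock".equals(waitEventType)) {
            return "lock wait";
        }
        if ("IO".equals(waitEventType)) {
            return "I/O wait";
        }
        if (txAge.compareTo(longTx) < 0) {
            return "normal";
        }
        if ("idle in transaction".equals(state)) {
            return "open transaction, waiting on the client";
        }
        if ("active".equals(state) && statementAge.compareTo(longTx) < 0) {
            // Old transaction, young statement: the client keeps issuing fast statements.
            return "long transaction made of short statements";
        }
        return "long-running statement";
    }

    public static void main(String[] args) {
        // Values taken from the two samples in this article (statement age rounded to zero).
        System.out.println(classify("active", null,
                Duration.ofSeconds(386), Duration.ZERO, Duration.ofMinutes(1)));
    }
}
```

For the session observed here (active, no wait event, a transaction more than 6 minutes old, a statement that had just started), the sketch lands on the "long transaction made of short statements" branch, which points at the application rather than the database.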

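4. Investigating the business code

The database itself was fine, so the problem was back in the JVM. I used Arthas to trace the relevant methods and find the slow code.

The SQL text was enough to locate the method. Tracing the calls one by one then exposed the problem: the method looped 500 times and took an extremely long time. After analyzing the business data and optimizing accordingly, the problem was resolved.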
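The article does not say what the optimization was. As one possible direction, and only as an assumption, a loop that issues one UPDATE per iteration can hand all rows to the driver as a single JDBC batch and commit once. In the sketch below the table and column names come from the UPDATE shown earlier. The class `BatchedStatusUpdate`, the reduced column list, and the parameter types are hypothetical.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Map;

// Hypothetical sketch: update many rows as one JDBC batch in one short transaction.
final class BatchedStatusUpdate {
    // Only two of the article's columns are shown; their Java types are assumptions.
    private static final String SQL =
            "UPDATE domain_force_strategy_record "
            + "SET send_status = ?, updated_at = NOW() WHERE id = ?";

    /** Applies every (id -> send_status) pair and returns the per-row update counts. */
    static int[] updateAll(Connection conn, Map<Long, Integer> sendStatusById)
            throws SQLException {
        boolean previousAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false);
        try (PreparedStatement ps = conn.prepareStatement(SQL)) {
            for (Map.Entry<Long, Integer> entry : sendStatusById.entrySet()) {
                ps.setInt(1, entry.getValue());
                ps.setLong(2, entry.getKey());
                ps.addBatch(); // queue the row instead of executing it immediately
            }
            int[] counts = ps.executeBatch();
            conn.commit(); // one commit for the whole batch
            return counts;
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        } finally {
            conn.setAutoCommit(previousAutoCommit);
        }
    }
}
```

Whether this helps depends on where the time actually went. If the 500 iterations were slow because of work between the statements, the more important change is to do that work before taking GLOBAL_LOCK and opening the transaction, so that both are held only for the writes.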

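Additional hardening

```
statement_timeout = 5s  # cap the run time of any single SQL statement
```

Note that statement_timeout limits each statement, not the transaction as a whole, so it protects against a runaway query rather than against a long loop of fast statements like the one found here.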
