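1. Spotting the problem
While operating the system through its web page, the front end sent a request but the back end never responded: the request stayed in the pending state for several minutes with no reply.
The class that handles this business operation has a global lock: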
java
private static final ReentrantLock GLOBAL_LOCK = new ReentrantLock();
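The business logic makes this lock mandatory: external requests from different systems may modify the same row, or even the same batch of rows. My first reaction was: is this a deadlock?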
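As an aside, a plain `lock()` call waits indefinitely, which is why the request hung with no error. The following is a minimal sketch, not the original code, of acquiring the same kind of global lock with a bounded wait so that a stalled owner surfaces as an exception. The class name `GlobalLockSketch` and the helper `withGlobalLock` are hypothetical.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

public final class GlobalLockSketch {
    // Same declaration as in the article; everything else here is illustrative.
    private static final ReentrantLock GLOBAL_LOCK = new ReentrantLock();

    /** Runs the task under the global lock, or throws if the lock is not free within the timeout. */
    static <T> T withGlobalLock(long timeout, TimeUnit unit, Supplier<T> task) {
        boolean acquired;
        try {
            acquired = GLOBAL_LOCK.tryLock(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag for callers
            throw new IllegalStateException("interrupted while waiting for GLOBAL_LOCK", e);
        }
        if (!acquired) {
            throw new IllegalStateException("GLOBAL_LOCK busy for more than " + timeout + " " + unit);
        }
        try {
            return task.get();
        } finally {
            GLOBAL_LOCK.unlock(); // always release, even if the task throws
        }
    }

    public static void main(String[] args) {
        String result = withGlobalLock(2, TimeUnit.SECONDS, () -> "updated");
        System.out.println(result);
    }
}
```

Whether failing fast is acceptable depends on the business requirement; the sketch only shows that the wait can be bounded.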
2. Checking for a JVM deadlock
bash
jstack -l <pid> > stack.log
The full dump contains none of the typical deadlock markers:
❌ Found one Java-level deadlock
❌ waiting to lock <xxx> held by ... forming a cycle
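The same check can also be made from inside the JVM. The sketch below uses the standard `ThreadMXBean.findDeadlockedThreads()` call, which covers both object monitors and ownable synchronizers such as `ReentrantLock`. The class name `DeadlockCheck` and the report format are assumptions made for illustration.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public final class DeadlockCheck {
    /** Returns one line per deadlocked thread, or an empty string if no deadlock is found. */
    static String describeDeadlocks() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = mx.findDeadlockedThreads(); // null when there is no deadlock
        if (ids == null) {
            return "";
        }
        StringBuilder report = new StringBuilder();
        for (ThreadInfo info : mx.getThreadInfo(ids, true, true)) {
            if (info == null) {
                continue; // the thread ended between the two calls
            }
            report.append(info.getThreadName())
                  .append(" waits for ").append(info.getLockName())
                  .append(" held by ").append(info.getLockOwnerName())
                  .append('\n');
        }
        return report.toString();
    }

    public static void main(String[] args) {
        String report = describeDeadlocks();
        System.out.println(report.isEmpty() ? "no deadlock detected" : report);
    }
}
```

An empty result here matches what the dump showed: many threads waiting on one owner is contention, not a cycle.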
Next, find the thread serving the page's request:
log
"http-nio-0.0.0.0-8081-exec-340" Id=1848 cpuUsage=0.0% deltaTime=0ms time=28ms WAITING on java.util.concurrent.locks.ReentrantLock$NonfairSync@ed361db owned by "DomainForceCommandTranslator" Id=206
at java.base@17.0.11/jdk.internal.misc.Unsafe.park(Native Method)
- waiting on java.util.concurrent.locks.ReentrantLock$NonfairSync@ed361db
at java.base@17.0.11/java.util.concurrent.locks.LockSupport.park(LockSupport.java:211)
at java.base@17.0.11/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:715)
at java.base@17.0.11/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:938)
at java.base@17.0.11/java.util.concurrent.locks.ReentrantLock$Sync.lock(ReentrantLock.java:153)
at java.base@17.0.11/java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:322)
It is waiting for the thread with Id=206 to release the ReentrantLock GLOBAL_LOCK, and a large number of other threads are waiting for the same lock.
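The "owned by" part of that thread line corresponds to information the JVM exposes through `ThreadInfo.getLockOwnerName()`. Here is a small self-contained sketch that reproduces the situation with two hypothetical threads, `demo-holder` and `demo-waiter`, and asks the JVM who owns the contended lock. All names are illustrative.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

public final class LockOwnerLookup {
    /** Blocks one thread on a lock held by another and returns the reported owner name. */
    static String findOwnerOfContendedLock() throws InterruptedException {
        ReentrantLock lock = new ReentrantLock();
        CountDownLatch held = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);

        Thread holder = new Thread(() -> {
            lock.lock();
            try {
                held.countDown();  // tell the caller the lock is now held
                release.await();   // keep holding until told to release
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                lock.unlock();
            }
        }, "demo-holder");
        Thread waiter = new Thread(() -> {
            lock.lock();           // parks here while demo-holder owns the lock
            lock.unlock();
        }, "demo-waiter");

        holder.start();
        held.await();
        waiter.start();
        try {
            // Poll until the waiter is parked on the lock.
            while (waiter.getState() != Thread.State.WAITING) {
                Thread.sleep(10);
            }
            ThreadInfo info = ManagementFactory.getThreadMXBean().getThreadInfo(waiter.getId());
            return info == null ? null : info.getLockOwnerName();
        } finally {
            release.countDown();
            holder.join();
            waiter.join();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("lock owner: " + findOwnerOfContendedLock());
    }
}
```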
Now look at what thread 206 is doing:
log
"DomainForceCommandTranslator" Id=206 cpuUsage=1.52% deltaTime=3ms time=444450ms RUNNABLE (in native)
at java.base@17.0.11/sun.nio.ch.Net.poll(Native Method)
at java.base@17.0.11/sun.nio.ch.NioSocketImpl.park(NioSocketImpl.java:186)
at java.base@17.0.11/sun.nio.ch.NioSocketImpl.timedRead(NioSocketImpl.java:290)
at java.base@17.0.11/sun.nio.ch.NioSocketImpl.implRead(NioSocketImpl.java:314)
at java.base@17.0.11/sun.nio.ch.NioSocketImpl.read(NioSocketImpl.java:355)
at java.base@17.0.11/sun.nio.ch.NioSocketImpl$1.read(NioSocketImpl.java:808)
at java.base@17.0.11/java.net.Socket$SocketInputStream.read(Socket.java:966)
at org.postgresql.core.VisibleBufferedInputStream.readMore(VisibleBufferedInputStream.java:161)
at org.postgresql.core.VisibleBufferedInputStream.ensureBytes(VisibleBufferedInputStream.java:128)
at org.postgresql.core.VisibleBufferedInputStream.ensureBytes(VisibleBufferedInputStream.java:113)
at org.postgresql.core.VisibleBufferedInputStream.read(VisibleBufferedInputStream.java:73)
at org.postgresql.core.PGStream.receiveChar(PGStream.java:465)
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2155)
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:368)
at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:498)
at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:415)
at org.postgresql.jdbc.PgPreparedStatement.executeWithFlags(PgPreparedStatement.java:190)
at org.postgresql.jdbc.PgPreparedStatement.execute(PgPreparedStatement.java:177)
❗ The thread is waiting for PostgreSQL to return a result.
3. Investigating why PostgreSQL is not responding
sql
postgres=# SELECT pid,client_addr,state,now() - xact_start AS tx_duration,query FROM pg_stat_activity WHERE xact_start IS NOT NULL ORDER BY tx_duration DESC;
pid | client_addr | state | tx_duration | query
--------+-------------+--------+-----------------+----------------------------------------------------------------------------------------------------------------------------------------------------
120657 | 127.0.0.1 | active | 00:06:26.896642 | +
| | | | UPDATE domain_force_strategy_record +
| | | | SET +
| | | | version = $1, +
| | | | cancel_command_id = $2, +
| | | | operation_type = $3, +
| | | | send_status = $4, +
| | | | send_message = $5, +
| | | | ack_status = $6, +
| | | | ack_message = $7, +
| | | | updated_at = NOW() +
| | | | WHERE id = $8 +
| | | |
212857 | 127.0.0.1 | active | 00:00:00 | SELECT pid,client_addr,state,now() - xact_start AS tx_duration,query FROM pg_stat_activity WHERE xact_start IS NOT NULL ORDER BY tx_duration DESC;
(2 rows)
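The transaction containing the UPDATE domain_force_strategy_record statement has been open for more than 6 minutes. An update by primary key should never be that slow, and the execution plan confirms it runs in milliseconds.

tx_duration is very long
💥 This means:
❗ a transaction has stayed open without committing → it keeps holding its locks → thread 206 is waiting on it
Next, check whether there is a lock wait: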
sql
postgres=# SELECT pid,state,wait_event_type,wait_event,query,now() - query_start AS duration FROM pg_stat_activity WHERE state <> 'idle' ORDER BY duration DESC;
pid | state | wait_event_type | wait_event | query | duration
--------+--------+-----------------+----------------+-------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------
212979 | active | Timeout | VacuumTruncate | autovacuum: VACUUM public.domain_force_strategy_record | 00:01:08.821759
212857 | active | | | SELECT pid,state,wait_event_type,wait_event,query,now() - query_start AS duration FROM pg_stat_activity WHERE state <> 'idle' ORDER BY duration DESC; | 00:00:00
120657 | active | | | +| -00:00:00.00069
| | | | UPDATE domain_force_strategy_record +|
| | | | SET +|
| | | | version = $1, +|
| | | | cancel_command_id = $2, +|
| | | | operation_type = $3, +|
| | | | send_status = $4, +|
| | | | send_message = $5, +|
| | | | ack_status = $6, +|
| | | | ack_message = $7, +|
| | | | updated_at = NOW() +|
| | | | WHERE id = $8 +|
| | | | |
(3 rows)
wait_event_type is NULL for the UPDATE
✅ Not a lock wait
A lock wait would show wait_event_type = Lock
✅ Not blocked on I/O
An I/O wait would show wait_event_type = IO
Verify whether anything is blocking it:
sql
SELECT
blocked.pid AS blocked_pid,
blocked.query AS blocked_query,
blocking.pid AS blocking_pid,
blocking.query AS blocking_query,
blocked.wait_event_type,
blocked.wait_event
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking
ON blocked.wait_event IS NOT NULL
AND blocking.pid = ANY(pg_blocking_pids(blocked.pid));
SELECT pg_blocking_pids(120657);
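Both queries return nothing, which confirms again that there is no lock wait. Putting it together:
tx_duration = 6 min
state = active (a statement is executing at the instant of sampling; its query_start is under a millisecond old, so it is a freshly issued statement, not one that has been running for six minutes)
execution plan: 0.2 ms
pg_blocking_pids = {} (nothing is blocking it)
👉 This means:
❗ Each SQL statement finishes quickly, but the transaction never ends (it is stuck "inside the transaction")
Most likely the application layer has not committed
👉 From PostgreSQL's point of view:
Each statement executes quickly ✅
The transaction is still open ❗
xact_start keeps counting ⏱
👉 This explains exactly:
why the execution plan is fast
why nothing is blocking
why the elapsed time is so long
Something is wrong with transaction control in the application layer (a long-running transaction)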
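4. Investigating the application code
The database itself was fine, so the problem was back in the JVM. I used Arthas to trace the method in question and find the slow code.
The SQL statement pinpoints the method; tracing its calls one by one then exposed the problem: the method loops 500 times and takes an extremely long time. After analyzing it against the business data and optimizing it, the problem was finally resolved.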

Additional optimization
statement_timeout = 5s # set a timeout for SQL statements in the database
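A similar limit can be applied per statement on the client side through the standard JDBC call `Statement.setQueryTimeout`. The sketch below is illustrative only: the class `QueryTimeoutSketch`, the method `updateWithTimeout`, and its parameters are hypothetical, and the caller is assumed to supply an open `Connection`. Note that both this setting and `statement_timeout` bound a single statement; neither limits how long a transaction made of many fast statements stays open.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public final class QueryTimeoutSketch {
    /**
     * Executes a single-parameter UPDATE with a per-statement timeout.
     * The sql argument is expected to contain exactly one "?" placeholder for the id.
     */
    static int updateWithTimeout(Connection conn, String sql, long id, int timeoutSeconds)
            throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setQueryTimeout(timeoutSeconds); // the driver cancels the statement after this many seconds
            ps.setLong(1, id);
            return ps.executeUpdate();          // number of rows updated
        }
    }
}
```

A call might look like `updateWithTimeout(conn, "UPDATE domain_force_strategy_record SET updated_at = NOW() WHERE id = ?", 1L, 5)`, where `conn` and the id value are placeholders.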