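Background
We ran into a classic production problem in our project: when several users generated AI images/videos concurrently through the same account, the system suddenly reported "Prisma reached the maximum number of connections". We later traced the error to deadlocked database transactions.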
Tech stack
- Runtime: Cloudflare Workers
- Database ORM: Prisma
- Database: PostgreSQL
- Framework: Hono
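Problem analysis
1. Characteristics of Cloudflare Workers
Cloudflare Workers is a serverless environment. Requests run in V8 isolates that the platform creates, reuses, and evicts as it sees fit, and I/O objects such as database connections opened while handling one request cannot be used from another request. In our setup this meant:
- Each request was effectively handled in isolation
- Each request initialized its own Prisma client
- Each Prisma client opened its own database connection(s)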
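To make the consequence of the list above concrete, here is a rough back-of-the-envelope sketch. The function name and the sample numbers are assumptions made for illustration; the only external fact used is that PostgreSQL's default `max_connections` is 100.

```typescript
// Rough upper bound on simultaneous database connections when every
// in-flight request owns its own client. Both inputs are assumptions
// you would measure in your own deployment.
function estimatePeakConnections(
  concurrentRequests: number,
  connectionsPerClient: number,
): number {
  if (concurrentRequests < 0 || connectionsPerClient < 0) {
    throw new RangeError("inputs must be non-negative");
  }
  return concurrentRequests * connectionsPerClient;
}

// Hypothetical burst: 40 in-flight requests, each client holding 3 connections.
const peak = estimatePeakConnections(40, 3);
const postgresDefaultMaxConnections = 100;
console.log(peak, peak > postgresDefaultMaxConnections);
```

With these sample numbers the estimate is 120, which is already above the default limit. Requests that sit waiting on a lock stay in flight longer, which pushes the concurrent count up further.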
2. Why the deadlock occurs
Our credits system relied on a multi-step transaction:
typescript
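// Problem code: prone to lock contention and deadlocks
await prisma.$transaction(async (tx) => {
  // 1. Read the user's credits. A plain SELECT takes no row lock in
  //    PostgreSQL (MVCC), so concurrent requests can all read the same balance.
  const userCredits = await tx.userCredits.findUnique({
    where: { userId: user.id },
  });

  // 2. Check the balance
  if (!userCredits || userCredits.balance < 1) {
    throw new Error('Insufficient credit balance');
  }

  // 3. Create the transaction record (write)
  await tx.creditsTransaction.create({
    data: {
      userId: user.id,
      type: 'deduct',
      amount: -1,
      // ...
    },
  });

  // 4. Update the user's credits. This takes a row-level write lock that is
  //    held until commit; the new values are computed from the read in step 1.
  await tx.userCredits.update({
    where: { userId: user.id },
    data: {
      balance: userCredits.balance - 1,
      totalSpent: userCredits.totalSpent + 1,
    },
  });
});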
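The transaction above computes the new balance from a value read earlier in the same transaction. The following in-memory sketch (no database involved; `Account`, `tick`, and both function names are hypothetical) illustrates why that read-then-write pattern can lose a deduction when two requests interleave, and how computing the value at write time avoids it.

```typescript
type Account = { balance: number };

// Yield to the event loop so that other "requests" get a chance to run.
const tick = (): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, 0));

// Read first, write later: the write uses a possibly stale snapshot.
async function deductReadModifyWrite(row: Account): Promise<void> {
  const snapshot = row.balance;
  await tick();
  row.balance = snapshot - 1;
}

// Compute the new value from the current value at write time.
async function deductAtomic(row: Account): Promise<void> {
  await tick();
  row.balance -= 1;
}

async function demo(): Promise<{ readModifyWrite: number; atomic: number }> {
  const a: Account = { balance: 10 };
  await Promise.all([deductReadModifyWrite(a), deductReadModifyWrite(a)]);

  const b: Account = { balance: 10 };
  await Promise.all([deductAtomic(b), deductAtomic(b)]);

  return { readModifyWrite: a.balance, atomic: b.balance };
}

demo().then((result) => console.log(result));
```

Both runs perform two deductions from a balance of 10. The read-then-write version ends at 9 because both requests read 10 before either wrote, while the version that decrements at write time ends at 8.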
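3. Deadlock scenario
When multiple users use the same account at the same time:
- Request A starts a transaction and locks the `userCredits` record
- Request B also starts a transaction and waits for the same `userCredits` record
- Request A needs to create a `creditsTransaction` record, but may itself be blocked by other locks
- Request B keeps waiting as well, and a deadlock chain forms
Every waiting transaction keeps its database connection open for the whole wait, which is how a burst of such requests also runs into the connection limit.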
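Strictly speaking, a deadlock requires a cycle of waits, whereas requests queueing behind a single lock holder only form a chain that drains once the holder commits. PostgreSQL looks for such cycles once a wait has lasted longer than `deadlock_timeout`. The sketch below is a simplified illustration of that idea, not PostgreSQL's implementation; the graph type and function name are assumptions.

```typescript
// Wait-for graph: each transaction maps to the transactions it waits for.
type WaitForGraph = Map<string, string[]>;

// Depth-first search that returns one cycle (for example ["A", "B", "A"])
// or null when the graph contains no cycle.
function findDeadlockCycle(graph: WaitForGraph): string[] | null {
  const visiting = new Set<string>();
  const done = new Set<string>();
  const path: string[] = [];

  function visit(tx: string): string[] | null {
    if (done.has(tx)) return null;
    if (visiting.has(tx)) return path.slice(path.indexOf(tx)).concat(tx);
    visiting.add(tx);
    path.push(tx);
    for (const blocker of graph.get(tx) ?? []) {
      const found = visit(blocker);
      if (found) return found;
    }
    path.pop();
    visiting.delete(tx);
    done.add(tx);
    return null;
  }

  for (const tx of Array.from(graph.keys())) {
    const found = visit(tx);
    if (found) return found;
  }
  return null;
}

// B and C both wait for A: a queue, not a deadlock.
const waitQueue: WaitForGraph = new Map([
  ["B", ["A"]],
  ["C", ["A"]],
]);

// A waits for B while B waits for A: a deadlock.
const mutualWait: WaitForGraph = new Map([
  ["A", ["B"]],
  ["B", ["A"]],
]);

console.log(findDeadlockCycle(waitQueue));
console.log(findDeadlockCycle(mutualWait));
```

The first graph yields `null` and the second yields the cycle A, B, A. The distinction matters when reading lock diagnostics: a long queue exhausts connections just as effectively as a deadlock, but only a cycle needs one of the transactions to be aborted.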
How to inspect row-lock state in the database
Querying lock state in PostgreSQL
sql
-- 1. List the locks that sessions are currently waiting for
SELECT
  l.locktype,
  l.database,
  l.relation,
  l.page,
  l.tuple,
  l.virtualxid,
  l.transactionid,
  l.classid,
  l.objid,
  l.objsubid,
  l.virtualtransaction,
  l.pid,
  l.mode,
  l.granted,
  a.query,
  a.query_start,
  age(now(), a.query_start) AS "age"
FROM pg_locks l
LEFT JOIN pg_stat_activity a ON l.pid = a.pid
WHERE l.granted = false -- only locks that have not been granted yet (waiting)
ORDER BY a.query_start;
-- 2. Show the lock state for a specific table
SELECT
  l.mode,
  l.granted,
  a.pid,
  a.application_name,
  a.client_addr,
  a.query,
  a.state
FROM pg_locks l
JOIN pg_stat_activity a ON l.pid = a.pid
JOIN pg_class c ON l.relation = c.oid
WHERE c.relname = 'user_credits' -- replace with your table name
ORDER BY l.granted, l.mode;
-- 3. Show blocked sessions together with the sessions that block them
SELECT
  blocked_locks.pid AS blocked_pid,
  blocked_activity.usename AS blocked_user,
  blocking_locks.pid AS blocking_pid,
  blocking_activity.usename AS blocking_user,
  blocked_activity.query AS blocked_statement,
  blocking_activity.query AS current_statement_in_blocking_process
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity
  ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks
  ON blocking_locks.locktype = blocked_locks.locktype
  AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
  AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
  AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
  AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
  AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
  AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
  AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
  AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
  AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
  AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity
  ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;
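-- 4. Log long lock waits (requires database administrator privileges).
--    Deadlock detection itself is always on; these settings control logging
--    and how long a session waits before the deadlock check runs.
-- In postgresql.conf:
-- log_lock_waits = on     (log any lock wait longer than deadlock_timeout)
-- deadlock_timeout = 1s   (1s is the default)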
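The blocked/blocking query above can return many rows during an incident. One way to read them is to look for the sessions that block others without being blocked themselves. The helper below is a small sketch of that idea: it assumes the rows have already been fetched and uses the `blocked_pid` and `blocking_pid` aliases from query 3, and the sample pids are invented for illustration rather than taken from real output.

```typescript
// One row of the blocked/blocking query, reduced to the two pid columns.
interface BlockingRow {
  blocked_pid: number;
  blocking_pid: number;
}

// Sessions that block others but are not blocked themselves. An empty result
// with non-empty input means every blocker is itself waiting, i.e. a cycle.
function rootBlockers(rows: BlockingRow[]): number[] {
  const blocked = new Set(rows.map((row) => row.blocked_pid));
  const roots = new Set<number>();
  for (const row of rows) {
    if (!blocked.has(row.blocking_pid)) roots.add(row.blocking_pid);
  }
  return Array.from(roots).sort((a, b) => a - b);
}

// Illustrative sample: 201 and 203 wait for 101, and 202 waits for 201.
const sampleRows: BlockingRow[] = [
  { blocked_pid: 201, blocking_pid: 101 },
  { blocked_pid: 202, blocking_pid: 201 },
  { blocked_pid: 203, blocking_pid: 101 },
];

console.log(rootBlockers(sampleRows));
```

For the sample rows the only root blocker is pid 101, so that session's transaction is the one to inspect first.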
Real-time monitoring script
sql
-- Create a monitoring view.
-- Note: pg_stat_user_tables reports write activity and vacuum/analyze times,
-- not locks; use the pg_locks queries above for lock state.
CREATE OR REPLACE VIEW lock_monitor AS
SELECT
  schemaname,
  relname AS tablename,
  n_tup_ins,
  n_tup_upd,
  n_tup_del,
  last_vacuum,
  last_autovacuum,
  last_analyze,
  last_autoanalyze
FROM pg_stat_user_tables
WHERE relname IN ('user_credits', 'credits_transaction');
Solutions
1. Temporary fix: comment out the credits logic
typescript
// Comment out every credits-related transaction
// await prisma.$transaction(async (tx) => {
//   // credit deduction logic...
// });

// Call the external API directly
const response = await fetch(`${EXTERNAL_API_BASE}/api/custom/image/submit/task`, {
  method: 'POST',
  body: formData,
});
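2. Long-term fix: optimistic locking
typescript
async function deductCreditsOptimistic(userId: string, amount: number) {
  const maxRetries = 5;
  for (let i = 0; i < maxRetries; i++) {
    // Read the current row (no lock is taken)
    const current = await prisma.userCredits.findUnique({
      where: { userId },
    });

    // Business errors are thrown immediately: retrying cannot fix them
    if (!current) {
      throw new Error('User credits record not found');
    }
    if (current.balance < amount) {
      throw new Error('Insufficient balance');
    }

    // Conditional update: it only matches if the row is unchanged since the read.
    // updatedAt serves as the version here; an integer version column is more
    // robust because two writes can share the same millisecond timestamp.
    const result = await prisma.userCredits.updateMany({
      where: {
        userId,
        updatedAt: current.updatedAt, // version check
      },
      data: {
        balance: current.balance - amount,
        totalSpent: current.totalSpent + amount,
        updatedAt: new Date(),
      },
    });

    if (result.count > 0) {
      return true; // success
    }

    // Version conflict: back off exponentially with jitter, then retry
    await new Promise((resolve) =>
      setTimeout(resolve, Math.pow(2, i) * 10 + Math.random() * 10)
    );
  }
  throw new Error('Too many optimistic lock conflicts');
}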
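For a simple counter such as a credit balance, a related alternative to read, check, and retry is a single conditional update that lets the database do the arithmetic. This is a different technique from the version-checked update above, shown only as a sketch. It relies on Prisma's `gte` filter and its atomic `decrement`/`increment` number operations; the `UserCreditsDelegate` interface is a hypothetical stand-in for `prisma.userCredits` so that the example type-checks without a generated client.

```typescript
// Hypothetical structural type covering the single call used in this sketch.
interface UserCreditsDelegate {
  updateMany(args: {
    where: { userId: string; balance: { gte: number } };
    data: { balance: { decrement: number }; totalSpent: { increment: number } };
  }): Promise<{ count: number }>;
}

// Deducts `amount` only if the balance is sufficient at the moment the UPDATE
// runs. Returns false when no row matched (unknown user or too few credits).
async function deductCreditsAtomic(
  userCredits: UserCreditsDelegate,
  userId: string,
  amount: number,
): Promise<boolean> {
  if (!Number.isFinite(amount) || amount <= 0) {
    throw new RangeError("amount must be a positive number");
  }
  const result = await userCredits.updateMany({
    where: { userId, balance: { gte: amount } },
    data: {
      balance: { decrement: amount },
      totalSpent: { increment: amount },
    },
  });
  return result.count === 1;
}
```

Because the balance check and the decrement happen in one statement, there is no window between reading and writing and no retry loop. The trade-off is that a `false` result does not say whether the user was missing or simply out of credits.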
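3. Database optimization
sql
-- Add a version column to support optimistic locking
ALTER TABLE user_credits ADD COLUMN version INTEGER NOT NULL DEFAULT 0;
-- Create indexes to speed up lookups
-- (the first one adds little if user_id already has a unique index)
CREATE INDEX idx_user_credits_user_version ON user_credits(user_id, version);
CREATE INDEX idx_credits_transaction_user_status ON credits_transaction(user_id, status);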
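To show how an integer `version` column is meant to be used, the following sketch replays a conflict against an in-memory stand-in for the table. Nothing here talks to a database: `InMemoryCredits` and its methods are hypothetical, and `compareAndSet` only mirrors the shape of a conditional `UPDATE ... WHERE version = ...`.

```typescript
interface CreditsRow {
  userId: string;
  balance: number;
  version: number;
}

// In-memory stand-in for the user_credits table (illustration only).
class InMemoryCredits {
  private rows = new Map<string, CreditsRow>();

  constructor(initial: CreditsRow[]) {
    for (const row of initial) this.rows.set(row.userId, { ...row });
  }

  // Returns a copy, like a row fetched by a plain SELECT.
  read(userId: string): CreditsRow | undefined {
    const row = this.rows.get(userId);
    return row ? { ...row } : undefined;
  }

  // Mirrors: UPDATE user_credits SET balance = $1, version = version + 1
  //          WHERE user_id = $2 AND version = $3
  // Returns true when the row matched and was updated.
  compareAndSet(userId: string, expectedVersion: number, newBalance: number): boolean {
    const row = this.rows.get(userId);
    if (!row || row.version !== expectedVersion) return false;
    row.balance = newBalance;
    row.version += 1;
    return true;
  }
}

const store = new InMemoryCredits([{ userId: "u1", balance: 10, version: 0 }]);

// Two requests read the same version before either one writes.
const readA = store.read("u1")!;
const readB = store.read("u1")!;

const firstWrite = store.compareAndSet("u1", readA.version, readA.balance - 1);
const staleWrite = store.compareAndSet("u1", readB.version, readB.balance - 1);

// The losing request re-reads and retries with the fresh version.
const reread = store.read("u1")!;
const retryWrite = store.compareAndSet("u1", reread.version, reread.balance - 1);

console.log(firstWrite, staleWrite, retryWrite, store.read("u1"));
```

The first write succeeds, the stale write is rejected because the version has moved from 0 to 1, and the retry succeeds, leaving a balance of 8 at version 2. Unlike a timestamp, the counter changes on every successful write, so a stale reader can never match by coincidence.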
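Best practices summary
1. Principles for database access on Cloudflare Workers
- Minimize transactional work: avoid holding locks for long
- Use optimistic locking: suited to high-concurrency workloads with many reads and few writes
- Process asynchronously: move complex operations to background processing
- Manage the connection pool: configure connection parameters sensibly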
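As one concrete reading of the connection pool item above: when Prisma uses its built-in pool, it accepts `connection_limit` and `pool_timeout` as query parameters on the connection string. The helper below is a minimal sketch that sets them; the function name, the example URL, and the chosen values are assumptions, and setups that go through a driver adapter or an external pooler are configured elsewhere.

```typescript
// Returns a copy of the connection string with Prisma pool parameters set.
// connection_limit caps the connections one client may open;
// pool_timeout is how many seconds a query waits for a free connection.
function withPoolSettings(
  databaseUrl: string,
  connectionLimit: number,
  poolTimeoutSeconds: number,
): string {
  const url = new URL(databaseUrl);
  url.searchParams.set("connection_limit", String(connectionLimit));
  url.searchParams.set("pool_timeout", String(poolTimeoutSeconds));
  return url.toString();
}

// Hypothetical connection string; one connection per client keeps the total
// close to the number of in-flight requests.
const pooledUrl = withPoolSettings(
  "postgresql://app:secret@db.example.com:5432/app?schema=public",
  1,
  10,
);
console.log(pooledUrl);
```

A low per-client limit does not remove lock waits, but it keeps a burst of short-lived clients from multiplying the connection count.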
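2. Deadlock prevention strategies
- Use a consistent lock order: always acquire locks in the same order
- Reduce lock hold time: commit or roll back transactions as early as possible
- Avoid nested transactions: keep transaction logic simple
- Use a retry mechanism: handle occasional lock conflicts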
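Two of the strategies above can be sketched in a few lines. The first helper puts the rows a transaction is about to lock into one canonical order, so two transactions touching the same users cannot lock them in opposite orders. The second decides whether a failed transaction is worth retrying, based on the PostgreSQL SQLSTATE codes `40P01` (deadlock detected) and `40001` (serialization failure). Both function names are hypothetical, and how the SQLSTATE is exposed on an error object depends on the driver or ORM.

```typescript
// Deduplicate and sort the keys so that every transaction locks rows
// in the same order, whatever order the caller supplied.
function inLockOrder(userIds: string[]): string[] {
  return Array.from(new Set(userIds)).sort();
}

// SQLSTATE codes for failures that a fresh attempt can resolve.
const RETRYABLE_SQLSTATES = new Set<string>(["40P01", "40001"]);

function isRetryableSqlState(code: string | undefined): boolean {
  return code !== undefined && RETRYABLE_SQLSTATES.has(code);
}

console.log(inLockOrder(["u3", "u1", "u3", "u2"]));
console.log(isRetryableSqlState("40P01"), isRetryableSqlState("23505"));
```

The first call yields the ids in the order u1, u2, u3. A unique-violation code such as `23505` is not retryable, because repeating the same statement would fail in the same way.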
3. Monitoring and alerting
typescript
// Add performance monitoring
const startTime = Date.now();
try {
  await deductCreditsOptimistic(userId, amount);
  console.log(`Credit deduction succeeded, took ${Date.now() - startTime}ms`);
} catch (error) {
  console.error(`Credit deduction failed, took ${Date.now() - startTime}ms`, error);
  // Send an alert...
}
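Conclusion
In a serverless environment such as Cloudflare Workers, a traditional pessimistic locking strategy can easily lead to connection pool exhaustion and deadlocks. Adopting optimistic locking, reducing transactional work, and improving the database design addresses these problems and improves the system's concurrency and stability.
For our project, temporarily commenting out the credits logic resolved the production incident quickly. We can then reintroduce the credits system step by step on top of optimistic locking.
This article summarizes a real production incident. We hope it helps developers who run into similar problems.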