Process 5671 waits for ExclusiveLock on tuple (33,12) of relation 31709 of database 16393; blocked by process 5652. Process 5652 waits for ShareLock on transaction 3382643; blocked by process 5670. Process 5670 waits for ExclusiveLock on tuple (33,12) of relation 31709 of database 16393; blocked by process 5671. Process 5671: UPDATE "user" SET seqno = seqno + 1,push_pending = true WHERE user_id = $1 RETURNING seqno Process 5652: UPDATE "user" SET seqno = seqno + 1,push_pending = true WHERE user_id = $1 RETURNING seqno Process 5670: UPDATE "user" SET seqno = seqno + 1,push_pending = true WHERE user_id = $1 RETURNING seqno
(关系31709是“用户”表.所有三个事务中的user_id都相同.)
这看起来不像你看到demonstrated in the documentation的普通死锁.我没有按顺序更新这个表的多行.我怀疑RETURNING子句与它有关,但我不明白为什么.关于如何解决这个问题或进一步调试的任何想法?
在评论中更新Erwin的问题:这是Postgres 9.3.此事务中还有其他命令,但我不相信它们会触及“用户”表.表上有一个触发器,用于使用current_timestamp()更新updated_at列:
CREATE OR REPLACE FUNCTION update_timestamp() RETURNS TRIGGER AS $$ BEGIN NEW.updated_at = now(); RETURN NEW; END; $$LANGUAGE plpgsql;
当我可以重现这个时,我会考虑获取交易的细节;我事后正在分析日志.
更新#2:我使用LOCK_DEBUG重建Postgres并使用trace_locks = on运行,试图绕过锁定的顺序.
更新的死锁消息是:
Process 54131 waits for ShareLock on transaction 4774; blocked by process 54157. Process 54157 waits for ExclusiveLock on tuple (1,16) of relation 18150 of database 18136; blocked by process 54131.
我可以清楚地看到ExclusiveLock上的阻塞:
2014-08-05 10:35:15 EDT apiary [54131] 2/316 4773 LOG: 00000: LockAcquire: new: lock(0x10f039f88) id(18136,18150,1,16,3,1) grantMask(0) req(0,0)=0 grant(0,0)=0 wait(0) type(ExclusiveLock) 2014-08-05 10:35:15 EDT apiary [54131] 2/316 4773 LOG: 00000: GrantLock: lock(0x10f039f88) id(18136,1) grantMask(80) req(0,1)=1 grant(0,1)=1 wait(0) type(ExclusiveLock) 2014-08-05 10:35:15 EDT apiary [54157] 3/188 4774 LOG: 00000: LockAcquire: found: lock(0x10f039f88) id(18136,1)=1 wait(0) type(ExclusiveLock) 2014-08-05 10:35:15 EDT apiary [54157] 3/188 4774 LOG: 00000: WaitOnLock: sleeping on lock: lock(0x10f039f88) id(18136,2)=2 grant(0,1)=1 wait(0) type(ExclusiveLock)
(格式为日期程序[pid] virtualtxid txid msg)
但我没有看到ShareLock的创建位置,或者事务4773在事务4774上阻塞的原因.我在查询pg_locks表时看到类似的结果:总是等待另一个事务的ShareLock的事务阻塞了一个元组第一笔交易.有关如何深入了解ShareLock源代码的任何建议?
更新3:我需要更新LOCK_DEBUG_ENABLED内联函数以无条件地返回true以查看ShareLock创建.一旦我这样做,我开始看到他们的创作:
2014-08-05 12:53:22 EDT apiary [76705] 2/471 6294 LOG: 00000: LockAcquire: lock [6294,0] ExclusiveLock 2014-08-05 12:53:22 EDT apiary [76705] 2/471 6294 LOG: 00000: LockAcquire: new: lock(0x115818378) id(6294,4,0)=0 wait(0) type(ExclusiveLock) 2014-08-05 12:53:22 EDT apiary [76705] 2/471 6294 LOG: 00000: GrantLock: lock(0x115818378) id(6294,1)=1 wait(0) type(ExclusiveLock) 2014-08-05 12:53:22 EDT apiary [76706] 4/153 6295 LOG: 00000: LockAcquire: lock [6294,0] ShareLock 2014-08-05 12:53:22 EDT apiary [76706] 4/153 6295 LOG: 00000: LockAcquire: found: lock(0x115818378) id(6294,1)=1 wait(0) type(ShareLock) 2014-08-05 12:53:22 EDT apiary [76706] 4/153 6295 LOG: 00000: WaitOnLock: sleeping on lock: lock(0x115818378) id(6294,1)=2 grant(0,1)=1 wait(0) type(ShareLock)
但我仍然不确定为什么要创建ShareLock,为什么6295(在这种情况下)必须等待6294.
解决方法
看起来所有引用用户的表的插入/更新也会在主表上插入后更新用户表上的用户seq.这些更新需要独占锁并被任何不属于当前事务的共享锁阻止.如果两个同时发生,它们就会陷入僵局.
例如,两个事件同时插入media_size和source可能会死锁,如下所示:
T1 T2 ----------------------------------------------------------------------- 1. Insert media size 1a. Excl Lock media size row 1b. Shared Lock on user row (FK) 2. Insert Source 2a. Excl Lock source row 2b. Shared lock on user row (FK) 3. Update user seq 3a. Excl Lock on user row (BLOCKS on 2b) 4. Update user seq 4a. Excl Lock on user row (Blocks on 1b) 5. Deadlock
我认为将更新用户seq步骤切换为第一步是有意义的,因为它会在尝试获取共享锁之前强制T1和T2阻塞(由于它已经有一个已排除的锁,因此它不需要).