Overly long SQL holds the library cache lock for a long time and hangs the system

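Lock wait event

Normally the library cache lock is held while a SQL statement is parsed: it is taken on the object in shared mode and released once the parse completes. This is the expected behavior.
In most cases the library cache lock is not held during the execution phase.
Example: a query whose joins use a merge join cartesian and that takes a long time to execute does not block other queries on the library cache lock, because the lock was released after the parse.
This is how the library cache lock behaves.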

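Solution

Why is the LC (library cache) lock still held for a long time after the 14-second parse?

By design, the LC lock on the object must be released once the parse is done.

There are exceptions, however: when bind variables are used, or when literals are replaced under cursor_sharing=similar. Bind replacement happens after the parse tree for the cursor has been built. Here a huge inlist is converted to bind variables under cursor_sharing=similar, and the bind replacement takes a long time, 167 seconds.
Because execution of the SQL depends on the bind values, and those are traversed only after the parse tree is built, the LC lock has to be held until execution takes place. The reason is bind peeking: the final plan may differ from the plan generated from the parse tree. The LC lock is therefore held until the final execution plan has been generated and executed, which is why Oracle holds it for 167 seconds here.

Why does a lock wait hang the DB? The DDL case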

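Avoid running DDL maintenance operations (in this case, compiling a package) while business processes are running.

Attachments:

Cause

The issue appears to be caused by issuing an "alter package" command, which compiles the package, while other sessions are executing the relevant packages.

According to hanganalyze, a session needs to request handle '0x11ce403120' (76474757408) in exclusive mode to compile the package:
is waiting for 'library cache pin' with wait info:
{
p1: 'handle address'=0x11ce403120 <<<<<<<<<<< contention on handle address = 76474757408
p2: 'pin address'=0x11a5dbbae0
p3: '100*mode+namespace'=0x2b5cefa00010003 <<<<<<<<<<<<<< request exclusive mode
time in wait: 11 min 24 sec
timeout after: 3 min 35 sec
wait id: 1091
blocking: 0 sessions
current sql: ALTER PACKAGE COMPILE DEBUG SPECIFICATION <<<<<<<<<<<<<<<<<< compile the package

When other sessions need to refer to or execute the package, they need to request the handle in shared mode. So the session that is compiling the package blocks the later sessions.

When the session that is compiling the package is itself blocked by another session, the database can be expected to hang.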

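My summary:

Normally a TX lock is a matter of application design. Without DDL it does not cause a hang; sessions simply wait as usual, and accessing the tables used by the package is fine. Once a library cache lock is involved, however, access to the dependent tables runs into trouble as well.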

----- Caused by gather statistics

Gather Statistics And Grant To Table Caused Library Cache Lock And Cursor: Mutex S

KB133453

The database was recently upgraded from 11.2.0.4 and gather stats job ran fine nightly on this database prior to upgrade. The database environment is using Multi Threaded Server (MTS) shared server connections.

This issue is occurring on multiple databases for the custom gather stats job. The job gathers statistics for a table and then, after gathering statistics, issues a grant to DBA to invalidate the SQL. --- i.e. no_invalidate=false

This action caused library cache lock, cursor: mutex S, and library cache: mutex X wait events when the cursor invalidations happen due to DDL on an object used in the SQL. (That is, DDL on a table referenced by the SQL, including gathering table statistics with no_invalidate=false.)

There were some SQL with high execution rate observed which were creating a new child cursor for every execution and the version count report showed high version counts with the following reason USER_BIND_PEEK_MISMATCH.

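Building a temporary-table-like row source with WITH ... AS, or using a very large inlist, both cause the library cache lock to be held even after the query has been parsed.

If table AA is modified at that point (DDL, or gathering statistics with no_invalidate=false), AA becomes locked, and as a knock-on effect later SELECTs against AA cannot return results.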

with XX as (
select xx from dual union
.... (tens of thousands of selects)
select xx from dual union
select xx from dual
)
SELECT /*+ parallel(8)*/
yy from xx ,AA ..

Applies To

All Users

Summary

Generally, the library cache lock is held during the parse of the SQL: it is taken on the object in shared mode and released after the parse. This is the expected behaviour.

In most cases the library cache lock is not held in the execution phase.

Example: a query whose joins go for a merge join cartesian and that takes a long time to execute does not block others on the library cache lock, because the lock was released after the parse.

This is the behaviour of the library cache lock.

Solution

Why is the LC lock held for a long time even after the 14 s parse?

As per the behaviour of the LC lock, the lock on the object has to be released once the parse is done.
But there are some exceptions, where bind variables are used or literal replacement occurs as part of cursor_sharing=similar, because bind replacement occurs after the parse tree has been created for the cursor. Since there is a huge inlist whose literals are converted to bind variables as part of cursor_sharing=similar, the bind replacement here takes a long time for the 9001 bind values, i.e. 167 secs.
Because the execution of the SQL depends upon the bind values, and those are traversed after the parse tree has been made, the LC lock has to be held until the execution takes place. The reason is that, because of the feature called bind peeking, the final plan may differ from the plan generated from the parse tree. Hence the LC lock is held until the final execution plan has been generated and executed. That is why Oracle holds the LC lock here for 167 s.

Excerpts from the raw 10046:

=============================

PARSING IN CURSOR #13 len=126725 dep=0 uid=5 oct=3 lid=5 tim=437160281069 hv=91324704 ad='3b67e54c8' sqlid='43ycb442r3090'

select * from test.tbl_quelle1 t1, test.tbl_quelle1 t2 , test.tbl_quelle1 t3, test.tbl_quelle1 t4, test.tbl_quelle1 t5

...

order by :"SYS_B_9000"

END OF STMT

PARSE #13:c=13630000,e=14389433,p=0,cr=0,cu=0,mis=1,r=0,dep=0,og=1,plh=0,tim=437160281069
++parse is finished in 14 seconds, /* e=14389433 */

++Bind replacement follows after parse++

BINDS #13:

Bind#0

oacdty=02 mxl=22(02) mxlc=00 mal=00 scl=00 pre=00

oacflg=10 fl2=0100 frm=00 csi=00 siz=24 off=0

kxsbbbfp=ffffffff7aaffbe0 bln=22 avl=02 flg=09

value=1

...

Bind#9000

oacdty=02 mxl=22(02) mxlc=00 mal=00 scl=00 pre=00

oacflg=10 fl2=0500 frm=00 csi=00 siz=24 off=0

kxsbbbfp=ffffffff7ab23fa0 bln=22 avl=02 flg=09

value=1

EXEC #13:c=163420000,e=171631775,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=285361322,tim=437331932600

The elapsed time of EXEC is e=171631775 microseconds, i.e. about 171.6 s. The time taken here is due to the bind value replacement.
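To make the comparison between the PARSE and EXEC lines easier to repeat on other traces, here is a minimal Python sketch that extracts `c=` and `e=` from the two call lines quoted above. It assumes those fields are in microseconds (consistent with e=14389433 being described as 14 seconds); the function name `call_times` is made up for this illustration.

```python
import re

# Two call lines copied from the trace excerpt above.
TRACE = """\
PARSE #13:c=13630000,e=14389433,p=0,cr=0,cu=0,mis=1,r=0,dep=0,og=1,plh=0,tim=437160281069
EXEC #13:c=163420000,e=171631775,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=285361322,tim=437331932600
"""

CALL_RE = re.compile(r"^(PARSE|EXEC|FETCH) #(\d+):(.*)$")


def call_times(trace_text):
    """Return {call name: (cpu seconds, elapsed seconds)}.

    Assumes c= and e= are microseconds. A later call of the same
    type overwrites an earlier one, which is enough for this excerpt.
    """
    result = {}
    for line in trace_text.splitlines():
        match = CALL_RE.match(line.strip())
        if not match:
            continue
        fields = dict(kv.split("=", 1) for kv in match.group(3).split(","))
        result[match.group(1)] = (int(fields["c"]) / 1e6, int(fields["e"]) / 1e6)
    return result


times = call_times(TRACE)
for call, (cpu, elapsed) in times.items():
    print(f"{call:5s} cpu={cpu:8.2f}s elapsed={elapsed:8.2f}s")
```

On this excerpt the EXEC call is roughly twelve times longer than the PARSE call, which is the gap the note attributes to bind replacement.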

Inference with reduced bind variables:

The library cache lock is held for less time when the number of binds is reduced.

With 9000 binds:

call     count     cpu  elapsed  disk  query  current  rows
-------  -----  ------  -------  ----  -----  -------  ----
Parse        1    2.26     2.30     0      1        0     0
Execute      1  640.95   644.00     0     10        0     0
Fetch       68    0.13     0.14     0   2790        0  1000
total       70  643.34   646.45     0   2801        0  1000

With 4500 binds:

call     count     cpu  elapsed  disk  query  current  rows
-------  -----  ------  -------  ----  -----  -------  ----
Parse        1    0.46     0.45     0      1        0     0
Execute      1  160.01   160.14     0     10        0     0
Fetch       35    0.08     0.07     0   1420        0   500
total       37  160.55   160.67     0   1431        0   500

With 900 binds:

call     count     cpu  elapsed  disk  query  current  rows
-------  -----  ------  -------  ----  -----  -------  ----
Parse        1    0.03     0.02     0      1        0     0
Execute      1    7.21     7.22     0     10        0     0
Fetch        8    0.01     0.01     0    310        0   100
total       10    7.25     7.26     0    321        0   100

With 90 binds:

call     count     cpu  elapsed  disk  query  current  rows
-------  -----  ------  -------  ----  -----  -------  ----
Parse        1    0.01     0.00     0      1        0     0
Execute      1    0.18     0.16     0     10        0     0
Fetch        2    0.00     0.00     0     70        0    10
total        4    0.19     0.17     0     81        0    10

So it is clearly seen that the library cache lock is not held for a long time when the number of binds is reduced.
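The four Execute timings above also suggest how quickly the cost grows with the number of binds. The following Python sketch is only arithmetic on the figures already listed, not a statement about Oracle internals: for each pair of adjacent runs it computes the exponent k for which the time ratio equals the bind-count ratio raised to k.

```python
import math

# (bind count, Execute elapsed seconds) copied from the four TKPROF summaries above.
runs = [(90, 0.16), (900, 7.22), (4500, 160.14), (9000, 644.00)]


def growth_exponent(n1, t1, n2, t2):
    """Exponent k such that t2 / t1 == (n2 / n1) ** k."""
    return math.log(t2 / t1) / math.log(n2 / n1)


exponents = []
for (n1, t1), (n2, t2) in zip(runs, runs[1:]):
    k = growth_exponent(n1, t1, n2, t2)
    exponents.append(k)
    print(f"{n1:>5} -> {n2:>5} binds: time x{t2 / t1:6.1f}, exponent ~{k:.2f}")
```

The exponents come out between roughly 1.6 and 2.0, so in these runs the execute time grows faster than linearly with the bind count: halving the inlist from 9000 to 4500 binds cut the time to about a quarter.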

Conclusion:

For SQL with binds, the library cache lock on the objects involved is held until the execute phase of the SQL is completed. This is not the case for SQL without binds. This is expected behavior and it is not a bug.

Oracle usually recommends not using such a huge inlist that goes through bind replacement. The application SQL has to be modified to avoid this kind of situation.
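The note does not say how the application SQL should be changed. One possible approach, shown here purely as an illustrative sketch and not as Oracle's recommendation, is to split the key list into smaller batches so that no single statement carries thousands of binds. The helper names, the batch size of 500, and the simplified single-table query are all assumptions made for this example; no database call is made.

```python
from itertools import islice


def chunked(values, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(values)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def in_list_sql(column, n_binds):
    """Build `column IN (:b0, ..., :bN)` with named bind placeholders."""
    placeholders = ", ".join(f":b{i}" for i in range(n_binds))
    return f"{column} IN ({placeholders})"


# Hypothetical inputs: 9001 key values and a batch size of 500.
keys = list(range(9001))
BATCH_SIZE = 500

statements = []
for batch in chunked(keys, BATCH_SIZE):
    sql = f"SELECT * FROM test.tbl_quelle1 t1 WHERE {in_list_sql('t1.rn', len(batch))}"
    binds = {f"b{i}": value for i, value in enumerate(batch)}
    statements.append((sql, binds))

print(len(statements), "statements; largest has",
      max(len(binds) for _, binds in statements), "binds")
```

Full batches share one statement text, so they can reuse one cursor; only the final short batch produces a different text. Whether batching, a temporary table joined to the query, or another rewrite fits best depends on the application.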

Attachments :

Cause

If the SQL has bind variables specified in the where clause, the library cache lock is held beyond the parse phase, because the bind replacement happens after the parse phase and before the execute phase. There is an intermediate stage called "BIND" which comes after the parse, so the library cache lock has to be held after the parse until all the binds have been replaced.
If the SQL has too many bind variables in the where clause, or a huge inlist with bind variables, the library cache lock is held for a long time until the bind replacement finishes. In those circumstances, the library cache lock will be seen until the execute phase.

Example:

The following example shows a SQL statement with a huge inlist of 9001 bind variables, where the library cache lock is held for a long time after the parse phase.

select * from test.tbl_quelle1 t1, test.tbl_quelle1 t2 , test.tbl_quelle1 t3, test.tbl_quelle1 t4, test.tbl_quelle1 t5
where
t1.rn=t2.rn and
t1.rn=t3.rn and
t1.rn=t4.rn and
t1.rn=t5.rn and
t1.rn in (:"SYS_B_0000",...,:"SYS_B_8999");

From TKPROF

call     count     cpu  elapsed  disk  query  current  rows
-------  -----  ------  -------  ----  -----  -------  ----
Parse        1   13.38    13.46     0      0        0     0
Execute      1  154.68   154.60     0      0        0     0
Fetch       68    0.37     3.85   338   2990        0  1000
total       70  168.43   171.93   338   2990        0  1000

SQL> select sid, serial#, event, sql_id, seconds_in_wait, state from v$session where sql_id=<sql_id> and state='WAITING';

SID   SERIAL#  EVENT               SQL_ID         SECONDS_IN_WAIT  STATE
2183  1        library cache lock  bc3n13pnrfv1g  167              WAITING

The parse time of the query is 13.46 seconds, but the library cache lock is held for 167 seconds. The lock is still being held after the query has been parsed.

-------

Applies To

All Users

Summary

AWR report shows high library cache pin:
Top 10 Foreground Events by Total Wait Time

Event                          Waits   Total Wait Time (sec)  Wait Avg(ms)  % DB time  Wait Class
library cache pin              56,938  734K                   12892         94.2       Concurrency
DB CPU                                 24.2K                                3.1
enq: TX - row lock contention  299     7024.1                 23492         .9         Application
library cache: mutex X         52,606  5919.1                 113           .8         Concurrency

ASH report indicates main contention happened in procedure:
Top Event P1/P2/P3 Values

Event              % Event  P1 Value, P2 Value, P3 Value                      % Activity  Parameter 1     Parameter 2  Parameter 3
library cache pin  96.98    "76474757408","75797084896","195289731997696003"  0.13        handle address  pin address  100*mode+namespace

<<<<<<<<< contention on handle address = 76474757408
Hanganalyze dump shows the following:
Chain 6:

Oracle session identified by:
{
instance: 2
os id: 18761
process id: 1216,
session id: 6199
session serial #: 18179
}
is waiting for 'library cache pin' with wait info:
{
p1: 'handle address'=0x11ce403120 <<<<<<<<<<< contention on handle address = 76474757408
p2: 'pin address'=0x11a5dbbae0
p3: '100*mode+namespace'=0x2b5cefa00010003 <<<<<<<<<<<<<< request exclusive mode
time in wait: 11 min 24 sec
timeout after: 3 min 35 sec
wait id: 1091
blocking: 0 sessions
current sql: ALTER PACKAGE COMPILE DEBUG SPECIFICATION <<<<<<<<<<<<<<<<<< compile the package
short stack: ksedsts()+465<-ksdxfstk()+32<-ksdxcb()+1927<-sspuser()+112<-__sighandler()<-semtimedop()+10<-skgpwwait()+178<-ksliwat()+2047<-kslwaitctx()+163<-kjusuc()+3400<-ksipgetctxi()+1759<-kqlmPin()+2943<-kqlmClusterLock()+237<-kglpnal()+4059<-kglpin()+1381<-kkdllk0()+904<-kkdlGetCodeObject()+461<-kkpalt()+353<-opiexe()+18730<-opiosq0()+4310<-kpooprx()+274<-kpoal8()+842<-opiodr()+915<-ttcpip()+2183<-opitsk()+1705<-opiino()+969<-opiodr()+915<-opidrv()+570<-sou2o()+103<-opimai_real()+133<-ssthrdmain()+265<-main()+201<-_libc
wait history:
* time between current wait and wait #1: 0.000049 sec
1. event: 'library cache lock'
  time waited: 0.000063 sec
  wait id: 1090 p1: 'handle address'=0x11ce403120
  p2: 'lock address'=0x11ac983730
  p3: '100*mode+namespace'=0x2b5cefa00010003
  * time between wait #1 and #2: 0.000228 sec
  ...
  }
  and is blocked by
  => Oracle session identified by:
  {
  instance: 2
  os id: 163634
  process id: 659,
  session id: 3707
  session serial #: 57363
  }
  which is waiting for 'enq: TX - row lock contention' with wait info:
  {
  p1: 'name|mode'=0x54580006
  p2: 'usn<<16 | slot'=0xd30003
  p3: 'sequence'=0x27cec
  time in wait: 12 min 3 sec
  timeout after: never
  wait id: 212120
  blocking: 1 session
  current sql: UPDATE...
  short stack: ksedsts()+465<-ksdxfstk()+32<-ksdxcb()+1927<-sspuser()+112<-__sighandler()<-semtimedop()+10<-skgpwwait()+178<-ksliwat()+2047<-kslwaitctx()+163<-kjusuc()+3400<-ksipgetctxi()+1759<-ksqcmi()+20798<-ksqgtlctx()+3501<-ksqgelctx()+557<-ktuGetTxForXid()+131<-ktcwit1()+336<-kdddgb()+8587<-kdusru()+461<-kauupd()+412<-updrow()+2167<-qerupFetch()+860<-qermvlgFetch()+296<-updaul()+1378<-updThreePhaseExe()+318<-updexe()+638<-opiexe()+10916<-opipls()+2154<-opiodr()+915<-rpidrus()+211<-skgmstack()+148<-rpiswu2()+690<-rpidrv()+13
...

}

and is blocked by

=> Oracle session identified by:

{

instance: 2

os id: 161259

process id: 583,

session id: 4737

session serial #: 19239

}

which is waiting for 'SQL*Net message from client' with wait info:

{

p1: 'driver id'=0x54435000

p2: '#bytes'=0x1

time in wait: 12 min 3 sec

timeout after: never

wait id: 83454

blocking: 2 sessions

current sql: <none>

short stack: ksedsts()+465<-ksdxfstk()+32<-ksdxcb()+1927<-sspuser()+112<-__sighandler()<-read()+14<-nttrd()+227<-nsprecv()+488<-nsrdr()+216<-nsfull_pkt_rcv()+10214<-nsfull_brc()+79<-nsbrecv()+69<-nioqrc()+495<-opikndf2()+978<-opitsk()+826<-opiino()+969<-opiodr()+915<-opidrv()+570<-sou2o()+103<-opimai_real()+133<-ssthrdmain()+265<-main()+201<-__libc_start_main()+253
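The chain above can be read as a linked list of "is blocked by" relations. As a small Python sketch, assuming only the session ids and wait events shown in Chain 6 (the function name `root_blocker` is invented for this example), following the links from the compiling session leads to the session at the head of the chain:

```python
# Sessions from Chain 6 above: session id -> (wait event, blocker session id or None).
chain = {
    6199: ("library cache pin", 3707),
    3707: ("enq: TX - row lock contention", 4737),
    4737: ("SQL*Net message from client", None),
}


def root_blocker(sessions, sid):
    """Follow blocked-by links from `sid` and return the path to the final blocker."""
    path = [sid]
    seen = {sid}
    while sessions[sid][1] is not None:
        sid = sessions[sid][1]
        if sid in seen:  # a cycle would be a deadlock, not a chain
            raise ValueError(f"cycle detected at session {sid}")
        seen.add(sid)
        path.append(sid)
    return path


path = root_blocker(chain, 6199)
for sid in path:
    print(sid, "->", chain[sid][0])
```

The path ends at session 4737, which is idle on 'SQL*Net message from client' while blocking two sessions. Since the session behind it waits on a TX row lock, a plausible reading is that 4737 holds an uncommitted transaction, although the dump itself does not state this.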


Solution

Avoid running DDL maintenance operations (in this case, compiling a package) while business processes are running.

Attachments :

Cause

The issue appears to be caused by issuing an 'alter package' command, which compiles the package, while other sessions were executing the relevant packages.

According to hanganalyze, a session needs to request handle '0x11ce403120' (76474757408) in exclusive mode to compile the package:

is waiting for 'library cache pin' with wait info:

{

p1: 'handle address'=0x11ce403120 <<<<<<<<<<< contention on handle address = 76474757408

p2: 'pin address'=0x11a5dbbae0

p3: '100*mode+namespace'=0x2b5cefa00010003 <<<<<<<<<<<<<< request exclusive mode

time in wait: 11 min 24 sec

timeout after: 3 min 35 sec

wait id: 1091

blocking: 0 sessions

current sql: ALTER PACKAGE COMPILE DEBUG SPECIFICATION <<<<<<<<<<<<<<<<<< compile the package

When other sessions need to refer to or execute the package, they generally need to request the handle (the package's address in the shared pool?) in shared mode. So the session that is compiling the package blocks the later sessions.

When the session that is compiling the package is itself blocked by another session (whose earlier execution has not finished), the DB can be expected to hang.
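The hex values in the hanganalyze dump and the decimal values in the ASH report are the same numbers, and p3 can be split into its parts. The Python sketch below rests on an assumption that is not stated in the source: that, despite the parameter name '100*mode+namespace', p3 packs an object identifier in the upper 32 bits, the namespace in the next 16 bits and the mode in the low 16 bits, with mode 2 meaning shared and 3 meaning exclusive. The decoded mode of 3 agrees with the "request exclusive mode" annotation in the dump.

```python
def decode_p3(p3):
    """Split p3 into (identifier, namespace, mode) under the assumed bit layout."""
    mode = p3 & 0xFFFF
    namespace = (p3 >> 16) & 0xFFFF
    ident = p3 >> 32
    return ident, namespace, mode


MODES = {2: "shared", 3: "exclusive"}  # assumed mode codes

p1 = int("0x11ce403120", 16)       # handle address from the dump
p3 = int("0x2b5cefa00010003", 16)  # '100*mode+namespace' from the dump
ident, namespace, mode = decode_p3(p3)

print(p1)  # decimal form of the handle address, as shown in the ASH report
print(p3)  # decimal form of p3, as shown in the ASH report
print(hex(ident), namespace, MODES.get(mode, mode))
```

Converting p1 this way makes it easy to match a handle address in a hanganalyze dump against the P1 values listed in an ASH report.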
