Mutex Event Waits
"Mutex Waits" is a collective term for waits for resources associated with the management of cursor objects in the shared pool during parsing. Mutexes were introduced in 10g as faster and lightweight forms of latches and waits for these resources will occur in normal operations. However when these waits become excessive, contention can occur causing problems.
Full troubleshooting and diagnostics for every type of mutex-related issue is beyond the scope of this article, but it covers the basic principles and how to identify the problem.
First, you need to identify that mutex waits are actually occurring.
How to Identify Mutex Event Waits
Mutex waits are characterised by sessions waiting for one or more of the following events (a quick query to check for current waiters follows the list):
- cursor: mutex X
- cursor: mutex S
- cursor: pin S
- cursor: pin X
- cursor: pin S wait on X
- library cache: mutex X
- library cache: mutex S
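As a quick check for whether any of these waits are happening right now, current waiters can be listed from V$SESSION. This is a minimal sketch; the LIKE patterns are assumed to cover the events listed above:
select sid, event, state, seconds_in_wait
from   v$session
where  event like 'cursor:%'            -- cursor mutex and cursor pin waits
or     event like 'library cache: mutex%';
Note that if STATE is not 'WAITING', the event shown is the last wait rather than a current one.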
Cursor mutexes are used to protect the parent cursor and are also used for cursor statistics operations.
Cursor pins are used to pin a cursor in preparation for a related operation on the cursor.
Library cache mutexes protect the same library cache operations as in earlier versions, except that these are now implemented using mutexes rather than latches. In all these cases, waits for these resources occur when two (or more) sessions are working with the same cursors simultaneously. When one session takes and holds a resource required by another, the second session waits on one of these events.
Mutex contention is typically characterised by a perception of slow performance at the session or even the database level. Since mutex operations consume almost nothing but CPU, when contention occurs CPU usage can rise and will quickly start to impact users. In normal operation the amount of CPU used per mutex operation, and the time each one takes, is extremely small, but when contention occurs and the number of mutex operations against the same objects runs into the millions, these small numbers add up. Additionally, as CPU is consumed, the mutex operations themselves can start to take longer (because of time spent waiting on the CPU run queue), further adding to the problem.
Diagnosing Potential Causes using AWR Report
The best starting point for identifying mutex waits is a general database report such as an Automatic Workload Repository (AWR) report.
When looking for mutex contention it is best to collect AWR reports for 2 separate periods:
- When the problem is actually occurring
- A separate, baseline period when the problem is not occurring but the load is similar
Collection of both an active report and a baseline is extremely useful for comparison purposes.
Remember that AWR is a general report and as such may not show directly which session is holding a mutex and why, but it will reveal the overall picture of database statistics such as the top waits, SQL statements run, parses, version counts, parameter settings, etc. that are useful indicators of mutex issues.
For information on how to collect AWR reports refer to:
Document 1363422.1 Automatic Workload Repository (AWR) Reports - Start Point
For mutex contention, it is preferable to look at snapshots with a maximum duration of an hour. Durations as short as 5-10 minutes can be used as long as the durations are the same for the baseline and problem periods.
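If you need to generate the reports manually, the standard AWR report script can be run from SQL*Plus; a minimal sketch (the script prompts for the report type, the number of days of snapshots to list, the begin and end snapshot IDs and the report name):
-- connect as a suitably privileged (DBA) user, then:
@?/rdbms/admin/awrrpt.sql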
If mutex contention is occurring then the mutex waits will usually surface in the top timed events:
Problem Period AWR Report: (1 hour duration)
Compare to the Baseline AWR Report: (1 hour duration)
In the problem report, the top wait is for a cursor operation, 'library cache: mutex X', which means that sessions are waiting to get a library cache mutex in eXclusive mode for one or more cursors. From the figures, this accounts for 56.42% of the database time. The average wait time of 294 ms (milliseconds) is extremely high, as is the number of waits at more than 1.3 million in an hour.
In comparison, during the baseline, there is no evidence of high waits for mutex events in the top 5 at all and the events seen are the more normal I/O waits.
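If the formatted reports are not available, the same comparison can be made directly against the AWR base tables. This is a minimal sketch, assuming a single instance and no instance restart within the snapshot range; the view is cumulative, so the two snapshots are differenced:
select event_name,
       max(total_waits) - min(total_waits)                             waits,
       round((max(time_waited_micro) - min(time_waited_micro)) / 1000) time_ms
from   dba_hist_system_event
where  snap_id in (<begin snap>, <end snap>)
and    dbid = <dbid>
and    (event_name like 'cursor:%' or event_name like 'library cache: mutex%')
group  by event_name
order  by time_ms desc;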
Now that we have identified a problem, we want to dig deeper and determine the area the problem is in so that we can ultimately get to a root cause and a solution.
- If you have an AWR report which shows the high mutex waits then start by running:
select * from (
  select p1,                                    -- object the mutex is against (see next query)
         sql_id,                                -- SQL statement the waiting session was running
         count(*),
         (ratio_to_report(count(*)) over ()) * 100 pct
  from   dba_hist_active_sess_history
  where  event = 'library cache: mutex X'
  and    snap_id between <begin snap> and <end snap>
  and    dbid = <dbid>
  group  by p1, sql_id
  order  by count(*) desc)
where rownum <= 10;
This will give you the top 10 P1/SQL_ID arguments of the waits.
The SQL_ID is the SQL statement the session is running.
The P1 is the object the mutex is against.
For the topmost P1 run:
select KGLNAOBJ, KGLNAOWN, KGLHDNSP, KGLOBTYP
from   x$kglob
where  KGLNAHSH = {value of P1};
This will tell you the object the mutex is against. If the same SQL_ID shows up with different P1 values in the top 10, then the problem is likely to be related to that SQL statement. If the SQL_ID and P1 combination is unique, it is likely to be a hot object.
If there is a hot object, review the following bug:
Note:9239863.8 Excessive "library cache:mutex X" contention on hot objects
If there is no hot object, but high general mutex waits, start diagnosing the load profile.
Load Profile
The load profile on the server, and where that load is concentrated, can help you drill down further. For mutex contention issues we are primarily interested in the parse information.
Problem Period AWR Report: (1 hour duration)
Baseline AWR Report: (1 hour duration)
Generally, the load is higher in the "Problem Period" report. Furthermore, the parse statistics are higher in the 'bad' report; hard parses are 45 per second versus 23 per second. This indicates a higher rate of parsing in the problem period, which may be causing contention issues. Now we should look at the SQL that is being parsed the most, as this is likely to be the cause of the problem.
Note: The SQL with the highest parse volume is more likely to be the cause of problems, but this is not necessarily the case; often an increase in parsing compared to a "good" baseline is a better indicator.
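If the formatted load profile sections are not to hand, the same system-level parse figures can be pulled from the AWR base tables. This is a minimal sketch assuming a single instance; the values in DBA_HIST_SYSSTAT are cumulative since instance startup, so per-snapshot deltas are computed with LAG:
select sn.snap_id,
       st.stat_name,
       st.value - lag(st.value) over (partition by st.stat_name
                                      order by sn.snap_id) as delta
from   dba_hist_sysstat  st,
       dba_hist_snapshot sn
where  st.snap_id         = sn.snap_id
and    st.dbid            = sn.dbid
and    st.instance_number = sn.instance_number
and    st.dbid            = <dbid>
and    st.stat_name in ('parse count (total)', 'parse count (hard)')
and    sn.snap_id between <begin snap> and <end snap>
order  by st.stat_name, sn.snap_id;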
Increased Parse Counts
Under SQL ordered by Parse Calls, we are looking for the total parse calls and then the parse calls for particular statements:
Problem Period AWR Report: (1 hour duration)
Baseline AWR Report: (1 hour duration)
In general the parse count has increased, moving from 1.8M to 3.1M. Focusing on specific statements, SQL_IDs '68subccxd9b03' and '12235mxs4h54u' have doubled their number of parses, and '3j91frnd21kks' has come in from 'nowhere' and must also have at least doubled its parses, since the lowest parse call count shown in the baseline is 15,000 and this statement shows 42,000.
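Where only the raw AWR data is available, a similar "SQL ordered by Parse Calls" list can be approximated directly from DBA_HIST_SQLSTAT. A minimal sketch, using the same placeholder convention as the earlier query:
select * from (
  select sql_id,
         sum(parse_calls_delta) parse_calls,   -- parse calls within the snapshot range
         sum(executions_delta)  executions
  from   dba_hist_sqlstat
  where  snap_id between <begin snap> and <end snap>
  and    dbid = <dbid>
  group  by sql_id
  order  by sum(parse_calls_delta) desc)
where rownum <= 10;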
These SQL statements are good candidates for investigation:
- Why has the parse count increased?
- Has new code or code changes been introduced?
- Is a new application being used?
- Have more users been brought online?
- Has the activity profile changed? For example, are more activities being run concurrently than previously?
By answering these kind of questions, you can often find potential causes.
See the "Over Parsing" section in:
Document 33089.1 TROUBLESHOOTING: Possible Causes of Poor SQL Performance
Mutex Sleeps
When a mutex is requested, this is called a get request.
If a session tries to get a mutex but another session is already holding it, then the get request cannot be granted and the session requesting the mutex will 'spin' for the mutex a number of times in the hope that it will be quickly freed. The session spins in a tight loop, testing each time to see if the mutex has been freed.
If the mutex has not been freed by the end of this spinning, the session waits.
When this happens the sleeps column for the particular code location where the session is waiting is incremented in the v$mutex_sleep* views.
This 'Sleeps' count for a particular location is very useful for identification of the area in which mutex contention is occurring.
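If an AWR report is not available, the cumulative per-location figures can also be read directly from V$MUTEX_SLEEP (the summary view; the history view with object detail is queried later in this article). A minimal sketch, noting that the figures are cumulative since instance startup:
select mutex_type, location, sleeps, wait_time
from   v$mutex_sleep
order  by sleeps desc;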
In later versions this information is externalised in the 'Mutex Sleep Summary' section of the AWR report:
Mutex Type       Location              Sleeps    Time (ms)
---------------- --------------- ------------ ------------
Library Cache    kglpin1   4       20,053,325      201,203
Library Cache    kglget1   1           38,809      110,015
Library Cache    kglpndl1  95          25,147       55,946
Library Cache    kglpin1   4           24,887       52,524
What we are interested in here is the location and primarily the Time spent in each. The number of sleeps is also important but if it takes no time then it is unlikely to be affecting performance.
This information can be used to search for other, similar issues that have resulted in contention in this particular area, and from those to determine solutions that have previously been used to address them.
As an example:
In this case the top location for sleeps is in the library cache, 'kglpin1 4'.
In terms of time this is taking almost twice as long as the next location, and it is also responsible for around 20 million more sleeps. It is therefore a good candidate for a search for known issues. In this case, if you search on 'kglpin1 4', one of the documents you will find is:
Document:7307972.8 Bug 7307972 - Excessive waits on 'library cache: mutex x'
which may be directly applicable, or may give pointers as to potential solutions.
Note: if nothing specific is found from searches on 'kglpin1 4' it is worth searching for the other locations (e.g. 'kglget1 1') - although this may be a new issue, related information from these searches may be helpful.
Note: Although this information is included in AWR reports, you can select it directly from the view V$MUTEX_SLEEP_HISTORY using:
select to_char(sysdate, 'HH:MI:SS') time,
       KGLNAHSH hash,
       sum(sleeps) sleeps,
       location,
       MUTEX_TYPE,
       substr(KGLNAOBJ, 1, 40) object
from   x$kglob, v$mutex_sleep_history
where  kglnahsh = mutex_identifier
group  by KGLNAOBJ, KGLNAHSH, location, MUTEX_TYPE
order  by sleeps
/
Interpretation is as with the AWR example above.
Database Appears 'Hung'
Sometimes contention for mutexes will become so intense that the database may appear to hang. In these cases, it is useful to determine which session or sessions are blocking others and to investigate what the blocking sessions are doing.
By running the following select (which outputs the session ID and the text of the SQL being executed) at short intervals, you can pick up common blockers and investigate their activities. If the same SQL is seen repeatedly, it can be investigated for problems in a similar way to the high-parsing SQL investigated previously.
select s.sid, t.sql_text
from   v$session s, v$sql t
where  s.event like '%mutex%'
and    t.sql_id = s.sql_id;
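The query above lists the waiters. To see who is holding the resource, the BLOCKING_SESSION column of V$SESSION can be added; a sketch, noting that this column is not necessarily populated for every mutex wait:
select s.sid               waiter,
       s.blocking_session  blocker,    -- SID of the holding session, where known
       s.event,
       t.sql_text
from   v$session s, v$sql t
where  s.event like '%mutex%'
and    t.sql_id = s.sql_id;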