
Locks, Latches, Mutexes, and CPU Usage

I was alerted about high CPU usage on a client machine. Checking which processes were using the most CPU, I found that it wasn't one or a few processes consuming most of the CPU resources; there were simply a lot of sessions in the run queue, each taking a bite of the CPU.

Checking the current wait events, I found a lot of sessions waiting on transaction-exclusive locks (enq: TX - row lock contention). The client assumed that the CPU usage was caused by the high number of locks and the sessions waiting for them.

This made me think there is a misconception about locks and latches in an Oracle RDBMS instance and how they are actually implemented under the hood; most importantly, how waiting is implemented when the resource needed is locked, latched, or mutexed (I suppose that is not a real word, but it sounds okay). Even if you understand the background of latch spinning and have played with the spin_count parameter in older versions, causing CPU spikes, things have changed: latches now mostly sleep rather than spin while waiting, and in some areas, such as the shared pool, they are being replaced by mutexes.

So, as we know, locks are used to protect “end-user visible resources”, such as tables, rows in tables, transactions, sequences, and materialized views, from concurrent user access, i.e. serializing requests that are incompatible with being done at the same time (think of the simple case of updating the same row in two different transactions). The locks are called enqueues because the sessions waiting on them are placed in queues (holders, converters, and waiters) and get access to the resource they need in the order in which they arrive.

The implementation can be compared to a supermarket where people (the sessions) wait to check out the items they want to buy at the cash registers (the resources being locked), but with a little twist: each cash register handles specific products, so people go to a specific cash register depending on what they need and are served in the order they arrive.

So how is the actual “waiting on a lock” implemented? How does session B, waiting on a transaction started by session A to commit, know that the resource is free for use?

When a resource needs to be locked, an entry is linked into a linked list of parties interested in that particular resource. The session holding the lock has an entry in the holders linked list, while sessions that cannot yet acquire the lock are placed on the waiters (or converters) list. So how does a waiting session know that it can move up the waiters list or be attached to the holders list? If it just retried over and over, the result would be like spinning, and it would waste a lot of CPU. A better way is to sleep and be notified when there is movement in the queue.
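As a rough mental model of the structure just described (a sketch of my own, not Oracle's actual source), an enqueue resource with its three lists might look like this:

/* Conceptual sketch of an enqueue resource, NOT Oracle's internal
 * layout: one resource with three linked lists of lock requests. */
struct lock_entry {
    int                sid;    /* session holding or wanting the lock */
    int                mode;   /* requested or held lock mode         */
    struct lock_entry *next;   /* next entry in the same list         */
};

struct enqueue_resource {
    char               type[3];    /* e.g. "TX"                       */
    int                id1, id2;   /* resource identifiers            */
    struct lock_entry *holders;    /* sessions that own the lock      */
    struct lock_entry *converters; /* owners waiting to change mode   */
    struct lock_entry *waiters;    /* sessions queued for the lock    */
};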

To find out how this is implemented, I traced Oracle foreground processes. I tried this on Oracle RDBMS 11.2.0.3 running on Linux. This is an excerpt of the system calls executed by a session waiting for a lock:

...
semtimedop(196610, , 1, {3, 0}) = -1 EAGAIN (Resource temporarily unavailable) <3.001000>
semtimedop(196610, , 1, {3, 0}) = -1 EAGAIN (Resource temporarily unavailable) <3.001000>
semtimedop(196610, , 1, {3, 0}) = -1 EAGAIN (Resource temporarily unavailable) <3.001000>
...

So Oracle uses UNIX (System V) semaphores to implement waiting on enqueues. (The holders, converters, and waiters linked lists themselves are protected at the Oracle kernel level by the enqueue and enqueue hash latches, so no two processes can modify them at the same time.)

A separate semaphore is used for each protected resource, like a traffic light at a road junction controlling the flow of the locks through the enqueue linked lists. In order for a process to move “up” through the waiters queue, it needs to get hold of the semaphore with ID 196610, semaphore number 35 in this case. It sleeps for 3 seconds at a time while waiting for the semaphore, so it doesn't waste any CPU cycles. The nice thing is that when a process is waiting to get hold of a semaphore, the owner process calls the semctl system call, which sets the semaphore to value 0. The waiters are then notified that the status of the semaphore has changed, so their wait is interrupted:

...
semctl(196610, 33, IPC_64|SETVAL, 0xbfeada10) = 0 <0.009000>
semtimedop(196610, , 1, {0, 100000000}) = 0 <0.003000>
...

Consequently, a large number of sessions waiting on a lock would not impact the server performance noticeably.
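To make this concrete, here is a minimal sketch of the wait/post pattern seen in the traces above (my own illustration, not Oracle code; the semaphore set, values, and timeouts are arbitrary). The waiter sleeps in semtimedop with a timeout, and a SETVAL to 0 from the holder is what wakes it up:

#define _GNU_SOURCE
#include <stdio.h>
#include <errno.h>
#include <time.h>
#include <sys/ipc.h>
#include <sys/sem.h>

int main(void)
{
    /* Create a one-semaphore set; Oracle uses pre-created sets such
     * as the one with ID 196610 seen in the traces. */
    int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);

    semctl(semid, 0, SETVAL, 1);   /* holder marks the resource busy */

    /* Waiter: "wait for zero" with a 3-second timeout per attempt,
     * mirroring the semtimedop(..., {3, 0}) = EAGAIN loop above. */
    struct sembuf wait_for_zero = { 0, 0, 0 };
    struct timespec timeout = { .tv_sec = 3, .tv_nsec = 0 };

    for (int attempt = 1; attempt <= 3; attempt++) {
        if (semtimedop(semid, &wait_for_zero, 1, &timeout) == 0) {
            printf("attempt %d: resource is free\n", attempt);
            break;
        }
        if (errno == EAGAIN)
            printf("attempt %d: still busy after 3s\n", attempt);
        if (attempt == 2)                 /* simulate the holder's post; */
            semctl(semid, 0, SETVAL, 0);  /* it wakes real waiters at once */
    }

    semctl(semid, 0, IPC_RMID);    /* remove the semaphore set */
    return 0;
}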

Latches and Mutexes

Latches, as we know, are used to protect Oracle's internal structures from corruption by multiple sessions/processes modifying them at the same time. An example is the cache buffers chains: linked lists attached to a hash table, used to quickly determine whether a block is in the buffer cache by hashing its address and class. Each linked list (chain) of blocks is protected by a latch, so whether we need to add a block to the list, remove one, or just read through the list, we need to get that latch. Obviously the latch needs to be held for a very short time, as it is acquired very often (for each block being accessed). Processes waiting for a latch are not queued: they either retry to get the latch or get posted that the latch is free while sleeping, so there is no order in who gets the latch.

We can compare latches to a company of 100 employees in 10 departments, each department with its own secretary answering the phone. When someone from the outside wants to call Scott in marketing, he dials the marketing department's phone number. If the secretary is already talking on the phone, he gets a busy signal. If the caller is willing to wait and must make the call, he redials immediately, trying to reach Scott. If he is very desperate (and doesn't have a life), he tries 20,000 times until he gets tired, makes a short break (goes to sleep), and later continues to bother the phone with another 20,000 attempts. In the meantime, if another caller attempts and succeeds in getting the free phone line, he gets through to the marketing department first. The caller represents the process needing the latch, the secretaries' phones are the latches, and the resource needed is Scott.

This method of getting a latch is the so-called spinning, and it was used prior to Oracle 10g. (This is platform-dependent; I am aware of the implementation on Linux.) The bad thing about spinning is that it burns CPU; it is an active process. Think of the caller trying to reach Scott: while he desperately redials 20,000 times, he uses all of his “CPU” and is unable to do anything else while dialing.

That's why in newer versions Oracle implemented “sleeping” latch waits: a process that sees the latch is in use immediately goes to sleep and gets notified by the latch-holding process when the latch becomes free. This means the waiters don't clog the CPU while waiting; they just sleep. So contention on latches should not impact the whole system and reduce scalability. Here is how it goes:

  1. We will do some testing with the so-called “first spare latch”. We can get its address from v$latch:
    1 select addr, name from v$latch
    2* where name like '%spare%'
    SQL> /
    ADDR NAME
    -------- ----------------------------------------------------------------
    200091F8 first spare latch
    2000925C second spare latch
    200092C0 third spare latch
    20009324 fourth spare latch
    20009388 fifth spare latch
  2. Acquire the latch “artificially” by calling the Oracle function kslgetl:
    SQL> oradebug setmypid
    Statement processed.
    SQL> oradebug call kslgetl 0x&laddr 1 2 3
    Enter value for laddr: 200091F8
    Function returned 1
  3. From another session, trace a session that will try to acquire the same latch and will wait:
    SQL> oradebug setmypid
    Statement processed.
    SQL> oradebug call kslgetl 0x&laddr 1 2 3
    Enter value for laddr: 200091F8
    <The process is waiting now>
    The strace output shows that it waits using the semop system call:
    semop(196610, , 1

This means that the process will sleep until it is woken by the process holding the latch, and it will not burn any CPU. When I first discovered that waiting on a latch is not active CPU spinning but just sleeping, it was a bit of a revelation to me. I felt like someone who had just realized that the Earth is round and not flat!
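For contrast, here is a rough sketch of the two waiting strategies (my own illustration, not Oracle's latch code; the spin count of 20,000 just echoes the analogy above): the old spin-and-retry approach versus sleeping on a semaphore until the holder posts it:

#define _GNU_SOURCE
#include <stdatomic.h>
#include <sched.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* A hypothetical latch: an atomic flag plus a semaphore to sleep on. */
struct latch {
    atomic_int busy;   /* 0 = free, 1 = held */
    int        semid;  /* semaphore set used for post/wait */
};

/* Pre-10g style, per the text: spin hard, take a break, spin again. */
void latch_get_spinning(struct latch *l)
{
    for (;;) {
        for (int spin = 0; spin < 20000; spin++) {
            int expected = 0;
            if (atomic_compare_exchange_strong(&l->busy, &expected, 1))
                return;      /* got the latch, but we burned CPU trying */
        }
        sched_yield();       /* tired: short break, then another 20000 */
    }
}

/* Newer style: one quick attempt, then sleep until the holder posts.
 * (Real implementations must close the race between the check and
 * the sleep; this sketch ignores that for brevity.) */
void latch_get_sleeping(struct latch *l)
{
    int expected = 0;
    while (!atomic_compare_exchange_strong(&l->busy, &expected, 1)) {
        struct sembuf wait_for_zero = { 0, 0, 0 };
        /* Blocks with no timeout, like the
         * semop(196610, , 1) = 0 <207.309000> call below. */
        semop(l->semid, &wait_for_zero, 1);
        expected = 0;        /* woken by the holder; try again */
    }
}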

The latch holder notifies the waiters (sleepers) by setting the semaphore value:

semctl(196610, 33, IPC_64|SETVAL, 0xbfd5774c) = 0 <0.001000>

At which point the waiter finally (after waiting for 207 seconds here) gets the latch:

semop(196610, , 1) = 0 <207.309000>

Going back to the comparison of latches with secretaries answering phones, this is as if the secretaries kept a list of all callers who called while the phone was busy and called them back after finishing the current call, so the callers would not waste time (CPU) by obsessively redialing.

All this is good: waiting on a latch no longer wastes CPU, and time spent waiting on latch free is decreased, since processes do not have to finish their sleep or spin count before realizing the latch is actually free. However, it is even better to decrease the chance of hitting latch-free contention at all, i.e. to minimize the chance of one process waiting for another to finish its work on something protected by the latch. This is where mutexes come into the game.

Going back again to the office example with the secretaries, mutexes are like giving all employees their own phones and removing the secretary, who had become a bottleneck. A caller would then get a busy signal only if the called person is actually busy talking (in our example, decreasing the chance of getting a busy signal tenfold).

In the Oracle world, this is not implemented for all resources that need to be protected, because it would become an overhead in memory usage. (Think of an additional 100-200 bytes allocated for a mutex for each block in the buffer cache if, for example, mutexes were used to protect block buffers.)

From 10g onward, I have noticed mutexes being used to protect cursors in the shared pool while they are parsed and executed. In theory, we should rarely see contention on mutexes since, as mentioned above, the chance of asking for a mutex that is already in use should be much smaller (unless, say, the same SQL is massively re-executed over and over in many sessions in parallel).

Mutexes in Oracle are not the same thing as mutexes at the OS level. Oracle mutexes are implemented using OS semaphores (on Linux), with a simple kernel variable used as a placeholder for a flag that marks a resource as busy or free, serializing otherwise parallel processes that need access to the protected resource. So latches and mutexes are actually a high-level Oracle instance interface to the OS semaphores.
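As a minimal sketch of that idea (an illustration of the concept, not Oracle's actual implementation), a shared word can act as the busy/free flag, with the semaphore touched only when a process actually has to wait:

#define _GNU_SOURCE
#include <stdatomic.h>
#include <time.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* Illustration only: a shared word marks the resource busy or free;
 * the semaphore is only used when a process actually has to wait. */
struct simple_mutex {
    atomic_int value;  /* 0 = free, non-zero = busy */
    int        semid;  /* semaphore set to sleep on while busy */
};

void mutex_get(struct simple_mutex *m)
{
    int expected = 0;
    /* Fast path: a single atomic test-and-set, no system call at all. */
    while (!atomic_compare_exchange_strong(&m->value, &expected, 1)) {
        /* Slow path: sleep up to 1 centisecond on the semaphore,
         * mirroring the semtimedop(..., {0, 10000000}) calls traced
         * later in this post, then recheck the flag. */
        struct sembuf wait_for_zero = { 0, 0, 0 };
        struct timespec ts = { .tv_sec = 0, .tv_nsec = 10000000 };
        semtimedop(m->semid, &wait_for_zero, 1, &ts);
        expected = 0;  /* compare-exchange overwrote it; reset */
    }
}

void mutex_release(struct simple_mutex *m)
{
    atomic_store(&m->value, 0);      /* mark the resource free */
    semctl(m->semid, 0, SETVAL, 0);  /* post any sleeping waiters */
}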

I have tested creating artificial mutex contention by just “poking” (Andrey Nikolaev's method) the actual memory location of the mutex, setting a value that represents a busy mutex:

We first need to find the actual address of the mutex in memory, since it is not in a fixed place. For protecting cursors, there is a separate mutex for each SQL statement with a particular hash value (I have noticed that even child cursors have different mutexes, although they share the same hash value). So we first create contention for a particular SQL statement.

I am using the following SQL, as it doesn't cause any additional load in logical/physical reads:

SQL> l
1* select 'bla' from dual
SQL> /

'BL
---
bla

Execution Plan
----------------------------------------------------------
Plan hash value: 1388734953

-----------------------------------------------------------------
| Id | Operation        | Name | Rows | Cost (%CPU)| Time     |
-----------------------------------------------------------------
|  0 | SELECT STATEMENT |      |    1 |     2   (0)| 00:00:01 |
|  1 |  FAST DUAL       |      |    1 |     2   (0)| 00:00:01 |
-----------------------------------------------------------------

Statistics
----------------------------------------------------------
          0 recursive calls
          0 db block gets
          0 consistent gets
          0 physical reads
          0 redo size
        420 bytes sent via SQL*Net to client
        419 bytes received via SQL*Net from client
          2 SQL*Net roundtrips to/from client
          0 sorts (memory)
          0 sorts (disk)
          1 rows processed

SQL>

This is a script for creating mutex contention:

[oracle@gtest scripts]$ cat mutex_load.sh
N=1 ;
until test "$N" -gt "101";
do echo $N
sqlplus /nolog @soft_parse.sql &
sleep 5
N=`expr $N + 1`
done

[oracle@gtest scripts]$ cat soft_parse.sql
connect / as sysdba
begin
for i in 1..9000000 loop
execute immediate 'select ''bla'' from dual';
end loop;
end;
/
exit

So the script will run about 100 sessions, each continuously executing the specified SQL. Let's find the SQL hash value, which we will need in order to find the mutex address:

select hash_value, sql_text, child_number from v$sql where sql_text like '%bla%'

SQL> /

HASH_VALUE SQL_TEXT                                           CHILD_NUMBER
---------- -------------------------------------------------- ------------
2386981435 begin for i in 1..9000000 loop execute immediat               0
           e 'select ''bla'' from dual'; end loop; end;
 957527304 select 'bla' from dual                                        0
 957527304 select 'bla' from dual                                        1
2344015831 EXPLAIN PLAN SET STATEMENT_ID='PLUS4294967295' FOR            0
           select 'bla' from dual
2344015831 EXPLAIN PLAN SET STATEMENT_ID='PLUS4294967295' FOR            1
           select 'bla' from dual
3438743966 select hash_value, sql_text, child_number from v$s            0
           ql where sql_text like '%bla%'
1671538998 select hash_value, sql_text from v$sql where sql_t            0
           ext like '%bla%'

7 rows selected.

We are interested in the SQL with hash value 957527304, and we have two child cursors for that SQL, since I executed it once plainly in SQL*Plus and once through execute immediate in the script above. So the mutex in question will have the identifier 957527304 and will protect both children of this SQL:

select MUTEX_ADDR, MUTEX_IDENTIFIER, MUTEX_TYPE, max(gets), max(sleeps), mutex_value
from x$mutex_sleep_history
group by MUTEX_ADDR, MUTEX_IDENTIFIER, MUTEX_TYPE, mutex_value
order by 4, 5

SQL> /

MUTEX_AD MUTEX_IDENTIFIER MUTEX_TYPE                        MAX(GETS) MAX(SLEEPS) MUTEX_VA
-------- ---------------- -------------------------------- ---------- ----------- --------
37D31DB4       3607215236 Cursor Pin                                1           1 00260000
37D0BDB4        722748295 Cursor Pin                                1           1 00280000
37D18DB4       4063208512 Cursor Pin                                1           1 00280000
37D3FDB4       3873422482 Cursor Pin                                9           2 00220000
37D4ADB4       3165782676 Cursor Pin                               11           1 00220000
36D493C0       3096556448 Cursor Pin                               16           1 00240000
352156CC        957527304 Cursor Pin                         15555114         178 003E0012
352156CC        957527304 Cursor Pin                         15765884          81 00450012
352156CC        957527304 Cursor Pin                         16536474         107 00430016
352156CC        957527304 Cursor Pin                         16776537         116 00340011
352156CC        957527304 Cursor Pin                         17281498          77 004C0017

The memory address of the mutex protecting the cursor for our SQL is 352156CC. Let’s poke it a bit to make the mutex busy:

SQL> oradebug setmypid
Statement processed.
SQL> oradebug poke 0x352156CC 4 0x004C0017
BEFORE: [352156CC, 352156D0) = 00000000
AFTER: [352156CC, 352156D0) = 004C0017

And from another session, we initiate a wait on that mutex:

1 begin
2 execute immediate 'select ''bla'' from dual';
3* end;
4 /

<The session is waiting now...>
Let’s see what the session is waiting on:

EVENT                          BLOCKING_SESSION
------------------------------ ----------------
cursor: pin S                                76

The number of the blocking session, 76, is actually encoded in the first two bytes of the value I set in the mutex: 0x4C, which is 76 in decimal.
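As a quick check of that arithmetic (assuming, as this output suggests, that the high-order 16 bits of the mutex value carry the holder's session ID):

#include <stdio.h>

int main(void)
{
    unsigned int mutex_value  = 0x004C0017;        /* the poked value      */
    unsigned int blocking_sid = mutex_value >> 16; /* high-order two bytes */

    printf("blocking session = %u\n", blocking_sid); /* prints 76 */
    return 0;
}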

So how is waiting on mutexes implemented? Doing an strace on the waiting process shows that it repeatedly executes:

semtimedop(196610, , 1, {0, 10000000}) = -1 EAGAIN (Resource temporarily unavailable) <0.011000>
semtimedop(196610, , 1, {0, 10000000}) = -1 EAGAIN (Resource temporarily unavailable) <0.011000>
semtimedop(196610, , 1, {0, 10000000}) = -1 EAGAIN (Resource temporarily unavailable) <0.011000>
semtimedop(196610, , 1, {0, 10000000}) = -1 EAGAIN (Resource temporarily unavailable) <0.011000>
semtimedop(196610, , 1, {0, 10000000}) = -1 EAGAIN (Resource temporarily unavailable) <0.011000>

So again, it uses semaphores with a defined timeout (1 centisecond here). The process tries to get semaphore number 75 in the semaphore array with ID 196610, sees it is busy, sleeps for 1 centisecond, times out with an error, and retries the same thing in a loop that ends when the semaphore is free. I released the mutex (i.e. the semaphore) by setting its memory location back to a value of 0:

SQL> oradebug setmypid
Statement processed.
SQL> oradebug poke 0x352156CC 4 00000000
BEFORE: [352156CC, 352156D0) = 004C0017
AFTER: [352156CC, 352156D0) = 00000000

This doesn’t notify (post) the waiters that the mutex is free since I have just modified the location in memory and did not use system calls normally, such as syscrl. On the next semaphore time out it has realized that the mutex is free and executed the SQL, which was waiting:

1 begin
2 execute immediate 'select ''bla'' from dual';
3* end;
SQL> /

PL/SQL procedure successfully completed.

So all mechanisms for protecting shared resources in recent Oracle RDBMS versions (I have tested this on Oracle 11.2.0.3 on Linux) seem to use semaphores as the underlying mechanism at the OS level. Here is how two processes would use a semaphore to acquire access to a shared resource (a SQL cursor in this example):

When one of the processes in the example would like to access the shared resource:

      1. It tries to get hold of the semaphore responsible for that object using the semop or semtimedop system calls. If the semaphore is already marked as being used, the process goes to sleep; it is either awoken when the semaphore gets released or, if semtimedop was used, when the specified timeout period has elapsed, returning an error code.
      2. If the semaphore is free, its counter is incremented, so it is marked as being used, and the system call finishes with a return value of 0.
      3. The process that was looking to get hold of the resource now has full access to it.
      4. When it finishes with the resource, it releases the semaphore by calling the semctl system call, which notifies the waiting processes (those that have executed sem(timed)op on that semaphore); they wake up and get a chance to acquire the semaphore and the underlying resource, as demonstrated in the sketch below.
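Here is a minimal two-process demonstration of this protocol (my own sketch; the semaphore key, values, and timings are arbitrary): the child plays the waiting session and sleeps in semop, while the parent plays the holder and posts with semctl:

#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>
#include <sys/ipc.h>
#include <sys/sem.h>

int main(void)
{
    int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);

    semctl(semid, 0, SETVAL, 1);       /* step 2: mark the resource busy */

    if (fork() == 0) {                 /* child = the waiting session    */
        struct sembuf wait_for_zero = { 0, 0, 0 };
        printf("child: resource busy, sleeping on semaphore\n");
        semop(semid, &wait_for_zero, 1);   /* step 1: sleep, burn no CPU */
        printf("child: posted, acquiring resource\n");  /* steps 2 and 3 */
        _exit(0);
    }

    sleep(2);                          /* parent works with the resource */
    printf("parent: done, posting waiters\n");
    semctl(semid, 0, SETVAL, 0);       /* step 4: release and notify     */

    wait(NULL);
    semctl(semid, 0, IPC_RMID);        /* clean up the semaphore set     */
    return 0;
}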

The implementation of mutexes in release 11.2.0.3 (and 11.2.0.2.2) added some parameters that allow the mechanics of mutexes to be tuned with respect to how long a mutex sleeps and how it uses CPU cycles while waiting. It also added the possibility of an “exponential back-off” while sleeping. This is explained in the MOS note:

Bug 10411618 - Enhancement to add different "Mutex" wait schemes [ID 10411618.8]

Wait schemes

~~~~~~~~~~~~

The following are the wait schemes added. For each of them we spin for a fixed number of times and then yield the first time we try to get the mutex.

From the second time onwards we do one of the following:

* Yield - Always yield the CPU.
* Sleep - Sleep for a fixed amount of time
* Exponential back off - Each iteration we sleep for a greater amount of time

When a process “yields” the CPU, it actually means it gives up its time slice on the CPU and is put at the end of the run queue. This implies that in a busy system with a long run queue, processes yielding the CPU will suffer poor performance. This is the reason why in some releases, 11.2.0.2 for example, we would see waits on mutexes (like cursor: pin S) in the top 5 wait events on a busy system.
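A sketch of what scheme 0 boils down to (illustrative only, not Oracle's code): the process never sleeps, it just keeps handing its time slice back to the scheduler between attempts, which is exactly the sched_yield() storm visible in the strace output further below:

#define _GNU_SOURCE
#include <stdatomic.h>
#include <sched.h>

/* Wait scheme 0 in miniature: no sleeping, just give up the time
 * slice and retry the atomic test-and-set. */
void mutex_get_yield(atomic_int *mutex_value)
{
    int expected = 0;
    while (!atomic_compare_exchange_strong(mutex_value, &expected, 1)) {
        sched_yield();   /* go to the back of the run queue */
        expected = 0;    /* reset and retry                 */
    }
}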

These are the hidden parameters that can be used for mutex tuning:

> For 11.2.0.3, 11.2.0.2.2 with patch 12431716 or 11.2.0.2.3 onwards:
* _mutex_spin_count (Integer)
- This sets the number of times to spin before yielding/waiting.
* _mutex_wait_scheme (Integer)
- In 11.2 this controls which wait scheme to use. It can be set to one of the three wait schemes described above thus:
o _mutex_wait_scheme = 0 - Always YIELD
o _mutex_wait_scheme = 1 & _mutex_wait_time = t - Always SLEEP for t milli-seconds
o _mutex_wait_scheme = 2 & _mutex_wait_time = t - EXP BACKOFF with maximum sleep

=============

I have done some testing with different mutex wait schemes and wait times:

SQL> alter system set "_mutex_wait_time"=10 scope=memory;
System altered.

Here are the system calls it made while waiting, with the sleep time increasing up to 10 centiseconds (_mutex_wait_scheme is 2 here):

semtimedop(196610, , 1, {0, 10000000}) = -1 EAGAIN (Resource temporarily unavailable) <0.011000>
semtimedop(196610, , 1, {0, 10000000}) = -1 EAGAIN (Resource temporarily unavailable) <0.011000>
semtimedop(196610, , 1, {0, 30000000}) = -1 EAGAIN (Resource temporarily unavailable) <0.031000>
semtimedop(196610, , 1, {0, 30000000}) = -1 EAGAIN (Resource temporarily unavailable) <0.031000>
semtimedop(196610, , 1, {0, 70000000}) = -1 EAGAIN (Resource temporarily unavailable) <0.071000>
semtimedop(196610, , 1, {0, 70000000}) = -1 EAGAIN (Resource temporarily unavailable) <0.071000>
semtimedop(196610, , 1, {0, 100000000}) = -1 EAGAIN (Resource temporarily unavailable) <0.101000>
semtimedop(196610, , 1, {0, 100000000}) = -1 EAGAIN (Resource temporarily unavailable) <0.101000>
semtimedop(196610, , 1, {0, 100000000}) = -1 EAGAIN (Resource temporarily unavailable) <0.101000>
semtimedop(196610, , 1, {0, 100000000}) = -1 EAGAIN (Resource temporarily unavailable) <0.101000>
...

Or up to 1 second of waiting:

SQL> alter system set "_mutex_wait_time"=100;

System altered.

semtimedop(196610, , 1, {0, 10000000}) = -1 EAGAIN (Resource temporarily unavailable) <0.011000>
semtimedop(196610, , 1, {0, 10000000}) = -1 EAGAIN (Resource temporarily unavailable) <0.011000>
semtimedop(196610, , 1, {0, 30000000}) = -1 EAGAIN (Resource temporarily unavailable) <0.031000>
semtimedop(196610, , 1, {0, 70000000}) = -1 EAGAIN (Resource temporarily unavailable) <0.071000>
semtimedop(196610, , 1, {0, 120000000}) = -1 EAGAIN (Resource temporarily unavailable) <0.121000>
semtimedop(196610, , 1, {0, 130000000}) = -1 EAGAIN (Resource temporarily unavailable) <0.131000>
semtimedop(196610, , 1, {0, 220000000}) = -1 EAGAIN (Resource temporarily unavailable) <0.221000>
semtimedop(196610, , 1, {0, 230000000}) = -1 EAGAIN (Resource temporarily unavailable) <0.231000>
semtimedop(196610, , 1, {0, 400000000}) = -1 EAGAIN (Resource temporarily unavailable) <0.401000>
semtimedop(196610, , 1, {0, 410000000}) = -1 EAGAIN (Resource temporarily unavailable) <0.411000>
semtimedop(196610, , 1, {0, 740000000}) = -1 EAGAIN (Resource temporarily unavailable) <0.741000>
semtimedop(196610, , 1, {0, 750000000}) = -1 EAGAIN (Resource temporarily unavailable) <0.751000>
semtimedop(196610, , 1, {1, 0}) = -1 EAGAIN (Resource temporarily unavailable) <1.001000>
semtimedop(196610, , 1, {1, 0}) = -1 EAGAIN (Resource temporarily unavailable) <1.001000>
semtimedop(196610, , 1, {1, 0}) = -1 EAGAIN (Resource temporarily unavailable) <1.001000>
...
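The back-off schedule above can be sketched as follows (my own illustration of the pattern visible in the traces, not Oracle's exact algorithm): each step roughly doubles until it is capped by _mutex_wait_time (10 centiseconds in the first trace, 100 in the second):

#include <stdio.h>

/* Rough reproduction of the sleep schedule seen in the traces:
 * roughly doubling (1, 3, 7, ... centiseconds) until capped by
 * _mutex_wait_time. Illustrative only. */
int main(void)
{
    int max_wait_cs = 10;  /* pretend _mutex_wait_time = 10 */
    int sleep_cs = 1;

    for (int attempt = 1; attempt <= 8; attempt++) {
        printf("attempt %d: sleep %d cs\n", attempt, sleep_cs);
        sleep_cs = sleep_cs * 2 + 1;    /* 1, 3, 7, 15, ...  */
        if (sleep_cs > max_wait_cs)
            sleep_cs = max_wait_cs;     /* capped at the max */
    }
    return 0;
}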

I also tried scheme 0, which means the process does not sleep at all but just yields the CPU (gives it up):

SQL> alter system set "_mutex_wait_scheme"=0;

System altered.

Here are its system calls:

...
sched_yield() = 0 <0.000000>
sched_yield() = 0 <0.000000>
sched_yield() = 0 <0.000000>
sched_yield() = 0 <0.000000>
sched_yield() = 0 <0.000000>
^C

And the strace output (with the summary option):

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 57.71    1.729747           4    491535           sched_yield
 35.39    1.060833           2    496678           gettimeofday
  6.84    0.204961          41      4965           select
  0.03    0.001000           7       140           munmap
  0.03    0.001000         200         5           pwrite64
  0.00    0.000000           0        11           read
  0.00    0.000000           0         7           write
  0.00    0.000000           0         8           open
  0.00    0.000000           0        18           close
  0.00    0.000000           0         1           chmod
  0.00    0.000000           0         7           lseek
  0.00    0.000000           0        28           times
  0.00    0.000000           0         1         1 ioctl
  0.00    0.000000           0        98           getrusage
  0.00    0.000000           0         3           statfs
  0.00    0.000000           0         2           fstatfs
  0.00    0.000000           0        20           rt_sigaction
  0.00    0.000000           0         2           rt_sigprocmask
  0.00    0.000000           0         3           pread64
  0.00    0.000000           0         5           getrlimit
  0.00    0.000000           0         2           mmap2
  0.00    0.000000           0        29           stat64
  0.00    0.000000           0         8           lstat64
  0.00    0.000000           0        10           fcntl64
  0.00    0.000000           0         3           futex
  0.00    0.000000           0         1           semctl
------ ----------- ----------- --------- --------- ----------------
100.00    2.997541                993590         1 total

===================================================================

References and further reading:

- "Latch, mutex and beyond" by Andrey Nikolaev: https://andreynikolaev.wordpress.com/

- Tanel Poder's blog: https://blog.tanelpoder.com

- The Quadro blog by Alex Fatkulin: https://afatkulin.blogspot.ca/2009/01/longhold-latch-waits-on-linux.html
