Pythian Blog: Technical Track

Post-Mortem Analysis Tools: Systemstate Dump

Editor’s note: This “post-mortem analysis” series consists of four posts in total. Make sure you also check out:

Now, onto today’s post:

Imagine this situation: The database is hanging and you can’t figure out what’s going on with the required timing. So to avoid affecting any SLA you decide to restart the database OR your leader/boss instructed you to do it OR you know the database is going to go down, anyway …

Every good DBA has faced this kind of situation at least once. Bottom line, after restarting everything was OK and the “problem” was solved. But now you’re being asked about RCA (root cause analysis—aka what caused this situation?). Let’s say the database was hanging, so no snap was closed for AWR (Automatic Workload Repository) and you lost the ASH (Active Session History) data due to restart. What can you do?

Well, there are a few tools that I’ll be covering in further posts. The first one and our subject for today is the systemstate dump.

Systemstate is basically composed of the process state for all processes in the instance (or instances) at the time it’s called. Through a systemstate it’s possible to identify enqueues, rowcache locks, mutexes, library cache pins and locks, latch free situations and other kinds of chains.

Systemstate is a good tool to add in an SR (systematic review), but it’s quite hard to habituate on reading/interpreting the file. To understand exactly how to read a systemstate the best recommendation I can make is, read the manual! The doc Reading and Understanding Systemstate Dumps (Doc ID 423153.1) has a very good explanation with examples; I’m not able to do it better.

I feel it’s necessary to share some information about the systemstate levels, though, as I had some difficulty finding it myself. SAFE HARBOR STATEMENT: It’s important to mention that Oracle does not recommend you to use “numeric events’’ without an SR (MOS), according to Note: 75713.1.

So, systemstate dump has the following levels:

Level Content
2 dump (not including the lock element)
10 dump
11 dump + global cache of RAC
256 short stack (function stack)
258 256 + 2 -> short stack + dump (not including the lock element)
266 256 + 10 -> short stack + dump
267 256 + 11 -> short stack + dump + global cache of RAC

Levels 11 and 267 will dump the global cache and will generate a large trace file—under normal circumstances these are not recommended.

Generally, if there aren’t too many processes, 266 is the recommended level once it can dump out the processes’ functions stack. The dump can be used to analyze the process in detail. However, even to generate a short stack, if there are too many processes—e.g. 2000+—it may take more than 30 minutes.

In this case, you can generate a level 10 or level 258 dump. Level 258 will collect more than a level 10 short stack, but less than a regular level 10, including some lock element data.

If using a RAC system, please pay attention to Bug 11800959 – A SYSTEMSTATE dump with level> = 10 in RAC dumps huge BUSY GLOBAL CACHE ELEMENTS – can hang / crash instances (Doc ID 11800959.8). The bug is fixed in 11.2.0.3. For versions <= 11.2.0.2 of the RAC, when the system locks an element for a length of time, when executing level 10, 266 or 267 of systemstate dump, it can cause database hang or crash. It may be solved by using level 258.

To generate it in the local node:

oradebug setmypid; 
oradebug unlimit; 
oradebug dump systemstate 266 
oradebug tracefile_name

OR, for all nodes in the cluster:

oradebug setmypid;
oradebug unlimit;
oradebug -g all dump systemstate 266
oradebug tracefile_name

An example of execution:

SQL> oradebug setmypid; 
Statement processed. 
SQL> oradebug unlimit; 
Statement processed. 
SQL> oradebug dump systemstate 266 
Statement processed. 
SQL> oradebug tracefile_name 
/db/oracle/diag/rdbms/pythiandb/pythiandb/trace/pythiandb_ora_18256.trc

If you cannot create a connection to the database with SQL*Plus (even as SYSDBA), because of a hang situation, you can use a preliminary connection.

I’ll explain more about this in a further post. But if you need help ASAP, I’d recommend you read:

  • How to Collect Systemstate Dumps When you Cannot Connect to Oracle (Doc ID 121779.1)
  • Important Customer information about using Numeric Events (Doc ID 75713.1)

An example of a systemstate level 266 dumpfile:

*** 2016-06-15 16:59:00.180
Processing Oradebug command 'dump systemstate 266'
===================================================
SYSTEM STATE (level=10, with short stacks)
------------
System global information:
     processes: base 0xbb5b6850, size 1000, cleanup 0xbb63b780
     allocation: free sessions 0xbba6fb48, free calls (nil)
     control alloc errors: 0 (process), 0 (session), 0 (call)
     PMON latch cleanup depth: 0
     seconds since PMON's last scan for dead processes: 39
     system statistics:
0 OS CPU Qt wait time
727657760 Requests to/from client
7106305 logons cumulative
274 logons current
231675649 opened cursors cumulative
1737 opened cursors current
43388218 user commits
1682106 user rollbacks
835078962 user calls
752889391 recursive calls
12720247 recursive cpu usage
150 pinned cursors current
15100586388 session logical reads
0 session logical reads in local numa group
0 session logical reads in remote numa group
[...]
PROCESS 1:
  ----------------------------------------
  SO: 0xbb63a6d0, type: 2, owner: (nil), flag: INIT/-/-/0x00 if: 0x3 c: 0x3
   proc=0xbb63a6d0, name=process, file=ksu.h LINE:12616, pg=0
  (process) Oracle pid:1, ser:0, calls cur/top: (nil)/(nil)
            flags : (0x20) PSEUDO
            flags2: (0x0),  flags3: (0x10)
            intr error: 0, call error: 0, sess error: 0, txn error 0
            intr queue: empty
    ksudlp FALSE at location: 0
  (post info) last post received: 0 0 0
              last post received-location: No post
              last process to post me: none
              last post sent: 0 0 0
              last post sent-location: No post
              last process posted by me: none
    (latch info) wait_event=0 bits=0
    O/S info: user: , term: , ospid:  (DEAD)
    OSD pid info: Unix process pid: 0, image: PSEUDO
    ----------------------------------------
    SO: 0x6000c838, type: 5, owner: 0xbb63a6d0, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
     proc=(nil), name=kss parent, file=kss2.h LINE:138, pg=0
  PSO child state object changes :
Dump of memory from 0x00000000BB5BA8E8 to 0x00000000BB5BAAF0
0BB5BA8E0                   00000000 00000000          [........]
0BB5BA8F0 00000000 00000000 00000000 00000000  [................]
  Repeat 31 times
PROCESS 2: PMON
  ----------------------------------------
  SO: 0xbb63b780, type: 2, owner: (nil), flag: INIT/-/-/0x00 if: 0x3 c: 0x3
   proc=0xbb63b780, name=process, file=ksu.h LINE:12616, pg=0
  (process) Oracle pid:2, ser:1, calls cur/top: 0xbaea88a8/0xbaea88a8
            flags : (0xe) SYSTEM
            flags2: (0x0),  flags3: (0x10)
            intr error: 0, call error: 0, sess error: 0, txn error 0
            intr queue: empty
    ksudlp FALSE at location: 0
  (post info) last post received: 0 0 16
              last post received-location: ksu.h LINE:13945 ID:ksupsc
              last process to post me: bb75a490 25 0
              last post sent: 0 0 19
              last post sent-location: ksu.h LINE:13957 ID:ksuxfd
              last process posted by me: bb63b780 1 14
    (latch info) wait_event=0 bits=0
    Process Group: DEFAULT, pseudo proc: 0xbba5e848
    O/S info: user: oracle, term: UNKNOWN, ospid: 14438
    OSD pid info: Unix process pid: 14438, image: oracle@devdb09 (PMON)
    Short stack dump:
[...]

I hope this was helpful! If you have questions or thoughts, please leave them in the comments.

You can find other posts in this series here:

No Comments Yet

Let us know what you think

Subscribe by email