Business Insights: Data Analytics for Business Insights

High System Time on Windows Running in a VM

Written by Timur Akhmadeev | Oct 22, 2014 4:00:00 AM
Recently I've seen an issue with CPU usage on a server running Windows 2003 Server in a VMware. This is a small Virtual Machine with just 2 cores allocated (which are possibly mapped to "threads" on a host level but I don't know the details). For some reason very high System CPU time was reported in a Statspack report. Here is how it looks like in a 1 hour Statspack report: [sourcecode gutter="false"] Host CPU (CPUs: 2 Cores: 2 Sockets: 0) ~~~~~~~~ Load Average Begin End User System Idle WIO WCPU ------- ------- ------- ------- ------- ------- -------- 3.04 8.77 88.19 [/sourcecode] Note that the System CPU time is more than twice the User CPU time on average (remember the averages could be misleading sometimes). This caught my attention as usual. Although the average CPU used is not really high, this server is somewhat sluggish even for a one hop RDP connection over the VPN. I have tried to find out some details about what is going on. Since I'm not a Windows guy, I did not know what kind of tools could be used to track places in the OS kernel that take too much time. On Linux this is relatively easy starting with strace/pstack/perf utilities and other command line tools. Windows is different. I've started to search for the options available, and the first thing to find is of course Perfmon, which allows to track and visualize different OS related metrics (counters in Perfmon terminology) on a system, CPU, or process levels. I've used it to capture a few key metrics such as User Time, System Time (which is apparently called Privileged Time on Windows), Queue length and Context Switches per second. From a graph of the CPU usage the issue is visible: Here the white line is representing Privileged (or System) CPU, and yellow line is Total CPU. It's clear that almost all used CPU is accounted to the Privileged part. By the way it is actually very easy to see a similar picture in a standard Performance tab of Task Manager, you just need to select View then Show Kernel Times and Privileged part of the used CPU will be displayed in red. After that I have searched for details of where to find why Privileged CPU time is so high. A good article that I have found is here. Although it is relatively old, it fits my case as the OS is a 32 bit Windows 2003 Server. The article points to a tool called KernRates. This is a command line tool with a very easy interface: you run it, wait for some time and stop it with Ctrl-C. After that the tool prints the profile of system calls by module. Here is what I've seen: [sourcecode gutter="false"] C:\Program Files\KrView\Kernrates>Kernrate_i386_Win2000.exe /==============================\ < KERNRATE LOG > \==============================/ Date: 2014/09/03 Time: 12:39:21 Machine Name: *** Number of Processors: 2 PROCESSOR_ARCHITECTURE: x86 PROCESSOR_LEVEL: 6 PROCESSOR_REVISION: 1706 Physical Memory: 3072 MB Pagefile Total: 6996 MB Virtual Total: 2047 MB PageFile1: \??\C:\pagefile.sys, 4080MB OS Version: 5.2 Build 3790 Service-Pack: 2.0 WinDir: C:\WINDOWS Kernrate User-Specified Command Line: Kernrate_i386_Win2000.exe Kernel Profile (PID = 0): Source= Time, Using Kernrate Default Rate of 25000 events/hit Starting to collect profile data ***> Press ctrl-c to finish collecting profile data ===> Finished Collecting Data, Starting to Process Results ------------Overall Summary:-------------- P0 K 0:00:03.703 ( 8.7%) U 0:00:00.734 ( 1.7%) I 0:00:38.046 (89.6%) DPC 0:00:00.031 ( 0.1%) Interrupt 0:00:00.406 ( 1.0%) Interrupts= 22840, Interrupt Rate= 538/sec. P1 K 0:00:02.343 ( 5.5%) U 0:00:00.656 ( 1.5%) I 0:00:39.484 (92.9%) DPC 0:00:00.000 ( 0.0%) Interrupt 0:00:00.281 ( 0.7%) Interrupts= 20017, Interrupt Rate= 471/sec. TOTAL K 0:00:06.046 ( 7.1%) U 0:00:01.390 ( 1.6%) I 0:01:17.531 (91.2%) DPC 0:00:00.031 ( 0.0%) Interrupt 0:00:00.687 ( 0.8%) Total Interrupts= 42857, Total Interrupt Rate= 1009/sec. Total Profile Time = 42484 msec BytesStart BytesStop BytesDiff. Available Physical Memory , 372678656, 363945984, -8732672 Available Pagefile(s) , 3285475328, 3281805312, -3670016 Available Virtual , 2131267584, 2130219008, -1048576 Available Extended Virtual , 0, 0, 0 Total Avg. Rate Context Switches , 61247, 1442/sec. System Calls , 305201, 7184/sec. Page Faults , 58440, 1376/sec. I/O Read Operations , 3496, 82/sec. I/O Write Operations , 2637, 62/sec. I/O Other Operations , 29567, 696/sec. I/O Read Bytes , 59649045, 17062/ I/O I/O Write Bytes , 2653894, 1006/ I/O I/O Other Bytes , 624604436, 21125/ I/O ----------------------------- Results for Kernel Mode: ----------------------------- OutputResults: KernelModuleCount = 109 Percentage in the following table is based on the Total Hits for the Kernel Time 33235 hits, 25000 events per hit -------- Module Hits msec %Total Events/Sec intelppm 30310 42486 91 % 17835286 ntkrnlpa 2337 42486 7 % 1375158 hal 271 42486 0 % 159464 mfehidk01 74 42486 0 % 43543 Ntfs 58 42486 0 % 34128 mfehidk 52 42486 0 % 30598 mfeapfk 47 42486 0 % 27656 mfeavfk01 17 42486 0 % 10003 tcpip 13 42486 0 % 7649 win32k 12 42486 0 % 7061 mfeavfk 10 42486 0 % 5884 fltmgr 6 42486 0 % 3530 CLASSPNP 3 42486 0 % 1765 SCSIPORT 3 42486 0 % 1765 RDPDD 2 42486 0 % 1176 afd 2 42486 0 % 1176 Npfs 2 42486 0 % 1176 NDIS 2 42486 0 % 1176 symmpi 2 42486 0 % 1176 TDTCP 1 42486 0 % 588 rdbss 1 42486 0 % 588 netbt 1 42486 0 % 588 mfetdi2k 1 42486 0 % 588 ipsec 1 42486 0 % 588 termdd 1 42486 0 % 588 TDI 1 42486 0 % 588 vmxnet 1 42486 0 % 588 KSecDD 1 42486 0 % 588 atapi 1 42486 0 % 588 volsnap 1 42486 0 % 588 ftdisk 1 42486 0 % 588 ================================= END OF RUN ================================== ============================== NORMAL END OF RUN ============================== [/sourcecode] The default output contains some basic information about the system, CPU usage, memory and context switching. The kernel modules profile is the most interesting part here. It lists some modules with internal names and the profile data: number of times the module was running during a sample; this is the most important information. So in mycase intelppm was the top running kernel module. I've searched again, now for intelppm, and found a few posts describing similar symptoms. Apparently intelppm is a CPU driver. Sometimes it causes issues such as BSOD or high CPU usage, especially if it is a cloned VM and CPU architecture changes in between. It was not clear if this something which can be disabled, but there were posts suggesting that stopping this service (which is not listed in Services) helped a few people. So I have recommended the client to try to disable this driver with the following commands: [sourcecode gutter="false"] sc config intelppm start=disabled sc stop intelppm [/sourcecode] Theoretically this should disable Intel CPU driver and Windows should try to use another if it is available. When we tried to run it, the 2nd command (to stop the driver) failed with the following message: [sourcecode gutter="false"] [SC] ControlService FAILED 1052: The requested control is not valid for this service. [/sourcecode] So it is not possible to stop the driver online, and Windows restart is necessary. We did a restart of the VM. After that, the situation was a bit different: the CPU time was somewhat reduced; but the privileged part was still quite high with hal (Hardware Abstraction Layer) on top instead of intelppm: [sourcecode gutter="false"] Time 95865 hits, 25000 events per hit -------- Module Hits msec %Total Events/Sec hal 82669 125183 86 % 16509629 ntkrnlpa 11788 125183 12 % 2354153 mfehidk 474 125183 0 % 94661 mfeapfk 224 125183 0 % 44734 Ntfs 207 125183 0 % 41339 vmmemctl 155 125183 0 % 30954 mfeavfk 92 125183 0 % 18373 tcpip 85 125183 0 % 16975 win32k 54 125183 0 % 10784 fltmgr 14 125183 0 % 2795 mfetdi2k 11 125183 0 % 2196 TDI 10 125183 0 % 1997 RDPWD 9 125183 0 % 1797 PartMgr 9 125183 0 % 1797 KSecDD 7 125183 0 % 1397 SCSIPORT 7 125183 0 % 1397 afd 6 125183 0 % 1198 symmpi 6 125183 0 % 1198 RDPDD 5 125183 0 % 998 ipsec 5 125183 0 % 998 NDIS 5 125183 0 % 998 CLASSPNP 5 125183 0 % 998 mfebopk 4 125183 0 % 798 Npfs 3 125183 0 % 599 termdd 3 125183 0 % 599 vmxnet 2 125183 0 % 399 volsnap 2 125183 0 % 399 ndisuio 1 125183 0 % 199 mrxsmb 1 125183 0 % 199 rdbss 1 125183 0 % 199 atapi 1 125183 0 % 199 [/sourcecode] But in terms of Oracle performance everything changed: everything now run much faster, including simple queries in the SQL*Plus. A particular query started to run 3 times faster on average: [sourcecode gutter="false"] -- stats before SQL> @sqlstats cp9jr3hp1jupk Elapsed Ela/exec User IO Rows per Versi Share Avg hard Execs s s CPU, s Gets/exec s exec ons Loads Mem, MB parse, ms PX Exec ------------ ------------ ------------ ----------- ------------ --------- ------------ ------- ------- ------- --------- ------- 135 170.093 1.260 155.31 835 4.29 1 1 1 .02 350.99 0 -- stats after SQL> @sqlstats cp9jr3hp1jupk Elapsed Ela/exec User IO Rows per Versi Share Avg hard Execs s s CPU, s Gets/exec s exec ons Loads Mem, MB parse, ms PX Exec ------------ ------------ ------------ ----------- ------------ --------- ------------ ------- ------- ------- --------- ------- 604 212.151 .351 154.75 1,013 31.79 1 1 1 .02 8.34 0 [/sourcecode] It looks like the change helped, but there is no sign that it helped on the OS level. This makes me think that such an improvement in performance may be attributed to something else, such as OS, hypervisor or combination of them and Oracle. In any case, high system time is not good and it usually indicates that something is wrong.