
RCA on High CPU Utilization in a Linux Box

High CPU utilization is one of the most common problems on servers, whether in Dev, Stage, or even Production. Support and development teams often struggle to pin down what is actually causing it, the blame game starts, and the ball jumps from one court to another. Most of this happens because of a lack of clarity on how to read CPU utilization on a server. It is a big topic that is hard to cover in a short blog post, but I have tried to keep this as simple, informative, and explanatory as possible.

 

[root@xxx.yyy.zzz ~]# mpstat -P ALL

Linux 2.6.18-194.32.1.el5 (xxx.yyy.zzz.com)    12/19/2011

07:14:44 AM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s

07:14:44 AM  all   46.44    0.00    0.30    1.91    0.02    0.06    0.00   51.27    222.41

07:14:44 AM    0   40.54    0.01    0.55    4.80    0.04    0.11    0.00   53.96    179.90

07:14:44 AM    1   49.56    0.01    0.21    0.59    0.01    0.04    0.00   49.60     14.16

07:14:44 AM    2   49.22    0.00    0.14    0.33    0.01    0.04    0.00   50.26     28.34

It is perfectly okay for a system to have a low idle percentage (it may even be zero), as long as the average run queue stays below 2 x the number of CPUs.

But xxx.yyy.zzz.com (192.168.0.1) has a CPU bottleneck:

[root@xxx.yyy.zzz ~]# vmstat 1 10

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------

r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st

11  1    104  48696 2060588 539632    0    0    59    73    0    0 46  0 51  2  0

13  1    104  48696 2060588 539684    0    0     0     0  110  430 100  0  0  0  0

22  1    104  48432 2060592 539680    0    0     4     0  123  440 100  0  0  0  0

15  1    104  48440 2060592 539684    0    0     0     8  107  450 100  0  0  0  0

16  1    104  48440 2060604 539672    0    0     8    80  125  485 100  0  0  0  0

15  1    104  48440 2060612 539688    0    0     8     0  110  448 100  0  0  0  0

15  1    104  48424 2060620 539680    0    0     8     0  107  524 100  0  0  0  0

19  1    104  48424 2060628 539696    0    0     8     0  264  552 100  0  0  0  0

17  1    104  48424 2060636 539688    0    0     8    16  111  442 100  0  0  0  0

15  1    104  48440 2060652 539680    0    0     8    16  163  516 100  0  0  0  0

 

The metric that identifies CPU over-utilization is the run queue (the r value) exceeding the number of CPUs on the server. With 3 CPUs here, r should stay below 6 (2 x 3), yet it is running between 11 and 22, which confirms the bottleneck.
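As a quick sanity check, you can compare the sampled run queue against the CPU count yourself. A minimal sketch (the awk averaging is just illustrative):

# grep -c '^processor' /proc/cpuinfo

This prints the logical CPU count (3 on the box above). Then average the r column over ten one-second samples:

# vmstat 1 10 | awk 'NR>2 {sum+=$1; n++} END {print "avg r =", sum/n}'

A sustained average above roughly 2 x the CPU count confirms the box is CPU bound.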

[root@xxx.yyy.zzz ~]# top

top - 07:38:07 up 317 days, 18:22,  3 users,  load average: 12.57, 11.99, 11.76

Tasks: 139 total,   4 running, 134 sleeping,   0 stopped,   1 zombie

Cpu0  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

Cpu1  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

Cpu2  : 99.7%us,  0.3%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

Mem:   8175240k total,  8126656k used,    48584k free,  2059612k buffers

Swap:  4192956k total,      104k used,  4192852k free,   539912k cached

 

PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND

23506 root      19   0 3193m 2.4g  12m S 299.1 31.2  25257:21 /opt/jdk1.6.0_23/bin/java -server -Xms2048m -Xmx2048m -XX:MaxPermSize=512m -Dfile.encoding=UTF-8

3465 root      15   0 17532 1292 1044 S  0.3  0.0 171:13.34 /usr/lib/vmware-tools/sbin64/vmware-guestd --background /var/run/vmware-guestd.pid

The above output clearly indicates that the java process is driving all CPUs to full capacity (100%), with almost all of the time spent in user space (us).
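To see which threads inside that process are burning the CPU, top can show per-thread usage (23506 is the java PID from the output above):

# top -H -p 23506

The busiest threads show up with their own IDs; converting an ID to hex lets you match it against the nid= field of a Java thread dump later:

# printf '0x%x\n' 23987

Here 23987 is just a hypothetical thread ID; use the ones actually reported by top -H.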

General Misconceptions

If CPU utilization reaches 100%, then a CPU bottleneck exists.

  1. If CPU utilization does not increase with load, the application is working well.
  2. If the average CPU utilization is low, the chance of a CPU bottleneck is low.

Reality

100% CPU utilization means the CPU is being used optimally; unless the run queue values (per vmstat) exceed the number of processors on the server (cpu_count), there is no CPU (i.e. infrastructure) bottleneck.

  1. A very low average CPU utilization (say 15%) does not mean the application is working fine: one CPU alone may be overloaded while the others sit idle, which pulls the average down. This still indicates a possible CPU issue.
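To rule this out, sample the per-CPU utilization over a short interval instead of relying on the since-boot averages that a one-shot mpstat -P ALL reports:

# mpstat -P ALL 2 5

A single core pegged near 100% %user while the others sit idle will show up clearly here, even though the overall average looks low.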

Many different things can drive CPU utilization high; the steps below help narrow down the root cause.

Next Tuning Steps

Thread Dump Analysis

    • When to use: if the CPU is under- or over-utilized and the site's response time is increasing.
    • Used to find lock contention and deadlocks between threads (a sketch for collecting a dump follows this list).
  1. ThreadsPerChild directive in the Apache httpd.conf file
    • Determines how many requests the HTTP server can handle concurrently; check whether it is set too low or too high.
  2. Tuning
    1. Thread pool: increase the thread pool size until there is no further improvement in CPU utilization.
    2. JVM: tune both the heap and the stack [ How-to doc: WIP ]; monitor GC activity with -verbose:gc.
    3. DB pool: increase the DB pool size until there is no further improvement in the database machine's CPU utilization.
  3. Run a profiler tool
    1. Tune the most frequently executed methods.
    2. Tune hot methods where most of the time is spent.
  4. Database
    1. Tune SQLs / the database instance.
    2. Create missing indexes.
    3. Check DB connection issues: a listener problem, too many open DB connections, or SQL queries taking too long.
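For the thread dump analysis mentioned above, a dump can be collected without restarting the JVM. A minimal sketch, assuming the java PID 23506 from the earlier top output (jstack ships with the JDK in use here):

# kill -3 23506

SIGQUIT makes HotSpot print a full thread dump to the JVM's stdout (wherever the process's console output is redirected). Alternatively, capture it to a file with jstack:

# /opt/jdk1.6.0_23/bin/jstack 23506 > /tmp/threaddump.txt

The hottest thread from top -H -p 23506 can then be matched against the nid= field in the dump, since nid is the native thread ID in hex:

# printf 'nid=0x%x\n' 23987

Again, 23987 is only a placeholder for the busy thread ID reported by top.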

Further Example

vmstat shows that %system CPU usage is high.

# vmstat 2

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
1  0      0 191420   8688  35780    0    0     0     0 1006   31  1  4 96  0  0
1  0      0 124468   9208  98020    0    0 15626  2074 1195  188  0 76  0 24  0
0  1      0 110716   9316 110996    0    0  3268  4144 1366   84  0 94  0  7  0
0  3      0  97048   9416 122272    0    0  2818 11855 1314  109  1 80  0 20  0
0  4      0  80476   9544 137888    0    0  3908  2786 1272  172  0 54  0 46  0

Let's run mpstat for a more detailed CPU breakdown; it shows the CPU is busy servicing interrupts (%irq and %soft).

# mpstat 2

Linux 2.6.18-92.el5 (centos-ks)         01/14/2010

02:03:50 AM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
02:04:04 AM  all    1.33    0.00   41.78    0.00    0.44    3.56    0.00   52.89   1015.56
02:04:06 AM  all    0.00    0.00    8.04   38.69   29.65   23.62    0.00    0.00   1326.63
02:04:08 AM  all    0.00    0.00    8.70   30.43   27.54   28.50    0.00    4.83   1327.54
02:04:10 AM  all    0.00    0.00    5.47   46.77   27.36   20.40    0.00    0.00   1280.10

Use sar to find out which interrupt number is the culprit.

# sar -I XALL 2 10

02:07:10 AM      INTR    intr/s

02:07:12 AM         0    992.57

02:07:12 AM         1      0.00

02:07:12 AM         2      0.00

02:07:12 AM         3      0.00

02:07:12 AM         4      0.00

02:07:12 AM         5      0.00

02:07:12 AM         6      0.00

02:07:12 AM         7      0.00

02:07:12 AM         8      0.00

02:07:12 AM         9    350.50

To see which device is behind the offending interrupt: # cat /proc/interrupts
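A minimal sketch for narrowing that down, using interrupt 9 that sar flagged above:

# grep ' 9:' /proc/interrupts

The matching line shows how many times each CPU has serviced IRQ 9 and which driver/device registered it. To watch the counters grow in near real time:

# watch -n1 "grep ' 9:' /proc/interrupts"

If the counters for the NIC's IRQ climb abnormally fast, the card or its driver is the likely culprit.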

Conclusion: it is actually a NIC issue; the network card was causing problems while transmitting or receiving frames.
