Linux CPU Statistics Explored
Running top gives you a good high-level overview of your server's health at the moment you look at it. One of the most useful statistics it presents is the %Cpu(s) line, which is split into eight sections, each showing the percentage of CPU time spent in a particular state. In my previous article on using top, I briefly mentioned three of the eight sections I glance at when troubleshooting a server. Today, I'd like to take a closer look.
%Cpu(s): 26.0 us, 7.4 sy, 0.0 ni, 66.6 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
To understand what this is telling us, let's take a look at each section in turn.
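Before going through them, here is a minimal sketch of how the sample line above could be split into labeled values in a script. The function name parse_cpu_line is made up for illustration, and the sketch assumes the layout shown above with a period as the decimal separator; other top versions or locales may format the line differently.

```python
import re


def parse_cpu_line(line):
    """Return a dict mapping each two-letter label to its percentage.

    Hypothetical helper: assumes the "%Cpu(s): <value> <label>, ..." layout
    from the sample line, with "." as the decimal separator.
    """
    # Everything after the first colon is a list of "<value> <label>" pairs.
    _, _, fields = line.partition(":")
    pairs = re.findall(r"([\d.]+)\s+([a-z]{2})", fields)
    return {label: float(value) for value, label in pairs}


sample = "%Cpu(s): 26.0 us, 7.4 sy, 0.0 ni, 66.6 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st"
stats = parse_cpu_line(sample)

for label, value in stats.items():
    print(f"{label}: {value}")
```

Keeping the result keyed by the same two-letter labels that top prints makes it easy to match each number to the sections below.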
us - User Processes
Most likely, this is what the server is here to do. User processes are normal programs and, in the case of your server, the services the machine is providing. Note that this figure does not include processes whose priority has been lowered with nice; those are counted separately under ni.
sy - Kernel Processes
System memory is split into two areas: user space and kernel space. Processes running in user space have no direct access to hardware and cannot interfere with code running in kernel space. Code that runs in kernel space has full access to the entire machine, including direct access to the hardware, so any errors at this level tend to crash the entire system. The sy figure is the share of CPU time spent in kernel space, which includes system calls made on behalf of user processes.
ni - User processes affected by nice
Some processes are more important than others. For example, you never want your backup program to affect the response time of a production web server. In this case, you may want to set the nice value of the web server to -20, the highest priority, and the nice value of the backup program to 19, the lowest. CPU time used by processes that have been given a positive nice value, that is, a lowered priority, shows up here. In this example, that would be the backup program.
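As a rough illustration of lowering a program's priority, the sketch below starts a child process with its nice value raised before it runs. The helper name run_niced is hypothetical, and the child command is a stand-in for a real backup job; it just reports its own nice value. Nice values on Linux run from -20 to 19, and raising the value (lowering priority) does not require root, while lowering it generally does.

```python
import os
import subprocess
import sys


def run_niced(cmd, increment=19):
    """Run cmd as a child whose nice value is raised by `increment`.

    Hypothetical helper for illustration; POSIX only.
    """
    def lower_priority():
        # Runs in the child just before exec; the kernel caps the result at 19.
        os.nice(increment)

    return subprocess.run(cmd, preexec_fn=lower_priority,
                          capture_output=True, text=True)


# Stand-in for a backup job: a tiny Python child that prints its nice value.
# os.nice(0) adds nothing and returns the current value.
result = run_niced([sys.executable, "-c", "import os; print(os.nice(0))"])
child_niceness = int(result.stdout)
print("child ran at nice value", child_niceness)
```

From a shell, the nice command does the same job, for example by prefixing the backup command with nice -n 19.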
id - Idle
The CPU had some free time, so it spent it catching up on a few good books, doing a little fishing, maybe taking a nap in a hammock tied between two palm trees. A server with lots of idle time is underused; it could take on more work without a problem.
wa - I/O Wait
As I covered previously, I have found that keeping a close eye on I/O wait can give you a heads-up when there is trouble brewing. Perhaps you have a busy database on a shared SAN environment. It is possible that the shared disk is overloaded by the database and becomes unresponsive for milliseconds at a time. In this case, the processes waiting on a response from the disk are set aside in a waiting state, and the time the CPU sits idle while that I/O is outstanding is counted here. Nothing useful gets done during that time. Ideally, this should stay at or near 0.0.
hi - Hardware Interrupts
According to Wikipedia, "A hardware interrupt is an electronic alerting signal sent to the processor from an external device, either a part of the computer itself such as a disk controller or an external peripheral." A high percentage of hardware interrupts may point towards faulty hardware. My first stop after top would probably be a quick check through dmesg, followed by syslog, to see if anything has been logging errors.
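Alongside the logs, the kernel's per-device interrupt counters in /proc/interrupts can hint at which device is responsible. The sketch below ranks interrupt sources by total count. The function name busiest_interrupts and the sample text are made up for illustration, and the parsing assumes the usual layout of a CPU header row followed by one row per interrupt source.

```python
def busiest_interrupts(text, top_n=3):
    """Return (total_count, label, description) tuples, largest first.

    Hypothetical helper: expects text laid out like /proc/interrupts.
    """
    lines = text.strip().splitlines()
    cpu_count = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals = []
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        parts = rest.split()
        counts = []
        # The first columns are per-CPU counts; stop at the first non-number.
        for part in parts[:cpu_count]:
            if not part.isdigit():
                break
            counts.append(int(part))
        description = " ".join(parts[len(counts):])
        totals.append((sum(counts), label.strip(), description))
    totals.sort(reverse=True)
    return totals[:top_n]


# Made-up sample in the style of /proc/interrupts on a two-CPU machine.
SAMPLE = """\
           CPU0       CPU1
  0:         36          0   IO-APIC   2-edge      timer
 24:     151234     149877   PCI-MSI 512000-edge      ahci[0000:00:1f.2]
 25:    9023411    8870012   PCI-MSI 1572864-edge      eth0
NMI:          0          0   Non-maskable interrupts
"""

for total, label, description in busiest_interrupts(SAMPLE):
    print(f"{label:>4} {total:>10} {description}")
```

On a live system, the same function could be fed the contents of /proc/interrupts. Since the counters are cumulative since boot, comparing two readings taken a few seconds apart is more telling than a single snapshot.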
si - Software Interrupts
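Also according to Wikipedia: "A software interrupt is caused either by an exceptional condition in the processor itself, or a special instruction in the instruction set which causes an interrupt when it is executed." In the context of top, si is the time the kernel spends servicing software interrupts (softirqs), which handle deferred work such as network packet processing. I have not come across an issue where a server was showing a high amount of software interrupts.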
st - Stolen Time
If your machine is running in a virtualized environment, as my Ubuntu server is, you may come across a non-zero percentage here in the "stolen from this vm by the hypervisor" section. This is time during which the guest had work ready to run, but the hypervisor gave the physical CPU to something else on the host. Ideally, this is another value that should always be zero, but if you find yourself with high numbers here, you may need to move some virtual machines off the host.
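All eight values come from cumulative tick counters that the kernel exposes on the first line of /proc/stat, in the order user, nice, system, idle, iowait, irq, softirq, steal. A percentage is the change in one counter between two readings, divided by the change in all of them. The sketch below shows that arithmetic. The function names and the two sample lines are made up for illustration, and the trailing guest counters are ignored for simplicity.

```python
# top's labels, in the order the counters appear on the /proc/stat "cpu" line.
FIELDS = ("us", "ni", "sy", "id", "wa", "hi", "si", "st")


def read_ticks(stat_line):
    """Turn a 'cpu ...' line from /proc/stat into a dict of tick counters."""
    values = stat_line.split()[1:1 + len(FIELDS)]
    return dict(zip(FIELDS, map(int, values)))


def cpu_percentages(before, after):
    """Share of elapsed ticks spent in each state between two samples."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total == 0:
        return {name: 0.0 for name in FIELDS}
    return {name: round(100.0 * delta / total, 1)
            for name, delta in deltas.items()}


# Made-up samples, chosen so the result matches the %Cpu(s) line shown earlier.
first = read_ticks("cpu  1000 0 300 8000 100 0 10 0 0 0")
second = read_ticks("cpu  1260 0 374 8666 100 0 10 0 0 0")

percentages = cpu_percentages(first, second)
print(percentages)
```

On a live Linux system, the two samples could be taken by reading the first line of /proc/stat, sleeping for a second or so, and reading it again.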
Top does a good job of condensing a lot of information into a readable format. Taking the extra time to read through the man pages and online documentation shows just how powerful this often-overlooked tool really is.