A Simple Systems Administrator’s Toolbox
Over the years I’ve come to appreciate the value of simplicity in systems administration. The simpler a setup is, the more likely it is to be stable and easy to fix and maintain. If a setup is so complicated that it takes serious brainpower just to comprehend, chances are that something is wrong somewhere. Systems administrators have been dealing with complexity for a long time, and the basic sysadmin toolbox reflects a lean towards keeping things simple.
My toolbox is very plain Jane, and for a reason. It’s basic because these are the tools that get the job done on every Unix-like platform available. Everything I’ve come to rely on is open source, even on proprietary platforms like AIX. If there’s a part of my toolbox missing, I’ll download the package or source and make sure it’s available. Here’s what’s in my sysadmin’s toolbox:
ssh: The first basic tool is SSH, usually OpenSSH, which is developed by the OpenBSD project. Believe it or not, I still run into systems or applications that rely on rsh, sometimes even with root access! (I’m looking at you, DB2 v.8.) SSH was built to be a drop-in replacement for rsh. Using SSH is very easy; mastering it takes a little more time, but is well worth the effort. With a proper setup, SSH can give you quick access to any server on the network and allow scripted access to pull in information.
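As a rough illustration of scripted access, here is a minimal Python sketch that wraps the ssh client. It assumes key-based logins are already set up. The function names and host names are made up for the example. BatchMode and ConnectTimeout are standard OpenSSH client options.

```python
import subprocess


def build_ssh_command(host, remote_command, connect_timeout=5):
    """Return an argv list that runs remote_command on host without prompting."""
    return [
        "ssh",
        "-o", "BatchMode=yes",                      # fail instead of asking for a password
        "-o", f"ConnectTimeout={connect_timeout}",  # give up quickly on unreachable hosts
        host,
        remote_command,
    ]


def run_remote(host, remote_command):
    """Run the command over SSH and return its standard output as text.

    Raises subprocess.CalledProcessError if ssh or the remote command fails.
    """
    result = subprocess.run(
        build_ssh_command(host, remote_command),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling run_remote("web01.example.com", "uptime") in a loop over a list of hosts (hypothetical names here) is one way to gather the same piece of information from every server.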
rsync: Like SSH above, rsync is another tool that is easy to use and hard to master. If we need to move data from one place to another, rsync is the tool for the job. Just this week we had to move a few hundred gigabytes from one SAN to another, a move that required the application using the data to be shut down. This kind of move is done in the middle of the night, so in preparation, I kicked off an rsync during the day and did the initial copy. That night, when the applications were shut down, I ran the same rsync command again. Since nearly all of the data had already been copied, all rsync had to do was copy over the data that had changed since the first pass earlier that day. Rsync saves us time, which brings our application back up faster, which saves the company money.
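The two-pass move described above could be scripted along these lines. This is only a sketch: the paths are hypothetical, the helper names are invented, and the choice of --delete (so that files removed from the source between passes also disappear from the destination) is an assumption, not necessarily the exact command that was used.

```python
import subprocess


def build_rsync_command(source, destination, dry_run=False):
    """Return an rsync argv list that mirrors the contents of source into destination."""
    # A trailing slash on the source tells rsync to copy the directory's
    # contents rather than the directory itself.
    src = source.rstrip("/") + "/"
    command = ["rsync", "-a", "--delete"]  # -a preserves permissions, times, links
    if dry_run:
        command.append("--dry-run")        # report what would change, copy nothing
    command += [src, destination]
    return command


def sync(source, destination, dry_run=False):
    """Run one rsync pass; raises CalledProcessError if rsync exits non-zero."""
    subprocess.run(build_rsync_command(source, destination, dry_run), check=True)
```

Running sync("/san_old/data", "/san_new/data") during the day does the bulk copy, and running the identical call again during the outage only transfers what changed in between.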
top: Top is a command line program that shows a regularly refreshed snapshot of the performance of the machine. If I’m told that there’s a problem with one of our systems, top is normally the first command I’ll run. Top makes it easy to tell if some random Java process has decided that it needs to eat up 99 percent of the CPU, or if the system load is going through the roof. Another simple application with a truckload of functionality, top is often overlooked as a diagnostic tool. Top can kill processes, display per-core CPU utilization, give memory statistics, and show a host of other information. The man page for top is well worth the time to commit to memory.
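When the same "is the load through the roof?" question needs to be asked from a script, the load average that top prints in its header is also available from Python’s standard library on Unix-like systems. The sketch below divides it by the core count. Treating a value above 1.0 as "busy" is a common rule of thumb and an assumption here, not a hard limit.

```python
import os


def load_per_core(load_average, cores):
    """Normalize a load average by the number of CPU cores (at least 1)."""
    return load_average / max(cores, 1)


def current_load_per_core():
    """Return the 1-minute load average divided by the core count (Unix only)."""
    one_minute, _five, _fifteen = os.getloadavg()
    # os.cpu_count() can return None when the count is undetermined.
    return load_per_core(one_minute, os.cpu_count() or 1)
```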
df: Disk use is simultaneously the number one cause and result of applications behaving badly. It is a cause because if an application needs to write temporary files or logs, and the filesystem becomes full so that it can’t, the application’s behavior often becomes unpredictable. It is also a result, because I’ve seen so many poorly coded applications that will hit a bug somewhere and either core dump and fill up a filesystem or start writing logs full of garbage until they’ve used up all available space. I’ve become addicted to the “-h” flag for df, which means human readable, but there are many other flags that can give you detailed information about what is going on in the filesystems.
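For catching a filling filesystem from a script, a small sketch using Python’s shutil.disk_usage is shown below. The 90 percent threshold and the function names are assumptions for illustration. The percentage is computed against the space available to ordinary users, which should roughly track what df reports, though exact rounding may differ.

```python
import shutil


def percent_used(used, free):
    """Return used space as a percentage of used + free (0.0 if both are zero)."""
    total = used + free
    return 0.0 if total == 0 else 100.0 * used / total


def check_filesystem(path, threshold=90.0):
    """Return (percent_used, is_over_threshold) for the filesystem holding path."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    percent = percent_used(usage.used, usage.free)
    return percent, percent >= threshold
```

For example, check_filesystem("/var") returns the current fill level of the filesystem containing /var and whether it has crossed the threshold.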
free: Free gives information about the system memory use, for both physical RAM and swap space. I like to see swap use very low, as close to zero as possible, and let Linux use the RAM. Linux will cache data that it uses frequently in RAM, and then let it go when an application needs that space. Often what I see is nearly all physical RAM in use, and very little, if any, swap space utilized. This indicates an appropriately sized physical machine for the application. If there are large amounts of RAM that are not in use at all, the machine is probably underutilized; if there is a large amount of swap in use, the machine is probably over-taxed. Free gives me all of that information in a heartbeat.
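On Linux, the same memory figures are exposed in /proc/meminfo, so the "how much swap is in use?" check can be sketched in a few lines of Python. The function names are invented for this example, and the parsing assumes the usual "Key:   value kB" layout of that file.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (mostly kB)."""
    values = {}
    for line in text.splitlines():
        key, separator, rest = line.partition(":")
        fields = rest.split()
        if separator and fields:
            values[key.strip()] = int(fields[0])
    return values


def swap_used_percent(meminfo):
    """Return the percentage of swap in use, or 0.0 if no swap is configured."""
    total = meminfo.get("SwapTotal", 0)
    if total == 0:
        return 0.0
    return 100.0 * (total - meminfo.get("SwapFree", 0)) / total


def read_meminfo(path="/proc/meminfo"):
    """Read and parse the kernel's memory summary (Linux only)."""
    with open(path) as handle:
        return parse_meminfo(handle.read())
```

A value from swap_used_percent(read_meminfo()) that stays near zero matches the healthy picture described above, while a steadily climbing value suggests the machine is over-taxed.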
Nagios: Tying together the basic tools above is Nagios, our monitoring system of choice. Nagios, as we use it, is entirely configured on the command line, and shows a simple web interface to give an overview of the health of the servers and services being monitored. I know it’s the thing for all the cool kids these days to jump on board with completely GUI-driven monitors that do the same thing Nagios does (only “better”!), but in my experience Nagios does everything we need, and is customizable, scriptable, reliable, and simple. There is a bit of a learning curve at first, but once you understand how Nagios works, it is very easy to maintain.
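Part of what makes Nagios scriptable is that a check is just a program that prints one line of status text and exits with 0 (OK), 1 (WARNING), 2 (CRITICAL), or 3 (UNKNOWN). The sketch below shows that convention in Python. The thresholds, the "DISK" label, and the function names are hypothetical and not taken from any real configuration.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(value, warning, critical):
    """Map a measured value onto a plugin exit code using two thresholds."""
    if value >= critical:
        return CRITICAL
    if value >= warning:
        return WARNING
    return OK


def format_status(service, code, value):
    """Build the single status line that the plugin prints."""
    return f"{service} {LABELS[code]} - {value:.1f}% used"


def run_check(service, value, warning=80.0, critical=90.0):
    """Print the status line and return the exit code for the caller to exit with."""
    code = evaluate(value, warning, critical)
    print(format_status(service, code, value))
    return code
```

A real plugin script would end with something like sys.exit(run_check("DISK", measured_percent)), where measured_percent comes from whatever is being monitored.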
These are the basic tools I use in systems administration. There are more, and perhaps I’ll cover them in another article, but these are the basics. I believe in mastering the basics, and anything that does the same thing as any of these tools, but makes it more complicated, simply gets thrown out the window. The beauty of the toolbox being built around open source is that even when I work in proprietary systems like AIX, I can still download the source, compile it, and make sure my toolbox is available. Do you have any simple, powerful tools you’d like to share? Drop me a line in the comments!