On Call Scheduling With Nagios
Nagios continues to impress me with its power and capability. For the past seven years Nagios has been my enterprise monitoring solution of choice, and as our environment has grown, Nagios has grown right along with it. Luckily, since we have a sane method of managing our configuration files, growing Nagios has not been an issue. Recently, though, we were comparing Zenoss with Nagios, and one item we discussed was how Zenoss handles the on call rotation between sysadmins. The way we were doing it with Zenoss required the sysadmin who was on call that week to log in and make a change. I was certain that Nagios had something built in, and sure enough, I was right.
On call scheduling with Nagios is done using timeperiods. Normally, time periods are defined in the timeperiods.cfg file, and to be honest, the only one we normally use is 24x7. Nagios comes preconfigured with several: "24x7", which covers all the time; "workhours", for the work day; "none", for no time at all; the US holidays; and the reverse of the US holidays (24x7 with the holidays removed). The time period is one of the things Nagios checks when deciding whether to send out an alert. If there is an alert that needs to be sent, and the current time falls within the time period defined for the user, the alert goes out.
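To make the link between users and time periods concrete, a time period is attached to a contact through the contact's notification period directives. The following is a minimal sketch, not a copy of a real config: the contact name, alias, and email address are placeholders, and it assumes the generic-contact template from the Nagios sample configuration is available to supply the remaining required settings.

```cfg
# Sketch only: contact_name, alias, and email are placeholder values
define contact {
    contact_name                    buys                ; hypothetical contact
    use                             generic-contact     ; assumes the sample-config template exists
    alias                           Example Sysadmin    ; placeholder
    email                           buys@example.com    ; placeholder address
    host_notification_period        24x7                ; when host alerts may be sent
    service_notification_period     24x7                ; when service alerts may be sent
}
```

Swapping 24x7 for the name of another time period is what restricts when that contact is notified.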
To set up an on call rotation, we create a new file named "oncall.cfg" and define a few new timeperiods for Nagios. For example, here is a portion of ours:
2013-03-11 / 21 07:00-24:00
2013-03-12 / 21 00:00-24:00
2013-03-13 / 21 00:00-24:00
2013-03-14 / 21 00:00-24:00
2013-03-15 / 21 00:00-24:00
2013-03-16 / 21 00:00-24:00
2013-03-17 / 21 00:00-24:00
2013-03-18 / 21 00:00-07:00
If we walk through this a bit, the first line of the definition names the timeperiod "buys-oncall". The next section defines the dates for the rotation. 2013-03-11 is March 11, 2013, a Monday. The next part, " / 21", tells Nagios to repeat this entry every twenty-one days, and the last part of the line tells Nagios to start at seven in the morning and end at midnight that day. I then repeat the entry for each remaining day of the week, ending at seven in the morning on the following Monday.
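Put together, a complete definition along these lines might look like the sketch below. The date lines and the names buys-oncall and buys-out-of-office come from the text; the alias and the name on the use line (admin2-out-of-office) are assumptions made up for illustration.

```cfg
# Sketch of a one-week on call timeperiod that repeats every 21 days
define timeperiod {
    timeperiod_name     buys-oncall
    alias               Buys On Call                ; assumed alias text
    2013-03-11 / 21     07:00-24:00                 ; Monday: shift starts at 07:00
    2013-03-12 / 21     00:00-24:00
    2013-03-13 / 21     00:00-24:00
    2013-03-14 / 21     00:00-24:00
    2013-03-15 / 21     00:00-24:00
    2013-03-16 / 21     00:00-24:00
    2013-03-17 / 21     00:00-24:00
    2013-03-18 / 21     00:00-07:00                 ; following Monday: shift ends at 07:00
    exclude             buys-out-of-office          ; times when alerts should not reach me
    use                 admin2-out-of-office        ; hypothetical name: a colleague's time off to cover
}
```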
The last two lines reference two other timeperiods, one to exclude and one to use. The first, named buys-out-of-office, looks like this:
2013-01-22 - 2013-01-23 07:00-16:00 ; Test Vacation Definition
This section defines when I am going to be unavailable to be on call. According to the dates above, I would be on vacation on January twenty-second and twenty-third. Note that Nagios applies the time range to each day in the date range, so this entry covers 7:00 AM to 4:00 PM on both days rather than one continuous stretch; a full day off would use 00:00-24:00. Nagios uses these dates as exclusion times when deciding whether to send you an alert.
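For reference, here is a sketch of how that line might sit inside a full definition. The alias is an assumption. The name directive is also an assumption about the original file, but Nagios resolves "use" against a template's name, and the sample configuration sets name on its us-holidays timeperiod for the same reason.

```cfg
# Sketch of an out-of-office timeperiod that other definitions can exclude or inherit
define timeperiod {
    name                buys-out-of-office          ; template name, looked up by "use" elsewhere
    timeperiod_name     buys-out-of-office          ; name looked up by "exclude" elsewhere
    alias               Buys Out Of Office          ; assumed alias text
    2013-01-22 - 2013-01-23     07:00-16:00         ; Test Vacation Definition
}
```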
But if I am not available, who is going to cover for me? That question is answered by the last line of the on call definition, "use". If there are three sysadmins, each covering a week, then each should cover for one other. That is why the third sysadmin's timeperiod definition contains the line "use buys-out-of-office". So, for the times that I tell Nagios not to send something to me, I am also telling it to send it to my backup.
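A sketch of what the backup's side might look like, assuming the third sysadmin's week begins on 2013-03-25, two weeks after the week shown earlier. The admin3 names and the alias are made up for illustration; only the use line comes from the text.

```cfg
# Sketch of the third sysadmin's on call timeperiod (hypothetical names and dates)
define timeperiod {
    timeperiod_name     admin3-oncall               ; hypothetical name
    alias               Admin3 On Call              ; assumed alias text
    2013-03-25 / 21     07:00-24:00                 ; Monday: shift starts at 07:00
    2013-03-26 / 21     00:00-24:00
    2013-03-27 / 21     00:00-24:00
    2013-03-28 / 21     00:00-24:00
    2013-03-29 / 21     00:00-24:00
    2013-03-30 / 21     00:00-24:00
    2013-03-31 / 21     00:00-24:00
    2013-04-01 / 21     00:00-07:00                 ; following Monday: shift ends at 07:00
    exclude             admin3-out-of-office        ; hypothetical: this sysadmin's own time off
    use                 buys-out-of-office          ; inherit the time off being covered
}
```

Because "use" pulls in the date entries from buys-out-of-office, those vacation hours become active time in the backup's timeperiod.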
Each of us has these two definitions in Nagios, and the on call one repeats automatically every 21 days. If any of us has vacation or time off, we add it as a line in the config file, maybe with a friendly comment about where we will be.
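As an illustration of that last step, new time off is just more date lines inside the existing out-of-office definition. The dates and comments below are invented for the example.

```cfg
# Hypothetical entries, added inside the existing out-of-office timeperiod definition
    2013-07-01 - 2013-07-05     00:00-24:00     ; camping, no cell coverage
    2013-08-16                  00:00-24:00     ; single day off
```

Running the Nagios configuration check (nagios -v followed by the path to your main config file) before reloading should catch any typos in the new lines.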
Nagios can be hard to wrap your head around at first, but once you do it becomes very easy to maintain. In the past few years there has been a lot of emphasis put on big, heavy, GUI-driven interfaces for systems management, but I'm personally happy to keep everything in text files and do my management with vi.