The Kernel Panic
I was on a conference call with a few other technicians and a few managers today, explaining that, no, unfortunately, they did not quite understand what had happened the day before. One manager assumed that a large influx of traffic had caused the site to go down, but that was not the case. I explained to him that our primary database had suffered a kernel panic, and the failover database did not, in fact, fail over. The manager then asked me a question that I had not thought of before, and one that gave me pause. What is a kernel panic?
If you have ever been on one of these types of calls, you know that they are always rather uncomfortable. The manager is upset because something went wrong, and on top of that it was something that they don’t fully understand. During such conversations I’ve found that it is normally best to keep explanations correct, but succinct. I explained that a kernel panic is what happens when the operating system encounters an error that it cannot recover from. That explanation seemed to be enough for him, but as I thought about it later, I found that it was not nearly enough for me.
The source of all knowledge, also known as Wikipedia, has a fairly good summary of a kernel panic:
A kernel panic is an action taken by an operating system upon detecting an internal fatal error from which it cannot safely recover. The term is largely specific to Unix and Unix-like systems; for Microsoft Windows operating systems the equivalent term is “Bug check” (or, colloquially, “Blue Screen of Death”).
Which is basically what I said earlier. However, the interesting portion of the article is too brief:
A panic may occur as a result of a hardware failure or a software bug in the operating system. In many cases, the operating system is capable of continued operation after an error has occurred. However, the system is in an unstable state and rather than risking security breaches and data corruption, the operating system stops to prevent further damage and facilitate diagnosis of the error.
That’s all well and good, but why would a perfectly good server that had been running fine for months suddenly panic and die? When the server was rebooted it came back perfectly fine, and there were no modifications to the software prior to the failure. There are no logs, though the other sysadmin said that he saw something about the ext3 driver on the console before rebooting.
So, this gives us a couple of directions we could pursue. On one hand, it is possible that the server hardware is beginning to fail. This particular server is four or five years old, and has been in production the entire time. It’s well past its prime, and due for a refresh. However, it is difficult to envision anything but the hard drive going bad on it. There are very few moving parts in the machine, so for anything else to fail, something would have to have burned out. It’s possible.
On the other hand, given the other sysadmin’s sighting of an ext3 filesystem error, it is possible that the server lost access to the hard drive, and simply couldn’t read from or write to it. Linux reacts, well, poorly, to losing its filesystem, so it would only be a matter of time before the kernel decided it needed to panic. That would also explain the lack of logs, and why the server still responded to pings but would not allow logins over SSH.
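One cheap way to look for this kind of trouble before it becomes a panic is sketched below. It rests on an assumption about configuration: ext3 filesystems are often mounted with errors=remount-ro, in which case a filesystem that has hit errors ends up mounted read-only, and that shows in /proc/mounts. The function name and the sample text are made up for illustration; they are not taken from the server in question.

```python
def read_only_ext_mounts(mounts_text, fstypes=("ext2", "ext3", "ext4")):
    """Return (device, mountpoint) pairs for ext filesystems mounted read-only.

    mounts_text is expected to look like /proc/mounts: whitespace-separated
    fields of device, mountpoint, filesystem type, options, dump, pass.
    """
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mountpoint, fstype, options = fields[:4]
        # Compare whole options, so "errors=remount-ro" does not count as "ro".
        if fstype in fstypes and "ro" in options.split(","):
            found.append((device, mountpoint))
    return found


# Hypothetical sample in /proc/mounts format, for illustration only.
SAMPLE_MOUNTS = """\
/dev/sda1 / ext3 rw,relatime,errors=remount-ro 0 0
/dev/sdb1 /var/lib/mysql ext3 ro,relatime 0 0
proc /proc proc rw,nosuid 0 0
"""

suspect_mounts = read_only_ext_mounts(SAMPLE_MOUNTS)
```

On a live Linux box the same function could be fed the contents of /proc/mounts instead of the sample. A read-only data volume is not proof of a failing disk, since someone may have mounted it that way on purpose, but on a database server it is worth a page.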
But, back again, why would the server suddenly lose access to the hard drive? Was there an accumulation of smaller errors that went unnoticed until it was too late? Or, perhaps it is because this particular server is a few months behind on patches, a couple of revisions back on firmware, and a year or two past due to be replaced. Unfortunately, without detailed logs, it is impossible to say for certain what caused this particular kernel panic. All it can be, for now, is a wakeup call that there is still much to do.
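If smaller errors were piling up beforehand, something as simple as the sketch below might have caught them, assuming the kernel messages had been shipped somewhere that survived the crash. The patterns are my guess at the kinds of lines worth counting, and the sample log lines are invented for illustration. We have no logs from the actual incident.

```python
import re
from collections import Counter

# Hypothetical patterns for disk and filesystem trouble in kernel messages.
PATTERNS = {
    "ext3_error": re.compile(r"EXT3-fs error", re.IGNORECASE),
    "io_error": re.compile(r"I/O error", re.IGNORECASE),
    "remount_ro": re.compile(r"remounting filesystem read-only", re.IGNORECASE),
}


def count_disk_warnings(log_text):
    """Count how many log lines match each warning pattern."""
    counts = Counter()
    for line in log_text.splitlines():
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


# Invented sample lines, not taken from the server discussed here.
SAMPLE_LOG = """\
kernel: end_request: I/O error, dev sda, sector 12345
kernel: Buffer I/O error on device sda1, logical block 1543
kernel: EXT3-fs error (device sda1): ext3_find_entry: reading directory #2 offset 0
kernel: Remounting filesystem read-only
sshd[1234]: Accepted publickey for admin
"""

warning_counts = count_disk_warnings(SAMPLE_LOG)
```

Run against a remote syslog copy on a schedule, a nonzero count in any bucket would be the early warning that was missing here.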