Techy: Server Downtime and Time

Sorry if you were visiting my blog between 7:45pm and 8:05pm today – I was performing a small server upgrade (to be specific, applying a patch to the Apache webserver) when something went wrong. It took me around 10 minutes to notice that Apache hadn’t properly restarted (I was doing some other maintenance tasks at the time, so I hadn’t got round to checking it had come back alive correctly), then around another 10 minutes to find out why it wouldn’t restart.

The problem I was getting was “[crit] (98)Address already in use: make_sock: could not bind to port 80” whenever I tried to restart Apache. I disabled the Tomcat Jakarta server (that was what I was patching, as the previous version I was running had a security hole), since I don’t do anything important with Java on my server at the moment, and attempted to restart Apache. Still no joy, and the same error message. I do a netstat -lpn to try and find out which process is listening on port 80 so I can kill it… Nothing is. Aaargh! I then have a brain wave and think that perhaps when Apache was restarted it failed to shut down properly. So I go into the Apache config file and change the “Port” number from 80 to 81. Start the server and success! (This also proves there was nothing wrong with the rest of the Apache configuration file.) Change the port number back to “80”, restart the server and the flibblepenguin is alive!
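
For anyone who hits the same “(98)Address already in use” message, here is a rough Python sketch of the same check done by hand: try to bind a TCP socket to the port and see whether the operating system refuses. The function name port_is_free and the default host are my own inventions for illustration; this is not what Apache itself does internally, just a minimal probe under those assumptions.

```python
import errno
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can be bound to (host, port).

    Returns False when the OS reports "Address already in use"
    (errno 98 on Linux). Any other error, such as a permission
    error for ports below 1024, is re-raised unchanged.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                return False
            raise
        return True


if __name__ == "__main__":
    # Port 8080 is only an example; probing port 80 usually needs root.
    print("port 8080 free:", port_is_free(8080))
```

Note that probing port 80 without root privileges will normally raise a PermissionError rather than return an answer, so the sketch is most useful on unprivileged ports or when run as root.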

Hardly the most complex thing in the world, but I had never come across that problem before and I wasn’t expecting it (an upgrade that only failed in the restart?). I was kinda worried that I might have had to “roll back” the server to the previous configuration – work of a whole 10 minutes maximum!

Once I was happy that the server was running correctly, I checked my email… Big mistake – oodles of messages sent to my multiple accounts saying “Apache Server down”, “Apache Server down”, “Apache Server down” – yep, I knew that already. One for every 5 minutes of downtime (grr). The really annoying thing is that just before it sent each of those emails to me, it would have tried to automatically bring Apache back “up” (the idea being that if my server hiccups, it fixes itself), so it was trying to fix things at the same time as I was. Grrr…
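
The monitoring behaviour described above (check, try one automatic restart, then email if that didn’t help) can be sketched roughly like this in Python. Everything here is hypothetical: watchdog_pass and the three callables passed into it are names I made up for illustration, and my real monitor is not necessarily built this way. Something like cron would run one pass every 5 minutes.

```python
from typing import Callable


def watchdog_pass(
    is_up: Callable[[], bool],
    restart: Callable[[], None],
    alert: Callable[[str], None],
) -> bool:
    """Run one monitoring pass and return True if the service ends up running.

    If the service is down, try a single restart first; only send an
    alert when the service is still down after that attempt.
    """
    if is_up():
        return True
    restart()
    if is_up():
        return True
    alert("Apache Server down")
    return False


if __name__ == "__main__":
    # Stand-in callables: a service that stays down, and print() as the "email".
    watchdog_pass(lambda: False, lambda: None, print)
```

Passing the check, restart and alert steps in as callables keeps the sketch free of any real service commands or mail settings, which would be guesses on my part.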

In other techy news, we have recently been having problems with one of our servers at work, on which we host a large number of client websites. It has been shutting down at random intervals (a complete shutdown, after which it has to be manually cold re-booted). The log files revealed no evidence whatsoever of what was going on, and I don’t have physical access to the box. I poke around the logs a bit more and check a few news groups. Then I report to my boss that it is probably a hardware failure – most likely bad RAM – and we need to get the hosting company (we lease dedicated servers off them) to replace the RAM modules.

Less than 24 hours later (and after around 2 hours of downtime) we get a message back from the hosting company saying they’ve swapped out the RAM modules and found the previous ones faulty. Go me! A perfect diagnosis of a long-running server problem, performed remotely! Wooo! Everything’s perfect… Apart from the fact the server now thinks it is mid-February 2002 – around a year out of date. Ah. I go in, switch to superuser/root, perform a quick date -s "Jan 30 2003 21:00" to reset the clock and then hwclock --systohc to synchronise the hardware clock to the system time (so if the power fails or the box reboots, the correct time is kept). Just to be safe, I install getdate and set the server up so that every day it checks a couple of public time servers and keeps the server time in sync. I could have probably coded the NTP interpreter myself, but I just wanted something quick and simple.
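
For the curious, here is roughly what “coding it myself” might have looked like: a minimal SNTP-style sketch in Python that asks a time server for the time and works out how far off the local clock is. This is not how getdate works (I haven’t looked), the function names are my own, and the default host pool.ntp.org is just an assumed example of a public time server. It ignores network delay, so treat the offset as approximate.

```python
import socket
import struct
import time

# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_EPOCH_OFFSET = 2208988800


def build_sntp_request():
    """Build a 48-byte client request: LI=0, version 3, mode 3 (client)."""
    return b"\x1b" + 47 * b"\0"


def parse_transmit_time(packet):
    """Extract the server's transmit timestamp as Unix seconds."""
    if len(packet) < 48:
        raise ValueError("NTP packet too short")
    # Bytes 40-47 hold the transmit timestamp: 32-bit seconds, 32-bit fraction.
    seconds, fraction = struct.unpack("!II", packet[40:48])
    return seconds - NTP_EPOCH_OFFSET + fraction / 2**32


def clock_offset(packet, local_unix_time):
    """Seconds to add to the local clock to match the server's time."""
    return parse_transmit_time(packet) - local_unix_time


def query_time_server(host="pool.ntp.org", timeout=5.0):
    """Send one request over UDP port 123 and return the raw reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_sntp_request(), (host, 123))
        packet, _ = sock.recvfrom(512)
    return packet


if __name__ == "__main__":
    reply = query_time_server()
    print("local clock is off by about", clock_offset(reply, time.time()), "seconds")
```

The sketch only reports the offset; actually setting the clock would still be a job for date -s and hwclock --systohc as root.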
