Despite our disaster recovery system operating correctly when a transformer next to our datacentre blew up (the UPS kicked in straight away, followed by the generators, and additional generators were brought on site whilst we waited nearly a week for a new transformer to be installed: running on generator power the whole time), it seems that we’ve been “bitten on the a–” anyway.
Basically, servers hosted in our 3—-.net datacentre area (aka “Server farm 1”) have been suffering major problems ever since we upgraded to the new Ensim Pro control panel. We suspect the additional load Ensim Pro placed on the servers pushed them into unexpectedly high usage (Ensim Pro seems to use around 10x the resources, but don’t quote me on that figure), so the servers have been shutting down to preserve data integrity (a bit like the way a human body faints or falls into a coma when something is wrong). You can imagine the downtime this causes for customers and the amount of stress it causes us (if you’ve ever been woken up at 2am by your mobile phone dancing off the desk as another text message comes in to tell you “Datacentre down”, you’ll know what I mean).
So we’ve rushed 2 more datacentres online and started moving customers off the affected datacentre to the new ones, with the intention that once its load has been massively reduced we’ll be able to do some proper investigation on it (small snippet of information: none of our datacentres is even touching 1/20th of the bandwidth available to them!).
However…
Things went a little bit “tits-up”.
First of all: the email warning customers about the migration didn’t get sent, and we didn’t realise it hadn’t gone out until the day the migration was taking place.
Second: the old datacentre failed again whilst we were copying the data across to the new one.
Thirdly: the new datacentre didn’t particularly like being forced to accept new sites in bulk (requests to add 6 sites per minute were bringing it to its knees, as each addition meant creating DNS records, user details and mailbox details, and copying over the web files and old email, etc. etc.).
Fourthly: sites seem to be “larger” upon import. On a number of sites, the new datacentre wouldn’t accept the import because the FrontPage extensions would put the site over its disc usage limit: a limit which is exactly the same as it was pre-restore. A test of re-restoring a site to the same datacentre (in fact, the exact same server) resulted in the site growing by 50Mb in disc usage: something we’re still puzzling over, as we didn’t notice it in trials.
Fifthly: whilst we did try to make the migration seamless for customers (reducing the DNS TTL and expiry to 600 seconds, getting emailed at every stage of the automatic process, continually watching the load average on the two datacentres, manually updating the old DNS records to point to the new location; there’s a rough sketch of the TTL side of things further down), we did hit two major problems.
One is a customer who’s a friend of the boss: his site is 90Mb in size and the new datacentre wouldn’t accept it (the restore kept failing for a number of reasons). 8pm and the boss calls and says he wants it sorted ASAP. I pull out every single stop I can, manually recreate the account on the old datacentre, manually increase the limits, import the data from the backup and bingo! It’s working again and I inform the customer. 20 minutes later, the new datacentre tells me that the 8th attempt at restoring the site over there has finally been successful! Grrr….
The other one we hit problems with was a certain N.R.Turner’s site. Most of our customers give us complete control over their domain names so we can update nameserver details etc. as necessary (we very rarely need to, but it’s one less headache for some people who shouldn’t even be allowed near a PC, IMHO); the rest just point their domain name at our nameservers, where we perform “behind the scenes magic” to make the transition as smooth as possible. However, Neil likes maintaining his own DNS, so we weren’t able to do the appropriate magic and he experienced downtime.
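For anyone wondering why we bother dropping the TTL before a move: the TTL is simply how long other people’s resolvers are allowed to cache our records, so at 600 seconds the worst-case lag between us repointing a site and visitors following it to the new datacentre is roughly ten minutes. Here’s a minimal sketch of the sort of check involved, in Python using the dnspython library; it isn’t our actual tooling, and example.com is obviously a stand-in:

    # Rough sketch, not our real tooling: see how long resolvers may cache
    # a record before they notice it has moved (requires the dnspython library).
    import dns.resolver

    def report_ttl(hostname: str) -> None:
        answer = dns.resolver.resolve(hostname, "A")
        ttl = answer.rrset.ttl
        addresses = [record.address for record in answer]
        print(f"{hostname} -> {addresses} (TTL {ttl}s)")
        if ttl > 600:
            print(f"Resolvers may keep the old address for up to {ttl // 60} "
                  "minutes after the records change.")

    report_ttl("example.com")  # stand-in domain, not a real customer

In Neil’s case the records live on his own nameservers, so no amount of TTL-lowering on our side would have helped until he updated them himself.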
The good news is that everything now seems to be stable. The load average on the “broken” datacentre is now the lowest I’ve seen for a long time, the load average on the new datacentre is hardly recordable (it’s that low; I spent a day tweaking its settings to make sure it’s as optimised as we can get it), and trials of a new control panel suite have commenced (we’ve decided that Ensim Pro isn’t suitable for us: it seems too prone to various problems, and there’s the worry that if we manually fix a problem it could break the entire “integrated into itself” Ensim suite).
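Since I keep going on about load averages: those 2am “Datacentre down” texts come from monitoring along roughly these lines. This is just a minimal sketch in Python of watching the 1-minute load average and shouting when it crosses a threshold; the threshold figure and the send_alert stub are made up for illustration, and the real monitoring is rather more involved:

    import os
    import time

    LOAD_THRESHOLD = 8.0   # illustrative figure only, not our real alarm level
    CHECK_INTERVAL = 60    # seconds between checks

    def send_alert(message: str) -> None:
        # Stub: the real thing texts the on-call mobile (usually at 2am...).
        print(f"ALERT: {message}")

    def watch_load() -> None:
        while True:
            one_min, five_min, fifteen_min = os.getloadavg()
            if one_min > LOAD_THRESHOLD:
                send_alert(f"1-min load {one_min:.2f} "
                           f"(5-min {five_min:.2f}, 15-min {fifteen_min:.2f})")
            time.sleep(CHECK_INTERVAL)

    if __name__ == "__main__":
        watch_load()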
Bad news: I’m a little bit worried about job security as the downtime hasn’t been good…