
27 Aug 13 Server Outage

On Monday morning, one of the four redundant power supplies in our blade chassis caught fire, permanently damaging the power supply and part of the chassis.  We caught the problem in less than an hour, but getting a new power supply and a new chassis component took some time.  The new power supply arrived in a few hours, but the chassis component didn’t arrive until 8:45pm.  As a result, many of our systems couldn’t be accessed from 11:30am until 9:40pm Monday night.

The blade chassis contains the CPUs that access our data.  All of our data is stored in an area away from the chassis, and all of it is backed up to disk and to tape.  All parts were covered under our maintenance contract and should have arrived within four hours.  However, because failures of the chassis itself are so unusual, the replacement had to be driven to us from South Bend, Indiana, which accounted for the bulk of the time.

The ALA website, the iMIS Membership system, the financial system, the blogs server, the wikis server, our internal Knowledge Management System, the ITTS help desk tracking system, and some shared drive access were all down during this time.

We’ve been told that this type of incident is very rare.  The redundant power supplies should have sufficed – the additional component that was damaged is not normally subject to failure.  The only way to protect against this type of failure would be to acquire an entire additional chassis with redundant network connections, at significant additional expense. This isn’t a feasible option for ALA, but we’ll continue our efforts to reduce single points of failure across our network.

Sherri

 
