Jan/05
2008

A little yellow light

So, we have this server. Actually we have lots and lots of them. But I don't care about them right now. I care about this one.

This one is a Linux box. And this Linux box has a mirrored RAID array.

So one day, one of our techs walks by the server rack and notices a little yellow light. Like a dedicated employee, he reports it. Which is good, because that little yellow light means something important. It means one of the drives has gone bad and has been removed from the array.

YAY for technology seeing a fault and isolating it!

So, we call up the support company and say "Hey you. Yellow light bad, RMA me a new drive so we can replace it."

You'd think they'd ship one out. But you'd be wrong. What they told us is that we needed to run a diagnostic program on the system to verify that the little yellow light was accurate. We rolled our eyes, but we tried to do as they said.

Of course, they sent us the instructions for a different model of server, so the first attempt to do this failed. But eventually we got it done and sent the logs off, thinking that now they would give us a new drive.

No. Of course not. I should have known. They told us that the next step for diagnosing the problem would be to rebuild the drive array from scratch and re-image the server.

....

No no, you read that right. In order to test to see if my raid drive is broke, I have to blow away the server. I ask you... what is the point of having raid if I have to blow away the server in order to test a raid failure? At that point, raid has ceased to be useful. I mean, I suppose it kept the server up until we could schedule an outage, but the failure still caused an outage.

So, we roll our eyes some more and schedule some downtime to rebuild the server.

First step, reboot and log into the BIOS to reset it to factory defaults. And that's where we get the problem. The server no longer understands the bad drive. The raid controller craps out and says that it has an unrecoverable drive error. Server won't boot. Remove the drive with the little yellow light and the server boots up fine.

So, we call up the support company. They're going to ship out a replacement drive.

It took us 10-15 man-hours, 2 separate change approvals (which involve dozens of groups signing off on our tests) and a rather large conference call. It took about 3 weeks to get everything set up. All to tell us exactly what the little yellow light told us in the first damn place.

3 comments
Comment from: Larathiel [Visitor] Email
Retarded on the part of the vendor but also perhaps a bit too sheepish on the part of IT.

How come your company didn't simply order another drive for immediate replacement and then worry about RMAs on existing equipment later? At least then you'd have a spare on hand in case another drive should fail in the future. Surely these drives can't be so expensive that they exceed the cost in man-hours and downtime...
01/05/08 @ 04:25
Comment from: Roulette [Member] Email
While the actual cost may be lower for the spare drive, justifying the expense would probably take longer and still cost a lot of man-hours. We're an outsourcing company, so not only would my team have to put together a proposal and cost plan for the spare drive, we'd have to get a series of managerial approvals. Once that was accomplished, we would then take the case to the client to try to talk them into spending the money (this particular client is notoriously... thrifty). If they did actually agree, they'd send the request to accounting for an authorization number. Then we'd give that to the vendor, who would then send us a new drive.

Gotta love big companies. Nothing works because everything has way too many steps.
01/05/08 @ 05:24
Comment from: Larathiel [Visitor] Email
What kind of gum?

FUGGUM!
01/05/08 @ 05:35