Every month we have server maintenance and patching.
This past weekend I arrived at 6:55am, breakfast in hand.
Checked out the building electronics recycling. Didn't find anything cool.
Went about my tasks.
We had an onsite repair scheduled for two servers.
9:20 the tech calls me and asks if we'd gotten the parts. We hadn't. He says they were picked up at 7am by special courier. Okay, well they haven't shown up.
9:28 one of my guys shows up.
9:35 the other one comes in.
9:40 the tech calls back, says they're "ten minutes away," and heads on over.
10:10 he shows up. No parts.
10:45 he calls his dispatch and is on the phone with them for twenty minutes while they try to figure out what happened to the parts. They tell him that one of the batches is on its way.
Meanwhile I get a call from the second courier stating she'll be arriving after 11:30. I ask if she can show up sooner because the tech has already been onsite for the better part of an hour.
She arrives at 11:10.
Working with the tech we get one of the devices repaired.
At 12:15pm while trying to decide what we want for lunch, and while the tech is back on with his dispatch trying to find out what happened to Parts Batch_001, we get an alert that the AC unit in the server room has gone down and the temp is going up.
Keep in mind, it's one of the hottest days so far this summer - but we'd come to find out it's a new problem and our usual steps to fix it weren't working.
We set up our portable coolers and, fifteen minutes later, with the unit still not coming back on, we start shutting down servers, including ones we needed up to complete patching and verification.
By 12:45 only our virtual hosts and two other servers are still running, out of fifty. ETA on the HVAC emergency response is 60 minutes.
The tech gets with his dispatch again and they state that, as no one had been present to receive the package, it was sent back.
I send him on his way.
Afterward I call the mfr and rip them a new one as politely as I can, stating that I've watched our CCTV footage and there were no visits from anyone outside of the timestamps listed above. All I get is "we're sorry for the inconvenience." I schedule a new delivery for Monday because by that point the item was on its way back to the depot.
3pm the HVAC tech shows up and goes through a bunch of things. At this point, four days later, I can't remember what his initial diagnosis was, but it's some bullshit about how the unit "shut down due to inconsistent flow from the building water supply". If that were truly the problem, it would be having issues every weekend, not just this one. Plus, if that were the case, there'd be building engineers crawling all over the place, because we're not the only tenant with water-cooled AC in our server room.
But it starts cooling by 4:30 or so, and slowly dropping.
We power up our servers and get our shit finished because it's getting late and we'd lost two guys due to time zone shenanigans.
5:30 rolls around and the temp has dropped far enough into the low orange that we button up and I head home, pulling out of the garage at 6pm.
6:20 I hear my phone ring, but no handsfree in car so I keep going home.
6:32 I get home, feed the cats, and remember that phone call. It was the emergency notification system stating the temps in the server room had shot back up.
Shit.
I put my shoes back on and run out, getting out of the house by 6:40 and back into the building at 7:10 due to sportsball game traffic.
While I'm rushing around setting up the cooling units again two of the other guys are shutting down those same servers again.
The building engineer shows up a few minutes after 8pm; he lives only a few blocks away. He was on his well-deserved weekend, but when one of the C-level people in my org calls his boss, he's gonna show up.
HVAC tech shows up about 45 minutes later.
Turns out there's a condensate pan on the unit (as there should be), but the float sensor that shuts the system off when the pan fills is a manual-reset type.
So what happened was humidity caused the pan to fill up with condensation, and the float sensor turned off the unit. Except for some reason some idiot installed a "manual reset" float instead of an "automatic" one, so if the unit ever shut off for this reason, it would need to be manually turned back on after a default cooling period of 15-20 minutes, plus any time required for the pan (currently bone dry) to drain.
The tech removed the trigger for the float and left after getting part information to order next business day.
I left at midnight after two conference calls - one with the tech and my boss, and the one after that with my boss and the C-level who called in the cavalry. I also set two rubbermaid totes under the pan for condensation catching purposes. Not ideal, but it would work.
I was back the next morning a little before 9 to check up on it. Still dry, still running.
Another of my guys went back in the afternoon. Same deal.
Turns out the mfr of the HVAC unit had no idea the device was shipped with a manual reset on that trigger. They had to special order the automatic part, which would require reprogramming of the unit - but as it's now Thursday morning and I've gotten no further calls from them about it, we don't know when it'll come in or what sort of downtime we're looking at (again).
to be continued...




