Drive Crash 6: The Great Roses 2023 Drive Crash
It would really, really, really suck to have a drive crash on the morning of the first day of Roses, wouldn't it...
Background
Over the Easter holidays, Jamie Parker-East, Dan Wade, and Marks Polakovs were upgrading our primary web server, which was a bit of a potato at the time. As part of this they replaced the motherboard, and realised that the new one had one fewer SATA connector than the old one. They got rather concerned, until they noticed that one of the drives in Web wasn't actually in use at all: it wasn't part of the RAID array or otherwise mounted, and Web seemed to work perfectly fine without it. This drive was labelled the "mystery drive" and put in Marks' pigeon hole.
What the crew hadn't realised was that this drive had probably been part of the array until it failed, presumably some months earlier (that's probably what all those emails were about...). The array kept on keeping on, as a RAID-5 is meant to when it loses a single drive, but Web was now just one more drive failure away from disaster.
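For the record, a degraded array isn't hard to spot if you know to look (and read the emails). Assuming Web was using Linux software RAID (mdadm) - the exact setup isn't recorded here - something like the following would have shown the missing member and, crucially, nagged someone about it:

```bash
# Spotting a degraded array, assuming Linux software RAID (mdadm).
# /dev/md0 and the mail address are placeholders, not Web's real config.
cat /proc/mdstat                 # a "_" in the status (e.g. "[UU_]") means a member is missing
sudo mdadm --detail /dev/md0     # look for "State : clean, degraded"

# mdadm can also email someone who actually reads the inbox when a drive drops out
sudo mdadm --monitor --scan --daemonise --mail=computing@example.org
```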
The Crash
In the day or two before Roses the crew noted that Web had started rebooting sporadically, but couldn't quite pin down why. Then, at around 10:00 on Friday 28 April (the first day of Roses), Web rebooted again, only this time it came up in emergency mode. Liam Burnand was in the MCR (control room) at the time and started diagnosing; Marks was next door in G/N/020 sorting logistical kerfuffles, until the mentions of a "server issue" got concerning enough to bring him into the control room too. They took a look at Web's startup logs, spotted that systemd was failing to mount a drive, and eventually put two and two together and realised what had happened. Oh dear.
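For the curious, here's a rough sketch of the kind of digging that points the finger at a failed mount - the unit, device, and filesystem names below are illustrative, not Web's actual ones:

```bash
# Find out why systemd dropped to emergency mode after a mount failure.
systemctl list-units --failed      # shows e.g. data.mount in a "failed" state
journalctl -b -u data.mount        # "can't read superblock" or similar in this boot's logs
lsblk -f                           # is the underlying block device even still there?

# Marking the mount "nofail" in /etc/fstab would have let Web boot normally
# (minus /data) instead of dropping into emergency mode:
# /dev/md1  /data  ext4  defaults,nofail  0  2
```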
Marks quickly took over MCR duties while Liam carried on investigating, and before long the two of them concluded that the /data mount on Web was a goner, taking with it:
- User home directories and Windows profiles
- Users' email inboxes
- The source code of the website (though there were copies on GitHub)
- The images for the wikis (that's why they're not exactly working right now)
- The contents of the HashiCorp Vault ("Computing Vault", used by some of our deployment automation)
The Bodging
So now we have a dead drive, a few terabytes of lost data, and a Roses to keep going. The team put in a number of bodges to keep everything broadcast-critical for Roses running:
- Commenting out or deleting all the bits of nginx config that referred to /data (see the sketch after this list)
- Bypassing authentication on the streaming servers (as that relied on an API running on Web)
- Hard-coding various keys into the Sports Graphics deployment configuration to get it to start
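The nginx part of that looked roughly like this - the paths and config layout are assumptions, not Web's actual setup:

```bash
# Find every nginx config file that still references the dead mount
grep -rl '/data' /etc/nginx/

# ...comment out or delete the offending root/alias/location directives by hand,
# then make sure the config still parses before reloading
sudo nginx -t && sudo systemctl reload nginx
```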
The award for most spectacular bodge, though, has to go to Marks temporarily pointing the DNS for "*.prod.ystv.co.uk" at URY's web server, which they configured as a reverse proxy pointing to Sports Graphics (also copying a wildcard TLS certificate and private key over to URY in the process...). This wasn't strictly necessary, but it meant the team could reboot Web at will without causing issues.
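A minimal sketch of what the URY-side proxy could have looked like, assuming nginx and a Debian-style sites-available layout - every hostname, path, and filename below is hypothetical:

```bash
# Write a reverse-proxy site that terminates TLS with the copied wildcard
# certificate and forwards everything to the Sports Graphics box.
sudo tee /etc/nginx/sites-available/ystv-prod-proxy > /dev/null <<'EOF'
server {
    listen 443 ssl;
    server_name *.prod.ystv.co.uk;

    # the wildcard certificate and key copied over from YSTV
    ssl_certificate     /etc/ssl/ystv/wildcard.prod.ystv.co.uk.crt;
    ssl_certificate_key /etc/ssl/ystv/wildcard.prod.ystv.co.uk.key;

    location / {
        proxy_pass https://graphics.internal.ystv.co.uk;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $remote_addr;
    }
}
EOF
sudo ln -s /etc/nginx/sites-available/ystv-prod-proxy /etc/nginx/sites-enabled/
sudo nginx -t && sudo systemctl reload nginx
```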
The Fallout
Having survived a Roses, the team turned to surveying the damage and rebuilding.
Eventually most bits and pieces got put back together in some way. Computing Vault was rebuilt from snippets of configuration scattered around the place, the wikis got brought back without images (until Rhys found a surprisingly recent backup and restored them), and all computing services were rebuilt. The team also built a new /data array on SSDs, though it took about three weeks of arguing, failing to get the money passed at admin, and probably boring the rest of the society to tears before they could finally agree on its configuration.
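The configuration the team eventually settled on isn't recorded here, so purely as an illustration, here's what building a four-SSD RAID-10 /data array with mdadm looks like (device names made up):

```bash
# Create the array, put a filesystem on it, and mount it at /data.
# "nofail" means a future drive failure won't stop the machine booting.
sudo mdadm --create /dev/md1 --level=10 --raid-devices=4 \
    /dev/sda /dev/sdb /dev/sdc /dev/sdd
sudo mkfs.ext4 /dev/md1
sudo mkdir -p /data
sudo mount /dev/md1 /data
echo '/dev/md1  /data  ext4  defaults,nofail  0  2' | sudo tee -a /etc/fstab
```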
Unfortunately there were some permanent casualties. Users' home directories and emails were gone forever. Every cloud has a silver lining, though: the team had just been discussing moving emails to Google Workspace (hosted under YUSU's domain), and specifically how to handle migrating "legacy" on-prem emails - the fact that they no longer existed made that decision much easier.