Drive Crash 6: The Great Roses 2023 Drive Crash
It would really, really, really suck to have a drive crash on the morning of the first day of Roses, wouldn't it...
Background
Over the Easter holidays, Jamie Parker-East, Dan Wade, and Marks Polakovs were upgrading our primary web server, which was a bit of a potato at the time. As part of this they replaced the motherboard, and realised that the new board had one fewer SATA connector than the old one. They got very concerned, until they noticed that one of the drives in Web wasn't actually in use at all - it wasn't part of the RAID array or otherwise mounted, and Web seemed to work perfectly fine without it. This drive was labelled the "mystery drive" and put in Marks' pigeon hole.
What the crew hadn't realised was that this drive had probably been part of the array until it failed, presumably some months earlier (that's probably what all those emails were about...). The array kept on keeping on, as a RAID-5 is meant to, but it was now one drive failure away from disaster.
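With hindsight this would have been easy to spot, if only someone had looked. A minimal sketch, assuming Linux software RAID managed with mdadm (the exact setup on Web isn't documented here), of how to check for a degraded array:

```shell
# Quick health check of all software RAID arrays.
# A healthy four-drive RAID-5 shows "[UUUU]"; a degraded one shows "[UUU_]".
cat /proc/mdstat

# More detail for a specific array (device name is illustrative):
# look for "State : clean, degraded" and any "removed"/"faulty" devices.
mdadm --detail /dev/md0

# mdadm can email a warning when a drive drops out - but only if
# MAILADDR is set, and only if someone actually reads the emails...
# (config path varies by distro)
grep MAILADDR /etc/mdadm/mdadm.conf
```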
The Crash
In the day or two before Roses the crew had noted that Web was rebooting sporadically, but couldn't quite pin down why. Then, at around 10:00 on Friday 28 April (the first day of Roses), Web rebooted again, only this time it came back up in emergency mode. Liam Burnand was in the MCR (control room) at the time and started diagnosing; Marks was next door in G/N/020 sorting out logistical kerfuffles, until mentions of a "server issue" got him concerned enough to come into the control room. They took a look at Web's startup logs, spotted that systemd was failing to mount a drive, and eventually put two and two together and realised what had happened. Oh dear.
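For the record, digging into a failed mount from emergency mode looks roughly like this - a sketch with an illustrative unit name, not the exact commands run on the day:

```shell
# List the units that failed during boot; a broken fstab entry for /data
# would show up as a failed "data.mount".
systemctl --failed

# Read the logs from this boot for that mount unit.
journalctl -b -u data.mount

# As a stopgap, commenting the dead drive out of /etc/fstab (or adding the
# "nofail" option) lets the machine boot normally; then reload systemd.
systemctl daemon-reload
```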
Marks quickly took over MCR duties while Liam started investigating, and the two of them soon concluded that the /data mount on Web was a goner, taking with it:
- User home directories and Windows profiles
- The source code of the website (though there were copies on GitHub)
- The images for the wikis (that's why they're not exactly working right now)
- The contents of the HashiCorp Vault ("Computing Vault", used by some of our deployment automation)
The Bodging
So now we have a dead drive and a Roses to keep going. The team put in a number of bodges to get everything broadcast-critical for Roses working again:
- Commenting out or deleting all the bits of nginx config that referred to /data (roughly as sketched after this list)
- Bypassing authentication on the streaming servers (as that relied on an API running on Web)
- Hard-coding various keys into the Sports Graphics deployment configuration to get it to start
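As an illustration of the first item, the nginx changes were along these lines - the server names and paths here are made up, not the real config:

```nginx
server {
    server_name example.ystv.co.uk;

    # Anything served out of the dead /data mount got commented out or
    # deleted, since there was nothing left behind it to serve.
    # location /media/ {
    #     alias /data/webroot/media/;
    # }

    # Everything that didn't touch /data was left alone.
    location / {
        proxy_pass http://127.0.0.1:8080;
    }
}
```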
The award for most spectacular bodge, though, has to go to Marks temporarily pointing the DNS for "*.prod.ystv.co.uk" at URY's web server, which they configured as a reverse proxy to Sports Graphics. This wasn't strictly necessary, but it meant the team could reboot Web at will without causing issues.
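For the curious, the URY side of that bodge would look something like this - a hedged sketch with a placeholder upstream address, not URY's actual config:

```nginx
# Catch-all for everything under the wildcard that *.prod.ystv.co.uk
# was temporarily pointed at, proxying it through to Sports Graphics.
server {
    listen 80;
    server_name *.prod.ystv.co.uk;

    location / {
        # Placeholder - wherever Sports Graphics was actually running.
        proxy_pass http://sports-graphics.internal:3000;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```

The nice part of doing it at the DNS level is that it's a one-line rollback once Web can be trusted again.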
The Fallout
With Roses survived, the team's next order of business was to survey the damage and rebuild.
More to come...