Drive Crash 6: The Great Roses 2023 Drive Crash



== Background ==
Over the Easter holidays, [[Jamie Parker-East]], [[Dan Wade]], and [[Marks Polakovs]] were upgrading our primary web server, which was a bit of a potato at the time. As part of this they replaced the motherboard, and realised that the new motherboard had one fewer SATA connector than the old one. They got very concerned, until they noticed that one of the drives in Web at the time wasn't actually in use at all, either as part of the RAID array or otherwise mounted, and Web seemed to work perfectly fine without it. This drive was labelled the "mystery drive" and put in Marks' pigeon hole.


What the crew hadn't realised was that this drive ''was'' probably part of the array until it failed, presumably some months earlier (that's probably what all those emails were about...). The array kept on keeping on, as a RAID-5 is meant to, but they were now one drive failure away from disaster.
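
Why one failed drive is survivable and a second is not can be shown with a toy model of RAID-5 parity. This is a minimal sketch rather than a description of how Web's array was actually set up: the block contents and the helper names <code>xor_blocks</code> and <code>rebuild_missing</code> are made up for illustration, and a real RAID-5 rotates the parity block across the drives instead of keeping it in a fixed position.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(stripe):
    """Rebuild the single missing block (marked None) in a stripe.

    `stripe` holds the data blocks plus one parity block. Raises ValueError
    unless exactly one block is missing.
    """
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) != 1:
        raise ValueError(f"cannot rebuild: {len(missing)} blocks missing")
    survivors = [block for block in stripe if block is not None]
    return xor_blocks(survivors)


# Hypothetical stripe: three data blocks plus the parity computed from them.
data = [b"YSTV", b"ROSE", b"2023"]
parity = xor_blocks(data)
stripe = data + [parity]

# One drive gone: the array is degraded, but the block can be rebuilt.
degraded = [stripe[0], None, stripe[2], stripe[3]]
recovered = rebuild_missing(degraded)

# A second drive gone: there is nothing left to rebuild from.
try:
    rebuild_missing([stripe[0], None, None, stripe[3]])
    second_failure_survivable = True
except ValueError:
    second_failure_survivable = False
```

Here <code>recovered</code> equals the lost block, while the second call fails. That is the position the array was in: still serving data, but with no redundancy left.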


* User home directories and Windows profiles
* Users' email inboxes
* The source code of the website (though there were copies on GitHub)
* The images for the wikis (that's why they're not exactly working right now)


== The Bodging ==
So now we have a dead drive, a few terabytes of lost data, and a Roses to keep going. The team put in a number of bodges to recover everything Roses broadcast-critical:


* Commenting out or deleting all the bits of nginx config that refer to /data
* Hard-coding various keys into the Sports Graphics deployment configuration to get it to start


The award for most spectacular bodge, though, has to go to Marks temporarily pointing the DNS for "*.prod.ystv.co.uk" to [[URY]]'s web server, which they configured as a reverse proxy pointing to Sports Graphics (also copying a wildcard TLS certificate and private key to URY in the process...) - this wasn't actually strictly necessary, but meant that the team could reboot Web at will without causing issues. With multiple layers of bodges, hacks, and horribleness in place, the team kept enough infrastructure going to deliver the biggest Roses ever.


== The Fallout ==
Having survived a Roses, the team's next order of business was to survey the damage and rebuild.


Eventually most bits and pieces got put back together in some way. Computing Vault was rebuilt based on snippets of configuration scattered around the place, the wikis got brought back without images (until [[Rhys Milling|Rhys]] found a surprisingly recent backup and restored them), and all computing services were rebuilt. The team also built a new SSD-based /data array, though it did take them about three weeks of arguing, failing to pass the money at admin, and probably boring the rest of the society to tears before they could finally agree on its configuration.
 
Unfortunately there were some permanent casualties. Users' home directories and emails were gone forever. Every cloud has a silver lining though: the team had just been discussing moving email to Google Workspace (hosted under YUSU's domain), and specifically how to handle migrating "legacy" on-prem emails - the fact that they no longer existed made the decision much easier.
 
== External Links ==
 
* [https://docs.google.com/document/u/1/d/1h_088clpJNuPGto5-polZUkJKjhxowbVTacxPg8PEwc/edit Post-mortem] (on the team Google Drive)
 
[[Category:Drive Crashes]]
[[Category:Notable Events]]