Drive Crash 3: DD-ouble or Bust

From YSTV History Wiki
Revision as of 20:35, 18 June 2017 by Samw (talk | contribs) (Add category)

To set the scene: it's the Sunday before Freshers' Week 2016, and 3rd Year Technical Director Sam W and now-alumnus Peter are in York preparing for a week of freshers-related YSTV fun. That evening, Sam notices the following two emails from Derrick, the trusty UPS.

Derrick oh no.png

Hm, looks like the campus substations got lazy over summer and are protesting at the sudden appearance of 5000 freshers. Ah well, it came back quickly enough. Nothing to see here. However, the illusion is shattered when - oh the horror! - Vault-hosted status.ystv.co.uk 502s! The seeds of doubt are sown, and, sure enough:

8:21 PM Peter Eskdale: So, guess which server still isn't on the UPS?

Yikes and forsooth, it is so: confirmed by Anthony, Vault - the primary filestore - is not supplied by the UPS. However, surely it should just be a case of a power on when next in YSTV and all would be well? Sam and Peter, at Sam's house, turn in for the night.

But alas. Bright and early the next day, at 1PM, our ill-fated techies turn their attention to the ailing server and all is not well.

Drive crash 3 slack.png

Oh no, not this sh*t again. Vault is not booting, alternately complaining about xz compressed data corruption, or silently blinking an underscore. In light of this, we were upgraded to defcon probably:

Driver crash 3 defcon probably.png

(At this point, Sam and Peter disappear off into the rain in the direction of Heslington East with a trolley full of gear for Constantine Freshers' Fair. They reappear, moistened by the weather, a little later.) Some hope arrives in the form of Anthony suggesting running memtest, which is duly done.

Drive crash 3 look at them errors.jpg

I mean, it's not like memory is that important, right? Looks like it's finally time to upgrade that crappy old hardware in Vault with one of the large stack of i5 motherboards recently acquired from Computer Science via the ever-bountiful DCOs mailing list. Perhaps the disks are fine then! In reaction to this, we are downgraded again:

7:20 PM lloydw set the channel topic: Defcon: “Maybe". This is not a drill!

The i5 motherboard is duly installed, and Debian is made to boot. And then it immediately kernel panics. Some discussion and experimentation regarding Vault's drives results in the scenario becoming clear: eight 1TB hard drives attached to a PCI card, in Linux software RAID, along with one SSD serving as the OS drive and high-speed cache for the storage. The evening grows late, and the assembled parties call it a night.

The next day, Debian kernel panics again when booted with just the SSD plugged in, so perhaps the issue is a borked SSD? It rapidly looks like the best plan of action is to reinstall Debian on a new disk and attempt to attach and import the existing mdadm RAID array, but our protagonist is hesitant to proceed. For through all of this there has been the promise of a nightly backup from before the incident, but in accordance with the teachings of Drive Crash 2, one should never trust the backup to remain uncorrupted. Any recovery effort should therefore start with a copy of the data on Vault's RAID, and no way of making one is forthcoming, because 6TB of free storage doesn't grow on trees. So things are left as they are, and Peter and Sam spend much of the day ignoring the issue and putting up trunking instead.

However, the situation cannot be ignored, and as such Sam joins Peter in the station the following day at the crack of 4PM or so, and starts to work out a way forward. After a few hours, hope (and >=6TB of free storage) comes in the form of three 3TB drives bought for the new file server of Sam's housemate (also a URY computing team member). Quite why they need 9TB of spinning platter is unknown... but some bribery and a frantic bike ride later, they are at our disposal. Simultaneously, a solicitation to the DCO's Hipchat by YUSU IT Assistant and friend of the station Liam results in the offer of some network storage from, of all places, the York NeuroImaging Center. He also supplies a PSU for the Frankenstein temporary box created to house copies of all 8 of Vault's disks, a device known to the world as osht.ystv.york.ac.uk...

osht.ystv.york.ac.uk visible to the right of the screen, along with assorted drive crash debris

And so, at 3AM on the 4th day, multiple dd's piped into ssh are ferrying the precious bytes of Vault's RAID over to the temporary filestore osht. dd encountered errors on three of the eight drives; the current plan is to ignore the errors, copy the bulk of the data, and hope that the array, along with most of the data, can be recovered. It is unclear, but increasingly likely, that these three drives were damaged by the same conditions which killed Vault's old RAM and the SSD.
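For the curious, here is roughly what "ignoring the errors" means when imaging a dying disk. The exact commands used that night are not recorded here, so this is only a Python sketch of the idea, not what was actually run: copy block by block, and when a block refuses to be read, write zeros in its place so everything after it stays at the right offset (the same behaviour dd offers with conv=noerror,sync). The function name and the read_block/write_block callables are made up for this illustration.

```python
def copy_skipping_errors(read_block, write_block, total_size, block_size=64 * 1024):
    """Copy total_size bytes block by block, zero-filling unreadable blocks.

    read_block(offset, length) returns bytes or raises OSError (hypothetical
    interface for this sketch); write_block(data) appends to the destination.
    Returns the list of offsets whose reads failed.
    """
    bad_offsets = []
    offset = 0
    while offset < total_size:
        length = min(block_size, total_size - offset)
        try:
            data = read_block(offset, length)
        except OSError:
            # Unreadable block: remember where it was and carry on.
            data = b""
            bad_offsets.append(offset)
        # Pad failed or short reads with zeros so later data keeps its offset.
        write_block(data.ljust(length, b"\x00"))
        offset += length
    return bad_offsets


if __name__ == "__main__":
    # Simulated 1 KiB "disk" whose second 256-byte block cannot be read.
    disk = bytes(range(256)) * 4

    def flaky_read(offset, length):
        if offset == 256:
            raise OSError("simulated bad sector")
        return disk[offset:offset + length]

    image = bytearray()
    bad = copy_skipping_errors(flaky_read, image.extend, len(disk), block_size=256)
    print(bad, len(image) == len(disk))  # prints: [256] True
```

Keeping the image the same length as the source matters because a software RAID member is only useful if its data sits at the offsets the array expects; a hole filled with zeros loses one block, whereas a skipped block would shift everything after it.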

For now, though, Player One of Drive Crash 3 heads home for bed, leaving the hum of hard drives behind in the dark...

TO BE CONTINUED


Lessons Learned

  • For the love of The Flying Spaghetti Monster, put the primary filestore on the frigging UPS.