Drive Crash 4: Now That's What I Call rsync
At the end of the 2016/2017 academic year, during weeks 9 and 10, Sam Willcocks, Tom Lee and Matthew Stratford decided it would be a good idea to tear everything out of the AV and Computing racks (see The Great Tech Redo 2017). Unfortunately, some of the servers didn't like being turned off and moved around and decided to fail.
The first server to complain was Backup, which reported a degraded array. This was caused by a High Impedance Air Gap between the hard drive power supply and the drive.
Shortly after the drive was plugged back into Backup, Web started to complain of a drive reporting SMART errors. This drive was replaced and Backup, not to be outdone by Web, decided to destroy one of its drives. This drive was replaced too and all was well.
The Attenborough Disaster
Matt and Tim Bradgate were happily patching Cat5 when Tom (who was patching SDI in the AV rack at the time) noticed a suspiciously familiar beeping noise.
The general reaction was "oh crap, not again".
Edwin Barnes was asked to stop editing so we could shut down Attenborough, and the dead drive was identified as one of the OS disks. The disk was replaced and the array began to rebuild, so Tim turned his attention to determining why the drive had failed, while Matt and Tom went back to patching the AV rack.
Tim determined that the drive was healthy, which was slightly concerning, but more concerning was the beeping that started to come from Attenborough.
#Computing moved up a level of emergency.
The rebuild onto the replacement OS drive failed, so the team decided to give the old "failed" drive a try. This started to rebuild fine, so everyone went to the pub.
One Courtyard meal later, the OS drive claimed to be rebuilt but, just to be sure, Attenborough was rebooted into the RAID BIOS. Much to the disappointment of all present, Attenborough promptly started to beep and reported the RAID array as degraded. Just in case something other than the drive had failed, Attenborough was set to rebuilding his RAID array again. This failed.
At this point there was only one thing to do: call Sam. Sam suggested installing Ubuntu on another, non-RAID drive and using it to dump all of the data on Attenborough onto Backup.
The new Ubuntu-based temporary Attenborough was given the hostname "TomScott" after York alumnus Tom Scott, who is in part known for bodging together the Emoji keyboard.
Hui-Ling Phillips and Katherine Bell had arrived bringing the gift of biscuits, and all waited for data to start pouring onto Backup through the magic of rsync, with the Pirates of the Caribbean soundtrack playing in the background to match the atmosphere and keep morale high.
Pizza was then ordered.
The decision was made to prioritise current and paid productions over pending edits during the transfer, so these projects were synced to Backup first. Edwin then set about copying these projects from Backup to his SSD so that YSTV definitely, absolutely, without a doubt had a copy. In the meantime, Kenric, Hui-Ling, Katherine and Edwin started crimping some Cat5 cables to help while the computers were being dealt with.
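For anyone facing the same scramble, the prioritised transfer can be sketched roughly as below. This is a minimal illustration and not the commands actually run on the night: the status labels, project names, paths and the `PRIORITY` table are all hypothetical. It only builds rsync argument lists in priority order and prints them; nothing is copied.

```python
# Hypothetical priority ranks: lower numbers are synced first.
# "current" and "paid" share the top rank; pending edits come later.
PRIORITY = {"current": 0, "paid": 0, "pending": 1}


def sync_order(projects):
    """Sort (name, status) pairs so the most urgent projects come first."""
    # sorted() is stable, so projects of equal rank keep their input order.
    # Unknown statuses are pushed to the very end.
    return sorted(projects, key=lambda p: PRIORITY.get(p[1], len(PRIORITY)))


def rsync_command(name, src_root, dest_root):
    """Build one rsync invocation as an argument list (not executed here)."""
    # -a preserves permissions and timestamps; --partial keeps partly
    # transferred files so an interrupted copy does not start from scratch.
    return ["rsync", "-a", "--partial",
            f"{src_root}/{name}/", f"{dest_root}/{name}/"]


if __name__ == "__main__":
    # Made-up project list and paths, purely for illustration.
    projects = [("old_rushes", "pending"), ("client_promo", "paid"),
                ("live_show", "current")]
    for name, _status in sync_order(projects):
        print(" ".join(rsync_command(name, "/mnt/attenborough",
                                     "backup:/srv/edits")))
```

Running one rsync per project, most important first, means that if the destination (or the source) dies partway through, the projects that matter most are the ones most likely to have made it across.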
All seemed well, so the Tech/Computing teams went back to completing small jobs about the studio. Tim started working on cron jobs to automate backups, Tom started to assemble a media cache for the edit PCs as a way to combat drive failure, and Matt continued work on patching/routing various cables. All was fine, until Backup started reporting SMART errors.
Goddammit.
Now there was a mad scramble to retrieve data from Backup onto Edwin's SSD.
As there was not much else to do other than wait for files to sync and pray that Backup lived long enough, Tom took this opportunity to go home, have a shower, and change, shortly followed by Katherine and Hui-Ling. Meanwhile, Matt and Tim pulled Backup out of the Computing rack. Upon Tom's return, Tim took his shift of showering and changing, and Tom pulled the 2TB drive out of Obriain to be sacrificed to the great Backup RAID array. Matt and Tom took the opportunity during the rebuild to continue the ongoing attempt to tidy up the studio.
After Tim's return, Tim and Tom continued setting up the media cache, while Tom also continued his effort to write the wiki article for the fourth in the series of Drive Crashes while the crash unfolded around him.
Attendees and Roles
Person | Role |
---|---|
Katherine | Nervous and there |
Hui-Ling | Cable Monkey |
Tom | Chief Bodger |
Tim | Linux Wrangler |
Matt | Cat5 Patcher |
Edwin | Stressed out Editor |
Sam | Remote Tech Support |
Kenric | Crimping Party Starter |
Drive Crash 4 as Told by #Computing
Lessons Learned
- Keep regular backups
- Don't unplug the servers
- No, that's not a good reason to
- Seriously, they will fail
- Drives cling to life until powered off (mostly)
- Bring biscuits
- Sleep is good
The Final Fatality
After returning home to recover from the ordeal, Tom sat down at his desktop to find it frozen. After months of neglected maintenance, and several days of being left on, the OS drive in Tom's desktop had failed. The final victim of Drive Crash 4.