Drive Crash 2: The fscking
Latest revision as of 20:35, 18 June 2017
During the Easter of 2014, then-Computing Officer Lloyd Wallis did some semi-planned and somewhat-thought-through work to increase both the storage available in YSTV and its resilience.
Potentially also to cover in this document is the shortly-proceeding replacement of fsrv's OS drive after a kerfuffle.
The Filling of Fsrv
In the months leading up to the Easter break, there were several occasions where the Pending Edits share on fsrv was completely full. This needed fixing.
So, a plan was devised to add another TB of storage to both fsrv and backup and grow their respective RAID5 arrays. We were in possession of one spare 1TB SATA drive, so at the start of Easter it was proposed that we grow the fsrv array during the holiday while it was not in heavy use, then pass the money to grow backup at the beginning of the Summer term.
Of course, as with any plan to touch the file server since the computing team played Drive Crash Classic, a plan was first put in place to do a complete test restore of the data on backup, to ensure that if it all went wrong we still had a copy.
The Borking of Backup
Of course, it turned out that running a restore of the entirety of Finished Shows did not go well. After being left running overnight, the restore was still copying the first file, 2013's Live on the Lawn, having successfully copied 430GB of the file so far. This was quite worrying as the file was only 2GB.
So, after a few more tries and a few more wiggles, scrapping BackupPC sounded like a good idea, replacing it with a flat copy of all the files.
Then at some point, we noticed something interesting with one of the files - the system would sit for ages on that one thing. Applying smartctl gave us good news - a potentially failing hard disk. By removing said disk and rebuilding as a 2TB array, the copy could continue. Testing on the HDD later showed that it was not actually failing.
After ~18 hours of copying, a fresh and working backup was obtained.
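For a flat copy like this one, "working" can be checked rather than assumed. The following is a minimal sketch of such a check, not what was actually run at the time: it walks a source tree and reports any file that is missing from the copy, differs in size, or differs in SHA-256 checksum. The function names and the idea of comparing two mounted directory trees are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large video files never sit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_copy(source_root, backup_root):
    """Return {relative_path: problem} for files that are missing or differ.

    Hypothetical helper: both arguments are directories, e.g. the share on
    the file server and the flat copy on the backup machine.
    """
    source_root, backup_root = Path(source_root), Path(backup_root)
    problems = {}
    for src in sorted(source_root.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source_root)
        dst = backup_root / rel
        if not dst.is_file():
            problems[rel.as_posix()] = "missing"
        elif src.stat().st_size != dst.stat().st_size:
            # Cheap check first: a size difference needs no hashing.
            problems[rel.as_posix()] = "size mismatch"
        elif sha256_of(src) != sha256_of(dst):
            problems[rel.as_posix()] = "checksum mismatch"
    return problems
```

An empty result means every source file has a byte-identical twin in the copy. It says nothing about whether the copy will survive a later power cut, so it is worth re-running immediately before doing anything destructive to the original.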
The Fixing of Fsrv
At this point, everything was considered good to go - we had a full, working backup on a redundant array, and would only need a couple of days to rebuild the array on fsrv and move everything back. So, we shut down fsrv with the intention of rebuilding the array in the RAID BIOS.
The Balls-up
As fsrv posted, there was one immediate concern with the POST data - the RAID controller insisted that the existing array had failed, as all four of the drives connected to it were no longer there. However, they were replaced with four completely different drives, despite the fact they were identical down to the serial numbers. It was at the first stage of investigating this that a High Impedance Air Gap developed between the power cable providing power to backup and the power supply of backup, causing an unexpected power-down. At this time, it was not noticed that the noise heard was that of backup restarting, and was disregarded whilst fsrv was told to start building a new array on its "new" drives.
Of course, once sufficient array building was done to fsrv to make it unlikely anything was ever coming back out of it, the source of the earlier electrical crackling noise was investigated. Once ystvbackup was brought back up, it was only natural for our worst fears to be correct - the backup of fsrv's recently wiped data was corrupt.
The "Recovery"
Over the following weeks, multiple attempts were made to recover the data from what was left of this filesystem using dd images, but no luck was to be had. Alex Williams is currently known to be writing a Python script to manually repair the inodes - we know for a fact the data is still on the disk, it is just the drive metadata that is lost.
Since this happened just after the annual NaSTA deadline, it wasn't as catastrophic as Drive Crash Classic: the content that won us Best Broadcaster 2014 had already made its way happily to the judges. For much of the content that was on the drives, producers expressed relief at no longer needing to find time to edit it. For the content that was important, most of it was on local tempvideo drives on the edit machines or still had the original recordings on Kenobi or Vidsrv, so the only real loss was show resources and finished shows, the latter of which can be reconstructed using data from web and playout.
The Conclusion
Never, ever have Just One copy of your data. And when you ask yourself "Should I make an extra copy on this spare disk I have too?" - the answer should not be "No".