Drive Crash 2: The fscking

From YSTV History Wiki

Revision as of 12:49, 4 April 2014

During Easter 2014, then-Computing Officer Lloyd Wallis did some semi-planned and somewhat-thought-through work to increase the storage available in YSTV and to improve its resilience.

Potentially also to be covered in this document is the shortly-proceeding replacement of fsrv's OS drive after a kerfuffle.

The Filling of Fsrv

In the months leading up to the Easter break, there were several occasions where the Pending Edits share on fsrv was completely full. This needed fixing.

So, a plan was devised to add another TB of storage to both fsrv and backup and grow their respective RAID5 arrays. We were in possession of one spare 1TB SATA drive, so at the start of Easter it was proposed that we grow the fsrv array during the holiday while it was not in heavy use, then pass the money to grow backup at the beginning of the Summer term.
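
For a rough sense of what growing a RAID5 array buys, the usual rule is that one disk's worth of space goes to parity, so usable capacity is (number of disks - 1) x disk size. A minimal sketch of that arithmetic follows; the function name and the disk counts are illustrative assumptions, since the exact number of drives in each array is not recorded here.

```python
def raid5_usable_tb(disk_count: int, disk_size_tb: float) -> float:
    """Usable capacity of a RAID5 array, where one disk's worth of space holds parity."""
    if disk_count < 3:
        raise ValueError("RAID5 needs at least three disks")
    return (disk_count - 1) * disk_size_tb


# Assumed layouts for illustration only (not taken from the write-up).
for disks in (3, 4):
    usable = raid5_usable_tb(disks, 1.0)
    print(f"{disks} x 1TB disks in RAID5 -> {usable:.0f}TB usable")
```

Under those assumptions, adding one 1TB drive to an array of 1TB disks adds 1TB of usable space.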

Of course, as with any plan to touch the file server since the computing team played Drive Crash classic, the first step was a complete test restore of the data on backup, to ensure that if it all went wrong we still had a copy.

The Borking of Backup

Of course, it turned out that running a restore of the entirety of Finished Shows did not go well. After leaving it running overnight, the restore was still copying the first file, 2013's Live on the Lawn, having successfully copied 430GB of the file so far. This was quite worrying as the file was only 2GB.

So, after a few more tries and a few more wiggles, scrapping BackupPC sounded like a good idea, replacing it with a flat copy of all the files.
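
One way to check a flat copy like this, sketched here in Python, is to walk the source tree and compare each file's size and checksum against its copy. The size check alone would flag a 2GB file that had somehow become 430GB. This is an illustrative sketch, not what was actually run at the time; the function names and the choice of SHA-256 are assumptions.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large video files are never loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(source_root: Path, copy_root: Path) -> list[str]:
    """List files under source_root that are missing or different under copy_root."""
    problems = []
    for source_file in sorted(source_root.rglob("*")):
        if not source_file.is_file():
            continue
        relative = source_file.relative_to(source_root)
        copied_file = copy_root / relative
        if not copied_file.is_file():
            problems.append(f"missing: {relative}")
        elif copied_file.stat().st_size != source_file.stat().st_size:
            # Cheap check first: a size mismatch needs no hashing.
            problems.append(f"size differs: {relative}")
        elif sha256_of(copied_file) != sha256_of(source_file):
            problems.append(f"content differs: {relative}")
    return problems
```

Calling `find_mismatches` with the two root directories returns an empty list when every source file has an identical copy.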

Then at some point, we noticed something interesting with one of the files - the system would sit for ages on that one thing. Applying smartctl gave us good news - a potentially failing hard disk. By removing said disk and rebuilding as a 2TB array, the copy could continue. Testing on the HDD later showed that it was not actually failing.

After ~18 hours of copying, a fresh and working backup was obtained.

The Fixing of Fsrv

At this point, everything was considered good to go - we had a full, working backup on a redundant array, and would only need a couple of days to rebuild the array on fsrv and move everything back. So the array was rebuilt as a 3TB software RAID.

The Balls-up

During the rebuild, a high impedance air gap developed between the power cable providing power to backup and the power supply of backup, causing an unexpected power-down. Upon restart, it was discovered that the EXT4 partition holding all the data had become corrupt. It is suspected that the drive being over 95% full didn't help with this.
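
A check for the "over 95% full" condition can be sketched in Python with the standard library's `shutil.disk_usage`. The 95% threshold comes from the figure above; the function names are hypothetical, and the percentage computed here may differ slightly from what tools like `df` report, since filesystems can reserve some blocks.

```python
import shutil


def percent_used(total_bytes: int, used_bytes: int) -> float:
    """Percentage of a filesystem's total space that is in use."""
    return 100.0 * used_bytes / total_bytes


def is_too_full(path: str, threshold_percent: float = 95.0) -> bool:
    """True if the filesystem holding `path` is at or above the threshold."""
    usage = shutil.disk_usage(path)
    return percent_used(usage.total, usage.used) >= threshold_percent


if __name__ == "__main__":
    # "." is a placeholder; point this at the mount holding the data.
    print("too full" if is_too_full(".") else "ok")
```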