Mon, 14 May 2007

Data disaster on pusa

pusa, a server I administer at uni, suffered a massive data accident on Wednesday. When I went to see why it didn't come up from a reboot on Friday, I found out the initrd hadn't been able to mount /. Weird...

Luckily, the two new disks were already installed in the host, waiting for me to finish the migration to RAID1 and the new Linux-VServer setup. Unfortunately, I've been way too busy, and it was too late for some of our data. A fsck of /dev/hda1 resulted in large portions of the data going to /lost+found. Discovering this made me feel like a great fool for not having dd'd the device first (a dry run of fsck had not reported anything useful). I found some of the lost data in random directories, but in general lots of files were missing, and others made no sense:

/oldpusa/etc: gzip compressed data, was "libpng.txt", from Unix, last modified: Wed Dec 20 00:58:51 2006, max compression
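
The lesson about imaging first can be made concrete. What follows is only a minimal sketch of what "dd the device before fsck" amounts to, written in Python for illustration: the function name image_device and the paths in the note below are hypothetical, and unlike a dedicated tool such as GNU ddrescue it simply stops on the first read error.

```python
import hashlib


def image_device(src_path, dst_path, chunk_size=1024 * 1024):
    """Copy src_path to dst_path chunk by chunk.

    Returns (bytes_copied, sha256_hex) so the image can be verified later.
    Hypothetical helper: it does not retry or skip unreadable sectors.
    """
    digest = hashlib.sha256()
    copied = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break  # end of device or file
            dst.write(chunk)
            digest.update(chunk)
            copied += len(chunk)
    return copied, digest.hexdigest()
```

Called as, say, image_device("/dev/hda1", "/mnt/new/hda1.img") with a destination on a different disk, this leaves a copy that fsck can be run against while the original stays untouched.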

I hoped my PostgreSQL stuff would be intact, so after dd'ing /dev/hda5, I fsck'd the image. The result was an empty filesystem and a lost+found full of stuff. I can't find any directory that resembles PostgreSQL data. I did find a directory with a PG_VERSION file in it, but the rest of the files in it (around 100) had numeric names and little more. If anyone thinks I might be able to rebuild my /var/lib/postgresql from this, I'll be infinitely grateful.
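
A directory holding a PG_VERSION file plus numerically named files is what a single PostgreSQL database directory normally looks like on disk, so it may be worth listing every such directory in the recovered tree. This is a rough sketch for locating candidates only: find_pg_candidates is a made-up name, and finding these directories does not by itself mean the cluster can be rebuilt, since the rest of the data directory would be needed as well.

```python
import os


def find_pg_candidates(root):
    """Walk root and report directories that contain a PG_VERSION file.

    Returns a list of (directory, version_string, numeric_file_count).
    Files named like "16384" or "16384.1" are counted as numeric.
    """
    results = []
    for dirpath, dirnames, filenames in os.walk(root):
        if "PG_VERSION" not in filenames:
            continue
        version_file = os.path.join(dirpath, "PG_VERSION")
        with open(version_file, "r", errors="replace") as fh:
            version = fh.read().strip()
        numeric = sum(1 for name in filenames if name.split(".")[0].isdigit())
        results.append((dirpath, version, numeric))
    return results
```

Running it over the mounted image's lost+found would show how many candidate directories survived and which PostgreSQL version each one claims to belong to.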

Anyway, I haven't written to the corrupted disk since I fucked up the root partition. I'm very interested in knowing what could cause corruption on all partitions, making them unmountable but still recognisable by fsck, even if the result is not good at all. Maybe a corrupted partition table? If so, what does the Dear Lazyweb recommend I try? I suspect the first portion of every partition was damaged, but maybe just that. Or some “partition table shift” that makes the filesystems lose their first superblock (trying other superblocks didn't work either)? Suggestions are very welcome by comment or email, and the more detail on which tools to use and how, the better. My backup of PostgreSQL is not so recent, and recovering some SmartList data would also be great.
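
One way to test the "partition table shift" idea is to scan a raw image of the whole disk for superblock signatures and compare where they sit with where the partition table says the partitions start. The sketch below assumes the filesystems are ext2 or ext3, which the post does not state: those keep a little-endian magic number 0xEF53 at byte 56 of the superblock, and the primary superblock lives 1024 bytes into the filesystem. The function name and the sanity check are illustrative choices, and hits can be backup superblocks or plain coincidences.

```python
import struct

SECTOR = 512
EXT_MAGIC = 0xEF53          # ext2/ext3 superblock magic (assumed filesystem type)
MAGIC_OFFSET = 56           # offset of s_magic inside the superblock
LOG_BLOCK_SIZE_OFFSET = 24  # offset of s_log_block_size inside the superblock
PRIMARY_SB_OFFSET = 1024    # primary superblock sits 1024 bytes into the filesystem


def find_ext_superblocks(image_path, sectors_per_read=2048):
    """Scan an image on 512-byte boundaries for ext2/ext3 superblock magic.

    Returns a list of (superblock_offset, block_size, fs_start_if_primary).
    The third value is only meaningful if the hit is the primary superblock;
    a backup copy would imply a different filesystem start.
    """
    hits = []
    position = 0
    with open(image_path, "rb") as img:
        while True:
            buf = img.read(SECTOR * sectors_per_read)
            if not buf:
                break
            for start in range(0, len(buf) - MAGIC_OFFSET - 1, SECTOR):
                (magic,) = struct.unpack_from("<H", buf, start + MAGIC_OFFSET)
                if magic != EXT_MAGIC:
                    continue
                (log_bs,) = struct.unpack_from("<I", buf, start + LOG_BLOCK_SIZE_OFFSET)
                if log_bs > 6:
                    continue  # implausible block size, likely a false positive
                offset = position + start
                hits.append((offset, 1024 << log_bs, offset - PRIMARY_SB_OFFSET))
            position += len(buf)
    return hits
```

If the candidate filesystem starts all differ from the partition table's start offsets by the same amount, that would point towards a shifted table rather than damaged filesystem contents.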

As for the mandatory “where are your backups”, the answer is basically that we had no resources to store them until very recently, and since we finally got the disks I've had no time to set everything up, so some bits (db, lists, web) were still not running off the new drives. The luckiest people have been the MUD owners, who have had no data loss at all, as they were living entirely on /dev/md0. Losing MUD data probably means getting angry calls at 4AM or so. :)