Fri, 21 Jan 2005

www.es.debian.org rebuilt

When we moved the Spanish Debian web mirror to a dedicated box, I only thought about the better Internet link, maintenance and resources the mirror would have. I never thought hardware problems would appear soon after we installed the box in the new location.

The move was done back in March or so, AFAICT from the logs on the server. In May, the box crashed for the first time due to some massive SCSI errors on the disk that had the root filesystem. Just rebooting would fix it, but some weeks later we would find dmesg full of crap again.

Fernando, one of the operators at the University, opened the box to find out what was going on and discovered that one of the fans had stopped. We thought that might be causing the weird stuff, but soon after, I had to go to the computer room to fix it myself, because the damage to the file system was too extensive.

During that visit, I finally saw the consoles of a few boxes that I had been using for like 8 years... iluso, gong, and other famous ones like tiberio (once the best computer in the University, used to do some Chemistry simulations, IIRC) or cesar (a very big Sun computer, the current best one in València, if I'm not mistaken).

The other day I had to update httpd.conf as requested by the debian-www guys, but as I feared, the box was having problems: apache was running normally (and had been for months, thanks to the binaries being in memory), but the filesystem was read-only due to the same errors on the disk, so I couldn't modify anything. I tried rebooting, but as expected the box didn't come up.

Today Sergio and I went to the campus, picked up the heavy box and took it back to the zulex to have a closer look outside the freezing University server room. After booting d-i, which is our preferred rescue tool these days, we examined what the disks still had, and with a few spare SCSI drives we started rebuilding the box from scratch.

Not having a Woody CD at the office, we decided it was time to upgrade to Sarge anyway, so we did our first RAID install using Debian Installer. Man, partman just rocks. After the base system was installed, we found our first blocker: lilo-installer apparently didn't know where to install the boot block, and would suggest /dev/md/0, which failed. After a few tries we learned about the raid-specific lilo.conf parameter, and managed to finish up.

Next, the SCSI BIOS was misconfigured, and it didn't boot from the correct SCSI ID. After some thought we realised what was going on and finally I could take the box home to finish up.

To fit the new disks in the case, I had to brute-force open the lid, a problem that will go away as soon as we get the rack case we've asked the Hardware Donations team to donate. (hi robster ;) Finding the old data was not so fun, as many files in /etc were corrupt, but I could save the ssh keys and the websync scripts for the web mirror.

Having a nice chance like this to fix things up, I moved the mirror to Apache 2, and it's hopefully working ok now. Tomorrow I'll take the box back to Uni and see if it is. Ideally sto will accept being co-admin for the mirror, as he lives nearby and is University staff anyway. :)

There's some extra space in the box now, so we are thinking about setting up an ftp mirror for the Uni, which I believe has none, even though many, many servers there run Debian.

I'm finally ready to power it off. This is the noisiest box I've worked on in a long time... it's going to be hard to get rid of the headache...