[SGVLUG] rsnapshot: the good and the bad - recovery after patching that goes very very wrong

Claude Felizardo cafelizardo at gmail.com
Tue May 8 08:42:53 PDT 2007


First, what's the best way to defer fsck of a large filesystem when
rebooting?  I have a rather large filesystem with thousands of hard
links that takes forever to run fsck, especially when there are
several errors and I have to reboot over and over.

------------------------------------------------------------------------------------------

Here's the story of a patch job gone horribly wrong, how rsnapshot
came to my rescue, and how the recovery stalled a full startup...

Okay, stupid me.  Before completing the installation of my replacement
fileserver, I got carried away playing with iptstate on various
machines and wanted to install it on my fileserver at home.  I needed
to resolve some dependencies, but the machine hadn't been patched
since, oh, probably February, when the mirror I was pointing to was
unavailable.  I found a different mirror, so, feeling lucky, I told it
to patch everything and went back to work.  I checked back on it a
little later and a few screens of errors had scrolled by about various
patches failing.  Uh oh.  I tried to type ls: coredump.  Even ps
caused a coredump.  I tried to open a new ssh session and got no
response.  Big uh oh.  I could access some web services, but I
couldn't start anything new.  I connected to the port mapped to the
new server and tried to hop over to the old server: no response.
Damn.

When I got home, I checked the log console (I have syslogd configured
to send a copy of /var/log/messages to /dev/tty12) and saw that it was
sorta still alive, but there were also messages that all was not happy
- cron jobs were failing.  From my desktop, I could see that samba was
still responding, as were the TiVo media server, the SlimServer and
mysql via phpmyadmin.  The weather server was still polling the
sensors and posting to wunderground and AAG, but systemgraph was no
longer responding.  Programs that had been running before the patching
started were still running, but anything new died.  I tried a few more
things from my desktop and went back to the server, but it was locked
up.  Crap.

I rebooted into failsafe and it locked up while trying to pivot root.
Same for every other entry in LILO.  Time to use my rescue CD, but
that required reconnecting the CD drive as slave on the primary IDE
controller.  I've got three ATA drives running as RAID-5 (/dev/hda,
hdc and hde), so I normally leave the CD drive disconnected so it's
not slowing down hda.

I've got three 250 GB drives configured into 5 RAID devices: the first
two are RAID-1 (two drives each, with the third as hot spare) for
/boot and /; the other three are each RAID-5, for /home, for /exports,
which has all my shared media (197 GB, 97% full), and for /.snapshots
(229 GB, 89%), which contains the rsnapshot live backups going back
180 days.  All my important data - the family pics, music, etc. - is
backed up on every computer in the house and onto DVDs stored in
remote locations.  I have a full backup minus the snapshots on an
external drive that takes forever to update, so it's not current.
Since I've been working on the replacement server, I'd forgotten to
make a new full backup.  Worst case, if I had to restore from that,
the only things I'd really lose are a few tweaks and maybe a month or
two of weather data.  Not a big deal.

Using the rescue mode on the Mandriva install CD, I was able to verify
that the hardware was still working and my data was still there.
Using a snapshot from before I started patching, I used rsync to
restore /, /etc, /var, /root, /bin, /lib, /sbin and /usr.  I left
/home and /exports alone.  I rebooted and crossed my fingers.  The
reboot was coming along fine until it got to the point where it checks
the filesystems and started running fsck on everything as expected -
but I'd forgotten how long it can take to fsck 230 GB on a PIII-450.
Of course the biggest one had errors, duplicate blocks, etc.  After
the 3rd reboot, I commented out the entry in /etc/fstab for /dev/md4
and disabled the backups.  The system booted up fine and is now
working again.  I left it running a second pass of fsck manually from
a shell and went to bed very very late.
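For anyone curious, the selective restore boiled down to something
like the following (run from the rescue shell with the arrays
assembled and mounted under /mnt; the snapshot name, mount points and
function name here are illustrative, not the exact commands I typed):

```shell
#!/bin/sh
# Restore a list of top-level directories from an rsnapshot snapshot,
# leaving everything else (e.g. /home, /exports) untouched.
restore_from_snapshot() {
    snap=$1    # e.g. /mnt/.snapshots/daily.2/localhost
    root=$2    # the damaged root filesystem, e.g. /mnt/root
    shift 2
    for d in "$@"; do
        # -a preserves permissions/times/ownership, -H preserves hard
        # links (essential for an rsnapshot tree), --delete removes
        # anything the botched patch run left behind
        rsync -aH --delete "$snap/$d/" "$root/$d/" || return 1
    done
}

# The actual invocation would look something like:
# restore_from_snapshot /mnt/.snapshots/daily.2/localhost /mnt/root \
#     etc var root bin lib sbin usr
```

The trailing slashes on both rsync paths matter: they make rsync copy
the *contents* of each snapshot directory into the corresponding
destination directory rather than nesting one inside the other.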

Well, this morning the system is still alive with, as best I can
tell, all services running.  It's currently making a fresh "hourly"
backup, which I plan to copy to the external hard disk.

So, what I'm thinking of doing is adding the noauto flag to /etc/fstab
for the two big partitions to defer fsck until after everything else
is running.  Then have a script, possibly called from
/etc/rc.d/rc.local, that would run fsck and then mount when done.  I'd
have to verify that anything that tries to use these directories does
a sanity check first.  I think rsnapshot already does this, but I'm
not sure about samba and the media server applications.  If possible,
I'd like to have it send an email if the mount fails.
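A rough sketch of what that rc.local script might look like (device
names, mount points and the admin address are placeholders; the fstab
entries would also need pass number 0 in the sixth field so the boot
sequence skips them entirely):

```shell
#!/bin/sh
# Deferred fsck-and-mount for the big noauto filesystems, e.g. called
# from /etc/rc.d/rc.local after everything else is up.
ADMIN="root"

check_and_mount() {
    dev=$1
    mnt=$2
    rc=0
    fsck -p "$dev" || rc=$?    # -p: preen, repair automatically where safe
    if [ "$rc" -le 1 ]; then   # 0 = clean, 1 = errors were corrected
        mount "$dev" "$mnt"
    else
        echo "fsck of $dev exited with status $rc; $mnt left unmounted" |
            mail -s "deferred fsck FAILED on $(hostname)" "$ADMIN"
        return 1
    fi
}

# Run the two big arrays in the background so the boot (and every
# other service) finishes first.
{
    check_and_mount /dev/md3 /exports
    check_and_mount /dev/md4 /.snapshots
} &
```

Services that touch /exports or /.snapshots would still need their own
sanity check (or a delayed start) since the mount can land minutes or
hours after boot.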

Suggestions or comments?
