Fri, 08 Jan 2016

Weird VirtIO errors on a jessie KVM host: Fixed!

Yesterday I posted a desperate plea for help as I had no idea where else to look for clues on what was causing random I/O errors on the guests of our jessie KVM host.

Thanks to Michael Herold, who was kind enough to mail me after identifying our problem: now we know os-prober is to blame. Every kernel update on the host runs update-grub, which invokes os-prober, and its scan of all block devices (including the volumes backing running guests) is what triggered the errors. We have quickly uninstalled it from all our systems.
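
If you run KVM hosts yourself and want to check whether you are exposed, something like this should do it (a minimal sketch; run as root on the host):

dpkg -l os-prober                  # is the package installed at all?
apt-get remove --purge os-prober   # a headless virtualisation host doesn't need it
update-grub                        # regenerate grub.cfg without the probe

If you need to keep the package around for some reason, setting GRUB_DISABLE_OS_PROBER=true in /etc/default/grub should also stop update-grub from invoking it.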

Thanks Michael! If you are by any chance going to attend FOSDEM, I will happily buy you beers!

Let's hope anyone else wondering what's going on with their filesystems finds the trail to these blog posts, and with it a quick solution!

Thu, 07 Jan 2016

Weird VirtIO errors on a jessie KVM host running Debian guests

Hi Interwebs! I'm facing a weird issue with one of our servers at work, involving Debian jessie, libvirt and Debian guests using VirtIO drivers. This is a plea for help. :)

Basically, we are getting random VirtIO errors inside our guests, resulting in stuff like this:

[4735406.568235] blk_update_request: I/O error, dev vda, sector 142339584
[4735406.572008] EXT4-fs warning (device dm-0): ext4_end_bio:317: I/O error -5 writing to inode 1184437 (offset 0 size 208896 starting block 17729472)
[4735406.572008] Buffer I/O error on device dm-0, logical block 17729472
[ ... ]
[4735406.572008] Buffer I/O error on device dm-0, logical block 17729481
[4735406.643486] blk_update_request: I/O error, dev vda, sector 142356480
[ ... ]
[4735406.748456] blk_update_request: I/O error, dev vda, sector 38587480
[4735411.020309] Buffer I/O error on dev dm-0, logical block 12640808, lost sync page write
[4735411.055184] Aborting journal on device dm-0-8.
[4735411.056148] Buffer I/O error on dev dm-0, logical block 12615680, lost sync page write
[4735411.057626] JBD2: Error -5 detected when updating journal superblock for dm-0-8.
[4735411.057936] Buffer I/O error on dev dm-0, logical block 0, lost sync page write
[4735411.057946] EXT4-fs error (device dm-0): ext4_journal_check_start:56: Detected aborted journal
[4735411.057948] EXT4-fs (dm-0): Remounting filesystem read-only
[4735411.057949] EXT4-fs (dm-0): previous I/O error to superblock detected

(From an Ubuntu 15.04 guest, EXT4 on LVM2)

Or,

Jan 06 03:39:11 titanium kernel: end_request: I/O error, dev vda, sector 1592467904
Jan 06 03:39:11 titanium kernel: EXT4-fs warning (device vda3): ext4_end_bio:317: I/O error -5 writing to inode 31169653 (offset 0 size 0 starting block 199058492)
Jan 06 03:39:11 titanium kernel: Buffer I/O error on device vda3, logical block 198899256
[ ... ]
Jan 06 03:39:12 titanium kernel: Aborting journal on device vda3-8.
Jan 06 03:39:12 titanium kernel: Buffer I/O error on device vda3, logical block 99647488

(From a Debian jessie guest, EXT4 directly on a VirtIO-based block device)

When this happens, it affects multiple guests on the host at the same time. Normally the errors are severe enough that the guests end up with a read-only filesystem, but we've seen a few guests survive with only a non-fatal I/O error. The host's dmesg shows nothing interesting.
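
For completeness, this is roughly how we look at the host side whenever an incident hits (and where nothing relevant ever shows up):

dmesg -T | tail -n 100                # kernel ring buffer with readable timestamps
journalctl -k --since "2 hours ago"   # the same data via the systemd journal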

We've seen this happen with quite heterogeneous guests: as the logs above show, both Ubuntu 15.04 with EXT4 on LVM2 and Debian jessie with EXT4 directly on the VirtIO block device were hit.

In short, we haven't seen a clear characteristic in any guest, other than the affected ones being those with some sustained I/O load (build machines, cgit servers, PostgreSQL database servers...). Most of the time, guests that just sit there doing nothing with their disks are not affected.
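
We haven't found a reliable reproducer either, but to approximate the kind of sustained write load the affected guests see, something like this inside a guest generates steady I/O (purely illustrative; it writes and then removes a 4 GiB file):

dd if=/dev/zero of=/var/tmp/ioload bs=1M count=4096 conv=fdatasync
rm /var/tmp/ioload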

The host is a stock Debian jessie install that manages libvirt-based QEMU guests. All the guests have their block devices attached via virtio drivers; some sit on spinning media behind an LSI RAID controller (it was a 3ware card before, which we replaced because we were very suspicious of it, but we are getting the same results), and some on PCIe SSD storage. We have three other hosts with a similar setup, except that they run Debian wheezy (and honestly we're not too keen on upgrading them yet, just in case); none of them has ever shown this kind of problem.
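
For reference, this is how you can see how a guest's disks are wired up in the libvirt domain XML (the guest name and backing device below are made up; the virtio bus is the relevant part):

virsh dumpxml myguest | grep -B 2 "bus='virtio'"

which prints something like:

  <driver name='qemu' type='raw'/>
  <source dev='/dev/mapper/vg0-myguest'/>
  <target dev='vda' bus='virtio'/>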

We've been seeing this since last summer, and haven't found a pattern that tells us where these I/O errors are coming from. Google isn't turning up other people with a similar problem, which we find quite surprising given how basic our setup is.

So, dear Interwebs, have you seen this? We could use any comment (on the blog, or in Debian bug #810121, or kernel bug 110441) that clues us in on what's to blame here. Thanks in advance!

Update: We finally know what's going on! The problem is gone at long last!