read corruption with qemu master io_uring engine / linux master / btrfs(?)


* read corruption with qemu master io_uring engine / linux master / btrfs(?)
From: Dominique MARTINET @ 2022-06-28  9:08 UTC (permalink / raw)
  To: io-uring, linux-btrfs

I don't have any good reproducer so it's a bit difficult to specify,
let's start with what I have...

I've got this one VM which has various segfaults all over the place when
starting it with aio=io_uring for its disk as follows:

  qemu-system-x86_64 -drive file=qemu/atde-test,if=none,id=hd0,format=raw,cache=none,aio=io_uring \
      -device virtio-blk-pci,drive=hd0 -m 8G -smp 4 -serial mon:stdio -enable-kvm

It also happens with virtio-scsi-blk:
  -device virtio-scsi-pci,id=scsihw0 \
  -drive file=qemu/atde-test,if=none,id=drive-scsi0,format=raw,cache=none,aio=io_uring \
  -device scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100

It also happened when the disk I was using was a qcow file backing up a
vmdk image (this VM's original disk is for vmware), so while I assume
qemu reading code and qemu-img convert code are similar I'll pretend
image format doesn't matter at this point...
It's happened with two such images, but I haven't been able to reproduce
with any other VMs yet.

I can also reproduce this on a second host machine with a completely
different ssd (WD sata in one vs. samsung nvme), so probably not a
firmware bug.

scrub sees no problem with my filesystems on the host.

I've confirmed it happens with at least debian testing's 5.16.0-4-amd64
and 5.17.0-1-amd64 kernels, as well as 5.19.0-rc4.
It also happens with both debian's qemu 7.0.0 and the qemu master branch
(v7.0.0-2031-g40d522490714).

These factors aside, anything else I tried changing made this bug no
longer reproduce:
 - I'm not sure what the rule is, but running the VM twice in a row
sometimes reproduces it and sometimes doesn't. Making a fresh copy of
my source image with `cp --reflink=always` seems to reproduce it
reliably.
 - it stops happening without io_uring
 - it stops happening if I copy the disk image with --reflink=never
 - it stops happening if I copy the disk image to another btrfs
partition, created in the same lv, so something about my partition
history matters?...
(I have ssd > GPT partitions > luks > lvm > btrfs with a single disk as
metadata DUP data single)
 - I was unable to reproduce on xfs (with a reflink copy) either but I
also was only able to try on a new fs...
 - I've never been able to reproduce on other VMs

If you'd like to give it a try, my reproducer source image is
---
curl -O https://download.atmark-techno.com/atde/atde9-amd64-20220624.tar.xz
tar xf atde9-amd64-20220624.tar.xz
qemu-img convert -O raw atde9-amd64-20220624/atde9-amd64.vmdk atde-test
cp --reflink=always atde-test atde-test2
---
and using 'atde-test'.
For further attempts I've removed atde-test and copied back from
atde-test2 with cp --reflink=always.
This VM graphical output is borked, but ssh listens so something like
`-netdev user,id=net0,hostfwd=tcp::2227-:22 -device virtio-net-pci,netdev=net0`
and 'ssh -p 2227 -l atmark localhost' should allow login with password
'atmark' or you can change vt on the console (root password 'root')

I also had similar problems with atde9-amd64-20211201.tar.xz .

When reproducing I've had either segfaults in the initrd and complete
boot failures, or boot working and login failures but ssh working
without login shell (ssh ... -tt localhost sh)
that allowed me to dump content of a couple of corrupted files.
When I looked:
- /usr/lib/x86_64-linux-gnu/libc-2.31.so had zeroes instead of data from
offset 0xb6000 to 0xb7fff; rest of file was identical.
- /usr/bin/dmesg had garbage from 0x05000 until 0x149d8 (end of file).
I was lucky and could match the garbage quickly: it is identical to the
content from 0x1000-0x109d8 within the disk itself.
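
The kind of comparison described above (finding where a corrupted copy
diverges from a known-good one, and whether the bad range is just zeroes)
can be sketched in C roughly as follows. This is only an illustration:
`diff_range` and `range_is_zero` are hypothetical helper names, and both
files are assumed to be already loaded into equally sized buffers.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Find the first and last differing byte between two equally sized buffers.
 * Returns 1 and fills *first / *last (inclusive offsets) if they differ,
 * 0 if the buffers are identical.
 */
int diff_range(const unsigned char *a, const unsigned char *b, size_t len,
	       size_t *first, size_t *last)
{
	size_t i;
	int found = 0;

	assert(first && last);
	for (i = 0; i < len; i++) {
		if (a[i] == b[i])
			continue;
		if (!found)
			*first = i;
		*last = i;
		found = 1;
	}
	return found;
}

/* Return 1 if buf[from..to] (inclusive) contains only zero bytes. */
int range_is_zero(const unsigned char *buf, size_t from, size_t to)
{
	size_t i;

	for (i = from; i <= to; i++)
		if (buf[i] != 0)
			return 0;
	return 1;
}
```

For the libc case above one would expect `diff_range` to report a
page-aligned range and `range_is_zero` to return 1 on the corrupted copy;
for the dmesg case the range would be non-zero and has to be searched for
in the disk image instead.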

I've rebooted a few times and it looks like the corruption is identical
every time for this machine as long as I keep using the same source file;
running from qemu-img convert again seems to change things a bit?
but whatever it is that is specific to these files is stable, even
through host reboots.

I'm sorry I haven't been able to make a better reproducer, I'll keep
trying a bit more tomorrow but maybe someone has an idea with what I've
had so far :/

Perhaps at this point it might be simpler to just try to take qemu out
of the equation and issue many parallel reads to different offsets
(overlapping?) of a large file in a similar way qemu io_uring engine
does and check their contents?
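
A minimal sketch of the checking half of that idea, under stated
assumptions: `compare_reads` is a hypothetical helper, the caller is
assumed to open the same file twice (once with O_DIRECT, once without),
and the reads here are plain synchronous pread() calls rather than
parallel io_uring submissions.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read len bytes at offset off through both descriptors and compare them.
 * Returns 0 if the data matches, 1 on mismatch, -1 on error or short read.
 * align must be a power of two and a multiple of sizeof(void *); for
 * O_DIRECT, off and len should be multiples of the required alignment too.
 */
int compare_reads(int fd_a, int fd_b, off_t off, size_t len, size_t align)
{
	void *a = NULL, *b = NULL;
	int ret = -1;

	assert(len > 0);
	if (posix_memalign(&a, align, len) || posix_memalign(&b, align, len))
		goto out;
	if (pread(fd_a, a, len, off) != (ssize_t)len ||
	    pread(fd_b, b, len, off) != (ssize_t)len)
		goto out;	/* error or short read: report it, do not compare */
	ret = memcmp(a, b, len) != 0;
out:
	free(a);
	free(b);
	return ret;
}
```

Calling this in a loop over random page-aligned offsets would flag any
range where the O_DIRECT path and the buffered path disagree; short reads
are reported separately (-1) instead of being compared.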

Thanks, and I'll probably follow up a bit tomorrow even if no-one has
any idea, but even ideas of where to look would be appreciated.
-- 
Dominique


* Re: read corruption with qemu master io_uring engine / linux master / btrfs(?)
From: Nikolay Borisov @ 2022-06-28 19:03 UTC (permalink / raw)
  To: Dominique MARTINET, io-uring, linux-btrfs



On 28.06.22 г. 12:08 ч., Dominique MARTINET wrote:
> I don't have any good reproducer so it's a bit difficult to specify,
> let's start with what I have...
> 
> I've got this one VM which has various segfaults all over the place when
> starting it with aio=io_uring for its disk as follow:
> 
>    qemu-system-x86_64 -drive file=qemu/atde-test,if=none,id=hd0,format=raw,cache=none,aio=io_uring \
>        -device virtio-blk-pci,drive=hd0 -m 8G -smp 4 -serial mon:stdio -enable-kvm

So cache=none means O_DIRECT and using io_uring. This really sounds 
similar to:

ca93e44bfb5fd7996b76f0f544999171f647f93b

This commit got merged into v5.17 so you shouldn't be seeing it on 5.17 
and onwards.

<snip>

> 
> Perhaps at this point it might be simpler to just try to take qemu out
> of the equation and issue many parallel reads to different offsets
> (overlapping?) of a large file in a similar way qemu io_uring engine
> does and check their contents?

Care to run the sample program in the aforementioned commit and verify 
it's not failing?

> 
> 
> Thanks, and I'll probably follow up a bit tomorrow even if no-one has
> any idea, but even ideas of where to look would be appreciated.



* Re: read corruption with qemu master io_uring engine / linux master / btrfs(?)
From: Dominique MARTINET @ 2022-06-29  0:35 UTC (permalink / raw)
  To: Nikolay Borisov, Jens Axboe; +Cc: io-uring, linux-btrfs


Thanks for the replies.

Nikolay Borisov wrote on Tue, Jun 28, 2022 at 10:03:20PM +0300:
> >    qemu-system-x86_64 -drive file=qemu/atde-test,if=none,id=hd0,format=raw,cache=none,aio=io_uring \
> >        -device virtio-blk-pci,drive=hd0 -m 8G -smp 4 -serial mon:stdio -enable-kvm
> 
> So cache=none means O_DIRECT and using io_uring. This really sounds similar
> to:
> 
> ca93e44bfb5fd7996b76f0f544999171f647f93b

That looks close, yes...

> This commit got merged into v5.17 so you shouldn't be seeing it on 5.17 and
> onwards.
> 
> <snip>
> 
> > 
> > Perhaps at this point it might be simpler to just try to take qemu out
> > of the equation and issue many parallel reads to different offsets
> > (overlapping?) of a large file in a similar way qemu io_uring engine
> > does and check their contents?
> 
> Care to run the sample program in the aforementioned commit and verify it's
> not failing

But unfortunately it seems like it is properly fixed on my machines:
---
io_uring read result for file foo:

  cqe->res == 8192 (expected 8192)
  memcmp(read_buf, write_buf) == 0 (expected 0)
---

Nikolay Borisov wrote on Tue, Jun 28, 2022 at 10:05:39PM +0300:
> Alternatively change cache=none (O_DIRECT) to cache=writeback (ordinary
> buffered writeback path) that way we'll know if it's related to the
> iomap-based O_DIRECT code in btrfs.

Good idea; I can confirm this doesn't reproduce without cache=none, so
O_DIRECT probably is another requirement here (probably because I
haven't been able to reproduce on a freshly created fs either, so not
being able to reproduce in a few tries is no guarantee...)


Jens Axboe wrote on Tue, Jun 28, 2022 at 01:12:54PM -0600:
> Not sure what's going on here, but I use qemu with io_uring many times
> each day and haven't seen anything odd. This is on ext4 and xfs however,
> I haven't used btrfs as the backing file system. I wonder if we can boil
> this down into a test case and try and figure out what is going on here.

Yes I'd say it's fs specific, I've not been able to reproduce on ext4 or
xfs -- but then again I couldn't reproduce with btrfs on a new
filesystem so there probably are some other conditions :/

I also agree writing a simple program like the io_uring test in the
above commit that'd sort of do it like qemu and compare contents would
be ideal.
I'll have a stab at this today.

-- 
Dominique



* Re: read corruption with qemu master io_uring engine / linux master / btrfs(?)
From: Dominique MARTINET @ 2022-06-29  5:14 UTC (permalink / raw)
  To: Nikolay Borisov, Jens Axboe; +Cc: io-uring, linux-btrfs

Dominique MARTINET wrote on Wed, Jun 29, 2022 at 09:35:44AM +0900:
> I also agree writing a simple program like the io_uring test in the
> above commit that'd sort of do it like qemu and compare contents would
> be ideal.
> I'll have a stab at this today.

Okay, after half a day failing to reproduce I had a closer look at qemu
and... it's a qemu bug.

Well, there probably are two bugs, but one should be benign:

 - qemu short read handling was... rather disappointing.
The patch should eventually appear here[1], but since the list seems to
be moderated I'm reposting it here:
-----
diff --git a/block/io_uring.c b/block/io_uring.c
index d48e472e74cb..d58aff9615ce 100644
--- a/block/io_uring.c
+++ b/block/io_uring.c
@@ -103,7 +103,7 @@ static void luring_resubmit_short_read(LuringState *s, LuringAIOCB *luringcb,
                       remaining);
 
     /* Update sqe */
-    luringcb->sqeq.off = nread;
+    luringcb->sqeq.off += nread;
     luringcb->sqeq.addr = (__u64)(uintptr_t)luringcb->resubmit_qiov.iov;
     luringcb->sqeq.len = luringcb->resubmit_qiov.niov;
 
-----
 (basically "just" a typo, but that must have never been tested!)
[1] https://lore.kernel.org/qemu-devel/[email protected]


 - comments there also say short reads should never happen on newer
kernels (assuming local filesystems?) -- how true is that? If we're
doing our best kernel side to avoid short reads I guess we probably
ought to have a look at this.

It can easily be reproduced with a simple io_uring program -- see
example attached that eventually fails with the following error on
btrfs:
bad read result for io 8, offset 792227840: 266240 should be 1466368

but doesn't fail on tmpfs or without O_DIRECT.
Feel free to butcher it: it's a quickly hacked, cut-down version of my
original test, which had hash computation etc., so the flow might feel a
bit weird.
Just compile with `gcc -o shortreads uring_shortreads.c -luring` and run
it with the file to read as its argument.
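
For reference, the offset arithmetic behind the one-line qemu fix above
can be sketched as a small stand-alone helper. This is not qemu code:
`struct read_req` and `read_req_advance` are hypothetical names used only
to illustrate why the resubmitted offset must be advanced relative to the
previous one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bookkeeping for one read request; not a qemu structure. */
struct read_req {
	uint64_t off;		/* file offset of the next byte to read */
	size_t remaining;	/* bytes still missing */
};

/*
 * Account for a completion that returned nread bytes.
 * Returns 1 if the request must be resubmitted for the remainder, 0 if done.
 */
int read_req_advance(struct read_req *req, size_t nread)
{
	assert(nread <= req->remaining);
	/*
	 * Advance relative to the current offset: `req->off = nread` would
	 * restart near the beginning of the file instead.
	 */
	req->off += nread;
	req->remaining -= nread;
	return req->remaining > 0;
}
```

With the numbers from the error message above (offset 792227840, 1466368
bytes requested, 266240 returned), the resubmitted read should start at
792494080 for the remaining 1200128 bytes. A caller would also need to
treat a 0-byte completion as end of file rather than resubmitting forever.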


Thanks!
-- 
Dominique

[-- Attachment #2: uring_shortreads.c --]
[-- Type: text/x-csrc, Size: 3359 bytes --]

/* Get O_DIRECT */
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <string.h>
#include <liburing.h>
#include <sys/random.h>
#include <sys/stat.h>

long pagesize;
size_t n_blocks;
#define QUEUE_SIZE 10
char *buffers[QUEUE_SIZE];
int bufsize[QUEUE_SIZE];
struct iovec iovec[QUEUE_SIZE];
long int offsets[QUEUE_SIZE];

void breakme(void) {
}

int submit_read(struct io_uring *ring, int fd, int i) {
	struct io_uring_sqe *sqe;
	int ret;

	sqe = io_uring_get_sqe(ring);
	if (!sqe) {
		fprintf(stderr, "Failed to get io_uring sqe\n");
		return 1;
	}
	if (i == 0 || rand() % 2 == 0 || offsets[i-1] > n_blocks - bufsize[i]) {
		offsets[i] = rand() % (n_blocks - bufsize[i] + 1);
	} else {
		offsets[i] = offsets[i - 1];
	}
	io_uring_prep_readv(sqe, fd, iovec + i, 1, offsets[i] * pagesize);
	io_uring_sqe_set_data(sqe, (void*)(uintptr_t)i);
	ret = io_uring_submit(ring);
	if (ret != 1) {
		fprintf(stderr,	"submit failed\n");
		return 1;
	}
	return 0;
}

int getsize(int fd) {
	struct stat sb;
	if (fstat(fd, &sb)) {
		fprintf(stderr, "fstat: %m\n");
		return 1;
	}
	n_blocks = sb.st_size / pagesize;
	return 0;
}

int main(int argc, char *argv[])
{
	char *file, *mapfile;
	unsigned int seed;
	struct io_uring ring;
	struct io_uring_cqe *cqe;
	int fd, i;
	ssize_t ret;
	size_t total = 0;

	if (argc < 2 || argc > 3) {
		fprintf(stderr, "Use: %s <file> [<seed>]\n", argv[0]);
		return 1;
	}
	file = argv[1];
	if (argc == 3) {
		seed = atol(argv[2]);
	} else {
		getrandom(&seed, sizeof(seed), 0);
	}
	printf("random seed %u\n", seed);
	srand(seed);
	pagesize = sysconf(_SC_PAGE_SIZE);
	if (asprintf(&mapfile, "%s.map", file) < 0) {
		fprintf(stderr, "asprintf map %d\n", errno);
		return 1;
	}

	fd = open(file, O_RDONLY | O_DIRECT);
	if (fd == -1) {
		fprintf(stderr,
				"Failed to open file '%s': %s (errno %d)\n",
				file, strerror(errno), errno);
		return 1;
	}
	if (getsize(fd))
		return 1;

	for (i = 0 ; i < QUEUE_SIZE; i++) {
		bufsize[i] = (rand() % 1024) + 1;
		ret = posix_memalign((void**)&buffers[i], pagesize, bufsize[i] * pagesize);
		if (ret) {
			fprintf(stderr, "Failed to allocate read buffer\n");
			return 1;
		}
	}


	printf("Starting io_uring reads...\n");


	ret = io_uring_queue_init(QUEUE_SIZE, &ring, 0);
	if (ret != 0) {
		fprintf(stderr, "Failed to create io_uring queue\n");
		return 1;
	}


	for (i = 0 ; i < QUEUE_SIZE; i++) {
		iovec[i].iov_base = buffers[i];
		iovec[i].iov_len = bufsize[i] * pagesize;
		if (submit_read(&ring, fd, i))
			return 1;
	}

	while (total++ < 10000000) {
		if (total % 1000 == 0)
			printf("%zu\n", total);
		ret = io_uring_wait_cqe(&ring, &cqe);
		if (ret < 0) {
			fprintf(stderr, "Failed at io_uring_wait_cqe()\n");
			return 1;
		}
		i = (intptr_t)io_uring_cqe_get_data(cqe);
		if (cqe->res < 0) {
			fprintf(stderr, "bad read result for io %d, offset %ld: %d\n",
				i, offsets[i] * pagesize, cqe->res);
			breakme();
			return 1;
		}
		/* a short read completes successfully with fewer bytes than requested */
		if (cqe->res != bufsize[i] * pagesize) {
			fprintf(stderr, "bad read result for io %d, offset %ld: %d should be %ld\n",
				i, offsets[i] * pagesize, cqe->res, bufsize[i] * pagesize);
			breakme();
			return 1;
		}
		io_uring_cqe_seen(&ring, cqe);

		// resubmit
		if (submit_read(&ring, fd, i))
			return 1;
	}
	io_uring_queue_exit(&ring);

	return 0;
}

