* read corruption with qemu master io_uring engine / linux master / btrfs(?)
@ 2022-06-28  9:08 Dominique MARTINET
  2022-06-28 19:03 ` Nikolay Borisov
  0 siblings, 1 reply; 4+ messages in thread
From: Dominique MARTINET @ 2022-06-28  9:08 UTC (permalink / raw)
To: io-uring, linux-btrfs

I don't have any good reproducer so it's a bit difficult to specify;
let's start with what I have...

I've got this one VM which has various segfaults all over the place when
started with aio=io_uring for its disk, as follows:

qemu-system-x86_64 -drive file=qemu/atde-test,if=none,id=hd0,format=raw,cache=none,aio=io_uring \
    -device virtio-blk-pci,drive=hd0 -m 8G -smp 4 -serial mon:stdio -enable-kvm

It also happens with virtio-scsi:

-device virtio-scsi-pci,id=scsihw0 \
-drive file=qemu/atde-test,if=none,id=drive-scsi0,format=raw,cache=none,aio=io_uring \
-device scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100

It also happened when the disk I was using was a qcow file backed by a
vmdk image (this VM's original disk is for vmware), so while I assume
the qemu read path and the qemu-img convert path are similar, I'll
assume the image format doesn't matter at this point...
It has happened with two such images, but I haven't been able to
reproduce with any other VMs yet.

I can also reproduce this on a second host machine with a completely
different ssd (WD sata in one vs. samsung nvme), so it's probably not a
firmware bug. scrub sees no problem with my filesystems on the host.

I've confirmed it happens with at least debian testing's 5.16.0-4-amd64
and 5.17.0-1-amd64 kernels, as well as 5.19.0-rc4. It also happens with
both debian's qemu 7.0.0 and the master branch (v7.0.0-2031-g40d522490714).

These factors aside, anything else I tried changing made this bug no
longer reproduce:
- I'm not sure what the rule is, but it sometimes doesn't happen when
  running the VM twice in a row, and sometimes it happens again. Making
  a fresh copy of my source image with `cp --reflink=always` seems to
  make it reproduce reliably.
- it stops happening without io_uring
- it stops happening if I copy the disk image with --reflink=never
- it stops happening if I copy the disk image to another btrfs
  partition, created in the same lv, so something about my partition
  history matters?... (I have ssd > GPT partitions > luks > lvm > btrfs
  with a single disk, metadata DUP, data single)
- I was unable to reproduce on xfs (with a reflink copy) either, but I
  was also only able to try on a new fs...
- I've never been able to reproduce on other VMs

If you'd like to give it a try, my reproducer source image is:
---
curl -O https://download.atmark-techno.com/atde/atde9-amd64-20220624.tar.xz
tar xf atde9-amd64-20220624.tar.xz
qemu-img convert -O raw atde9-amd64-20220624/atde9-amd64.vmdk atde-test
cp --reflink=always atde-test atde-test2
---
and I'm using 'atde-test'. For further attempts I've removed atde-test
and copied it back from atde-test2 with cp --reflink=always.

This VM's graphical output is borked, but ssh listens, so something like
`-netdev user,id=net0,hostfwd=tcp::2227-:22 -device virtio-net-pci,netdev=net0`
and 'ssh -p 2227 -l atmark localhost' should allow login with password
'atmark', or you can change vt on the console (root password 'root').
I also had similar problems with atde9-amd64-20211201.tar.xz.

When reproducing, I've had either segfaults in the initrd and complete
boot failures, or boot working and login failing but ssh working without
a login shell (ssh ... -tt localhost sh), which allowed me to dump the
content of a couple of corrupted files. When I looked:
- /usr/lib/x86_64-linux-gnu/libc-2.31.so had zeroes instead of data from
  offset 0xb6000 to 0xb7fff; the rest of the file was identical.
- /usr/bin/dmesg had garbage from 0x05000 until 0x149d8 (end of file). I
  was lucky and could match the garbage quickly: it is identical to the
  content from 0x1000-0x109d8 within the disk itself.

I've rebooted a few times and it looks like the corruption is identical
every time for this machine, as long as I keep using the same source
file; running qemu-img convert again seems to change things a bit, but
whatever it is that is specific to these files is stable, even through
host reboots.

I'm sorry I haven't been able to make a better reproducer; I'll keep
trying a bit more tomorrow, but maybe someone has an idea with what I've
had so far :/

Perhaps at this point it might be simpler to just take qemu out of the
equation and issue many parallel reads to different offsets
(overlapping?) of a large file in a similar way to qemu's io_uring
engine, then check their contents?

Thanks, and I'll probably follow up a bit tomorrow even if no-one has
any idea, but even ideas of where to look would be appreciated.
-- 
Dominique

^ permalink raw reply	[flat|nested] 4+ messages in thread
* Re: read corruption with qemu master io_uring engine / linux master / btrfs(?)
  2022-06-28  9:08 read corruption with qemu master io_uring engine / linux master / btrfs(?) Dominique MARTINET
@ 2022-06-28 19:03 ` Nikolay Borisov
  2022-06-29  0:35   ` Dominique MARTINET
  0 siblings, 1 reply; 4+ messages in thread
From: Nikolay Borisov @ 2022-06-28 19:03 UTC (permalink / raw)
To: Dominique MARTINET, io-uring, linux-btrfs

On 28.06.22 г. 12:08 ч., Dominique MARTINET wrote:
> I don't have any good reproducer so it's a bit difficult to specify,
> let's start with what I have...
>
> I've got this one VM which has various segfaults all over the place when
> starting it with aio=io_uring for its disk as follow:
>
> qemu-system-x86_64 -drive file=qemu/atde-test,if=none,id=hd0,format=raw,cache=none,aio=io_uring \
>     -device virtio-blk-pci,drive=hd0 -m 8G -smp 4 -serial mon:stdio -enable-kvm

So cache=none means O_DIRECT, and this is using io_uring. This really
sounds similar to:

ca93e44bfb5fd7996b76f0f544999171f647f93b

This commit got merged into v5.17, so you shouldn't be seeing it on 5.17
and onwards.

<snip>

>
> Perhaps at this point it might be simpler to just try to take qemu out
> of the equation and issue many parallel reads to different offsets
> (overlapping?) of a large file in a similar way qemu io_uring engine
> does and check their contents?

Care to run the sample program in the aforementioned commit and verify
it's not failing?

>
> Thanks, and I'll probably follow up a bit tomorrow even if no-one has
> any idea, but even ideas of where to look would be appreciated.
* Re: read corruption with qemu master io_uring engine / linux master / btrfs(?)
  2022-06-28 19:03 ` Nikolay Borisov
@ 2022-06-29  0:35   ` Dominique MARTINET
  2022-06-29  5:14     ` Dominique MARTINET
  0 siblings, 1 reply; 4+ messages in thread
From: Dominique MARTINET @ 2022-06-29  0:35 UTC (permalink / raw)
To: Nikolay Borisov, Jens Axboe; +Cc: io-uring, linux-btrfs

Thanks for the replies.

Nikolay Borisov wrote on Tue, Jun 28, 2022 at 10:03:20PM +0300:
> > qemu-system-x86_64 -drive file=qemu/atde-test,if=none,id=hd0,format=raw,cache=none,aio=io_uring \
> >     -device virtio-blk-pci,drive=hd0 -m 8G -smp 4 -serial mon:stdio -enable-kvm
>
> So cache=none means O_DIRECT and using io_uring. This really sounds similar
> to:
>
> ca93e44bfb5fd7996b76f0f544999171f647f93b

That looks close, yes...

> This commit got merged into v5.17 so you shouldn't be seeing it on 5.17 and
> onwards.
>
> <snip>
>
> > Perhaps at this point it might be simpler to just try to take qemu out
> > of the equation and issue many parallel reads to different offsets
> > (overlapping?) of a large file in a similar way qemu io_uring engine
> > does and check their contents?
>
> Care to run the sample program in the aforementioned commit and verify it's
> not failing

But unfortunately it seems like that one is properly fixed on my machines:
---
io_uring read result for file foo:

  cqe->res == 8192 (expected 8192)
  memcmp(read_buf, write_buf) == 0 (expected 0)
---

Nikolay Borisov wrote on Tue, Jun 28, 2022 at 10:05:39PM +0300:
> Alternatively change cache=none (O_DIRECT) to cache=writeback (ordinary
> buffered writeback path) that way we'll know if it's related to the
> iomap-based O_DIRECT code in btrfs.

Good idea; I can confirm this doesn't reproduce without cache=none, so
O_DIRECT probably is another requirement here (I say probably because I
haven't been able to reproduce on a freshly created fs either, so
failing to reproduce in a few tries is no guarantee...)
Jens Axboe wrote on Tue, Jun 28, 2022 at 01:12:54PM -0600:
> Not sure what's going on here, but I use qemu with io_uring many times
> each day and haven't seen anything odd. This is on ext4 and xfs however,
> I haven't used btrfs as the backing file system. I wonder if we can boil
> this down into a test case and try and figure out what is going on here.

Yes, I'd say it's fs specific: I've not been able to reproduce on ext4
or xfs -- but then again I couldn't reproduce with btrfs on a new
filesystem either, so there probably are some other conditions :/

I also agree that writing a simple program like the io_uring test in the
above commit, one that would sort of do it like qemu does and compare
contents, would be ideal.
I'll have a stab at this today.
-- 
Dominique
* Re: read corruption with qemu master io_uring engine / linux master / btrfs(?)
  2022-06-29  0:35   ` Dominique MARTINET
@ 2022-06-29  5:14     ` Dominique MARTINET
  0 siblings, 0 replies; 4+ messages in thread
From: Dominique MARTINET @ 2022-06-29  5:14 UTC (permalink / raw)
To: Nikolay Borisov, Jens Axboe; +Cc: io-uring, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 2008 bytes --]

Dominique MARTINET wrote on Wed, Jun 29, 2022 at 09:35:44AM +0900:
> I also agree writing a simple program like the io_uring test in the
> above commit that'd sort of do it like qemu and compare contents would
> be ideal.
> I'll have a stab at this today.

Okay, after half a day failing to reproduce this I had a closer look at
qemu and... it's a qemu bug.
Well, there probably are two bugs, but one should be benign:

- qemu's short read handling was... rather disappointing. The patch
  should appear here[1] eventually, but as the list seems to be
  moderated I'm reposting it here:

-----
diff --git a/block/io_uring.c b/block/io_uring.c
index d48e472e74cb..d58aff9615ce 100644
--- a/block/io_uring.c
+++ b/block/io_uring.c
@@ -103,7 +103,7 @@ static void luring_resubmit_short_read(LuringState *s, LuringAIOCB *luringcb,
                       remaining);
 
     /* Update sqe */
-    luringcb->sqeq.off = nread;
+    luringcb->sqeq.off += nread;
     luringcb->sqeq.addr = (__u64)(uintptr_t)luringcb->resubmit_qiov.iov;
     luringcb->sqeq.len = luringcb->resubmit_qiov.niov;
-----

  (basically "just" a typo, but this path must never have been tested!)

  [1] https://lore.kernel.org/qemu-devel/[email protected]

- comments there also say short reads should never happen on newer
  kernels (assuming local filesystems?) -- how true is that? If we're
  doing our best kernel side to avoid short reads, I guess we probably
  ought to have a look at this.
It can easily be reproduced with a simple io_uring program -- see the
example attached, which eventually fails with the following error on
btrfs:
  bad read result for io 8, offset 792227840: 266240 should be 1466368
but doesn't fail on tmpfs or without O_DIRECT.

Feel free to butcher it; it's already a quickly hacked-down version of
my original test that had hash computation etc., so the flow might feel
a bit weird.
Just compile with `gcc -o shortreads uring_shortreads.c -luring` and run
with the file to read as argument.

Thanks!
-- 
Dominique

[-- Attachment #2: uring_shortreads.c --]
[-- Type: text/x-csrc, Size: 3359 bytes --]

/* Get O_DIRECT */
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <string.h>
#include <liburing.h>
#include <sys/random.h>
#include <sys/stat.h>

long pagesize;
size_t n_blocks;
#define QUEUE_SIZE 10
char *buffers[QUEUE_SIZE];
int bufsize[QUEUE_SIZE];
struct iovec iovec[QUEUE_SIZE];
long int offsets[QUEUE_SIZE];

/* empty function to set a breakpoint on when a check fails */
void breakme(void)
{
}

int submit_read(struct io_uring *ring, int fd, int i)
{
	struct io_uring_sqe *sqe;
	int ret;

	sqe = io_uring_get_sqe(ring);
	if (!sqe) {
		fprintf(stderr, "Failed to get io_uring sqe\n");
		return 1;
	}
	/* half the time reuse the previous offset to get overlapping reads */
	if (i == 0 || rand() % 2 == 0 || offsets[i-1] > n_blocks - bufsize[i]) {
		offsets[i] = rand() % (n_blocks - bufsize[i] + 1);
	} else {
		offsets[i] = offsets[i - 1];
	}
	io_uring_prep_readv(sqe, fd, iovec + i, 1, offsets[i] * pagesize);
	io_uring_sqe_set_data(sqe, (void*)(uintptr_t)i);
	ret = io_uring_submit(ring);
	if (ret != 1) {
		fprintf(stderr, "submit failed\n");
		return 1;
	}
	return 0;
}

int getsize(int fd)
{
	struct stat sb;

	if (fstat(fd, &sb)) {
		fprintf(stderr, "fstat: %m\n");
		return 1;
	}
	n_blocks = sb.st_size / pagesize;
	return 0;
}

int main(int argc, char *argv[])
{
	char *file, *mapfile;
	unsigned int seed;
	struct io_uring ring;
	struct io_uring_cqe *cqe;
	int fd, i;
	ssize_t ret;
	size_t total = 0;

	if (argc < 2 || argc > 3) {
		fprintf(stderr, "Use: %s <file> [<seed>]\n", argv[0]);
		return 1;
	}
	file = argv[1];
	if (argc == 3) {
		seed = atol(argv[2]);
	} else {
		getrandom(&seed, sizeof(seed), 0);
	}
	printf("random seed %u\n", seed);
	srand(seed);
	pagesize = sysconf(_SC_PAGE_SIZE);
	/* leftover from the original test's map-file handling */
	if (asprintf(&mapfile, "%s.map", file) < 0) {
		fprintf(stderr, "asprintf map %d\n", errno);
		return 1;
	}
	fd = open(file, O_RDONLY | O_DIRECT);
	if (fd == -1) {
		fprintf(stderr, "Failed to open file '%s': %s (errno %d)\n",
			file, strerror(errno), errno);
		return 1;
	}
	if (getsize(fd))
		return 1;

	for (i = 0 ; i < QUEUE_SIZE; i++) {
		bufsize[i] = (rand() % 1024) + 1;
		ret = posix_memalign((void**)&buffers[i], pagesize,
				     bufsize[i] * pagesize);
		if (ret) {
			fprintf(stderr, "Failed to allocate read buffer\n");
			return 1;
		}
	}

	printf("Starting io_uring reads...\n");
	ret = io_uring_queue_init(QUEUE_SIZE, &ring, 0);
	if (ret != 0) {
		fprintf(stderr, "Failed to create io_uring queue\n");
		return 1;
	}

	for (i = 0 ; i < QUEUE_SIZE; i++) {
		iovec[i].iov_base = buffers[i];
		iovec[i].iov_len = bufsize[i] * pagesize;
		if (submit_read(&ring, fd, i))
			return 1;
	}

	while (total++ < 10000000) {
		if (total % 1000 == 0)
			printf("%zd\n", total);
		ret = io_uring_wait_cqe(&ring, &cqe);
		if (ret < 0) {
			fprintf(stderr, "Failed at io_uring_wait_cqe()\n");
			return 1;
		}
		i = (intptr_t)io_uring_cqe_get_data(cqe);
		if (cqe->res < 0) {
			fprintf(stderr,
				"bad read result for io %d, offset %zd: %d\n",
				i, offsets[i] * pagesize, cqe->res);
			breakme();
			return 1;
		}
		if (cqe->res != bufsize[i] * pagesize) {
			fprintf(stderr,
				"bad read result for io %d, offset %zd: %d should be %zd\n",
				i, offsets[i] * pagesize, cqe->res,
				bufsize[i] * pagesize);
			breakme();
			return 1;
		}
		io_uring_cqe_seen(&ring, cqe);

		/* resubmit */
		if (submit_read(&ring, fd, i))
			return 1;
	}

	io_uring_queue_exit(&ring);

	return 0;
}
[parent not found: <[email protected]>]
end of thread, other threads:[~2022-06-29 5:14 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-28 9:08 read corruption with qemu master io_uring engine / linux master / btrfs(?) Dominique MARTINET
2022-06-28 19:03 ` Nikolay Borisov
2022-06-29 0:35 ` Dominique MARTINET
2022-06-29 5:14 ` Dominique MARTINET
[not found] <[email protected]>