* read corruption with qemu master io_uring engine / linux master / btrfs(?)
@ 2022-06-28 9:08 Dominique MARTINET
2022-06-28 19:03 ` Nikolay Borisov
` (2 more replies)
0 siblings, 3 replies; 19+ messages in thread
From: Dominique MARTINET @ 2022-06-28 9:08 UTC (permalink / raw)
To: io-uring, linux-btrfs
I don't have any good reproducer so it's a bit difficult to specify,
let's start with what I have...
I've got this one VM which has various segfaults all over the place when
starting it with aio=io_uring for its disk as follows:
qemu-system-x86_64 -drive file=qemu/atde-test,if=none,id=hd0,format=raw,cache=none,aio=io_uring \
-device virtio-blk-pci,drive=hd0 -m 8G -smp 4 -serial mon:stdio -enable-kvm
It also happens with virtio-scsi-blk:
-device virtio-scsi-pci,id=scsihw0 \
-drive file=qemu/atde-test,if=none,id=drive-scsi0,format=raw,cache=none,aio=io_uring \
-device scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100
It also happened when the disk I was using was a qcow file backed by a
vmdk image (this VM's original disk is for vmware); since I assume the
qemu read path and the qemu-img convert path are similar, I'll assume
the image format doesn't matter at this point...
It's happened with two such images, but I haven't been able to reproduce
with any other VMs yet.
I can also reproduce this on a second host machine with a completely
different ssd (WD sata in one vs. samsung nvme), so probably not a
firmware bug.
scrub sees no problem with my filesystems on the host.
I've confirmed it happens with at least debian testing's 5.16.0-4-amd64
and 5.17.0-1-amd64 kernels, as well as 5.19.0-rc4.
It also happens with both debian's 7.0.0 and the master branch
(v7.0.0-2031-g40d522490714)
These factors aside, anything else I tried changing made this bug no
longer reproduce:
- I'm not sure what the rule is, but it sometimes doesn't happen when
running the VM twice in a row and sometimes it happens again. Making a
fresh copy of my source image with `cp --reflink=always` seems to
reproduce it reliably.
- it stops happening without io_uring
- it stops happening if I copy the disk image with --reflink=never
- it stops happening if I copy the disk image to another btrfs
partition, created in the same lv, so something about my partition
history matters?...
(I have ssd > GPT partitions > luks > lvm > btrfs with a single disk as
metadata DUP data single)
- I was unable to reproduce on xfs (with a reflink copy) either but I
also was only able to try on a new fs...
- I've never been able to reproduce on other VMs
If you'd like to give it a try, my reproducer source image is
---
curl -O https://download.atmark-techno.com/atde/atde9-amd64-20220624.tar.xz
tar xf atde9-amd64-20220624.tar.xz
qemu-img convert -O raw atde9-amd64-20220624/atde9-amd64.vmdk atde-test
cp --reflink=always atde-test atde-test2
---
and using 'atde-test'.
For further attempts I've removed atde-test and copied back from
atde-test2 with cp --reflink=always.
This VM graphical output is borked, but ssh listens so something like
`-netdev user,id=net0,hostfwd=tcp::2227-:22 -device virtio-net-pci,netdev=net0`
and 'ssh -p 2227 -l atmark localhost' should allow login with password
'atmark' or you can change vt on the console (root password 'root')
I also had similar problems with atde9-amd64-20211201.tar.xz .
When reproducing I've had either segfaults in the initrd and complete
boot failures, or the boot working but logins failing, with ssh still
working without a login shell (ssh ... -tt localhost sh), which allowed
me to dump the content of a couple of corrupted files.
When I looked:
- /usr/lib/x86_64-linux-gnu/libc-2.31.so had zeroes instead of data from
offset 0xb6000 to 0xb7fff; rest of file was identical.
- /usr/bin/dmesg had garbage from 0x05000 until 0x149d8 (end of file).
I was lucky and could match the garbage quickly: it is identical to the
content from 0x1000-0x109d8 within the disk itself.
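A comparison along these lines confirms such a match (a sketch rather
than the exact commands used; it assumes the guest's corrupted
/usr/bin/dmesg was copied out as dmesg.corrupted next to the disk image):
---
# compare the garbage tail of dmesg with the matching range of the image
cmp <(dd if=dmesg.corrupted bs=1 skip=$((0x05000)) count=$((0x149d8 - 0x5000)) 2>/dev/null) \
    <(dd if=atde-test bs=1 skip=$((0x01000)) count=$((0x149d8 - 0x5000)) 2>/dev/null) \
  && echo identical
---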
I've rebooted a few times and it looks like the corruption is identical
every time for this machine as long as I keep using the same source file;
running from qemu-img convert again seems to change things a bit?
but whatever it is that is specific to these files is stable, even
through host reboots.
I'm sorry I haven't been able to make a better reproducer, I'll keep
trying a bit more tomorrow but maybe someone has an idea with what I've
had so far :/
Perhaps at this point it might be simpler to just try to take qemu out
of the equation and issue many parallel reads to different offsets
(overlapping?) of a large file in a similar way qemu io_uring engine
does and check their contents?
Thanks, and I'll probably follow up a bit tomorrow even if no-one has
any idea, but even ideas of where to look would be appreciated.
--
Dominique
* Re: read corruption with qemu master io_uring engine / linux master / btrfs(?)
2022-06-28 9:08 read corruption with qemu master io_uring engine / linux master / btrfs(?) Dominique MARTINET
@ 2022-06-28 19:03 ` Nikolay Borisov
2022-06-29 0:35 ` Dominique MARTINET
2022-06-28 19:05 ` Nikolay Borisov
2022-06-28 19:12 ` Jens Axboe
2 siblings, 1 reply; 19+ messages in thread
From: Nikolay Borisov @ 2022-06-28 19:03 UTC (permalink / raw)
To: Dominique MARTINET, io-uring, linux-btrfs
On 28.06.22 г. 12:08 ч., Dominique MARTINET wrote:
> I don't have any good reproducer so it's a bit difficult to specify,
> let's start with what I have...
>
> I've got this one VM which has various segfaults all over the place when
> starting it with aio=io_uring for its disk as follows:
>
> qemu-system-x86_64 -drive file=qemu/atde-test,if=none,id=hd0,format=raw,cache=none,aio=io_uring \
> -device virtio-blk-pci,drive=hd0 -m 8G -smp 4 -serial mon:stdio -enable-kvm
So cache=none means O_DIRECT and using io_uring. This really sounds
similar to:
ca93e44bfb5fd7996b76f0f544999171f647f93b
This commit got merged into v5.17 so you shouldn't be seeing it on 5.17
and onwards.
<snip>
>
> Perhaps at this point it might be simpler to just try to take qemu out
> of the equation and issue many parallel reads to different offsets
> (overlapping?) of a large file in a similar way qemu io_uring engine
> does and check their contents?
Care to run the sample program in the aforementioned commit and verify
it's not failing?
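For reference, that test is roughly of the following shape (a minimal
sketch, not the exact program from the commit message; the file name and
sizes here are only an approximation):
-----
/* Write an 8K O_DIRECT file as two separate extents, then read it back
 * in a single io_uring request and compare with what was written.
 * On a buggy kernel the read came back short and/or the buffer did not
 * match; on a fixed kernel both checks pass. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    struct io_uring ring;
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    struct iovec iov;
    char *wbuf, *rbuf;
    int fd;

    if (posix_memalign((void **)&wbuf, 4096, 8192) ||
        posix_memalign((void **)&rbuf, 4096, 8192))
        return 1;
    memset(wbuf, 0xab, 4096);
    memset(wbuf + 4096, 0xcd, 4096);

    fd = open("foo", O_CREAT | O_TRUNC | O_RDWR | O_DIRECT, 0666);
    if (fd < 0)
        return 1;
    /* two writes with an fsync in between to (hopefully) get two extents */
    if (pwrite(fd, wbuf, 4096, 0) != 4096 || fsync(fd) ||
        pwrite(fd, wbuf + 4096, 4096, 4096) != 4096 || fsync(fd))
        return 1;

    if (io_uring_queue_init(1, &ring, 0))
        return 1;
    sqe = io_uring_get_sqe(&ring);
    if (!sqe)
        return 1;
    iov.iov_base = rbuf;
    iov.iov_len = 8192;
    io_uring_prep_readv(sqe, fd, &iov, 1, 0);
    io_uring_submit(&ring);
    if (io_uring_wait_cqe(&ring, &cqe))
        return 1;

    printf("io_uring read result for file foo:\n");
    printf("  cqe->res == %d (expected 8192)\n", cqe->res);
    printf("  memcmp(read_buf, write_buf) == %d (expected 0)\n",
           memcmp(rbuf, wbuf, 8192));
    io_uring_cqe_seen(&ring, cqe);
    io_uring_queue_exit(&ring);
    return 0;
}
-----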
>
>
> Thanks, and I'll probably follow up a bit tomorrow even if no-one has
> any idea, but even ideas of where to look would be appreciated.
* Re: read corruption with qemu master io_uring engine / linux master / btrfs(?)
2022-06-28 9:08 read corruption with qemu master io_uring engine / linux master / btrfs(?) Dominique MARTINET
2022-06-28 19:03 ` Nikolay Borisov
@ 2022-06-28 19:05 ` Nikolay Borisov
2022-06-28 19:12 ` Jens Axboe
2 siblings, 0 replies; 19+ messages in thread
From: Nikolay Borisov @ 2022-06-28 19:05 UTC (permalink / raw)
To: Dominique MARTINET, io-uring, linux-btrfs
On 28.06.22 г. 12:08 ч., Dominique MARTINET wrote:
> I don't have any good reproducer so it's a bit difficult to specify,
> let's start with what I have...
>
> I've got this one VM which has various segfaults all over the place when
> starting it with aio=io_uring for its disk as follows:
>
> qemu-system-x86_64 -drive file=qemu/atde-test,if=none,id=hd0,format=raw,cache=none,aio=io_uring \
> -device virtio-blk-pci,drive=hd0 -m 8G -smp 4 -serial mon:stdio -enable-kvm
>
> It also happens with virtio-scsi-blk:
> -device virtio-scsi-pci,id=scsihw0 \
> -drive file=qemu/atde-test,if=none,id=drive-scsi0,format=raw,cache=none,aio=io_uring \
> -device scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100
>
Alternatively, change cache=none (O_DIRECT) to cache=writeback (the
ordinary buffered writeback path); that way we'll know whether it's
related to the iomap-based O_DIRECT code in btrfs.
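For reference, that would be the same invocation as before with only the
cache mode changed, e.g.:
---
qemu-system-x86_64 -drive file=qemu/atde-test,if=none,id=hd0,format=raw,cache=writeback,aio=io_uring \
    -device virtio-blk-pci,drive=hd0 -m 8G -smp 4 -serial mon:stdio -enable-kvm
---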
* Re: read corruption with qemu master io_uring engine / linux master / btrfs(?)
2022-06-28 9:08 read corruption with qemu master io_uring engine / linux master / btrfs(?) Dominique MARTINET
2022-06-28 19:03 ` Nikolay Borisov
2022-06-28 19:05 ` Nikolay Borisov
@ 2022-06-28 19:12 ` Jens Axboe
2 siblings, 0 replies; 19+ messages in thread
From: Jens Axboe @ 2022-06-28 19:12 UTC (permalink / raw)
To: Dominique MARTINET, io-uring, linux-btrfs
On 6/28/22 3:08 AM, Dominique MARTINET wrote:
> I don't have any good reproducer so it's a bit difficult to specify,
> let's start with what I have...
>
> I've got this one VM which has various segfaults all over the place when
> starting it with aio=io_uring for its disk as follows:
>
> qemu-system-x86_64 -drive file=qemu/atde-test,if=none,id=hd0,format=raw,cache=none,aio=io_uring \
> -device virtio-blk-pci,drive=hd0 -m 8G -smp 4 -serial mon:stdio -enable-kvm
>
> It also happens with virtio-scsi-blk:
> -device virtio-scsi-pci,id=scsihw0 \
> -drive file=qemu/atde-test,if=none,id=drive-scsi0,format=raw,cache=none,aio=io_uring \
> -device scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100
>
> It also happened when the disk I was using was a qcow file backed by a
> vmdk image (this VM's original disk is for vmware); since I assume the
> qemu read path and the qemu-img convert path are similar, I'll assume
> the image format doesn't matter at this point...
> It's happened with two such images, but I haven't been able to reproduce
> with any other VMs yet.
>
> I can also reproduce this on a second host machine with a completely
> different ssd (WD sata in one vs. samsung nvme), so probably not a
> firmware bug.
>
> scrub sees no problem with my filesystems on the host.
>
> I've confirmed it happens with at least debian testing's 5.16.0-4-amd64
> and 5.17.0-1-amd64 kernels, as well as 5.19.0-rc4.
> It also happens with both debian's 7.0.0 and the master branch
> (v7.0.0-2031-g40d522490714)
>
>
> These factors aside, anything else I tried changing made this bug no
> longer reproduce:
> - I'm not sure what the rule is, but it sometimes doesn't happen when
> running the VM twice in a row and sometimes it happens again. Making a
> fresh copy of my source image with `cp --reflink=always` seems to
> reproduce it reliably.
> - it stops happening without io_uring
> - it stops happening if I copy the disk image with --reflink=never
> - it stops happening if I copy the disk image to another btrfs
> partition, created in the same lv, so something about my partition
> history matters?...
> (I have ssd > GPT partitions > luks > lvm > btrfs with a single disk as
> metadata DUP data single)
> - I was unable to reproduce on xfs (with a reflink copy) either but I
> also was only able to try on a new fs...
> - I've never been able to reproduce on other VMs
>
>
> If you'd like to give it a try, my reproducer source image is
> ---
> curl -O https://download.atmark-techno.com/atde/atde9-amd64-20220624.tar.xz
> tar xf atde9-amd64-20220624.tar.xz
> qemu-img convert -O raw atde9-amd64-20220624/atde9-amd64.vmdk atde-test
> cp --reflink=always atde-test atde-test2
> ---
> and using 'atde-test'.
> For further attempts I've removed atde-test and copied back from
> atde-test2 with cp --reflink=always.
> This VM graphical output is borked, but ssh listens so something like
> `-netdev user,id=net0,hostfwd=tcp::2227-:22 -device virtio-net-pci,netdev=net0`
> and 'ssh -p 2227 -l atmark localhost' should allow login with password
> 'atmark' or you can change vt on the console (root password 'root')
>
> I also had similar problems with atde9-amd64-20211201.tar.xz .
>
>
> When reproducing I've had either segfaults in the initrd and complete
> boot failures, or the boot working but logins failing, with ssh still
> working without a login shell (ssh ... -tt localhost sh), which allowed
> me to dump the content of a couple of corrupted files.
> When I looked:
> - /usr/lib/x86_64-linux-gnu/libc-2.31.so had zeroes instead of data from
> offset 0xb6000 to 0xb7fff; rest of file was identical.
> - /usr/bin/dmesg had garbage from 0x05000 until 0x149d8 (end of file).
> I was lucky and could match the garbage quickly: it is identical to the
> content from 0x1000-0x109d8 within the disk itself.
>
> I've rebooted a few times and it looks like the corruption is identical
> every time for this machine as long as I keep using the same source file;
> running from qemu-img convert again seems to change things a bit?
> but whatever it is that is specific to these files is stable, even
> through host reboots.
>
>
>
> I'm sorry I haven't been able to make a better reproducer, I'll keep
> trying a bit more tomorrow but maybe someone has an idea with what I've
> had so far :/
>
> Perhaps at this point it might be simpler to just try to take qemu out
> of the equation and issue many parallel reads to different offsets
> (overlapping?) of a large file in a similar way qemu io_uring engine
> does and check their contents?
>
>
> Thanks, and I'll probably follow up a bit tomorrow even if no-one has
> any idea, but even ideas of where to look would be appreciated.
Not sure what's going on here, but I use qemu with io_uring many times
each day and haven't seen anything odd. This is on ext4 and xfs, however;
I haven't used btrfs as the backing file system. I wonder if we can boil
this down into a test case and try to figure out what is going on here.
--
Jens Axboe
* Re: read corruption with qemu master io_uring engine / linux master / btrfs(?)
2022-06-28 19:03 ` Nikolay Borisov
@ 2022-06-29 0:35 ` Dominique MARTINET
2022-06-29 5:14 ` Dominique MARTINET
0 siblings, 1 reply; 19+ messages in thread
From: Dominique MARTINET @ 2022-06-29 0:35 UTC (permalink / raw)
To: Nikolay Borisov, Jens Axboe; +Cc: io-uring, linux-btrfs
Thanks for the replies.
Nikolay Borisov wrote on Tue, Jun 28, 2022 at 10:03:20PM +0300:
> > qemu-system-x86_64 -drive file=qemu/atde-test,if=none,id=hd0,format=raw,cache=none,aio=io_uring \
> > -device virtio-blk-pci,drive=hd0 -m 8G -smp 4 -serial mon:stdio -enable-kvm
>
> So cache=none means O_DIRECT and using io_uring. This really sounds similar
> to:
>
> ca93e44bfb5fd7996b76f0f544999171f647f93b
That looks close, yes...
> This commit got merged into v5.17 so you shouldn't be seeing it on 5.17 and
> onwards.
>
> <snip>
>
> >
> > Perhaps at this point it might be simpler to just try to take qemu out
> > of the equation and issue many parallel reads to different offsets
> > (overlapping?) of a large file in a similar way qemu io_uring engine
> > does and check their contents?
>
> Care to run the sample program in the aforementioned commit and verify it's
> not failing?
But unfortunately it seems like it is properly fixed on my machines:
---
io_uring read result for file foo:
cqe->res == 8192 (expected 8192)
memcmp(read_buf, write_buf) == 0 (expected 0)
---
Nikolay Borisov wrote on Tue, Jun 28, 2022 at 10:05:39PM +0300:
> Alternatively, change cache=none (O_DIRECT) to cache=writeback (the
> ordinary buffered writeback path); that way we'll know whether it's
> related to the iomap-based O_DIRECT code in btrfs.
Good idea; I can confirm this doesn't reproduce without cache=none, so
O_DIRECT is probably another requirement here (though I haven't been
able to reproduce on a freshly created fs either, so not being able to
reproduce it in a few tries is no guarantee...)
Jens Axboe wrote on Tue, Jun 28, 2022 at 01:12:54PM -0600:
> Not sure what's going on here, but I use qemu with io_uring many times
> each day and haven't seen anything odd. This is on ext4 and xfs, however;
> I haven't used btrfs as the backing file system. I wonder if we can boil
> this down into a test case and try to figure out what is going on here.
Yes I'd say it's fs specific, I've not been able to reproduce on ext4 or
xfs -- but then again I couldn't reproduce with btrfs on a new
filesystem so there probably are some other conditions :/
I also agree writing a simple program like the io_uring test in the
above commit that'd sort of do it like qemu and compare contents would
be ideal.
I'll have a stab at this today.
--
Dominique
* Re: read corruption with qemu master io_uring engine / linux master / btrfs(?)
2022-06-29 0:35 ` Dominique MARTINET
@ 2022-06-29 5:14 ` Dominique MARTINET
2022-06-29 15:37 ` Filipe Manana
0 siblings, 1 reply; 19+ messages in thread
From: Dominique MARTINET @ 2022-06-29 5:14 UTC (permalink / raw)
To: Nikolay Borisov, Jens Axboe; +Cc: io-uring, linux-btrfs
[-- Attachment #1: Type: text/plain, Size: 2008 bytes --]
Dominique MARTINET wrote on Wed, Jun 29, 2022 at 09:35:44AM +0900:
> I also agree writing a simple program like the io_uring test in the
> above commit that'd sort of do it like qemu and compare contents would
> be ideal.
> I'll have a stab at this today.
Okay, after half a day failing to reproduce I had a closer look at qemu
and... it's a qemu bug.
Well, there probably are two bugs, but one should be benign:
- qemu short read handling was... rather disappointing.
Patch should appear here[1] eventually, but as it seems moderated?
I'm reposting it here:
-----
diff --git a/block/io_uring.c b/block/io_uring.c
index d48e472e74cb..d58aff9615ce 100644
--- a/block/io_uring.c
+++ b/block/io_uring.c
@@ -103,7 +103,7 @@ static void luring_resubmit_short_read(LuringState *s, LuringAIOCB *luringcb,
remaining);
/* Update sqe */
- luringcb->sqeq.off = nread;
+ luringcb->sqeq.off += nread;
luringcb->sqeq.addr = (__u64)(uintptr_t)luringcb->resubmit_qiov.iov;
luringcb->sqeq.len = luringcb->resubmit_qiov.niov;
-----
(basically "just" a typo, but that must have never been tested!)
[1] https://lore.kernel.org/qemu-devel/[email protected]
- comments there also say short reads should never happen on newer
kernels (assuming local filesystems?) -- how true is that? If we're
doing our best kernel side to avoid short reads I guess we probably
ought to have a look at this.
It can easily be reproduced with a simple io_uring program -- see
example attached that eventually fails with the following error on
btrfs:
bad read result for io 8, offset 792227840: 266240 should be 1466368
but doesn't fail on tmpfs or without O_DIRECT.
Feel free to butcher it; it's a quickly hacked-down version of my
original test that had hash computation etc., so the flow might feel a
bit weird.
Just compile with `gcc -o shortreads uring_shortreads.c -luring` and run
it with the file to read as argument.
Thanks!
--
Dominique
[-- Attachment #2: uring_shortreads.c --]
[-- Type: text/x-csrc, Size: 3359 bytes --]
/* Get O_DIRECT */
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <string.h>
#include <liburing.h>
#include <sys/random.h>
#include <sys/stat.h>
long pagesize;
size_t n_blocks;
#define QUEUE_SIZE 10
char *buffers[QUEUE_SIZE];
int bufsize[QUEUE_SIZE];
struct iovec iovec[QUEUE_SIZE];
long int offsets[QUEUE_SIZE];
void breakme(void) {
}
int submit_read(struct io_uring *ring, int fd, int i) {
struct io_uring_sqe *sqe;
int ret;
sqe = io_uring_get_sqe(ring);
if (!sqe) {
fprintf(stderr, "Failed to get io_uring sqe\n");
return 1;
}
if (i == 0 || rand() % 2 == 0 || offsets[i-1] > n_blocks - bufsize[i]) {
offsets[i] = rand() % (n_blocks - bufsize[i] + 1);
} else {
offsets[i] = offsets[i - 1];
}
io_uring_prep_readv(sqe, fd, iovec + i, 1, offsets[i] * pagesize);
io_uring_sqe_set_data(sqe, (void*)(uintptr_t)i);
ret = io_uring_submit(ring);
if (ret != 1) {
fprintf(stderr, "submit failed\n");
return 1;
}
return 0;
}
int getsize(int fd) {
struct stat sb;
if (fstat(fd, &sb)) {
fprintf(stderr, "fstat: %m\n");
return 1;
}
n_blocks = sb.st_size / pagesize;
return 0;
}
int main(int argc, char *argv[])
{
char *file, *mapfile;
unsigned int seed;
struct io_uring ring;
struct io_uring_cqe *cqe;
int fd, i;
ssize_t ret;
size_t total = 0;
if (argc < 2 || argc > 3) {
fprintf(stderr, "Use: %s <file> [<seed>]\n", argv[0]);
return 1;
}
file = argv[1];
if (argc == 3) {
seed = atol(argv[2]);
} else {
getrandom(&seed, sizeof(seed), 0);
}
printf("random seed %u\n", seed);
srand(seed);
pagesize = sysconf(_SC_PAGE_SIZE);
if (asprintf(&mapfile, "%s.map", file) < 0) {
fprintf(stderr, "asprintf map %d\n", errno);
return 1;
}
fd = open(file, O_RDONLY | O_DIRECT);
if (fd == -1) {
fprintf(stderr,
"Failed to open file '%s': %s (errno %d)\n",
file, strerror(errno), errno);
return 1;
}
if (getsize(fd))
return 1;
for (i = 0 ; i < QUEUE_SIZE; i++) {
bufsize[i] = (rand() % 1024) + 1;
ret = posix_memalign((void**)&buffers[i], pagesize, bufsize[i] * pagesize);
if (ret) {
fprintf(stderr, "Failed to allocate read buffer\n");
return 1;
}
}
printf("Starting io_uring reads...\n");
ret = io_uring_queue_init(QUEUE_SIZE, &ring, 0);
if (ret != 0) {
fprintf(stderr, "Failed to create io_uring queue\n");
return 1;
}
for (i = 0 ; i < QUEUE_SIZE; i++) {
iovec[i].iov_base = buffers[i];
iovec[i].iov_len = bufsize[i] * pagesize;
if (submit_read(&ring, fd, i))
return 1;
}
while (total++ < 10000000) {
if (total % 1000 == 0)
printf("%zd\n", total);
ret = io_uring_wait_cqe(&ring, &cqe);
if (ret < 0) {
fprintf(stderr, "Failed at io_uring_wait_cqe()\n");
return 1;
}
i = (intptr_t)io_uring_cqe_get_data(cqe);
if (cqe->res < 0) {
fprintf(stderr, "bad read result for io %d, offset %zd: %d\n",
i, offsets[i] * pagesize, cqe->res);
breakme();
return 1;
}
if (cqe->res != bufsize[i] * pagesize) {
fprintf(stderr, "bad read result for io %d, offset %zd: %d should be %zd\n",
i, offsets[i] * pagesize, cqe->res, bufsize[i] * pagesize);
breakme();
return 1;
}
io_uring_cqe_seen(&ring, cqe);
// resubmit
if (submit_read(&ring, fd, i))
return 1;
}
io_uring_queue_exit(&ring);
return 0;
}
* Re: read corruption with qemu master io_uring engine / linux master / btrfs(?)
2022-06-29 5:14 ` Dominique MARTINET
@ 2022-06-29 15:37 ` Filipe Manana
2022-06-30 0:41 ` Dominique MARTINET
0 siblings, 1 reply; 19+ messages in thread
From: Filipe Manana @ 2022-06-29 15:37 UTC (permalink / raw)
To: Dominique MARTINET; +Cc: Nikolay Borisov, Jens Axboe, io-uring, linux-btrfs
On Wed, Jun 29, 2022 at 02:14:18PM +0900, Dominique MARTINET wrote:
> Dominique MARTINET wrote on Wed, Jun 29, 2022 at 09:35:44AM +0900:
> > I also agree writing a simple program like the io_uring test in the
> > above commit that'd sort of do it like qemu and compare contents would
> > be ideal.
> > I'll have a stab at this today.
>
> Okay, after half a day failing to reproduce I had a closer look at qemu
> and... it's a qemu bug.
>
> Well, there probably are two bugs, but one should be benign:
>
> - qemu short read handling was... rather disappointing.
> Patch should appear here[1] eventually, but as it seems moderated?
> I'm reposting it here:
> -----
> diff --git a/block/io_uring.c b/block/io_uring.c
> index d48e472e74cb..d58aff9615ce 100644
> --- a/block/io_uring.c
> +++ b/block/io_uring.c
> @@ -103,7 +103,7 @@ static void luring_resubmit_short_read(LuringState *s, LuringAIOCB *luringcb,
> remaining);
>
> /* Update sqe */
> - luringcb->sqeq.off = nread;
> + luringcb->sqeq.off += nread;
> luringcb->sqeq.addr = (__u64)(uintptr_t)luringcb->resubmit_qiov.iov;
> luringcb->sqeq.len = luringcb->resubmit_qiov.niov;
>
> -----
> (basically "just" a typo, but that must have never been tested!)
> [1] https://lore.kernel.org/qemu-devel/[email protected]
Btw, the link doesn't work (at least at the moment):
"Message-ID <[email protected]> not found"
>
>
> - comments there also say short reads should never happen on newer
> kernels (assuming local filesystems?) -- how true is that? If we're
> doing our best kernel side to avoid short reads I guess we probably
> ought to have a look at this.
Short reads can happen, and an application should deal with it.
If we look at the man page for read(2):
"
On success, the number of bytes read is returned (zero indicates
end of file), and the file position is advanced by this number.
It is not an error if this number is smaller than the number of
bytes requested; this may happen for example because fewer bytes
are actually available right now (maybe because we were close to
end-of-file, or because we are reading from a pipe, or from a
terminal), or because read() was interrupted by a signal. See
also NOTES.
"
pread(2) refers to read(2)'s documentation about short reads as well.
I don't think reading with io_uring is an exception; I'm not aware of
any rule that forbids short reads from happening (even if the offset
and length don't cross the EOF boundary).
As mentioned in the commit pointed out before, we recently had a similar
report with MariaDB, which wasn't dealing with short reads properly
and got fixed shortly after:
https://jira.mariadb.org/browse/MDEV-27900?focusedCommentId=216582&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-216582
In fact, not dealing with short reads at all is not that uncommon
in applications. In that particular case we could avoid doing the
short read in btrfs by returning -EAGAIN and making io_uring use
a blocking context to do a blocking direct IO read.
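As an illustration of what dealing with it means for a plain pread()
consumer (a minimal sketch, not code from any of the projects discussed
here), the usual pattern is to loop until the requested range is
complete; an io_uring user does the same thing by resubmitting the
remainder at the advanced offset, which is what the qemu fix above does:
-----
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Read exactly `count` bytes at `offset`, retrying on short reads.
 * Returns the number of bytes read (less than count only at EOF),
 * or -1 on error. */
ssize_t read_full(int fd, void *buf, size_t count, off_t offset)
{
    size_t done = 0;

    while (done < count) {
        ssize_t ret = pread(fd, (char *)buf + done, count - done,
                            offset + done);
        if (ret < 0) {
            if (errno == EINTR)
                continue;   /* interrupted, just retry */
            return -1;      /* real error */
        }
        if (ret == 0)
            break;          /* EOF: return what we got */
        done += ret;        /* short read: advance and retry */
    }
    return done;
}
-----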
>
> It can easily be reproduced with a simple io_uring program -- see
> example attached that eventually fails with the following error on
> btrfs:
> bad read result for io 8, offset 792227840: 266240 should be 1466368
>
> but doesn't fail on tmpfs or without O_DIRECT.
> Feel free to butcher it; it's a quickly hacked-down version of my
> original test that had hash computation etc., so the flow might feel a
> bit weird.
> Just compile with `gcc -o shortreads uring_shortreads.c -luring` and run
> it with the file to read as argument.
I just tried your program, against the qemu/vmdk image you mentioned in the
first message, and after over an hour running I couldn't trigger any short
reads - this was on the integration misc-next branch.
It's possible that to trigger the issue, one needs a particular file extent
layout, which will not be the same as yours after downloading and converting
the file.
Are you able to apply kernel patches and test? If so I may provide you with
a patch to try and see if it fixes the problem for you.
Thanks.
>
>
> Thanks!
> --
> Dominique
> /* Get O_DIRECT */
> #ifndef _GNU_SOURCE
> #define _GNU_SOURCE
> #endif
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <fcntl.h>
> #include <errno.h>
> #include <string.h>
> #include <liburing.h>
> #include <sys/random.h>
> #include <sys/stat.h>
>
> long pagesize;
> size_t n_blocks;
> #define QUEUE_SIZE 10
> char *buffers[QUEUE_SIZE];
> int bufsize[QUEUE_SIZE];
> struct iovec iovec[QUEUE_SIZE];
> long int offsets[QUEUE_SIZE];
>
> void breakme(void) {
> }
>
> int submit_read(struct io_uring *ring, int fd, int i) {
> struct io_uring_sqe *sqe;
> int ret;
>
> sqe = io_uring_get_sqe(ring);
> if (!sqe) {
> fprintf(stderr, "Failed to get io_uring sqe\n");
> return 1;
> }
> if (i == 0 || rand() % 2 == 0 || offsets[i-1] > n_blocks - bufsize[i]) {
> offsets[i] = rand() % (n_blocks - bufsize[i] + 1);
> } else {
> offsets[i] = offsets[i - 1];
> }
> io_uring_prep_readv(sqe, fd, iovec + i, 1, offsets[i] * pagesize);
> io_uring_sqe_set_data(sqe, (void*)(uintptr_t)i);
> ret = io_uring_submit(ring);
> if (ret != 1) {
> fprintf(stderr, "submit failed\n");
> return 1;
> }
> return 0;
> }
>
> int getsize(int fd) {
> struct stat sb;
> if (fstat(fd, &sb)) {
> fprintf(stderr, "fstat: %m\n");
> return 1;
> }
> n_blocks = sb.st_size / pagesize;
> return 0;
> }
>
> int main(int argc, char *argv[])
> {
> char *file, *mapfile;
> unsigned int seed;
> struct io_uring ring;
> struct io_uring_cqe *cqe;
> int fd, i;
> ssize_t ret;
> size_t total = 0;
>
> if (argc < 2 || argc > 3) {
> fprintf(stderr, "Use: %s <file> [<seed>]\n", argv[0]);
> return 1;
> }
> file = argv[1];
> if (argc == 3) {
> seed = atol(argv[2]);
> } else {
> getrandom(&seed, sizeof(seed), 0);
> }
> printf("random seed %u\n", seed);
> srand(seed);
> pagesize = sysconf(_SC_PAGE_SIZE);
> if (asprintf(&mapfile, "%s.map", file) < 0) {
> fprintf(stderr, "asprintf map %d\n", errno);
> return 1;
> }
>
> fd = open(file, O_RDONLY | O_DIRECT);
> if (fd == -1) {
> fprintf(stderr,
> "Failed to open file '%s': %s (errno %d)\n",
> file, strerror(errno), errno);
> return 1;
> }
> if (getsize(fd))
> return 1;
>
> for (i = 0 ; i < QUEUE_SIZE; i++) {
> bufsize[i] = (rand() % 1024) + 1;
> ret = posix_memalign((void**)&buffers[i], pagesize, bufsize[i] * pagesize);
> if (ret) {
> fprintf(stderr, "Failed to allocate read buffer\n");
> return 1;
> }
> }
>
>
> printf("Starting io_uring reads...\n");
>
>
> ret = io_uring_queue_init(QUEUE_SIZE, &ring, 0);
> if (ret != 0) {
> fprintf(stderr, "Failed to create io_uring queue\n");
> return 1;
> }
>
>
> for (i = 0 ; i < QUEUE_SIZE; i++) {
> iovec[i].iov_base = buffers[i];
> iovec[i].iov_len = bufsize[i] * pagesize;
> if (submit_read(&ring, fd, i))
> return 1;
> }
>
> while (total++ < 10000000) {
> if (total % 1000 == 0)
> printf("%zd\n", total);
> ret = io_uring_wait_cqe(&ring, &cqe);
> if (ret < 0) {
> fprintf(stderr, "Failed at io_uring_wait_cqe()\n");
> return 1;
> }
> i = (intptr_t)io_uring_cqe_get_data(cqe);
> if (cqe->res < 0) {
> fprintf(stderr, "bad read result for io %d, offset %zd: %d\n",
> i, offsets[i] * pagesize, cqe->res);
> breakme();
> return 1;
> }
> if (cqe->res != bufsize[i] * pagesize) {
> fprintf(stderr, "bad read result for io %d, offset %zd: %d should be %zd\n",
> i, offsets[i] * pagesize, cqe->res, bufsize[i] * pagesize);
> breakme();
> return 1;
> }
> io_uring_cqe_seen(&ring, cqe);
>
> // resubmit
> if (submit_read(&ring, fd, i))
> return 1;
> }
> io_uring_queue_exit(&ring);
>
> return 0;
> }
* Re: read corruption with qemu master io_uring engine / linux master / btrfs(?)
2022-06-29 15:37 ` Filipe Manana
@ 2022-06-30 0:41 ` Dominique MARTINET
2022-06-30 7:56 ` Dominique MARTINET
0 siblings, 1 reply; 19+ messages in thread
From: Dominique MARTINET @ 2022-06-30 0:41 UTC (permalink / raw)
To: Filipe Manana; +Cc: Nikolay Borisov, Jens Axboe, io-uring, linux-btrfs
Filipe Manana wrote on Wed, Jun 29, 2022 at 04:37:10PM +0100:
> On Wed, Jun 29, 2022 at 02:14:18PM +0900, Dominique MARTINET wrote:
> > - qemu short read handling was... rather disappointing.
> > Patch should appear here[1] eventually, but as it seems moderated?
> > [1] https://lore.kernel.org/qemu-devel/[email protected]
>
> Btw, the link doesn't work (at least at the moment):
>
> "Message-ID <[email protected]> not found"
Yes, the submitting a patch documentation[1] mentions the lists are
moderated, so it took a bit of time.
It looks like it went through now, and my understanding is further
mails won't be delayed -- but I'll Cc you on v2 after testing it.
[1] https://www.qemu.org/docs/master/devel/submitting-a-patch.html
> > - comments there also say short reads should never happen on newer
> > kernels (assuming local filesystems?) -- how true is that? If we're
> > doing our best kernel side to avoid short reads I guess we probably
> > ought to have a look at this.
>
> Short reads can happen, and an application should deal with it.
I definitely agree with this, qemu must be fixed. I don't think anyone
will argue with that.
> If we look at the man page for read(2):
>
> "
> On success, the number of bytes read is returned (zero indicates
> end of file), and the file position is advanced by this number.
> It is not an error if this number is smaller than the number of
> bytes requested; this may happen for example because fewer bytes
> are actually available right now (maybe because we were close to
> end-of-file, or because we are reading from a pipe, or from a
> terminal), or because read() was interrupted by a signal. See
> also NOTES.
> "
>
> pread(2) refers to read(2)'s documentation about short reads as well.
> I don't think reading with io_uring is an exception; I'm not aware of
> any rule that forbids short reads from happening (even if the offset
> and length don't cross the EOF boundary).
It might be documented, but I was laughed at when I said we (9p) were
allowed to return short reads on a whim:
https://lkml.kernel.org/r/[email protected]
(now that might not be the proudest thing I've allowed for 9p, but it
shows there is some expectation we don't do short reads if we can avoid
it... But yes, that doesn't mean we shouldn't fix broken applications
when we find one)
> As mentioned in the commit pointed out before, we recently had a similar
> report with MariaDB, which wasn't dealing with short reads properly
> and got fixed shortly after:
>
> https://jira.mariadb.org/browse/MDEV-27900?focusedCommentId=216582&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-216582
>
> In fact, not dealing with short reads at all is not that uncommon
> in applications. In that particular case we could avoid doing the
> short read in btrfs by returning -EAGAIN and making io_uring use
> a blocking context to do a blocking direct IO read.
Sounds good to me.
> > It can easily be reproduced with a simple io_uring program -- see
> > example attached that eventually fails with the following error on
> > btrfs:
> > bad read result for io 8, offset 792227840: 266240 should be 1466368
> >
> > but doesn't fail on tmpfs or without O_DIRECT.
> > Feel free to butcher it; it's a quickly hacked-down version of my
> > original test that had hash computation etc., so the flow might feel a
> > bit weird.
> > Just compile with `gcc -o shortreads uring_shortreads.c -luring` and run
> > it with the file to read as argument.
>
> I just tried your program, against the qemu/vmdk image you mentioned in the
> first message, and after over an hour running I couldn't trigger any short
> reads - this was on the integration misc-next branch.
>
> It's possible that to trigger the issue, one needs a particular file extent
> layout, which will not be the same as yours after downloading and converting
> the file.
Ugh. I've also been unable to reproduce on a test fs, despite filling it
with small files and removing some to artificially fragment the image,
so I guess I really do have something on these "normal" filesystems...
Is there a way to artificially try to recreate weird layouts?
I've also tried btrfs send|receive, but while it did preserve reflinked
extents it didn't seem to do the trick.
> Are you able to apply kernel patches and test? If so I may provide you with
> a patch to try and see if it fixes the problem for you.
Yes, no problem with that; I'm not deleting that file until we've seen
the end of it and will be happy to test anything :)
--
Dominique
* Re: read corruption with qemu master io_uring engine / linux master / btrfs(?)
2022-06-30 0:41 ` Dominique MARTINET
@ 2022-06-30 7:56 ` Dominique MARTINET
2022-06-30 10:45 ` Filipe Manana
0 siblings, 1 reply; 19+ messages in thread
From: Dominique MARTINET @ 2022-06-30 7:56 UTC (permalink / raw)
To: Filipe Manana; +Cc: Nikolay Borisov, Jens Axboe, io-uring, linux-btrfs
Dominique MARTINET wrote on Thu, Jun 30, 2022 at 09:41:01AM +0900:
> > I just tried your program, against the qemu/vmdk image you mentioned in the
> > first message, and after over an hour running I couldn't trigger any short
> > reads - this was on the integration misc-next branch.
> >
> > It's possible that to trigger the issue, one needs a particular file extent
> > layout, which will not be the same as yours after downloading and converting
> > the file.
>
> Ugh. I've also been unable to reproduce on a test fs, despite filling it
> with small files and removing some to artificially fragment the image,
> so I guess I really do have something on these "normal" filesystems...
>
> Is there a way to artificially try to recreate weird layouts?
> I've also tried btrfs send|receive, but while it did preserve reflinked
> extents it didn't seem to do the trick.
I take that one back: I was able to reproduce with my filesystem riddled
with holes.
I was just looking at another distantly related problem that happened
with cp, but trying with busybox cat didn't reproduce it and I got
confused:
https://lore.kernel.org/linux-btrfs/[email protected]/T/#u
Anyway, here's a pretty ugly reproducer to create a file that made short
reads on a brand new FS:
# 50GB FS -> fill with 50GB of small files and remove 1/10
$ mkdir /mnt/d.{00..50}
$ for i in {00000..49999}; do
dd if=/dev/urandom of=/mnt/d.${i:0:2}/test.$i bs=1M count=1 status=none;
done
$ rm -f /mnt/d.*/*2
$ btrfs subvolume create ~/sendme
$ cp --reflink=always bigfile ~/sendme/bigfile
$ btrfs property set ~/sendme ro true
$ btrfs send ~/sendme | btrfs receive /mnt/receive
and /mnt/receive/bigfile did the trick for me.
This probably didn't need the send/receive at all, I just didn't try
plain copy again.
Anyway, happy to test any patch as said earlier, it's probably not worth
spending too much time on trying to reproduce on your end at this
point...
--
Dominique
* Re: read corruption with qemu master io_uring engine / linux master / btrfs(?)
2022-06-30 7:56 ` Dominique MARTINET
@ 2022-06-30 10:45 ` Filipe Manana
2022-06-30 11:09 ` Filipe Manana
2022-06-30 11:27 ` Dominique MARTINET
0 siblings, 2 replies; 19+ messages in thread
From: Filipe Manana @ 2022-06-30 10:45 UTC (permalink / raw)
To: Dominique MARTINET; +Cc: Nikolay Borisov, Jens Axboe, io-uring, linux-btrfs
On Thu, Jun 30, 2022 at 04:56:37PM +0900, Dominique MARTINET wrote:
> Dominique MARTINET wrote on Thu, Jun 30, 2022 at 09:41:01AM +0900:
> > > I just tried your program, against the qemu/vmdk image you mentioned in the
> > > first message, and after over an hour running I couldn't trigger any short
> > > reads - this was on the integration misc-next branch.
> > >
> > > It's possible that to trigger the issue, one needs a particular file extent
> > > layout, which will not be the same as yours after downloading and converting
> > > the file.
> >
> > Ugh. I've also been unable to reproduce on a test fs, despite filling it
> > with small files and removing some to artificially fragment the image,
> > so I guess I really do have something on these "normal" filesystems...
> >
> > Is there a way to artificially try to recreate weird layouts?
> > I've also tried btrfs send|receive, but while it did preserve reflinked
> > extents it didn't seem to do the trick.
>
> I take that one back: I was able to reproduce with my filesystem riddled
> with holes.
> I was just looking at another distantly related problem that happened
> with cp, but trying with busybox cat didn't reproduce it and I got
> confused:
> https://lore.kernel.org/linux-btrfs/[email protected]/T/#u
>
>
> Anyway, here's a pretty ugly reproducer to create a file that made short
> reads on a brand new FS:
>
> # 50GB FS -> fill with 50GB of small files and remove 1/10
> $ mkdir /mnt/d.{00..50}
> $ for i in {00000..49999}; do
> dd if=/dev/urandom of=/mnt/d.${i:0:2}/test.$i bs=1M count=1 status=none;
> done
> $ rm -f /mnt/d.*/*2
> $ btrfs subvolume create ~/sendme
> $ cp --reflink=always bigfile ~/sendme/bigfile
> $ btrfs property set ~/sendme ro true
> $ btrfs send ~/sendme | btrfs receive /mnt/receive
>
> and /mnt/receive/bigfile did the trick for me.
> This probably didn't need the send/receive at all, I just didn't try
> plain copy again.
>
> Anyway, happy to test any patch as said earlier, it's probably not worth
> spending too much time on trying to reproduce on your end at this
> point...
That's perfect.
So here's a patch for you to try:
https://gist.githubusercontent.com/fdmanana/4b24d6b30983e956bb1784a44873c5dd/raw/572490b127071bf827c3bc05dd58dcb7bcff373a/dio.patch
Thanks!
>
> --
> Dominique
* Re: read corruption with qemu master io_uring engine / linux master / btrfs(?)
2022-06-30 10:45 ` Filipe Manana
@ 2022-06-30 11:09 ` Filipe Manana
2022-06-30 11:27 ` Dominique MARTINET
1 sibling, 0 replies; 19+ messages in thread
From: Filipe Manana @ 2022-06-30 11:09 UTC (permalink / raw)
To: Dominique MARTINET; +Cc: Nikolay Borisov, Jens Axboe, io-uring, linux-btrfs
On Thu, Jun 30, 2022 at 11:45:36AM +0100, Filipe Manana wrote:
> On Thu, Jun 30, 2022 at 04:56:37PM +0900, Dominique MARTINET wrote:
> > Dominique MARTINET wrote on Thu, Jun 30, 2022 at 09:41:01AM +0900:
> > > > I just tried your program, against the qemu/vmdk image you mentioned in the
> > > > first message, and after over an hour running I couldn't trigger any short
> > > > reads - this was on the integration misc-next branch.
> > > >
> > > > It's possible that to trigger the issue, one needs a particular file extent
> > > > layout, which will not be the same as yours after downloading and converting
> > > > the file.
> > >
> > > Ugh. I've also been unable to reproduce on a test fs, despite filling it
> > > with small files and removing some to artificially fragment the image,
> > > so I guess I really do have something on these "normal" filesystems...
> > >
> > > Is there a way to artificially try to recreate weird layouts?
> > > I've also tried btrfs send|receive, but while it did preserve reflinked
> > > extents it didn't seem to do the trick.
> >
> > I take that one back: I was able to reproduce with my filesystem riddled
> > with holes.
> > I was just looking at another distantly related problem that happened
> > with cp, but trying with busybox cat didn't reproduce it and I got
> > confused:
> > https://lore.kernel.org/linux-btrfs/[email protected]/T/#u
> >
> >
> > Anyway, here's a pretty ugly reproducer to create a file that made short
> > reads on a brand new FS:
> >
> > # 50GB FS -> fill with 50GB of small files and remove 1/10
> > $ mkdir /mnt/d.{00..50}
> > $ for i in {00000..49999}; do
> > dd if=/dev/urandom of=/mnt/d.${i:0:2}/test.$i bs=1M count=1 status=none;
> > done
> > $ rm -f /mnt/d.*/*2
> > $ btrfs subvolume create ~/sendme
> > $ cp --reflink=always bigfile ~/sendme/bigfile
> > $ btrfs property set ~/sendme ro true
> > $ btrfs send ~/sendme | btrfs receive /mnt/receive
> >
> > and /mnt/receive/bigfile did the trick for me.
> > This probably didn't need the send/receive at all, I just didn't try
> > plain copy again.
> >
> > Anyway, happy to test any patch as said earlier, it's probably not worth
> > spending too much time on trying to reproduce on your end at this
> > point...
>
> That's perfect.
>
> So here's a patch for you to try:
>
> https://gist.githubusercontent.com/fdmanana/4b24d6b30983e956bb1784a44873c5dd/raw/572490b127071bf827c3bc05dd58dcb7bcff373a/dio.patch
Actually it's this URL:
https://gist.githubusercontent.com/fdmanana/4b24d6b30983e956bb1784a44873c5dd/raw/0dad2dd3fd14df735f166c2c416dc9265d660493/dio.patch
Thanks.
>
> Thanks!
> >
> > --
> > Dominique
* Re: read corruption with qemu master io_uring engine / linux master / btrfs(?)
2022-06-30 10:45 ` Filipe Manana
2022-06-30 11:09 ` Filipe Manana
@ 2022-06-30 11:27 ` Dominique MARTINET
2022-06-30 12:51 ` Filipe Manana
1 sibling, 1 reply; 19+ messages in thread
From: Dominique MARTINET @ 2022-06-30 11:27 UTC (permalink / raw)
To: Filipe Manana; +Cc: Nikolay Borisov, Jens Axboe, io-uring, linux-btrfs
Filipe Manana wrote on Thu, Jun 30, 2022 at 11:45:36AM +0100:
> So here's a patch for you to try:
>
> https://gist.githubusercontent.com/fdmanana/4b24d6b30983e956bb1784a44873c5dd/raw/572490b127071bf827c3bc05dd58dcb7bcff373a/dio.patch
Thanks.
Unfortunately I still hit short reads with this; I can't really tell if
there are more or fewer than before (the parallelism of my reproducer
means that even dropping caches and restarting with the same seed I get
a different offset for the short read), but it looks fairly similar --
it usually happens within the first 1000 operations, sometimes a bit
later, with or without the patch.
I went ahead and added a printk in dio_fault_in_size to see if it was
used and it looks like it is, but that doesn't really tell if it is the
reason for short reads (hm, thinking back I could just have printed the
offsets...):
----
/ # /mnt/repro /mnt/t/t/atde-test
random seed 4061910570
Starting io_uring reads...
[ 17.872992] dio_fault_in_size: left 3710976 prev_left 0 size 131072
[ 17.873958] dio_fault_in_size: left 3579904 prev_left 3710976 size 131072
[ 17.874246] dio_fault_in_size: left 1933312 prev_left 0 size 131072
[ 17.875111] dio_fault_in_size: left 3448832 prev_left 3579904 size 131072
[ 17.876446] dio_fault_in_size: left 3317760 prev_left 3448832 size 131072
[ 17.877493] dio_fault_in_size: left 3186688 prev_left 3317760 size 131072
[ 17.878667] dio_fault_in_size: left 3055616 prev_left 3186688 size 131072
[ 17.880001] dio_fault_in_size: left 2924544 prev_left 3055616 size 131072
[ 17.881524] dio_fault_in_size: left 2793472 prev_left 2924544 size 131072
[ 17.882462] dio_fault_in_size: left 2662400 prev_left 2793472 size 131072
[ 17.883433] dio_fault_in_size: left 2531328 prev_left 2662400 size 131072
[ 17.884573] dio_fault_in_size: left 2400256 prev_left 2531328 size 131072
[ 17.886008] dio_fault_in_size: left 2269184 prev_left 2400256 size 131072
[ 17.887058] dio_fault_in_size: left 2138112 prev_left 2269184 size 131072
[ 17.888313] dio_fault_in_size: left 2007040 prev_left 2138112 size 131072
[ 17.889873] dio_fault_in_size: left 1875968 prev_left 2007040 size 131072
[ 17.891041] dio_fault_in_size: left 1744896 prev_left 1875968 size 131072
[ 17.893174] dio_fault_in_size: left 802816 prev_left 1744896 size 131072
[ 17.930249] dio_fault_in_size: left 3325952 prev_left 0 size 131072
[ 17.931472] dio_fault_in_size: left 1699840 prev_left 0 size 131072
[ 17.956509] dio_fault_in_size: left 1699840 prev_left 0 size 131072
[ 17.957522] dio_fault_in_size: left 1888256 prev_left 0 size 131072
bad read result for io 3, offset 4022030336: 176128 should be 1531904
----
(ugh, saw the second patch after writing all this.. but it's the same:
----
/ # /mnt/repro /mnt/t/t/atde-test
random seed 634214270
Starting io_uring reads...
[ 17.858718] dio_fault_in_size: left 1949696 prev_left 0 size 131072
[ 18.193604] dio_fault_in_size: left 1142784 prev_left 0 size 131072
[ 18.218500] dio_fault_in_size: left 528384 prev_left 0 size 131072
[ 18.248184] dio_fault_in_size: left 643072 prev_left 0 size 131072
[ 18.291639] dio_fault_in_size: left 131072 prev_left 0 size 131072
bad read result for io 4, offset 5079498752: 241664 should be 2142208
----
the rest of the mail refers to the first patch's run, as I used the
offset from the first output, but that shouldn't matter)
Given my file has many many extents, my guess would be that short reads
happen when we're crossing an extent boundary.
Using the fiemap[1] command I can confirm that it is the case:
[1] https://github.com/ColinIanKing/fiemap
$ printf "%x\n" $((4022030336 + 176128))
efbe0000
$ fiemap /mnt/t/t/atde-test
File atde-test has 199533 extents:
# Logical Physical Length Flags
...
23205: 00000000efba0000 0000001324f00000 0000000000020000 0008
23206: 00000000efbc0000 00000013222af000 0000000000020000 0008
23207: 00000000efbe0000 00000013222bb000 0000000000020000 0008
but given how many extents there are that doesn't explain why it stopped
at this offset within the file and not another before it: transition
from compressed to non-compressed or something? I didn't find any tool
able to show extent attributes; here's what `btrfs insp dump-tree` has
to say about this physical offset:
$ printf "%d\n" 0x00000013222af000
82177617920
$ printf "%d\n" 0x00000013222bb000
82177667072
$ btrfs insp dump-tree /dev/vg/test
...
leaf 171360256 items 195 free space 29 generation 527 owner EXTENT_TREE
leaf 171360256 flags 0x1(WRITTEN) backref revision 1
checksum stored d9b6566b00000000000000000000000000000000000000000000000000000000
checksum calced d9b6566b00000000000000000000000000000000000000000000000000000000
fs uuid 3f85a731-21b4-4f3d-85b5-f9c45e8493f5
chunk uuid 77575a06-4d6f-4748-a62c-59e6d9221be8
item 0 key (82177576960 EXTENT_ITEM 40960) itemoff 16230 itemsize 53
refs 1 gen 527 flags DATA
extent data backref root 256 objectid 257 offset 4021682176 count 1
item 1 key (82177617920 EXTENT_ITEM 49152) itemoff 16177 itemsize 53
refs 1 gen 527 flags DATA
extent data backref root 256 objectid 257 offset 4022075392 count 1
item 2 key (82177667072 EXTENT_ITEM 36864) itemoff 16124 itemsize 53
refs 1 gen 527 flags DATA
extent data backref root 256 objectid 257 offset 4022206464 count 1
... but that doesn't really help me understand here.
Oh, well, passing you the ball again! :)
Please ask if there's any info I could get you.
--
Dominique
* Re: read corruption with qemu master io_uring engine / linux master / btrfs(?)
2022-06-30 11:27 ` Dominique MARTINET
@ 2022-06-30 12:51 ` Filipe Manana
2022-06-30 13:08 ` Dominique MARTINET
0 siblings, 1 reply; 19+ messages in thread
From: Filipe Manana @ 2022-06-30 12:51 UTC (permalink / raw)
To: Dominique MARTINET; +Cc: Nikolay Borisov, Jens Axboe, io-uring, linux-btrfs
On Thu, Jun 30, 2022 at 08:27:50PM +0900, Dominique MARTINET wrote:
> Filipe Manana wrote on Thu, Jun 30, 2022 at 11:45:36AM +0100:
> > So here's a patch for you to try:
> >
> > https://gist.githubusercontent.com/fdmanana/4b24d6b30983e956bb1784a44873c5dd/raw/572490b127071bf827c3bc05dd58dcb7bcff373a/dio.patch
>
> Thanks.
> Unfortunately I still hit short reads with this; I can't really tell if
> there are more or fewer than before (the parallelism of my reproducer
> means that even dropping caches and restarting with the same seed I get
> a different offset for the short read), but it looks fairly similar --
> it usually happens within the first 1000 operations, sometimes a bit
> later, with or without the patch.
>
> I went ahead and added a printk in dio_fault_in_size to see if it was
> used and it looks like it is, but that doesn't really tell if it is the
> reason for short reads (hm, thinking back I could just have printed the
> offsets...):
> ----
> / # /mnt/repro /mnt/t/t/atde-test
> random seed 4061910570
> Starting io_uring reads...
> [ 17.872992] dio_fault_in_size: left 3710976 prev_left 0 size 131072
> [ 17.873958] dio_fault_in_size: left 3579904 prev_left 3710976 size 131072
> [ 17.874246] dio_fault_in_size: left 1933312 prev_left 0 size 131072
> [ 17.875111] dio_fault_in_size: left 3448832 prev_left 3579904 size 131072
> [ 17.876446] dio_fault_in_size: left 3317760 prev_left 3448832 size 131072
> [ 17.877493] dio_fault_in_size: left 3186688 prev_left 3317760 size 131072
> [ 17.878667] dio_fault_in_size: left 3055616 prev_left 3186688 size 131072
> [ 17.880001] dio_fault_in_size: left 2924544 prev_left 3055616 size 131072
> [ 17.881524] dio_fault_in_size: left 2793472 prev_left 2924544 size 131072
> [ 17.882462] dio_fault_in_size: left 2662400 prev_left 2793472 size 131072
> [ 17.883433] dio_fault_in_size: left 2531328 prev_left 2662400 size 131072
> [ 17.884573] dio_fault_in_size: left 2400256 prev_left 2531328 size 131072
> [ 17.886008] dio_fault_in_size: left 2269184 prev_left 2400256 size 131072
> [ 17.887058] dio_fault_in_size: left 2138112 prev_left 2269184 size 131072
> [ 17.888313] dio_fault_in_size: left 2007040 prev_left 2138112 size 131072
> [ 17.889873] dio_fault_in_size: left 1875968 prev_left 2007040 size 131072
> [ 17.891041] dio_fault_in_size: left 1744896 prev_left 1875968 size 131072
> [ 17.893174] dio_fault_in_size: left 802816 prev_left 1744896 size 131072
> [ 17.930249] dio_fault_in_size: left 3325952 prev_left 0 size 131072
> [ 17.931472] dio_fault_in_size: left 1699840 prev_left 0 size 131072
> [ 17.956509] dio_fault_in_size: left 1699840 prev_left 0 size 131072
> [ 17.957522] dio_fault_in_size: left 1888256 prev_left 0 size 131072
> bad read result for io 3, offset 4022030336: 176128 should be 1531904
> ----
>
> (ugh, saw the second patch after writing all this.. but it's the same:
Yep, it only prevents an infinite loop in rare scenarios (not triggered
by your reproducer).
> ----
> / # /mnt/repro /mnt/t/t/atde-test
> random seed 634214270
> Starting io_uring reads...
> [ 17.858718] dio_fault_in_size: left 1949696 prev_left 0 size 131072
> [ 18.193604] dio_fault_in_size: left 1142784 prev_left 0 size 131072
> [ 18.218500] dio_fault_in_size: left 528384 prev_left 0 size 131072
> [ 18.248184] dio_fault_in_size: left 643072 prev_left 0 size 131072
> [ 18.291639] dio_fault_in_size: left 131072 prev_left 0 size 131072
> bad read result for io 4, offset 5079498752: 241664 should be 2142208
> ----
> the rest of the mail refers to the first patch's run, as I used the
> offset from the first output, but that shouldn't matter)
>
> Given my file has many many extents, my guess would be that short reads
> happen when we're crossing an extent boundary.
>
>
> Using the fiemap[1] command I can confirm that it is the case:
> [1] https://github.com/ColinIanKing/fiemap
>
> $ printf "%x\n" $((4022030336 + 176128))
> efbe0000
> $ fiemap /mnt/t/t/atde-test
> File atde-test has 199533 extents:
> # Logical Physical Length Flags
> ...
> 23205: 00000000efba0000 0000001324f00000 0000000000020000 0008
> 23206: 00000000efbc0000 00000013222af000 0000000000020000 0008
> 23207: 00000000efbe0000 00000013222bb000 0000000000020000 0008
>
> but given how many extents there are that doesn't explain why it stopped
> at this offset within the file and not another before it: transition
> from compressed to non-compressed or something? I didn't find any tool
> able to show extent attributes; here's what `btrfs insp dump-tree` has
> to say about this physical offset:
>
> $ printf "%d\n" 0x00000013222af000
> 82177617920
> $ printf "%d\n" 0x00000013222bb000
> 82177667072
> $ btrfs insp dump-tree /dev/vg/test
> ...
> leaf 171360256 items 195 free space 29 generation 527 owner EXTENT_TREE
> leaf 171360256 flags 0x1(WRITTEN) backref revision 1
> checksum stored d9b6566b00000000000000000000000000000000000000000000000000000000
> checksum calced d9b6566b00000000000000000000000000000000000000000000000000000000
> fs uuid 3f85a731-21b4-4f3d-85b5-f9c45e8493f5
> chunk uuid 77575a06-4d6f-4748-a62c-59e6d9221be8
> item 0 key (82177576960 EXTENT_ITEM 40960) itemoff 16230 itemsize 53
> refs 1 gen 527 flags DATA
> extent data backref root 256 objectid 257 offset 4021682176 count 1
> item 1 key (82177617920 EXTENT_ITEM 49152) itemoff 16177 itemsize 53
> refs 1 gen 527 flags DATA
> extent data backref root 256 objectid 257 offset 4022075392 count 1
> item 2 key (82177667072 EXTENT_ITEM 36864) itemoff 16124 itemsize 53
> refs 1 gen 527 flags DATA
> extent data backref root 256 objectid 257 offset 4022206464 count 1
>
> ... but that doesn't really help me understand here.
>
> Oh, well, passing you the ball again! :)
> Please ask if there's any info I could get you.
Ok, maybe it's page fault related or there's something else besides page faults
involved.
Can you dump the subvolume tree like this:
btrfs inspect-internal dump-tree -t 5 /dev/sda 2>&1 | xz -9 > dump.xz
Here the 5 is the ID of the default subvolume. If the test file is on
a different subvolume, you'll need to replace 5 with the subvolume's ID.
This is just to look at the file extent layout.
Also, tell me the inode number of the file (or just its name, and I'll
find out its inode number), and an example file offset and read length
that triggers a short read, so that I know where to look.
And btw, that dump-tree command will dump all file names, directory names
and xattr names and values (if they are human readable) - so if privacy is
a concern here, just pass --hide-names to the dump-tree command.
Thanks.
> --
> Dominique
* Re: read corruption with qemu master io_uring engine / linux master / btrfs(?)
2022-06-30 12:51 ` Filipe Manana
@ 2022-06-30 13:08 ` Dominique MARTINET
2022-06-30 15:10 ` Filipe Manana
0 siblings, 1 reply; 19+ messages in thread
From: Dominique MARTINET @ 2022-06-30 13:08 UTC (permalink / raw)
To: Filipe Manana; +Cc: Nikolay Borisov, Jens Axboe, io-uring, linux-btrfs
Filipe Manana wrote on Thu, Jun 30, 2022 at 01:51:24PM +0100:
> > Please ask if there's any info I could get you.
>
> Ok, maybe it's page fault related or there's something else besides page faults
> involved.
>
> Can you dump the subvolume tree like this:
>
> btrfs inspect-internal dump-tree -t 5 /dev/sda 2>&1 | xz -9 > dump.xz
>
> Here the 5 is the ID of the default subvolume. If the test file is on
> a different subvolume, you'll need to replace 5 with the subvolume's ID.
Sure thing.
It's 2MB compressed:
https://gaia.codewreck.org/local/tmp/dump-tree.xz
> This is just to look at the file extent layout.
> Also, tell me the inode number of the file (or just its name, and I'll
> find out its inode number), and an example file offset and read length
> that triggers a short read, so that I know where to look.
There's just a single file in that subvolume, inode 257
> And btw, that dump-tree command will dump all file names, directory names
> and xattr names and values (if they are human readable) - so if privacy is
> a concern here, just pass --hide-names to the dump-tree command.
(thanks for the warning)
--
Dominique
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: read corruption with qemu master io_uring engine / linux master / btrfs(?)
2022-06-30 13:08 ` Dominique MARTINET
@ 2022-06-30 15:10 ` Filipe Manana
2022-07-01 1:25 ` Dominique MARTINET
0 siblings, 1 reply; 19+ messages in thread
From: Filipe Manana @ 2022-06-30 15:10 UTC (permalink / raw)
To: Dominique MARTINET; +Cc: Nikolay Borisov, Jens Axboe, io-uring, linux-btrfs
On Thu, Jun 30, 2022 at 10:08:18PM +0900, Dominique MARTINET wrote:
> Filipe Manana wrote on Thu, Jun 30, 2022 at 01:51:24PM +0100:
> > > Please ask if there's any infos I could get you.
> >
> > Ok, maybe it's page fault related or there's something else besides page faults
> > involved.
> >
> > Can you dump the subvolume tree like this:
> >
> > btrfs inspect-internal dump-tree -t 5 /dev/sda 2>&1 | xz -9 > dump.xz
> >
> > Here the 5 is the ID of the default subvolume. If the test file is on
> > a different subvolume, you'll need to replace 5 with the subvolume's ID.
>
> Sure thing.
>
> It's 2MB compressed:
> https://gaia.codewreck.org/local/tmp/dump-tree.xz
Ok, the file has a mix of compressed and non-compressed extents.
This may prevent the short reads (not tested yet):
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 7a54f964ff37..42fb56ed0021 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7684,7 +7684,7 @@ static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start,
if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags) ||
em->block_start == EXTENT_MAP_INLINE) {
free_extent_map(em);
- ret = -ENOTBLK;
+ ret = (flags & IOMAP_NOWAIT) ? -EAGAIN : -ENOTBLK;
goto unlock_err;
}
Can you give it a try?
Thanks.
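For completeness, the short read itself can be observed from userspace with
a single O_DIRECT read submitted through io_uring. A minimal sketch,
assuming liburing; the file name, offset and length below are only
placeholders (the offset is kept 4 KiB aligned for O_DIRECT):
---
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <liburing.h>

int main(void)
{
	const size_t len = 1 << 20;			/* 1 MiB read */
	const unsigned long long off = 4022075392ULL;	/* example offset in the image */
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	void *buf;
	int fd, ret;

	fd = open("atde-test", O_RDONLY | O_DIRECT);
	if (fd < 0 || posix_memalign(&buf, 4096, len))
		return 1;
	if (io_uring_queue_init(8, &ring, 0) < 0)
		return 1;

	/* One direct read spanning multiple extents, submitted via io_uring. */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_read(sqe, fd, buf, len, off);
	io_uring_submit(&ring);

	ret = io_uring_wait_cqe(&ring, &cqe);
	if (ret == 0 && cqe->res >= 0 && (size_t)cqe->res < len)
		printf("short read: %d of %zu bytes\n", cqe->res, len);
	else
		printf("res=%d\n", ret ? ret : cqe->res);

	io_uring_cqe_seen(&ring, cqe);
	io_uring_queue_exit(&ring);
	return 0;
}
---
Without the change above, a read that starts on a non-compressed extent and
runs into a compressed one can complete with cqe->res smaller than the
requested length; with it, the read is expected to come back complete, since
the NOWAIT attempt now returns -EAGAIN and io_uring retries it in a context
where blocking is allowed.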
>
>
> > This is just to look at the file extent layout.
> > Also, tell me the inode number of the file (or just its name, and I'll
> > find out its inode number), and an example file offset and read length
> > that triggers a short read, so that I know where to look.
>
> There's just a single file in that subvolume, inode 257
>
> > And btw, that dump-tree command will dump all file names, directory names
> > and xattr names and values (if they are human readable) - so if privacy is
> > a concern here, just pass --hide-names to the dump-tree command.
>
> (thanks for the warning)
> --
> Dominique
^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: read corruption with qemu master io_uring engine / linux master / btrfs(?)
2022-06-30 15:10 ` Filipe Manana
@ 2022-07-01 1:25 ` Dominique MARTINET
2022-07-01 8:45 ` Filipe Manana
0 siblings, 1 reply; 19+ messages in thread
From: Dominique MARTINET @ 2022-07-01 1:25 UTC (permalink / raw)
To: Filipe Manana; +Cc: Nikolay Borisov, Jens Axboe, io-uring, linux-btrfs
Filipe Manana wrote on Thu, Jun 30, 2022 at 04:10:38PM +0100:
> This may prevent the short reads (not tested yet):
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 7a54f964ff37..42fb56ed0021 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -7684,7 +7684,7 @@ static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start,
> if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags) ||
> em->block_start == EXTENT_MAP_INLINE) {
> free_extent_map(em);
> - ret = -ENOTBLK;
> + ret = (flags & IOMAP_NOWAIT) ? -EAGAIN : -ENOTBLK;
> goto unlock_err;
> }
>
> Can you give it a try?
This appears to do the trick!
I've also removed the first patch and still cannot see any short reads,
so this would be enough on its own for my case.
--
Dominique
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: read corruption with qemu master io_uring engine / linux master / btrfs(?)
2022-07-01 1:25 ` Dominique MARTINET
@ 2022-07-01 8:45 ` Filipe Manana
2022-07-01 10:33 ` Filipe Manana
0 siblings, 1 reply; 19+ messages in thread
From: Filipe Manana @ 2022-07-01 8:45 UTC (permalink / raw)
To: Dominique MARTINET; +Cc: Nikolay Borisov, Jens Axboe, io-uring, linux-btrfs
On Fri, Jul 01, 2022 at 10:25:58AM +0900, Dominique MARTINET wrote:
> Filipe Manana wrote on Thu, Jun 30, 2022 at 04:10:38PM +0100:
> > This may prevent the short reads (not tested yet):
> >
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index 7a54f964ff37..42fb56ed0021 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -7684,7 +7684,7 @@ static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start,
> > if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags) ||
> > em->block_start == EXTENT_MAP_INLINE) {
> > free_extent_map(em);
> > - ret = -ENOTBLK;
> > + ret = (flags & IOMAP_NOWAIT) ? -EAGAIN : -ENOTBLK;
> > goto unlock_err;
> > }
> >
> > Can you give it a try?
>
> This appears to do the trick!
> I've also removed the first patch and still cannot see any short reads,
> so this would be enough on its own for my case.
Great.
After my last reply, I retried your reproducer (and qemu image file) on a fs
mounted with -o compress-force=zstd and was able to trigger the short reads
in less than a minute.
With that patch I can no longer trigger the short reads either (after running
the reproducer for about 2 hours).
I'll give it some more testing here along with other minor fixes I have for
other scenarios. I'll submit a patchset, with this fix included, on Monday.
I'll add your Tested-by tag for this specific patch.
Thanks for all the testing and the reproducer!
>
> --
> Dominique
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: read corruption with qemu master io_uring engine / linux master / btrfs(?)
2022-07-01 8:45 ` Filipe Manana
@ 2022-07-01 10:33 ` Filipe Manana
2022-07-01 23:48 ` Dominique MARTINET
0 siblings, 1 reply; 19+ messages in thread
From: Filipe Manana @ 2022-07-01 10:33 UTC (permalink / raw)
To: Dominique MARTINET; +Cc: Nikolay Borisov, Jens Axboe, io-uring, linux-btrfs
On Fri, Jul 01, 2022 at 09:45:04AM +0100, Filipe Manana wrote:
> On Fri, Jul 01, 2022 at 10:25:58AM +0900, Dominique MARTINET wrote:
> > Filipe Manana wrote on Thu, Jun 30, 2022 at 04:10:38PM +0100:
> > > This may prevent the short reads (not tested yet):
> > >
> > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > > index 7a54f964ff37..42fb56ed0021 100644
> > > --- a/fs/btrfs/inode.c
> > > +++ b/fs/btrfs/inode.c
> > > @@ -7684,7 +7684,7 @@ static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start,
> > > if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags) ||
> > > em->block_start == EXTENT_MAP_INLINE) {
> > > free_extent_map(em);
> > > - ret = -ENOTBLK;
> > > + ret = (flags & IOMAP_NOWAIT) ? -EAGAIN : -ENOTBLK;
> > > goto unlock_err;
> > > }
> > >
> > > Can you give it a try?
> >
> > This appears to do the trick!
> > I've also removed the first patch and still cannot see any short reads,
> > so this would be enough on its own for my case.
>
> Great.
>
> After my last reply, I retried your reproducer (and qemu image file) on a fs
> mounted with -o compress-force=zstd and was able to trigger the short reads
> in less than a minute.
>
> With that patch I can no longer trigger the short reads either (after running
> the reproducer for about 2 hours).
>
> I'll give it some more testing here along with other minor fixes I have for
> other scenarios. I'll submit a patchset, with this fix included, on Monday.
In the meantime, I've left the patches staged here:
https://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux.git/log/?h=dio_fixes
>
> I'll add your Tested-by tag for this specific patch.
>
> Thanks for all the testing and the reproducer!
>
>
> >
> > --
> > Dominique
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: read corruption with qemu master io_uring engine / linux master / btrfs(?)
2022-07-01 10:33 ` Filipe Manana
@ 2022-07-01 23:48 ` Dominique MARTINET
0 siblings, 0 replies; 19+ messages in thread
From: Dominique MARTINET @ 2022-07-01 23:48 UTC (permalink / raw)
To: Filipe Manana; +Cc: Nikolay Borisov, Jens Axboe, io-uring, linux-btrfs
Filipe Manana wrote on Fri, Jul 01, 2022 at 11:33:29AM +0100:
> > I'll give it some more testing here along with other minor fixes I have for
> > other scenarios. I'll submit a patchset, with this fix included, on Monday.
>
> In the meantime, I've left the patches staged here:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux.git/log/?h=dio_fixes
Thanks!
Just a small typo in the commit message of the 'return -EAGAIN for NOWAIT' patch:
> After having read the first extent, when we find the compressed extent we
> eturn -ENOTBLK from btrfs_dio_iomap_begin(), which results in iomap to
the 'return' at the start of the second quoted line is missing its leading 'r'.
The rest looks good to me, thank you for the fixes :)
--
Dominique
^ permalink raw reply [flat|nested] 19+ messages in thread
end of thread
Thread overview: 19+ messages
2022-06-28 9:08 read corruption with qemu master io_uring engine / linux master / btrfs(?) Dominique MARTINET
2022-06-28 19:03 ` Nikolay Borisov
2022-06-29 0:35 ` Dominique MARTINET
2022-06-29 5:14 ` Dominique MARTINET
2022-06-29 15:37 ` Filipe Manana
2022-06-30 0:41 ` Dominique MARTINET
2022-06-30 7:56 ` Dominique MARTINET
2022-06-30 10:45 ` Filipe Manana
2022-06-30 11:09 ` Filipe Manana
2022-06-30 11:27 ` Dominique MARTINET
2022-06-30 12:51 ` Filipe Manana
2022-06-30 13:08 ` Dominique MARTINET
2022-06-30 15:10 ` Filipe Manana
2022-07-01 1:25 ` Dominique MARTINET
2022-07-01 8:45 ` Filipe Manana
2022-07-01 10:33 ` Filipe Manana
2022-07-01 23:48 ` Dominique MARTINET
2022-06-28 19:05 ` Nikolay Borisov
2022-06-28 19:12 ` Jens Axboe