* Possible io_uring related race leads to btrfs data csum mismatch
@ 2023-08-16  6:52 Qu Wenruo
  2023-08-16 14:33 ` Jens Axboe
  0 siblings, 1 reply; 22+ messages in thread
From: Qu Wenruo @ 2023-08-16  6:52 UTC (permalink / raw)
  To: [email protected], Linux FS Devel, io-uring

Hi,

Recently I have been digging into a very rare failure during
btrfs/06[234567], where btrfs scrub detects unrepairable data corruption.

After days of digging, I have a much smaller reproducer:

```
fail()
{
	echo "!!! FAILED !!!"
	exit 1
}

workload()
{
	mkfs.btrfs -f -m single -d single --csum sha256 $dev1
	mount $dev1 $mnt
	# There are around 10 more combinations with different
	# seed and -p/-n parameters, but this is the smallest one
	# I found so far.
	$fsstress -p 7 -n 50 -s 1691396493 -w -d $mnt
	umount $mnt
	btrfs check --check-data-csum $dev1 || fail
}
runtime=1024
for (( i = 0; i < $runtime; i++ )); do
	echo "=== $i / $runtime ==="
	workload
done
```

At least here, with a 6-core VM (the host has 8C/16T) and fast enough
storage (PCIe 4.0 NVMe, with unsafe cache mode), it has around a 1/100
chance of hitting the error.

Checking the fsstress verbose log against the failed file, the offending
write turns out to be an io_uring write.

And with uring_write disabled in fsstress, I can no longer reproduce the
csum mismatch, even with much larger -n and -p parameters.

However, I didn't see any io_uring related callback inside the btrfs code,
so any advice on the io_uring part would be appreciated.

Thanks,
Qu

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: Possible io_uring related race leads to btrfs data csum mismatch
  2023-08-16  6:52 Possible io_uring related race leads to btrfs data csum mismatch Qu Wenruo
@ 2023-08-16 14:33 ` Jens Axboe
  2023-08-16 14:49   ` Jens Axboe
  2023-08-16 21:46   ` Qu Wenruo
  0 siblings, 2 replies; 22+ messages in thread
From: Jens Axboe @ 2023-08-16 14:33 UTC (permalink / raw)
  To: Qu Wenruo, [email protected], Linux FS Devel, io-uring

On 8/16/23 12:52 AM, Qu Wenruo wrote:
> Hi,
> 
> Recently I'm digging into a very rare failure during btrfs/06[234567],
> where btrfs scrub detects unrepairable data corruption.
> 
> After days of digging, I have a much smaller reproducer:
> 
> ```
> fail()
> {
> 	echo "!!! FAILED !!!"
> 	exit 1
> }
> 
> workload()
> {
> 	mkfs.btrfs -f -m single -d single --csum sha256 $dev1
> 	mount $dev1 $mnt
> 	# There are around 10 more combinations with different
> 	# seed and -p/-n parameters, but this is the smallest one
> 	# I found so far.
> 	$fsstress -p 7 -n 50 -s 1691396493 -w -d $mnt
> 	umount $mnt
> 	btrfs check --check-data-csum $dev1 || fail
> }
> runtime=1024
> for (( i = 0; i < $runtime; i++ )); do
> 	echo "=== $i / $runtime ==="
> 	workload
> done
> ```

Tried to reproduce this, both on a vm and on a real host, and no luck so
far. I've got a few followup questions, as your report is missing some
important info:

1) What kernel are you running?
2) What's the .config you are using?

> At least here, with a VM with 6 cores (host has 8C/16T), fast enough
> storage (PCIE4.0 NVME, with unsafe cache mode), it has the chance around
> 1/100 to hit the error.

What does "unsafe cache mode" mean? Is that write back caching enabled?
Write back caching with volatile write cache? For your device, can you
do:

$ grep . /sys/block/$dev/queue/*

> Checking the fsstress verbose log against the failed file, it turns out
> to be an io_uring write.

Any more details on what the write looks like?

> And with uring_write disabled in fsstress, I have no longer reproduced
> the csum mismatch, even with much larger -n and -p parameters.

Is it more likely to reproduce with larger -n/-p in general?

> However I didn't see any io_uring related callback inside btrfs code,
> any advice on the io_uring part would be appreciated.

io_uring doesn't do anything special here, it uses the normal page cache
read/write parts for buffered IO. But you may get extra parallelism
with io_uring here. For example, with the buffered write that this most
likely is, libaio would be exactly the same as a pwrite(2) on the file.
If this would've blocked, io_uring would offload this to a helper
thread. Depending on the workload, you could have multiple of those in
progress at the same time.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 22+ messages in thread
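For readers who want to see what fsstress's uring_write operation roughly
corresponds to, here is a minimal sketch of a buffered write submitted
through io_uring. This is not fsstress's actual code; it assumes liburing,
and the file name, fill byte, and sizes are arbitrary. As Jens notes above,
the kernel services this through the same ->write_iter() path that a plain
pwritev(2) would use; the only io_uring-specific twist is that a submission
which would block may be punted to an io-wq helper thread.

```c
/*
 * Minimal sketch: one buffered write via io_uring (assumes liburing).
 * File name, fill byte, and sizes are arbitrary; error handling trimmed.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/uio.h>
#include <liburing.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	static char buf[64 * 1024];
	struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
	int fd;

	if (io_uring_queue_init(8, &ring, 0) < 0)
		return 1;
	/* No O_DIRECT here: this is a buffered write, like uring_write */
	fd = open("testfile", O_RDWR | O_CREAT, 0644);
	memset(buf, 0xaa, sizeof(buf));

	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_writev(sqe, fd, &iov, 1, 0);
	io_uring_submit(&ring);

	/* cqe->res is the byte count on success, or a negative errno */
	if (io_uring_wait_cqe(&ring, &cqe) == 0) {
		printf("write returned %d\n", cqe->res);
		io_uring_cqe_seen(&ring, cqe);
	}
	io_uring_queue_exit(&ring);
	return 0;
}
```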
* Re: Possible io_uring related race leads to btrfs data csum mismatch
  2023-08-16 14:33 ` Jens Axboe
@ 2023-08-16 14:49   ` Jens Axboe
  2023-08-16 21:46   ` Qu Wenruo
  1 sibling, 0 replies; 22+ messages in thread
From: Jens Axboe @ 2023-08-16 14:49 UTC (permalink / raw)
  To: Qu Wenruo, [email protected], Linux FS Devel, io-uring

On 8/16/23 8:33 AM, Jens Axboe wrote:
>> However I didn't see any io_uring related callback inside btrfs code,
>> any advice on the io_uring part would be appreciated.
> 
> io_uring doesn't do anything special here, it uses the normal page cache
> read/write parts for buffered IO. But you may get extra parallelism
> with io_uring here. For example, with the buffered write that this most
> likely is, libaio would be exactly the same as a pwrite(2) on the file.
> If this would've blocked, io_uring would offload this to a helper
> thread. Depending on the workload, you could have multiple of those in
> progress at the same time.

I poked a bit at fsstress, and it's a bit odd imho. For example, any aio
read/write seems to hardcode O_DIRECT, while the io_uring side will be
buffered. Not sure why those differences exist, or why buffered/dio isn't
a variable. But this does mean that these are certainly buffered writes
with io_uring.

Are any of the writes overlapping? You could have a situation where
writeA and writeB overlap, and writeA gets punted to io-wq for execution
while writeB completes inline. In other words: writeA is issued, writeB
is issued, writeA goes to io-wq, writeB completes inline, and only then
is writeA executed and completed. It may be exposing issues in btrfs.

You can try the below patch, which should serialize all the writes to a
given file. If this makes a difference for you, then I'd strongly suspect
that the issue is deeper than the delivery mechanism of the write.

diff --git a/ltp/fsstress.c b/ltp/fsstress.c
index 6641a525fe5d..034cbba27c6e 100644
--- a/ltp/fsstress.c
+++ b/ltp/fsstress.c
@@ -2317,6 +2317,7 @@ do_uring_rw(opnum_t opno, long r, int flags)
 		off %= maxfsize;
 		memset(buf, nameseq & 0xff, len);
 		io_uring_prep_writev(sqe, fd, &iovec, 1, off);
+		sqe->flags |= IOSQE_ASYNC;
 	} else {
 		off = (off64_t)(lr % stb.st_size);
 		io_uring_prep_readv(sqe, fd, &iovec, 1, off);

-- 
Jens Axboe

^ permalink raw reply related	[flat|nested] 22+ messages in thread
* Re: Possible io_uring related race leads to btrfs data csum mismatch 2023-08-16 14:33 ` Jens Axboe 2023-08-16 14:49 ` Jens Axboe @ 2023-08-16 21:46 ` Qu Wenruo 2023-08-16 22:28 ` Jens Axboe 2023-08-16 22:36 ` Jens Axboe 1 sibling, 2 replies; 22+ messages in thread From: Qu Wenruo @ 2023-08-16 21:46 UTC (permalink / raw) To: Jens Axboe, [email protected], Linux FS Devel, io-uring On 2023/8/16 22:33, Jens Axboe wrote: > On 8/16/23 12:52 AM, Qu Wenruo wrote: >> Hi, >> >> Recently I'm digging into a very rare failure during btrfs/06[234567], >> where btrfs scrub detects unrepairable data corruption. >> >> After days of digging, I have a much smaller reproducer: >> >> ``` >> fail() >> { >> echo "!!! FAILED !!!" >> exit 1 >> } >> >> workload() >> { >> mkfs.btrfs -f -m single -d single --csum sha256 $dev1 >> mount $dev1 $mnt >> # There are around 10 more combinations with different >> # seed and -p/-n parameters, but this is the smallest one >> # I found so far. >> $fsstress -p 7 -n 50 -s 1691396493 -w -d $mnt >> umount $mnt >> btrfs check --check-data-csum $dev1 || fail >> } >> runtime=1024 >> for (( i = 0; i < $runtime; i++ )); do >> echo "=== $i / $runtime ===" >> workload >> done >> ``` > > Tried to reproduce this, both on a vm and on a real host, and no luck so > far. I've got a few followup questions as your report is missing some > important info: You may want to try much higher -p/-n numbers. For verification purpose, I normally go with -p 10 -n 10000, which has a much higher chance to hit, but definitely too noisy for debug. I just tried a run with "$fsstress -p 10 -n 10000 -w -d $mnt" as the workload, it failed at 21/1024. > > 1) What kernel are you running? David's misc-next branch, aka, lastest upstream tags plus some btrfs patches for the next merge window. Although I have some internal reports showing this problem quite some time ago. > 2) What's the .config you are using? Pretty common config, no heavy debug options (KASAN etc). > >> At least here, with a VM with 6 cores (host has 8C/16T), fast enough >> storage (PCIE4.0 NVME, with unsafe cache mode), it has the chance around >> 1/100 to hit the error. > > What does "unsafe cche mode" mean? Libvirt cache option "unsafe" Which is mostly ignoring flush/fua commands and fully rely on host fs (in my case it's file backed) cache. > Is that write back caching enabled? > Write back caching with volatile write cache? For your device, can you > do: > > $ grep . /sys/block/$dev/queue/* > >> Checking the fsstress verbose log against the failed file, it turns out >> to be an io_uring write. > > Any more details on what the write looks like? 
For the involved file, the fsstress log shows the following operations for
the minimal reproducer seed/-p/-n combination:

```
0/24: link d0/f2 d0/f3 0
0/29: fallocate(INSERT_RANGE) d0/f3 [276 2 0 0 176 481971]t 884736 585728 95
0/30: uring_write d0/f3[276 2 0 0 176 481971] [1400622, 56456(res=56456)] 0
0/31: writev d0/f3[276 2 0 0 296 1457078] [709121,8,964] 0
0/34: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[276 2 308134 1763236 320 1457078] return 25, fallback to stat()
0/34: dwrite d0/f3[276 2 308134 1763236 320 1457078] [589824,16384] 0
0/38: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[276 2 308134 1763236 496 1457078] return 25, fallback to stat()
0/38: dwrite d0/f3[276 2 308134 1763236 496 1457078] [2084864,36864] 0
0/40: fallocate(ZERO_RANGE) d0/f3 [276 2 308134 1763236 688 2809139]t 3512660 81075 0
0/43: splice d0/f5[289 1 0 0 1872 2678784] [552619,59420] -> d0/f3[276 2 308134 1763236 856 3593735] [5603798,59420] 0
0/48: fallocate(KEEP_SIZE|PUNCH_HOLE) d0/f3 [276 1 308134 1763236 976 5663218]t 1361821 480392 0
0/49: clonerange d0/f3[276 1 308134 1763236 856 5663218] [2461696,53248] -> d0/f5[289 1 0 0 1872 2678784] [942080,53248]
```

> 
>> And with uring_write disabled in fsstress, I have no longer reproduced
>> the csum mismatch, even with much larger -n and -p parameters.
> 
> Is it more likely to reproduce with larger -n/-p in general?

Yes, but I use that specific combination as the minimal reproducer for
debug purposes.

> 
>> However I didn't see any io_uring related callback inside btrfs code,
>> any advice on the io_uring part would be appreciated.
> 
> io_uring doesn't do anything special here, it uses the normal page cache
> read/write parts for buffered IO. But you may get extra parallelism
> with io_uring here. For example, with the buffered write that this most
> likely is, libaio would be exactly the same as a pwrite(2) on the file.
> If this would've blocked, io_uring would offload this to a helper
> thread. Depending on the workload, you could have multiple of those in
> progress at the same time.

My biggest concern is: would io_uring modify the page while it's still
under writeback?

In that case, it's going to cause a csum mismatch, as btrfs relies on the
pages under writeback staying unchanged.

Thanks,
Qu

> 

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: Possible io_uring related race leads to btrfs data csum mismatch 2023-08-16 21:46 ` Qu Wenruo @ 2023-08-16 22:28 ` Jens Axboe 2023-08-17 1:05 ` Qu Wenruo 2023-08-16 22:36 ` Jens Axboe 1 sibling, 1 reply; 22+ messages in thread From: Jens Axboe @ 2023-08-16 22:28 UTC (permalink / raw) To: Qu Wenruo, [email protected], Linux FS Devel, io-uring On 8/16/23 3:46 PM, Qu Wenruo wrote: > > > On 2023/8/16 22:33, Jens Axboe wrote: >> On 8/16/23 12:52 AM, Qu Wenruo wrote: >>> Hi, >>> >>> Recently I'm digging into a very rare failure during btrfs/06[234567], >>> where btrfs scrub detects unrepairable data corruption. >>> >>> After days of digging, I have a much smaller reproducer: >>> >>> ``` >>> fail() >>> { >>> echo "!!! FAILED !!!" >>> exit 1 >>> } >>> >>> workload() >>> { >>> mkfs.btrfs -f -m single -d single --csum sha256 $dev1 >>> mount $dev1 $mnt >>> # There are around 10 more combinations with different >>> # seed and -p/-n parameters, but this is the smallest one >>> # I found so far. >>> $fsstress -p 7 -n 50 -s 1691396493 -w -d $mnt >>> umount $mnt >>> btrfs check --check-data-csum $dev1 || fail >>> } >>> runtime=1024 >>> for (( i = 0; i < $runtime; i++ )); do >>> echo "=== $i / $runtime ===" >>> workload >>> done >>> ``` >> >> Tried to reproduce this, both on a vm and on a real host, and no luck so >> far. I've got a few followup questions as your report is missing some >> important info: > > You may want to try much higher -p/-n numbers. > > For verification purpose, I normally go with -p 10 -n 10000, which has a > much higher chance to hit, but definitely too noisy for debug. > > I just tried a run with "$fsstress -p 10 -n 10000 -w -d $mnt" as the > workload, it failed at 21/1024. OK I'll try that. >> 1) What kernel are you running? > > David's misc-next branch, aka, lastest upstream tags plus some btrfs > patches for the next merge window. > > Although I have some internal reports showing this problem quite some > time ago. That's what I was getting at, if it was new or not. >> 2) What's the .config you are using? > > Pretty common config, no heavy debug options (KASAN etc). Please just send the .config, I'd rather not have to guess. Things like preempt etc may make a difference in reproducing this. >>> At least here, with a VM with 6 cores (host has 8C/16T), fast enough >>> storage (PCIE4.0 NVME, with unsafe cache mode), it has the chance around >>> 1/100 to hit the error. >> >> What does "unsafe cche mode" mean? > > Libvirt cache option "unsafe" > > Which is mostly ignoring flush/fua commands and fully rely on host fs > (in my case it's file backed) cache. Gotcha >> Is that write back caching enabled? >> Write back caching with volatile write cache? For your device, can you >> do: >> >> $ grep . /sys/block/$dev/queue/* >> >>> Checking the fsstress verbose log against the failed file, it turns out >>> to be an io_uring write. >> >> Any more details on what the write looks like? 
> > For the involved file, it shows the following operations for the minimal > reproducible seed/-p/-n combination: > > ``` > 0/24: link d0/f2 d0/f3 0 > 0/29: fallocate(INSERT_RANGE) d0/f3 [276 2 0 0 176 481971]t 884736 585728 95 > 0/30: uring_write d0/f3[276 2 0 0 176 481971] [1400622, 56456(res=56456)] 0 > 0/31: writev d0/f3[276 2 0 0 296 1457078] [709121,8,964] 0 > 0/34: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[276 2 308134 1763236 320 > 1457078] return 25, fallback to stat() > 0/34: dwrite d0/f3[276 2 308134 1763236 320 1457078] [589824,16384] 0 > 0/38: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[276 2 308134 1763236 496 > 1457078] return 25, fallback to stat() > 0/38: dwrite d0/f3[276 2 308134 1763236 496 1457078] [2084864,36864] 0 > 0/40: fallocate(ZERO_RANGE) d0/f3 [276 2 308134 1763236 688 2809139]t > 3512660 81075 0 > 0/43: splice d0/f5[289 1 0 0 1872 2678784] [552619,59420] -> d0/f3[276 2 > 308134 1763236 856 3593735] [5603798,59420] 0 > 0/48: fallocate(KEEP_SIZE|PUNCH_HOLE) d0/f3 [276 1 308134 1763236 976 > 5663218]t 1361821 480392 0 > 0/49: clonerange d0/f3[276 1 308134 1763236 856 5663218] [2461696,53248] > -> d0/f5[289 1 0 0 1872 2678784] [942080,53248] > ``` And just to be sure, this is not mixing dio and buffered, right? >>> However I didn't see any io_uring related callback inside btrfs code, >>> any advice on the io_uring part would be appreciated. >> >> io_uring doesn't do anything special here, it uses the normal page cache >> read/write parts for buffered IO. But you may get extra parallellism >> with io_uring here. For example, with the buffered write that this most >> likely is, libaio would be exactly the same as a pwrite(2) on the file. >> If this would've blocked, io_uring would offload this to a helper >> thread. Depending on the workload, you could have multiple of those in >> progress at the same time. > > My biggest concern is, would io_uring modify the page when it's still > under writeback? No, of course not. Like I mentioned, io_uring doesn't do anything that the normal read/write path isn't already doing - it's using the same ->read_iter() and ->write_iter() that everything else is, there's no page cache code in io_uring. > In that case, it's going to cause csum mismatch as btrfs relies on the > page under writeback to be unchanged. Sure, I'm aware of the stable page requirements. See my followup email as well on a patch to test as well. -- Jens Axboe ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Possible io_uring related race leads to btrfs data csum mismatch 2023-08-16 22:28 ` Jens Axboe @ 2023-08-17 1:05 ` Qu Wenruo 2023-08-17 1:12 ` Jens Axboe 0 siblings, 1 reply; 22+ messages in thread From: Qu Wenruo @ 2023-08-17 1:05 UTC (permalink / raw) To: Jens Axboe, [email protected], Linux FS Devel, io-uring [-- Attachment #1: Type: text/plain, Size: 4068 bytes --] On 2023/8/17 06:28, Jens Axboe wrote: [...] > >>> 2) What's the .config you are using? >> >> Pretty common config, no heavy debug options (KASAN etc). > > Please just send the .config, I'd rather not have to guess. Things like > preempt etc may make a difference in reproducing this. Sure, please see the attached config.gz > >>>> At least here, with a VM with 6 cores (host has 8C/16T), fast enough >>>> storage (PCIE4.0 NVME, with unsafe cache mode), it has the chance around >>>> 1/100 to hit the error. >>> >>> What does "unsafe cche mode" mean? >> >> Libvirt cache option "unsafe" >> >> Which is mostly ignoring flush/fua commands and fully rely on host fs >> (in my case it's file backed) cache. > > Gotcha > >>> Is that write back caching enabled? >>> Write back caching with volatile write cache? For your device, can you >>> do: >>> >>> $ grep . /sys/block/$dev/queue/* >>> >>>> Checking the fsstress verbose log against the failed file, it turns out >>>> to be an io_uring write. >>> >>> Any more details on what the write looks like? >> >> For the involved file, it shows the following operations for the minimal >> reproducible seed/-p/-n combination: >> >> ``` >> 0/24: link d0/f2 d0/f3 0 >> 0/29: fallocate(INSERT_RANGE) d0/f3 [276 2 0 0 176 481971]t 884736 585728 95 >> 0/30: uring_write d0/f3[276 2 0 0 176 481971] [1400622, 56456(res=56456)] 0 >> 0/31: writev d0/f3[276 2 0 0 296 1457078] [709121,8,964] 0 >> 0/34: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[276 2 308134 1763236 320 >> 1457078] return 25, fallback to stat() >> 0/34: dwrite d0/f3[276 2 308134 1763236 320 1457078] [589824,16384] 0 >> 0/38: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[276 2 308134 1763236 496 >> 1457078] return 25, fallback to stat() >> 0/38: dwrite d0/f3[276 2 308134 1763236 496 1457078] [2084864,36864] 0 >> 0/40: fallocate(ZERO_RANGE) d0/f3 [276 2 308134 1763236 688 2809139]t >> 3512660 81075 0 >> 0/43: splice d0/f5[289 1 0 0 1872 2678784] [552619,59420] -> d0/f3[276 2 >> 308134 1763236 856 3593735] [5603798,59420] 0 >> 0/48: fallocate(KEEP_SIZE|PUNCH_HOLE) d0/f3 [276 1 308134 1763236 976 >> 5663218]t 1361821 480392 0 >> 0/49: clonerange d0/f3[276 1 308134 1763236 856 5663218] [2461696,53248] >> -> d0/f5[289 1 0 0 1872 2678784] [942080,53248] >> ``` > > And just to be sure, this is not mixing dio and buffered, right? I'd say it's mixing, there are dwrite() and writev() for the same file, but at least not overlapping using this particular seed, nor they are concurrent (all inside the same process sequentially). But considering if only uring_write is disabled, then no more reproduce, thus there must be some untested btrfs path triggered by uring_write. > >>>> However I didn't see any io_uring related callback inside btrfs code, >>>> any advice on the io_uring part would be appreciated. >>> >>> io_uring doesn't do anything special here, it uses the normal page cache >>> read/write parts for buffered IO. But you may get extra parallellism >>> with io_uring here. For example, with the buffered write that this most >>> likely is, libaio would be exactly the same as a pwrite(2) on the file. 
>>> If this would've blocked, io_uring would offload this to a helper >>> thread. Depending on the workload, you could have multiple of those in >>> progress at the same time. >> >> My biggest concern is, would io_uring modify the page when it's still >> under writeback? > > No, of course not. Like I mentioned, io_uring doesn't do anything that > the normal read/write path isn't already doing - it's using the same > ->read_iter() and ->write_iter() that everything else is, there's no > page cache code in io_uring. > >> In that case, it's going to cause csum mismatch as btrfs relies on the >> page under writeback to be unchanged. > > Sure, I'm aware of the stable page requirements. > > See my followup email as well on a patch to test as well. > Applied and tested, using "-p 10 -n 1000" as fsstress workload, failed at 23rd run. Thanks, Qu [-- Attachment #2: config.gz --] [-- Type: application/gzip, Size: 35693 bytes --] ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Possible io_uring related race leads to btrfs data csum mismatch 2023-08-17 1:05 ` Qu Wenruo @ 2023-08-17 1:12 ` Jens Axboe 2023-08-17 1:19 ` Qu Wenruo 0 siblings, 1 reply; 22+ messages in thread From: Jens Axboe @ 2023-08-17 1:12 UTC (permalink / raw) To: Qu Wenruo, [email protected], Linux FS Devel, io-uring On 8/16/23 7:05 PM, Qu Wenruo wrote: > > > On 2023/8/17 06:28, Jens Axboe wrote: > [...] >> >>>> 2) What's the .config you are using? >>> >>> Pretty common config, no heavy debug options (KASAN etc). >> >> Please just send the .config, I'd rather not have to guess. Things like >> preempt etc may make a difference in reproducing this. > > Sure, please see the attached config.gz Thanks >> And just to be sure, this is not mixing dio and buffered, right? > > I'd say it's mixing, there are dwrite() and writev() for the same file, > but at least not overlapping using this particular seed, nor they are > concurrent (all inside the same process sequentially). > > But considering if only uring_write is disabled, then no more reproduce, > thus there must be some untested btrfs path triggered by uring_write. That would be one conclusion, another would be that timing is just different and that triggers and issue. Or it could of course be a bug in io_uring, perhaps a short write that gets retried or something like that. I've run the tests for hours here and don't hit anything, I've pulled in the for-next branch for btrfs and see if that'll make a difference. I'll check your .config too. Might not be a bad idea to have the writes contain known data, and when you hit the failure to verify the csum, dump the data where the csum says it's wrong and figure out at what offset, what content, etc it is? If that can get correlated to the log of what happened, that might shed some light on this. >>>>> However I didn't see any io_uring related callback inside btrfs code, >>>>> any advice on the io_uring part would be appreciated. >>>> >>>> io_uring doesn't do anything special here, it uses the normal page cache >>>> read/write parts for buffered IO. But you may get extra parallellism >>>> with io_uring here. For example, with the buffered write that this most >>>> likely is, libaio would be exactly the same as a pwrite(2) on the file. >>>> If this would've blocked, io_uring would offload this to a helper >>>> thread. Depending on the workload, you could have multiple of those in >>>> progress at the same time. >>> >>> My biggest concern is, would io_uring modify the page when it's still >>> under writeback? >> >> No, of course not. Like I mentioned, io_uring doesn't do anything that >> the normal read/write path isn't already doing - it's using the same >> ->read_iter() and ->write_iter() that everything else is, there's no >> page cache code in io_uring. >> >>> In that case, it's going to cause csum mismatch as btrfs relies on the >>> page under writeback to be unchanged. >> >> Sure, I'm aware of the stable page requirements. >> >> See my followup email as well on a patch to test as well. >> > > Applied and tested, using "-p 10 -n 1000" as fsstress workload, failed > at 23rd run. OK, that rules out the multiple-writers theory. -- Jens Axboe ^ permalink raw reply [flat|nested] 22+ messages in thread
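Jens's "known data" suggestion above could be acted on with something like
the following sketch. It is purely illustrative and not fsstress code; the
tagging scheme, word size, and function names are assumptions. Each 8-byte
word of a write buffer encodes the file offset it targets plus an operation
id, so when the csum mismatch fires, reading the bad block back immediately
identifies both the offset and the logged operation that wrote it.

```c
#include <stdint.h>
#include <stdio.h>
#include <stddef.h>

/*
 * Hypothetical helpers (not part of fsstress): tag every 8-byte word of a
 * write buffer with the destination file offset plus an operation id.
 * The offset is masked to 48 bits so the op id occupies the top 16 bits.
 */
static void fill_pattern(void *buf, size_t len, uint64_t file_off, uint64_t op_id)
{
	uint64_t *p = buf;
	size_t i;

	for (i = 0; i < len / 8; i++)
		p[i] = (op_id << 48) | ((file_off + i * 8) & 0xffffffffffffULL);
}

/* Returns the file offset of the first mismatch, or -1 if the range is clean. */
static long long verify_pattern(const void *buf, size_t len, uint64_t file_off, uint64_t op_id)
{
	const uint64_t *p = buf;
	size_t i;

	for (i = 0; i < len / 8; i++) {
		uint64_t expect = (op_id << 48) | ((file_off + i * 8) & 0xffffffffffffULL);

		if (p[i] != expect) {
			fprintf(stderr,
				"mismatch at file offset %llu: got 0x%llx, expected 0x%llx\n",
				(unsigned long long)(file_off + i * 8),
				(unsigned long long)p[i],
				(unsigned long long)expect);
			return (long long)(file_off + i * 8);
		}
	}
	return -1;
}
```

Run against the corrupted range reported by btrfs check, a verifier of this
shape correlates the bad bytes with a specific entry in the fsstress log
without manual hex-dump archaeology.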
* Re: Possible io_uring related race leads to btrfs data csum mismatch 2023-08-17 1:12 ` Jens Axboe @ 2023-08-17 1:19 ` Qu Wenruo 2023-08-17 1:23 ` Jens Axboe 0 siblings, 1 reply; 22+ messages in thread From: Qu Wenruo @ 2023-08-17 1:19 UTC (permalink / raw) To: Jens Axboe, [email protected], Linux FS Devel, io-uring On 2023/8/17 09:12, Jens Axboe wrote: > On 8/16/23 7:05 PM, Qu Wenruo wrote: >> >> >> On 2023/8/17 06:28, Jens Axboe wrote: >> [...] >>> >>>>> 2) What's the .config you are using? >>>> >>>> Pretty common config, no heavy debug options (KASAN etc). >>> >>> Please just send the .config, I'd rather not have to guess. Things like >>> preempt etc may make a difference in reproducing this. >> >> Sure, please see the attached config.gz > > Thanks > >>> And just to be sure, this is not mixing dio and buffered, right? >> >> I'd say it's mixing, there are dwrite() and writev() for the same file, >> but at least not overlapping using this particular seed, nor they are >> concurrent (all inside the same process sequentially). >> >> But considering if only uring_write is disabled, then no more reproduce, >> thus there must be some untested btrfs path triggered by uring_write. > > That would be one conclusion, another would be that timing is just > different and that triggers and issue. Or it could of course be a bug in > io_uring, perhaps a short write that gets retried or something like > that. I've run the tests for hours here and don't hit anything, I've > pulled in the for-next branch for btrfs and see if that'll make a > difference. I'll check your .config too. Just to mention, the problem itself was pretty hard to hit before if using any debug kernel configs. Not sure why but later I switched both my CPUs (from a desktop i7-13700K but with limited 160W power, to a laptop 7940HS), dropping all heavy debug kernel configs, then it's 100% reproducible here. So I guess a faster CPU is also one factor? > > Might not be a bad idea to have the writes contain known data, and when > you hit the failure to verify the csum, dump the data where the csum > says it's wrong and figure out at what offset, what content, etc it is? > If that can get correlated to the log of what happened, that might shed > some light on this. > Thanks for the advice, would definitely try this method, would keep you updated when I found something valuable. Thanks, Qu ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Possible io_uring related race leads to btrfs data csum mismatch 2023-08-17 1:19 ` Qu Wenruo @ 2023-08-17 1:23 ` Jens Axboe 2023-08-17 1:31 ` Qu Wenruo 0 siblings, 1 reply; 22+ messages in thread From: Jens Axboe @ 2023-08-17 1:23 UTC (permalink / raw) To: Qu Wenruo, [email protected], Linux FS Devel, io-uring On 8/16/23 7:19 PM, Qu Wenruo wrote: > On 2023/8/17 09:12, Jens Axboe wrote: >> On 8/16/23 7:05 PM, Qu Wenruo wrote: >>> >>> >>> On 2023/8/17 06:28, Jens Axboe wrote: >>> [...] >>>> >>>>>> 2) What's the .config you are using? >>>>> >>>>> Pretty common config, no heavy debug options (KASAN etc). >>>> >>>> Please just send the .config, I'd rather not have to guess. Things like >>>> preempt etc may make a difference in reproducing this. >>> >>> Sure, please see the attached config.gz >> >> Thanks >> >>>> And just to be sure, this is not mixing dio and buffered, right? >>> >>> I'd say it's mixing, there are dwrite() and writev() for the same file, >>> but at least not overlapping using this particular seed, nor they are >>> concurrent (all inside the same process sequentially). >>> >>> But considering if only uring_write is disabled, then no more reproduce, >>> thus there must be some untested btrfs path triggered by uring_write. >> >> That would be one conclusion, another would be that timing is just >> different and that triggers and issue. Or it could of course be a bug in >> io_uring, perhaps a short write that gets retried or something like >> that. I've run the tests for hours here and don't hit anything, I've >> pulled in the for-next branch for btrfs and see if that'll make a >> difference. I'll check your .config too. > > Just to mention, the problem itself was pretty hard to hit before if > using any debug kernel configs. The kernels I'm testing with don't have any debug options enabled, outside of the basic cheap stuff. I do notice you have all btrfs debug stuff enabled, I'll try and do that too. > Not sure why but later I switched both my CPUs (from a desktop i7-13700K > but with limited 160W power, to a laptop 7940HS), dropping all heavy > debug kernel configs, then it's 100% reproducible here. > > So I guess a faster CPU is also one factor? I've run this on kvm on an apple m1 max, no luck there. Ran it on a 7950X, no luck there. Fiddling config options on the 7950 and booting up the 7763 two socket box. Both that and the 7950 are using gen4 optane, should be plenty beefy. But if it's timing related, well... >> Might not be a bad idea to have the writes contain known data, and when >> you hit the failure to verify the csum, dump the data where the csum >> says it's wrong and figure out at what offset, what content, etc it is? >> If that can get correlated to the log of what happened, that might shed >> some light on this. >> > Thanks for the advice, would definitely try this method, would keep you > updated when I found something valuable. If I can't reproduce this, then this seems like the best way forward indeed. -- Jens Axboe ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Possible io_uring related race leads to btrfs data csum mismatch 2023-08-17 1:23 ` Jens Axboe @ 2023-08-17 1:31 ` Qu Wenruo 2023-08-17 1:32 ` Jens Axboe 0 siblings, 1 reply; 22+ messages in thread From: Qu Wenruo @ 2023-08-17 1:31 UTC (permalink / raw) To: Jens Axboe, [email protected], Linux FS Devel, io-uring On 2023/8/17 09:23, Jens Axboe wrote: > On 8/16/23 7:19 PM, Qu Wenruo wrote: >> On 2023/8/17 09:12, Jens Axboe wrote: >>> On 8/16/23 7:05 PM, Qu Wenruo wrote: >>>> >>>> >>>> On 2023/8/17 06:28, Jens Axboe wrote: >>>> [...] >>>>> >>>>>>> 2) What's the .config you are using? >>>>>> >>>>>> Pretty common config, no heavy debug options (KASAN etc). >>>>> >>>>> Please just send the .config, I'd rather not have to guess. Things like >>>>> preempt etc may make a difference in reproducing this. >>>> >>>> Sure, please see the attached config.gz >>> >>> Thanks >>> >>>>> And just to be sure, this is not mixing dio and buffered, right? >>>> >>>> I'd say it's mixing, there are dwrite() and writev() for the same file, >>>> but at least not overlapping using this particular seed, nor they are >>>> concurrent (all inside the same process sequentially). >>>> >>>> But considering if only uring_write is disabled, then no more reproduce, >>>> thus there must be some untested btrfs path triggered by uring_write. >>> >>> That would be one conclusion, another would be that timing is just >>> different and that triggers and issue. Or it could of course be a bug in >>> io_uring, perhaps a short write that gets retried or something like >>> that. I've run the tests for hours here and don't hit anything, I've >>> pulled in the for-next branch for btrfs and see if that'll make a >>> difference. I'll check your .config too. >> >> Just to mention, the problem itself was pretty hard to hit before if >> using any debug kernel configs. > > The kernels I'm testing with don't have any debug options enabled, > outside of the basic cheap stuff. I do notice you have all btrfs debug > stuff enabled, I'll try and do that too. > >> Not sure why but later I switched both my CPUs (from a desktop i7-13700K >> but with limited 160W power, to a laptop 7940HS), dropping all heavy >> debug kernel configs, then it's 100% reproducible here. >> >> So I guess a faster CPU is also one factor? > > I've run this on kvm on an apple m1 max, no luck there. Ran it on a > 7950X, no luck there. Fiddling config options on the 7950 and booting up > the 7763 two socket box. Both that and the 7950 are using gen4 optane, > should be plenty beefy. But if it's timing related, well... Just to mention, the following progs are involved: - btrfs-progs v6.3.3 In theory anything newer than 5.15 should be fine, it's some default settings change. - fsstress from xfstests project Thus it's not the one directly from LTP Hopes this could help you to reproduce the bug. Thanks, Qu > >>> Might not be a bad idea to have the writes contain known data, and when >>> you hit the failure to verify the csum, dump the data where the csum >>> says it's wrong and figure out at what offset, what content, etc it is? >>> If that can get correlated to the log of what happened, that might shed >>> some light on this. >>> >> Thanks for the advice, would definitely try this method, would keep you >> updated when I found something valuable. > > If I can't reproduce this, then this seems like the best way forward > indeed. > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Possible io_uring related race leads to btrfs data csum mismatch 2023-08-17 1:31 ` Qu Wenruo @ 2023-08-17 1:32 ` Jens Axboe 2023-08-19 23:59 ` Qu Wenruo 0 siblings, 1 reply; 22+ messages in thread From: Jens Axboe @ 2023-08-17 1:32 UTC (permalink / raw) To: Qu Wenruo, [email protected], Linux FS Devel, io-uring On 8/16/23 7:31 PM, Qu Wenruo wrote: > > > On 2023/8/17 09:23, Jens Axboe wrote: >> On 8/16/23 7:19 PM, Qu Wenruo wrote: >>> On 2023/8/17 09:12, Jens Axboe wrote: >>>> On 8/16/23 7:05 PM, Qu Wenruo wrote: >>>>> >>>>> >>>>> On 2023/8/17 06:28, Jens Axboe wrote: >>>>> [...] >>>>>> >>>>>>>> 2) What's the .config you are using? >>>>>>> >>>>>>> Pretty common config, no heavy debug options (KASAN etc). >>>>>> >>>>>> Please just send the .config, I'd rather not have to guess. Things like >>>>>> preempt etc may make a difference in reproducing this. >>>>> >>>>> Sure, please see the attached config.gz >>>> >>>> Thanks >>>> >>>>>> And just to be sure, this is not mixing dio and buffered, right? >>>>> >>>>> I'd say it's mixing, there are dwrite() and writev() for the same file, >>>>> but at least not overlapping using this particular seed, nor they are >>>>> concurrent (all inside the same process sequentially). >>>>> >>>>> But considering if only uring_write is disabled, then no more reproduce, >>>>> thus there must be some untested btrfs path triggered by uring_write. >>>> >>>> That would be one conclusion, another would be that timing is just >>>> different and that triggers and issue. Or it could of course be a bug in >>>> io_uring, perhaps a short write that gets retried or something like >>>> that. I've run the tests for hours here and don't hit anything, I've >>>> pulled in the for-next branch for btrfs and see if that'll make a >>>> difference. I'll check your .config too. >>> >>> Just to mention, the problem itself was pretty hard to hit before if >>> using any debug kernel configs. >> >> The kernels I'm testing with don't have any debug options enabled, >> outside of the basic cheap stuff. I do notice you have all btrfs debug >> stuff enabled, I'll try and do that too. >> >>> Not sure why but later I switched both my CPUs (from a desktop i7-13700K >>> but with limited 160W power, to a laptop 7940HS), dropping all heavy >>> debug kernel configs, then it's 100% reproducible here. >>> >>> So I guess a faster CPU is also one factor? >> >> I've run this on kvm on an apple m1 max, no luck there. Ran it on a >> 7950X, no luck there. Fiddling config options on the 7950 and booting up >> the 7763 two socket box. Both that and the 7950 are using gen4 optane, >> should be plenty beefy. But if it's timing related, well... > > Just to mention, the following progs are involved: > > - btrfs-progs v6.3.3 > In theory anything newer than 5.15 should be fine, it's some default > settings change. axboe@r7525 ~> apt show btrfs-progs Package: btrfs-progs Version: 6.3.2-1 is what I have. > - fsstress from xfstests project > Thus it's not the one directly from LTP That's what I'm using too. > Hopes this could help you to reproduce the bug. So far, not really :-) -- Jens Axboe ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Possible io_uring related race leads to btrfs data csum mismatch
  2023-08-17  1:32 ` Jens Axboe
@ 2023-08-19 23:59   ` Qu Wenruo
  2023-08-20  0:22     ` Qu Wenruo
  0 siblings, 1 reply; 22+ messages in thread
From: Qu Wenruo @ 2023-08-19 23:59 UTC (permalink / raw)
  To: Jens Axboe, [email protected], Linux FS Devel, io-uring

Hi Jens

I tried more on my side to debug the situation, and found a very weird
write behavior: some unexpected direct IO happened, without a
corresponding fsstress workload.

The workload is:

$fsstress -p 7 -n 50 -s 1691396493 -w -d $mnt -v > /tmp/fsstress

With it I can reliably reproduce the problem locally, with around a 1/50
probability.
In my particular case, it results in data corruption at root 5 inode 283
offset 8192.

Then I added some trace points for the following functions:

- btrfs_do_write_iter()
  Two trace points, one before btrfs_direct_write(), and one before
  btrfs_buffered_write(), outputting the aligned and unaligned write
  range, root/inode number, and type of the write (buffered or direct).

- btrfs_finish_one_ordered()
  This is where btrfs inserts its ordered extent into the subvolume tree.
  This happens when a range of pages finishes its writeback.

Then here comes the fsstress log for inode 283 (no btrfs root number):

0/22: clonerange d0/f2[283 1 0 0 0 0] [0,0] -> d0/f2[283 1 0 0 0 0] [307200,0]
0/23: copyrange d0/f2[283 1 0 0 0 0] [0,0] -> d0/f2[283 1 0 0 0 0] [1058819,0]
0/25: write d0/f2[283 2 0 0 0 0] [393644,88327] 0
0/29: fallocate(INSERT_RANGE) d0/f3 [283 2 0 0 176 481971]t 884736 585728 95
0/30: uring_write d0/f3[283 2 0 0 176 481971] [1400622, 56456(res=56456)] 0
0/31: writev d0/f3[283 2 0 0 296 1457078] [709121,8,964] 0
0/33: do_aio_rw - xfsctl(XFS_IOC_DIOINFO) d0/f2[283 2 308134 1763236 320 1457078] return 25, fallback to stat()
0/34: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[283 2 308134 1763236 320 1457078] return 25, fallback to stat()
0/34: dwrite d0/f3[283 2 308134 1763236 320 1457078] [589824,16384] 0
0/38: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[283 2 308134 1763236 496 1457078] return 25, fallback to stat()
0/38: dwrite d0/f3[283 2 308134 1763236 496 1457078] [2084864,36864] 0
0/39: write d0/d4/f6[283 2 308134 1763236 496 2121728] [2749000,60139] 0
0/40: fallocate(ZERO_RANGE) d0/f3 [283 2 308134 1763236 688 2809139]t 3512660 81075 0
0/43: splice d0/f5[293 1 0 0 1872 2678784] [552619,59420] -> d0/f3[283 2 308134 1763236 856 3593735] [5603798,59420] 0
0/48: fallocate(KEEP_SIZE|PUNCH_HOLE) d0/f3 [283 1 308134 1763236 976 5663218]t 1361821 480392 0
0/49: clonerange d0/f3[283 1 308134 1763236 856 5663218] [2461696,53248] -> d0/f5[293 1 0 0 1872 2678784] [942080,53248]

Note one thing: there is no direct/buffered write into inode 283 offset
8192.

But from the trace events for root 5 inode 283:

btrfs_do_write_iter: r/i=5/283 buffered fileoff=393216(393644) len=90112(88327)
btrfs_do_write_iter: r/i=5/283 buffered fileoff=1396736(1400622) len=61440(56456)
btrfs_do_write_iter: r/i=5/283 buffered fileoff=708608(709121) len=12288(7712)

btrfs_do_write_iter: r/i=5/283 direct fileoff=8192(8192) len=73728(73728) <<<<<

btrfs_do_write_iter: r/i=5/283 direct fileoff=589824(589824) len=16384(16384)
btrfs_finish_one_ordered: r/i=5/283 fileoff=8192 len=73728
btrfs_finish_one_ordered: r/i=5/283 fileoff=589824 len=16384
btrfs_do_write_iter: r/i=5/283 direct fileoff=2084864(2084864) len=36864(36864)
btrfs_finish_one_ordered: r/i=5/283 fileoff=2084864 len=36864
btrfs_do_write_iter: r/i=5/283 buffered fileoff=2748416(2749000) len=61440(60139)
btrfs_do_write_iter: r/i=5/283 buffered fileoff=5603328(5603798) len=61440(59420)
btrfs_finish_one_ordered: r/i=5/283 fileoff=393216 len=90112
btrfs_finish_one_ordered: r/i=5/283 fileoff=708608 len=12288
btrfs_finish_one_ordered: r/i=5/283 fileoff=1396736 len=61440
btrfs_finish_one_ordered: r/i=5/283 fileoff=3592192 len=4096
btrfs_finish_one_ordered: r/i=5/283 fileoff=2748416 len=61440
btrfs_finish_one_ordered: r/i=5/283 fileoff=5603328 len=61440

Note the phantom direct IO call, which is in the corrupted range.

If paired with fsstress, that phantom write happens between the two
operations:

0/31: writev d0/f3[283 2 0 0 296 1457078] [709121,8,964] 0
0/34: dwrite d0/f3[283 2 308134 1763236 320 1457078] [589824,16384] 0

I'll keep digging, but the phantom write, which is not properly logged by
fsstress, is already a concern to me.

Or maybe I'm missing some fixes in fsstress?

Thanks,
Qu

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: Possible io_uring related race leads to btrfs data csum mismatch 2023-08-19 23:59 ` Qu Wenruo @ 2023-08-20 0:22 ` Qu Wenruo 2023-08-20 13:26 ` Jens Axboe 0 siblings, 1 reply; 22+ messages in thread From: Qu Wenruo @ 2023-08-20 0:22 UTC (permalink / raw) To: Jens Axboe, [email protected], Linux FS Devel, io-uring On 2023/8/20 07:59, Qu Wenruo wrote: > Hi Jens > > I tried more on my side to debug the situation, and found a very weird > write behavior: > > Some unexpected direct IO happened, without corresponding > fsstress workload. > > The workload is: > > $fsstress -p 7 -n 50 -s 1691396493 -w -d $mnt -v > /tmp/fsstress > > Which I can reliably reproduce the problem locally, around 1/50 > possibility. > In my particular case, it results data corruption at root 5 inode 283 > offset 8192. > > Then I added some trace points for the following functions: > > - btrfs_do_write_iter() > Two trace points, one before btrfs_direct_write(), and one > before btrfs_buffered_write(), outputting the aligned and unaligned > write range, root/inode number, type of the write (buffered or > direct). > > - btrfs_finish_one_ordered() > This is where btrfs inserts its ordered extent into the subvolume > tree. > This happens when a range of pages finishes its writeback. > > Then here comes the fsstress log for inode 283 (no btrfs root number): > > 0/22: clonerange d0/f2[283 1 0 0 0 0] [0,0] -> d0/f2[283 1 0 0 0 0] > [307200,0] > 0/23: copyrange d0/f2[283 1 0 0 0 0] [0,0] -> d0/f2[283 1 0 0 0 0] > [1058819,0] > 0/25: write d0/f2[283 2 0 0 0 0] [393644,88327] 0 > 0/29: fallocate(INSERT_RANGE) d0/f3 [283 2 0 0 176 481971]t 884736 > 585728 95 > 0/30: uring_write d0/f3[283 2 0 0 176 481971] [1400622, 56456(res=56456)] 0 > 0/31: writev d0/f3[283 2 0 0 296 1457078] [709121,8,964] 0 > 0/33: do_aio_rw - xfsctl(XFS_IOC_DIOINFO) d0/f2[283 2 308134 1763236 320 > 1457078] return 25, fallback to stat() > 0/34: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[283 2 308134 1763236 320 > 1457078] return 25, fallback to stat() > 0/34: dwrite d0/f3[283 2 308134 1763236 320 1457078] [589824,16384] 0 > 0/38: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[283 2 308134 1763236 496 > 1457078] return 25, fallback to stat() > 0/38: dwrite d0/f3[283 2 308134 1763236 496 1457078] [2084864,36864] 0 > 0/39: write d0/d4/f6[283 2 308134 1763236 496 2121728] [2749000,60139] 0 > 0/40: fallocate(ZERO_RANGE) d0/f3 [283 2 308134 1763236 688 2809139]t > 3512660 81075 0 > 0/43: splice d0/f5[293 1 0 0 1872 2678784] [552619,59420] -> d0/f3[283 2 > 308134 1763236 856 3593735] [5603798,59420] 0 > 0/48: fallocate(KEEP_SIZE|PUNCH_HOLE) d0/f3 [283 1 308134 1763236 976 > 5663218]t 1361821 480392 0 > 0/49: clonerange d0/f3[283 1 308134 1763236 856 5663218] [2461696,53248] > -> d0/f5[293 1 0 0 1872 2678784] [942080,53248] > > Note one thing, there is no direct/buffered write into inode 283 offset > 8192. 
> 
> But from the trace events for root 5 inode 283:
> 
> btrfs_do_write_iter: r/i=5/283 buffered fileoff=393216(393644) len=90112(88327)
> btrfs_do_write_iter: r/i=5/283 buffered fileoff=1396736(1400622) len=61440(56456)
> btrfs_do_write_iter: r/i=5/283 buffered fileoff=708608(709121) len=12288(7712)
> 
> btrfs_do_write_iter: r/i=5/283 direct fileoff=8192(8192) len=73728(73728) <<<<<
> 
> btrfs_do_write_iter: r/i=5/283 direct fileoff=589824(589824) len=16384(16384)
> btrfs_finish_one_ordered: r/i=5/283 fileoff=8192 len=73728
> btrfs_finish_one_ordered: r/i=5/283 fileoff=589824 len=16384
> btrfs_do_write_iter: r/i=5/283 direct fileoff=2084864(2084864) len=36864(36864)
> btrfs_finish_one_ordered: r/i=5/283 fileoff=2084864 len=36864
> btrfs_do_write_iter: r/i=5/283 buffered fileoff=2748416(2749000) len=61440(60139)
> btrfs_do_write_iter: r/i=5/283 buffered fileoff=5603328(5603798) len=61440(59420)
> btrfs_finish_one_ordered: r/i=5/283 fileoff=393216 len=90112
> btrfs_finish_one_ordered: r/i=5/283 fileoff=708608 len=12288
> btrfs_finish_one_ordered: r/i=5/283 fileoff=1396736 len=61440
> btrfs_finish_one_ordered: r/i=5/283 fileoff=3592192 len=4096
> btrfs_finish_one_ordered: r/i=5/283 fileoff=2748416 len=61440
> btrfs_finish_one_ordered: r/i=5/283 fileoff=5603328 len=61440
> 
> Note the phantom direct IO call, which is in the corrupted range.
> 
> If paired with fsstress, that phantom write happens between the two
> operations:
> 
> 0/31: writev d0/f3[283 2 0 0 296 1457078] [709121,8,964] 0
> 0/34: dwrite d0/f3[283 2 308134 1763236 320 1457078] [589824,16384] 0

Just to be more accurate, there is a 0/33 operation, which is:

0/33: do_aio_rw - xfsctl(XFS_IOC_DIOINFO) d0/f2[285 2 308134 1763236 320 1457078] return 25, fallback to stat()
0/33: awrite - io_getevents failed -4

The failed one doesn't print an inode number, thus it didn't get caught by
grep.

Return value -4 means -EINTR, and I'm not sure who sent the interruption.
But if this interruption happens before the IO finishes, we end up calling
free() on the buffer, and if we're unlucky enough, the freed memory can be
re-allocated for some other use, thus modifying the pages before the
writeback finishes.

I think this is the direct cause of the data corruption: the pages are
modified before the direct IO finishes.

But unfortunately I still don't understand why the interruption can happen,
nor how we should handle such an interruption.
(I guess just retry?)

Thanks,
Qu

> 
> I'll keep digging, but the phantom write, which is not properly logged by
> fsstress, is already a concern to me.
> 
> Or maybe I'm missing some fixes in fsstress?
> 
> Thanks,
> Qu

^ permalink raw reply	[flat|nested] 22+ messages in thread
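To make the suspected sequence concrete, here is a condensed sketch of what
Qu describes above (assumptions on my part: libaio on an O_DIRECT fd, a
heap-allocated buffer freed on the error path as described in the thread;
the function name and parameters are illustrative, not fsstress's actual
code). If io_getevents() returns -EINTR before the write completes, the
buffer is freed while the device may still be reading from it, and any
reuse of that memory is exactly the modification-under-IO that btrfs's data
csums will later flag.

```c
#include <libaio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Condensed illustration of the suspected failure sequence. */
static void suspected_sequence(io_context_t ctx, int dio_fd, size_t len, off_t off)
{
	struct iocb cb, *cbs[1] = { &cb };
	struct io_event ev;
	void *buf;

	if (posix_memalign(&buf, 4096, len))
		return;

	io_prep_pwrite(&cb, dio_fd, buf, len, off);	/* O_DIRECT write from buf */
	if (io_submit(ctx, 1, cbs) != 1) {
		free(buf);
		return;
	}

	/*
	 * A signal-based io_uring task_work notification can interrupt this
	 * wait, making it return -EINTR while the write is still in flight.
	 */
	if (io_getevents(ctx, 1, 1, &ev, NULL) != 1) {
		free(buf);	/* buffer may be reused while the device still reads it */
		return;
	}
	free(buf);		/* safe: the write has completed */
}
```

The fix Jens posts below retries the wait on -EINTR instead of bailing out,
so the buffer is only released once the write has actually completed.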
* Re: Possible io_uring related race leads to btrfs data csum mismatch 2023-08-20 0:22 ` Qu Wenruo @ 2023-08-20 13:26 ` Jens Axboe 2023-08-20 14:11 ` Jens Axboe 0 siblings, 1 reply; 22+ messages in thread From: Jens Axboe @ 2023-08-20 13:26 UTC (permalink / raw) To: Qu Wenruo, [email protected], Linux FS Devel, io-uring On 8/19/23 6:22 PM, Qu Wenruo wrote: > > > On 2023/8/20 07:59, Qu Wenruo wrote: >> Hi Jens >> >> I tried more on my side to debug the situation, and found a very weird >> write behavior: >> >> Some unexpected direct IO happened, without corresponding >> fsstress workload. >> >> The workload is: >> >> $fsstress -p 7 -n 50 -s 1691396493 -w -d $mnt -v > /tmp/fsstress >> >> Which I can reliably reproduce the problem locally, around 1/50 >> possibility. >> In my particular case, it results data corruption at root 5 inode 283 >> offset 8192. >> >> Then I added some trace points for the following functions: >> >> - btrfs_do_write_iter() >> Two trace points, one before btrfs_direct_write(), and one >> before btrfs_buffered_write(), outputting the aligned and unaligned >> write range, root/inode number, type of the write (buffered or >> direct). >> >> - btrfs_finish_one_ordered() >> This is where btrfs inserts its ordered extent into the subvolume >> tree. >> This happens when a range of pages finishes its writeback. >> >> Then here comes the fsstress log for inode 283 (no btrfs root number): >> >> 0/22: clonerange d0/f2[283 1 0 0 0 0] [0,0] -> d0/f2[283 1 0 0 0 0] >> [307200,0] >> 0/23: copyrange d0/f2[283 1 0 0 0 0] [0,0] -> d0/f2[283 1 0 0 0 0] >> [1058819,0] >> 0/25: write d0/f2[283 2 0 0 0 0] [393644,88327] 0 >> 0/29: fallocate(INSERT_RANGE) d0/f3 [283 2 0 0 176 481971]t 884736 >> 585728 95 >> 0/30: uring_write d0/f3[283 2 0 0 176 481971] [1400622, 56456(res=56456)] 0 >> 0/31: writev d0/f3[283 2 0 0 296 1457078] [709121,8,964] 0 >> 0/33: do_aio_rw - xfsctl(XFS_IOC_DIOINFO) d0/f2[283 2 308134 1763236 320 >> 1457078] return 25, fallback to stat() >> 0/34: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[283 2 308134 1763236 320 >> 1457078] return 25, fallback to stat() >> 0/34: dwrite d0/f3[283 2 308134 1763236 320 1457078] [589824,16384] 0 >> 0/38: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[283 2 308134 1763236 496 >> 1457078] return 25, fallback to stat() >> 0/38: dwrite d0/f3[283 2 308134 1763236 496 1457078] [2084864,36864] 0 >> 0/39: write d0/d4/f6[283 2 308134 1763236 496 2121728] [2749000,60139] 0 >> 0/40: fallocate(ZERO_RANGE) d0/f3 [283 2 308134 1763236 688 2809139]t >> 3512660 81075 0 >> 0/43: splice d0/f5[293 1 0 0 1872 2678784] [552619,59420] -> d0/f3[283 2 >> 308134 1763236 856 3593735] [5603798,59420] 0 >> 0/48: fallocate(KEEP_SIZE|PUNCH_HOLE) d0/f3 [283 1 308134 1763236 976 >> 5663218]t 1361821 480392 0 >> 0/49: clonerange d0/f3[283 1 308134 1763236 856 5663218] [2461696,53248] >> -> d0/f5[293 1 0 0 1872 2678784] [942080,53248] >> >> Note one thing, there is no direct/buffered write into inode 283 offset >> 8192. 
>> >> But from the trace events for root 5 inode 283: >> >> btrfs_do_write_iter: r/i=5/283 buffered fileoff=393216(393644) >> len=90112(88327) >> btrfs_do_write_iter: r/i=5/283 buffered fileoff=1396736(1400622) >> len=61440(56456) >> btrfs_do_write_iter: r/i=5/283 buffered fileoff=708608(709121) >> len=12288(7712) >> >> btrfs_do_write_iter: r/i=5/283 direct fileoff=8192(8192) >> len=73728(73728) <<<<< >> >> btrfs_do_write_iter: r/i=5/283 direct fileoff=589824(589824) >> len=16384(16384) >> btrfs_finish_one_ordered: r/i=5/283 fileoff=8192 len=73728 >> btrfs_finish_one_ordered: r/i=5/283 fileoff=589824 len=16384 >> btrfs_do_write_iter: r/i=5/283 direct fileoff=2084864(2084864) >> len=36864(36864) >> btrfs_finish_one_ordered: r/i=5/283 fileoff=2084864 len=36864 >> btrfs_do_write_iter: r/i=5/283 buffered fileoff=2748416(2749000) >> len=61440(60139) >> btrfs_do_write_iter: r/i=5/283 buffered fileoff=5603328(5603798) >> len=61440(59420) >> btrfs_finish_one_ordered: r/i=5/283 fileoff=393216 len=90112 >> btrfs_finish_one_ordered: r/i=5/283 fileoff=708608 len=12288 >> btrfs_finish_one_ordered: r/i=5/283 fileoff=1396736 len=61440 >> btrfs_finish_one_ordered: r/i=5/283 fileoff=3592192 len=4096 >> btrfs_finish_one_ordered: r/i=5/283 fileoff=2748416 len=61440 >> btrfs_finish_one_ordered: r/i=5/283 fileoff=5603328 len=61440 >> >> Note that phantom direct IO call, which is in the corrupted range. >> >> If paired with fsstress, that phantom write happens between the two >> operations: >> >> 0/31: writev d0/f3[283 2 0 0 296 1457078] [709121,8,964] 0 >> 0/34: dwrite d0/f3[283 2 308134 1763236 320 1457078] [589824,16384] 0 > > Just to be more accurate, there is a 0/33 operation, which is: > > 0/33: do_aio_rw - xfsctl(XFS_IOC_DIOINFO) d0/f2[285 2 308134 1763236 320 > 1457078] return 25, fallback to stat() > 0/33: awrite - io_getevents failed -4 > > The failed one doesn't have inode number thus it didn't get caught by grep. > > Return value -4 means -INTR, not sure who sent the interruption. > But if this interruption happens before the IO finished, we can call > free() on the buffer, and if we're unlucky enough, the freed memory can > be re-allocated for some other usage, thus modifying the pages before > the writeback finished. > > I think this is the direct cause of the data corruption, page > modification before direct IO finished. > > But unfortunately I still didn't get why the interruption can happen, > nor how can we handle such interruption? > (I guess just retry?) It's because you are mixing aio/io_uring, and the default settings for io_uring is to use signal based notifications for queueing task_work. This then causes a spurious -EINTR, which stops your io_getevents() wait. Looks like this is a bug in fsstress, it should just retry the wait if this happens. You can also configure the ring to not use signal based notifications, but that bug needs fixing regardless. -- Jens Axboe ^ permalink raw reply [flat|nested] 22+ messages in thread
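For the second option Jens mentions, configuring the ring to avoid
signal-based notifications, a minimal sketch might look like the following.
This is an illustration rather than anything proposed in the thread as the
fix; it assumes liburing and a kernel with IORING_SETUP_COOP_TASKRUN (5.19
or newer). With that flag, task_work runs when the task next enters the
kernel instead of being forced via a signal, so a concurrent io_getevents()
in the same task is no longer interrupted with a spurious -EINTR by io_uring
completions. The fsstress-side retry remains the more general fix.

```c
#include <string.h>
#include <liburing.h>

/*
 * Sketch: set up an io_uring that does not use signal-based task_work
 * notifications (assumes liburing and IORING_SETUP_COOP_TASKRUN, 5.19+).
 */
static int setup_coop_ring(struct io_uring *ring)
{
	struct io_uring_params p;

	memset(&p, 0, sizeof(p));
	p.flags = IORING_SETUP_COOP_TASKRUN;

	return io_uring_queue_init_params(8, ring, &p);
}
```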
* Re: Possible io_uring related race leads to btrfs data csum mismatch 2023-08-20 13:26 ` Jens Axboe @ 2023-08-20 14:11 ` Jens Axboe 2023-08-20 18:18 ` Matthew Wilcox 2023-08-21 0:38 ` Qu Wenruo 0 siblings, 2 replies; 22+ messages in thread From: Jens Axboe @ 2023-08-20 14:11 UTC (permalink / raw) To: Qu Wenruo, [email protected], Linux FS Devel, io-uring On 8/20/23 7:26 AM, Jens Axboe wrote: > On 8/19/23 6:22 PM, Qu Wenruo wrote: >> >> >> On 2023/8/20 07:59, Qu Wenruo wrote: >>> Hi Jens >>> >>> I tried more on my side to debug the situation, and found a very weird >>> write behavior: >>> >>> Some unexpected direct IO happened, without corresponding >>> fsstress workload. >>> >>> The workload is: >>> >>> $fsstress -p 7 -n 50 -s 1691396493 -w -d $mnt -v > /tmp/fsstress >>> >>> Which I can reliably reproduce the problem locally, around 1/50 >>> possibility. >>> In my particular case, it results data corruption at root 5 inode 283 >>> offset 8192. >>> >>> Then I added some trace points for the following functions: >>> >>> - btrfs_do_write_iter() >>> Two trace points, one before btrfs_direct_write(), and one >>> before btrfs_buffered_write(), outputting the aligned and unaligned >>> write range, root/inode number, type of the write (buffered or >>> direct). >>> >>> - btrfs_finish_one_ordered() >>> This is where btrfs inserts its ordered extent into the subvolume >>> tree. >>> This happens when a range of pages finishes its writeback. >>> >>> Then here comes the fsstress log for inode 283 (no btrfs root number): >>> >>> 0/22: clonerange d0/f2[283 1 0 0 0 0] [0,0] -> d0/f2[283 1 0 0 0 0] >>> [307200,0] >>> 0/23: copyrange d0/f2[283 1 0 0 0 0] [0,0] -> d0/f2[283 1 0 0 0 0] >>> [1058819,0] >>> 0/25: write d0/f2[283 2 0 0 0 0] [393644,88327] 0 >>> 0/29: fallocate(INSERT_RANGE) d0/f3 [283 2 0 0 176 481971]t 884736 >>> 585728 95 >>> 0/30: uring_write d0/f3[283 2 0 0 176 481971] [1400622, 56456(res=56456)] 0 >>> 0/31: writev d0/f3[283 2 0 0 296 1457078] [709121,8,964] 0 >>> 0/33: do_aio_rw - xfsctl(XFS_IOC_DIOINFO) d0/f2[283 2 308134 1763236 320 >>> 1457078] return 25, fallback to stat() >>> 0/34: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[283 2 308134 1763236 320 >>> 1457078] return 25, fallback to stat() >>> 0/34: dwrite d0/f3[283 2 308134 1763236 320 1457078] [589824,16384] 0 >>> 0/38: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[283 2 308134 1763236 496 >>> 1457078] return 25, fallback to stat() >>> 0/38: dwrite d0/f3[283 2 308134 1763236 496 1457078] [2084864,36864] 0 >>> 0/39: write d0/d4/f6[283 2 308134 1763236 496 2121728] [2749000,60139] 0 >>> 0/40: fallocate(ZERO_RANGE) d0/f3 [283 2 308134 1763236 688 2809139]t >>> 3512660 81075 0 >>> 0/43: splice d0/f5[293 1 0 0 1872 2678784] [552619,59420] -> d0/f3[283 2 >>> 308134 1763236 856 3593735] [5603798,59420] 0 >>> 0/48: fallocate(KEEP_SIZE|PUNCH_HOLE) d0/f3 [283 1 308134 1763236 976 >>> 5663218]t 1361821 480392 0 >>> 0/49: clonerange d0/f3[283 1 308134 1763236 856 5663218] [2461696,53248] >>> -> d0/f5[293 1 0 0 1872 2678784] [942080,53248] >>> >>> Note one thing, there is no direct/buffered write into inode 283 offset >>> 8192. 
>>> >>> But from the trace events for root 5 inode 283: >>> >>> btrfs_do_write_iter: r/i=5/283 buffered fileoff=393216(393644) >>> len=90112(88327) >>> btrfs_do_write_iter: r/i=5/283 buffered fileoff=1396736(1400622) >>> len=61440(56456) >>> btrfs_do_write_iter: r/i=5/283 buffered fileoff=708608(709121) >>> len=12288(7712) >>> >>> btrfs_do_write_iter: r/i=5/283 direct fileoff=8192(8192) >>> len=73728(73728) <<<<< >>> >>> btrfs_do_write_iter: r/i=5/283 direct fileoff=589824(589824) >>> len=16384(16384) >>> btrfs_finish_one_ordered: r/i=5/283 fileoff=8192 len=73728 >>> btrfs_finish_one_ordered: r/i=5/283 fileoff=589824 len=16384 >>> btrfs_do_write_iter: r/i=5/283 direct fileoff=2084864(2084864) >>> len=36864(36864) >>> btrfs_finish_one_ordered: r/i=5/283 fileoff=2084864 len=36864 >>> btrfs_do_write_iter: r/i=5/283 buffered fileoff=2748416(2749000) >>> len=61440(60139) >>> btrfs_do_write_iter: r/i=5/283 buffered fileoff=5603328(5603798) >>> len=61440(59420) >>> btrfs_finish_one_ordered: r/i=5/283 fileoff=393216 len=90112 >>> btrfs_finish_one_ordered: r/i=5/283 fileoff=708608 len=12288 >>> btrfs_finish_one_ordered: r/i=5/283 fileoff=1396736 len=61440 >>> btrfs_finish_one_ordered: r/i=5/283 fileoff=3592192 len=4096 >>> btrfs_finish_one_ordered: r/i=5/283 fileoff=2748416 len=61440 >>> btrfs_finish_one_ordered: r/i=5/283 fileoff=5603328 len=61440 >>> >>> Note that phantom direct IO call, which is in the corrupted range. >>> >>> If paired with fsstress, that phantom write happens between the two >>> operations: >>> >>> 0/31: writev d0/f3[283 2 0 0 296 1457078] [709121,8,964] 0 >>> 0/34: dwrite d0/f3[283 2 308134 1763236 320 1457078] [589824,16384] 0 >> >> Just to be more accurate, there is a 0/33 operation, which is: >> >> 0/33: do_aio_rw - xfsctl(XFS_IOC_DIOINFO) d0/f2[285 2 308134 1763236 320 >> 1457078] return 25, fallback to stat() >> 0/33: awrite - io_getevents failed -4 >> >> The failed one doesn't have inode number thus it didn't get caught by grep. >> >> Return value -4 means -INTR, not sure who sent the interruption. >> But if this interruption happens before the IO finished, we can call >> free() on the buffer, and if we're unlucky enough, the freed memory can >> be re-allocated for some other usage, thus modifying the pages before >> the writeback finished. >> >> I think this is the direct cause of the data corruption, page >> modification before direct IO finished. >> >> But unfortunately I still didn't get why the interruption can happen, >> nor how can we handle such interruption? >> (I guess just retry?) > > It's because you are mixing aio/io_uring, and the default settings for > io_uring is to use signal based notifications for queueing task_work. > This then causes a spurious -EINTR, which stops your io_getevents() > wait. Looks like this is a bug in fsstress, it should just retry the > wait if this happens. You can also configure the ring to not use signal > based notifications, but that bug needs fixing regardless. Something like this will probably fix it. diff --git a/ltp/fsstress.c b/ltp/fsstress.c index 6641a525fe5d..05fbfd3f8cf8 100644 --- a/ltp/fsstress.c +++ b/ltp/fsstress.c @@ -2072,6 +2072,23 @@ void inode_info(char *str, size_t sz, struct stat64 *s, int verbose) (long long) s->st_blocks, (long long) s->st_size); } +static int io_get_single_event(struct io_event *event) +{ + int ret; + + do { + /* + * We can get -EINTR if competing with io_uring using signal + * based notifications. For that case, just retry the wait. 
+ */ + ret = io_getevents(io_ctx, 1, 1, event, NULL); + if (ret != -EINTR) + break; + } while (1); + + return ret; +} + void afsync_f(opnum_t opno, long r) { @@ -2111,7 +2128,7 @@ afsync_f(opnum_t opno, long r) close(fd); return; } - if ((e = io_getevents(io_ctx, 1, 1, &event, NULL)) != 1) { + if ((e = io_get_single_event(&event)) != 1) { if (v) printf("%d/%lld: afsync - io_getevents failed %d\n", procid, opno, e); @@ -2220,10 +2237,10 @@ do_aio_rw(opnum_t opno, long r, int flags) if ((e = io_submit(io_ctx, 1, iocbs)) != 1) { if (v) printf("%d/%lld: %s - io_submit failed %d\n", - procid, opno, iswrite ? "awrite" : "aread", e); + procid, opno, iswrite ? "awrite" : "aread", e); goto aio_out; } - if ((e = io_getevents(io_ctx, 1, 1, &event, NULL)) != 1) { + if ((e = io_get_single_event(&event)) != 1) { if (v) printf("%d/%lld: %s - io_getevents failed %d\n", procid, opno, iswrite ? "awrite" : "aread", e); -- Jens Axboe ^ permalink raw reply related [flat|nested] 22+ messages in thread
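As a side note on the alternative Jens mentions above (configuring the ring to not use signal-based notifications): below is a minimal liburing sketch. It is illustrative only and not part of fsstress or this thread; the helper name and the -EINVAL fallback are assumptions. IORING_SETUP_COOP_TASKRUN (kernel 5.19+) asks the kernel to run task_work on the next transition to kernel space instead of poking the task with a signal/IPI, which is what produces the spurious -EINTR in a concurrent io_getevents() wait.

```c
#include <errno.h>
#include <liburing.h>

/*
 * Illustrative sketch (not fsstress code): set up a ring that avoids
 * signal-based task_work notifications, so a libaio io_getevents() wait
 * in the same task is not interrupted with -EINTR.  Falls back to a
 * plain ring if the kernel does not know the flag.
 */
static int setup_ring_without_signal_notify(struct io_uring *ring,
					    unsigned entries)
{
	int ret;

	ret = io_uring_queue_init(entries, ring, IORING_SETUP_COOP_TASKRUN);
	if (ret == -EINVAL)	/* flag not supported by this kernel */
		ret = io_uring_queue_init(entries, ring, 0);
	return ret;		/* 0 on success, negative errno on failure */
}
```

Either way, as noted above, the EINTR handling in fsstress needs fixing regardless of how the ring is configured.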
* Re: Possible io_uring related race leads to btrfs data csum mismatch 2023-08-20 14:11 ` Jens Axboe @ 2023-08-20 18:18 ` Matthew Wilcox 2023-08-20 18:40 ` Jens Axboe 2023-08-21 0:38 ` Qu Wenruo 1 sibling, 1 reply; 22+ messages in thread From: Matthew Wilcox @ 2023-08-20 18:18 UTC (permalink / raw) To: Jens Axboe Cc: Qu Wenruo, [email protected], Linux FS Devel, io-uring On Sun, Aug 20, 2023 at 08:11:04AM -0600, Jens Axboe wrote: > +static int io_get_single_event(struct io_event *event) > +{ > + int ret; > + > + do { > + /* > + * We can get -EINTR if competing with io_uring using signal > + * based notifications. For that case, just retry the wait. > + */ > + ret = io_getevents(io_ctx, 1, 1, event, NULL); > + if (ret != -EINTR) > + break; > + } while (1); > + > + return ret; > +} Is there a reason to prefer this style over: do { ret = io_getevents(io_ctx, 1, 1, event, NULL); } while (ret == -1 && errno == EINTR); (we need to check errno, here, right? Or is io_getevents() special somehow?) ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Possible io_uring related race leads to btrfs data csum mismatch 2023-08-20 18:18 ` Matthew Wilcox @ 2023-08-20 18:40 ` Jens Axboe 0 siblings, 0 replies; 22+ messages in thread From: Jens Axboe @ 2023-08-20 18:40 UTC (permalink / raw) To: Matthew Wilcox Cc: Qu Wenruo, [email protected], Linux FS Devel, io-uring On 8/20/23 12:18 PM, Matthew Wilcox wrote: > On Sun, Aug 20, 2023 at 08:11:04AM -0600, Jens Axboe wrote: >> +static int io_get_single_event(struct io_event *event) >> +{ >> + int ret; >> + >> + do { >> + /* >> + * We can get -EINTR if competing with io_uring using signal >> + * based notifications. For that case, just retry the wait. >> + */ >> + ret = io_getevents(io_ctx, 1, 1, event, NULL); >> + if (ret != -EINTR) >> + break; >> + } while (1); >> + >> + return ret; >> +} > > Is there a reason to prefer this style over: > > do { > ret = io_getevents(io_ctx, 1, 1, event, NULL); > } while (ret == -1 && errno == EINTR); > > (we need to check errno, here, right? Or is io_getevents() special > somehow?) Honestly, don't really care about the style, mostly cared about getting a bug fixed. io_getevents() returns number of events claimed, or -errno. -- Jens Axboe ^ permalink raw reply [flat|nested] 22+ messages in thread
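To make the distinction above concrete, here is a hedged sketch (illustration only, not fsstress code; the helper names are made up) of the two retry conventions: libaio's io_getevents() returns the reaped event count or a negative errno directly, while POSIX-style syscalls return -1 and set errno.

```c
#include <errno.h>
#include <libaio.h>
#include <unistd.h>

/* libaio convention: the error comes back as a negative return value. */
static int getevents_retry(io_context_t ctx, struct io_event *event)
{
	int ret;

	do {
		ret = io_getevents(ctx, 1, 1, event, NULL);
	} while (ret == -EINTR);

	return ret;	/* number of events reaped, or negative errno */
}

/* POSIX convention, for contrast: -1 plus errno. */
static ssize_t read_retry(int fd, void *buf, size_t len)
{
	ssize_t ret;

	do {
		ret = read(fd, buf, len);
	} while (ret == -1 && errno == EINTR);

	return ret;
}
```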
* Re: Possible io_uring related race leads to btrfs data csum mismatch 2023-08-20 14:11 ` Jens Axboe 2023-08-20 18:18 ` Matthew Wilcox @ 2023-08-21 0:38 ` Qu Wenruo 2023-08-21 14:57 ` Jens Axboe 1 sibling, 1 reply; 22+ messages in thread From: Qu Wenruo @ 2023-08-21 0:38 UTC (permalink / raw) To: Jens Axboe, [email protected], Linux FS Devel, io-uring On 2023/8/20 22:11, Jens Axboe wrote: > On 8/20/23 7:26 AM, Jens Axboe wrote: >> On 8/19/23 6:22 PM, Qu Wenruo wrote: >>> >>> >>> On 2023/8/20 07:59, Qu Wenruo wrote: >>>> Hi Jens >>>> >>>> I tried more on my side to debug the situation, and found a very weird >>>> write behavior: >>>> >>>> Some unexpected direct IO happened, without corresponding >>>> fsstress workload. >>>> >>>> The workload is: >>>> >>>> $fsstress -p 7 -n 50 -s 1691396493 -w -d $mnt -v > /tmp/fsstress >>>> >>>> Which I can reliably reproduce the problem locally, around 1/50 >>>> possibility. >>>> In my particular case, it results data corruption at root 5 inode 283 >>>> offset 8192. >>>> >>>> Then I added some trace points for the following functions: >>>> >>>> - btrfs_do_write_iter() >>>> Two trace points, one before btrfs_direct_write(), and one >>>> before btrfs_buffered_write(), outputting the aligned and unaligned >>>> write range, root/inode number, type of the write (buffered or >>>> direct). >>>> >>>> - btrfs_finish_one_ordered() >>>> This is where btrfs inserts its ordered extent into the subvolume >>>> tree. >>>> This happens when a range of pages finishes its writeback. >>>> >>>> Then here comes the fsstress log for inode 283 (no btrfs root number): >>>> >>>> 0/22: clonerange d0/f2[283 1 0 0 0 0] [0,0] -> d0/f2[283 1 0 0 0 0] >>>> [307200,0] >>>> 0/23: copyrange d0/f2[283 1 0 0 0 0] [0,0] -> d0/f2[283 1 0 0 0 0] >>>> [1058819,0] >>>> 0/25: write d0/f2[283 2 0 0 0 0] [393644,88327] 0 >>>> 0/29: fallocate(INSERT_RANGE) d0/f3 [283 2 0 0 176 481971]t 884736 >>>> 585728 95 >>>> 0/30: uring_write d0/f3[283 2 0 0 176 481971] [1400622, 56456(res=56456)] 0 >>>> 0/31: writev d0/f3[283 2 0 0 296 1457078] [709121,8,964] 0 >>>> 0/33: do_aio_rw - xfsctl(XFS_IOC_DIOINFO) d0/f2[283 2 308134 1763236 320 >>>> 1457078] return 25, fallback to stat() >>>> 0/34: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[283 2 308134 1763236 320 >>>> 1457078] return 25, fallback to stat() >>>> 0/34: dwrite d0/f3[283 2 308134 1763236 320 1457078] [589824,16384] 0 >>>> 0/38: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[283 2 308134 1763236 496 >>>> 1457078] return 25, fallback to stat() >>>> 0/38: dwrite d0/f3[283 2 308134 1763236 496 1457078] [2084864,36864] 0 >>>> 0/39: write d0/d4/f6[283 2 308134 1763236 496 2121728] [2749000,60139] 0 >>>> 0/40: fallocate(ZERO_RANGE) d0/f3 [283 2 308134 1763236 688 2809139]t >>>> 3512660 81075 0 >>>> 0/43: splice d0/f5[293 1 0 0 1872 2678784] [552619,59420] -> d0/f3[283 2 >>>> 308134 1763236 856 3593735] [5603798,59420] 0 >>>> 0/48: fallocate(KEEP_SIZE|PUNCH_HOLE) d0/f3 [283 1 308134 1763236 976 >>>> 5663218]t 1361821 480392 0 >>>> 0/49: clonerange d0/f3[283 1 308134 1763236 856 5663218] [2461696,53248] >>>> -> d0/f5[293 1 0 0 1872 2678784] [942080,53248] >>>> >>>> Note one thing, there is no direct/buffered write into inode 283 offset >>>> 8192. 
>>>> >>>> But from the trace events for root 5 inode 283: >>>> >>>> btrfs_do_write_iter: r/i=5/283 buffered fileoff=393216(393644) >>>> len=90112(88327) >>>> btrfs_do_write_iter: r/i=5/283 buffered fileoff=1396736(1400622) >>>> len=61440(56456) >>>> btrfs_do_write_iter: r/i=5/283 buffered fileoff=708608(709121) >>>> len=12288(7712) >>>> >>>> btrfs_do_write_iter: r/i=5/283 direct fileoff=8192(8192) >>>> len=73728(73728) <<<<< >>>> >>>> btrfs_do_write_iter: r/i=5/283 direct fileoff=589824(589824) >>>> len=16384(16384) >>>> btrfs_finish_one_ordered: r/i=5/283 fileoff=8192 len=73728 >>>> btrfs_finish_one_ordered: r/i=5/283 fileoff=589824 len=16384 >>>> btrfs_do_write_iter: r/i=5/283 direct fileoff=2084864(2084864) >>>> len=36864(36864) >>>> btrfs_finish_one_ordered: r/i=5/283 fileoff=2084864 len=36864 >>>> btrfs_do_write_iter: r/i=5/283 buffered fileoff=2748416(2749000) >>>> len=61440(60139) >>>> btrfs_do_write_iter: r/i=5/283 buffered fileoff=5603328(5603798) >>>> len=61440(59420) >>>> btrfs_finish_one_ordered: r/i=5/283 fileoff=393216 len=90112 >>>> btrfs_finish_one_ordered: r/i=5/283 fileoff=708608 len=12288 >>>> btrfs_finish_one_ordered: r/i=5/283 fileoff=1396736 len=61440 >>>> btrfs_finish_one_ordered: r/i=5/283 fileoff=3592192 len=4096 >>>> btrfs_finish_one_ordered: r/i=5/283 fileoff=2748416 len=61440 >>>> btrfs_finish_one_ordered: r/i=5/283 fileoff=5603328 len=61440 >>>> >>>> Note that phantom direct IO call, which is in the corrupted range. >>>> >>>> If paired with fsstress, that phantom write happens between the two >>>> operations: >>>> >>>> 0/31: writev d0/f3[283 2 0 0 296 1457078] [709121,8,964] 0 >>>> 0/34: dwrite d0/f3[283 2 308134 1763236 320 1457078] [589824,16384] 0 >>> >>> Just to be more accurate, there is a 0/33 operation, which is: >>> >>> 0/33: do_aio_rw - xfsctl(XFS_IOC_DIOINFO) d0/f2[285 2 308134 1763236 320 >>> 1457078] return 25, fallback to stat() >>> 0/33: awrite - io_getevents failed -4 >>> >>> The failed one doesn't have inode number thus it didn't get caught by grep. >>> >>> Return value -4 means -INTR, not sure who sent the interruption. >>> But if this interruption happens before the IO finished, we can call >>> free() on the buffer, and if we're unlucky enough, the freed memory can >>> be re-allocated for some other usage, thus modifying the pages before >>> the writeback finished. >>> >>> I think this is the direct cause of the data corruption, page >>> modification before direct IO finished. >>> >>> But unfortunately I still didn't get why the interruption can happen, >>> nor how can we handle such interruption? >>> (I guess just retry?) >> >> It's because you are mixing aio/io_uring, and the default settings for >> io_uring is to use signal based notifications for queueing task_work. >> This then causes a spurious -EINTR, which stops your io_getevents() >> wait. Looks like this is a bug in fsstress, it should just retry the >> wait if this happens. You can also configure the ring to not use signal >> based notifications, but that bug needs fixing regardless. > > Something like this will probably fix it. 
> > > diff --git a/ltp/fsstress.c b/ltp/fsstress.c > index 6641a525fe5d..05fbfd3f8cf8 100644 > --- a/ltp/fsstress.c > +++ b/ltp/fsstress.c > @@ -2072,6 +2072,23 @@ void inode_info(char *str, size_t sz, struct stat64 *s, int verbose) > (long long) s->st_blocks, (long long) s->st_size); > } > > +static int io_get_single_event(struct io_event *event) > +{ > + int ret; > + > + do { > + /* > + * We can get -EINTR if competing with io_uring using signal > + * based notifications. For that case, just retry the wait. > + */ > + ret = io_getevents(io_ctx, 1, 1, event, NULL); > + if (ret != -EINTR) > + break; > + } while (1); > + > + return ret; > +} > + > void > afsync_f(opnum_t opno, long r) > { > @@ -2111,7 +2128,7 @@ afsync_f(opnum_t opno, long r) > close(fd); > return; > } > - if ((e = io_getevents(io_ctx, 1, 1, &event, NULL)) != 1) { > + if ((e = io_get_single_event(&event)) != 1) { > if (v) > printf("%d/%lld: afsync - io_getevents failed %d\n", > procid, opno, e); > @@ -2220,10 +2237,10 @@ do_aio_rw(opnum_t opno, long r, int flags) > if ((e = io_submit(io_ctx, 1, iocbs)) != 1) { > if (v) > printf("%d/%lld: %s - io_submit failed %d\n", > - procid, opno, iswrite ? "awrite" : "aread", e); > + procid, opno, iswrite ? "awrite" : "aread", e); > goto aio_out; > } > - if ((e = io_getevents(io_ctx, 1, 1, &event, NULL)) != 1) { > + if ((e = io_get_single_event(&event)) != 1) { > if (v) > printf("%d/%lld: %s - io_getevents failed %d\n", > procid, opno, iswrite ? "awrite" : "aread", e); > Exactly what I sent for fsstress: https://lore.kernel.org/linux-btrfs/[email protected]/T/#u Thanks, Qu ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Possible io_uring related race leads to btrfs data csum mismatch 2023-08-21 0:38 ` Qu Wenruo @ 2023-08-21 14:57 ` Jens Axboe 2023-08-21 21:42 ` Qu Wenruo 0 siblings, 1 reply; 22+ messages in thread From: Jens Axboe @ 2023-08-21 14:57 UTC (permalink / raw) To: Qu Wenruo, [email protected], Linux FS Devel, io-uring On 8/20/23 6:38 PM, Qu Wenruo wrote: > > > On 2023/8/20 22:11, Jens Axboe wrote: >> On 8/20/23 7:26 AM, Jens Axboe wrote: >>> On 8/19/23 6:22 PM, Qu Wenruo wrote: >>>> >>>> >>>> On 2023/8/20 07:59, Qu Wenruo wrote: >>>>> Hi Jens >>>>> >>>>> I tried more on my side to debug the situation, and found a very weird >>>>> write behavior: >>>>> >>>>> Some unexpected direct IO happened, without corresponding >>>>> fsstress workload. >>>>> >>>>> The workload is: >>>>> >>>>> $fsstress -p 7 -n 50 -s 1691396493 -w -d $mnt -v > /tmp/fsstress >>>>> >>>>> Which I can reliably reproduce the problem locally, around 1/50 >>>>> possibility. >>>>> In my particular case, it results data corruption at root 5 inode 283 >>>>> offset 8192. >>>>> >>>>> Then I added some trace points for the following functions: >>>>> >>>>> - btrfs_do_write_iter() >>>>> Two trace points, one before btrfs_direct_write(), and one >>>>> before btrfs_buffered_write(), outputting the aligned and unaligned >>>>> write range, root/inode number, type of the write (buffered or >>>>> direct). >>>>> >>>>> - btrfs_finish_one_ordered() >>>>> This is where btrfs inserts its ordered extent into the subvolume >>>>> tree. >>>>> This happens when a range of pages finishes its writeback. >>>>> >>>>> Then here comes the fsstress log for inode 283 (no btrfs root number): >>>>> >>>>> 0/22: clonerange d0/f2[283 1 0 0 0 0] [0,0] -> d0/f2[283 1 0 0 0 0] >>>>> [307200,0] >>>>> 0/23: copyrange d0/f2[283 1 0 0 0 0] [0,0] -> d0/f2[283 1 0 0 0 0] >>>>> [1058819,0] >>>>> 0/25: write d0/f2[283 2 0 0 0 0] [393644,88327] 0 >>>>> 0/29: fallocate(INSERT_RANGE) d0/f3 [283 2 0 0 176 481971]t 884736 >>>>> 585728 95 >>>>> 0/30: uring_write d0/f3[283 2 0 0 176 481971] [1400622, 56456(res=56456)] 0 >>>>> 0/31: writev d0/f3[283 2 0 0 296 1457078] [709121,8,964] 0 >>>>> 0/33: do_aio_rw - xfsctl(XFS_IOC_DIOINFO) d0/f2[283 2 308134 1763236 320 >>>>> 1457078] return 25, fallback to stat() >>>>> 0/34: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[283 2 308134 1763236 320 >>>>> 1457078] return 25, fallback to stat() >>>>> 0/34: dwrite d0/f3[283 2 308134 1763236 320 1457078] [589824,16384] 0 >>>>> 0/38: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[283 2 308134 1763236 496 >>>>> 1457078] return 25, fallback to stat() >>>>> 0/38: dwrite d0/f3[283 2 308134 1763236 496 1457078] [2084864,36864] 0 >>>>> 0/39: write d0/d4/f6[283 2 308134 1763236 496 2121728] [2749000,60139] 0 >>>>> 0/40: fallocate(ZERO_RANGE) d0/f3 [283 2 308134 1763236 688 2809139]t >>>>> 3512660 81075 0 >>>>> 0/43: splice d0/f5[293 1 0 0 1872 2678784] [552619,59420] -> d0/f3[283 2 >>>>> 308134 1763236 856 3593735] [5603798,59420] 0 >>>>> 0/48: fallocate(KEEP_SIZE|PUNCH_HOLE) d0/f3 [283 1 308134 1763236 976 >>>>> 5663218]t 1361821 480392 0 >>>>> 0/49: clonerange d0/f3[283 1 308134 1763236 856 5663218] [2461696,53248] >>>>> -> d0/f5[293 1 0 0 1872 2678784] [942080,53248] >>>>> >>>>> Note one thing, there is no direct/buffered write into inode 283 offset >>>>> 8192. 
>>>>> >>>>> But from the trace events for root 5 inode 283: >>>>> >>>>> btrfs_do_write_iter: r/i=5/283 buffered fileoff=393216(393644) >>>>> len=90112(88327) >>>>> btrfs_do_write_iter: r/i=5/283 buffered fileoff=1396736(1400622) >>>>> len=61440(56456) >>>>> btrfs_do_write_iter: r/i=5/283 buffered fileoff=708608(709121) >>>>> len=12288(7712) >>>>> >>>>> btrfs_do_write_iter: r/i=5/283 direct fileoff=8192(8192) >>>>> len=73728(73728) <<<<< >>>>> >>>>> btrfs_do_write_iter: r/i=5/283 direct fileoff=589824(589824) >>>>> len=16384(16384) >>>>> btrfs_finish_one_ordered: r/i=5/283 fileoff=8192 len=73728 >>>>> btrfs_finish_one_ordered: r/i=5/283 fileoff=589824 len=16384 >>>>> btrfs_do_write_iter: r/i=5/283 direct fileoff=2084864(2084864) >>>>> len=36864(36864) >>>>> btrfs_finish_one_ordered: r/i=5/283 fileoff=2084864 len=36864 >>>>> btrfs_do_write_iter: r/i=5/283 buffered fileoff=2748416(2749000) >>>>> len=61440(60139) >>>>> btrfs_do_write_iter: r/i=5/283 buffered fileoff=5603328(5603798) >>>>> len=61440(59420) >>>>> btrfs_finish_one_ordered: r/i=5/283 fileoff=393216 len=90112 >>>>> btrfs_finish_one_ordered: r/i=5/283 fileoff=708608 len=12288 >>>>> btrfs_finish_one_ordered: r/i=5/283 fileoff=1396736 len=61440 >>>>> btrfs_finish_one_ordered: r/i=5/283 fileoff=3592192 len=4096 >>>>> btrfs_finish_one_ordered: r/i=5/283 fileoff=2748416 len=61440 >>>>> btrfs_finish_one_ordered: r/i=5/283 fileoff=5603328 len=61440 >>>>> >>>>> Note that phantom direct IO call, which is in the corrupted range. >>>>> >>>>> If paired with fsstress, that phantom write happens between the two >>>>> operations: >>>>> >>>>> 0/31: writev d0/f3[283 2 0 0 296 1457078] [709121,8,964] 0 >>>>> 0/34: dwrite d0/f3[283 2 308134 1763236 320 1457078] [589824,16384] 0 >>>> >>>> Just to be more accurate, there is a 0/33 operation, which is: >>>> >>>> 0/33: do_aio_rw - xfsctl(XFS_IOC_DIOINFO) d0/f2[285 2 308134 1763236 320 >>>> 1457078] return 25, fallback to stat() >>>> 0/33: awrite - io_getevents failed -4 >>>> >>>> The failed one doesn't have inode number thus it didn't get caught by grep. >>>> >>>> Return value -4 means -INTR, not sure who sent the interruption. >>>> But if this interruption happens before the IO finished, we can call >>>> free() on the buffer, and if we're unlucky enough, the freed memory can >>>> be re-allocated for some other usage, thus modifying the pages before >>>> the writeback finished. >>>> >>>> I think this is the direct cause of the data corruption, page >>>> modification before direct IO finished. >>>> >>>> But unfortunately I still didn't get why the interruption can happen, >>>> nor how can we handle such interruption? >>>> (I guess just retry?) >>> >>> It's because you are mixing aio/io_uring, and the default settings for >>> io_uring is to use signal based notifications for queueing task_work. >>> This then causes a spurious -EINTR, which stops your io_getevents() >>> wait. Looks like this is a bug in fsstress, it should just retry the >>> wait if this happens. You can also configure the ring to not use signal >>> based notifications, but that bug needs fixing regardless. >> >> Something like this will probably fix it. 
>> >> >> diff --git a/ltp/fsstress.c b/ltp/fsstress.c >> index 6641a525fe5d..05fbfd3f8cf8 100644 >> --- a/ltp/fsstress.c >> +++ b/ltp/fsstress.c >> @@ -2072,6 +2072,23 @@ void inode_info(char *str, size_t sz, struct stat64 *s, int verbose) >> (long long) s->st_blocks, (long long) s->st_size); >> } >> >> +static int io_get_single_event(struct io_event *event) >> +{ >> + int ret; >> + >> + do { >> + /* >> + * We can get -EINTR if competing with io_uring using signal >> + * based notifications. For that case, just retry the wait. >> + */ >> + ret = io_getevents(io_ctx, 1, 1, event, NULL); >> + if (ret != -EINTR) >> + break; >> + } while (1); >> + >> + return ret; >> +} >> + >> void >> afsync_f(opnum_t opno, long r) >> { >> @@ -2111,7 +2128,7 @@ afsync_f(opnum_t opno, long r) >> close(fd); >> return; >> } >> - if ((e = io_getevents(io_ctx, 1, 1, &event, NULL)) != 1) { >> + if ((e = io_get_single_event(&event)) != 1) { >> if (v) >> printf("%d/%lld: afsync - io_getevents failed %d\n", >> procid, opno, e); >> @@ -2220,10 +2237,10 @@ do_aio_rw(opnum_t opno, long r, int flags) >> if ((e = io_submit(io_ctx, 1, iocbs)) != 1) { >> if (v) >> printf("%d/%lld: %s - io_submit failed %d\n", >> - procid, opno, iswrite ? "awrite" : "aread", e); >> + procid, opno, iswrite ? "awrite" : "aread", e); >> goto aio_out; >> } >> - if ((e = io_getevents(io_ctx, 1, 1, &event, NULL)) != 1) { >> + if ((e = io_get_single_event(&event)) != 1) { >> if (v) >> printf("%d/%lld: %s - io_getevents failed %d\n", >> procid, opno, iswrite ? "awrite" : "aread", e); >> > Exactly what I sent for fsstress: > https://lore.kernel.org/linux-btrfs/[email protected]/T/#u It's not really, as you only did the one case of io_getevents(). What happens if the other one gets EINTR and aborts, now we do a rw operation and the first event returned is the one from the fsync? You should not just fix up the one that you happened to hit, fix up both of them. -- Jens Axboe ^ permalink raw reply [flat|nested] 22+ messages in thread
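A hedged sketch of the hazard Jens describes (illustration only; this is not the fsstress fix, and the helper name is made up): if the afsync_f() wait gives up on -EINTR, its completion stays queued in the io_context, and a later single-event io_getevents() can hand back that stale fsync completion instead of the I/O just submitted. Comparing event->obj with the expected iocb makes such a mix-up visible; the actual fix is simply to retry the wait at both call sites, as in the patch above.

```c
#include <errno.h>
#include <libaio.h>

/*
 * Illustrative only: wait for one event and verify it belongs to the
 * iocb we just submitted, to surface leftover completions from an
 * earlier wait that was aborted by -EINTR.
 */
static int wait_for_this_iocb(io_context_t ctx, struct iocb *expected,
			      struct io_event *event)
{
	int ret;

	do {
		ret = io_getevents(ctx, 1, 1, event, NULL);
	} while (ret == -EINTR);

	if (ret == 1 && event->obj != expected)
		return -ESTALE;	/* stale completion from an earlier op */

	return ret;
}
```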
* Re: Possible io_uring related race leads to btrfs data csum mismatch 2023-08-21 14:57 ` Jens Axboe @ 2023-08-21 21:42 ` Qu Wenruo 0 siblings, 0 replies; 22+ messages in thread From: Qu Wenruo @ 2023-08-21 21:42 UTC (permalink / raw) To: Jens Axboe, [email protected], Linux FS Devel, io-uring On 2023/8/21 22:57, Jens Axboe wrote: > On 8/20/23 6:38 PM, Qu Wenruo wrote: >> >> >> On 2023/8/20 22:11, Jens Axboe wrote: >>> On 8/20/23 7:26 AM, Jens Axboe wrote: >>>> On 8/19/23 6:22 PM, Qu Wenruo wrote: >>>>> >>>>> >>>>> On 2023/8/20 07:59, Qu Wenruo wrote: >>>>>> Hi Jens >>>>>> >>>>>> I tried more on my side to debug the situation, and found a very weird >>>>>> write behavior: >>>>>> >>>>>> Some unexpected direct IO happened, without corresponding >>>>>> fsstress workload. >>>>>> >>>>>> The workload is: >>>>>> >>>>>> $fsstress -p 7 -n 50 -s 1691396493 -w -d $mnt -v > /tmp/fsstress >>>>>> >>>>>> Which I can reliably reproduce the problem locally, around 1/50 >>>>>> possibility. >>>>>> In my particular case, it results data corruption at root 5 inode 283 >>>>>> offset 8192. >>>>>> >>>>>> Then I added some trace points for the following functions: >>>>>> >>>>>> - btrfs_do_write_iter() >>>>>> Two trace points, one before btrfs_direct_write(), and one >>>>>> before btrfs_buffered_write(), outputting the aligned and unaligned >>>>>> write range, root/inode number, type of the write (buffered or >>>>>> direct). >>>>>> >>>>>> - btrfs_finish_one_ordered() >>>>>> This is where btrfs inserts its ordered extent into the subvolume >>>>>> tree. >>>>>> This happens when a range of pages finishes its writeback. >>>>>> >>>>>> Then here comes the fsstress log for inode 283 (no btrfs root number): >>>>>> >>>>>> 0/22: clonerange d0/f2[283 1 0 0 0 0] [0,0] -> d0/f2[283 1 0 0 0 0] >>>>>> [307200,0] >>>>>> 0/23: copyrange d0/f2[283 1 0 0 0 0] [0,0] -> d0/f2[283 1 0 0 0 0] >>>>>> [1058819,0] >>>>>> 0/25: write d0/f2[283 2 0 0 0 0] [393644,88327] 0 >>>>>> 0/29: fallocate(INSERT_RANGE) d0/f3 [283 2 0 0 176 481971]t 884736 >>>>>> 585728 95 >>>>>> 0/30: uring_write d0/f3[283 2 0 0 176 481971] [1400622, 56456(res=56456)] 0 >>>>>> 0/31: writev d0/f3[283 2 0 0 296 1457078] [709121,8,964] 0 >>>>>> 0/33: do_aio_rw - xfsctl(XFS_IOC_DIOINFO) d0/f2[283 2 308134 1763236 320 >>>>>> 1457078] return 25, fallback to stat() >>>>>> 0/34: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[283 2 308134 1763236 320 >>>>>> 1457078] return 25, fallback to stat() >>>>>> 0/34: dwrite d0/f3[283 2 308134 1763236 320 1457078] [589824,16384] 0 >>>>>> 0/38: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[283 2 308134 1763236 496 >>>>>> 1457078] return 25, fallback to stat() >>>>>> 0/38: dwrite d0/f3[283 2 308134 1763236 496 1457078] [2084864,36864] 0 >>>>>> 0/39: write d0/d4/f6[283 2 308134 1763236 496 2121728] [2749000,60139] 0 >>>>>> 0/40: fallocate(ZERO_RANGE) d0/f3 [283 2 308134 1763236 688 2809139]t >>>>>> 3512660 81075 0 >>>>>> 0/43: splice d0/f5[293 1 0 0 1872 2678784] [552619,59420] -> d0/f3[283 2 >>>>>> 308134 1763236 856 3593735] [5603798,59420] 0 >>>>>> 0/48: fallocate(KEEP_SIZE|PUNCH_HOLE) d0/f3 [283 1 308134 1763236 976 >>>>>> 5663218]t 1361821 480392 0 >>>>>> 0/49: clonerange d0/f3[283 1 308134 1763236 856 5663218] [2461696,53248] >>>>>> -> d0/f5[293 1 0 0 1872 2678784] [942080,53248] >>>>>> >>>>>> Note one thing, there is no direct/buffered write into inode 283 offset >>>>>> 8192. 
>>>>>> >>>>>> But from the trace events for root 5 inode 283: >>>>>> >>>>>> btrfs_do_write_iter: r/i=5/283 buffered fileoff=393216(393644) >>>>>> len=90112(88327) >>>>>> btrfs_do_write_iter: r/i=5/283 buffered fileoff=1396736(1400622) >>>>>> len=61440(56456) >>>>>> btrfs_do_write_iter: r/i=5/283 buffered fileoff=708608(709121) >>>>>> len=12288(7712) >>>>>> >>>>>> btrfs_do_write_iter: r/i=5/283 direct fileoff=8192(8192) >>>>>> len=73728(73728) <<<<< >>>>>> >>>>>> btrfs_do_write_iter: r/i=5/283 direct fileoff=589824(589824) >>>>>> len=16384(16384) >>>>>> btrfs_finish_one_ordered: r/i=5/283 fileoff=8192 len=73728 >>>>>> btrfs_finish_one_ordered: r/i=5/283 fileoff=589824 len=16384 >>>>>> btrfs_do_write_iter: r/i=5/283 direct fileoff=2084864(2084864) >>>>>> len=36864(36864) >>>>>> btrfs_finish_one_ordered: r/i=5/283 fileoff=2084864 len=36864 >>>>>> btrfs_do_write_iter: r/i=5/283 buffered fileoff=2748416(2749000) >>>>>> len=61440(60139) >>>>>> btrfs_do_write_iter: r/i=5/283 buffered fileoff=5603328(5603798) >>>>>> len=61440(59420) >>>>>> btrfs_finish_one_ordered: r/i=5/283 fileoff=393216 len=90112 >>>>>> btrfs_finish_one_ordered: r/i=5/283 fileoff=708608 len=12288 >>>>>> btrfs_finish_one_ordered: r/i=5/283 fileoff=1396736 len=61440 >>>>>> btrfs_finish_one_ordered: r/i=5/283 fileoff=3592192 len=4096 >>>>>> btrfs_finish_one_ordered: r/i=5/283 fileoff=2748416 len=61440 >>>>>> btrfs_finish_one_ordered: r/i=5/283 fileoff=5603328 len=61440 >>>>>> >>>>>> Note that phantom direct IO call, which is in the corrupted range. >>>>>> >>>>>> If paired with fsstress, that phantom write happens between the two >>>>>> operations: >>>>>> >>>>>> 0/31: writev d0/f3[283 2 0 0 296 1457078] [709121,8,964] 0 >>>>>> 0/34: dwrite d0/f3[283 2 308134 1763236 320 1457078] [589824,16384] 0 >>>>> >>>>> Just to be more accurate, there is a 0/33 operation, which is: >>>>> >>>>> 0/33: do_aio_rw - xfsctl(XFS_IOC_DIOINFO) d0/f2[285 2 308134 1763236 320 >>>>> 1457078] return 25, fallback to stat() >>>>> 0/33: awrite - io_getevents failed -4 >>>>> >>>>> The failed one doesn't have inode number thus it didn't get caught by grep. >>>>> >>>>> Return value -4 means -INTR, not sure who sent the interruption. >>>>> But if this interruption happens before the IO finished, we can call >>>>> free() on the buffer, and if we're unlucky enough, the freed memory can >>>>> be re-allocated for some other usage, thus modifying the pages before >>>>> the writeback finished. >>>>> >>>>> I think this is the direct cause of the data corruption, page >>>>> modification before direct IO finished. >>>>> >>>>> But unfortunately I still didn't get why the interruption can happen, >>>>> nor how can we handle such interruption? >>>>> (I guess just retry?) >>>> >>>> It's because you are mixing aio/io_uring, and the default settings for >>>> io_uring is to use signal based notifications for queueing task_work. >>>> This then causes a spurious -EINTR, which stops your io_getevents() >>>> wait. Looks like this is a bug in fsstress, it should just retry the >>>> wait if this happens. You can also configure the ring to not use signal >>>> based notifications, but that bug needs fixing regardless. >>> >>> Something like this will probably fix it. 
>>> >>> >>> diff --git a/ltp/fsstress.c b/ltp/fsstress.c >>> index 6641a525fe5d..05fbfd3f8cf8 100644 >>> --- a/ltp/fsstress.c >>> +++ b/ltp/fsstress.c >>> @@ -2072,6 +2072,23 @@ void inode_info(char *str, size_t sz, struct stat64 *s, int verbose) >>> (long long) s->st_blocks, (long long) s->st_size); >>> } >>> >>> +static int io_get_single_event(struct io_event *event) >>> +{ >>> + int ret; >>> + >>> + do { >>> + /* >>> + * We can get -EINTR if competing with io_uring using signal >>> + * based notifications. For that case, just retry the wait. >>> + */ >>> + ret = io_getevents(io_ctx, 1, 1, event, NULL); >>> + if (ret != -EINTR) >>> + break; >>> + } while (1); >>> + >>> + return ret; >>> +} >>> + >>> void >>> afsync_f(opnum_t opno, long r) >>> { >>> @@ -2111,7 +2128,7 @@ afsync_f(opnum_t opno, long r) >>> close(fd); >>> return; >>> } >>> - if ((e = io_getevents(io_ctx, 1, 1, &event, NULL)) != 1) { >>> + if ((e = io_get_single_event(&event)) != 1) { >>> if (v) >>> printf("%d/%lld: afsync - io_getevents failed %d\n", >>> procid, opno, e); >>> @@ -2220,10 +2237,10 @@ do_aio_rw(opnum_t opno, long r, int flags) >>> if ((e = io_submit(io_ctx, 1, iocbs)) != 1) { >>> if (v) >>> printf("%d/%lld: %s - io_submit failed %d\n", >>> - procid, opno, iswrite ? "awrite" : "aread", e); >>> + procid, opno, iswrite ? "awrite" : "aread", e); >>> goto aio_out; >>> } >>> - if ((e = io_getevents(io_ctx, 1, 1, &event, NULL)) != 1) { >>> + if ((e = io_get_single_event(&event)) != 1) { >>> if (v) >>> printf("%d/%lld: %s - io_getevents failed %d\n", >>> procid, opno, iswrite ? "awrite" : "aread", e); >>> >> Exactly what I sent for fsstress: >> https://lore.kernel.org/linux-btrfs/[email protected]/T/#u > > It's not really, as you only did the one case of io_getevents(). What > happens if the other one gets EINTR and aborts, now we do a rw operation > and the first event returned is the one from the fsync? Oh my bad, forgot there is another one. Would fix it in the next update. Thanks, Qu > > You should not just fix up the one that you happened to hit, fix up both > of them. > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Possible io_uring related race leads to btrfs data csum mismatch 2023-08-16 21:46 ` Qu Wenruo 2023-08-16 22:28 ` Jens Axboe @ 2023-08-16 22:36 ` Jens Axboe 2023-08-17 0:40 ` Qu Wenruo 1 sibling, 1 reply; 22+ messages in thread From: Jens Axboe @ 2023-08-16 22:36 UTC (permalink / raw) To: Qu Wenruo, [email protected], Linux FS Devel, io-uring On 8/16/23 3:46 PM, Qu Wenruo wrote: >> Is that write back caching enabled? >> Write back caching with volatile write cache? For your device, can you >> do: >> >> $ grep . /sys/block/$dev/queue/* You didn't answer this one either. -- Jens Axboe ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Possible io_uring related race leads to btrfs data csum mismatch 2023-08-16 22:36 ` Jens Axboe @ 2023-08-17 0:40 ` Qu Wenruo 0 siblings, 0 replies; 22+ messages in thread From: Qu Wenruo @ 2023-08-17 0:40 UTC (permalink / raw) To: Jens Axboe, [email protected], Linux FS Devel, io-uring On 2023/8/17 06:36, Jens Axboe wrote: > On 8/16/23 3:46 PM, Qu Wenruo wrote: >>> Is that write back caching enabled? >>> Write back caching with volatile write cache? For your device, can you >>> do: >>> >>> $ grep . /sys/block/$dev/queue/* > > You didn't answer this one either. > Sorry, here is the output: $ grep . /sys/block/vdb/queue/* /sys/block/vdb/queue/add_random:0 /sys/block/vdb/queue/chunk_sectors:0 /sys/block/vdb/queue/dax:0 /sys/block/vdb/queue/discard_granularity:512 /sys/block/vdb/queue/discard_max_bytes:2147483136 /sys/block/vdb/queue/discard_max_hw_bytes:2147483136 /sys/block/vdb/queue/discard_zeroes_data:0 /sys/block/vdb/queue/dma_alignment:511 /sys/block/vdb/queue/fua:0 /sys/block/vdb/queue/hw_sector_size:512 /sys/block/vdb/queue/io_poll:0 /sys/block/vdb/queue/io_poll_delay:-1 /sys/block/vdb/queue/iostats:1 /sys/block/vdb/queue/logical_block_size:512 /sys/block/vdb/queue/max_discard_segments:1 /sys/block/vdb/queue/max_hw_sectors_kb:2147483647 /sys/block/vdb/queue/max_integrity_segments:0 /sys/block/vdb/queue/max_sectors_kb:1280 /sys/block/vdb/queue/max_segment_size:4294967295 /sys/block/vdb/queue/max_segments:254 /sys/block/vdb/queue/minimum_io_size:512 /sys/block/vdb/queue/nomerges:0 /sys/block/vdb/queue/nr_requests:256 /sys/block/vdb/queue/nr_zones:0 /sys/block/vdb/queue/optimal_io_size:0 /sys/block/vdb/queue/physical_block_size:512 /sys/block/vdb/queue/read_ahead_kb:128 /sys/block/vdb/queue/rotational:1 /sys/block/vdb/queue/rq_affinity:1 /sys/block/vdb/queue/scheduler:[none] mq-deadline kyber bfq /sys/block/vdb/queue/stable_writes:0 /sys/block/vdb/queue/throttle_sample_time:100 /sys/block/vdb/queue/virt_boundary_mask:0 /sys/block/vdb/queue/wbt_lat_usec:75000 /sys/block/vdb/queue/write_cache:write back /sys/block/vdb/queue/write_same_max_bytes:0 /sys/block/vdb/queue/write_zeroes_max_bytes:2147483136 /sys/block/vdb/queue/zone_append_max_bytes:0 /sys/block/vdb/queue/zone_write_granularity:0 /sys/block/vdb/queue/zoned:none Thanks, Qu ^ permalink raw reply [flat|nested] 22+ messages in thread
end of thread

Thread overview: 22+ messages (newest: 2023-08-21 21:42 UTC)
2023-08-16  6:52 Possible io_uring related race leads to btrfs data csum mismatch Qu Wenruo
2023-08-16 14:33 ` Jens Axboe
2023-08-16 14:49 ` Jens Axboe
2023-08-16 21:46 ` Qu Wenruo
2023-08-16 22:28 ` Jens Axboe
2023-08-17  1:05 ` Qu Wenruo
2023-08-17  1:12 ` Jens Axboe
2023-08-17  1:19 ` Qu Wenruo
2023-08-17  1:23 ` Jens Axboe
2023-08-17  1:31 ` Qu Wenruo
2023-08-17  1:32 ` Jens Axboe
2023-08-19 23:59 ` Qu Wenruo
2023-08-20  0:22 ` Qu Wenruo
2023-08-20 13:26 ` Jens Axboe
2023-08-20 14:11 ` Jens Axboe
2023-08-20 18:18 ` Matthew Wilcox
2023-08-20 18:40 ` Jens Axboe
2023-08-21  0:38 ` Qu Wenruo
2023-08-21 14:57 ` Jens Axboe
2023-08-21 21:42 ` Qu Wenruo
2023-08-16 22:36 ` Jens Axboe
2023-08-17  0:40 ` Qu Wenruo