On 2023/8/17 06:28, Jens Axboe wrote:
[...]
>
>>> 2) What's the .config you are using?
>>
>> Pretty common config, no heavy debug options (KASAN etc).
>
> Please just send the .config, I'd rather not have to guess. Things like
> preempt etc may make a difference in reproducing this.

Sure, please see the attached config.gz.

>
>>>> At least here, with a VM with 6 cores (host has 8C/16T), fast enough
>>>> storage (PCIE4.0 NVME, with unsafe cache mode), it has the chance around
>>>> 1/100 to hit the error.
>>>
>>> What does "unsafe cache mode" mean?
>>
>> Libvirt cache option "unsafe"
>>
>> Which is mostly ignoring flush/fua commands and fully rely on host fs
>> (in my case it's file backed) cache.
>
> Gotcha
>
>>> Is that write back caching enabled?
>>> Write back caching with volatile write cache? For your device, can you
>>> do:
>>>
>>> $ grep . /sys/block/$dev/queue/*
>>>
>>>> Checking the fsstress verbose log against the failed file, it turns out
>>>> to be an io_uring write.
>>>
>>> Any more details on what the write looks like?
>>
>> For the involved file, it shows the following operations for the minimal
>> reproducible seed/-p/-n combination:
>>
>> ```
>> 0/24: link d0/f2 d0/f3 0
>> 0/29: fallocate(INSERT_RANGE) d0/f3 [276 2 0 0 176 481971]t 884736 585728 95
>> 0/30: uring_write d0/f3[276 2 0 0 176 481971] [1400622, 56456(res=56456)] 0
>> 0/31: writev d0/f3[276 2 0 0 296 1457078] [709121,8,964] 0
>> 0/34: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[276 2 308134 1763236 320 1457078] return 25, fallback to stat()
>> 0/34: dwrite d0/f3[276 2 308134 1763236 320 1457078] [589824,16384] 0
>> 0/38: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[276 2 308134 1763236 496 1457078] return 25, fallback to stat()
>> 0/38: dwrite d0/f3[276 2 308134 1763236 496 1457078] [2084864,36864] 0
>> 0/40: fallocate(ZERO_RANGE) d0/f3 [276 2 308134 1763236 688 2809139]t 3512660 81075 0
>> 0/43: splice d0/f5[289 1 0 0 1872 2678784] [552619,59420] -> d0/f3[276 2 308134 1763236 856 3593735] [5603798,59420] 0
>> 0/48: fallocate(KEEP_SIZE|PUNCH_HOLE) d0/f3 [276 1 308134 1763236 976 5663218]t 1361821 480392 0
>> 0/49: clonerange d0/f3[276 1 308134 1763236 856 5663218] [2461696,53248] -> d0/f5[289 1 0 0 1872 2678784] [942080,53248]
>> ```
>
> And just to be sure, this is not mixing dio and buffered, right?

I'd say it is mixing: there are dwrite() and writev() calls on the same
file. But at least with this particular seed they do not overlap, nor
are they concurrent (all are issued sequentially inside the same
process).

But since the failure no longer reproduces once only uring_write is
disabled, there must be some untested btrfs path triggered by
uring_write.

>
>>>> However I didn't see any io_uring related callback inside btrfs code,
>>>> any advice on the io_uring part would be appreciated.
>>>
>>> io_uring doesn't do anything special here, it uses the normal page cache
>>> read/write parts for buffered IO. But you may get extra parallelism
>>> with io_uring here.
>>> For example, with the buffered write that this most
>>> likely is, libaio would be exactly the same as a pwrite(2) on the file.
>>> If this would've blocked, io_uring would offload this to a helper
>>> thread. Depending on the workload, you could have multiple of those in
>>> progress at the same time.
>>
>> My biggest concern is, would io_uring modify the page when it's still
>> under writeback?
>
> No, of course not. Like I mentioned, io_uring doesn't do anything that
> the normal read/write path isn't already doing - it's using the same
> ->read_iter() and ->write_iter() that everything else is, there's no
> page cache code in io_uring.
>
>> In that case, it's going to cause csum mismatch as btrfs relies on the
>> page under writeback to be unchanged.
>
> Sure, I'm aware of the stable page requirements.
>
> See my followup email as well for a patch to test.
>

Applied and tested: using "-p 10 -n 1000" as the fsstress workload, it
failed at the 23rd run.

Thanks,
Qu
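
P.S. As a sanity check on the "not overlapping" observation above, here
is a minimal sketch (not part of fsstress or btrfs) that compares the
byte ranges of the buffered and direct writes to d0/f3 taken from the
log. It assumes the writev triple [709121,8,964] means offset 709121
with 964 iovecs of 8 bytes each; the names `buffered`, `direct`,
`overlaps` and `find_overlaps` are only illustrative. It ignores the
fallocate/splice/clonerange operations and checks plain byte ranges,
not page granularity.

```python
# Write ranges on d0/f3, copied from the fsstress log as (offset, length).
buffered = {
    "uring_write 0/30": (1400622, 56456),
    # Assumption: [709121,8,964] = offset, per-iovec length, iovec count.
    "writev 0/31": (709121, 8 * 964),
}
direct = {
    "dwrite 0/34": (589824, 16384),
    "dwrite 0/38": (2084864, 36864),
}


def overlaps(a, b):
    """Return True if half-open byte ranges [off, off + len) intersect."""
    a_off, a_len = a
    b_off, b_len = b
    return a_off < b_off + b_len and b_off < a_off + a_len


def find_overlaps(buffered, direct):
    """List every (buffered, direct) pair whose byte ranges intersect."""
    return [
        (b_name, d_name)
        for b_name, b_range in buffered.items()
        for d_name, d_range in direct.items()
        if overlaps(b_range, d_range)
    ]


if __name__ == "__main__":
    print(find_overlaps(buffered, direct))
```

With the ranges above this prints an empty list, i.e. under the stated
assumption no buffered write range intersects a direct write range.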