From: Qu Wenruo <[email protected]>
To: Jens Axboe <[email protected]>,
"[email protected]" <[email protected]>,
Linux FS Devel <[email protected]>,
[email protected]
Subject: Re: Possible io_uring related race leads to btrfs data csum mismatch
Date: Thu, 17 Aug 2023 09:05:56 +0800
Message-ID: <[email protected]>
In-Reply-To: <[email protected]>
On 2023/8/17 06:28, Jens Axboe wrote:
[...]
>
>>> 2) What's the .config you are using?
>>
>> Pretty common config, no heavy debug options (KASAN etc).
>
> Please just send the .config, I'd rather not have to guess. Things like
> preempt etc may make a difference in reproducing this.
Sure, please see the attached config.gz
>
>>>> At least here, with a 6-core VM (host has 8C/16T) and fast enough
>>>> storage (PCIe 4.0 NVMe, with unsafe cache mode), there is roughly a
>>>> 1/100 chance of hitting the error.
>>>
>>> What does "unsafe cache mode" mean?
>>
>> Libvirt cache option "unsafe"
>>
>> Which mostly means ignoring flush/FUA commands and fully relying on
>> the host fs cache (in my case it's file backed).
>
> Gotcha
>
>>> Is that writeback caching enabled?
>>> Writeback caching with a volatile write cache? For your device, can
>>> you do:
>>>
>>> $ grep . /sys/block/$dev/queue/*
>>>
>>>> Checking the fsstress verbose log against the failed file, it turns out
>>>> to be an io_uring write.
>>>
>>> Any more details on what the write looks like?
>>
>> For the involved file, the log shows the following operations for the
>> minimal seed/-p/-n combination that still reproduces the issue:
>>
>> ```
>> 0/24: link d0/f2 d0/f3 0
>> 0/29: fallocate(INSERT_RANGE) d0/f3 [276 2 0 0 176 481971]t 884736 585728 95
>> 0/30: uring_write d0/f3[276 2 0 0 176 481971] [1400622, 56456(res=56456)] 0
>> 0/31: writev d0/f3[276 2 0 0 296 1457078] [709121,8,964] 0
>> 0/34: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[276 2 308134 1763236 320 1457078] return 25, fallback to stat()
>> 0/34: dwrite d0/f3[276 2 308134 1763236 320 1457078] [589824,16384] 0
>> 0/38: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[276 2 308134 1763236 496 1457078] return 25, fallback to stat()
>> 0/38: dwrite d0/f3[276 2 308134 1763236 496 1457078] [2084864,36864] 0
>> 0/40: fallocate(ZERO_RANGE) d0/f3 [276 2 308134 1763236 688 2809139]t 3512660 81075 0
>> 0/43: splice d0/f5[289 1 0 0 1872 2678784] [552619,59420] -> d0/f3[276 2 308134 1763236 856 3593735] [5603798,59420] 0
>> 0/48: fallocate(KEEP_SIZE|PUNCH_HOLE) d0/f3 [276 1 308134 1763236 976 5663218]t 1361821 480392 0
>> 0/49: clonerange d0/f3[276 1 308134 1763236 856 5663218] [2461696,53248] -> d0/f5[289 1 0 0 1872 2678784] [942080,53248]
>> ```
>
> And just to be sure, this is not mixing dio and buffered, right?
I'd say it's mixing: there are dwrite() and writev() calls against the
same file, though at least with this particular seed they don't overlap,
nor are they concurrent (all are issued sequentially from the same
process).

However, if only uring_write is disabled, the problem no longer
reproduces, so there must be some btrfs path that only uring_write
exercises.
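To make the pattern concrete, the non-io_uring part of that mix amounts
to roughly the following (a sketch only, not the actual fsstress code;
offsets/lengths are lifted from the log above, error handling omitted):

```
#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
	int bfd = open("d0/f3", O_RDWR);            /* buffered fd */
	int dfd = open("d0/f3", O_RDWR | O_DIRECT); /* direct-IO fd */
	static char buf[16384] __attribute__((aligned(4096)));
	struct iovec iov = { .iov_base = buf, .iov_len = 964 };

	memset(buf, 0xaa, sizeof(buf));
	pwritev(bfd, &iov, 1, 709121);   /* writev: buffered write */
	pwrite(dfd, buf, 16384, 589824); /* dwrite: O_DIRECT, aligned */
	/* uring_write is also a buffered write, just submitted through
	 * io_uring -- see the liburing sketch further down. */
	close(dfd);
	close(bfd);
	return 0;
}
```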
>
>>>> However, I didn't see any io_uring related callback inside the btrfs
>>>> code; any advice on the io_uring part would be appreciated.
>>>
>>> io_uring doesn't do anything special here, it uses the normal page cache
> read/write parts for buffered IO. But you may get extra parallelism
>>> with io_uring here. For example, with the buffered write that this most
>>> likely is, libaio would be exactly the same as a pwrite(2) on the file.
>>> If this would've blocked, io_uring would offload this to a helper
>>> thread. Depending on the workload, you could have multiple of those in
>>> progress at the same time.
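For reference then, the uring_write above boils down to a plain buffered
write submitted through io_uring, roughly like this (a minimal liburing
sketch; the file name, offset and length are just the ones from the log
entry, error handling omitted):

```
#define _GNU_SOURCE
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	static char buf[56456];          /* length from the log entry */
	int fd = open("d0/f3", O_RDWR);

	io_uring_queue_init(8, &ring, 0);
	sqe = io_uring_get_sqe(&ring);
	/* A plain buffered write: it goes through the same ->write_iter()
	 * as pwrite(2); if it would block, io_uring punts it to an io-wq
	 * helper thread instead. */
	io_uring_prep_write(sqe, fd, buf, sizeof(buf), 1400622);
	io_uring_submit(&ring);
	io_uring_wait_cqe(&ring, &cqe);
	printf("res=%d\n", cqe->res);    /* the log showed res=56456 */
	io_uring_cqe_seen(&ring, cqe);
	io_uring_queue_exit(&ring);
	return 0;
}
```

(Build with -luring; nothing here is btrfs-specific, which matches your
point that io_uring just feeds the regular ->write_iter() path.)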
>>
>> My biggest concern is: would io_uring modify the page while it's still
>> under writeback?
>
> No, of course not. Like I mentioned, io_uring doesn't do anything that
> the normal read/write path isn't already doing - it's using the same
> ->read_iter() and ->write_iter() that everything else is, there's no
> page cache code in io_uring.
>
>> In that case, it's going to cause a csum mismatch, as btrfs relies on
>> pages under writeback staying unchanged.
>
> Sure, I'm aware of the stable page requirements.
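To spell out the failure mode I'm worried about: btrfs computes the data
csum at bio submit time, so any modification of the page between that
point and the device reading it produces a mismatch on read-back. An
illustrative sketch (not btrfs code; a trivial checksum stands in for
crc32c, the point is the ordering, not the algorithm):

```
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static uint32_t csum(const uint8_t *p, size_t len)
{
	uint32_t s = 0;

	while (len--)
		s = s * 31 + *p++;
	return s;
}

int main(void)
{
	uint8_t page[4096] = { 0 };
	uint8_t on_disk[4096];

	uint32_t at_submit = csum(page, sizeof(page)); /* at bio submit */
	page[42] ^= 0xff;                /* page modified "in flight" */
	memcpy(on_disk, page, sizeof(page)); /* device sees current data */

	printf("submit csum=%08x on-disk csum=%08x -> %s\n",
	       at_submit, csum(on_disk, sizeof(on_disk)),
	       at_submit == csum(on_disk, sizeof(on_disk)) ?
	       "ok" : "MISMATCH");
	return 0;
}
```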
>
> Also see my followup email for a patch to test.
>
Applied and tested; with "-p 10 -n 1000" as the fsstress workload, it
failed at the 23rd run.
Thanks,
Qu
[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 35693 bytes --]