From: Jens Axboe <[email protected]>
To: Qu Wenruo <[email protected]>,
"[email protected]" <[email protected]>,
Linux FS Devel <[email protected]>,
[email protected]
Subject: Re: Possible io_uring related race leads to btrfs data csum mismatch
Date: Wed, 16 Aug 2023 16:28:25 -0600
Message-ID: <[email protected]>
In-Reply-To: <[email protected]>
On 8/16/23 3:46 PM, Qu Wenruo wrote:
>
>
> On 2023/8/16 22:33, Jens Axboe wrote:
>> On 8/16/23 12:52 AM, Qu Wenruo wrote:
>>> Hi,
>>>
>>> Recently I've been digging into a very rare failure during
>>> btrfs/06[234567], where btrfs scrub detects unrepairable data corruption.
>>>
>>> After days of digging, I have a much smaller reproducer:
>>>
>>> ```
>>> # $dev1, $mnt and $fsstress must point to a scratch block device,
>>> # a mount point, and the fsstress binary respectively.
>>> fail()
>>> {
>>>     echo "!!! FAILED !!!"
>>>     exit 1
>>> }
>>>
>>> workload()
>>> {
>>>     mkfs.btrfs -f -m single -d single --csum sha256 $dev1
>>>     mount $dev1 $mnt
>>>     # There are around 10 more combinations with different
>>>     # seed and -p/-n parameters, but this is the smallest one
>>>     # I found so far.
>>>     $fsstress -p 7 -n 50 -s 1691396493 -w -d $mnt
>>>     umount $mnt
>>>     btrfs check --check-data-csum $dev1 || fail
>>> }
>>>
>>> runtime=1024
>>> for (( i = 0; i < $runtime; i++ )); do
>>>     echo "=== $i / $runtime ==="
>>>     workload
>>> done
>>> ```
>>
>> Tried to reproduce this, both on a VM and on a real host, and no luck so
>> far. I've got a few followup questions, as your report is missing some
>> important info:
>
> You may want to try much higher -p/-n numbers.
>
> For verification purposes, I normally go with -p 10 -n 10000, which has a
> much higher chance of hitting it, but is definitely too noisy for debugging.
>
> I just tried a run with "$fsstress -p 10 -n 10000 -w -d $mnt" as the
> workload, it failed at 21/1024.
OK, I'll try that.
>> 1) What kernel are you running?
>
> David's misc-next branch, i.e. the latest upstream tag plus some btrfs
> patches queued for the next merge window.
>
> Although I have some internal reports showing this problem from quite
> some time ago.
That's what I was getting at - whether it's new or not.
>> 2) What's the .config you are using?
>
> Pretty common config, no heavy debug options (KASAN etc).
Please just send the .config; I'd rather not have to guess. Things like
preempt etc may make a difference in reproducing this.
>>> At least here, with a VM with 6 cores (host has 8C/16T) and fast enough
>>> storage (PCIe 4.0 NVMe, with unsafe cache mode), it has around a 1/100
>>> chance of hitting the error.
>>
>> What does "unsafe cache mode" mean?
>
> The libvirt cache option "unsafe", which mostly ignores flush/FUA
> commands and fully relies on the host fs cache (in my case the backing
> storage is a file).
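>
> In the domain XML that's just the cache attribute on the disk's
> <driver> element, something like this (the driver/source/target details
> here are illustrative and from memory; cache='unsafe' is the relevant
> bit):
>
> ```
> <disk type='file' device='disk'>
>   <driver name='qemu' type='raw' cache='unsafe'/>
>   <source file='/path/to/backing-file.img'/>
>   <target dev='vdb' bus='virtio'/>
> </disk>
> ```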
Gotcha
>> Is that writeback caching enabled? Writeback caching with a volatile
>> write cache? For your device, can you do:
>>
>> $ grep . /sys/block/$dev/queue/*
>>
>>> Checking the fsstress verbose log against the failed file, it turns out
>>> to be an io_uring write.
>>
>> Any more details on what the write looks like?
>
> For the file involved, the log shows the following operations for the
> minimal reproducer seed/-p/-n combination:
>
> ```
> 0/24: link d0/f2 d0/f3 0
> 0/29: fallocate(INSERT_RANGE) d0/f3 [276 2 0 0 176 481971]t 884736 585728 95
> 0/30: uring_write d0/f3[276 2 0 0 176 481971] [1400622, 56456(res=56456)] 0
> 0/31: writev d0/f3[276 2 0 0 296 1457078] [709121,8,964] 0
> 0/34: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[276 2 308134 1763236 320 1457078] return 25, fallback to stat()
> 0/34: dwrite d0/f3[276 2 308134 1763236 320 1457078] [589824,16384] 0
> 0/38: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[276 2 308134 1763236 496 1457078] return 25, fallback to stat()
> 0/38: dwrite d0/f3[276 2 308134 1763236 496 1457078] [2084864,36864] 0
> 0/40: fallocate(ZERO_RANGE) d0/f3 [276 2 308134 1763236 688 2809139]t 3512660 81075 0
> 0/43: splice d0/f5[289 1 0 0 1872 2678784] [552619,59420] -> d0/f3[276 2 308134 1763236 856 3593735] [5603798,59420] 0
> 0/48: fallocate(KEEP_SIZE|PUNCH_HOLE) d0/f3 [276 1 308134 1763236 976 5663218]t 1361821 480392 0
> 0/49: clonerange d0/f3[276 1 308134 1763236 856 5663218] [2461696,53248] -> d0/f5[289 1 0 0 1872 2678784] [942080,53248]
> ```
And just to be sure, this is not mixing dio and buffered, right?
>>> However I didn't see any io_uring related callbacks inside the btrfs
>>> code, so any advice on the io_uring part would be appreciated.
>>
>> io_uring doesn't do anything special here, it uses the normal page cache
>> read/write parts for buffered IO. But you may get extra parallelism
>> with io_uring here. For example, with the buffered write that this most
>> likely is, libaio would be exactly the same as a pwrite(2) on the file.
>> If this would've blocked, io_uring would offload this to a helper
>> thread. Depending on the workload, you could have multiple of those in
>> progress at the same time.
>
> My biggest concern is: would io_uring modify the page while it's still
> under writeback?
No, of course not. Like I mentioned, io_uring doesn't do anything that
the normal read/write path isn't already doing - it's using the same
->read_iter() and ->write_iter() that everything else is; there's no
page cache code in io_uring.
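For reference, here's roughly all there is to a buffered io_uring write
from userspace - a minimal liburing sketch (made-up path, fixed 4k
buffer, error handling elided), which ends up in the exact same
->write_iter() that a plain pwrite(2) hits:

```
/* buffered (no O_DIRECT) write via io_uring; build with -luring */
#include <fcntl.h>
#include <string.h>
#include <liburing.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	char buf[4096];
	int fd;

	io_uring_queue_init(8, &ring, 0);
	fd = open("/mnt/test/file", O_WRONLY | O_CREAT, 0644);
	memset(buf, 0xaa, sizeof(buf));

	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_write(sqe, fd, buf, sizeof(buf), 0);
	io_uring_submit(&ring);

	/* If the write can't complete inline, io_uring punts it to an
	 * io-wq worker thread - but either way it goes through the
	 * regular page cache write path. */
	io_uring_wait_cqe(&ring, &cqe);
	io_uring_cqe_seen(&ring, cqe);
	io_uring_queue_exit(&ring);
	return 0;
}
```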
> In that case, it's going to cause csum mismatches, as btrfs relies on
> pages under writeback staying unchanged.
Sure, I'm aware of the stable page requirements.
See my followup email as well, which has a patch to test.
--
Jens Axboe