public inbox for [email protected]
From: Qu Wenruo <[email protected]>
To: Jens Axboe <[email protected]>,
	"[email protected]" <[email protected]>,
	Linux FS Devel <[email protected]>,
	[email protected]
Subject: Re: Possible io_uring related race leads to btrfs data csum mismatch
Date: Thu, 17 Aug 2023 05:46:18 +0800	[thread overview]
Message-ID: <[email protected]> (raw)
In-Reply-To: <[email protected]>



On 2023/8/16 22:33, Jens Axboe wrote:
> On 8/16/23 12:52 AM, Qu Wenruo wrote:
>> Hi,
>>
>> Recently I'm digging into a very rare failure during btrfs/06[234567],
>> where btrfs scrub detects unrepairable data corruption.
>>
>> After days of digging, I have a much smaller reproducer:
>>
>> ```
>> fail()
>> {
>>          echo "!!! FAILED !!!"
>>          exit 1
>> }
>>
>> workload()
>> {
>>          mkfs.btrfs -f -m single -d single --csum sha256 $dev1
>>          mount $dev1 $mnt
>>          # There are around 10 more combinations with different
>>          # seed and -p/-n parameters, but this is the smallest one
>>          # I found so far.
>>          $fsstress -p 7 -n 50 -s 1691396493 -w -d $mnt
>>          umount $mnt
>>          btrfs check --check-data-csum $dev1 || fail
>> }
>> runtime=1024
>> for (( i = 0; i < $runtime; i++ )); do
>>          echo "=== $i / $runtime ==="
>>          workload
>> done
>> ```
>
> Tried to reproduce this, both on a vm and on a real host, and no luck so
> far. I've got a few followup questions as your report is missing some
> important info:

You may want to try much higher -p/-n numbers.

For verification purposes, I normally go with -p 10 -n 10000, which has a
much higher chance of hitting the problem, but is definitely too noisy
for debugging.

I just tried a run with "$fsstress -p 10 -n 10000 -w -d $mnt" as the
workload, it failed at 21/1024.

>
> 1) What kernel are you running?

David's misc-next branch, i.e. the latest upstream tag plus some btrfs
patches queued for the next merge window.

Although I have had some internal reports showing this problem for quite
some time.

> 2) What's the .config you are using?

Pretty common config, no heavy debug options (KASAN etc).

>
>> At least here, with a VM with 6 cores (host has 8C/16T), fast enough
>> storage (PCIE4.0 NVME, with unsafe cache mode), it has the chance around
>> 1/100 to hit the error.
>
> What does "unsafe cache mode" mean?

The libvirt cache option "unsafe", which mostly ignores flush/FUA
commands and fully relies on the host fs cache (in my case it's file
backed).
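
For reference, that corresponds to a disk section in the libvirt domain
XML like the following (the source path and target device here are
made-up placeholders, not my actual setup):

```xml
<disk type='file' device='disk'>
  <!-- cache='unsafe': guest flush/FUA requests are dropped and data
       integrity depends entirely on the host page cache -->
  <driver name='qemu' type='raw' cache='unsafe'/>
  <source file='/var/lib/libvirt/images/test.raw'/>
  <target dev='vdb' bus='virtio'/>
</disk>
```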

> Is that write back caching enabled?
> Write back caching with volatile write cache? For your device, can you
> do:
>
> $ grep . /sys/block/$dev/queue/*
>
>> Checking the fsstress verbose log against the failed file, it turns out
>> to be an io_uring write.
>
> Any more details on what the write looks like?

For the involved file, the fsstress log shows the following operations
for the minimal reproducer seed/-p/-n combination:

```
0/24: link d0/f2 d0/f3 0
0/29: fallocate(INSERT_RANGE) d0/f3 [276 2 0 0 176 481971]t 884736 585728 95
0/30: uring_write d0/f3[276 2 0 0 176 481971] [1400622, 56456(res=56456)] 0
0/31: writev d0/f3[276 2 0 0 296 1457078] [709121,8,964] 0
0/34: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[276 2 308134 1763236 320 1457078] return 25, fallback to stat()
0/34: dwrite d0/f3[276 2 308134 1763236 320 1457078] [589824,16384] 0
0/38: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[276 2 308134 1763236 496 1457078] return 25, fallback to stat()
0/38: dwrite d0/f3[276 2 308134 1763236 496 1457078] [2084864,36864] 0
0/40: fallocate(ZERO_RANGE) d0/f3 [276 2 308134 1763236 688 2809139]t 3512660 81075 0
0/43: splice d0/f5[289 1 0 0 1872 2678784] [552619,59420] -> d0/f3[276 2 308134 1763236 856 3593735] [5603798,59420] 0
0/48: fallocate(KEEP_SIZE|PUNCH_HOLE) d0/f3 [276 1 308134 1763236 976 5663218]t 1361821 480392 0
0/49: clonerange d0/f3[276 1 308134 1763236 856 5663218] [2461696,53248] -> d0/f5[289 1 0 0 1872 2678784] [942080,53248]
```

>
>> And with uring_write disabled in fsstress, I have no longer reproduced
>> the csum mismatch, even with much larger -n and -p parameters.
>
> Is it more likely to reproduce with larger -n/-p in general?

Yes, but I use that specific combination as the minimal reproducer for
debug purposes.

>
>> However I didn't see any io_uring related callback inside btrfs code,
>> any advice on the io_uring part would be appreciated.
>
> io_uring doesn't do anything special here, it uses the normal page cache
> read/write parts for buffered IO. But you may get extra parallelism
> with io_uring here. For example, with the buffered write that this most
> likely is, libaio would be exactly the same as a pwrite(2) on the file.
> If this would've blocked, io_uring would offload this to a helper
> thread. Depending on the workload, you could have multiple of those in
> progress at the same time.

My biggest concern is: can io_uring modify a page while it is still
under writeback?
If so, it would cause a csum mismatch, as btrfs relies on pages under
writeback staying unchanged.

Thanks,
Qu

>


Thread overview: 22+ messages
2023-08-16  6:52 Possible io_uring related race leads to btrfs data csum mismatch Qu Wenruo
2023-08-16 14:33 ` Jens Axboe
2023-08-16 14:49   ` Jens Axboe
2023-08-16 21:46   ` Qu Wenruo [this message]
2023-08-16 22:28     ` Jens Axboe
2023-08-17  1:05       ` Qu Wenruo
2023-08-17  1:12         ` Jens Axboe
2023-08-17  1:19           ` Qu Wenruo
2023-08-17  1:23             ` Jens Axboe
2023-08-17  1:31               ` Qu Wenruo
2023-08-17  1:32                 ` Jens Axboe
2023-08-19 23:59                   ` Qu Wenruo
2023-08-20  0:22                     ` Qu Wenruo
2023-08-20 13:26                       ` Jens Axboe
2023-08-20 14:11                         ` Jens Axboe
2023-08-20 18:18                           ` Matthew Wilcox
2023-08-20 18:40                             ` Jens Axboe
2023-08-21  0:38                           ` Qu Wenruo
2023-08-21 14:57                             ` Jens Axboe
2023-08-21 21:42                               ` Qu Wenruo
2023-08-16 22:36     ` Jens Axboe
2023-08-17  0:40       ` Qu Wenruo
