From: Jens Axboe <[email protected]>
To: Qu Wenruo <[email protected]>,
"[email protected]" <[email protected]>,
Linux FS Devel <[email protected]>,
[email protected]
Subject: Re: Possible io_uring related race leads to btrfs data csum mismatch
Date: Sun, 20 Aug 2023 07:26:08 -0600
Message-ID: <[email protected]>
In-Reply-To: <[email protected]>
On 8/19/23 6:22 PM, Qu Wenruo wrote:
>
>
> On 2023/8/20 07:59, Qu Wenruo wrote:
>> Hi Jens
>>
>> I did some more debugging on my side, and found a very weird write
>> behavior:
>>
>> An unexpected direct IO happened, without any corresponding fsstress
>> operation.
>>
>> The workload is:
>>
>> $fsstress -p 7 -n 50 -s 1691396493 -w -d $mnt -v > /tmp/fsstress
>>
>> With this I can reliably reproduce the problem locally, with around a
>> 1/50 probability.
>> In my particular case, it results in data corruption at root 5 inode
>> 283 offset 8192.
>>
>> Then I added some trace points for the following functions (a rough
>> sketch follows the list):
>>
>> - btrfs_do_write_iter()
>> Two trace points, one before btrfs_direct_write() and one before
>> btrfs_buffered_write(), outputting the aligned and unaligned write
>> range, the root/inode number, and the type of the write (buffered
>> or direct).
>>
>> - btrfs_finish_one_ordered()
>> This is where btrfs inserts its ordered extent into the subvolume
>> tree.
>> This happens when a range of pages finishes its writeback.
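>>
>> As a rough sketch (illustrative only: the helper name, its placement
>> and the PAGE_SIZE rounding are assumptions, the printed fields are
>> just meant to match the traces quoted below), the
>> btrfs_do_write_iter() hook looks like this:
>>
>>   /* sketch: called from btrfs_do_write_iter() before dispatching */
>>   static void log_write(struct btrfs_inode *inode, loff_t pos,
>>                         size_t len, bool direct)
>>   {
>>           u64 start = round_down(pos, PAGE_SIZE);
>>           u64 end = round_up(pos + len, PAGE_SIZE);
>>
>>           trace_printk("r/i=%llu/%llu %s fileoff=%llu(%llu) len=%llu(%zu)\n",
>>                        inode->root->root_key.objectid, btrfs_ino(inode),
>>                        direct ? "direct" : "buffered",
>>                        start, (u64)pos, end - start, len);
>>   }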
>>
>> Then here comes the fsstress log for inode 283 (fsstress does not log
>> the btrfs root number):
>>
>> 0/22: clonerange d0/f2[283 1 0 0 0 0] [0,0] -> d0/f2[283 1 0 0 0 0]
>> [307200,0]
>> 0/23: copyrange d0/f2[283 1 0 0 0 0] [0,0] -> d0/f2[283 1 0 0 0 0]
>> [1058819,0]
>> 0/25: write d0/f2[283 2 0 0 0 0] [393644,88327] 0
>> 0/29: fallocate(INSERT_RANGE) d0/f3 [283 2 0 0 176 481971]t 884736
>> 585728 95
>> 0/30: uring_write d0/f3[283 2 0 0 176 481971] [1400622, 56456(res=56456)] 0
>> 0/31: writev d0/f3[283 2 0 0 296 1457078] [709121,8,964] 0
>> 0/33: do_aio_rw - xfsctl(XFS_IOC_DIOINFO) d0/f2[283 2 308134 1763236 320
>> 1457078] return 25, fallback to stat()
>> 0/34: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[283 2 308134 1763236 320
>> 1457078] return 25, fallback to stat()
>> 0/34: dwrite d0/f3[283 2 308134 1763236 320 1457078] [589824,16384] 0
>> 0/38: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[283 2 308134 1763236 496
>> 1457078] return 25, fallback to stat()
>> 0/38: dwrite d0/f3[283 2 308134 1763236 496 1457078] [2084864,36864] 0
>> 0/39: write d0/d4/f6[283 2 308134 1763236 496 2121728] [2749000,60139] 0
>> 0/40: fallocate(ZERO_RANGE) d0/f3 [283 2 308134 1763236 688 2809139]t
>> 3512660 81075 0
>> 0/43: splice d0/f5[293 1 0 0 1872 2678784] [552619,59420] -> d0/f3[283 2
>> 308134 1763236 856 3593735] [5603798,59420] 0
>> 0/48: fallocate(KEEP_SIZE|PUNCH_HOLE) d0/f3 [283 1 308134 1763236 976
>> 5663218]t 1361821 480392 0
>> 0/49: clonerange d0/f3[283 1 308134 1763236 856 5663218] [2461696,53248]
>> -> d0/f5[293 1 0 0 1872 2678784] [942080,53248]
>>
>> Note one thing: there is no direct or buffered write into inode 283 at
>> offset 8192.
>>
>> But from the trace events for root 5 inode 283:
>>
>> btrfs_do_write_iter: r/i=5/283 buffered fileoff=393216(393644)
>> len=90112(88327)
>> btrfs_do_write_iter: r/i=5/283 buffered fileoff=1396736(1400622)
>> len=61440(56456)
>> btrfs_do_write_iter: r/i=5/283 buffered fileoff=708608(709121)
>> len=12288(7712)
>>
>> btrfs_do_write_iter: r/i=5/283 direct fileoff=8192(8192)
>> len=73728(73728) <<<<<
>>
>> btrfs_do_write_iter: r/i=5/283 direct fileoff=589824(589824)
>> len=16384(16384)
>> btrfs_finish_one_ordered: r/i=5/283 fileoff=8192 len=73728
>> btrfs_finish_one_ordered: r/i=5/283 fileoff=589824 len=16384
>> btrfs_do_write_iter: r/i=5/283 direct fileoff=2084864(2084864)
>> len=36864(36864)
>> btrfs_finish_one_ordered: r/i=5/283 fileoff=2084864 len=36864
>> btrfs_do_write_iter: r/i=5/283 buffered fileoff=2748416(2749000)
>> len=61440(60139)
>> btrfs_do_write_iter: r/i=5/283 buffered fileoff=5603328(5603798)
>> len=61440(59420)
>> btrfs_finish_one_ordered: r/i=5/283 fileoff=393216 len=90112
>> btrfs_finish_one_ordered: r/i=5/283 fileoff=708608 len=12288
>> btrfs_finish_one_ordered: r/i=5/283 fileoff=1396736 len=61440
>> btrfs_finish_one_ordered: r/i=5/283 fileoff=3592192 len=4096
>> btrfs_finish_one_ordered: r/i=5/283 fileoff=2748416 len=61440
>> btrfs_finish_one_ordered: r/i=5/283 fileoff=5603328 len=61440
>>
>> Note the phantom direct IO call, which falls inside the corrupted
>> range.
>>
>> Paired with the fsstress log, that phantom write happens between these
>> two operations:
>>
>> 0/31: writev d0/f3[283 2 0 0 296 1457078] [709121,8,964] 0
>> 0/34: dwrite d0/f3[283 2 308134 1763236 320 1457078] [589824,16384] 0
>
> Just to be more accurate, there is a 0/33 operation, which is:
>
> 0/33: do_aio_rw - xfsctl(XFS_IOC_DIOINFO) d0/f2[285 2 308134 1763236 320
> 1457078] return 25, fallback to stat()
> 0/33: awrite - io_getevents failed -4
>
> The failed one doesn't include the inode number, thus it didn't get
> caught by grep.
>
> Return value -4 means -EINTR; I'm not sure who sent the interruption.
> But if this interruption happens before the IO has finished, we can
> call free() on the buffer, and if we're unlucky enough the freed memory
> can be re-allocated for some other use, modifying the pages before the
> writeback has finished.
>
> I think this is the direct cause of the data corruption: the pages are
> modified before the direct IO has finished.
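>
> In other words, roughly this pattern (illustrative only, not the
> actual fsstress code):
>
>   #include <libaio.h>
>   #include <stdlib.h>
>
>   static int wait_then_free(io_context_t ctx, void *buf)
>   {
>           struct io_event ev;
>           int ret = io_getevents(ctx, 1, 1, &ev, NULL);
>
>           /*
>            * If ret is -EINTR the dwrite may still be in flight, yet
>            * the buffer is freed below; once the allocator hands those
>            * pages out again, they can be modified before the direct
>            * IO has completed.
>            */
>           free(buf);
>           return ret;
>   }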
>
> But unfortunately I still don't understand why the interruption can
> happen, nor how we should handle such an interruption.
> (I guess just retry?)
It's because you are mixing aio and io_uring, and the default setting
for io_uring is to use signal-based notifications for queueing
task_work. This then causes a spurious -EINTR, which stops your
io_getevents() wait. Looks like this is a bug in fsstress: it should
just retry the wait if this happens. You can also configure the ring to
not use signal-based notifications, but that fsstress bug needs fixing
regardless.
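
As an illustration only (not a tested fsstress patch, and the liburing
flag below needs a 5.19+ kernel), the two options look roughly like
this:

  #include <libaio.h>
  #include <liburing.h>
  #include <errno.h>

  /* Option 1: treat -EINTR from the aio wait as spurious and retry */
  static int wait_one_event(io_context_t ctx, struct io_event *ev)
  {
          int ret;

          do {
                  ret = io_getevents(ctx, 1, 1, ev, NULL);
          } while (ret == -EINTR);

          return ret;
  }

  /* Option 2: don't use signal/IPI based task_work notifications */
  static int setup_ring(struct io_uring *ring, unsigned int entries)
  {
          struct io_uring_params p = { 0 };

          p.flags = IORING_SETUP_COOP_TASKRUN;
          return io_uring_queue_init_params(entries, ring, &p);
  }
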
--
Jens Axboe