From: Qu Wenruo <[email protected]>
To: Jens Axboe <[email protected]>,
	"[email protected]" <[email protected]>,
	Linux FS Devel <[email protected]>,
	[email protected]
Subject: Re: Possible io_uring related race leads to btrfs data csum mismatch
Date: Sun, 20 Aug 2023 08:22:14 +0800
Message-ID: <[email protected]>
In-Reply-To: <[email protected]>



On 2023/8/20 07:59, Qu Wenruo wrote:
> Hi Jens
>
> I did some more debugging on my side and found a very weird write
> behavior:
>
>      Some unexpected direct IO happened, without a corresponding
>      fsstress operation.
>
> The workload is:
>
>      $fsstress -p 7 -n 50 -s 1691396493 -w -d $mnt -v > /tmp/fsstress
>
> With this I can reliably reproduce the problem locally, with roughly a
> 1/50 chance per run.
> In my particular case, it results in data corruption at root 5 inode 283
> offset 8192.
>
> Then I added some trace points for the following functions:
>
> - btrfs_do_write_iter()
>    Two trace points, one before btrfs_direct_write() and one before
>    btrfs_buffered_write(), outputting the aligned and unaligned write
>    range, the root/inode numbers, and the type of the write (buffered
>    or direct).
>
> - btrfs_finish_one_ordered()
>    This is where btrfs inserts its ordered extent into the subvolume
>    tree.
>    This happens when a range of pages finishes its writeback.
>
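For reference, the instrumentation behind the two trace points above looks
roughly like the following. This is a minimal sketch, not the actual
debugging patch: trace_printk(), btrfs_ino(), root_key.objectid, round_down()
and round_up() are real kernel interfaces, but the helper name, its exact
call sites in btrfs_do_write_iter() and the format string are illustrative
assumptions.

/*
 * Sketch only: called right before the direct/buffered branches in
 * btrfs_do_write_iter(), logging both the page-aligned and the raw
 * (unaligned) range plus the root/inode numbers.
 */
static void trace_write_range(struct btrfs_inode *inode, loff_t pos,
			      size_t len, const char *type)
{
	u64 aligned_start = round_down(pos, PAGE_SIZE);
	u64 aligned_end = round_up(pos + len, PAGE_SIZE);

	trace_printk("r/i=%llu/%llu %s fileoff=%llu(%llu) len=%llu(%zu)\n",
		     inode->root->root_key.objectid, btrfs_ino(inode),
		     type, aligned_start, (u64)pos,
		     aligned_end - aligned_start, len);
}
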
> Here is the fsstress log for inode 283 (fsstress does not log the btrfs
> root number):
>
> 0/22: clonerange d0/f2[283 1 0 0 0 0] [0,0] -> d0/f2[283 1 0 0 0 0]
> [307200,0]
> 0/23: copyrange d0/f2[283 1 0 0 0 0] [0,0] -> d0/f2[283 1 0 0 0 0]
> [1058819,0]
> 0/25: write d0/f2[283 2 0 0 0 0] [393644,88327] 0
> 0/29: fallocate(INSERT_RANGE) d0/f3 [283 2 0 0 176 481971]t 884736
> 585728 95
> 0/30: uring_write d0/f3[283 2 0 0 176 481971] [1400622, 56456(res=56456)] 0
> 0/31: writev d0/f3[283 2 0 0 296 1457078] [709121,8,964] 0
> 0/33: do_aio_rw - xfsctl(XFS_IOC_DIOINFO) d0/f2[283 2 308134 1763236 320
> 1457078] return 25, fallback to stat()
> 0/34: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[283 2 308134 1763236 320
> 1457078] return 25, fallback to stat()
> 0/34: dwrite d0/f3[283 2 308134 1763236 320 1457078] [589824,16384] 0
> 0/38: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[283 2 308134 1763236 496
> 1457078] return 25, fallback to stat()
> 0/38: dwrite d0/f3[283 2 308134 1763236 496 1457078] [2084864,36864] 0
> 0/39: write d0/d4/f6[283 2 308134 1763236 496 2121728] [2749000,60139] 0
> 0/40: fallocate(ZERO_RANGE) d0/f3 [283 2 308134 1763236 688 2809139]t
> 3512660 81075 0
> 0/43: splice d0/f5[293 1 0 0 1872 2678784] [552619,59420] -> d0/f3[283 2
> 308134 1763236 856 3593735] [5603798,59420] 0
> 0/48: fallocate(KEEP_SIZE|PUNCH_HOLE) d0/f3 [283 1 308134 1763236 976
> 5663218]t 1361821 480392 0
> 0/49: clonerange d0/f3[283 1 308134 1763236 856 5663218] [2461696,53248]
> -> d0/f5[293 1 0 0 1872 2678784] [942080,53248]
>
> Note one thing: there is no direct or buffered write into inode 283
> offset 8192.
>
> But from the trace events for root 5 inode 283:
>
>   btrfs_do_write_iter: r/i=5/283 buffered fileoff=393216(393644)
> len=90112(88327)
>   btrfs_do_write_iter: r/i=5/283 buffered fileoff=1396736(1400622)
> len=61440(56456)
>   btrfs_do_write_iter: r/i=5/283 buffered fileoff=708608(709121)
> len=12288(7712)
>
>   btrfs_do_write_iter: r/i=5/283 direct fileoff=8192(8192)
> len=73728(73728) <<<<<
>
>   btrfs_do_write_iter: r/i=5/283 direct fileoff=589824(589824)
> len=16384(16384)
>   btrfs_finish_one_ordered: r/i=5/283 fileoff=8192 len=73728
>   btrfs_finish_one_ordered: r/i=5/283 fileoff=589824 len=16384
>   btrfs_do_write_iter: r/i=5/283 direct fileoff=2084864(2084864)
> len=36864(36864)
>   btrfs_finish_one_ordered: r/i=5/283 fileoff=2084864 len=36864
>   btrfs_do_write_iter: r/i=5/283 buffered fileoff=2748416(2749000)
> len=61440(60139)
>   btrfs_do_write_iter: r/i=5/283 buffered fileoff=5603328(5603798)
> len=61440(59420)
>   btrfs_finish_one_ordered: r/i=5/283 fileoff=393216 len=90112
>   btrfs_finish_one_ordered: r/i=5/283 fileoff=708608 len=12288
>   btrfs_finish_one_ordered: r/i=5/283 fileoff=1396736 len=61440
>   btrfs_finish_one_ordered: r/i=5/283 fileoff=3592192 len=4096
>   btrfs_finish_one_ordered: r/i=5/283 fileoff=2748416 len=61440
>   btrfs_finish_one_ordered: r/i=5/283 fileoff=5603328 len=61440
>
> Note the phantom direct IO call, which covers the corrupted range.
>
> When matched against the fsstress log, that phantom write happens between
> these two operations:
>
> 0/31: writev d0/f3[283 2 0 0 296 1457078] [709121,8,964] 0
> 0/34: dwrite d0/f3[283 2 308134 1763236 320 1457078] [589824,16384] 0

Just to be more accurate, there is a 0/33 operation, which is:

0/33: do_aio_rw - xfsctl(XFS_IOC_DIOINFO) d0/f2[285 2 308134 1763236 320
1457078] return 25, fallback to stat()
0/33: awrite - io_getevents failed -4

The failed line doesn't include the inode number, thus it didn't get caught
by my grep.

Return value -4 means -EINTR; I'm not sure who sent the interruption.
But if this interruption happens before the IO has finished, free() gets
called on the buffer, and if we're unlucky enough the freed memory can be
re-allocated for some other use, thus modifying the pages before the
writeback has finished.
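
In libaio terms, the suspected pattern looks roughly like this. It is a
simplified sketch of an fsstress-like caller, not the literal fsstress code;
the function name buggy_aio_dwrite() is made up for illustration:

#include <sys/types.h>
#include <stdlib.h>
#include <libaio.h>

/* Simplified sketch of the suspected race, not the literal fsstress code. */
static int buggy_aio_dwrite(io_context_t ctx, int fd, off_t off, size_t len)
{
	struct iocb cb, *cbs[1] = { &cb };
	struct io_event ev;
	void *buf;
	int ret;

	if (posix_memalign(&buf, 4096, len))
		return -1;

	io_prep_pwrite(&cb, fd, buf, len, off);	/* fd opened with O_DIRECT */
	if (io_submit(ctx, 1, cbs) != 1) {
		free(buf);
		return -1;
	}

	ret = io_getevents(ctx, 1, 1, &ev, NULL);
	/*
	 * If a signal arrives while waiting, ret is -EINTR and we fall
	 * through with the write still in flight ...
	 */
	free(buf);	/* ... so the kernel may still be reading this buffer */
	return ret;
}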

I think this is the direct cause of the data corruption: page modification
before the direct IO finished.

But unfortunately I still don't understand why the interruption can happen,
nor how we should handle such an interruption.
(I guess just retry?)
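
If plain retry is the right answer, the reaping path would presumably just
need something like this (an untested sketch, continuing the example above),
so that the buffer is never freed while the IO is still in flight:

	/* Keep reaping until the submitted IO has really completed. */
	do {
		ret = io_getevents(ctx, 1, 1, &ev, NULL);
	} while (ret == -EINTR);

	if (ret == 1)
		free(buf);	/* safe now, the kernel is done with the buffer */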

Thanks,
Qu
>
> I'll keep digging, but a phantom write that is not properly logged by
> fsstress is already a concern to me.
>
> Or maybe I'm missing some fixes in fsstress?
>
> Thanks,
> Qu

Thread overview: 22+ messages
2023-08-16  6:52 Possible io_uring related race leads to btrfs data csum mismatch Qu Wenruo
2023-08-16 14:33 ` Jens Axboe
2023-08-16 14:49   ` Jens Axboe
2023-08-16 21:46   ` Qu Wenruo
2023-08-16 22:28     ` Jens Axboe
2023-08-17  1:05       ` Qu Wenruo
2023-08-17  1:12         ` Jens Axboe
2023-08-17  1:19           ` Qu Wenruo
2023-08-17  1:23             ` Jens Axboe
2023-08-17  1:31               ` Qu Wenruo
2023-08-17  1:32                 ` Jens Axboe
2023-08-19 23:59                   ` Qu Wenruo
2023-08-20  0:22                     ` Qu Wenruo [this message]
2023-08-20 13:26                       ` Jens Axboe
2023-08-20 14:11                         ` Jens Axboe
2023-08-20 18:18                           ` Matthew Wilcox
2023-08-20 18:40                             ` Jens Axboe
2023-08-21  0:38                           ` Qu Wenruo
2023-08-21 14:57                             ` Jens Axboe
2023-08-21 21:42                               ` Qu Wenruo
2023-08-16 22:36     ` Jens Axboe
2023-08-17  0:40       ` Qu Wenruo
