From: Qu Wenruo <[email protected]>
To: Jens Axboe <[email protected]>,
	"[email protected]" <[email protected]>,
	Linux FS Devel <[email protected]>,
	[email protected]
Subject: Re: Possible io_uring related race leads to btrfs data csum mismatch
Date: Tue, 22 Aug 2023 05:42:40 +0800	[thread overview]
Message-ID: <[email protected]> (raw)
In-Reply-To: <[email protected]>



On 2023/8/21 22:57, Jens Axboe wrote:
> On 8/20/23 6:38 PM, Qu Wenruo wrote:
>>
>>
>> On 2023/8/20 22:11, Jens Axboe wrote:
>>> On 8/20/23 7:26 AM, Jens Axboe wrote:
>>>> On 8/19/23 6:22 PM, Qu Wenruo wrote:
>>>>>
>>>>>
>>>>> On 2023/8/20 07:59, Qu Wenruo wrote:
>>>>>> Hi Jens
>>>>>>
>>>>>> I did some more debugging on my side and found a very weird write
>>>>>> behavior:
>>>>>>
>>>>>>        Some unexpected direct IO happened, without corresponding
>>>>>>        fsstress workload.
>>>>>>
>>>>>> The workload is:
>>>>>>
>>>>>>        $fsstress -p 7 -n 50 -s 1691396493 -w -d $mnt -v > /tmp/fsstress
>>>>>>
>>>>>> With this I can reliably reproduce the problem locally, with roughly a
>>>>>> 1-in-50 chance per run.
>>>>>> In my particular case, it results in data corruption at root 5 inode 283
>>>>>> offset 8192.
>>>>>>
>>>>>> Then I added some trace points for the following functions:
>>>>>>
>>>>>> - btrfs_do_write_iter()
>>>>>>      Two trace points, one before btrfs_direct_write() and one before
>>>>>>      btrfs_buffered_write(), outputting the aligned and unaligned write
>>>>>>      range, the root/inode number, and the type of the write (buffered
>>>>>>      or direct).
>>>>>>
>>>>>> - btrfs_finish_one_ordered()
>>>>>>      This is where btrfs inserts its ordered extent into the subvolume
>>>>>>      tree.
>>>>>>      This happens when a range of pages finishes its writeback.
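(For illustration, the instrumentation was probably of roughly the following
shape; this is only a sketch reconstructed from the trace output below, not
the actual debug patch, and the btrfs helper/field names are assumptions that
may differ between kernel versions:)

/*
 * Hypothetical helper, called from btrfs_do_write_iter() just before the
 * buffered/direct branches; not the real debug patch.
 */
static void trace_btrfs_write(struct btrfs_inode *inode, const char *type,
			      u64 aligned_off, u64 aligned_len,
			      u64 raw_off, u64 raw_len)
{
	trace_printk("r/i=%llu/%llu %s fileoff=%llu(%llu) len=%llu(%llu)\n",
		     inode->root->root_key.objectid, btrfs_ino(inode),
		     type, aligned_off, raw_off, aligned_len, raw_len);
}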
>>>>>>
>>>>>> Here is the fsstress log for inode 283 (fsstress does not log the
>>>>>> btrfs root number):
>>>>>>
>>>>>> 0/22: clonerange d0/f2[283 1 0 0 0 0] [0,0] -> d0/f2[283 1 0 0 0 0]
>>>>>> [307200,0]
>>>>>> 0/23: copyrange d0/f2[283 1 0 0 0 0] [0,0] -> d0/f2[283 1 0 0 0 0]
>>>>>> [1058819,0]
>>>>>> 0/25: write d0/f2[283 2 0 0 0 0] [393644,88327] 0
>>>>>> 0/29: fallocate(INSERT_RANGE) d0/f3 [283 2 0 0 176 481971]t 884736
>>>>>> 585728 95
>>>>>> 0/30: uring_write d0/f3[283 2 0 0 176 481971] [1400622, 56456(res=56456)] 0
>>>>>> 0/31: writev d0/f3[283 2 0 0 296 1457078] [709121,8,964] 0
>>>>>> 0/33: do_aio_rw - xfsctl(XFS_IOC_DIOINFO) d0/f2[283 2 308134 1763236 320
>>>>>> 1457078] return 25, fallback to stat()
>>>>>> 0/34: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[283 2 308134 1763236 320
>>>>>> 1457078] return 25, fallback to stat()
>>>>>> 0/34: dwrite d0/f3[283 2 308134 1763236 320 1457078] [589824,16384] 0
>>>>>> 0/38: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[283 2 308134 1763236 496
>>>>>> 1457078] return 25, fallback to stat()
>>>>>> 0/38: dwrite d0/f3[283 2 308134 1763236 496 1457078] [2084864,36864] 0
>>>>>> 0/39: write d0/d4/f6[283 2 308134 1763236 496 2121728] [2749000,60139] 0
>>>>>> 0/40: fallocate(ZERO_RANGE) d0/f3 [283 2 308134 1763236 688 2809139]t
>>>>>> 3512660 81075 0
>>>>>> 0/43: splice d0/f5[293 1 0 0 1872 2678784] [552619,59420] -> d0/f3[283 2
>>>>>> 308134 1763236 856 3593735] [5603798,59420] 0
>>>>>> 0/48: fallocate(KEEP_SIZE|PUNCH_HOLE) d0/f3 [283 1 308134 1763236 976
>>>>>> 5663218]t 1361821 480392 0
>>>>>> 0/49: clonerange d0/f3[283 1 308134 1763236 856 5663218] [2461696,53248]
>>>>>> -> d0/f5[293 1 0 0 1872 2678784] [942080,53248]
>>>>>>
>>>>>> Note one thing: there is no direct or buffered write into inode 283
>>>>>> offset 8192.
>>>>>>
>>>>>> But from the trace events for root 5 inode 283:
>>>>>>
>>>>>>     btrfs_do_write_iter: r/i=5/283 buffered fileoff=393216(393644)
>>>>>> len=90112(88327)
>>>>>>     btrfs_do_write_iter: r/i=5/283 buffered fileoff=1396736(1400622)
>>>>>> len=61440(56456)
>>>>>>     btrfs_do_write_iter: r/i=5/283 buffered fileoff=708608(709121)
>>>>>> len=12288(7712)
>>>>>>
>>>>>>     btrfs_do_write_iter: r/i=5/283 direct fileoff=8192(8192)
>>>>>> len=73728(73728) <<<<<
>>>>>>
>>>>>>     btrfs_do_write_iter: r/i=5/283 direct fileoff=589824(589824)
>>>>>> len=16384(16384)
>>>>>>     btrfs_finish_one_ordered: r/i=5/283 fileoff=8192 len=73728
>>>>>>     btrfs_finish_one_ordered: r/i=5/283 fileoff=589824 len=16384
>>>>>>     btrfs_do_write_iter: r/i=5/283 direct fileoff=2084864(2084864)
>>>>>> len=36864(36864)
>>>>>>     btrfs_finish_one_ordered: r/i=5/283 fileoff=2084864 len=36864
>>>>>>     btrfs_do_write_iter: r/i=5/283 buffered fileoff=2748416(2749000)
>>>>>> len=61440(60139)
>>>>>>     btrfs_do_write_iter: r/i=5/283 buffered fileoff=5603328(5603798)
>>>>>> len=61440(59420)
>>>>>>     btrfs_finish_one_ordered: r/i=5/283 fileoff=393216 len=90112
>>>>>>     btrfs_finish_one_ordered: r/i=5/283 fileoff=708608 len=12288
>>>>>>     btrfs_finish_one_ordered: r/i=5/283 fileoff=1396736 len=61440
>>>>>>     btrfs_finish_one_ordered: r/i=5/283 fileoff=3592192 len=4096
>>>>>>     btrfs_finish_one_ordered: r/i=5/283 fileoff=2748416 len=61440
>>>>>>     btrfs_finish_one_ordered: r/i=5/283 fileoff=5603328 len=61440
>>>>>>
>>>>>> Note the phantom direct IO call, which falls inside the corrupted range.
>>>>>>
>>>>>> Cross-referenced with the fsstress log, that phantom write happens
>>>>>> between these two operations:
>>>>>>
>>>>>> 0/31: writev d0/f3[283 2 0 0 296 1457078] [709121,8,964] 0
>>>>>> 0/34: dwrite d0/f3[283 2 308134 1763236 320 1457078] [589824,16384] 0
>>>>>
>>>>> Just to be more accurate, there is a 0/33 operation, which is:
>>>>>
>>>>> 0/33: do_aio_rw - xfsctl(XFS_IOC_DIOINFO) d0/f2[285 2 308134 1763236 320
>>>>> 1457078] return 25, fallback to stat()
>>>>> 0/33: awrite - io_getevents failed -4
>>>>>
>>>>> The failed operation doesn't log the inode number, thus it didn't get
>>>>> caught by the grep.
>>>>>
>>>>> Return value -4 means -EINTR; I'm not sure who sent the interruption.
>>>>> But if this interruption arrives before the IO has finished, fsstress can
>>>>> call free() on the buffer, and if we're unlucky enough, the freed memory
>>>>> can be re-allocated for some other use, modifying the pages before the
>>>>> in-flight write has completed.
>>>>>
>>>>> I think this is the direct cause of the data corruption: the pages get
>>>>> modified before the direct IO has finished.
>>>>>
>>>>> But unfortunately I still don't understand why the interruption can
>>>>> happen, nor how we should handle such an interruption.
>>>>> (I guess just retry?)
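(To make the suspected failure mode concrete, here is a minimal libaio sketch
of that pattern; the function name, buffer size and alignment are made up for
illustration and this is not the fsstress code itself:)

#include <libaio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/*
 * Sketch of the hazard: if io_getevents() returns -EINTR and the caller
 * gives up, the buffer is freed while the O_DIRECT write may still be in
 * flight, so whatever re-uses that memory ends up modifying the pages the
 * device is still reading from.
 */
static int racy_dio_write(io_context_t ctx, int fd, off_t off, size_t len)
{
	struct iocb iocb, *iocbs[1] = { &iocb };
	struct io_event event;
	void *buf;
	int ret;

	if (posix_memalign(&buf, 4096, len))
		return -1;
	memset(buf, 0xaa, len);

	io_prep_pwrite(&iocb, fd, buf, len, off);
	if (io_submit(ctx, 1, iocbs) != 1) {
		free(buf);
		return -1;
	}

	ret = io_getevents(ctx, 1, 1, &event, NULL);
	/*
	 * BUG: on -EINTR the write may not have completed, yet the buffer is
	 * freed below anyway.  Correct code retries the wait until it gets
	 * the completion (as in the fsstress fix discussed in this thread).
	 */
	free(buf);
	return ret == 1 ? 0 : -1;
}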
>>>>
>>>> It's because you are mixing aio and io_uring, and the default setting
>>>> for io_uring is to use signal-based notifications for queueing task_work.
>>>> This then causes a spurious -EINTR, which interrupts your io_getevents()
>>>> wait. Looks like this is a bug in fsstress; it should just retry the wait
>>>> if this happens. You can also configure the ring not to use signal-based
>>>> notifications, but that bug needs fixing regardless.
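(For reference, a minimal liburing sketch of setting up a ring without
signal-based task_work notifications; IORING_SETUP_COOP_TASKRUN needs a
5.19+ kernel, and the setup_ring() helper name is just for illustration:)

#include <liburing.h>
#include <stdio.h>

/*
 * Create a ring that uses cooperative task_work running instead of
 * signal/IPI based notifications, so a task waiting in another syscall
 * (e.g. libaio's io_getevents()) is not hit with spurious -EINTR.
 */
static int setup_ring(struct io_uring *ring, unsigned entries)
{
	struct io_uring_params p = { 0 };
	int ret;

	p.flags = IORING_SETUP_COOP_TASKRUN;
	ret = io_uring_queue_init_params(entries, ring, &p);
	if (ret < 0)
		fprintf(stderr, "io_uring_queue_init_params: %d\n", ret);
	return ret;
}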
>>>
>>> Something like this will probably fix it.
>>>
>>>
>>> diff --git a/ltp/fsstress.c b/ltp/fsstress.c
>>> index 6641a525fe5d..05fbfd3f8cf8 100644
>>> --- a/ltp/fsstress.c
>>> +++ b/ltp/fsstress.c
>>> @@ -2072,6 +2072,23 @@ void inode_info(char *str, size_t sz, struct stat64 *s, int verbose)
>>>                 (long long) s->st_blocks, (long long) s->st_size);
>>>    }
>>>
>>> +static int io_get_single_event(struct io_event *event)
>>> +{
>>> +    int ret;
>>> +
>>> +    do {
>>> +        /*
>>> +         * We can get -EINTR if competing with io_uring using signal
>>> +         * based notifications. For that case, just retry the wait.
>>> +         */
>>> +        ret = io_getevents(io_ctx, 1, 1, event, NULL);
>>> +        if (ret != -EINTR)
>>> +            break;
>>> +    } while (1);
>>> +
>>> +    return ret;
>>> +}
>>> +
>>>    void
>>>    afsync_f(opnum_t opno, long r)
>>>    {
>>> @@ -2111,7 +2128,7 @@ afsync_f(opnum_t opno, long r)
>>>            close(fd);
>>>            return;
>>>        }
>>> -    if ((e = io_getevents(io_ctx, 1, 1, &event, NULL)) != 1) {
>>> +    if ((e = io_get_single_event(&event)) != 1) {
>>>            if (v)
>>>                printf("%d/%lld: afsync - io_getevents failed %d\n",
>>>                       procid, opno, e);
>>> @@ -2220,10 +2237,10 @@ do_aio_rw(opnum_t opno, long r, int flags)
>>>        if ((e = io_submit(io_ctx, 1, iocbs)) != 1) {
>>>            if (v)
>>>                printf("%d/%lld: %s - io_submit failed %d\n",
>>> -                   procid, opno, iswrite ? "awrite" : "aread", e);
>>> +                    procid, opno, iswrite ? "awrite" : "aread", e);
>>>            goto aio_out;
>>>        }
>>> -    if ((e = io_getevents(io_ctx, 1, 1, &event, NULL)) != 1) {
>>> +    if ((e = io_get_single_event(&event)) != 1) {
>>>            if (v)
>>>                printf("%d/%lld: %s - io_getevents failed %d\n",
>>>                       procid, opno, iswrite ? "awrite" : "aread", e);
>>>
>> Exactly what I sent for fsstress:
>> https://lore.kernel.org/linux-btrfs/[email protected]/T/#u
>
> It's not really the same, as you only covered the one case of
> io_getevents(). What happens if the other one gets -EINTR and aborts? Now
> we do a read/write operation and the first event we get back is the one
> from the fsync.

Oh my bad, I forgot there is another one.

Will fix it in the next update.

Thanks,
Qu
>
> You should not just fix up the one that you happened to hit, fix up both
> of them.
>


Thread overview: 22+ messages
2023-08-16  6:52 Possible io_uring related race leads to btrfs data csum mismatch Qu Wenruo
2023-08-16 14:33 ` Jens Axboe
2023-08-16 14:49   ` Jens Axboe
2023-08-16 21:46   ` Qu Wenruo
2023-08-16 22:28     ` Jens Axboe
2023-08-17  1:05       ` Qu Wenruo
2023-08-17  1:12         ` Jens Axboe
2023-08-17  1:19           ` Qu Wenruo
2023-08-17  1:23             ` Jens Axboe
2023-08-17  1:31               ` Qu Wenruo
2023-08-17  1:32                 ` Jens Axboe
2023-08-19 23:59                   ` Qu Wenruo
2023-08-20  0:22                     ` Qu Wenruo
2023-08-20 13:26                       ` Jens Axboe
2023-08-20 14:11                         ` Jens Axboe
2023-08-20 18:18                           ` Matthew Wilcox
2023-08-20 18:40                             ` Jens Axboe
2023-08-21  0:38                           ` Qu Wenruo
2023-08-21 14:57                             ` Jens Axboe
2023-08-21 21:42                               ` Qu Wenruo [this message]
2023-08-16 22:36     ` Jens Axboe
2023-08-17  0:40       ` Qu Wenruo
