Re: [PATCH 5/5] btrfs: add io_uring command for encoded reads

public inbox for io-uring@vger.kernel.org
 help / color / mirror / Atom feed

From: Pavel Begunkov <asml.silence@gmail.com>
To: Mark Harmstone <maharmstone@meta.com>,
	"linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>
Cc: "io-uring@vger.kernel.org" <io-uring@vger.kernel.org>
Subject: Re: [PATCH 5/5] btrfs: add io_uring command for encoded reads
Date: Thu, 31 Oct 2024 18:26:03 +0000	[thread overview]
Message-ID: <fad6bb86-0646-4039-b234-5752f3f833f0@gmail.com> (raw)
In-Reply-To: <46aa1f2a-d0c6-429e-a862-1b3b8c37c109@meta.com>

On 10/31/24 17:08, Mark Harmstone wrote:
> Thanks Pavel.
> 
...
>> If that's an iovec backed iter that might fail, you'd need to
>> steal this patch
>>
>> https://lore.kernel.org/all/20241016-fuse-uring-for-6-10-rfc4-v4-12-9739c753666e@ddn.com/
>>
>> and fail if "issue_flags & IO_URING_F_TASK_DEAD", see
>>
>> https://lore.kernel.org/all/20241016-fuse-uring-for-6-10-rfc4-v4-13-9739c753666e@ddn.com/
> 
> Thanks, I've sent a patchset including your patch. Does it make a
> difference, though? If the task has died, presumably copy_page_to_iter
> can't copy to another process' memory...?

IIRC copy_to_user will crash without mm set, not sure about
copy_page_to_iter(). Regardless, when the original task has dies
and it gets run from io_uring's fallback path, you shouldn't
make any assumptions about the current task.

>>> +            ret = -EFAULT;
>>> +            goto out;
>>> +        }
>>> +
>>> +        i++;
>>> +        cur += bytes;
>>> +        page_offset = 0;
>>> +    }
>>> +    ret = priv->count;
>>> +
>>> +out:
>>> +    unlock_extent(io_tree, priv->start, priv->lockend,
>>> &priv->cached_state);
>>> +    btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED);
>>
>> When called via io_uring_cmd_complete_in_task() this function might
>> not get run in any reasonable amount of time. Even worse, a
>> misbehaving user can block it until the task dies.
>>
>> I don't remember if rwsem allows unlock being executed in a different
>> task than the pairing lock, but blocking it for that long could be a
>> problem. I might not remember it right but I think Boris meantioned
>> that the O_DIRECT path drops the inode lock right after submission
>> without waiting for bios to complete. Is that right? Can we do it
>> here as well?
> 
> We can't release the inode lock until we've released the extent lock. I
> do intend to look into reducing the amount of time we hold the extent
> lock, if we can, but it's not trivial to do this in a safe manner.

I lack the btrfs knowledge, but sounds like it can be done the
same way the async dio path works.

> We could perhaps move the unlocking to btrfs_uring_read_extent_endio
> instead, but it looks unlocking an rwsem in a different context might
> cause problems with PREEMPT_RT(?).

At least from a quick glance it doesn't seem that locks in
__clear_extent_bit are [soft]irq protected. Would be a good
idea to give it a run with lockdep enabled.


...
>>> +    ret = btrfs_encoded_read_regular_fill_pages(inode, disk_bytenr,
>>> +                            disk_io_size, pages,
>>> +                            priv);
>>> +    if (ret && ret != -EIOCBQUEUED)
>>> +        goto fail;
>>
>> Turning both into return EIOCBQUEUED is a bit suspicious, but
>> I lack context to say. Might make sense to return ret and let
>> the caller handle it.
> 
> btrfs_encoded_read_regular_fill_pages returns 0 if the bio completes
> before the function can finish, -EIOCBQUEUED otherwise. In either case
> the behaviour of the calling function will be the same.

Ok

...
>>> +    if (copy_to_user(sqe_addr + copy_end, (char *)&args +
>>> copy_end_kernel,
>>> +             sizeof(args) - copy_end_kernel)) {
>>> +        if (ret == -EIOCBQUEUED) {
>>> +            unlock_extent(io_tree, start, lockend, &cached_state);
>>> +            btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED);
>>> +        }
>>> +        ret = -EFAULT;
>>> +        goto out_free;
>>
>> It seems we're saving iov in the priv structure, who can access the iovec
>> after the request is submitted? -EIOCBQUEUED in general means that the
>> request is submitted and will get completed async, e.g. via callback, and
>> if the bio callback can use the iov maybe via the iter, this goto will be
>> a use after free.
>>
>> Also, you're returning -EFAULT back to io_uring, which will kill the
>> io_uring request / cmd while there might still be in flight bios that
>> can try to access it.
>>
>> Can you inject errors into the copy and test please?
> 
> The bio hasn't been submitted at this point, that happens in
> btrfs_uring_read_extent. So far all we've done is read the metadata from
> the page cache. The copy_to_user here is copying the metadata info to
> the userspace structure.

I see, in this case it should be fine, but why is it -EIOCBQUEUED
then? It always meant that it queued up the request and will
complete it asynchronously, and that's where the confusion sprouted
from. Not looking deeper but sounds more like -EAGAIN? Assuming it's
returned because we can't block

>>> +    }
>>> +
>>> +    if (ret == -EIOCBQUEUED) {
>>> +        u64 count;
>>> +
>>> +        /*
>>> +         * If we've optimized things by storing the iovecs on the stack,
>>> +         * undo this.
>>> +         */> +        if (!iov) {
>>> +            iov = kmalloc(sizeof(struct iovec) * args.iovcnt,
>>> +                      GFP_NOFS);
>>> +            if (!iov) {
>>> +                unlock_extent(io_tree, start, lockend,
>>> +                          &cached_state);
>>> +                btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED);
>>> +                ret = -ENOMEM;
>>> +                goto out_acct;
>>> +            }
>>> +
>>> +            memcpy(iov, iovstack,
>>> +                   sizeof(struct iovec) * args.iovcnt);

As an optimisation in the future you can allocate it
together with the btrfs_priv structure.

>>> +        }
>>> +
>>> +        count = min_t(u64, iov_iter_count(&iter), disk_io_size);
>>> +
>>> +        /* Match ioctl by not returning past EOF if uncompressed */
>>> +        if (!args.compression)
>>> +            count = min_t(u64, count, args.len);
>>> +
>>> +        ret = btrfs_uring_read_extent(&kiocb, &iter, start, lockend,
>>> +                          cached_state, disk_bytenr,
>>> +                          disk_io_size, count,
>>> +                          args.compression, iov, cmd);

So that's the only spot where asynchronous code branches off
in this function? Do I read you right?

>>> +
>>> +        goto out_acct;
>>> +    }
>>> +
>>> +out_free:
>>> +    kfree(iov);
>>> +
>>> +out_acct:
>>> +    if (ret > 0)
>>> +        add_rchar(current, ret);
>>> +    inc_syscr(current);
>>> +
>>> +    return ret;
>>> +}

-- 
Pavel Begunkov

next prev parent reply	other threads:[~2024-10-31 18:25 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-10-22 14:50 [PATCH v4 0/5] btrfs: io_uring interface for encoded reads Mark Harmstone
2024-10-22 14:50 ` [PATCH 1/5] btrfs: remove pointless addition in btrfs_encoded_read Mark Harmstone
2024-10-30  1:46   ` Anand Jain
2024-10-31 11:44     ` Mark Harmstone
2024-10-22 14:50 ` [PATCH 2/5] btrfs: change btrfs_encoded_read so that reading of extent is done by caller Mark Harmstone
2024-10-22 14:50 ` [PATCH 3/5] btrfs: don't sleep in btrfs_encoded_read if IOCB_NOWAIT set Mark Harmstone
2024-10-22 14:50 ` [PATCH 4/5] btrfs: move priv off stack in btrfs_encoded_read_regular_fill_pages Mark Harmstone
2024-10-22 14:50 ` [PATCH 5/5] btrfs: add io_uring command for encoded reads Mark Harmstone
2024-10-29 21:51   ` David Sterba
2024-10-30  0:59   ` Pavel Begunkov
2024-10-30  1:24     ` David Sterba
2024-10-30  2:32       ` Pavel Begunkov
2024-10-31 17:08     ` Mark Harmstone
2024-10-31 18:26       ` Pavel Begunkov [this message]
2024-10-29 21:29 ` [PATCH v4 0/5] btrfs: io_uring interface " David Sterba
2024-10-30 12:37   ` Mark Harmstone
2024-10-30  1:22 ` Anand Jain
2024-10-30 11:48   ` Mark Harmstone
  -- strict thread matches above, loose matches on Subject: below --
2024-10-14 17:18 [PATCH v3 0/5] btrfs: encoded reads via io_uring Mark Harmstone
2024-10-14 17:18 ` [PATCH 5/5] btrfs: add io_uring command for encoded reads Mark Harmstone
2024-10-21 13:50   ` David Sterba
2024-10-21 16:15     ` Pavel Begunkov
2024-10-21 17:05     ` Mark Harmstone
2024-10-21 18:23       ` David Sterba
2024-10-22  9:12         ` Mark Harmstone

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=fad6bb86-0646-4039-b234-5752f3f833f0@gmail.com \
    --to=asml.silence@gmail.com \
    --cc=io-uring@vger.kernel.org \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=maharmstone@meta.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox