From: Pavel Begunkov <[email protected]>
To: Mark Harmstone <[email protected]>,
"[email protected]" <[email protected]>
Cc: "[email protected]" <[email protected]>
Subject: Re: [PATCH 5/5] btrfs: add io_uring command for encoded reads
Date: Thu, 31 Oct 2024 18:26:03 +0000 [thread overview]
Message-ID: <[email protected]> (raw)
In-Reply-To: <[email protected]>
On 10/31/24 17:08, Mark Harmstone wrote:
> Thanks Pavel.
>
...
>> If that's an iovec backed iter that might fail, you'd need to
>> steal this patch
>>
>> https://lore.kernel.org/all/[email protected]/
>>
>> and fail if "issue_flags & IO_URING_F_TASK_DEAD", see
>>
>> https://lore.kernel.org/all/[email protected]/
>
> Thanks, I've sent a patchset including your patch. Does it make a
> difference, though? If the task has died, presumably copy_page_to_iter
> can't copy to another process' memory...?
IIRC copy_to_user will crash without mm set, not sure about
copy_page_to_iter(). Regardless, when the original task has dies
and it gets run from io_uring's fallback path, you shouldn't
make any assumptions about the current task.
>>> + ret = -EFAULT;
>>> + goto out;
>>> + }
>>> +
>>> + i++;
>>> + cur += bytes;
>>> + page_offset = 0;
>>> + }
>>> + ret = priv->count;
>>> +
>>> +out:
>>> + unlock_extent(io_tree, priv->start, priv->lockend,
>>> &priv->cached_state);
>>> + btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED);
>>
>> When called via io_uring_cmd_complete_in_task() this function might
>> not get run in any reasonable amount of time. Even worse, a
>> misbehaving user can block it until the task dies.
>>
>> I don't remember if rwsem allows unlock being executed in a different
>> task than the pairing lock, but blocking it for that long could be a
>> problem. I might not remember it right but I think Boris meantioned
>> that the O_DIRECT path drops the inode lock right after submission
>> without waiting for bios to complete. Is that right? Can we do it
>> here as well?
>
> We can't release the inode lock until we've released the extent lock. I
> do intend to look into reducing the amount of time we hold the extent
> lock, if we can, but it's not trivial to do this in a safe manner.
I lack the btrfs knowledge, but sounds like it can be done the
same way the async dio path works.
> We could perhaps move the unlocking to btrfs_uring_read_extent_endio
> instead, but it looks unlocking an rwsem in a different context might
> cause problems with PREEMPT_RT(?).
At least from a quick glance it doesn't seem that locks in
__clear_extent_bit are [soft]irq protected. Would be a good
idea to give it a run with lockdep enabled.
...
>>> + ret = btrfs_encoded_read_regular_fill_pages(inode, disk_bytenr,
>>> + disk_io_size, pages,
>>> + priv);
>>> + if (ret && ret != -EIOCBQUEUED)
>>> + goto fail;
>>
>> Turning both into return EIOCBQUEUED is a bit suspicious, but
>> I lack context to say. Might make sense to return ret and let
>> the caller handle it.
>
> btrfs_encoded_read_regular_fill_pages returns 0 if the bio completes
> before the function can finish, -EIOCBQUEUED otherwise. In either case
> the behaviour of the calling function will be the same.
Ok
...
>>> + if (copy_to_user(sqe_addr + copy_end, (char *)&args +
>>> copy_end_kernel,
>>> + sizeof(args) - copy_end_kernel)) {
>>> + if (ret == -EIOCBQUEUED) {
>>> + unlock_extent(io_tree, start, lockend, &cached_state);
>>> + btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED);
>>> + }
>>> + ret = -EFAULT;
>>> + goto out_free;
>>
>> It seems we're saving iov in the priv structure, who can access the iovec
>> after the request is submitted? -EIOCBQUEUED in general means that the
>> request is submitted and will get completed async, e.g. via callback, and
>> if the bio callback can use the iov maybe via the iter, this goto will be
>> a use after free.
>>
>> Also, you're returning -EFAULT back to io_uring, which will kill the
>> io_uring request / cmd while there might still be in flight bios that
>> can try to access it.
>>
>> Can you inject errors into the copy and test please?
>
> The bio hasn't been submitted at this point, that happens in
> btrfs_uring_read_extent. So far all we've done is read the metadata from
> the page cache. The copy_to_user here is copying the metadata info to
> the userspace structure.
I see, in this case it should be fine, but why is it -EIOCBQUEUED
then? It always meant that it queued up the request and will
complete it asynchronously, and that's where the confusion sprouted
from. Not looking deeper but sounds more like -EAGAIN? Assuming it's
returned because we can't block
>>> + }
>>> +
>>> + if (ret == -EIOCBQUEUED) {
>>> + u64 count;
>>> +
>>> + /*
>>> + * If we've optimized things by storing the iovecs on the stack,
>>> + * undo this.
>>> + */> + if (!iov) {
>>> + iov = kmalloc(sizeof(struct iovec) * args.iovcnt,
>>> + GFP_NOFS);
>>> + if (!iov) {
>>> + unlock_extent(io_tree, start, lockend,
>>> + &cached_state);
>>> + btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED);
>>> + ret = -ENOMEM;
>>> + goto out_acct;
>>> + }
>>> +
>>> + memcpy(iov, iovstack,
>>> + sizeof(struct iovec) * args.iovcnt);
As an optimisation in the future you can allocate it
together with the btrfs_priv structure.
>>> + }
>>> +
>>> + count = min_t(u64, iov_iter_count(&iter), disk_io_size);
>>> +
>>> + /* Match ioctl by not returning past EOF if uncompressed */
>>> + if (!args.compression)
>>> + count = min_t(u64, count, args.len);
>>> +
>>> + ret = btrfs_uring_read_extent(&kiocb, &iter, start, lockend,
>>> + cached_state, disk_bytenr,
>>> + disk_io_size, count,
>>> + args.compression, iov, cmd);
So that's the only spot where asynchronous code branches off
in this function? Do I read you right?
>>> +
>>> + goto out_acct;
>>> + }
>>> +
>>> +out_free:
>>> + kfree(iov);
>>> +
>>> +out_acct:
>>> + if (ret > 0)
>>> + add_rchar(current, ret);
>>> + inc_syscr(current);
>>> +
>>> + return ret;
>>> +}
--
Pavel Begunkov
next prev parent reply other threads:[~2024-10-31 18:25 UTC|newest]
Thread overview: 24+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-10-22 14:50 [PATCH v4 0/5] btrfs: io_uring interface for encoded reads Mark Harmstone
2024-10-22 14:50 ` [PATCH 1/5] btrfs: remove pointless addition in btrfs_encoded_read Mark Harmstone
2024-10-30 1:46 ` Anand Jain
2024-10-31 11:44 ` Mark Harmstone
2024-10-22 14:50 ` [PATCH 2/5] btrfs: change btrfs_encoded_read so that reading of extent is done by caller Mark Harmstone
2024-10-22 14:50 ` [PATCH 3/5] btrfs: don't sleep in btrfs_encoded_read if IOCB_NOWAIT set Mark Harmstone
2024-10-22 14:50 ` [PATCH 4/5] btrfs: move priv off stack in btrfs_encoded_read_regular_fill_pages Mark Harmstone
2024-10-22 14:50 ` [PATCH 5/5] btrfs: add io_uring command for encoded reads Mark Harmstone
2024-10-29 21:51 ` David Sterba
2024-10-30 0:59 ` Pavel Begunkov
2024-10-30 1:24 ` David Sterba
2024-10-30 2:32 ` Pavel Begunkov
2024-10-31 17:08 ` Mark Harmstone
2024-10-31 18:26 ` Pavel Begunkov [this message]
2024-10-29 21:29 ` [PATCH v4 0/5] btrfs: io_uring interface " David Sterba
2024-10-30 12:37 ` Mark Harmstone
2024-10-30 1:22 ` Anand Jain
2024-10-30 11:48 ` Mark Harmstone
-- strict thread matches above, loose matches on Subject: below --
2024-10-14 17:18 [PATCH v3 0/5] btrfs: encoded reads via io_uring Mark Harmstone
2024-10-14 17:18 ` [PATCH 5/5] btrfs: add io_uring command for encoded reads Mark Harmstone
2024-10-21 13:50 ` David Sterba
2024-10-21 16:15 ` Pavel Begunkov
2024-10-21 17:05 ` Mark Harmstone
2024-10-21 18:23 ` David Sterba
2024-10-22 9:12 ` Mark Harmstone
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox