public inbox for [email protected]
 help / color / mirror / Atom feed
From: Yonghong Song <[email protected]>
To: Kumar Kartikeya Dwivedi <[email protected]>
Cc: <[email protected]>, Jens Axboe <[email protected]>,
	Pavel Begunkov <[email protected]>,
	<[email protected]>, Alexei Starovoitov <[email protected]>,
	Daniel Borkmann <[email protected]>,
	Andrii Nakryiko <[email protected]>,
	Pavel Emelyanov <[email protected]>,
	Alexander Mihalicyn <[email protected]>,
	Andrei Vagin <[email protected]>, <[email protected]>,
	<[email protected]>
Subject: Re: [PATCH bpf-next v1 1/8] io_uring: Implement eBPF iterator for registered buffers
Date: Thu, 18 Nov 2021 11:13:02 -0800	[thread overview]
Message-ID: <[email protected]> (raw)
In-Reply-To: <[email protected]>



On 11/18/21 10:28 AM, Kumar Kartikeya Dwivedi wrote:
> On Thu, Nov 18, 2021 at 10:51:59PM IST, Yonghong Song wrote:
>>
>>
>> On 11/15/21 9:42 PM, Kumar Kartikeya Dwivedi wrote:
>>> This change adds eBPF iterator for buffers registered in io_uring ctx.
>>> It gives access to the ctx, the index of the registered buffer, and a
>>> pointer to the io_uring_ubuf itself. This allows the iterator to save
>>> info related to buffers added to an io_uring instance, that isn't easy
>>> to export using the fdinfo interface (like exact struct page composing
>>> the registered buffer).
>>>
>>> The primary usecase this is enabling is checkpoint/restore support.
>>>
>>> Note that we need to use mutex_trylock when the file is read from, in
>>> seq_start functions, as the order of lock taken is opposite of what it
>>> would be when io_uring operation reads the same file.  We take
>>> seq_file->lock, then ctx->uring_lock, while io_uring would first take
>>> ctx->uring_lock and then seq_file->lock for the same ctx.
>>>
>>> This can lead to a deadlock scenario described below:
>>>
>>>         CPU 0				CPU 1
>>>
>>>         vfs_read
>>>         mutex_lock(&seq_file->lock)	io_read
>>> 					mutex_lock(&ctx->uring_lock)
>>>         mutex_lock(&ctx->uring_lock) # switched to mutex_trylock
>>> 					mutex_lock(&seq_file->lock)
>>
>> It is not clear which mutex_lock switched to mutex_trylock.
> 
> The one in vfs_read.
> 
>>  From below example, it looks like &ctx->uring_lock. But if this is
>> the case, we could have deadlock, right?
>>
> 
> Sorry, I will make the commit message clearer in the next version.
> 
> The sequence on CPU 0 is for normal read(2) on iterator.
> For CPU 1, it is an io_uring instance trying to do same on iterator attached to
> itself.
> 
> So CPU 0 does
> 
> sys_read
> vfs_read
>   bpf_seq_read
>   mutex_lock(&seq_file->lock)    # A
>    io_uring_buf_seq_start
>    mutex_lock(&ctx->uring_lock)  # B
> 
> and CPU 1 does
> 
> io_uring_enter
> mutex_lock(&ctx->uring_lock)    # B
>   io_read
>    bpf_seq_read
>    mutex_lock(&seq_file->lock)   # A
>    ...
> 
> Since the order of locks is opposite, it can deadlock. So I switched the
> mutex_lock in io_uring_buf_seq_start to trylock, so it can return an error for
> this case, then it will release seq_file->lock and CPU 1 will make progress.
> 
> The second problem without use of trylock is described below (for same case of
> io_uring reading from iterator attached to itself).
> 
> Let me know if I missed something.

Thanks for explanation. The above diagram is much better.

> 
>>>
>>> The trylock also protects the case where io_uring tries to read from
>>> iterator attached to itself (same ctx), where the order of locks would
>>> be:
>>>    io_uring_enter
>>>     mutex_lock(&ctx->uring_lock) <-----------.
>>>     io_read				    \
>>>      seq_read				     \
>>>       mutex_lock(&seq_file->lock)		     /
>>>       mutex_lock(&ctx->uring_lock) # deadlock-`
>>>
>>> In both these cases (recursive read and contended uring_lock), -EDEADLK
>>> is returned to userspace.
>>>
>>> In the future, this iterator will be extended to directly support
>>> iteration of bvec Flexible Array Member, so that when there is no
>>> corresponding VMA that maps to the registered buffer (e.g. if VMA is
>>> destroyed after pinning pages), we are able to reconstruct the
>>> registration on restore by dumping the page contents and then replaying
>>> them into a temporary mapping used for registration later. All this is
>>> out of scope for the current series however, but builds upon this
>>> iterator.
>>>
>>> Cc: Jens Axboe <[email protected]>
>>> Cc: Pavel Begunkov <[email protected]>
>>> Cc: [email protected]
>>> Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]>
>>> ---
>>>    fs/io_uring.c                  | 179 +++++++++++++++++++++++++++++++++
>>>    include/linux/bpf.h            |   2 +
>>>    include/uapi/linux/bpf.h       |   3 +
>>>    tools/include/uapi/linux/bpf.h |   3 +
>>>    4 files changed, 187 insertions(+)
>>>
[...]
> 
>>> [...]
>>> +static struct bpf_iter_reg io_uring_buf_reg_info = {
>>> +	.target            = "io_uring_buf",
>>> +	.feature	   = BPF_ITER_RESCHED,
>>> +	.attach_target     = bpf_io_uring_iter_attach,
>>> +	.detach_target     = bpf_io_uring_iter_detach,
>>
>> Since you have this extra `io_uring_fd` for the iterator, you may want
>> to implement show_fdinfo and fill_link_info callback functions here.
>>
> 
> Ack, but some questions:
> 
> What should it have? e.g. it easy to go from map_id to map fd if one wants
> access to the map attached to the iterator, but not sure how one can obtain more
> information about target fd from io_uring or epoll iterators.

Just to be clear, I am talking about uapi struct bpf_link_info.
I agree that fd is not really useful. So I guess it is up to you
whether you want to show fd to user or not. We can always
add it later if needed.

> 
> Should I delegate to their show_fdinfo and dump using that?
> 
> The type/target is already available in link_info, not sure what other useful
> information can be added there, which allows obtaining the io_uring/epoll fd.
> 
>>> +	.ctx_arg_info_size = 2,
>>> +	.ctx_arg_info = {
>>> +		{ offsetof(struct bpf_iter__io_uring_buf, ctx),
>>> +		  PTR_TO_BTF_ID },
>>> +		{ offsetof(struct bpf_iter__io_uring_buf, ubuf),
>>> +		  PTR_TO_BTF_ID_OR_NULL },
>>> +	},
>>> +	.seq_info	   = &bpf_io_uring_buf_seq_info,
>>> +};
>>> +
>>> +static int __init io_uring_iter_init(void)
>>> +{
>>> +	io_uring_buf_reg_info.ctx_arg_info[0].btf_id = btf_io_uring_ids[0];
>>> +	io_uring_buf_reg_info.ctx_arg_info[1].btf_id = btf_io_uring_ids[1];
>>> +	return bpf_iter_reg_target(&io_uring_buf_reg_info);
>>> +}
>>> +late_initcall(io_uring_iter_init);
>>> +
>>> +#endif
>>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>>> index 56098c866704..ddb9d4520a3f 100644
>>> --- a/include/linux/bpf.h
>>> +++ b/include/linux/bpf.h
>>> @@ -1509,8 +1509,10 @@ int bpf_obj_get_user(const char __user *pathname, int flags);
>>>    	extern int bpf_iter_ ## target(args);			\
>>>    	int __init bpf_iter_ ## target(args) { return 0; }
>>> +struct io_ring_ctx;
>>>    struct bpf_iter_aux_info {
>>>    	struct bpf_map *map;
>>> +	struct io_ring_ctx *ctx;
>>>    };
>>
>> Can we use union here? Note that below bpf_iter_link_info in
>> uapi/linux/bpf.h, map_fd and io_uring_fd is also an union.
>>
> 
> So the reason I didn't use a union was the link->aux.map check in
> bpf_iter.c::__get_seq_info. Even if we switch to union bpf_iter_aux_info, it
> needs some way to determine whether link is for map type, so maybe a string
> comparison there? Leaving it out of union felt cleaner, also I move both
> io_ring_ctx and eventpoll file into a union in later patch.

I see. the seq_ops for map element iterator is different from others.
the seq_ops is not from main iter registration but from map_ops.

I think your change is okay. But maybe a comment to explain why
map is different from others in struct bpf_iter_aux_info.

> 
>>>    typedef int (*bpf_iter_attach_target_t)(struct bpf_prog *prog,
>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>>> index 6297eafdc40f..3323defa99a1 100644
>>> --- a/include/uapi/linux/bpf.h
>>> +++ b/include/uapi/linux/bpf.h
>>> @@ -91,6 +91,9 @@ union bpf_iter_link_info {
>>>    	struct {
>>>    		__u32	map_fd;
>>>    	} map;
>>> +	struct {
>>> +		__u32   io_uring_fd;
>>> +	} io_uring;
>>>    };
>>>    /* BPF syscall commands, see bpf(2) man-page for more details. */
>>> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
>>> index 6297eafdc40f..3323defa99a1 100644
>>> --- a/tools/include/uapi/linux/bpf.h
>>> +++ b/tools/include/uapi/linux/bpf.h
>>> @@ -91,6 +91,9 @@ union bpf_iter_link_info {
>>>    	struct {
>>>    		__u32	map_fd;
>>>    	} map;
>>> +	struct {
>>> +		__u32   io_uring_fd;
>>> +	} io_uring;
>>>    };
>>>    /* BPF syscall commands, see bpf(2) man-page for more details. */
>>>
> 
> --
> Kartikeya
> 

  reply	other threads:[~2021-11-18 19:13 UTC|newest]

Thread overview: 32+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-11-16  5:42 [PATCH bpf-next v1 0/8] Introduce BPF iterators for io_uring and epoll Kumar Kartikeya Dwivedi
2021-11-16  5:42 ` [PATCH bpf-next v1 1/8] io_uring: Implement eBPF iterator for registered buffers Kumar Kartikeya Dwivedi
2021-11-18 17:21   ` Yonghong Song
2021-11-18 18:28     ` Kumar Kartikeya Dwivedi
2021-11-18 19:13       ` Yonghong Song [this message]
2021-11-18 22:02   ` Alexei Starovoitov
2021-11-19  4:15     ` Kumar Kartikeya Dwivedi
2021-11-19  4:44       ` Kumar Kartikeya Dwivedi
2021-11-19  4:56       ` Alexei Starovoitov
2021-11-19  5:16         ` Kumar Kartikeya Dwivedi
2021-11-19  5:24           ` Alexei Starovoitov
2021-11-19  6:12             ` Kumar Kartikeya Dwivedi
2021-12-03 15:52             ` Pavel Begunkov
2021-12-03 23:16               ` Kumar Kartikeya Dwivedi
2021-11-16  5:42 ` [PATCH bpf-next v1 2/8] bpf: Add bpf_page_to_pfn helper Kumar Kartikeya Dwivedi
2021-11-17 12:35   ` kernel test robot
2021-11-17 13:39   ` kernel test robot
2021-11-18 17:27   ` Yonghong Song
2021-11-18 18:30     ` Kumar Kartikeya Dwivedi
2021-11-18 19:18       ` Yonghong Song
2021-11-18 19:22         ` Kumar Kartikeya Dwivedi
2021-11-16  5:42 ` [PATCH bpf-next v1 3/8] io_uring: Implement eBPF iterator for registered files Kumar Kartikeya Dwivedi
2021-11-18 17:33   ` Yonghong Song
2021-11-16  5:42 ` [PATCH bpf-next v1 4/8] epoll: Implement eBPF iterator for registered items Kumar Kartikeya Dwivedi
2021-11-18 17:50   ` Yonghong Song
2021-11-16  5:42 ` [PATCH bpf-next v1 5/8] selftests/bpf: Add test for io_uring BPF iterators Kumar Kartikeya Dwivedi
2021-11-18 17:54   ` Yonghong Song
2021-11-18 18:33     ` Kumar Kartikeya Dwivedi
2021-11-18 19:21       ` Yonghong Song
2021-11-16  5:42 ` [PATCH bpf-next v1 6/8] selftests/bpf: Add test for epoll BPF iterator Kumar Kartikeya Dwivedi
2021-11-16  5:42 ` [PATCH bpf-next v1 7/8] selftests/bpf: Test partial reads for io_uring, epoll iterators Kumar Kartikeya Dwivedi
2021-11-16  5:42 ` [PATCH bpf-next v1 8/8] selftests/bpf: Fix btf_dump test for bpf_iter_link_info Kumar Kartikeya Dwivedi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox