From: Yonghong Song <[email protected]>
To: Kumar Kartikeya Dwivedi <[email protected]>
Cc: <[email protected]>, Jens Axboe <[email protected]>,
Pavel Begunkov <[email protected]>,
<[email protected]>, Alexei Starovoitov <[email protected]>,
Daniel Borkmann <[email protected]>,
Andrii Nakryiko <[email protected]>,
Pavel Emelyanov <[email protected]>,
Alexander Mihalicyn <[email protected]>,
Andrei Vagin <[email protected]>, <[email protected]>,
<[email protected]>
Subject: Re: [PATCH bpf-next v1 1/8] io_uring: Implement eBPF iterator for registered buffers
Date: Thu, 18 Nov 2021 11:13:02 -0800 [thread overview]
Message-ID: <[email protected]> (raw)
In-Reply-To: <[email protected]>
On 11/18/21 10:28 AM, Kumar Kartikeya Dwivedi wrote:
> On Thu, Nov 18, 2021 at 10:51:59PM IST, Yonghong Song wrote:
>>
>>
>> On 11/15/21 9:42 PM, Kumar Kartikeya Dwivedi wrote:
>>> This change adds an eBPF iterator for buffers registered in an io_uring ctx.
>>> It gives access to the ctx, the index of the registered buffer, and a
>>> pointer to the io_uring_ubuf itself. This allows the iterator to save
>>> info related to buffers added to an io_uring instance that isn't easy
>>> to export using the fdinfo interface (like the exact struct pages
>>> composing a registered buffer).
>>>
>>> The primary usecase this is enabling is checkpoint/restore support.
>>>
>>> Note that we need to use mutex_trylock when the file is read from, in
>>> the seq_start functions, as the lock order is the opposite of the one
>>> taken when an io_uring operation reads the same file. We take
>>> seq_file->lock, then ctx->uring_lock, while io_uring would first take
>>> ctx->uring_lock and then seq_file->lock for the same ctx.
>>>
>>> This can lead to a deadlock scenario described below:
>>>
>>>       CPU 0                                CPU 1
>>>
>>>  vfs_read
>>>   mutex_lock(&seq_file->lock)              io_read
>>>                                             mutex_lock(&ctx->uring_lock)
>>>   mutex_lock(&ctx->uring_lock)  # switched to mutex_trylock
>>>                                             mutex_lock(&seq_file->lock)
>>
>> It is not clear which mutex_lock switched to mutex_trylock.
>
> The one in vfs_read.
>
>> From below example, it looks like &ctx->uring_lock. But if this is
>> the case, we could have deadlock, right?
>>
>
> Sorry, I will make the commit message clearer in the next version.
>
> The sequence on CPU 0 is for a normal read(2) on the iterator.
> On CPU 1, it is an io_uring instance trying to do the same on an iterator
> attached to itself.
>
> So CPU 0 does
>
>   sys_read
>     vfs_read
>       bpf_seq_read
>         mutex_lock(&seq_file->lock)     # A
>         io_uring_buf_seq_start
>           mutex_lock(&ctx->uring_lock)  # B
>
> and CPU 1 does
>
>   io_uring_enter
>     mutex_lock(&ctx->uring_lock)        # B
>     io_read
>       bpf_seq_read
>         mutex_lock(&seq_file->lock)     # A
>   ...
>
> Since the order of the locks is opposite, it can deadlock. So I switched the
> mutex_lock in io_uring_buf_seq_start to a trylock, so it can return an error
> in this case; the read then releases seq_file->lock and CPU 1 can make
> progress.
>
> The second problem without the use of trylock is described below (for the
> same case of io_uring reading from an iterator attached to itself).
>
> Let me know if I missed something.
Thanks for the explanation. The above diagram is much better.
>
>>>
>>> The trylock also protects the case where io_uring tries to read from an
>>> iterator attached to itself (same ctx), where the order of locks would
>>> be:
>>> io_uring_enter
>>>   mutex_lock(&ctx->uring_lock)    <-----------.
>>>   io_read                                      \
>>>     seq_read                                    \
>>>       mutex_lock(&seq_file->lock)               /
>>>       mutex_lock(&ctx->uring_lock)  # deadlock-`
>>>
>>> In both these cases (recursive read and contended uring_lock), -EDEADLK
>>> is returned to userspace.
>>>
>>> In the future, this iterator will be extended to directly support
>>> iteration of the bvec Flexible Array Member, so that when there is no
>>> corresponding VMA that maps to the registered buffer (e.g. if the VMA
>>> is destroyed after pinning the pages), we are able to reconstruct the
>>> registration on restore by dumping the page contents and then replaying
>>> them into a temporary mapping used for registration later. All of this
>>> is out of scope for the current series, but builds upon this iterator.
>>>
>>> Cc: Jens Axboe <[email protected]>
>>> Cc: Pavel Begunkov <[email protected]>
>>> Cc: [email protected]
>>> Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]>
>>> ---
>>>   fs/io_uring.c                  | 179 +++++++++++++++++++++++++++++++++
>>>   include/linux/bpf.h            |   2 +
>>>   include/uapi/linux/bpf.h       |   3 +
>>>   tools/include/uapi/linux/bpf.h |   3 +
>>>   4 files changed, 187 insertions(+)
>>>
[...]
>
>>> [...]
>>> +static struct bpf_iter_reg io_uring_buf_reg_info = {
>>> +	.target		= "io_uring_buf",
>>> +	.feature	= BPF_ITER_RESCHED,
>>> +	.attach_target	= bpf_io_uring_iter_attach,
>>> +	.detach_target	= bpf_io_uring_iter_detach,
>>
>> Since you have this extra `io_uring_fd` for the iterator, you may want
>> to implement the show_fdinfo and fill_link_info callback functions here.
>>
>
> Ack, but some questions:
>
> What should it have? E.g. it is easy to go from a map_id to a map fd if one
> wants access to the map attached to the iterator, but I am not sure how one
> can obtain more information about the target fd from the io_uring or epoll
> iterators.
Just to be clear, I am talking about the uapi struct bpf_link_info.
I agree that the fd is not really useful, so I guess it is up to you
whether you want to show the fd to the user or not. We can always
add it later if needed.
>
> Should I delegate to their show_fdinfo and dump using that?
>
> The type/target is already available in link_info; I'm not sure what other
> useful information could be added there that allows obtaining the
> io_uring/epoll fd.
>
>>> +	.ctx_arg_info_size = 2,
>>> +	.ctx_arg_info = {
>>> +		{ offsetof(struct bpf_iter__io_uring_buf, ctx),
>>> +		  PTR_TO_BTF_ID },
>>> +		{ offsetof(struct bpf_iter__io_uring_buf, ubuf),
>>> +		  PTR_TO_BTF_ID_OR_NULL },
>>> +	},
>>> +	.seq_info = &bpf_io_uring_buf_seq_info,
>>> +};
>>> +
>>> +static int __init io_uring_iter_init(void)
>>> +{
>>> +	io_uring_buf_reg_info.ctx_arg_info[0].btf_id = btf_io_uring_ids[0];
>>> +	io_uring_buf_reg_info.ctx_arg_info[1].btf_id = btf_io_uring_ids[1];
>>> +	return bpf_iter_reg_target(&io_uring_buf_reg_info);
>>> +}
>>> +late_initcall(io_uring_iter_init);
>>> +
>>> +#endif
>>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>>> index 56098c866704..ddb9d4520a3f 100644
>>> --- a/include/linux/bpf.h
>>> +++ b/include/linux/bpf.h
>>> @@ -1509,8 +1509,10 @@ int bpf_obj_get_user(const char __user *pathname, int flags);
>>>  	extern int bpf_iter_ ## target(args);	\
>>>  	int __init bpf_iter_ ## target(args) { return 0; }
>>>
>>> +struct io_ring_ctx;
>>>  struct bpf_iter_aux_info {
>>>  	struct bpf_map *map;
>>> +	struct io_ring_ctx *ctx;
>>>  };
>>
>> Can we use union here? Note that below bpf_iter_link_info in
>> uapi/linux/bpf.h, map_fd and io_uring_fd is also an union.
>>
>
> So the reason I didn't use a union was the link->aux.map check in
> bpf_iter.c::__get_seq_info. Even if we switch to a union bpf_iter_aux_info,
> it needs some way to determine whether the link is for a map type, so maybe
> a string comparison there? Leaving it out of the union felt cleaner; I also
> move both io_ring_ctx and the eventpoll file into a union in a later patch.
I see. The seq_ops for the map element iterator is different from the
others: it comes not from the main iter registration but from map_ops.
I think your change is okay, but maybe add a comment in
struct bpf_iter_aux_info to explain why map is different from the others.
>
>>> typedef int (*bpf_iter_attach_target_t)(struct bpf_prog *prog,
>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>>> index 6297eafdc40f..3323defa99a1 100644
>>> --- a/include/uapi/linux/bpf.h
>>> +++ b/include/uapi/linux/bpf.h
>>> @@ -91,6 +91,9 @@ union bpf_iter_link_info {
>>>  	struct {
>>>  		__u32 map_fd;
>>>  	} map;
>>> +	struct {
>>> +		__u32 io_uring_fd;
>>> +	} io_uring;
>>>  };
>>> /* BPF syscall commands, see bpf(2) man-page for more details. */
>>> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
>>> index 6297eafdc40f..3323defa99a1 100644
>>> --- a/tools/include/uapi/linux/bpf.h
>>> +++ b/tools/include/uapi/linux/bpf.h
>>> @@ -91,6 +91,9 @@ union bpf_iter_link_info {
>>>  	struct {
>>>  		__u32 map_fd;
>>>  	} map;
>>> +	struct {
>>> +		__u32 io_uring_fd;
>>> +	} io_uring;
>>>  };
>>> /* BPF syscall commands, see bpf(2) man-page for more details. */
>>>
>
> --
> Kartikeya
>