From: Yonghong Song <[email protected]>
To: Kumar Kartikeya Dwivedi <[email protected]>, <[email protected]>
Cc: Alexander Viro <[email protected]>,
<[email protected]>,
Alexei Starovoitov <[email protected]>,
Daniel Borkmann <[email protected]>,
Andrii Nakryiko <[email protected]>,
Pavel Emelyanov <[email protected]>,
Alexander Mihalicyn <[email protected]>,
Andrei Vagin <[email protected]>, <[email protected]>,
<[email protected]>
Subject: Re: [PATCH bpf-next v1 4/8] epoll: Implement eBPF iterator for registered items
Date: Thu, 18 Nov 2021 09:50:29 -0800 [thread overview]
Message-ID: <[email protected]> (raw)
In-Reply-To: <[email protected]>
On 11/15/21 9:42 PM, Kumar Kartikeya Dwivedi wrote:
> This patch adds eBPF iterator for epoll items (epitems) registered in an
> epoll instance. It gives access to the eventpoll ctx, and the registered
> epoll item (struct epitem). This allows the iterator to inspect the
> registered file and be able to use others iterators to associate it with
> a task's fdtable.
>
> The primary usecase this is enabling is expediting existing eventpoll
> checkpoint/restore support in the CRIU project. This iterator allows us
> to switch from a worst case O(n^2) algorithm to a single O(n) pass over
> task and epoll registered descriptors.
>
> We also make sure we're iterating over a live file, one that is not
> going away. The case we're concerned about is a file that has its
> f_count as zero, but is waiting for iterator bpf_seq_read to release
> ep->mtx, so that it can remove its epitem. Since such a file will
> disappear once iteration is done, and it is being destructed, we use
> get_file_rcu to ensure it is alive when invoking the BPF program.
>
> Getting access to a file that is going to disappear after iteration
> is not useful anyway. This does have a performance overhead however
> (since file reference will be raised and dropped for each file).
>
> The rcu_read_lock around get_file_rcu isn't strictly required for
> lifetime management since fput path is serialized on ep->mtx to call
> ep_remove, hence the epi->ffd.file pointer remains stable during our
> seq_start/seq_stop bracketing.
>
> To be able to continue back from the position we were iterating, we
> store the epi->ffi.fd and use ep_find_tfd to find the target file again.
> It would be more appropriate to use both struct file pointer and fd
> number to find the last file, but see below for why that cannot be done.
>
> Taking reference to struct file and walking RB-Tree to find it again
> will lead to reference cycle issue if the iterator after partial read
> takes reference to socket which later is used in creating a descriptor
> cycle using SCM_RIGHTS. An example that was encountered when working on
> this is mentioned below.
>
> Let there be Unix sockets SK1, SK2, epoll fd EP, and epoll iterator
> ITER.
> Let SK1 be registered in EP, then on a partial read it is possible
> that ITER returns from read and takes reference to SK1 to be able to
> find it later in RB-Tree and continue the iteration. If SK1 sends
> ITER over to SK2 using SCM_RIGHTS, and SK2 sends SK2 over to SK1 using
> SCM_RIGHTS, and both fds are not consumed on the corresponding receive
> ends, a cycle is created. When all of SK1, SK2, EP, and ITER are
> closed, SK1's receive queue holds reference to SK2, and SK2's receive
> queue holds reference to ITER, which holds a reference to SK1.
> All file descriptors except EP leak.
>
> To resolve it, we would need to hook into the Unix Socket GC mechanism,
> but the alternative of using ep_find_tfd is much more simpler. The
> finding of the last position in face of concurrent modification of the
> epoll set is at best an approximation anyway. For the case of CRIU, the
> epoll set remains stable.
>
> Cc: Alexander Viro <[email protected]>
> Cc: [email protected]
> Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]>
> ---
> fs/eventpoll.c | 196 ++++++++++++++++++++++++++++++++-
> include/linux/bpf.h | 5 +-
> include/uapi/linux/bpf.h | 3 +
> tools/include/uapi/linux/bpf.h | 3 +
> 4 files changed, 205 insertions(+), 2 deletions(-)
>
[...]
> +
> +static const struct bpf_iter_seq_info bpf_epoll_seq_info = {
> + .seq_ops = &bpf_epoll_seq_ops,
> + .init_seq_private = bpf_epoll_init_seq,
> + .seq_priv_size = sizeof(struct bpf_epoll_iter_seq_info),
> +};
> +
> +static struct bpf_iter_reg epoll_reg_info = {
> + .target = "epoll",
> + .feature = BPF_ITER_RESCHED,
> + .attach_target = bpf_epoll_iter_attach,
> + .detach_target = bpf_epoll_iter_detach,
implement show_fdinfo and fill_link_info?
There are some bpftool work needed as well to show the information
in user space. An example is e60495eafdba ("bpftool: Implement
link_query for bpf iterators").
> + .ctx_arg_info_size = 2,
> + .ctx_arg_info = {
> + { offsetof(struct bpf_iter__epoll, ep),
> + PTR_TO_BTF_ID },
> + { offsetof(struct bpf_iter__epoll, epi),
> + PTR_TO_BTF_ID_OR_NULL },
> + },
> + .seq_info = &bpf_epoll_seq_info,
> +};
> +
> +static int __init epoll_iter_init(void)
> +{
> + epoll_reg_info.ctx_arg_info[0].btf_id = btf_epoll_ids[0];
> + epoll_reg_info.ctx_arg_info[1].btf_id = btf_epoll_ids[1];
> + return bpf_iter_reg_target(&epoll_reg_info);
> +}
> +late_initcall(epoll_iter_init);
> +
> +#endif
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index fe7b499da781..eb1c9acdc40b 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1512,7 +1512,10 @@ int bpf_obj_get_user(const char __user *pathname, int flags);
> struct io_ring_ctx;
> struct bpf_iter_aux_info {
> struct bpf_map *map;
> - struct io_ring_ctx *ctx;
> + union {
> + struct io_ring_ctx *ctx;
> + struct file *ep;
> + };
You changed to union here. I think we can change
"struct bpf_iter_aux_info" to "union bpf_iter_aux_info".
This should make code simpler and easy to understand.
> };
>
> typedef int (*bpf_iter_attach_target_t)(struct bpf_prog *prog,
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index b70e9da3d722..64e18c1dcfca 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -94,6 +94,9 @@ union bpf_iter_link_info {
> struct {
> __u32 io_uring_fd;
> } io_uring;
> + struct {
> + __u32 epoll_fd;
> + } epoll;
> };
>
> /* BPF syscall commands, see bpf(2) man-page for more details. */
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index b70e9da3d722..64e18c1dcfca 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -94,6 +94,9 @@ union bpf_iter_link_info {
> struct {
> __u32 io_uring_fd;
> } io_uring;
> + struct {
> + __u32 epoll_fd;
> + } epoll;
> };
>
> /* BPF syscall commands, see bpf(2) man-page for more details. */
>
next prev parent reply other threads:[~2021-11-18 17:51 UTC|newest]
Thread overview: 32+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-11-16 5:42 [PATCH bpf-next v1 0/8] Introduce BPF iterators for io_uring and epoll Kumar Kartikeya Dwivedi
2021-11-16 5:42 ` [PATCH bpf-next v1 1/8] io_uring: Implement eBPF iterator for registered buffers Kumar Kartikeya Dwivedi
2021-11-18 17:21 ` Yonghong Song
2021-11-18 18:28 ` Kumar Kartikeya Dwivedi
2021-11-18 19:13 ` Yonghong Song
2021-11-18 22:02 ` Alexei Starovoitov
2021-11-19 4:15 ` Kumar Kartikeya Dwivedi
2021-11-19 4:44 ` Kumar Kartikeya Dwivedi
2021-11-19 4:56 ` Alexei Starovoitov
2021-11-19 5:16 ` Kumar Kartikeya Dwivedi
2021-11-19 5:24 ` Alexei Starovoitov
2021-11-19 6:12 ` Kumar Kartikeya Dwivedi
2021-12-03 15:52 ` Pavel Begunkov
2021-12-03 23:16 ` Kumar Kartikeya Dwivedi
2021-11-16 5:42 ` [PATCH bpf-next v1 2/8] bpf: Add bpf_page_to_pfn helper Kumar Kartikeya Dwivedi
2021-11-17 12:35 ` kernel test robot
2021-11-17 13:39 ` kernel test robot
2021-11-18 17:27 ` Yonghong Song
2021-11-18 18:30 ` Kumar Kartikeya Dwivedi
2021-11-18 19:18 ` Yonghong Song
2021-11-18 19:22 ` Kumar Kartikeya Dwivedi
2021-11-16 5:42 ` [PATCH bpf-next v1 3/8] io_uring: Implement eBPF iterator for registered files Kumar Kartikeya Dwivedi
2021-11-18 17:33 ` Yonghong Song
2021-11-16 5:42 ` [PATCH bpf-next v1 4/8] epoll: Implement eBPF iterator for registered items Kumar Kartikeya Dwivedi
2021-11-18 17:50 ` Yonghong Song [this message]
2021-11-16 5:42 ` [PATCH bpf-next v1 5/8] selftests/bpf: Add test for io_uring BPF iterators Kumar Kartikeya Dwivedi
2021-11-18 17:54 ` Yonghong Song
2021-11-18 18:33 ` Kumar Kartikeya Dwivedi
2021-11-18 19:21 ` Yonghong Song
2021-11-16 5:42 ` [PATCH bpf-next v1 6/8] selftests/bpf: Add test for epoll BPF iterator Kumar Kartikeya Dwivedi
2021-11-16 5:42 ` [PATCH bpf-next v1 7/8] selftests/bpf: Test partial reads for io_uring, epoll iterators Kumar Kartikeya Dwivedi
2021-11-16 5:42 ` [PATCH bpf-next v1 8/8] selftests/bpf: Fix btf_dump test for bpf_iter_link_info Kumar Kartikeya Dwivedi
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox