public inbox for [email protected]
 help / color / mirror / Atom feed
From: Yonghong Song <[email protected]>
To: Kumar Kartikeya Dwivedi <[email protected]>, <[email protected]>
Cc: Alexander Viro <[email protected]>,
	<[email protected]>,
	Alexei Starovoitov <[email protected]>,
	Daniel Borkmann <[email protected]>,
	Andrii Nakryiko <[email protected]>,
	Pavel Emelyanov <[email protected]>,
	Alexander Mihalicyn <[email protected]>,
	Andrei Vagin <[email protected]>, <[email protected]>,
	<[email protected]>
Subject: Re: [PATCH bpf-next v1 4/8] epoll: Implement eBPF iterator for registered items
Date: Thu, 18 Nov 2021 09:50:29 -0800	[thread overview]
Message-ID: <[email protected]> (raw)
In-Reply-To: <[email protected]>



On 11/15/21 9:42 PM, Kumar Kartikeya Dwivedi wrote:
> This patch adds eBPF iterator for epoll items (epitems) registered in an
> epoll instance. It gives access to the eventpoll ctx, and the registered
> epoll item (struct epitem). This allows the iterator to inspect the
> registered file and be able to use others iterators to associate it with
> a task's fdtable.
> 
> The primary usecase this is enabling is expediting existing eventpoll
> checkpoint/restore support in the CRIU project. This iterator allows us
> to switch from a worst case O(n^2) algorithm to a single O(n) pass over
> task and epoll registered descriptors.
> 
> We also make sure we're iterating over a live file, one that is not
> going away. The case we're concerned about is a file that has its
> f_count as zero, but is waiting for iterator bpf_seq_read to release
> ep->mtx, so that it can remove its epitem. Since such a file will
> disappear once iteration is done, and it is being destructed, we use
> get_file_rcu to ensure it is alive when invoking the BPF program.
> 
> Getting access to a file that is going to disappear after iteration
> is not useful anyway. This does have a performance overhead however
> (since file reference will be raised and dropped for each file).
> 
> The rcu_read_lock around get_file_rcu isn't strictly required for
> lifetime management since fput path is serialized on ep->mtx to call
> ep_remove, hence the epi->ffd.file pointer remains stable during our
> seq_start/seq_stop bracketing.
> 
> To be able to continue back from the position we were iterating, we
> store the epi->ffi.fd and use ep_find_tfd to find the target file again.
> It would be more appropriate to use both struct file pointer and fd
> number to find the last file, but see below for why that cannot be done.
> 
> Taking reference to struct file and walking RB-Tree to find it again
> will lead to reference cycle issue if the iterator after partial read
> takes reference to socket which later is used in creating a descriptor
> cycle using SCM_RIGHTS. An example that was encountered when working on
> this is mentioned below.
> 
>    Let there be Unix sockets SK1, SK2, epoll fd EP, and epoll iterator
>    ITER.
>    Let SK1 be registered in EP, then on a partial read it is possible
>    that ITER returns from read and takes reference to SK1 to be able to
>    find it later in RB-Tree and continue the iteration.  If SK1 sends
>    ITER over to SK2 using SCM_RIGHTS, and SK2 sends SK2 over to SK1 using
>    SCM_RIGHTS, and both fds are not consumed on the corresponding receive
>    ends, a cycle is created.  When all of SK1, SK2, EP, and ITER are
>    closed, SK1's receive queue holds reference to SK2, and SK2's receive
>    queue holds reference to ITER, which holds a reference to SK1.
>    All file descriptors except EP leak.
> 
> To resolve it, we would need to hook into the Unix Socket GC mechanism,
> but the alternative of using ep_find_tfd is much more simpler. The
> finding of the last position in face of concurrent modification of the
> epoll set is at best an approximation anyway. For the case of CRIU, the
> epoll set remains stable.
> 
> Cc: Alexander Viro <[email protected]>
> Cc: [email protected]
> Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]>
> ---
>   fs/eventpoll.c                 | 196 ++++++++++++++++++++++++++++++++-
>   include/linux/bpf.h            |   5 +-
>   include/uapi/linux/bpf.h       |   3 +
>   tools/include/uapi/linux/bpf.h |   3 +
>   4 files changed, 205 insertions(+), 2 deletions(-)
> 
[...]
> +
> +static const struct bpf_iter_seq_info bpf_epoll_seq_info = {
> +	.seq_ops          = &bpf_epoll_seq_ops,
> +	.init_seq_private = bpf_epoll_init_seq,
> +	.seq_priv_size    = sizeof(struct bpf_epoll_iter_seq_info),
> +};
> +
> +static struct bpf_iter_reg epoll_reg_info = {
> +	.target            = "epoll",
> +	.feature           = BPF_ITER_RESCHED,
> +	.attach_target     = bpf_epoll_iter_attach,
> +	.detach_target     = bpf_epoll_iter_detach,

implement show_fdinfo and fill_link_info?

There are some bpftool work needed as well to show the information
in user space. An example is e60495eafdba ("bpftool: Implement 
link_query for bpf iterators").


> +	.ctx_arg_info_size = 2,
> +	.ctx_arg_info = {
> +		{ offsetof(struct bpf_iter__epoll, ep),
> +		  PTR_TO_BTF_ID },
> +		{ offsetof(struct bpf_iter__epoll, epi),
> +		  PTR_TO_BTF_ID_OR_NULL },
> +	},
> +	.seq_info	   = &bpf_epoll_seq_info,
> +};
> +
> +static int __init epoll_iter_init(void)
> +{
> +	epoll_reg_info.ctx_arg_info[0].btf_id = btf_epoll_ids[0];
> +	epoll_reg_info.ctx_arg_info[1].btf_id = btf_epoll_ids[1];
> +	return bpf_iter_reg_target(&epoll_reg_info);
> +}
> +late_initcall(epoll_iter_init);
> +
> +#endif
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index fe7b499da781..eb1c9acdc40b 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1512,7 +1512,10 @@ int bpf_obj_get_user(const char __user *pathname, int flags);
>   struct io_ring_ctx;
>   struct bpf_iter_aux_info {
>   	struct bpf_map *map;
> -	struct io_ring_ctx *ctx;
> +	union {
> +		struct io_ring_ctx *ctx;
> +		struct file *ep;
> +	};

You changed to union here. I think we can change
"struct bpf_iter_aux_info" to "union bpf_iter_aux_info".
This should make code simpler and easy to understand.

>   };
>   
>   typedef int (*bpf_iter_attach_target_t)(struct bpf_prog *prog,
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index b70e9da3d722..64e18c1dcfca 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -94,6 +94,9 @@ union bpf_iter_link_info {
>   	struct {
>   		__u32   io_uring_fd;
>   	} io_uring;
> +	struct {
> +		__u32   epoll_fd;
> +	} epoll;
>   };
>   
>   /* BPF syscall commands, see bpf(2) man-page for more details. */
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index b70e9da3d722..64e18c1dcfca 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -94,6 +94,9 @@ union bpf_iter_link_info {
>   	struct {
>   		__u32   io_uring_fd;
>   	} io_uring;
> +	struct {
> +		__u32   epoll_fd;
> +	} epoll;
>   };
>   
>   /* BPF syscall commands, see bpf(2) man-page for more details. */
> 

  reply	other threads:[~2021-11-18 17:51 UTC|newest]

Thread overview: 32+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-11-16  5:42 [PATCH bpf-next v1 0/8] Introduce BPF iterators for io_uring and epoll Kumar Kartikeya Dwivedi
2021-11-16  5:42 ` [PATCH bpf-next v1 1/8] io_uring: Implement eBPF iterator for registered buffers Kumar Kartikeya Dwivedi
2021-11-18 17:21   ` Yonghong Song
2021-11-18 18:28     ` Kumar Kartikeya Dwivedi
2021-11-18 19:13       ` Yonghong Song
2021-11-18 22:02   ` Alexei Starovoitov
2021-11-19  4:15     ` Kumar Kartikeya Dwivedi
2021-11-19  4:44       ` Kumar Kartikeya Dwivedi
2021-11-19  4:56       ` Alexei Starovoitov
2021-11-19  5:16         ` Kumar Kartikeya Dwivedi
2021-11-19  5:24           ` Alexei Starovoitov
2021-11-19  6:12             ` Kumar Kartikeya Dwivedi
2021-12-03 15:52             ` Pavel Begunkov
2021-12-03 23:16               ` Kumar Kartikeya Dwivedi
2021-11-16  5:42 ` [PATCH bpf-next v1 2/8] bpf: Add bpf_page_to_pfn helper Kumar Kartikeya Dwivedi
2021-11-17 12:35   ` kernel test robot
2021-11-17 13:39   ` kernel test robot
2021-11-18 17:27   ` Yonghong Song
2021-11-18 18:30     ` Kumar Kartikeya Dwivedi
2021-11-18 19:18       ` Yonghong Song
2021-11-18 19:22         ` Kumar Kartikeya Dwivedi
2021-11-16  5:42 ` [PATCH bpf-next v1 3/8] io_uring: Implement eBPF iterator for registered files Kumar Kartikeya Dwivedi
2021-11-18 17:33   ` Yonghong Song
2021-11-16  5:42 ` [PATCH bpf-next v1 4/8] epoll: Implement eBPF iterator for registered items Kumar Kartikeya Dwivedi
2021-11-18 17:50   ` Yonghong Song [this message]
2021-11-16  5:42 ` [PATCH bpf-next v1 5/8] selftests/bpf: Add test for io_uring BPF iterators Kumar Kartikeya Dwivedi
2021-11-18 17:54   ` Yonghong Song
2021-11-18 18:33     ` Kumar Kartikeya Dwivedi
2021-11-18 19:21       ` Yonghong Song
2021-11-16  5:42 ` [PATCH bpf-next v1 6/8] selftests/bpf: Add test for epoll BPF iterator Kumar Kartikeya Dwivedi
2021-11-16  5:42 ` [PATCH bpf-next v1 7/8] selftests/bpf: Test partial reads for io_uring, epoll iterators Kumar Kartikeya Dwivedi
2021-11-16  5:42 ` [PATCH bpf-next v1 8/8] selftests/bpf: Fix btf_dump test for bpf_iter_link_info Kumar Kartikeya Dwivedi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox