From: Kumar Kartikeya Dwivedi <[email protected]>
To: Yonghong Song <[email protected]>
Cc: [email protected], Jens Axboe <[email protected]>,
Pavel Begunkov <[email protected]>,
[email protected], Alexei Starovoitov <[email protected]>,
Daniel Borkmann <[email protected]>,
Andrii Nakryiko <[email protected]>,
Pavel Emelyanov <[email protected]>,
Alexander Mihalicyn <[email protected]>,
Andrei Vagin <[email protected]>,
[email protected], [email protected]
Subject: Re: [PATCH bpf-next v1 1/8] io_uring: Implement eBPF iterator for registered buffers
Date: Thu, 18 Nov 2021 23:58:45 +0530 [thread overview]
Message-ID: <[email protected]> (raw)
In-Reply-To: <[email protected]>
On Thu, Nov 18, 2021 at 10:51:59PM IST, Yonghong Song wrote:
>
>
> On 11/15/21 9:42 PM, Kumar Kartikeya Dwivedi wrote:
> > This change adds eBPF iterator for buffers registered in io_uring ctx.
> > It gives access to the ctx, the index of the registered buffer, and a
> > pointer to the io_uring_ubuf itself. This allows the iterator to save
> > info related to buffers added to an io_uring instance, that isn't easy
> > to export using the fdinfo interface (like exact struct page composing
> > the registered buffer).
> >
> > The primary usecase this is enabling is checkpoint/restore support.
> >
> > Note that we need to use mutex_trylock when the file is read from, in
> > seq_start functions, as the order of lock taken is opposite of what it
> > would be when io_uring operation reads the same file. We take
> > seq_file->lock, then ctx->uring_lock, while io_uring would first take
> > ctx->uring_lock and then seq_file->lock for the same ctx.
> >
> > This can lead to a deadlock scenario described below:
> >
> > CPU 0 CPU 1
> >
> > vfs_read
> > mutex_lock(&seq_file->lock) io_read
> > mutex_lock(&ctx->uring_lock)
> > mutex_lock(&ctx->uring_lock) # switched to mutex_trylock
> > mutex_lock(&seq_file->lock)
>
> It is not clear which mutex_lock switched to mutex_trylock.
The one in vfs_read.
> From below example, it looks like &ctx->uring_lock. But if this is
> the case, we could have deadlock, right?
>
Sorry, I will make the commit message clearer in the next version.
The sequence on CPU 0 is for normal read(2) on iterator.
For CPU 1, it is an io_uring instance trying to do same on iterator attached to
itself.
So CPU 0 does
sys_read
vfs_read
bpf_seq_read
mutex_lock(&seq_file->lock) # A
io_uring_buf_seq_start
mutex_lock(&ctx->uring_lock) # B
and CPU 1 does
io_uring_enter
mutex_lock(&ctx->uring_lock) # B
io_read
bpf_seq_read
mutex_lock(&seq_file->lock) # A
...
Since the order of locks is opposite, it can deadlock. So I switched the
mutex_lock in io_uring_buf_seq_start to trylock, so it can return an error for
this case, then it will release seq_file->lock and CPU 1 will make progress.
The second problem without use of trylock is described below (for same case of
io_uring reading from iterator attached to itself).
Let me know if I missed something.
> >
> > The trylock also protects the case where io_uring tries to read from
> > iterator attached to itself (same ctx), where the order of locks would
> > be:
> > io_uring_enter
> > mutex_lock(&ctx->uring_lock) <-----------.
> > io_read \
> > seq_read \
> > mutex_lock(&seq_file->lock) /
> > mutex_lock(&ctx->uring_lock) # deadlock-`
> >
> > In both these cases (recursive read and contended uring_lock), -EDEADLK
> > is returned to userspace.
> >
> > In the future, this iterator will be extended to directly support
> > iteration of bvec Flexible Array Member, so that when there is no
> > corresponding VMA that maps to the registered buffer (e.g. if VMA is
> > destroyed after pinning pages), we are able to reconstruct the
> > registration on restore by dumping the page contents and then replaying
> > them into a temporary mapping used for registration later. All this is
> > out of scope for the current series however, but builds upon this
> > iterator.
> >
> > Cc: Jens Axboe <[email protected]>
> > Cc: Pavel Begunkov <[email protected]>
> > Cc: [email protected]
> > Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]>
> > ---
> > fs/io_uring.c | 179 +++++++++++++++++++++++++++++++++
> > include/linux/bpf.h | 2 +
> > include/uapi/linux/bpf.h | 3 +
> > tools/include/uapi/linux/bpf.h | 3 +
> > 4 files changed, 187 insertions(+)
> >
> > diff --git a/fs/io_uring.c b/fs/io_uring.c
> > index b07196b4511c..46a110989155 100644
> > --- a/fs/io_uring.c
> > +++ b/fs/io_uring.c
> > @@ -81,6 +81,7 @@
> > #include <linux/tracehook.h>
> > #include <linux/audit.h>
> > #include <linux/security.h>
> > +#include <linux/btf_ids.h>
> > #define CREATE_TRACE_POINTS
> > #include <trace/events/io_uring.h>
> > @@ -11125,3 +11126,181 @@ static int __init io_uring_init(void)
> > return 0;
> > };
> > __initcall(io_uring_init);
> > +
> > +#ifdef CONFIG_BPF_SYSCALL
> > +
> > +BTF_ID_LIST(btf_io_uring_ids)
> > +BTF_ID(struct, io_ring_ctx)
> > +BTF_ID(struct, io_mapped_ubuf)
> > +
> > +struct bpf_io_uring_seq_info {
> > + struct io_ring_ctx *ctx;
> > + unsigned long index;
> > +};
> > +
> > +static int bpf_io_uring_init_seq(void *priv_data, struct bpf_iter_aux_info *aux)
> > +{
> > + struct bpf_io_uring_seq_info *info = priv_data;
> > + struct io_ring_ctx *ctx = aux->ctx;
> > +
> > + info->ctx = ctx;
> > + return 0;
> > +}
> > +
> > +static int bpf_io_uring_iter_attach(struct bpf_prog *prog,
> > + union bpf_iter_link_info *linfo,
> > + struct bpf_iter_aux_info *aux)
> > +{
> > + struct io_ring_ctx *ctx;
> > + struct fd f;
> > + int ret;
> > +
> > + f = fdget(linfo->io_uring.io_uring_fd);
> > + if (unlikely(!f.file))
> > + return -EBADF;
> > +
> > + ret = -EOPNOTSUPP;
> > + if (unlikely(f.file->f_op != &io_uring_fops))
> > + goto out_fput;
> > +
> > + ret = -ENXIO;
> > + ctx = f.file->private_data;
> > + if (unlikely(!percpu_ref_tryget(&ctx->refs)))
> > + goto out_fput;
> > +
> > + ret = 0;
> > + aux->ctx = ctx;
> > +
> > +out_fput:
> > + fdput(f);
> > + return ret;
> > +}
> > +
> > +static void bpf_io_uring_iter_detach(struct bpf_iter_aux_info *aux)
> > +{
> > + percpu_ref_put(&aux->ctx->refs);
> > +}
> > +
> > +/* io_uring iterator for registered buffers */
> > +
> > +struct bpf_iter__io_uring_buf {
> > + __bpf_md_ptr(struct bpf_iter_meta *, meta);
> > + __bpf_md_ptr(struct io_ring_ctx *, ctx);
> > + __bpf_md_ptr(struct io_mapped_ubuf *, ubuf);
> > + unsigned long index;
> > +};
>
> I would suggest to change "unsigned long index" to either u32 or u64.
> This structure is also the bpf program context and in bpf program context,
> "index" will be u64. Then on 32bit system, we potentially
> could have issues.
>
Ack, will do.
> > +
> > +static void *__bpf_io_uring_buf_seq_get_next(struct bpf_io_uring_seq_info *info)
> > +{
> > + if (info->index < info->ctx->nr_user_bufs)
> > + return info->ctx->user_bufs[info->index++];
> > + return NULL;
> > +}
> > +
> > +static void *bpf_io_uring_buf_seq_start(struct seq_file *seq, loff_t *pos)
> > +{
> > + struct bpf_io_uring_seq_info *info = seq->private;
> > + struct io_mapped_ubuf *ubuf;
> > +
> > + /* Indicate to userspace that the uring lock is contended */
> > + if (!mutex_trylock(&info->ctx->uring_lock))
> > + return ERR_PTR(-EDEADLK);
> > +
> > + ubuf = __bpf_io_uring_buf_seq_get_next(info);
> > + if (!ubuf)
> > + return NULL;
> > +
> > + if (*pos == 0)
> > + ++*pos;
> > + return ubuf;
> > +}
> > +
> > +static void *bpf_io_uring_buf_seq_next(struct seq_file *seq, void *v, loff_t *pos)
> > +{
> > + struct bpf_io_uring_seq_info *info = seq->private;
> > +
> > + ++*pos;
> > + return __bpf_io_uring_buf_seq_get_next(info);
> > +}
> > +
> > +DEFINE_BPF_ITER_FUNC(io_uring_buf, struct bpf_iter_meta *meta,
> > + struct io_ring_ctx *ctx, struct io_mapped_ubuf *ubuf,
> > + unsigned long index)
>
> Again, change "unsigned long" to "u32" or "u64".
>
Ack.
> > [...]
> > +static struct bpf_iter_reg io_uring_buf_reg_info = {
> > + .target = "io_uring_buf",
> > + .feature = BPF_ITER_RESCHED,
> > + .attach_target = bpf_io_uring_iter_attach,
> > + .detach_target = bpf_io_uring_iter_detach,
>
> Since you have this extra `io_uring_fd` for the iterator, you may want
> to implement show_fdinfo and fill_link_info callback functions here.
>
Ack, but some questions:
What should it have? e.g. it easy to go from map_id to map fd if one wants
access to the map attached to the iterator, but not sure how one can obtain more
information about target fd from io_uring or epoll iterators.
Should I delegate to their show_fdinfo and dump using that?
The type/target is already available in link_info, not sure what other useful
information can be added there, which allows obtaining the io_uring/epoll fd.
> > + .ctx_arg_info_size = 2,
> > + .ctx_arg_info = {
> > + { offsetof(struct bpf_iter__io_uring_buf, ctx),
> > + PTR_TO_BTF_ID },
> > + { offsetof(struct bpf_iter__io_uring_buf, ubuf),
> > + PTR_TO_BTF_ID_OR_NULL },
> > + },
> > + .seq_info = &bpf_io_uring_buf_seq_info,
> > +};
> > +
> > +static int __init io_uring_iter_init(void)
> > +{
> > + io_uring_buf_reg_info.ctx_arg_info[0].btf_id = btf_io_uring_ids[0];
> > + io_uring_buf_reg_info.ctx_arg_info[1].btf_id = btf_io_uring_ids[1];
> > + return bpf_iter_reg_target(&io_uring_buf_reg_info);
> > +}
> > +late_initcall(io_uring_iter_init);
> > +
> > +#endif
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index 56098c866704..ddb9d4520a3f 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
> > @@ -1509,8 +1509,10 @@ int bpf_obj_get_user(const char __user *pathname, int flags);
> > extern int bpf_iter_ ## target(args); \
> > int __init bpf_iter_ ## target(args) { return 0; }
> > +struct io_ring_ctx;
> > struct bpf_iter_aux_info {
> > struct bpf_map *map;
> > + struct io_ring_ctx *ctx;
> > };
>
> Can we use union here? Note that below bpf_iter_link_info in
> uapi/linux/bpf.h, map_fd and io_uring_fd is also an union.
>
So the reason I didn't use a union was the link->aux.map check in
bpf_iter.c::__get_seq_info. Even if we switch to union bpf_iter_aux_info, it
needs some way to determine whether link is for map type, so maybe a string
comparison there? Leaving it out of union felt cleaner, also I move both
io_ring_ctx and eventpoll file into a union in later patch.
> > typedef int (*bpf_iter_attach_target_t)(struct bpf_prog *prog,
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index 6297eafdc40f..3323defa99a1 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -91,6 +91,9 @@ union bpf_iter_link_info {
> > struct {
> > __u32 map_fd;
> > } map;
> > + struct {
> > + __u32 io_uring_fd;
> > + } io_uring;
> > };
> > /* BPF syscall commands, see bpf(2) man-page for more details. */
> > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> > index 6297eafdc40f..3323defa99a1 100644
> > --- a/tools/include/uapi/linux/bpf.h
> > +++ b/tools/include/uapi/linux/bpf.h
> > @@ -91,6 +91,9 @@ union bpf_iter_link_info {
> > struct {
> > __u32 map_fd;
> > } map;
> > + struct {
> > + __u32 io_uring_fd;
> > + } io_uring;
> > };
> > /* BPF syscall commands, see bpf(2) man-page for more details. */
> >
--
Kartikeya
next prev parent reply other threads:[~2021-11-18 18:28 UTC|newest]
Thread overview: 32+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-11-16 5:42 [PATCH bpf-next v1 0/8] Introduce BPF iterators for io_uring and epoll Kumar Kartikeya Dwivedi
2021-11-16 5:42 ` [PATCH bpf-next v1 1/8] io_uring: Implement eBPF iterator for registered buffers Kumar Kartikeya Dwivedi
2021-11-18 17:21 ` Yonghong Song
2021-11-18 18:28 ` Kumar Kartikeya Dwivedi [this message]
2021-11-18 19:13 ` Yonghong Song
2021-11-18 22:02 ` Alexei Starovoitov
2021-11-19 4:15 ` Kumar Kartikeya Dwivedi
2021-11-19 4:44 ` Kumar Kartikeya Dwivedi
2021-11-19 4:56 ` Alexei Starovoitov
2021-11-19 5:16 ` Kumar Kartikeya Dwivedi
2021-11-19 5:24 ` Alexei Starovoitov
2021-11-19 6:12 ` Kumar Kartikeya Dwivedi
2021-12-03 15:52 ` Pavel Begunkov
2021-12-03 23:16 ` Kumar Kartikeya Dwivedi
2021-11-16 5:42 ` [PATCH bpf-next v1 2/8] bpf: Add bpf_page_to_pfn helper Kumar Kartikeya Dwivedi
2021-11-17 12:35 ` kernel test robot
2021-11-17 13:39 ` kernel test robot
2021-11-18 17:27 ` Yonghong Song
2021-11-18 18:30 ` Kumar Kartikeya Dwivedi
2021-11-18 19:18 ` Yonghong Song
2021-11-18 19:22 ` Kumar Kartikeya Dwivedi
2021-11-16 5:42 ` [PATCH bpf-next v1 3/8] io_uring: Implement eBPF iterator for registered files Kumar Kartikeya Dwivedi
2021-11-18 17:33 ` Yonghong Song
2021-11-16 5:42 ` [PATCH bpf-next v1 4/8] epoll: Implement eBPF iterator for registered items Kumar Kartikeya Dwivedi
2021-11-18 17:50 ` Yonghong Song
2021-11-16 5:42 ` [PATCH bpf-next v1 5/8] selftests/bpf: Add test for io_uring BPF iterators Kumar Kartikeya Dwivedi
2021-11-18 17:54 ` Yonghong Song
2021-11-18 18:33 ` Kumar Kartikeya Dwivedi
2021-11-18 19:21 ` Yonghong Song
2021-11-16 5:42 ` [PATCH bpf-next v1 6/8] selftests/bpf: Add test for epoll BPF iterator Kumar Kartikeya Dwivedi
2021-11-16 5:42 ` [PATCH bpf-next v1 7/8] selftests/bpf: Test partial reads for io_uring, epoll iterators Kumar Kartikeya Dwivedi
2021-11-16 5:42 ` [PATCH bpf-next v1 8/8] selftests/bpf: Fix btf_dump test for bpf_iter_link_info Kumar Kartikeya Dwivedi
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox