* [PATCH bpf-next v3 00/10] Introduce BPF iterators for io_uring and epoll
@ 2021-12-01 4:23 Kumar Kartikeya Dwivedi
2021-12-01 4:23 ` [PATCH bpf-next v3 01/10] io_uring: Implement eBPF iterator for registered buffers Kumar Kartikeya Dwivedi
` (9 more replies)
0 siblings, 10 replies; 11+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2021-12-01 4:23 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Yonghong Song, Pavel Emelyanov, Alexander Mikhalitsyn,
Andrei Vagin, criu, io-uring, linux-fsdevel
The CRIU [0] project developers are exploring potential uses of the BPF
subsystem for complicated tasks that are difficult to support in the kernel
using existing interfaces. Even when such tasks can be implemented using
procfs or kcmp, it is difficult to make them perform well without some kind of
programmable introspection into kernel data structures. Moreover, for
procfs based state inspection, the output format, once agreed upon, is set in
stone and hard to extend, and at the same time inefficient to consume from
programs (the state is first converted from machine readable form to human
readable form, only to be converted back again to machine readable form). In
addition, the kcmp based file set matching algorithm performs poorly, since
each file in one set needs to be compared to each file in the other set to
determine struct file equivalence.
This set adds an io_uring file iterator (for registered files), an io_uring
ubuf iterator (for registered buffers), and an epoll iterator (for files
registered using EPOLL_CTL_ADD) to overcome these limitations. Combined with
the existing task, task_file, and task_vma iterators, all of these can be used
together to significantly enhance and speed up the task dumping procedure, as
the sketch below shows.
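As a rough sketch of the intended userspace flow (mirroring the selftests
added later in the series; setup and error handling elided), an iterator link
is pinned to one io_uring or epoll fd and then consumed like any other BPF
iterator:

  DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
  union bpf_iter_link_info linfo = {};
  struct bpf_link *link;
  char buf[4096];
  int iter_fd;

  /* attach the iterator program to one specific io_uring instance */
  linfo.io_uring.io_uring_fd = io_uring_fd;
  opts.link_info = &linfo;
  opts.link_info_len = sizeof(linfo);
  link = bpf_program__attach_iter(skel->progs.dump_io_uring_buf, &opts);

  /* instantiate and drain the iterator like a seq_file */
  iter_fd = bpf_iter_create(bpf_link__fd(link));
  while (read(iter_fd, buf, sizeof(buf)) > 0)
          ;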
The two immediate use cases are io_uring checkpoint/restore support and epoll
checkpoint/restore support. The first is so far unimplemented, and the second
is being expedited using the new epoll iterator. In the future, more stages of
the checkpointing sequence can be offloaded to eBPF programs to reduce process
downtime, e.g. in the pre-dump stage, before the task is seized.
The io_uring file iterator is even more important now due to the advent of
descriptorless files in io_uring [1], which makes dumping a task's files a lot
harder for CRIU, since there is no visibility into these hidden descriptors
that the task depends upon for operation. Similarly, the io_uring_buf iterator
is useful in case the original VMA used to register a buffer has been
destroyed.
The set includes a sample showing how these iterators, along with the
task_file iterator, can be used to restore an io_uring instance; it implements
a simplified version of the code we are planning to adopt for CRIU. Patch 10
is not meant for submission, only exposition, hence it is explicitly marked
RFC. It implements the missing features noted in [2].
Please see the individual patches for more details.
[ Note (for Yonghong): I am still unsure what would be useful in show_fdinfo
  and fill_link_info for epoll, so that has been left out. I was reminded that
  io_uring now uses anon_inode_getfile_secure, which we also use in CRIU to
  determine the source fd of the ring mapping, so the inode should be enough
  to identify the io_uring fd in userspace; hence I implemented it for
  io_uring in v2. ]
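(For reference, matching the inode exposed in link info back to an io_uring
fd can be done as in the minimal sketch below, mirroring the
io_uring_inode_match helper in the selftests; error handling elided.)

  struct bpf_link_info info = {};
  __u32 len = sizeof(info);
  struct stat st;

  bpf_obj_get_info_by_fd(link_fd, &info, &len);
  fstat(io_uring_fd, &st);
  if (st.st_ino == info.iter.io_uring.inode)
          /* io_uring_fd refers to the attached io_uring instance */;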
[0]: https://criu.org/Main_Page
[1]: https://lwn.net/Articles/863071
[2]: https://github.com/checkpoint-restore/criu/pull/1597
Changelog:
----------
v2 -> v3:
v2: https://lore.kernel.org/bpf/[email protected]
* Make show_fdinfo/fill_link_info functions static (Kernel Test Robot)
* Minor memory leak fixes for bpf_cr
* Use proper names instead of -2, -1 for denoting epoll iterator state
v1 -> v2:
v1: https://lore.kernel.org/bpf/[email protected]
* Add example showing how iterator is useful in C/R of io_uring (Alexei)
* Change type of index from unsigned long to u64 (Yonghong)
* Fix build error for CONFIG_IO_URING=n (Kernel Test Robot)
* Move bpf_page_to_pfn out of CONFIG_IO_URING (Yonghong)
* Add comment to bpf_iter_aux_info for map member (Yonghong)
* show_fdinfo/fill_link_info for io_uring (Yonghong)
* Fix other nits
Kumar Kartikeya Dwivedi (10):
io_uring: Implement eBPF iterator for registered buffers
bpf: Add bpf_page_to_pfn helper
io_uring: Implement eBPF iterator for registered files
epoll: Implement eBPF iterator for registered items
bpftool: Output io_uring iterator info
selftests/bpf: Add test for io_uring BPF iterators
selftests/bpf: Add test for epoll BPF iterator
selftests/bpf: Test partial reads for io_uring, epoll iterators
selftests/bpf: Fix btf_dump test for bpf_iter_link_info
samples/bpf: Add example to checkpoint/restore io_uring
fs/eventpoll.c | 201 ++++-
fs/io_uring.c | 345 +++++++++
include/linux/bpf.h | 16 +
include/uapi/linux/bpf.h | 18 +
kernel/trace/bpf_trace.c | 19 +
samples/bpf/.gitignore | 1 +
samples/bpf/Makefile | 8 +-
samples/bpf/bpf_cr.bpf.c | 185 +++++
samples/bpf/bpf_cr.c | 688 ++++++++++++++++++
samples/bpf/bpf_cr.h | 48 ++
samples/bpf/hbm_kern.h | 2 -
scripts/bpf_doc.py | 2 +
tools/bpf/bpftool/link.c | 10 +
tools/include/uapi/linux/bpf.h | 18 +
.../selftests/bpf/prog_tests/bpf_iter.c | 387 +++++++++-
.../selftests/bpf/prog_tests/btf_dump.c | 4 +-
.../selftests/bpf/progs/bpf_iter_epoll.c | 33 +
.../selftests/bpf/progs/bpf_iter_io_uring.c | 50 ++
18 files changed, 2027 insertions(+), 8 deletions(-)
create mode 100644 samples/bpf/bpf_cr.bpf.c
create mode 100644 samples/bpf/bpf_cr.c
create mode 100644 samples/bpf/bpf_cr.h
create mode 100644 tools/testing/selftests/bpf/progs/bpf_iter_epoll.c
create mode 100644 tools/testing/selftests/bpf/progs/bpf_iter_io_uring.c
--
2.34.1
* [PATCH bpf-next v3 01/10] io_uring: Implement eBPF iterator for registered buffers
2021-12-01 4:23 [PATCH bpf-next v3 00/10] Introduce BPF iterators for io_uring and epoll Kumar Kartikeya Dwivedi
@ 2021-12-01 4:23 ` Kumar Kartikeya Dwivedi
2021-12-01 4:23 ` [PATCH bpf-next v3 02/10] bpf: Add bpf_page_to_pfn helper Kumar Kartikeya Dwivedi
` (8 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2021-12-01 4:23 UTC (permalink / raw)
To: bpf
Cc: Jens Axboe, Pavel Begunkov, io-uring, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, Yonghong Song, Pavel Emelyanov,
Alexander Mikhalitsyn, Andrei Vagin, criu, linux-fsdevel
This change adds an eBPF iterator for buffers registered in an io_uring ctx.
It gives access to the ctx, the index of the registered buffer, and a
pointer to the io_mapped_ubuf itself. This allows the iterator to save
info related to buffers added to an io_uring instance that isn't easy
to export using the fdinfo interface (like the exact struct pages composing
a registered buffer).
The primary use case this enables is checkpoint/restore support.
Note that we need to use mutex_trylock when the file is read from, in the
seq_start function, as the lock ordering is the opposite of what it
would be when an io_uring operation reads the same file. We take
seq_file->lock, then ctx->uring_lock, while io_uring would first take
ctx->uring_lock and then seq_file->lock for the same ctx.
This can lead to a deadlock scenario described below:
The sequence on CPU 0 is for a normal read(2) on the iterator.
For CPU 1, it is an io_uring instance trying to do the same on an iterator
attached to itself.
So CPU 0 does
sys_read
vfs_read
bpf_seq_read
mutex_lock(&seq_file->lock) # A
io_uring_buf_seq_start
mutex_lock(&ctx->uring_lock) # B
and CPU 1 does
io_uring_enter
mutex_lock(&ctx->uring_lock) # B
io_read
bpf_seq_read
mutex_lock(&seq_file->lock) # A
...
Since the lock ordering is opposite, this can deadlock. So we switch the
mutex_lock in io_uring_buf_seq_start to a trylock, so that it returns an
error in this case; the read path then releases seq_file->lock and CPU 1
can make progress.
The trylock also protects the case where io_uring tries to read from an
iterator attached to itself (same ctx), where the lock ordering would
be:
io_uring_enter
mutex_lock(&ctx->uring_lock) <------------.
io_read \
seq_read \
mutex_lock(&seq_file->lock) /
mutex_lock(&ctx->uring_lock) # deadlock-`
In both these cases (recursive read and contended uring_lock), -EDEADLK
is returned to userspace.
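A userspace consumer can deal with the contended-lock case by simply retrying
the read; a minimal sketch (the backoff is an arbitrary choice, and the
recursive case is a programming error that no retry will fix):

  ssize_t n;

  for (;;) {
          n = read(iter_fd, buf, sizeof(buf));
          if (n >= 0 || errno != EDEADLK)
                  break;
          /* ctx->uring_lock was contended, back off and retry */
          usleep(1000);
  }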
In the future, this iterator will be extended to directly support
iteration of the bvec flexible array member, so that when there is no
corresponding VMA that maps to the registered buffer (e.g. if the VMA is
destroyed after pinning the pages), we are able to reconstruct the
registration on restore by dumping the page contents and then replaying
them into a temporary mapping used for registration. All of this is
out of scope for the current series, but it builds upon this iterator.
Cc: Jens Axboe <[email protected]>
Cc: Pavel Begunkov <[email protected]>
Cc: [email protected]
Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]>
---
fs/io_uring.c | 203 +++++++++++++++++++++++++++++++++
include/linux/bpf.h | 12 ++
include/uapi/linux/bpf.h | 6 +
tools/include/uapi/linux/bpf.h | 6 +
4 files changed, 227 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index b07196b4511c..02e628448ebd 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -81,6 +81,7 @@
#include <linux/tracehook.h>
#include <linux/audit.h>
#include <linux/security.h>
+#include <linux/btf_ids.h>
#define CREATE_TRACE_POINTS
#include <trace/events/io_uring.h>
@@ -11125,3 +11126,205 @@ static int __init io_uring_init(void)
return 0;
};
__initcall(io_uring_init);
+
+#ifdef CONFIG_BPF_SYSCALL
+
+BTF_ID_LIST(btf_io_uring_ids)
+BTF_ID(struct, io_ring_ctx)
+BTF_ID(struct, io_mapped_ubuf)
+
+struct bpf_io_uring_seq_info {
+ struct io_ring_ctx *ctx;
+ u64 index;
+};
+
+static int bpf_io_uring_init_seq(void *priv_data, struct bpf_iter_aux_info *aux)
+{
+ struct bpf_io_uring_seq_info *info = priv_data;
+ struct io_ring_ctx *ctx = aux->io_uring.ctx;
+
+ info->ctx = ctx;
+ return 0;
+}
+
+static int bpf_io_uring_iter_attach(struct bpf_prog *prog,
+ union bpf_iter_link_info *linfo,
+ struct bpf_iter_aux_info *aux)
+{
+ struct io_ring_ctx *ctx;
+ struct fd f;
+ int ret;
+
+ f = fdget(linfo->io_uring.io_uring_fd);
+ if (unlikely(!f.file))
+ return -EBADF;
+
+ ret = -EOPNOTSUPP;
+ if (unlikely(f.file->f_op != &io_uring_fops))
+ goto out_fput;
+
+ ret = -ENXIO;
+ ctx = f.file->private_data;
+ if (unlikely(!percpu_ref_tryget(&ctx->refs)))
+ goto out_fput;
+
+ ret = 0;
+ aux->io_uring.ctx = ctx;
+ /* each io_uring file's inode is unique, since it uses
+ * anon_inode_getfile_secure, which can be used to search
+ * through files and map link fd back to the io_uring.
+ */
+ aux->io_uring.inode = f.file->f_inode->i_ino;
+
+out_fput:
+ fdput(f);
+ return ret;
+}
+
+static void bpf_io_uring_iter_detach(struct bpf_iter_aux_info *aux)
+{
+ percpu_ref_put(&aux->io_uring.ctx->refs);
+}
+
+#ifdef CONFIG_PROC_FS
+static void bpf_io_uring_iter_show_fdinfo(const struct bpf_iter_aux_info *aux,
+ struct seq_file *seq)
+{
+ seq_printf(seq, "io_uring_inode:\t%lu\n", aux->io_uring.inode);
+}
+#endif
+
+static int bpf_io_uring_iter_fill_link_info(const struct bpf_iter_aux_info *aux,
+ struct bpf_link_info *info)
+{
+ info->iter.io_uring.inode = aux->io_uring.inode;
+ return 0;
+}
+
+/* io_uring iterator for registered buffers */
+
+struct bpf_iter__io_uring_buf {
+ __bpf_md_ptr(struct bpf_iter_meta *, meta);
+ __bpf_md_ptr(struct io_ring_ctx *, ctx);
+ __bpf_md_ptr(struct io_mapped_ubuf *, ubuf);
+ u64 index;
+};
+
+static void *__bpf_io_uring_buf_seq_get_next(struct bpf_io_uring_seq_info *info)
+{
+ if (info->index < info->ctx->nr_user_bufs)
+ return info->ctx->user_bufs[info->index++];
+ return NULL;
+}
+
+static void *bpf_io_uring_buf_seq_start(struct seq_file *seq, loff_t *pos)
+{
+ struct bpf_io_uring_seq_info *info = seq->private;
+ struct io_mapped_ubuf *ubuf;
+
+ /* Indicate to userspace that the uring lock is contended */
+ if (!mutex_trylock(&info->ctx->uring_lock))
+ return ERR_PTR(-EDEADLK);
+
+ ubuf = __bpf_io_uring_buf_seq_get_next(info);
+ if (!ubuf)
+ return NULL;
+
+ if (*pos == 0)
+ ++*pos;
+ return ubuf;
+}
+
+static void *bpf_io_uring_buf_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+ struct bpf_io_uring_seq_info *info = seq->private;
+
+ ++*pos;
+ return __bpf_io_uring_buf_seq_get_next(info);
+}
+
+DEFINE_BPF_ITER_FUNC(io_uring_buf, struct bpf_iter_meta *meta,
+ struct io_ring_ctx *ctx, struct io_mapped_ubuf *ubuf,
+ u64 index)
+
+static int __bpf_io_uring_buf_seq_show(struct seq_file *seq, void *v, bool in_stop)
+{
+ struct bpf_io_uring_seq_info *info = seq->private;
+ struct bpf_iter__io_uring_buf ctx;
+ struct bpf_iter_meta meta;
+ struct bpf_prog *prog;
+
+ meta.seq = seq;
+ prog = bpf_iter_get_info(&meta, in_stop);
+ if (!prog)
+ return 0;
+
+ ctx.meta = &meta;
+ ctx.ctx = info->ctx;
+ ctx.ubuf = v;
+ ctx.index = info->index ? info->index - !in_stop : 0;
+
+ return bpf_iter_run_prog(prog, &ctx);
+}
+
+static int bpf_io_uring_buf_seq_show(struct seq_file *seq, void *v)
+{
+ return __bpf_io_uring_buf_seq_show(seq, v, false);
+}
+
+static void bpf_io_uring_buf_seq_stop(struct seq_file *seq, void *v)
+{
+ struct bpf_io_uring_seq_info *info = seq->private;
+
+ /* If IS_ERR(v) is true, then ctx->uring_lock wasn't taken */
+ if (IS_ERR(v))
+ return;
+ if (!v)
+ __bpf_io_uring_buf_seq_show(seq, v, true);
+ else if (info->index) /* restart from index */
+ info->index--;
+ mutex_unlock(&info->ctx->uring_lock);
+}
+
+static const struct seq_operations bpf_io_uring_buf_seq_ops = {
+ .start = bpf_io_uring_buf_seq_start,
+ .next = bpf_io_uring_buf_seq_next,
+ .stop = bpf_io_uring_buf_seq_stop,
+ .show = bpf_io_uring_buf_seq_show,
+};
+
+static const struct bpf_iter_seq_info bpf_io_uring_buf_seq_info = {
+ .seq_ops = &bpf_io_uring_buf_seq_ops,
+ .init_seq_private = bpf_io_uring_init_seq,
+ .fini_seq_private = NULL,
+ .seq_priv_size = sizeof(struct bpf_io_uring_seq_info),
+};
+
+static struct bpf_iter_reg io_uring_buf_reg_info = {
+ .target = "io_uring_buf",
+ .feature = BPF_ITER_RESCHED,
+ .attach_target = bpf_io_uring_iter_attach,
+ .detach_target = bpf_io_uring_iter_detach,
+#ifdef CONFIG_PROC_FS
+ .show_fdinfo = bpf_io_uring_iter_show_fdinfo,
+#endif
+ .fill_link_info = bpf_io_uring_iter_fill_link_info,
+ .ctx_arg_info_size = 2,
+ .ctx_arg_info = {
+ { offsetof(struct bpf_iter__io_uring_buf, ctx),
+ PTR_TO_BTF_ID },
+ { offsetof(struct bpf_iter__io_uring_buf, ubuf),
+ PTR_TO_BTF_ID_OR_NULL },
+ },
+ .seq_info = &bpf_io_uring_buf_seq_info,
+};
+
+static int __init io_uring_iter_init(void)
+{
+ io_uring_buf_reg_info.ctx_arg_info[0].btf_id = btf_io_uring_ids[0];
+ io_uring_buf_reg_info.ctx_arg_info[1].btf_id = btf_io_uring_ids[1];
+ return bpf_iter_reg_target(&io_uring_buf_reg_info);
+}
+late_initcall(io_uring_iter_init);
+
+#endif
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index cc7a0c36e7df..967842881024 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1509,8 +1509,20 @@ int bpf_obj_get_user(const char __user *pathname, int flags);
extern int bpf_iter_ ## target(args); \
int __init bpf_iter_ ## target(args) { return 0; }
+struct io_ring_ctx;
+
struct bpf_iter_aux_info {
+ /* Map member must not alias any other members, due to the check in
+ * bpf_trace.c:__get_seq_info, since in case of map the seq_ops for
+ * iterator is different from others. The seq_ops is not from main
+ * iter registration but from map_ops. Nullability of 'map' allows
+ * to skip this check for non-map iterator cheaply.
+ */
struct bpf_map *map;
+ struct {
+ struct io_ring_ctx *ctx;
+ ino_t inode;
+ } io_uring;
};
typedef int (*bpf_iter_attach_target_t)(struct bpf_prog *prog,
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index a69e4b04ffeb..1ad1ae85743c 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -91,6 +91,9 @@ union bpf_iter_link_info {
struct {
__u32 map_fd;
} map;
+ struct {
+ __u32 io_uring_fd;
+ } io_uring;
};
/* BPF syscall commands, see bpf(2) man-page for more details. */
@@ -5720,6 +5723,9 @@ struct bpf_link_info {
struct {
__u32 map_id;
} map;
+ struct {
+ __u64 inode;
+ } io_uring;
};
} iter;
struct {
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index a69e4b04ffeb..1ad1ae85743c 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -91,6 +91,9 @@ union bpf_iter_link_info {
struct {
__u32 map_fd;
} map;
+ struct {
+ __u32 io_uring_fd;
+ } io_uring;
};
/* BPF syscall commands, see bpf(2) man-page for more details. */
@@ -5720,6 +5723,9 @@ struct bpf_link_info {
struct {
__u32 map_id;
} map;
+ struct {
+ __u64 inode;
+ } io_uring;
};
} iter;
struct {
--
2.34.1
* [PATCH bpf-next v3 02/10] bpf: Add bpf_page_to_pfn helper
2021-12-01 4:23 [PATCH bpf-next v3 00/10] Introduce BPF iterators for io_uring and epoll Kumar Kartikeya Dwivedi
2021-12-01 4:23 ` [PATCH bpf-next v3 01/10] io_uring: Implement eBPF iterator for registered buffers Kumar Kartikeya Dwivedi
@ 2021-12-01 4:23 ` Kumar Kartikeya Dwivedi
2021-12-01 4:23 ` [PATCH bpf-next v3 03/10] io_uring: Implement eBPF iterator for registered files Kumar Kartikeya Dwivedi
` (7 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2021-12-01 4:23 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Yonghong Song, Pavel Emelyanov, Alexander Mikhalitsyn,
Andrei Vagin, criu, io-uring, linux-fsdevel
In CRIU, we need to be able to determine whether a page pinned by
io_uring is still present at the same range in the process's VMA.
/proc/<pid>/pagemap gives us the PFN; using this helper, we can establish
this mapping easily from the iterator side, as sketched below.
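For reference, the userspace half of this matching reads one 64-bit pagemap
entry per page; a minimal sketch (mirroring the page_addr_to_pfn helper in
the selftests):

  static unsigned long long vaddr_to_pfn(unsigned long vaddr)
  {
          long page_size = sysconf(_SC_PAGE_SIZE);
          unsigned long long ent = 0;
          int fd;

          fd = open("/proc/self/pagemap", O_RDONLY);
          if (fd < 0)
                  return 0;
          /* one 64-bit entry per page */
          if (pread(fd, &ent, sizeof(ent), (vaddr / page_size) * 8) != sizeof(ent))
                  ent = 0;
          close(fd);
          /* bits 0-54 hold the PFN for a present, non-swapped page */
          return ent & ((1ULL << 55) - 1);
  }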
It is a simple wrapper over the in-kernel page_to_pfn macro, and ensures
the passed-in pointer is a struct page PTR_TO_BTF_ID. For the CRIU use case,
this is obtained from the bvec of io_mapped_ubuf.
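The BPF side then reduces to a call like the following (a fragment of the
selftest program added later in this series):

  SEC("iter/io_uring_buf")
  int dump_io_uring_buf(struct bpf_iter__io_uring_buf *ctx)
  {
          struct io_mapped_ubuf *ubuf = ctx->ubuf;

          if (ubuf)
                  /* PFN of the first page backing the registered buffer */
                  BPF_SEQ_PRINTF(ctx->meta->seq, "PFN=%lu\n",
                                 (unsigned long)bpf_page_to_pfn(ubuf->bvec[0].bv_page));
          return 0;
  }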
Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]>
---
include/linux/bpf.h | 1 +
include/uapi/linux/bpf.h | 9 +++++++++
kernel/trace/bpf_trace.c | 19 +++++++++++++++++++
scripts/bpf_doc.py | 2 ++
tools/include/uapi/linux/bpf.h | 9 +++++++++
5 files changed, 40 insertions(+)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 967842881024..e44503158d76 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -2176,6 +2176,7 @@ extern const struct bpf_func_proto bpf_sk_setsockopt_proto;
extern const struct bpf_func_proto bpf_sk_getsockopt_proto;
extern const struct bpf_func_proto bpf_kallsyms_lookup_name_proto;
extern const struct bpf_func_proto bpf_find_vma_proto;
+extern const struct bpf_func_proto bpf_page_to_pfn_proto;
const struct bpf_func_proto *tracing_prog_func_proto(
enum bpf_func_id func_id, const struct bpf_prog *prog);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 1ad1ae85743c..885d9293c147 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -4960,6 +4960,14 @@ union bpf_attr {
* **-ENOENT** if *task->mm* is NULL, or no vma contains *addr*.
* **-EBUSY** if failed to try lock mmap_lock.
* **-EINVAL** for invalid **flags**.
+ *
+ * long bpf_page_to_pfn(struct page *page)
+ * Description
+ * Obtain the page frame number (PFN) for the given *struct page*
+ * pointer.
+ * Return
+ * Page Frame Number corresponding to the page pointed to by the
+ * *struct page* pointer, or U64_MAX if pointer is NULL.
*/
#define __BPF_FUNC_MAPPER(FN) \
FN(unspec), \
@@ -5143,6 +5151,7 @@ union bpf_attr {
FN(skc_to_unix_sock), \
FN(kallsyms_lookup_name), \
FN(find_vma), \
+ FN(page_to_pfn), \
/* */
/* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 25ea521fb8f1..2a6488f14e58 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -1091,6 +1091,23 @@ static const struct bpf_func_proto bpf_get_branch_snapshot_proto = {
.arg2_type = ARG_CONST_SIZE_OR_ZERO,
};
+BPF_CALL_1(bpf_page_to_pfn, struct page *, page)
+{
+ /* PTR_TO_BTF_ID can be NULL */
+ if (!page)
+ return U64_MAX;
+ return page_to_pfn(page);
+}
+
+BTF_ID_LIST_SINGLE(btf_page_to_pfn_ids, struct, page)
+
+const struct bpf_func_proto bpf_page_to_pfn_proto = {
+ .func = bpf_page_to_pfn,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_PTR_TO_BTF_ID,
+ .arg1_btf_id = &btf_page_to_pfn_ids[0],
+};
+
static const struct bpf_func_proto *
bpf_tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
{
@@ -1212,6 +1229,8 @@ bpf_tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
return &bpf_find_vma_proto;
case BPF_FUNC_trace_vprintk:
return bpf_get_trace_vprintk_proto();
+ case BPF_FUNC_page_to_pfn:
+ return &bpf_page_to_pfn_proto;
default:
return bpf_base_func_proto(func_id);
}
diff --git a/scripts/bpf_doc.py b/scripts/bpf_doc.py
index a6403ddf5de7..ae68ca794980 100755
--- a/scripts/bpf_doc.py
+++ b/scripts/bpf_doc.py
@@ -549,6 +549,7 @@ class PrinterHelpers(Printer):
'struct socket',
'struct file',
'struct bpf_timer',
+ 'struct page',
]
known_types = {
'...',
@@ -598,6 +599,7 @@ class PrinterHelpers(Printer):
'struct socket',
'struct file',
'struct bpf_timer',
+ 'struct page',
}
mapped_types = {
'u8': '__u8',
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 1ad1ae85743c..885d9293c147 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -4960,6 +4960,14 @@ union bpf_attr {
* **-ENOENT** if *task->mm* is NULL, or no vma contains *addr*.
* **-EBUSY** if failed to try lock mmap_lock.
* **-EINVAL** for invalid **flags**.
+ *
+ * long bpf_page_to_pfn(struct page *page)
+ * Description
+ * Obtain the page frame number (PFN) for the given *struct page*
+ * pointer.
+ * Return
+ * Page Frame Number corresponding to the page pointed to by the
+ * *struct page* pointer, or U64_MAX if pointer is NULL.
*/
#define __BPF_FUNC_MAPPER(FN) \
FN(unspec), \
@@ -5143,6 +5151,7 @@ union bpf_attr {
FN(skc_to_unix_sock), \
FN(kallsyms_lookup_name), \
FN(find_vma), \
+ FN(page_to_pfn), \
/* */
/* integer value in 'imm' field of BPF_CALL instruction selects which helper
--
2.34.1
* [PATCH bpf-next v3 03/10] io_uring: Implement eBPF iterator for registered files
2021-12-01 4:23 [PATCH bpf-next v3 00/10] Introduce BPF iterators for io_uring and epoll Kumar Kartikeya Dwivedi
2021-12-01 4:23 ` [PATCH bpf-next v3 01/10] io_uring: Implement eBPF iterator for registered buffers Kumar Kartikeya Dwivedi
2021-12-01 4:23 ` [PATCH bpf-next v3 02/10] bpf: Add bpf_page_to_pfn helper Kumar Kartikeya Dwivedi
@ 2021-12-01 4:23 ` Kumar Kartikeya Dwivedi
2021-12-01 4:23 ` [PATCH bpf-next v3 04/10] epoll: Implement eBPF iterator for registered items Kumar Kartikeya Dwivedi
` (6 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2021-12-01 4:23 UTC (permalink / raw)
To: bpf
Cc: Jens Axboe, Pavel Begunkov, io-uring, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, Yonghong Song, Pavel Emelyanov,
Alexander Mikhalitsyn, Andrei Vagin, criu, linux-fsdevel
This change adds an eBPF iterator for files registered in an io_uring ctx.
It gives access to the ctx, the index of the registered file, and a
pointer to the struct file itself. This allows the iterator to save
info related to files added to an io_uring instance that isn't easy
to export using the fdinfo interface (like being able to match
registered files to a task's file set). Getting access to the underlying
struct file allows deduplication and efficient pairing with the task's
file set (obtained using the task_file iterator), as sketched below.
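For instance, deduplication against a task's fd table can key a hash map on
the struct file pointer itself, so that a subsequent task_file iterator pass
matches each fd in O(1). A minimal sketch (program and map names are
illustrative, not part of this series):

  struct {
          __uint(type, BPF_MAP_TYPE_HASH);
          __uint(max_entries, 1024);
          __type(key, __u64);   /* struct file pointer */
          __type(value, __u64); /* slot in the registered file set */
  } fileset SEC(".maps");

  SEC("iter/io_uring_file")
  int record_io_uring_file(struct bpf_iter__io_uring_file *ctx)
  {
          __u64 key = (__u64)ctx->file, idx = ctx->index;

          /* remember each registered file by its struct file pointer */
          if (ctx->file)
                  bpf_map_update_elem(&fileset, &key, &idx, BPF_ANY);
          return 0;
  }

  SEC("iter/task_file")
  int match_task_file(struct bpf_iter__task_file *ctx)
  {
          __u64 key = (__u64)ctx->file, *idx;

          if (!ctx->file)
                  return 0;
          /* same struct file => fd ctx->fd aliases registered slot *idx */
          idx = bpf_map_lookup_elem(&fileset, &key);
          if (idx)
                  BPF_SEQ_PRINTF(ctx->meta->seq, "%u -> %llu\n", ctx->fd, *idx);
          return 0;
  }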
The primary use case this enables is checkpoint/restore support.
Note that we need to use mutex_trylock when the file is read from, in the
seq_start function, as the lock ordering is the opposite of what it
would be when an io_uring operation reads the same file. We take
seq_file->lock, then ctx->uring_lock, while io_uring would first take
ctx->uring_lock and then seq_file->lock for the same ctx.
This can lead to a deadlock scenario described below:
The sequence on CPU 0 is for a normal read(2) on the iterator. For CPU 1, it
is an io_uring instance trying to do the same on an iterator attached to
itself.
So CPU 0 does
sys_read
vfs_read
bpf_seq_read
mutex_lock(&seq_file->lock) # A
io_uring_file_seq_start
mutex_lock(&ctx->uring_lock) # B
and CPU 1 does
io_uring_enter
mutex_lock(&ctx->uring_lock) # B
io_read
bpf_seq_read
mutex_lock(&seq_file->lock) # A
...
Since the lock ordering is opposite, this can deadlock. So we switch the
mutex_lock in io_uring_file_seq_start to a trylock, so that it returns an
error in this case; the read path then releases seq_file->lock and CPU 1
can make progress.
The trylock also protects the case where io_uring tries to read from an
iterator attached to itself (same ctx), where the lock ordering would
be:
io_uring_enter
mutex_lock(&ctx->uring_lock) <------------.
io_read \
seq_read \
mutex_lock(&seq_file->lock) /
mutex_lock(&ctx->uring_lock) # deadlock-`
In both these cases (recursive read and contended uring_lock), -EDEADLK
is returned to userspace.
With the advent of descriptorless files supported by io_uring, this
iterator provides the required visibility and introspection into the
io_uring instance for the purposes of dumping and restoring it.
In the future, this iterator will be extended to support direct
inspection of a lot of file state (currently descriptorless files
are obtained using openat2 and socket) to dump file state for these
hidden files. Later, we can explore filling in the gaps for dumping
file state for more file types (those not hidden in the io_uring ctx).
All of this is out of scope for the current series, but it builds
upon this iterator.
Cc: Jens Axboe <[email protected]>
Cc: Pavel Begunkov <[email protected]>
Cc: [email protected]
Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]>
---
fs/io_uring.c | 144 +++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 143 insertions(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 02e628448ebd..28348fce81dc 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -11132,6 +11132,7 @@ __initcall(io_uring_init);
BTF_ID_LIST(btf_io_uring_ids)
BTF_ID(struct, io_ring_ctx)
BTF_ID(struct, io_mapped_ubuf)
+BTF_ID(struct, file)
struct bpf_io_uring_seq_info {
struct io_ring_ctx *ctx;
@@ -11319,11 +11320,152 @@ static struct bpf_iter_reg io_uring_buf_reg_info = {
.seq_info = &bpf_io_uring_buf_seq_info,
};
+/* io_uring iterator for registered files */
+
+struct bpf_iter__io_uring_file {
+ __bpf_md_ptr(struct bpf_iter_meta *, meta);
+ __bpf_md_ptr(struct io_ring_ctx *, ctx);
+ __bpf_md_ptr(struct file *, file);
+ u64 index;
+};
+
+static void *__bpf_io_uring_file_seq_get_next(struct bpf_io_uring_seq_info *info)
+{
+ struct file *file = NULL;
+
+ if (info->index < info->ctx->nr_user_files) {
+ /* file set can be sparse */
+ file = io_file_from_index(info->ctx, info->index++);
+ /* use info as a distinct pointer to distinguish between empty
+ * slot and valid file, since we cannot return NULL for this
+ * case if we want iter prog to still be invoked with file ==
+ * NULL.
+ */
+ if (!file)
+ return info;
+ }
+
+ return file;
+}
+
+static void *bpf_io_uring_file_seq_start(struct seq_file *seq, loff_t *pos)
+{
+ struct bpf_io_uring_seq_info *info = seq->private;
+ struct file *file;
+
+ /* Indicate to userspace that the uring lock is contended */
+ if (!mutex_trylock(&info->ctx->uring_lock))
+ return ERR_PTR(-EDEADLK);
+
+ file = __bpf_io_uring_file_seq_get_next(info);
+ if (!file)
+ return NULL;
+
+ if (*pos == 0)
+ ++*pos;
+ return file;
+}
+
+static void *bpf_io_uring_file_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+ struct bpf_io_uring_seq_info *info = seq->private;
+
+ ++*pos;
+ return __bpf_io_uring_file_seq_get_next(info);
+}
+
+DEFINE_BPF_ITER_FUNC(io_uring_file, struct bpf_iter_meta *meta,
+ struct io_ring_ctx *ctx, struct file *file,
+ u64 index)
+
+static int __bpf_io_uring_file_seq_show(struct seq_file *seq, void *v, bool in_stop)
+{
+ struct bpf_io_uring_seq_info *info = seq->private;
+ struct bpf_iter__io_uring_file ctx;
+ struct bpf_iter_meta meta;
+ struct bpf_prog *prog;
+
+ meta.seq = seq;
+ prog = bpf_iter_get_info(&meta, in_stop);
+ if (!prog)
+ return 0;
+
+ ctx.meta = &meta;
+ ctx.ctx = info->ctx;
+ /* when we encounter empty slot, v will point to info */
+ ctx.file = v == info ? NULL : v;
+ ctx.index = info->index ? info->index - !in_stop : 0;
+
+ return bpf_iter_run_prog(prog, &ctx);
+}
+
+static int bpf_io_uring_file_seq_show(struct seq_file *seq, void *v)
+{
+ return __bpf_io_uring_file_seq_show(seq, v, false);
+}
+
+static void bpf_io_uring_file_seq_stop(struct seq_file *seq, void *v)
+{
+ struct bpf_io_uring_seq_info *info = seq->private;
+
+ /* If IS_ERR(v) is true, then ctx->uring_lock wasn't taken */
+ if (IS_ERR(v))
+ return;
+ if (!v)
+ __bpf_io_uring_file_seq_show(seq, v, true);
+ else if (info->index) /* restart from index */
+ info->index--;
+ mutex_unlock(&info->ctx->uring_lock);
+}
+
+static const struct seq_operations bpf_io_uring_file_seq_ops = {
+ .start = bpf_io_uring_file_seq_start,
+ .next = bpf_io_uring_file_seq_next,
+ .stop = bpf_io_uring_file_seq_stop,
+ .show = bpf_io_uring_file_seq_show,
+};
+
+static const struct bpf_iter_seq_info bpf_io_uring_file_seq_info = {
+ .seq_ops = &bpf_io_uring_file_seq_ops,
+ .init_seq_private = bpf_io_uring_init_seq,
+ .fini_seq_private = NULL,
+ .seq_priv_size = sizeof(struct bpf_io_uring_seq_info),
+};
+
+static struct bpf_iter_reg io_uring_file_reg_info = {
+ .target = "io_uring_file",
+ .feature = BPF_ITER_RESCHED,
+ .attach_target = bpf_io_uring_iter_attach,
+ .detach_target = bpf_io_uring_iter_detach,
+#ifdef CONFIG_PROC_FS
+ .show_fdinfo = bpf_io_uring_iter_show_fdinfo,
+#endif
+ .fill_link_info = bpf_io_uring_iter_fill_link_info,
+ .ctx_arg_info_size = 2,
+ .ctx_arg_info = {
+ { offsetof(struct bpf_iter__io_uring_file, ctx),
+ PTR_TO_BTF_ID },
+ { offsetof(struct bpf_iter__io_uring_file, file),
+ PTR_TO_BTF_ID_OR_NULL },
+ },
+ .seq_info = &bpf_io_uring_file_seq_info,
+};
+
static int __init io_uring_iter_init(void)
{
+ int ret;
+
io_uring_buf_reg_info.ctx_arg_info[0].btf_id = btf_io_uring_ids[0];
io_uring_buf_reg_info.ctx_arg_info[1].btf_id = btf_io_uring_ids[1];
- return bpf_iter_reg_target(&io_uring_buf_reg_info);
+ io_uring_file_reg_info.ctx_arg_info[0].btf_id = btf_io_uring_ids[0];
+ io_uring_file_reg_info.ctx_arg_info[1].btf_id = btf_io_uring_ids[2];
+ ret = bpf_iter_reg_target(&io_uring_buf_reg_info);
+ if (ret)
+ return ret;
+ ret = bpf_iter_reg_target(&io_uring_file_reg_info);
+ if (ret)
+ bpf_iter_unreg_target(&io_uring_buf_reg_info);
+ return ret;
}
late_initcall(io_uring_iter_init);
--
2.34.1
* [PATCH bpf-next v3 04/10] epoll: Implement eBPF iterator for registered items
2021-12-01 4:23 [PATCH bpf-next v3 00/10] Introduce BPF iterators for io_uring and epoll Kumar Kartikeya Dwivedi
` (2 preceding siblings ...)
2021-12-01 4:23 ` [PATCH bpf-next v3 03/10] io_uring: Implement eBPF iterator for registered files Kumar Kartikeya Dwivedi
@ 2021-12-01 4:23 ` Kumar Kartikeya Dwivedi
2021-12-01 4:23 ` [PATCH bpf-next v3 05/10] bpftool: Output io_uring iterator info Kumar Kartikeya Dwivedi
` (5 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2021-12-01 4:23 UTC (permalink / raw)
To: bpf
Cc: Alexander Viro, linux-fsdevel, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, Yonghong Song, Pavel Emelyanov,
Alexander Mikhalitsyn, Andrei Vagin, criu, io-uring
This patch adds an eBPF iterator for epoll items (epitems) registered in an
epoll instance. It gives access to the eventpoll ctx and the registered
epoll item (struct epitem). This allows the iterator to inspect the
registered file and to use other iterators to associate it with
a task's fdtable.
The primary use case this enables is expediting the existing eventpoll
checkpoint/restore support in the CRIU project. This iterator allows us
to switch from a worst case O(n^2) algorithm to a single O(n) pass over
the task's and the epoll instance's registered descriptors.
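As a rough illustration, the iterator program itself can be as small as the
following sketch (along the lines of the selftest added later in this series;
epi is NULL for the final invocation):

  SEC("iter/epoll")
  int dump_epoll(struct bpf_iter__epoll *ctx)
  {
          struct seq_file *seq = ctx->meta->seq;
          struct epitem *epi = ctx->epi;

          /* epi is NULL when called one last time after iteration ends */
          if (epi)
                  BPF_SEQ_PRINTF(seq, "%d\n", epi->ffd.fd);
          return 0;
  }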
We also make sure we're iterating over a live file, one that is not
going away. The case we're concerned about is a file whose f_count is
zero, but which is waiting for the iterator's bpf_seq_read to release
ep->mtx, so that it can remove its epitem. Since such a file will
disappear once iteration is done, and it is being destructed, we use
get_file_rcu to ensure it is alive when invoking the BPF program.
Getting access to a file that is going to disappear after iteration
is not useful anyway. This does have a performance overhead, however
(a file reference is raised and dropped for each file).
The rcu_read_lock around get_file_rcu isn't strictly required for
lifetime management, since the fput path is serialized on ep->mtx to call
ep_remove; hence the epi->ffd.file pointer remains stable during our
seq_start/seq_stop bracketing.
To be able to continue from the position we stopped at, we store
epi->ffd.fd and use ep_find_tfd to find the target file again.
It would be more appropriate to use both the struct file pointer and the fd
number to find the last file, but see below for why that cannot be done.
Taking a reference to the struct file and walking the RB-Tree to find it
again would lead to a reference cycle issue if the iterator, after a partial
read, takes a reference to a socket which is later used to create a
descriptor cycle using SCM_RIGHTS. An example encountered while working on
this is described below.
Let there be Unix sockets SK1, SK2, an epoll fd EP, and an epoll iterator
ITER.
Let SK1 be registered in EP. Then, on a partial read, it is possible
that ITER returns from read and takes a reference to SK1 to be able to
find it later in the RB-Tree and continue the iteration. If SK1 sends
ITER over to SK2 using SCM_RIGHTS, and SK2 sends itself over to SK1 using
SCM_RIGHTS, and both fds are not consumed on the corresponding receive
ends, a cycle is created. When all of SK1, SK2, EP, and ITER are
closed, SK1's receive queue holds a reference to SK2, and SK2's receive
queue holds a reference to ITER, which holds a reference to SK1.
All file descriptors except EP leak.
To resolve it, we would need to hook into the Unix socket GC mechanism,
but the alternative of using ep_find_tfd is much simpler. Finding the
last position in the face of concurrent modification of the epoll set is
at best an approximation anyway. For the case of CRIU, the epoll set
remains stable.
Cc: Alexander Viro <[email protected]>
Cc: [email protected]
Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]>
---
fs/eventpoll.c | 201 ++++++++++++++++++++++++++++++++-
include/linux/bpf.h | 11 +-
include/uapi/linux/bpf.h | 3 +
tools/include/uapi/linux/bpf.h | 3 +
4 files changed, 213 insertions(+), 5 deletions(-)
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 06f4c5ae1451..fb4e58857baa 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -37,6 +37,7 @@
#include <linux/seq_file.h>
#include <linux/compat.h>
#include <linux/rculist.h>
+#include <linux/btf_ids.h>
#include <net/busy_poll.h>
/*
@@ -985,7 +986,6 @@ static struct epitem *ep_find(struct eventpoll *ep, struct file *file, int fd)
return epir;
}
-#ifdef CONFIG_KCMP
static struct epitem *ep_find_tfd(struct eventpoll *ep, int tfd, unsigned long toff)
{
struct rb_node *rbp;
@@ -1005,6 +1005,7 @@ static struct epitem *ep_find_tfd(struct eventpoll *ep, int tfd, unsigned long t
return NULL;
}
+#ifdef CONFIG_KCMP
struct file *get_epoll_tfile_raw_ptr(struct file *file, int tfd,
unsigned long toff)
{
@@ -2385,3 +2386,201 @@ static int __init eventpoll_init(void)
return 0;
}
fs_initcall(eventpoll_init);
+
+#ifdef CONFIG_BPF_SYSCALL
+
+enum epoll_iter_state {
+ EP_ITER_DONE = -2,
+ EP_ITER_INIT = -1,
+};
+
+BTF_ID_LIST(btf_epoll_ids)
+BTF_ID(struct, eventpoll)
+BTF_ID(struct, epitem)
+
+struct bpf_epoll_iter_seq_info {
+ struct eventpoll *ep;
+ struct rb_node *rbp;
+ int tfd;
+};
+
+static int bpf_epoll_init_seq(void *priv_data, struct bpf_iter_aux_info *aux)
+{
+ struct bpf_epoll_iter_seq_info *info = priv_data;
+
+ info->ep = aux->ep->private_data;
+ info->tfd = EP_ITER_INIT;
+ return 0;
+}
+
+static int bpf_epoll_iter_attach(struct bpf_prog *prog,
+ union bpf_iter_link_info *linfo,
+ struct bpf_iter_aux_info *aux)
+{
+ struct file *file;
+ int ret;
+
+ file = fget(linfo->epoll.epoll_fd);
+ if (!file)
+ return -EBADF;
+
+ ret = -EOPNOTSUPP;
+ if (unlikely(!is_file_epoll(file)))
+ goto out_fput;
+
+ aux->ep = file;
+ return 0;
+out_fput:
+ fput(file);
+ return ret;
+}
+
+static void bpf_epoll_iter_detach(struct bpf_iter_aux_info *aux)
+{
+ fput(aux->ep);
+}
+
+struct bpf_iter__epoll {
+ __bpf_md_ptr(struct bpf_iter_meta *, meta);
+ __bpf_md_ptr(struct eventpoll *, ep);
+ __bpf_md_ptr(struct epitem *, epi);
+};
+
+static void *bpf_epoll_seq_start(struct seq_file *seq, loff_t *pos)
+{
+ struct bpf_epoll_iter_seq_info *info = seq->private;
+ struct epitem *epi;
+
+ mutex_lock(&info->ep->mtx);
+ /* already iterated? */
+ if (info->tfd == EP_ITER_DONE)
+ return NULL;
+ /* partially iterated? find position to restart */
+ if (info->tfd >= 0) {
+ epi = ep_find_tfd(info->ep, info->tfd, 0);
+ if (!epi)
+ return NULL;
+ info->rbp = &epi->rbn;
+ return epi;
+ }
+ WARN_ON(info->tfd != EP_ITER_INIT);
+ /* first iteration */
+ info->rbp = rb_first_cached(&info->ep->rbr);
+ if (!info->rbp)
+ return NULL;
+ if (*pos == 0)
+ ++*pos;
+ return rb_entry(info->rbp, struct epitem, rbn);
+}
+
+static void *bpf_epoll_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+ struct bpf_epoll_iter_seq_info *info = seq->private;
+
+ ++*pos;
+ info->rbp = rb_next(info->rbp);
+ return info->rbp ? rb_entry(info->rbp, struct epitem, rbn) : NULL;
+}
+
+DEFINE_BPF_ITER_FUNC(epoll, struct bpf_iter_meta *meta, struct eventpoll *ep,
+ struct epitem *epi)
+
+static int __bpf_epoll_seq_show(struct seq_file *seq, void *v, bool in_stop)
+{
+ struct bpf_epoll_iter_seq_info *info = seq->private;
+ struct bpf_iter__epoll ctx;
+ struct bpf_iter_meta meta;
+ struct bpf_prog *prog;
+ int ret;
+
+ meta.seq = seq;
+ prog = bpf_iter_get_info(&meta, in_stop);
+ if (!prog)
+ return 0;
+
+ ctx.meta = &meta;
+ ctx.ep = info->ep;
+ ctx.epi = v;
+ if (ctx.epi) {
+ /* The file we are going to pass to prog may already have its
+ * f_count as 0, hence before invoking the prog, we always try
+ * to get the reference if it isn't zero, failing which we skip
+ * the file. This is usually the case for files that are closed
+ * before calling EPOLL_CTL_DEL for them, which would wait for
+ * us to release ep->mtx before doing ep_remove.
+ */
+ rcu_read_lock();
+ ret = get_file_rcu(ctx.epi->ffd.file);
+ rcu_read_unlock();
+ if (!ret)
+ return 0;
+ }
+ ret = bpf_iter_run_prog(prog, &ctx);
+ /* fput queues work asynchronously, so in our case, either task_work for
+ * non-exiting task, and otherwise delayed_fput, so holding ep->mtx and
+ * calling fput (which will take the same lock) in this context will not
+ * deadlock us, in case f_count is 1 at this point.
+ */
+ if (ctx.epi)
+ fput(ctx.epi->ffd.file);
+ return ret;
+}
+
+static int bpf_epoll_seq_show(struct seq_file *seq, void *v)
+{
+ return __bpf_epoll_seq_show(seq, v, false);
+}
+
+static void bpf_epoll_seq_stop(struct seq_file *seq, void *v)
+{
+ struct bpf_epoll_iter_seq_info *info = seq->private;
+ struct epitem *epi;
+
+ if (!v) {
+ __bpf_epoll_seq_show(seq, v, true);
+ /* done iterating */
+ info->tfd = EP_ITER_DONE;
+ } else {
+ epi = rb_entry(info->rbp, struct epitem, rbn);
+ info->tfd = epi->ffd.fd;
+ }
+ mutex_unlock(&info->ep->mtx);
+}
+
+static const struct seq_operations bpf_epoll_seq_ops = {
+ .start = bpf_epoll_seq_start,
+ .next = bpf_epoll_seq_next,
+ .stop = bpf_epoll_seq_stop,
+ .show = bpf_epoll_seq_show,
+};
+
+static const struct bpf_iter_seq_info bpf_epoll_seq_info = {
+ .seq_ops = &bpf_epoll_seq_ops,
+ .init_seq_private = bpf_epoll_init_seq,
+ .seq_priv_size = sizeof(struct bpf_epoll_iter_seq_info),
+};
+
+static struct bpf_iter_reg epoll_reg_info = {
+ .target = "epoll",
+ .feature = BPF_ITER_RESCHED,
+ .attach_target = bpf_epoll_iter_attach,
+ .detach_target = bpf_epoll_iter_detach,
+ .ctx_arg_info_size = 2,
+ .ctx_arg_info = {
+ { offsetof(struct bpf_iter__epoll, ep),
+ PTR_TO_BTF_ID },
+ { offsetof(struct bpf_iter__epoll, epi),
+ PTR_TO_BTF_ID_OR_NULL },
+ },
+ .seq_info = &bpf_epoll_seq_info,
+};
+
+static int __init epoll_iter_init(void)
+{
+ epoll_reg_info.ctx_arg_info[0].btf_id = btf_epoll_ids[0];
+ epoll_reg_info.ctx_arg_info[1].btf_id = btf_epoll_ids[1];
+ return bpf_iter_reg_target(&epoll_reg_info);
+}
+late_initcall(epoll_iter_init);
+
+#endif
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index e44503158d76..d7e3e9c59b68 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1519,10 +1519,13 @@ struct bpf_iter_aux_info {
* to skip this check for non-map iterator cheaply.
*/
struct bpf_map *map;
- struct {
- struct io_ring_ctx *ctx;
- ino_t inode;
- } io_uring;
+ union {
+ struct {
+ struct io_ring_ctx *ctx;
+ ino_t inode;
+ } io_uring;
+ struct file *ep;
+ };
};
typedef int (*bpf_iter_attach_target_t)(struct bpf_prog *prog,
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 885d9293c147..b82b11d72520 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -94,6 +94,9 @@ union bpf_iter_link_info {
struct {
__u32 io_uring_fd;
} io_uring;
+ struct {
+ __u32 epoll_fd;
+ } epoll;
};
/* BPF syscall commands, see bpf(2) man-page for more details. */
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 885d9293c147..b82b11d72520 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -94,6 +94,9 @@ union bpf_iter_link_info {
struct {
__u32 io_uring_fd;
} io_uring;
+ struct {
+ __u32 epoll_fd;
+ } epoll;
};
/* BPF syscall commands, see bpf(2) man-page for more details. */
--
2.34.1
* [PATCH bpf-next v3 05/10] bpftool: Output io_uring iterator info
2021-12-01 4:23 [PATCH bpf-next v3 00/10] Introduce BPF iterators for io_uring and epoll Kumar Kartikeya Dwivedi
` (3 preceding siblings ...)
2021-12-01 4:23 ` [PATCH bpf-next v3 04/10] epoll: Implement eBPF iterator for registered items Kumar Kartikeya Dwivedi
@ 2021-12-01 4:23 ` Kumar Kartikeya Dwivedi
2021-12-01 4:23 ` [PATCH bpf-next v3 06/10] selftests/bpf: Add test for io_uring BPF iterators Kumar Kartikeya Dwivedi
` (4 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2021-12-01 4:23 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Yonghong Song, Pavel Emelyanov, Alexander Mikhalitsyn,
Andrei Vagin, criu, io-uring, linux-fsdevel
Output the sole field related to the io_uring iterator (the inode of the
attached io_uring) so that it is useful both for informational and debugging
purposes (e.g. finding the actual io_uring fd attached to the iterator).
Output:
89: iter prog 262 target_name io_uring_file io_uring_inode 16764
pids test_progs(384)
[
{
"id": 123,
"type": "iter",
"prog_id": 463,
"target_name": "io_uring_buf",
"io_uring_inode": 16871,
"pids": [
{
"pid": 443,
"comm": "test_progs"
}
]
}
]
[
{
"id": 126,
"type": "iter",
"prog_id": 483,
"target_name": "io_uring_file",
"io_uring_inode": 16887,
"pids": [
{
"pid": 448,
"comm": "test_progs"
}
]
}
]
Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]>
---
tools/bpf/bpftool/link.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/tools/bpf/bpftool/link.c b/tools/bpf/bpftool/link.c
index 2c258db0d352..409ae861b839 100644
--- a/tools/bpf/bpftool/link.c
+++ b/tools/bpf/bpftool/link.c
@@ -86,6 +86,12 @@ static bool is_iter_map_target(const char *target_name)
strcmp(target_name, "bpf_sk_storage_map") == 0;
}
+static bool is_iter_io_uring_target(const char *target_name)
+{
+ return strcmp(target_name, "io_uring_file") == 0 ||
+ strcmp(target_name, "io_uring_buf") == 0;
+}
+
static void show_iter_json(struct bpf_link_info *info, json_writer_t *wtr)
{
const char *target_name = u64_to_ptr(info->iter.target_name);
@@ -94,6 +100,8 @@ static void show_iter_json(struct bpf_link_info *info, json_writer_t *wtr)
if (is_iter_map_target(target_name))
jsonw_uint_field(wtr, "map_id", info->iter.map.map_id);
+ else if (is_iter_io_uring_target(target_name))
+ jsonw_uint_field(wtr, "io_uring_inode", info->iter.io_uring.inode);
}
static int get_prog_info(int prog_id, struct bpf_prog_info *info)
@@ -204,6 +212,8 @@ static void show_iter_plain(struct bpf_link_info *info)
if (is_iter_map_target(target_name))
printf("map_id %u ", info->iter.map.map_id);
+ else if (is_iter_io_uring_target(target_name))
+ printf("io_uring_inode %llu ", info->iter.io_uring.inode);
}
static int show_link_close_plain(int fd, struct bpf_link_info *info)
--
2.34.1
* [PATCH bpf-next v3 06/10] selftests/bpf: Add test for io_uring BPF iterators
2021-12-01 4:23 [PATCH bpf-next v3 00/10] Introduce BPF iterators for io_uring and epoll Kumar Kartikeya Dwivedi
` (4 preceding siblings ...)
2021-12-01 4:23 ` [PATCH bpf-next v3 05/10] bpftool: Output io_uring iterator info Kumar Kartikeya Dwivedi
@ 2021-12-01 4:23 ` Kumar Kartikeya Dwivedi
2021-12-01 4:23 ` [PATCH bpf-next v3 07/10] selftests/bpf: Add test for epoll BPF iterator Kumar Kartikeya Dwivedi
` (3 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2021-12-01 4:23 UTC (permalink / raw)
To: bpf
Cc: Jens Axboe, Pavel Begunkov, io-uring, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, Yonghong Song, Pavel Emelyanov,
Alexander Mikhalitsyn, Andrei Vagin, criu, linux-fsdevel
This exercises the io_uring_buf and io_uring_file iterators, and tests
sparse file sets as well.
Cc: Jens Axboe <[email protected]>
Cc: Pavel Begunkov <[email protected]>
Cc: [email protected]
Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]>
---
.../selftests/bpf/prog_tests/bpf_iter.c | 251 ++++++++++++++++++
.../selftests/bpf/progs/bpf_iter_io_uring.c | 50 ++++
2 files changed, 301 insertions(+)
create mode 100644 tools/testing/selftests/bpf/progs/bpf_iter_io_uring.c
diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_iter.c b/tools/testing/selftests/bpf/prog_tests/bpf_iter.c
index 0b996be923b5..13ea2eaed032 100644
--- a/tools/testing/selftests/bpf/prog_tests/bpf_iter.c
+++ b/tools/testing/selftests/bpf/prog_tests/bpf_iter.c
@@ -1,6 +1,10 @@
// SPDX-License-Identifier: GPL-2.0
/* Copyright (c) 2020 Facebook */
+#include <sys/stat.h>
+#include <sys/mman.h>
#include <test_progs.h>
+#include <linux/io_uring.h>
+
#include "bpf_iter_ipv6_route.skel.h"
#include "bpf_iter_netlink.skel.h"
#include "bpf_iter_bpf_map.skel.h"
@@ -26,6 +30,7 @@
#include "bpf_iter_bpf_sk_storage_map.skel.h"
#include "bpf_iter_test_kern5.skel.h"
#include "bpf_iter_test_kern6.skel.h"
+#include "bpf_iter_io_uring.skel.h"
static int duration;
@@ -1239,6 +1244,248 @@ static void test_task_vma(void)
bpf_iter_task_vma__destroy(skel);
}
+static int sys_io_uring_setup(u32 entries, struct io_uring_params *p)
+{
+ return syscall(__NR_io_uring_setup, entries, p);
+}
+
+static int io_uring_register_bufs(int io_uring_fd, struct iovec *iovs, unsigned int nr)
+{
+ return syscall(__NR_io_uring_register, io_uring_fd,
+ IORING_REGISTER_BUFFERS, iovs, nr);
+}
+
+static int io_uring_register_files(int io_uring_fd, int *fds, unsigned int nr)
+{
+ return syscall(__NR_io_uring_register, io_uring_fd,
+ IORING_REGISTER_FILES, fds, nr);
+}
+
+static unsigned long long page_addr_to_pfn(unsigned long addr)
+{
+ int page_size = sysconf(_SC_PAGE_SIZE), fd, ret;
+ unsigned long long pfn;
+
+ if (page_size < 0)
+ return 0;
+ fd = open("/proc/self/pagemap", O_RDONLY);
+ if (fd < 0)
+ return 0;
+
+ ret = pread(fd, &pfn, sizeof(pfn), (addr / page_size) * 8);
+ close(fd);
+ if (ret < 0)
+ return 0;
+ /* Bits 0-54 have PFN for non-swapped page */
+ return pfn & 0x7fffffffffffff;
+}
+
+static int io_uring_inode_match(int link_fd, int io_uring_fd)
+{
+ struct bpf_link_info linfo = {};
+ __u32 info_len = sizeof(linfo);
+ struct stat st;
+ int ret;
+
+ ret = fstat(io_uring_fd, &st);
+ if (ret < 0)
+ return -errno;
+
+ ret = bpf_obj_get_info_by_fd(link_fd, &linfo, &info_len);
+ if (ret < 0)
+ return -errno;
+
+ ASSERT_EQ(st.st_ino, linfo.iter.io_uring.inode, "io_uring inode matches");
+ return 0;
+}
+
+void test_io_uring_buf(void)
+{
+ DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
+ char rbuf[4096], buf[4096] = "B\n";
+ union bpf_iter_link_info linfo;
+ struct bpf_iter_io_uring *skel;
+ int ret, fd, i, len = 128;
+ struct io_uring_params p;
+ struct iovec iovs[8];
+ int iter_fd;
+ char *str;
+
+ opts.link_info = &linfo;
+ opts.link_info_len = sizeof(linfo);
+
+ skel = bpf_iter_io_uring__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "bpf_iter_io_uring__open_and_load"))
+ return;
+
+ for (i = 0; i < ARRAY_SIZE(iovs); i++) {
+ iovs[i].iov_len = len;
+ iovs[i].iov_base = mmap(NULL, len, PROT_READ | PROT_WRITE,
+ MAP_ANONYMOUS | MAP_SHARED, -1, 0);
+ if (iovs[i].iov_base == MAP_FAILED)
+ goto end;
+ len *= 2;
+ }
+
+ memset(&p, 0, sizeof(p));
+ fd = sys_io_uring_setup(1, &p);
+ if (!ASSERT_GE(fd, 0, "io_uring_setup"))
+ goto end;
+
+ linfo.io_uring.io_uring_fd = fd;
+ skel->links.dump_io_uring_buf = bpf_program__attach_iter(skel->progs.dump_io_uring_buf,
+ &opts);
+ if (!ASSERT_OK_PTR(skel->links.dump_io_uring_buf, "bpf_program__attach_iter"))
+ goto end_close_fd;
+
+ if (!ASSERT_OK(io_uring_inode_match(bpf_link__fd(skel->links.dump_io_uring_buf), fd), "inode match"))
+ goto end_close_fd;
+
+ ret = io_uring_register_bufs(fd, iovs, ARRAY_SIZE(iovs));
+ if (!ASSERT_OK(ret, "io_uring_register_bufs"))
+ goto end_close_fd;
+
+ /* "B\n" */
+ len = 2;
+ str = buf + len;
+ for (int j = 0; j < ARRAY_SIZE(iovs); j++) {
+ ret = snprintf(str, sizeof(buf) - len, "%d:0x%lx:%zu\n", j,
+ (unsigned long)iovs[j].iov_base,
+ iovs[j].iov_len);
+ if (!ASSERT_GE(ret, 0, "snprintf") || !ASSERT_LT(ret, sizeof(buf) - len, "snprintf"))
+ goto end_close_fd;
+ len += ret;
+ str += ret;
+
+ ret = snprintf(str, sizeof(buf) - len, "`-PFN for bvec[0]=%llu\n",
+ page_addr_to_pfn((unsigned long)iovs[j].iov_base));
+ if (!ASSERT_GE(ret, 0, "snprintf") || !ASSERT_LT(ret, sizeof(buf) - len, "snprintf"))
+ goto end_close_fd;
+ len += ret;
+ str += ret;
+ }
+
+ ret = snprintf(str, sizeof(buf) - len, "E:%zu\n", ARRAY_SIZE(iovs));
+ if (!ASSERT_GE(ret, 0, "snprintf") || !ASSERT_LT(ret, sizeof(buf) - len, "snprintf"))
+ goto end_close_fd;
+
+ iter_fd = bpf_iter_create(bpf_link__fd(skel->links.dump_io_uring_buf));
+ if (!ASSERT_GE(iter_fd, 0, "bpf_iter_create"))
+ goto end_close_fd;
+
+ ret = read_fd_into_buffer(iter_fd, rbuf, sizeof(rbuf));
+ if (!ASSERT_GT(ret, 0, "read_fd_into_buffer"))
+ goto end_close_iter;
+
+ if (!ASSERT_OK(strcmp(rbuf, buf), "compare iterator output")) {
+ puts("=== Expected Output ===");
+ printf("%s", buf);
+ puts("==== Actual Output ====");
+ printf("%s", rbuf);
+ puts("=======================");
+ }
+end_close_iter:
+ close(iter_fd);
+end_close_fd:
+ close(fd);
+end:
+ while (i--)
+ munmap(iovs[i].iov_base, iovs[i].iov_len);
+ bpf_iter_io_uring__destroy(skel);
+}
+
+void test_io_uring_file(void)
+{
+ int reg_files[] = { [0 ... 7] = -1 };
+ DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
+ char buf[4096] = "B\n", rbuf[4096] = {}, *str;
+ union bpf_iter_link_info linfo = {};
+ struct bpf_iter_io_uring *skel;
+ int iter_fd, fd, len = 0, ret;
+ struct io_uring_params p;
+
+ opts.link_info = &linfo;
+ opts.link_info_len = sizeof(linfo);
+
+ skel = bpf_iter_io_uring__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "bpf_iter_io_uring__open_and_load"))
+ return;
+
+ /* "B\n" */
+ len = 2;
+ str = buf + len;
+ ret = snprintf(str, sizeof(buf) - len, "B\n");
+ for (int i = 0; i < ARRAY_SIZE(reg_files); i++) {
+ char templ[] = "/tmp/io_uringXXXXXX";
+ const char *name, *def = "<none>";
+
+ /* create sparse set */
+ if (i & 1) {
+ name = def;
+ } else {
+ reg_files[i] = mkstemp(templ);
+ if (!ASSERT_GE(reg_files[i], 0, templ))
+ goto end_close_reg_files;
+ name = templ;
+ ASSERT_OK(unlink(name), "unlink");
+ }
+ ret = snprintf(str, sizeof(buf) - len, "%d:%s%s\n", i, name, name != def ? " (deleted)" : "");
+ if (!ASSERT_GE(ret, 0, "snprintf") || !ASSERT_LT(ret, sizeof(buf) - len, "snprintf"))
+ goto end_close_reg_files;
+ len += ret;
+ str += ret;
+ }
+
+ ret = snprintf(str, sizeof(buf) - len, "E:%zu\n", ARRAY_SIZE(reg_files));
+ if (!ASSERT_GE(ret, 0, "snprintf") || !ASSERT_LT(ret, sizeof(buf) - len, "snprintf"))
+ goto end_close_reg_files;
+
+ memset(&p, 0, sizeof(p));
+ fd = sys_io_uring_setup(1, &p);
+ if (!ASSERT_GE(fd, 0, "io_uring_setup"))
+ goto end_close_reg_files;
+
+ linfo.io_uring.io_uring_fd = fd;
+ skel->links.dump_io_uring_file = bpf_program__attach_iter(skel->progs.dump_io_uring_file,
+ &opts);
+ if (!ASSERT_OK_PTR(skel->links.dump_io_uring_file, "bpf_program__attach_iter"))
+ goto end_close_fd;
+
+ if (!ASSERT_OK(io_uring_inode_match(bpf_link__fd(skel->links.dump_io_uring_file), fd), "inode match"))
+ goto end_close_fd;
+
+ iter_fd = bpf_iter_create(bpf_link__fd(skel->links.dump_io_uring_file));
+ if (!ASSERT_GE(iter_fd, 0, "bpf_iter_create"))
+ goto end;
+
+ ret = io_uring_register_files(fd, reg_files, ARRAY_SIZE(reg_files));
+ if (!ASSERT_OK(ret, "io_uring_register_files"))
+ goto end_iter_fd;
+
+ ret = read_fd_into_buffer(iter_fd, rbuf, sizeof(rbuf));
+ if (!ASSERT_GT(ret, 0, "read_fd_into_buffer(iterator_fd, buf)"))
+ goto end_iter_fd;
+
+ if (!ASSERT_OK(strcmp(rbuf, buf), "compare iterator output")) {
+ puts("=== Expected Output ===");
+ printf("%s", buf);
+ puts("==== Actual Output ====");
+ printf("%s", rbuf);
+ puts("=======================");
+ }
+end_iter_fd:
+ close(iter_fd);
+end_close_fd:
+ close(fd);
+end_close_reg_files:
+ for (int i = 0; i < ARRAY_SIZE(reg_files); i++) {
+ if (reg_files[i] != -1)
+ close(reg_files[i]);
+ }
+end:
+ bpf_iter_io_uring__destroy(skel);
+}
+
void test_bpf_iter(void)
{
if (test__start_subtest("btf_id_or_null"))
@@ -1299,4 +1546,8 @@ void test_bpf_iter(void)
test_rdonly_buf_out_of_bound();
if (test__start_subtest("buf-neg-offset"))
test_buf_neg_offset();
+ if (test__start_subtest("io_uring_buf"))
+ test_io_uring_buf();
+ if (test__start_subtest("io_uring_file"))
+ test_io_uring_file();
}
diff --git a/tools/testing/selftests/bpf/progs/bpf_iter_io_uring.c b/tools/testing/selftests/bpf/progs/bpf_iter_io_uring.c
new file mode 100644
index 000000000000..caf8bd0bf8d4
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bpf_iter_io_uring.c
@@ -0,0 +1,50 @@
+// SPDX-License-Identifier: GPL-2.0
+#include "bpf_iter.h"
+#include <bpf/bpf_helpers.h>
+
+SEC("iter/io_uring_buf")
+int dump_io_uring_buf(struct bpf_iter__io_uring_buf *ctx)
+{
+ struct io_mapped_ubuf *ubuf = ctx->ubuf;
+ struct seq_file *seq = ctx->meta->seq;
+ unsigned int index = ctx->index;
+
+ if (!ctx->meta->seq_num)
+ BPF_SEQ_PRINTF(seq, "B\n");
+
+ if (ubuf) {
+ BPF_SEQ_PRINTF(seq, "%u:0x%lx:%lu\n", index, (unsigned long)ubuf->ubuf,
+ (unsigned long)ubuf->ubuf_end - ubuf->ubuf);
+ BPF_SEQ_PRINTF(seq, "`-PFN for bvec[0]=%lu\n",
+ (unsigned long)bpf_page_to_pfn(ubuf->bvec[0].bv_page));
+ } else {
+ BPF_SEQ_PRINTF(seq, "E:%u\n", index);
+ }
+ return 0;
+}
+
+SEC("iter/io_uring_file")
+int dump_io_uring_file(struct bpf_iter__io_uring_file *ctx)
+{
+ struct seq_file *seq = ctx->meta->seq;
+ unsigned int index = ctx->index;
+ struct file *file = ctx->file;
+ char buf[256] = "";
+
+ if (!ctx->meta->seq_num)
+ BPF_SEQ_PRINTF(seq, "B\n");
+ /* for io_uring_file iterator, this is the terminating condition */
+ if (ctx->ctx->nr_user_files == index) {
+ BPF_SEQ_PRINTF(seq, "E:%u\n", index);
+ return 0;
+ }
+ if (file) {
+ bpf_d_path(&file->f_path, buf, sizeof(buf));
+ BPF_SEQ_PRINTF(seq, "%u:%s\n", index, buf);
+ } else {
+ BPF_SEQ_PRINTF(seq, "%u:<none>\n", index);
+ }
+ return 0;
+}
+
+char _license[] SEC("license") = "GPL";
--
2.34.1
* [PATCH bpf-next v3 07/10] selftests/bpf: Add test for epoll BPF iterator
2021-12-01 4:23 [PATCH bpf-next v3 00/10] Introduce BPF iterators for io_uring and epoll Kumar Kartikeya Dwivedi
` (5 preceding siblings ...)
2021-12-01 4:23 ` [PATCH bpf-next v3 06/10] selftests/bpf: Add test for io_uring BPF iterators Kumar Kartikeya Dwivedi
@ 2021-12-01 4:23 ` Kumar Kartikeya Dwivedi
2021-12-01 4:23 ` [PATCH bpf-next v3 08/10] selftests/bpf: Test partial reads for io_uring, epoll iterators Kumar Kartikeya Dwivedi
` (2 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2021-12-01 4:23 UTC (permalink / raw)
To: bpf
Cc: Alexander Viro, linux-fsdevel, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, Yonghong Song, Pavel Emelyanov,
Alexander Mikhalitsyn, Andrei Vagin, criu, io-uring
This tests the epoll iterator, including peeking into the epitem to
inspect the registered file and fd number, and verifying that output in
userspace.
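For reference, the iterator output verified below has this shape (fd
numbers illustrative; the userspace matching is order-independent):
  B
  pipe:3
  socket:5
  pipe:4
  socket:6
  E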
Cc: Alexander Viro <[email protected]>
Cc: [email protected]
Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]>
---
.../selftests/bpf/prog_tests/bpf_iter.c | 121 ++++++++++++++++++
.../selftests/bpf/progs/bpf_iter_epoll.c | 33 +++++
2 files changed, 154 insertions(+)
create mode 100644 tools/testing/selftests/bpf/progs/bpf_iter_epoll.c
diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_iter.c b/tools/testing/selftests/bpf/prog_tests/bpf_iter.c
index 13ea2eaed032..cc0555c5b373 100644
--- a/tools/testing/selftests/bpf/prog_tests/bpf_iter.c
+++ b/tools/testing/selftests/bpf/prog_tests/bpf_iter.c
@@ -2,6 +2,7 @@
/* Copyright (c) 2020 Facebook */
#include <sys/stat.h>
#include <sys/mman.h>
+#include <sys/epoll.h>
#include <test_progs.h>
#include <linux/io_uring.h>
@@ -31,6 +32,7 @@
#include "bpf_iter_test_kern5.skel.h"
#include "bpf_iter_test_kern6.skel.h"
#include "bpf_iter_io_uring.skel.h"
+#include "bpf_iter_epoll.skel.h"
static int duration;
@@ -1486,6 +1488,123 @@ void test_io_uring_file(void)
bpf_iter_io_uring__destroy(skel);
}
+void test_epoll(void)
+{
+ const char *fmt = "B\npipe:%d\nsocket:%d\npipe:%d\nsocket:%d\nE\n";
+ DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
+ char buf[4096] = {}, rbuf[4096] = {};
+ union bpf_iter_link_info linfo = {};
+ int fds[2], sk[2], epfd, ret;
+ struct bpf_iter_epoll *skel;
+ struct epoll_event ev = {};
+ int iter_fd, set[4] = {};
+ char *s, *t;
+
+ opts.link_info = &linfo;
+ opts.link_info_len = sizeof(linfo);
+
+ skel = bpf_iter_epoll__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "bpf_iter_epoll__open_and_load"))
+ return;
+
+ epfd = epoll_create1(EPOLL_CLOEXEC);
+ if (!ASSERT_GE(epfd, 0, "epoll_create1"))
+ goto end;
+
+ ret = pipe(fds);
+ if (!ASSERT_OK(ret, "pipe(fds)"))
+ goto end_epfd;
+
+ ret = socketpair(AF_UNIX, SOCK_STREAM, 0, sk);
+ if (!ASSERT_OK(ret, "socketpair"))
+ goto end_pipe;
+
+ ev.events = EPOLLIN;
+
+ ret = epoll_ctl(epfd, EPOLL_CTL_ADD, fds[0], &ev);
+ if (!ASSERT_OK(ret, "epoll_ctl"))
+ goto end_sk;
+
+ ret = epoll_ctl(epfd, EPOLL_CTL_ADD, sk[0], &ev);
+ if (!ASSERT_OK(ret, "epoll_ctl"))
+ goto end_sk;
+
+ ret = epoll_ctl(epfd, EPOLL_CTL_ADD, fds[1], &ev);
+ if (!ASSERT_OK(ret, "epoll_ctl"))
+ goto end_sk;
+
+ ret = epoll_ctl(epfd, EPOLL_CTL_ADD, sk[1], &ev);
+ if (!ASSERT_OK(ret, "epoll_ctl"))
+ goto end_sk;
+
+ linfo.epoll.epoll_fd = epfd;
+ skel->links.dump_epoll = bpf_program__attach_iter(skel->progs.dump_epoll, &opts);
+ if (!ASSERT_OK_PTR(skel->links.dump_epoll, "bpf_program__attach_iter"))
+ goto end_sk;
+
+ iter_fd = bpf_iter_create(bpf_link__fd(skel->links.dump_epoll));
+ if (!ASSERT_GE(iter_fd, 0, "bpf_iter_create"))
+ goto end_sk;
+
+ ret = epoll_ctl(epfd, EPOLL_CTL_ADD, iter_fd, &ev);
+ if (!ASSERT_EQ(ret, -1, "epoll_ctl add for iter_fd"))
+ goto end_iter_fd;
+
+ ret = snprintf(buf, sizeof(buf), fmt, fds[0], sk[0], fds[1], sk[1]);
+ if (!ASSERT_GE(ret, 0, "snprintf") || !ASSERT_LT(ret, sizeof(buf), "snprintf"))
+ goto end_iter_fd;
+
+ ret = read_fd_into_buffer(iter_fd, rbuf, sizeof(rbuf));
+ if (!ASSERT_GT(ret, 0, "read_fd_into_buffer"))
+ goto end_iter_fd;
+
+ puts("=== Expected Output ===");
+ printf("%s", buf);
+ puts("==== Actual Output ====");
+ printf("%s", rbuf);
+ puts("=======================");
+
+ s = rbuf;
+ while ((s = strtok_r(s, "\n", &t))) {
+ int fd = -1;
+
+ if (s[0] == 'B' || s[0] == 'E')
+ goto next;
+ ASSERT_EQ(sscanf(s, s[0] == 'p' ? "pipe:%d" : "socket:%d", &fd), 1, s);
+ if (fd == fds[0]) {
+ ASSERT_NEQ(set[0], 1, "pipe[0]");
+ set[0] = 1;
+ } else if (fd == fds[1]) {
+ ASSERT_NEQ(set[1], 1, "pipe[1]");
+ set[1] = 1;
+ } else if (fd == sk[0]) {
+ ASSERT_NEQ(set[2], 1, "sk[0]");
+ set[2] = 1;
+ } else if (fd == sk[1]) {
+ ASSERT_NEQ(set[3], 1, "sk[1]");
+ set[3] = 1;
+ } else {
+ ASSERT_TRUE(0, "Incorrect fd in iterator output");
+ }
+next:
+ s = NULL;
+ }
+ for (int i = 0; i < ARRAY_SIZE(set); i++)
+ ASSERT_EQ(set[i], 1, "fd found");
+end_iter_fd:
+ close(iter_fd);
+end_sk:
+ close(sk[1]);
+ close(sk[0]);
+end_pipe:
+ close(fds[1]);
+ close(fds[0]);
+end_epfd:
+ close(epfd);
+end:
+ bpf_iter_epoll__destroy(skel);
+}
+
void test_bpf_iter(void)
{
if (test__start_subtest("btf_id_or_null"))
@@ -1550,4 +1669,6 @@ void test_bpf_iter(void)
test_io_uring_buf();
if (test__start_subtest("io_uring_file"))
test_io_uring_file();
+ if (test__start_subtest("epoll"))
+ test_epoll();
}
diff --git a/tools/testing/selftests/bpf/progs/bpf_iter_epoll.c b/tools/testing/selftests/bpf/progs/bpf_iter_epoll.c
new file mode 100644
index 000000000000..0afc74d154a1
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bpf_iter_epoll.c
@@ -0,0 +1,33 @@
+// SPDX-License-Identifier: GPL-2.0
+#include "bpf_iter.h"
+#include <bpf/bpf_helpers.h>
+
+extern void pipefifo_fops __ksym;
+
+SEC("iter/epoll")
+int dump_epoll(struct bpf_iter__epoll *ctx)
+{
+ struct seq_file *seq = ctx->meta->seq;
+ struct epitem *epi = ctx->epi;
+ char sstr[] = "socket";
+ char pstr[] = "pipe";
+
+ if (!ctx->meta->seq_num) {
+ BPF_SEQ_PRINTF(seq, "B\n");
+ }
+ if (epi) {
+ struct file *f = epi->ffd.file;
+ char *str;
+
+ if (f->f_op == &pipefifo_fops)
+ str = pstr;
+ else
+ str = sstr;
+ BPF_SEQ_PRINTF(seq, "%s:%d\n", str, epi->ffd.fd);
+ } else {
+ BPF_SEQ_PRINTF(seq, "E\n");
+ }
+ return 0;
+}
+
+char _license[] SEC("license") = "GPL";
--
2.34.1
* [PATCH bpf-next v3 08/10] selftests/bpf: Test partial reads for io_uring, epoll iterators
2021-12-01 4:23 [PATCH bpf-next v3 00/10] Introduce BPF iterators for io_uring and epoll Kumar Kartikeya Dwivedi
` (6 preceding siblings ...)
2021-12-01 4:23 ` [PATCH bpf-next v3 07/10] selftests/bpf: Add test for epoll BPF iterator Kumar Kartikeya Dwivedi
@ 2021-12-01 4:23 ` Kumar Kartikeya Dwivedi
2021-12-01 4:23 ` [PATCH bpf-next v3 09/10] selftests/bpf: Fix btf_dump test for bpf_iter_link_info Kumar Kartikeya Dwivedi
2021-12-01 4:23 ` [PATCH RFC bpf-next v3 10/10] samples/bpf: Add example to checkpoint/restore io_uring Kumar Kartikeya Dwivedi
9 siblings, 0 replies; 11+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2021-12-01 4:23 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Yonghong Song, Pavel Emelyanov, Alexander Mikhalitsyn,
Andrei Vagin, criu, io-uring, linux-fsdevel
Ensure that the output is consistent in the face of partial reads that
return to userspace and then resume again later. To this end, we do
reads in 1-byte chunks, which is a bit stupid in real life, but works
well to simulate interrupted iteration. This also tests the case where
the seq_file buffer is consumed (after seq_printf) on an interrupted read
before the iterator invokes the BPF program again.
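A minimal sketch of the chunked-read idea (the actual helper below,
__read_fd_into_buffer(), generalizes the chunk size; sketch assumes
<unistd.h>):

static int read_one_byte_chunks(int fd, char *buf, int size)
{
	int off = 0, len = 0;

	/* Force a kernel/userspace round trip per byte to exercise
	 * stop/resume of the iterator.
	 */
	while (off < size && (len = read(fd, buf + off, 1)) > 0)
		off += len;
	return len < 0 ? len : off;
}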
Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]>
---
.../selftests/bpf/prog_tests/bpf_iter.c | 33 ++++++++++++-------
1 file changed, 22 insertions(+), 11 deletions(-)
diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_iter.c b/tools/testing/selftests/bpf/prog_tests/bpf_iter.c
index cc0555c5b373..3a07fdf31874 100644
--- a/tools/testing/selftests/bpf/prog_tests/bpf_iter.c
+++ b/tools/testing/selftests/bpf/prog_tests/bpf_iter.c
@@ -73,13 +73,13 @@ static void do_dummy_read(struct bpf_program *prog)
bpf_link__destroy(link);
}
-static int read_fd_into_buffer(int fd, char *buf, int size)
+static int __read_fd_into_buffer(int fd, char *buf, int size, size_t chunks)
{
int bufleft = size;
int len;
do {
- len = read(fd, buf, bufleft);
+ len = read(fd, buf, chunks ?: bufleft);
if (len > 0) {
buf += len;
bufleft -= len;
@@ -89,6 +89,11 @@ static int read_fd_into_buffer(int fd, char *buf, int size)
return len < 0 ? len : size - bufleft;
}
+static int read_fd_into_buffer(int fd, char *buf, int size)
+{
+ return __read_fd_into_buffer(fd, buf, size, 0);
+}
+
static void test_ipv6_route(void)
{
struct bpf_iter_ipv6_route *skel;
@@ -1301,7 +1306,7 @@ static int io_uring_inode_match(int link_fd, int io_uring_fd)
return 0;
}
-void test_io_uring_buf(void)
+void test_io_uring_buf(bool partial)
{
DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
char rbuf[4096], buf[4096] = "B\n";
@@ -1375,7 +1380,7 @@ void test_io_uring_buf(void)
if (!ASSERT_GE(iter_fd, 0, "bpf_iter_create"))
goto end_close_fd;
- ret = read_fd_into_buffer(iter_fd, rbuf, sizeof(rbuf));
+ ret = __read_fd_into_buffer(iter_fd, rbuf, sizeof(rbuf), partial);
if (!ASSERT_GT(ret, 0, "read_fd_into_buffer"))
goto end_close_iter;
@@ -1396,7 +1401,7 @@ void test_io_uring_buf(void)
bpf_iter_io_uring__destroy(skel);
}
-void test_io_uring_file(void)
+void test_io_uring_file(bool partial)
{
int reg_files[] = { [0 ... 7] = -1 };
DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
@@ -1464,7 +1469,7 @@ void test_io_uring_file(void)
if (!ASSERT_OK(ret, "io_uring_register_files"))
goto end_iter_fd;
- ret = read_fd_into_buffer(iter_fd, rbuf, sizeof(rbuf));
+ ret = __read_fd_into_buffer(iter_fd, rbuf, sizeof(rbuf), partial);
if (!ASSERT_GT(ret, 0, "read_fd_into_buffer(iterator_fd, buf)"))
goto end_iter_fd;
@@ -1488,7 +1493,7 @@ void test_io_uring_file(void)
bpf_iter_io_uring__destroy(skel);
}
-void test_epoll(void)
+void test_epoll(bool partial)
{
const char *fmt = "B\npipe:%d\nsocket:%d\npipe:%d\nsocket:%d\nE\n";
DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
@@ -1554,7 +1559,7 @@ void test_epoll(void)
if (!ASSERT_GE(ret, 0, "snprintf") || !ASSERT_LT(ret, sizeof(buf), "snprintf"))
goto end_iter_fd;
- ret = read_fd_into_buffer(iter_fd, rbuf, sizeof(rbuf));
+ ret = __read_fd_into_buffer(iter_fd, rbuf, sizeof(rbuf), partial);
if (!ASSERT_GT(ret, 0, "read_fd_into_buffer"))
goto end_iter_fd;
@@ -1666,9 +1671,15 @@ void test_bpf_iter(void)
if (test__start_subtest("buf-neg-offset"))
test_buf_neg_offset();
if (test__start_subtest("io_uring_buf"))
- test_io_uring_buf();
+ test_io_uring_buf(false);
if (test__start_subtest("io_uring_file"))
- test_io_uring_file();
+ test_io_uring_file(false);
if (test__start_subtest("epoll"))
- test_epoll();
+ test_epoll(false);
+ if (test__start_subtest("io_uring_buf-partial"))
+ test_io_uring_buf(true);
+ if (test__start_subtest("io_uring_file-partial"))
+ test_io_uring_file(true);
+ if (test__start_subtest("epoll-partial"))
+ test_epoll(true);
}
--
2.34.1
* [PATCH bpf-next v3 09/10] selftests/bpf: Fix btf_dump test for bpf_iter_link_info
2021-12-01 4:23 [PATCH bpf-next v3 00/10] Introduce BPF iterators for io_uring and epoll Kumar Kartikeya Dwivedi
` (7 preceding siblings ...)
2021-12-01 4:23 ` [PATCH bpf-next v3 08/10] selftests/bpf: Test partial reads for io_uring, epoll iterators Kumar Kartikeya Dwivedi
@ 2021-12-01 4:23 ` Kumar Kartikeya Dwivedi
2021-12-01 4:23 ` [PATCH RFC bpf-next v3 10/10] samples/bpf: Add example to checkpoint/restore io_uring Kumar Kartikeya Dwivedi
9 siblings, 0 replies; 11+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2021-12-01 4:23 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Yonghong Song, Pavel Emelyanov, Alexander Mikhalitsyn,
Andrei Vagin, criu, io-uring, linux-fsdevel
Since we changed the definition of union bpf_iter_link_info while adding
io_uring and epoll iterator support, adjust the selftest to check against
the updated definition.
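For reference, the layout exercised by the updated expected string below
is roughly as follows (a sketch reconstructed from this test, not the
authoritative UAPI header):

union bpf_iter_link_info {
	struct {
		__u32 map_fd;
	} map;
	struct {
		__u32 io_uring_fd;
	} io_uring;
	struct {
		__u32 epoll_fd;
	} epoll;
};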
Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]>
---
tools/testing/selftests/bpf/prog_tests/btf_dump.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/bpf/prog_tests/btf_dump.c b/tools/testing/selftests/bpf/prog_tests/btf_dump.c
index 9e26903f9170..1678b2c49f78 100644
--- a/tools/testing/selftests/bpf/prog_tests/btf_dump.c
+++ b/tools/testing/selftests/bpf/prog_tests/btf_dump.c
@@ -736,7 +736,9 @@ static void test_btf_dump_struct_data(struct btf *btf, struct btf_dump *d,
/* union with nested struct */
TEST_BTF_DUMP_DATA(btf, d, "union", str, union bpf_iter_link_info, BTF_F_COMPACT,
- "(union bpf_iter_link_info){.map = (struct){.map_fd = (__u32)1,},}",
+ "(union bpf_iter_link_info){.map = (struct){.map_fd = (__u32)1,},"
+ ".io_uring = (struct){.io_uring_fd = (__u32)1,},"
+ ".epoll = (struct){.epoll_fd = (__u32)1,},}",
{ .map = { .map_fd = 1 }});
/* struct skb with nested structs/unions; because type output is so
--
2.34.1
* [PATCH RFC bpf-next v3 10/10] samples/bpf: Add example to checkpoint/restore io_uring
2021-12-01 4:23 [PATCH bpf-next v3 00/10] Introduce BPF iterators for io_uring and epoll Kumar Kartikeya Dwivedi
` (8 preceding siblings ...)
2021-12-01 4:23 ` [PATCH bpf-next v3 09/10] selftests/bpf: Fix btf_dump test for bpf_iter_link_info Kumar Kartikeya Dwivedi
@ 2021-12-01 4:23 ` Kumar Kartikeya Dwivedi
9 siblings, 0 replies; 11+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2021-12-01 4:23 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Yonghong Song, Pavel Emelyanov, Alexander Mikhalitsyn,
Andrei Vagin, criu, io-uring, linux-fsdevel
The sample demonstrates how the task and io_uring BPF iterators can be
used to checkpoint the state of an io_uring instance and then recreate
it from that information, as a working example of how userspace projects
like CRIU will make use of these iterators.
This is very similar in principle to how CRIU actually works, writing
all data on dump to protobuf images, which are then read during restore
to reconstruct the task and its resources. Here we use a custom binary
format and pipe the io_uring "image(s)" (when a wq_fd is attached there
will be multiple images) to the restorer, which then consumes this
information to form a total ordering of the restore actions it has to
execute to reach the same state.
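For illustration, the stream is just a sequence of fixed-size struct
io_uring_dump records (see bpf_cr.h below); a minimal consumer sketch,
assuming only <stdio.h>, <unistd.h> and bpf_cr.h:

static int consume_stream(void)
{
	struct io_uring_dump d;
	ssize_t n;

	/* Read one record at a time from stdin until EOF; records are
	 * later sorted by (io_uring_fd, type) to derive the total order
	 * of restore actions.
	 */
	while ((n = read(STDIN_FILENO, &d, sizeof(d))) == sizeof(d))
		fprintf(stderr, "type=%d fd=%d end=%d\n",
			d.type, d.io_uring_fd, d.end);
	return n ? -1 : 0;
}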
The sample restores all features that currently cannot be restored
without BPF iterators, and hence is a good demonstration of what we would
like to achieve using these new facilities. As is evident, a single
iteration pass in each iterator suffices to obtain all the information we
require.
io_uring ring buffer restoration is orthogonal and not specific to
iterators, so it has been left out.
Our example app also shares the workqueue with the parent io_uring; the
dumper tool detects this and moves to dump the parent io_uring first.
io_uring doesn't allow creating cycles in this case, so the chain ends
eventually in practice. For now only a single parent is supported, but it
is easy to extend this to arbitrary-length chains (by recursing with a
limit in do_dump_parent after detecting the presence of wq_fd > 0).
The epoll iterator use case is similar to what we do in dump_io_uring_file,
and would significantly simplify the current implementation [0].
[0]: https://github.com/checkpoint-restore/criu/blob/criu-dev/criu/eventpoll.c
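A hypothetical epoll dumper would follow the same attach-and-read pattern
as do_dump() in this patch; a rough sketch (error paths trimmed, using
only the libbpf APIs already used here):

static int dump_epoll_items(struct bpf_program *prog, int epfd)
{
	DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
	union bpf_iter_link_info linfo = {};
	struct bpf_link *link;
	char buf[4096];
	int it, ret;

	linfo.epoll.epoll_fd = epfd;
	opts.link_info = &linfo;
	opts.link_info_len = sizeof(linfo);

	link = bpf_program__attach_iter(prog, &opts);
	if (!link)
		return -errno;
	it = bpf_iter_create(bpf_link__fd(link));
	if (it < 0) {
		ret = -errno;
		bpf_link__destroy(link);
		return ret;
	}
	/* Drive the iterator; output lands in buf for consumption. */
	ret = read(it, buf, sizeof(buf));
	if (ret < 0)
		ret = -errno;
	close(it);
	bpf_link__destroy(link);
	return ret;
}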
The dry-run mode of bpf_cr tool prints the dump image:
$ ./bpf_cr app &
PID: 318, Parent io_uring: 3, Dependent io_uring: 4
$ ./bpf_cr dump 318 4 | ./bpf_cr restore --dry-run
DUMP_SETUP:
io_uring_fd: 3
end: true
flags: 14
sq_entries: 2
cq_entries: 4
sq_thread_cpu: 0
sq_thread_idle: 1500
wq_fd: 0
DUMP_SETUP:
io_uring_fd: 4
end: false
flags: 46
sq_entries: 2
cq_entries: 4
sq_thread_cpu: 0
sq_thread_idle: 1500
wq_fd: 3
DUMP_EVENTFD:
io_uring_fd: 4
end: false
eventfd: 5
async: true
DUMP_REG_FD:
io_uring_fd: 4
end: false
reg_fd: 0
index: 0
DUMP_REG_FD:
io_uring_fd: 4
end: false
reg_fd: 0
index: 2
DUMP_REG_FD:
io_uring_fd: 4
end: false
reg_fd: 0
index: 4
DUMP_REG_BUF:
io_uring_fd: 4
end: false
addr: 0
len: 0
index: 0
DUMP_REG_BUF:
io_uring_fd: 4
end: true
addr: 140721288339216
len: 120
index: 1
Nothing to do, exiting...
======
The trace is as follows:
// We can shift fd numbers around randomly; it doesn't impact C/R
$ exec 3<> /dev/urandom
$ exec 4<> /dev/random
$ exec 5<> /dev/null
$ strace ./bpf_cr app &
...
io_uring_setup(2, {flags=IORING_SETUP_SQPOLL|IORING_SETUP_SQ_AFF|IORING_SETUP_CQSIZE, sq_thread_cpu=0, sq_thread_idle=1500, sq_entries=2, cq_entries=4, features=IORING_FEAT_SINGLE_MMAP|IORING_FEAT_NODROP|IORING_FEAT_SUBMIT_STABLE|IORING_FEAT_RW_CUR_POS|IORING_FEAT_CUR_PERSONALITY|IORING_FEAT_FAST_POLL|IORING_FEAT_POLL_32BITS|IORING_FEAT_SQPOLL_NONFIXED|IORING_FEAT_EXT_ARG|IORING_FEAT_NATIVE_WORKERS|IORING_FEAT_RSRC_TAGS, sq_off={head=0, tail=64, ring_mask=256, ring_entries=264, flags=276, dropped=272, array=384}, cq_off={head=128, tail=192, ring_mask=260, ring_entries=268, overflow=284, cqes=320, flags=0x118 /* IORING_CQ_??? */}}) = 6
getpid() = 324
...
io_uring_setup(2, {flags=IORING_SETUP_SQPOLL|IORING_SETUP_SQ_AFF|IORING_SETUP_CQSIZE|IORING_SETUP_ATTACH_WQ, sq_thread_cpu=0, sq_thread_idle=1500, wq_fd=6, sq_entries=2, cq_entries=4, features=IORING_FEAT_SINGLE_MMAP|IORING_FEAT_NODROP|IORING_FEAT_SUBMIT_STABLE|IORING_FEAT_RW_CUR_POS|IORING_FEAT_CUR_PERSONALITY|IORING_FEAT_FAST_POLL|IORING_FEAT_POLL_32BITS|IORING_FEAT_SQPOLL_NONFIXED|IORING_FEAT_EXT_ARG|IORING_FEAT_NATIVE_WORKERS|IORING_FEAT_RSRC_TAGS, sq_off={head=0, tail=64, ring_mask=256, ring_entries=264, flags=276, dropped=272, array=384}, cq_off={head=128, tail=192, ring_mask=260, ring_entries=268, overflow=284, cqes=320, flags=0x118 /* IORING_CQ_??? */}}) = 7
...
// PID: 324, Parent io_uring: 6, Dependent io_uring: 7
...
eventfd2(42, 0) = 8
io_uring_register(7, IORING_REGISTER_EVENTFD_ASYNC, [8], 1) = 0
io_uring_register(7, IORING_REGISTER_FILES, [0, -1, 1, -1, 2], 5) = 0
io_uring_register(7, IORING_REGISTER_BUFFERS, [{iov_base=NULL, iov_len=0}, {iov_base=0x7ffdf1a27680, iov_len=120}], 2) = 0
The restore trace is as follows (restore detects the wq_fd on its own,
and dumps and restores it as well, before restoring fd 7):
$ ./bpf_cr dump 326 7 | strace ./bpf_cr restore
...
io_uring_setup(2, {flags=IORING_SETUP_SQPOLL|IORING_SETUP_SQ_AFF|IORING_SETUP_CQSIZE, sq_thread_cpu=0, sq_thread_idle=1500, sq_entries=2, cq_entries=4, features=IORING_FEAT_SINGLE_MMAP|IORING_FEAT_NODROP|IORING_FEAT_SUBMIT_STABLE|IORING_FEAT_RW_CUR_POS|IORING_FEAT_CUR_PERSONALITY|IORING_FEAT_FAST_POLL|IORING_FEAT_POLL_32BITS|IORING_FEAT_SQPOLL_NONFIXED|IORING_FEAT_EXT_ARG|IORING_FEAT_NATIVE_WORKERS|IORING_FEAT_RSRC_TAGS, sq_off={head=0, tail=64, ring_mask=256, ring_entries=264, flags=276, dropped=272, array=384}, cq_off={head=128, tail=192, ring_mask=260, ring_entries=268, overflow=284, cqes=320, flags=0x118 /* IORING_CQ_??? */}}) = 6
dup2(6, 6) = 6
...
io_uring_setup(2, {flags=IORING_SETUP_SQPOLL|IORING_SETUP_SQ_AFF|IORING_SETUP_CQSIZE|IORING_SETUP_ATTACH_WQ, sq_thread_cpu=0, sq_thread_idle=1500, wq_fd=6, sq_entries=2, cq_entries=4, features=IORING_FEAT_SINGLE_MMAP|IORING_FEAT_NODROP|IORING_FEAT_SUBMIT_STABLE|IORING_FEAT_RW_CUR_POS|IORING_FEAT_CUR_PERSONALITY|IORING_FEAT_FAST_POLL|IORING_FEAT_POLL_32BITS|IORING_FEAT_SQPOLL_NONFIXED|IORING_FEAT_EXT_ARG|IORING_FEAT_NATIVE_WORKERS|IORING_FEAT_RSRC_TAGS, sq_off={head=0, tail=64, ring_mask=256, ring_entries=264, flags=276, dropped=272, array=384}, cq_off={head=128, tail=192, ring_mask=260, ring_entries=268, overflow=284, cqes=320, flags=0x118 /* IORING_CQ_??? */}}) = 7
dup2(7, 7) = 7
...
eventfd2(42, 0) = 8
io_uring_register(7, IORING_REGISTER_EVENTFD_ASYNC, [8], 1) = 0
...
// fd numbers 0, 1, and 2 refer to the same struct file, hence the lowest
// one is used during restore; it doesn't matter, as the underlying
// struct file is the same...
io_uring_register(7, IORING_REGISTER_FILES, [0, -1, 0, -1, 0], 5) = 0
// This step would happen after restoring the mm, so for now it fails for the second iovec
io_uring_register(7, IORING_REGISTER_BUFFERS, [{iov_base=NULL, iov_len=0}, {iov_base=0x7ffdf1a27680, iov_len=120}], 2) = -1 EFAULT (Bad address)
...
---
samples/bpf/.gitignore | 1 +
samples/bpf/Makefile | 8 +-
samples/bpf/bpf_cr.bpf.c | 185 +++++++++++
samples/bpf/bpf_cr.c | 688 +++++++++++++++++++++++++++++++++++++++
samples/bpf/bpf_cr.h | 48 +++
samples/bpf/hbm_kern.h | 2 -
6 files changed, 928 insertions(+), 4 deletions(-)
create mode 100644 samples/bpf/bpf_cr.bpf.c
create mode 100644 samples/bpf/bpf_cr.c
create mode 100644 samples/bpf/bpf_cr.h
diff --git a/samples/bpf/.gitignore b/samples/bpf/.gitignore
index 0e7bfdbff80a..9c542431ea45 100644
--- a/samples/bpf/.gitignore
+++ b/samples/bpf/.gitignore
@@ -1,4 +1,5 @@
# SPDX-License-Identifier: GPL-2.0-only
+bpf_cr
cpustat
fds_example
hbm
diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index a886dff1ba89..a64f2e019bfc 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -53,6 +53,7 @@ tprogs-y += task_fd_query
tprogs-y += xdp_sample_pkts
tprogs-y += ibumad
tprogs-y += hbm
+tprogs-y += bpf_cr
tprogs-y += xdp_redirect_cpu
tprogs-y += xdp_redirect_map_multi
@@ -118,6 +119,7 @@ task_fd_query-objs := task_fd_query_user.o $(TRACE_HELPERS)
xdp_sample_pkts-objs := xdp_sample_pkts_user.o
ibumad-objs := ibumad_user.o
hbm-objs := hbm.o $(CGROUP_HELPERS)
+bpf_cr-objs := bpf_cr.o
xdp_redirect_map_multi-objs := xdp_redirect_map_multi_user.o $(XDP_SAMPLE)
xdp_redirect_cpu-objs := xdp_redirect_cpu_user.o $(XDP_SAMPLE)
@@ -198,7 +200,7 @@ BPF_EXTRA_CFLAGS += -I$(srctree)/arch/mips/include/asm/mach-generic
endif
endif
-TPROGS_CFLAGS += -Wall -O2
+TPROGS_CFLAGS += -Wall -O2 -g
TPROGS_CFLAGS += -Wmissing-prototypes
TPROGS_CFLAGS += -Wstrict-prototypes
@@ -337,6 +339,7 @@ $(obj)/xdp_redirect_map_multi_user.o: $(obj)/xdp_redirect_map_multi.skel.h
$(obj)/xdp_redirect_map_user.o: $(obj)/xdp_redirect_map.skel.h
$(obj)/xdp_redirect_user.o: $(obj)/xdp_redirect.skel.h
$(obj)/xdp_monitor_user.o: $(obj)/xdp_monitor.skel.h
+$(obj)/bpf_cr.o: $(obj)/bpf_cr.skel.h
$(obj)/tracex5_kern.o: $(obj)/syscall_nrs.h
$(obj)/hbm_out_kern.o: $(src)/hbm.h $(src)/hbm_kern.h
@@ -392,7 +395,7 @@ $(obj)/%.bpf.o: $(src)/%.bpf.c $(obj)/vmlinux.h $(src)/xdp_sample.bpf.h $(src)/x
-I$(LIBBPF_INCLUDE) $(CLANG_SYS_INCLUDES) \
-c $(filter %.bpf.c,$^) -o $@
-LINKED_SKELS := xdp_redirect_cpu.skel.h xdp_redirect_map_multi.skel.h \
+LINKED_SKELS := bpf_cr.skel.h xdp_redirect_cpu.skel.h xdp_redirect_map_multi.skel.h \
xdp_redirect_map.skel.h xdp_redirect.skel.h xdp_monitor.skel.h
clean-files += $(LINKED_SKELS)
@@ -401,6 +404,7 @@ xdp_redirect_map_multi.skel.h-deps := xdp_redirect_map_multi.bpf.o xdp_sample.bp
xdp_redirect_map.skel.h-deps := xdp_redirect_map.bpf.o xdp_sample.bpf.o
xdp_redirect.skel.h-deps := xdp_redirect.bpf.o xdp_sample.bpf.o
xdp_monitor.skel.h-deps := xdp_monitor.bpf.o xdp_sample.bpf.o
+bpf_cr.skel.h-deps := bpf_cr.bpf.o
LINKED_BPF_SRCS := $(patsubst %.bpf.o,%.bpf.c,$(foreach skel,$(LINKED_SKELS),$($(skel)-deps)))
diff --git a/samples/bpf/bpf_cr.bpf.c b/samples/bpf/bpf_cr.bpf.c
new file mode 100644
index 000000000000..6b0bb019f2be
--- /dev/null
+++ b/samples/bpf/bpf_cr.bpf.c
@@ -0,0 +1,185 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include "vmlinux.h"
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_core_read.h>
+#include <bpf/bpf_helpers.h>
+
+#include "bpf_cr.h"
+
+/* struct file -> int fd */
+struct {
+ __uint(type, BPF_MAP_TYPE_HASH);
+ __type(key, __u64);
+ __type(value, int);
+ __uint(max_entries, 16);
+} fdtable_map SEC(".maps");
+
+struct ctx_map_val {
+ int fd;
+ bool init;
+};
+
+/* io_ring_ctx -> int fd */
+struct {
+ __uint(type, BPF_MAP_TYPE_HASH);
+ __type(key, __u64);
+ __type(value, struct ctx_map_val);
+ __uint(max_entries, 16);
+} io_ring_ctx_map SEC(".maps");
+
+/* ctx->sq_data -> int fd */
+struct {
+ __uint(type, BPF_MAP_TYPE_HASH);
+ __type(key, __u64);
+ __type(value, int);
+ __uint(max_entries, 16);
+} sq_data_map SEC(".maps");
+
+/* eventfd_ctx -> int fd */
+struct {
+ __uint(type, BPF_MAP_TYPE_HASH);
+ __type(key, __u64);
+ __type(value, int);
+ __uint(max_entries, 16);
+} eventfd_ctx_map SEC(".maps");
+
+const volatile pid_t tgid = 0;
+
+extern void eventfd_fops __ksym;
+extern void io_uring_fops __ksym;
+
+SEC("iter/task_file")
+int dump_task(struct bpf_iter__task_file *ctx)
+{
+ struct seq_file *seq = ctx->meta->seq;
+ struct task_struct *task = ctx->task;
+ struct file *file = ctx->file;
+ struct ctx_map_val val = {};
+ __u64 f_priv;
+ int fd;
+
+ if (!task)
+ return 0;
+ if (task->tgid != tgid)
+ return 0;
+ if (!file)
+ return 0;
+
+ f_priv = (__u64)file->private_data;
+ fd = ctx->fd;
+ val.fd = fd;
+ if (file->f_op == &eventfd_fops) {
+ bpf_map_update_elem(&eventfd_ctx_map, &f_priv, &fd, 0);
+ } else if (file->f_op == &io_uring_fops) {
+ struct io_ring_ctx *ctx;
+ void *sq_data;
+ __u64 key;
+
+ bpf_map_update_elem(&io_ring_ctx_map, &f_priv, &val, 0);
+ ctx = file->private_data;
+ bpf_probe_read_kernel(&sq_data, sizeof(sq_data), &ctx->sq_data);
+ key = (__u64)sq_data;
+ bpf_map_update_elem(&sq_data_map, &key, &fd, BPF_NOEXIST);
+ }
+ f_priv = (__u64)file;
+ bpf_map_update_elem(&fdtable_map, &f_priv, &fd, BPF_NOEXIST);
+ return 0;
+}
+
+static void dump_io_ring_ctx(struct seq_file *seq, struct io_ring_ctx *ctx, int ring_fd)
+{
+ struct io_uring_dump dump;
+ struct ctx_map_val *val;
+ __u64 key;
+ int *fd;
+
+ key = (__u64)ctx;
+ val = bpf_map_lookup_elem(&io_ring_ctx_map, &key);
+ if (val && val->init)
+ return;
+ __builtin_memset(&dump, 0, sizeof(dump));
+ if (val)
+ val->init = true;
+ dump.type = DUMP_SETUP;
+ dump.io_uring_fd = ring_fd;
+ key = (__u64)ctx->sq_data;
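+/* Matches IORING_SETUP_ATTACH_WQ (1U << 5) in the io_uring UAPI; redefined
+ * here since vmlinux.h carries no #defines.
+ */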
+#define ATTACH_WQ_FLAG (1 << 5)
+ if (ctx->flags & ATTACH_WQ_FLAG) {
+ fd = bpf_map_lookup_elem(&sq_data_map, &key);
+ if (fd)
+ dump.desc.setup.wq_fd = *fd;
+ }
+ dump.desc.setup.flags = ctx->flags;
+ dump.desc.setup.sq_entries = ctx->sq_entries;
+ dump.desc.setup.cq_entries = ctx->cq_entries;
+ dump.desc.setup.sq_thread_cpu = ctx->sq_data->sq_cpu;
+ dump.desc.setup.sq_thread_idle = ctx->sq_data->sq_thread_idle;
+ bpf_seq_write(seq, &dump, sizeof(dump));
+ if (ctx->cq_ev_fd) {
+ dump.type = DUMP_EVENTFD;
+ key = (__u64)ctx->cq_ev_fd;
+ fd = bpf_map_lookup_elem(&eventfd_ctx_map, &key);
+ if (fd)
+ dump.desc.eventfd.eventfd = *fd;
+ dump.desc.eventfd.async = ctx->eventfd_async;
+ bpf_seq_write(seq, &dump, sizeof(dump));
+ }
+}
+
+SEC("iter/io_uring_buf")
+int dump_io_uring_buf(struct bpf_iter__io_uring_buf *ctx)
+{
+ struct io_mapped_ubuf *ubuf = ctx->ubuf;
+ struct seq_file *seq = ctx->meta->seq;
+ struct io_uring_dump dump;
+ __u64 key;
+ int *fd;
+
+ __builtin_memset(&dump, 0, sizeof(dump));
+ key = (__u64)ctx->ctx;
+ fd = bpf_map_lookup_elem(&io_ring_ctx_map, &key);
+ if (!ctx->meta->seq_num)
+ dump_io_ring_ctx(seq, ctx->ctx, fd ? *fd : 0);
+ if (!ubuf)
+ return 0;
+ dump.type = DUMP_REG_BUF;
+ if (fd)
+ dump.io_uring_fd = *fd;
+ dump.desc.reg_buf.index = ctx->index;
+ if (ubuf != ctx->ctx->dummy_ubuf) {
+ dump.desc.reg_buf.addr = ubuf->ubuf;
+ dump.desc.reg_buf.len = ubuf->ubuf_end - ubuf->ubuf;
+ }
+ bpf_seq_write(seq, &dump, sizeof(dump));
+ return 0;
+}
+
+SEC("iter/io_uring_file")
+int dump_io_uring_file(struct bpf_iter__io_uring_file *ctx)
+{
+ struct seq_file *seq = ctx->meta->seq;
+ struct file *file = ctx->file;
+ struct io_uring_dump dump;
+ __u64 key;
+ int *fd;
+
+ __builtin_memset(&dump, 0, sizeof(dump));
+ key = (__u64)ctx->ctx;
+ fd = bpf_map_lookup_elem(&io_ring_ctx_map, &key);
+ if (!ctx->meta->seq_num)
+ dump_io_ring_ctx(seq, ctx->ctx, fd ? *fd : 0);
+ if (!file)
+ return 0;
+ dump.type = DUMP_REG_FD;
+ if (fd)
+ dump.io_uring_fd = *fd;
+ dump.desc.reg_fd.index = ctx->index;
+ key = (__u64)file;
+ fd = bpf_map_lookup_elem(&fdtable_map, &key);
+ if (fd)
+ dump.desc.reg_fd.reg_fd = *fd;
+ bpf_seq_write(seq, &dump, sizeof(dump));
+ return 0;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/bpf_cr.c b/samples/bpf/bpf_cr.c
new file mode 100644
index 000000000000..f5e0270af852
--- /dev/null
+++ b/samples/bpf/bpf_cr.c
@@ -0,0 +1,688 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * BPF C/R
+ *
+ * Tool to use BPF iterators to dump process state. This currently supports
+ * dumping io_uring fd state, by taking process PID and fd number pair, then
+ * dumping to stdout the state as binary struct, which can be passed to the
+ * tool consuming it, to recreate io_uring.
+ */
+
+#include <errno.h>
+#include <stdio.h>
+#include <assert.h>
+#include <stdint.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <bpf/bpf.h>
+#include <stdbool.h>
+#include <sys/uio.h>
+#include <bpf/libbpf.h>
+#include <sys/eventfd.h>
+#include <sys/syscall.h>
+#include <linux/io_uring.h>
+
+#include "bpf_cr.h"
+#include "bpf_cr.skel.h"
+
+/* Approx. 4096 / sizeof(struct io_uring_dump) (~40 bytes per record) */
+#define MAX_DESC 96
+size_t dump_desc_cnt;
+size_t reg_fd_cnt;
+size_t reg_buf_cnt;
+struct io_uring_dump *dump_desc[MAX_DESC];
+int fds[MAX_DESC];
+struct iovec bufs[MAX_DESC];
+
+static int sys_pidfd_open(pid_t pid, unsigned int flags)
+{
+ return syscall(__NR_pidfd_open, pid, flags);
+}
+
+static int sys_pidfd_getfd(int pidfd, int targetfd, unsigned int flags)
+{
+ return syscall(__NR_pidfd_getfd, pidfd, targetfd, flags);
+}
+
+static int sys_io_uring_setup(uint32_t entries, struct io_uring_params *p)
+{
+ return syscall(__NR_io_uring_setup, entries, p);
+}
+
+static int sys_io_uring_register(unsigned int fd, unsigned int opcode,
+ void *arg, unsigned int nr_args)
+{
+ return syscall(__NR_io_uring_register, fd, opcode, arg, nr_args);
+}
+
+static const char *type2str[__DUMP_MAX] = {
+ [DUMP_SETUP] = "DUMP_SETUP",
+ [DUMP_EVENTFD] = "DUMP_EVENTFD",
+ [DUMP_REG_FD] = "DUMP_REG_FD",
+ [DUMP_REG_BUF] = "DUMP_REG_BUF",
+};
+
+static int do_dump_parent(struct bpf_cr *skel, int parent_fd)
+{
+ DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
+ union bpf_iter_link_info linfo = {};
+ int ret = 0, buf_it, file_it;
+ struct bpf_link *lb, *lf;
+ char buf[4096];
+
+ linfo.io_uring.io_uring_fd = parent_fd;
+ opts.link_info = &linfo;
+ opts.link_info_len = sizeof(linfo);
+
+ lb = bpf_program__attach_iter(skel->progs.dump_io_uring_buf, &opts);
+ if (!lb) {
+ ret = -errno;
+ fprintf(stderr, "Failed to attach to io_uring_buf: %m\n");
+ return ret;
+ }
+
+ lf = bpf_program__attach_iter(skel->progs.dump_io_uring_file, &opts);
+ if (!lf) {
+ ret = -errno;
+ fprintf(stderr, "Failed to attach io_uring_file: %m\n");
+ goto end;
+ }
+
+ buf_it = bpf_iter_create(bpf_link__fd(lb));
+ if (buf_it < 0) {
+ ret = -errno;
+ fprintf(stderr, "Failed to create io_uring_buf: %m\n");
+ goto end_lf;
+ }
+
+ file_it = bpf_iter_create(bpf_link__fd(lf));
+ if (file_it < 0) {
+ ret = -errno;
+ fprintf(stderr, "Failed to create io_uring_file: %m\n");
+ goto end_buf_it;
+ }
+
+ ret = read(file_it, buf, sizeof(buf));
+ if (ret < 0) {
+ ret = -errno;
+ fprintf(stderr, "Failed to read from io_uring_file iterator: %m\n");
+ goto end_file_it;
+ }
+
+ ret = write(STDOUT_FILENO, buf, ret);
+ if (ret < 0) {
+ ret = -errno;
+ fprintf(stderr, "Failed to write to stdout: %m\n");
+ goto end_file_it;
+ }
+
+ ret = read(buf_it, buf, sizeof(buf));
+ if (ret < 0) {
+ ret = -errno;
+ fprintf(stderr, "Failed to read from io_uring_buf iterator: %m\n");
+ goto end_file_it;
+ }
+
+ ret = write(STDOUT_FILENO, buf, ret);
+ if (ret < 0) {
+ ret = -errno;
+ fprintf(stderr, "Failed to write to stdout: %m\n");
+ goto end_file_it;
+ }
+
+end_file_it:
+ close(file_it);
+end_buf_it:
+ close(buf_it);
+end_lf:
+ bpf_link__destroy(lf);
+end:
+ bpf_link__destroy(lb);
+ return ret;
+}
+
+static int do_dump(pid_t tpid, int tfd)
+{
+ int pidfd, ret = 0, buf_it, file_it, task_it;
+ DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
+ union bpf_iter_link_info linfo = {};
+ const struct io_uring_dump *d;
+ struct bpf_cr *skel;
+ char buf[4096];
+
+ pidfd = sys_pidfd_open(tpid, 0);
+ if (pidfd < 0) {
+ fprintf(stderr, "Failed to open pidfd for PID %d: %m\n", tpid);
+ return 1;
+ }
+
+ tfd = sys_pidfd_getfd(pidfd, tfd, 0);
+ if (tfd < 0) {
+ fprintf(stderr, "Failed to acquire io_uring fd from PID %d: %m\n", tpid);
+ ret = 1;
+ goto end;
+ }
+
+ skel = bpf_cr__open();
+ if (!skel) {
+ fprintf(stderr, "Failed to open BPF prog: %m\n");
+ ret = 1;
+ goto end_tfd;
+ }
+ skel->rodata->tgid = tpid;
+
+ ret = bpf_cr__load(skel);
+ if (ret < 0) {
+ fprintf(stderr, "Failed to load BPF prog: %m\n");
+ ret = 1;
+ goto end_skel;
+ }
+
+ skel->links.dump_task = bpf_program__attach_iter(skel->progs.dump_task, NULL);
+ if (!skel->links.dump_task) {
+ fprintf(stderr, "Failed to attach task_file iterator: %m\n");
+ ret = 1;
+ goto end_skel;
+ }
+
+ task_it = bpf_iter_create(bpf_link__fd(skel->links.dump_task));
+ if (task_it < 0) {
+ fprintf(stderr, "Failed to create task_file iterator: %m\n");
+ ret = 1;
+ goto end_skel;
+ }
+
+ /* Drive task iterator */
+ ret = read(task_it, buf, sizeof(buf));
+ close(task_it);
+ if (ret < 0) {
+ fprintf(stderr, "Failed to read from task_file iterator: %m\n");
+ ret = 1;
+ goto end_skel;
+ }
+
+ linfo.io_uring.io_uring_fd = tfd;
+ opts.link_info = &linfo;
+ opts.link_info_len = sizeof(linfo);
+ skel->links.dump_io_uring_buf = bpf_program__attach_iter(skel->progs.dump_io_uring_buf,
+ &opts);
+ if (!skel->links.dump_io_uring_buf) {
+ fprintf(stderr, "Failed to attach io_uring_buf iterator: %m\n");
+ ret = 1;
+ goto end_skel;
+ }
+ skel->links.dump_io_uring_file = bpf_program__attach_iter(skel->progs.dump_io_uring_file,
+ &opts);
+ if (!skel->links.dump_io_uring_file) {
+ fprintf(stderr, "Failed to attach io_uring_file iterator: %m\n");
+ ret = 1;
+ goto end_skel;
+ }
+
+ buf_it = bpf_iter_create(bpf_link__fd(skel->links.dump_io_uring_buf));
+ if (buf_it < 0) {
+ fprintf(stderr, "Failed to create io_uring_buf iterator: %m\n");
+ ret = 1;
+ goto end_skel;
+ }
+
+ file_it = bpf_iter_create(bpf_link__fd(skel->links.dump_io_uring_file));
+ if (file_it < 0) {
+ fprintf(stderr, "Failed to create io_uring_file iterator: %m\n");
+ ret = 1;
+ goto end_buf_it;
+ }
+
+ ret = read(file_it, buf, sizeof(buf));
+ if (ret < 0) {
+ fprintf(stderr, "Failed to read from io_uring_file iterator: %m\n");
+ ret = 1;
+ goto end_file_it;
+ }
+
+ /* Check if we have to dump its parent as well; the first descriptor
+ * will always be DUMP_SETUP. If so, recurse and dump the parent first.
+ */
+ d = (void *)buf;
+ if (ret >= sizeof(*d) && d->type == DUMP_SETUP && d->desc.setup.wq_fd) {
+ int r;
+
+ r = sys_pidfd_getfd(pidfd, d->desc.setup.wq_fd, 0);
+ if (r < 0) {
+ fprintf(stderr, "Failed to obtain parent io_uring: %m\n");
+ ret = 1;
+ goto end_file_it;
+ }
+ r = do_dump_parent(skel, r);
+ if (r < 0) {
+ ret = 1;
+ goto end_file_it;
+ }
+ }
+
+ ret = write(STDOUT_FILENO, buf, ret);
+ if (ret < 0) {
+ fprintf(stderr, "Failed to write to stdout: %m\n");
+ ret = 1;
+ goto end_file_it;
+ }
+
+ ret = read(buf_it, buf, sizeof(buf));
+ if (ret < 0) {
+ fprintf(stderr, "Failed to read from io_uring_buf iterator: %m\n");
+ ret = 1;
+ goto end_file_it;
+ }
+
+ ret = write(STDOUT_FILENO, buf, ret);
+ if (ret < 0) {
+ fprintf(stderr, "Failed to write to stdout: %m\n");
+ ret = 1;
+ goto end_file_it;
+ }
+
+end_file_it:
+ close(file_it);
+end_buf_it:
+ close(buf_it);
+end_skel:
+ bpf_cr__destroy(skel);
+end_tfd:
+ close(tfd);
+end:
+ close(pidfd);
+ return ret;
+}
+
+static int dump_desc_cmp(const void *a, const void *b)
+{
+ const struct io_uring_dump *da = a;
+ const struct io_uring_dump *db = b;
+ uint64_t dafd = da->io_uring_fd;
+ uint64_t dbfd = db->io_uring_fd;
+
+ if (dafd < dbfd)
+ return -1;
+ else if (dafd > dbfd)
+ return 1;
+ else if (da->type < db->type)
+ return -1;
+ else if (da->type > db->type)
+ return 1;
+ return 0;
+}
+
+static int do_restore_setup(const struct io_uring_dump *d)
+{
+ struct io_uring_params p;
+ int fd, nfd;
+
+ memset(&p, 0, sizeof(p));
+
+ p.flags = d->desc.setup.flags;
+ if (p.flags & IORING_SETUP_SQ_AFF)
+ p.sq_thread_cpu = d->desc.setup.sq_thread_cpu;
+ if (p.flags & IORING_SETUP_SQPOLL)
+ p.sq_thread_idle = d->desc.setup.sq_thread_idle;
+ if (p.flags & IORING_SETUP_ATTACH_WQ)
+ p.wq_fd = d->desc.setup.wq_fd;
+ if (p.flags & IORING_SETUP_CQSIZE)
+ p.cq_entries = d->desc.setup.cq_entries;
+
+ fd = sys_io_uring_setup(d->desc.setup.sq_entries, &p);
+ if (fd < 0) {
+ fprintf(stderr, "Failed to restore DUMP_SETUP desc: %m\n");
+ return -errno;
+ }
+
+ nfd = dup2(fd, d->io_uring_fd);
+ if (nfd < 0) {
+ fprintf(stderr, "Failed to dup io_uring_fd: %m\n");
+ close(fd);
+ return -errno;
+ }
+ return 0;
+}
+
+static int do_restore_eventfd(const struct io_uring_dump *d)
+{
+ int evfd, ret, opcode;
+
+ /* In CRIU this would require restoring the eventfd first, found by
+ * matching the eventfd_ctx while peeking into struct file guts from
+ * the task_file iterator. Here, we just reopen a normal eventfd and
+ * register it. The BPF program does have the eventfd matching code to
+ * report the fd number.
+ */
+ evfd = eventfd(42, 0);
+ if (evfd < 0) {
+ fprintf(stderr, "Failed to open eventfd: %m\n");
+ return -errno;
+ }
+
+ opcode = d->desc.eventfd.async ? IORING_REGISTER_EVENTFD_ASYNC : IORING_REGISTER_EVENTFD;
+ ret = sys_io_uring_register(d->io_uring_fd, opcode, &evfd, 1);
+ if (ret < 0) {
+ ret = -errno;
+ fprintf(stderr, "Failed to register eventfd: %m\n");
+ goto end;
+ }
+
+ ret = 0;
+end:
+ close(evfd);
+ return ret;
+}
+
+static void print_desc(const struct io_uring_dump *d)
+{
+ printf("%s:\n\tio_uring_fd: %d\n\tend: %s\n",
+ type2str[d->type % __DUMP_MAX], d->io_uring_fd, d->end ? "true" : "false");
+ switch (d->type) {
+ case DUMP_SETUP:
+ printf("\t\tflags: %u\n\t\tsq_entries: %u\n\t\tcq_entries: %u\n"
+ "\t\tsq_thread_cpu: %d\n\t\tsq_thread_idle: %d\n\t\twq_fd: %d\n",
+ d->desc.setup.flags, d->desc.setup.sq_entries,
+ d->desc.setup.cq_entries, d->desc.setup.sq_thread_cpu,
+ d->desc.setup.sq_thread_idle, d->desc.setup.wq_fd);
+ break;
+ case DUMP_EVENTFD:
+ printf("\t\teventfd: %d\n\t\tasync: %s\n",
+ d->desc.eventfd.eventfd,
+ d->desc.eventfd.async ? "true" : "false");
+ break;
+ case DUMP_REG_FD:
+ printf("\t\treg_fd: %d\n\t\tindex: %lu\n",
+ d->desc.reg_fd.reg_fd, d->desc.reg_fd.index);
+ break;
+ case DUMP_REG_BUF:
+ printf("\t\taddr: %lu\n\t\tlen: %lu\n\t\tindex: %lu\n",
+ d->desc.reg_buf.addr, d->desc.reg_buf.len,
+ d->desc.reg_buf.index);
+ break;
+ default:
+ printf("\t\t{Unknown}\n");
+ break;
+ }
+}
+
+static int do_restore_reg_fd(const struct io_uring_dump *d)
+{
+ int ret;
+
+ /* In CRIU, we restore the fds to be registered before executing the
+ * restore action that registers file descriptors to io_uring.
+ * Our example app would register stdin/stdout/stderr in a sparse
+ * table, so the test case in the commit works.
+ */
+ if (reg_fd_cnt == MAX_DESC || d->desc.reg_fd.index >= MAX_DESC) {
+ fprintf(stderr, "Exceeded max fds MAX_DESC (%d)\n", MAX_DESC);
+ return -EDOM;
+ }
+ assert(reg_fd_cnt <= d->desc.reg_fd.index);
+ /* Fill sparse entries */
+ while (reg_fd_cnt < d->desc.reg_fd.index)
+ fds[reg_fd_cnt++] = -1;
+ fds[reg_fd_cnt++] = d->desc.reg_fd.reg_fd;
+ if (d->end) {
+ ret = sys_io_uring_register(d->io_uring_fd,
+ IORING_REGISTER_FILES, &fds,
+ reg_fd_cnt);
+ if (ret < 0) {
+ fprintf(stderr, "Failed to register files: %m\n");
+ return -errno;
+ }
+ }
+ return 0;
+}
+
+static int do_restore_reg_buf(const struct io_uring_dump *d)
+{
+ struct iovec *iov;
+ int ret;
+
+ /* This step must be executed with care in CRIU for buffers whose
+ * source buffers are still intact. There are primarily three cases
+ * (each with corner cases excluded for brevity):
+ * 1. Source VMA is intact ([ubuf->ubuf, ubuf->ubuf_end) is in the VMA,
+ * base page PFN is the same)
+ * 2. Source VMA is split (with multiple pages of ubuf overlaying
+ * holes) using munmap(s).
+ * 3. Source VMA is absent (no VMA, or full VMA with incorrect PFN).
+ *
+ * The PFN remains unique while pages are pinned, hence a page with the
+ * same PFN will not be recycled into another mapping by the page
+ * allocator. Cases 2 and 3 require dumping page contents.
+ *
+ * A VMA with holes (registered before punching the holes) also needs
+ * partial page content dumping to restore it without holes, and then
+ * the holes are punched again. This case can be detected when the
+ * buffer touches two VMAs with holes and the base page PFN matches
+ * (the split VMA case).
+ *
+ * All of this is too complicated to demonstrate here, and is done in
+ * userspace, hence left out. Future patches will implement the
+ * page-dumping side of the ubuf iterator.
+ *
+ * In the usual case we might be able to dump page contents from inside
+ * the io_uring being dumped, by submitting operations, but we want to
+ * avoid manipulating the ring while dumping, and the opcodes we would
+ * need for that may be restricted, hence preventing the dump.
+ */
+ if (reg_buf_cnt == MAX_DESC) {
+ fprintf(stderr, "Exceeded max buffers MAX_DESC (%d)\n", MAX_DESC);
+ return -EDOM;
+ }
+ assert(d->desc.reg_buf.index == reg_buf_cnt);
+ iov = &bufs[reg_buf_cnt++];
+ iov->iov_base = (void *)d->desc.reg_buf.addr;
+ iov->iov_len = d->desc.reg_buf.len;
+ if (d->end) {
+ if (reg_fd_cnt) {
+ ret = sys_io_uring_register(d->io_uring_fd,
+ IORING_REGISTER_FILES, &fds,
+ reg_fd_cnt);
+ if (ret < 0) {
+ fprintf(stderr, "Failed to register files: %m\n");
+ return -errno;
+ }
+ }
+
+ ret = sys_io_uring_register(d->io_uring_fd,
+ IORING_REGISTER_BUFFERS, &bufs,
+ reg_buf_cnt);
+ if (ret < 0) {
+ fprintf(stderr, "Failed to register buffers: %m\n");
+ return -errno;
+ }
+ }
+ return 0;
+}
+
+static int do_restore_action(const struct io_uring_dump *d, bool dry_run)
+{
+ int ret;
+
+ print_desc(d);
+
+ if (dry_run)
+ return 0;
+
+ switch (d->type) {
+ case DUMP_SETUP:
+ ret = do_restore_setup(d);
+ break;
+ case DUMP_EVENTFD:
+ ret = do_restore_eventfd(d);
+ break;
+ case DUMP_REG_FD:
+ ret = do_restore_reg_fd(d);
+ break;
+ case DUMP_REG_BUF:
+ ret = do_restore_reg_buf(d);
+ break;
+ default:
+ fprintf(stderr, "Unknown dump descriptor\n");
+ return -EDOM;
+ }
+ return ret;
+}
+
+static int do_restore(bool dry_run)
+{
+ struct io_uring_dump dump;
+ int ret, prev_fd = 0;
+
+ while ((ret = read(STDIN_FILENO, &dump, sizeof(dump)))) {
+ struct io_uring_dump *d;
+
+ if (ret < 0) {
+ fprintf(stderr, "Failed to read descriptor: %m\n");
+ ret = 1;
+ goto free;
+ }
+
+ ret = 1;
+ if (dump_desc_cnt == MAX_DESC) {
+ fprintf(stderr, "Cannot process more than MAX_DESC (%d) dump descs\n",
+ MAX_DESC);
+ goto free;
+ }
+
+ d = calloc(1, sizeof(*d));
+ if (!d) {
+ fprintf(stderr, "Failed to allocate dump descriptor: %m\n");
+ goto free;
+ }
+
+ *d = dump;
+ if (!prev_fd)
+ prev_fd = d->io_uring_fd;
+ if (prev_fd != d->io_uring_fd) {
+ dump_desc[dump_desc_cnt - 1]->end = true;
+ prev_fd = d->io_uring_fd;
+ }
+ dump_desc[dump_desc_cnt++] = d;
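+ /* Keep the array sorted by (io_uring_fd, type) as records arrive, so
+ * restore actions for each ring run in the declaration order of
+ * enum io_uring_state_type (DUMP_SETUP first); see bpf_cr.h.
+ */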
+ qsort(dump_desc, dump_desc_cnt, sizeof(dump_desc[0]), dump_desc_cmp);
+ }
+ if (dump_desc_cnt)
+ dump_desc[dump_desc_cnt - 1]->end = true;
+
+ for (size_t i = 0; i < dump_desc_cnt; i++) {
+ ret = do_restore_action(dump_desc[i], dry_run);
+ if (ret < 0) {
+ fprintf(stderr, "Failed to execute restore action\n");
+ goto free;
+ }
+ }
+
+ if (!dry_run && dump_desc_cnt)
+ sleep(10000);
+ else
+ puts("Nothing to do, exiting...");
+ ret = 0;
+free:
+ while (dump_desc_cnt--)
+ free(dump_desc[dump_desc_cnt]);
+ return ret;
+}
+
+static int run_app(void)
+{
+ struct io_uring_params p;
+ int r, ret, fd, evfd;
+
+ memset(&p, 0, sizeof(p));
+ p.flags |= IORING_SETUP_CQSIZE | IORING_SETUP_SQPOLL | IORING_SETUP_SQ_AFF;
+ p.sq_thread_idle = 1500;
+ p.cq_entries = 4;
+ /* Create a test case with parent io_uring, dependent io_uring,
+ * registered files, eventfd (async), buffers, etc.
+ */
+ fd = sys_io_uring_setup(2, &p);
+ if (fd < 0) {
+ fprintf(stderr, "Failed to create io_uring: %m\n");
+ return 1;
+ }
+
+ r = 1;
+ printf("PID: %d, Parent io_uring: %d, ", getpid(), fd);
+ p.flags |= IORING_SETUP_ATTACH_WQ;
+ p.wq_fd = fd;
+
+ fd = sys_io_uring_setup(2, &p);
+ if (fd < 0) {
+ fprintf(stderr, "\nFailed to create io_uring: %m\n");
+ goto end_wq_fd;
+ }
+
+ printf("Dependent io_uring: %d\n", fd);
+
+ evfd = eventfd(42, 0);
+ if (evfd < 0) {
+ fprintf(stderr, "Failed to create eventfd: %m\n");
+ goto end_fd;
+ }
+
+ ret = sys_io_uring_register(fd, IORING_REGISTER_EVENTFD_ASYNC, &evfd, 1);
+ if (ret < 0) {
+ fprintf(stderr, "Failed to register eventfd (async): %m\n");
+ goto end_evfd;
+ }
+
+ ret = sys_io_uring_register(fd, IORING_REGISTER_FILES, &(int []){0, -1, 1, -1, 2}, 5);
+ if (ret < 0) {
+ fprintf(stderr, "Failed to register files: %m\n");
+ goto end_evfd;
+ }
+
+ /* Register dummy buf as well */
+ ret = sys_io_uring_register(fd, IORING_REGISTER_BUFFERS, &(struct iovec[]){{}, {&p, sizeof(p)}}, 2);
+ if (ret < 0) {
+ fprintf(stderr, "Failed to register buffers: %m\n");
+ goto end_evfd;
+ }
+
+ pause();
+
+ r = 0;
+end_evfd:
+ close(evfd);
+end_fd:
+ close(fd);
+end_wq_fd:
+ close(p.wq_fd);
+ return r;
+}
+
+int main(int argc, char *argv[])
+{
+ if (argc < 2 || argc > 4) {
+usage:
+ fprintf(stderr, "Usage: %s dump PID FD > dump.out\n"
+ "\tcat dump.out | %s restore [--dry-run]\n"
+ "\t%s app\n", argv[0], argv[0], argv[0]);
+ return 1;
+ }
+
+ if (libbpf_set_strict_mode(LIBBPF_STRICT_ALL)) {
+ fprintf(stderr, "Failed to set libbpf strict mode\n");
+ return 1;
+ }
+
+ if (!strcmp(argv[1], "app")) {
+ return run_app();
+ } else if (!strcmp(argv[1], "dump")) {
+ if (argc != 4)
+ goto usage;
+ return do_dump(atoi(argv[2]), atoi(argv[3]));
+ } else if (!strcmp(argv[1], "restore")) {
+ if (argc < 2 || argc > 3)
+ goto usage;
+ if (argc == 3 && strcmp(argv[2], "--dry-run"))
+ goto usage;
+ return do_restore(argc == 3 /* dry_run mode */);
+ }
+ fprintf(stderr, "Unknown argument\n");
+ goto usage;
+}
diff --git a/samples/bpf/bpf_cr.h b/samples/bpf/bpf_cr.h
new file mode 100644
index 000000000000..74d4ca639db5
--- /dev/null
+++ b/samples/bpf/bpf_cr.h
@@ -0,0 +1,48 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#ifndef BPF_CR_H
+#define BPF_CR_H
+
+/* The order of restore actions is in order of declaration for each type,
+ * hence on restore consumed descriptors can be sorted based on their type,
+ * and then each action for the corresponding descriptor can be invoked, to
+ * recreate the io_uring.
+ */
+enum io_uring_state_type {
+ DUMP_SETUP, /* Record setup parameters */
+ DUMP_EVENTFD, /* eventfd registered in io_uring */
+ DUMP_REG_FD, /* fd registered in io_uring */
+ DUMP_REG_BUF, /* buffer registered in io_uring */
+ __DUMP_MAX,
+};
+
+struct io_uring_dump {
+ enum io_uring_state_type type;
+ int32_t io_uring_fd;
+ bool end;
+ union {
+ struct /* DUMP_SETUP */ {
+ uint32_t flags;
+ uint32_t sq_entries;
+ uint32_t cq_entries;
+ int32_t sq_thread_cpu;
+ int32_t sq_thread_idle;
+ uint32_t wq_fd;
+ } setup;
+ struct /* DUMP_EVENTFD */ {
+ uint32_t eventfd;
+ bool async;
+ } eventfd;
+ struct /* DUMP_REG_FD */ {
+ uint32_t reg_fd;
+ uint64_t index;
+ } reg_fd;
+ struct /* DUMP_REG_BUF */ {
+ uint64_t addr;
+ uint64_t len;
+ uint64_t index;
+ } reg_buf;
+ } desc;
+};
+
+#endif
diff --git a/samples/bpf/hbm_kern.h b/samples/bpf/hbm_kern.h
index 722b3fadb467..1752a46a2b05 100644
--- a/samples/bpf/hbm_kern.h
+++ b/samples/bpf/hbm_kern.h
@@ -9,8 +9,6 @@
* Include file for sample Host Bandwidth Manager (HBM) BPF programs
*/
#define KBUILD_MODNAME "foo"
-#include <stddef.h>
-#include <stdbool.h>
#include <uapi/linux/bpf.h>
#include <uapi/linux/if_ether.h>
#include <uapi/linux/if_packet.h>
--
2.34.1