* [RFC 1/3] bpf/io_uring: add io_uring program type
2024-11-11 1:50 [RFC 0/3] Add BPF for io_uring Pavel Begunkov
@ 2024-11-11 1:50 ` Pavel Begunkov
2024-11-11 1:50 ` [RFC 2/3] io_uring/bpf: allow to register and run BPF programs Pavel Begunkov
` (2 subsequent siblings)
3 siblings, 0 replies; 8+ messages in thread
From: Pavel Begunkov @ 2024-11-11 1:50 UTC (permalink / raw)
To: io-uring; +Cc: asml.silence
Add a new BPF program type and a bare minimum implementation that would be
responsible for orchestrating in-kernel request handling in the io_uring
waiting loop. The program is supposed to replace the logic that terminates
the traditional waiting loop based on parameters like the number of
completion events to wait for, and it returns one of the IOU_BPF_RET_*
return codes telling the kernel whether it should return to user space or
continue waiting.
At the moment there is no way to attach it anywhere, and the program is
pretty useless and doesn't yet know how to interact with io_uring.
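For illustration, the BPF side of such a program is about as trivial as the
sketch below (untested; the SEC() name is a placeholder as there is no libbpf
section mapping yet, so a loader would set BPF_PROG_TYPE_IOURING and
BPF_F_SLEEPABLE explicitly, e.g. via bpf_program__set_type() and
bpf_program__set_flags()):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "GPL";

SEC("iouring.s")	/* placeholder section name */
int iou_noop(struct io_uring_bpf_ctx *ctx)
{
	/* nothing to inspect yet, hand control straight back to user space */
	return IOU_BPF_RET_STOP;
}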
Signed-off-by: Pavel Begunkov <[email protected]>
---
include/linux/bpf.h | 1 +
include/linux/bpf_types.h | 4 ++++
include/linux/io_uring/bpf.h | 10 ++++++++++
include/uapi/linux/bpf.h | 1 +
include/uapi/linux/io_uring/bpf.h | 22 ++++++++++++++++++++++
io_uring/Makefile | 1 +
io_uring/bpf.c | 24 ++++++++++++++++++++++++
kernel/bpf/btf.c | 3 +++
kernel/bpf/syscall.c | 1 +
kernel/bpf/verifier.c | 10 +++++++++-
10 files changed, 76 insertions(+), 1 deletion(-)
create mode 100644 include/linux/io_uring/bpf.h
create mode 100644 include/uapi/linux/io_uring/bpf.h
create mode 100644 io_uring/bpf.c
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 19d8ca8ac960..bccd99dd58c4 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -30,6 +30,7 @@
#include <linux/static_call.h>
#include <linux/memcontrol.h>
#include <linux/cfi.h>
+#include <linux/io_uring/bpf.h>
struct bpf_verifier_env;
struct bpf_verifier_log;
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 9f2a6b83b49e..24293e1ee0b1 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -83,6 +83,10 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_SYSCALL, bpf_syscall,
BPF_PROG_TYPE(BPF_PROG_TYPE_NETFILTER, netfilter,
struct bpf_nf_ctx, struct bpf_nf_ctx)
#endif
+#ifdef CONFIG_IO_URING
+BPF_PROG_TYPE(BPF_PROG_TYPE_IOURING, bpf_io_uring,
+ struct io_uring_bpf_ctx, struct io_bpf_ctx_kern)
+#endif
BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY, array_map_ops)
BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_ARRAY, percpu_array_map_ops)
diff --git a/include/linux/io_uring/bpf.h b/include/linux/io_uring/bpf.h
new file mode 100644
index 000000000000..b700a4b65111
--- /dev/null
+++ b/include/linux/io_uring/bpf.h
@@ -0,0 +1,10 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+#ifndef _LINUX_IO_URING_BPF_H
+#define _LINUX_IO_URING_BPF_H
+
+#include <uapi/linux/io_uring/bpf.h>
+
+struct io_bpf_ctx_kern {
+};
+
+#endif
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index e8241b320c6d..1945430d31a6 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1055,6 +1055,7 @@ enum bpf_prog_type {
BPF_PROG_TYPE_SK_LOOKUP,
BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */
BPF_PROG_TYPE_NETFILTER,
+ BPF_PROG_TYPE_IOURING,
__MAX_BPF_PROG_TYPE
};
diff --git a/include/uapi/linux/io_uring/bpf.h b/include/uapi/linux/io_uring/bpf.h
new file mode 100644
index 000000000000..da749fe7251c
--- /dev/null
+++ b/include/uapi/linux/io_uring/bpf.h
@@ -0,0 +1,22 @@
+/* SPDX-License-Identifier: (GPL-2.0 WITH Linux-syscall-note) OR MIT */
+/*
+ * Header file for the io_uring bpf interface.
+ *
+ * Copyright (C) 2024 Pavel Begunkov
+ */
+#ifndef LINUX_IO_URING_BPF_H
+#define LINUX_IO_URING_BPF_H
+
+#include <linux/types.h>
+
+enum {
+ IOU_BPF_RET_OK,
+ IOU_BPF_RET_STOP,
+
+ __IOU_BPF_RET_MAX,
+};
+
+struct io_uring_bpf_ctx {
+};
+
+#endif
diff --git a/io_uring/Makefile b/io_uring/Makefile
index 53167bef37d7..5da66ecc98e5 100644
--- a/io_uring/Makefile
+++ b/io_uring/Makefile
@@ -17,3 +17,4 @@ obj-$(CONFIG_IO_URING) += io_uring.o opdef.o kbuf.o rsrc.o notif.o \
obj-$(CONFIG_IO_WQ) += io-wq.o
obj-$(CONFIG_FUTEX) += futex.o
obj-$(CONFIG_NET_RX_BUSY_POLL) += napi.o
+obj-$(CONFIG_BPF) += bpf.o
diff --git a/io_uring/bpf.c b/io_uring/bpf.c
new file mode 100644
index 000000000000..6eb0c47b4aa9
--- /dev/null
+++ b/io_uring/bpf.c
@@ -0,0 +1,24 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/bpf.h>
+
+static const struct bpf_func_proto *
+io_bpf_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+ return bpf_base_func_proto(func_id, prog);
+}
+
+static bool io_bpf_is_valid_access(int off, int size,
+ enum bpf_access_type type,
+ const struct bpf_prog *prog,
+ struct bpf_insn_access_aux *info)
+{
+ return false;
+}
+
+const struct bpf_prog_ops bpf_io_uring_prog_ops = {};
+
+const struct bpf_verifier_ops bpf_io_uring_verifier_ops = {
+ .get_func_proto = io_bpf_func_proto,
+ .is_valid_access = io_bpf_is_valid_access,
+};
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 5cd1c7a23848..e102ee7c530a 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -219,6 +219,7 @@ enum btf_kfunc_hook {
BTF_KFUNC_HOOK_LWT,
BTF_KFUNC_HOOK_NETFILTER,
BTF_KFUNC_HOOK_KPROBE,
+ BTF_KFUNC_HOOK_IOURING,
BTF_KFUNC_HOOK_MAX,
};
@@ -8393,6 +8394,8 @@ static int bpf_prog_type_to_kfunc_hook(enum bpf_prog_type prog_type)
return BTF_KFUNC_HOOK_NETFILTER;
case BPF_PROG_TYPE_KPROBE:
return BTF_KFUNC_HOOK_KPROBE;
+ case BPF_PROG_TYPE_IOURING:
+ return BTF_KFUNC_HOOK_IOURING;
default:
return BTF_KFUNC_HOOK_MAX;
}
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 8cfa7183d2ef..5587ede39ae2 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2571,6 +2571,7 @@ bpf_prog_load_check_attach(enum bpf_prog_type prog_type,
return -EINVAL;
case BPF_PROG_TYPE_SYSCALL:
case BPF_PROG_TYPE_EXT:
+ case BPF_PROG_TYPE_IOURING:
if (expected_attach_type)
return -EINVAL;
fallthrough;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 411ab1b57af4..14de335ba66b 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -15946,6 +15946,9 @@ static int check_return_code(struct bpf_verifier_env *env, int regno, const char
case BPF_PROG_TYPE_NETFILTER:
range = retval_range(NF_DROP, NF_ACCEPT);
break;
+ case BPF_PROG_TYPE_IOURING:
+ range = retval_range(IOU_BPF_RET_OK, __IOU_BPF_RET_MAX - 1);
+ break;
case BPF_PROG_TYPE_EXT:
/* freplace program can return anything as its return value
* depends on the to-be-replaced kernel func or bpf program.
@@ -22209,7 +22212,8 @@ static bool can_be_sleepable(struct bpf_prog *prog)
}
return prog->type == BPF_PROG_TYPE_LSM ||
prog->type == BPF_PROG_TYPE_KPROBE /* only for uprobes */ ||
- prog->type == BPF_PROG_TYPE_STRUCT_OPS;
+ prog->type == BPF_PROG_TYPE_STRUCT_OPS ||
+ prog->type == BPF_PROG_TYPE_IOURING;
}
static int check_attach_btf_id(struct bpf_verifier_env *env)
@@ -22229,6 +22233,10 @@ static int check_attach_btf_id(struct bpf_verifier_env *env)
verbose(env, "Syscall programs can only be sleepable\n");
return -EINVAL;
}
+ if (prog->type == BPF_PROG_TYPE_IOURING && !prog->sleepable) {
+ verbose(env, "io_uring programs can only be sleepable\n");
+ return -EINVAL;
+ }
if (prog->sleepable && !can_be_sleepable(prog)) {
verbose(env, "Only fentry/fexit/fmod_ret, lsm, iter, uprobe, and struct_ops programs can be sleepable\n");
--
2.46.0
^ permalink raw reply related [flat|nested] 8+ messages in thread
* [RFC 2/3] io_uring/bpf: allow to register and run BPF programs
2024-11-11 1:50 [RFC 0/3] Add BPF for io_uring Pavel Begunkov
2024-11-11 1:50 ` [RFC 1/3] bpf/io_uring: add io_uring program type Pavel Begunkov
@ 2024-11-11 1:50 ` Pavel Begunkov
2024-11-13 8:21 ` Ming Lei
2024-11-11 1:50 ` [RFC 3/3] io_uring/bpf: add kfuncs for " Pavel Begunkov
2024-11-13 8:13 ` [RFC 0/3] Add BPF for io_uring Ming Lei
3 siblings, 1 reply; 8+ messages in thread
From: Pavel Begunkov @ 2024-11-11 1:50 UTC (permalink / raw)
To: io-uring; +Cc: asml.silence
Let the user register a BPF_PROG_TYPE_IOURING BPF program with a ring.
The program will be run in the waiting loop every time something
happens, i.e. the task was woken up by a task_work / signal / etc.
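Registration boils down to something like the sketch below (untested; it
uses the raw register syscall since liburing has no helper for the new
opcode, and prog_fd is the fd of an already loaded BPF_PROG_TYPE_IOURING
program):

#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/io_uring.h>

static int ring_register_bpf(int ring_fd, int prog_fd)
{
	struct io_uring_bpf_reg reg;

	memset(&reg, 0, sizeof(reg));
	reg.prog_fd = prog_fd;
	/* flags/resv must be zero, nr_args must be 1 */
	return syscall(__NR_io_uring_register, ring_fd,
		       IORING_REGISTER_BPF, &reg, 1);
}

The ring has to be created with IORING_SETUP_DEFER_TASKRUN, otherwise
registration fails with -EOPNOTSUPP.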
Signed-off-by: Pavel Begunkov <[email protected]>
---
include/linux/io_uring_types.h | 4 +++
include/uapi/linux/io_uring.h | 9 +++++
io_uring/bpf.c | 63 ++++++++++++++++++++++++++++++++++
io_uring/bpf.h | 41 ++++++++++++++++++++++
io_uring/io_uring.c | 15 ++++++++
io_uring/register.c | 7 ++++
6 files changed, 139 insertions(+)
create mode 100644 io_uring/bpf.h
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index ad5001102c86..50cee0d3622e 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -8,6 +8,8 @@
#include <linux/llist.h>
#include <uapi/linux/io_uring.h>
+struct io_bpf_ctx;
+
enum {
/*
* A hint to not wake right away but delay until there are enough of
@@ -246,6 +248,8 @@ struct io_ring_ctx {
enum task_work_notify_mode notify_method;
unsigned sq_thread_idle;
+
+ struct io_bpf_ctx *bpf_ctx;
} ____cacheline_aligned_in_smp;
/* submission data */
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index ba373deb8406..f2c2fefc8514 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -634,6 +634,8 @@ enum io_uring_register_op {
/* register fixed io_uring_reg_wait arguments */
IORING_REGISTER_CQWAIT_REG = 34,
+ IORING_REGISTER_BPF = 35,
+
/* this goes last */
IORING_REGISTER_LAST,
@@ -905,6 +907,13 @@ enum io_uring_socket_op {
SOCKET_URING_OP_SETSOCKOPT,
};
+struct io_uring_bpf_reg {
+ __u64 prog_fd;
+ __u32 flags;
+ __u32 resv1;
+ __u64 resv2[2];
+};
+
#ifdef __cplusplus
}
#endif
diff --git a/io_uring/bpf.c b/io_uring/bpf.c
index 6eb0c47b4aa9..8b7c74761c63 100644
--- a/io_uring/bpf.c
+++ b/io_uring/bpf.c
@@ -1,6 +1,9 @@
// SPDX-License-Identifier: GPL-2.0
#include <linux/bpf.h>
+#include <linux/filter.h>
+
+#include "bpf.h"
static const struct bpf_func_proto *
io_bpf_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
@@ -22,3 +25,63 @@ const struct bpf_verifier_ops bpf_io_uring_verifier_ops = {
.get_func_proto = io_bpf_func_proto,
.is_valid_access = io_bpf_is_valid_access,
};
+
+int io_run_bpf(struct io_ring_ctx *ctx)
+{
+ struct io_bpf_ctx *bc = ctx->bpf_ctx;
+ int ret;
+
+ mutex_lock(&ctx->uring_lock);
+ ret = bpf_prog_run_pin_on_cpu(bc->prog, bc);
+ mutex_unlock(&ctx->uring_lock);
+ return ret;
+}
+
+int io_unregister_bpf(struct io_ring_ctx *ctx)
+{
+ struct io_bpf_ctx *bc = ctx->bpf_ctx;
+
+ if (!bc)
+ return -ENXIO;
+ bpf_prog_put(bc->prog);
+ kfree(bc);
+ ctx->bpf_ctx = NULL;
+ return 0;
+}
+
+int io_register_bpf(struct io_ring_ctx *ctx, void __user *arg,
+ unsigned int nr_args)
+{
+ struct io_uring_bpf_reg __user *bpf_reg_usr = arg;
+ struct io_uring_bpf_reg bpf_reg;
+ struct io_bpf_ctx *bc;
+ struct bpf_prog *prog;
+
+ if (!(ctx->flags & IORING_SETUP_DEFER_TASKRUN))
+ return -EOPNOTSUPP;
+
+ if (nr_args != 1)
+ return -EINVAL;
+ if (copy_from_user(&bpf_reg, bpf_reg_usr, sizeof(bpf_reg)))
+ return -EFAULT;
+ if (bpf_reg.flags || bpf_reg.resv1 ||
+ bpf_reg.resv2[0] || bpf_reg.resv2[1])
+ return -EINVAL;
+
+ if (ctx->bpf_ctx)
+ return -ENXIO;
+
+ bc = kzalloc(sizeof(*bc), GFP_KERNEL);
+ if (!bc)
+ return -ENOMEM;
+
+ prog = bpf_prog_get_type(bpf_reg.prog_fd, BPF_PROG_TYPE_IOURING);
+ if (IS_ERR(prog)) {
+ kfree(bc);
+ return PTR_ERR(prog);
+ }
+
+ bc->prog = prog;
+ ctx->bpf_ctx = bc;
+ return 0;
+}
diff --git a/io_uring/bpf.h b/io_uring/bpf.h
new file mode 100644
index 000000000000..2b4e555ff07a
--- /dev/null
+++ b/io_uring/bpf.h
@@ -0,0 +1,41 @@
+// SPDX-License-Identifier: GPL-2.0
+#ifndef IOU_BPF_H
+#define IOU_BPF_H
+
+#include <linux/io_uring/bpf.h>
+#include <linux/io_uring_types.h>
+
+struct bpf_prog;
+
+struct io_bpf_ctx {
+ struct io_bpf_ctx_kern kern;
+ struct bpf_prog *prog;
+};
+
+static inline bool io_bpf_enabled(struct io_ring_ctx *ctx)
+{
+ return IS_ENABLED(CONFIG_BPF) && ctx->bpf_ctx != NULL;
+}
+
+#ifdef CONFIG_BPF
+int io_register_bpf(struct io_ring_ctx *ctx, void __user *arg,
+ unsigned int nr_args);
+int io_unregister_bpf(struct io_ring_ctx *ctx);
+int io_run_bpf(struct io_ring_ctx *ctx);
+
+#else
+static inline int io_register_bpf(struct io_ring_ctx *ctx, void __user *arg,
+ unsigned int nr_args)
+{
+ return -EOPNOTSUPP;
+}
+static inline int io_unregister_bpf(struct io_ring_ctx *ctx)
+{
+ return -EOPNOTSUPP;
+}
+static inline int io_run_bpf(struct io_ring_ctx *ctx)
+{
+}
+#endif
+
+#endif
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index f34fa1ead2cf..82599e2a888a 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -104,6 +104,7 @@
#include "rw.h"
#include "alloc_cache.h"
#include "eventfd.h"
+#include "bpf.h"
#define SQE_COMMON_FLAGS (IOSQE_FIXED_FILE | IOSQE_IO_LINK | \
IOSQE_IO_HARDLINK | IOSQE_ASYNC)
@@ -2834,6 +2835,12 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, u32 flags,
io_napi_busy_loop(ctx, &iowq);
+ if (io_bpf_enabled(ctx)) {
+ ret = io_run_bpf(ctx);
+ if (ret == IOU_BPF_RET_STOP)
+ return 0;
+ }
+
trace_io_uring_cqring_wait(ctx, min_events);
do {
unsigned long check_cq;
@@ -2879,6 +2886,13 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, u32 flags,
if (ret < 0)
break;
+ if (io_bpf_enabled(ctx)) {
+ ret = io_run_bpf(ctx);
+ if (ret == IOU_BPF_RET_STOP)
+ break;
+ continue;
+ }
+
check_cq = READ_ONCE(ctx->check_cq);
if (unlikely(check_cq)) {
/* let the caller flush overflows, retry */
@@ -3009,6 +3023,7 @@ static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
io_futex_cache_free(ctx);
io_destroy_buffers(ctx);
io_unregister_cqwait_reg(ctx);
+ io_unregister_bpf(ctx);
mutex_unlock(&ctx->uring_lock);
if (ctx->sq_creds)
put_cred(ctx->sq_creds);
diff --git a/io_uring/register.c b/io_uring/register.c
index 45edfc57963a..2a8efeacf2db 100644
--- a/io_uring/register.c
+++ b/io_uring/register.c
@@ -30,6 +30,7 @@
#include "eventfd.h"
#include "msg_ring.h"
#include "memmap.h"
+#include "bpf.h"
#define IORING_MAX_RESTRICTIONS (IORING_RESTRICTION_LAST + \
IORING_REGISTER_LAST + IORING_OP_LAST)
@@ -846,6 +847,12 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
break;
ret = io_register_cqwait_reg(ctx, arg);
break;
+ case IORING_REGISTER_BPF:
+ ret = -EINVAL;
+ if (!arg)
+ break;
+ ret = io_register_bpf(ctx, arg, nr_args);
+ break;
default:
ret = -EINVAL;
break;
--
2.46.0
^ permalink raw reply related [flat|nested] 8+ messages in thread
* Re: [RFC 2/3] io_uring/bpf: allow to register and run BPF programs
2024-11-11 1:50 ` [RFC 2/3] io_uring/bpf: allow to register and run BPF programs Pavel Begunkov
@ 2024-11-13 8:21 ` Ming Lei
2024-11-13 13:09 ` Pavel Begunkov
0 siblings, 1 reply; 8+ messages in thread
From: Ming Lei @ 2024-11-13 8:21 UTC (permalink / raw)
To: Pavel Begunkov; +Cc: io-uring, ming.lei
On Mon, Nov 11, 2024 at 01:50:45AM +0000, Pavel Begunkov wrote:
> Let the user register a BPF_PROG_TYPE_IOURING BPF program with a ring.
> The program will be run in the waiting loop every time something
> happens, i.e. the task was woken up by a task_work / signal / etc.
>
> Signed-off-by: Pavel Begunkov <[email protected]>
> ---
> include/linux/io_uring_types.h | 4 +++
> include/uapi/linux/io_uring.h | 9 +++++
> io_uring/bpf.c | 63 ++++++++++++++++++++++++++++++++++
> io_uring/bpf.h | 41 ++++++++++++++++++++++
> io_uring/io_uring.c | 15 ++++++++
> io_uring/register.c | 7 ++++
> 6 files changed, 139 insertions(+)
> create mode 100644 io_uring/bpf.h
>
> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
> index ad5001102c86..50cee0d3622e 100644
> --- a/include/linux/io_uring_types.h
> +++ b/include/linux/io_uring_types.h
> @@ -8,6 +8,8 @@
> #include <linux/llist.h>
> #include <uapi/linux/io_uring.h>
>
> +struct io_bpf_ctx;
> +
> enum {
> /*
> * A hint to not wake right away but delay until there are enough of
> @@ -246,6 +248,8 @@ struct io_ring_ctx {
>
> enum task_work_notify_mode notify_method;
> unsigned sq_thread_idle;
> +
> + struct io_bpf_ctx *bpf_ctx;
> } ____cacheline_aligned_in_smp;
>
> /* submission data */
> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> index ba373deb8406..f2c2fefc8514 100644
> --- a/include/uapi/linux/io_uring.h
> +++ b/include/uapi/linux/io_uring.h
> @@ -634,6 +634,8 @@ enum io_uring_register_op {
> /* register fixed io_uring_reg_wait arguments */
> IORING_REGISTER_CQWAIT_REG = 34,
>
> + IORING_REGISTER_BPF = 35,
> +
> /* this goes last */
> IORING_REGISTER_LAST,
>
> @@ -905,6 +907,13 @@ enum io_uring_socket_op {
> SOCKET_URING_OP_SETSOCKOPT,
> };
>
> +struct io_uring_bpf_reg {
> + __u64 prog_fd;
> + __u32 flags;
> + __u32 resv1;
> + __u64 resv2[2];
> +};
> +
> #ifdef __cplusplus
> }
> #endif
> diff --git a/io_uring/bpf.c b/io_uring/bpf.c
> index 6eb0c47b4aa9..8b7c74761c63 100644
> --- a/io_uring/bpf.c
> +++ b/io_uring/bpf.c
> @@ -1,6 +1,9 @@
> // SPDX-License-Identifier: GPL-2.0
>
> #include <linux/bpf.h>
> +#include <linux/filter.h>
> +
> +#include "bpf.h"
>
> static const struct bpf_func_proto *
> io_bpf_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> @@ -22,3 +25,63 @@ const struct bpf_verifier_ops bpf_io_uring_verifier_ops = {
> .get_func_proto = io_bpf_func_proto,
> .is_valid_access = io_bpf_is_valid_access,
> };
> +
> +int io_run_bpf(struct io_ring_ctx *ctx)
> +{
> + struct io_bpf_ctx *bc = ctx->bpf_ctx;
> + int ret;
> +
> + mutex_lock(&ctx->uring_lock);
> + ret = bpf_prog_run_pin_on_cpu(bc->prog, bc);
> + mutex_unlock(&ctx->uring_lock);
> + return ret;
> +}
> +
> +int io_unregister_bpf(struct io_ring_ctx *ctx)
> +{
> + struct io_bpf_ctx *bc = ctx->bpf_ctx;
> +
> + if (!bc)
> + return -ENXIO;
> + bpf_prog_put(bc->prog);
> + kfree(bc);
> + ctx->bpf_ctx = NULL;
> + return 0;
> +}
> +
> +int io_register_bpf(struct io_ring_ctx *ctx, void __user *arg,
> + unsigned int nr_args)
> +{
> + struct io_uring_bpf_reg __user *bpf_reg_usr = arg;
> + struct io_uring_bpf_reg bpf_reg;
> + struct io_bpf_ctx *bc;
> + struct bpf_prog *prog;
> +
> + if (!(ctx->flags & IORING_SETUP_DEFER_TASKRUN))
> + return -EOPNOTSUPP;
> +
> + if (nr_args != 1)
> + return -EINVAL;
> + if (copy_from_user(&bpf_reg, bpf_reg_usr, sizeof(bpf_reg)))
> + return -EFAULT;
> + if (bpf_reg.flags || bpf_reg.resv1 ||
> + bpf_reg.resv2[0] || bpf_reg.resv2[1])
> + return -EINVAL;
> +
> + if (ctx->bpf_ctx)
> + return -ENXIO;
> +
> + bc = kzalloc(sizeof(*bc), GFP_KERNEL);
> + if (!bc)
> + return -ENOMEM;
> +
> + prog = bpf_prog_get_type(bpf_reg.prog_fd, BPF_PROG_TYPE_IOURING);
> + if (IS_ERR(prog)) {
> + kfree(bc);
> + return PTR_ERR(prog);
> + }
> +
> + bc->prog = prog;
> + ctx->bpf_ctx = bc;
> + return 0;
> +}
> diff --git a/io_uring/bpf.h b/io_uring/bpf.h
> new file mode 100644
> index 000000000000..2b4e555ff07a
> --- /dev/null
> +++ b/io_uring/bpf.h
> @@ -0,0 +1,41 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#ifndef IOU_BPF_H
> +#define IOU_BPF_H
> +
> +#include <linux/io_uring/bpf.h>
> +#include <linux/io_uring_types.h>
> +
> +struct bpf_prog;
> +
> +struct io_bpf_ctx {
> + struct io_bpf_ctx_kern kern;
> + struct bpf_prog *prog;
> +};
> +
> +static inline bool io_bpf_enabled(struct io_ring_ctx *ctx)
> +{
> + return IS_ENABLED(CONFIG_BPF) && ctx->bpf_ctx != NULL;
> +}
> +
> +#ifdef CONFIG_BPF
> +int io_register_bpf(struct io_ring_ctx *ctx, void __user *arg,
> + unsigned int nr_args);
> +int io_unregister_bpf(struct io_ring_ctx *ctx);
> +int io_run_bpf(struct io_ring_ctx *ctx);
> +
> +#else
> +static inline int io_register_bpf(struct io_ring_ctx *ctx, void __user *arg,
> + unsigned int nr_args)
> +{
> + return -EOPNOTSUPP;
> +}
> +static inline int io_unregister_bpf(struct io_ring_ctx *ctx)
> +{
> + return -EOPNOTSUPP;
> +}
> +static inline int io_run_bpf(struct io_ring_ctx *ctx)
> +{
> +}
> +#endif
> +
> +#endif
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index f34fa1ead2cf..82599e2a888a 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -104,6 +104,7 @@
> #include "rw.h"
> #include "alloc_cache.h"
> #include "eventfd.h"
> +#include "bpf.h"
>
> #define SQE_COMMON_FLAGS (IOSQE_FIXED_FILE | IOSQE_IO_LINK | \
> IOSQE_IO_HARDLINK | IOSQE_ASYNC)
> @@ -2834,6 +2835,12 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, u32 flags,
>
> io_napi_busy_loop(ctx, &iowq);
>
> + if (io_bpf_enabled(ctx)) {
> + ret = io_run_bpf(ctx);
> + if (ret == IOU_BPF_RET_STOP)
> + return 0;
> + }
> +
> trace_io_uring_cqring_wait(ctx, min_events);
> do {
> unsigned long check_cq;
> @@ -2879,6 +2886,13 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, u32 flags,
> if (ret < 0)
> break;
>
> + if (io_bpf_enabled(ctx)) {
> + ret = io_run_bpf(ctx);
> + if (ret == IOU_BPF_RET_STOP)
> + break;
> + continue;
> + }
I believe 'struct_ops' is a much simpler way to run the prog and return the result.
Then you wouldn't need any bpf core changes or the bpf register code.
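Something along these lines on the BPF side, just to illustrate the shape
(the io_uring_bpf_ops struct and its ->wait() callback are made up here):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

SEC("struct_ops.s/wait")
int BPF_PROG(iou_wait, struct io_uring_bpf_ctx *ctx)
{
	return IOU_BPF_RET_STOP;
}

SEC(".struct_ops.link")
struct io_uring_bpf_ops iou_ops = {
	.wait = (void *)iou_wait,
};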
Thanks,
Ming
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [RFC 2/3] io_uring/bpf: allow to register and run BPF programs
2024-11-13 8:21 ` Ming Lei
@ 2024-11-13 13:09 ` Pavel Begunkov
0 siblings, 0 replies; 8+ messages in thread
From: Pavel Begunkov @ 2024-11-13 13:09 UTC (permalink / raw)
To: Ming Lei; +Cc: io-uring
On 11/13/24 08:21, Ming Lei wrote:
> On Mon, Nov 11, 2024 at 01:50:45AM +0000, Pavel Begunkov wrote:
>> Let the user register a BPF_PROG_TYPE_IOURING BPF program with a ring.
>> The program will be run in the waiting loop every time something
>> happens, i.e. the task was woken up by a task_work / signal / etc.
>>
>> Signed-off-by: Pavel Begunkov <[email protected]>
>> ---
...
>> do {
>> unsigned long check_cq;
>> @@ -2879,6 +2886,13 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, u32 flags,
>> if (ret < 0)
>> break;
>>
>> + if (io_bpf_enabled(ctx)) {
>> + ret = io_run_bpf(ctx);
>> + if (ret == IOU_BPF_RET_STOP)
>> + break;
>> + continue;
>> + }
>
> I believe 'struct_ops' is a much simpler way to run the prog and return the result.
> Then you wouldn't need any bpf core changes or the bpf register code.
Right, that's one of the things I need to look into; it's on my todo
list, but I'm not at all sure I'd want to get rid of the register
opcode. It's a good idea to have stronger registration locking, i.e.
under ->uring_lock, but again I need to check it out.
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 8+ messages in thread
* [RFC 3/3] io_uring/bpf: add kfuncs for BPF programs
2024-11-11 1:50 [RFC 0/3] Add BPF for io_uring Pavel Begunkov
2024-11-11 1:50 ` [RFC 1/3] bpf/io_uring: add io_uring program type Pavel Begunkov
2024-11-11 1:50 ` [RFC 2/3] io_uring/bpf: allow to register and run BPF programs Pavel Begunkov
@ 2024-11-11 1:50 ` Pavel Begunkov
2024-11-13 8:13 ` [RFC 0/3] Add BPF for io_uring Ming Lei
3 siblings, 0 replies; 8+ messages in thread
From: Pavel Begunkov @ 2024-11-11 1:50 UTC (permalink / raw)
To: io-uring; +Cc: asml.silence
Add a way for io_uring BPF programs to look at CQEs and submit new
requests.
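For illustration, a waiting loop program built on these kfuncs might look
like the sketch below (untested; the SEC() name is a placeholder, the kfunc
declarations mirror the kernel prototypes, and the drain loop is capped so
the verifier can bound it):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "GPL";

extern struct io_uring_cqe *bpf_io_uring_get_cqe2(struct io_uring_bpf_ctx *ctx) __ksym;
extern void bpf_io_uring_set_wait_params(struct io_uring_bpf_ctx *ctx,
					 unsigned wait_nr) __ksym;

unsigned to_wait;	/* total CQEs to consume, set by user space */

SEC("iouring.s")	/* placeholder section name */
int iou_wait(struct io_uring_bpf_ctx *ctx)
{
	int i;

	/* drain whatever is already posted */
	for (i = 0; i < 32 && to_wait; i++) {
		if (!bpf_io_uring_get_cqe2(ctx))
			break;
		to_wait--;
	}
	if (!to_wait)
		return IOU_BPF_RET_STOP;	/* done, return to user space */
	/* batch wakeups instead of being run for every completion */
	bpf_io_uring_set_wait_params(ctx, to_wait > 8 ? 8 : to_wait);
	return IOU_BPF_RET_OK;
}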
Signed-off-by: Pavel Begunkov <[email protected]>
---
io_uring/bpf.c | 118 ++++++++++++++++++++++++++++++++++++++++++++
io_uring/bpf.h | 2 +
io_uring/io_uring.c | 1 +
3 files changed, 121 insertions(+)
diff --git a/io_uring/bpf.c b/io_uring/bpf.c
index 8b7c74761c63..d413c3712612 100644
--- a/io_uring/bpf.c
+++ b/io_uring/bpf.c
@@ -4,6 +4,123 @@
#include <linux/filter.h>
#include "bpf.h"
+#include "io_uring.h"
+
+static inline struct io_bpf_ctx *io_user_to_bpf_ctx(struct io_uring_bpf_ctx *ctx)
+{
+ struct io_bpf_ctx_kern *bc = (struct io_bpf_ctx_kern *)ctx;
+
+ return container_of(bc, struct io_bpf_ctx, kern);
+}
+
+__bpf_kfunc_start_defs();
+
+__bpf_kfunc int bpf_io_uring_queue_sqe(struct io_uring_bpf_ctx *user_ctx,
+ void *bpf_sqe, int mem__sz)
+{
+ struct io_bpf_ctx *bc = io_user_to_bpf_ctx(user_ctx);
+ struct io_ring_ctx *ctx = bc->ctx;
+ unsigned tail = ctx->rings->sq.tail;
+ struct io_uring_sqe *sqe;
+
+ if (mem__sz != sizeof(*sqe))
+ return -EINVAL;
+
+ ctx->rings->sq.tail++;
+ tail &= (ctx->sq_entries - 1);
+ /* double index for 128-byte SQEs, twice as long */
+ if (ctx->flags & IORING_SETUP_SQE128)
+ tail <<= 1;
+ sqe = &ctx->sq_sqes[tail];
+ memcpy(sqe, bpf_sqe, sizeof(*sqe));
+ return 0;
+}
+
+__bpf_kfunc int bpf_io_uring_submit_sqes(struct io_uring_bpf_ctx *user_ctx,
+ unsigned nr)
+{
+ struct io_bpf_ctx *bc = io_user_to_bpf_ctx(user_ctx);
+ struct io_ring_ctx *ctx = bc->ctx;
+
+ return io_submit_sqes(ctx, nr);
+}
+
+__bpf_kfunc int bpf_io_uring_get_cqe(struct io_uring_bpf_ctx *user_ctx,
+ struct io_uring_cqe *res__uninit)
+{
+ struct io_bpf_ctx *bc = io_user_to_bpf_ctx(user_ctx);
+ struct io_ring_ctx *ctx = bc->ctx;
+ struct io_rings *rings = ctx->rings;
+ unsigned int mask = ctx->cq_entries - 1;
+ unsigned head = rings->cq.head;
+ struct io_uring_cqe *cqe;
+
+ /* TODO CQE32 */
+ if (head == rings->cq.tail)
+ goto fail;
+
+ cqe = &rings->cqes[head & mask];
+ memcpy(res__uninit, cqe, sizeof(*cqe));
+ rings->cq.head++;
+ return 0;
+fail:
+ memset(res__uninit, 0, sizeof(*res__uninit));
+ return -EINVAL;
+}
+
+__bpf_kfunc
+struct io_uring_cqe *bpf_io_uring_get_cqe2(struct io_uring_bpf_ctx *user_ctx)
+{
+ struct io_bpf_ctx *bc = io_user_to_bpf_ctx(user_ctx);
+ struct io_ring_ctx *ctx = bc->ctx;
+ struct io_rings *rings = ctx->rings;
+ unsigned int mask = ctx->cq_entries - 1;
+ unsigned head = rings->cq.head;
+ struct io_uring_cqe *cqe;
+
+ /* TODO CQE32 */
+ if (head == rings->cq.tail)
+ return NULL;
+
+ cqe = &rings->cqes[head & mask];
+ rings->cq.head++;
+ return cqe;
+}
+
+__bpf_kfunc
+void bpf_io_uring_set_wait_params(struct io_uring_bpf_ctx *user_ctx,
+ unsigned wait_nr)
+{
+ struct io_bpf_ctx *bc = io_user_to_bpf_ctx(user_ctx);
+ struct io_ring_ctx *ctx = bc->ctx;
+ struct io_wait_queue *wq = bc->waitq;
+
+ wait_nr = min_t(unsigned, wait_nr, ctx->cq_entries);
+ wq->cq_tail = READ_ONCE(ctx->rings->cq.head) + wait_nr;
+}
+
+__bpf_kfunc_end_defs();
+
+BTF_KFUNCS_START(io_uring_kfunc_set)
+BTF_ID_FLAGS(func, bpf_io_uring_queue_sqe, KF_SLEEPABLE);
+BTF_ID_FLAGS(func, bpf_io_uring_submit_sqes, KF_SLEEPABLE);
+BTF_ID_FLAGS(func, bpf_io_uring_get_cqe, 0);
+BTF_ID_FLAGS(func, bpf_io_uring_get_cqe2, KF_RET_NULL);
+BTF_ID_FLAGS(func, bpf_io_uring_set_wait_params, 0);
+BTF_KFUNCS_END(io_uring_kfunc_set)
+
+static const struct btf_kfunc_id_set bpf_io_uring_kfunc_set = {
+ .owner = THIS_MODULE,
+ .set = &io_uring_kfunc_set,
+};
+
+static int init_io_uring_bpf(void)
+{
+ return register_btf_kfunc_id_set(BPF_PROG_TYPE_IOURING,
+ &bpf_io_uring_kfunc_set);
+}
+late_initcall(init_io_uring_bpf);
+
static const struct bpf_func_proto *
io_bpf_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
@@ -82,6 +199,7 @@ int io_register_bpf(struct io_ring_ctx *ctx, void __user *arg,
}
bc->prog = prog;
+ bc->ctx = ctx;
ctx->bpf_ctx = bc;
return 0;
}
diff --git a/io_uring/bpf.h b/io_uring/bpf.h
index 2b4e555ff07a..9f578a48ce2e 100644
--- a/io_uring/bpf.h
+++ b/io_uring/bpf.h
@@ -9,6 +9,8 @@ struct bpf_prog;
struct io_bpf_ctx {
struct io_bpf_ctx_kern kern;
+ struct io_ring_ctx *ctx;
+ struct io_wait_queue *waitq;
struct bpf_prog *prog;
};
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 82599e2a888a..98206e68ce70 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -2836,6 +2836,7 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, u32 flags,
io_napi_busy_loop(ctx, &iowq);
if (io_bpf_enabled(ctx)) {
+ ctx->bpf_ctx->waitq = &iowq;
ret = io_run_bpf(ctx);
if (ret == IOU_BPF_RET_STOP)
return 0;
--
2.46.0
^ permalink raw reply related [flat|nested] 8+ messages in thread
* Re: [RFC 0/3] Add BPF for io_uring
2024-11-11 1:50 [RFC 0/3] Add BPF for io_uring Pavel Begunkov
` (2 preceding siblings ...)
2024-11-11 1:50 ` [RFC 3/3] io_uring/bpf: add kfuncs for " Pavel Begunkov
@ 2024-11-13 8:13 ` Ming Lei
2024-11-13 13:09 ` Pavel Begunkov
3 siblings, 1 reply; 8+ messages in thread
From: Ming Lei @ 2024-11-13 8:13 UTC (permalink / raw)
To: Pavel Begunkov; +Cc: io-uring, ming.lei
On Mon, Nov 11, 2024 at 01:50:43AM +0000, Pavel Begunkov wrote:
> WARNING: it's an early prototype and could likely be broken and unsafe
> to run. Also, most probably it doesn't do the right thing from the
> modern BPF perspective, but that's fine as I want to get some numbers
> first and only then consult with BPF folks and brush it up.
>
> A comeback of the io_uring BPF proposal put on top of new infrastructure.
> Instead of executing BPF as a new request type, it's now run in the io_uring
> waiting loop. The program is called to react every time we get a new
> event like a queued task_work or an interrupt. Patch 3 adds some helpers
> the BPF program can use to interact with io_uring like submitting new
> requests and looking at CQEs. It also controls when to return control
> back to user space by returning one of IOU_BPF_RET_{OK,STOP}, and sets
> the task_work batching size, i.e. how many CQEs to wait for before it is run
> again, via a kfunc helper. We need to be able to sleep to submit
> requests, hence only sleepable BPF is allowed.
I guess this way may break the existing interface of io_uring_enter(),
or at least one flag should be added to tell the kernel that the wait
behavior will be overridden by the bpf prog.
Also, can you share how these perfect parameters may be calculated by
the bpf prog? And why isn't io_uring kernel code capable of doing that?
>
> BPF can help to create arbitrary relations between requests from
> within the kernel
Can you explain in detail what you mean by `arbitrary relations`?
> and later help with tuning the wait loop batching.
> E.g. with minor extensions we can implement batch wait timeouts.
> We can also use it to let the user safely access internal resources
> and maybe even do a more elaborate request setup than the SQE allows.
>
> The benchmark is primitive, the non-BPF baseline issues a 2 nop request
> link at a time and waits for them to complete. The BPF version runs
> them (2 * N requests) one by one. Numbers with mitigations on:
>
> # nice -n -20 taskset -c 0 ./minimal 0 50000000
> type 2-LINK, requests to run 50000000
> sec 10, total (ms) 10314
> # nice -n -20 taskset -c 0 ./minimal 1 50000000
> type BPF, requests to run 50000000
> sec 6, total (ms) 6808
>
> It needs to be better tested, especially with asynchronous requests
> like reads and other hardware. It can also be further optimised. E.g.
> we can avoid extra locking by taking it once for BPF/task_work_run.
>
> The test (see examples-bpf/minimal[.bpf].c)
> https://github.com/isilence/liburing.git io_uring-bpf
> https://github.com/isilence/liburing/tree/io_uring-bpf
Looks like you pull bpftool & libbpf code into the example; just
wondering, why not link the example with libbpf directly?
Thanks,
Ming
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [RFC 0/3] Add BPF for io_uring
2024-11-13 8:13 ` [RFC 0/3] Add BPF for io_uring Ming Lei
@ 2024-11-13 13:09 ` Pavel Begunkov
0 siblings, 0 replies; 8+ messages in thread
From: Pavel Begunkov @ 2024-11-13 13:09 UTC (permalink / raw)
To: Ming Lei; +Cc: io-uring
On 11/13/24 08:13, Ming Lei wrote:
> On Mon, Nov 11, 2024 at 01:50:43AM +0000, Pavel Begunkov wrote:
>> WARNING: it's an early prototype and could likely be broken and unsafe
>> to run. Also, most probably it doesn't do the right thing from the
>> modern BPF perspective, but that's fine as I want to get some numbers
>> first and only then consult with BPF folks and brush it up.
>>
>> A comeback of the io_uring BPF proposal put on top of new infrastructure.
>> Instead of executing BPF as a new request type, it's now run in the io_uring
>> waiting loop. The program is called to react every time we get a new
>> event like a queued task_work or an interrupt. Patch 3 adds some helpers
>> the BPF program can use to interact with io_uring like submitting new
>> requests and looking at CQEs. It also controls when to return control
>> back to user space by returning one of IOU_BPF_RET_{OK,STOP}, and sets
>> the task_work batching size, i.e. how many CQEs to wait for before it is run
>> again, via a kfunc helper. We need to be able to sleep to submit
>> requests, hence only sleepable BPF is allowed.
>
> I guess this way may break the existing interface of io_uring_enter(),
> or at least one flag should be added to tell the kernel that the wait
> behavior will be overridden by the bpf prog.
It doesn't change anything if there is no BPF registered; a user
who adds BPF should expect the change of behaviour dictated by
their own BPF program. Unlike some other BPF hooks, it should be
installed by the ring user and not from outside.
> Also, can you share how these perfect parameters may be calculated by
> the bpf prog?
In terms of knowledge it should be on par with normal user space,
so it's not about how to calculate a certain parameter but rather
how to pass it to the kernel and whether we want to carve the path
through the io_uring_enter syscall for that. With kfuncs it's easier
to keep it out of the generic path, and they're even more lax on
removals.
> And why isn't io_uring kernel code capable of doing that?
Take a look at the min_wait feature: it passes two timeout values to
io_uring_enter, which wouldn't be needed if the same were implemented in
BPF (with additional helpers). But what if you want to go a step further
and, let's say, have "if the first timeout expired without any new CQEs,
retry with the doubled waiting time"? You either do it less efficiently
or further extend the API.
>> BPF can help to create arbitrary relations between requests from
>> within the kernel
>
> Can you explain in detail what you mean by `arbitrary relations`?
For example, we could've implemented IOSQE_IO_LINK and spared ourselves
a lot of misery, but unfortunately that's past tense. No code for
assembling links, no extra worries about CQE ordering, no complications
around cancellations, no problems with when we bind files and buffers
(prep vs issue), no hacky state machine for DRAIN+LINK, no troubles
interpreting request results as errors or not with the subsequent
decision of breaking links (see IOSQE_IO_HARDLINK, min_ret/MSG_WAITALL
handling in net.c, IORING_TIMEOUT_ETIME_SUCCESS, etc.). Simpler kernel
code, easier to maintain, fewer bugs, and it would even work faster.
By arbitrary relations you can think of a directed acyclic graph setting
the execution ordering between requests. With the error handling
questions involved, I don't believe a hard-coded kernel version would
ever be viable.
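To make it concrete, a hand-rolled link of two requests could look roughly
like the sketch below in the waiting loop program, on top of the patch 3
kfuncs (untested; TAG_A/TAG_B, the NOP payload and the SEC() name are only
for illustration):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

#define TAG_A	1	/* user_data of the first request, submitted by user space */
#define TAG_B	2

char LICENSE[] SEC("license") = "GPL";

extern int bpf_io_uring_queue_sqe(struct io_uring_bpf_ctx *ctx,
				  void *bpf_sqe, int mem__sz) __ksym;
extern int bpf_io_uring_submit_sqes(struct io_uring_bpf_ctx *ctx,
				    unsigned nr) __ksym;
extern struct io_uring_cqe *bpf_io_uring_get_cqe2(struct io_uring_bpf_ctx *ctx) __ksym;

SEC("iouring.s")	/* placeholder section name */
int iou_link(struct io_uring_bpf_ctx *ctx)
{
	struct io_uring_cqe *cqe = bpf_io_uring_get_cqe2(ctx);
	struct io_uring_sqe sqe = {};

	if (!cqe)
		return IOU_BPF_RET_OK;		/* nothing yet, keep waiting */
	if (cqe->user_data != TAG_A)
		return IOU_BPF_RET_OK;
	if (cqe->res < 0)
		return IOU_BPF_RET_STOP;	/* the "break the link" policy lives here */
	/* the first request succeeded, only now issue the second one */
	sqe.opcode = IORING_OP_NOP;
	sqe.user_data = TAG_B;
	bpf_io_uring_queue_sqe(ctx, &sqe, sizeof(sqe));
	bpf_io_uring_submit_sqes(ctx, 1);
	return IOU_BPF_RET_STOP;
}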
>> and later help with tuning the wait loop batching.
>> E.g. with minor extensions we can implement batch wait timeouts.
>> We can also use it to let the user safely access internal resources
>> and maybe even do a more elaborate request setup than the SQE allows.
>>
>> The benchmark is primitive, the non-BPF baseline issues a 2 nop request
>> link at a time and waits for them to complete. The BPF version runs
>> them (2 * N requests) one by one. Numbers with mitigations on:
>>
>> # nice -n -20 taskset -c 0 ./minimal 0 50000000
>> type 2-LINK, requests to run 50000000
>> sec 10, total (ms) 10314
>> # nice -n -20 taskset -c 0 ./minimal 1 50000000
>> type BPF, requests to run 50000000
>> sec 6, total (ms) 6808
>>
>> It needs to be better tested, especially with asynchronous requests
>> like reads and other hardware. It can also be further optimised. E.g.
>> we can avoid extra locking by taking it once for BPF/task_work_run.
>>
>> The test (see examples-bpf/minimal[.bpf].c)
>> https://github.com/isilence/liburing.git io_uring-bpf
>> https://github.com/isilence/liburing/tree/io_uring-bpf
>
> Looks like you pull bpftool & libbpf code into the example; just
> wondering, why not link the example with libbpf directly?
It needs liburing and is more useful down the road as a liburing
example. I'll clean up the mess into submodules and separate commits.
Eventually it'd be turned into tests with system deps, but that
wouldn't be convenient for now.
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 8+ messages in thread