* [RFC 0/3] Add io_uring & ebpf based methods to implement zero-copy for ublk
@ 2023-02-15 0:41 Xiaoguang Wang
2023-02-15 0:41 ` [RFC 1/3] bpf: add UBLK program type Xiaoguang Wang
` (4 more replies)
0 siblings, 5 replies; 13+ messages in thread
From: Xiaoguang Wang @ 2023-02-15 0:41 UTC (permalink / raw)
To: linux-block, io-uring, bpf; +Cc: ming.lei, axboe, asml.silence, ZiyangZhang
Normally, userspace block device implementations need to copy data between
the kernel block layer's io requests and the userspace block device's userspace
daemon; for example, ublk and tcmu both have similar logic. This copy
consumes noticeable cpu resources, especially for large io.
There are methods that try to reduce these cpu overheads, so that the userspace
block device's io performance can be improved further. These methods
include: 1) using special hardware to do the memory copy, but not all
architectures have such special hardware; 2) software methods, such as
mmap'ing the kernel block layer's io request data into the userspace daemon [1],
but that has page table map/unmap and tlb flush overhead, security issues,
etc., and may only be friendly to large io.
Add a new program type BPF_PROG_TYPE_UBLK for ublk, which is a generic
framework for implementing block device logic from userspace. Typical
userspace block device implementations need to copy data between kernel
block layer's io requests and userspace block device's userspace daemon,
which will consume cpu resources, especially for large io.
To solve this problem, I'd propose a new method which combines the
respective advantages of io_uring and ebpf. Add a new program type
BPF_PROG_TYPE_UBLK for ublk; the userspace block device daemon process should
register an ebpf prog, and this bpf prog will use the bpf helper offered by the
ublk bpf prog type to submit io requests on behalf of the daemon process.
Currently there is only one helper:
u64 bpf_ublk_queue_sqe(struct ublk_io_bpf_ctx *bpf_ctx,
struct io_uring_sqe *sqe, u32 sqe_len, u32 fd)
This helper will use io_uring to submit io requests, so we need to make
io_uring able to submit an sqe located in the kernel (some of the code ideas
come from Pavel's patchset [2], but Pavel's patches still require that sqe->buf
come from a userspace addr). The bpf prog initializes sqes, but does not need to
initialize the sqes' buf field; sqe->buf will come from kernel block layer io
requests in some form. See patch 2 for more.
In the example of the ublk loop target, we can easily implement the below logic
in an ebpf prog:
1. The userspace daemon registers an ebpf prog and passes two backend file
fds in an ebpf map structure.
2. For kernel io requests against the first half of the userspace device, the
ebpf prog prepares an io_uring sqe, which will submit io against the first
backend file fd, and the sqe's buffer comes from the kernel io request. Kernel
io requests against the second half of the userspace device have similar logic,
only the sqe's fd will be the second backend file fd.
3. When the ublk driver's blk-mq queue_rq() is called, this ebpf prog will
be executed and will complete the kernel io requests.
That means, by using ebpf, we can implement various kinds of userspace logic
in the kernel; a rough sketch of such a prep prog follows below.
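The sketch below is illustrative only and not part of the patches (the real
test prog is the ublk.bpf.c posted later in this thread); the map layout, the
ONE_GIB_SECTORS split point, and the assumption that the daemon stores the two
backend fds and its io_uring fd in array maps are all made up for this example:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

/* helper id 212 matches the BPF_FUNC_ublk_queue_sqe added in patch 3 */
static long (*bpf_ublk_queue_sqe)(void *ctx, struct io_uring_sqe *sqe,
				  u32 sqe_len, u32 fd) = (void *) 212;

#define ONE_GIB_SECTORS (1ULL << 21)	/* illustrative split point */

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, 2);
	__type(key, int);
	__type(value, int);
} backend_fd_map SEC(".maps");		/* [0] = first half, [1] = second half */

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, 1);
	__type(key, int);
	__type(value, int);
} ring_fd_map SEC(".maps");		/* io_uring fd registered by the daemon */

SEC("ublk.s/")
int ublk_loop_split_prog(struct ublk_bpf_ctx *ctx)
{
	struct io_uring_sqe sqe = {};
	int zero = 0, one = 1, *backend_fd, *ring_fd;
	u64 off = ctx->start_sector << 9;
	u8 op;

	/* read op the same way the posted ublk.bpf.c does */
	bpf_probe_read_kernel(&op, 1, &ctx->op);
	if (op != REQ_OP_READ && op != REQ_OP_WRITE)
		return 0;

	/* requests crossing the half-way boundary are ignored in this sketch */
	if (ctx->start_sector < ONE_GIB_SECTORS) {
		backend_fd = bpf_map_lookup_elem(&backend_fd_map, &zero);
	} else {
		backend_fd = bpf_map_lookup_elem(&backend_fd_map, &one);
		off -= ONE_GIB_SECTORS << 9;
	}
	ring_fd = bpf_map_lookup_elem(&ring_fd_map, &zero);
	if (!backend_fd || !ring_fd)
		return 0;

	/* sqe.addr stays 0: the buffer is taken from the kernel request */
	sqe.opcode = (op == REQ_OP_READ) ? IORING_OP_READ : IORING_OP_WRITE;
	sqe.fd = *backend_fd;
	sqe.off = off;
	sqe.len = ctx->nr_sectors << 9;
	sqe.user_data = ((u64)ctx->q_id << 16) | ctx->tag;

	bpf_ublk_queue_sqe(ctx, &sqe, sizeof(sqe), *ring_fd);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";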
From the above example, we can see that this method has at least 3 advantages:
1. Remove memory copy between the kernel block layer and the userspace daemon
completely.
2. Save memory. The userspace daemon doesn't need to maintain memory to
issue and complete io requests; it uses the kernel block layer io requests'
memory directly.
3. We may reduce the number of round trips between the kernel and the userspace
daemon, so we may reduce kernel & userspace context switch overheads.
Test:
Add a ublk loop target: ublk add -t loop -q 1 -d 128 -f loop.file
fio job file:
[global]
direct=1
filename=/dev/ublkb0
time_based
runtime=60
numjobs=1
cpus_allowed=1
[rand-read-4k]
bs=512K
iodepth=16
ioengine=libaio
rw=randwrite
stonewall
Without this patch:
WRITE: bw=745MiB/s (781MB/s), 745MiB/s-745MiB/s (781MB/s-781MB/s), io=43.6GiB (46.8GB), run=60010-60010msec
ublk daemon's cpu utilization is about 9.3%~10.0%, as shown by the top tool.
With this patch:
WRITE: bw=744MiB/s (781MB/s), 744MiB/s-744MiB/s (781MB/s-781MB/s), io=43.6GiB (46.8GB), run=60012-60012msec
ublk daemon's cpu utilization is about 1.3%~1.7%, as shown by the top tool.
From the above tests, this method reduces the cpu copy overhead significantly.
TODO:
I must say this patchset is just an RFC for the design.
1) Currently, this patchset only makes the ublk ebpf prog submit io requests
using io_uring in the kernel; cqe events still need to be handled in the
userspace daemon. Once we later succeed in making io_uring handle cqes in the
kernel, the ublk ebpf prog can implement the whole io path in the kernel.
2) The ublk driver needs to work better with ebpf; currently I did some hack
code to support ebpf in the ublk driver, and it can only support write requests.
3) I have not done many tests yet; I will run liburing/ublk/blktests
later.
Any review and suggestions are welcome, thanks.
[1] https://lore.kernel.org/all/[email protected]/
[2] https://lore.kernel.org/all/[email protected]/
Xiaoguang Wang (3):
bpf: add UBLK program type
io_uring: enable io_uring to submit sqes located in kernel
ublk_drv: add ebpf support
drivers/block/ublk_drv.c | 228 ++++++++++++++++++++++++++++++++-
include/linux/bpf_types.h | 2 +
include/linux/io_uring.h | 13 ++
include/linux/io_uring_types.h | 8 +-
include/uapi/linux/bpf.h | 2 +
include/uapi/linux/ublk_cmd.h | 11 ++
io_uring/io_uring.c | 59 ++++++++-
io_uring/rsrc.c | 15 +++
io_uring/rsrc.h | 3 +
io_uring/rw.c | 7 +
kernel/bpf/syscall.c | 1 +
kernel/bpf/verifier.c | 9 +-
scripts/bpf_doc.py | 4 +
tools/include/uapi/linux/bpf.h | 9 ++
tools/lib/bpf/libbpf.c | 2 +
15 files changed, 366 insertions(+), 7 deletions(-)
--
2.31.1
* [RFC 1/3] bpf: add UBLK program type
2023-02-15 0:41 [RFC 0/3] Add io_uring & ebpf based methods to implement zero-copy for ublk Xiaoguang Wang
@ 2023-02-15 0:41 ` Xiaoguang Wang
2023-02-15 0:41 ` [RFC 2/3] io_uring: enable io_uring to submit sqes located in kernel Xiaoguang Wang
` (3 subsequent siblings)
4 siblings, 0 replies; 13+ messages in thread
From: Xiaoguang Wang @ 2023-02-15 0:41 UTC (permalink / raw)
To: linux-block, io-uring, bpf; +Cc: ming.lei, axboe, asml.silence, ZiyangZhang
Add a new program type BPF_PROG_TYPE_UBLK for ublk, which is a generic
framework for implementing block device logic from userspace. Typical
userspace block device implementations need to copy data between the kernel
block layer's io requests and the userspace block device's userspace daemon,
which will consume cpu resources, especially for large io.
There are methods that try to reduce these cpu overheads, so that the userspace
block device's io performance can be improved further. These methods
include: 1) using special hardware to do the memory copy, but not all
architectures have such special hardware; 2) software methods, such as
mmap'ing the kernel block layer's io request data into the userspace daemon [1],
but that has page table map/unmap and tlb flush overhead, etc., and may
only be friendly to large io.
To solve this problem, I'd propose a new method which uses io_uring
to submit the userspace daemon's io requests in the kernel and uses the kernel
block device io requests' pages. Further, userspace block devices may have
different userspace logic about how to complete kernel io requests; here
we can use ebpf to implement various kinds of userspace logic in the kernel. In
the example of the ublk loop target, we can easily implement the below logic in
an ebpf prog:
1. The userspace daemon registers this ebpf prog and passes two backend file
fds in an ebpf map structure.
2. For kernel io requests against the first half of the userspace device, the
ebpf prog prepares an io_uring sqe, which will submit io against the first
backend file fd, and the sqe's buffer comes from the kernel io request. Kernel
io requests against the second half of the userspace device have similar logic,
only the sqe's fd will be the second backend file fd.
3. When the ublk driver's blk-mq queue_rq() is called, this ebpf prog will
be executed and will complete the kernel io requests.
From the above example, we can see that this method has at least 3 advantages:
1. Remove memory copy between the kernel block layer and the userspace daemon
completely.
2. Save memory. The userspace daemon doesn't need to maintain memory to
issue and complete io requests; it uses the kernel block layer io requests'
memory directly.
3. We may reduce the number of round trips between the kernel and the userspace
daemon.
Currently, this patchset only makes the ublk ebpf prog submit io requests
using io_uring in the kernel; cqe events still need to be handled in the
userspace daemon. Once we later succeed in making io_uring handle cqes in the
kernel, the ublk ebpf prog can implement the whole io path in the kernel.
[1] https://lore.kernel.org/all/[email protected]/
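For completeness, below is a hedged sketch of how a daemon could load such a
prog with plain libbpf using the new "ublk.s/" section (assuming a libbpf built
with this patch; the prog name matches the example ublk.bpf.c posted later in
this thread, and error handling is trimmed):

#include <bpf/libbpf.h>

/* returns the prog fd that UBLK_CMD_REG_BPF_PROG (patch 3) consumes,
 * or -1 on failure
 */
int load_ublk_prep_prog(const char *obj_path)
{
	struct bpf_object *obj;
	struct bpf_program *prog;

	obj = bpf_object__open_file(obj_path, NULL);
	if (!obj)
		return -1;
	if (bpf_object__load(obj)) {
		bpf_object__close(obj);
		return -1;
	}
	prog = bpf_object__find_program_by_name(obj, "ublk_io_prep_prog");
	if (!prog)
		return -1;
	return bpf_program__fd(prog);
}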
Signed-off-by: Xiaoguang Wang <[email protected]>
---
drivers/block/ublk_drv.c | 23 +++++++++++++++++++++++
include/linux/bpf_types.h | 2 ++
include/uapi/linux/bpf.h | 1 +
kernel/bpf/syscall.c | 1 +
kernel/bpf/verifier.c | 9 +++++++--
tools/include/uapi/linux/bpf.h | 1 +
tools/lib/bpf/libbpf.c | 2 ++
7 files changed, 37 insertions(+), 2 deletions(-)
diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 6368b56eacf1..b628e9eaefa6 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -43,6 +43,8 @@
#include <asm/page.h>
#include <linux/task_work.h>
#include <uapi/linux/ublk_cmd.h>
+#include <linux/filter.h>
+#include <linux/bpf.h>
#define UBLK_MINORS (1U << MINORBITS)
@@ -187,6 +189,27 @@ static DEFINE_MUTEX(ublk_ctl_mutex);
static struct miscdevice ublk_misc;
+static const struct bpf_func_proto *
+ublk_bpf_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+ return bpf_base_func_proto(func_id);
+}
+
+static bool ublk_bpf_is_valid_access(int off, int size,
+ enum bpf_access_type type,
+ const struct bpf_prog *prog,
+ struct bpf_insn_access_aux *info)
+{
+ return false;
+}
+
+const struct bpf_prog_ops bpf_ublk_prog_ops = {};
+
+const struct bpf_verifier_ops bpf_ublk_verifier_ops = {
+ .get_func_proto = ublk_bpf_func_proto,
+ .is_valid_access = ublk_bpf_is_valid_access,
+};
+
static void ublk_dev_param_basic_apply(struct ublk_device *ub)
{
struct request_queue *q = ub->ub_disk->queue;
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index d4ee3ccd3753..4ef0bc0251b7 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -79,6 +79,8 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_LSM, lsm,
#endif
BPF_PROG_TYPE(BPF_PROG_TYPE_SYSCALL, bpf_syscall,
void *, void *)
+BPF_PROG_TYPE(BPF_PROG_TYPE_UBLK, bpf_ublk,
+ void *, void *)
BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY, array_map_ops)
BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_ARRAY, percpu_array_map_ops)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 464ca3f01fe7..515b7b995b3a 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -986,6 +986,7 @@ enum bpf_prog_type {
BPF_PROG_TYPE_LSM,
BPF_PROG_TYPE_SK_LOOKUP,
BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */
+ BPF_PROG_TYPE_UBLK,
};
enum bpf_attach_type {
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index ecca9366c7a6..eb1752243f4f 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2432,6 +2432,7 @@ static bool is_net_admin_prog_type(enum bpf_prog_type prog_type)
case BPF_PROG_TYPE_CGROUP_SOCKOPT:
case BPF_PROG_TYPE_CGROUP_SYSCTL:
case BPF_PROG_TYPE_SOCK_OPS:
+ case BPF_PROG_TYPE_UBLK:
case BPF_PROG_TYPE_EXT: /* extends any prog */
return true;
case BPF_PROG_TYPE_CGROUP_SKB:
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 7ee218827259..1e5bc89aea36 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -12235,6 +12235,10 @@ static int check_return_code(struct bpf_verifier_env *env)
}
break;
+ case BPF_PROG_TYPE_UBLK:
+ range = tnum_const(0);
+ break;
+
case BPF_PROG_TYPE_EXT:
/* freplace program can return anything as its return value
* depends on the to-be-replaced kernel func or bpf program.
@@ -16770,8 +16774,9 @@ static int check_attach_btf_id(struct bpf_verifier_env *env)
}
if (prog->aux->sleepable && prog->type != BPF_PROG_TYPE_TRACING &&
- prog->type != BPF_PROG_TYPE_LSM && prog->type != BPF_PROG_TYPE_KPROBE) {
- verbose(env, "Only fentry/fexit/fmod_ret, lsm, and kprobe/uprobe programs can be sleepable\n");
+ prog->type != BPF_PROG_TYPE_LSM && prog->type != BPF_PROG_TYPE_KPROBE &&
+ prog->type != BPF_PROG_TYPE_UBLK) {
+ verbose(env, "Only fentry/fexit/fmod_ret, lsm, and kprobe/uprobe, ublk programs can be sleepable\n");
return -EINVAL;
}
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 464ca3f01fe7..515b7b995b3a 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -986,6 +986,7 @@ enum bpf_prog_type {
BPF_PROG_TYPE_LSM,
BPF_PROG_TYPE_SK_LOOKUP,
BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */
+ BPF_PROG_TYPE_UBLK,
};
enum bpf_attach_type {
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 2a82f49ce16f..6fe77f9a2cc8 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -8606,6 +8606,8 @@ static const struct bpf_sec_def section_defs[] = {
SEC_DEF("cgroup/dev", CGROUP_DEVICE, BPF_CGROUP_DEVICE, SEC_ATTACHABLE_OPT),
SEC_DEF("struct_ops+", STRUCT_OPS, 0, SEC_NONE),
SEC_DEF("sk_lookup", SK_LOOKUP, BPF_SK_LOOKUP, SEC_ATTACHABLE),
+ SEC_DEF("ublk/", UBLK, 0, SEC_SLEEPABLE),
+ SEC_DEF("ublk.s/", UBLK, 0, SEC_SLEEPABLE),
};
static size_t custom_sec_def_cnt;
--
2.31.1
* [RFC 2/3] io_uring: enable io_uring to submit sqes located in kernel
2023-02-15 0:41 [RFC 0/3] Add io_uring & ebpf based methods to implement zero-copy for ublk Xiaoguang Wang
2023-02-15 0:41 ` [RFC 1/3] bpf: add UBLK program type Xiaoguang Wang
@ 2023-02-15 0:41 ` Xiaoguang Wang
2023-02-15 0:41 ` [RFC 3/3] ublk_drv: add ebpf support Xiaoguang Wang
` (2 subsequent siblings)
4 siblings, 0 replies; 13+ messages in thread
From: Xiaoguang Wang @ 2023-02-15 0:41 UTC (permalink / raw)
To: linux-block, io-uring, bpf; +Cc: ming.lei, axboe, asml.silence, ZiyangZhang
Currently this feature can be used by userspace block devices to reduce
kernel & userspace memory copy overhead. With this feature, a userspace
block device driver can submit and complete io requests using the kernel
block layer io requests' memory data, and further, by using ebpf, we
can customize how an sqe is initialized and how io is submitted and completed.
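To illustrate the intended use of the new interface, here is a rough,
illustrative kernel-side sketch (not part of this patch; the real consumer is
the ublk driver in patch 3). It wraps a block request's bvecs in an
io_mapped_kbuf and feeds a kernel-built sqe to an existing io_uring instance;
kbuf lifetime handling is omitted here (the ublk patch keeps it in the request
pdu):

#include <linux/blk-mq.h>
#include <linux/io_uring.h>
#include <linux/slab.h>

static int example_submit_kernel_sqe(struct request *rq, int ring_fd,
				     int backend_fd)
{
	struct io_uring_sqe sqe = {};
	struct io_mapped_kbuf *kbuf;
	struct req_iterator iter;
	struct bio_vec bv, *bvec;
	int nr_bvec = 0;

	rq_for_each_bvec(bv, rq, iter)
		nr_bvec++;

	kbuf = kmalloc(sizeof(*kbuf), GFP_NOIO);
	bvec = kmalloc_array(nr_bvec, sizeof(*bvec), GFP_NOIO);
	if (!kbuf || !bvec) {
		kfree(kbuf);
		kfree(bvec);
		return -ENOMEM;
	}
	kbuf->bvec = bvec;
	kbuf->nr_bvecs = nr_bvec;
	kbuf->count = blk_rq_bytes(rq);
	rq_for_each_bvec(bv, rq, iter)
		*bvec++ = bv;

	/* sqe.addr stays 0: REQ_F_KBUF makes __io_import_iovec() build the
	 * iov_iter from kbuf->bvec instead of a user address
	 */
	sqe.opcode = IORING_OP_WRITE;
	sqe.fd = backend_fd;
	sqe.off = (u64)blk_rq_pos(rq) << 9;
	sqe.len = blk_rq_bytes(rq);

	return io_uring_submit_sqe(ring_fd, &sqe, sizeof(sqe), kbuf);
}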
Signed-off-by: Xiaoguang Wang <[email protected]>
---
include/linux/io_uring.h | 13 ++++++++
include/linux/io_uring_types.h | 8 ++++-
io_uring/io_uring.c | 59 ++++++++++++++++++++++++++++++++--
io_uring/rsrc.c | 15 +++++++++
io_uring/rsrc.h | 3 ++
io_uring/rw.c | 7 ++++
6 files changed, 101 insertions(+), 4 deletions(-)
diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
index 934e5dd4ccc0..d69882c98608 100644
--- a/include/linux/io_uring.h
+++ b/include/linux/io_uring.h
@@ -36,6 +36,12 @@ struct io_uring_cmd {
u8 pdu[32]; /* available inline for free use */
};
+struct io_mapped_kbuf {
+ size_t count;
+ unsigned int nr_bvecs;
+ struct bio_vec *bvec;
+};
+
#if defined(CONFIG_IO_URING)
int io_uring_cmd_import_fixed(u64 ubuf, unsigned long len, int rw,
struct iov_iter *iter, void *ioucmd);
@@ -65,6 +71,8 @@ static inline void io_uring_free(struct task_struct *tsk)
if (tsk->io_uring)
__io_uring_free(tsk);
}
+int io_uring_submit_sqe(int fd, const struct io_uring_sqe *sqe, u32 sqe_len,
+ struct io_mapped_kbuf *kbuf);
#else
static inline int io_uring_cmd_import_fixed(u64 ubuf, unsigned long len, int rw,
struct iov_iter *iter, void *ioucmd)
@@ -96,6 +104,11 @@ static inline const char *io_uring_get_opcode(u8 opcode)
{
return "";
}
+int io_uring_submit_sqe(int fd, const struct io_uring_sqe *sqe, u32 sqe_len,
+ struct io_mapped_kbuf *kbuf)
+{
+ return 0;
+}
#endif
#endif
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 128a67a40065..260f8365c802 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -398,6 +398,7 @@ enum {
/* keep async read/write and isreg together and in order */
REQ_F_SUPPORT_NOWAIT_BIT,
REQ_F_ISREG_BIT,
+ REQ_F_KBUF_BIT,
/* not a real bit, just to check we're not overflowing the space */
__REQ_F_LAST_BIT,
@@ -467,6 +468,8 @@ enum {
REQ_F_CLEAR_POLLIN = BIT(REQ_F_CLEAR_POLLIN_BIT),
/* hashed into ->cancel_hash_locked, protected by ->uring_lock */
REQ_F_HASH_LOCKED = BIT(REQ_F_HASH_LOCKED_BIT),
+ /* buffer comes from kernel */
+ REQ_F_KBUF = BIT(REQ_F_KBUF_BIT),
};
typedef void (*io_req_tw_func_t)(struct io_kiocb *req, bool *locked);
@@ -527,7 +530,7 @@ struct io_kiocb {
* and after selection it points to the buffer ID itself.
*/
u16 buf_index;
- unsigned int flags;
+ u64 flags;
struct io_cqe cqe;
@@ -540,6 +543,9 @@ struct io_kiocb {
/* store used ubuf, so we can prevent reloading */
struct io_mapped_ubuf *imu;
+ /* store used kbuf */
+ struct io_mapped_kbuf *imk;
+
/* stores selected buf, valid IFF REQ_F_BUFFER_SELECTED is set */
struct io_buffer *kbuf;
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index db623b3185c8..a174365470fb 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -2232,7 +2232,8 @@ static __cold int io_submit_fail_init(const struct io_uring_sqe *sqe,
}
static inline int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
- const struct io_uring_sqe *sqe)
+ const struct io_uring_sqe *sqe,
+ struct io_mapped_kbuf *kbuf)
__must_hold(&ctx->uring_lock)
{
struct io_submit_link *link = &ctx->submit_state.link;
@@ -2241,6 +2242,10 @@ static inline int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
ret = io_init_req(ctx, req, sqe);
if (unlikely(ret))
return io_submit_fail_init(sqe, req, ret);
+ if (unlikely(kbuf)) {
+ req->imk = kbuf;
+ req->flags |= REQ_F_KBUF;
+ }
/* don't need @sqe from now on */
trace_io_uring_submit_sqe(req, true);
@@ -2392,7 +2397,7 @@ int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr)
* Continue submitting even for sqe failure if the
* ring was setup with IORING_SETUP_SUBMIT_ALL
*/
- if (unlikely(io_submit_sqe(ctx, req, sqe)) &&
+ if (unlikely(io_submit_sqe(ctx, req, sqe, NULL)) &&
!(ctx->flags & IORING_SETUP_SUBMIT_ALL)) {
left--;
break;
@@ -3272,6 +3277,54 @@ static int io_get_ext_arg(unsigned flags, const void __user *argp, size_t *argsz
return 0;
}
+int io_uring_submit_sqe(int fd, const struct io_uring_sqe *sqe, u32 sqe_len,
+ struct io_mapped_kbuf *kbuf)
+{
+ struct io_kiocb *req;
+ struct fd f;
+ int ret;
+ struct io_ring_ctx *ctx;
+
+ f = fdget(fd);
+ if (unlikely(!f.file))
+ return -EBADF;
+
+ ret = -EOPNOTSUPP;
+ if (unlikely(!io_is_uring_fops(f.file))) {
+ ret = -EBADF;
+ goto out;
+ }
+ ctx = f.file->private_data;
+
+ mutex_lock(&ctx->uring_lock);
+ if (unlikely(!io_alloc_req_refill(ctx)))
+ goto out;
+ req = io_alloc_req(ctx);
+ if (unlikely(!req)) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ if (!percpu_ref_tryget_many(&ctx->refs, 1)) {
+ kmem_cache_free(req_cachep, req);
+ ret = -EAGAIN;
+ goto out;
+ }
+ percpu_counter_add(&current->io_uring->inflight, 1);
+ refcount_add(1, &current->usage);
+
+ /* returns number of submitted SQEs or an error */
+ ret = !io_submit_sqe(ctx, req, sqe, kbuf);
+ mutex_unlock(&ctx->uring_lock);
+ fdput(f);
+ return ret;
+
+out:
+ mutex_unlock(&ctx->uring_lock);
+ fdput(f);
+ return ret;
+}
+EXPORT_SYMBOL(io_uring_submit_sqe);
+
SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
u32, min_complete, u32, flags, const void __user *, argp,
size_t, argsz)
@@ -4270,7 +4323,7 @@ static int __init io_uring_init(void)
BUILD_BUG_ON(SQE_COMMON_FLAGS >= (1 << 8));
BUILD_BUG_ON((SQE_VALID_FLAGS | SQE_COMMON_FLAGS) != SQE_VALID_FLAGS);
- BUILD_BUG_ON(__REQ_F_LAST_BIT > 8 * sizeof(int));
+ BUILD_BUG_ON(__REQ_F_LAST_BIT > 8 * sizeof(u64));
BUILD_BUG_ON(sizeof(atomic_t) != sizeof(u32));
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 18de10c68a15..51861f01185f 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -1380,3 +1380,18 @@ int io_import_fixed(int ddir, struct iov_iter *iter,
return 0;
}
+
+int io_import_fixed_kbuf(int ddir, struct iov_iter *iter,
+ struct io_mapped_kbuf *kbuf,
+ u64 offset, size_t len)
+{
+ if (WARN_ON_ONCE(!kbuf))
+ return -EFAULT;
+ if (offset >= kbuf->count)
+ return -EFAULT;
+
+ iov_iter_bvec(iter, ddir, kbuf->bvec, kbuf->nr_bvecs, offset + len);
+ iov_iter_advance(iter, offset);
+ return 0;
+}
+
diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index 2b8743645efc..c6897d218bb9 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -69,6 +69,9 @@ int io_import_fixed(int ddir, struct iov_iter *iter,
struct io_mapped_ubuf *imu,
u64 buf_addr, size_t len);
+int io_import_fixed_kbuf(int ddir, struct iov_iter *iter,
+ struct io_mapped_kbuf *kbuf, u64 buf_addr, size_t len);
+
void __io_sqe_buffers_unregister(struct io_ring_ctx *ctx);
int io_sqe_buffers_unregister(struct io_ring_ctx *ctx);
int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 9c3ddd46a1ad..bdf4c4f0661f 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -378,6 +378,13 @@ static struct iovec *__io_import_iovec(int ddir, struct io_kiocb *req,
return NULL;
}
+ if (unlikely(req->flags & REQ_F_KBUF)) {
+ ret = io_import_fixed_kbuf(ddir, iter, req->imk, rw->addr, rw->len);
+ if (ret)
+ return ERR_PTR(ret);
+ return NULL;
+ }
+
buf = u64_to_user_ptr(rw->addr);
sqe_len = rw->len;
--
2.31.1
* [RFC 3/3] ublk_drv: add ebpf support
2023-02-15 0:41 [RFC 0/3] Add io_uring & ebpf based methods to implement zero-copy for ublk Xiaoguang Wang
2023-02-15 0:41 ` [RFC 1/3] bpf: add UBLK program type Xiaoguang Wang
2023-02-15 0:41 ` [RFC 2/3] io_uring: enable io_uring to submit sqes located in kernel Xiaoguang Wang
@ 2023-02-15 0:41 ` Xiaoguang Wang
2023-02-16 8:11 ` Ming Lei
2023-02-15 0:46 ` [UBLKSRV] Add " Xiaoguang Wang
2023-02-15 8:40 ` [RFC 0/3] Add io_uring & ebpf based methods to implement zero-copy for ublk Ziyang Zhang
4 siblings, 1 reply; 13+ messages in thread
From: Xiaoguang Wang @ 2023-02-15 0:41 UTC (permalink / raw)
To: linux-block, io-uring, bpf; +Cc: ming.lei, axboe, asml.silence, ZiyangZhang
Currently only one ebpf helper, bpf_ublk_queue_sqe(), is added. A ublksrv
target can use this helper to write an ebpf prog to support ublk kernel &
userspace zero copy; please see the ublksrv test code for more info.
Signed-off-by: Xiaoguang Wang <[email protected]>
---
drivers/block/ublk_drv.c | 207 ++++++++++++++++++++++++++++++++-
include/uapi/linux/bpf.h | 1 +
include/uapi/linux/ublk_cmd.h | 11 ++
scripts/bpf_doc.py | 4 +
tools/include/uapi/linux/bpf.h | 8 ++
5 files changed, 229 insertions(+), 2 deletions(-)
diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index b628e9eaefa6..44c289b72864 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -61,6 +61,7 @@
struct ublk_rq_data {
struct llist_node node;
struct callback_head work;
+ struct io_mapped_kbuf *kbuf;
};
struct ublk_uring_cmd_pdu {
@@ -163,6 +164,9 @@ struct ublk_device {
unsigned int nr_queues_ready;
atomic_t nr_aborted_queues;
+ struct bpf_prog *io_prep_prog;
+ struct bpf_prog *io_submit_prog;
+
/*
* Our ubq->daemon may be killed without any notification, so
* monitor each queue's daemon periodically
@@ -189,10 +193,46 @@ static DEFINE_MUTEX(ublk_ctl_mutex);
static struct miscdevice ublk_misc;
+struct ublk_io_bpf_ctx {
+ struct ublk_bpf_ctx ctx;
+ struct ublk_device *ub;
+ struct callback_head work;
+};
+
+BPF_CALL_4(bpf_ublk_queue_sqe, struct ublk_io_bpf_ctx *, bpf_ctx,
+ struct io_uring_sqe *, sqe, u32, sqe_len, u32, fd)
+{
+ struct request *rq;
+ struct ublk_rq_data *data;
+ struct io_mapped_kbuf *kbuf;
+ u16 q_id = bpf_ctx->ctx.q_id;
+ u16 tag = bpf_ctx->ctx.tag;
+
+ rq = blk_mq_tag_to_rq(bpf_ctx->ub->tag_set.tags[q_id], tag);
+ data = blk_mq_rq_to_pdu(rq);
+ kbuf = data->kbuf;
+ io_uring_submit_sqe(fd, sqe, sqe_len, kbuf);
+ return 0;
+}
+
+const struct bpf_func_proto ublk_bpf_queue_sqe_proto = {
+ .func = bpf_ublk_queue_sqe,
+ .gpl_only = false,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_ANYTHING,
+ .arg2_type = ARG_ANYTHING,
+ .arg3_type = ARG_ANYTHING,
+};
+
static const struct bpf_func_proto *
ublk_bpf_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
{
- return bpf_base_func_proto(func_id);
+ switch (func_id) {
+ case BPF_FUNC_ublk_queue_sqe:
+ return &ublk_bpf_queue_sqe_proto;
+ default:
+ return bpf_base_func_proto(func_id);
+ }
}
static bool ublk_bpf_is_valid_access(int off, int size,
@@ -200,6 +240,23 @@ static bool ublk_bpf_is_valid_access(int off, int size,
const struct bpf_prog *prog,
struct bpf_insn_access_aux *info)
{
+ if (off < 0 || off >= sizeof(struct ublk_bpf_ctx))
+ return false;
+ if (off % size != 0)
+ return false;
+
+ switch (off) {
+ case offsetof(struct ublk_bpf_ctx, q_id):
+ return size == sizeof_field(struct ublk_bpf_ctx, q_id);
+ case offsetof(struct ublk_bpf_ctx, tag):
+ return size == sizeof_field(struct ublk_bpf_ctx, tag);
+ case offsetof(struct ublk_bpf_ctx, op):
+ return size == sizeof_field(struct ublk_bpf_ctx, op);
+ case offsetof(struct ublk_bpf_ctx, nr_sectors):
+ return size == sizeof_field(struct ublk_bpf_ctx, nr_sectors);
+ case offsetof(struct ublk_bpf_ctx, start_sector):
+ return size == sizeof_field(struct ublk_bpf_ctx, start_sector);
+ }
return false;
}
@@ -324,7 +381,7 @@ static void ublk_put_device(struct ublk_device *ub)
static inline struct ublk_queue *ublk_get_queue(struct ublk_device *dev,
int qid)
{
- return (struct ublk_queue *)&(dev->__queues[qid * dev->queue_size]);
+ return (struct ublk_queue *)&(dev->__queues[qid * dev->queue_size]);
}
static inline bool ublk_rq_has_data(const struct request *rq)
@@ -492,12 +549,16 @@ static inline int ublk_copy_user_pages(struct ublk_map_data *data,
static int ublk_map_io(const struct ublk_queue *ubq, const struct request *req,
struct ublk_io *io)
{
+ struct ublk_device *ub = ubq->dev;
const unsigned int rq_bytes = blk_rq_bytes(req);
/*
* no zero copy, we delay copy WRITE request data into ublksrv
* context and the big benefit is that pinning pages in current
* context is pretty fast, see ublk_pin_user_pages
*/
+ if ((req_op(req) == REQ_OP_WRITE) && ub->io_prep_prog)
+ return rq_bytes;
+
if (req_op(req) != REQ_OP_WRITE && req_op(req) != REQ_OP_FLUSH)
return rq_bytes;
@@ -860,6 +921,89 @@ static void ublk_queue_cmd(struct ublk_queue *ubq, struct request *rq)
}
}
+static void ublk_bpf_io_submit_fn(struct callback_head *work)
+{
+ struct ublk_io_bpf_ctx *bpf_ctx = container_of(work,
+ struct ublk_io_bpf_ctx, work);
+
+ if (bpf_ctx->ub->io_submit_prog)
+ bpf_prog_run_pin_on_cpu(bpf_ctx->ub->io_submit_prog, bpf_ctx);
+ kfree(bpf_ctx);
+}
+
+static int ublk_init_uring_kbuf(struct request *rq)
+{
+ struct bio_vec *bvec;
+ struct req_iterator rq_iter;
+ struct bio_vec tmp;
+ int nr_bvec = 0;
+ struct io_mapped_kbuf *kbuf;
+ struct ublk_rq_data *data = blk_mq_rq_to_pdu(rq);
+
+ /* Drop previous allocation */
+ if (data->kbuf) {
+ kfree(data->kbuf->bvec);
+ kfree(data->kbuf);
+ data->kbuf = NULL;
+ }
+
+ kbuf = kmalloc(sizeof(struct io_mapped_kbuf), GFP_NOIO);
+ if (!kbuf)
+ return -EIO;
+
+ rq_for_each_bvec(tmp, rq, rq_iter)
+ nr_bvec++;
+
+ bvec = kmalloc_array(nr_bvec, sizeof(struct bio_vec), GFP_NOIO);
+ if (!bvec) {
+ kfree(kbuf);
+ return -EIO;
+ }
+ kbuf->bvec = bvec;
+ rq_for_each_bvec(tmp, rq, rq_iter) {
+ *bvec = tmp;
+ bvec++;
+ }
+
+ kbuf->count = blk_rq_bytes(rq);
+ kbuf->nr_bvecs = nr_bvec;
+ data->kbuf = kbuf;
+ return 0;
+}
+
+static int ublk_run_bpf_prog(struct ublk_queue *ubq, struct request *rq)
+{
+ int err;
+ struct ublk_device *ub = ubq->dev;
+ struct bpf_prog *prog = ub->io_prep_prog;
+ struct ublk_io_bpf_ctx *bpf_ctx;
+
+ if (!prog)
+ return 0;
+
+ bpf_ctx = kmalloc(sizeof(struct ublk_io_bpf_ctx), GFP_NOIO);
+ if (!bpf_ctx)
+ return -EIO;
+
+ err = ublk_init_uring_kbuf(rq);
+ if (err < 0) {
+ kfree(bpf_ctx);
+ return -EIO;
+ }
+ bpf_ctx->ub = ub;
+ bpf_ctx->ctx.q_id = ubq->q_id;
+ bpf_ctx->ctx.tag = rq->tag;
+ bpf_ctx->ctx.op = req_op(rq);
+ bpf_ctx->ctx.nr_sectors = blk_rq_sectors(rq);
+ bpf_ctx->ctx.start_sector = blk_rq_pos(rq);
+ bpf_prog_run_pin_on_cpu(prog, bpf_ctx);
+
+ init_task_work(&bpf_ctx->work, ublk_bpf_io_submit_fn);
+ if (task_work_add(ubq->ubq_daemon, &bpf_ctx->work, TWA_SIGNAL_NO_IPI))
+ kfree(bpf_ctx);
+ return 0;
+}
+
static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx,
const struct blk_mq_queue_data *bd)
{
@@ -872,6 +1016,9 @@ static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx,
if (unlikely(res != BLK_STS_OK))
return BLK_STS_IOERR;
+ /* Currently just for test. */
+ ublk_run_bpf_prog(ubq, rq);
+
/* With recovery feature enabled, force_abort is set in
* ublk_stop_dev() before calling del_gendisk(). We have to
* abort all requeued and new rqs here to let del_gendisk()
@@ -2009,6 +2156,56 @@ static int ublk_ctrl_end_recovery(struct io_uring_cmd *cmd)
return ret;
}
+static int ublk_ctrl_reg_bpf_prog(struct io_uring_cmd *cmd)
+{
+ struct ublksrv_ctrl_cmd *header = (struct ublksrv_ctrl_cmd *)cmd->cmd;
+ struct ublk_device *ub;
+ struct bpf_prog *prog;
+ int ret = 0;
+
+ ub = ublk_get_device_from_id(header->dev_id);
+ if (!ub)
+ return -EINVAL;
+
+ mutex_lock(&ub->mutex);
+ prog = bpf_prog_get_type(header->data[0], BPF_PROG_TYPE_UBLK);
+ if (IS_ERR(prog)) {
+ ret = PTR_ERR(prog);
+ goto out_unlock;
+ }
+ ub->io_prep_prog = prog;
+
+ prog = bpf_prog_get_type(header->data[1], BPF_PROG_TYPE_UBLK);
+ if (IS_ERR(prog)) {
+ ret = PTR_ERR(prog);
+ goto out_unlock;
+ }
+ ub->io_submit_prog = prog;
+
+out_unlock:
+ mutex_unlock(&ub->mutex);
+ ublk_put_device(ub);
+ return ret;
+}
+
+static int ublk_ctrl_unreg_bpf_prog(struct io_uring_cmd *cmd)
+{
+ struct ublksrv_ctrl_cmd *header = (struct ublksrv_ctrl_cmd *)cmd->cmd;
+ struct ublk_device *ub;
+
+ ub = ublk_get_device_from_id(header->dev_id);
+ if (!ub)
+ return -EINVAL;
+
+ mutex_lock(&ub->mutex);
+ bpf_prog_put(ub->io_prep_prog);
+ bpf_prog_put(ub->io_submit_prog);
+ ub->io_prep_prog = NULL;
+ ub->io_submit_prog = NULL;
+ mutex_unlock(&ub->mutex);
+ ublk_put_device(ub);
+ return 0;
+}
static int ublk_ctrl_uring_cmd(struct io_uring_cmd *cmd,
unsigned int issue_flags)
{
@@ -2059,6 +2256,12 @@ static int ublk_ctrl_uring_cmd(struct io_uring_cmd *cmd,
case UBLK_CMD_END_USER_RECOVERY:
ret = ublk_ctrl_end_recovery(cmd);
break;
+ case UBLK_CMD_REG_BPF_PROG:
+ ret = ublk_ctrl_reg_bpf_prog(cmd);
+ break;
+ case UBLK_CMD_UNREG_BPF_PROG:
+ ret = ublk_ctrl_unreg_bpf_prog(cmd);
+ break;
default:
break;
}
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 515b7b995b3a..578d65e9f30e 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -5699,6 +5699,7 @@ union bpf_attr {
FN(user_ringbuf_drain, 209, ##ctx) \
FN(cgrp_storage_get, 210, ##ctx) \
FN(cgrp_storage_delete, 211, ##ctx) \
+ FN(ublk_queue_sqe, 212, ##ctx) \
/* */
/* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that don't
diff --git a/include/uapi/linux/ublk_cmd.h b/include/uapi/linux/ublk_cmd.h
index 8f88e3a29998..a43b1864de51 100644
--- a/include/uapi/linux/ublk_cmd.h
+++ b/include/uapi/linux/ublk_cmd.h
@@ -17,6 +17,8 @@
#define UBLK_CMD_STOP_DEV 0x07
#define UBLK_CMD_SET_PARAMS 0x08
#define UBLK_CMD_GET_PARAMS 0x09
+#define UBLK_CMD_REG_BPF_PROG 0x0a
+#define UBLK_CMD_UNREG_BPF_PROG 0x0b
#define UBLK_CMD_START_USER_RECOVERY 0x10
#define UBLK_CMD_END_USER_RECOVERY 0x11
/*
@@ -230,4 +232,13 @@ struct ublk_params {
struct ublk_param_discard discard;
};
+struct ublk_bpf_ctx {
+ __u32 t_val;
+ __u16 q_id;
+ __u16 tag;
+ __u8 op;
+ __u32 nr_sectors;
+ __u64 start_sector;
+};
+
#endif
diff --git a/scripts/bpf_doc.py b/scripts/bpf_doc.py
index e8d90829f23e..f8672294e145 100755
--- a/scripts/bpf_doc.py
+++ b/scripts/bpf_doc.py
@@ -700,6 +700,8 @@ class PrinterHelpers(Printer):
'struct bpf_dynptr',
'struct iphdr',
'struct ipv6hdr',
+ 'struct ublk_io_bpf_ctx',
+ 'struct io_uring_sqe',
]
known_types = {
'...',
@@ -755,6 +757,8 @@ class PrinterHelpers(Printer):
'const struct bpf_dynptr',
'struct iphdr',
'struct ipv6hdr',
+ 'struct ublk_io_bpf_ctx',
+ 'struct io_uring_sqe',
}
mapped_types = {
'u8': '__u8',
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 515b7b995b3a..530094246e2a 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -5485,6 +5485,13 @@ union bpf_attr {
* 0 on success.
*
* **-ENOENT** if the bpf_local_storage cannot be found.
+ *
+ * u64 bpf_ublk_queue_sqe(struct ublk_io_bpf_ctx *ctx, struct io_uring_sqe *sqe, u32 sqe_len, u32 fd)
+ * Description
+ * Submit ublk io requests.
+ * Return
+ * 0 on success.
+ *
*/
#define ___BPF_FUNC_MAPPER(FN, ctx...) \
FN(unspec, 0, ##ctx) \
@@ -5699,6 +5706,7 @@ union bpf_attr {
FN(user_ringbuf_drain, 209, ##ctx) \
FN(cgrp_storage_get, 210, ##ctx) \
FN(cgrp_storage_delete, 211, ##ctx) \
+ FN(ublk_queue_sqe, 212, ##ctx) \
/* */
/* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that don't
--
2.31.1
* [UBLKSRV] Add ebpf support.
2023-02-15 0:41 [RFC 0/3] Add io_uring & ebpf based methods to implement zero-copy for ublk Xiaoguang Wang
` (2 preceding siblings ...)
2023-02-15 0:41 ` [RFC 3/3] ublk_drv: add ebpf support Xiaoguang Wang
@ 2023-02-15 0:46 ` Xiaoguang Wang
2023-02-16 8:28 ` Ming Lei
2023-02-15 8:40 ` [RFC 0/3] Add io_uring & ebpf based methods to implement zero-copy for ublk Ziyang Zhang
4 siblings, 1 reply; 13+ messages in thread
From: Xiaoguang Wang @ 2023-02-15 0:46 UTC (permalink / raw)
To: linux-block, io-uring, bpf; +Cc: ming.lei, axboe, asml.silence, ZiyangZhang
Signed-off-by: Xiaoguang Wang <[email protected]>
---
bpf/ublk.bpf.c | 168 +++++++++++++++++++++++++++++++++++++++++
include/ublk_cmd.h | 2 +
include/ublksrv.h | 8 ++
include/ublksrv_priv.h | 1 +
include/ublksrv_tgt.h | 1 +
lib/ublksrv.c | 4 +
lib/ublksrv_cmd.c | 21 ++++++
tgt_loop.cpp | 31 +++++++-
ublksrv_tgt.cpp | 33 ++++++++
9 files changed, 268 insertions(+), 1 deletion(-)
create mode 100644 bpf/ublk.bpf.c
diff --git a/bpf/ublk.bpf.c b/bpf/ublk.bpf.c
new file mode 100644
index 0000000..80e79de
--- /dev/null
+++ b/bpf/ublk.bpf.c
@@ -0,0 +1,168 @@
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_core_read.h>
+
+
+static long (*bpf_ublk_queue_sqe)(void *ctx, struct io_uring_sqe *sqe,
+ u32 sqe_len, u32 fd) = (void *) 212;
+
+int target_fd = -1;
+
+struct sqe_key {
+ u16 q_id;
+ u16 tag;
+ u32 res;
+ u64 offset;
+};
+
+struct sqe_data {
+ char data[128];
+};
+
+struct {
+ __uint(type, BPF_MAP_TYPE_HASH);
+ __uint(max_entries, 8192);
+ __type(key, struct sqe_key);
+ __type(value, struct sqe_data);
+} sqes_map SEC(".maps");
+
+struct {
+ __uint(type, BPF_MAP_TYPE_ARRAY);
+ __uint(max_entries, 128);
+ __type(key, int);
+ __type(value, int);
+} uring_fd_map SEC(".maps");
+
+static inline void io_uring_prep_rw(__u8 op, struct io_uring_sqe *sqe, int fd,
+ const void *addr, unsigned len,
+ __u64 offset)
+{
+ sqe->opcode = op;
+ sqe->flags = 0;
+ sqe->ioprio = 0;
+ sqe->fd = fd;
+ sqe->off = offset;
+ sqe->addr = (unsigned long) addr;
+ sqe->len = len;
+ sqe->fsync_flags = 0;
+ sqe->buf_index = 0;
+ sqe->personality = 0;
+ sqe->splice_fd_in = 0;
+ sqe->addr3 = 0;
+ sqe->__pad2[0] = 0;
+}
+
+static inline void io_uring_prep_nop(struct io_uring_sqe *sqe)
+{
+ io_uring_prep_rw(IORING_OP_NOP, sqe, -1, 0, 0, 0);
+}
+
+static inline void io_uring_prep_read(struct io_uring_sqe *sqe, int fd,
+ void *buf, unsigned nbytes, off_t offset)
+{
+ io_uring_prep_rw(IORING_OP_READ, sqe, fd, buf, nbytes, offset);
+}
+
+static inline void io_uring_prep_write(struct io_uring_sqe *sqe, int fd,
+ const void *buf, unsigned nbytes, off_t offset)
+{
+ io_uring_prep_rw(IORING_OP_WRITE, sqe, fd, buf, nbytes, offset);
+}
+
+/*
+static u64 submit_sqe(struct bpf_map *map, void *key, void *value, void *data)
+{
+ struct io_uring_sqe *sqe = (struct io_uring_sqe *)value;
+ struct ublk_bpf_ctx *ctx = ((struct callback_ctx *)data)->ctx;
+ struct sqe_key *skey = (struct sqe_key *)key;
+ char fmt[] ="submit sqe for req[qid:%u tag:%u]\n";
+ char fmt2[] ="submit sqe test prep\n";
+ u16 qid, tag;
+ int q_id = skey->q_id, *ring_fd;
+
+ bpf_trace_printk(fmt2, sizeof(fmt2));
+ ring_fd = bpf_map_lookup_elem(&uring_fd_map, &q_id);
+ if (ring_fd) {
+ bpf_trace_printk(fmt, sizeof(fmt), qid, skey->tag);
+ bpf_ublk_queue_sqe(ctx, sqe, 128, *ring_fd);
+ bpf_map_delete_elem(map, key);
+ }
+ return 0;
+}
+*/
+
+static inline __u64 build_user_data(unsigned tag, unsigned op,
+ unsigned tgt_data, unsigned is_target_io,
+ unsigned is_bpf_io)
+{
+ return tag | (op << 16) | (tgt_data << 24) | (__u64)is_target_io << 63 |
+ (__u64)is_bpf_io << 60;
+}
+
+SEC("ublk.s/")
+int ublk_io_prep_prog(struct ublk_bpf_ctx *ctx)
+{
+ struct io_uring_sqe *sqe;
+ struct sqe_data sd = {0};
+ struct sqe_key key;
+ u16 q_id = ctx->q_id;
+ u8 op; // = ctx->op;
+ u32 nr_sectors = ctx->nr_sectors;
+ u64 start_sector = ctx->start_sector;
+ char fmt_1[] ="ublk_io_prep_prog %d %d\n";
+
+ key.q_id = ctx->q_id;
+ key.tag = ctx->tag;
+ key.offset = 0;
+ key.res = 0;
+
+ bpf_probe_read_kernel(&op, 1, &ctx->op);
+ bpf_trace_printk(fmt_1, sizeof(fmt_1), q_id, op);
+ sqe = (struct io_uring_sqe *)&sd;
+ if (op == REQ_OP_READ) {
+ char fmt[] ="add read sae\n";
+
+ bpf_trace_printk(fmt, sizeof(fmt));
+ io_uring_prep_read(sqe, target_fd, 0, nr_sectors << 9,
+ start_sector << 9);
+ sqe->user_data = build_user_data(ctx->tag, op, 0, 1, 1);
+ bpf_map_update_elem(&sqes_map, &key, &sd, BPF_NOEXIST);
+ } else if (op == REQ_OP_WRITE) {
+ char fmt[] ="add write sae\n";
+
+ bpf_trace_printk(fmt, sizeof(fmt));
+
+ io_uring_prep_write(sqe, target_fd, 0, nr_sectors << 9,
+ start_sector << 9);
+ sqe->user_data = build_user_data(ctx->tag, op, 0, 1, 1);
+ bpf_map_update_elem(&sqes_map, &key, &sd, BPF_NOEXIST);
+ } else {
+ ;
+ }
+ return 0;
+}
+
+SEC("ublk.s/")
+int ublk_io_submit_prog(struct ublk_bpf_ctx *ctx)
+{
+ struct io_uring_sqe *sqe;
+ char fmt[] ="submit sqe for req[qid:%u tag:%u]\n";
+ int q_id = ctx->q_id, *ring_fd;
+ struct sqe_key key;
+
+ key.q_id = ctx->q_id;
+ key.tag = ctx->tag;
+ key.offset = 0;
+ key.res = 0;
+
+ sqe = bpf_map_lookup_elem(&sqes_map, &key);
+ ring_fd = bpf_map_lookup_elem(&uring_fd_map, &q_id);
+ if (ring_fd) {
+ bpf_trace_printk(fmt, sizeof(fmt), key.q_id, key.tag);
+ bpf_ublk_queue_sqe(ctx, sqe, 128, *ring_fd);
+ bpf_map_delete_elem(&sqes_map, &key);
+ }
+ return 0;
+}
+
+char LICENSE[] SEC("license") = "GPL";
diff --git a/include/ublk_cmd.h b/include/ublk_cmd.h
index f6238cc..893ba8c 100644
--- a/include/ublk_cmd.h
+++ b/include/ublk_cmd.h
@@ -17,6 +17,8 @@
#define UBLK_CMD_STOP_DEV 0x07
#define UBLK_CMD_SET_PARAMS 0x08
#define UBLK_CMD_GET_PARAMS 0x09
+#define UBLK_CMD_REG_BPF_PROG 0x0a
+#define UBLK_CMD_UNREG_BPF_PROG 0x0b
#define UBLK_CMD_START_USER_RECOVERY 0x10
#define UBLK_CMD_END_USER_RECOVERY 0x11
#define UBLK_CMD_GET_DEV_INFO2 0x12
diff --git a/include/ublksrv.h b/include/ublksrv.h
index d38bd46..f5deddb 100644
--- a/include/ublksrv.h
+++ b/include/ublksrv.h
@@ -106,6 +106,7 @@ struct ublksrv_tgt_info {
unsigned int nr_fds;
int fds[UBLKSRV_TGT_MAX_FDS];
void *tgt_data;
+ void *tgt_bpf_obj;
/*
* Extra IO slots for each queue, target code can reserve some
@@ -263,6 +264,8 @@ struct ublksrv_tgt_type {
int (*init_queue)(const struct ublksrv_queue *, void **queue_data_ptr);
void (*deinit_queue)(const struct ublksrv_queue *);
+ int (*init_queue_bpf)(const struct ublksrv_dev *dev, const struct ublksrv_queue *q);
+
unsigned long reserved[5];
};
@@ -318,6 +321,11 @@ extern void ublksrv_ctrl_prep_recovery(struct ublksrv_ctrl_dev *dev,
const char *recovery_jbuf);
extern const char *ublksrv_ctrl_get_recovery_jbuf(const struct ublksrv_ctrl_dev *dev);
+extern void ublksrv_ctrl_set_bpf_obj_info(struct ublksrv_ctrl_dev *dev,
+ void *obj);
+extern int ublksrv_ctrl_reg_bpf_prog(struct ublksrv_ctrl_dev *dev,
+ int io_prep_fd, int io_submit_fd);
+
/* ublksrv device ("/dev/ublkcN") level APIs */
extern const struct ublksrv_dev *ublksrv_dev_init(const struct ublksrv_ctrl_dev *
ctrl_dev);
diff --git a/include/ublksrv_priv.h b/include/ublksrv_priv.h
index 2996baa..8da8866 100644
--- a/include/ublksrv_priv.h
+++ b/include/ublksrv_priv.h
@@ -42,6 +42,7 @@ struct ublksrv_ctrl_dev {
const char *tgt_type;
const struct ublksrv_tgt_type *tgt_ops;
+ void *bpf_obj;
/*
* default is UBLKSRV_RUN_DIR but can be specified via command line,
diff --git a/include/ublksrv_tgt.h b/include/ublksrv_tgt.h
index 234d31e..e0db7d9 100644
--- a/include/ublksrv_tgt.h
+++ b/include/ublksrv_tgt.h
@@ -9,6 +9,7 @@
#include <getopt.h>
#include <string.h>
#include <stdarg.h>
+#include <limits.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/ioctl.h>
diff --git a/lib/ublksrv.c b/lib/ublksrv.c
index 96bed95..110ccb3 100644
--- a/lib/ublksrv.c
+++ b/lib/ublksrv.c
@@ -603,6 +603,9 @@ skip_alloc_buf:
goto fail;
}
+ if (dev->tgt.ops->init_queue_bpf)
+ dev->tgt.ops->init_queue_bpf(tdev, local_to_tq(q));
+
ublksrv_dev_init_io_cmds(dev, q);
/*
@@ -723,6 +726,7 @@ const struct ublksrv_dev *ublksrv_dev_init(const struct ublksrv_ctrl_dev *ctrl_d
}
tgt->fds[0] = dev->cdev_fd;
+ tgt->tgt_bpf_obj = ctrl_dev->bpf_obj;
ret = ublksrv_tgt_init(dev, ctrl_dev->tgt_type, ctrl_dev->tgt_ops,
ctrl_dev->tgt_argc, ctrl_dev->tgt_argv);
diff --git a/lib/ublksrv_cmd.c b/lib/ublksrv_cmd.c
index 0d7265d..0101cb9 100644
--- a/lib/ublksrv_cmd.c
+++ b/lib/ublksrv_cmd.c
@@ -502,6 +502,27 @@ int ublksrv_ctrl_end_recovery(struct ublksrv_ctrl_dev *dev, int daemon_pid)
return ret;
}
+int ublksrv_ctrl_reg_bpf_prog(struct ublksrv_ctrl_dev *dev,
+ int io_prep_fd, int io_submit_fd)
+{
+ struct ublksrv_ctrl_cmd_data data = {
+ .cmd_op = UBLK_CMD_REG_BPF_PROG,
+ .flags = CTRL_CMD_HAS_DATA,
+ };
+ int ret;
+
+ data.data[0] = io_prep_fd;
+ data.data[1] = io_submit_fd;
+
+ ret = __ublksrv_ctrl_cmd(dev, &data);
+ return ret;
+}
+
+void ublksrv_ctrl_set_bpf_obj_info(struct ublksrv_ctrl_dev *dev, void *obj)
+{
+ dev->bpf_obj = obj;
+}
+
const struct ublksrv_ctrl_dev_info *ublksrv_ctrl_get_dev_info(
const struct ublksrv_ctrl_dev *dev)
{
diff --git a/tgt_loop.cpp b/tgt_loop.cpp
index 79a65d3..b1568fe 100644
--- a/tgt_loop.cpp
+++ b/tgt_loop.cpp
@@ -4,7 +4,11 @@
#include <poll.h>
#include <sys/epoll.h>
+#include <linux/bpf.h>
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
#include "ublksrv_tgt.h"
+#include "bpf/.tmp/ublk.skel.h"
static bool backing_supports_discard(char *name)
{
@@ -88,6 +92,20 @@ static int loop_recovery_tgt(struct ublksrv_dev *dev, int type)
return 0;
}
+static int loop_init_queue_bpf(const struct ublksrv_dev *dev,
+ const struct ublksrv_queue *q)
+{
+ int ret, q_id, ring_fd;
+ const struct ublksrv_tgt_info *tgt = &dev->tgt;
+ struct ublk_bpf *obj = (struct ublk_bpf*)tgt->tgt_bpf_obj;
+
+ q_id = q->q_id;
+ ring_fd = q->ring_ptr->ring_fd;
+ ret = bpf_map_update_elem(bpf_map__fd(obj->maps.uring_fd_map), &q_id,
+ &ring_fd, 0);
+ return ret;
+}
+
static int loop_init_tgt(struct ublksrv_dev *dev, int type, int argc, char
*argv[])
{
@@ -125,6 +143,7 @@ static int loop_init_tgt(struct ublksrv_dev *dev, int type, int argc, char
},
};
bool can_discard = false;
+ struct ublk_bpf *bpf_obj;
strcpy(tgt_json.name, "loop");
@@ -218,6 +237,10 @@ static int loop_init_tgt(struct ublksrv_dev *dev, int type, int argc, char
jbuf = ublksrv_tgt_realloc_json_buf(dev, &jbuf_size);
} while (ret < 0);
+ if (tgt->tgt_bpf_obj) {
+ bpf_obj = (struct ublk_bpf *)tgt->tgt_bpf_obj;
+ bpf_obj->data->target_fd = tgt->fds[1];
+ }
return 0;
}
@@ -252,9 +275,14 @@ static int loop_queue_tgt_io(const struct ublksrv_queue *q,
const struct ublk_io_data *data, int tag)
{
const struct ublksrv_io_desc *iod = data->iod;
- struct io_uring_sqe *sqe = io_uring_get_sqe(q->ring_ptr);
+ struct io_uring_sqe *sqe;
unsigned ublk_op = ublksrv_get_op(iod);
+ /* ebpf prog wil handle read/write requests. */
+ if ((ublk_op == UBLK_IO_OP_READ) || (ublk_op == UBLK_IO_OP_WRITE))
+ return 1;
+
+ sqe = io_uring_get_sqe(q->ring_ptr);
if (!sqe)
return 0;
@@ -374,6 +402,7 @@ struct ublksrv_tgt_type loop_tgt_type = {
.type = UBLKSRV_TGT_TYPE_LOOP,
.name = "loop",
.recovery_tgt = loop_recovery_tgt,
+ .init_queue_bpf = loop_init_queue_bpf,
};
static void tgt_loop_init() __attribute__((constructor));
diff --git a/ublksrv_tgt.cpp b/ublksrv_tgt.cpp
index 5ed328d..d3796cf 100644
--- a/ublksrv_tgt.cpp
+++ b/ublksrv_tgt.cpp
@@ -2,6 +2,7 @@
#include "config.h"
#include "ublksrv_tgt.h"
+#include "bpf/.tmp/ublk.skel.h"
/* per-task variable */
static pthread_mutex_t jbuf_lock;
@@ -575,6 +576,31 @@ static void ublksrv_tgt_set_params(struct ublksrv_ctrl_dev *cdev,
}
}
+static int ublksrv_tgt_load_bpf_prog(struct ublksrv_ctrl_dev *cdev)
+{
+ struct ublk_bpf *obj;
+ int ret, io_prep_fd, io_submit_fd;
+
+ obj = ublk_bpf__open();
+ if (!obj) {
+ fprintf(stderr, "failed to open BPF object\n");
+ return -1;
+ }
+ ret = ublk_bpf__load(obj);
+ if (ret) {
+ fprintf(stderr, "failed to load BPF object\n");
+ return -1;
+ }
+
+
+ io_prep_fd = bpf_program__fd(obj->progs.ublk_io_prep_prog);
+ io_submit_fd = bpf_program__fd(obj->progs.ublk_io_submit_prog);
+ ret = ublksrv_ctrl_reg_bpf_prog(cdev, io_prep_fd, io_submit_fd);
+ if (!ret)
+ ublksrv_ctrl_set_bpf_obj_info(cdev, obj);
+ return ret;
+}
+
static int cmd_dev_add(int argc, char *argv[])
{
static const struct option longopts[] = {
@@ -696,6 +722,13 @@ static int cmd_dev_add(int argc, char *argv[])
goto fail;
}
+ ret = ublksrv_tgt_load_bpf_prog(dev);
+ if (ret < 0) {
+ fprintf(stderr, "dev %d load bpf prog failed, ret %d\n",
+ data.dev_id, ret);
+ goto fail_stop_daemon;
+ }
+
{
const struct ublksrv_ctrl_dev_info *info =
ublksrv_ctrl_get_dev_info(dev);
--
2.31.1
* Re: [RFC 0/3] Add io_uring & ebpf based methods to implement zero-copy for ublk
2023-02-15 0:41 [RFC 0/3] Add io_uring & ebpf based methods to implement zero-copy for ublk Xiaoguang Wang
` (3 preceding siblings ...)
2023-02-15 0:46 ` [UBLKSRV] Add " Xiaoguang Wang
@ 2023-02-15 8:40 ` Ziyang Zhang
4 siblings, 0 replies; 13+ messages in thread
From: Ziyang Zhang @ 2023-02-15 8:40 UTC (permalink / raw)
To: linux-block, io-uring, bpf; +Cc: ming.lei, axboe, asml.silence, Xiaoguang Wang
On 2023/2/15 08:41, Xiaoguang Wang wrote:
> Normally, userspace block device impementations need to copy data between
> kernel block layer's io requests and userspace block device's userspace
> daemon, for example, ublk and tcmu both have similar logic, but this
> operation will consume cpu resources obviously, especially for large io.
>
> There are methods trying to reduce these cpu overheads, then userspace
> block device's io performance will be improved further. These methods
> contain: 1) use special hardware to do memory copy, but seems not all
> architectures have these special hardware; 2) sofeware methods, such as
> mmap kernel block layer's io requests's data to userspace daemon [1],
> but it has page table's map/unmap, tlb flush overhead, security issue,
> etc, and it maybe only friendly to large io.
>
> Add a new program type BPF_PROG_TYPE_UBLK for ublk, which is a generic
> framework for implementing block device logic from userspace. Typical
> userspace block device impementations need to copy data between kernel
> block layer's io requests and userspace block device's userspace daemon,
> which will consume cpu resources, especially for large io.
>
> To solve this problem, I'd propose a new method, which will combine the
> respective advantages of io_uring and ebpf. Add a new program type
> BPF_PROG_TYPE_UBLK for ublk, userspace block device daemon process should
> register an ebpf prog. This bpf prog will use bpf helper offered by ublk
> bpf prog type to submit io requests on behalf of daemon process.
> Currently there is only one helper:
> u64 bpf_ublk_queue_sqe(struct ublk_io_bpf_ctx *bpf_ctx,
> struct io_uring_sqe *sqe, u32 sqe_len, u32, fd)
>
> This helper will use io_uring to submit io requests, so we need to make
> io_uring be able to submit a sqe located in kernel(Some codes idea comes
> from Pavel's patchset [2], but pavel's patch needs sqe->buf still comes
> from userspace addr), and bpf prog initializes sqes, but does not need to
> initializes sqes' buf field, sqe->buf will come from kernel block layer io
> requests in some form. See patch 2 for more.
>
> In example of ublk loop target, we can easily implement such below logic in
> ebpf prog:
> 1. userspace daemon registers an ebpf prog and passes two backend file
> fd in ebpf map structure。
> 2. For kernel io requests against the first half of userspace device,
> ebpf prog prepares an io_uring sqe, which will submit io against the first
> backend file fd and sqe's buffer comes from kernel io reqeusts. Kernel
> io requests against second half of userspace device has similar logic,
> only sqe's fd will be the second backend file fd.
> 3. When ublk driver blk-mq queue_rq() is called, this ebpf prog will
> be executed and completes kernel io requests.
>
> That means, by using ebpf, we can implement various userspace log in kernel.
>
> From above expample, we can see that this method has 3 advantages at least:
> 1. Remove memory copy between kernel block layer and userspace daemon
> completely.
> 2. Save memory. Userspace daemon doesn't need to maintain memory to
> issue and complete io requests, and use kernel block layer io requests
> memory directly.
> 2. We may reduce the number of round trips between kernel and userspace
> daemon, so may reduce kernel & userspace context switch overheads.
>
> Test:
> Add a ublk loop target: ublk add -t loop -q 1 -d 128 -f loop.file
>
> fio job file:
> [global]
> direct=1
> filename=/dev/ublkb0
> time_based
> runtime=60
> numjobs=1
> cpus_allowed=1
>
> [rand-read-4k]
> bs=512K
> iodepth=16
> ioengine=libaio
> rw=randwrite
> stonewall
>
>
> Without this patch:
> WRITE: bw=745MiB/s (781MB/s), 745MiB/s-745MiB/s (781MB/s-781MB/s), io=43.6GiB (46.8GB), run=60010-60010msec
> ublk daemon's cpu utilization is about 9.3%~10.0%, showed by top tool.
>
> With this patch:
> WRITE: bw=744MiB/s (781MB/s), 744MiB/s-744MiB/s (781MB/s-781MB/s), io=43.6GiB (46.8GB), run=60012-60012msec
> ublk daemon's cpu utilization is about 1.3%~1.7%, showed by top tool.
>
> From above tests, this method can reduce cpu copy overhead obviously.
>
>
> TODO:
> I must say this patchset is just a RFC for design.
>
> 1) Currently for this patchset, I just make ublk ebpf prog submit io requests
> using io_uring in kernel, cqe event still needs to be handled in userspace
> daemon. Once later we succeed in make io_uring handle cqe in kernel, ublk
> ebpf prog can implement io in kernel.
>
> 2) ublk driver needs to work better with ebpf, currently I did some hack
> codes to support ebpf in ublk driver, it only can support write requests.
>
> 3) I have not done much tests yet, will run liburing/ublk/blktests
> later.
>
> Any review and suggestions are welcome, thanks.
>
> [1] https://lore.kernel.org/all/[email protected]/
> [2] https://lore.kernel.org/all/[email protected]/
>
>
> Xiaoguang Wang (3):
> bpf: add UBLK program type
> io_uring: enable io_uring to submit sqes located in kernel
> ublk_drv: add ebpf support
>
> drivers/block/ublk_drv.c | 228 ++++++++++++++++++++++++++++++++-
> include/linux/bpf_types.h | 2 +
> include/linux/io_uring.h | 13 ++
> include/linux/io_uring_types.h | 8 +-
> include/uapi/linux/bpf.h | 2 +
> include/uapi/linux/ublk_cmd.h | 11 ++
> io_uring/io_uring.c | 59 ++++++++-
> io_uring/rsrc.c | 15 +++
> io_uring/rsrc.h | 3 +
> io_uring/rw.c | 7 +
> kernel/bpf/syscall.c | 1 +
> kernel/bpf/verifier.c | 9 +-
> scripts/bpf_doc.py | 4 +
> tools/include/uapi/linux/bpf.h | 9 ++
> tools/lib/bpf/libbpf.c | 2 +
> 15 files changed, 366 insertions(+), 7 deletions(-)
>
Hi, here is the perf report output of the ublk daemon (loop target):
+ 57.96% 4.03% ublk liburing.so.2.2 [.] _io_uring_get_cqe ▒
+ 53.94% 0.00% ublk [kernel.vmlinux] [k] entry_SYSCALL_64 ◆
+ 53.94% 0.65% ublk [kernel.vmlinux] [k] do_syscall_64 ▒
+ 48.37% 1.18% ublk [kernel.vmlinux] [k] __do_sys_io_uring_enter ▒
+ 42.92% 1.72% ublk [kernel.vmlinux] [k] io_cqring_wait ▒
+ 35.17% 0.06% ublk [kernel.vmlinux] [k] task_work_run ▒
+ 34.75% 0.53% ublk [kernel.vmlinux] [k] io_run_task_work_sig ▒
+ 33.45% 0.00% ublk [kernel.vmlinux] [k] ublk_bpf_io_submit_fn ▒
+ 33.16% 0.06% ublk bpf_prog_3bdc6181a3c616fb_ublk_io_submit_prog [k] bpf_prog_3bdc6181a3c616fb_ublk_io_sub▒
+ 32.68% 0.00% iou-wrk-18583 [unknown] [k] 0000000000000000 ▒
+ 32.68% 0.00% iou-wrk-18583 [unknown] [k] 0x00007efe920b1040 ▒
+ 32.68% 0.00% iou-wrk-18583 [kernel.vmlinux] [k] ret_from_fork ▒
+ 32.68% 0.47% iou-wrk-18583 [kernel.vmlinux] [k] io_wqe_worker ▒
+ 30.61% 0.00% ublk [kernel.vmlinux] [k] io_submit_sqe ▒
+ 30.31% 0.06% ublk [kernel.vmlinux] [k] io_issue_sqe ▒
+ 28.00% 0.00% ublk [kernel.vmlinux] [k] bpf_ublk_queue_sqe ▒
+ 28.00% 0.00% ublk [kernel.vmlinux] [k] io_uring_submit_sqe ▒
+ 27.18% 0.00% ublk [kernel.vmlinux] [k] io_write ▒
+ 27.18% 0.00% ublk [xfs] [k] xfs_file_write_iter
The call stack is:
- 57.96% 4.03% ublk liburing.so.2.2 [.] _io_uring_get_cqe ◆
- 53.94% _io_uring_get_cqe ▒
entry_SYSCALL_64 ▒
- do_syscall_64 ▒
- 48.37% __do_sys_io_uring_enter ▒
- 42.92% io_cqring_wait ▒
- 34.75% io_run_task_work_sig ▒
- task_work_run ▒
- 32.50% ublk_bpf_io_submit_fn ▒
- 32.21% bpf_prog_3bdc6181a3c616fb_ublk_io_submit_prog ▒
- 27.12% bpf_ublk_queue_sqe ▒
- io_uring_submit_sqe ▒
- 26.64% io_submit_sqe ▒
- 26.35% io_issue_sqe ▒
- io_write ▒
xfs_file_write_iter ▒
Here, "io_submit" ebpf prog will be run in task_work of ublk daemon
process after io_uring_enter() syscall. In this ebpf prog, a sqe is
built and submitted. All information about this blk-mq request is
stored in a "ctx". Then io_uring can write to the backing file
(xfs_file_write_iter).
Here is the call stack from the perf report output of fio:
- 5.04% 0.18% fio [kernel.vmlinux] [k] ublk_queue_rq ▒
- 4.86% ublk_queue_rq ▒
- 3.67% bpf_prog_b8456549dbe40c37_ublk_io_prep_prog ▒
- 3.10% bpf_trace_printk ▒
2.83% _raw_spin_unlock_irqrestore ▒
- 0.70% task_work_add ▒
- try_to_wake_up ▒
_raw_spin_unlock_irqrestore ▒
Here, "io_prep" ebpf prog will be run in "ublk_queue_rq" process.
In this ebpf prog, qid, tag, nr_sectors, start_sector, op, flags
will be stored in one "ctx". Then we add a task_work to the ublk
daemon process.
Regards,
Zhang
* Re: [RFC 3/3] ublk_drv: add ebpf support
2023-02-15 0:41 ` [RFC 3/3] ublk_drv: add ebpf support Xiaoguang Wang
@ 2023-02-16 8:11 ` Ming Lei
2023-02-16 12:12 ` Xiaoguang Wang
0 siblings, 1 reply; 13+ messages in thread
From: Ming Lei @ 2023-02-16 8:11 UTC (permalink / raw)
To: Xiaoguang Wang
Cc: linux-block, io-uring, bpf, axboe, asml.silence, ZiyangZhang,
ming.lei
On Wed, Feb 15, 2023 at 08:41:22AM +0800, Xiaoguang Wang wrote:
> Currenly only one bpf_ublk_queue_sqe() ebpf is added, ublksrv target
> can use this helper to write ebpf prog to support ublk kernel & usersapce
> zero copy, please see ublksrv test codes for more info.
>
> Signed-off-by: Xiaoguang Wang <[email protected]>
> ---
> drivers/block/ublk_drv.c | 207 ++++++++++++++++++++++++++++++++-
> include/uapi/linux/bpf.h | 1 +
> include/uapi/linux/ublk_cmd.h | 11 ++
> scripts/bpf_doc.py | 4 +
> tools/include/uapi/linux/bpf.h | 8 ++
> 5 files changed, 229 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> index b628e9eaefa6..44c289b72864 100644
> --- a/drivers/block/ublk_drv.c
> +++ b/drivers/block/ublk_drv.c
> @@ -61,6 +61,7 @@
> struct ublk_rq_data {
> struct llist_node node;
> struct callback_head work;
> + struct io_mapped_kbuf *kbuf;
> };
>
> struct ublk_uring_cmd_pdu {
> @@ -163,6 +164,9 @@ struct ublk_device {
> unsigned int nr_queues_ready;
> atomic_t nr_aborted_queues;
>
> + struct bpf_prog *io_prep_prog;
> + struct bpf_prog *io_submit_prog;
> +
> /*
> * Our ubq->daemon may be killed without any notification, so
> * monitor each queue's daemon periodically
> @@ -189,10 +193,46 @@ static DEFINE_MUTEX(ublk_ctl_mutex);
>
> static struct miscdevice ublk_misc;
>
> +struct ublk_io_bpf_ctx {
> + struct ublk_bpf_ctx ctx;
> + struct ublk_device *ub;
> + struct callback_head work;
> +};
> +
> +BPF_CALL_4(bpf_ublk_queue_sqe, struct ublk_io_bpf_ctx *, bpf_ctx,
> + struct io_uring_sqe *, sqe, u32, sqe_len, u32, fd)
> +{
> + struct request *rq;
> + struct ublk_rq_data *data;
> + struct io_mapped_kbuf *kbuf;
> + u16 q_id = bpf_ctx->ctx.q_id;
> + u16 tag = bpf_ctx->ctx.tag;
> +
> + rq = blk_mq_tag_to_rq(bpf_ctx->ub->tag_set.tags[q_id], tag);
> + data = blk_mq_rq_to_pdu(rq);
> + kbuf = data->kbuf;
> + io_uring_submit_sqe(fd, sqe, sqe_len, kbuf);
> + return 0;
> +}
> +
> +const struct bpf_func_proto ublk_bpf_queue_sqe_proto = {
> + .func = bpf_ublk_queue_sqe,
> + .gpl_only = false,
> + .ret_type = RET_INTEGER,
> + .arg1_type = ARG_ANYTHING,
> + .arg2_type = ARG_ANYTHING,
> + .arg3_type = ARG_ANYTHING,
> +};
> +
> static const struct bpf_func_proto *
> ublk_bpf_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> {
> - return bpf_base_func_proto(func_id);
> + switch (func_id) {
> + case BPF_FUNC_ublk_queue_sqe:
> + return &ublk_bpf_queue_sqe_proto;
> + default:
> + return bpf_base_func_proto(func_id);
> + }
> }
>
> static bool ublk_bpf_is_valid_access(int off, int size,
> @@ -200,6 +240,23 @@ static bool ublk_bpf_is_valid_access(int off, int size,
> const struct bpf_prog *prog,
> struct bpf_insn_access_aux *info)
> {
> + if (off < 0 || off >= sizeof(struct ublk_bpf_ctx))
> + return false;
> + if (off % size != 0)
> + return false;
> +
> + switch (off) {
> + case offsetof(struct ublk_bpf_ctx, q_id):
> + return size == sizeof_field(struct ublk_bpf_ctx, q_id);
> + case offsetof(struct ublk_bpf_ctx, tag):
> + return size == sizeof_field(struct ublk_bpf_ctx, tag);
> + case offsetof(struct ublk_bpf_ctx, op):
> + return size == sizeof_field(struct ublk_bpf_ctx, op);
> + case offsetof(struct ublk_bpf_ctx, nr_sectors):
> + return size == sizeof_field(struct ublk_bpf_ctx, nr_sectors);
> + case offsetof(struct ublk_bpf_ctx, start_sector):
> + return size == sizeof_field(struct ublk_bpf_ctx, start_sector);
> + }
> return false;
> }
>
> @@ -324,7 +381,7 @@ static void ublk_put_device(struct ublk_device *ub)
> static inline struct ublk_queue *ublk_get_queue(struct ublk_device *dev,
> int qid)
> {
> - return (struct ublk_queue *)&(dev->__queues[qid * dev->queue_size]);
> + return (struct ublk_queue *)&(dev->__queues[qid * dev->queue_size]);
> }
>
> static inline bool ublk_rq_has_data(const struct request *rq)
> @@ -492,12 +549,16 @@ static inline int ublk_copy_user_pages(struct ublk_map_data *data,
> static int ublk_map_io(const struct ublk_queue *ubq, const struct request *req,
> struct ublk_io *io)
> {
> + struct ublk_device *ub = ubq->dev;
> const unsigned int rq_bytes = blk_rq_bytes(req);
> /*
> * no zero copy, we delay copy WRITE request data into ublksrv
> * context and the big benefit is that pinning pages in current
> * context is pretty fast, see ublk_pin_user_pages
> */
> + if ((req_op(req) == REQ_OP_WRITE) && ub->io_prep_prog)
> + return rq_bytes;
Can you explain a bit why READ isn't supported? Because WRITE zero
copy is supposed to be supported easily with splice based approach,
and I am more interested in READ zc actually.
> +
> if (req_op(req) != REQ_OP_WRITE && req_op(req) != REQ_OP_FLUSH)
> return rq_bytes;
>
> @@ -860,6 +921,89 @@ static void ublk_queue_cmd(struct ublk_queue *ubq, struct request *rq)
> }
> }
>
> +static void ublk_bpf_io_submit_fn(struct callback_head *work)
> +{
> + struct ublk_io_bpf_ctx *bpf_ctx = container_of(work,
> + struct ublk_io_bpf_ctx, work);
> +
> + if (bpf_ctx->ub->io_submit_prog)
> + bpf_prog_run_pin_on_cpu(bpf_ctx->ub->io_submit_prog, bpf_ctx);
> + kfree(bpf_ctx);
> +}
> +
> +static int ublk_init_uring_kbuf(struct request *rq)
> +{
> + struct bio_vec *bvec;
> + struct req_iterator rq_iter;
> + struct bio_vec tmp;
> + int nr_bvec = 0;
> + struct io_mapped_kbuf *kbuf;
> + struct ublk_rq_data *data = blk_mq_rq_to_pdu(rq);
> +
> + /* Drop previous allocation */
> + if (data->kbuf) {
> + kfree(data->kbuf->bvec);
> + kfree(data->kbuf);
> + data->kbuf = NULL;
> + }
> +
> + kbuf = kmalloc(sizeof(struct io_mapped_kbuf), GFP_NOIO);
> + if (!kbuf)
> + return -EIO;
> +
> + rq_for_each_bvec(tmp, rq, rq_iter)
> + nr_bvec++;
> +
> + bvec = kmalloc_array(nr_bvec, sizeof(struct bio_vec), GFP_NOIO);
> + if (!bvec) {
> + kfree(kbuf);
> + return -EIO;
> + }
> + kbuf->bvec = bvec;
> + rq_for_each_bvec(tmp, rq, rq_iter) {
> + *bvec = tmp;
> + bvec++;
> + }
> +
> + kbuf->count = blk_rq_bytes(rq);
> + kbuf->nr_bvecs = nr_bvec;
> + data->kbuf = kbuf;
> + return 0;
bio/req bvec table is immutable, so here you can pass its reference
to kbuf directly.
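For the single-bio case, that suggestion might look roughly like the sketch
below. This is not the posted patch: the io_mapped_kbuf fields follow the
RFC, the GFP_NOIO allocation is kept, and multi-bio requests would still
need per-bio handling, as discussed later in this thread.

/* Sketch only: borrow the immutable bvec table instead of copying it,
 * assuming the request carries data in a single bio. */
static int ublk_init_uring_kbuf(struct request *rq)
{
	struct ublk_rq_data *data = blk_mq_rq_to_pdu(rq);
	struct bio *bio = rq->bio;
	struct io_mapped_kbuf *kbuf;

	kbuf = kmalloc(sizeof(*kbuf), GFP_NOIO);
	if (!kbuf)
		return -ENOMEM;

	kbuf->bvec = bio->bi_io_vec;	/* reference, not a copy */
	kbuf->nr_bvecs = bio->bi_vcnt;
	kbuf->count = blk_rq_bytes(rq);
	data->kbuf = kbuf;
	return 0;
}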
> +}
> +
> +static int ublk_run_bpf_prog(struct ublk_queue *ubq, struct request *rq)
> +{
> + int err;
> + struct ublk_device *ub = ubq->dev;
> + struct bpf_prog *prog = ub->io_prep_prog;
> + struct ublk_io_bpf_ctx *bpf_ctx;
> +
> + if (!prog)
> + return 0;
> +
> + bpf_ctx = kmalloc(sizeof(struct ublk_io_bpf_ctx), GFP_NOIO);
> + if (!bpf_ctx)
> + return -EIO;
> +
> + err = ublk_init_uring_kbuf(rq);
> + if (err < 0) {
> + kfree(bpf_ctx);
> + return -EIO;
> + }
> + bpf_ctx->ub = ub;
> + bpf_ctx->ctx.q_id = ubq->q_id;
> + bpf_ctx->ctx.tag = rq->tag;
> + bpf_ctx->ctx.op = req_op(rq);
> + bpf_ctx->ctx.nr_sectors = blk_rq_sectors(rq);
> + bpf_ctx->ctx.start_sector = blk_rq_pos(rq);
The above is for setting up target io parameter, which is supposed
to be from userspace, cause it is result of user space logic. If
these parameters are from kernel, the whole logic has to be done
in io_prep_prog.
> + bpf_prog_run_pin_on_cpu(prog, bpf_ctx);
> +
> + init_task_work(&bpf_ctx->work, ublk_bpf_io_submit_fn);
> + if (task_work_add(ubq->ubq_daemon, &bpf_ctx->work, TWA_SIGNAL_NO_IPI))
> + kfree(bpf_ctx);
task_work_add() is only available in case of ublk builtin.
> + return 0;
> +}
> +
> static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx,
> const struct blk_mq_queue_data *bd)
> {
> @@ -872,6 +1016,9 @@ static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx,
> if (unlikely(res != BLK_STS_OK))
> return BLK_STS_IOERR;
>
> + /* Currently just for test. */
> + ublk_run_bpf_prog(ubq, rq);
Can you explain the above comment a bit? When is the io_prep_prog called
in the non-test version? Or can you post the non-test version in list
for review.
Here it is the key for understanding the whole idea, especially when
is io_prep_prog called finally? How to pass parameters to io_prep_prog?
Given it is ebpf prog, I don't think any userspace parameter can be
passed to io_prep_prog when submitting IO, that means all user logic has
to be done inside io_prep_prog? If yes, not sure if it is one good way,
cause ebpf prog is very limited programming environment, but the user
logic could be as complicated as using btree to map io, or communicating
with remote machine for figuring out the mapping. Loop is just the
simplest direct mapping.
Thanks,
Ming
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [UBLKSRV] Add ebpf support.
2023-02-15 0:46 ` [UBLKSRV] Add " Xiaoguang Wang
@ 2023-02-16 8:28 ` Ming Lei
2023-02-16 9:17 ` Xiaoguang Wang
0 siblings, 1 reply; 13+ messages in thread
From: Ming Lei @ 2023-02-16 8:28 UTC (permalink / raw)
To: Xiaoguang Wang
Cc: linux-block, io-uring, bpf, axboe, asml.silence, ZiyangZhang,
ming.lei
On Wed, Feb 15, 2023 at 08:46:18AM +0800, Xiaoguang Wang wrote:
> Signed-off-by: Xiaoguang Wang <[email protected]>
> ---
> bpf/ublk.bpf.c | 168 +++++++++++++++++++++++++++++++++++++++++
> include/ublk_cmd.h | 2 +
> include/ublksrv.h | 8 ++
> include/ublksrv_priv.h | 1 +
> include/ublksrv_tgt.h | 1 +
> lib/ublksrv.c | 4 +
> lib/ublksrv_cmd.c | 21 ++++++
> tgt_loop.cpp | 31 +++++++-
> ublksrv_tgt.cpp | 33 ++++++++
> 9 files changed, 268 insertions(+), 1 deletion(-)
> create mode 100644 bpf/ublk.bpf.c
>
> diff --git a/bpf/ublk.bpf.c b/bpf/ublk.bpf.c
> new file mode 100644
> index 0000000..80e79de
> --- /dev/null
> +++ b/bpf/ublk.bpf.c
> @@ -0,0 +1,168 @@
> +#include "vmlinux.h"
Where is vmlinux.h?
> +#include <bpf/bpf_helpers.h>
> +#include <bpf/bpf_core_read.h>
> +
> +
> +static long (*bpf_ublk_queue_sqe)(void *ctx, struct io_uring_sqe *sqe,
> + u32 sqe_len, u32 fd) = (void *) 212;
> +
> +int target_fd = -1;
> +
> +struct sqe_key {
> + u16 q_id;
> + u16 tag;
> + u32 res;
> + u64 offset;
> +};
> +
> +struct sqe_data {
> + char data[128];
> +};
> +
> +struct {
> + __uint(type, BPF_MAP_TYPE_HASH);
> + __uint(max_entries, 8192);
> + __type(key, struct sqe_key);
> + __type(value, struct sqe_data);
> +} sqes_map SEC(".maps");
> +
> +struct {
> + __uint(type, BPF_MAP_TYPE_ARRAY);
> + __uint(max_entries, 128);
> + __type(key, int);
> + __type(value, int);
> +} uring_fd_map SEC(".maps");
> +
> +static inline void io_uring_prep_rw(__u8 op, struct io_uring_sqe *sqe, int fd,
> + const void *addr, unsigned len,
> + __u64 offset)
> +{
> + sqe->opcode = op;
> + sqe->flags = 0;
> + sqe->ioprio = 0;
> + sqe->fd = fd;
> + sqe->off = offset;
> + sqe->addr = (unsigned long) addr;
> + sqe->len = len;
> + sqe->fsync_flags = 0;
> + sqe->buf_index = 0;
> + sqe->personality = 0;
> + sqe->splice_fd_in = 0;
> + sqe->addr3 = 0;
> + sqe->__pad2[0] = 0;
> +}
> +
> +static inline void io_uring_prep_nop(struct io_uring_sqe *sqe)
> +{
> + io_uring_prep_rw(IORING_OP_NOP, sqe, -1, 0, 0, 0);
> +}
> +
> +static inline void io_uring_prep_read(struct io_uring_sqe *sqe, int fd,
> + void *buf, unsigned nbytes, off_t offset)
> +{
> + io_uring_prep_rw(IORING_OP_READ, sqe, fd, buf, nbytes, offset);
> +}
> +
> +static inline void io_uring_prep_write(struct io_uring_sqe *sqe, int fd,
> + const void *buf, unsigned nbytes, off_t offset)
> +{
> + io_uring_prep_rw(IORING_OP_WRITE, sqe, fd, buf, nbytes, offset);
> +}
> +
> +/*
> +static u64 submit_sqe(struct bpf_map *map, void *key, void *value, void *data)
> +{
> + struct io_uring_sqe *sqe = (struct io_uring_sqe *)value;
> + struct ublk_bpf_ctx *ctx = ((struct callback_ctx *)data)->ctx;
> + struct sqe_key *skey = (struct sqe_key *)key;
> + char fmt[] ="submit sqe for req[qid:%u tag:%u]\n";
> + char fmt2[] ="submit sqe test prep\n";
> + u16 qid, tag;
> + int q_id = skey->q_id, *ring_fd;
> +
> + bpf_trace_printk(fmt2, sizeof(fmt2));
> + ring_fd = bpf_map_lookup_elem(&uring_fd_map, &q_id);
> + if (ring_fd) {
> + bpf_trace_printk(fmt, sizeof(fmt), qid, skey->tag);
> + bpf_ublk_queue_sqe(ctx, sqe, 128, *ring_fd);
> + bpf_map_delete_elem(map, key);
> + }
> + return 0;
> +}
> +*/
> +
> +static inline __u64 build_user_data(unsigned tag, unsigned op,
> + unsigned tgt_data, unsigned is_target_io,
> + unsigned is_bpf_io)
> +{
> + return tag | (op << 16) | (tgt_data << 24) | (__u64)is_target_io << 63 |
> + (__u64)is_bpf_io << 60;
> +}
> +
> +SEC("ublk.s/")
> +int ublk_io_prep_prog(struct ublk_bpf_ctx *ctx)
> +{
> + struct io_uring_sqe *sqe;
> + struct sqe_data sd = {0};
> + struct sqe_key key;
> + u16 q_id = ctx->q_id;
> + u8 op; // = ctx->op;
> + u32 nr_sectors = ctx->nr_sectors;
> + u64 start_sector = ctx->start_sector;
> + char fmt_1[] ="ublk_io_prep_prog %d %d\n";
> +
> + key.q_id = ctx->q_id;
> + key.tag = ctx->tag;
> + key.offset = 0;
> + key.res = 0;
> +
> + bpf_probe_read_kernel(&op, 1, &ctx->op);
> + bpf_trace_printk(fmt_1, sizeof(fmt_1), q_id, op);
> + sqe = (struct io_uring_sqe *)&sd;
> + if (op == REQ_OP_READ) {
> + char fmt[] ="add read sae\n";
> +
> + bpf_trace_printk(fmt, sizeof(fmt));
> + io_uring_prep_read(sqe, target_fd, 0, nr_sectors << 9,
> + start_sector << 9);
> + sqe->user_data = build_user_data(ctx->tag, op, 0, 1, 1);
> + bpf_map_update_elem(&sqes_map, &key, &sd, BPF_NOEXIST);
> + } else if (op == REQ_OP_WRITE) {
> + char fmt[] ="add write sae\n";
> +
> + bpf_trace_printk(fmt, sizeof(fmt));
> +
> + io_uring_prep_write(sqe, target_fd, 0, nr_sectors << 9,
> + start_sector << 9);
> + sqe->user_data = build_user_data(ctx->tag, op, 0, 1, 1);
> + bpf_map_update_elem(&sqes_map, &key, &sd, BPF_NOEXIST);
> + } else {
> + ;
> + }
> + return 0;
> +}
> +
> +SEC("ublk.s/")
> +int ublk_io_submit_prog(struct ublk_bpf_ctx *ctx)
> +{
> + struct io_uring_sqe *sqe;
> + char fmt[] ="submit sqe for req[qid:%u tag:%u]\n";
> + int q_id = ctx->q_id, *ring_fd;
> + struct sqe_key key;
> +
> + key.q_id = ctx->q_id;
> + key.tag = ctx->tag;
> + key.offset = 0;
> + key.res = 0;
> +
> + sqe = bpf_map_lookup_elem(&sqes_map, &key);
> + ring_fd = bpf_map_lookup_elem(&uring_fd_map, &q_id);
> + if (ring_fd) {
> + bpf_trace_printk(fmt, sizeof(fmt), key.q_id, key.tag);
> + bpf_ublk_queue_sqe(ctx, sqe, 128, *ring_fd);
> + bpf_map_delete_elem(&sqes_map, &key);
> + }
> + return 0;
> +}
> +
> +char LICENSE[] SEC("license") = "GPL";
> diff --git a/include/ublk_cmd.h b/include/ublk_cmd.h
> index f6238cc..893ba8c 100644
> --- a/include/ublk_cmd.h
> +++ b/include/ublk_cmd.h
> @@ -17,6 +17,8 @@
> #define UBLK_CMD_STOP_DEV 0x07
> #define UBLK_CMD_SET_PARAMS 0x08
> #define UBLK_CMD_GET_PARAMS 0x09
> +#define UBLK_CMD_REG_BPF_PROG 0x0a
> +#define UBLK_CMD_UNREG_BPF_PROG 0x0b
> #define UBLK_CMD_START_USER_RECOVERY 0x10
> #define UBLK_CMD_END_USER_RECOVERY 0x11
> #define UBLK_CMD_GET_DEV_INFO2 0x12
> diff --git a/include/ublksrv.h b/include/ublksrv.h
> index d38bd46..f5deddb 100644
> --- a/include/ublksrv.h
> +++ b/include/ublksrv.h
> @@ -106,6 +106,7 @@ struct ublksrv_tgt_info {
> unsigned int nr_fds;
> int fds[UBLKSRV_TGT_MAX_FDS];
> void *tgt_data;
> + void *tgt_bpf_obj;
>
> /*
> * Extra IO slots for each queue, target code can reserve some
> @@ -263,6 +264,8 @@ struct ublksrv_tgt_type {
> int (*init_queue)(const struct ublksrv_queue *, void **queue_data_ptr);
> void (*deinit_queue)(const struct ublksrv_queue *);
>
> + int (*init_queue_bpf)(const struct ublksrv_dev *dev, const struct ublksrv_queue *q);
> +
> unsigned long reserved[5];
> };
>
> @@ -318,6 +321,11 @@ extern void ublksrv_ctrl_prep_recovery(struct ublksrv_ctrl_dev *dev,
> const char *recovery_jbuf);
> extern const char *ublksrv_ctrl_get_recovery_jbuf(const struct ublksrv_ctrl_dev *dev);
>
> +extern void ublksrv_ctrl_set_bpf_obj_info(struct ublksrv_ctrl_dev *dev,
> + void *obj);
> +extern int ublksrv_ctrl_reg_bpf_prog(struct ublksrv_ctrl_dev *dev,
> + int io_prep_fd, int io_submit_fd);
> +
> /* ublksrv device ("/dev/ublkcN") level APIs */
> extern const struct ublksrv_dev *ublksrv_dev_init(const struct ublksrv_ctrl_dev *
> ctrl_dev);
> diff --git a/include/ublksrv_priv.h b/include/ublksrv_priv.h
> index 2996baa..8da8866 100644
> --- a/include/ublksrv_priv.h
> +++ b/include/ublksrv_priv.h
> @@ -42,6 +42,7 @@ struct ublksrv_ctrl_dev {
>
> const char *tgt_type;
> const struct ublksrv_tgt_type *tgt_ops;
> + void *bpf_obj;
>
> /*
> * default is UBLKSRV_RUN_DIR but can be specified via command line,
> diff --git a/include/ublksrv_tgt.h b/include/ublksrv_tgt.h
> index 234d31e..e0db7d9 100644
> --- a/include/ublksrv_tgt.h
> +++ b/include/ublksrv_tgt.h
> @@ -9,6 +9,7 @@
> #include <getopt.h>
> #include <string.h>
> #include <stdarg.h>
> +#include <limits.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <sys/ioctl.h>
> diff --git a/lib/ublksrv.c b/lib/ublksrv.c
> index 96bed95..110ccb3 100644
> --- a/lib/ublksrv.c
> +++ b/lib/ublksrv.c
> @@ -603,6 +603,9 @@ skip_alloc_buf:
> goto fail;
> }
>
> + if (dev->tgt.ops->init_queue_bpf)
> + dev->tgt.ops->init_queue_bpf(tdev, local_to_tq(q));
> +
> ublksrv_dev_init_io_cmds(dev, q);
>
> /*
> @@ -723,6 +726,7 @@ const struct ublksrv_dev *ublksrv_dev_init(const struct ublksrv_ctrl_dev *ctrl_d
> }
>
> tgt->fds[0] = dev->cdev_fd;
> + tgt->tgt_bpf_obj = ctrl_dev->bpf_obj;
>
> ret = ublksrv_tgt_init(dev, ctrl_dev->tgt_type, ctrl_dev->tgt_ops,
> ctrl_dev->tgt_argc, ctrl_dev->tgt_argv);
> diff --git a/lib/ublksrv_cmd.c b/lib/ublksrv_cmd.c
> index 0d7265d..0101cb9 100644
> --- a/lib/ublksrv_cmd.c
> +++ b/lib/ublksrv_cmd.c
> @@ -502,6 +502,27 @@ int ublksrv_ctrl_end_recovery(struct ublksrv_ctrl_dev *dev, int daemon_pid)
> return ret;
> }
>
> +int ublksrv_ctrl_reg_bpf_prog(struct ublksrv_ctrl_dev *dev,
> + int io_prep_fd, int io_submit_fd)
> +{
> + struct ublksrv_ctrl_cmd_data data = {
> + .cmd_op = UBLK_CMD_REG_BPF_PROG,
> + .flags = CTRL_CMD_HAS_DATA,
> + };
> + int ret;
> +
> + data.data[0] = io_prep_fd;
> + data.data[1] = io_submit_fd;
> +
> + ret = __ublksrv_ctrl_cmd(dev, &data);
> + return ret;
> +}
> +
> +void ublksrv_ctrl_set_bpf_obj_info(struct ublksrv_ctrl_dev *dev, void *obj)
> +{
> + dev->bpf_obj = obj;
> +}
> +
> const struct ublksrv_ctrl_dev_info *ublksrv_ctrl_get_dev_info(
> const struct ublksrv_ctrl_dev *dev)
> {
> diff --git a/tgt_loop.cpp b/tgt_loop.cpp
> index 79a65d3..b1568fe 100644
> --- a/tgt_loop.cpp
> +++ b/tgt_loop.cpp
> @@ -4,7 +4,11 @@
>
> #include <poll.h>
> #include <sys/epoll.h>
> +#include <linux/bpf.h>
> +#include <bpf/bpf.h>
> +#include <bpf/libbpf.h>
> #include "ublksrv_tgt.h"
> +#include "bpf/.tmp/ublk.skel.h"
Where is bpf/.tmp/ublk.skel.h?
>
> static bool backing_supports_discard(char *name)
> {
> @@ -88,6 +92,20 @@ static int loop_recovery_tgt(struct ublksrv_dev *dev, int type)
> return 0;
> }
>
> +static int loop_init_queue_bpf(const struct ublksrv_dev *dev,
> + const struct ublksrv_queue *q)
> +{
> + int ret, q_id, ring_fd;
> + const struct ublksrv_tgt_info *tgt = &dev->tgt;
> + struct ublk_bpf *obj = (struct ublk_bpf*)tgt->tgt_bpf_obj;
> +
> + q_id = q->q_id;
> + ring_fd = q->ring_ptr->ring_fd;
> + ret = bpf_map_update_elem(bpf_map__fd(obj->maps.uring_fd_map), &q_id,
> + &ring_fd, 0);
> + return ret;
> +}
> +
> static int loop_init_tgt(struct ublksrv_dev *dev, int type, int argc, char
> *argv[])
> {
> @@ -125,6 +143,7 @@ static int loop_init_tgt(struct ublksrv_dev *dev, int type, int argc, char
> },
> };
> bool can_discard = false;
> + struct ublk_bpf *bpf_obj;
>
> strcpy(tgt_json.name, "loop");
>
> @@ -218,6 +237,10 @@ static int loop_init_tgt(struct ublksrv_dev *dev, int type, int argc, char
> jbuf = ublksrv_tgt_realloc_json_buf(dev, &jbuf_size);
> } while (ret < 0);
>
> + if (tgt->tgt_bpf_obj) {
> + bpf_obj = (struct ublk_bpf *)tgt->tgt_bpf_obj;
> + bpf_obj->data->target_fd = tgt->fds[1];
> + }
> return 0;
> }
>
> @@ -252,9 +275,14 @@ static int loop_queue_tgt_io(const struct ublksrv_queue *q,
> const struct ublk_io_data *data, int tag)
> {
> const struct ublksrv_io_desc *iod = data->iod;
> - struct io_uring_sqe *sqe = io_uring_get_sqe(q->ring_ptr);
> + struct io_uring_sqe *sqe;
> unsigned ublk_op = ublksrv_get_op(iod);
>
> + /* ebpf prog will handle read/write requests. */
> + if ((ublk_op == UBLK_IO_OP_READ) || (ublk_op == UBLK_IO_OP_WRITE))
> + return 1;
> +
> + sqe = io_uring_get_sqe(q->ring_ptr);
> if (!sqe)
> return 0;
>
> @@ -374,6 +402,7 @@ struct ublksrv_tgt_type loop_tgt_type = {
> .type = UBLKSRV_TGT_TYPE_LOOP,
> .name = "loop",
> .recovery_tgt = loop_recovery_tgt,
> + .init_queue_bpf = loop_init_queue_bpf,
> };
>
> static void tgt_loop_init() __attribute__((constructor));
> diff --git a/ublksrv_tgt.cpp b/ublksrv_tgt.cpp
> index 5ed328d..d3796cf 100644
> --- a/ublksrv_tgt.cpp
> +++ b/ublksrv_tgt.cpp
> @@ -2,6 +2,7 @@
>
> #include "config.h"
> #include "ublksrv_tgt.h"
> +#include "bpf/.tmp/ublk.skel.h"
Same with above
Thanks,
Ming
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [UBLKSRV] Add ebpf support.
2023-02-16 8:28 ` Ming Lei
@ 2023-02-16 9:17 ` Xiaoguang Wang
0 siblings, 0 replies; 13+ messages in thread
From: Xiaoguang Wang @ 2023-02-16 9:17 UTC (permalink / raw)
To: Ming Lei; +Cc: linux-block, io-uring, bpf, axboe, asml.silence, ZiyangZhang
hello,
> On Wed, Feb 15, 2023 at 08:46:18AM +0800, Xiaoguang Wang wrote:
>> Signed-off-by: Xiaoguang Wang <[email protected]>
>> ---
>> bpf/ublk.bpf.c | 168 +++++++++++++++++++++++++++++++++++++++++
>> include/ublk_cmd.h | 2 +
>> include/ublksrv.h | 8 ++
>> include/ublksrv_priv.h | 1 +
>> include/ublksrv_tgt.h | 1 +
>> lib/ublksrv.c | 4 +
>> lib/ublksrv_cmd.c | 21 ++++++
>> tgt_loop.cpp | 31 +++++++-
>> ublksrv_tgt.cpp | 33 ++++++++
>> 9 files changed, 268 insertions(+), 1 deletion(-)
>> create mode 100644 bpf/ublk.bpf.c
>>
>> diff --git a/bpf/ublk.bpf.c b/bpf/ublk.bpf.c
>> new file mode 100644
>> index 0000000..80e79de
>> --- /dev/null
>> +++ b/bpf/ublk.bpf.c
>> @@ -0,0 +1,168 @@
>> +#include "vmlinux.h"
> Where is vmlinux.h?
Sorry, I forgot to attach the Makefile in this commit; it shows
how to generate vmlinux.h and how to compile the ebpf
prog object. I'll prepare a v2 patch set to fix this issue soon.
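(For reference, vmlinux.h is usually generated with
"bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h" and the
skeleton header with "bpftool gen skeleton ublk.bpf.o > ublk.skel.h"; the
missing Makefile presumably wraps these commands.)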
Thanks for review.
Regards,
Xiaoguang Wang
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [RFC 3/3] ublk_drv: add ebpf support
2023-02-16 8:11 ` Ming Lei
@ 2023-02-16 12:12 ` Xiaoguang Wang
2023-02-17 3:02 ` Ming Lei
0 siblings, 1 reply; 13+ messages in thread
From: Xiaoguang Wang @ 2023-02-16 12:12 UTC (permalink / raw)
To: Ming Lei; +Cc: linux-block, io-uring, bpf, axboe, asml.silence, ZiyangZhang
hello,
> On Wed, Feb 15, 2023 at 08:41:22AM +0800, Xiaoguang Wang wrote:
>> Currently only one bpf_ublk_queue_sqe() ebpf helper is added; the ublksrv target
>> can use this helper to write an ebpf prog to support ublk kernel & userspace
>> zero copy, please see the ublksrv test codes for more info.
>>
>> */
>> + if ((req_op(req) == REQ_OP_WRITE) && ub->io_prep_prog)
>> + return rq_bytes;
> Can you explain a bit why READ isn't supported? Because WRITE zero
> copy is supposed to be supported easily with splice based approach,
> and I am more interested in READ zc actually.
No special reason, READ op can also be supported. I'll
add this support in patch set v2.
For this RFC patch set, I just tried to show the idea, so
I must admit that current codes are not mature enough :)
>
>> +
>> if (req_op(req) != REQ_OP_WRITE && req_op(req) != REQ_OP_FLUSH)
>> return rq_bytes;
>>
>> @@ -860,6 +921,89 @@ static void ublk_queue_cmd(struct ublk_queue *ubq, struct request *rq)
>> }
>> }
>>
>>
>> + kbuf->bvec = bvec;
>> + rq_for_each_bvec(tmp, rq, rq_iter) {
>> + *bvec = tmp;
>> + bvec++;
>> + }
>> +
>> + kbuf->count = blk_rq_bytes(rq);
>> + kbuf->nr_bvecs = nr_bvec;
>> + data->kbuf = kbuf;
>> + return 0;
> bio/req bvec table is immutable, so here you can pass its reference
> to kbuf directly.
Yeah, thanks.
>
>> +}
>> +
>> +static int ublk_run_bpf_prog(struct ublk_queue *ubq, struct request *rq)
>> +{
>> + int err;
>> + struct ublk_device *ub = ubq->dev;
>> + struct bpf_prog *prog = ub->io_prep_prog;
>> + struct ublk_io_bpf_ctx *bpf_ctx;
>> +
>> + if (!prog)
>> + return 0;
>> +
>> + bpf_ctx = kmalloc(sizeof(struct ublk_io_bpf_ctx), GFP_NOIO);
>> + if (!bpf_ctx)
>> + return -EIO;
>> +
>> + err = ublk_init_uring_kbuf(rq);
>> + if (err < 0) {
>> + kfree(bpf_ctx);
>> + return -EIO;
>> + }
>> + bpf_ctx->ub = ub;
>> + bpf_ctx->ctx.q_id = ubq->q_id;
>> + bpf_ctx->ctx.tag = rq->tag;
>> + bpf_ctx->ctx.op = req_op(rq);
>> + bpf_ctx->ctx.nr_sectors = blk_rq_sectors(rq);
>> + bpf_ctx->ctx.start_sector = blk_rq_pos(rq);
> The above is for setting up target io parameter, which is supposed
> to be from userspace, cause it is result of user space logic. If
> these parameters are from kernel, the whole logic has to be done
> in io_prep_prog.
Yeah, it's designed that io_prep_prog implements user space
io logic.
>
>> + bpf_prog_run_pin_on_cpu(prog, bpf_ctx);
>> +
>> + init_task_work(&bpf_ctx->work, ublk_bpf_io_submit_fn);
>> + if (task_work_add(ubq->ubq_daemon, &bpf_ctx->work, TWA_SIGNAL_NO_IPI))
>> + kfree(bpf_ctx);
> task_work_add() is only available in case of ublk builtin.
Yeah, I'm thinking how to work around it.
>
>> + return 0;
>> +}
>> +
>> static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx,
>> const struct blk_mq_queue_data *bd)
>> {
>> @@ -872,6 +1016,9 @@ static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx,
>> if (unlikely(res != BLK_STS_OK))
>> return BLK_STS_IOERR;
>>
>> + /* Currently just for test. */
>> + ublk_run_bpf_prog(ubq, rq);
> Can you explain the above comment a bit? When is the io_prep_prog called
> in the non-test version? Or can you post the non-test version in list
> for review.
Forgot to delete stale comments, sorry. I'm writing v2 patch set,
> Here it is the key for understanding the whole idea, especially when
> is io_prep_prog called finally? How to pass parameters to io_prep_prog?
Let me explain more about the design:
io_prep_prog has two types of parameters:
1) its call argument: struct ublk_bpf_ctx, see ublk.bpf.c.
ublk_bpf_ctx will describe one kernel io requests about
its op, qid, sectors info. io_prep_prog uses these info to
map target io.
2) ebpf map structure, user space daemon can use map
structure to pass much information from user space to
io_prep_prog, which will help it to initialize target io if necessary.
io_prep_prog is called when ublk_queue_rq() is called, this bpf
prog will initialize one or more sqes according to user logic, and
io_prep_prog will put these sqes in an ebpf map structure, then
execute a task_work_add() to notify ubq_daemon to execute
io_submit_prog. Note, we can not call io_uring_submit_sqe()
in task context that calls ublk_queue_rq(), that context does not
have io_uring instance owned by ubq_daemon.
Later ubq_daemon will call io_submit_prog to submit sqes.
>
> Given it is ebpf prog, I don't think any userspace parameter can be
> passed to io_prep_prog when submitting IO, that means all user logic has
> to be done inside io_prep_prog? If yes, not sure if it is one good way,
> cause ebpf prog is very limited programming environment, but the user
> logic could be as complicated as using btree to map io, or communicating
> with remote machine for figuring out the mapping. Loop is just the
> simplest direct mapping.
Currently, we can use an ebpf map structure to pass user space
parameters to io_prep_prog. Also I agree with you that complicated
logic may be hard to implement in an ebpf prog; hopefully the ebpf
community will improve this situation gradually.
The userspace target implementations I have met so far just use
userspace block device solutions to access a distributed filesystem,
which involves socket programming and fairly simple io mapping logic. We
can prepare the socket fd in an ebpf map structure, and such mapping
logic should be easy to implement in an ebpf prog, though I haven't
applied this ebpf method to our internal business yet.
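As one illustration of this (not taken from the posted patches; the map
name and layout are made up for the example), the daemon can publish a
per-queue backend fd in a small array map, and io_prep_prog reads it when
building the sqe:

/* ebpf side (illustrative names): the daemon publishes one backend fd
 * per queue before I/O starts. */
struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, 128);
	__type(key, int);
	__type(value, int);
} backend_fd_map SEC(".maps");

SEC("ublk.s/")
int ublk_io_prep_prog(struct ublk_bpf_ctx *ctx)
{
	int q_id = ctx->q_id;
	int *fd = bpf_map_lookup_elem(&backend_fd_map, &q_id);

	if (!fd)
		return 0;
	/* prep the read/write sqe against *fd, as ublk.bpf.c does */
	return 0;
}

On the daemon side this is one bpf_map_update_elem() call per queue,
exactly the pattern loop_init_queue_bpf() already uses for uring_fd_map in
the ublksrv patch.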
Thanks for review.
Regards,
Xiaoguang Wang
>
>
> Thanks,
> Ming
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [RFC 3/3] ublk_drv: add ebpf support
2023-02-16 12:12 ` Xiaoguang Wang
@ 2023-02-17 3:02 ` Ming Lei
2023-02-17 10:46 ` Ming Lei
2023-02-22 14:13 ` Xiaoguang Wang
0 siblings, 2 replies; 13+ messages in thread
From: Ming Lei @ 2023-02-17 3:02 UTC (permalink / raw)
To: Xiaoguang Wang
Cc: linux-block, io-uring, bpf, axboe, asml.silence, ZiyangZhang,
ming.lei
On Thu, Feb 16, 2023 at 08:12:18PM +0800, Xiaoguang Wang wrote:
> hello,
>
> > On Wed, Feb 15, 2023 at 08:41:22AM +0800, Xiaoguang Wang wrote:
> >> Currently only one bpf_ublk_queue_sqe() ebpf helper is added; the ublksrv target
> >> can use this helper to write an ebpf prog to support ublk kernel & userspace
> >> zero copy, please see the ublksrv test codes for more info.
> >>
> >> */
> >> + if ((req_op(req) == REQ_OP_WRITE) && ub->io_prep_prog)
> >> + return rq_bytes;
> > Can you explain a bit why READ isn't supported? Because WRITE zero
> > copy is supposed to be supported easily with splice based approach,
> > and I am more interested in READ zc actually.
> No special reason, READ op can also be supported. I'll
> add this support in patch set v2.
> For this RFC patch set, I just tried to show the idea, so
> I must admit that current codes are not mature enough :)
OK.
>
> >
> >> +
> >> if (req_op(req) != REQ_OP_WRITE && req_op(req) != REQ_OP_FLUSH)
> >> return rq_bytes;
> >>
> >> @@ -860,6 +921,89 @@ static void ublk_queue_cmd(struct ublk_queue *ubq, struct request *rq)
> >> }
> >> }
> >>
> >>
> >> + kbuf->bvec = bvec;
> >> + rq_for_each_bvec(tmp, rq, rq_iter) {
> >> + *bvec = tmp;
> >> + bvec++;
> >> + }
> >> +
> >> + kbuf->count = blk_rq_bytes(rq);
> >> + kbuf->nr_bvecs = nr_bvec;
> >> + data->kbuf = kbuf;
> >> + return 0;
> > bio/req bvec table is immutable, so here you can pass its reference
> > to kbuf directly.
> Yeah, thanks.
Also if this request has multiple bios, either you need to submit
multiple sqes or copy all bvecs into a single table. And in the case of a
single bio, the table reference can be used directly.
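A sketch of that per-bio variant could look as follows (again not the
posted code: the helper name is made up, the io_mapped_kbuf fields follow
the RFC, and cleanup of earlier kbufs on failure is omitted):

/* Sketch: walk the bios of the request and build one kbuf (and later one
 * sqe) per bio, borrowing each bio's immutable bvec table. */
static int ublk_init_uring_kbufs(struct request *rq)
{
	struct bio *bio;

	__rq_for_each_bio(bio, rq) {
		struct io_mapped_kbuf *kbuf = kmalloc(sizeof(*kbuf), GFP_NOIO);

		if (!kbuf)
			return -ENOMEM;
		kbuf->bvec = bio->bi_io_vec;	/* reference, not a copy */
		kbuf->nr_bvecs = bio->bi_vcnt;
		kbuf->count = bio->bi_iter.bi_size;
		/* hand this kbuf to its own sqe here */
	}
	return 0;
}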
>
> >
> >> +}
> >> +
> >> +static int ublk_run_bpf_prog(struct ublk_queue *ubq, struct request *rq)
> >> +{
> >> + int err;
> >> + struct ublk_device *ub = ubq->dev;
> >> + struct bpf_prog *prog = ub->io_prep_prog;
> >> + struct ublk_io_bpf_ctx *bpf_ctx;
> >> +
> >> + if (!prog)
> >> + return 0;
> >> +
> >> + bpf_ctx = kmalloc(sizeof(struct ublk_io_bpf_ctx), GFP_NOIO);
> >> + if (!bpf_ctx)
> >> + return -EIO;
> >> +
> >> + err = ublk_init_uring_kbuf(rq);
> >> + if (err < 0) {
> >> + kfree(bpf_ctx);
> >> + return -EIO;
> >> + }
> >> + bpf_ctx->ub = ub;
> >> + bpf_ctx->ctx.q_id = ubq->q_id;
> >> + bpf_ctx->ctx.tag = rq->tag;
> >> + bpf_ctx->ctx.op = req_op(rq);
> >> + bpf_ctx->ctx.nr_sectors = blk_rq_sectors(rq);
> >> + bpf_ctx->ctx.start_sector = blk_rq_pos(rq);
> > The above is for setting up target io parameter, which is supposed
> > to be from userspace, cause it is result of user space logic. If
> > these parameters are from kernel, the whole logic has to be done
> > in io_prep_prog.
> Yeah, it's designed that io_prep_prog implements user space
> io logic.
That could be the biggest weakness of this approach, because people
really want to implement complicated logic in userspace, which should
be the biggest value of ublk, but now it seems you move kernel C
programming into userspace ebpf programming, and I don't think ebpf
is good at handling complicated userspace logic.
>
> >
> >> + bpf_prog_run_pin_on_cpu(prog, bpf_ctx);
> >> +
> >> + init_task_work(&bpf_ctx->work, ublk_bpf_io_submit_fn);
> >> + if (task_work_add(ubq->ubq_daemon, &bpf_ctx->work, TWA_SIGNAL_NO_IPI))
> >> + kfree(bpf_ctx);
> > task_work_add() is only available in case of ublk builtin.
> Yeah, I'm thinking how to work around it.
>
> >
> >> + return 0;
> >> +}
> >> +
> >> static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx,
> >> const struct blk_mq_queue_data *bd)
> >> {
> >> @@ -872,6 +1016,9 @@ static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx,
> >> if (unlikely(res != BLK_STS_OK))
> >> return BLK_STS_IOERR;
> >>
> >> + /* Currently just for test. */
> >> + ublk_run_bpf_prog(ubq, rq);
> > Can you explain the above comment a bit? When is the io_prep_prog called
> > in the non-test version? Or can you post the non-test version in list
> > for review.
> Forgot to delete stale comments, sorry. I'm writing v2 patch set,
OK, got it, so it looks like ublk_run_bpf_prog is designed to run the two
progs loaded from two control commands.
>
> > Here it is the key for understanding the whole idea, especially when
> > is io_prep_prog called finally? How to pass parameters to io_prep_prog?
> Let me explain more about the design:
> io_prep_prog has two types of parameters:
> 1) its call argument: struct ublk_bpf_ctx, see ublk.bpf.c.
> ublk_bpf_ctx will describe one kernel io requests about
> its op, qid, sectors info. io_prep_prog uses these info to
> map target io.
> 2) ebpf map structure, user space daemon can use map
> structure to pass much information from user space to
> io_prep_prog, which will help it to initialize target io if necessary.
>
> io_prep_prog is called when ublk_queue_rq() is called, this bpf
> prog will initialize one or more sqes according to user logic, and
> io_prep_prog will put these sqes in an ebpf map structure, then
> execute a task_work_add() to notify ubq_daemon to execute
> io_submit_prog. Note, we can not call io_uring_submit_sqe()
> in task context that calls ublk_queue_rq(), that context does not
> have io_uring instance owned by ubq_daemon.
> Later ubq_daemon will call io_submit_prog to submit sqes.
Submitting sqe from kernel looks interesting, but I guess
performance may be hurt, given plugging(batching) can't be applied
any more, which is supposed to affect io perf a lot.
Thanks,
Ming
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [RFC 3/3] ublk_drv: add ebpf support
2023-02-17 3:02 ` Ming Lei
@ 2023-02-17 10:46 ` Ming Lei
2023-02-22 14:13 ` Xiaoguang Wang
1 sibling, 0 replies; 13+ messages in thread
From: Ming Lei @ 2023-02-17 10:46 UTC (permalink / raw)
To: Xiaoguang Wang
Cc: linux-block, io-uring, bpf, axboe, asml.silence, ZiyangZhang,
ming.lei
On Fri, Feb 17, 2023 at 11:02:14AM +0800, Ming Lei wrote:
> On Thu, Feb 16, 2023 at 08:12:18PM +0800, Xiaoguang Wang wrote:
> > hello,
...
> > io_prep_prog is called when ublk_queue_rq() is called, this bpf
> > prog will initialize one or more sqes according to user logic, and
> > io_prep_prog will put these sqes in an ebpf map structure, then
> > execute a task_work_add() to notify ubq_daemon to execute
> > io_submit_prog. Note, we can not call io_uring_submit_sqe()
> > in task context that calls ublk_queue_rq(), that context does not
> > have io_uring instance owned by ubq_daemon.
> > Later ubq_daemon will call io_submit_prog to submit sqes.
>
> Submitting sqe from kernel looks interesting, but I guess
> performance may be hurt, given plugging(batching) can't be applied
> any more, which is supposed to affect io perf a lot.
If submitting an SQE in kernel is really doable, maybe we can add another
command, such as UBLK_IO_SUBMIT_SQE (just like UBLK_IO_NEED_GET_DATA),
pass the built SQE (which represents part of the user logic result) as
the io_uring command payload, and ask the ublk driver to build the buffer
for this SQE, then submit this SQE in the kernel.
But there is an SQE ordering problem: net usually requires SQEs to be linked
and submitted in order, and with this approach it becomes hard to maintain
SQE order (some linked in userspace, and some in kernel).
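Purely as an illustration of that idea (no such command exists in the
posted patches or in the ublk UAPI; the struct and field names are
hypothetical), the payload of such a command might carry the daemon-built
SQE plus the tag needed for the driver to attach the request's buffer:

/* hypothetical payload for a UBLK_IO_SUBMIT_SQE io_uring command */
struct ublk_submit_sqe_payload {
	__u16	q_id;
	__u16	tag;		/* which blk-mq request supplies the buffer */
	__u32	sqe_len;	/* 64 or 128 bytes */
	__u8	sqe[];		/* SQE built by the daemon, buffer fields left empty */
};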
Thanks,
Ming
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [RFC 3/3] ublk_drv: add ebpf support
2023-02-17 3:02 ` Ming Lei
2023-02-17 10:46 ` Ming Lei
@ 2023-02-22 14:13 ` Xiaoguang Wang
1 sibling, 0 replies; 13+ messages in thread
From: Xiaoguang Wang @ 2023-02-22 14:13 UTC (permalink / raw)
To: Ming Lei; +Cc: linux-block, io-uring, bpf, axboe, asml.silence, ZiyangZhang
hi,
I spent some time to write v2, especially think about how to work
around task_work_add is not exported, so sorry for late response.
>
>>> The above is for setting up target io parameter, which is supposed
>>> to be from userspace, cause it is result of user space logic. If
>>> these parameters are from kernel, the whole logic has to be done
>>> in io_prep_prog.
>> Yeah, it's designed that io_prep_prog implements user space
>> io logic.
> That could be the biggest weakness of this approach, because people
> really want to implement complicated logic in userspace, which should
> be the biggest value of ublk, but now it seems you move kernel C
> programming into userspace ebpf programming, and I don't think ebpf
> is good at handling complicated userspace logic.
Absolutely agree with you; ebpf has strict programming rules, and
I spent more time than I had expected at the start to support the loop
target ebpf prog (ublk.bpf.c). Later I'll try to collaborate with my
colleagues to see whether we can program their userspace logic
into an ebpf prog, fully or partially.
>> io_prep_prog is called when ublk_queue_rq() is called, this bpf
>> prog will initialize one or more sqes according to user logic, and
>> io_prep_prog will put these sqes in an ebpf map structure, then
>> execute a task_work_add() to notify ubq_daemon to execute
>> io_submit_prog. Note, we can not call io_uring_submit_sqe()
>> in task context that calls ublk_queue_rq(), that context does not
>> have io_uring instance owned by ubq_daemon.
>> Later ubq_daemon will call io_submit_prog to submit sqes.
> Submitting sqe from kernel looks interesting, but I guess
> performance may be hurt, given plugging(batching) can't be applied
> any more, which is supposed to affect io perf a lot.
Yes, agreed, but I haven't had much time to improve this yet.
Currently, I mainly try to use this feature for large ios, to
reduce the memory copy overhead, which consumes a lot of
cpu resources; our clients really hope we can reduce it.
Regards,
Xiaoguang Wang
>
>
>
> Thanks,
> Ming
^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2023-02-22 14:13 UTC | newest]
Thread overview: 13+ messages
2023-02-15 0:41 [RFC 0/3] Add io_uring & ebpf based methods to implement zero-copy for ublk Xiaoguang Wang
2023-02-15 0:41 ` [RFC 1/3] bpf: add UBLK program type Xiaoguang Wang
2023-02-15 0:41 ` [RFC 2/3] io_uring: enable io_uring to submit sqes located in kernel Xiaoguang Wang
2023-02-15 0:41 ` [RFC 3/3] ublk_drv: add ebpf support Xiaoguang Wang
2023-02-16 8:11 ` Ming Lei
2023-02-16 12:12 ` Xiaoguang Wang
2023-02-17 3:02 ` Ming Lei
2023-02-17 10:46 ` Ming Lei
2023-02-22 14:13 ` Xiaoguang Wang
2023-02-15 0:46 ` [UBLKSRV] Add " Xiaoguang Wang
2023-02-16 8:28 ` Ming Lei
2023-02-16 9:17 ` Xiaoguang Wang
2023-02-15 8:40 ` [RFC 0/3] Add io_uring & ebpf based methods to implement zero-copy for ublk Ziyang Zhang