From: Ziyang Zhang <[email protected]>
To: [email protected], [email protected],
[email protected]
Cc: [email protected], [email protected], [email protected],
Xiaoguang Wang <[email protected]>
Subject: Re: [RFC 0/3] Add io_uring & ebpf based methods to implement zero-copy for ublk
Date: Wed, 15 Feb 2023 16:40:57 +0800
Message-ID: <[email protected]>
In-Reply-To: <[email protected]>
On 2023/2/15 08:41, Xiaoguang Wang wrote:
> Normally, userspace block device implementations need to copy data between
> the kernel block layer's io requests and the userspace block device's
> userspace daemon; for example, ublk and tcmu both have similar logic. This
> copy consumes significant cpu resources, especially for large io.
>
> There are methods that try to reduce this cpu overhead so that userspace
> block devices' io performance can be improved further. These methods
> include: 1) use special hardware to do the memory copy, but not all
> architectures have such hardware; 2) software methods, such as mmap-ing
> the kernel block layer's io request data into the userspace daemon [1],
> but that has page table map/unmap and tlb flush overhead, security issues,
> etc., and may only be friendly to large io.
>
> ublk is a generic framework for implementing block device logic from
> userspace. Typical userspace block device implementations need to copy
> data between the kernel block layer's io requests and the userspace
> daemon, which consumes cpu resources, especially for large io.
>
> To solve this problem, I'd propose a new method which combines the
> respective advantages of io_uring and ebpf: add a new program type
> BPF_PROG_TYPE_UBLK for ublk, and have the userspace block device daemon
> register an ebpf prog. This bpf prog uses bpf helpers offered by the ublk
> bpf prog type to submit io requests on behalf of the daemon process.
> Currently there is only one helper:
>     u64 bpf_ublk_queue_sqe(struct ublk_io_bpf_ctx *bpf_ctx,
>                            struct io_uring_sqe *sqe, u32 sqe_len, u32 fd)
>
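Regarding the "register an ebpf prog" step: from the daemon side, loading such
a prog with libbpf might look roughly like the sketch below. How the prog fd
is then handed to ublk_drv is defined by patch 3/3; the UBLK_CMD_REG_BPF name
in the comment is purely a placeholder, not a real command.

#include <bpf/libbpf.h>

/* Load the BPF_PROG_TYPE_UBLK object and return the prog fd, which the
 * daemon would then register with the ublk control device. The object is
 * intentionally kept open so the returned prog fd stays valid. */
static int load_ublk_prog(const char *obj_path, const char *prog_name)
{
        struct bpf_object *obj;
        struct bpf_program *prog;

        obj = bpf_object__open_file(obj_path, NULL);
        if (!obj)
                return -1;
        if (bpf_object__load(obj)) {
                bpf_object__close(obj);
                return -1;
        }

        prog = bpf_object__find_program_by_name(obj, prog_name);
        if (!prog) {
                bpf_object__close(obj);
                return -1;
        }

        /* Hand the fd to ublk_drv, e.g. via a hypothetical UBLK_CMD_REG_BPF
         * control command; that part is driver-specific (patch 3/3). */
        return bpf_program__fd(prog);
}
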
> This helper uses io_uring to submit io requests, so we need to make
> io_uring able to submit an sqe located in kernel memory (some of the code
> ideas come from Pavel's patchset [2], but Pavel's patches still require
> sqe->buf to come from a userspace address). The bpf prog initializes the
> sqes but does not need to initialize their buf field; sqe->buf will come
> from the kernel block layer io requests in some form. See patch 2 for more.
>
> Taking the ublk loop target as an example, we can easily implement the
> logic below in an ebpf prog (a sketch follows this list):
> 1. The userspace daemon registers an ebpf prog and passes two backend file
> fds in an ebpf map.
> 2. For kernel io requests against the first half of the userspace device,
> the ebpf prog prepares an io_uring sqe, which submits io against the first
> backend file fd, and the sqe's buffer comes from the kernel io request.
> Kernel io requests against the second half of the userspace device have
> similar logic, only the sqe's fd is the second backend file fd.
> 3. When the ublk driver's blk-mq queue_rq() is called, this ebpf prog is
> executed and completes the kernel io requests.
>
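To make step 2 above concrete, here is a rough sketch of what such a submit
prog could look like. Everything below that is not taken from the patches is
an assumption made for illustration: the struct ublk_io_bpf_ctx layout, the
"ublk" SEC() name, the BPF_FUNC_ublk_queue_sqe enum value, the map layout and
the device-midpoint constant.

#include <linux/types.h>
#include <linux/bpf.h>
#include <linux/io_uring.h>
#include <bpf/bpf_helpers.h>

/* Hypothetical ctx layout; the real struct ublk_io_bpf_ctx comes from the
 * patched uapi headers (patch 1/3). */
struct ublk_io_bpf_ctx {
        __u32 q_id;
        __u32 tag;
        __u32 op;
        __u32 flags;
        __u32 nr_sectors;
        __u64 start_sector;
};

/* New helper from patch 1/3; in a real build this declaration would be
 * auto-generated into bpf_helper_defs.h by the patched scripts/bpf_doc.py. */
static long (*bpf_ublk_queue_sqe)(struct ublk_io_bpf_ctx *bpf_ctx,
                                  struct io_uring_sqe *sqe,
                                  __u32 sqe_len, __u32 fd) =
        (void *)BPF_FUNC_ublk_queue_sqe;

struct {
        __uint(type, BPF_MAP_TYPE_ARRAY);
        __uint(max_entries, 2);                 /* the two backend file fds */
        __type(key, __u32);
        __type(value, __u32);
} backend_fds SEC(".maps");

/* Device midpoint in sectors, a made-up constant for this example. */
#define DEV_HALF_SECTORS        (1ULL << 21)

SEC("ublk")
int ublk_io_submit_prog(struct ublk_io_bpf_ctx *ctx)
{
        struct io_uring_sqe sqe = {};
        __u32 key = ctx->start_sector < DEV_HALF_SECTORS ? 0 : 1;
        __u32 *fd = bpf_map_lookup_elem(&backend_fds, &key);

        if (!fd)
                return 0;

        /* Prepare a write against the chosen backend file. sqe.addr is left
         * unset on purpose: the kernel wires the data buffer up to the blk-mq
         * request's pages (patch 2/3), so no copy and no userspace buffer. */
        sqe.opcode = IORING_OP_WRITE;
        sqe.off = (ctx->start_sector - key * DEV_HALF_SECTORS) << 9;
        sqe.len = ctx->nr_sectors << 9;

        bpf_ublk_queue_sqe(ctx, &sqe, sizeof(sqe), *fd);
        return 0;
}

char LICENSE[] SEC("license") = "GPL";

A read request would look the same with IORING_OP_READ; as noted in the TODO
below, the current hack only handles writes.
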
> That means that, by using ebpf, we can implement various kinds of userspace
> logic in the kernel.
>
> From the example above, we can see that this method has at least 3 advantages:
> 1. Memory copies between the kernel block layer and the userspace daemon
> are removed completely.
> 2. Memory is saved. The userspace daemon doesn't need to maintain its own
> buffers to issue and complete io requests; the kernel block layer io
> request memory is used directly.
> 3. We may reduce the number of round trips between the kernel and the
> userspace daemon, and so reduce kernel/userspace context switch overhead.
>
> Test:
> Add a ublk loop target: ublk add -t loop -q 1 -d 128 -f loop.file
>
> fio job file:
> [global]
> direct=1
> filename=/dev/ublkb0
> time_based
> runtime=60
> numjobs=1
> cpus_allowed=1
>
> [rand-read-4k]
> bs=512K
> iodepth=16
> ioengine=libaio
> rw=randwrite
> stonewall
>
>
> Without this patch:
> WRITE: bw=745MiB/s (781MB/s), 745MiB/s-745MiB/s (781MB/s-781MB/s), io=43.6GiB (46.8GB), run=60010-60010msec
> ublk daemon's cpu utilization is about 9.3%~10.0%, as shown by top.
>
> With this patch:
> WRITE: bw=744MiB/s (781MB/s), 744MiB/s-744MiB/s (781MB/s-781MB/s), io=43.6GiB (46.8GB), run=60012-60012msec
> ublk daemon's cpu utilization is about 1.3%~1.7%, as shown by top.
>
> From the above tests, this method clearly reduces cpu copy overhead.
>
>
> TODO:
> I must say this patchset is just an RFC to show the design.
>
> 1) Currently, this patchset only makes the ublk ebpf prog submit io requests
> using io_uring in the kernel; cqe events still need to be handled in the
> userspace daemon. Once we later succeed in making io_uring handle cqes in
> the kernel, the ublk ebpf prog will be able to implement the whole io path
> in the kernel.
>
> 2) The ublk driver needs to work better with ebpf. Currently I added some
> hack code to support ebpf in the ublk driver, and it can only support
> write requests.
>
> 3) I have not done many tests yet; I will run liburing/ublk/blktests
> later.
>
> Any review and suggestions are welcome, thanks.
>
> [1] https://lore.kernel.org/all/[email protected]/
> [2] https://lore.kernel.org/all/[email protected]/
>
>
> Xiaoguang Wang (3):
> bpf: add UBLK program type
> io_uring: enable io_uring to submit sqes located in kernel
> ublk_drv: add ebpf support
>
> drivers/block/ublk_drv.c | 228 ++++++++++++++++++++++++++++++++-
> include/linux/bpf_types.h | 2 +
> include/linux/io_uring.h | 13 ++
> include/linux/io_uring_types.h | 8 +-
> include/uapi/linux/bpf.h | 2 +
> include/uapi/linux/ublk_cmd.h | 11 ++
> io_uring/io_uring.c | 59 ++++++++-
> io_uring/rsrc.c | 15 +++
> io_uring/rsrc.h | 3 +
> io_uring/rw.c | 7 +
> kernel/bpf/syscall.c | 1 +
> kernel/bpf/verifier.c | 9 +-
> scripts/bpf_doc.py | 4 +
> tools/include/uapi/linux/bpf.h | 9 ++
> tools/lib/bpf/libbpf.c | 2 +
> 15 files changed, 366 insertions(+), 7 deletions(-)
>
Hi,

Here is the perf report output of the ublk daemon (loop target):
+ 57.96% 4.03% ublk liburing.so.2.2 [.] _io_uring_get_cqe
+ 53.94% 0.00% ublk [kernel.vmlinux] [k] entry_SYSCALL_64
+ 53.94% 0.65% ublk [kernel.vmlinux] [k] do_syscall_64
+ 48.37% 1.18% ublk [kernel.vmlinux] [k] __do_sys_io_uring_enter
+ 42.92% 1.72% ublk [kernel.vmlinux] [k] io_cqring_wait
+ 35.17% 0.06% ublk [kernel.vmlinux] [k] task_work_run
+ 34.75% 0.53% ublk [kernel.vmlinux] [k] io_run_task_work_sig
+ 33.45% 0.00% ublk [kernel.vmlinux] [k] ublk_bpf_io_submit_fn
+ 33.16% 0.06% ublk bpf_prog_3bdc6181a3c616fb_ublk_io_submit_prog [k] bpf_prog_3bdc6181a3c616fb_ublk_io_submit_prog
+ 32.68% 0.00% iou-wrk-18583 [unknown] [k] 0000000000000000
+ 32.68% 0.00% iou-wrk-18583 [unknown] [k] 0x00007efe920b1040
+ 32.68% 0.00% iou-wrk-18583 [kernel.vmlinux] [k] ret_from_fork
+ 32.68% 0.47% iou-wrk-18583 [kernel.vmlinux] [k] io_wqe_worker
+ 30.61% 0.00% ublk [kernel.vmlinux] [k] io_submit_sqe
+ 30.31% 0.06% ublk [kernel.vmlinux] [k] io_issue_sqe
+ 28.00% 0.00% ublk [kernel.vmlinux] [k] bpf_ublk_queue_sqe
+ 28.00% 0.00% ublk [kernel.vmlinux] [k] io_uring_submit_sqe
+ 27.18% 0.00% ublk [kernel.vmlinux] [k] io_write
+ 27.18% 0.00% ublk [xfs] [k] xfs_file_write_iter
The call stack is:
- 57.96% 4.03% ublk liburing.so.2.2 [.] _io_uring_get_cqe
   - 53.94% _io_uring_get_cqe
        entry_SYSCALL_64
      - do_syscall_64
         - 48.37% __do_sys_io_uring_enter
            - 42.92% io_cqring_wait
               - 34.75% io_run_task_work_sig
                  - task_work_run
                     - 32.50% ublk_bpf_io_submit_fn
                        - 32.21% bpf_prog_3bdc6181a3c616fb_ublk_io_submit_prog
                           - 27.12% bpf_ublk_queue_sqe
                              - io_uring_submit_sqe
                                 - 26.64% io_submit_sqe
                                    - 26.35% io_issue_sqe
                                       - io_write
                                            xfs_file_write_iter
Here, "io_submit" ebpf prog will be run in task_work of ublk daemon
process after io_uring_enter() syscall. In this ebpf prog, a sqe is
built and submitted. All information about this blk-mq request is
stored in a "ctx". Then io_uring can write to the backing file
(xfs_file_write_iter).
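For context, the _io_uring_get_cqe/io_cqring_wait frames above come from the
daemon's ordinary completion loop; a generic liburing loop of that shape
(a sketch, not the actual ublksrv code) looks like:

#include <liburing.h>
#include <stdio.h>

/* Block in io_uring_enter() until at least one cqe is ready, then reap all
 * available cqes. The registered "io_submit" bpf prog runs from task_work
 * inside this very syscall, which is why it shows up under io_cqring_wait
 * in the profile. */
static int wait_and_reap(struct io_uring *ring)
{
        struct io_uring_cqe *cqe;
        unsigned head, reaped = 0;
        int ret;

        ret = io_uring_wait_cqe(ring, &cqe);
        if (ret < 0)
                return ret;

        io_uring_for_each_cqe(ring, head, cqe) {
                if (cqe->res < 0)
                        fprintf(stderr, "io error: %d\n", cqe->res);
                /* ... complete the corresponding ublk request here ... */
                reaped++;
        }
        io_uring_cq_advance(ring, reaped);
        return reaped;
}
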
Here is the call stack from the perf report output of fio:
- 5.04% 0.18% fio [kernel.vmlinux] [k] ublk_queue_rq
   - 4.86% ublk_queue_rq
      - 3.67% bpf_prog_b8456549dbe40c37_ublk_io_prep_prog
         - 3.10% bpf_trace_printk
              2.83% _raw_spin_unlock_irqrestore
      - 0.70% task_work_add
         - try_to_wake_up
              _raw_spin_unlock_irqrestore
Here, "io_prep" ebpf prog will be run in "ublk_queue_rq" process.
In this ebpf prog, qid, tag, nr_sectors, start_sector, op, flags
will be stored in one "ctx". Then we add a task_work to the ublk
daemon process.
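One way to picture the "io_prep" side, again only as a hedged sketch: the
real code stores into whatever in-kernel ctx the driver provides, but the
same idea can be shown with a per-tag array map and the hypothetical
ublk_io_bpf_ctx layout from the earlier sketch:

#include <linux/types.h>
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Same hypothetical layout as in the submit-prog sketch above. */
struct ublk_io_bpf_ctx {
        __u32 q_id;
        __u32 tag;
        __u32 op;
        __u32 flags;
        __u32 nr_sectors;
        __u64 start_sector;
};

struct ublk_req_desc {
        __u32 qid;
        __u32 tag;
        __u32 op;
        __u32 flags;
        __u32 nr_sectors;
        __u64 start_sector;
};

struct {
        __uint(type, BPF_MAP_TYPE_ARRAY);
        __uint(max_entries, 128);               /* one slot per tag (-d 128) */
        __type(key, __u32);
        __type(value, struct ublk_req_desc);
} req_descs SEC(".maps");

SEC("ublk")
int ublk_io_prep_prog(struct ublk_io_bpf_ctx *ctx)
{
        __u32 key = ctx->tag;
        struct ublk_req_desc *d = bpf_map_lookup_elem(&req_descs, &key);

        if (!d)
                return 0;

        /* Record the request description; the "io_submit" prog run later
         * from task_work uses it to build the sqe. */
        d->qid = ctx->q_id;
        d->tag = ctx->tag;
        d->op = ctx->op;
        d->flags = ctx->flags;
        d->nr_sectors = ctx->nr_sectors;
        d->start_sector = ctx->start_sector;
        return 0;
}

char LICENSE[] SEC("license") = "GPL";

The task_work_add() visible in the profile is presumably done by the driver
after this prog returns, not by the prog itself.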
Regards,
Zhang