From: Ming Lei <[email protected]>
To: Pavel Begunkov <[email protected]>
Cc: Jens Axboe <[email protected]>,
[email protected], [email protected],
Miklos Szeredi <[email protected]>,
ZiyangZhang <[email protected]>,
Xiaoguang Wang <[email protected]>,
Bernd Schubert <[email protected]>,
[email protected]
Subject: Re: [PATCH V2 00/17] io_uring/ublk: add IORING_OP_FUSED_CMD
Date: Wed, 8 Mar 2023 10:10:54 +0800
Message-ID: <[email protected]>
In-Reply-To: <[email protected]>
On Tue, Mar 07, 2023 at 05:17:04PM +0000, Pavel Begunkov wrote:
> On 3/7/23 15:37, Pavel Begunkov wrote:
> > On 3/7/23 14:15, Ming Lei wrote:
> > > Hello,
> > >
> > > Add IORING_OP_FUSED_CMD, it is one special URING_CMD, which has to
> > > be SQE128. The 1st SQE(master) is one 64byte URING_CMD, and the 2nd
> > > 64byte SQE(slave) is another normal 64byte OP. For any OP which needs
> > > to support slave OP, io_issue_defs[op].fused_slave needs to be set as 1,
> > > and its ->issue() can retrieve/import buffer from master request's
> > > fused_cmd_kbuf. The slave OP is actually submitted from kernel, part of
> > > this idea is from Xiaoguang's ublk ebpf patchset, but this patchset
> > > submits slave OP just like normal OP issued from userspace, that said,
> > > SQE order is kept, and batching handling is done too.
> >
> > From a quick look through patches it all looks a bit complicated
> > and intrusive, all over generic hot paths. I think instead we
> > should be able to use registered buffer table as intermediary and
> > reuse splicing. Let me try it out
>
> Here we go, isolated in a new opcode, and in the end should work
> with any file supporting splice. It's a quick prototype, it's lacking
> and there are many obvious fatal bugs. It also needs some optimisations,
> improvements on how executed by io_uring and extra stuff like
> memcpy ops and fixed buf recv/send. I'll clean it up.
>
> I used a test below, it essentially does zc recv.
>
> https://github.com/isilence/liburing/commit/81fe705739af7d9b77266f9aa901c1ada870739d
>
>
> From 87ad9e8e3aed683aa040fb4b9ae499f8726ba393 Mon Sep 17 00:00:00 2001
> Message-Id: <87ad9e8e3aed683aa040fb4b9ae499f8726ba393.1678208911.git.asml.silence@gmail.com>
> From: Pavel Begunkov <[email protected]>
> Date: Tue, 7 Mar 2023 17:01:44 +0000
> Subject: [POC 1/1] io_uring: splicing into reg buf table
>
> EXTREMELY BUGGY! Not for inclusion.
>
> Add a new operation called IORING_OP_SPLICE_FROM,
> which splices from a file into the registered buffer table. This is
> done in a zerocopy fashion with a caveat that the user won't have
> direct access to the data, however it can use it with any io_uring
> request supporting registered buffers.
>
> Signed-off-by: Pavel Begunkov <[email protected]>
> ---
> include/uapi/linux/io_uring.h | 1 +
> io_uring/io_uring.c | 4 +-
> io_uring/opdef.c | 10 ++++
> io_uring/splice.c | 98 +++++++++++++++++++++++++++++++++++
> io_uring/splice.h | 3 ++
> 5 files changed, 114 insertions(+), 2 deletions(-)
>
> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> index 709de6d4feb2..a91ce1d2ebd7 100644
> --- a/include/uapi/linux/io_uring.h
> +++ b/include/uapi/linux/io_uring.h
> @@ -223,6 +223,7 @@ enum io_uring_op {
> IORING_OP_URING_CMD,
> IORING_OP_SEND_ZC,
> IORING_OP_SENDMSG_ZC,
> + IORING_OP_SPLICE_FROM,
> /* this goes last, obviously */
> IORING_OP_LAST,
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index 7625597b5227..b7389a6ea190 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -2781,8 +2781,8 @@ static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
> io_wait_rsrc_data(ctx->file_data);
> mutex_lock(&ctx->uring_lock);
> - if (ctx->buf_data)
> - __io_sqe_buffers_unregister(ctx);
> + // if (ctx->buf_data)
> + // __io_sqe_buffers_unregister(ctx);
> if (ctx->file_data)
> __io_sqe_files_unregister(ctx);
> io_cqring_overflow_kill(ctx);
> diff --git a/io_uring/opdef.c b/io_uring/opdef.c
> index cca7c5b55208..28d4fa42676b 100644
> --- a/io_uring/opdef.c
> +++ b/io_uring/opdef.c
> @@ -428,6 +428,13 @@ const struct io_issue_def io_issue_defs[] = {
> .prep = io_eopnotsupp_prep,
> #endif
> },
> + [IORING_OP_SPLICE_FROM] = {
> + .needs_file = 1,
> + .unbound_nonreg_file = 1,
> + // .pollin = 1,
> + .prep = io_splice_from_prep,
> + .issue = io_splice_from,
> + }
> };
> @@ -648,6 +655,9 @@ const struct io_cold_def io_cold_defs[] = {
> .fail = io_sendrecv_fail,
> #endif
> },
> + [IORING_OP_SPLICE_FROM] = {
> + .name = "SPLICE_FROM",
> + }
> };
> const char *io_uring_get_opcode(u8 opcode)
> diff --git a/io_uring/splice.c b/io_uring/splice.c
> index 2a4bbb719531..0467e9f46e99 100644
> --- a/io_uring/splice.c
> +++ b/io_uring/splice.c
> @@ -8,11 +8,13 @@
> #include <linux/namei.h>
> #include <linux/io_uring.h>
> #include <linux/splice.h>
> +#include <linux/nospec.h>
> #include <uapi/linux/io_uring.h>
> #include "io_uring.h"
> #include "splice.h"
> +#include "rsrc.h"
> struct io_splice {
> struct file *file_out;
> @@ -119,3 +121,99 @@ int io_splice(struct io_kiocb *req, unsigned int issue_flags)
> io_req_set_res(req, ret, 0);
> return IOU_OK;
> }
> +
> +struct io_splice_from {
> + struct file *file;
> + loff_t off;
> + u64 len;
> +};
> +
> +
> +int io_splice_from_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
> +{
> + struct io_splice_from *sp = io_kiocb_to_cmd(req, struct io_splice_from);
> +
> + if (unlikely(sqe->splice_flags || sqe->splice_fd_in || sqe->ioprio ||
> + sqe->addr || sqe->addr3))
> + return -EINVAL;
> +
> + req->buf_index = READ_ONCE(sqe->buf_index);
> +
> + sp->len = READ_ONCE(sqe->len);
> + if (unlikely(!sp->len))
> + return -EINVAL;
> +
> + sp->off = READ_ONCE(sqe->off);
> + return 0;
> +}
> +
> +int io_splice_from(struct io_kiocb *req, unsigned int issue_flags)
> +{
> + struct io_splice_from *sp = io_kiocb_to_cmd(req, struct io_splice_from);
> + loff_t *ppos = (sp->off == -1) ? NULL : &sp->off;
> + struct io_mapped_ubuf *imu;
> + struct pipe_inode_info *pi;
> + struct io_ring_ctx *ctx;
> + unsigned int pipe_tail;
> + int ret, i, nr_pages;
> + u16 index;
> +
> + if (!sp->file->f_op->splice_read)
> + return -ENOTSUPP;
> +
> + pi = alloc_pipe_info();
The above should be replaced with a direct pipe; otherwise, allocating
a pipe inode for every request really hurts performance.
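To illustrate the point about allocation cost, here is a minimal userspace
sketch of the "allocate once, then reuse" pattern. It is only a toy model:
toy_pipe, toy_task, toy_alloc_pipe() and toy_get_direct_pipe() are
hypothetical names, not kernel APIs. In the kernel, splice_direct_to_actor()
follows a similar idea by caching an internal pipe in current->splice_pipe.

```c
#include <stdlib.h>

/* Toy stand-in for struct pipe_inode_info (hypothetical, userspace only). */
struct toy_pipe {
	unsigned int head;
	unsigned int tail;
};

/* Toy stand-in for a task that caches one internal pipe. */
struct toy_task {
	struct toy_pipe *cached_pipe;
};

/* Counts how many times the expensive allocation actually happens. */
int toy_alloc_count;

/* The expensive path: models what alloc_pipe_info() costs per call. */
struct toy_pipe *toy_alloc_pipe(void)
{
	struct toy_pipe *p = calloc(1, sizeof(*p));

	if (p)
		toy_alloc_count++;
	return p;
}

/*
 * The cheap path: allocate the pipe on first use only, then hand back
 * the cached one on every later call. Returns NULL on allocation failure.
 */
struct toy_pipe *toy_get_direct_pipe(struct toy_task *t)
{
	if (!t->cached_pipe)
		t->cached_pipe = toy_alloc_pipe();
	return t->cached_pipe;
}

/* Release the cached pipe when the task goes away. */
void toy_put_direct_pipe(struct toy_task *t)
{
	free(t->cached_pipe);
	t->cached_pipe = NULL;
}
```

With this shape, only the first request pays for the allocation; every later
request issued by the same task reuses the cached pipe.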
> + if (!pi)
> + return -ENOMEM;
> + pi->readers = 1;
> +
> + ret = sp->file->f_op->splice_read(sp->file, ppos, pi, sp->len, 0);
> + if (ret < 0)
> + goto done;
> +
> + nr_pages = pipe_occupancy(pi->head, pi->tail);
> + imu = kvmalloc(struct_size(imu, bvec, nr_pages), GFP_KERNEL);
> + if (!imu)
> + goto done;
> +
> + ret = 0;
> + pipe_tail = pi->tail;
> + for (i = 0; !pipe_empty(pi->head, pipe_tail); i++) {
> + unsigned int mask = pi->ring_size - 1; // kill mask
> + struct pipe_buffer *buf = &pi->bufs[pipe_tail & mask];
> +
> + bvec_set_page(&imu->bvec[i], buf->page, buf->len, buf->offset);
> + ret += buf->len;
> + pipe_tail++;
> + }
> + if (WARN_ON_ONCE(i != nr_pages))
> + return -EFAULT;
> +
> + ctx = req->ctx;
> + io_ring_submit_lock(ctx, issue_flags);
> + if (unlikely(req->buf_index >= ctx->nr_user_bufs)) {
> + /* TODO: cleanup pages */
> + ret = -EFAULT;
> + kvfree(imu);
> + goto done_unlock;
> + }
> + index = array_index_nospec(req->buf_index, ctx->nr_user_bufs);
> + if (ctx->user_bufs[index] != ctx->dummy_ubuf) {
> + /* TODO: cleanup pages */
> + kvfree(imu);
> + ret = -EFAULT;
> + goto done_unlock;
> + }
> +
> + imu->ubuf = 0;
> + imu->ubuf_end = ret;
> + imu->nr_bvecs = nr_pages;
> + ctx->user_bufs[index] = imu;
Your patch looks like it transfers ownership of the pages to an io_uring
fixed buffer, but unfortunately it can't be done this way. splice is
meant for moving data, not for transferring buffer ownership:
https://lore.kernel.org/linux-block/CAJfpeguQ3xn2-6svkkVXJ88tiVfcDd-eKi1evzzfvu305fMoyw@mail.gmail.com/
1) The pages are actually owned by the device side (ublk; here, sp->file),
and we only want to loan them to normal io_uring OPs.
2) After normal io_uring OPs have finished using these pages, the pages
have to be returned to sp->file, and that notification has to be done
explicitly, because the pages are owned by the sp->file that
splice_read() was called on.
3) The RW direction of the pages has to be strictly limited: in the case
of ublk/fuse, device pages can only be read or only be written, depending
on the direction of the user io request.
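The three requirements above can be sketched as a small userspace model. This
is only an illustration under stated assumptions: loaned_buf, loan_get(),
loan_put() and note_returned() are hypothetical names, and none of this is
the interface proposed in the patchset.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* What a borrower may do with the pages; fixed by the owner per request. */
enum loan_dir {
	LOAN_MAY_READ,	/* borrower may only read from the pages */
	LOAN_MAY_WRITE,	/* borrower may only write into the pages */
};

/* Hypothetical descriptor for pages that stay owned by the lender. */
struct loaned_buf {
	void *pages;		/* owned by the lender for its whole lifetime */
	size_t len;
	enum loan_dir dir;	/* requirement 3: one direction only */
	int borrowers;		/* OPs currently using the pages */
	void (*returned)(struct loaned_buf *b);	/* requirement 2 */
	int returned_count;	/* bookkeeping for the example callback */
};

/* Example owner-side callback: records that the pages came back. */
void note_returned(struct loaned_buf *b)
{
	b->returned_count++;
}

/* Requirement 1: borrow without taking ownership; reject wrong direction. */
int loan_get(struct loaned_buf *b, enum loan_dir want)
{
	if (want != b->dir)
		return -EPERM;
	b->borrowers++;
	return 0;
}

/* Requirement 2: the last borrower explicitly notifies the owner. */
void loan_put(struct loaned_buf *b)
{
	assert(b->borrowers > 0);
	if (--b->borrowers == 0 && b->returned)
		b->returned(b);
}
```

The owner never gives up the pages: borrowers only hold a counted reference,
the owner is told explicitly when the last one is done, and an access in the
wrong direction is refused up front.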
Also, IMO it isn't good to add a buffer to ctx->user_bufs[] for one-shot
use and then retrieve it only once; with a fused command, it can simply
be set via req->imu.
Thanks,
Ming