public inbox for [email protected]
From: Ming Lei <[email protected]>
To: Jens Axboe <[email protected]>
Cc: [email protected], [email protected],
	[email protected],
	Alexander Viro <[email protected]>,
	Stefan Hajnoczi <[email protected]>,
	Miklos Szeredi <[email protected]>,
	Bernd Schubert <[email protected]>,
	Nitesh Shetty <[email protected]>,
	Christoph Hellwig <[email protected]>,
	Ziyang Zhang <[email protected]>,
	[email protected]
Subject: Re: [PATCH 3/4] io_uring: add IORING_OP_READ[WRITE]_SPLICE_BUF
Date: Sun, 12 Feb 2023 11:22:42 +0800
Message-ID: <Y+hbggDCm9wViPAv@T590>
In-Reply-To: <[email protected]>

On Sat, Feb 11, 2023 at 09:52:58AM -0700, Jens Axboe wrote:
> On 2/11/23 9:12 AM, Ming Lei wrote:
> > On Sat, Feb 11, 2023 at 08:45:18AM -0700, Jens Axboe wrote:
> >> On 2/10/23 8:32 AM, Ming Lei wrote:
> >>> IORING_OP_READ_SPLICE_BUF: read to buffer which is built from
> >>> ->splice_read() of specified fd, so user needs to provide (splice_fd, offset, len)
> >>> for building buffer.
> >>>
> >>> IORING_OP_WRITE_SPLICE_BUF: write from buffer which is built from
> >>> ->splice_read() of specified fd, so user needs to provide (splice_fd, offset, len)
> >>> for building buffer.
> >>>
> >>> The typical use case is for supporting ublk/fuse io_uring zero copy,
> >>> and READ/WRITE OP retrieves ublk/fuse request buffer via direct pipe
> >>> from device->splice_read(), then READ/WRITE can be done to/from this
> >>> buffer directly.
> >>
> >> Main question here - would this be better not plumbed up through the rw
> >> path? Might be cleaner, even if it either requires a bit of helper
> >> refactoring or accepting a bit of duplication. But would still be better
> >> than polluting the rw fast path imho.
> > 
> > The buffer is actually IO buffer, which has to be plumbed up in IO path,
> > and it can't be done like the registered buffer.
> > 
> > The only effect on the fast path is:
> > 
> > 		if (io_rw_splice_buf(req))	// which just checks the opcode
> > 			return io_prep_rw_splice_buf(req, sqe);
> > 
> > and the cleanup code which is only done for the two new OPs.
> > 
> > Or maybe I misunderstand your point? Or any detailed suggestion?
> > 
> > Actually the code should be factored into a generic helper, since net.c
> > needs to use it too. Probably it needs to move to rsrc.c?
> 
> Yep, just refactoring out those bits as a prep thing. rsrc could work,
> or perhaps a new file for that.

OK.

> 
> >> Also seems like this should be separately testable. We can't add new
> >> opcodes that don't have a feature test at least, and should also have
> >> various corner case tests. A bit of commenting outside of this below.
> > 
> > OK, I will write/add one very simple ublk userspace to liburing for
> > test purpose.
> 
> Thanks!

Thinking about this further: if we use ublk for liburing tests, root is
often needed. Unprivileged mode is supported, but it requires an
administrator to grant access, so is ublk still a good choice?

It could be easier to add ->splice_read() to /dev/zero for test
purposes: just allocate zeroed pages in ->splice_read() and add them
to the pipe, as ublk's ->splice_read() does. The sink side can then
read from or write to these pages, and zero's read_iter_zero() won't
be affected. Normal splice/tee won't be able to connect to /dev/zero
either, because we only allow this for in-kernel use.
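
To make the /dev/zero idea concrete, here is a userspace-only sketch of the producer side: it queues zero-filled pages into a pipe-like array, one slot per page, which the sink may then read or overwrite. Every name here (sketch_pipe, sketch_zero_splice_read(), the 4096-byte page and 16-slot limit) is an assumption made for illustration; it is not the kernel's pipe or splice API.

```c
#include <stdlib.h>

#define SKETCH_PAGE_SIZE  4096u	/* assumed page size */
#define SKETCH_PIPE_SLOTS 16u	/* assumed pipe capacity */

/* Hypothetical stand-in for a direct pipe: one slot per queued page. */
struct sketch_pipe {
	unsigned char *pages[SKETCH_PIPE_SLOTS];
	size_t lens[SKETCH_PIPE_SLOTS];
	unsigned nr;
};

/*
 * Producer side: queue zero-filled pages covering `len` bytes.
 * Returns the number of bytes queued, or -1 if nothing could be allocated.
 */
static long sketch_zero_splice_read(struct sketch_pipe *pipe, size_t len)
{
	long queued = 0;

	while (len && pipe->nr < SKETCH_PIPE_SLOTS) {
		size_t chunk = len < SKETCH_PAGE_SIZE ? len : SKETCH_PAGE_SIZE;
		unsigned char *page = calloc(1, SKETCH_PAGE_SIZE);

		if (!page)
			return queued ? queued : -1;
		pipe->pages[pipe->nr] = page;
		pipe->lens[pipe->nr] = chunk;
		pipe->nr++;
		len -= chunk;
		queued += (long)chunk;
	}
	return queued;
}

/* Drop every queued page once the sink is done with them. */
static void sketch_pipe_release(struct sketch_pipe *pipe)
{
	for (unsigned i = 0; i < pipe->nr; i++)
		free(pipe->pages[i]);
	pipe->nr = 0;
}
```

The pages are private to the pipe in this model, so a sink writing into them does not touch any shared zero page, which is the property the test would rely on.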

> 
> >>> diff --git a/io_uring/opdef.c b/io_uring/opdef.c
> >>> index 5238ecd7af6a..91e8d8f96134 100644
> >>> --- a/io_uring/opdef.c
> >>> +++ b/io_uring/opdef.c
> >>> @@ -427,6 +427,31 @@ const struct io_issue_def io_issue_defs[] = {
> >>>  		.prep			= io_eopnotsupp_prep,
> >>>  #endif
> >>>  	},
> >>> +	[IORING_OP_READ_SPLICE_BUF] = {
> >>> +		.needs_file		= 1,
> >>> +		.unbound_nonreg_file	= 1,
> >>> +		.pollin			= 1,
> >>> +		.plug			= 1,
> >>> +		.audit_skip		= 1,
> >>> +		.ioprio			= 1,
> >>> +		.iopoll			= 1,
> >>> +		.iopoll_queue		= 1,
> >>> +		.prep			= io_prep_rw,
> >>> +		.issue			= io_read,
> >>> +	},
> >>> +	[IORING_OP_WRITE_SPLICE_BUF] = {
> >>> +		.needs_file		= 1,
> >>> +		.hash_reg_file		= 1,
> >>> +		.unbound_nonreg_file	= 1,
> >>> +		.pollout		= 1,
> >>> +		.plug			= 1,
> >>> +		.audit_skip		= 1,
> >>> +		.ioprio			= 1,
> >>> +		.iopoll			= 1,
> >>> +		.iopoll_queue		= 1,
> >>> +		.prep			= io_prep_rw,
> >>> +		.issue			= io_write,
> >>> +	},
> >>
> >> Are these really safe with iopoll?
> > 
> > Yeah, after the buffer is built, the handling is basically
> > the same as IORING_OP_WRITE_FIXED, so I think it is safe.
> 
> Yeah, on a second look, as these are just using the normal read/write
> path after that should be fine indeed.
> 
> >>
> >>> +static int io_prep_rw_splice_buf(struct io_kiocb *req,
> >>> +				 const struct io_uring_sqe *sqe)
> >>> +{
> >>> +	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
> >>> +	unsigned nr_pages = io_rw_splice_buf_nr_bvecs(rw->len);
> >>> +	loff_t splice_off = READ_ONCE(sqe->splice_off_in);
> >>> +	struct io_rw_splice_buf_data data;
> >>> +	struct io_mapped_ubuf *imu;
> >>> +	struct fd splice_fd;
> >>> +	int ret;
> >>> +
> >>> +	splice_fd = fdget(READ_ONCE(sqe->splice_fd_in));
> >>> +	if (!splice_fd.file)
> >>> +		return -EBADF;
> >>
> >> Seems like this should check for SPLICE_F_FD_IN_FIXED, and also use
> >> io_file_get_normal() for the non-fixed case in case someone passed in an
> >> io_uring fd.
> > 
> > SPLICE_F_FD_IN_FIXED needs one extra word for holding splice flags, if
> > we can use sqe->addr3, I think it is doable.
> 
> I haven't checked the rest, but you can't just use ->splice_flags for
> this?

->splice_flags shares storage with ->rw_flags, so it can't be used.

I think it is fine to use ->addr3, given that io_getxattr()/io_setxattr()/
io_msg_ring() already use it.
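
The aliasing problem can be shown with a deliberately simplified model. The struct below is hypothetical and is not the real struct io_uring_sqe layout; it only keeps the two properties that matter here: the rw flags and splice flags live in one union, while addr3 is a separate word.

```c
#include <stdint.h>

/* Simplified, hypothetical model of the relevant SQE fields. */
struct sketch_sqe {
	union {
		int32_t  rw_flags;	/* consumed by read/write opcodes */
		uint32_t splice_flags;	/* consumed by splice/tee opcodes */
	};
	uint64_t addr3;			/* separate word, unused by plain rw */
};

/* Returns 1 if storing splice flags overwrites rw_flags (shared storage). */
static int sketch_splice_flags_clobber_rw_flags(void)
{
	struct sketch_sqe sqe = { .rw_flags = 0x10 };

	sqe.splice_flags = 1U << 31;
	return sqe.rw_flags != 0x10;
}

/* Returns 1 if carrying the flags in addr3 leaves rw_flags intact. */
static int sketch_addr3_keeps_rw_flags(void)
{
	struct sketch_sqe sqe = { .rw_flags = 0x10 };

	sqe.addr3 = 1U << 31;
	return sqe.rw_flags == 0x10 && sqe.addr3 == (1U << 31);
}
```

Since the new opcodes go through the normal read/write prep, they still need the rw flags, so any splice flag has to travel in a field the rw path does not read.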

> 
> In any case, the get path needs to look like io_tee() here, and:
> 
> >>> +out_put_fd:
> >>> +	if (splice_fd.file)
> >>> +		fdput(splice_fd);
> 
> this put needs to be gated on whether it's a fixed file or not.

Yeah.
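
A userspace model of the get/put pairing being asked for, assuming the usual rule that a fixed (registered) file is pinned by the ring's table and must not be put by the request, while a normal lookup takes a reference that has to be dropped. All names are hypothetical stand-ins: SKETCH_FD_IN_FIXED plays the role of SPLICE_F_FD_IN_FIXED, and sketch_get_fixed()/sketch_get_normal() are not the kernel helpers.

```c
#include <stddef.h>

#define SKETCH_FD_IN_FIXED (1U << 31)	/* stand-in for SPLICE_F_FD_IN_FIXED */

struct sketch_file {
	int refs;
};

/* Fixed files are pinned by the registered table: no extra reference. */
static struct sketch_file *sketch_get_fixed(struct sketch_file *table,
					    size_t nr, unsigned idx)
{
	return idx < nr ? &table[idx] : NULL;
}

/* A normal lookup takes a reference that the caller must drop. */
static struct sketch_file *sketch_get_normal(struct sketch_file *file)
{
	if (file)
		file->refs++;
	return file;
}

static void sketch_put(struct sketch_file *file)
{
	file->refs--;
}

/* Returns 0 on success, -1 where the kernel would return -EBADF. */
static int sketch_build_buf(struct sketch_file *table, size_t nr,
			    struct sketch_file *normal, unsigned idx,
			    unsigned flags)
{
	struct sketch_file *in;

	if (flags & SKETCH_FD_IN_FIXED)
		in = sketch_get_fixed(table, nr, idx);
	else
		in = sketch_get_normal(normal);
	if (!in)
		return -1;

	/* ... build the buffer from `in` here ... */

	/* The put is gated on the same flag as the get. */
	if (!(flags & SKETCH_FD_IN_FIXED))
		sketch_put(in);
	return 0;
}
```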

> 
> >> If the operation is done, clear NEED_CLEANUP and do the cleanup here?
> >> That'll be faster.
> > 
> > The buffer has to be cleaned up after the req is completed, since the
> > bvec table is needed for the bio, and page references need to be
> > dropped after IO is done too.
> 
> I mean when you clear that flag, call the cleanup bits you otherwise
> would've called on later cleanup.

Got it.
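
One way to read the suggestion, as a userspace sketch: once the IO has completed, clear the flag and release the buffer in the same place, so the generic late cleanup finds nothing left to do. The request struct and helper names are hypothetical; SKETCH_F_NEED_CLEANUP stands in for REQ_F_NEED_CLEANUP.

```c
#define SKETCH_F_NEED_CLEANUP 0x1u	/* stand-in for REQ_F_NEED_CLEANUP */

struct sketch_req {
	unsigned flags;
	int buf_live;		/* 1 while the bvec table and page refs are held */
	int cleanup_calls;	/* counts releases, to catch double cleanup */
};

/* Release the buffer: free the bvec table and drop the page references. */
static void sketch_cleanup_buf(struct sketch_req *req)
{
	req->buf_live = 0;
	req->cleanup_calls++;
}

/* Prep builds the buffer and marks the request as needing cleanup. */
static void sketch_prep(struct sketch_req *req)
{
	req->buf_live = 1;
	req->cleanup_calls = 0;
	req->flags = SKETCH_F_NEED_CLEANUP;
}

/* IO finished: clear the flag and do the cleanup right here. */
static void sketch_complete(struct sketch_req *req)
{
	if (req->flags & SKETCH_F_NEED_CLEANUP) {
		req->flags &= ~SKETCH_F_NEED_CLEANUP;
		sketch_cleanup_buf(req);
	}
}

/* Generic late cleanup: only acts if the flag is still set. */
static void sketch_generic_cleanup(struct sketch_req *req)
{
	if (req->flags & SKETCH_F_NEED_CLEANUP) {
		req->flags &= ~SKETCH_F_NEED_CLEANUP;
		sketch_cleanup_buf(req);
	}
}
```

Either path releases the buffer exactly once, and neither runs before the IO is done, which keeps the bvec table valid for the bio.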

Thanks,
Ming


Thread overview: 30+ messages
2023-02-10 15:32 [PATCH 0/4] io_uring: add IORING_OP_READ[WRITE]_SPLICE_BUF Ming Lei
2023-02-10 15:32 ` [PATCH 1/4] fs/splice: enhance direct pipe & splice for moving pages in kernel Ming Lei
2023-02-11 15:42   ` Ming Lei
2023-02-11 18:57     ` Linus Torvalds
2023-02-12  1:39       ` Ming Lei
2023-02-13 20:04         ` Linus Torvalds
2023-02-14  0:52           ` Ming Lei
2023-02-14  2:35             ` Ming Lei
2023-02-14 11:03           ` Miklos Szeredi
2023-02-14 14:35             ` Ming Lei
2023-02-14 15:39               ` Miklos Szeredi
2023-02-15  0:11                 ` Ming Lei
2023-02-15 10:36                   ` Miklos Szeredi
2023-02-10 15:32 ` [PATCH 2/4] fs/splice: allow to ignore signal in __splice_from_pipe Ming Lei
2023-02-10 15:32 ` [PATCH 3/4] io_uring: add IORING_OP_READ[WRITE]_SPLICE_BUF Ming Lei
2023-02-11 15:45   ` Jens Axboe
2023-02-11 16:12     ` Ming Lei
2023-02-11 16:52       ` Jens Axboe
2023-02-12  3:22         ` Ming Lei [this message]
2023-02-12  3:55           ` Jens Axboe
2023-02-13  1:06             ` Ming Lei
2023-02-11 17:13   ` Jens Axboe
2023-02-12  1:48     ` Ming Lei
2023-02-12  2:42       ` Jens Axboe
2023-02-10 15:32 ` [PATCH 4/4] ublk_drv: support splice based read/write zero copy Ming Lei
2023-02-10 21:54 ` [PATCH 0/4] io_uring: add IORING_OP_READ[WRITE]_SPLICE_BUF Jens Axboe
2023-02-10 22:19   ` Jens Axboe
2023-02-11  5:13   ` Ming Lei
2023-02-11 15:45     ` Jens Axboe
2023-02-14 16:36 ` Stefan Hajnoczi
