public inbox for [email protected]
 help / color / mirror / Atom feed
From: Ming Lei <[email protected]>
To: Jens Axboe <[email protected]>
Cc: [email protected], [email protected],
	[email protected],
	Alexander Viro <[email protected]>,
	Stefan Hajnoczi <[email protected]>,
	Miklos Szeredi <[email protected]>,
	Bernd Schubert <[email protected]>,
	Nitesh Shetty <[email protected]>,
	Christoph Hellwig <[email protected]>,
	Ziyang Zhang <[email protected]>,
	[email protected],
	Linus Torvalds <[email protected]>
Subject: Re: [PATCH 3/4] io_uring: add IORING_OP_READ[WRITE]_SPLICE_BUF
Date: Mon, 13 Feb 2023 09:06:03 +0800	[thread overview]
Message-ID: <Y+mM+z519WhoOZpk@T590> (raw)
In-Reply-To: <[email protected]>

On Sat, Feb 11, 2023 at 08:55:23PM -0700, Jens Axboe wrote:
> On 2/11/23 8:22?PM, Ming Lei wrote:
> >>>> Also seems like this should be separately testable. We can't add new
> >>>> opcodes that don't have a feature test at least, and should also have
> >>>> various corner case tests. A bit of commenting outside of this below.
> >>>
> >>> OK, I will write/add one very simple ublk userspace to liburing for
> >>> test purpose.
> >>
> >> Thanks!
> > 
> > Thinking of further, if we use ublk for liburing test purpose, root is
> > often needed, even though we support un-privileged mode, which needs
> > administrator to grant access, so is it still good to do so?
> 
> That's fine, some tests already depend on root for certain things, like
> passthrough. When I run the tests, I do a pass as both a regular user
> and as root. The important bit is just that the tests skip when they are
> not root rather than fail.

OK, that is nice! I am going to write one minimized ublk userspace,
which can be used in both liburing & blktests for test purpose.

> 
> > It could be easier to add ->splice_read() on /dev/zero for test
> > purpose, just allocate zeroed pages in ->splice_read(), and add
> > them to pipe like ublk->splice_read(), and sink side can read
> > from or write to these pages, but zero's read_iter_zero() won't
> > be affected. And normal splice/tee won't connect to zero too
> > because we only allow it from kernel use.
> 
> Arguably /dev/zero should still support splice_read() as a regression
> fix as I argued to Linus, so I'd just add that as a prep patch.

Understood.

> 
> >>>> Seems like this should check for SPLICE_F_FD_IN_FIXED, and also use
> >>>> io_file_get_normal() for the non-fixed case in case someone passed in an
> >>>> io_uring fd.
> >>>
> >>> SPLICE_F_FD_IN_FIXED needs one extra word for holding splice flags, if
> >>> we can use sqe->addr3, I think it is doable.
> >>
> >> I haven't checked the rest, but you can't just use ->splice_flags for
> >> this?
> > 
> > ->splice_flags shares memory with rwflags, so can't be used.
> > 
> > I think it is fine to use ->addr3, given io_getxattr()/io_setxattr()/
> > io_msg_ring() has used that.
> 
> This is part of the confusion, as you treat it basically like a
> read/write internally, but the opcode names indicate differently. Why
> not just have a separate prep helper for these and then use a layout

Looks separate prep is cleaner.

> that makes more sense,surely rwflags aren't applicable for these
> anyway? I think that'd make it a lot cleaner.
> 
> Yeah, addr3 could easily be used, but it's makes for a really confusing
> command structure when the command is kinda-read but also kinda-splice.
> And it arguable makes more sense to treat it as the latter, as it takes
> the two fds like splice.

Yeah, it can be thought as one direct splice between device and generic
file, and I'd take more words to share the whole story here.

1) traditional splice is: 

file A(read file A to pipe buffer) -> pipe -> file B(write pipe buffer to file B)

which implements zero copy for 'cp file_A file_B'.

2) device splice (device -> pipe -> file)

If only for doing 'cp device file', the current splice works just fine, however
we do not have syscall for handling the following more generic scenario:

	dev(produce buffer to pipe) -> pipe -> file(consume buffer from pipe)

splice is supposed for transferring pages over pipe, so the above model
is reasonable. And we do have kernel APIs for doing the above by direct
pipe: splice_direct_to_actor & __splice_from_pipe.

That is basically what this patch does, and consumer could be READ
or WRITE. The current splice syscall only supports WRITE consumer(
write pipe buffer to file) in pipe's read end.

It can be used for fuse to support READ zero copy(fuse write zero copy
was supported in 2010, but never support read zero copy because of missing
such syscall), and for supporting ublk zero copy.

Actually it can help to implement any "virt" drivers, most of which just
talks with file or socket, or anything which can be handled in userspace
efficiently.

Follows the usage, suppose the syscall is named as dev_splice()

	1) "poll" device A for incoming request
	- "poll' just represents one kernel/user communication, once it
	returns, there is request peeked

    - device is one virt device and exposed to userspace and for
	receiving request from userspace

	2) handling driver specific logic
	   - it could be request mapping, such as logical volume manager,
		 the logic can be quite complicated to require Btree to map
		 request from A to underlying files, such as dm-thin or bcache

	   - it could be network based device, nbd, ceph RBD, drbd, ..., usb
	   over IP, .., there could be meta lookup over socket, send
	   command/recv reply, crypto enc/dec, ...

	3) dev_splice(A, A_offset, len, B, B_offset, OP)
	- A is the device
	- B is one generic file(regular file, block device, socket, ...)
	- OP is the consumer operation(could just be read/write or net
	  recv/send)
	- A(produce buffer) -> direct pipe -> B(consume the buffer from
	pipe by OP)

	4) after the above device_splice() is completed, request is
	completed, and send notification to userspace

Now we have io_uring command for handling 1) & 4) efficiently, the
test over ublk has shown this point. If 3) can be supported, the
complicated driver/device logic in 2) can be moved to userspace.
Then the whole driver can be implemented in userspace efficiently,
performance could even be better than kernel driver[1][2].

The trouble is that when dev_splice() becomes much more generic, it is
harder to define it as syscall. It could be easier with io_uring
compared with splice() syscall, but still not easy enough:

- if the sqe interface(for providing parameters) can be stable from
  beginning, or can be extended in future easily

- REQ_F_* has been used up

- may cause some trouble inside io_uring implementation, such as, how
to convert to other io_uring OP handling with the provide consumer op code.

That is why I treat it as io_uring read/write from the beginning, the other
two could be just treated as net recv/send, and only difference is that
buffer is retrieved from direct pipe via splice(device, offset, len).

So which type is better?  Treating it as dev_spice() or specific consumer
OP? If we treat it as splice, is it better to define it as one single
generic OP, or muliple OPs and each one is mapped to single consumer OP?

I am open about the interface definition, any comment/suggestion is
highly welcome!

[1] https://lore.kernel.org/linux-block/Yza1u1KfKa7ycQm0@T590/
[2] https://lore.kernel.org/lkml/[email protected]/T/

Thanks,
Ming


  reply	other threads:[~2023-02-13  1:07 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-02-10 15:32 [PATCH 0/4] io_uring: add IORING_OP_READ[WRITE]_SPLICE_BUF Ming Lei
2023-02-10 15:32 ` [PATCH 1/4] fs/splice: enhance direct pipe & splice for moving pages in kernel Ming Lei
2023-02-11 15:42   ` Ming Lei
2023-02-11 18:57     ` Linus Torvalds
2023-02-12  1:39       ` Ming Lei
2023-02-13 20:04         ` Linus Torvalds
2023-02-14  0:52           ` Ming Lei
2023-02-14  2:35             ` Ming Lei
2023-02-14 11:03           ` Miklos Szeredi
2023-02-14 14:35             ` Ming Lei
2023-02-14 15:39               ` Miklos Szeredi
2023-02-15  0:11                 ` Ming Lei
2023-02-15 10:36                   ` Miklos Szeredi
2023-02-10 15:32 ` [PATCH 2/4] fs/splice: allow to ignore signal in __splice_from_pipe Ming Lei
2023-02-10 15:32 ` [PATCH 3/4] io_uring: add IORING_OP_READ[WRITE]_SPLICE_BUF Ming Lei
2023-02-11 15:45   ` Jens Axboe
2023-02-11 16:12     ` Ming Lei
2023-02-11 16:52       ` Jens Axboe
2023-02-12  3:22         ` Ming Lei
2023-02-12  3:55           ` Jens Axboe
2023-02-13  1:06             ` Ming Lei [this message]
2023-02-11 17:13   ` Jens Axboe
2023-02-12  1:48     ` Ming Lei
2023-02-12  2:42       ` Jens Axboe
2023-02-10 15:32 ` [PATCH 4/4] ublk_drv: support splice based read/write zero copy Ming Lei
2023-02-10 21:54 ` [PATCH 0/4] io_uring: add IORING_OP_READ[WRITE]_SPLICE_BUF Jens Axboe
2023-02-10 22:19   ` Jens Axboe
2023-02-11  5:13   ` Ming Lei
2023-02-11 15:45     ` Jens Axboe
2023-02-14 16:36 ` Stefan Hajnoczi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Y+mM+z519WhoOZpk@T590 \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox