public inbox for [email protected]
 help / color / mirror / Atom feed
From: Ming Lei <[email protected]>
To: Bernd Schubert <[email protected]>
Cc: Jens Axboe <[email protected]>,
	"[email protected]" <[email protected]>,
	"[email protected]" <[email protected]>,
	"[email protected]" <[email protected]>,
	Miklos Szeredi <[email protected]>,
	ZiyangZhang <[email protected]>,
	Xiaoguang Wang <[email protected]>,
	Pavel Begunkov <[email protected]>,
	Stefan Hajnoczi <[email protected]>,
	Dan Williams <[email protected]>
Subject: Re: [PATCH V6 17/17] block: ublk_drv: apply io_uring FUSED_CMD for supporting zero copy
Date: Sat, 1 Apr 2023 21:19:11 +0800	[thread overview]
Message-ID: <[email protected]> (raw)
In-Reply-To: <[email protected]>

On Fri, Mar 31, 2023 at 07:13:21PM +0000, Bernd Schubert wrote:
> On 3/30/23 13:36, Ming Lei wrote:
> > Apply io_uring fused command for supporting zero copy:
> > 
> > 1) init the fused cmd buffer(io_mapped_buf) in ublk_map_io(), and deinit it
> > in ublk_unmap_io(), and this buffer is immutable, so it is just fine to
> > retrieve it from concurrent fused command.
> > 
> > 1) add sub-command opcode of UBLK_IO_FUSED_SUBMIT_IO for retrieving this
> > fused cmd(zero copy) buffer
> > 
> > 2) call io_fused_cmd_start_secondary_req() to provide buffer to secondary
> > request and submit secondary request; meantime setup complete callback via
> > this API, once secondary request is completed, the complete callback is
> > called for freeing the buffer and completing the fused command
> > 
> > Also request reference is held during fused command lifetime, and this way
> > guarantees that request buffer won't be freed until all inflight fused
> > commands are completed.
> > 
> > userspace(only implement sqe128 fused command):
> > 
> > 	https://github.com/ming1/ubdsrv/tree/fused-cmd-zc-for-v6
> > 
> > liburing test(only implement normal sqe fused command: two 64byte SQEs)
> > 
> > 	https://github.com/ming1/liburing/tree/fused_cmd_miniublk_for_v6
> > 
> > Signed-off-by: Ming Lei <[email protected]>
> > ---
> >   Documentation/block/ublk.rst  | 126 ++++++++++++++++++++--
> >   drivers/block/ublk_drv.c      | 192 ++++++++++++++++++++++++++++++++--
> >   include/uapi/linux/ublk_cmd.h |   6 +-
> >   3 files changed, 303 insertions(+), 21 deletions(-)
> > 
> > diff --git a/Documentation/block/ublk.rst b/Documentation/block/ublk.rst
> > index 1713b2890abb..7b7aa24e9729 100644
> > --- a/Documentation/block/ublk.rst
> > +++ b/Documentation/block/ublk.rst
> > @@ -297,18 +297,126 @@ with specified IO tag in the command data:
> >     ``UBLK_IO_COMMIT_AND_FETCH_REQ`` to the server, ublkdrv needs to copy
> >     the server buffer (pages) read to the IO request pages.
> >   
> > -Future development
> > -==================
> > +- ``UBLK_IO_FUSED_SUBMIT_IO``
> > +
> > +  Used for implementing zero copy feature.
> > +
> > +  It has to been the primary command of io_uring fused command. This command
> 
> s/been/be/ ?

Yeah.

> 
> > +  submits the generic secondary IO request with io buffer provided by our primary
> > +  command, and won't be completed until the secondary request is done.
> > +
> > +  The provided buffer is represented as ``io_uring_bvec_buf``, which is
> > +  actually ublk request buffer's reference, and the reference is shared &
> > +  read-only, so the generic secondary request can retrieve any part of the buffer
> > +  by passing buffer offset & length.
> >   
> >   Zero copy
> > ----------
> > +=========
> > +
> > +What is zero copy?
> > +------------------
> > +
> > +When application submits IO to ``/dev/ublkb*``, userspace buffer(direct io)
> > +or page cache buffer(buffered io) or kernel buffer(meta io often) is used
> > +for submitting data to ublk driver, and all kinds of these buffers are
> > +represented by bio/bvecs(ublk request buffer) finally. Before supporting
> > +zero copy, data in these buffers has to be copied to ublk server userspace
> > +buffer before handling WRITE IO, or after handing READ IO, so that ublk
> > +server can handle IO for ``/dev/ublkb*`` with the copied data.
> > +
> > +The extra copy between ublk request buffer and ublk server userspace buffer
> > +not only increases CPU utilization(such as pinning pages, copy data), but
> > +also consumes memory bandwidth, and the cost could be very big when IO size
> > +is big. It is observed that ublk-null IOPS may be increased to ~5X if the
> > +extra copy can be avoided.
> > +
> > +So zero copy is very important for supporting high performance block device
> > +in userspace.
> > +
> > +Technical requirements
> > +----------------------
> > +
> > +- ublk request buffer use
> > +
> > +ublk request buffer is represented by bio/bvec, which is immutable, so do
> > +not try to change bvec via buffer reference; data can be read from or
> > +written to the buffer according to buffer direction, but bvec can't be
> > +changed
> > +
> > +- buffer lifetime
> > +
> > +Ublk server borrows ublk request buffer for handling ublk IO, ublk request
> > +buffer reference is used. Reference can't outlive the referent buffer. That
> > +means all request buffer references have to be released by ublk server
> > +before ublk driver completes this request, when request buffer ownership
> > +is transferred to upper layer(FS, application, ...).
> > +
> > +Also after ublk request is completed, any page belonging to this ublk
> > +request can not be written or read any more from ublk server since it is
> > +one block device from kernel viewpoint.
> > +
> > +- buffer direction
> > +
> > +For ublk WRITE request, ublk request buffer should only be accessed as data
> > +source, and the buffer can't be written by ublk server
> > +
> > +For ublk READ request, ublk request buffer should only be accessed as data
> > +destination, and the buffer can't be read by ublk server, otherwise kernel
> > +data is leaked to ublk server, which can be unprivileged application.
> > +
> > +- arbitrary size sub-buffer needs to be retrieved from ublk server
> > +
> > +ublk is one generic framework for implementing block device in userspace,
> > +and typical requirements include logical volume manager(mirror, stripped, ...),
> > +distributed network storage, compressed target, ...
> > +
> > +ublk server needs to retrieve arbitrary size sub-buffer of ublk request, and
> > +ublk server needs to submit IOs with these sub-buffer(s). That also means
> > +arbitrary size sub-buffer(s) can be used to submit IO multiple times.
> > +
> > +Any sub-buffer is actually one reference of ublk request buffer, which
> > +ownership can't be transferred to upper layer if any reference is held
> > +by ublk server.
> > +
> > +Why slice isn't good for ublk zero copy
> 
> slice

Good catch.

> 
> > +---------------------------------------
> > +
> > +- spliced page from ->splice_read() can't be written
> > +
> > +ublk READ request can't be handled because spliced page can't be written to, and
> > +extending splice for ublk zero copy isn't one good solution [#splice_extend]_
> > +
> > +- it is very hard to meet above requirements  wrt. request buffer lifetime
> > +
> > +splice/pipe focuses on page reference lifetime, but ublk zero copy pays more
> > +attention to ublk request buffer lifetime. If is very inefficient to respect
> > +request buffer lifetime by using all pipe buffer's ->release() which requires
> > +all pipe buffers and pipe to be kept when ublk server handles IO. That means
> > +one single dedicated ``pipe_inode_info`` has to be allocated runtime for each
> > +provided buffer, and the pipe needs to be populated with pages in ublk request
> > +buffer.
> > +
> > +
> > +io_uring fused command based zero copy
> > +--------------------------------------
> > +
> > +io_uring fused command includes one primary command(uring command) and one
> > +generic secondary request. The primary command is responsible for submitting
> > +secondary request with provided buffer from ublk request, and primary command
> > +won't be completed until the secondary request is completed.
> > +
> > +Typical ublk IO handling includes network and FS IO, so it is usual enough
> > +for io_uring net & fs to support IO with provided buffer from primary command.
> >   
> > -Zero copy is a generic requirement for nbd, fuse or similar drivers. A
> 
> Maybe replace 'requirement' with "very useful"? It works without zc, 
> just with limited performance.

OK, also it has been observed that the extra copy affects performance a log for
big IO size(64K+). 

> 
> > -problem [#xiaoguang]_ Xiaoguang mentioned is that pages mapped to userspace
> > -can't be remapped any more in kernel with existing mm interfaces. This can
> > -occurs when destining direct IO to ``/dev/ublkb*``. Also, he reported that
> > -big requests (IO size >= 256 KB) may benefit a lot from zero copy.
> > +Once primary command is submitted successfully, ublk driver guarantees that
> > +the ublk request buffer won't be gone away since secondary request actually
> > +grabs the buffer's reference. This way also guarantees that multiple
> > +concurrent fused commands associated with same request buffer works fine,
> > +as the provided buffer reference is shared & read-only.
> >   
> > +Also buffer usage direction flag is passed to primary command from userspace,
> > +so ublk driver can validate if it is legal to use buffer with requested
> > +direction.
> >   
> >   References
> >   ==========
> > @@ -323,4 +431,4 @@ References
> >   
> >   .. [#stefan] https://lore.kernel.org/linux-block/[email protected]/
> >   
> > -.. [#xiaoguang] https://lore.kernel.org/linux-block/[email protected]/
> > +.. [#splice_extend] https://lore.kernel.org/linux-block/CAHk-=wgJsi7t7YYpuo6ewXGnHz2nmj67iWR6KPGoz5TBu34mWQ@mail.gmail.com/
> 
> 
> Any chance you could add a few code path into this document? For example 
> as in Documentation/filesystems/fuse.rst,
> section "Kernel - userspace interface"

OK, just it isn't handy to describe the flow by pure text.

BTW, you can also see the code path in the link:

https://github.com/ming1/ubdsrv/blob/master/doc/ublk_intro.pdf


thanks,
Ming


  reply	other threads:[~2023-04-01 13:20 UTC|newest]

Thread overview: 37+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-03-30 11:36 [PATCH V6 00/17] io_uring/ublk: add generic IORING_OP_FUSED_CMD Ming Lei
2023-03-30 11:36 ` [PATCH V6 01/17] io_uring: increase io_kiocb->flags into 64bit Ming Lei
2023-03-30 11:36 ` [PATCH V6 02/17] io_uring: use ctx->cached_sq_head to calculate left sqes Ming Lei
2023-03-30 11:36 ` [PATCH V6 03/17] io_uring: add generic IORING_OP_FUSED_CMD Ming Lei
2023-04-01 14:35   ` Ming Lei
2023-03-30 11:36 ` [PATCH V6 04/17] io_uring: support providing buffer by IORING_OP_FUSED_CMD Ming Lei
2023-03-30 11:36 ` [PATCH V6 05/17] io_uring: support OP_READ/OP_WRITE for fused secondary request Ming Lei
2023-03-30 11:36 ` [PATCH V6 06/17] io_uring: support OP_SEND_ZC/OP_RECV " Ming Lei
2023-03-30 11:36 ` [PATCH V6 07/17] block: ublk_drv: add common exit handling Ming Lei
2023-03-30 11:36 ` [PATCH V6 08/17] block: ublk_drv: don't consider flush request in map/unmap io Ming Lei
2023-03-30 11:36 ` [PATCH V6 09/17] block: ublk_drv: add two helpers to clean up map/unmap request Ming Lei
2023-03-30 11:36 ` [PATCH V6 10/17] block: ublk_drv: clean up several helpers Ming Lei
2023-03-30 11:36 ` [PATCH V6 11/17] block: ublk_drv: cleanup 'struct ublk_map_data' Ming Lei
2023-03-30 11:36 ` [PATCH V6 12/17] block: ublk_drv: cleanup ublk_copy_user_pages Ming Lei
2023-03-31 16:22   ` Bernd Schubert
2023-03-30 11:36 ` [PATCH V6 13/17] block: ublk_drv: grab request reference when the request is handled by userspace Ming Lei
2023-03-30 11:36 ` [PATCH V6 14/17] block: ublk_drv: support to copy any part of request pages Ming Lei
2023-03-30 11:36 ` [PATCH V6 15/17] block: ublk_drv: add read()/write() support for ublk char device Ming Lei
2023-03-30 11:36 ` [PATCH V6 16/17] block: ublk_drv: don't check buffer in case of zero copy Ming Lei
2023-03-30 11:36 ` [PATCH V6 17/17] block: ublk_drv: apply io_uring FUSED_CMD for supporting " Ming Lei
2023-03-31 19:13   ` Bernd Schubert
2023-04-01 13:19     ` Ming Lei [this message]
2023-03-31 19:55   ` Bernd Schubert
2023-04-01 13:22     ` Ming Lei
2023-04-03  9:25     ` Bernd Schubert
2023-04-03  1:11 ` [PATCH V6 00/17] io_uring/ublk: add generic IORING_OP_FUSED_CMD Ming Lei
2023-04-03  1:24   ` Jens Axboe
2023-04-04  7:48     ` Ming Lei
2023-04-03  1:23 ` (subset) " Jens Axboe
2023-04-18 19:38 ` Bernd Schubert
2023-04-19  1:51   ` Ming Lei
2023-04-19  9:56     ` Bernd Schubert
2023-04-19 11:19       ` Ming Lei
2023-04-19 15:42         ` Bernd Schubert
2023-04-20  1:18           ` Pavel Begunkov
2023-04-20  1:38           ` Ming Lei
2023-04-21 22:38             ` Bernd Schubert

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox