public inbox for [email protected]
 help / color / mirror / Atom feed
From: Ming Lei <[email protected]>
To: Ziyang Zhang <[email protected]>
Cc: [email protected], [email protected],
	[email protected],
	Gabriel Krisman Bertazi <[email protected]>,
	Xiaoguang Wang <[email protected]>,
	[email protected]
Subject: Re: [RFC] libubd: library for ubd(userspace block driver based on io_uring passthrough)
Date: Thu, 30 Jun 2022 17:09:06 +0800	[thread overview]
Message-ID: <Yr1oMvYCqn5m2oLX@T590> (raw)
In-Reply-To: <[email protected]>

On Thu, Jun 30, 2022 at 03:16:21PM +0800, Ziyang Zhang wrote:
> Hi, Ming
> 
> On 2022/6/29 19:33, Ming Lei wrote:
> > On Wed, Jun 29, 2022 at 11:22:23AM +0800, Ziyang Zhang wrote:
> >> Hi Ming,
> >>
> >> On 2022/6/27 23:29, Ming Lei wrote:
> >>> Hi Ziyang,
> >>>
> >>> On Mon, Jun 27, 2022 at 04:20:55PM +0800, Ziyang Zhang wrote:
> >>>> Hi Ming,
> >>>>
> >>>> We are learning your ubd code and developing a library: libubd for ubd.
> >>>> This article explains why we need libubd and how we design it.
> >>>>
> >>>> Related threads:
> >>>> (1) https://lore.kernel.org/all/Yk%2Fn7UtGK1vVGFX0@T590/
> >>>> (2) https://lore.kernel.org/all/YnDhorlKgOKiWkiz@T590/
> >>>> (3) https://lore.kernel.org/all/[email protected]/
> >>>> (4) https://lore.kernel.org/all/[email protected]/
> >>>>
> >>>>
> >>>> Userspace block driver(ubd)[1], based on io_uring passthrough,
> >>>> allows users to define their own backend storage in userspace
> >>>> and provides block devices such as /dev/ubdbX.
> >>>> Ming Lei has provided kernel driver code: ubd_drv.c[2]
> >>>> and userspace code: ubdsrv[3].
> >>>>
> >>>> ubd_drv.c simply passes all blk-mq IO requests
> >>>> to ubdsrv through io_uring sqes/cqes. We think the kernel code
> >>>> is pretty well-designed.
> >>>>
> >>>> ubdsrv is implemented by a single daemon
> >>>> and target(backend) IO handling(null_tgt and loop_tgt) 
> >>>> is embedded in the daemon. 
> >>>> While trying ubdsrv, we find ubdsrv is hard to be used 
> >>>> by our backend.
> >>>
> >>> ubd is supposed to provide one generic framework for user space block
> >>> driver, and it can be used for doing lots of fun/useful thing.
> >>>
> >>> If I understand correctly, this isn't same with your use case:
> >>>
> >>> 1) your user space block driver isn't generic, and should be dedicated
> >>> for Alibaba's uses
> >>>
> >>> 2) your case has been there for long time, and you want to switch from other
> >>> approach(maybe tcmu) to ubd given ubd has better performance.
> >>>
> >>
> >> Yes, you are correct :)
> >> The idea of design libubd is actually from libtcmu.
> >>
> >> We do have some userspace storage system as the IO handling backend, 
> >> and we need ubd to provide block drivers such as /dev/ubdbX for up layer client apps.
> >>
> >>
> >> I think your motivation is that provides a complete user block driver to users
> >> and they DO NOT change any code.
> >> Users DO change their code using libubd for embedding libubd into the backend.
> >>
> >>
> >>>> First is description of our backend:
> >>>>
> >>>> (1) a distributing system sends/receives IO requests 
> >>>>     through network.
> >>>>
> >>>> (2) The system use RPC calls among hundreds of
> >>>>      storage servers and RPC calls are associated with data buffers
> >>>>      allocated from a memory pool.
> >>>>
> >>>> (3) On each server for each device(/dev/vdX), our backend runs
> >>>>      many threads to handle IO requests and manage the device. 
> >>>>
> >>>> Second are reasons why ubdsrv is hard to use for us:
> >>>>
> >>>> (1) ubdsrv requires the target(backend) issues IO requests
> >>>>     to the io_uring provided by ubdsrv but our backend 
> >>>>     uses something like RPC and does not support io_uring.
> >>>
> >>> As one generic framework, the io command has to be io_uring
> >>> passthrough, and the io doesn't have to be handled by io_uring.
> >>
> >> Yes, our backend define its own communicating method.
> >>
> >>>
> >>> But IMO io_uring is much more efficient, so I'd try to make async io
> >>> (io uring) as the 1st citizen in the framework, especially for new
> >>> driver.
> >>>
> >>> But it can support other way really, such as use io_uring with eventfd,
> >>> the other userspace context can handle io, then wake up io_uring context
> >>> via eventfd. You may not use io_uring for handling io, but you still
> >>> need to communicate with the context for handling io_uring passthrough
> >>> command, and one mechanism(such as eventfd) has to be there for the
> >>> communication.
> >>
> >> Ok, eventfd may be helpful. 
> >> If you read my API, you may find ubdlib_complete_io_request().
> >> I think the backend io worker thread can call this function to tell the 
> >> ubd queue thread(the io_uring context in it) to commit the IO.
> > 
> > The ubdlib_complete_io_request() has to be called in the same pthread
> > context, that looks not flexible. When you handle IO via non-io_uring in the same
> > context, the cpu utilization in submission/completion side should be
> > higher than io_uring. And this way should be worse than the usage in
> > ubd/loop, that is why I suggest to use one io_uring for handling both
> > io command and io request if possible.
> 
> ubdlib_complete_io_request() can be called in the io worker thread,
> not in the ubdsrv queue thread(with the io_uring context for handling uring_cmd).
> 
> You can find ubd_runner.c in my libubd repo. There are many io worker
> threads for each ubdsrv queue to handle IO requests.
> 
> Actually this idea comes from tcmu-runner. The data flow is:
> 
> 1) in ubdsrv queue thread, io_uring_enter(): returns(IO reqs received from blk-mq)
> 
> 2) in ubdsrv queue thread, ubdsrv_reap_requests(): iterate on each cqe(with an IO req),
>    
>    for READ/WRITE requests, ubd_aio_queue_io() to enqueue the IO req into a io_queue
>    (each ubdsrv queue has one io_queue). This IO req's status is IO_HANDLING_ASYNC.
>     
>    for other simple(can be handled very quickly), 
>    handle it right now and call ubdlib_complete_io_request()
> 
> 3) in ubdsrv queue thread, ubdsrv_commit_and_fetch(): iterate on all IO slots per ubdsrv queue
>    and setup sqe if one IO(IO completion) is ready to commit.
>    
>    Here, some IO slots are still IO_HANDLING_ASYNC so no sqe is generated for them.
> 
> 
> 4)  in ubdsrv queue thread, io_uring_enter(): submit all sqes and wait for cqes
>     (io_uring_enter() will return after at least one IO req is received from blk-mq)
>    
> 5) When 3) or 4) happens, at the same time in ubdsrv queue IO worker threads:
>    each io worker thread try to deque and handle one IO req from io_queue per ubdsrv queue.
>    
>    After the IO worker handles the IO req(WRITE/READ), it calls ubdlib_complete_io_request()
>    This function can mark this  IO req's status to ready to commit.
> 
> IO handling/completion and io_uring_enter() can happen at the same time.
> 
> Besides, io_uring_enter can:
> 
> 1) block and wait for cqes until at least
> one blk-mq req comes from queue_rq()
> 
> 2) submit sqes(with last IO completion and next fetch)
> 
> so I have to consider how to notify io_uring about io completion 
> after io_uring_enter() is slept(block and wait for cqes).

Yeah, that was exactly my question, :-)

> 
> In current version of ubd_runner(an async libubd target), I try to use an "unblock"
> io_uring_enter_timeout() and caller can set a timeout value for it.
> So IO completions happen after io_uring_enter_timeout() call can be committed
> by next io_uring_enter_timeout() call...
> 
> But this is a very ugly implementation 
> because I may waste CPU on useless loops in ubdsrv queue thread if
> blk-mq reqs do not income frequently.
> 
> You mentioned that eventfd may be helpful and I agree with you. :)
> I can register an eventfd in io_uring after ubd_aio_queue_io() and write the eventfd
> in  ubdlib_complete_io_request().
> 
> I will fix my code.

FYI, there is one example about using eventfd to wakeup io_uring, which
can be added to the library for your usecase:

https://gist.github.com/1Jo1/6496d1b8b6b363c301271340e2eab95b

> 
> > 
> >>
> >>
> >>
> >>>
> >>>>
> >>>> (2) ubdsrv forks a daemon and it takes over everything.
> >>>>     Users should type "list/stop/del" ctrl-commands to interact with
> >>>>     the daemon. It is inconvenient for our backend
> >>>>     because it has threads(from a C++ thread library) running inside.
> >>>
> >>> No, list/stop/del won't interact with the daemon, and the per-queue
> >>> pthread is only handling IO commands(io_uring passthrough) and IO request.
> >>>
> >>
> >>
> >> Sorry I made a mistake.
> >>
> >> I mean from user's view, 
> >> he has to type list/del/stop from cmdlind to control the daemon.
> >> (I know the control flow is cmdline-->ubd_drv.c-->ubdsrv daemon).
> >>
> >> This is a little weird if we try to make a ubd library.
> >> So I actually provides APIs in libubd for users to do these list/del/stop works.
> > 
> > OK, that is fine to export APIs for admin purpose.
> > 
> >>
> >>
> >>>>
> >>>> (3) ubdsrv PRE-allocates internal data buffers for each ubd device.
> >>>>     The data flow is:
> >>>>     bio vectors <-1-> ubdsrv data buffer <-2-> backend buffer(our RPC buffer).
> >>>>     Since ubdsrv does not export its internal data buffer to backend,
> >>>>     the second copy is unavoidable. 
> >>>>     PRE-allocating data buffer may not be a good idea for wasting memory
> >>>>     if there are hundreds of ubd devices(/dev/ubdbX).
> >>>
> >>> The preallocation is just virtual memory, which is cheap and not pinned, but
> >>> ubdsrv does support buffer provided by io command, see:
> >>>
> >>> https://github.com/ming1/linux/commit/0a964a1700e11ba50227b6d633edf233bdd8a07d
> >>
> >> Actually I discussed on the design of pre-allocation in your RFC patch for ubd_drv
> >> but you did not reply :)
> >>
> >> I paste it here:
> >>
> >> "I am worried about the fixed-size(size is max io size, 256KiB) pre-allocated data buffers in UBDSRV
> >> may consume too much memory. Do you mean these pages can be reclaimed by sth like madvise()?
> >> If (1)swap is not set and (2)madvise() is not called, these pages may not be reclaimed."
> >>
> >> I observed that your ubdsrv use posix_memalign() to pre-allocate data buffers, 
> >> and I have already noticed the memory cost while testing your ubdsrv with hundreds of /dev/ubdbX.
> > 
> > Usually posix_memalign just allocates virtual memory which is unlimited
> > in 64bit arch, and pages should be allocated until the buffer is read or write.
> > After the READ/WRITE is done, kernel still can reclaim the pages in this
> > virtual memory.
> > 
> > In future, we still may optimize the memory uses via madvise, such as
> > MADV_DONTNEED, after the slot is idle for long enough.
> 
> Ok, thanks for explanation. 
> 
> > 
> >>
> >> Another IMPORTANT problem is your commit:
> >> https://github.com/ming1/linux/commit/0a964a1700e11ba50227b6d633edf233bdd8a07d
> >> may be not helpful for WRITE requests if I understand correctly.
> >>
> >> Consider this data flow:
> >>
> >> 1. ubdsrv commits an IO req(req1, a READ req).
> >>
> >> 2. ubdsrv issues a sqe(UBD_IO_COMMIT_AND_FETCH_REQ), and sets io->addr to addr1.
> >>    addr1 is the addr of buffer user passed.
> >>    
> >>
> >> 3. ubd gets the sqe and commits req1, sets io->addr to addr1.
> >>
> >> 4. ubd gets IO req(req2, a WRITE req) from blk-mq(queue_rq) and commit a cqe.
> >>
> >> 5. ubd copys data to be written from biovec to addr1 in a task_work.
> >>
> >> 6. ubdsrv gets the cqe and tell the IO target to handle req2.
> >>
> >> 7. IO target handles req2. It is a WRITE req so target issues a io_uring write
> >>    cmd(with buffer set to addr1).
> >>
> >>
> >>
> >> The problem happens in 5). You cannot know the actual data_len of an blk-mq req
> >> until you get one in queue_rq. So length of addr1 may be less than data_len.
> > 
> > So far, the actual length of buffer has to be set as at least rq_max_blocks, since
> > we set it as ubd queue's max hw sectors. Yeah, you may argue memory
> > waste, but process virtual address is unlimited for 64bit arch, and
> > pages are allocated until actual read/write is started.
> 
> Ok, since I allow users to config rq_max_blocks in libubd, 
> it's users' responsibility to ensure length of user buffers
> is at least rq_max_blocks.
> 
> Now I agree on your commit:
> https://github.com/ming1/linux/commit/0a964a1700e11ba50227b6d633edf233bdd8a07d
> 
> Provide WRITE buffer in advance(when sending COMMIT_AND_FETCH) seems OK :)
> 
> > 
> >>>
> >>>>
> >>>> To better use ubd in more complicated scenarios, we have developed libubd.
> >>>> It does not assume implementation of backend and can be embedded into it.
> >>>> We refer to the code structure of tcmu-runner[4], 
> >>>> which includes a library(libtcmu) for users 
> >>>> to embed tcmu-runner inside backend's code. 
> >>>> It:
> >>>>
> >>>> (1) Does not fork/pthread_create but embedded in backend's threads
> >>>
> >>> That is because your backend may not use io_uring, I guess.
> >>>
> >>> But it is pretty easy to move the decision of creating pthread to target
> >>> code, which can be done in the interface of .prepare_target().
> >>
> >> I think the library should not create any thread if we want a libubd.
> > 
> > I Agree.
> > 
> >>
> >>>
> >>>>
> >>>> (2) Provides libubd APIs for backend to add/delete ubd devices 
> >>>>     and fetch/commit IO requests
> >>>
> >>> The above could be the main job of libubd.
> >>
> >> indeed.
> >>
> >>>
> >>>>
> >>>> (3) simply passes backend-provided data buffers to ubd_drv.c in kernel,
> >>>>     since the backend actually has no knowledge 
> >>>>     on incoming data size until it gets an IO descriptor.
> >>>
> >>> I can understand your requirement, not look at your code yet, but libubd
> >>> should be pretty thin from function viewpoint, and there are lots of common
> >>> things to abstract/share among all drivers, please see recent ubdsrv change:
> >>>
> >>> https://github.com/ming1/ubdsrv/commits/master
> >>>
> >>> in which:
> >>> 	- coroutine is added for handling target io
> >>> 	- the target interface(ubdsrv_tgt_type) has been cleaned/improved for
> >>> 	supporting complicated target
> >>> 	- c++ support
> >>
> >> Yes, I have read your coroutine code but I am not an expert of C++ 20.:(
> >> I think it is actually target(backend) design and ubd should not assume 
> >> how the backend handle IOs. 
> >>
> >> The work ubd in userspace has to be done is:
> >>
> >> 1) give some IO descriptors to backend, such as ubd_get_io_requests()
> >>
> >> 2) get IO completion form backend, such as ubd_complete_io_requests()
> > 
> > Or the user provides/registers two callbacks: handle_io_async() and
> > io_complete(), the former is called when one request comes from ubd
> > driver, the latter(optional) is called when one io is done.
> > 
> > Also you didn't mention how you notify io_uring about io completion after
> > io_uring_enter() is slept if your backend code doesn't use io_uring to
> > handle io.
> > 
> > I think one communication mechanism(such as eventfd) is needed for your
> > case.
> 
> Ok, I will try eventfd with io_uring.
> 
> > 
> >>
> >>
> >>
> >>>
> >>> IMO, libubd isn't worth of one freshly new project, and it could be integrated
> >>> into ubdsrv easily. The potential users could be existed usersapce
> >>> block driver projects.
> >>
> >> Yes, so many userspace storage systems can use ubd!
> >> You may look at tcmu-runner. It:
> >>
> >> 1) provides a library(libtcmu.c) for those who have a existing backend.
> >>
> >> 2) provides a runner(main.c in tcmu-runner) like your ubdsrv 
> >>    for those who just want to run it. 
> >>    And the runner is build on top of libtcmu.
> >>
> >>>
> >>> If you don't object, I am happy to co-work with you to add the support
> >>> for libubd in ubdsrv, then we can avoid to invent a wheel
> >>
> >> +1 :)
> > 
> > Thinking of further, I'd suggest to split ubdsrv into two parts:
> > 
> > 1) libubdsrv
> > - provide APIs like what you did in libubd
> > - provide API for notify io_uring(handling io command) that one io is
> > completed, and the API should support handling IO from other context
> > (not same with the io_uring context for handling io command).
> > 
> > 2) ubd target
> > - built on libubdsrv, such as ubd command is built on libubdsrv, and
> > specific target implementation is built on the library too.
> > 
> > It shouldn't be hard to work towards this direction, and I guess this
> > way should make current target implementation more clean.
> > 
> 
> Yes, this is like tcmu-runner's structure: a libtcmu and some target
> Thanks, Ming.  Glad to co-work with you.
> 
> I will take your advice and improve libubd(the communication mechanism, maybe eventfd).

I have added libublk branch for working towards this direction, if we
cowork on libublk, please write patch against this branch, then I can
apply your patch directly.

https://github.com/ming1/ubdsrv/tree/libublk

Thanks, 
Ming


  reply	other threads:[~2022-06-30  9:09 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-06-27  8:20 [RFC] libubd: library for ubd(userspace block driver based on io_uring passthrough) Ziyang Zhang
2022-06-27 15:29 ` Ming Lei
2022-06-29  3:22   ` Ziyang Zhang
2022-06-29 11:33     ` Ming Lei
2022-06-30  7:16       ` Ziyang Zhang
2022-06-30  9:09         ` Ming Lei [this message]
2022-06-30  9:29           ` Ziyang Zhang
2022-06-30 11:45             ` Ming Lei
2022-07-04  4:08             ` Ming Lei
2022-07-04  9:49               ` Ziyang Zhang
2022-07-04 11:03                 ` Ming Lei
2022-07-04 13:50                 ` Ming Lei

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Yr1oMvYCqn5m2oLX@T590 \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox