From: Hao Xu <[email protected]>
To: [email protected]
Cc: Jens Axboe <[email protected]>,
	Pavel Begunkov <[email protected]>,
	Ingo Molnar <[email protected]>,
	Wanpeng Li <[email protected]>
Subject: Re: [RFC 00/19] uringlet
Date: Thu, 25 Aug 2022 21:03:59 +0800	[thread overview]
Message-ID: <[email protected]> (raw)
In-Reply-To: <[email protected]>

On 8/19/22 23:27, Hao Xu wrote:
> From: Hao Xu <[email protected]>
> 
> Hi Jens and all,
> 
> This is an early RFC for a new way to do async IO. Currently io_uring
> works like this (see the sketch after this list):
>   - issue an IO request in a nowait way
>     Here nowait means returning an error (EAGAIN) to the io_uring layer
>     when the request would block deeper in the kernel stack.
> 
>   - issue an IO request in a normal (blocking) way
>     io_uring catches the EAGAIN error and creates/wakes up an io-worker
>     to redo the IO request in a blocking way, while the original context
>     moves on to issue other requests. (Some types of requests, like
>     buffered reads, leverage task work to avoid io-workers.)
> 
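> Roughly, that flow looks like the sketch below (simplified pseudo-C;
> the names mirror the io_uring code but are illustrative, not the
> actual implementation):
> 
>     /* first pass: try to issue without blocking */
>     ret = io_issue_sqe(req, IO_URING_F_NONBLOCK);
>     if (ret == -EAGAIN) {
>             /*
>              * The request would have blocked somewhere deeper in the
>              * kernel.  Punt it to an io-worker, which re-issues it
>              * from scratch in a blocking way, while this context goes
>              * on to submit the next SQE.
>              */
>             io_queue_iowq(req);
>     }
> 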
> This has two main disadvantages:
>   - we have to find every blocking point along the kernel code path and
>     modify it to support nowait.
>     e.g.  alloc_memory() ----> if (alloc_memory() fails) return -EAGAIN
>     This hugely adds programming complexity, especially when the code
>     path is long and complicated (see the sketch after this list). For
>     example, for buffered writes we have to handle locks, possibly the
>     journal, and metadata misses such as extent nodes.
> 
>   - By creating/waking up a new worker, we redo an IO request from the
>     very beginning, which means we re-walk the path from the start to
>     the previous blocking point.
>     The original context also has to backtrack to the io_uring layer
>     from the blocking point before it can submit other requests, while
>     it would be better to start the new submission directly.
> 
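> To illustrate the first point, every blocking step along a deep call
> chain needs its own nowait escape hatch. A hypothetical sketch (the
> fs_* helpers below are made up for illustration, not real kernel
> functions):
> 
>     static int fs_buffered_write(struct kiocb *iocb, bool nowait)
>     {
>             if (!fs_trylock_meta(iocb)) {
>                     if (nowait)
>                             return -EAGAIN; /* bubble up instead of sleeping */
>                     fs_lock_meta(iocb);
>             }
>             if (!fs_extent_cached(iocb)) {
>                     if (nowait)
>                             return -EAGAIN; /* reading the extent tree would block */
>                     fs_read_extent(iocb);
>             }
>             /* the actual data copy into the page cache */
>             return fs_copy_and_dirty(iocb);
>     }
> 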
> This RFC provides a new way to do it.
>   - We maintain a worker pool for each io_uring instance, and each worker
>     in it can submit requests. The original task only needs to create the
>     first worker and return to userspace; later it doesn't need to call
>     io_uring_enter.[1]
> 
>   - The created worker begins to submit requests. When it blocks, just
>     let it be blocked, and create/wake up another worker to continue the
>     submission.
> 
> [1] I currently keep these workers alive until the io_uring context
>      exits. In other words, a worker submits, sleeps, and wakes up, but
>      won't exit. Thus the original task doesn't need to create/wake up
>      workers.
> 
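> From the application side, usage would look roughly like the sketch
> below. It assumes the IORING_SETUP_URINGLET flag from this series is
> exposed in the uapi/liburing headers; error handling is omitted:
> 
>     #include <liburing.h>
> 
>     static int uringlet_write(int fd, const void *buf, unsigned int len)
>     {
>             struct io_uring_params p = { 0 };
>             struct io_uring ring;
>             struct io_uring_sqe *sqe;
> 
>             p.flags = IORING_SETUP_URINGLET;
>             io_uring_queue_init_params(64, &ring, &p);
> 
>             sqe = io_uring_get_sqe(&ring);
>             io_uring_prep_write(sqe, fd, buf, len, 0);
> 
>             /*
>              * The first submit creates the first uringlet worker; after
>              * that, new SQEs placed in the ring are picked up by the
>              * workers without further io_uring_enter() calls from this
>              * task (per the description above).
>              */
>             return io_uring_submit(&ring);
>     }
> 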
> I've done some testing:
> name: buffered write
> fs: xfs
> env: qemu box, 4 cpu, 8G mem.
> tool: fio
> 
>   - single file test:
> 
>     fio ioengine=io_uring, size=10M, bs=1024, direct=0,
>         thread=1, rw=randwrite, time_based=1, runtime=180
> 
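> For reference, the parameters above map to a fio job file roughly like
> the one below (the filename is a placeholder for a file on the xfs
> test fs; iodepth was varied across the runs shown next):
> 
>     [global]
>     ioengine=io_uring
>     rw=randwrite
>     bs=1024
>     size=10M
>     direct=0
>     thread=1
>     time_based=1
>     runtime=180
> 
>     [buffered-write]
>     filename=/mnt/xfs/testfile
>     iodepth=1
> 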
>     async buffered writes:
>     iodepth
>        1      write: IOPS=428k, BW=418MiB/s (438MB/s)(73.5GiB/180000msec);
>        2      write: IOPS=406k, BW=396MiB/s (416MB/s)(69.7GiB/180002msec);
>        4      write: IOPS=382k, BW=373MiB/s (391MB/s)(65.6GiB/180000msec);
>        8      write: IOPS=255k, BW=249MiB/s (261MB/s)(43.7GiB/180001msec);
>        16     write: IOPS=399k, BW=390MiB/s (409MB/s)(68.5GiB/180000msec);
>        32     write: IOPS=433k, BW=423MiB/s (443MB/s)(74.3GiB/180000msec);
> 
>        1      lat (nsec): min=547, max=2929.3k, avg=1074.98, stdev=6498.72
>        2      lat (nsec): min=607, max=84320k, avg=3619.15, stdev=109104.36
>        4      lat (nsec): min=891, max=195941k, avg=9062.16, stdev=213600.71
>        8      lat (nsec): min=684, max=204164k, avg=29308.56, stdev=542490.72
>        16     lat (nsec): min=1002, max=77279k, avg=38716.65, stdev=461785.55
>        32     lat (nsec): min=674, max=75279k, avg=72673.91, stdev=588002.49
> 
> 
>     uringlet:
>     iodepth
>       1       write: IOPS=120k, BW=117MiB/s (123MB/s)(20.6GiB/180006msec);
>       2       write: IOPS=273k, BW=266MiB/s (279MB/s)(46.8GiB/180010msec);
>       4       write: IOPS=336k, BW=328MiB/s (344MB/s)(57.7GiB/180002msec);
>       8       write: IOPS=373k, BW=365MiB/s (382MB/s)(64.1GiB/180000msec);
>       16      write: IOPS=442k, BW=432MiB/s (453MB/s)(75.9GiB/180001msec);
>       32      write: IOPS=444k, BW=434MiB/s (455MB/s)(76.2GiB/180010msec);
> 
>       1       lat (nsec): min=684, max=10790k, avg=6781.23, stdev=10000.69
>       2       lat (nsec): min=650, max=91712k, avg=5690.52, stdev=136818.11
>       4       lat (nsec): min=785, max=79038k, avg=10297.04, stdev=227375.52
>       8       lat (nsec): min=862, max=97493k, avg=19804.67, stdev=350809.60
>       16      lat (nsec): min=823, max=81279k, avg=34681.33, stdev=478427.17
>       32      lat (usec): min=6, max=105935, avg=70.55, stdev=696.08
> 
> uringlet behaves worse on IOPS and latency at small iodepth. I think
> the reason is that there are more sleeps and wakeups (not sure about
> it; I'll look into it later).
> 
> The downsides of uringlet:
>   - it costs more CPU resources; the reason is similar to the sqpoll
>     case: a uringlet worker keeps checking the sqring to reduce
>     latency.[2]
>   - task->plug is disabled for now since uringlet is buggy with it.
> 
> [2] For now, I allow a uringlet worker to spin on an empty sqring for
> some time.
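> 
> That spinning is similar in spirit to how sqpoll idles. A rough sketch
> of the worker loop (illustrative pseudo-C; the helper names and the
> spin budget are made up, not the actual patch):
> 
>     #define URINGLET_SPIN_BUDGET	1000
> 
>     static void uringlet_worker_loop(struct io_ring_ctx *ctx)
>     {
>             unsigned int spins = 0;
> 
>             for (;;) {
>                     if (sqring_has_entries(ctx)) {
>                             spins = 0;
>                             /* may block; another worker then takes over */
>                             submit_pending_sqes(ctx);
>                             continue;
>                     }
>                     if (++spins < URINGLET_SPIN_BUDGET) {
>                             cpu_relax();    /* burn some CPU to cut latency */
>                             continue;
>                     }
>                     spins = 0;
>                     /* woken when userspace adds new SQEs */
>                     sleep_until_new_sqes(ctx);
>             }
>     }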
> 
> Any comments are welcome. This early RFC only supports buffered writes
> for now; if the underlying idea proves to be the right approach, I'll
> turn it into a formal patchset, resolve the detailed technical issues,
> and try to support more io_uring features.
> 
> Regards,
> Hao
> 

Friendly ping...
Jens, any thoughts on this one?

Thread overview: 21+ messages
2022-08-19 15:27 [RFC 00/19] uringlet Hao Xu
2022-08-19 15:27 ` [PATCH 01/19] io_uring: change return value of create_io_worker() and io_wqe_create_worker() Hao Xu
2022-08-19 15:27 ` [PATCH 02/19] io_uring: add IORING_SETUP_URINGLET Hao Xu
2022-08-19 15:27 ` [PATCH 03/19] io_uring: make worker pool per ctx for uringlet mode Hao Xu
2022-08-19 15:27 ` [PATCH 04/19] io-wq: split io_wqe_worker() to io_wqe_worker_normal() and io_wqe_worker_let() Hao Xu
2022-08-19 15:27 ` [PATCH 05/19] io_uring: add io_uringler_offload() for uringlet mode Hao Xu
2022-08-19 15:27 ` [PATCH 06/19] io-wq: change the io-worker scheduling logic Hao Xu
2022-08-19 15:27 ` [PATCH 07/19] io-wq: move worker state flags to io-wq.h Hao Xu
2022-08-19 15:27 ` [PATCH 08/19] io-wq: add IO_WORKER_F_SUBMIT and its friends Hao Xu
2022-08-19 15:27 ` [PATCH 09/19] io-wq: add IO_WORKER_F_SCHEDULED " Hao Xu
2022-08-19 15:27 ` [PATCH 10/19] io_uring: add io_submit_sqes_let() Hao Xu
2022-08-19 15:27 ` [PATCH 11/19] io_uring: don't allocate io-wq for a worker in uringlet mode Hao Xu
2022-08-19 15:27 ` [PATCH 12/19] io_uring: add uringlet worker cancellation function Hao Xu
2022-08-19 15:27 ` [PATCH 13/19] io-wq: add wq->owner for uringlet mode Hao Xu
2022-08-19 15:27 ` [PATCH 14/19] io_uring: modify issue_flags " Hao Xu
2022-08-19 15:27 ` [PATCH 15/19] io_uring: don't use inline completion cache if scheduled Hao Xu
2022-08-19 15:27 ` [PATCH 16/19] io_uring: release ctx->let when a ring exits Hao Xu
2022-08-19 15:27 ` [PATCH 17/19] io_uring: disable task plug for now Hao Xu
2022-08-19 15:27 ` [PATCH 18/19] io-wq: only do io_uringlet_end() at the first schedule time Hao Xu
2022-08-19 15:27 ` [PATCH 19/19] io_uring: wire up uringlet Hao Xu
2022-08-25 13:03 ` Hao Xu [this message]
