From: Hao Xu <[email protected]>
To: [email protected]
Cc: Jens Axboe <[email protected]>,
Pavel Begunkov <[email protected]>,
Ingo Molnar <[email protected]>,
Wanpeng Li <[email protected]>
Subject: [RFC 00/19] uringlet
Date: Fri, 19 Aug 2022 23:27:19 +0800
Message-ID: <[email protected]>
Hi Jens and all,
This is an early RFC for a new way to do async IO. Currently io_uring
works like this:
- First, issue the IO request in a nowait way.
  Here nowait means the deeper kernel stack returns an error (EAGAIN)
  to the io_uring layer wherever it would otherwise block.
- If that fails, issue the IO request in the normal (blocking) way.
  io_uring catches the EAGAIN error and creates/wakes up an io-worker
  to redo the IO request in a blocking way. The original context moves
  on to issue other requests. (Some request types, such as buffered
  reads, leverage task work to avoid io-workers.)
This has two main disadvantages:
- We have to find every blocking point along the kernel code path and
  modify it to support nowait, e.g.
    alloc_memory() ----> if (alloc_memory() fails) return -EAGAIN
  This adds a lot of programming complexity, especially when the code
  path is long and complicated. For buffered writes, for example, we
  have to handle locks, possibly the journal, and misses on metadata
  such as extent nodes.
- By creating/waking up a new worker, we redo the IO request from the
  very beginning, which means we re-walk the path from the start to
  the previous blocking point.
  Meanwhile the original context backtracks from the blocking point to
  the io_uring layer to submit other requests, when it would be better
  to start the new submission directly.
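
The current flow can be sketched in plain userspace C. This is only a
toy model, not io_uring code: struct request, issue(), worker_run() and
submit() are hypothetical names, and the "worker" runs inline instead of
on another thread. The attempts counter shows the second disadvantage:
a request that would block walks the issue path twice.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical request; would_block stands in for a lock, a cache miss, etc. */
struct request {
	bool would_block;
	bool punted;    /* handed over to a worker after -EAGAIN */
	int attempts;   /* how many times the issue path was walked */
	int result;
};

/* Stand-in for the issue path: with nowait set, bail out instead of sleeping. */
static int issue(struct request *req, bool nowait)
{
	req->attempts++;
	if (req->would_block && nowait)
		return -EAGAIN;
	req->result = 0;    /* blocking was allowed or not needed */
	return 0;
}

/* Stand-in for an io-worker: redo the request from the start, may block. */
static void worker_run(struct request *req)
{
	int ret = issue(req, false);

	assert(ret == 0);
	(void)ret;
}

/* Submission path: try nowait first, punt to a worker on -EAGAIN. */
static void submit(struct request *req)
{
	if (issue(req, true) == -EAGAIN) {
		req->punted = true;
		worker_run(req);    /* in the kernel this runs on another thread */
	}
}
```

Calling submit() on a request with would_block set leaves attempts at 2
and punted set, while a non-blocking request completes on the first
walk.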
This RFC provides a new way to do it.
- We maintain a worker pool for each io_uring instance, and each worker
  in it can submit requests. The original task only needs to create the
  first worker and return to userspace. After that it doesn't need to
  call io_uring_enter.[1]
- The created worker begins to submit requests. When it blocks, just
  let it stay blocked, and create/wake up another worker to continue
  the submission.

[1] I currently keep these workers until the io_uring context exits. In
    other words, a worker submits, sleeps, and wakes up, but never
    exits. Thus the original task doesn't need to create/wake up
    workers.
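
The hand-over described above can be modelled as a small state machine.
This is a single-threaded sketch under my own assumptions, not the code
in this series: struct pool, MAX_WORKERS and the pool_*() helpers are
made-up names, and "blocking" is just a state change. It only shows that
a blocked worker stays blocked while an idle or newly created worker
takes over submission, and that workers are reused rather than exiting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_WORKERS 8    /* arbitrary limit for the sketch */

enum worker_state { W_UNUSED, W_IDLE, W_SUBMITTING, W_BLOCKED };

struct pool {
	enum worker_state w[MAX_WORKERS];
	int created;    /* workers ever created; they never exit */
};

/* Wake an idle worker, else create one. Returns its index or -1. */
static int pool_get_submitter(struct pool *p)
{
	for (int i = 0; i < MAX_WORKERS; i++) {
		if (p->w[i] == W_IDLE) {
			p->w[i] = W_SUBMITTING;
			return i;
		}
	}
	for (int i = 0; i < MAX_WORKERS; i++) {
		if (p->w[i] == W_UNUSED) {
			p->w[i] = W_SUBMITTING;
			p->created++;
			return i;
		}
	}
	return -1;
}

/*
 * Submit n requests; blocks[i] says whether request i puts its worker
 * to sleep. Returns the number of requests submitted.
 */
static int pool_submit_all(struct pool *p, const bool *blocks, size_t n)
{
	int cur = pool_get_submitter(p);
	size_t i;

	for (i = 0; i < n && cur >= 0; i++) {
		if (blocks[i]) {
			p->w[cur] = W_BLOCKED;        /* just let it be blocked */
			cur = pool_get_submitter(p);  /* another worker continues */
		}
	}
	if (cur >= 0)
		p->w[cur] = W_IDLE;    /* nothing left to submit: go idle */
	return (int)i;
}

/* The IO of worker idx finished: it wakes up and becomes idle again. */
static void pool_io_done(struct pool *p, int idx)
{
	assert(p->w[idx] == W_BLOCKED);
	p->w[idx] = W_IDLE;
}
```

With a zero-initialized pool, submitting four requests of which two
block ends with two blocked workers and one idle worker. A later
submission reuses the idle ones instead of creating more.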
I've done some testing:
name: buffered write
fs: xfs
env: qemu box, 4 cpu, 8G mem.
tool: fio
- single file test:
fio ioengine=io_uring, size=10M, bs=1024, direct=0,
thread=1, rw=randwrite, time_based=1, runtime=180
async buffered writes:
iodepth
1 write: IOPS=428k, BW=418MiB/s (438MB/s)(73.5GiB/180000msec);
2 write: IOPS=406k, BW=396MiB/s (416MB/s)(69.7GiB/180002msec);
4 write: IOPS=382k, BW=373MiB/s (391MB/s)(65.6GiB/180000msec);
8 write: IOPS=255k, BW=249MiB/s (261MB/s)(43.7GiB/180001msec);
16 write: IOPS=399k, BW=390MiB/s (409MB/s)(68.5GiB/180000msec);
32 write: IOPS=433k, BW=423MiB/s (443MB/s)(74.3GiB/180000msec);
1 lat (nsec): min=547, max=2929.3k, avg=1074.98, stdev=6498.72
2 lat (nsec): min=607, max=84320k, avg=3619.15, stdev=109104.36
4 lat (nsec): min=891, max=195941k, avg=9062.16, stdev=213600.71
8 lat (nsec): min=684, max=204164k, avg=29308.56, stdev=542490.72
16 lat (nsec): min=1002, max=77279k, avg=38716.65, stdev=461785.55
32 lat (nsec): min=674, max=75279k, avg=72673.91, stdev=588002.49
uringlet:
iodepth
1 write: IOPS=120k, BW=117MiB/s (123MB/s)(20.6GiB/180006msec);
2 write: IOPS=273k, BW=266MiB/s (279MB/s)(46.8GiB/180010msec);
4 write: IOPS=336k, BW=328MiB/s (344MB/s)(57.7GiB/180002msec);
8 write: IOPS=373k, BW=365MiB/s (382MB/s)(64.1GiB/180000msec);
16 write: IOPS=442k, BW=432MiB/s (453MB/s)(75.9GiB/180001msec);
32 write: IOPS=444k, BW=434MiB/s (455MB/s)(76.2GiB/180010msec);
1 lat (nsec): min=684, max=10790k, avg=6781.23, stdev=10000.69
2 lat (nsec): min=650, max=91712k, avg=5690.52, stdev=136818.11
4 lat (nsec): min=785, max=79038k, avg=10297.04, stdev=227375.52
8 lat (nsec): min=862, max=97493k, avg=19804.67, stdev=350809.60
16 lat (nsec): min=823, max=81279k, avg=34681.33, stdev=478427.17
32 lat (usec): min=6, max=105935, avg=70.55, stdev=696.08
uringlet has worse IOPS and latency at small iodepth. I think the
reason is that there are more sleeps and wakeups (I'm not sure about
this and will look into it later).
The downsides of uringlet:
- It costs more CPU. The reason is similar to the sqpoll case: a
  uringlet worker keeps checking the sqring to reduce latency.[2]
- task->plug is disabled for now since uringlet is buggy with it.

[2] For now, I allow a uringlet worker to spin on the empty sqring a
    number of times.
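
The bounded spin in [2] might look roughly like the following userspace
sketch. It assumes a simplified ring with only a head and a tail; struct
sq_ring, spin_on_sqring() and the spin budget are hypothetical and not
taken from the patches. The point is the trade-off: a larger budget
burns more CPU but picks up new SQEs sooner, and once the budget is used
up the worker would go to sleep.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified submission ring: only the two indices matter here. */
struct sq_ring {
	_Atomic uint32_t tail;  /* advanced by the userspace producer */
	uint32_t head;          /* advanced by the consuming worker */
};

enum spin_result { SPIN_GOT_WORK, SPIN_GIVE_UP };

/*
 * Poll the ring up to max_spins times. Returns SPIN_GOT_WORK as soon as
 * an entry is visible, SPIN_GIVE_UP when the budget is exhausted (the
 * caller would then sleep). *spins_used reports the iterations spent.
 */
static enum spin_result spin_on_sqring(struct sq_ring *sq,
				       unsigned int max_spins,
				       unsigned int *spins_used)
{
	unsigned int i;

	assert(sq && spins_used);

	for (i = 0; i < max_spins; i++) {
		uint32_t tail = atomic_load_explicit(&sq->tail,
						     memory_order_acquire);

		if (tail != sq->head) {
			*spins_used = i;
			return SPIN_GOT_WORK;
		}
		/* a real implementation would relax the CPU here */
	}
	*spins_used = i;
	return SPIN_GIVE_UP;
}
```

On an empty ring a budget of 16 is fully consumed before giving up; once
the producer advances the tail, the next call returns on the first
check.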
Any comments are welcome. This early RFC only supports buffered writes
for now. If the underlying idea proves to be the right way, I'll turn
it into a formal patchset, resolve the detailed technical issues, and
try to support more io_uring features.
Regards,
Hao
Hao Xu (19):
io_uring: change return value of create_io_worker() and
io_wqe_create_worker()
io_uring: add IORING_SETUP_URINGLET
io_uring: make worker pool per ctx for uringlet mode
io-wq: split io_wqe_worker() to io_wqe_worker_normal() and
io_wqe_worker_let()
io_uring: add io_uringler_offload() for uringlet mode
io-wq: change the io-worker scheduling logic
io-wq: move worker state flags to io-wq.h
io-wq: add IO_WORKER_F_SUBMIT and its friends
io-wq: add IO_WORKER_F_SCHEDULED and its friends
io_uring: add io_submit_sqes_let()
io_uring: don't allocate io-wq for a worker in uringlet mode
io_uring: add uringlet worker cancellation function
io-wq: add wq->owner for uringlet mode
io_uring: modify issue_flags for uringlet mode
io_uring: don't use inline completion cache if scheduled
io_uring: release ctx->let when a ring exits
io_uring: disable task plug for now
io-wq: only do io_uringlet_end() at the first schedule time
io_uring: wire up uringlet
include/linux/io_uring_types.h | 1 +
include/uapi/linux/io_uring.h | 4 +
io_uring/io-wq.c | 242 +++++++++++++++++++++++++--------
io_uring/io-wq.h | 65 ++++++++-
io_uring/io_uring.c | 122 +++++++++++++++--
io_uring/io_uring.h | 5 +-
io_uring/tctx.c | 31 +++--
7 files changed, 393 insertions(+), 77 deletions(-)
base-commit: 3f743e9bbb8fe20f4c477e4bf6341c4187a4a264
--
2.25.1