public inbox for [email protected]
 help / color / mirror / Atom feed
* [RFC 00/19] uringlet
@ 2022-08-19 15:27 Hao Xu
  2022-08-19 15:27 ` [PATCH 01/19] io_uring: change return value of create_io_worker() and io_wqe_create_worker() Hao Xu
                   ` (19 more replies)
  0 siblings, 20 replies; 21+ messages in thread
From: Hao Xu @ 2022-08-19 15:27 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, Pavel Begunkov, Ingo Molnar, Wanpeng Li

From: Hao Xu <[email protected]>

Hi Jens and all,

This is an early RFC for a new way to do async IO. Currently io_uring
works like this:
 - issue an IO request in a nowait way
   here nowait means returning an error (-EAGAIN) to the io_uring layer
   when the request would block deeper in the kernel stack.

 - issue the IO request in a normal (blocking) way
   io_uring catches the -EAGAIN error and creates/wakes up an io-worker
   to redo the IO request in a blocking way. The original context moves
   on to issue other requests. (Some types of requests, like buffered
   reads, leverage task work to do away with io-workers.)

This has two main disadvantages:
 - we have to find every blocking point along the kernel code path and
   modify it to support nowait,
   e.g.  alloc_memory() ----> if (alloc_memory() fails) return -EAGAIN
   This hugely adds programming complexity, especially when the code
   path is long and complicated. For buffered writes, for example, we
   have to handle locks, possibly the journal, and metadata misses such
   as extent nodes. (A toy illustration of this pattern follows this
   list.)

 - by creating/waking up a new worker, we redo an IO request from the
   very beginning, which means we re-walk the path from the start to
   the previous blocking point.
   The original context backtracks to the io_uring layer from the
   blocking point to submit other requests, while it would be better to
   start the new submission directly.
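
To make the first blocking-point issue concrete, here is a toy
userspace stand-in for the nowait plumbing (not kernel code; the
function and lock are purely illustrative):

#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t meta_lock = PTHREAD_MUTEX_INITIALIZER;

/* One step of a buffered write: every lock/alloc/IO point must grow
 * a branch like this before it can be issued in a nowait way. */
static int write_step(bool nonblock)
{
	if (nonblock) {
		if (pthread_mutex_trylock(&meta_lock))
			return -EAGAIN;	/* punt to an io-worker */
	} else {
		pthread_mutex_lock(&meta_lock);
	}
	/* the actual work happens under the lock */
	pthread_mutex_unlock(&meta_lock);
	return 0;
}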

This RFC provides a new way to do it.
 - We maintain a worker pool for each io_uring instance, and each worker
   in it can submit requests. The original task only needs to create the
   first worker and return to userspace; later it doesn't need to call
   io_uring_enter.[1]

 - The created worker begins to submit requests. When it blocks, just
   let it block, and create/wake up another worker to continue the
   submission.

[1] I currently keep these workers until the io_uring context exits. In
    other words, a worker submits, sleeps, and wakes up, but won't
    exit. Thus the original task doesn't need to create/wake up workers.
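
For illustration, a minimal sketch of the intended userspace flow,
assuming liburing passes the new setup flag through unmodified (the
flag value comes from patch 02; the rest is hypothetical usage, not
part of this series):

#include <liburing.h>
#include <string.h>

#ifndef IORING_SETUP_URINGLET
#define IORING_SETUP_URINGLET	(1U << 13)	/* from patch 02 */
#endif

int main(void)
{
	struct io_uring ring;
	struct io_uring_params p;
	struct io_uring_sqe *sqe;

	memset(&p, 0, sizeof(p));
	p.flags = IORING_SETUP_URINGLET;
	if (io_uring_queue_init_params(8, &ring, &p) < 0)
		return 1;

	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_nop(sqe);
	/*
	 * The first submit enters the kernel and kicks off the first
	 * uringlet worker (io_uringlet_offload(), patch 05). The
	 * workers then keep polling the sqring themselves, so later
	 * SQEs only need the SQ ring tail update; no further
	 * io_uring_enter calls are required (footnote [1]).
	 */
	io_uring_submit(&ring);

	io_uring_queue_exit(&ring);
	return 0;
}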

I've done some testing:
name: buffered write
fs: xfs
env: qemu box, 4 cpu, 8G mem.
tool: fio

 - single file test:

   fio ioengine=io_uring, size=10M, bs=1024, direct=0,
       thread=1, rw=randwrite, time_based=1, runtime=180

   async buffered writes:
   iodepth
      1      write: IOPS=428k, BW=418MiB/s (438MB/s)(73.5GiB/180000msec);
      2      write: IOPS=406k, BW=396MiB/s (416MB/s)(69.7GiB/180002msec);
      4      write: IOPS=382k, BW=373MiB/s (391MB/s)(65.6GiB/180000msec);
      8      write: IOPS=255k, BW=249MiB/s (261MB/s)(43.7GiB/180001msec);
      16     write: IOPS=399k, BW=390MiB/s (409MB/s)(68.5GiB/180000msec);
      32     write: IOPS=433k, BW=423MiB/s (443MB/s)(74.3GiB/180000msec);

      1      lat (nsec): min=547, max=2929.3k, avg=1074.98, stdev=6498.72
      2      lat (nsec): min=607, max=84320k, avg=3619.15, stdev=109104.36
      4      lat (nsec): min=891, max=195941k, avg=9062.16, stdev=213600.71
      8      lat (nsec): min=684, max=204164k, avg=29308.56, stdev=542490.72
      16     lat (nsec): min=1002, max=77279k, avg=38716.65, stdev=461785.55
      32     lat (nsec): min=674, max=75279k, avg=72673.91, stdev=588002.49


   uringlet:
   iodepth
     1       write: IOPS=120k, BW=117MiB/s (123MB/s)(20.6GiB/180006msec);
     2       write: IOPS=273k, BW=266MiB/s (279MB/s)(46.8GiB/180010msec);
     4       write: IOPS=336k, BW=328MiB/s (344MB/s)(57.7GiB/180002msec);
     8       write: IOPS=373k, BW=365MiB/s (382MB/s)(64.1GiB/180000msec);
     16      write: IOPS=442k, BW=432MiB/s (453MB/s)(75.9GiB/180001msec);
     32      write: IOPS=444k, BW=434MiB/s (455MB/s)(76.2GiB/180010msec);

     1       lat (nsec): min=684, max=10790k, avg=6781.23, stdev=10000.69
     2       lat (nsec): min=650, max=91712k, avg=5690.52, stdev=136818.11
     4       lat (nsec): min=785, max=79038k, avg=10297.04, stdev=227375.52
     8       lat (nsec): min=862, max=97493k, avg=19804.67, stdev=350809.60
     16      lat (nsec): min=823, max=81279k, avg=34681.33, stdev=478427.17
     32      lat (usec): min=6, max=105935, avg=70.55, stdev=696.08

uringlet behaves worse on IOPS and latency at small iodepth. I think
the reason is that there are more sleeps and wakeups. (I'm not sure
about this yet; I'll look into it later.)

The downside of uringlet:
 - it costs more CPU resources; the reason is similar to the sqpoll
   case: a uringlet worker keeps checking the sqring to reduce
   latency.[2]
 - task->plug is disabled for now since uringlet is buggy with it.

[2] For now, I allow a uringlet worker to spin on an empty sqring for a
while before it goes to sleep (see io_wqe_worker_let() in patch 04).

Any comments are welcome. This early RFC only supports buffered writes
for now; if the idea behind it proves to be the right way, I'll turn it
into a formal patchset, resolve the detailed technical issues, and try
to support more io_uring features.

Regards,
Hao

Hao Xu (19):
  io_uring: change return value of create_io_worker() and
    io_wqe_create_worker()
  io_uring: add IORING_SETUP_URINGLET
  io_uring: make worker pool per ctx for uringlet mode
  io-wq: split io_wqe_worker() to io_wqe_worker_normal() and
    io_wqe_worker_let()
  io_uring: add io_uringlet_offload() for uringlet mode
  io-wq: change the io-worker scheduling logic
  io-wq: move worker state flags to io-wq.h
  io-wq: add IO_WORKER_F_SUBMIT and its friends
  io-wq: add IO_WORKER_F_SCHEDULED and its friends
  io_uring: add io_submit_sqes_let()
  io_uring: don't allocate io-wq for a worker in uringlet mode
  io_uring: add uringlet worker cancellation function
  io-wq: add wq->owner for uringlet mode
  io_uring: modify issue_flags for uringlet mode
  io_uring: don't use inline completion cache if scheduled
  io_uring: release ctx->let when a ring exits
  io_uring: disable task plug for now
  io-wq: only do io_uringlet_end() at the first schedule time
  io_uring: wire up uringlet

 include/linux/io_uring_types.h |   1 +
 include/uapi/linux/io_uring.h  |   4 +
 io_uring/io-wq.c               | 242 +++++++++++++++++++++++++--------
 io_uring/io-wq.h               |  65 ++++++++-
 io_uring/io_uring.c            | 122 +++++++++++++++--
 io_uring/io_uring.h            |   5 +-
 io_uring/tctx.c                |  31 +++--
 7 files changed, 393 insertions(+), 77 deletions(-)


base-commit: 3f743e9bbb8fe20f4c477e4bf6341c4187a4a264
-- 
2.25.1


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 01/19] io_uring: change return value of create_io_worker() and io_wqe_create_worker()
  2022-08-19 15:27 [RFC 00/19] uringlet Hao Xu
@ 2022-08-19 15:27 ` Hao Xu
  2022-08-19 15:27 ` [PATCH 02/19] io_uring: add IORING_SETUP_URINGLET Hao Xu
                   ` (18 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Hao Xu @ 2022-08-19 15:27 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, Pavel Begunkov, Ingo Molnar, Wanpeng Li

From: Hao Xu <[email protected]>

Change the return values of create_io_worker() and io_wqe_create_worker()
so that they report the detailed error code.

Signed-off-by: Hao Xu <[email protected]>
---
 io_uring/io-wq.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
index c6536d4b2da0..f631acbd50df 100644
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -140,7 +140,7 @@ struct io_cb_cancel_data {
 	bool cancel_all;
 };
 
-static bool create_io_worker(struct io_wq *wq, struct io_wqe *wqe, int index);
+static int create_io_worker(struct io_wq *wq, struct io_wqe *wqe, int index);
 static void io_wqe_dec_running(struct io_worker *worker);
 static bool io_acct_cancel_pending_work(struct io_wqe *wqe,
 					struct io_wqe_acct *acct,
@@ -289,7 +289,7 @@ static bool io_wqe_activate_free_worker(struct io_wqe *wqe,
  * We need a worker. If we find a free one, we're good. If not, and we're
  * below the max number of workers, create one.
  */
-static bool io_wqe_create_worker(struct io_wqe *wqe, struct io_wqe_acct *acct)
+static int io_wqe_create_worker(struct io_wqe *wqe, struct io_wqe_acct *acct)
 {
 	/*
 	 * Most likely an attempt to queue unbounded work on an io_wq that
@@ -301,7 +301,7 @@ static bool io_wqe_create_worker(struct io_wqe *wqe, struct io_wqe_acct *acct)
 	raw_spin_lock(&wqe->lock);
 	if (acct->nr_workers >= acct->max_workers) {
 		raw_spin_unlock(&wqe->lock);
-		return true;
+		return 0;
 	}
 	acct->nr_workers++;
 	raw_spin_unlock(&wqe->lock);
@@ -790,7 +790,7 @@ static void io_workqueue_create(struct work_struct *work)
 		kfree(worker);
 }
 
-static bool create_io_worker(struct io_wq *wq, struct io_wqe *wqe, int index)
+static int create_io_worker(struct io_wq *wq, struct io_wqe *wqe, int index)
 {
 	struct io_wqe_acct *acct = &wqe->acct[index];
 	struct io_worker *worker;
@@ -806,7 +806,7 @@ static bool create_io_worker(struct io_wq *wq, struct io_wqe *wqe, int index)
 		acct->nr_workers--;
 		raw_spin_unlock(&wqe->lock);
 		io_worker_ref_put(wq);
-		return false;
+		return -ENOMEM;
 	}
 
 	refcount_set(&worker->ref, 1);
@@ -828,7 +828,7 @@ static bool create_io_worker(struct io_wq *wq, struct io_wqe *wqe, int index)
 		schedule_work(&worker->work);
 	}
 
-	return true;
+	return 0;
 }
 
 /*
@@ -933,7 +933,7 @@ static void io_wqe_enqueue(struct io_wqe *wqe, struct io_wq_work *work)
 	    !atomic_read(&acct->nr_running))) {
 		bool did_create;
 
-		did_create = io_wqe_create_worker(wqe, acct);
+		did_create = !io_wqe_create_worker(wqe, acct);
 		if (likely(did_create))
 			return;
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 02/19] io_uring: add IORING_SETUP_URINGLET
  2022-08-19 15:27 [RFC 00/19] uringlet Hao Xu
  2022-08-19 15:27 ` [PATCH 01/19] io_uring: change return value of create_io_worker() and io_wqe_create_worker() Hao Xu
@ 2022-08-19 15:27 ` Hao Xu
  2022-08-19 15:27 ` [PATCH 03/19] io_uring: make worker pool per ctx for uringlet mode Hao Xu
                   ` (17 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Hao Xu @ 2022-08-19 15:27 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, Pavel Begunkov, Ingo Molnar, Wanpeng Li

From: Hao Xu <[email protected]>

Add a new setup flag to turn on/off uringlet mode.

Signed-off-by: Hao Xu <[email protected]>
---
 include/uapi/linux/io_uring.h | 4 ++++
 io_uring/io_uring.c           | 4 +++-
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 1463cfecb56b..68507c23b079 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -152,6 +152,10 @@ enum {
  * Only one task is allowed to submit requests
  */
 #define IORING_SETUP_SINGLE_ISSUER	(1U << 12)
+/*
+ * uringlet mode
+ */
+#define IORING_SETUP_URINGLET		(1U << 13)
 
 enum io_uring_op {
 	IORING_OP_NOP,
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index ebfdb2212ec2..5e4f5b1684dd 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -3226,6 +3226,8 @@ static __cold int io_uring_create(unsigned entries, struct io_uring_params *p,
 	struct file *file;
 	int ret;
 
+	if (p->flags & IORING_SETUP_URINGLET)
+		return -EINVAL;
 	if (!entries)
 		return -EINVAL;
 	if (entries > IORING_MAX_ENTRIES) {
@@ -3400,7 +3402,7 @@ static long io_uring_setup(u32 entries, struct io_uring_params __user *params)
 			IORING_SETUP_R_DISABLED | IORING_SETUP_SUBMIT_ALL |
 			IORING_SETUP_COOP_TASKRUN | IORING_SETUP_TASKRUN_FLAG |
 			IORING_SETUP_SQE128 | IORING_SETUP_CQE32 |
-			IORING_SETUP_SINGLE_ISSUER))
+			IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_URINGLET))
 		return -EINVAL;
 
 	return io_uring_create(entries, &p, params);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 03/19] io_uring: make worker pool per ctx for uringlet mode
  2022-08-19 15:27 [RFC 00/19] uringlet Hao Xu
  2022-08-19 15:27 ` [PATCH 01/19] io_uring: change return value of create_io_worker() and io_wqe_create_worker() Hao Xu
  2022-08-19 15:27 ` [PATCH 02/19] io_uring: add IORING_SETUP_URINGLET Hao Xu
@ 2022-08-19 15:27 ` Hao Xu
  2022-08-19 15:27 ` [PATCH 04/19] io-wq: split io_wqe_worker() to io_wqe_worker_normal() and io_wqe_worker_let() Hao Xu
                   ` (16 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Hao Xu @ 2022-08-19 15:27 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, Pavel Begunkov, Ingo Molnar, Wanpeng Li

From: Hao Xu <[email protected]>

For uringlet mode, make the worker pool per ctx. This is much easier
to implement; we can improve it later if necessary. In uringlet mode,
we need to find the specific ctx from within a worker, so add a member
'private' for this. For now, we also distinguish a uringlet io-wq from
a normal one by wq->private.

Signed-off-by: Hao Xu <[email protected]>
---
 include/linux/io_uring_types.h |  1 +
 io_uring/io-wq.c               | 11 ++++++++++-
 io_uring/io-wq.h               |  4 ++++
 io_uring/io_uring.c            |  9 +++++++++
 io_uring/tctx.c                |  8 +++++++-
 5 files changed, 31 insertions(+), 2 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 677a25d44d7f..c8093e733a35 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -356,6 +356,7 @@ struct io_ring_ctx {
 	unsigned			sq_thread_idle;
 	/* protected by ->completion_lock */
 	unsigned			evfd_last_cq_tail;
+	struct io_wq			*let;
 };
 
 enum {
diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
index f631acbd50df..aaa58cbacf60 100644
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -127,6 +127,8 @@ struct io_wq {
 
 	struct task_struct *task;
 
+	void *private;
+
 	struct io_wqe *wqes[];
 };
 
@@ -392,6 +394,11 @@ static bool io_queue_worker_create(struct io_worker *worker,
 	return false;
 }
 
+static inline bool io_wq_is_uringlet(struct io_wq *wq)
+{
+	return wq->private;
+}
+
 static void io_wqe_dec_running(struct io_worker *worker)
 {
 	struct io_wqe_acct *acct = io_wqe_get_acct(worker);
@@ -1153,6 +1160,7 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
 	wq->hash = data->hash;
 	wq->free_work = data->free_work;
 	wq->do_work = data->do_work;
+	wq->private = data->private;
 
 	ret = -ENOMEM;
 	for_each_node(node) {
@@ -1188,7 +1196,8 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
 		INIT_LIST_HEAD(&wqe->all_list);
 	}
 
-	wq->task = get_task_struct(data->task);
+	if (data->task)
+		wq->task = get_task_struct(data->task);
 	atomic_set(&wq->worker_refs, 1);
 	init_completion(&wq->worker_done);
 	return wq;
diff --git a/io_uring/io-wq.h b/io_uring/io-wq.h
index 31228426d192..b9f5ce4493e0 100644
--- a/io_uring/io-wq.h
+++ b/io_uring/io-wq.h
@@ -41,6 +41,7 @@ struct io_wq_data {
 	struct task_struct *task;
 	io_wq_work_fn *do_work;
 	free_work_fn *free_work;
+	void *private;
 };
 
 struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data);
@@ -80,4 +81,7 @@ static inline bool io_wq_current_is_worker(void)
 	return in_task() && (current->flags & PF_IO_WORKER) &&
 		current->worker_private;
 }
+
+extern struct io_wq *io_init_wq_offload(struct io_ring_ctx *ctx,
+					struct task_struct *task);
 #endif
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 5e4f5b1684dd..cb011a04653b 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -3318,6 +3318,15 @@ static __cold int io_uring_create(unsigned entries, struct io_uring_params *p,
 	ret = io_sq_offload_create(ctx, p);
 	if (ret)
 		goto err;
+
+	if (ctx->flags & IORING_SETUP_URINGLET) {
+		ctx->let = io_init_wq_offload(ctx, current);
+		if (IS_ERR(ctx->let)) {
+			ret = PTR_ERR(ctx->let);
+			goto err;
+		}
+	}
+
 	/* always set a rsrc node */
 	ret = io_rsrc_node_switch_start(ctx);
 	if (ret)
diff --git a/io_uring/tctx.c b/io_uring/tctx.c
index 7f97d97fef0a..09c91cd7b5bf 100644
--- a/io_uring/tctx.c
+++ b/io_uring/tctx.c
@@ -12,7 +12,7 @@
 #include "io_uring.h"
 #include "tctx.h"
 
-static struct io_wq *io_init_wq_offload(struct io_ring_ctx *ctx,
+struct io_wq *io_init_wq_offload(struct io_ring_ctx *ctx,
 					struct task_struct *task)
 {
 	struct io_wq_hash *hash;
@@ -34,9 +34,15 @@ static struct io_wq *io_init_wq_offload(struct io_ring_ctx *ctx,
 	mutex_unlock(&ctx->uring_lock);
 
 	data.hash = hash;
+	/* for uringlet, wq->task is the iouring instance creator */
 	data.task = task;
 	data.free_work = io_wq_free_work;
 	data.do_work = io_wq_submit_work;
+	/* distinguish normal iowq and uringlet by wq->private for now */
+	if (ctx->flags & IORING_SETUP_URINGLET)
+		data.private = ctx;
+	else
+		data.private = NULL;
 
 	/* Do QD, or 4 * CPUS, whatever is smallest */
 	concurrency = min(ctx->sq_entries, 4 * num_online_cpus());
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 04/19] io-wq: split io_wqe_worker() to io_wqe_worker_normal() and io_wqe_worker_let()
  2022-08-19 15:27 [RFC 00/19] uringlet Hao Xu
                   ` (2 preceding siblings ...)
  2022-08-19 15:27 ` [PATCH 03/19] io_uring: make worker pool per ctx for uringlet mode Hao Xu
@ 2022-08-19 15:27 ` Hao Xu
  2022-08-19 15:27 ` [PATCH 05/19] io_uring: add io_uringlet_offload() for uringlet mode Hao Xu
                   ` (15 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Hao Xu @ 2022-08-19 15:27 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, Pavel Begunkov, Ingo Molnar, Wanpeng Li

From: Hao Xu <[email protected]>

io_wqe_worker_normal() is the handler for normal io-workers, and
io_wqe_worker_let() is the handler for uringlet mode.

Signed-off-by: Hao Xu <[email protected]>
---
 io_uring/io-wq.c    | 82 ++++++++++++++++++++++++++++++++++++++++-----
 io_uring/io-wq.h    |  8 ++++-
 io_uring/io_uring.c |  8 +++--
 io_uring/io_uring.h |  2 +-
 4 files changed, 87 insertions(+), 13 deletions(-)

diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
index aaa58cbacf60..b533db18d7c0 100644
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -20,6 +20,7 @@
 #include "io-wq.h"
 #include "slist.h"
 #include "io_uring.h"
+#include "tctx.h"
 
 #define WORKER_IDLE_TIMEOUT	(5 * HZ)
 
@@ -617,19 +618,12 @@ static void io_worker_handle_work(struct io_worker *worker)
 	} while (1);
 }
 
-static int io_wqe_worker(void *data)
+static void io_wqe_worker_normal(struct io_worker *worker)
 {
-	struct io_worker *worker = data;
 	struct io_wqe_acct *acct = io_wqe_get_acct(worker);
 	struct io_wqe *wqe = worker->wqe;
 	struct io_wq *wq = wqe->wq;
 	bool last_timeout = false;
-	char buf[TASK_COMM_LEN];
-
-	worker->flags |= (IO_WORKER_F_UP | IO_WORKER_F_RUNNING);
-
-	snprintf(buf, sizeof(buf), "iou-wrk-%d", wq->task->pid);
-	set_task_comm(current, buf);
 
 	while (!test_bit(IO_WQ_BIT_EXIT, &wq->state)) {
 		long ret;
@@ -664,6 +658,78 @@ static int io_wqe_worker(void *data)
 
 	if (test_bit(IO_WQ_BIT_EXIT, &wq->state))
 		io_worker_handle_work(worker);
+}
+
+#define IO_URINGLET_EMPTY_LIMIT	100000
+#define URINGLET_WORKER_IDLE_TIMEOUT	1
+
+static void io_wqe_worker_let(struct io_worker *worker)
+{
+	struct io_wqe *wqe = worker->wqe;
+	struct io_wq *wq = wqe->wq;
+
+	/* TODO this one breaks encapsulation */
+	if (unlikely(io_uring_add_tctx_node(wq->private)))
+		goto out;
+
+	while (!test_bit(IO_WQ_BIT_EXIT, &wq->state)) {
+		unsigned int empty_count = 0;
+
+		__io_worker_busy(wqe, worker);
+		set_current_state(TASK_INTERRUPTIBLE);
+
+		do {
+			enum io_uringlet_state submit_state;
+
+			submit_state = wq->do_work(wq->private);
+			if (submit_state == IO_URINGLET_SCHEDULED) {
+				empty_count = 0;
+				break;
+			} else if (submit_state == IO_URINGLET_EMPTY) {
+				if (++empty_count > IO_URINGLET_EMPTY_LIMIT)
+					break;
+			} else {
+				empty_count = 0;
+			}
+			cond_resched();
+		} while (1);
+
+		raw_spin_lock(&wqe->lock);
+		__io_worker_idle(wqe, worker);
+		raw_spin_unlock(&wqe->lock);
+		schedule_timeout(URINGLET_WORKER_IDLE_TIMEOUT);
+		if (signal_pending(current)) {
+			struct ksignal ksig;
+
+			if (!get_signal(&ksig))
+				continue;
+			break;
+		}
+	}
+
+	__set_current_state(TASK_RUNNING);
+out:
+	wq->free_work(NULL);
+}
+
+static int io_wqe_worker(void *data)
+{
+	struct io_worker *worker = data;
+	struct io_wqe *wqe = worker->wqe;
+	struct io_wq *wq = wqe->wq;
+	bool uringlet = io_wq_is_uringlet(wq);
+	char buf[TASK_COMM_LEN];
+
+	worker->flags |= (IO_WORKER_F_UP | IO_WORKER_F_RUNNING);
+
+	snprintf(buf, sizeof(buf), uringlet ? "iou-let-%d" : "iou-wrk-%d",
+		 wq->task->pid);
+	set_task_comm(current, buf);
+
+	if (uringlet)
+		io_wqe_worker_let(worker);
+	else
+		io_wqe_worker_normal(worker);
 
 	io_worker_exit(worker);
 	return 0;
diff --git a/io_uring/io-wq.h b/io_uring/io-wq.h
index b9f5ce4493e0..b862b04e49ce 100644
--- a/io_uring/io-wq.h
+++ b/io_uring/io-wq.h
@@ -21,8 +21,14 @@ enum io_wq_cancel {
 	IO_WQ_CANCEL_NOTFOUND,	/* work not found */
 };
 
+enum io_uringlet_state {
+	IO_URINGLET_INLINE,
+	IO_URINGLET_EMPTY,
+	IO_URINGLET_SCHEDULED,
+};
+
 typedef struct io_wq_work *(free_work_fn)(struct io_wq_work *);
-typedef void (io_wq_work_fn)(struct io_wq_work *);
+typedef int (io_wq_work_fn)(struct io_wq_work *);
 
 struct io_wq_hash {
 	refcount_t refs;
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index cb011a04653b..b57e9059a388 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1612,7 +1612,7 @@ struct io_wq_work *io_wq_free_work(struct io_wq_work *work)
 	return req ? &req->work : NULL;
 }
 
-void io_wq_submit_work(struct io_wq_work *work)
+int io_wq_submit_work(struct io_wq_work *work)
 {
 	struct io_kiocb *req = container_of(work, struct io_kiocb, work);
 	const struct io_op_def *def = &io_op_defs[req->opcode];
@@ -1632,7 +1632,7 @@ void io_wq_submit_work(struct io_wq_work *work)
 	if (work->flags & IO_WQ_WORK_CANCEL) {
 fail:
 		io_req_task_queue_fail(req, err);
-		return;
+		return 0;
 	}
 	if (!io_assign_file(req, issue_flags)) {
 		err = -EBADF;
@@ -1666,7 +1666,7 @@ void io_wq_submit_work(struct io_wq_work *work)
 		}
 
 		if (io_arm_poll_handler(req, issue_flags) == IO_APOLL_OK)
-			return;
+			return 0;
 		/* aborted or ready, in either case retry blocking */
 		needs_poll = false;
 		issue_flags &= ~IO_URING_F_NONBLOCK;
@@ -1675,6 +1675,8 @@ void io_wq_submit_work(struct io_wq_work *work)
 	/* avoid locking problems by failing it from a clean context */
 	if (ret < 0)
 		io_req_task_queue_fail(req, ret);
+
+	return 0;
 }
 
 inline struct file *io_file_get_fixed(struct io_kiocb *req, int fd,
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 2f73f83af960..b20d2506a60f 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -69,7 +69,7 @@ void io_free_batch_list(struct io_ring_ctx *ctx, struct io_wq_work_node *node);
 int io_req_prep_async(struct io_kiocb *req);
 
 struct io_wq_work *io_wq_free_work(struct io_wq_work *work);
-void io_wq_submit_work(struct io_wq_work *work);
+int io_wq_submit_work(struct io_wq_work *work);
 
 void io_free_req(struct io_kiocb *req);
 void io_queue_next(struct io_kiocb *req);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 05/19] io_uring: add io_uringlet_offload() for uringlet mode
  2022-08-19 15:27 [RFC 00/19] uringlet Hao Xu
                   ` (3 preceding siblings ...)
  2022-08-19 15:27 ` [PATCH 04/19] io-wq: split io_wqe_worker() to io_wqe_worker_normal() and io_wqe_worker_let() Hao Xu
@ 2022-08-19 15:27 ` Hao Xu
  2022-08-19 15:27 ` [PATCH 06/19] io-wq: change the io-worker scheduling logic Hao Xu
                   ` (14 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Hao Xu @ 2022-08-19 15:27 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, Pavel Begunkov, Ingo Molnar, Wanpeng Li

From: Hao Xu <[email protected]>

In uringlet mode, an io_uring_enter call shouldn't do the sqe
submission work itself, but just offload it to the io-workers.

Signed-off-by: Hao Xu <[email protected]>
---
 io_uring/io-wq.c    | 18 ++++++++++++++++++
 io_uring/io-wq.h    |  1 +
 io_uring/io_uring.c | 21 ++++++++++++++-------
 3 files changed, 33 insertions(+), 7 deletions(-)

diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
index b533db18d7c0..212ea16cbb5e 100644
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -771,6 +771,24 @@ void io_wq_worker_sleeping(struct task_struct *tsk)
 	io_wqe_dec_running(worker);
 }
 
+int io_uringlet_offload(struct io_wq *wq)
+{
+	struct io_wqe *wqe = wq->wqes[numa_node_id()];
+	struct io_wqe_acct *acct = io_get_acct(wqe, true);
+	bool waken;
+
+	raw_spin_lock(&wqe->lock);
+	rcu_read_lock();
+	waken = io_wqe_activate_free_worker(wqe, acct);
+	rcu_read_unlock();
+	raw_spin_unlock(&wqe->lock);
+
+	if (waken)
+		return 0;
+
+	return io_wqe_create_worker(wqe, acct);
+}
+
 static void io_init_new_worker(struct io_wqe *wqe, struct io_worker *worker,
 			       struct task_struct *tsk)
 {
diff --git a/io_uring/io-wq.h b/io_uring/io-wq.h
index b862b04e49ce..66d2aeb17951 100644
--- a/io_uring/io-wq.h
+++ b/io_uring/io-wq.h
@@ -90,4 +90,5 @@ static inline bool io_wq_current_is_worker(void)
 
 extern struct io_wq *io_init_wq_offload(struct io_ring_ctx *ctx,
 					struct task_struct *task);
+extern int io_uringlet_offload(struct io_wq *wq);
 #endif
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index b57e9059a388..554041705e96 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -3051,15 +3051,22 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
 		if (unlikely(ret))
 			goto out;
 
-		mutex_lock(&ctx->uring_lock);
-		ret = io_submit_sqes(ctx, to_submit);
-		if (ret != to_submit) {
+		if (!(ctx->flags & IORING_SETUP_URINGLET)) {
+			mutex_lock(&ctx->uring_lock);
+			ret = io_submit_sqes(ctx, to_submit);
+			if (ret != to_submit) {
+				mutex_unlock(&ctx->uring_lock);
+				goto out;
+			}
+			if ((flags & IORING_ENTER_GETEVENTS) && ctx->syscall_iopoll)
+				goto iopoll_locked;
 			mutex_unlock(&ctx->uring_lock);
-			goto out;
+		} else {
+			ret = io_uringlet_offload(ctx->let);
+			if (ret)
+				goto out;
+			ret = to_submit;
 		}
-		if ((flags & IORING_ENTER_GETEVENTS) && ctx->syscall_iopoll)
-			goto iopoll_locked;
-		mutex_unlock(&ctx->uring_lock);
 	}
 	if (flags & IORING_ENTER_GETEVENTS) {
 		int ret2;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 06/19] io-wq: change the io-worker scheduling logic
  2022-08-19 15:27 [RFC 00/19] uringlet Hao Xu
                   ` (4 preceding siblings ...)
  2022-08-19 15:27 ` [PATCH 05/19] io_uring: add io_uringlet_offload() for uringlet mode Hao Xu
@ 2022-08-19 15:27 ` Hao Xu
  2022-08-19 15:27 ` [PATCH 07/19] io-wq: move worker state flags to io-wq.h Hao Xu
                   ` (13 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Hao Xu @ 2022-08-19 15:27 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, Pavel Begunkov, Ingo Molnar, Wanpeng Li

From: Hao Xu <[email protected]>

We create an io-worker when an io-worker goes to sleep and some
condition is met. For uringlet mode, we need to do this scheduling too.
A uringlet worker goes to sleep because it blocks somewhere below the
io_uring layer in the kernel stack, so we should wake up or create a
new uringlet worker in this situation. Meanwhile, set a flag to let the
sqe submitter know it has been scheduled out.

Signed-off-by: Hao Xu <[email protected]>
---
 io_uring/io-wq.c | 22 ++++++++++++++++++----
 1 file changed, 18 insertions(+), 4 deletions(-)

diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
index 212ea16cbb5e..5f54af7579a4 100644
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -404,14 +404,28 @@ static void io_wqe_dec_running(struct io_worker *worker)
 {
 	struct io_wqe_acct *acct = io_wqe_get_acct(worker);
 	struct io_wqe *wqe = worker->wqe;
+	struct io_wq *wq = wqe->wq;
+	bool zero_refs;
 
 	if (!(worker->flags & IO_WORKER_F_UP))
 		return;
 
-	if (!atomic_dec_and_test(&acct->nr_running))
-		return;
-	if (!io_acct_run_queue(acct))
-		return;
+	zero_refs = atomic_dec_and_test(&acct->nr_running);
+
+	if (io_wq_is_uringlet(wq)) {
+		bool activated;
+
+		raw_spin_lock(&wqe->lock);
+		rcu_read_lock();
+		activated = io_wqe_activate_free_worker(wqe, acct);
+		rcu_read_unlock();
+		raw_spin_unlock(&wqe->lock);
+		if (activated)
+			return;
+	} else {
+		if (!zero_refs || !io_acct_run_queue(acct))
+			return;
+	}
 
 	atomic_inc(&acct->nr_running);
 	atomic_inc(&wqe->wq->worker_refs);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 07/19] io-wq: move worker state flags to io-wq.h
  2022-08-19 15:27 [RFC 00/19] uringlet Hao Xu
                   ` (5 preceding siblings ...)
  2022-08-19 15:27 ` [PATCH 06/19] io-wq: change the io-worker scheduling logic Hao Xu
@ 2022-08-19 15:27 ` Hao Xu
  2022-08-19 15:27 ` [PATCH 08/19] io-wq: add IO_WORKER_F_SUBMIT and its friends Hao Xu
                   ` (12 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Hao Xu @ 2022-08-19 15:27 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, Pavel Begunkov, Ingo Molnar, Wanpeng Li

From: Hao Xu <[email protected]>

Move worker state flags to io-wq.h so that we can leverage them later.

Signed-off-by: Hao Xu <[email protected]>
---
 io_uring/io-wq.c | 7 -------
 io_uring/io-wq.h | 8 ++++++++
 2 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
index 5f54af7579a4..55f1063f24c7 100644
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -24,13 +24,6 @@
 
 #define WORKER_IDLE_TIMEOUT	(5 * HZ)
 
-enum {
-	IO_WORKER_F_UP		= 1,	/* up and active */
-	IO_WORKER_F_RUNNING	= 2,	/* account as running */
-	IO_WORKER_F_FREE	= 4,	/* worker on free list */
-	IO_WORKER_F_BOUND	= 8,	/* is doing bounded work */
-};
-
 enum {
 	IO_WQ_BIT_EXIT		= 0,	/* wq exiting */
 };
diff --git a/io_uring/io-wq.h b/io_uring/io-wq.h
index 66d2aeb17951..504a8a8e3fd8 100644
--- a/io_uring/io-wq.h
+++ b/io_uring/io-wq.h
@@ -27,6 +27,14 @@ enum io_uringlet_state {
 	IO_URINGLET_SCHEDULED,
 };
 
+enum {
+	IO_WORKER_F_UP		= 1,	/* up and active */
+	IO_WORKER_F_RUNNING	= 2,	/* account as running */
+	IO_WORKER_F_FREE	= 4,	/* worker on free list */
+	IO_WORKER_F_BOUND	= 8,	/* is doing bounded work */
+	IO_WORKER_F_SCHEDULED	= 16,	/* worker had been scheduled out before */
+};
+
 typedef struct io_wq_work *(free_work_fn)(struct io_wq_work *);
 typedef int (io_wq_work_fn)(struct io_wq_work *);
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 08/19] io-wq: add IO_WORKER_F_SUBMIT and its friends
  2022-08-19 15:27 [RFC 00/19] uringlet Hao Xu
                   ` (6 preceding siblings ...)
  2022-08-19 15:27 ` [PATCH 07/19] io-wq: move worker state flags to io-wq.h Hao Xu
@ 2022-08-19 15:27 ` Hao Xu
  2022-08-19 15:27 ` [PATCH 09/19] io-wq: add IO_WORKER_F_SCHEDULED " Hao Xu
                   ` (11 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Hao Xu @ 2022-08-19 15:27 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, Pavel Begunkov, Ingo Molnar, Wanpeng Li

From: Hao Xu <[email protected]>

Add IO_WORKER_F_SUBMIT to indicate that a uringlet worker is submitting
sqes, so that we know to do some scheduling when it blocks.

Signed-off-by: Hao Xu <[email protected]>
---
 io_uring/io-wq.c | 20 ++++++++++++++++++++
 io_uring/io-wq.h |  1 +
 2 files changed, 21 insertions(+)

diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
index 55f1063f24c7..7e58bb5857ee 100644
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -393,6 +393,21 @@ static inline bool io_wq_is_uringlet(struct io_wq *wq)
 	return wq->private;
 }
 
+static inline void io_worker_set_submit(struct io_worker *worker)
+{
+	worker->flags |= IO_WORKER_F_SUBMIT;
+}
+
+static inline void io_worker_clean_submit(struct io_worker *worker)
+{
+	worker->flags &= ~IO_WORKER_F_SUBMIT;
+}
+
+static inline bool io_worker_test_submit(struct io_worker *worker)
+{
+	return worker->flags & IO_WORKER_F_SUBMIT;
+}
+
 static void io_wqe_dec_running(struct io_worker *worker)
 {
 	struct io_wqe_acct *acct = io_wqe_get_acct(worker);
@@ -408,6 +423,9 @@ static void io_wqe_dec_running(struct io_worker *worker)
 	if (io_wq_is_uringlet(wq)) {
 		bool activated;
 
+		if (!io_worker_test_submit(worker))
+			return;
+
 		raw_spin_lock(&wqe->lock);
 		rcu_read_lock();
 		activated = io_wqe_activate_free_worker(wqe, acct);
@@ -688,7 +706,9 @@ static void io_wqe_worker_let(struct io_worker *worker)
 		do {
 			enum io_uringlet_state submit_state;
 
+			io_worker_set_submit(worker);
 			submit_state = wq->do_work(wq->private);
+			io_worker_clean_submit(worker);
 			if (submit_state == IO_URINGLET_SCHEDULED) {
 				empty_count = 0;
 				break;
diff --git a/io_uring/io-wq.h b/io_uring/io-wq.h
index 504a8a8e3fd8..1485e9009784 100644
--- a/io_uring/io-wq.h
+++ b/io_uring/io-wq.h
@@ -33,6 +33,7 @@ enum {
 	IO_WORKER_F_FREE	= 4,	/* worker on free list */
 	IO_WORKER_F_BOUND	= 8,	/* is doing bounded work */
 	IO_WORKER_F_SCHEDULED	= 16,	/* worker had been scheduled out before */
+	IO_WORKER_F_SUBMIT	= 32,	/* uringlet worker is submitting sqes */
 };
 
 typedef struct io_wq_work *(free_work_fn)(struct io_wq_work *);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 09/19] io-wq: add IO_WORKER_F_SCHEDULED and its friends
  2022-08-19 15:27 [RFC 00/19] uringlet Hao Xu
                   ` (7 preceding siblings ...)
  2022-08-19 15:27 ` [PATCH 08/19] io-wq: add IO_WORKER_F_SUBMIT and its friends Hao Xu
@ 2022-08-19 15:27 ` Hao Xu
  2022-08-19 15:27 ` [PATCH 10/19] io_uring: add io_submit_sqes_let() Hao Xu
                   ` (10 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Hao Xu @ 2022-08-19 15:27 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, Pavel Begunkov, Ingo Molnar, Wanpeng Li

From: Hao Xu <[email protected]>

Signed-off-by: Hao Xu <[email protected]>
---
 io_uring/io-wq.c | 29 ++---------------------------
 io_uring/io-wq.h | 43 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 45 insertions(+), 27 deletions(-)

diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
index 7e58bb5857ee..fe4faff79cf8 100644
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -32,33 +32,6 @@ enum {
 	IO_ACCT_STALLED_BIT	= 0,	/* stalled on hash */
 };
 
-/*
- * One for each thread in a wqe pool
- */
-struct io_worker {
-	refcount_t ref;
-	unsigned flags;
-	struct hlist_nulls_node nulls_node;
-	struct list_head all_list;
-	struct task_struct *task;
-	struct io_wqe *wqe;
-
-	struct io_wq_work *cur_work;
-	struct io_wq_work *next_work;
-	raw_spinlock_t lock;
-
-	struct completion ref_done;
-
-	unsigned long create_state;
-	struct callback_head create_work;
-	int create_index;
-
-	union {
-		struct rcu_head rcu;
-		struct work_struct work;
-	};
-};
-
 #if BITS_PER_LONG == 64
 #define IO_WQ_HASH_ORDER	6
 #else
@@ -426,6 +399,7 @@ static void io_wqe_dec_running(struct io_worker *worker)
 		if (!io_worker_test_submit(worker))
 			return;
 
+		io_worker_set_scheduled(worker);
 		raw_spin_lock(&wqe->lock);
 		rcu_read_lock();
 		activated = io_wqe_activate_free_worker(wqe, acct);
@@ -706,6 +680,7 @@ static void io_wqe_worker_let(struct io_worker *worker)
 		do {
 			enum io_uringlet_state submit_state;
 
+			io_worker_clean_scheduled(worker);
 			io_worker_set_submit(worker);
 			submit_state = wq->do_work(wq->private);
 			io_worker_clean_submit(worker);
diff --git a/io_uring/io-wq.h b/io_uring/io-wq.h
index 1485e9009784..81146dba2ae6 100644
--- a/io_uring/io-wq.h
+++ b/io_uring/io-wq.h
@@ -3,9 +3,37 @@
 
 #include <linux/refcount.h>
 #include <linux/io_uring_types.h>
+#include <linux/list_nulls.h>
 
 struct io_wq;
 
+/*
+ * One for each thread in a wqe pool
+ */
+struct io_worker {
+	refcount_t ref;
+	unsigned flags;
+	struct hlist_nulls_node nulls_node;
+	struct list_head all_list;
+	struct task_struct *task;
+	struct io_wqe *wqe;
+
+	struct io_wq_work *cur_work;
+	struct io_wq_work *next_work;
+	raw_spinlock_t lock;
+
+	struct completion ref_done;
+
+	unsigned long create_state;
+	struct callback_head create_work;
+	int create_index;
+
+	union {
+		struct rcu_head rcu;
+		struct work_struct work;
+	};
+};
+
 enum {
 	IO_WQ_WORK_CANCEL	= 1,
 	IO_WQ_WORK_HASHED	= 2,
@@ -97,6 +125,21 @@ static inline bool io_wq_current_is_worker(void)
 		current->worker_private;
 }
 
+static inline void io_worker_set_scheduled(struct io_worker *worker)
+{
+	worker->flags |= IO_WORKER_F_SCHEDULED;
+}
+
+static inline void io_worker_clean_scheduled(struct io_worker *worker)
+{
+	worker->flags &= ~IO_WORKER_F_SCHEDULED;
+}
+
+static inline bool io_worker_test_scheduled(struct io_worker *worker)
+{
+	return worker->flags & IO_WORKER_F_SCHEDULED;
+}
+
 extern struct io_wq *io_init_wq_offload(struct io_ring_ctx *ctx,
 					struct task_struct *task);
 extern int io_uringlet_offload(struct io_wq *wq);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 10/19] io_uring: add io_submit_sqes_let()
  2022-08-19 15:27 [RFC 00/19] uringlet Hao Xu
                   ` (8 preceding siblings ...)
  2022-08-19 15:27 ` [PATCH 09/19] io-wq: add IO_WORKER_F_SCHEDULED " Hao Xu
@ 2022-08-19 15:27 ` Hao Xu
  2022-08-19 15:27 ` [PATCH 11/19] io_uring: don't allocate io-wq for a worker in uringlet mode Hao Xu
                   ` (9 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Hao Xu @ 2022-08-19 15:27 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, Pavel Begunkov, Ingo Molnar, Wanpeng Li

From: Hao Xu <[email protected]>

Add io_submit_sqes_let() for submitting sqes in uringlet mode, and
update the logic at schedule time and at io-wq init time.

Signed-off-by: Hao Xu <[email protected]>
---
 io_uring/io-wq.c    |  1 +
 io_uring/io_uring.c | 55 +++++++++++++++++++++++++++++++++++++++++++++
 io_uring/io_uring.h |  2 ++
 io_uring/tctx.c     |  8 ++++---
 4 files changed, 63 insertions(+), 3 deletions(-)

diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
index fe4faff79cf8..00a1cdefb787 100644
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -399,6 +399,7 @@ static void io_wqe_dec_running(struct io_worker *worker)
 		if (!io_worker_test_submit(worker))
 			return;
 
+		io_uringlet_end(wq->private);
 		io_worker_set_scheduled(worker);
 		raw_spin_lock(&wqe->lock);
 		rcu_read_lock();
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 554041705e96..a5fb6fa02ded 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -2054,6 +2054,12 @@ static void io_commit_sqring(struct io_ring_ctx *ctx)
 	smp_store_release(&rings->sq.head, ctx->cached_sq_head);
 }
 
+void io_uringlet_end(struct io_ring_ctx *ctx)
+{
+	io_submit_state_end(ctx);
+	io_commit_sqring(ctx);
+}
+
 /*
  * Fetch an sqe, if one is available. Note this returns a pointer to memory
  * that is mapped by userspace. This means that care needs to be taken to
@@ -2141,6 +2147,55 @@ int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr)
 	io_commit_sqring(ctx);
 	return ret;
 }
+int io_submit_sqes_let(struct io_wq_work *work)
+{
+	struct io_ring_ctx *ctx = (struct io_ring_ctx *)work;
+	unsigned int entries;
+	bool scheduled = false;
+	void *worker = current->worker_private;
+
+	entries = io_sqring_entries(ctx);
+	if (!entries)
+		return IO_URINGLET_EMPTY;
+
+	io_get_task_refs(entries);
+	io_submit_state_start(&ctx->submit_state, entries);
+	do {
+		const struct io_uring_sqe *sqe;
+		struct io_kiocb *req;
+
+		if (unlikely(!io_alloc_req_refill(ctx)))
+			break;
+		req = io_alloc_req(ctx);
+		sqe = io_get_sqe(ctx);
+		if (unlikely(!sqe)) {
+			io_req_add_to_cache(req, ctx);
+			break;
+		}
+
+		if (unlikely(io_submit_sqe(ctx, req, sqe)))
+			break;
+		/*  TODO this one breaks encapsulation */
+		scheduled = io_worker_test_scheduled(worker);
+		if (unlikely(scheduled)) {
+			entries--;
+			break;
+		}
+	} while (--entries);
+
+	/* TODO do this at the schedule time too */
+	if (unlikely(entries))
+		current->io_uring->cached_refs += entries;
+
+	 /* Commit SQ ring head once we've consumed and submitted all SQEs */
+
+	if (scheduled)
+		return IO_URINGLET_SCHEDULED;
+
+	io_uringlet_end(ctx);
+	return IO_URINGLET_INLINE;
+}
+
 
 struct io_wait_queue {
 	struct wait_queue_entry wq;
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index b20d2506a60f..b95d92619607 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -64,9 +64,11 @@ int io_uring_alloc_task_context(struct task_struct *task,
 
 int io_poll_issue(struct io_kiocb *req, bool *locked);
 int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr);
+int io_submit_sqes_let(struct io_wq_work *work);
 int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin);
 void io_free_batch_list(struct io_ring_ctx *ctx, struct io_wq_work_node *node);
 int io_req_prep_async(struct io_kiocb *req);
+void io_uringlet_end(struct io_ring_ctx *ctx);
 
 struct io_wq_work *io_wq_free_work(struct io_wq_work *work);
 int io_wq_submit_work(struct io_wq_work *work);
diff --git a/io_uring/tctx.c b/io_uring/tctx.c
index 09c91cd7b5bf..0c15fb8b9a2e 100644
--- a/io_uring/tctx.c
+++ b/io_uring/tctx.c
@@ -37,12 +37,14 @@ struct io_wq *io_init_wq_offload(struct io_ring_ctx *ctx,
 	/* for uringlet, wq->task is the iouring instance creator */
 	data.task = task;
 	data.free_work = io_wq_free_work;
-	data.do_work = io_wq_submit_work;
 	/* distinguish normal iowq and uringlet by wq->private for now */
-	if (ctx->flags & IORING_SETUP_URINGLET)
+	if (ctx->flags & IORING_SETUP_URINGLET) {
 		data.private = ctx;
-	else
+		data.do_work = io_submit_sqes_let;
+	} else {
 		data.private = NULL;
+		data.do_work = io_wq_submit_work;
+	}
 
 	/* Do QD, or 4 * CPUS, whatever is smallest */
 	concurrency = min(ctx->sq_entries, 4 * num_online_cpus());
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 11/19] io_uring: don't allocate io-wq for a worker in uringlet mode
  2022-08-19 15:27 [RFC 00/19] uringlet Hao Xu
                   ` (9 preceding siblings ...)
  2022-08-19 15:27 ` [PATCH 10/19] io_uring: add io_submit_sqes_let() Hao Xu
@ 2022-08-19 15:27 ` Hao Xu
  2022-08-19 15:27 ` [PATCH 12/19] io_uring: add uringlet worker cancellation function Hao Xu
                   ` (8 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Hao Xu @ 2022-08-19 15:27 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, Pavel Begunkov, Ingo Molnar, Wanpeng Li

From: Hao Xu <[email protected]>

A uringlet worker doesn't need any io-wq pool.

Signed-off-by: Hao Xu <[email protected]>
---
 io_uring/tctx.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/io_uring/tctx.c b/io_uring/tctx.c
index 0c15fb8b9a2e..b04d361bcf34 100644
--- a/io_uring/tctx.c
+++ b/io_uring/tctx.c
@@ -81,12 +81,17 @@ __cold int io_uring_alloc_task_context(struct task_struct *task,
 		return ret;
 	}
 
-	tctx->io_wq = io_init_wq_offload(ctx, task);
-	if (IS_ERR(tctx->io_wq)) {
-		ret = PTR_ERR(tctx->io_wq);
-		percpu_counter_destroy(&tctx->inflight);
-		kfree(tctx);
-		return ret;
+	/*
+	 * don't allocate io-wq in uringlet mode
+	 */
+	if (!(ctx->flags & IORING_SETUP_URINGLET)) {
+		tctx->io_wq = io_init_wq_offload(ctx, task);
+		if (IS_ERR(tctx->io_wq)) {
+			ret = PTR_ERR(tctx->io_wq);
+			percpu_counter_destroy(&tctx->inflight);
+			kfree(tctx);
+			return ret;
+		}
 	}
 
 	xa_init(&tctx->xa);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 12/19] io_uring: add uringlet worker cancellation function
  2022-08-19 15:27 [RFC 00/19] uringlet Hao Xu
                   ` (10 preceding siblings ...)
  2022-08-19 15:27 ` [PATCH 11/19] io_uring: don't allocate io-wq for a worker in uringlet mode Hao Xu
@ 2022-08-19 15:27 ` Hao Xu
  2022-08-19 15:27 ` [PATCH 13/19] io-wq: add wq->owner for uringlet mode Hao Xu
                   ` (7 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Hao Xu @ 2022-08-19 15:27 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, Pavel Begunkov, Ingo Molnar, Wanpeng Li

From: Hao Xu <[email protected]>

A uringlet worker submits sqes, so we need to do some cancellation work
before it exits.

Signed-off-by: Hao Xu <[email protected]>
---
 io_uring/io_uring.c | 6 ++++++
 io_uring/io_uring.h | 1 +
 io_uring/tctx.c     | 2 ++
 3 files changed, 9 insertions(+)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index a5fb6fa02ded..67d02dc16ea5 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -2922,6 +2922,12 @@ void __io_uring_cancel(bool cancel_all)
 	io_uring_cancel_generic(cancel_all, NULL);
 }
 
+struct io_wq_work *io_uringlet_cancel(struct io_wq_work *work)
+{
+	__io_uring_cancel(true);
+	return NULL;
+}
+
 static void *io_uring_validate_mmap_request(struct file *file,
 					    loff_t pgoff, size_t sz)
 {
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index b95d92619607..011d0beb33bf 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -71,6 +71,7 @@ int io_req_prep_async(struct io_kiocb *req);
 void io_uringlet_end(struct io_ring_ctx *ctx);
 
 struct io_wq_work *io_wq_free_work(struct io_wq_work *work);
+struct io_wq_work *io_uringlet_cancel(struct io_wq_work *work);
 int io_wq_submit_work(struct io_wq_work *work);
 
 void io_free_req(struct io_kiocb *req);
diff --git a/io_uring/tctx.c b/io_uring/tctx.c
index b04d361bcf34..e10b20725066 100644
--- a/io_uring/tctx.c
+++ b/io_uring/tctx.c
@@ -41,9 +41,11 @@ struct io_wq *io_init_wq_offload(struct io_ring_ctx *ctx,
 	if (ctx->flags & IORING_SETUP_URINGLET) {
 		data.private = ctx;
 		data.do_work = io_submit_sqes_let;
+		data.free_work = io_uringlet_cancel;
 	} else {
 		data.private = NULL;
 		data.do_work = io_wq_submit_work;
+		data.free_work = io_wq_free_work;
 	}
 
 	/* Do QD, or 4 * CPUS, whatever is smallest */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 13/19] io-wq: add wq->owner for uringlet mode
  2022-08-19 15:27 [RFC 00/19] uringlet Hao Xu
                   ` (11 preceding siblings ...)
  2022-08-19 15:27 ` [PATCH 12/19] io_uring: add uringlet worker cancellation function Hao Xu
@ 2022-08-19 15:27 ` Hao Xu
  2022-08-19 15:27 ` [PATCH 14/19] io_uring: modify issue_flags " Hao Xu
                   ` (6 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Hao Xu @ 2022-08-19 15:27 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, Pavel Begunkov, Ingo Molnar, Wanpeng Li

From: Hao Xu <[email protected]>

In uringlet mode, we allow exactly one worker to submit sqes at a
time. nr_running is not a good fit for enforcing that, so add a member
wq->owner and its lock; this avoids race conditions between workers.

Signed-off-by: Hao Xu <[email protected]>
---
 io_uring/io-wq.c | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
index 00a1cdefb787..9fcaeea7a478 100644
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -96,6 +96,9 @@ struct io_wq {
 
 	void *private;
 
+	raw_spinlock_t lock;
+	struct io_worker *owner;
+
 	struct io_wqe *wqes[];
 };
 
@@ -381,6 +384,8 @@ static inline bool io_worker_test_submit(struct io_worker *worker)
 	return worker->flags & IO_WORKER_F_SUBMIT;
 }
 
+#define IO_WQ_OWNER_TRANSMIT	((struct io_worker *)-1)
+
 static void io_wqe_dec_running(struct io_worker *worker)
 {
 	struct io_wqe_acct *acct = io_wqe_get_acct(worker);
@@ -401,6 +406,10 @@ static void io_wqe_dec_running(struct io_worker *worker)
 
 		io_uringlet_end(wq->private);
 		io_worker_set_scheduled(worker);
+		raw_spin_lock(&wq->lock);
+		wq->owner = IO_WQ_OWNER_TRANSMIT;
+		raw_spin_unlock(&wq->lock);
+
 		raw_spin_lock(&wqe->lock);
 		rcu_read_lock();
 		activated = io_wqe_activate_free_worker(wqe, acct);
@@ -674,6 +683,17 @@ static void io_wqe_worker_let(struct io_worker *worker)
 
 	while (!test_bit(IO_WQ_BIT_EXIT, &wq->state)) {
 		unsigned int empty_count = 0;
+		struct io_worker *owner;
+
+		raw_spin_lock(&wq->lock);
+		owner = wq->owner;
+		if (owner && owner != IO_WQ_OWNER_TRANSMIT && owner != worker) {
+			raw_spin_unlock(&wq->lock);
+			set_current_state(TASK_INTERRUPTIBLE);
+			goto sleep;
+		}
+		wq->owner = worker;
+		raw_spin_unlock(&wq->lock);
 
 		__io_worker_busy(wqe, worker);
 		set_current_state(TASK_INTERRUPTIBLE);
@@ -697,6 +717,7 @@ static void io_wqe_worker_let(struct io_worker *worker)
 			cond_resched();
 		} while (1);
 
+sleep:
 		raw_spin_lock(&wqe->lock);
 		__io_worker_idle(wqe, worker);
 		raw_spin_unlock(&wqe->lock);
@@ -780,6 +801,14 @@ int io_uringlet_offload(struct io_wq *wq)
 	struct io_wqe_acct *acct = io_get_acct(wqe, true);
 	bool waken;
 
+	raw_spin_lock(&wq->lock);
+	if (wq->owner) {
+		raw_spin_unlock(&wq->lock);
+		return 0;
+	}
+	wq->owner = IO_WQ_OWNER_TRANSMIT;
+	raw_spin_unlock(&wq->lock);
+
 	raw_spin_lock(&wqe->lock);
 	rcu_read_lock();
 	waken = io_wqe_activate_free_worker(wqe, acct);
@@ -1248,6 +1277,7 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
 	wq->free_work = data->free_work;
 	wq->do_work = data->do_work;
 	wq->private = data->private;
+	raw_spin_lock_init(&wq->lock);
 
 	ret = -ENOMEM;
 	for_each_node(node) {
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 14/19] io_uring: modify issue_flags for uringlet mode
  2022-08-19 15:27 [RFC 00/19] uringlet Hao Xu
                   ` (12 preceding siblings ...)
  2022-08-19 15:27 ` [PATCH 13/19] io-wq: add wq->owner for uringlet mode Hao Xu
@ 2022-08-19 15:27 ` Hao Xu
  2022-08-19 15:27 ` [PATCH 15/19] io_uring: don't use inline completion cache if scheduled Hao Xu
                   ` (5 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Hao Xu @ 2022-08-19 15:27 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, Pavel Begunkov, Ingo Molnar, Wanpeng Li

From: Hao Xu <[email protected]>

We don't need IO_URING_F_NONBLOCK in uringlet mode.

Signed-off-by: Hao Xu <[email protected]>
---
 io_uring/io_uring.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 67d02dc16ea5..0c14b90b8b47 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1751,8 +1751,12 @@ static inline void io_queue_sqe(struct io_kiocb *req)
 	__must_hold(&req->ctx->uring_lock)
 {
 	int ret;
+	unsigned int issue_flags = IO_URING_F_COMPLETE_DEFER;
 
-	ret = io_issue_sqe(req, IO_URING_F_NONBLOCK|IO_URING_F_COMPLETE_DEFER);
+	if (!(req->ctx->flags & IORING_SETUP_URINGLET))
+		issue_flags |= IO_URING_F_NONBLOCK;
+
+	ret = io_issue_sqe(req, issue_flags);
 
 	/*
 	 * We async punt it if the file wasn't marked NOWAIT, or if the file
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 15/19] io_uring: don't use inline completion cache if scheduled
  2022-08-19 15:27 [RFC 00/19] uringlet Hao Xu
                   ` (13 preceding siblings ...)
  2022-08-19 15:27 ` [PATCH 14/19] io_uring: modify issue_flags " Hao Xu
@ 2022-08-19 15:27 ` Hao Xu
  2022-08-19 15:27 ` [PATCH 16/19] io_uring: release ctx->let when a ring exits Hao Xu
                   ` (4 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Hao Xu @ 2022-08-19 15:27 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, Pavel Begunkov, Ingo Molnar, Wanpeng Li

From: Hao Xu <[email protected]>

In uringlet mode, if a worker has been scheduled out during sqe
submission, we cannot use inline completion for that sqe.

Signed-off-by: Hao Xu <[email protected]>
---
 io_uring/io_uring.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 0c14b90b8b47..a109dcb48702 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1582,7 +1582,14 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
 		revert_creds(creds);
 
 	if (ret == IOU_OK) {
-		if (issue_flags & IO_URING_F_COMPLETE_DEFER)
+		bool uringlet = req->ctx->flags & IORING_SETUP_URINGLET;
+		bool scheduled = false;
+
+		if (uringlet)
+			scheduled =
+				io_worker_test_scheduled(current->worker_private);
+
+		if ((issue_flags & IO_URING_F_COMPLETE_DEFER) && !scheduled)
 			io_req_complete_defer(req);
 		else
 			io_req_complete_post(req);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 16/19] io_uring: release ctx->let when a ring exits
  2022-08-19 15:27 [RFC 00/19] uringlet Hao Xu
                   ` (14 preceding siblings ...)
  2022-08-19 15:27 ` [PATCH 15/19] io_uring: don't use inline completion cache if scheduled Hao Xu
@ 2022-08-19 15:27 ` Hao Xu
  2022-08-19 15:27 ` [PATCH 17/19] io_uring: disable task plug for now Hao Xu
                   ` (3 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Hao Xu @ 2022-08-19 15:27 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, Pavel Begunkov, Ingo Molnar, Wanpeng Li

From: Hao Xu <[email protected]>

Release the uringlet worker pool when a ring exits, and reclaim the
resources.

Signed-off-by: Hao Xu <[email protected]>
---
 io_uring/io_uring.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index a109dcb48702..bbe8948f4771 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -2699,6 +2699,11 @@ static __cold void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx)
 	unsigned long index;
 	struct creds *creds;
 
+	if (ctx->flags & IORING_SETUP_URINGLET) {
+		io_wq_exit_start(ctx->let);
+		io_wq_put_and_exit(ctx->let);
+	}
+
 	mutex_lock(&ctx->uring_lock);
 	percpu_ref_kill(&ctx->refs);
 	if (ctx->rings)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 17/19] io_uring: disable task plug for now
  2022-08-19 15:27 [RFC 00/19] uringlet Hao Xu
                   ` (15 preceding siblings ...)
  2022-08-19 15:27 ` [PATCH 16/19] io_uring: release ctx->let when a ring exits Hao Xu
@ 2022-08-19 15:27 ` Hao Xu
  2022-08-19 15:27 ` [PATCH 18/19] io-wq: only do io_uringlet_end() at the first schedule time Hao Xu
                   ` (2 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Hao Xu @ 2022-08-19 15:27 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, Pavel Begunkov, Ingo Molnar, Wanpeng Li

From: Hao Xu <[email protected]>

This is a temporary commit: the task plug causes a hang and the reason
is unclear for now, so disable it in uringlet mode.

Signed-off-by: Hao Xu <[email protected]>
---
 io_uring/io_uring.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index bbe8948f4771..a48e34f63845 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -2171,6 +2171,7 @@ int io_submit_sqes_let(struct io_wq_work *work)
 
 	io_get_task_refs(entries);
 	io_submit_state_start(&ctx->submit_state, entries);
+	ctx->submit_state->need_plug = false;
 	do {
 		const struct io_uring_sqe *sqe;
 		struct io_kiocb *req;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 18/19] io-wq: only do io_uringlet_end() at the first schedule time
  2022-08-19 15:27 [RFC 00/19] uringlet Hao Xu
                   ` (16 preceding siblings ...)
  2022-08-19 15:27 ` [PATCH 17/19] io_uring: disable task plug for now Hao Xu
@ 2022-08-19 15:27 ` Hao Xu
  2022-08-19 15:27 ` [PATCH 19/19] io_uring: wire up uringlet Hao Xu
  2022-08-25 13:03 ` [RFC 00/19] uringlet Hao Xu
  19 siblings, 0 replies; 21+ messages in thread
From: Hao Xu @ 2022-08-19 15:27 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, Pavel Begunkov, Ingo Molnar, Wanpeng Li

From: Hao Xu <[email protected]>

A request may block multiple times during its life cycle. We should
only do io_uringlet_end() the first time, since this function may
modify ctx->submit_state, and after the first time the task has
already lost control of sqe submission; allowing it to run again would
corrupt the submission state.

Signed-off-by: Hao Xu <[email protected]>
---
 io_uring/io-wq.c    | 14 +++++++++++---
 io_uring/io_uring.c |  2 +-
 2 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
index 9fcaeea7a478..f845b7daced8 100644
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -400,16 +400,24 @@ static void io_wqe_dec_running(struct io_worker *worker)
 
 	if (io_wq_is_uringlet(wq)) {
 		bool activated;
+		bool first_block;
 
 		if (!io_worker_test_submit(worker))
 			return;
 
-		io_uringlet_end(wq->private);
-		io_worker_set_scheduled(worker);
 		raw_spin_lock(&wq->lock);
-		wq->owner = IO_WQ_OWNER_TRANSMIT;
+		first_block = (wq->owner == worker);
 		raw_spin_unlock(&wq->lock);
 
+		io_worker_set_scheduled(worker);
+
+		if (first_block) {
+			io_uringlet_end(wq->private);
+			raw_spin_lock(&wq->lock);
+			wq->owner = IO_WQ_OWNER_TRANSMIT;
+			raw_spin_unlock(&wq->lock);
+		}
+
 		raw_spin_lock(&wqe->lock);
 		rcu_read_lock();
 		activated = io_wqe_activate_free_worker(wqe, acct);
-- 
2.25.1


* [PATCH 19/19] io_uring: wire up uringlet
  2022-08-19 15:27 [RFC 00/19] uringlet Hao Xu
                   ` (17 preceding siblings ...)
  2022-08-19 15:27 ` [PATCH 18/19] io-wq: only do io_uringlet_end() at the first schedule time Hao Xu
@ 2022-08-19 15:27 ` Hao Xu
  2022-08-25 13:03 ` [RFC 00/19] uringlet Hao Xu
  19 siblings, 0 replies; 21+ messages in thread
From: Hao Xu @ 2022-08-19 15:27 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, Pavel Begunkov, Ingo Molnar, Wanpeng Li

From: Hao Xu <[email protected]>

Enable the uringlet mode by removing the setup-time rejection of
IORING_SETUP_URINGLET.
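
A hedged userspace sketch of enabling it (IORING_SETUP_URINGLET comes
from patch 02/19; ring mmap, sqe setup and most error handling are
omitted, so this only demonstrates the setup flag):

	#include <linux/io_uring.h>
	#include <stdio.h>
	#include <string.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	int main(void)
	{
		struct io_uring_params p;
		int ring_fd;

		memset(&p, 0, sizeof(p));
		p.flags = IORING_SETUP_URINGLET;	/* from patch 02/19 */

		ring_fd = syscall(__NR_io_uring_setup, 8, &p);
		if (ring_fd < 0) {
			/* kernels without this series return -EINVAL */
			perror("io_uring_setup");
			return 1;
		}

		/*
		 * Per the cover letter, the let workers poll the sqring,
		 * so after mmap'ing the rings the task fills in sqes and
		 * needs no further io_uring_enter(2) calls to submit.
		 */
		close(ring_fd);
		return 0;
	}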

Signed-off-by: Hao Xu <[email protected]>
---
 io_uring/io_uring.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 7ebc83b3a33f..72474b512063 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -3313,8 +3313,6 @@ static __cold int io_uring_create(unsigned entries, struct io_uring_params *p,
 	struct file *file;
 	int ret;
 
-	if (p->flags & IORING_SETUP_URINGLET)
-		return -EINVAL;
 	if (!entries)
 		return -EINVAL;
 	if (entries > IORING_MAX_ENTRIES) {
-- 
2.25.1


* Re: [RFC 00/19] uringlet
  2022-08-19 15:27 [RFC 00/19] uringlet Hao Xu
                   ` (18 preceding siblings ...)
  2022-08-19 15:27 ` [PATCH 19/19] io_uring: wire up uringlet Hao Xu
@ 2022-08-25 13:03 ` Hao Xu
  19 siblings, 0 replies; 21+ messages in thread
From: Hao Xu @ 2022-08-25 13:03 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, Pavel Begunkov, Ingo Molnar, Wanpeng Li

On 8/19/22 23:27, Hao Xu wrote:
> [...]
>
>     uringlet:
>     iodepth
>       1       write: IOPS=120k, BW=117MiB/s (123MB/s)(20.6GiB/180006msec);
>       2       write: IOPS=273k, BW=266MiB/s (279MB/s)(46.8GiB/180010msec);
>       4       write: IOPS=336k, BW=328MiB/s (344MB/s)(57.7GiB/180002msec);
>       8       write: IOPS=373k, BW=365MiB/s (382MB/s)(64.1GiB/180000msec);
>       16      write: IOPS=442k, BW=432MiB/s (453MB/s)(75.9GiB/180001msec);
>       32      write: IOPS=444k, BW=434MiB/s (455MB/s)(76.2GiB/180010msec);
> 
>       1       lat (nsec): min=684, max=10790k, avg=6781.23, stdev=10000.69
>       2       lat (nsec): min=650, max=91712k, avg=5690.52, stdev=136818.11
>       4       lat (nsec): min=785, max=79038k, avg=10297.04, stdev=227375.52
>       8       lat (nsec): min=862, max=97493k, avg=19804.67, stdev=350809.60
>       16      lat (nsec): min=823, max=81279k, avg=34681.33, stdev=478427.17
>       32      lat (usec): min=6, max=105935, avg=70.55, stdev=696.08
> 
> uringlet behaves worse on IOPS and latency at small iodepth. I think
> the reason is that there are more sleeps and wakeups. (I'm not sure
> about this; I'll look into it later.)
> 
> The downsides of uringlet:
>   - it costs more CPU resources; the reason is similar to the sqpoll case: a
>     uringlet worker keeps checking the sqring to reduce latency.[2]
>   - task->plug is disabled for now since uringlet is buggy with it.
> 
> [2] For now, I allow a uringlet worker to spin for a while on an empty
> sqring.
> 
> Any comments are welcome. This early RFC only supports buffered writes
> for now; if the idea behind it proves to be the right way, I'll turn it
> into a formal patchset, resolve the detailed technical issues, and try
> to support more io_uring features.
> 
> Regards,
> Hao
> 

Friendly ping...
Jens, any thoughts on this one?

end of thread

Thread overview: 21+ messages
2022-08-19 15:27 [RFC 00/19] uringlet Hao Xu
2022-08-19 15:27 ` [PATCH 01/19] io_uring: change return value of create_io_worker() and io_wqe_create_worker() Hao Xu
2022-08-19 15:27 ` [PATCH 02/19] io_uring: add IORING_SETUP_URINGLET Hao Xu
2022-08-19 15:27 ` [PATCH 03/19] io_uring: make worker pool per ctx for uringlet mode Hao Xu
2022-08-19 15:27 ` [PATCH 04/19] io-wq: split io_wqe_worker() to io_wqe_worker_normal() and io_wqe_worker_let() Hao Xu
2022-08-19 15:27 ` [PATCH 05/19] io_uring: add io_uringler_offload() for uringlet mode Hao Xu
2022-08-19 15:27 ` [PATCH 06/19] io-wq: change the io-worker scheduling logic Hao Xu
2022-08-19 15:27 ` [PATCH 07/19] io-wq: move worker state flags to io-wq.h Hao Xu
2022-08-19 15:27 ` [PATCH 08/19] io-wq: add IO_WORKER_F_SUBMIT and its friends Hao Xu
2022-08-19 15:27 ` [PATCH 09/19] io-wq: add IO_WORKER_F_SCHEDULED " Hao Xu
2022-08-19 15:27 ` [PATCH 10/19] io_uring: add io_submit_sqes_let() Hao Xu
2022-08-19 15:27 ` [PATCH 11/19] io_uring: don't allocate io-wq for a worker in uringlet mode Hao Xu
2022-08-19 15:27 ` [PATCH 12/19] io_uring: add uringlet worker cancellation function Hao Xu
2022-08-19 15:27 ` [PATCH 13/19] io-wq: add wq->owner for uringlet mode Hao Xu
2022-08-19 15:27 ` [PATCH 14/19] io_uring: modify issue_flags " Hao Xu
2022-08-19 15:27 ` [PATCH 15/19] io_uring: don't use inline completion cache if scheduled Hao Xu
2022-08-19 15:27 ` [PATCH 16/19] io_uring: release ctx->let when a ring exits Hao Xu
2022-08-19 15:27 ` [PATCH 17/19] io_uring: disable task plug for now Hao Xu
2022-08-19 15:27 ` [PATCH 18/19] io-wq: only do io_uringlet_end() at the first schedule time Hao Xu
2022-08-19 15:27 ` [PATCH 19/19] io_uring: wire up uringlet Hao Xu
2022-08-25 13:03 ` [RFC 00/19] uringlet Hao Xu
