public inbox for [email protected]
* [PATCHSET v4 0/5] Improve MSG_RING DEFER_TASKRUN performance
@ 2024-06-18 18:48 Jens Axboe
  2024-06-18 18:48 ` [PATCH 1/5] io_uring/msg_ring: tighten requirement for remote posting Jens Axboe
                   ` (4 more replies)
  0 siblings, 5 replies; 7+ messages in thread
From: Jens Axboe @ 2024-06-18 18:48 UTC (permalink / raw)
  To: io-uring; +Cc: asml.silence

Hi,

For v1, the replies to it, and tons of perf measurements, go here:

https://lore.kernel.org/io-uring/[email protected]/

and find v2 here:

https://lore.kernel.org/io-uring/[email protected]/

and v3 here:

https://lore.kernel.org/io-uring/[email protected]/

and you can find the git tree here:

https://git.kernel.dk/cgit/linux/log/?h=io_uring-msg-ring.1

and the silly test app being used here:

https://kernel.dk/msg-lat.c

Patches are based on top of the pending 6.11 io_uring changes.

tldr is that this series greatly improves the latency, overhead, and
throughput of sending messages to other rings. It does so by using the
existing io_uring task_work for passing messages, rather than the big
hammer of TWA_SIGNAL based generic kernel task_work. Note that this
differs from v3 of this posting, which used the CQE overflow approach.
While the CQE overflow approach still performs a bit better, this one
is a bit cleaner.
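
To make the measured path concrete, below is a minimal liburing sketch
of the sender side. It is not msg-lat.c itself, just an assumed shape of
the same MSG_RING path (the send_msg() name is made up); init_flags=3000
in the output below presumably corresponds to IORING_SETUP_SINGLE_ISSUER
| IORING_SETUP_DEFER_TASKRUN on both rings.

#include <errno.h>
#include <liburing.h>

/*
 * Post a CQE with user_data == 'data' into the ring identified by
 * 'target_fd'. The receiver just waits on its own CQ and reads the
 * message out of cqe->user_data.
 */
static int send_msg(struct io_uring *src, int target_fd, __u64 data)
{
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	int ret;

	sqe = io_uring_get_sqe(src);
	if (!sqe)
		return -EBUSY;

	/* len=0, no MSG_RING flags: a plain data message */
	io_uring_prep_msg_ring(sqe, target_fd, 0, data, 0);

	ret = io_uring_submit(src);
	if (ret < 0)
		return ret;

	/* reap the sender-side completion of the MSG_RING request */
	ret = io_uring_wait_cqe(src, &cqe);
	if (ret < 0)
		return ret;
	ret = cqe->res;
	io_uring_cqe_seen(src, cqe);
	return ret;
}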

Performance for local (same node CPUs) message passing before this
change:

init_flags=3000, delay=10 usec
latencies for: receiver (msg=82631)
    percentiles (nsec):
     |  1.0000th=[ 3088],  5.0000th=[ 3088], 10.0000th=[ 3120],
     | 20.0000th=[ 3248], 30.0000th=[ 3280], 40.0000th=[ 3312],
     | 50.0000th=[ 3408], 60.0000th=[ 3440], 70.0000th=[ 3472],
     | 80.0000th=[ 3504], 90.0000th=[ 3600], 95.0000th=[ 3696],
     | 99.0000th=[ 6368], 99.5000th=[ 6496], 99.9000th=[ 6880],
     | 99.9500th=[ 7008], 99.9900th=[12352]
latencies for: sender (msg=82631)
    percentiles (nsec):
     |  1.0000th=[ 5280],  5.0000th=[ 5280], 10.0000th=[ 5344],
     | 20.0000th=[ 5408], 30.0000th=[ 5472], 40.0000th=[ 5472],
     | 50.0000th=[ 5600], 60.0000th=[ 5600], 70.0000th=[ 5664],
     | 80.0000th=[ 5664], 90.0000th=[ 5792], 95.0000th=[ 5920],
     | 99.0000th=[ 8512], 99.5000th=[ 8640], 99.9000th=[ 8896],
     | 99.9500th=[ 9280], 99.9900th=[19840]

and after:

init_flags=3000, delay=10 usec
Latencies for: Sender (msg=236763)
    percentiles (nsec):
     |  1.0000th=[  225],  5.0000th=[  245], 10.0000th=[  278],
     | 20.0000th=[  294], 30.0000th=[  330], 40.0000th=[  378],
     | 50.0000th=[  418], 60.0000th=[  466], 70.0000th=[  524],
     | 80.0000th=[  604], 90.0000th=[  708], 95.0000th=[  804],
     | 99.0000th=[ 1864], 99.5000th=[ 2480], 99.9000th=[ 2768],
     | 99.9500th=[ 2864], 99.9900th=[ 3056]
Latencies for: Receiver (msg=236763)
    percentiles (nsec):
     |  1.0000th=[  764],  5.0000th=[  940], 10.0000th=[ 1096],
     | 20.0000th=[ 1416], 30.0000th=[ 1736], 40.0000th=[ 2040],
     | 50.0000th=[ 2352], 60.0000th=[ 2704], 70.0000th=[ 3152],
     | 80.0000th=[ 3856], 90.0000th=[ 4960], 95.0000th=[ 6176],
     | 99.0000th=[ 8032], 99.5000th=[ 8256], 99.9000th=[ 8768],
     | 99.9500th=[10304], 99.9900th=[91648]

and for remote (different nodes) CPUs, before:

init_flags=3000, delay=10 usec
Latencies for: Receiver (msg=44002)
    percentiles (nsec):
     |  1.0000th=[ 7264],  5.0000th=[ 8384], 10.0000th=[ 8512],
     | 20.0000th=[ 8640], 30.0000th=[ 8896], 40.0000th=[ 9024],
     | 50.0000th=[ 9152], 60.0000th=[ 9280], 70.0000th=[ 9408],
     | 80.0000th=[ 9536], 90.0000th=[ 9792], 95.0000th=[ 9920],
     | 99.0000th=[10304], 99.5000th=[13376], 99.9000th=[19840],
     | 99.9500th=[20608], 99.9900th=[25728]
Latencies for: Sender (msg=44002)
    percentiles (nsec):
     |  1.0000th=[11712],  5.0000th=[12864], 10.0000th=[12864],
     | 20.0000th=[13120], 30.0000th=[13248], 40.0000th=[13376],
     | 50.0000th=[13504], 60.0000th=[13760], 70.0000th=[13888],
     | 80.0000th=[14144], 90.0000th=[14272], 95.0000th=[14400],
     | 99.0000th=[15936], 99.5000th=[21632], 99.9000th=[24704],
     | 99.9500th=[25984], 99.9900th=[37632]

and after the changes:

init_flags=3000, delay=10 usec
Latencies for: Sender (msg=192598)
    percentiles (nsec):
     |  1.0000th=[  402],  5.0000th=[  430], 10.0000th=[  446],
     | 20.0000th=[  482], 30.0000th=[  700], 40.0000th=[  804],
     | 50.0000th=[  932], 60.0000th=[ 1176], 70.0000th=[ 1304],
     | 80.0000th=[ 1480], 90.0000th=[ 1752], 95.0000th=[ 2128],
     | 99.0000th=[ 2736], 99.5000th=[ 2928], 99.9000th=[ 4256],
     | 99.9500th=[ 8768], 99.9900th=[12864]
Latencies for: Receiver (msg=192598)
    percentiles (nsec):
     |  1.0000th=[ 2024],  5.0000th=[ 2544], 10.0000th=[ 2928],
     | 20.0000th=[ 3600], 30.0000th=[ 4048], 40.0000th=[ 4448],
     | 50.0000th=[ 4896], 60.0000th=[ 5408], 70.0000th=[ 5920],
     | 80.0000th=[ 6752], 90.0000th=[ 7904], 95.0000th=[ 9408],
     | 99.0000th=[10816], 99.5000th=[11712], 99.9000th=[16320],
     | 99.9500th=[18304], 99.9900th=[22656]

 include/linux/io_uring_types.h |   3 +
 io_uring/io_uring.c            |  53 ++++++++++++---
 io_uring/io_uring.h            |   3 +
 io_uring/msg_ring.c            | 119 ++++++++++++++++++++-------------
 io_uring/msg_ring.h            |   1 +
 5 files changed, 124 insertions(+), 55 deletions(-)

Since v3:
- Switch back to task_work approach, rather than utilize overflows
  for this
- Retain old task_work approach for fd passing
- Various tweaks and cleanups

-- 
Jens Axboe


* [PATCH 1/5] io_uring/msg_ring: tighten requirement for remote posting
  2024-06-18 18:48 [PATCHSET v4 0/5] Improve MSG_RING DEFER_TASKRUN performance Jens Axboe
@ 2024-06-18 18:48 ` Jens Axboe
  2024-06-18 18:48 ` [PATCH 2/5] io_uring: add remote task_work execution helper Jens Axboe
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 7+ messages in thread
From: Jens Axboe @ 2024-06-18 18:48 UTC (permalink / raw)
  To: io-uring; +Cc: asml.silence, Jens Axboe

Currently this is gated on whether or not the target ring needs a local
completion - and if so, whether or not we're running on the right task.
The use case for same-thread cross posting is probably a lot less
relevant than remote posting, and since we're going to improve this
situation anyway, just gate it on local posting and ignore which task
we're currently running on.

Signed-off-by: Jens Axboe <[email protected]>
---
 io_uring/msg_ring.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/io_uring/msg_ring.c b/io_uring/msg_ring.c
index 81c4a9d43729..9fdb0cc19bfd 100644
--- a/io_uring/msg_ring.c
+++ b/io_uring/msg_ring.c
@@ -68,9 +68,7 @@ void io_msg_ring_cleanup(struct io_kiocb *req)
 
 static inline bool io_msg_need_remote(struct io_ring_ctx *target_ctx)
 {
-	if (!target_ctx->task_complete)
-		return false;
-	return current != target_ctx->submitter_task;
+	return target_ctx->task_complete;
 }
 
 static int io_msg_exec_remote(struct io_kiocb *req, task_work_func_t func)
-- 
2.43.0


* [PATCH 2/5] io_uring: add remote task_work execution helper
  2024-06-18 18:48 [PATCHSET v4 0/5] Improve MSG_RING DEFER_TASKRUN performance Jens Axboe
  2024-06-18 18:48 ` [PATCH 1/5] io_uring/msg_ring: tighten requirement for remote posting Jens Axboe
@ 2024-06-18 18:48 ` Jens Axboe
  2024-06-18 18:48 ` [PATCH 3/5] io_uring: add io_add_aux_cqe() helper Jens Axboe
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 7+ messages in thread
From: Jens Axboe @ 2024-06-18 18:48 UTC (permalink / raw)
  To: io-uring; +Cc: asml.silence, Jens Axboe

All our task_work handling is targeted at the state in the io_kiocb
itself, since that is what it is being used for. MSG_RING, however,
rolls its own task_work handling, ignoring how that is usually done.

In preparation for switching MSG_RING to be able to use the normal
task_work handling, add io_req_task_work_add_remote() which allows the
caller to pass in the target io_ring_ctx.
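
For illustration, the intended calling pattern - mirroring how the
msg_ring conversion later in this series uses it - looks roughly like
the below, where my_tw_handler is a hypothetical task_work callback:

	/* hand 'req' over for execution in the target ring's task context */
	percpu_ref_get(&target_ctx->refs);
	req->ctx = target_ctx;
	req->task = READ_ONCE(target_ctx->submitter_task);
	req->io_task_work.func = my_tw_handler;
	io_req_task_work_add_remote(req, target_ctx, IOU_F_TWQ_LAZY_WAKE);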

Signed-off-by: Jens Axboe <[email protected]>
---
 io_uring/io_uring.c | 24 ++++++++++++++++--------
 io_uring/io_uring.h |  2 ++
 2 files changed, 18 insertions(+), 8 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 438c44ca3abd..85b2ce54328c 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1098,9 +1098,10 @@ void tctx_task_work(struct callback_head *cb)
 	WARN_ON_ONCE(ret);
 }
 
-static inline void io_req_local_work_add(struct io_kiocb *req, unsigned flags)
+static inline void io_req_local_work_add(struct io_kiocb *req,
+					 struct io_ring_ctx *ctx,
+					 unsigned flags)
 {
-	struct io_ring_ctx *ctx = req->ctx;
 	unsigned nr_wait, nr_tw, nr_tw_prev;
 	struct llist_node *head;
 
@@ -1114,6 +1115,8 @@ static inline void io_req_local_work_add(struct io_kiocb *req, unsigned flags)
 	if (req->flags & (REQ_F_LINK | REQ_F_HARDLINK))
 		flags &= ~IOU_F_TWQ_LAZY_WAKE;
 
+	guard(rcu)();
+
 	head = READ_ONCE(ctx->work_llist.first);
 	do {
 		nr_tw_prev = 0;
@@ -1195,13 +1198,18 @@ static void io_req_normal_work_add(struct io_kiocb *req)
 
 void __io_req_task_work_add(struct io_kiocb *req, unsigned flags)
 {
-	if (req->ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
-		rcu_read_lock();
-		io_req_local_work_add(req, flags);
-		rcu_read_unlock();
-	} else {
+	if (req->ctx->flags & IORING_SETUP_DEFER_TASKRUN)
+		io_req_local_work_add(req, req->ctx, flags);
+	else
 		io_req_normal_work_add(req);
-	}
+}
+
+void io_req_task_work_add_remote(struct io_kiocb *req, struct io_ring_ctx *ctx,
+				 unsigned flags)
+{
+	if (WARN_ON_ONCE(!(ctx->flags & IORING_SETUP_DEFER_TASKRUN)))
+		return;
+	io_req_local_work_add(req, ctx, flags);
 }
 
 static void __cold io_move_task_work_from_local(struct io_ring_ctx *ctx)
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index cd43924eed04..7a8641214509 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -73,6 +73,8 @@ struct file *io_file_get_fixed(struct io_kiocb *req, int fd,
 			       unsigned issue_flags);
 
 void __io_req_task_work_add(struct io_kiocb *req, unsigned flags);
+void io_req_task_work_add_remote(struct io_kiocb *req, struct io_ring_ctx *ctx,
+				 unsigned flags);
 bool io_alloc_async_data(struct io_kiocb *req);
 void io_req_task_queue(struct io_kiocb *req);
 void io_req_task_complete(struct io_kiocb *req, struct io_tw_state *ts);
-- 
2.43.0


* [PATCH 3/5] io_uring: add io_add_aux_cqe() helper
  2024-06-18 18:48 [PATCHSET v4 0/5] Improve MSG_RING DEFER_TASKRUN performance Jens Axboe
  2024-06-18 18:48 ` [PATCH 1/5] io_uring/msg_ring: tighten requirement for remote posting Jens Axboe
  2024-06-18 18:48 ` [PATCH 2/5] io_uring: add remote task_work execution helper Jens Axboe
@ 2024-06-18 18:48 ` Jens Axboe
  2024-06-18 18:48 ` [PATCH 4/5] io_uring/msg_ring: improve handling of target CQE posting Jens Axboe
  2024-06-18 18:48 ` [PATCH 5/5] io_uring/msg_ring: add an alloc cache for io_kiocb entries Jens Axboe
  4 siblings, 0 replies; 7+ messages in thread
From: Jens Axboe @ 2024-06-18 18:48 UTC (permalink / raw)
  To: io-uring; +Cc: asml.silence, Jens Axboe

This helper will post a CQE, and can be called from task_work where we
know that the ctx is already properly locked and that deferred
completions will get flushed later on.
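
As a usage sketch, a task_work handler posting an auxiliary CQE for the
ring it runs on would look roughly like the below (this mirrors the
msg_ring handler added later in this series; the name is hypothetical):

	static void my_tw_complete(struct io_kiocb *req, struct io_tw_state *ts)
	{
		struct io_ring_ctx *ctx = req->ctx;

		/* uring_lock is held by the tw run, the CQ flush happens afterwards */
		io_add_aux_cqe(ctx, req->cqe.user_data, req->cqe.res, req->cqe.flags);
		/* free/recycle req and drop the ctx reference here */
		percpu_ref_put(&ctx->refs);
	}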

Signed-off-by: Jens Axboe <[email protected]>
---
 io_uring/io_uring.c | 23 +++++++++++++++++++++--
 io_uring/io_uring.h |  1 +
 2 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 85b2ce54328c..cdeb94d2a26b 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -801,19 +801,38 @@ static bool io_fill_cqe_aux(struct io_ring_ctx *ctx, u64 user_data, s32 res,
 	return false;
 }
 
-bool io_post_aux_cqe(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags)
+static bool __io_post_aux_cqe(struct io_ring_ctx *ctx, u64 user_data, s32 res,
+			      u32 cflags)
 {
 	bool filled;
 
-	io_cq_lock(ctx);
 	filled = io_fill_cqe_aux(ctx, user_data, res, cflags);
 	if (!filled)
 		filled = io_cqring_event_overflow(ctx, user_data, res, cflags, 0, 0);
 
+	return filled;
+}
+
+bool io_post_aux_cqe(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags)
+{
+	bool filled;
+
+	io_cq_lock(ctx);
+	filled = __io_post_aux_cqe(ctx, user_data, res, cflags);
 	io_cq_unlock_post(ctx);
 	return filled;
 }
 
+/*
+ * Must be called from inline task_work so we know a flush will happen later,
+ * and obviously with ctx->uring_lock held (tw always has that).
+ */
+void io_add_aux_cqe(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags)
+{
+	__io_post_aux_cqe(ctx, user_data, res, cflags);
+	ctx->submit_state.cq_flush = true;
+}
+
 /*
  * A helper for multishot requests posting additional CQEs.
  * Should only be used from a task_work including IO_URING_F_MULTISHOT.
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 7a8641214509..e1ce908f0679 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -65,6 +65,7 @@ bool io_cqe_cache_refill(struct io_ring_ctx *ctx, bool overflow);
 int io_run_task_work_sig(struct io_ring_ctx *ctx);
 void io_req_defer_failed(struct io_kiocb *req, s32 res);
 bool io_post_aux_cqe(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags);
+void io_add_aux_cqe(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags);
 bool io_req_post_cqe(struct io_kiocb *req, s32 res, u32 cflags);
 void __io_commit_cqring_flush(struct io_ring_ctx *ctx);
 
-- 
2.43.0


* [PATCH 4/5] io_uring/msg_ring: improve handling of target CQE posting
  2024-06-18 18:48 [PATCHSET v4 0/5] Improve MSG_RING DEFER_TASKRUN performance Jens Axboe
                   ` (2 preceding siblings ...)
  2024-06-18 18:48 ` [PATCH 3/5] io_uring: add io_add_aux_cqe() helper Jens Axboe
@ 2024-06-18 18:48 ` Jens Axboe
  2024-07-01 13:06   ` Pavel Begunkov
  2024-06-18 18:48 ` [PATCH 5/5] io_uring/msg_ring: add an alloc cache for io_kiocb entries Jens Axboe
  4 siblings, 1 reply; 7+ messages in thread
From: Jens Axboe @ 2024-06-18 18:48 UTC (permalink / raw)
  To: io-uring; +Cc: asml.silence, Jens Axboe

Use the exported helper for queueing task_work for message passing,
rather than rolling our own.

Note that this is only done for strict data messages for now; file
descriptor passing messages still rely on the kernel task_work. It could
get converted at some point if it's performance critical.

This improves peak performance of message passing by about 5x in some
basic testing, with 2 threads just sending messages to each other.
Before this change, it was capped at around 700K/sec, with the change
it's at over 4M/sec.

Signed-off-by: Jens Axboe <[email protected]>
---
 io_uring/msg_ring.c | 90 +++++++++++++++++++++++----------------------
 1 file changed, 47 insertions(+), 43 deletions(-)

diff --git a/io_uring/msg_ring.c b/io_uring/msg_ring.c
index 9fdb0cc19bfd..ad7d67d44461 100644
--- a/io_uring/msg_ring.c
+++ b/io_uring/msg_ring.c
@@ -13,7 +13,6 @@
 #include "filetable.h"
 #include "msg_ring.h"
 
-
 /* All valid masks for MSG_RING */
 #define IORING_MSG_RING_MASK		(IORING_MSG_RING_CQE_SKIP | \
 					IORING_MSG_RING_FLAGS_PASS)
@@ -71,54 +70,43 @@ static inline bool io_msg_need_remote(struct io_ring_ctx *target_ctx)
 	return target_ctx->task_complete;
 }
 
-static int io_msg_exec_remote(struct io_kiocb *req, task_work_func_t func)
+static void io_msg_tw_complete(struct io_kiocb *req, struct io_tw_state *ts)
 {
-	struct io_ring_ctx *ctx = req->file->private_data;
-	struct io_msg *msg = io_kiocb_to_cmd(req, struct io_msg);
-	struct task_struct *task = READ_ONCE(ctx->submitter_task);
-
-	if (unlikely(!task))
-		return -EOWNERDEAD;
+	struct io_ring_ctx *ctx = req->ctx;
 
-	init_task_work(&msg->tw, func);
-	if (task_work_add(task, &msg->tw, TWA_SIGNAL))
-		return -EOWNERDEAD;
+	io_add_aux_cqe(ctx, req->cqe.user_data, req->cqe.res, req->cqe.flags);
+	kmem_cache_free(req_cachep, req);
+	percpu_ref_put(&ctx->refs);
+}
 
-	return IOU_ISSUE_SKIP_COMPLETE;
+static void io_msg_remote_post(struct io_ring_ctx *ctx, struct io_kiocb *req,
+			       int res, u32 cflags, u64 user_data)
+{
+	req->cqe.user_data = user_data;
+	io_req_set_res(req, res, cflags);
+	percpu_ref_get(&ctx->refs);
+	req->ctx = ctx;
+	req->task = READ_ONCE(ctx->submitter_task);
+	req->io_task_work.func = io_msg_tw_complete;
+	io_req_task_work_add_remote(req, ctx, IOU_F_TWQ_LAZY_WAKE);
 }
 
-static void io_msg_tw_complete(struct callback_head *head)
+static int io_msg_data_remote(struct io_kiocb *req)
 {
-	struct io_msg *msg = container_of(head, struct io_msg, tw);
-	struct io_kiocb *req = cmd_to_io_kiocb(msg);
 	struct io_ring_ctx *target_ctx = req->file->private_data;
-	int ret = 0;
-
-	if (current->flags & PF_EXITING) {
-		ret = -EOWNERDEAD;
-	} else {
-		u32 flags = 0;
-
-		if (msg->flags & IORING_MSG_RING_FLAGS_PASS)
-			flags = msg->cqe_flags;
-
-		/*
-		 * If the target ring is using IOPOLL mode, then we need to be
-		 * holding the uring_lock for posting completions. Other ring
-		 * types rely on the regular completion locking, which is
-		 * handled while posting.
-		 */
-		if (target_ctx->flags & IORING_SETUP_IOPOLL)
-			mutex_lock(&target_ctx->uring_lock);
-		if (!io_post_aux_cqe(target_ctx, msg->user_data, msg->len, flags))
-			ret = -EOVERFLOW;
-		if (target_ctx->flags & IORING_SETUP_IOPOLL)
-			mutex_unlock(&target_ctx->uring_lock);
-	}
+	struct io_msg *msg = io_kiocb_to_cmd(req, struct io_msg);
+	struct io_kiocb *target;
+	u32 flags = 0;
 
-	if (ret < 0)
-		req_set_fail(req);
-	io_req_queue_tw_complete(req, ret);
+	target = kmem_cache_alloc(req_cachep, GFP_KERNEL);
+	if (unlikely(!target))
+		return -ENOMEM;
+
+	if (msg->flags & IORING_MSG_RING_FLAGS_PASS)
+		flags = msg->cqe_flags;
+
+	io_msg_remote_post(target_ctx, target, msg->len, flags, msg->user_data);
+	return 0;
 }
 
 static int io_msg_ring_data(struct io_kiocb *req, unsigned int issue_flags)
@@ -136,7 +124,7 @@ static int io_msg_ring_data(struct io_kiocb *req, unsigned int issue_flags)
 		return -EBADFD;
 
 	if (io_msg_need_remote(target_ctx))
-		return io_msg_exec_remote(req, io_msg_tw_complete);
+		return io_msg_data_remote(req);
 
 	if (msg->flags & IORING_MSG_RING_FLAGS_PASS)
 		flags = msg->cqe_flags;
@@ -216,6 +204,22 @@ static void io_msg_tw_fd_complete(struct callback_head *head)
 	io_req_queue_tw_complete(req, ret);
 }
 
+static int io_msg_fd_remote(struct io_kiocb *req)
+{
+	struct io_ring_ctx *ctx = req->file->private_data;
+	struct io_msg *msg = io_kiocb_to_cmd(req, struct io_msg);
+	struct task_struct *task = READ_ONCE(ctx->submitter_task);
+
+	if (unlikely(!task))
+		return -EOWNERDEAD;
+
+	init_task_work(&msg->tw, io_msg_tw_fd_complete);
+	if (task_work_add(task, &msg->tw, TWA_SIGNAL))
+		return -EOWNERDEAD;
+
+	return IOU_ISSUE_SKIP_COMPLETE;
+}
+
 static int io_msg_send_fd(struct io_kiocb *req, unsigned int issue_flags)
 {
 	struct io_ring_ctx *target_ctx = req->file->private_data;
@@ -238,7 +242,7 @@ static int io_msg_send_fd(struct io_kiocb *req, unsigned int issue_flags)
 	}
 
 	if (io_msg_need_remote(target_ctx))
-		return io_msg_exec_remote(req, io_msg_tw_fd_complete);
+		return io_msg_fd_remote(req);
 	return io_msg_install_complete(req, issue_flags);
 }
 
-- 
2.43.0


* [PATCH 5/5] io_uring/msg_ring: add an alloc cache for io_kiocb entries
  2024-06-18 18:48 [PATCHSET v4 0/5] Improve MSG_RING DEFER_TASKRUN performance Jens Axboe
                   ` (3 preceding siblings ...)
  2024-06-18 18:48 ` [PATCH 4/5] io_uring/msg_ring: improve handling of target CQE posting Jens Axboe
@ 2024-06-18 18:48 ` Jens Axboe
  4 siblings, 0 replies; 7+ messages in thread
From: Jens Axboe @ 2024-06-18 18:48 UTC (permalink / raw)
  To: io-uring; +Cc: asml.silence, Jens Axboe

With slab accounting, allocating and freeing memory has considerable
overhead. Add a basic alloc cache for the io_kiocb allocations that
msg_ring needs to do. Unlike other caches, this one is used by the
sender, which grabs entries from the remote ring. When the remote ring
gets the posted completion, it frees the entry locally. Hence the cache
is separately locked, using ctx->msg_lock.
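
The same trylock pattern is used on both the alloc and the free side, so
neither sender nor receiver ever waits on msg_lock - on contention (or
an empty/full cache) they just fall back to the slab allocator. In rough
form, consolidating the hunks below:

	/* alloc side: prefer the cache, fall back to slab */
	struct io_kiocb *req = NULL;

	if (spin_trylock(&ctx->msg_lock)) {
		req = io_alloc_cache_get(&ctx->msg_cache);
		spin_unlock(&ctx->msg_lock);
	}
	if (!req)
		req = kmem_cache_alloc(req_cachep, GFP_KERNEL | __GFP_NOWARN);

	/* free side, in the target's task_work after posting the CQE */
	if (spin_trylock(&ctx->msg_lock)) {
		if (io_alloc_cache_put(&ctx->msg_cache, req))
			req = NULL;	/* now owned by the cache */
		spin_unlock(&ctx->msg_lock);
	}
	if (req)
		kfree(req);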

Signed-off-by: Jens Axboe <[email protected]>
---
 include/linux/io_uring_types.h |  3 +++
 io_uring/io_uring.c            |  6 ++++++
 io_uring/msg_ring.c            | 31 +++++++++++++++++++++++++++++--
 io_uring/msg_ring.h            |  1 +
 4 files changed, 39 insertions(+), 2 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 1052a68fd68d..ede42dce1506 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -397,6 +397,9 @@ struct io_ring_ctx {
 	struct callback_head		poll_wq_task_work;
 	struct list_head		defer_list;
 
+	struct io_alloc_cache		msg_cache;
+	spinlock_t			msg_lock;
+
 #ifdef CONFIG_NET_RX_BUSY_POLL
 	struct list_head	napi_list;	/* track busy poll napi_id */
 	spinlock_t		napi_lock;	/* napi_list lock */
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index cdeb94d2a26b..7ed1e009aaec 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -95,6 +95,7 @@
 #include "futex.h"
 #include "napi.h"
 #include "uring_cmd.h"
+#include "msg_ring.h"
 #include "memmap.h"
 
 #include "timeout.h"
@@ -315,6 +316,9 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 			    sizeof(struct io_async_rw));
 	ret |= io_alloc_cache_init(&ctx->uring_cache, IO_ALLOC_CACHE_MAX,
 			    sizeof(struct uring_cache));
+	spin_lock_init(&ctx->msg_lock);
+	ret |= io_alloc_cache_init(&ctx->msg_cache, IO_ALLOC_CACHE_MAX,
+			    sizeof(struct io_kiocb));
 	ret |= io_futex_cache_init(ctx);
 	if (ret)
 		goto err;
@@ -351,6 +355,7 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 	io_alloc_cache_free(&ctx->netmsg_cache, io_netmsg_cache_free);
 	io_alloc_cache_free(&ctx->rw_cache, io_rw_cache_free);
 	io_alloc_cache_free(&ctx->uring_cache, kfree);
+	io_alloc_cache_free(&ctx->msg_cache, io_msg_cache_free);
 	io_futex_cache_free(ctx);
 	kfree(ctx->cancel_table.hbs);
 	kfree(ctx->cancel_table_locked.hbs);
@@ -2599,6 +2604,7 @@ static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
 	io_alloc_cache_free(&ctx->netmsg_cache, io_netmsg_cache_free);
 	io_alloc_cache_free(&ctx->rw_cache, io_rw_cache_free);
 	io_alloc_cache_free(&ctx->uring_cache, kfree);
+	io_alloc_cache_free(&ctx->msg_cache, io_msg_cache_free);
 	io_futex_cache_free(ctx);
 	io_destroy_buffers(ctx);
 	mutex_unlock(&ctx->uring_lock);
diff --git a/io_uring/msg_ring.c b/io_uring/msg_ring.c
index ad7d67d44461..47a754e83b49 100644
--- a/io_uring/msg_ring.c
+++ b/io_uring/msg_ring.c
@@ -11,6 +11,7 @@
 #include "io_uring.h"
 #include "rsrc.h"
 #include "filetable.h"
+#include "alloc_cache.h"
 #include "msg_ring.h"
 
 /* All valid masks for MSG_RING */
@@ -75,7 +76,13 @@ static void io_msg_tw_complete(struct io_kiocb *req, struct io_tw_state *ts)
 	struct io_ring_ctx *ctx = req->ctx;
 
 	io_add_aux_cqe(ctx, req->cqe.user_data, req->cqe.res, req->cqe.flags);
-	kmem_cache_free(req_cachep, req);
+	if (spin_trylock(&ctx->msg_lock)) {
+		if (io_alloc_cache_put(&ctx->msg_cache, req))
+			req = NULL;
+		spin_unlock(&ctx->msg_lock);
+	}
+	if (req)
+		kfree(req);
 	percpu_ref_put(&ctx->refs);
 }
 
@@ -91,6 +98,19 @@ static void io_msg_remote_post(struct io_ring_ctx *ctx, struct io_kiocb *req,
 	io_req_task_work_add_remote(req, ctx, IOU_F_TWQ_LAZY_WAKE);
 }
 
+static struct io_kiocb *io_msg_get_kiocb(struct io_ring_ctx *ctx)
+{
+	struct io_kiocb *req = NULL;
+
+	if (spin_trylock(&ctx->msg_lock)) {
+		req = io_alloc_cache_get(&ctx->msg_cache);
+		spin_unlock(&ctx->msg_lock);
+	}
+	if (req)
+		return req;
+	return kmem_cache_alloc(req_cachep, GFP_KERNEL | __GFP_NOWARN);
+}
+
 static int io_msg_data_remote(struct io_kiocb *req)
 {
 	struct io_ring_ctx *target_ctx = req->file->private_data;
@@ -98,7 +118,7 @@ static int io_msg_data_remote(struct io_kiocb *req)
 	struct io_kiocb *target;
 	u32 flags = 0;
 
-	target = kmem_cache_alloc(req_cachep, GFP_KERNEL);
+	target = io_msg_get_kiocb(req->ctx);
 	if (unlikely(!target))
 		return -ENOMEM;
 
@@ -296,3 +316,10 @@ int io_msg_ring(struct io_kiocb *req, unsigned int issue_flags)
 	io_req_set_res(req, ret, 0);
 	return IOU_OK;
 }
+
+void io_msg_cache_free(const void *entry)
+{
+	struct io_kiocb *req = (struct io_kiocb *) entry;
+
+	kmem_cache_free(req_cachep, req);
+}
diff --git a/io_uring/msg_ring.h b/io_uring/msg_ring.h
index 3987ee6c0e5f..3030f3942f0f 100644
--- a/io_uring/msg_ring.h
+++ b/io_uring/msg_ring.h
@@ -3,3 +3,4 @@
 int io_msg_ring_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
 int io_msg_ring(struct io_kiocb *req, unsigned int issue_flags);
 void io_msg_ring_cleanup(struct io_kiocb *req);
+void io_msg_cache_free(const void *entry);
-- 
2.43.0


* Re: [PATCH 4/5] io_uring/msg_ring: improve handling of target CQE posting
  2024-06-18 18:48 ` [PATCH 4/5] io_uring/msg_ring: improve handling of target CQE posting Jens Axboe
@ 2024-07-01 13:06   ` Pavel Begunkov
  0 siblings, 0 replies; 7+ messages in thread
From: Pavel Begunkov @ 2024-07-01 13:06 UTC (permalink / raw)
  To: Jens Axboe, io-uring

On 6/18/24 19:48, Jens Axboe wrote:
> Use the exported helper for queueing task_work for message passing,
> rather than rolling our own.
> 
> Note that this is only done for strict data messages for now; file
> descriptor passing messages still rely on the kernel task_work. It could
> get converted at some point if it's performance critical.
> 
> This improves peak performance of message passing by about 5x in some
> basic testing, with 2 threads just sending messages to each other.
> Before this change, it was capped at around 700K/sec, with the change
> it's at over 4M/sec.
> 
> Signed-off-by: Jens Axboe <[email protected]>
> ---
>   io_uring/msg_ring.c | 90 +++++++++++++++++++++++----------------------
>   1 file changed, 47 insertions(+), 43 deletions(-)
> 
> diff --git a/io_uring/msg_ring.c b/io_uring/msg_ring.c
> index 9fdb0cc19bfd..ad7d67d44461 100644
> --- a/io_uring/msg_ring.c
> +++ b/io_uring/msg_ring.c
> @@ -13,7 +13,6 @@
>   #include "filetable.h"
>   #include "msg_ring.h"
>   
> -
>   /* All valid masks for MSG_RING */
>   #define IORING_MSG_RING_MASK		(IORING_MSG_RING_CQE_SKIP | \
>   					IORING_MSG_RING_FLAGS_PASS)
> @@ -71,54 +70,43 @@ static inline bool io_msg_need_remote(struct io_ring_ctx *target_ctx)
>   	return target_ctx->task_complete;
>   }
>   
> -static int io_msg_exec_remote(struct io_kiocb *req, task_work_func_t func)
> +static void io_msg_tw_complete(struct io_kiocb *req, struct io_tw_state *ts)
>   {
> -	struct io_ring_ctx *ctx = req->file->private_data;
> -	struct io_msg *msg = io_kiocb_to_cmd(req, struct io_msg);
> -	struct task_struct *task = READ_ONCE(ctx->submitter_task);

Not about this series particularly, but it sounds like msg requests
should be REQ_F_INFLIGHT, though there is a chance lazy file assignment
is enough.

> -
> -	if (unlikely(!task))
> -		return -EOWNERDEAD;
> +	struct io_ring_ctx *ctx = req->ctx;
>   
> -	init_task_work(&msg->tw, func);
> -	if (task_work_add(task, &msg->tw, TWA_SIGNAL))
> -		return -EOWNERDEAD;
> +	io_add_aux_cqe(ctx, req->cqe.user_data, req->cqe.res, req->cqe.flags);
> +	kmem_cache_free(req_cachep, req);
> +	percpu_ref_put(&ctx->refs);
> +}
>   
> -	return IOU_ISSUE_SKIP_COMPLETE;
> +static void io_msg_remote_post(struct io_ring_ctx *ctx, struct io_kiocb *req,
> +			       int res, u32 cflags, u64 user_data)
> +{
> +	req->cqe.user_data = user_data;
> +	io_req_set_res(req, res, cflags);
> +	percpu_ref_get(&ctx->refs);
> +	req->ctx = ctx;
> +	req->task = READ_ONCE(ctx->submitter_task);

Missing a null check; apart from that, the patchset looks right

> +	req->io_task_work.func = io_msg_tw_complete;
> +	io_req_task_work_add_remote(req, ctx, IOU_F_TWQ_LAZY_WAKE);
>   }

-- 
Pavel Begunkov
