* [PATCHSET v2 0/7] Improve MSG_RING DEFER_TASKRUN performance
@ 2024-05-30 15:23 Jens Axboe
  2024-05-30 15:23 ` [PATCH 1/7] io_uring/msg_ring: split fd installing into a helper Jens Axboe
                   ` (7 more replies)
  0 siblings, 8 replies; 15+ messages in thread
From: Jens Axboe @ 2024-05-30 15:23 UTC (permalink / raw)
  To: io-uring

Hi,

For v1, the replies to it, and tons of perf measurements, go here:

https://lore.kernel.org/io-uring/[email protected]/T/#m12f44c0a9ee40a59b0dcc226e22a0d031903aa73

as I won't duplicate them here. Performance has also improved since
v1: the slab accounting is gone, and we now rely solely on
the completion_lock on the issuer side.
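
For reference, here's a minimal sender-side sketch of the kind of MSG_RING
usage this series is about, using liburing. It is not the actual benchmark;
the ring setup, target ring fd, and error handling are illustrative
assumptions only:

/*
 * Illustrative only: the target ring is assumed to have been created with
 * IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN and is waited on
 * by its owning thread; the sender posts a CQE to it from its own ring via
 * IORING_OP_MSG_RING.
 */
#include <liburing.h>

static int send_msg(struct io_uring *src, int target_ring_fd,
		    __u64 user_data, unsigned int len)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(src);
	struct io_uring_cqe *cqe;
	int ret;

	if (!sqe)
		return -EBUSY;

	/* a CQE with 'len' as res and 'user_data' shows up on the target */
	io_uring_prep_msg_ring(sqe, target_ring_fd, len, user_data, 0);

	ret = io_uring_submit(src);
	if (ret < 0)
		return ret;

	/* reap the sender-side completion */
	ret = io_uring_wait_cqe(src, &cqe);
	if (ret < 0)
		return ret;
	ret = cqe->res;
	io_uring_cqe_seen(src, cqe);
	return ret;
}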

Changes since v1:
- Change commit messages to reflect it's DEFER_TASKRUN, not SINGLE_ISSUER
- Get rid of the need to double lock on the target uring_lock
- Relax the check for needing remote posting, and then finally kill it
- Unify it across ring types
- Kill (now) unused callback_head in io_msg
- Add overflow caching to avoid __GFP_ACCOUNT overhead
- Rebase on current git master with 6.9 and 6.10 fixes pulled in

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 1/7] io_uring/msg_ring: split fd installing into a helper
  2024-05-30 15:23 [PATCHSET v2 0/7] Improve MSG_RING DEFER_TASKRUN performance Jens Axboe
@ 2024-05-30 15:23 ` Jens Axboe
  2024-05-30 15:23 ` [PATCH 2/7] io_uring/msg_ring: tighten requirement for remote posting Jens Axboe
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: Jens Axboe @ 2024-05-30 15:23 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe

No functional changes in this patch, just in preparation for needing to
complete the fd install with the ctx lock already held.

Signed-off-by: Jens Axboe <[email protected]>
---
 io_uring/msg_ring.c | 26 ++++++++++++++++++--------
 1 file changed, 18 insertions(+), 8 deletions(-)

diff --git a/io_uring/msg_ring.c b/io_uring/msg_ring.c
index 81c4a9d43729..feff2b0822cf 100644
--- a/io_uring/msg_ring.c
+++ b/io_uring/msg_ring.c
@@ -173,25 +173,23 @@ static struct file *io_msg_grab_file(struct io_kiocb *req, unsigned int issue_fl
 	return file;
 }
 
-static int io_msg_install_complete(struct io_kiocb *req, unsigned int issue_flags)
+static int __io_msg_install_complete(struct io_kiocb *req)
 {
 	struct io_ring_ctx *target_ctx = req->file->private_data;
 	struct io_msg *msg = io_kiocb_to_cmd(req, struct io_msg);
 	struct file *src_file = msg->src_file;
 	int ret;
 
-	if (unlikely(io_double_lock_ctx(target_ctx, issue_flags)))
-		return -EAGAIN;
-
 	ret = __io_fixed_fd_install(target_ctx, src_file, msg->dst_fd);
 	if (ret < 0)
-		goto out_unlock;
+		return ret;
 
 	msg->src_file = NULL;
 	req->flags &= ~REQ_F_NEED_CLEANUP;
 
 	if (msg->flags & IORING_MSG_RING_CQE_SKIP)
-		goto out_unlock;
+		return ret;
+
 	/*
 	 * If this fails, the target still received the file descriptor but
 	 * wasn't notified of the fact. This means that if this request
@@ -199,8 +197,20 @@ static int io_msg_install_complete(struct io_kiocb *req, unsigned int issue_flag
 	 * later IORING_OP_MSG_RING delivers the message.
 	 */
 	if (!io_post_aux_cqe(target_ctx, msg->user_data, ret, 0))
-		ret = -EOVERFLOW;
-out_unlock:
+		return -EOVERFLOW;
+
+	return ret;
+}
+
+static int io_msg_install_complete(struct io_kiocb *req, unsigned int issue_flags)
+{
+	struct io_ring_ctx *target_ctx = req->file->private_data;
+	int ret;
+
+	if (unlikely(io_double_lock_ctx(target_ctx, issue_flags)))
+		return -EAGAIN;
+
+	ret = __io_msg_install_complete(req);
 	io_double_unlock_ctx(target_ctx);
 	return ret;
 }
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 2/7] io_uring/msg_ring: tighten requirement for remote posting
  2024-05-30 15:23 [PATCHSET v2 0/7] Improve MSG_RING DEFER_TASKRUN performance Jens Axboe
  2024-05-30 15:23 ` [PATCH 1/7] io_uring/msg_ring: split fd installing into a helper Jens Axboe
@ 2024-05-30 15:23 ` Jens Axboe
  2024-05-30 15:23 ` [PATCH 3/7] io_uring/msg_ring: avoid double indirection task_work for data messages Jens Axboe
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: Jens Axboe @ 2024-05-30 15:23 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe

Currently this is gated on whether or not the target ring needs a local
completion - and if so, whether or not we're running on the right task.
The use case for same-thread cross posting is probably a lot less
relevant than remote posting, and since we're going to improve this
situation anyway, just gate it on local posting and ignore which task
we're currently running on.

Signed-off-by: Jens Axboe <[email protected]>
---
 io_uring/msg_ring.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/io_uring/msg_ring.c b/io_uring/msg_ring.c
index feff2b0822cf..15e7bda77d0d 100644
--- a/io_uring/msg_ring.c
+++ b/io_uring/msg_ring.c
@@ -68,9 +68,7 @@ void io_msg_ring_cleanup(struct io_kiocb *req)
 
 static inline bool io_msg_need_remote(struct io_ring_ctx *target_ctx)
 {
-	if (!target_ctx->task_complete)
-		return false;
-	return current != target_ctx->submitter_task;
+	return target_ctx->task_complete;
 }
 
 static int io_msg_exec_remote(struct io_kiocb *req, task_work_func_t func)
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 3/7] io_uring/msg_ring: avoid double indirection task_work for data messages
  2024-05-30 15:23 [PATCHSET v2 0/7] Improve MSG_RING DEFER_TASKRUN performance Jens Axboe
  2024-05-30 15:23 ` [PATCH 1/7] io_uring/msg_ring: split fd installing into a helper Jens Axboe
  2024-05-30 15:23 ` [PATCH 2/7] io_uring/msg_ring: tighten requirement for remote posting Jens Axboe
@ 2024-05-30 15:23 ` Jens Axboe
  2024-05-30 15:23 ` [PATCH 4/7] io_uring/msg_ring: avoid double indirection task_work for fd passing Jens Axboe
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: Jens Axboe @ 2024-05-30 15:23 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe

If IORING_SETUP_DEFER_TASKRUN is set, then we can't post CQEs remotely
to the target ring. Instead, task_work is queued for the target ring,
which is used to post the CQE. To make matters worse, once the target
CQE has been posted, task_work is then queued with the originator to
fill the completion.

This obviously adds a bunch of overhead and latency. Instead of relying
on generic kernel task_work for this, fill an overflow entry on the
target ring and flag it such that the target ring will flush it. This
avoids the task_work for posting the CQE, and it means that the
originator CQE can be filled inline as well.

In local testing, this reduces the latency on the sender side by 5-6x.

Signed-off-by: Jens Axboe <[email protected]>
---
 io_uring/msg_ring.c | 87 ++++++++++++++++++++++++++++-----------------
 1 file changed, 55 insertions(+), 32 deletions(-)

diff --git a/io_uring/msg_ring.c b/io_uring/msg_ring.c
index 15e7bda77d0d..bdb935ef7aa2 100644
--- a/io_uring/msg_ring.c
+++ b/io_uring/msg_ring.c
@@ -87,38 +87,61 @@ static int io_msg_exec_remote(struct io_kiocb *req, task_work_func_t func)
 	return IOU_ISSUE_SKIP_COMPLETE;
 }
 
-static void io_msg_tw_complete(struct callback_head *head)
+static struct io_overflow_cqe *io_alloc_overflow(struct io_ring_ctx *target_ctx)
 {
-	struct io_msg *msg = container_of(head, struct io_msg, tw);
-	struct io_kiocb *req = cmd_to_io_kiocb(msg);
-	struct io_ring_ctx *target_ctx = req->file->private_data;
-	int ret = 0;
-
-	if (current->flags & PF_EXITING) {
-		ret = -EOWNERDEAD;
-	} else {
-		u32 flags = 0;
-
-		if (msg->flags & IORING_MSG_RING_FLAGS_PASS)
-			flags = msg->cqe_flags;
-
-		/*
-		 * If the target ring is using IOPOLL mode, then we need to be
-		 * holding the uring_lock for posting completions. Other ring
-		 * types rely on the regular completion locking, which is
-		 * handled while posting.
-		 */
-		if (target_ctx->flags & IORING_SETUP_IOPOLL)
-			mutex_lock(&target_ctx->uring_lock);
-		if (!io_post_aux_cqe(target_ctx, msg->user_data, msg->len, flags))
-			ret = -EOVERFLOW;
-		if (target_ctx->flags & IORING_SETUP_IOPOLL)
-			mutex_unlock(&target_ctx->uring_lock);
+	bool is_cqe32 = target_ctx->flags & IORING_SETUP_CQE32;
+	size_t cqe_size = sizeof(struct io_overflow_cqe);
+	struct io_overflow_cqe *ocqe;
+
+	if (is_cqe32)
+		cqe_size += sizeof(struct io_uring_cqe);
+
+	ocqe = kmalloc(cqe_size, GFP_ATOMIC | __GFP_ACCOUNT);
+	if (!ocqe)
+		return NULL;
+
+	if (is_cqe32)
+		ocqe->cqe.big_cqe[0] = ocqe->cqe.big_cqe[1] = 0;
+
+	return ocqe;
+}
+
+/*
+ * Entered with the target uring_lock held, and will drop it before
+ * returning. Adds a previously allocated ocqe to the overflow list on
+ * the target, and marks it appropriately for flushing.
+ */
+static void io_msg_add_overflow(struct io_msg *msg,
+				struct io_ring_ctx *target_ctx,
+				struct io_overflow_cqe *ocqe, int ret,
+				u32 flags)
+	__releases(&target_ctx->completion_lock)
+{
+	if (list_empty(&target_ctx->cq_overflow_list)) {
+		set_bit(IO_CHECK_CQ_OVERFLOW_BIT, &target_ctx->check_cq);
+		atomic_or(IORING_SQ_TASKRUN, &target_ctx->rings->sq_flags);
 	}
 
-	if (ret < 0)
-		req_set_fail(req);
-	io_req_queue_tw_complete(req, ret);
+	ocqe->cqe.user_data = msg->user_data;
+	ocqe->cqe.res = ret;
+	ocqe->cqe.flags = flags;
+	list_add_tail(&ocqe->list, &target_ctx->cq_overflow_list);
+	spin_unlock(&target_ctx->completion_lock);
+	wake_up_state(target_ctx->submitter_task, TASK_INTERRUPTIBLE);
+}
+
+static int io_msg_fill_remote(struct io_msg *msg, unsigned int issue_flags,
+			      struct io_ring_ctx *target_ctx, u32 flags)
+{
+	struct io_overflow_cqe *ocqe;
+
+	ocqe = io_alloc_overflow(target_ctx);
+	if (!ocqe)
+		return -ENOMEM;
+
+	spin_lock(&target_ctx->completion_lock);
+	io_msg_add_overflow(msg, target_ctx, ocqe, msg->len, flags);
+	return 0;
 }
 
 static int io_msg_ring_data(struct io_kiocb *req, unsigned int issue_flags)
@@ -135,12 +158,12 @@ static int io_msg_ring_data(struct io_kiocb *req, unsigned int issue_flags)
 	if (target_ctx->flags & IORING_SETUP_R_DISABLED)
 		return -EBADFD;
 
-	if (io_msg_need_remote(target_ctx))
-		return io_msg_exec_remote(req, io_msg_tw_complete);
-
 	if (msg->flags & IORING_MSG_RING_FLAGS_PASS)
 		flags = msg->cqe_flags;
 
+	if (io_msg_need_remote(target_ctx))
+		return io_msg_fill_remote(msg, issue_flags, target_ctx, flags);
+
 	ret = -EOVERFLOW;
 	if (target_ctx->flags & IORING_SETUP_IOPOLL) {
 		if (unlikely(io_double_lock_ctx(target_ctx, issue_flags)))
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 4/7] io_uring/msg_ring: avoid double indirection task_work for fd passing
  2024-05-30 15:23 [PATCHSET v2 0/7] Improve MSG_RING DEFER_TASKRUN performance Jens Axboe
                   ` (2 preceding siblings ...)
  2024-05-30 15:23 ` [PATCH 3/7] io_uring/msg_ring: avoid double indirection task_work for data messages Jens Axboe
@ 2024-05-30 15:23 ` Jens Axboe
  2024-05-30 15:23 ` [PATCH 5/7] io_uring/msg_ring: add an alloc cache for CQE entries Jens Axboe
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: Jens Axboe @ 2024-05-30 15:23 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe

As was done for MSG_RING data passing to avoid a double task_work
roundtrip for IORING_SETUP_DEFER_TASKRUN, implement the same model for
fd passing. File descriptor passing is separately locked anyway, so the
only remaining issue is CQE posting, just like it was for data passing.
And for that, we can use the same approach.

Signed-off-by: Jens Axboe <[email protected]>
---
 io_uring/msg_ring.c | 58 +++++++++++++++++++++++++--------------------
 1 file changed, 32 insertions(+), 26 deletions(-)

diff --git a/io_uring/msg_ring.c b/io_uring/msg_ring.c
index bdb935ef7aa2..74590e66d7f7 100644
--- a/io_uring/msg_ring.c
+++ b/io_uring/msg_ring.c
@@ -71,22 +71,6 @@ static inline bool io_msg_need_remote(struct io_ring_ctx *target_ctx)
 	return target_ctx->task_complete;
 }
 
-static int io_msg_exec_remote(struct io_kiocb *req, task_work_func_t func)
-{
-	struct io_ring_ctx *ctx = req->file->private_data;
-	struct io_msg *msg = io_kiocb_to_cmd(req, struct io_msg);
-	struct task_struct *task = READ_ONCE(ctx->submitter_task);
-
-	if (unlikely(!task))
-		return -EOWNERDEAD;
-
-	init_task_work(&msg->tw, func);
-	if (task_work_add(task, &msg->tw, TWA_SIGNAL))
-		return -EOWNERDEAD;
-
-	return IOU_ISSUE_SKIP_COMPLETE;
-}
-
 static struct io_overflow_cqe *io_alloc_overflow(struct io_ring_ctx *target_ctx)
 {
 	bool is_cqe32 = target_ctx->flags & IORING_SETUP_CQE32;
@@ -236,17 +220,39 @@ static int io_msg_install_complete(struct io_kiocb *req, unsigned int issue_flag
 	return ret;
 }
 
-static void io_msg_tw_fd_complete(struct callback_head *head)
+static int io_msg_install_remote(struct io_kiocb *req, unsigned int issue_flags,
+				 struct io_ring_ctx *target_ctx)
 {
-	struct io_msg *msg = container_of(head, struct io_msg, tw);
-	struct io_kiocb *req = cmd_to_io_kiocb(msg);
-	int ret = -EOWNERDEAD;
+	struct io_msg *msg = io_kiocb_to_cmd(req, struct io_msg);
+	bool skip_cqe = msg->flags & IORING_MSG_RING_CQE_SKIP;
+	struct io_overflow_cqe *ocqe = NULL;
+	int ret;
 
-	if (!(current->flags & PF_EXITING))
-		ret = io_msg_install_complete(req, IO_URING_F_UNLOCKED);
-	if (ret < 0)
-		req_set_fail(req);
-	io_req_queue_tw_complete(req, ret);
+	if (!skip_cqe) {
+		ocqe = io_alloc_overflow(target_ctx);
+		if (!ocqe)
+			return -ENOMEM;
+	}
+
+	if (unlikely(io_double_lock_ctx(target_ctx, issue_flags))) {
+		kfree(ocqe);
+		return -EAGAIN;
+	}
+
+	ret = __io_fixed_fd_install(target_ctx, msg->src_file, msg->dst_fd);
+	mutex_unlock(&target_ctx->uring_lock);
+
+	if (ret >= 0) {
+		msg->src_file = NULL;
+		req->flags &= ~REQ_F_NEED_CLEANUP;
+		if (!skip_cqe) {
+			spin_lock(&target_ctx->completion_lock);
+			io_msg_add_overflow(msg, target_ctx, ocqe, ret, 0);
+			return 0;
+		}
+	}
+	kfree(ocqe);
+	return ret;
 }
 
 static int io_msg_send_fd(struct io_kiocb *req, unsigned int issue_flags)
@@ -271,7 +277,7 @@ static int io_msg_send_fd(struct io_kiocb *req, unsigned int issue_flags)
 	}
 
 	if (io_msg_need_remote(target_ctx))
-		return io_msg_exec_remote(req, io_msg_tw_fd_complete);
+		return io_msg_install_remote(req, issue_flags, target_ctx);
 	return io_msg_install_complete(req, issue_flags);
 }
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 5/7] io_uring/msg_ring: add an alloc cache for CQE entries
  2024-05-30 15:23 [PATCHSET v2 0/7] Improve MSG_RING DEFER_TASKRUN performance Jens Axboe
                   ` (3 preceding siblings ...)
  2024-05-30 15:23 ` [PATCH 4/7] io_uring/msg_ring: avoid double indirection task_work for fd passing Jens Axboe
@ 2024-05-30 15:23 ` Jens Axboe
  2024-05-30 15:23 ` [PATCH 6/7] io_uring/msg_ring: remove callback_head from struct io_msg Jens Axboe
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: Jens Axboe @ 2024-05-30 15:23 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe

io_uring accounts the memory allocated (__GFP_ACCOUNT), which is quite
expensive. Wrap the allocations and frees in the provided alloc cache
framework. The target ctx needs to be locked anyway for posting the
overflow entry, so just move the overflow alloc inside that section.
Flushing the entries happens with the lock held as well, so
io_alloc_cache_put() can be used there to recycle them.

In a simple test, most of the overhead of DEFER_TASKRUN message passing
ends up being the memory accounting on allocation and free, and with
this change it's completely gone.

Signed-off-by: Jens Axboe <[email protected]>
---
 include/linux/io_uring_types.h |  7 ++++
 io_uring/io_uring.c            |  7 +++-
 io_uring/msg_ring.c            | 67 +++++++++++++++++++++++-----------
 io_uring/msg_ring.h            |  3 ++
 4 files changed, 62 insertions(+), 22 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 91224bbcfa73..0f8fc6070b12 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -357,6 +357,13 @@ struct io_ring_ctx {
 	struct io_alloc_cache	futex_cache;
 #endif
 
+	/*
+	 * Unlike the other caches, this one is used by the sender of messages
+	 * to this ring, not by the ring itself. As such, protection for this
+	 * cache is under ->completion_lock, not ->uring_lock.
+	 */
+	struct io_alloc_cache	msg_cache;
+
 	const struct cred	*sq_creds;	/* cred used for __io_sq_thread() */
 	struct io_sq_data	*sq_data;	/* if using sq thread polling */
 
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 816e93e7f949..bdb2636dc939 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -95,6 +95,7 @@
 #include "futex.h"
 #include "napi.h"
 #include "uring_cmd.h"
+#include "msg_ring.h"
 #include "memmap.h"
 
 #include "timeout.h"
@@ -315,6 +316,7 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 	ret |= io_alloc_cache_init(&ctx->uring_cache, IO_ALLOC_CACHE_MAX,
 			    sizeof(struct uring_cache));
 	ret |= io_futex_cache_init(ctx);
+	ret |= io_msg_cache_init(ctx);
 	if (ret)
 		goto err;
 	init_completion(&ctx->ref_comp);
@@ -351,6 +353,7 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 	io_alloc_cache_free(&ctx->rw_cache, io_rw_cache_free);
 	io_alloc_cache_free(&ctx->uring_cache, kfree);
 	io_futex_cache_free(ctx);
+	io_msg_cache_free(ctx);
 	kfree(ctx->cancel_table.hbs);
 	kfree(ctx->cancel_table_locked.hbs);
 	xa_destroy(&ctx->io_bl_xa);
@@ -695,7 +698,8 @@ static void __io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool dying)
 			memcpy(cqe, &ocqe->cqe, cqe_size);
 		}
 		list_del(&ocqe->list);
-		kfree(ocqe);
+		if (!io_alloc_cache_put(&ctx->msg_cache, ocqe))
+			kfree(ocqe);
 	}
 
 	if (list_empty(&ctx->cq_overflow_list)) {
@@ -2649,6 +2653,7 @@ static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
 	io_alloc_cache_free(&ctx->rw_cache, io_rw_cache_free);
 	io_alloc_cache_free(&ctx->uring_cache, kfree);
 	io_futex_cache_free(ctx);
+	io_msg_cache_free(ctx);
 	io_destroy_buffers(ctx);
 	mutex_unlock(&ctx->uring_lock);
 	if (ctx->sq_creds)
diff --git a/io_uring/msg_ring.c b/io_uring/msg_ring.c
index 74590e66d7f7..392763f3f090 100644
--- a/io_uring/msg_ring.c
+++ b/io_uring/msg_ring.c
@@ -11,6 +11,7 @@
 #include "io_uring.h"
 #include "rsrc.h"
 #include "filetable.h"
+#include "alloc_cache.h"
 #include "msg_ring.h"
 
 
@@ -73,19 +74,24 @@ static inline bool io_msg_need_remote(struct io_ring_ctx *target_ctx)
 
 static struct io_overflow_cqe *io_alloc_overflow(struct io_ring_ctx *target_ctx)
 {
-	bool is_cqe32 = target_ctx->flags & IORING_SETUP_CQE32;
-	size_t cqe_size = sizeof(struct io_overflow_cqe);
 	struct io_overflow_cqe *ocqe;
 
-	if (is_cqe32)
-		cqe_size += sizeof(struct io_uring_cqe);
+	ocqe = io_alloc_cache_get(&target_ctx->msg_cache);
+	if (!ocqe) {
+		bool is_cqe32 = target_ctx->flags & IORING_SETUP_CQE32;
+		size_t cqe_size = sizeof(struct io_overflow_cqe);
 
-	ocqe = kmalloc(cqe_size, GFP_ATOMIC | __GFP_ACCOUNT);
-	if (!ocqe)
-		return NULL;
+		if (is_cqe32)
+			cqe_size += sizeof(struct io_uring_cqe);
 
-	if (is_cqe32)
-		ocqe->cqe.big_cqe[0] = ocqe->cqe.big_cqe[1] = 0;
+		ocqe = kmalloc(cqe_size, GFP_ATOMIC | __GFP_ACCOUNT);
+		if (!ocqe)
+			return NULL;
+
+		/* just init at alloc time, won't change */
+		if (is_cqe32)
+			ocqe->cqe.big_cqe[0] = ocqe->cqe.big_cqe[1] = 0;
+	}
 
 	return ocqe;
 }
@@ -119,13 +125,16 @@ static int io_msg_fill_remote(struct io_msg *msg, unsigned int issue_flags,
 {
 	struct io_overflow_cqe *ocqe;
 
+	spin_lock(&target_ctx->completion_lock);
+
 	ocqe = io_alloc_overflow(target_ctx);
-	if (!ocqe)
-		return -ENOMEM;
+	if (ocqe) {
+		io_msg_add_overflow(msg, target_ctx, ocqe, msg->len, flags);
+		return 0;
+	}
 
-	spin_lock(&target_ctx->completion_lock);
-	io_msg_add_overflow(msg, target_ctx, ocqe, msg->len, flags);
-	return 0;
+	spin_unlock(&target_ctx->completion_lock);
+	return -ENOMEM;
 }
 
 static int io_msg_ring_data(struct io_kiocb *req, unsigned int issue_flags)
@@ -228,17 +237,16 @@ static int io_msg_install_remote(struct io_kiocb *req, unsigned int issue_flags,
 	struct io_overflow_cqe *ocqe = NULL;
 	int ret;
 
+	if (unlikely(io_double_lock_ctx(target_ctx, issue_flags)))
+		return -EAGAIN;
+
 	if (!skip_cqe) {
+		spin_lock(&target_ctx->completion_lock);
 		ocqe = io_alloc_overflow(target_ctx);
 		if (!ocqe)
 			return -ENOMEM;
 	}
 
-	if (unlikely(io_double_lock_ctx(target_ctx, issue_flags))) {
-		kfree(ocqe);
-		return -EAGAIN;
-	}
-
 	ret = __io_fixed_fd_install(target_ctx, msg->src_file, msg->dst_fd);
 	mutex_unlock(&target_ctx->uring_lock);
 
@@ -246,12 +254,14 @@ static int io_msg_install_remote(struct io_kiocb *req, unsigned int issue_flags,
 		msg->src_file = NULL;
 		req->flags &= ~REQ_F_NEED_CLEANUP;
 		if (!skip_cqe) {
-			spin_lock(&target_ctx->completion_lock);
 			io_msg_add_overflow(msg, target_ctx, ocqe, ret, 0);
 			return 0;
 		}
 	}
-	kfree(ocqe);
+	if (ocqe) {
+		spin_unlock(&target_ctx->completion_lock);
+		kfree(ocqe);
+	}
 	return ret;
 }
 
@@ -331,3 +341,18 @@ int io_msg_ring(struct io_kiocb *req, unsigned int issue_flags)
 	io_req_set_res(req, ret, 0);
 	return IOU_OK;
 }
+
+int io_msg_cache_init(struct io_ring_ctx *ctx)
+{
+	size_t size = sizeof(struct io_overflow_cqe);
+
+	if (ctx->flags & IORING_SETUP_CQE32)
+		size += sizeof(struct io_uring_cqe);
+
+	return io_alloc_cache_init(&ctx->msg_cache, IO_ALLOC_CACHE_MAX, size);
+}
+
+void io_msg_cache_free(struct io_ring_ctx *ctx)
+{
+	io_alloc_cache_free(&ctx->msg_cache, kfree);
+}
diff --git a/io_uring/msg_ring.h b/io_uring/msg_ring.h
index 3987ee6c0e5f..94f5716d522e 100644
--- a/io_uring/msg_ring.h
+++ b/io_uring/msg_ring.h
@@ -3,3 +3,6 @@
 int io_msg_ring_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
 int io_msg_ring(struct io_kiocb *req, unsigned int issue_flags);
 void io_msg_ring_cleanup(struct io_kiocb *req);
+
+int io_msg_cache_init(struct io_ring_ctx *ctx);
+void io_msg_cache_free(struct io_ring_ctx *ctx);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 6/7] io_uring/msg_ring: remove callback_head from struct io_msg
  2024-05-30 15:23 [PATCHSET v2 0/7] Improve MSG_RING DEFER_TASKRUN performance Jens Axboe
                   ` (4 preceding siblings ...)
  2024-05-30 15:23 ` [PATCH 5/7] io_uring/msg_ring: add an alloc cache for CQE entries Jens Axboe
@ 2024-05-30 15:23 ` Jens Axboe
  2024-05-30 15:23 ` [PATCH 7/7] io_uring/msg_ring: remove non-remote message passing Jens Axboe
  2024-06-03 13:53 ` [PATCHSET v2 0/7] Improve MSG_RING DEFER_TASKRUN performance Pavel Begunkov
  7 siblings, 0 replies; 15+ messages in thread
From: Jens Axboe @ 2024-05-30 15:23 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe

This is now unused, get rid of it.

Signed-off-by: Jens Axboe <[email protected]>
---
 io_uring/msg_ring.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/io_uring/msg_ring.c b/io_uring/msg_ring.c
index 392763f3f090..5264ba346df8 100644
--- a/io_uring/msg_ring.c
+++ b/io_uring/msg_ring.c
@@ -22,7 +22,6 @@
 struct io_msg {
 	struct file			*file;
 	struct file			*src_file;
-	struct callback_head		tw;
 	u64 user_data;
 	u32 len;
 	u32 cmd;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 7/7] io_uring/msg_ring: remove non-remote message passing
  2024-05-30 15:23 [PATCHSET v2 0/7] Improve MSG_RING DEFER_TASKRUN performance Jens Axboe
                   ` (5 preceding siblings ...)
  2024-05-30 15:23 ` [PATCH 6/7] io_uring/msg_ring: remove callback_head from struct io_msg Jens Axboe
@ 2024-05-30 15:23 ` Jens Axboe
  2024-06-03 13:53 ` [PATCHSET v2 0/7] Improve MSG_RING DEFER_TASKRUN performance Pavel Begunkov
  7 siblings, 0 replies; 15+ messages in thread
From: Jens Axboe @ 2024-05-30 15:23 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe

Now that the overflow approach works well, there's no need to retain the
double locking for direct CQ posting on the target ring. Just have any
kind of target ring use the same messaging mechanism.

Signed-off-by: Jens Axboe <[email protected]>
---
 io_uring/msg_ring.c | 78 +++++----------------------------------------
 1 file changed, 8 insertions(+), 70 deletions(-)

diff --git a/io_uring/msg_ring.c b/io_uring/msg_ring.c
index 5264ba346df8..e966ec8757cb 100644
--- a/io_uring/msg_ring.c
+++ b/io_uring/msg_ring.c
@@ -33,11 +33,6 @@ struct io_msg {
 	u32 flags;
 };
 
-static void io_double_unlock_ctx(struct io_ring_ctx *octx)
-{
-	mutex_unlock(&octx->uring_lock);
-}
-
 static int io_double_lock_ctx(struct io_ring_ctx *octx,
 			      unsigned int issue_flags)
 {
@@ -66,11 +61,6 @@ void io_msg_ring_cleanup(struct io_kiocb *req)
 	msg->src_file = NULL;
 }
 
-static inline bool io_msg_need_remote(struct io_ring_ctx *target_ctx)
-{
-	return target_ctx->task_complete;
-}
-
 static struct io_overflow_cqe *io_alloc_overflow(struct io_ring_ctx *target_ctx)
 {
 	struct io_overflow_cqe *ocqe;
@@ -106,6 +96,8 @@ static void io_msg_add_overflow(struct io_msg *msg,
 				u32 flags)
 	__releases(&target_ctx->completion_lock)
 {
+	struct task_struct *task = READ_ONCE(target_ctx->submitter_task);
+
 	if (list_empty(&target_ctx->cq_overflow_list)) {
 		set_bit(IO_CHECK_CQ_OVERFLOW_BIT, &target_ctx->check_cq);
 		atomic_or(IORING_SQ_TASKRUN, &target_ctx->rings->sq_flags);
@@ -116,7 +108,10 @@ static void io_msg_add_overflow(struct io_msg *msg,
 	ocqe->cqe.flags = flags;
 	list_add_tail(&ocqe->list, &target_ctx->cq_overflow_list);
 	spin_unlock(&target_ctx->completion_lock);
-	wake_up_state(target_ctx->submitter_task, TASK_INTERRUPTIBLE);
+	if (task)
+		wake_up_state(task, TASK_INTERRUPTIBLE);
+	else if (wq_has_sleeper(&target_ctx->cq_wait))
+		wake_up(&target_ctx->cq_wait);
 }
 
 static int io_msg_fill_remote(struct io_msg *msg, unsigned int issue_flags,
@@ -141,7 +136,6 @@ static int io_msg_ring_data(struct io_kiocb *req, unsigned int issue_flags)
 	struct io_ring_ctx *target_ctx = req->file->private_data;
 	struct io_msg *msg = io_kiocb_to_cmd(req, struct io_msg);
 	u32 flags = 0;
-	int ret;
 
 	if (msg->src_fd || msg->flags & ~IORING_MSG_RING_FLAGS_PASS)
 		return -EINVAL;
@@ -153,19 +147,7 @@ static int io_msg_ring_data(struct io_kiocb *req, unsigned int issue_flags)
 	if (msg->flags & IORING_MSG_RING_FLAGS_PASS)
 		flags = msg->cqe_flags;
 
-	if (io_msg_need_remote(target_ctx))
-		return io_msg_fill_remote(msg, issue_flags, target_ctx, flags);
-
-	ret = -EOVERFLOW;
-	if (target_ctx->flags & IORING_SETUP_IOPOLL) {
-		if (unlikely(io_double_lock_ctx(target_ctx, issue_flags)))
-			return -EAGAIN;
-	}
-	if (io_post_aux_cqe(target_ctx, msg->user_data, msg->len, flags))
-		ret = 0;
-	if (target_ctx->flags & IORING_SETUP_IOPOLL)
-		io_double_unlock_ctx(target_ctx);
-	return ret;
+	return io_msg_fill_remote(msg, issue_flags, target_ctx, flags);
 }
 
 static struct file *io_msg_grab_file(struct io_kiocb *req, unsigned int issue_flags)
@@ -186,48 +168,6 @@ static struct file *io_msg_grab_file(struct io_kiocb *req, unsigned int issue_fl
 	return file;
 }
 
-static int __io_msg_install_complete(struct io_kiocb *req)
-{
-	struct io_ring_ctx *target_ctx = req->file->private_data;
-	struct io_msg *msg = io_kiocb_to_cmd(req, struct io_msg);
-	struct file *src_file = msg->src_file;
-	int ret;
-
-	ret = __io_fixed_fd_install(target_ctx, src_file, msg->dst_fd);
-	if (ret < 0)
-		return ret;
-
-	msg->src_file = NULL;
-	req->flags &= ~REQ_F_NEED_CLEANUP;
-
-	if (msg->flags & IORING_MSG_RING_CQE_SKIP)
-		return ret;
-
-	/*
-	 * If this fails, the target still received the file descriptor but
-	 * wasn't notified of the fact. This means that if this request
-	 * completes with -EOVERFLOW, then the sender must ensure that a
-	 * later IORING_OP_MSG_RING delivers the message.
-	 */
-	if (!io_post_aux_cqe(target_ctx, msg->user_data, ret, 0))
-		return -EOVERFLOW;
-
-	return ret;
-}
-
-static int io_msg_install_complete(struct io_kiocb *req, unsigned int issue_flags)
-{
-	struct io_ring_ctx *target_ctx = req->file->private_data;
-	int ret;
-
-	if (unlikely(io_double_lock_ctx(target_ctx, issue_flags)))
-		return -EAGAIN;
-
-	ret = __io_msg_install_complete(req);
-	io_double_unlock_ctx(target_ctx);
-	return ret;
-}
-
 static int io_msg_install_remote(struct io_kiocb *req, unsigned int issue_flags,
 				 struct io_ring_ctx *target_ctx)
 {
@@ -285,9 +225,7 @@ static int io_msg_send_fd(struct io_kiocb *req, unsigned int issue_flags)
 		req->flags |= REQ_F_NEED_CLEANUP;
 	}
 
-	if (io_msg_need_remote(target_ctx))
-		return io_msg_install_remote(req, issue_flags, target_ctx);
-	return io_msg_install_complete(req, issue_flags);
+	return io_msg_install_remote(req, issue_flags, target_ctx);
 }
 
 int io_msg_ring_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCHSET v2 0/7] Improve MSG_RING DEFER_TASKRUN performance
  2024-05-30 15:23 [PATCHSET v2 0/7] Improve MSG_RING DEFER_TASKRUN performance Jens Axboe
                   ` (6 preceding siblings ...)
  2024-05-30 15:23 ` [PATCH 7/7] io_uring/msg_ring: remove non-remote message passing Jens Axboe
@ 2024-06-03 13:53 ` Pavel Begunkov
  2024-06-04 18:57   ` Jens Axboe
  7 siblings, 1 reply; 15+ messages in thread
From: Pavel Begunkov @ 2024-06-03 13:53 UTC (permalink / raw)
  To: Jens Axboe, io-uring

On 5/30/24 16:23, Jens Axboe wrote:
> Hi,
> 
> For v1 and replies to that and tons of perf measurements, go here:

I'd really prefer the task_work version rather than carving
yet another path specific to msg_ring. Perf might sound better,
but it's duplicating wake-up paths, it's not integrated with batch
waiting, it's not clear how it affects different workloads with target
locking, and it would work weirdly in terms of ordering.

If the swing back is that expensive, another option is to
allocate a new request and let the target ring deallocate
it once the message is delivered (similar to that overflow
entry).


> https://lore.kernel.org/io-uring/[email protected]/T/#m12f44c0a9ee40a59b0dcc226e22a0d031903aa73
> 
> as I won't duplicate them in here. Performance has been improved since
> v1 as well, as the slab accounting is gone and we now rely soly on
> the completion_lock on the issuer side.
> 
> Changes since v1:
> - Change commit messages to reflect it's DEFER_TASKRUN, not SINGLE_ISSUER
> - Get rid of the need to double lock on the target uring_lock
> - Relax the check for needing remote posting, and then finally kill it
> - Unify it across ring types
> - Kill (now) unused callback_head in io_msg
> - Add overflow caching to avoid __GFP_ACCOUNT overhead
> - Rebase on current git master with 6.9 and 6.10 fixes pulled in
> 

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCHSET v2 0/7] Improve MSG_RING DEFER_TASKRUN performance
  2024-06-03 13:53 ` [PATCHSET v2 0/7] Improve MSG_RING DEFER_TASKRUN performance Pavel Begunkov
@ 2024-06-04 18:57   ` Jens Axboe
  2024-06-04 19:55     ` Jens Axboe
  2024-06-05 15:50     ` Pavel Begunkov
  0 siblings, 2 replies; 15+ messages in thread
From: Jens Axboe @ 2024-06-04 18:57 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 6/3/24 7:53 AM, Pavel Begunkov wrote:
> On 5/30/24 16:23, Jens Axboe wrote:
>> Hi,
>>
>> For v1 and replies to that and tons of perf measurements, go here:
> 
> I'd really prefer the task_work version rather than carving
> yet another path specific to msg_ring. Perf might sounds better,
> but it's duplicating wake up paths, not integrated with batch
> waiting, not clear how affects different workloads with target
> locking and would work weird in terms of ordering.

The duplication is really minor, basically non-existent imho. It's a
wakeup call; it's literally 2 lines of code. I do agree on the batching,
though I don't think that's really a big concern as most usage I'd
expect from this would be sending single messages. You're not batch
waiting on those. But there could obviously be cases where you have a
lot of mixed traffic, and for those it would make sense to have the
batch wakeups.

What I do like with this version is that we end up with just one method
for delivering the CQE, rather than needing to split it into two. And it
gets rid of the uring_lock double locking for non-SINGLE_ISSUER. I know
we always try and push people towards DEFER_TASKRUN|SINGLE_ISSUER, but
that doesn't mean we should just ignore the cases where that isn't true.
Unifying that code and making it faster all around is a worthy goal in
and of itself. The code is CERTAINLY a lot cleaner after the change than
it was with all the IOPOLL special casing etc.

> If the swing back is that expensive, another option is to
> allocate a new request and let the target ring to deallocate
> it once the message is delivered (similar to that overflow
> entry).

I can give it a shot, and then run some testing. If we get close enough
with the latencies and performance, then I'd certainly be more amenable
to going either route.

We'd definitely need to pass in the required memory and avoid the return
round trip, as that basically doubles the cost (and latency) of sending
a message. The downside of what you suggest here is that while that
should integrate nicely with existing local task_work, it'll also mean
that we'll need hot path checks for treating that request type as a
special thing. Things like req->ctx being not local, freeing the request
rather than recycling, etc. And that'll need to happen in multiple
spots.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCHSET v2 0/7] Improve MSG_RING DEFER_TASKRUN performance
  2024-06-04 18:57   ` Jens Axboe
@ 2024-06-04 19:55     ` Jens Axboe
  2024-06-05 15:50     ` Pavel Begunkov
  1 sibling, 0 replies; 15+ messages in thread
From: Jens Axboe @ 2024-06-04 19:55 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 6/4/24 12:57 PM, Jens Axboe wrote:
>> If the swing back is that expensive, another option is to
>> allocate a new request and let the target ring to deallocate
>> it once the message is delivered (similar to that overflow
>> entry).
> 
> I can give it a shot, and then run some testing. If we get close enough
> with the latencies and performance, then I'd certainly be more amenable
> to going either route.
> 
> We'd definitely need to pass in the required memory and avoid the return
> round trip, as that basically doubles the cost (and latency) of sending
> a message. The downside of what you suggest here is that while that
> should integrate nicely with existing local task_work, it'll also mean
> that we'll need hot path checks for treating that request type as a
> special thing. Things like req->ctx being not local, freeing the request
> rather than recycling, etc. And that'll need to happen in multiple
> spots.

On top of that, you also need CQE memory for the other side, not just
the req itself. Otherwise you don't know whether it'll post or not in
low-memory situations.

I dunno, I feel like this solution would get a lot more complicated than
it is now, rather than make it simpler.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCHSET v2 0/7] Improve MSG_RING DEFER_TASKRUN performance
  2024-06-04 18:57   ` Jens Axboe
  2024-06-04 19:55     ` Jens Axboe
@ 2024-06-05 15:50     ` Pavel Begunkov
  2024-06-05 16:41       ` Jens Axboe
  1 sibling, 1 reply; 15+ messages in thread
From: Pavel Begunkov @ 2024-06-05 15:50 UTC (permalink / raw)
  To: Jens Axboe, io-uring

On 6/4/24 19:57, Jens Axboe wrote:
> On 6/3/24 7:53 AM, Pavel Begunkov wrote:
>> On 5/30/24 16:23, Jens Axboe wrote:
>>> Hi,
>>>
>>> For v1 and replies to that and tons of perf measurements, go here:
>>
>> I'd really prefer the task_work version rather than carving
>> yet another path specific to msg_ring. Perf might sounds better,
>> but it's duplicating wake up paths, not integrated with batch
>> waiting, not clear how affects different workloads with target
>> locking and would work weird in terms of ordering.
> 
> The duplication is really minor, basically non-existent imho. It's a
> wakeup call, it's literally 2 lines of code. I do agree on the batching,

Well, v3 tries to add msg_ring/nr_overflow handling to local
task work; that's what I mean by duplicating paths, and we'll
continue gutting the hot path to support msg_ring in
this way.

Does it work with eventfd? I can't find any handling, so next
you'd be adding:

io_commit_cqring_flush(ctx);

Likely draining around cq_extra should also be patched.
Yes, fixable, but it'll be a pile of fun, and without many
users, it'll take time to discover it all.

> though I don't think that's really a big concern as most usage I'd
> expect from this would be sending single messages. You're not batch
> waiting on those. But there could obviously be cases where you have a
> lot of mixed traffic, and for those it would make sense to have the
> batch wakeups.
> 
> What I do like with this version is that we end up with just one method
> for delivering the CQE, rather than needing to split it into two. And it
> gets rid of the uring_lock double locking for non-SINGLE_ISSUER. I know

You can't get rid of target locking for fd passing; the file tables
are synced by the lock. Otherwise it's only IOPOLL, because with
normal rings it can, and IIRC does, take the completion_lock for CQE
posting. I don't see a problem here, unless you care that much about
IOPOLL?

> we always try and push people towards DEFER_TASKRUN|SINGLE_ISSUER, but
> that doesn't mean we should just ignore the cases where that isn't true.
> Unifying that code and making it faster all around is a worthy goal in
> and of itself. The code is CERTAINLY a lot cleaner after the change than
> all the IOPOLL etc.
> 
>> If the swing back is that expensive, another option is to
>> allocate a new request and let the target ring to deallocate
>> it once the message is delivered (similar to that overflow
>> entry).
> 
> I can give it a shot, and then run some testing. If we get close enough
> with the latencies and performance, then I'd certainly be more amenable
> to going either route.
> 
> We'd definitely need to pass in the required memory and avoid the return

Right, same as with CQEs

> round trip, as that basically doubles the cost (and latency) of sending

Sender's latency, which is IMHO not important at all

> a message. The downside of what you suggest here is that while that
> should integrate nicely with existing local task_work, it'll also mean
> that we'll need hot path checks for treating that request type as a
> special thing. Things like req->ctx being not local, freeing the request
> rather than recycling, etc. And that'll need to happen in multiple
> spots.

I'm not suggesting feeding that request into flush_completions()
and the common completion infra; it can be killed right in the tw callback.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCHSET v2 0/7] Improve MSG_RING DEFER_TASKRUN performance
  2024-06-05 15:50     ` Pavel Begunkov
@ 2024-06-05 16:41       ` Jens Axboe
  2024-06-05 19:20         ` Pavel Begunkov
  0 siblings, 1 reply; 15+ messages in thread
From: Jens Axboe @ 2024-06-05 16:41 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 6/5/24 9:50 AM, Pavel Begunkov wrote:
> On 6/4/24 19:57, Jens Axboe wrote:
>> On 6/3/24 7:53 AM, Pavel Begunkov wrote:
>>> On 5/30/24 16:23, Jens Axboe wrote:
>>>> Hi,
>>>>
>>>> For v1 and replies to that and tons of perf measurements, go here:
>>>
>>> I'd really prefer the task_work version rather than carving
>>> yet another path specific to msg_ring. Perf might sounds better,
>>> but it's duplicating wake up paths, not integrated with batch
>>> waiting, not clear how affects different workloads with target
>>> locking and would work weird in terms of ordering.
>>
>> The duplication is really minor, basically non-existent imho. It's a
>> wakeup call, it's literally 2 lines of code. I do agree on the batching,
> 
> Well, v3 tries to add msg_ring/nr_overflow handling to local
> task work, that what I mean by duplicating paths, and we'll
> continue gutting the hot path for supporting msg_ring in
> this way.

No matter how you look at it, there will be changes to the hot path,
whether we use local task_work like in the original or do
the current approach.

> Does it work with eventfd? I can't find any handling, so next
> you'd be adding:
> 
> io_commit_cqring_flush(ctx);

That's merely because the flagging should be done in io_defer_wake(),
moving that code to the common helper as well.

> Likely draining around cq_extra should also be patched.
> Yes, fixable, but it'll be a pile of fun, and without many
> users, it'll take time to discover it all.

Yes, that may indeed need tweaking. But this is a bit of a chicken and
egg problem - there are not many users of it, because it currently
sucks. We have to make it better, and there's already one user lined up
because of these changes.

We can't just let MSG_RING linger. It's an appealing interface for
message passing where you are using rings on both sides, but it's
currently pretty much useless exactly for the case that we care about
the most - DEFER_TASKRUN. So right now you are caught between a rock and
a hard place, where you want to use DEFER_TASKRUN because it's a lot
better for the things that people care about, but if you need message
passing, then it doesn't work very well.

>> though I don't think that's really a big concern as most usage I'd
>> expect from this would be sending single messages. You're not batch
>> waiting on those. But there could obviously be cases where you have a
>> lot of mixed traffic, and for those it would make sense to have the
>> batch wakeups.
>>
>> What I do like with this version is that we end up with just one method
>> for delivering the CQE, rather than needing to split it into two. And it
>> gets rid of the uring_lock double locking for non-SINGLE_ISSUER. I know
> 
> You can't get rid of target locking for fd passing, the file tables
> are sync'ed by the lock. Otherwise it's only IOPOLL, because with
> normal rings it can and IIRC does take the completion_lock for CQE
> posting. I don't see a problem here, unless you care that much about
> IOPOLL?

Right, fd passing still needs to grab the lock, and it still does with
the patchset. We can't really get around it for fd passing, at least not
without further work (which I have no current plans to do). I don't
care about IOPOLL in particular for message passing; I don't think there
are any good use cases there. It's more of a code hygiene thing; the
branches are still there and do exist.

>> we always try and push people towards DEFER_TASKRUN|SINGLE_ISSUER, but
>> that doesn't mean we should just ignore the cases where that isn't true.
>> Unifying that code and making it faster all around is a worthy goal in
>> and of itself. The code is CERTAINLY a lot cleaner after the change than
>> all the IOPOLL etc.
>>
>>> If the swing back is that expensive, another option is to
>>> allocate a new request and let the target ring to deallocate
>>> it once the message is delivered (similar to that overflow
>>> entry).
>>
>> I can give it a shot, and then run some testing. If we get close enough
>> with the latencies and performance, then I'd certainly be more amenable
>> to going either route.
>>
>> We'd definitely need to pass in the required memory and avoid the return
> 
> Right, same as with CQEs
> 
>> round trip, as that basically doubles the cost (and latency) of sending
> 
> Sender's latency, which is IMHO not important at all

But it IS important. Not because of the latency itself, that part is
less important, but because of the added overhead of bouncing from ring1
to ring2, and then back from ring2 to ring1. The reduction in latency is
a direct reflection of the reduction in overhead.

>> a message. The downside of what you suggest here is that while that
>> should integrate nicely with existing local task_work, it'll also mean
>> that we'll need hot path checks for treating that request type as a
>> special thing. Things like req->ctx being not local, freeing the request
>> rather than recycling, etc. And that'll need to happen in multiple
>> spots.
> 
> I'm not suggesting feeding that request into flush_completions()
> and common completion infra, can be killed right in the tw callback.

Right, so you need to special case these requests when you run the local
task_work. Which was my point above, you're going to need to accept hot
path additions regardless of the approach.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCHSET v2 0/7] Improve MSG_RING DEFER_TASKRUN performance
  2024-06-05 16:41       ` Jens Axboe
@ 2024-06-05 19:20         ` Pavel Begunkov
  2024-06-05 19:36           ` Jens Axboe
  0 siblings, 1 reply; 15+ messages in thread
From: Pavel Begunkov @ 2024-06-05 19:20 UTC (permalink / raw)
  To: Jens Axboe, io-uring

On 6/5/24 17:41, Jens Axboe wrote:
> On 6/5/24 9:50 AM, Pavel Begunkov wrote:
>> On 6/4/24 19:57, Jens Axboe wrote:
>>> On 6/3/24 7:53 AM, Pavel Begunkov wrote:
>>>> On 5/30/24 16:23, Jens Axboe wrote:
>>>>> Hi,
>>>>>
>>>>> For v1 and replies to that and tons of perf measurements, go here:
>>>>
>>>> I'd really prefer the task_work version rather than carving
>>>> yet another path specific to msg_ring. Perf might sounds better,
>>>> but it's duplicating wake up paths, not integrated with batch
>>>> waiting, not clear how affects different workloads with target
>>>> locking and would work weird in terms of ordering.
>>>
>>> The duplication is really minor, basically non-existent imho. It's a
>>> wakeup call, it's literally 2 lines of code. I do agree on the batching,
>>
>> Well, v3 tries to add msg_ring/nr_overflow handling to local
>> task work, that what I mean by duplicating paths, and we'll
>> continue gutting the hot path for supporting msg_ring in
>> this way.
> 
> No matter how you look at it, there will be changes to the hot path
> regardless of whether we use local task_work like in the original, or do
> the current approach.

The only downside for !msg_ring paths in the original was
un-inlining of local tw_add().

>> Does it work with eventfd? I can't find any handling, so next
>> you'd be adding:
>>
>> io_commit_cqring_flush(ctx);
> 
> That's merely because the flagging should be done in io_defer_wake(),
> moving that code to the common helper as well.

Flagging? If you mean io_commit_cqring_flush() then no,
it shouldn't and cannot be there. It's called strictly after
posting a CQE (or queuing an overflow), which is after tw
callback execution.

>> Likely draining around cq_extra should also be patched.
>> Yes, fixable, but it'll be a pile of fun, and without many
>> users, it'll take time to discover it all.
> 
> Yes that may need tweaking indeed. But this is a bit of a chicken and
> egg problem - there are not many users of it, because it currently
> sucks. We have to make it better, and there's already one user lined up
> because of these changes.
> 
> We can't just let MSG_RING linger. It's an appealing interface for
> message passing where you are using rings on both sides, but it's
> currently pretty much useless exactly for the case that we care about
> the most - DEFER_TASKRUN. So right now you are caught between a rock and
> a hard place, where you want to use DEFER_TASKRUN because it's a lot
> better for the things that people care about, but if you need message
> passing, then it doesn't work very well.
> 
>>> though I don't think that's really a big concern as most usage I'd
>>> expect from this would be sending single messages. You're not batch
>>> waiting on those. But there could obviously be cases where you have a
>>> lot of mixed traffic, and for those it would make sense to have the
>>> batch wakeups.
>>>
>>> What I do like with this version is that we end up with just one method
>>> for delivering the CQE, rather than needing to split it into two. And it
>>> gets rid of the uring_lock double locking for non-SINGLE_ISSUER. I know
>>
>> You can't get rid of target locking for fd passing, the file tables
>> are sync'ed by the lock. Otherwise it's only IOPOLL, because with
>> normal rings it can and IIRC does take the completion_lock for CQE
>> posting. I don't see a problem here, unless you care that much about
>> IOPOLL?
> 
> Right, fd passing still needs to grab the lock, and it still does with
> the patchset. We can't really get around it for fd passing, at least not
> without further work (of which I have no current plans to do). I don't
> care about IOPOLL in particular for message passing, I don't think there
> are any good use cases there. It's more of a code hygiene thing, the
> branches are still there and do exist.
> 
>>> we always try and push people towards DEFER_TASKRUN|SINGLE_ISSUER, but
>>> that doesn't mean we should just ignore the cases where that isn't true.
>>> Unifying that code and making it faster all around is a worthy goal in
>>> and of itself. The code is CERTAINLY a lot cleaner after the change than
>>> all the IOPOLL etc.
>>>
>>>> If the swing back is that expensive, another option is to
>>>> allocate a new request and let the target ring to deallocate
>>>> it once the message is delivered (similar to that overflow
>>>> entry).
>>>
>>> I can give it a shot, and then run some testing. If we get close enough
>>> with the latencies and performance, then I'd certainly be more amenable
>>> to going either route.
>>>
>>> We'd definitely need to pass in the required memory and avoid the return
>>
>> Right, same as with CQEs
>>
>>> round trip, as that basically doubles the cost (and latency) of sending
>>
>> Sender's latency, which is IMHO not important at all
> 
> But it IS important. Not because of the latency itself, that part is
> less important, but because of the added overhead of bouncing from ring1
> to ring2, and then back from ring2 to ring1. The reduction in latency is
> a direct reflecting of the reduction of overhead.
> 
>>> a message. The downside of what you suggest here is that while that
>>> should integrate nicely with existing local task_work, it'll also mean
>>> that we'll need hot path checks for treating that request type as a
>>> special thing. Things like req->ctx being not local, freeing the request
>>> rather than recycling, etc. And that'll need to happen in multiple
>>> spots.
>>
>> I'm not suggesting feeding that request into flush_completions()
>> and common completion infra, can be killed right in the tw callback.
> 
> Right, so you need to special case these requests when you run the local
> task_work. Which was my point above, you're going to need to accept hot
> path additions regardless of the approach.
> 

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCHSET v2 0/7] Improve MSG_RING DEFER_TASKRUN performance
  2024-06-05 19:20         ` Pavel Begunkov
@ 2024-06-05 19:36           ` Jens Axboe
  0 siblings, 0 replies; 15+ messages in thread
From: Jens Axboe @ 2024-06-05 19:36 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 6/5/24 1:20 PM, Pavel Begunkov wrote:
> On 6/5/24 17:41, Jens Axboe wrote:
>> On 6/5/24 9:50 AM, Pavel Begunkov wrote:
>>> On 6/4/24 19:57, Jens Axboe wrote:
>>>> On 6/3/24 7:53 AM, Pavel Begunkov wrote:
>>>>> On 5/30/24 16:23, Jens Axboe wrote:
>>>>>> Hi,
>>>>>>
>>>>>> For v1 and replies to that and tons of perf measurements, go here:
>>>>>
>>>>> I'd really prefer the task_work version rather than carving
>>>>> yet another path specific to msg_ring. Perf might sounds better,
>>>>> but it's duplicating wake up paths, not integrated with batch
>>>>> waiting, not clear how affects different workloads with target
>>>>> locking and would work weird in terms of ordering.
>>>>
>>>> The duplication is really minor, basically non-existent imho. It's a
>>>> wakeup call, it's literally 2 lines of code. I do agree on the batching,
>>>
>>> Well, v3 tries to add msg_ring/nr_overflow handling to local
>>> task work, that what I mean by duplicating paths, and we'll
>>> continue gutting the hot path for supporting msg_ring in
>>> this way.
>>
>> No matter how you look at it, there will be changes to the hot path
>> regardless of whether we use local task_work like in the original, or do
>> the current approach.
> 
> The only downside for !msg_ring paths in the original was
> un-inlining of local tw_add().

You're comparing an incomplete RFC to a more complete patchset; that
will not be the only downside once you're done with the local task_work
approach and the roundtrip is avoided. And that is my comparison base,
not some half-finished POC that I posted for comments.

>>> Does it work with eventfd? I can't find any handling, so next
>>> you'd be adding:
>>>
>>> io_commit_cqring_flush(ctx);
>>
>> That's merely because the flagging should be done in io_defer_wake(),
>> moving that code to the common helper as well.
> 
> Flagging? If you mean io_commit_cqring_flush() then no,
> it shouldn't and cannot be there. It's called strictly after
> posting a CQE (or queuing an overflow), which is after tw
> callback execution.

I meant the SQ ring flagging and eventfd signaling, which is currently
done when adding local work. That should go in io_defer_wake().

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 15+ messages in thread

