public inbox for [email protected]
* [PATCHSET 0/3] Improve MSG_RING SINGLE_ISSUER performance
@ 2024-05-24 22:58 Jens Axboe
  2024-05-24 22:58 ` [PATCH 1/3] io_uring/msg_ring: split fd installing into a helper Jens Axboe
                   ` (3 more replies)
  0 siblings, 4 replies; 23+ messages in thread
From: Jens Axboe @ 2024-05-24 22:58 UTC (permalink / raw)
  To: io-uring

Hi,

A ring setup with IORING_SETUP_SINGLE_ISSUER, which is required to
use IORING_SETUP_DEFER_TASKRUN, will need two round trips through
generic task_work. This isn't ideal. This patchset attempts to rectify
that, taking a new approach rather than trying to use the io_uring
task_work infrastructure to handle it as in previous postings.

In a sample test app that has one thread send messages to another,
logging both the time between the sender sending and the receiver
receiving and just the time for the sender to post a message and get the
CQE back,
I see the following sender latencies with the stock kernel:

Latencies for: Sender
    percentiles (nsec):
     |  1.0000th=[ 4384],  5.0000th=[ 4512], 10.0000th=[ 4576],
     | 20.0000th=[ 4768], 30.0000th=[ 4896], 40.0000th=[ 5024],
     | 50.0000th=[ 5088], 60.0000th=[ 5152], 70.0000th=[ 5280],
     | 80.0000th=[ 5344], 90.0000th=[ 5536], 95.0000th=[ 5728],
     | 99.0000th=[ 8032], 99.5000th=[18048], 99.9000th=[21376],
     | 99.9500th=[26496], 99.9900th=[59136]

and with the patches:

Latencies for: Sender
    percentiles (nsec):
     |  1.0000th=[  756],  5.0000th=[  820], 10.0000th=[  828],
     | 20.0000th=[  844], 30.0000th=[  852], 40.0000th=[  852],
     | 50.0000th=[  860], 60.0000th=[  860], 70.0000th=[  868],
     | 80.0000th=[  884], 90.0000th=[  964], 95.0000th=[  988],
     | 99.0000th=[ 1128], 99.5000th=[ 1208], 99.9000th=[ 1544],
     | 99.9500th=[ 1944], 99.9900th=[ 2896]

For the receiving side the win is smaller as it only "suffers" from
a single generic task_work, about a 10% win in latencies there.
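
For reference, a minimal sketch of this kind of message passing, assuming
liburing; this is not the actual msg-lat test app, and the ring flags,
queue sizes, and single-threaded structure here are simplifying
assumptions (the real test runs the two rings in separate threads and
records the timings around the submit/wait calls):

#include <liburing.h>
#include <stdio.h>

int main(void)
{
	unsigned flags = IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN;
	struct io_uring sender, receiver;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;

	if (io_uring_queue_init(8, &sender, flags) ||
	    io_uring_queue_init(8, &receiver, flags))
		return 1;

	/* post a CQE with user_data 0xcafe to the receiving ring */
	sqe = io_uring_get_sqe(&sender);
	io_uring_prep_msg_ring(sqe, receiver.ring_fd, 0, 0xcafe, 0);
	io_uring_submit(&sender);

	/* sender latency: from submit until this CQE comes back */
	if (!io_uring_wait_cqe(&sender, &cqe)) {
		printf("sender: res=%d\n", cqe->res);
		io_uring_cqe_seen(&sender, cqe);
	}

	/* receiver latency: until the message CQE shows up on this ring */
	if (!io_uring_wait_cqe(&receiver, &cqe)) {
		printf("receiver: user_data=0x%llx res=%d\n",
		       (unsigned long long) cqe->user_data, cqe->res);
		io_uring_cqe_seen(&receiver, cqe);
	}
	return 0;
}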

The idea here is to utilize the CQE overflow infrastructure for this,
as that allows the right task to post the CQE to the ring.

Patch 1 is a basic refactoring prep patch, patch 2 adds support for normal
messages, and patch 3 adopts the same approach for fd passing.

 io_uring/msg_ring.c | 151 ++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 138 insertions(+), 13 deletions(-)

-- 
Jens Axboe



* [PATCH 1/3] io_uring/msg_ring: split fd installing into a helper
  2024-05-24 22:58 [PATCHSET 0/3] Improve MSG_RING SINGLE_ISSUER performance Jens Axboe
@ 2024-05-24 22:58 ` Jens Axboe
  2024-05-24 22:58 ` [PATCH 2/3] io_uring/msg_ring: avoid double indirection task_work for data messages Jens Axboe
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 23+ messages in thread
From: Jens Axboe @ 2024-05-24 22:58 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe

No functional changes in this patch, just in preparation for needing to
complete the fd install with the ctx lock already held.

Signed-off-by: Jens Axboe <[email protected]>
---
 io_uring/msg_ring.c | 26 ++++++++++++++++++--------
 1 file changed, 18 insertions(+), 8 deletions(-)

diff --git a/io_uring/msg_ring.c b/io_uring/msg_ring.c
index 81c4a9d43729..feff2b0822cf 100644
--- a/io_uring/msg_ring.c
+++ b/io_uring/msg_ring.c
@@ -173,25 +173,23 @@ static struct file *io_msg_grab_file(struct io_kiocb *req, unsigned int issue_fl
 	return file;
 }
 
-static int io_msg_install_complete(struct io_kiocb *req, unsigned int issue_flags)
+static int __io_msg_install_complete(struct io_kiocb *req)
 {
 	struct io_ring_ctx *target_ctx = req->file->private_data;
 	struct io_msg *msg = io_kiocb_to_cmd(req, struct io_msg);
 	struct file *src_file = msg->src_file;
 	int ret;
 
-	if (unlikely(io_double_lock_ctx(target_ctx, issue_flags)))
-		return -EAGAIN;
-
 	ret = __io_fixed_fd_install(target_ctx, src_file, msg->dst_fd);
 	if (ret < 0)
-		goto out_unlock;
+		return ret;
 
 	msg->src_file = NULL;
 	req->flags &= ~REQ_F_NEED_CLEANUP;
 
 	if (msg->flags & IORING_MSG_RING_CQE_SKIP)
-		goto out_unlock;
+		return ret;
+
 	/*
 	 * If this fails, the target still received the file descriptor but
 	 * wasn't notified of the fact. This means that if this request
@@ -199,8 +197,20 @@ static int io_msg_install_complete(struct io_kiocb *req, unsigned int issue_flag
 	 * later IORING_OP_MSG_RING delivers the message.
 	 */
 	if (!io_post_aux_cqe(target_ctx, msg->user_data, ret, 0))
-		ret = -EOVERFLOW;
-out_unlock:
+		return -EOVERFLOW;
+
+	return ret;
+}
+
+static int io_msg_install_complete(struct io_kiocb *req, unsigned int issue_flags)
+{
+	struct io_ring_ctx *target_ctx = req->file->private_data;
+	int ret;
+
+	if (unlikely(io_double_lock_ctx(target_ctx, issue_flags)))
+		return -EAGAIN;
+
+	ret = __io_msg_install_complete(req);
 	io_double_unlock_ctx(target_ctx);
 	return ret;
 }
-- 
2.43.0



* [PATCH 2/3] io_uring/msg_ring: avoid double indirection task_work for data messages
  2024-05-24 22:58 [PATCHSET 0/3] Improve MSG_RING SINGLE_ISSUER performance Jens Axboe
  2024-05-24 22:58 ` [PATCH 1/3] io_uring/msg_ring: split fd installing into a helper Jens Axboe
@ 2024-05-24 22:58 ` Jens Axboe
  2024-05-28 13:18   ` Pavel Begunkov
  2024-05-28 13:32   ` Pavel Begunkov
  2024-05-24 22:58 ` [PATCH 3/3] io_uring/msg_ring: avoid double indirection task_work for fd passing Jens Axboe
  2024-05-28 13:31 ` [PATCHSET 0/3] Improve MSG_RING SINGLE_ISSUER performance Pavel Begunkov
  3 siblings, 2 replies; 23+ messages in thread
From: Jens Axboe @ 2024-05-24 22:58 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe

If IORING_SETUP_SINGLE_ISSUER is set, then we can't post CQEs remotely
to the target ring. Instead, task_work is queued for the target ring,
which is used to post the CQE. To make matters worse, once the target
CQE has been posted, task_work is then queued with the originator to
fill the completion.

This obviously adds a bunch of overhead and latency. Instead of relying
on generic kernel task_work for this, fill an overflow entry on the
target ring and flag it as such that the target ring will flush it. This
avoids both the task_work for posting the CQE, and it means that the
originator CQE can be filled inline as well.

In local testing, this reduces the latency on the sender side by 5-6x.

Signed-off-by: Jens Axboe <[email protected]>
---
 io_uring/msg_ring.c | 77 +++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 74 insertions(+), 3 deletions(-)

diff --git a/io_uring/msg_ring.c b/io_uring/msg_ring.c
index feff2b0822cf..3f89ff3a40ad 100644
--- a/io_uring/msg_ring.c
+++ b/io_uring/msg_ring.c
@@ -123,6 +123,69 @@ static void io_msg_tw_complete(struct callback_head *head)
 	io_req_queue_tw_complete(req, ret);
 }
 
+static struct io_overflow_cqe *io_alloc_overflow(struct io_ring_ctx *target_ctx)
+{
+	bool is_cqe32 = target_ctx->flags & IORING_SETUP_CQE32;
+	size_t cqe_size = sizeof(struct io_overflow_cqe);
+	struct io_overflow_cqe *ocqe;
+
+	if (is_cqe32)
+		cqe_size += sizeof(struct io_uring_cqe);
+
+	ocqe = kmalloc(cqe_size, GFP_ATOMIC | __GFP_ACCOUNT);
+	if (!ocqe)
+		return NULL;
+
+	if (is_cqe32)
+		ocqe->cqe.big_cqe[0] = ocqe->cqe.big_cqe[1] = 0;
+
+	return ocqe;
+}
+
+/*
+ * Entered with the target uring_lock held, and will drop it before
+ * returning. Adds a previously allocated ocqe to the overflow list on
+ * the target, and marks it appropriately for flushing.
+ */
+static void io_msg_add_overflow(struct io_msg *msg,
+				struct io_ring_ctx *target_ctx,
+				struct io_overflow_cqe *ocqe, int ret)
+	__releases(target_ctx->uring_lock)
+{
+	spin_lock(&target_ctx->completion_lock);
+
+	if (list_empty(&target_ctx->cq_overflow_list)) {
+		set_bit(IO_CHECK_CQ_OVERFLOW_BIT, &target_ctx->check_cq);
+		atomic_or(IORING_SQ_TASKRUN, &target_ctx->rings->sq_flags);
+	}
+
+	ocqe->cqe.user_data = msg->user_data;
+	ocqe->cqe.res = ret;
+	list_add_tail(&ocqe->list, &target_ctx->cq_overflow_list);
+	spin_unlock(&target_ctx->completion_lock);
+	mutex_unlock(&target_ctx->uring_lock);
+	wake_up_state(target_ctx->submitter_task, TASK_INTERRUPTIBLE);
+}
+
+static bool io_msg_fill_remote(struct io_msg *msg, unsigned int issue_flags,
+			       struct io_ring_ctx *target_ctx, u32 flags)
+{
+	struct io_overflow_cqe *ocqe;
+
+	ocqe = io_alloc_overflow(target_ctx);
+	if (!ocqe)
+		return false;
+
+	if (unlikely(io_double_lock_ctx(target_ctx, issue_flags))) {
+		kfree(ocqe);
+		return false;
+	}
+
+	ocqe->cqe.flags = flags;
+	io_msg_add_overflow(msg, target_ctx, ocqe, msg->len);
+	return true;
+}
+
 static int io_msg_ring_data(struct io_kiocb *req, unsigned int issue_flags)
 {
 	struct io_ring_ctx *target_ctx = req->file->private_data;
@@ -137,12 +200,20 @@ static int io_msg_ring_data(struct io_kiocb *req, unsigned int issue_flags)
 	if (target_ctx->flags & IORING_SETUP_R_DISABLED)
 		return -EBADFD;
 
-	if (io_msg_need_remote(target_ctx))
-		return io_msg_exec_remote(req, io_msg_tw_complete);
-
 	if (msg->flags & IORING_MSG_RING_FLAGS_PASS)
 		flags = msg->cqe_flags;
 
+	if (io_msg_need_remote(target_ctx)) {
+		/*
+		 * Try adding an overflow entry to the target, and only if
+		 * that fails, resort to using more expensive task_work to
+		 * have the target_ctx owner fill the CQE.
+		 */
+		if (!io_msg_fill_remote(msg, issue_flags, target_ctx, flags))
+			return io_msg_exec_remote(req, io_msg_tw_complete);
+		return 0;
+	}
+
 	ret = -EOVERFLOW;
 	if (target_ctx->flags & IORING_SETUP_IOPOLL) {
 		if (unlikely(io_double_lock_ctx(target_ctx, issue_flags)))
-- 
2.43.0



* [PATCH 3/3] io_uring/msg_ring: avoid double indirection task_work for fd passing
  2024-05-24 22:58 [PATCHSET 0/3] Improve MSG_RING SINGLE_ISSUER performance Jens Axboe
  2024-05-24 22:58 ` [PATCH 1/3] io_uring/msg_ring: split fd installing into a helper Jens Axboe
  2024-05-24 22:58 ` [PATCH 2/3] io_uring/msg_ring: avoid double indirection task_work for data messages Jens Axboe
@ 2024-05-24 22:58 ` Jens Axboe
  2024-05-28 13:31 ` [PATCHSET 0/3] Improve MSG_RING SINGLE_ISSUER performance Pavel Begunkov
  3 siblings, 0 replies; 23+ messages in thread
From: Jens Axboe @ 2024-05-24 22:58 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe

Like what was done for MSG_RING data passing to avoid a double task_work
roundtrip for IORING_SETUP_SINGLE_ISSUER, implement the same model for
fd passing. File descriptor passing is separately locked anyway, so the
only remaining issue is CQE posting, just like it was for data passing.
And for that, we can use the same approach.
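
For context, a minimal userspace sketch of what MSG_RING fd passing looks
like, assuming liburing >= 2.4 and its io_uring_prep_msg_ring_fd() helper;
the registered-table sizes, slot numbers, and the /dev/null fd below are
illustrative assumptions, not part of this patch:

#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>

int main(void)
{
	struct io_uring src, dst;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	int fd;

	if (io_uring_queue_init(8, &src, 0) || io_uring_queue_init(8, &dst, 0))
		return 1;

	/* slot 0 in the source ring's registered (direct) file table */
	fd = open("/dev/null", O_RDONLY);
	io_uring_register_files(&src, &fd, 1);

	/* give the target ring a sparse table to install into */
	io_uring_register_files_sparse(&dst, 8);

	/* send source slot 0 to target slot 0, target CQE user_data 0x1234 */
	sqe = io_uring_get_sqe(&src);
	io_uring_prep_msg_ring_fd(sqe, dst.ring_fd, 0, 0, 0x1234, 0);
	io_uring_submit(&src);

	if (!io_uring_wait_cqe(&src, &cqe)) {
		printf("sender: res=%d\n", cqe->res);
		io_uring_cqe_seen(&src, cqe);
	}
	if (!io_uring_wait_cqe(&dst, &cqe)) {
		printf("target: user_data=0x%llx res=%d\n",
		       (unsigned long long) cqe->user_data, cqe->res);
		io_uring_cqe_seen(&dst, cqe);
	}
	return 0;
}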

Signed-off-by: Jens Axboe <[email protected]>
---
 io_uring/msg_ring.c | 48 +++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 46 insertions(+), 2 deletions(-)

diff --git a/io_uring/msg_ring.c b/io_uring/msg_ring.c
index 3f89ff3a40ad..499702425711 100644
--- a/io_uring/msg_ring.c
+++ b/io_uring/msg_ring.c
@@ -299,6 +299,41 @@ static void io_msg_tw_fd_complete(struct callback_head *head)
 	io_req_queue_tw_complete(req, ret);
 }
 
+static int io_msg_install_remote(struct io_msg *msg, unsigned int issue_flags,
+				 struct io_ring_ctx *target_ctx)
+{
+	bool skip_cqe = msg->flags & IORING_MSG_RING_CQE_SKIP;
+	struct io_overflow_cqe *ocqe;
+	int ret;
+
+	if (!skip_cqe) {
+		ocqe = io_alloc_overflow(target_ctx);
+		if (!ocqe)
+			return -ENOMEM;
+	}
+
+	if (unlikely(io_double_lock_ctx(target_ctx, issue_flags))) {
+		kfree(ocqe);
+		return -EAGAIN;
+	}
+
+	ret = __io_fixed_fd_install(target_ctx, msg->src_file, msg->dst_fd);
+	if (ret < 0)
+		goto out;
+
+	msg->src_file = NULL;
+
+	if (!skip_cqe) {
+		ocqe->cqe.flags = 0;
+		io_msg_add_overflow(msg, target_ctx, ocqe, ret);
+		return 0;
+	}
+out:
+	mutex_unlock(&target_ctx->uring_lock);
+	kfree(ocqe);
+	return ret;
+}
+
 static int io_msg_send_fd(struct io_kiocb *req, unsigned int issue_flags)
 {
 	struct io_ring_ctx *target_ctx = req->file->private_data;
@@ -320,8 +355,17 @@ static int io_msg_send_fd(struct io_kiocb *req, unsigned int issue_flags)
 		req->flags |= REQ_F_NEED_CLEANUP;
 	}
 
-	if (io_msg_need_remote(target_ctx))
-		return io_msg_exec_remote(req, io_msg_tw_fd_complete);
+	if (io_msg_need_remote(target_ctx)) {
+		int ret;
+
+		ret = io_msg_install_remote(msg, issue_flags, target_ctx);
+		if (ret == -EAGAIN)
+			return io_msg_exec_remote(req, io_msg_tw_fd_complete);
+		else if (ret < 0)
+			return ret;
+		req->flags &= ~REQ_F_NEED_CLEANUP;
+		return 0;
+	}
 	return io_msg_install_complete(req, issue_flags);
 }
 
-- 
2.43.0



* Re: [PATCH 2/3] io_uring/msg_ring: avoid double indirection task_work for data messages
  2024-05-24 22:58 ` [PATCH 2/3] io_uring/msg_ring: avoid double indirection task_work for data messages Jens Axboe
@ 2024-05-28 13:18   ` Pavel Begunkov
  2024-05-28 14:23     ` Jens Axboe
  2024-05-28 13:32   ` Pavel Begunkov
  1 sibling, 1 reply; 23+ messages in thread
From: Pavel Begunkov @ 2024-05-28 13:18 UTC (permalink / raw)
  To: Jens Axboe, io-uring

On 5/24/24 23:58, Jens Axboe wrote:
> If IORING_SETUP_SINGLE_ISSUER is set, then we can't post CQEs remotely
> to the target ring. Instead, task_work is queued for the target ring,
> which is used to post the CQE. To make matters worse, once the target
> CQE has been posted, task_work is then queued with the originator to
> fill the completion.
> 
> This obviously adds a bunch of overhead and latency. Instead of relying
> on generic kernel task_work for this, fill an overflow entry on the
> target ring and flag it as such that the target ring will flush it. This
> avoids both the task_work for posting the CQE, and it means that the
> originator CQE can be filled inline as well.
> 
> In local testing, this reduces the latency on the sender side by 5-6x.
> 
> Signed-off-by: Jens Axboe <[email protected]>
> ---
>   io_uring/msg_ring.c | 77 +++++++++++++++++++++++++++++++++++++++++++--
>   1 file changed, 74 insertions(+), 3 deletions(-)
> 
> diff --git a/io_uring/msg_ring.c b/io_uring/msg_ring.c
> index feff2b0822cf..3f89ff3a40ad 100644
> --- a/io_uring/msg_ring.c
> +++ b/io_uring/msg_ring.c
> @@ -123,6 +123,69 @@ static void io_msg_tw_complete(struct callback_head *head)
>   	io_req_queue_tw_complete(req, ret);
>   }
>   
> +static struct io_overflow_cqe *io_alloc_overflow(struct io_ring_ctx *target_ctx)
> +{
> +	bool is_cqe32 = target_ctx->flags & IORING_SETUP_CQE32;
> +	size_t cqe_size = sizeof(struct io_overflow_cqe);
> +	struct io_overflow_cqe *ocqe;
> +
> +	if (is_cqe32)
> +		cqe_size += sizeof(struct io_uring_cqe);
> +
> +	ocqe = kmalloc(cqe_size, GFP_ATOMIC | __GFP_ACCOUNT);
> +	if (!ocqe)
> +		return NULL;
> +
> +	if (is_cqe32)
> +		ocqe->cqe.big_cqe[0] = ocqe->cqe.big_cqe[1] = 0;
> +
> +	return ocqe;
> +}
> +
> +/*
> + * Entered with the target uring_lock held, and will drop it before
> + * returning. Adds a previously allocated ocqe to the overflow list on
> + * the target, and marks it appropriately for flushing.
> + */
> +static void io_msg_add_overflow(struct io_msg *msg,
> +				struct io_ring_ctx *target_ctx,
> +				struct io_overflow_cqe *ocqe, int ret)
> +	__releases(target_ctx->uring_lock)
> +{
> +	spin_lock(&target_ctx->completion_lock);
> +
> +	if (list_empty(&target_ctx->cq_overflow_list)) {
> +		set_bit(IO_CHECK_CQ_OVERFLOW_BIT, &target_ctx->check_cq);
> +		atomic_or(IORING_SQ_TASKRUN, &target_ctx->rings->sq_flags);

TASKRUN? The normal overflow path sets IORING_SQ_CQ_OVERFLOW


> +	}
> +
> +	ocqe->cqe.user_data = msg->user_data;
> +	ocqe->cqe.res = ret;
> +	list_add_tail(&ocqe->list, &target_ctx->cq_overflow_list);
> +	spin_unlock(&target_ctx->completion_lock);
> +	mutex_unlock(&target_ctx->uring_lock);
> +	wake_up_state(target_ctx->submitter_task, TASK_INTERRUPTIBLE);
> +}

-- 
Pavel Begunkov


* Re: [PATCHSET 0/3] Improve MSG_RING SINGLE_ISSUER performance
  2024-05-24 22:58 [PATCHSET 0/3] Improve MSG_RING SINGLE_ISSUER performance Jens Axboe
                   ` (2 preceding siblings ...)
  2024-05-24 22:58 ` [PATCH 3/3] io_uring/msg_ring: avoid double indirection task_work for fd passing Jens Axboe
@ 2024-05-28 13:31 ` Pavel Begunkov
  2024-05-28 14:34   ` Jens Axboe
  3 siblings, 1 reply; 23+ messages in thread
From: Pavel Begunkov @ 2024-05-28 13:31 UTC (permalink / raw)
  To: Jens Axboe, io-uring

On 5/24/24 23:58, Jens Axboe wrote:
> Hi,
> 
> A ring setup with IORING_SETUP_SINGLE_ISSUER, which is required to

IORING_SETUP_SINGLE_ISSUER has nothing to do with it, it's
specifically an IORING_SETUP_DEFER_TASKRUN optimisation.

> use IORING_SETUP_DEFER_TASKRUN, will need two round trips through
> generic task_work. This isn't ideal. This patchset attempts to rectify
> that, taking a new approach rather than trying to use the io_uring
> task_work infrastructure to handle it as in previous postings.

Not sure why you'd want to piggyback onto overflows, it's not
such a well made and reliable infra, whereas the DEFER_TASKRUN
part of the task_work approach was fine.

The completion path doesn't usually look at the overflow list
but on cached cqe pointers showing the CQ is full, that means
after you queue an overflow someone may post a CQE in the CQ
in the normal path and you get reordering. Not that bad
considering it's from another ring, but a bit nasty and surely
will bite us back in the future, it always does.

That's assuming you decide io_msg_need_remote() solely based on
->task_complete and remove

	return current != target_ctx->submitter_task;

otherwise you can get two linked msg_ring target CQEs reordered.

It's also duplicating that crappy overflow code nobody cares
much about, and it's still a forced wake up with no batching,
circumventing the normal wake up path, i.e. io_uring tw.

I don't think we should care about the request completion
latency (sender latency), people should be more interested
in the reaction speed (receiver latency), but if you care
about it for a reason, perhaps you can just as well allocate
a new request and complete the previous one right away.

> In a sample test app that has one thread send messages to another and
> logging both the time between sender sending and receiver receiving and
> just the time for the sender to post a message and get the CQE back,
> I see the following sender latencies with the stock kernel:
> 
> Latencies for: Sender
>      percentiles (nsec):
>       |  1.0000th=[ 4384],  5.0000th=[ 4512], 10.0000th=[ 4576],
>       | 20.0000th=[ 4768], 30.0000th=[ 4896], 40.0000th=[ 5024],
>       | 50.0000th=[ 5088], 60.0000th=[ 5152], 70.0000th=[ 5280],
>       | 80.0000th=[ 5344], 90.0000th=[ 5536], 95.0000th=[ 5728],
>       | 99.0000th=[ 8032], 99.5000th=[18048], 99.9000th=[21376],
>       | 99.9500th=[26496], 99.9900th=[59136]
> 
> and with the patches:
> 
> Latencies for: Sender
>      percentiles (nsec):
>       |  1.0000th=[  756],  5.0000th=[  820], 10.0000th=[  828],
>       | 20.0000th=[  844], 30.0000th=[  852], 40.0000th=[  852],
>       | 50.0000th=[  860], 60.0000th=[  860], 70.0000th=[  868],
>       | 80.0000th=[  884], 90.0000th=[  964], 95.0000th=[  988],
>       | 99.0000th=[ 1128], 99.5000th=[ 1208], 99.9000th=[ 1544],
>       | 99.9500th=[ 1944], 99.9900th=[ 2896]
> 
> For the receiving side the win is smaller as it only "suffers" from
> a single generic task_work, about a 10% win in latencies there.
> 
> The idea here is to utilize the CQE overflow infrastructure for this,
> as that allows the right task to post the CQE to the ring.
> 
> 1 is a basic refactoring prep patch, patch 2 adds support for normal
> messages, and patch 3 adopts the same approach for fd passing.
> 
>   io_uring/msg_ring.c | 151 ++++++++++++++++++++++++++++++++++++++++----
>   1 file changed, 138 insertions(+), 13 deletions(-)
> 

-- 
Pavel Begunkov


* Re: [PATCH 2/3] io_uring/msg_ring: avoid double indirection task_work for data messages
  2024-05-24 22:58 ` [PATCH 2/3] io_uring/msg_ring: avoid double indirection task_work for data messages Jens Axboe
  2024-05-28 13:18   ` Pavel Begunkov
@ 2024-05-28 13:32   ` Pavel Begunkov
  2024-05-28 14:23     ` Jens Axboe
  1 sibling, 1 reply; 23+ messages in thread
From: Pavel Begunkov @ 2024-05-28 13:32 UTC (permalink / raw)
  To: Jens Axboe, io-uring

On 5/24/24 23:58, Jens Axboe wrote:
> If IORING_SETUP_SINGLE_ISSUER is set, then we can't post CQEs remotely
> to the target ring. Instead, task_work is queued for the target ring,
> which is used to post the CQE. To make matters worse, once the target
> CQE has been posted, task_work is then queued with the originator to
> fill the completion.
> 
> This obviously adds a bunch of overhead and latency. Instead of relying
> on generic kernel task_work for this, fill an overflow entry on the
> target ring and flag it as such that the target ring will flush it. This
> avoids both the task_work for posting the CQE, and it means that the
> originator CQE can be filled inline as well.
> 
> In local testing, this reduces the latency on the sender side by 5-6x.
> 
> Signed-off-by: Jens Axboe <[email protected]>
> ---
>   io_uring/msg_ring.c | 77 +++++++++++++++++++++++++++++++++++++++++++--
>   1 file changed, 74 insertions(+), 3 deletions(-)
> 
> diff --git a/io_uring/msg_ring.c b/io_uring/msg_ring.c
> index feff2b0822cf..3f89ff3a40ad 100644
> --- a/io_uring/msg_ring.c
> +++ b/io_uring/msg_ring.c
> @@ -123,6 +123,69 @@ static void io_msg_tw_complete(struct callback_head *head)
>   	io_req_queue_tw_complete(req, ret);
>   }
>   
> +static struct io_overflow_cqe *io_alloc_overflow(struct io_ring_ctx *target_ctx)
> +{
> +	bool is_cqe32 = target_ctx->flags & IORING_SETUP_CQE32;
> +	size_t cqe_size = sizeof(struct io_overflow_cqe);
> +	struct io_overflow_cqe *ocqe;
> +
> +	if (is_cqe32)
> +		cqe_size += sizeof(struct io_uring_cqe);
> +
> +	ocqe = kmalloc(cqe_size, GFP_ATOMIC | __GFP_ACCOUNT);

__GFP_ACCOUNT looks painful

> +	if (!ocqe)
> +		return NULL;
> +
> +	if (is_cqe32)
> +		ocqe->cqe.big_cqe[0] = ocqe->cqe.big_cqe[1] = 0;
> +
> +	return ocqe;
> +}
> +
...

-- 
Pavel Begunkov


* Re: [PATCH 2/3] io_uring/msg_ring: avoid double indirection task_work for data messages
  2024-05-28 13:32   ` Pavel Begunkov
@ 2024-05-28 14:23     ` Jens Axboe
  2024-05-28 16:23       ` Pavel Begunkov
  0 siblings, 1 reply; 23+ messages in thread
From: Jens Axboe @ 2024-05-28 14:23 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 5/28/24 7:32 AM, Pavel Begunkov wrote:
> On 5/24/24 23:58, Jens Axboe wrote:
>> If IORING_SETUP_SINGLE_ISSUER is set, then we can't post CQEs remotely
>> to the target ring. Instead, task_work is queued for the target ring,
>> which is used to post the CQE. To make matters worse, once the target
>> CQE has been posted, task_work is then queued with the originator to
>> fill the completion.
>>
>> This obviously adds a bunch of overhead and latency. Instead of relying
>> on generic kernel task_work for this, fill an overflow entry on the
>> target ring and flag it as such that the target ring will flush it. This
>> avoids both the task_work for posting the CQE, and it means that the
>> originator CQE can be filled inline as well.
>>
>> In local testing, this reduces the latency on the sender side by 5-6x.
>>
>> Signed-off-by: Jens Axboe <[email protected]>
>> ---
>>   io_uring/msg_ring.c | 77 +++++++++++++++++++++++++++++++++++++++++++--
>>   1 file changed, 74 insertions(+), 3 deletions(-)
>>
>> diff --git a/io_uring/msg_ring.c b/io_uring/msg_ring.c
>> index feff2b0822cf..3f89ff3a40ad 100644
>> --- a/io_uring/msg_ring.c
>> +++ b/io_uring/msg_ring.c
>> @@ -123,6 +123,69 @@ static void io_msg_tw_complete(struct callback_head *head)
>>       io_req_queue_tw_complete(req, ret);
>>   }
>>   +static struct io_overflow_cqe *io_alloc_overflow(struct io_ring_ctx *target_ctx)
>> +{
>> +    bool is_cqe32 = target_ctx->flags & IORING_SETUP_CQE32;
>> +    size_t cqe_size = sizeof(struct io_overflow_cqe);
>> +    struct io_overflow_cqe *ocqe;
>> +
>> +    if (is_cqe32)
>> +        cqe_size += sizeof(struct io_uring_cqe);
>> +
>> +    ocqe = kmalloc(cqe_size, GFP_ATOMIC | __GFP_ACCOUNT);
> 
> __GFP_ACCOUNT looks painful

It always is - I did add the usual alloc cache for this after posting
this series, which makes it a no-op basically:

https://git.kernel.dk/cgit/linux/commit/?h=io_uring-msg_ring&id=c39ead262b60872d6d7daf55e9fc7d76dc09b29d

Just haven't posted a v2 yet.

-- 
Jens Axboe



* Re: [PATCH 2/3] io_uring/msg_ring: avoid double indirection task_work for data messages
  2024-05-28 13:18   ` Pavel Begunkov
@ 2024-05-28 14:23     ` Jens Axboe
  0 siblings, 0 replies; 23+ messages in thread
From: Jens Axboe @ 2024-05-28 14:23 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 5/28/24 7:18 AM, Pavel Begunkov wrote:
> On 5/24/24 23:58, Jens Axboe wrote:
>> If IORING_SETUP_SINGLE_ISSUER is set, then we can't post CQEs remotely
>> to the target ring. Instead, task_work is queued for the target ring,
>> which is used to post the CQE. To make matters worse, once the target
>> CQE has been posted, task_work is then queued with the originator to
>> fill the completion.
>>
>> This obviously adds a bunch of overhead and latency. Instead of relying
>> on generic kernel task_work for this, fill an overflow entry on the
>> target ring and flag it as such that the target ring will flush it. This
>> avoids both the task_work for posting the CQE, and it means that the
>> originator CQE can be filled inline as well.
>>
>> In local testing, this reduces the latency on the sender side by 5-6x.
>>
>> Signed-off-by: Jens Axboe <[email protected]>
>> ---
>>   io_uring/msg_ring.c | 77 +++++++++++++++++++++++++++++++++++++++++++--
>>   1 file changed, 74 insertions(+), 3 deletions(-)
>>
>> diff --git a/io_uring/msg_ring.c b/io_uring/msg_ring.c
>> index feff2b0822cf..3f89ff3a40ad 100644
>> --- a/io_uring/msg_ring.c
>> +++ b/io_uring/msg_ring.c
>> @@ -123,6 +123,69 @@ static void io_msg_tw_complete(struct callback_head *head)
>>       io_req_queue_tw_complete(req, ret);
>>   }
>>   +static struct io_overflow_cqe *io_alloc_overflow(struct io_ring_ctx *target_ctx)
>> +{
>> +    bool is_cqe32 = target_ctx->flags & IORING_SETUP_CQE32;
>> +    size_t cqe_size = sizeof(struct io_overflow_cqe);
>> +    struct io_overflow_cqe *ocqe;
>> +
>> +    if (is_cqe32)
>> +        cqe_size += sizeof(struct io_uring_cqe);
>> +
>> +    ocqe = kmalloc(cqe_size, GFP_ATOMIC | __GFP_ACCOUNT);
>> +    if (!ocqe)
>> +        return NULL;
>> +
>> +    if (is_cqe32)
>> +        ocqe->cqe.big_cqe[0] = ocqe->cqe.big_cqe[1] = 0;
>> +
>> +    return ocqe;
>> +}
>> +
>> +/*
>> + * Entered with the target uring_lock held, and will drop it before
>> + * returning. Adds a previously allocated ocqe to the overflow list on
>> + * the target, and marks it appropriately for flushing.
>> + */
>> +static void io_msg_add_overflow(struct io_msg *msg,
>> +                struct io_ring_ctx *target_ctx,
>> +                struct io_overflow_cqe *ocqe, int ret)
>> +    __releases(target_ctx->uring_lock)
>> +{
>> +    spin_lock(&target_ctx->completion_lock);
>> +
>> +    if (list_empty(&target_ctx->cq_overflow_list)) {
>> +        set_bit(IO_CHECK_CQ_OVERFLOW_BIT, &target_ctx->check_cq);
>> +        atomic_or(IORING_SQ_TASKRUN, &target_ctx->rings->sq_flags);
> 
> TASKRUN? The normal overflow path sets IORING_SQ_CQ_OVERFLOW

Was a bit split on it - we want it run as part of waiting, but I also
wasn't super interested in exposing it as an overflow condition, since
it's really more of an internal implementation detail.

-- 
Jens Axboe



* Re: [PATCHSET 0/3] Improve MSG_RING SINGLE_ISSUER performance
  2024-05-28 13:31 ` [PATCHSET 0/3] Improve MSG_RING SINGLE_ISSUER performance Pavel Begunkov
@ 2024-05-28 14:34   ` Jens Axboe
  2024-05-28 14:39     ` Jens Axboe
                       ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Jens Axboe @ 2024-05-28 14:34 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 5/28/24 7:31 AM, Pavel Begunkov wrote:
> On 5/24/24 23:58, Jens Axboe wrote:
>> Hi,
>>
>> A ring setup with IORING_SETUP_SINGLE_ISSUER, which is required to
> 
> IORING_SETUP_SINGLE_ISSUER has nothing to do with it, it's
> specifically an IORING_SETUP_DEFER_TASKRUN optimisation.

Right, I should change that in the commit message. It's task_complete
driving it, which is tied to DEFER_TASKRUN.

>> use IORING_SETUP_DEFER_TASKRUN, will need two round trips through
>> generic task_work. This isn't ideal. This patchset attempts to rectify
>> that, taking a new approach rather than trying to use the io_uring
>> task_work infrastructure to handle it as in previous postings.
> 
> Not sure why you'd want to piggyback onto overflows, it's not
> such a well made and reliable infra, whereas the DEFER_TASKRUN
> part of the task_work approach was fine.

It's not right now, because it's really a "don't get into this
condition, if you do, things are slower". And this series doesn't really
change that, and honestly it doesn't even need to. It's still way better
than what we had before in terms of DEFER_TASKRUN and messages.

We could certainly make the messages a subset of real overflow if we
wanted, but honestly it looks decent enough as-is with the changes that
I'm not hugely motivated to do that.

> The completion path doesn't usually look at the overflow list
> but on cached cqe pointers showing the CQ is full, that means
> after you queue an overflow someone may post a CQE in the CQ
> in the normal path and you get reordering. Not that bad
> considering it's from another ring, but a bit nasty and surely
> will bite us back in the future, it always does.

This is true, but that's generally true anyway, as completions come in async.
You don't get to control the exact order on a remote ring. Messages
themselves will be ordered when posted, which should be the important
aspect here. Order with locally posted completions I don't see why you'd
care, that's a timing game that you cannot control.

> That's assuming you decide io_msg_need_remote() solely based
> ->task_complete and remove
> 
>     return current != target_ctx->submitter_task;
> 
> otherwise you can get two linked msg_ring target CQEs reordered.

Good point, since it'd now be cheap enough, would probably make sense to
simply gate it on task_complete alone. I even toyed with the idea of
just using this approach for any ring type and kill some code in there,
but didn't venture that far yet.

> It's also duplicating that crappy overflow code nobody cares
> much about, and it's still a forced wake up with no batching,
> circumventing the normal wake up path, i.e. io_uring tw.

Yes, since this is v1 I didn't bother to integrate more tightly with the
generic overflows, that should obviously be done by first adding a
helper for this. I consider that pretty minor.

> I don't think we should care about the request completion
> latency (sender latency), people should be more interested
> in the reaction speed (receiver latency), but if you care
> about it for a reason, perhaps you can just as well allocate
> a new request and complete the previous one right away.

I know the numbers I posted were just sender side improvements on that
particular box; however, that isn't really the case for others. Here it
is on another test box:

axboe@r7525 ~> ./msg-lat
init_flags=3000
Wait on startup
802775: my fd 3, other 4
802776: my fd 4, other 3
Latencies for: Receiver
    percentiles (nsec):
     |  1.0000th=[ 4192],  5.0000th=[ 4320], 10.0000th=[ 4448],
     | 20.0000th=[ 4576], 30.0000th=[ 4704], 40.0000th=[ 4832],
     | 50.0000th=[ 4960], 60.0000th=[ 5088], 70.0000th=[ 5216],
     | 80.0000th=[ 5344], 90.0000th=[ 5536], 95.0000th=[ 5728],
     | 99.0000th=[ 6176], 99.5000th=[ 7136], 99.9000th=[19584],
     | 99.9500th=[20352], 99.9900th=[28288]
Latencies for: Sender
    percentiles (nsec):
     |  1.0000th=[ 6560],  5.0000th=[ 6880], 10.0000th=[ 7008],
     | 20.0000th=[ 7264], 30.0000th=[ 7456], 40.0000th=[ 7712],
     | 50.0000th=[ 8032], 60.0000th=[ 8256], 70.0000th=[ 8512],
     | 80.0000th=[ 8640], 90.0000th=[ 8896], 95.0000th=[ 9152],
     | 99.0000th=[ 9792], 99.5000th=[11584], 99.9000th=[23168],
     | 99.9500th=[28032], 99.9900th=[40192]

and after:

axboe@r7525 ~> ./msg-lat                                                       1.776s
init_flags=3000
Wait on startup
3767: my fd 3, other 4
3768: my fd 4, other 3
Latencies for: Sender
    percentiles (nsec):
     |  1.0000th=[  740],  5.0000th=[  748], 10.0000th=[  756],
     | 20.0000th=[  764], 30.0000th=[  764], 40.0000th=[  772],
     | 50.0000th=[  772], 60.0000th=[  780], 70.0000th=[  780],
     | 80.0000th=[  860], 90.0000th=[  892], 95.0000th=[  900],
     | 99.0000th=[ 1224], 99.5000th=[ 1368], 99.9000th=[ 1656],
     | 99.9500th=[ 1976], 99.9900th=[ 3408]
Latencies for: Receiver
    percentiles (nsec):
     |  1.0000th=[ 2736],  5.0000th=[ 2736], 10.0000th=[ 2768],
     | 20.0000th=[ 2800], 30.0000th=[ 2800], 40.0000th=[ 2800],
     | 50.0000th=[ 2832], 60.0000th=[ 2832], 70.0000th=[ 2896],
     | 80.0000th=[ 2928], 90.0000th=[ 3024], 95.0000th=[ 3120],
     | 99.0000th=[ 4080], 99.5000th=[15424], 99.9000th=[18560],
     | 99.9500th=[21632], 99.9900th=[58624]

Obviously some variation in runs in general, but it's most certainly
faster in terms of receiving too. This test case is fixed at doing 100K
messages per second, didn't do any peak testing. But I strongly suspect
you'll see very nice efficiency gains here too, as doing two TWA_SIGNAL
task_work is pretty sucky imho.

You could just make it io_kiocb based, but I did not want to get into
foreign requests on remote rings. What would you envision with that
approach, using our normal ring task_work for this instead? That would
be an approach, obviously this one took a different path from the
previous task_work driven approach.

-- 
Jens Axboe



* Re: [PATCHSET 0/3] Improve MSG_RING SINGLE_ISSUER performance
  2024-05-28 14:34   ` Jens Axboe
@ 2024-05-28 14:39     ` Jens Axboe
  2024-05-28 15:27     ` Jens Axboe
  2024-05-28 16:50     ` Pavel Begunkov
  2 siblings, 0 replies; 23+ messages in thread
From: Jens Axboe @ 2024-05-28 14:39 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 5/28/24 8:34 AM, Jens Axboe wrote:
> You could just make it io_kiocb based, but I did not want to get into
> foreign requests on remote rings. What would you envision with that
> approach, using our normal ring task_work for this instead? That would
> be an approach, obviously this one took a different path from the
> previous task_work driven approach.

Foreign io_kiocb itself isn't even enough, you'd need an overflow cqe
allocated upfront too for posting directly without waiting for the
reply. Basically it needs to be confident it'll post successfully on the
remote ring, otherwise we cannot complete the source msg ring sqe
inline.

-- 
Jens Axboe



* Re: [PATCHSET 0/3] Improve MSG_RING SINGLE_ISSUER performance
  2024-05-28 14:34   ` Jens Axboe
  2024-05-28 14:39     ` Jens Axboe
@ 2024-05-28 15:27     ` Jens Axboe
  2024-05-28 16:50     ` Pavel Begunkov
  2 siblings, 0 replies; 23+ messages in thread
From: Jens Axboe @ 2024-05-28 15:27 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 5/28/24 8:34 AM, Jens Axboe wrote:
> Obviously some variation in runs in general, but it's most certainly
> faster in terms of receiving too. This test case is fixed at doing 100K
> messages per second, didn't do any peak testing. But I strongly suspect
> you'll see very nice efficiency gains here too, as doing two TWA_SIGNAL
> task_work is pretty sucky imho.

An example on that same box: running the same message passer test, sys
time used for a 3 second run varies between 300ms and 850ms on the stock
kernel (with a lot of variance); with the patches it's around 150ms and
pretty steady.

-- 
Jens Axboe



* Re: [PATCH 2/3] io_uring/msg_ring: avoid double indirection task_work for data messages
  2024-05-28 14:23     ` Jens Axboe
@ 2024-05-28 16:23       ` Pavel Begunkov
  2024-05-28 17:59         ` Jens Axboe
  0 siblings, 1 reply; 23+ messages in thread
From: Pavel Begunkov @ 2024-05-28 16:23 UTC (permalink / raw)
  To: Jens Axboe, io-uring

On 5/28/24 15:23, Jens Axboe wrote:
> On 5/28/24 7:32 AM, Pavel Begunkov wrote:
>> On 5/24/24 23:58, Jens Axboe wrote:
>>> If IORING_SETUP_SINGLE_ISSUER is set, then we can't post CQEs remotely
>>> to the target ring. Instead, task_work is queued for the target ring,
>>> which is used to post the CQE. To make matters worse, once the target
>>> CQE has been posted, task_work is then queued with the originator to
>>> fill the completion.
>>>
>>> This obviously adds a bunch of overhead and latency. Instead of relying
>>> on generic kernel task_work for this, fill an overflow entry on the
>>> target ring and flag it as such that the target ring will flush it. This
>>> avoids both the task_work for posting the CQE, and it means that the
>>> originator CQE can be filled inline as well.
>>>
>>> In local testing, this reduces the latency on the sender side by 5-6x.
>>>
>>> Signed-off-by: Jens Axboe <[email protected]>
>>> ---
>>>    io_uring/msg_ring.c | 77 +++++++++++++++++++++++++++++++++++++++++++--
>>>    1 file changed, 74 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/io_uring/msg_ring.c b/io_uring/msg_ring.c
>>> index feff2b0822cf..3f89ff3a40ad 100644
>>> --- a/io_uring/msg_ring.c
>>> +++ b/io_uring/msg_ring.c
>>> @@ -123,6 +123,69 @@ static void io_msg_tw_complete(struct callback_head *head)
>>>        io_req_queue_tw_complete(req, ret);
>>>    }
>>>    +static struct io_overflow_cqe *io_alloc_overflow(struct io_ring_ctx *target_ctx)
>>> +{
>>> +    bool is_cqe32 = target_ctx->flags & IORING_SETUP_CQE32;
>>> +    size_t cqe_size = sizeof(struct io_overflow_cqe);
>>> +    struct io_overflow_cqe *ocqe;
>>> +
>>> +    if (is_cqe32)
>>> +        cqe_size += sizeof(struct io_uring_cqe);
>>> +
>>> +    ocqe = kmalloc(cqe_size, GFP_ATOMIC | __GFP_ACCOUNT);
>>
>> __GFP_ACCOUNT looks painful
> 
> It always is - I did add the usual alloc cache for this after posting
> this series, which makes it a no-op basically:

Simple ring private cache wouldn't work so well with non
uniform transfer distributions. One way messaging, userspace
level batching, etc., but the main question is in the other
email, i.e. maybe it's better to go with the 2 tw hop model,
which returns memory back where it came from.

> https://git.kernel.dk/cgit/linux/commit/?h=io_uring-msg_ring&id=c39ead262b60872d6d7daf55e9fc7d76dc09b29d
> 
> Just haven't posted a v2 yet.
> 

-- 
Pavel Begunkov


* Re: [PATCHSET 0/3] Improve MSG_RING SINGLE_ISSUER performance
  2024-05-28 14:34   ` Jens Axboe
  2024-05-28 14:39     ` Jens Axboe
  2024-05-28 15:27     ` Jens Axboe
@ 2024-05-28 16:50     ` Pavel Begunkov
  2024-05-28 18:07       ` Jens Axboe
  2 siblings, 1 reply; 23+ messages in thread
From: Pavel Begunkov @ 2024-05-28 16:50 UTC (permalink / raw)
  To: Jens Axboe, io-uring

On 5/28/24 15:34, Jens Axboe wrote:
> On 5/28/24 7:31 AM, Pavel Begunkov wrote:
>> On 5/24/24 23:58, Jens Axboe wrote:
>>> Hi,
>>>
>>> A ring setup with IORING_SETUP_SINGLE_ISSUER, which is required to
>>
>> IORING_SETUP_SINGLE_ISSUER has nothing to do with it, it's
>> specifically an IORING_SETUP_DEFER_TASKRUN optimisation.
> 
> Right, I should change that in the commit message. It's task_complete
> driving it, which is tied to DEFER_TASKRUN.
> 
>>> use IORING_SETUP_DEFER_TASKRUN, will need two round trips through
>>> generic task_work. This isn't ideal. This patchset attempts to rectify
>>> that, taking a new approach rather than trying to use the io_uring
>>> task_work infrastructure to handle it as in previous postings.
>>
>> Not sure why you'd want to piggyback onto overflows, it's not
>> such a well made and reliable infra, whereas the DEFER_TASKRUN
>> part of the task_work approach was fine.
> 
> It's not right now, because it's really a "don't get into this
> condition, if you do, things are slower". And this series doesn't really
> change that, and honestly it doesn't even need to. It's still way better
> than what we had before in terms of DEFER_TASKRUN and messages.

Better than how it is now or comparing to the previous attempt?
I think the one using io_uring's tw infra was better, which is
where all wake ups and optimisations currently consolidate.

> We could certainly make the messages a subset of real overflow if we
> wanted, but honestly it looks decent enough as-is with the changes that
> I'm not hugely motivated to do that.

Not sure what you mean here, but not really suggesting refactoring
overflows. Taking the tw based patches should be easy, it only
needs killing !DEFER_TASKRUN changes from there.


>> The completion path doesn't usually look at the overflow list
>> but on cached cqe pointers showing the CQ is full, that means
>> after you queue an overflow someone may post a CQE in the CQ
>> in the normal path and you get reordering. Not that bad
>> considering it's from another ring, but a bit nasty and surely
>> will bite us back in the future, it always does.
> 
> This is true, but generally true as well as completions come in async.
> You don't get to control the exact order on a remote ring. Messages
> themselves will be ordered when posted, which should be the important
> aspect here. Order with locally posted completions I don't see why you'd
> care, that's a timing game that you cannot control.

True for a single request, but in a more complex system
sender's ordering will affect the order on the receiving side.

ring1: msg_ring(); write(pipe)
ring2: read(pipe)

The user would definitely think that the other ring will
first get a msg_ring CQE and then a CQE from the read, but as
always it's hard to predict all repercussions.
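
As a minimal sketch of that scenario, assuming liburing (both rings in
one thread here for brevity; in the case described they would belong to
different tasks):

#include <liburing.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	struct io_uring ring1, ring2;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	char buf[8];
	int i, pfd[2];

	if (io_uring_queue_init(8, &ring1, 0) ||
	    io_uring_queue_init(8, &ring2, 0) || pipe(pfd))
		return 1;

	/* ring2: read on the pipe, submitted up front (user_data 2) */
	sqe = io_uring_get_sqe(&ring2);
	io_uring_prep_read(sqe, pfd[0], buf, sizeof(buf), 0);
	io_uring_sqe_set_data64(sqe, 2);
	io_uring_submit(&ring2);

	/* ring1: msg_ring to ring2 (user_data 1 on ring2), then write(pipe) */
	sqe = io_uring_get_sqe(&ring1);
	io_uring_prep_msg_ring(sqe, ring2.ring_fd, 0, 1, 0);
	io_uring_submit(&ring1);
	write(pfd[1], "x", 1);

	/* the naive expectation is user_data 1 first, then 2 */
	for (i = 0; i < 2; i++) {
		if (io_uring_wait_cqe(&ring2, &cqe))
			break;
		printf("ring2 CQE user_data=%llu res=%d\n",
		       (unsigned long long) cqe->user_data, cqe->res);
		io_uring_cqe_seen(&ring2, cqe);
	}
	return 0;
}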

>> That's assuming you decide io_msg_need_remote() solely based
>> ->task_complete and remove
>>
>>      return current != target_ctx->submitter_task;
>>
>> otherwise you can get two linked msg_ring target CQEs reordered.
> 
> Good point, since it'd now be cheap enough, would probably make sense to
> simply gate it on task_complete alone. I even toyed with the idea of
> just using this approach for any ring type and kill some code in there,
> but didn't venture that far yet.

That task check doesn't make any real difference. If it's the
same thread you can skip io_uring altogether.

>> It's also duplicating that crappy overflow code nobody cares
>> much about, and it's still a forced wake up with no batching,
>> circumventing the normal wake up path, i.e. io_uring tw.
> 
> Yes, since this is v1 I didn't bother to integrate more tightly with the
> generic overflows, that should obviously be done by first adding a
> helper for this. I consider that pretty minor.

My problem is not duplication of code base but rather
extending the internal user base of it. You can say that
msg_ring can easily become a hot path for some, and
then we'll be putting efforts both into overflows and
task_work when in essence they solve quite a similar
problem here.

>> I don't think we should care about the request completion
>> latency (sender latency), people should be more interested
>> in the reaction speed (receiver latency), but if you care
>> about it for a reason, perhaps you can just as well allocate
>> a new request and complete the previous one right away.
> 
> I know the numbers I posted was just sender side improvements on that
> particular box, however that isn't really the case for others. Here's on
> an another test box:
> 
> axboe@r7525 ~> ./msg-lat
> init_flags=3000
> Wait on startup
> 802775: my fd 3, other 4
> 802776: my fd 4, other 3
> Latencies for: Receiver
>      percentiles (nsec):
>       |  1.0000th=[ 4192],  5.0000th=[ 4320], 10.0000th=[ 4448],
>       | 20.0000th=[ 4576], 30.0000th=[ 4704], 40.0000th=[ 4832],
>       | 50.0000th=[ 4960], 60.0000th=[ 5088], 70.0000th=[ 5216],
>       | 80.0000th=[ 5344], 90.0000th=[ 5536], 95.0000th=[ 5728],
>       | 99.0000th=[ 6176], 99.5000th=[ 7136], 99.9000th=[19584],
>       | 99.9500th=[20352], 99.9900th=[28288]
> Latencies for: Sender
>      percentiles (nsec):
>       |  1.0000th=[ 6560],  5.0000th=[ 6880], 10.0000th=[ 7008],
>       | 20.0000th=[ 7264], 30.0000th=[ 7456], 40.0000th=[ 7712],
>       | 50.0000th=[ 8032], 60.0000th=[ 8256], 70.0000th=[ 8512],
>       | 80.0000th=[ 8640], 90.0000th=[ 8896], 95.0000th=[ 9152],
>       | 99.0000th=[ 9792], 99.5000th=[11584], 99.9000th=[23168],
>       | 99.9500th=[28032], 99.9900th=[40192]
> 
> and after:
> 
> axboe@r7525 ~> ./msg-lat                                                       1.776s
> init_flags=3000
> Wait on startup
> 3767: my fd 3, other 4
> 3768: my fd 4, other 3
> Latencies for: Sender
>      percentiles (nsec):
>       |  1.0000th=[  740],  5.0000th=[  748], 10.0000th=[  756],
>       | 20.0000th=[  764], 30.0000th=[  764], 40.0000th=[  772],
>       | 50.0000th=[  772], 60.0000th=[  780], 70.0000th=[  780],
>       | 80.0000th=[  860], 90.0000th=[  892], 95.0000th=[  900],
>       | 99.0000th=[ 1224], 99.5000th=[ 1368], 99.9000th=[ 1656],
>       | 99.9500th=[ 1976], 99.9900th=[ 3408]
> Latencies for: Receiver
>      percentiles (nsec):
>       |  1.0000th=[ 2736],  5.0000th=[ 2736], 10.0000th=[ 2768],
>       | 20.0000th=[ 2800], 30.0000th=[ 2800], 40.0000th=[ 2800],
>       | 50.0000th=[ 2832], 60.0000th=[ 2832], 70.0000th=[ 2896],
>       | 80.0000th=[ 2928], 90.0000th=[ 3024], 95.0000th=[ 3120],
>       | 99.0000th=[ 4080], 99.5000th=[15424], 99.9000th=[18560],
>       | 99.9500th=[21632], 99.9900th=[58624]
> 
> Obviously some variation in runs in general, but it's most certainly
> faster in terms of receiving too. This test case is fixed at doing 100K
> messages per second, didn't do any peak testing. But I strongly suspect
> you'll see very nice efficiency gains here too, as doing two TWA_SIGNAL
> task_work is pretty sucky imho.

Sure, you mentioned wins on the receiver side, I consider it
to be the main merit (latency and throughput)

> You could just make it io_kiocb based, but I did not want to get into
> foreign requests on remote rings. What would you envision with that
> approach, using our normal ring task_work for this instead? That would

It was buggy in the !DEFER_TASKRUN path. Fortunately, you don't care
about it because it just does it all under ->completion_lock, which
is why you shouldn't have ever hit the problem in testing.

> be an approach, obviously this one took a different path from the
> previous task_work driven approach.

-- 
Pavel Begunkov


* Re: [PATCH 2/3] io_uring/msg_ring: avoid double indirection task_work for data messages
  2024-05-28 16:23       ` Pavel Begunkov
@ 2024-05-28 17:59         ` Jens Axboe
  2024-05-29  2:04           ` Pavel Begunkov
  0 siblings, 1 reply; 23+ messages in thread
From: Jens Axboe @ 2024-05-28 17:59 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 5/28/24 10:23 AM, Pavel Begunkov wrote:
> On 5/28/24 15:23, Jens Axboe wrote:
>> On 5/28/24 7:32 AM, Pavel Begunkov wrote:
>>> On 5/24/24 23:58, Jens Axboe wrote:
>>>> If IORING_SETUP_SINGLE_ISSUER is set, then we can't post CQEs remotely
>>>> to the target ring. Instead, task_work is queued for the target ring,
>>>> which is used to post the CQE. To make matters worse, once the target
>>>> CQE has been posted, task_work is then queued with the originator to
>>>> fill the completion.
>>>>
>>>> This obviously adds a bunch of overhead and latency. Instead of relying
>>>> on generic kernel task_work for this, fill an overflow entry on the
>>>> target ring and flag it as such that the target ring will flush it. This
>>>> avoids both the task_work for posting the CQE, and it means that the
>>>> originator CQE can be filled inline as well.
>>>>
>>>> In local testing, this reduces the latency on the sender side by 5-6x.
>>>>
>>>> Signed-off-by: Jens Axboe <[email protected]>
>>>> ---
>>>>    io_uring/msg_ring.c | 77 +++++++++++++++++++++++++++++++++++++++++++--
>>>>    1 file changed, 74 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/io_uring/msg_ring.c b/io_uring/msg_ring.c
>>>> index feff2b0822cf..3f89ff3a40ad 100644
>>>> --- a/io_uring/msg_ring.c
>>>> +++ b/io_uring/msg_ring.c
>>>> @@ -123,6 +123,69 @@ static void io_msg_tw_complete(struct callback_head *head)
>>>>        io_req_queue_tw_complete(req, ret);
>>>>    }
>>>>    +static struct io_overflow_cqe *io_alloc_overflow(struct io_ring_ctx *target_ctx)
>>>> +{
>>>> +    bool is_cqe32 = target_ctx->flags & IORING_SETUP_CQE32;
>>>> +    size_t cqe_size = sizeof(struct io_overflow_cqe);
>>>> +    struct io_overflow_cqe *ocqe;
>>>> +
>>>> +    if (is_cqe32)
>>>> +        cqe_size += sizeof(struct io_uring_cqe);
>>>> +
>>>> +    ocqe = kmalloc(cqe_size, GFP_ATOMIC | __GFP_ACCOUNT);
>>>
>>> __GFP_ACCOUNT looks painful
>>
>> It always is - I did add the usual alloc cache for this after posting
>> this series, which makes it a no-op basically:
> 
> Simple ring private cache wouldn't work so well with non
> uniform transfer distributions. One way messaging, userspace
> level batching, etc., but the main question is in the other
> email, i.e. maybe it's better to go with the 2 tw hop model,
> which returns memory back where it came from.

The cache is local to the ring, so anyone that sends messages to that
ring gets to use it. So I believe it should in fact work really well. If
messaging is bidirectional, then caching on the target will apply in
both directions.

-- 
Jens Axboe



* Re: [PATCHSET 0/3] Improve MSG_RING SINGLE_ISSUER performance
  2024-05-28 16:50     ` Pavel Begunkov
@ 2024-05-28 18:07       ` Jens Axboe
  2024-05-28 18:31         ` Jens Axboe
  0 siblings, 1 reply; 23+ messages in thread
From: Jens Axboe @ 2024-05-28 18:07 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 5/28/24 10:50 AM, Pavel Begunkov wrote:
> On 5/28/24 15:34, Jens Axboe wrote:
>> On 5/28/24 7:31 AM, Pavel Begunkov wrote:
>>> On 5/24/24 23:58, Jens Axboe wrote:
>>>> Hi,
>>>>
>>>> A ring setup with IORING_SETUP_SINGLE_ISSUER, which is required to
>>>
>>> IORING_SETUP_SINGLE_ISSUER has nothing to do with it, it's
>>> specifically an IORING_SETUP_DEFER_TASKRUN optimisation.
>>
>> Right, I should change that in the commit message. It's task_complete
>> driving it, which is tied to DEFER_TASKRUN.
>>
>>>> use IORING_SETUP_DEFER_TASKRUN, will need two round trips through
>>>> generic task_work. This isn't ideal. This patchset attempts to rectify
>>>> that, taking a new approach rather than trying to use the io_uring
>>>> task_work infrastructure to handle it as in previous postings.
>>>
>>> Not sure why you'd want to piggyback onto overflows, it's not
>>> such a well made and reliable infra, whereas the DEFER_TASKRUN
>>> part of the task_work approach was fine.
>>
>> It's not right now, because it's really a "don't get into this
>> condition, if you do, things are slower". And this series doesn't really
>> change that, and honestly it doesn't even need to. It's still way better
>> than what we had before in terms of DEFER_TASKRUN and messages.
> 
> Better than how it is now or comparing to the previous attempt?
> I think the one using io_uring's tw infra was better, which is
> where all wake ups and optimisations currently consolidate.

Better than both - I haven't tested with the previous version, but I can
certainly do that. The reason why I think it'll be better is that it
avoids the double roundtrips. Yes v1 was using normal task_work which is
better, but it didn't solve what I think is the fundamental issue here.

I'll forward port it and give it a spin, then we'll know.

>> We could certainly make the messages a subset of real overflow if we
>> wanted, but honestly it looks decent enough as-is with the changes that
>> I'm not hugely motivated to do that.
> 
> Not sure what you mean here, but not really suggesting refactoring
> overflows. Taking the tw based patches should be easy, it only
> needs killing !DEFER_TASKRUN changes from there.

Sorry wasn't clear, the refactoring was purely my suggestion to reduce a
bit of the code duplication.

>>> The completion path doesn't usually look at the overflow list
>>> but on cached cqe pointers showing the CQ is full, that means
>>> after you queue an overflow someone may post a CQE in the CQ
>>> in the normal path and you get reordering. Not that bad
>>> considering it's from another ring, but a bit nasty and surely
>>> will bite us back in the future, it always does.
>>
>> This is true, but generally true as well as completions come in async.
>> You don't get to control the exact order on a remote ring. Messages
>> themselves will be ordered when posted, which should be the important
>> aspect here. Order with locally posted completions I don't see why you'd
>> care, that's a timing game that you cannot control.
> 
> True for a single request, but in a more complex system
> sender's ordering will affect the order on the receiving side.
> 
> ring1: msg_ring(); write(pipe)
> ring2: read(pipe)
> 
> The user would definitely think that the other ring will
> first get a msg_ring CQE and then a CQE from the read, but as
> always it's hard to predict all repercussions.

Nobody should be making assumptions like that, and that will in fact
already not be the case. If the msg_ring fails to lock the remote ring,
then it may very well end up in the hands of io-wq. And then you could
get either result validly, msg CQE first or last.

>>> That's assuming you decide io_msg_need_remote() solely based
>>> ->task_complete and remove
>>>
>>>      return current != target_ctx->submitter_task;
>>>
>>> otherwise you can get two linked msg_ring target CQEs reordered.
>>
>> Good point, since it'd now be cheap enough, would probably make sense to
>> simply gate it on task_complete alone. I even toyed with the idea of
>> just using this approach for any ring type and kill some code in there,
>> but didn't venture that far yet.
> 
> That task check doesn't make any real difference. If it's the
> same thread you can skip io_uring altogether.

Yeah agree, I did add a patch in between after the last email to just
remove the task check. It's not really a useful case to attempt to cater
to in particular.

>>> It's also duplicating that crappy overflow code nobody cares
>>> much about, and it's still a forced wake up with no batching,
>>> circumventing the normal wake up path, i.e. io_uring tw.
>>
>> Yes, since this is v1 I didn't bother to integrate more tightly with the
>> generic overflows, that should obviously be done by first adding a
>> helper for this. I consider that pretty minor.
> 
> My problem is not duplication of code base but rather
> extending the internal user base of it. You can say that
> msg_ring can easily become a hot path for some, and
> then we'll be putting efforts both into overflows and
> task_work when in essence they solve quite a similar
> problem here.

That's why I was tempted to remove the task_work path for messages on
top of this. But I agree on the wakeup side, that's obviously something
that doesn't currently work with msg ring. And while I don't see that as
a hard problem to solve, it is a bit annoying to have multiple sources
for that. Would naturally be better to retain just the task_work side.

One use case that I think is interesting with messages is work passing,
eliminating a user side data structure (and lock) for that and side
channel wakeups. I don't think the wakeup batching matters there, as
you're generally not going to be batching work on the receiving side.
You're either already running work for processing, or sleeping waiting
for one.
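
A minimal sketch of that work passing pattern, assuming liburing; the
work struct and the idea of carrying its pointer in user_data are
illustrative assumptions, not something prescribed by this thread:

#include <liburing.h>
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

struct work {
	int type;
};

int main(void)
{
	struct io_uring producer, consumer;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	struct work *w;

	if (io_uring_queue_init(8, &producer, 0) ||
	    io_uring_queue_init(8, &consumer, 0))
		return 1;

	/* producer: carry the work pointer in user_data, no shared queue
	 * or lock, no separate eventfd/pipe for the wakeup */
	w = malloc(sizeof(*w));
	if (!w)
		return 1;
	w->type = 42;
	sqe = io_uring_get_sqe(&producer);
	io_uring_prep_msg_ring(sqe, consumer.ring_fd, 0, (__u64)(uintptr_t)w, 0);
	io_uring_submit(&producer);

	/* consumer: the work item arrives as a CQE in the normal wait
	 * loop (in practice this side runs in another thread that is
	 * already waiting on its own ring) */
	if (!io_uring_wait_cqe(&consumer, &cqe)) {
		struct work *item = (struct work *)(uintptr_t)cqe->user_data;

		printf("got work item type %d\n", item->type);
		io_uring_cqe_seen(&consumer, cqe);
		free(item);
	}
	return 0;
}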

>>> I don't think we should care about the request completion
>>> latency (sender latency), people should be more interested
>>> in the reaction speed (receiver latency), but if you care
>>> about it for a reason, perhaps you can just as well allocate
>>> a new request and complete the previous one right away.
>>
>> I know the numbers I posted was just sender side improvements on that
>> particular box, however that isn't really the case for others. Here's on
>> an another test box:
>>
>> axboe@r7525 ~> ./msg-lat
>> init_flags=3000
>> Wait on startup
>> 802775: my fd 3, other 4
>> 802776: my fd 4, other 3
>> Latencies for: Receiver
>>      percentiles (nsec):
>>       |  1.0000th=[ 4192],  5.0000th=[ 4320], 10.0000th=[ 4448],
>>       | 20.0000th=[ 4576], 30.0000th=[ 4704], 40.0000th=[ 4832],
>>       | 50.0000th=[ 4960], 60.0000th=[ 5088], 70.0000th=[ 5216],
>>       | 80.0000th=[ 5344], 90.0000th=[ 5536], 95.0000th=[ 5728],
>>       | 99.0000th=[ 6176], 99.5000th=[ 7136], 99.9000th=[19584],
>>       | 99.9500th=[20352], 99.9900th=[28288]
>> Latencies for: Sender
>>      percentiles (nsec):
>>       |  1.0000th=[ 6560],  5.0000th=[ 6880], 10.0000th=[ 7008],
>>       | 20.0000th=[ 7264], 30.0000th=[ 7456], 40.0000th=[ 7712],
>>       | 50.0000th=[ 8032], 60.0000th=[ 8256], 70.0000th=[ 8512],
>>       | 80.0000th=[ 8640], 90.0000th=[ 8896], 95.0000th=[ 9152],
>>       | 99.0000th=[ 9792], 99.5000th=[11584], 99.9000th=[23168],
>>       | 99.9500th=[28032], 99.9900th=[40192]
>>
>> and after:
>>
>> axboe@r7525 ~> ./msg-lat                                                       1.776s
>> init_flags=3000
>> Wait on startup
>> 3767: my fd 3, other 4
>> 3768: my fd 4, other 3
>> Latencies for: Sender
>>      percentiles (nsec):
>>       |  1.0000th=[  740],  5.0000th=[  748], 10.0000th=[  756],
>>       | 20.0000th=[  764], 30.0000th=[  764], 40.0000th=[  772],
>>       | 50.0000th=[  772], 60.0000th=[  780], 70.0000th=[  780],
>>       | 80.0000th=[  860], 90.0000th=[  892], 95.0000th=[  900],
>>       | 99.0000th=[ 1224], 99.5000th=[ 1368], 99.9000th=[ 1656],
>>       | 99.9500th=[ 1976], 99.9900th=[ 3408]
>> Latencies for: Receiver
>>      percentiles (nsec):
>>       |  1.0000th=[ 2736],  5.0000th=[ 2736], 10.0000th=[ 2768],
>>       | 20.0000th=[ 2800], 30.0000th=[ 2800], 40.0000th=[ 2800],
>>       | 50.0000th=[ 2832], 60.0000th=[ 2832], 70.0000th=[ 2896],
>>       | 80.0000th=[ 2928], 90.0000th=[ 3024], 95.0000th=[ 3120],
>>       | 99.0000th=[ 4080], 99.5000th=[15424], 99.9000th=[18560],
>>       | 99.9500th=[21632], 99.9900th=[58624]
>>
>> Obviously there's some run-to-run variation in general, but it's most
>> certainly faster in terms of receiving too. This test case is fixed at
>> 100K messages per second, and I didn't do any peak testing. But I
>> strongly suspect you'll see very nice efficiency gains here too, as
>> doing two TWA_SIGNAL task_work notifications is pretty sucky imho.
> 
> Sure, you mentioned wins on the receiver side; I consider that
> to be the main merit (latency and throughput).

Actually I think both are interesting - not because the sender side is
latency sensitive on receiving the CQE for it, but because it goes
hand in hand with a reduction in cycles spent sending that work. That's
the win on the sender side, more so than the latency win. The latter is
just gravy on top.

>> You could just make it io_kiocb based, but I did not want to get into
>> foreign requests on remote rings. What would you envision with that
>> approach, using our normal ring task_work for this instead? That would
> 
> It was buggy in the !DEFER_TASKRUN path. Fortunately, you don't care
> about it because it just does it all under ->completion_lock, which
> is why you shouldn't have ever hit the problem in testing.

I'll test the old approach and we'll see where we are at.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCHSET 0/3] Improve MSG_RING SINGLE_ISSUER performance
  2024-05-28 18:07       ` Jens Axboe
@ 2024-05-28 18:31         ` Jens Axboe
  2024-05-28 23:04           ` Jens Axboe
  0 siblings, 1 reply; 23+ messages in thread
From: Jens Axboe @ 2024-05-28 18:31 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 5/28/24 12:07 PM, Jens Axboe wrote:
> On 5/28/24 10:50 AM, Pavel Begunkov wrote:
>> On 5/28/24 15:34, Jens Axboe wrote:
>>> On 5/28/24 7:31 AM, Pavel Begunkov wrote:
>>>> On 5/24/24 23:58, Jens Axboe wrote:
>>>>> Hi,
>>>>>
>>>>> A ring setup with with IORING_SETUP_SINGLE_ISSUER, which is required to
>>>>
>>>> IORING_SETUP_SINGLE_ISSUER has nothing to do with it, it's
>>>> specifically an IORING_SETUP_DEFER_TASKRUN optimisation.
>>>
>>> Right, I should change that in the commit message. It's task_complete
>>> driving it, which is tied to DEFER_TASKRUN.
>>>
>>>>> use IORING_SETUP_DEFER_TASKRUN, will need two round trips through
>>>>> generic task_work. This isn't ideal. This patchset attempts to rectify
>>>>> that, taking a new approach rather than trying to use the io_uring
>>>>> task_work infrastructure to handle it as in previous postings.
>>>>
>>>> Not sure why you'd want to piggyback onto overflows, it's not
>>>> such a well made and reliable infra, whereas the DEFER_TASKRUN
>>>> part of the task_work approach was fine.
>>>
>>> It's not right now, because it's really a "don't get into this
>>> condition, if you do, things are slower". And this series doesn't really
>>> change that, and honestly it doesn't even need to. It's still way better
>>> than what we had before in terms of DEFER_TASKRUN and messages.
>>
>> Better than how it is now, or compared to the previous attempt?
>> I think the one using io_uring's tw infra was better, which is
>> where all wake ups and optimisations currently consolidate.
> 
> Better than both - I haven't tested with the previous version, but I can
> certainly do that. The reason why I think it'll be better is that it
> avoids the double roundtrips. Yes v1 was using normal task_work which is
> better, but it didn't solve what I think is the fundamental issue here.
> 
> I'll forward port it and give it a spin, then we'll know.

I suspect a bug in the previous patches, because this is what the
forward port looks like. First, for reference, the current results:

init_flags=3000
Wait on startup
3767: my fd 3, other 4
3768: my fd 4, other 3
Latencies for: Sender
    percentiles (nsec):
     |  1.0000th=[  740],  5.0000th=[  748], 10.0000th=[  756],
     | 20.0000th=[  764], 30.0000th=[  764], 40.0000th=[  772],
     | 50.0000th=[  772], 60.0000th=[  780], 70.0000th=[  780],
     | 80.0000th=[  860], 90.0000th=[  892], 95.0000th=[  900],
     | 99.0000th=[ 1224], 99.5000th=[ 1368], 99.9000th=[ 1656],
     | 99.9500th=[ 1976], 99.9900th=[ 3408]
Latencies for: Receiver
    percentiles (nsec):
     |  1.0000th=[ 2736],  5.0000th=[ 2736], 10.0000th=[ 2768],
     | 20.0000th=[ 2800], 30.0000th=[ 2800], 40.0000th=[ 2800],
     | 50.0000th=[ 2832], 60.0000th=[ 2832], 70.0000th=[ 2896],
     | 80.0000th=[ 2928], 90.0000th=[ 3024], 95.0000th=[ 3120],
     | 99.0000th=[ 4080], 99.5000th=[15424], 99.9000th=[18560],
     | 99.9500th=[21632], 99.9900th=[58624]

and here's with io_uring-msg_ring.1, which is just a straightforward
forward port of the previous patches on the same base as v2:

init_flags=3000
Wait on startup
4097: my fd 4, other 3
4096: my fd 3, other 4
Latencies for: Receiver
    percentiles (nsec):
     |  1.0000th=[ 5920],  5.0000th=[ 5920], 10.0000th=[ 5984],
     | 20.0000th=[ 5984], 30.0000th=[ 6048], 40.0000th=[ 6048],
     | 50.0000th=[ 6112], 60.0000th=[ 6304], 70.0000th=[ 6368],
     | 80.0000th=[ 6560], 90.0000th=[ 6880], 95.0000th=[ 7072],
     | 99.0000th=[ 7456], 99.5000th=[ 7712], 99.9000th=[ 8640],
     | 99.9500th=[10432], 99.9900th=[26240]
Latencies for: Sender
    percentiles (nsec):
     |  1.0000th=[ 9536],  5.0000th=[ 9664], 10.0000th=[ 9664],
     | 20.0000th=[ 9920], 30.0000th=[ 9920], 40.0000th=[10048],
     | 50.0000th=[10176], 60.0000th=[10304], 70.0000th=[10432],
     | 80.0000th=[10688], 90.0000th=[10944], 95.0000th=[11328],
     | 99.0000th=[11840], 99.5000th=[12096], 99.9000th=[13888],
     | 99.9500th=[15424], 99.9900th=[34560]

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCHSET 0/3] Improve MSG_RING SINGLE_ISSUER performance
  2024-05-28 18:31         ` Jens Axboe
@ 2024-05-28 23:04           ` Jens Axboe
  2024-05-29  1:35             ` Jens Axboe
  0 siblings, 1 reply; 23+ messages in thread
From: Jens Axboe @ 2024-05-28 23:04 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 5/28/24 12:31 PM, Jens Axboe wrote:
> I suspect a bug in the previous patches, because this is what the
> forward port looks like. First, for reference, the current results:

Got it sorted, and pinned sender and receiver on CPUs to avoid the
variation. It looks like this with the task_work approach that I sent
out as v1:

Latencies for: Sender
    percentiles (nsec):
     |  1.0000th=[ 2160],  5.0000th=[ 2672], 10.0000th=[ 2768],
     | 20.0000th=[ 3568], 30.0000th=[ 3568], 40.0000th=[ 3600],
     | 50.0000th=[ 3600], 60.0000th=[ 3600], 70.0000th=[ 3632],
     | 80.0000th=[ 3632], 90.0000th=[ 3664], 95.0000th=[ 3696],
     | 99.0000th=[ 4832], 99.5000th=[15168], 99.9000th=[16192],
     | 99.9500th=[16320], 99.9900th=[18304]
Latencies for: Receiver
    percentiles (nsec):
     |  1.0000th=[ 1528],  5.0000th=[ 1576], 10.0000th=[ 1656],
     | 20.0000th=[ 2040], 30.0000th=[ 2064], 40.0000th=[ 2064],
     | 50.0000th=[ 2064], 60.0000th=[ 2064], 70.0000th=[ 2096],
     | 80.0000th=[ 2096], 90.0000th=[ 2128], 95.0000th=[ 2160],
     | 99.0000th=[ 3472], 99.5000th=[14784], 99.9000th=[15168],
     | 99.9500th=[15424], 99.9900th=[17280]

and here's the exact same test run on the current patches:

Latencies for: Sender
    percentiles (nsec):
     |  1.0000th=[  362],  5.0000th=[  362], 10.0000th=[  370],
     | 20.0000th=[  370], 30.0000th=[  370], 40.0000th=[  370],
     | 50.0000th=[  374], 60.0000th=[  382], 70.0000th=[  382],
     | 80.0000th=[  382], 90.0000th=[  382], 95.0000th=[  390],
     | 99.0000th=[  402], 99.5000th=[  430], 99.9000th=[  900],
     | 99.9500th=[  972], 99.9900th=[ 1432]
Latencies for: Receiver
    percentiles (nsec):
     |  1.0000th=[ 1528],  5.0000th=[ 1544], 10.0000th=[ 1560],
     | 20.0000th=[ 1576], 30.0000th=[ 1592], 40.0000th=[ 1592],
     | 50.0000th=[ 1592], 60.0000th=[ 1608], 70.0000th=[ 1608],
     | 80.0000th=[ 1640], 90.0000th=[ 1672], 95.0000th=[ 1688],
     | 99.0000th=[ 1848], 99.5000th=[ 2128], 99.9000th=[14272],
     | 99.9500th=[14784], 99.9900th=[73216]

I'll try and augment the test app to do proper rated submissions, so I
can ramp up the rates a bit and see what happens.
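
"Rated submissions" here just means pacing the sends at a fixed rate,
roughly along these lines (illustration only, not the actual msg-lat
changes; the names are made up):

#include <time.h>
#include <stdint.h>

static uint64_t nsecs_now(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

/* send_one() stands in for the per-message msg_ring submit + wait */
static void run_rated(unsigned int msgs_per_sec, unsigned int seconds,
                      void (*send_one)(void))
{
        uint64_t interval = 1000000000ULL / msgs_per_sec;
        uint64_t next = nsecs_now();
        uint64_t end = next + (uint64_t)seconds * 1000000000ULL;

        while (nsecs_now() < end) {
                while (nsecs_now() < next)
                        ;       /* spin until the next send slot */
                send_one();
                next += interval;
        }
}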

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCHSET 0/3] Improve MSG_RING SINGLE_ISSUER performance
  2024-05-28 23:04           ` Jens Axboe
@ 2024-05-29  1:35             ` Jens Axboe
  2024-05-29  2:08               ` Pavel Begunkov
  0 siblings, 1 reply; 23+ messages in thread
From: Jens Axboe @ 2024-05-29  1:35 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 5/28/24 5:04 PM, Jens Axboe wrote:
> On 5/28/24 12:31 PM, Jens Axboe wrote:
>> I suspect a bug in the previous patches, because this is what the
>> forward port looks like. First, for reference, the current results:
> 
> Got it sorted, and pinned sender and receiver on CPUs to avoid the
> variation. It looks like this with the task_work approach that I sent
> out as v1:
> 
> Latencies for: Sender
>     percentiles (nsec):
>      |  1.0000th=[ 2160],  5.0000th=[ 2672], 10.0000th=[ 2768],
>      | 20.0000th=[ 3568], 30.0000th=[ 3568], 40.0000th=[ 3600],
>      | 50.0000th=[ 3600], 60.0000th=[ 3600], 70.0000th=[ 3632],
>      | 80.0000th=[ 3632], 90.0000th=[ 3664], 95.0000th=[ 3696],
>      | 99.0000th=[ 4832], 99.5000th=[15168], 99.9000th=[16192],
>      | 99.9500th=[16320], 99.9900th=[18304]
> Latencies for: Receiver
>     percentiles (nsec):
>      |  1.0000th=[ 1528],  5.0000th=[ 1576], 10.0000th=[ 1656],
>      | 20.0000th=[ 2040], 30.0000th=[ 2064], 40.0000th=[ 2064],
>      | 50.0000th=[ 2064], 60.0000th=[ 2064], 70.0000th=[ 2096],
>      | 80.0000th=[ 2096], 90.0000th=[ 2128], 95.0000th=[ 2160],
>      | 99.0000th=[ 3472], 99.5000th=[14784], 99.9000th=[15168],
>      | 99.9500th=[15424], 99.9900th=[17280]
> 
> and here's the exact same test run on the current patches:
> 
> Latencies for: Sender
>     percentiles (nsec):
>      |  1.0000th=[  362],  5.0000th=[  362], 10.0000th=[  370],
>      | 20.0000th=[  370], 30.0000th=[  370], 40.0000th=[  370],
>      | 50.0000th=[  374], 60.0000th=[  382], 70.0000th=[  382],
>      | 80.0000th=[  382], 90.0000th=[  382], 95.0000th=[  390],
>      | 99.0000th=[  402], 99.5000th=[  430], 99.9000th=[  900],
>      | 99.9500th=[  972], 99.9900th=[ 1432]
> Latencies for: Receiver
>     percentiles (nsec):
>      |  1.0000th=[ 1528],  5.0000th=[ 1544], 10.0000th=[ 1560],
>      | 20.0000th=[ 1576], 30.0000th=[ 1592], 40.0000th=[ 1592],
>      | 50.0000th=[ 1592], 60.0000th=[ 1608], 70.0000th=[ 1608],
>      | 80.0000th=[ 1640], 90.0000th=[ 1672], 95.0000th=[ 1688],
>      | 99.0000th=[ 1848], 99.5000th=[ 2128], 99.9000th=[14272],
>      | 99.9500th=[14784], 99.9900th=[73216]
> 
> I'll try and augment the test app to do proper rated submissions, so I
> can ramp up the rates a bit and see what happens.

And the final one, with the rated sends sorted out. One key observation
is that v1 trails the current edition; it just can't keep up as the rate
is increased. If we cap the rate at what should be 33K messages per
second, v1 gets ~28K messages and has the following latency profile (for
a 3 second run):

Latencies for: Receiver (msg=83863)
    percentiles (nsec):
     |  1.0000th=[  1208],  5.0000th=[  1336], 10.0000th=[  1400],
     | 20.0000th=[  1768], 30.0000th=[  1912], 40.0000th=[  1976],
     | 50.0000th=[  2040], 60.0000th=[  2160], 70.0000th=[  2256],
     | 80.0000th=[  2480], 90.0000th=[  2736], 95.0000th=[  3024],
     | 99.0000th=[  4080], 99.5000th=[  4896], 99.9000th=[  9664],
     | 99.9500th=[ 17024], 99.9900th=[218112]
Latencies for: Sender (msg=83863)
    percentiles (nsec):
     |  1.0000th=[  1928],  5.0000th=[  2064], 10.0000th=[  2160],
     | 20.0000th=[  2608], 30.0000th=[  2672], 40.0000th=[  2736],
     | 50.0000th=[  2864], 60.0000th=[  2960], 70.0000th=[  3152],
     | 80.0000th=[  3408], 90.0000th=[  4128], 95.0000th=[  4576],
     | 99.0000th=[  5920], 99.5000th=[  6752], 99.9000th=[ 13376],
     | 99.9500th=[ 22912], 99.9900th=[261120]

and the current edition does:

Latencies for: Sender (msg=94488)
    percentiles (nsec):
     |  1.0000th=[  181],  5.0000th=[  191], 10.0000th=[  201],
     | 20.0000th=[  215], 30.0000th=[  225], 40.0000th=[  235],
     | 50.0000th=[  262], 60.0000th=[  306], 70.0000th=[  430],
     | 80.0000th=[ 1004], 90.0000th=[ 2480], 95.0000th=[ 3632],
     | 99.0000th=[ 8096], 99.5000th=[12352], 99.9000th=[18048],
     | 99.9500th=[19584], 99.9900th=[23680]
Latencies for: Receiver (msg=94488)
    percentiles (nsec):
     |  1.0000th=[  342],  5.0000th=[  398], 10.0000th=[  482],
     | 20.0000th=[  652], 30.0000th=[  812], 40.0000th=[  972],
     | 50.0000th=[ 1240], 60.0000th=[ 1640], 70.0000th=[ 1944],
     | 80.0000th=[ 2448], 90.0000th=[ 3248], 95.0000th=[ 5216],
     | 99.0000th=[10304], 99.5000th=[12352], 99.9000th=[18048],
     | 99.9500th=[19840], 99.9900th=[23168]

If we cap it where v1 keeps up, at 13K messages per second, v1 does:

Latencies for: Receiver (msg=38820)
    percentiles (nsec):
     |  1.0000th=[ 1160],  5.0000th=[ 1256], 10.0000th=[ 1352],
     | 20.0000th=[ 1688], 30.0000th=[ 1928], 40.0000th=[ 1976],
     | 50.0000th=[ 2064], 60.0000th=[ 2384], 70.0000th=[ 2480],
     | 80.0000th=[ 2768], 90.0000th=[ 3280], 95.0000th=[ 3472],
     | 99.0000th=[ 4192], 99.5000th=[ 4512], 99.9000th=[ 6624],
     | 99.9500th=[ 8768], 99.9900th=[14272]
Latencies for: Sender (msg=38820)
    percentiles (nsec):
     |  1.0000th=[ 1848],  5.0000th=[ 1928], 10.0000th=[ 2040],
     | 20.0000th=[ 2608], 30.0000th=[ 2640], 40.0000th=[ 2736],
     | 50.0000th=[ 3024], 60.0000th=[ 3120], 70.0000th=[ 3376],
     | 80.0000th=[ 3824], 90.0000th=[ 4512], 95.0000th=[ 4768],
     | 99.0000th=[ 5536], 99.5000th=[ 6048], 99.9000th=[ 9024],
     | 99.9500th=[10304], 99.9900th=[23424]

and v2 does:

Latencies for: Sender (msg=39005)
    percentiles (nsec):
     |  1.0000th=[  191],  5.0000th=[  211], 10.0000th=[  262],
     | 20.0000th=[  342], 30.0000th=[  382], 40.0000th=[  402],
     | 50.0000th=[  450], 60.0000th=[  532], 70.0000th=[ 1080],
     | 80.0000th=[ 1848], 90.0000th=[ 4768], 95.0000th=[10944],
     | 99.0000th=[16512], 99.5000th=[18304], 99.9000th=[22400],
     | 99.9500th=[26496], 99.9900th=[41728]
Latencies for: Receiver (msg=39005)
    percentiles (nsec):
     |  1.0000th=[  410],  5.0000th=[  604], 10.0000th=[  700],
     | 20.0000th=[  900], 30.0000th=[ 1128], 40.0000th=[ 1320],
     | 50.0000th=[ 1672], 60.0000th=[ 2256], 70.0000th=[ 2736],
     | 80.0000th=[ 3760], 90.0000th=[ 5408], 95.0000th=[11072],
     | 99.0000th=[18304], 99.5000th=[20096], 99.9000th=[24704],
     | 99.9500th=[27520], 99.9900th=[35584]

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/3] io_uring/msg_ring: avoid double indirection task_work for data messages
  2024-05-28 17:59         ` Jens Axboe
@ 2024-05-29  2:04           ` Pavel Begunkov
  2024-05-29  2:43             ` Jens Axboe
  0 siblings, 1 reply; 23+ messages in thread
From: Pavel Begunkov @ 2024-05-29  2:04 UTC (permalink / raw)
  To: Jens Axboe, io-uring

On 5/28/24 18:59, Jens Axboe wrote:
> On 5/28/24 10:23 AM, Pavel Begunkov wrote:
>> On 5/28/24 15:23, Jens Axboe wrote:
>>> On 5/28/24 7:32 AM, Pavel Begunkov wrote:
>>>> On 5/24/24 23:58, Jens Axboe wrote:
>>>>> If IORING_SETUP_SINGLE_ISSUER is set, then we can't post CQEs remotely
>>>>> to the target ring. Instead, task_work is queued for the target ring,
>>>>> which is used to post the CQE. To make matters worse, once the target
>>>>> CQE has been posted, task_work is then queued with the originator to
>>>>> fill the completion.
>>>>>
>>>>> This obviously adds a bunch of overhead and latency. Instead of relying
>>>>> on generic kernel task_work for this, fill an overflow entry on the
>>>>> target ring and flag it as such that the target ring will flush it. This
>>>>> avoids both the task_work for posting the CQE, and it means that the
>>>>> originator CQE can be filled inline as well.
>>>>>
>>>>> In local testing, this reduces the latency on the sender side by 5-6x.
>>>>>
>>>>> Signed-off-by: Jens Axboe <[email protected]>
>>>>> ---
>>>>>     io_uring/msg_ring.c | 77 +++++++++++++++++++++++++++++++++++++++++++--
>>>>>     1 file changed, 74 insertions(+), 3 deletions(-)
>>>>>
>>>>> diff --git a/io_uring/msg_ring.c b/io_uring/msg_ring.c
>>>>> index feff2b0822cf..3f89ff3a40ad 100644
>>>>> --- a/io_uring/msg_ring.c
>>>>> +++ b/io_uring/msg_ring.c
>>>>> @@ -123,6 +123,69 @@ static void io_msg_tw_complete(struct callback_head *head)
>>>>>         io_req_queue_tw_complete(req, ret);
>>>>>     }
>>>>>     +static struct io_overflow_cqe *io_alloc_overflow(struct io_ring_ctx *target_ctx)
>>>>> +{
>>>>> +    bool is_cqe32 = target_ctx->flags & IORING_SETUP_CQE32;
>>>>> +    size_t cqe_size = sizeof(struct io_overflow_cqe);
>>>>> +    struct io_overflow_cqe *ocqe;
>>>>> +
>>>>> +    if (is_cqe32)
>>>>> +        cqe_size += sizeof(struct io_uring_cqe);
>>>>> +
>>>>> +    ocqe = kmalloc(cqe_size, GFP_ATOMIC | __GFP_ACCOUNT);
>>>>
>>>> __GFP_ACCOUNT looks painful
>>>
>>> It always is - I did add the usual alloc cache for this after posting
>>> this series, which makes it a no-op basically:
>>
>> Simple ring private cache wouldn't work so well with non
>> uniform transfer distributions. One way messaging, userspace
>> level batching, etc., but the main question is in the other
>> email, i.e. maybe it's better to go with the 2 tw hop model,
>> which returns memory back where it came from.
> 
> The cache is local to the ring, so anyone that sends messages to that
> ring gets to use it. So I believe it should in fact work really well. If
> messaging is bidirectional, then caching on the target will apply in
> both directions.

*taking a look at the patch* it gets the entry from the target's
ring, so indeed not a problem. Taking the target lock for that,
however, is not the best; I ranted before about interdependencies
b/w rings. E.g. requests messaging a ring run by a task that is CPU
bound in submission / tw execution would be directed to io-wq and
occupy a worker thread for the time being.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCHSET 0/3] Improve MSG_RING SINGLE_ISSUER performance
  2024-05-29  1:35             ` Jens Axboe
@ 2024-05-29  2:08               ` Pavel Begunkov
  2024-05-29  2:42                 ` Jens Axboe
  0 siblings, 1 reply; 23+ messages in thread
From: Pavel Begunkov @ 2024-05-29  2:08 UTC (permalink / raw)
  To: Jens Axboe, io-uring

On 5/29/24 02:35, Jens Axboe wrote:
> On 5/28/24 5:04 PM, Jens Axboe wrote:
>> On 5/28/24 12:31 PM, Jens Axboe wrote:
>>> I suspect a bug in the previous patches, because this is what the
>>> forward port looks like. First, for reference, the current results:
>>
>> Got it sorted, and pinned sender and receiver on CPUs to avoid the
>> variation. It looks like this with the task_work approach that I sent
>> out as v1:
>>
>> Latencies for: Sender
>>      percentiles (nsec):
>>       |  1.0000th=[ 2160],  5.0000th=[ 2672], 10.0000th=[ 2768],
>>       | 20.0000th=[ 3568], 30.0000th=[ 3568], 40.0000th=[ 3600],
>>       | 50.0000th=[ 3600], 60.0000th=[ 3600], 70.0000th=[ 3632],
>>       | 80.0000th=[ 3632], 90.0000th=[ 3664], 95.0000th=[ 3696],
>>       | 99.0000th=[ 4832], 99.5000th=[15168], 99.9000th=[16192],
>>       | 99.9500th=[16320], 99.9900th=[18304]
>> Latencies for: Receiver
>>      percentiles (nsec):
>>       |  1.0000th=[ 1528],  5.0000th=[ 1576], 10.0000th=[ 1656],
>>       | 20.0000th=[ 2040], 30.0000th=[ 2064], 40.0000th=[ 2064],
>>       | 50.0000th=[ 2064], 60.0000th=[ 2064], 70.0000th=[ 2096],
>>       | 80.0000th=[ 2096], 90.0000th=[ 2128], 95.0000th=[ 2160],
>>       | 99.0000th=[ 3472], 99.5000th=[14784], 99.9000th=[15168],
>>       | 99.9500th=[15424], 99.9900th=[17280]
>>
>> and here's the exact same test run on the current patches:
>>
>> Latencies for: Sender
>>      percentiles (nsec):
>>       |  1.0000th=[  362],  5.0000th=[  362], 10.0000th=[  370],
>>       | 20.0000th=[  370], 30.0000th=[  370], 40.0000th=[  370],
>>       | 50.0000th=[  374], 60.0000th=[  382], 70.0000th=[  382],
>>       | 80.0000th=[  382], 90.0000th=[  382], 95.0000th=[  390],
>>       | 99.0000th=[  402], 99.5000th=[  430], 99.9000th=[  900],
>>       | 99.9500th=[  972], 99.9900th=[ 1432]
>> Latencies for: Receiver
>>      percentiles (nsec):
>>       |  1.0000th=[ 1528],  5.0000th=[ 1544], 10.0000th=[ 1560],
>>       | 20.0000th=[ 1576], 30.0000th=[ 1592], 40.0000th=[ 1592],
>>       | 50.0000th=[ 1592], 60.0000th=[ 1608], 70.0000th=[ 1608],
>>       | 80.0000th=[ 1640], 90.0000th=[ 1672], 95.0000th=[ 1688],
>>       | 99.0000th=[ 1848], 99.5000th=[ 2128], 99.9000th=[14272],
>>       | 99.9500th=[14784], 99.9900th=[73216]
>>
>> I'll try and augment the test app to do proper rated submissions, so I
>> can ramp up the rates a bit and see what happens.
> 
> And the final one, with the rated sends sorted out. One key observation
> is that v1 trails the current edition, it just can't keep up as the rate
> is increased. If we cap the rate at what should be 33K messages per
> second, v1 gets ~28K messages and has the following latency profile (for
> a 3 second run)

Do you see where the receiver latency comes from? The wakeups are
quite similar in nature, assuming it's all wait(nr=1) and CPUs
are not 100% consumed. Does the hop back spoil the scheduling timing?


> Latencies for: Receiver (msg=83863)
>      percentiles (nsec):
>       |  1.0000th=[  1208],  5.0000th=[  1336], 10.0000th=[  1400],
>       | 20.0000th=[  1768], 30.0000th=[  1912], 40.0000th=[  1976],
>       | 50.0000th=[  2040], 60.0000th=[  2160], 70.0000th=[  2256],
>       | 80.0000th=[  2480], 90.0000th=[  2736], 95.0000th=[  3024],
>       | 99.0000th=[  4080], 99.5000th=[  4896], 99.9000th=[  9664],
>       | 99.9500th=[ 17024], 99.9900th=[218112]
> Latencies for: Sender (msg=83863)
>      percentiles (nsec):
>       |  1.0000th=[  1928],  5.0000th=[  2064], 10.0000th=[  2160],
>       | 20.0000th=[  2608], 30.0000th=[  2672], 40.0000th=[  2736],
>       | 50.0000th=[  2864], 60.0000th=[  2960], 70.0000th=[  3152],
>       | 80.0000th=[  3408], 90.0000th=[  4128], 95.0000th=[  4576],
>       | 99.0000th=[  5920], 99.5000th=[  6752], 99.9000th=[ 13376],
>       | 99.9500th=[ 22912], 99.9900th=[261120]
> 
> and the current edition does:
> 
> Latencies for: Sender (msg=94488)
>      percentiles (nsec):
>       |  1.0000th=[  181],  5.0000th=[  191], 10.0000th=[  201],
>       | 20.0000th=[  215], 30.0000th=[  225], 40.0000th=[  235],
>       | 50.0000th=[  262], 60.0000th=[  306], 70.0000th=[  430],
>       | 80.0000th=[ 1004], 90.0000th=[ 2480], 95.0000th=[ 3632],
>       | 99.0000th=[ 8096], 99.5000th=[12352], 99.9000th=[18048],
>       | 99.9500th=[19584], 99.9900th=[23680]
> Latencies for: Receiver (msg=94488)
>      percentiles (nsec):
>       |  1.0000th=[  342],  5.0000th=[  398], 10.0000th=[  482],
>       | 20.0000th=[  652], 30.0000th=[  812], 40.0000th=[  972],
>       | 50.0000th=[ 1240], 60.0000th=[ 1640], 70.0000th=[ 1944],
>       | 80.0000th=[ 2448], 90.0000th=[ 3248], 95.0000th=[ 5216],
>       | 99.0000th=[10304], 99.5000th=[12352], 99.9000th=[18048],
>       | 99.9500th=[19840], 99.9900th=[23168]
> 
> If we cap it where v1 keeps up, at 13K messages per second, v1 does:
> 
> Latencies for: Receiver (msg=38820)
>      percentiles (nsec):
>       |  1.0000th=[ 1160],  5.0000th=[ 1256], 10.0000th=[ 1352],
>       | 20.0000th=[ 1688], 30.0000th=[ 1928], 40.0000th=[ 1976],
>       | 50.0000th=[ 2064], 60.0000th=[ 2384], 70.0000th=[ 2480],
>       | 80.0000th=[ 2768], 90.0000th=[ 3280], 95.0000th=[ 3472],
>       | 99.0000th=[ 4192], 99.5000th=[ 4512], 99.9000th=[ 6624],
>       | 99.9500th=[ 8768], 99.9900th=[14272]
> Latencies for: Sender (msg=38820)
>      percentiles (nsec):
>       |  1.0000th=[ 1848],  5.0000th=[ 1928], 10.0000th=[ 2040],
>       | 20.0000th=[ 2608], 30.0000th=[ 2640], 40.0000th=[ 2736],
>       | 50.0000th=[ 3024], 60.0000th=[ 3120], 70.0000th=[ 3376],
>       | 80.0000th=[ 3824], 90.0000th=[ 4512], 95.0000th=[ 4768],
>       | 99.0000th=[ 5536], 99.5000th=[ 6048], 99.9000th=[ 9024],
>       | 99.9500th=[10304], 99.9900th=[23424]
> 
> and v2 does:
> 
> Latencies for: Sender (msg=39005)
>      percentiles (nsec):
>       |  1.0000th=[  191],  5.0000th=[  211], 10.0000th=[  262],
>       | 20.0000th=[  342], 30.0000th=[  382], 40.0000th=[  402],
>       | 50.0000th=[  450], 60.0000th=[  532], 70.0000th=[ 1080],
>       | 80.0000th=[ 1848], 90.0000th=[ 4768], 95.0000th=[10944],
>       | 99.0000th=[16512], 99.5000th=[18304], 99.9000th=[22400],
>       | 99.9500th=[26496], 99.9900th=[41728]
> Latencies for: Receiver (msg=39005)
>      percentiles (nsec):
>       |  1.0000th=[  410],  5.0000th=[  604], 10.0000th=[  700],
>       | 20.0000th=[  900], 30.0000th=[ 1128], 40.0000th=[ 1320],
>       | 50.0000th=[ 1672], 60.0000th=[ 2256], 70.0000th=[ 2736],
>       | 80.0000th=[ 3760], 90.0000th=[ 5408], 95.0000th=[11072],
>       | 99.0000th=[18304], 99.5000th=[20096], 99.9000th=[24704],
>       | 99.9500th=[27520], 99.9900th=[35584]
> 

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCHSET 0/3] Improve MSG_RING SINGLE_ISSUER performance
  2024-05-29  2:08               ` Pavel Begunkov
@ 2024-05-29  2:42                 ` Jens Axboe
  0 siblings, 0 replies; 23+ messages in thread
From: Jens Axboe @ 2024-05-29  2:42 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 5/28/24 8:08 PM, Pavel Begunkov wrote:
> On 5/29/24 02:35, Jens Axboe wrote:
>> On 5/28/24 5:04 PM, Jens Axboe wrote:
>>> On 5/28/24 12:31 PM, Jens Axboe wrote:
>>>> I suspect a bug in the previous patches, because this is what the
>>>> forward port looks like. First, for reference, the current results:
>>>
>>> Got it sorted, and pinned sender and receiver on CPUs to avoid the
>>> variation. It looks like this with the task_work approach that I sent
>>> out as v1:
>>>
>>> Latencies for: Sender
>>>      percentiles (nsec):
>>>       |  1.0000th=[ 2160],  5.0000th=[ 2672], 10.0000th=[ 2768],
>>>       | 20.0000th=[ 3568], 30.0000th=[ 3568], 40.0000th=[ 3600],
>>>       | 50.0000th=[ 3600], 60.0000th=[ 3600], 70.0000th=[ 3632],
>>>       | 80.0000th=[ 3632], 90.0000th=[ 3664], 95.0000th=[ 3696],
>>>       | 99.0000th=[ 4832], 99.5000th=[15168], 99.9000th=[16192],
>>>       | 99.9500th=[16320], 99.9900th=[18304]
>>> Latencies for: Receiver
>>>      percentiles (nsec):
>>>       |  1.0000th=[ 1528],  5.0000th=[ 1576], 10.0000th=[ 1656],
>>>       | 20.0000th=[ 2040], 30.0000th=[ 2064], 40.0000th=[ 2064],
>>>       | 50.0000th=[ 2064], 60.0000th=[ 2064], 70.0000th=[ 2096],
>>>       | 80.0000th=[ 2096], 90.0000th=[ 2128], 95.0000th=[ 2160],
>>>       | 99.0000th=[ 3472], 99.5000th=[14784], 99.9000th=[15168],
>>>       | 99.9500th=[15424], 99.9900th=[17280]
>>>
>>> and here's the exact same test run on the current patches:
>>>
>>> Latencies for: Sender
>>>      percentiles (nsec):
>>>       |  1.0000th=[  362],  5.0000th=[  362], 10.0000th=[  370],
>>>       | 20.0000th=[  370], 30.0000th=[  370], 40.0000th=[  370],
>>>       | 50.0000th=[  374], 60.0000th=[  382], 70.0000th=[  382],
>>>       | 80.0000th=[  382], 90.0000th=[  382], 95.0000th=[  390],
>>>       | 99.0000th=[  402], 99.5000th=[  430], 99.9000th=[  900],
>>>       | 99.9500th=[  972], 99.9900th=[ 1432]
>>> Latencies for: Receiver
>>>      percentiles (nsec):
>>>       |  1.0000th=[ 1528],  5.0000th=[ 1544], 10.0000th=[ 1560],
>>>       | 20.0000th=[ 1576], 30.0000th=[ 1592], 40.0000th=[ 1592],
>>>       | 50.0000th=[ 1592], 60.0000th=[ 1608], 70.0000th=[ 1608],
>>>       | 80.0000th=[ 1640], 90.0000th=[ 1672], 95.0000th=[ 1688],
>>>       | 99.0000th=[ 1848], 99.5000th=[ 2128], 99.9000th=[14272],
>>>       | 99.9500th=[14784], 99.9900th=[73216]
>>>
>>> I'll try and augment the test app to do proper rated submissions, so I
>>> can ramp up the rates a bit and see what happens.
>>
>> And the final one, with the rated sends sorted out. One key observation
>> is that v1 trails the current edition, it just can't keep up as the rate
>> is increased. If we cap the rate at what should be 33K messages per
>> second, v1 gets ~28K messages and has the following latency profile (for
>> a 3 second run)
> 
> Do you see where the receiver latency comes from? The wakeups are
> quite similar in nature, assuming it's all wait(nr=1) and CPUs
> are not 100% consumed. The hop back spoils scheduling timing?

I haven't dug into that side yet, but I'm guessing it's indeed
scheduling artifacts. It's all doing single waits: the sender is doing
io_uring_submit_and_wait(ring, 1) and the receiver is doing
io_uring_wait_cqe(ring, &cqe).
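
Roughly, the per-message pattern is the below (a sketch of what the test
app does, not the actual code):

#include <liburing.h>

/* sender: queue one MSG_RING SQE and wait for its own CQE */
static void send_and_wait(struct io_uring *ring, int target_fd, __u64 data)
{
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        struct io_uring_cqe *cqe;

        io_uring_prep_msg_ring(sqe, target_fd, 0, data, 0);
        io_uring_submit_and_wait(ring, 1);
        if (!io_uring_peek_cqe(ring, &cqe))
                io_uring_cqe_seen(ring, cqe);
}

/* receiver: block for a single CQE at a time */
static void wait_one(struct io_uring *ring)
{
        struct io_uring_cqe *cqe;

        if (!io_uring_wait_cqe(ring, &cqe))
                io_uring_cqe_seen(ring, cqe);
}

So each message is a single wakeup on each side, which is why scheduling
placement can show up directly in the receiver percentiles.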

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/3] io_uring/msg_ring: avoid double indirection task_work for data messages
  2024-05-29  2:04           ` Pavel Begunkov
@ 2024-05-29  2:43             ` Jens Axboe
  0 siblings, 0 replies; 23+ messages in thread
From: Jens Axboe @ 2024-05-29  2:43 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 5/28/24 8:04 PM, Pavel Begunkov wrote:
> On 5/28/24 18:59, Jens Axboe wrote:
>> On 5/28/24 10:23 AM, Pavel Begunkov wrote:
>>> On 5/28/24 15:23, Jens Axboe wrote:
>>>> On 5/28/24 7:32 AM, Pavel Begunkov wrote:
>>>>> On 5/24/24 23:58, Jens Axboe wrote:
>>>>>> If IORING_SETUP_SINGLE_ISSUER is set, then we can't post CQEs remotely
>>>>>> to the target ring. Instead, task_work is queued for the target ring,
>>>>>> which is used to post the CQE. To make matters worse, once the target
>>>>>> CQE has been posted, task_work is then queued with the originator to
>>>>>> fill the completion.
>>>>>>
>>>>>> This obviously adds a bunch of overhead and latency. Instead of relying
>>>>>> on generic kernel task_work for this, fill an overflow entry on the
>>>>>> target ring and flag it as such that the target ring will flush it. This
>>>>>> avoids both the task_work for posting the CQE, and it means that the
>>>>>> originator CQE can be filled inline as well.
>>>>>>
>>>>>> In local testing, this reduces the latency on the sender side by 5-6x.
>>>>>>
>>>>>> Signed-off-by: Jens Axboe <[email protected]>
>>>>>> ---
>>>>>>     io_uring/msg_ring.c | 77 +++++++++++++++++++++++++++++++++++++++++++--
>>>>>>     1 file changed, 74 insertions(+), 3 deletions(-)
>>>>>>
>>>>>> diff --git a/io_uring/msg_ring.c b/io_uring/msg_ring.c
>>>>>> index feff2b0822cf..3f89ff3a40ad 100644
>>>>>> --- a/io_uring/msg_ring.c
>>>>>> +++ b/io_uring/msg_ring.c
>>>>>> @@ -123,6 +123,69 @@ static void io_msg_tw_complete(struct callback_head *head)
>>>>>>         io_req_queue_tw_complete(req, ret);
>>>>>>     }
>>>>>>     +static struct io_overflow_cqe *io_alloc_overflow(struct io_ring_ctx *target_ctx)
>>>>>> +{
>>>>>> +    bool is_cqe32 = target_ctx->flags & IORING_SETUP_CQE32;
>>>>>> +    size_t cqe_size = sizeof(struct io_overflow_cqe);
>>>>>> +    struct io_overflow_cqe *ocqe;
>>>>>> +
>>>>>> +    if (is_cqe32)
>>>>>> +        cqe_size += sizeof(struct io_uring_cqe);
>>>>>> +
>>>>>> +    ocqe = kmalloc(cqe_size, GFP_ATOMIC | __GFP_ACCOUNT);
>>>>>
>>>>> __GFP_ACCOUNT looks painful
>>>>
>>>> It always is - I did add the usual alloc cache for this after posting
>>>> this series, which makes it a no-op basically:
>>>
>>> Simple ring private cache wouldn't work so well with non
>>> uniform transfer distributions. One way messaging, userspace
>>> level batching, etc., but the main question is in the other
>>> email, i.e. maybe it's better to go with the 2 tw hop model,
>>> which returns memory back where it came from.
>>
>> The cache is local to the ring, so anyone that sends messages to that
>> ring gets to use it. So I believe it should in fact work really well. If
>> messaging is bidirectional, then caching on the target will apply in
>> both directions.
> 
> *taking a look at the patch* it gets the entry from the target's
> ring, so indeed not a problem. Taking the target lock for that,
> however, is not the best, I ranted before about inter dependencies
> b/w rings. E.g. requests messaging a ring run by a task CPU bound
> in submission / tw execution would be directed to iowq and occupy
> a worker thread for the time being.

I can try and do some stats on io-wq bouncing; it can indeed be a risk.
It might even be possible to only retain the ring lock for flushing,
which is less of an issue as it happens locally, and have the overflow
entries locked separately. For now I just kept the overflow backend that
we already have, and the locking that MSG_RING already does.
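
As a purely illustrative sketch of that idea (the names below are
hypothetical, not the actual io_uring fields), the sender would only
take a dedicated overflow lock, and the target would splice entries out
under that same lock before posting CQEs under its own ring lock:

#include <linux/spinlock.h>
#include <linux/list.h>

/* hypothetical per-ring state, not the real struct io_ring_ctx layout */
struct msg_overflow_state {
        spinlock_t              lock;
        struct list_head        list;
};

/* sender side: no target ring/completion lock needed to queue an entry */
static void queue_msg_overflow(struct msg_overflow_state *st,
                               struct list_head *entry)
{
        spin_lock(&st->lock);
        list_add_tail(entry, &st->list);
        spin_unlock(&st->lock);
}

/*
 * target side: splice out under the overflow lock; the caller then posts
 * the CQEs while holding its own ring lock, which is a local operation
 */
static void flush_msg_overflow(struct msg_overflow_state *st,
                               struct list_head *out)
{
        spin_lock(&st->lock);
        list_splice_init(&st->list, out);
        spin_unlock(&st->lock);
}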

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2024-05-29  2:43 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-05-24 22:58 [PATCHSET 0/3] Improve MSG_RING SINGLE_ISSUER performance Jens Axboe
2024-05-24 22:58 ` [PATCH 1/3] io_uring/msg_ring: split fd installing into a helper Jens Axboe
2024-05-24 22:58 ` [PATCH 2/3] io_uring/msg_ring: avoid double indirection task_work for data messages Jens Axboe
2024-05-28 13:18   ` Pavel Begunkov
2024-05-28 14:23     ` Jens Axboe
2024-05-28 13:32   ` Pavel Begunkov
2024-05-28 14:23     ` Jens Axboe
2024-05-28 16:23       ` Pavel Begunkov
2024-05-28 17:59         ` Jens Axboe
2024-05-29  2:04           ` Pavel Begunkov
2024-05-29  2:43             ` Jens Axboe
2024-05-24 22:58 ` [PATCH 3/3] io_uring/msg_ring: avoid double indirection task_work for fd passing Jens Axboe
2024-05-28 13:31 ` [PATCHSET 0/3] Improve MSG_RING SINGLE_ISSUER performance Pavel Begunkov
2024-05-28 14:34   ` Jens Axboe
2024-05-28 14:39     ` Jens Axboe
2024-05-28 15:27     ` Jens Axboe
2024-05-28 16:50     ` Pavel Begunkov
2024-05-28 18:07       ` Jens Axboe
2024-05-28 18:31         ` Jens Axboe
2024-05-28 23:04           ` Jens Axboe
2024-05-29  1:35             ` Jens Axboe
2024-05-29  2:08               ` Pavel Begunkov
2024-05-29  2:42                 ` Jens Axboe
