* [PATCHSET 0/3] Improve MSG_RING SINGLE_ISSUER performance
@ 2024-05-24 22:58 Jens Axboe
  2024-05-24 22:58 ` [PATCH 1/3] io_uring/msg_ring: split fd installing into a helper Jens Axboe
  ` (3 more replies)
  0 siblings, 4 replies; 23+ messages in thread

From: Jens Axboe @ 2024-05-24 22:58 UTC (permalink / raw)
  To: io-uring

Hi,

Sending a MSG_RING message to a ring set up with IORING_SETUP_SINGLE_ISSUER,
which is required to use IORING_SETUP_DEFER_TASKRUN, needs two round trips
through generic task_work. This isn't ideal. This patchset attempts to
rectify that, taking a new approach rather than trying to use the io_uring
task_work infrastructure to handle it as in previous postings.

In a sample test app that has one thread send messages to another, logging
both the time between the sender sending and the receiver receiving a
message, and the time for the sender to post a message and get the CQE
back, I see the following sender latencies with the stock kernel:

Latencies for: Sender
    percentiles (nsec):
     |  1.0000th=[ 4384],  5.0000th=[ 4512], 10.0000th=[ 4576],
     | 20.0000th=[ 4768], 30.0000th=[ 4896], 40.0000th=[ 5024],
     | 50.0000th=[ 5088], 60.0000th=[ 5152], 70.0000th=[ 5280],
     | 80.0000th=[ 5344], 90.0000th=[ 5536], 95.0000th=[ 5728],
     | 99.0000th=[ 8032], 99.5000th=[18048], 99.9000th=[21376],
     | 99.9500th=[26496], 99.9900th=[59136]

and with the patches:

Latencies for: Sender
    percentiles (nsec):
     |  1.0000th=[  756],  5.0000th=[  820], 10.0000th=[  828],
     | 20.0000th=[  844], 30.0000th=[  852], 40.0000th=[  852],
     | 50.0000th=[  860], 60.0000th=[  860], 70.0000th=[  868],
     | 80.0000th=[  884], 90.0000th=[  964], 95.0000th=[  988],
     | 99.0000th=[ 1128], 99.5000th=[ 1208], 99.9000th=[ 1544],
     | 99.9500th=[ 1944], 99.9900th=[ 2896]

For the receiving side the win is smaller as it only "suffers" from a
single generic task_work, about a 10% win in latencies there.

The idea here is to utilize the CQE overflow infrastructure for this, as
that allows the right task to post the CQE to the ring.

Patch 1 is a basic refactoring prep patch, patch 2 adds support for normal
messages, and patch 3 adopts the same approach for fd passing.

 io_uring/msg_ring.c | 151 ++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 138 insertions(+), 13 deletions(-)

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 23+ messages in thread
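For reference, the msg-lat tool itself is not included in this posting, but a minimal sender/receiver pair built on liburing's io_uring_prep_msg_ring() looks roughly like the sketch below. It is illustrative only: the thread layout, NR_MSGS, and the barrier are assumptions, not part of the original benchmark. Both rings use IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN (the init_flags=3000 seen in output later in this thread), and each thread creates the ring it waits on, since DEFER_TASKRUN requires completions to be run by the ring's owner task.

        /*
         * Illustrative sketch, not the actual msg-lat tool: one thread sends
         * MSG_RING messages to the other thread's ring, both rings set up
         * with SINGLE_ISSUER | DEFER_TASKRUN. Error handling trimmed.
         */
        #include <liburing.h>
        #include <pthread.h>
        #include <stdio.h>

        #define NR_MSGS         8
        #define RING_FLAGS      (IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN)

        static struct io_uring dst_ring;
        static pthread_barrier_t barrier;

        static void *receiver(void *arg)
        {
                struct io_uring_cqe *cqe;
                int i;

                /* DEFER_TASKRUN: the waiter must be the task that created the ring */
                if (io_uring_queue_init(8, &dst_ring, RING_FLAGS))
                        return NULL;
                pthread_barrier_wait(&barrier);

                for (i = 0; i < NR_MSGS; i++) {
                        /* the CQE posted by the sender's MSG_RING shows up here */
                        if (io_uring_wait_cqe(&dst_ring, &cqe))
                                break;
                        printf("receiver: data=%llu res=%d\n",
                               (unsigned long long) cqe->user_data, cqe->res);
                        io_uring_cqe_seen(&dst_ring, cqe);
                }
                return NULL;
        }

        int main(void)
        {
                struct io_uring src_ring;
                struct io_uring_cqe *cqe;
                struct io_uring_sqe *sqe;
                pthread_t thread;
                int i;

                pthread_barrier_init(&barrier, NULL, 2);
                pthread_create(&thread, NULL, receiver, NULL);

                if (io_uring_queue_init(8, &src_ring, RING_FLAGS))
                        return 1;
                pthread_barrier_wait(&barrier);

                for (i = 0; i < NR_MSGS; i++) {
                        sqe = io_uring_get_sqe(&src_ring);
                        /* post a CQE with res=0x10, user_data=i on the target ring */
                        io_uring_prep_msg_ring(sqe, dst_ring.ring_fd, 0x10, i, 0);
                        io_uring_submit(&src_ring);

                        /* sender-side completion: what the "Sender" numbers measure */
                        if (io_uring_wait_cqe(&src_ring, &cqe))
                                break;
                        io_uring_cqe_seen(&src_ring, cqe);
                }
                pthread_join(thread, NULL);
                io_uring_queue_exit(&src_ring);
                io_uring_queue_exit(&dst_ring);
                return 0;
        }

The sender-side io_uring_wait_cqe() at the bottom of the loop corresponds to the "Sender" percentiles above; the receiver's wait corresponds to the "Receiver" numbers quoted later in the thread.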
* [PATCH 1/3] io_uring/msg_ring: split fd installing into a helper 2024-05-24 22:58 [PATCHSET 0/3] Improve MSG_RING SINGLE_ISSUER performance Jens Axboe @ 2024-05-24 22:58 ` Jens Axboe 2024-05-24 22:58 ` [PATCH 2/3] io_uring/msg_ring: avoid double indirection task_work for data messages Jens Axboe ` (2 subsequent siblings) 3 siblings, 0 replies; 23+ messages in thread From: Jens Axboe @ 2024-05-24 22:58 UTC (permalink / raw) To: io-uring; +Cc: Jens Axboe No functional changes in this patch, just in preparation for needing to complete the fd install with the ctx lock already held. Signed-off-by: Jens Axboe <[email protected]> --- io_uring/msg_ring.c | 26 ++++++++++++++++++-------- 1 file changed, 18 insertions(+), 8 deletions(-) diff --git a/io_uring/msg_ring.c b/io_uring/msg_ring.c index 81c4a9d43729..feff2b0822cf 100644 --- a/io_uring/msg_ring.c +++ b/io_uring/msg_ring.c @@ -173,25 +173,23 @@ static struct file *io_msg_grab_file(struct io_kiocb *req, unsigned int issue_fl return file; } -static int io_msg_install_complete(struct io_kiocb *req, unsigned int issue_flags) +static int __io_msg_install_complete(struct io_kiocb *req) { struct io_ring_ctx *target_ctx = req->file->private_data; struct io_msg *msg = io_kiocb_to_cmd(req, struct io_msg); struct file *src_file = msg->src_file; int ret; - if (unlikely(io_double_lock_ctx(target_ctx, issue_flags))) - return -EAGAIN; - ret = __io_fixed_fd_install(target_ctx, src_file, msg->dst_fd); if (ret < 0) - goto out_unlock; + return ret; msg->src_file = NULL; req->flags &= ~REQ_F_NEED_CLEANUP; if (msg->flags & IORING_MSG_RING_CQE_SKIP) - goto out_unlock; + return ret; + /* * If this fails, the target still received the file descriptor but * wasn't notified of the fact. This means that if this request @@ -199,8 +197,20 @@ static int io_msg_install_complete(struct io_kiocb *req, unsigned int issue_flag * later IORING_OP_MSG_RING delivers the message. */ if (!io_post_aux_cqe(target_ctx, msg->user_data, ret, 0)) - ret = -EOVERFLOW; -out_unlock: + return -EOVERFLOW; + + return ret; +} + +static int io_msg_install_complete(struct io_kiocb *req, unsigned int issue_flags) +{ + struct io_ring_ctx *target_ctx = req->file->private_data; + int ret; + + if (unlikely(io_double_lock_ctx(target_ctx, issue_flags))) + return -EAGAIN; + + ret = __io_msg_install_complete(req); io_double_unlock_ctx(target_ctx); return ret; } -- 2.43.0 ^ permalink raw reply related [flat|nested] 23+ messages in thread
* [PATCH 2/3] io_uring/msg_ring: avoid double indirection task_work for data messages 2024-05-24 22:58 [PATCHSET 0/3] Improve MSG_RING SINGLE_ISSUER performance Jens Axboe 2024-05-24 22:58 ` [PATCH 1/3] io_uring/msg_ring: split fd installing into a helper Jens Axboe @ 2024-05-24 22:58 ` Jens Axboe 2024-05-28 13:18 ` Pavel Begunkov 2024-05-28 13:32 ` Pavel Begunkov 2024-05-24 22:58 ` [PATCH 3/3] io_uring/msg_ring: avoid double indirection task_work for fd passing Jens Axboe 2024-05-28 13:31 ` [PATCHSET 0/3] Improve MSG_RING SINGLE_ISSUER performance Pavel Begunkov 3 siblings, 2 replies; 23+ messages in thread From: Jens Axboe @ 2024-05-24 22:58 UTC (permalink / raw) To: io-uring; +Cc: Jens Axboe If IORING_SETUP_SINGLE_ISSUER is set, then we can't post CQEs remotely to the target ring. Instead, task_work is queued for the target ring, which is used to post the CQE. To make matters worse, once the target CQE has been posted, task_work is then queued with the originator to fill the completion. This obviously adds a bunch of overhead and latency. Instead of relying on generic kernel task_work for this, fill an overflow entry on the target ring and flag it as such that the target ring will flush it. This avoids both the task_work for posting the CQE, and it means that the originator CQE can be filled inline as well. In local testing, this reduces the latency on the sender side by 5-6x. Signed-off-by: Jens Axboe <[email protected]> --- io_uring/msg_ring.c | 77 +++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 74 insertions(+), 3 deletions(-) diff --git a/io_uring/msg_ring.c b/io_uring/msg_ring.c index feff2b0822cf..3f89ff3a40ad 100644 --- a/io_uring/msg_ring.c +++ b/io_uring/msg_ring.c @@ -123,6 +123,69 @@ static void io_msg_tw_complete(struct callback_head *head) io_req_queue_tw_complete(req, ret); } +static struct io_overflow_cqe *io_alloc_overflow(struct io_ring_ctx *target_ctx) +{ + bool is_cqe32 = target_ctx->flags & IORING_SETUP_CQE32; + size_t cqe_size = sizeof(struct io_overflow_cqe); + struct io_overflow_cqe *ocqe; + + if (is_cqe32) + cqe_size += sizeof(struct io_uring_cqe); + + ocqe = kmalloc(cqe_size, GFP_ATOMIC | __GFP_ACCOUNT); + if (!ocqe) + return NULL; + + if (is_cqe32) + ocqe->cqe.big_cqe[0] = ocqe->cqe.big_cqe[1] = 0; + + return ocqe; +} + +/* + * Entered with the target uring_lock held, and will drop it before + * returning. Adds a previously allocated ocqe to the overflow list on + * the target, and marks it appropriately for flushing. 
+ */ +static void io_msg_add_overflow(struct io_msg *msg, + struct io_ring_ctx *target_ctx, + struct io_overflow_cqe *ocqe, int ret) + __releases(target_ctx->uring_lock) +{ + spin_lock(&target_ctx->completion_lock); + + if (list_empty(&target_ctx->cq_overflow_list)) { + set_bit(IO_CHECK_CQ_OVERFLOW_BIT, &target_ctx->check_cq); + atomic_or(IORING_SQ_TASKRUN, &target_ctx->rings->sq_flags); + } + + ocqe->cqe.user_data = msg->user_data; + ocqe->cqe.res = ret; + list_add_tail(&ocqe->list, &target_ctx->cq_overflow_list); + spin_unlock(&target_ctx->completion_lock); + mutex_unlock(&target_ctx->uring_lock); + wake_up_state(target_ctx->submitter_task, TASK_INTERRUPTIBLE); +} + +static bool io_msg_fill_remote(struct io_msg *msg, unsigned int issue_flags, + struct io_ring_ctx *target_ctx, u32 flags) +{ + struct io_overflow_cqe *ocqe; + + ocqe = io_alloc_overflow(target_ctx); + if (!ocqe) + return false; + + if (unlikely(io_double_lock_ctx(target_ctx, issue_flags))) { + kfree(ocqe); + return false; + } + + ocqe->cqe.flags = flags; + io_msg_add_overflow(msg, target_ctx, ocqe, msg->len); + return true; +} + static int io_msg_ring_data(struct io_kiocb *req, unsigned int issue_flags) { struct io_ring_ctx *target_ctx = req->file->private_data; @@ -137,12 +200,20 @@ static int io_msg_ring_data(struct io_kiocb *req, unsigned int issue_flags) if (target_ctx->flags & IORING_SETUP_R_DISABLED) return -EBADFD; - if (io_msg_need_remote(target_ctx)) - return io_msg_exec_remote(req, io_msg_tw_complete); - if (msg->flags & IORING_MSG_RING_FLAGS_PASS) flags = msg->cqe_flags; + if (io_msg_need_remote(target_ctx)) { + /* + * Try adding an overflow entry to the target, and only if + * that fails, resort to using more expensive task_work to + * have the target_ctx owner fill the CQE. + */ + if (!io_msg_fill_remote(msg, issue_flags, target_ctx, flags)) + return io_msg_exec_remote(req, io_msg_tw_complete); + return 0; + } + ret = -EOVERFLOW; if (target_ctx->flags & IORING_SETUP_IOPOLL) { if (unlikely(io_double_lock_ctx(target_ctx, issue_flags))) -- 2.43.0 ^ permalink raw reply related [flat|nested] 23+ messages in thread
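For context on what "the target ring will flush it" means here: nothing new is needed on the receiving side, because the generic overflow handling already drains cq_overflow_list when the target task next checks its CQ. A simplified illustration of that existing flush follows; it is not the actual kernel code, and get_free_cqe_slot() is a stand-in name for grabbing the next free CQ ring slot:

        static void flush_overflowed_cqes(struct io_ring_ctx *ctx)
        {
                size_t cqe_size = sizeof(struct io_uring_cqe);

                if (ctx->flags & IORING_SETUP_CQE32)
                        cqe_size <<= 1;

                spin_lock(&ctx->completion_lock);
                while (!list_empty(&ctx->cq_overflow_list)) {
                        struct io_overflow_cqe *ocqe;
                        struct io_uring_cqe *cqe;

                        /* stand-in for reserving the next free CQ ring entry */
                        cqe = get_free_cqe_slot(ctx);
                        if (!cqe)
                                break;
                        ocqe = list_first_entry(&ctx->cq_overflow_list,
                                                struct io_overflow_cqe, list);
                        /* copy the stashed CQE into the real ring and free it */
                        memcpy(cqe, &ocqe->cqe, cqe_size);
                        list_del(&ocqe->list);
                        kfree(ocqe);
                }
                if (list_empty(&ctx->cq_overflow_list))
                        clear_bit(IO_CHECK_CQ_OVERFLOW_BIT, &ctx->check_cq);
                spin_unlock(&ctx->completion_lock);
        }

The wake_up_state() in io_msg_add_overflow() is what gets a sleeping DEFER_TASKRUN waiter to notice IO_CHECK_CQ_OVERFLOW_BIT and perform a flush along these lines.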
* Re: [PATCH 2/3] io_uring/msg_ring: avoid double indirection task_work for data messages 2024-05-24 22:58 ` [PATCH 2/3] io_uring/msg_ring: avoid double indirection task_work for data messages Jens Axboe @ 2024-05-28 13:18 ` Pavel Begunkov 2024-05-28 14:23 ` Jens Axboe 2024-05-28 13:32 ` Pavel Begunkov 1 sibling, 1 reply; 23+ messages in thread From: Pavel Begunkov @ 2024-05-28 13:18 UTC (permalink / raw) To: Jens Axboe, io-uring On 5/24/24 23:58, Jens Axboe wrote: > If IORING_SETUP_SINGLE_ISSUER is set, then we can't post CQEs remotely > to the target ring. Instead, task_work is queued for the target ring, > which is used to post the CQE. To make matters worse, once the target > CQE has been posted, task_work is then queued with the originator to > fill the completion. > > This obviously adds a bunch of overhead and latency. Instead of relying > on generic kernel task_work for this, fill an overflow entry on the > target ring and flag it as such that the target ring will flush it. This > avoids both the task_work for posting the CQE, and it means that the > originator CQE can be filled inline as well. > > In local testing, this reduces the latency on the sender side by 5-6x. > > Signed-off-by: Jens Axboe <[email protected]> > --- > io_uring/msg_ring.c | 77 +++++++++++++++++++++++++++++++++++++++++++-- > 1 file changed, 74 insertions(+), 3 deletions(-) > > diff --git a/io_uring/msg_ring.c b/io_uring/msg_ring.c > index feff2b0822cf..3f89ff3a40ad 100644 > --- a/io_uring/msg_ring.c > +++ b/io_uring/msg_ring.c > @@ -123,6 +123,69 @@ static void io_msg_tw_complete(struct callback_head *head) > io_req_queue_tw_complete(req, ret); > } > > +static struct io_overflow_cqe *io_alloc_overflow(struct io_ring_ctx *target_ctx) > +{ > + bool is_cqe32 = target_ctx->flags & IORING_SETUP_CQE32; > + size_t cqe_size = sizeof(struct io_overflow_cqe); > + struct io_overflow_cqe *ocqe; > + > + if (is_cqe32) > + cqe_size += sizeof(struct io_uring_cqe); > + > + ocqe = kmalloc(cqe_size, GFP_ATOMIC | __GFP_ACCOUNT); > + if (!ocqe) > + return NULL; > + > + if (is_cqe32) > + ocqe->cqe.big_cqe[0] = ocqe->cqe.big_cqe[1] = 0; > + > + return ocqe; > +} > + > +/* > + * Entered with the target uring_lock held, and will drop it before > + * returning. Adds a previously allocated ocqe to the overflow list on > + * the target, and marks it appropriately for flushing. > + */ > +static void io_msg_add_overflow(struct io_msg *msg, > + struct io_ring_ctx *target_ctx, > + struct io_overflow_cqe *ocqe, int ret) > + __releases(target_ctx->uring_lock) > +{ > + spin_lock(&target_ctx->completion_lock); > + > + if (list_empty(&target_ctx->cq_overflow_list)) { > + set_bit(IO_CHECK_CQ_OVERFLOW_BIT, &target_ctx->check_cq); > + atomic_or(IORING_SQ_TASKRUN, &target_ctx->rings->sq_flags); TASKRUN? The normal overflow path sets IORING_SQ_CQ_OVERFLOW > + } > + > + ocqe->cqe.user_data = msg->user_data; > + ocqe->cqe.res = ret; > + list_add_tail(&ocqe->list, &target_ctx->cq_overflow_list); > + spin_unlock(&target_ctx->completion_lock); > + mutex_unlock(&target_ctx->uring_lock); > + wake_up_state(target_ctx->submitter_task, TASK_INTERRUPTIBLE); > +} -- Pavel Begunkov ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 2/3] io_uring/msg_ring: avoid double indirection task_work for data messages 2024-05-28 13:18 ` Pavel Begunkov @ 2024-05-28 14:23 ` Jens Axboe 0 siblings, 0 replies; 23+ messages in thread From: Jens Axboe @ 2024-05-28 14:23 UTC (permalink / raw) To: Pavel Begunkov, io-uring On 5/28/24 7:18 AM, Pavel Begunkov wrote: > On 5/24/24 23:58, Jens Axboe wrote: >> If IORING_SETUP_SINGLE_ISSUER is set, then we can't post CQEs remotely >> to the target ring. Instead, task_work is queued for the target ring, >> which is used to post the CQE. To make matters worse, once the target >> CQE has been posted, task_work is then queued with the originator to >> fill the completion. >> >> This obviously adds a bunch of overhead and latency. Instead of relying >> on generic kernel task_work for this, fill an overflow entry on the >> target ring and flag it as such that the target ring will flush it. This >> avoids both the task_work for posting the CQE, and it means that the >> originator CQE can be filled inline as well. >> >> In local testing, this reduces the latency on the sender side by 5-6x. >> >> Signed-off-by: Jens Axboe <[email protected]> >> --- >> io_uring/msg_ring.c | 77 +++++++++++++++++++++++++++++++++++++++++++-- >> 1 file changed, 74 insertions(+), 3 deletions(-) >> >> diff --git a/io_uring/msg_ring.c b/io_uring/msg_ring.c >> index feff2b0822cf..3f89ff3a40ad 100644 >> --- a/io_uring/msg_ring.c >> +++ b/io_uring/msg_ring.c >> @@ -123,6 +123,69 @@ static void io_msg_tw_complete(struct callback_head *head) >> io_req_queue_tw_complete(req, ret); >> } >> +static struct io_overflow_cqe *io_alloc_overflow(struct io_ring_ctx *target_ctx) >> +{ >> + bool is_cqe32 = target_ctx->flags & IORING_SETUP_CQE32; >> + size_t cqe_size = sizeof(struct io_overflow_cqe); >> + struct io_overflow_cqe *ocqe; >> + >> + if (is_cqe32) >> + cqe_size += sizeof(struct io_uring_cqe); >> + >> + ocqe = kmalloc(cqe_size, GFP_ATOMIC | __GFP_ACCOUNT); >> + if (!ocqe) >> + return NULL; >> + >> + if (is_cqe32) >> + ocqe->cqe.big_cqe[0] = ocqe->cqe.big_cqe[1] = 0; >> + >> + return ocqe; >> +} >> + >> +/* >> + * Entered with the target uring_lock held, and will drop it before >> + * returning. Adds a previously allocated ocqe to the overflow list on >> + * the target, and marks it appropriately for flushing. >> + */ >> +static void io_msg_add_overflow(struct io_msg *msg, >> + struct io_ring_ctx *target_ctx, >> + struct io_overflow_cqe *ocqe, int ret) >> + __releases(target_ctx->uring_lock) >> +{ >> + spin_lock(&target_ctx->completion_lock); >> + >> + if (list_empty(&target_ctx->cq_overflow_list)) { >> + set_bit(IO_CHECK_CQ_OVERFLOW_BIT, &target_ctx->check_cq); >> + atomic_or(IORING_SQ_TASKRUN, &target_ctx->rings->sq_flags); > > TASKRUN? The normal overflow path sets IORING_SQ_CQ_OVERFLOW Was a bit split on it - we want it run as part of waiting, but I also wasn't super interested in exposing it as an overflow condition since it is now. It's more of an internal implementation detail. -- Jens Axboe ^ permalink raw reply [flat|nested] 23+ messages in thread
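One reason either flag works for a receiver that is peeking from userspace is that liburing treats both bits as "enter the kernel and flush before concluding the CQ is empty". Its internal check looks roughly like the following (paraphrased from liburing's queue.c, so treat it as an approximation):

        /* roughly what liburing's internal CQE peek path checks */
        static inline bool cq_ring_needs_flush(struct io_uring *ring)
        {
                return IO_URING_READ_ONCE(*ring->sq.kflags) &
                       (IORING_SQ_CQ_OVERFLOW | IORING_SQ_TASKRUN);
        }

So from that angle IORING_SQ_TASKRUN is sufficient to get the target into the kernel, while IORING_SQ_CQ_OVERFLOW stays reserved for a genuinely full CQ, which matches the reasoning above. A receiver already sleeping in the kernel is instead covered by the wake_up_state() plus the check_cq bit set in the patch.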
* Re: [PATCH 2/3] io_uring/msg_ring: avoid double indirection task_work for data messages 2024-05-24 22:58 ` [PATCH 2/3] io_uring/msg_ring: avoid double indirection task_work for data messages Jens Axboe 2024-05-28 13:18 ` Pavel Begunkov @ 2024-05-28 13:32 ` Pavel Begunkov 2024-05-28 14:23 ` Jens Axboe 1 sibling, 1 reply; 23+ messages in thread From: Pavel Begunkov @ 2024-05-28 13:32 UTC (permalink / raw) To: Jens Axboe, io-uring On 5/24/24 23:58, Jens Axboe wrote: > If IORING_SETUP_SINGLE_ISSUER is set, then we can't post CQEs remotely > to the target ring. Instead, task_work is queued for the target ring, > which is used to post the CQE. To make matters worse, once the target > CQE has been posted, task_work is then queued with the originator to > fill the completion. > > This obviously adds a bunch of overhead and latency. Instead of relying > on generic kernel task_work for this, fill an overflow entry on the > target ring and flag it as such that the target ring will flush it. This > avoids both the task_work for posting the CQE, and it means that the > originator CQE can be filled inline as well. > > In local testing, this reduces the latency on the sender side by 5-6x. > > Signed-off-by: Jens Axboe <[email protected]> > --- > io_uring/msg_ring.c | 77 +++++++++++++++++++++++++++++++++++++++++++-- > 1 file changed, 74 insertions(+), 3 deletions(-) > > diff --git a/io_uring/msg_ring.c b/io_uring/msg_ring.c > index feff2b0822cf..3f89ff3a40ad 100644 > --- a/io_uring/msg_ring.c > +++ b/io_uring/msg_ring.c > @@ -123,6 +123,69 @@ static void io_msg_tw_complete(struct callback_head *head) > io_req_queue_tw_complete(req, ret); > } > > +static struct io_overflow_cqe *io_alloc_overflow(struct io_ring_ctx *target_ctx) > +{ > + bool is_cqe32 = target_ctx->flags & IORING_SETUP_CQE32; > + size_t cqe_size = sizeof(struct io_overflow_cqe); > + struct io_overflow_cqe *ocqe; > + > + if (is_cqe32) > + cqe_size += sizeof(struct io_uring_cqe); > + > + ocqe = kmalloc(cqe_size, GFP_ATOMIC | __GFP_ACCOUNT); __GFP_ACCOUNT looks painful > + if (!ocqe) > + return NULL; > + > + if (is_cqe32) > + ocqe->cqe.big_cqe[0] = ocqe->cqe.big_cqe[1] = 0; > + > + return ocqe; > +} > + ... -- Pavel Begunkov ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 2/3] io_uring/msg_ring: avoid double indirection task_work for data messages 2024-05-28 13:32 ` Pavel Begunkov @ 2024-05-28 14:23 ` Jens Axboe 2024-05-28 16:23 ` Pavel Begunkov 0 siblings, 1 reply; 23+ messages in thread From: Jens Axboe @ 2024-05-28 14:23 UTC (permalink / raw) To: Pavel Begunkov, io-uring On 5/28/24 7:32 AM, Pavel Begunkov wrote: > On 5/24/24 23:58, Jens Axboe wrote: >> If IORING_SETUP_SINGLE_ISSUER is set, then we can't post CQEs remotely >> to the target ring. Instead, task_work is queued for the target ring, >> which is used to post the CQE. To make matters worse, once the target >> CQE has been posted, task_work is then queued with the originator to >> fill the completion. >> >> This obviously adds a bunch of overhead and latency. Instead of relying >> on generic kernel task_work for this, fill an overflow entry on the >> target ring and flag it as such that the target ring will flush it. This >> avoids both the task_work for posting the CQE, and it means that the >> originator CQE can be filled inline as well. >> >> In local testing, this reduces the latency on the sender side by 5-6x. >> >> Signed-off-by: Jens Axboe <[email protected]> >> --- >> io_uring/msg_ring.c | 77 +++++++++++++++++++++++++++++++++++++++++++-- >> 1 file changed, 74 insertions(+), 3 deletions(-) >> >> diff --git a/io_uring/msg_ring.c b/io_uring/msg_ring.c >> index feff2b0822cf..3f89ff3a40ad 100644 >> --- a/io_uring/msg_ring.c >> +++ b/io_uring/msg_ring.c >> @@ -123,6 +123,69 @@ static void io_msg_tw_complete(struct callback_head *head) >> io_req_queue_tw_complete(req, ret); >> } >> +static struct io_overflow_cqe *io_alloc_overflow(struct io_ring_ctx *target_ctx) >> +{ >> + bool is_cqe32 = target_ctx->flags & IORING_SETUP_CQE32; >> + size_t cqe_size = sizeof(struct io_overflow_cqe); >> + struct io_overflow_cqe *ocqe; >> + >> + if (is_cqe32) >> + cqe_size += sizeof(struct io_uring_cqe); >> + >> + ocqe = kmalloc(cqe_size, GFP_ATOMIC | __GFP_ACCOUNT); > > __GFP_ACCOUNT looks painful It always is - I did add the usual alloc cache for this after posting this series, which makes it a no-op basically: https://git.kernel.dk/cgit/linux/commit/?h=io_uring-msg_ring&id=c39ead262b60872d6d7daf55e9fc7d76dc09b29d Just haven't posted a v2 yet. -- Jens Axboe ^ permalink raw reply [flat|nested] 23+ messages in thread
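The linked commit is the authoritative version of the cache; as a rough illustration of the idea only (not the code in that commit), a per-ring cache of overflow entries can be as simple as a capped freelist consulted before falling back to kmalloc(), reusing the ocqe->list member for linkage:

        #define MSG_CACHE_MAX   32      /* arbitrary cap for the sketch */

        struct msg_cqe_cache {          /* imagined per-ring state */
                struct list_head        free_list;
                unsigned int            nr_free;
        };

        /* called with the target ring's lock held, as msg_ring already does */
        static struct io_overflow_cqe *msg_cache_get(struct msg_cqe_cache *cache)
        {
                struct io_overflow_cqe *ocqe;

                if (list_empty(&cache->free_list))
                        return NULL;    /* caller falls back to kmalloc() */
                ocqe = list_first_entry(&cache->free_list,
                                        struct io_overflow_cqe, list);
                list_del(&ocqe->list);
                cache->nr_free--;
                return ocqe;
        }

        /* called by the target when it consumes the overflow entry */
        static void msg_cache_put(struct msg_cqe_cache *cache,
                                  struct io_overflow_cqe *ocqe)
        {
                if (cache->nr_free >= MSG_CACHE_MAX) {
                        kfree(ocqe);
                        return;
                }
                list_add(&ocqe->list, &cache->free_list);
                cache->nr_free++;
        }

Recycling on the target side is what makes the __GFP_ACCOUNT charge effectively a one-time cost per cached entry.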
* Re: [PATCH 2/3] io_uring/msg_ring: avoid double indirection task_work for data messages 2024-05-28 14:23 ` Jens Axboe @ 2024-05-28 16:23 ` Pavel Begunkov 2024-05-28 17:59 ` Jens Axboe 0 siblings, 1 reply; 23+ messages in thread From: Pavel Begunkov @ 2024-05-28 16:23 UTC (permalink / raw) To: Jens Axboe, io-uring On 5/28/24 15:23, Jens Axboe wrote: > On 5/28/24 7:32 AM, Pavel Begunkov wrote: >> On 5/24/24 23:58, Jens Axboe wrote: >>> If IORING_SETUP_SINGLE_ISSUER is set, then we can't post CQEs remotely >>> to the target ring. Instead, task_work is queued for the target ring, >>> which is used to post the CQE. To make matters worse, once the target >>> CQE has been posted, task_work is then queued with the originator to >>> fill the completion. >>> >>> This obviously adds a bunch of overhead and latency. Instead of relying >>> on generic kernel task_work for this, fill an overflow entry on the >>> target ring and flag it as such that the target ring will flush it. This >>> avoids both the task_work for posting the CQE, and it means that the >>> originator CQE can be filled inline as well. >>> >>> In local testing, this reduces the latency on the sender side by 5-6x. >>> >>> Signed-off-by: Jens Axboe <[email protected]> >>> --- >>> io_uring/msg_ring.c | 77 +++++++++++++++++++++++++++++++++++++++++++-- >>> 1 file changed, 74 insertions(+), 3 deletions(-) >>> >>> diff --git a/io_uring/msg_ring.c b/io_uring/msg_ring.c >>> index feff2b0822cf..3f89ff3a40ad 100644 >>> --- a/io_uring/msg_ring.c >>> +++ b/io_uring/msg_ring.c >>> @@ -123,6 +123,69 @@ static void io_msg_tw_complete(struct callback_head *head) >>> io_req_queue_tw_complete(req, ret); >>> } >>> +static struct io_overflow_cqe *io_alloc_overflow(struct io_ring_ctx *target_ctx) >>> +{ >>> + bool is_cqe32 = target_ctx->flags & IORING_SETUP_CQE32; >>> + size_t cqe_size = sizeof(struct io_overflow_cqe); >>> + struct io_overflow_cqe *ocqe; >>> + >>> + if (is_cqe32) >>> + cqe_size += sizeof(struct io_uring_cqe); >>> + >>> + ocqe = kmalloc(cqe_size, GFP_ATOMIC | __GFP_ACCOUNT); >> >> __GFP_ACCOUNT looks painful > > It always is - I did add the usual alloc cache for this after posting > this series, which makes it a no-op basically: Simple ring private cache wouldn't work so well with non uniform transfer distributions. One way messaging, userspace level batching, etc., but the main question is in the other email, i.e. maybe it's better to go with the 2 tw hop model, which returns memory back where it came from. > https://git.kernel.dk/cgit/linux/commit/?h=io_uring-msg_ring&id=c39ead262b60872d6d7daf55e9fc7d76dc09b29d > > Just haven't posted a v2 yet. > -- Pavel Begunkov ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 2/3] io_uring/msg_ring: avoid double indirection task_work for data messages 2024-05-28 16:23 ` Pavel Begunkov @ 2024-05-28 17:59 ` Jens Axboe 2024-05-29 2:04 ` Pavel Begunkov 0 siblings, 1 reply; 23+ messages in thread From: Jens Axboe @ 2024-05-28 17:59 UTC (permalink / raw) To: Pavel Begunkov, io-uring On 5/28/24 10:23 AM, Pavel Begunkov wrote: > On 5/28/24 15:23, Jens Axboe wrote: >> On 5/28/24 7:32 AM, Pavel Begunkov wrote: >>> On 5/24/24 23:58, Jens Axboe wrote: >>>> If IORING_SETUP_SINGLE_ISSUER is set, then we can't post CQEs remotely >>>> to the target ring. Instead, task_work is queued for the target ring, >>>> which is used to post the CQE. To make matters worse, once the target >>>> CQE has been posted, task_work is then queued with the originator to >>>> fill the completion. >>>> >>>> This obviously adds a bunch of overhead and latency. Instead of relying >>>> on generic kernel task_work for this, fill an overflow entry on the >>>> target ring and flag it as such that the target ring will flush it. This >>>> avoids both the task_work for posting the CQE, and it means that the >>>> originator CQE can be filled inline as well. >>>> >>>> In local testing, this reduces the latency on the sender side by 5-6x. >>>> >>>> Signed-off-by: Jens Axboe <[email protected]> >>>> --- >>>> io_uring/msg_ring.c | 77 +++++++++++++++++++++++++++++++++++++++++++-- >>>> 1 file changed, 74 insertions(+), 3 deletions(-) >>>> >>>> diff --git a/io_uring/msg_ring.c b/io_uring/msg_ring.c >>>> index feff2b0822cf..3f89ff3a40ad 100644 >>>> --- a/io_uring/msg_ring.c >>>> +++ b/io_uring/msg_ring.c >>>> @@ -123,6 +123,69 @@ static void io_msg_tw_complete(struct callback_head *head) >>>> io_req_queue_tw_complete(req, ret); >>>> } >>>> +static struct io_overflow_cqe *io_alloc_overflow(struct io_ring_ctx *target_ctx) >>>> +{ >>>> + bool is_cqe32 = target_ctx->flags & IORING_SETUP_CQE32; >>>> + size_t cqe_size = sizeof(struct io_overflow_cqe); >>>> + struct io_overflow_cqe *ocqe; >>>> + >>>> + if (is_cqe32) >>>> + cqe_size += sizeof(struct io_uring_cqe); >>>> + >>>> + ocqe = kmalloc(cqe_size, GFP_ATOMIC | __GFP_ACCOUNT); >>> >>> __GFP_ACCOUNT looks painful >> >> It always is - I did add the usual alloc cache for this after posting >> this series, which makes it a no-op basically: > > Simple ring private cache wouldn't work so well with non > uniform transfer distributions. One way messaging, userspace > level batching, etc., but the main question is in the other > email, i.e. maybe it's better to go with the 2 tw hop model, > which returns memory back where it came from. The cache is local to the ring, so anyone that sends messages to that ring gets to use it. So I believe it should in fact work really well. If messaging is bidirectional, then caching on the target will apply in both directions. -- Jens Axboe ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 2/3] io_uring/msg_ring: avoid double indirection task_work for data messages 2024-05-28 17:59 ` Jens Axboe @ 2024-05-29 2:04 ` Pavel Begunkov 2024-05-29 2:43 ` Jens Axboe 0 siblings, 1 reply; 23+ messages in thread From: Pavel Begunkov @ 2024-05-29 2:04 UTC (permalink / raw) To: Jens Axboe, io-uring On 5/28/24 18:59, Jens Axboe wrote: > On 5/28/24 10:23 AM, Pavel Begunkov wrote: >> On 5/28/24 15:23, Jens Axboe wrote: >>> On 5/28/24 7:32 AM, Pavel Begunkov wrote: >>>> On 5/24/24 23:58, Jens Axboe wrote: >>>>> If IORING_SETUP_SINGLE_ISSUER is set, then we can't post CQEs remotely >>>>> to the target ring. Instead, task_work is queued for the target ring, >>>>> which is used to post the CQE. To make matters worse, once the target >>>>> CQE has been posted, task_work is then queued with the originator to >>>>> fill the completion. >>>>> >>>>> This obviously adds a bunch of overhead and latency. Instead of relying >>>>> on generic kernel task_work for this, fill an overflow entry on the >>>>> target ring and flag it as such that the target ring will flush it. This >>>>> avoids both the task_work for posting the CQE, and it means that the >>>>> originator CQE can be filled inline as well. >>>>> >>>>> In local testing, this reduces the latency on the sender side by 5-6x. >>>>> >>>>> Signed-off-by: Jens Axboe <[email protected]> >>>>> --- >>>>> io_uring/msg_ring.c | 77 +++++++++++++++++++++++++++++++++++++++++++-- >>>>> 1 file changed, 74 insertions(+), 3 deletions(-) >>>>> >>>>> diff --git a/io_uring/msg_ring.c b/io_uring/msg_ring.c >>>>> index feff2b0822cf..3f89ff3a40ad 100644 >>>>> --- a/io_uring/msg_ring.c >>>>> +++ b/io_uring/msg_ring.c >>>>> @@ -123,6 +123,69 @@ static void io_msg_tw_complete(struct callback_head *head) >>>>> io_req_queue_tw_complete(req, ret); >>>>> } >>>>> +static struct io_overflow_cqe *io_alloc_overflow(struct io_ring_ctx *target_ctx) >>>>> +{ >>>>> + bool is_cqe32 = target_ctx->flags & IORING_SETUP_CQE32; >>>>> + size_t cqe_size = sizeof(struct io_overflow_cqe); >>>>> + struct io_overflow_cqe *ocqe; >>>>> + >>>>> + if (is_cqe32) >>>>> + cqe_size += sizeof(struct io_uring_cqe); >>>>> + >>>>> + ocqe = kmalloc(cqe_size, GFP_ATOMIC | __GFP_ACCOUNT); >>>> >>>> __GFP_ACCOUNT looks painful >>> >>> It always is - I did add the usual alloc cache for this after posting >>> this series, which makes it a no-op basically: >> >> Simple ring private cache wouldn't work so well with non >> uniform transfer distributions. One way messaging, userspace >> level batching, etc., but the main question is in the other >> email, i.e. maybe it's better to go with the 2 tw hop model, >> which returns memory back where it came from. > > The cache is local to the ring, so anyone that sends messages to that > ring gets to use it. So I believe it should in fact work really well. If > messaging is bidirectional, then caching on the target will apply in > both directions. *taking a look at the patch* it gets the entry from the target's ring, so indeed not a problem. Taking the target lock for that, however, is not the best, I ranted before about inter dependencies b/w rings. E.g. requests messaging a ring run by a task CPU bound in submission / tw execution would be directed to iowq and occupy a worker thread for the time being. -- Pavel Begunkov ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 2/3] io_uring/msg_ring: avoid double indirection task_work for data messages 2024-05-29 2:04 ` Pavel Begunkov @ 2024-05-29 2:43 ` Jens Axboe 0 siblings, 0 replies; 23+ messages in thread From: Jens Axboe @ 2024-05-29 2:43 UTC (permalink / raw) To: Pavel Begunkov, io-uring On 5/28/24 8:04 PM, Pavel Begunkov wrote: > On 5/28/24 18:59, Jens Axboe wrote: >> On 5/28/24 10:23 AM, Pavel Begunkov wrote: >>> On 5/28/24 15:23, Jens Axboe wrote: >>>> On 5/28/24 7:32 AM, Pavel Begunkov wrote: >>>>> On 5/24/24 23:58, Jens Axboe wrote: >>>>>> If IORING_SETUP_SINGLE_ISSUER is set, then we can't post CQEs remotely >>>>>> to the target ring. Instead, task_work is queued for the target ring, >>>>>> which is used to post the CQE. To make matters worse, once the target >>>>>> CQE has been posted, task_work is then queued with the originator to >>>>>> fill the completion. >>>>>> >>>>>> This obviously adds a bunch of overhead and latency. Instead of relying >>>>>> on generic kernel task_work for this, fill an overflow entry on the >>>>>> target ring and flag it as such that the target ring will flush it. This >>>>>> avoids both the task_work for posting the CQE, and it means that the >>>>>> originator CQE can be filled inline as well. >>>>>> >>>>>> In local testing, this reduces the latency on the sender side by 5-6x. >>>>>> >>>>>> Signed-off-by: Jens Axboe <[email protected]> >>>>>> --- >>>>>> io_uring/msg_ring.c | 77 +++++++++++++++++++++++++++++++++++++++++++-- >>>>>> 1 file changed, 74 insertions(+), 3 deletions(-) >>>>>> >>>>>> diff --git a/io_uring/msg_ring.c b/io_uring/msg_ring.c >>>>>> index feff2b0822cf..3f89ff3a40ad 100644 >>>>>> --- a/io_uring/msg_ring.c >>>>>> +++ b/io_uring/msg_ring.c >>>>>> @@ -123,6 +123,69 @@ static void io_msg_tw_complete(struct callback_head *head) >>>>>> io_req_queue_tw_complete(req, ret); >>>>>> } >>>>>> +static struct io_overflow_cqe *io_alloc_overflow(struct io_ring_ctx *target_ctx) >>>>>> +{ >>>>>> + bool is_cqe32 = target_ctx->flags & IORING_SETUP_CQE32; >>>>>> + size_t cqe_size = sizeof(struct io_overflow_cqe); >>>>>> + struct io_overflow_cqe *ocqe; >>>>>> + >>>>>> + if (is_cqe32) >>>>>> + cqe_size += sizeof(struct io_uring_cqe); >>>>>> + >>>>>> + ocqe = kmalloc(cqe_size, GFP_ATOMIC | __GFP_ACCOUNT); >>>>> >>>>> __GFP_ACCOUNT looks painful >>>> >>>> It always is - I did add the usual alloc cache for this after posting >>>> this series, which makes it a no-op basically: >>> >>> Simple ring private cache wouldn't work so well with non >>> uniform transfer distributions. One way messaging, userspace >>> level batching, etc., but the main question is in the other >>> email, i.e. maybe it's better to go with the 2 tw hop model, >>> which returns memory back where it came from. >> >> The cache is local to the ring, so anyone that sends messages to that >> ring gets to use it. So I believe it should in fact work really well. If >> messaging is bidirectional, then caching on the target will apply in >> both directions. > > *taking a look at the patch* it gets the entry from the target's > ring, so indeed not a problem. Taking the target lock for that, > however, is not the best, I ranted before about inter dependencies > b/w rings. E.g. requests messaging a ring run by a task CPU bound > in submission / tw execution would be directed to iowq and occupy > a worker thread for the time being. I can try and do some stats on io-wq bouncing, it can indeed be a risk. 
Might even be possible to only retain the ring lock for flushing, which is less of an issue as it happens locally, and have the overflow entries locked separately. For now I just kept the overflow backend that we already have, and the locking that MSG_RING already does. -- Jens Axboe ^ permalink raw reply [flat|nested] 23+ messages in thread
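To make the "locked separately" idea concrete, one purely hypothetical shape for it is a dedicated spinlock and list on the target, so a sender never touches the target's uring_lock and the target only takes its own lock when flushing. None of these fields or helpers exist in the posted patches; this is just a sketch of the possibility being floated:

        /* hypothetical extra target-ring state, sketch only */
        struct msg_overflow {
                spinlock_t              lock;   /* taken briefly by senders */
                struct list_head        list;   /* pending msg_ring CQEs */
        };

        static void msg_queue_remote(struct io_ring_ctx *target_ctx,
                                     struct msg_overflow *mo,
                                     struct io_overflow_cqe *ocqe)
        {
                spin_lock(&mo->lock);
                list_add_tail(&ocqe->list, &mo->list);
                spin_unlock(&mo->lock);

                /* still mark the CQ and wake the owner, as patch 2 does */
                set_bit(IO_CHECK_CQ_OVERFLOW_BIT, &target_ctx->check_cq);
                wake_up_state(target_ctx->submitter_task, TASK_INTERRUPTIBLE);
        }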
* [PATCH 3/3] io_uring/msg_ring: avoid double indirection task_work for fd passing 2024-05-24 22:58 [PATCHSET 0/3] Improve MSG_RING SINGLE_ISSUER performance Jens Axboe 2024-05-24 22:58 ` [PATCH 1/3] io_uring/msg_ring: split fd installing into a helper Jens Axboe 2024-05-24 22:58 ` [PATCH 2/3] io_uring/msg_ring: avoid double indirection task_work for data messages Jens Axboe @ 2024-05-24 22:58 ` Jens Axboe 2024-05-28 13:31 ` [PATCHSET 0/3] Improve MSG_RING SINGLE_ISSUER performance Pavel Begunkov 3 siblings, 0 replies; 23+ messages in thread From: Jens Axboe @ 2024-05-24 22:58 UTC (permalink / raw) To: io-uring; +Cc: Jens Axboe Like what was done for MSG_RING data passing avoiding a double task_work roundtrip for IORING_SETUP_SINGLE_ISSUER, implement the same model for fd passing. File descriptor passing is separately locked anyway, so the only remaining issue is CQE posting, just like it was for data passing. And for that, we can use the same approach. Signed-off-by: Jens Axboe <[email protected]> --- io_uring/msg_ring.c | 48 +++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 46 insertions(+), 2 deletions(-) diff --git a/io_uring/msg_ring.c b/io_uring/msg_ring.c index 3f89ff3a40ad..499702425711 100644 --- a/io_uring/msg_ring.c +++ b/io_uring/msg_ring.c @@ -299,6 +299,41 @@ static void io_msg_tw_fd_complete(struct callback_head *head) io_req_queue_tw_complete(req, ret); } +static int io_msg_install_remote(struct io_msg *msg, unsigned int issue_flags, + struct io_ring_ctx *target_ctx) +{ + bool skip_cqe = msg->flags & IORING_MSG_RING_CQE_SKIP; + struct io_overflow_cqe *ocqe; + int ret; + + if (!skip_cqe) { + ocqe = io_alloc_overflow(target_ctx); + if (!ocqe) + return -ENOMEM; + } + + if (unlikely(io_double_lock_ctx(target_ctx, issue_flags))) { + kfree(ocqe); + return -EAGAIN; + } + + ret = __io_fixed_fd_install(target_ctx, msg->src_file, msg->dst_fd); + if (ret < 0) + goto out; + + msg->src_file = NULL; + + if (!skip_cqe) { + ocqe->cqe.flags = 0; + io_msg_add_overflow(msg, target_ctx, ocqe, ret); + return 0; + } +out: + mutex_unlock(&target_ctx->uring_lock); + kfree(ocqe); + return ret; +} + static int io_msg_send_fd(struct io_kiocb *req, unsigned int issue_flags) { struct io_ring_ctx *target_ctx = req->file->private_data; @@ -320,8 +355,17 @@ static int io_msg_send_fd(struct io_kiocb *req, unsigned int issue_flags) req->flags |= REQ_F_NEED_CLEANUP; } - if (io_msg_need_remote(target_ctx)) - return io_msg_exec_remote(req, io_msg_tw_fd_complete); + if (io_msg_need_remote(target_ctx)) { + int ret; + + ret = io_msg_install_remote(msg, issue_flags, target_ctx); + if (ret == -EAGAIN) + return io_msg_exec_remote(req, io_msg_tw_fd_complete); + else if (ret < 0) + return ret; + req->flags &= ~REQ_F_NEED_CLEANUP; + return 0; + } return io_msg_install_complete(req, issue_flags); } -- 2.43.0 ^ permalink raw reply related [flat|nested] 23+ messages in thread
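From userspace, the path patch 3 optimizes corresponds to liburing's io_uring_prep_msg_ring_fd(), which moves a direct (registered) descriptor from a slot in the sender's ring into a slot in the target's ring. A hedged sketch, assuming both rings already have a registered file table (e.g. via io_uring_register_files() on the sender and io_uring_register_files_sparse() on the target); the helper name pass_direct_fd() is made up for illustration:

        #include <liburing.h>
        #include <errno.h>

        /*
         * Sketch only: pass the direct descriptor in slot 'src_slot' of the
         * sender's ring into slot 'dst_slot' of the receiver's ring, posting
         * a CQE with 'data' on the receiver unless CQE_SKIP is requested.
         */
        static int pass_direct_fd(struct io_uring *src, struct io_uring *dst,
                                  int src_slot, int dst_slot, __u64 data)
        {
                struct io_uring_sqe *sqe = io_uring_get_sqe(src);
                struct io_uring_cqe *cqe;
                int ret;

                if (!sqe)
                        return -EBUSY;

                io_uring_prep_msg_ring_fd(sqe, dst->ring_fd, src_slot,
                                          dst_slot, data, 0);
                io_uring_submit(src);

                /* with patch 3, this CQE no longer needs a task_work round trip */
                ret = io_uring_wait_cqe(src, &cqe);
                if (ret)
                        return ret;
                ret = cqe->res;
                io_uring_cqe_seen(src, cqe);
                return ret;
        }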
* Re: [PATCHSET 0/3] Improve MSG_RING SINGLE_ISSUER performance 2024-05-24 22:58 [PATCHSET 0/3] Improve MSG_RING SINGLE_ISSUER performance Jens Axboe ` (2 preceding siblings ...) 2024-05-24 22:58 ` [PATCH 3/3] io_uring/msg_ring: avoid double indirection task_work for fd passing Jens Axboe @ 2024-05-28 13:31 ` Pavel Begunkov 2024-05-28 14:34 ` Jens Axboe 3 siblings, 1 reply; 23+ messages in thread From: Pavel Begunkov @ 2024-05-28 13:31 UTC (permalink / raw) To: Jens Axboe, io-uring On 5/24/24 23:58, Jens Axboe wrote: > Hi, > > A ring setup with with IORING_SETUP_SINGLE_ISSUER, which is required to IORING_SETUP_SINGLE_ISSUER has nothing to do with it, it's specifically an IORING_SETUP_DEFER_TASKRUN optimisation. > use IORING_SETUP_DEFER_TASKRUN, will need two round trips through > generic task_work. This isn't ideal. This patchset attempts to rectify > that, taking a new approach rather than trying to use the io_uring > task_work infrastructure to handle it as in previous postings. Not sure why you'd want to piggyback onto overflows, it's not such a well made and reliable infra, whereas the DEFER_TASKRUN part of the task_work approach was fine. The completion path doesn't usually look at the overflow list but on cached cqe pointers showing the CQ is full, that means after you queue an overflow someone may post a CQE in the CQ in the normal path and you get reordering. Not that bad considering it's from another ring, but a bit nasty and surely will bite us back in the future, it always does. That's assuming you decide io_msg_need_remote() solely based ->task_complete and remove return current != target_ctx->submitter_task; otherwise you can get two linked msg_ring target CQEs reordered. It's also duplicating that crappy overflow code nobody cares much about, and it's still a forced wake up with no batching, circumventing the normal wake up path, i.e. io_uring tw. I don't think we should care about the request completion latency (sender latency), people should be more interested in the reaction speed (receiver latency), but if you care about it for a reason, perhaps you can just as well allocate a new request and complete the previous one right away. > In a sample test app that has one thread send messages to another and > logging both the time between sender sending and receiver receving and > just the time for the sender to post a message and get the CQE back, > I see the following sender latencies with the stock kernel: > > Latencies for: Sender > percentiles (nsec): > | 1.0000th=[ 4384], 5.0000th=[ 4512], 10.0000th=[ 4576], > | 20.0000th=[ 4768], 30.0000th=[ 4896], 40.0000th=[ 5024], > | 50.0000th=[ 5088], 60.0000th=[ 5152], 70.0000th=[ 5280], > | 80.0000th=[ 5344], 90.0000th=[ 5536], 95.0000th=[ 5728], > | 99.0000th=[ 8032], 99.5000th=[18048], 99.9000th=[21376], > | 99.9500th=[26496], 99.9900th=[59136] > > and with the patches: > > Latencies for: Sender > percentiles (nsec): > | 1.0000th=[ 756], 5.0000th=[ 820], 10.0000th=[ 828], > | 20.0000th=[ 844], 30.0000th=[ 852], 40.0000th=[ 852], > | 50.0000th=[ 860], 60.0000th=[ 860], 70.0000th=[ 868], > | 80.0000th=[ 884], 90.0000th=[ 964], 95.0000th=[ 988], > | 99.0000th=[ 1128], 99.5000th=[ 1208], 99.9000th=[ 1544], > | 99.9500th=[ 1944], 99.9900th=[ 2896] > > For the receiving side the win is smaller as it only "suffers" from > a single generic task_work, about a 10% win in latencies there. > > The idea here is to utilize the CQE overflow infrastructure for this, > as that allows the right task to post the CQE to the ring. 
> > 1 is a basic refactoring prep patch, patch 2 adds support for normal > messages, and patch 3 adopts the same approach for fd passing. > > io_uring/msg_ring.c | 151 ++++++++++++++++++++++++++++++++++++++++---- > 1 file changed, 138 insertions(+), 13 deletions(-) > -- Pavel Begunkov ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCHSET 0/3] Improve MSG_RING SINGLE_ISSUER performance 2024-05-28 13:31 ` [PATCHSET 0/3] Improve MSG_RING SINGLE_ISSUER performance Pavel Begunkov @ 2024-05-28 14:34 ` Jens Axboe 2024-05-28 14:39 ` Jens Axboe ` (2 more replies) 0 siblings, 3 replies; 23+ messages in thread From: Jens Axboe @ 2024-05-28 14:34 UTC (permalink / raw) To: Pavel Begunkov, io-uring On 5/28/24 7:31 AM, Pavel Begunkov wrote: > On 5/24/24 23:58, Jens Axboe wrote: >> Hi, >> >> A ring setup with with IORING_SETUP_SINGLE_ISSUER, which is required to > > IORING_SETUP_SINGLE_ISSUER has nothing to do with it, it's > specifically an IORING_SETUP_DEFER_TASKRUN optimisation. Right, I should change that in the commit message. It's task_complete driving it, which is tied to DEFER_TASKRUN. >> use IORING_SETUP_DEFER_TASKRUN, will need two round trips through >> generic task_work. This isn't ideal. This patchset attempts to rectify >> that, taking a new approach rather than trying to use the io_uring >> task_work infrastructure to handle it as in previous postings. > > Not sure why you'd want to piggyback onto overflows, it's not > such a well made and reliable infra, whereas the DEFER_TASKRUN > part of the task_work approach was fine. It's not right now, because it's really a "don't get into this condition, if you do, things are slower". And this series doesn't really change that, and honestly it doesn't even need to. It's still way better than what we had before in terms of DEFER_TASKRUN and messages. We could certainly make the messages a subset of real overflow if we wanted, but honestly it looks decent enough as-is with the changes that I'm not hugely motivated to do that. > The completion path doesn't usually look at the overflow list > but on cached cqe pointers showing the CQ is full, that means > after you queue an overflow someone may post a CQE in the CQ > in the normal path and you get reordering. Not that bad > considering it's from another ring, but a bit nasty and surely > will bite us back in the future, it always does. This is true, but generally true as well as completions come in async. You don't get to control the exact order on a remote ring. Messages themselves will be ordered when posted, which should be the important aspect here. Order with locally posted completions I don't see why you'd care, that's a timing game that you cannot control. > That's assuming you decide io_msg_need_remote() solely based > ->task_complete and remove > > return current != target_ctx->submitter_task; > > otherwise you can get two linked msg_ring target CQEs reordered. Good point, since it'd now be cheap enough, would probably make sense to simply gate it on task_complete alone. I even toyed with the idea of just using this approach for any ring type and kill some code in there, but didn't venture that far yet. > It's also duplicating that crappy overflow code nobody cares > much about, and it's still a forced wake up with no batching, > circumventing the normal wake up path, i.e. io_uring tw. Yes, since this is v1 I didn't bother to integrate more tightly with the generic overflows, that should obviously be done by first adding a helper for this. I consider that pretty minor. > I don't think we should care about the request completion > latency (sender latency), people should be more interested > in the reaction speed (receiver latency), but if you care > about it for a reason, perhaps you can just as well allocate > a new request and complete the previous one right away. 
I know the numbers I posted were just sender side improvements on that
particular box, however that isn't really the case for others. Here's
another test box:

axboe@r7525 ~> ./msg-lat
init_flags=3000
Wait on startup
802775: my fd 3, other 4
802776: my fd 4, other 3
Latencies for: Receiver
    percentiles (nsec):
     | 1.0000th=[ 4192], 5.0000th=[ 4320], 10.0000th=[ 4448],
     | 20.0000th=[ 4576], 30.0000th=[ 4704], 40.0000th=[ 4832],
     | 50.0000th=[ 4960], 60.0000th=[ 5088], 70.0000th=[ 5216],
     | 80.0000th=[ 5344], 90.0000th=[ 5536], 95.0000th=[ 5728],
     | 99.0000th=[ 6176], 99.5000th=[ 7136], 99.9000th=[19584],
     | 99.9500th=[20352], 99.9900th=[28288]
Latencies for: Sender
    percentiles (nsec):
     | 1.0000th=[ 6560], 5.0000th=[ 6880], 10.0000th=[ 7008],
     | 20.0000th=[ 7264], 30.0000th=[ 7456], 40.0000th=[ 7712],
     | 50.0000th=[ 8032], 60.0000th=[ 8256], 70.0000th=[ 8512],
     | 80.0000th=[ 8640], 90.0000th=[ 8896], 95.0000th=[ 9152],
     | 99.0000th=[ 9792], 99.5000th=[11584], 99.9000th=[23168],
     | 99.9500th=[28032], 99.9900th=[40192]

and after:

axboe@r7525 ~> ./msg-lat  1.776s
init_flags=3000
Wait on startup
3767: my fd 3, other 4
3768: my fd 4, other 3
Latencies for: Sender
    percentiles (nsec):
     | 1.0000th=[  740], 5.0000th=[  748], 10.0000th=[  756],
     | 20.0000th=[  764], 30.0000th=[  764], 40.0000th=[  772],
     | 50.0000th=[  772], 60.0000th=[  780], 70.0000th=[  780],
     | 80.0000th=[  860], 90.0000th=[  892], 95.0000th=[  900],
     | 99.0000th=[ 1224], 99.5000th=[ 1368], 99.9000th=[ 1656],
     | 99.9500th=[ 1976], 99.9900th=[ 3408]
Latencies for: Receiver
    percentiles (nsec):
     | 1.0000th=[ 2736], 5.0000th=[ 2736], 10.0000th=[ 2768],
     | 20.0000th=[ 2800], 30.0000th=[ 2800], 40.0000th=[ 2800],
     | 50.0000th=[ 2832], 60.0000th=[ 2832], 70.0000th=[ 2896],
     | 80.0000th=[ 2928], 90.0000th=[ 3024], 95.0000th=[ 3120],
     | 99.0000th=[ 4080], 99.5000th=[15424], 99.9000th=[18560],
     | 99.9500th=[21632], 99.9900th=[58624]

Obviously some variation in runs in general, but it's most certainly
faster in terms of receiving too. This test case is fixed at doing 100K
messages per second, didn't do any peak testing. But I strongly suspect
you'll see very nice efficiency gains here too, as doing two TWA_SIGNAL
task_work is pretty sucky imho.

You could just make it io_kiocb based, but I did not want to get into
foreign requests on remote rings. What would you envision with that
approach, using our normal ring task_work for this instead? That would
be an approach, obviously this one took a different path from the
previous task_work driven approach.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [PATCHSET 0/3] Improve MSG_RING SINGLE_ISSUER performance 2024-05-28 14:34 ` Jens Axboe @ 2024-05-28 14:39 ` Jens Axboe 2024-05-28 15:27 ` Jens Axboe 2024-05-28 16:50 ` Pavel Begunkov 2 siblings, 0 replies; 23+ messages in thread From: Jens Axboe @ 2024-05-28 14:39 UTC (permalink / raw) To: Pavel Begunkov, io-uring On 5/28/24 8:34 AM, Jens Axboe wrote: > You could just make it io_kiocb based, but I did not want to get into > foreign requests on remote rings. What would you envision with that > approach, using our normal ring task_work for this instead? That would > be an approach, obviously this one took a different path from the > previous task_work driven approach. Foreign io_kiocb itself isn't even enough, you'd need an overflow cqe allocated upfront too for posting directly without waiting for the reply. Basically it needs to be confident it'll post successfully on the remote ring, otherwise we cannot complete the source msg ring sqe inline. -- Jens Axboe ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCHSET 0/3] Improve MSG_RING SINGLE_ISSUER performance 2024-05-28 14:34 ` Jens Axboe 2024-05-28 14:39 ` Jens Axboe @ 2024-05-28 15:27 ` Jens Axboe 2024-05-28 16:50 ` Pavel Begunkov 2 siblings, 0 replies; 23+ messages in thread From: Jens Axboe @ 2024-05-28 15:27 UTC (permalink / raw) To: Pavel Begunkov, io-uring On 5/28/24 8:34 AM, Jens Axboe wrote: > Obivously some variation in runs in general, but it's most certainly > faster in terms of receiving too. This test case is fixed at doing 100K > messages per second, didn't do any peak testing. But I strongly suspect > you'll see very nice efficiency gains here too, as doing two TWA_SIGNAL > task_work is pretty sucky imho. Example on that same box. Running the same message passer test, sys time used for a 3 second run varies between 300ms and 850ms on the stock kernel (with a lot of variance), after it's around ~150ms and pretty steady. -- Jens Axboe ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCHSET 0/3] Improve MSG_RING SINGLE_ISSUER performance 2024-05-28 14:34 ` Jens Axboe 2024-05-28 14:39 ` Jens Axboe 2024-05-28 15:27 ` Jens Axboe @ 2024-05-28 16:50 ` Pavel Begunkov 2024-05-28 18:07 ` Jens Axboe 2 siblings, 1 reply; 23+ messages in thread From: Pavel Begunkov @ 2024-05-28 16:50 UTC (permalink / raw) To: Jens Axboe, io-uring On 5/28/24 15:34, Jens Axboe wrote: > On 5/28/24 7:31 AM, Pavel Begunkov wrote: >> On 5/24/24 23:58, Jens Axboe wrote: >>> Hi, >>> >>> A ring setup with with IORING_SETUP_SINGLE_ISSUER, which is required to >> >> IORING_SETUP_SINGLE_ISSUER has nothing to do with it, it's >> specifically an IORING_SETUP_DEFER_TASKRUN optimisation. > > Right, I should change that in the commit message. It's task_complete > driving it, which is tied to DEFER_TASKRUN. > >>> use IORING_SETUP_DEFER_TASKRUN, will need two round trips through >>> generic task_work. This isn't ideal. This patchset attempts to rectify >>> that, taking a new approach rather than trying to use the io_uring >>> task_work infrastructure to handle it as in previous postings. >> >> Not sure why you'd want to piggyback onto overflows, it's not >> such a well made and reliable infra, whereas the DEFER_TASKRUN >> part of the task_work approach was fine. > > It's not right now, because it's really a "don't get into this > condition, if you do, things are slower". And this series doesn't really > change that, and honestly it doesn't even need to. It's still way better > than what we had before in terms of DEFER_TASKRUN and messages. Better than how it is now or comparing to the previous attempt? I think the one using io_uring's tw infra was better, which is where all wake ups and optimisations currently consolidate. > We could certainly make the messages a subset of real overflow if we > wanted, but honestly it looks decent enough as-is with the changes that > I'm not hugely motivated to do that. Not sure what you mean here, but not really suggesting refactoring overflows. Taking the tw based patches should be easy, it only needs killing !DEFER_TASKRUN changes from there. >> The completion path doesn't usually look at the overflow list >> but on cached cqe pointers showing the CQ is full, that means >> after you queue an overflow someone may post a CQE in the CQ >> in the normal path and you get reordering. Not that bad >> considering it's from another ring, but a bit nasty and surely >> will bite us back in the future, it always does. > > This is true, but generally true as well as completions come in async. > You don't get to control the exact order on a remote ring. Messages > themselves will be ordered when posted, which should be the important > aspect here. Order with locally posted completions I don't see why you'd > care, that's a timing game that you cannot control. True for a single request, but in a more complex system sender's ordering will affect the order on the receiving side. ring1: msg_ring(); write(pipe) ring2: read(pipe) The user would definitely think that the other ring will first get a msg_ring CQE and then an CQE from the read, but as always it's hard to predict all repercussions. >> That's assuming you decide io_msg_need_remote() solely based >> ->task_complete and remove >> >> return current != target_ctx->submitter_task; >> >> otherwise you can get two linked msg_ring target CQEs reordered. > > Good point, since it'd now be cheap enough, would probably make sense to > simply gate it on task_complete alone. 
I even toyed with the idea of > just using this approach for any ring type and kill some code in there, > but didn't venture that far yet. That task check doesn't make any real difference. If it's the same thread you can skip io_uring all together. >> It's also duplicating that crappy overflow code nobody cares >> much about, and it's still a forced wake up with no batching, >> circumventing the normal wake up path, i.e. io_uring tw. > > Yes, since this is v1 I didn't bother to integrate more tightly with the > generic overflows, that should obviously be done by first adding a > helper for this. I consider that pretty minor. My problem is not duplication of code base but rather extending the internal user base of it. You can say that msg_ring can easily become a hot path for some, and then we'll be putting efforts both into overflows and task_work when in essence they solve quite a similar problem here. >> I don't think we should care about the request completion >> latency (sender latency), people should be more interested >> in the reaction speed (receiver latency), but if you care >> about it for a reason, perhaps you can just as well allocate >> a new request and complete the previous one right away. > > I know the numbers I posted was just sender side improvements on that > particular box, however that isn't really the case for others. Here's on > an another test box: > > axboe@r7525 ~> ./msg-lat > init_flags=3000 > Wait on startup > 802775: my fd 3, other 4 > 802776: my fd 4, other 3 > Latencies for: Receiver > percentiles (nsec): > | 1.0000th=[ 4192], 5.0000th=[ 4320], 10.0000th=[ 4448], > | 20.0000th=[ 4576], 30.0000th=[ 4704], 40.0000th=[ 4832], > | 50.0000th=[ 4960], 60.0000th=[ 5088], 70.0000th=[ 5216], > | 80.0000th=[ 5344], 90.0000th=[ 5536], 95.0000th=[ 5728], > | 99.0000th=[ 6176], 99.5000th=[ 7136], 99.9000th=[19584], > | 99.9500th=[20352], 99.9900th=[28288] > Latencies for: Sender > percentiles (nsec): > | 1.0000th=[ 6560], 5.0000th=[ 6880], 10.0000th=[ 7008], > | 20.0000th=[ 7264], 30.0000th=[ 7456], 40.0000th=[ 7712], > | 50.0000th=[ 8032], 60.0000th=[ 8256], 70.0000th=[ 8512], > | 80.0000th=[ 8640], 90.0000th=[ 8896], 95.0000th=[ 9152], > | 99.0000th=[ 9792], 99.5000th=[11584], 99.9000th=[23168], > | 99.9500th=[28032], 99.9900th=[40192] > > and after: > > axboe@r7525 ~> ./msg-lat 1.776s > init_flags=3000 > Wait on startup > 3767: my fd 3, other 4 > 3768: my fd 4, other 3 > Latencies for: Sender > percentiles (nsec): > | 1.0000th=[ 740], 5.0000th=[ 748], 10.0000th=[ 756], > | 20.0000th=[ 764], 30.0000th=[ 764], 40.0000th=[ 772], > | 50.0000th=[ 772], 60.0000th=[ 780], 70.0000th=[ 780], > | 80.0000th=[ 860], 90.0000th=[ 892], 95.0000th=[ 900], > | 99.0000th=[ 1224], 99.5000th=[ 1368], 99.9000th=[ 1656], > | 99.9500th=[ 1976], 99.9900th=[ 3408] > Latencies for: Receiver > percentiles (nsec): > | 1.0000th=[ 2736], 5.0000th=[ 2736], 10.0000th=[ 2768], > | 20.0000th=[ 2800], 30.0000th=[ 2800], 40.0000th=[ 2800], > | 50.0000th=[ 2832], 60.0000th=[ 2832], 70.0000th=[ 2896], > | 80.0000th=[ 2928], 90.0000th=[ 3024], 95.0000th=[ 3120], > | 99.0000th=[ 4080], 99.5000th=[15424], 99.9000th=[18560], > | 99.9500th=[21632], 99.9900th=[58624] > > Obivously some variation in runs in general, but it's most certainly > faster in terms of receiving too. This test case is fixed at doing 100K > messages per second, didn't do any peak testing. But I strongly suspect > you'll see very nice efficiency gains here too, as doing two TWA_SIGNAL > task_work is pretty sucky imho. 
Sure, you mentioned wins on the receiver side, I consider it to be the main merit (latency and throughput) > You could just make it io_kiocb based, but I did not want to get into > foreign requests on remote rings. What would you envision with that > approach, using our normal ring task_work for this instead? That would It was buggy in the !DEFER_TASKRUN path. Fortunately, you don't care about it because it just does it all under ->completion_lock, which is why you shouldn't have ever hit the problem in testing. > be an approach, obviously this one took a different path from the > previous task_work driven approach. -- Pavel Begunkov ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCHSET 0/3] Improve MSG_RING SINGLE_ISSUER performance 2024-05-28 16:50 ` Pavel Begunkov @ 2024-05-28 18:07 ` Jens Axboe 2024-05-28 18:31 ` Jens Axboe 0 siblings, 1 reply; 23+ messages in thread From: Jens Axboe @ 2024-05-28 18:07 UTC (permalink / raw) To: Pavel Begunkov, io-uring On 5/28/24 10:50 AM, Pavel Begunkov wrote: > On 5/28/24 15:34, Jens Axboe wrote: >> On 5/28/24 7:31 AM, Pavel Begunkov wrote: >>> On 5/24/24 23:58, Jens Axboe wrote: >>>> Hi, >>>> >>>> A ring setup with with IORING_SETUP_SINGLE_ISSUER, which is required to >>> >>> IORING_SETUP_SINGLE_ISSUER has nothing to do with it, it's >>> specifically an IORING_SETUP_DEFER_TASKRUN optimisation. >> >> Right, I should change that in the commit message. It's task_complete >> driving it, which is tied to DEFER_TASKRUN. >> >>>> use IORING_SETUP_DEFER_TASKRUN, will need two round trips through >>>> generic task_work. This isn't ideal. This patchset attempts to rectify >>>> that, taking a new approach rather than trying to use the io_uring >>>> task_work infrastructure to handle it as in previous postings. >>> >>> Not sure why you'd want to piggyback onto overflows, it's not >>> such a well made and reliable infra, whereas the DEFER_TASKRUN >>> part of the task_work approach was fine. >> >> It's not right now, because it's really a "don't get into this >> condition, if you do, things are slower". And this series doesn't really >> change that, and honestly it doesn't even need to. It's still way better >> than what we had before in terms of DEFER_TASKRUN and messages. > > Better than how it is now or comparing to the previous attempt? > I think the one using io_uring's tw infra was better, which is > where all wake ups and optimisations currently consolidate. Better than both - I haven't tested with the previous version, but I can certainly do that. The reason why I think it'll be better is that it avoids the double roundtrips. Yes v1 was using normal task_work which is better, but it didn't solve what I think is the fundamental issue here. I'll forward port it and give it a spin, then we'll know. >> We could certainly make the messages a subset of real overflow if we >> wanted, but honestly it looks decent enough as-is with the changes that >> I'm not hugely motivated to do that. > > Not sure what you mean here, but not really suggesting refactoring > overflows. Taking the tw based patches should be easy, it only > needs killing !DEFER_TASKRUN changes from there. Sorry wasn't clear, the refactoring was purely my suggestion to reduce a bit of the code duplication. >>> The completion path doesn't usually look at the overflow list >>> but on cached cqe pointers showing the CQ is full, that means >>> after you queue an overflow someone may post a CQE in the CQ >>> in the normal path and you get reordering. Not that bad >>> considering it's from another ring, but a bit nasty and surely >>> will bite us back in the future, it always does. >> >> This is true, but generally true as well as completions come in async. >> You don't get to control the exact order on a remote ring. Messages >> themselves will be ordered when posted, which should be the important >> aspect here. Order with locally posted completions I don't see why you'd >> care, that's a timing game that you cannot control. > > True for a single request, but in a more complex system > sender's ordering will affect the order on the receiving side. 
> > ring1: msg_ring(); write(pipe) > ring2: read(pipe) > > The user would definitely think that the other ring will > first get a msg_ring CQE and then an CQE from the read, but as > always it's hard to predict all repercussions. Nobody should be making assumptions like that, and that will in fact already not be the case. If the msg_ring fails to lock the remote ring, then it may very well end up in the hands of io-wq. And then you could get either result validly, msg CQE first or last. >>> That's assuming you decide io_msg_need_remote() solely based >>> ->task_complete and remove >>> >>> return current != target_ctx->submitter_task; >>> >>> otherwise you can get two linked msg_ring target CQEs reordered. >> >> Good point, since it'd now be cheap enough, would probably make sense to >> simply gate it on task_complete alone. I even toyed with the idea of >> just using this approach for any ring type and kill some code in there, >> but didn't venture that far yet. > > That task check doesn't make any real difference. If it's the > same thread you can skip io_uring all together. Yeah agree, I did add a patch in between after the last email to just remove the task check. It's not really a useful case to attempt to cater to in particular. >>> It's also duplicating that crappy overflow code nobody cares >>> much about, and it's still a forced wake up with no batching, >>> circumventing the normal wake up path, i.e. io_uring tw. >> >> Yes, since this is v1 I didn't bother to integrate more tightly with the >> generic overflows, that should obviously be done by first adding a >> helper for this. I consider that pretty minor. > > My problem is not duplication of code base but rather > extending the internal user base of it. You can say that > msg_ring can easily become a hot path for some, and > then we'll be putting efforts both into overflows and > task_work when in essence they solve quite a similar > problem here. That's why I was tempted to remove the task_work path for messages on top of this. But I agree on the wakeup side, that's obviously something that doesn't currently work with msg ring. And while I don't see that as a hard problem to solve, it is a bit annoying to have multiple sources for that. Would naturally be better to retain just the task_work side. For one use case that I think is interesting with messages is work passing, eliminating a user side data structure (and lock) for that and side channel wakeups, I don't think the wakeup batching matters as you're generally not going to be batching receiving work. You're either already running work for processing, or sleeping waiting for one. >>> I don't think we should care about the request completion >>> latency (sender latency), people should be more interested >>> in the reaction speed (receiver latency), but if you care >>> about it for a reason, perhaps you can just as well allocate >>> a new request and complete the previous one right away. >> >> I know the numbers I posted was just sender side improvements on that >> particular box, however that isn't really the case for others. 
Here's on >> an another test box: >> >> axboe@r7525 ~> ./msg-lat >> init_flags=3000 >> Wait on startup >> 802775: my fd 3, other 4 >> 802776: my fd 4, other 3 >> Latencies for: Receiver >> percentiles (nsec): >> | 1.0000th=[ 4192], 5.0000th=[ 4320], 10.0000th=[ 4448], >> | 20.0000th=[ 4576], 30.0000th=[ 4704], 40.0000th=[ 4832], >> | 50.0000th=[ 4960], 60.0000th=[ 5088], 70.0000th=[ 5216], >> | 80.0000th=[ 5344], 90.0000th=[ 5536], 95.0000th=[ 5728], >> | 99.0000th=[ 6176], 99.5000th=[ 7136], 99.9000th=[19584], >> | 99.9500th=[20352], 99.9900th=[28288] >> Latencies for: Sender >> percentiles (nsec): >> | 1.0000th=[ 6560], 5.0000th=[ 6880], 10.0000th=[ 7008], >> | 20.0000th=[ 7264], 30.0000th=[ 7456], 40.0000th=[ 7712], >> | 50.0000th=[ 8032], 60.0000th=[ 8256], 70.0000th=[ 8512], >> | 80.0000th=[ 8640], 90.0000th=[ 8896], 95.0000th=[ 9152], >> | 99.0000th=[ 9792], 99.5000th=[11584], 99.9000th=[23168], >> | 99.9500th=[28032], 99.9900th=[40192] >> >> and after: >> >> axboe@r7525 ~> ./msg-lat 1.776s >> init_flags=3000 >> Wait on startup >> 3767: my fd 3, other 4 >> 3768: my fd 4, other 3 >> Latencies for: Sender >> percentiles (nsec): >> | 1.0000th=[ 740], 5.0000th=[ 748], 10.0000th=[ 756], >> | 20.0000th=[ 764], 30.0000th=[ 764], 40.0000th=[ 772], >> | 50.0000th=[ 772], 60.0000th=[ 780], 70.0000th=[ 780], >> | 80.0000th=[ 860], 90.0000th=[ 892], 95.0000th=[ 900], >> | 99.0000th=[ 1224], 99.5000th=[ 1368], 99.9000th=[ 1656], >> | 99.9500th=[ 1976], 99.9900th=[ 3408] >> Latencies for: Receiver >> percentiles (nsec): >> | 1.0000th=[ 2736], 5.0000th=[ 2736], 10.0000th=[ 2768], >> | 20.0000th=[ 2800], 30.0000th=[ 2800], 40.0000th=[ 2800], >> | 50.0000th=[ 2832], 60.0000th=[ 2832], 70.0000th=[ 2896], >> | 80.0000th=[ 2928], 90.0000th=[ 3024], 95.0000th=[ 3120], >> | 99.0000th=[ 4080], 99.5000th=[15424], 99.9000th=[18560], >> | 99.9500th=[21632], 99.9900th=[58624] >> >> Obivously some variation in runs in general, but it's most certainly >> faster in terms of receiving too. This test case is fixed at doing 100K >> messages per second, didn't do any peak testing. But I strongly suspect >> you'll see very nice efficiency gains here too, as doing two TWA_SIGNAL >> task_work is pretty sucky imho. > > Sure, you mentioned wins on the receiver side, I consider it > to be the main merit (latency and throughput) Actually I think both are interesting - not because the sender side is latency sensitive for on receving the CQE for it, but because it goes hand in hand with a reduction in cycles spent sending that work. That's the win on the sender side, more so than the latency win. The latter is just gravy on top. >> You could just make it io_kiocb based, but I did not want to get into >> foreign requests on remote rings. What would you envision with that >> approach, using our normal ring task_work for this instead? That would > > It was buggy in the !DEFER_TASKRUN path. Fortunately, you don't care > about it because it just does it all under ->completion_lock, which > is why you shouldn't have ever hit the problem in testing. I'll test the old approach and we'll see where we are at. -- Jens Axboe ^ permalink raw reply [flat|nested] 23+ messages in thread
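On the io_msg_need_remote() point discussed above, the before/after being talked about is roughly the following. This is a paraphrased sketch built from the quoted check, not the literal source:

/* today: only bounce to the target task if it is someone else */
static inline bool io_msg_need_remote(struct io_ring_ctx *target_ctx)
{
	if (!target_ctx->task_complete)
		return false;
	return current != target_ctx->submitter_task;
}

/* suggested: gate purely on DEFER_TASKRUN-style completions */
static inline bool io_msg_need_remote(struct io_ring_ctx *target_ctx)
{
	return target_ctx->task_complete;
}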
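And for the pipe example earlier in this message, a minimal userspace sketch of the scenario's shape (plain rings, error handling omitted, fds and sizes arbitrary). Which of the two CQEs ring2 reaps first depends on the target ring's setup and on which internal path the MSG_RING takes, which is exactly the point about not relying on that order. Build with -luring; the sender's own CQE on ring1 is left unreaped to keep the sketch short.

#include <liburing.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static struct io_uring send_ring, recv_ring;
static int pipefd[2];

static void *receiver(void *arg)
{
	struct io_uring_cqe *cqe;

	(void) arg;
	/* reap two CQEs: the pipe read and the MSG_RING message */
	for (int i = 0; i < 2; i++) {
		io_uring_wait_cqe(&recv_ring, &cqe);
		printf("ring2 CQE %d: user_data=%llu res=%d\n", i,
		       (unsigned long long) cqe->user_data, cqe->res);
		io_uring_cqe_seen(&recv_ring, cqe);
	}
	return NULL;
}

int main(void)
{
	struct io_uring_sqe *sqe;
	static char buf[8];
	pthread_t thr;

	io_uring_queue_init(8, &send_ring, 0);
	io_uring_queue_init(8, &recv_ring, 0);
	pipe(pipefd);

	/* ring2: queue up the pipe read first */
	sqe = io_uring_get_sqe(&recv_ring);
	io_uring_prep_read(sqe, pipefd[0], buf, sizeof(buf), 0);
	sqe->user_data = 1;
	io_uring_submit(&recv_ring);
	pthread_create(&thr, NULL, receiver, NULL);

	/* ring1: msg_ring() to ring2, then write(pipe) */
	sqe = io_uring_get_sqe(&send_ring);
	io_uring_prep_msg_ring(sqe, recv_ring.ring_fd, 0, 2, 0);
	io_uring_submit(&send_ring);
	write(pipefd[1], "x", 1);

	pthread_join(thr, NULL);
	io_uring_queue_exit(&send_ring);
	io_uring_queue_exit(&recv_ring);
	return 0;
}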
* Re: [PATCHSET 0/3] Improve MSG_RING SINGLE_ISSUER performance 2024-05-28 18:07 ` Jens Axboe @ 2024-05-28 18:31 ` Jens Axboe 2024-05-28 23:04 ` Jens Axboe 0 siblings, 1 reply; 23+ messages in thread From: Jens Axboe @ 2024-05-28 18:31 UTC (permalink / raw) To: Pavel Begunkov, io-uring On 5/28/24 12:07 PM, Jens Axboe wrote: > On 5/28/24 10:50 AM, Pavel Begunkov wrote: >> On 5/28/24 15:34, Jens Axboe wrote: >>> On 5/28/24 7:31 AM, Pavel Begunkov wrote: >>>> On 5/24/24 23:58, Jens Axboe wrote: >>>>> Hi, >>>>> >>>>> A ring setup with with IORING_SETUP_SINGLE_ISSUER, which is required to >>>> >>>> IORING_SETUP_SINGLE_ISSUER has nothing to do with it, it's >>>> specifically an IORING_SETUP_DEFER_TASKRUN optimisation. >>> >>> Right, I should change that in the commit message. It's task_complete >>> driving it, which is tied to DEFER_TASKRUN. >>> >>>>> use IORING_SETUP_DEFER_TASKRUN, will need two round trips through >>>>> generic task_work. This isn't ideal. This patchset attempts to rectify >>>>> that, taking a new approach rather than trying to use the io_uring >>>>> task_work infrastructure to handle it as in previous postings. >>>> >>>> Not sure why you'd want to piggyback onto overflows, it's not >>>> such a well made and reliable infra, whereas the DEFER_TASKRUN >>>> part of the task_work approach was fine. >>> >>> It's not right now, because it's really a "don't get into this >>> condition, if you do, things are slower". And this series doesn't really >>> change that, and honestly it doesn't even need to. It's still way better >>> than what we had before in terms of DEFER_TASKRUN and messages. >> >> Better than how it is now or comparing to the previous attempt? >> I think the one using io_uring's tw infra was better, which is >> where all wake ups and optimisations currently consolidate. > > Better than both - I haven't tested with the previous version, but I can > certainly do that. The reason why I think it'll be better is that it > avoids the double roundtrips. Yes v1 was using normal task_work which is > better, but it didn't solve what I think is the fundamental issue here. > > I'll forward port it and give it a spin, then we'll know. I suspect a bug in the previous patches, because this is what the forward port looks like. 
First, for reference, the current results: init_flags=3000 Wait on startup 3767: my fd 3, other 4 3768: my fd 4, other 3 Latencies for: Sender percentiles (nsec): | 1.0000th=[ 740], 5.0000th=[ 748], 10.0000th=[ 756], | 20.0000th=[ 764], 30.0000th=[ 764], 40.0000th=[ 772], | 50.0000th=[ 772], 60.0000th=[ 780], 70.0000th=[ 780], | 80.0000th=[ 860], 90.0000th=[ 892], 95.0000th=[ 900], | 99.0000th=[ 1224], 99.5000th=[ 1368], 99.9000th=[ 1656], | 99.9500th=[ 1976], 99.9900th=[ 3408] Latencies for: Receiver percentiles (nsec): | 1.0000th=[ 2736], 5.0000th=[ 2736], 10.0000th=[ 2768], | 20.0000th=[ 2800], 30.0000th=[ 2800], 40.0000th=[ 2800], | 50.0000th=[ 2832], 60.0000th=[ 2832], 70.0000th=[ 2896], | 80.0000th=[ 2928], 90.0000th=[ 3024], 95.0000th=[ 3120], | 99.0000th=[ 4080], 99.5000th=[15424], 99.9000th=[18560], | 99.9500th=[21632], 99.9900th=[58624] and here's with io_uring-msg_ring.1, which is just a straight forward forward port of the previous patches on the same base as v2: init_flags=3000 Wait on startup 4097: my fd 4, other 3 4096: my fd 3, other 4 Latencies for: Receiver percentiles (nsec): | 1.0000th=[ 5920], 5.0000th=[ 5920], 10.0000th=[ 5984], | 20.0000th=[ 5984], 30.0000th=[ 6048], 40.0000th=[ 6048], | 50.0000th=[ 6112], 60.0000th=[ 6304], 70.0000th=[ 6368], | 80.0000th=[ 6560], 90.0000th=[ 6880], 95.0000th=[ 7072], | 99.0000th=[ 7456], 99.5000th=[ 7712], 99.9000th=[ 8640], | 99.9500th=[10432], 99.9900th=[26240] Latencies for: Sender percentiles (nsec): | 1.0000th=[ 9536], 5.0000th=[ 9664], 10.0000th=[ 9664], | 20.0000th=[ 9920], 30.0000th=[ 9920], 40.0000th=[10048], | 50.0000th=[10176], 60.0000th=[10304], 70.0000th=[10432], | 80.0000th=[10688], 90.0000th=[10944], 95.0000th=[11328], | 99.0000th=[11840], 99.5000th=[12096], 99.9000th=[13888], | 99.9500th=[15424], 99.9900th=[34560] -- Jens Axboe ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCHSET 0/3] Improve MSG_RING SINGLE_ISSUER performance 2024-05-28 18:31 ` Jens Axboe @ 2024-05-28 23:04 ` Jens Axboe 2024-05-29 1:35 ` Jens Axboe 0 siblings, 1 reply; 23+ messages in thread From: Jens Axboe @ 2024-05-28 23:04 UTC (permalink / raw) To: Pavel Begunkov, io-uring On 5/28/24 12:31 PM, Jens Axboe wrote: > I suspect a bug in the previous patches, because this is what the > forward port looks like. First, for reference, the current results: Got it sorted, and pinned sender and receiver on CPUs to avoid the variation. It looks like this with the task_work approach that I sent out as v1: Latencies for: Sender percentiles (nsec): | 1.0000th=[ 2160], 5.0000th=[ 2672], 10.0000th=[ 2768], | 20.0000th=[ 3568], 30.0000th=[ 3568], 40.0000th=[ 3600], | 50.0000th=[ 3600], 60.0000th=[ 3600], 70.0000th=[ 3632], | 80.0000th=[ 3632], 90.0000th=[ 3664], 95.0000th=[ 3696], | 99.0000th=[ 4832], 99.5000th=[15168], 99.9000th=[16192], | 99.9500th=[16320], 99.9900th=[18304] Latencies for: Receiver percentiles (nsec): | 1.0000th=[ 1528], 5.0000th=[ 1576], 10.0000th=[ 1656], | 20.0000th=[ 2040], 30.0000th=[ 2064], 40.0000th=[ 2064], | 50.0000th=[ 2064], 60.0000th=[ 2064], 70.0000th=[ 2096], | 80.0000th=[ 2096], 90.0000th=[ 2128], 95.0000th=[ 2160], | 99.0000th=[ 3472], 99.5000th=[14784], 99.9000th=[15168], | 99.9500th=[15424], 99.9900th=[17280] and here's the exact same test run on the current patches: Latencies for: Sender percentiles (nsec): | 1.0000th=[ 362], 5.0000th=[ 362], 10.0000th=[ 370], | 20.0000th=[ 370], 30.0000th=[ 370], 40.0000th=[ 370], | 50.0000th=[ 374], 60.0000th=[ 382], 70.0000th=[ 382], | 80.0000th=[ 382], 90.0000th=[ 382], 95.0000th=[ 390], | 99.0000th=[ 402], 99.5000th=[ 430], 99.9000th=[ 900], | 99.9500th=[ 972], 99.9900th=[ 1432] Latencies for: Receiver percentiles (nsec): | 1.0000th=[ 1528], 5.0000th=[ 1544], 10.0000th=[ 1560], | 20.0000th=[ 1576], 30.0000th=[ 1592], 40.0000th=[ 1592], | 50.0000th=[ 1592], 60.0000th=[ 1608], 70.0000th=[ 1608], | 80.0000th=[ 1640], 90.0000th=[ 1672], 95.0000th=[ 1688], | 99.0000th=[ 1848], 99.5000th=[ 2128], 99.9000th=[14272], | 99.9500th=[14784], 99.9900th=[73216] I'll try and augment the test app to do proper rated submissions, so I can ramp up the rates a bit and see what happens. -- Jens Axboe ^ permalink raw reply [flat|nested] 23+ messages in thread
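For reference, the pinning mentioned above can be done along these lines. This is one possible way, not necessarily what the test app does, and the CPU numbers are arbitrary:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* pin the calling thread to one CPU, taking scheduler placement out of the picture */
static int pin_to_cpu(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

/* e.g. the sender calls pin_to_cpu(0) and the receiver pin_to_cpu(1) at startup */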
* Re: [PATCHSET 0/3] Improve MSG_RING SINGLE_ISSUER performance 2024-05-28 23:04 ` Jens Axboe @ 2024-05-29 1:35 ` Jens Axboe 2024-05-29 2:08 ` Pavel Begunkov 0 siblings, 1 reply; 23+ messages in thread From: Jens Axboe @ 2024-05-29 1:35 UTC (permalink / raw) To: Pavel Begunkov, io-uring On 5/28/24 5:04 PM, Jens Axboe wrote: > On 5/28/24 12:31 PM, Jens Axboe wrote: >> I suspect a bug in the previous patches, because this is what the >> forward port looks like. First, for reference, the current results: > > Got it sorted, and pinned sender and receiver on CPUs to avoid the > variation. It looks like this with the task_work approach that I sent > out as v1: > > Latencies for: Sender > percentiles (nsec): > | 1.0000th=[ 2160], 5.0000th=[ 2672], 10.0000th=[ 2768], > | 20.0000th=[ 3568], 30.0000th=[ 3568], 40.0000th=[ 3600], > | 50.0000th=[ 3600], 60.0000th=[ 3600], 70.0000th=[ 3632], > | 80.0000th=[ 3632], 90.0000th=[ 3664], 95.0000th=[ 3696], > | 99.0000th=[ 4832], 99.5000th=[15168], 99.9000th=[16192], > | 99.9500th=[16320], 99.9900th=[18304] > Latencies for: Receiver > percentiles (nsec): > | 1.0000th=[ 1528], 5.0000th=[ 1576], 10.0000th=[ 1656], > | 20.0000th=[ 2040], 30.0000th=[ 2064], 40.0000th=[ 2064], > | 50.0000th=[ 2064], 60.0000th=[ 2064], 70.0000th=[ 2096], > | 80.0000th=[ 2096], 90.0000th=[ 2128], 95.0000th=[ 2160], > | 99.0000th=[ 3472], 99.5000th=[14784], 99.9000th=[15168], > | 99.9500th=[15424], 99.9900th=[17280] > > and here's the exact same test run on the current patches: > > Latencies for: Sender > percentiles (nsec): > | 1.0000th=[ 362], 5.0000th=[ 362], 10.0000th=[ 370], > | 20.0000th=[ 370], 30.0000th=[ 370], 40.0000th=[ 370], > | 50.0000th=[ 374], 60.0000th=[ 382], 70.0000th=[ 382], > | 80.0000th=[ 382], 90.0000th=[ 382], 95.0000th=[ 390], > | 99.0000th=[ 402], 99.5000th=[ 430], 99.9000th=[ 900], > | 99.9500th=[ 972], 99.9900th=[ 1432] > Latencies for: Receiver > percentiles (nsec): > | 1.0000th=[ 1528], 5.0000th=[ 1544], 10.0000th=[ 1560], > | 20.0000th=[ 1576], 30.0000th=[ 1592], 40.0000th=[ 1592], > | 50.0000th=[ 1592], 60.0000th=[ 1608], 70.0000th=[ 1608], > | 80.0000th=[ 1640], 90.0000th=[ 1672], 95.0000th=[ 1688], > | 99.0000th=[ 1848], 99.5000th=[ 2128], 99.9000th=[14272], > | 99.9500th=[14784], 99.9900th=[73216] > > I'll try and augment the test app to do proper rated submissions, so I > can ramp up the rates a bit and see what happens. And the final one, with the rated sends sorted out. One key observation is that v1 trails the current edition, it just can't keep up as the rate is increased. 
If we cap the rate at at what should be 33K messages per second, v1 gets ~28K messages and has the following latency profile (for a 3 second run) Latencies for: Receiver (msg=83863) percentiles (nsec): | 1.0000th=[ 1208], 5.0000th=[ 1336], 10.0000th=[ 1400], | 20.0000th=[ 1768], 30.0000th=[ 1912], 40.0000th=[ 1976], | 50.0000th=[ 2040], 60.0000th=[ 2160], 70.0000th=[ 2256], | 80.0000th=[ 2480], 90.0000th=[ 2736], 95.0000th=[ 3024], | 99.0000th=[ 4080], 99.5000th=[ 4896], 99.9000th=[ 9664], | 99.9500th=[ 17024], 99.9900th=[218112] Latencies for: Sender (msg=83863) percentiles (nsec): | 1.0000th=[ 1928], 5.0000th=[ 2064], 10.0000th=[ 2160], | 20.0000th=[ 2608], 30.0000th=[ 2672], 40.0000th=[ 2736], | 50.0000th=[ 2864], 60.0000th=[ 2960], 70.0000th=[ 3152], | 80.0000th=[ 3408], 90.0000th=[ 4128], 95.0000th=[ 4576], | 99.0000th=[ 5920], 99.5000th=[ 6752], 99.9000th=[ 13376], | 99.9500th=[ 22912], 99.9900th=[261120] and the current edition does: Latencies for: Sender (msg=94488) percentiles (nsec): | 1.0000th=[ 181], 5.0000th=[ 191], 10.0000th=[ 201], | 20.0000th=[ 215], 30.0000th=[ 225], 40.0000th=[ 235], | 50.0000th=[ 262], 60.0000th=[ 306], 70.0000th=[ 430], | 80.0000th=[ 1004], 90.0000th=[ 2480], 95.0000th=[ 3632], | 99.0000th=[ 8096], 99.5000th=[12352], 99.9000th=[18048], | 99.9500th=[19584], 99.9900th=[23680] Latencies for: Receiver (msg=94488) percentiles (nsec): | 1.0000th=[ 342], 5.0000th=[ 398], 10.0000th=[ 482], | 20.0000th=[ 652], 30.0000th=[ 812], 40.0000th=[ 972], | 50.0000th=[ 1240], 60.0000th=[ 1640], 70.0000th=[ 1944], | 80.0000th=[ 2448], 90.0000th=[ 3248], 95.0000th=[ 5216], | 99.0000th=[10304], 99.5000th=[12352], 99.9000th=[18048], | 99.9500th=[19840], 99.9900th=[23168] If we cap it where v1 keeps up, at 13K messages per second, v1 does: Latencies for: Receiver (msg=38820) percentiles (nsec): | 1.0000th=[ 1160], 5.0000th=[ 1256], 10.0000th=[ 1352], | 20.0000th=[ 1688], 30.0000th=[ 1928], 40.0000th=[ 1976], | 50.0000th=[ 2064], 60.0000th=[ 2384], 70.0000th=[ 2480], | 80.0000th=[ 2768], 90.0000th=[ 3280], 95.0000th=[ 3472], | 99.0000th=[ 4192], 99.5000th=[ 4512], 99.9000th=[ 6624], | 99.9500th=[ 8768], 99.9900th=[14272] Latencies for: Sender (msg=38820) percentiles (nsec): | 1.0000th=[ 1848], 5.0000th=[ 1928], 10.0000th=[ 2040], | 20.0000th=[ 2608], 30.0000th=[ 2640], 40.0000th=[ 2736], | 50.0000th=[ 3024], 60.0000th=[ 3120], 70.0000th=[ 3376], | 80.0000th=[ 3824], 90.0000th=[ 4512], 95.0000th=[ 4768], | 99.0000th=[ 5536], 99.5000th=[ 6048], 99.9000th=[ 9024], | 99.9500th=[10304], 99.9900th=[23424] and v2 does: Latencies for: Sender (msg=39005) percentiles (nsec): | 1.0000th=[ 191], 5.0000th=[ 211], 10.0000th=[ 262], | 20.0000th=[ 342], 30.0000th=[ 382], 40.0000th=[ 402], | 50.0000th=[ 450], 60.0000th=[ 532], 70.0000th=[ 1080], | 80.0000th=[ 1848], 90.0000th=[ 4768], 95.0000th=[10944], | 99.0000th=[16512], 99.5000th=[18304], 99.9000th=[22400], | 99.9500th=[26496], 99.9900th=[41728] Latencies for: Receiver (msg=39005) percentiles (nsec): | 1.0000th=[ 410], 5.0000th=[ 604], 10.0000th=[ 700], | 20.0000th=[ 900], 30.0000th=[ 1128], 40.0000th=[ 1320], | 50.0000th=[ 1672], 60.0000th=[ 2256], 70.0000th=[ 2736], | 80.0000th=[ 3760], 90.0000th=[ 5408], 95.0000th=[11072], | 99.0000th=[18304], 99.5000th=[20096], 99.9000th=[24704], | 99.9500th=[27520], 99.9900th=[35584] -- Jens Axboe ^ permalink raw reply [flat|nested] 23+ messages in thread
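As for what "rated submissions" can look like (an assumption about the test app change, since it isn't shown here): pace the sender off a monotonic clock so messages go out at a fixed rate rather than back to back. A rough sketch, assuming the period is below one second; send_one() is a stand-in for whatever posts a single message:

#include <stdint.h>
#include <time.h>

static inline uint64_t mono_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t) ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

/* send nr_msgs messages, one every period_ns nanoseconds */
static void rated_send_loop(uint64_t period_ns, int nr_msgs,
			    void (*send_one)(void))
{
	uint64_t next = mono_ns();

	for (int i = 0; i < nr_msgs; i++) {
		send_one();		/* e.g. the msg_ring submit + wait */
		next += period_ns;
		uint64_t now = mono_ns();
		if (now < next) {
			struct timespec ts = {
				.tv_sec = 0,
				.tv_nsec = (long) (next - now),
			};
			nanosleep(&ts, NULL);
		}
	}
}

/* ~33K msgs/sec would be rated_send_loop(1000000000ULL / 33000, total, send_one); */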
* Re: [PATCHSET 0/3] Improve MSG_RING SINGLE_ISSUER performance 2024-05-29 1:35 ` Jens Axboe @ 2024-05-29 2:08 ` Pavel Begunkov 2024-05-29 2:42 ` Jens Axboe 0 siblings, 1 reply; 23+ messages in thread From: Pavel Begunkov @ 2024-05-29 2:08 UTC (permalink / raw) To: Jens Axboe, io-uring On 5/29/24 02:35, Jens Axboe wrote: > On 5/28/24 5:04 PM, Jens Axboe wrote: >> On 5/28/24 12:31 PM, Jens Axboe wrote: >>> I suspect a bug in the previous patches, because this is what the >>> forward port looks like. First, for reference, the current results: >> >> Got it sorted, and pinned sender and receiver on CPUs to avoid the >> variation. It looks like this with the task_work approach that I sent >> out as v1: >> >> Latencies for: Sender >> percentiles (nsec): >> | 1.0000th=[ 2160], 5.0000th=[ 2672], 10.0000th=[ 2768], >> | 20.0000th=[ 3568], 30.0000th=[ 3568], 40.0000th=[ 3600], >> | 50.0000th=[ 3600], 60.0000th=[ 3600], 70.0000th=[ 3632], >> | 80.0000th=[ 3632], 90.0000th=[ 3664], 95.0000th=[ 3696], >> | 99.0000th=[ 4832], 99.5000th=[15168], 99.9000th=[16192], >> | 99.9500th=[16320], 99.9900th=[18304] >> Latencies for: Receiver >> percentiles (nsec): >> | 1.0000th=[ 1528], 5.0000th=[ 1576], 10.0000th=[ 1656], >> | 20.0000th=[ 2040], 30.0000th=[ 2064], 40.0000th=[ 2064], >> | 50.0000th=[ 2064], 60.0000th=[ 2064], 70.0000th=[ 2096], >> | 80.0000th=[ 2096], 90.0000th=[ 2128], 95.0000th=[ 2160], >> | 99.0000th=[ 3472], 99.5000th=[14784], 99.9000th=[15168], >> | 99.9500th=[15424], 99.9900th=[17280] >> >> and here's the exact same test run on the current patches: >> >> Latencies for: Sender >> percentiles (nsec): >> | 1.0000th=[ 362], 5.0000th=[ 362], 10.0000th=[ 370], >> | 20.0000th=[ 370], 30.0000th=[ 370], 40.0000th=[ 370], >> | 50.0000th=[ 374], 60.0000th=[ 382], 70.0000th=[ 382], >> | 80.0000th=[ 382], 90.0000th=[ 382], 95.0000th=[ 390], >> | 99.0000th=[ 402], 99.5000th=[ 430], 99.9000th=[ 900], >> | 99.9500th=[ 972], 99.9900th=[ 1432] >> Latencies for: Receiver >> percentiles (nsec): >> | 1.0000th=[ 1528], 5.0000th=[ 1544], 10.0000th=[ 1560], >> | 20.0000th=[ 1576], 30.0000th=[ 1592], 40.0000th=[ 1592], >> | 50.0000th=[ 1592], 60.0000th=[ 1608], 70.0000th=[ 1608], >> | 80.0000th=[ 1640], 90.0000th=[ 1672], 95.0000th=[ 1688], >> | 99.0000th=[ 1848], 99.5000th=[ 2128], 99.9000th=[14272], >> | 99.9500th=[14784], 99.9900th=[73216] >> >> I'll try and augment the test app to do proper rated submissions, so I >> can ramp up the rates a bit and see what happens. > > And the final one, with the rated sends sorted out. One key observation > is that v1 trails the current edition, it just can't keep up as the rate > is increased. If we cap the rate at at what should be 33K messages per > second, v1 gets ~28K messages and has the following latency profile (for > a 3 second run) Do you see where the receiver latency comes from? The wakeups are quite similar in nature, assuming it's all wait(nr=1) and CPUs are not 100% consumed. The hop back spoils scheduling timing? 
> Latencies for: Receiver (msg=83863) > percentiles (nsec): > | 1.0000th=[ 1208], 5.0000th=[ 1336], 10.0000th=[ 1400], > | 20.0000th=[ 1768], 30.0000th=[ 1912], 40.0000th=[ 1976], > | 50.0000th=[ 2040], 60.0000th=[ 2160], 70.0000th=[ 2256], > | 80.0000th=[ 2480], 90.0000th=[ 2736], 95.0000th=[ 3024], > | 99.0000th=[ 4080], 99.5000th=[ 4896], 99.9000th=[ 9664], > | 99.9500th=[ 17024], 99.9900th=[218112] > Latencies for: Sender (msg=83863) > percentiles (nsec): > | 1.0000th=[ 1928], 5.0000th=[ 2064], 10.0000th=[ 2160], > | 20.0000th=[ 2608], 30.0000th=[ 2672], 40.0000th=[ 2736], > | 50.0000th=[ 2864], 60.0000th=[ 2960], 70.0000th=[ 3152], > | 80.0000th=[ 3408], 90.0000th=[ 4128], 95.0000th=[ 4576], > | 99.0000th=[ 5920], 99.5000th=[ 6752], 99.9000th=[ 13376], > | 99.9500th=[ 22912], 99.9900th=[261120] > > and the current edition does: > > Latencies for: Sender (msg=94488) > percentiles (nsec): > | 1.0000th=[ 181], 5.0000th=[ 191], 10.0000th=[ 201], > | 20.0000th=[ 215], 30.0000th=[ 225], 40.0000th=[ 235], > | 50.0000th=[ 262], 60.0000th=[ 306], 70.0000th=[ 430], > | 80.0000th=[ 1004], 90.0000th=[ 2480], 95.0000th=[ 3632], > | 99.0000th=[ 8096], 99.5000th=[12352], 99.9000th=[18048], > | 99.9500th=[19584], 99.9900th=[23680] > Latencies for: Receiver (msg=94488) > percentiles (nsec): > | 1.0000th=[ 342], 5.0000th=[ 398], 10.0000th=[ 482], > | 20.0000th=[ 652], 30.0000th=[ 812], 40.0000th=[ 972], > | 50.0000th=[ 1240], 60.0000th=[ 1640], 70.0000th=[ 1944], > | 80.0000th=[ 2448], 90.0000th=[ 3248], 95.0000th=[ 5216], > | 99.0000th=[10304], 99.5000th=[12352], 99.9000th=[18048], > | 99.9500th=[19840], 99.9900th=[23168] > > If we cap it where v1 keeps up, at 13K messages per second, v1 does: > > Latencies for: Receiver (msg=38820) > percentiles (nsec): > | 1.0000th=[ 1160], 5.0000th=[ 1256], 10.0000th=[ 1352], > | 20.0000th=[ 1688], 30.0000th=[ 1928], 40.0000th=[ 1976], > | 50.0000th=[ 2064], 60.0000th=[ 2384], 70.0000th=[ 2480], > | 80.0000th=[ 2768], 90.0000th=[ 3280], 95.0000th=[ 3472], > | 99.0000th=[ 4192], 99.5000th=[ 4512], 99.9000th=[ 6624], > | 99.9500th=[ 8768], 99.9900th=[14272] > Latencies for: Sender (msg=38820) > percentiles (nsec): > | 1.0000th=[ 1848], 5.0000th=[ 1928], 10.0000th=[ 2040], > | 20.0000th=[ 2608], 30.0000th=[ 2640], 40.0000th=[ 2736], > | 50.0000th=[ 3024], 60.0000th=[ 3120], 70.0000th=[ 3376], > | 80.0000th=[ 3824], 90.0000th=[ 4512], 95.0000th=[ 4768], > | 99.0000th=[ 5536], 99.5000th=[ 6048], 99.9000th=[ 9024], > | 99.9500th=[10304], 99.9900th=[23424] > > and v2 does: > > Latencies for: Sender (msg=39005) > percentiles (nsec): > | 1.0000th=[ 191], 5.0000th=[ 211], 10.0000th=[ 262], > | 20.0000th=[ 342], 30.0000th=[ 382], 40.0000th=[ 402], > | 50.0000th=[ 450], 60.0000th=[ 532], 70.0000th=[ 1080], > | 80.0000th=[ 1848], 90.0000th=[ 4768], 95.0000th=[10944], > | 99.0000th=[16512], 99.5000th=[18304], 99.9000th=[22400], > | 99.9500th=[26496], 99.9900th=[41728] > Latencies for: Receiver (msg=39005) > percentiles (nsec): > | 1.0000th=[ 410], 5.0000th=[ 604], 10.0000th=[ 700], > | 20.0000th=[ 900], 30.0000th=[ 1128], 40.0000th=[ 1320], > | 50.0000th=[ 1672], 60.0000th=[ 2256], 70.0000th=[ 2736], > | 80.0000th=[ 3760], 90.0000th=[ 5408], 95.0000th=[11072], > | 99.0000th=[18304], 99.5000th=[20096], 99.9000th=[24704], > | 99.9500th=[27520], 99.9900th=[35584] > -- Pavel Begunkov ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCHSET 0/3] Improve MSG_RING SINGLE_ISSUER performance 2024-05-29 2:08 ` Pavel Begunkov @ 2024-05-29 2:42 ` Jens Axboe 0 siblings, 0 replies; 23+ messages in thread From: Jens Axboe @ 2024-05-29 2:42 UTC (permalink / raw) To: Pavel Begunkov, io-uring On 5/28/24 8:08 PM, Pavel Begunkov wrote: > On 5/29/24 02:35, Jens Axboe wrote: >> On 5/28/24 5:04 PM, Jens Axboe wrote: >>> On 5/28/24 12:31 PM, Jens Axboe wrote: >>>> I suspect a bug in the previous patches, because this is what the >>>> forward port looks like. First, for reference, the current results: >>> >>> Got it sorted, and pinned sender and receiver on CPUs to avoid the >>> variation. It looks like this with the task_work approach that I sent >>> out as v1: >>> >>> Latencies for: Sender >>> percentiles (nsec): >>> | 1.0000th=[ 2160], 5.0000th=[ 2672], 10.0000th=[ 2768], >>> | 20.0000th=[ 3568], 30.0000th=[ 3568], 40.0000th=[ 3600], >>> | 50.0000th=[ 3600], 60.0000th=[ 3600], 70.0000th=[ 3632], >>> | 80.0000th=[ 3632], 90.0000th=[ 3664], 95.0000th=[ 3696], >>> | 99.0000th=[ 4832], 99.5000th=[15168], 99.9000th=[16192], >>> | 99.9500th=[16320], 99.9900th=[18304] >>> Latencies for: Receiver >>> percentiles (nsec): >>> | 1.0000th=[ 1528], 5.0000th=[ 1576], 10.0000th=[ 1656], >>> | 20.0000th=[ 2040], 30.0000th=[ 2064], 40.0000th=[ 2064], >>> | 50.0000th=[ 2064], 60.0000th=[ 2064], 70.0000th=[ 2096], >>> | 80.0000th=[ 2096], 90.0000th=[ 2128], 95.0000th=[ 2160], >>> | 99.0000th=[ 3472], 99.5000th=[14784], 99.9000th=[15168], >>> | 99.9500th=[15424], 99.9900th=[17280] >>> >>> and here's the exact same test run on the current patches: >>> >>> Latencies for: Sender >>> percentiles (nsec): >>> | 1.0000th=[ 362], 5.0000th=[ 362], 10.0000th=[ 370], >>> | 20.0000th=[ 370], 30.0000th=[ 370], 40.0000th=[ 370], >>> | 50.0000th=[ 374], 60.0000th=[ 382], 70.0000th=[ 382], >>> | 80.0000th=[ 382], 90.0000th=[ 382], 95.0000th=[ 390], >>> | 99.0000th=[ 402], 99.5000th=[ 430], 99.9000th=[ 900], >>> | 99.9500th=[ 972], 99.9900th=[ 1432] >>> Latencies for: Receiver >>> percentiles (nsec): >>> | 1.0000th=[ 1528], 5.0000th=[ 1544], 10.0000th=[ 1560], >>> | 20.0000th=[ 1576], 30.0000th=[ 1592], 40.0000th=[ 1592], >>> | 50.0000th=[ 1592], 60.0000th=[ 1608], 70.0000th=[ 1608], >>> | 80.0000th=[ 1640], 90.0000th=[ 1672], 95.0000th=[ 1688], >>> | 99.0000th=[ 1848], 99.5000th=[ 2128], 99.9000th=[14272], >>> | 99.9500th=[14784], 99.9900th=[73216] >>> >>> I'll try and augment the test app to do proper rated submissions, so I >>> can ramp up the rates a bit and see what happens. >> >> And the final one, with the rated sends sorted out. One key observation >> is that v1 trails the current edition, it just can't keep up as the rate >> is increased. If we cap the rate at at what should be 33K messages per >> second, v1 gets ~28K messages and has the following latency profile (for >> a 3 second run) > > Do you see where the receiver latency comes from? The wakeups are > quite similar in nature, assuming it's all wait(nr=1) and CPUs > are not 100% consumed. The hop back spoils scheduling timing? I haven't dug into that side yet, but I'm guessing it's indeed scheduling artifacts. It's all doing single waits, the sender is doing io_uring_submit_and_wait(ring, 1) and the receiver is doing io_uring_wait_cqe(ring, &cqe); -- Jens Axboe ^ permalink raw reply [flat|nested] 23+ messages in thread
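Tying the two calls above together, the per-message loops in a test like this presumably look roughly as follows. This is a sketch, not the actual msg-lat source: carrying the send timestamp in the message's user_data is an assumption made so both sides can compute a latency, and mono_ns() is the helper sketched earlier.

#include <liburing.h>
#include <stdint.h>

/* sender: post one MSG_RING, wait for its own CQE, return the round-trip time */
static uint64_t send_one_msg(struct io_uring *ring, int target_ring_fd)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
	struct io_uring_cqe *cqe;
	uint64_t t0 = mono_ns();

	io_uring_prep_msg_ring(sqe, target_ring_fd, 0, t0, 0);
	io_uring_submit_and_wait(ring, 1);
	io_uring_peek_cqe(ring, &cqe);
	io_uring_cqe_seen(ring, cqe);
	return mono_ns() - t0;		/* "Sender" latency */
}

/* receiver: wait for the message and compute the one-way delay */
static uint64_t recv_one_msg(struct io_uring *ring)
{
	struct io_uring_cqe *cqe;
	uint64_t sent;

	io_uring_wait_cqe(ring, &cqe);
	sent = cqe->user_data;		/* timestamp stuffed in by the sender */
	io_uring_cqe_seen(ring, cqe);
	return mono_ns() - sent;	/* "Receiver" latency */
}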