* Sending CQE to a different ring
From: Artyom Pavlov @ 2022-03-09 23:49 UTC (permalink / raw)
To: io-uring
Greetings!

A common approach for multi-threaded servers is to have a number of
threads equal to the number of cores and to launch a separate ring in
each one. AFAIK, currently, if we want to send an event to a different
ring, we have to write-lock that ring, create an SQE, and update the
ring index. Alternatively, we could use some kind of user-space message
passing.

Such approaches are somewhat inefficient, and I think the problem can
be solved elegantly by updating the io_uring_sqe type to accept the fd
of the ring to which the kernel must send the CQE. It could be done by
introducing an IOSQE_ flag and using one of the currently unused
padding u64s.

Such a feature could be useful for load balancing and for message
passing between threads, which would ride on top of io_uring, i.e. you
could send a NOP with user_data pointing to a message payload.
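To illustrate, a purely hypothetical sketch of the submission side
(IOSQE_CQE_RING and the cqe_ring_fd field are made-up names standing in
for the proposed flag and the reused padding u64):

#include <liburing.h>

/* Hypothetical sketch only: IOSQE_CQE_RING and sqe->cqe_ring_fd do not
 * exist; they stand in for the proposed flag and padding u64. */
static void read_complete_elsewhere(struct io_uring *ring1,
                                    struct io_uring *ring2, int fd,
                                    void *buf, unsigned len, void *msg)
{
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring1);

        io_uring_prep_read(sqe, fd, buf, len, 0);
        sqe->flags |= IOSQE_CQE_RING;           /* proposed IOSQE_ flag */
        sqe->cqe_ring_fd = ring2->ring_fd;      /* proposed padding reuse */
        io_uring_sqe_set_data(sqe, msg);        /* CQE is posted on ring2 */
        io_uring_submit(ring1);
}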
Best regards,
Artyom Pavlov.
* Re: Sending CQE to a different ring
From: Jens Axboe @ 2022-03-10 1:36 UTC (permalink / raw)
To: Artyom Pavlov, io-uring
On 3/9/22 4:49 PM, Artyom Pavlov wrote:
> Greetings!
>
> A common approach for multi-threaded servers is to have a number of
> threads equal to the number of cores and to launch a separate ring in
> each one. AFAIK, currently, if we want to send an event to a different
> ring, we have to write-lock that ring, create an SQE, and update the
> ring index. Alternatively, we could use some kind of user-space message
> passing.
>
> Such approaches are somewhat inefficient, and I think the problem can
> be solved elegantly by updating the io_uring_sqe type to accept the fd
> of the ring to which the kernel must send the CQE. It could be done by
> introducing an IOSQE_ flag and using one of the currently unused
> padding u64s.
>
> Such a feature could be useful for load balancing and for message
> passing between threads, which would ride on top of io_uring, i.e. you
> could send a NOP with user_data pointing to a message payload.
So what you want is a NOP with 'fd' set to the fd of another ring, and
that nop posts a CQE on that other ring? I don't think we'd need IOSQE
flags for that, we just need a NOP that supports that. I see a few ways
of going about that:

1) Add a new 'NOP' that takes an fd, and validates that that fd is an
   io_uring instance. It can then grab the completion lock on that ring
   and post an empty CQE.

2) We add a FEAT flag saying NOP supports taking an 'fd' argument, where
   'fd' is another ring. Posting CQE same as above.

3) We add a specific opcode for this. Basically the same as #2, but
   maybe with a more descriptive name than NOP.

Might make sense to pair that with a CQE flag or something like that, as
there's no specific user_data that could be used as it doesn't match an
existing SQE that has been issued. IORING_CQE_F_WAKEUP for example.
Would be applicable to all the above cases.

I kind of like #3 the best. Add a IORING_OP_RING_WAKEUP command, require
that sqe->fd point to a ring (could even be the ring itself, doesn't
matter). And add IORING_CQE_F_WAKEUP as a specific flag for that.
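As a sketch of the userspace side of #3 (IORING_OP_RING_WAKEUP being the
proposed, not-yet-existing opcode):

#include <string.h>
#include <liburing.h>

/* Sketch only: IORING_OP_RING_WAKEUP is the proposed opcode; the
 * target ring's waiter would see a CQE with IORING_CQE_F_WAKEUP set. */
static int wakeup_ring(struct io_uring *self, struct io_uring *target)
{
        struct io_uring_sqe *sqe = io_uring_get_sqe(self);

        if (!sqe)
                return -EBUSY;
        memset(sqe, 0, sizeof(*sqe));
        sqe->opcode = IORING_OP_RING_WAKEUP;    /* proposed opcode */
        sqe->fd = target->ring_fd;              /* ring receiving the CQE */
        sqe->user_data = 1;                     /* local completion tag */
        return io_uring_submit(self);
}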
--
Jens Axboe
* Re: Sending CQE to a different ring
From: Jens Axboe @ 2022-03-10 1:55 UTC (permalink / raw)
To: Artyom Pavlov, io-uring
On 3/9/22 6:36 PM, Jens Axboe wrote:
> On 3/9/22 4:49 PM, Artyom Pavlov wrote:
>> Greetings!
>>
>> A common approach for multi-threaded servers is to have a number of
>> threads equal to the number of cores and to launch a separate ring in
>> each one. AFAIK, currently, if we want to send an event to a different
>> ring, we have to write-lock that ring, create an SQE, and update the
>> ring index. Alternatively, we could use some kind of user-space message
>> passing.
>>
>> Such approaches are somewhat inefficient, and I think the problem can
>> be solved elegantly by updating the io_uring_sqe type to accept the fd
>> of the ring to which the kernel must send the CQE. It could be done by
>> introducing an IOSQE_ flag and using one of the currently unused
>> padding u64s.
>>
>> Such a feature could be useful for load balancing and for message
>> passing between threads, which would ride on top of io_uring, i.e. you
>> could send a NOP with user_data pointing to a message payload.
>
> So what you want is a NOP with 'fd' set to the fd of another ring, and
> that nop posts a CQE on that other ring? I don't think we'd need IOSQE
> flags for that, we just need a NOP that supports that. I see a few ways
> of going about that:
>
> 1) Add a new 'NOP' that takes an fd, and validates that that fd is an
> io_uring instance. It can then grab the completion lock on that ring
> and post an empty CQE.
>
> 2) We add a FEAT flag saying NOP supports taking an 'fd' argument, where
> 'fd' is another ring. Posting CQE same as above.
>
> 3) We add a specific opcode for this. Basically the same as #2, but
> maybe with a more descriptive name than NOP.
>
> Might make sense to pair that with a CQE flag or something like that, as
> there's no specific user_data that could be used as it doesn't match an
> existing SQE that has been issued. IORING_CQE_F_WAKEUP for example.
> Would be applicable to all the above cases.
>
> I kind of like #3 the best. Add a IORING_OP_RING_WAKEUP command, require
> that sqe->fd point to a ring (could even be the ring itself, doesn't
> matter). And add IORING_CQE_F_WAKEUP as a specific flag for that.
Something like the below, totally untested. The request will complete on
the original ring with either 0, for success, or -EOVERFLOW if the
target ring was already in an overflow state. If the fd specified isn't
an io_uring context, then the request will complete with -EBADFD.

If you have any way of testing this, please do. I'll write a basic
functionality test for it as well, but not until tomorrow.

Maybe we want to include in cqe->res who the waker was? We can stuff the
pid/tid in there, for example.
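Roughly like this, as a sketch (task_pid_vnr() being one option for
that):

/* Sketch of the "who woke us" idea: report the submitter's tid in
 * cqe->res rather than 0 (task_pid_vnr() being one way to do that). */
if (cqe) {
        WRITE_ONCE(cqe->user_data, 0);
        WRITE_ONCE(cqe->res, task_pid_vnr(current));
        WRITE_ONCE(cqe->flags, IORING_CQE_F_WAKEUP);
}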
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 2e04f718319d..b21f85a48224 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1105,6 +1105,9 @@ static const struct io_op_def io_op_defs[] = {
 	[IORING_OP_MKDIRAT] = {},
 	[IORING_OP_SYMLINKAT] = {},
 	[IORING_OP_LINKAT] = {},
+	[IORING_OP_WAKEUP_RING] = {
+		.needs_file		= 1,
+	},
 };
 
 /* requests with any of those set should undergo io_disarm_next() */
@@ -4235,6 +4238,44 @@ static int io_nop(struct io_kiocb *req, unsigned int issue_flags)
 	return 0;
 }
 
+static int io_wakeup_ring_prep(struct io_kiocb *req,
+			       const struct io_uring_sqe *sqe)
+{
+	if (unlikely(sqe->addr || sqe->ioprio || sqe->buf_index || sqe->off ||
+		     sqe->len || sqe->rw_flags || sqe->splice_fd_in ||
+		     sqe->personality))
+		return -EINVAL;
+
+	if (req->file->f_op != &io_uring_fops)
+		return -EBADFD;
+
+	return 0;
+}
+
+static int io_wakeup_ring(struct io_kiocb *req, unsigned int issue_flags)
+{
+	struct io_uring_cqe *cqe;
+	struct io_ring_ctx *ctx;
+	int ret = 0;
+
+	ctx = req->file->private_data;
+	spin_lock(&ctx->completion_lock);
+	cqe = io_get_cqe(ctx);
+	if (cqe) {
+		WRITE_ONCE(cqe->user_data, 0);
+		WRITE_ONCE(cqe->res, 0);
+		WRITE_ONCE(cqe->flags, IORING_CQE_F_WAKEUP);
+	} else {
+		ret = -EOVERFLOW;
+	}
+	io_commit_cqring(ctx);
+	spin_unlock(&ctx->completion_lock);
+	io_cqring_ev_posted(ctx);
+
+	__io_req_complete(req, issue_flags, ret, 0);
+	return 0;
+}
+
 static int io_fsync_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
 	struct io_ring_ctx *ctx = req->ctx;
@@ -6568,6 +6609,8 @@ static int io_req_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 		return io_symlinkat_prep(req, sqe);
 	case IORING_OP_LINKAT:
 		return io_linkat_prep(req, sqe);
+	case IORING_OP_WAKEUP_RING:
+		return io_wakeup_ring_prep(req, sqe);
 	}
 
 	printk_once(KERN_WARNING "io_uring: unhandled opcode %d\n",
@@ -6851,6 +6894,9 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
 	case IORING_OP_LINKAT:
 		ret = io_linkat(req, issue_flags);
 		break;
+	case IORING_OP_WAKEUP_RING:
+		ret = io_wakeup_ring(req, issue_flags);
+		break;
 	default:
 		ret = -EINVAL;
 		break;
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 787f491f0d2a..088232133594 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -143,6 +143,7 @@ enum {
 	IORING_OP_MKDIRAT,
 	IORING_OP_SYMLINKAT,
 	IORING_OP_LINKAT,
+	IORING_OP_WAKEUP_RING,
 
 	/* this goes last, obviously */
 	IORING_OP_LAST,
@@ -199,9 +200,11 @@ struct io_uring_cqe {
  *
  * IORING_CQE_F_BUFFER	If set, the upper 16 bits are the buffer ID
  * IORING_CQE_F_MORE	If set, parent SQE will generate more CQE entries
+ * IORING_CQE_F_WAKEUP	Wakeup request CQE, no link to an SQE
  */
 #define IORING_CQE_F_BUFFER		(1U << 0)
 #define IORING_CQE_F_MORE		(1U << 1)
+#define IORING_CQE_F_WAKEUP		(1U << 2)
 
 enum {
 	IORING_CQE_BUFFER_SHIFT		= 16,
--
Jens Axboe
* Re: Sending CQE to a different ring
From: Artyom Pavlov @ 2022-03-10 2:11 UTC (permalink / raw)
To: Jens Axboe, io-uring
10.03.2022 04:36, Jens Axboe wrote:
> On 3/9/22 4:49 PM, Artyom Pavlov wrote:
>> Greetings!
>>
>> A common approach for multi-threaded servers is to have a number of
>> threads equal to the number of cores and to launch a separate ring in
>> each one. AFAIK, currently, if we want to send an event to a different
>> ring, we have to write-lock that ring, create an SQE, and update the
>> ring index. Alternatively, we could use some kind of user-space message
>> passing.
>>
>> Such approaches are somewhat inefficient, and I think the problem can
>> be solved elegantly by updating the io_uring_sqe type to accept the fd
>> of the ring to which the kernel must send the CQE. It could be done by
>> introducing an IOSQE_ flag and using one of the currently unused
>> padding u64s.
>>
>> Such a feature could be useful for load balancing and for message
>> passing between threads, which would ride on top of io_uring, i.e. you
>> could send a NOP with user_data pointing to a message payload.
>
> So what you want is a NOP with 'fd' set to the fd of another ring, and
> that nop posts a CQE on that other ring? I don't think we'd need IOSQE
> flags for that, we just need a NOP that supports that. I see a few ways
> of going about that:
>
> 1) Add a new 'NOP' that takes an fd, and validates that that fd is an
> io_uring instance. It can then grab the completion lock on that ring
> and post an empty CQE.
>
> 2) We add a FEAT flag saying NOP supports taking an 'fd' argument, where
> 'fd' is another ring. Posting CQE same as above.
>
> 3) We add a specific opcode for this. Basically the same as #2, but
> maybe with a more descriptive name than NOP.
>
> Might make sense to pair that with a CQE flag or something like that, as
> there's no specific user_data that could be used as it doesn't match an
> existing SQE that has been issued. IORING_CQE_F_WAKEUP for example.
> Would be applicable to all the above cases.
>
> I kind of like #3 the best. Add a IORING_OP_RING_WAKEUP command, require
> that sqe->fd point to a ring (could even be the ring itself, doesn't
> matter). And add IORING_CQE_F_WAKEUP as a specific flag for that.
>
No, ideally I would like to be able to send any type of SQE to a
different ring. For example, if I see that the current ring is
overloaded, I can create exactly the same SQEs as during usual
operation, but with a changed recipient ring.

Your approach with a new "sendable" NOP would allow emulating this in
user space, but it would involve an unnecessary ring round-trip and be
a bit less pleasant in user code, e.g. we would need to encode a
separate state "the task is being sent to a different ring" instead of
simply telling io_uring "read data and report the CQE on this ring"
without any intermediate states.
* Re: Sending CQE to a different ring
From: Jens Axboe @ 2022-03-10 2:33 UTC (permalink / raw)
To: Artyom Pavlov, io-uring
On 3/9/22 6:55 PM, Jens Axboe wrote:
> On 3/9/22 6:36 PM, Jens Axboe wrote:
>> [...]
>
> Something like the below, totally untested. The request will complete on
> the original ring with either 0, for success, or -EOVERFLOW if the
> target ring was already in an overflow state. If the fd specified isn't
> an io_uring context, then the request will complete with -EBADFD.
>
> If you have any way of testing this, please do. I'll write a basic
> functionality test for it as well, but not until tomorrow.
>
> Maybe we want to include in cqe->res who the waker was? We can stuff the
> pid/tid in there, for example.
Made the pid change, and also wrote a test case for it. Only change
otherwise is adding a completion trace event as well. Patch below
against for-5.18/io_uring, and attached the test case for liburing.
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 2e04f718319d..b21f85a48224 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1105,6 +1105,9 @@ static const struct io_op_def io_op_defs[] = {
 	[IORING_OP_MKDIRAT] = {},
 	[IORING_OP_SYMLINKAT] = {},
 	[IORING_OP_LINKAT] = {},
+	[IORING_OP_WAKEUP_RING] = {
+		.needs_file		= 1,
+	},
 };
 
 /* requests with any of those set should undergo io_disarm_next() */
@@ -4235,6 +4238,44 @@ static int io_nop(struct io_kiocb *req, unsigned int issue_flags)
 	return 0;
 }
 
+static int io_wakeup_ring_prep(struct io_kiocb *req,
+			       const struct io_uring_sqe *sqe)
+{
+	if (unlikely(sqe->addr || sqe->ioprio || sqe->buf_index || sqe->off ||
+		     sqe->len || sqe->rw_flags || sqe->splice_fd_in ||
+		     sqe->personality))
+		return -EINVAL;
+
+	if (req->file->f_op != &io_uring_fops)
+		return -EBADFD;
+
+	return 0;
+}
+
+static int io_wakeup_ring(struct io_kiocb *req, unsigned int issue_flags)
+{
+	struct io_uring_cqe *cqe;
+	struct io_ring_ctx *ctx;
+	int ret = 0;
+
+	ctx = req->file->private_data;
+	spin_lock(&ctx->completion_lock);
+	cqe = io_get_cqe(ctx);
+	if (cqe) {
+		WRITE_ONCE(cqe->user_data, 0);
+		WRITE_ONCE(cqe->res, 0);
+		WRITE_ONCE(cqe->flags, IORING_CQE_F_WAKEUP);
+	} else {
+		ret = -EOVERFLOW;
+	}
+	io_commit_cqring(ctx);
+	spin_unlock(&ctx->completion_lock);
+	io_cqring_ev_posted(ctx);
+
+	__io_req_complete(req, issue_flags, ret, 0);
+	return 0;
+}
+
 static int io_fsync_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
 	struct io_ring_ctx *ctx = req->ctx;
@@ -6568,6 +6609,8 @@ static int io_req_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 		return io_symlinkat_prep(req, sqe);
 	case IORING_OP_LINKAT:
 		return io_linkat_prep(req, sqe);
+	case IORING_OP_WAKEUP_RING:
+		return io_wakeup_ring_prep(req, sqe);
 	}
 
 	printk_once(KERN_WARNING "io_uring: unhandled opcode %d\n",
@@ -6851,6 +6894,9 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
 	case IORING_OP_LINKAT:
 		ret = io_linkat(req, issue_flags);
 		break;
+	case IORING_OP_WAKEUP_RING:
+		ret = io_wakeup_ring(req, issue_flags);
+		break;
 	default:
 		ret = -EINVAL;
 		break;
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 787f491f0d2a..088232133594 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -143,6 +143,7 @@ enum {
 	IORING_OP_MKDIRAT,
 	IORING_OP_SYMLINKAT,
 	IORING_OP_LINKAT,
+	IORING_OP_WAKEUP_RING,
 
 	/* this goes last, obviously */
 	IORING_OP_LAST,
@@ -199,9 +200,11 @@ struct io_uring_cqe {
  *
  * IORING_CQE_F_BUFFER	If set, the upper 16 bits are the buffer ID
  * IORING_CQE_F_MORE	If set, parent SQE will generate more CQE entries
+ * IORING_CQE_F_WAKEUP	Wakeup request CQE, no link to an SQE
  */
 #define IORING_CQE_F_BUFFER		(1U << 0)
 #define IORING_CQE_F_MORE		(1U << 1)
+#define IORING_CQE_F_WAKEUP		(1U << 2)
 
 enum {
 	IORING_CQE_BUFFER_SHIFT		= 16,
--
Jens Axboe
[-- Attachment #2: wakeup-ring.patch --]
[-- Type: text/x-patch, Size: 4877 bytes --]
commit d07d17adab5b918bd0543ce78f75a4747a057379
Author: Jens Axboe <[email protected]>
Date:   Wed Mar 9 19:31:57 2022 -0700

    test/wakeup-ring: add test cases for IORING_OP_WAKEUP_RING

    Signed-off-by: Jens Axboe <[email protected]>

diff --git a/.gitignore b/.gitignore
index c9dc77fbe162..961fd9e96dc7 100644
--- a/.gitignore
+++ b/.gitignore
@@ -128,6 +128,7 @@
 /test/timeout-overflow
 /test/unlink
 /test/wakeup-hang
+/test/wakeup-ring
 /test/multicqes_drain
 /test/poll-mshot-update
 /test/rsrc_tags
diff --git a/src/include/liburing/io_uring.h b/src/include/liburing/io_uring.h
index a7d193d0df38..8f919b42a8ea 100644
--- a/src/include/liburing/io_uring.h
+++ b/src/include/liburing/io_uring.h
@@ -147,6 +147,7 @@ enum {
 	IORING_OP_MKDIRAT,
 	IORING_OP_SYMLINKAT,
 	IORING_OP_LINKAT,
+	IORING_OP_WAKEUP_RING,
 
 	/* this goes last, obviously */
 	IORING_OP_LAST,
diff --git a/test/Makefile b/test/Makefile
index f421f536df87..4aafbae826ca 100644
--- a/test/Makefile
+++ b/test/Makefile
@@ -151,6 +151,7 @@ test_srcs := \
 	timeout-overflow.c \
 	unlink.c \
 	wakeup-hang.c \
+	wakeup-ring.c \
 	skip-cqe.c \
 	# EOL
 
@@ -221,6 +222,7 @@ ring-leak2: override LDFLAGS += -lpthread
 poll-mshot-update: override LDFLAGS += -lpthread
 exit-no-cleanup: override LDFLAGS += -lpthread
 pollfree: override LDFLAGS += -lpthread
+wakeup-ring: override LDFLAGS += -lpthread
 
 install: $(test_targets) runtests.sh runtests-loop.sh
 	$(INSTALL) -D -d -m 755 $(datadir)/liburing-test/
diff --git a/test/wakeup-ring.c b/test/wakeup-ring.c
new file mode 100644
index 000000000000..0ccae92d4c93
--- /dev/null
+++ b/test/wakeup-ring.c
@@ -0,0 +1,172 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Description: test ring wakeup command
+ *
+ */
+#include <errno.h>
+#include <stdio.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <string.h>
+#include <fcntl.h>
+#include <pthread.h>
+
+#include "liburing.h"
+
+static int no_wakeup;
+
+static int test_own(struct io_uring *ring)
+{
+	struct io_uring_cqe *cqe;
+	struct io_uring_sqe *sqe;
+	int ret, i;
+
+	sqe = io_uring_get_sqe(ring);
+	if (!sqe) {
+		fprintf(stderr, "get sqe failed\n");
+		goto err;
+	}
+
+	sqe->opcode = IORING_OP_WAKEUP_RING;
+	sqe->fd = ring->ring_fd;
+	sqe->user_data = 1;
+
+	ret = io_uring_submit(ring);
+	if (ret <= 0) {
+		fprintf(stderr, "sqe submit failed: %d\n", ret);
+		goto err;
+	}
+
+	for (i = 0; i < 2; i++) {
+		ret = io_uring_peek_cqe(ring, &cqe);
+		if (ret < 0) {
+			fprintf(stderr, "wait completion %d\n", ret);
+			goto err;
+		}
+		switch (cqe->user_data) {
+		case 1:
+			if (cqe->res == -EINVAL || cqe->res == -EOPNOTSUPP) {
+				no_wakeup = 1;
+				return 0;
+			}
+			if (cqe->res != 0) {
+				fprintf(stderr, "wakeup res %d\n", cqe->res);
+				return -1;
+			}
+			break;
+		case 0:
+			if (!(cqe->flags & (1U << 2))) {
+				fprintf(stderr, "invalid flags %x\n", cqe->flags);
+				return -1;
+			}
+			break;
+		}
+		io_uring_cqe_seen(ring, cqe);
+	}
+
+	return 0;
+err:
+	return 1;
+}
+
+static void *wait_cqe_fn(void *data)
+{
+	struct io_uring *ring = data;
+	struct io_uring_cqe *cqe;
+	int ret;
+
+	ret = io_uring_wait_cqe(ring, &cqe);
+	if (ret) {
+		fprintf(stderr, "wait cqe %d\n", ret);
+		goto err;
+	}
+
+	if (!(cqe->flags & (1U << 2))) {
+		fprintf(stderr, "invalid flags %x\n", cqe->flags);
+		goto err;
+	}
+
+	return NULL;
+err:
+	return (void *) (unsigned long) 1;
+}
+
+static int test_remote(struct io_uring *ring, struct io_uring *target)
+{
+	struct io_uring_cqe *cqe;
+	struct io_uring_sqe *sqe;
+	int ret;
+
+	sqe = io_uring_get_sqe(ring);
+	if (!sqe) {
+		fprintf(stderr, "get sqe failed\n");
+		goto err;
+	}
+
+	sqe->opcode = IORING_OP_WAKEUP_RING;
+	sqe->fd = target->ring_fd;
+	sqe->user_data = 1;
+
+	ret = io_uring_submit(ring);
+	if (ret <= 0) {
+		fprintf(stderr, "sqe submit failed: %d\n", ret);
+		goto err;
+	}
+
+	ret = io_uring_peek_cqe(ring, &cqe);
+	if (ret < 0) {
+		fprintf(stderr, "wait completion %d\n", ret);
+		goto err;
+	}
+	if (cqe->res != 0) {
+		fprintf(stderr, "wakeup res %d\n", cqe->res);
+		return -1;
+	}
+
+	io_uring_cqe_seen(ring, cqe);
+	return 0;
+err:
+	return 1;
+}
+
+int main(int argc, char *argv[])
+{
+	struct io_uring ring, ring2;
+	pthread_t thread;
+	void *tret;
+	int ret;
+
+	if (argc > 1)
+		return 0;
+
+	ret = io_uring_queue_init(8, &ring, 0);
+	if (ret) {
+		fprintf(stderr, "ring setup failed: %d\n", ret);
+		return 1;
+	}
+	ret = io_uring_queue_init(8, &ring2, 0);
+	if (ret) {
+		fprintf(stderr, "ring setup failed: %d\n", ret);
+		return 1;
+	}
+
+	pthread_create(&thread, NULL, wait_cqe_fn, &ring2);
+
+	ret = test_own(&ring);
+	if (ret) {
+		fprintf(stderr, "test_own failed\n");
+		return ret;
+	}
+	if (no_wakeup)
+		return 0;
+
+	ret = test_remote(&ring, &ring2);
+	if (ret) {
+		fprintf(stderr, "test_remote failed\n");
+		return ret;
+	}
+
+	pthread_join(thread, &tret);
+
+	return 0;
+}
* Re: Sending CQE to a different ring
From: Jens Axboe @ 2022-03-10 3:00 UTC (permalink / raw)
To: Artyom Pavlov, io-uring
On 3/9/22 7:11 PM, Artyom Pavlov wrote:
> 10.03.2022 04:36, Jens Axboe wrote:
>> [...]
>
> No, ideally I would like to be able to send any type of SQE to a
> different ring. For example, if I see that the current ring is
> overloaded, I can create exactly the same SQEs as during usual
> operation, but with a changed recipient ring.
>
> Your approach with a new "sendable" NOP would allow emulating this in
> user space, but it would involve an unnecessary ring round-trip and be
> a bit less pleasant in user code, e.g. we would need to encode a
> separate state "the task is being sent to a different ring" instead of
> simply telling io_uring "read data and report the CQE on this ring"
> without any intermediate states.
OK, so what you're asking is to be able to submit an sqe to ring1, but
have the completion show up in ring2? With the idea being that the rings
are set up so that you're basing this on which thread should ultimately
process the request when it completes, which is why you want it to
target another ring?

It'd certainly be doable, but it's a bit of a strange beast. My main
concerns with that would be:

1) It's a fast path code addition to every request, we'd need to check
   some new field (sqe->completion_ring_fd) and then also grab a
   reference to that file for use at completion time.

2) Completions are protected by the completion lock, and it isn't
   trivial to nest these. What happens if ring1 submits an sqe with
   ring2 as the cqe target, and ring2 submits an sqe with ring1 as the
   cqe target? We can't safely nest these, as we could easily introduce
   deadlocks that way.
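To make the nesting problem concrete, the classic ABBA ordering
(schematic only):

/*
 * Naively nesting the two completion locks can deadlock:
 *
 *   CPU0 (ring1 targets ring2)            CPU1 (ring2 targets ring1)
 *   spin_lock(&ring1->completion_lock);   spin_lock(&ring2->completion_lock);
 *   spin_lock(&ring2->completion_lock);   spin_lock(&ring1->completion_lock);
 *       ...waits on CPU1 forever...           ...waits on CPU0 forever...
 */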
My knee-jerk reaction is that it'd be both simpler and cheaper to
implement this in userspace... Unless there's an elegant solution to it,
which I don't immediately see.
--
Jens Axboe
* Re: Sending CQE to a different ring
2022-03-10 2:11 ` Artyom Pavlov
2022-03-10 3:00 ` Jens Axboe
@ 2022-03-10 3:06 ` Jens Axboe
1 sibling, 0 replies; 28+ messages in thread
From: Jens Axboe @ 2022-03-10 3:06 UTC (permalink / raw)
To: Artyom Pavlov, io-uring
On 3/9/22 7:11 PM, Artyom Pavlov wrote:
> No, ideally I would like to be able to send any type of SQE to a
> different ring. For example, if I see that the current ring is
> overloaded, I can create exactly the same SQEs as during usual
> operation, but with a changed recipient ring.
And since I could still be misunderstanding your intent here: if you
mean that you want to submit an sqe to ring1 with ring2 being the
target, then I'm not sure that would be very helpful. The submitting
task is the owner of the request, and will ultimately be the one that
ends up running e.g. task_work associated with the request. It's not
really a good way to shift work from one ring to another, if the setup
is such that the rings are tied to a thread and the threads are in turn
mostly tied to a CPU or group of CPUs.
--
Jens Axboe
* Re: Sending CQE to a different ring
From: Artyom Pavlov @ 2022-03-10 3:48 UTC (permalink / raw)
To: Jens Axboe, io-uring
> OK, so what you're asking is to be able to submit an sqe to ring1, but
> have the completion show up in ring2? With the idea being that the rings
> are set up so that you're basing this on which thread should ultimately
> process the request when it completes, which is why you want it to
> target another ring?

Yes, to both questions.

> 1) It's a fast path code addition to every request, we'd need to check
>    some new field (sqe->completion_ring_fd) and then also grab a
>    reference to that file for use at completion time.

Since migration of tasks will be relatively rare, the relevant branch
could be marked cold, and such a branch should be easy on the CPU's
branch predictor. So I don't think we would see a measurable performance
regression for the common case.

> 2) Completions are protected by the completion lock, and it isn't
>    trivial to nest these. What happens if ring1 submits an sqe with
>    ring2 as the cqe target, and ring2 submits an sqe with ring1 as the
>    cqe target? We can't safely nest these, as we could easily introduce
>    deadlocks that way.

I thought a better approach would be to copy the SQE from ring1 into
ring2's internal buffer and execute it as usual (IIUC the kernel copies
SQEs before processing them). I am not familiar with the internals of
the io_uring implementation, so I cannot give any practical proposals.

> My knee-jerk reaction is that it'd be both simpler and cheaper to
> implement this in userspace... Unless there's an elegant solution to it,
> which I don't immediately see.

Yes, as I said in the initial post, it's certainly possible to do this
in user space. But I think it's quite a common problem, so it could
warrant a built-in solution in the io_uring API. It could also be a bit
more efficient to do in kernel space: e.g. you would not need mutexes,
which in the worst case may involve parking and unparking threads, thus
stalling the event loop.

> The submitting task is the owner of the request, and will ultimately
> be the one that ends up running e.g. task_work associated with the
> request. It's not really a good way to shift work from one ring to
> another, if the setup is such that the rings are tied to a thread and
> the threads are in turn mostly tied to a CPU or group of CPUs.

I am not sure I understand your point here. In my understanding, the
common approach to using io_uring is to keep in user_data a pointer to
FSM (finite state machine) state, together with a pointer to a function
used to drive the FSM further after a CQE is received (alternatively,
instead of the function pointer, a jump table could be used).

Usually it does not matter much on which thread the FSM is driven, since
the FSM state is kept on the heap. Yes, it may not be great from a CPU
cache point of view, but it's better than having unbalanced thread load.
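For illustration, a minimal sketch of this pattern (struct fsm and the
drive() callback are made-up names):

#include <liburing.h>

/* Minimal sketch of the FSM-in-user_data pattern; struct fsm and the
 * drive() signature are made up for illustration. */
struct fsm {
        void (*drive)(struct fsm *self, struct io_uring *ring, int res);
        /* ... heap-allocated state, so any thread can drive it ... */
};

static void event_loop(struct io_uring *ring)
{
        struct io_uring_cqe *cqe;
        struct fsm *f;

        while (io_uring_wait_cqe(ring, &cqe) == 0) {
                f = io_uring_cqe_get_data(cqe);
                f->drive(f, ring, cqe->res);    /* advance the machine */
                io_uring_cqe_seen(ring, cqe);
        }
}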
* Re: Sending CQE to a different ring
From: Jens Axboe @ 2022-03-10 4:03 UTC (permalink / raw)
To: Artyom Pavlov, io-uring
On 3/9/22 8:48 PM, Artyom Pavlov wrote:
>> OK, so what you're asking is to be able to submit an sqe to ring1, but
>> have the completion show up in ring2? With the idea being that the rings
>> are set up so that you're basing this on which thread should ultimately
>> process the request when it completes, which is why you want it to
>> target another ring?
>
> Yes, to both questions.
>
>> 1) It's a fast path code addition to every request, we'd need to check
>>    some new field (sqe->completion_ring_fd) and then also grab a
>>    reference to that file for use at completion time.
>
> Since migration of tasks will be relatively rare, the relevant branch
> could be marked cold, and such a branch should be easy on the CPU's
> branch predictor. So I don't think we would see a measurable performance
> regression for the common case.

It's not the branches I'm worried about, it's the growing of the request
to accommodate it, and the need to bring in another fd for this. We're
not just talking about one branch here; a request being tied to a
specific ring is a pretty core foundation of the internal code. It would
require massive refactoring to disconnect those two. We have a lot of
optimizations in place to handle completions efficiently as it is.

But I guess I'm still a bit confused on what this will buy us. The
request is still being executed on the first ring (and hence the thread
associated with it); with the suggested approach here, the only thing
you'd gain is the completion going somewhere else. Is this purely about
the post-processing that happens when a completion is posted to a given
ring?

>> 2) Completions are protected by the completion lock, and it isn't
>>    trivial to nest these. What happens if ring1 submits an sqe with
>>    ring2 as the cqe target, and ring2 submits an sqe with ring1 as the
>>    cqe target? We can't safely nest these, as we could easily introduce
>>    deadlocks that way.
>
> I thought a better approach would be to copy the SQE from ring1 into
> ring2's internal buffer and execute it as usual (IIUC the kernel copies
> SQEs before processing them). I am not familiar with the internals of
> the io_uring implementation, so I cannot give any practical proposals.

That's certainly possible, but what does it buy us? Why not just issue
it on ring2 to begin with? Issuing per ring is serialized anyway, by an
internal mutex.

>> My knee-jerk reaction is that it'd be both simpler and cheaper to
>> implement this in userspace... Unless there's an elegant solution to it,
>> which I don't immediately see.
>
> Yes, as I said in the initial post, it's certainly possible to do this
> in user space. But I think it's quite a common problem, so it could
> warrant a built-in solution in the io_uring API. It could also be a bit
> more efficient to do in kernel space: e.g. you would not need mutexes,
> which in the worst case may involve parking and unparking threads, thus
> stalling the event loop.

I'm all for having solutions for common problems, but it has to make
sense to do so.

liburing has some state in the ring structure which makes it hard to
share, but for the raw interface, there's really not that concern
outside of needing to ensure that you serialize access to writing the sq
ring head and sqe entries. The kernel doesn't really care, though you
don't want two threads entering the kernel for the same ring, as one of
them would simply then just be blocked until the other is done.
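As a sketch, that userspace serialization is just this (assuming one
thread reaps completions for the shared ring):

#include <pthread.h>
#include <liburing.h>

/* Sketch of the userspace alternative: threads share one ring and only
 * serialize writing the SQ; a single thread reaps the completions. */
static pthread_mutex_t sq_lock = PTHREAD_MUTEX_INITIALIZER;

static int submit_shared(struct io_uring *ring, int fd, void *buf,
                         unsigned len, void *user_data)
{
        struct io_uring_sqe *sqe;
        int ret;

        pthread_mutex_lock(&sq_lock);
        sqe = io_uring_get_sqe(ring);
        if (!sqe) {
                pthread_mutex_unlock(&sq_lock);
                return -EBUSY;
        }
        io_uring_prep_read(sqe, fd, buf, len, 0);
        io_uring_sqe_set_data(sqe, user_data);
        ret = io_uring_submit(ring);    /* one thread in the kernel at a time */
        pthread_mutex_unlock(&sq_lock);
        return ret;
}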
>> The submitting task is the owner of the request, and will ultimately
>> be the one that ends up running e.g. task_work associated with the
>> request. It's not really a good way to shift work from one ring to
>> another, if the setup is such that the rings are tied to a thread and
>> the threads are in turn mostly tied to a CPU or group of CPUs.
>
> I am not sure I understand your point here. In my understanding, the
> common approach to using io_uring is to keep in user_data a pointer to
> FSM (finite state machine) state, together with a pointer to a function
> used to drive the FSM further after a CQE is received (alternatively,
> instead of the function pointer, a jump table could be used).
>
> Usually it does not matter much on which thread the FSM is driven,
> since the FSM state is kept on the heap. Yes, it may not be great from
> a CPU cache point of view, but it's better than having unbalanced
> thread load.

OK, so it is just about the post-processing. These are key questions to
answer, because they'll help drive what the best solution is here.

How did the original thread end up with the work to begin with? Was the
workload evenly distributed at that point, but later conditions (before
it got issued) mean that the situation has now changed and we'd prefer
to execute it somewhere else?

I'll mull over this a bit...
--
Jens Axboe
* Re: Sending CQE to a different ring
From: Jens Axboe @ 2022-03-10 4:14 UTC (permalink / raw)
To: Artyom Pavlov, io-uring
On 3/9/22 9:03 PM, Jens Axboe wrote:
> I'll mull over this a bit...
One idea... You issue the request as you normally would for ring1, and
you mark request A with IOSQE_CQE_SKIP_SUCCESS. Then you link an
IORING_OP_WAKEUP_RING to request A, with its fd set to ring2, and also
mark that with IOSQE_CQE_SKIP_SUCCESS.

We'd need to have sqe->addr (or whatever field) in the WAKEUP_RING
request be set to the user_data of request A, so we can propagate it.

The end result is that ring1 won't see any completions for request A or
the linked WAKEUP, unless one of them fails. If that happens, then you
get to process things locally still, but given that this is a weird
error case, it shouldn't affect things in practice. ring2 will get the
CQE posted once request A has completed, with the user_data filled in
from request A. Which is really what we care about here, as far as I
understand.

This basically works right now with what I posted, and without needing
to rearchitect a bunch of stuff. And it's pretty efficient. The only
thing we'd need to add is passing in the target CQE user_data for the
WAKEUP request. It would just need to be wrapped in something that
allows you to do this easily, as it would be useful for others too,
potentially.
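As a sketch of that wrapper (assuming sqe->addr is the field that
carries the user_data to propagate; the patch above doesn't implement
that part yet, its prep rejects sqe->addr):

#include <string.h>
#include <liburing.h>

/* Sketch of the linked recipe above; error handling elided. */
static void read_then_notify(struct io_uring *ring1, int ring2_fd,
                             int fd, void *buf, unsigned len, __u64 tag)
{
        struct io_uring_sqe *sqe;

        /* request A: silent on success, linked to the wakeup */
        sqe = io_uring_get_sqe(ring1);
        io_uring_prep_read(sqe, fd, buf, len, 0);
        sqe->user_data = tag;
        sqe->flags |= IOSQE_CQE_SKIP_SUCCESS | IOSQE_IO_LINK;

        /* linked wakeup: would post a CQE carrying 'tag' on ring2 */
        sqe = io_uring_get_sqe(ring1);
        memset(sqe, 0, sizeof(*sqe));
        sqe->opcode = IORING_OP_WAKEUP_RING;
        sqe->fd = ring2_fd;
        sqe->addr = tag;        /* assumed propagation field */
        sqe->flags |= IOSQE_CQE_SKIP_SUCCESS;

        io_uring_submit(ring1);
}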
--
Jens Axboe
* Re: Sending CQE to a different ring
From: Chris Panayis @ 2022-03-10 9:15 UTC (permalink / raw)
To: Jens Axboe, Artyom Pavlov, io-uring
ooo.. I like this simple interface, even if there ends up being a more
'infra-ring-esque' api... We currently implement thread/ring
callbacks/wakeups using eventfd - this IORING_OP_WAKEUP_RING interface
would be much better..

Thanks

Chris
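For context, the eventfd scheme we'd be replacing looks roughly like
this (a sketch; the loop re-arms the read after each wakeup CQE):

#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <liburing.h>

/* Sketch of the eventfd wakeup pattern: the loop keeps a read armed on
 * an eventfd (from eventfd(0, 0)); any other thread writes to it to
 * wake the loop. */
static uint64_t efd_val;

static void arm_wakeup(struct io_uring *ring, int efd)
{
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

        io_uring_prep_read(sqe, efd, &efd_val, sizeof(efd_val), 0);
        io_uring_sqe_set_data(sqe, NULL);       /* NULL tags "wakeup" here */
        io_uring_submit(ring);
}

static void wake(int efd)
{
        uint64_t one = 1;

        if (write(efd, &one, sizeof(one)) < 0)  /* from any other thread */
                perror("eventfd write");
}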
On 10/03/2022 02:33, Jens Axboe wrote:
> On 3/9/22 6:55 PM, Jens Axboe wrote:
>> On 3/9/22 6:36 PM, Jens Axboe wrote:
>>> [...]
>> Something like the below, totally untested. The request will complete on
>> the original ring with either 0, for success, or -EOVERFLOW if the
>> target ring was already in an overflow state. If the fd specified isn't
>> an io_uring context, then the request will complete with -EBADFD.
>>
>> If you have any way of testing this, please do. I'll write a basic
>> functionality test for it as well, but not until tomorrow.
>>
>> Maybe we want to include in cqe->res who the waker was? We can stuff the
>> pid/tid in there, for example.
> Made the pid change, and also wrote a test case for it. Only change
> otherwise is adding a completion trace event as well. Patch below
> against for-5.18/io_uring, and attached the test case for liburing.
>
> [...]
* Re: Sending CQE to a different ring
From: Pavel Begunkov @ 2022-03-10 13:34 UTC (permalink / raw)
To: Jens Axboe, Artyom Pavlov, io-uring
On 3/10/22 03:00, Jens Axboe wrote:
> On 3/9/22 7:11 PM, Artyom Pavlov wrote:
>> [...]
>
> OK, so what you're asking is to be able to submit an sqe to ring1, but
> have the completion show up in ring2? With the idea being that the rings
> are setup so that you're basing this on which thread should ultimately
> process the request when it completes, which is why you want it to
> target another ring?
>
> It'd certainly be doable, but it's a bit of a strange beast. My main
> concern with that would be:
>
> 1) It's a fast path code addition to every request, we'd need to check
> some new field (sqe->completion_ring_fd) and then also grab a
> reference to that file for use at completion time.
>
> 2) Completions are protected by the completion lock, and it isn't
> trivial to nest these. What happens if ring1 submits an sqe with
> ring2 as the cqe target, and ring2 submits an sqe with ring1 as the
> cqe target? We can't safely nest these, as we could easily introduce
> deadlocks that way.
>
> My knee jerk reaction is that it'd be both simpler and cheaper to
> implement this in userspace... Unless there's an elegant solution to it,
> which I don't immediately see.
A per-request fd will unfortunately be ugly and slow. As people have
asked about similar things before, the only thing I can suggest is to
add a way to pass in another SQ. The execution will be slower, but at
least it can be made zero overhead for the normal path.
--
Pavel Begunkov
* Re: Sending CQE to a different ring
From: Jens Axboe @ 2022-03-10 13:43 UTC (permalink / raw)
To: Pavel Begunkov, Artyom Pavlov, io-uring
On 3/10/22 6:34 AM, Pavel Begunkov wrote:
> On 3/10/22 03:00, Jens Axboe wrote:
>> [...]
>
> A per-request fd will unfortunately be ugly and slow. As people have
> asked about similar things before, the only thing I can suggest is to
> add a way to pass in another SQ. The execution will be slower, but at
> least it can be made zero overhead for the normal path.
The MSG_RING command seems like a good fit to me, and it'll both cater
to the "I just need to wake up this ring and I don't want to use
signals" crowd, and allow passing actual (limited) information like
what is needed in this case.
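As a sketch of how that could look from userspace, assuming a prep
helper shaped like io_uring_prep_msg_ring(sqe, fd, len, data, flags)
(names hypothetical until this lands):

#include <liburing.h>

/* Sketch of MSG_RING-style usage; io_uring_prep_msg_ring() is assumed
 * here. The target ring would see a CQE with res == len and
 * user_data == data. */
static int msg_ring(struct io_uring *self, int target_fd, __u64 msg)
{
        struct io_uring_sqe *sqe = io_uring_get_sqe(self);

        if (!sqe)
                return -EBUSY;
        io_uring_prep_msg_ring(sqe, target_fd, 0, msg, 0);
        sqe->user_data = 0;     /* local CQE: res 0 on success */
        return io_uring_submit(self);
}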
--
Jens Axboe
* Re: Sending CQE to a different ring
From: Pavel Begunkov @ 2022-03-10 13:51 UTC (permalink / raw)
To: Jens Axboe, Artyom Pavlov, io-uring
On 3/10/22 13:43, Jens Axboe wrote:
> On 3/10/22 6:34 AM, Pavel Begunkov wrote:
>> On 3/10/22 03:00, Jens Axboe wrote:
>>> On 3/9/22 7:11 PM, Artyom Pavlov wrote:
>>>> 10.03.2022 04:36, Jens Axboe wrote:
>>>>> On 3/9/22 4:49 PM, Artyom Pavlov wrote:
[...]
>>> [...]
>>
>> A per-request fd will unfortunately be ugly and slow. As people have
>> asked about similar things before, the only thing I can suggest is to
>> add a way to pass in another SQ. The execution will be slower, but at
>> least it can be made zero overhead for the normal path.
>
> The MSG_RING command seems like a good fit to me: it'll both cater to
> the "I just need to wake up this ring and I don't want to use signals"
> crowd, and allow passing actual (limited) information like what is needed
> in this case.
Agreed, if that's what is needed.
To clarify, the other approach I suggested is to be able to submit from a
different, userspace-provided SQ, so there is no locking around the main
SQ. And this doesn't cross a ring boundary; one would still only need to
specify a single target ring fd.
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Sending CQE to a different ring
2022-03-10 2:33 ` Jens Axboe
2022-03-10 9:15 ` Chris Panayis
@ 2022-03-10 13:53 ` Pavel Begunkov
2022-03-10 15:38 ` Jens Axboe
1 sibling, 1 reply; 28+ messages in thread
From: Pavel Begunkov @ 2022-03-10 13:53 UTC (permalink / raw)
To: Jens Axboe, Artyom Pavlov, io-uring
On 3/10/22 02:33, Jens Axboe wrote:
> On 3/9/22 6:55 PM, Jens Axboe wrote:
>> On 3/9/22 6:36 PM, Jens Axboe wrote:
>>> On 3/9/22 4:49 PM, Artyom Pavlov wrote:
>>>> Greetings!
>>>>
>>>> A common approach for multi-threaded servers is to have a number of
>>>> threads equal to a number of cores and launch a separate ring in each
>>>> one. AFAIK currently if we want to send an event to a different ring,
>>>> we have to write-lock this ring, create SQE, and update the index
>>>> ring. Alternatively, we could use some kind of user-space message
>>>> passing.
>>>>
>>>> Such approaches are somewhat inefficient and I think it can be solved
>>>> elegantly by updating the io_uring_sqe type to allow accepting fd of a
>>>> ring to which CQE must be sent by kernel. It can be done by
>>>> introducing an IOSQE_ flag and using one of currently unused padding
>>>> u64s.
>>>>
>>>> Such feature could be useful for load balancing and message passing
>>>> between threads which would ride on top of io-uring, i.e. you could
>>>> send NOP with user_data pointing to a message payload.
>>>
>>> So what you want is a NOP with 'fd' set to the fd of another ring, and
>>> that nop posts a CQE on that other ring? I don't think we'd need IOSQE
>>> flags for that, we just need a NOP that supports that. I see a few ways
>>> of going about that:
>>>
>>> 1) Add a new 'NOP' that takes an fd, and validates that that fd is an
>>> io_uring instance. It can then grab the completion lock on that ring
>>> and post an empty CQE.
>>>
>>> 2) We add a FEAT flag saying NOP supports taking an 'fd' argument, where
>>> 'fd' is another ring. Posting CQE same as above.
>>>
>>> 3) We add a specific opcode for this. Basically the same as #2, but
>>> maybe with a more descriptive name than NOP.
>>>
>>> Might make sense to pair that with a CQE flag or something like that, as
>>> there's no specific user_data that could be used as it doesn't match an
>>> existing SQE that has been issued. IORING_CQE_F_WAKEUP for example.
>>> Would be applicable to all the above cases.
>>>
>>> I kind of like #3 the best. Add a IORING_OP_RING_WAKEUP command, require
>>> that sqe->fd point to a ring (could even be the ring itself, doesn't
>>> matter). And add IORING_CQE_F_WAKEUP as a specific flag for that.
>>
>> Something like the below, totally untested. The request will complete on
>> the original ring with either 0, for success, or -EOVERFLOW if the
>> target ring was already in an overflow state. If the fd specified isn't
>> an io_uring context, then the request will complete with -EBADFD.
>>
>> If you have any way of testing this, please do. I'll write a basic
>> functionality test for it as well, but not until tomorrow.
>>
>> Maybe we want to include in cqe->res who the waker was? We can stuff the
>> pid/tid in there, for example.
>
> Made the pid change, and also wrote a test case for it. Only change
> otherwise is adding a completion trace event as well. Patch below
> against for-5.18/io_uring, and attached the test case for liburing.
>
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 2e04f718319d..b21f85a48224 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -1105,6 +1105,9 @@ static const struct io_op_def io_op_defs[] = {
> [IORING_OP_MKDIRAT] = {},
> [IORING_OP_SYMLINKAT] = {},
> [IORING_OP_LINKAT] = {},
> + [IORING_OP_WAKEUP_RING] = {
> + .needs_file = 1,
> + },
> };
>
> /* requests with any of those set should undergo io_disarm_next() */
> @@ -4235,6 +4238,44 @@ static int io_nop(struct io_kiocb *req, unsigned int issue_flags)
> return 0;
> }
>
> +static int io_wakeup_ring_prep(struct io_kiocb *req,
> + const struct io_uring_sqe *sqe)
> +{
> + if (unlikely(sqe->addr || sqe->ioprio || sqe->buf_index || sqe->off ||
> + sqe->len || sqe->rw_flags || sqe->splice_fd_in ||
> + sqe->personality))
> + return -EINVAL;
> +
> + if (req->file->f_op != &io_uring_fops)
> + return -EBADFD;
> +
> + return 0;
> +}
> +
> +static int io_wakeup_ring(struct io_kiocb *req, unsigned int issue_flags)
> +{
> + struct io_uring_cqe *cqe;
> + struct io_ring_ctx *ctx;
> + int ret = 0;
> +
> + ctx = req->file->private_data;
> + spin_lock(&ctx->completion_lock);
> + cqe = io_get_cqe(ctx);
> + if (cqe) {
> + WRITE_ONCE(cqe->user_data, 0);
> + WRITE_ONCE(cqe->res, 0);
> + WRITE_ONCE(cqe->flags, IORING_CQE_F_WAKEUP);
> + } else {
> + ret = -EOVERFLOW;
> + }
io_fill_cqe_aux(), maybe? Handles overflows better, increments cq_extra,
etc. Might also make sense to kick cq_timeouts, so waiters are forced to
wake up.
> + io_commit_cqring(ctx);
> + spin_unlock(&ctx->completion_lock);
> + io_cqring_ev_posted(ctx);
> +
> + __io_req_complete(req, issue_flags, ret, 0);
> + return 0;
> +}
> +
> static int io_fsync_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
> {
> struct io_ring_ctx *ctx = req->ctx;
> @@ -6568,6 +6609,8 @@ static int io_req_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
> return io_symlinkat_prep(req, sqe);
> case IORING_OP_LINKAT:
> return io_linkat_prep(req, sqe);
> + case IORING_OP_WAKEUP_RING:
> + return io_wakeup_ring_prep(req, sqe);
> }
>
> printk_once(KERN_WARNING "io_uring: unhandled opcode %d\n",
> @@ -6851,6 +6894,9 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
> case IORING_OP_LINKAT:
> ret = io_linkat(req, issue_flags);
> break;
> + case IORING_OP_WAKEUP_RING:
> + ret = io_wakeup_ring(req, issue_flags);
> + break;
> default:
> ret = -EINVAL;
> break;
> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> index 787f491f0d2a..088232133594 100644
> --- a/include/uapi/linux/io_uring.h
> +++ b/include/uapi/linux/io_uring.h
> @@ -143,6 +143,7 @@ enum {
> IORING_OP_MKDIRAT,
> IORING_OP_SYMLINKAT,
> IORING_OP_LINKAT,
> + IORING_OP_WAKEUP_RING,
>
> /* this goes last, obviously */
> IORING_OP_LAST,
> @@ -199,9 +200,11 @@ struct io_uring_cqe {
> *
> * IORING_CQE_F_BUFFER If set, the upper 16 bits are the buffer ID
> * IORING_CQE_F_MORE If set, parent SQE will generate more CQE entries
> + * IORING_CQE_F_WAKEUP Wakeup request CQE, no link to an SQE
> */
> #define IORING_CQE_F_BUFFER (1U << 0)
> #define IORING_CQE_F_MORE (1U << 1)
> +#define IORING_CQE_F_WAKEUP (1U << 2)
>
> enum {
> IORING_CQE_BUFFER_SHIFT = 16,
>
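To illustrate the io_fill_cqe_aux() suggestion above, the completion side
could end up looking roughly like the below. Untested sketch against the
same tree; it lets io_fill_cqe_aux() take care of the overflow path and
the cq_extra accounting:

static int io_wakeup_ring(struct io_kiocb *req, unsigned int issue_flags)
{
	struct io_ring_ctx *target_ctx = req->file->private_data;
	bool filled;
	int ret = -EOVERFLOW;

	spin_lock(&target_ctx->completion_lock);
	/* posts the CQE, or queues an overflow entry if the CQ ring is full */
	filled = io_fill_cqe_aux(target_ctx, 0, 0, IORING_CQE_F_WAKEUP);
	io_commit_cqring(target_ctx);
	spin_unlock(&target_ctx->completion_lock);

	if (filled) {
		io_cqring_ev_posted(target_ctx);
		ret = 0;
	}

	__io_req_complete(req, issue_flags, ret, 0);
	return 0;
}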
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Sending CQE to a different ring
2022-03-10 4:14 ` Jens Axboe
@ 2022-03-10 14:00 ` Artyom Pavlov
2022-03-10 15:36 ` Artyom Pavlov
1 sibling, 0 replies; 28+ messages in thread
From: Artyom Pavlov @ 2022-03-10 14:00 UTC (permalink / raw)
To: Jens Axboe, io-uring
> It's not the branches I'm worried about, it's the growing of the request
> to accommodate it, and the need to bring in another fd for this.
Maybe it's worth pre-registering the fds of rings to which we can send
CQEs, similarly to pre-registering file fds? It would allow us to use a
u8 or u16 instead of a u64 for identifying the recipient ring.
> But I guess I'm still a bit confused on what this will buy us. The
> request is still being executed on the first ring (and hence the thread
> associated with it), with the suggested approach here the only thing
> you'd gain is the completion going somewhere else. Is this purely about
> the post-processing that happens when a completion is posted to a given
> ring?
As I wrote earlier, I am not familiar with the internals of the io_uring
implementation, so I am talking purely from a user's point of view. I will
trust your judgment in regard to implementation complexity.
I guess, from the user's PoV, it does not matter on which ring the SQE is
executed. It can have certain performance implications, but otherwise, for
the user, it's simply an implementation detail.
> How did the original thread end up with the work to begin with? Was the
> workload evenly distributed at that point, but later conditions (before
> it get issued) mean that the situation has now changed and we'd prefer
> to execute it somewhere else?
Let's talk about a concrete, simplified example. Imagine a server which
accepts commands from the network to compute the hash of a file at a
given path. The server executes the following algorithm:
1) Accept connection
2) Read command
3) Open file and create hasher state
4) Read chunk of data from file
5) If the read data is not empty, update the hasher state and go to step
4; otherwise finalize the hasher
6) Return the resulting hash and go to step 2
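In rough liburing-style pseudocode, each worker thread would run a loop
like the sketch below (the conn/hasher types and the hash_*/queue_*
helpers are made up for illustration):

struct io_uring ring;
io_uring_queue_init(256, &ring, 0);

for (;;) {
	struct io_uring_cqe *cqe;
	struct conn *c;

	io_uring_wait_cqe(&ring, &cqe);
	c = io_uring_cqe_get_data(cqe);

	switch (c->state) {
	case READ_CHUNK:			/* steps 4 and 5 */
		if (cqe->res > 0) {
			hash_update(&c->hasher, c->buf, cqe->res);
			queue_read_chunk(&ring, c);	/* back to step 4 */
		} else {
			hash_final(&c->hasher, c->digest);
			queue_reply(&ring, c);		/* step 6 */
		}
		break;
	/* ... ACCEPT, READ_COMMAND and SEND_REPLY states elided ... */
	}
	io_uring_cqe_seen(&ring, cqe);
}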
We have two places where we can balance load. First, after we accept a
connection, we have to decide which ring will process it. Second, during
creation of the SQE for step 4, if the current thread is overloaded, we
can transfer the task to a different thread.
The problem is that we cannot predict how the kernel will return read
chunks. Even if we distribute SQEs evenly across the rings, it's possible
that the kernel will return CQEs to a single ring in a burst, overloading
it while the other threads starve for events.
On second thought, it looks like your solution with IORING_OP_WAKEUP_RING
has the following advantage: it allows us to migrate a task before
execution of step 5 has started, while with my proposal we would be able
to migrate tasks only at SQE creation (i.e. at step 4).
> One idea... You issue the request as you normally would for ring1, and
> you mark that request A with IOSQE_CQE_SKIP_SUCCESS. Then you link an
> IORING_OP_WAKEUP_RING to request A, with the fd for it set to ring2, and
> also mark that with IOSQE_CQE_SKIP_SUCCESS.
Looks interesting! I had forgotten about linking and IOSQE_CQE_SKIP_SUCCESS.
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Sending CQE to a different ring
2022-03-10 4:14 ` Jens Axboe
2022-03-10 14:00 ` Artyom Pavlov
@ 2022-03-10 15:36 ` Artyom Pavlov
2022-03-10 15:43 ` Jens Axboe
1 sibling, 1 reply; 28+ messages in thread
From: Artyom Pavlov @ 2022-03-10 15:36 UTC (permalink / raw)
To: Jens Axboe, io-uring
After thinking about it a bit, I think this approach has one serious
disadvantage: you lose the result value of a successful initial request.
Imagine we submit IORING_OP_READ and link an IORING_OP_WAKEUP_RING to it.
If the request completes successfully, both ring1 and ring2 will lose
the number of bytes read.
10.03.2022 07:14, Jens Axboe wrote:
> On 3/9/22 9:03 PM, Jens Axboe wrote:
>> I'll mull over this a bit...
>
> One idea... You issue the request as you normally would for ring1, and
> you mark that request A with IOSQE_CQE_SKIP_SUCCESS. Then you link an
> IORING_OP_WAKEUP_RING to request A, with the fd for it set to ring2, and
> also mark that with IOSQE_CQE_SKIP_SUCCESS.
>
> We'd need to have sqe->addr (or whatever field) in the WAKEUP_RING
> request be set to the user_data of request A, so we can propagate it.
>
> The end result is that ring1 won't see any completions for request A or
> the linked WAKEUP, unless one of them fails. If that happens, then you
> get to process things locally still, but given that this is a weird
> error case, shouldn't affect things in practice. ring2 will get the CQE
> posted once request A has completed, with the user_data filled in from
> request A. Which is really what we care about here, as far as I
> understand.
>
> This basically works right now with what I posted, and without needing
> to rearchitect a bunch of stuff. And it's pretty efficient. Only thing
> we'd need to add is passing in the target cqe user_data for the WAKEUP
> request. Would just need to be wrapped in something that allows you to
> do this easily, as it would be useful for others too potentially.
>
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Sending CQE to a different ring
2022-03-10 13:53 ` Pavel Begunkov
@ 2022-03-10 15:38 ` Jens Axboe
0 siblings, 0 replies; 28+ messages in thread
From: Jens Axboe @ 2022-03-10 15:38 UTC (permalink / raw)
To: Pavel Begunkov, Artyom Pavlov, io-uring
On 3/10/22 6:53 AM, Pavel Begunkov wrote:
> On 3/10/22 02:33, Jens Axboe wrote:
>> On 3/9/22 6:55 PM, Jens Axboe wrote:
>>> On 3/9/22 6:36 PM, Jens Axboe wrote:
>>>> On 3/9/22 4:49 PM, Artyom Pavlov wrote:
>>>>> Greetings!
>>>>>
>>>>> A common approach for multi-threaded servers is to have a number of
>>>>> threads equal to a number of cores and launch a separate ring in each
>>>>> one. AFAIK currently if we want to send an event to a different ring,
>>>>> we have to write-lock this ring, create SQE, and update the index
>>>>> ring. Alternatively, we could use some kind of user-space message
>>>>> passing.
>>>>>
>>>>> Such approaches are somewhat inefficient and I think it can be solved
>>>>> elegantly by updating the io_uring_sqe type to allow accepting fd of a
>>>>> ring to which CQE must be sent by kernel. It can be done by
>>>>> introducing an IOSQE_ flag and using one of currently unused padding
>>>>> u64s.
>>>>>
>>>>> Such feature could be useful for load balancing and message passing
>>>>> between threads which would ride on top of io-uring, i.e. you could
>>>>> send NOP with user_data pointing to a message payload.
>>>>
>>>> So what you want is a NOP with 'fd' set to the fd of another ring, and
>>>> that nop posts a CQE on that other ring? I don't think we'd need IOSQE
>>>> flags for that, we just need a NOP that supports that. I see a few ways
>>>> of going about that:
>>>>
>>>> 1) Add a new 'NOP' that takes an fd, and validates that that fd is an
>>>> io_uring instance. It can then grab the completion lock on that ring
>>>> and post an empty CQE.
>>>>
>>>> 2) We add a FEAT flag saying NOP supports taking an 'fd' argument, where
>>>> 'fd' is another ring. Posting CQE same as above.
>>>>
>>>> 3) We add a specific opcode for this. Basically the same as #2, but
>>>> maybe with a more descriptive name than NOP.
>>>>
>>>> Might make sense to pair that with a CQE flag or something like that, as
>>>> there's no specific user_data that could be used as it doesn't match an
>>>> existing SQE that has been issued. IORING_CQE_F_WAKEUP for example.
>>>> Would be applicable to all the above cases.
>>>>
>>>> I kind of like #3 the best. Add a IORING_OP_RING_WAKEUP command, require
>>>> that sqe->fd point to a ring (could even be the ring itself, doesn't
>>>> matter). And add IORING_CQE_F_WAKEUP as a specific flag for that.
>>>
>>> Something like the below, totally untested. The request will complete on
>>> the original ring with either 0, for success, or -EOVERFLOW if the
>>> target ring was already in an overflow state. If the fd specified isn't
>>> an io_uring context, then the request will complete with -EBADFD.
>>>
>>> If you have any way of testing this, please do. I'll write a basic
>>> functionality test for it as well, but not until tomorrow.
>>>
>>> Maybe we want to include in cqe->res who the waker was? We can stuff the
>>> pid/tid in there, for example.
>>
>> Made the pid change, and also wrote a test case for it. Only change
>> otherwise is adding a completion trace event as well. Patch below
>> against for-5.18/io_uring, and attached the test case for liburing.
>>
>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>> index 2e04f718319d..b21f85a48224 100644
>> --- a/fs/io_uring.c
>> +++ b/fs/io_uring.c
>> @@ -1105,6 +1105,9 @@ static const struct io_op_def io_op_defs[] = {
>> [IORING_OP_MKDIRAT] = {},
>> [IORING_OP_SYMLINKAT] = {},
>> [IORING_OP_LINKAT] = {},
>> + [IORING_OP_WAKEUP_RING] = {
>> + .needs_file = 1,
>> + },
>> };
>> /* requests with any of those set should undergo io_disarm_next() */
>> @@ -4235,6 +4238,44 @@ static int io_nop(struct io_kiocb *req, unsigned int issue_flags)
>> return 0;
>> }
>> +static int io_wakeup_ring_prep(struct io_kiocb *req,
>> + const struct io_uring_sqe *sqe)
>> +{
>> + if (unlikely(sqe->addr || sqe->ioprio || sqe->buf_index || sqe->off ||
>> + sqe->len || sqe->rw_flags || sqe->splice_fd_in ||
>> + sqe->personality))
>> + return -EINVAL;
>> +
>> + if (req->file->f_op != &io_uring_fops)
>> + return -EBADFD;
>> +
>> + return 0;
>> +}
>> +
>> +static int io_wakeup_ring(struct io_kiocb *req, unsigned int issue_flags)
>> +{
>> + struct io_uring_cqe *cqe;
>> + struct io_ring_ctx *ctx;
>> + int ret = 0;
>> +
>> + ctx = req->file->private_data;
>> + spin_lock(&ctx->completion_lock);
>> + cqe = io_get_cqe(ctx);
>> + if (cqe) {
>> + WRITE_ONCE(cqe->user_data, 0);
>> + WRITE_ONCE(cqe->res, 0);
>> + WRITE_ONCE(cqe->flags, IORING_CQE_F_WAKEUP);
>> + } else {
>> + ret = -EOVERFLOW;
>> + }
>
> io_fill_cqe_aux(), maybe? Handles overflows better, increments cq_extra,
> etc. Might also make sense to kick cq_timeouts, so waiters are forced to
> wake up.
I think the main question here is whether we want to handle overflows at
all; I deliberately didn't do that. But apart from that, io_fill_cqe_aux()
does everything we need.
I guess the nice thing about actually allocating an overflow entry is
that there's no weird error handling on the submitter side. Let's go
with that.
--
Jens Axboe
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Sending CQE to a different ring
2022-03-10 15:36 ` Artyom Pavlov
@ 2022-03-10 15:43 ` Jens Axboe
2022-03-10 15:46 ` Jens Axboe
0 siblings, 1 reply; 28+ messages in thread
From: Jens Axboe @ 2022-03-10 15:43 UTC (permalink / raw)
To: Artyom Pavlov, io-uring
On 3/10/22 8:36 AM, Artyom Pavlov wrote:
> After thinking about it a bit, I think this approach has one serious
> disadvantage: you lose the result value of a successful initial request.
> Imagine we submit IORING_OP_READ and link an IORING_OP_WAKEUP_RING to it.
> If the request completes successfully, both ring1 and ring2 will
> lose the number of bytes read.
But you know what the result is, otherwise you would've gotten a cqe
posted if it wasn't what you asked for.
--
Jens Axboe
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Sending CQE to a different ring
2022-03-10 15:43 ` Jens Axboe
@ 2022-03-10 15:46 ` Jens Axboe
2022-03-10 15:52 ` Artyom Pavlov
0 siblings, 1 reply; 28+ messages in thread
From: Jens Axboe @ 2022-03-10 15:46 UTC (permalink / raw)
To: Artyom Pavlov, io-uring
On 3/10/22 8:43 AM, Jens Axboe wrote:
> On 3/10/22 8:36 AM, Artyom Pavlov wrote:
>> After thinking about it a bit, I think this approach has one serious
>> disadvantage: you lose the result value of a successful initial request.
>> Imagine we submit IORING_OP_READ and link an IORING_OP_WAKEUP_RING to it.
>> If the request completes successfully, both ring1 and ring2 will
>> lose the number of bytes read.
>
> But you know what the result is, otherwise you would've gotten a cqe
> posted if it wasn't what you asked for.
Can also be made more explicit by setting sqe->len to what you set the
original request length to, and then pass that back in the cqe->res
instead of using the pid from the task that sent it. Then you'd have it
immediately available. Might make more sense than the pid, not sure
anyone would care about that?
--
Jens Axboe
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Sending CQE to a different ring
2022-03-10 15:46 ` Jens Axboe
@ 2022-03-10 15:52 ` Artyom Pavlov
2022-03-10 15:57 ` Jens Axboe
0 siblings, 1 reply; 28+ messages in thread
From: Artyom Pavlov @ 2022-03-10 15:52 UTC (permalink / raw)
To: Jens Axboe, io-uring
10.03.2022 18:46, Jens Axboe wrote:
> On 3/10/22 8:43 AM, Jens Axboe wrote:
>> On 3/10/22 8:36 AM, Artyom Pavlov wrote:
>>> After thinking about it a bit, I think this approach has one serious
>>> disadvantage: you lose the result value of a successful initial request.
>>> Imagine we submit IORING_OP_READ and link an IORING_OP_WAKEUP_RING to it.
>>> If the request completes successfully, both ring1 and ring2 will
>>> lose the number of bytes read.
>>
>> But you know what the result is, otherwise you would've gotten a cqe
>> posted if it wasn't what you asked for.
>
> Can also be made more explicit by setting sqe->len to what you set the
> original request length to, and then pass that back in the cqe->res
> instead of using the pid from the task that sent it. Then you'd have it
> immediately available. Might make more sense than the pid, not sure
> anyone would care about that?
Maybe I am missing something, but we only know that the request to which
IORING_OP_WAKEUP_RING was linked completed successfully. How exactly do
you retrieve the number of read bytes with the linking approach?
Yes, passing a positive result value would make more sense than the PID
of the submitter, which is rarely, if ever, needed. IIUC we would not be
able to use linking with such an approach, since sqe->len has to be set in
user code based on a received CQE, but I guess it should be fine in practice.
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Sending CQE to a different ring
2022-03-10 15:52 ` Artyom Pavlov
@ 2022-03-10 15:57 ` Jens Axboe
2022-03-10 16:07 ` Artyom Pavlov
0 siblings, 1 reply; 28+ messages in thread
From: Jens Axboe @ 2022-03-10 15:57 UTC (permalink / raw)
To: Artyom Pavlov, io-uring
On 3/10/22 8:52 AM, Artyom Pavlov wrote:
> 10.03.2022 18:46, Jens Axboe wrote:
>> On 3/10/22 8:43 AM, Jens Axboe wrote:
>>> On 3/10/22 8:36 AM, Artyom Pavlov wrote:
>>>> After thinking about it a bit, I think this approach has one serious
>>>> disadvantage: you lose the result value of a successful initial request.
>>>> Imagine we submit IORING_OP_READ and link an IORING_OP_WAKEUP_RING to it.
>>>> If the request completes successfully, both ring1 and ring2 will
>>>> lose the number of bytes read.
>>>
>>> But you know what the result is, otherwise you would've gotten a cqe
>>> posted if it wasn't what you asked for.
>>
>> Can also be made more explicit by setting sqe->len to what you set the
>> original request length to, and then pass that back in the cqe->res
>> instead of using the pid from the task that sent it. Then you'd have it
>> immediately available. Might make more sense than the pid, not sure
>> anyone would care about that?
>
> Maybe I am missing something, but we only know that the request to
> which IORING_OP_WAKEUP_RING was linked completed successfully. How
> exactly do you retrieve the number of read bytes with the linking
> approach?
Because you'd do:
sqe1 = get_sqe();
prep_sqe(sqe1);
sqe1->len = io_bytes;
sqe1->flags |= IOSQE_IO_LINK | IOSQE_CQE_SKIP_SUCCESS;
sqe2 = get_sqe();
prep_msg_ring(sqe2);
sqe2->fd = target_ring_fd;
sqe2->len = io_bytes;
sqe2->flags |= IOSQE_CQE_SKIP_SUCCESS;
Then when target_ring gets the cqe for the msg_ring operation, it'll
have that len available in cqe->res. If sqe1 doesn't complete with
io_bytes as the result, the link will be broken and you'll have to
handle it locally.
Which should be fine, it's not like you can't handle it locally, you
just prefer to have it handled remotely.
> Yes, passing a positive result value would make more sense than the PID
> of the submitter, which is rarely, if ever, needed. IIUC we would not be
> able to use linking with such an approach, since sqe->len has to be set
> in user code based on a received CQE, but I guess it should be fine in
> practice.
Right, and using sqe->len and passing it through makes a lot more sense
in general as you can pass whatever you want there. If you want to use
the pid, you can use it like that. Or for whatever else you'd want. That
gives you both 'len' and 'user_data' as information you can pass between
the rings.
It could also be used with `len` holding a message type and `user_data`
holding a pointer to a struct. For example.
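E.g. something like the below sketch. The prep helper shape is assumed
here, and MSG_NEW_WORK, struct work, and handle_work() are made up for
the example:

/* ring1 side: 'len' carries a message type, 'user_data' a pointer */
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring1);

io_uring_prep_msg_ring(sqe, ring2_fd, MSG_NEW_WORK,
			(uint64_t) (uintptr_t) work_item, 0);

/* ring2 side, when reaping its completions */
if (cqe->res == MSG_NEW_WORK) {
	struct work *w = (struct work *) (uintptr_t) cqe->user_data;

	handle_work(w);
}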
--
Jens Axboe
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Sending CQE to a different ring
2022-03-10 15:57 ` Jens Axboe
@ 2022-03-10 16:07 ` Artyom Pavlov
2022-03-10 16:12 ` Jens Axboe
0 siblings, 1 reply; 28+ messages in thread
From: Artyom Pavlov @ 2022-03-10 16:07 UTC (permalink / raw)
To: Jens Axboe, io-uring
>> Yes, passing a positive result value would make more sense than the PID
>> of the submitter, which is rarely, if ever, needed. IIUC we would not be
>> able to use linking with such an approach, since sqe->len has to be set
>> in user code based on a received CQE, but I guess it should be fine in
>> practice.
>
> Right, and using sqe->len and passing it through makes a lot more sense
> in general as you can pass whatever you want there. If you want to use
> the pid, you can use it like that. Or for whatever else you'd want. That
> gives you both 'len' and 'user_data' as information you can pass between
> the rings.
>
> It could also be used with `len` holding a message type and `user_data`
> holding a pointer to a struct. For example.
I like IORING_OP_WAKEUP_RING with sqe->len being copied to cqe->res.
The only question I have is how "negative" (i.e. greater than or equal
to 2^31) lengths should be handled. They should either be copied just
like positive values, or we should get an error.
Also what do you think about registering ring fds? Could it be beneficial?
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Sending CQE to a different ring
2022-03-10 16:07 ` Artyom Pavlov
@ 2022-03-10 16:12 ` Jens Axboe
2022-03-10 16:22 ` Artyom Pavlov
0 siblings, 1 reply; 28+ messages in thread
From: Jens Axboe @ 2022-03-10 16:12 UTC (permalink / raw)
To: Artyom Pavlov, io-uring
On 3/10/22 9:07 AM, Artyom Pavlov wrote:
>>> Yes, passing a positive result value would make more sense than the PID
>>> of the submitter, which is rarely, if ever, needed. IIUC we would not be
>>> able to use linking with such an approach, since sqe->len has to be set
>>> in user code based on a received CQE, but I guess it should be fine in
>>> practice.
>>
>> Right, and using sqe->len and passing it through makes a lot more sense
>> in general as you can pass whatever you want there. If you want to use
>> the pid, you can use it like that. Or for whatever else you'd want. That
>> gives you both 'len' and 'user_data' as information you can pass between
>> the rings.
>>
>> It could also be used with `len` holding a message type and `user_data`
>> holding a pointer to a struct. For example.
>
> I like IORING_OP_WAKEUP_RING with sqe->len being copied to
> cqe->res. The only question I have is how "negative" (i.e.
> greater than or equal to 2^31) lengths should be handled. They should
> either be copied just like positive values, or we should get an error.
Any IO-transferring syscall in Linux supports INT_MAX as the maximum in
one request, which is also why the res field is sized the way it is. So
you cannot transfer more than that anyway, hence it should not be an
issue (at least not one specific to io_uring).
> Also what do you think about registering ring fds? Could it be
> beneficial?
Definitely, it'll make the overhead a bit lower for issuing the
IORING_OP_MSG_RING (it's been renamed ;-) request.
--
Jens Axboe
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Sending CQE to a different ring
2022-03-10 16:12 ` Jens Axboe
@ 2022-03-10 16:22 ` Artyom Pavlov
2022-03-10 16:25 ` Jens Axboe
0 siblings, 1 reply; 28+ messages in thread
From: Artyom Pavlov @ 2022-03-10 16:22 UTC (permalink / raw)
To: Jens Axboe, io-uring
>> I like IORING_OP_WAKEUP_RING with sqe->len being copied to
>> cqe->res. The only question I have is how "negative" (i.e.
>> greater than or equal to 2^31) lengths should be handled. They should
>> either be copied just like positive values, or we should get an error.
>
> Any IO-transferring syscall in Linux supports INT_MAX as the maximum in
> one request, which is also why the res field is sized the way it is. So
> you cannot transfer more than that anyway, hence it should not be an
> issue (at least not one specific to io_uring).
>
>> Also what do you think about registering ring fds? Could it be
>> beneficial?
>
> Definitely, it'll make the overhead a bit lower for issuing the
> IORING_OP_MSG_RING (it's been renamed ;-) request.
>
I mean the case where sqe->len is used to transfer arbitrary data set by
the user. I believe the IORING_OP_MSG_RING behavior for this edge case
should be explicitly documented.
It looks like we have 3 options:
1) Copy sqe->len to cqe->res without any checks. Disadvantage: the
user-provided value may collide with -EBADFD and -EOVERFLOW.
2) Submit an error CQE to the submitter ring.
3) Submit an error CQE to the receiver ring (cqe->res will contain the
error code).
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Sending CQE to a different ring
2022-03-10 16:22 ` Artyom Pavlov
@ 2022-03-10 16:25 ` Jens Axboe
2022-03-10 16:28 ` Artyom Pavlov
0 siblings, 1 reply; 28+ messages in thread
From: Jens Axboe @ 2022-03-10 16:25 UTC (permalink / raw)
To: Artyom Pavlov, io-uring
On 3/10/22 9:22 AM, Artyom Pavlov wrote:
>>> I like IORING_OP_WAKEUP_RING with sqe->len being copied to
>>> cqe->res. The only question I have is how "negative" (i.e.
>>> greater than or equal to 2^31) lengths should be handled. They should
>>> either be copied just like positive values, or we should get an error.
>>
>> Any IO-transferring syscall in Linux supports INT_MAX as the maximum in
>> one request, which is also why the res field is sized the way it is. So
>> you cannot transfer more than that anyway, hence it should not be an
>> issue (at least not one specific to io_uring).
>>
>>> Also what do you think about registering ring fds? Could it be
>>> beneficial?
>>
>> Definitely, it'll make the overhead a bit lower for issuing the
>> IORING_OP_MSG_RING (it's been renamed ;-) request.
>>
>
> I mean the case where sqe->len is used to transfer arbitrary data set
> by the user. I believe the IORING_OP_MSG_RING behavior for this edge
> case should be explicitly documented.
Ah gotcha
> It looks like we have 3 options:
> 1) Copy sqe->len to cqe->res without any checks. Disadvantage: the
> user-provided value may collide with -EBADFD and -EOVERFLOW.
> 2) Submit an error CQE to the submitter ring.
> 3) Submit an error CQE to the receiver ring (cqe->res will contain
> the error code).
#1 should not be an issue, as cqe->res for those error values is the
original ring's result code, not the target ring's.
I'd say the application should just cast it to a u32 and interpret it like
that, if it's worried about the signed nature of it?
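I.e. on the receiving side it'd just be something like the below sketch,
if the full 32 bits are used as a payload:

/* treat the passed 'len' as an unsigned value */
uint32_t payload = (uint32_t) cqe->res;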
--
Jens Axboe
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Sending CQE to a different ring
2022-03-10 16:25 ` Jens Axboe
@ 2022-03-10 16:28 ` Artyom Pavlov
2022-03-10 16:30 ` Jens Axboe
0 siblings, 1 reply; 28+ messages in thread
From: Artyom Pavlov @ 2022-03-10 16:28 UTC (permalink / raw)
To: Jens Axboe, io-uring
>> It looks like we have 3 options:
>> 1) Copy sqe->len to cqe->res without any checks. Disadvantage: the
>> user-provided value may collide with -EBADFD and -EOVERFLOW.
>> 2) Submit an error CQE to the submitter ring.
>> 3) Submit an error CQE to the receiver ring (cqe->res will contain
>> the error code).
>
> #1 should not be an issue, as cqe->res for those error values is the
> original ring's result code, not the target ring's.
>
> I'd say the application should just cast it to a u32 and interpret it
> like that, if it's worried about the signed nature of it?
Ah, indeed. I missed that the EBADFD and EOVERFLOW errors can only happen
on the submitter ring, so the receiver ring can always interpret a CQE
with the IORING_CQE_F_MSG flag as a successfully received message from
another ring.
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Sending CQE to a different ring
2022-03-10 16:28 ` Artyom Pavlov
@ 2022-03-10 16:30 ` Jens Axboe
0 siblings, 0 replies; 28+ messages in thread
From: Jens Axboe @ 2022-03-10 16:30 UTC (permalink / raw)
To: Artyom Pavlov, io-uring
On 3/10/22 9:28 AM, Artyom Pavlov wrote:
>>> It looks like we have 3 options:
>>> 1) Copy sqe->len to cqe->res without any checks. Disadvantage: the
>>> user-provided value may collide with -EBADFD and -EOVERFLOW.
>>> 2) Submit an error CQE to the submitter ring.
>>> 3) Submit an error CQE to the receiver ring (cqe->res will contain
>>> the error code).
>>
>> #1 should not be an issue, as cqe->res for those error values is the
>> original ring's result code, not the target ring's.
>>
>> I'd say the application should just cast it to a u32 and interpret it
>> like that, if it's worried about the signed nature of it?
>
> Ah, indeed. I missed that the EBADFD and EOVERFLOW errors can only
> happen on the submitter ring, so the receiver ring can always
> interpret a CQE with the IORING_CQE_F_MSG flag as a successfully
> received message from another ring.
Yes, so I don't think there's any confusion there. I did just make the
prep helper in liburing take an unsigned, and I did test that we don't
have any issues on that front.
Posted v2, passes testing too. I actually think this is quite a nifty
feature, and it can be used quite flexibly to communicate between rings.
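FWIW, with the liburing helper, sending a message then boils down to the
below (sketch, assuming the v2 prep helper keeps this shape):

struct io_uring_sqe *sqe = io_uring_get_sqe(&ring1);

/* posts a CQE on ring2 with cqe->res = len and cqe->user_data = data */
io_uring_prep_msg_ring(sqe, ring2_fd, len, data, 0);
io_uring_submit(&ring1);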
--
Jens Axboe
^ permalink raw reply [flat|nested] 28+ messages in thread
end of thread [newest: 2022-03-10 16:30 UTC]
Thread overview: 28+ messages
2022-03-09 23:49 Sending CQE to a different ring Artyom Pavlov
2022-03-10 1:36 ` Jens Axboe
2022-03-10 1:55 ` Jens Axboe
2022-03-10 2:33 ` Jens Axboe
2022-03-10 9:15 ` Chris Panayis
2022-03-10 13:53 ` Pavel Begunkov
2022-03-10 15:38 ` Jens Axboe
2022-03-10 2:11 ` Artyom Pavlov
2022-03-10 3:00 ` Jens Axboe
2022-03-10 3:48 ` Artyom Pavlov
2022-03-10 4:03 ` Jens Axboe
2022-03-10 4:14 ` Jens Axboe
2022-03-10 14:00 ` Artyom Pavlov
2022-03-10 15:36 ` Artyom Pavlov
2022-03-10 15:43 ` Jens Axboe
2022-03-10 15:46 ` Jens Axboe
2022-03-10 15:52 ` Artyom Pavlov
2022-03-10 15:57 ` Jens Axboe
2022-03-10 16:07 ` Artyom Pavlov
2022-03-10 16:12 ` Jens Axboe
2022-03-10 16:22 ` Artyom Pavlov
2022-03-10 16:25 ` Jens Axboe
2022-03-10 16:28 ` Artyom Pavlov
2022-03-10 16:30 ` Jens Axboe
2022-03-10 13:34 ` Pavel Begunkov
2022-03-10 13:43 ` Jens Axboe
2022-03-10 13:51 ` Pavel Begunkov
2022-03-10 3:06 ` Jens Axboe