* [PATCH 0/3] Add REQ_F_CQE_SKIP support to io_uring zerocopy @ 2024-04-04 22:16 Oliver Crumrine 2024-04-04 22:17 ` [PATCH 1/3] io_uring: Add REQ_F_CQE_SKIP support for " Oliver Crumrine ` (3 more replies) 0 siblings, 4 replies; 17+ messages in thread From: Oliver Crumrine @ 2024-04-04 22:16 UTC (permalink / raw) To: axboe, asml.silence, davem, edumazet, kuba, pabeni, shuah, leitao Cc: io-uring, netdev, linux-kselftest, linux-kernel This patchset allows for io_uring zerocopy to support REQ_F_CQE_SKIP, skipping the normal completion notification, but not the zerocopy buffer release notification. This patchset also includes a test to test these changes, and a patch to mini_liburing to enable io_uring_peek_cqe, which is needed for the test. Oliver Crumrine (3): io_uring: Add REQ_F_CQE_SKIP support for io_uring zerocopy io_uring: Add io_uring_peek_cqe to mini_liburing io_uring: Support IOSQE_CQE_SKIP_SUCCESS in io_uring zerocopy test io_uring/net.c | 6 +-- tools/include/io_uring/mini_liburing.h | 18 +++++++++ .../selftests/net/io_uring_zerocopy_tx.c | 37 +++++++++++++++++-- .../selftests/net/io_uring_zerocopy_tx.sh | 7 +++- 4 files changed, 59 insertions(+), 10 deletions(-) -- 2.44.0 ^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH 1/3] io_uring: Add REQ_F_CQE_SKIP support for io_uring zerocopy 2024-04-04 22:16 [PATCH 0/3] Add REQ_F_CQE_SKIP support to io_uring zerocopy Oliver Crumrine @ 2024-04-04 22:17 ` Oliver Crumrine 2024-04-05 13:01 ` Pavel Begunkov 2024-04-04 22:19 ` [PATCH 2/3] io_uring: Add io_uring_peek_cqe to mini_liburing Oliver Crumrine ` (2 subsequent siblings) 3 siblings, 1 reply; 17+ messages in thread From: Oliver Crumrine @ 2024-04-04 22:17 UTC (permalink / raw) To: axboe, asml.silence, davem, edumazet, kuba, pabeni, shuah, leitao Cc: io-uring, netdev, linux-kselftest, linux-kernel In his patch to enable zerocopy networking for io_uring, Pavel Begunkov specifically disabled REQ_F_CQE_SKIP, as (at least from my understanding) the userspace program wouldn't receive the IORING_CQE_F_MORE flag in the result value. To fix this, instead of keeping track of how many CQEs have been received, and subtracting notifs from that, programs can keep track of how many SQEs they have issued, and if a CQE is returned with an error, they can simply subtract from how many notifs they expect to receive. Signed-off-by: Oliver Crumrine <[email protected]> --- io_uring/net.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/io_uring/net.c b/io_uring/net.c index 1e7665ff6ef7..822f49809b68 100644 --- a/io_uring/net.c +++ b/io_uring/net.c @@ -1044,9 +1044,6 @@ int io_send_zc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) if (unlikely(READ_ONCE(sqe->__pad2[0]) || READ_ONCE(sqe->addr3))) return -EINVAL; - /* we don't support IOSQE_CQE_SKIP_SUCCESS just yet */ - if (req->flags & REQ_F_CQE_SKIP) - return -EINVAL; notif = zc->notif = io_alloc_notif(ctx); if (!notif) @@ -1342,7 +1339,8 @@ void io_sendrecv_fail(struct io_kiocb *req) req->cqe.res = sr->done_io; if ((req->flags & REQ_F_NEED_CLEANUP) && - (req->opcode == IORING_OP_SEND_ZC || req->opcode == IORING_OP_SENDMSG_ZC)) + (req->opcode == IORING_OP_SEND_ZC || req->opcode == IORING_OP_SENDMSG_ZC) && + !(req->flags & REQ_F_CQE_SKIP)) req->cqe.flags |= IORING_CQE_F_MORE; } -- 2.44.0 ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH 1/3] io_uring: Add REQ_F_CQE_SKIP support for io_uring zerocopy 2024-04-04 22:17 ` [PATCH 1/3] io_uring: Add REQ_F_CQE_SKIP support for " Oliver Crumrine @ 2024-04-05 13:01 ` Pavel Begunkov 2024-04-05 20:04 ` Oliver Crumrine 0 siblings, 1 reply; 17+ messages in thread From: Pavel Begunkov @ 2024-04-05 13:01 UTC (permalink / raw) To: Oliver Crumrine, axboe, davem, edumazet, kuba, pabeni, shuah, leitao Cc: io-uring, netdev, linux-kselftest, linux-kernel On 4/4/24 23:17, Oliver Crumrine wrote: > In his patch to enable zerocopy networking for io_uring, Pavel Begunkov > specifically disabled REQ_F_CQE_SKIP, as (at least from my > understanding) the userspace program wouldn't receive the > IORING_CQE_F_MORE flag in the result value. No. IORING_CQE_F_MORE means there will be another CQE from this request, so a single CQE without IORING_CQE_F_MORE is trivially fine. The problem is the semantics, because by suppressing the first CQE you're loosing the result value. You might rely on WAITALL as other sends and "fail" (in terms of io_uring) the request in case of a partial send posting 2 CQEs, but that's not a great way and it's getting userspace complicated pretty easily. In short, it was left out for later because there is a better way to implement it, but it should be done carefully > To fix this, instead of keeping track of how many CQEs have been > received, and subtracting notifs from that, programs can keep track of That's a benchmark way of doing it, more realistically it'd be more like event_loop() { cqe = wait_cqe(); struct req *r = (struct req *)cqe->user_data; r->callback(r, cqe); } send_zc_callback(req, cqe) { if (cqe->flags & F_MORE) { // don't free the req // we should wait for another CQE ... } } > how many SQEs they have issued, and if a CQE is returned with an error, > they can simply subtract from how many notifs they expect to receive. The design specifically untangles those two notions, i.e. there can be a notification even when the main CQE fails (ret<0). It's safer this way, even though AFAIK relying on errors would be fine with current users (TCP/UDP). > Signed-off-by: Oliver Crumrine <[email protected]> > --- > io_uring/net.c | 6 ++---- > 1 file changed, 2 insertions(+), 4 deletions(-) > > diff --git a/io_uring/net.c b/io_uring/net.c > index 1e7665ff6ef7..822f49809b68 100644 > --- a/io_uring/net.c > +++ b/io_uring/net.c > @@ -1044,9 +1044,6 @@ int io_send_zc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) > > if (unlikely(READ_ONCE(sqe->__pad2[0]) || READ_ONCE(sqe->addr3))) > return -EINVAL; > - /* we don't support IOSQE_CQE_SKIP_SUCCESS just yet */ > - if (req->flags & REQ_F_CQE_SKIP) > - return -EINVAL; > > notif = zc->notif = io_alloc_notif(ctx); > if (!notif) > @@ -1342,7 +1339,8 @@ void io_sendrecv_fail(struct io_kiocb *req) > req->cqe.res = sr->done_io; > > if ((req->flags & REQ_F_NEED_CLEANUP) && > - (req->opcode == IORING_OP_SEND_ZC || req->opcode == IORING_OP_SENDMSG_ZC)) > + (req->opcode == IORING_OP_SEND_ZC || req->opcode == IORING_OP_SENDMSG_ZC) && > + !(req->flags & REQ_F_CQE_SKIP)) > req->cqe.flags |= IORING_CQE_F_MORE; > } > -- Pavel Begunkov ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/3] io_uring: Add REQ_F_CQE_SKIP support for io_uring zerocopy 2024-04-05 13:01 ` Pavel Begunkov @ 2024-04-05 20:04 ` Oliver Crumrine 2024-04-06 21:23 ` Pavel Begunkov 0 siblings, 1 reply; 17+ messages in thread From: Oliver Crumrine @ 2024-04-05 20:04 UTC (permalink / raw) To: Pavel Begunkov, Oliver Crumrine, axboe, davem, edumazet, kuba, pabeni, shuah, leitao Cc: io-uring, netdev, linux-kselftest, linux-kernel Pavel Begunkov wrote: > On 4/4/24 23:17, Oliver Crumrine wrote: > > In his patch to enable zerocopy networking for io_uring, Pavel Begunkov > > specifically disabled REQ_F_CQE_SKIP, as (at least from my > > understanding) the userspace program wouldn't receive the > > IORING_CQE_F_MORE flag in the result value. > > No. IORING_CQE_F_MORE means there will be another CQE from this > request, so a single CQE without IORING_CQE_F_MORE is trivially > fine. > > The problem is the semantics, because by suppressing the first > CQE you're loosing the result value. You might rely on WAITALL That's already happening with io_send. > as other sends and "fail" (in terms of io_uring) the request > in case of a partial send posting 2 CQEs, but that's not a great > way and it's getting userspace complicated pretty easily. > > In short, it was left out for later because there is a > better way to implement it, but it should be done carefully Maybe we could put the return values in the notifs? That would be a discrepancy between io_send and io_send_zc, though. > > > > To fix this, instead of keeping track of how many CQEs have been > > received, and subtracting notifs from that, programs can keep track of > > That's a benchmark way of doing it, more realistically > it'd be more like > > event_loop() { > cqe = wait_cqe(); > struct req *r = (struct req *)cqe->user_data; > r->callback(r, cqe); > > > send_zc_callback(req, cqe) { > if (cqe->flags & F_MORE) { > // don't free the req > // we should wait for another CQE > ... > } > } > > > how many SQEs they have issued, and if a CQE is returned with an error, > > they can simply subtract from how many notifs they expect to receive. > > The design specifically untangles those two notions, i.e. there can > be a notification even when the main CQE fails (ret<0). It's safer > this way, even though AFAIK relying on errors would be fine with > current users (TCP/UDP). > > > > Signed-off-by: Oliver Crumrine <[email protected]> > > --- > > io_uring/net.c | 6 ++---- > > 1 file changed, 2 insertions(+), 4 deletions(-) > > > > diff --git a/io_uring/net.c b/io_uring/net.c > > index 1e7665ff6ef7..822f49809b68 100644 > > --- a/io_uring/net.c > > +++ b/io_uring/net.c > > @@ -1044,9 +1044,6 @@ int io_send_zc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) > > > > if (unlikely(READ_ONCE(sqe->__pad2[0]) || READ_ONCE(sqe->addr3))) > > return -EINVAL; > > - /* we don't support IOSQE_CQE_SKIP_SUCCESS just yet */ > > - if (req->flags & REQ_F_CQE_SKIP) > > - return -EINVAL; > > > > notif = zc->notif = io_alloc_notif(ctx); > > if (!notif) > > @@ -1342,7 +1339,8 @@ void io_sendrecv_fail(struct io_kiocb *req) > > req->cqe.res = sr->done_io; > > > > if ((req->flags & REQ_F_NEED_CLEANUP) && > > - (req->opcode == IORING_OP_SEND_ZC || req->opcode == IORING_OP_SENDMSG_ZC)) > > + (req->opcode == IORING_OP_SEND_ZC || req->opcode == IORING_OP_SENDMSG_ZC) && > > + !(req->flags & REQ_F_CQE_SKIP)) > > req->cqe.flags |= IORING_CQE_F_MORE; > > } > > > > -- > Pavel Begunkov ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/3] io_uring: Add REQ_F_CQE_SKIP support for io_uring zerocopy 2024-04-05 20:04 ` Oliver Crumrine @ 2024-04-06 21:23 ` Pavel Begunkov 2024-04-07 13:13 ` Oliver Crumrine 0 siblings, 1 reply; 17+ messages in thread From: Pavel Begunkov @ 2024-04-06 21:23 UTC (permalink / raw) To: Oliver Crumrine, axboe, davem, edumazet, kuba, pabeni, shuah, leitao Cc: io-uring, netdev, linux-kselftest, linux-kernel On 4/5/24 21:04, Oliver Crumrine wrote: > Pavel Begunkov wrote: >> On 4/4/24 23:17, Oliver Crumrine wrote: >>> In his patch to enable zerocopy networking for io_uring, Pavel Begunkov >>> specifically disabled REQ_F_CQE_SKIP, as (at least from my >>> understanding) the userspace program wouldn't receive the >>> IORING_CQE_F_MORE flag in the result value. >> >> No. IORING_CQE_F_MORE means there will be another CQE from this >> request, so a single CQE without IORING_CQE_F_MORE is trivially >> fine. >> >> The problem is the semantics, because by suppressing the first >> CQE you're loosing the result value. You might rely on WAITALL > That's already happening with io_send. Right, and it's still annoying and hard to use >> as other sends and "fail" (in terms of io_uring) the request >> in case of a partial send posting 2 CQEs, but that's not a great >> way and it's getting userspace complicated pretty easily. >> >> In short, it was left out for later because there is a >> better way to implement it, but it should be done carefully > Maybe we could put the return values in the notifs? That would be a > discrepancy between io_send and io_send_zc, though. Yes. And yes, having a custom flavour is not good. It'd only be well usable if apart from returning the actual result it also guarantees there will be one and only one CQE, then the userspace doesn't have to do the dancing with counting and checking F_MORE. In fact, I outlined before how a generic solution may looks like: https://github.com/axboe/liburing/issues/824 The only interesting part, IMHO, is to be able to merge the main completion with its notification. Below is an old stash rebased onto for-6.10. The only thing missing is relinking, but maybe we don't even care about it. I need to cover it well with tests. commit ca5e4fb6d105b5dfdf3768d46ce01529b7bb88c5 Author: Pavel Begunkov <[email protected]> Date: Sat Apr 6 15:46:38 2024 +0100 io_uring/net: introduce single CQE send zc mode IORING_OP_SEND[MSG]_ZC requests are posting two completions, one to notify that the data was queued, and later a second, usually referred as "notification", to let the user know that the buffer used can be reused/freed. In some cases the user might not care about the main completion and would be content getting only the notification, which would allow to simplify the userspace. One example is when after a send the user would be waiting for the other end to get the message and reply back not pushing any more data in the meantime. Another case is unreliable protocols like UDP, which do not require a confirmation from the other end before dropping buffers, and so the notifications are usually posted shortly after the send request is queued. Add a flag merging completions into a single CQE. cqe->res will store the send's result as usual, and it will have IORING_CQE_F_NOTIF set if the buffer was potentially used. Timewise, it would be posted at the moment when the notification would have been originally completed. Signed-off-by: Pavel Begunkov <[email protected]> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 7bd10201a02b..e2b528c341c9 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -356,6 +356,7 @@ enum io_uring_op { #define IORING_RECV_MULTISHOT (1U << 1) #define IORING_RECVSEND_FIXED_BUF (1U << 2) #define IORING_SEND_ZC_REPORT_USAGE (1U << 3) +#define IORING_SEND_ZC_COMBINE_CQE (1U << 4) /* * cqe.res for IORING_CQE_F_NOTIF if diff --git a/io_uring/net.c b/io_uring/net.c index a74287692071..052f030ab8f8 100644 --- a/io_uring/net.c +++ b/io_uring/net.c @@ -992,7 +992,19 @@ void io_send_zc_cleanup(struct io_kiocb *req) } } -#define IO_ZC_FLAGS_COMMON (IORING_RECVSEND_POLL_FIRST | IORING_RECVSEND_FIXED_BUF) +static inline void io_sendzc_adjust_res(struct io_kiocb *req) +{ + struct io_sr_msg *sr = io_kiocb_to_cmd(req, struct io_sr_msg); + + if (sr->flags & IORING_SEND_ZC_COMBINE_CQE) { + sr->notif->cqe.res = req->cqe.res; + req->flags |= REQ_F_CQE_SKIP; + } +} + +#define IO_ZC_FLAGS_COMMON (IORING_RECVSEND_POLL_FIRST | \ + IORING_RECVSEND_FIXED_BUF | \ + IORING_SEND_ZC_COMBINE_CQE) #define IO_ZC_FLAGS_VALID (IO_ZC_FLAGS_COMMON | IORING_SEND_ZC_REPORT_USAGE) int io_send_zc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) @@ -1022,6 +1034,8 @@ int io_send_zc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) if (zc->flags & ~IO_ZC_FLAGS_VALID) return -EINVAL; if (zc->flags & IORING_SEND_ZC_REPORT_USAGE) { + if (zc->flags & IORING_SEND_ZC_COMBINE_CQE) + return -EINVAL; io_notif_set_extended(notif); io_notif_to_data(notif)->zc_report = true; } @@ -1197,6 +1211,9 @@ int io_send_zc(struct io_kiocb *req, unsigned int issue_flags) else if (zc->done_io) ret = zc->done_io; + io_req_set_res(req, ret, IORING_CQE_F_MORE); + io_sendzc_adjust_res(req); + /* * If we're in io-wq we can't rely on tw ordering guarantees, defer * flushing notif to io_send_zc_cleanup() @@ -1205,7 +1222,6 @@ int io_send_zc(struct io_kiocb *req, unsigned int issue_flags) io_notif_flush(zc->notif); io_req_msg_cleanup(req, 0); } - io_req_set_res(req, ret, IORING_CQE_F_MORE); return IOU_OK; } @@ -1258,6 +1274,9 @@ int io_sendmsg_zc(struct io_kiocb *req, unsigned int issue_flags) else if (sr->done_io) ret = sr->done_io; + io_req_set_res(req, ret, IORING_CQE_F_MORE); + io_sendzc_adjust_res(req); + /* * If we're in io-wq we can't rely on tw ordering guarantees, defer * flushing notif to io_send_zc_cleanup() @@ -1266,7 +1285,6 @@ int io_sendmsg_zc(struct io_kiocb *req, unsigned int issue_flags) io_notif_flush(sr->notif); io_req_msg_cleanup(req, 0); } - io_req_set_res(req, ret, IORING_CQE_F_MORE); return IOU_OK; } @@ -1278,8 +1296,10 @@ void io_sendrecv_fail(struct io_kiocb *req) req->cqe.res = sr->done_io; if ((req->flags & REQ_F_NEED_CLEANUP) && - (req->opcode == IORING_OP_SEND_ZC || req->opcode == IORING_OP_SENDMSG_ZC)) + (req->opcode == IORING_OP_SEND_ZC || req->opcode == IORING_OP_SENDMSG_ZC)) { req->cqe.flags |= IORING_CQE_F_MORE; + io_sendzc_adjust_res(req); + } } int io_accept_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) -- Pavel Begunkov ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH 1/3] io_uring: Add REQ_F_CQE_SKIP support for io_uring zerocopy 2024-04-06 21:23 ` Pavel Begunkov @ 2024-04-07 13:13 ` Oliver Crumrine 2024-04-07 19:14 ` Oliver Crumrine 0 siblings, 1 reply; 17+ messages in thread From: Oliver Crumrine @ 2024-04-07 13:13 UTC (permalink / raw) To: Pavel Begunkov, Oliver Crumrine, axboe; +Cc: io-uring, linux-kernel Pavel Begunkov wrote: > On 4/5/24 21:04, Oliver Crumrine wrote: > > Pavel Begunkov wrote: > >> On 4/4/24 23:17, Oliver Crumrine wrote: > >>> In his patch to enable zerocopy networking for io_uring, Pavel Begunkov > >>> specifically disabled REQ_F_CQE_SKIP, as (at least from my > >>> understanding) the userspace program wouldn't receive the > >>> IORING_CQE_F_MORE flag in the result value. > >> > >> No. IORING_CQE_F_MORE means there will be another CQE from this > >> request, so a single CQE without IORING_CQE_F_MORE is trivially > >> fine. > >> > >> The problem is the semantics, because by suppressing the first > >> CQE you're loosing the result value. You might rely on WAITALL > > That's already happening with io_send. > > Right, and it's still annoying and hard to use Another solution might be something where there is a counter that stores how many CQEs with REQ_F_CQE_SKIP have been processed. Before exiting, userspace could call a function like: io_wait_completions(int completions) which would wait until everything is done, and then userspace could peek the completion ring. > > >> as other sends and "fail" (in terms of io_uring) the request > >> in case of a partial send posting 2 CQEs, but that's not a great > >> way and it's getting userspace complicated pretty easily. > >> > >> In short, it was left out for later because there is a > >> better way to implement it, but it should be done carefully > > Maybe we could put the return values in the notifs? That would be a > > discrepancy between io_send and io_send_zc, though. > > Yes. And yes, having a custom flavour is not good. It'd only > be well usable if apart from returning the actual result > it also guarantees there will be one and only one CQE, then > the userspace doesn't have to do the dancing with counting > and checking F_MORE. In fact, I outlined before how a generic > solution may looks like: > > https://github.com/axboe/liburing/issues/824 > > The only interesting part, IMHO, is to be able to merge the > main completion with its notification. Below is an old stash > rebased onto for-6.10. The only thing missing is relinking, > but maybe we don't even care about it. I need to cover it > well with tests. The patch looks pretty good. The only potential issue is that you store the res of the normal CQE into the notif CQE. This overwrites the IORING_CQE_F_NOTIF with IORING_CQE_F_MORE. This means that the notif would indicate to userspace that there will be another CQE, of which there won't. > > > > > commit ca5e4fb6d105b5dfdf3768d46ce01529b7bb88c5 > Author: Pavel Begunkov <[email protected]> > Date: Sat Apr 6 15:46:38 2024 +0100 > > io_uring/net: introduce single CQE send zc mode > > IORING_OP_SEND[MSG]_ZC requests are posting two completions, one to > notify that the data was queued, and later a second, usually referred > as "notification", to let the user know that the buffer used can be > reused/freed. In some cases the user might not care about the main > completion and would be content getting only the notification, which > would allow to simplify the userspace. > > One example is when after a send the user would be waiting for the other > end to get the message and reply back not pushing any more data in the > meantime. Another case is unreliable protocols like UDP, which do not > require a confirmation from the other end before dropping buffers, and > so the notifications are usually posted shortly after the send request > is queued. > > Add a flag merging completions into a single CQE. cqe->res will store > the send's result as usual, and it will have IORING_CQE_F_NOTIF set if > the buffer was potentially used. Timewise, it would be posted at the > moment when the notification would have been originally completed. > > Signed-off-by: Pavel Begunkov <[email protected]> > > diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h > index 7bd10201a02b..e2b528c341c9 100644 > --- a/include/uapi/linux/io_uring.h > +++ b/include/uapi/linux/io_uring.h > @@ -356,6 +356,7 @@ enum io_uring_op { > #define IORING_RECV_MULTISHOT (1U << 1) > #define IORING_RECVSEND_FIXED_BUF (1U << 2) > #define IORING_SEND_ZC_REPORT_USAGE (1U << 3) > +#define IORING_SEND_ZC_COMBINE_CQE (1U << 4) > > /* > * cqe.res for IORING_CQE_F_NOTIF if > diff --git a/io_uring/net.c b/io_uring/net.c > index a74287692071..052f030ab8f8 100644 > --- a/io_uring/net.c > +++ b/io_uring/net.c > @@ -992,7 +992,19 @@ void io_send_zc_cleanup(struct io_kiocb *req) > } > } > > -#define IO_ZC_FLAGS_COMMON (IORING_RECVSEND_POLL_FIRST | IORING_RECVSEND_FIXED_BUF) > +static inline void io_sendzc_adjust_res(struct io_kiocb *req) > +{ > + struct io_sr_msg *sr = io_kiocb_to_cmd(req, struct io_sr_msg); > + > + if (sr->flags & IORING_SEND_ZC_COMBINE_CQE) { > + sr->notif->cqe.res = req->cqe.res; > + req->flags |= REQ_F_CQE_SKIP; > + } > +} > + > +#define IO_ZC_FLAGS_COMMON (IORING_RECVSEND_POLL_FIRST | \ > + IORING_RECVSEND_FIXED_BUF | \ > + IORING_SEND_ZC_COMBINE_CQE) > #define IO_ZC_FLAGS_VALID (IO_ZC_FLAGS_COMMON | IORING_SEND_ZC_REPORT_USAGE) > > int io_send_zc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) > @@ -1022,6 +1034,8 @@ int io_send_zc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) > if (zc->flags & ~IO_ZC_FLAGS_VALID) > return -EINVAL; > if (zc->flags & IORING_SEND_ZC_REPORT_USAGE) { > + if (zc->flags & IORING_SEND_ZC_COMBINE_CQE) > + return -EINVAL; > io_notif_set_extended(notif); > io_notif_to_data(notif)->zc_report = true; > } > @@ -1197,6 +1211,9 @@ int io_send_zc(struct io_kiocb *req, unsigned int issue_flags) > else if (zc->done_io) > ret = zc->done_io; > > + io_req_set_res(req, ret, IORING_CQE_F_MORE); > + io_sendzc_adjust_res(req); > + > /* > * If we're in io-wq we can't rely on tw ordering guarantees, defer > * flushing notif to io_send_zc_cleanup() > @@ -1205,7 +1222,6 @@ int io_send_zc(struct io_kiocb *req, unsigned int issue_flags) > io_notif_flush(zc->notif); > io_req_msg_cleanup(req, 0); > } > - io_req_set_res(req, ret, IORING_CQE_F_MORE); > return IOU_OK; > } > > else if (sr->done_io) > ret = sr->done_io; > > + io_req_set_res(req, ret, IORING_CQE_F_MORE); > + io_sendzc_adjust_res(req); > + > /* > * If we're in io-wq we can't rely on tw ordering guarantees, defer > * flushing notif to io_send_zc_cleanup() > @@ -1266,7 +1285,6 @@ int io_sendmsg_zc(struct io_kiocb *req, unsigned int issue_flags) > io_notif_flush(sr->notif); > io_req_msg_cleanup(req, 0); > } > - io_req_set_res(req, ret, IORING_CQE_F_MORE); > return IOU_OK; > } > > @@ -1278,8 +1296,10 @@ void io_sendrecv_fail(struct io_kiocb *req) > req->cqe.res = sr->done_io; > > if ((req->flags & REQ_F_NEED_CLEANUP) && > - (req->opcode == IORING_OP_SEND_ZC || req->opcode == IORING_OP_SENDMSG_ZC)) > + (req->opcode == IORING_OP_SEND_ZC || req->opcode == IORING_OP_SENDMSG_ZC)) { > req->cqe.flags |= IORING_CQE_F_MORE; > + io_sendzc_adjust_res(req); > + } > } > > int io_accept_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) > > > -- > Pavel Begunkov ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/3] io_uring: Add REQ_F_CQE_SKIP support for io_uring zerocopy 2024-04-07 13:13 ` Oliver Crumrine @ 2024-04-07 19:14 ` Oliver Crumrine 2024-04-07 23:46 ` Pavel Begunkov 0 siblings, 1 reply; 17+ messages in thread From: Oliver Crumrine @ 2024-04-07 19:14 UTC (permalink / raw) To: Pavel Begunkov, Oliver Crumrine, axboe; +Cc: io-uring, linux-kernel Oliver Crumrine wrote: > Pavel Begunkov wrote: > > On 4/5/24 21:04, Oliver Crumrine wrote: > > > Pavel Begunkov wrote: > > >> On 4/4/24 23:17, Oliver Crumrine wrote: > > >>> In his patch to enable zerocopy networking for io_uring, Pavel Begunkov > > >>> specifically disabled REQ_F_CQE_SKIP, as (at least from my > > >>> understanding) the userspace program wouldn't receive the > > >>> IORING_CQE_F_MORE flag in the result value. > > >> > > >> No. IORING_CQE_F_MORE means there will be another CQE from this > > >> request, so a single CQE without IORING_CQE_F_MORE is trivially > > >> fine. > > >> > > >> The problem is the semantics, because by suppressing the first > > >> CQE you're loosing the result value. You might rely on WAITALL > > > That's already happening with io_send. > > > > Right, and it's still annoying and hard to use > Another solution might be something where there is a counter that stores > how many CQEs with REQ_F_CQE_SKIP have been processed. Before exiting, > userspace could call a function like: io_wait_completions(int completions) > which would wait until everything is done, and then userspace could peek > the completion ring. > > > > >> as other sends and "fail" (in terms of io_uring) the request > > >> in case of a partial send posting 2 CQEs, but that's not a great > > >> way and it's getting userspace complicated pretty easily. > > >> > > >> In short, it was left out for later because there is a > > >> better way to implement it, but it should be done carefully > > > Maybe we could put the return values in the notifs? That would be a > > > discrepancy between io_send and io_send_zc, though. > > > > Yes. And yes, having a custom flavour is not good. It'd only > > be well usable if apart from returning the actual result > > it also guarantees there will be one and only one CQE, then > > the userspace doesn't have to do the dancing with counting > > and checking F_MORE. In fact, I outlined before how a generic > > solution may looks like: > > > > https://github.com/axboe/liburing/issues/824 > > > > The only interesting part, IMHO, is to be able to merge the > > main completion with its notification. Below is an old stash > > rebased onto for-6.10. The only thing missing is relinking, > > but maybe we don't even care about it. I need to cover it > > well with tests. > The patch looks pretty good. The only potential issue is that you store > the res of the normal CQE into the notif CQE. This overwrites the > IORING_CQE_F_NOTIF with IORING_CQE_F_MORE. This means that the notif would > indicate to userspace that there will be another CQE, of which there > won't. I was wrong here; Mixed up flags and result value. > > > > > > > > > > commit ca5e4fb6d105b5dfdf3768d46ce01529b7bb88c5 > > Author: Pavel Begunkov <[email protected]> > > Date: Sat Apr 6 15:46:38 2024 +0100 > > > > io_uring/net: introduce single CQE send zc mode > > > > IORING_OP_SEND[MSG]_ZC requests are posting two completions, one to > > notify that the data was queued, and later a second, usually referred > > as "notification", to let the user know that the buffer used can be > > reused/freed. In some cases the user might not care about the main > > completion and would be content getting only the notification, which > > would allow to simplify the userspace. > > > > One example is when after a send the user would be waiting for the other > > end to get the message and reply back not pushing any more data in the > > meantime. Another case is unreliable protocols like UDP, which do not > > require a confirmation from the other end before dropping buffers, and > > so the notifications are usually posted shortly after the send request > > is queued. > > > > Add a flag merging completions into a single CQE. cqe->res will store > > the send's result as usual, and it will have IORING_CQE_F_NOTIF set if > > the buffer was potentially used. Timewise, it would be posted at the > > moment when the notification would have been originally completed. > > > > Signed-off-by: Pavel Begunkov <[email protected]> > > > > diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h > > index 7bd10201a02b..e2b528c341c9 100644 > > --- a/include/uapi/linux/io_uring.h > > +++ b/include/uapi/linux/io_uring.h > > @@ -356,6 +356,7 @@ enum io_uring_op { > > #define IORING_RECV_MULTISHOT (1U << 1) > > #define IORING_RECVSEND_FIXED_BUF (1U << 2) > > #define IORING_SEND_ZC_REPORT_USAGE (1U << 3) > > +#define IORING_SEND_ZC_COMBINE_CQE (1U << 4) > > > > /* > > * cqe.res for IORING_CQE_F_NOTIF if > > diff --git a/io_uring/net.c b/io_uring/net.c > > index a74287692071..052f030ab8f8 100644 > > --- a/io_uring/net.c > > +++ b/io_uring/net.c > > @@ -992,7 +992,19 @@ void io_send_zc_cleanup(struct io_kiocb *req) > > } > > } > > > > -#define IO_ZC_FLAGS_COMMON (IORING_RECVSEND_POLL_FIRST | IORING_RECVSEND_FIXED_BUF) > > +static inline void io_sendzc_adjust_res(struct io_kiocb *req) > > +{ > > + struct io_sr_msg *sr = io_kiocb_to_cmd(req, struct io_sr_msg); > > + > > + if (sr->flags & IORING_SEND_ZC_COMBINE_CQE) { > > + sr->notif->cqe.res = req->cqe.res; > > + req->flags |= REQ_F_CQE_SKIP; > > + } > > +} > > + > > +#define IO_ZC_FLAGS_COMMON (IORING_RECVSEND_POLL_FIRST | \ > > + IORING_RECVSEND_FIXED_BUF | \ > > + IORING_SEND_ZC_COMBINE_CQE) > > #define IO_ZC_FLAGS_VALID (IO_ZC_FLAGS_COMMON | IORING_SEND_ZC_REPORT_USAGE) > > > > int io_send_zc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) > > @@ -1022,6 +1034,8 @@ int io_send_zc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) > > if (zc->flags & ~IO_ZC_FLAGS_VALID) > > return -EINVAL; > > if (zc->flags & IORING_SEND_ZC_REPORT_USAGE) { > > + if (zc->flags & IORING_SEND_ZC_COMBINE_CQE) > > + return -EINVAL; > > io_notif_set_extended(notif); > > io_notif_to_data(notif)->zc_report = true; > > } > > @@ -1197,6 +1211,9 @@ int io_send_zc(struct io_kiocb *req, unsigned int issue_flags) > > else if (zc->done_io) > > ret = zc->done_io; > > > > + io_req_set_res(req, ret, IORING_CQE_F_MORE); > > + io_sendzc_adjust_res(req); > > + > > /* > > * If we're in io-wq we can't rely on tw ordering guarantees, defer > > * flushing notif to io_send_zc_cleanup() > > @@ -1205,7 +1222,6 @@ int io_send_zc(struct io_kiocb *req, unsigned int issue_flags) > > io_notif_flush(zc->notif); > > io_req_msg_cleanup(req, 0); > > } > > - io_req_set_res(req, ret, IORING_CQE_F_MORE); > > return IOU_OK; > > } > > > > > else if (sr->done_io) > > ret = sr->done_io; > > > > + io_req_set_res(req, ret, IORING_CQE_F_MORE); > > + io_sendzc_adjust_res(req); > > + > > /* > > * If we're in io-wq we can't rely on tw ordering guarantees, defer > > * flushing notif to io_send_zc_cleanup() > > @@ -1266,7 +1285,6 @@ int io_sendmsg_zc(struct io_kiocb *req, unsigned int issue_flags) > > io_notif_flush(sr->notif); > > io_req_msg_cleanup(req, 0); > > } > > - io_req_set_res(req, ret, IORING_CQE_F_MORE); > > return IOU_OK; > > } > > > > @@ -1278,8 +1296,10 @@ void io_sendrecv_fail(struct io_kiocb *req) > > req->cqe.res = sr->done_io; > > > > if ((req->flags & REQ_F_NEED_CLEANUP) && > > - (req->opcode == IORING_OP_SEND_ZC || req->opcode == IORING_OP_SENDMSG_ZC)) > > + (req->opcode == IORING_OP_SEND_ZC || req->opcode == IORING_OP_SENDMSG_ZC)) { > > req->cqe.flags |= IORING_CQE_F_MORE; > > + io_sendzc_adjust_res(req); > > + } > > } > > > > int io_accept_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) > > > > > > -- > > Pavel Begunkov > > ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/3] io_uring: Add REQ_F_CQE_SKIP support for io_uring zerocopy 2024-04-07 19:14 ` Oliver Crumrine @ 2024-04-07 23:46 ` Pavel Begunkov 2024-04-09 1:33 ` Oliver Crumrine 0 siblings, 1 reply; 17+ messages in thread From: Pavel Begunkov @ 2024-04-07 23:46 UTC (permalink / raw) To: Oliver Crumrine, axboe; +Cc: io-uring, linux-kernel On 4/7/24 20:14, Oliver Crumrine wrote: > Oliver Crumrine wrote: >> Pavel Begunkov wrote: >>> On 4/5/24 21:04, Oliver Crumrine wrote: >>>> Pavel Begunkov wrote: >>>>> On 4/4/24 23:17, Oliver Crumrine wrote: >>>>>> In his patch to enable zerocopy networking for io_uring, Pavel Begunkov >>>>>> specifically disabled REQ_F_CQE_SKIP, as (at least from my >>>>>> understanding) the userspace program wouldn't receive the >>>>>> IORING_CQE_F_MORE flag in the result value. >>>>> >>>>> No. IORING_CQE_F_MORE means there will be another CQE from this >>>>> request, so a single CQE without IORING_CQE_F_MORE is trivially >>>>> fine. >>>>> >>>>> The problem is the semantics, because by suppressing the first >>>>> CQE you're loosing the result value. You might rely on WAITALL >>>> That's already happening with io_send. >>> >>> Right, and it's still annoying and hard to use >> Another solution might be something where there is a counter that stores >> how many CQEs with REQ_F_CQE_SKIP have been processed. Before exiting, >> userspace could call a function like: io_wait_completions(int completions) >> which would wait until everything is done, and then userspace could peek >> the completion ring. >>> >>>>> as other sends and "fail" (in terms of io_uring) the request >>>>> in case of a partial send posting 2 CQEs, but that's not a great >>>>> way and it's getting userspace complicated pretty easily. >>>>> >>>>> In short, it was left out for later because there is a >>>>> better way to implement it, but it should be done carefully >>>> Maybe we could put the return values in the notifs? That would be a >>>> discrepancy between io_send and io_send_zc, though. >>> >>> Yes. And yes, having a custom flavour is not good. It'd only >>> be well usable if apart from returning the actual result >>> it also guarantees there will be one and only one CQE, then >>> the userspace doesn't have to do the dancing with counting >>> and checking F_MORE. In fact, I outlined before how a generic >>> solution may looks like: >>> >>> https://github.com/axboe/liburing/issues/824 >>> >>> The only interesting part, IMHO, is to be able to merge the >>> main completion with its notification. Below is an old stash >>> rebased onto for-6.10. The only thing missing is relinking, >>> but maybe we don't even care about it. I need to cover it >>> well with tests. >> The patch looks pretty good. The only potential issue is that you store >> the res of the normal CQE into the notif CQE. This overwrites the >> IORING_CQE_F_NOTIF with IORING_CQE_F_MORE. This means that the notif would >> indicate to userspace that there will be another CQE, of which there >> won't. > I was wrong here; Mixed up flags and result value. Right, it's fine. And it's synchronised by the ubuf refcounting, though it might get more complicated if I'd try out some counting optimisations. FWIW, it shouldn't give any performance wins. The heavy stuff is notifications waking the task, which is still there. I can even imagine that having separate CQEs might be more flexible and would allow more efficient CQ batching. -- Pavel Begunkov ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/3] io_uring: Add REQ_F_CQE_SKIP support for io_uring zerocopy 2024-04-07 23:46 ` Pavel Begunkov @ 2024-04-09 1:33 ` Oliver Crumrine 2024-04-10 12:05 ` Pavel Begunkov 0 siblings, 1 reply; 17+ messages in thread From: Oliver Crumrine @ 2024-04-09 1:33 UTC (permalink / raw) To: Pavel Begunkov, Oliver Crumrine, axboe; +Cc: io-uring, linux-kernel Pavel Begunkov wrote: > On 4/7/24 20:14, Oliver Crumrine wrote: > > Oliver Crumrine wrote: > >> Pavel Begunkov wrote: > >>> On 4/5/24 21:04, Oliver Crumrine wrote: > >>>> Pavel Begunkov wrote: > >>>>> On 4/4/24 23:17, Oliver Crumrine wrote: > >>>>>> In his patch to enable zerocopy networking for io_uring, Pavel Begunkov > >>>>>> specifically disabled REQ_F_CQE_SKIP, as (at least from my > >>>>>> understanding) the userspace program wouldn't receive the > >>>>>> IORING_CQE_F_MORE flag in the result value. > >>>>> > >>>>> No. IORING_CQE_F_MORE means there will be another CQE from this > >>>>> request, so a single CQE without IORING_CQE_F_MORE is trivially > >>>>> fine. > >>>>> > >>>>> The problem is the semantics, because by suppressing the first > >>>>> CQE you're loosing the result value. You might rely on WAITALL > >>>> That's already happening with io_send. > >>> > >>> Right, and it's still annoying and hard to use > >> Another solution might be something where there is a counter that stores > >> how many CQEs with REQ_F_CQE_SKIP have been processed. Before exiting, > >> userspace could call a function like: io_wait_completions(int completions) > >> which would wait until everything is done, and then userspace could peek > >> the completion ring. > >>> > >>>>> as other sends and "fail" (in terms of io_uring) the request > >>>>> in case of a partial send posting 2 CQEs, but that's not a great > >>>>> way and it's getting userspace complicated pretty easily. > >>>>> > >>>>> In short, it was left out for later because there is a > >>>>> better way to implement it, but it should be done carefully > >>>> Maybe we could put the return values in the notifs? That would be a > >>>> discrepancy between io_send and io_send_zc, though. > >>> > >>> Yes. And yes, having a custom flavour is not good. It'd only > >>> be well usable if apart from returning the actual result > >>> it also guarantees there will be one and only one CQE, then > >>> the userspace doesn't have to do the dancing with counting > >>> and checking F_MORE. In fact, I outlined before how a generic > >>> solution may looks like: > >>> > >>> https://github.com/axboe/liburing/issues/824 > >>> > >>> The only interesting part, IMHO, is to be able to merge the > >>> main completion with its notification. Below is an old stash > >>> rebased onto for-6.10. The only thing missing is relinking, > >>> but maybe we don't even care about it. I need to cover it > >>> well with tests. > >> The patch looks pretty good. The only potential issue is that you store > >> the res of the normal CQE into the notif CQE. This overwrites the > >> IORING_CQE_F_NOTIF with IORING_CQE_F_MORE. This means that the notif would > >> indicate to userspace that there will be another CQE, of which there > >> won't. > > I was wrong here; Mixed up flags and result value. > > Right, it's fine. And it's synchronised by the ubuf refcounting, > though it might get more complicated if I'd try out some counting > optimisations. > > FWIW, it shouldn't give any performance wins. The heavy stuff is > notifications waking the task, which is still there. I can even > imagine that having separate CQEs might be more flexible and would > allow more efficient CQ batching. I've actaully been working on this issue for a little while now. My current idea is that an id is put into the optval section of the SQE, and then it can be used to tag that req with a certain group. When a req has a flag set on it, it can request for all of group's notifs to be "flushed" in one notif that encompasses that entire group. If the id is zero, it won't be associated with a group and will generate a notif. LMK if you see anything in here that could overcomplicate userspace. I think it's pretty simple, but you've had a crack at this before so I'd like to hear your opinion. > > -- > Pavel Begunkov ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/3] io_uring: Add REQ_F_CQE_SKIP support for io_uring zerocopy 2024-04-09 1:33 ` Oliver Crumrine @ 2024-04-10 12:05 ` Pavel Begunkov 2024-04-11 0:52 ` Oliver Crumrine 0 siblings, 1 reply; 17+ messages in thread From: Pavel Begunkov @ 2024-04-10 12:05 UTC (permalink / raw) To: Oliver Crumrine, axboe; +Cc: io-uring, linux-kernel On 4/9/24 02:33, Oliver Crumrine wrote: > Pavel Begunkov wrote: >> On 4/7/24 20:14, Oliver Crumrine wrote: >>> Oliver Crumrine wrote: >>>> Pavel Begunkov wrote: >>>>> On 4/5/24 21:04, Oliver Crumrine wrote: >>>>>> Pavel Begunkov wrote: >>>>>>> On 4/4/24 23:17, Oliver Crumrine wrote: >>>>>>>> In his patch to enable zerocopy networking for io_uring, Pavel Begunkov >>>>>>>> specifically disabled REQ_F_CQE_SKIP, as (at least from my >>>>>>>> understanding) the userspace program wouldn't receive the >>>>>>>> IORING_CQE_F_MORE flag in the result value. >>>>>>> >>>>>>> No. IORING_CQE_F_MORE means there will be another CQE from this >>>>>>> request, so a single CQE without IORING_CQE_F_MORE is trivially >>>>>>> fine. >>>>>>> >>>>>>> The problem is the semantics, because by suppressing the first >>>>>>> CQE you're loosing the result value. You might rely on WAITALL >>>>>> That's already happening with io_send. >>>>> >>>>> Right, and it's still annoying and hard to use >>>> Another solution might be something where there is a counter that stores >>>> how many CQEs with REQ_F_CQE_SKIP have been processed. Before exiting, >>>> userspace could call a function like: io_wait_completions(int completions) >>>> which would wait until everything is done, and then userspace could peek >>>> the completion ring. >>>>> >>>>>>> as other sends and "fail" (in terms of io_uring) the request >>>>>>> in case of a partial send posting 2 CQEs, but that's not a great >>>>>>> way and it's getting userspace complicated pretty easily. >>>>>>> >>>>>>> In short, it was left out for later because there is a >>>>>>> better way to implement it, but it should be done carefully >>>>>> Maybe we could put the return values in the notifs? That would be a >>>>>> discrepancy between io_send and io_send_zc, though. >>>>> >>>>> Yes. And yes, having a custom flavour is not good. It'd only >>>>> be well usable if apart from returning the actual result >>>>> it also guarantees there will be one and only one CQE, then >>>>> the userspace doesn't have to do the dancing with counting >>>>> and checking F_MORE. In fact, I outlined before how a generic >>>>> solution may looks like: >>>>> >>>>> https://github.com/axboe/liburing/issues/824 >>>>> >>>>> The only interesting part, IMHO, is to be able to merge the >>>>> main completion with its notification. Below is an old stash >>>>> rebased onto for-6.10. The only thing missing is relinking, >>>>> but maybe we don't even care about it. I need to cover it >>>>> well with tests. >>>> The patch looks pretty good. The only potential issue is that you store >>>> the res of the normal CQE into the notif CQE. This overwrites the >>>> IORING_CQE_F_NOTIF with IORING_CQE_F_MORE. This means that the notif would >>>> indicate to userspace that there will be another CQE, of which there >>>> won't. >>> I was wrong here; Mixed up flags and result value. >> >> Right, it's fine. And it's synchronised by the ubuf refcounting, >> though it might get more complicated if I'd try out some counting >> optimisations. >> >> FWIW, it shouldn't give any performance wins. The heavy stuff is >> notifications waking the task, which is still there. I can even >> imagine that having separate CQEs might be more flexible and would >> allow more efficient CQ batching. > I've actaully been working on this issue for a little while now. My current > idea is that an id is put into the optval section of the SQE, and then it > can be used to tag that req with a certain group. When a req has a flag > set on it, it can request for all of group's notifs to be "flushed" in one > notif that encompasses that entire group. If the id is zero, it won't be > associated with a group and will generate a notif. LMK if you see anything > in here that could overcomplicate userspace. I think it's pretty simple, > but you've had a crack at this before so I'd like to hear your opinion. You can take a look at early versions of the IORING_OP_SEND_ZC, e.g. patchset v1, probably even later ones. It was basically doing what you've described with minor uapi changes, like you had to declare groups (slots) in advance, i.e. register them. More flexible and so performant in some circumstances, but the overall feedback from people trying it is that it's complicated. The user should allocate group ids, track bound requests / buffers, do other management. The next question is how the user should decide what bind to what. There is some nastiness in using the same group for multiple sockets, and then what's the cut line to flush the previous notif? I probably forgot a couple more complaints. TL;DR; The performance is a bit of a longer story, problems are mostly coming from the async nature of io_uring, and it'd be nice to solve at least a part of it generically, not only for sendzc. The expensive stuff is waking up the task, it's not unique to notifications, recv will trigger it with polling as well as other opcodes. Then the key is completion batching. What's interesting, take for example some tx only toy benchmark with DEFER_TASKRUN (recommended to use in any case). If you always wait for sends without notifications and add eventual *_get_events(), that would completely avoid the wake up overhead if there are enough buffers, and if it's not it can 1:1 replace tx polling. Try groups, see if numbers are good. And a heads up, I'm looking at improving it a little bit for TCP because of a report, not changing uapi but might change performance math. -- Pavel Begunkov ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/3] io_uring: Add REQ_F_CQE_SKIP support for io_uring zerocopy 2024-04-10 12:05 ` Pavel Begunkov @ 2024-04-11 0:52 ` Oliver Crumrine 2024-04-12 13:20 ` Pavel Begunkov 0 siblings, 1 reply; 17+ messages in thread From: Oliver Crumrine @ 2024-04-11 0:52 UTC (permalink / raw) To: Pavel Begunkov, Oliver Crumrine, axboe; +Cc: io-uring, linux-kernel Pavel Begunkov wrote: > On 4/9/24 02:33, Oliver Crumrine wrote: > > Pavel Begunkov wrote: > >> On 4/7/24 20:14, Oliver Crumrine wrote: > >>> Oliver Crumrine wrote: > >>>> Pavel Begunkov wrote: > >>>>> On 4/5/24 21:04, Oliver Crumrine wrote: > >>>>>> Pavel Begunkov wrote: > >>>>>>> On 4/4/24 23:17, Oliver Crumrine wrote: > >>>>>>>> In his patch to enable zerocopy networking for io_uring, Pavel Begunkov > >>>>>>>> specifically disabled REQ_F_CQE_SKIP, as (at least from my > >>>>>>>> understanding) the userspace program wouldn't receive the > >>>>>>>> IORING_CQE_F_MORE flag in the result value. > >>>>>>> > >>>>>>> No. IORING_CQE_F_MORE means there will be another CQE from this > >>>>>>> request, so a single CQE without IORING_CQE_F_MORE is trivially > >>>>>>> fine. > >>>>>>> > >>>>>>> The problem is the semantics, because by suppressing the first > >>>>>>> CQE you're loosing the result value. You might rely on WAITALL > >>>>>> That's already happening with io_send. > >>>>> > >>>>> Right, and it's still annoying and hard to use > >>>> Another solution might be something where there is a counter that stores > >>>> how many CQEs with REQ_F_CQE_SKIP have been processed. Before exiting, > >>>> userspace could call a function like: io_wait_completions(int completions) > >>>> which would wait until everything is done, and then userspace could peek > >>>> the completion ring. > >>>>> > >>>>>>> as other sends and "fail" (in terms of io_uring) the request > >>>>>>> in case of a partial send posting 2 CQEs, but that's not a great > >>>>>>> way and it's getting userspace complicated pretty easily. > >>>>>>> > >>>>>>> In short, it was left out for later because there is a > >>>>>>> better way to implement it, but it should be done carefully > >>>>>> Maybe we could put the return values in the notifs? That would be a > >>>>>> discrepancy between io_send and io_send_zc, though. > >>>>> > >>>>> Yes. And yes, having a custom flavour is not good. It'd only > >>>>> be well usable if apart from returning the actual result > >>>>> it also guarantees there will be one and only one CQE, then > >>>>> the userspace doesn't have to do the dancing with counting > >>>>> and checking F_MORE. In fact, I outlined before how a generic > >>>>> solution may looks like: > >>>>> > >>>>> https://github.com/axboe/liburing/issues/824 > >>>>> > >>>>> The only interesting part, IMHO, is to be able to merge the > >>>>> main completion with its notification. Below is an old stash > >>>>> rebased onto for-6.10. The only thing missing is relinking, > >>>>> but maybe we don't even care about it. I need to cover it > >>>>> well with tests. > >>>> The patch looks pretty good. The only potential issue is that you store > >>>> the res of the normal CQE into the notif CQE. This overwrites the > >>>> IORING_CQE_F_NOTIF with IORING_CQE_F_MORE. This means that the notif would > >>>> indicate to userspace that there will be another CQE, of which there > >>>> won't. > >>> I was wrong here; Mixed up flags and result value. > >> > >> Right, it's fine. And it's synchronised by the ubuf refcounting, > >> though it might get more complicated if I'd try out some counting > >> optimisations. > >> > >> FWIW, it shouldn't give any performance wins. The heavy stuff is > >> notifications waking the task, which is still there. I can even > >> imagine that having separate CQEs might be more flexible and would > >> allow more efficient CQ batching. > > I've actaully been working on this issue for a little while now. My current > > idea is that an id is put into the optval section of the SQE, and then it > > can be used to tag that req with a certain group. When a req has a flag > > set on it, it can request for all of group's notifs to be "flushed" in one > > notif that encompasses that entire group. If the id is zero, it won't be > > associated with a group and will generate a notif. LMK if you see anything > > in here that could overcomplicate userspace. I think it's pretty simple, > > but you've had a crack at this before so I'd like to hear your opinion. > > You can take a look at early versions of the IORING_OP_SEND_ZC, e.g. > patchset v1, probably even later ones. It was basically doing what > you've described with minor uapi changes, like you had to declare groups > (slots) in advance, i.e. register them. My idea is that insead of allocating slots before making requests, "slots" will be allocated as the group ids show up. Instead of an array of slots, a linked list can be used so things can be kmalloc'ed on the fly to make the uapi simpler. > > More flexible and so performant in some circumstances, but the overall > feedback from people trying it is that it's complicated. The user should > allocate group ids, track bound requests / buffers, do other management. > The next question is how the user should decide what bind to what. There > is some nastiness in using the same group for multiple sockets, and then Then maybe we find a way to prevent multiple sockets on one group. > what's the cut line to flush the previous notif? I probably forgot a I'd make it the max for a u32 -- I'm (probably) going to use an atomic_t to store the counter of how many reqs have been completed, so a u32 max would make sense. > couple more complaints. > > TL;DR; > > The performance is a bit of a longer story, problems are mostly coming > from the async nature of io_uring, and it'd be nice to solve at least a > part of it generically, not only for sendzc. The expensive stuff is > waking up the task, it's not unique to notifications, recv will trigger > it with polling as well as other opcodes. Then the key is completion > batching. Maybe the interface is made for sendzc first, and people could test it there. Then if it is considered beneficial to other places, it could be implemented there. > > What's interesting, take for example some tx only toy benchmark with > DEFER_TASKRUN (recommended to use in any case). If you always wait for > sends without notifications and add eventual *_get_events(), that would > completely avoid the wake up overhead if there are enough buffers, > and if it's not it can 1:1 replace tx polling. Seems like an interesting way to eliminate waiting overhead. > > Try groups, see if numbers are good. And a heads up, I'm looking at I will. Working hard to have the code done by Sunday. > improving it a little bit for TCP because of a report, not changing > uapi but might change performance math. > > -- > Pavel Begunkov ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/3] io_uring: Add REQ_F_CQE_SKIP support for io_uring zerocopy 2024-04-11 0:52 ` Oliver Crumrine @ 2024-04-12 13:20 ` Pavel Begunkov 2024-04-15 23:51 ` Oliver Crumrine 0 siblings, 1 reply; 17+ messages in thread From: Pavel Begunkov @ 2024-04-12 13:20 UTC (permalink / raw) To: Oliver Crumrine, axboe; +Cc: io-uring, linux-kernel On 4/11/24 01:52, Oliver Crumrine wrote: > Pavel Begunkov wrote: >> On 4/9/24 02:33, Oliver Crumrine wrote: >>> Pavel Begunkov wrote: >>>> On 4/7/24 20:14, Oliver Crumrine wrote: >>>>> Oliver Crumrine wrote: >>>>>> Pavel Begunkov wrote: >>>>>>> On 4/5/24 21:04, Oliver Crumrine wrote: >>>>>>>> Pavel Begunkov wrote: >>>>>>>>> On 4/4/24 23:17, Oliver Crumrine wrote: >>>>>>>>>> In his patch to enable zerocopy networking for io_uring, Pavel Begunkov >>>>>>>>>> specifically disabled REQ_F_CQE_SKIP, as (at least from my >>>>>>>>>> understanding) the userspace program wouldn't receive the >>>>>>>>>> IORING_CQE_F_MORE flag in the result value. >>>>>>>>> >>>>>>>>> No. IORING_CQE_F_MORE means there will be another CQE from this >>>>>>>>> request, so a single CQE without IORING_CQE_F_MORE is trivially >>>>>>>>> fine. >>>>>>>>> >>>>>>>>> The problem is the semantics, because by suppressing the first >>>>>>>>> CQE you're loosing the result value. You might rely on WAITALL >>>>>>>> That's already happening with io_send. >>>>>>> >>>>>>> Right, and it's still annoying and hard to use >>>>>> Another solution might be something where there is a counter that stores >>>>>> how many CQEs with REQ_F_CQE_SKIP have been processed. Before exiting, >>>>>> userspace could call a function like: io_wait_completions(int completions) >>>>>> which would wait until everything is done, and then userspace could peek >>>>>> the completion ring. >>>>>>> >>>>>>>>> as other sends and "fail" (in terms of io_uring) the request >>>>>>>>> in case of a partial send posting 2 CQEs, but that's not a great >>>>>>>>> way and it's getting userspace complicated pretty easily. >>>>>>>>> >>>>>>>>> In short, it was left out for later because there is a >>>>>>>>> better way to implement it, but it should be done carefully >>>>>>>> Maybe we could put the return values in the notifs? That would be a >>>>>>>> discrepancy between io_send and io_send_zc, though. >>>>>>> >>>>>>> Yes. And yes, having a custom flavour is not good. It'd only >>>>>>> be well usable if apart from returning the actual result >>>>>>> it also guarantees there will be one and only one CQE, then >>>>>>> the userspace doesn't have to do the dancing with counting >>>>>>> and checking F_MORE. In fact, I outlined before how a generic >>>>>>> solution may looks like: >>>>>>> >>>>>>> https://github.com/axboe/liburing/issues/824 >>>>>>> >>>>>>> The only interesting part, IMHO, is to be able to merge the >>>>>>> main completion with its notification. Below is an old stash >>>>>>> rebased onto for-6.10. The only thing missing is relinking, >>>>>>> but maybe we don't even care about it. I need to cover it >>>>>>> well with tests. >>>>>> The patch looks pretty good. The only potential issue is that you store >>>>>> the res of the normal CQE into the notif CQE. This overwrites the >>>>>> IORING_CQE_F_NOTIF with IORING_CQE_F_MORE. This means that the notif would >>>>>> indicate to userspace that there will be another CQE, of which there >>>>>> won't. >>>>> I was wrong here; Mixed up flags and result value. >>>> >>>> Right, it's fine. And it's synchronised by the ubuf refcounting, >>>> though it might get more complicated if I'd try out some counting >>>> optimisations. >>>> >>>> FWIW, it shouldn't give any performance wins. The heavy stuff is >>>> notifications waking the task, which is still there. I can even >>>> imagine that having separate CQEs might be more flexible and would >>>> allow more efficient CQ batching. >>> I've actaully been working on this issue for a little while now. My current >>> idea is that an id is put into the optval section of the SQE, and then it >>> can be used to tag that req with a certain group. When a req has a flag >>> set on it, it can request for all of group's notifs to be "flushed" in one >>> notif that encompasses that entire group. If the id is zero, it won't be >>> associated with a group and will generate a notif. LMK if you see anything >>> in here that could overcomplicate userspace. I think it's pretty simple, >>> but you've had a crack at this before so I'd like to hear your opinion. >> >> You can take a look at early versions of the IORING_OP_SEND_ZC, e.g. >> patchset v1, probably even later ones. It was basically doing what >> you've described with minor uapi changes, like you had to declare groups >> (slots) in advance, i.e. register them. > My idea is that insead of allocating slots before making requests, "slots" > will be allocated as the group ids show up. Instead of an array of slots, a > linked list can be used so things can be kmalloc'ed on the fly to make > the uapi simpler. >> >> More flexible and so performant in some circumstances, but the overall >> feedback from people trying it is that it's complicated. The user should >> allocate group ids, track bound requests / buffers, do other management. >> The next question is how the user should decide what bind to what. There >> is some nastiness in using the same group for multiple sockets, and then > Then maybe we find a way to prevent multiple sockets on one group. You don't have to explicitly prevent it unless there are other reasons, it's just not given a real app would be able to use it this way. >> what's the cut line to flush the previous notif? I probably forgot a > I'd make it the max for a u32 -- I'm (probably) going to use an atomic_t > to store the counter of how many reqs have been completed, so a u32 max > would make sense. To be clear, the question raised is entirely for userspace to decide if we're talking about the design when the user has to flush a group notificaiton via flag or so. Atomics or not is a performance side, that's separate. >> couple more complaints. >> >> TL;DR; >> >> The performance is a bit of a longer story, problems are mostly coming >> from the async nature of io_uring, and it'd be nice to solve at least a >> part of it generically, not only for sendzc. The expensive stuff is >> waking up the task, it's not unique to notifications, recv will trigger >> it with polling as well as other opcodes. Then the key is completion >> batching. > Maybe the interface is made for sendzc first, and people could test it > there. Then if it is considered beneficial to other places, it could be > implemented there. >> >> What's interesting, take for example some tx only toy benchmark with >> DEFER_TASKRUN (recommended to use in any case). If you always wait for >> sends without notifications and add eventual *_get_events(), that would >> completely avoid the wake up overhead if there are enough buffers, >> and if it's not it can 1:1 replace tx polling. > Seems like an interesting way to eliminate waiting overhead. >> >> Try groups, see if numbers are good. And a heads up, I'm looking at > I will. Working hard to have the code done by Sunday. Good, and here is the patchset I mentioned: https://lore.kernel.org/io-uring/[email protected]/T/ >> improving it a little bit for TCP because of a report, not changing >> uapi but might change performance math. -- Pavel Begunkov ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/3] io_uring: Add REQ_F_CQE_SKIP support for io_uring zerocopy 2024-04-12 13:20 ` Pavel Begunkov @ 2024-04-15 23:51 ` Oliver Crumrine 0 siblings, 0 replies; 17+ messages in thread From: Oliver Crumrine @ 2024-04-15 23:51 UTC (permalink / raw) To: Pavel Begunkov, Oliver Crumrine, axboe; +Cc: io-uring, linux-kernel Pavel Begunkov wrote: > On 4/11/24 01:52, Oliver Crumrine wrote: > > Pavel Begunkov wrote: > >> On 4/9/24 02:33, Oliver Crumrine wrote: > >>> Pavel Begunkov wrote: > >>>> On 4/7/24 20:14, Oliver Crumrine wrote: > >>>>> Oliver Crumrine wrote: > >>>>>> Pavel Begunkov wrote: > >>>>>>> On 4/5/24 21:04, Oliver Crumrine wrote: > >>>>>>>> Pavel Begunkov wrote: > >>>>>>>>> On 4/4/24 23:17, Oliver Crumrine wrote: > >>>>>>>>>> In his patch to enable zerocopy networking for io_uring, Pavel Begunkov > >>>>>>>>>> specifically disabled REQ_F_CQE_SKIP, as (at least from my > >>>>>>>>>> understanding) the userspace program wouldn't receive the > >>>>>>>>>> IORING_CQE_F_MORE flag in the result value. > >>>>>>>>> > >>>>>>>>> No. IORING_CQE_F_MORE means there will be another CQE from this > >>>>>>>>> request, so a single CQE without IORING_CQE_F_MORE is trivially > >>>>>>>>> fine. > >>>>>>>>> > >>>>>>>>> The problem is the semantics, because by suppressing the first > >>>>>>>>> CQE you're loosing the result value. You might rely on WAITALL > >>>>>>>> That's already happening with io_send. > >>>>>>> > >>>>>>> Right, and it's still annoying and hard to use > >>>>>> Another solution might be something where there is a counter that stores > >>>>>> how many CQEs with REQ_F_CQE_SKIP have been processed. Before exiting, > >>>>>> userspace could call a function like: io_wait_completions(int completions) > >>>>>> which would wait until everything is done, and then userspace could peek > >>>>>> the completion ring. > >>>>>>> > >>>>>>>>> as other sends and "fail" (in terms of io_uring) the request > >>>>>>>>> in case of a partial send posting 2 CQEs, but that's not a great > >>>>>>>>> way and it's getting userspace complicated pretty easily. > >>>>>>>>> > >>>>>>>>> In short, it was left out for later because there is a > >>>>>>>>> better way to implement it, but it should be done carefully > >>>>>>>> Maybe we could put the return values in the notifs? That would be a > >>>>>>>> discrepancy between io_send and io_send_zc, though. > >>>>>>> > >>>>>>> Yes. And yes, having a custom flavour is not good. It'd only > >>>>>>> be well usable if apart from returning the actual result > >>>>>>> it also guarantees there will be one and only one CQE, then > >>>>>>> the userspace doesn't have to do the dancing with counting > >>>>>>> and checking F_MORE. In fact, I outlined before how a generic > >>>>>>> solution may looks like: > >>>>>>> > >>>>>>> https://github.com/axboe/liburing/issues/824 > >>>>>>> > >>>>>>> The only interesting part, IMHO, is to be able to merge the > >>>>>>> main completion with its notification. Below is an old stash > >>>>>>> rebased onto for-6.10. The only thing missing is relinking, > >>>>>>> but maybe we don't even care about it. I need to cover it > >>>>>>> well with tests. > >>>>>> The patch looks pretty good. The only potential issue is that you store > >>>>>> the res of the normal CQE into the notif CQE. This overwrites the > >>>>>> IORING_CQE_F_NOTIF with IORING_CQE_F_MORE. This means that the notif would > >>>>>> indicate to userspace that there will be another CQE, of which there > >>>>>> won't. > >>>>> I was wrong here; Mixed up flags and result value. > >>>> > >>>> Right, it's fine. And it's synchronised by the ubuf refcounting, > >>>> though it might get more complicated if I'd try out some counting > >>>> optimisations. > >>>> > >>>> FWIW, it shouldn't give any performance wins. The heavy stuff is > >>>> notifications waking the task, which is still there. I can even > >>>> imagine that having separate CQEs might be more flexible and would > >>>> allow more efficient CQ batching. > >>> I've actaully been working on this issue for a little while now. My current > >>> idea is that an id is put into the optval section of the SQE, and then it > >>> can be used to tag that req with a certain group. When a req has a flag > >>> set on it, it can request for all of group's notifs to be "flushed" in one > >>> notif that encompasses that entire group. If the id is zero, it won't be > >>> associated with a group and will generate a notif. LMK if you see anything > >>> in here that could overcomplicate userspace. I think it's pretty simple, > >>> but you've had a crack at this before so I'd like to hear your opinion. > >> > >> You can take a look at early versions of the IORING_OP_SEND_ZC, e.g. > >> patchset v1, probably even later ones. It was basically doing what > >> you've described with minor uapi changes, like you had to declare groups > >> (slots) in advance, i.e. register them. > > My idea is that insead of allocating slots before making requests, "slots" > > will be allocated as the group ids show up. Instead of an array of slots, a > > linked list can be used so things can be kmalloc'ed on the fly to make > > the uapi simpler. > >> > >> More flexible and so performant in some circumstances, but the overall > >> feedback from people trying it is that it's complicated. The user should > >> allocate group ids, track bound requests / buffers, do other management. > >> The next question is how the user should decide what bind to what. There > >> is some nastiness in using the same group for multiple sockets, and then > > Then maybe we find a way to prevent multiple sockets on one group. > > You don't have to explicitly prevent it unless there are other reasons, > it's just not given a real app would be able to use it this way. > > >> what's the cut line to flush the previous notif? I probably forgot a > > I'd make it the max for a u32 -- I'm (probably) going to use an atomic_t > > to store the counter of how many reqs have been completed, so a u32 max > > would make sense. > > To be clear, the question raised is entirely for userspace to decide > if we're talking about the design when the user has to flush a group > notificaiton via flag or so. Atomics or not is a performance side, > that's separate. > > >> couple more complaints. > >> > >> TL;DR; > >> > >> The performance is a bit of a longer story, problems are mostly coming > >> from the async nature of io_uring, and it'd be nice to solve at least a > >> part of it generically, not only for sendzc. The expensive stuff is > >> waking up the task, it's not unique to notifications, recv will trigger > >> it with polling as well as other opcodes. Then the key is completion > >> batching. > > Maybe the interface is made for sendzc first, and people could test it > > there. Then if it is considered beneficial to other places, it could be > > implemented there. > >> > >> What's interesting, take for example some tx only toy benchmark with > >> DEFER_TASKRUN (recommended to use in any case). If you always wait for > >> sends without notifications and add eventual *_get_events(), that would > >> completely avoid the wake up overhead if there are enough buffers, > >> and if it's not it can 1:1 replace tx polling. > > Seems like an interesting way to eliminate waiting overhead. > >> > >> Try groups, see if numbers are good. And a heads up, I'm looking at > > I will. Working hard to have the code done by Sunday. > > Good, and here is the patchset I mentioned: > > https://lore.kernel.org/io-uring/[email protected]/T/ Wow! 6x improvment is crazy. I just finished the code for notif grouping, and will be benchmarking it in the upcoming hours/days. It's still in a pre-alpha state, so I'll have to put a little more work into it. (pre-alpha means leaking memory. I have 32 gigs in my system. Assuming I don't go too crazy on the benchmarking I should be fine) Either way, my patch will need a little bit of work to be compatible with yours, as it modifies the ubuf callback, and yours does too. > > >> improving it a little bit for TCP because of a report, not changing > >> uapi but might change performance math. > > -- > Pavel Begunkov ^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH 2/3] io_uring: Add io_uring_peek_cqe to mini_liburing 2024-04-04 22:16 [PATCH 0/3] Add REQ_F_CQE_SKIP support to io_uring zerocopy Oliver Crumrine 2024-04-04 22:17 ` [PATCH 1/3] io_uring: Add REQ_F_CQE_SKIP support for " Oliver Crumrine @ 2024-04-04 22:19 ` Oliver Crumrine 2024-04-04 22:19 ` [PATCH 3/3] io_uring: Support IOSQE_CQE_SKIP_SUCCESS in io_uring zerocopy test Oliver Crumrine 2024-04-05 12:06 ` [PATCH 0/3] Add REQ_F_CQE_SKIP support to io_uring zerocopy Pavel Begunkov 3 siblings, 0 replies; 17+ messages in thread From: Oliver Crumrine @ 2024-04-04 22:19 UTC (permalink / raw) To: axboe, asml.silence, davem, edumazet, kuba, pabeni, shuah, leitao Cc: io-uring, netdev, linux-kselftest, linux-kernel mini_liburing.h is used by the io_uring tests as a stand-in for liburing. The next patch in this series needs the io_uring_peek_cqe function, and I speculate that future tests will too, so add that here. Please scrutinize this code, I am somewhat new to io_uring, and whilst I have tested this and know it to work, I very well could've made a mistake somewhere without noticing. Signed-off-by: Oliver Crumrine <[email protected]> --- tools/include/io_uring/mini_liburing.h | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+) diff --git a/tools/include/io_uring/mini_liburing.h b/tools/include/io_uring/mini_liburing.h index 9ccb16074eb5..94dcc7f9aa33 100644 --- a/tools/include/io_uring/mini_liburing.h +++ b/tools/include/io_uring/mini_liburing.h @@ -182,6 +182,24 @@ static inline int io_uring_wait_cqe(struct io_uring *ring, return 0; } +static inline int io_uring_peek_cqe(struct io_uring *ring, + struct io_uring_cqe **cqe_ptr) +{ + struct io_uring_cq *cq = &ring->cq; + const unsigned int mask = *cq->kring_mask; + unsigned int head = *cq->khead; + unsigned int tail = *cq->ktail; + + *cqe_ptr = NULL; + + read_barrier(); + if (head-tail) { + *cqe_ptr = &cq->cqes[head & mask]; + return 0; + } + return -EAGAIN; +} + static inline int io_uring_submit(struct io_uring *ring) { struct io_uring_sq *sq = &ring->sq; -- 2.44.0 ^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH 3/3] io_uring: Support IOSQE_CQE_SKIP_SUCCESS in io_uring zerocopy test 2024-04-04 22:16 [PATCH 0/3] Add REQ_F_CQE_SKIP support to io_uring zerocopy Oliver Crumrine 2024-04-04 22:17 ` [PATCH 1/3] io_uring: Add REQ_F_CQE_SKIP support for " Oliver Crumrine 2024-04-04 22:19 ` [PATCH 2/3] io_uring: Add io_uring_peek_cqe to mini_liburing Oliver Crumrine @ 2024-04-04 22:19 ` Oliver Crumrine 2024-04-06 20:33 ` Muhammad Usama Anjum 2024-04-05 12:06 ` [PATCH 0/3] Add REQ_F_CQE_SKIP support to io_uring zerocopy Pavel Begunkov 3 siblings, 1 reply; 17+ messages in thread From: Oliver Crumrine @ 2024-04-04 22:19 UTC (permalink / raw) To: axboe, asml.silence, davem, edumazet, kuba, pabeni, shuah, leitao Cc: io-uring, netdev, linux-kselftest, linux-kernel Add support for the IOSQE_CQE_SKIP_SUCCESS flag in the io_uring_zerocopy_tx test, using the "-a" option. Instead of incrementing when IORING_CQE_F_MORE is set, remember how many SQEs are sent and simply wait on notifs instead of regular completions. For non-zc stuff, there won't be notifs or completions, so don't wait on either of those, but check the completion queue for errors at the end to make sure none have popped up. The changes to the shell script run the tests both with and without the "-a" option. Signed-off-by: Oliver Crumrine <[email protected]> --- .../selftests/net/io_uring_zerocopy_tx.c | 38 +++++++++++++++++-- .../selftests/net/io_uring_zerocopy_tx.sh | 7 +++- 2 files changed, 39 insertions(+), 6 deletions(-) diff --git a/tools/testing/selftests/net/io_uring_zerocopy_tx.c b/tools/testing/selftests/net/io_uring_zerocopy_tx.c index 76e604e4810e..11a43594935f 100644 --- a/tools/testing/selftests/net/io_uring_zerocopy_tx.c +++ b/tools/testing/selftests/net/io_uring_zerocopy_tx.c @@ -50,8 +50,10 @@ enum { }; static bool cfg_cork = false; +static bool cfg_nocqe = false; static int cfg_mode = MODE_ZC_FIXED; static int cfg_nr_reqs = 8; +static int cfg_nr_completions = 8; static int cfg_family = PF_UNSPEC; static int cfg_payload_len; static int cfg_port = 8000; @@ -134,11 +136,21 @@ static void do_tx(int domain, int type, int protocol) if (mode == MODE_NONZC) { io_uring_prep_send(sqe, fd, payload, cfg_payload_len, msg_flags); + if (cfg_nocqe) { + sqe->flags |= IOSQE_CQE_SKIP_SUCCESS; + cfg_nr_completions--; + } sqe->user_data = NONZC_TAG; } else { io_uring_prep_sendzc(sqe, fd, payload, cfg_payload_len, msg_flags, zc_flags); + if (cfg_nocqe) { + sqe->flags |= IOSQE_CQE_SKIP_SUCCESS; + packets++; + compl_cqes++; + bytes += cfg_payload_len; + } if (mode == MODE_ZC_FIXED) { sqe->ioprio |= IORING_RECVSEND_FIXED_BUF; sqe->buf_index = buf_idx; @@ -153,7 +165,7 @@ static void do_tx(int domain, int type, int protocol) if (cfg_cork) do_setsockopt(fd, IPPROTO_UDP, UDP_CORK, 0); - for (i = 0; i < cfg_nr_reqs; i++) { + for (i = 0; i < cfg_nr_completions; i++) { ret = io_uring_wait_cqe(&ring, &cqe); if (ret) error(1, ret, "wait cqe"); @@ -168,7 +180,9 @@ static void do_tx(int domain, int type, int protocol) if (compl_cqes <= 0) error(1, -EINVAL, "notification mismatch"); compl_cqes--; - i--; + if (!cfg_nocqe) + i--; io_uring_cqe_seen(&ring); continue; } @@ -200,6 +214,17 @@ static void do_tx(int domain, int type, int protocol) compl_cqes--; } + /* The above code does not account for a send error when + * IOSQE_CQE_SKIP_SUCCESS is set. This is operating under the + * assumption that an error CQE will get put on the ring before + * the above code completes: + */ + while (!io_uring_peek_cqe(&ring, &cqe)) { + if (cqe->res == -EAGAIN) + continue; + error(1, -EINVAL, "send failed"); + } + fprintf(stderr, "tx=%lu (MB=%lu), tx/s=%lu (MB/s=%lu)\n", packets, bytes >> 20, packets / (cfg_runtime_ms / 1000), @@ -221,7 +246,7 @@ static void do_test(int domain, int type, int protocol) static void usage(const char *filepath) { error(1, 0, "Usage: %s (-4|-6) (udp|tcp) -D<dst_ip> [-s<payload size>] " - "[-t<time s>] [-n<batch>] [-p<port>] [-m<mode>]", filepath); + "[-t<time s>] [-n<batch>] [-p<port>] [-m<mode>] [-a]", filepath); } static void parse_opts(int argc, char **argv) @@ -239,7 +264,7 @@ static void parse_opts(int argc, char **argv) usage(argv[0]); cfg_payload_len = max_payload_len; - while ((c = getopt(argc, argv, "46D:p:s:t:n:c:m:")) != -1) { + while ((c = getopt(argc, argv, "46aD:p:s:t:n:c:m:")) != -1) { switch (c) { case '4': if (cfg_family != PF_UNSPEC) @@ -274,6 +299,9 @@ static void parse_opts(int argc, char **argv) case 'm': cfg_mode = strtol(optarg, NULL, 0); break; + case 'a': + cfg_nocqe = true; + break; } } @@ -302,6 +330,8 @@ static void parse_opts(int argc, char **argv) error(1, 0, "-s: payload exceeds max (%d)", max_payload_len); if (optind != argc - 1) usage(argv[0]); + + cfg_nr_completions = cfg_nr_reqs; } int main(int argc, char **argv) diff --git a/tools/testing/selftests/net/io_uring_zerocopy_tx.sh b/tools/testing/selftests/net/io_uring_zerocopy_tx.sh index 123439545013..aeb4645b7891 100755 --- a/tools/testing/selftests/net/io_uring_zerocopy_tx.sh +++ b/tools/testing/selftests/net/io_uring_zerocopy_tx.sh @@ -25,11 +25,14 @@ readonly path_sysctl_mem="net.core.optmem_max" # No arguments: automated test if [[ "$#" -eq "0" ]]; then IPs=( "4" "6" ) + SKIPCQEs=("" "-a") for IP in "${IPs[@]}"; do for mode in $(seq 1 3); do - $0 "$IP" udp -m "$mode" -t 1 -n 32 - $0 "$IP" tcp -m "$mode" -t 1 -n 1 + for cqe in "${SKIPCQEs[@]}"; do + $0 "$IP" udp -m "$mode" -t 1 -n 32 "$cqe" + $0 "$IP" tcp -m "$mode" -t 1 -n 1 "$cqe" + done done done -- 2.44.0 ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH 3/3] io_uring: Support IOSQE_CQE_SKIP_SUCCESS in io_uring zerocopy test 2024-04-04 22:19 ` [PATCH 3/3] io_uring: Support IOSQE_CQE_SKIP_SUCCESS in io_uring zerocopy test Oliver Crumrine @ 2024-04-06 20:33 ` Muhammad Usama Anjum 0 siblings, 0 replies; 17+ messages in thread From: Muhammad Usama Anjum @ 2024-04-06 20:33 UTC (permalink / raw) To: Oliver Crumrine, axboe, asml.silence, davem, edumazet, kuba, pabeni, shuah, leitao Cc: Muhammad Usama Anjum, io-uring, netdev, linux-kselftest, linux-kernel On 4/5/24 3:19 AM, Oliver Crumrine wrote: > Add support for the IOSQE_CQE_SKIP_SUCCESS flag in the io_uring_zerocopy_tx > test, using the "-a" option. Instead of incrementing when > IORING_CQE_F_MORE is set, remember how many SQEs are sent and simply > wait on notifs instead of regular completions. For non-zc stuff, there > won't be notifs or completions, so don't wait on either of those, but > check the completion queue for errors at the end to make sure none have > popped up. > > The changes to the shell script run the tests both with and without the > "-a" option. > > Signed-off-by: Oliver Crumrine <[email protected]> Acked-by: Muhammad Usama Anjum <[email protected]> > --- > .../selftests/net/io_uring_zerocopy_tx.c | 38 +++++++++++++++++-- > .../selftests/net/io_uring_zerocopy_tx.sh | 7 +++- > 2 files changed, 39 insertions(+), 6 deletions(-) > > diff --git a/tools/testing/selftests/net/io_uring_zerocopy_tx.c b/tools/testing/selftests/net/io_uring_zerocopy_tx.c > index 76e604e4810e..11a43594935f 100644 > --- a/tools/testing/selftests/net/io_uring_zerocopy_tx.c > +++ b/tools/testing/selftests/net/io_uring_zerocopy_tx.c > @@ -50,8 +50,10 @@ enum { > }; > > static bool cfg_cork = false; > +static bool cfg_nocqe = false; > static int cfg_mode = MODE_ZC_FIXED; > static int cfg_nr_reqs = 8; > +static int cfg_nr_completions = 8; > static int cfg_family = PF_UNSPEC; > static int cfg_payload_len; > static int cfg_port = 8000; > @@ -134,11 +136,21 @@ static void do_tx(int domain, int type, int protocol) > if (mode == MODE_NONZC) { > io_uring_prep_send(sqe, fd, payload, > cfg_payload_len, msg_flags); > + if (cfg_nocqe) { > + sqe->flags |= IOSQE_CQE_SKIP_SUCCESS; > + cfg_nr_completions--; > + } > sqe->user_data = NONZC_TAG; > } else { > io_uring_prep_sendzc(sqe, fd, payload, > cfg_payload_len, > msg_flags, zc_flags); > + if (cfg_nocqe) { > + sqe->flags |= IOSQE_CQE_SKIP_SUCCESS; > + packets++; > + compl_cqes++; > + bytes += cfg_payload_len; > + } > if (mode == MODE_ZC_FIXED) { > sqe->ioprio |= IORING_RECVSEND_FIXED_BUF; > sqe->buf_index = buf_idx; > @@ -153,7 +165,7 @@ static void do_tx(int domain, int type, int protocol) > > if (cfg_cork) > do_setsockopt(fd, IPPROTO_UDP, UDP_CORK, 0); > - for (i = 0; i < cfg_nr_reqs; i++) { > + for (i = 0; i < cfg_nr_completions; i++) { > ret = io_uring_wait_cqe(&ring, &cqe); > if (ret) > error(1, ret, "wait cqe"); > @@ -168,7 +180,9 @@ static void do_tx(int domain, int type, int protocol) > if (compl_cqes <= 0) > error(1, -EINVAL, "notification mismatch"); > compl_cqes--; > - i--; > + if (!cfg_nocqe) > + i--; > io_uring_cqe_seen(&ring); > continue; > } > @@ -200,6 +214,17 @@ static void do_tx(int domain, int type, int protocol) > compl_cqes--; > } > > + /* The above code does not account for a send error when > + * IOSQE_CQE_SKIP_SUCCESS is set. This is operating under the > + * assumption that an error CQE will get put on the ring before > + * the above code completes: > + */ > + while (!io_uring_peek_cqe(&ring, &cqe)) { > + if (cqe->res == -EAGAIN) > + continue; > + error(1, -EINVAL, "send failed"); > + } > + > fprintf(stderr, "tx=%lu (MB=%lu), tx/s=%lu (MB/s=%lu)\n", > packets, bytes >> 20, > packets / (cfg_runtime_ms / 1000), > @@ -221,7 +246,7 @@ static void do_test(int domain, int type, int protocol) > static void usage(const char *filepath) > { > error(1, 0, "Usage: %s (-4|-6) (udp|tcp) -D<dst_ip> [-s<payload size>] " > - "[-t<time s>] [-n<batch>] [-p<port>] [-m<mode>]", filepath); > + "[-t<time s>] [-n<batch>] [-p<port>] [-m<mode>] [-a]", filepath); > } > > static void parse_opts(int argc, char **argv) > @@ -239,7 +264,7 @@ static void parse_opts(int argc, char **argv) > usage(argv[0]); > cfg_payload_len = max_payload_len; > > - while ((c = getopt(argc, argv, "46D:p:s:t:n:c:m:")) != -1) { > + while ((c = getopt(argc, argv, "46aD:p:s:t:n:c:m:")) != -1) { > switch (c) { > case '4': > if (cfg_family != PF_UNSPEC) > @@ -274,6 +299,9 @@ static void parse_opts(int argc, char **argv) > case 'm': > cfg_mode = strtol(optarg, NULL, 0); > break; > + case 'a': > + cfg_nocqe = true; > + break; > } > } > > @@ -302,6 +330,8 @@ static void parse_opts(int argc, char **argv) > error(1, 0, "-s: payload exceeds max (%d)", max_payload_len); > if (optind != argc - 1) > usage(argv[0]); > + > + cfg_nr_completions = cfg_nr_reqs; > } > > int main(int argc, char **argv) > diff --git a/tools/testing/selftests/net/io_uring_zerocopy_tx.sh b/tools/testing/selftests/net/io_uring_zerocopy_tx.sh > index 123439545013..aeb4645b7891 100755 > --- a/tools/testing/selftests/net/io_uring_zerocopy_tx.sh > +++ b/tools/testing/selftests/net/io_uring_zerocopy_tx.sh > @@ -25,11 +25,14 @@ readonly path_sysctl_mem="net.core.optmem_max" > # No arguments: automated test > if [[ "$#" -eq "0" ]]; then > IPs=( "4" "6" ) > + SKIPCQEs=("" "-a") > > for IP in "${IPs[@]}"; do > for mode in $(seq 1 3); do > - $0 "$IP" udp -m "$mode" -t 1 -n 32 > - $0 "$IP" tcp -m "$mode" -t 1 -n 1 > + for cqe in "${SKIPCQEs[@]}"; do > + $0 "$IP" udp -m "$mode" -t 1 -n 32 "$cqe" > + $0 "$IP" tcp -m "$mode" -t 1 -n 1 "$cqe" > + done > done > done > -- BR, Muhammad Usama Anjum ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 0/3] Add REQ_F_CQE_SKIP support to io_uring zerocopy 2024-04-04 22:16 [PATCH 0/3] Add REQ_F_CQE_SKIP support to io_uring zerocopy Oliver Crumrine ` (2 preceding siblings ...) 2024-04-04 22:19 ` [PATCH 3/3] io_uring: Support IOSQE_CQE_SKIP_SUCCESS in io_uring zerocopy test Oliver Crumrine @ 2024-04-05 12:06 ` Pavel Begunkov 3 siblings, 0 replies; 17+ messages in thread From: Pavel Begunkov @ 2024-04-05 12:06 UTC (permalink / raw) To: Oliver Crumrine, axboe, davem, edumazet, kuba, pabeni, shuah, leitao Cc: io-uring, netdev, linux-kselftest, linux-kernel On 4/4/24 23:16, Oliver Crumrine wrote: > This patchset allows for io_uring zerocopy to support REQ_F_CQE_SKIP, > skipping the normal completion notification, but not the zerocopy buffer > release notification. It's an io_uring internal change not altering how it operates with the net layer, you don't need to CC the net list. > This patchset also includes a test to test these changes, and a patch to > mini_liburing to enable io_uring_peek_cqe, which is needed for the test. For the same reason tests should be in liburing, where all io_uring tests are, and the selftest can be dropped. See liburing/test/send-zerocopy.c > Oliver Crumrine (3): > io_uring: Add REQ_F_CQE_SKIP support for io_uring zerocopy > io_uring: Add io_uring_peek_cqe to mini_liburing > io_uring: Support IOSQE_CQE_SKIP_SUCCESS in io_uring zerocopy test > > io_uring/net.c | 6 +-- > tools/include/io_uring/mini_liburing.h | 18 +++++++++ > .../selftests/net/io_uring_zerocopy_tx.c | 37 +++++++++++++++++-- > .../selftests/net/io_uring_zerocopy_tx.sh | 7 +++- > 4 files changed, 59 insertions(+), 10 deletions(-) > -- Pavel Begunkov ^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2024-04-15 23:51 UTC | newest] Thread overview: 17+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2024-04-04 22:16 [PATCH 0/3] Add REQ_F_CQE_SKIP support to io_uring zerocopy Oliver Crumrine 2024-04-04 22:17 ` [PATCH 1/3] io_uring: Add REQ_F_CQE_SKIP support for " Oliver Crumrine 2024-04-05 13:01 ` Pavel Begunkov 2024-04-05 20:04 ` Oliver Crumrine 2024-04-06 21:23 ` Pavel Begunkov 2024-04-07 13:13 ` Oliver Crumrine 2024-04-07 19:14 ` Oliver Crumrine 2024-04-07 23:46 ` Pavel Begunkov 2024-04-09 1:33 ` Oliver Crumrine 2024-04-10 12:05 ` Pavel Begunkov 2024-04-11 0:52 ` Oliver Crumrine 2024-04-12 13:20 ` Pavel Begunkov 2024-04-15 23:51 ` Oliver Crumrine 2024-04-04 22:19 ` [PATCH 2/3] io_uring: Add io_uring_peek_cqe to mini_liburing Oliver Crumrine 2024-04-04 22:19 ` [PATCH 3/3] io_uring: Support IOSQE_CQE_SKIP_SUCCESS in io_uring zerocopy test Oliver Crumrine 2024-04-06 20:33 ` Muhammad Usama Anjum 2024-04-05 12:06 ` [PATCH 0/3] Add REQ_F_CQE_SKIP support to io_uring zerocopy Pavel Begunkov
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox