* IORING_CQE_F_COPIED
@ 2022-10-14 11:06 Stefan Metzmacher
2022-10-17 16:46 ` IORING_CQE_F_COPIED Pavel Begunkov
0 siblings, 1 reply; 25+ messages in thread
From: Stefan Metzmacher @ 2022-10-14 11:06 UTC (permalink / raw)
To: Pavel Begunkov, io-uring, Jens Axboe; +Cc: Jakub Kicinski, netdev
Hi Pavel,
In the tests I made I used this version of IORING_CQE_F_COPIED:
https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=645d3b584c417a247d92d71baa6266a5f3d0d17d
(also inlined at the end)
Would that be something we want for 6.1? (Should I post it with a useful commit message, after doing some more tests?)
metze
include/uapi/linux/io_uring.h | 1 +
io_uring/notif.c | 5 +++++
net/ipv4/ip_output.c | 3 ++-
net/ipv4/tcp.c | 2 ++
net/ipv6/ip6_output.c | 3 ++-
5 files changed, 12 insertions(+), 2 deletions(-)
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 04729989e6ee..efeab6a9b4f3 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -341,6 +341,7 @@ struct io_uring_cqe {
#define IORING_CQE_F_MORE (1U << 1)
#define IORING_CQE_F_SOCK_NONEMPTY (1U << 2)
#define IORING_CQE_F_NOTIF (1U << 3)
+#define IORING_CQE_F_COPIED (1U << 4)
enum {
IORING_CQE_BUFFER_SHIFT = 16,
diff --git a/io_uring/notif.c b/io_uring/notif.c
index e37c6569d82e..2162d1af0b60 100644
--- a/io_uring/notif.c
+++ b/io_uring/notif.c
@@ -18,6 +18,8 @@ static void __io_notif_complete_tw(struct io_kiocb *notif, bool *locked)
__io_unaccount_mem(ctx->user, nd->account_pages);
nd->account_pages = 0;
}
+ if (!nd->uarg.zerocopy)
+ notif->cqe.flags |= IORING_CQE_F_COPIED;
io_req_task_complete(notif, locked);
}
@@ -28,6 +30,8 @@ static void io_uring_tx_zerocopy_callback(struct sk_buff *skb,
struct io_notif_data *nd = container_of(uarg, struct io_notif_data, uarg);
struct io_kiocb *notif = cmd_to_io_kiocb(nd);
+ uarg->zerocopy = uarg->zerocopy & success;
+
if (refcount_dec_and_test(&uarg->refcnt)) {
notif->io_task_work.func = __io_notif_complete_tw;
io_req_task_work_add(notif);
@@ -53,6 +57,7 @@ struct io_kiocb *io_alloc_notif(struct io_ring_ctx *ctx)
nd = io_notif_to_data(notif);
nd->account_pages = 0;
+ nd->uarg.zerocopy = 1;
nd->uarg.flags = SKBFL_ZEROCOPY_FRAG | SKBFL_DONT_ORPHAN;
nd->uarg.callback = io_uring_tx_zerocopy_callback;
refcount_set(&nd->uarg.refcnt, 1);
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 04e2034f2f8e..64d263a8ece8 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1032,7 +1032,8 @@ static int __ip_append_data(struct sock *sk,
paged = true;
zc = true;
uarg = msg->msg_ubuf;
- }
+ } else
+ msg->msg_ubuf->zerocopy = 0;
} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb));
if (!uarg)
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index cdf26724d7db..d3a2ed9f22df 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1247,6 +1247,8 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
uarg = msg->msg_ubuf;
net_zcopy_get(uarg);
zc = sk->sk_route_caps & NETIF_F_SG;
+ if (!zc)
+ uarg->zerocopy = 0;
} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
uarg = msg_zerocopy_realloc(sk, size, skb_zcopy(skb));
if (!uarg) {
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index bb0f469a5247..3d75dd05ff98 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1559,7 +1559,8 @@ static int __ip6_append_data(struct sock *sk,
paged = true;
zc = true;
uarg = msg->msg_ubuf;
- }
+ } else
+ msg->msg_ubuf->zerocopy = 0;
} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb));
if (!uarg)
^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: IORING_CQE_F_COPIED
2022-10-14 11:06 IORING_CQE_F_COPIED Stefan Metzmacher
@ 2022-10-17 16:46 ` Pavel Begunkov
2022-10-18 8:43 ` IORING_CQE_F_COPIED Stefan Metzmacher
0 siblings, 1 reply; 25+ messages in thread
From: Pavel Begunkov @ 2022-10-17 16:46 UTC (permalink / raw)
To: Stefan Metzmacher, io-uring, Jens Axboe
Cc: Jakub Kicinski, netdev, Dylan Yudaken
Hey Stefan,
On 10/14/22 12:06, Stefan Metzmacher wrote:
> Hi Pavel,
>
> In the tests I made I used this version of IORING_CQE_F_COPIED:
> https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=645d3b584c417a247d92d71baa6266a5f3d0d17d
> (also inlined at the end)
>
> Would that be something we want for 6.1? (Should I post it with a useful commit message, after doing some more tests?)
I was thinking, can it be delivered separately but not in the same cqe?
The intention is to keep it off the IO path. For example, it can emit a
zc status CQE or maybe keep a "zc failed" counter inside the ring. Other
options? And we can add a separate callback for that, will make a couple
of things better.
What do you think? Especially from the userspace usability perspective.
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: IORING_CQE_F_COPIED
2022-10-17 16:46 ` IORING_CQE_F_COPIED Pavel Begunkov
@ 2022-10-18 8:43 ` Stefan Metzmacher
2022-10-19 15:06 ` IORING_CQE_F_COPIED Pavel Begunkov
0 siblings, 1 reply; 25+ messages in thread
From: Stefan Metzmacher @ 2022-10-18 8:43 UTC (permalink / raw)
To: Pavel Begunkov, io-uring, Jens Axboe
Cc: Jakub Kicinski, netdev, Dylan Yudaken
Hi Pavel,
> On 10/14/22 12:06, Stefan Metzmacher wrote:
>> Hi Pavel,
>>
>> In the tests I made I used this version of IORING_CQE_F_COPIED:
>> https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=645d3b584c417a247d92d71baa6266a5f3d0d17d
>> (also inlined at the end)
>>
>> Would that be something we want for 6.1? (Should I post it with a useful commit message, after doing some more tests?)
>
> I was thinking, can it be delivered separately but not in the same cqe?
> The intention is to keep it off the IO path. For example, it can emit a
> zc status CQE or maybe keep a "zc failed" counter inside the ring. Other
> options? And we can add a separate callback for that, will make a couple
> of things better.
>
> What do you think? Especially from the userspace usability perspective.
So far I can't think of any other way that would be useful,
but that doesn't mean something else doesn't exist...
IORING_CQE_F_COPIED is available per request and makes it possible
to judge why the related SENDMSG_ZC was fast or not.
It's also available in trace-cmd report.
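To illustrate the userspace side, consuming the proposed flag with liburing
could look roughly like the sketch below (IORING_CQE_F_COPIED is the bit
proposed above, not an existing uapi bit; everything else is plain liburing,
and error handling is omitted):

#include <liburing.h>
#include <stdbool.h>
#include <stdio.h>

static int send_zc_and_check(struct io_uring *ring, int sockfd,
                             const void *buf, size_t len)
{
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        struct io_uring_cqe *cqe;
        bool more;
        int sent;

        io_uring_prep_send_zc(sqe, sockfd, buf, len, 0, 0);
        io_uring_submit(ring);

        /* first CQE: the send result; IORING_CQE_F_MORE says a notif follows */
        io_uring_wait_cqe(ring, &cqe);
        sent = cqe->res;
        more = cqe->flags & IORING_CQE_F_MORE;
        io_uring_cqe_seen(ring, cqe);

        if (more) {
                /* notif CQE: the buffer may be reused from here on */
                io_uring_wait_cqe(ring, &cqe);
                if (cqe->flags & IORING_CQE_F_COPIED) /* proposed bit */
                        fprintf(stderr, "zc send fell back to copying\n");
                io_uring_cqe_seen(ring, cqe);
        }
        return sent;
}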
Everything else would likely re-introduce complexity similar to what we
had with the notif slots.
Instead of a new IORING_CQE_F_COPIED flag we could also set
cqe.res = SO_EE_CODE_ZEROCOPY_COPIED, but that isn't really different.
As I basically use the same logic that's used to generate SO_EE_CODE_ZEROCOPY_COPIED
for the native MSG_ZEROCOPY, I don't see the problem with IORING_CQE_F_COPIED.
Can you be more verbose why you're thinking about something different?
metze
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: IORING_CQE_F_COPIED
2022-10-18 8:43 ` IORING_CQE_F_COPIED Stefan Metzmacher
@ 2022-10-19 15:06 ` Pavel Begunkov
2022-10-19 16:12 ` IORING_CQE_F_COPIED Stefan Metzmacher
0 siblings, 1 reply; 25+ messages in thread
From: Pavel Begunkov @ 2022-10-19 15:06 UTC (permalink / raw)
To: Stefan Metzmacher, io-uring, Jens Axboe
Cc: Jakub Kicinski, netdev, Dylan Yudaken
On 10/18/22 09:43, Stefan Metzmacher wrote:
> Hi Pavel,
>
>> On 10/14/22 12:06, Stefan Metzmacher wrote:
>>> Hi Pavel,
>>>
>>> In the tests I made I used this version of IORING_CQE_F_COPIED:
>>> https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=645d3b584c417a247d92d71baa6266a5f3d0d17d
>>> (also inlined at the end)
>>>
>>> Would that be something we want for 6.1? (Should I post it with a useful commit message, after doing some more tests?)
>>
>> I was thinking, can it be delivered separately but not in the same cqe?
>> The intention is to keep it off the IO path. For example, it can emit a
>> zc status CQE or maybe keep a "zc failed" counter inside the ring. Other
>> options? And we can add a separate callback for that, will make a couple
>> of things better.
>>
>> What do you think? Especially from the userspace usability perspective.
>
> So far I can't think of any other way that would be useful,
> but that doesn't mean something else doesn't exist...
>
> IORING_CQE_F_COPIED is available per request and makes it possible
> to judge why the related SENDMSG_ZC was fast or not.
> It's also available in trace-cmd report.
>
> Everything else would likely re-introduce complexity similar to what we
> had with the notif slots.
>
> Instead of a new IORING_CQE_F_COPIED flag we could also set
> cqe.res = SO_EE_CODE_ZEROCOPY_COPIED, but that isn't really different.
>
> As I basically use the same logic that's used to generate SO_EE_CODE_ZEROCOPY_COPIED
> for the native MSG_ZEROCOPY, I don't see the problem with IORING_CQE_F_COPIED.
> Can you be more verbose why you're thinking about something different?
Because it feels like something that should be done roughly once and in
advance. Performance-wise, I agree that a bunch of extra instructions in
the (io_uring) IO path won't make a difference as the net overhead is
already high, but I still prefer to keep it thin. The complexity is a
good point though, if only we could piggyback it onto MSG_PROBE.
Ok, let's do IORING_CQE_F_COPIED and aim for 6.2 + possibly a backport.
First, there is no more ubuf_info::zerocopy, see for-next, but you can
grab space in io_kiocb, io_kiocb::iopoll_completed is a good candidate.
You would want to take one io_uring patch I'm going to send (will CC
you), with that you won't need to change anything in net/. And the last
bit, let's make the zc probing conditional on an IORING_RECVSEND_* flag;
I'll later make it zero overhead when the flag is not set by replacing the callback.
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: IORING_CQE_F_COPIED
2022-10-19 15:06 ` IORING_CQE_F_COPIED Pavel Begunkov
@ 2022-10-19 16:12 ` Stefan Metzmacher
2022-10-20 2:24 ` IORING_CQE_F_COPIED Pavel Begunkov
0 siblings, 1 reply; 25+ messages in thread
From: Stefan Metzmacher @ 2022-10-19 16:12 UTC (permalink / raw)
To: Pavel Begunkov, io-uring, Jens Axboe
Cc: Jakub Kicinski, netdev, Dylan Yudaken
Hi Pavel,
>> As I basically use the same logic that's used to generate SO_EE_CODE_ZEROCOPY_COPIED
>> for the native MSG_ZEROCOPY, I don't see the problem with IORING_CQE_F_COPIED.
>> Can you be more verbose why you're thinking about something different?
>
> Because it feels like something that should be done roughly once and in
> advance. Performance-wise, I agree that a bunch of extra instructions in
> the (io_uring) IO path won't make a difference as the net overhead is
> already high, but I still prefer to keep it thin. The complexity is a
> good point though, if only we could piggyback it onto MSG_PROBE.
> Ok, let's do IORING_CQE_F_COPIED and aim for 6.2 + possibly a backport.
Thanks!
Experimenting with this stuff makes me wish for a way to
have a different 'user_data' field for the notif cqe,
maybe based on an IORING_RECVSEND_ flag; it would make my life
easier and avoid some complexity in userspace...
As I need to handle retry on short writes even with MSG_WAITALL
as EINTR and other errors could cause them.
What do you think?
> First, there is no more ubuf_info::zerocopy, see for-next, but you can
> grab space in io_kiocb, io_kiocb::iopoll_completed is a good candidate.
Ok I found your "net: introduce struct ubuf_info_msgzc" and
"net: shrink struct ubuf_info" commits.
I think the change would be trivial, the zerocopy field would just move
to struct io_notif_data..., maybe as 'bool copied'.
> You would want to take one io_uring patch I'm going to send (will CC
> you), with that you won't need to change anything in net/.
The problem is that e.g. tcp_sendmsg_locked() won't ever call
the callback at all if 'zc' is false.
That's why there's the:
if (!zc)
        uarg->zerocopy = 0;
Maybe I can invert the logic and use two variables 'zero_copied'
and 'copied'.
We'd start with both being false and this logic in the callback:
if (success) {
        if (unlikely(!nd->zero_copied && !nd->copied))
                nd->zero_copied = true;
} else {
        if (unlikely(!nd->copied)) {
                nd->copied = true;
                nd->zero_copied = false;
        }
}
And __io_notif_complete_tw still needs:
if (!nd->zero_copied)
        notif->cqe.flags |= IORING_CQE_F_COPIED;
instead of if (nd->copied)
> And the last bit, let's make the zc probing conditional on an IORING_RECVSEND_* flag;
> I'll later make it zero overhead when the flag is not set by replacing the callback.
And the if statement to select a high-speed callback based on
an IORING_RECVSEND_ flag is less overhead than
the if statements in the slow callback version?
metze
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: IORING_CQE_F_COPIED
2022-10-19 16:12 ` IORING_CQE_F_COPIED Stefan Metzmacher
@ 2022-10-20 2:24 ` Pavel Begunkov
2022-10-20 10:04 ` IORING_SEND_NOTIF_REPORT_USAGE (was Re: IORING_CQE_F_COPIED) Stefan Metzmacher
2022-10-20 10:10 ` IORING_SEND_NOTIF_USER_DATA " Stefan Metzmacher
0 siblings, 2 replies; 25+ messages in thread
From: Pavel Begunkov @ 2022-10-20 2:24 UTC (permalink / raw)
To: Stefan Metzmacher, io-uring, Jens Axboe
Cc: Jakub Kicinski, netdev, Dylan Yudaken
On 10/19/22 17:12, Stefan Metzmacher wrote:
> Hi Pavel,
>
>>> As I basically use the same logic that's used to generate SO_EE_CODE_ZEROCOPY_COPIED
>>> for the native MSG_ZEROCOPY, I don't see the problem with IORING_CQE_F_COPIED.
>>> Can you be more verbose why you're thinking about something different?
>>
>> Because it feels like something that should be done roughly once and in
>> advance. Performance-wise, I agree that a bunch of extra instructions in
>> the (io_uring) IO path won't make a difference as the net overhead is
>> already high, but I still prefer to keep it thin. The complexity is a
>> good point though, if only we could piggyback it onto MSG_PROBE.
>> Ok, let's do IORING_CQE_F_COPIED and aim for 6.2 + possibly a backport.
>
> Thanks!
>
> Experimenting with this stuff makes me wish for a way to
> have a different 'user_data' field for the notif cqe,
> maybe based on an IORING_RECVSEND_ flag; it would make my life
> easier and avoid some complexity in userspace...
> As I need to handle retry on short writes even with MSG_WAITALL
> as EINTR and other errors could cause them.
>
> What do you think?
>
>> First, there is no more ubuf_info::zerocopy, see for-next, but you can
>> grab space in io_kiocb, io_kiocb::iopoll_completed is a good candidate.
>
> Ok I found your "net: introduce struct ubuf_info_msgzc" and
> "net: shrink struct ubuf_info" commits.
>
> I think the change would be trivial, the zerocopy field would just move
> to struct io_notif_data..., maybe as 'bool copied'.
>
>> You would want to take one io_uring patch I'm going to send (will CC
>> you), with that you won't need to change anything in net/.
>
> The problem is that e.g. tcp_sendmsg_locked() won't ever call
> the callback at all if 'zc' is false.
>
> That's why there's the:
>
> if (!zc)
>         uarg->zerocopy = 0;
>
> Maybe I can invert the logic and use two variables 'zero_copied'
> and 'copied'.
>
> We'd start with both being false and this logic in the callback:
>
> if (success) {
>         if (unlikely(!nd->zero_copied && !nd->copied))
>                 nd->zero_copied = true;
> } else {
>         if (unlikely(!nd->copied)) {
>                 nd->copied = true;
>                 nd->zero_copied = false;
>         }
> }
Yep, something like that should do, but let's guard against
spurious net_zcopy_put() just in case.
used = false;
copied = false;

callback(skb, success, ubuf) {
        if (skb)
                used = true;
        if (!success)
                copied = true;
}

complete() {
        if (!used || copied)
                set_flag(IORING_CQE_F_COPIED);
}
> And __io_notif_complete_tw still needs:
>
> if (!nd->zero_copied)
>         notif->cqe.flags |= IORING_CQE_F_COPIED;
Which can be shoved in a custom callback
>> And the last bit, let's make the zc probing conditional on an IORING_RECVSEND_* flag;
>> I'll later make it zero overhead when the flag is not set by replacing the callback.
>
> And the if statement to select a high-speed callback based on
> an IORING_RECVSEND_ flag is less overhead than
> the if statements in the slow callback version?
I'm more concerned about future changes around it, but there won't
be extra ifs.
#define COMMON_FLAGS (RECVSEND_FIRST_POLL|...)
#define ALL_FLAGS (COMMON_FLAGS|RECVSEND_PROBE)

if (flags & ~COMMON_FLAGS) {
        if (flags & ~ALL_FLAGS)
                return err;
        if (flags & RECVSEND_PROBE)
                set_callback(notif);
}
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 25+ messages in thread
* IORING_SEND_NOTIF_REPORT_USAGE (was Re: IORING_CQE_F_COPIED)
2022-10-20 2:24 ` IORING_CQE_F_COPIED Pavel Begunkov
@ 2022-10-20 10:04 ` Stefan Metzmacher
2022-10-20 13:46 ` Pavel Begunkov
2022-10-20 10:10 ` IORING_SEND_NOTIF_USER_DATA " Stefan Metzmacher
1 sibling, 1 reply; 25+ messages in thread
From: Stefan Metzmacher @ 2022-10-20 10:04 UTC (permalink / raw)
To: Pavel Begunkov, io-uring, Jens Axboe
Cc: Jakub Kicinski, netdev, Dylan Yudaken
Hi Pavel,
> Yep, something like that should do, but let's guard against
> spurious net_zcopy_put() just in case.
>
> used = false;
> copied = false;
>
> callback(skb, success, ubuf) {
>         if (skb)
>                 used = true;
>         if (!success)
>                 copied = true;
> }
> complete() {
>         if (!used || copied)
>                 set_flag(IORING_CQE_F_COPIED);
> }
>
>> And __io_notif_complete_tw still needs:
>>
>> if (!nd->zero_copied)
>>         notif->cqe.flags |= IORING_CQE_F_COPIED;
>
> Which can be shoved in a custom callback
>
Ok, got the idea.
> I'm more concerned about future changes around it, but there won't
> be extra ifs.
>
> #define COMMON_FLAGS (RECVSEND_FIRST_POLL|...)
> #define ALL_FLAGS (COMMON_FLAGS|RECVSEND_PROBE)
>
> if (flags & ~COMMON_FLAGS) {
>         if (flags & ~ALL_FLAGS)
>                 return err;
>         if (flags & RECVSEND_PROBE)
>                 set_callback(notif);
> }
So far I came up with an IORING_SEND_NOTIF_REPORT_USAGE opt-in flag
and the reporting is done in cqe.res with IORING_NOTIF_USAGE_ZC_USED (0x00000001)
and/or IORING_NOTIF_USAGE_ZC_COPIED (0x80000000). So the caller is also
able to notice that some parts were able to use zero copy, while other
fragments were copied.
I haven't tested it yet, but I want to post it early...
What do you think?
metze
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index ab7458033ee3..751fc4eff8d1 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -296,10 +296,28 @@ enum io_uring_op {
*
* IORING_RECVSEND_FIXED_BUF Use registered buffers, the index is stored in
* the buf_index field.
+ *
+ * IORING_SEND_NOTIF_REPORT_USAGE
+ * If SEND[MSG]_ZC should report
+ * the zerocopy usage in cqe.res
+ * for the IORING_CQE_F_NOTIF cqe.
+ * IORING_NOTIF_USAGE_ZC_USED if zero copy was used
+ * (at least partially).
+ * IORING_NOTIF_USAGE_ZC_COPIED if data was copied
+ * (at least partially).
*/
#define IORING_RECVSEND_POLL_FIRST (1U << 0)
#define IORING_RECV_MULTISHOT (1U << 1)
#define IORING_RECVSEND_FIXED_BUF (1U << 2)
+#define IORING_SEND_NOTIF_REPORT_USAGE (1U << 3)
+
+/*
+ * cqe.res for IORING_CQE_F_NOTIF if
+ * IORING_SEND_NOTIF_REPORT_USAGE was requested
+ */
+#define IORING_NOTIF_USAGE_ZC_USED (1U << 0)
+#define IORING_NOTIF_USAGE_ZC_COPIED (1U << 31)
+
/*
* accept flags stored in sqe->ioprio
diff --git a/io_uring/net.c b/io_uring/net.c
index 735eec545115..a79d7d349e19 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -946,9 +946,11 @@ int io_send_zc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
zc->flags = READ_ONCE(sqe->ioprio);
if (zc->flags & ~(IORING_RECVSEND_POLL_FIRST |
- IORING_RECVSEND_FIXED_BUF))
+ IORING_RECVSEND_FIXED_BUF |
+ IORING_SEND_NOTIF_REPORT_USAGE))
return -EINVAL;
- notif = zc->notif = io_alloc_notif(ctx);
+ notif = zc->notif = io_alloc_notif(ctx,
+ zc->flags & IORING_SEND_NOTIF_REPORT_USAGE);
if (!notif)
return -ENOMEM;
notif->cqe.user_data = req->cqe.user_data;
diff --git a/io_uring/notif.c b/io_uring/notif.c
index e37c6569d82e..3844e3c8ad7e 100644
--- a/io_uring/notif.c
+++ b/io_uring/notif.c
@@ -3,13 +3,14 @@
#include <linux/file.h>
#include <linux/slab.h>
#include <linux/net.h>
+#include <linux/errqueue.h>
#include <linux/io_uring.h>
#include "io_uring.h"
#include "notif.h"
#include "rsrc.h"
-static void __io_notif_complete_tw(struct io_kiocb *notif, bool *locked)
+static inline void __io_notif_complete_tw(struct io_kiocb *notif, bool *locked)
{
struct io_notif_data *nd = io_notif_to_data(notif);
struct io_ring_ctx *ctx = notif->ctx;
@@ -21,20 +22,46 @@ static void __io_notif_complete_tw(struct io_kiocb *notif, bool *locked)
io_req_task_complete(notif, locked);
}
-static void io_uring_tx_zerocopy_callback(struct sk_buff *skb,
- struct ubuf_info *uarg,
- bool success)
+static inline void io_uring_tx_zerocopy_callback(struct sk_buff *skb,
+ struct ubuf_info *uarg,
+ bool success)
{
struct io_notif_data *nd = container_of(uarg, struct io_notif_data, uarg);
struct io_kiocb *notif = cmd_to_io_kiocb(nd);
if (refcount_dec_and_test(&uarg->refcnt)) {
- notif->io_task_work.func = __io_notif_complete_tw;
io_req_task_work_add(notif);
}
}
-struct io_kiocb *io_alloc_notif(struct io_ring_ctx *ctx)
+static void __io_notif_complete_tw_report_usage(struct io_kiocb *notif, bool *locked)
+{
+ struct io_notif_data *nd = io_notif_to_data(notif);
+
+ if (likely(nd->zc_used))
+ notif->cqe.res |= IORING_NOTIF_USAGE_ZC_USED;
+
+ if (unlikely(nd->zc_copied))
+ notif->cqe.res |= IORING_NOTIF_USAGE_ZC_COPIED;
+
+ __io_notif_complete_tw(notif, locked);
+}
+
+static void io_uring_tx_zerocopy_callback_report_usage(struct sk_buff *skb,
+ struct ubuf_info *uarg,
+ bool success)
+{
+ struct io_notif_data *nd = container_of(uarg, struct io_notif_data, uarg);
+
+ if (success && !nd->zc_used && skb)
+ nd->zc_used = true;
+ else if (unlikely(!success && !nd->zc_copied))
+ nd->zc_copied = true;
+
+ io_uring_tx_zerocopy_callback(skb, uarg, success);
+}
+
+struct io_kiocb *io_alloc_notif(struct io_ring_ctx *ctx, bool report_usage)
__must_hold(&ctx->uring_lock)
{
struct io_kiocb *notif;
@@ -54,7 +81,14 @@ struct io_kiocb *io_alloc_notif(struct io_ring_ctx *ctx)
nd = io_notif_to_data(notif);
nd->account_pages = 0;
nd->uarg.flags = SKBFL_ZEROCOPY_FRAG | SKBFL_DONT_ORPHAN;
- nd->uarg.callback = io_uring_tx_zerocopy_callback;
+ if (report_usage) {
+ nd->zc_used = nd->zc_copied = false;
+ nd->uarg.callback = io_uring_tx_zerocopy_callback_report_usage;
+ notif->io_task_work.func = __io_notif_complete_tw_report_usage;
+ } else {
+ nd->uarg.callback = io_uring_tx_zerocopy_callback;
+ notif->io_task_work.func = __io_notif_complete_tw;
+ }
refcount_set(&nd->uarg.refcnt, 1);
return notif;
}
@@ -66,7 +100,6 @@ void io_notif_flush(struct io_kiocb *notif)
/* drop slot's master ref */
if (refcount_dec_and_test(&nd->uarg.refcnt)) {
- notif->io_task_work.func = __io_notif_complete_tw;
io_req_task_work_add(notif);
}
}
diff --git a/io_uring/notif.h b/io_uring/notif.h
index 5b4d710c8ca5..5ac7a2745e52 100644
--- a/io_uring/notif.h
+++ b/io_uring/notif.h
@@ -13,10 +13,12 @@ struct io_notif_data {
struct file *file;
struct ubuf_info uarg;
unsigned long account_pages;
+ bool zc_used;
+ bool zc_copied;
};
void io_notif_flush(struct io_kiocb *notif);
-struct io_kiocb *io_alloc_notif(struct io_ring_ctx *ctx);
+struct io_kiocb *io_alloc_notif(struct io_ring_ctx *ctx, bool report_usage);
static inline struct io_notif_data *io_notif_to_data(struct io_kiocb *notif)
{
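On the userspace side, decoding the notif cqe under this scheme would be
something like the sketch below (the helper and the counters are made up
for illustration; the flag and the IORING_NOTIF_USAGE_* bits are the ones
from the diff above):

#include <liburing.h>

/* sketch: classify a notif CQE generated with IORING_SEND_NOTIF_REPORT_USAGE */
static void account_notif(const struct io_uring_cqe *cqe,
                          unsigned *zc_used, unsigned *zc_copied)
{
        if (!(cqe->flags & IORING_CQE_F_NOTIF))
                return;
        if (cqe->res & IORING_NOTIF_USAGE_ZC_USED)
                (*zc_used)++;           /* at least partially zero copy */
        if (cqe->res & IORING_NOTIF_USAGE_ZC_COPIED)
                (*zc_copied)++;         /* at least partially copied */
}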
^ permalink raw reply related [flat|nested] 25+ messages in thread
* IORING_SEND_NOTIF_USER_DATA (was Re: IORING_CQE_F_COPIED)
2022-10-20 2:24 ` IORING_CQE_F_COPIED Pavel Begunkov
2022-10-20 10:04 ` IORING_SEND_NOTIF_REPORT_USAGE (was Re: IORING_CQE_F_COPIED) Stefan Metzmacher
@ 2022-10-20 10:10 ` Stefan Metzmacher
2022-10-20 15:37 ` Pavel Begunkov
1 sibling, 1 reply; 25+ messages in thread
From: Stefan Metzmacher @ 2022-10-20 10:10 UTC (permalink / raw)
To: Pavel Begunkov, io-uring, Jens Axboe
Cc: Jakub Kicinski, netdev, Dylan Yudaken
Hi Pavel,
>> Experimenting with this stuff makes me wish for a way to
>> have a different 'user_data' field for the notif cqe,
>> maybe based on an IORING_RECVSEND_ flag; it would make my life
>> easier and avoid some complexity in userspace...
>> As I need to handle retry on short writes even with MSG_WAITALL
>> as EINTR and other errors could cause them.
>>
>> What do you think?
Any comment on this?
IORING_SEND_NOTIF_USER_DATA could let us use
notif->cqe.user_data = sqe->addr3;
metze
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: IORING_SEND_NOTIF_REPORT_USAGE (was Re: IORING_CQE_F_COPIED)
2022-10-20 10:04 ` IORING_SEND_NOTIF_REPORT_USAGE (was Re: IORING_CQE_F_COPIED) Stefan Metzmacher
@ 2022-10-20 13:46 ` Pavel Begunkov
2022-10-20 14:51 ` Stefan Metzmacher
0 siblings, 1 reply; 25+ messages in thread
From: Pavel Begunkov @ 2022-10-20 13:46 UTC (permalink / raw)
To: Stefan Metzmacher, io-uring, Jens Axboe
Cc: Jakub Kicinski, netdev, Dylan Yudaken
On 10/20/22 11:04, Stefan Metzmacher wrote:
> Hi Pavel,
[...]
>
> So far I came up with an IORING_SEND_NOTIF_REPORT_USAGE opt-in flag
> and the reporting is done in cqe.res with IORING_NOTIF_USAGE_ZC_USED (0x00000001)
> and/or IORING_NOTIF_USAGE_ZC_COPIED (0x80000000). So the caller is also
> able to notice that some parts were able to use zero copy, while other
> fragments were copied.
Are we really interested in multihoming and probably some very edge cases?
I'd argue we're not and it should be a single bool hint indicating whether
zc is viable or not. It can do more complex calculations _if_ needed, e.g.
looking inside skb's and figuring out how many bytes were copied, but as for me
it should better be turned into a single bool in the end. Could also be the
number of bytes copied, but I don't think we can have the accuracy for
that (e.g. what we're going to return if some protocol duplicates an skb
and sends to 2 different devices or is processing it in a pipeline?)
So the question is what is the use case for having 2 flags?
btw, now we've got another example why the report flag is a good idea,
we can't use cqe.res unconditionally because we want to have a "one CQE
per request" mode, but it's fine if we make it and the report flag
mutually exclusive.
> I haven't tested it yet, but I want to post it early...
>
> What do you think?
Keeping in mind potential backporting let's make it as simple and
short as possible first and then do optimisations on top.
> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> index ab7458033ee3..751fc4eff8d1 100644
> --- a/include/uapi/linux/io_uring.h
> +++ b/include/uapi/linux/io_uring.h
> @@ -296,10 +296,28 @@ enum io_uring_op {
> *
> * IORING_RECVSEND_FIXED_BUF Use registered buffers, the index is stored in
> * the buf_index field.
> + *
> + * IORING_SEND_NOTIF_REPORT_USAGE
> + * If SEND[MSG]_ZC should report
> + * the zerocopy usage in cqe.res
> + * for the IORING_CQE_F_NOTIF cqe.
> + * IORING_NOTIF_USAGE_ZC_USED if zero copy was used
> + * (at least partially).
> + * IORING_NOTIF_USAGE_ZC_COPIED if data was copied
> + * (at least partially).
> */
> #define IORING_RECVSEND_POLL_FIRST (1U << 0)
> #define IORING_RECV_MULTISHOT (1U << 1)
> #define IORING_RECVSEND_FIXED_BUF (1U << 2)
> +#define IORING_SEND_NOTIF_REPORT_USAGE (1U << 3)
> +
> +/*
> + * cqe.res for IORING_CQE_F_NOTIF if
> + * IORING_SEND_NOTIF_REPORT_USAGE was requested
> + */
> +#define IORING_NOTIF_USAGE_ZC_USED (1U << 0)
> +#define IORING_NOTIF_USAGE_ZC_COPIED (1U << 31)
> +
>
> /*
> * accept flags stored in sqe->ioprio
> diff --git a/io_uring/net.c b/io_uring/net.c
> index 735eec545115..a79d7d349e19 100644
> --- a/io_uring/net.c
> +++ b/io_uring/net.c
> @@ -946,9 +946,11 @@ int io_send_zc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>
> zc->flags = READ_ONCE(sqe->ioprio);
> if (zc->flags & ~(IORING_RECVSEND_POLL_FIRST |
> - IORING_RECVSEND_FIXED_BUF))
> + IORING_RECVSEND_FIXED_BUF |
> + IORING_SEND_NOTIF_REPORT_USAGE))
> return -EINVAL;
> - notif = zc->notif = io_alloc_notif(ctx);
> + notif = zc->notif = io_alloc_notif(ctx,
> + zc->flags & IORING_SEND_NOTIF_REPORT_USAGE);
> if (!notif)
> return -ENOMEM;
> notif->cqe.user_data = req->cqe.user_data;
> diff --git a/io_uring/notif.c b/io_uring/notif.c
> index e37c6569d82e..3844e3c8ad7e 100644
> --- a/io_uring/notif.c
> +++ b/io_uring/notif.c
> @@ -3,13 +3,14 @@
> #include <linux/file.h>
> #include <linux/slab.h>
> #include <linux/net.h>
> +#include <linux/errqueue.h>
Is it needed?
> #include <linux/io_uring.h>
>
> #include "io_uring.h"
> #include "notif.h"
> #include "rsrc.h"
>
> -static void __io_notif_complete_tw(struct io_kiocb *notif, bool *locked)
> +static inline void __io_notif_complete_tw(struct io_kiocb *notif, bool *locked)
Let's remove this hunk with inlining and do it later
> {
> struct io_notif_data *nd = io_notif_to_data(notif);
> struct io_ring_ctx *ctx = notif->ctx;
> @@ -21,20 +22,46 @@ static void __io_notif_complete_tw(struct io_kiocb *notif, bool *locked)
> io_req_task_complete(notif, locked);
> }
>
> -static void io_uring_tx_zerocopy_callback(struct sk_buff *skb,
> - struct ubuf_info *uarg,
> - bool success)
> +static inline void io_uring_tx_zerocopy_callback(struct sk_buff *skb,
> + struct ubuf_info *uarg,
> + bool success)
This one as well.
> {
> struct io_notif_data *nd = container_of(uarg, struct io_notif_data, uarg);
> struct io_kiocb *notif = cmd_to_io_kiocb(nd);
>
> if (refcount_dec_and_test(&uarg->refcnt)) {
> - notif->io_task_work.func = __io_notif_complete_tw;
> io_req_task_work_add(notif);
> }
> }
>
> -struct io_kiocb *io_alloc_notif(struct io_ring_ctx *ctx)
> +static void __io_notif_complete_tw_report_usage(struct io_kiocb *notif, bool *locked)
Just shove all that into __io_notif_complete_tw().
> +{
> + struct io_notif_data *nd = io_notif_to_data(notif);
> +
> + if (likely(nd->zc_used))
> + notif->cqe.res |= IORING_NOTIF_USAGE_ZC_USED;
> +
> + if (unlikely(nd->zc_copied))
> + notif->cqe.res |= IORING_NOTIF_USAGE_ZC_COPIED;
> +
> + __io_notif_complete_tw(notif, locked);
> +}
> +
> +static void io_uring_tx_zerocopy_callback_report_usage(struct sk_buff *skb,
> + struct ubuf_info *uarg,
> + bool success)
> +{
> + struct io_notif_data *nd = container_of(uarg, struct io_notif_data, uarg);
> +
> + if (success && !nd->zc_used && skb)
> + nd->zc_used = true;
> + else if (unlikely(!success && !nd->zc_copied))
> + nd->zc_copied = true;
It's fine but racy, so let's WRITE_ONCE() to indicate it.
> +
> + io_uring_tx_zerocopy_callback(skb, uarg, success);
> +}
> +
> +struct io_kiocb *io_alloc_notif(struct io_ring_ctx *ctx, bool report_usage)
> __must_hold(&ctx->uring_lock)
And it's better to kill this argument and init zc_used/copied
unconditionally.
> {
> struct io_kiocb *notif;
> @@ -54,7 +81,14 @@ struct io_kiocb *io_alloc_notif(struct io_ring_ctx *ctx)
> nd = io_notif_to_data(notif);
> nd->account_pages = 0;
> nd->uarg.flags = SKBFL_ZEROCOPY_FRAG | SKBFL_DONT_ORPHAN;
> - nd->uarg.callback = io_uring_tx_zerocopy_callback;
> + if (report_usage) {
> + nd->zc_used = nd->zc_copied = false;
> + nd->uarg.callback = io_uring_tx_zerocopy_callback_report_usage;
> + notif->io_task_work.func = __io_notif_complete_tw_report_usage;
> + } else {
> + nd->uarg.callback = io_uring_tx_zerocopy_callback;
> + notif->io_task_work.func = __io_notif_complete_tw;
> + }
> refcount_set(&nd->uarg.refcnt, 1);
> return notif;
> }
> @@ -66,7 +100,6 @@ void io_notif_flush(struct io_kiocb *notif)
>
> /* drop slot's master ref */
> if (refcount_dec_and_test(&nd->uarg.refcnt)) {
> - notif->io_task_work.func = __io_notif_complete_tw;
> io_req_task_work_add(notif);
> }
> }
> diff --git a/io_uring/notif.h b/io_uring/notif.h
> index 5b4d710c8ca5..5ac7a2745e52 100644
> --- a/io_uring/notif.h
> +++ b/io_uring/notif.h
> @@ -13,10 +13,12 @@ struct io_notif_data {
> struct file *file;
> struct ubuf_info uarg;
> unsigned long account_pages;
> + bool zc_used;
> + bool zc_copied;
IIRC io_notif_data is fully packed in 6.1, so placing zc_{used,copied}
there might complicate backporting (if any). We can place them in io_kiocb
directly and move in 6.2. Alternatively account_pages doesn't have to be
long.
> };
>
> void io_notif_flush(struct io_kiocb *notif);
> -struct io_kiocb *io_alloc_notif(struct io_ring_ctx *ctx);
> +struct io_kiocb *io_alloc_notif(struct io_ring_ctx *ctx, bool report_usage);
>
> static inline struct io_notif_data *io_notif_to_data(struct io_kiocb *notif)
> {
>
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: IORING_SEND_NOTIF_REPORT_USAGE (was Re: IORING_CQE_F_COPIED)
2022-10-20 13:46 ` Pavel Begunkov
@ 2022-10-20 14:51 ` Stefan Metzmacher
2022-10-20 15:31 ` Pavel Begunkov
0 siblings, 1 reply; 25+ messages in thread
From: Stefan Metzmacher @ 2022-10-20 14:51 UTC (permalink / raw)
To: Pavel Begunkov, io-uring, Jens Axboe
Cc: Jakub Kicinski, netdev, Dylan Yudaken
Hi Pavel,
>> So far I came up with an IORING_SEND_NOTIF_REPORT_USAGE opt-in flag
>> and the reporting is done in cqe.res with IORING_NOTIF_USAGE_ZC_USED (0x00000001)
>> and/or IORING_NOTIF_USAGE_ZC_COPIED (0x80000000). So the caller is also
>> able to notice that some parts were able to use zero copy, while other
>> fragments were copied.
>
> Are we really interested in multihoming and probably some very edge cases?
> I'd argue we're not and it should be a single bool hint indicating whether
> zc is viable or not. It can do more complex calculations _if_ needed, e.g.
> looking inside skb's and figuring out how many bytes were copied, but as for me
> it should better be turned into a single bool in the end. Could also be the
> number of bytes copied, but I don't think we can have the accuracy for
> that (e.g. what we're going to return if some protocol duplicates an skb
> and sends to 2 different devices or is processing it in a pipeline?)
>
> So the question is what is the use case for having 2 flags?
It's mostly for debugging.
> btw, now we've got another example why the report flag is a good idea,
I don't understand that line...
> we can't use cqe.res unconditionally because we want to have a "one CQE
> per request" mode, but it's fine if we make it and the report flag
> mutually exclusive.
You mean we can add an optimized case where SEND[MSG]_ZC would not
generate F_MORE and would skip F_NOTIF, because we copied or the transmission
path was really fast?
Then I'd move to IORING_CQE_F_COPIED again...
>> I haven't tested it yet, but I want to post it early...
>>
>> What do you think?
>
> Keeping in mind potential backporting let's make it as simple and
> short as possible first and then do optimisations on top.
ok.
>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
>> index ab7458033ee3..751fc4eff8d1 100644
>> --- a/include/uapi/linux/io_uring.h
>> +++ b/include/uapi/linux/io_uring.h
>> @@ -296,10 +296,28 @@ enum io_uring_op {
>> *
>> * IORING_RECVSEND_FIXED_BUF Use registered buffers, the index is stored in
>> * the buf_index field.
>> + *
>> + * IORING_SEND_NOTIF_REPORT_USAGE
>> + * If SEND[MSG]_ZC should report
>> + * the zerocopy usage in cqe.res
>> + * for the IORING_CQE_F_NOTIF cqe.
>> + * IORING_NOTIF_USAGE_ZC_USED if zero copy was used
>> + * (at least partially).
>> + * IORING_NOTIF_USAGE_ZC_COPIED if data was copied
>> + * (at least partially).
>> */
>> #define IORING_RECVSEND_POLL_FIRST (1U << 0)
>> #define IORING_RECV_MULTISHOT (1U << 1)
>> #define IORING_RECVSEND_FIXED_BUF (1U << 2)
>> +#define IORING_SEND_NOTIF_REPORT_USAGE (1U << 3)
>> +
>> +/*
>> + * cqe.res for IORING_CQE_F_NOTIF if
>> + * IORING_SEND_NOTIF_REPORT_USAGE was requested
>> + */
>> +#define IORING_NOTIF_USAGE_ZC_USED (1U << 0)
>> +#define IORING_NOTIF_USAGE_ZC_COPIED (1U << 31)
>> +
>>
>> /*
>> * accept flags stored in sqe->ioprio
>> diff --git a/io_uring/net.c b/io_uring/net.c
>> index 735eec545115..a79d7d349e19 100644
>> --- a/io_uring/net.c
>> +++ b/io_uring/net.c
>> @@ -946,9 +946,11 @@ int io_send_zc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>
>> zc->flags = READ_ONCE(sqe->ioprio);
>> if (zc->flags & ~(IORING_RECVSEND_POLL_FIRST |
>> - IORING_RECVSEND_FIXED_BUF))
>> + IORING_RECVSEND_FIXED_BUF |
>> + IORING_SEND_NOTIF_REPORT_USAGE))
>> return -EINVAL;
>> - notif = zc->notif = io_alloc_notif(ctx);
>> + notif = zc->notif = io_alloc_notif(ctx,
>> + zc->flags & IORING_SEND_NOTIF_REPORT_USAGE);
>> if (!notif)
>> return -ENOMEM;
>> notif->cqe.user_data = req->cqe.user_data;
>> diff --git a/io_uring/notif.c b/io_uring/notif.c
>> index e37c6569d82e..3844e3c8ad7e 100644
>> --- a/io_uring/notif.c
>> +++ b/io_uring/notif.c
>> @@ -3,13 +3,14 @@
>> #include <linux/file.h>
>> #include <linux/slab.h>
>> #include <linux/net.h>
>> +#include <linux/errqueue.h>
>
> Is it needed?
No
>> #include <linux/io_uring.h>
>>
>> #include "io_uring.h"
>> #include "notif.h"
>> #include "rsrc.h"
>>
>> -static void __io_notif_complete_tw(struct io_kiocb *notif, bool *locked)
>> +static inline void __io_notif_complete_tw(struct io_kiocb *notif, bool *locked)
>
> Let's remove this hunk with inlining and do it later
>
>> {
>> struct io_notif_data *nd = io_notif_to_data(notif);
>> struct io_ring_ctx *ctx = notif->ctx;
>> @@ -21,20 +22,46 @@ static void __io_notif_complete_tw(struct io_kiocb *notif, bool *locked)
>> io_req_task_complete(notif, locked);
>> }
>>
>> -static void io_uring_tx_zerocopy_callback(struct sk_buff *skb,
>> - struct ubuf_info *uarg,
>> - bool success)
>> +static inline void io_uring_tx_zerocopy_callback(struct sk_buff *skb,
>> + struct ubuf_info *uarg,
>> + bool success)
>
> This one as well.
>
>
>> {
>> struct io_notif_data *nd = container_of(uarg, struct io_notif_data, uarg);
>> struct io_kiocb *notif = cmd_to_io_kiocb(nd);
>>
>> if (refcount_dec_and_test(&uarg->refcnt)) {
>> - notif->io_task_work.func = __io_notif_complete_tw;
>> io_req_task_work_add(notif);
>> }
>> }
>>
>> -struct io_kiocb *io_alloc_notif(struct io_ring_ctx *ctx)
>> +static void __io_notif_complete_tw_report_usage(struct io_kiocb *notif, bool *locked)
>
> Just shove all that into __io_notif_complete_tw().
Ok, and then optimize later?
Otherwise we could have IORING_CQE_F_COPIED by default without opt-in
flag...
>> +static void io_uring_tx_zerocopy_callback_report_usage(struct sk_buff *skb,
>> + struct ubuf_info *uarg,
>> + bool success)
>> +{
>> + struct io_notif_data *nd = container_of(uarg, struct io_notif_data, uarg);
>> +
>> + if (success && !nd->zc_used && skb)
>> + nd->zc_used = true;
>> + else if (unlikely(!success && !nd->zc_copied))
>> + nd->zc_copied = true;
>
> It's fine but racy, so let's WRITE_ONCE() to indicate it.
I don't see how this could be a problem, but I can add it.
>> diff --git a/io_uring/notif.h b/io_uring/notif.h
>> index 5b4d710c8ca5..5ac7a2745e52 100644
>> --- a/io_uring/notif.h
>> +++ b/io_uring/notif.h
>> @@ -13,10 +13,12 @@ struct io_notif_data {
>> struct file *file;
>> struct ubuf_info uarg;
>> unsigned long account_pages;
>> + bool zc_used;
>> + bool zc_copied;
>
> IIRC io_notif_data is fully packed in 6.1, so placing zc_{used,copied}
> there might complicate backporting (if any). We can place them in io_kiocb
> directly and move in 6.2. Alternatively account_pages doesn't have to be
> long.
As far as I can see kernel-dk-block/io_uring-6.1 already has your
shrink patches included...
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: IORING_SEND_NOTIF_REPORT_USAGE (was Re: IORING_CQE_F_COPIED)
2022-10-20 14:51 ` Stefan Metzmacher
@ 2022-10-20 15:31 ` Pavel Begunkov
2022-10-21 9:36 ` Stefan Metzmacher
0 siblings, 1 reply; 25+ messages in thread
From: Pavel Begunkov @ 2022-10-20 15:31 UTC (permalink / raw)
To: Stefan Metzmacher, io-uring, Jens Axboe
Cc: Jakub Kicinski, netdev, Dylan Yudaken
On 10/20/22 15:51, Stefan Metzmacher wrote:
> Hi Pavel,
>
>>> So far I came up with an IORING_SEND_NOTIF_REPORT_USAGE opt-in flag
>>> and the reporting is done in cqe.res with IORING_NOTIF_USAGE_ZC_USED (0x00000001)
>>> and/or IORING_NOTIF_USAGE_ZC_COPIED (0x80000000). So the caller is also
>>> able to notice that some parts were able to use zero copy, while other
>>> fragments were copied.
>>
>> Are we really interested in multihoming and probably some very edge cases?
>> I'd argue we're not and it should be a single bool hint indicating whether
>> zc is viable or not. It can do more complex calculations _if_ needed, e.g.
>> looking inside skb's and figuring out how many bytes were copied, but as for me
>> it should better be turned into a single bool in the end. Could also be the
>> number of bytes copied, but I don't think we can have the accuracy for
>> that (e.g. what we're going to return if some protocol duplicates an skb
>> and sends to 2 different devices or is processing it in a pipeline?)
>>
>> So the question is what is the use case for having 2 flags?
>
> It's mostly for debugging.
Ok, then it sounds like we don't need it.
>> btw, now we've got another example why the report flag is a good idea,
>
> I don't understand that line...
I'm just saying that IORING_SEND_NOTIF_* instead of unconditional reporting
is more flexible and extendible from the uapi perspective.
>> we can't use cqe.res unconditionally because we want to have a "one CQE
>> per request" mode, but it's fine if we make it and the report flag
>> mutually exclusive.
>
> You mean we can add an optimized case where SEND[MSG]_ZC would not
> generate F_MORE and would skip F_NOTIF, because we copied or the transmission
> path was really fast?
It is rather about optionally omitting the first (aka completion) cqe and
posting only the notification cqe, which makes a lot of sense for UDP and
some TCP use cases.
> Then I'd move to IORING_CQE_F_COPIED again...
[...]
>>> -struct io_kiocb *io_alloc_notif(struct io_ring_ctx *ctx)
>>> +static void __io_notif_complete_tw_report_usage(struct io_kiocb *notif, bool *locked)
>>
>> Just shove all that into __io_notif_complete_tw().
>
> Ok, and then optimize later?
Right, I'm just tired of backporting patches by hand :)
> Otherwise we could have IORING_CQE_F_COPIED by default without opt-in
> flag...
>
>>> +static void io_uring_tx_zerocopy_callback_report_usage(struct sk_buff *skb,
>>> + struct ubuf_info *uarg,
>>> + bool success)
>>> +{
>>> + struct io_notif_data *nd = container_of(uarg, struct io_notif_data, uarg);
>>> +
>>> + if (success && !nd->zc_used && skb)
>>> + nd->zc_used = true;
>>> + else if (unlikely(!success && !nd->zc_copied))
>>> + nd->zc_copied = true;
>>
>> It's fine but racy, so let's WRITE_ONCE() to indicate it.
>
> I don't see how this could be a problem, but I can add it.
It's not a problem, but better to be a little bit more explicit
about parallel writes.
>>> diff --git a/io_uring/notif.h b/io_uring/notif.h
>>> index 5b4d710c8ca5..5ac7a2745e52 100644
>>> --- a/io_uring/notif.h
>>> +++ b/io_uring/notif.h
>>> @@ -13,10 +13,12 @@ struct io_notif_data {
>>> struct file *file;
>>> struct ubuf_info uarg;
>>> unsigned long account_pages;
>>> + bool zc_used;
>>> + bool zc_copied;
>>
>> IIRC io_notif_data is fully packed in 6.1, so placing zc_{used,copied}
>> there might complicate backporting (if any). We can place them in io_kiocb
>> directly and move in 6.2. Alternatively account_pages doesn't have to be
>> long.
>
> As far as I can see kernel-dk-block/io_uring-6.1 already has your
> shrink patches included...
Sorry, I mean 6.0
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: IORING_SEND_NOTIF_USER_DATA (was Re: IORING_CQE_F_COPIED)
2022-10-20 10:10 ` IORING_SEND_NOTIF_USER_DATA " Stefan Metzmacher
@ 2022-10-20 15:37 ` Pavel Begunkov
2022-10-21 8:32 ` Stefan Metzmacher
0 siblings, 1 reply; 25+ messages in thread
From: Pavel Begunkov @ 2022-10-20 15:37 UTC (permalink / raw)
To: Stefan Metzmacher, io-uring, Jens Axboe
Cc: Jakub Kicinski, netdev, Dylan Yudaken
On 10/20/22 11:10, Stefan Metzmacher wrote:
> Hi Pavel,
>
>>> Experimenting with this stuff makes me wish for a way to
>>> have a different 'user_data' field for the notif cqe,
>>> maybe based on an IORING_RECVSEND_ flag; it would make my life
>>> easier and avoid some complexity in userspace...
>>> As I need to handle retry on short writes even with MSG_WAITALL
>>> as EINTR and other errors could cause them.
>>>
>>> What do you think?
>
> Any comment on this?
>
> IORING_SEND_NOTIF_USER_DATA could let us use
> notif->cqe.user_data = sqe->addr3;
I'd rather not use the last available u64, tbh, that was the
reason for not adding a second user_data in the first place.
Maybe IORING_SETUP_SQE128?
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: IORING_SEND_NOTIF_USER_DATA (was Re: IORING_CQE_F_COPIED)
2022-10-20 15:37 ` Pavel Begunkov
@ 2022-10-21 8:32 ` Stefan Metzmacher
2022-10-21 9:27 ` Pavel Begunkov
0 siblings, 1 reply; 25+ messages in thread
From: Stefan Metzmacher @ 2022-10-21 8:32 UTC (permalink / raw)
To: Pavel Begunkov, io-uring, Jens Axboe
Cc: Jakub Kicinski, netdev, Dylan Yudaken
Hi Pavel,
>>>> Experimenting with this stuff makes me wish for a way to
>>>> have a different 'user_data' field for the notif cqe,
>>>> maybe based on an IORING_RECVSEND_ flag; it would make my life
>>>> easier and avoid some complexity in userspace...
>>>> As I need to handle retry on short writes even with MSG_WAITALL
>>>> as EINTR and other errors could cause them.
>>>>
>>>> What do you think?
>>
>> Any comment on this?
>>
>> IORING_SEND_NOTIF_USER_DATA could let us use
>> notif->cqe.user_data = sqe->addr3;
>
> I'd rather not use the last available u64, tbh, that was the
> reason for not adding a second user_data in the first place.
As far as I can see io_send_zc_prep has this:
if (unlikely(READ_ONCE(sqe->__pad2[0]) || READ_ONCE(sqe->addr3)))
        return -EINVAL;
both are u64...
metze
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: IORING_SEND_NOTIF_USER_DATA (was Re: IORING_CQE_F_COPIED)
2022-10-21 8:32 ` Stefan Metzmacher
@ 2022-10-21 9:27 ` Pavel Begunkov
2022-10-21 9:45 ` Stefan Metzmacher
2022-10-21 10:15 ` Stefan Metzmacher
0 siblings, 2 replies; 25+ messages in thread
From: Pavel Begunkov @ 2022-10-21 9:27 UTC (permalink / raw)
To: Stefan Metzmacher, io-uring, Jens Axboe
Cc: Jakub Kicinski, netdev, Dylan Yudaken
On 10/21/22 09:32, Stefan Metzmacher wrote:
> Hi Pavel,
>
>>>>> Experimenting with this stuff makes me wish for a way to
>>>>> have a different 'user_data' field for the notif cqe,
>>>>> maybe based on an IORING_RECVSEND_ flag; it would make my life
>>>>> easier and avoid some complexity in userspace...
>>>>> As I need to handle retry on short writes even with MSG_WAITALL
>>>>> as EINTR and other errors could cause them.
>>>>>
>>>>> What do you think?
>>>
>>> Any comment on this?
>>>
>>> IORING_SEND_NOTIF_USER_DATA could let us use
>>> notif->cqe.user_data = sqe->addr3;
>>
>> I'd rather not use the last available u64, tbh, that was the
>> reason for not adding a second user_data in the first place.
>
> As far as I can see io_send_zc_prep has this:
>
> if (unlikely(READ_ONCE(sqe->__pad2[0]) || READ_ONCE(sqe->addr3)))
>         return -EINVAL;
>
> both are u64...
Hah, true, completely forgot about that one
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: IORING_SEND_NOTIF_REPORT_USAGE (was Re: IORING_CQE_F_COPIED)
2022-10-20 15:31 ` Pavel Begunkov
@ 2022-10-21 9:36 ` Stefan Metzmacher
2022-10-21 11:09 ` Pavel Begunkov
0 siblings, 1 reply; 25+ messages in thread
From: Stefan Metzmacher @ 2022-10-21 9:36 UTC (permalink / raw)
To: Pavel Begunkov, io-uring, Jens Axboe
Cc: Jakub Kicinski, netdev, Dylan Yudaken
Hi Pavel,
>>>> So far I came up with an IORING_SEND_NOTIF_REPORT_USAGE opt-in flag
>>>> and the reporting is done in cqe.res with IORING_NOTIF_USAGE_ZC_USED (0x00000001)
>>>> and/or IORING_NOTIF_USAGE_ZC_COPIED (0x80000000). So the caller is also
>>>> able to notice that some parts were able to use zero copy, while other
>>>> fragments were copied.
>>>
>>> Are we really interested in multihoming and probably some very edge cases?
>>> I'd argue we're not and it should be a single bool hint indicating whether
>>> zc is viable or not. It can do more complex calculations _if_ needed, e.g.
>>> looking inside skb's and figuring out how many bytes were copied, but as for me
>>> it should better be turned into a single bool in the end. Could also be the
>>> number of bytes copied, but I don't think we can have the accuracy for
>>> that (e.g. what we're going to return if some protocol duplicates an skb
>>> and sends to 2 different devices or is processing it in a pipeline?)
>>>
>>> So the question is what is the use case for having 2 flags?
>>
>> It's mostly for debugging.
>
> Ok, then it sounds like we don't need it.
Maybe I could add some tracepoints to the callback?
>>> btw, now we've got another example why the report flag is a good idea,
>>
>> I don't understand that line...
>
> I'm just saying that IORING_SEND_NOTIF_* instead of unconditional reporting
> is more flexible and extendible from the uapi perspective.
ok
>>> we can't use cqe.res unconditionally because we want to have a "one CQE
>>> per request" mode, but it's fine if we make it and the report flag
>>> mutually exclusive.
>>
>> You mean we can add an optimized case where SEND[MSG]_ZC would not
>> generate F_MORE and would skip F_NOTIF, because we copied or the transmission
>> path was really fast?
>
> It is rather about optionally omitting the first (aka completion) cqe and
> posting only the notification cqe, which makes a lot of sense for UDP and
> some TCP use cases.
OK.
>> Then I'd move to IORING_CQE_F_COPIED again...
> [...]
>>>> -struct io_kiocb *io_alloc_notif(struct io_ring_ctx *ctx)
>>>> +static void __io_notif_complete_tw_report_usage(struct io_kiocb *notif, bool *locked)
>>>
>>> Just shove all that into __io_notif_complete_tw().
>>
>> Ok, and then optimze later?
>
> Right, I'm just tired of back porting patches by hand :)
ok, I just assumed it would be 6.1 only.
>> Otherwise we could have IORING_CQE_F_COPIED by default without opt-in
>> flag...
Do you still want an opt-in flag to get IORING_CQE_F_COPIED?
If so, what name do you want it to be?
>>>> +static void io_uring_tx_zerocopy_callback_report_usage(struct sk_buff *skb,
>>>> + struct ubuf_info *uarg,
>>>> + bool success)
>>>> +{
>>>> + struct io_notif_data *nd = container_of(uarg, struct io_notif_data, uarg);
>>>> +
>>>> + if (success && !nd->zc_used && skb)
>>>> + nd->zc_used = true;
>>>> + else if (unlikely(!success && !nd->zc_copied))
>>>> + nd->zc_copied = true;
>>>
>>> It's fine but racy, so let's WRITE_ONCE() to indicate it.
>>
>> I don't see how this could be a problem, but I can add it.
>
> It's not a problem, but better to be a little bit more explicit
> about parallel writes.
ok.
>>>> diff --git a/io_uring/notif.h b/io_uring/notif.h
>>>> index 5b4d710c8ca5..5ac7a2745e52 100644
>>>> --- a/io_uring/notif.h
>>>> +++ b/io_uring/notif.h
>>>> @@ -13,10 +13,12 @@ struct io_notif_data {
>>>> struct file *file;
>>>> struct ubuf_info uarg;
>>>> unsigned long account_pages;
>>>> + bool zc_used;
>>>> + bool zc_copied;
>>>
>>> IIRC io_notif_data is fully packed in 6.1, so placing zc_{used,copied}
>>> there might complicate backporting (if any). We can place them in io_kiocb
>>> directly and move in 6.2. Alternatively account_pages doesn't have to be
>>> long.
>>
>> As far as I can see kernel-dk-block/io_uring-6.1 already has your
>> shrink patches included...
>
> Sorry, I mean 6.0
So you want to backport to 6.0?
Find the current version below; sizeof(struct io_kiocb) will grow from
3*64 + 24 to 3*64 + 32 (on x86_64) so it stays within 4 cache lines.
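(A compile-time check along the lines of the following sketch could pin that
down; the placement is hypothetical and 64 is the x86_64 cache line size:)

/* hypothetical compile-time guard for the size claim above */
static_assert(sizeof(struct io_kiocb) <= 4 * 64,
              "io_kiocb must stay within 4 cache lines");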
I tried this first:
union {
        u8 iopoll_completed;
        struct {
                u8 zc_used:1;
                u8 zc_copied:1;
        };
};
But then WRITE_ONCE() complains about a bitfield write (it takes the
address of its target internally, and you can't take the address of a bitfield).
So let me know about the opt-in flag and I'll prepare real commits
including a patch that moves from struct io_kiocb to struct io_notif_data
on top.
metze
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index f5b687a787a3..189152ad78d6 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -515,6 +515,9 @@ struct io_kiocb {
u8 opcode;
/* polled IO has completed */
u8 iopoll_completed;
+ /* these will be moved to struct io_notif_data in 6.1 */
+ bool zc_used;
+ bool zc_copied;
/*
* Can be either a fixed buffer index, or used with provided buffers.
* For the latter, before issue it points to the buffer group ID,
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index ab7458033ee3..738d6234d1d9 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -350,6 +350,7 @@ struct io_uring_cqe {
#define IORING_CQE_F_MORE (1U << 1)
#define IORING_CQE_F_SOCK_NONEMPTY (1U << 2)
#define IORING_CQE_F_NOTIF (1U << 3)
+#define IORING_CQE_F_COPIED (1U << 4)
enum {
IORING_CQE_BUFFER_SHIFT = 16,
diff --git a/io_uring/notif.c b/io_uring/notif.c
index e37c6569d82e..033aca064b10 100644
--- a/io_uring/notif.c
+++ b/io_uring/notif.c
@@ -18,6 +18,10 @@ static void __io_notif_complete_tw(struct io_kiocb *notif, bool *locked)
__io_unaccount_mem(ctx->user, nd->account_pages);
nd->account_pages = 0;
}
+
+ if (notif->zc_copied || !notif->zc_used)
+ notif->cqe.flags |= IORING_CQE_F_COPIED;
+
io_req_task_complete(notif, locked);
}
@@ -28,6 +32,11 @@ static void io_uring_tx_zerocopy_callback(struct sk_buff *skb,
struct io_notif_data *nd = container_of(uarg, struct io_notif_data, uarg);
struct io_kiocb *notif = cmd_to_io_kiocb(nd);
+ if (success && !notif->zc_used && skb)
+ WRITE_ONCE(notif->zc_used, true);
+ else if (!success && !notif->zc_copied)
+ WRITE_ONCE(notif->zc_copied, true);
+
if (refcount_dec_and_test(&uarg->refcnt)) {
notif->io_task_work.func = __io_notif_complete_tw;
io_req_task_work_add(notif);
@@ -55,6 +64,7 @@ struct io_kiocb *io_alloc_notif(struct io_ring_ctx *ctx)
nd->account_pages = 0;
nd->uarg.flags = SKBFL_ZEROCOPY_FRAG | SKBFL_DONT_ORPHAN;
nd->uarg.callback = io_uring_tx_zerocopy_callback;
+ notif->zc_used = notif->zc_copied = false;
refcount_set(&nd->uarg.refcnt, 1);
return notif;
}
^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: IORING_SEND_NOTIF_USER_DATA (was Re: IORING_CQE_F_COPIED)
2022-10-21 9:27 ` Pavel Begunkov
@ 2022-10-21 9:45 ` Stefan Metzmacher
2022-10-21 11:20 ` Pavel Begunkov
2022-10-21 10:15 ` Stefan Metzmacher
1 sibling, 1 reply; 25+ messages in thread
From: Stefan Metzmacher @ 2022-10-21 9:45 UTC (permalink / raw)
To: Pavel Begunkov, io-uring, Jens Axboe
Cc: Jakub Kicinski, netdev, Dylan Yudaken
Am 21.10.22 um 11:27 schrieb Pavel Begunkov:
> On 10/21/22 09:32, Stefan Metzmacher wrote:
>> Hi Pavel,
>>
>>>>>> Experimenting with this stuff makes me wish for a way to
>>>>>> have a different 'user_data' field for the notif cqe,
>>>>>> maybe based on an IORING_RECVSEND_ flag; it would make my life
>>>>>> easier and avoid some complexity in userspace...
>>>>>> As I need to handle retry on short writes even with MSG_WAITALL
>>>>>> as EINTR and other errors could cause them.
>>>>>>
>>>>>> What do you think?
>>>>
>>>> Any comment on this?
>>>>
>>>> IORING_SEND_NOTIF_USER_DATA could let us use
>>>> notif->cqe.user_data = sqe->addr3;
>>>
>>> I'd rather not use the last available u64, tbh, that was the
>>> reason for not adding a second user_data in the first place.
>>
>> As far as I can see io_send_zc_prep has this:
>>
>> if (unlikely(READ_ONCE(sqe->__pad2[0]) || READ_ONCE(sqe->addr3)))
>>         return -EINVAL;
>>
>> both are u64...
>
> Hah, true, completely forgot about that one
So would a commit like below be fine for you?
Do you have anything in mind for SEND[MSG]_ZC that could possibly use
another u64 in future?
metze
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 738d6234d1d9..7a6272872334 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -300,6 +300,7 @@ enum io_uring_op {
#define IORING_RECVSEND_POLL_FIRST (1U << 0)
#define IORING_RECV_MULTISHOT (1U << 1)
#define IORING_RECVSEND_FIXED_BUF (1U << 2)
+#define IORING_SEND_NOTIF_USER_DATA (1U << 3)
/*
* accept flags stored in sqe->ioprio
diff --git a/io_uring/net.c b/io_uring/net.c
index 735eec545115..e1bc06b58cd7 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -938,7 +938,7 @@ int io_send_zc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
struct io_ring_ctx *ctx = req->ctx;
struct io_kiocb *notif;
- if (unlikely(READ_ONCE(sqe->__pad2[0]) || READ_ONCE(sqe->addr3)))
+ if (unlikely(READ_ONCE(sqe->__pad2[0])))
return -EINVAL;
/* we don't support IOSQE_CQE_SKIP_SUCCESS just yet */
if (req->flags & REQ_F_CQE_SKIP)
@@ -946,12 +946,19 @@ int io_send_zc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
zc->flags = READ_ONCE(sqe->ioprio);
if (zc->flags & ~(IORING_RECVSEND_POLL_FIRST |
- IORING_RECVSEND_FIXED_BUF))
+ IORING_RECVSEND_FIXED_BUF |
+ IORING_SEND_NOTIF_USER_DATA))
return -EINVAL;
notif = zc->notif = io_alloc_notif(ctx);
if (!notif)
return -ENOMEM;
- notif->cqe.user_data = req->cqe.user_data;
+ if (zc->flags & IORING_SEND_NOTIF_USER_DATA) {
+ notif->cqe.user_data = READ_ONCE(sqe->addr3);
+ } else {
+ if (unlikely(READ_ONCE(sqe->addr3)))
+ return -EINVAL;
+ notif->cqe.user_data = req->cqe.user_data;
+ }
notif->cqe.res = 0;
notif->cqe.flags = IORING_CQE_F_NOTIF;
req->flags |= REQ_F_NEED_CLEANUP;
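For illustration, the userspace side would then look roughly like this
(a sketch assuming liburing's io_uring_prep_send_zc() helper;
IORING_SEND_NOTIF_USER_DATA, notif_state and send_state are only the
proposal/placeholders from this mail, not a released API):

/* give the NOTIF cqe its own user_data via sqe->addr3 */
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

io_uring_prep_send_zc(sqe, sockfd, buf, len, 0,
		      IORING_SEND_NOTIF_USER_DATA);
sqe->addr3 = (__u64)(uintptr_t)notif_state;	/* user_data for the NOTIF cqe */
io_uring_sqe_set_data(sqe, send_state);		/* user_data for the send cqe */

io_uring_submit(&ring);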
* Re: IORING_SEND_NOTIF_USER_DATA (was Re: IORING_CQE_F_COPIED)
2022-10-21 9:27 ` Pavel Begunkov
2022-10-21 9:45 ` Stefan Metzmacher
@ 2022-10-21 10:15 ` Stefan Metzmacher
2022-10-21 11:26 ` Pavel Begunkov
1 sibling, 1 reply; 25+ messages in thread
From: Stefan Metzmacher @ 2022-10-21 10:15 UTC (permalink / raw)
To: Pavel Begunkov, io-uring, Jens Axboe; +Cc: Dylan Yudaken
Hi Pavel, and others...
>> As far as I can see io_send_zc_prep has this:
>>
>> if (unlikely(READ_ONCE(sqe->__pad2[0]) || READ_ONCE(sqe->addr3)))
>> return -EINVAL;
>>
>> both are u64...
>
> Hah, true, completely forgot about that one
BTW: any comment on my "[RFC PATCH 0/8] cleanup struct io_uring_sqe layout"
thread, which would make it much easier to figure out which fields are used? See
https://lore.kernel.org/io-uring/[email protected]/#r
metze
* Re: IORING_SEND_NOTIF_REPORT_USAGE (was Re: IORING_CQE_F_COPIED)
2022-10-21 9:36 ` Stefan Metzmacher
@ 2022-10-21 11:09 ` Pavel Begunkov
2022-10-21 14:03 ` Stefan Metzmacher
0 siblings, 1 reply; 25+ messages in thread
From: Pavel Begunkov @ 2022-10-21 11:09 UTC (permalink / raw)
To: Stefan Metzmacher, io-uring, Jens Axboe
Cc: Jakub Kicinski, netdev, Dylan Yudaken
On 10/21/22 10:36, Stefan Metzmacher wrote:
> Hi Pavel,
[...]
>> Right, I'm just tired of back porting patches by hand :)
>
> ok, I just assumed it would be 6.1 only.
I'm fine with 6.1 only, it'd make things easier. I thought from
your first postings you wanted it in 6.0. Then we don't need to care
about the placing of the copied/used flags.
>>> Otherwise we could have IORING_CQE_F_COPIED by default without opt-in
>>> flag...
>
> Do you still want an opt-in flag to get IORING_CQE_F_COPIED?
> If so what name do you want it to be?
Ala a IORING_SEND_* flag? Yes please.
*_REPORT_USAGE was fine but I'd make it IORING_SEND_ZC_REPORT_USAGE.
And can be extended if there is more info needed in the future.
And I don't mind using a bit in cqe->res, makes cflags less polluted.
>>>>> +static void io_uring_tx_zerocopy_callback_report_usage(struct sk_buff *skb,
>>>>> + struct ubuf_info *uarg,
>>>>> + bool success)
>>>>> +{
>>>>> + struct io_notif_data *nd = container_of(uarg, struct io_notif_data, uarg);
>>>>> +
>>>>> + if (success && !nd->zc_used && skb)
>>>>> + nd->zc_used = true;
>>>>> + else if (unlikely(!success && !nd->zc_copied))
>>>>> + nd->zc_copied = true;
>>>>
>>>> It's fine but racy, so let's WRITE_ONCE() to indicate it.
>>>
>>> I don't see how this could be a problem, but I can add it.
>>
>> It's not a problem, but better to be a little be more explicit
>> about parallel writes.
>
> ok.
>
>>>>> diff --git a/io_uring/notif.h b/io_uring/notif.h
>>>>> index 5b4d710c8ca5..5ac7a2745e52 100644
>>>>> --- a/io_uring/notif.h
>>>>> +++ b/io_uring/notif.h
>>>>> @@ -13,10 +13,12 @@ struct io_notif_data {
>>>>> struct file *file;
>>>>> struct ubuf_info uarg;
>>>>> unsigned long account_pages;
>>>>> + bool zc_used;
>>>>> + bool zc_copied;
>>>>
>>>> IIRC io_notif_data is fully packed in 6.1, so placing zc_{used,copied}
>>>> there might complicate backporting (if any). We can place them in io_kiocb
>>>> directly and move in 6.2. Alternatively account_pages doesn't have to be
>>>> long.
>>>
>>> As far as I can see kernel-dk-block/io_uring-6.1 alread has your
>>> shrink patches included...
>>
>> Sorry, I mean 6.0
>
> So you want to backport to 6.0?
>
> Find the current version below, sizeof(struct io_kiocb) will grow from
> 3*64 + 24 to 3*64 + 32 (on x86_64) so it stays within 4 cache lines.
>
> I tried this first:
>
> union {
> u8 iopoll_completed;
> struct {
> u8 zc_used:1;
> u8 zc_copied:1;
> };
> };
>
> But then WRITE_ONCE() complains about a bitfield write.
rightfully so, it can't be a bitfield as it would be racy
and not only in theory this time.
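To spell out the race (a sketch, not tied to the patch): two bitfields
share one byte, so each store is a read-modify-write of that byte and
concurrent callbacks on two CPUs can lose one of the updates, while two
separate bools never overlap:

#include <stdbool.h>
#include <stdint.h>

/* racy: setting zc_used on CPU0 and zc_copied on CPU1 both
 * rewrite the same byte, so one of the updates can be lost
 */
struct flags_bitfield {
	uint8_t zc_used:1;
	uint8_t zc_copied:1;
};

/* fine for latch-to-true semantics: each flag has its own byte,
 * the stores never overlap and WRITE_ONCE() works on them
 */
struct flags_bools {
	bool zc_used;
	bool zc_copied;
};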
> So let me know about the opt-in flag and I'll prepare real commits
> including a patch that moves from struct io_kiocb to struct io_notif_data
> on top.
Yeah, better to be opt-in, but apart from that and the comments above
it looks good.
> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
> index f5b687a787a3..189152ad78d6 100644
> --- a/include/linux/io_uring_types.h
> +++ b/include/linux/io_uring_types.h
> @@ -515,6 +515,9 @@ struct io_kiocb {
> u8 opcode;
> /* polled IO has completed */
> u8 iopoll_completed;
> + /* these will be moved to struct io_notif_data in 6.1 */
> + bool zc_used;
> + bool zc_copied;
> /*
> * Can be either a fixed buffer index, or used with provided buffers.
> * For the latter, before issue it points to the buffer group ID,
> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> index ab7458033ee3..738d6234d1d9 100644
> --- a/include/uapi/linux/io_uring.h
> +++ b/include/uapi/linux/io_uring.h
> @@ -350,6 +350,7 @@ struct io_uring_cqe {
> #define IORING_CQE_F_MORE (1U << 1)
> #define IORING_CQE_F_SOCK_NONEMPTY (1U << 2)
> #define IORING_CQE_F_NOTIF (1U << 3)
> +#define IORING_CQE_F_COPIED (1U << 4)
>
> enum {
> IORING_CQE_BUFFER_SHIFT = 16,
> diff --git a/io_uring/notif.c b/io_uring/notif.c
> index e37c6569d82e..033aca064b10 100644
> --- a/io_uring/notif.c
> +++ b/io_uring/notif.c
> @@ -18,6 +18,10 @@ static void __io_notif_complete_tw(struct io_kiocb *notif, bool *locked)
> __io_unaccount_mem(ctx->user, nd->account_pages);
> nd->account_pages = 0;
> }
> +
> + if (notif->zc_copied || !notif->zc_used)
> + notif->cqe.flags |= IORING_CQE_F_COPIED;
> +
As discussed above, should depend on IORING_SEND_ZC_REPORT_USAGE
> io_req_task_complete(notif, locked);
> }
>
> @@ -28,6 +32,11 @@ static void io_uring_tx_zerocopy_callback(struct sk_buff *skb,
> struct io_notif_data *nd = container_of(uarg, struct io_notif_data, uarg);
> struct io_kiocb *notif = cmd_to_io_kiocb(nd);
>
> + if (success && !notif->zc_used && skb)
> + WRITE_ONCE(notif->zc_used, true);
> + else if (!success && !notif->zc_copied)
> + WRITE_ONCE(notif->zc_copied, true);
> +
> if (refcount_dec_and_test(&uarg->refcnt)) {
> notif->io_task_work.func = __io_notif_complete_tw;
> io_req_task_work_add(notif);
> @@ -55,6 +64,7 @@ struct io_kiocb *io_alloc_notif(struct io_ring_ctx *ctx)
> nd->account_pages = 0;
> nd->uarg.flags = SKBFL_ZEROCOPY_FRAG | SKBFL_DONT_ORPHAN;
> nd->uarg.callback = io_uring_tx_zerocopy_callback;
> + notif->zc_used = notif->zc_copied = false;
> refcount_set(&nd->uarg.refcnt, 1);
> return notif;
> }
>
--
Pavel Begunkov
* Re: IORING_SEND_NOTIF_USER_DATA (was Re: IORING_CQE_F_COPIED)
2022-10-21 9:45 ` Stefan Metzmacher
@ 2022-10-21 11:20 ` Pavel Begunkov
2022-10-21 12:10 ` Stefan Metzmacher
0 siblings, 1 reply; 25+ messages in thread
From: Pavel Begunkov @ 2022-10-21 11:20 UTC (permalink / raw)
To: Stefan Metzmacher, io-uring, Jens Axboe
Cc: Jakub Kicinski, netdev, Dylan Yudaken
On 10/21/22 10:45, Stefan Metzmacher wrote:
> Am 21.10.22 um 11:27 schrieb Pavel Begunkov:
>> On 10/21/22 09:32, Stefan Metzmacher wrote:
>>> Hi Pavel,
>>>
>>>>>>> Experimenting with this stuff lets me wish to have a way to
>>>>>>> have a different 'user_data' field for the notif cqe,
>>>>>>> maybe based on a IORING_RECVSEND_ flag, it may make my life
>>>>>>> easier and would avoid some complexity in userspace...
>>>>>>> As I need to handle retry on short writes even with MSG_WAITALL
>>>>>>> as EINTR and other errors could cause them.
>>>>>>>
>>>>>>> What do you think?
>>>>>
>>>>> Any comment on this?
>>>>>
>>>>> IORING_SEND_NOTIF_USER_DATA could let us use
>>>>> notif->cqe.user_data = sqe->addr3;
>>>>
>>>> I'd rather not use the last available u64, tbh, that was the
>>>> reason for not adding a second user_data in the first place.
>>>
>>> As far as I can see io_send_zc_prep has this:
>>>
>>> if (unlikely(READ_ONCE(sqe->__pad2[0]) || READ_ONCE(sqe->addr3)))
>>> return -EINVAL;
>>>
>>> both are u64...
>>
>> Hah, true, completely forgot about that one
>
> So would a commit like below be fine for you?
>
> Do you have anything in mind for SEND[MSG]_ZC that could possibly use
> another u64 in future?
It'll most likely be taken in the future; some features are planned,
some I can imagine. The question is how necessary this one is and/or
how much simpler it would make it considering that CQEs are ordered
and apps still need to check for F_MORE. It shouldn't even require
refcounting. Can you elaborate on the simplifying userspace part?
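To make the ordering point concrete, a minimal sketch of what an app
can already do today (assuming liburing; handle_notif() and
handle_send_result() are placeholders, and both cqes carry the
request's user_data):

struct io_uring_cqe *cqe;
struct my_state *s;

io_uring_wait_cqe(&ring, &cqe);
s = io_uring_cqe_get_data(cqe);

if (cqe->flags & IORING_CQE_F_NOTIF) {
	/* second cqe: the kernel is done with the buffer */
	handle_notif(s);
} else {
	/* first cqe: the send result; F_MORE means a notif cqe follows */
	handle_send_result(s, cqe->res, cqe->flags & IORING_CQE_F_MORE);
}
io_uring_cqe_seen(&ring, cqe);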
--
Pavel Begunkov
* Re: IORING_SEND_NOTIF_USER_DATA (was Re: IORING_CQE_F_COPIED)
2022-10-21 10:15 ` Stefan Metzmacher
@ 2022-10-21 11:26 ` Pavel Begunkov
2022-10-21 12:38 ` Stefan Metzmacher
0 siblings, 1 reply; 25+ messages in thread
From: Pavel Begunkov @ 2022-10-21 11:26 UTC (permalink / raw)
To: Stefan Metzmacher, io-uring, Jens Axboe; +Cc: Dylan Yudaken
On 10/21/22 11:15, Stefan Metzmacher wrote:
> Hi Pavel, and others...
>
>>> As far as I can see io_send_zc_prep has this:
>>>
>>> if (unlikely(READ_ONCE(sqe->__pad2[0]) || READ_ONCE(sqe->addr3)))
>>> return -EINVAL;
>>>
>>> both are u64...
>>
>> Hah, true, completely forgot about that one
>
> BTW: any comment on my "[RFC PATCH 0/8] cleanup struct io_uring_sqe layout"
> thread, which would make it much easier to figure out which fields are used? See
> https://lore.kernel.org/io-uring/[email protected]/#r
I admit the sqe layout is messy as there is no good separation b/w
common vs opcode specific fields, but it's not like the new layout
makes it much simpler. E.g. looking up who is using a field will get
more complicated. iow, no strong opinion on it.
btw, will be happy to have the include guard patch from one of
your branches
--
Pavel Begunkov
* Re: IORING_SEND_NOTIF_USER_DATA (was Re: IORING_CQE_F_COPIED)
2022-10-21 11:20 ` Pavel Begunkov
@ 2022-10-21 12:10 ` Stefan Metzmacher
0 siblings, 0 replies; 25+ messages in thread
From: Stefan Metzmacher @ 2022-10-21 12:10 UTC (permalink / raw)
To: Pavel Begunkov, io-uring, Jens Axboe
Cc: Jakub Kicinski, netdev, Dylan Yudaken
Am 21.10.22 um 13:20 schrieb Pavel Begunkov:
> On 10/21/22 10:45, Stefan Metzmacher wrote:
>> Am 21.10.22 um 11:27 schrieb Pavel Begunkov:
>>> On 10/21/22 09:32, Stefan Metzmacher wrote:
>>>> Hi Pavel,
>>>>
>>>>>>>> Experimenting with this stuff lets me wish to have a way to
>>>>>>>> have a different 'user_data' field for the notif cqe,
>>>>>>>> maybe based on a IORING_RECVSEND_ flag, it may make my life
>>>>>>>> easier and would avoid some complexity in userspace...
>>>>>>>> As I need to handle retry on short writes even with MSG_WAITALL
>>>>>>>> as EINTR and other errors could cause them.
>>>>>>>>
>>>>>>>> What do you think?
>>>>>>
>>>>>> Any comment on this?
>>>>>>
>>>>>> IORING_SEND_NOTIF_USER_DATA could let us use
>>>>>> notif->cqe.user_data = sqe->addr3;
>>>>>
>>>>> I'd rather not use the last available u64, tbh, that was the
>>>>> reason for not adding a second user_data in the first place.
>>>>
>>>> As far as I can see io_send_zc_prep has this:
>>>>
>>>> if (unlikely(READ_ONCE(sqe->__pad2[0]) || READ_ONCE(sqe->addr3)))
>>>> return -EINVAL;
>>>>
>>>> both are u64...
>>>
>>> Hah, true, completely forgot about that one
>>
>> So would a commit like below be fine for you?
>>
>> Do you have anything in mind for SEND[MSG]_ZC that could possibly use
>> another u64 in future?
>
> It'll most likely be taken in the future, some features are planned
> some I can imagine.
Can you give examples? I can't imagine any possible feature.
> The question is how necessary this one is and/or
> how much simpler it would make it considering that CQEs are ordered
> and apps still need to check for F_MORE. It shouldn't even require
> refcounting. Can you elaborate on the simplifying userspace part?
>
It's not critical, it would just make it easier to dispatch
different callback functions for the two cases.
The current problem I'm facing is that I have a structure
holding the state of a response, with a single embedded
completion structure:
(simplified) struct completion {
uint32_t generation;
void (*callback_fn)(void *callback_private, const struct io_uring_cqe *cqe);
void *callback_private;
};
I use the memory address of the completion structure glued with the lower bits of the generation
as 'user_data'. Imagine that I got a short write from SENDMSG_ZC/WAITALL
because EINTR was generated; then I need to retry from userspace, which
I'd do immediately without waiting for the NOTIF cqe to arrive.
For each incoming cqe I get the completion address and the generation
out of user_data, and then verify the generation against the one stored in
the completion in order to detect bugs, before passing it over to callback_fn().
Because I still need to handle the NOTIF cqe from the first try,
I can't change the generation for the next try.
I thought about using two completion structures: one for the main SENDMSG_ZC result
(which gets its generation incremented with each retry) and one for the NOTIF cqes,
which keeps its generation stable and has a simple callback_fn that just waits for a
refcount to reach 0.
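As a rough sketch of that encoding (assuming the completion structure
is at least 8-byte aligned, so the low 3 bits are free for the
generation):

#include <stdint.h>

#define GEN_MASK 0x7ULL

static inline uint64_t completion_to_user_data(struct completion *c)
{
	/* pointer in the high bits, truncated generation in the low bits */
	return (uint64_t)(uintptr_t)c | (c->generation & GEN_MASK);
}

static inline struct completion *user_data_to_completion(uint64_t ud,
							 uint32_t *generation)
{
	*generation = ud & GEN_MASK;
	return (struct completion *)(uintptr_t)(ud & ~GEN_MASK);
}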
Most likely I just need to sit down and concentrate to get the
refcounting and similar things sorted out.
If there are really useful things we will do with addr3 and __pad2[0],
I can try to cope with it... It would just be sad if they ended up unused anyway.
metze
* Re: IORING_SEND_NOTIF_USER_DATA (was Re: IORING_CQE_F_COPIED)
2022-10-21 11:26 ` Pavel Begunkov
@ 2022-10-21 12:38 ` Stefan Metzmacher
0 siblings, 0 replies; 25+ messages in thread
From: Stefan Metzmacher @ 2022-10-21 12:38 UTC (permalink / raw)
To: Pavel Begunkov, io-uring, Jens Axboe; +Cc: Dylan Yudaken
Am 21.10.22 um 13:26 schrieb Pavel Begunkov:
> On 10/21/22 11:15, Stefan Metzmacher wrote:
>> Hi Pavel, and others...
>>
>>>> As far as I can see io_send_zc_prep has this:
>>>>
>>>> if (unlikely(READ_ONCE(sqe->__pad2[0]) || READ_ONCE(sqe->addr3)))
>>>> return -EINVAL;
>>>>
>>>> both are u64...
>>>
>>> Hah, true, completely forgot about that one
>>
>> BTW: any comment on my "[RFC PATCH 0/8] cleanup struct io_uring_sqe layout"
>> thread, that would make it much easier to figure out which fields are used..., see
>> https://lore.kernel.org/io-uring/[email protected]/#r
>
> I admit the sqe layout is messy as there is no good separation b/w
> common vs opcode specific fields, but it's not like the new layout
> makes it much simpler.
Really?
> E.g. looking who is using a field will get more complicated.
Why should anyone care what fields other opcodes use
and how they are named?
For legacy reasons we have to live with
struct io_uring_sqe_common common; in the middle.
Apart from that, each opcode should be free to use
5 u64 fields and 1 u32 field.
But e.g.
+ /* IORING_OP_FALLOCATE */
+ struct io_uring_sqe_fallocate {
+ struct io_uring_sqe_hdr hdr;
+
+ __u64 offset;
+ __u64 length;
+ __u32 mode;
+ __u32 u32_ofs28;
+
+ struct io_uring_sqe_common common;
+
+ __u32 u32_ofs44;
+ __u64 u64_ofs48;
+ __u64 u64_ofs56;
+ } fallocate;
This immediately shows what's used and what's not,
and it avoids brain-dead things like using
sqe->addr instead of sqe->len for the length.
And it makes it trivial to verify that the _prep function
rejects any unused field.
And it would make it easier to write per-opcode tracing code,
which can be easily analyzed.
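For example, a prep function under such a layout could reject the
unused fields by name (a sketch; sqe->fallocate and the *_ofs* names
are the hypothetical ones from the RFC, not mainline code):

static int io_fallocate_prep(struct io_kiocb *req,
			     const struct io_uring_sqe *sqe)
{
	const struct io_uring_sqe_fallocate *fa = &sqe->fallocate;
	struct io_sync *sync = io_kiocb_to_cmd(req, struct io_sync);

	if (unlikely(READ_ONCE(fa->u32_ofs28) || READ_ONCE(fa->u32_ofs44) ||
		     READ_ONCE(fa->u64_ofs48) || READ_ONCE(fa->u64_ofs56)))
		return -EINVAL;

	sync->off = READ_ONCE(fa->offset);
	sync->len = READ_ONCE(fa->length);
	sync->mode = READ_ONCE(fa->mode);
	return 0;
}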
> iow, no strong opinion on it.
>
> btw, will be happy to have the include guard patch from one of
> your branches
This one from the io_uring_livepatch.v6.1 branch?
https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=3c36e05baad737f5cb896fdc9fc53dc1b74d2499
metze
* Re: IORING_SEND_NOTIF_REPORT_USAGE (was Re: IORING_CQE_F_COPIED)
2022-10-21 11:09 ` Pavel Begunkov
@ 2022-10-21 14:03 ` Stefan Metzmacher
2022-10-27 8:47 ` Stefan Metzmacher
2022-10-27 10:51 ` Pavel Begunkov
0 siblings, 2 replies; 25+ messages in thread
From: Stefan Metzmacher @ 2022-10-21 14:03 UTC (permalink / raw)
To: Pavel Begunkov, io-uring, Jens Axboe
Cc: Jakub Kicinski, netdev, Dylan Yudaken
Am 21.10.22 um 13:09 schrieb Pavel Begunkov:
> On 10/21/22 10:36, Stefan Metzmacher wrote:
>> Hi Pavel,
> [...]
>>> Right, I'm just tired of back porting patches by hand :)
>>
>> ok, I just assumed it would be 6.1 only.
>
> I'm fine with 6.1 only, it'd make things easier. I thought from
> your first postings you wanted it 6.0. Then we don't need to care
> about the placing of the copied/used flags.
>
>>>> Otherwise we could have IORING_CQE_F_COPIED by default without opt-in
>>>> flag...
>>
>> Do you still want an opt-in flag to get IORING_CQE_F_COPIED?
>> If so what name do you want it to be?
>
> Ala a IORING_SEND_* flag? Yes please.
>
> *_REPORT_USAGE was fine but I'd make it IORING_SEND_ZC_REPORT_USAGE.
> And can be extended if there is more info needed in the future.
>
> And I don't mind using a bit in cqe->res, makes cflags less polluted.
So no worries about the delayed/skip sendmsg completion anymore?
Should I define it like this, ok?
#define IORING_NOTIF_USAGE_ZC_COPIED (1U << 31)
See the full patch below...
metze
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index d69ae7eba773..32e1f2a55b70 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -296,10 +296,24 @@ enum io_uring_op {
*
* IORING_RECVSEND_FIXED_BUF Use registered buffers, the index is stored in
* the buf_index field.
+
+ * IORING_SEND_NOTIF_REPORT_USAGE
+ * If SEND[MSG]_ZC should report
+ * the zerocopy usage in cqe.res
+ * for the IORING_CQE_F_NOTIF cqe.
+ * IORING_NOTIF_USAGE_ZC_COPIED if data was copied
+ * (at least partially).
*/
#define IORING_RECVSEND_POLL_FIRST (1U << 0)
#define IORING_RECV_MULTISHOT (1U << 1)
#define IORING_RECVSEND_FIXED_BUF (1U << 2)
+#define IORING_SEND_ZC_REPORT_USAGE (1U << 3)
+
+/*
+ * cqe.res for IORING_CQE_F_NOTIF if
+ * IORING_SEND_ZC_REPORT_USAGE was requested
+ */
+#define IORING_NOTIF_USAGE_ZC_COPIED (1U << 31)
/*
* accept flags stored in sqe->ioprio
diff --git a/io_uring/net.c b/io_uring/net.c
index 56078f47efe7..1aa3b50b3e82 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -939,7 +939,8 @@ int io_send_zc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
zc->flags = READ_ONCE(sqe->ioprio);
if (zc->flags & ~(IORING_RECVSEND_POLL_FIRST |
- IORING_RECVSEND_FIXED_BUF))
+ IORING_RECVSEND_FIXED_BUF |
+ IORING_SEND_ZC_REPORT_USAGE))
return -EINVAL;
notif = zc->notif = io_alloc_notif(ctx);
if (!notif)
@@ -957,6 +958,9 @@ int io_send_zc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
req->imu = READ_ONCE(ctx->user_bufs[idx]);
io_req_set_rsrc_node(notif, ctx, 0);
}
+ if (zc->flags & IORING_SEND_ZC_REPORT_USAGE) {
+ io_notif_to_data(notif)->zc_report = true;
+ }
if (req->opcode == IORING_OP_SEND_ZC) {
if (READ_ONCE(sqe->__pad3[0]))
diff --git a/io_uring/notif.c b/io_uring/notif.c
index e37c6569d82e..4bfef10161fa 100644
--- a/io_uring/notif.c
+++ b/io_uring/notif.c
@@ -18,6 +18,10 @@ static void __io_notif_complete_tw(struct io_kiocb *notif, bool *locked)
__io_unaccount_mem(ctx->user, nd->account_pages);
nd->account_pages = 0;
}
+
+ if (nd->zc_report && (nd->zc_copied || !nd->zc_used))
+ notif->cqe.res |= IORING_NOTIF_USAGE_ZC_COPIED;
+
io_req_task_complete(notif, locked);
}
@@ -28,6 +32,13 @@ static void io_uring_tx_zerocopy_callback(struct sk_buff *skb,
struct io_notif_data *nd = container_of(uarg, struct io_notif_data, uarg);
struct io_kiocb *notif = cmd_to_io_kiocb(nd);
+ if (nd->zc_report) {
+ if (success && !nd->zc_used && skb)
+ WRITE_ONCE(nd->zc_used, true);
+ else if (!success && !nd->zc_copied)
+ WRITE_ONCE(nd->zc_copied, true);
+ }
+
if (refcount_dec_and_test(&uarg->refcnt)) {
notif->io_task_work.func = __io_notif_complete_tw;
io_req_task_work_add(notif);
@@ -55,6 +66,7 @@ struct io_kiocb *io_alloc_notif(struct io_ring_ctx *ctx)
nd->account_pages = 0;
nd->uarg.flags = SKBFL_ZEROCOPY_FRAG | SKBFL_DONT_ORPHAN;
nd->uarg.callback = io_uring_tx_zerocopy_callback;
+ nd->zc_report = nd->zc_used = nd->zc_copied = false;
refcount_set(&nd->uarg.refcnt, 1);
return notif;
}
diff --git a/io_uring/notif.h b/io_uring/notif.h
index e4fbcae0f3fd..6be2e5ae8581 100644
--- a/io_uring/notif.h
+++ b/io_uring/notif.h
@@ -15,6 +15,9 @@ struct io_notif_data {
struct file *file;
struct ubuf_info uarg;
unsigned long account_pages;
+ bool zc_report;
+ bool zc_used;
+ bool zc_copied;
};
void io_notif_flush(struct io_kiocb *notif);
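For completeness, the userspace consumer would look roughly like this
(a sketch assuming liburing helpers; error handling and the send cqe
itself are omitted):

struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
struct io_uring_cqe *cqe;

io_uring_prep_send_zc(sqe, sockfd, buf, len, 0,
		      IORING_SEND_ZC_REPORT_USAGE);
io_uring_submit(&ring);

io_uring_wait_cqe(&ring, &cqe);
if ((cqe->flags & IORING_CQE_F_NOTIF) &&
    (cqe->res & IORING_NOTIF_USAGE_ZC_COPIED))
	fprintf(stderr, "send fell back to copying the data\n");
io_uring_cqe_seen(&ring, cqe);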
* Re: IORING_SEND_NOTIF_REPORT_USAGE (was Re: IORING_CQE_F_COPIED)
2022-10-21 14:03 ` Stefan Metzmacher
@ 2022-10-27 8:47 ` Stefan Metzmacher
2022-10-27 10:51 ` Pavel Begunkov
1 sibling, 0 replies; 25+ messages in thread
From: Stefan Metzmacher @ 2022-10-27 8:47 UTC (permalink / raw)
To: Pavel Begunkov, io-uring, Jens Axboe
Cc: Jakub Kicinski, netdev, Dylan Yudaken
Hi Pavel,
>> Ala a IORING_SEND_* flag? Yes please.
>>
>> *_REPORT_USAGE was fine but I'd make it IORING_SEND_ZC_REPORT_USAGE.
>> And can be extended if there is more info needed in the future.
>>
>> And I don't mind using a bit in cqe->res, makes cflags less polluted.
>
> So no worries about the delayed/skip sendmsg completion anymore?
>
> Should I define it like this, ok?
>
> #define IORING_NOTIF_USAGE_ZC_COPIED (1U << 31)
>
> See the full patch below...
Apart from still having IORING_SEND_NOTIF_REPORT_USAGE
in the comment... (which I'll fix...)
Is this now fine for you? Then I would post a real patch.
Thanks!
metze
* Re: IORING_SEND_NOTIF_REPORT_USAGE (was Re: IORING_CQE_F_COPIED)
2022-10-21 14:03 ` Stefan Metzmacher
2022-10-27 8:47 ` Stefan Metzmacher
@ 2022-10-27 10:51 ` Pavel Begunkov
1 sibling, 0 replies; 25+ messages in thread
From: Pavel Begunkov @ 2022-10-27 10:51 UTC (permalink / raw)
To: Stefan Metzmacher, io-uring, Jens Axboe
Cc: Jakub Kicinski, netdev, Dylan Yudaken
On 10/21/22 15:03, Stefan Metzmacher wrote:
> Am 21.10.22 um 13:09 schrieb Pavel Begunkov:
>> On 10/21/22 10:36, Stefan Metzmacher wrote:
>>> Hi Pavel,
>> [...]
>>>> Right, I'm just tired of back porting patches by hand :)
>>>
>>> ok, I just assumed it would be 6.1 only.
>>
>> I'm fine with 6.1 only, it'd make things easier. I thought from
>> your first postings you wanted it 6.0. Then we don't need to care
>> about the placing of the copied/used flags.
>>
>>>>> Otherwise we could have IORING_CQE_F_COPIED by default without opt-in
>>>>> flag...
>>>
>>> Do you still want an opt-in flag to get IORING_CQE_F_COPIED?
>>> If so what name do you want it to be?
>>
>> Ala a IORING_SEND_* flag? Yes please.
>>
>> *_REPORT_USAGE was fine but I'd make it IORING_SEND_ZC_REPORT_USAGE.
>> And can be extended if there is more info needed in the future.
>>
>> And I don't mind using a bit in cqe->res, makes cflags less polluted.
>
> So no worries about the delayed/skip sendmsg completion anymore?
I'll just make it incompatible with the reporting.
> Should I define it like this, ok?
>
> #define IORING_NOTIF_USAGE_ZC_COPIED (1U << 31)
>
> See the full patch below...
Looks good
--
Pavel Begunkov
Thread overview: 25+ messages
2022-10-14 11:06 IORING_CQE_F_COPIED Stefan Metzmacher
2022-10-17 16:46 ` IORING_CQE_F_COPIED Pavel Begunkov
2022-10-18 8:43 ` IORING_CQE_F_COPIED Stefan Metzmacher
2022-10-19 15:06 ` IORING_CQE_F_COPIED Pavel Begunkov
2022-10-19 16:12 ` IORING_CQE_F_COPIED Stefan Metzmacher
2022-10-20 2:24 ` IORING_CQE_F_COPIED Pavel Begunkov
2022-10-20 10:04 ` IORING_SEND_NOTIF_REPORT_USAGE (was Re: IORING_CQE_F_COPIED) Stefan Metzmacher
2022-10-20 13:46 ` Pavel Begunkov
2022-10-20 14:51 ` Stefan Metzmacher
2022-10-20 15:31 ` Pavel Begunkov
2022-10-21 9:36 ` Stefan Metzmacher
2022-10-21 11:09 ` Pavel Begunkov
2022-10-21 14:03 ` Stefan Metzmacher
2022-10-27 8:47 ` Stefan Metzmacher
2022-10-27 10:51 ` Pavel Begunkov
2022-10-20 10:10 ` IORING_SEND_NOTIF_USER_DATA " Stefan Metzmacher
2022-10-20 15:37 ` Pavel Begunkov
2022-10-21 8:32 ` Stefan Metzmacher
2022-10-21 9:27 ` Pavel Begunkov
2022-10-21 9:45 ` Stefan Metzmacher
2022-10-21 11:20 ` Pavel Begunkov
2022-10-21 12:10 ` Stefan Metzmacher
2022-10-21 10:15 ` Stefan Metzmacher
2022-10-21 11:26 ` Pavel Begunkov
2022-10-21 12:38 ` Stefan Metzmacher