* [PATCHSET 0/3] Add support for IORING_CQE_F_POLLED
@ 2025-07-11 23:59 ` Jens Axboe
2025-07-11 23:59 ` [PATCH 1/3] io_uring/poll: cleanup apoll freeing Jens Axboe
` (2 more replies)
0 siblings, 3 replies; 15+ messages in thread
From: Jens Axboe @ 2025-07-11 23:59 UTC (permalink / raw)
To: io-uring
Hi,
It can be useful to know if a given file/socket/pipe needed to go
through poll arming to successfully execute the request. For example, a
write to a pipe will normally succeed inline. Ditto for a send on a
socket. But if the socket or pipe is full, then io_uring must wait for a
POLLOUT trigger to execute the request. This can be useful backpressure
information for an application. On the read side, it tells the
application whether data was readily available, or whether POLLIN
triggering was needed before the data could be read/received.
This patchset adds support for IORING_CQE_F_POLLED, which is set in
cqe->flags if a request needed to go through the poll machinery before
it could get successfully executed.
Patch 1 is just an unrelated cleanup I spotted while doing this work.
Patch 2 adds a request flag that gets set via poll wake, and patch 3
wires up using this flag to set IORING_CQE_F_POLLED.
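As a rough illustration of the intended use (a hypothetical liburing
sketch, not part of these patches; throttle_sender() is made-up
application policy), an application could consume the flag like so:

#include <errno.h>
#include <liburing.h>

extern void throttle_sender(void);	/* hypothetical app-defined policy */

static int send_with_backpressure(struct io_uring *ring, int sockfd,
				  const void *buf, size_t len)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
	struct io_uring_cqe *cqe;
	int ret;

	if (!sqe)
		return -EBUSY;
	io_uring_prep_send(sqe, sockfd, buf, len, 0);
	io_uring_submit(ring);

	ret = io_uring_wait_cqe(ring, &cqe);
	if (ret < 0)
		return ret;
	ret = cqe->res;
	/* Flag from patch 3: the completion needed a POLLOUT wakeup, so
	 * the socket was full on the initial attempt - back off. */
	if (cqe->flags & IORING_CQE_F_POLLED)
		throttle_sender();
	io_uring_cqe_seen(ring, cqe);
	return ret;
}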
liburing test cases here:
https://git.kernel.dk/cgit/liburing/log/?h=cqe-polled
and the kernel patches can also be found here:
https://git.kernel.dk/cgit/linux/log/?h=io_uring-send-full
include/linux/io_uring_types.h | 3 +++
include/uapi/linux/io_uring.h | 6 ++++++
io_uring/io_uring.c | 14 ++++++--------
io_uring/io_uring.h | 2 ++
io_uring/poll.c | 1 +
5 files changed, 18 insertions(+), 8 deletions(-)
--
Jens Axboe
* [PATCH 1/3] io_uring/poll: cleanup apoll freeing
2025-07-11 23:59 ` Jens Axboe
@ 2025-07-11 23:59 ` Jens Axboe
2025-07-11 23:59 ` [PATCH 2/3] io_uring/poll: flag request as having gone through poll wake machinery Jens Axboe
2025-07-11 23:59 ` [PATCH 3/3] io_uring: add IORING_CQE_F_POLLED flag Jens Axboe
2 siblings, 0 replies; 15+ messages in thread
From: Jens Axboe @ 2025-07-11 23:59 UTC (permalink / raw)
To: io-uring; +Cc: Jens Axboe
No point in having REQ_F_POLLED in both IO_REQ_CLEAN_FLAGS and in
IO_REQ_CLEAN_SLOW_FLAGS, and having both io_free_batch_list() and
io_clean_op() check for it and clean up after it.
Move REQ_F_POLLED to IO_REQ_CLEAN_SLOW_FLAGS and drop it from
IO_REQ_CLEAN_FLAGS, and have only io_free_batch_list() do the check and
freeing.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
io_uring/io_uring.c | 11 +++--------
1 file changed, 3 insertions(+), 8 deletions(-)
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 895740c955d0..4ef69dd58734 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -114,11 +114,11 @@
#define IO_REQ_LINK_FLAGS (REQ_F_LINK | REQ_F_HARDLINK)
#define IO_REQ_CLEAN_FLAGS (REQ_F_BUFFER_SELECTED | REQ_F_NEED_CLEANUP | \
- REQ_F_POLLED | REQ_F_INFLIGHT | REQ_F_CREDS | \
- REQ_F_ASYNC_DATA)
+ REQ_F_INFLIGHT | REQ_F_CREDS | REQ_F_ASYNC_DATA)
#define IO_REQ_CLEAN_SLOW_FLAGS (REQ_F_REFCOUNT | IO_REQ_LINK_FLAGS | \
- REQ_F_REISSUE | IO_REQ_CLEAN_FLAGS)
+ REQ_F_REISSUE | REQ_F_POLLED | \
+ IO_REQ_CLEAN_FLAGS)
#define IO_TCTX_REFS_CACHE_NR (1U << 10)
@@ -392,11 +392,6 @@ static void io_clean_op(struct io_kiocb *req)
if (def->cleanup)
def->cleanup(req);
}
- if ((req->flags & REQ_F_POLLED) && req->apoll) {
- kfree(req->apoll->double_poll);
- kfree(req->apoll);
- req->apoll = NULL;
- }
if (req->flags & REQ_F_INFLIGHT)
atomic_dec(&req->tctx->inflight_tracked);
if (req->flags & REQ_F_CREDS)
--
2.50.0
* [PATCH 2/3] io_uring/poll: flag request as having gone through poll wake machinery
2025-07-11 23:59 ` Jens Axboe
2025-07-11 23:59 ` [PATCH 1/3] io_uring/poll: cleanup apoll freeing Jens Axboe
@ 2025-07-11 23:59 ` Jens Axboe
2025-07-12 11:39 ` Pavel Begunkov
2025-07-11 23:59 ` [PATCH 3/3] io_uring: add IORING_CQE_F_POLLED flag Jens Axboe
2 siblings, 1 reply; 15+ messages in thread
From: Jens Axboe @ 2025-07-11 23:59 UTC (permalink / raw)
To: io-uring; +Cc: Jens Axboe
No functional changes in this patch, just in preparation for being able
to flag completions as having been triggered via poll.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
include/linux/io_uring_types.h | 3 +++
io_uring/poll.c | 1 +
2 files changed, 4 insertions(+)
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 80a178f3d896..b56fe2247077 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -505,6 +505,7 @@ enum {
REQ_F_HAS_METADATA_BIT,
REQ_F_IMPORT_BUFFER_BIT,
REQ_F_SQE_COPIED_BIT,
+ REQ_F_POLL_WAKE_BIT,
/* not a real bit, just to check we're not overflowing the space */
__REQ_F_LAST_BIT,
@@ -596,6 +597,8 @@ enum {
REQ_F_IMPORT_BUFFER = IO_REQ_FLAG(REQ_F_IMPORT_BUFFER_BIT),
/* ->sqe_copy() has been called, if necessary */
REQ_F_SQE_COPIED = IO_REQ_FLAG(REQ_F_SQE_COPIED_BIT),
+ /* request went through poll wakeup machinery */
+ REQ_F_POLL_WAKE = IO_REQ_FLAG(REQ_F_POLL_WAKE_BIT),
};
typedef void (*io_req_tw_func_t)(struct io_kiocb *req, io_tw_token_t tw);
diff --git a/io_uring/poll.c b/io_uring/poll.c
index c7e9fb34563d..e1950b744f3b 100644
--- a/io_uring/poll.c
+++ b/io_uring/poll.c
@@ -423,6 +423,7 @@ static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
else
req->flags &= ~REQ_F_SINGLE_POLL;
}
+ req->flags |= REQ_F_POLL_WAKE;
__io_poll_execute(req, mask);
}
return 1;
--
2.50.0
* [PATCH 3/3] io_uring: add IORING_CQE_F_POLLED flag
2025-07-11 23:59 ` Jens Axboe
2025-07-11 23:59 ` [PATCH 1/3] io_uring/poll: cleanup apoll freeing Jens Axboe
2025-07-11 23:59 ` [PATCH 2/3] io_uring/poll: flag request as having gone through poll wake machinery Jens Axboe
@ 2025-07-11 23:59 ` Jens Axboe
2025-07-12 11:34 ` Pavel Begunkov
2 siblings, 1 reply; 15+ messages in thread
From: Jens Axboe @ 2025-07-11 23:59 UTC (permalink / raw)
To: io-uring; +Cc: Jens Axboe
If IORING_CQE_F_POLLED is set in cqe->flags for a completion, then that
request had to go through the poll machinery before it could be
completed. For a read/recv type operation, it means the fd/socket was
originally empty when the request was submitted. For a write/send type
operation, it means that the socket/pipe/file was full when the initial
attempt was made to execute it. This can be used for backpressure
signaling, sending back information to the application on the state of
the file or socket.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
include/uapi/linux/io_uring.h | 6 ++++++
io_uring/io_uring.c | 3 +++
io_uring/io_uring.h | 2 ++
3 files changed, 11 insertions(+)
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index b8a0e70ee2fd..7f2613ee9a5b 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -483,12 +483,18 @@ struct io_uring_cqe {
* other provided buffer type, all completions with a
* buffer passed back is automatically returned to the
* application.
+ * IORING_CQE_F_POLLED If set, the operation completed only after going
+ * through the poll machinery. For a write/send, this
+ * means the socket was full when the operation was
+ * first attempted. For a read operation, the socket/fd
+ * was empty when it was initially attempted.
*/
#define IORING_CQE_F_BUFFER (1U << 0)
#define IORING_CQE_F_MORE (1U << 1)
#define IORING_CQE_F_SOCK_NONEMPTY (1U << 2)
#define IORING_CQE_F_NOTIF (1U << 3)
#define IORING_CQE_F_BUF_MORE (1U << 4)
+#define IORING_CQE_F_POLLED (1U << 5)
#define IORING_CQE_BUFFER_SHIFT 16
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 4ef69dd58734..292ac416dfbe 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -891,6 +891,9 @@ bool io_req_post_cqe(struct io_kiocb *req, s32 res, u32 cflags)
struct io_ring_ctx *ctx = req->ctx;
bool posted;
+ if (req->flags & REQ_F_POLL_WAKE)
+ cflags |= IORING_CQE_F_POLLED;
+
/*
* If multishot has already posted deferred completions, ensure that
* those are flushed first before posting this one. If not, CQEs
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index dc17162e7af1..d837e02d26b2 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -235,6 +235,8 @@ static inline void req_set_fail(struct io_kiocb *req)
static inline void io_req_set_res(struct io_kiocb *req, s32 res, u32 cflags)
{
+ if (req->flags & REQ_F_POLL_WAKE)
+ cflags |= IORING_CQE_F_POLLED;
req->cqe.res = res;
req->cqe.flags = cflags;
}
--
2.50.0
* Re: [PATCH 3/3] io_uring: add IORING_CQE_F_POLLED flag
2025-07-11 23:59 ` [PATCH 3/3] io_uring: add IORING_CQE_F_POLLED flag Jens Axboe
@ 2025-07-12 11:34 ` Pavel Begunkov
2025-07-12 14:49 ` Pavel Begunkov
0 siblings, 1 reply; 15+ messages in thread
From: Pavel Begunkov @ 2025-07-12 11:34 UTC (permalink / raw)
To: Jens Axboe, io-uring
On 7/12/25 00:59, Jens Axboe wrote:
...> /*
> * If multishot has already posted deferred completions, ensure that
> * those are flushed first before posting this one. If not, CQEs
> diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
> index dc17162e7af1..d837e02d26b2 100644
> --- a/io_uring/io_uring.h
> +++ b/io_uring/io_uring.h
> @@ -235,6 +235,8 @@ static inline void req_set_fail(struct io_kiocb *req)
>
> static inline void io_req_set_res(struct io_kiocb *req, s32 res, u32 cflags)
> {
> + if (req->flags & REQ_F_POLL_WAKE)
> + cflags |= IORING_CQE_F_POLLED;
Can you avoid introducing this new uapi (and overhead) for requests that
don't care about it please? It's useless for multishots, and the only
real potential use case is send requests.
--
Pavel Begunkov
* Re: [PATCH 2/3] io_uring/poll: flag request as having gone through poll wake machinery
2025-07-11 23:59 ` [PATCH 2/3] io_uring/poll: flag request as having gone through poll wake machinery Jens Axboe
@ 2025-07-12 11:39 ` Pavel Begunkov
2025-07-12 20:59 ` Jens Axboe
0 siblings, 1 reply; 15+ messages in thread
From: Pavel Begunkov @ 2025-07-12 11:39 UTC (permalink / raw)
To: Jens Axboe, io-uring
On 7/12/25 00:59, Jens Axboe wrote:
> No functional changes in this patch, just in preparation for being able
> to flag completions as having been triggered via poll.
>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
> include/linux/io_uring_types.h | 3 +++
> io_uring/poll.c | 1 +
> 2 files changed, 4 insertions(+)
>
> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
> index 80a178f3d896..b56fe2247077 100644
> --- a/include/linux/io_uring_types.h
> +++ b/include/linux/io_uring_types.h
> @@ -505,6 +505,7 @@ enum {
> REQ_F_HAS_METADATA_BIT,
> REQ_F_IMPORT_BUFFER_BIT,
> REQ_F_SQE_COPIED_BIT,
> + REQ_F_POLL_WAKE_BIT,
>
> /* not a real bit, just to check we're not overflowing the space */
> __REQ_F_LAST_BIT,
> @@ -596,6 +597,8 @@ enum {
> REQ_F_IMPORT_BUFFER = IO_REQ_FLAG(REQ_F_IMPORT_BUFFER_BIT),
> /* ->sqe_copy() has been called, if necessary */
> REQ_F_SQE_COPIED = IO_REQ_FLAG(REQ_F_SQE_COPIED_BIT),
> + /* request went through poll wakeup machinery */
> + REQ_F_POLL_WAKE = IO_REQ_FLAG(REQ_F_POLL_WAKE_BIT),
> };
>
> typedef void (*io_req_tw_func_t)(struct io_kiocb *req, io_tw_token_t tw);
> diff --git a/io_uring/poll.c b/io_uring/poll.c
> index c7e9fb34563d..e1950b744f3b 100644
> --- a/io_uring/poll.c
> +++ b/io_uring/poll.c
> @@ -423,6 +423,7 @@ static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
> else
> req->flags &= ~REQ_F_SINGLE_POLL;
> }
> + req->flags |= REQ_F_POLL_WAKE;
Same, it's overhead for all polled requests for no clear gain.
Just move it to the arming function. It's also not correct to
keep it here, if that's what you care about.
> __io_poll_execute(req, mask);
> }
> return 1;
--
Pavel Begunkov
* Re: [PATCH 3/3] io_uring: add IORING_CQE_F_POLLED flag
2025-07-12 11:34 ` Pavel Begunkov
@ 2025-07-12 14:49 ` Pavel Begunkov
2025-07-12 21:02 ` Jens Axboe
0 siblings, 1 reply; 15+ messages in thread
From: Pavel Begunkov @ 2025-07-12 14:49 UTC (permalink / raw)
To: Jens Axboe, io-uring
On 7/12/25 12:34, Pavel Begunkov wrote:
> On 7/12/25 00:59, Jens Axboe wrote:
> ...> /*
>> * If multishot has already posted deferred completions, ensure that
>> * those are flushed first before posting this one. If not, CQEs
>> diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
>> index dc17162e7af1..d837e02d26b2 100644
>> --- a/io_uring/io_uring.h
>> +++ b/io_uring/io_uring.h
>> @@ -235,6 +235,8 @@ static inline void req_set_fail(struct io_kiocb *req)
>> static inline void io_req_set_res(struct io_kiocb *req, s32 res, u32 cflags)
>> {
>> + if (req->flags & REQ_F_POLL_WAKE)
>> + cflags |= IORING_CQE_F_POLLED;
>
> Can you avoid introducing this new uapi (and overhead) for requests that
> don't care about it please? It's useless for multishots, and the only
> real potential use case is send requests.
Another thought, I think the userspace can already easily infer
information similar to what this flag gives. E.g. peek at CQEs
right after submission and mark the inverse of the flag. The
actual impl can be made nicer than that.
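For example, something along these lines (a hypothetical liburing
sketch of that idea, not tested; mark_completed_inline() is made up,
and deferred/task_work completions make it only an approximation):

#include <liburing.h>

extern void mark_completed_inline(void *user_data);	/* made-up helper */

static void classify_inline_completions(struct io_uring *ring)
{
	struct io_uring_cqe *cqe;
	unsigned head, seen = 0;

	io_uring_submit(ring);

	/* CQEs already visible right after submit completed inline, i.e.
	 * without waiting on poll; later arrivals are the inverse. */
	io_uring_for_each_cqe(ring, head, cqe) {
		mark_completed_inline(io_uring_cqe_get_data(cqe));
		seen++;
	}
	io_uring_cq_advance(ring, seen);
}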
--
Pavel Begunkov
* Re: [PATCH 2/3] io_uring/poll: flag request as having gone through poll wake machinery
2025-07-12 11:39 ` Pavel Begunkov
@ 2025-07-12 20:59 ` Jens Axboe
2025-07-14 9:26 ` Pavel Begunkov
0 siblings, 1 reply; 15+ messages in thread
From: Jens Axboe @ 2025-07-12 20:59 UTC (permalink / raw)
To: Pavel Begunkov, io-uring
On 7/12/25 5:39 AM, Pavel Begunkov wrote:
> On 7/12/25 00:59, Jens Axboe wrote:
>> No functional changes in this patch, just in preparation for being able
>> to flag completions as having been triggered via poll.
>>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>> ---
>> include/linux/io_uring_types.h | 3 +++
>> io_uring/poll.c | 1 +
>> 2 files changed, 4 insertions(+)
>>
>> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
>> index 80a178f3d896..b56fe2247077 100644
>> --- a/include/linux/io_uring_types.h
>> +++ b/include/linux/io_uring_types.h
>> @@ -505,6 +505,7 @@ enum {
>> REQ_F_HAS_METADATA_BIT,
>> REQ_F_IMPORT_BUFFER_BIT,
>> REQ_F_SQE_COPIED_BIT,
>> + REQ_F_POLL_WAKE_BIT,
>> /* not a real bit, just to check we're not overflowing the space */
>> __REQ_F_LAST_BIT,
>> @@ -596,6 +597,8 @@ enum {
>> REQ_F_IMPORT_BUFFER = IO_REQ_FLAG(REQ_F_IMPORT_BUFFER_BIT),
>> /* ->sqe_copy() has been called, if necessary */
>> REQ_F_SQE_COPIED = IO_REQ_FLAG(REQ_F_SQE_COPIED_BIT),
>> + /* request went through poll wakeup machinery */
>> + REQ_F_POLL_WAKE = IO_REQ_FLAG(REQ_F_POLL_WAKE_BIT),
>> };
>> typedef void (*io_req_tw_func_t)(struct io_kiocb *req, io_tw_token_t tw);
>> diff --git a/io_uring/poll.c b/io_uring/poll.c
>> index c7e9fb34563d..e1950b744f3b 100644
>> --- a/io_uring/poll.c
>> +++ b/io_uring/poll.c
>> @@ -423,6 +423,7 @@ static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
>> else
>> req->flags &= ~REQ_F_SINGLE_POLL;
>> }
>> + req->flags |= REQ_F_POLL_WAKE;
>
> Same, it's overhead for all polled requests for no clear gain.
> Just move it to the arming function. It's also not correct to
> keep it here, if that's what you care about.
Not too worried about overhead, for an unlocked or. The whole poll
machinery is pretty intense in that regard. But yeah, do agree that just
moving it to arming would be better and more appropriate too.
I'm still a bit split on whether this makes any sense at all; patches 2-3 really
should be RFC.
--
Jens Axboe
* Re: [PATCH 3/3] io_uring: add IORING_CQE_F_POLLED flag
2025-07-12 14:49 ` Pavel Begunkov
@ 2025-07-12 21:02 ` Jens Axboe
2025-07-12 23:05 ` Jens Axboe
0 siblings, 1 reply; 15+ messages in thread
From: Jens Axboe @ 2025-07-12 21:02 UTC (permalink / raw)
To: Pavel Begunkov, io-uring
On 7/12/25 8:49 AM, Pavel Begunkov wrote:
> On 7/12/25 12:34, Pavel Begunkov wrote:
>> On 7/12/25 00:59, Jens Axboe wrote:
>> ...> /*
>>> * If multishot has already posted deferred completions, ensure that
>>> * those are flushed first before posting this one. If not, CQEs
>>> diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
>>> index dc17162e7af1..d837e02d26b2 100644
>>> --- a/io_uring/io_uring.h
>>> +++ b/io_uring/io_uring.h
>>> @@ -235,6 +235,8 @@ static inline void req_set_fail(struct io_kiocb *req)
>>> static inline void io_req_set_res(struct io_kiocb *req, s32 res, u32 cflags)
>>> {
>>> + if (req->flags & REQ_F_POLL_WAKE)
>>> + cflags |= IORING_CQE_F_POLLED;
>>
>> Can you avoid introducing this new uapi (and overhead) for requests that
>> don't care about it please? It's useless for multishots, and the only
>> real potential use case is send requests.
>
> Another thought, I think the userspace can already easily infer
> information similar to what this flag gives. E.g. peek at CQEs
> right after submission and mark the inverse of the flag. The
> actual impl can be made nicer than that.
As per the previous reply, not sure it makes a ton of sense. The initial
hack I did was just for sends, and it actually just reused the bit for
SOCK_NONEMPTY, as it was only for writes. But then the concept seemed
generic enough that it'd be useful for writes. And then it becomes
mostly a "did I poll thing", which obviously then makes sense for single
shot read/recv as well.
Now it's taking a new bit, which isn't great.
But I think it's one of those things that need to ruminate a bit. Maybe
just go back to doing it purely for sends. But then perhaps you'd
actually want to know if the NEXT send would block, not that your
current one did.
--
Jens Axboe
* Re: [PATCH 3/3] io_uring: add IORING_CQE_F_POLLED flag
2025-07-12 21:02 ` Jens Axboe
@ 2025-07-12 23:05 ` Jens Axboe
0 siblings, 0 replies; 15+ messages in thread
From: Jens Axboe @ 2025-07-12 23:05 UTC (permalink / raw)
To: Pavel Begunkov, io-uring
On 7/12/25 3:02 PM, Jens Axboe wrote:
> On 7/12/25 8:49 AM, Pavel Begunkov wrote:
>> On 7/12/25 12:34, Pavel Begunkov wrote:
>>> On 7/12/25 00:59, Jens Axboe wrote:
>>> ...> /*
>>>> * If multishot has already posted deferred completions, ensure that
>>>> * those are flushed first before posting this one. If not, CQEs
>>>> diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
>>>> index dc17162e7af1..d837e02d26b2 100644
>>>> --- a/io_uring/io_uring.h
>>>> +++ b/io_uring/io_uring.h
>>>> @@ -235,6 +235,8 @@ static inline void req_set_fail(struct io_kiocb *req)
>>>> static inline void io_req_set_res(struct io_kiocb *req, s32 res, u32 cflags)
>>>> {
>>>> + if (req->flags & REQ_F_POLL_WAKE)
>>>> + cflags |= IORING_CQE_F_POLLED;
>>>
>>> Can you avoid introducing this new uapi (and overhead) for requests that
>>> don't care about it please? It's useless for multishots, and the only
>>> real potential use case is send requests.
>>
>> Another thought, I think the userspace can already easily infer
>> information similar to what this flag gives. E.g. peek at CQEs
>> right after submission and mark the inverse of the flag. The
>> actual impl can be made nicer than that.
>
> As per the previous reply, not sure it makes a ton of sense. The initial
> hack I did was just for sends, and it actually just reused the bit for
> SOCK_NONEMPTY, as it was only for writes. But then the concept seemed
> generic enough that it'd be useful for writes. And then it becomes
> mostly a "did I poll" thing, which obviously then makes sense for single
> shot read/recv as well.
>
> Now it's taking a new bit, which isn't great.
>
> But I think it's one of those things that need to ruminate a bit. Maybe
> just go back to doing it purely for sends. But then perhaps you'd
> actually want to know if the NEXT send would block, not that your
> current one did.
This is closer to what I originally had:
https://git.kernel.dk/cgit/linux/log/?h=io_uring-send-full
with REQ_F_POLL_ARMED added and then used just on the send side,
reusing the value of SOCK_NONEMPTY, as that one is just for
receives and this one is just for sends.
--
Jens Axboe
* Re: [PATCH 2/3] io_uring/poll: flag request as having gone through poll wake machinery
2025-07-12 20:59 ` Jens Axboe
@ 2025-07-14 9:26 ` Pavel Begunkov
2025-07-14 14:54 ` Jens Axboe
0 siblings, 1 reply; 15+ messages in thread
From: Pavel Begunkov @ 2025-07-14 9:26 UTC (permalink / raw)
To: Jens Axboe, io-uring
On 7/12/25 21:59, Jens Axboe wrote:
> On 7/12/25 5:39 AM, Pavel Begunkov wrote:
>> On 7/12/25 00:59, Jens Axboe wrote:
>>> No functional changes in this patch, just in preparation for being able
>>> to flag completions as having been triggered via poll.
>>>
>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>> ---
>>> include/linux/io_uring_types.h | 3 +++
>>> io_uring/poll.c | 1 +
>>> 2 files changed, 4 insertions(+)
>>>
>>> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
>>> index 80a178f3d896..b56fe2247077 100644
>>> --- a/include/linux/io_uring_types.h
>>> +++ b/include/linux/io_uring_types.h
>>> @@ -505,6 +505,7 @@ enum {
>>> REQ_F_HAS_METADATA_BIT,
>>> REQ_F_IMPORT_BUFFER_BIT,
>>> REQ_F_SQE_COPIED_BIT,
>>> + REQ_F_POLL_WAKE_BIT,
>>> /* not a real bit, just to check we're not overflowing the space */
>>> __REQ_F_LAST_BIT,
>>> @@ -596,6 +597,8 @@ enum {
>>> REQ_F_IMPORT_BUFFER = IO_REQ_FLAG(REQ_F_IMPORT_BUFFER_BIT),
>>> /* ->sqe_copy() has been called, if necessary */
>>> REQ_F_SQE_COPIED = IO_REQ_FLAG(REQ_F_SQE_COPIED_BIT),
>>> + /* request went through poll wakeup machinery */
>>> + REQ_F_POLL_WAKE = IO_REQ_FLAG(REQ_F_POLL_WAKE_BIT),
>>> };
>>> typedef void (*io_req_tw_func_t)(struct io_kiocb *req, io_tw_token_t tw);
>>> diff --git a/io_uring/poll.c b/io_uring/poll.c
>>> index c7e9fb34563d..e1950b744f3b 100644
>>> --- a/io_uring/poll.c
>>> +++ b/io_uring/poll.c
>>> @@ -423,6 +423,7 @@ static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
>>> else
>>> req->flags &= ~REQ_F_SINGLE_POLL;
>>> }
>>> + req->flags |= REQ_F_POLL_WAKE;
>>
>> Same, it's overhead for all polled requests for no clear gain.
>> Just move it to the arming function. It's also not correct to
>> keep it here, if that's what you care about.
>
> Not too worried about overhead, for an unlocked or. The whole poll
You know, I wrote this machinery and optimised it, I'm not saying it
to just piss you off, I still need it to work well for zcrx :)
Not going into details, but it's not such a simple unlocked or. And
death by a thousand cuts never gets old either.
> machinery is pretty intense in that regard. But yeah, do agree that just
> moving it to arming would be better and more appropriate too.
>
> I'm still a bit split on whether this makes any sense at all; patches 2-3 really
Right, that's what I meant by unclear benefit. You're returning
information from the past when it's already irrelevant, especially
so for socket tx, with how they handle their wait queue wakeups.
--
Pavel Begunkov
* Re: [PATCH 2/3] io_uring/poll: flag request as having gone through poll wake machinery
2025-07-14 9:26 ` Pavel Begunkov
@ 2025-07-14 14:54 ` Jens Axboe
2025-07-14 15:45 ` Pavel Begunkov
0 siblings, 1 reply; 15+ messages in thread
From: Jens Axboe @ 2025-07-14 14:54 UTC (permalink / raw)
To: Pavel Begunkov, io-uring
On 7/14/25 3:26 AM, Pavel Begunkov wrote:
> On 7/12/25 21:59, Jens Axboe wrote:
>> On 7/12/25 5:39 AM, Pavel Begunkov wrote:
>>> On 7/12/25 00:59, Jens Axboe wrote:
>>>> No functional changes in this patch, just in preparation for being able
>>>> to flag completions as having been triggered via poll.
>>>>
>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>>> ---
>>>> include/linux/io_uring_types.h | 3 +++
>>>> io_uring/poll.c | 1 +
>>>> 2 files changed, 4 insertions(+)
>>>>
>>>> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
>>>> index 80a178f3d896..b56fe2247077 100644
>>>> --- a/include/linux/io_uring_types.h
>>>> +++ b/include/linux/io_uring_types.h
>>>> @@ -505,6 +505,7 @@ enum {
>>>> REQ_F_HAS_METADATA_BIT,
>>>> REQ_F_IMPORT_BUFFER_BIT,
>>>> REQ_F_SQE_COPIED_BIT,
>>>> + REQ_F_POLL_WAKE_BIT,
>>>> /* not a real bit, just to check we're not overflowing the space */
>>>> __REQ_F_LAST_BIT,
>>>> @@ -596,6 +597,8 @@ enum {
>>>> REQ_F_IMPORT_BUFFER = IO_REQ_FLAG(REQ_F_IMPORT_BUFFER_BIT),
>>>> /* ->sqe_copy() has been called, if necessary */
>>>> REQ_F_SQE_COPIED = IO_REQ_FLAG(REQ_F_SQE_COPIED_BIT),
>>>> + /* request went through poll wakeup machinery */
>>>> + REQ_F_POLL_WAKE = IO_REQ_FLAG(REQ_F_POLL_WAKE_BIT),
>>>> };
>>>> typedef void (*io_req_tw_func_t)(struct io_kiocb *req, io_tw_token_t tw);
>>>> diff --git a/io_uring/poll.c b/io_uring/poll.c
>>>> index c7e9fb34563d..e1950b744f3b 100644
>>>> --- a/io_uring/poll.c
>>>> +++ b/io_uring/poll.c
>>>> @@ -423,6 +423,7 @@ static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
>>>> else
>>>> req->flags &= ~REQ_F_SINGLE_POLL;
>>>> }
>>>> + req->flags |= REQ_F_POLL_WAKE;
>>>
>>> Same, it's overhead for all polled requests for no clear gain.
>>> Just move it to the arming function. It's also not correct to
>>> keep it here, if that's what you care about.
>>
>> Not too worried about overhead, for an unlocked or. The whole poll
>
> You know, I wrote this machinery and optimised it, I'm not saying it
> to just piss you off, I still need it to work well for zcrx :)
This was not a critique of the code, just a generic statement that the
serialization around poll triggering is obviously a lot more expensive
than basic flag checking or setting. Not every comment is a backhanded
attack on someone's code.
> Not going into details, but it's not such a simple unlocked or. And
> death by a thousand cuts never gets old either.
That's obviously true, I was just trying to set expectations that a
single flag mask is not really a big deal. If the idea and feature were
fully solidified and useful, then arguing that adding a bit or is a
problem is nonsense. By that standard, we could never add anything to
the code, only remove. At the same time, adding frivolous code is of
course always a bad idea.
>> machinery is pretty intense in that regard. But yeah, do agree that just
>> moving it to arming would be better and more appropriate too.
>>
>> I'm still a bit split on whether this makes any sense at all; patches 2-3 really
>
> Right, that's what I meant by unclear benefit. You're returning
> information from the past when it's already irrelevant, especially
> so for socket tx, with how they handle their wait queue wakeups.
Indeed, and that's really the core of it and why I'm not totally sold on
it either. As mentioned in a previous email, I think it's better to let
this concept brew a bit in the background. Maybe something more specific
and useful will come out of this later.
--
Jens Axboe
* Re: [PATCH 2/3] io_uring/poll: flag request as having gone through poll wake machinery
2025-07-14 14:54 ` Jens Axboe
@ 2025-07-14 15:45 ` Pavel Begunkov
2025-07-14 17:51 ` Jens Axboe
0 siblings, 1 reply; 15+ messages in thread
From: Pavel Begunkov @ 2025-07-14 15:45 UTC (permalink / raw)
To: Jens Axboe, io-uring
On 7/14/25 15:54, Jens Axboe wrote:
> On 7/14/25 3:26 AM, Pavel Begunkov wrote:
>> On 7/12/25 21:59, Jens Axboe wrote:
>>> On 7/12/25 5:39 AM, Pavel Begunkov wrote:
>>>> On 7/12/25 00:59, Jens Axboe wrote:
>>>>> No functional changes in this patch, just in preparation for being able
...>>>> Same, it's overhead for all polled requests for no clear gain.
>>>> Just move it to the arming function. It's also not correct to
>>>> keep it here, if that's what you care about.
>>>
>>> Not too worried about overhead, for an unlocked or. The whole poll
>>
>> You know, I wrote this machinery and optimised it, I'm not saying it
>> to just piss you off, I still need it to work well for zcrx :)
>
> This was not a critique of the code, it's just a generic statement on
> the serialization around poll triggering is obviously a lot more
> expensive than basic flag checking or setting. Every comment is not a
> backhanded attack on someones code.
Not taken that way, it works well enough for such highly concurrent
synchronisation.
>> Not going into details, but it's not such a simple unlocked or. And
>> death by a thousand cuts never gets old either.
>
> That's obviously true, I was just trying to set expectations that a
> single flag mask is not really a big deal. If the idea and feature was
> fully solidified and useful, then arguing that adding a bit or is a
> problem is nonsense.
Quite the opposite, it should be argued about, and not because "or"
is expensive, but because it's a write in a nuanced place.
> By that standard, we could never add anything to
> the code, only remove. At the same time, adding frivolous code is of
> course always a bad idea.
--
Pavel Begunkov
* Re: [PATCH 2/3] io_uring/poll: flag request as having gone through poll wake machinery
2025-07-14 15:45 ` Pavel Begunkov
@ 2025-07-14 17:51 ` Jens Axboe
2025-07-18 10:20 ` Pavel Begunkov
0 siblings, 1 reply; 15+ messages in thread
From: Jens Axboe @ 2025-07-14 17:51 UTC (permalink / raw)
To: Pavel Begunkov, io-uring
On 7/14/25 9:45 AM, Pavel Begunkov wrote:
> On 7/14/25 15:54, Jens Axboe wrote:
>> On 7/14/25 3:26 AM, Pavel Begunkov wrote:
>>> On 7/12/25 21:59, Jens Axboe wrote:
>>>> On 7/12/25 5:39 AM, Pavel Begunkov wrote:
>>>>> On 7/12/25 00:59, Jens Axboe wrote:
>>>>>> No functional changes in this patch, just in preparation for being able
> ...>>>> Same, it's overhead for all polled requests for no clear gain.
>>>>> Just move it to the arming function. It's also not correct to
>>>>> keep it here, if that's what you care about.
>>>>
>>>> Not too worried about overhead, for an unlocked or. The whole poll
>>>
>>> You know, I wrote this machinery and optimised it, I'm not saying it
>>> to just piss you off, I still need it to work well for zcrx :)
>>
>> This was not a critique of the code, just a generic statement that the
>> serialization around poll triggering is obviously a lot more expensive
>> than basic flag checking or setting. Not every comment is a backhanded
>> attack on someone's code.
>
> Not taken that way, it works well enough for such highly concurrent
> synchronisation.
Certainly, no complaints!
>>> Not going into details, but it's not such a simple unlocked or. And
>>> death by a thousand cuts never gets old either.
>>
>> That's obviously true, I was just trying to set expectations that a
>> single flag mask is not really a big deal. If the idea and feature were
>> fully solidified and useful, then arguing that adding a bit or is a
>> problem is nonsense.
>
> Quite the opposite, it should be argued about, and not because "or"
> is expensive, but because it's a write in a nuanced place.
I think that's orthogonal - should it be commented? Definitely yes. This
is sadly true for a lot of the code in there, but doesn't mean we should
add more.
But all moot so far, at least until it can move forward, if at all.
--
Jens Axboe
* Re: [PATCH 2/3] io_uring/poll: flag request as having gone through poll wake machinery
2025-07-14 17:51 ` Jens Axboe
@ 2025-07-18 10:20 ` Pavel Begunkov
0 siblings, 0 replies; 15+ messages in thread
From: Pavel Begunkov @ 2025-07-18 10:20 UTC (permalink / raw)
To: Jens Axboe, io-uring
On 7/14/25 18:51, Jens Axboe wrote:
> On 7/14/25 9:45 AM, Pavel Begunkov wrote:
>> On 7/14/25 15:54, Jens Axboe wrote:
>>> On 7/14/25 3:26 AM, Pavel Begunkov wrote:
>>>> On 7/12/25 21:59, Jens Axboe wrote:
>>>>> On 7/12/25 5:39 AM, Pavel Begunkov wrote:
>>>>>> On 7/12/25 00:59, Jens Axboe wrote:
>>>>>>> No functional changes in this patch, just in preparation for being able
>> ...>>>> Same, it's overhead for all polled requests for no clear gain.
>>>>>> Just move it to the arming function. It's also not correct to
>>>>>> keep it here, if that's what you care about.
>>>>>
>>>>> Not too worried about overhead, for an unlocked or. The whole poll
>>>>
>>>> You know, I wrote this machinery and optimised it, I'm not saying it
>>>> to just piss you off, I still need it to work well for zcrx :)
>>>
>>> This was not a critique of the code, just a generic statement that the
>>> serialization around poll triggering is obviously a lot more expensive
>>> than basic flag checking or setting. Not every comment is a backhanded
>>> attack on someone's code.
>>
>> Not taken this way, it works well enough for such highly concurrent
>> synchronisation.
>
> Certainly, no complaints!
>
>>>> Not going into details, but it's not such a simple unlocked or. And
>>>> death by a thousand cuts never gets old either.
>>>
>>> That's obviously true, I was just trying to set expectations that a
>>> single flag mask is not really a big deal. If the idea and feature were
>>> fully solidified and useful, then arguing that adding a bit or is a
>>> problem is nonsense.
>>
>> Quite the opposite, it should be argued about, and not because "or"
>> is expensive, but because it's a write in a nuanced place.
>
> I think that's orthogonal - should it be commented? Definitely yes. This
> is sadly true for a lot of the code in there, but doesn't mean we should
> add more.
Not sure I understand what you mean. I've been saying from the beginning
that there is a legit performance concern for that chunk, which happens
to be a bitwise "or". That's why I commented, and what I believe should
be argued about. The "or" part is not that relevant, so let's not go
straw-manning it. I'd just hope you're less eager to call everything
nonsense, because a single bitwise "or" could be a problem depending on
the circumstances :)
--
Pavel Begunkov