public inbox for [email protected]
* io_uring and spurious wake-ups from eventfd
@ 2020-01-07 15:55 Mark Papadakis
  2020-01-07 20:26 ` Jens Axboe
  0 siblings, 1 reply; 11+ messages in thread
From: Mark Papadakis @ 2020-01-07 15:55 UTC (permalink / raw)
  To: io-uring

This is perhaps an odd request, but if it's trivial to implement support for the feature described below, it could help others as it would help me (I've been experimenting with io_uring for some time now).

Being able to register an eventfd with an io_uring context is very handy if you have, e.g., some sort of reactor thread multiplexing I/O using epoll, where you want to be notified when there are pending CQEs to drain. The problem, such as it is, is that this can result in unnecessary/spurious wake-ups.

If, for example, you are monitoring some non-blocking sockets for EPOLLIN, then when poll reports pending bytes on those sockets you reserve an SQE for preadv() for each reported event and call io_uring_enter() to submit the SQEs. Because the data is readily available, the completions are there as soon as io_uring_enter() returns, and you can process them right away.
The "problem" is that poll will then wake up immediately in the next reactor loop iteration, because the eventfd was tripped (which is reasonable but unnecessary).
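
For concreteness, a minimal sketch (using liburing) of the reactor pattern described above. It assumes the eventfd was already registered with io_uring_register_eventfd() and added to the epoll set, and iov_for() and handle_completion() are hypothetical application helpers:

#include <liburing.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/uio.h>
#include <unistd.h>

/* hypothetical application helpers, assumed to exist elsewhere */
struct iovec *iov_for(int fd);
void handle_completion(struct io_uring_cqe *cqe);

static void drain_cqes(struct io_uring *ring)
{
    struct io_uring_cqe *cqe;
    unsigned head, seen = 0;

    io_uring_for_each_cqe(ring, head, cqe) {
        handle_completion(cqe);     /* hypothetical app helper */
        seen++;
    }
    io_uring_cq_advance(ring, seen);
}

static void reactor_loop(struct io_uring *ring, int epfd, int efd)
{
    struct epoll_event evs[64];

    for (;;) {
        int n = epoll_wait(epfd, evs, 64, -1);

        for (int i = 0; i < n; i++) {
            if (evs[i].data.fd == efd) {
                uint64_t cnt;

                /* eventfd fired: drain its counter, then reap CQEs */
                read(efd, &cnt, sizeof(cnt));
                drain_cqes(ring);
                continue;
            }
            /* socket readable: queue a readv() for it */
            struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
            io_uring_prep_readv(sqe, evs[i].data.fd,
                                iov_for(evs[i].data.fd), 1, 0);
        }

        /* The data was already there, so these complete inline ... */
        io_uring_submit(ring);
        drain_cqes(ring);
        /* ... yet the eventfd is still signalled, so the next epoll_wait()
         * wakes up immediately only to find nothing left to reap. */
    }
}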

What if there were a flag for io_uring_setup() so that the eventfd would only be tripped for CQEs that were processed asynchronously, or, if that's non-trivial, only for CQEs that reference file FDs?

That would avoid the spurious wake-up.

Thank you,
@markpapadakis

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: io_uring and spurious wake-ups from eventfd
  2020-01-07 15:55 io_uring and spurious wake-ups from eventfd Mark Papadakis
@ 2020-01-07 20:26 ` Jens Axboe
  2020-01-07 20:34   ` Jens Axboe
  0 siblings, 1 reply; 11+ messages in thread
From: Jens Axboe @ 2020-01-07 20:26 UTC (permalink / raw)
  To: Mark Papadakis, io-uring

On 1/7/20 8:55 AM, Mark Papadakis wrote:
> This is perhaps an odd request, but if it’s trivial to implement
> support for this described feature, it could help others like it ‘d
> help me (I ‘ve been experimenting with io_uring for some time now).
> 
> Being able to register an eventfd with an io_uring context is very
> handy, if you e.g have some sort of reactor thread multiplexing I/O
> using epoll etc, where you want to be notified when there are pending
> CQEs to drain. The problem, such as it is, is that this can result in
> un-necessary/spurious wake-ups.
> 
> If, for example, you are monitoring some sockets for EPOLLIN, and when
> poll says you have pending bytes to read from their sockets, and said
> sockets are non-blocking, and for each some reported event you reserve
> an SQE for preadv() to read that data and then you io_uring_enter to
> submit the SQEs, because the data is readily available, as soon as
> io_uring_enter returns, you will have your completions available -
> which you can process.  The “problem” is that poll will wake up
> immediately thereafter in the next reactor loop iteration because
> eventfd was tripped (which is reasonable but un-necessary).
> 
> What if there was a flag for io_uring_setup() so that the eventfd
> would only be tripped for CQEs that were processed asynchronously, or,
> if that’s non-trivial, only for CQEs that reference file FDs?
> 
> That’d help with that spurious wake-up.

One easy way to do that would be for the application to signal that it
doesn't want eventfd notifications for certain requests. Like using an
IOSQE_ flag for that. Then you could set that on the requests you submit
in response to triggering an eventfd event.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: io_uring and spurious wake-ups from eventfd
  2020-01-07 20:26 ` Jens Axboe
@ 2020-01-07 20:34   ` Jens Axboe
  2020-01-08  7:36     ` Mark Papadakis
  0 siblings, 1 reply; 11+ messages in thread
From: Jens Axboe @ 2020-01-07 20:34 UTC (permalink / raw)
  To: Mark Papadakis, io-uring

On 1/7/20 1:26 PM, Jens Axboe wrote:
> On 1/7/20 8:55 AM, Mark Papadakis wrote:
>> This is perhaps an odd request, but if it’s trivial to implement
>> support for this described feature, it could help others like it ‘d
>> help me (I ‘ve been experimenting with io_uring for some time now).
>>
>> Being able to register an eventfd with an io_uring context is very
>> handy, if you e.g have some sort of reactor thread multiplexing I/O
>> using epoll etc, where you want to be notified when there are pending
>> CQEs to drain. The problem, such as it is, is that this can result in
>> un-necessary/spurious wake-ups.
>>
>> If, for example, you are monitoring some sockets for EPOLLIN, and when
>> poll says you have pending bytes to read from their sockets, and said
>> sockets are non-blocking, and for each some reported event you reserve
>> an SQE for preadv() to read that data and then you io_uring_enter to
>> submit the SQEs, because the data is readily available, as soon as
>> io_uring_enter returns, you will have your completions available -
>> which you can process.  The “problem” is that poll will wake up
>> immediately thereafter in the next reactor loop iteration because
>> eventfd was tripped (which is reasonable but un-necessary).
>>
>> What if there was a flag for io_uring_setup() so that the eventfd
>> would only be tripped for CQEs that were processed asynchronously, or,
>> if that’s non-trivial, only for CQEs that reference file FDs?
>>
>> That’d help with that spurious wake-up.
> 
> One easy way to do that would be for the application to signal that it
> doesn't want eventfd notifications for certain requests. Like using an
> IOSQE_ flag for that. Then you could set that on the requests you submit
> in response to triggering an eventfd event.

Something like this - totally untested. Note that for some error cases I
didn't go through the trouble of wiring this up, but for all the normal
cases it should work.

Patch is against my for-5.6/io_uring branch.

diff --git a/fs/io_uring.c b/fs/io_uring.c
index c770c2c0eb52..5c6da1e1f29c 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -512,6 +512,7 @@ struct io_kiocb {
 #define REQ_F_HARDLINK		65536	/* doesn't sever on completion < 0 */
 #define REQ_F_FORCE_ASYNC	131072	/* IOSQE_ASYNC */
 #define REQ_F_CUR_POS		262144	/* read/write uses file position */
+#define REQ_F_NO_EVFD		524288	/* don't trigger eventfd for this req */
 	u64			user_data;
 	u32			result;
 	u32			sequence;
@@ -960,13 +961,13 @@ static struct io_uring_cqe *io_get_cqring(struct io_ring_ctx *ctx)
 	return &rings->cqes[tail & ctx->cq_mask];
 }
 
-static void io_cqring_ev_posted(struct io_ring_ctx *ctx)
+static void io_cqring_ev_posted(struct io_ring_ctx *ctx, bool evfd)
 {
 	if (waitqueue_active(&ctx->wait))
 		wake_up(&ctx->wait);
 	if (waitqueue_active(&ctx->sqo_wait))
 		wake_up(&ctx->sqo_wait);
-	if (ctx->cq_ev_fd)
+	if (evfd && ctx->cq_ev_fd)
 		eventfd_signal(ctx->cq_ev_fd, 1);
 }
 
@@ -1018,7 +1019,7 @@ static bool io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force)
 		clear_bit(0, &ctx->cq_check_overflow);
 	}
 	spin_unlock_irqrestore(&ctx->completion_lock, flags);
-	io_cqring_ev_posted(ctx);
+	io_cqring_ev_posted(ctx, true);
 
 	while (!list_empty(&list)) {
 		req = list_first_entry(&list, struct io_kiocb, list);
@@ -1070,7 +1071,7 @@ static void io_cqring_add_event(struct io_kiocb *req, long res)
 	io_commit_cqring(ctx);
 	spin_unlock_irqrestore(&ctx->completion_lock, flags);
 
-	io_cqring_ev_posted(ctx);
+	io_cqring_ev_posted(ctx, !(req->flags & REQ_F_NO_EVFD));
 }
 
 static inline bool io_is_fallback_req(struct io_kiocb *req)
@@ -1288,7 +1289,7 @@ static void io_req_link_next(struct io_kiocb *req, struct io_kiocb **nxtptr)
 
 	req->flags |= REQ_F_LINK_NEXT;
 	if (wake_ev)
-		io_cqring_ev_posted(ctx);
+		io_cqring_ev_posted(ctx, true);
 }
 
 /*
@@ -1320,7 +1321,7 @@ static void io_fail_links(struct io_kiocb *req)
 
 	io_commit_cqring(ctx);
 	spin_unlock_irqrestore(&ctx->completion_lock, flags);
-	io_cqring_ev_posted(ctx);
+	io_cqring_ev_posted(ctx, true);
 }
 
 static void io_req_find_next(struct io_kiocb *req, struct io_kiocb **nxt)
@@ -3366,7 +3367,7 @@ static void io_poll_complete_work(struct io_wq_work **workptr)
 	io_poll_complete(req, mask, ret);
 	spin_unlock_irq(&ctx->completion_lock);
 
-	io_cqring_ev_posted(ctx);
+	io_cqring_ev_posted(ctx, !(req->flags & REQ_F_NO_EVFD));
 
 	if (ret < 0)
 		req_set_fail_links(req);
@@ -3379,6 +3380,7 @@ static void __io_poll_flush(struct io_ring_ctx *ctx, struct llist_node *nodes)
 {
 	struct io_kiocb *req, *tmp;
 	struct req_batch rb;
+	bool evfd = false;
 
 	rb.to_free = rb.need_iter = 0;
 	spin_lock_irq(&ctx->completion_lock);
@@ -3386,6 +3388,8 @@ static void __io_poll_flush(struct io_ring_ctx *ctx, struct llist_node *nodes)
 		hash_del(&req->hash_node);
 		io_poll_complete(req, req->result, 0);
 
+		if (!evfd && !(req->flags & REQ_F_NO_EVFD))
+			evfd = true;
 		if (refcount_dec_and_test(&req->refs) &&
 		    !io_req_multi_free(&rb, req)) {
 			req->flags |= REQ_F_COMP_LOCKED;
@@ -3394,7 +3398,7 @@ static void __io_poll_flush(struct io_ring_ctx *ctx, struct llist_node *nodes)
 	}
 	spin_unlock_irq(&ctx->completion_lock);
 
-	io_cqring_ev_posted(ctx);
+	io_cqring_ev_posted(ctx, evfd);
 	io_free_req_many(ctx, &rb);
 }
 
@@ -3439,7 +3443,7 @@ static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
 			io_put_req(req);
 			spin_unlock_irqrestore(&ctx->completion_lock, flags);
 
-			io_cqring_ev_posted(ctx);
+			io_cqring_ev_posted(ctx, !(req->flags & REQ_F_NO_EVFD));
 			req = NULL;
 		} else {
 			req->result = mask;
@@ -3557,7 +3561,7 @@ static int io_poll_add(struct io_kiocb *req, struct io_kiocb **nxt)
 	spin_unlock_irq(&ctx->completion_lock);
 
 	if (mask) {
-		io_cqring_ev_posted(ctx);
+		io_cqring_ev_posted(ctx, !(req->flags & REQ_F_NO_EVFD));
 		io_put_req_find_next(req, nxt);
 	}
 	return ipt.error;
@@ -3597,7 +3601,7 @@ static enum hrtimer_restart io_timeout_fn(struct hrtimer *timer)
 	io_commit_cqring(ctx);
 	spin_unlock_irqrestore(&ctx->completion_lock, flags);
 
-	io_cqring_ev_posted(ctx);
+	io_cqring_ev_posted(ctx, !(req->flags & REQ_F_NO_EVFD));
 	req_set_fail_links(req);
 	io_put_req(req);
 	return HRTIMER_NORESTART;
@@ -3659,7 +3663,7 @@ static int io_timeout_remove(struct io_kiocb *req)
 	io_cqring_fill_event(req, ret);
 	io_commit_cqring(ctx);
 	spin_unlock_irq(&ctx->completion_lock);
-	io_cqring_ev_posted(ctx);
+	io_cqring_ev_posted(ctx, !(req->flags & REQ_F_NO_EVFD));
 	if (ret < 0)
 		req_set_fail_links(req);
 	io_put_req(req);
@@ -3829,7 +3833,7 @@ static void io_async_find_and_cancel(struct io_ring_ctx *ctx,
 	io_cqring_fill_event(req, ret);
 	io_commit_cqring(ctx);
 	spin_unlock_irqrestore(&ctx->completion_lock, flags);
-	io_cqring_ev_posted(ctx);
+	io_cqring_ev_posted(ctx, !(req->flags & REQ_F_NO_EVFD));
 
 	if (ret < 0)
 		req_set_fail_links(req);
@@ -4528,7 +4532,8 @@ static inline void io_queue_link_head(struct io_kiocb *req)
 }
 
 #define SQE_VALID_FLAGS	(IOSQE_FIXED_FILE|IOSQE_IO_DRAIN|IOSQE_IO_LINK|	\
-				IOSQE_IO_HARDLINK | IOSQE_ASYNC)
+				IOSQE_IO_HARDLINK | IOSQE_ASYNC | 	\
+				IOSQE_NO_EVENTFD)
 
 static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 			  struct io_submit_state *state, struct io_kiocb **link)
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index be64c9757ff1..72af96a38ed8 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -53,6 +53,7 @@ struct io_uring_sqe {
 #define IOSQE_IO_LINK		(1U << 2)	/* links next sqe */
 #define IOSQE_IO_HARDLINK	(1U << 3)	/* like LINK, but stronger */
 #define IOSQE_ASYNC		(1U << 4)	/* always go async */
+#define IOSQE_NO_EVENTFD	(1U << 5)	/* don't trigger eventfd */
 
 /*
  * io_uring_setup() flags
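
A quick userspace sketch of how the flag proposed in this patch might be used (assuming liburing, with ring, sockfd and iov set up elsewhere; note the flag only exists in the untested patch above):

struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

io_uring_prep_readv(sqe, sockfd, &iov, 1, 0);
/* queued in response to an epoll wake-up: the data is already there and the
 * CQE will be posted inline by io_uring_enter(), so don't signal the eventfd */
sqe->flags |= IOSQE_NO_EVENTFD;
io_uring_submit(&ring);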

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: io_uring and spurious wake-ups from eventfd
  2020-01-07 20:34   ` Jens Axboe
@ 2020-01-08  7:36     ` Mark Papadakis
  2020-01-08 16:24       ` Jens Axboe
  0 siblings, 1 reply; 11+ messages in thread
From: Mark Papadakis @ 2020-01-08  7:36 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring



> On 7 Jan 2020, at 10:34 PM, Jens Axboe <[email protected]> wrote:
> 
> On 1/7/20 1:26 PM, Jens Axboe wrote:
>> On 1/7/20 8:55 AM, Mark Papadakis wrote:
>>> This is perhaps an odd request, but if it’s trivial to implement
>>> support for this described feature, it could help others like it ‘d
>>> help me (I ‘ve been experimenting with io_uring for some time now).
>>> 
>>> Being able to register an eventfd with an io_uring context is very
>>> handy, if you e.g have some sort of reactor thread multiplexing I/O
>>> using epoll etc, where you want to be notified when there are pending
>>> CQEs to drain. The problem, such as it is, is that this can result in
>>> un-necessary/spurious wake-ups.
>>> 
>>> If, for example, you are monitoring some sockets for EPOLLIN, and when
>>> poll says you have pending bytes to read from their sockets, and said
>>> sockets are non-blocking, and for each some reported event you reserve
>>> an SQE for preadv() to read that data and then you io_uring_enter to
>>> submit the SQEs, because the data is readily available, as soon as
>>> io_uring_enter returns, you will have your completions available -
>>> which you can process.  The “problem” is that poll will wake up
>>> immediately thereafter in the next reactor loop iteration because
>>> eventfd was tripped (which is reasonable but un-necessary).
>>> 
>>> What if there was a flag for io_uring_setup() so that the eventfd
>>> would only be tripped for CQEs that were processed asynchronously, or,
>>> if that’s non-trivial, only for CQEs that reference file FDs?
>>> 
>>> That’d help with that spurious wake-up.
>> 
>> One easy way to do that would be for the application to signal that it
>> doesn't want eventfd notifications for certain requests. Like using an
>> IOSQE_ flag for that. Then you could set that on the requests you submit
>> in response to triggering an eventfd event.
> 


Thanks Jens,

This is great, but perhaps there is a slightly more optimal way to do this.
Ideally, io_uring should trip the eventfd only for new completions that were not produced
in the context of an io_uring_enter(). That is to say, if any SQEs can be served immediately (because the data is readily available in
buffers/caches in the kernel), their respective CQEs will be produced in the context of the io_uring_enter() call that submitted those SQEs (and thus the CQEs can be processed immediately after io_uring_enter() returns).
So, if any CQEs are placed in the ring at any other time, i.e. not during an io_uring_enter() call, those completions were produced asynchronously and the eventfd can be tripped; otherwise, there is no need to trip the eventfd at all.

e.g. (pseudocode):
void produce_completion(cfq_ctx *ctx, const bool in_io_uring_enter_ctx) {
        cqe_ring_push(cqe_from_ctx(ctx));
        if (false == in_io_uring_enter_ctx && eventfd_registered()) {
                trip_iouring_eventfd();
        } else {
                // don't bother
        }
}

@markpapadakis

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: io_uring and spurious wake-ups from eventfd
  2020-01-08  7:36     ` Mark Papadakis
@ 2020-01-08 16:24       ` Jens Axboe
  2020-01-08 16:46         ` Mark Papadakis
  2020-01-09  6:09         ` Daurnimator
  0 siblings, 2 replies; 11+ messages in thread
From: Jens Axboe @ 2020-01-08 16:24 UTC (permalink / raw)
  To: Mark Papadakis; +Cc: io-uring

On 1/8/20 12:36 AM, Mark Papadakis wrote:
> 
> 
>> On 7 Jan 2020, at 10:34 PM, Jens Axboe <[email protected]> wrote:
>>
>> On 1/7/20 1:26 PM, Jens Axboe wrote:
>>> On 1/7/20 8:55 AM, Mark Papadakis wrote:
>>>> This is perhaps an odd request, but if it’s trivial to implement
>>>> support for this described feature, it could help others like it ‘d
>>>> help me (I ‘ve been experimenting with io_uring for some time now).
>>>>
>>>> Being able to register an eventfd with an io_uring context is very
>>>> handy, if you e.g have some sort of reactor thread multiplexing I/O
>>>> using epoll etc, where you want to be notified when there are pending
>>>> CQEs to drain. The problem, such as it is, is that this can result in
>>>> un-necessary/spurious wake-ups.
>>>>
>>>> If, for example, you are monitoring some sockets for EPOLLIN, and when
>>>> poll says you have pending bytes to read from their sockets, and said
>>>> sockets are non-blocking, and for each some reported event you reserve
>>>> an SQE for preadv() to read that data and then you io_uring_enter to
>>>> submit the SQEs, because the data is readily available, as soon as
>>>> io_uring_enter returns, you will have your completions available -
>>>> which you can process.  The “problem” is that poll will wake up
>>>> immediately thereafter in the next reactor loop iteration because
>>>> eventfd was tripped (which is reasonable but un-necessary).
>>>>
>>>> What if there was a flag for io_uring_setup() so that the eventfd
>>>> would only be tripped for CQEs that were processed asynchronously, or,
>>>> if that’s non-trivial, only for CQEs that reference file FDs?
>>>>
>>>> That’d help with that spurious wake-up.
>>>
>>> One easy way to do that would be for the application to signal that it
>>> doesn't want eventfd notifications for certain requests. Like using an
>>> IOSQE_ flag for that. Then you could set that on the requests you submit
>>> in response to triggering an eventfd event.
>>
> 
> 
> Thanks Jens,
> 
> This is great, but perhaps there is a somewhat slightly more optimal
> way to do this.  Ideally, io_uring should trip the eventfd if there
> are any new completions available, that haven’t been produced In the
> context of an io_uring_enter(). That is to say, if any SQEs can be
> immediately served (because data is readily available in
> Buffers/caches in the kernel), then their respective CQEs will be
> produced in the context of that io_uring_enter() that submitted said
> SQEs(and thus the CQEs can be processed immediately after
> io_uring_enter() returns).  So, if any CQEs are placed in the
> respective ring at any other time, but not during an io_uring_enter()
> call, then it means those completions were produced asynchronously,
> and thus the eventfd can be tripped, otherwise, there is no need to
> trip the eventfd at all.
> 
> e.g (pseudocode):
> void produce_completion(cfq_ctx *ctx, const bool in_io_uring_enter_ctx) {
>         cqe_ring_push(cqe_from_ctx(ctx));
>         if (false == in_io_uring_enter_ctx && eventfd_registered()) {
>                 trip_iouring_eventfd();
>         } else {
>                 // don't bother
>         }
> }

I see what you're saying, so essentially only trigger eventfd
notifications if the completions happen async. That does make a lot of
sense, and it would be cleaner than having to flag this per request as
well. I think we'd still need to make that opt-in as it changes the
behavior of it.

The best way to do that would be to add IORING_REGISTER_EVENTFD_ASYNC or
something like that. Does the exact same thing as
IORING_REGISTER_EVENTFD, but only triggers it if completions happen
async.

What do you think?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: io_uring and spurious wake-ups from eventfd
  2020-01-08 16:24       ` Jens Axboe
@ 2020-01-08 16:46         ` Mark Papadakis
  2020-01-08 16:50           ` Jens Axboe
  2020-01-09  6:09         ` Daurnimator
  1 sibling, 1 reply; 11+ messages in thread
From: Mark Papadakis @ 2020-01-08 16:46 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring

This sounds perfect!

Thanks Jens

@markpapadakis

> On 8 Jan 2020, at 6:24 PM, Jens Axboe <[email protected]> wrote:
> 
> On 1/8/20 12:36 AM, Mark Papadakis wrote:
>> 
>> 
>>>> On 7 Jan 2020, at 10:34 PM, Jens Axboe <[email protected]> wrote:
>>> 
>>> On 1/7/20 1:26 PM, Jens Axboe wrote:
>>>> On 1/7/20 8:55 AM, Mark Papadakis wrote:
>>>>> This is perhaps an odd request, but if it’s trivial to implement
>>>>> support for this described feature, it could help others like it ‘d
>>>>> help me (I ‘ve been experimenting with io_uring for some time now).
>>>>> 
>>>>> Being able to register an eventfd with an io_uring context is very
>>>>> handy, if you e.g have some sort of reactor thread multiplexing I/O
>>>>> using epoll etc, where you want to be notified when there are pending
>>>>> CQEs to drain. The problem, such as it is, is that this can result in
>>>>> un-necessary/spurious wake-ups.
>>>>> 
>>>>> If, for example, you are monitoring some sockets for EPOLLIN, and when
>>>>> poll says you have pending bytes to read from their sockets, and said
>>>>> sockets are non-blocking, and for each some reported event you reserve
>>>>> an SQE for preadv() to read that data and then you io_uring_enter to
>>>>> submit the SQEs, because the data is readily available, as soon as
>>>>> io_uring_enter returns, you will have your completions available -
>>>>> which you can process.  The “problem” is that poll will wake up
>>>>> immediately thereafter in the next reactor loop iteration because
>>>>> eventfd was tripped (which is reasonable but un-necessary).
>>>>> 
>>>>> What if there was a flag for io_uring_setup() so that the eventfd
>>>>> would only be tripped for CQEs that were processed asynchronously, or,
>>>>> if that’s non-trivial, only for CQEs that reference file FDs?
>>>>> 
>>>>> That’d help with that spurious wake-up.
>>>> 
>>>> One easy way to do that would be for the application to signal that it
>>>> doesn't want eventfd notifications for certain requests. Like using an
>>>> IOSQE_ flag for that. Then you could set that on the requests you submit
>>>> in response to triggering an eventfd event.
>>> 
>> 
>> 
>> Thanks Jens,
>> 
>> This is great, but perhaps there is a somewhat slightly more optimal
>> way to do this.  Ideally, io_uring should trip the eventfd if there
>> are any new completions available, that haven’t been produced In the
>> context of an io_uring_enter(). That is to say, if any SQEs can be
>> immediately served (because data is readily available in
>> Buffers/caches in the kernel), then their respective CQEs will be
>> produced in the context of that io_uring_enter() that submitted said
>> SQEs(and thus the CQEs can be processed immediately after
>> io_uring_enter() returns).  So, if any CQEs are placed in the
>> respective ring at any other time, but not during an io_uring_enter()
>> call, then it means those completions were produced asynchronously,
>> and thus the eventfd can be tripped, otherwise, there is no need to
>> trip the eventfd at all.
>> 
>> e.g (pseudocode):
>> void produce_completion(cfq_ctx *ctx, const bool in_io_uring_enter_ctx) {
>>        cqe_ring_push(cqe_from_ctx(ctx));
>>        if (false == in_io_uring_enter_ctx && eventfd_registered()) {
>>                trip_iouring_eventfd();
>>        } else {
>>                // don't bother
>>        }
>> }
> 
> I see what you're saying, so essentially only trigger eventfd
> notifications if the completions happen async. That does make a lot of
> sense, and it would be cleaner than having to flag this per request as
> well. I think we'd still need to make that opt-in as it changes the
> behavior of it.
> 
> The best way to do that would be to add IORING_REGISTER_EVENTFD_ASYNC or
> something like that. Does the exact same thing as
> IORING_REGISTER_EVENTFD, but only triggers it if completions happen
> async.
> 
> What do you think?
> 
> -- 
> Jens Axboe
> 


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: io_uring and spurious wake-ups from eventfd
  2020-01-08 16:46         ` Mark Papadakis
@ 2020-01-08 16:50           ` Jens Axboe
  2020-01-08 17:20             ` Jens Axboe
  0 siblings, 1 reply; 11+ messages in thread
From: Jens Axboe @ 2020-01-08 16:50 UTC (permalink / raw)
  To: Mark Papadakis; +Cc: io-uring

On 1/8/20 9:46 AM, Mark Papadakis wrote:
> Thus sounds perfect!

I'll try and cook this up, if you can test it.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: io_uring and spurious wake-ups from eventfd
  2020-01-08 16:50           ` Jens Axboe
@ 2020-01-08 17:20             ` Jens Axboe
  2020-01-08 18:08               ` Jens Axboe
  0 siblings, 1 reply; 11+ messages in thread
From: Jens Axboe @ 2020-01-08 17:20 UTC (permalink / raw)
  To: Mark Papadakis; +Cc: io-uring

On 1/8/20 9:50 AM, Jens Axboe wrote:
> On 1/8/20 9:46 AM, Mark Papadakis wrote:
>> Thus sounds perfect!
> 
> I'll try and cook this up, if you can test it.

Something like the below.

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 06fdda43163d..70478888bb16 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -206,6 +206,7 @@ struct io_ring_ctx {
 		bool			account_mem;
 		bool			cq_overflow_flushed;
 		bool			drain_next;
+		bool			eventfd_async;
 
 		/*
 		 * Ring buffer of indices into array of io_uring_sqe, which is
@@ -960,13 +961,20 @@ static struct io_uring_cqe *io_get_cqring(struct io_ring_ctx *ctx)
 	return &rings->cqes[tail & ctx->cq_mask];
 }
 
+static inline bool io_should_trigger_evfd(struct io_ring_ctx *ctx)
+{
+	if (!ctx->eventfd_async)
+		return true;
+	return io_wq_current_is_worker() || in_interrupt();
+}
+
 static void io_cqring_ev_posted(struct io_ring_ctx *ctx)
 {
 	if (waitqueue_active(&ctx->wait))
 		wake_up(&ctx->wait);
 	if (waitqueue_active(&ctx->sqo_wait))
 		wake_up(&ctx->sqo_wait);
-	if (ctx->cq_ev_fd)
+	if (ctx->cq_ev_fd && io_should_trigger_evfd(ctx))
 		eventfd_signal(ctx->cq_ev_fd, 1);
 }
 
@@ -6552,10 +6560,17 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
 		ret = io_sqe_files_update(ctx, arg, nr_args);
 		break;
 	case IORING_REGISTER_EVENTFD:
+	case IORING_REGISTER_EVENTFD_ASYNC:
 		ret = -EINVAL;
 		if (nr_args != 1)
 			break;
 		ret = io_eventfd_register(ctx, arg);
+		if (ret)
+			break;
+		if (opcode == IORING_REGISTER_EVENTFD_ASYNC)
+			ctx->eventfd_async = true;
+		else
+			ctx->eventfd_async = false;
 		break;
 	case IORING_UNREGISTER_EVENTFD:
 		ret = -EINVAL;
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index be64c9757ff1..02e0b2d59b63 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -192,6 +192,7 @@ struct io_uring_params {
 #define IORING_REGISTER_EVENTFD		4
 #define IORING_UNREGISTER_EVENTFD	5
 #define IORING_REGISTER_FILES_UPDATE	6
+#define IORING_REGISTER_EVENTFD_ASYNC	7
 
 struct io_uring_files_update {
 	__u32 offset;
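
For reference, a hedged sketch of how an application might opt in from userspace using the raw io_uring_register() syscall (liburing has no helper for this yet; the opcode value and __NR_io_uring_register availability are assumptions based on the patch above and current kernel headers):

#include <sys/eventfd.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef IORING_REGISTER_EVENTFD_ASYNC
#define IORING_REGISTER_EVENTFD_ASYNC	7	/* from the patch above */
#endif

/* Register an eventfd that is only signalled for async completions. */
static int register_async_eventfd(int ring_fd)
{
    int efd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);

    if (efd < 0)
        return -1;
    /* same arguments as IORING_REGISTER_EVENTFD */
    if (syscall(__NR_io_uring_register, ring_fd,
                IORING_REGISTER_EVENTFD_ASYNC, &efd, 1) < 0) {
        close(efd);
        return -1;
    }
    return efd;
}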

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: io_uring and spurious wake-ups from eventfd
  2020-01-08 17:20             ` Jens Axboe
@ 2020-01-08 18:08               ` Jens Axboe
  0 siblings, 0 replies; 11+ messages in thread
From: Jens Axboe @ 2020-01-08 18:08 UTC (permalink / raw)
  To: Mark Papadakis; +Cc: io-uring

On 1/8/20 10:20 AM, Jens Axboe wrote:
> On 1/8/20 9:50 AM, Jens Axboe wrote:
>> On 1/8/20 9:46 AM, Mark Papadakis wrote:
>>> Thus sounds perfect!
>>
>> I'll try and cook this up, if you can test it.
> 
> Something like the below.

I pushed it here for easier testing:

https://git.kernel.dk/cgit/linux-block/log/?h=for-5.6/io_uring-vfs

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: io_uring and spurious wake-ups from eventfd
  2020-01-08 16:24       ` Jens Axboe
  2020-01-08 16:46         ` Mark Papadakis
@ 2020-01-09  6:09         ` Daurnimator
  2020-01-09 15:14           ` Jens Axboe
  1 sibling, 1 reply; 11+ messages in thread
From: Daurnimator @ 2020-01-09  6:09 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Mark Papadakis, io-uring

On Thu, 9 Jan 2020 at 03:25, Jens Axboe <[email protected]> wrote:
> I see what you're saying, so essentially only trigger eventfd
> notifications if the completions happen async. That does make a lot of
> sense, and it would be cleaner than having to flag this per request as
> well. I think we'd still need to make that opt-in as it changes the
> behavior of it.
>
> The best way to do that would be to add IORING_REGISTER_EVENTFD_ASYNC or
> something like that. Does the exact same thing as
> IORING_REGISTER_EVENTFD, but only triggers it if completions happen
> async.
>
> What do you think?


Why would a new opcode be cleaner than using a flag for the existing
EVENTFD opcode?

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: io_uring and spurious wake-ups from eventfd
  2020-01-09  6:09         ` Daurnimator
@ 2020-01-09 15:14           ` Jens Axboe
  0 siblings, 0 replies; 11+ messages in thread
From: Jens Axboe @ 2020-01-09 15:14 UTC (permalink / raw)
  To: Daurnimator; +Cc: Mark Papadakis, io-uring

On 1/8/20 11:09 PM, Daurnimator wrote:
> On Thu, 9 Jan 2020 at 03:25, Jens Axboe <[email protected]> wrote:
>> I see what you're saying, so essentially only trigger eventfd
>> notifications if the completions happen async. That does make a lot of
>> sense, and it would be cleaner than having to flag this per request as
>> well. I think we'd still need to make that opt-in as it changes the
>> behavior of it.
>>
>> The best way to do that would be to add IORING_REGISTER_EVENTFD_ASYNC or
>> something like that. Does the exact same thing as
>> IORING_REGISTER_EVENTFD, but only triggers it if completions happen
>> async.
>>
>> What do you think?
> 
> 
> Why would a new opcode be cleaner than using a flag for the existing
> EVENTFD opcode?

A few reasons I can think of:

1) We don't consume an IOSQE flag, which is a pretty sparse resource.
2) This is generally behavior where you either want one or the other,
   not a mix. Hence a general setup/modify flag makes more sense to me.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2020-01-09 15:14 UTC | newest]

Thread overview: 11+ messages
2020-01-07 15:55 io_uring and spurious wake-ups from eventfd Mark Papadakis
2020-01-07 20:26 ` Jens Axboe
2020-01-07 20:34   ` Jens Axboe
2020-01-08  7:36     ` Mark Papadakis
2020-01-08 16:24       ` Jens Axboe
2020-01-08 16:46         ` Mark Papadakis
2020-01-08 16:50           ` Jens Axboe
2020-01-08 17:20             ` Jens Axboe
2020-01-08 18:08               ` Jens Axboe
2020-01-09  6:09         ` Daurnimator
2020-01-09 15:14           ` Jens Axboe
