* [PATCH v2 1/8] io_uring: move pinning out of io_req_local_work_add
From: Pavel Begunkov @ 2023-04-06 13:20 UTC (permalink / raw)
To: io-uring; +Cc: Jens Axboe, asml.silence, linux-kernel
Move ctx pinning from io_req_local_work_add() to the caller; it looks
better and makes working with the code a bit easier.
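For illustration, a minimal userspace C sketch of the same refactor (hypothetical names, C11 atomics standing in for percpu_ref): the helper keeps its early returns without needing a "goto put_ref", while the single caller pins and unpins around the call.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_int ctx_refs = 1;

/* callee: several early-return paths, none of which needs "goto put_ref" */
static void local_work_add(bool queued_first)
{
        if (!queued_first)
                return;
        puts("signal waiters / wake up task");
}

/* caller: pin once, unpin on the single exit */
static void task_work_add(bool queued_first)
{
        atomic_fetch_add(&ctx_refs, 1);
        local_work_add(queued_first);
        atomic_fetch_sub(&ctx_refs, 1);
}

int main(void)
{
        task_work_add(true);
        task_work_add(false);
        return 0;
}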
Signed-off-by: Pavel Begunkov <[email protected]>
---
io_uring/io_uring.c | 11 ++++-------
1 file changed, 4 insertions(+), 7 deletions(-)
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index ae90d2753e0d..29a0516ee5ce 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1306,17 +1306,15 @@ static void io_req_local_work_add(struct io_kiocb *req)
{
struct io_ring_ctx *ctx = req->ctx;
- percpu_ref_get(&ctx->refs);
-
if (!llist_add(&req->io_task_work.node, &ctx->work_llist))
- goto put_ref;
+ return;
/* needed for the following wake up */
smp_mb__after_atomic();
if (unlikely(atomic_read(&req->task->io_uring->in_cancel))) {
io_move_task_work_from_local(ctx);
- goto put_ref;
+ return;
}
if (ctx->flags & IORING_SETUP_TASKRUN_FLAG)
@@ -1326,9 +1324,6 @@ static void io_req_local_work_add(struct io_kiocb *req)
if (READ_ONCE(ctx->cq_waiting))
wake_up_state(ctx->submitter_task, TASK_INTERRUPTIBLE);
-
-put_ref:
- percpu_ref_put(&ctx->refs);
}
void __io_req_task_work_add(struct io_kiocb *req, bool allow_local)
@@ -1337,7 +1332,9 @@ void __io_req_task_work_add(struct io_kiocb *req, bool allow_local)
struct io_ring_ctx *ctx = req->ctx;
if (allow_local && ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
+ percpu_ref_get(&ctx->refs);
io_req_local_work_add(req);
+ percpu_ref_put(&ctx->refs);
return;
}
--
2.40.0
* [PATCH v2 2/8] io_uring: optimise local tw add ctx pinning
From: Pavel Begunkov @ 2023-04-06 13:20 UTC (permalink / raw)
To: io-uring; +Cc: Jens Axboe, asml.silence, linux-kernel
We currently pin the ctx for io_req_local_work_add() with
percpu_ref_get/put, which imply two rcu_read_lock/unlock pairs and some
extra overhead on top in the fast path. Replace it with a plain RCU read
section and let io_ring_exit_work() synchronise against it with
synchronize_rcu().
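For illustration, a minimal userspace analogue of that pairing, assuming liburcu is available (the patch itself uses the in-kernel RCU API): readers run inside a cheap rcu_read_lock() section, and the teardown path calls synchronize_rcu() before freeing, so any reader that saw the object is guaranteed to have finished with it.

#include <stdio.h>
#include <stdlib.h>
#include <urcu.h>               /* userspace RCU; link with -lurcu */

struct ctx {
        int alive;
};

static void reader(struct ctx *ctx)
{
        rcu_read_lock();        /* no atomics in the fast path */
        printf("ctx alive=%d\n", ctx->alive);
        rcu_read_unlock();
}

int main(void)
{
        struct ctx *ctx = calloc(1, sizeof(*ctx));

        if (!ctx)
                return 1;
        ctx->alive = 1;
        rcu_register_thread();
        reader(ctx);

        /* teardown: pairs with the reader's RCU section above */
        synchronize_rcu();      /* wait for all in-flight readers */
        free(ctx);

        rcu_unregister_thread();
        return 0;
}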
Signed-off-by: Pavel Begunkov <[email protected]>
---
io_uring/io_uring.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 29a0516ee5ce..fb7215b543cd 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1332,9 +1332,9 @@ void __io_req_task_work_add(struct io_kiocb *req, bool allow_local)
struct io_ring_ctx *ctx = req->ctx;
if (allow_local && ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
- percpu_ref_get(&ctx->refs);
+ rcu_read_lock();
io_req_local_work_add(req);
- percpu_ref_put(&ctx->refs);
+ rcu_read_unlock();
return;
}
@@ -3052,6 +3052,10 @@ static __cold void io_ring_exit_work(struct work_struct *work)
spin_lock(&ctx->completion_lock);
spin_unlock(&ctx->completion_lock);
+ /* pairs with RCU read section in io_req_local_work_add() */
+ if (ctx->flags & IORING_SETUP_DEFER_TASKRUN)
+ synchronize_rcu();
+
io_ring_ctx_free(ctx);
}
--
2.40.0
* [PATCH v2 3/8] io_uring: refactor __io_cq_unlock_post_flush()
From: Pavel Begunkov @ 2023-04-06 13:20 UTC (permalink / raw)
To: io-uring; +Cc: Jens Axboe, asml.silence, linux-kernel
Instead of smp_mb() + __io_cqring_wake() in __io_cq_unlock_post_flush(),
use the equivalent io_cqring_wake(). With that we can clean it up further
and remove __io_cqring_wake().
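The smp_mb() can be folded away because wq_has_sleeper(), which io_cqring_wake() uses after this patch, already issues the full barrier the waker needs before checking for sleepers. A self-contained C11 sketch of that barrier pairing (illustrative flags, not the kernel helpers):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_bool cond_set;            /* "a CQE was posted" */
static atomic_bool have_waiter;         /* "someone sleeps on cq_wait" */

/* completion side: publish the condition, then look for sleepers */
static void post_and_maybe_wake(void)
{
        atomic_store_explicit(&cond_set, true, memory_order_relaxed);
        /* full fence before peeking at the waiter flag; wq_has_sleeper()
         * bundles exactly this barrier in the kernel */
        atomic_thread_fence(memory_order_seq_cst);
        if (atomic_load_explicit(&have_waiter, memory_order_relaxed))
                puts("wake the waiter");
}

/* wait side: announce ourselves, then re-check the condition */
static bool prepare_to_wait_and_check(void)
{
        atomic_store_explicit(&have_waiter, true, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);
        /* a concurrent waker either sees have_waiter or we see cond_set */
        return atomic_load_explicit(&cond_set, memory_order_relaxed);
}

int main(void)
{
        prepare_to_wait_and_check();
        post_and_maybe_wake();
        return 0;
}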
Signed-off-by: Pavel Begunkov <[email protected]>
---
io_uring/io_uring.c | 6 ++----
io_uring/io_uring.h | 11 ++---------
2 files changed, 4 insertions(+), 13 deletions(-)
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index fb7215b543cd..d4ac62de2113 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -640,10 +640,8 @@ static inline void __io_cq_unlock_post_flush(struct io_ring_ctx *ctx)
* it will re-check the wakeup conditions once we return we can safely
* skip waking it up.
*/
- if (!(ctx->flags & IORING_SETUP_DEFER_TASKRUN)) {
- smp_mb();
- __io_cqring_wake(ctx);
- }
+ if (!(ctx->flags & IORING_SETUP_DEFER_TASKRUN))
+ io_cqring_wake(ctx);
}
void io_cq_unlock_post(struct io_ring_ctx *ctx)
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 193b2db39fe8..24d8196bbca3 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -228,8 +228,7 @@ static inline void io_poll_wq_wake(struct io_ring_ctx *ctx)
poll_to_key(EPOLL_URING_WAKE | EPOLLIN));
}
-/* requires smb_mb() prior, see wq_has_sleeper() */
-static inline void __io_cqring_wake(struct io_ring_ctx *ctx)
+static inline void io_cqring_wake(struct io_ring_ctx *ctx)
{
/*
* Trigger waitqueue handler on all waiters on our waitqueue. This
@@ -241,17 +240,11 @@ static inline void __io_cqring_wake(struct io_ring_ctx *ctx)
* waitqueue handlers, we know we have a dependency between eventfd or
* epoll and should terminate multishot poll at that point.
*/
- if (waitqueue_active(&ctx->cq_wait))
+ if (wq_has_sleeper(&ctx->cq_wait))
__wake_up(&ctx->cq_wait, TASK_NORMAL, 0,
poll_to_key(EPOLL_URING_WAKE | EPOLLIN));
}
-static inline void io_cqring_wake(struct io_ring_ctx *ctx)
-{
- smp_mb();
- __io_cqring_wake(ctx);
-}
-
static inline bool io_sqring_full(struct io_ring_ctx *ctx)
{
struct io_rings *r = ctx->rings;
--
2.40.0
* Re: [PATCH v2 3/8] io_uring: refactor __io_cq_unlock_post_flush()
From: Pavel Begunkov @ 2023-04-06 14:23 UTC (permalink / raw)
To: io-uring; +Cc: Jens Axboe, linux-kernel
On 4/6/23 14:20, Pavel Begunkov wrote:
> Instead of smp_mb() + __io_cqring_wake() in __io_cq_unlock_post_flush()
> use equivalent io_cqring_wake(). With that we can clean it up further
> and remove __io_cqring_wake().
I didn't notice that patches 3 and 7 have the same subject. This one
would have been better called "refactor io_cqring_wake()".
> Signed-off-by: Pavel Begunkov <[email protected]>
> ---
> io_uring/io_uring.c | 6 ++----
> io_uring/io_uring.h | 11 ++---------
> 2 files changed, 4 insertions(+), 13 deletions(-)
>
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index fb7215b543cd..d4ac62de2113 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -640,10 +640,8 @@ static inline void __io_cq_unlock_post_flush(struct io_ring_ctx *ctx)
> * it will re-check the wakeup conditions once we return we can safely
> * skip waking it up.
> */
> - if (!(ctx->flags & IORING_SETUP_DEFER_TASKRUN)) {
> - smp_mb();
> - __io_cqring_wake(ctx);
> - }
> + if (!(ctx->flags & IORING_SETUP_DEFER_TASKRUN))
> + io_cqring_wake(ctx);
> }
>
--
Pavel Begunkov
* [PATCH v2 4/8] io_uring: add tw add flags
From: Pavel Begunkov @ 2023-04-06 13:20 UTC (permalink / raw)
To: io-uring; +Cc: Jens Axboe, asml.silence, linux-kernel
We pass 'allow_local' into io_req_task_work_add() but will need more
flags. Replace the bool with a flags bit field and give the allow_local
behaviour a named flag.
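A tiny userspace sketch of the call-site change, with hypothetical names (the kernel's flag is IOU_F_TWQ_FORCE_NORMAL, the inverse of allow_local), showing how new behaviours slot in without churning every caller:

#include <stdio.h>

enum {
        TWQ_FORCE_NORMAL = 1 << 0,      /* don't use deferred task_work */
        /* later flags, e.g. a lazy-wake hint, are added here */
};

static void task_work_add(unsigned flags)
{
        if (!(flags & TWQ_FORCE_NORMAL))
                puts("deferred (local) task_work path");
        else
                puts("normal task_work path");
}

int main(void)
{
        task_work_add(0);                       /* old allow_local == true */
        task_work_add(TWQ_FORCE_NORMAL);        /* old allow_local == false */
        return 0;
}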
Signed-off-by: Pavel Begunkov <[email protected]>
---
io_uring/io_uring.c | 7 ++++---
io_uring/io_uring.h | 9 +++++++--
2 files changed, 11 insertions(+), 5 deletions(-)
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index d4ac62de2113..6f175fe682e4 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1324,12 +1324,13 @@ static void io_req_local_work_add(struct io_kiocb *req)
wake_up_state(ctx->submitter_task, TASK_INTERRUPTIBLE);
}
-void __io_req_task_work_add(struct io_kiocb *req, bool allow_local)
+void __io_req_task_work_add(struct io_kiocb *req, unsigned flags)
{
struct io_uring_task *tctx = req->task->io_uring;
struct io_ring_ctx *ctx = req->ctx;
- if (allow_local && ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
+ if (!(flags & IOU_F_TWQ_FORCE_NORMAL) &&
+ (ctx->flags & IORING_SETUP_DEFER_TASKRUN)) {
rcu_read_lock();
io_req_local_work_add(req);
rcu_read_unlock();
@@ -1359,7 +1360,7 @@ static void __cold io_move_task_work_from_local(struct io_ring_ctx *ctx)
io_task_work.node);
node = node->next;
- __io_req_task_work_add(req, false);
+ __io_req_task_work_add(req, IOU_F_TWQ_FORCE_NORMAL);
}
}
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 24d8196bbca3..cb4309a2acdc 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -15,6 +15,11 @@
#include <trace/events/io_uring.h>
#endif
+enum {
+ /* don't use deferred task_work */
+ IOU_F_TWQ_FORCE_NORMAL = 1,
+};
+
enum {
IOU_OK = 0,
IOU_ISSUE_SKIP_COMPLETE = -EIOCBQUEUED,
@@ -48,7 +53,7 @@ static inline bool io_req_ffs_set(struct io_kiocb *req)
return req->flags & REQ_F_FIXED_FILE;
}
-void __io_req_task_work_add(struct io_kiocb *req, bool allow_local);
+void __io_req_task_work_add(struct io_kiocb *req, unsigned flags);
bool io_is_uring_fops(struct file *file);
bool io_alloc_async_data(struct io_kiocb *req);
void io_req_task_queue(struct io_kiocb *req);
@@ -93,7 +98,7 @@ bool io_match_task_safe(struct io_kiocb *head, struct task_struct *task,
static inline void io_req_task_work_add(struct io_kiocb *req)
{
- __io_req_task_work_add(req, true);
+ __io_req_task_work_add(req, 0);
}
#define io_for_each_link(pos, head) \
--
2.40.0
* [PATCH v2 5/8] io_uring: inline llist_add()
From: Pavel Begunkov @ 2023-04-06 13:20 UTC (permalink / raw)
To: io-uring; +Cc: Jens Axboe, asml.silence, linux-kernel
We'll need to grab some information from the previous request in the tw
list, so inline llist_add() here; it'll be used in the following patch.
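The point of open-coding it is that the compare-exchange loop hands back the previous list head, which the next patch reads. A minimal userspace C11 analogue (illustrative types and names, not the kernel llist implementation):

#include <stdatomic.h>
#include <stdio.h>

struct node {
        struct node *next;
        unsigned nr_tw;         /* per-item hint used by the next patch */
};

static _Atomic(struct node *) list_head;

/* push a node and return the previous head, which llist_add() hides */
static struct node *push_get_prev(struct node *node)
{
        struct node *first = atomic_load_explicit(&list_head,
                                                  memory_order_relaxed);

        do {
                node->next = first;     /* link before publishing */
        } while (!atomic_compare_exchange_weak_explicit(&list_head, &first,
                                                        node,
                                                        memory_order_release,
                                                        memory_order_relaxed));
        return first;                   /* NULL: the list was empty */
}

int main(void)
{
        struct node a = { 0 }, b = { 0 };

        printf("prev=%p\n", (void *)push_get_prev(&a)); /* NULL */
        printf("prev=%p\n", (void *)push_get_prev(&b)); /* &a */
        return 0;
}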
Signed-off-by: Pavel Begunkov <[email protected]>
---
io_uring/io_uring.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 6f175fe682e4..786ecfa01c54 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1303,8 +1303,15 @@ static __cold void io_fallback_tw(struct io_uring_task *tctx)
static void io_req_local_work_add(struct io_kiocb *req)
{
struct io_ring_ctx *ctx = req->ctx;
+ struct llist_node *first;
- if (!llist_add(&req->io_task_work.node, &ctx->work_llist))
+ first = READ_ONCE(ctx->work_llist.first);
+ do {
+ req->io_task_work.node.next = first;
+ } while (!try_cmpxchg(&ctx->work_llist.first, &first,
+ &req->io_task_work.node));
+
+ if (first)
return;
/* needed for the following wake up */
--
2.40.0
* [PATCH v2 6/8] io_uring: reduce scheduling due to tw
From: Pavel Begunkov @ 2023-04-06 13:20 UTC (permalink / raw)
To: io-uring; +Cc: Jens Axboe, asml.silence, linux-kernel
Every queued task_work item tries to wake up the task it will run in,
which causes excessive scheduling and additional overhead. For some tw
that's justified, but others won't do much more than post a single CQE.
When a task waits for multiple CQEs, every such task_work will wake it
up. Instead, the task may give a hint about how many CQEs it waits for;
io_req_local_work_add() will compare against it and skip wake-ups if
#CQEs + #tw is not enough to satisfy the waiting condition. task_work
that uses the optimisation should be simple enough to never post more
than one CQE. It's also ignored for non-DEFER_TASKRUN rings.
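The wake decision itself is small enough to sketch as a standalone helper (hypothetical names; in the kernel these values are read with atomics, and nr_tw is forced to -1U when the lazy hint is not set so the comparison always passes):

#include <stdbool.h>
#include <stdio.h>

static bool should_wake(unsigned nr_wait,       /* CQEs the waiter needs; 0: nobody waits */
                        unsigned nr_tw,         /* queued tw items including this one */
                        unsigned nr_tw_prev)    /* count before this add */
{
        if (!nr_wait)
                return false;           /* no one is waiting */
        if (nr_wait > nr_tw)
                return false;           /* not enough pending completions yet */
        if (nr_tw_prev >= nr_wait)
                return false;           /* a previous add already woke the task */
        return true;
}

int main(void)
{
        /* a waiter wanting 4 CQEs is only woken by the 4th queued item */
        for (unsigned n = 1; n <= 5; n++)
                printf("tw #%u -> wake=%d\n", n, should_wake(4, n, n - 1));
        return 0;
}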
Signed-off-by: Pavel Begunkov <[email protected]>
---
include/linux/io_uring_types.h | 3 +-
io_uring/io_uring.c | 68 +++++++++++++++++++++++-----------
io_uring/io_uring.h | 9 +++++
io_uring/notif.c | 2 +-
io_uring/notif.h | 2 +-
io_uring/rw.c | 2 +-
6 files changed, 61 insertions(+), 25 deletions(-)
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 4a6ce03a4903..fa621a508a01 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -296,7 +296,7 @@ struct io_ring_ctx {
spinlock_t completion_lock;
bool poll_multi_queue;
- bool cq_waiting;
+ atomic_t cq_wait_nr;
/*
* ->iopoll_list is protected by the ctx->uring_lock for
@@ -566,6 +566,7 @@ struct io_kiocb {
atomic_t refs;
atomic_t poll_refs;
struct io_task_work io_task_work;
+ unsigned nr_tw;
/* for polled requests, i.e. IORING_OP_POLL_ADD and async armed poll */
union {
struct hlist_node hash_node;
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 786ecfa01c54..8a327a81beaf 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1300,35 +1300,59 @@ static __cold void io_fallback_tw(struct io_uring_task *tctx)
}
}
-static void io_req_local_work_add(struct io_kiocb *req)
+static void io_req_local_work_add(struct io_kiocb *req, unsigned flags)
{
struct io_ring_ctx *ctx = req->ctx;
+ unsigned nr_wait, nr_tw, nr_tw_prev;
struct llist_node *first;
+ if (req->flags & (REQ_F_LINK | REQ_F_HARDLINK))
+ flags &= ~IOU_F_TWQ_LAZY_WAKE;
+
first = READ_ONCE(ctx->work_llist.first);
do {
+ nr_tw_prev = 0;
+ if (first) {
+ struct io_kiocb *first_req = container_of(first,
+ struct io_kiocb,
+ io_task_work.node);
+ /*
+ * Might be executed at any moment, rely on
+ * SLAB_TYPESAFE_BY_RCU to keep it alive.
+ */
+ nr_tw_prev = READ_ONCE(first_req->nr_tw);
+ }
+ nr_tw = nr_tw_prev + 1;
+ /* Large enough to fail the nr_wait comparison below */
+ if (!(flags & IOU_F_TWQ_LAZY_WAKE))
+ nr_tw = -1U;
+
+ req->nr_tw = nr_tw;
req->io_task_work.node.next = first;
} while (!try_cmpxchg(&ctx->work_llist.first, &first,
&req->io_task_work.node));
- if (first)
- return;
-
- /* needed for the following wake up */
- smp_mb__after_atomic();
-
- if (unlikely(atomic_read(&req->task->io_uring->in_cancel))) {
- io_move_task_work_from_local(ctx);
- return;
+ if (!first) {
+ if (unlikely(atomic_read(&req->task->io_uring->in_cancel))) {
+ io_move_task_work_from_local(ctx);
+ return;
+ }
+ if (ctx->flags & IORING_SETUP_TASKRUN_FLAG)
+ atomic_or(IORING_SQ_TASKRUN, &ctx->rings->sq_flags);
+ if (ctx->has_evfd)
+ io_eventfd_signal(ctx);
}
- if (ctx->flags & IORING_SETUP_TASKRUN_FLAG)
- atomic_or(IORING_SQ_TASKRUN, &ctx->rings->sq_flags);
- if (ctx->has_evfd)
- io_eventfd_signal(ctx);
-
- if (READ_ONCE(ctx->cq_waiting))
- wake_up_state(ctx->submitter_task, TASK_INTERRUPTIBLE);
+ nr_wait = atomic_read(&ctx->cq_wait_nr);
+ /* no one is waiting */
+ if (!nr_wait)
+ return;
+ /* either not enough or the previous add has already woken it up */
+ if (nr_wait > nr_tw || nr_tw_prev >= nr_wait)
+ return;
+ /* pairs with set_current_state() in io_cqring_wait() */
+ smp_mb__after_atomic();
+ wake_up_state(ctx->submitter_task, TASK_INTERRUPTIBLE);
}
void __io_req_task_work_add(struct io_kiocb *req, unsigned flags)
@@ -1339,7 +1363,7 @@ void __io_req_task_work_add(struct io_kiocb *req, unsigned flags)
if (!(flags & IOU_F_TWQ_FORCE_NORMAL) &&
(ctx->flags & IORING_SETUP_DEFER_TASKRUN)) {
rcu_read_lock();
- io_req_local_work_add(req);
+ io_req_local_work_add(req, flags);
rcu_read_unlock();
return;
}
@@ -2625,7 +2649,9 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
unsigned long check_cq;
if (ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
- WRITE_ONCE(ctx->cq_waiting, 1);
+ int nr_wait = (int) iowq.cq_tail - READ_ONCE(ctx->rings->cq.tail);
+
+ atomic_set(&ctx->cq_wait_nr, nr_wait);
set_current_state(TASK_INTERRUPTIBLE);
} else {
prepare_to_wait_exclusive(&ctx->cq_wait, &iowq.wq,
@@ -2634,7 +2660,7 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
ret = io_cqring_wait_schedule(ctx, &iowq);
__set_current_state(TASK_RUNNING);
- WRITE_ONCE(ctx->cq_waiting, 0);
+ atomic_set(&ctx->cq_wait_nr, 0);
if (ret < 0)
break;
@@ -4517,7 +4543,7 @@ static int __init io_uring_init(void)
io_uring_optable_init();
req_cachep = KMEM_CACHE(io_kiocb, SLAB_HWCACHE_ALIGN | SLAB_PANIC |
- SLAB_ACCOUNT);
+ SLAB_ACCOUNT | SLAB_TYPESAFE_BY_RCU);
return 0;
};
__initcall(io_uring_init);
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index cb4309a2acdc..ef449e43d493 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -18,6 +18,15 @@
enum {
/* don't use deferred task_work */
IOU_F_TWQ_FORCE_NORMAL = 1,
+
+ /*
+ * A hint to not wake right away but delay until there are enough of
+ * tw's queued to match the number of CQEs the task is waiting for.
+ *
+ * Must not be used with requests generating more than one CQE.
+ * It's also ignored unless IORING_SETUP_DEFER_TASKRUN is set.
+ */
+ IOU_F_TWQ_LAZY_WAKE = 2,
};
enum {
diff --git a/io_uring/notif.c b/io_uring/notif.c
index 172105eb347d..e1846a25dde1 100644
--- a/io_uring/notif.c
+++ b/io_uring/notif.c
@@ -31,7 +31,7 @@ static void io_tx_ubuf_callback(struct sk_buff *skb, struct ubuf_info *uarg,
struct io_kiocb *notif = cmd_to_io_kiocb(nd);
if (refcount_dec_and_test(&uarg->refcnt))
- io_req_task_work_add(notif);
+ __io_req_task_work_add(notif, IOU_F_TWQ_LAZY_WAKE);
}
static void io_tx_ubuf_callback_ext(struct sk_buff *skb, struct ubuf_info *uarg,
diff --git a/io_uring/notif.h b/io_uring/notif.h
index c88c800cd89d..6dd1b30a468f 100644
--- a/io_uring/notif.h
+++ b/io_uring/notif.h
@@ -33,7 +33,7 @@ static inline void io_notif_flush(struct io_kiocb *notif)
/* drop slot's master ref */
if (refcount_dec_and_test(&nd->uarg.refcnt))
- io_req_task_work_add(notif);
+ __io_req_task_work_add(notif, IOU_F_TWQ_LAZY_WAKE);
}
static inline int io_notif_account_mem(struct io_kiocb *notif, unsigned len)
diff --git a/io_uring/rw.c b/io_uring/rw.c
index f14868624f41..6c7d2654770e 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -304,7 +304,7 @@ static void io_complete_rw(struct kiocb *kiocb, long res)
return;
io_req_set_res(req, io_fixup_rw_res(req, res), 0);
req->io_task_work.func = io_req_rw_complete;
- io_req_task_work_add(req);
+ __io_req_task_work_add(req, IOU_F_TWQ_LAZY_WAKE);
}
static void io_complete_rw_iopoll(struct kiocb *kiocb, long res)
--
2.40.0
* [PATCH v2 7/8] io_uring: refactor __io_cq_unlock_post_flush()
From: Pavel Begunkov @ 2023-04-06 13:20 UTC (permalink / raw)
To: io-uring; +Cc: Jens Axboe, asml.silence, linux-kernel
Separate the ->task_complete path in __io_cq_unlock_post_flush().
Signed-off-by: Pavel Begunkov <[email protected]>
---
io_uring/io_uring.c | 22 ++++++++++++----------
1 file changed, 12 insertions(+), 10 deletions(-)
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 8a327a81beaf..0ea50c46f27f 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -627,21 +627,23 @@ static inline void __io_cq_unlock_post(struct io_ring_ctx *ctx)
io_cqring_wake(ctx);
}
-static inline void __io_cq_unlock_post_flush(struct io_ring_ctx *ctx)
+static void __io_cq_unlock_post_flush(struct io_ring_ctx *ctx)
__releases(ctx->completion_lock)
{
io_commit_cqring(ctx);
- __io_cq_unlock(ctx);
- io_commit_cqring_flush(ctx);
- /*
- * As ->task_complete implies that the ring is single tasked, cq_wait
- * may only be waited on by the current in io_cqring_wait(), but since
- * it will re-check the wakeup conditions once we return we can safely
- * skip waking it up.
- */
- if (!(ctx->flags & IORING_SETUP_DEFER_TASKRUN))
+ if (ctx->task_complete) {
+ /*
+ * ->task_complete implies that only current might be waiting
+ * for CQEs, and obviously, we currently don't. No one is
+ * waiting, wakeups are futile, skip them.
+ */
+ io_commit_cqring_flush(ctx);
+ } else {
+ __io_cq_unlock(ctx);
+ io_commit_cqring_flush(ctx);
io_cqring_wake(ctx);
+ }
}
void io_cq_unlock_post(struct io_ring_ctx *ctx)
--
2.40.0
* [PATCH v2 8/8] io_uring: optimise io_req_local_work_add
From: Pavel Begunkov @ 2023-04-06 13:20 UTC (permalink / raw)
To: io-uring; +Cc: Jens Axboe, asml.silence, linux-kernel
Chains of memory accesses are never good for performance.
The req->task->io_uring->in_cancel check in io_req_local_work_add() is
there so that when a task exiting via io_uring_try_cancel_requests()
starts waiting for completions, it gets woken up by every new task_work
item queued.
Do a little trick instead: announce waiting in
io_uring_try_cancel_requests(), so that io_req_local_work_add() wakes us
up. We also need to check for deferred tw items after
prepare_to_wait(TASK_INTERRUPTIBLE).
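A small C11 sketch of the announce-then-check pairing this relies on (illustrative names; in the kernel the announcement is atomic_set(&ctx->cq_wait_nr, 1) followed by smp_mb()): either the cancelling task sees the queued work when it re-checks, or the queueing side sees a non-zero waiter count and issues the wakeup.

#include <stdatomic.h>
#include <stdio.h>

static atomic_int cq_wait_nr;           /* 0: nobody waits; >0: wake on queue */
static atomic_int work_queued;          /* stand-in for the deferred tw list */

/* cancel/wait side */
static void announce_and_check(void)
{
        atomic_store_explicit(&cq_wait_nr, 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);      /* pairs with fence below */
        if (atomic_load_explicit(&work_queued, memory_order_relaxed))
                puts("don't sleep: run the pending task_work first");
}

/* io_req_local_work_add() side */
static void queue_work(void)
{
        atomic_store_explicit(&work_queued, 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);
        if (atomic_load_explicit(&cq_wait_nr, memory_order_relaxed))
                puts("wake the waiting task");
}

int main(void)
{
        announce_and_check();
        queue_work();
        return 0;
}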
Signed-off-by: Pavel Begunkov <[email protected]>
---
io_uring/io_uring.c | 24 ++++++++++++++++--------
1 file changed, 16 insertions(+), 8 deletions(-)
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 0ea50c46f27f..9bbf58297a0e 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1335,10 +1335,6 @@ static void io_req_local_work_add(struct io_kiocb *req, unsigned flags)
&req->io_task_work.node));
if (!first) {
- if (unlikely(atomic_read(&req->task->io_uring->in_cancel))) {
- io_move_task_work_from_local(ctx);
- return;
- }
if (ctx->flags & IORING_SETUP_TASKRUN_FLAG)
atomic_or(IORING_SQ_TASKRUN, &ctx->rings->sq_flags);
if (ctx->has_evfd)
@@ -3205,6 +3201,12 @@ static __cold bool io_uring_try_cancel_requests(struct io_ring_ctx *ctx,
enum io_wq_cancel cret;
bool ret = false;
+ /* set it so io_req_local_work_add() would wake us up */
+ if (ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
+ atomic_set(&ctx->cq_wait_nr, 1);
+ smp_mb();
+ }
+
/* failed during ring init, it couldn't have issued any requests */
if (!ctx->rings)
return false;
@@ -3259,6 +3261,8 @@ __cold void io_uring_cancel_generic(bool cancel_all, struct io_sq_data *sqd)
{
struct io_uring_task *tctx = current->io_uring;
struct io_ring_ctx *ctx;
+ struct io_tctx_node *node;
+ unsigned long index;
s64 inflight;
DEFINE_WAIT(wait);
@@ -3280,9 +3284,6 @@ __cold void io_uring_cancel_generic(bool cancel_all, struct io_sq_data *sqd)
break;
if (!sqd) {
- struct io_tctx_node *node;
- unsigned long index;
-
xa_for_each(&tctx->xa, index, node) {
/* sqpoll task will cancel all its requests */
if (node->ctx->sq_data)
@@ -3305,7 +3306,13 @@ __cold void io_uring_cancel_generic(bool cancel_all, struct io_sq_data *sqd)
prepare_to_wait(&tctx->wait, &wait, TASK_INTERRUPTIBLE);
io_run_task_work();
io_uring_drop_tctx_refs(current);
-
+ xa_for_each(&tctx->xa, index, node) {
+ if (!llist_empty(&node->ctx->work_llist)) {
+ WARN_ON_ONCE(node->ctx->submitter_task &&
+ node->ctx->submitter_task != current);
+ goto end_wait;
+ }
+ }
/*
* If we've seen completions, retry without waiting. This
* avoids a race where a completion comes in before we did
@@ -3313,6 +3320,7 @@ __cold void io_uring_cancel_generic(bool cancel_all, struct io_sq_data *sqd)
*/
if (inflight == tctx_inflight(tctx, !cancel_all))
schedule();
+end_wait:
finish_wait(&tctx->wait, &wait);
} while (1);
--
2.40.0
* Re: [PATCH v2 0/8] optimise rescheduling due to deferred tw
From: Jens Axboe @ 2023-04-12 1:53 UTC (permalink / raw)
To: io-uring, Pavel Begunkov; +Cc: linux-kernel
On Thu, 06 Apr 2023 14:20:06 +0100, Pavel Begunkov wrote:
> io_uring extensively uses task_work, but when a task is waiting
> every new queued task_work batch will try to wake it up and so
> cause lots of scheduling activity. This series optimises it,
> specifically applied for rw completions and send-zc notifications
> for now, and will be helpful for further optimisations.
>
> Quick testing shows results similar to v1; numbers from v1:
> For my zc net test, once in a while waiting for a portion of buffers,
> I've got a 10x decrease in the number of context switches and a 2x
> improvement in CPU util (17% vs 8%). In profiles, io_cqring_work()
> got down from 40-50% of CPU to ~13%.
>
> [...]
Applied, thanks!
[1/8] io_uring: move pinning out of io_req_local_work_add
commit: ab1c590f5c9b96d8d8843d351aed72469f8f2ef0
[2/8] io_uring: optimise local tw add ctx pinning
commit: d73a572df24661851465c821d33c03e70e4b68e5
[3/8] io_uring: refactor __io_cq_unlock_post_flush()
commit: c66ae3ec38f946edb1776d25c1c8cd63803b8ec3
[4/8] io_uring: add tw add flags
commit: 8501fe70ae9855076ffb03a3670e02a7b3437304
[5/8] io_uring: inline llist_add()
commit: 5150940079a3ce94d7474f6f5b0d6276569dc1de
[6/8] io_uring: reduce scheduling due to tw
commit: 8751d15426a31baaf40f7570263c27c3e5d1dc44
[7/8] io_uring: refactor __io_cq_unlock_post_flush()
commit: c66ae3ec38f946edb1776d25c1c8cd63803b8ec3
[8/8] io_uring: optimise io_req_local_work_add
commit: 360cd42c4e95ff06d8d7b0a54e42236c7e7c187f
Best regards,
--
Jens Axboe