* [PATCH 0/8] Various io_uring micro-optimizations (reducing lock contention)
@ 2025-01-28 13:39 Max Kellermann
2025-01-28 13:39 ` [PATCH 1/8] io_uring/io-wq: eliminate redundant io_work_get_acct() calls Max Kellermann
` (8 more replies)
0 siblings, 9 replies; 30+ messages in thread
From: Max Kellermann @ 2025-01-28 13:39 UTC (permalink / raw)
To: axboe, asml.silence, io-uring, linux-kernel; +Cc: Max Kellermann
While optimizing my io_uring-based web server, I found that the kernel
spends 35% of its CPU time waiting for `io_wq_acct.lock`. This patch
set reduces contention on that lock, though I believe much more needs
to be done to allow more worker concurrency.
I measured these patches with my HTTP server (serving static files and
running a tiny PHP script) and with a micro-benchmark that submits
millions of `IORING_OP_NOP` entries (with `IOSQE_ASYNC` to force
offloading the operation to a worker, so this offload overhead can be
measured).
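For reference, a minimal sketch of such a NOP flood using liburing
could look like the following (queue depth, total count and the
reaping strategy are assumptions for illustration, not the exact
benchmark used here):

#include <liburing.h>
#include <stdio.h>

int main(void)
{
	struct io_uring ring;
	unsigned long done = 0;
	const unsigned long total = 10UL * 1000 * 1000;

	if (io_uring_queue_init(256, &ring, 0) < 0)
		return 1;

	while (done < total) {
		struct io_uring_sqe *sqe;
		struct io_uring_cqe *cqe;
		unsigned queued = 0;

		/* fill the SQ ring with NOPs, forced to io-wq via IOSQE_ASYNC */
		while ((sqe = io_uring_get_sqe(&ring)) != NULL) {
			io_uring_prep_nop(sqe);
			sqe->flags |= IOSQE_ASYNC;
			queued++;
		}

		io_uring_submit(&ring);

		/* reap all completions of this batch */
		while (queued-- > 0) {
			if (io_uring_wait_cqe(&ring, &cqe) < 0)
				return 1;
			io_uring_cqe_seen(&ring, cqe);
			done++;
		}
	}

	io_uring_queue_exit(&ring);
	printf("%lu NOPs completed\n", done);
	return 0;
}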
Some of the optimizations eliminate memory accesses, e.g. by passing
values that are already known to (inlined) functions and by caching
values in local variables. These are useful optimizations, but they
are too small to measure in a benchmark (too much noise).
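For illustration only, a schematic of this pattern (the helper
example_insert_work() is made up, and io_work_get_acct() is shown with
the flags-taking signature introduced by patch 4):

/* schematic only - mirrors the pattern used by patches 1, 4 and 7 */
static void enqueue_example(struct io_wq *wq, struct io_wq_work *work)
{
	/* one atomic read, cached in a local, instead of several scattered ones */
	const unsigned int work_flags = atomic_read(&work->flags);

	/* the caller derives the acct once and passes it down ... */
	struct io_wq_acct *acct = io_work_get_acct(wq, work_flags);

	/* ... so the (made-up) callee doesn't call io_work_get_acct() again */
	example_insert_work(wq, acct, work, work_flags);
}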
Some of the patches have a measurable effect and they contain
benchmark numbers that I could reproduce in repeated runs, despite the
noise.
I'm not confident about the correctness of the last patch ("io_uring:
skip redundant poll wakeups"). This seemed like low-hanging fruit, so
low that it seemed suspicious to me. If this is a useful
optimization, the idea could probably be ported to other wait_queue
users, or even into the wait_queue library. What I'm not confident
about is whether the optimization is valid or whether it may miss
wakeups, leading to stalls. Please advise!
Total "perf diff" for `IORING_OP_NOP`:
42.25% -9.24% [kernel.kallsyms] [k] queued_spin_lock_slowpath
4.79% +2.83% [kernel.kallsyms] [k] io_worker_handle_work
7.23% -1.41% [kernel.kallsyms] [k] io_wq_submit_work
6.80% +1.23% [kernel.kallsyms] [k] io_wq_free_work
3.19% +1.10% [kernel.kallsyms] [k] io_req_task_complete
2.45% +0.94% [kernel.kallsyms] [k] try_to_wake_up
+0.81% [kernel.kallsyms] [k] io_acct_activate_free_worker
0.79% +0.64% [kernel.kallsyms] [k] __schedule
Serving static files with HTTP (send+receive on local+TCP, splice
file->pipe->TCP):
42.92% -7.84% [kernel.kallsyms] [k] queued_spin_lock_slowpath
1.53% -1.51% [kernel.kallsyms] [k] ep_poll_callback
1.18% +1.49% [kernel.kallsyms] [k] io_wq_free_work
0.61% +0.60% [kernel.kallsyms] [k] try_to_wake_up
0.76% -0.43% [kernel.kallsyms] [k] _raw_spin_lock_irqsave
2.22% -0.33% [kernel.kallsyms] [k] io_wq_submit_work
Running PHP script (send+receive on local+TCP, splice pipe->TCP):
33.01% -4.13% [kernel.kallsyms] [k] queued_spin_lock_slowpath
1.57% -1.56% [kernel.kallsyms] [k] ep_poll_callback
1.36% +1.19% [kernel.kallsyms] [k] io_wq_free_work
0.94% -0.61% [kernel.kallsyms] [k] _raw_spin_lock_irqsave
2.56% -0.36% [kernel.kallsyms] [k] io_wq_submit_work
2.06% +0.36% [kernel.kallsyms] [k] io_worker_handle_work
1.00% +0.35% [kernel.kallsyms] [k] try_to_wake_up
(The `IORING_OP_NOP` benchmark finishes after a hardcoded number of
operations, while the two HTTP benchmarks run for a fixed wallclock
duration; with the patches applied, more HTTP requests were therefore
handled within that time.)
Max Kellermann (8):
io_uring/io-wq: eliminate redundant io_work_get_acct() calls
io_uring/io-wq: add io_worker.acct pointer
io_uring/io-wq: move worker lists to struct io_wq_acct
io_uring/io-wq: cache work->flags in variable
io_uring/io-wq: do not use bogus hash value
io_uring/io-wq: pass io_wq to io_get_next_work()
io_uring: cache io_kiocb->flags in variable
io_uring: skip redundant poll wakeups
include/linux/io_uring_types.h | 10 ++
io_uring/io-wq.c | 230 +++++++++++++++++++--------------
io_uring/io-wq.h | 7 +-
io_uring/io_uring.c | 63 +++++----
io_uring/io_uring.h | 2 +-
5 files changed, 187 insertions(+), 125 deletions(-)
--
2.45.2
* [PATCH 1/8] io_uring/io-wq: eliminate redundant io_work_get_acct() calls
2025-01-28 13:39 [PATCH 0/8] Various io_uring micro-optimizations (reducing lock contention) Max Kellermann
@ 2025-01-28 13:39 ` Max Kellermann
2025-01-28 13:39 ` [PATCH 2/8] io_uring/io-wq: add io_worker.acct pointer Max Kellermann
` (7 subsequent siblings)
8 siblings, 0 replies; 30+ messages in thread
From: Max Kellermann @ 2025-01-28 13:39 UTC (permalink / raw)
To: axboe, asml.silence, io-uring, linux-kernel; +Cc: Max Kellermann
Instead of calling io_work_get_acct() again, pass acct to
io_wq_insert_work() and io_wq_remove_pending().
The atomic access in io_work_get_acct() was done while holding
`acct->lock`; optimizing it away reduces lock contention a bit.
Signed-off-by: Max Kellermann <[email protected]>
---
io_uring/io-wq.c | 9 ++++-----
1 file changed, 4 insertions(+), 5 deletions(-)
diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
index 5d0928f37471..6d26f6f068af 100644
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -903,9 +903,8 @@ static void io_run_cancel(struct io_wq_work *work, struct io_wq *wq)
} while (work);
}
-static void io_wq_insert_work(struct io_wq *wq, struct io_wq_work *work)
+static void io_wq_insert_work(struct io_wq *wq, struct io_wq_acct *acct, struct io_wq_work *work)
{
- struct io_wq_acct *acct = io_work_get_acct(wq, work);
unsigned int hash;
struct io_wq_work *tail;
@@ -951,7 +950,7 @@ void io_wq_enqueue(struct io_wq *wq, struct io_wq_work *work)
}
raw_spin_lock(&acct->lock);
- io_wq_insert_work(wq, work);
+ io_wq_insert_work(wq, acct, work);
clear_bit(IO_ACCT_STALLED_BIT, &acct->flags);
raw_spin_unlock(&acct->lock);
@@ -1021,10 +1020,10 @@ static bool io_wq_worker_cancel(struct io_worker *worker, void *data)
}
static inline void io_wq_remove_pending(struct io_wq *wq,
+ struct io_wq_acct *acct,
struct io_wq_work *work,
struct io_wq_work_node *prev)
{
- struct io_wq_acct *acct = io_work_get_acct(wq, work);
unsigned int hash = io_get_work_hash(work);
struct io_wq_work *prev_work = NULL;
@@ -1051,7 +1050,7 @@ static bool io_acct_cancel_pending_work(struct io_wq *wq,
work = container_of(node, struct io_wq_work, list);
if (!match->fn(work, match->data))
continue;
- io_wq_remove_pending(wq, work, prev);
+ io_wq_remove_pending(wq, acct, work, prev);
raw_spin_unlock(&acct->lock);
io_run_cancel(work, wq);
match->nr_pending++;
--
2.45.2
* [PATCH 2/8] io_uring/io-wq: add io_worker.acct pointer
2025-01-28 13:39 [PATCH 0/8] Various io_uring micro-optimizations (reducing lock contention) Max Kellermann
2025-01-28 13:39 ` [PATCH 1/8] io_uring/io-wq: eliminate redundant io_work_get_acct() calls Max Kellermann
@ 2025-01-28 13:39 ` Max Kellermann
2025-01-28 13:39 ` [PATCH 3/8] io_uring/io-wq: move worker lists to struct io_wq_acct Max Kellermann
` (6 subsequent siblings)
8 siblings, 0 replies; 30+ messages in thread
From: Max Kellermann @ 2025-01-28 13:39 UTC (permalink / raw)
To: axboe, asml.silence, io-uring, linux-kernel; +Cc: Max Kellermann
This replaces the `IO_WORKER_F_BOUND` flag. None of the code that
checks this flag actually cares whether the worker is "bound"; it only
uses the flag to determine the `io_wq_acct` pointer. At the cost of an
extra pointer field, we can eliminate some fragile pointer arithmetic,
and the `create_index` and `index` fields are no longer needed.
Signed-off-by: Max Kellermann <[email protected]>
---
io_uring/io-wq.c | 23 ++++++++---------------
1 file changed, 8 insertions(+), 15 deletions(-)
diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
index 6d26f6f068af..197352ef78c7 100644
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -30,7 +30,6 @@ enum {
IO_WORKER_F_UP = 0, /* up and active */
IO_WORKER_F_RUNNING = 1, /* account as running */
IO_WORKER_F_FREE = 2, /* worker on free list */
- IO_WORKER_F_BOUND = 3, /* is doing bounded work */
};
enum {
@@ -46,12 +45,12 @@ enum {
*/
struct io_worker {
refcount_t ref;
- int create_index;
unsigned long flags;
struct hlist_nulls_node nulls_node;
struct list_head all_list;
struct task_struct *task;
struct io_wq *wq;
+ struct io_wq_acct *acct;
struct io_wq_work *cur_work;
raw_spinlock_t lock;
@@ -79,7 +78,6 @@ struct io_worker {
struct io_wq_acct {
unsigned nr_workers;
unsigned max_workers;
- int index;
atomic_t nr_running;
raw_spinlock_t lock;
struct io_wq_work_list work_list;
@@ -135,7 +133,7 @@ struct io_cb_cancel_data {
bool cancel_all;
};
-static bool create_io_worker(struct io_wq *wq, int index);
+static bool create_io_worker(struct io_wq *wq, struct io_wq_acct *acct);
static void io_wq_dec_running(struct io_worker *worker);
static bool io_acct_cancel_pending_work(struct io_wq *wq,
struct io_wq_acct *acct,
@@ -167,7 +165,7 @@ static inline struct io_wq_acct *io_work_get_acct(struct io_wq *wq,
static inline struct io_wq_acct *io_wq_get_acct(struct io_worker *worker)
{
- return io_get_acct(worker->wq, test_bit(IO_WORKER_F_BOUND, &worker->flags));
+ return worker->acct;
}
static void io_worker_ref_put(struct io_wq *wq)
@@ -323,7 +321,7 @@ static bool io_wq_create_worker(struct io_wq *wq, struct io_wq_acct *acct)
raw_spin_unlock(&wq->lock);
atomic_inc(&acct->nr_running);
atomic_inc(&wq->worker_refs);
- return create_io_worker(wq, acct->index);
+ return create_io_worker(wq, acct);
}
static void io_wq_inc_running(struct io_worker *worker)
@@ -343,7 +341,7 @@ static void create_worker_cb(struct callback_head *cb)
worker = container_of(cb, struct io_worker, create_work);
wq = worker->wq;
- acct = &wq->acct[worker->create_index];
+ acct = worker->acct;
raw_spin_lock(&wq->lock);
if (acct->nr_workers < acct->max_workers) {
@@ -352,7 +350,7 @@ static void create_worker_cb(struct callback_head *cb)
}
raw_spin_unlock(&wq->lock);
if (do_create) {
- create_io_worker(wq, worker->create_index);
+ create_io_worker(wq, acct);
} else {
atomic_dec(&acct->nr_running);
io_worker_ref_put(wq);
@@ -384,7 +382,6 @@ static bool io_queue_worker_create(struct io_worker *worker,
atomic_inc(&wq->worker_refs);
init_task_work(&worker->create_work, func);
- worker->create_index = acct->index;
if (!task_work_add(wq->task, &worker->create_work, TWA_SIGNAL)) {
/*
* EXIT may have been set after checking it above, check after
@@ -821,9 +818,8 @@ static void io_workqueue_create(struct work_struct *work)
kfree(worker);
}
-static bool create_io_worker(struct io_wq *wq, int index)
+static bool create_io_worker(struct io_wq *wq, struct io_wq_acct *acct)
{
- struct io_wq_acct *acct = &wq->acct[index];
struct io_worker *worker;
struct task_struct *tsk;
@@ -842,12 +838,10 @@ static bool create_io_worker(struct io_wq *wq, int index)
refcount_set(&worker->ref, 1);
worker->wq = wq;
+ worker->acct = acct;
raw_spin_lock_init(&worker->lock);
init_completion(&worker->ref_done);
- if (index == IO_WQ_ACCT_BOUND)
- set_bit(IO_WORKER_F_BOUND, &worker->flags);
-
tsk = create_io_thread(io_wq_worker, worker, NUMA_NO_NODE);
if (!IS_ERR(tsk)) {
io_init_new_worker(wq, worker, tsk);
@@ -1176,7 +1170,6 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
for (i = 0; i < IO_WQ_ACCT_NR; i++) {
struct io_wq_acct *acct = &wq->acct[i];
- acct->index = i;
atomic_set(&acct->nr_running, 0);
INIT_WQ_LIST(&acct->work_list);
raw_spin_lock_init(&acct->lock);
--
2.45.2
* [PATCH 3/8] io_uring/io-wq: move worker lists to struct io_wq_acct
2025-01-28 13:39 [PATCH 0/8] Various io_uring micro-optimizations (reducing lock contention) Max Kellermann
2025-01-28 13:39 ` [PATCH 1/8] io_uring/io-wq: eliminate redundant io_work_get_acct() calls Max Kellermann
2025-01-28 13:39 ` [PATCH 2/8] io_uring/io-wq: add io_worker.acct pointer Max Kellermann
@ 2025-01-28 13:39 ` Max Kellermann
2025-01-28 13:39 ` [PATCH 4/8] io_uring/io-wq: cache work->flags in variable Max Kellermann
` (5 subsequent siblings)
8 siblings, 0 replies; 30+ messages in thread
From: Max Kellermann @ 2025-01-28 13:39 UTC (permalink / raw)
To: axboe, asml.silence, io-uring, linux-kernel; +Cc: Max Kellermann
Have separate linked lists for bounded and unbounded workers. This
way, io_acct_activate_free_worker() sees only workers relevant to it
and doesn't need to skip irrelevant ones. This speeds up the
linked list traversal (under acct->lock).
The `io_wq.lock` field is moved to `io_wq_acct.workers_lock`. Despite
its comment, it did not actually protect "access to elements below"
in general; it only protected access to the worker lists. Having two
locks instead of one reduces contention on this lock.
Signed-off-by: Max Kellermann <[email protected]>
---
io_uring/io-wq.c | 162 ++++++++++++++++++++++++++++-------------------
1 file changed, 96 insertions(+), 66 deletions(-)
diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
index 197352ef78c7..dfdd45ebe4bb 100644
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -76,9 +76,27 @@ struct io_worker {
#define IO_WQ_NR_HASH_BUCKETS (1u << IO_WQ_HASH_ORDER)
struct io_wq_acct {
+ /**
+ * Protects access to the worker lists.
+ */
+ raw_spinlock_t workers_lock;
+
unsigned nr_workers;
unsigned max_workers;
atomic_t nr_running;
+
+ /**
+ * The list of free workers. Protected by #workers_lock
+ * (write) and RCU (read).
+ */
+ struct hlist_nulls_head free_list;
+
+ /**
+ * The list of all workers. Protected by #workers_lock
+ * (write) and RCU (read).
+ */
+ struct list_head all_list;
+
raw_spinlock_t lock;
struct io_wq_work_list work_list;
unsigned long flags;
@@ -110,12 +128,6 @@ struct io_wq {
struct io_wq_acct acct[IO_WQ_ACCT_NR];
- /* lock protects access to elements below */
- raw_spinlock_t lock;
-
- struct hlist_nulls_head free_list;
- struct list_head all_list;
-
struct wait_queue_entry wait;
struct io_wq_work *hash_tail[IO_WQ_NR_HASH_BUCKETS];
@@ -190,9 +202,9 @@ static void io_worker_cancel_cb(struct io_worker *worker)
struct io_wq *wq = worker->wq;
atomic_dec(&acct->nr_running);
- raw_spin_lock(&wq->lock);
+ raw_spin_lock(&acct->workers_lock);
acct->nr_workers--;
- raw_spin_unlock(&wq->lock);
+ raw_spin_unlock(&acct->workers_lock);
io_worker_ref_put(wq);
clear_bit_unlock(0, &worker->create_state);
io_worker_release(worker);
@@ -211,6 +223,7 @@ static bool io_task_worker_match(struct callback_head *cb, void *data)
static void io_worker_exit(struct io_worker *worker)
{
struct io_wq *wq = worker->wq;
+ struct io_wq_acct *acct = io_wq_get_acct(worker);
while (1) {
struct callback_head *cb = task_work_cancel_match(wq->task,
@@ -224,11 +237,11 @@ static void io_worker_exit(struct io_worker *worker)
io_worker_release(worker);
wait_for_completion(&worker->ref_done);
- raw_spin_lock(&wq->lock);
+ raw_spin_lock(&acct->workers_lock);
if (test_bit(IO_WORKER_F_FREE, &worker->flags))
hlist_nulls_del_rcu(&worker->nulls_node);
list_del_rcu(&worker->all_list);
- raw_spin_unlock(&wq->lock);
+ raw_spin_unlock(&acct->workers_lock);
io_wq_dec_running(worker);
/*
* this worker is a goner, clear ->worker_private to avoid any
@@ -267,8 +280,7 @@ static inline bool io_acct_run_queue(struct io_wq_acct *acct)
* Check head of free list for an available worker. If one isn't available,
* caller must create one.
*/
-static bool io_wq_activate_free_worker(struct io_wq *wq,
- struct io_wq_acct *acct)
+static bool io_acct_activate_free_worker(struct io_wq_acct *acct)
__must_hold(RCU)
{
struct hlist_nulls_node *n;
@@ -279,13 +291,9 @@ static bool io_wq_activate_free_worker(struct io_wq *wq,
* activate. If a given worker is on the free_list but in the process
* of exiting, keep trying.
*/
- hlist_nulls_for_each_entry_rcu(worker, n, &wq->free_list, nulls_node) {
+ hlist_nulls_for_each_entry_rcu(worker, n, &acct->free_list, nulls_node) {
if (!io_worker_get(worker))
continue;
- if (io_wq_get_acct(worker) != acct) {
- io_worker_release(worker);
- continue;
- }
/*
* If the worker is already running, it's either already
* starting work or finishing work. In either case, if it does
@@ -312,13 +320,13 @@ static bool io_wq_create_worker(struct io_wq *wq, struct io_wq_acct *acct)
if (unlikely(!acct->max_workers))
pr_warn_once("io-wq is not configured for unbound workers");
- raw_spin_lock(&wq->lock);
+ raw_spin_lock(&acct->workers_lock);
if (acct->nr_workers >= acct->max_workers) {
- raw_spin_unlock(&wq->lock);
+ raw_spin_unlock(&acct->workers_lock);
return true;
}
acct->nr_workers++;
- raw_spin_unlock(&wq->lock);
+ raw_spin_unlock(&acct->workers_lock);
atomic_inc(&acct->nr_running);
atomic_inc(&wq->worker_refs);
return create_io_worker(wq, acct);
@@ -342,13 +350,13 @@ static void create_worker_cb(struct callback_head *cb)
worker = container_of(cb, struct io_worker, create_work);
wq = worker->wq;
acct = worker->acct;
- raw_spin_lock(&wq->lock);
+ raw_spin_lock(&acct->workers_lock);
if (acct->nr_workers < acct->max_workers) {
acct->nr_workers++;
do_create = true;
}
- raw_spin_unlock(&wq->lock);
+ raw_spin_unlock(&acct->workers_lock);
if (do_create) {
create_io_worker(wq, acct);
} else {
@@ -427,25 +435,25 @@ static void io_wq_dec_running(struct io_worker *worker)
* Worker will start processing some work. Move it to the busy list, if
* it's currently on the freelist
*/
-static void __io_worker_busy(struct io_wq *wq, struct io_worker *worker)
+static void __io_worker_busy(struct io_wq_acct *acct, struct io_worker *worker)
{
if (test_bit(IO_WORKER_F_FREE, &worker->flags)) {
clear_bit(IO_WORKER_F_FREE, &worker->flags);
- raw_spin_lock(&wq->lock);
+ raw_spin_lock(&acct->workers_lock);
hlist_nulls_del_init_rcu(&worker->nulls_node);
- raw_spin_unlock(&wq->lock);
+ raw_spin_unlock(&acct->workers_lock);
}
}
/*
* No work, worker going to sleep. Move to freelist.
*/
-static void __io_worker_idle(struct io_wq *wq, struct io_worker *worker)
- __must_hold(wq->lock)
+static void __io_worker_idle(struct io_wq_acct *acct, struct io_worker *worker)
+ __must_hold(acct->workers_lock)
{
if (!test_bit(IO_WORKER_F_FREE, &worker->flags)) {
set_bit(IO_WORKER_F_FREE, &worker->flags);
- hlist_nulls_add_head_rcu(&worker->nulls_node, &wq->free_list);
+ hlist_nulls_add_head_rcu(&worker->nulls_node, &acct->free_list);
}
}
@@ -580,7 +588,7 @@ static void io_worker_handle_work(struct io_wq_acct *acct,
if (!work)
break;
- __io_worker_busy(wq, worker);
+ __io_worker_busy(acct, worker);
io_assign_current_work(worker, work);
__set_current_state(TASK_RUNNING);
@@ -651,20 +659,20 @@ static int io_wq_worker(void *data)
while (io_acct_run_queue(acct))
io_worker_handle_work(acct, worker);
- raw_spin_lock(&wq->lock);
+ raw_spin_lock(&acct->workers_lock);
/*
* Last sleep timed out. Exit if we're not the last worker,
* or if someone modified our affinity.
*/
if (last_timeout && (exit_mask || acct->nr_workers > 1)) {
acct->nr_workers--;
- raw_spin_unlock(&wq->lock);
+ raw_spin_unlock(&acct->workers_lock);
__set_current_state(TASK_RUNNING);
break;
}
last_timeout = false;
- __io_worker_idle(wq, worker);
- raw_spin_unlock(&wq->lock);
+ __io_worker_idle(acct, worker);
+ raw_spin_unlock(&acct->workers_lock);
if (io_run_task_work())
continue;
ret = schedule_timeout(WORKER_IDLE_TIMEOUT);
@@ -725,18 +733,18 @@ void io_wq_worker_sleeping(struct task_struct *tsk)
io_wq_dec_running(worker);
}
-static void io_init_new_worker(struct io_wq *wq, struct io_worker *worker,
+static void io_init_new_worker(struct io_wq *wq, struct io_wq_acct *acct, struct io_worker *worker,
struct task_struct *tsk)
{
tsk->worker_private = worker;
worker->task = tsk;
set_cpus_allowed_ptr(tsk, wq->cpu_mask);
- raw_spin_lock(&wq->lock);
- hlist_nulls_add_head_rcu(&worker->nulls_node, &wq->free_list);
- list_add_tail_rcu(&worker->all_list, &wq->all_list);
+ raw_spin_lock(&acct->workers_lock);
+ hlist_nulls_add_head_rcu(&worker->nulls_node, &acct->free_list);
+ list_add_tail_rcu(&worker->all_list, &acct->all_list);
set_bit(IO_WORKER_F_FREE, &worker->flags);
- raw_spin_unlock(&wq->lock);
+ raw_spin_unlock(&acct->workers_lock);
wake_up_new_task(tsk);
}
@@ -772,20 +780,20 @@ static void create_worker_cont(struct callback_head *cb)
struct io_worker *worker;
struct task_struct *tsk;
struct io_wq *wq;
+ struct io_wq_acct *acct;
worker = container_of(cb, struct io_worker, create_work);
clear_bit_unlock(0, &worker->create_state);
wq = worker->wq;
+ acct = io_wq_get_acct(worker);
tsk = create_io_thread(io_wq_worker, worker, NUMA_NO_NODE);
if (!IS_ERR(tsk)) {
- io_init_new_worker(wq, worker, tsk);
+ io_init_new_worker(wq, acct, worker, tsk);
io_worker_release(worker);
return;
} else if (!io_should_retry_thread(worker, PTR_ERR(tsk))) {
- struct io_wq_acct *acct = io_wq_get_acct(worker);
-
atomic_dec(&acct->nr_running);
- raw_spin_lock(&wq->lock);
+ raw_spin_lock(&acct->workers_lock);
acct->nr_workers--;
if (!acct->nr_workers) {
struct io_cb_cancel_data match = {
@@ -793,11 +801,11 @@ static void create_worker_cont(struct callback_head *cb)
.cancel_all = true,
};
- raw_spin_unlock(&wq->lock);
+ raw_spin_unlock(&acct->workers_lock);
while (io_acct_cancel_pending_work(wq, acct, &match))
;
} else {
- raw_spin_unlock(&wq->lock);
+ raw_spin_unlock(&acct->workers_lock);
}
io_worker_ref_put(wq);
kfree(worker);
@@ -829,9 +837,9 @@ static bool create_io_worker(struct io_wq *wq, struct io_wq_acct *acct)
if (!worker) {
fail:
atomic_dec(&acct->nr_running);
- raw_spin_lock(&wq->lock);
+ raw_spin_lock(&acct->workers_lock);
acct->nr_workers--;
- raw_spin_unlock(&wq->lock);
+ raw_spin_unlock(&acct->workers_lock);
io_worker_ref_put(wq);
return false;
}
@@ -844,7 +852,7 @@ static bool create_io_worker(struct io_wq *wq, struct io_wq_acct *acct)
tsk = create_io_thread(io_wq_worker, worker, NUMA_NO_NODE);
if (!IS_ERR(tsk)) {
- io_init_new_worker(wq, worker, tsk);
+ io_init_new_worker(wq, acct, worker, tsk);
} else if (!io_should_retry_thread(worker, PTR_ERR(tsk))) {
kfree(worker);
goto fail;
@@ -860,14 +868,14 @@ static bool create_io_worker(struct io_wq *wq, struct io_wq_acct *acct)
* Iterate the passed in list and call the specific function for each
* worker that isn't exiting
*/
-static bool io_wq_for_each_worker(struct io_wq *wq,
- bool (*func)(struct io_worker *, void *),
- void *data)
+static bool io_acct_for_each_worker(struct io_wq_acct *acct,
+ bool (*func)(struct io_worker *, void *),
+ void *data)
{
struct io_worker *worker;
bool ret = false;
- list_for_each_entry_rcu(worker, &wq->all_list, all_list) {
+ list_for_each_entry_rcu(worker, &acct->all_list, all_list) {
if (io_worker_get(worker)) {
/* no task if node is/was offline */
if (worker->task)
@@ -881,6 +889,18 @@ static bool io_wq_for_each_worker(struct io_wq *wq,
return ret;
}
+static bool io_wq_for_each_worker(struct io_wq *wq,
+ bool (*func)(struct io_worker *, void *),
+ void *data)
+{
+ for (int i = 0; i < IO_WQ_ACCT_NR; i++) {
+ if (!io_acct_for_each_worker(&wq->acct[i], func, data))
+ return false;
+ }
+
+ return true;
+}
+
static bool io_wq_worker_wake(struct io_worker *worker, void *data)
{
__set_notify_signal(worker->task);
@@ -949,7 +969,7 @@ void io_wq_enqueue(struct io_wq *wq, struct io_wq_work *work)
raw_spin_unlock(&acct->lock);
rcu_read_lock();
- do_create = !io_wq_activate_free_worker(wq, acct);
+ do_create = !io_acct_activate_free_worker(acct);
rcu_read_unlock();
if (do_create && ((work_flags & IO_WQ_WORK_CONCURRENT) ||
@@ -960,12 +980,12 @@ void io_wq_enqueue(struct io_wq *wq, struct io_wq_work *work)
if (likely(did_create))
return;
- raw_spin_lock(&wq->lock);
+ raw_spin_lock(&acct->workers_lock);
if (acct->nr_workers) {
- raw_spin_unlock(&wq->lock);
+ raw_spin_unlock(&acct->workers_lock);
return;
}
- raw_spin_unlock(&wq->lock);
+ raw_spin_unlock(&acct->workers_lock);
/* fatal condition, failed to create the first worker */
io_acct_cancel_pending_work(wq, acct, &match);
@@ -1072,11 +1092,22 @@ static void io_wq_cancel_pending_work(struct io_wq *wq,
}
}
+static void io_acct_cancel_running_work(struct io_wq_acct *acct,
+ struct io_cb_cancel_data *match)
+{
+ raw_spin_lock(&acct->workers_lock);
+ io_acct_for_each_worker(acct, io_wq_worker_cancel, match);
+ raw_spin_unlock(&acct->workers_lock);
+}
+
static void io_wq_cancel_running_work(struct io_wq *wq,
struct io_cb_cancel_data *match)
{
rcu_read_lock();
- io_wq_for_each_worker(wq, io_wq_worker_cancel, match);
+
+ for (int i = 0; i < IO_WQ_ACCT_NR; i++)
+ io_acct_cancel_running_work(&wq->acct[i], match);
+
rcu_read_unlock();
}
@@ -1099,16 +1130,14 @@ enum io_wq_cancel io_wq_cancel_cb(struct io_wq *wq, work_cancel_fn *cancel,
* as an indication that we attempt to signal cancellation. The
* completion will run normally in this case.
*
- * Do both of these while holding the wq->lock, to ensure that
+ * Do both of these while holding the acct->workers_lock, to ensure that
* we'll find a work item regardless of state.
*/
io_wq_cancel_pending_work(wq, &match);
if (match.nr_pending && !match.cancel_all)
return IO_WQ_CANCEL_OK;
- raw_spin_lock(&wq->lock);
io_wq_cancel_running_work(wq, &match);
- raw_spin_unlock(&wq->lock);
if (match.nr_running && !match.cancel_all)
return IO_WQ_CANCEL_RUNNING;
@@ -1132,7 +1161,7 @@ static int io_wq_hash_wake(struct wait_queue_entry *wait, unsigned mode,
struct io_wq_acct *acct = &wq->acct[i];
if (test_and_clear_bit(IO_ACCT_STALLED_BIT, &acct->flags))
- io_wq_activate_free_worker(wq, acct);
+ io_acct_activate_free_worker(acct);
}
rcu_read_unlock();
return 1;
@@ -1171,14 +1200,15 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
struct io_wq_acct *acct = &wq->acct[i];
atomic_set(&acct->nr_running, 0);
+
+ raw_spin_lock_init(&acct->workers_lock);
+ INIT_HLIST_NULLS_HEAD(&acct->free_list, 0);
+ INIT_LIST_HEAD(&acct->all_list);
+
INIT_WQ_LIST(&acct->work_list);
raw_spin_lock_init(&acct->lock);
}
- raw_spin_lock_init(&wq->lock);
- INIT_HLIST_NULLS_HEAD(&wq->free_list, 0);
- INIT_LIST_HEAD(&wq->all_list);
-
wq->task = get_task_struct(data->task);
atomic_set(&wq->worker_refs, 1);
init_completion(&wq->worker_done);
@@ -1364,14 +1394,14 @@ int io_wq_max_workers(struct io_wq *wq, int *new_count)
rcu_read_lock();
- raw_spin_lock(&wq->lock);
for (i = 0; i < IO_WQ_ACCT_NR; i++) {
acct = &wq->acct[i];
+ raw_spin_lock(&acct->workers_lock);
prev[i] = max_t(int, acct->max_workers, prev[i]);
if (new_count[i])
acct->max_workers = new_count[i];
+ raw_spin_unlock(&acct->workers_lock);
}
- raw_spin_unlock(&wq->lock);
rcu_read_unlock();
for (i = 0; i < IO_WQ_ACCT_NR; i++)
--
2.45.2
* [PATCH 4/8] io_uring/io-wq: cache work->flags in variable
2025-01-28 13:39 [PATCH 0/8] Various io_uring micro-optimizations (reducing lock contention) Max Kellermann
` (2 preceding siblings ...)
2025-01-28 13:39 ` [PATCH 3/8] io_uring/io-wq: move worker lists to struct io_wq_acct Max Kellermann
@ 2025-01-28 13:39 ` Max Kellermann
2025-01-29 18:57 ` Pavel Begunkov
2025-01-28 13:39 ` [PATCH 5/8] io_uring/io-wq: do not use bogus hash value Max Kellermann
` (4 subsequent siblings)
8 siblings, 1 reply; 30+ messages in thread
From: Max Kellermann @ 2025-01-28 13:39 UTC (permalink / raw)
To: axboe, asml.silence, io-uring, linux-kernel; +Cc: Max Kellermann
This eliminates several redundant atomic reads and therefore reduces
the time for which the surrounding spinlocks are held.
In several io_uring benchmarks, this reduced the CPU time spent in
queued_spin_lock_slowpath() considerably:
io_uring benchmark with a flood of `IORING_OP_NOP` and `IOSQE_ASYNC`:
38.86% -1.49% [kernel.kallsyms] [k] queued_spin_lock_slowpath
6.75% +0.36% [kernel.kallsyms] [k] io_worker_handle_work
2.60% +0.19% [kernel.kallsyms] [k] io_nop
3.92% +0.18% [kernel.kallsyms] [k] io_req_task_complete
6.34% -0.18% [kernel.kallsyms] [k] io_wq_submit_work
HTTP server, static file:
42.79% -2.77% [kernel.kallsyms] [k] queued_spin_lock_slowpath
2.08% +0.23% [kernel.kallsyms] [k] io_wq_submit_work
1.19% +0.20% [kernel.kallsyms] [k] amd_iommu_iotlb_sync_map
1.46% +0.15% [kernel.kallsyms] [k] ep_poll_callback
1.80% +0.15% [kernel.kallsyms] [k] io_worker_handle_work
HTTP server, PHP:
35.03% -1.80% [kernel.kallsyms] [k] queued_spin_lock_slowpath
0.84% +0.21% [kernel.kallsyms] [k] amd_iommu_iotlb_sync_map
1.39% +0.12% [kernel.kallsyms] [k] _copy_to_iter
0.21% +0.10% [kernel.kallsyms] [k] update_sd_lb_stats
Signed-off-by: Max Kellermann <[email protected]>
---
io_uring/io-wq.c | 33 +++++++++++++++++++++------------
io_uring/io-wq.h | 7 ++++++-
2 files changed, 27 insertions(+), 13 deletions(-)
diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
index dfdd45ebe4bb..ba9974e6f521 100644
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -170,9 +170,9 @@ static inline struct io_wq_acct *io_get_acct(struct io_wq *wq, bool bound)
}
static inline struct io_wq_acct *io_work_get_acct(struct io_wq *wq,
- struct io_wq_work *work)
+ unsigned int work_flags)
{
- return io_get_acct(wq, !(atomic_read(&work->flags) & IO_WQ_WORK_UNBOUND));
+ return io_get_acct(wq, !(work_flags & IO_WQ_WORK_UNBOUND));
}
static inline struct io_wq_acct *io_wq_get_acct(struct io_worker *worker)
@@ -457,9 +457,14 @@ static void __io_worker_idle(struct io_wq_acct *acct, struct io_worker *worker)
}
}
+static inline unsigned int __io_get_work_hash(unsigned int work_flags)
+{
+ return work_flags >> IO_WQ_HASH_SHIFT;
+}
+
static inline unsigned int io_get_work_hash(struct io_wq_work *work)
{
- return atomic_read(&work->flags) >> IO_WQ_HASH_SHIFT;
+ return __io_get_work_hash(atomic_read(&work->flags));
}
static bool io_wait_on_hash(struct io_wq *wq, unsigned int hash)
@@ -489,17 +494,19 @@ static struct io_wq_work *io_get_next_work(struct io_wq_acct *acct,
struct io_wq *wq = worker->wq;
wq_list_for_each(node, prev, &acct->work_list) {
+ unsigned int work_flags;
unsigned int hash;
work = container_of(node, struct io_wq_work, list);
/* not hashed, can run anytime */
- if (!io_wq_is_hashed(work)) {
+ work_flags = atomic_read(&work->flags);
+ if (!__io_wq_is_hashed(work_flags)) {
wq_list_del(&acct->work_list, node, prev);
return work;
}
- hash = io_get_work_hash(work);
+ hash = __io_get_work_hash(work_flags);
/* all items with this hash lie in [work, tail] */
tail = wq->hash_tail[hash];
@@ -596,12 +603,13 @@ static void io_worker_handle_work(struct io_wq_acct *acct,
/* handle a whole dependent link */
do {
struct io_wq_work *next_hashed, *linked;
- unsigned int hash = io_get_work_hash(work);
+ unsigned int work_flags = atomic_read(&work->flags);
+ unsigned int hash = __io_get_work_hash(work_flags);
next_hashed = wq_next_work(work);
if (do_kill &&
- (atomic_read(&work->flags) & IO_WQ_WORK_UNBOUND))
+ (work_flags & IO_WQ_WORK_UNBOUND))
atomic_or(IO_WQ_WORK_CANCEL, &work->flags);
wq->do_work(work);
io_assign_current_work(worker, NULL);
@@ -917,18 +925,19 @@ static void io_run_cancel(struct io_wq_work *work, struct io_wq *wq)
} while (work);
}
-static void io_wq_insert_work(struct io_wq *wq, struct io_wq_acct *acct, struct io_wq_work *work)
+static void io_wq_insert_work(struct io_wq *wq, struct io_wq_acct *acct,
+ struct io_wq_work *work, unsigned int work_flags)
{
unsigned int hash;
struct io_wq_work *tail;
- if (!io_wq_is_hashed(work)) {
+ if (!__io_wq_is_hashed(work_flags)) {
append:
wq_list_add_tail(&work->list, &acct->work_list);
return;
}
- hash = io_get_work_hash(work);
+ hash = __io_get_work_hash(work_flags);
tail = wq->hash_tail[hash];
wq->hash_tail[hash] = work;
if (!tail)
@@ -944,8 +953,8 @@ static bool io_wq_work_match_item(struct io_wq_work *work, void *data)
void io_wq_enqueue(struct io_wq *wq, struct io_wq_work *work)
{
- struct io_wq_acct *acct = io_work_get_acct(wq, work);
unsigned int work_flags = atomic_read(&work->flags);
+ struct io_wq_acct *acct = io_work_get_acct(wq, work_flags);
struct io_cb_cancel_data match = {
.fn = io_wq_work_match_item,
.data = work,
@@ -964,7 +973,7 @@ void io_wq_enqueue(struct io_wq *wq, struct io_wq_work *work)
}
raw_spin_lock(&acct->lock);
- io_wq_insert_work(wq, acct, work);
+ io_wq_insert_work(wq, acct, work, work_flags);
clear_bit(IO_ACCT_STALLED_BIT, &acct->flags);
raw_spin_unlock(&acct->lock);
diff --git a/io_uring/io-wq.h b/io_uring/io-wq.h
index b3b004a7b625..d4fb2940e435 100644
--- a/io_uring/io-wq.h
+++ b/io_uring/io-wq.h
@@ -54,9 +54,14 @@ int io_wq_cpu_affinity(struct io_uring_task *tctx, cpumask_var_t mask);
int io_wq_max_workers(struct io_wq *wq, int *new_count);
bool io_wq_worker_stopped(void);
+static inline bool __io_wq_is_hashed(unsigned int work_flags)
+{
+ return work_flags & IO_WQ_WORK_HASHED;
+}
+
static inline bool io_wq_is_hashed(struct io_wq_work *work)
{
- return atomic_read(&work->flags) & IO_WQ_WORK_HASHED;
+ return __io_wq_is_hashed(atomic_read(&work->flags));
}
typedef bool (work_cancel_fn)(struct io_wq_work *, void *);
--
2.45.2
* [PATCH 5/8] io_uring/io-wq: do not use bogus hash value
2025-01-28 13:39 [PATCH 0/8] Various io_uring micro-optimizations (reducing lock contention) Max Kellermann
` (3 preceding siblings ...)
2025-01-28 13:39 ` [PATCH 4/8] io_uring/io-wq: cache work->flags in variable Max Kellermann
@ 2025-01-28 13:39 ` Max Kellermann
2025-01-28 13:39 ` [PATCH 6/8] io_uring/io-wq: pass io_wq to io_get_next_work() Max Kellermann
` (3 subsequent siblings)
8 siblings, 0 replies; 30+ messages in thread
From: Max Kellermann @ 2025-01-28 13:39 UTC (permalink / raw)
To: axboe, asml.silence, io-uring, linux-kernel; +Cc: Max Kellermann
Previously, the `hash` variable was initialized with `-1` and only
updated by io_get_next_work() if the current work was hashed. Commit
60cf46ae6054 ("io-wq: hash dependent work") changed this to always
call io_get_work_hash() even if the work was not hashed. This caused
the `hash != -1U` check to always be true, adding some overhead for
the `hash->wait` code.
This patch fixes the regression by checking the `IO_WQ_WORK_HASHED`
flag.
Perf diff for a flood of `IORING_OP_NOP` with `IOSQE_ASYNC`:
38.55% -1.57% [kernel.kallsyms] [k] queued_spin_lock_slowpath
6.86% -0.72% [kernel.kallsyms] [k] io_worker_handle_work
0.10% +0.67% [kernel.kallsyms] [k] put_prev_entity
1.96% +0.59% [kernel.kallsyms] [k] io_nop_prep
3.31% -0.51% [kernel.kallsyms] [k] try_to_wake_up
7.18% -0.47% [kernel.kallsyms] [k] io_wq_free_work
Fixes: 60cf46ae6054 ("io-wq: hash dependent work")
Cc: Pavel Begunkov <[email protected]>
Signed-off-by: Max Kellermann <[email protected]>
---
io_uring/io-wq.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
index ba9974e6f521..6e31f312b61a 100644
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -604,7 +604,9 @@ static void io_worker_handle_work(struct io_wq_acct *acct,
do {
struct io_wq_work *next_hashed, *linked;
unsigned int work_flags = atomic_read(&work->flags);
- unsigned int hash = __io_get_work_hash(work_flags);
+ unsigned int hash = __io_wq_is_hashed(work_flags)
+ ? __io_get_work_hash(work_flags)
+ : -1U;
next_hashed = wq_next_work(work);
--
2.45.2
* [PATCH 6/8] io_uring/io-wq: pass io_wq to io_get_next_work()
2025-01-28 13:39 [PATCH 0/8] Various io_uring micro-optimizations (reducing lock contention) Max Kellermann
` (4 preceding siblings ...)
2025-01-28 13:39 ` [PATCH 5/8] io_uring/io-wq: do not use bogus hash value Max Kellermann
@ 2025-01-28 13:39 ` Max Kellermann
2025-01-28 13:39 ` [PATCH 7/8] io_uring: cache io_kiocb->flags in variable Max Kellermann
` (2 subsequent siblings)
8 siblings, 0 replies; 30+ messages in thread
From: Max Kellermann @ 2025-01-28 13:39 UTC (permalink / raw)
To: axboe, asml.silence, io-uring, linux-kernel; +Cc: Max Kellermann
The only caller has already determined this pointer, so let's skip
the redundant dereference.
Signed-off-by: Max Kellermann <[email protected]>
---
io_uring/io-wq.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
index 6e31f312b61a..f7d328feb722 100644
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -485,13 +485,12 @@ static bool io_wait_on_hash(struct io_wq *wq, unsigned int hash)
}
static struct io_wq_work *io_get_next_work(struct io_wq_acct *acct,
- struct io_worker *worker)
+ struct io_wq *wq)
__must_hold(acct->lock)
{
struct io_wq_work_node *node, *prev;
struct io_wq_work *work, *tail;
unsigned int stall_hash = -1U;
- struct io_wq *wq = worker->wq;
wq_list_for_each(node, prev, &acct->work_list) {
unsigned int work_flags;
@@ -576,7 +575,7 @@ static void io_worker_handle_work(struct io_wq_acct *acct,
* can't make progress, any work completion or insertion will
* clear the stalled flag.
*/
- work = io_get_next_work(acct, worker);
+ work = io_get_next_work(acct, wq);
if (work) {
/*
* Make sure cancelation can find this, even before
--
2.45.2
* [PATCH 7/8] io_uring: cache io_kiocb->flags in variable
2025-01-28 13:39 [PATCH 0/8] Various io_uring micro-optimizations (reducing lock contention) Max Kellermann
` (5 preceding siblings ...)
2025-01-28 13:39 ` [PATCH 6/8] io_uring/io-wq: pass io_wq to io_get_next_work() Max Kellermann
@ 2025-01-28 13:39 ` Max Kellermann
2025-01-29 19:11 ` Pavel Begunkov
2025-01-28 13:39 ` [PATCH 8/8] io_uring: skip redundant poll wakeups Max Kellermann
2025-01-29 17:18 ` [PATCH 0/8] Various io_uring micro-optimizations (reducing lock contention) Jens Axboe
8 siblings, 1 reply; 30+ messages in thread
From: Max Kellermann @ 2025-01-28 13:39 UTC (permalink / raw)
To: axboe, asml.silence, io-uring, linux-kernel; +Cc: Max Kellermann
This eliminates several redundant reads, some of which probably cannot
be optimized away by the compiler.
Signed-off-by: Max Kellermann <[email protected]>
---
io_uring/io_uring.c | 59 +++++++++++++++++++++++++++------------------
1 file changed, 35 insertions(+), 24 deletions(-)
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 7bfbc7c22367..137c2066c5a3 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -391,28 +391,30 @@ static bool req_need_defer(struct io_kiocb *req, u32 seq)
static void io_clean_op(struct io_kiocb *req)
{
- if (req->flags & REQ_F_BUFFER_SELECTED) {
+ const unsigned int req_flags = req->flags;
+
+ if (req_flags & REQ_F_BUFFER_SELECTED) {
spin_lock(&req->ctx->completion_lock);
io_kbuf_drop(req);
spin_unlock(&req->ctx->completion_lock);
}
- if (req->flags & REQ_F_NEED_CLEANUP) {
+ if (req_flags & REQ_F_NEED_CLEANUP) {
const struct io_cold_def *def = &io_cold_defs[req->opcode];
if (def->cleanup)
def->cleanup(req);
}
- if ((req->flags & REQ_F_POLLED) && req->apoll) {
+ if ((req_flags & REQ_F_POLLED) && req->apoll) {
kfree(req->apoll->double_poll);
kfree(req->apoll);
req->apoll = NULL;
}
- if (req->flags & REQ_F_INFLIGHT)
+ if (req_flags & REQ_F_INFLIGHT)
atomic_dec(&req->tctx->inflight_tracked);
- if (req->flags & REQ_F_CREDS)
+ if (req_flags & REQ_F_CREDS)
put_cred(req->creds);
- if (req->flags & REQ_F_ASYNC_DATA) {
+ if (req_flags & REQ_F_ASYNC_DATA) {
kfree(req->async_data);
req->async_data = NULL;
}
@@ -453,31 +455,37 @@ static noinline void __io_arm_ltimeout(struct io_kiocb *req)
io_queue_linked_timeout(__io_prep_linked_timeout(req));
}
-static inline void io_arm_ltimeout(struct io_kiocb *req)
+static inline void _io_arm_ltimeout(struct io_kiocb *req, unsigned int req_flags)
{
- if (unlikely(req->flags & REQ_F_ARM_LTIMEOUT))
+ if (unlikely(req_flags & REQ_F_ARM_LTIMEOUT))
__io_arm_ltimeout(req);
}
+static inline void io_arm_ltimeout(struct io_kiocb *req)
+{
+ _io_arm_ltimeout(req, req->flags);
+}
+
static void io_prep_async_work(struct io_kiocb *req)
{
+ unsigned int req_flags = req->flags;
const struct io_issue_def *def = &io_issue_defs[req->opcode];
struct io_ring_ctx *ctx = req->ctx;
- if (!(req->flags & REQ_F_CREDS)) {
- req->flags |= REQ_F_CREDS;
+ if (!(req_flags & REQ_F_CREDS)) {
+ req_flags = req->flags |= REQ_F_CREDS;
req->creds = get_current_cred();
}
req->work.list.next = NULL;
atomic_set(&req->work.flags, 0);
- if (req->flags & REQ_F_FORCE_ASYNC)
+ if (req_flags & REQ_F_FORCE_ASYNC)
atomic_or(IO_WQ_WORK_CONCURRENT, &req->work.flags);
- if (req->file && !(req->flags & REQ_F_FIXED_FILE))
- req->flags |= io_file_get_flags(req->file);
+ if (req->file && !(req_flags & REQ_F_FIXED_FILE))
+ req_flags = req->flags |= io_file_get_flags(req->file);
- if (req->file && (req->flags & REQ_F_ISREG)) {
+ if (req->file && (req_flags & REQ_F_ISREG)) {
bool should_hash = def->hash_reg_file;
/* don't serialize this request if the fs doesn't need it */
@@ -1703,13 +1711,14 @@ static __cold void io_drain_req(struct io_kiocb *req)
spin_unlock(&ctx->completion_lock);
}
-static bool io_assign_file(struct io_kiocb *req, const struct io_issue_def *def,
+static bool io_assign_file(struct io_kiocb *req, unsigned int req_flags,
+ const struct io_issue_def *def,
unsigned int issue_flags)
{
if (req->file || !def->needs_file)
return true;
- if (req->flags & REQ_F_FIXED_FILE)
+ if (req_flags & REQ_F_FIXED_FILE)
req->file = io_file_get_fixed(req, req->cqe.fd, issue_flags);
else
req->file = io_file_get_normal(req, req->cqe.fd);
@@ -1719,14 +1728,15 @@ static bool io_assign_file(struct io_kiocb *req, const struct io_issue_def *def,
static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
{
+ const unsigned int req_flags = req->flags;
const struct io_issue_def *def = &io_issue_defs[req->opcode];
const struct cred *creds = NULL;
int ret;
- if (unlikely(!io_assign_file(req, def, issue_flags)))
+ if (unlikely(!io_assign_file(req, req_flags, def, issue_flags)))
return -EBADF;
- if (unlikely((req->flags & REQ_F_CREDS) && req->creds != current_cred()))
+ if (unlikely((req_flags & REQ_F_CREDS) && req->creds != current_cred()))
creds = override_creds(req->creds);
if (!def->audit_skip)
@@ -1783,18 +1793,19 @@ struct io_wq_work *io_wq_free_work(struct io_wq_work *work)
void io_wq_submit_work(struct io_wq_work *work)
{
struct io_kiocb *req = container_of(work, struct io_kiocb, work);
+ const unsigned int req_flags = req->flags;
const struct io_issue_def *def = &io_issue_defs[req->opcode];
unsigned int issue_flags = IO_URING_F_UNLOCKED | IO_URING_F_IOWQ;
bool needs_poll = false;
int ret = 0, err = -ECANCELED;
/* one will be dropped by ->io_wq_free_work() after returning to io-wq */
- if (!(req->flags & REQ_F_REFCOUNT))
+ if (!(req_flags & REQ_F_REFCOUNT))
__io_req_set_refcount(req, 2);
else
req_ref_get(req);
- io_arm_ltimeout(req);
+ _io_arm_ltimeout(req, req_flags);
/* either cancelled or io-wq is dying, so don't touch tctx->iowq */
if (atomic_read(&work->flags) & IO_WQ_WORK_CANCEL) {
@@ -1802,7 +1813,7 @@ void io_wq_submit_work(struct io_wq_work *work)
io_req_task_queue_fail(req, err);
return;
}
- if (!io_assign_file(req, def, issue_flags)) {
+ if (!io_assign_file(req, req_flags, def, issue_flags)) {
err = -EBADF;
atomic_or(IO_WQ_WORK_CANCEL, &work->flags);
goto fail;
@@ -1816,7 +1827,7 @@ void io_wq_submit_work(struct io_wq_work *work)
* Don't allow any multishot execution from io-wq. It's more restrictive
* than necessary and also cleaner.
*/
- if (req->flags & REQ_F_APOLL_MULTISHOT) {
+ if (req_flags & REQ_F_APOLL_MULTISHOT) {
err = -EBADFD;
if (!io_file_can_poll(req))
goto fail;
@@ -1831,7 +1842,7 @@ void io_wq_submit_work(struct io_wq_work *work)
}
}
- if (req->flags & REQ_F_FORCE_ASYNC) {
+ if (req_flags & REQ_F_FORCE_ASYNC) {
bool opcode_poll = def->pollin || def->pollout;
if (opcode_poll && io_file_can_poll(req)) {
@@ -1849,7 +1860,7 @@ void io_wq_submit_work(struct io_wq_work *work)
* If REQ_F_NOWAIT is set, then don't wait or retry with
* poll. -EAGAIN is final for that case.
*/
- if (req->flags & REQ_F_NOWAIT)
+ if (req_flags & REQ_F_NOWAIT)
break;
/*
--
2.45.2
* [PATCH 8/8] io_uring: skip redundant poll wakeups
2025-01-28 13:39 [PATCH 0/8] Various io_uring micro-optimizations (reducing lock contention) Max Kellermann
` (6 preceding siblings ...)
2025-01-28 13:39 ` [PATCH 7/8] io_uring: cache io_kiocb->flags in variable Max Kellermann
@ 2025-01-28 13:39 ` Max Kellermann
2025-01-31 13:54 ` Pavel Begunkov
2025-01-29 17:18 ` [PATCH 0/8] Various io_uring micro-optimizations (reducing lock contention) Jens Axboe
8 siblings, 1 reply; 30+ messages in thread
From: Max Kellermann @ 2025-01-28 13:39 UTC (permalink / raw)
To: axboe, asml.silence, io-uring, linux-kernel; +Cc: Max Kellermann
Using io_uring together with epoll is very expensive because every
completion leads to a __wake_up() call; most of these are unnecessary
because the polling process has already been woken up but has not yet
had a chance to process the completions. During this time,
wq_has_sleeper() still returns true, so this check alone is not
enough.
Perf diff for a HTTP server pushing a static file with splice() into
the TCP socket:
37.91% -2.00% [kernel.kallsyms] [k] queued_spin_lock_slowpath
1.69% -1.67% [kernel.kallsyms] [k] ep_poll_callback
0.95% +1.64% [kernel.kallsyms] [k] io_wq_free_work
0.88% -0.35% [kernel.kallsyms] [k] _raw_spin_lock_irqsave
1.66% +0.28% [kernel.kallsyms] [k] io_worker_handle_work
1.14% +0.18% [kernel.kallsyms] [k] _raw_spin_lock
0.24% -0.17% [kernel.kallsyms] [k] __wake_up
Signed-off-by: Max Kellermann <[email protected]>
---
include/linux/io_uring_types.h | 10 ++++++++++
io_uring/io_uring.c | 4 ++++
io_uring/io_uring.h | 2 +-
3 files changed, 15 insertions(+), 1 deletion(-)
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 623d8e798a11..01514cb76095 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -384,6 +384,16 @@ struct io_ring_ctx {
struct wait_queue_head poll_wq;
struct io_restriction restrictions;
+ /**
+ * Non-zero if a process is waiting for #poll_wq and reset to
+ * zero when #poll_wq is woken up. This is supposed to
+ * eliminate redundant wakeup calls. Only checking
+ * wq_has_sleeper() is not enough because it will return true
+ * until the sleeper has actually woken up and has been
+ * scheduled.
+ */
+ atomic_t poll_wq_waiting;
+
u32 pers_next;
struct xarray personalities;
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 137c2066c5a3..b65efd07e9f0 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -2760,6 +2760,7 @@ static __cold void io_activate_pollwq_cb(struct callback_head *cb)
* Wake ups for some events between start of polling and activation
* might've been lost due to loose synchronisation.
*/
+ atomic_set_release(&ctx->poll_wq_waiting, 0);
wake_up_all(&ctx->poll_wq);
percpu_ref_put(&ctx->refs);
}
@@ -2793,6 +2794,9 @@ static __poll_t io_uring_poll(struct file *file, poll_table *wait)
if (unlikely(!ctx->poll_activated))
io_activate_pollwq(ctx);
+
+ atomic_set(&ctx->poll_wq_waiting, 1);
+
/*
* provides mb() which pairs with barrier from wq_has_sleeper
* call in io_commit_cqring
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index f65e3f3ede51..186cee066f9f 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -287,7 +287,7 @@ static inline void io_commit_cqring(struct io_ring_ctx *ctx)
static inline void io_poll_wq_wake(struct io_ring_ctx *ctx)
{
- if (wq_has_sleeper(&ctx->poll_wq))
+ if (wq_has_sleeper(&ctx->poll_wq) && atomic_xchg_release(&ctx->poll_wq_waiting, 0) > 0)
__wake_up(&ctx->poll_wq, TASK_NORMAL, 0,
poll_to_key(EPOLL_URING_WAKE | EPOLLIN));
}
--
2.45.2
* Re: [PATCH 0/8] Various io_uring micro-optimizations (reducing lock contention)
2025-01-28 13:39 [PATCH 0/8] Various io_uring micro-optimizations (reducing lock contention) Max Kellermann
` (7 preceding siblings ...)
2025-01-28 13:39 ` [PATCH 8/8] io_uring: skip redundant poll wakeups Max Kellermann
@ 2025-01-29 17:18 ` Jens Axboe
2025-01-29 17:39 ` Max Kellermann
8 siblings, 1 reply; 30+ messages in thread
From: Jens Axboe @ 2025-01-29 17:18 UTC (permalink / raw)
To: Max Kellermann, asml.silence, io-uring, linux-kernel
On 1/28/25 6:39 AM, Max Kellermann wrote:
> While optimizing my io_uring-based web server, I found that the kernel
> spends 35% of the CPU time waiting for `io_wq_acct.lock`. This patch
> set reduces contention of this lock, though I believe much more should
> be done in order to allow more worker concurrency.
>
> I measured these patches with my HTTP server (serving static files and
> running a tiny PHP script) and with a micro-benchmark that submits
> millions of `IORING_OP_NOP` entries (with `IOSQE_ASYNC` to force
> offloading the operation to a worker, so this offload overhead can be
> measured).
>
> Some of the optimizations eliminate memory accesses, e.g. by passing
> values that are already known to (inlined) functions and by caching
> values in local variables. These are useful optimizations, but they
> are too small to measure them in a benchmark (too much noise).
>
> Some of the patches have a measurable effect and they contain
> benchmark numbers that I could reproduce in repeated runs, despite the
> noise.
>
> I'm not confident about the correctness of the last patch ("io_uring:
> skip redundant poll wakeups"). This seemed like low-hanging fruit, so
> low that it seemed suspicious to me. If this is a useful
> optimization, the idea could probably be ported to other wait_queue
> users, or even into the wait_queue library. What I'm not confident
> about is whether the optimization is valid or whether it may miss
> wakeups, leading to stalls. Please advise!
That last patch is the only one that needs a bit more checking, so I'd
suggest we just ignore that one for now. We're in the middle of the
merge window anyway, so all of this would have to wait for the 6.15
merge window - iow, plenty of time.
The other patches look pretty straightforward to me. The only thing
that has me puzzled a bit is why you have so much io-wq activity with
your application; in general I'd expect 0 activity there. But then I
saw the forced ASYNC flag, and it makes sense. In general, forcing
that isn't a great idea, but for a benchmark of io-wq it certainly
makes sense.
I'll apply 1-7 once 6.14-rc1 is out and I can kick off a
for-6.15/io_uring branch. Thanks!
--
Jens Axboe
* Re: [PATCH 0/8] Various io_uring micro-optimizations (reducing lock contention)
2025-01-29 17:18 ` [PATCH 0/8] Various io_uring micro-optimizations (reducing lock contention) Jens Axboe
@ 2025-01-29 17:39 ` Max Kellermann
2025-01-29 17:45 ` Jens Axboe
2025-01-29 19:30 ` Pavel Begunkov
0 siblings, 2 replies; 30+ messages in thread
From: Max Kellermann @ 2025-01-29 17:39 UTC (permalink / raw)
To: Jens Axboe; +Cc: asml.silence, io-uring, linux-kernel
On Wed, Jan 29, 2025 at 6:19 PM Jens Axboe <[email protected]> wrote:
> The other patches look pretty straight forward to me. Only thing that
> has me puzzled a bit is why you have so much io-wq activity with your
> application, in general I'd expect 0 activity there. But Then I saw the
> forced ASYNC flag, and it makes sense. In general, forcing that isn't a
> great idea, but for a benchmark for io-wq it certainly makes sense.
I was experimenting with io_uring and wanted to see how much
performance I can squeeze out of my web server running
single-threaded. The overhead of io_uring_submit() grew very large,
because the "send" operation would do a lot of synchronous work in the
kernel. I tried SQPOLL but it was actually a big performance
regression; this just shifted my CPU usage to epoll_wait(). Forcing
ASYNC gave me large throughput improvements (moving the submission
overhead to iowq), but then the iowq lock contention was the next
limit, thus this patch series.
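(For illustration, a minimal sketch of what that forcing looks like on
the submission side - the helper name and error handling are
assumptions, not code from this series:)

#include <liburing.h>

/* hypothetical helper: queue a send and force it to be executed by io-wq */
static int queue_async_send(struct io_uring *ring, int sockfd,
			    const void *buf, size_t len)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	if (sqe == NULL)
		return -1;	/* SQ ring full; caller should submit and retry */

	io_uring_prep_send(sqe, sockfd, buf, len, 0);
	sqe->flags |= IOSQE_ASYNC;	/* skip the inline attempt, go straight to io-wq */
	return io_uring_submit(ring);
}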
I'm still experimenting, and I will certainly revisit SQPOLL to learn
more about why it didn't help and how to fix it.
> I'll apply 1-7 once 6.14-rc1 is out and I can kick off a
> for-6.15/io_uring branch. Thanks!
Thanks Jens, and please let me know when you're ready to discuss the
last patch. It's a big improvement for those who combine io_uring with
epoll, it's worth it.
Max
* Re: [PATCH 0/8] Various io_uring micro-optimizations (reducing lock contention)
2025-01-29 17:39 ` Max Kellermann
@ 2025-01-29 17:45 ` Jens Axboe
2025-01-29 18:01 ` Max Kellermann
2025-01-29 19:30 ` Pavel Begunkov
1 sibling, 1 reply; 30+ messages in thread
From: Jens Axboe @ 2025-01-29 17:45 UTC (permalink / raw)
To: Max Kellermann; +Cc: asml.silence, io-uring, linux-kernel
On 1/29/25 10:39 AM, Max Kellermann wrote:
> On Wed, Jan 29, 2025 at 6:19 PM Jens Axboe <[email protected]> wrote:
>> The other patches look pretty straight forward to me. Only thing that
>> has me puzzled a bit is why you have so much io-wq activity with your
>> application, in general I'd expect 0 activity there. But Then I saw the
>> forced ASYNC flag, and it makes sense. In general, forcing that isn't a
>> great idea, but for a benchmark for io-wq it certainly makes sense.
>
> I was experimenting with io_uring and wanted to see how much
> performance I can squeeze out of my web server running
> single-threaded. The overhead of io_uring_submit() grew very large,
> because the "send" operation would do a lot of synchronous work in the
> kernel. I tried SQPOLL but it was actually a big performance
> regression; this just shifted my CPU usage to epoll_wait(). Forcing
> ASYNC gave me large throughput improvements (moving the submission
> overhead to iowq), but then the iowq lock contention was the next
> limit, thus this patch series.
>
> I'm still experimenting, and I will certainly revisit SQPOLL to learn
> more about why it didn't help and how to fix it.
Why are you combining it with epoll in the first place? It's a lot more
efficient to wait on one/multiple events in io_uring_enter() rather
than go back to a serialized one-event-per-notification model by using
epoll to wait on completions on the io_uring side.
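(A minimal sketch of that batched wait/reap with liburing - the
handler call is an assumption:)

#include <liburing.h>

/* submit what's queued, wait for at least one completion, then reap the batch */
static void submit_and_reap(struct io_uring *ring)
{
	struct io_uring_cqe *cqe;
	unsigned head, count = 0;

	io_uring_submit_and_wait(ring, 1);	/* a single io_uring_enter() call */

	io_uring_for_each_cqe(ring, head, cqe) {
		/* handle_completion(cqe);  <- application-specific, hypothetical */
		count++;
	}
	io_uring_cq_advance(ring, count);
}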
--
Jens Axboe
* Re: [PATCH 0/8] Various io_uring micro-optimizations (reducing lock contention)
2025-01-29 17:45 ` Jens Axboe
@ 2025-01-29 18:01 ` Max Kellermann
2025-01-31 16:13 ` Jens Axboe
0 siblings, 1 reply; 30+ messages in thread
From: Max Kellermann @ 2025-01-29 18:01 UTC (permalink / raw)
To: Jens Axboe; +Cc: asml.silence, io-uring, linux-kernel
On Wed, Jan 29, 2025 at 6:45 PM Jens Axboe <[email protected]> wrote:
> Why are you combining it with epoll in the first place? It's a lot more
> efficient to wait on a/multiple events in io_uring_enter() rather than
> go back to a serialize one-event-per-notification by using epoll to wait
> on completions on the io_uring side.
Yes, I wish I could do that, but that works only if everything is
io_uring - all or nothing. Most of the code is built around an
epoll-based loop and will not be ported to io_uring so quickly.
Maybe what's missing is epoll_wait as an io_uring opcode. Then I could
wrap it the other way around. Or am I supposed to use io_uring's
poll_add_multishot for that?
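(For illustration, wrapping it the other way around with a multishot
poll on the existing epoll fd might look roughly like this - whether
that is the intended usage is exactly the open question here:)

#include <liburing.h>
#include <poll.h>

/* hypothetical sketch: let the ring watch the existing epoll fd; when it
 * fires, the event loop would call epoll_wait() with a zero timeout */
static int watch_epoll_fd(struct io_uring *ring, int epfd)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	if (sqe == NULL)
		return -1;

	io_uring_prep_poll_multishot(sqe, epfd, POLLIN);
	io_uring_sqe_set_data(sqe, NULL);	/* tag however the event loop needs */
	return io_uring_submit(ring);
}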
Max
* Re: [PATCH 4/8] io_uring/io-wq: cache work->flags in variable
2025-01-28 13:39 ` [PATCH 4/8] io_uring/io-wq: cache work->flags in variable Max Kellermann
@ 2025-01-29 18:57 ` Pavel Begunkov
2025-01-29 19:11 ` Max Kellermann
0 siblings, 1 reply; 30+ messages in thread
From: Pavel Begunkov @ 2025-01-29 18:57 UTC (permalink / raw)
To: Max Kellermann, axboe, io-uring, linux-kernel
On 1/28/25 13:39, Max Kellermann wrote:
> This eliminates several redundant atomic reads and therefore reduces
> the duration the surrounding spinlocks are held.
What architecture are you running? I don't get why the reads are
expensive when they're relaxed and there shouldn't even be any
contention. They don't even need to be atomics; we should still be
able to convert them back to plain ints.
> In several io_uring benchmarks, this reduced the CPU time spent in
> queued_spin_lock_slowpath() considerably:
>
> io_uring benchmark with a flood of `IORING_OP_NOP` and `IOSQE_ASYNC`:
>
> 38.86% -1.49% [kernel.kallsyms] [k] queued_spin_lock_slowpath
> 6.75% +0.36% [kernel.kallsyms] [k] io_worker_handle_work
> 2.60% +0.19% [kernel.kallsyms] [k] io_nop
> 3.92% +0.18% [kernel.kallsyms] [k] io_req_task_complete
> 6.34% -0.18% [kernel.kallsyms] [k] io_wq_submit_work
>
> HTTP server, static file:
>
> 42.79% -2.77% [kernel.kallsyms] [k] queued_spin_lock_slowpath
> 2.08% +0.23% [kernel.kallsyms] [k] io_wq_submit_work
> 1.19% +0.20% [kernel.kallsyms] [k] amd_iommu_iotlb_sync_map
> 1.46% +0.15% [kernel.kallsyms] [k] ep_poll_callback
> 1.80% +0.15% [kernel.kallsyms] [k] io_worker_handle_work
>
> HTTP server, PHP:
>
> 35.03% -1.80% [kernel.kallsyms] [k] queued_spin_lock_slowpath
> 0.84% +0.21% [kernel.kallsyms] [k] amd_iommu_iotlb_sync_map
> 1.39% +0.12% [kernel.kallsyms] [k] _copy_to_iter
> 0.21% +0.10% [kernel.kallsyms] [k] update_sd_lb_stats
>
> Signed-off-by: Max Kellermann <[email protected]>
--
Pavel Begunkov
* Re: [PATCH 4/8] io_uring/io-wq: cache work->flags in variable
2025-01-29 18:57 ` Pavel Begunkov
@ 2025-01-29 19:11 ` Max Kellermann
2025-01-29 23:41 ` Pavel Begunkov
0 siblings, 1 reply; 30+ messages in thread
From: Max Kellermann @ 2025-01-29 19:11 UTC (permalink / raw)
To: Pavel Begunkov; +Cc: axboe, io-uring, linux-kernel
On Wed, Jan 29, 2025 at 7:56 PM Pavel Begunkov <[email protected]> wrote:
> What architecture are you running? I don't get why the reads
> are expensive when they're relaxed and there shouldn't even be
> any contention. It doesn't even need to be atomics; we should
> still be able to convert it back to plain ints.
I measured on an AMD Epyc 9654P.
As you see in my numbers, around 40% of the CPU time was wasted on
spinlock contention. Dozens of io-wq threads are trampling on each
other's feet all the time.
I don't think this is about memory accesses being exceptionally
expensive; it's just about wringing every cycle from the code section
that's under the heavy-contention spinlock.
^ permalink raw reply [flat|nested] 30+ messages in thread
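The caching discussed above boils down to loading the atomic flags word once
into a local variable and then testing the plain value, instead of re-reading
the atomic for every check inside the contended section. A user-space sketch
of the pattern with C11 atomics (the flag names are made up; this is not the
kernel diff itself):

#include <stdatomic.h>
#include <stdbool.h>

#define WORK_HASHED	(1u << 0)
#define WORK_CANCEL	(1u << 1)

struct work {
	atomic_uint flags;	/* stands in for io_wq_work.flags */
};

/* before: every check re-reads the atomic */
static bool needs_hash_before(struct work *w)
{
	return (atomic_load_explicit(&w->flags, memory_order_relaxed) & WORK_HASHED) &&
	       !(atomic_load_explicit(&w->flags, memory_order_relaxed) & WORK_CANCEL);
}

/* after: load once, then operate on the plain local value */
static bool needs_hash_after(struct work *w)
{
	unsigned int flags = atomic_load_explicit(&w->flags, memory_order_relaxed);

	return (flags & WORK_HASHED) && !(flags & WORK_CANCEL);
}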
* Re: [PATCH 7/8] io_uring: cache io_kiocb->flags in variable
2025-01-28 13:39 ` [PATCH 7/8] io_uring: cache io_kiocb->flags in variable Max Kellermann
@ 2025-01-29 19:11 ` Pavel Begunkov
0 siblings, 0 replies; 30+ messages in thread
From: Pavel Begunkov @ 2025-01-29 19:11 UTC (permalink / raw)
To: Max Kellermann, axboe, io-uring, linux-kernel
On 1/28/25 13:39, Max Kellermann wrote:
> This eliminates several redundant reads, some of which probably cannot
> be optimized away by the compiler.
Let's not; it hurts readability with no clear benefit. In most cases
the compiler will be able to optimise it just where it matters, and
in cold paths we're comparing the overhead of reading a cached variable
with that of taking locks and doing indirect calls, and even then it'd
likely need to be saved onto the stack and loaded back.
The only place where it might be worth it is io_issue_sqe(), and
even there I'd doubt it.
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 0/8] Various io_uring micro-optimizations (reducing lock contention)
2025-01-29 17:39 ` Max Kellermann
2025-01-29 17:45 ` Jens Axboe
@ 2025-01-29 19:30 ` Pavel Begunkov
2025-01-29 19:43 ` Max Kellermann
1 sibling, 1 reply; 30+ messages in thread
From: Pavel Begunkov @ 2025-01-29 19:30 UTC (permalink / raw)
To: Max Kellermann, Jens Axboe; +Cc: io-uring, linux-kernel
On 1/29/25 17:39, Max Kellermann wrote:
> On Wed, Jan 29, 2025 at 6:19 PM Jens Axboe <[email protected]> wrote:
>> The other patches look pretty straightforward to me. The only thing that
>> has me puzzled a bit is why you have so much io-wq activity with your
>> application; in general I'd expect 0 activity there. But then I saw the
>> forced ASYNC flag, and it makes sense. In general, forcing that isn't a
>> great idea, but for an io-wq benchmark it certainly makes sense.
>
> I was experimenting with io_uring and wanted to see how much
> performance I can squeeze out of my web server running
> single-threaded. The overhead of io_uring_submit() grew very large,
> because the "send" operation would do a lot of synchronous work in the
> kernel. I tried SQPOLL but it was actually a big performance
> regression; this just shifted my CPU usage to epoll_wait(). Forcing
> ASYNC gave me large throughput improvements (moving the submission
> overhead to iowq), but then the iowq lock contention was the next
> limit, thus this patch series.
It's great to see iowq getting some optimisations, but note that
it wouldn't be fair comparing it to single threaded peers when
you have a lot of iowq activity as it might be occupying multiple
CPUs. A curious open question is whether it'd be more performant
to have several user threads with their private rings.
> I'm still experimenting, and I will certainly revisit SQPOLL to learn
> more about why it didn't help and how to fix it.
It's wasteful unless you saturate it close to 100%, and then you
usually have SQPOLL on a separate CPU from the user task submitting
requests, and so it'd take some cache bouncing. It's not a silver
bullet.
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 30+ messages in thread
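For reference, the forced offload described in the exchange above amounts to
setting IOSQE_ASYNC on each SQE so that the request is punted to io-wq rather
than executed inline at submit time. A minimal liburing sketch (IORING_OP_NOP
mirrors the micro-benchmark from the cover letter; a real server would prep
send/splice requests instead):

#include <liburing.h>

static int submit_forced_async_nop(struct io_uring *ring)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	if (!sqe)
		return -1;
	io_uring_prep_nop(sqe);
	sqe->flags |= IOSQE_ASYNC;	/* force offload to an io-wq worker */
	return io_uring_submit(ring);
}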
* Re: [PATCH 0/8] Various io_uring micro-optimizations (reducing lock contention)
2025-01-29 19:30 ` Pavel Begunkov
@ 2025-01-29 19:43 ` Max Kellermann
0 siblings, 0 replies; 30+ messages in thread
From: Max Kellermann @ 2025-01-29 19:43 UTC (permalink / raw)
To: Pavel Begunkov; +Cc: Jens Axboe, io-uring, linux-kernel
On Wed, Jan 29, 2025 at 8:30 PM Pavel Begunkov <[email protected]> wrote:
> It's great to see iowq getting some optimisations, but note that
> it wouldn't be fair comparing it to single threaded peers when
> you have a lot of iowq activity as it might be occupying multiple
> CPUs.
True. Fully loaded with the benchmark, I see 400%-600% CPU usage on my
process (30-40% of that being spinlock contention).
I wanted to explore how far I can get with a single (userspace)
thread, and leave the dirty thread-sync work to the kernel.
> It's wasteful unless you saturate it close to 100%, and then you
> usually have SQPOLL on a separate CPU from the user task submitting
> requests, and so it'd take some cache bouncing. It's not a silver
> bullet.
Of course, memory latency always bites us in the end. But this isn't
the endgame just yet; we still have a lot of potential for
optimizations.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 4/8] io_uring/io-wq: cache work->flags in variable
2025-01-29 19:11 ` Max Kellermann
@ 2025-01-29 23:41 ` Pavel Begunkov
2025-01-30 5:36 ` Max Kellermann
2025-01-30 14:54 ` Jens Axboe
0 siblings, 2 replies; 30+ messages in thread
From: Pavel Begunkov @ 2025-01-29 23:41 UTC (permalink / raw)
To: Max Kellermann; +Cc: axboe, io-uring, linux-kernel
On 1/29/25 19:11, Max Kellermann wrote:
> On Wed, Jan 29, 2025 at 7:56 PM Pavel Begunkov <[email protected]> wrote:
>> What architecture are you running? I don't get why the reads
>> are expensive when they're relaxed and there shouldn't even be
>> any contention. It doesn't even need to be atomics; we should
>> still be able to convert it back to plain ints.
>
> I measured on an AMD Epyc 9654P.
> As you see in my numbers, around 40% of the CPU time was wasted on
> spinlock contention. Dozens of io-wq threads are trampling on each
> other's feet all the time.
> I don't think this is about memory accesses being exceptionally
> expensive; it's just about wringing every cycle from the code section
> that's under the heavy-contention spinlock.
Ok, then it's an architectural problem and needs more serious
reengineering, e.g. of how work items are stored and grabbed, and it
might even get some more use cases for io_uring. FWIW, I'm not saying
smaller optimisations shouldn't have a place, especially when they're
clean.
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 4/8] io_uring/io-wq: cache work->flags in variable
2025-01-29 23:41 ` Pavel Begunkov
@ 2025-01-30 5:36 ` Max Kellermann
2025-01-30 14:57 ` Jens Axboe
2025-01-30 14:54 ` Jens Axboe
1 sibling, 1 reply; 30+ messages in thread
From: Max Kellermann @ 2025-01-30 5:36 UTC (permalink / raw)
To: Pavel Begunkov; +Cc: axboe, io-uring, linux-kernel
On Thu, Jan 30, 2025 at 12:41 AM Pavel Begunkov <[email protected]> wrote:
> Ok, then it's an architectural problem and needs more serious
> reengineering, e.g. of how work items are stored and grabbed
Rough unpolished idea: I was thinking about having multiple work
lists, each with its own spinlock (separate cache line), and each
io-wq thread only uses one of them, while the submitter round-robins
through the lists.
^ permalink raw reply [flat|nested] 30+ messages in thread
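Spelled out, the idea above is one pending list per lock, each padded to its
own cache line, with submitters spreading work round-robin while every worker
drains only its own list. A rough user-space sketch (all names invented; this
is not a proposal for the actual kernel data structures):

#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define NR_LISTS 8

struct work_item {
	struct work_item *next;
};

struct work_list {
	pthread_spinlock_t lock;	/* initialized elsewhere with pthread_spin_init() */
	struct work_item *head, *tail;
} __attribute__((aligned(64)));		/* keep each lock on its own cache line */

static struct work_list lists[NR_LISTS];
static atomic_uint rr_counter;

/* submitter side: pick the next list round-robin and append under its lock */
static void submit_work(struct work_item *item)
{
	unsigned int idx = atomic_fetch_add_explicit(&rr_counter, 1,
						     memory_order_relaxed) % NR_LISTS;
	struct work_list *wl = &lists[idx];

	item->next = NULL;
	pthread_spin_lock(&wl->lock);
	if (wl->tail)
		wl->tail->next = item;
	else
		wl->head = item;
	wl->tail = item;
	pthread_spin_unlock(&wl->lock);
}

/* worker side: each worker only ever touches its own list and lock */
static struct work_item *pop_work(unsigned int worker_id)
{
	struct work_list *wl = &lists[worker_id % NR_LISTS];
	struct work_item *item;

	pthread_spin_lock(&wl->lock);
	item = wl->head;
	if (item) {
		wl->head = item->next;
		if (!wl->head)
			wl->tail = NULL;
	}
	pthread_spin_unlock(&wl->lock);
	return item;
}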
* Re: [PATCH 4/8] io_uring/io-wq: cache work->flags in variable
2025-01-29 23:41 ` Pavel Begunkov
2025-01-30 5:36 ` Max Kellermann
@ 2025-01-30 14:54 ` Jens Axboe
1 sibling, 0 replies; 30+ messages in thread
From: Jens Axboe @ 2025-01-30 14:54 UTC (permalink / raw)
To: Pavel Begunkov, Max Kellermann; +Cc: io-uring, linux-kernel
On 1/29/25 4:41 PM, Pavel Begunkov wrote:
> On 1/29/25 19:11, Max Kellermann wrote:
>> On Wed, Jan 29, 2025 at 7:56 PM Pavel Begunkov <[email protected]> wrote:
>>> What architecture are you running? I don't get why the reads
>>> are expensive when they're relaxed and there shouldn't even be
>>> any contention. It doesn't even need to be atomics; we should
>>> still be able to convert it back to plain ints.
>>
>> I measured on an AMD Epyc 9654P.
>> As you see in my numbers, around 40% of the CPU time was wasted on
>> spinlock contention. Dozens of io-wq threads are trampling on each
>> other's feet all the time.
>> I don't think this is about memory accesses being exceptionally
>> expensive; it's just about wringing every cycle from the code section
>> that's under the heavy-contention spinlock.
>
> Ok, then it's an architectural problem and needs more serious
> reengineering, e.g. of how work items are stored and grabbed, and it
> might even get some more use cases for io_uring. FWIW, I'm not saying
> smaller optimisations shouldn't have a place, especially when they're
> clean.
Totally agree - io-wq would need some improvements on the where to queue
and pull work to make it scale better, which may indeed be a good idea
to do and would open it up to more use cases that currently don't make
much sense.
That said, also agree that the minor optimizations still have a place,
it's not like they will stand in the way of general improvements as
well.
--
Jens Axboe
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 4/8] io_uring/io-wq: cache work->flags in variable
2025-01-30 5:36 ` Max Kellermann
@ 2025-01-30 14:57 ` Jens Axboe
2025-01-31 14:06 ` Pavel Begunkov
0 siblings, 1 reply; 30+ messages in thread
From: Jens Axboe @ 2025-01-30 14:57 UTC (permalink / raw)
To: Max Kellermann, Pavel Begunkov; +Cc: io-uring, linux-kernel
On 1/29/25 10:36 PM, Max Kellermann wrote:
> On Thu, Jan 30, 2025 at 12:41 AM Pavel Begunkov <[email protected]> wrote:
>> Ok, then it's an architectural problem and needs more serious
>> reengineering, e.g. of how work items are stored and grabbed
>
> Rough unpolished idea: I was thinking about having multiple work
> lists, each with its own spinlock (separate cache line), and each
> io-wq thread only uses one of them, while the submitter round-robins
> through the lists.
Pending work would certainly need better spreading than just the two
classes we have now.
One thing to keep in mind is that the design of io-wq is such that it's
quite possible to have N work items pending and just a single thread
serving all of them. If the io-wq thread doesn't go to sleep, it will
keep processing work units. This is done for efficiency reasons, and to
avoid a proliferation of io-wq threads when it's not going to be
beneficial. This means that when you queue a work item, it's not easy to
pick an appropriate io-wq thread upfront, and generally the io-wq thread
itself will pick its next work item at the perfect time - when it
doesn't have anything else to do, or has finished its existing work.
This should be kept in mind for making io-wq scale better.
--
Jens Axboe
^ permalink raw reply [flat|nested] 30+ messages in thread
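Schematically (in user-space terms, not the kernel implementation), the
behaviour described above is the classic worker loop that only sleeps when
the pending list is empty, so a single thread can serve any number of queued
items as long as it never blocks:

#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct work_item {
	struct work_item *next;
	void (*fn)(struct work_item *);
};

struct worker_ctx {
	pthread_mutex_t lock;
	pthread_cond_t more_work;
	struct work_item *head;		/* singly linked pending list */
	bool exiting;
};

static void *worker_loop(void *arg)
{
	struct worker_ctx *ctx = arg;

	for (;;) {
		struct work_item *item;

		pthread_mutex_lock(&ctx->lock);
		while (!ctx->head && !ctx->exiting)
			pthread_cond_wait(&ctx->more_work, &ctx->lock);	/* sleep only when idle */
		if (!ctx->head) {
			pthread_mutex_unlock(&ctx->lock);
			break;		/* exiting and nothing left to do */
		}
		item = ctx->head;
		ctx->head = item->next;
		pthread_mutex_unlock(&ctx->lock);

		item->fn(item);		/* run one item, then immediately look for the next */
	}
	return NULL;
}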
* Re: [PATCH 8/8] io_uring: skip redundant poll wakeups
2025-01-28 13:39 ` [PATCH 8/8] io_uring: skip redundant poll wakeups Max Kellermann
@ 2025-01-31 13:54 ` Pavel Begunkov
2025-01-31 17:16 ` Max Kellermann
0 siblings, 1 reply; 30+ messages in thread
From: Pavel Begunkov @ 2025-01-31 13:54 UTC (permalink / raw)
To: Max Kellermann, axboe, io-uring, linux-kernel
On 1/28/25 13:39, Max Kellermann wrote:
> Using io_uring with epoll is very expensive because every completion
> leads to a __wake_up() call, most of which are unnecessary because the
> polling process has already been woken up but has not had a chance to
> process the completions. During this time, wq_has_sleeper() still
> returns true, therefore this check is not enough.
The poller is not required to call vfs_poll / io_uring_poll()
multiple times, in which case all subsequent events will be dropped
on the floor. E.g. the poller queues a poll entry in the first
io_uring_poll(), and then every time there is an event it'd do
vfs_read() or whatever without removing the entry.
I don't think we can make such assumptions about the poller, at
least without some help from it / special casing particular
callbacks.
...
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index 137c2066c5a3..b65efd07e9f0 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
...
> @@ -2793,6 +2794,9 @@ static __poll_t io_uring_poll(struct file *file, poll_table *wait)
>
>  	if (unlikely(!ctx->poll_activated))
>  		io_activate_pollwq(ctx);
> +
> +	atomic_set(&ctx->poll_wq_waiting, 1);
io_uring_poll()                |
  poll_wq_waiting = 1          |
                               | io_poll_wq_wake()
                               |   poll_wq_waiting = 0
                               |   // no waiters yet => no wake ups
                               | <return to user space>
                               | <consume all cqes>
  poll_wait()                  |
  return; // no events         |
                               | produce_cqes()
                               | io_poll_wq_wake()
                               |   if (poll_wq_waiting) wake();
                               |   // it's still 0, wake up is lost
>  	/*
>  	 * provides mb() which pairs with barrier from wq_has_sleeper
>  	 * call in io_commit_cqring
> diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
> index f65e3f3ede51..186cee066f9f 100644
> --- a/io_uring/io_uring.h
> +++ b/io_uring/io_uring.h
> @@ -287,7 +287,7 @@ static inline void io_commit_cqring(struct io_ring_ctx *ctx)
>
>  static inline void io_poll_wq_wake(struct io_ring_ctx *ctx)
>  {
> -	if (wq_has_sleeper(&ctx->poll_wq))
> +	if (wq_has_sleeper(&ctx->poll_wq) && atomic_xchg_release(&ctx->poll_wq_waiting, 0) > 0)
>  		__wake_up(&ctx->poll_wq, TASK_NORMAL, 0,
>  			  poll_to_key(EPOLL_URING_WAKE | EPOLLIN));
>  }
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 4/8] io_uring/io-wq: cache work->flags in variable
2025-01-30 14:57 ` Jens Axboe
@ 2025-01-31 14:06 ` Pavel Begunkov
0 siblings, 0 replies; 30+ messages in thread
From: Pavel Begunkov @ 2025-01-31 14:06 UTC (permalink / raw)
To: Jens Axboe, Max Kellermann; +Cc: io-uring, linux-kernel
On 1/30/25 14:57, Jens Axboe wrote:
> On 1/29/25 10:36 PM, Max Kellermann wrote:
>> On Thu, Jan 30, 2025 at 12:41 AM Pavel Begunkov <[email protected]> wrote:
>>> Ok, then it's an architectural problem and needs more serious
>>> reengineering, e.g. of how work items are stored and grabbed
>>
>> Rough unpolished idea: I was thinking about having multiple work
>> lists, each with its own spinlock (separate cache line), and each
>> io-wq thread only uses one of them, while the submitter round-robins
>> through the lists.
>
> Pending work would certainly need better spreading than just the two
> classes we have now.
>
> One thing to keep in mind is that the design of io-wq is such that it's
> quite possible to have N work items pending and just a single thread
> serving all of them. If the io-wq thread doesn't go to sleep, it will
> keep processing work units. This is done for efficiency reasons, and to
Looking at people complaining about too many iowq tasks, we should be
limiting the number of them even more aggressively, and maybe scaling
them down faster if that's a problem.
> avoid a proliferation of io-wq threads when it's not going to be
> beneficial. This means that when you queue a work item, it's not easy to
> pick an appropriate io-wq thread upfront, and generally the io-wq thread
> itself will pick its next work item at the perfect time - when it
> doesn't have anything else to do, or has finished its existing work.
>
> This should be kept in mind for making io-wq scale better.
People are saying that work stealing works well with thread
pools; that might be an option, even though there are some
differences from userspace thread pools. I also remember Hao was
trying to do something for iowq a couple of years ago.
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 0/8] Various io_uring micro-optimizations (reducing lock contention)
2025-01-29 18:01 ` Max Kellermann
@ 2025-01-31 16:13 ` Jens Axboe
2025-02-01 15:25 ` Jens Axboe
0 siblings, 1 reply; 30+ messages in thread
From: Jens Axboe @ 2025-01-31 16:13 UTC (permalink / raw)
To: Max Kellermann; +Cc: asml.silence, io-uring, linux-kernel
On 1/29/25 11:01 AM, Max Kellermann wrote:
> On Wed, Jan 29, 2025 at 6:45 PM Jens Axboe <[email protected]> wrote:
>> Why are you combining it with epoll in the first place? It's a lot more
>> efficient to wait on one or multiple events in io_uring_enter() rather than
>> go back to a serialized one-event-per-notification model by using epoll to
>> wait on completions on the io_uring side.
>
> Yes, I wish I could do that, but that works only if everything is
> io_uring - all or nothing. Most of the code is built around an
> epoll-based loop and will not be ported to io_uring so quickly.
>
> Maybe what's missing is epoll_wait as an io_uring opcode. Then I could
> wrap it the other way. Or am I supposed to use io_uring
> poll_add_multishot for that?
Not a huge fan of adding more epoll logic to io_uring, but you are right
that this case may indeed make sense, as it allows you to integrate better
that way in existing event loops. I'll take a look.
--
Jens Axboe
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 8/8] io_uring: skip redundant poll wakeups
2025-01-31 13:54 ` Pavel Begunkov
@ 2025-01-31 17:16 ` Max Kellermann
2025-01-31 17:25 ` Pavel Begunkov
0 siblings, 1 reply; 30+ messages in thread
From: Max Kellermann @ 2025-01-31 17:16 UTC (permalink / raw)
To: Pavel Begunkov; +Cc: axboe, io-uring, linux-kernel
On Fri, Jan 31, 2025 at 2:54 PM Pavel Begunkov <[email protected]> wrote:
> | // it's still 0, wake up is lost
I think I misunderstood how these polling functions in the kernel work
- thanks for the explanation, Pavel. This one patch is indeed flawed,
and I'll try to come up with something better.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 8/8] io_uring: skip redundant poll wakeups
2025-01-31 17:16 ` Max Kellermann
@ 2025-01-31 17:25 ` Pavel Begunkov
0 siblings, 0 replies; 30+ messages in thread
From: Pavel Begunkov @ 2025-01-31 17:25 UTC (permalink / raw)
To: Max Kellermann; +Cc: axboe, io-uring, linux-kernel
On 1/31/25 17:16, Max Kellermann wrote:
> On Fri, Jan 31, 2025 at 2:54 PM Pavel Begunkov <[email protected]> wrote:
>> | // it's still 0, wake up is lost
>
> I think I misunderstood how these polling functions in the kernel work
> - thanks for the explanation, Pavel. This one patch is indeed flawed,
> and I'll try to come up with something better.
No worries, it's too easy to get lost there; it takes me some
staring at the code every time as well.
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 0/8] Various io_uring micro-optimizations (reducing lock contention)
2025-01-31 16:13 ` Jens Axboe
@ 2025-02-01 15:25 ` Jens Axboe
2025-02-01 15:30 ` Max Kellermann
0 siblings, 1 reply; 30+ messages in thread
From: Jens Axboe @ 2025-02-01 15:25 UTC (permalink / raw)
To: Max Kellermann; +Cc: asml.silence, io-uring, linux-kernel
On 1/31/25 9:13 AM, Jens Axboe wrote:
> On 1/29/25 11:01 AM, Max Kellermann wrote:
>> On Wed, Jan 29, 2025 at 6:45 PM Jens Axboe <[email protected]> wrote:
>>> Why are you combining it with epoll in the first place? It's a lot more
>>> efficient to wait on one or multiple events in io_uring_enter() rather than
>>> go back to a serialized one-event-per-notification model by using epoll to
>>> wait on completions on the io_uring side.
>>
>> Yes, I wish I could do that, but that works only if everything is
>> io_uring - all or nothing. Most of the code is built around an
>> epoll-based loop and will not be ported to io_uring so quickly.
>>
>> Maybe what's missing is epoll_wait as an io_uring opcode. Then I could
>> wrap it the other way. Or am I supposed to use io_uring
>> poll_add_multishot for that?
>
> Not a huge fan of adding more epoll logic to io_uring, but you are right
> that this case may indeed make sense, as it allows you to integrate better
> that way in existing event loops. I'll take a look.
Here's a series doing that:
https://git.kernel.dk/cgit/linux/log/?h=io_uring-epoll-wait
Could actually work pretty well - the last patch adds multishot support
as well, which means we can avoid the write lock dance for repeated
triggers of this epoll event. That should actually end up being more
efficient than regular epoll_wait(2).
Wrote a basic test case to exercise it, and it seems to work fine for
me, but it's obviously not super well tested just yet. Below is the liburing
diff; it just adds the helper to prepare one of these epoll wait requests.
diff --git a/src/include/liburing.h b/src/include/liburing.h
index 49b4edf437b2..a95c475496f4 100644
--- a/src/include/liburing.h
+++ b/src/include/liburing.h
@@ -729,6 +729,15 @@ IOURINGINLINE void io_uring_prep_listen(struct io_uring_sqe *sqe, int fd,
 	io_uring_prep_rw(IORING_OP_LISTEN, sqe, fd, 0, backlog, 0);
 }
 
+struct epoll_event;
+IOURINGINLINE void io_uring_prep_epoll_wait(struct io_uring_sqe *sqe, int fd,
+					    struct epoll_event *events,
+					    int maxevents, unsigned flags)
+{
+	io_uring_prep_rw(IORING_OP_EPOLL_WAIT, sqe, fd, events, maxevents, 0);
+	sqe->epoll_flags = flags;
+}
+
IOURINGINLINE void io_uring_prep_files_update(struct io_uring_sqe *sqe,
int *fds, unsigned nr_fds,
int offset)
diff --git a/src/include/liburing/io_uring.h b/src/include/liburing/io_uring.h
index 765919883cff..bc725787ceb7 100644
--- a/src/include/liburing/io_uring.h
+++ b/src/include/liburing/io_uring.h
@@ -73,6 +73,7 @@ struct io_uring_sqe {
__u32 futex_flags;
__u32 install_fd_flags;
__u32 nop_flags;
+ __u32 epoll_flags;
};
__u64 user_data; /* data to be passed back at completion time */
/* pack this to avoid bogus arm OABI complaints */
@@ -262,6 +263,7 @@ enum io_uring_op {
 	IORING_OP_FTRUNCATE,
 	IORING_OP_BIND,
 	IORING_OP_LISTEN,
+	IORING_OP_EPOLL_WAIT,
 
 	/* this goes last, obviously */
 	IORING_OP_LAST,
@@ -388,6 +390,11 @@ enum io_uring_op {
 #define IORING_ACCEPT_DONTWAIT	(1U << 1)
 #define IORING_ACCEPT_POLL_FIRST	(1U << 2)
 
+/*
+ * epoll_wait flags, stored in sqe->epoll_flags
+ */
+#define IORING_EPOLL_WAIT_MULTISHOT	(1U << 0)
+
 /*
  * IORING_OP_MSG_RING command types, stored in sqe->addr
  */
--
Jens Axboe
^ permalink raw reply related [flat|nested] 30+ messages in thread
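Assuming the io_uring-epoll-wait branch above, usage from an event loop might
look roughly like the following. Note that treating cqe->res as the number of
events written into the array is an assumption made here for illustration,
not something confirmed in the thread:

#include <liburing.h>
#include <sys/epoll.h>

static void wait_epoll_via_uring(struct io_uring *ring, int epfd)
{
	struct epoll_event events[64];
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
	struct io_uring_cqe *cqe;
	int i;

	io_uring_prep_epoll_wait(sqe, epfd, events, 64, 0);
	io_uring_submit_and_wait(ring, 1);

	if (io_uring_peek_cqe(ring, &cqe) == 0) {
		for (i = 0; i < cqe->res; i++) {
			/* handle events[i] exactly as after epoll_wait(2) */
		}
		io_uring_cqe_seen(ring, cqe);
	}
}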
* Re: [PATCH 0/8] Various io_uring micro-optimizations (reducing lock contention)
2025-02-01 15:25 ` Jens Axboe
@ 2025-02-01 15:30 ` Max Kellermann
2025-02-01 15:38 ` Jens Axboe
0 siblings, 1 reply; 30+ messages in thread
From: Max Kellermann @ 2025-02-01 15:30 UTC (permalink / raw)
To: Jens Axboe; +Cc: asml.silence, io-uring, linux-kernel
On Sat, Feb 1, 2025 at 4:26 PM Jens Axboe <[email protected]> wrote:
> > Not a huge fan of adding more epoll logic to io_uring, but you are right
> > that this case may indeed make sense, as it allows you to integrate better
> > that way in existing event loops. I'll take a look.
>
> Here's a series doing that:
>
> https://git.kernel.dk/cgit/linux/log/?h=io_uring-epoll-wait
>
> Could actually work pretty well - the last patch adds multishot support
> as well, which means we can avoid the write lock dance for repeated
> triggers of this epoll event. That should actually end up being more
> efficient than regular epoll_wait(2).
Nice, thanks Jens! I will integrate this in our I/O event loop and
test it next week. This will eliminate the io_uring poll wakeup
overhead completely.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 0/8] Various io_uring micro-optimizations (reducing lock contention)
2025-02-01 15:30 ` Max Kellermann
@ 2025-02-01 15:38 ` Jens Axboe
0 siblings, 0 replies; 30+ messages in thread
From: Jens Axboe @ 2025-02-01 15:38 UTC (permalink / raw)
To: Max Kellermann; +Cc: asml.silence, io-uring, linux-kernel
On 2/1/25 8:30 AM, Max Kellermann wrote:
> On Sat, Feb 1, 2025 at 4:26 PM Jens Axboe <[email protected]> wrote:
>>> Not a huge fan of adding more epoll logic to io_uring, but you are right
>>> that this case may indeed make sense, as it allows you to integrate better
>>> that way in existing event loops. I'll take a look.
>>
>> Here's a series doing that:
>>
>> https://git.kernel.dk/cgit/linux/log/?h=io_uring-epoll-wait
>>
>> Could actually work pretty well - the last patch adds multishot support
>> as well, which means we can avoid the write lock dance for repeated
>> triggers of this epoll event. That should actually end up being more
>> efficient than regular epoll_wait(2).
>
> Nice, thanks Jens! I will integrate this in our I/O event loop and
> test it next week. This will eliminate the io_uring poll wakeup
> overhead completely.
That'd be great, let us know how it goes!
--
Jens Axboe
^ permalink raw reply [flat|nested] 30+ messages in thread