* [PATCH v2 1/8] io_uring: account drain memory to cgroup
2025-05-09 11:12 [PATCH v2 0/8] allocated requests based drain and fixes Pavel Begunkov
@ 2025-05-09 11:12 ` Pavel Begunkov
2025-05-09 11:12 ` [PATCH v2 2/8] io_uring: fix spurious drain flushing Pavel Begunkov
` (7 subsequent siblings)
8 siblings, 0 replies; 12+ messages in thread
From: Pavel Begunkov @ 2025-05-09 11:12 UTC (permalink / raw)
To: io-uring; +Cc: asml.silence
Account drain allocations against memcg. It's not a big problem, as each
such allocation is paired with a request, which is already accounted, but
it's nicer to follow the limits more closely.
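For reference, GFP_KERNEL_ACCOUNT is just GFP_KERNEL with memcg charging
enabled, roughly (from the gfp headers):
    #define GFP_KERNEL_ACCOUNT	(GFP_KERNEL | __GFP_ACCOUNT)
so the change only adds the __GFP_ACCOUNT charge; nothing else about the
allocation changes.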
Cc: stable@vger.kernel.org # 6.1
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
io_uring/io_uring.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 0d051476008c..23e283e65eeb 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1672,7 +1672,7 @@ static __cold void io_drain_req(struct io_kiocb *req)
spin_unlock(&ctx->completion_lock);
io_prep_async_link(req);
- de = kmalloc(sizeof(*de), GFP_KERNEL);
+ de = kmalloc(sizeof(*de), GFP_KERNEL_ACCOUNT);
if (!de) {
ret = -ENOMEM;
io_req_defer_failed(req, ret);
--
2.49.0
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH v2 2/8] io_uring: fix spurious drain flushing
2025-05-09 11:12 [PATCH v2 0/8] allocated requests based drain and fixes Pavel Begunkov
2025-05-09 11:12 ` [PATCH v2 1/8] io_uring: account drain memory to cgroup Pavel Begunkov
@ 2025-05-09 11:12 ` Pavel Begunkov
2025-05-09 11:12 ` [PATCH v2 3/8] io_uring: simplify drain ret passing Pavel Begunkov
` (6 subsequent siblings)
8 siblings, 0 replies; 12+ messages in thread
From: Pavel Begunkov @ 2025-05-09 11:12 UTC (permalink / raw)
To: io-uring; +Cc: asml.silence
io_queue_deferred() is not tolerant of spurious calls that don't complete
any requests. You can have an inflight drain-marked request and another
request that came after it and got queued into the drain list. Now, if
io_queue_deferred() is called before the first request completes, it'll
check the 2nd req with req_need_defer(), find that there is no drain
flag set, and queue it for execution.
To make io_queue_deferred() work correctly, it should at least check
sequences for the first request, and then also check whether there is
another drain request creating another bubble.
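As an illustration only (a sketch assuming liburing and a pre-initialised
'ring'; not the author's reproducer), the problematic ordering can be set
up like this:
    struct io_uring_sqe *sqe;
    sqe = io_uring_get_sqe(&ring);          /* req_a: some slow op */
    io_uring_prep_read(sqe, slow_fd, buf, sizeof(buf), 0);
    sqe = io_uring_get_sqe(&ring);          /* req_b: slow drain barrier */
    io_uring_prep_read(sqe, slow_fd, buf, sizeof(buf), 0);
    sqe->flags |= IOSQE_IO_DRAIN;
    sqe = io_uring_get_sqe(&ring);          /* req_c: must wait for req_b */
    io_uring_prep_nop(sqe);
    io_uring_submit(&ring);
Once req_b has been flushed and is inflight while req_c still sits in the
defer list, a spurious io_queue_deferred() call releases req_c before req_b
completes, which is the ordering violation fixed here.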
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
io_uring/io_uring.c | 14 +++++++++++++-
1 file changed, 13 insertions(+), 1 deletion(-)
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 23e283e65eeb..add46ab19017 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -559,18 +559,30 @@ void io_req_queue_iowq(struct io_kiocb *req)
io_req_task_work_add(req);
}
+static bool io_drain_defer_seq(struct io_kiocb *req, u32 seq)
+{
+ struct io_ring_ctx *ctx = req->ctx;
+
+ return seq + READ_ONCE(ctx->cq_extra) != ctx->cached_cq_tail;
+}
+
static __cold noinline void io_queue_deferred(struct io_ring_ctx *ctx)
{
+ bool drain_seen = false, first = true;
+
spin_lock(&ctx->completion_lock);
while (!list_empty(&ctx->defer_list)) {
struct io_defer_entry *de = list_first_entry(&ctx->defer_list,
struct io_defer_entry, list);
- if (req_need_defer(de->req, de->seq))
+ drain_seen |= de->req->flags & REQ_F_IO_DRAIN;
+ if ((drain_seen || first) && io_drain_defer_seq(de->req, de->seq))
break;
+
list_del_init(&de->list);
io_req_task_queue(de->req);
kfree(de);
+ first = false;
}
spin_unlock(&ctx->completion_lock);
}
--
2.49.0
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH v2 3/8] io_uring: simplify drain ret passing
2025-05-09 11:12 [PATCH v2 0/8] allocated requests based drain and fixes Pavel Begunkov
2025-05-09 11:12 ` [PATCH v2 1/8] io_uring: account drain memory to cgroup Pavel Begunkov
2025-05-09 11:12 ` [PATCH v2 2/8] io_uring: fix spurious drain flushing Pavel Begunkov
@ 2025-05-09 11:12 ` Pavel Begunkov
2025-05-09 11:12 ` [PATCH v2 4/8] io_uring: remove drain prealloc checks Pavel Begunkov
` (5 subsequent siblings)
8 siblings, 0 replies; 12+ messages in thread
From: Pavel Begunkov @ 2025-05-09 11:12 UTC (permalink / raw)
To: io-uring; +Cc: asml.silence
"ret" in io_drain_req() is only used in one place, remove it and pass
-ENOMEM directly.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
io_uring/io_uring.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index add46ab19017..c32fae28ea52 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1669,7 +1669,6 @@ static __cold void io_drain_req(struct io_kiocb *req)
{
struct io_ring_ctx *ctx = req->ctx;
struct io_defer_entry *de;
- int ret;
u32 seq = io_get_sequence(req);
/* Still need defer if there is pending req in defer list. */
@@ -1686,8 +1685,7 @@ static __cold void io_drain_req(struct io_kiocb *req)
io_prep_async_link(req);
de = kmalloc(sizeof(*de), GFP_KERNEL_ACCOUNT);
if (!de) {
- ret = -ENOMEM;
- io_req_defer_failed(req, ret);
+ io_req_defer_failed(req, -ENOMEM);
return;
}
--
2.49.0
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH v2 4/8] io_uring: remove drain prealloc checks
2025-05-09 11:12 [PATCH v2 0/8] allocated requests based drain and fixes Pavel Begunkov
` (2 preceding siblings ...)
2025-05-09 11:12 ` [PATCH v2 3/8] io_uring: simplify drain ret passing Pavel Begunkov
@ 2025-05-09 11:12 ` Pavel Begunkov
2025-05-09 11:12 ` [PATCH v2 5/8] io_uring: consolidate drain seq checking Pavel Begunkov
` (4 subsequent siblings)
8 siblings, 0 replies; 12+ messages in thread
From: Pavel Begunkov @ 2025-05-09 11:12 UTC (permalink / raw)
To: io-uring; +Cc: asml.silence
Currently, io_drain_req() has two steps: a fast path that checks sequence
numbers, followed by allocation, rechecking and the actual queuing. Simplify
it further by removing the first step.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
io_uring/io_uring.c | 15 +++------------
1 file changed, 3 insertions(+), 12 deletions(-)
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index c32fae28ea52..e3f6914304a2 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1671,17 +1671,6 @@ static __cold void io_drain_req(struct io_kiocb *req)
struct io_defer_entry *de;
u32 seq = io_get_sequence(req);
- /* Still need defer if there is pending req in defer list. */
- spin_lock(&ctx->completion_lock);
- if (!req_need_defer(req, seq) && list_empty_careful(&ctx->defer_list)) {
- spin_unlock(&ctx->completion_lock);
-queue:
- ctx->drain_active = false;
- io_req_task_queue(req);
- return;
- }
- spin_unlock(&ctx->completion_lock);
-
io_prep_async_link(req);
de = kmalloc(sizeof(*de), GFP_KERNEL_ACCOUNT);
if (!de) {
@@ -1693,7 +1682,9 @@ static __cold void io_drain_req(struct io_kiocb *req)
if (!req_need_defer(req, seq) && list_empty(&ctx->defer_list)) {
spin_unlock(&ctx->completion_lock);
kfree(de);
- goto queue;
+ ctx->drain_active = false;
+ io_req_task_queue(req);
+ return;
}
trace_io_uring_defer(req);
--
2.49.0
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH v2 5/8] io_uring: consolidate drain seq checking
2025-05-09 11:12 [PATCH v2 0/8] allocated requests based drain and fixes Pavel Begunkov
` (3 preceding siblings ...)
2025-05-09 11:12 ` [PATCH v2 4/8] io_uring: remove drain prealloc checks Pavel Begunkov
@ 2025-05-09 11:12 ` Pavel Begunkov
2025-05-09 11:12 ` [PATCH v2 6/8] io_uring: open code io_account_cq_overflow() Pavel Begunkov
` (3 subsequent siblings)
8 siblings, 0 replies; 12+ messages in thread
From: Pavel Begunkov @ 2025-05-09 11:12 UTC (permalink / raw)
To: io-uring; +Cc: asml.silence
We check sequences both when queuing drained requests and when flushing
them. Instead, always queue and immediately try to flush, so that all
sequence handling can be kept contained in the flushing code.
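The conversion below relies on the kernel's scope-based locking helpers
from <linux/cleanup.h>. For readers unfamiliar with them, the guard forms
used here are roughly equivalent to explicit lock/unlock pairs
(illustrative):
    guard(spinlock)(&lock);         /* spin_lock(&lock); auto-unlock at scope exit */
    scoped_guard(spinlock, &lock) {
            /* runs with the lock held; unlocked when the block ends */
    }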
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
io_uring/io_uring.c | 45 +++++++++++++++++----------------------------
1 file changed, 17 insertions(+), 28 deletions(-)
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index e3f6914304a2..0afae33e05e6 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -389,17 +389,6 @@ static void io_account_cq_overflow(struct io_ring_ctx *ctx)
ctx->cq_extra--;
}
-static bool req_need_defer(struct io_kiocb *req, u32 seq)
-{
- if (unlikely(req->flags & REQ_F_IO_DRAIN)) {
- struct io_ring_ctx *ctx = req->ctx;
-
- return seq + READ_ONCE(ctx->cq_extra) != ctx->cached_cq_tail;
- }
-
- return false;
-}
-
static void io_clean_op(struct io_kiocb *req)
{
if (unlikely(req->flags & REQ_F_BUFFER_SELECTED))
@@ -566,11 +555,10 @@ static bool io_drain_defer_seq(struct io_kiocb *req, u32 seq)
return seq + READ_ONCE(ctx->cq_extra) != ctx->cached_cq_tail;
}
-static __cold noinline void io_queue_deferred(struct io_ring_ctx *ctx)
+static __cold noinline void __io_queue_deferred(struct io_ring_ctx *ctx)
{
bool drain_seen = false, first = true;
- spin_lock(&ctx->completion_lock);
while (!list_empty(&ctx->defer_list)) {
struct io_defer_entry *de = list_first_entry(&ctx->defer_list,
struct io_defer_entry, list);
@@ -584,7 +572,12 @@ static __cold noinline void io_queue_deferred(struct io_ring_ctx *ctx)
kfree(de);
first = false;
}
- spin_unlock(&ctx->completion_lock);
+}
+
+static __cold noinline void io_queue_deferred(struct io_ring_ctx *ctx)
+{
+ guard(spinlock)(&ctx->completion_lock);
+ __io_queue_deferred(ctx);
}
void __io_commit_cqring_flush(struct io_ring_ctx *ctx)
@@ -1668,30 +1661,26 @@ static __cold void io_drain_req(struct io_kiocb *req)
__must_hold(&ctx->uring_lock)
{
struct io_ring_ctx *ctx = req->ctx;
+ bool drain = req->flags & IOSQE_IO_DRAIN;
struct io_defer_entry *de;
- u32 seq = io_get_sequence(req);
- io_prep_async_link(req);
de = kmalloc(sizeof(*de), GFP_KERNEL_ACCOUNT);
if (!de) {
io_req_defer_failed(req, -ENOMEM);
return;
}
- spin_lock(&ctx->completion_lock);
- if (!req_need_defer(req, seq) && list_empty(&ctx->defer_list)) {
- spin_unlock(&ctx->completion_lock);
- kfree(de);
- ctx->drain_active = false;
- io_req_task_queue(req);
- return;
- }
-
+ io_prep_async_link(req);
trace_io_uring_defer(req);
de->req = req;
- de->seq = seq;
- list_add_tail(&de->list, &ctx->defer_list);
- spin_unlock(&ctx->completion_lock);
+ de->seq = io_get_sequence(req);
+
+ scoped_guard(spinlock, &ctx->completion_lock) {
+ list_add_tail(&de->list, &ctx->defer_list);
+ __io_queue_deferred(ctx);
+ if (!drain && list_empty(&ctx->defer_list))
+ ctx->drain_active = false;
+ }
}
static bool io_assign_file(struct io_kiocb *req, const struct io_issue_def *def,
--
2.49.0
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH v2 6/8] io_uring: open code io_account_cq_overflow()
2025-05-09 11:12 [PATCH v2 0/8] allocated requests based drain and fixes Pavel Begunkov
` (4 preceding siblings ...)
2025-05-09 11:12 ` [PATCH v2 5/8] io_uring: consolidate drain seq checking Pavel Begunkov
@ 2025-05-09 11:12 ` Pavel Begunkov
2025-05-09 11:12 ` [PATCH v2 7/8] io_uring: count allocated requests Pavel Begunkov
` (2 subsequent siblings)
8 siblings, 0 replies; 12+ messages in thread
From: Pavel Begunkov @ 2025-05-09 11:12 UTC (permalink / raw)
To: io-uring; +Cc: asml.silence
io_account_cq_overflow() doesn't help explain what's going on there, and
it'll become even smaller with the following patches, so open code it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
io_uring/io_uring.c | 13 ++++---------
1 file changed, 4 insertions(+), 9 deletions(-)
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 0afae33e05e6..9619c46bd25e 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -381,14 +381,6 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
return NULL;
}
-static void io_account_cq_overflow(struct io_ring_ctx *ctx)
-{
- struct io_rings *r = ctx->rings;
-
- WRITE_ONCE(r->cq_overflow, READ_ONCE(r->cq_overflow) + 1);
- ctx->cq_extra--;
-}
-
static void io_clean_op(struct io_kiocb *req)
{
if (unlikely(req->flags & REQ_F_BUFFER_SELECTED))
@@ -742,12 +734,15 @@ static bool io_cqring_event_overflow(struct io_ring_ctx *ctx, u64 user_data,
ocqe = kmalloc(ocq_size, GFP_ATOMIC | __GFP_ACCOUNT);
trace_io_uring_cqe_overflow(ctx, user_data, res, cflags, ocqe);
if (!ocqe) {
+ struct io_rings *r = ctx->rings;
+
/*
* If we're in ring overflow flush mode, or in task cancel mode,
* or cannot allocate an overflow entry, then we need to drop it
* on the floor.
*/
- io_account_cq_overflow(ctx);
+ WRITE_ONCE(r->cq_overflow, READ_ONCE(r->cq_overflow) + 1);
+ ctx->cq_extra--;
set_bit(IO_CHECK_CQ_DROPPED_BIT, &ctx->check_cq);
return false;
}
--
2.49.0
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH v2 7/8] io_uring: count allocated requests
2025-05-09 11:12 [PATCH v2 0/8] allocated requests based drain and fixes Pavel Begunkov
` (5 preceding siblings ...)
2025-05-09 11:12 ` [PATCH v2 6/8] io_uring: open code io_account_cq_overflow() Pavel Begunkov
@ 2025-05-09 11:12 ` Pavel Begunkov
2025-05-09 11:12 ` [PATCH v2 8/8] io_uring: drain based on allocates reqs Pavel Begunkov
2025-05-09 14:02 ` [PATCH v2 0/8] allocated requests based drain and fixes Jens Axboe
8 siblings, 0 replies; 12+ messages in thread
From: Pavel Begunkov @ 2025-05-09 11:12 UTC (permalink / raw)
To: io-uring; +Cc: asml.silence
Keep track of the number of requests a ring currently has allocated (and
not yet freed); it'll be needed in the next patch.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
include/linux/io_uring_types.h | 1 +
io_uring/io_uring.c | 9 ++++++++-
2 files changed, 9 insertions(+), 1 deletion(-)
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 7e23e993280e..73b289b48280 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -435,6 +435,7 @@ struct io_ring_ctx {
/* protected by ->completion_lock */
unsigned evfd_last_cq_tail;
+ unsigned nr_req_allocated;
/*
* Protection for resize vs mmap races - both the mmap and resize
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 9619c46bd25e..14188f49a4ce 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -954,6 +954,8 @@ __cold bool __io_alloc_req_refill(struct io_ring_ctx *ctx)
}
percpu_ref_get_many(&ctx->refs, ret);
+ ctx->nr_req_allocated += ret;
+
while (ret--) {
struct io_kiocb *req = reqs[ret];
@@ -2691,8 +2693,10 @@ static void io_req_caches_free(struct io_ring_ctx *ctx)
kmem_cache_free(req_cachep, req);
nr++;
}
- if (nr)
+ if (nr) {
+ ctx->nr_req_allocated -= nr;
percpu_ref_put_many(&ctx->refs, nr);
+ }
mutex_unlock(&ctx->uring_lock);
}
@@ -2729,6 +2733,9 @@ static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
percpu_ref_exit(&ctx->refs);
free_uid(ctx->user);
io_req_caches_free(ctx);
+
+ WARN_ON_ONCE(ctx->nr_req_allocated);
+
if (ctx->hash_map)
io_wq_put_hash(ctx->hash_map);
io_napi_free(ctx);
--
2.49.0
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH v2 8/8] io_uring: drain based on allocates reqs
2025-05-09 11:12 [PATCH v2 0/8] allocated requests based drain and fixes Pavel Begunkov
` (6 preceding siblings ...)
2025-05-09 11:12 ` [PATCH v2 7/8] io_uring: count allocated requests Pavel Begunkov
@ 2025-05-09 11:12 ` Pavel Begunkov
2025-05-13 10:37 ` Andy Shevchenko
2025-05-09 14:02 ` [PATCH v2 0/8] allocated requests based drain and fixes Jens Axboe
8 siblings, 1 reply; 12+ messages in thread
From: Pavel Begunkov @ 2025-05-09 11:12 UTC (permalink / raw)
To: io-uring; +Cc: asml.silence
Don't rely on CQ sequence numbers for draining, as it has become messy
and needs cq_extra adjustments. Instead, base it on the number of
allocated requests and only allow flushing when all requests are in the
drain list.
As a result, cq_extra is gone: there is no accounting overhead for it in aux
CQE posting, less bloat since it was inlined before, and it's generally
simpler than trying to track where it should be bumped and where it should
be put back, as in the overflow case. It'll also likely help with cleaning
up and unifying some of the CQ posting helpers.
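Concretely, the new flushing rule can be restated as the following condition
(a hypothetical helper shown only for clarity, it isn't added by the patch):
    static inline bool io_drain_can_flush(struct io_ring_ctx *ctx)
    {
            /* every allocated (not yet freed) request, counting each
             * request in a link chain, is parked in ctx->defer_list */
            return ctx->nr_req_allocated == ctx->nr_drained;
    }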
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
include/linux/io_uring_types.h | 2 +-
io_uring/io_uring.c | 83 +++++++++++++++-------------------
io_uring/io_uring.h | 3 +-
3 files changed, 38 insertions(+), 50 deletions(-)
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 73b289b48280..00dbd7cd0e7d 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -341,7 +341,6 @@ struct io_ring_ctx {
unsigned cached_cq_tail;
unsigned cq_entries;
struct io_ev_fd __rcu *io_ev_fd;
- unsigned cq_extra;
void *cq_wait_arg;
size_t cq_wait_size;
@@ -417,6 +416,7 @@ struct io_ring_ctx {
struct callback_head poll_wq_task_work;
struct list_head defer_list;
+ unsigned nr_drained;
struct io_alloc_cache msg_cache;
spinlock_t msg_lock;
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 14188f49a4ce..0fda1b1a33ae 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -129,7 +129,6 @@
struct io_defer_entry {
struct list_head list;
struct io_kiocb *req;
- u32 seq;
};
/* requests with any of those set should undergo io_disarm_next() */
@@ -149,6 +148,7 @@ static bool io_uring_try_cancel_requests(struct io_ring_ctx *ctx,
bool is_sqpoll_thread);
static void io_queue_sqe(struct io_kiocb *req);
+static void __io_req_caches_free(struct io_ring_ctx *ctx);
static __read_mostly DEFINE_STATIC_KEY_FALSE(io_key_has_sqarray);
@@ -540,46 +540,45 @@ void io_req_queue_iowq(struct io_kiocb *req)
io_req_task_work_add(req);
}
-static bool io_drain_defer_seq(struct io_kiocb *req, u32 seq)
+static unsigned io_linked_nr(struct io_kiocb *req)
{
- struct io_ring_ctx *ctx = req->ctx;
+ struct io_kiocb *tmp;
+ unsigned nr = 0;
- return seq + READ_ONCE(ctx->cq_extra) != ctx->cached_cq_tail;
+ io_for_each_link(tmp, req)
+ nr++;
+ return nr;
}
-static __cold noinline void __io_queue_deferred(struct io_ring_ctx *ctx)
+static __cold noinline void io_queue_deferred(struct io_ring_ctx *ctx)
{
bool drain_seen = false, first = true;
+ lockdep_assert_held(&ctx->uring_lock);
+ __io_req_caches_free(ctx);
+
while (!list_empty(&ctx->defer_list)) {
struct io_defer_entry *de = list_first_entry(&ctx->defer_list,
struct io_defer_entry, list);
drain_seen |= de->req->flags & REQ_F_IO_DRAIN;
- if ((drain_seen || first) && io_drain_defer_seq(de->req, de->seq))
- break;
+ if ((drain_seen || first) && ctx->nr_req_allocated != ctx->nr_drained)
+ return;
list_del_init(&de->list);
+ ctx->nr_drained -= io_linked_nr(de->req);
io_req_task_queue(de->req);
kfree(de);
first = false;
}
}
-static __cold noinline void io_queue_deferred(struct io_ring_ctx *ctx)
-{
- guard(spinlock)(&ctx->completion_lock);
- __io_queue_deferred(ctx);
-}
-
void __io_commit_cqring_flush(struct io_ring_ctx *ctx)
{
if (ctx->poll_activated)
io_poll_wq_wake(ctx);
if (ctx->off_timeout_used)
io_flush_timeouts(ctx);
- if (ctx->drain_active)
- io_queue_deferred(ctx);
if (ctx->has_evfd)
io_eventfd_signal(ctx, true);
}
@@ -742,7 +741,6 @@ static bool io_cqring_event_overflow(struct io_ring_ctx *ctx, u64 user_data,
* on the floor.
*/
WRITE_ONCE(r->cq_overflow, READ_ONCE(r->cq_overflow) + 1);
- ctx->cq_extra--;
set_bit(IO_CHECK_CQ_DROPPED_BIT, &ctx->check_cq);
return false;
}
@@ -812,8 +810,6 @@ static bool io_fill_cqe_aux(struct io_ring_ctx *ctx, u64 user_data, s32 res,
{
struct io_uring_cqe *cqe;
- ctx->cq_extra++;
-
if (likely(io_get_cqe(ctx, &cqe))) {
WRITE_ONCE(cqe->user_data, user_data);
WRITE_ONCE(cqe->res, res);
@@ -1456,6 +1452,10 @@ void __io_submit_flush_completions(struct io_ring_ctx *ctx)
io_free_batch_list(ctx, state->compl_reqs.first);
INIT_WQ_LIST(&state->compl_reqs);
}
+
+ if (unlikely(ctx->drain_active))
+ io_queue_deferred(ctx);
+
ctx->submit_state.cq_flush = false;
}
@@ -1643,23 +1643,14 @@ io_req_flags_t io_file_get_flags(struct file *file)
return res;
}
-static u32 io_get_sequence(struct io_kiocb *req)
-{
- u32 seq = req->ctx->cached_sq_head;
- struct io_kiocb *cur;
-
- /* need original cached_sq_head, but it was increased for each req */
- io_for_each_link(cur, req)
- seq--;
- return seq;
-}
-
static __cold void io_drain_req(struct io_kiocb *req)
__must_hold(&ctx->uring_lock)
{
struct io_ring_ctx *ctx = req->ctx;
bool drain = req->flags & IOSQE_IO_DRAIN;
struct io_defer_entry *de;
+ struct io_kiocb *tmp;
+ int nr = 0;
de = kmalloc(sizeof(*de), GFP_KERNEL_ACCOUNT);
if (!de) {
@@ -1667,17 +1658,17 @@ static __cold void io_drain_req(struct io_kiocb *req)
return;
}
+ io_for_each_link(tmp, req)
+ nr++;
io_prep_async_link(req);
trace_io_uring_defer(req);
de->req = req;
- de->seq = io_get_sequence(req);
- scoped_guard(spinlock, &ctx->completion_lock) {
- list_add_tail(&de->list, &ctx->defer_list);
- __io_queue_deferred(ctx);
- if (!drain && list_empty(&ctx->defer_list))
- ctx->drain_active = false;
- }
+ ctx->nr_drained += io_linked_nr(req);
+ list_add_tail(&de->list, &ctx->defer_list);
+ io_queue_deferred(ctx);
+ if (!drain && list_empty(&ctx->defer_list))
+ ctx->drain_active = false;
}
static bool io_assign_file(struct io_kiocb *req, const struct io_issue_def *def,
@@ -2260,10 +2251,6 @@ static bool io_get_sqe(struct io_ring_ctx *ctx, const struct io_uring_sqe **sqe)
(!(ctx->flags & IORING_SETUP_NO_SQARRAY))) {
head = READ_ONCE(ctx->sq_array[head]);
if (unlikely(head >= ctx->sq_entries)) {
- /* drop invalid entries */
- spin_lock(&ctx->completion_lock);
- ctx->cq_extra--;
- spin_unlock(&ctx->completion_lock);
WRITE_ONCE(ctx->rings->sq_dropped,
READ_ONCE(ctx->rings->sq_dropped) + 1);
return false;
@@ -2681,13 +2668,11 @@ unsigned long rings_size(unsigned int flags, unsigned int sq_entries,
return off;
}
-static void io_req_caches_free(struct io_ring_ctx *ctx)
+static __cold void __io_req_caches_free(struct io_ring_ctx *ctx)
{
struct io_kiocb *req;
int nr = 0;
- mutex_lock(&ctx->uring_lock);
-
while (!io_req_cache_empty(ctx)) {
req = io_extract_req(ctx);
kmem_cache_free(req_cachep, req);
@@ -2697,7 +2682,12 @@ static void io_req_caches_free(struct io_ring_ctx *ctx)
ctx->nr_req_allocated -= nr;
percpu_ref_put_many(&ctx->refs, nr);
}
- mutex_unlock(&ctx->uring_lock);
+}
+
+static __cold void io_req_caches_free(struct io_ring_ctx *ctx)
+{
+ guard(mutex)(&ctx->uring_lock);
+ __io_req_caches_free(ctx);
}
static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
@@ -3002,20 +2992,19 @@ static __cold bool io_cancel_defer_files(struct io_ring_ctx *ctx,
struct io_defer_entry *de;
LIST_HEAD(list);
- spin_lock(&ctx->completion_lock);
list_for_each_entry_reverse(de, &ctx->defer_list, list) {
if (io_match_task_safe(de->req, tctx, cancel_all)) {
list_cut_position(&list, &ctx->defer_list, &de->list);
break;
}
}
- spin_unlock(&ctx->completion_lock);
if (list_empty(&list))
return false;
while (!list_empty(&list)) {
de = list_first_entry(&list, struct io_defer_entry, list);
list_del_init(&de->list);
+ ctx->nr_drained -= io_linked_nr(de->req);
io_req_task_queue_fail(de->req, -ECANCELED);
kfree(de);
}
@@ -3090,8 +3079,8 @@ static __cold bool io_uring_try_cancel_requests(struct io_ring_ctx *ctx,
if ((ctx->flags & IORING_SETUP_DEFER_TASKRUN) &&
io_allowed_defer_tw_run(ctx))
ret |= io_run_local_work(ctx, INT_MAX, INT_MAX) > 0;
- ret |= io_cancel_defer_files(ctx, tctx, cancel_all);
mutex_lock(&ctx->uring_lock);
+ ret |= io_cancel_defer_files(ctx, tctx, cancel_all);
ret |= io_poll_remove_all(ctx, tctx, cancel_all);
ret |= io_waitid_remove_all(ctx, tctx, cancel_all);
ret |= io_futex_remove_all(ctx, tctx, cancel_all);
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index e4050b2d0821..81f22196a57d 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -196,7 +196,6 @@ static inline bool io_defer_get_uncommited_cqe(struct io_ring_ctx *ctx,
{
io_lockdep_assert_cq_locked(ctx);
- ctx->cq_extra++;
ctx->submit_state.cq_flush = true;
return io_get_cqe(ctx, cqe_ret);
}
@@ -414,7 +413,7 @@ static inline void io_req_complete_defer(struct io_kiocb *req)
static inline void io_commit_cqring_flush(struct io_ring_ctx *ctx)
{
- if (unlikely(ctx->off_timeout_used || ctx->drain_active ||
+ if (unlikely(ctx->off_timeout_used ||
ctx->has_evfd || ctx->poll_activated))
__io_commit_cqring_flush(ctx);
}
--
2.49.0
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH v2 8/8] io_uring: drain based on allocates reqs
2025-05-09 11:12 ` [PATCH v2 8/8] io_uring: drain based on allocates reqs Pavel Begunkov
@ 2025-05-13 10:37 ` Andy Shevchenko
2025-05-13 13:35 ` Pavel Begunkov
0 siblings, 1 reply; 12+ messages in thread
From: Andy Shevchenko @ 2025-05-13 10:37 UTC (permalink / raw)
To: Pavel Begunkov; +Cc: io-uring
On Fri, May 09, 2025 at 12:12:54PM +0100, Pavel Begunkov wrote:
> Don't rely on CQ sequence numbers for draining, as it has become messy
> and needs cq_extra adjustments. Instead, base it on the number of
> allocated requests and only allow flushing when all requests are in the
> drain list.
>
> As a result, cq_extra is gone: there is no accounting overhead for it in aux
> CQE posting, less bloat since it was inlined before, and it's generally
> simpler than trying to track where it should be bumped and where it should
> be put back, as in the overflow case. It'll also likely help with cleaning
> up and unifying some of the CQ posting helpers.
This patch breaks the `make W=1` build. Please, always test your changes with
`make W=1`. See below the details.
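For instance, rebuilding just this object is enough to see it:
    make W=1 io_uring/io_uring.o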
...
> static __cold void io_drain_req(struct io_kiocb *req)
> __must_hold(&ctx->uring_lock)
> {
> struct io_ring_ctx *ctx = req->ctx;
> bool drain = req->flags & IOSQE_IO_DRAIN;
> struct io_defer_entry *de;
> + struct io_kiocb *tmp;
> + int nr = 0;
Defined and assigned...
>
> de = kmalloc(sizeof(*de), GFP_KERNEL_ACCOUNT);
> if (!de) {
> @@ -1667,17 +1658,17 @@ static __cold void io_drain_req(struct io_kiocb *req)
> return;
> }
>
> + io_for_each_link(tmp, req)
> + nr++;
...just incremented...
And that seems to be it. Does the above have any side effects? Or is it just
dead code (a.k.a. leftovers from the rebase/upcoming work)?
In any case, please make use of nr or drop it completely.
io_uring/io_uring.c:1649:6: error: variable 'nr' set but not used [-Werror,-Wunused-but-set-variable]
1649 | int nr = 0;
| ^
1 error generated.
> io_prep_async_link(req);
> trace_io_uring_defer(req);
> de->req = req;
> - de->seq = io_get_sequence(req);
>
> - scoped_guard(spinlock, &ctx->completion_lock) {
> - list_add_tail(&de->list, &ctx->defer_list);
> - __io_queue_deferred(ctx);
> - if (!drain && list_empty(&ctx->defer_list))
> - ctx->drain_active = false;
> - }
> + ctx->nr_drained += io_linked_nr(req);
> + list_add_tail(&de->list, &ctx->defer_list);
> + io_queue_deferred(ctx);
> + if (!drain && list_empty(&ctx->defer_list))
> + ctx->drain_active = false;
> }
--
With Best Regards,
Andy Shevchenko
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2 8/8] io_uring: drain based on allocates reqs
2025-05-13 10:37 ` Andy Shevchenko
@ 2025-05-13 13:35 ` Pavel Begunkov
0 siblings, 0 replies; 12+ messages in thread
From: Pavel Begunkov @ 2025-05-13 13:35 UTC (permalink / raw)
To: Andy Shevchenko; +Cc: io-uring
On 5/13/25 11:37, Andy Shevchenko wrote:
> On Fri, May 09, 2025 at 12:12:54PM +0100, Pavel Begunkov wrote:
>> Don't rely on CQ sequence numbers for draining, as it has become messy
>> and needs cq_extra adjustments. Instead, base it on the number of
>> allocated requests and only allow flushing when all requests are in the
>> drain list.
>>
>> As a result, cq_extra is gone: there is no accounting overhead for it in aux
>> CQE posting, less bloat since it was inlined before, and it's generally
>> simpler than trying to track where it should be bumped and where it should
>> be put back, as in the overflow case. It'll also likely help with cleaning
>> up and unifying some of the CQ posting helpers.
>
> This patch breaks the `make W=1` build. Please, always test your changes with
That's good advice, even if unsolicited, but it doesn't help with the
issue. Not every compiler complains. I'd also say things like that
should warn by default, and they usually do, at least for basic
"unused variable" cases.
> `make W=1`. See below the details.
Yes, it's dead code and has already been fixed a day ago.
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2 0/8] allocated requests based drain and fixes
2025-05-09 11:12 [PATCH v2 0/8] allocated requests based drain and fixes Pavel Begunkov
` (7 preceding siblings ...)
2025-05-09 11:12 ` [PATCH v2 8/8] io_uring: drain based on allocates reqs Pavel Begunkov
@ 2025-05-09 14:02 ` Jens Axboe
8 siblings, 0 replies; 12+ messages in thread
From: Jens Axboe @ 2025-05-09 14:02 UTC (permalink / raw)
To: io-uring, Pavel Begunkov
On Fri, 09 May 2025 12:12:46 +0100, Pavel Begunkov wrote:
> Patches 1-7 are fixes and preparations. Patch 8 implements
> draining based on the number of allocated requests, which
> is simpler and removes the cq_extra accounting overhead.
>
> v2: fix still-drained reqs being executed on spurious calls to
> io_queue_deferred() (Patch 2)
>
> [...]
Applied, thanks!
[1/8] io_uring: account drain memory to cgroup
commit: f979c20547e72568e3c793bc92c7522bc3166246
[2/8] io_uring: fix spurious drain flushing
commit: fde04c7e2775feb0746301e0ef86a04d3598c3fe
[3/8] io_uring: simplify drain ret passing
commit: 05b334110fdc85f536d7dd573120d573801fb2d1
[4/8] io_uring: remove drain prealloc checks
commit: e91e4f692f7993d5d192228c5f8a9a2e12ff5250
[5/8] io_uring: consolidate drain seq checking
commit: 19a94da447f832ee614f8f5532d31c1c70061520
[6/8] io_uring: open code io_account_cq_overflow()
commit: b0c8a6401fbca91da4fe0dc10d61a770f1581e45
[7/8] io_uring: count allocated requests
commit: 63de899cb6220357dea9d0f4e5aa459ff5193bb0
[8/8] io_uring: drain based on allocates reqs
commit: 55c28f431651973385c17081bdcdadf8b5c6da91
Best regards,
--
Jens Axboe
^ permalink raw reply [flat|nested] 12+ messages in thread