public inbox for io-uring@vger.kernel.org
* [PATCH v2 0/8] allocated requests based drain and fixes
@ 2025-05-09 11:12 Pavel Begunkov
  2025-05-09 11:12 ` [PATCH v2 1/8] io_uring: account drain memory to cgroup Pavel Begunkov
                   ` (8 more replies)
  0 siblings, 9 replies; 12+ messages in thread
From: Pavel Begunkov @ 2025-05-09 11:12 UTC (permalink / raw)
  To: io-uring; +Cc: asml.silence

Patches 1-7 are fixes and preparations. Patch 8 implements
draining based on the number of allocated requests, which
is simpler and removes the cq_extra accounting overhead.

v2: fix executing still-drained reqs on spurious calls to
    io_queue_deferred() (Patch 2)

    Only clear ->drain_active after flushing, and only if a
    non-drain request got flushed.

Pavel Begunkov (8):
  io_uring: account drain memory to cgroup
  io_uring: fix spurious drain flushing
  io_uring: simplify drain ret passing
  io_uring: remove drain prealloc checks
  io_uring: consolidate drain seq checking
  io_uring: open code io_account_cq_overflow()
  io_uring: count allocated requests
  io_uring: drain based on allocated reqs

 include/linux/io_uring_types.h |   3 +-
 io_uring/io_uring.c            | 137 ++++++++++++++-------------------
 io_uring/io_uring.h            |   3 +-
 3 files changed, 62 insertions(+), 81 deletions(-)

-- 
2.49.0


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v2 1/8] io_uring: account drain memory to cgroup
  2025-05-09 11:12 [PATCH v2 0/8] allocated requests based drain and fixes Pavel Begunkov
@ 2025-05-09 11:12 ` Pavel Begunkov
  2025-05-09 11:12 ` [PATCH v2 2/8] io_uring: fix spurious drain flushing Pavel Begunkov
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 12+ messages in thread
From: Pavel Begunkov @ 2025-05-09 11:12 UTC (permalink / raw)
  To: io-uring; +Cc: asml.silence

Account drain allocations against memcg. It's not a big problem, as each
such allocation is paired with a request that is already accounted, but
it's nicer to follow the limits more closely.
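
For reference, GFP_KERNEL_ACCOUNT is simply GFP_KERNEL with __GFP_ACCOUNT
added, so the allocation gets charged to the current task's memory cgroup.
Paraphrasing the definition from the gfp headers:

	#define GFP_KERNEL_ACCOUNT (GFP_KERNEL | __GFP_ACCOUNT)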

Cc: stable@vger.kernel.org # 6.1
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 io_uring/io_uring.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 0d051476008c..23e283e65eeb 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1672,7 +1672,7 @@ static __cold void io_drain_req(struct io_kiocb *req)
 	spin_unlock(&ctx->completion_lock);
 
 	io_prep_async_link(req);
-	de = kmalloc(sizeof(*de), GFP_KERNEL);
+	de = kmalloc(sizeof(*de), GFP_KERNEL_ACCOUNT);
 	if (!de) {
 		ret = -ENOMEM;
 		io_req_defer_failed(req, ret);
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v2 2/8] io_uring: fix spurious drain flushing
  2025-05-09 11:12 [PATCH v2 0/8] allocated requests based drain and fixes Pavel Begunkov
  2025-05-09 11:12 ` [PATCH v2 1/8] io_uring: account drain memory to cgroup Pavel Begunkov
@ 2025-05-09 11:12 ` Pavel Begunkov
  2025-05-09 11:12 ` [PATCH v2 3/8] io_uring: simplify drain ret passing Pavel Begunkov
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 12+ messages in thread
From: Pavel Begunkov @ 2025-05-09 11:12 UTC (permalink / raw)
  To: io-uring; +Cc: asml.silence

io_queue_deferred() is not tolerant of spurious calls made while some
requests have not yet completed. You can have an inflight drain-marked
request and another request that came after and got queued into the
drain list. Now, if io_queue_deferred() is called before the first
request completes, it'll check the 2nd req with req_need_defer(), find
that there is no drain flag set, and queue it for execution.

To make io_queue_deferred() work, it should at least check sequences for
the first request, and we also need to check whether there is another
drain request creating another bubble.
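
A rough timeline of the problem (a simplified pseudo-flow, not the exact
kernel code):

	/* drain-marked req A is issued and still inflight */
	/* req B, submitted after A, sits in ctx->defer_list waiting for A */
	io_queue_deferred(ctx);		/* spurious call, A hasn't completed */
	    req_need_defer(B, B->seq);	/* B has no REQ_F_IO_DRAIN -> false */
	    io_req_task_queue(B);	/* B runs before A, breaking drain ordering */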

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 io_uring/io_uring.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 23e283e65eeb..add46ab19017 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -559,18 +559,30 @@ void io_req_queue_iowq(struct io_kiocb *req)
 	io_req_task_work_add(req);
 }
 
+static bool io_drain_defer_seq(struct io_kiocb *req, u32 seq)
+{
+	struct io_ring_ctx *ctx = req->ctx;
+
+	return seq + READ_ONCE(ctx->cq_extra) != ctx->cached_cq_tail;
+}
+
 static __cold noinline void io_queue_deferred(struct io_ring_ctx *ctx)
 {
+	bool drain_seen = false, first = true;
+
 	spin_lock(&ctx->completion_lock);
 	while (!list_empty(&ctx->defer_list)) {
 		struct io_defer_entry *de = list_first_entry(&ctx->defer_list,
 						struct io_defer_entry, list);
 
-		if (req_need_defer(de->req, de->seq))
+		drain_seen |= de->req->flags & REQ_F_IO_DRAIN;
+		if ((drain_seen || first) && io_drain_defer_seq(de->req, de->seq))
 			break;
+
 		list_del_init(&de->list);
 		io_req_task_queue(de->req);
 		kfree(de);
+		first = false;
 	}
 	spin_unlock(&ctx->completion_lock);
 }
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v2 3/8] io_uring: simplify drain ret passing
  2025-05-09 11:12 [PATCH v2 0/8] allocated requests based drain and fixes Pavel Begunkov
  2025-05-09 11:12 ` [PATCH v2 1/8] io_uring: account drain memory to cgroup Pavel Begunkov
  2025-05-09 11:12 ` [PATCH v2 2/8] io_uring: fix spurious drain flushing Pavel Begunkov
@ 2025-05-09 11:12 ` Pavel Begunkov
  2025-05-09 11:12 ` [PATCH v2 4/8] io_uring: remove drain prealloc checks Pavel Begunkov
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 12+ messages in thread
From: Pavel Begunkov @ 2025-05-09 11:12 UTC (permalink / raw)
  To: io-uring; +Cc: asml.silence

"ret" in io_drain_req() is only used in one place, remove it and pass
-ENOMEM directly.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 io_uring/io_uring.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index add46ab19017..c32fae28ea52 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1669,7 +1669,6 @@ static __cold void io_drain_req(struct io_kiocb *req)
 {
 	struct io_ring_ctx *ctx = req->ctx;
 	struct io_defer_entry *de;
-	int ret;
 	u32 seq = io_get_sequence(req);
 
 	/* Still need defer if there is pending req in defer list. */
@@ -1686,8 +1685,7 @@ static __cold void io_drain_req(struct io_kiocb *req)
 	io_prep_async_link(req);
 	de = kmalloc(sizeof(*de), GFP_KERNEL_ACCOUNT);
 	if (!de) {
-		ret = -ENOMEM;
-		io_req_defer_failed(req, ret);
+		io_req_defer_failed(req, -ENOMEM);
 		return;
 	}
 
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v2 4/8] io_uring: remove drain prealloc checks
  2025-05-09 11:12 [PATCH v2 0/8] allocated requests based drain and fixes Pavel Begunkov
                   ` (2 preceding siblings ...)
  2025-05-09 11:12 ` [PATCH v2 3/8] io_uring: simplify drain ret passing Pavel Begunkov
@ 2025-05-09 11:12 ` Pavel Begunkov
  2025-05-09 11:12 ` [PATCH v2 5/8] io_uring: consolidate drain seq checking Pavel Begunkov
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 12+ messages in thread
From: Pavel Begunkov @ 2025-05-09 11:12 UTC (permalink / raw)
  To: io-uring; +Cc: asml.silence

Currently io_drain_req() has two steps: a fast path that checks sequence
numbers, and a slower path that allocates, rechecks and actually queues
the request. Simplify it further by removing the first step.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 io_uring/io_uring.c | 15 +++------------
 1 file changed, 3 insertions(+), 12 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index c32fae28ea52..e3f6914304a2 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1671,17 +1671,6 @@ static __cold void io_drain_req(struct io_kiocb *req)
 	struct io_defer_entry *de;
 	u32 seq = io_get_sequence(req);
 
-	/* Still need defer if there is pending req in defer list. */
-	spin_lock(&ctx->completion_lock);
-	if (!req_need_defer(req, seq) && list_empty_careful(&ctx->defer_list)) {
-		spin_unlock(&ctx->completion_lock);
-queue:
-		ctx->drain_active = false;
-		io_req_task_queue(req);
-		return;
-	}
-	spin_unlock(&ctx->completion_lock);
-
 	io_prep_async_link(req);
 	de = kmalloc(sizeof(*de), GFP_KERNEL_ACCOUNT);
 	if (!de) {
@@ -1693,7 +1682,9 @@ static __cold void io_drain_req(struct io_kiocb *req)
 	if (!req_need_defer(req, seq) && list_empty(&ctx->defer_list)) {
 		spin_unlock(&ctx->completion_lock);
 		kfree(de);
-		goto queue;
+		ctx->drain_active = false;
+		io_req_task_queue(req);
+		return;
 	}
 
 	trace_io_uring_defer(req);
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v2 5/8] io_uring: consolidate drain seq checking
  2025-05-09 11:12 [PATCH v2 0/8] allocated requests based drain and fixes Pavel Begunkov
                   ` (3 preceding siblings ...)
  2025-05-09 11:12 ` [PATCH v2 4/8] io_uring: remove drain prealloc checks Pavel Begunkov
@ 2025-05-09 11:12 ` Pavel Begunkov
  2025-05-09 11:12 ` [PATCH v2 6/8] io_uring: open code io_account_cq_overflow() Pavel Begunkov
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 12+ messages in thread
From: Pavel Begunkov @ 2025-05-09 11:12 UTC (permalink / raw)
  To: io-uring; +Cc: asml.silence

We check sequences both when queuing drained requests and when flushing
them. Instead, always queue the request and immediately try to flush, so
that all sequence handling is kept contained in the flushing code.
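
The conversion below also starts using the guard()/scoped_guard() helpers
from <linux/cleanup.h>, which take the lock at the declaration and drop it
automatically when the enclosing scope ends. A minimal sketch of the
pattern (illustrative only, not part of the patch):

	static void example(struct io_ring_ctx *ctx)
	{
		guard(spinlock)(&ctx->completion_lock);
		/* completion_lock is held for the rest of this scope */
	}	/* and released automatically here */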

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 io_uring/io_uring.c | 45 +++++++++++++++++----------------------------
 1 file changed, 17 insertions(+), 28 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index e3f6914304a2..0afae33e05e6 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -389,17 +389,6 @@ static void io_account_cq_overflow(struct io_ring_ctx *ctx)
 	ctx->cq_extra--;
 }
 
-static bool req_need_defer(struct io_kiocb *req, u32 seq)
-{
-	if (unlikely(req->flags & REQ_F_IO_DRAIN)) {
-		struct io_ring_ctx *ctx = req->ctx;
-
-		return seq + READ_ONCE(ctx->cq_extra) != ctx->cached_cq_tail;
-	}
-
-	return false;
-}
-
 static void io_clean_op(struct io_kiocb *req)
 {
 	if (unlikely(req->flags & REQ_F_BUFFER_SELECTED))
@@ -566,11 +555,10 @@ static bool io_drain_defer_seq(struct io_kiocb *req, u32 seq)
 	return seq + READ_ONCE(ctx->cq_extra) != ctx->cached_cq_tail;
 }
 
-static __cold noinline void io_queue_deferred(struct io_ring_ctx *ctx)
+static __cold noinline void __io_queue_deferred(struct io_ring_ctx *ctx)
 {
 	bool drain_seen = false, first = true;
 
-	spin_lock(&ctx->completion_lock);
 	while (!list_empty(&ctx->defer_list)) {
 		struct io_defer_entry *de = list_first_entry(&ctx->defer_list,
 						struct io_defer_entry, list);
@@ -584,7 +572,12 @@ static __cold noinline void io_queue_deferred(struct io_ring_ctx *ctx)
 		kfree(de);
 		first = false;
 	}
-	spin_unlock(&ctx->completion_lock);
+}
+
+static __cold noinline void io_queue_deferred(struct io_ring_ctx *ctx)
+{
+	guard(spinlock)(&ctx->completion_lock);
+	__io_queue_deferred(ctx);
 }
 
 void __io_commit_cqring_flush(struct io_ring_ctx *ctx)
@@ -1668,30 +1661,26 @@ static __cold void io_drain_req(struct io_kiocb *req)
 	__must_hold(&ctx->uring_lock)
 {
 	struct io_ring_ctx *ctx = req->ctx;
+	bool drain = req->flags & IOSQE_IO_DRAIN;
 	struct io_defer_entry *de;
-	u32 seq = io_get_sequence(req);
 
-	io_prep_async_link(req);
 	de = kmalloc(sizeof(*de), GFP_KERNEL_ACCOUNT);
 	if (!de) {
 		io_req_defer_failed(req, -ENOMEM);
 		return;
 	}
 
-	spin_lock(&ctx->completion_lock);
-	if (!req_need_defer(req, seq) && list_empty(&ctx->defer_list)) {
-		spin_unlock(&ctx->completion_lock);
-		kfree(de);
-		ctx->drain_active = false;
-		io_req_task_queue(req);
-		return;
-	}
-
+	io_prep_async_link(req);
 	trace_io_uring_defer(req);
 	de->req = req;
-	de->seq = seq;
-	list_add_tail(&de->list, &ctx->defer_list);
-	spin_unlock(&ctx->completion_lock);
+	de->seq = io_get_sequence(req);
+
+	scoped_guard(spinlock, &ctx->completion_lock) {
+		list_add_tail(&de->list, &ctx->defer_list);
+		__io_queue_deferred(ctx);
+		if (!drain && list_empty(&ctx->defer_list))
+			ctx->drain_active = false;
+	}
 }
 
 static bool io_assign_file(struct io_kiocb *req, const struct io_issue_def *def,
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v2 6/8] io_uring: open code io_account_cq_overflow()
  2025-05-09 11:12 [PATCH v2 0/8] allocated requests based drain and fixes Pavel Begunkov
                   ` (4 preceding siblings ...)
  2025-05-09 11:12 ` [PATCH v2 5/8] io_uring: consolidate drain seq checking Pavel Begunkov
@ 2025-05-09 11:12 ` Pavel Begunkov
  2025-05-09 11:12 ` [PATCH v2 7/8] io_uring: count allocated requests Pavel Begunkov
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 12+ messages in thread
From: Pavel Begunkov @ 2025-05-09 11:12 UTC (permalink / raw)
  To: io-uring; +Cc: asml.silence

io_account_cq_overflow() doesn't help explain what's going on in there,
and it'll become even smaller with the following patches, so open code
it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 io_uring/io_uring.c | 13 ++++---------
 1 file changed, 4 insertions(+), 9 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 0afae33e05e6..9619c46bd25e 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -381,14 +381,6 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 	return NULL;
 }
 
-static void io_account_cq_overflow(struct io_ring_ctx *ctx)
-{
-	struct io_rings *r = ctx->rings;
-
-	WRITE_ONCE(r->cq_overflow, READ_ONCE(r->cq_overflow) + 1);
-	ctx->cq_extra--;
-}
-
 static void io_clean_op(struct io_kiocb *req)
 {
 	if (unlikely(req->flags & REQ_F_BUFFER_SELECTED))
@@ -742,12 +734,15 @@ static bool io_cqring_event_overflow(struct io_ring_ctx *ctx, u64 user_data,
 	ocqe = kmalloc(ocq_size, GFP_ATOMIC | __GFP_ACCOUNT);
 	trace_io_uring_cqe_overflow(ctx, user_data, res, cflags, ocqe);
 	if (!ocqe) {
+		struct io_rings *r = ctx->rings;
+
 		/*
 		 * If we're in ring overflow flush mode, or in task cancel mode,
 		 * or cannot allocate an overflow entry, then we need to drop it
 		 * on the floor.
 		 */
-		io_account_cq_overflow(ctx);
+		WRITE_ONCE(r->cq_overflow, READ_ONCE(r->cq_overflow) + 1);
+		ctx->cq_extra--;
 		set_bit(IO_CHECK_CQ_DROPPED_BIT, &ctx->check_cq);
 		return false;
 	}
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v2 7/8] io_uring: count allocated requests
  2025-05-09 11:12 [PATCH v2 0/8] allocated requests based drain and fixes Pavel Begunkov
                   ` (5 preceding siblings ...)
  2025-05-09 11:12 ` [PATCH v2 6/8] io_uring: open code io_account_cq_overflow() Pavel Begunkov
@ 2025-05-09 11:12 ` Pavel Begunkov
  2025-05-09 11:12 ` [PATCH v2 8/8] io_uring: drain based on allocated reqs Pavel Begunkov
  2025-05-09 14:02 ` [PATCH v2 0/8] allocated requests based drain and fixes Jens Axboe
  8 siblings, 0 replies; 12+ messages in thread
From: Pavel Begunkov @ 2025-05-09 11:12 UTC (permalink / raw)
  To: io-uring; +Cc: asml.silence

Keep track of the number of requests a ring currently has allocated (and
not freed); it'll be needed in the next patch.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 include/linux/io_uring_types.h | 1 +
 io_uring/io_uring.c            | 9 ++++++++-
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 7e23e993280e..73b289b48280 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -435,6 +435,7 @@ struct io_ring_ctx {
 
 	/* protected by ->completion_lock */
 	unsigned			evfd_last_cq_tail;
+	unsigned			nr_req_allocated;
 
 	/*
 	 * Protection for resize vs mmap races - both the mmap and resize
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 9619c46bd25e..14188f49a4ce 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -954,6 +954,8 @@ __cold bool __io_alloc_req_refill(struct io_ring_ctx *ctx)
 	}
 
 	percpu_ref_get_many(&ctx->refs, ret);
+	ctx->nr_req_allocated += ret;
+
 	while (ret--) {
 		struct io_kiocb *req = reqs[ret];
 
@@ -2691,8 +2693,10 @@ static void io_req_caches_free(struct io_ring_ctx *ctx)
 		kmem_cache_free(req_cachep, req);
 		nr++;
 	}
-	if (nr)
+	if (nr) {
+		ctx->nr_req_allocated -= nr;
 		percpu_ref_put_many(&ctx->refs, nr);
+	}
 	mutex_unlock(&ctx->uring_lock);
 }
 
@@ -2729,6 +2733,9 @@ static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
 	percpu_ref_exit(&ctx->refs);
 	free_uid(ctx->user);
 	io_req_caches_free(ctx);
+
+	WARN_ON_ONCE(ctx->nr_req_allocated);
+
 	if (ctx->hash_map)
 		io_wq_put_hash(ctx->hash_map);
 	io_napi_free(ctx);
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v2 8/8] io_uring: drain based on allocated reqs
  2025-05-09 11:12 [PATCH v2 0/8] allocated requests based drain and fixes Pavel Begunkov
                   ` (6 preceding siblings ...)
  2025-05-09 11:12 ` [PATCH v2 7/8] io_uring: count allocated requests Pavel Begunkov
@ 2025-05-09 11:12 ` Pavel Begunkov
  2025-05-13 10:37   ` Andy Shevchenko
  2025-05-09 14:02 ` [PATCH v2 0/8] allocated requests based drain and fixes Jens Axboe
  8 siblings, 1 reply; 12+ messages in thread
From: Pavel Begunkov @ 2025-05-09 11:12 UTC (permalink / raw)
  To: io-uring; +Cc: asml.silence

Don't rely on CQ sequence numbers for draining, as that has become messy
and needs cq_extra adjustments. Instead, base draining on the number of
allocated requests and only allow flushing when all requests are in the
drain list.

As a result, cq_extra is gone, there is no overhead for its accounting in
aux cqe posting, and there is less bloat since it was inlined before. It's
also in general simpler than trying to track where cq_extra should be
bumped and where it should be put back, as in cases of overflow. It'll
likely also help with cleaning up and unifying some of the CQ posting
helpers.
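
The gist of the new flush condition, in simplified form (a sketch of the
logic added below, not a verbatim excerpt):

	/* only flush the drain list once every allocated req is parked in it */
	if (ctx->nr_req_allocated != ctx->nr_drained)
		return;
	/* nothing else is inflight, it's safe to run the deferred requests */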

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 include/linux/io_uring_types.h |  2 +-
 io_uring/io_uring.c            | 83 +++++++++++++++-------------------
 io_uring/io_uring.h            |  3 +-
 3 files changed, 38 insertions(+), 50 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 73b289b48280..00dbd7cd0e7d 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -341,7 +341,6 @@ struct io_ring_ctx {
 		unsigned		cached_cq_tail;
 		unsigned		cq_entries;
 		struct io_ev_fd	__rcu	*io_ev_fd;
-		unsigned		cq_extra;
 
 		void			*cq_wait_arg;
 		size_t			cq_wait_size;
@@ -417,6 +416,7 @@ struct io_ring_ctx {
 
 	struct callback_head		poll_wq_task_work;
 	struct list_head		defer_list;
+	unsigned			nr_drained;
 
 	struct io_alloc_cache		msg_cache;
 	spinlock_t			msg_lock;
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 14188f49a4ce..0fda1b1a33ae 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -129,7 +129,6 @@
 struct io_defer_entry {
 	struct list_head	list;
 	struct io_kiocb		*req;
-	u32			seq;
 };
 
 /* requests with any of those set should undergo io_disarm_next() */
@@ -149,6 +148,7 @@ static bool io_uring_try_cancel_requests(struct io_ring_ctx *ctx,
 					 bool is_sqpoll_thread);
 
 static void io_queue_sqe(struct io_kiocb *req);
+static void __io_req_caches_free(struct io_ring_ctx *ctx);
 
 static __read_mostly DEFINE_STATIC_KEY_FALSE(io_key_has_sqarray);
 
@@ -540,46 +540,45 @@ void io_req_queue_iowq(struct io_kiocb *req)
 	io_req_task_work_add(req);
 }
 
-static bool io_drain_defer_seq(struct io_kiocb *req, u32 seq)
+static unsigned io_linked_nr(struct io_kiocb *req)
 {
-	struct io_ring_ctx *ctx = req->ctx;
+	struct io_kiocb *tmp;
+	unsigned nr = 0;
 
-	return seq + READ_ONCE(ctx->cq_extra) != ctx->cached_cq_tail;
+	io_for_each_link(tmp, req)
+		nr++;
+	return nr;
 }
 
-static __cold noinline void __io_queue_deferred(struct io_ring_ctx *ctx)
+static __cold noinline void io_queue_deferred(struct io_ring_ctx *ctx)
 {
 	bool drain_seen = false, first = true;
 
+	lockdep_assert_held(&ctx->uring_lock);
+	__io_req_caches_free(ctx);
+
 	while (!list_empty(&ctx->defer_list)) {
 		struct io_defer_entry *de = list_first_entry(&ctx->defer_list,
 						struct io_defer_entry, list);
 
 		drain_seen |= de->req->flags & REQ_F_IO_DRAIN;
-		if ((drain_seen || first) && io_drain_defer_seq(de->req, de->seq))
-			break;
+		if ((drain_seen || first) && ctx->nr_req_allocated != ctx->nr_drained)
+			return;
 
 		list_del_init(&de->list);
+		ctx->nr_drained -= io_linked_nr(de->req);
 		io_req_task_queue(de->req);
 		kfree(de);
 		first = false;
 	}
 }
 
-static __cold noinline void io_queue_deferred(struct io_ring_ctx *ctx)
-{
-	guard(spinlock)(&ctx->completion_lock);
-	__io_queue_deferred(ctx);
-}
-
 void __io_commit_cqring_flush(struct io_ring_ctx *ctx)
 {
 	if (ctx->poll_activated)
 		io_poll_wq_wake(ctx);
 	if (ctx->off_timeout_used)
 		io_flush_timeouts(ctx);
-	if (ctx->drain_active)
-		io_queue_deferred(ctx);
 	if (ctx->has_evfd)
 		io_eventfd_signal(ctx, true);
 }
@@ -742,7 +741,6 @@ static bool io_cqring_event_overflow(struct io_ring_ctx *ctx, u64 user_data,
 		 * on the floor.
 		 */
 		WRITE_ONCE(r->cq_overflow, READ_ONCE(r->cq_overflow) + 1);
-		ctx->cq_extra--;
 		set_bit(IO_CHECK_CQ_DROPPED_BIT, &ctx->check_cq);
 		return false;
 	}
@@ -812,8 +810,6 @@ static bool io_fill_cqe_aux(struct io_ring_ctx *ctx, u64 user_data, s32 res,
 {
 	struct io_uring_cqe *cqe;
 
-	ctx->cq_extra++;
-
 	if (likely(io_get_cqe(ctx, &cqe))) {
 		WRITE_ONCE(cqe->user_data, user_data);
 		WRITE_ONCE(cqe->res, res);
@@ -1456,6 +1452,10 @@ void __io_submit_flush_completions(struct io_ring_ctx *ctx)
 		io_free_batch_list(ctx, state->compl_reqs.first);
 		INIT_WQ_LIST(&state->compl_reqs);
 	}
+
+	if (unlikely(ctx->drain_active))
+		io_queue_deferred(ctx);
+
 	ctx->submit_state.cq_flush = false;
 }
 
@@ -1643,23 +1643,14 @@ io_req_flags_t io_file_get_flags(struct file *file)
 	return res;
 }
 
-static u32 io_get_sequence(struct io_kiocb *req)
-{
-	u32 seq = req->ctx->cached_sq_head;
-	struct io_kiocb *cur;
-
-	/* need original cached_sq_head, but it was increased for each req */
-	io_for_each_link(cur, req)
-		seq--;
-	return seq;
-}
-
 static __cold void io_drain_req(struct io_kiocb *req)
 	__must_hold(&ctx->uring_lock)
 {
 	struct io_ring_ctx *ctx = req->ctx;
 	bool drain = req->flags & IOSQE_IO_DRAIN;
 	struct io_defer_entry *de;
+	struct io_kiocb *tmp;
+	int nr = 0;
 
 	de = kmalloc(sizeof(*de), GFP_KERNEL_ACCOUNT);
 	if (!de) {
@@ -1667,17 +1658,17 @@ static __cold void io_drain_req(struct io_kiocb *req)
 		return;
 	}
 
+	io_for_each_link(tmp, req)
+		nr++;
 	io_prep_async_link(req);
 	trace_io_uring_defer(req);
 	de->req = req;
-	de->seq = io_get_sequence(req);
 
-	scoped_guard(spinlock, &ctx->completion_lock) {
-		list_add_tail(&de->list, &ctx->defer_list);
-		__io_queue_deferred(ctx);
-		if (!drain && list_empty(&ctx->defer_list))
-			ctx->drain_active = false;
-	}
+	ctx->nr_drained += io_linked_nr(req);
+	list_add_tail(&de->list, &ctx->defer_list);
+	io_queue_deferred(ctx);
+	if (!drain && list_empty(&ctx->defer_list))
+		ctx->drain_active = false;
 }
 
 static bool io_assign_file(struct io_kiocb *req, const struct io_issue_def *def,
@@ -2260,10 +2251,6 @@ static bool io_get_sqe(struct io_ring_ctx *ctx, const struct io_uring_sqe **sqe)
 	    (!(ctx->flags & IORING_SETUP_NO_SQARRAY))) {
 		head = READ_ONCE(ctx->sq_array[head]);
 		if (unlikely(head >= ctx->sq_entries)) {
-			/* drop invalid entries */
-			spin_lock(&ctx->completion_lock);
-			ctx->cq_extra--;
-			spin_unlock(&ctx->completion_lock);
 			WRITE_ONCE(ctx->rings->sq_dropped,
 				   READ_ONCE(ctx->rings->sq_dropped) + 1);
 			return false;
@@ -2681,13 +2668,11 @@ unsigned long rings_size(unsigned int flags, unsigned int sq_entries,
 	return off;
 }
 
-static void io_req_caches_free(struct io_ring_ctx *ctx)
+static __cold void __io_req_caches_free(struct io_ring_ctx *ctx)
 {
 	struct io_kiocb *req;
 	int nr = 0;
 
-	mutex_lock(&ctx->uring_lock);
-
 	while (!io_req_cache_empty(ctx)) {
 		req = io_extract_req(ctx);
 		kmem_cache_free(req_cachep, req);
@@ -2697,7 +2682,12 @@ static void io_req_caches_free(struct io_ring_ctx *ctx)
 		ctx->nr_req_allocated -= nr;
 		percpu_ref_put_many(&ctx->refs, nr);
 	}
-	mutex_unlock(&ctx->uring_lock);
+}
+
+static __cold void io_req_caches_free(struct io_ring_ctx *ctx)
+{
+	guard(mutex)(&ctx->uring_lock);
+	__io_req_caches_free(ctx);
 }
 
 static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
@@ -3002,20 +2992,19 @@ static __cold bool io_cancel_defer_files(struct io_ring_ctx *ctx,
 	struct io_defer_entry *de;
 	LIST_HEAD(list);
 
-	spin_lock(&ctx->completion_lock);
 	list_for_each_entry_reverse(de, &ctx->defer_list, list) {
 		if (io_match_task_safe(de->req, tctx, cancel_all)) {
 			list_cut_position(&list, &ctx->defer_list, &de->list);
 			break;
 		}
 	}
-	spin_unlock(&ctx->completion_lock);
 	if (list_empty(&list))
 		return false;
 
 	while (!list_empty(&list)) {
 		de = list_first_entry(&list, struct io_defer_entry, list);
 		list_del_init(&de->list);
+		ctx->nr_drained -= io_linked_nr(de->req);
 		io_req_task_queue_fail(de->req, -ECANCELED);
 		kfree(de);
 	}
@@ -3090,8 +3079,8 @@ static __cold bool io_uring_try_cancel_requests(struct io_ring_ctx *ctx,
 	if ((ctx->flags & IORING_SETUP_DEFER_TASKRUN) &&
 	    io_allowed_defer_tw_run(ctx))
 		ret |= io_run_local_work(ctx, INT_MAX, INT_MAX) > 0;
-	ret |= io_cancel_defer_files(ctx, tctx, cancel_all);
 	mutex_lock(&ctx->uring_lock);
+	ret |= io_cancel_defer_files(ctx, tctx, cancel_all);
 	ret |= io_poll_remove_all(ctx, tctx, cancel_all);
 	ret |= io_waitid_remove_all(ctx, tctx, cancel_all);
 	ret |= io_futex_remove_all(ctx, tctx, cancel_all);
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index e4050b2d0821..81f22196a57d 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -196,7 +196,6 @@ static inline bool io_defer_get_uncommited_cqe(struct io_ring_ctx *ctx,
 {
 	io_lockdep_assert_cq_locked(ctx);
 
-	ctx->cq_extra++;
 	ctx->submit_state.cq_flush = true;
 	return io_get_cqe(ctx, cqe_ret);
 }
@@ -414,7 +413,7 @@ static inline void io_req_complete_defer(struct io_kiocb *req)
 
 static inline void io_commit_cqring_flush(struct io_ring_ctx *ctx)
 {
-	if (unlikely(ctx->off_timeout_used || ctx->drain_active ||
+	if (unlikely(ctx->off_timeout_used ||
 		     ctx->has_evfd || ctx->poll_activated))
 		__io_commit_cqring_flush(ctx);
 }
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 0/8] allocated requests based drain and fixes
  2025-05-09 11:12 [PATCH v2 0/8] allocated requests based drain and fixes Pavel Begunkov
                   ` (7 preceding siblings ...)
  2025-05-09 11:12 ` [PATCH v2 8/8] io_uring: drain based on allocated reqs Pavel Begunkov
@ 2025-05-09 14:02 ` Jens Axboe
  8 siblings, 0 replies; 12+ messages in thread
From: Jens Axboe @ 2025-05-09 14:02 UTC (permalink / raw)
  To: io-uring, Pavel Begunkov


On Fri, 09 May 2025 12:12:46 +0100, Pavel Begunkov wrote:
> Patches 1-7 are fixes and preparations. Patch 8 implements
> draining based on the number of allocated requests, which
> is simpler and remove cq_extra accounting overhead.
> 
> v2: fix executing still drained reqs on spurious calls to
>     io_queue_deferred() (Patch 2)
> 
> [...]

Applied, thanks!

[1/8] io_uring: account drain memory to cgroup
      commit: f979c20547e72568e3c793bc92c7522bc3166246
[2/8] io_uring: fix spurious drain flushing
      commit: fde04c7e2775feb0746301e0ef86a04d3598c3fe
[3/8] io_uring: simplify drain ret passing
      commit: 05b334110fdc85f536d7dd573120d573801fb2d1
[4/8] io_uring: remove drain prealloc checks
      commit: e91e4f692f7993d5d192228c5f8a9a2e12ff5250
[5/8] io_uring: consolidate drain seq checking
      commit: 19a94da447f832ee614f8f5532d31c1c70061520
[6/8] io_uring: open code io_account_cq_overflow()
      commit: b0c8a6401fbca91da4fe0dc10d61a770f1581e45
[7/8] io_uring: count allocated requests
      commit: 63de899cb6220357dea9d0f4e5aa459ff5193bb0
[8/8] io_uring: drain based on allocated reqs
      commit: 55c28f431651973385c17081bdcdadf8b5c6da91

Best regards,
-- 
Jens Axboe




^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 8/8] io_uring: drain based on allocated reqs
  2025-05-09 11:12 ` [PATCH v2 8/8] io_uring: drain based on allocated reqs Pavel Begunkov
@ 2025-05-13 10:37   ` Andy Shevchenko
  2025-05-13 13:35     ` Pavel Begunkov
  0 siblings, 1 reply; 12+ messages in thread
From: Andy Shevchenko @ 2025-05-13 10:37 UTC (permalink / raw)
  To: Pavel Begunkov; +Cc: io-uring

On Fri, May 09, 2025 at 12:12:54PM +0100, Pavel Begunkov wrote:
> Don't rely on CQ sequence numbers for draining, as it has become messy
> and needs cq_extra adjustments. Instead, base it on the number of
> allocated requests and only allow flushing when all requests are in the
> drain list.
> 
> As a result, cq_extra is gone, no overhead for its accounting in aux cqe
> posting, less bloating as it was inlined before, and it's in general
> simpler than trying to track where we should bump it and where it should
> be put back like in cases of overflow. Also, it'll likely help with
> cleaning and unifying some of the CQ posting helpers.

This patch breaks the `make W=1` build. Please always test your changes
with `make W=1`. See the details below.
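
For reference, rebuilding the single object is enough to see it (a
hypothetical invocation; the exact target depends on your tree and config):

	make W=1 io_uring/io_uring.o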

...

>  static __cold void io_drain_req(struct io_kiocb *req)
>  	__must_hold(&ctx->uring_lock)
>  {
>  	struct io_ring_ctx *ctx = req->ctx;
>  	bool drain = req->flags & IOSQE_IO_DRAIN;
>  	struct io_defer_entry *de;
> +	struct io_kiocb *tmp;

> +	int nr = 0;

Defined and assigned...

>  
>  	de = kmalloc(sizeof(*de), GFP_KERNEL_ACCOUNT);
>  	if (!de) {
> @@ -1667,17 +1658,17 @@ static __cold void io_drain_req(struct io_kiocb *req)
>  		return;
>  	}
>  
> +	io_for_each_link(tmp, req)
> +		nr++;

...just incremented...

And that seems to be it. Does the above have any side effects? Or is it
just dead code (i.e. leftovers from a rebase or upcoming work)?

In any case, please make use of nr or drop it completely.

io_uring/io_uring.c:1649:6: error: variable 'nr' set but not used [-Werror,-Wunused-but-set-variable]
 1649 |         int nr = 0;
      |             ^
1 error generated.

>  	io_prep_async_link(req);
>  	trace_io_uring_defer(req);
>  	de->req = req;
> -	de->seq = io_get_sequence(req);
>  
> -	scoped_guard(spinlock, &ctx->completion_lock) {
> -		list_add_tail(&de->list, &ctx->defer_list);
> -		__io_queue_deferred(ctx);
> -		if (!drain && list_empty(&ctx->defer_list))
> -			ctx->drain_active = false;
> -	}
> +	ctx->nr_drained += io_linked_nr(req);
> +	list_add_tail(&de->list, &ctx->defer_list);
> +	io_queue_deferred(ctx);
> +	if (!drain && list_empty(&ctx->defer_list))
> +		ctx->drain_active = false;
>  }

-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 8/8] io_uring: drain based on allocated reqs
  2025-05-13 10:37   ` Andy Shevchenko
@ 2025-05-13 13:35     ` Pavel Begunkov
  0 siblings, 0 replies; 12+ messages in thread
From: Pavel Begunkov @ 2025-05-13 13:35 UTC (permalink / raw)
  To: Andy Shevchenko; +Cc: io-uring

On 5/13/25 11:37, Andy Shevchenko wrote:
> On Fri, May 09, 2025 at 12:12:54PM +0100, Pavel Begunkov wrote:
>> Don't rely on CQ sequence numbers for draining, as it has become messy
>> and needs cq_extra adjustments. Instead, base it on the number of
>> allocated requests and only allow flushing when all requests are in the
>> drain list.
>>
>> As a result, cq_extra is gone, no overhead for its accounting in aux cqe
>> posting, less bloating as it was inlined before, and it's in general
>> simpler than trying to track where we should bump it and where it should
>> be put back like in cases of overflow. Also, it'll likely help with
>> cleaning and unifying some of the CQ posting helpers.
> 
> This patch breaks the `make W=1` build. Please, always test your changes with

That's good advice, even if unsolicited, but it doesn't help with the
issue. Not every compiler complains. I'd also say things like that should
warn by default, and they usually do, at least for basic "unused
variable" cases.

> `make W=1`. See below the details.

Yes, it's dead code and was already fixed a day ago.

-- 
Pavel Begunkov


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2025-05-13 13:34 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-05-09 11:12 [PATCH v2 0/8] allocated requests based drain and fixes Pavel Begunkov
2025-05-09 11:12 ` [PATCH v2 1/8] io_uring: account drain memory to cgroup Pavel Begunkov
2025-05-09 11:12 ` [PATCH v2 2/8] io_uring: fix spurious drain flushing Pavel Begunkov
2025-05-09 11:12 ` [PATCH v2 3/8] io_uring: simplify drain ret passing Pavel Begunkov
2025-05-09 11:12 ` [PATCH v2 4/8] io_uring: remove drain prealloc checks Pavel Begunkov
2025-05-09 11:12 ` [PATCH v2 5/8] io_uring: consolidate drain seq checking Pavel Begunkov
2025-05-09 11:12 ` [PATCH v2 6/8] io_uring: open code io_account_cq_overflow() Pavel Begunkov
2025-05-09 11:12 ` [PATCH v2 7/8] io_uring: count allocated requests Pavel Begunkov
2025-05-09 11:12 ` [PATCH v2 8/8] io_uring: drain based on allocated reqs Pavel Begunkov
2025-05-13 10:37   ` Andy Shevchenko
2025-05-13 13:35     ` Pavel Begunkov
2025-05-09 14:02 ` [PATCH v2 0/8] allocated requests based drain and fixes Jens Axboe
