[PATCH v2 00/15] caching and SQ/CQ optimisations

public inbox for [email protected]
 help / color / mirror / Atom feed

* [PATCH v2 00/15] caching and SQ/CQ optimisations
@ 2023-08-24 22:53 Pavel Begunkov
  2023-08-24 22:53 ` [PATCH v2 01/15] io_uring: improve cqe !tracing hot path Pavel Begunkov
                   ` (15 more replies)
  0 siblings, 16 replies; 17+ messages in thread
From: Pavel Begunkov @ 2023-08-24 22:53 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

Patch 1-5 optimise io_fill_cqe_req

Patch 6-7 combine iopoll and normal completion paths

Patch 8 inlines io_fill_cqe_req.

Patch 9 should improve CPU caching of SQ/CQ pointers

Patch 10 removes conditionally SQ indirection (->sq_array). Assuming we'll
make it a default in liburing, Patch 10 optimises it with static_key.

Patch 11-15 shuffle io_ring_ctx fields.

Testing with t/io_uring nops only for now

                QD2     QD4     QD8     QD16    QD32
baseline:       17.3    26.6    36.4    43.7    49.4
Patches 1-15:   17.8    27.4    37.9    45.8    51.2
Patches 1-16:   17.9    28.2    39.3    47.8    54

v2:
  removed static_key, it'll be submitted later after it rolls out well
  minor description changes

Pavel Begunkov (15):
  io_uring: improve cqe !tracing hot path
  io_uring: cqe init hardening
  io_uring: simplify big_cqe handling
  io_uring: refactor __io_get_cqe()
  io_uring: optimise extra io_get_cqe null check
  io_uring: reorder cqring_flush and wakeups
  io_uring: merge iopoll and normal completion paths
  io_uring: force inline io_fill_cqe_req
  io_uring: compact SQ/CQ heads/tails
  io_uring: add option to remove SQ indirection
  io_uring: move non aligned field to the end
  io_uring: banish non-hot data to end of io_ring_ctx
  io_uring: separate task_work/waiting cache line
  io_uring: move multishot cqe cache in ctx
  io_uring: move iopoll ctx fields around

 include/linux/io_uring_types.h | 129 +++++++++++++++++----------------
 include/uapi/linux/io_uring.h  |   5 ++
 io_uring/io_uring.c            | 120 +++++++++++++++---------------
 io_uring/io_uring.h            |  58 +++++++--------
 io_uring/rw.c                  |  24 ++----
 io_uring/uring_cmd.c           |   5 +-
 6 files changed, 163 insertions(+), 178 deletions(-)

-- 
2.41.0


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v2 01/15] io_uring: improve cqe !tracing hot path
  2023-08-24 22:53 [PATCH v2 00/15] caching and SQ/CQ optimisations Pavel Begunkov
@ 2023-08-24 22:53 ` Pavel Begunkov
  2023-08-24 22:53 ` [PATCH v2 02/15] io_uring: cqe init hardening Pavel Begunkov
                   ` (14 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: Pavel Begunkov @ 2023-08-24 22:53 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

While looking at io_fill_cqe_req()'s asm I stumbled on our trace points
turning into the chunk below:

trace_io_uring_complete(req->ctx, req, req->cqe.user_data,
			req->cqe.res, req->cqe.flags,
			req->extra1, req->extra2);

io_uring/io_uring.c:898: 	trace_io_uring_complete(req->ctx, req, req->cqe.user_data,
	movq	232(%rbx), %rdi	# req_44(D)->big_cqe.extra2, _5
	movq	224(%rbx), %rdx	# req_44(D)->big_cqe.extra1, _6
	movl	84(%rbx), %r9d	# req_44(D)->cqe.D.81184.flags, _7
	movl	80(%rbx), %r8d	# req_44(D)->cqe.res, _8
	movq	72(%rbx), %rcx	# req_44(D)->cqe.user_data, _9
	movq	88(%rbx), %rsi	# req_44(D)->ctx, _10
./arch/x86/include/asm/jump_label.h:27: 	asm_volatile_goto("1:"
	1:jmp .L1772 # objtool NOPs this 	#
	...

It does a jump_label for actual tracing, but those 6 moves will stay
there in the hottest io_uring path. As an optimisation, add a
trace_io_uring_complete_enabled() check, which is also uses jump_labels,
it tricks the compiler into behaving. It removes the junk without
changing anything else int the hot path.

Note: apparently, it's not only me noticing it, and people are also
working it around. We should remove the check when it's solved
generically or rework tracing.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 io_uring/io_uring.h | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 3e6ff3cd9a24..465598223386 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -145,10 +145,11 @@ static inline bool io_fill_cqe_req(struct io_ring_ctx *ctx, struct io_kiocb *req
 	if (unlikely(!cqe))
 		return false;
 
-	trace_io_uring_complete(req->ctx, req, req->cqe.user_data,
-				req->cqe.res, req->cqe.flags,
-				(req->flags & REQ_F_CQE32_INIT) ? req->extra1 : 0,
-				(req->flags & REQ_F_CQE32_INIT) ? req->extra2 : 0);
+	if (trace_io_uring_complete_enabled())
+		trace_io_uring_complete(req->ctx, req, req->cqe.user_data,
+					req->cqe.res, req->cqe.flags,
+					(req->flags & REQ_F_CQE32_INIT) ? req->extra1 : 0,
+					(req->flags & REQ_F_CQE32_INIT) ? req->extra2 : 0);
 
 	memcpy(cqe, &req->cqe, sizeof(*cqe));
 
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v2 02/15] io_uring: cqe init hardening
  2023-08-24 22:53 [PATCH v2 00/15] caching and SQ/CQ optimisations Pavel Begunkov
  2023-08-24 22:53 ` [PATCH v2 01/15] io_uring: improve cqe !tracing hot path Pavel Begunkov
@ 2023-08-24 22:53 ` Pavel Begunkov
  2023-08-24 22:53 ` [PATCH v2 03/15] io_uring: simplify big_cqe handling Pavel Begunkov
                   ` (13 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: Pavel Begunkov @ 2023-08-24 22:53 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

io_kiocb::cqe stores the completion info which we'll memcpy to
userspace, and we rely on callbacks and other later steps to populate
it with right values. We have never had problems with that, but it would
still be safer to zero it on allocation.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 io_uring/io_uring.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index e1a23f4993d3..3e0fe1ebbc10 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1056,7 +1056,7 @@ static void io_preinit_req(struct io_kiocb *req, struct io_ring_ctx *ctx)
 	req->link = NULL;
 	req->async_data = NULL;
 	/* not necessary, but safer to zero */
-	req->cqe.res = 0;
+	memset(&req->cqe, 0, sizeof(req->cqe));
 }
 
 static void io_flush_cached_locked_reqs(struct io_ring_ctx *ctx,
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v2 03/15] io_uring: simplify big_cqe handling
  2023-08-24 22:53 [PATCH v2 00/15] caching and SQ/CQ optimisations Pavel Begunkov
  2023-08-24 22:53 ` [PATCH v2 01/15] io_uring: improve cqe !tracing hot path Pavel Begunkov
  2023-08-24 22:53 ` [PATCH v2 02/15] io_uring: cqe init hardening Pavel Begunkov
@ 2023-08-24 22:53 ` Pavel Begunkov
  2023-08-24 22:53 ` [PATCH v2 04/15] io_uring: refactor __io_get_cqe() Pavel Begunkov
                   ` (12 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: Pavel Begunkov @ 2023-08-24 22:53 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

Don't keep big_cqe bits of req in a union with hash_node, find a
separate space for it. It's bit safer, but also if we keep it always
initialised, we can get rid of ugly REQ_F_CQE32_INIT handling.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 include/linux/io_uring_types.h | 16 ++++++----------
 io_uring/io_uring.c            |  8 +++-----
 io_uring/io_uring.h            | 15 +++------------
 io_uring/uring_cmd.c           |  5 ++---
 4 files changed, 14 insertions(+), 30 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index f04ce513fadb..9795eda529f7 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -409,7 +409,6 @@ enum {
 	REQ_F_SINGLE_POLL_BIT,
 	REQ_F_DOUBLE_POLL_BIT,
 	REQ_F_PARTIAL_IO_BIT,
-	REQ_F_CQE32_INIT_BIT,
 	REQ_F_APOLL_MULTISHOT_BIT,
 	REQ_F_CLEAR_POLLIN_BIT,
 	REQ_F_HASH_LOCKED_BIT,
@@ -479,8 +478,6 @@ enum {
 	REQ_F_PARTIAL_IO	= BIT(REQ_F_PARTIAL_IO_BIT),
 	/* fast poll multishot mode */
 	REQ_F_APOLL_MULTISHOT	= BIT(REQ_F_APOLL_MULTISHOT_BIT),
-	/* ->extra1 and ->extra2 are initialised */
-	REQ_F_CQE32_INIT	= BIT(REQ_F_CQE32_INIT_BIT),
 	/* recvmsg special flag, clear EPOLLIN */
 	REQ_F_CLEAR_POLLIN	= BIT(REQ_F_CLEAR_POLLIN_BIT),
 	/* hashed into ->cancel_hash_locked, protected by ->uring_lock */
@@ -579,13 +576,7 @@ struct io_kiocb {
 	struct io_task_work		io_task_work;
 	unsigned			nr_tw;
 	/* for polled requests, i.e. IORING_OP_POLL_ADD and async armed poll */
-	union {
-		struct hlist_node	hash_node;
-		struct {
-			u64		extra1;
-			u64		extra2;
-		};
-	};
+	struct hlist_node		hash_node;
 	/* internal polling, see IORING_FEAT_FAST_POLL */
 	struct async_poll		*apoll;
 	/* opcode allocated if it needs to store data for async defer */
@@ -595,6 +586,11 @@ struct io_kiocb {
 	/* custom credentials, valid IFF REQ_F_CREDS is set */
 	const struct cred		*creds;
 	struct io_wq_work		work;
+
+	struct {
+		u64			extra1;
+		u64			extra2;
+	} big_cqe;
 };
 
 struct io_overflow_cqe {
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 3e0fe1ebbc10..0aeb33256a6d 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -807,13 +807,10 @@ static bool io_cqring_event_overflow(struct io_ring_ctx *ctx, u64 user_data,
 
 void io_req_cqe_overflow(struct io_kiocb *req)
 {
-	if (!(req->flags & REQ_F_CQE32_INIT)) {
-		req->extra1 = 0;
-		req->extra2 = 0;
-	}
 	io_cqring_event_overflow(req->ctx, req->cqe.user_data,
 				req->cqe.res, req->cqe.flags,
-				req->extra1, req->extra2);
+				req->big_cqe.extra1, req->big_cqe.extra2);
+	memset(&req->big_cqe, 0, sizeof(req->big_cqe));
 }
 
 /*
@@ -1057,6 +1054,7 @@ static void io_preinit_req(struct io_kiocb *req, struct io_ring_ctx *ctx)
 	req->async_data = NULL;
 	/* not necessary, but safer to zero */
 	memset(&req->cqe, 0, sizeof(req->cqe));
+	memset(&req->big_cqe, 0, sizeof(req->big_cqe));
 }
 
 static void io_flush_cached_locked_reqs(struct io_ring_ctx *ctx,
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 465598223386..9b5dfb6ef484 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -148,21 +148,12 @@ static inline bool io_fill_cqe_req(struct io_ring_ctx *ctx, struct io_kiocb *req
 	if (trace_io_uring_complete_enabled())
 		trace_io_uring_complete(req->ctx, req, req->cqe.user_data,
 					req->cqe.res, req->cqe.flags,
-					(req->flags & REQ_F_CQE32_INIT) ? req->extra1 : 0,
-					(req->flags & REQ_F_CQE32_INIT) ? req->extra2 : 0);
+					req->big_cqe.extra1, req->big_cqe.extra2);
 
 	memcpy(cqe, &req->cqe, sizeof(*cqe));
-
 	if (ctx->flags & IORING_SETUP_CQE32) {
-		u64 extra1 = 0, extra2 = 0;
-
-		if (req->flags & REQ_F_CQE32_INIT) {
-			extra1 = req->extra1;
-			extra2 = req->extra2;
-		}
-
-		WRITE_ONCE(cqe->big_cqe[0], extra1);
-		WRITE_ONCE(cqe->big_cqe[1], extra2);
+		memcpy(cqe->big_cqe, &req->big_cqe, sizeof(*cqe));
+		memset(&req->big_cqe, 0, sizeof(req->big_cqe));
 	}
 	return true;
 }
diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
index 8e7a03c1b20e..537795fddc87 100644
--- a/io_uring/uring_cmd.c
+++ b/io_uring/uring_cmd.c
@@ -43,9 +43,8 @@ EXPORT_SYMBOL_GPL(io_uring_cmd_do_in_task_lazy);
 static inline void io_req_set_cqe32_extra(struct io_kiocb *req,
 					  u64 extra1, u64 extra2)
 {
-	req->extra1 = extra1;
-	req->extra2 = extra2;
-	req->flags |= REQ_F_CQE32_INIT;
+	req->big_cqe.extra1 = extra1;
+	req->big_cqe.extra2 = extra2;
 }
 
 /*
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v2 04/15] io_uring: refactor __io_get_cqe()
  2023-08-24 22:53 [PATCH v2 00/15] caching and SQ/CQ optimisations Pavel Begunkov
                   ` (2 preceding siblings ...)
  2023-08-24 22:53 ` [PATCH v2 03/15] io_uring: simplify big_cqe handling Pavel Begunkov
@ 2023-08-24 22:53 ` Pavel Begunkov
  2023-08-24 22:53 ` [PATCH v2 05/15] io_uring: optimise extra io_get_cqe null check Pavel Begunkov
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: Pavel Begunkov @ 2023-08-24 22:53 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

Make __io_get_cqe simpler by not grabbing the cqe from refilled cached,
but letting io_get_cqe() do it for us. That's cleaner and removes some
duplication.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 io_uring/io_uring.c | 13 ++++---------
 io_uring/io_uring.h | 23 ++++++++++++-----------
 2 files changed, 16 insertions(+), 20 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 0aeb33256a6d..de05831eeca7 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -818,7 +818,7 @@ void io_req_cqe_overflow(struct io_kiocb *req)
  * control dependency is enough as we're using WRITE_ONCE to
  * fill the cq entry
  */
-struct io_uring_cqe *__io_get_cqe(struct io_ring_ctx *ctx, bool overflow)
+bool io_cqe_cache_refill(struct io_ring_ctx *ctx, bool overflow)
 {
 	struct io_rings *rings = ctx->rings;
 	unsigned int off = ctx->cached_cq_tail & (ctx->cq_entries - 1);
@@ -830,7 +830,7 @@ struct io_uring_cqe *__io_get_cqe(struct io_ring_ctx *ctx, bool overflow)
 	 * Force overflow the completion.
 	 */
 	if (!overflow && (ctx->check_cq & BIT(IO_CHECK_CQ_OVERFLOW_BIT)))
-		return NULL;
+		return false;
 
 	/* userspace may cheat modifying the tail, be safe and do min */
 	queued = min(__io_cqring_events(ctx), ctx->cq_entries);
@@ -838,7 +838,7 @@ struct io_uring_cqe *__io_get_cqe(struct io_ring_ctx *ctx, bool overflow)
 	/* we need a contiguous range, limit based on the current array offset */
 	len = min(free, ctx->cq_entries - off);
 	if (!len)
-		return NULL;
+		return false;
 
 	if (ctx->flags & IORING_SETUP_CQE32) {
 		off <<= 1;
@@ -847,12 +847,7 @@ struct io_uring_cqe *__io_get_cqe(struct io_ring_ctx *ctx, bool overflow)
 
 	ctx->cqe_cached = &rings->cqes[off];
 	ctx->cqe_sentinel = ctx->cqe_cached + len;
-
-	ctx->cached_cq_tail++;
-	ctx->cqe_cached++;
-	if (ctx->flags & IORING_SETUP_CQE32)
-		ctx->cqe_cached++;
-	return &rings->cqes[off];
+	return true;
 }
 
 static bool io_fill_cqe_aux(struct io_ring_ctx *ctx, u64 user_data, s32 res,
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 9b5dfb6ef484..9c80d20fe18f 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -38,7 +38,7 @@ enum {
 	IOU_STOP_MULTISHOT	= -ECANCELED,
 };
 
-struct io_uring_cqe *__io_get_cqe(struct io_ring_ctx *ctx, bool overflow);
+bool io_cqe_cache_refill(struct io_ring_ctx *ctx, bool overflow);
 void io_req_cqe_overflow(struct io_kiocb *req);
 int io_run_task_work_sig(struct io_ring_ctx *ctx);
 void io_req_defer_failed(struct io_kiocb *req, s32 res);
@@ -112,19 +112,20 @@ static inline void io_req_task_work_add(struct io_kiocb *req)
 static inline struct io_uring_cqe *io_get_cqe_overflow(struct io_ring_ctx *ctx,
 						       bool overflow)
 {
-	io_lockdep_assert_cq_locked(ctx);
+	struct io_uring_cqe *cqe;
 
-	if (likely(ctx->cqe_cached < ctx->cqe_sentinel)) {
-		struct io_uring_cqe *cqe = ctx->cqe_cached;
+	io_lockdep_assert_cq_locked(ctx);
 
-		ctx->cached_cq_tail++;
-		ctx->cqe_cached++;
-		if (ctx->flags & IORING_SETUP_CQE32)
-			ctx->cqe_cached++;
-		return cqe;
+	if (unlikely(ctx->cqe_cached >= ctx->cqe_sentinel)) {
+		if (unlikely(!io_cqe_cache_refill(ctx, overflow)))
+			return NULL;
 	}
-
-	return __io_get_cqe(ctx, overflow);
+	cqe = ctx->cqe_cached;
+	ctx->cached_cq_tail++;
+	ctx->cqe_cached++;
+	if (ctx->flags & IORING_SETUP_CQE32)
+		ctx->cqe_cached++;
+	return cqe;
 }
 
 static inline struct io_uring_cqe *io_get_cqe(struct io_ring_ctx *ctx)
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v2 05/15] io_uring: optimise extra io_get_cqe null check
  2023-08-24 22:53 [PATCH v2 00/15] caching and SQ/CQ optimisations Pavel Begunkov
                   ` (3 preceding siblings ...)
  2023-08-24 22:53 ` [PATCH v2 04/15] io_uring: refactor __io_get_cqe() Pavel Begunkov
@ 2023-08-24 22:53 ` Pavel Begunkov
  2023-08-24 22:53 ` [PATCH v2 06/15] io_uring: reorder cqring_flush and wakeups Pavel Begunkov
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: Pavel Begunkov @ 2023-08-24 22:53 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

If the cached cqe check passes in io_get_cqe*() it already means that
the cqe we return is valid and non-zero, however the compiler is unable
to optimise null checks like in io_fill_cqe_req().

Do a bit of trickery, return success/fail boolean from io_get_cqe*()
and store cqe in the cqe parameter. That makes it do the right thing,
erasing the check together with the introduced indirection.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 io_uring/io_uring.c |  7 +++----
 io_uring/io_uring.h | 20 +++++++++-----------
 2 files changed, 12 insertions(+), 15 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index de05831eeca7..cfc2dc8c4b2f 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -683,10 +683,10 @@ static void __io_cqring_overflow_flush(struct io_ring_ctx *ctx)
 
 	io_cq_lock(ctx);
 	while (!list_empty(&ctx->cq_overflow_list)) {
-		struct io_uring_cqe *cqe = io_get_cqe_overflow(ctx, true);
+		struct io_uring_cqe *cqe;
 		struct io_overflow_cqe *ocqe;
 
-		if (!cqe)
+		if (!io_get_cqe_overflow(ctx, &cqe, true))
 			break;
 		ocqe = list_first_entry(&ctx->cq_overflow_list,
 					struct io_overflow_cqe, list);
@@ -862,8 +862,7 @@ static bool io_fill_cqe_aux(struct io_ring_ctx *ctx, u64 user_data, s32 res,
 	 * submission (by quite a lot). Increment the overflow count in
 	 * the ring.
 	 */
-	cqe = io_get_cqe(ctx);
-	if (likely(cqe)) {
+	if (likely(io_get_cqe(ctx, &cqe))) {
 		trace_io_uring_complete(ctx, NULL, user_data, res, cflags, 0, 0);
 
 		WRITE_ONCE(cqe->user_data, user_data);
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 9c80d20fe18f..2960e35b32a5 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -109,28 +109,27 @@ static inline void io_req_task_work_add(struct io_kiocb *req)
 #define io_for_each_link(pos, head) \
 	for (pos = (head); pos; pos = pos->link)
 
-static inline struct io_uring_cqe *io_get_cqe_overflow(struct io_ring_ctx *ctx,
-						       bool overflow)
+static inline bool io_get_cqe_overflow(struct io_ring_ctx *ctx,
+					struct io_uring_cqe **ret,
+					bool overflow)
 {
-	struct io_uring_cqe *cqe;
-
 	io_lockdep_assert_cq_locked(ctx);
 
 	if (unlikely(ctx->cqe_cached >= ctx->cqe_sentinel)) {
 		if (unlikely(!io_cqe_cache_refill(ctx, overflow)))
-			return NULL;
+			return false;
 	}
-	cqe = ctx->cqe_cached;
+	*ret = ctx->cqe_cached;
 	ctx->cached_cq_tail++;
 	ctx->cqe_cached++;
 	if (ctx->flags & IORING_SETUP_CQE32)
 		ctx->cqe_cached++;
-	return cqe;
+	return true;
 }
 
-static inline struct io_uring_cqe *io_get_cqe(struct io_ring_ctx *ctx)
+static inline bool io_get_cqe(struct io_ring_ctx *ctx, struct io_uring_cqe **ret)
 {
-	return io_get_cqe_overflow(ctx, false);
+	return io_get_cqe_overflow(ctx, ret, false);
 }
 
 static inline bool io_fill_cqe_req(struct io_ring_ctx *ctx, struct io_kiocb *req)
@@ -142,8 +141,7 @@ static inline bool io_fill_cqe_req(struct io_ring_ctx *ctx, struct io_kiocb *req
 	 * submission (by quite a lot). Increment the overflow count in
 	 * the ring.
 	 */
-	cqe = io_get_cqe(ctx);
-	if (unlikely(!cqe))
+	if (unlikely(!io_get_cqe(ctx, &cqe)))
 		return false;
 
 	if (trace_io_uring_complete_enabled())
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v2 06/15] io_uring: reorder cqring_flush and wakeups
  2023-08-24 22:53 [PATCH v2 00/15] caching and SQ/CQ optimisations Pavel Begunkov
                   ` (4 preceding siblings ...)
  2023-08-24 22:53 ` [PATCH v2 05/15] io_uring: optimise extra io_get_cqe null check Pavel Begunkov
@ 2023-08-24 22:53 ` Pavel Begunkov
  2023-08-24 22:53 ` [PATCH v2 07/15] io_uring: merge iopoll and normal completion paths Pavel Begunkov
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: Pavel Begunkov @ 2023-08-24 22:53 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

Unlike in the past, io_commit_cqring_flush() doesn't do anything that
may need io_cqring_wake() to be issued after, all requests it completes
will go via task_work. Do io_commit_cqring_flush() after
io_cqring_wake() to clean up __io_cq_unlock_post().

Signed-off-by: Pavel Begunkov <[email protected]>
---
 io_uring/io_uring.c | 14 +++-----------
 io_uring/rw.c       |  2 +-
 2 files changed, 4 insertions(+), 12 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index cfc2dc8c4b2f..7c1ef5b6628d 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -629,19 +629,11 @@ static inline void io_cq_lock(struct io_ring_ctx *ctx)
 static inline void __io_cq_unlock_post(struct io_ring_ctx *ctx)
 {
 	io_commit_cqring(ctx);
-
-	if (ctx->task_complete) {
-		/*
-		 * ->task_complete implies that only current might be waiting
-		 * for CQEs, and obviously, we currently don't. No one is
-		 * waiting, wakeups are futile, skip them.
-		 */
-		io_commit_cqring_flush(ctx);
-	} else {
+	if (!ctx->task_complete) {
 		spin_unlock(&ctx->completion_lock);
-		io_commit_cqring_flush(ctx);
 		io_cqring_wake(ctx);
 	}
+	io_commit_cqring_flush(ctx);
 }
 
 static void io_cq_unlock_post(struct io_ring_ctx *ctx)
@@ -649,8 +641,8 @@ static void io_cq_unlock_post(struct io_ring_ctx *ctx)
 {
 	io_commit_cqring(ctx);
 	spin_unlock(&ctx->completion_lock);
-	io_commit_cqring_flush(ctx);
 	io_cqring_wake(ctx);
+	io_commit_cqring_flush(ctx);
 }
 
 /* Returns true if there are no backlogged entries after the flush */
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 9b51afdae505..20140d3505f1 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -985,9 +985,9 @@ int io_write(struct io_kiocb *req, unsigned int issue_flags)
 
 static void io_cqring_ev_posted_iopoll(struct io_ring_ctx *ctx)
 {
-	io_commit_cqring_flush(ctx);
 	if (ctx->flags & IORING_SETUP_SQPOLL)
 		io_cqring_wake(ctx);
+	io_commit_cqring_flush(ctx);
 }
 
 void io_rw_fail(struct io_kiocb *req)
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v2 07/15] io_uring: merge iopoll and normal completion paths
  2023-08-24 22:53 [PATCH v2 00/15] caching and SQ/CQ optimisations Pavel Begunkov
                   ` (5 preceding siblings ...)
  2023-08-24 22:53 ` [PATCH v2 06/15] io_uring: reorder cqring_flush and wakeups Pavel Begunkov
@ 2023-08-24 22:53 ` Pavel Begunkov
  2023-08-24 22:53 ` [PATCH v2 08/15] io_uring: force inline io_fill_cqe_req Pavel Begunkov
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: Pavel Begunkov @ 2023-08-24 22:53 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

io_do_iopoll() and io_submit_flush_completions() are pretty similar,
both filling CQEs and then free a list of requests. Don't duplicate it
and make iopoll use __io_submit_flush_completions(), which also helps
with inlining and other optimisations.

For that, we need to first find all completed iopoll requests and splice
them from the iopoll list and then pass it down. This adds one extra
list traversal, which should be fine as requests will stay hot in cache.

CQ locking is already conditional, introduce ->lockless_cq and skip
locking for IOPOLL as it's protected by ->uring_lock.

We also add a wakeup optimisation for IOPOLL to __io_cq_unlock_post(),
so it works just like io_cqring_ev_posted_iopoll().

Signed-off-by: Pavel Begunkov <[email protected]>
---
 include/linux/io_uring_types.h |  1 +
 io_uring/io_uring.c            | 18 ++++++++++++------
 io_uring/io_uring.h            |  2 +-
 io_uring/rw.c                  | 24 +++++-------------------
 4 files changed, 19 insertions(+), 26 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 9795eda529f7..c0c03d8059df 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -205,6 +205,7 @@ struct io_ring_ctx {
 		unsigned int		has_evfd: 1;
 		/* all CQEs should be posted only by the submitter task */
 		unsigned int		task_complete: 1;
+		unsigned int		lockless_cq: 1;
 		unsigned int		syscall_iopoll: 1;
 		unsigned int		poll_activated: 1;
 		unsigned int		drain_disabled: 1;
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 7c1ef5b6628d..e8321903e3f3 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -147,7 +147,6 @@ static bool io_uring_try_cancel_requests(struct io_ring_ctx *ctx,
 					 bool cancel_all);
 
 static void io_queue_sqe(struct io_kiocb *req);
-static void __io_submit_flush_completions(struct io_ring_ctx *ctx);
 
 struct kmem_cache *req_cachep;
 
@@ -616,7 +615,7 @@ void __io_commit_cqring_flush(struct io_ring_ctx *ctx)
 
 static inline void __io_cq_lock(struct io_ring_ctx *ctx)
 {
-	if (!ctx->task_complete)
+	if (!ctx->lockless_cq)
 		spin_lock(&ctx->completion_lock);
 }
 
@@ -630,8 +629,11 @@ static inline void __io_cq_unlock_post(struct io_ring_ctx *ctx)
 {
 	io_commit_cqring(ctx);
 	if (!ctx->task_complete) {
-		spin_unlock(&ctx->completion_lock);
-		io_cqring_wake(ctx);
+		if (!ctx->lockless_cq)
+			spin_unlock(&ctx->completion_lock);
+		/* IOPOLL rings only need to wake up if it's also SQPOLL */
+		if (!ctx->syscall_iopoll)
+			io_cqring_wake(ctx);
 	}
 	io_commit_cqring_flush(ctx);
 }
@@ -1485,7 +1487,8 @@ void io_queue_next(struct io_kiocb *req)
 		io_req_task_queue(nxt);
 }
 
-void io_free_batch_list(struct io_ring_ctx *ctx, struct io_wq_work_node *node)
+static void io_free_batch_list(struct io_ring_ctx *ctx,
+			       struct io_wq_work_node *node)
 	__must_hold(&ctx->uring_lock)
 {
 	do {
@@ -1522,7 +1525,7 @@ void io_free_batch_list(struct io_ring_ctx *ctx, struct io_wq_work_node *node)
 	} while (node);
 }
 
-static void __io_submit_flush_completions(struct io_ring_ctx *ctx)
+void __io_submit_flush_completions(struct io_ring_ctx *ctx)
 	__must_hold(&ctx->uring_lock)
 {
 	struct io_submit_state *state = &ctx->submit_state;
@@ -3836,6 +3839,9 @@ static __cold int io_uring_create(unsigned entries, struct io_uring_params *p,
 	    !(ctx->flags & IORING_SETUP_SQPOLL))
 		ctx->task_complete = true;
 
+	if (ctx->task_complete || (ctx->flags & IORING_SETUP_IOPOLL))
+		ctx->lockless_cq = true;
+
 	/*
 	 * lazy poll_wq activation relies on ->task_complete for synchronisation
 	 * purposes, see io_activate_pollwq()
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 2960e35b32a5..07fd185064d2 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -72,7 +72,7 @@ int io_ring_add_registered_file(struct io_uring_task *tctx, struct file *file,
 int io_poll_issue(struct io_kiocb *req, struct io_tw_state *ts);
 int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr);
 int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin);
-void io_free_batch_list(struct io_ring_ctx *ctx, struct io_wq_work_node *node);
+void __io_submit_flush_completions(struct io_ring_ctx *ctx);
 int io_req_prep_async(struct io_kiocb *req);
 
 struct io_wq_work *io_wq_free_work(struct io_wq_work *work);
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 20140d3505f1..0a1e515f0510 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -983,13 +983,6 @@ int io_write(struct io_kiocb *req, unsigned int issue_flags)
 	return ret;
 }
 
-static void io_cqring_ev_posted_iopoll(struct io_ring_ctx *ctx)
-{
-	if (ctx->flags & IORING_SETUP_SQPOLL)
-		io_cqring_wake(ctx);
-	io_commit_cqring_flush(ctx);
-}
-
 void io_rw_fail(struct io_kiocb *req)
 {
 	int res;
@@ -1060,24 +1053,17 @@ int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin)
 		if (!smp_load_acquire(&req->iopoll_completed))
 			break;
 		nr_events++;
-		if (unlikely(req->flags & REQ_F_CQE_SKIP))
-			continue;
-
 		req->cqe.flags = io_put_kbuf(req, 0);
-		if (unlikely(!io_fill_cqe_req(ctx, req))) {
-			spin_lock(&ctx->completion_lock);
-			io_req_cqe_overflow(req);
-			spin_unlock(&ctx->completion_lock);
-		}
 	}
-
 	if (unlikely(!nr_events))
 		return 0;
 
-	io_commit_cqring(ctx);
-	io_cqring_ev_posted_iopoll(ctx);
 	pos = start ? start->next : ctx->iopoll_list.first;
 	wq_list_cut(&ctx->iopoll_list, prev, start);
-	io_free_batch_list(ctx, pos);
+
+	if (WARN_ON_ONCE(!wq_list_empty(&ctx->submit_state.compl_reqs)))
+		return 0;
+	ctx->submit_state.compl_reqs.first = pos;
+	__io_submit_flush_completions(ctx);
 	return nr_events;
 }
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v2 08/15] io_uring: force inline io_fill_cqe_req
  2023-08-24 22:53 [PATCH v2 00/15] caching and SQ/CQ optimisations Pavel Begunkov
                   ` (6 preceding siblings ...)
  2023-08-24 22:53 ` [PATCH v2 07/15] io_uring: merge iopoll and normal completion paths Pavel Begunkov
@ 2023-08-24 22:53 ` Pavel Begunkov
  2023-08-24 22:53 ` [PATCH v2 09/15] io_uring: compact SQ/CQ heads/tails Pavel Begunkov
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: Pavel Begunkov @ 2023-08-24 22:53 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

There are only 2 callers of io_fill_cqe_req left, and one of them is
extremely hot. Force inline the function.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 io_uring/io_uring.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 07fd185064d2..547c30582fb8 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -132,7 +132,8 @@ static inline bool io_get_cqe(struct io_ring_ctx *ctx, struct io_uring_cqe **ret
 	return io_get_cqe_overflow(ctx, ret, false);
 }
 
-static inline bool io_fill_cqe_req(struct io_ring_ctx *ctx, struct io_kiocb *req)
+static __always_inline bool io_fill_cqe_req(struct io_ring_ctx *ctx,
+					    struct io_kiocb *req)
 {
 	struct io_uring_cqe *cqe;
 
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v2 09/15] io_uring: compact SQ/CQ heads/tails
  2023-08-24 22:53 [PATCH v2 00/15] caching and SQ/CQ optimisations Pavel Begunkov
                   ` (7 preceding siblings ...)
  2023-08-24 22:53 ` [PATCH v2 08/15] io_uring: force inline io_fill_cqe_req Pavel Begunkov
@ 2023-08-24 22:53 ` Pavel Begunkov
  2023-08-24 22:53 ` [PATCH v2 10/15] io_uring: add option to remove SQ indirection Pavel Begunkov
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: Pavel Begunkov @ 2023-08-24 22:53 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

Queues heads and tails cache line aligned. That makes sq, cq taking 4
lines or 5 lines if we include the rest of struct io_rings (e.g.
sq_flags is frequently accessed).

Since modern io_uring is mostly single threaded, it doesn't make much
send to spread them as such, it wastes space and puts additional pressure
on caches. Put them all into a single line.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 include/linux/io_uring_types.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index c0c03d8059df..608a8e80e881 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -69,8 +69,8 @@ struct io_uring_task {
 };
 
 struct io_uring {
-	u32 head ____cacheline_aligned_in_smp;
-	u32 tail ____cacheline_aligned_in_smp;
+	u32 head;
+	u32 tail;
 };
 
 /*
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v2 10/15] io_uring: add option to remove SQ indirection
  2023-08-24 22:53 [PATCH v2 00/15] caching and SQ/CQ optimisations Pavel Begunkov
                   ` (8 preceding siblings ...)
  2023-08-24 22:53 ` [PATCH v2 09/15] io_uring: compact SQ/CQ heads/tails Pavel Begunkov
@ 2023-08-24 22:53 ` Pavel Begunkov
  2023-08-24 22:53 ` [PATCH v2 11/15] io_uring: move non aligned field to the end Pavel Begunkov
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: Pavel Begunkov @ 2023-08-24 22:53 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

Not many aware, but io_uring submission queue has two levels. The first
level usually appears as sq_array and stores indexes into the actual SQ.

To my knowledge, no one has ever seriously used it, nor liburing exposes
it to users. Add IORING_SETUP_NO_SQARRAY, when set we don't bother
creating and using the sq_array and SQ heads/tails will be pointing
directly into the SQ. Improves memory footprint, in term of both
allocations as well as cache usage, and also should make io_get_sqe()
less branchy in the end.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 include/uapi/linux/io_uring.h |  5 ++++
 io_uring/io_uring.c           | 52 +++++++++++++++++++++--------------
 2 files changed, 37 insertions(+), 20 deletions(-)

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 9fc7195f25df..8e61f8b7c2ce 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -185,6 +185,11 @@ enum {
  */
 #define IORING_SETUP_REGISTERED_FD_ONLY	(1U << 15)
 
+/*
+ * Removes indirection through the SQ index array.
+ */
+#define IORING_SETUP_NO_SQARRAY		(1U << 16)
+
 enum io_uring_op {
 	IORING_OP_NOP,
 	IORING_OP_READV,
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index e8321903e3f3..a6eea3938802 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -2339,8 +2339,21 @@ static void io_commit_sqring(struct io_ring_ctx *ctx)
  */
 static bool io_get_sqe(struct io_ring_ctx *ctx, const struct io_uring_sqe **sqe)
 {
-	unsigned head, mask = ctx->sq_entries - 1;
-	unsigned sq_idx = ctx->cached_sq_head++ & mask;
+	unsigned mask = ctx->sq_entries - 1;
+	unsigned head = ctx->cached_sq_head++ & mask;
+
+	if (!(ctx->flags & IORING_SETUP_NO_SQARRAY)) {
+		head = READ_ONCE(ctx->sq_array[head]);
+		if (unlikely(head >= ctx->sq_entries)) {
+			/* drop invalid entries */
+			spin_lock(&ctx->completion_lock);
+			ctx->cq_extra--;
+			spin_unlock(&ctx->completion_lock);
+			WRITE_ONCE(ctx->rings->sq_dropped,
+				   READ_ONCE(ctx->rings->sq_dropped) + 1);
+			return false;
+		}
+	}
 
 	/*
 	 * The cached sq head (or cq tail) serves two purposes:
@@ -2350,22 +2363,12 @@ static bool io_get_sqe(struct io_ring_ctx *ctx, const struct io_uring_sqe **sqe)
 	 * 2) allows the kernel side to track the head on its own, even
 	 *    though the application is the one updating it.
 	 */
-	head = READ_ONCE(ctx->sq_array[sq_idx]);
-	if (likely(head < ctx->sq_entries)) {
-		/* double index for 128-byte SQEs, twice as long */
-		if (ctx->flags & IORING_SETUP_SQE128)
-			head <<= 1;
-		*sqe = &ctx->sq_sqes[head];
-		return true;
-	}
 
-	/* drop invalid entries */
-	spin_lock(&ctx->completion_lock);
-	ctx->cq_extra--;
-	spin_unlock(&ctx->completion_lock);
-	WRITE_ONCE(ctx->rings->sq_dropped,
-		   READ_ONCE(ctx->rings->sq_dropped) + 1);
-	return false;
+	/* double index for 128-byte SQEs, twice as long */
+	if (ctx->flags & IORING_SETUP_SQE128)
+		head <<= 1;
+	*sqe = &ctx->sq_sqes[head];
+	return true;
 }
 
 int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr)
@@ -2734,6 +2737,12 @@ static unsigned long rings_size(struct io_ring_ctx *ctx, unsigned int sq_entries
 		return SIZE_MAX;
 #endif
 
+	if (ctx->flags & IORING_SETUP_NO_SQARRAY) {
+		if (sq_offset)
+			*sq_offset = SIZE_MAX;
+		return off;
+	}
+
 	if (sq_offset)
 		*sq_offset = off;
 
@@ -3710,7 +3719,8 @@ static __cold int io_allocate_scq_urings(struct io_ring_ctx *ctx,
 		return PTR_ERR(rings);
 
 	ctx->rings = rings;
-	ctx->sq_array = (u32 *)((char *)rings + sq_array_offset);
+	if (!(ctx->flags & IORING_SETUP_NO_SQARRAY))
+		ctx->sq_array = (u32 *)((char *)rings + sq_array_offset);
 	rings->sq_ring_mask = p->sq_entries - 1;
 	rings->cq_ring_mask = p->cq_entries - 1;
 	rings->sq_ring_entries = p->sq_entries;
@@ -3921,7 +3931,8 @@ static __cold int io_uring_create(unsigned entries, struct io_uring_params *p,
 	p->sq_off.ring_entries = offsetof(struct io_rings, sq_ring_entries);
 	p->sq_off.flags = offsetof(struct io_rings, sq_flags);
 	p->sq_off.dropped = offsetof(struct io_rings, sq_dropped);
-	p->sq_off.array = (char *)ctx->sq_array - (char *)ctx->rings;
+	if (!(ctx->flags & IORING_SETUP_NO_SQARRAY))
+		p->sq_off.array = (char *)ctx->sq_array - (char *)ctx->rings;
 	p->sq_off.resv1 = 0;
 	if (!(ctx->flags & IORING_SETUP_NO_MMAP))
 		p->sq_off.user_addr = 0;
@@ -4010,7 +4021,8 @@ static long io_uring_setup(u32 entries, struct io_uring_params __user *params)
 			IORING_SETUP_COOP_TASKRUN | IORING_SETUP_TASKRUN_FLAG |
 			IORING_SETUP_SQE128 | IORING_SETUP_CQE32 |
 			IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN |
-			IORING_SETUP_NO_MMAP | IORING_SETUP_REGISTERED_FD_ONLY))
+			IORING_SETUP_NO_MMAP | IORING_SETUP_REGISTERED_FD_ONLY |
+			IORING_SETUP_NO_SQARRAY))
 		return -EINVAL;
 
 	return io_uring_create(entries, &p, params);
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v2 11/15] io_uring: move non aligned field to the end
  2023-08-24 22:53 [PATCH v2 00/15] caching and SQ/CQ optimisations Pavel Begunkov
                   ` (9 preceding siblings ...)
  2023-08-24 22:53 ` [PATCH v2 10/15] io_uring: add option to remove SQ indirection Pavel Begunkov
@ 2023-08-24 22:53 ` Pavel Begunkov
  2023-08-24 22:53 ` [PATCH v2 12/15] io_uring: banish non-hot data to end of io_ring_ctx Pavel Begunkov
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: Pavel Begunkov @ 2023-08-24 22:53 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

Move not cache aligned fields down in io_ring_ctx, should change
anything, but makes further refactoring easier.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 include/linux/io_uring_types.h | 36 +++++++++++++++++-----------------
 1 file changed, 18 insertions(+), 18 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 608a8e80e881..ad87d6074fb2 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -270,24 +270,6 @@ struct io_ring_ctx {
 		struct io_alloc_cache	netmsg_cache;
 	} ____cacheline_aligned_in_smp;
 
-	/* IRQ completion list, under ->completion_lock */
-	struct io_wq_work_list	locked_free_list;
-	unsigned int		locked_free_nr;
-
-	const struct cred	*sq_creds;	/* cred used for __io_sq_thread() */
-	struct io_sq_data	*sq_data;	/* if using sq thread polling */
-
-	struct wait_queue_head	sqo_sq_wait;
-	struct list_head	sqd_list;
-
-	unsigned long		check_cq;
-
-	unsigned int		file_alloc_start;
-	unsigned int		file_alloc_end;
-
-	struct xarray		personalities;
-	u32			pers_next;
-
 	struct {
 		/*
 		 * We cache a range of free CQEs we can use, once exhausted it
@@ -332,6 +314,24 @@ struct io_ring_ctx {
 		unsigned		cq_last_tm_flush;
 	} ____cacheline_aligned_in_smp;
 
+	/* IRQ completion list, under ->completion_lock */
+	struct io_wq_work_list	locked_free_list;
+	unsigned int		locked_free_nr;
+
+	const struct cred	*sq_creds;	/* cred used for __io_sq_thread() */
+	struct io_sq_data	*sq_data;	/* if using sq thread polling */
+
+	struct wait_queue_head	sqo_sq_wait;
+	struct list_head	sqd_list;
+
+	unsigned long		check_cq;
+
+	unsigned int		file_alloc_start;
+	unsigned int		file_alloc_end;
+
+	struct xarray		personalities;
+	u32			pers_next;
+
 	/* Keep this last, we don't need it for the fast path */
 	struct wait_queue_head		poll_wq;
 	struct io_restriction		restrictions;
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v2 12/15] io_uring: banish non-hot data to end of io_ring_ctx
  2023-08-24 22:53 [PATCH v2 00/15] caching and SQ/CQ optimisations Pavel Begunkov
                   ` (10 preceding siblings ...)
  2023-08-24 22:53 ` [PATCH v2 11/15] io_uring: move non aligned field to the end Pavel Begunkov
@ 2023-08-24 22:53 ` Pavel Begunkov
  2023-08-24 22:53 ` [PATCH v2 13/15] io_uring: separate task_work/waiting cache line Pavel Begunkov
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: Pavel Begunkov @ 2023-08-24 22:53 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

Let's move all slow path, setup/init and so on fields to the end of
io_ring_ctx, that makes ctx reorganisation later easier. That includes,
page arrays used only on tear down, CQ overflow list, old provided
buffer caches and used by io-wq poll hashes.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 include/linux/io_uring_types.h | 37 +++++++++++++++++-----------------
 1 file changed, 19 insertions(+), 18 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index ad87d6074fb2..72e609752323 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -211,20 +211,11 @@ struct io_ring_ctx {
 		unsigned int		drain_disabled: 1;
 		unsigned int		compat: 1;
 
-		enum task_work_notify_mode	notify_method;
+		struct task_struct	*submitter_task;
+		struct io_rings		*rings;
+		struct percpu_ref	refs;
 
-		/*
-		 * If IORING_SETUP_NO_MMAP is used, then the below holds
-		 * the gup'ed pages for the two rings, and the sqes.
-		 */
-		unsigned short		n_ring_pages;
-		unsigned short		n_sqe_pages;
-		struct page		**ring_pages;
-		struct page		**sqe_pages;
-
-		struct io_rings			*rings;
-		struct task_struct		*submitter_task;
-		struct percpu_ref		refs;
+		enum task_work_notify_mode	notify_method;
 	} ____cacheline_aligned_in_smp;
 
 	/* submission data */
@@ -262,10 +253,8 @@ struct io_ring_ctx {
 
 		struct io_buffer_list	*io_bl;
 		struct xarray		io_bl_xa;
-		struct list_head	io_buffers_cache;
 
 		struct io_hash_table	cancel_table_locked;
-		struct list_head	cq_overflow_list;
 		struct io_alloc_cache	apoll_cache;
 		struct io_alloc_cache	netmsg_cache;
 	} ____cacheline_aligned_in_smp;
@@ -298,11 +287,8 @@ struct io_ring_ctx {
 		 * manipulate the list, hence no extra locking is needed there.
 		 */
 		struct io_wq_work_list	iopoll_list;
-		struct io_hash_table	cancel_table;
 
 		struct llist_head	work_llist;
-
-		struct list_head	io_buffers_comp;
 	} ____cacheline_aligned_in_smp;
 
 	/* timeouts */
@@ -318,6 +304,10 @@ struct io_ring_ctx {
 	struct io_wq_work_list	locked_free_list;
 	unsigned int		locked_free_nr;
 
+	struct list_head	io_buffers_comp;
+	struct list_head	cq_overflow_list;
+	struct io_hash_table	cancel_table;
+
 	const struct cred	*sq_creds;	/* cred used for __io_sq_thread() */
 	struct io_sq_data	*sq_data;	/* if using sq thread polling */
 
@@ -332,6 +322,8 @@ struct io_ring_ctx {
 	struct xarray		personalities;
 	u32			pers_next;
 
+	struct list_head	io_buffers_cache;
+
 	/* Keep this last, we don't need it for the fast path */
 	struct wait_queue_head		poll_wq;
 	struct io_restriction		restrictions;
@@ -375,6 +367,15 @@ struct io_ring_ctx {
 	unsigned			sq_thread_idle;
 	/* protected by ->completion_lock */
 	unsigned			evfd_last_cq_tail;
+
+	/*
+	 * If IORING_SETUP_NO_MMAP is used, then the below holds
+	 * the gup'ed pages for the two rings, and the sqes.
+	 */
+	unsigned short			n_ring_pages;
+	unsigned short			n_sqe_pages;
+	struct page			**ring_pages;
+	struct page			**sqe_pages;
 };
 
 struct io_tw_state {
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v2 13/15] io_uring: separate task_work/waiting cache line
  2023-08-24 22:53 [PATCH v2 00/15] caching and SQ/CQ optimisations Pavel Begunkov
                   ` (11 preceding siblings ...)
  2023-08-24 22:53 ` [PATCH v2 12/15] io_uring: banish non-hot data to end of io_ring_ctx Pavel Begunkov
@ 2023-08-24 22:53 ` Pavel Begunkov
  2023-08-24 22:53 ` [PATCH v2 14/15] io_uring: move multishot cqe cache in ctx Pavel Begunkov
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: Pavel Begunkov @ 2023-08-24 22:53 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

task_work's are typically queued up from IRQ/softirq potentially by a
random CPU like in case of networking. Batch ctx fields bouncing as this
into a separate cache line.

We also move ->cq_timeouts there because waiters have to read and check
it. We can also conditionally hide ->cq_timeouts in the future from the
CQ wait path as a not really useful rudiment.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 include/linux/io_uring_types.h | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 72e609752323..5de5dffe29df 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -270,15 +270,25 @@ struct io_ring_ctx {
 		unsigned		cached_cq_tail;
 		unsigned		cq_entries;
 		struct io_ev_fd	__rcu	*io_ev_fd;
-		struct wait_queue_head	cq_wait;
 		unsigned		cq_extra;
 	} ____cacheline_aligned_in_smp;
 
+	/*
+	 * task_work and async notification delivery cacheline. Expected to
+	 * regularly bounce b/w CPUs.
+	 */
+	struct {
+		struct llist_head	work_llist;
+		unsigned long		check_cq;
+		atomic_t		cq_wait_nr;
+		atomic_t		cq_timeouts;
+		struct wait_queue_head	cq_wait;
+	} ____cacheline_aligned_in_smp;
+
 	struct {
 		spinlock_t		completion_lock;
 
 		bool			poll_multi_queue;
-		atomic_t		cq_wait_nr;
 
 		/*
 		 * ->iopoll_list is protected by the ctx->uring_lock for
@@ -287,14 +297,11 @@ struct io_ring_ctx {
 		 * manipulate the list, hence no extra locking is needed there.
 		 */
 		struct io_wq_work_list	iopoll_list;
-
-		struct llist_head	work_llist;
 	} ____cacheline_aligned_in_smp;
 
 	/* timeouts */
 	struct {
 		spinlock_t		timeout_lock;
-		atomic_t		cq_timeouts;
 		struct list_head	timeout_list;
 		struct list_head	ltimeout_list;
 		unsigned		cq_last_tm_flush;
@@ -314,8 +321,6 @@ struct io_ring_ctx {
 	struct wait_queue_head	sqo_sq_wait;
 	struct list_head	sqd_list;
 
-	unsigned long		check_cq;
-
 	unsigned int		file_alloc_start;
 	unsigned int		file_alloc_end;
 
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v2 14/15] io_uring: move multishot cqe cache in ctx
  2023-08-24 22:53 [PATCH v2 00/15] caching and SQ/CQ optimisations Pavel Begunkov
                   ` (12 preceding siblings ...)
  2023-08-24 22:53 ` [PATCH v2 13/15] io_uring: separate task_work/waiting cache line Pavel Begunkov
@ 2023-08-24 22:53 ` Pavel Begunkov
  2023-08-24 22:53 ` [PATCH v2 15/15] io_uring: move iopoll ctx fields around Pavel Begunkov
  2023-08-24 23:16 ` [PATCH v2 00/15] caching and SQ/CQ optimisations Jens Axboe
  15 siblings, 0 replies; 17+ messages in thread
From: Pavel Begunkov @ 2023-08-24 22:53 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

We cache multishot CQEs before flushing them to the CQ in
submit_state.cqe. It's a 16 entry cache totalling 256 bytes in the
middle of the io_submit_state structure. Move it out of there, it
should help with CPU caches for the submission state, and shouldn't
affect cached CQEs.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 include/linux/io_uring_types.h | 3 ++-
 io_uring/io_uring.c            | 6 +++---
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 5de5dffe29df..01bdbc223edd 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -176,7 +176,6 @@ struct io_submit_state {
 	unsigned short		submit_nr;
 	unsigned int		cqes_count;
 	struct blk_plug		plug;
-	struct io_uring_cqe	cqes[16];
 };
 
 struct io_ev_fd {
@@ -307,6 +306,8 @@ struct io_ring_ctx {
 		unsigned		cq_last_tm_flush;
 	} ____cacheline_aligned_in_smp;
 
+	struct io_uring_cqe	completion_cqes[16];
+
 	/* IRQ completion list, under ->completion_lock */
 	struct io_wq_work_list	locked_free_list;
 	unsigned int		locked_free_nr;
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index a6eea3938802..88599852af82 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -880,7 +880,7 @@ static void __io_flush_post_cqes(struct io_ring_ctx *ctx)
 
 	lockdep_assert_held(&ctx->uring_lock);
 	for (i = 0; i < state->cqes_count; i++) {
-		struct io_uring_cqe *cqe = &state->cqes[i];
+		struct io_uring_cqe *cqe = &ctx->completion_cqes[i];
 
 		if (!io_fill_cqe_aux(ctx, cqe->user_data, cqe->res, cqe->flags)) {
 			if (ctx->task_complete) {
@@ -931,7 +931,7 @@ bool io_fill_cqe_req_aux(struct io_kiocb *req, bool defer, s32 res, u32 cflags)
 
 	lockdep_assert_held(&ctx->uring_lock);
 
-	if (ctx->submit_state.cqes_count == ARRAY_SIZE(ctx->submit_state.cqes)) {
+	if (ctx->submit_state.cqes_count == ARRAY_SIZE(ctx->completion_cqes)) {
 		__io_cq_lock(ctx);
 		__io_flush_post_cqes(ctx);
 		/* no need to flush - flush is deferred */
@@ -945,7 +945,7 @@ bool io_fill_cqe_req_aux(struct io_kiocb *req, bool defer, s32 res, u32 cflags)
 	if (test_bit(IO_CHECK_CQ_OVERFLOW_BIT, &ctx->check_cq))
 		return false;
 
-	cqe = &ctx->submit_state.cqes[ctx->submit_state.cqes_count++];
+	cqe = &ctx->completion_cqes[ctx->submit_state.cqes_count++];
 	cqe->user_data = user_data;
 	cqe->res = res;
 	cqe->flags = cflags;
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v2 15/15] io_uring: move iopoll ctx fields around
  2023-08-24 22:53 [PATCH v2 00/15] caching and SQ/CQ optimisations Pavel Begunkov
                   ` (13 preceding siblings ...)
  2023-08-24 22:53 ` [PATCH v2 14/15] io_uring: move multishot cqe cache in ctx Pavel Begunkov
@ 2023-08-24 22:53 ` Pavel Begunkov
  2023-08-24 23:16 ` [PATCH v2 00/15] caching and SQ/CQ optimisations Jens Axboe
  15 siblings, 0 replies; 17+ messages in thread
From: Pavel Begunkov @ 2023-08-24 22:53 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

Move poll_multi_queue and iopoll_list to the submission cache line, it
doesn't make much sense to keep them separately, and is better place
for it in general.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 include/linux/io_uring_types.h | 25 +++++++++++--------------
 1 file changed, 11 insertions(+), 14 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 01bdbc223edd..13d19b9be9f4 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -256,6 +256,15 @@ struct io_ring_ctx {
 		struct io_hash_table	cancel_table_locked;
 		struct io_alloc_cache	apoll_cache;
 		struct io_alloc_cache	netmsg_cache;
+
+		/*
+		 * ->iopoll_list is protected by the ctx->uring_lock for
+		 * io_uring instances that don't use IORING_SETUP_SQPOLL.
+		 * For SQPOLL, only the single threaded io_sq_thread() will
+		 * manipulate the list, hence no extra locking is needed there.
+		 */
+		struct io_wq_work_list	iopoll_list;
+		bool			poll_multi_queue;
 	} ____cacheline_aligned_in_smp;
 
 	struct {
@@ -284,20 +293,6 @@ struct io_ring_ctx {
 		struct wait_queue_head	cq_wait;
 	} ____cacheline_aligned_in_smp;
 
-	struct {
-		spinlock_t		completion_lock;
-
-		bool			poll_multi_queue;
-
-		/*
-		 * ->iopoll_list is protected by the ctx->uring_lock for
-		 * io_uring instances that don't use IORING_SETUP_SQPOLL.
-		 * For SQPOLL, only the single threaded io_sq_thread() will
-		 * manipulate the list, hence no extra locking is needed there.
-		 */
-		struct io_wq_work_list	iopoll_list;
-	} ____cacheline_aligned_in_smp;
-
 	/* timeouts */
 	struct {
 		spinlock_t		timeout_lock;
@@ -308,6 +303,8 @@ struct io_ring_ctx {
 
 	struct io_uring_cqe	completion_cqes[16];
 
+	spinlock_t		completion_lock;
+
 	/* IRQ completion list, under ->completion_lock */
 	struct io_wq_work_list	locked_free_list;
 	unsigned int		locked_free_nr;
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 00/15] caching and SQ/CQ optimisations
  2023-08-24 22:53 [PATCH v2 00/15] caching and SQ/CQ optimisations Pavel Begunkov
                   ` (14 preceding siblings ...)
  2023-08-24 22:53 ` [PATCH v2 15/15] io_uring: move iopoll ctx fields around Pavel Begunkov
@ 2023-08-24 23:16 ` Jens Axboe
  15 siblings, 0 replies; 17+ messages in thread
From: Jens Axboe @ 2023-08-24 23:16 UTC (permalink / raw)
  To: io-uring, Pavel Begunkov


On Thu, 24 Aug 2023 23:53:22 +0100, Pavel Begunkov wrote:
> Patch 1-5 optimise io_fill_cqe_req
> 
> Patch 6-7 combine iopoll and normal completion paths
> 
> Patch 8 inlines io_fill_cqe_req.
> 
> Patch 9 should improve CPU caching of SQ/CQ pointers
> 
> [...]

Applied, thanks!

[01/15] io_uring: improve cqe !tracing hot path
        commit: a0727c738309a06ef5579c1742f8f0def63aa883
[02/15] io_uring: cqe init hardening
        commit: 31d3ba924fd86add6d14f9085fdd2f4ec0879631
[03/15] io_uring: simplify big_cqe handling
        commit: b24c5d752962fa0970cd7e3d74b1cd0e843358de
[04/15] io_uring: refactor __io_get_cqe()
        commit: 20d6b633870495fda1d92d283ebf890d80f68ecd
[05/15] io_uring: optimise extra io_get_cqe null check
        commit: 59fbc409e71649f558fb4578cdbfac67acb824dc
[06/15] io_uring: reorder cqring_flush and wakeups
        commit: 54927baf6c195fb512ac38b26a041ca44edb2e29
[07/15] io_uring: merge iopoll and normal completion paths
        commit: ec26c225f06f5993f8891fa6c79fab3c92981181
[08/15] io_uring: force inline io_fill_cqe_req
        commit: 093a650b757210bc856ca7f5349fb5a4bb9d4bd6
[09/15] io_uring: compact SQ/CQ heads/tails
        commit: e5598d6ae62626d261b046a2f19347c38681ff51
[10/15] io_uring: add option to remove SQ indirection
        commit: 2af89abda7d9c2aeb573677e2c498ddb09f8058a
[11/15] io_uring: move non aligned field to the end
        commit: d7f06fea5d6be78403d42c9637f67bc883870094
[12/15] io_uring: banish non-hot data to end of io_ring_ctx
        commit: 18df385f42f0b3310ed2e4a3e39264bf5e784692
[13/15] io_uring: separate task_work/waiting cache line
        commit: c9def23dde5238184777340ad811e4903f216a2d
[14/15] io_uring: move multishot cqe cache in ctx
        commit: 0aa7aa5f766933d4f91b22d9658cd688e1f15dab
[15/15] io_uring: move iopoll ctx fields around
        commit: 644c4a7a721fb90356cdd42219c9928a3c386230

Best regards,
-- 
Jens Axboe




^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2023-08-24 23:17 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-08-24 22:53 [PATCH v2 00/15] caching and SQ/CQ optimisations Pavel Begunkov
2023-08-24 22:53 ` [PATCH v2 01/15] io_uring: improve cqe !tracing hot path Pavel Begunkov
2023-08-24 22:53 ` [PATCH v2 02/15] io_uring: cqe init hardening Pavel Begunkov
2023-08-24 22:53 ` [PATCH v2 03/15] io_uring: simplify big_cqe handling Pavel Begunkov
2023-08-24 22:53 ` [PATCH v2 04/15] io_uring: refactor __io_get_cqe() Pavel Begunkov
2023-08-24 22:53 ` [PATCH v2 05/15] io_uring: optimise extra io_get_cqe null check Pavel Begunkov
2023-08-24 22:53 ` [PATCH v2 06/15] io_uring: reorder cqring_flush and wakeups Pavel Begunkov
2023-08-24 22:53 ` [PATCH v2 07/15] io_uring: merge iopoll and normal completion paths Pavel Begunkov
2023-08-24 22:53 ` [PATCH v2 08/15] io_uring: force inline io_fill_cqe_req Pavel Begunkov
2023-08-24 22:53 ` [PATCH v2 09/15] io_uring: compact SQ/CQ heads/tails Pavel Begunkov
2023-08-24 22:53 ` [PATCH v2 10/15] io_uring: add option to remove SQ indirection Pavel Begunkov
2023-08-24 22:53 ` [PATCH v2 11/15] io_uring: move non aligned field to the end Pavel Begunkov
2023-08-24 22:53 ` [PATCH v2 12/15] io_uring: banish non-hot data to end of io_ring_ctx Pavel Begunkov
2023-08-24 22:53 ` [PATCH v2 13/15] io_uring: separate task_work/waiting cache line Pavel Begunkov
2023-08-24 22:53 ` [PATCH v2 14/15] io_uring: move multishot cqe cache in ctx Pavel Begunkov
2023-08-24 22:53 ` [PATCH v2 15/15] io_uring: move iopoll ctx fields around Pavel Begunkov
2023-08-24 23:16 ` [PATCH v2 00/15] caching and SQ/CQ optimisations Jens Axboe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox