public inbox for [email protected]
* [RFC 00/16] caching and SQ/CQ optimisations
@ 2023-08-15 17:31 Pavel Begunkov
  2023-08-15 17:31 ` [PATCH 01/16] io_uring: improve cqe !tracing hot path Pavel Begunkov
                   ` (15 more replies)
  0 siblings, 16 replies; 23+ messages in thread
From: Pavel Begunkov @ 2023-08-15 17:31 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

Patches 1-5 optimise io_fill_cqe_req.

Patches 6-7 combine the iopoll and normal completion paths.

Patch 8 should improve CPU caching of SQ/CQ pointers

Patch 9 conditionally removes the SQ indirection (->sq_array). Assuming
we'll make that the default in liburing, Patch 10 optimises it with a
static_key.

Patches 11-15 shuffle io_ring_ctx fields.

Patch 16 inlines io_fill_cqe_req.

Testing is with t/io_uring nops only for now:

                QD2     QD4     QD8     QD16    QD32
baseline:       17.3    26.6    36.4    43.7    49.4
Patches 1-15:   17.8    27.4    37.9    45.8    51.2
Patches 1-16:   17.9    28.2    39.3    47.8    54

L1 load misses decreased from 1.7% to 1.3%. I don't think that's
significant by itself, and it will be more interesting to see how it
looks when we do actual IO.
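
For reference, numbers like the L1 miss rate above can be gathered with
perf stat. A rough sketch of such an invocation (the t/io_uring flags
are illustrative assumptions, not the exact command used):

	# -N for no-op requests and -d for queue depth are assumed flags
	perf stat -e L1-dcache-loads,L1-dcache-load-misses -- \
		./t/io_uring -N1 -d32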

Pavel Begunkov (16):
  io_uring: improve cqe !tracing hot path
  io_uring: cqe init hardening
  io_uring: simplify big_cqe handling
  io_uring: refactor __io_get_cqe()
  io_uring: optimise extra io_get_cqe null check
  io_uring: reorder cqring_flush and wakeups
  io_uring: merge iopoll and normal completion paths
  io_uring: compact SQ/CQ heads/tails
  io_uring: add option to remove SQ indirection
  io_uring: static_key for !IORING_SETUP_NO_SQARRAY
  io_uring: move non aligned field to the end
  io_uring: banish non-hot data to end of io_ring_ctx
  io_uring: separate task_work/waiting cache line
  io_uring: move multishot cqe cache in ctx
  io_uring: move iopoll ctx fields around
  io_uring: force inline io_fill_cqe_req

 include/linux/io_uring_types.h | 129 ++++++++++++++++----------------
 include/uapi/linux/io_uring.h  |   5 ++
 io_uring/io_uring.c            | 130 ++++++++++++++++++---------------
 io_uring/io_uring.h            |  58 +++++++--------
 io_uring/rw.c                  |  24 ++----
 io_uring/uring_cmd.c           |   5 +-
 6 files changed, 173 insertions(+), 178 deletions(-)

-- 
2.41.0


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 01/16] io_uring: improve cqe !tracing hot path
  2023-08-15 17:31 [RFC 00/16] caching and SQ/CQ optimisations Pavel Begunkov
@ 2023-08-15 17:31 ` Pavel Begunkov
  2023-08-15 17:31 ` [PATCH 02/16] io_uring: cqe init hardening Pavel Begunkov
                   ` (14 subsequent siblings)
  15 siblings, 0 replies; 23+ messages in thread
From: Pavel Begunkov @ 2023-08-15 17:31 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

While looking at io_fill_cqe_req()'s asm I stumbled on our trace points
turning into the chunk below:

trace_io_uring_complete(req->ctx, req, req->cqe.user_data,
			req->cqe.res, req->cqe.flags,
			req->extra1, req->extra2);

io_uring/io_uring.c:898: 	trace_io_uring_complete(req->ctx, req, req->cqe.user_data,
	movq	232(%rbx), %rdi	# req_44(D)->big_cqe.extra2, _5
	movq	224(%rbx), %rdx	# req_44(D)->big_cqe.extra1, _6
	movl	84(%rbx), %r9d	# req_44(D)->cqe.D.81184.flags, _7
	movl	80(%rbx), %r8d	# req_44(D)->cqe.res, _8
	movq	72(%rbx), %rcx	# req_44(D)->cqe.user_data, _9
	movq	88(%rbx), %rsi	# req_44(D)->ctx, _10
./arch/x86/include/asm/jump_label.h:27: 	asm_volatile_goto("1:"
	1:jmp .L1772 # objtool NOPs this 	#
	...

It does a jump_label check for the actual tracing, but those 6 moves to
materialise the arguments stay in the hottest io_uring path. As an
optimisation, add a trace_io_uring_complete_enabled() check, which also
uses jump_labels; it nudges the compiler into keeping the argument loads
behind the branch. That removes the junk without changing anything else
in the hot path.

Note: apparently I'm not the only one noticing this, and people are
already working around it. We should remove the check once it's solved
generically or tracing is reworked.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 io_uring/io_uring.h | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 3e6ff3cd9a24..465598223386 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -145,10 +145,11 @@ static inline bool io_fill_cqe_req(struct io_ring_ctx *ctx, struct io_kiocb *req
 	if (unlikely(!cqe))
 		return false;
 
-	trace_io_uring_complete(req->ctx, req, req->cqe.user_data,
-				req->cqe.res, req->cqe.flags,
-				(req->flags & REQ_F_CQE32_INIT) ? req->extra1 : 0,
-				(req->flags & REQ_F_CQE32_INIT) ? req->extra2 : 0);
+	if (trace_io_uring_complete_enabled())
+		trace_io_uring_complete(req->ctx, req, req->cqe.user_data,
+					req->cqe.res, req->cqe.flags,
+					(req->flags & REQ_F_CQE32_INIT) ? req->extra1 : 0,
+					(req->flags & REQ_F_CQE32_INIT) ? req->extra2 : 0);
 
 	memcpy(cqe, &req->cqe, sizeof(*cqe));
 
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 02/16] io_uring: cqe init hardening
  2023-08-15 17:31 [RFC 00/16] caching and SQ/CQ optimisations Pavel Begunkov
  2023-08-15 17:31 ` [PATCH 01/16] io_uring: improve cqe !tracing hot path Pavel Begunkov
@ 2023-08-15 17:31 ` Pavel Begunkov
  2023-08-19 15:03   ` Jens Axboe
  2023-08-15 17:31 ` [PATCH 03/16] io_uring: simplify big_cqe handling Pavel Begunkov
                   ` (13 subsequent siblings)
  15 siblings, 1 reply; 23+ messages in thread
From: Pavel Begunkov @ 2023-08-15 17:31 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

io_kiocb::cqe stores the completion info which we'll memcpy to
userspace, and we rely on callbacks and other later steps to populate
it with right values. We have never had problems with that, but it would
still be safer to zero it on allocation.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 io_uring/io_uring.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index e189158ebbdd..4d27655be3a6 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1056,7 +1056,7 @@ static void io_preinit_req(struct io_kiocb *req, struct io_ring_ctx *ctx)
 	req->link = NULL;
 	req->async_data = NULL;
 	/* not necessary, but safer to zero */
-	req->cqe.res = 0;
+	memset(&req->cqe, 0, sizeof(req->cqe));
 }
 
 static void io_flush_cached_locked_reqs(struct io_ring_ctx *ctx,
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 03/16] io_uring: simplify big_cqe handling
  2023-08-15 17:31 [RFC 00/16] caching and SQ/CQ optimisations Pavel Begunkov
  2023-08-15 17:31 ` [PATCH 01/16] io_uring: improve cqe !tracing hot path Pavel Begunkov
  2023-08-15 17:31 ` [PATCH 02/16] io_uring: cqe init hardening Pavel Begunkov
@ 2023-08-15 17:31 ` Pavel Begunkov
  2023-08-15 17:31 ` [PATCH 04/16] io_uring: refactor __io_get_cqe() Pavel Begunkov
                   ` (12 subsequent siblings)
  15 siblings, 0 replies; 23+ messages in thread
From: Pavel Begunkov @ 2023-08-15 17:31 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

Don't keep the big_cqe bits of a request in a union with hash_node;
give them their own space instead. It's a bit safer, and if we keep them
always initialised we can also get rid of the ugly REQ_F_CQE32_INIT
handling.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 include/linux/io_uring_types.h | 16 ++++++----------
 io_uring/io_uring.c            |  8 +++-----
 io_uring/io_uring.h            | 15 +++------------
 io_uring/uring_cmd.c           |  5 ++---
 4 files changed, 14 insertions(+), 30 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index f04ce513fadb..9795eda529f7 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -409,7 +409,6 @@ enum {
 	REQ_F_SINGLE_POLL_BIT,
 	REQ_F_DOUBLE_POLL_BIT,
 	REQ_F_PARTIAL_IO_BIT,
-	REQ_F_CQE32_INIT_BIT,
 	REQ_F_APOLL_MULTISHOT_BIT,
 	REQ_F_CLEAR_POLLIN_BIT,
 	REQ_F_HASH_LOCKED_BIT,
@@ -479,8 +478,6 @@ enum {
 	REQ_F_PARTIAL_IO	= BIT(REQ_F_PARTIAL_IO_BIT),
 	/* fast poll multishot mode */
 	REQ_F_APOLL_MULTISHOT	= BIT(REQ_F_APOLL_MULTISHOT_BIT),
-	/* ->extra1 and ->extra2 are initialised */
-	REQ_F_CQE32_INIT	= BIT(REQ_F_CQE32_INIT_BIT),
 	/* recvmsg special flag, clear EPOLLIN */
 	REQ_F_CLEAR_POLLIN	= BIT(REQ_F_CLEAR_POLLIN_BIT),
 	/* hashed into ->cancel_hash_locked, protected by ->uring_lock */
@@ -579,13 +576,7 @@ struct io_kiocb {
 	struct io_task_work		io_task_work;
 	unsigned			nr_tw;
 	/* for polled requests, i.e. IORING_OP_POLL_ADD and async armed poll */
-	union {
-		struct hlist_node	hash_node;
-		struct {
-			u64		extra1;
-			u64		extra2;
-		};
-	};
+	struct hlist_node		hash_node;
 	/* internal polling, see IORING_FEAT_FAST_POLL */
 	struct async_poll		*apoll;
 	/* opcode allocated if it needs to store data for async defer */
@@ -595,6 +586,11 @@ struct io_kiocb {
 	/* custom credentials, valid IFF REQ_F_CREDS is set */
 	const struct cred		*creds;
 	struct io_wq_work		work;
+
+	struct {
+		u64			extra1;
+		u64			extra2;
+	} big_cqe;
 };
 
 struct io_overflow_cqe {
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 4d27655be3a6..20b46e64cc07 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -807,13 +807,10 @@ static bool io_cqring_event_overflow(struct io_ring_ctx *ctx, u64 user_data,
 
 void io_req_cqe_overflow(struct io_kiocb *req)
 {
-	if (!(req->flags & REQ_F_CQE32_INIT)) {
-		req->extra1 = 0;
-		req->extra2 = 0;
-	}
 	io_cqring_event_overflow(req->ctx, req->cqe.user_data,
 				req->cqe.res, req->cqe.flags,
-				req->extra1, req->extra2);
+				req->big_cqe.extra1, req->big_cqe.extra2);
+	memset(&req->big_cqe, 0, sizeof(req->big_cqe));
 }
 
 /*
@@ -1057,6 +1054,7 @@ static void io_preinit_req(struct io_kiocb *req, struct io_ring_ctx *ctx)
 	req->async_data = NULL;
 	/* not necessary, but safer to zero */
 	memset(&req->cqe, 0, sizeof(req->cqe));
+	memset(&req->big_cqe, 0, sizeof(req->big_cqe));
 }
 
 static void io_flush_cached_locked_reqs(struct io_ring_ctx *ctx,
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 465598223386..9b5dfb6ef484 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -148,21 +148,12 @@ static inline bool io_fill_cqe_req(struct io_ring_ctx *ctx, struct io_kiocb *req
 	if (trace_io_uring_complete_enabled())
 		trace_io_uring_complete(req->ctx, req, req->cqe.user_data,
 					req->cqe.res, req->cqe.flags,
-					(req->flags & REQ_F_CQE32_INIT) ? req->extra1 : 0,
-					(req->flags & REQ_F_CQE32_INIT) ? req->extra2 : 0);
+					req->big_cqe.extra1, req->big_cqe.extra2);
 
 	memcpy(cqe, &req->cqe, sizeof(*cqe));
-
 	if (ctx->flags & IORING_SETUP_CQE32) {
-		u64 extra1 = 0, extra2 = 0;
-
-		if (req->flags & REQ_F_CQE32_INIT) {
-			extra1 = req->extra1;
-			extra2 = req->extra2;
-		}
-
-		WRITE_ONCE(cqe->big_cqe[0], extra1);
-		WRITE_ONCE(cqe->big_cqe[1], extra2);
+		memcpy(cqe->big_cqe, &req->big_cqe, sizeof(*cqe));
+		memset(&req->big_cqe, 0, sizeof(req->big_cqe));
 	}
 	return true;
 }
diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
index 8e7a03c1b20e..537795fddc87 100644
--- a/io_uring/uring_cmd.c
+++ b/io_uring/uring_cmd.c
@@ -43,9 +43,8 @@ EXPORT_SYMBOL_GPL(io_uring_cmd_do_in_task_lazy);
 static inline void io_req_set_cqe32_extra(struct io_kiocb *req,
 					  u64 extra1, u64 extra2)
 {
-	req->extra1 = extra1;
-	req->extra2 = extra2;
-	req->flags |= REQ_F_CQE32_INIT;
+	req->big_cqe.extra1 = extra1;
+	req->big_cqe.extra2 = extra2;
 }
 
 /*
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 04/16] io_uring: refactor __io_get_cqe()
  2023-08-15 17:31 [RFC 00/16] caching and SQ/CQ optimisations Pavel Begunkov
                   ` (2 preceding siblings ...)
  2023-08-15 17:31 ` [PATCH 03/16] io_uring: simplify big_cqe handling Pavel Begunkov
@ 2023-08-15 17:31 ` Pavel Begunkov
  2023-08-15 17:31 ` [PATCH 05/16] io_uring: optimise extra io_get_cqe null check Pavel Begunkov
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 23+ messages in thread
From: Pavel Begunkov @ 2023-08-15 17:31 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

Make __io_get_cqe() simpler by not grabbing the cqe from the refilled
cache, but letting io_get_cqe() do it for us. That's cleaner and removes
some duplication.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 io_uring/io_uring.c | 13 ++++---------
 io_uring/io_uring.h | 23 ++++++++++++-----------
 2 files changed, 16 insertions(+), 20 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 20b46e64cc07..623d41755714 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -818,7 +818,7 @@ void io_req_cqe_overflow(struct io_kiocb *req)
  * control dependency is enough as we're using WRITE_ONCE to
  * fill the cq entry
  */
-struct io_uring_cqe *__io_get_cqe(struct io_ring_ctx *ctx, bool overflow)
+bool io_cqe_cache_refill(struct io_ring_ctx *ctx, bool overflow)
 {
 	struct io_rings *rings = ctx->rings;
 	unsigned int off = ctx->cached_cq_tail & (ctx->cq_entries - 1);
@@ -830,7 +830,7 @@ struct io_uring_cqe *__io_get_cqe(struct io_ring_ctx *ctx, bool overflow)
 	 * Force overflow the completion.
 	 */
 	if (!overflow && (ctx->check_cq & BIT(IO_CHECK_CQ_OVERFLOW_BIT)))
-		return NULL;
+		return false;
 
 	/* userspace may cheat modifying the tail, be safe and do min */
 	queued = min(__io_cqring_events(ctx), ctx->cq_entries);
@@ -838,7 +838,7 @@ struct io_uring_cqe *__io_get_cqe(struct io_ring_ctx *ctx, bool overflow)
 	/* we need a contiguous range, limit based on the current array offset */
 	len = min(free, ctx->cq_entries - off);
 	if (!len)
-		return NULL;
+		return false;
 
 	if (ctx->flags & IORING_SETUP_CQE32) {
 		off <<= 1;
@@ -847,12 +847,7 @@ struct io_uring_cqe *__io_get_cqe(struct io_ring_ctx *ctx, bool overflow)
 
 	ctx->cqe_cached = &rings->cqes[off];
 	ctx->cqe_sentinel = ctx->cqe_cached + len;
-
-	ctx->cached_cq_tail++;
-	ctx->cqe_cached++;
-	if (ctx->flags & IORING_SETUP_CQE32)
-		ctx->cqe_cached++;
-	return &rings->cqes[off];
+	return true;
 }
 
 static bool io_fill_cqe_aux(struct io_ring_ctx *ctx, u64 user_data, s32 res,
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 9b5dfb6ef484..9c80d20fe18f 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -38,7 +38,7 @@ enum {
 	IOU_STOP_MULTISHOT	= -ECANCELED,
 };
 
-struct io_uring_cqe *__io_get_cqe(struct io_ring_ctx *ctx, bool overflow);
+bool io_cqe_cache_refill(struct io_ring_ctx *ctx, bool overflow);
 void io_req_cqe_overflow(struct io_kiocb *req);
 int io_run_task_work_sig(struct io_ring_ctx *ctx);
 void io_req_defer_failed(struct io_kiocb *req, s32 res);
@@ -112,19 +112,20 @@ static inline void io_req_task_work_add(struct io_kiocb *req)
 static inline struct io_uring_cqe *io_get_cqe_overflow(struct io_ring_ctx *ctx,
 						       bool overflow)
 {
-	io_lockdep_assert_cq_locked(ctx);
+	struct io_uring_cqe *cqe;
 
-	if (likely(ctx->cqe_cached < ctx->cqe_sentinel)) {
-		struct io_uring_cqe *cqe = ctx->cqe_cached;
+	io_lockdep_assert_cq_locked(ctx);
 
-		ctx->cached_cq_tail++;
-		ctx->cqe_cached++;
-		if (ctx->flags & IORING_SETUP_CQE32)
-			ctx->cqe_cached++;
-		return cqe;
+	if (unlikely(ctx->cqe_cached >= ctx->cqe_sentinel)) {
+		if (unlikely(!io_cqe_cache_refill(ctx, overflow)))
+			return NULL;
 	}
-
-	return __io_get_cqe(ctx, overflow);
+	cqe = ctx->cqe_cached;
+	ctx->cached_cq_tail++;
+	ctx->cqe_cached++;
+	if (ctx->flags & IORING_SETUP_CQE32)
+		ctx->cqe_cached++;
+	return cqe;
 }
 
 static inline struct io_uring_cqe *io_get_cqe(struct io_ring_ctx *ctx)
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 05/16] io_uring: optimise extra io_get_cqe null check
  2023-08-15 17:31 [RFC 00/16] caching and SQ/CQ optimisations Pavel Begunkov
                   ` (3 preceding siblings ...)
  2023-08-15 17:31 ` [PATCH 04/16] io_uring: refactor __io_get_cqe() Pavel Begunkov
@ 2023-08-15 17:31 ` Pavel Begunkov
  2023-08-15 17:31 ` [PATCH 06/16] io_uring: reorder cqring_flush and wakeups Pavel Begunkov
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 23+ messages in thread
From: Pavel Begunkov @ 2023-08-15 17:31 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

If the cached cqe check passes in io_get_cqe*(), it already means that
the cqe we return is valid and non-NULL; however, the compiler is unable
to optimise away NULL checks in callers such as io_fill_cqe_req().

Do a bit of trickery: return a success/fail boolean from io_get_cqe*()
and store the cqe in an out parameter. That makes the compiler do the
right thing, erasing the check together with the introduced indirection.
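
Caller-side, the change boils down to the following juxtaposition (a
sketch mirroring the io_fill_cqe_req() hunk below):

	/* before: pointer return, so the caller re-tests for NULL */
	cqe = io_get_cqe(ctx);
	if (unlikely(!cqe))
		return false;

	/* after: branch on the helper's success flag; the redundant
	 * pointer check disappears from the generated code */
	if (unlikely(!io_get_cqe(ctx, &cqe)))
		return false;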

Signed-off-by: Pavel Begunkov <[email protected]>
---
 io_uring/io_uring.c |  7 +++----
 io_uring/io_uring.h | 20 +++++++++-----------
 2 files changed, 12 insertions(+), 15 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 623d41755714..e5378dc7aa19 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -683,10 +683,10 @@ static void __io_cqring_overflow_flush(struct io_ring_ctx *ctx)
 
 	io_cq_lock(ctx);
 	while (!list_empty(&ctx->cq_overflow_list)) {
-		struct io_uring_cqe *cqe = io_get_cqe_overflow(ctx, true);
+		struct io_uring_cqe *cqe;
 		struct io_overflow_cqe *ocqe;
 
-		if (!cqe)
+		if (!io_get_cqe_overflow(ctx, &cqe, true))
 			break;
 		ocqe = list_first_entry(&ctx->cq_overflow_list,
 					struct io_overflow_cqe, list);
@@ -862,8 +862,7 @@ static bool io_fill_cqe_aux(struct io_ring_ctx *ctx, u64 user_data, s32 res,
 	 * submission (by quite a lot). Increment the overflow count in
 	 * the ring.
 	 */
-	cqe = io_get_cqe(ctx);
-	if (likely(cqe)) {
+	if (likely(io_get_cqe(ctx, &cqe))) {
 		trace_io_uring_complete(ctx, NULL, user_data, res, cflags, 0, 0);
 
 		WRITE_ONCE(cqe->user_data, user_data);
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 9c80d20fe18f..2960e35b32a5 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -109,28 +109,27 @@ static inline void io_req_task_work_add(struct io_kiocb *req)
 #define io_for_each_link(pos, head) \
 	for (pos = (head); pos; pos = pos->link)
 
-static inline struct io_uring_cqe *io_get_cqe_overflow(struct io_ring_ctx *ctx,
-						       bool overflow)
+static inline bool io_get_cqe_overflow(struct io_ring_ctx *ctx,
+					struct io_uring_cqe **ret,
+					bool overflow)
 {
-	struct io_uring_cqe *cqe;
-
 	io_lockdep_assert_cq_locked(ctx);
 
 	if (unlikely(ctx->cqe_cached >= ctx->cqe_sentinel)) {
 		if (unlikely(!io_cqe_cache_refill(ctx, overflow)))
-			return NULL;
+			return false;
 	}
-	cqe = ctx->cqe_cached;
+	*ret = ctx->cqe_cached;
 	ctx->cached_cq_tail++;
 	ctx->cqe_cached++;
 	if (ctx->flags & IORING_SETUP_CQE32)
 		ctx->cqe_cached++;
-	return cqe;
+	return true;
 }
 
-static inline struct io_uring_cqe *io_get_cqe(struct io_ring_ctx *ctx)
+static inline bool io_get_cqe(struct io_ring_ctx *ctx, struct io_uring_cqe **ret)
 {
-	return io_get_cqe_overflow(ctx, false);
+	return io_get_cqe_overflow(ctx, ret, false);
 }
 
 static inline bool io_fill_cqe_req(struct io_ring_ctx *ctx, struct io_kiocb *req)
@@ -142,8 +141,7 @@ static inline bool io_fill_cqe_req(struct io_ring_ctx *ctx, struct io_kiocb *req
 	 * submission (by quite a lot). Increment the overflow count in
 	 * the ring.
 	 */
-	cqe = io_get_cqe(ctx);
-	if (unlikely(!cqe))
+	if (unlikely(!io_get_cqe(ctx, &cqe)))
 		return false;
 
 	if (trace_io_uring_complete_enabled())
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 06/16] io_uring: reorder cqring_flush and wakeups
  2023-08-15 17:31 [RFC 00/16] caching and SQ/CQ optimisations Pavel Begunkov
                   ` (4 preceding siblings ...)
  2023-08-15 17:31 ` [PATCH 05/16] io_uring: optimise extra io_get_cqe null check Pavel Begunkov
@ 2023-08-15 17:31 ` Pavel Begunkov
  2023-08-15 17:31 ` [PATCH 07/16] io_uring: merge iopoll and normal completion paths Pavel Begunkov
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 23+ messages in thread
From: Pavel Begunkov @ 2023-08-15 17:31 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

Unlike in the past, io_commit_cqring_flush() doesn't do anything that
may need io_cqring_wake() to be issued afterwards; all requests it
completes go via task_work. Do io_commit_cqring_flush() after
io_cqring_wake() to clean up __io_cq_unlock_post().

Signed-off-by: Pavel Begunkov <[email protected]>
---
 io_uring/io_uring.c | 14 +++-----------
 io_uring/rw.c       |  2 +-
 2 files changed, 4 insertions(+), 12 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index e5378dc7aa19..8d27d2a2e893 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -629,19 +629,11 @@ static inline void io_cq_lock(struct io_ring_ctx *ctx)
 static inline void __io_cq_unlock_post(struct io_ring_ctx *ctx)
 {
 	io_commit_cqring(ctx);
-
-	if (ctx->task_complete) {
-		/*
-		 * ->task_complete implies that only current might be waiting
-		 * for CQEs, and obviously, we currently don't. No one is
-		 * waiting, wakeups are futile, skip them.
-		 */
-		io_commit_cqring_flush(ctx);
-	} else {
+	if (!ctx->task_complete) {
 		spin_unlock(&ctx->completion_lock);
-		io_commit_cqring_flush(ctx);
 		io_cqring_wake(ctx);
 	}
+	io_commit_cqring_flush(ctx);
 }
 
 static void io_cq_unlock_post(struct io_ring_ctx *ctx)
@@ -649,8 +641,8 @@ static void io_cq_unlock_post(struct io_ring_ctx *ctx)
 {
 	io_commit_cqring(ctx);
 	spin_unlock(&ctx->completion_lock);
-	io_commit_cqring_flush(ctx);
 	io_cqring_wake(ctx);
+	io_commit_cqring_flush(ctx);
 }
 
 /* Returns true if there are no backlogged entries after the flush */
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 9b51afdae505..20140d3505f1 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -985,9 +985,9 @@ int io_write(struct io_kiocb *req, unsigned int issue_flags)
 
 static void io_cqring_ev_posted_iopoll(struct io_ring_ctx *ctx)
 {
-	io_commit_cqring_flush(ctx);
 	if (ctx->flags & IORING_SETUP_SQPOLL)
 		io_cqring_wake(ctx);
+	io_commit_cqring_flush(ctx);
 }
 
 void io_rw_fail(struct io_kiocb *req)
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 07/16] io_uring: merge iopoll and normal completion paths
  2023-08-15 17:31 [RFC 00/16] caching and SQ/CQ optimisations Pavel Begunkov
                   ` (5 preceding siblings ...)
  2023-08-15 17:31 ` [PATCH 06/16] io_uring: reorder cqring_flush and wakeups Pavel Begunkov
@ 2023-08-15 17:31 ` Pavel Begunkov
  2023-08-15 17:31 ` [PATCH 08/16] io_uring: compact SQ/CQ heads/tails Pavel Begunkov
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 23+ messages in thread
From: Pavel Begunkov @ 2023-08-15 17:31 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

io_do_iopoll() and io_submit_flush_completions() are pretty similar:
both fill CQEs and then free a list of requests. Don't duplicate that;
make iopoll use __io_submit_flush_completions(), which also helps with
inlining and other optimisations.

For that, we first need to find all completed iopoll requests, splice
them off the iopoll list and then pass them down. This adds one extra
list traversal, which should be fine as the requests will stay hot in
cache.

CQ locking is already conditional, introduce ->lockless_cq and skip
locking for IOPOLL as it's protected by ->uring_lock.

We also add a wakeup optimisation for IOPOLL to __io_cq_unlock_post(),
so it works just like io_cqring_ev_posted_iopoll().

Signed-off-by: Pavel Begunkov <[email protected]>
---
 include/linux/io_uring_types.h |  1 +
 io_uring/io_uring.c            | 18 ++++++++++++------
 io_uring/io_uring.h            |  2 +-
 io_uring/rw.c                  | 24 +++++-------------------
 4 files changed, 19 insertions(+), 26 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 9795eda529f7..c0c03d8059df 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -205,6 +205,7 @@ struct io_ring_ctx {
 		unsigned int		has_evfd: 1;
 		/* all CQEs should be posted only by the submitter task */
 		unsigned int		task_complete: 1;
+		unsigned int		lockless_cq: 1;
 		unsigned int		syscall_iopoll: 1;
 		unsigned int		poll_activated: 1;
 		unsigned int		drain_disabled: 1;
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 8d27d2a2e893..204c6a31c5d1 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -147,7 +147,6 @@ static bool io_uring_try_cancel_requests(struct io_ring_ctx *ctx,
 					 bool cancel_all);
 
 static void io_queue_sqe(struct io_kiocb *req);
-static void __io_submit_flush_completions(struct io_ring_ctx *ctx);
 
 struct kmem_cache *req_cachep;
 
@@ -616,7 +615,7 @@ void __io_commit_cqring_flush(struct io_ring_ctx *ctx)
 
 static inline void __io_cq_lock(struct io_ring_ctx *ctx)
 {
-	if (!ctx->task_complete)
+	if (!ctx->lockless_cq)
 		spin_lock(&ctx->completion_lock);
 }
 
@@ -630,8 +629,11 @@ static inline void __io_cq_unlock_post(struct io_ring_ctx *ctx)
 {
 	io_commit_cqring(ctx);
 	if (!ctx->task_complete) {
-		spin_unlock(&ctx->completion_lock);
-		io_cqring_wake(ctx);
+		if (!ctx->lockless_cq)
+			spin_unlock(&ctx->completion_lock);
+		/* IOPOLL rings only need to wake up if it's also SQPOLL */
+		if (!ctx->syscall_iopoll)
+			io_cqring_wake(ctx);
 	}
 	io_commit_cqring_flush(ctx);
 }
@@ -1485,7 +1487,8 @@ void io_queue_next(struct io_kiocb *req)
 		io_req_task_queue(nxt);
 }
 
-void io_free_batch_list(struct io_ring_ctx *ctx, struct io_wq_work_node *node)
+static void io_free_batch_list(struct io_ring_ctx *ctx,
+			       struct io_wq_work_node *node)
 	__must_hold(&ctx->uring_lock)
 {
 	do {
@@ -1522,7 +1525,7 @@ void io_free_batch_list(struct io_ring_ctx *ctx, struct io_wq_work_node *node)
 	} while (node);
 }
 
-static void __io_submit_flush_completions(struct io_ring_ctx *ctx)
+void __io_submit_flush_completions(struct io_ring_ctx *ctx)
 	__must_hold(&ctx->uring_lock)
 {
 	struct io_submit_state *state = &ctx->submit_state;
@@ -3836,6 +3839,9 @@ static __cold int io_uring_create(unsigned entries, struct io_uring_params *p,
 	    !(ctx->flags & IORING_SETUP_SQPOLL))
 		ctx->task_complete = true;
 
+	if (ctx->task_complete || (ctx->flags & IORING_SETUP_IOPOLL))
+		ctx->lockless_cq = true;
+
 	/*
 	 * lazy poll_wq activation relies on ->task_complete for synchronisation
 	 * purposes, see io_activate_pollwq()
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 2960e35b32a5..07fd185064d2 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -72,7 +72,7 @@ int io_ring_add_registered_file(struct io_uring_task *tctx, struct file *file,
 int io_poll_issue(struct io_kiocb *req, struct io_tw_state *ts);
 int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr);
 int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin);
-void io_free_batch_list(struct io_ring_ctx *ctx, struct io_wq_work_node *node);
+void __io_submit_flush_completions(struct io_ring_ctx *ctx);
 int io_req_prep_async(struct io_kiocb *req);
 
 struct io_wq_work *io_wq_free_work(struct io_wq_work *work);
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 20140d3505f1..0a1e515f0510 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -983,13 +983,6 @@ int io_write(struct io_kiocb *req, unsigned int issue_flags)
 	return ret;
 }
 
-static void io_cqring_ev_posted_iopoll(struct io_ring_ctx *ctx)
-{
-	if (ctx->flags & IORING_SETUP_SQPOLL)
-		io_cqring_wake(ctx);
-	io_commit_cqring_flush(ctx);
-}
-
 void io_rw_fail(struct io_kiocb *req)
 {
 	int res;
@@ -1060,24 +1053,17 @@ int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin)
 		if (!smp_load_acquire(&req->iopoll_completed))
 			break;
 		nr_events++;
-		if (unlikely(req->flags & REQ_F_CQE_SKIP))
-			continue;
-
 		req->cqe.flags = io_put_kbuf(req, 0);
-		if (unlikely(!io_fill_cqe_req(ctx, req))) {
-			spin_lock(&ctx->completion_lock);
-			io_req_cqe_overflow(req);
-			spin_unlock(&ctx->completion_lock);
-		}
 	}
-
 	if (unlikely(!nr_events))
 		return 0;
 
-	io_commit_cqring(ctx);
-	io_cqring_ev_posted_iopoll(ctx);
 	pos = start ? start->next : ctx->iopoll_list.first;
 	wq_list_cut(&ctx->iopoll_list, prev, start);
-	io_free_batch_list(ctx, pos);
+
+	if (WARN_ON_ONCE(!wq_list_empty(&ctx->submit_state.compl_reqs)))
+		return 0;
+	ctx->submit_state.compl_reqs.first = pos;
+	__io_submit_flush_completions(ctx);
 	return nr_events;
 }
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 08/16] io_uring: compact SQ/CQ heads/tails
  2023-08-15 17:31 [RFC 00/16] caching and SQ/CQ optimisations Pavel Begunkov
                   ` (6 preceding siblings ...)
  2023-08-15 17:31 ` [PATCH 07/16] io_uring: merge iopoll and normal completion paths Pavel Begunkov
@ 2023-08-15 17:31 ` Pavel Begunkov
  2023-08-19 15:05   ` Jens Axboe
  2023-08-15 17:31 ` [PATCH 09/16] io_uring: add option to remove SQ indirection Pavel Begunkov
                   ` (7 subsequent siblings)
  15 siblings, 1 reply; 23+ messages in thread
From: Pavel Begunkov @ 2023-08-15 17:31 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

Queues heads and tails cache line aligned. That makes sq, cq taking 4
lines or 5 lines if we include the rest of struct io_rings (e.g.
sq_flags is frequently accessed).

Since modern io_uring is mostly single threaded, it doesn't make much
send to sread them as such, it wastes space and puts additional pressure
on caches. Put them all into a single line.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 include/linux/io_uring_types.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index c0c03d8059df..608a8e80e881 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -69,8 +69,8 @@ struct io_uring_task {
 };
 
 struct io_uring {
-	u32 head ____cacheline_aligned_in_smp;
-	u32 tail ____cacheline_aligned_in_smp;
+	u32 head;
+	u32 tail;
 };
 
 /*
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 09/16] io_uring: add option to remove SQ indirection
  2023-08-15 17:31 [RFC 00/16] caching and SQ/CQ optimisations Pavel Begunkov
                   ` (7 preceding siblings ...)
  2023-08-15 17:31 ` [PATCH 08/16] io_uring: compact SQ/CQ heads/tails Pavel Begunkov
@ 2023-08-15 17:31 ` Pavel Begunkov
  2023-08-19 15:06   ` Jens Axboe
  2023-08-15 17:31 ` [PATCH 10/16] io_uring: static_key for !IORING_SETUP_NO_SQARRAY Pavel Begunkov
                   ` (6 subsequent siblings)
  15 siblings, 1 reply; 23+ messages in thread
From: Pavel Begunkov @ 2023-08-15 17:31 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

Not many are aware, but the io_uring submission queue has two levels.
The first level usually appears as sq_array and stores indexes into the
actual SQ. To my knowledge, no one has ever seriously used it, nor does
liburing expose it to users.

Add IORING_SETUP_NO_SQARRAY, when set we don't bother creating and using
the sq_array and SQ heads/tails will be pointing directly into the SQ.
Improves memory footprint, in term of both allocations as well as cache
usage, and also should make io_get_sqe() less branchy in the end.
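
From userspace the difference looks roughly like the sketch below
(illustrative pseudo-C against the mmap'ed rings, not liburing code;
memory-ordering details simplified):

	/* two-level scheme: the sq_array slot holds an index into the SQEs */
	idx = tail & ring_mask;
	sqes[idx] = prepared_sqe;
	sq_array[idx] = idx;		/* the extra level of indirection */
	smp_store_release(&sq_ring->tail, tail + 1);

	/* IORING_SETUP_NO_SQARRAY: the tail indexes the SQE array directly,
	 * sq_array is never allocated or written */
	sqes[tail & ring_mask] = prepared_sqe;
	smp_store_release(&sq_ring->tail, tail + 1);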

Signed-off-by: Pavel Begunkov <[email protected]>
---
 include/uapi/linux/io_uring.h |  5 ++++
 io_uring/io_uring.c           | 52 +++++++++++++++++++++--------------
 2 files changed, 37 insertions(+), 20 deletions(-)

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 9fc7195f25df..f669b1ed33be 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -185,6 +185,11 @@ enum {
  */
 #define IORING_SETUP_REGISTERED_FD_ONLY	(1U << 15)
 
+/*
+ * Disables the SQ index array.
+ */
+#define IORING_SETUP_NO_SQARRAY		(1U << 16)
+
 enum io_uring_op {
 	IORING_OP_NOP,
 	IORING_OP_READV,
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 204c6a31c5d1..ac6d1687ba6c 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -2339,8 +2339,21 @@ static void io_commit_sqring(struct io_ring_ctx *ctx)
  */
 static bool io_get_sqe(struct io_ring_ctx *ctx, const struct io_uring_sqe **sqe)
 {
-	unsigned head, mask = ctx->sq_entries - 1;
-	unsigned sq_idx = ctx->cached_sq_head++ & mask;
+	unsigned mask = ctx->sq_entries - 1;
+	unsigned head = ctx->cached_sq_head++ & mask;
+
+	if (!(ctx->flags & IORING_SETUP_NO_SQARRAY)) {
+		head = READ_ONCE(ctx->sq_array[head]);
+		if (unlikely(head >= ctx->sq_entries)) {
+			/* drop invalid entries */
+			spin_lock(&ctx->completion_lock);
+			ctx->cq_extra--;
+			spin_unlock(&ctx->completion_lock);
+			WRITE_ONCE(ctx->rings->sq_dropped,
+				   READ_ONCE(ctx->rings->sq_dropped) + 1);
+			return false;
+		}
+	}
 
 	/*
 	 * The cached sq head (or cq tail) serves two purposes:
@@ -2350,22 +2363,12 @@ static bool io_get_sqe(struct io_ring_ctx *ctx, const struct io_uring_sqe **sqe)
 	 * 2) allows the kernel side to track the head on its own, even
 	 *    though the application is the one updating it.
 	 */
-	head = READ_ONCE(ctx->sq_array[sq_idx]);
-	if (likely(head < ctx->sq_entries)) {
-		/* double index for 128-byte SQEs, twice as long */
-		if (ctx->flags & IORING_SETUP_SQE128)
-			head <<= 1;
-		*sqe = &ctx->sq_sqes[head];
-		return true;
-	}
 
-	/* drop invalid entries */
-	spin_lock(&ctx->completion_lock);
-	ctx->cq_extra--;
-	spin_unlock(&ctx->completion_lock);
-	WRITE_ONCE(ctx->rings->sq_dropped,
-		   READ_ONCE(ctx->rings->sq_dropped) + 1);
-	return false;
+	/* double index for 128-byte SQEs, twice as long */
+	if (ctx->flags & IORING_SETUP_SQE128)
+		head <<= 1;
+	*sqe = &ctx->sq_sqes[head];
+	return true;
 }
 
 int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr)
@@ -2734,6 +2737,12 @@ static unsigned long rings_size(struct io_ring_ctx *ctx, unsigned int sq_entries
 		return SIZE_MAX;
 #endif
 
+	if (ctx->flags & IORING_SETUP_NO_SQARRAY) {
+		if (sq_offset)
+			*sq_offset = SIZE_MAX;
+		return off;
+	}
+
 	if (sq_offset)
 		*sq_offset = off;
 
@@ -3710,7 +3719,8 @@ static __cold int io_allocate_scq_urings(struct io_ring_ctx *ctx,
 		return PTR_ERR(rings);
 
 	ctx->rings = rings;
-	ctx->sq_array = (u32 *)((char *)rings + sq_array_offset);
+	if (!(ctx->flags & IORING_SETUP_NO_SQARRAY))
+		ctx->sq_array = (u32 *)((char *)rings + sq_array_offset);
 	rings->sq_ring_mask = p->sq_entries - 1;
 	rings->cq_ring_mask = p->cq_entries - 1;
 	rings->sq_ring_entries = p->sq_entries;
@@ -3921,7 +3931,8 @@ static __cold int io_uring_create(unsigned entries, struct io_uring_params *p,
 	p->sq_off.ring_entries = offsetof(struct io_rings, sq_ring_entries);
 	p->sq_off.flags = offsetof(struct io_rings, sq_flags);
 	p->sq_off.dropped = offsetof(struct io_rings, sq_dropped);
-	p->sq_off.array = (char *)ctx->sq_array - (char *)ctx->rings;
+	if (!(ctx->flags & IORING_SETUP_NO_SQARRAY))
+		p->sq_off.array = (char *)ctx->sq_array - (char *)ctx->rings;
 	p->sq_off.resv1 = 0;
 	if (!(ctx->flags & IORING_SETUP_NO_MMAP))
 		p->sq_off.user_addr = 0;
@@ -4010,7 +4021,8 @@ static long io_uring_setup(u32 entries, struct io_uring_params __user *params)
 			IORING_SETUP_COOP_TASKRUN | IORING_SETUP_TASKRUN_FLAG |
 			IORING_SETUP_SQE128 | IORING_SETUP_CQE32 |
 			IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN |
-			IORING_SETUP_NO_MMAP | IORING_SETUP_REGISTERED_FD_ONLY))
+			IORING_SETUP_NO_MMAP | IORING_SETUP_REGISTERED_FD_ONLY |
+			IORING_SETUP_NO_SQARRAY))
 		return -EINVAL;
 
 	return io_uring_create(entries, &p, params);
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 10/16] io_uring: static_key for !IORING_SETUP_NO_SQARRAY
  2023-08-15 17:31 [RFC 00/16] caching and SQ/CQ optimisations Pavel Begunkov
                   ` (8 preceding siblings ...)
  2023-08-15 17:31 ` [PATCH 09/16] io_uring: add option to remove SQ indirection Pavel Begunkov
@ 2023-08-15 17:31 ` Pavel Begunkov
  2023-08-15 17:31 ` [PATCH 11/16] io_uring: move non aligned field to the end Pavel Begunkov
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 23+ messages in thread
From: Pavel Begunkov @ 2023-08-15 17:31 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

At some point IORING_SETUP_NO_SQARRAY should become the default, so add
a static_key to optimise out the chunk of io_get_sqe() dealing with
sq_arrays.
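
Condensed from the hunks below, the key follows the usual
reference-counted static branch pattern (sketch):

	static __read_mostly DEFINE_STATIC_KEY_FALSE(io_key_has_sqarray);

	/* ring setup: count rings that still use the indirection */
	if (!(ctx->flags & IORING_SETUP_NO_SQARRAY))
		static_branch_inc(&io_key_has_sqarray);

	/* io_get_sqe() hot path: patched to a plain NOP once no such
	 * ring exists in the system */
	if (static_branch_unlikely(&io_key_has_sqarray) &&
	    !(ctx->flags & IORING_SETUP_NO_SQARRAY)) {
		/* legacy sq_array lookup */
	}

	/* ring teardown: drop the reference */
	if (!(ctx->flags & IORING_SETUP_NO_SQARRAY))
		static_branch_dec(&io_key_has_sqarray);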

Signed-off-by: Pavel Begunkov <[email protected]>
---
 io_uring/io_uring.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index ac6d1687ba6c..c39606740c73 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -72,6 +72,7 @@
 #include <linux/io_uring.h>
 #include <linux/audit.h>
 #include <linux/security.h>
+#include <linux/jump_label.h>
 #include <asm/shmparam.h>
 
 #define CREATE_TRACE_POINTS
@@ -148,6 +149,8 @@ static bool io_uring_try_cancel_requests(struct io_ring_ctx *ctx,
 
 static void io_queue_sqe(struct io_kiocb *req);
 
+static __read_mostly DEFINE_STATIC_KEY_FALSE(io_key_has_sqarray);
+
 struct kmem_cache *req_cachep;
 
 struct sock *io_uring_get_socket(struct file *file)
@@ -2342,7 +2345,8 @@ static bool io_get_sqe(struct io_ring_ctx *ctx, const struct io_uring_sqe **sqe)
 	unsigned mask = ctx->sq_entries - 1;
 	unsigned head = ctx->cached_sq_head++ & mask;
 
-	if (!(ctx->flags & IORING_SETUP_NO_SQARRAY)) {
+	if (static_branch_unlikely(&io_key_has_sqarray) &&
+	    (!(ctx->flags & IORING_SETUP_NO_SQARRAY))) {
 		head = READ_ONCE(ctx->sq_array[head]);
 		if (unlikely(head >= ctx->sq_entries)) {
 			/* drop invalid entries */
@@ -2871,6 +2875,9 @@ static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
 #endif
 	WARN_ON_ONCE(!list_empty(&ctx->ltimeout_list));
 
+	if (!(ctx->flags & IORING_SETUP_NO_SQARRAY))
+		static_branch_dec(&io_key_has_sqarray);
+
 	io_alloc_cache_free(&ctx->rsrc_node_cache, io_rsrc_node_cache_free);
 	if (ctx->mm_account) {
 		mmdrop(ctx->mm_account);
@@ -3844,6 +3851,9 @@ static __cold int io_uring_create(unsigned entries, struct io_uring_params *p,
 	if (!ctx)
 		return -ENOMEM;
 
+	if (!(ctx->flags & IORING_SETUP_NO_SQARRAY))
+		static_branch_inc(&io_key_has_sqarray);
+
 	if ((ctx->flags & IORING_SETUP_DEFER_TASKRUN) &&
 	    !(ctx->flags & IORING_SETUP_IOPOLL) &&
 	    !(ctx->flags & IORING_SETUP_SQPOLL))
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 11/16] io_uring: move non aligned field to the end
  2023-08-15 17:31 [RFC 00/16] caching and SQ/CQ optimisations Pavel Begunkov
                   ` (9 preceding siblings ...)
  2023-08-15 17:31 ` [PATCH 10/16] io_uring: static_key for !IORING_SETUP_NO_SQARRAY Pavel Begunkov
@ 2023-08-15 17:31 ` Pavel Begunkov
  2023-08-15 17:31 ` [PATCH 12/16] io_uring: banish non-hot data to end of io_ring_ctx Pavel Begunkov
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 23+ messages in thread
From: Pavel Begunkov @ 2023-08-15 17:31 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

Move the fields that are not cache line aligned down in io_ring_ctx.
This shouldn't change anything, but it makes further refactoring easier.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 include/linux/io_uring_types.h | 36 +++++++++++++++++-----------------
 1 file changed, 18 insertions(+), 18 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 608a8e80e881..ad87d6074fb2 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -270,24 +270,6 @@ struct io_ring_ctx {
 		struct io_alloc_cache	netmsg_cache;
 	} ____cacheline_aligned_in_smp;
 
-	/* IRQ completion list, under ->completion_lock */
-	struct io_wq_work_list	locked_free_list;
-	unsigned int		locked_free_nr;
-
-	const struct cred	*sq_creds;	/* cred used for __io_sq_thread() */
-	struct io_sq_data	*sq_data;	/* if using sq thread polling */
-
-	struct wait_queue_head	sqo_sq_wait;
-	struct list_head	sqd_list;
-
-	unsigned long		check_cq;
-
-	unsigned int		file_alloc_start;
-	unsigned int		file_alloc_end;
-
-	struct xarray		personalities;
-	u32			pers_next;
-
 	struct {
 		/*
 		 * We cache a range of free CQEs we can use, once exhausted it
@@ -332,6 +314,24 @@ struct io_ring_ctx {
 		unsigned		cq_last_tm_flush;
 	} ____cacheline_aligned_in_smp;
 
+	/* IRQ completion list, under ->completion_lock */
+	struct io_wq_work_list	locked_free_list;
+	unsigned int		locked_free_nr;
+
+	const struct cred	*sq_creds;	/* cred used for __io_sq_thread() */
+	struct io_sq_data	*sq_data;	/* if using sq thread polling */
+
+	struct wait_queue_head	sqo_sq_wait;
+	struct list_head	sqd_list;
+
+	unsigned long		check_cq;
+
+	unsigned int		file_alloc_start;
+	unsigned int		file_alloc_end;
+
+	struct xarray		personalities;
+	u32			pers_next;
+
 	/* Keep this last, we don't need it for the fast path */
 	struct wait_queue_head		poll_wq;
 	struct io_restriction		restrictions;
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 12/16] io_uring: banish non-hot data to end of io_ring_ctx
  2023-08-15 17:31 [RFC 00/16] caching and SQ/CQ optimisations Pavel Begunkov
                   ` (10 preceding siblings ...)
  2023-08-15 17:31 ` [PATCH 11/16] io_uring: move non aligned field to the end Pavel Begunkov
@ 2023-08-15 17:31 ` Pavel Begunkov
  2023-08-15 17:31 ` [PATCH 13/16] io_uring: separate task_work/waiting cache line Pavel Begunkov
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 23+ messages in thread
From: Pavel Begunkov @ 2023-08-15 17:31 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

Let's move all slow path, setup/init and similar fields to the end of
io_ring_ctx; that makes later ctx reorganisation easier. That includes
the page arrays used only on teardown, the CQ overflow list, the old
provided buffer caches and the poll hashes used by io-wq.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 include/linux/io_uring_types.h | 37 +++++++++++++++++-----------------
 1 file changed, 19 insertions(+), 18 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index ad87d6074fb2..72e609752323 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -211,20 +211,11 @@ struct io_ring_ctx {
 		unsigned int		drain_disabled: 1;
 		unsigned int		compat: 1;
 
-		enum task_work_notify_mode	notify_method;
+		struct task_struct	*submitter_task;
+		struct io_rings		*rings;
+		struct percpu_ref	refs;
 
-		/*
-		 * If IORING_SETUP_NO_MMAP is used, then the below holds
-		 * the gup'ed pages for the two rings, and the sqes.
-		 */
-		unsigned short		n_ring_pages;
-		unsigned short		n_sqe_pages;
-		struct page		**ring_pages;
-		struct page		**sqe_pages;
-
-		struct io_rings			*rings;
-		struct task_struct		*submitter_task;
-		struct percpu_ref		refs;
+		enum task_work_notify_mode	notify_method;
 	} ____cacheline_aligned_in_smp;
 
 	/* submission data */
@@ -262,10 +253,8 @@ struct io_ring_ctx {
 
 		struct io_buffer_list	*io_bl;
 		struct xarray		io_bl_xa;
-		struct list_head	io_buffers_cache;
 
 		struct io_hash_table	cancel_table_locked;
-		struct list_head	cq_overflow_list;
 		struct io_alloc_cache	apoll_cache;
 		struct io_alloc_cache	netmsg_cache;
 	} ____cacheline_aligned_in_smp;
@@ -298,11 +287,8 @@ struct io_ring_ctx {
 		 * manipulate the list, hence no extra locking is needed there.
 		 */
 		struct io_wq_work_list	iopoll_list;
-		struct io_hash_table	cancel_table;
 
 		struct llist_head	work_llist;
-
-		struct list_head	io_buffers_comp;
 	} ____cacheline_aligned_in_smp;
 
 	/* timeouts */
@@ -318,6 +304,10 @@ struct io_ring_ctx {
 	struct io_wq_work_list	locked_free_list;
 	unsigned int		locked_free_nr;
 
+	struct list_head	io_buffers_comp;
+	struct list_head	cq_overflow_list;
+	struct io_hash_table	cancel_table;
+
 	const struct cred	*sq_creds;	/* cred used for __io_sq_thread() */
 	struct io_sq_data	*sq_data;	/* if using sq thread polling */
 
@@ -332,6 +322,8 @@ struct io_ring_ctx {
 	struct xarray		personalities;
 	u32			pers_next;
 
+	struct list_head	io_buffers_cache;
+
 	/* Keep this last, we don't need it for the fast path */
 	struct wait_queue_head		poll_wq;
 	struct io_restriction		restrictions;
@@ -375,6 +367,15 @@ struct io_ring_ctx {
 	unsigned			sq_thread_idle;
 	/* protected by ->completion_lock */
 	unsigned			evfd_last_cq_tail;
+
+	/*
+	 * If IORING_SETUP_NO_MMAP is used, then the below holds
+	 * the gup'ed pages for the two rings, and the sqes.
+	 */
+	unsigned short			n_ring_pages;
+	unsigned short			n_sqe_pages;
+	struct page			**ring_pages;
+	struct page			**sqe_pages;
 };
 
 struct io_tw_state {
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 13/16] io_uring: separate task_work/waiting cache line
  2023-08-15 17:31 [RFC 00/16] caching and SQ/CQ optimisations Pavel Begunkov
                   ` (11 preceding siblings ...)
  2023-08-15 17:31 ` [PATCH 12/16] io_uring: banish non-hot data to end of io_ring_ctx Pavel Begunkov
@ 2023-08-15 17:31 ` Pavel Begunkov
  2023-08-15 17:31 ` [PATCH 14/16] io_uring: move multishot cqe cache in ctx Pavel Begunkov
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 23+ messages in thread
From: Pavel Begunkov @ 2023-08-15 17:31 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

task_works are typically queued up from IRQ/softirq context, potentially
by a random CPU as in the networking case. Batch the ctx fields that
bounce around like this into a separate cache line.

We also move ->cq_timeouts there because waiters have to read and check
it. In the future we can also conditionally hide ->cq_timeouts from the
CQ wait path, as it's not a particularly useful rudiment.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 include/linux/io_uring_types.h | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 72e609752323..5de5dffe29df 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -270,15 +270,25 @@ struct io_ring_ctx {
 		unsigned		cached_cq_tail;
 		unsigned		cq_entries;
 		struct io_ev_fd	__rcu	*io_ev_fd;
-		struct wait_queue_head	cq_wait;
 		unsigned		cq_extra;
 	} ____cacheline_aligned_in_smp;
 
+	/*
+	 * task_work and async notification delivery cacheline. Expected to
+	 * regularly bounce b/w CPUs.
+	 */
+	struct {
+		struct llist_head	work_llist;
+		unsigned long		check_cq;
+		atomic_t		cq_wait_nr;
+		atomic_t		cq_timeouts;
+		struct wait_queue_head	cq_wait;
+	} ____cacheline_aligned_in_smp;
+
 	struct {
 		spinlock_t		completion_lock;
 
 		bool			poll_multi_queue;
-		atomic_t		cq_wait_nr;
 
 		/*
 		 * ->iopoll_list is protected by the ctx->uring_lock for
@@ -287,14 +297,11 @@ struct io_ring_ctx {
 		 * manipulate the list, hence no extra locking is needed there.
 		 */
 		struct io_wq_work_list	iopoll_list;
-
-		struct llist_head	work_llist;
 	} ____cacheline_aligned_in_smp;
 
 	/* timeouts */
 	struct {
 		spinlock_t		timeout_lock;
-		atomic_t		cq_timeouts;
 		struct list_head	timeout_list;
 		struct list_head	ltimeout_list;
 		unsigned		cq_last_tm_flush;
@@ -314,8 +321,6 @@ struct io_ring_ctx {
 	struct wait_queue_head	sqo_sq_wait;
 	struct list_head	sqd_list;
 
-	unsigned long		check_cq;
-
 	unsigned int		file_alloc_start;
 	unsigned int		file_alloc_end;
 
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 14/16] io_uring: move multishot cqe cache in ctx
  2023-08-15 17:31 [RFC 00/16] caching and SQ/CQ optimisations Pavel Begunkov
                   ` (12 preceding siblings ...)
  2023-08-15 17:31 ` [PATCH 13/16] io_uring: separate task_work/waiting cache line Pavel Begunkov
@ 2023-08-15 17:31 ` Pavel Begunkov
  2023-08-15 17:31 ` [PATCH 15/16] io_uring: move iopoll ctx fields around Pavel Begunkov
  2023-08-15 17:31 ` [PATCH 16/16] io_uring: force inline io_fill_cqe_req Pavel Begunkov
  15 siblings, 0 replies; 23+ messages in thread
From: Pavel Begunkov @ 2023-08-15 17:31 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

We cache multishot CQEs before flushing them to the CQ in
submit_state.cqes. It's a 16 entry cache totalling 256 bytes, sitting in
the middle of the io_submit_state structure. Move it out of there; it
should help with CPU caching of the submission state, and shouldn't
affect the cached CQEs.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 include/linux/io_uring_types.h | 3 ++-
 io_uring/io_uring.c            | 6 +++---
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 5de5dffe29df..01bdbc223edd 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -176,7 +176,6 @@ struct io_submit_state {
 	unsigned short		submit_nr;
 	unsigned int		cqes_count;
 	struct blk_plug		plug;
-	struct io_uring_cqe	cqes[16];
 };
 
 struct io_ev_fd {
@@ -307,6 +306,8 @@ struct io_ring_ctx {
 		unsigned		cq_last_tm_flush;
 	} ____cacheline_aligned_in_smp;
 
+	struct io_uring_cqe	completion_cqes[16];
+
 	/* IRQ completion list, under ->completion_lock */
 	struct io_wq_work_list	locked_free_list;
 	unsigned int		locked_free_nr;
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index c39606740c73..5b13d22d1b76 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -883,7 +883,7 @@ static void __io_flush_post_cqes(struct io_ring_ctx *ctx)
 
 	lockdep_assert_held(&ctx->uring_lock);
 	for (i = 0; i < state->cqes_count; i++) {
-		struct io_uring_cqe *cqe = &state->cqes[i];
+		struct io_uring_cqe *cqe = &ctx->completion_cqes[i];
 
 		if (!io_fill_cqe_aux(ctx, cqe->user_data, cqe->res, cqe->flags)) {
 			if (ctx->task_complete) {
@@ -934,7 +934,7 @@ bool io_fill_cqe_req_aux(struct io_kiocb *req, bool defer, s32 res, u32 cflags)
 
 	lockdep_assert_held(&ctx->uring_lock);
 
-	if (ctx->submit_state.cqes_count == ARRAY_SIZE(ctx->submit_state.cqes)) {
+	if (ctx->submit_state.cqes_count == ARRAY_SIZE(ctx->completion_cqes)) {
 		__io_cq_lock(ctx);
 		__io_flush_post_cqes(ctx);
 		/* no need to flush - flush is deferred */
@@ -948,7 +948,7 @@ bool io_fill_cqe_req_aux(struct io_kiocb *req, bool defer, s32 res, u32 cflags)
 	if (test_bit(IO_CHECK_CQ_OVERFLOW_BIT, &ctx->check_cq))
 		return false;
 
-	cqe = &ctx->submit_state.cqes[ctx->submit_state.cqes_count++];
+	cqe = &ctx->completion_cqes[ctx->submit_state.cqes_count++];
 	cqe->user_data = user_data;
 	cqe->res = res;
 	cqe->flags = cflags;
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 15/16] io_uring: move iopoll ctx fields around
  2023-08-15 17:31 [RFC 00/16] caching and SQ/CQ optimisations Pavel Begunkov
                   ` (13 preceding siblings ...)
  2023-08-15 17:31 ` [PATCH 14/16] io_uring: move multishot cqe cache in ctx Pavel Begunkov
@ 2023-08-15 17:31 ` Pavel Begunkov
  2023-08-15 17:31 ` [PATCH 16/16] io_uring: force inline io_fill_cqe_req Pavel Begunkov
  15 siblings, 0 replies; 23+ messages in thread
From: Pavel Begunkov @ 2023-08-15 17:31 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

Move poll_multi_queue and iopoll_list to the submission cache line. It
doesn't make much sense to keep them separate, and it's a better place
for them in general.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 include/linux/io_uring_types.h | 25 +++++++++++--------------
 1 file changed, 11 insertions(+), 14 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 01bdbc223edd..13d19b9be9f4 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -256,6 +256,15 @@ struct io_ring_ctx {
 		struct io_hash_table	cancel_table_locked;
 		struct io_alloc_cache	apoll_cache;
 		struct io_alloc_cache	netmsg_cache;
+
+		/*
+		 * ->iopoll_list is protected by the ctx->uring_lock for
+		 * io_uring instances that don't use IORING_SETUP_SQPOLL.
+		 * For SQPOLL, only the single threaded io_sq_thread() will
+		 * manipulate the list, hence no extra locking is needed there.
+		 */
+		struct io_wq_work_list	iopoll_list;
+		bool			poll_multi_queue;
 	} ____cacheline_aligned_in_smp;
 
 	struct {
@@ -284,20 +293,6 @@ struct io_ring_ctx {
 		struct wait_queue_head	cq_wait;
 	} ____cacheline_aligned_in_smp;
 
-	struct {
-		spinlock_t		completion_lock;
-
-		bool			poll_multi_queue;
-
-		/*
-		 * ->iopoll_list is protected by the ctx->uring_lock for
-		 * io_uring instances that don't use IORING_SETUP_SQPOLL.
-		 * For SQPOLL, only the single threaded io_sq_thread() will
-		 * manipulate the list, hence no extra locking is needed there.
-		 */
-		struct io_wq_work_list	iopoll_list;
-	} ____cacheline_aligned_in_smp;
-
 	/* timeouts */
 	struct {
 		spinlock_t		timeout_lock;
@@ -308,6 +303,8 @@ struct io_ring_ctx {
 
 	struct io_uring_cqe	completion_cqes[16];
 
+	spinlock_t		completion_lock;
+
 	/* IRQ completion list, under ->completion_lock */
 	struct io_wq_work_list	locked_free_list;
 	unsigned int		locked_free_nr;
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 16/16] io_uring: force inline io_fill_cqe_req
  2023-08-15 17:31 [RFC 00/16] caching and SQ/CQ optimisations Pavel Begunkov
                   ` (14 preceding siblings ...)
  2023-08-15 17:31 ` [PATCH 15/16] io_uring: move iopoll ctx fields around Pavel Begunkov
@ 2023-08-15 17:31 ` Pavel Begunkov
  15 siblings, 0 replies; 23+ messages in thread
From: Pavel Begunkov @ 2023-08-15 17:31 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

There are only 2 callers of io_fill_cqe_req left, and one of them is
extremely hot. Force inline the function.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 io_uring/io_uring.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 07fd185064d2..547c30582fb8 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -132,7 +132,8 @@ static inline bool io_get_cqe(struct io_ring_ctx *ctx, struct io_uring_cqe **ret
 	return io_get_cqe_overflow(ctx, ret, false);
 }
 
-static inline bool io_fill_cqe_req(struct io_ring_ctx *ctx, struct io_kiocb *req)
+static __always_inline bool io_fill_cqe_req(struct io_ring_ctx *ctx,
+					    struct io_kiocb *req)
 {
 	struct io_uring_cqe *cqe;
 
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH 02/16] io_uring: cqe init hardening
  2023-08-15 17:31 ` [PATCH 02/16] io_uring: cqe init hardening Pavel Begunkov
@ 2023-08-19 15:03   ` Jens Axboe
  2023-08-24 16:28     ` Pavel Begunkov
  0 siblings, 1 reply; 23+ messages in thread
From: Jens Axboe @ 2023-08-19 15:03 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 8/15/23 11:31 AM, Pavel Begunkov wrote:
> io_kiocb::cqe stores the completion info which we'll memcpy to
> userspace, and we rely on callbacks and other later steps to populate
> it with right values. We have never had problems with that, but it would
> still be safer to zero it on allocation.
> 
> Signed-off-by: Pavel Begunkov <[email protected]>
> ---
>  io_uring/io_uring.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index e189158ebbdd..4d27655be3a6 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -1056,7 +1056,7 @@ static void io_preinit_req(struct io_kiocb *req, struct io_ring_ctx *ctx)
>  	req->link = NULL;
>  	req->async_data = NULL;
>  	/* not necessary, but safer to zero */
> -	req->cqe.res = 0;
> +	memset(&req->cqe, 0, sizeof(req->cqe));
>  }
>  
>  static void io_flush_cached_locked_reqs(struct io_ring_ctx *ctx,

I think this is a good idea, but I wonder if we should open-clear it
instead. I've had cases in the past where that's more efficient than
calling memset.
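
For reference, an open-coded clear here would presumably be per-field
stores on struct io_uring_cqe, along these lines (sketch):

	req->cqe.user_data = 0;
	req->cqe.res = 0;
	req->cqe.flags = 0;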

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 08/16] io_uring: compact SQ/CQ heads/tails
  2023-08-15 17:31 ` [PATCH 08/16] io_uring: compact SQ/CQ heads/tails Pavel Begunkov
@ 2023-08-19 15:05   ` Jens Axboe
  2023-08-24 16:29     ` Pavel Begunkov
  0 siblings, 1 reply; 23+ messages in thread
From: Jens Axboe @ 2023-08-19 15:05 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 8/15/23 11:31 AM, Pavel Begunkov wrote:
> Queues heads and tails cache line aligned. That makes sq, cq taking 4
> lines or 5 lines if we include the rest of struct io_rings (e.g.
> sq_flags is frequently accessed).
> 
> Since modern io_uring is mostly single threaded, it doesn't make much
> send to sread them as such, it wastes space and puts additional pressure

"sense to spread". Can fix up while applying. Change itself looks good
to me.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 23+ messages in thread
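
To illustrate the trade-off, a sketch with made-up field names rather
than the real struct io_rings, assuming 64-byte cache lines: per-field
alignment spreads the four counters over four lines, while packing them
keeps a mostly single-threaded submit/complete cycle on one or two hot
lines.

#include <stdint.h>

#define CACHELINE_ALIGNED __attribute__((aligned(64)))

/*
 * Spread layout: each counter sits on its own cache line. Good under
 * heavy cross-CPU traffic, but 4+ lines are touched per cycle and
 * sizeof(struct rings_spread) blows up to 256 bytes.
 */
struct rings_spread {
	uint32_t sq_head CACHELINE_ALIGNED;
	uint32_t sq_tail CACHELINE_ALIGNED;
	uint32_t cq_head CACHELINE_ALIGNED;
	uint32_t cq_tail CACHELINE_ALIGNED;
};

/*
 * Compact layout: heads/tails and a frequently read flags word share
 * one line, so a single-threaded ring touches far fewer lines.
 */
struct rings_compact {
	uint32_t sq_head;
	uint32_t sq_tail;
	uint32_t cq_head;
	uint32_t cq_tail;
	uint32_t sq_flags;
} CACHELINE_ALIGNED;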

* Re: [PATCH 09/16] io_uring: add option to remove SQ indirection
  2023-08-15 17:31 ` [PATCH 09/16] io_uring: add option to remove SQ indirection Pavel Begunkov
@ 2023-08-19 15:06   ` Jens Axboe
  0 siblings, 0 replies; 23+ messages in thread
From: Jens Axboe @ 2023-08-19 15:06 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 8/15/23 11:31 AM, Pavel Begunkov wrote:
> Not many aware, but io_uring submission queue has two levels. The first
> level usually appears as sq_array and stores indexes into the actual SQ.
> To my knowledge, no one has ever seriously used it, nor liburing exposes
> it to users.

liburing does indeed not expose it; fio does use it, as it's using the
raw interface. But yeah, it's definitely not needed for most cases, and
I think this change is fine direction-wise.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 23+ messages in thread
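
A rough sketch of the lookup being discussed, in illustrative C rather
than the kernel's actual fetch path: with the classic layout the kernel
goes through sq_array[] to find which SQE to read, i.e. two dependent
loads, while with IORING_SETUP_NO_SQARRAY entries are consumed in ring
order and the indirection disappears.

#include <stdint.h>

struct sqe { uint64_t user_data; /* ... rest of the entry ... */ };

/* Classic two-level layout: sq_array[] holds indexes into sqes[]. */
static inline const struct sqe *
fetch_sqe_indirect(const uint32_t *sq_array, const struct sqe *sqes,
		   uint32_t head, uint32_t mask)
{
	uint32_t idx = sq_array[head & mask];	/* first load */
	return &sqes[idx & mask];		/* second, dependent load */
}

/* IORING_SETUP_NO_SQARRAY: SQEs are consumed directly in ring order. */
static inline const struct sqe *
fetch_sqe_direct(const struct sqe *sqes, uint32_t head, uint32_t mask)
{
	return &sqes[head & mask];
}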

* Re: [PATCH 02/16] io_uring: cqe init hardening
  2023-08-19 15:03   ` Jens Axboe
@ 2023-08-24 16:28     ` Pavel Begunkov
  2023-08-24 16:49       ` Jens Axboe
  0 siblings, 1 reply; 23+ messages in thread
From: Pavel Begunkov @ 2023-08-24 16:28 UTC (permalink / raw)
  To: Jens Axboe, io-uring

On 8/19/23 16:03, Jens Axboe wrote:
> On 8/15/23 11:31 AM, Pavel Begunkov wrote:
>> io_kiocb::cqe stores the completion info which we'll memcpy to
>> userspace, and we rely on callbacks and other later steps to populate
>> it with right values. We have never had problems with that, but it would
>> still be safer to zero it on allocation.
>>
>> Signed-off-by: Pavel Begunkov <[email protected]>
>> ---
>>   io_uring/io_uring.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
>> index e189158ebbdd..4d27655be3a6 100644
>> --- a/io_uring/io_uring.c
>> +++ b/io_uring/io_uring.c
>> @@ -1056,7 +1056,7 @@ static void io_preinit_req(struct io_kiocb *req, struct io_ring_ctx *ctx)
>>   	req->link = NULL;
>>   	req->async_data = NULL;
>>   	/* not necessary, but safer to zero */
>> -	req->cqe.res = 0;
>> +	memset(&req->cqe, 0, sizeof(req->cqe));
>>   }
>>   
>>   static void io_flush_cached_locked_reqs(struct io_ring_ctx *ctx,
> 
> I think this is a good idea, but I wonder if we should open-code the
> clearing instead. I've had cases in the past where that's more
> efficient than calling memset.

I don't think that ever happens for a 16-byte memset, the compiler
expands it inline into a couple of plain stores. In either case the
dominant cost is a cache refill, which is already a slow path. I
believe memset is better here.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 23+ messages in thread
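
For reference, a sketch of the two options with an illustrative
16-byte cqe-like struct (not the kernel's definitions). For a small
constant size like this, GCC and Clang normally expand the memset()
inline into a couple of stores, so both forms tend to compile to the
same code.

#include <stdint.h>
#include <string.h>

struct cqe {
	uint64_t user_data;
	int32_t  res;
	uint32_t flags;
};

/* The patch's approach: one memset over the whole 16-byte struct. */
void clear_with_memset(struct cqe *cqe)
{
	memset(cqe, 0, sizeof(*cqe));
}

/* Open-coded clearing of the individual fields. */
void clear_open_coded(struct cqe *cqe)
{
	cqe->user_data = 0;
	cqe->res = 0;
	cqe->flags = 0;
}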

* Re: [PATCH 08/16] io_uring: compact SQ/CQ heads/tails
  2023-08-19 15:05   ` Jens Axboe
@ 2023-08-24 16:29     ` Pavel Begunkov
  0 siblings, 0 replies; 23+ messages in thread
From: Pavel Begunkov @ 2023-08-24 16:29 UTC (permalink / raw)
  To: Jens Axboe, io-uring

On 8/19/23 16:05, Jens Axboe wrote:
> On 8/15/23 11:31 AM, Pavel Begunkov wrote:
>> Queues heads and tails cache line aligned. That makes sq, cq taking 4
>> lines or 5 lines if we include the rest of struct io_rings (e.g.
>> sq_flags is frequently accessed).
>>
>> Since modern io_uring is mostly single threaded, it doesn't make much
>> send to sread them as such, it wastes space and puts additional pressure
> 
> "sense to spread". Can fix up while applying. Change itself looks good
> to me.

I'll be resending as we agreed, will fix it up, thanks

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 02/16] io_uring: cqe init hardening
  2023-08-24 16:28     ` Pavel Begunkov
@ 2023-08-24 16:49       ` Jens Axboe
  0 siblings, 0 replies; 23+ messages in thread
From: Jens Axboe @ 2023-08-24 16:49 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 8/24/23 10:28 AM, Pavel Begunkov wrote:
> On 8/19/23 16:03, Jens Axboe wrote:
>> On 8/15/23 11:31 AM, Pavel Begunkov wrote:
>>> io_kiocb::cqe stores the completion info which we'll memcpy to
>>> userspace, and we rely on callbacks and other later steps to populate
>>> it with right values. We have never had problems with that, but it would
>>> still be safer to zero it on allocation.
>>>
>>> Signed-off-by: Pavel Begunkov <[email protected]>
>>> ---
>>>   io_uring/io_uring.c | 2 +-
>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
>>> index e189158ebbdd..4d27655be3a6 100644
>>> --- a/io_uring/io_uring.c
>>> +++ b/io_uring/io_uring.c
>>> @@ -1056,7 +1056,7 @@ static void io_preinit_req(struct io_kiocb *req, struct io_ring_ctx *ctx)
>>>       req->link = NULL;
>>>       req->async_data = NULL;
>>>       /* not necessary, but safer to zero */
>>> -    req->cqe.res = 0;
>>> +    memset(&req->cqe, 0, sizeof(req->cqe));
>>>   }
>>>     static void io_flush_cached_locked_reqs(struct io_ring_ctx *ctx,
>>
>> I think this is a good idea, but I wonder if we should open-code the
>> clearing instead. I've had cases in the past where that's more
>> efficient than calling memset.
> 
> I don't think that ever happens for a 16-byte memset, the compiler
> expands it inline into a couple of plain stores. In either case the
> dominant cost is a cache refill, which is already a slow path. I
> believe memset is better here.

Yeah, I think it's fine as-is - just checked here and either approach
yields the same code.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2023-08-24 16:50 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-08-15 17:31 [RFC 00/16] caching and SQ/CQ optimisations Pavel Begunkov
2023-08-15 17:31 ` [PATCH 01/16] io_uring: improve cqe !tracing hot path Pavel Begunkov
2023-08-15 17:31 ` [PATCH 02/16] io_uring: cqe init hardening Pavel Begunkov
2023-08-19 15:03   ` Jens Axboe
2023-08-24 16:28     ` Pavel Begunkov
2023-08-24 16:49       ` Jens Axboe
2023-08-15 17:31 ` [PATCH 03/16] io_uring: simplify big_cqe handling Pavel Begunkov
2023-08-15 17:31 ` [PATCH 04/16] io_uring: refactor __io_get_cqe() Pavel Begunkov
2023-08-15 17:31 ` [PATCH 05/16] io_uring: optimise extra io_get_cqe null check Pavel Begunkov
2023-08-15 17:31 ` [PATCH 06/16] io_uring: reorder cqring_flush and wakeups Pavel Begunkov
2023-08-15 17:31 ` [PATCH 07/16] io_uring: merge iopoll and normal completion paths Pavel Begunkov
2023-08-15 17:31 ` [PATCH 08/16] io_uring: compact SQ/CQ heads/tails Pavel Begunkov
2023-08-19 15:05   ` Jens Axboe
2023-08-24 16:29     ` Pavel Begunkov
2023-08-15 17:31 ` [PATCH 09/16] io_uring: add option to remove SQ indirection Pavel Begunkov
2023-08-19 15:06   ` Jens Axboe
2023-08-15 17:31 ` [PATCH 10/16] io_uring: static_key for !IORING_SETUP_NO_SQARRAY Pavel Begunkov
2023-08-15 17:31 ` [PATCH 11/16] io_uring: move non aligned field to the end Pavel Begunkov
2023-08-15 17:31 ` [PATCH 12/16] io_uring: banish non-hot data to end of io_ring_ctx Pavel Begunkov
2023-08-15 17:31 ` [PATCH 13/16] io_uring: separate task_work/waiting cache line Pavel Begunkov
2023-08-15 17:31 ` [PATCH 14/16] io_uring: move multishot cqe cache in ctx Pavel Begunkov
2023-08-15 17:31 ` [PATCH 15/16] io_uring: move iopoll ctx fields around Pavel Begunkov
2023-08-15 17:31 ` [PATCH 16/16] io_uring: force inline io_fill_cqe_req Pavel Begunkov
