* [PATCH 00/16] squeeze more performance
@ 2021-10-04 19:02 Pavel Begunkov
2021-10-04 19:02 ` [PATCH 01/16] io_uring: optimise kiocb layout Pavel Begunkov
` (16 more replies)
0 siblings, 17 replies; 19+ messages in thread
From: Pavel Begunkov @ 2021-10-04 19:02 UTC (permalink / raw)
To: io-uring; +Cc: Jens Axboe, asml.silence
fio/t/io_uring -s32 -d32 -c32 -N1
| baseline | 0-15 | 0-16 | diff
setup 1: | 34 MIOPS | 42 MIOPS | 42.2 MIOPS | 25 %
setup 2: | 31 MIOPS | 31 MIOPS | 32 MIOPS | ~3 %
Setup 1 gets a 25% performance improvement, which is unexpected, and a
share of it should be attributed to compiler/HW magic. Setup 2 is just
3%, but the catch is that some of the patches _very_ unexpectedly sink
performance, so over the series it's more like
31 MIOPS -> 29 -> 30 -> 29 -> 31 -> 32.
I'd suggest leaving 16/16 aside, maybe for future consideration and
refinement. The end result is not very clear; I'd expect around 3-5%
with a more stable setup for nops32, and a better win for
io_cqring_ev_posted()-intensive cases like BPF.
Pavel Begunkov (16):
io_uring: optimise kiocb layout
io_uring: add more likely/unlikely() annotations
io_uring: delay req queueing into compl-batch list
io_uring: optimise request allocation
io_uring: optimise INIT_WQ_LIST
io_uring: don't wake sqpoll in io_cqring_ev_posted
io_uring: merge CQ and poll waitqueues
io_uring: optimise ctx referencing by requests
io_uring: mark cold functions
io_uring: optimise io_free_batch_list()
io_uring: control ->async_data with a REQ_F flag
io_uring: remove struct io_completion
io_uring: inline io_req_needs_clean()
io_uring: inline io_poll_complete
io_uring: correct fill events helpers types
io_uring: mark hot functions
fs/io-wq.h | 1 -
fs/io_uring.c | 390 ++++++++++++++++++++++++++------------------------
2 files changed, 205 insertions(+), 186 deletions(-)
--
2.33.0
* [PATCH 01/16] io_uring: optimise kiocb layout
2021-10-04 19:02 [PATCH 00/16] squeeze more performance Pavel Begunkov
@ 2021-10-04 19:02 ` Pavel Begunkov
2021-10-04 19:02 ` [PATCH 02/16] io_uring: add more likely/unlikely() annotations Pavel Begunkov
` (15 subsequent siblings)
16 siblings, 0 replies; 19+ messages in thread
From: Pavel Begunkov @ 2021-10-04 19:02 UTC (permalink / raw)
To: io-uring; +Cc: Jens Axboe, asml.silence
We want ->comp_list in the second cacheline, which is hotter than the
third. Swap the field with ->link, which is not as hot: it's guarded by
request flags and so isn't accessed unless there is a link. While at
it, add a couple of comments for io_kiocb fields.
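As a rough way to double-check the new placement (purely illustrative,
not part of the patch; static_assert, offsetof and L1_CACHE_BYTES are
the kernel's existing facilities), one could assert that ->comp_list
lands within the first two cachelines, or inspect the offsets with
pahole -C io_kiocb fs/io_uring.o:

/* hypothetical layout check, assuming 64-byte cachelines */
static_assert(offsetof(struct io_kiocb, comp_list) < 2 * L1_CACHE_BYTES);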
Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index a9eefd74b7e1..970535071564 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -863,19 +863,22 @@ struct io_kiocb {
struct task_struct *task;
u64 user_data;
- struct io_kiocb *link;
struct percpu_ref *fixed_rsrc_refs;
- /* used with ctx->iopoll_list with reads/writes */
+ /* used by request caches, completion batching and iopoll */
struct io_wq_work_node comp_list;
+ struct io_kiocb *link;
struct io_task_work io_task_work;
/* for polled requests, i.e. IORING_OP_POLL_ADD and async armed poll */
struct hlist_node hash_node;
+ /* internal polling, see IORING_FEAT_FAST_POLL */
struct async_poll *apoll;
/* store used ubuf, so we can prevent reloading */
struct io_mapped_ubuf *imu;
struct io_wq_work work;
+ /* custom credentials, valid IFF REQ_F_CREDS is set */
const struct cred *creds;
+ /* stores selected buf, valid IFF REQ_F_BUFFER_SELECTED is set */
struct io_buffer *kbuf;
};
--
2.33.0
* [PATCH 02/16] io_uring: add more likely/unlikely() annotations
2021-10-04 19:02 [PATCH 00/16] squeeze more performance Pavel Begunkov
2021-10-04 19:02 ` [PATCH 01/16] io_uring: optimise kiocb layout Pavel Begunkov
@ 2021-10-04 19:02 ` Pavel Begunkov
2021-10-04 19:02 ` [PATCH 03/16] io_uring: delay req queueing into compl-batch list Pavel Begunkov
` (14 subsequent siblings)
16 siblings, 0 replies; 19+ messages in thread
From: Pavel Begunkov @ 2021-10-04 19:02 UTC (permalink / raw)
To: io-uring; +Cc: Jens Axboe, asml.silence
Add two extra unlikely() in io_submit_sqes() and one around
io_req_needs_clean() to help the compiler avoid extra jumps in hot
paths.
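For reference, these annotations are only hints for branch layout;
ignoring the branch-profiling variants, include/linux/compiler.h
defines them roughly as:

#define likely(x)	__builtin_expect(!!(x), 1)	/* branch expected taken */
#define unlikely(x)	__builtin_expect(!!(x), 0)	/* branch expected not taken */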
Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 970535071564..b09b267247f5 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1957,7 +1957,7 @@ static inline void io_dismantle_req(struct io_kiocb *req)
{
unsigned int flags = req->flags;
- if (io_req_needs_clean(req))
+ if (unlikely(io_req_needs_clean(req)))
io_clean_op(req);
if (!(flags & REQ_F_FIXED_FILE))
io_put_file(req->file);
@@ -7201,11 +7201,11 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr)
unsigned int entries = io_sqring_entries(ctx);
int submitted = 0;
- if (!entries)
+ if (unlikely(!entries))
return 0;
/* make sure SQ entry isn't read before tail */
nr = min3(nr, ctx->sq_entries, entries);
- if (!percpu_ref_tryget_many(&ctx->refs, nr))
+ if (unlikely(!percpu_ref_tryget_many(&ctx->refs, nr)))
return -EAGAIN;
io_get_task_refs(nr);
--
2.33.0
* [PATCH 03/16] io_uring: delay req queueing into compl-batch list
2021-10-04 19:02 [PATCH 00/16] squeeze more performance Pavel Begunkov
2021-10-04 19:02 ` [PATCH 01/16] io_uring: optimise kiocb layout Pavel Begunkov
2021-10-04 19:02 ` [PATCH 02/16] io_uring: add more likely/unlikely() annotations Pavel Begunkov
@ 2021-10-04 19:02 ` Pavel Begunkov
2021-10-04 19:02 ` [PATCH 04/16] io_uring: optimise request allocation Pavel Begunkov
` (13 subsequent siblings)
16 siblings, 0 replies; 19+ messages in thread
From: Pavel Begunkov @ 2021-10-04 19:02 UTC (permalink / raw)
To: io-uring; +Cc: Jens Axboe, asml.silence
io_req_complete_state() is inlined and used in lots of places, so we
want to keep it concise. Move adding a request to the completion batch
list out of io_req_complete_state() and into the consumer, i.e.
__io_queue_sqe().
before vs after
text data bss dec hex filename
91894 14002 8 105904 19db0 ./fs/io_uring.o
91046 14002 8 105056 19a60 ./fs/io_uring.o
Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 26 ++++++++++++++++----------
1 file changed, 16 insertions(+), 10 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index b09b267247f5..54850696ab6d 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1446,6 +1446,13 @@ static void io_prep_async_link(struct io_kiocb *req)
}
}
+static inline void io_req_add_compl_list(struct io_kiocb *req)
+{
+ struct io_submit_state *state = &req->ctx->submit_state;
+
+ wq_list_add_tail(&req->comp_list, &state->compl_reqs);
+}
+
static void io_queue_async_work(struct io_kiocb *req, bool *locked)
{
struct io_ring_ctx *ctx = req->ctx;
@@ -1820,20 +1827,15 @@ static inline bool io_req_needs_clean(struct io_kiocb *req)
return req->flags & IO_REQ_CLEAN_FLAGS;
}
-static void io_req_complete_state(struct io_kiocb *req, long res,
- unsigned int cflags)
+static inline void io_req_complete_state(struct io_kiocb *req, long res,
+ unsigned int cflags)
{
- struct io_submit_state *state;
-
/* clean per-opcode space, because req->compl is aliased with it */
if (io_req_needs_clean(req))
io_clean_op(req);
req->result = res;
req->compl.cflags = cflags;
req->flags |= REQ_F_COMPLETE_INLINE;
-
- state = &req->ctx->submit_state;
- wq_list_add_tail(&req->comp_list, &state->compl_reqs);
}
static inline void __io_req_complete(struct io_kiocb *req, unsigned issue_flags,
@@ -2621,10 +2623,12 @@ static void io_req_task_complete(struct io_kiocb *req, bool *locked)
unsigned int cflags = io_put_rw_kbuf(req);
long res = req->result;
- if (*locked)
+ if (*locked) {
io_req_complete_state(req, res, cflags);
- else
+ io_req_add_compl_list(req);
+ } else {
io_req_complete_post(req, res, cflags);
+ }
}
static void __io_complete_rw(struct io_kiocb *req, long res, long res2,
@@ -6889,8 +6893,10 @@ static inline void __io_queue_sqe(struct io_kiocb *req)
ret = io_issue_sqe(req, IO_URING_F_NONBLOCK|IO_URING_F_COMPLETE_DEFER);
- if (req->flags & REQ_F_COMPLETE_INLINE)
+ if (req->flags & REQ_F_COMPLETE_INLINE) {
+ io_req_add_compl_list(req);
return;
+ }
/*
* We async punt it if the file wasn't marked NOWAIT, or if the file
* doesn't support non-blocking read/write attempts
--
2.33.0
* [PATCH 04/16] io_uring: optimise request allocation
2021-10-04 19:02 [PATCH 00/16] squeeze more performance Pavel Begunkov
` (2 preceding siblings ...)
2021-10-04 19:02 ` [PATCH 03/16] io_uring: delay req queueing into compl-batch list Pavel Begunkov
@ 2021-10-04 19:02 ` Pavel Begunkov
2021-10-04 19:02 ` [PATCH 05/16] io_uring: optimise INIT_WQ_LIST Pavel Begunkov
` (12 subsequent siblings)
16 siblings, 0 replies; 19+ messages in thread
From: Pavel Begunkov @ 2021-10-04 19:02 UTC (permalink / raw)
To: io-uring; +Cc: Jens Axboe, asml.silence
Even after fully inlining io_alloc_req(), my compiler keeps a NULL
check on the successful allocation path, and no hacks like an empty
dereference help it. Restructure io_alloc_req() by splitting out the
refill part, so the compiler generates a slightly better binary.
Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 28 ++++++++++++++++++++--------
1 file changed, 20 insertions(+), 8 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 54850696ab6d..377c1cfd5d06 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1912,18 +1912,17 @@ static bool io_flush_cached_reqs(struct io_ring_ctx *ctx)
* Because of that, io_alloc_req() should be called only under ->uring_lock
* and with extra caution to not get a request that is still worked on.
*/
-static struct io_kiocb *io_alloc_req(struct io_ring_ctx *ctx)
+static bool __io_alloc_req_refill(struct io_ring_ctx *ctx)
__must_hold(&ctx->uring_lock)
{
struct io_submit_state *state = &ctx->submit_state;
gfp_t gfp = GFP_KERNEL | __GFP_NOWARN;
void *reqs[IO_REQ_ALLOC_BATCH];
- struct io_wq_work_node *node;
struct io_kiocb *req;
int ret, i;
if (likely(state->free_list.next || io_flush_cached_reqs(ctx)))
- goto got_req;
+ return true;
ret = kmem_cache_alloc_bulk(req_cachep, gfp, ARRAY_SIZE(reqs), reqs);
@@ -1934,7 +1933,7 @@ static struct io_kiocb *io_alloc_req(struct io_ring_ctx *ctx)
if (unlikely(ret <= 0)) {
reqs[0] = kmem_cache_alloc(req_cachep, gfp);
if (!reqs[0])
- return NULL;
+ return false;
ret = 1;
}
@@ -1944,8 +1943,21 @@ static struct io_kiocb *io_alloc_req(struct io_ring_ctx *ctx)
io_preinit_req(req, ctx);
wq_stack_add_head(&req->comp_list, &state->free_list);
}
-got_req:
- node = wq_stack_extract(&state->free_list);
+ return true;
+}
+
+static inline bool io_alloc_req_refill(struct io_ring_ctx *ctx)
+{
+ if (unlikely(!ctx->submit_state.free_list.next))
+ return __io_alloc_req_refill(ctx);
+ return true;
+}
+
+static inline struct io_kiocb *io_alloc_req(struct io_ring_ctx *ctx)
+{
+ struct io_wq_work_node *node;
+
+ node = wq_stack_extract(&ctx->submit_state.free_list);
return container_of(node, struct io_kiocb, comp_list);
}
@@ -7220,12 +7232,12 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr)
const struct io_uring_sqe *sqe;
struct io_kiocb *req;
- req = io_alloc_req(ctx);
- if (unlikely(!req)) {
+ if (unlikely(!io_alloc_req_refill(ctx))) {
if (!submitted)
submitted = -EAGAIN;
break;
}
+ req = io_alloc_req(ctx);
sqe = io_get_sqe(ctx);
if (unlikely(!sqe)) {
wq_stack_add_head(&req->comp_list, &ctx->submit_state.free_list);
--
2.33.0
* [PATCH 05/16] io_uring: optimise INIT_WQ_LIST
2021-10-04 19:02 [PATCH 00/16] squeeze more performance Pavel Begunkov
` (3 preceding siblings ...)
2021-10-04 19:02 ` [PATCH 04/16] io_uring: optimise request allocation Pavel Begunkov
@ 2021-10-04 19:02 ` Pavel Begunkov
2021-10-04 19:02 ` [PATCH 06/16] io_uring: don't wake sqpoll in io_cqring_ev_posted Pavel Begunkov
` (11 subsequent siblings)
16 siblings, 0 replies; 19+ messages in thread
From: Pavel Begunkov @ 2021-10-04 19:02 UTC (permalink / raw)
To: io-uring; +Cc: Jens Axboe, asml.silence
The invariant of io_wq_work_list is that it's empty IFF ->first is
NULL, so there is no need to initialise ->last. Now that the list has
more users this may actually matter: the init is done in each
task_work iteration and on every completion flush.
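For context, the tail-add helper only dereferences ->last after it has
seen a non-NULL ->first, which is what makes skipping the ->last
initialisation safe. Roughly, from fs/io-wq.h:

static inline void wq_list_add_tail(struct io_wq_work_node *node,
				    struct io_wq_work_list *list)
{
	node->next = NULL;
	if (!list->first) {
		/* list was empty: ->last is written before it's ever read */
		list->last = node;
		WRITE_ONCE(list->first, node);
	} else {
		list->last->next = node;
		list->last = node;
	}
}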
Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io-wq.h | 1 -
1 file changed, 1 deletion(-)
diff --git a/fs/io-wq.h b/fs/io-wq.h
index 87ba6a733630..41bf37674a49 100644
--- a/fs/io-wq.h
+++ b/fs/io-wq.h
@@ -38,7 +38,6 @@ struct io_wq_work_list {
#define wq_list_empty(list) (READ_ONCE((list)->first) == NULL)
#define INIT_WQ_LIST(list) do { \
(list)->first = NULL; \
- (list)->last = NULL; \
} while (0)
static inline void wq_list_add_after(struct io_wq_work_node *node,
--
2.33.0
* [PATCH 06/16] io_uring: don't wake sqpoll in io_cqring_ev_posted
2021-10-04 19:02 [PATCH 00/16] squeeze more performance Pavel Begunkov
` (4 preceding siblings ...)
2021-10-04 19:02 ` [PATCH 05/16] io_uring: optimise INIT_WQ_LIST Pavel Begunkov
@ 2021-10-04 19:02 ` Pavel Begunkov
2021-10-04 19:02 ` [PATCH 07/16] io_uring: merge CQ and poll waitqueues Pavel Begunkov
` (10 subsequent siblings)
16 siblings, 0 replies; 19+ messages in thread
From: Pavel Begunkov @ 2021-10-04 19:02 UTC (permalink / raw)
To: io-uring; +Cc: Jens Axboe, asml.silence
io_cqring_ev_posted() doesn't need to wake the SQPOLL task: that's
done either by userspace or via task_work, and no action is required
on request completion. Remove the bits waking it up from
io_cqring_ev_posted().
Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 377c1cfd5d06..56c0f7f1610f 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1619,8 +1619,6 @@ static void io_cqring_ev_posted(struct io_ring_ctx *ctx)
*/
if (wq_has_sleeper(&ctx->cq_wait))
wake_up_all(&ctx->cq_wait);
- if (ctx->sq_data && waitqueue_active(&ctx->sq_data->wait))
- wake_up(&ctx->sq_data->wait);
if (io_should_trigger_evfd(ctx))
eventfd_signal(ctx->cq_ev_fd, 1);
if (waitqueue_active(&ctx->poll_wait))
--
2.33.0
* [PATCH 07/16] io_uring: merge CQ and poll waitqueues
2021-10-04 19:02 [PATCH 00/16] squeeze more performance Pavel Begunkov
` (5 preceding siblings ...)
2021-10-04 19:02 ` [PATCH 06/16] io_uring: don't wake sqpoll in io_cqring_ev_posted Pavel Begunkov
@ 2021-10-04 19:02 ` Pavel Begunkov
2021-10-04 19:02 ` [PATCH 08/16] io_uring: optimise ctx referencing by requests Pavel Begunkov
` (9 subsequent siblings)
16 siblings, 0 replies; 19+ messages in thread
From: Pavel Begunkov @ 2021-10-04 19:02 UTC (permalink / raw)
To: io-uring; +Cc: Jens Axboe, asml.silence
->cq_wait and ->poll_wait are woken up in the same manner, so use a
single waitqueue for both of them. CQ waiters are queued exclusively,
so a wake up will first go over all pollers, which is exactly what we
need.
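A sketch of the two existing call sites sharing the queue after this
change (the ordering is the waitqueue core's doing: non-exclusive
entries sit at the head, exclusive ones at the tail, and a wake up only
stops counting against exclusive waiters):

/* io_uring_poll(): pollers are plain, non-exclusive waiters */
poll_wait(file, &ctx->cq_wait, wait);

/* io_cqring_wait(): CQ waiters are queued exclusively, i.e. at the tail,
 * so wake_up_all() visits every poller before the CQ waiter */
prepare_to_wait_exclusive(&ctx->cq_wait, &iowq.wq, TASK_INTERRUPTIBLE);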
Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 8 +-------
1 file changed, 1 insertion(+), 7 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 56c0f7f1610f..b465fba8a0dc 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -394,7 +394,6 @@ struct io_ring_ctx {
unsigned cached_cq_tail;
unsigned cq_entries;
struct eventfd_ctx *cq_ev_fd;
- struct wait_queue_head poll_wait;
struct wait_queue_head cq_wait;
unsigned cq_extra;
atomic_t cq_timeouts;
@@ -1300,7 +1299,6 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
ctx->flags = p->flags;
init_waitqueue_head(&ctx->sqo_sq_wait);
INIT_LIST_HEAD(&ctx->sqd_list);
- init_waitqueue_head(&ctx->poll_wait);
INIT_LIST_HEAD(&ctx->cq_overflow_list);
init_completion(&ctx->ref_comp);
xa_init_flags(&ctx->io_buffers, XA_FLAGS_ALLOC1);
@@ -1621,8 +1619,6 @@ static void io_cqring_ev_posted(struct io_ring_ctx *ctx)
wake_up_all(&ctx->cq_wait);
if (io_should_trigger_evfd(ctx))
eventfd_signal(ctx->cq_ev_fd, 1);
- if (waitqueue_active(&ctx->poll_wait))
- wake_up_interruptible(&ctx->poll_wait);
}
static void io_cqring_ev_posted_iopoll(struct io_ring_ctx *ctx)
@@ -1636,8 +1632,6 @@ static void io_cqring_ev_posted_iopoll(struct io_ring_ctx *ctx)
}
if (io_should_trigger_evfd(ctx))
eventfd_signal(ctx->cq_ev_fd, 1);
- if (waitqueue_active(&ctx->poll_wait))
- wake_up_interruptible(&ctx->poll_wait);
}
/* Returns true if there are no backlogged entries after the flush */
@@ -9256,7 +9250,7 @@ static __poll_t io_uring_poll(struct file *file, poll_table *wait)
struct io_ring_ctx *ctx = file->private_data;
__poll_t mask = 0;
- poll_wait(file, &ctx->poll_wait, wait);
+ poll_wait(file, &ctx->cq_wait, wait);
/*
* synchronizes with barrier from wq_has_sleeper call in
* io_commit_cqring
--
2.33.0
* [PATCH 08/16] io_uring: optimise ctx referencing by requests
2021-10-04 19:02 [PATCH 00/16] squeeze more performance Pavel Begunkov
` (6 preceding siblings ...)
2021-10-04 19:02 ` [PATCH 07/16] io_uring: merge CQ and poll waitqueues Pavel Begunkov
@ 2021-10-04 19:02 ` Pavel Begunkov
2021-10-04 19:02 ` [PATCH 09/16] io_uring: mark cold functions Pavel Begunkov
` (8 subsequent siblings)
16 siblings, 0 replies; 19+ messages in thread
From: Pavel Begunkov @ 2021-10-04 19:02 UTC (permalink / raw)
To: io-uring; +Cc: Jens Axboe, asml.silence
Currently, we allocate one ctx reference per request at submission
time and put it back when the request is freed. It's batched and not
overly expensive, but it still bloats the kernel, adds 2 function calls
for RCU and adds some overhead for request counting in
io_free_batch_list().
Always keep one reference with a request, even when it's freed and in
io_uring request caches. There is extra work at ring exit / quiesce
paths, which now need to put all cached requests. io_ring_exit_work() is
already looping, so it's not a problem. Add hybrid-busy waiting to
io_ctx_quiesce() as well for now.
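A sketch of the resulting reference lifecycle, pieced together from the
diff below (no new helpers, just the calls the patch touches):

/* cache refill: one ctx ref is taken per request entering the cache */
ret = kmem_cache_alloc_bulk(req_cachep, gfp, ARRAY_SIZE(reqs), reqs);
percpu_ref_get_many(&ctx->refs, ret);

/* completion/free: the request goes back to the cache still holding its
 * ref, so there is no per-request percpu_ref_put() on the hot path */

/* cache purge (ring exit / quiesce): the refs are finally dropped in bulk */
percpu_ref_put_many(&ctx->refs, nr);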
Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 26 ++++++++++++++------------
1 file changed, 14 insertions(+), 12 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index b465fba8a0dc..2b7f38df6a0c 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1807,7 +1807,6 @@ static void io_req_complete_post(struct io_kiocb *req, long res,
io_put_task(req->task, 1);
wq_list_add_head(&req->comp_list, &ctx->locked_free_list);
ctx->locked_free_nr++;
- percpu_ref_put(&ctx->refs);
}
io_commit_cqring(ctx);
spin_unlock(&ctx->completion_lock);
@@ -1929,6 +1928,7 @@ static bool __io_alloc_req_refill(struct io_ring_ctx *ctx)
ret = 1;
}
+ percpu_ref_get_many(&ctx->refs, ret);
for (i = 0; i < ret; i++) {
req = reqs[i];
@@ -1986,8 +1986,6 @@ static void __io_free_req(struct io_kiocb *req)
wq_list_add_head(&req->comp_list, &ctx->locked_free_list);
ctx->locked_free_nr++;
spin_unlock(&ctx->completion_lock);
-
- percpu_ref_put(&ctx->refs);
}
static inline void io_remove_next_linked(struct io_kiocb *req)
@@ -2276,7 +2274,7 @@ static void io_free_batch_list(struct io_ring_ctx *ctx,
__must_hold(&ctx->uring_lock)
{
struct task_struct *task = NULL;
- int task_refs = 0, ctx_refs = 0;
+ int task_refs = 0;
do {
struct io_kiocb *req = container_of(node, struct io_kiocb,
@@ -2296,12 +2294,9 @@ static void io_free_batch_list(struct io_ring_ctx *ctx,
task_refs = 0;
}
task_refs++;
- ctx_refs++;
wq_stack_add_head(&req->comp_list, &ctx->submit_state.free_list);
} while (node);
- if (ctx_refs)
- percpu_ref_put_many(&ctx->refs, ctx_refs);
if (task)
io_put_task(task, task_refs);
}
@@ -7215,8 +7210,6 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr)
return 0;
/* make sure SQ entry isn't read before tail */
nr = min3(nr, ctx->sq_entries, entries);
- if (unlikely(!percpu_ref_tryget_many(&ctx->refs, nr)))
- return -EAGAIN;
io_get_task_refs(nr);
io_submit_state_start(&ctx->submit_state, nr);
@@ -7246,7 +7239,6 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr)
int unused = nr - ref_used;
current->io_uring->cached_refs += unused;
- percpu_ref_put_many(&ctx->refs, unused);
}
io_submit_state_end(ctx);
@@ -9167,6 +9159,7 @@ static void io_destroy_buffers(struct io_ring_ctx *ctx)
static void io_req_caches_free(struct io_ring_ctx *ctx)
{
struct io_submit_state *state = &ctx->submit_state;
+ int nr = 0;
mutex_lock(&ctx->uring_lock);
io_flush_cached_locked_reqs(ctx, state);
@@ -9178,7 +9171,10 @@ static void io_req_caches_free(struct io_ring_ctx *ctx)
node = wq_stack_extract(&state->free_list);
req = container_of(node, struct io_kiocb, comp_list);
kmem_cache_free(req_cachep, req);
+ nr++;
}
+ if (nr)
+ percpu_ref_put_many(&ctx->refs, nr);
mutex_unlock(&ctx->uring_lock);
}
@@ -9348,6 +9344,8 @@ static void io_ring_exit_work(struct work_struct *work)
io_sq_thread_unpark(sqd);
}
+ io_req_caches_free(ctx);
+
if (WARN_ON_ONCE(time_after(jiffies, timeout))) {
/* there is little hope left, don't run it too often */
interval = HZ * 60;
@@ -10727,10 +10725,14 @@ static int io_ctx_quiesce(struct io_ring_ctx *ctx)
*/
mutex_unlock(&ctx->uring_lock);
do {
- ret = wait_for_completion_interruptible(&ctx->ref_comp);
- if (!ret)
+ ret = wait_for_completion_interruptible_timeout(&ctx->ref_comp, HZ);
+ if (ret) {
+ ret = min(0L, ret);
break;
+ }
+
ret = io_run_task_work_sig();
+ io_req_caches_free(ctx);
} while (ret >= 0);
mutex_lock(&ctx->uring_lock);
--
2.33.0
* [PATCH 09/16] io_uring: mark cold functions
2021-10-04 19:02 [PATCH 00/16] squeeze more performance Pavel Begunkov
` (7 preceding siblings ...)
2021-10-04 19:02 ` [PATCH 08/16] io_uring: optimise ctx referencing by requests Pavel Begunkov
@ 2021-10-04 19:02 ` Pavel Begunkov
2021-10-04 19:02 ` [PATCH 10/16] io_uring: optimise io_free_batch_list() Pavel Begunkov
` (7 subsequent siblings)
16 siblings, 0 replies; 19+ messages in thread
From: Pavel Begunkov @ 2021-10-04 19:02 UTC (permalink / raw)
To: io-uring; +Cc: Jens Axboe, asml.silence
Annotate cold functions with __cold so compilers can optimise them for
size. It shrinks the binary by 2.5-3%:
text data bss dec hex filename
90670 14002 8 104680 198e8 ./fs/io_uring.o
88053 14002 8 102063 18eaf ./fs/io_uring.o
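For reference, the annotation is the kernel's existing helper from
include/linux/compiler_attributes.h; it makes GCC/clang optimise the
function for size and group it away from the hot text:

#define __cold			__attribute__((__cold__))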
Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 123 ++++++++++++++++++++++++++------------------------
1 file changed, 64 insertions(+), 59 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 2b7f38df6a0c..10112ea73e77 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1230,7 +1230,7 @@ static inline void req_fail_link_node(struct io_kiocb *req, int res)
req->result = res;
}
-static void io_ring_ctx_ref_free(struct percpu_ref *ref)
+static __cold void io_ring_ctx_ref_free(struct percpu_ref *ref)
{
struct io_ring_ctx *ctx = container_of(ref, struct io_ring_ctx, refs);
@@ -1242,7 +1242,7 @@ static inline bool io_is_timeout_noseq(struct io_kiocb *req)
return !req->timeout.off;
}
-static void io_fallback_req_func(struct work_struct *work)
+static __cold void io_fallback_req_func(struct work_struct *work)
{
struct io_ring_ctx *ctx = container_of(work, struct io_ring_ctx,
fallback_work.work);
@@ -1262,7 +1262,7 @@ static void io_fallback_req_func(struct work_struct *work)
}
-static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
+static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
{
struct io_ring_ctx *ctx;
int hash_bits;
@@ -1361,7 +1361,7 @@ static inline bool io_req_ffs_set(struct io_kiocb *req)
return IS_ENABLED(CONFIG_64BIT) && (req->flags & REQ_F_FIXED_FILE);
}
-static void io_req_track_inflight(struct io_kiocb *req)
+static inline void io_req_track_inflight(struct io_kiocb *req)
{
if (!(req->flags & REQ_F_INFLIGHT)) {
req->flags |= REQ_F_INFLIGHT;
@@ -1500,7 +1500,7 @@ static void io_kill_timeout(struct io_kiocb *req, int status)
}
}
-static void io_queue_deferred(struct io_ring_ctx *ctx)
+static __cold void io_queue_deferred(struct io_ring_ctx *ctx)
{
while (!list_empty(&ctx->defer_list)) {
struct io_defer_entry *de = list_first_entry(&ctx->defer_list,
@@ -1514,7 +1514,7 @@ static void io_queue_deferred(struct io_ring_ctx *ctx)
}
}
-static void io_flush_timeouts(struct io_ring_ctx *ctx)
+static __cold void io_flush_timeouts(struct io_ring_ctx *ctx)
__must_hold(&ctx->completion_lock)
{
u32 seq = ctx->cached_cq_tail - atomic_read(&ctx->cq_timeouts);
@@ -1547,7 +1547,7 @@ static void io_flush_timeouts(struct io_ring_ctx *ctx)
spin_unlock_irq(&ctx->timeout_lock);
}
-static void __io_commit_cqring_flush(struct io_ring_ctx *ctx)
+static __cold void __io_commit_cqring_flush(struct io_ring_ctx *ctx)
{
if (ctx->off_timeout_used)
io_flush_timeouts(ctx);
@@ -1903,7 +1903,7 @@ static bool io_flush_cached_reqs(struct io_ring_ctx *ctx)
* Because of that, io_alloc_req() should be called only under ->uring_lock
* and with extra caution to not get a request that is still worked on.
*/
-static bool __io_alloc_req_refill(struct io_ring_ctx *ctx)
+static __cold bool __io_alloc_req_refill(struct io_ring_ctx *ctx)
__must_hold(&ctx->uring_lock)
{
struct io_submit_state *state = &ctx->submit_state;
@@ -1975,7 +1975,7 @@ static inline void io_dismantle_req(struct io_kiocb *req)
}
}
-static void __io_free_req(struct io_kiocb *req)
+static __cold void __io_free_req(struct io_kiocb *req)
{
struct io_ring_ctx *ctx = req->ctx;
@@ -2462,7 +2462,7 @@ static int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin)
* We can't just wait for polled events to come to us, we have to actively
* find and complete them.
*/
-static void io_iopoll_try_reap_events(struct io_ring_ctx *ctx)
+static __cold void io_iopoll_try_reap_events(struct io_ring_ctx *ctx)
{
if (!(ctx->flags & IORING_SETUP_IOPOLL))
return;
@@ -5635,8 +5635,8 @@ static bool io_poll_remove_one(struct io_kiocb *req)
/*
* Returns true if we found and killed one or more poll requests
*/
-static bool io_poll_remove_all(struct io_ring_ctx *ctx, struct task_struct *tsk,
- bool cancel_all)
+static __cold bool io_poll_remove_all(struct io_ring_ctx *ctx,
+ struct task_struct *tsk, bool cancel_all)
{
struct hlist_node *tmp;
struct io_kiocb *req;
@@ -6428,7 +6428,7 @@ static u32 io_get_sequence(struct io_kiocb *req)
return seq;
}
-static void io_drain_req(struct io_kiocb *req)
+static __cold void io_drain_req(struct io_kiocb *req)
{
struct io_ring_ctx *ctx = req->ctx;
struct io_defer_entry *de;
@@ -7308,7 +7308,7 @@ static int __io_sq_thread(struct io_ring_ctx *ctx, bool cap_entries)
return ret;
}
-static void io_sqd_update_thread_idle(struct io_sq_data *sqd)
+static __cold void io_sqd_update_thread_idle(struct io_sq_data *sqd)
{
struct io_ring_ctx *ctx;
unsigned sq_thread_idle = 0;
@@ -7562,7 +7562,7 @@ static void io_free_page_table(void **table, size_t size)
kfree(table);
}
-static void **io_alloc_page_table(size_t size)
+static __cold void **io_alloc_page_table(size_t size)
{
unsigned i, nr_tables = DIV_ROUND_UP(size, PAGE_SIZE);
size_t init_size = size;
@@ -7591,7 +7591,7 @@ static void io_rsrc_node_destroy(struct io_rsrc_node *ref_node)
kfree(ref_node);
}
-static void io_rsrc_node_ref_zero(struct percpu_ref *ref)
+static __cold void io_rsrc_node_ref_zero(struct percpu_ref *ref)
{
struct io_rsrc_node *node = container_of(ref, struct io_rsrc_node, refs);
struct io_ring_ctx *ctx = node->rsrc_data->ctx;
@@ -7668,7 +7668,8 @@ static int io_rsrc_node_switch_start(struct io_ring_ctx *ctx)
return ctx->rsrc_backup_node ? 0 : -ENOMEM;
}
-static int io_rsrc_ref_quiesce(struct io_rsrc_data *data, struct io_ring_ctx *ctx)
+static __cold int io_rsrc_ref_quiesce(struct io_rsrc_data *data,
+ struct io_ring_ctx *ctx)
{
int ret;
@@ -7724,9 +7725,9 @@ static void io_rsrc_data_free(struct io_rsrc_data *data)
kfree(data);
}
-static int io_rsrc_data_alloc(struct io_ring_ctx *ctx, rsrc_put_fn *do_put,
- u64 __user *utags, unsigned nr,
- struct io_rsrc_data **pdata)
+static __cold int io_rsrc_data_alloc(struct io_ring_ctx *ctx, rsrc_put_fn *do_put,
+ u64 __user *utags, unsigned nr,
+ struct io_rsrc_data **pdata)
{
struct io_rsrc_data *data;
int ret = -ENOMEM;
@@ -8496,8 +8497,8 @@ static struct io_wq *io_init_wq_offload(struct io_ring_ctx *ctx,
return io_wq_create(concurrency, &data);
}
-static int io_uring_alloc_task_context(struct task_struct *task,
- struct io_ring_ctx *ctx)
+static __cold int io_uring_alloc_task_context(struct task_struct *task,
+ struct io_ring_ctx *ctx)
{
struct io_uring_task *tctx;
int ret;
@@ -8544,8 +8545,8 @@ void __io_uring_free(struct task_struct *tsk)
tsk->io_uring = NULL;
}
-static int io_sq_offload_create(struct io_ring_ctx *ctx,
- struct io_uring_params *p)
+static __cold int io_sq_offload_create(struct io_ring_ctx *ctx,
+ struct io_uring_params *p)
{
int ret;
@@ -9184,7 +9185,7 @@ static void io_wait_rsrc_data(struct io_rsrc_data *data)
wait_for_completion(&data->done);
}
-static void io_ring_ctx_free(struct io_ring_ctx *ctx)
+static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
{
io_sq_thread_finish(ctx);
@@ -9293,7 +9294,7 @@ struct io_tctx_exit {
struct io_ring_ctx *ctx;
};
-static void io_tctx_exit_cb(struct callback_head *cb)
+static __cold void io_tctx_exit_cb(struct callback_head *cb)
{
struct io_uring_task *tctx = current->io_uring;
struct io_tctx_exit *work;
@@ -9308,14 +9309,14 @@ static void io_tctx_exit_cb(struct callback_head *cb)
complete(&work->completion);
}
-static bool io_cancel_ctx_cb(struct io_wq_work *work, void *data)
+static __cold bool io_cancel_ctx_cb(struct io_wq_work *work, void *data)
{
struct io_kiocb *req = container_of(work, struct io_kiocb, work);
return req->ctx == data;
}
-static void io_ring_exit_work(struct work_struct *work)
+static __cold void io_ring_exit_work(struct work_struct *work)
{
struct io_ring_ctx *ctx = container_of(work, struct io_ring_ctx, exit_work);
unsigned long timeout = jiffies + HZ * 60 * 5;
@@ -9386,8 +9387,8 @@ static void io_ring_exit_work(struct work_struct *work)
}
/* Returns true if we found and killed one or more timeouts */
-static bool io_kill_timeouts(struct io_ring_ctx *ctx, struct task_struct *tsk,
- bool cancel_all)
+static __cold bool io_kill_timeouts(struct io_ring_ctx *ctx,
+ struct task_struct *tsk, bool cancel_all)
{
struct io_kiocb *req, *tmp;
int canceled = 0;
@@ -9409,7 +9410,7 @@ static bool io_kill_timeouts(struct io_ring_ctx *ctx, struct task_struct *tsk,
return canceled != 0;
}
-static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx)
+static __cold void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx)
{
unsigned long index;
struct creds *creds;
@@ -9471,8 +9472,9 @@ static bool io_cancel_task_cb(struct io_wq_work *work, void *data)
return ret;
}
-static bool io_cancel_defer_files(struct io_ring_ctx *ctx,
- struct task_struct *task, bool cancel_all)
+static __cold bool io_cancel_defer_files(struct io_ring_ctx *ctx,
+ struct task_struct *task,
+ bool cancel_all)
{
struct io_defer_entry *de;
LIST_HEAD(list);
@@ -9497,7 +9499,7 @@ static bool io_cancel_defer_files(struct io_ring_ctx *ctx,
return true;
}
-static bool io_uring_try_cancel_iowq(struct io_ring_ctx *ctx)
+static __cold bool io_uring_try_cancel_iowq(struct io_ring_ctx *ctx)
{
struct io_tctx_node *node;
enum io_wq_cancel cret;
@@ -9521,9 +9523,9 @@ static bool io_uring_try_cancel_iowq(struct io_ring_ctx *ctx)
return ret;
}
-static void io_uring_try_cancel_requests(struct io_ring_ctx *ctx,
- struct task_struct *task,
- bool cancel_all)
+static __cold void io_uring_try_cancel_requests(struct io_ring_ctx *ctx,
+ struct task_struct *task,
+ bool cancel_all)
{
struct io_task_cancel cancel = { .task = task, .all = cancel_all, };
struct io_uring_task *tctx = task ? task->io_uring : NULL;
@@ -9613,7 +9615,7 @@ static inline int io_uring_add_tctx_node(struct io_ring_ctx *ctx)
/*
* Remove this io_uring_file -> task mapping.
*/
-static void io_uring_del_tctx_node(unsigned long index)
+static __cold void io_uring_del_tctx_node(unsigned long index)
{
struct io_uring_task *tctx = current->io_uring;
struct io_tctx_node *node;
@@ -9636,7 +9638,7 @@ static void io_uring_del_tctx_node(unsigned long index)
kfree(node);
}
-static void io_uring_clean_tctx(struct io_uring_task *tctx)
+static __cold void io_uring_clean_tctx(struct io_uring_task *tctx)
{
struct io_wq *wq = tctx->io_wq;
struct io_tctx_node *node;
@@ -9663,7 +9665,7 @@ static s64 tctx_inflight(struct io_uring_task *tctx, bool tracked)
return percpu_counter_sum(&tctx->inflight);
}
-static void io_uring_drop_tctx_refs(struct task_struct *task)
+static __cold void io_uring_drop_tctx_refs(struct task_struct *task)
{
struct io_uring_task *tctx = task->io_uring;
unsigned int refs = tctx->cached_refs;
@@ -9679,7 +9681,8 @@ static void io_uring_drop_tctx_refs(struct task_struct *task)
* Find any io_uring ctx that this task has registered or done IO on, and cancel
* requests. @sqd should be not-null IIF it's an SQPOLL thread cancellation.
*/
-static void io_uring_cancel_generic(bool cancel_all, struct io_sq_data *sqd)
+static __cold void io_uring_cancel_generic(bool cancel_all,
+ struct io_sq_data *sqd)
{
struct io_uring_task *tctx = current->io_uring;
struct io_ring_ctx *ctx;
@@ -9772,7 +9775,7 @@ static void *io_uring_validate_mmap_request(struct file *file,
#ifdef CONFIG_MMU
-static int io_uring_mmap(struct file *file, struct vm_area_struct *vma)
+static __cold int io_uring_mmap(struct file *file, struct vm_area_struct *vma)
{
size_t sz = vma->vm_end - vma->vm_start;
unsigned long pfn;
@@ -9957,7 +9960,7 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
}
#ifdef CONFIG_PROC_FS
-static int io_uring_show_cred(struct seq_file *m, unsigned int id,
+static __cold int io_uring_show_cred(struct seq_file *m, unsigned int id,
const struct cred *cred)
{
struct user_namespace *uns = seq_user_ns(m);
@@ -9989,7 +9992,8 @@ static int io_uring_show_cred(struct seq_file *m, unsigned int id,
return 0;
}
-static void __io_uring_show_fdinfo(struct io_ring_ctx *ctx, struct seq_file *m)
+static __cold void __io_uring_show_fdinfo(struct io_ring_ctx *ctx,
+ struct seq_file *m)
{
struct io_sq_data *sq = NULL;
struct io_overflow_cqe *ocqe;
@@ -10101,7 +10105,7 @@ static void __io_uring_show_fdinfo(struct io_ring_ctx *ctx, struct seq_file *m)
spin_unlock(&ctx->completion_lock);
}
-static void io_uring_show_fdinfo(struct seq_file *m, struct file *f)
+static __cold void io_uring_show_fdinfo(struct seq_file *m, struct file *f)
{
struct io_ring_ctx *ctx = f->private_data;
@@ -10125,8 +10129,8 @@ static const struct file_operations io_uring_fops = {
#endif
};
-static int io_allocate_scq_urings(struct io_ring_ctx *ctx,
- struct io_uring_params *p)
+static __cold int io_allocate_scq_urings(struct io_ring_ctx *ctx,
+ struct io_uring_params *p)
{
struct io_rings *rings;
size_t size, sq_array_offset;
@@ -10215,8 +10219,8 @@ static struct file *io_uring_get_file(struct io_ring_ctx *ctx)
return file;
}
-static int io_uring_create(unsigned entries, struct io_uring_params *p,
- struct io_uring_params __user *params)
+static __cold int io_uring_create(unsigned entries, struct io_uring_params *p,
+ struct io_uring_params __user *params)
{
struct io_ring_ctx *ctx;
struct file *file;
@@ -10374,7 +10378,8 @@ SYSCALL_DEFINE2(io_uring_setup, u32, entries,
return io_uring_setup(entries, params);
}
-static int io_probe(struct io_ring_ctx *ctx, void __user *arg, unsigned nr_args)
+static __cold int io_probe(struct io_ring_ctx *ctx, void __user *arg,
+ unsigned nr_args)
{
struct io_uring_probe *p;
size_t size;
@@ -10430,8 +10435,8 @@ static int io_register_personality(struct io_ring_ctx *ctx)
return id;
}
-static int io_register_restrictions(struct io_ring_ctx *ctx, void __user *arg,
- unsigned int nr_args)
+static __cold int io_register_restrictions(struct io_ring_ctx *ctx,
+ void __user *arg, unsigned int nr_args)
{
struct io_uring_restriction *res;
size_t size;
@@ -10565,7 +10570,7 @@ static int io_register_rsrc_update(struct io_ring_ctx *ctx, void __user *arg,
return __io_register_rsrc_update(ctx, type, &up, up.nr);
}
-static int io_register_rsrc(struct io_ring_ctx *ctx, void __user *arg,
+static __cold int io_register_rsrc(struct io_ring_ctx *ctx, void __user *arg,
unsigned int size, unsigned int type)
{
struct io_uring_rsrc_register rr;
@@ -10591,8 +10596,8 @@ static int io_register_rsrc(struct io_ring_ctx *ctx, void __user *arg,
return -EINVAL;
}
-static int io_register_iowq_aff(struct io_ring_ctx *ctx, void __user *arg,
- unsigned len)
+static __cold int io_register_iowq_aff(struct io_ring_ctx *ctx,
+ void __user *arg, unsigned len)
{
struct io_uring_task *tctx = current->io_uring;
cpumask_var_t new_mask;
@@ -10618,7 +10623,7 @@ static int io_register_iowq_aff(struct io_ring_ctx *ctx, void __user *arg,
return ret;
}
-static int io_unregister_iowq_aff(struct io_ring_ctx *ctx)
+static __cold int io_unregister_iowq_aff(struct io_ring_ctx *ctx)
{
struct io_uring_task *tctx = current->io_uring;
@@ -10628,8 +10633,8 @@ static int io_unregister_iowq_aff(struct io_ring_ctx *ctx)
return io_wq_cpu_affinity(tctx->io_wq, NULL);
}
-static int io_register_iowq_max_workers(struct io_ring_ctx *ctx,
- void __user *arg)
+static __cold int io_register_iowq_max_workers(struct io_ring_ctx *ctx,
+ void __user *arg)
{
struct io_uring_task *tctx = NULL;
struct io_sq_data *sqd = NULL;
@@ -10710,7 +10715,7 @@ static bool io_register_op_must_quiesce(int op)
}
}
-static int io_ctx_quiesce(struct io_ring_ctx *ctx)
+static __cold int io_ctx_quiesce(struct io_ring_ctx *ctx)
{
long ret;
--
2.33.0
* [PATCH 10/16] io_uring: optimise io_free_batch_list()
2021-10-04 19:02 [PATCH 00/16] squeeze more performance Pavel Begunkov
` (8 preceding siblings ...)
2021-10-04 19:02 ` [PATCH 09/16] io_uring: mark cold functions Pavel Begunkov
@ 2021-10-04 19:02 ` Pavel Begunkov
2021-10-04 19:02 ` [PATCH 11/16] io_uring: control ->async_data with a REQ_F flag Pavel Begunkov
` (6 subsequent siblings)
16 siblings, 0 replies; 19+ messages in thread
From: Pavel Begunkov @ 2021-10-04 19:02 UTC (permalink / raw)
To: io-uring; +Cc: Jens Axboe, asml.silence
Delay reading the next node in io_free_batch_list(): it allows the
compiler to load the value a bit later, reducing register spilling in
some cases. With gcc 11.1 it helped to move the @task_refs variable
from the stack to a register and optimised out a couple of per-request
instructions.
Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 10112ea73e77..50312ac4537d 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2280,9 +2280,10 @@ static void io_free_batch_list(struct io_ring_ctx *ctx,
struct io_kiocb *req = container_of(node, struct io_kiocb,
comp_list);
- node = req->comp_list.next;
- if (!req_ref_put_and_test(req))
+ if (!req_ref_put_and_test(req)) {
+ node = req->comp_list.next;
continue;
+ }
io_queue_next(req);
io_dismantle_req(req);
@@ -2294,6 +2295,7 @@ static void io_free_batch_list(struct io_ring_ctx *ctx,
task_refs = 0;
}
task_refs++;
+ node = req->comp_list.next;
wq_stack_add_head(&req->comp_list, &ctx->submit_state.free_list);
} while (node);
--
2.33.0
* [PATCH 11/16] io_uring: control ->async_data with a REQ_F flag
2021-10-04 19:02 [PATCH 00/16] squeeze more performance Pavel Begunkov
` (9 preceding siblings ...)
2021-10-04 19:02 ` [PATCH 10/16] io_uring: optimise io_free_batch_list() Pavel Begunkov
@ 2021-10-04 19:02 ` Pavel Begunkov
2021-10-04 19:02 ` [PATCH 12/16] io_uring: remove struct io_completion Pavel Begunkov
` (5 subsequent siblings)
16 siblings, 0 replies; 19+ messages in thread
From: Pavel Begunkov @ 2021-10-04 19:02 UTC (permalink / raw)
To: io-uring; +Cc: Jens Axboe, asml.silence
->async_data is a slow path, so it won't matter much if we do the
cleanup inside io_clean_op(). Moreover, in many cases it's allocated
together with setting one or more of the IO_REQ_CLEAN_FLAGS flags, so
it'd go through io_clean_op() anyway.
Control ->async_data allocation with a new flag, REQ_F_ASYNC_DATA, so
we can do all the maintenance behind the io_req_needs_clean() fast
check.
Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 72 ++++++++++++++++++++++++++++++++-------------------
1 file changed, 46 insertions(+), 26 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 50312ac4537d..1e93c0b1314c 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -109,7 +109,8 @@
#define SQE_VALID_FLAGS (SQE_COMMON_FLAGS|IOSQE_BUFFER_SELECT|IOSQE_IO_DRAIN)
#define IO_REQ_CLEAN_FLAGS (REQ_F_BUFFER_SELECTED | REQ_F_NEED_CLEANUP | \
- REQ_F_POLLED | REQ_F_INFLIGHT | REQ_F_CREDS)
+ REQ_F_POLLED | REQ_F_INFLIGHT | REQ_F_CREDS | \
+ REQ_F_ASYNC_DATA)
#define IO_TCTX_REFS_CACHE_NR (1U << 10)
@@ -732,6 +733,7 @@ enum {
REQ_F_CREDS_BIT,
REQ_F_REFCOUNT_BIT,
REQ_F_ARM_LTIMEOUT_BIT,
+ REQ_F_ASYNC_DATA_BIT,
/* keep async read/write and isreg together and in order */
REQ_F_NOWAIT_READ_BIT,
REQ_F_NOWAIT_WRITE_BIT,
@@ -787,6 +789,8 @@ enum {
REQ_F_REFCOUNT = BIT(REQ_F_REFCOUNT_BIT),
/* there is a linked timeout that has to be armed */
REQ_F_ARM_LTIMEOUT = BIT(REQ_F_ARM_LTIMEOUT_BIT),
+ /* ->async_data allocated */
+ REQ_F_ASYNC_DATA = BIT(REQ_F_ASYNC_DATA_BIT),
};
struct async_poll {
@@ -847,8 +851,6 @@ struct io_kiocb {
struct io_completion compl;
};
- /* opcode allocated if it needs to store data for async defer */
- void *async_data;
u8 opcode;
/* polled IO has completed */
u8 iopoll_completed;
@@ -863,6 +865,8 @@ struct io_kiocb {
u64 user_data;
struct percpu_ref *fixed_rsrc_refs;
+ /* store used ubuf, so we can prevent reloading */
+ struct io_mapped_ubuf *imu;
/* used by request caches, completion batching and iopoll */
struct io_wq_work_node comp_list;
@@ -872,8 +876,9 @@ struct io_kiocb {
struct hlist_node hash_node;
/* internal polling, see IORING_FEAT_FAST_POLL */
struct async_poll *apoll;
- /* store used ubuf, so we can prevent reloading */
- struct io_mapped_ubuf *imu;
+
+ /* opcode allocated if it needs to store data for async defer */
+ void *async_data;
struct io_wq_work work;
/* custom credentials, valid IFF REQ_F_CREDS is set */
const struct cred *creds;
@@ -1219,6 +1224,11 @@ static bool io_match_task(struct io_kiocb *head, struct task_struct *task,
return false;
}
+static inline bool req_has_async_data(struct io_kiocb *req)
+{
+ return req->flags & REQ_F_ASYNC_DATA;
+}
+
static inline void req_set_fail(struct io_kiocb *req)
{
req->flags |= REQ_F_FAIL;
@@ -1969,10 +1979,6 @@ static inline void io_dismantle_req(struct io_kiocb *req)
io_put_file(req->file);
if (req->fixed_rsrc_refs)
percpu_ref_put(req->fixed_rsrc_refs);
- if (req->async_data) {
- kfree(req->async_data);
- req->async_data = NULL;
- }
}
static __cold void __io_free_req(struct io_kiocb *req)
@@ -2561,7 +2567,7 @@ static bool io_resubmit_prep(struct io_kiocb *req)
{
struct io_async_rw *rw = req->async_data;
- if (!rw)
+ if (!req_has_async_data(req))
return !io_req_prep_async(req);
iov_iter_restore(&rw->iter, &rw->iter_state);
return true;
@@ -2881,7 +2887,7 @@ static void kiocb_done(struct kiocb *kiocb, ssize_t ret,
struct io_async_rw *io = req->async_data;
/* add previously done IO, if any */
- if (io && io->bytes_done > 0) {
+ if (req_has_async_data(req) && io->bytes_done > 0) {
if (ret < 0)
ret = io->bytes_done;
else
@@ -3260,7 +3266,11 @@ static inline bool io_alloc_async_data(struct io_kiocb *req)
{
WARN_ON_ONCE(!io_op_defs[req->opcode].async_size);
req->async_data = kmalloc(io_op_defs[req->opcode].async_size, GFP_KERNEL);
- return req->async_data == NULL;
+ if (req->async_data) {
+ req->flags |= REQ_F_ASYNC_DATA;
+ return false;
+ }
+ return true;
}
static int io_setup_async_rw(struct io_kiocb *req, const struct iovec *iovec,
@@ -3269,7 +3279,7 @@ static int io_setup_async_rw(struct io_kiocb *req, const struct iovec *iovec,
{
if (!force && !io_op_defs[req->opcode].needs_async_setup)
return 0;
- if (!req->async_data) {
+ if (!req_has_async_data(req)) {
struct io_async_rw *iorw;
if (io_alloc_async_data(req)) {
@@ -3402,12 +3412,13 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
struct kiocb *kiocb = &req->rw.kiocb;
struct iov_iter __iter, *iter = &__iter;
- struct io_async_rw *rw = req->async_data;
bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
struct iov_iter_state __state, *state;
+ struct io_async_rw *rw;
ssize_t ret, ret2;
- if (rw) {
+ if (req_has_async_data(req)) {
+ rw = req->async_data;
iter = &rw->iter;
state = &rw->iter_state;
/*
@@ -3537,12 +3548,13 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
struct kiocb *kiocb = &req->rw.kiocb;
struct iov_iter __iter, *iter = &__iter;
- struct io_async_rw *rw = req->async_data;
bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
struct iov_iter_state __state, *state;
+ struct io_async_rw *rw;
ssize_t ret, ret2;
- if (rw) {
+ if (req_has_async_data(req)) {
+ rw = req->async_data;
iter = &rw->iter;
state = &rw->iter_state;
iov_iter_restore(iter, state);
@@ -4711,8 +4723,9 @@ static int io_sendmsg(struct io_kiocb *req, unsigned int issue_flags)
if (unlikely(!sock))
return -ENOTSOCK;
- kmsg = req->async_data;
- if (!kmsg) {
+ if (req_has_async_data(req)) {
+ kmsg = req->async_data;
+ } else {
ret = io_sendmsg_copy_hdr(req, &iomsg);
if (ret)
return ret;
@@ -4928,8 +4941,9 @@ static int io_recvmsg(struct io_kiocb *req, unsigned int issue_flags)
if (unlikely(!sock))
return -ENOTSOCK;
- kmsg = req->async_data;
- if (!kmsg) {
+ if (req_has_async_data(req)) {
+ kmsg = req->async_data;
+ } else {
ret = io_recvmsg_copy_hdr(req, &iomsg);
if (ret)
return ret;
@@ -5120,7 +5134,7 @@ static int io_connect(struct io_kiocb *req, unsigned int issue_flags)
int ret;
bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
- if (req->async_data) {
+ if (req_has_async_data(req)) {
io = req->async_data;
} else {
ret = move_addr_to_kernel(req->connect.addr,
@@ -5136,7 +5150,7 @@ static int io_connect(struct io_kiocb *req, unsigned int issue_flags)
ret = __sys_connect_file(req->file, &io->address,
req->connect.addr_len, file_flags);
if ((ret == -EAGAIN || ret == -EINPROGRESS) && force_nonblock) {
- if (req->async_data)
+ if (req_has_async_data(req))
return -EAGAIN;
if (io_alloc_async_data(req)) {
ret = -ENOMEM;
@@ -5427,7 +5441,10 @@ static void __io_queue_proc(struct io_poll_iocb *poll, struct io_poll_table *pt,
io_init_poll_iocb(poll, poll_one->events, io_poll_double_wake);
req_ref_get(req);
poll->wait.private = req;
+
*poll_ptr = poll;
+ if (req->opcode == IORING_OP_POLL_ADD)
+ req->flags |= REQ_F_ASYNC_DATA;
}
pt->nr_entries++;
@@ -6089,7 +6106,7 @@ static int io_timeout_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe,
if (unlikely(off && !req->ctx->off_timeout_used))
req->ctx->off_timeout_used = true;
- if (!req->async_data && io_alloc_async_data(req))
+ if (!req_has_async_data(req) && io_alloc_async_data(req))
return -ENOMEM;
data = req->async_data;
@@ -6398,7 +6415,7 @@ static int io_req_prep_async(struct io_kiocb *req)
{
if (!io_op_defs[req->opcode].needs_async_setup)
return 0;
- if (WARN_ON_ONCE(req->async_data))
+ if (WARN_ON_ONCE(req_has_async_data(req)))
return -EFAULT;
if (io_alloc_async_data(req))
return -EAGAIN;
@@ -6541,7 +6558,10 @@ static void io_clean_op(struct io_kiocb *req)
}
if (req->flags & REQ_F_CREDS)
put_cred(req->creds);
-
+ if (req->flags & REQ_F_ASYNC_DATA) {
+ kfree(req->async_data);
+ req->async_data = NULL;
+ }
req->flags &= ~IO_REQ_CLEAN_FLAGS;
}
--
2.33.0
* [PATCH 12/16] io_uring: remove struct io_completion
2021-10-04 19:02 [PATCH 00/16] squeeze more performance Pavel Begunkov
` (10 preceding siblings ...)
2021-10-04 19:02 ` [PATCH 11/16] io_uring: control ->async_data with a REQ_F flag Pavel Begunkov
@ 2021-10-04 19:02 ` Pavel Begunkov
2021-10-04 19:02 ` [PATCH 13/16] io_uring: inline io_req_needs_clean() Pavel Begunkov
` (4 subsequent siblings)
16 siblings, 0 replies; 19+ messages in thread
From: Pavel Begunkov @ 2021-10-04 19:02 UTC (permalink / raw)
To: io-uring; +Cc: Jens Axboe, asml.silence
We keep struct io_completion only as temporary storage for cflags.
Place the field directly in io_kiocb: it's cleaner, removes extra bits
and might even be useful for future optimisations.
Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 24 +++++++-----------------
1 file changed, 7 insertions(+), 17 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 1e93c0b1314c..9f12ac5ec906 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -684,11 +684,6 @@ struct io_hardlink {
int flags;
};
-struct io_completion {
- struct file *file;
- u32 cflags;
-};
-
struct io_async_connect {
struct sockaddr_storage address;
};
@@ -847,22 +842,20 @@ struct io_kiocb {
struct io_mkdir mkdir;
struct io_symlink symlink;
struct io_hardlink hardlink;
- /* use only after cleaning per-op data, see io_clean_op() */
- struct io_completion compl;
};
u8 opcode;
/* polled IO has completed */
u8 iopoll_completed;
-
u16 buf_index;
+ unsigned int flags;
+
+ u64 user_data;
u32 result;
+ u32 cflags;
struct io_ring_ctx *ctx;
- unsigned int flags;
- atomic_t refs;
struct task_struct *task;
- u64 user_data;
struct percpu_ref *fixed_rsrc_refs;
/* store used ubuf, so we can prevent reloading */
@@ -870,13 +863,13 @@ struct io_kiocb {
/* used by request caches, completion batching and iopoll */
struct io_wq_work_node comp_list;
+ atomic_t refs;
struct io_kiocb *link;
struct io_task_work io_task_work;
/* for polled requests, i.e. IORING_OP_POLL_ADD and async armed poll */
struct hlist_node hash_node;
/* internal polling, see IORING_FEAT_FAST_POLL */
struct async_poll *apoll;
-
/* opcode allocated if it needs to store data for async defer */
void *async_data;
struct io_wq_work work;
@@ -1831,11 +1824,8 @@ static inline bool io_req_needs_clean(struct io_kiocb *req)
static inline void io_req_complete_state(struct io_kiocb *req, long res,
unsigned int cflags)
{
- /* clean per-opcode space, because req->compl is aliased with it */
- if (io_req_needs_clean(req))
- io_clean_op(req);
req->result = res;
- req->compl.cflags = cflags;
+ req->cflags = cflags;
req->flags |= REQ_F_COMPLETE_INLINE;
}
@@ -2321,7 +2311,7 @@ static void __io_submit_flush_completions(struct io_ring_ctx *ctx)
comp_list);
__io_cqring_fill_event(ctx, req->user_data, req->result,
- req->compl.cflags);
+ req->cflags);
}
io_commit_cqring(ctx);
spin_unlock(&ctx->completion_lock);
--
2.33.0
* [PATCH 13/16] io_uring: inline io_req_needs_clean()
2021-10-04 19:02 [PATCH 00/16] squeeze more performance Pavel Begunkov
` (11 preceding siblings ...)
2021-10-04 19:02 ` [PATCH 12/16] io_uring: remove struct io_completion Pavel Begunkov
@ 2021-10-04 19:02 ` Pavel Begunkov
2021-10-04 19:02 ` [PATCH 14/16] io_uring: inline io_poll_complete Pavel Begunkov
` (3 subsequent siblings)
16 siblings, 0 replies; 19+ messages in thread
From: Pavel Begunkov @ 2021-10-04 19:02 UTC (permalink / raw)
To: io-uring; +Cc: Jens Axboe, asml.silence
There is only a single user of io_req_needs_clean(), so inline it.
Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 7 +------
1 file changed, 1 insertion(+), 6 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 9f12ac5ec906..1ffa811eb76a 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1816,11 +1816,6 @@ static void io_req_complete_post(struct io_kiocb *req, long res,
io_cqring_ev_posted(ctx);
}
-static inline bool io_req_needs_clean(struct io_kiocb *req)
-{
- return req->flags & IO_REQ_CLEAN_FLAGS;
-}
-
static inline void io_req_complete_state(struct io_kiocb *req, long res,
unsigned int cflags)
{
@@ -1963,7 +1958,7 @@ static inline void io_dismantle_req(struct io_kiocb *req)
{
unsigned int flags = req->flags;
- if (unlikely(io_req_needs_clean(req)))
+ if (unlikely(flags & IO_REQ_CLEAN_FLAGS))
io_clean_op(req);
if (!(flags & REQ_F_FIXED_FILE))
io_put_file(req->file);
--
2.33.0
* [PATCH 14/16] io_uring: inline io_poll_complete
2021-10-04 19:02 [PATCH 00/16] squeeze more performance Pavel Begunkov
` (12 preceding siblings ...)
2021-10-04 19:02 ` [PATCH 13/16] io_uring: inline io_req_needs_clean() Pavel Begunkov
@ 2021-10-04 19:02 ` Pavel Begunkov
2021-10-04 19:03 ` [PATCH 15/16] io_uring: correct fill events helpers types Pavel Begunkov
` (2 subsequent siblings)
16 siblings, 0 replies; 19+ messages in thread
From: Pavel Begunkov @ 2021-10-04 19:02 UTC (permalink / raw)
To: io-uring; +Cc: Jens Axboe, asml.silence
Inline io_poll_complete(): it's simple and doesn't serve any
particular purpose as a separate helper.
Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 13 ++-----------
1 file changed, 2 insertions(+), 11 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 1ffa811eb76a..f60818602544 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -5295,16 +5295,6 @@ static bool __io_poll_complete(struct io_kiocb *req, __poll_t mask)
return !(flags & IORING_CQE_F_MORE);
}
-static inline bool io_poll_complete(struct io_kiocb *req, __poll_t mask)
- __must_hold(&req->ctx->completion_lock)
-{
- bool done;
-
- done = __io_poll_complete(req, mask);
- io_commit_cqring(req->ctx);
- return done;
-}
-
static void io_poll_task_func(struct io_kiocb *req, bool *locked)
{
struct io_ring_ctx *ctx = req->ctx;
@@ -5794,7 +5784,8 @@ static int io_poll_add(struct io_kiocb *req, unsigned int issue_flags)
if (mask) { /* no async, we'd stolen it */
ipt.error = 0;
- done = io_poll_complete(req, mask);
+ done = __io_poll_complete(req, mask);
+ io_commit_cqring(req->ctx);
}
spin_unlock(&ctx->completion_lock);
--
2.33.0
* [PATCH 15/16] io_uring: correct fill events helpers types
2021-10-04 19:02 [PATCH 00/16] squeeze more performance Pavel Begunkov
` (13 preceding siblings ...)
2021-10-04 19:02 ` [PATCH 14/16] io_uring: inline io_poll_complete Pavel Begunkov
@ 2021-10-04 19:03 ` Pavel Begunkov
2021-10-04 19:03 ` [PATCH 16/16] io_uring: mark hot functions Pavel Begunkov
2021-10-04 20:19 ` [PATCH 00/16] squeeze more performance Jens Axboe
16 siblings, 0 replies; 19+ messages in thread
From: Pavel Begunkov @ 2021-10-04 19:03 UTC (permalink / raw)
To: io-uring; +Cc: Jens Axboe, asml.silence
The CQE result is a 32-bit integer, so the functions generating CQEs
are better off taking 32-bit types rather than long. Convert
io_cqring_fill_event() and other helpers.
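The UAPI layout makes the point: the CQE result is a 32-bit field on
the wire (include/uapi/linux/io_uring.h), so carrying it around as a
long buys nothing:

struct io_uring_cqe {
	__u64	user_data;	/* sqe->user_data passed back */
	__s32	res;		/* result code for this event */
	__u32	flags;
};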
Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 24 ++++++++++++------------
1 file changed, 12 insertions(+), 12 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index f60818602544..62dc128e9b6b 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1072,7 +1072,7 @@ static void io_uring_try_cancel_requests(struct io_ring_ctx *ctx,
static void io_uring_cancel_generic(bool cancel_all, struct io_sq_data *sqd);
static bool io_cqring_fill_event(struct io_ring_ctx *ctx, u64 user_data,
- long res, unsigned int cflags);
+ s32 res, u32 cflags);
static void io_put_req(struct io_kiocb *req);
static void io_put_req_deferred(struct io_kiocb *req);
static void io_dismantle_req(struct io_kiocb *req);
@@ -1730,7 +1730,7 @@ static inline void io_get_task_refs(int nr)
}
static bool io_cqring_event_overflow(struct io_ring_ctx *ctx, u64 user_data,
- long res, unsigned int cflags)
+ s32 res, u32 cflags)
{
struct io_overflow_cqe *ocqe;
@@ -1758,7 +1758,7 @@ static bool io_cqring_event_overflow(struct io_ring_ctx *ctx, u64 user_data,
}
static inline bool __io_cqring_fill_event(struct io_ring_ctx *ctx, u64 user_data,
- long res, unsigned int cflags)
+ s32 res, u32 cflags)
{
struct io_uring_cqe *cqe;
@@ -1781,13 +1781,13 @@ static inline bool __io_cqring_fill_event(struct io_ring_ctx *ctx, u64 user_data
/* not as hot to bloat with inlining */
static noinline bool io_cqring_fill_event(struct io_ring_ctx *ctx, u64 user_data,
- long res, unsigned int cflags)
+ s32 res, u32 cflags)
{
return __io_cqring_fill_event(ctx, user_data, res, cflags);
}
-static void io_req_complete_post(struct io_kiocb *req, long res,
- unsigned int cflags)
+static void io_req_complete_post(struct io_kiocb *req, s32 res,
+ u32 cflags)
{
struct io_ring_ctx *ctx = req->ctx;
@@ -1816,8 +1816,8 @@ static void io_req_complete_post(struct io_kiocb *req, long res,
io_cqring_ev_posted(ctx);
}
-static inline void io_req_complete_state(struct io_kiocb *req, long res,
- unsigned int cflags)
+static inline void io_req_complete_state(struct io_kiocb *req, s32 res,
+ u32 cflags)
{
req->result = res;
req->cflags = cflags;
@@ -1825,7 +1825,7 @@ static inline void io_req_complete_state(struct io_kiocb *req, long res,
}
static inline void __io_req_complete(struct io_kiocb *req, unsigned issue_flags,
- long res, unsigned cflags)
+ s32 res, u32 cflags)
{
if (issue_flags & IO_URING_F_COMPLETE_DEFER)
io_req_complete_state(req, res, cflags);
@@ -1833,12 +1833,12 @@ static inline void __io_req_complete(struct io_kiocb *req, unsigned issue_flags,
io_req_complete_post(req, res, cflags);
}
-static inline void io_req_complete(struct io_kiocb *req, long res)
+static inline void io_req_complete(struct io_kiocb *req, s32 res)
{
__io_req_complete(req, 0, res, 0);
}
-static void io_req_complete_failed(struct io_kiocb *req, long res)
+static void io_req_complete_failed(struct io_kiocb *req, s32 res)
{
req_set_fail(req);
io_req_complete_post(req, res, 0);
@@ -2613,7 +2613,7 @@ static bool __io_complete_rw_common(struct io_kiocb *req, long res)
static void io_req_task_complete(struct io_kiocb *req, bool *locked)
{
unsigned int cflags = io_put_rw_kbuf(req);
- long res = req->result;
+ int res = req->result;
if (*locked) {
io_req_complete_state(req, res, cflags);
--
2.33.0
^ permalink raw reply related [flat|nested] 19+ messages in thread
* [PATCH 16/16] io_uring: mark hot functions
2021-10-04 19:02 [PATCH 00/16] squeeze more performance Pavel Begunkov
` (14 preceding siblings ...)
2021-10-04 19:03 ` [PATCH 15/16] io_uring: correct fill events helpers types Pavel Begunkov
@ 2021-10-04 19:03 ` Pavel Begunkov
2021-10-04 20:19 ` [PATCH 00/16] squeeze more performance Jens Axboe
16 siblings, 0 replies; 19+ messages in thread
From: Pavel Begunkov @ 2021-10-04 19:03 UTC (permalink / raw)
To: io-uring; +Cc: Jens Axboe, asml.silence
It can be quite beneficial to mark appropriate functions with
__attribute__((hot)); it mostly helps by rearranging functions so that
they are cached better. E.g. the nops test showed a 31->32 MIOPS
improvement with it.
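For the curious, a tiny standalone sketch of what the attribute buys
(fast_path(), slow_error_path() and handle() are made-up names; only the
__hot/__cold defines mirror the patch, which adds a local #define __hot
near the top of fs/io_uring.c): GCC optimises hot-marked functions more
aggressively and, on many targets, groups them into a .text.hot
subsection so the hot path stays close together in the instruction
cache, while __cold code is kept out of the way.

#define __hot	__attribute__((__hot__))
#define __cold	__attribute__((__cold__))

/* Expected to run for nearly every request: optimised harder and
 * placed with other hot functions for i-cache locality. */
static __hot int fast_path(int x)
{
	return x + 1;
}

/* Rarely executed: kept out of the hot part of the text section. */
static __cold int slow_error_path(int err)
{
	return -err;
}

int handle(int x)
{
	int ret = fast_path(x);

	if (ret < 0)
		return slow_error_path(ret);
	return ret;
}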
Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 34 ++++++++++++++++++----------------
1 file changed, 18 insertions(+), 16 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 62dc128e9b6b..e35569df7f80 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -114,6 +114,8 @@
#define IO_TCTX_REFS_CACHE_NR (1U << 10)
+#define __hot __attribute__((__hot__))
+
struct io_uring {
u32 head ____cacheline_aligned_in_smp;
u32 tail ____cacheline_aligned_in_smp;
@@ -1816,8 +1818,8 @@ static void io_req_complete_post(struct io_kiocb *req, s32 res,
io_cqring_ev_posted(ctx);
}
-static inline void io_req_complete_state(struct io_kiocb *req, s32 res,
- u32 cflags)
+static inline __hot void io_req_complete_state(struct io_kiocb *req,
+ s32 res, u32 cflags)
{
req->result = res;
req->cflags = cflags;
@@ -2260,8 +2262,8 @@ static void io_free_req_work(struct io_kiocb *req, bool *locked)
io_free_req(req);
}
-static void io_free_batch_list(struct io_ring_ctx *ctx,
- struct io_wq_work_node *node)
+static __hot void io_free_batch_list(struct io_ring_ctx *ctx,
+ struct io_wq_work_node *node)
__must_hold(&ctx->uring_lock)
{
struct task_struct *task = NULL;
@@ -2294,7 +2296,7 @@ static void io_free_batch_list(struct io_ring_ctx *ctx,
io_put_task(task, task_refs);
}
-static void __io_submit_flush_completions(struct io_ring_ctx *ctx)
+static __hot void __io_submit_flush_completions(struct io_ring_ctx *ctx)
__must_hold(&ctx->uring_lock)
{
struct io_wq_work_node *node, *prev;
@@ -2389,7 +2391,7 @@ static inline bool io_run_task_work(void)
return false;
}
-static int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin)
+static __hot int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin)
{
struct io_wq_work_node *pos, *start, *prev;
int nr_events = 0;
@@ -2479,7 +2481,7 @@ static __cold void io_iopoll_try_reap_events(struct io_ring_ctx *ctx)
mutex_unlock(&ctx->uring_lock);
}
-static int io_iopoll_check(struct io_ring_ctx *ctx, long min)
+static __hot int io_iopoll_check(struct io_ring_ctx *ctx, long min)
{
unsigned int nr_events = 0;
int ret = 0;
@@ -6541,7 +6543,7 @@ static void io_clean_op(struct io_kiocb *req)
req->flags &= ~IO_REQ_CLEAN_FLAGS;
}
-static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
+static __hot int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
{
struct io_ring_ctx *ctx = req->ctx;
const struct cred *creds = NULL;
@@ -6882,7 +6884,7 @@ static void io_queue_sqe_arm_apoll(struct io_kiocb *req)
io_queue_linked_timeout(linked_timeout);
}
-static inline void __io_queue_sqe(struct io_kiocb *req)
+static inline __hot void __io_queue_sqe(struct io_kiocb *req)
__must_hold(&req->ctx->uring_lock)
{
struct io_kiocb *linked_timeout;
@@ -6926,7 +6928,7 @@ static void io_queue_sqe_fallback(struct io_kiocb *req)
}
}
-static inline void io_queue_sqe(struct io_kiocb *req)
+static inline __hot void io_queue_sqe(struct io_kiocb *req)
__must_hold(&req->ctx->uring_lock)
{
if (likely(!(req->flags & (REQ_F_FORCE_ASYNC | REQ_F_FAIL))))
@@ -6977,8 +6979,8 @@ static void io_init_req_drain(struct io_kiocb *req)
}
}
-static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
- const struct io_uring_sqe *sqe)
+static __hot int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
+ const struct io_uring_sqe *sqe)
__must_hold(&ctx->uring_lock)
{
struct io_submit_state *state;
@@ -7050,8 +7052,8 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
return io_req_prep(req, sqe);
}
-static int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
- const struct io_uring_sqe *sqe)
+static __hot int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
+ const struct io_uring_sqe *sqe)
__must_hold(&ctx->uring_lock)
{
struct io_submit_link *link = &ctx->submit_state.link;
@@ -7174,7 +7176,7 @@ static void io_commit_sqring(struct io_ring_ctx *ctx)
* used, it's important that those reads are done through READ_ONCE() to
* prevent a re-load down the line.
*/
-static const struct io_uring_sqe *io_get_sqe(struct io_ring_ctx *ctx)
+static inline const struct io_uring_sqe *io_get_sqe(struct io_ring_ctx *ctx)
{
unsigned head, mask = ctx->sq_entries - 1;
unsigned sq_idx = ctx->cached_sq_head++ & mask;
@@ -7198,7 +7200,7 @@ static const struct io_uring_sqe *io_get_sqe(struct io_ring_ctx *ctx)
return NULL;
}
-static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr)
+static __hot int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr)
__must_hold(&ctx->uring_lock)
{
unsigned int entries = io_sqring_entries(ctx);
--
2.33.0
^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH 00/16] squeeze more performance
2021-10-04 19:02 [PATCH 00/16] squeeze more performance Pavel Begunkov
` (15 preceding siblings ...)
2021-10-04 19:03 ` [PATCH 16/16] io_uring: mark hot functions Pavel Begunkov
@ 2021-10-04 20:19 ` Jens Axboe
2021-10-04 20:33 ` Jens Axboe
16 siblings, 1 reply; 19+ messages in thread
From: Jens Axboe @ 2021-10-04 20:19 UTC (permalink / raw)
To: Pavel Begunkov, io-uring
On 10/4/21 1:02 PM, Pavel Begunkov wrote:
> fio/t/io_uring -s32 -d32 -c32 -N1
>
> | baseline | 0-15 | 0-16 | diff
> setup 1: | 34 MIOPS | 42 MIOPS | 42.2 MIOPS | 25 %
> setup 2: | 31 MIOPS | 31 MIOPS | 32 MIOPS | ~3 %
>
> Setup 1 gets 25% performance improvement, which is unexpected and a
> share of it should be accounted as compiler/HW magic. Setup 2 is just
> 3%, but the catch is that some of the patches _very_ unexpectedly sink
> performance, so it's more like 31 MIOPS -> 29 -> 30 -> 29 -> 31 -> 32
>
> I'd suggest to leave 16/16 aside, maybe for future consideration and
> refinement. The end result is not very clear, I'd expect probably
> around 3-5% with a more stable setup for nops32, and a better win
> for io_cqring_ev_posted() intensive cases like BPF.
Looks good and tests fine for me. I've skipped 16/16 for now; we can
evaluate that one later.
--
Jens Axboe
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 00/16] squeeze more performance
2021-10-04 20:19 ` [PATCH 00/16] squeeze more performance Jens Axboe
@ 2021-10-04 20:33 ` Jens Axboe
0 siblings, 0 replies; 19+ messages in thread
From: Jens Axboe @ 2021-10-04 20:33 UTC (permalink / raw)
To: Pavel Begunkov, io-uring
On 10/4/21 2:19 PM, Jens Axboe wrote:
> On 10/4/21 1:02 PM, Pavel Begunkov wrote:
>> fio/t/io_uring -s32 -d32 -c32 -N1
>>
>> | baseline | 0-15 | 0-16 | diff
>> setup 1: | 34 MIOPS | 42 MIOPS | 42.2 MIOPS | 25 %
>> setup 2: | 31 MIOPS | 31 MIOPS | 32 MIOPS | ~3 %
>>
>> Setup 1 gets 25% performance improvement, which is unexpected and a
>> share of it should be accounted as compiler/HW magic. Setup 2 is just
>> 3%, but the catch is that some of the patches _very_ unexpectedly sink
>> performance, so it's more like 31 MIOPS -> 29 -> 30 -> 29 -> 31 -> 32
>>
>> I'd suggest to leave 16/16 aside, maybe for future consideration and
>> refinement. The end result is not very clear, I'd expect probably
>> around 3-5% with a more stable setup for nops32, and a better win
>> for io_cqring_ev_posted() intensive cases like BPF.
>
> Looks and tests good for me. I've skipped 16/16 for now, we can
> evaluate that one later.
For reference, running this on just the faster box:
Setup/Test   | Peak-1-thread | Peak-2-threads | NOPS  | Diff
-------------+---------------+----------------+-------+-----
Setup 2 pre  | 5.07M         | 5.74M          | 71.1M |
Setup 2 post | 5.23M         | 5.84M          | 73.9M |
which is a pretty substantial win.
--
Jens Axboe
^ permalink raw reply [flat|nested] 19+ messages in thread