* [PATCH 01/12] io_uring: keep SQ pointers in a single cacheline
From: Pavel Begunkov @ 2021-06-14 22:37 UTC (permalink / raw)
To: Jens Axboe, io-uring
sq_array and sq_sqes are always used together, but they currently sit
in different cachelines: the boundary falls right before
cq_overflow_list, which is rarely touched. Move the two fields together
so that fetching the SQ pointers loads only one cacheline.
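As a side note, not part of the patch: the layout claim is easy to
sanity-check in userspace with offsetof() on a toy struct. The 64-byte
cacheline size and the struct below are illustrative assumptions, not
the real io_ring_ctx.

#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define CACHELINE 64	/* assumed cacheline size */

/* Toy layout mirroring the idea of the patch, not the real io_ring_ctx. */
struct sq_section {
	unsigned int *sq_array;		/* indices into the SQE array */
	void *sq_sqes;			/* the SQE array itself */
	unsigned int cached_sq_head;
	unsigned int sq_entries;
};

int main(void)
{
	/* both pointers touched on every SQE fetch fall into one cacheline */
	assert(offsetof(struct sq_section, sq_array) / CACHELINE ==
	       offsetof(struct sq_section, sq_sqes) / CACHELINE);
	printf("sq_array and sq_sqes share a cacheline\n");
	return 0;
}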
Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index d665c9419ad3..f3c827cd8ff8 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -364,6 +364,7 @@ struct io_ring_ctx {
* array.
*/
u32 *sq_array;
+ struct io_uring_sqe *sq_sqes;
unsigned cached_sq_head;
unsigned sq_entries;
unsigned sq_thread_idle;
@@ -373,8 +374,6 @@ struct io_ring_ctx {
struct list_head defer_list;
struct list_head timeout_list;
struct list_head cq_overflow_list;
-
- struct io_uring_sqe *sq_sqes;
} ____cacheline_aligned_in_smp;
struct {
--
2.31.1
* [PATCH 02/12] io_uring: move ctx->flags from SQ cacheline
From: Pavel Begunkov @ 2021-06-14 22:37 UTC (permalink / raw)
To: Jens Axboe, io-uring
ctx->flags is heavily used by both the completion and submission sides,
so move it out of the ctx fields related to submission. Instead, place
it together with ctx->refs: that section is already cacheline-aligned
and so pads a lot of space, and both fields almost never change. They
are also accessed together most of the time, since refs are taken at
submission and put back during completion.

Do the same with ctx->rings, where the pointer itself is never modified
apart from ring init/free.

Note: in percpu mode, struct percpu_ref doesn't modify the struct itself
but goes through an indirection via ref->percpu_count_ptr.
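To illustrate that note, here is a minimal userspace model, not the
kernel's struct percpu_ref: the hot get/put path only writes through the
counter pointer, so the structure embedded in the ctx is never dirtied
and its cacheline can be shared with read-mostly fields such as ->flags
and ->rings.

#include <stdatomic.h>
#include <stdio.h>

/* stand-in for percpu_count_ptr; the real one points at per-cpu counters */
struct ref_model {
	atomic_long *count_ptr;
};

static void ref_model_get(struct ref_model *ref)
{
	/* writes *count_ptr, never the struct itself */
	atomic_fetch_add_explicit(ref->count_ptr, 1, memory_order_relaxed);
}

static void ref_model_put(struct ref_model *ref)
{
	atomic_fetch_sub_explicit(ref->count_ptr, 1, memory_order_relaxed);
}

int main(void)
{
	atomic_long count = 0;
	struct ref_model ref = { .count_ptr = &count };

	ref_model_get(&ref);
	ref_model_put(&ref);
	printf("count=%ld\n", atomic_load(&count));
	return 0;
}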
Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index f3c827cd8ff8..a4460383bd25 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -341,17 +341,19 @@ struct io_submit_state {
};
struct io_ring_ctx {
+ /* const or read-mostly hot data */
struct {
struct percpu_ref refs;
- } ____cacheline_aligned_in_smp;
- struct {
+ struct io_rings *rings;
unsigned int flags;
unsigned int compat: 1;
unsigned int drain_next: 1;
unsigned int eventfd_async: 1;
unsigned int restricted: 1;
+ } ____cacheline_aligned_in_smp;
+ struct {
/*
* Ring buffer of indices into array of io_uring_sqe, which is
* mmapped by the application using the IORING_OFF_SQES offset.
@@ -386,8 +388,6 @@ struct io_ring_ctx {
struct list_head locked_free_list;
unsigned int locked_free_nr;
- struct io_rings *rings;
-
const struct cred *sq_creds; /* cred used for __io_sq_thread() */
struct io_sq_data *sq_data; /* if using sq thread polling */
--
2.31.1
* [PATCH 03/12] io_uring: shuffle more fields into SQ ctx section
From: Pavel Begunkov @ 2021-06-14 22:37 UTC (permalink / raw)
To: Jens Axboe, io-uring
Since locked_free_* moved out of struct io_submit_state,
ctx->submit_state is accessed on the submission side only, so move it
into the submission section. The same goes for the rsrc table
pointers/nodes/etc.: they must be taken and checked during submission
because they are synchronised by uring_lock, so move them there as
well.
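As an illustration only, not something from the patch, the access
pattern these fields follow looks roughly like the hypothetical helper
below; the function name is made up, only the fields and the locking
rule come from the code being moved.

/*
 * Hypothetical sketch: fixed resources are only valid to look at with
 * ->uring_lock held, which is why they can live in the submission
 * section.
 */
static struct io_mapped_ubuf *sketch_get_fixed_buf(struct io_ring_ctx *ctx,
						   unsigned int idx)
{
	lockdep_assert_held(&ctx->uring_lock);

	if (unlikely(idx >= ctx->nr_user_bufs))
		return NULL;
	return ctx->user_bufs[idx];
}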
Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 35 +++++++++++++++++------------------
1 file changed, 17 insertions(+), 18 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index a4460383bd25..5cc0c4dd2709 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -353,6 +353,7 @@ struct io_ring_ctx {
unsigned int restricted: 1;
} ____cacheline_aligned_in_smp;
+ /* submission data */
struct {
/*
* Ring buffer of indices into array of io_uring_sqe, which is
@@ -369,13 +370,27 @@ struct io_ring_ctx {
struct io_uring_sqe *sq_sqes;
unsigned cached_sq_head;
unsigned sq_entries;
- unsigned sq_thread_idle;
unsigned cached_sq_dropped;
unsigned long sq_check_overflow;
-
struct list_head defer_list;
+
+ /*
+ * Fixed resources fast path, should be accessed only under
+ * uring_lock, and updated through io_uring_register(2)
+ */
+ struct io_rsrc_node *rsrc_node;
+ struct io_file_table file_table;
+ unsigned nr_user_files;
+ unsigned nr_user_bufs;
+ struct io_mapped_ubuf **user_bufs;
+
+ struct io_submit_state submit_state;
struct list_head timeout_list;
struct list_head cq_overflow_list;
+ struct xarray io_buffers;
+ struct xarray personalities;
+ u32 pers_next;
+ unsigned sq_thread_idle;
} ____cacheline_aligned_in_smp;
struct {
@@ -383,7 +398,6 @@ struct io_ring_ctx {
wait_queue_head_t wait;
} ____cacheline_aligned_in_smp;
- struct io_submit_state submit_state;
/* IRQ completion list, under ->completion_lock */
struct list_head locked_free_list;
unsigned int locked_free_nr;
@@ -394,21 +408,6 @@ struct io_ring_ctx {
struct wait_queue_head sqo_sq_wait;
struct list_head sqd_list;
- /*
- * Fixed resources fast path, should be accessed only under uring_lock,
- * and updated through io_uring_register(2)
- */
- struct io_rsrc_node *rsrc_node;
-
- struct io_file_table file_table;
- unsigned nr_user_files;
- unsigned nr_user_bufs;
- struct io_mapped_ubuf **user_bufs;
-
- struct xarray io_buffers;
- struct xarray personalities;
- u32 pers_next;
-
struct {
unsigned cached_cq_tail;
unsigned cq_entries;
--
2.31.1
* [PATCH 04/12] io_uring: refactor io_get_sqe()
From: Pavel Begunkov @ 2021-06-14 22:37 UTC (permalink / raw)
To: Jens Axboe, io-uring
The line in io_get_sqe() that evaluates @head packs in too many
operations, including READ_ONCE(), which makes it inconvenient for
probing. Split it up, which also improves readability.
Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 5cc0c4dd2709..3baacfe2c9b7 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -6685,8 +6685,8 @@ static void io_commit_sqring(struct io_ring_ctx *ctx)
*/
static const struct io_uring_sqe *io_get_sqe(struct io_ring_ctx *ctx)
{
- u32 *sq_array = ctx->sq_array;
unsigned head, mask = ctx->sq_entries - 1;
+ unsigned sq_idx = ctx->cached_sq_head++ & mask;
/*
* The cached sq head (or cq tail) serves two purposes:
@@ -6696,7 +6696,7 @@ static const struct io_uring_sqe *io_get_sqe(struct io_ring_ctx *ctx)
* 2) allows the kernel side to track the head on its own, even
* though the application is the one updating it.
*/
- head = READ_ONCE(sq_array[ctx->cached_sq_head++ & mask]);
+ head = READ_ONCE(ctx->sq_array[sq_idx]);
if (likely(head < ctx->sq_entries))
return &ctx->sq_sqes[head];
--
2.31.1
* [PATCH 05/12] io_uring: don't cache number of dropped SQEs
From: Pavel Begunkov @ 2021-06-14 22:37 UTC (permalink / raw)
To: Jens Axboe, io-uring
Kill ->cached_sq_dropped and wire the DRAIN sequence number correction
through ->cq_extra, which exists exactly for that purpose. The
user-visible dropped counter is now populated by incrementing it
directly instead of keeping a kernel-side copy, similarly to what was
done not so long ago with cq_overflow.
Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 12 +++++-------
1 file changed, 5 insertions(+), 7 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 3baacfe2c9b7..6dd14f4aa5f1 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -370,7 +370,6 @@ struct io_ring_ctx {
struct io_uring_sqe *sq_sqes;
unsigned cached_sq_head;
unsigned sq_entries;
- unsigned cached_sq_dropped;
unsigned long sq_check_overflow;
struct list_head defer_list;
@@ -5994,13 +5993,11 @@ static u32 io_get_sequence(struct io_kiocb *req)
{
struct io_kiocb *pos;
struct io_ring_ctx *ctx = req->ctx;
- u32 total_submitted, nr_reqs = 0;
+ u32 nr_reqs = 0;
io_for_each_link(pos, req)
nr_reqs++;
-
- total_submitted = ctx->cached_sq_head - ctx->cached_sq_dropped;
- return total_submitted - nr_reqs;
+ return ctx->cached_sq_head - nr_reqs;
}
static int io_req_defer(struct io_kiocb *req)
@@ -6701,8 +6698,9 @@ static const struct io_uring_sqe *io_get_sqe(struct io_ring_ctx *ctx)
return &ctx->sq_sqes[head];
/* drop invalid entries */
- ctx->cached_sq_dropped++;
- WRITE_ONCE(ctx->rings->sq_dropped, ctx->cached_sq_dropped);
+ ctx->cq_extra--;
+ WRITE_ONCE(ctx->rings->sq_dropped,
+ READ_ONCE(ctx->rings->sq_dropped) + 1);
return NULL;
}
--
2.31.1
* [PATCH 06/12] io_uring: optimise completion timeout flushing
From: Pavel Begunkov @ 2021-06-14 22:37 UTC (permalink / raw)
To: Jens Axboe, io-uring
io_commit_cqring() can be very hot and we definitely don't want to
touch ->timeout_list there, because 1) it's shared with the submission
side, so it may lead to cache bouncing, and 2) it may need to load an
extra cacheline, especially for IRQ completions.

The completion side only cares about it when there are offset-mode
timeouts, which are not that common. Replace the
list_empty(->timeout_list) hot-path check with a new one-way flag,
which is set when the first offset-mode timeout is prepared.

Note: the flag sits in the same cacheline as ->rings, which is used
shortly after the check.
Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 6dd14f4aa5f1..fbf3b2149a4c 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -351,6 +351,7 @@ struct io_ring_ctx {
unsigned int drain_next: 1;
unsigned int eventfd_async: 1;
unsigned int restricted: 1;
+ unsigned int off_timeout_used: 1;
} ____cacheline_aligned_in_smp;
/* submission data */
@@ -1318,12 +1319,12 @@ static void io_flush_timeouts(struct io_ring_ctx *ctx)
{
u32 seq;
- if (list_empty(&ctx->timeout_list))
+ if (likely(!ctx->off_timeout_used))
return;
seq = ctx->cached_cq_tail - atomic_read(&ctx->cq_timeouts);
- do {
+ while (!list_empty(&ctx->timeout_list)) {
u32 events_needed, events_got;
struct io_kiocb *req = list_first_entry(&ctx->timeout_list,
struct io_kiocb, timeout.list);
@@ -1345,8 +1346,7 @@ static void io_flush_timeouts(struct io_ring_ctx *ctx)
list_del_init(&req->timeout.list);
io_kill_timeout(req, 0);
- } while (!list_empty(&ctx->timeout_list));
-
+ }
ctx->cq_last_tm_flush = seq;
}
@@ -5651,6 +5651,8 @@ static int io_timeout_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe,
return -EINVAL;
req->timeout.off = off;
+ if (unlikely(off && !req->ctx->off_timeout_used))
+ req->ctx->off_timeout_used = true;
if (!req->async_data && io_alloc_async_data(req))
return -ENOMEM;
--
2.31.1
* [PATCH 07/12] io_uring: small io_submit_sqe() optimisation
From: Pavel Begunkov @ 2021-06-14 22:37 UTC (permalink / raw)
To: Jens Axboe, io-uring
submit_state.link is only used to assemble a link and is not needed for
the actual submission, so clear it before io_queue_sqe() in
io_submit_sqe(), while it's still hot in the caches, rather than after
queueing has had a chance to evict it. This may also help the compiler
with spilling or enable other optimisations.
Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index fbf3b2149a4c..9c73991465c8 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -6616,8 +6616,8 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
/* last request of a link, enqueue the link */
if (!(req->flags & (REQ_F_LINK | REQ_F_HARDLINK))) {
- io_queue_sqe(head);
link->head = NULL;
+ io_queue_sqe(head);
}
} else {
if (unlikely(ctx->drain_next)) {
--
2.31.1
* [PATCH 08/12] io_uring: clean up check_overflow flag
From: Pavel Begunkov @ 2021-06-14 22:37 UTC (permalink / raw)
To: Jens Axboe, io-uring
There are no users of ->sq_check_overflow; only ->cq_check_overflow is
used. Combine them into a single flag and move it out of the
completion-related part of struct io_ring_ctx.

A not-so-obvious benefit is that all completion-side fields now fit into
a single cacheline. Before, they took two cachelines with 56B of
padding, and io_cqring_ev_posted*() still touched both of them.
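The padding argument can be reproduced with a toy userspace model; the
sizes and the struct below are illustrative, not the real io_ring_ctx
layout. An aligned section that spills just past one 64-byte line gets
rounded up to two.

#include <stdio.h>

#define CACHELINE 64
#define cacheline_aligned __attribute__((aligned(CACHELINE)))

/*
 * 72 bytes of fields in a cacheline-aligned section round up to 128
 * bytes, i.e. two cachelines with 56 bytes of padding.
 */
struct toy_layout {
	struct {
		char fields[72];
	} completion cacheline_aligned;
};

int main(void)
{
	printf("completion section: %zu bytes, padding: %zu bytes\n",
	       sizeof(struct toy_layout), sizeof(struct toy_layout) - 72);
	return 0;
}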
Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 20 +++++++++-----------
1 file changed, 9 insertions(+), 11 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 9c73991465c8..65d51e2d5c15 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -371,7 +371,6 @@ struct io_ring_ctx {
struct io_uring_sqe *sq_sqes;
unsigned cached_sq_head;
unsigned sq_entries;
- unsigned long sq_check_overflow;
struct list_head defer_list;
/*
@@ -408,13 +407,14 @@ struct io_ring_ctx {
struct wait_queue_head sqo_sq_wait;
struct list_head sqd_list;
+ unsigned long check_cq_overflow;
+
struct {
unsigned cached_cq_tail;
unsigned cq_entries;
atomic_t cq_timeouts;
unsigned cq_last_tm_flush;
unsigned cq_extra;
- unsigned long cq_check_overflow;
struct wait_queue_head cq_wait;
struct fasync_struct *cq_fasync;
struct eventfd_ctx *cq_ev_fd;
@@ -1464,8 +1464,7 @@ static bool __io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force)
all_flushed = list_empty(&ctx->cq_overflow_list);
if (all_flushed) {
- clear_bit(0, &ctx->sq_check_overflow);
- clear_bit(0, &ctx->cq_check_overflow);
+ clear_bit(0, &ctx->check_cq_overflow);
ctx->rings->sq_flags &= ~IORING_SQ_CQ_OVERFLOW;
}
@@ -1481,7 +1480,7 @@ static bool io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force)
{
bool ret = true;
- if (test_bit(0, &ctx->cq_check_overflow)) {
+ if (test_bit(0, &ctx->check_cq_overflow)) {
/* iopoll syncs against uring_lock, not completion_lock */
if (ctx->flags & IORING_SETUP_IOPOLL)
mutex_lock(&ctx->uring_lock);
@@ -1544,8 +1543,7 @@ static bool io_cqring_event_overflow(struct io_ring_ctx *ctx, u64 user_data,
return false;
}
if (list_empty(&ctx->cq_overflow_list)) {
- set_bit(0, &ctx->sq_check_overflow);
- set_bit(0, &ctx->cq_check_overflow);
+ set_bit(0, &ctx->check_cq_overflow);
ctx->rings->sq_flags |= IORING_SQ_CQ_OVERFLOW;
}
ocqe->cqe.user_data = user_data;
@@ -2391,7 +2389,7 @@ static int io_iopoll_check(struct io_ring_ctx *ctx, long min)
* If we do, we can potentially be spinning for commands that
* already triggered a CQE (eg in error).
*/
- if (test_bit(0, &ctx->cq_check_overflow))
+ if (test_bit(0, &ctx->check_cq_overflow))
__io_cqring_overflow_flush(ctx, false);
if (io_cqring_events(ctx))
goto out;
@@ -6965,7 +6963,7 @@ static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode,
* Cannot safely flush overflowed CQEs from here, ensure we wake up
* the task, and the next invocation will do it.
*/
- if (io_should_wake(iowq) || test_bit(0, &iowq->ctx->cq_check_overflow))
+ if (io_should_wake(iowq) || test_bit(0, &iowq->ctx->check_cq_overflow))
return autoremove_wake_function(curr, mode, wake_flags, key);
return -1;
}
@@ -6993,7 +6991,7 @@ static inline int io_cqring_wait_schedule(struct io_ring_ctx *ctx,
if (ret || io_should_wake(iowq))
return ret;
/* let the caller flush overflows, retry */
- if (test_bit(0, &ctx->cq_check_overflow))
+ if (test_bit(0, &ctx->check_cq_overflow))
return 1;
*timeout = schedule_timeout(*timeout);
@@ -8702,7 +8700,7 @@ static __poll_t io_uring_poll(struct file *file, poll_table *wait)
* Users may get EPOLLIN meanwhile seeing nothing in cqring, this
* pushs them to do the flush.
*/
- if (io_cqring_events(ctx) || test_bit(0, &ctx->cq_check_overflow))
+ if (io_cqring_events(ctx) || test_bit(0, &ctx->check_cq_overflow))
mask |= EPOLLIN | EPOLLRDNORM;
return mask;
--
2.31.1
* [PATCH 09/12] io_uring: wait heads renaming
From: Pavel Begunkov @ 2021-06-14 22:37 UTC (permalink / raw)
To: Jens Axboe, io-uring
We use several wait_queue_heads for different purposes, but the naming
is confusing. First rename ctx->cq_wait to ctx->poll_wait, because that
one is used for polling an io_uring instance. Then rename ctx->wait to
ctx->cq_wait, which is responsible for CQE waiting.
Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 30 +++++++++++++++---------------
1 file changed, 15 insertions(+), 15 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 65d51e2d5c15..e9bf26fbf65d 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -394,7 +394,7 @@ struct io_ring_ctx {
struct {
struct mutex uring_lock;
- wait_queue_head_t wait;
+ wait_queue_head_t cq_wait;
} ____cacheline_aligned_in_smp;
/* IRQ completion list, under ->completion_lock */
@@ -415,7 +415,7 @@ struct io_ring_ctx {
atomic_t cq_timeouts;
unsigned cq_last_tm_flush;
unsigned cq_extra;
- struct wait_queue_head cq_wait;
+ struct wait_queue_head poll_wait;
struct fasync_struct *cq_fasync;
struct eventfd_ctx *cq_ev_fd;
} ____cacheline_aligned_in_smp;
@@ -1178,13 +1178,13 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
ctx->flags = p->flags;
init_waitqueue_head(&ctx->sqo_sq_wait);
INIT_LIST_HEAD(&ctx->sqd_list);
- init_waitqueue_head(&ctx->cq_wait);
+ init_waitqueue_head(&ctx->poll_wait);
INIT_LIST_HEAD(&ctx->cq_overflow_list);
init_completion(&ctx->ref_comp);
xa_init_flags(&ctx->io_buffers, XA_FLAGS_ALLOC1);
xa_init_flags(&ctx->personalities, XA_FLAGS_ALLOC1);
mutex_init(&ctx->uring_lock);
- init_waitqueue_head(&ctx->wait);
+ init_waitqueue_head(&ctx->cq_wait);
spin_lock_init(&ctx->completion_lock);
INIT_LIST_HEAD(&ctx->iopoll_list);
INIT_LIST_HEAD(&ctx->defer_list);
@@ -1404,14 +1404,14 @@ static void io_cqring_ev_posted(struct io_ring_ctx *ctx)
/* see waitqueue_active() comment */
smp_mb();
- if (waitqueue_active(&ctx->wait))
- wake_up(&ctx->wait);
+ if (waitqueue_active(&ctx->cq_wait))
+ wake_up(&ctx->cq_wait);
if (ctx->sq_data && waitqueue_active(&ctx->sq_data->wait))
wake_up(&ctx->sq_data->wait);
if (io_should_trigger_evfd(ctx))
eventfd_signal(ctx->cq_ev_fd, 1);
- if (waitqueue_active(&ctx->cq_wait)) {
- wake_up_interruptible(&ctx->cq_wait);
+ if (waitqueue_active(&ctx->poll_wait)) {
+ wake_up_interruptible(&ctx->poll_wait);
kill_fasync(&ctx->cq_fasync, SIGIO, POLL_IN);
}
}
@@ -1422,13 +1422,13 @@ static void io_cqring_ev_posted_iopoll(struct io_ring_ctx *ctx)
smp_mb();
if (ctx->flags & IORING_SETUP_SQPOLL) {
- if (waitqueue_active(&ctx->wait))
- wake_up(&ctx->wait);
+ if (waitqueue_active(&ctx->cq_wait))
+ wake_up(&ctx->cq_wait);
}
if (io_should_trigger_evfd(ctx))
eventfd_signal(ctx->cq_ev_fd, 1);
- if (waitqueue_active(&ctx->cq_wait)) {
- wake_up_interruptible(&ctx->cq_wait);
+ if (waitqueue_active(&ctx->poll_wait)) {
+ wake_up_interruptible(&ctx->poll_wait);
kill_fasync(&ctx->cq_fasync, SIGIO, POLL_IN);
}
}
@@ -7056,10 +7056,10 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
ret = -EBUSY;
break;
}
- prepare_to_wait_exclusive(&ctx->wait, &iowq.wq,
+ prepare_to_wait_exclusive(&ctx->cq_wait, &iowq.wq,
TASK_INTERRUPTIBLE);
ret = io_cqring_wait_schedule(ctx, &iowq, &timeout);
- finish_wait(&ctx->wait, &iowq.wq);
+ finish_wait(&ctx->cq_wait, &iowq.wq);
cond_resched();
} while (ret > 0);
@@ -8678,7 +8678,7 @@ static __poll_t io_uring_poll(struct file *file, poll_table *wait)
struct io_ring_ctx *ctx = file->private_data;
__poll_t mask = 0;
- poll_wait(file, &ctx->cq_wait, wait);
+ poll_wait(file, &ctx->poll_wait, wait);
/*
* synchronizes with barrier from wq_has_sleeper call in
* io_commit_cqring
--
2.31.1
* [PATCH 10/12] io_uring: move uring_lock location
From: Pavel Begunkov @ 2021-06-14 22:37 UTC (permalink / raw)
To: Jens Axboe, io-uring
->uring_lock is predominantly used for submission, even though it
protects many other things like iopoll, registration, selected bufs,
and more. Yet it is placed together with ->cq_wait, which is poked from
the completion and CQ-waiting sides. Move them apart: ->uring_lock goes
into the submission data, and cq_wait into the completion-related
chunk. The latter requires some reshuffling so that everything needed
by io_cqring_ev_posted*() lands in one cacheline.
Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 16 +++++++---------
1 file changed, 7 insertions(+), 9 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index e9bf26fbf65d..1b6cfc6b79c5 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -356,6 +356,8 @@ struct io_ring_ctx {
/* submission data */
struct {
+ struct mutex uring_lock;
+
/*
* Ring buffer of indices into array of io_uring_sqe, which is
* mmapped by the application using the IORING_OFF_SQES offset.
@@ -392,11 +394,6 @@ struct io_ring_ctx {
unsigned sq_thread_idle;
} ____cacheline_aligned_in_smp;
- struct {
- struct mutex uring_lock;
- wait_queue_head_t cq_wait;
- } ____cacheline_aligned_in_smp;
-
/* IRQ completion list, under ->completion_lock */
struct list_head locked_free_list;
unsigned int locked_free_nr;
@@ -412,12 +409,13 @@ struct io_ring_ctx {
struct {
unsigned cached_cq_tail;
unsigned cq_entries;
- atomic_t cq_timeouts;
- unsigned cq_last_tm_flush;
- unsigned cq_extra;
+ struct eventfd_ctx *cq_ev_fd;
struct wait_queue_head poll_wait;
+ struct wait_queue_head cq_wait;
+ unsigned cq_extra;
+ atomic_t cq_timeouts;
struct fasync_struct *cq_fasync;
- struct eventfd_ctx *cq_ev_fd;
+ unsigned cq_last_tm_flush;
} ____cacheline_aligned_in_smp;
struct {
--
2.31.1
* [PATCH 11/12] io_uring: refactor io_req_defer()
From: Pavel Begunkov @ 2021-06-14 22:37 UTC (permalink / raw)
To: Jens Axboe, io-uring
Rename io_req_defer() to io_drain_req() and refactor it, uncoupling it
from io_queue_sqe()'s error handling and preparing for upcoming
optimisations. Also, prioritise the non-IOSQE_ASYNC path.
Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 39 +++++++++++++++++++--------------------
1 file changed, 19 insertions(+), 20 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 1b6cfc6b79c5..29b705201ca3 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -5998,7 +5998,7 @@ static u32 io_get_sequence(struct io_kiocb *req)
return ctx->cached_sq_head - nr_reqs;
}
-static int io_req_defer(struct io_kiocb *req)
+static bool io_drain_req(struct io_kiocb *req)
{
struct io_ring_ctx *ctx = req->ctx;
struct io_defer_entry *de;
@@ -6008,27 +6008,29 @@ static int io_req_defer(struct io_kiocb *req)
/* Still need defer if there is pending req in defer list. */
if (likely(list_empty_careful(&ctx->defer_list) &&
!(req->flags & REQ_F_IO_DRAIN)))
- return 0;
+ return false;
seq = io_get_sequence(req);
/* Still a chance to pass the sequence check */
if (!req_need_defer(req, seq) && list_empty_careful(&ctx->defer_list))
- return 0;
+ return false;
ret = io_req_prep_async(req);
if (ret)
return ret;
io_prep_async_link(req);
de = kmalloc(sizeof(*de), GFP_KERNEL);
- if (!de)
- return -ENOMEM;
+ if (!de) {
+ io_req_complete_failed(req, ret);
+ return true;
+ }
spin_lock_irq(&ctx->completion_lock);
if (!req_need_defer(req, seq) && list_empty(&ctx->defer_list)) {
spin_unlock_irq(&ctx->completion_lock);
kfree(de);
io_queue_async_work(req);
- return -EIOCBQUEUED;
+ return true;
}
trace_io_uring_defer(ctx, req, req->user_data);
@@ -6036,7 +6038,7 @@ static int io_req_defer(struct io_kiocb *req)
de->seq = seq;
list_add_tail(&de->list, &ctx->defer_list);
spin_unlock_irq(&ctx->completion_lock);
- return -EIOCBQUEUED;
+ return true;
}
static void io_clean_op(struct io_kiocb *req)
@@ -6447,21 +6449,18 @@ static void __io_queue_sqe(struct io_kiocb *req)
static void io_queue_sqe(struct io_kiocb *req)
{
- int ret;
+ if (io_drain_req(req))
+ return;
- ret = io_req_defer(req);
- if (ret) {
- if (ret != -EIOCBQUEUED) {
-fail_req:
- io_req_complete_failed(req, ret);
- }
- } else if (req->flags & REQ_F_FORCE_ASYNC) {
- ret = io_req_prep_async(req);
- if (unlikely(ret))
- goto fail_req;
- io_queue_async_work(req);
- } else {
+ if (likely(!(req->flags & REQ_F_FORCE_ASYNC))) {
__io_queue_sqe(req);
+ } else {
+ int ret = io_req_prep_async(req);
+
+ if (unlikely(ret))
+ io_req_complete_failed(req, ret);
+ else
+ io_queue_async_work(req);
}
}
--
2.31.1
* [PATCH 12/12] io_uring: optimise non-drain path
From: Pavel Begunkov @ 2021-06-14 22:37 UTC (permalink / raw)
To: Jens Axboe, io-uring
Replace the drain checks with a one-way flag that is set upon seeing
the first IOSQE_IO_DRAIN request. There are several places where it
cuts cycles well (a standalone sketch of the pattern follows below):

1) It's much faster than the previous fast-path check, which tests two
conditions in io_drain_req() including the fairly expensive
list_empty_careful().

2) io_queue_sqe() can now be marked inline, which is a huge win.

3) It replaces the timeout and drain checks in io_commit_cqring() with
a single flags test. It's also good not to touch ->defer_list there
without a reason, which limits cache bouncing.

It adds a small amount of overhead to the drain path, but that is
negligible. The main nuisance is that once any DRAIN request is seen
during an io_uring instance's lifetime, it will _always_ go through the
slower path, so applications that avoid draining and offset-mode
timeouts are preferable. The overhead in that case is not big, but it's
worth bearing in mind.
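A compilable standalone sketch of the one-way-flag pattern; the names
and the bookkeeping are illustrative, not the io_uring ones.

#include <stdbool.h>
#include <stdio.h>

struct toy_ctx {
	bool drain_used;	/* set once, never cleared on this path */
	unsigned int deferred;	/* stand-in for ->defer_list */
};

static void toy_submit(struct toy_ctx *ctx, bool drain)
{
	if (drain)
		ctx->drain_used = true;	/* first DRAIN flips the flag */
	if (ctx->drain_used)		/* replaces the two-condition check */
		ctx->deferred++;	/* pretend the request was deferred */
}

static void toy_commit(struct toy_ctx *ctx)
{
	if (ctx->drain_used)		/* single flag test on the hot path */
		printf("flush %u deferred requests\n", ctx->deferred);
}

int main(void)
{
	struct toy_ctx ctx = { 0 };

	toy_submit(&ctx, false);	/* fast path, flag still clear */
	toy_submit(&ctx, true);		/* slow path from here on */
	toy_commit(&ctx);
	return 0;
}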
Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 57 +++++++++++++++++++++++++++------------------------
1 file changed, 30 insertions(+), 27 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 29b705201ca3..5828ffdbea82 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -352,6 +352,7 @@ struct io_ring_ctx {
unsigned int eventfd_async: 1;
unsigned int restricted: 1;
unsigned int off_timeout_used: 1;
+ unsigned int drain_used: 1;
} ____cacheline_aligned_in_smp;
/* submission data */
@@ -1299,9 +1300,9 @@ static void io_kill_timeout(struct io_kiocb *req, int status)
}
}
-static void __io_queue_deferred(struct io_ring_ctx *ctx)
+static void io_queue_deferred(struct io_ring_ctx *ctx)
{
- do {
+ while (!list_empty(&ctx->defer_list)) {
struct io_defer_entry *de = list_first_entry(&ctx->defer_list,
struct io_defer_entry, list);
@@ -1310,17 +1311,12 @@ static void __io_queue_deferred(struct io_ring_ctx *ctx)
list_del_init(&de->list);
io_req_task_queue(de->req);
kfree(de);
- } while (!list_empty(&ctx->defer_list));
+ }
}
static void io_flush_timeouts(struct io_ring_ctx *ctx)
{
- u32 seq;
-
- if (likely(!ctx->off_timeout_used))
- return;
-
- seq = ctx->cached_cq_tail - atomic_read(&ctx->cq_timeouts);
+ u32 seq = ctx->cached_cq_tail - atomic_read(&ctx->cq_timeouts);
while (!list_empty(&ctx->timeout_list)) {
u32 events_needed, events_got;
@@ -1350,13 +1346,14 @@ static void io_flush_timeouts(struct io_ring_ctx *ctx)
static void io_commit_cqring(struct io_ring_ctx *ctx)
{
- io_flush_timeouts(ctx);
-
+ if (unlikely(ctx->off_timeout_used || ctx->drain_used)) {
+ if (ctx->off_timeout_used)
+ io_flush_timeouts(ctx);
+ if (ctx->drain_used)
+ io_queue_deferred(ctx);
+ }
/* order cqe stores with ring update */
smp_store_release(&ctx->rings->cq.tail, ctx->cached_cq_tail);
-
- if (unlikely(!list_empty(&ctx->defer_list)))
- __io_queue_deferred(ctx);
}
static inline bool io_sqring_full(struct io_ring_ctx *ctx)
@@ -6447,9 +6444,9 @@ static void __io_queue_sqe(struct io_kiocb *req)
io_queue_linked_timeout(linked_timeout);
}
-static void io_queue_sqe(struct io_kiocb *req)
+static inline void io_queue_sqe(struct io_kiocb *req)
{
- if (io_drain_req(req))
+ if (unlikely(req->ctx->drain_used) && io_drain_req(req))
return;
if (likely(!(req->flags & REQ_F_FORCE_ASYNC))) {
@@ -6573,6 +6570,23 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
io_req_complete_failed(req, ret);
return ret;
}
+
+ if (unlikely(req->flags & REQ_F_IO_DRAIN)) {
+ ctx->drain_used = true;
+
+ /*
+ * Taking sequential execution of a link, draining both sides
+ * of the link also fullfils IOSQE_IO_DRAIN semantics for all
+ * requests in the link. So, it drains the head and the
+ * next after the link request. The last one is done via
+ * drain_next flag to persist the effect across calls.
+ */
+ if (link->head) {
+ link->head->flags |= REQ_F_IO_DRAIN;
+ ctx->drain_next = 1;
+ }
+ }
+
ret = io_req_prep(req, sqe);
if (unlikely(ret))
goto fail_req;
@@ -6591,17 +6605,6 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
if (link->head) {
struct io_kiocb *head = link->head;
- /*
- * Taking sequential execution of a link, draining both sides
- * of the link also fullfils IOSQE_IO_DRAIN semantics for all
- * requests in the link. So, it drains the head and the
- * next after the link request. The last one is done via
- * drain_next flag to persist the effect across calls.
- */
- if (req->flags & REQ_F_IO_DRAIN) {
- head->flags |= REQ_F_IO_DRAIN;
- ctx->drain_next = 1;
- }
ret = io_req_prep_async(req);
if (unlikely(ret))
goto fail_req;
--
2.31.1
* Re: [PATCH 5.14 00/12] for-next optimisations
From: Pavel Begunkov @ 2021-06-15 12:27 UTC (permalink / raw)
To: Jens Axboe, io-uring
On 6/14/21 11:37 PM, Pavel Begunkov wrote:
> There are two main lines of work intertwined here. The first one is
> pt.2 of ctx field shuffling for better caching. There are a couple of
> things left on that front.
>
> The second is optimising the (presumably) rarely used offset-based
> timeouts and draining. There is a downside (see 12/12), which will be
> fixed later. The plan is to queue a task_work clearing drain_used
> (under uring_lock) from io_queue_deferred() once all drained requests
> are gone.
There is a much better way to switch it back off: from io_drain_req().
The series is good to go; I'll patch on top later.
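Purely as a sketch of that idea, hypothetical and not a patch from this
series, re-enabling the fast path could look something like:

/*
 * Hypothetical, for illustration only: once the defer list has emptied,
 * flip drain_used back under ->uring_lock so io_queue_sqe() returns to
 * the fast path.
 */
static void io_sketch_reset_drain(struct io_ring_ctx *ctx)
{
	lockdep_assert_held(&ctx->uring_lock);

	if (list_empty_careful(&ctx->defer_list))
		ctx->drain_used = false;
}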
> nops(batch=32):
> 15.9 MIOPS vs 17.3 MIOPS
> nullblk (irqmode=2 completion_nsec=0 submit_queues=16), no merges, no stat
> 1002 KIOPS vs 1050 KIOPS
>
> Though the second test is very slow compared to what I've seen
> before, so it might not be representative.
>
> Pavel Begunkov (12):
> io_uring: keep SQ pointers in a single cacheline
> io_uring: move ctx->flags from SQ cacheline
> io_uring: shuffle more fields into SQ ctx section
> io_uring: refactor io_get_sqe()
> io_uring: don't cache number of dropped SQEs
> io_uring: optimise completion timeout flushing
> io_uring: small io_submit_sqe() optimisation
> io_uring: clean up check_overflow flag
> io_uring: wait heads renaming
> io_uring: move uring_lock location
> io_uring: refactor io_req_defer()
> io_uring: optimise non-drain path
>
> fs/io_uring.c | 226 +++++++++++++++++++++++++-------------------------
> 1 file changed, 111 insertions(+), 115 deletions(-)
>
--
Pavel Begunkov
* Re: [PATCH 5.14 00/12] for-next optimisations
From: Jens Axboe @ 2021-06-15 21:38 UTC (permalink / raw)
To: Pavel Begunkov, io-uring
On 6/14/21 4:37 PM, Pavel Begunkov wrote:
> There are two main lines of work intertwined here. The first one is
> pt.2 of ctx field shuffling for better caching. There are a couple of
> things left on that front.
>
> The second is optimising the (presumably) rarely used offset-based
> timeouts and draining. There is a downside (see 12/12), which will be
> fixed later. The plan is to queue a task_work clearing drain_used
> (under uring_lock) from io_queue_deferred() once all drained requests
> are gone.
>
> nops(batch=32):
> 15.9 MIOPS vs 17.3 MIOPS
> nullblk (irqmode=2 completion_nsec=0 submit_queues=16), no merges, no stat
> 1002 KIOPS vs 1050 KIOPS
>
> Though the second test is very slow compared to what I've seen
> before, so it might not be representative.
Applied, thanks. I'll run this through my testing, too.
--
Jens Axboe