public inbox for [email protected]
* [RFC 0/4] completion locking optimisation feature
@ 2022-03-18 13:52 Pavel Begunkov
  2022-03-18 13:52 ` [PATCH 1/4] io_uring: get rid of raw fill cqe in kill_timeout Pavel Begunkov
                   ` (4 more replies)
  0 siblings, 5 replies; 15+ messages in thread
From: Pavel Begunkov @ 2022-03-18 13:52 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

A WIP feature that optimises out spinlocking around CQE posting for some
use cases. For a more detailed description see 4/4.

Quick benchmarking with fio/t/io_uring nops gives an extra ~4% of
throughput for QD=1, and ~2.5% for QD=4.
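
For reference, opting in is just a matter of setting the new flag at ring
creation time, from the task that will be doing the submissions. A minimal
sketch with raw syscalls (no liburing; the flag value matches 4/4, the
helper name and error handling are illustrative only):

#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/io_uring.h>

#ifndef IORING_SETUP_PRIVATE_CQ
#define IORING_SETUP_PRIVATE_CQ	(1U << 8)
#endif

/* returns the ring fd, or -1 with errno set */
static int setup_private_cq_ring(unsigned int entries)
{
	struct io_uring_params p;

	memset(&p, 0, sizeof(p));
	/* only the creating task may submit; IOPOLL is rejected */
	p.flags = IORING_SETUP_PRIVATE_CQ;
	return syscall(__NR_io_uring_setup, entries, &p);
}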

Pavel Begunkov (4):
  io_uring: get rid of raw fill cqe in kill_timeout
  io_uring: get rid of raw fill_cqe in io_fail_links
  io_uring: remove raw fill_cqe from linked timeout
  io_uring: optimise compl locking for non-shared rings

 fs/io_uring.c                 | 126 ++++++++++++++++++++++------------
 include/uapi/linux/io_uring.h |   1 +
 2 files changed, 85 insertions(+), 42 deletions(-)

-- 
2.35.1



* [PATCH 1/4] io_uring: get rid of raw fill cqe in kill_timeout
  2022-03-18 13:52 [RFC 0/4] completion locking optimisation feature Pavel Begunkov
@ 2022-03-18 13:52 ` Pavel Begunkov
  2022-03-18 13:52 ` [PATCH 2/4] io_uring: get rid of raw fill_cqe in io_fail_links Pavel Begunkov
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 15+ messages in thread
From: Pavel Begunkov @ 2022-03-18 13:52 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

We don't want raw fill_cqe calls. In preparation for upcoming features,
get rid of the raw fill_cqe by using io_req_task_complete().

Signed-off-by: Pavel Begunkov <[email protected]>
---
 fs/io_uring.c | 41 +++++++++++++++++++++++------------------
 1 file changed, 23 insertions(+), 18 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 690bfeaa609a..0e04e0997d7d 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1191,6 +1191,7 @@ static int io_close_fixed(struct io_kiocb *req, unsigned int issue_flags);
 
 static enum hrtimer_restart io_link_timeout_fn(struct hrtimer *timer);
 static void io_eventfd_signal(struct io_ring_ctx *ctx);
+static void io_req_tw_queue_complete(struct io_kiocb *req, u32 res);
 
 static struct kmem_cache *req_cachep;
 
@@ -1746,8 +1747,7 @@ static void io_kill_timeout(struct io_kiocb *req, int status)
 		atomic_set(&req->ctx->cq_timeouts,
 			atomic_read(&req->ctx->cq_timeouts) + 1);
 		list_del_init(&req->timeout.list);
-		io_fill_cqe_req(req, status, 0);
-		io_put_req_deferred(req);
+		io_req_tw_queue_complete(req, status);
 	}
 }
 
@@ -2595,6 +2595,19 @@ static void io_req_task_submit(struct io_kiocb *req, bool *locked)
 		io_req_complete_failed(req, -EFAULT);
 }
 
+static void io_req_task_complete(struct io_kiocb *req, bool *locked)
+{
+	int res = req->result;
+
+	if (*locked) {
+		io_req_complete_state(req, res, io_put_kbuf(req, 0));
+		io_req_add_compl_list(req);
+	} else {
+		io_req_complete_post(req, res,
+					io_put_kbuf(req, IO_URING_F_UNLOCKED));
+	}
+}
+
 static void io_req_task_queue_fail(struct io_kiocb *req, int ret)
 {
 	req->result = ret;
@@ -2602,6 +2615,13 @@ static void io_req_task_queue_fail(struct io_kiocb *req, int ret)
 	io_req_task_work_add(req, false);
 }
 
+static void io_req_tw_queue_complete(struct io_kiocb *req, u32 res)
+{
+	req->result = res;
+	req->io_task_work.func = io_req_task_complete;
+	io_req_task_work_add(req, false);
+}
+
 static void io_req_task_queue(struct io_kiocb *req)
 {
 	req->io_task_work.func = io_req_task_submit;
@@ -2987,19 +3007,6 @@ static bool __io_complete_rw_common(struct io_kiocb *req, long res)
 	return false;
 }
 
-static inline void io_req_task_complete(struct io_kiocb *req, bool *locked)
-{
-	int res = req->result;
-
-	if (*locked) {
-		io_req_complete_state(req, res, io_put_kbuf(req, 0));
-		io_req_add_compl_list(req);
-	} else {
-		io_req_complete_post(req, res,
-					io_put_kbuf(req, IO_URING_F_UNLOCKED));
-	}
-}
-
 static void __io_complete_rw(struct io_kiocb *req, long res,
 			     unsigned int issue_flags)
 {
@@ -6458,9 +6465,7 @@ static enum hrtimer_restart io_timeout_fn(struct hrtimer *timer)
 	if (!(data->flags & IORING_TIMEOUT_ETIME_SUCCESS))
 		req_set_fail(req);
 
-	req->result = -ETIME;
-	req->io_task_work.func = io_req_task_complete;
-	io_req_task_work_add(req, false);
+	io_req_tw_queue_complete(req, -ETIME);
 	return HRTIMER_NORESTART;
 }
 
-- 
2.35.1



* [PATCH 2/4] io_uring: get rid of raw fill_cqe in io_fail_links
  2022-03-18 13:52 [RFC 0/4] completion locking optimisation feature Pavel Begunkov
  2022-03-18 13:52 ` [PATCH 1/4] io_uring: get rid of raw fill cqe in kill_timeout Pavel Begunkov
@ 2022-03-18 13:52 ` Pavel Begunkov
  2022-03-18 13:52 ` [PATCH 3/4] io_uring: remove raw fill_cqe from linked timeout Pavel Begunkov
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 15+ messages in thread
From: Pavel Begunkov @ 2022-03-18 13:52 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

Replace the raw fill_cqe inside io_fail_links() with a tw completion.
The CQE ordering guarantees rely on the fact that io_uring's task_works
are executed in order.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 fs/io_uring.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 0e04e0997d7d..fff66f4d00c4 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2335,11 +2335,13 @@ static void io_fail_links(struct io_kiocb *req)
 		trace_io_uring_fail_link(req->ctx, req, req->user_data,
 					req->opcode, link);
 
-		if (!ignore_cqes) {
+		if (!ignore_cqes)
 			link->flags &= ~REQ_F_CQE_SKIP;
-			io_fill_cqe_req(link, res, 0);
-		}
-		io_put_req_deferred(link);
+		/*
+		 * linked CQEs should be ordered, we rely on the tw
+		 * infrastructure executing them in the right order
+		 */
+		io_req_tw_queue_complete(link, res);
 		link = nxt;
 	}
 }
-- 
2.35.1



* [PATCH 3/4] io_uring: remove raw fill_cqe from linked timeout
  2022-03-18 13:52 [RFC 0/4] completion locking optimisation feature Pavel Begunkov
  2022-03-18 13:52 ` [PATCH 1/4] io_uring: get rid of raw fill cqe in kill_timeout Pavel Begunkov
  2022-03-18 13:52 ` [PATCH 2/4] io_uring: get rid of raw fill_cqe in io_fail_links Pavel Begunkov
@ 2022-03-18 13:52 ` Pavel Begunkov
  2022-03-18 13:52 ` [PATCH 4/4] io_uring: optimise compl locking for non-shared rings Pavel Begunkov
  2022-03-18 14:42 ` [RFC 0/4] completion locking optimisation feature Pavel Begunkov
  4 siblings, 0 replies; 15+ messages in thread
From: Pavel Begunkov @ 2022-03-18 13:52 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

We don't want io_fill_cqe_req() in disarming, so replace it with
io_req_tw_queue_complete() as usual. A minor problem here is that the
timeout CQE may come after the completion of further linked requests,
e.g. for a chain like

req1 -> linked_timeout -> req2

However, it shouldn't be much of a problem as it can happen even without
this patch when timeouts are racing with the main request completions.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 fs/io_uring.c | 16 ++--------------
 1 file changed, 2 insertions(+), 14 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index fff66f4d00c4..5f895ad910b6 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1166,8 +1166,6 @@ static void io_uring_try_cancel_requests(struct io_ring_ctx *ctx,
 					 bool cancel_all);
 static void io_uring_cancel_generic(bool cancel_all, struct io_sq_data *sqd);
 
-static void io_fill_cqe_req(struct io_kiocb *req, s32 res, u32 cflags);
-
 static void io_put_req(struct io_kiocb *req);
 static void io_put_req_deferred(struct io_kiocb *req);
 static void io_dismantle_req(struct io_kiocb *req);
@@ -2070,12 +2068,6 @@ static inline bool __io_fill_cqe_req(struct io_kiocb *req, s32 res, u32 cflags)
 	return __io_fill_cqe(req->ctx, req->user_data, res, cflags);
 }
 
-static noinline void io_fill_cqe_req(struct io_kiocb *req, s32 res, u32 cflags)
-{
-	if (!(req->flags & REQ_F_CQE_SKIP))
-		__io_fill_cqe_req(req, res, cflags);
-}
-
 static noinline bool io_fill_cqe_aux(struct io_ring_ctx *ctx, u64 user_data,
 				     s32 res, u32 cflags)
 {
@@ -2307,9 +2299,7 @@ static bool io_kill_linked_timeout(struct io_kiocb *req)
 		link->timeout.head = NULL;
 		if (hrtimer_try_to_cancel(&io->timer) != -1) {
 			list_del(&link->timeout.list);
-			/* leave REQ_F_CQE_SKIP to io_fill_cqe_req */
-			io_fill_cqe_req(link, -ECANCELED, 0);
-			io_put_req_deferred(link);
+			io_req_tw_queue_complete(link, -ECANCELED);
 			return true;
 		}
 	}
@@ -2357,9 +2347,7 @@ static bool io_disarm_next(struct io_kiocb *req)
 		req->flags &= ~REQ_F_ARM_LTIMEOUT;
 		if (link && link->opcode == IORING_OP_LINK_TIMEOUT) {
 			io_remove_next_linked(req);
-			/* leave REQ_F_CQE_SKIP to io_fill_cqe_req */
-			io_fill_cqe_req(link, -ECANCELED, 0);
-			io_put_req_deferred(link);
+			io_req_tw_queue_complete(link, -ECANCELED);
 			posted = true;
 		}
 	} else if (req->flags & REQ_F_LINK_TIMEOUT) {
-- 
2.35.1



* [PATCH 4/4] io_uring: optimise compl locking for non-shared rings
  2022-03-18 13:52 [RFC 0/4] completion locking optimisation feature Pavel Begunkov
                   ` (2 preceding siblings ...)
  2022-03-18 13:52 ` [PATCH 3/4] io_uring: remove raw fill_cqe from linked timeout Pavel Begunkov
@ 2022-03-18 13:52 ` Pavel Begunkov
  2022-03-18 14:54   ` Jens Axboe
  2022-03-18 14:42 ` [RFC 0/4] completion locking optimisation feature Pavel Begunkov
  4 siblings, 1 reply; 15+ messages in thread
From: Pavel Begunkov @ 2022-03-18 13:52 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

When only one task submits requests, most CQEs are expected to be
filled from that task's context, so we get natural serialisation. That
means that in those cases we don't need spinlocking around CQE posting.
One downside is that it also means io-wq workers can't emit CQEs
directly but have to do it through the original task context using
task_works. That may hurt latency and performance and might matter a
lot to some workloads, but it's not a huge deal in general as io-wq is
a slow path and there is some additional merit from tw completion
batching.

The feature is opted into by userspace by setting a new
IORING_SETUP_PRIVATE_CQ flag. It doesn't work with IOPOLL, and for now
only the task that created the ring can submit requests to it.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 fs/io_uring.c                 | 61 +++++++++++++++++++++++++++++++----
 include/uapi/linux/io_uring.h |  1 +
 2 files changed, 55 insertions(+), 7 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 5f895ad910b6..52a15b29f6e4 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2109,6 +2109,12 @@ static void io_req_complete_post(struct io_kiocb *req, s32 res,
 {
 	struct io_ring_ctx *ctx = req->ctx;
 
+	/* todo cflags */
+	if (ctx->flags & IORING_SETUP_PRIVATE_CQ) {
+		io_req_tw_queue_complete(req, res);
+		return;
+	}
+
 	spin_lock(&ctx->completion_lock);
 	__io_req_complete_post(req, res, cflags);
 	io_commit_cqring(ctx);
@@ -2593,8 +2599,14 @@ static void io_req_task_complete(struct io_kiocb *req, bool *locked)
 		io_req_complete_state(req, res, io_put_kbuf(req, 0));
 		io_req_add_compl_list(req);
 	} else {
-		io_req_complete_post(req, res,
+		struct io_ring_ctx *ctx = req->ctx;
+
+		spin_lock(&ctx->completion_lock);
+		__io_req_complete_post(req, res,
 					io_put_kbuf(req, IO_URING_F_UNLOCKED));
+		io_commit_cqring(ctx);
+		spin_unlock(&ctx->completion_lock);
+		io_cqring_ev_posted(ctx);
 	}
 }
 
@@ -2686,7 +2698,9 @@ static void __io_submit_flush_completions(struct io_ring_ctx *ctx)
 	struct io_submit_state *state = &ctx->submit_state;
 
 	if (state->flush_cqes) {
-		spin_lock(&ctx->completion_lock);
+		if (!(ctx->flags & IORING_SETUP_PRIVATE_CQ))
+			spin_lock(&ctx->completion_lock);
+
 		wq_list_for_each(node, prev, &state->compl_reqs) {
 			struct io_kiocb *req = container_of(node, struct io_kiocb,
 						    comp_list);
@@ -2705,7 +2719,9 @@ static void __io_submit_flush_completions(struct io_ring_ctx *ctx)
 		}
 
 		io_commit_cqring(ctx);
-		spin_unlock(&ctx->completion_lock);
+		if (!(ctx->flags & IORING_SETUP_PRIVATE_CQ))
+			spin_unlock(&ctx->completion_lock);
+
 		io_cqring_ev_posted(ctx);
 		state->flush_cqes = false;
 	}
@@ -5895,8 +5911,10 @@ static int io_poll_check_events(struct io_kiocb *req)
 	int v;
 
 	/* req->task == current here, checking PF_EXITING is safe */
-	if (unlikely(req->task->flags & PF_EXITING))
+	if (unlikely(req->task->flags & PF_EXITING)) {
 		io_poll_mark_cancelled(req);
+		return -ECANCELED;
+	}
 
 	do {
 		v = atomic_read(&req->poll_refs);
@@ -9165,6 +9183,8 @@ static int __io_sqe_files_update(struct io_ring_ctx *ctx,
 	unsigned int done;
 	bool needs_switch = false;
 
+	if (tags && (ctx->flags & IORING_SETUP_PRIVATE_CQ))
+		return -EINVAL;
 	if (!ctx->file_data)
 		return -ENXIO;
 	if (up->offset + nr_args > ctx->nr_user_files)
@@ -9845,6 +9865,9 @@ static int __io_sqe_buffers_update(struct io_ring_ctx *ctx,
 	__u32 done;
 	int i, err;
 
+	if (tags && (ctx->flags & IORING_SETUP_PRIVATE_CQ))
+		return -EINVAL;
+
 	if (!ctx->buf_data)
 		return -ENXIO;
 	if (up->offset + nr_args > ctx->nr_user_bufs)
@@ -10389,6 +10412,23 @@ static __cold void io_uring_try_cancel_requests(struct io_ring_ctx *ctx,
 	}
 }
 
+static __cold int io_uring_check_ctxs(struct io_ring_ctx *ctx)
+{
+	int nr_tctxs = 0, max_tctxs = 1;
+	struct list_head *pos;
+
+	if (!(ctx->flags & IORING_SETUP_PRIVATE_CQ))
+		return 0;
+
+	if (ctx->flags & IORING_SETUP_IOPOLL)
+		return -EINVAL;
+	if (ctx->flags & IORING_SETUP_SQPOLL)
+		max_tctxs++;
+	list_for_each(pos, &ctx->tctx_list)
+		nr_tctxs++;
+	return nr_tctxs < max_tctxs ? 0 : -EINVAL;
+}
+
 static int __io_uring_add_tctx_node(struct io_ring_ctx *ctx)
 {
 	struct io_uring_task *tctx = current->io_uring;
@@ -10417,14 +10457,18 @@ static int __io_uring_add_tctx_node(struct io_ring_ctx *ctx)
 		node->ctx = ctx;
 		node->task = current;
 
-		ret = xa_err(xa_store(&tctx->xa, (unsigned long)ctx,
+		mutex_lock(&ctx->uring_lock);
+		ret = io_uring_check_ctxs(ctx);
+		if (!ret) {
+			ret = xa_err(xa_store(&tctx->xa, (unsigned long)ctx,
 					node, GFP_KERNEL));
+		}
 		if (ret) {
 			kfree(node);
+			mutex_unlock(&ctx->uring_lock);
 			return ret;
 		}
 
-		mutex_lock(&ctx->uring_lock);
 		list_add(&node->ctx_node, &ctx->tctx_list);
 		mutex_unlock(&ctx->uring_lock);
 	}
@@ -11349,7 +11393,8 @@ static long io_uring_setup(u32 entries, struct io_uring_params __user *params)
 	if (p.flags & ~(IORING_SETUP_IOPOLL | IORING_SETUP_SQPOLL |
 			IORING_SETUP_SQ_AFF | IORING_SETUP_CQSIZE |
 			IORING_SETUP_CLAMP | IORING_SETUP_ATTACH_WQ |
-			IORING_SETUP_R_DISABLED | IORING_SETUP_SUBMIT_ALL))
+			IORING_SETUP_R_DISABLED | IORING_SETUP_SUBMIT_ALL |
+			IORING_SETUP_PRIVATE_CQ))
 		return -EINVAL;
 
 	return  io_uring_create(entries, &p, params);
@@ -11561,6 +11606,8 @@ static __cold int io_register_rsrc(struct io_ring_ctx *ctx, void __user *arg,
 	/* keep it extendible */
 	if (size != sizeof(rr))
 		return -EINVAL;
+	if (rr.tags && (ctx->flags & IORING_SETUP_PRIVATE_CQ))
+		return -EINVAL;
 
 	memset(&rr, 0, sizeof(rr));
 	if (copy_from_user(&rr, arg, size))
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index d2be4eb22008..342fab169b83 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -102,6 +102,7 @@ enum {
 #define IORING_SETUP_ATTACH_WQ	(1U << 5)	/* attach to existing wq */
 #define IORING_SETUP_R_DISABLED	(1U << 6)	/* start with ring disabled */
 #define IORING_SETUP_SUBMIT_ALL	(1U << 7)	/* continue submit on error */
+#define IORING_SETUP_PRIVATE_CQ	(1U << 8)
 
 enum {
 	IORING_OP_NOP,
-- 
2.35.1



* Re: [RFC 0/4] completion locking optimisation feature
  2022-03-18 13:52 [RFC 0/4] completion locking optimisation feature Pavel Begunkov
                   ` (3 preceding siblings ...)
  2022-03-18 13:52 ` [PATCH 4/4] io_uring: optimise compl locking for non-shared rings Pavel Begunkov
@ 2022-03-18 14:42 ` Pavel Begunkov
  2022-03-18 14:52   ` Jens Axboe
  4 siblings, 1 reply; 15+ messages in thread
From: Pavel Begunkov @ 2022-03-18 14:42 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe

On 3/18/22 13:52, Pavel Begunkov wrote:
> A WIP feature optimising out CQEs posting spinlocking for some use cases.
> For a more detailed description see 4/4.
> 
> Quick benchmarking with fio/t/io_uring nops gives extra 4% to throughput for
> QD=1, and ~+2.5% for QD=4.

Non-io_uring overhead (syscalls + userspace) takes ~60% of all execution
time, so the percentage will depend quite a bit on the CPU and the kernel
config. Likely to be more than 4% for a faster setup.

fwiw, this was also using IORING_ENTER_REGISTERED_RING, in case it's not
yet included in the upstream version of the tool.

Also, I want to play with it afterwards to see if we can avoid taking
uring_lock as well.


> Pavel Begunkov (4):
>    io_uring: get rid of raw fill cqe in kill_timeout
>    io_uring: get rid of raw fill_cqe in io_fail_links
>    io_uring: remove raw fill_cqe from linked timeout
>    io_uring: optimise compl locking for non-shared rings
> 
>   fs/io_uring.c                 | 126 ++++++++++++++++++++++------------
>   include/uapi/linux/io_uring.h |   1 +
>   2 files changed, 85 insertions(+), 42 deletions(-)
> 

-- 
Pavel Begunkov


* Re: [RFC 0/4] completion locking optimisation feature
  2022-03-18 14:42 ` [RFC 0/4] completion locking optimisation feature Pavel Begunkov
@ 2022-03-18 14:52   ` Jens Axboe
  2022-03-18 15:00     ` Pavel Begunkov
  0 siblings, 1 reply; 15+ messages in thread
From: Jens Axboe @ 2022-03-18 14:52 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 3/18/22 8:42 AM, Pavel Begunkov wrote:
> On 3/18/22 13:52, Pavel Begunkov wrote:
>> A WIP feature optimising out CQEs posting spinlocking for some use cases.
>> For a more detailed description see 4/4.
>>
>> Quick benchmarking with fio/t/io_uring nops gives extra 4% to throughput for
>> QD=1, and ~+2.5% for QD=4.
> 
> Non-io_uring overhead (syscalls + userspace) takes ~60% of all execution
> time, so the percentage should quite depend on the CPU and the kernel config.
> Likely to be more than 4% for a faster setup.
> 
> fwiw, was also usingIORING_ENTER_REGISTERED_RING, if it's not yet included
> in the upstream version of the tool.

But that seems to be exclusive of using PRIVATE_CQ?

> Also, want to play after to see if we can also avoid taking uring_lock.

That would certainly also be interesting!


-- 
Jens Axboe



* Re: [PATCH 4/4] io_uring: optimise compl locking for non-shared rings
  2022-03-18 13:52 ` [PATCH 4/4] io_uring: optimise compl locking for non-shared rings Pavel Begunkov
@ 2022-03-18 14:54   ` Jens Axboe
  2022-03-18 15:13     ` Pavel Begunkov
  0 siblings, 1 reply; 15+ messages in thread
From: Jens Axboe @ 2022-03-18 14:54 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 3/18/22 7:52 AM, Pavel Begunkov wrote:
> When only one task submits requests, most of CQEs are expected to be
> filled from that task context so we have natural serialisation. That
> would mean that in those cases we don't need spinlocking around CQE
> posting. One downside is that it also mean that io-wq workers can't emit
> CQEs directly but should do it through the original task context using
> task_works. That may hurt latency and performance and might matter much
> to some workloads, but it's not a huge deal in general as io-wq is a
> slow path and there is some additional merit from tw completion
> batching.

Not too worried about io-wq task_work for CQ filling, it is the slower
path after all. And I think we can get away with doing notifications as
it's just for CQ filling. If the task is currently waiting in
cqring_wait, then it'll get woken anyway and it will process task work.
If it's in userspace, it doesn't need a notification. That should make
it somewhat lighter than requiring the use of TIF_NOTIFY_SIGNAL for that.

> The feature should be opted-in by the userspace by setting a new
> IORING_SETUP_PRIVATE_CQ flag. It doesn't work with IOPOLL, and also for
> now only the task that created a ring can submit requests to it.

I know this is a WIP, but why do we need CQ_PRIVATE? And this needs to
work with registered files (and ring fd) as that is probably a bigger
win than skipping the completion_lock if you're not shared anyway.


-- 
Jens Axboe



* Re: [RFC 0/4] completion locking optimisation feature
  2022-03-18 14:52   ` Jens Axboe
@ 2022-03-18 15:00     ` Pavel Begunkov
  2022-03-18 15:22       ` Jens Axboe
  0 siblings, 1 reply; 15+ messages in thread
From: Pavel Begunkov @ 2022-03-18 15:00 UTC (permalink / raw)
  To: Jens Axboe, io-uring

On 3/18/22 14:52, Jens Axboe wrote:
> On 3/18/22 8:42 AM, Pavel Begunkov wrote:
>> On 3/18/22 13:52, Pavel Begunkov wrote:
>>> A WIP feature optimising out CQEs posting spinlocking for some use cases.
>>> For a more detailed description see 4/4.
>>>
>>> Quick benchmarking with fio/t/io_uring nops gives extra 4% to throughput for
>>> QD=1, and ~+2.5% for QD=4.
>>
>> Non-io_uring overhead (syscalls + userspace) takes ~60% of all execution
>> time, so the percentage should quite depend on the CPU and the kernel config.
>> Likely to be more than 4% for a faster setup.
>>
>> fwiw, was also usingIORING_ENTER_REGISTERED_RING, if it's not yet included
>> in the upstream version of the tool.
> 
> But that seems to be exclusive of using PRIVATE_CQ?

No, it's not. Let me ask, so I can clarify the description: why do you
think it is exclusive?


>> Also, want to play after to see if we can also avoid taking uring_lock.
> 
> That would certainly also be interesting!

-- 
Pavel Begunkov


* Re: [PATCH 4/4] io_uring: optimise compl locking for non-shared rings
  2022-03-18 14:54   ` Jens Axboe
@ 2022-03-18 15:13     ` Pavel Begunkov
  2022-03-18 15:21       ` Jens Axboe
  0 siblings, 1 reply; 15+ messages in thread
From: Pavel Begunkov @ 2022-03-18 15:13 UTC (permalink / raw)
  To: Jens Axboe, io-uring

On 3/18/22 14:54, Jens Axboe wrote:
> On 3/18/22 7:52 AM, Pavel Begunkov wrote:
>> When only one task submits requests, most of CQEs are expected to be
>> filled from that task context so we have natural serialisation. That
>> would mean that in those cases we don't need spinlocking around CQE
>> posting. One downside is that it also mean that io-wq workers can't emit
>> CQEs directly but should do it through the original task context using
>> task_works. That may hurt latency and performance and might matter much
>> to some workloads, but it's not a huge deal in general as io-wq is a
>> slow path and there is some additional merit from tw completion
>> batching.
> 
> Not too worried about io-wq task_work for cq filling, it is the slower
> path after all. And I think we can get away with doing notifications as
> it's just for CQ filling. If the task is currently waiting in
> cqring_wait, then it'll get woken anyway and it will process task work.
> If it's in userspace, it doesn't need a notification. That should make
> it somewhat lighter than requiring using TIF_NOTIFY_SIGNAL for that.
> 
>> The feature should be opted-in by the userspace by setting a new
>> IORING_SETUP_PRIVATE_CQ flag. It doesn't work with IOPOLL, and also for
>> now only the task that created a ring can submit requests to it.
> 
> I know this is a WIP, but why do we need CQ_PRIVATE? And this needs to

One reason is the io-wq -> tw punting, which is not optimal for e.g.
active users of IOSQE_ASYNC. The second is that the fundamental
requirement is that only one task should be submitting requests. I was
thinking about automating it, e.g. when we register a second tctx we go
through a slow path waiting for all current tw to complete and then
remove an internal, not userspace visible, CQ_PRIVATE flag.

Also, as the SQPOLL task is by definition the only one submitting SQEs,
I was thinking about enabling it by default there, but didn't because
of the io-wq / IOSQE_ASYNC issue.

> work with registered files (and ring fd) as that is probably a bigger
> win than skipping the completion_lock if you're not shared anyway.

It does work with fixed/registered files and registered io_uring fds.

As for "a bigger win", probably in many cases, but if you submit a good
batch at once and completion tw batching doesn't kick in (e.g. direct
bdev reads of not too high intensity), it might save N spinlock/unlock
pairs, whereas a registered ring fd kills only one pair of fdget/fdput.

-- 
Pavel Begunkov


* Re: [PATCH 4/4] io_uring: optimise compl locking for non-shared rings
  2022-03-18 15:13     ` Pavel Begunkov
@ 2022-03-18 15:21       ` Jens Axboe
  2022-03-18 15:32         ` Pavel Begunkov
  0 siblings, 1 reply; 15+ messages in thread
From: Jens Axboe @ 2022-03-18 15:21 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 3/18/22 9:13 AM, Pavel Begunkov wrote:
> On 3/18/22 14:54, Jens Axboe wrote:
>> On 3/18/22 7:52 AM, Pavel Begunkov wrote:
>>> When only one task submits requests, most of CQEs are expected to be
>>> filled from that task context so we have natural serialisation. That
>>> would mean that in those cases we don't need spinlocking around CQE
>>> posting. One downside is that it also mean that io-wq workers can't emit
>>> CQEs directly but should do it through the original task context using
>>> task_works. That may hurt latency and performance and might matter much
>>> to some workloads, but it's not a huge deal in general as io-wq is a
>>> slow path and there is some additional merit from tw completion
>>> batching.
>>
>> Not too worried about io-wq task_work for cq filling, it is the slower
>> path after all. And I think we can get away with doing notifications as
>> it's just for CQ filling. If the task is currently waiting in
>> cqring_wait, then it'll get woken anyway and it will process task work.
>> If it's in userspace, it doesn't need a notification. That should make
>> it somewhat lighter than requiring using TIF_NOTIFY_SIGNAL for that.
>>
>>> The feature should be opted-in by the userspace by setting a new
>>> IORING_SETUP_PRIVATE_CQ flag. It doesn't work with IOPOLL, and also for
>>> now only the task that created a ring can submit requests to it.
>>
>> I know this is a WIP, but why do we need CQ_PRIVATE? And this needs to
> 
> One reason is because of the io-wq -> tw punting, which is not optimal
> for e.g. active users of IOSQE_ASYNC. The second is because the
> fundamental requirement is that only one task should be submitting
> requests. Was thinking about automating it, e.g. when we register
> a second tctx we go through a slow path waiting for all current tw
> to complete and then removing an internal and not userspace visible
> CQ_PRIVATE flag.

Was thinking something along those lines too. The alternative is setting
up the ring with SETUP_SINGLE_ISSUER or something like that, having the
application tell us that it is a single issuer and no submits are
shared across threads. Serves the same kind of purpose as CQ_PRIVATE,
but enables us to simply fail things if the task violates those
constraints. Would also be a better name I believe as it might enable
further optimizations in the future, like for example the mutex
reduction for submits.

> Also, as SQPOLL task is by definition the only one submitting SQEs,
> was thinking about enabling it by default for them, but didn't do
> because of the io-wq / IOSQE_ASYNC.

Gotcha.

>> work with registered files (and ring fd) as that is probably a bigger
>> win than skipping the completion_lock if you're not shared anyway.
> 
> It does work with fixed/registered files and registered io_uring fds.

t/io_uring fails for me with registered files or rings, getting EINVAL.
Might be user error, but that's simply just setting CQ_PRIVATE for
setup.

> In regards of "a bigger win", probably in many cases, but if you submit
> a good batch at once, and completion tw batching doesn't kick in (e.g.
> direct bdev read of not too high intensity), it might save
> N spinlock/unlock when registered ring fd would kill only one pair of
> fdget/fdput.

Definitely, various cases where one would be a bigger win than the
other, agree on that. But let's just ensure that both work together :-)

-- 
Jens Axboe



* Re: [RFC 0/4] completion locking optimisation feature
  2022-03-18 15:00     ` Pavel Begunkov
@ 2022-03-18 15:22       ` Jens Axboe
  2022-03-18 15:34         ` Pavel Begunkov
  0 siblings, 1 reply; 15+ messages in thread
From: Jens Axboe @ 2022-03-18 15:22 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 3/18/22 9:00 AM, Pavel Begunkov wrote:
> On 3/18/22 14:52, Jens Axboe wrote:
>> On 3/18/22 8:42 AM, Pavel Begunkov wrote:
>>> On 3/18/22 13:52, Pavel Begunkov wrote:
>>>> A WIP feature optimising out CQEs posting spinlocking for some use cases.
>>>> For a more detailed description see 4/4.
>>>>
>>>> Quick benchmarking with fio/t/io_uring nops gives extra 4% to throughput for
>>>> QD=1, and ~+2.5% for QD=4.
>>>
>>> Non-io_uring overhead (syscalls + userspace) takes ~60% of all execution
>>> time, so the percentage should quite depend on the CPU and the kernel config.
>>> Likely to be more than 4% for a faster setup.
>>>
>>> fwiw, was also usingIORING_ENTER_REGISTERED_RING, if it's not yet included
>>> in the upstream version of the tool.
>>
>> But that seems to be exclusive of using PRIVATE_CQ?
> 
> No, it's not. Let me ask to clarify the description and so, why do you
> think it is exclusive?

Didn't dig too deep, just saw various EINVAL around registered files and
updating/insertion. And the fact that t/io_uring gets -EINVAL on
register with op 20 (ring fd) if I set CQ_PRIVATE, see other email.

-- 
Jens Axboe



* Re: [PATCH 4/4] io_uring: optimise compl locking for non-shared rings
  2022-03-18 15:21       ` Jens Axboe
@ 2022-03-18 15:32         ` Pavel Begunkov
  2022-03-18 16:06           ` Jens Axboe
  0 siblings, 1 reply; 15+ messages in thread
From: Pavel Begunkov @ 2022-03-18 15:32 UTC (permalink / raw)
  To: Jens Axboe, io-uring

[-- Attachment #1: Type: text/plain, Size: 4067 bytes --]

On 3/18/22 15:21, Jens Axboe wrote:
> On 3/18/22 9:13 AM, Pavel Begunkov wrote:
>> On 3/18/22 14:54, Jens Axboe wrote:
>>> On 3/18/22 7:52 AM, Pavel Begunkov wrote:
>>>> When only one task submits requests, most of CQEs are expected to be
>>>> filled from that task context so we have natural serialisation. That
>>>> would mean that in those cases we don't need spinlocking around CQE
>>>> posting. One downside is that it also mean that io-wq workers can't emit
>>>> CQEs directly but should do it through the original task context using
>>>> task_works. That may hurt latency and performance and might matter much
>>>> to some workloads, but it's not a huge deal in general as io-wq is a
>>>> slow path and there is some additional merit from tw completion
>>>> batching.
>>>
>>> Not too worried about io-wq task_work for cq filling, it is the slower
>>> path after all. And I think we can get away with doing notifications as
>>> it's just for CQ filling. If the task is currently waiting in
>>> cqring_wait, then it'll get woken anyway and it will process task work.
>>> If it's in userspace, it doesn't need a notification. That should make
>>> it somewhat lighter than requiring using TIF_NOTIFY_SIGNAL for that.
>>>
>>>> The feature should be opted-in by the userspace by setting a new
>>>> IORING_SETUP_PRIVATE_CQ flag. It doesn't work with IOPOLL, and also for
>>>> now only the task that created a ring can submit requests to it.
>>>
>>> I know this is a WIP, but why do we need CQ_PRIVATE? And this needs to
>>
>> One reason is because of the io-wq -> tw punting, which is not optimal
>> for e.g. active users of IOSQE_ASYNC. The second is because the
>> fundamental requirement is that only one task should be submitting
>> requests. Was thinking about automating it, e.g. when we register
>> a second tctx we go through a slow path waiting for all current tw
>> to complete and then removing an internal and not userspace visible
>> CQ_PRIVATE flag.
> 
> Was thinking something along those lines too. The alternative is setting
> up the ring with SETUP_SINGLE_ISSUER or something like that, having the
> application tell us that it is a single issuer and no submits are
> shared across threads. Serves the same kind of purpose as CQ_PRIVATE,
> but enables us to simply fail things if the task violates those
> constraints. Would also be a better name I believe as it might enable
> further optimizations in the future, like for example the mutex
> reduction for submits.

That's exactly what it is, including the failing part. And I like
your name better, will take it.

>> Also, as SQPOLL task is by definition the only one submitting SQEs,
>> was thinking about enabling it by default for them, but didn't do
>> because of the io-wq / IOSQE_ASYNC.
> 
> Gotcha.
> 
>>> work with registered files (and ring fd) as that is probably a bigger
>>> win than skipping the completion_lock if you're not shared anyway.
>>
>> It does work with fixed/registered files and registered io_uring fds.
> 
> t/io_uring fails for me with registered files or rings, getting EINVAL.
> Might be user error, but that's simply just setting CQ_PRIVATE for
> setup.

One thing I changed in the tool is that the ring should be created
by the submitter task, so I moved setup_ring() into the submitter thread.
I plan to get rid of this restriction though.

Weird that it works for you only without reg files/rings, I'll
take a look.

Attached is the io_uring.c that I used. It's based on some old version,
so do_nop can't be set via argv but has to be turned on in the source
code. IORING_ENTER_REGISTERED_RING is always enabled.

>> In regards of "a bigger win", probably in many cases, but if you submit
>> a good batch at once, and completion tw batching doesn't kick in (e.g.
>> direct bdev read of not too high intensity), it might save
>> N spinlock/unlock when registered ring fd would kill only one pair of
>> fdget/fdput.
> 
> Definitely, various cases where one would be a bigger win than the
> other, agree on that. But let's just ensure that both work together :-)

-- 
Pavel Begunkov

[-- Attachment #2: io_uring.c --]
[-- Type: text/x-csrc, Size: 15194 bytes --]

#include <stdio.h>
#include <errno.h>
#include <assert.h>
#include <stdlib.h>
#include <stddef.h>
#include <signal.h>
#include <inttypes.h>

#include <sys/types.h>
#include <sys/stat.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/resource.h>
#include <sys/mman.h>
#include <sys/uio.h>
#include <linux/fs.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <pthread.h>
#include <sched.h>

#include "../arch/arch.h"
#include "../lib/types.h"
#include "../os/linux/io_uring.h"

#define min(a, b)		((a < b) ? (a) : (b))

struct io_sq_ring {
	unsigned *head;
	unsigned *tail;
	unsigned *ring_mask;
	unsigned *ring_entries;
	unsigned *flags;
	unsigned *array;
};


#define IORING_SETUP_PRIVATE_CQ	(1U << 8)

enum {
	IORING_REGISTER_RING_FDS		= 20,
};

#define IORING_ENTER_REGISTERED_RING	(1U << 4)

struct io_cq_ring {
	unsigned *head;
	unsigned *tail;
	unsigned *ring_mask;
	unsigned *ring_entries;
	struct io_uring_cqe *cqes;
};

#define DEPTH			128
#define BATCH_SUBMIT		32
#define BATCH_COMPLETE		32
#define BS			4096

#define MAX_FDS			16

static unsigned sq_ring_mask, cq_ring_mask;

struct file {
	unsigned long max_blocks;
	unsigned pending_ios;
	int real_fd;
	int fixed_fd;
};

struct submitter {
	pthread_t thread;
	int ring_fd;
	struct io_sq_ring sq_ring;
	struct io_uring_sqe *sqes;
	struct io_cq_ring cq_ring;
	int inflight;
	unsigned long reaps;
	unsigned long done;
	unsigned long calls;
	volatile int finish;

	__s32 *fds;

	struct file files[MAX_FDS];
	unsigned nr_files;
	unsigned cur_file;
	struct iovec iovecs[];
};

static struct submitter *submitter;
static volatile int finish;

static int depth = DEPTH;
static int batch_submit = BATCH_SUBMIT;
static int batch_complete = BATCH_COMPLETE;
static int bs = BS;
static int polled = 1;		/* use IO polling */
static int fixedbufs = 1;	/* use fixed user buffers */
static int register_files = 1;	/* use fixed files */
static int buffered = 0;	/* use buffered IO, not O_DIRECT */
static int sq_thread_poll = 0;	/* use kernel submission/poller thread */
static int sq_thread_cpu = -1;	/* pin above thread to this CPU */
static int do_nop = 1;		/* no-op SQ ring commands */

struct io_uring_rsrc_update {
	__u32 offset;
	__u32 resv;
	__aligned_u64 data;
};

static int vectored = 1;

static int setup_ring(struct submitter *s);

static int io_uring_register_buffers(struct submitter *s)
{
	if (do_nop)
		return 0;

	return syscall(__NR_io_uring_register, s->ring_fd,
			IORING_REGISTER_BUFFERS, s->iovecs, depth);
}

static int io_uring_register_io_uring_fd(struct submitter *s)
{
	struct io_uring_rsrc_update up = {};

	up.offset = 0;
	up.data = s->ring_fd;

	return syscall(__NR_io_uring_register, s->ring_fd,
			IORING_REGISTER_RING_FDS, &up, 1);
}

static int io_uring_register_files(struct submitter *s)
{
	int i;

	if (do_nop)
		return 0;

	s->fds = calloc(s->nr_files, sizeof(__s32));
	for (i = 0; i < s->nr_files; i++) {
		s->fds[i] = s->files[i].real_fd;
		s->files[i].fixed_fd = i;
	}

	return syscall(__NR_io_uring_register, s->ring_fd,
			IORING_REGISTER_FILES, s->fds, s->nr_files);
}

static int io_uring_setup(unsigned entries, struct io_uring_params *p)
{
	return syscall(__NR_io_uring_setup, entries, p);
}

static void io_uring_probe(int fd)
{
	struct io_uring_probe *p;
	int ret;

	p = malloc(sizeof(*p) + 256 * sizeof(struct io_uring_probe_op));
	if (!p)
		return;

	memset(p, 0, sizeof(*p) + 256 * sizeof(struct io_uring_probe_op));
	ret = syscall(__NR_io_uring_register, fd, IORING_REGISTER_PROBE, p, 256);
	if (ret < 0)
		goto out;

	if (IORING_OP_READ > p->ops_len)
		goto out;

	if ((p->ops[IORING_OP_READ].flags & IO_URING_OP_SUPPORTED))
		vectored = 0;
out:
	free(p);
}

static int io_uring_enter(struct submitter *s, unsigned int to_submit,
			  unsigned int min_complete, unsigned int flags)
{
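	/*
	 * The ring fd was registered at index 0 via IORING_REGISTER_RING_FDS,
	 * so pass 0 as the fd here and always set IORING_ENTER_REGISTERED_RING.
	 */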
	return syscall(__NR_io_uring_enter, 0, to_submit, min_complete,
			flags | IORING_ENTER_REGISTERED_RING, NULL, 0);
}

#ifndef CONFIG_HAVE_GETTID
static int gettid(void)
{
	return syscall(__NR_gettid);
}
#endif

static unsigned file_depth(struct submitter *s)
{
	return (depth + s->nr_files - 1) / s->nr_files;
}

static void init_io(struct submitter *s, unsigned index)
{
	struct io_uring_sqe *sqe = &s->sqes[index];
	unsigned long offset;
	struct file *f;
	long r;

	if (do_nop) {
		sqe->opcode = IORING_OP_NOP;
		return;
	}

	if (s->nr_files == 1) {
		f = &s->files[0];
	} else {
		f = &s->files[s->cur_file];
		if (f->pending_ios >= file_depth(s)) {
			s->cur_file++;
			if (s->cur_file == s->nr_files)
				s->cur_file = 0;
			f = &s->files[s->cur_file];
		}
	}
	f->pending_ios++;

	r = lrand48();
	offset = (r % (f->max_blocks - 1)) * bs;

	if (register_files) {
		sqe->flags = IOSQE_FIXED_FILE;
		sqe->fd = f->fixed_fd;
	} else {
		sqe->flags = 0;
		sqe->fd = f->real_fd;
	}
	if (fixedbufs) {
		sqe->opcode = IORING_OP_READ_FIXED;
		sqe->addr = (unsigned long) s->iovecs[index].iov_base;
		sqe->len = bs;
		sqe->buf_index = index;
	} else if (!vectored) {
		sqe->opcode = IORING_OP_READ;
		sqe->addr = (unsigned long) s->iovecs[index].iov_base;
		sqe->len = bs;
		sqe->buf_index = 0;
	} else {
		sqe->opcode = IORING_OP_READV;
		sqe->addr = (unsigned long) &s->iovecs[index];
		sqe->len = 1;
		sqe->buf_index = 0;
	}
	sqe->ioprio = 0;
	sqe->off = offset;
	sqe->user_data = (unsigned long) f;
}

static int prep_more_ios(struct submitter *s, int max_ios)
{
	struct io_sq_ring *ring = &s->sq_ring;
	unsigned index, tail, next_tail, prepped = 0;

	next_tail = tail = *ring->tail;
	do {
		next_tail++;
		if (next_tail == atomic_load_acquire(ring->head))
			break;

		index = tail & sq_ring_mask;
		init_io(s, index);
		ring->array[index] = index;
		prepped++;
		tail = next_tail;
	} while (prepped < max_ios);

	if (prepped)
		atomic_store_release(ring->tail, tail);
	return prepped;
}

static int get_file_size(struct file *f)
{
	struct stat st;

	if (fstat(f->real_fd, &st) < 0)
		return -1;
	if (S_ISBLK(st.st_mode)) {
		unsigned long long bytes;

		if (ioctl(f->real_fd, BLKGETSIZE64, &bytes) != 0)
			return -1;

		f->max_blocks = bytes / bs;
		return 0;
	} else if (S_ISREG(st.st_mode)) {
		f->max_blocks = st.st_size / bs;
		return 0;
	}

	return -1;
}

static int reap_events(struct submitter *s)
{
	struct io_cq_ring *ring = &s->cq_ring;
	struct io_uring_cqe *cqe;
	unsigned head, reaped = 0;

	head = *ring->head;
	do {
		struct file *f;

		read_barrier();
		if (head == atomic_load_acquire(ring->tail))
			break;
		cqe = &ring->cqes[head & cq_ring_mask];
		if (!do_nop) {
			f = (struct file *) (uintptr_t) cqe->user_data;
			f->pending_ios--;
			if (cqe->res != bs) {
				printf("io: unexpected ret=%d\n", cqe->res);
				if (polled && cqe->res == -EOPNOTSUPP)
					printf("Your filesystem/driver/kernel doesn't support polled IO\n");
				return -1;
			}
		}
		reaped++;
		head++;
	} while (1);

	if (reaped) {
		s->inflight -= reaped;
		atomic_store_release(ring->head, head);
	}
	return reaped;
}

static void *submitter_fn(void *data)
{
	struct submitter *s = data;
	struct io_sq_ring *ring = &s->sq_ring;
	int ret, prepped;
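
	/*
	 * setup_ring() below is done from the submitter thread itself, so the
	 * ring creator and the (only) submitter are the same task, as the
	 * CQ_PRIVATE opt-in in this series requires.
	 */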

	ret = setup_ring(s);
	if (ret) {
		printf("ring setup failed: %s, %d\n", strerror(errno), ret);
		return NULL;
	}

	printf("submitter=%d\n", gettid());

	srand48(pthread_self());

	prepped = 0;
	do {
		int to_wait, to_submit, this_reap, to_prep;
		unsigned ring_flags = 0;

		if (!prepped && s->inflight < depth) {
			to_prep = min(depth - s->inflight, batch_submit);
			prepped = prep_more_ios(s, to_prep);
		}
		s->inflight += prepped;
submit_more:
		to_submit = prepped;
submit:
		if (to_submit && (s->inflight + to_submit <= depth))
			to_wait = 0;
		else
			to_wait = min(s->inflight + to_submit, batch_complete);

		/*
		 * Only need to call io_uring_enter if we're not using SQ thread
		 * poll, or if IORING_SQ_NEED_WAKEUP is set.
		 */
		if (sq_thread_poll)
			ring_flags = atomic_load_acquire(ring->flags);
		if (!sq_thread_poll || ring_flags & IORING_SQ_NEED_WAKEUP) {
			unsigned flags = 0;

			if (to_wait)
				flags = IORING_ENTER_GETEVENTS;
			if (ring_flags & IORING_SQ_NEED_WAKEUP)
				flags |= IORING_ENTER_SQ_WAKEUP;
			ret = io_uring_enter(s, to_submit, to_wait, flags);
			s->calls++;
		} else {
			/* for SQPOLL, we submitted it all effectively */
			ret = to_submit;
		}

		/*
		 * For non SQ thread poll, we already got the events we needed
		 * through the io_uring_enter() above. For SQ thread poll, we
		 * need to loop here until we find enough events.
		 */
		this_reap = 0;
		do {
			int r;
			r = reap_events(s);
			if (r == -1) {
				s->finish = 1;
				break;
			} else if (r > 0)
				this_reap += r;
		} while (sq_thread_poll && this_reap < to_wait);
		s->reaps += this_reap;

		if (ret >= 0) {
			if (!ret) {
				to_submit = 0;
				if (s->inflight)
					goto submit;
				continue;
			} else if (ret < to_submit) {
				int diff = to_submit - ret;

				s->done += ret;
				prepped -= diff;
				goto submit_more;
			}
			s->done += ret;
			prepped = 0;
			continue;
		} else if (ret < 0) {
			if (errno == EAGAIN) {
				if (s->finish)
					break;
				if (this_reap)
					goto submit;
				to_submit = 0;
				goto submit;
			}
			printf("io_submit: %s\n", strerror(errno));
			break;
		}
	} while (!s->finish);

	finish = 1;
	return NULL;
}

static void sig_int(int sig)
{
	printf("Exiting on signal %d\n", sig);
	submitter->finish = 1;
	finish = 1;
}

static void arm_sig_int(void)
{
	struct sigaction act;

	memset(&act, 0, sizeof(act));
	act.sa_handler = sig_int;
	act.sa_flags = SA_RESTART;
	sigaction(SIGINT, &act, NULL);
}

static int setup_ring(struct submitter *s)
{
	struct io_sq_ring *sring = &s->sq_ring;
	struct io_cq_ring *cring = &s->cq_ring;
	struct io_uring_params p;
	int ret, fd;
	void *ptr;

	memset(&p, 0, sizeof(p));

	if (polled && !do_nop)
		p.flags |= IORING_SETUP_IOPOLL;
	if (sq_thread_poll) {
		p.flags |= IORING_SETUP_SQPOLL;
		if (sq_thread_cpu != -1) {
			p.flags |= IORING_SETUP_SQ_AFF;
			p.sq_thread_cpu = sq_thread_cpu;
		}
	}
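	/* opt into the private CQ (single issuer) completion optimisation */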
	p.flags |= IORING_SETUP_PRIVATE_CQ;

	fd = io_uring_setup(depth, &p);
	if (fd < 0) {
		perror("io_uring_setup");
		return 1;
	}
	s->ring_fd = fd;

	io_uring_probe(fd);

	ret = io_uring_register_io_uring_fd(s);
	if (ret < 0) {
		perror("io_uring_register_io_uring_fd");
		return 1;
	}

	if (fixedbufs) {
		ret = io_uring_register_buffers(s);
		if (ret < 0) {
			perror("io_uring_register_buffers");
			return 1;
		}
	}

	if (register_files) {
		ret = io_uring_register_files(s);
		if (ret < 0) {
			perror("io_uring_register_files");
			return 1;
		}
	}

	ptr = mmap(0, p.sq_off.array + p.sq_entries * sizeof(__u32),
			PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE, fd,
			IORING_OFF_SQ_RING);
	printf("sq_ring ptr = 0x%p\n", ptr);
	sring->head = ptr + p.sq_off.head;
	sring->tail = ptr + p.sq_off.tail;
	sring->ring_mask = ptr + p.sq_off.ring_mask;
	sring->ring_entries = ptr + p.sq_off.ring_entries;
	sring->flags = ptr + p.sq_off.flags;
	sring->array = ptr + p.sq_off.array;
	sq_ring_mask = *sring->ring_mask;

	s->sqes = mmap(0, p.sq_entries * sizeof(struct io_uring_sqe),
			PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE, fd,
			IORING_OFF_SQES);
	printf("sqes ptr    = 0x%p\n", s->sqes);

	ptr = mmap(0, p.cq_off.cqes + p.cq_entries * sizeof(struct io_uring_cqe),
			PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE, fd,
			IORING_OFF_CQ_RING);
	printf("cq_ring ptr = 0x%p\n", ptr);
	cring->head = ptr + p.cq_off.head;
	cring->tail = ptr + p.cq_off.tail;
	cring->ring_mask = ptr + p.cq_off.ring_mask;
	cring->ring_entries = ptr + p.cq_off.ring_entries;
	cring->cqes = ptr + p.cq_off.cqes;
	cq_ring_mask = *cring->ring_mask;
	return 0;
}

static void file_depths(char *buf)
{
	struct submitter *s = submitter;
	char *p;
	int i;

	buf[0] = '\0';
	p = buf;
	for (i = 0; i < s->nr_files; i++) {
		struct file *f = &s->files[i];

		if (i + 1 == s->nr_files)
			p += sprintf(p, "%d", f->pending_ios);
		else
			p += sprintf(p, "%d, ", f->pending_ios);
	}
}

static void usage(char *argv)
{
	printf("%s [options] -- [filenames]\n"
		" -d <int> : IO Depth, default %d\n"
		" -s <int> : Batch submit, default %d\n"
		" -c <int> : Batch complete, default %d\n"
		" -b <int> : Block size, default %d\n"
		" -p <bool> : Polled IO, default %d\n",
		argv, DEPTH, BATCH_SUBMIT, BATCH_COMPLETE, BS, polled);
	exit(0);
}

int main(int argc, char *argv[])
{
	struct submitter *s;
	unsigned long done, calls, reap;
	int i, flags, fd, opt;
	char *fdepths;
	void *ret;

	if (!do_nop && argc < 2) {
		printf("%s: filename [options]\n", argv[0]);
		return 1;
	}

	while ((opt = getopt(argc, argv, "d:s:c:b:p:B:F:h?")) != -1) {
		switch (opt) {
		case 'd':
			depth = atoi(optarg);
			break;
		case 's':
			batch_submit = atoi(optarg);
			break;
		case 'c':
			batch_complete = atoi(optarg);
			break;
		case 'b':
			bs = atoi(optarg);
			break;
		case 'p':
			polled = !!atoi(optarg);
			break;
		case 'B':
			fixedbufs = !!atoi(optarg);
			break;
		case 'F':
			register_files = !!atoi(optarg);
			break;
		case 'h':
		case '?':
		default:
			usage(argv[0]);
			break;
		}
	}

	submitter = malloc(sizeof(*submitter) + depth * sizeof(struct iovec));
	memset(submitter, 0, sizeof(*submitter) + depth * sizeof(struct iovec));
	s = submitter;

	flags = O_RDONLY | O_NOATIME;
	if (!buffered)
		flags |= O_DIRECT;

	i = optind;
	while (!do_nop && i < argc) {
		struct file *f;

		if (s->nr_files == MAX_FDS) {
			printf("Max number of files (%d) reached\n", MAX_FDS);
			break;
		}
		fd = open(argv[i], flags);
		if (fd < 0) {
			perror("open");
			return 1;
		}

		f = &s->files[s->nr_files];
		f->real_fd = fd;
		if (get_file_size(f)) {
			printf("failed getting size of device/file\n");
			return 1;
		}
		if (f->max_blocks <= 1) {
			printf("Zero file/device size?\n");
			return 1;
		}
		f->max_blocks--;

		printf("Added file %s\n", argv[i]);
		s->nr_files++;
		i++;
	}

	if (fixedbufs) {
		struct rlimit rlim;

		rlim.rlim_cur = RLIM_INFINITY;
		rlim.rlim_max = RLIM_INFINITY;
		if (setrlimit(RLIMIT_MEMLOCK, &rlim) < 0) {
			perror("setrlimit");
			return 1;
		}
	}

	arm_sig_int();

	for (i = 0; i < depth; i++) {
		void *buf;

		if (posix_memalign(&buf, bs, bs)) {
			printf("failed alloc\n");
			return 1;
		}
		s->iovecs[i].iov_base = buf;
		s->iovecs[i].iov_len = bs;
	}

	printf("polled=%d, fixedbufs=%d, register_files=%d, buffered=%d", polled, fixedbufs, register_files, buffered);
	printf(" QD=%d, sq_ring=%d, cq_ring=%d\n", depth, 0, 0);

	pthread_create(&s->thread, NULL, submitter_fn, s);

	fdepths = malloc(8 * s->nr_files);
	reap = calls = done = 0;
	do {
		unsigned long this_done = 0;
		unsigned long this_reap = 0;
		unsigned long this_call = 0;
		unsigned long rpc = 0, ipc = 0;

		sleep(1);
		this_done += s->done;
		this_call += s->calls;
		this_reap += s->reaps;
		if (this_call - calls) {
			rpc = (this_done - done) / (this_call - calls);
			ipc = (this_reap - reap) / (this_call - calls);
		} else
			rpc = ipc = -1;
		file_depths(fdepths);
		printf("IOPS=%lu, IOS/call=%ld/%ld, inflight=%u (%s)\n",
				this_done - done, rpc, ipc, s->inflight,
				fdepths);
		done = this_done;
		calls = this_call;
		reap = this_reap;
	} while (!finish);

	pthread_join(s->thread, &ret);
	close(s->ring_fd);
	free(fdepths);
	return 0;
}


* Re: [RFC 0/4] completion locking optimisation feature
  2022-03-18 15:22       ` Jens Axboe
@ 2022-03-18 15:34         ` Pavel Begunkov
  0 siblings, 0 replies; 15+ messages in thread
From: Pavel Begunkov @ 2022-03-18 15:34 UTC (permalink / raw)
  To: Jens Axboe, io-uring

On 3/18/22 15:22, Jens Axboe wrote:
> On 3/18/22 9:00 AM, Pavel Begunkov wrote:
>> On 3/18/22 14:52, Jens Axboe wrote:
>>> On 3/18/22 8:42 AM, Pavel Begunkov wrote:
>>>> On 3/18/22 13:52, Pavel Begunkov wrote:
>>>>> A WIP feature optimising out CQEs posting spinlocking for some use cases.
>>>>> For a more detailed description see 4/4.
>>>>>
>>>>> Quick benchmarking with fio/t/io_uring nops gives extra 4% to throughput for
>>>>> QD=1, and ~+2.5% for QD=4.
>>>>
>>>> Non-io_uring overhead (syscalls + userspace) takes ~60% of all execution
>>>> time, so the percentage should quite depend on the CPU and the kernel config.
>>>> Likely to be more than 4% for a faster setup.
>>>>
>>>> fwiw, was also usingIORING_ENTER_REGISTERED_RING, if it's not yet included
>>>> in the upstream version of the tool.
>>>
>>> But that seems to be exclusive of using PRIVATE_CQ?
>>
>> No, it's not. Let me ask to clarify the description and so, why do you
>> think it is exclusive?
> 
> Didn't dig too deep, just saw various EINVAL around registered files and
> updating/insertion. And the fact that t/io_uring gets -EINVAL on

Ah, those are for registered rsrc with tags, just because they use
fill_cqe. I need to do something about them.

> register with op 20 (ring fd) if I set CQ_PRIVATE, see other email.

-- 
Pavel Begunkov


* Re: [PATCH 4/4] io_uring: optimise compl locking for non-shared rings
  2022-03-18 15:32         ` Pavel Begunkov
@ 2022-03-18 16:06           ` Jens Axboe
  0 siblings, 0 replies; 15+ messages in thread
From: Jens Axboe @ 2022-03-18 16:06 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 3/18/22 9:32 AM, Pavel Begunkov wrote:
>>>> work with registered files (and ring fd) as that is probably a bigger
>>>> win than skipping the completion_lock if you're not shared anyway.
>>>
>>> It does work with fixed/registered files and registered io_uring fds.
>>
>> t/io_uring fails for me with registered files or rings, getting EINVAL.
>> Might be user error, but that's simply just setting CQ_PRIVATE for
>> setup.
> 
> One thing I changed in the tool is that the ring should be created
> by the submitter task, so move setup_ring into the submitter thread.
> Plan to get rid of this restriction though.

OK, thought it was probably something like that; you do end up with two
tctx if you init in main and pass the ring to a thread. But that's
probably quite a common way to do setup, so we should deal with that in
the future at least. Just part of the stuff that would ultimately need
polishing.

> Weird that it works only for you only without reg files/rings, will
> take a look.

I think the part above totally explains it.

> Attached io_uring.c that I used, it's based on some old version,
> so do_nop can't be set in argv but should turned in the source code.
> IORING_ENTER_REGISTERED_RING is always enabled.

Thanks!

-- 
Jens Axboe



end of thread, other threads:[~2022-03-18 16:07 UTC | newest]

Thread overview: 15+ messages
2022-03-18 13:52 [RFC 0/4] completion locking optimisation feature Pavel Begunkov
2022-03-18 13:52 ` [PATCH 1/4] io_uring: get rid of raw fill cqe in kill_timeout Pavel Begunkov
2022-03-18 13:52 ` [PATCH 2/4] io_uring: get rid of raw fill_cqe in io_fail_links Pavel Begunkov
2022-03-18 13:52 ` [PATCH 3/4] io_uring: remove raw fill_cqe from linked timeout Pavel Begunkov
2022-03-18 13:52 ` [PATCH 4/4] io_uring: optimise compl locking for non-shared rings Pavel Begunkov
2022-03-18 14:54   ` Jens Axboe
2022-03-18 15:13     ` Pavel Begunkov
2022-03-18 15:21       ` Jens Axboe
2022-03-18 15:32         ` Pavel Begunkov
2022-03-18 16:06           ` Jens Axboe
2022-03-18 14:42 ` [RFC 0/4] completion locking optimisation feature Pavel Begunkov
2022-03-18 14:52   ` Jens Axboe
2022-03-18 15:00     ` Pavel Begunkov
2022-03-18 15:22       ` Jens Axboe
2022-03-18 15:34         ` Pavel Begunkov
