From: Jens Axboe <[email protected]>
To: Peter Zijlstra <[email protected]>
Cc: "Carter Li 李通洲" <[email protected]>,
"Pavel Begunkov" <[email protected]>,
io-uring <[email protected]>
Subject: Re: [ISSUE] The time cost of IOSQE_IO_LINK
Date: Fri, 14 Feb 2020 10:52:43 -0700
Message-ID: <[email protected]>
In-Reply-To: <[email protected]>
On 2/14/20 9:18 AM, Jens Axboe wrote:
> On 2/14/20 8:47 AM, Jens Axboe wrote:
>>> I suspect you meant to put that in finish_task_switch(), which is the
>>> tail end of every schedule(); schedule_tail() is the tail end of
>>> clone().
>>>
>>> Or maybe you meant to put it in (and rename) sched_update_worker(),
>>> which runs after every schedule() but in a preemptible context -- much
>>> saner, since you don't want to go add an unbounded amount of work in a
>>> non-preemptible context.
>>>
>>> At which point you already have your callback: io_wq_worker_running(),
>>> or is this for any random task?
>>
>> Let me try and clarify - this isn't for the worker tasks, this is for
>> any task that is using io_uring. In fact, it's specifically not for the
>> worker threads, just the task itself.
>>
>> I basically want the handler to be called when:
>>
>> 1) The task is scheduled in. The poll will complete and stuff some items
>> on that task list, and I want the task to process them as it wakes up.
>>
>> 2) The task is going to sleep; I don't want to leave entries around
>> while the task is sleeping.
>>
>> 3) I need it to be called from "normal" context, with interrupts
>> enabled, preemption enabled, etc.
>>
>> sched_update_worker() (with a rename) looks ideal for #1, and the
>> context is sane for me. I just need a good spot to put the hook call on
>> the schedule-out side. I think this:
>>
>> if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
>>         preempt_disable();
>>         if (tsk->flags & PF_WQ_WORKER)
>>                 wq_worker_sleeping(tsk);
>>         else
>>                 io_wq_worker_sleeping(tsk);
>>         preempt_enable_no_resched();
>> }
>>
>> just needs to go into another helper, and then I can call the handler
>> there outside of the preempt-disabled section.
>>
>> I'm sure there are daemons lurking here, but I'll test and see how it
>> goes...
>
> Here's a stab at cleaning it up:
>
> https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-task-poll
>
> top two patches. The first one simply cleans up sched_update_worker(),
> so we now have sched_in_update() and sched_out_update(). No functional
> changes in this patch; it just moves the worker sched-out handling into
> a helper.
>
> 2nd patch then utilizes this to flush the per-task requests that may
> have been queued up.
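Roughly, the split looks like this (just a sketch, the branch above is
authoritative; the condition guarding the new handler call is my
shorthand):

static void sched_out_update(struct task_struct *tsk)
{
        /* worker sched-out handling, moved verbatim out of __schedule() */
        if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
                preempt_disable();
                if (tsk->flags & PF_WQ_WORKER)
                        wq_worker_sleeping(tsk);
                else
                        io_wq_worker_sleeping(tsk);
                preempt_enable_no_resched();
        }
}

static void sched_in_update(struct task_struct *tsk)
{
        /* renamed from sched_update_worker(), worker hooks unchanged */
        if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
                if (tsk->flags & PF_WQ_WORKER)
                        wq_worker_running(tsk);
                else
                        io_wq_worker_running(tsk);
        }
        /* new: drain per-task io_uring work in a preemptible context */
        if (!list_empty(&tsk->uring_work))
                io_uring_task_handler(tsk);
}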
In fact, we can go even further. If we have this task handler, then we:
1) Never need to go async for poll completion, and we can remove a bunch
of code that handles that
2) Don't need to worry about nested eventfd notification; that code goes
away too
3) Don't need the poll llist for batching flushes; that goes away as well
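The wake side collapses to parking the completed request on the task
that owns it and waking that task up. From the patch below:

        tsk = req->task;
        req->result = mask;
        raw_spin_lock_irqsave(&tsk->uring_lock, flags);
        list_add_tail(&req->list, &tsk->uring_work);
        raw_spin_unlock_irqrestore(&tsk->uring_lock, flags);
        wake_up_process(tsk);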
In terms of performance, for the single-client case we were doing about
48K requests per second in a KVM guest on my laptop; now we're doing
148K. So it's definitely worthwhile... On top of that, the diffstat:
fs/io_uring.c | 166 +++++++-------------------------------------------
1 file changed, 22 insertions(+), 144 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index a683d7a08003..1e436b732f2a 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -295,7 +295,6 @@ struct io_ring_ctx {
struct {
spinlock_t completion_lock;
- struct llist_head poll_llist;
/*
* ->poll_list is protected by the ctx->uring_lock for
@@ -552,19 +551,13 @@ struct io_kiocb {
};
struct io_async_ctx *io;
- /*
- * llist_node is only used for poll deferred completions
- */
- struct llist_node llist_node;
bool in_async;
bool needs_fixed_file;
u8 opcode;
struct io_ring_ctx *ctx;
- union {
- struct list_head list;
- struct hlist_node hash_node;
- };
+ struct list_head list;
+ struct hlist_node hash_node;
struct list_head link_list;
unsigned int flags;
refcount_t refs;
@@ -835,7 +828,6 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
mutex_init(&ctx->uring_lock);
init_waitqueue_head(&ctx->wait);
spin_lock_init(&ctx->completion_lock);
- init_llist_head(&ctx->poll_llist);
INIT_LIST_HEAD(&ctx->poll_list);
INIT_LIST_HEAD(&ctx->defer_list);
INIT_LIST_HEAD(&ctx->timeout_list);
@@ -932,8 +924,6 @@ static inline void io_req_work_grab_env(struct io_kiocb *req,
}
if (!req->work.task_pid)
req->work.task_pid = task_pid_vnr(current);
- if (!req->task)
- req->task = get_task_struct(current);
}
static inline void io_req_work_drop_env(struct io_kiocb *req)
@@ -3545,92 +3535,18 @@ static void io_poll_complete(struct io_kiocb *req, __poll_t mask, int error)
io_commit_cqring(ctx);
}
-static void io_poll_complete_work(struct io_wq_work **workptr)
+static void __io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe);
+static void io_poll_task_handler(struct io_kiocb *req, struct io_kiocb **nxt)
{
- struct io_wq_work *work = *workptr;
- struct io_kiocb *req = container_of(work, struct io_kiocb, work);
- struct io_poll_iocb *poll = &req->poll;
- struct poll_table_struct pt = { ._key = poll->events };
struct io_ring_ctx *ctx = req->ctx;
- struct io_kiocb *nxt = NULL;
- __poll_t mask = 0;
- int ret = 0;
- if (work->flags & IO_WQ_WORK_CANCEL) {
- WRITE_ONCE(poll->canceled, true);
- ret = -ECANCELED;
- } else if (READ_ONCE(poll->canceled)) {
- ret = -ECANCELED;
- }
-
- if (ret != -ECANCELED)
- mask = vfs_poll(poll->file, &pt) & poll->events;
-
- /*
- * Note that ->ki_cancel callers also delete iocb from active_reqs after
- * calling ->ki_cancel. We need the ctx_lock roundtrip here to
- * synchronize with them. In the cancellation case the list_del_init
- * itself is not actually needed, but harmless so we keep it in to
- * avoid further branches in the fast path.
- */
spin_lock_irq(&ctx->completion_lock);
- if (!mask && ret != -ECANCELED) {
- add_wait_queue(poll->head, &poll->wait);
- spin_unlock_irq(&ctx->completion_lock);
- return;
- }
hash_del(&req->hash_node);
- io_poll_complete(req, mask, ret);
- spin_unlock_irq(&ctx->completion_lock);
-
- io_cqring_ev_posted(ctx);
-
- if (ret < 0)
- req_set_fail_links(req);
- io_put_req_find_next(req, &nxt);
- if (nxt)
- io_wq_assign_next(workptr, nxt);
-}
-
-static void __io_poll_flush(struct io_ring_ctx *ctx, struct llist_node *nodes)
-{
- struct io_kiocb *req, *tmp;
- struct req_batch rb;
-
- rb.to_free = rb.need_iter = 0;
- spin_lock_irq(&ctx->completion_lock);
- llist_for_each_entry_safe(req, tmp, nodes, llist_node) {
- hash_del(&req->hash_node);
- io_poll_complete(req, req->result, 0);
-
- if (refcount_dec_and_test(&req->refs) &&
- !io_req_multi_free(&rb, req)) {
- req->flags |= REQ_F_COMP_LOCKED;
- io_free_req(req);
- }
- }
+ io_poll_complete(req, req->result, 0);
+ req->flags |= REQ_F_COMP_LOCKED;
+ io_put_req_find_next(req, nxt);
spin_unlock_irq(&ctx->completion_lock);
-
io_cqring_ev_posted(ctx);
- io_free_req_many(ctx, &rb);
-}
-
-static void io_poll_flush(struct io_wq_work **workptr)
-{
- struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work);
- struct llist_node *nodes;
-
- nodes = llist_del_all(&req->ctx->poll_llist);
- if (nodes)
- __io_poll_flush(req->ctx, nodes);
-}
-
-static void io_poll_trigger_evfd(struct io_wq_work **workptr)
-{
- struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work);
-
- eventfd_signal(req->ctx->cq_ev_fd, 1);
- io_put_req(req);
}
static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
@@ -3638,8 +3554,9 @@ static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
{
struct io_poll_iocb *poll = wait->private;
struct io_kiocb *req = container_of(poll, struct io_kiocb, poll);
- struct io_ring_ctx *ctx = req->ctx;
__poll_t mask = key_to_poll(key);
+ struct task_struct *tsk;
+ unsigned long flags;
/* for instances that support it check for an event match first: */
if (mask && !(mask & poll->events))
@@ -3647,56 +3564,12 @@ static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
list_del_init(&poll->wait.entry);
- /*
- * Run completion inline if we can. We're using trylock here because
- * we are violating the completion_lock -> poll wq lock ordering.
- * If we have a link timeout we're going to need the completion_lock
- * for finalizing the request, mark us as having grabbed that already.
- */
- if (mask) {
- unsigned long flags;
-
- if (llist_empty(&ctx->poll_llist) &&
- spin_trylock_irqsave(&ctx->completion_lock, flags)) {
- bool trigger_ev;
-
- hash_del(&req->hash_node);
- io_poll_complete(req, mask, 0);
-
- trigger_ev = io_should_trigger_evfd(ctx);
- if (trigger_ev && eventfd_signal_count()) {
- trigger_ev = false;
- req->work.func = io_poll_trigger_evfd;
- } else {
- struct io_kiocb *nxt = NULL;
-
- req->flags |= REQ_F_COMP_LOCKED;
- io_put_req_find_next(req, &nxt);
- if (nxt) {
- struct task_struct *tsk = nxt->task;
-
- raw_spin_lock(&tsk->uring_lock);
- list_add_tail(&nxt->list, &tsk->uring_work);
- raw_spin_unlock(&tsk->uring_lock);
- /* do we need to wake tsk here??? */
- }
- req = NULL;
- }
- spin_unlock_irqrestore(&ctx->completion_lock, flags);
- __io_cqring_ev_posted(ctx, trigger_ev);
- } else {
- req->result = mask;
- req->llist_node.next = NULL;
- /* if the list wasn't empty, we're done */
- if (!llist_add(&req->llist_node, &ctx->poll_llist))
- req = NULL;
- else
- req->work.func = io_poll_flush;
- }
- }
- if (req)
- io_queue_async_work(req);
-
+ tsk = req->task;
+ req->result = mask;
+ raw_spin_lock_irqsave(&tsk->uring_lock, flags);
+ list_add_tail(&req->list, &tsk->uring_work);
+ raw_spin_unlock_irqrestore(&tsk->uring_lock, flags);
+ wake_up_process(tsk);
return 1;
}
@@ -3755,7 +3628,6 @@ static int io_poll_add(struct io_kiocb *req, struct io_kiocb **nxt)
bool cancel = false;
__poll_t mask;
- INIT_IO_WORK(&req->work, io_poll_complete_work);
INIT_HLIST_NODE(&req->hash_node);
poll->head = NULL;
@@ -4863,6 +4735,8 @@ static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe,
return false;
}
+ req->task = get_task_struct(current);
+
/*
* If we already have a head request, queue this one for async
* submittal once the head completes. If we don't have a head but
@@ -5270,10 +5144,14 @@ void io_uring_task_handler(struct task_struct *tsk)
raw_spin_unlock_irq(&tsk->uring_lock);
while (!list_empty(&local_list)) {
+ struct io_kiocb *nxt = NULL;
+
req = list_first_entry(&local_list, struct io_kiocb, list);
list_del(&req->list);
- __io_queue_sqe(req, NULL);
+ io_poll_task_handler(req, &nxt);
+ if (nxt)
+ __io_queue_sqe(nxt, NULL);
}
}
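With that, the per-task handler ends up looking roughly like this
(reconstructed around the hunk above; the list-splicing details may
differ from the branch):

void io_uring_task_handler(struct task_struct *tsk)
{
        LIST_HEAD(local_list);
        struct io_kiocb *req;

        /* grab pending work under the lock, then process it unlocked */
        raw_spin_lock_irq(&tsk->uring_lock);
        list_splice_init(&tsk->uring_work, &local_list);
        raw_spin_unlock_irq(&tsk->uring_lock);

        while (!list_empty(&local_list)) {
                struct io_kiocb *nxt = NULL;

                req = list_first_entry(&local_list, struct io_kiocb, list);
                list_del(&req->list);

                /* complete the poll request; chase any linked request */
                io_poll_task_handler(req, &nxt);
                if (nxt)
                        __io_queue_sqe(nxt, NULL);
        }
}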
--
Jens Axboe