From: Pavel Begunkov <[email protected]>
To: Hao Xu <[email protected]>, Jens Axboe <[email protected]>
Cc: [email protected], Joseph Qi <[email protected]>
Subject: Re: [PATCH 3/8] io_uring: add a limited tw list for irq completion work
Date: Thu, 30 Sep 2021 10:02:07 +0100 [thread overview]
Message-ID: <[email protected]> (raw)
In-Reply-To: <[email protected]>
On 9/29/21 12:38 PM, Hao Xu wrote:
> 在 2021/9/28 下午7:29, Pavel Begunkov 写道:
[...]
>>> @@ -2132,12 +2136,16 @@ static void tctx_task_work(struct callback_head *cb)
>>> while (1) {
>>> struct io_wq_work_node *node;
>>> - if (!tctx->task_list.first && locked)
>>> + if (!tctx->prior_task_list.first &&
>>> + !tctx->task_list.first && locked)
>>> io_submit_flush_completions(ctx);
>>> spin_lock_irq(&tctx->task_lock);
>>> - node = tctx->task_list.first;
>>> + wq_list_merge(&tctx->prior_task_list, &tctx->task_list);
>>> + node = tctx->prior_task_list.first;
>>
>> I find all this accounting expensive, sure I'll see it for my BPF tests.
> May I ask how you evaluate the overhead with BPF here?
It's a custom branch and apparently would need some thinking on how
to apply your stuff on top, because of yet another list in [1]. In
short, the case in mind spins inside of tctx_task_work() doing one
request at a time.
I think it would be easier if I try it out myself.
[1] https://github.com/isilence/linux/commit/d6285a9817eb26aa52ad54a79584512d7efa82fd
>>
>> How about
>> 1) remove MAX_EMERGENCY_TW_RATIO and all the counters,
>> prior_nr and others.
>>
>> 2) rely solely on list merging
>>
>> So, when an iteration of the loop finds a set of requests to run, it
>> first executes all priority ones of that set and then the rest (just
>> by the fact that the lists are merged and everything on them gets
>> executed).
>>
>> It solves the problem of total starvation of non-prio requests, e.g.
>> if new completions come in as fast as you complete previous ones. One
>> downside is that prio requests coming while we execute a previous
>> batch will be executed only after a previous batch of non-prio
>> requests, I don't think it's much of a problem but interesting to
>> see numbers.
> hmm, this probably doesn't solve the starvation, since there may be
> a number of priority TWs ahead of the non-prio TWs in one iteration, in
> the case of submitting many sqes in one io_submit_sqes. That's why I
> cap priority TWs at 1/3 there.
I don't think it's a problem: they should be fast enough, and we have
a forward progress guarantee for non-prio requests. IMHO that should be enough.
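For reference, the merge-only scheme can be sketched in userspace like
this (simplified stand-ins for the kernel's singly linked wq_list, not
the real io_wq_work_list helpers from io-wq.h):

```c
#include <stddef.h>

/* Userspace stand-ins for the kernel's singly linked work list. */
struct wq_node { struct wq_node *next; };
struct wq_list { struct wq_node *first, *last; };

static void wq_add_tail(struct wq_list *l, struct wq_node *n)
{
	n->next = NULL;
	if (l->last)
		l->last->next = n;
	else
		l->first = n;
	l->last = n;
}

/*
 * Splice the normal list onto the tail of the priority list, so a
 * single walk of the result runs every priority entry of this batch
 * first and then the rest -- non-prio work still makes forward
 * progress even if new completions keep arriving.
 */
static struct wq_node *wq_merge(struct wq_list *prio, struct wq_list *list)
{
	if (!prio->first)
		return list->first;
	if (list->first) {
		prio->last->next = list->first;
		prio->last = list->last;
	}
	return prio->first;
}
```

No counters are needed: ordering alone bounds how long non-prio
requests can wait behind priority ones within a batch.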
>>
>>
>>> INIT_WQ_LIST(&tctx->task_list);
>>> + INIT_WQ_LIST(&tctx->prior_task_list);
>>> + tctx->nr = tctx->prior_nr = 0;
>>> if (!node)
>>> tctx->task_running = false;
>>> spin_unlock_irq(&tctx->task_lock);
>>> @@ -2166,7 +2174,7 @@ static void tctx_task_work(struct callback_head *cb)
>>> ctx_flush_and_put(ctx, &locked);
>>> }
>>> -static void io_req_task_work_add(struct io_kiocb *req)
>>> +static void io_req_task_work_add(struct io_kiocb *req, bool emergency)
>>
>> I think "priority" instead of "emergency" will be more accurate
>>
>>> {
>>> struct task_struct *tsk = req->task;
>>> struct io_uring_task *tctx = tsk->io_uring;
>>> @@ -2178,7 +2186,13 @@ static void io_req_task_work_add(struct io_kiocb *req)
>>> WARN_ON_ONCE(!tctx);
>>> spin_lock_irqsave(&tctx->task_lock, flags);
>>> - wq_list_add_tail(&req->io_task_work.node, &tctx->task_list);
>>> + if (emergency && tctx->prior_nr * MAX_EMERGENCY_TW_RATIO < tctx->nr) {
>>> + wq_list_add_tail(&req->io_task_work.node, &tctx->prior_task_list);
>>> + tctx->prior_nr++;
>>> + } else {
>>> + wq_list_add_tail(&req->io_task_work.node, &tctx->task_list);
>>> + }
>>> + tctx->nr++;
>>> running = tctx->task_running;
>>> if (!running)
>>> tctx->task_running = true;
>>
>>
>>
>
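For readers following along, the ratio-gated enqueue from the quoted
hunk looks roughly like this as a standalone userspace sketch (names
simplified; the patch itself uses MAX_EMERGENCY_TW_RATIO,
prior_task_list and prior_nr):

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the patch's MAX_EMERGENCY_TW_RATIO. */
#define MAX_PRIO_TW_RATIO 3

struct wq_node { struct wq_node *next; };
struct wq_list { struct wq_node *first, *last; };

struct tctx {
	struct wq_list prio_list, list;
	unsigned int prio_nr, nr;
};

static void wq_add_tail(struct wq_list *l, struct wq_node *n)
{
	n->next = NULL;
	if (l->last)
		l->last->next = n;
	else
		l->first = n;
	l->last = n;
}

/*
 * Admit a request to the priority list only while priority entries
 * stay below roughly 1/MAX_PRIO_TW_RATIO of the pending batch; the
 * excess spills into the normal list, bounding non-prio starvation.
 */
static void tw_add(struct tctx *t, struct wq_node *n, bool prio)
{
	if (prio && t->prio_nr * MAX_PRIO_TW_RATIO < t->nr) {
		wq_add_tail(&t->prio_list, n);
		t->prio_nr++;
	} else {
		wq_add_tail(&t->list, n);
	}
	t->nr++;
}
```

With a ratio of 3 this matches the "1/3 priority TWs" cap discussed
above: a burst of priority completions cannot crowd the batch beyond
that fraction.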
--
Pavel Begunkov
Thread overview: 25+ messages
2021-09-27 6:17 [PATCH 0/6] task_work optimization Hao Xu
2021-09-27 6:17 ` [PATCH 1/8] io-wq: code clean for io_wq_add_work_after() Hao Xu
2021-09-28 11:08 ` Pavel Begunkov
2021-09-29 7:36 ` Hao Xu
2021-09-29 11:23 ` Pavel Begunkov
2021-09-27 6:17 ` [PATCH 2/8] io-wq: add helper to merge two wq_lists Hao Xu
2021-09-27 10:17 ` Hao Xu
2021-09-28 11:10 ` Pavel Begunkov
2021-09-28 16:48 ` Hao Xu
2021-09-29 11:23 ` Pavel Begunkov
2021-09-27 6:17 ` [PATCH 3/8] io_uring: add a limited tw list for irq completion work Hao Xu
2021-09-28 11:29 ` Pavel Begunkov
2021-09-28 16:55 ` Hao Xu
2021-09-29 11:25 ` Pavel Begunkov
2021-09-29 11:38 ` Hao Xu
2021-09-30 9:02 ` Pavel Begunkov [this message]
2021-09-30 3:21 ` Hao Xu
2021-09-27 6:17 ` [PATCH 4/8] io_uring: add helper for task work execution code Hao Xu
2021-09-27 6:17 ` [PATCH 5/8] io_uring: split io_req_complete_post() and add a helper Hao Xu
2021-09-27 6:17 ` [PATCH 6/8] io_uring: move up io_put_kbuf() and io_put_rw_kbuf() Hao Xu
2021-09-27 6:17 ` [PATCH 7/8] io_uring: add tw_ctx for io_uring_task Hao Xu
2021-09-27 6:17 ` [PATCH 8/8] io_uring: batch completion in prior_task_list Hao Xu
2021-09-27 6:21 ` [PATCH 0/6] task_work optimization Hao Xu
-- strict thread matches above, loose matches on Subject: below --
2021-09-27 10:51 [PATCH v2 0/8] " Hao Xu
2021-09-27 10:51 ` [PATCH 3/8] io_uring: add a limited tw list for irq completion work Hao Xu
2021-09-29 12:31 ` Hao Xu