Subject: Re: [PATCH for-5.16 v3 0/8] task work optimization
From: Hao Xu <haoxu@linux.alibaba.com>
To: Pavel Begunkov, Jens Axboe
Cc: io-uring@vger.kernel.org, Joseph Qi
References: <20211027140216.20008-1-haoxu@linux.alibaba.com> <7a528ce1-a44e-3ee7-095c-1a92528ec441@linux.alibaba.com>
In-Reply-To: <7a528ce1-a44e-3ee7-095c-1a92528ec441@linux.alibaba.com>
Date: Thu, 28 Oct 2021 19:46:29 +0800

On 2021/10/28 2:07 PM, Hao Xu wrote:
> On 2021/10/28 2:15 AM, Pavel Begunkov wrote:
>> On 10/27/21 15:02, Hao Xu wrote:
>>> Tested this patchset by manually replacing __io_queue_sqe() in
>>> io_queue_sqe() with io_req_task_queue() to construct 'heavy' task
>>> works. Then tested with fio:
>>
>> If submissions and completions are done by the same task it doesn't
>> really matter in which order they're executed, because the task won't
>> get back to userspace execution to see CQEs until tw returns.
> It may matter; it depends on the cost of the submission and the DMA IO
> time. Take sqpoll mode as an example, where we submit 10 reqs:
> t1          io_submit_sqes
>             -->io_req_task_queue
> t2          io_task_work_run
> We actually do the submission in t2, but if the workload is big
> enough, the 'irq completion TW' is only inserted into the TW list
> after t2 is fully done, so those 'irq completion TW' are delayed to
> the next round. With this patchset, we can handle them first.
>> Furthermore, it even might be worse because the earlier you submit
>> the better, with everything else equal.
>>
>> IIRC, that's how it is with fio, right? If so, you may get better
>> numbers with a test that does submissions and completions in
>> different threads.
> Because of the completion cache, I doubt that works. For a single ctx,
> it seems we always update the cqring pointer only after all the TWs in
> the list are done.

I suddenly realized that sqpoll mode does submissions and completions in
different threads, and in that situation this patchset always does
commit_cqring() first, right after handling the TWs in the priority
list. So this already is the right kind of test, or do I miss something?
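To make that concrete, here is a minimal userspace sketch of the
two-list idea as I understand it (tw_node/tw_list/tw_run are made-up
illustrative names, not the kernel's wq_list API, and this is not the
patch's actual code):

/*
 * Keep two singly-linked task-work lists and always drain the priority
 * list (irq completion TWs) before the normal list, so the work that
 * produces CQEs is not delayed behind heavier submission work.
 */
#include <stdio.h>
#include <stddef.h>

struct tw_node {
	const char *name;
	struct tw_node *next;
};

struct tw_list {
	struct tw_node *first;
	struct tw_node **last;	/* tail pointer for O(1) append */
};

static void tw_list_init(struct tw_list *l)
{
	l->first = NULL;
	l->last = &l->first;
}

static void tw_list_add(struct tw_list *l, struct tw_node *n)
{
	n->next = NULL;
	*l->last = n;
	l->last = &n->next;
}

/* Drain one list; in the kernel each node would carry a callback. */
static void tw_run(struct tw_list *l)
{
	for (struct tw_node *n = l->first; n; n = n->next)
		printf("run %s\n", n->name);
	tw_list_init(l);
}

int main(void)
{
	struct tw_list prio, normal;
	struct tw_node submit = { .name = "submission tw" };
	struct tw_node irq_cqe = { .name = "irq completion tw" };

	tw_list_init(&prio);
	tw_list_init(&normal);

	/* a submission tw goes to the normal list, an irq completion tw
	 * goes to the priority list */
	tw_list_add(&normal, &submit);
	tw_list_add(&prio, &irq_cqe);

	/* drain the priority list first: the completion is handled (and
	 * could be flushed to the CQ ring) before the submission work */
	tw_run(&prio);
	tw_run(&normal);
	return 0;
}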
>>
>> Also interesting to find an explanation for your numbers, assuming
>> they're stable. 7/8 batching? How often does it go down this path?
>> If only one task submits requests it should already be covered by
>> the existing batching.
> The reason may be what I said above, but I don't have a strict proof
> for it yet.
>
> The problem with the existing batching is (given there is only one
> ctx):
> 1. we flush it only after all the TWs are done
> 2. we batch them only if we hold the uring lock
> The new batching:
> 1. doesn't care about the uring lock
> 2. can flush the completions in the priority list in advance
>    (which means userland can see them earlier)
>>
>>
>>> ioengine=io_uring
>>> sqpoll=1
>>> thread=1
>>> bs=4k
>>> direct=1
>>> rw=randread
>>> time_based=1
>>> runtime=600
>>> randrepeat=0
>>> group_reporting=1
>>> filename=/dev/nvme0n1
>>>
>>> 2/8 sets an unlimited priority_task_list, 8/8 sets a limitation of
>>> 1/3 * (len_priority_list + len_normal_list), data below:
>>>     depth     no 8/8   include 8/8      before this patchset
>>>      1        7.05         7.82              7.10
>>>      2        8.47         8.48              8.60
>>>      4        10.42        9.99              10.42
>>>      8        13.78        13.13             13.22
>>>      16       27.41        27.92             24.33
>>>      32       49.40        46.16             53.08
>>>      64       102.53       105.68            103.36
>>>      128      196.98       202.76            205.61
>>>      256      372.99       375.61            414.88
>>>      512      747.23       763.95            791.30
>>>      1024     1472.59      1527.46           1538.72
>>>      2048     3153.49      3129.22           3329.01
>>>      4096     6387.86      5899.74           6682.54
>>>      8192     12150.25     12433.59          12774.14
>>>      16384    23085.58     24342.84          26044.71
>>>
>>> It seems 2/8 is better; I haven't tried choices other than 1/3, but
>>> 8/8 is still included here for people's further thoughts.
>>>
>>> Hao Xu (8):
>>>    io-wq: add helper to merge two wq_lists
>>>    io_uring: add a priority tw list for irq completion work
>>>    io_uring: add helper for task work execution code
>>>    io_uring: split io_req_complete_post() and add a helper
>>>    io_uring: move up io_put_kbuf() and io_put_rw_kbuf()
>>>    io_uring: add nr_ctx to record the number of ctx in a task
>>>    io_uring: batch completion in prior_task_list
>>>    io_uring: add limited number of TWs to priority task list
>>>
>>>   fs/io-wq.h    |  21 +++++++
>>>   fs/io_uring.c | 168 +++++++++++++++++++++++++++++++++++---------------
>>>   2 files changed, 138 insertions(+), 51 deletions(-)
>>
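As a side note on the 8/8 limit quoted above, here is a rough
userspace sketch of what a 1/3 cap could look like (the names and the
exact accounting are illustrative assumptions, not the code in 8/8):

/*
 * Cap the priority list at roughly 1/3 of all queued task works; once
 * the cap is hit, new irq completion TWs fall back to the normal list.
 * Purely illustrative accounting.
 */
#include <stdbool.h>
#include <stdio.h>

struct tw_counts {
	unsigned int prio_len;
	unsigned int normal_len;
};

/* would another priority tw still fit under the 1/3 cap? */
static bool prio_list_has_room(const struct tw_counts *c)
{
	unsigned int total = c->prio_len + c->normal_len;

	return c->prio_len * 3 <= total;
}

int main(void)
{
	struct tw_counts c = { 0, 0 };

	for (int i = 0; i < 9; i++) {
		if (prio_list_has_room(&c)) {
			c.prio_len++;
			printf("tw %d -> priority list\n", i);
		} else {
			c.normal_len++;
			printf("tw %d -> normal list\n", i);
		}
	}
	/* ends up with roughly one third of the TWs on the priority list */
	return 0;
}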