From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 1F6A5C433F5 for ; Sat, 18 Dec 2021 06:57:14 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232213AbhLRG5N (ORCPT ); Sat, 18 Dec 2021 01:57:13 -0500 Received: from out30-56.freemail.mail.aliyun.com ([115.124.30.56]:51868 "EHLO out30-56.freemail.mail.aliyun.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232199AbhLRG5N (ORCPT ); Sat, 18 Dec 2021 01:57:13 -0500 X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R131e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=e01e04357;MF=haoxu@linux.alibaba.com;NM=1;PH=DS;RN=4;SR=0;TI=SMTPD_---0V-yXjeg_1639810629; Received: from 192.168.31.207(mailfrom:haoxu@linux.alibaba.com fp:SMTPD_---0V-yXjeg_1639810629) by smtp.aliyun-inc.com(127.0.0.1); Sat, 18 Dec 2021 14:57:10 +0800 Subject: Re: [POC RFC 0/3] support graph like dependent sqes To: Pavel Begunkov , Jens Axboe Cc: io-uring@vger.kernel.org, Joseph Qi References: <20211214055734.61702-1-haoxu@linux.alibaba.com> <4ef630f4-54d8-e8c6-8622-dccef5323864@gmail.com> <7607c0f9-cad3-cfc5-687e-07dc82684b4e@linux.alibaba.com> <06e21b01-a168-e25f-1b42-97789392bd89@gmail.com> From: Hao Xu Message-ID: <96155b9c-9f35-53b8-456a-8623fc850b03@linux.alibaba.com> Date: Sat, 18 Dec 2021 14:57:08 +0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.14.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 8bit Content-Language: en-US Precedence: bulk List-ID: X-Mailing-List: io-uring@vger.kernel.org 在 2021/12/18 上午3:33, Pavel Begunkov 写道: > On 12/16/21 16:55, Hao Xu wrote: >> 在 2021/12/15 上午2:16, Pavel Begunkov 写道: >>> On 12/14/21 16:53, Hao Xu wrote: >>>> 在 2021/12/14 下午11:21, Pavel Begunkov 写道: >>>>> On 12/14/21 05:57, Hao Xu wrote: >>>>>> This is just a proof of concept which is incompleted, send it >>>>>> early for >>>>>> thoughts and suggestions. >>>>>> >>>>>> We already have IOSQE_IO_LINK to describe linear dependency >>>>>> relationship sqes. While this patchset provides a new feature to >>>>>> support DAG dependency. For instance, 4 sqes have a relationship >>>>>> as below: >>>>>>        --> 2 -- >>>>>>       /        \ >>>>>> 1 ---          ---> 4 >>>>>>       \        / >>>>>>        --> 3 -- >>>>>> IOSQE_IO_LINK serializes them to 1-->2-->3-->4, which unneccessarily >>>>>> serializes 2 and 3. But a DAG can fully describe it. >>>>>> >>>>>> For the detail usage, see the following patches' messages. >>>>>> >>>>>> Tested it with 100 direct read sqes, each one reads a BS=4k block >>>>>> data >>>>>> in a same file, blocks are not overlapped. These sqes form a graph: >>>>>>        2 >>>>>>        3 >>>>>> 1 --> 4 --> 100 >>>>>>       ... >>>>>>        99 >>>>>> >>>>>> This is an extreme case, just to show the idea. >>>>>> >>>>>> results below: >>>>>> io_link: >>>>>> IOPS: 15898251 >>>>>> graph_link: >>>>>> IOPS: 29325513 >>>>>> io_link: >>>>>> IOPS: 16420361 >>>>>> graph_link: >>>>>> IOPS: 29585798 >>>>>> io_link: >>>>>> IOPS: 18148820 >>>>>> graph_link: >>>>>> IOPS: 27932960 >>>>> >>>>> Hmm, what do we compare here? IIUC, >>>>> "io_link" is a huge link of 100 requests. Around 15898251 IOPS >>>>> "graph_link" is a graph of diameter 3. Around 29585798 IOPS >>> >>> Diam 2 graph, my bad >>> >>> >>>>> Is that right? If so it'd more more fair to compare with a >>>>> similar graph-like scheduling on the userspace side. >>>> >>>> The above test is more like to show the disadvantage of LINK >>> >>> Oh yeah, links can be slow, especially when it kills potential >>> parallelism or need extra allocations for keeping state, like >>> READV and WRITEV. >>> >>> >>>> But yes, it's better to test the similar userspace  scheduling since >>>> >>>> LINK is definitely not a good choice so have to prove the graph stuff >>>> >>>> beat the userspace scheduling. Will test that soon. Thanks. >>> >>> Would be also great if you can also post the benchmark once >>> it's done >> >> Wrote a new test to test nop sqes forming a full binary tree with >> (2^10)-1 nodes, >> which I think it a more general case.  Turns out the result is still >> not stable and >> the kernel side graph link is much slow. I'll try to optimize it. > > That's expected unfortunately. And without reacting on results > of previous requests, it's hard to imagine to be useful. BPF may > have helped, e.g. not keeping an explicit graph but just generating > new requests from the kernel... But apparently even with this it's > hard to compete with just leaving it in userspace. > Tried to exclude the memory allocation stuff, seems it's a bit better than the user graph. For the result delivery, I was thinking of attaching BPF program within a sqe, not creating a single BPF type sqe. Then we can have data flow in the graph or linkchain. But I haven't had a clear draft for it >> Btw, is there any comparison data between the current io link feature >> and the >> userspace scheduling. > > Don't remember. I'd try to look up the cover-letter for the patches > implementing it, I believe there should've been some numbers and > hopefully test description. > > fwiw, before io_uring mailing list got established patches/etc. > were mostly going through linux-block mailing list. Links are old, so > patches might be there. >