Subject: Re: [PATCH v3 2/3] io_uring: reduce frequent add_wait_queue() overhead for multi-shot poll request
To: Pavel Begunkov, io-uring@vger.kernel.org
Cc: axboe@kernel.dk
References: <20211025053849.3139-1-xiaoguang.wang@linux.alibaba.com> <20211025053849.3139-3-xiaoguang.wang@linux.alibaba.com> <7dd1823d-0324-36d1-2562-362f2ef0399b@gmail.com> <7e6c2a36-adf5-ece5-9109-cd5c4429e79d@linux.alibaba.com> <68dced19-6f11-4db8-321a-856217505539@gmail.com>
From: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Message-ID: <4d66262e-7356-2609-7698-c37322a7c072@linux.alibaba.com>
Date: Fri, 29 Oct 2021 22:34:12 +0800
In-Reply-To: <68dced19-6f11-4db8-321a-856217505539@gmail.com>

> On 10/29/21 14:37, Xiaoguang Wang wrote:
>> hi,
>>
>>> On 10/29/21 03:57, Xiaoguang Wang wrote:
>>>> hi,
>>>>
>>>>> On 10/25/21 06:38, Xiaoguang Wang wrote:
>>>>>> Run echo_server to evaluate io_uring's multi-shot poll performance;
>>>>>> perf shows that add_wait_queue() has obvious overhead. Introduce a
>>>>>> new state 'active' in io_poll_iocb to indicate whether io_poll_wake()
>>>>>> should queue a task_work. This new state is set to true initially,
>>>>>> set to false when starting to queue a task work, and set to true
>>>>>> again once a poll cqe has been committed. One concern is that this
>>>>>> method may lose wake-up events, but it seems to be ok.
>>>>>>
>>>>>>     io_poll_wake                io_poll_task_func
>>>>>> t1                       |
>>>>>> t2                       | WRITE_ONCE(req->poll.active, true);
>>>>>> t3                       |
>>>>>> t4                       |    io_commit_cqring(ctx);
>>>>>> t5                       |
>>>>>> t6                       |
>>>>>>
>>>>>> If a wake-up event happens before or at t4, it's ok: the user app
>>>>>> will always see a cqe. If a wake-up event happens after t4, then
>>>>>> IIUC io_poll_wake() will see the new req->poll.active value by
>>>>>> using READ_ONCE().
>>>>>>
>>>>>> The echo_server code can be cloned from:
>>>>>> https://codeup.openanolis.cn/codeup/storage/io_uring-echo-server.git,
>>>>>> branch xiaoguangwang/io_uring_multishot.
>>>>>>
>>>>>> Without this patch, the tps in our test environment is 284116; with
>>>>>> this patch, the tps is 287832, about a 1.3% improvement, which is
>>>>>> indeed in accord with the saved add_wait_queue() cost.
>>>>>>
>>>>>> Signed-off-by: Xiaoguang Wang
>>>>>> ---
>>>>>>  fs/io_uring.c | 57 +++++++++++++++++++++++++++++++++------------------------
>>>>>>  1 file changed, 33 insertions(+), 24 deletions(-)
>>>>>>
>>>>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>>>>> index 18af9bb9a4bc..e4c779dac953 100644
>>>>>> --- a/fs/io_uring.c
>>>>>> +++ b/fs/io_uring.c
>>>>>> @@ -481,6 +481,7 @@ struct io_poll_iocb {
>>>>>>  	__poll_t			events;
>>>>>>  	bool				done;
>>>>>>  	bool				canceled;
>>>>>> +	bool				active;
>>>>>>  	struct wait_queue_entry		wait;
>>>>>>  };
>>>>>>
>>>>>> @@ -5233,8 +5234,6 @@ static inline int __io_async_wake(struct io_kiocb *req, struct io_poll_iocb *pol
>>>>>>  {
>>>>>>  	trace_io_uring_task_add(req->ctx, req->opcode, req->user_data, mask);
>>>>>>
>>>>>> -	list_del_init(&poll->wait.entry);
>>>>>> -
>>>>>
>>>>> As I mentioned to Hao some time ago, we can't allow this function, or
>>>>> in particular io_req_task_work_add(), to happen twice before the first
>>>>> task work got executed; they use the same field in io_kiocb, and that
>>>>> will corrupt the tw list.
>>>>>
>>>>> Looks like that's what can happen here.
>>>> If I have understood the scenario you described correctly, I think it
>>>> won't happen :)
>>>> With this patch, once the first io_req_task_work_add() is issued,
>>>> poll.active will be set to false, so no new io_req_task_work_add() will
>>>> be issued. Only after the first task_work installed by the first
>>>> io_req_task_work_add() has completed will poll.active be set to true
>>>> again.
>>>
>>> Ah, I see now, the active dance is in io_poll_wake(). That won't work
>>> with READ_ONCE/WRITE_ONCE though; you would need real atomics.
>>>
>>> The easiest race to explain is:
>>>
>>> CPU1                        | CPU2
>>> io_poll_wake                | io_poll_wake
>>> if (p->active) return 0;    | if (p->active) return 0;
>> It's "if (!p->active) return 0;" in my patch :)
>
> hah, yeah, email draft-coding
>
>>> // p->active is false in both cases, continue
>>> p->active = false;          | p->active = false;
>>> task_work_add()             | task_work_add()
>> io_poll_wake() is called with poll->head->lock held, so there will be no
>> concurrent io_poll_wake() calls.
>
> The problem is that the poll implementation is too wobbly, can't count
> how many corner cases there are...
Absolutely agree with you. In the process of developing the fixed poll
patch, I have realized that the poll implementation is not easy...

> We can get to the point that your
> patches are "proven" to work, but I'll be more on the cautious side as
> the current state is hell over complicated.
OK, I understand your concerns. I'll check the double poll code again; I'm
not quite familiar with it yet. A simplified sketch of the intended flow,
and of an atomic variant of it, is in the P.S. below.

Regards,
Xiaoguang Wang
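
P.S. To make the intended flow easier to follow, here is a simplified sketch
of how the 'active' flag is meant to gate task_work queueing. It is
illustrative only: it abridges the real signatures, omits the usual mask/key
filtering, and does not reproduce the exact patch hunks.

    static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode,
                            int sync, void *key)
    {
            struct io_kiocb *req = wait->private;
            struct io_poll_iocb *poll = container_of(wait, struct io_poll_iocb, wait);

            if (!READ_ONCE(poll->active))
                    return 0;       /* a task_work is already queued for this req */

            WRITE_ONCE(poll->active, false);
            return __io_async_wake(req, poll, key_to_poll(key), io_poll_task_func);
    }

    /* in io_poll_task_func(), once the pending event has been handled: */
            WRITE_ONCE(req->poll.active, true);     /* t2: re-arm wake-ups */
            io_commit_cqring(ctx);                  /* t4: publish the cqe */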
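
If it turns out that poll->head->lock does not serialize every io_poll_wake()
caller (double poll, where a request sits on two wait queues, is exactly the
case I still need to check), then the plain READ_ONCE/WRITE_ONCE pair could
indeed race as you describe, and the flag would need a real atomic
test-and-clear. Roughly, and again only as a sketch rather than a patch:

    struct io_poll_iocb {
            /* ... existing fields ... */
            atomic_t                active; /* 1: a wake-up may queue a task_work */
    };

    static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode,
                            int sync, void *key)
    {
            struct io_kiocb *req = wait->private;
            struct io_poll_iocb *poll = container_of(wait, struct io_poll_iocb, wait);

            /* only one concurrent waker can win the 1 -> 0 transition */
            if (!atomic_xchg(&poll->active, 0))
                    return 0;
            return __io_async_wake(req, poll, key_to_poll(key), io_poll_task_func);
    }

    /* and in io_poll_task_func(), before committing the cqe: */
            atomic_set(&req->poll.active, 1);
            io_commit_cqring(ctx);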