Message-ID: <659462b3-4fc7-12ed-f760-da0f222539a0@linux.dev>
Date: Tue, 14 Jun 2022 17:52:41 +0800
Subject: Re: [RFC] support memory recycle for ring-mapped provided buffer
To: Dylan Yudaken, "io-uring@vger.kernel.org"
Cc: "axboe@kernel.dk", "asml.silence@gmail.com"
References: <6641baea-ba35-fb31-b2e7-901d72e9d9a0@linux.dev> <4980fd4d-b1f3-7b1c-8bfc-6be4d31f9da0@linux.dev>
From: Hao Xu

Hi Dylan,

On 6/14/22 16:38, Dylan Yudaken wrote:
> On Tue, 2022-06-14 at 14:26 +0800, Hao Xu wrote:
>> On 6/12/22 15:30, Hao Xu wrote:
>>> On 6/10/22 13:55, Hao Xu wrote:
>>>> Hi all,
>>>>
>>>> I've actually done most of the code for this, but I think it's
>>>> necessary to first ask the community for comments on the design.
>>>> What I do is, when consuming a buffer, don't increment the head,
>>>> but check the length in real use. Then update the buffer info like
>>>> buff->addr += len, buff->len -= len;
>>>> (of course if a req consumes the whole buffer, just increment head)
>>>> and since we have now changed the addr of the buffer, a simple
>>>> buffer id is useless for userspace to get the data. We have to
>>>> deliver the original addr back to userspace through cqe->extra1,
>>>> which means this feature needs CQE32 to be on.
>>>> This way a provided buffer may be split into many pieces, and
>>>> userspace should track each piece; when all the pieces are spare
>>>> again, they can re-provide the buffer. (They can surely re-provide
>>>> each piece separately, but that causes more and more memory
>>>> fragments; anyway, it's the users' choice.)
>>>>
>>>> What do you think of this? Actually I'm not a fan of big cqe, it's
>>>> not perfect to have the limitation of having CQE32 on, but there
>>>> seems to be no other option?
>>
>> Another way is two rings, just like sqring and cqring. Users provide
>> buffers to the sqring, the kernel fetches them and, when data is
>> there, puts them on the cqring for users to read. The downside is
>> that we need to copy the buffer metadata, and there is a limit on how
>> many times we can split the buffer, since the cqring has a length.
>>
>>>>
>>>> Thanks,
>>>> Hao
>>>
>>> To implement this, CQE32 has to be introduced almost everywhere.
>>> For example for io_issue_sqe:
>>>
>>> def->issue();
>>> if (unlikely(CQE32))
>>>      __io_req_complete32();
>>> else
>>>      __io_req_complete();
>>>
>>> which will certainly have some overhead for the main path. Any
>>> comments?

For this downside, I think there is a way to limit it to only the
read/recv path.

>>>
>>> Regards,
>>> Hao
>>>
>>
>
> I find the idea interesting, but is it definitely worth doing?
>
> Other downsides I see with this approach:
> * userspace would have to keep track of when a buffer is finished. This
>   might get complicated.

This one is fine I think, since users can choose not to enable this
feature, and if they use it, they can choose not to track the buffer
but to re-provide each piece immediately.
(When a user registers the pbuf ring, they can pass a flag to enable
this feature.)

> * there is a problem of tiny writes - would we want to support a
>   minimum buffer size?

Sorry, I'm not following here. Why do we need to have a min buffer size?

>
> I think in general it can be achieved using the existing buffer ring
> and leave the management to userspace. For example if a user prepares a
> ring with N large buffers, on each completion the user is free to
> requeue that buffer without the recently completed chunk.

[1]
I see, I was not aware of this...

>
> The downsides here I see are:
> * there is a delay to requeuing the buffer. This might cause more
>   ENOBUFS. Practically I 'feel' this will not be a big problem in
>   practice
> * there is an additional atomic increment on the ring
>
> Do you feel the wins are worth the extra complexity?

Personally speaking, the only downside of my first approach is the
overhead of cqe32 on the iopoll completion path and the
read/recv/recvmsg path. But it looks like [1] is fine... TBH I'm not
sure which one is better.

Thanks,
Hao
>
>
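
For reference, the partial-consume bookkeeping from the original RFC
(buff->addr += len, buff->len -= len, and only advance head once a buffer
is used up) can be sketched in plain C as below. This is a userspace
model under assumptions, not the kernel patch: struct pbuf, struct
pbuf_ring and pbuf_consume() are hypothetical names made up for
illustration, and the address written to *data_addr stands in for the
value the RFC proposes to report through the extra field of a 32-byte
CQE.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one entry of a provided-buffer ring. */
struct pbuf {
    uint64_t addr; /* start of the still-unused part of the buffer */
    uint32_t len;  /* bytes still unused */
    uint16_t bid;  /* buffer id chosen by userspace */
};

/* Hypothetical ring holding only the fields this sketch needs. */
struct pbuf_ring {
    struct pbuf *bufs;
    uint16_t head; /* next entry to consume from */
    uint16_t tail; /* one past the last provided entry */
    uint16_t mask; /* ring size - 1; size is a power of two */
};

/*
 * Take up to `want` bytes from the entry at head.
 *
 * On success, *data_addr is set to where the data starts (the address
 * userspace needs, since the buffer id alone no longer identifies the
 * chunk) and the number of bytes taken is returned. Returns 0 when the
 * ring is empty, where a real receive path would fail with ENOBUFS.
 */
static uint32_t pbuf_consume(struct pbuf_ring *r, uint32_t want,
                             uint64_t *data_addr)
{
    if (r->head == r->tail)
        return 0;

    struct pbuf *b = &r->bufs[r->head & r->mask];
    uint32_t used = want < b->len ? want : b->len;

    *data_addr = b->addr;

    /* Shrink the entry in place instead of retiring it. */
    b->addr += used;
    b->len -= used;

    /* Only a fully consumed buffer moves the head forward. */
    if (b->len == 0)
        r->head++;

    return used;
}
```

With two provided buffers of 100 and 50 bytes, a 30-byte request leaves
head untouched and the first entry shrunk to 70 bytes; a following
oversized request drains those 70 bytes and only then advances head.
The alternative marked [1] moves the same arithmetic to userspace: on
each completion the application re-provides the remainder of the buffer
itself.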