From: Pavel Begunkov
To: Bernd Schubert, Miklos Szeredi, Ming Lei
Cc: Jens Axboe, io-uring@vger.kernel.org, Joanne Koong, Josef Bacik
Subject: Re: Large CQE for fuse headers
Date: Mon, 14 Oct 2024 18:48:13 +0100
Message-ID: <74b0e140-f79d-4a89-a83a-77334f739c92@gmail.com>
In-Reply-To: <24ee0d07-47cc-4dcb-bdca-2123f38d7219@fastmail.fm>
References: <2fe2a3d3-4720-4d33-871e-5408ba44a543@fastmail.fm>
 <24ee0d07-47cc-4dcb-bdca-2123f38d7219@fastmail.fm>

On 10/14/24 16:21, Bernd Schubert wrote:
> On 10/14/24 15:34, Pavel Begunkov wrote:
>> On 10/14/24 13:47, Bernd Schubert wrote:
>>> On 10/14/24 13:10, Miklos Szeredi wrote:
>>>> On Mon, 14 Oct 2024 at 04:44, Ming Lei wrote:
>>>>
>>>>> It also depends on how the fuse user code consumes the big CQE
>>>>> payload: if the fuse header needs to be kept in memory a bit
>>>>> longer, you may have to copy it somewhere for post-processing,
>>>>> since io_uring (kernel) needs the CQE returned asap.
>>>>
>>>> Yes.
>>>>
>>>> I'm not quite sure how the libfuse interface will work to
>>>> accommodate this. Currently, if the server needs to delay the
>>>> processing of a request, it has to copy all arguments, since their
>>>> validity is not guaranteed after the callback returns. With the
>>>> io_uring infrastructure the headers would need to be copied, but
>>>> the data buffer would be per-request and would not need copying.
>>>> This relaxes a requirement, so existing servers would continue to
>>>> work fine, but they would not be able to take full advantage of
>>>> the multi-buffer design.
>>>>
>>>> Bernd, do you have an idea how this would work?
>>>
>>> I assume returning a CQE is io_uring_cq_advance()?
>>
>> Yes
>>
>>> In my current libfuse io_uring branch that only happens when all
>>> CQEs have been processed. We could also easily switch to
>>> io_uring_cqe_seen() to do it per CQE.
>>
>> Either one works.
>>
>>> I don't understand why we need to return CQEs asap. Assuming the
>>> CQ ring size is the same as the SQ ring size - why does it matter?
>>
>> The SQE is consumed once the request is issued, but nothing prevents
>> the user from keeping the QD larger than the SQ size, e.g. doing M
>> syscalls, each sending N requests, and then waiting for N * M
>> completions.

In other words, it's perfectly legal to:

ring = create_ring(nr_cqes=N);
for (i = 0 .. M) {
	for (j = 0 .. N)
		prep_sqe();
	submit_all_sqes();
}
wait(nr=N * M);
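Spelled out with liburing, that might look roughly like the below (a
sketch: nop requests purely for illustration, one-at-a-time reaping to
keep it short, error handling omitted):

#include <liburing.h>

#define N 8	/* requests queued per submit syscall */
#define M 4	/* number of submit syscalls */

int main(void)
{
	struct io_uring ring;

	/* SQ sized for a single batch of N; the CQ defaults to 2 * N,
	 * yet we keep N * M requests in flight */
	io_uring_queue_init(N, &ring, 0);

	for (int i = 0; i < M; i++) {
		for (int j = 0; j < N; j++)
			io_uring_prep_nop(io_uring_get_sqe(&ring));
		io_uring_submit(&ring);
	}

	/* reap all N * M completions one CQE at a time; completions
	 * that overflowed the CQ are buffered by the kernel (5.5+)
	 * and flushed back as we free up space */
	for (int i = 0; i < N * M; i++) {
		struct io_uring_cqe *cqe;

		io_uring_wait_cqe(&ring, &cqe);
		io_uring_cqe_seen(&ring, cqe);
	}

	io_uring_queue_exit(&ring);
	return 0;
}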
With the caveat that a single wait can't complete more than the CQ
size, but you can add a loop atop the wait:

while (nr_inflight_cqes) {
	wait(nr = min(CQ_size, nr_inflight_cqes));
	process_cqes();
}

Or do something more elaborate; frameworks often allow pushing any
number of requests without caring too much about exactly matching the
queue sizes, apart from sizing them for performance reasons.

> I need a bit of help to understand this. Do you mean that in typical
> io-uring usage SQEs get submitted and already released in the kernel,

Typical or not, the number of requests in flight is not limited by the
size of the SQ; the SQ only limits how many requests you can queue per
syscall, i.e. per io_uring_submit().

> and then users submit even more SQEs? And that creates a kernel
> queue depth for completion?
> I guess as long as libfuse does not expose the ring we don't have
> that issue. But then yeah, exposing the ring to the
> fuse-server/daemon is planned...

Could be. For example, you don't need to care about overflows at all
if the CQ size is always larger than the number of requests in flight.
Perhaps the simplest example:

prep_requests(nr=N);
wait_cq(nr=N);
process_cqes(nr=N);

>>> If we indeed need to return the CQE before processing the request,
>>> it would indeed be better to have a 2nd memory buffer associated
>>> with the fuse request.
>>
>> With that said, the usual problem is to size the CQ so that it
>> (almost) never overflows, otherwise it hurts performance. With
>> DEFER_TASKRUN you can delay returning CQEs to the kernel until the
>> next time you wait for completions, i.e. do an io_uring waiting
>> syscall. Without the flag, CQEs may come asynchronously to the
>> user, so it needs a bit more consideration.
>
> Current libfuse code has IORING_SETUP_SINGLE_ISSUER,
> IORING_SETUP_DEFER_TASKRUN, IORING_SETUP_TASKRUN_FLAG and
> IORING_SETUP_COOP_TASKRUN disabled, as these somehow slow things
> down.

Those flags are not a requirement; you can try to size the CQ so that
overflows are rare. It's just a bit easier to do with DEFER_TASKRUN.

> Not sure if this thread is the optimal place to discuss this. I
> would also first like to sort out all the other design topics before
> going into fine-tuning...

-- 
Pavel Begunkov
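For reference, sizing the CQ explicitly is a one-liner at setup time.
A minimal sketch, not from the thread: it assumes a 6.1+ kernel for
DEFER_TASKRUN and a made-up max_inflight bound on concurrent fuse
requests:

#include <liburing.h>

/* Hypothetical helper: size the CQ so it covers the expected number
 * of in-flight fuse requests with headroom, and use DEFER_TASKRUN
 * (which requires SINGLE_ISSUER) so CQEs are only posted while the
 * server sits in its waiting syscall. */
static int setup_ring(struct io_uring *ring, unsigned max_inflight)
{
	struct io_uring_params p = {
		.flags = IORING_SETUP_SINGLE_ISSUER |
			 IORING_SETUP_DEFER_TASKRUN |
			 IORING_SETUP_CQSIZE,
		/* 2x headroom so overflows stay rare */
		.cq_entries = max_inflight * 2,
	};

	return io_uring_queue_init_params(max_inflight, ring, &p);
}

With a setup like this, the simple prep/wait/process loop above never
sees an overflow as long as the server honors max_inflight.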