Message-ID: <1acb251a-4c4a-479c-a51e-a8db9a6e0fa3@kernel.dk>
Date: Thu, 11 Dec 2025 03:33:44 -0700
Subject: Re: [RFC PATCH 2/2] io_uring: fix io may accumulation in poll mode
From: Jens Axboe
To: Fengnan, asml.silence@gmail.com, io-uring@vger.kernel.org
Cc: Fengnan Chang, Diangang Li
References: <20251210085501.84261-1-changfengnan@bytedance.com>
 <20251210085501.84261-3-changfengnan@bytedance.com>
 <69f81ed8-2b4a-461f-90b8-0b9752140f8d@kernel.dk>
 <0661763c-4f56-4895-afd2-7346bb2452e4@gmail.com>
 <0654d130-665a-4b1a-b99b-bb80ca06353a@kernel.dk>
In-Reply-To: <0654d130-665a-4b1a-b99b-bb80ca06353a@kernel.dk>

On 12/11/25 3:22 AM, Jens Axboe wrote:
> On 12/11/25 12:38 AM, Fengnan wrote:
>>
>>
>> On 2025/12/11 12:10, Jens Axboe wrote:
>>> On 12/10/25 7:15 PM, Jens Axboe wrote:
>>>> On 12/10/25 1:55 AM, Fengnan Chang wrote:
>>>>> In io_do_iopoll(), when the poll loop over iopoll_list ends, the
>>>>> current req is assumed to be the request that actually completed.
>>>>> This may be reasonable for a multi-queue ctx, but it is problematic
>>>>> for a single-queue ctx because the current request may not yet be
>>>>> done when polling picks up a result. In that case a completed io
>>>>> has to wait for the first io on the list to finish before the user
>>>>> is notified, which can cause io to accumulate in the list.
>>>>> Our modification plan is as follows: change io_wq_work_list to a
>>>>> normal (doubly linked) list so that a request can be removed from
>>>>> iopoll_list and moved to the comp_reqs list as soon as it
>>>>> completes. This way each io is handled independently and all of
>>>>> them are processed in time.
>>>>>
>>>>> After modification, test with:
>>>>>
>>>>> ./t/io_uring -p1 -d128 -b4096 -s32 -c32 -F1 -B1 -R1 -X1 -n1 -P1
>>>>> /dev/nvme6n1
>>>>>
>>>>> Base IOPS is 725K, patch IOPS is 782K.
>>>>>
>>>>> ./t/io_uring -p1 -d128 -b4096 -s32 -c1 -F1 -B1 -R1 -X1 -n1 -P1
>>>>> /dev/nvme6n1
>>>>>
>>>>> Base IOPS is 880K, patch IOPS is 895K.
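As a rough illustration of the plan quoted above, the following is a
minimal userspace sketch. The list_head definition is a stand-in for the
kernel's, and only the iopoll_list/comp_reqs names are taken from the
description; the real io_uring structures and locking are not shown.

    #include <stdio.h>

    /* Minimal stand-in for the kernel's doubly linked struct list_head. */
    struct list_head { struct list_head *next, *prev; };

    static void list_init(struct list_head *h) { h->next = h->prev = h; }

    static void list_add_tail(struct list_head *n, struct list_head *h)
    {
            n->prev = h->prev;
            n->next = h;
            h->prev->next = n;
            h->prev = n;
    }

    static void list_del(struct list_head *n)
    {
            n->prev->next = n->next;
            n->next->prev = n->prev;
    }

    /* Toy request with only the fields needed for the illustration. */
    struct req {
            int id;
            int completed;
            struct list_head node;  /* linked on iopoll_list or comp_reqs */
    };

    /*
     * The core of the plan: with a doubly linked node, a request that
     * completes can be unlinked from iopoll_list and appended to comp_reqs
     * right away, no matter where it sits, so a late submission that
     * finishes early is not stuck behind a still-pending earlier request.
     */
    static void move_to_completed(struct req *r, struct list_head *comp_reqs)
    {
            r->completed = 1;
            list_del(&r->node);
            list_add_tail(&r->node, comp_reqs);
    }

    int main(void)
    {
            struct list_head iopoll_list, comp_reqs;
            struct req reqs[3];
            int i;

            list_init(&iopoll_list);
            list_init(&comp_reqs);
            for (i = 0; i < 3; i++) {
                    reqs[i].id = i;
                    reqs[i].completed = 0;
                    list_add_tail(&reqs[i].node, &iopoll_list);
            }
            /* The last submitted request completes first and is reaped first. */
            move_to_completed(&reqs[2], &comp_reqs);
            printf("req %d completed while %d and %d are still pending\n",
                   reqs[2].id, reqs[0].id, reqs[1].id);
            return 0;
    }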
>>>> A few notes on this:
>>>>
>>>> 1) Manipulating the list in io_complete_rw_iopoll() I don't think is
>>>> necessarily safe. Yes, generally this is invoked from the
>>>> owning/polling task, but that's not guaranteed.
>>>>
>>>> 2) The patch doesn't apply to the current tree, must be an older
>>>> version?
>>>>
>>>> 3) When hand-applied, it still throws a compile warning about an
>>>> unused variable. Please don't send untested stuff...
>>>>
>>>> 4) Don't just blatantly bloat the io_kiocb. When you change from a
>>>> singly to a doubly linked list, you're growing the io_kiocb size. You
>>>> should be able to use a union with struct io_task_work, for example.
>>>> That's already 16b in size - win/win, as you don't need to slow down
>>>> the cache management (it can keep using the linkage it currently
>>>> uses), and you're not bloating the io_kiocb.
>>>>
>>>> 5) The already mentioned point about the cache free list now being
>>>> doubly linked. This is generally a _bad_ idea, as removing and adding
>>>> entries now needs to touch other entries too. That's not very cache
>>>> friendly.
>>>>
>>>> #1 is kind of the big one, as it means you'll need to re-think how
>>>> you do this. I do agree that the current approach isn't necessarily
>>>> ideal, as we don't process completions as quickly as we could, so I
>>>> think there's merit in continuing this work.
>>>
>>> Proof of concept below, entirely untested, at a conference. It
>>> basically does what I describe above, and retains the list
>>> manipulation logic on the iopoll side rather than on the completion
>>> side. Can you take a look at this? And if it runs, can you do some
>>> numbers on that too?
>>
>> This patch works, and in my test case the performance is identical to
>> my patch.
>
> Good!
>
>> But there is a small problem: to find completed requests we now always
>> have to traverse the whole iopoll_list. This can be a bit inefficient
>> in some cases. For example, if the earlier submissions are 128K ios
>> and the last io is 4K, the last io will complete much earlier. This
>> kind of scenario can't be verified with the current test, and I'm not
>> sure whether it's a meaningful one.
>
> Not sure that's a big problem, you're just spinning to complete anyway.
> You could add your iob->nr_reqs or something, and break after finding
> those known to have completed. That won't necessarily change anything,
> it could still be the last one that completed. It would probably make
> more sense to pass in 'min_events' or similar and stop after that. But
> those are mostly tweaks that can be made after the fact. If you want to
> send out a new patch based on the one I sent, feel free to.

Eg, something like this on top would do that. Like I mentioned earlier,
you cannot do the list manipulation where you did it, it's not safe. You
have to defer it to reaping time. If we could do it from the callback
where we mark it complete, then that would surely make things more
trivial and avoid iteration when not needed.
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index cae9e857aea4..93709ee6c3ee 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -917,6 +917,7 @@ static inline bool blk_mq_add_to_batch(struct request *req,
 	else if (iob->complete != complete)
 		return false;
 	iob->need_ts |= blk_mq_need_time_stamp(req);
+	iob->nr_reqs++;
 	rq_list_add_tail(&iob->req_list, req);
 	return true;
 }
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 72e34acd439c..9335b552e040 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1821,6 +1821,7 @@ void bdev_fput(struct file *bdev_file);
 struct io_comp_batch {
 	struct rq_list req_list;
 	bool need_ts;
+	int nr_reqs;
 	void (*complete)(struct io_comp_batch *);
 };
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 307f1f39d9f3..37b5b2328f24 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -1358,6 +1358,8 @@ int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin)
 		iob.complete(&iob);
 
 	list_for_each_entry_safe(req, tmp, &ctx->iopoll_list, iopoll_node) {
+		if (nr_events == iob.nr_reqs)
+			break;
 		/* order with io_complete_rw_iopoll(), e.g. ->result updates */
 		if (!smp_load_acquire(&req->iopoll_completed))
 			continue;

-- 
Jens Axboe
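As a side note on point 4 above, here is a minimal, compile-only sketch
of the suggested union. The type definitions are stand-ins sized like the
64-bit kernel types, not the real io_kiocb/io_task_work declarations.

    #include <stdio.h>

    /* Stand-ins with the same shape as the kernel types on 64-bit builds:
     * struct list_head is two pointers, and struct io_task_work is a
     * singly linked llist node plus a callback pointer, both 16 bytes.
     * Purely illustrative definitions. */
    struct list_head { struct list_head *next, *prev; };
    struct llist_node { struct llist_node *next; };
    struct io_task_work { struct llist_node node; void (*func)(void *); };

    /*
     * The idea from note 4: a request either sits on the iopoll list
     * waiting to be reaped or is queued as task_work for completion,
     * never both, so the doubly linked iopoll node can share storage
     * with io_task_work instead of growing the request structure.
     */
    struct io_kiocb_sketch {
            union {
                    struct list_head        iopoll_node;
                    struct io_task_work     io_task_work;
            };
    };

    int main(void)
    {
            printf("shared linkage size: %zu bytes\n",
                   sizeof(struct io_kiocb_sketch));
            return 0;
    }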