Message-ID: <5c209f2e-34c5-4afa-9ecf-842f33d6baf0@davidwei.uk>
Date: Mon, 19 Aug 2024 16:03:48 -0700
Subject: Re: [PATCH v2] io_uring: add IORING_ENTER_NO_IOWAIT to not set in_iowait
From: David Wei
To: Jeff Moyer
Cc: io-uring@vger.kernel.org, Jens Axboe, Pavel Begunkov
References: <20240816223640.1140763-1-dw@davidwei.uk>

On 2024-08-16 18:23, Jeff Moyer wrote:
> Hi, David,
>
> David Wei writes:
>
>> io_uring sets current->in_iowait when waiting for completions, which
>> achieves two things:
>>
>> 1. Proper accounting of the time as iowait time
>> 2. Enable cpufreq optimisations, setting SCHED_CPUFREQ_IOWAIT on the rq
>>
>> For block IO this makes sense as high iowait can be indicative of
>> issues.
>
> It also lets you know that the system isn't truly idle. IOW, it would
> be doing some work if it didn't have to wait for I/O. This was the
> reason the metric was added (admins being confused about why their
> system was showing up idle).

I see. Thanks for the historical context.

>> But for network IO, especially recv, the recv side does not control
>> when the completions happen.
>>
>> Some user tooling attributes iowait time as CPU utilisation i.e. not
>
> What user tooling are you talking about? If it shows iowait as busy
> time, the tooling is broken. Please see my last mail on the subject:
> https://lore.kernel.org/io-uring/x49cz0hxdfa.fsf@segfault.boston.devel.redhat.com/

Our internal tooling, for example, considers CPU util% to be
(100 - idle%), but it also has a CPU busy% defined as
(100 - idle% - iowait%). It is very unfortunate that everyone uses CPU
util% for monitoring, with all sorts of alerts, dashboards and load
balancers referring to this value. One reason is that, depending on
context, high iowait time may or may not be a problem, so it isn't as
simple as redefining CPU util% to exclude iowait.
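To make the difference concrete, here is a rough sketch of how a
monitor derives both numbers from the aggregate cpu line in /proc/stat
(illustrative only, not our actual tooling; a real monitor samples the
counters twice and works on the deltas):

#include <stdio.h>

int main(void)
{
	unsigned long long user, nice, system, idle, iowait, irq,
			   softirq, steal;
	FILE *f = fopen("/proc/stat", "r");

	/* Field order per proc(5): user nice system idle iowait ... */
	if (!f || fscanf(f, "cpu %llu %llu %llu %llu %llu %llu %llu %llu",
			 &user, &nice, &system, &idle, &iowait, &irq,
			 &softirq, &steal) != 8)
		return 1;
	fclose(f);

	unsigned long long total = user + nice + system + idle + iowait +
				   irq + softirq + steal;

	/* util% = 100 - idle%: iowait counts as "utilised" */
	double util = 100.0 * (total - idle) / total;
	/* busy% = 100 - idle% - iowait%: iowait counts as idle */
	double busy = 100.0 * (total - idle - iowait) / total;

	printf("util%%: %.1f  busy%%: %.1f\n", util, busy);
	return 0;
}

The gap between the two numbers is exactly the iowait share, which is
why a task parked in io_uring_enter() waiting on network completions
can make a host look busy to anything keyed off util%.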
>> idle, so high iowait time looks like high CPU util even though the
>> task is not scheduled and the CPU is free to run other tasks. When
>> doing network IO with e.g. the batch completion feature, the CPU may
>> appear to have high utilisation.
>
> Again, iowait is idle time.

That's fair. I think it is simpler to have a single "CPU util" metric
defined as (100 - idle%), plus a switch that userspace explicitly flips
to say "I want iowait to be considered truly idle or not". This means
things such as load balancers can be built around a single metric,
rather than having to consider both util/busy and needing to understand
"does iowait mean anything here?".

>> This patchset adds an IORING_ENTER_NO_IOWAIT flag that can be set on
>> enter. If set, then current->in_iowait is not set. By default this
>> flag is not set, to maintain existing behaviour, i.e. in_iowait is
>> always set. This is to prevent waiting for completions being
>> accounted as CPU utilisation.
>>
>> Not setting in_iowait does mean that we also lose the cpufreq
>> optimisations above, because the in_iowait semantics couple 1 and 2
>> together. Eventually we will untangle the two so the optimisations
>> can be enabled independently of the accounting.
>>
>> IORING_FEAT_IOWAIT_TOGGLE is returned in io_uring_create() to
>> indicate support. This will be used by liburing to check for this
>> feature.
>
> If I receive a problem report where iowait time isn't accurate, I now
> have to somehow figure out if an application is setting this flag. This
> sounds like a support headache, and I do wonder what the benefit is.
> From what you've written, the justification for the patch is that some
> userspace tooling misinterprets iowait. Shouldn't we just fix that?

Right, I understand your concerns. That's why by default this flag is
not set and io_uring behaves as before, with in_iowait always set.

Unfortunately, "just fix userspace" is a huge ask for us, because a
whole pyramid of both code and human understanding has been built on
top of the current definition of "CPU utilisation". That is extremely
time consuming to change, nor is it something that we (io_uring) should
take on imo. Why not give people the option to indicate whether they
want iowait showing up or not?

> It may be that certain (all?) network functions, like recv, should not
> be accounted as iowait. However, I don't think the onus should be on
> applications to tell the kernel about that--the kernel should just
> figure that out on its own.
>
> Am I alone in these opinions?

Why should the onus be on the kernel? I think it is more difficult for
the kernel to figure out exactly what semantics userspace wants, and
simpler for userspace to select its preference.

From my experience, userspace apps assign a thread with an io_uring
instance to either network IO or disk IO, but never both. If there is
a valid case for doing both types in the same io_uring, then it would
be trivial to add wait helpers that set IORING_ENTER_NO_IOWAIT on a
per-wait basis, along the lines of the sketch below.
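A minimal sketch with raw syscalls (the two new names are from this v2
patch and may change before merge, so this assumes uapi headers from a
kernel carrying the patch; error handling trimmed):

#include <linux/io_uring.h>
#include <sys/syscall.h>
#include <stdbool.h>
#include <string.h>
#include <unistd.h>

static int ring_fd;
static int have_iowait_toggle;

static int ring_setup(unsigned int entries)
{
	struct io_uring_params p;

	memset(&p, 0, sizeof(p));
	ring_fd = syscall(__NR_io_uring_setup, entries, &p);
	if (ring_fd >= 0)
		have_iowait_toggle = p.features & IORING_FEAT_IOWAIT_TOGGLE;
	return ring_fd;
}

/*
 * Wait for completions; skip the in_iowait accounting when the caller
 * says this wait is for network IO and the kernel supports the toggle.
 */
static int ring_wait(unsigned int to_submit, unsigned int min_complete,
		     bool net_io)
{
	unsigned int flags = IORING_ENTER_GETEVENTS;

	if (net_io && have_iowait_toggle)
		flags |= IORING_ENTER_NO_IOWAIT;
	return syscall(__NR_io_uring_enter, ring_fd, to_submit,
		       min_complete, flags, NULL, 0);
}

liburing could hide the feature check behind helpers like this, so an
application only has to express "this wait is network IO".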
I do agree with you that this is not ideal. What io_uring really wants
is to decouple the iowait accounting from the cpufreq optimisation that
gets enabled in the presence of in_iowait, which is a bigger ask and
out of scope for this patch. When _someone_ decides to fix the wider
iowait issue, I'm happy to revisit this patch.

> Cheers,
> Jeff
>