public inbox for [email protected]
 help / color / mirror / Atom feed
From: David Wei <[email protected]>
To: Jeff Moyer <[email protected]>
Cc: [email protected], Jens Axboe <[email protected]>,
	Pavel Begunkov <[email protected]>
Subject: Re: [PATCH v2] io_uring: add IORING_ENTER_NO_IOWAIT to not set in_iowait
Date: Mon, 19 Aug 2024 16:03:48 -0700	[thread overview]
Message-ID: <[email protected]> (raw)
In-Reply-To: <[email protected]>

On 2024-08-16 18:23, Jeff Moyer wrote:
> Hi, David,
> 
> David Wei <[email protected]> writes:
> 
>> io_uring sets current->in_iowait when waiting for completions, which
>> achieves two things:
>>
>> 1. Proper accounting of the time as iowait time
>> 2. Enable cpufreq optimisations, setting SCHED_CPUFREQ_IOWAIT on the rq
>>
>> For block IO this makes sense as high iowait can be indicative of
>> issues.
> 
> It also let's you know that the system isn't truly idle.  IOW, it would
> be doing some work if it didn't have to wait for I/O.  This was the
> reason the metric was added (admins being confused about why their
> system was showing up idle).

I see. Thanks for the historical context.

> 
>> But for network IO especially recv, the recv side does not control
>> when the completions happen.
>>
>> Some user tooling attributes iowait time as CPU utilisation i.e. not
> 
> What user tooling are you talking about?  If it shows iowait as busy
> time, the tooling is broken.  Please see my last mail on the subject:
>   https://lore.kernel.org/io-uring/[email protected]/

Our internal tooling for example considers CPU util% to be (100 -
idle%), but it also has a CPU busy% defined as (100 - idle% - iowait%).
It is very unfortunate that everyone uses CPU util% for monitoring, with
all sorts of alerts, dashboards and load balancers referring to this
value. One reason is that, depending on context, high iowait time may or
may not be a problem, so it isn't as simple as redefining CPU util% to
exclude iowait.

> 
>> idle, so high iowait time looks like high CPU util even though the task
>> is not scheduled and the CPU is free to run other tasks. When doing
>> network IO with e.g. the batch completion feature, the CPU may appear to
>> have high utilisation.
> 
> Again, iowait is idle time.

That's fair. I think it is simpler to have a single "CPU util" metric
defined as (100 - idle%), and have a switch that userspace explicitly
flips to say "I want iowait to be considered truly idle or not". This
means things such as load balancers can be built around a single metric,
rather than having to consider both util/busy and needing to understand
"does iowait mean anything?".

> 
>> This patchset adds a IOURING_ENTER_NO_IOWAIT flag that can be set on
>> enter. If set, then current->in_iowait is not set. By default this flag
>> is not set to maintain existing behaviour i.e. in_iowait is always set.
>> This is to prevent waiting for completions being accounted as CPU
>> utilisation.
>>
>> Not setting in_iowait does mean that we also lose cpufreq optimisations
>> above because in_iowait semantics couples 1 and 2 together. Eventually
>> we will untangle the two so the optimisations can be enabled
>> independently of the accounting.
>>
>> IORING_FEAT_IOWAIT_TOGGLE is returned in io_uring_create() to indicate
>> support. This will be used by liburing to check for this feature.
> 
> If I receive a problem report where iowait time isn't accurate, I now
> have to somehow figure out if an application is setting this flag.  This
> sounds like a support headache, and I do wonder what the benefit is.
> From what you've written, the justification for the patch is that some
> userspace tooling misinterprets iowait.  Shouldn't we just fix that?

Right, I understand your concerns. That's why by default this flag is
not set and io_uring behaves as before with in_iowait always set.

Unfortunately, "just fix userspace" for us is a huge ask because a whole
pyramid of both code and human understanding has been built on the
current definition of "CPU utilisation". This is extremely time
consuming to change, nor is it something that we (io_uring) want to take
on imo. Why not give the option for people to indicate whether they want
iowait showing up or not?

> 
> It may be that certain (all?) network functions, like recv, should not
> be accounted as iowait.  However, I don't think the onus should be on
> applications to tell the kernel about that--the kernel should just
> figure that out on its own.
> 
> Am I alone in these opinions?

Why should the onus be on the kernel? I think it is more difficult for
the kernel to figure out exactly what semantics userspace wants and it
is simpler for userspace to select their preference.

From my experience, userspace apps either assigns a thread with an
io_uring instance to network IO or disk IO, but never both. If there is
a valid case for doing both types in the same io_uring, then it would be
trivial to add wait helpers that sets IORING_ENTER_NO_IOWAIT on a per
wait basis.

I do agree with you that this is not ideal. What io_uring really wants
is to decouple the iowait accounting from the cpufreq optimisation that
gets enabled in the presence of in_iowait, which is a bigger ask and out
of scope of this patch. When _someone_ decides to fix the wider iowait
issue, I'm happy to revisit this patch.

> 
> Cheers,
> Jeff
> 

  reply	other threads:[~2024-08-19 23:03 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-08-16 22:36 [PATCH v2] io_uring: add IORING_ENTER_NO_IOWAIT to not set in_iowait David Wei
2024-08-16 22:49 ` Jens Axboe
2024-08-17  1:23 ` Jeff Moyer
2024-08-19 23:03   ` David Wei [this message]
2024-08-17 19:44 ` Pavel Begunkov
2024-08-17 20:20   ` Jens Axboe
2024-08-17 21:05     ` Pavel Begunkov
2024-08-17 21:09       ` Jens Axboe
2024-08-17 22:04         ` Pavel Begunkov
2024-08-18  1:08           ` Jens Axboe
2024-08-18  2:27             ` Pavel Begunkov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox