[RFT RFC PATCH 0/8] cpufreq: cpuidle: Remove iowait behaviour

public inbox for io-uring@vger.kernel.org
 help / color / mirror / Atom feed

* [RFT RFC PATCH 0/8] cpufreq: cpuidle: Remove iowait behaviour
@ 2024-09-05  9:26 Christian Loehle
  2024-09-05  9:26 ` [RFC PATCH 1/8] cpuidle: menu: Remove iowait influence Christian Loehle
                   ` (8 more replies)
  0 siblings, 9 replies; 25+ messages in thread
From: Christian Loehle @ 2024-09-05  9:26 UTC (permalink / raw)
  To: linux-pm, linux-kernel, rafael, peterz
  Cc: juri.lelli, mingo, dietmar.eggemann, vschneid, vincent.guittot,
	Johannes.Thumshirn, adrian.hunter, ulf.hansson, bvanassche,
	andres, asml.silence, linux-block, io-uring, qyousef, dsmythies,
	axboe, Christian Loehle

I wanted to share my current status after working on the schedutil
iowait boost issue for a while now. This is what I consider the best
solution, happy for anyone to share thoughts and test results (it's
simply impossible to cover them all).
I'm hoping to remove some (bad) heuristics that have been in the kernel
for a long time and are seemingly impossible to evolve. Since the
introduction of these heuristics IO workloads have changed and those
heuristics can be removed while only really affecting synthetic
benchmarks.

A task can set the in_iowait flag before calling schedule() to indicate
that it is waiting for IO to complete (traditionally this was block IO
only).
Initially the reason for introducing the flag was almost cosmetic: To
distinguish a CPU idle system that has no tasks to run from a CPU idle
system that has tasks waiting for IO in order to continue running.

I have outlined the issues with the current iowait behaviour and
grouped them into: A) iowait as a performance metric, B) for cpuidle
specifically, C) for cpufreq specifically.

====== iowait for performance ======
Issues with using iowait as a performance metric:

1. The definition of iowait altogether is vague at best, it is
sprinkled across kernel code. Traditionally is for block io only but
grepping the tree for shows many counterexamples.
2. Not every IO task is performance-critical (consider writebacks).

====== cpuidle ======
In the cpuidle governor menu, the number of iowaiting tasks is used to
determine an interactivity requirement, the more tasks the CPU has with
with iowait set the shallower the state selected will be.

Relying on iowait for cpuidle is problematic for the following reasons:
1. There is no guarantee for an iowaiting task to be woken up on the
same CPU.
2. The task being in iowait says nothing about the idle duration, we
could be selecting shallower states for a long time.
3. The task being in iowait doesn't always imply a performance hit
with increased latency.
4. If there is such a performance hit, the number of iowaiting tasks
doesn't directly correlate.

The cpuidle governor teo shows that the reliance on using iowait isn't
critical to performance. So this series replaces menu with teo.

====== cpufreq ======
cpufreq governors had iowait-based behavior since a long time, the
initial git release in 2.6.12 with the ondemand governor already had
iowait_time essentially counting as busy time and thus increasing
frequency.
iowait boost in schedutil was introduced by
commit ("21ca6d2c52f8 cpufreq: schedutil: Add iowait boosting")
with it more or less following intel_pstate's approach.
Both intel_pstate and schedutil use the enqueueing of a (FAIR) task on
the rq to determine the boost. The more often a task that has iowait
set is enqueued the higher the applied boost.
The underlying problem is that while a task might only utilize the CPU
to a small degree (due to the task mostly blocking for IO), the time
in-between enqueue and dequeue might be critical to performance
(consider that a storage device has no outstanding requests in the
meantime and therefore is idling).
The low utilization will lead to a low frequency being selected and
thus the task's throughput being sub-optimal.
We call this the io utilization problem.

Relying on iowait wakeups for cpufreq is problematic for the following reasons:

1. Specifically for asynchronous or multi-threaded IO the assumption
that we are missing out on block device throughput because it is idling
isn't true. We are often boosting (to an inefficient) frequency just to
have another request in a queue.
2. It assumes a task that has waited on IO will send io requests in the
future.

There are additional schedutil-specific reasons listed in the patch.

====== Per-task iowait boosting ======
There is a general consensus that schedutil iowait boosting needed
to at least evolve.
Attempts of this have been made, mostly centered around associating the
boost to the task that is frequently being enqueued.

Unfortunately there are problems to this in addition to the ones
mentioned about iowait generally:

1. If the time a task is runnable (so between enqueue and dequeue) is
critical to throughput, so is the interrupt handling and the block
layer completion. For a simple fio example that goes to userspace and
back everytime even for a block device driver like nvme-pci that is
highly optimised, more than half of the time between io requests is
spent on the CPU without the task being enqueued. If the CPU is only
boosted for the time that the task is runnable, there is little gain.
For !MCQ devices we cannot assume the non-task part migrates with the
task, so migration even to a higher capacity CPU might not be
beneficial. (This is getting less important though with widespread
MCQ devices, see Bart's comments in [1].)
2. If the task benefits from boosting it will be very short-lived
(otherwise it would build up utilization on it's own), much shorter
than hardware can switch frequency, thus the boost won't be effective
until at least the next enqueue-dequeue cycle. Together with the
previous reason this means we have to hold the boosted frequency
even if the task is off the rq.
3. The increased complexity of implementation.

I don't think a per-task io boost strategy that is practical and
improves enough on the per-rq design. See my proposals [1] and the
OSPM24 discussion [2] about iowait boosting.

====== Performance impact ======
Overall performance impact will depend to some degree on the workload
and the platform.
Performance is compared for mainline menu and mainline teo with
schedutil with or without iowait boost.

----- Synthetic benchmark -----
The worst case is a simple fio benchmark with minimal time spend in
userspace and IOs issued sequentially back-to-back.
fio --minimal --time_based --name=fiotest --filename=/dev/nvme0n1 --runtime=30 --rw=randread --bs=4k --ioengine=psync --iodepth=1 --direct=1 | cut -f 8 -d ";"

The fio task will not exceed utilization for anything but the lowest
OPP being requested on an otherwise idle system.

On a RK3399 which has high latency incurring sleep states (reported as
state1:370us state2:900us, observed on average more like state1:50us
state2:100us) using teo can actually offset the performance loss
incurred by not using schedutil iowait boosting.

RK3399:
menu iowait boost
/dev/nvme0n1
[2584, 2622, 2629]
menu no iowait boost
/dev/nvme0n1
[2068, 2097, 2647]
teo iowait boost
/dev/nvme0n1
[4476, 4478, 4492]
teo no iowait boost
/dev/nvme0n1
[3728, 3733, 3737]

Menu despite the performance multiplier of iowait will often select a
deeper state even if the current workload makes these a worse choice
for both throughput and energy-efficiency.

On a M1 which has just state1 during which the hardware will power-gate
WFI which leads to exit latencies of 10us to 20us there is no gain due
to using teo, so a regression is visible altogether:

M1 Pro:
menu iowait boost
/dev/nvme0n1
[12780, 12788, 12790]
menu no iowait boost
/dev/nvme0n1
[8354, 8359, 8389]
teo iowait boost
/dev/nvme0n1
[12680, 12708, 12728]
teo no iowait boost
/dev/nvme0n1
[7981, 8005, 8093]

One can see the diminishing effect of iowait boosting, even for the synchronous
single-threaded case if the workload builds up utilization on it's own by using
the --thinkcycles fio parameter.

----- Benchmarks -----

Fortunately most real-world io workloads don't suffer from the io
utilization problem.
To illustrate here are a few, all done on the M1 Pro again.
All are run with default mmtests configs except for the filebench
ones where I have set FILEBENCH_MIN_THREADS=8 and
FILEBENCH_MAX_THREADS=16. M1 Pro has 8 CPUs.
The test are run with mainline menu and mainline teo and the sugov
iowait boost knob on or off:
menu + iowait_boost, menu - iowait_boost, teo + iowait_boost, teo - iowait_boost, menu + iowait_boost
The last column is to get a feel on the error margin of the test.

config-db-pgbench-timed-rw-large
pgbench Transactions
Hmean     1       726.68 (   0.00%)      713.79 (  -1.77%)      764.83 (   5.25%)      686.74 (  -5.50%)      715.22 (  -1.58%)
Hmean     4      2213.45 (   0.00%)     2302.69 (   4.03%)     2155.90 (  -2.60%)     2716.60 *  22.73%*     2099.55 (  -5.15%)
Hmean     7      3048.96 (   0.00%)     2530.59 * -17.00%*     2761.00 (  -9.44%)     2601.78 * -14.67%*     3967.59 *  30.13%*
Hmean     12     3223.91 (   0.00%)     3166.93 (  -1.77%)     3503.41 (   8.67%)     3122.70 (  -3.14%)     3228.43 (   0.14%)
Hmean     16     3510.04 (   0.00%)     3405.16 (  -2.99%)     3462.12 (  -1.37%)     3584.56 (   2.12%)     3476.20 (  -0.96%)

The multithreaded tests look good, the upper 3 aren't stable on this
platform IMV. I have seen pgbench run fine even for 1 and 4 with teo,
the original io_uring regression seemed to be from using menu.

config-io-blogbench
blogbench
Hmean     WriteScore    12060.71 (   0.00%)    12179.88 (   0.99%)    12155.41 (   0.79%)    11936.12 (  -1.03%)    12159.15 (   0.82%)
Hmean     ReadScore    206741.40 (   0.00%)   205900.19 (  -0.41%)   204815.61 (  -0.93%)   220167.94 (   6.49%)   207404.34 (   0.32%)

config-io-bonniepp-dir-async
bonnie Throughput
Hmean     SeqCreate ops        742.63 (   0.00%)      756.90 (   1.92%)      784.49 (   5.64%)      734.96 (  -1.03%)      762.66 (   2.70%)
Hmean     SeqCreate read       637.83 (   0.00%)      539.57 * -15.40%*      645.19 (   1.15%)      533.63 * -16.34%*      642.83 (   0.78%)
Hmean     SeqCreate del          0.00 (   0.00%)        0.00 (   0.00%)        0.00 (   0.00%)        0.00 (   0.00%)        0.00 (   0.00%)
Hmean     RandCreate ops      1456.37 (   0.00%)     1418.19 (  -2.62%)     1465.71 (   0.64%)     1484.20 (   1.91%)     1552.98 (   6.63%)
Hmean     RandCreate read      595.00 (   0.00%)      581.01 (  -2.35%)      563.78 (  -5.25%)      563.58 (  -5.28%)      562.95 (  -5.39%)

The SeqCreate read line looks like a ~15% regression wihout iowait boost.

config-io-filebench-varmail-medium
filebench
Hmean     varmail-8     27995.82 (   0.00%)    27463.93 (  -1.90%)    28341.78 (   1.24%)    27237.85 *  -2.71%*    28164.97 (   0.60%)
Hmean     varmail-16    42904.57 (   0.00%)    41633.66 *  -2.96%*    41641.88 *  -2.94%*    40831.48 *  -4.83%*    41756.76 (  -2.68%)

config-io-filebench-webproxy-medium
filebench
Hmean     webproxy-8    191686.03 (   0.00%)   190237.03 (  -0.76%)   192915.82 (   0.64%)   190706.94 (  -0.51%)   192282.81 (   0.31%)
Hmean     webproxy-16   181528.22 (   0.00%)   185743.13 *   2.32%*   185432.54 *   2.15%*   186597.08 *   2.79%*   183376.19 (   1.02%)

config-io-filebench-webserver-medium
filebench
Hmean     webserver-4     49974.72 (   0.00%)    44327.69 * -11.30%*    49498.08 (  -0.95%)    44223.22 * -11.51%*    50726.92 *   1.51%*
Hmean     webserver-6     73232.09 (   0.00%)    65999.99 *  -9.88%*    73396.77 (   0.22%)    65628.80 * -10.38%*    72996.77 (  -0.32%)
Hmean     webserver-12   149138.05 (   0.00%)   148735.35 (  -0.27%)   148853.11 (  -0.19%)   148119.96 *  -0.68%*   148458.42 (  -0.46%)

If threads < nr_cpus then there's a ~10% regression due to iowait boosting.

config-io-fio-io_uring-randread-direct
fio Throughput
Hmean     kb/sec-randread-read   459665.40 (   0.00%)   460498.37 (   0.18%)   442021.24 *  -3.84%*   449985.77 *  -2.11%*   459446.89 (  -0.05%)

Indicates a slight teo regression.

config-io-fsmark-xfsrepair
fsmark
Hmean     4-files/sec   203535.32 (   0.00%)   202487.47 (  -0.51%)   208261.00 (   2.32%)   211015.56 *   3.68%*   208432.18 *   2.41%*

config-workload-ebizzy
ebizzy Overall Throughput
Hmean     Rsec-1    112650.58 (   0.00%)   117498.40 (   4.30%)   112201.69 (  -0.40%)   112297.30 (  -0.31%)   111077.71 (  -1.40%)
Hmean     Rsec-3    413238.27 (   0.00%)   385083.09 (  -6.81%)   407349.16 (  -1.43%)   418227.30 (   1.21%)   418465.17 (   1.26%)
Hmean     Rsec-5    638089.98 (   0.00%)   637027.84 (  -0.17%)   637337.67 (  -0.12%)   639066.37 (   0.15%)   638654.41 (   0.09%)
Hmean     Rsec-7    761622.12 (   0.00%)   761506.36 (  -0.02%)   761725.47 (   0.01%)   760639.83 (  -0.13%)   761642.36 (   0.00%)
Hmean     Rsec-12   783093.40 (   0.00%)   765566.10 (  -2.24%)   786634.19 (   0.45%)   792604.59 (   1.21%)   796267.24 (   1.68%)
Hmean     Rsec-18   784677.51 (   0.00%)   783555.79 (  -0.14%)   785633.74 (   0.12%)   771189.93 (  -1.72%)   790559.21 (   0.75%)
Hmean     Rsec-24   779544.14 (   0.00%)   781893.47 (   0.30%)   777973.36 (  -0.20%)   776038.68 (  -0.45%)   775678.94 (  -0.50%)
Hmean     Rsec-30   757447.59 (   0.00%)   733989.03 (  -3.10%)   729435.88 (  -3.70%)   746541.43 (  -1.44%)   752617.47 (  -0.64%)
Hmean     Rsec-32   760912.43 (   0.00%)   773417.23 (   1.64%)   735356.71 (  -3.36%)   743558.83 (  -2.28%)   751862.45 (  -1.19%)

No obvious trend here, but since it was used a couple times for cpuidle
governor development in the past I did include it.

Note:
Technically the entire series saves some time on the enqueue path, that
is not reflected here as it's all measured with the sysfs knob.

====== RFT ======

The series is structured in a way to make bisecting easy in case of
regressions. Additionally I provided two test patches in schedutil to
cap the boost using a sysfs knob. In case there are regressions I would
be interested if setting a relatively low cap mitigates the issue.
I can provide a similar knob for intel_state if regressions are
reported that bisect to that.
Information about the CPU capacities / topology would be useful if it
bisects to the cpufreq changes, about the idle states if it bisects to
cpuidle respectively.

====== TODO ======

I haven't touched cpufreq_ondemand yet, that would need to be done as
well, but wanted to get some comments and/or testresults first.
should_io_be_busy() occurrences also still needs to be addressed.

[1]
v1 per-task io boost
https://lore.kernel.org/lkml/20240304201625.100619-1-christian.loehle@arm.com/
v2 per-task io boost
https://lore.kernel.org/lkml/20240518113947.2127802-2-christian.loehle@arm.com/
[2]
OSPM24 discussion iowait boosting
https://www.youtube.com/watch?v=MSQGEsSziZ4

Regards,
Christian

Christian Loehle (8):
  cpuidle: menu: Remove iowait influence
  cpuidle: Prefer teo over menu governor
  TEST: cpufreq/schedutil: Linear iowait boost step
  TEST: cpufreq/schedutil: iowait boost cap sysfs
  cpufreq/schedutil: Remove iowait boost
  cpufreq: intel_pstate: Remove iowait boost
  cpufreq: Remove SCHED_CPUFREQ_IOWAIT update
  io_uring: Do not set iowait before sleeping

 drivers/cpufreq/intel_pstate.c   |  50 +----------
 drivers/cpuidle/Kconfig          |   5 +-
 drivers/cpuidle/governors/menu.c |  78 +++--------------
 drivers/cpuidle/governors/teo.c  |   2 +-
 include/linux/sched/cpufreq.h    |   2 -
 io_uring/io_uring.c              |  17 ----
 kernel/sched/cpufreq_schedutil.c | 144 +------------------------------
 kernel/sched/fair.c              |   8 --
 8 files changed, 18 insertions(+), 288 deletions(-)

--
2.34.1

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [RFC PATCH 1/8] cpuidle: menu: Remove iowait influence
  2024-09-05  9:26 [RFT RFC PATCH 0/8] cpufreq: cpuidle: Remove iowait behaviour Christian Loehle
@ 2024-09-05  9:26 ` Christian Loehle
  2024-09-30 14:58   ` Rafael J. Wysocki
  2024-09-05  9:26 ` [RFC PATCH 2/8] cpuidle: Prefer teo over menu governor Christian Loehle
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 25+ messages in thread
From: Christian Loehle @ 2024-09-05  9:26 UTC (permalink / raw)
  To: linux-pm, linux-kernel, rafael, peterz
  Cc: juri.lelli, mingo, dietmar.eggemann, vschneid, vincent.guittot,
	Johannes.Thumshirn, adrian.hunter, ulf.hansson, bvanassche,
	andres, asml.silence, linux-block, io-uring, qyousef, dsmythies,
	axboe, Christian Loehle

Remove CPU iowaiters influence on idle state selection.
Remove the menu notion of performance multiplier which increased with
the number of tasks that went to iowait sleep on this CPU and haven't
woken up yet.

Relying on iowait for cpuidle is problematic for a few reasons:
1. There is no guarantee that an iowaiting task will wake up on the
same CPU.
2. The task being in iowait says nothing about the idle duration, we
could be selecting shallower states for a long time.
3. The task being in iowait doesn't always imply a performance hit
with increased latency.
4. If there is such a performance hit, the number of iowaiting tasks
doesn't directly correlate.
5. The definition of iowait altogether is vague at best, it is
sprinkled across kernel code.

Signed-off-by: Christian Loehle <christian.loehle@arm.com>
---
 drivers/cpuidle/governors/menu.c | 76 ++++----------------------------
 1 file changed, 9 insertions(+), 67 deletions(-)

diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
index f3c9d49f0f2a..28363bfa3e4c 100644
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -19,7 +19,7 @@
 
 #include "gov.h"
 
-#define BUCKETS 12
+#define BUCKETS 6
 #define INTERVAL_SHIFT 3
 #define INTERVALS (1UL << INTERVAL_SHIFT)
 #define RESOLUTION 1024
@@ -29,12 +29,11 @@
 /*
  * Concepts and ideas behind the menu governor
  *
- * For the menu governor, there are 3 decision factors for picking a C
+ * For the menu governor, there are 2 decision factors for picking a C
  * state:
  * 1) Energy break even point
- * 2) Performance impact
- * 3) Latency tolerance (from pmqos infrastructure)
- * These three factors are treated independently.
+ * 2) Latency tolerance (from pmqos infrastructure)
+ * These two factors are treated independently.
  *
  * Energy break even point
  * -----------------------
@@ -75,30 +74,6 @@
  * intervals and if the stand deviation of these 8 intervals is below a
  * threshold value, we use the average of these intervals as prediction.
  *
- * Limiting Performance Impact
- * ---------------------------
- * C states, especially those with large exit latencies, can have a real
- * noticeable impact on workloads, which is not acceptable for most sysadmins,
- * and in addition, less performance has a power price of its own.
- *
- * As a general rule of thumb, menu assumes that the following heuristic
- * holds:
- *     The busier the system, the less impact of C states is acceptable
- *
- * This rule-of-thumb is implemented using a performance-multiplier:
- * If the exit latency times the performance multiplier is longer than
- * the predicted duration, the C state is not considered a candidate
- * for selection due to a too high performance impact. So the higher
- * this multiplier is, the longer we need to be idle to pick a deep C
- * state, and thus the less likely a busy CPU will hit such a deep
- * C state.
- *
- * Currently there is only one value determining the factor:
- * 10 points are added for each process that is waiting for IO on this CPU.
- * (This value was experimentally determined.)
- * Utilization is no longer a factor as it was shown that it never contributed
- * significantly to the performance multiplier in the first place.
- *
  */
 
 struct menu_device {
@@ -112,19 +87,10 @@ struct menu_device {
 	int		interval_ptr;
 };
 
-static inline int which_bucket(u64 duration_ns, unsigned int nr_iowaiters)
+static inline int which_bucket(u64 duration_ns)
 {
 	int bucket = 0;
 
-	/*
-	 * We keep two groups of stats; one with no
-	 * IO pending, one without.
-	 * This allows us to calculate
-	 * E(duration)|iowait
-	 */
-	if (nr_iowaiters)
-		bucket = BUCKETS/2;
-
 	if (duration_ns < 10ULL * NSEC_PER_USEC)
 		return bucket;
 	if (duration_ns < 100ULL * NSEC_PER_USEC)
@@ -138,19 +104,6 @@ static inline int which_bucket(u64 duration_ns, unsigned int nr_iowaiters)
 	return bucket + 5;
 }
 
-/*
- * Return a multiplier for the exit latency that is intended
- * to take performance requirements into account.
- * The more performance critical we estimate the system
- * to be, the higher this multiplier, and thus the higher
- * the barrier to go to an expensive C state.
- */
-static inline int performance_multiplier(unsigned int nr_iowaiters)
-{
-	/* for IO wait tasks (per cpu!) we add 10x each */
-	return 1 + 10 * nr_iowaiters;
-}
-
 static DEFINE_PER_CPU(struct menu_device, menu_devices);
 
 static void menu_update(struct cpuidle_driver *drv, struct cpuidle_device *dev);
@@ -258,8 +211,6 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
 	struct menu_device *data = this_cpu_ptr(&menu_devices);
 	s64 latency_req = cpuidle_governor_latency_req(dev->cpu);
 	u64 predicted_ns;
-	u64 interactivity_req;
-	unsigned int nr_iowaiters;
 	ktime_t delta, delta_tick;
 	int i, idx;
 
@@ -268,8 +219,6 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
 		data->needs_update = 0;
 	}
 
-	nr_iowaiters = nr_iowait_cpu(dev->cpu);
-
 	/* Find the shortest expected idle interval. */
 	predicted_ns = get_typical_interval(data) * NSEC_PER_USEC;
 	if (predicted_ns > RESIDENCY_THRESHOLD_NS) {
@@ -283,7 +232,7 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
 		}
 
 		data->next_timer_ns = delta;
-		data->bucket = which_bucket(data->next_timer_ns, nr_iowaiters);
+		data->bucket = which_bucket(data->next_timer_ns);
 
 		/* Round up the result for half microseconds. */
 		timer_us = div_u64((RESOLUTION * DECAY * NSEC_PER_USEC) / 2 +
@@ -301,7 +250,7 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
 		 */
 		data->next_timer_ns = KTIME_MAX;
 		delta_tick = TICK_NSEC / 2;
-		data->bucket = which_bucket(KTIME_MAX, nr_iowaiters);
+		data->bucket = which_bucket(KTIME_MAX);
 	}
 
 	if (unlikely(drv->state_count <= 1 || latency_req == 0) ||
@@ -328,15 +277,8 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
 		 */
 		if (predicted_ns < TICK_NSEC)
 			predicted_ns = data->next_timer_ns;
-	} else {
-		/*
-		 * Use the performance multiplier and the user-configurable
-		 * latency_req to determine the maximum exit latency.
-		 */
-		interactivity_req = div64_u64(predicted_ns,
-					      performance_multiplier(nr_iowaiters));
-		if (latency_req > interactivity_req)
-			latency_req = interactivity_req;
+	} else if (latency_req > predicted_ns) {
+		latency_req = predicted_ns;
 	}
 
 	/*
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [RFC PATCH 2/8] cpuidle: Prefer teo over menu governor
  2024-09-05  9:26 [RFT RFC PATCH 0/8] cpufreq: cpuidle: Remove iowait behaviour Christian Loehle
  2024-09-05  9:26 ` [RFC PATCH 1/8] cpuidle: menu: Remove iowait influence Christian Loehle
@ 2024-09-05  9:26 ` Christian Loehle
  2024-09-30 15:06   ` Rafael J. Wysocki
  2024-09-05  9:26 ` [RFC PATCH 3/8] TEST: cpufreq/schedutil: Linear iowait boost step Christian Loehle
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 25+ messages in thread
From: Christian Loehle @ 2024-09-05  9:26 UTC (permalink / raw)
  To: linux-pm, linux-kernel, rafael, peterz
  Cc: juri.lelli, mingo, dietmar.eggemann, vschneid, vincent.guittot,
	Johannes.Thumshirn, adrian.hunter, ulf.hansson, bvanassche,
	andres, asml.silence, linux-block, io-uring, qyousef, dsmythies,
	axboe, Christian Loehle

Since menu no longer has the interactivity boost teo works better
overall, so make it the default.

Signed-off-by: Christian Loehle <christian.loehle@arm.com>
---
 drivers/cpuidle/Kconfig          | 5 +----
 drivers/cpuidle/governors/menu.c | 2 +-
 drivers/cpuidle/governors/teo.c  | 2 +-
 3 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/drivers/cpuidle/Kconfig b/drivers/cpuidle/Kconfig
index cac5997dca50..ae67a464025a 100644
--- a/drivers/cpuidle/Kconfig
+++ b/drivers/cpuidle/Kconfig
@@ -5,7 +5,7 @@ config CPU_IDLE
 	bool "CPU idle PM support"
 	default y if ACPI || PPC_PSERIES
 	select CPU_IDLE_GOV_LADDER if (!NO_HZ && !NO_HZ_IDLE)
-	select CPU_IDLE_GOV_MENU if (NO_HZ || NO_HZ_IDLE) && !CPU_IDLE_GOV_TEO
+	select CPU_IDLE_GOV_TEO if (NO_HZ || NO_HZ_IDLE) && !CPU_IDLE_GOV_MENU
 	help
 	  CPU idle is a generic framework for supporting software-controlled
 	  idle processor power management.  It includes modular cross-platform
@@ -30,9 +30,6 @@ config CPU_IDLE_GOV_TEO
 	  This governor implements a simplified idle state selection method
 	  focused on timer events and does not do any interactivity boosting.
 
-	  Some workloads benefit from using it and it generally should be safe
-	  to use.  Say Y here if you are not happy with the alternatives.
-
 config CPU_IDLE_GOV_HALTPOLL
 	bool "Haltpoll governor (for virtualized systems)"
 	depends on KVM_GUEST
diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
index 28363bfa3e4c..c0ae5e98d6f1 100644
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -508,7 +508,7 @@ static int menu_enable_device(struct cpuidle_driver *drv,
 
 static struct cpuidle_governor menu_governor = {
 	.name =		"menu",
-	.rating =	20,
+	.rating =	19,
 	.enable =	menu_enable_device,
 	.select =	menu_select,
 	.reflect =	menu_reflect,
diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
index f2992f92d8db..6c3cc39f285d 100644
--- a/drivers/cpuidle/governors/teo.c
+++ b/drivers/cpuidle/governors/teo.c
@@ -537,7 +537,7 @@ static int teo_enable_device(struct cpuidle_driver *drv,
 
 static struct cpuidle_governor teo_governor = {
 	.name =		"teo",
-	.rating =	19,
+	.rating =	20,
 	.enable =	teo_enable_device,
 	.select =	teo_select,
 	.reflect =	teo_reflect,
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [RFC PATCH 3/8] TEST: cpufreq/schedutil: Linear iowait boost step
  2024-09-05  9:26 [RFT RFC PATCH 0/8] cpufreq: cpuidle: Remove iowait behaviour Christian Loehle
  2024-09-05  9:26 ` [RFC PATCH 1/8] cpuidle: menu: Remove iowait influence Christian Loehle
  2024-09-05  9:26 ` [RFC PATCH 2/8] cpuidle: Prefer teo over menu governor Christian Loehle
@ 2024-09-05  9:26 ` Christian Loehle
  2024-09-05  9:26 ` [RFC PATCH 4/8] TEST: cpufreq/schedutil: iowait boost cap sysfs Christian Loehle
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 25+ messages in thread
From: Christian Loehle @ 2024-09-05  9:26 UTC (permalink / raw)
  To: linux-pm, linux-kernel, rafael, peterz
  Cc: juri.lelli, mingo, dietmar.eggemann, vschneid, vincent.guittot,
	Johannes.Thumshirn, adrian.hunter, ulf.hansson, bvanassche,
	andres, asml.silence, linux-block, io-uring, qyousef, dsmythies,
	axboe, Christian Loehle

In preparation for capping iowait boost make the steps linear as
opposed to doubling.

Signed-off-by: Christian Loehle <christian.loehle@arm.com>
---
 kernel/sched/cpufreq_schedutil.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index eece6244f9d2..7810374aaa5b 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -267,7 +267,8 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
 	/* Double the boost at each request */
 	if (sg_cpu->iowait_boost) {
 		sg_cpu->iowait_boost =
-			min_t(unsigned int, sg_cpu->iowait_boost << 1, SCHED_CAPACITY_SCALE);
+			min_t(unsigned int,
+			      sg_cpu->iowait_boost + IOWAIT_BOOST_MIN, SCHED_CAPACITY_SCALE);
 		return;
 	}
 
@@ -308,11 +309,9 @@ static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
 		/*
 		 * No boost pending; reduce the boost value.
 		 */
-		sg_cpu->iowait_boost >>= 1;
-		if (sg_cpu->iowait_boost < IOWAIT_BOOST_MIN) {
-			sg_cpu->iowait_boost = 0;
+		sg_cpu->iowait_boost -= IOWAIT_BOOST_MIN;
+		if (!sg_cpu->iowait_boost)
 			return 0;
-		}
 	}
 
 	sg_cpu->iowait_boost_pending = false;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [RFC PATCH 4/8] TEST: cpufreq/schedutil: iowait boost cap sysfs
  2024-09-05  9:26 [RFT RFC PATCH 0/8] cpufreq: cpuidle: Remove iowait behaviour Christian Loehle
                   ` (2 preceding siblings ...)
  2024-09-05  9:26 ` [RFC PATCH 3/8] TEST: cpufreq/schedutil: Linear iowait boost step Christian Loehle
@ 2024-09-05  9:26 ` Christian Loehle
  2024-09-05  9:26 ` [RFC PATCH 5/8] cpufreq/schedutil: Remove iowait boost Christian Loehle
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 25+ messages in thread
From: Christian Loehle @ 2024-09-05  9:26 UTC (permalink / raw)
  To: linux-pm, linux-kernel, rafael, peterz
  Cc: juri.lelli, mingo, dietmar.eggemann, vschneid, vincent.guittot,
	Johannes.Thumshirn, adrian.hunter, ulf.hansson, bvanassche,
	andres, asml.silence, linux-block, io-uring, qyousef, dsmythies,
	axboe, Christian Loehle

Add a knob to cap applied iowait_boost per sysfs.
This is to test for potential regressions.

Signed-off-by: Christian Loehle <christian.loehle@arm.com>
---
 kernel/sched/cpufreq_schedutil.c | 38 ++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 7810374aaa5b..5324f07fc93a 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -11,6 +11,7 @@
 struct sugov_tunables {
 	struct gov_attr_set	attr_set;
 	unsigned int		rate_limit_us;
+	unsigned int		iowait_boost_cap;
 };
 
 struct sugov_policy {
@@ -35,6 +36,8 @@ struct sugov_policy {
 
 	bool			limits_changed;
 	bool			need_freq_update;
+
+	unsigned int		iowait_boost_cap;
 };
 
 struct sugov_cpu {
@@ -316,6 +319,9 @@ static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
 
 	sg_cpu->iowait_boost_pending = false;
 
+	if (sg_cpu->iowait_boost > sg_cpu->sg_policy->iowait_boost_cap)
+		sg_cpu->iowait_boost = sg_cpu->sg_policy->iowait_boost_cap;
+
 	/*
 	 * sg_cpu->util is already in capacity scale; convert iowait_boost
 	 * into the same scale so we can compare.
@@ -554,6 +560,14 @@ static ssize_t rate_limit_us_show(struct gov_attr_set *attr_set, char *buf)
 	return sprintf(buf, "%u\n", tunables->rate_limit_us);
 }
 
+
+static ssize_t iowait_boost_cap_show(struct gov_attr_set *attr_set, char *buf)
+{
+	struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
+
+	return sprintf(buf, "%u\n", tunables->iowait_boost_cap);
+}
+
 static ssize_t
 rate_limit_us_store(struct gov_attr_set *attr_set, const char *buf, size_t count)
 {
@@ -572,10 +586,30 @@ rate_limit_us_store(struct gov_attr_set *attr_set, const char *buf, size_t count
 	return count;
 }
 
+static ssize_t
+iowait_boost_cap_store(struct gov_attr_set *attr_set, const char *buf, size_t count)
+{
+	struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
+	struct sugov_policy *sg_policy;
+	unsigned int iowait_boost_cap;
+
+	if (kstrtouint(buf, 10, &iowait_boost_cap))
+		return -EINVAL;
+
+	tunables->iowait_boost_cap = iowait_boost_cap;
+
+	list_for_each_entry(sg_policy, &attr_set->policy_list, tunables_hook)
+		sg_policy->iowait_boost_cap = iowait_boost_cap;
+
+	return count;
+}
+
 static struct governor_attr rate_limit_us = __ATTR_RW(rate_limit_us);
+static struct governor_attr iowait_boost_cap = __ATTR_RW(iowait_boost_cap);
 
 static struct attribute *sugov_attrs[] = {
 	&rate_limit_us.attr,
+	&iowait_boost_cap.attr,
 	NULL
 };
 ATTRIBUTE_GROUPS(sugov);
@@ -765,6 +799,8 @@ static int sugov_init(struct cpufreq_policy *policy)
 
 	tunables->rate_limit_us = cpufreq_policy_transition_delay_us(policy);
 
+	tunables->iowait_boost_cap = SCHED_CAPACITY_SCALE;
+
 	policy->governor_data = sg_policy;
 	sg_policy->tunables = tunables;
 
@@ -834,6 +870,8 @@ static int sugov_start(struct cpufreq_policy *policy)
 	sg_policy->limits_changed		= false;
 	sg_policy->cached_raw_freq		= 0;
 
+	sg_policy->iowait_boost_cap		= SCHED_CAPACITY_SCALE;
+
 	sg_policy->need_freq_update = cpufreq_driver_test_flags(CPUFREQ_NEED_UPDATE_LIMITS);
 
 	if (policy_is_shared(policy))
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [RFC PATCH 5/8] cpufreq/schedutil: Remove iowait boost
  2024-09-05  9:26 [RFT RFC PATCH 0/8] cpufreq: cpuidle: Remove iowait behaviour Christian Loehle
                   ` (3 preceding siblings ...)
  2024-09-05  9:26 ` [RFC PATCH 4/8] TEST: cpufreq/schedutil: iowait boost cap sysfs Christian Loehle
@ 2024-09-05  9:26 ` Christian Loehle
  2024-09-30 16:34   ` Rafael J. Wysocki
  2024-09-05  9:26 ` [RFC PATCH 6/8] cpufreq: intel_pstate: " Christian Loehle
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 25+ messages in thread
From: Christian Loehle @ 2024-09-05  9:26 UTC (permalink / raw)
  To: linux-pm, linux-kernel, rafael, peterz
  Cc: juri.lelli, mingo, dietmar.eggemann, vschneid, vincent.guittot,
	Johannes.Thumshirn, adrian.hunter, ulf.hansson, bvanassche,
	andres, asml.silence, linux-block, io-uring, qyousef, dsmythies,
	axboe, Christian Loehle

iowait boost in schedutil was introduced by
commit ("21ca6d2c52f8 cpufreq: schedutil: Add iowait boosting").
with it more or less following intel_pstate's approach to increase
frequency after an iowait wakeup.
Behaviour that is piggy-backed onto iowait boost is problematic
due to a lot of reasons, so remove it.

For schedutil specifically these are some of the reasons:
1. Boosting is applied even in scenarios where it doesn't improve
throughput.
2. The boost is not accounted for in EAS: a) feec() will only consider
 the actual task utilization for task placement, but another CPU might
 be more energy-efficient at that capacity than the boosted one.)
 b) When placing a non-IO task while a CPU is boosted compute_energy()
 assumes a lower OPP than what is actually applied. This leads to
 wrong EAS decisions.
3. Actual IO heavy workloads are hardly distinguished from infrequent
in_iowait wakeups.
4. The boost isn't accounted for in task placement.
5. The boost isn't associated with a task, it therefore lingers on the
rq even after the responsible task has migrated / stopped.
6. The boost isn't associated with a task, it therefore needs to ramp
up again when migrated.
7. Since schedutil doesn't know which task is getting woken up,
multiple unrelated in_iowait tasks lead to boosting.
8. Boosting is hard to control with UCLAMP_MAX (which is only active
when the task is on the rq, which for boosted tasks is usually not
the case for most of the time).

One benefit of schedutil specifically is the reliance on the
scheduler's utilization signals, which have evolved a lot since it's
original introduction. Some cases that benefitted from iowait boosting
in the past can now be covered by e.g. util_est.

Signed-off-by: Christian Loehle <christian.loehle@arm.com>
---
 kernel/sched/cpufreq_schedutil.c | 181 +------------------------------
 1 file changed, 3 insertions(+), 178 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 5324f07fc93a..55b8b8ba7238 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -6,12 +6,9 @@
  * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
  */
 
-#define IOWAIT_BOOST_MIN	(SCHED_CAPACITY_SCALE / 8)
-
 struct sugov_tunables {
 	struct gov_attr_set	attr_set;
 	unsigned int		rate_limit_us;
-	unsigned int		iowait_boost_cap;
 };
 
 struct sugov_policy {
@@ -36,8 +33,6 @@ struct sugov_policy {
 
 	bool			limits_changed;
 	bool			need_freq_update;
-
-	unsigned int		iowait_boost_cap;
 };
 
 struct sugov_cpu {
@@ -45,10 +40,6 @@ struct sugov_cpu {
 	struct sugov_policy	*sg_policy;
 	unsigned int		cpu;
 
-	bool			iowait_boost_pending;
-	unsigned int		iowait_boost;
-	u64			last_update;
-
 	unsigned long		util;
 	unsigned long		bw_min;
 
@@ -198,137 +189,15 @@ unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
 	return max(min, max);
 }
 
-static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
+static void sugov_get_util(struct sugov_cpu *sg_cpu)
 {
 	unsigned long min, max, util = cpu_util_cfs_boost(sg_cpu->cpu);
 
 	util = effective_cpu_util(sg_cpu->cpu, util, &min, &max);
-	util = max(util, boost);
 	sg_cpu->bw_min = min;
 	sg_cpu->util = sugov_effective_cpu_perf(sg_cpu->cpu, util, min, max);
 }
 
-/**
- * sugov_iowait_reset() - Reset the IO boost status of a CPU.
- * @sg_cpu: the sugov data for the CPU to boost
- * @time: the update time from the caller
- * @set_iowait_boost: true if an IO boost has been requested
- *
- * The IO wait boost of a task is disabled after a tick since the last update
- * of a CPU. If a new IO wait boost is requested after more then a tick, then
- * we enable the boost starting from IOWAIT_BOOST_MIN, which improves energy
- * efficiency by ignoring sporadic wakeups from IO.
- */
-static bool sugov_iowait_reset(struct sugov_cpu *sg_cpu, u64 time,
-			       bool set_iowait_boost)
-{
-	s64 delta_ns = time - sg_cpu->last_update;
-
-	/* Reset boost only if a tick has elapsed since last request */
-	if (delta_ns <= TICK_NSEC)
-		return false;
-
-	sg_cpu->iowait_boost = set_iowait_boost ? IOWAIT_BOOST_MIN : 0;
-	sg_cpu->iowait_boost_pending = set_iowait_boost;
-
-	return true;
-}
-
-/**
- * sugov_iowait_boost() - Updates the IO boost status of a CPU.
- * @sg_cpu: the sugov data for the CPU to boost
- * @time: the update time from the caller
- * @flags: SCHED_CPUFREQ_IOWAIT if the task is waking up after an IO wait
- *
- * Each time a task wakes up after an IO operation, the CPU utilization can be
- * boosted to a certain utilization which doubles at each "frequent and
- * successive" wakeup from IO, ranging from IOWAIT_BOOST_MIN to the utilization
- * of the maximum OPP.
- *
- * To keep doubling, an IO boost has to be requested at least once per tick,
- * otherwise we restart from the utilization of the minimum OPP.
- */
-static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
-			       unsigned int flags)
-{
-	bool set_iowait_boost = flags & SCHED_CPUFREQ_IOWAIT;
-
-	/* Reset boost if the CPU appears to have been idle enough */
-	if (sg_cpu->iowait_boost &&
-	    sugov_iowait_reset(sg_cpu, time, set_iowait_boost))
-		return;
-
-	/* Boost only tasks waking up after IO */
-	if (!set_iowait_boost)
-		return;
-
-	/* Ensure boost doubles only one time at each request */
-	if (sg_cpu->iowait_boost_pending)
-		return;
-	sg_cpu->iowait_boost_pending = true;
-
-	/* Double the boost at each request */
-	if (sg_cpu->iowait_boost) {
-		sg_cpu->iowait_boost =
-			min_t(unsigned int,
-			      sg_cpu->iowait_boost + IOWAIT_BOOST_MIN, SCHED_CAPACITY_SCALE);
-		return;
-	}
-
-	/* First wakeup after IO: start with minimum boost */
-	sg_cpu->iowait_boost = IOWAIT_BOOST_MIN;
-}
-
-/**
- * sugov_iowait_apply() - Apply the IO boost to a CPU.
- * @sg_cpu: the sugov data for the cpu to boost
- * @time: the update time from the caller
- * @max_cap: the max CPU capacity
- *
- * A CPU running a task which woken up after an IO operation can have its
- * utilization boosted to speed up the completion of those IO operations.
- * The IO boost value is increased each time a task wakes up from IO, in
- * sugov_iowait_apply(), and it's instead decreased by this function,
- * each time an increase has not been requested (!iowait_boost_pending).
- *
- * A CPU which also appears to have been idle for at least one tick has also
- * its IO boost utilization reset.
- *
- * This mechanism is designed to boost high frequently IO waiting tasks, while
- * being more conservative on tasks which does sporadic IO operations.
- */
-static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
-			       unsigned long max_cap)
-{
-	/* No boost currently required */
-	if (!sg_cpu->iowait_boost)
-		return 0;
-
-	/* Reset boost if the CPU appears to have been idle enough */
-	if (sugov_iowait_reset(sg_cpu, time, false))
-		return 0;
-
-	if (!sg_cpu->iowait_boost_pending) {
-		/*
-		 * No boost pending; reduce the boost value.
-		 */
-		sg_cpu->iowait_boost -= IOWAIT_BOOST_MIN;
-		if (!sg_cpu->iowait_boost)
-			return 0;
-	}
-
-	sg_cpu->iowait_boost_pending = false;
-
-	if (sg_cpu->iowait_boost > sg_cpu->sg_policy->iowait_boost_cap)
-		sg_cpu->iowait_boost = sg_cpu->sg_policy->iowait_boost_cap;
-
-	/*
-	 * sg_cpu->util is already in capacity scale; convert iowait_boost
-	 * into the same scale so we can compare.
-	 */
-	return (sg_cpu->iowait_boost * max_cap) >> SCHED_CAPACITY_SHIFT;
-}
-
 #ifdef CONFIG_NO_HZ_COMMON
 static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
 {
@@ -356,18 +225,12 @@ static inline bool sugov_update_single_common(struct sugov_cpu *sg_cpu,
 					      u64 time, unsigned long max_cap,
 					      unsigned int flags)
 {
-	unsigned long boost;
-
-	sugov_iowait_boost(sg_cpu, time, flags);
-	sg_cpu->last_update = time;
-
 	ignore_dl_rate_limit(sg_cpu);
 
 	if (!sugov_should_update_freq(sg_cpu->sg_policy, time))
 		return false;
 
-	boost = sugov_iowait_apply(sg_cpu, time, max_cap);
-	sugov_get_util(sg_cpu, boost);
+	sugov_get_util(sg_cpu);
 
 	return true;
 }
@@ -468,11 +331,8 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
 
 	for_each_cpu(j, policy->cpus) {
 		struct sugov_cpu *j_sg_cpu = &per_cpu(sugov_cpu, j);
-		unsigned long boost;
-
-		boost = sugov_iowait_apply(j_sg_cpu, time, max_cap);
-		sugov_get_util(j_sg_cpu, boost);
 
+		sugov_get_util(j_sg_cpu);
 		util = max(j_sg_cpu->util, util);
 	}
 
@@ -488,9 +348,6 @@ sugov_update_shared(struct update_util_data *hook, u64 time, unsigned int flags)
 
 	raw_spin_lock(&sg_policy->update_lock);
 
-	sugov_iowait_boost(sg_cpu, time, flags);
-	sg_cpu->last_update = time;
-
 	ignore_dl_rate_limit(sg_cpu);
 
 	if (sugov_should_update_freq(sg_policy, time)) {
@@ -560,14 +417,6 @@ static ssize_t rate_limit_us_show(struct gov_attr_set *attr_set, char *buf)
 	return sprintf(buf, "%u\n", tunables->rate_limit_us);
 }
 
-
-static ssize_t iowait_boost_cap_show(struct gov_attr_set *attr_set, char *buf)
-{
-	struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
-
-	return sprintf(buf, "%u\n", tunables->iowait_boost_cap);
-}
-
 static ssize_t
 rate_limit_us_store(struct gov_attr_set *attr_set, const char *buf, size_t count)
 {
@@ -586,30 +435,10 @@ rate_limit_us_store(struct gov_attr_set *attr_set, const char *buf, size_t count
 	return count;
 }
 
-static ssize_t
-iowait_boost_cap_store(struct gov_attr_set *attr_set, const char *buf, size_t count)
-{
-	struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
-	struct sugov_policy *sg_policy;
-	unsigned int iowait_boost_cap;
-
-	if (kstrtouint(buf, 10, &iowait_boost_cap))
-		return -EINVAL;
-
-	tunables->iowait_boost_cap = iowait_boost_cap;
-
-	list_for_each_entry(sg_policy, &attr_set->policy_list, tunables_hook)
-		sg_policy->iowait_boost_cap = iowait_boost_cap;
-
-	return count;
-}
-
 static struct governor_attr rate_limit_us = __ATTR_RW(rate_limit_us);
-static struct governor_attr iowait_boost_cap = __ATTR_RW(iowait_boost_cap);
 
 static struct attribute *sugov_attrs[] = {
 	&rate_limit_us.attr,
-	&iowait_boost_cap.attr,
 	NULL
 };
 ATTRIBUTE_GROUPS(sugov);
@@ -799,8 +628,6 @@ static int sugov_init(struct cpufreq_policy *policy)
 
 	tunables->rate_limit_us = cpufreq_policy_transition_delay_us(policy);
 
-	tunables->iowait_boost_cap = SCHED_CAPACITY_SCALE;
-
 	policy->governor_data = sg_policy;
 	sg_policy->tunables = tunables;
 
@@ -870,8 +697,6 @@ static int sugov_start(struct cpufreq_policy *policy)
 	sg_policy->limits_changed		= false;
 	sg_policy->cached_raw_freq		= 0;
 
-	sg_policy->iowait_boost_cap		= SCHED_CAPACITY_SCALE;
-
 	sg_policy->need_freq_update = cpufreq_driver_test_flags(CPUFREQ_NEED_UPDATE_LIMITS);
 
 	if (policy_is_shared(policy))
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [RFC PATCH 6/8] cpufreq: intel_pstate: Remove iowait boost
  2024-09-05  9:26 [RFT RFC PATCH 0/8] cpufreq: cpuidle: Remove iowait behaviour Christian Loehle
                   ` (4 preceding siblings ...)
  2024-09-05  9:26 ` [RFC PATCH 5/8] cpufreq/schedutil: Remove iowait boost Christian Loehle
@ 2024-09-05  9:26 ` Christian Loehle
  2024-09-12 11:22   ` [RFC PATCH] TEST: cpufreq: intel_pstate: sysfs iowait_boost_cap Christian Loehle
  2024-09-30 18:03   ` [RFC PATCH 6/8] cpufreq: intel_pstate: Remove iowait boost Rafael J. Wysocki
  2024-09-05  9:26 ` [RFC PATCH 7/8] cpufreq: Remove SCHED_CPUFREQ_IOWAIT update Christian Loehle
                   ` (2 subsequent siblings)
  8 siblings, 2 replies; 25+ messages in thread
From: Christian Loehle @ 2024-09-05  9:26 UTC (permalink / raw)
  To: linux-pm, linux-kernel, rafael, peterz
  Cc: juri.lelli, mingo, dietmar.eggemann, vschneid, vincent.guittot,
	Johannes.Thumshirn, adrian.hunter, ulf.hansson, bvanassche,
	andres, asml.silence, linux-block, io-uring, qyousef, dsmythies,
	axboe, Christian Loehle

Analogous to schedutil, remove iowait boost for the same reasons.

Signed-off-by: Christian Loehle <christian.loehle@arm.com>
---
 drivers/cpufreq/intel_pstate.c | 50 ++--------------------------------
 1 file changed, 3 insertions(+), 47 deletions(-)

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index c0278d023cfc..7f30b2569bb3 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -191,7 +191,6 @@ struct global_params {
  * @policy:		CPUFreq policy value
  * @update_util:	CPUFreq utility callback information
  * @update_util_set:	CPUFreq utility callback is set
- * @iowait_boost:	iowait-related boost fraction
  * @last_update:	Time of the last update.
  * @pstate:		Stores P state limits for this CPU
  * @vid:		Stores VID limits for this CPU
@@ -245,7 +244,6 @@ struct cpudata {
 	struct acpi_processor_performance acpi_perf_data;
 	bool valid_pss_table;
 #endif
-	unsigned int iowait_boost;
 	s16 epp_powersave;
 	s16 epp_policy;
 	s16 epp_default;
@@ -2136,28 +2134,7 @@ static inline void intel_pstate_update_util_hwp_local(struct cpudata *cpu,
 {
 	cpu->sample.time = time;
 
-	if (cpu->sched_flags & SCHED_CPUFREQ_IOWAIT) {
-		bool do_io = false;
-
-		cpu->sched_flags = 0;
-		/*
-		 * Set iowait_boost flag and update time. Since IO WAIT flag
-		 * is set all the time, we can't just conclude that there is
-		 * some IO bound activity is scheduled on this CPU with just
-		 * one occurrence. If we receive at least two in two
-		 * consecutive ticks, then we treat as boost candidate.
-		 */
-		if (time_before64(time, cpu->last_io_update + 2 * TICK_NSEC))
-			do_io = true;
-
-		cpu->last_io_update = time;
-
-		if (do_io)
-			intel_pstate_hwp_boost_up(cpu);
-
-	} else {
-		intel_pstate_hwp_boost_down(cpu);
-	}
+	intel_pstate_hwp_boost_down(cpu);
 }
 
 static inline void intel_pstate_update_util_hwp(struct update_util_data *data,
@@ -2240,9 +2217,6 @@ static inline int32_t get_target_pstate(struct cpudata *cpu)
 	busy_frac = div_fp(sample->mperf << cpu->aperf_mperf_shift,
 			   sample->tsc);
 
-	if (busy_frac < cpu->iowait_boost)
-		busy_frac = cpu->iowait_boost;
-
 	sample->busy_scaled = busy_frac * 100;
 
 	target = READ_ONCE(global.no_turbo) ?
@@ -2303,7 +2277,7 @@ static void intel_pstate_adjust_pstate(struct cpudata *cpu)
 		sample->aperf,
 		sample->tsc,
 		get_avg_frequency(cpu),
-		fp_toint(cpu->iowait_boost * 100));
+		0);
 }
 
 static void intel_pstate_update_util(struct update_util_data *data, u64 time,
@@ -2317,24 +2291,6 @@ static void intel_pstate_update_util(struct update_util_data *data, u64 time,
 		return;
 
 	delta_ns = time - cpu->last_update;
-	if (flags & SCHED_CPUFREQ_IOWAIT) {
-		/* Start over if the CPU may have been idle. */
-		if (delta_ns > TICK_NSEC) {
-			cpu->iowait_boost = ONE_EIGHTH_FP;
-		} else if (cpu->iowait_boost >= ONE_EIGHTH_FP) {
-			cpu->iowait_boost <<= 1;
-			if (cpu->iowait_boost > int_tofp(1))
-				cpu->iowait_boost = int_tofp(1);
-		} else {
-			cpu->iowait_boost = ONE_EIGHTH_FP;
-		}
-	} else if (cpu->iowait_boost) {
-		/* Clear iowait_boost if the CPU may have been idle. */
-		if (delta_ns > TICK_NSEC)
-			cpu->iowait_boost = 0;
-		else
-			cpu->iowait_boost >>= 1;
-	}
 	cpu->last_update = time;
 	delta_ns = time - cpu->sample.time;
 	if ((s64)delta_ns < INTEL_PSTATE_SAMPLING_INTERVAL)
@@ -2832,7 +2788,7 @@ static void intel_cpufreq_trace(struct cpudata *cpu, unsigned int trace_type, in
 		sample->aperf,
 		sample->tsc,
 		get_avg_frequency(cpu),
-		fp_toint(cpu->iowait_boost * 100));
+		0);
 }
 
 static void intel_cpufreq_hwp_update(struct cpudata *cpu, u32 min, u32 max,
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [RFC PATCH 7/8] cpufreq: Remove SCHED_CPUFREQ_IOWAIT update
  2024-09-05  9:26 [RFT RFC PATCH 0/8] cpufreq: cpuidle: Remove iowait behaviour Christian Loehle
                   ` (5 preceding siblings ...)
  2024-09-05  9:26 ` [RFC PATCH 6/8] cpufreq: intel_pstate: " Christian Loehle
@ 2024-09-05  9:26 ` Christian Loehle
  2024-09-05  9:26 ` [RFC PATCH 8/8] io_uring: Do not set iowait before sleeping Christian Loehle
  2024-09-05 12:31 ` [RFT RFC PATCH 0/8] cpufreq: cpuidle: Remove iowait behaviour Christian Loehle
  8 siblings, 0 replies; 25+ messages in thread
From: Christian Loehle @ 2024-09-05  9:26 UTC (permalink / raw)
  To: linux-pm, linux-kernel, rafael, peterz
  Cc: juri.lelli, mingo, dietmar.eggemann, vschneid, vincent.guittot,
	Johannes.Thumshirn, adrian.hunter, ulf.hansson, bvanassche,
	andres, asml.silence, linux-block, io-uring, qyousef, dsmythies,
	axboe, Christian Loehle

Neither intel_pstate nor schedutil care for the flag anymore, so
remove the update and flag definition.

Signed-off-by: Christian Loehle <christian.loehle@arm.com>
---
 include/linux/sched/cpufreq.h | 2 --
 kernel/sched/fair.c           | 8 --------
 2 files changed, 10 deletions(-)

diff --git a/include/linux/sched/cpufreq.h b/include/linux/sched/cpufreq.h
index bdd31ab93bc5..d4af813d3126 100644
--- a/include/linux/sched/cpufreq.h
+++ b/include/linux/sched/cpufreq.h
@@ -8,8 +8,6 @@
  * Interface between cpufreq drivers and the scheduler:
  */
 
-#define SCHED_CPUFREQ_IOWAIT	(1U << 0)
-
 #ifdef CONFIG_CPU_FREQ
 struct cpufreq_policy;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9057584ec06d..5cae0e5619aa 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6759,14 +6759,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	 */
 	util_est_enqueue(&rq->cfs, p);
 
-	/*
-	 * If in_iowait is set, the code below may not trigger any cpufreq
-	 * utilization updates, so do it here explicitly with the IOWAIT flag
-	 * passed.
-	 */
-	if (p->in_iowait)
-		cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);
-
 	for_each_sched_entity(se) {
 		if (se->on_rq)
 			break;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [RFC PATCH 8/8] io_uring: Do not set iowait before sleeping
  2024-09-05  9:26 [RFT RFC PATCH 0/8] cpufreq: cpuidle: Remove iowait behaviour Christian Loehle
                   ` (6 preceding siblings ...)
  2024-09-05  9:26 ` [RFC PATCH 7/8] cpufreq: Remove SCHED_CPUFREQ_IOWAIT update Christian Loehle
@ 2024-09-05  9:26 ` Christian Loehle
  2024-09-05 12:31 ` [RFT RFC PATCH 0/8] cpufreq: cpuidle: Remove iowait behaviour Christian Loehle
  8 siblings, 0 replies; 25+ messages in thread
From: Christian Loehle @ 2024-09-05  9:26 UTC (permalink / raw)
  To: linux-pm, linux-kernel, rafael, peterz
  Cc: juri.lelli, mingo, dietmar.eggemann, vschneid, vincent.guittot,
	Johannes.Thumshirn, adrian.hunter, ulf.hansson, bvanassche,
	andres, asml.silence, linux-block, io-uring, qyousef, dsmythies,
	axboe, Christian Loehle

Setting in_iowait was introduced in commit
8a796565cec3 ("io_uring: Use io_schedule* in cqring wait")
to tackle a perf regression that was caused by menu taking iowait into
account for synchronous IO and thus not selecting deeper states like in
the io_uring counterpart.
That behaviour is gone, so the workaround can be removed.

Signed-off-by: Christian Loehle <christian.loehle@arm.com>
---
 io_uring/io_uring.c | 17 -----------------
 1 file changed, 17 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 3942db160f18..c819d40bdcf0 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -2341,15 +2341,6 @@ int io_run_task_work_sig(struct io_ring_ctx *ctx)
 	return 0;
 }
 
-static bool current_pending_io(void)
-{
-	struct io_uring_task *tctx = current->io_uring;
-
-	if (!tctx)
-		return false;
-	return percpu_counter_read_positive(&tctx->inflight);
-}
-
 /* when returns >0, the caller should retry */
 static inline int io_cqring_wait_schedule(struct io_ring_ctx *ctx,
 					  struct io_wait_queue *iowq)
@@ -2367,19 +2358,11 @@ static inline int io_cqring_wait_schedule(struct io_ring_ctx *ctx,
 	if (unlikely(io_should_wake(iowq)))
 		return 0;
 
-	/*
-	 * Mark us as being in io_wait if we have pending requests, so cpufreq
-	 * can take into account that the task is waiting for IO - turns out
-	 * to be important for low QD IO.
-	 */
-	if (current_pending_io())
-		current->in_iowait = 1;
 	ret = 0;
 	if (iowq->timeout == KTIME_MAX)
 		schedule();
 	else if (!schedule_hrtimeout(&iowq->timeout, HRTIMER_MODE_ABS))
 		ret = -ETIME;
-	current->in_iowait = 0;
 	return ret;
 }
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [RFT RFC PATCH 0/8] cpufreq: cpuidle: Remove iowait behaviour
  2024-09-05  9:26 [RFT RFC PATCH 0/8] cpufreq: cpuidle: Remove iowait behaviour Christian Loehle
                   ` (7 preceding siblings ...)
  2024-09-05  9:26 ` [RFC PATCH 8/8] io_uring: Do not set iowait before sleeping Christian Loehle
@ 2024-09-05 12:31 ` Christian Loehle
  8 siblings, 0 replies; 25+ messages in thread
From: Christian Loehle @ 2024-09-05 12:31 UTC (permalink / raw)
  To: linux-pm, linux-kernel, rafael, peterz
  Cc: juri.lelli, mingo, dietmar.eggemann, vschneid, vincent.guittot,
	Johannes.Thumshirn, adrian.hunter, ulf.hansson, bvanassche,
	andres, asml.silence, linux-block, io-uring, qyousef, dsmythies,
	axboe

On 9/5/24 10:26, Christian Loehle wrote:
> I wanted to share my current status after working on the schedutil
> iowait boost issue for a while now. This is what I consider the best
> solution, happy for anyone to share thoughts and test results (it's
> simply impossible to cover them all).
> I'm hoping to remove some (bad) heuristics that have been in the kernel
> for a long time and are seemingly impossible to evolve. Since the
> introduction of these heuristics IO workloads have changed and those
> heuristics can be removed while only really affecting synthetic
> benchmarks.

Lots of related discussion is also here:
[PATCHSET v4 0/4] Split iowait into two states
https://lore.kernel.org/lkml/20240416121526.67022-1-axboe@kernel.dk/
[PATCHSET v6 0/4] Split iowait into two states
https://lore.kernel.org/lkml/20240819154259.215504-1-axboe@kernel.dk/

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [RFC PATCH] TEST: cpufreq: intel_pstate: sysfs iowait_boost_cap
  2024-09-05  9:26 ` [RFC PATCH 6/8] cpufreq: intel_pstate: " Christian Loehle
@ 2024-09-12 11:22   ` Christian Loehle
  2024-09-30 18:03   ` [RFC PATCH 6/8] cpufreq: intel_pstate: Remove iowait boost Rafael J. Wysocki
  1 sibling, 0 replies; 25+ messages in thread
From: Christian Loehle @ 2024-09-12 11:22 UTC (permalink / raw)
  To: linux-pm, linux-kernel, rafael, peterz
  Cc: juri.lelli, mingo, dietmar.eggemann, vschneid, vincent.guittot,
	Johannes.Thumshirn, adrian.hunter, ulf.hansson, bvanassche,
	andres, asml.silence, linux-block, io-uring, qyousef, dsmythies,
	axboe

For non-HWP systems, rework iowait boost to be linear and add
the sysfs knob iowait_boost_cap to limit the maximum boost in
8 steps.

I don't see a good way to translate this to HWP, as the
boost applied isn't as static as it is for non-HWP, but there
is already the dynamic_hwp_boost sysfs to enable/disable
completely.

Signed-off-by: Christian Loehle <christian.loehle@arm.com>
---
 drivers/cpufreq/intel_pstate.c | 39 ++++++++++++++++++++++++++++++++--
 1 file changed, 37 insertions(+), 2 deletions(-)

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index c0278d023cfc..6882d8c74e61 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -183,6 +183,7 @@ struct global_params {
 	bool turbo_disabled;
 	int max_perf_pct;
 	int min_perf_pct;
+	unsigned int iowait_boost_cap;
 };
 
 /**
@@ -1444,6 +1445,30 @@ static ssize_t store_min_perf_pct(struct kobject *a, struct kobj_attribute *b,
 	return count;
 }
 
+static ssize_t store_iowait_boost_cap(struct kobject *a, struct kobj_attribute *b,
+				  const char *buf, size_t count)
+{
+	unsigned int input;
+	int ret;
+
+	ret = sscanf(buf, "%u", &input);
+	if (ret != 1)
+		return -EINVAL;
+
+	mutex_lock(&intel_pstate_driver_lock);
+
+	if (!intel_pstate_driver) {
+		mutex_unlock(&intel_pstate_driver_lock);
+		return -EAGAIN;
+	}
+
+	global.iowait_boost_cap = clamp_t(int, input, 0, 8);
+
+	mutex_unlock(&intel_pstate_driver_lock);
+
+	return count;
+}
+
 static ssize_t show_hwp_dynamic_boost(struct kobject *kobj,
 				struct kobj_attribute *attr, char *buf)
 {
@@ -1497,6 +1522,7 @@ static ssize_t store_energy_efficiency(struct kobject *a, struct kobj_attribute
 
 show_one(max_perf_pct, max_perf_pct);
 show_one(min_perf_pct, min_perf_pct);
+show_one(iowait_boost_cap, iowait_boost_cap);
 
 define_one_global_rw(status);
 define_one_global_rw(no_turbo);
@@ -1506,6 +1532,7 @@ define_one_global_ro(turbo_pct);
 define_one_global_ro(num_pstates);
 define_one_global_rw(hwp_dynamic_boost);
 define_one_global_rw(energy_efficiency);
+define_one_global_rw(iowait_boost_cap);
 
 static struct attribute *intel_pstate_attributes[] = {
 	&status.attr,
@@ -1562,6 +1589,9 @@ static void __init intel_pstate_sysfs_expose_params(void)
 		rc = sysfs_create_file(intel_pstate_kobject, &energy_efficiency.attr);
 		WARN_ON(rc);
 	}
+
+	rc = sysfs_create_file(intel_pstate_kobject, &iowait_boost_cap.attr);
+	WARN_ON(rc);
 }
 
 static void __init intel_pstate_sysfs_remove(void)
@@ -2322,18 +2352,23 @@ static void intel_pstate_update_util(struct update_util_data *data, u64 time,
 		if (delta_ns > TICK_NSEC) {
 			cpu->iowait_boost = ONE_EIGHTH_FP;
 		} else if (cpu->iowait_boost >= ONE_EIGHTH_FP) {
-			cpu->iowait_boost <<= 1;
+			cpu->iowait_boost += ONE_EIGHTH_FP;
 			if (cpu->iowait_boost > int_tofp(1))
 				cpu->iowait_boost = int_tofp(1);
 		} else {
 			cpu->iowait_boost = ONE_EIGHTH_FP;
 		}
+		if (cpu->iowait_boost > global.iowait_boost_cap * ONE_EIGHTH_FP)
+			cpu->iowait_boost = global.iowait_boost_cap * ONE_EIGHTH_FP;
 	} else if (cpu->iowait_boost) {
 		/* Clear iowait_boost if the CPU may have been idle. */
 		if (delta_ns > TICK_NSEC)
 			cpu->iowait_boost = 0;
-		else
+		else {
 			cpu->iowait_boost >>= 1;
+			if (cpu->iowait_boost < ONE_EIGHTH_FP)
+				cpu->iowait_boost = 0;
+		}
 	}
 	cpu->last_update = time;
 	delta_ns = time - cpu->sample.time;
-- 
2.25.1

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH 1/8] cpuidle: menu: Remove iowait influence
  2024-09-05  9:26 ` [RFC PATCH 1/8] cpuidle: menu: Remove iowait influence Christian Loehle
@ 2024-09-30 14:58   ` Rafael J. Wysocki
  0 siblings, 0 replies; 25+ messages in thread
From: Rafael J. Wysocki @ 2024-09-30 14:58 UTC (permalink / raw)
  To: Christian Loehle
  Cc: linux-pm, linux-kernel, rafael, peterz, juri.lelli, mingo,
	dietmar.eggemann, vschneid, vincent.guittot, Johannes.Thumshirn,
	adrian.hunter, ulf.hansson, bvanassche, andres, asml.silence,
	linux-block, io-uring, qyousef, dsmythies, axboe

On Thu, Sep 5, 2024 at 11:27 AM Christian Loehle
<christian.loehle@arm.com> wrote:
>
> Remove CPU iowaiters influence on idle state selection.
> Remove the menu notion of performance multiplier which increased with
> the number of tasks that went to iowait sleep on this CPU and haven't
> woken up yet.
>
> Relying on iowait for cpuidle is problematic for a few reasons:
> 1. There is no guarantee that an iowaiting task will wake up on the
> same CPU.
> 2. The task being in iowait says nothing about the idle duration, we
> could be selecting shallower states for a long time.
> 3. The task being in iowait doesn't always imply a performance hit
> with increased latency.
> 4. If there is such a performance hit, the number of iowaiting tasks
> doesn't directly correlate.
> 5. The definition of iowait altogether is vague at best, it is
> sprinkled across kernel code.
>
> Signed-off-by: Christian Loehle <christian.loehle@arm.com>

I promised feedback on this series.

As far as this particular patch is concerned, I generally agree with
all of the above, so I'm going to put it into linux-next right away
and see if anyone reports a problem with it.

> ---
>  drivers/cpuidle/governors/menu.c | 76 ++++----------------------------
>  1 file changed, 9 insertions(+), 67 deletions(-)
>
> diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
> index f3c9d49f0f2a..28363bfa3e4c 100644
> --- a/drivers/cpuidle/governors/menu.c
> +++ b/drivers/cpuidle/governors/menu.c
> @@ -19,7 +19,7 @@
>
>  #include "gov.h"
>
> -#define BUCKETS 12
> +#define BUCKETS 6
>  #define INTERVAL_SHIFT 3
>  #define INTERVALS (1UL << INTERVAL_SHIFT)
>  #define RESOLUTION 1024
> @@ -29,12 +29,11 @@
>  /*
>   * Concepts and ideas behind the menu governor
>   *
> - * For the menu governor, there are 3 decision factors for picking a C
> + * For the menu governor, there are 2 decision factors for picking a C
>   * state:
>   * 1) Energy break even point
> - * 2) Performance impact
> - * 3) Latency tolerance (from pmqos infrastructure)
> - * These three factors are treated independently.
> + * 2) Latency tolerance (from pmqos infrastructure)
> + * These two factors are treated independently.
>   *
>   * Energy break even point
>   * -----------------------
> @@ -75,30 +74,6 @@
>   * intervals and if the stand deviation of these 8 intervals is below a
>   * threshold value, we use the average of these intervals as prediction.
>   *
> - * Limiting Performance Impact
> - * ---------------------------
> - * C states, especially those with large exit latencies, can have a real
> - * noticeable impact on workloads, which is not acceptable for most sysadmins,
> - * and in addition, less performance has a power price of its own.
> - *
> - * As a general rule of thumb, menu assumes that the following heuristic
> - * holds:
> - *     The busier the system, the less impact of C states is acceptable
> - *
> - * This rule-of-thumb is implemented using a performance-multiplier:
> - * If the exit latency times the performance multiplier is longer than
> - * the predicted duration, the C state is not considered a candidate
> - * for selection due to a too high performance impact. So the higher
> - * this multiplier is, the longer we need to be idle to pick a deep C
> - * state, and thus the less likely a busy CPU will hit such a deep
> - * C state.
> - *
> - * Currently there is only one value determining the factor:
> - * 10 points are added for each process that is waiting for IO on this CPU.
> - * (This value was experimentally determined.)
> - * Utilization is no longer a factor as it was shown that it never contributed
> - * significantly to the performance multiplier in the first place.
> - *
>   */
>
>  struct menu_device {
> @@ -112,19 +87,10 @@ struct menu_device {
>         int             interval_ptr;
>  };
>
> -static inline int which_bucket(u64 duration_ns, unsigned int nr_iowaiters)
> +static inline int which_bucket(u64 duration_ns)
>  {
>         int bucket = 0;
>
> -       /*
> -        * We keep two groups of stats; one with no
> -        * IO pending, one without.
> -        * This allows us to calculate
> -        * E(duration)|iowait
> -        */
> -       if (nr_iowaiters)
> -               bucket = BUCKETS/2;
> -
>         if (duration_ns < 10ULL * NSEC_PER_USEC)
>                 return bucket;
>         if (duration_ns < 100ULL * NSEC_PER_USEC)
> @@ -138,19 +104,6 @@ static inline int which_bucket(u64 duration_ns, unsigned int nr_iowaiters)
>         return bucket + 5;
>  }
>
> -/*
> - * Return a multiplier for the exit latency that is intended
> - * to take performance requirements into account.
> - * The more performance critical we estimate the system
> - * to be, the higher this multiplier, and thus the higher
> - * the barrier to go to an expensive C state.
> - */
> -static inline int performance_multiplier(unsigned int nr_iowaiters)
> -{
> -       /* for IO wait tasks (per cpu!) we add 10x each */
> -       return 1 + 10 * nr_iowaiters;
> -}
> -
>  static DEFINE_PER_CPU(struct menu_device, menu_devices);
>
>  static void menu_update(struct cpuidle_driver *drv, struct cpuidle_device *dev);
> @@ -258,8 +211,6 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
>         struct menu_device *data = this_cpu_ptr(&menu_devices);
>         s64 latency_req = cpuidle_governor_latency_req(dev->cpu);
>         u64 predicted_ns;
> -       u64 interactivity_req;
> -       unsigned int nr_iowaiters;
>         ktime_t delta, delta_tick;
>         int i, idx;
>
> @@ -268,8 +219,6 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
>                 data->needs_update = 0;
>         }
>
> -       nr_iowaiters = nr_iowait_cpu(dev->cpu);
> -
>         /* Find the shortest expected idle interval. */
>         predicted_ns = get_typical_interval(data) * NSEC_PER_USEC;
>         if (predicted_ns > RESIDENCY_THRESHOLD_NS) {
> @@ -283,7 +232,7 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
>                 }
>
>                 data->next_timer_ns = delta;
> -               data->bucket = which_bucket(data->next_timer_ns, nr_iowaiters);
> +               data->bucket = which_bucket(data->next_timer_ns);
>
>                 /* Round up the result for half microseconds. */
>                 timer_us = div_u64((RESOLUTION * DECAY * NSEC_PER_USEC) / 2 +
> @@ -301,7 +250,7 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
>                  */
>                 data->next_timer_ns = KTIME_MAX;
>                 delta_tick = TICK_NSEC / 2;
> -               data->bucket = which_bucket(KTIME_MAX, nr_iowaiters);
> +               data->bucket = which_bucket(KTIME_MAX);
>         }
>
>         if (unlikely(drv->state_count <= 1 || latency_req == 0) ||
> @@ -328,15 +277,8 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
>                  */
>                 if (predicted_ns < TICK_NSEC)
>                         predicted_ns = data->next_timer_ns;
> -       } else {
> -               /*
> -                * Use the performance multiplier and the user-configurable
> -                * latency_req to determine the maximum exit latency.
> -                */
> -               interactivity_req = div64_u64(predicted_ns,
> -                                             performance_multiplier(nr_iowaiters));
> -               if (latency_req > interactivity_req)
> -                       latency_req = interactivity_req;
> +       } else if (latency_req > predicted_ns) {
> +               latency_req = predicted_ns;
>         }
>
>         /*
> --
> 2.34.1
>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH 2/8] cpuidle: Prefer teo over menu governor
  2024-09-05  9:26 ` [RFC PATCH 2/8] cpuidle: Prefer teo over menu governor Christian Loehle
@ 2024-09-30 15:06   ` Rafael J. Wysocki
  2024-09-30 16:12     ` Christian Loehle
  0 siblings, 1 reply; 25+ messages in thread
From: Rafael J. Wysocki @ 2024-09-30 15:06 UTC (permalink / raw)
  To: Christian Loehle
  Cc: linux-pm, linux-kernel, rafael, peterz, juri.lelli, mingo,
	dietmar.eggemann, vschneid, vincent.guittot, Johannes.Thumshirn,
	adrian.hunter, ulf.hansson, bvanassche, andres, asml.silence,
	linux-block, io-uring, qyousef, dsmythies, axboe

On Thu, Sep 5, 2024 at 11:27 AM Christian Loehle
<christian.loehle@arm.com> wrote:
>
> Since menu no longer has the interactivity boost teo works better
> overall, so make it the default.
>
> Signed-off-by: Christian Loehle <christian.loehle@arm.com>

I know that this isn't strictly related to the use of iowait in menu,
but I'd rather wait with this one until the previous change in menu
settles down.

Also it would be good to provide some numbers to support the "teo
works better overall" claim above.

> ---
>  drivers/cpuidle/Kconfig          | 5 +----
>  drivers/cpuidle/governors/menu.c | 2 +-
>  drivers/cpuidle/governors/teo.c  | 2 +-
>  3 files changed, 3 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/cpuidle/Kconfig b/drivers/cpuidle/Kconfig
> index cac5997dca50..ae67a464025a 100644
> --- a/drivers/cpuidle/Kconfig
> +++ b/drivers/cpuidle/Kconfig
> @@ -5,7 +5,7 @@ config CPU_IDLE
>         bool "CPU idle PM support"
>         default y if ACPI || PPC_PSERIES
>         select CPU_IDLE_GOV_LADDER if (!NO_HZ && !NO_HZ_IDLE)
> -       select CPU_IDLE_GOV_MENU if (NO_HZ || NO_HZ_IDLE) && !CPU_IDLE_GOV_TEO
> +       select CPU_IDLE_GOV_TEO if (NO_HZ || NO_HZ_IDLE) && !CPU_IDLE_GOV_MENU
>         help
>           CPU idle is a generic framework for supporting software-controlled
>           idle processor power management.  It includes modular cross-platform
> @@ -30,9 +30,6 @@ config CPU_IDLE_GOV_TEO
>           This governor implements a simplified idle state selection method
>           focused on timer events and does not do any interactivity boosting.
>
> -         Some workloads benefit from using it and it generally should be safe
> -         to use.  Say Y here if you are not happy with the alternatives.
> -
>  config CPU_IDLE_GOV_HALTPOLL
>         bool "Haltpoll governor (for virtualized systems)"
>         depends on KVM_GUEST
> diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
> index 28363bfa3e4c..c0ae5e98d6f1 100644
> --- a/drivers/cpuidle/governors/menu.c
> +++ b/drivers/cpuidle/governors/menu.c
> @@ -508,7 +508,7 @@ static int menu_enable_device(struct cpuidle_driver *drv,
>
>  static struct cpuidle_governor menu_governor = {
>         .name =         "menu",
> -       .rating =       20,
> +       .rating =       19,
>         .enable =       menu_enable_device,
>         .select =       menu_select,
>         .reflect =      menu_reflect,
> diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
> index f2992f92d8db..6c3cc39f285d 100644
> --- a/drivers/cpuidle/governors/teo.c
> +++ b/drivers/cpuidle/governors/teo.c
> @@ -537,7 +537,7 @@ static int teo_enable_device(struct cpuidle_driver *drv,
>
>  static struct cpuidle_governor teo_governor = {
>         .name =         "teo",
> -       .rating =       19,
> +       .rating =       20,
>         .enable =       teo_enable_device,
>         .select =       teo_select,
>         .reflect =      teo_reflect,
> --
> 2.34.1
>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH 2/8] cpuidle: Prefer teo over menu governor
  2024-09-30 15:06   ` Rafael J. Wysocki
@ 2024-09-30 16:12     ` Christian Loehle
  2024-09-30 16:42       ` Rafael J. Wysocki
  0 siblings, 1 reply; 25+ messages in thread
From: Christian Loehle @ 2024-09-30 16:12 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: linux-pm, linux-kernel, peterz, juri.lelli, mingo,
	dietmar.eggemann, vschneid, vincent.guittot, Johannes.Thumshirn,
	adrian.hunter, ulf.hansson, bvanassche, andres, asml.silence,
	linux-block, io-uring, qyousef, dsmythies, axboe

On 9/30/24 16:06, Rafael J. Wysocki wrote:
> On Thu, Sep 5, 2024 at 11:27 AM Christian Loehle
> <christian.loehle@arm.com> wrote:
>>
>> Since menu no longer has the interactivity boost teo works better
>> overall, so make it the default.
>>
>> Signed-off-by: Christian Loehle <christian.loehle@arm.com>

First of all thank you for taking a look.

> 
> I know that this isn't strictly related to the use of iowait in menu,
> but I'd rather wait with this one until the previous change in menu
> settles down.

Sure, I will look at any regressions that are reported, although "teo
is performing better/worse/eqyal" would already be a pretty helpful hint
and for me personally, if they do both perform badly I find debugging
teo way easier.

> 
> Also it would be good to provide some numbers to support the "teo
> works better overall" claim above.

Definitely, there are some in the overall cover-letter if you just
compare equivalent menu/teo columns, but with the very fragmented
cpuidle world this isn't anywhere near enough to back up that claim.
We have found it to provide better results in both mobile and infra/
server workloads on common arm64 platforms.

That being said, I don't mind menu being around or even the default
per-se, but would encourage anyone to give teo a try.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH 5/8] cpufreq/schedutil: Remove iowait boost
  2024-09-05  9:26 ` [RFC PATCH 5/8] cpufreq/schedutil: Remove iowait boost Christian Loehle
@ 2024-09-30 16:34   ` Rafael J. Wysocki
  2024-10-03  9:10     ` Christian Loehle
  2024-10-03  9:47     ` Quentin Perret
  0 siblings, 2 replies; 25+ messages in thread
From: Rafael J. Wysocki @ 2024-09-30 16:34 UTC (permalink / raw)
  To: Christian Loehle
  Cc: linux-pm, linux-kernel, rafael, peterz, juri.lelli, mingo,
	dietmar.eggemann, vschneid, vincent.guittot, Johannes.Thumshirn,
	adrian.hunter, ulf.hansson, bvanassche, andres, asml.silence,
	linux-block, io-uring, qyousef, dsmythies, axboe

On Thu, Sep 5, 2024 at 11:27 AM Christian Loehle
<christian.loehle@arm.com> wrote:
>
> iowait boost in schedutil was introduced by
> commit ("21ca6d2c52f8 cpufreq: schedutil: Add iowait boosting").
> with it more or less following intel_pstate's approach to increase
> frequency after an iowait wakeup.
> Behaviour that is piggy-backed onto iowait boost is problematic
> due to a lot of reasons, so remove it.
>
> For schedutil specifically these are some of the reasons:
> 1. Boosting is applied even in scenarios where it doesn't improve
> throughput.

Well, I wouldn't argue this way because it is kind of like saying that
air conditioning is used even when it doesn't really help.  It is
sometimes hard to know in advance whether or not it will help though.

> 2. The boost is not accounted for in EAS: a) feec() will only consider
>  the actual task utilization for task placement, but another CPU might
>  be more energy-efficient at that capacity than the boosted one.)
>  b) When placing a non-IO task while a CPU is boosted compute_energy()
>  assumes a lower OPP than what is actually applied. This leads to
>  wrong EAS decisions.

That's a very good point IMV and so is the one regarding UCLAMP_MAX (8
in your list).

If the goal is to set the adequate performance for a given utilization
level (either actual or prescribed), boosting doesn't really play well
with this and it shouldn't be used at least in these cases.

> 3. Actual IO heavy workloads are hardly distinguished from infrequent
> in_iowait wakeups.

Do infrequent in_iowait wakeups really cause the boosting to be
applied at full swing?

> 4. The boost isn't accounted for in task placement.

I'm not sure what exactly this means.  "Big" vs "little" or something else?

> 5. The boost isn't associated with a task, it therefore lingers on the
> rq even after the responsible task has migrated / stopped.

Fair enough, but this is rather a problem with the implementation of
boosting and not with the basic idea of it.

> 6. The boost isn't associated with a task, it therefore needs to ramp
> up again when migrated.

Well, that again is somewhat implementation-related IMV, and it need
not be problematic in principle.  Namely, if a task migrates and it is
not the only one in the "new" CPUs runqueue, and the other tasks in
there don't use in_iowait, maybe it's better to not boost it?

It also means that boosting is not very consistent, though, which is a
valid point.

> 7. Since schedutil doesn't know which task is getting woken up,
> multiple unrelated in_iowait tasks lead to boosting.

Well, that's by design: it boosts, when "there is enough IO pressure
in the runqueue", so to speak.

Basically, it is a departure from the "make performance follow
utilization" general idea and it is based on the observation that in
some cases performance can be improved by taking additional
information into account.

It is also about pure performance, not about energy efficiency.

> 8. Boosting is hard to control with UCLAMP_MAX (which is only active
> when the task is on the rq, which for boosted tasks is usually not
> the case for most of the time).
>
> One benefit of schedutil specifically is the reliance on the
> scheduler's utilization signals, which have evolved a lot since it's
> original introduction. Some cases that benefitted from iowait boosting
> in the past can now be covered by e.g. util_est.

And it would be good to give some examples of this.

IMV you have a clean-cut argument in the EAS and UCLAMP_MAX cases, but
apart from that it is all a bit hand-wavy.

> Signed-off-by: Christian Loehle <christian.loehle@arm.com>
> ---
>  kernel/sched/cpufreq_schedutil.c | 181 +------------------------------
>  1 file changed, 3 insertions(+), 178 deletions(-)
>
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 5324f07fc93a..55b8b8ba7238 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -6,12 +6,9 @@
>   * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>   */
>
> -#define IOWAIT_BOOST_MIN       (SCHED_CAPACITY_SCALE / 8)
> -
>  struct sugov_tunables {
>         struct gov_attr_set     attr_set;
>         unsigned int            rate_limit_us;
> -       unsigned int            iowait_boost_cap;
>  };
>
>  struct sugov_policy {
> @@ -36,8 +33,6 @@ struct sugov_policy {
>
>         bool                    limits_changed;
>         bool                    need_freq_update;
> -
> -       unsigned int            iowait_boost_cap;
>  };
>
>  struct sugov_cpu {
> @@ -45,10 +40,6 @@ struct sugov_cpu {
>         struct sugov_policy     *sg_policy;
>         unsigned int            cpu;
>
> -       bool                    iowait_boost_pending;
> -       unsigned int            iowait_boost;
> -       u64                     last_update;
> -
>         unsigned long           util;
>         unsigned long           bw_min;
>
> @@ -198,137 +189,15 @@ unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
>         return max(min, max);
>  }
>
> -static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
> +static void sugov_get_util(struct sugov_cpu *sg_cpu)
>  {
>         unsigned long min, max, util = cpu_util_cfs_boost(sg_cpu->cpu);
>
>         util = effective_cpu_util(sg_cpu->cpu, util, &min, &max);
> -       util = max(util, boost);
>         sg_cpu->bw_min = min;
>         sg_cpu->util = sugov_effective_cpu_perf(sg_cpu->cpu, util, min, max);
>  }
>
> -/**
> - * sugov_iowait_reset() - Reset the IO boost status of a CPU.
> - * @sg_cpu: the sugov data for the CPU to boost
> - * @time: the update time from the caller
> - * @set_iowait_boost: true if an IO boost has been requested
> - *
> - * The IO wait boost of a task is disabled after a tick since the last update
> - * of a CPU. If a new IO wait boost is requested after more then a tick, then
> - * we enable the boost starting from IOWAIT_BOOST_MIN, which improves energy
> - * efficiency by ignoring sporadic wakeups from IO.
> - */
> -static bool sugov_iowait_reset(struct sugov_cpu *sg_cpu, u64 time,
> -                              bool set_iowait_boost)
> -{
> -       s64 delta_ns = time - sg_cpu->last_update;
> -
> -       /* Reset boost only if a tick has elapsed since last request */
> -       if (delta_ns <= TICK_NSEC)
> -               return false;
> -
> -       sg_cpu->iowait_boost = set_iowait_boost ? IOWAIT_BOOST_MIN : 0;
> -       sg_cpu->iowait_boost_pending = set_iowait_boost;
> -
> -       return true;
> -}
> -
> -/**
> - * sugov_iowait_boost() - Updates the IO boost status of a CPU.
> - * @sg_cpu: the sugov data for the CPU to boost
> - * @time: the update time from the caller
> - * @flags: SCHED_CPUFREQ_IOWAIT if the task is waking up after an IO wait
> - *
> - * Each time a task wakes up after an IO operation, the CPU utilization can be
> - * boosted to a certain utilization which doubles at each "frequent and
> - * successive" wakeup from IO, ranging from IOWAIT_BOOST_MIN to the utilization
> - * of the maximum OPP.
> - *
> - * To keep doubling, an IO boost has to be requested at least once per tick,
> - * otherwise we restart from the utilization of the minimum OPP.
> - */
> -static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
> -                              unsigned int flags)
> -{
> -       bool set_iowait_boost = flags & SCHED_CPUFREQ_IOWAIT;
> -
> -       /* Reset boost if the CPU appears to have been idle enough */
> -       if (sg_cpu->iowait_boost &&
> -           sugov_iowait_reset(sg_cpu, time, set_iowait_boost))
> -               return;
> -
> -       /* Boost only tasks waking up after IO */
> -       if (!set_iowait_boost)
> -               return;
> -
> -       /* Ensure boost doubles only one time at each request */
> -       if (sg_cpu->iowait_boost_pending)
> -               return;
> -       sg_cpu->iowait_boost_pending = true;
> -
> -       /* Double the boost at each request */
> -       if (sg_cpu->iowait_boost) {
> -               sg_cpu->iowait_boost =
> -                       min_t(unsigned int,
> -                             sg_cpu->iowait_boost + IOWAIT_BOOST_MIN, SCHED_CAPACITY_SCALE);
> -               return;
> -       }
> -
> -       /* First wakeup after IO: start with minimum boost */
> -       sg_cpu->iowait_boost = IOWAIT_BOOST_MIN;
> -}
> -
> -/**
> - * sugov_iowait_apply() - Apply the IO boost to a CPU.
> - * @sg_cpu: the sugov data for the cpu to boost
> - * @time: the update time from the caller
> - * @max_cap: the max CPU capacity
> - *
> - * A CPU running a task which woken up after an IO operation can have its
> - * utilization boosted to speed up the completion of those IO operations.
> - * The IO boost value is increased each time a task wakes up from IO, in
> - * sugov_iowait_apply(), and it's instead decreased by this function,
> - * each time an increase has not been requested (!iowait_boost_pending).
> - *
> - * A CPU which also appears to have been idle for at least one tick has also
> - * its IO boost utilization reset.
> - *
> - * This mechanism is designed to boost high frequently IO waiting tasks, while
> - * being more conservative on tasks which does sporadic IO operations.
> - */
> -static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
> -                              unsigned long max_cap)
> -{
> -       /* No boost currently required */
> -       if (!sg_cpu->iowait_boost)
> -               return 0;
> -
> -       /* Reset boost if the CPU appears to have been idle enough */
> -       if (sugov_iowait_reset(sg_cpu, time, false))
> -               return 0;
> -
> -       if (!sg_cpu->iowait_boost_pending) {
> -               /*
> -                * No boost pending; reduce the boost value.
> -                */
> -               sg_cpu->iowait_boost -= IOWAIT_BOOST_MIN;
> -               if (!sg_cpu->iowait_boost)
> -                       return 0;
> -       }
> -
> -       sg_cpu->iowait_boost_pending = false;
> -
> -       if (sg_cpu->iowait_boost > sg_cpu->sg_policy->iowait_boost_cap)
> -               sg_cpu->iowait_boost = sg_cpu->sg_policy->iowait_boost_cap;
> -
> -       /*
> -        * sg_cpu->util is already in capacity scale; convert iowait_boost
> -        * into the same scale so we can compare.
> -        */
> -       return (sg_cpu->iowait_boost * max_cap) >> SCHED_CAPACITY_SHIFT;
> -}
> -
>  #ifdef CONFIG_NO_HZ_COMMON
>  static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
>  {
> @@ -356,18 +225,12 @@ static inline bool sugov_update_single_common(struct sugov_cpu *sg_cpu,
>                                               u64 time, unsigned long max_cap,
>                                               unsigned int flags)
>  {
> -       unsigned long boost;
> -
> -       sugov_iowait_boost(sg_cpu, time, flags);
> -       sg_cpu->last_update = time;
> -
>         ignore_dl_rate_limit(sg_cpu);
>
>         if (!sugov_should_update_freq(sg_cpu->sg_policy, time))
>                 return false;
>
> -       boost = sugov_iowait_apply(sg_cpu, time, max_cap);
> -       sugov_get_util(sg_cpu, boost);
> +       sugov_get_util(sg_cpu);
>
>         return true;
>  }
> @@ -468,11 +331,8 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
>
>         for_each_cpu(j, policy->cpus) {
>                 struct sugov_cpu *j_sg_cpu = &per_cpu(sugov_cpu, j);
> -               unsigned long boost;
> -
> -               boost = sugov_iowait_apply(j_sg_cpu, time, max_cap);
> -               sugov_get_util(j_sg_cpu, boost);
>
> +               sugov_get_util(j_sg_cpu);
>                 util = max(j_sg_cpu->util, util);
>         }
>
> @@ -488,9 +348,6 @@ sugov_update_shared(struct update_util_data *hook, u64 time, unsigned int flags)
>
>         raw_spin_lock(&sg_policy->update_lock);
>
> -       sugov_iowait_boost(sg_cpu, time, flags);
> -       sg_cpu->last_update = time;
> -
>         ignore_dl_rate_limit(sg_cpu);
>
>         if (sugov_should_update_freq(sg_policy, time)) {
> @@ -560,14 +417,6 @@ static ssize_t rate_limit_us_show(struct gov_attr_set *attr_set, char *buf)
>         return sprintf(buf, "%u\n", tunables->rate_limit_us);
>  }
>
> -
> -static ssize_t iowait_boost_cap_show(struct gov_attr_set *attr_set, char *buf)
> -{
> -       struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
> -
> -       return sprintf(buf, "%u\n", tunables->iowait_boost_cap);
> -}
> -
>  static ssize_t
>  rate_limit_us_store(struct gov_attr_set *attr_set, const char *buf, size_t count)
>  {
> @@ -586,30 +435,10 @@ rate_limit_us_store(struct gov_attr_set *attr_set, const char *buf, size_t count
>         return count;
>  }
>
> -static ssize_t
> -iowait_boost_cap_store(struct gov_attr_set *attr_set, const char *buf, size_t count)
> -{
> -       struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
> -       struct sugov_policy *sg_policy;
> -       unsigned int iowait_boost_cap;
> -
> -       if (kstrtouint(buf, 10, &iowait_boost_cap))
> -               return -EINVAL;
> -
> -       tunables->iowait_boost_cap = iowait_boost_cap;
> -
> -       list_for_each_entry(sg_policy, &attr_set->policy_list, tunables_hook)
> -               sg_policy->iowait_boost_cap = iowait_boost_cap;
> -
> -       return count;
> -}
> -
>  static struct governor_attr rate_limit_us = __ATTR_RW(rate_limit_us);
> -static struct governor_attr iowait_boost_cap = __ATTR_RW(iowait_boost_cap);
>
>  static struct attribute *sugov_attrs[] = {
>         &rate_limit_us.attr,
> -       &iowait_boost_cap.attr,
>         NULL
>  };
>  ATTRIBUTE_GROUPS(sugov);
> @@ -799,8 +628,6 @@ static int sugov_init(struct cpufreq_policy *policy)
>
>         tunables->rate_limit_us = cpufreq_policy_transition_delay_us(policy);
>
> -       tunables->iowait_boost_cap = SCHED_CAPACITY_SCALE;
> -
>         policy->governor_data = sg_policy;
>         sg_policy->tunables = tunables;
>
> @@ -870,8 +697,6 @@ static int sugov_start(struct cpufreq_policy *policy)
>         sg_policy->limits_changed               = false;
>         sg_policy->cached_raw_freq              = 0;
>
> -       sg_policy->iowait_boost_cap             = SCHED_CAPACITY_SCALE;
> -
>         sg_policy->need_freq_update = cpufreq_driver_test_flags(CPUFREQ_NEED_UPDATE_LIMITS);
>
>         if (policy_is_shared(policy))
> --
> 2.34.1
>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH 2/8] cpuidle: Prefer teo over menu governor
  2024-09-30 16:12     ` Christian Loehle
@ 2024-09-30 16:42       ` Rafael J. Wysocki
  0 siblings, 0 replies; 25+ messages in thread
From: Rafael J. Wysocki @ 2024-09-30 16:42 UTC (permalink / raw)
  To: Christian Loehle
  Cc: Rafael J. Wysocki, linux-pm, linux-kernel, peterz, juri.lelli,
	mingo, dietmar.eggemann, vschneid, vincent.guittot,
	Johannes.Thumshirn, adrian.hunter, ulf.hansson, bvanassche,
	andres, asml.silence, linux-block, io-uring, qyousef, dsmythies,
	axboe

On Mon, Sep 30, 2024 at 6:12 PM Christian Loehle
<christian.loehle@arm.com> wrote:
>
> On 9/30/24 16:06, Rafael J. Wysocki wrote:
> > On Thu, Sep 5, 2024 at 11:27 AM Christian Loehle
> > <christian.loehle@arm.com> wrote:
> >>
> >> Since menu no longer has the interactivity boost teo works better
> >> overall, so make it the default.
> >>
> >> Signed-off-by: Christian Loehle <christian.loehle@arm.com>
>
> First of all thank you for taking a look.
>
> >
> > I know that this isn't strictly related to the use of iowait in menu,
> > but I'd rather wait with this one until the previous change in menu
> > settles down.
>
> Sure, I will look at any regressions that are reported, although "teo
> is performing better/worse/eqyal" would already be a pretty helpful hint
> and for me personally, if they do both perform badly I find debugging
> teo way easier.
>
> >
> > Also it would be good to provide some numbers to support the "teo
> > works better overall" claim above.
>
> Definitely, there are some in the overall cover-letter if you just
> compare equivalent menu/teo columns, but with the very fragmented
> cpuidle world this isn't anywhere near enough to back up that claim.
> We have found it to provide better results in both mobile and infra/
> server workloads on common arm64 platforms.

So why don't you add some numbers to the patch changelog?

If you can at least demonstrate that they are on par with each other
in some relevant benchmarks, then you can use the argument of teo
being more straightforward and so easier to reason about.

> That being said, I don't mind menu being around or even the default
> per-se, but would encourage anyone to give teo a try.

Fair enough.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH 6/8] cpufreq: intel_pstate: Remove iowait boost
  2024-09-05  9:26 ` [RFC PATCH 6/8] cpufreq: intel_pstate: " Christian Loehle
  2024-09-12 11:22   ` [RFC PATCH] TEST: cpufreq: intel_pstate: sysfs iowait_boost_cap Christian Loehle
@ 2024-09-30 18:03   ` Rafael J. Wysocki
  2024-09-30 20:35     ` srinivas pandruvada
  1 sibling, 1 reply; 25+ messages in thread
From: Rafael J. Wysocki @ 2024-09-30 18:03 UTC (permalink / raw)
  To: Christian Loehle
  Cc: linux-pm, linux-kernel, rafael, peterz, juri.lelli, mingo,
	dietmar.eggemann, vschneid, vincent.guittot, Johannes.Thumshirn,
	adrian.hunter, ulf.hansson, bvanassche, andres, asml.silence,
	linux-block, io-uring, qyousef, dsmythies, axboe,
	Srinivas Pandruvada

+Srinivas who can say more about the reasons why iowait boosting makes
a difference for intel_pstate than I do.

On Thu, Sep 5, 2024 at 11:27 AM Christian Loehle
<christian.loehle@arm.com> wrote:
>
> Analogous to schedutil, remove iowait boost for the same reasons.

Well, first of all, iowait boosting was added to intel_pstate to help
some workloads that otherwise were underperforming.  I'm not sure if
you can simply remove it without introducing performance regressions
in those workloads.

While you can argue that it is not useful in schedutil any more due to
the improved scheduler input for it, you can hardly extend that
argument to intel_pstate because it doesn't use all of the scheduler
input used by schedutil.

Also, the EAS and UCLAMP_MAX arguments are not applicable to
intel_pstate because it doesn't support any of them.

This applies to the ondemand cpufreq governor either.


> Signed-off-by: Christian Loehle <christian.loehle@arm.com>
> ---
>  drivers/cpufreq/intel_pstate.c | 50 ++--------------------------------
>  1 file changed, 3 insertions(+), 47 deletions(-)
>
> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> index c0278d023cfc..7f30b2569bb3 100644
> --- a/drivers/cpufreq/intel_pstate.c
> +++ b/drivers/cpufreq/intel_pstate.c
> @@ -191,7 +191,6 @@ struct global_params {
>   * @policy:            CPUFreq policy value
>   * @update_util:       CPUFreq utility callback information
>   * @update_util_set:   CPUFreq utility callback is set
> - * @iowait_boost:      iowait-related boost fraction
>   * @last_update:       Time of the last update.
>   * @pstate:            Stores P state limits for this CPU
>   * @vid:               Stores VID limits for this CPU
> @@ -245,7 +244,6 @@ struct cpudata {
>         struct acpi_processor_performance acpi_perf_data;
>         bool valid_pss_table;
>  #endif
> -       unsigned int iowait_boost;
>         s16 epp_powersave;
>         s16 epp_policy;
>         s16 epp_default;
> @@ -2136,28 +2134,7 @@ static inline void intel_pstate_update_util_hwp_local(struct cpudata *cpu,
>  {
>         cpu->sample.time = time;
>
> -       if (cpu->sched_flags & SCHED_CPUFREQ_IOWAIT) {
> -               bool do_io = false;
> -
> -               cpu->sched_flags = 0;
> -               /*
> -                * Set iowait_boost flag and update time. Since IO WAIT flag
> -                * is set all the time, we can't just conclude that there is
> -                * some IO bound activity is scheduled on this CPU with just
> -                * one occurrence. If we receive at least two in two
> -                * consecutive ticks, then we treat as boost candidate.
> -                */
> -               if (time_before64(time, cpu->last_io_update + 2 * TICK_NSEC))
> -                       do_io = true;
> -
> -               cpu->last_io_update = time;
> -
> -               if (do_io)
> -                       intel_pstate_hwp_boost_up(cpu);
> -
> -       } else {
> -               intel_pstate_hwp_boost_down(cpu);
> -       }
> +       intel_pstate_hwp_boost_down(cpu);
>  }
>
>  static inline void intel_pstate_update_util_hwp(struct update_util_data *data,
> @@ -2240,9 +2217,6 @@ static inline int32_t get_target_pstate(struct cpudata *cpu)
>         busy_frac = div_fp(sample->mperf << cpu->aperf_mperf_shift,
>                            sample->tsc);
>
> -       if (busy_frac < cpu->iowait_boost)
> -               busy_frac = cpu->iowait_boost;
> -
>         sample->busy_scaled = busy_frac * 100;
>
>         target = READ_ONCE(global.no_turbo) ?
> @@ -2303,7 +2277,7 @@ static void intel_pstate_adjust_pstate(struct cpudata *cpu)
>                 sample->aperf,
>                 sample->tsc,
>                 get_avg_frequency(cpu),
> -               fp_toint(cpu->iowait_boost * 100));
> +               0);
>  }
>
>  static void intel_pstate_update_util(struct update_util_data *data, u64 time,
> @@ -2317,24 +2291,6 @@ static void intel_pstate_update_util(struct update_util_data *data, u64 time,
>                 return;
>
>         delta_ns = time - cpu->last_update;
> -       if (flags & SCHED_CPUFREQ_IOWAIT) {
> -               /* Start over if the CPU may have been idle. */
> -               if (delta_ns > TICK_NSEC) {
> -                       cpu->iowait_boost = ONE_EIGHTH_FP;
> -               } else if (cpu->iowait_boost >= ONE_EIGHTH_FP) {
> -                       cpu->iowait_boost <<= 1;
> -                       if (cpu->iowait_boost > int_tofp(1))
> -                               cpu->iowait_boost = int_tofp(1);
> -               } else {
> -                       cpu->iowait_boost = ONE_EIGHTH_FP;
> -               }
> -       } else if (cpu->iowait_boost) {
> -               /* Clear iowait_boost if the CPU may have been idle. */
> -               if (delta_ns > TICK_NSEC)
> -                       cpu->iowait_boost = 0;
> -               else
> -                       cpu->iowait_boost >>= 1;
> -       }
>         cpu->last_update = time;
>         delta_ns = time - cpu->sample.time;
>         if ((s64)delta_ns < INTEL_PSTATE_SAMPLING_INTERVAL)
> @@ -2832,7 +2788,7 @@ static void intel_cpufreq_trace(struct cpudata *cpu, unsigned int trace_type, in
>                 sample->aperf,
>                 sample->tsc,
>                 get_avg_frequency(cpu),
> -               fp_toint(cpu->iowait_boost * 100));
> +               0);
>  }
>
>  static void intel_cpufreq_hwp_update(struct cpudata *cpu, u32 min, u32 max,
> --
> 2.34.1
>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH 6/8] cpufreq: intel_pstate: Remove iowait boost
  2024-09-30 18:03   ` [RFC PATCH 6/8] cpufreq: intel_pstate: Remove iowait boost Rafael J. Wysocki
@ 2024-09-30 20:35     ` srinivas pandruvada
  2024-10-01  9:57       ` Christian Loehle
  0 siblings, 1 reply; 25+ messages in thread
From: srinivas pandruvada @ 2024-09-30 20:35 UTC (permalink / raw)
  To: Rafael J. Wysocki, Christian Loehle
  Cc: linux-pm, linux-kernel, peterz, juri.lelli, mingo,
	dietmar.eggemann, vschneid, vincent.guittot, Johannes.Thumshirn,
	adrian.hunter, ulf.hansson, bvanassche, andres, asml.silence,
	linux-block, io-uring, qyousef, dsmythies, axboe

On Mon, 2024-09-30 at 20:03 +0200, Rafael J. Wysocki wrote:
> +Srinivas who can say more about the reasons why iowait boosting
> makes
> a difference for intel_pstate than I do.
> 
It makes difference on Xeons and also GFX performance.
The actual gains will be model specific as it will be dependent on
hardware algorithms and EPP.

It was introduced to solve regression in Skylake xeons. But even in the
recent servers there are gains.
Refer to
https://lkml.iu.edu/hypermail/linux/kernel/1806.0/03574.html

Thanks,
Srinivas


> On Thu, Sep 5, 2024 at 11:27 AM Christian Loehle
> <christian.loehle@arm.com> wrote:
> > 
> > Analogous to schedutil, remove iowait boost for the same reasons.
> 
> Well, first of all, iowait boosting was added to intel_pstate to help
> some workloads that otherwise were underperforming.  I'm not sure if
> you can simply remove it without introducing performance regressions
> in those workloads.
> 
> While you can argue that it is not useful in schedutil any more due
> to
> the improved scheduler input for it, you can hardly extend that
> argument to intel_pstate because it doesn't use all of the scheduler
> input used by schedutil.
> 
> Also, the EAS and UCLAMP_MAX arguments are not applicable to
> intel_pstate because it doesn't support any of them.
> 
> This applies to the ondemand cpufreq governor either.
> 
> 
> > Signed-off-by: Christian Loehle <christian.loehle@arm.com>
> > ---
> >  drivers/cpufreq/intel_pstate.c | 50 ++----------------------------
> > ----
> >  1 file changed, 3 insertions(+), 47 deletions(-)
> > 
> > diff --git a/drivers/cpufreq/intel_pstate.c
> > b/drivers/cpufreq/intel_pstate.c
> > index c0278d023cfc..7f30b2569bb3 100644
> > --- a/drivers/cpufreq/intel_pstate.c
> > +++ b/drivers/cpufreq/intel_pstate.c
> > @@ -191,7 +191,6 @@ struct global_params {
> >   * @policy:            CPUFreq policy value
> >   * @update_util:       CPUFreq utility callback information
> >   * @update_util_set:   CPUFreq utility callback is set
> > - * @iowait_boost:      iowait-related boost fraction
> >   * @last_update:       Time of the last update.
> >   * @pstate:            Stores P state limits for this CPU
> >   * @vid:               Stores VID limits for this CPU
> > @@ -245,7 +244,6 @@ struct cpudata {
> >         struct acpi_processor_performance acpi_perf_data;
> >         bool valid_pss_table;
> >  #endif
> > -       unsigned int iowait_boost;
> >         s16 epp_powersave;
> >         s16 epp_policy;
> >         s16 epp_default;
> > @@ -2136,28 +2134,7 @@ static inline void
> > intel_pstate_update_util_hwp_local(struct cpudata *cpu,
> >  {
> >         cpu->sample.time = time;
> > 
> > -       if (cpu->sched_flags & SCHED_CPUFREQ_IOWAIT) {
> > -               bool do_io = false;
> > -
> > -               cpu->sched_flags = 0;
> > -               /*
> > -                * Set iowait_boost flag and update time. Since IO
> > WAIT flag
> > -                * is set all the time, we can't just conclude that
> > there is
> > -                * some IO bound activity is scheduled on this CPU
> > with just
> > -                * one occurrence. If we receive at least two in
> > two
> > -                * consecutive ticks, then we treat as boost
> > candidate.
> > -                */
> > -               if (time_before64(time, cpu->last_io_update + 2 *
> > TICK_NSEC))
> > -                       do_io = true;
> > -
> > -               cpu->last_io_update = time;
> > -
> > -               if (do_io)
> > -                       intel_pstate_hwp_boost_up(cpu);
> > -
> > -       } else {
> > -               intel_pstate_hwp_boost_down(cpu);
> > -       }
> > +       intel_pstate_hwp_boost_down(cpu);
> >  }
> > 
> >  static inline void intel_pstate_update_util_hwp(struct
> > update_util_data *data,
> > @@ -2240,9 +2217,6 @@ static inline int32_t
> > get_target_pstate(struct cpudata *cpu)
> >         busy_frac = div_fp(sample->mperf << cpu->aperf_mperf_shift,
> >                            sample->tsc);
> > 
> > -       if (busy_frac < cpu->iowait_boost)
> > -               busy_frac = cpu->iowait_boost;
> > -
> >         sample->busy_scaled = busy_frac * 100;
> > 
> >         target = READ_ONCE(global.no_turbo) ?
> > @@ -2303,7 +2277,7 @@ static void intel_pstate_adjust_pstate(struct
> > cpudata *cpu)
> >                 sample->aperf,
> >                 sample->tsc,
> >                 get_avg_frequency(cpu),
> > -               fp_toint(cpu->iowait_boost * 100));
> > +               0);
> >  }
> > 
> >  static void intel_pstate_update_util(struct update_util_data
> > *data, u64 time,
> > @@ -2317,24 +2291,6 @@ static void intel_pstate_update_util(struct
> > update_util_data *data, u64 time,
> >                 return;
> > 
> >         delta_ns = time - cpu->last_update;
> > -       if (flags & SCHED_CPUFREQ_IOWAIT) {
> > -               /* Start over if the CPU may have been idle. */
> > -               if (delta_ns > TICK_NSEC) {
> > -                       cpu->iowait_boost = ONE_EIGHTH_FP;
> > -               } else if (cpu->iowait_boost >= ONE_EIGHTH_FP) {
> > -                       cpu->iowait_boost <<= 1;
> > -                       if (cpu->iowait_boost > int_tofp(1))
> > -                               cpu->iowait_boost = int_tofp(1);
> > -               } else {
> > -                       cpu->iowait_boost = ONE_EIGHTH_FP;
> > -               }
> > -       } else if (cpu->iowait_boost) {
> > -               /* Clear iowait_boost if the CPU may have been
> > idle. */
> > -               if (delta_ns > TICK_NSEC)
> > -                       cpu->iowait_boost = 0;
> > -               else
> > -                       cpu->iowait_boost >>= 1;
> > -       }
> >         cpu->last_update = time;
> >         delta_ns = time - cpu->sample.time;
> >         if ((s64)delta_ns < INTEL_PSTATE_SAMPLING_INTERVAL)
> > @@ -2832,7 +2788,7 @@ static void intel_cpufreq_trace(struct
> > cpudata *cpu, unsigned int trace_type, in
> >                 sample->aperf,
> >                 sample->tsc,
> >                 get_avg_frequency(cpu),
> > -               fp_toint(cpu->iowait_boost * 100));
> > +               0);
> >  }
> > 
> >  static void intel_cpufreq_hwp_update(struct cpudata *cpu, u32 min,
> > u32 max,
> > --
> > 2.34.1
> > 


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH 6/8] cpufreq: intel_pstate: Remove iowait boost
  2024-09-30 20:35     ` srinivas pandruvada
@ 2024-10-01  9:57       ` Christian Loehle
  2024-10-01 14:46         ` srinivas pandruvada
  0 siblings, 1 reply; 25+ messages in thread
From: Christian Loehle @ 2024-10-01  9:57 UTC (permalink / raw)
  To: srinivas pandruvada, Rafael J. Wysocki
  Cc: linux-pm, linux-kernel, peterz, juri.lelli, mingo,
	dietmar.eggemann, vschneid, vincent.guittot, Johannes.Thumshirn,
	adrian.hunter, ulf.hansson, bvanassche, andres, asml.silence,
	linux-block, io-uring, qyousef, dsmythies, axboe

On 9/30/24 21:35, srinivas pandruvada wrote:
> On Mon, 2024-09-30 at 20:03 +0200, Rafael J. Wysocki wrote:
>> +Srinivas who can say more about the reasons why iowait boosting
>> makes
>> a difference for intel_pstate than I do.
>>

Hi Srinivas,

> It makes difference on Xeons and also GFX performance.

AFAIU the GFX performance with iowait boost is a regression though,
because it cuts into the system power budget (CPU+GPU), especially
on desktop and mobile chips (but also some servers), no?
https://lore.kernel.org/lkml/20180730220029.81983-1-srinivas.pandruvada@linux.intel.com/
https://lore.kernel.org/lkml/e7388bf4-deb1-34b6-97d7-89ced8e78ef1@intel.com/
Or is there a reported case where iowait boosting helps
graphics workloads?

> The actual gains will be model specific as it will be dependent on
> hardware algorithms and EPP.
> 
> It was introduced to solve regression in Skylake xeons. But even in the
> recent servers there are gains.
> Refer to
> https://lkml.iu.edu/hypermail/linux/kernel/1806.0/03574.html

Did you look into PELT utilization values at that time?
I see why intel_pstate might be worse off than schedutil wrt removing
iowait boosting and do see two remedies essentially:
1. Boost after all sleeps (less aggressively), although I'm not a huge fan of
this.
2. If the gap between util_est and HWP-determined frequency is too large
then apply some boost. A sort of fallback on a schedutil strategy.
That would of course require util_est to be significantly large in those
scenarios.

I might try to propose something for 2, although as you can probably
guess, playing with HWP is somewhat uncharted waters for me.

Since intel_pstate will actually boost into unsustainable P-states,
there should be workloads that regress with iowait boosting. I'll
go looking for those.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH 6/8] cpufreq: intel_pstate: Remove iowait boost
  2024-10-01  9:57       ` Christian Loehle
@ 2024-10-01 14:46         ` srinivas pandruvada
  0 siblings, 0 replies; 25+ messages in thread
From: srinivas pandruvada @ 2024-10-01 14:46 UTC (permalink / raw)
  To: Christian Loehle, Rafael J. Wysocki
  Cc: linux-pm, linux-kernel, peterz, juri.lelli, mingo,
	dietmar.eggemann, vschneid, vincent.guittot, Johannes.Thumshirn,
	adrian.hunter, ulf.hansson, bvanassche, andres, asml.silence,
	linux-block, io-uring, qyousef, dsmythies, axboe

Hi Christian,

On Tue, 2024-10-01 at 10:57 +0100, Christian Loehle wrote:
> On 9/30/24 21:35, srinivas pandruvada wrote:
> > On Mon, 2024-09-30 at 20:03 +0200, Rafael J. Wysocki wrote:
> > > +Srinivas who can say more about the reasons why iowait boosting
> > > makes
> > > a difference for intel_pstate than I do.
> > > 
> 
> Hi Srinivas,
> 
> > It makes difference on Xeons and also GFX performance.
> 
> AFAIU the GFX performance with iowait boost is a regression though,
> because it cuts into the system power budget (CPU+GPU), especially
> on desktop and mobile chips (but also some servers), no?
> https://lore.kernel.org/lkml/20180730220029.81983-1-srinivas.pandruvada@linux.intel.com/
> https://lore.kernel.org/lkml/e7388bf4-deb1-34b6-97d7-89ced8e78ef1@intel.com/
> Or is there a reported case where iowait boosting helps
> graphics workloads?
> 
GFX is complex as you have both cases depending on the generation. We
don't enable the control by default. There is a user space control, so
that it can be selected when it helps.


> > The actual gains will be model specific as it will be dependent on
> > hardware algorithms and EPP.
> > 
> > It was introduced to solve regression in Skylake xeons. But even in
> > the
> > recent servers there are gains.
> > Refer to
> > https://lkml.iu.edu/hypermail/linux/kernel/1806.0/03574.html
> 
> Did you look into PELT utilization values at that time?
No. But boost is needed for idle or semi-idle CPUs, otherwise HWP would
have already running at higher frequency. But we could avoid boot if
util is above a threshold.


> I see why intel_pstate might be worse off than schedutil wrt removing
> iowait boosting and do see two remedies essentially:
> 1. Boost after all sleeps (less aggressively), although I'm not a
> huge fan of
> this.
> 2. If the gap between util_est and HWP-determined frequency is too
> large
> then apply some boost. A sort of fallback on a schedutil strategy.
> That would of course require util_est to be significantly large in
> those
> scenarios.
> 
> I might try to propose something for 2, although as you can probably
> guess, playing with HWP is somewhat uncharted waters for me.
> 
Now we sample the last HWP determined frequency at every tick and can
use to avoid boost. So need some experiments.

> Since intel_pstate will actually boost into unsustainable P-states,
> there should be workloads that regress with iowait boosting. I'll
> go looking for those.

Xeons power limit is in order of 100s of Watts. So boost doesn't
generally to unsustainable state. Even if all cores boost at the same
time, the worst case we still get all core turbo.

Thanks,
Srinivas




> 
> 


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH 5/8] cpufreq/schedutil: Remove iowait boost
  2024-09-30 16:34   ` Rafael J. Wysocki
@ 2024-10-03  9:10     ` Christian Loehle
  2024-10-03  9:47     ` Quentin Perret
  1 sibling, 0 replies; 25+ messages in thread
From: Christian Loehle @ 2024-10-03  9:10 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: linux-pm, linux-kernel, peterz, juri.lelli, mingo,
	dietmar.eggemann, vschneid, vincent.guittot, Johannes.Thumshirn,
	adrian.hunter, ulf.hansson, bvanassche, andres, asml.silence,
	linux-block, io-uring, qyousef, dsmythies, axboe

On 9/30/24 17:34, Rafael J. Wysocki wrote:
> On Thu, Sep 5, 2024 at 11:27 AM Christian Loehle
> <christian.loehle@arm.com> wrote:
>>
>> iowait boost in schedutil was introduced by
>> commit ("21ca6d2c52f8 cpufreq: schedutil: Add iowait boosting").
>> with it more or less following intel_pstate's approach to increase
>> frequency after an iowait wakeup.
>> Behaviour that is piggy-backed onto iowait boost is problematic
>> due to a lot of reasons, so remove it.
>>
>> For schedutil specifically these are some of the reasons:
>> 1. Boosting is applied even in scenarios where it doesn't improve
>> throughput.
> 
> Well, I wouldn't argue this way because it is kind of like saying that
> air conditioning is used even when it doesn't really help.  It is
> sometimes hard to know in advance whether or not it will help though.

Right, it's a heuristic that's often wrong and costs energy when it
triggers is what I was trying to say.

> 
>> 2. The boost is not accounted for in EAS: a) feec() will only consider
>>  the actual task utilization for task placement, but another CPU might
>>  be more energy-efficient at that capacity than the boosted one.)
>>  b) When placing a non-IO task while a CPU is boosted compute_energy()
>>  assumes a lower OPP than what is actually applied. This leads to
>>  wrong EAS decisions.
> 
> That's a very good point IMV and so is the one regarding UCLAMP_MAX (8
> in your list).
> 
> If the goal is to set the adequate performance for a given utilization
> level (either actual or prescribed), boosting doesn't really play well
> with this and it shouldn't be used at least in these cases.
> 
>> 3. Actual IO heavy workloads are hardly distinguished from infrequent
>> in_iowait wakeups.
> 
> Do infrequent in_iowait wakeups really cause the boosting to be
> applied at full swing?

Maybe not full swing, but the relatively high rate_limit_us and TICK_NSEC
found on Android deivces does indeed lead to occasional boosting periods
even for 'infrequent'/unrelated wakeups.

> 
>> 4. The boost isn't accounted for in task placement.
> 
> I'm not sure what exactly this means.  "Big" vs "little" or something else?

That should be "[...] in task placement for HMP", you're right.
Essentially if we were to consider a task to be 100% of capacity boost-worthy,
we need to consider that at task placement. Now we cap out at the local CPU,
which might be rather small. (~10% of the biggest CPU on mobile).
Logically this argument (a CAS argument essentially), should probably come
before the EAS one to make more sense.

>> 5. The boost isn't associated with a task, it therefore lingers on the
>> rq even after the responsible task has migrated / stopped.
> 
> Fair enough, but this is rather a problem with the implementation of
> boosting and not with the basic idea of it.

Unfortunately the lingering (or to use a term with less negative connotation:
holding) almost is a necessity, too, as described in the cover-letter.
If we only boost at enqueue (and immediately scale down on dequeue) we lose
out massively, as the interrupt isn't boosted and we have to run at the lower
frequency for the DVFS transition delay (even if on x86 that may be close to
negligible). IMO this is the main reason why the mechanism can't evolve (into
something like a per-task strategy).
Even a per-task strategy would need to a) set a timer in case the iowait
period is too long and b) remove boost from prev_cpu if enqueued somewhere
else.

> 
>> 6. The boost isn't associated with a task, it therefore needs to ramp
>> up again when migrated.
> 
> Well, that again is somewhat implementation-related IMV, and it need
> not be problematic in principle.  Namely, if a task migrates and it is
> not the only one in the "new" CPUs runqueue, and the other tasks in
> there don't use in_iowait, maybe it's better to not boost it?

Agreed, this can be argued about (and also isn't a huge problem in
practice).

> 
> It also means that boosting is not very consistent, though, which is a
> valid point.
> 
>> 7. Since schedutil doesn't know which task is getting woken up,
>> multiple unrelated in_iowait tasks lead to boosting.
> 
> Well, that's by design: it boosts, when "there is enough IO pressure
> in the runqueue", so to speak.> 
> Basically, it is a departure from the "make performance follow
> utilization" general idea and it is based on the observation that in
> some cases performance can be improved by taking additional
> information into account.
> 
> It is also about pure performance, not about energy efficiency.

And the lines between those become more and more blurry, see the GFX
regression. There's very few free lunches up for grabs these days, if
you're boosting performance on X, you're likely paying for it on Y.
That is fine as long as boosting X is deliberate which iowait boosting
very much is not.

> 
>> 8. Boosting is hard to control with UCLAMP_MAX (which is only active
>> when the task is on the rq, which for boosted tasks is usually not
>> the case for most of the time).
>>
>> One benefit of schedutil specifically is the reliance on the
>> scheduler's utilization signals, which have evolved a lot since it's
>> original introduction. Some cases that benefitted from iowait boosting
>> in the past can now be covered by e.g. util_est.
> 
> And it would be good to give some examples of this.
> 
> IMV you have a clean-cut argument in the EAS and UCLAMP_MAX cases, but
> apart from that it is all a bit hand-wavy.

Thanks Rafael, you brought up some good points!


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH 5/8] cpufreq/schedutil: Remove iowait boost
  2024-09-30 16:34   ` Rafael J. Wysocki
  2024-10-03  9:10     ` Christian Loehle
@ 2024-10-03  9:47     ` Quentin Perret
  2024-10-03 10:30       ` Christian Loehle
  1 sibling, 1 reply; 25+ messages in thread
From: Quentin Perret @ 2024-10-03  9:47 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Christian Loehle, linux-pm, linux-kernel, peterz, juri.lelli,
	mingo, dietmar.eggemann, vschneid, vincent.guittot,
	Johannes.Thumshirn, adrian.hunter, ulf.hansson, bvanassche,
	andres, asml.silence, linux-block, io-uring, qyousef, dsmythies,
	axboe

On Monday 30 Sep 2024 at 18:34:24 (+0200), Rafael J. Wysocki wrote:
> On Thu, Sep 5, 2024 at 11:27 AM Christian Loehle
> <christian.loehle@arm.com> wrote:
> >
> > iowait boost in schedutil was introduced by
> > commit ("21ca6d2c52f8 cpufreq: schedutil: Add iowait boosting").
> > with it more or less following intel_pstate's approach to increase
> > frequency after an iowait wakeup.
> > Behaviour that is piggy-backed onto iowait boost is problematic
> > due to a lot of reasons, so remove it.
> >
> > For schedutil specifically these are some of the reasons:
> > 1. Boosting is applied even in scenarios where it doesn't improve
> > throughput.
> 
> Well, I wouldn't argue this way because it is kind of like saying that
> air conditioning is used even when it doesn't really help.  It is
> sometimes hard to know in advance whether or not it will help though.
> 
> > 2. The boost is not accounted for in EAS: a) feec() will only consider
> >  the actual task utilization for task placement, but another CPU might
> >  be more energy-efficient at that capacity than the boosted one.)
> >  b) When placing a non-IO task while a CPU is boosted compute_energy()
> >  assumes a lower OPP than what is actually applied. This leads to
> >  wrong EAS decisions.
> 
> That's a very good point IMV and so is the one regarding UCLAMP_MAX (8
> in your list).

I would actually argue that this is also an implementation problem
rather than something fundamental about boosting. EAS could be taught
about iowait boosting and factor that into the decisions.

> If the goal is to set the adequate performance for a given utilization
> level (either actual or prescribed), boosting doesn't really play well
> with this and it shouldn't be used at least in these cases.

There's plenty of cases where EAS will correctly understand that
migrating a task away will not reduce the OPP (e.g. another task on the
rq has a uclamp_min request, or another CPU in the perf domain has a
higher request), so iowait boosting could probably be added.

In fact if the iowait boost was made a task property, EAS could easily
understand the effect of migrating that boost with the task (it's not
fundamentally different from migrating a task with a high uclamp_min
from the energy model perspective).

> > 3. Actual IO heavy workloads are hardly distinguished from infrequent
> > in_iowait wakeups.
> 
> Do infrequent in_iowait wakeups really cause the boosting to be
> applied at full swing?
> 
> > 4. The boost isn't accounted for in task placement.
> 
> I'm not sure what exactly this means.  "Big" vs "little" or something else?
> 
> > 5. The boost isn't associated with a task, it therefore lingers on the
> > rq even after the responsible task has migrated / stopped.
> 
> Fair enough, but this is rather a problem with the implementation of
> boosting and not with the basic idea of it.

+1

> > 6. The boost isn't associated with a task, it therefore needs to ramp
> > up again when migrated.
> 
> Well, that again is somewhat implementation-related IMV, and it need
> not be problematic in principle.  Namely, if a task migrates and it is
> not the only one in the "new" CPUs runqueue, and the other tasks in
> there don't use in_iowait, maybe it's better to not boost it?
> 
> It also means that boosting is not very consistent, though, which is a
> valid point.
> 
> > 7. Since schedutil doesn't know which task is getting woken up,
> > multiple unrelated in_iowait tasks lead to boosting.
> 
> Well, that's by design: it boosts, when "there is enough IO pressure
> in the runqueue", so to speak.
> 
> Basically, it is a departure from the "make performance follow
> utilization" general idea and it is based on the observation that in
> some cases performance can be improved by taking additional
> information into account.
> 
> It is also about pure performance, not about energy efficiency.
> 
> > 8. Boosting is hard to control with UCLAMP_MAX (which is only active
> > when the task is on the rq, which for boosted tasks is usually not
> > the case for most of the time).

Sounds like another reason to make iowait boosting per-task to me :-)

I've always thought that turning iowait boosting into some sort of
in-kernel uclamp_min request would be a good approach for most of the
issues mentioned above. Note that I'm not necessarily saying to use the
actual uclamp infrastructure (though it's valid option), I'm really just
talking about the concept. Is that something you've considered?

I presume we could even factor out the 'logic' part of the code that
decides out to request the boost into its own thing, and possibly have
different policies for different use-cases, but that might be overkill.

Thanks,
Quentin

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH 5/8] cpufreq/schedutil: Remove iowait boost
  2024-10-03  9:47     ` Quentin Perret
@ 2024-10-03 10:30       ` Christian Loehle
  2024-10-05  0:39         ` Andres Freund
  0 siblings, 1 reply; 25+ messages in thread
From: Christian Loehle @ 2024-10-03 10:30 UTC (permalink / raw)
  To: Quentin Perret, Rafael J. Wysocki
  Cc: linux-pm, linux-kernel, peterz, juri.lelli, mingo,
	dietmar.eggemann, vschneid, vincent.guittot, Johannes.Thumshirn,
	adrian.hunter, ulf.hansson, bvanassche, andres, asml.silence,
	linux-block, io-uring, qyousef, dsmythies, axboe

On 10/3/24 10:47, Quentin Perret wrote:
> On Monday 30 Sep 2024 at 18:34:24 (+0200), Rafael J. Wysocki wrote:
>> On Thu, Sep 5, 2024 at 11:27 AM Christian Loehle
>> <christian.loehle@arm.com> wrote:
>>>
>>> iowait boost in schedutil was introduced by
>>> commit ("21ca6d2c52f8 cpufreq: schedutil: Add iowait boosting").
>>> with it more or less following intel_pstate's approach to increase
>>> frequency after an iowait wakeup.
>>> Behaviour that is piggy-backed onto iowait boost is problematic
>>> due to a lot of reasons, so remove it.
>>>
>>> For schedutil specifically these are some of the reasons:
>>> 1. Boosting is applied even in scenarios where it doesn't improve
>>> throughput.
>>
>> Well, I wouldn't argue this way because it is kind of like saying that
>> air conditioning is used even when it doesn't really help.  It is
>> sometimes hard to know in advance whether or not it will help though.
>>
>>> 2. The boost is not accounted for in EAS: a) feec() will only consider
>>>  the actual task utilization for task placement, but another CPU might
>>>  be more energy-efficient at that capacity than the boosted one.)
>>>  b) When placing a non-IO task while a CPU is boosted compute_energy()
>>>  assumes a lower OPP than what is actually applied. This leads to
>>>  wrong EAS decisions.
>>
>> That's a very good point IMV and so is the one regarding UCLAMP_MAX (8
>> in your list).
> 
> I would actually argue that this is also an implementation problem
> rather than something fundamental about boosting. EAS could be taught
> about iowait boosting and factor that into the decisions.

Definitely, and I did do exactly that.

> 
>> If the goal is to set the adequate performance for a given utilization
>> level (either actual or prescribed), boosting doesn't really play well
>> with this and it shouldn't be used at least in these cases.
> 
> There's plenty of cases where EAS will correctly understand that
> migrating a task away will not reduce the OPP (e.g. another task on the
> rq has a uclamp_min request, or another CPU in the perf domain has a
> higher request), so iowait boosting could probably be added.
> 
> In fact if the iowait boost was made a task property, EAS could easily
> understand the effect of migrating that boost with the task (it's not
> fundamentally different from migrating a task with a high uclamp_min
> from the energy model perspective).

True.
> 
>>> 3. Actual IO heavy workloads are hardly distinguished from infrequent
>>> in_iowait wakeups.
>>
>> Do infrequent in_iowait wakeups really cause the boosting to be
>> applied at full swing?
>>
>>> 4. The boost isn't accounted for in task placement.
>>
>> I'm not sure what exactly this means.  "Big" vs "little" or something else?
>>
>>> 5. The boost isn't associated with a task, it therefore lingers on the
>>> rq even after the responsible task has migrated / stopped.
>>
>> Fair enough, but this is rather a problem with the implementation of
>> boosting and not with the basic idea of it.
> 
> +1
> 
>>> 6. The boost isn't associated with a task, it therefore needs to ramp
>>> up again when migrated.
>>
>> Well, that again is somewhat implementation-related IMV, and it need
>> not be problematic in principle.  Namely, if a task migrates and it is
>> not the only one in the "new" CPUs runqueue, and the other tasks in
>> there don't use in_iowait, maybe it's better to not boost it?
>>
>> It also means that boosting is not very consistent, though, which is a
>> valid point.
>>
>>> 7. Since schedutil doesn't know which task is getting woken up,
>>> multiple unrelated in_iowait tasks lead to boosting.
>>
>> Well, that's by design: it boosts, when "there is enough IO pressure
>> in the runqueue", so to speak.
>>
>> Basically, it is a departure from the "make performance follow
>> utilization" general idea and it is based on the observation that in
>> some cases performance can be improved by taking additional
>> information into account.
>>
>> It is also about pure performance, not about energy efficiency.
>>
>>> 8. Boosting is hard to control with UCLAMP_MAX (which is only active
>>> when the task is on the rq, which for boosted tasks is usually not
>>> the case for most of the time).
> 
> Sounds like another reason to make iowait boosting per-task to me :-)
> 
> I've always thought that turning iowait boosting into some sort of
> in-kernel uclamp_min request would be a good approach for most of the
> issues mentioned above. Note that I'm not necessarily saying to use the
> actual uclamp infrastructure (though it's valid option), I'm really just
> talking about the concept. Is that something you've considered?
> 
> I presume we could even factor out the 'logic' part of the code that
> decides out to request the boost into its own thing, and possibly have
> different policies for different use-cases, but that might be overkill.

See the cover-letter part on per-task iowait boosting, specifically:
[1]
v1 per-task io boost
https://lore.kernel.org/lkml/20240304201625.100619-1-christian.loehle@arm.com/
v2 per-task io boost
https://lore.kernel.org/lkml/20240518113947.2127802-2-christian.loehle@arm.com/
[2]
OSPM24 discussion iowait boosting
https://www.youtube.com/watch?v=MSQGEsSziZ4

These are the main issues with transforming the existing mechanism into
a per-task attribute.
Almost unsolvable is: Does reducing "iowait pressure" (be it per-task or per-rq)
actually improve throughput even (assuming for now that this throughput is
something we care about, I'm sure you know that isn't always the case, e.g.
background tasks). With MCQ devices and some reasonable IO workload that is
IO-bound our iowait boosting is often just boosting CPU frequency (which uses
power obviously) to queue in yet another request for a device which has essentially
endless pending requests. If pending request N+1 arrives x usecs earlier or
later at the device then makes no difference in IO throughput.
If boosting would improve e.g. IOPS (of that device) is something the block layer
(with a lot of added infrastructure, but at least in theory it would know what
device we're iowaiting on, unlike the scheduler) could tell us about. If that is
actually useful for user experience (i.e. worth the power) only userspace can decide
(and then we're back at uclamp_min anyway).
(The above all assumes that iowait even means "is waiting for block IO and
about to send another block IO" which is far from reality.)

Thanks Quentin for getting involved, your input is very much appreciated!

Regards,
Christian

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH 5/8] cpufreq/schedutil: Remove iowait boost
  2024-10-03 10:30       ` Christian Loehle
@ 2024-10-05  0:39         ` Andres Freund
  2024-10-09  9:54           ` Christian Loehle
  0 siblings, 1 reply; 25+ messages in thread
From: Andres Freund @ 2024-10-05  0:39 UTC (permalink / raw)
  To: Christian Loehle
  Cc: Quentin Perret, Rafael J. Wysocki, linux-pm, linux-kernel, peterz,
	juri.lelli, mingo, dietmar.eggemann, vschneid, vincent.guittot,
	Johannes.Thumshirn, adrian.hunter, ulf.hansson, bvanassche,
	asml.silence, linux-block, io-uring, qyousef, dsmythies, axboe

Hi,

A caveat: I'm a userspace developer that occasionally strays into kernel land
(see e.g. the io_uring iowait thing). So I'm likely to get some kernel side
things wrong.

On 2024-10-03 11:30:52 +0100, Christian Loehle wrote:
> These are the main issues with transforming the existing mechanism into
> a per-task attribute.
> Almost unsolvable is: Does reducing "iowait pressure" (be it per-task or per-rq)
> actually improve throughput even (assuming for now that this throughput is
> something we care about, I'm sure you know that isn't always the case, e.g.
> background tasks). With MCQ devices and some reasonable IO workload that is
> IO-bound our iowait boosting is often just boosting CPU frequency (which uses
> power obviously) to queue in yet another request for a device which has essentially
> endless pending requests. If pending request N+1 arrives x usecs earlier or
> later at the device then makes no difference in IO throughput.

That's sometimes true, but definitely not all the time? There are plenty
workloads with low-queue-depth style IO. Which often are also rather latency
sensitive.

E.g. the device a database journal resides on will typically have a low queue
depth. It's extremely common in OLTPish workloads to be bound by the latency
of journal flushes. If, after the journal flush completes, the CPU is clocked
low and takes a while to wake up, you'll see substantially worse performance.

> If boosting would improve e.g. IOPS (of that device) is something the block layer
> (with a lot of added infrastructure, but at least in theory it would know what
> device we're iowaiting on, unlike the scheduler) could tell us about. If that is
> actually useful for user experience (i.e. worth the power) only userspace can decide
> (and then we're back at uclamp_min anyway).

I think there are many cases where userspace won't realistically be able to do
anything about that.

For one, just because, for some workload, a too deep idle state is bad during
IO, doesn't mean userspace won't ever want to clock down. And it's probably
going to be too expensive to change any attributes around idle states for
individual IOs.

Are there actually any non-privileged APIs around this that userspace *could*
even change? I'd not consider moving to busy-polling based APIs a realistic
alternative.

For many workloads cpuidle is way too aggressive dropping into lower states
*despite* iowait. But just disabling all lower idle states obviously has
undesirable energy usage implications. It surely is the answer for some
workloads, but I don't think it'd be good to promote it as the sole solution.

It's easy to under-estimate the real-world impact of a change like this. When
benchmarking we tend to see what kind of throughput we can get, by having N
clients hammering the server as fast as they can. But in the real world that's
pretty rare for anything latency sensitive to go full blast - rather there's a
rate of requests incoming and that the clients are sensitive to requests being
processed more slowly.

That's not to say that the current situation can't be improved - I've seen way
too many workloads where the only ways to get decent performance were one of:

- disable most idle states (via sysfs or /dev/cpu_dma_latency)
- just have busy loops when idling - doesn't work when doing synchronous
  syscalls that block though
- have some lower priority tasks scheduled that just burns CPU

I'm just worried that removing iowait will make this worse.

Greetings,

Andres Freund

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH 5/8] cpufreq/schedutil: Remove iowait boost
  2024-10-05  0:39         ` Andres Freund
@ 2024-10-09  9:54           ` Christian Loehle
  0 siblings, 0 replies; 25+ messages in thread
From: Christian Loehle @ 2024-10-09  9:54 UTC (permalink / raw)
  To: Andres Freund
  Cc: Quentin Perret, Rafael J. Wysocki, linux-pm, linux-kernel, peterz,
	juri.lelli, mingo, dietmar.eggemann, vschneid, vincent.guittot,
	Johannes.Thumshirn, adrian.hunter, ulf.hansson, bvanassche,
	asml.silence, linux-block, io-uring, qyousef, dsmythies, axboe

On 10/5/24 01:39, Andres Freund wrote:
> Hi,
> 
> 
> A caveat: I'm a userspace developer that occasionally strays into kernel land
> (see e.g. the io_uring iowait thing). So I'm likely to get some kernel side
> things wrong.

Thank you for your input!

> 
> On 2024-10-03 11:30:52 +0100, Christian Loehle wrote:
>> These are the main issues with transforming the existing mechanism into
>> a per-task attribute.
>> Almost unsolvable is: Does reducing "iowait pressure" (be it per-task or per-rq)
>> actually improve throughput even (assuming for now that this throughput is
>> something we care about, I'm sure you know that isn't always the case, e.g.
>> background tasks). With MCQ devices and some reasonable IO workload that is
>> IO-bound our iowait boosting is often just boosting CPU frequency (which uses
>> power obviously) to queue in yet another request for a device which has essentially
>> endless pending requests. If pending request N+1 arrives x usecs earlier or
>> later at the device then makes no difference in IO throughput.
> 
> That's sometimes true, but definitely not all the time? There are plenty
> workloads with low-queue-depth style IO. Which often are also rather latency
> sensitive.
> 
> E.g. the device a database journal resides on will typically have a low queue
> depth. It's extremely common in OLTPish workloads to be bound by the latency
> of journal flushes. If, after the journal flush completes, the CPU is clocked
> low and takes a while to wake up, you'll see substantially worse performance.

Yeah absolutely and if we knew what a latency-sensitive journal flush is tuning
cpuidle and cpufreq to it would probably be reasonable.
I did test mmtests filebench-oltp that looked fine, do you have any other
benchmarks you would like to see?

>> If boosting would improve e.g. IOPS (of that device) is something the block layer
>> (with a lot of added infrastructure, but at least in theory it would know what
>> device we're iowaiting on, unlike the scheduler) could tell us about. If that is
>> actually useful for user experience (i.e. worth the power) only userspace can decide
>> (and then we're back at uclamp_min anyway).
> 
> I think there are many cases where userspace won't realistically be able to do
> anything about that.
> 
> For one, just because, for some workload, a too deep idle state is bad during
> IO, doesn't mean userspace won't ever want to clock down. And it's probably
> going to be too expensive to change any attributes around idle states for
> individual IOs.

So the kernel currently applies these to all of them essentially.

> 
> Are there actually any non-privileged APIs around this that userspace *could*
> even change? I'd not consider moving to busy-polling based APIs a realistic
> alternative.

No and I'm not sure an actual non-privileged API would be a good idea, would
it? It is essentially changing hardware behavior.
So does busy-polling of course, but the kernel can at least curb that and
maintain fairness and so forth.

> 
> For many workloads cpuidle is way too aggressive dropping into lower states
> *despite* iowait. But just disabling all lower idle states obviously has
> undesirable energy usage implications. It surely is the answer for some
> workloads, but I don't think it'd be good to promote it as the sole solution.

Right, but we (cpuidle) don't know how to distinguish the two, we just do it
for all of them. Whether kernel or userspace applies the same (awful) heuristic
doesn't make that much of a difference in practice.

> 
> It's easy to under-estimate the real-world impact of a change like this. When
> benchmarking we tend to see what kind of throughput we can get, by having N
> clients hammering the server as fast as they can. But in the real world that's
> pretty rare for anything latency sensitive to go full blast - rather there's a
> rate of requests incoming and that the clients are sensitive to requests being
> processed more slowly.

Agreed, this series is posted as RFT and I'm happy to take a look at any
regressions for both the cpufreq and cpuidle parts of it.

> 
> 
> That's not to say that the current situation can't be improved - I've seen way
> too many workloads where the only ways to get decent performance were one of:
> 
> - disable most idle states (via sysfs or /dev/cpu_dma_latency)
> - just have busy loops when idling - doesn't work when doing synchronous
>   syscalls that block though
> - have some lower priority tasks scheduled that just burns CPU
> 
> I'm just worried that removing iowait will make this worse.

I just need to mention again that almost all of what you replied does refer to
cpuidle, not cpufreq (which this particular patch was about), not to create more
confusion.

Regards,
Christian

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2024-10-09  9:54 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-09-05  9:26 [RFT RFC PATCH 0/8] cpufreq: cpuidle: Remove iowait behaviour Christian Loehle
2024-09-05  9:26 ` [RFC PATCH 1/8] cpuidle: menu: Remove iowait influence Christian Loehle
2024-09-30 14:58   ` Rafael J. Wysocki
2024-09-05  9:26 ` [RFC PATCH 2/8] cpuidle: Prefer teo over menu governor Christian Loehle
2024-09-30 15:06   ` Rafael J. Wysocki
2024-09-30 16:12     ` Christian Loehle
2024-09-30 16:42       ` Rafael J. Wysocki
2024-09-05  9:26 ` [RFC PATCH 3/8] TEST: cpufreq/schedutil: Linear iowait boost step Christian Loehle
2024-09-05  9:26 ` [RFC PATCH 4/8] TEST: cpufreq/schedutil: iowait boost cap sysfs Christian Loehle
2024-09-05  9:26 ` [RFC PATCH 5/8] cpufreq/schedutil: Remove iowait boost Christian Loehle
2024-09-30 16:34   ` Rafael J. Wysocki
2024-10-03  9:10     ` Christian Loehle
2024-10-03  9:47     ` Quentin Perret
2024-10-03 10:30       ` Christian Loehle
2024-10-05  0:39         ` Andres Freund
2024-10-09  9:54           ` Christian Loehle
2024-09-05  9:26 ` [RFC PATCH 6/8] cpufreq: intel_pstate: " Christian Loehle
2024-09-12 11:22   ` [RFC PATCH] TEST: cpufreq: intel_pstate: sysfs iowait_boost_cap Christian Loehle
2024-09-30 18:03   ` [RFC PATCH 6/8] cpufreq: intel_pstate: Remove iowait boost Rafael J. Wysocki
2024-09-30 20:35     ` srinivas pandruvada
2024-10-01  9:57       ` Christian Loehle
2024-10-01 14:46         ` srinivas pandruvada
2024-09-05  9:26 ` [RFC PATCH 7/8] cpufreq: Remove SCHED_CPUFREQ_IOWAIT update Christian Loehle
2024-09-05  9:26 ` [RFC PATCH 8/8] io_uring: Do not set iowait before sleeping Christian Loehle
2024-09-05 12:31 ` [RFT RFC PATCH 0/8] cpufreq: cpuidle: Remove iowait behaviour Christian Loehle

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox