public inbox for [email protected]
 help / color / mirror / Atom feed
* [PATCH] io_uring/io-wq: stop setting PF_NO_SETAFFINITY on io-wq workers
@ 2023-03-08 14:27 Jens Axboe
  2023-03-14 10:07 ` Daniel Dao
  0 siblings, 1 reply; 5+ messages in thread
From: Jens Axboe @ 2023-03-08 14:27 UTC (permalink / raw)
  To: io-uring; +Cc: Daniel Dao

Every now and then, reports come in from users puzzled as to why changing
the affinity of io-wq workers fails with EINVAL. This happens because the
workers set PF_NO_SETAFFINITY as part of their creation, as io-wq organizes
workers into groups based on what CPU they are running on.

However, this is purely an optimization and not a functional requirement.
We can allow setting affinity, and just lazily update our worker to wqe
mappings. If a given io-wq thread times out, it normally exits if there's
no more work to do. The exception is if it's the last worker available.
For the timeout case, check the affinity of the worker against the group mask
and exit even if it's the last worker. New workers should be created with
the right mask and in the right location.
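
For illustration, the kind of userspace call that used to fail looks roughly
like the sketch below (a minimal sketch only; the worker tid is a hypothetical
placeholder for an iou-wrk-* thread id):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>

int main(void)
{
	pid_t worker_tid = 720;	/* hypothetical iou-wrk-* thread id */
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(0, &set);
	/* Before this patch: fails with EINVAL because the worker has
	 * PF_NO_SETAFFINITY set. With it: succeeds. */
	if (sched_setaffinity(worker_tid, sizeof(set), &set) < 0)
		perror("sched_setaffinity");
	return 0;
}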

Reported-by: Daniel Dao <[email protected]>
Link: https://lore.kernel.org/io-uring/CA+wXwBQwgxB3_UphSny-yAP5b26meeOu1W4TwYVcD_+5gOhvPw@mail.gmail.com/
Signed-off-by: Jens Axboe <[email protected]>

---

diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
index 411bb2d1acd4..f81c0a7136a5 100644
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -616,7 +616,7 @@ static int io_wqe_worker(void *data)
 	struct io_wqe_acct *acct = io_wqe_get_acct(worker);
 	struct io_wqe *wqe = worker->wqe;
 	struct io_wq *wq = wqe->wq;
-	bool last_timeout = false;
+	bool exit_mask = false, last_timeout = false;
 	char buf[TASK_COMM_LEN];
 
 	worker->flags |= (IO_WORKER_F_UP | IO_WORKER_F_RUNNING);
@@ -632,8 +632,11 @@ static int io_wqe_worker(void *data)
 			io_worker_handle_work(worker);
 
 		raw_spin_lock(&wqe->lock);
-		/* timed out, exit unless we're the last worker */
-		if (last_timeout && acct->nr_workers > 1) {
+		/*
+		 * Last sleep timed out. Exit if we're not the last worker,
+		 * or if someone modified our affinity.
+		 */
+		if (last_timeout && (exit_mask || acct->nr_workers > 1)) {
 			acct->nr_workers--;
 			raw_spin_unlock(&wqe->lock);
 			__set_current_state(TASK_RUNNING);
@@ -652,7 +655,11 @@ static int io_wqe_worker(void *data)
 				continue;
 			break;
 		}
-		last_timeout = !ret;
+		if (!ret) {
+			last_timeout = true;
+			exit_mask = !cpumask_test_cpu(raw_smp_processor_id(),
+							wqe->cpu_mask);
+		}
 	}
 
 	if (test_bit(IO_WQ_BIT_EXIT, &wq->state))
@@ -704,7 +711,6 @@ static void io_init_new_worker(struct io_wqe *wqe, struct io_worker *worker,
 	tsk->worker_private = worker;
 	worker->task = tsk;
 	set_cpus_allowed_ptr(tsk, wqe->cpu_mask);
-	tsk->flags |= PF_NO_SETAFFINITY;
 
 	raw_spin_lock(&wqe->lock);
 	hlist_nulls_add_head_rcu(&worker->nulls_node, &wqe->free_list);

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH] io_uring/io-wq: stop setting PF_NO_SETAFFINITY on io-wq workers
  2023-03-08 14:27 [PATCH] io_uring/io-wq: stop setting PF_NO_SETAFFINITY on io-wq workers Jens Axboe
@ 2023-03-14 10:07 ` Daniel Dao
  2023-03-14 16:25   ` Michal Koutný
  0 siblings, 1 reply; 5+ messages in thread
From: Daniel Dao @ 2023-03-14 10:07 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Waiman Long, cgroups

On Wed, Mar 8, 2023 at 2:27 PM Jens Axboe <[email protected]> wrote:
>
> Every now and then, reports come in from users puzzled as to why changing
> the affinity of io-wq workers fails with EINVAL. This happens because the
> workers set PF_NO_SETAFFINITY as part of their creation, as io-wq organizes
> workers into groups based on what CPU they are running on.
>
> However, this is purely an optimization and not a functional requirement.
> We can allow setting affinity, and just lazily update our worker to wqe
> mappings. If a given io-wq thread times out, it normally exits if there's
> no more work to do. The exception is if it's the last worker available.
> For the timeout case, check the affinity of the worker against the group mask
> and exit even if it's the last worker. New workers should be created with
> the right mask and in the right location.

The patch resolved the bug around enabling cpuset for subtree_control for me.
However, it also doesn't prevent users from setting a cpuset value that is
incompatible with iou threads. For example, on a 2-numa 4-cpu node, new
iou-wrks are bound to 2-3 while we can set cpuset.cpus to 1-2 successfully.
The end result is a mix of cpu distribution such as:

  pid 533's current affinity list: 1,2 # process
  pid 720's current affinity list: 1,2 # iou-wrk-533
  pid 5236's current affinity list: 2,3 # iou-wrk-533, running outside of cpuset

IMO this violates the principle of cpuset and can be confusing for end users.
I think I prefer Waiman's suggestion of allowing an implicit move to cpuset
when enabling cpuset with subtree_control, but not explicit moves such as
setting cpuset.cpus or writing the pids into cgroup.procs. It's easier to
reason about and makes the failure mode more explicit.

What do you think ?

Cheers,
Daniel.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] io_uring/io-wq: stop setting PF_NO_SETAFFINITY on io-wq workers
  2023-03-14 10:07 ` Daniel Dao
@ 2023-03-14 16:25   ` Michal Koutný
  2023-03-14 16:48     ` Jens Axboe
  2023-03-14 18:17     ` Waiman Long
  0 siblings, 2 replies; 5+ messages in thread
From: Michal Koutný @ 2023-03-14 16:25 UTC (permalink / raw)
  To: Daniel Dao; +Cc: Jens Axboe, io-uring, Waiman Long, cgroups

Hello.

On Tue, Mar 14, 2023 at 10:07:40AM +0000, Daniel Dao <[email protected]> wrote:
> IMO this violates the principle of cpuset and can be confusing for end users.
> I think I prefer Waiman's suggestion of allowing an implicit move to cpuset
> when enabling cpuset with subtree_control, but not explicit moves such as
> setting cpuset.cpus or writing the pids into cgroup.procs. It's easier to
> reason about and makes the failure mode more explicit.
> 
> What do you think ?

I think cpuset should take precedence over IO workers' affinity (just as it
does over affinity set with sched_setaffinity(2)).
Thus:
- modifying cpuset.cpus	                update task's affinity, for sure
- implicit migration (enabling cpuset)  update task's affinity, effective nop
- explicit migration (meh)              update task's affinity, ¯\_(ツ)_/¯

My understanding of PF_NO_SETAFFINITY is that it's for kernel threads that
do work that's functionally needed on a given CPU and thus cannot be
migrated [1]. As said previously, for io_uring workers affinity is for
performance only.
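
For context, the EINVAL that userspace sees comes straight from the affinity
syscall path; roughly, sched_setaffinity() in kernel/sched/core.c does the
following (a paraphrased sketch, not the exact code):

	/* Tasks flagged PF_NO_SETAFFINITY are rejected outright;
	 * this is the EINVAL users were hitting. */
	if (p->flags & PF_NO_SETAFFINITY) {
		retval = -EINVAL;
		goto out_put_task;
	}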

Hence, I'd also suggest on top of 01e68ce08a30 ("io_uring/io-wq: stop
setting PF_NO_SETAFFINITY on io-wq workers"):

--- a/io_uring/sqpoll.c
+++ b/io_uring/sqpoll.c
@@ -233,7 +233,6 @@ static int io_sq_thread(void *data)
                set_cpus_allowed_ptr(current, cpumask_of(sqd->sq_cpu));
        else
                set_cpus_allowed_ptr(current, cpu_online_mask);
-       current->flags |= PF_NO_SETAFFINITY;

        mutex_lock(&sqd->lock);
        while (1) {

Afterall, io_uring_setup(2) already mentions:
> When cgroup setting cpuset.cpus changes (typically in container
> environment), the bounded cpu set may be changed as well.

HTH,
Michal

[1] Ideally, those should always remain in the root cpuset cgroup.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] io_uring/io-wq: stop setting PF_NO_SETAFFINITY on io-wq workers
  2023-03-14 16:25   ` Michal Koutný
@ 2023-03-14 16:48     ` Jens Axboe
  2023-03-14 18:17     ` Waiman Long
  1 sibling, 0 replies; 5+ messages in thread
From: Jens Axboe @ 2023-03-14 16:48 UTC (permalink / raw)
  To: Michal Koutný, Daniel Dao; +Cc: io-uring, Waiman Long, cgroups

On 3/14/23 10:25 AM, Michal Koutný wrote:
> Hello.
> 
> On Tue, Mar 14, 2023 at 10:07:40AM +0000, Daniel Dao <[email protected]> wrote:
>> IMO this violates the principle of cpuset and can be confusing for end users.
>> I think I prefer Waiman's suggestion of allowing an implicit move to cpuset
>> when enabling cpuset with subtree_control, but not explicit moves such as
>> setting cpuset.cpus or writing the pids into cgroup.procs. It's easier to
>> reason about and makes the failure mode more explicit.
>>
>> What do you think ?
> 
> I think cpuset should take precedence over IO workers' affinity (just as it
> does over affinity set with sched_setaffinity(2)).
> Thus:
> - modifying cpuset.cpus	                update task's affinity, for sure
> - implicit migration (enabling cpuset)  update task's affinity, effective nop
> - explicit migration (meh)              update task's affinity, ¯\_(ツ)_/¯
> 
> My understanding of PF_NO_SETAFFINITY is that it's for kernel threads that
> do work that's functionally needed on a given CPU and thus cannot be
> migrated [1]. As said previously, for io_uring workers affinity is for
> performance only.
> 
> Hence, I'd also suggest on top of 01e68ce08a30 ("io_uring/io-wq: stop
> setting PF_NO_SETAFFINITY on io-wq workers"):
> 
> --- a/io_uring/sqpoll.c
> +++ b/io_uring/sqpoll.c
> @@ -233,7 +233,6 @@ static int io_sq_thread(void *data)
>                 set_cpus_allowed_ptr(current, cpumask_of(sqd->sq_cpu));
>         else
>                 set_cpus_allowed_ptr(current, cpu_online_mask);
> -       current->flags |= PF_NO_SETAFFINITY;
> 
>         mutex_lock(&sqd->lock);
>         while (1) {

Ah yes, let's get that done as well in the same release. Do you want
to send a patch for this?

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] io_uring/io-wq: stop setting PF_NO_SETAFFINITY on io-wq workers
  2023-03-14 16:25   ` Michal Koutný
  2023-03-14 16:48     ` Jens Axboe
@ 2023-03-14 18:17     ` Waiman Long
  1 sibling, 0 replies; 5+ messages in thread
From: Waiman Long @ 2023-03-14 18:17 UTC (permalink / raw)
  To: Michal Koutný, Daniel Dao; +Cc: Jens Axboe, io-uring, cgroups

On 3/14/23 12:25, Michal Koutný wrote:
> Hello.
>
> On Tue, Mar 14, 2023 at 10:07:40AM +0000, Daniel Dao <[email protected]> wrote:
>> IMO this violates the principle of cpuset and can be confusing for end users.
>> I think I prefer Waiman's suggestion of allowing an implicit move to cpuset
>> when enabling cpuset with subtree_control, but not explicit moves such as
>> setting cpuset.cpus or writing the pids into cgroup.procs. It's easier to
>> reason about and makes the failure mode more explicit.
>>
>> What do you think ?
> I think cpuset should take precedence over IO workers' affinity (just as it
> does over affinity set with sched_setaffinity(2)).
> Thus:
> - modifying cpuset.cpus	                update task's affinity, for sure
> - implicit migration (enabling cpuset)  update task's affinity, effective nop
Note that since commit 7fd4da9c1584 ("cgroup/cpuset: Optimize
cpuset_attach() on v2") in v6.2, implicit migration (enabling cpuset)
will not affect the cpu affinity of the process.
> - explicit migration (meh)              update task's affinity, ¯\_(ツ)_/¯
>
> My understanding of PF_NO_SETAFFINITY is that it's for kernel threads that
> do work that's functionally needed on a given CPU and thus cannot be
> migrated [1]. As said previously, for io_uring workers affinity is for
> performance only.
>
> Hence, I'd also suggest on top of 01e68ce08a30 ("io_uring/io-wq: stop
> setting PF_NO_SETAFFINITY on io-wq workers"):
>
> --- a/io_uring/sqpoll.c
> +++ b/io_uring/sqpoll.c
> @@ -233,7 +233,6 @@ static int io_sq_thread(void *data)
>                  set_cpus_allowed_ptr(current, cpumask_of(sqd->sq_cpu));
>          else
>                  set_cpus_allowed_ptr(current, cpu_online_mask);
> -       current->flags |= PF_NO_SETAFFINITY;
>
>          mutex_lock(&sqd->lock);
>          while (1) {
>
> Afterall, io_uring_setup(2) already mentions:
>> When cgroup setting cpuset.cpus changes (typically in container
>> environment), the bounded cpu set may be changed as well.

Using sched_setaffinity(2) can be another alternative. Starting from
v6.2, the cpu affinity set by sched_setaffinity(2) will be more or less
maintained and constrained by the current cpuset even if the cpu list is
being changed, as long as there is overlap between the two. The
intersection between the cpu affinity set by sched_setaffinity(2) and the
effective_cpus in cpuset will be the effective cpu affinity of the task.
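
A rough userspace illustration of that intersection rule (a hedged sketch,
assuming a v6.2+ kernel; the CPU numbers are arbitrary):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
	cpu_set_t want, got;
	int cpu;

	/* Request CPUs 1-2. On v6.2+ this requested mask is retained... */
	CPU_ZERO(&want);
	CPU_SET(1, &want);
	CPU_SET(2, &want);
	if (sched_setaffinity(0, sizeof(want), &want) < 0)
		perror("sched_setaffinity");

	/* ...and if cpuset.cpus is later restricted to e.g. 2-3, the
	 * effective mask read back here should be the intersection {2}
	 * rather than the earlier call having failed. */
	if (sched_getaffinity(0, sizeof(got), &got) == 0)
		for (cpu = 0; cpu < CPU_SETSIZE; cpu++)
			if (CPU_ISSET(cpu, &got))
				printf("effective cpu: %d\n", cpu);
	return 0;
}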

Cheers,
Longman


^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2023-03-14 18:18 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-08 14:27 [PATCH] io_uring/io-wq: stop setting PF_NO_SETAFFINITY on io-wq workers Jens Axboe
2023-03-14 10:07 ` Daniel Dao
2023-03-14 16:25   ` Michal Koutný
2023-03-14 16:48     ` Jens Axboe
2023-03-14 18:17     ` Waiman Long

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox