* Unexpected EINVAL when enabling cpuset in subtree_control when io_uring threads are running
From: Daniel Dao @ 2023-03-08 11:42 UTC
  To: Jens Axboe; +Cc: io-uring, Tejun Heo, cgroups, linux-kernel, kernel-team

Hi all,

We encountered EINVAL when enabling cpuset in cgroupv2 when io_uring
worker threads are running. Here are the steps to reproduce the failure
on kernel 6.1.14:

1. Remove cpuset from subtree_control

  > for d in $(find /sys/fs/cgroup/ -maxdepth 1 -type d); do echo '-cpuset' | sudo tee -a $d/cgroup.subtree_control; done
  > cat /sys/fs/cgroup/cgroup.subtree_control
  cpu io memory pids

2. Run any applications that utilize the uring worker thread pool. I used
   https://github.com/cloudflare/cloudflare-blog/tree/master/2022-02-io_uring-worker-pool

  > cargo run -- -a -w 2 -t 2

3. Enabling cpuset will return EINVAL

  > echo '+cpuset' | sudo tee -a /sys/fs/cgroup/cgroup.subtree_control
  +cpuset
  tee: /sys/fs/cgroup/cgroup.subtree_control: Invalid argument

We traced this down to task_can_attach(), which returns EINVAL when it
encounters kthreads with PF_NO_SETAFFINITY set, a flag that io_uring
worker threads also have.
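
A minimal sketch of how one might confirm from userspace which threads
carry the flag. It assumes that the ninth field of
/proc/<pid>/task/<tid>/stat holds the kernel task flags, as proc(5)
describes, and that PF_NO_SETAFFINITY is 0x04000000 as in
include/linux/sched.h for this kernel series. The helper names
stat_flags() and thread_has_no_setaffinity() are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumption: value copied from include/linux/sched.h; not a userspace ABI. */
#define PF_NO_SETAFFINITY 0x04000000UL

/*
 * Parse the task flags (field 9) out of one line of
 * /proc/<pid>/task/<tid>/stat. The comm field may contain spaces and
 * parentheses, so scanning starts after the last ')'.
 * Returns 0 on success, -1 if the line cannot be parsed.
 */
int stat_flags(const char *line, unsigned long *flags)
{
	const char *p;

	assert(line != NULL && flags != NULL);
	p = strrchr(line, ')');
	if (p == NULL)
		return -1;
	/* Skip state, ppid, pgrp, session, tty_nr, tpgid; then read flags. */
	if (sscanf(p + 1, " %*c %*d %*d %*d %*d %*d %lu", flags) != 1)
		return -1;
	return 0;
}

/*
 * Returns 1 if the thread has PF_NO_SETAFFINITY set, 0 if it does not,
 * and -1 if the stat file cannot be read or parsed.
 */
int thread_has_no_setaffinity(int pid, int tid)
{
	char path[64], line[1024];
	unsigned long flags;
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%d/task/%d/stat", pid, tid);
	f = fopen(path, "r");
	if (f == NULL)
		return -1;
	if (fgets(line, sizeof(line), f) == NULL) {
		fclose(f);
		return -1;
	}
	fclose(f);
	if (stat_flags(line, &flags) != 0)
		return -1;
	return (flags & PF_NO_SETAFFINITY) ? 1 : 0;
}
```

Calling thread_has_no_setaffinity() for every entry under
/proc/<pid>/task should point at the threads that make the migration
fail.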

This seems like an unexpected interaction when enabling cpuset for subtrees
that contain kthreads. We are currently considering a workaround: enable
cpuset in the root subtree_control before any io_uring application can
start, so that a failure to enable cpuset is confined to the cgroups that
contain io_uring kthreads. But this is cumbersome.

Any suggestions would be very much appreciated.

Thanks,
Daniel.


* Re: Unexpected EINVAL when enabling cpuset in subtree_control when io_uring threads are running
From: Jens Axboe @ 2023-03-08 13:50 UTC
  To: Daniel Dao; +Cc: io-uring, Tejun Heo, cgroups, linux-kernel, kernel-team

On 3/8/23 4:42 AM, Daniel Dao wrote:
> We traced this down to task_can_attach that will return EINVAL when it
> encounters
> kthreads with PF_NO_SETAFFINITY, which io_uring worker threads have.

One important thing to note here is that io_uring workers are not
kthreads, but like kthreads, they set PF_NO_SETAFFINITY which prevents
userspace from moving them around. We do have an explicit API for
setting the affinity of workers associated with an io_uring context,
however.
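
The explicit API referred to here is presumably the
IORING_REGISTER_IOWQ_AFF opcode of io_uring_register(2), which liburing
wraps as io_uring_register_iowq_aff(). A minimal sketch using the raw
syscalls follows, assuming kernel headers new enough to define that
opcode. The helper names ring_create() and ring_set_iowq_cpu() are
hypothetical, and the 1024-bit mask size is an arbitrary choice for the
example.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/io_uring.h>

#define MASK_BITS 1024	/* assumption: large enough for the target machine */
#define LONG_BITS (8 * sizeof(unsigned long))

/* Create an io_uring instance. Returns the ring fd, or -errno on failure. */
int ring_create(unsigned int entries)
{
	struct io_uring_params p;
	long fd;

	memset(&p, 0, sizeof(p));
	fd = syscall(__NR_io_uring_setup, entries, &p);
	return fd < 0 ? -errno : (int)fd;
}

/*
 * Restrict the io-wq workers of the given ring to a single CPU.
 * Returns 0 on success, or -errno on failure.
 */
int ring_set_iowq_cpu(int ring_fd, int cpu)
{
	unsigned long mask[MASK_BITS / LONG_BITS];
	long ret;

	assert(sizeof(mask) * 8 == MASK_BITS);
	if (cpu < 0 || cpu >= MASK_BITS)
		return -EINVAL;

	/* Build a bitmap with only the requested CPU set. */
	memset(mask, 0, sizeof(mask));
	mask[cpu / LONG_BITS] |= 1UL << (cpu % LONG_BITS);

	/* For this opcode, arg is the CPU bitmap and nr_args its size in bytes. */
	ret = syscall(__NR_io_uring_register, ring_fd,
		      IORING_REGISTER_IOWQ_AFF, mask, sizeof(mask));
	return ret < 0 ? -errno : 0;
}
```

This changes where the workers of one ring may run; it does not change
how cpuset treats those workers during a cgroup migration.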

But you are not the first to come across this, and I'm pondering how we
can improve this situation. io-wq blocks changing the CPU affinity
because it organizes workers within a node, but this is purely an
optimization and not integral to how it works.

One thing we could do is simply check the cpumask of the worker after it
has slept, if it woke up due to a timeout (i.e. not to handle real
work). That would lazily drop workers that are no longer affinitized
correctly. With that, I think it'd be sane to drop the PF_NO_SETAFFINITY
flag from the worker. Something like the below; it would be great if you
could test.


diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
index 411bb2d1acd4..669f50cb4e90 100644
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -616,7 +616,7 @@ static int io_wqe_worker(void *data)
 	struct io_wqe_acct *acct = io_wqe_get_acct(worker);
 	struct io_wqe *wqe = worker->wqe;
 	struct io_wq *wq = wqe->wq;
-	bool last_timeout = false;
+	bool exit_mask = false, last_timeout = false;
 	char buf[TASK_COMM_LEN];
 
 	worker->flags |= (IO_WORKER_F_UP | IO_WORKER_F_RUNNING);
@@ -632,8 +632,11 @@ static int io_wqe_worker(void *data)
 			io_worker_handle_work(worker);
 
 		raw_spin_lock(&wqe->lock);
-		/* timed out, exit unless we're the last worker */
-		if (last_timeout && acct->nr_workers > 1) {
+		/*
+		 * Last sleep timed out. Exit if we're not the last worker,
+		 * or if someone modified our affinity.
+		 */
+		if (last_timeout && (exit_mask || acct->nr_workers > 1)) {
 			acct->nr_workers--;
 			raw_spin_unlock(&wqe->lock);
 			__set_current_state(TASK_RUNNING);
@@ -652,7 +655,11 @@ static int io_wqe_worker(void *data)
 				continue;
 			break;
 		}
-		last_timeout = !ret;
+		if (!ret) {
+			last_timeout = true;
+			exit_mask = !cpumask_test_cpu(smp_processor_id(),
+							wqe->cpu_mask);
+		}
 	}
 
 	if (test_bit(IO_WQ_BIT_EXIT, &wq->state))
@@ -704,7 +711,6 @@ static void io_init_new_worker(struct io_wqe *wqe, struct io_worker *worker,
 	tsk->worker_private = worker;
 	worker->task = tsk;
 	set_cpus_allowed_ptr(tsk, wqe->cpu_mask);
-	tsk->flags |= PF_NO_SETAFFINITY;
 
 	raw_spin_lock(&wqe->lock);
 	hlist_nulls_add_head_rcu(&worker->nulls_node, &wqe->free_list);

-- 
Jens Axboe



* Re: Unexpected EINVAL when enabling cpuset in subtree_control when io_uring threads are running
From: Waiman Long @ 2023-03-08 14:20 UTC
  To: Daniel Dao, Jens Axboe
  Cc: io-uring, Tejun Heo, cgroups, linux-kernel, kernel-team

Anytime you echo "+cpuset" to cgroup.subtree_control to enable cpuset,
the tasks within the child cgroups do an implicit move from the parent
cpuset to the child cpusets. However, that move fails if any task has the
PF_NO_SETAFFINITY flag set, because the task_can_attach() function checks
for it. One possible solution is for cpuset to ignore tasks with
PF_NO_SETAFFINITY set during an implicit move. IOW, allow the implicit
move without touching such tasks, but still reject an explicit move done
through cgroup.procs.
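
To make the difference concrete, here is a toy userspace model of the
two policies. It is not kernel code: struct toy_task and both functions
are hypothetical, and PF_NO_SETAFFINITY is assumed to be 0x04000000 as in
include/linux/sched.h.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumption: value copied from include/linux/sched.h. */
#define PF_NO_SETAFFINITY 0x04000000u

/* Hypothetical stand-in for the few task fields that matter here. */
struct toy_task {
	const char *comm;
	unsigned int flags;
};

/*
 * Behavior as described above: one task with PF_NO_SETAFFINITY makes
 * the whole migration fail with -EINVAL.
 */
int toy_can_attach(const struct toy_task *tasks, size_t n)
{
	assert(tasks != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (tasks[i].flags & PF_NO_SETAFFINITY)
			return -EINVAL;
	}
	return 0;
}

/*
 * Suggested behavior: an implicit move (subtree_control) leaves such
 * tasks untouched and succeeds, while an explicit move (cgroup.procs)
 * is still rejected.
 */
int toy_can_attach_relaxed(const struct toy_task *tasks, size_t n,
			   bool implicit)
{
	assert(tasks != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (!(tasks[i].flags & PF_NO_SETAFFINITY))
			continue;
		if (!implicit)
			return -EINVAL;
		/* Implicit move: skip the task, keep its affinity as is. */
	}
	return 0;
}
```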

Cheers,
Longman



* Re: Unexpected EINVAL when enabling cpuset in subtree_control when io_uring threads are running
From: Jens Axboe @ 2023-03-08 14:26 UTC
  To: Waiman Long, Daniel Dao
  Cc: io-uring, Tejun Heo, cgroups, linux-kernel, kernel-team

On 3/8/23 7:20 AM, Waiman Long wrote:
> Anytime you echo "+cpuset" to cgroup.subtree_control to enable cpuset,
> the tasks within the child cgroups will do an implicit move from the
> parent cpuset to the child cpusets. However, that move will fail if
> any task has the PF_NO_SETAFFINITY flag set due to task_can_attach()
> function which checks for this. One possible solution is for the
> cpuset to ignore tasks with PF_NO_SETAFFINITY set for implicit move.
> IOW, allowing the implicit move without touching it, but not explicit
> one using cgroup.procs.

I was pondering this too as I was typing my reply, but at least for
io-wq, this report isn't the first to be puzzled or broken by the fact
that task threads might have PF_NO_SETAFFINITY set. So while it might be
worthwhile for cpuset to ignore PF_NO_SETAFFINITY as a separate fix,
I think it's better to fix io-wq in general. Not sure we have other
cases where it's even possible to have PF_NO_SETAFFINITY set on
userspace threads?

-- 
Jens Axboe



* Re: Unexpected EINVAL when enabling cpuset in subtree_control when io_uring threads are running
From: Waiman Long @ 2023-03-08 14:44 UTC
  To: Jens Axboe, Daniel Dao
  Cc: io-uring, Tejun Heo, cgroups, linux-kernel, kernel-team

On 3/8/23 09:26, Jens Axboe wrote:
> I was pondering this too as I was typing my reply, but at least for
> io-wq, this report isn't the first to be puzzled or broken by the fact
> that task threads might have PF_NO_SETAFFINITY set. So while it might be
> worthwhile for cpuset to ignore PF_NO_SETAFFINITY as a separate fix,
> I think it's better to fix io-wq in general. Not sure we have other
> cases where it's even possible to have PF_NO_SETAFFINITY set on
> userspace threads?

Changing the current cpuset behavior is an alternative solution. It is a
problem anytime a task (user or kthread) has PF_NO_SETAFFINITY set but is
not in the root cgroup. Besides io_uring, I have no idea whether there are
other use cases out there. It is just a change we may need to make in the
future if other similar cases show up. Since you are fixing it on the
io-wq side, it is not an urgent issue that needs to be addressed on the
cpuset side.

Thanks,
Longman

