* [PATCH 6.1 1/2] io_uring/io-wq: do not allow pinning outside of cpuset
From: Felix Moessbauer @ 2024-09-11 16:23 UTC
To: axboe
Cc: stable, asml.silence, linux-kernel, io-uring, cgroups, dqminh,
longman, adriaan.schmidt, florian.bezdeka, Felix Moessbauer
commit 0997aa5497c714edbb349ca366d28bd550ba3408 upstream.
The io worker threads are userland threads that just never exit to
userland. As such, they are also assigned to a cgroup (the cgroup of the
creating task).

When changing the affinity of an io_wq thread via syscall, we must only
allow cpumasks that lie within the limits defined by the cpuset controller
of the cgroup (if enabled).
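
For illustration, a minimal userspace sketch of the affinity path this
constrains, assuming liburing's io_uring_register_iowq_aff() wrapper
(error handling trimmed); with this change the request fails with
-EINVAL when the mask reaches outside the caller's cpuset:

#define _GNU_SOURCE
#include <liburing.h>
#include <sched.h>
#include <stdio.h>

int main(void)
{
	struct io_uring ring;
	cpu_set_t mask;
	int ret;

	io_uring_queue_init(8, &ring, 0);

	CPU_ZERO(&mask);
	CPU_SET(0, &mask);	/* ask to pin io-wq workers to CPU 0 */

	/* fails with -EINVAL if CPU 0 is not in our cgroup's cpuset */
	ret = io_uring_register_iowq_aff(&ring, sizeof(mask), &mask);
	if (ret < 0)
		fprintf(stderr, "iowq_aff: %d\n", ret);

	io_uring_queue_exit(&ring);
	return 0;
}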
Fixes: da64d6db3bd3 ("io_uring: One wqe per wq")
Signed-off-by: Felix Moessbauer <[email protected]>
---
io_uring/io-wq.c | 25 +++++++++++++++++++------
1 file changed, 19 insertions(+), 6 deletions(-)
diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
index 139cd49b2c27..c74bcc8d2f06 100644
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -13,6 +13,7 @@
 #include <linux/slab.h>
 #include <linux/rculist_nulls.h>
 #include <linux/cpu.h>
+#include <linux/cpuset.h>
 #include <linux/task_work.h>
 #include <linux/audit.h>
 #include <uapi/linux/io_uring.h>
@@ -1362,22 +1363,34 @@ static int io_wq_cpu_offline(unsigned int cpu, struct hlist_node *node)
 
 int io_wq_cpu_affinity(struct io_uring_task *tctx, cpumask_var_t mask)
 {
+	cpumask_var_t allowed_mask;
+	int ret = 0;
 	int i;
 
 	if (!tctx || !tctx->io_wq)
 		return -EINVAL;
 
+	if (!alloc_cpumask_var(&allowed_mask, GFP_KERNEL))
+		return -ENOMEM;
+	cpuset_cpus_allowed(tctx->io_wq->task, allowed_mask);
+
 	rcu_read_lock();
 	for_each_node(i) {
 		struct io_wqe *wqe = tctx->io_wq->wqes[i];
-
-		if (mask)
-			cpumask_copy(wqe->cpu_mask, mask);
-		else
-			cpumask_copy(wqe->cpu_mask, cpumask_of_node(i));
+		if (mask) {
+			if (cpumask_subset(mask, allowed_mask))
+				cpumask_copy(wqe->cpu_mask, mask);
+			else
+				ret = -EINVAL;
+		} else {
+			if (!cpumask_and(wqe->cpu_mask, cpumask_of_node(i), allowed_mask))
+				cpumask_copy(wqe->cpu_mask, allowed_mask);
+		}
 	}
 	rcu_read_unlock();
-	return 0;
+
+	free_cpumask_var(allowed_mask);
+	return ret;
 }
 
 /*
--
2.39.2
* [PATCH 6.1 2/2] io_uring/io-wq: inherit cpuset of cgroup in io worker
From: Felix Moessbauer @ 2024-09-11 16:23 UTC
To: axboe
Cc: stable, asml.silence, linux-kernel, io-uring, cgroups, dqminh,
longman, adriaan.schmidt, florian.bezdeka, Felix Moessbauer
commit 84eacf177faa605853c58e5b1c0d9544b88c16fd upstream.
The io worker threads are userland threads that just never exit to
userland. As such, they are also assigned to a cgroup (the cgroup of the
creating task).

When creating a new io worker, this worker should inherit the cpuset
of the cgroup.
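
Sketched (using the names from the hunk below), the semantics at worker
creation become: intersect the per-node mask with the creating task's
cpuset, and fall back to the cpuset alone if that intersection is empty:

	cpuset_cpus_allowed(current, allowed_mask);
	if (!cpumask_and(wqe->cpu_mask, cpumask_of_node(node), allowed_mask))
		/* empty intersection: use the cpuset alone */
		cpumask_copy(wqe->cpu_mask, allowed_mask);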
Fixes: da64d6db3bd3 ("io_uring: One wqe per wq")
Signed-off-by: Felix Moessbauer <[email protected]>
---
io_uring/io-wq.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
index c74bcc8d2f06..04265bf8d319 100644
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -1157,6 +1157,7 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
 {
 	int ret, node, i;
 	struct io_wq *wq;
+	cpumask_var_t allowed_mask;
 
 	if (WARN_ON_ONCE(!data->free_work || !data->do_work))
 		return ERR_PTR(-EINVAL);
@@ -1176,6 +1177,9 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
 	wq->do_work = data->do_work;
 
 	ret = -ENOMEM;
+	if (!alloc_cpumask_var(&allowed_mask, GFP_KERNEL))
+		goto err;
+	cpuset_cpus_allowed(current, allowed_mask);
 	for_each_node(node) {
 		struct io_wqe *wqe;
 		int alloc_node = node;
@@ -1188,7 +1192,8 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
 		wq->wqes[node] = wqe;
 		if (!alloc_cpumask_var(&wqe->cpu_mask, GFP_KERNEL))
 			goto err;
-		cpumask_copy(wqe->cpu_mask, cpumask_of_node(node));
+		if (!cpumask_and(wqe->cpu_mask, cpumask_of_node(node), allowed_mask))
+			cpumask_copy(wqe->cpu_mask, allowed_mask);
 		wqe->node = alloc_node;
 		wqe->acct[IO_WQ_ACCT_BOUND].max_workers = bounded;
 		wqe->acct[IO_WQ_ACCT_UNBOUND].max_workers =
@@ -1222,6 +1227,7 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
 		free_cpumask_var(wq->wqes[node]->cpu_mask);
 		kfree(wq->wqes[node]);
 	}
+	free_cpumask_var(allowed_mask);
 err_wq:
 	kfree(wq);
 	return ERR_PTR(ret);
--
2.39.2
* Re: [PATCH 6.1 0/2] io_uring/io-wq: respect cgroup cpusets
From: Jens Axboe @ 2024-09-11 16:28 UTC
To: Felix Moessbauer
Cc: stable, asml.silence, linux-kernel, io-uring, cgroups, dqminh,
longman, adriaan.schmidt, florian.bezdeka
On 9/11/24 10:23 AM, Felix Moessbauer wrote:
> Hi,
>
> as discussed in [1], this is a manual backport of the remaining two
> patches to let the io worker threads respect the affinities defined by
> the cgroup of the process.
>
> In 6.1 one worker is created per NUMA node, while in da64d6db3bd3
> ("io_uring: One wqe per wq") this is changed to only have a single worker.
> As this patch is pretty invasive, Jens and I agreed not to backport it.
>
> Instead we now limit the workers' cpuset to the CPUs that are in the
> intersection between what the cgroup allows and what the NUMA node has.
> This leaves the question of what to do in case the intersection is empty:
> To be backwards compatible, we allow this case, but restrict the cpumask
> of the poller to the cpuset defined by the cgroup. We further believe
> this is a reasonable decision, as da64d6db3bd3 drops the NUMA awareness
> anyway.
>
> [1] https://lore.kernel.org/lkml/[email protected]
The upstream patches are staged for 6.12 and marked for a backport, so
they should go upstream next week. Once they are upstream, I'll make
sure to check in on these on the stable front.
--
Jens Axboe
* Re: [PATCH 6.1 0/2] io_uring/io-wq: respect cgroup cpusets
From: Greg KH @ 2024-09-30 19:15 UTC
To: Felix Moessbauer
Cc: axboe, stable, asml.silence, linux-kernel, io-uring, cgroups,
dqminh, longman, adriaan.schmidt, florian.bezdeka
On Wed, Sep 11, 2024 at 06:23:14PM +0200, Felix Moessbauer wrote:
> Hi,
>
> as discussed in [1], this is a manual backport of the remaining two
> patches to let the io worker threads respect the affinities defined by
> the cgroup of the process.
>
> In 6.1 one worker is created per NUMA node, while in da64d6db3bd3
> ("io_uring: One wqe per wq") this is changed to only have a single worker.
> As this patch is pretty invasive, Jens and I agreed not to backport it.
>
> Instead we now limit the workers' cpuset to the CPUs that are in the
> intersection between what the cgroup allows and what the NUMA node has.
> This leaves the question of what to do in case the intersection is empty:
> To be backwards compatible, we allow this case, but restrict the cpumask
> of the poller to the cpuset defined by the cgroup. We further believe
> this is a reasonable decision, as da64d6db3bd3 drops the NUMA awareness
> anyway.
>
> [1] https://lore.kernel.org/lkml/[email protected]
Why was neither of these actually tagged for inclusion in a stable tree?
Why just 6.1.y? Please submit them for all relevant kernel versions.
thanks,
greg k-h
* Re: [PATCH 6.1 0/2] io_uring/io-wq: respect cgroup cpusets
From: MOESSBAUER, Felix @ 2024-10-01 7:32 UTC
To: [email protected]
Cc: Schmidt, Adriaan, [email protected],
[email protected], Bezdeka, Florian,
[email protected], [email protected], [email protected],
[email protected], [email protected],
[email protected]
On Mon, 2024-09-30 at 21:15 +0200, Greg KH wrote:
> On Wed, Sep 11, 2024 at 06:23:14PM +0200, Felix Moessbauer wrote:
> > Hi,
> >
> > as discussed in [1], this is a manual backport of the remaining two
> > patches to let the io worker threads respect the affinities defined by
> > the cgroup of the process.
> >
> > In 6.1 one worker is created per NUMA node, while in da64d6db3bd3
> > ("io_uring: One wqe per wq") this is changed to only have a single worker.
> > As this patch is pretty invasive, Jens and I agreed not to backport it.
> >
> > Instead we now limit the workers' cpuset to the CPUs that are in the
> > intersection between what the cgroup allows and what the NUMA node has.
> > This leaves the question of what to do in case the intersection is empty:
> > To be backwards compatible, we allow this case, but restrict the cpumask
> > of the poller to the cpuset defined by the cgroup. We further believe
> > this is a reasonable decision, as da64d6db3bd3 drops the NUMA awareness
> > anyway.
> >
> > [1] https://lore.kernel.org/lkml/[email protected]
>
> Why was neither of these actually tagged for inclusion in a stable
> tree?
This is a manual backport of these patches for 6.1, as the subsystem
changed significantly between 6.1 and 6.2, making an automated backport
impossible. This has been agreed on with Jens in
https://lore.kernel.org/lkml/[email protected]/
> Why just 6.1.y? Please submit them for all relevant kernel versions.
The original patch was tagged stable and got accepted in 6.6, 6.10 and
6.11.
Felix
>
> thanks,
>
> greg k-h
--
Siemens AG, Technology
Linux Expert Center
* Re: [PATCH 6.1 0/2] io_uring/io-wq: respect cgroup cpusets
From: gregkh @ 2024-10-01 7:50 UTC
To: MOESSBAUER, Felix
Cc: Schmidt, Adriaan, [email protected],
[email protected], Bezdeka, Florian,
[email protected], [email protected], [email protected],
[email protected], [email protected],
[email protected]
On Tue, Oct 01, 2024 at 07:32:42AM +0000, MOESSBAUER, Felix wrote:
> On Mon, 2024-09-30 at 21:15 +0200, Greg KH wrote:
> > On Wed, Sep 11, 2024 at 06:23:14PM +0200, Felix Moessbauer wrote:
> > > Hi,
> > >
> > > as discussed in [1], this is a manual backport of the remaining two
> > > patches to let the io worker threads respect the affinities defined by
> > > the cgroup of the process.
> > >
> > > In 6.1 one worker is created per NUMA node, while in da64d6db3bd3
> > > ("io_uring: One wqe per wq") this is changed to only have a single worker.
> > > As this patch is pretty invasive, Jens and I agreed not to backport it.
> > >
> > > Instead we now limit the workers' cpuset to the CPUs that are in the
> > > intersection between what the cgroup allows and what the NUMA node has.
> > > This leaves the question of what to do in case the intersection is empty:
> > > To be backwards compatible, we allow this case, but restrict the cpumask
> > > of the poller to the cpuset defined by the cgroup. We further believe
> > > this is a reasonable decision, as da64d6db3bd3 drops the NUMA awareness
> > > anyway.
> > >
> > > [1] https://lore.kernel.org/lkml/[email protected]
> >
> > Why was neither of these actually tagged for inclusion in a stable
> > tree?
>
> This is a manual backport of these patches for 6.1, as the subsystem
> changed significantly between 6.1 and 6.2, making an automated backport
> impossible. This has been agreed on with Jens in
> https://lore.kernel.org/lkml/[email protected]/
>
> > Why just 6.1.y? Please submit them for all relevant kernel versions.
>
> The original patch was tagged stable and got accepted in 6.6, 6.10 and
> 6.11.
No, they were not at all. Please properly tag them in the future as per
the documentation if you wish to have things applied to the stable
trees:
https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
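
For reference, the tag that document describes is a footer line such as
the following (the version hint after the # is optional):

	Cc: <stable@vger.kernel.org> # 6.1

alongside the Fixes: tag these patches already carry.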
thanks,
greg k-h
* Re: [PATCH 6.1 0/2] io_uring/io-wq: respect cgroup cpusets
From: Jens Axboe @ 2024-10-01 13:35 UTC
To: [email protected], MOESSBAUER, Felix
Cc: Schmidt, Adriaan, [email protected],
[email protected], Bezdeka, Florian,
[email protected], [email protected],
[email protected], [email protected],
[email protected]
On 10/1/24 1:50 AM, [email protected] wrote:
> On Tue, Oct 01, 2024 at 07:32:42AM +0000, MOESSBAUER, Felix wrote:
>> On Mon, 2024-09-30 at 21:15 +0200, Greg KH wrote:
>>> On Wed, Sep 11, 2024 at 06:23:14PM +0200, Felix Moessbauer wrote:
>>>> Hi,
>>>>
>>>> as discussed in [1], this is a manual backport of the remaining two
>>>> patches to let the io worker threads respect the affinities defined by
>>>> the cgroup of the process.
>>>>
>>>> In 6.1 one worker is created per NUMA node, while in da64d6db3bd3
>>>> ("io_uring: One wqe per wq") this is changed to only have a single worker.
>>>> As this patch is pretty invasive, Jens and I agreed not to backport it.
>>>>
>>>> Instead we now limit the workers' cpuset to the CPUs that are in the
>>>> intersection between what the cgroup allows and what the NUMA node has.
>>>> This leaves the question of what to do in case the intersection is empty:
>>>> To be backwards compatible, we allow this case, but restrict the cpumask
>>>> of the poller to the cpuset defined by the cgroup. We further believe
>>>> this is a reasonable decision, as da64d6db3bd3 drops the NUMA awareness
>>>> anyway.
>>>>
>>>> [1] https://lore.kernel.org/lkml/[email protected]
>>>
>>> Why was neither of these actually tagged for inclusion in a stable
>>> tree?
>>
>> This is a manual backport of these patches for 6.1, as the subsystem
>> changed significantly between 6.1 and 6.2, making an automated backport
>> impossible. This has been agreed on with Jens in
>> https://lore.kernel.org/lkml/[email protected]/
>>
>>> Why just 6.1.y? Please submit them for all relevant kernel versions.
>>
>> The original patch was tagged stable and got accepted in 6.6, 6.10 and
>> 6.11.
>
> No, they were not at all. Please properly tag them in the future as per
> the documentation if you wish to have things applied to the stable
> trees:
> https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
That's my bad; I missed that one of them did not get marked for stable.
The sqpoll one did.
--
Jens Axboe