* [PATCH 2/2] nvme: use uring_cmd sys_admin flag
2023-12-04 17:53 [PATCH 1/2] iouring: one capable call per iouring instance Keith Busch
@ 2023-12-04 17:53 ` Keith Busch
2023-12-04 18:05 ` [PATCH 1/2] iouring: one capable call per iouring instance Jens Axboe
` (2 subsequent siblings)
3 siblings, 0 replies; 21+ messages in thread
From: Keith Busch @ 2023-12-04 17:53 UTC (permalink / raw)
To: linux-nvme, io-uring; +Cc: axboe, hch, sagi, asml.silence, Keith Busch
From: Keith Busch <[email protected]>
The nvme passthrough interface through io_uring is intended to be fast, so we
should avoid calling capable() every io. Checking other permission first helped
reduce this overhead, but it's still called for many commands.
Use the new uring_cmd sys admin issue_flag to see if we can skip
additional checks. The ioctl path won't be able to use this
optimization, but that wasn't considered a fast path anyway.
Signed-off-by: Keith Busch <[email protected]>
---
drivers/nvme/host/ioctl.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index 6c5ae820bc0fc..83c0a1170505c 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -11,6 +11,7 @@
enum {
NVME_IOCTL_VEC = (1 << 0),
NVME_IOCTL_PARTITION = (1 << 1),
+ NVME_IOCTL_SYS_ADMIN = (1 << 2),
};
static bool nvme_cmd_allowed(struct nvme_ns *ns, struct nvme_command *c,
@@ -18,6 +19,9 @@ static bool nvme_cmd_allowed(struct nvme_ns *ns, struct nvme_command *c,
{
u32 effects;
+ if (flags & NVME_IOCTL_SYS_ADMIN)
+ return true;
+
/*
* Do not allow unprivileged passthrough on partitions, as that allows an
* escape from the containment of the partition.
@@ -445,7 +449,7 @@ static int nvme_uring_cmd_io(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
struct request *req;
blk_opf_t rq_flags = REQ_ALLOC_CACHE;
blk_mq_req_flags_t blk_flags = 0;
- int ret;
+ int ret, flags = 0;
c.common.opcode = READ_ONCE(cmd->opcode);
c.common.flags = READ_ONCE(cmd->flags);
@@ -468,7 +472,11 @@ static int nvme_uring_cmd_io(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
c.common.cdw14 = cpu_to_le32(READ_ONCE(cmd->cdw14));
c.common.cdw15 = cpu_to_le32(READ_ONCE(cmd->cdw15));
- if (!nvme_cmd_allowed(ns, &c, 0, ioucmd->file->f_mode & FMODE_WRITE))
+ if (issue_flags & IO_URING_F_SYS_ADMIN)
+ flags |= NVME_IOCTL_SYS_ADMIN;
+
+ if (!nvme_cmd_allowed(ns, &c, flags,
+ ioucmd->file->f_mode & FMODE_WRITE))
return -EACCES;
d.metadata = READ_ONCE(cmd->metadata);
--
2.34.1
^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH 1/2] iouring: one capable call per iouring instance
2023-12-04 17:53 [PATCH 1/2] iouring: one capable call per iouring instance Keith Busch
2023-12-04 17:53 ` [PATCH 2/2] nvme: use uring_cmd sys_admin flag Keith Busch
@ 2023-12-04 18:05 ` Jens Axboe
2023-12-04 18:45 ` Pavel Begunkov
2023-12-05 16:21 ` Kanchan Joshi
2023-12-04 18:15 ` Jens Axboe
2023-12-04 18:40 ` Jeff Moyer
3 siblings, 2 replies; 21+ messages in thread
From: Jens Axboe @ 2023-12-04 18:05 UTC (permalink / raw)
To: Keith Busch, linux-nvme, io-uring; +Cc: hch, sagi, asml.silence, Keith Busch
On 12/4/23 10:53 AM, Keith Busch wrote:
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index 1d254f2c997de..4aa10b64f539e 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -3980,6 +3980,7 @@ static __cold int io_uring_create(unsigned entries, struct io_uring_params *p,
> ctx->syscall_iopoll = 1;
>
> ctx->compat = in_compat_syscall();
> + ctx->sys_admin = capable(CAP_SYS_ADMIN);
> if (!ns_capable_noaudit(&init_user_ns, CAP_IPC_LOCK))
> ctx->user = get_uid(current_user());
Hmm, what happens if the app starts as eg root for initialization
purposes and drops caps after? That would have previously have caused
passthrough to fail, but now it will work. Perhaps this is fine, after
all this isn't unusual for eg opening device or doing other init special
work?
In any case, that should definitely be explicitly mentioned in the
commit message for a change like that.
--
Jens Axboe
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 1/2] iouring: one capable call per iouring instance
2023-12-04 18:05 ` [PATCH 1/2] iouring: one capable call per iouring instance Jens Axboe
@ 2023-12-04 18:45 ` Pavel Begunkov
2023-12-05 16:21 ` Kanchan Joshi
1 sibling, 0 replies; 21+ messages in thread
From: Pavel Begunkov @ 2023-12-04 18:45 UTC (permalink / raw)
To: Jens Axboe, Keith Busch, linux-nvme, io-uring; +Cc: hch, sagi, Keith Busch
On 12/4/23 18:05, Jens Axboe wrote:
> On 12/4/23 10:53 AM, Keith Busch wrote:
>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
>> index 1d254f2c997de..4aa10b64f539e 100644
>> --- a/io_uring/io_uring.c
>> +++ b/io_uring/io_uring.c
>> @@ -3980,6 +3980,7 @@ static __cold int io_uring_create(unsigned entries, struct io_uring_params *p,
>> ctx->syscall_iopoll = 1;
>>
>> ctx->compat = in_compat_syscall();
>> + ctx->sys_admin = capable(CAP_SYS_ADMIN);
>> if (!ns_capable_noaudit(&init_user_ns, CAP_IPC_LOCK))
>> ctx->user = get_uid(current_user());
>
> Hmm, what happens if the app starts as eg root for initialization
> purposes and drops caps after? That would have previously have caused
> passthrough to fail, but now it will work. Perhaps this is fine, after
> all this isn't unusual for eg opening device or doing other init special
> work?
The side effects would be quite a surprise when you initialize the ring
from a privileged process and then pass it to a less capable one. Ring
sharing would also be affected. Privilege downgrade also sounds like
a valid concern. The first two will be solved if restricted to
IORING_SETUP_DEFER_TASKRUN rings and
io_is_capable() {
return ctx->sys_admin || capable();
}
And it still doesn't seem great bypassing it, when the question is
rather why it's expensive? I've seen before in the wild a fat BPF
program running on every call, is that what happens?
> In any case, that should definitely be explicitly mentioned in the
> commit message for a change like that.
>
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 1/2] iouring: one capable call per iouring instance
2023-12-04 18:05 ` [PATCH 1/2] iouring: one capable call per iouring instance Jens Axboe
2023-12-04 18:45 ` Pavel Begunkov
@ 2023-12-05 16:21 ` Kanchan Joshi
2023-12-06 21:09 ` Keith Busch
1 sibling, 1 reply; 21+ messages in thread
From: Kanchan Joshi @ 2023-12-05 16:21 UTC (permalink / raw)
To: Jens Axboe, Keith Busch, linux-nvme, io-uring
Cc: hch, sagi, asml.silence, Keith Busch
On 12/4/2023 11:35 PM, Jens Axboe wrote:
> On 12/4/23 10:53 AM, Keith Busch wrote:
>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
>> index 1d254f2c997de..4aa10b64f539e 100644
>> --- a/io_uring/io_uring.c
>> +++ b/io_uring/io_uring.c
>> @@ -3980,6 +3980,7 @@ static __cold int io_uring_create(unsigned entries, struct io_uring_params *p,
>> ctx->syscall_iopoll = 1;
>>
>> ctx->compat = in_compat_syscall();
>> + ctx->sys_admin = capable(CAP_SYS_ADMIN);
>> if (!ns_capable_noaudit(&init_user_ns, CAP_IPC_LOCK))
>> ctx->user = get_uid(current_user());
> Hmm, what happens if the app starts as eg root for initialization
> purposes and drops caps after? That would have previously have caused
> passthrough to fail, but now it will work.
Does it sound any better if this 'super ring' type of ability is asked
explicitly by a setup flag say IORING_SETUP_CAPABLE_ONCE.
It does not change the old behavior. It also implies that capable user
knows what it asked for, so no need to keep things in sync if the
process drops caps after ring setup is done.
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 4aa10b64f539..589e614144b6 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -3981,6 +3981,8 @@ static __cold int io_uring_create(unsigned
entries, struct io_uring_params *p,
ctx->compat = in_compat_syscall();
+ if (ctx->flags & IORING_SETUP_CAPABLE_ONCE &&
capable(CAP_SYS_ADMIN))
+ ctx->sys_admin = 1;
if (!ns_capable_noaudit(&init_user_ns, CAP_IPC_LOCK))
ctx->user = get_uid(current_user());
^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH 1/2] iouring: one capable call per iouring instance
2023-12-05 16:21 ` Kanchan Joshi
@ 2023-12-06 21:09 ` Keith Busch
0 siblings, 0 replies; 21+ messages in thread
From: Keith Busch @ 2023-12-06 21:09 UTC (permalink / raw)
To: Kanchan Joshi
Cc: Jens Axboe, Keith Busch, linux-nvme, io-uring, hch, sagi,
asml.silence
On Tue, Dec 05, 2023 at 09:51:13PM +0530, Kanchan Joshi wrote:
> Does it sound any better if this 'super ring' type of ability is asked
> explicitly by a setup flag say IORING_SETUP_CAPABLE_ONCE.
Just my opinion: that option sounds good!
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 1/2] iouring: one capable call per iouring instance
2023-12-04 17:53 [PATCH 1/2] iouring: one capable call per iouring instance Keith Busch
2023-12-04 17:53 ` [PATCH 2/2] nvme: use uring_cmd sys_admin flag Keith Busch
2023-12-04 18:05 ` [PATCH 1/2] iouring: one capable call per iouring instance Jens Axboe
@ 2023-12-04 18:15 ` Jens Axboe
2023-12-04 18:40 ` Jeff Moyer
3 siblings, 0 replies; 21+ messages in thread
From: Jens Axboe @ 2023-12-04 18:15 UTC (permalink / raw)
To: Keith Busch, linux-nvme, io-uring; +Cc: hch, sagi, asml.silence, Keith Busch
On 12/4/23 10:53 AM, Keith Busch wrote:
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index 1d254f2c997de..4aa10b64f539e 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -3980,6 +3980,7 @@ static __cold int io_uring_create(unsigned entries, struct io_uring_params *p,
> ctx->syscall_iopoll = 1;
>
> ctx->compat = in_compat_syscall();
> + ctx->sys_admin = capable(CAP_SYS_ADMIN);
> if (!ns_capable_noaudit(&init_user_ns, CAP_IPC_LOCK))
> ctx->user = get_uid(current_user());
>
> diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
> index 8a38b9f75d841..764f0e004aa00 100644
> --- a/io_uring/uring_cmd.c
> +++ b/io_uring/uring_cmd.c
> @@ -164,6 +164,8 @@ int io_uring_cmd(struct io_kiocb *req, unsigned int issue_flags)
> issue_flags |= IO_URING_F_CQE32;
> if (ctx->compat)
> issue_flags |= IO_URING_F_COMPAT;
> + if (ctx->sys_admin)
> + issue_flags |= IO_URING_F_SYS_ADMIN;
> if (ctx->flags & IORING_SETUP_IOPOLL) {
> if (!file->f_op->uring_cmd_iopoll)
> return -EOPNOTSUPP;
Since we know have two flags that would be cached from init time, rather
than add a second branch for this, why not have:
io_uring_create()
{
[...]
if (in_compat_syscall())
ctx->issue_flags |= IO_URING_F_COMPAT;
if (capable(CAP_SYS_ADMIN))
ctx->issue_flags |= IO_URING_F_SYS_ADMIN;
[...]
}
and get rid of ->compat and ->sys_admin, and then have:
io_uring_cmd()
{
issue_flags |= ctx->issue_flags;
}
Then we can also drop checking for IORING_SETUP_SQE128/CQE32 as well,
dropping two more fast path branches.
--
Jens Axboe
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 1/2] iouring: one capable call per iouring instance
2023-12-04 17:53 [PATCH 1/2] iouring: one capable call per iouring instance Keith Busch
` (2 preceding siblings ...)
2023-12-04 18:15 ` Jens Axboe
@ 2023-12-04 18:40 ` Jeff Moyer
2023-12-04 18:57 ` Keith Busch
2023-12-04 19:01 ` Jens Axboe
3 siblings, 2 replies; 21+ messages in thread
From: Jeff Moyer @ 2023-12-04 18:40 UTC (permalink / raw)
To: Keith Busch
Cc: linux-nvme, io-uring, axboe, hch, sagi, asml.silence, Keith Busch,
linux-security-module
I added a CC: linux-security-module@vger
Hi, Keith,
Keith Busch <[email protected]> writes:
> From: Keith Busch <[email protected]>
>
> The uring_cmd operation is often used for privileged actions, so drivers
> subscribing to this interface check capable() for each command. The
> capable() function is not fast path friendly for many kernel configs,
> and this can really harm performance. Stash the capable sys admin
> attribute in the io_uring context and set a new issue_flag for the
> uring_cmd interface.
I have a few questions. What privileged actions are performance
sensitive? I would hope that anything requiring privileges would not be
in a fast path (but clearly that's not the case). What performance
benefits did you measure with this patch set in place (and on what
workloads)? What happens when a ring fd is passed to another process?
Finally, as Jens mentioned, I would expect dropping priviliges to, you
know, drop privileges. I don't think a commit message is going to be
enough documentation for a change like this.
Cheers,
Jeff
>
> Signed-off-by: Keith Busch <[email protected]>
> ---
> include/linux/io_uring_types.h | 4 ++++
> io_uring/io_uring.c | 1 +
> io_uring/uring_cmd.c | 2 ++
> 3 files changed, 7 insertions(+)
>
> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
> index bebab36abce89..d64d6916753f0 100644
> --- a/include/linux/io_uring_types.h
> +++ b/include/linux/io_uring_types.h
> @@ -36,6 +36,9 @@ enum io_uring_cmd_flags {
> /* set when uring wants to cancel a previously issued command */
> IO_URING_F_CANCEL = (1 << 11),
> IO_URING_F_COMPAT = (1 << 12),
> +
> + /* ring validated as CAP_SYS_ADMIN capable */
> + IO_URING_F_SYS_ADMIN = (1 << 13),
> };
>
> struct io_wq_work_node {
> @@ -240,6 +243,7 @@ struct io_ring_ctx {
> unsigned int poll_activated: 1;
> unsigned int drain_disabled: 1;
> unsigned int compat: 1;
> + unsigned int sys_admin: 1;
>
> struct task_struct *submitter_task;
> struct io_rings *rings;
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index 1d254f2c997de..4aa10b64f539e 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -3980,6 +3980,7 @@ static __cold int io_uring_create(unsigned entries, struct io_uring_params *p,
> ctx->syscall_iopoll = 1;
>
> ctx->compat = in_compat_syscall();
> + ctx->sys_admin = capable(CAP_SYS_ADMIN);
> if (!ns_capable_noaudit(&init_user_ns, CAP_IPC_LOCK))
> ctx->user = get_uid(current_user());
>
> diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
> index 8a38b9f75d841..764f0e004aa00 100644
> --- a/io_uring/uring_cmd.c
> +++ b/io_uring/uring_cmd.c
> @@ -164,6 +164,8 @@ int io_uring_cmd(struct io_kiocb *req, unsigned int issue_flags)
> issue_flags |= IO_URING_F_CQE32;
> if (ctx->compat)
> issue_flags |= IO_URING_F_COMPAT;
> + if (ctx->sys_admin)
> + issue_flags |= IO_URING_F_SYS_ADMIN;
> if (ctx->flags & IORING_SETUP_IOPOLL) {
> if (!file->f_op->uring_cmd_iopoll)
> return -EOPNOTSUPP;
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 1/2] iouring: one capable call per iouring instance
2023-12-04 18:40 ` Jeff Moyer
@ 2023-12-04 18:57 ` Keith Busch
2023-12-05 4:14 ` Ming Lei
2023-12-04 19:01 ` Jens Axboe
1 sibling, 1 reply; 21+ messages in thread
From: Keith Busch @ 2023-12-04 18:57 UTC (permalink / raw)
To: Jeff Moyer
Cc: Keith Busch, linux-nvme, io-uring, axboe, hch, sagi, asml.silence,
linux-security-module
On Mon, Dec 04, 2023 at 01:40:58PM -0500, Jeff Moyer wrote:
> I added a CC: linux-security-module@vger
> Keith Busch <[email protected]> writes:
> > From: Keith Busch <[email protected]>
> >
> > The uring_cmd operation is often used for privileged actions, so drivers
> > subscribing to this interface check capable() for each command. The
> > capable() function is not fast path friendly for many kernel configs,
> > and this can really harm performance. Stash the capable sys admin
> > attribute in the io_uring context and set a new issue_flag for the
> > uring_cmd interface.
>
> I have a few questions. What privileged actions are performance
> sensitive? I would hope that anything requiring privileges would not
> be in a fast path (but clearly that's not the case).
Protocol specifics that don't have a generic equivalent. For example,
NVMe FDP is reachable only through the uring_cmd and ioctl interfaces,
but you use it like normal reads and writes so has to be as fast as the
generic interfaces.
The same interfaces can be abused, so access needs to be restricted.
> What performance benefits did you measure with this patch set in place
> (and on what workloads)?
Quite a bit. Here's a random read high-depth workload on a single
device test:
Before: 970k IOPs
After: 1750k IOPs
> What happens when a ring fd is passed to another process?
>
> Finally, as Jens mentioned, I would expect dropping priviliges to, you
> know, drop privileges. I don't think a commit message is going to be
> enough documentation for a change like this.
Yeah, point taken.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 1/2] iouring: one capable call per iouring instance
2023-12-04 18:57 ` Keith Busch
@ 2023-12-05 4:14 ` Ming Lei
2023-12-05 4:31 ` Keith Busch
0 siblings, 1 reply; 21+ messages in thread
From: Ming Lei @ 2023-12-05 4:14 UTC (permalink / raw)
To: Keith Busch
Cc: Jeff Moyer, Keith Busch, linux-nvme, io-uring, axboe, hch, sagi,
asml.silence, linux-security-module, Kanchan Joshi
On Mon, Dec 04, 2023 at 11:57:55AM -0700, Keith Busch wrote:
> On Mon, Dec 04, 2023 at 01:40:58PM -0500, Jeff Moyer wrote:
> > I added a CC: linux-security-module@vger
> > Keith Busch <[email protected]> writes:
> > > From: Keith Busch <[email protected]>
> > >
> > > The uring_cmd operation is often used for privileged actions, so drivers
> > > subscribing to this interface check capable() for each command. The
> > > capable() function is not fast path friendly for many kernel configs,
> > > and this can really harm performance. Stash the capable sys admin
> > > attribute in the io_uring context and set a new issue_flag for the
> > > uring_cmd interface.
> >
> > I have a few questions. What privileged actions are performance
> > sensitive? I would hope that anything requiring privileges would not
> > be in a fast path (but clearly that's not the case).
>
> Protocol specifics that don't have a generic equivalent. For example,
> NVMe FDP is reachable only through the uring_cmd and ioctl interfaces,
> but you use it like normal reads and writes so has to be as fast as the
> generic interfaces.
But normal read/write pt command doesn't require ADMIN any more since
commit 855b7717f44b ("nvme: fine-granular CAP_SYS_ADMIN for nvme io commands"),
why do you have to pay the cost of checking capable(CAP_SYS_ADMIN)?
Thanks,
Ming
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 1/2] iouring: one capable call per iouring instance
2023-12-05 4:14 ` Ming Lei
@ 2023-12-05 4:31 ` Keith Busch
2023-12-05 5:25 ` Ming Lei
0 siblings, 1 reply; 21+ messages in thread
From: Keith Busch @ 2023-12-05 4:31 UTC (permalink / raw)
To: Ming Lei
Cc: Jeff Moyer, Keith Busch, linux-nvme, io-uring, axboe, hch, sagi,
asml.silence, linux-security-module, Kanchan Joshi
On Tue, Dec 05, 2023 at 12:14:22PM +0800, Ming Lei wrote:
> On Mon, Dec 04, 2023 at 11:57:55AM -0700, Keith Busch wrote:
> > On Mon, Dec 04, 2023 at 01:40:58PM -0500, Jeff Moyer wrote:
> > > I added a CC: linux-security-module@vger
> > > Keith Busch <[email protected]> writes:
> > > > From: Keith Busch <[email protected]>
> > > >
> > > > The uring_cmd operation is often used for privileged actions, so drivers
> > > > subscribing to this interface check capable() for each command. The
> > > > capable() function is not fast path friendly for many kernel configs,
> > > > and this can really harm performance. Stash the capable sys admin
> > > > attribute in the io_uring context and set a new issue_flag for the
> > > > uring_cmd interface.
> > >
> > > I have a few questions. What privileged actions are performance
> > > sensitive? I would hope that anything requiring privileges would not
> > > be in a fast path (but clearly that's not the case).
> >
> > Protocol specifics that don't have a generic equivalent. For example,
> > NVMe FDP is reachable only through the uring_cmd and ioctl interfaces,
> > but you use it like normal reads and writes so has to be as fast as the
> > generic interfaces.
>
> But normal read/write pt command doesn't require ADMIN any more since
> commit 855b7717f44b ("nvme: fine-granular CAP_SYS_ADMIN for nvme io commands"),
> why do you have to pay the cost of checking capable(CAP_SYS_ADMIN)?
Good question. The "capable" check had always been first so even with
the relaxed permissions, it was still paying the price. I have changed
that order in commit staged here (not yet upstream):
http://git.infradead.org/nvme.git/commitdiff/7be866b1cf0bf1dfa74480fe8097daeceda68622
Note that only prevents the costly capable() check if the inexpensive
checks could make a determination. That's still not solving the problem
long term since we aim for forward compatibility where we have no idea
which opcodes, admin identifications, or vendor specifics could be
deemed "safe" for non-root users in the future, so those conditions
would always fall back to the more expensive check that this patch was
trying to mitigate for admin processes.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 1/2] iouring: one capable call per iouring instance
2023-12-05 4:31 ` Keith Busch
@ 2023-12-05 5:25 ` Ming Lei
2023-12-05 15:45 ` Keith Busch
0 siblings, 1 reply; 21+ messages in thread
From: Ming Lei @ 2023-12-05 5:25 UTC (permalink / raw)
To: Keith Busch
Cc: Jeff Moyer, Keith Busch, linux-nvme, io-uring, axboe, hch, sagi,
asml.silence, linux-security-module, Kanchan Joshi
On Mon, Dec 04, 2023 at 09:31:21PM -0700, Keith Busch wrote:
> On Tue, Dec 05, 2023 at 12:14:22PM +0800, Ming Lei wrote:
> > On Mon, Dec 04, 2023 at 11:57:55AM -0700, Keith Busch wrote:
> > > On Mon, Dec 04, 2023 at 01:40:58PM -0500, Jeff Moyer wrote:
> > > > I added a CC: linux-security-module@vger
> > > > Keith Busch <[email protected]> writes:
> > > > > From: Keith Busch <[email protected]>
> > > > >
> > > > > The uring_cmd operation is often used for privileged actions, so drivers
> > > > > subscribing to this interface check capable() for each command. The
> > > > > capable() function is not fast path friendly for many kernel configs,
> > > > > and this can really harm performance. Stash the capable sys admin
> > > > > attribute in the io_uring context and set a new issue_flag for the
> > > > > uring_cmd interface.
> > > >
> > > > I have a few questions. What privileged actions are performance
> > > > sensitive? I would hope that anything requiring privileges would not
> > > > be in a fast path (but clearly that's not the case).
> > >
> > > Protocol specifics that don't have a generic equivalent. For example,
> > > NVMe FDP is reachable only through the uring_cmd and ioctl interfaces,
> > > but you use it like normal reads and writes so has to be as fast as the
> > > generic interfaces.
> >
> > But normal read/write pt command doesn't require ADMIN any more since
> > commit 855b7717f44b ("nvme: fine-granular CAP_SYS_ADMIN for nvme io commands"),
> > why do you have to pay the cost of checking capable(CAP_SYS_ADMIN)?
>
> Good question. The "capable" check had always been first so even with
> the relaxed permissions, it was still paying the price. I have changed
> that order in commit staged here (not yet upstream):
>
> http://git.infradead.org/nvme.git/commitdiff/7be866b1cf0bf1dfa74480fe8097daeceda68622
With this change, I guess you shouldn't see the following big gap, right?
> Before: 970k IOPs
> After: 1750k IOPs
>
> Note that only prevents the costly capable() check if the inexpensive
> checks could make a determination. That's still not solving the problem
> long term since we aim for forward compatibility where we have no idea
> which opcodes, admin identifications, or vendor specifics could be
> deemed "safe" for non-root users in the future, so those conditions
> would always fall back to the more expensive check that this patch was
> trying to mitigate for admin processes.
Not sure I get the idea, it is related with nvme's permission model for
user pt command, and:
1) it should be always checked in entry of nvme user pt command
2) only the following two types of commands require ADMIN, per commit
855b7717f44b ("nvme: fine-granular CAP_SYS_ADMIN for nvme io commands")
- any admin-cmd is not allowed
- vendor-specific and fabric commmand are not allowed
Can you provide more details why the expensive check can't be avoided for
fast read/write user IO commands?
Thanks,
Ming
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 1/2] iouring: one capable call per iouring instance
2023-12-05 5:25 ` Ming Lei
@ 2023-12-05 15:45 ` Keith Busch
2023-12-06 3:08 ` Ming Lei
0 siblings, 1 reply; 21+ messages in thread
From: Keith Busch @ 2023-12-05 15:45 UTC (permalink / raw)
To: Ming Lei
Cc: Jeff Moyer, Keith Busch, linux-nvme, io-uring, axboe, hch, sagi,
asml.silence, linux-security-module, Kanchan Joshi
On Tue, Dec 05, 2023 at 01:25:44PM +0800, Ming Lei wrote:
> On Mon, Dec 04, 2023 at 09:31:21PM -0700, Keith Busch wrote:
> > Good question. The "capable" check had always been first so even with
> > the relaxed permissions, it was still paying the price. I have changed
> > that order in commit staged here (not yet upstream):
> >
> > http://git.infradead.org/nvme.git/commitdiff/7be866b1cf0bf1dfa74480fe8097daeceda68622
>
> With this change, I guess you shouldn't see the following big gap, right?
Correct.
> > Before: 970k IOPs
> > After: 1750k IOPs
> > Note that only prevents the costly capable() check if the inexpensive
> > checks could make a determination. That's still not solving the problem
> > long term since we aim for forward compatibility where we have no idea
> > which opcodes, admin identifications, or vendor specifics could be
> > deemed "safe" for non-root users in the future, so those conditions
> > would always fall back to the more expensive check that this patch was
> > trying to mitigate for admin processes.
>
> Not sure I get the idea, it is related with nvme's permission model for
> user pt command, and:
>
> 1) it should be always checked in entry of nvme user pt command
>
> 2) only the following two types of commands require ADMIN, per commit
> 855b7717f44b ("nvme: fine-granular CAP_SYS_ADMIN for nvme io commands")
>
> - any admin-cmd is not allowed
> - vendor-specific and fabric commmand are not allowed
>
> Can you provide more details why the expensive check can't be avoided for
> fast read/write user IO commands?
It's not necessarily about the read/write passthrough commands. It's for
commands we don't know about today. Do we want to revisit this problem
every time spec provides another operation? Are vendor unique solutions
not allowed to get high IOPs access?
Secondly, some people have rediscovered you can abuse this interface to
corrupt kernel memory, so there are considerations to restricting this
to CAP_SYS_ADMIN anyway, so there's no cheap check available today if we
have to go that route.
Lastly (and least important), there are a lot of checks happening in the
"quick" path that I am trying to replace with a single check. While each
invidividual check isn't too bad, they start to add up.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 1/2] iouring: one capable call per iouring instance
2023-12-05 15:45 ` Keith Busch
@ 2023-12-06 3:08 ` Ming Lei
2023-12-06 15:31 ` Keith Busch
0 siblings, 1 reply; 21+ messages in thread
From: Ming Lei @ 2023-12-06 3:08 UTC (permalink / raw)
To: Keith Busch
Cc: Jeff Moyer, Keith Busch, linux-nvme, io-uring, axboe, hch, sagi,
asml.silence, linux-security-module, Kanchan Joshi
On Tue, Dec 05, 2023 at 08:45:10AM -0700, Keith Busch wrote:
> On Tue, Dec 05, 2023 at 01:25:44PM +0800, Ming Lei wrote:
> > On Mon, Dec 04, 2023 at 09:31:21PM -0700, Keith Busch wrote:
> > > Good question. The "capable" check had always been first so even with
> > > the relaxed permissions, it was still paying the price. I have changed
> > > that order in commit staged here (not yet upstream):
> > >
> > > http://git.infradead.org/nvme.git/commitdiff/7be866b1cf0bf1dfa74480fe8097daeceda68622
> >
> > With this change, I guess you shouldn't see the following big gap, right?
>
> Correct.
>
> > > Before: 970k IOPs
> > > After: 1750k IOPs
>
> > > Note that only prevents the costly capable() check if the inexpensive
> > > checks could make a determination. That's still not solving the problem
> > > long term since we aim for forward compatibility where we have no idea
> > > which opcodes, admin identifications, or vendor specifics could be
> > > deemed "safe" for non-root users in the future, so those conditions
> > > would always fall back to the more expensive check that this patch was
> > > trying to mitigate for admin processes.
> >
> > Not sure I get the idea, it is related with nvme's permission model for
> > user pt command, and:
> >
> > 1) it should be always checked in entry of nvme user pt command
> >
> > 2) only the following two types of commands require ADMIN, per commit
> > 855b7717f44b ("nvme: fine-granular CAP_SYS_ADMIN for nvme io commands")
> >
> > - any admin-cmd is not allowed
> > - vendor-specific and fabric commmand are not allowed
> >
> > Can you provide more details why the expensive check can't be avoided for
> > fast read/write user IO commands?
>
> It's not necessarily about the read/write passthrough commands. It's for
> commands we don't know about today. Do we want to revisit this problem
> every time spec provides another operation? Are vendor unique solutions
> not allowed to get high IOPs access?
Except for read/write, what other commands are performance sensitive?
>
> Secondly, some people have rediscovered you can abuse this interface to
> corrupt kernel memory, so there are considerations to restricting this
Just wondering why ADMIN won't corrupt kernel memory, and only normal
user can, looks it is kernel bug instead of permission related issue.
> to CAP_SYS_ADMIN anyway, so there's no cheap check available today if we
> have to go that route.
If capable(CAP_SYS_ADMIN) is really slow, I am wondering why not
optimize it in task_struct?
Thanks,
Ming
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 1/2] iouring: one capable call per iouring instance
2023-12-06 3:08 ` Ming Lei
@ 2023-12-06 15:31 ` Keith Busch
2023-12-07 1:23 ` Ming Lei
0 siblings, 1 reply; 21+ messages in thread
From: Keith Busch @ 2023-12-06 15:31 UTC (permalink / raw)
To: Ming Lei
Cc: Jeff Moyer, Keith Busch, linux-nvme, io-uring, axboe, hch, sagi,
asml.silence, linux-security-module, Kanchan Joshi
On Wed, Dec 06, 2023 at 11:08:17AM +0800, Ming Lei wrote:
> On Tue, Dec 05, 2023 at 08:45:10AM -0700, Keith Busch wrote:
> >
> > It's not necessarily about the read/write passthrough commands. It's for
> > commands we don't know about today. Do we want to revisit this problem
> > every time spec provides another operation? Are vendor unique solutions
> > not allowed to get high IOPs access?
>
> Except for read/write, what other commands are performance sensitive?
It varies by command set, but this question is irrelevant. I'm not
interested in gatekeeping the fast path.
> > Secondly, some people have rediscovered you can abuse this interface to
> > corrupt kernel memory, so there are considerations to restricting this
>
> Just wondering why ADMIN won't corrupt kernel memory, and only normal
> user can, looks it is kernel bug instead of permission related issue.
Admin can corrupt memory as easily as a normal user through this
interface. We just don't want such capabilities to be available to
regular users.
And it's a user bug: user told the kernel to map buffer of size X, but
the device transfers size Y into it. Kernel can't do anything about that
(other than remove the interface, but such an action will break many
existing users) because we fundamentally do not know the true transfer
size of a random command. Many NVMe commands don't explicitly encode
transfer lengths, so disagreement between host and device on implicit
lengths risk corruption. It's a protocol "feature".
> > to CAP_SYS_ADMIN anyway, so there's no cheap check available today if we
> > have to go that route.
>
> If capable(CAP_SYS_ADMIN) is really slow, I am wondering why not
> optimize it in task_struct?
That's an interesting point to look into. I was hoping to not touch such
a common struct, but I'm open to all options.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 1/2] iouring: one capable call per iouring instance
2023-12-06 15:31 ` Keith Busch
@ 2023-12-07 1:23 ` Ming Lei
2023-12-07 17:48 ` Christoph Hellwig
0 siblings, 1 reply; 21+ messages in thread
From: Ming Lei @ 2023-12-07 1:23 UTC (permalink / raw)
To: Keith Busch
Cc: Jeff Moyer, Keith Busch, linux-nvme, io-uring, axboe, hch, sagi,
asml.silence, linux-security-module, Kanchan Joshi
On Wed, Dec 06, 2023 at 08:31:54AM -0700, Keith Busch wrote:
> On Wed, Dec 06, 2023 at 11:08:17AM +0800, Ming Lei wrote:
> > On Tue, Dec 05, 2023 at 08:45:10AM -0700, Keith Busch wrote:
> > >
> > > It's not necessarily about the read/write passthrough commands. It's for
> > > commands we don't know about today. Do we want to revisit this problem
> > > every time spec provides another operation? Are vendor unique solutions
> > > not allowed to get high IOPs access?
> >
> > Except for read/write, what other commands are performance sensitive?
>
> It varies by command set, but this question is irrelevant. I'm not
> interested in gatekeeping the fast path.
IMO, it doesn't make sense to run such optimization for commands which aren't
performance sensitive.
>
> > > Secondly, some people have rediscovered you can abuse this interface to
> > > corrupt kernel memory, so there are considerations to restricting this
> >
> > Just wondering why ADMIN won't corrupt kernel memory, and only normal
> > user can, looks it is kernel bug instead of permission related issue.
>
> Admin can corrupt memory as easily as a normal user through this
> interface. We just don't want such capabilities to be available to
> regular users.
>
> And it's a user bug: user told the kernel to map buffer of size X, but
> the device transfers size Y into it. Kernel can't do anything about that
> (other than remove the interface, but such an action will break many
> existing users) because we fundamentally do not know the true transfer
> size of a random command. Many NVMe commands don't explicitly encode
> transfer lengths, so disagreement between host and device on implicit
> lengths risk corruption. It's a protocol "feature".
Got it, thanks for the explanation, and looks one big defect of
NVMe protocol or the device implementation.
>
> > > to CAP_SYS_ADMIN anyway, so there's no cheap check available today if we
> > > have to go that route.
> >
> > If capable(CAP_SYS_ADMIN) is really slow, I am wondering why not
> > optimize it in task_struct?
>
> That's an interesting point to look into. I was hoping to not touch such
> a common struct, but I'm open to all options.
capability is per-thread, and it is updated in current process/pthread, so
the correct place to cache this info is 'task_struct'.
Thanks,
Ming
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 1/2] iouring: one capable call per iouring instance
2023-12-07 1:23 ` Ming Lei
@ 2023-12-07 17:48 ` Christoph Hellwig
0 siblings, 0 replies; 21+ messages in thread
From: Christoph Hellwig @ 2023-12-07 17:48 UTC (permalink / raw)
To: Ming Lei
Cc: Keith Busch, Jeff Moyer, Keith Busch, linux-nvme, io-uring, axboe,
hch, sagi, asml.silence, linux-security-module, Kanchan Joshi
On Thu, Dec 07, 2023 at 09:23:06AM +0800, Ming Lei wrote:
> Got it, thanks for the explanation, and looks one big defect of
> NVMe protocol or the device implementation.
It is. And NVMe has plenty more problems like that and people keep
adding this kind of crap even today :(
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 1/2] iouring: one capable call per iouring instance
2023-12-04 18:40 ` Jeff Moyer
2023-12-04 18:57 ` Keith Busch
@ 2023-12-04 19:01 ` Jens Axboe
2023-12-04 19:22 ` Jeff Moyer
1 sibling, 1 reply; 21+ messages in thread
From: Jens Axboe @ 2023-12-04 19:01 UTC (permalink / raw)
To: Jeff Moyer, Keith Busch
Cc: linux-nvme, io-uring, hch, sagi, asml.silence, Keith Busch,
linux-security-module
On 12/4/23 11:40 AM, Jeff Moyer wrote:
> Finally, as Jens mentioned, I would expect dropping priviliges to, you
> know, drop privileges. I don't think a commit message is going to be
> enough documentation for a change like this.
Only thing I can think of here is to cache the state in
task->io_uring->something, and then ensure those are invalidated
whenever caps change. It's one of those cases where that's probably only
done once, but we do need to be able to catch it. Not convinced that
caching it at ring creation is sane enough, even if it is kind of like
opening devices before privs are dropped where you could not otherwise
re-open them later on.
--
Jens Axboe
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 1/2] iouring: one capable call per iouring instance
2023-12-04 19:01 ` Jens Axboe
@ 2023-12-04 19:22 ` Jeff Moyer
2023-12-04 19:33 ` Jens Axboe
2023-12-04 19:37 ` Keith Busch
0 siblings, 2 replies; 21+ messages in thread
From: Jeff Moyer @ 2023-12-04 19:22 UTC (permalink / raw)
To: Jens Axboe
Cc: Keith Busch, linux-nvme, io-uring, hch, sagi, asml.silence,
Keith Busch, linux-security-module
Jens Axboe <[email protected]> writes:
> On 12/4/23 11:40 AM, Jeff Moyer wrote:
>> Finally, as Jens mentioned, I would expect dropping priviliges to, you
>> know, drop privileges. I don't think a commit message is going to be
>> enough documentation for a change like this.
>
> Only thing I can think of here is to cache the state in
> task->io_uring->something, and then ensure those are invalidated
> whenever caps change.
I looked through the capable() code, and there is no way that I could
find to be notified of changes.
> It's one of those cases where that's probably only done once, but we
> do need to be able to catch it. Not convinced that caching it at ring
> creation is sane enough, even if it is kind of like opening devices
> before privs are dropped where you could not otherwise re-open them
> later on.
Agreed.
-Jeff
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 1/2] iouring: one capable call per iouring instance
2023-12-04 19:22 ` Jeff Moyer
@ 2023-12-04 19:33 ` Jens Axboe
2023-12-04 19:37 ` Keith Busch
1 sibling, 0 replies; 21+ messages in thread
From: Jens Axboe @ 2023-12-04 19:33 UTC (permalink / raw)
To: Jeff Moyer
Cc: Keith Busch, linux-nvme, io-uring, hch, sagi, asml.silence,
Keith Busch, linux-security-module
On 12/4/23 12:22 PM, Jeff Moyer wrote:
> Jens Axboe <[email protected]> writes:
>
>> On 12/4/23 11:40 AM, Jeff Moyer wrote:
>>> Finally, as Jens mentioned, I would expect dropping priviliges to, you
>>> know, drop privileges. I don't think a commit message is going to be
>>> enough documentation for a change like this.
>>
>> Only thing I can think of here is to cache the state in
>> task->io_uring->something, and then ensure those are invalidated
>> whenever caps change.
>
> I looked through the capable() code, and there is no way that I could
> find to be notified of changes.
Right, what I meant is that you'd need to add an io_uring_cap_change()
or something that gets called, and that iterates the rings associated
with that task and clears the flag. Ugly...
--
Jens Axboe
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 1/2] iouring: one capable call per iouring instance
2023-12-04 19:22 ` Jeff Moyer
2023-12-04 19:33 ` Jens Axboe
@ 2023-12-04 19:37 ` Keith Busch
1 sibling, 0 replies; 21+ messages in thread
From: Keith Busch @ 2023-12-04 19:37 UTC (permalink / raw)
To: Jeff Moyer
Cc: Jens Axboe, Keith Busch, linux-nvme, io-uring, hch, sagi,
asml.silence, linux-security-module
On Mon, Dec 04, 2023 at 02:22:22PM -0500, Jeff Moyer wrote:
> Jens Axboe <[email protected]> writes:
>
> > On 12/4/23 11:40 AM, Jeff Moyer wrote:
> >> Finally, as Jens mentioned, I would expect dropping priviliges to, you
> >> know, drop privileges. I don't think a commit message is going to be
> >> enough documentation for a change like this.
> >
> > Only thing I can think of here is to cache the state in
> > task->io_uring->something, and then ensure those are invalidated
> > whenever caps change.
>
> I looked through the capable() code, and there is no way that I could
> find to be notified of changes.
Something like LSM_HOOK_INIT on 'capset', but needs to work without
CONFIG_SECURITY.
^ permalink raw reply [flat|nested] 21+ messages in thread