[RFC] io_uring: add restrictions to support untrusted applications and guests

* [RFC] io_uring: add restrictions to support untrusted applications and guests
@ 2020-06-09 14:24 Stefano Garzarella
  2020-06-14 15:52 ` Jens Axboe
  2020-06-15  9:04 ` Jann Horn
  0 siblings, 2 replies; 13+ messages in thread
From: Stefano Garzarella @ 2020-06-09 14:24 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Stefan Hajnoczi, Jeff Moyer, io-uring, linux-kernel

Hi Jens,
Stefan and I have a proposal to share with the io_uring community.
Before implementing it, we would like to discuss it to get feedback and
to see whether it could be accepted:

Adding restrictions to io_uring
===============================
The io_uring API provides submission and completion queues for performing
asynchronous I/O operations. The queues are located in memory that is
accessible to both the host userspace application and the kernel, making it
possible to monitor for activity through polling instead of system calls. This
design offers good performance and this makes exposing io_uring to guests an
attractive idea for improving I/O performance in virtualization.

PoC and preliminary benchmarks
------------------------------
We built a PoC, using QEMU and a virtio-blk device, to share the io_uring
CQ and SQ rings with the guest.
QEMU initializes io_uring, registers the device (NVMe) fd through
io_uring_register(2), and maps the rings in the guest memory.
The virtio-blk driver uses these rings to send requests instead of using
the standard virtqueues.

The PoC implements a pure polling solution where the application is polling
(IOPOLL enabled) in the guest and the sqpoll_kthread is polling in the host
(SQPOLL and IOPOLL enabled).

These are the encouraging results we obtained from this preliminary work;
we used fio (rw=randread bs=4k) to measure the kIOPS on an NVMe device:

- bare-metal
                                                       iodepth
  | fio ioengine                              |  1  |  8  |  16 |  32 |
  |-------------------------------------------|----:|----:|----:|----:|
  | io_uring (SQPOLL + IOPOLL)                | 119 | 550 | 581 | 585 |
  | io_uring (IOPOLL)                         | 122 | 502 | 519 | 538 |

- QEMU/KVM guest (aio=io_uring)
                                                       iodepth
  | virtio-blk            | fio ioengine      |  1  |  8  |  16 |  32 |
  |-----------------------|-------------------|----:|----:|----:|----:|
  | virtqueues            | io_uring (IOPOLL) |  27 | 144 | 209 | 266 |
  | virtqueues + iothread | io_uring (IOPOLL) |  73 | 264 | 306 | 312 |
  | io_uring passthrough  | io_uring (IOPOLL) | 104 | 532 | 577 | 585 |

  All guest experiments are using the QEMU io_uring backend with SQPOLL and
  IOPOLL enabled. The virtio-blk driver is modified to support block io_poll
  on both virtqueues and io_uring passthrough.

Before developing this proof-of-concept further we would like to discuss
io_uring changes required to restrict rings since this mechanism is a
prerequisite for real-world use cases where guests are untrusted.

Restrictions
------------
This document proposes io_uring API changes that safely allow untrusted
applications or guests to use io_uring. io_uring's existing security model is
that of kernel system call handler code. It is designed to reject invalid
inputs from host userspace applications. Supporting guests as io_uring API
clients adds a new trust domain with access to even fewer resources than host
userspace applications.

Guests do not have direct access to host userspace application file descriptors
or memory. The host userspace application, a Virtual Machine Monitor (VMM) such
as QEMU, grants access to a subset of its file descriptors and memory. The
allowed file descriptors are typically the disk image files belonging to the
guest. The memory is typically the virtual machine's RAM that the VMM has
allocated on behalf of the guest.

The following extensions to the io_uring API allow the host application to
grant access to some of its file descriptors.

These extensions are designed to be applicable to other use cases besides
untrusted guests and are not virtualization-specific. For example, the
restrictions can limit an application to a subset of sqe operations, similar
to seccomp syscall whitelisting.

An address translation and memory restriction mechanism would also be
necessary, but we can discuss this later.

The IORING_REGISTER_RESTRICTIONS opcode
---------------------------------------
The new io_uring_register(2) IORING_REGISTER_RESTRICTIONS opcode permanently
installs a feature whitelist on an io_ring_ctx. The io_ring_ctx can then be
passed to untrusted code with the knowledge that only operations present in the
whitelist can be executed.

The whitelist approach ensures that new features added to io_uring do not
accidentally become available when an existing application is launched on a
newer kernel version.

The IORING_REGISTER_RESTRICTIONS opcode takes an array of struct
io_uring_restriction elements that describe whitelisted features:

  #define IORING_REGISTER_RESTRICTIONS 11

  /* struct io_uring_restriction::opcode values */
  enum {
      /* Allow an io_uring_register(2) opcode */
      IORING_RESTRICTION_REGISTER_OP,

      /* Allow an sqe opcode */
      IORING_RESTRICTION_SQE_OP,

      /* Only allow fixed files */
      IORING_RESTRICTION_FIXED_FILES_ONLY,

      /* Only allow registered addresses and translate them */
      IORING_RESTRICTION_BUFFER_CHECK
  };

  struct io_uring_restriction {
      __u16 opcode;
      union {
          __u8 register_op; /* IORING_RESTRICTION_REGISTER_OP */
          __u8 sqe_op;      /* IORING_RESTRICTION_SQE_OP */
      };
      __u8 resv;
      __u32 resv2[3];
  };

This call can only be made once. Afterwards it is not possible to change
restrictions anymore. This prevents untrusted code from removing restrictions.
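
As a rough illustration of the "install once" semantics described above, here
is a small userspace model. It is only a sketch: the model_* names, the bitmap
layout, and the -EBUSY / -EINVAL return codes are assumptions made for this
example, not part of the proposal or of any kernel header.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins mirroring the proposed enum and struct (hypothetical names). */
enum {
    MODEL_RESTRICTION_REGISTER_OP,
    MODEL_RESTRICTION_SQE_OP,
    MODEL_RESTRICTION_FIXED_FILES_ONLY,
    MODEL_RESTRICTION_BUFFER_CHECK,
};

struct model_restriction {
    uint16_t opcode;
    union {
        uint8_t register_op;
        uint8_t sqe_op;
    };
    uint8_t resv;
    uint32_t resv2[3];
};

/* With this layout each element occupies 16 bytes. */
static_assert(sizeof(struct model_restriction) == 16, "unexpected layout");

/* Hypothetical per-ring state: one bit per whitelisted opcode. */
struct model_ring {
    bool restricted;           /* set by the first successful call, never cleared */
    bool fixed_files_only;
    bool buffer_check;
    uint64_t register_ops[4];  /* 256-bit bitmap indexed by register opcode */
    uint64_t sqe_ops[4];       /* 256-bit bitmap indexed by sqe opcode */
};

void model_bit_set(uint64_t *map, uint8_t bit)
{
    map[bit / 64] |= UINT64_C(1) << (bit % 64);
}

bool model_bit_test(const uint64_t *map, uint8_t bit)
{
    return (map[bit / 64] >> (bit % 64)) & 1;
}

/* Returns 0 on success, -EBUSY on a second call, -EINVAL on bad input. */
int model_register_restrictions(struct model_ring *ring,
                                const struct model_restriction *res, size_t nr)
{
    struct model_ring staged;

    if (ring->restricted)
        return -EBUSY;

    /* Work on a copy so a failed call leaves the ring untouched. */
    staged = *ring;
    for (size_t i = 0; i < nr; i++) {
        if (res[i].resv || res[i].resv2[0] || res[i].resv2[1] || res[i].resv2[2])
            return -EINVAL;

        switch (res[i].opcode) {
        case MODEL_RESTRICTION_REGISTER_OP:
            model_bit_set(staged.register_ops, res[i].register_op);
            break;
        case MODEL_RESTRICTION_SQE_OP:
            model_bit_set(staged.sqe_ops, res[i].sqe_op);
            break;
        case MODEL_RESTRICTION_FIXED_FILES_ONLY:
            staged.fixed_files_only = true;
            break;
        case MODEL_RESTRICTION_BUFFER_CHECK:
            staged.buffer_check = true;
            break;
        default:
            return -EINVAL;
        }
    }

    staged.restricted = true;
    *ring = staged;
    return 0;
}
```

In this model, untrusted code that later holds the ring cannot widen the
whitelist, because every call after the first one fails before any state is
modified.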

Limiting access to io_uring operations
--------------------------------------
The following example shows how to whitelist IORING_OP_READV, IORING_OP_WRITEV,
and IORING_OP_FSYNC:

  struct io_uring_restriction restrictions[] = {
      {
          .opcode = IORING_RESTRICTION_SQE_OP,
          .sqe_op = IORING_OP_READV,
      },
      {
          .opcode = IORING_RESTRICTION_SQE_OP,
          .sqe_op = IORING_OP_WRITEV,
      },
      {
          .opcode = IORING_RESTRICTION_SQE_OP,
          .sqe_op = IORING_OP_FSYNC,
      },
      ...
  };

  io_uring_register(ringfd, IORING_REGISTER_RESTRICTIONS,
                    restrictions, ARRAY_SIZE(restrictions));
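
The submission-side check implied by such a whitelist could look roughly like
the sketch below. It is a userspace model, not kernel code: the MODEL_OP_*
values, the model_whitelist type, and the choice of -EACCES for a denied
opcode are all assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative opcode numbers; local to this example. */
enum {
    MODEL_OP_NOP,
    MODEL_OP_READV,
    MODEL_OP_WRITEV,
    MODEL_OP_FSYNC,
    MODEL_OP_READ_FIXED,
    MODEL_OP_LAST,
};

/* Hypothetical whitelist: allowed[op] is nonzero when op may be submitted. */
struct model_whitelist {
    uint8_t allowed[MODEL_OP_LAST];
};

void model_allow(struct model_whitelist *wl, unsigned op)
{
    assert(wl != NULL && op < MODEL_OP_LAST);
    wl->allowed[op] = 1;
}

/*
 * Default-deny check: anything not explicitly whitelisted is rejected,
 * so an opcode added later stays unavailable until someone allows it.
 */
int model_check_sqe_op(const struct model_whitelist *wl, unsigned op)
{
    if (op >= MODEL_OP_LAST)
        return -EINVAL;
    return wl->allowed[op] ? 0 : -EACCES;
}
```

Because the table starts out all zeroes, a ring whose whitelist names only
READV, WRITEV, and FSYNC rejects every other opcode without the application
having to enumerate what it wants to forbid.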

Limiting access to file descriptors
-----------------------------------
The fixed files mechanism can be used to limit access to a set of file
descriptors:

  struct io_uring_restriction restrictions[] = {
      {
          .opcode = IORING_RESTRICTION_FIXED_FILES_ONLY,
      },
      ...
  };

  io_uring_register(ringfd, IORING_REGISTER_RESTRICTIONS,
                    restrictions, ARRAY_SIZE(restrictions));

Only requests with the sqe->flags IOSQE_FIXED_FILE bit set will be allowed.
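
A minimal sketch of what this check might amount to, again as a userspace
model. The fixed_ring and fixed_sqe types, the MODEL_IOSQE_FIXED_FILE
constant, and the error codes are hypothetical; the sketch also assumes that,
for fixed files, sqe->fd is an index into the registered file table.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-in for the sqe->flags bit that selects a registered file. */
#define MODEL_IOSQE_FIXED_FILE (1u << 0)

/* Only the sqe fields this check looks at. */
struct fixed_sqe {
    uint8_t flags;
    int32_t fd;    /* index into the registered table when the flag is set */
};

struct fixed_ring {
    bool fixed_files_only;    /* the restriction is installed */
    unsigned nr_fixed_files;  /* size of the registered file table */
};

/* Returns 0 if the request may proceed, a negative errno otherwise. */
int fixed_check_file(const struct fixed_ring *ring, const struct fixed_sqe *sqe)
{
    bool fixed;

    assert(ring != NULL && sqe != NULL);
    fixed = (sqe->flags & MODEL_IOSQE_FIXED_FILE) != 0;

    /* Restricted rings never resolve raw descriptors of the host process. */
    if (ring->fixed_files_only && !fixed)
        return -EACCES;

    /* A fixed-file request must name a slot that was actually registered. */
    if (fixed && (sqe->fd < 0 || (unsigned)sqe->fd >= ring->nr_fixed_files))
        return -EBADF;

    return 0;
}
```

The effect is that the set of reachable files is exactly the table the
trusted code registered before handing the ring over.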

Thanks for your feedback,
Stefano


* Re: [RFC] io_uring: add restrictions to support untrusted applications and guests
  2020-06-09 14:24 [RFC] io_uring: add restrictions to support untrusted applications and guests Stefano Garzarella
@ 2020-06-14 15:52 ` Jens Axboe
  2020-06-15  7:23   ` Stefano Garzarella
  2020-06-15  9:04 ` Jann Horn
  1 sibling, 1 reply; 13+ messages in thread
From: Jens Axboe @ 2020-06-14 15:52 UTC (permalink / raw)
  To: Stefano Garzarella; +Cc: Stefan Hajnoczi, Jeff Moyer, io-uring, linux-kernel

On 6/9/20 8:24 AM, Stefano Garzarella wrote:
> [...]

I don't think this sounds unreasonable, but I'd really like to see a
prototype hacked up before rendering any further opinions on it :-)

-- 
Jens Axboe



* Re: [RFC] io_uring: add restrictions to support untrusted applications and guests
  2020-06-14 15:52 ` Jens Axboe
@ 2020-06-15  7:23   ` Stefano Garzarella
  0 siblings, 0 replies; 13+ messages in thread
From: Stefano Garzarella @ 2020-06-15  7:23 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Stefan Hajnoczi, Jeff Moyer, io-uring, linux-kernel

On Sun, Jun 14, 2020 at 09:52:30AM -0600, Jens Axboe wrote:
> On 6/9/20 8:24 AM, Stefano Garzarella wrote:
> > [...]
> 
> I don't think this sounds unreasonable, but I'd really like to see a
> prototype hacked up before rendering any further opinions on it :-)

Yeah :-) I'll be back with a prototype of these changes ASAP.

Thanks for your feedback,
Stefano



* Re: [RFC] io_uring: add restrictions to support untrusted applications and guests
  2020-06-09 14:24 [RFC] io_uring: add restrictions to support untrusted applications and guests Stefano Garzarella
  2020-06-14 15:52 ` Jens Axboe
@ 2020-06-15  9:04 ` Jann Horn
  2020-06-15 13:33   ` Stefano Garzarella
  2020-06-15 22:01   ` Christian Brauner
  1 sibling, 2 replies; 13+ messages in thread
From: Jann Horn @ 2020-06-15  9:04 UTC (permalink / raw)
  To: Stefano Garzarella, Kees Cook, Christian Brauner, Sargun Dhillon,
	Aleksa Sarai
  Cc: Jens Axboe, Stefan Hajnoczi, Jeff Moyer, io-uring, kernel list,
	Kernel Hardening

+Kees, Christian, Sargun, Aleksa, kernel-hardening for their opinions
on seccomp-related aspects

On Tue, Jun 9, 2020 at 4:24 PM Stefano Garzarella <[email protected]> wrote:
> Hi Jens,
> Stefan and I have a proposal to share with the io_uring community.
> Before implementing it, we would like to discuss it to get feedback and
> to see whether it could be accepted:
>
> Adding restrictions to io_uring
> =====================================
> The io_uring API provides submission and completion queues for performing
> asynchronous I/O operations. The queues are located in memory that is
> accessible to both the host userspace application and the kernel, making it
> possible to monitor for activity through polling instead of system calls. This
> design offers good performance and this makes exposing io_uring to guests an
> attractive idea for improving I/O performance in virtualization.
[...]
> Restrictions
> ------------
> This document proposes io_uring API changes that safely allow untrusted
> applications or guests to use io_uring. io_uring's existing security model is
> that of kernel system call handler code. It is designed to reject invalid
> inputs from host userspace applications. Supporting guests as io_uring API
> clients adds a new trust domain with access to even fewer resources than host
> userspace applications.
>
> Guests do not have direct access to host userspace application file descriptors
> or memory. The host userspace application, a Virtual Machine Monitor (VMM) such
> as QEMU, grants access to a subset of its file descriptors and memory. The
> allowed file descriptors are typically the disk image files belonging to the
> guest. The memory is typically the virtual machine's RAM that the VMM has
> allocated on behalf of the guest.
>
> The following extensions to the io_uring API allow the host application to
> grant access to some of its file descriptors.
>
> These extensions are designed to be applicable to other use cases besides
> untrusted guests and are not virtualization-specific. For example, the
> restrictions can be used to allow only a subset of sqe operations available to
> an application similar to seccomp syscall whitelisting.
>
> An address translation and memory restriction mechanism would also be
> necessary, but we can discuss this later.
>
> The IORING_REGISTER_RESTRICTIONS opcode
> ---------------------------------------
> The new io_uring_register(2) IORING_REGISTER_RESTRICTIONS opcode permanently
> installs a feature whitelist on an io_ring_ctx. The io_ring_ctx can then be
> passed to untrusted code with the knowledge that only operations present in the
> whitelist can be executed.

This approach of first creating a normal io_uring instance and then
installing restrictions separately in a second syscall means that it
won't be possible to use seccomp to restrict newly created io_uring
instances; code that should be subject to seccomp restrictions and
uring restrictions would only be able to use preexisting io_uring
instances that have already been configured by trusted code.

So I think that from the seccomp perspective, it might be preferable
to set up these restrictions in the io_uring_setup() syscall. It might
also be a bit nicer from a code cleanliness perspective, since you
won't have to worry about concurrently changing restrictions.


* Re: [RFC] io_uring: add restrictions to support untrusted applications and guests
  2020-06-15  9:04 ` Jann Horn
@ 2020-06-15 13:33   ` Stefano Garzarella
  2020-06-15 17:00     ` Jens Axboe
  2020-06-15 22:01   ` Christian Brauner
  1 sibling, 1 reply; 13+ messages in thread
From: Stefano Garzarella @ 2020-06-15 13:33 UTC (permalink / raw)
  To: Jann Horn
  Cc: Kees Cook, Christian Brauner, Sargun Dhillon, Aleksa Sarai,
	Jens Axboe, Stefan Hajnoczi, Jeff Moyer, io-uring, kernel list,
	Kernel Hardening

On Mon, Jun 15, 2020 at 11:04:06AM +0200, Jann Horn wrote:
> +Kees, Christian, Sargun, Aleksa, kernel-hardening for their opinions
> on seccomp-related aspects
> 
> On Tue, Jun 9, 2020 at 4:24 PM Stefano Garzarella <[email protected]> wrote:
> > [...]
> > The IORING_REGISTER_RESTRICTIONS opcode
> > ---------------------------------------
> > The new io_uring_register(2) IORING_REGISTER_RESTRICTIONS opcode permanently
> > installs a feature whitelist on an io_ring_ctx. The io_ring_ctx can then be
> > passed to untrusted code with the knowledge that only operations present in the
> > whitelist can be executed.
> 
> This approach of first creating a normal io_uring instance and then
> installing restrictions separately in a second syscall means that it
> won't be possible to use seccomp to restrict newly created io_uring
> instances; code that should be subject to seccomp restrictions and
> uring restrictions would only be able to use preexisting io_uring
> instances that have already been configured by trusted code.
> 
> So I think that from the seccomp perspective, it might be preferable
> to set up these restrictions in the io_uring_setup() syscall. It might
> also be a bit nicer from a code cleanliness perspective, since you
> won't have to worry about concurrently changing restrictions.
> 

Thank you for these details!

It seems feasible to include the restrictions during io_uring_setup().

My only doubt is how to let the trusted code perform some operations
(for example registering file descriptors, buffers, eventfds, etc.)
before it passes the queues to the untrusted code.

To handle this, I would have to include these operations in io_uring_setup(),
which adds code that I had hoped to avoid by reusing io_uring_register().

If I add restrictions in io_uring_setup() and then add an operation to
go into safe mode (e.g. a flag in io_uring_enter()), we would have the same
problem, right?

Just to be clear, I mean something like this:

    /* params will include restrictions */
    fd = io_uring_setup(entries, params);

    /* trusted code */
    io_uring_register_files(fd, ...);
    io_uring_register_buffers(fd, ...);
    io_uring_register_eventfd(fd, ...);

    /* enable safe mode */
    io_uring_enter(fd, ..., IORING_ENTER_ENABLE_RESTRICTIONS);


Anyway, including a list of things to register in the 'params' passed
to io_uring_setup() should be feasible, if Jens agrees :-)

Thanks,
Stefano



* Re: [RFC] io_uring: add restrictions to support untrusted applications and guests
  2020-06-15 13:33   ` Stefano Garzarella
@ 2020-06-15 17:00     ` Jens Axboe
  2020-06-16  9:12       ` Stefano Garzarella
  0 siblings, 1 reply; 13+ messages in thread
From: Jens Axboe @ 2020-06-15 17:00 UTC (permalink / raw)
  To: Stefano Garzarella, Jann Horn
  Cc: Kees Cook, Christian Brauner, Sargun Dhillon, Aleksa Sarai,
	Stefan Hajnoczi, Jeff Moyer, io-uring, kernel list,
	Kernel Hardening

On 6/15/20 7:33 AM, Stefano Garzarella wrote:
> On Mon, Jun 15, 2020 at 11:04:06AM +0200, Jann Horn wrote:
>> +Kees, Christian, Sargun, Aleksa, kernel-hardening for their opinions
>> on seccomp-related aspects
>>
>> On Tue, Jun 9, 2020 at 4:24 PM Stefano Garzarella <[email protected]> wrote:
>>> Hi Jens,
>>> Stefan and I have a proposal to share with io_uring community.
>>> Before implementing it we would like to discuss it to receive feedbacks and
>>> to see if it could be accepted:
>>>
>>> Adding restrictions to io_uring
>>> =====================================
>>> The io_uring API provides submission and completion queues for performing
>>> asynchronous I/O operations. The queues are located in memory that is
>>> accessible to both the host userspace application and the kernel, making it
>>> possible to monitor for activity through polling instead of system calls. This
>>> design offers good performance and this makes exposing io_uring to guests an
>>> attractive idea for improving I/O performance in virtualization.
>> [...]
>>> Restrictions
>>> ------------
>>> This document proposes io_uring API changes that safely allow untrusted
>>> applications or guests to use io_uring. io_uring's existing security model is
>>> that of kernel system call handler code. It is designed to reject invalid
>>> inputs from host userspace applications. Supporting guests as io_uring API
>>> clients adds a new trust domain with access to even fewer resources than host
>>> userspace applications.
>>>
>>> Guests do not have direct access to host userspace application file descriptors
>>> or memory. The host userspace application, a Virtual Machine Monitor (VMM) such
>>> as QEMU, grants access to a subset of its file descriptors and memory. The
>>> allowed file descriptors are typically the disk image files belonging to the
>>> guest. The memory is typically the virtual machine's RAM that the VMM has
>>> allocated on behalf of the guest.
>>>
>>> The following extensions to the io_uring API allow the host application to
>>> grant access to some of its file descriptors.
>>>
>>> These extensions are designed to be applicable to other use cases besides
>>> untrusted guests and are not virtualization-specific. For example, the
>>> restrictions can be used to allow only a subset of sqe operations available to
>>> an application similar to seccomp syscall whitelisting.
>>>
>>> An address translation and memory restriction mechanism would also be
>>> necessary, but we can discuss this later.
>>>
>>> The IOURING_REGISTER_RESTRICTIONS opcode
>>> ----------------------------------------
>>> The new io_uring_register(2) IOURING_REGISTER_RESTRICTIONS opcode permanently
>>> installs a feature whitelist on an io_ring_ctx. The io_ring_ctx can then be
>>> passed to untrusted code with the knowledge that only operations present in the
>>> whitelist can be executed.
>>
>> This approach of first creating a normal io_uring instance and then
>> installing restrictions separately in a second syscall means that it
>> won't be possible to use seccomp to restrict newly created io_uring
>> instances; code that should be subject to seccomp restrictions and
>> uring restrictions would only be able to use preexisting io_uring
>> instances that have already been configured by trusted code.
>>
>> So I think that from the seccomp perspective, it might be preferable
>> to set up these restrictions in the io_uring_setup() syscall. It might
>> also be a bit nicer from a code cleanliness perspective, since you
>> won't have to worry about concurrently changing restrictions.
>>
> 
> Thank you for these details!
> 
> It seems feasible to include the restrictions during io_uring_setup().
> 
> The only doubt concerns the possibility of allowing the trusted code to
> do some operations, before passing queues to the untrusted code, for
> example registering file descriptors, buffers, eventfds, etc.
> 
> To avoid this, I should include these operations in io_uring_setup(),
> adding some code that I wanted to avoid by reusing io_uring_register().
> 
> If I add restrictions in io_uring_setup() and then add an operation to
> go into safe mode (e.g. a flag in io_uring_enter()), we would have the same
> problem, right?
> 
> Just to be clear, I mean something like this:
> 
>     /* params will include restrictions */
>     fd = io_uring_setup(entries, params);
> 
>     /* trusted code */
>     io_uring_register_files(fd, ...);
>     io_uring_register_buffers(fd, ...);
>     io_uring_register_eventfd(fd, ...);
> 
>     /* enable safe mode */
>     io_uring_enter(fd, ..., IORING_ENTER_ENABLE_RESTRICTIONS);
> 
> 
> Anyway, including a list of things to register in the 'params', passed
> to io_uring_setup(), should be feasible, if Jens agrees :-)

I wonder how best to deal with this, in terms of ring visibility vs
registering restrictions. We could potentially start the ring in a
disabled mode, if asked to. It'd still be visible in terms of having
the fd installed, but it'd just error requests. That'd leave you with
time to do the various setup routines needed before then flagging it
as enabled. My only worry on that would be adding overhead for doing
that. It'd be cheap enough to check for IORING_SETUP_DISABLED in
ctx->flags in io_uring_enter(), and return -EBADFD or something if
that's the case. That doesn't cover the SQPOLL case though, but maybe we
just don't start the sq thread if IORING_SETUP_DISABLED is set.

We'd need a way to clear IORING_SETUP_DISABLED through
io_uring_register(). When clearing, that could then start the sq thread
as well, when SQPOLL is set.
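
To make the disabled-start idea concrete, here is a toy userspace model
in C. It is not kernel code: the `model_*` names, the struct, and the
numeric flag values are all made up for illustration, and only the
behaviour (error out of enter while disabled, keep registration working,
start the sq thread on enable) follows the description above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#ifndef EBADFD
#define EBADFD 77 /* Linux value; fallback for platforms that lack it */
#endif

/* Hypothetical flag bits; the values are arbitrary for this model. */
#define MODEL_SETUP_SQPOLL   (1u << 1)
#define MODEL_SETUP_DISABLED (1u << 6)

/* Toy stand-in for the per-ring context. */
struct model_ring_ctx {
    unsigned int flags;
    bool sq_thread_running;
    unsigned int registered_files;
};

/* Models setup: with DISABLED set, the sq thread is not started. */
void model_setup(struct model_ring_ctx *ctx, unsigned int flags)
{
    assert(ctx != NULL);
    ctx->flags = flags;
    ctx->registered_files = 0;
    ctx->sq_thread_running =
        (flags & MODEL_SETUP_SQPOLL) && !(flags & MODEL_SETUP_DISABLED);
}

/* Models enter: a single flag test decides whether requests are refused. */
int model_enter(const struct model_ring_ctx *ctx, unsigned int to_submit)
{
    if (ctx->flags & MODEL_SETUP_DISABLED)
        return -EBADFD;
    return (int)to_submit; /* pretend everything was submitted */
}

/* Models a registration call, which stays usable while the ring is disabled. */
int model_register_files(struct model_ring_ctx *ctx, unsigned int nr)
{
    ctx->registered_files += nr;
    return 0;
}

/* Models the register opcode that clears DISABLED and starts the sq thread. */
int model_register_enable(struct model_ring_ctx *ctx)
{
    if (!(ctx->flags & MODEL_SETUP_DISABLED))
        return -EBADFD; /* already enabled */
    ctx->flags &= ~MODEL_SETUP_DISABLED;
    if (ctx->flags & MODEL_SETUP_SQPOLL)
        ctx->sq_thread_running = true;
    return 0;
}
```

In this model, trusted code would call model_setup() with the DISABLED
bit, do its registrations, and call model_register_enable() last; any
model_enter() before that point fails with -EBADFD.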

-- 
Jens Axboe



* Re: [RFC] io_uring: add restrictions to support untrusted applications and guests
  2020-06-15  9:04 ` Jann Horn
  2020-06-15 13:33   ` Stefano Garzarella
@ 2020-06-15 22:01   ` Christian Brauner
  2020-06-15 23:26     ` Jann Horn
  1 sibling, 1 reply; 13+ messages in thread
From: Christian Brauner @ 2020-06-15 22:01 UTC (permalink / raw)
  To: Jann Horn
  Cc: Stefano Garzarella, Kees Cook, Sargun Dhillon, Aleksa Sarai,
	Jens Axboe, Stefan Hajnoczi, Jeff Moyer, io-uring, kernel list,
	Kernel Hardening

On Mon, Jun 15, 2020 at 11:04:06AM +0200, Jann Horn wrote:
> +Kees, Christian, Sargun, Aleksa, kernel-hardening for their opinions
> on seccomp-related aspects

Just fyi, I'm on holiday this week so my responses have some
non-significant lag into early next week.

> 
> On Tue, Jun 9, 2020 at 4:24 PM Stefano Garzarella <[email protected]> wrote:
> > Hi Jens,
> > Stefan and I have a proposal to share with io_uring community.
> > Before implementing it we would like to discuss it to receive feedbacks and
> > to see if it could be accepted:
> >
> > Adding restrictions to io_uring
> > =====================================
> > The io_uring API provides submission and completion queues for performing
> > asynchronous I/O operations. The queues are located in memory that is
> > accessible to both the host userspace application and the kernel, making it
> > possible to monitor for activity through polling instead of system calls. This
> > design offers good performance and this makes exposing io_uring to guests an
> > attractive idea for improving I/O performance in virtualization.
> [...]
> > Restrictions
> > ------------
> > This document proposes io_uring API changes that safely allow untrusted
> > applications or guests to use io_uring. io_uring's existing security model is
> > that of kernel system call handler code. It is designed to reject invalid
> > inputs from host userspace applications. Supporting guests as io_uring API
> > clients adds a new trust domain with access to even fewer resources than host
> > userspace applications.
> >
> > Guests do not have direct access to host userspace application file descriptors
> > or memory. The host userspace application, a Virtual Machine Monitor (VMM) such
> > as QEMU, grants access to a subset of its file descriptors and memory. The
> > allowed file descriptors are typically the disk image files belonging to the
> > guest. The memory is typically the virtual machine's RAM that the VMM has
> > allocated on behalf of the guest.
> >
> > The following extensions to the io_uring API allow the host application to
> > grant access to some of its file descriptors.
> >
> > These extensions are designed to be applicable to other use cases besides
> > untrusted guests and are not virtualization-specific. For example, the
> > restrictions can be used to allow only a subset of sqe operations available to
> > an application similar to seccomp syscall whitelisting.
> >
> > An address translation and memory restriction mechanism would also be
> > necessary, but we can discuss this later.
> >
> > The IOURING_REGISTER_RESTRICTIONS opcode
> > ----------------------------------------
> > The new io_uring_register(2) IOURING_REGISTER_RESTRICTIONS opcode permanently
> > installs a feature whitelist on an io_ring_ctx. The io_ring_ctx can then be
> > passed to untrusted code with the knowledge that only operations present in the
> > whitelist can be executed.
> 
> This approach of first creating a normal io_uring instance and then
> installing restrictions separately in a second syscall means that it
> won't be possible to use seccomp to restrict newly created io_uring
> instances; code that should be subject to seccomp restrictions and
> uring restrictions would only be able to use preexisting io_uring
> instances that have already been configured by trusted code.
> 
> So I think that from the seccomp perspective, it might be preferable
> to set up these restrictions in the io_uring_setup() syscall. It might

So from what I can gather from this proposal, this would be a separate
security model for io_uring? I'm not too thrilled about that tbh. (There's
some discussion around extending seccomp - also at kernel summit.)
But doing the whole restriction setup in io_uring_setup() would at least
mean that if seccomp is extended to filter first-level pointers it could
know about all the security restrictions that apply to this io_uring
instance (Which I think you were getting at, Jann?).

Hm, if a task has a seccomp filter installed that blocks openat
syscalls, would it make sense for io_uring to automatically block
openat() calls as well? Or is the expectation "just block all of
io_uring if you're worried about that"?

Christian


* Re: [RFC] io_uring: add restrictions to support untrusted applications and guests
  2020-06-15 22:01   ` Christian Brauner
@ 2020-06-15 23:26     ` Jann Horn
  0 siblings, 0 replies; 13+ messages in thread
From: Jann Horn @ 2020-06-15 23:26 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Stefano Garzarella, Kees Cook, Sargun Dhillon, Aleksa Sarai,
	Jens Axboe, Stefan Hajnoczi, Jeff Moyer, io-uring, kernel list,
	Kernel Hardening

On Tue, Jun 16, 2020 at 12:01 AM Christian Brauner
<[email protected]> wrote:
>
> On Mon, Jun 15, 2020 at 11:04:06AM +0200, Jann Horn wrote:
> > +Kees, Christian, Sargun, Aleksa, kernel-hardening for their opinions
> > on seccomp-related aspects
>
> Just fyi, I'm on holiday this week so my responses have some
> non-significant lag into early next week.
>
> >
> > On Tue, Jun 9, 2020 at 4:24 PM Stefano Garzarella <[email protected]> wrote:
> > > Hi Jens,
> > > Stefan and I have a proposal to share with io_uring community.
> > > Before implementing it we would like to discuss it to receive feedbacks and
> > > to see if it could be accepted:
> > >
> > > Adding restrictions to io_uring
> > > =====================================
> > > The io_uring API provides submission and completion queues for performing
> > > asynchronous I/O operations. The queues are located in memory that is
> > > accessible to both the host userspace application and the kernel, making it
> > > possible to monitor for activity through polling instead of system calls. This
> > > design offers good performance and this makes exposing io_uring to guests an
> > > attractive idea for improving I/O performance in virtualization.
> > [...]
> > > Restrictions
> > > ------------
> > > This document proposes io_uring API changes that safely allow untrusted
> > > applications or guests to use io_uring. io_uring's existing security model is
> > > that of kernel system call handler code. It is designed to reject invalid
> > > inputs from host userspace applications. Supporting guests as io_uring API
> > > clients adds a new trust domain with access to even fewer resources than host
> > > userspace applications.
> > >
> > > Guests do not have direct access to host userspace application file descriptors
> > > or memory. The host userspace application, a Virtual Machine Monitor (VMM) such
> > > as QEMU, grants access to a subset of its file descriptors and memory. The
> > > allowed file descriptors are typically the disk image files belonging to the
> > > guest. The memory is typically the virtual machine's RAM that the VMM has
> > > allocated on behalf of the guest.
> > >
> > > The following extensions to the io_uring API allow the host application to
> > > grant access to some of its file descriptors.
> > >
> > > These extensions are designed to be applicable to other use cases besides
> > > untrusted guests and are not virtualization-specific. For example, the
> > > restrictions can be used to allow only a subset of sqe operations available to
> > > an application similar to seccomp syscall whitelisting.
> > >
> > > An address translation and memory restriction mechanism would also be
> > > necessary, but we can discuss this later.
> > >
> > > The IOURING_REGISTER_RESTRICTIONS opcode
> > > ----------------------------------------
> > > The new io_uring_register(2) IOURING_REGISTER_RESTRICTIONS opcode permanently
> > > installs a feature whitelist on an io_ring_ctx. The io_ring_ctx can then be
> > > passed to untrusted code with the knowledge that only operations present in the
> > > whitelist can be executed.
> >
> > This approach of first creating a normal io_uring instance and then
> > installing restrictions separately in a second syscall means that it
> > won't be possible to use seccomp to restrict newly created io_uring
> > instances; code that should be subject to seccomp restrictions and
> > uring restrictions would only be able to use preexisting io_uring
> > instances that have already been configured by trusted code.
> >
> > So I think that from the seccomp perspective, it might be preferable
> > to set up these restrictions in the io_uring_setup() syscall. It might
>
> So from what I can gather from this proposal, this would be a separate
> security model for io_uring? I'm not too thrilled about that tbh. (There's
> some discussion around extending seccomp - also at kernel summit.)
> But doing the whole restriction setup in io_uring_setup() would at least
> mean that if seccomp is extended to filter first-level pointers it could
> know about all the security restrictions that apply to this io_uring
> instance (Which I think you were getting at, Jann?).

Yeah.

> Hm, would it make sense that if a task has a seccomp filter installed
> that blocks openat syscalls that io_uring should automatically block
> openat() calls as well or is the expectation "just block all of io_uring
> if you're worried about that"?

I mean, if we could make that automagic, that'd be kinda neat; but I'm
slightly worried that an automated translation might end up being
slightly inaccurate. (But maybe that's acceptable?)
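
As a rough illustration of where such a translation could go wrong, here
is a toy table-driven sketch in C. The table, the `xlate_*` names, and
the string-based opcode names are all hypothetical; real seccomp filters
are BPF programs that can also inspect syscall arguments, which this
model ignores entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical translation table: syscall name -> sqe opcode name. */
struct xlate_entry {
    const char *syscall_name;
    const char *sqe_op_name;
};

static const struct xlate_entry xlate_table[] = {
    { "openat",  "OPENAT"  },
    { "openat2", "OPENAT2" },
    { "readv",   "READV"   },
    { "writev",  "WRITEV"  },
};

/*
 * Returns true if sqe_op should be refused because the syscall it maps to
 * appears in the list of blocked syscall names.
 */
bool xlate_op_blocked(const char *sqe_op,
                      const char *const *blocked, size_t nr_blocked)
{
    assert(sqe_op != NULL);
    for (size_t i = 0; i < sizeof(xlate_table) / sizeof(xlate_table[0]); i++) {
        if (strcmp(xlate_table[i].sqe_op_name, sqe_op) != 0)
            continue;
        for (size_t j = 0; j < nr_blocked; j++) {
            if (strcmp(blocked[j], xlate_table[i].syscall_name) == 0)
                return true;
        }
    }
    return false;
}
```

With only "openat" in the blocked list, this model refuses OPENAT but
still lets OPENAT2 through, which is one example of the kind of
inaccuracy an automatic mapping would have to deal with.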


* Re: [RFC] io_uring: add restrictions to support untrusted applications and guests
  2020-06-15 17:00     ` Jens Axboe
@ 2020-06-16  9:12       ` Stefano Garzarella
  2020-06-16 11:32         ` Jann Horn
  2020-06-16 15:26         ` Jens Axboe
  0 siblings, 2 replies; 13+ messages in thread
From: Stefano Garzarella @ 2020-06-16  9:12 UTC (permalink / raw)
  To: Jens Axboe, Jann Horn
  Cc: Kees Cook, Christian Brauner, Sargun Dhillon, Aleksa Sarai,
	Stefan Hajnoczi, Jeff Moyer, io-uring, kernel list,
	Kernel Hardening

On Mon, Jun 15, 2020 at 11:00:25AM -0600, Jens Axboe wrote:
> On 6/15/20 7:33 AM, Stefano Garzarella wrote:
> > On Mon, Jun 15, 2020 at 11:04:06AM +0200, Jann Horn wrote:
> >> +Kees, Christian, Sargun, Aleksa, kernel-hardening for their opinions
> >> on seccomp-related aspects
> >>
> >> On Tue, Jun 9, 2020 at 4:24 PM Stefano Garzarella <[email protected]> wrote:
> >>> Hi Jens,
> >>> Stefan and I have a proposal to share with io_uring community.
> >>> Before implementing it we would like to discuss it to receive feedbacks and
> >>> to see if it could be accepted:
> >>>
> >>> Adding restrictions to io_uring
> >>> =====================================
> >>> The io_uring API provides submission and completion queues for performing
> >>> asynchronous I/O operations. The queues are located in memory that is
> >>> accessible to both the host userspace application and the kernel, making it
> >>> possible to monitor for activity through polling instead of system calls. This
> >>> design offers good performance and this makes exposing io_uring to guests an
> >>> attractive idea for improving I/O performance in virtualization.
> >> [...]
> >>> Restrictions
> >>> ------------
> >>> This document proposes io_uring API changes that safely allow untrusted
> >>> applications or guests to use io_uring. io_uring's existing security model is
> >>> that of kernel system call handler code. It is designed to reject invalid
> >>> inputs from host userspace applications. Supporting guests as io_uring API
> >>> clients adds a new trust domain with access to even fewer resources than host
> >>> userspace applications.
> >>>
> >>> Guests do not have direct access to host userspace application file descriptors
> >>> or memory. The host userspace application, a Virtual Machine Monitor (VMM) such
> >>> as QEMU, grants access to a subset of its file descriptors and memory. The
> >>> allowed file descriptors are typically the disk image files belonging to the
> >>> guest. The memory is typically the virtual machine's RAM that the VMM has
> >>> allocated on behalf of the guest.
> >>>
> >>> The following extensions to the io_uring API allow the host application to
> >>> grant access to some of its file descriptors.
> >>>
> >>> These extensions are designed to be applicable to other use cases besides
> >>> untrusted guests and are not virtualization-specific. For example, the
> >>> restrictions can be used to allow only a subset of sqe operations available to
> >>> an application similar to seccomp syscall whitelisting.
> >>>
> >>> An address translation and memory restriction mechanism would also be
> >>> necessary, but we can discuss this later.
> >>>
> >>> The IOURING_REGISTER_RESTRICTIONS opcode
> >>> ----------------------------------------
> >>> The new io_uring_register(2) IOURING_REGISTER_RESTRICTIONS opcode permanently
> >>> installs a feature whitelist on an io_ring_ctx. The io_ring_ctx can then be
> >>> passed to untrusted code with the knowledge that only operations present in the
> >>> whitelist can be executed.
> >>
> >> This approach of first creating a normal io_uring instance and then
> >> installing restrictions separately in a second syscall means that it
> >> won't be possible to use seccomp to restrict newly created io_uring
> >> instances; code that should be subject to seccomp restrictions and
> >> uring restrictions would only be able to use preexisting io_uring
> >> instances that have already been configured by trusted code.
> >>
> >> So I think that from the seccomp perspective, it might be preferable
> >> to set up these restrictions in the io_uring_setup() syscall. It might
> >> also be a bit nicer from a code cleanliness perspective, since you
> >> won't have to worry about concurrently changing restrictions.
> >>
> > 
> > Thank you for these details!
> > 
> > It seems feasible to include the restrictions during io_uring_setup().
> > 
> > The only doubt concerns the possibility of allowing the trusted code to
> > do some operations, before passing queues to the untrusted code, for
> > example registering file descriptors, buffers, eventfds, etc.
> > 
> > To avoid this, I should include these operations in io_uring_setup(),
> > adding some code that I wanted to avoid by reusing io_uring_register().
> > 
> > If I add restrictions in io_uring_setup() and then add an operation to
> > go into safe mode (e.g. a flag in io_uring_enter()), we would have the same
> > problem, right?
> > 
> > Just to be clear, I mean something like this:
> > 
> >     /* params will include restrictions */
> >     fd = io_uring_setup(entries, params);
> > 
> >     /* trusted code */
> >     io_uring_register_files(fd, ...);
> >     io_uring_register_buffers(fd, ...);
> >     io_uring_register_eventfd(fd, ...);
> > 
> >     /* enable safe mode */
> >     io_uring_enter(fd, ..., IORING_ENTER_ENABLE_RESTRICTIONS);
> > 
> > 
> > Anyway, including a list of things to register in the 'params', passed
> > to io_uring_setup(), should be feasible, if Jens agrees :-)
> 
> I wonder how best to deal with this, in terms of ring visibility vs
> registering restrictions. We could potentially start the ring in a
> disabled mode, if asked to. It'd still be visible in terms of having
> the fd installed, but it'd just error requests. That'd leave you with
> time to do the various setup routines needed before then flagging it
> as enabled. My only worry on that would be adding overhead for doing
> that. It'd be cheap enough to check for IORING_SETUP_DISABLED in
> ctx->flags in io_uring_enter(), and return -EBADFD or something if
> that's the case. That doesn't cover the SQPOLL case though, but maybe we
> just don't start the sq thread if IORING_SETUP_DISABLED is set.

That seems to me a very good approach, and easy to implement. This way
we can reuse io_uring_register() without having to modify
io_uring_setup() too much.

> 
> We'd need a way to clear IORING_SETUP_DISABLED through
> io_uring_register(). When clearing, that could then start the sq thread
> as well, when SQPOLL is set.

Could we do it using io_uring_enter(), since it already has a flags field,
or do you think that would be semantically incorrect?

@Jann, do you think this could work with seccomp?

Thanks,
Stefano



* Re: [RFC] io_uring: add restrictions to support untrusted applications and guests
  2020-06-16  9:12       ` Stefano Garzarella
@ 2020-06-16 11:32         ` Jann Horn
  2020-06-16 14:07           ` Stefano Garzarella
  2020-06-16 15:26         ` Jens Axboe
  1 sibling, 1 reply; 13+ messages in thread
From: Jann Horn @ 2020-06-16 11:32 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: Jens Axboe, Kees Cook, Christian Brauner, Sargun Dhillon,
	Aleksa Sarai, Stefan Hajnoczi, Jeff Moyer, io-uring, kernel list,
	Kernel Hardening

On Tue, Jun 16, 2020 at 11:13 AM Stefano Garzarella <[email protected]> wrote:
> On Mon, Jun 15, 2020 at 11:00:25AM -0600, Jens Axboe wrote:
> > On 6/15/20 7:33 AM, Stefano Garzarella wrote:
> > > On Mon, Jun 15, 2020 at 11:04:06AM +0200, Jann Horn wrote:
> > >> +Kees, Christian, Sargun, Aleksa, kernel-hardening for their opinions
> > >> on seccomp-related aspects
> > >>
> > >> On Tue, Jun 9, 2020 at 4:24 PM Stefano Garzarella <[email protected]> wrote:
> > >>> Hi Jens,
> > >>> Stefan and I have a proposal to share with io_uring community.
> > >>> Before implementing it we would like to discuss it to receive feedbacks and
> > >>> to see if it could be accepted:
> > >>>
> > >>> Adding restrictions to io_uring
> > >>> =====================================
> > >>> The io_uring API provides submission and completion queues for performing
> > >>> asynchronous I/O operations. The queues are located in memory that is
> > >>> accessible to both the host userspace application and the kernel, making it
> > >>> possible to monitor for activity through polling instead of system calls. This
> > >>> design offers good performance and this makes exposing io_uring to guests an
> > >>> attractive idea for improving I/O performance in virtualization.
> > >> [...]
> > >>> Restrictions
> > >>> ------------
> > >>> This document proposes io_uring API changes that safely allow untrusted
> > >>> applications or guests to use io_uring. io_uring's existing security model is
> > >>> that of kernel system call handler code. It is designed to reject invalid
> > >>> inputs from host userspace applications. Supporting guests as io_uring API
> > >>> clients adds a new trust domain with access to even fewer resources than host
> > >>> userspace applications.
> > >>>
> > >>> Guests do not have direct access to host userspace application file descriptors
> > >>> or memory. The host userspace application, a Virtual Machine Monitor (VMM) such
> > >>> as QEMU, grants access to a subset of its file descriptors and memory. The
> > >>> allowed file descriptors are typically the disk image files belonging to the
> > >>> guest. The memory is typically the virtual machine's RAM that the VMM has
> > >>> allocated on behalf of the guest.
> > >>>
> > >>> The following extensions to the io_uring API allow the host application to
> > >>> grant access to some of its file descriptors.
> > >>>
> > >>> These extensions are designed to be applicable to other use cases besides
> > >>> untrusted guests and are not virtualization-specific. For example, the
> > >>> restrictions can be used to allow only a subset of sqe operations available to
> > >>> an application similar to seccomp syscall whitelisting.
> > >>>
> > >>> An address translation and memory restriction mechanism would also be
> > >>> necessary, but we can discuss this later.
> > >>>
> > >>> The IOURING_REGISTER_RESTRICTIONS opcode
> > >>> ----------------------------------------
> > >>> The new io_uring_register(2) IOURING_REGISTER_RESTRICTIONS opcode permanently
> > >>> installs a feature whitelist on an io_ring_ctx. The io_ring_ctx can then be
> > >>> passed to untrusted code with the knowledge that only operations present in the
> > >>> whitelist can be executed.
> > >>
> > >> This approach of first creating a normal io_uring instance and then
> > >> installing restrictions separately in a second syscall means that it
> > >> won't be possible to use seccomp to restrict newly created io_uring
> > >> instances; code that should be subject to seccomp restrictions and
> > >> uring restrictions would only be able to use preexisting io_uring
> > >> instances that have already been configured by trusted code.
> > >>
> > >> So I think that from the seccomp perspective, it might be preferable
> > >> to set up these restrictions in the io_uring_setup() syscall. It might
> > >> also be a bit nicer from a code cleanliness perspective, since you
> > >> won't have to worry about concurrently changing restrictions.
> > >>
> > >
> > > Thank you for these details!
> > >
> > > It seems feasible to include the restrictions during io_uring_setup().
> > >
> > > The only doubt concerns the possibility of allowing the trusted code to
> > > do some operations, before passing queues to the untrusted code, for
> > > example registering file descriptors, buffers, eventfds, etc.
> > >
> > > To avoid this, I should include these operations in io_uring_setup(),
> > > adding some code that I wanted to avoid by reusing io_uring_register().
> > >
> > > If I add restrictions in io_uring_setup() and then add an operation to
> > > go into safe mode (e.g. a flag in io_uring_enter()), we would have the same
> > > problem, right?
> > >
> > > Just to be clear, I mean something like this:
> > >
> > >     /* params will include restrictions */
> > >     fd = io_uring_setup(entries, params);
> > >
> > >     /* trusted code */
> > >     io_uring_register_files(fd, ...);
> > >     io_uring_register_buffers(fd, ...);
> > >     io_uring_register_eventfd(fd, ...);
> > >
> > >     /* enable safe mode */
> > >     io_uring_enter(fd, ..., IORING_ENTER_ENABLE_RESTRICTIONS);
> > >
> > >
> > > Anyway, including a list of things to register in the 'params', passed
> > > to io_uring_setup(), should be feasible, if Jens agrees :-)
> >
> > I wonder how best to deal with this, in terms of ring visibility vs
> > registering restrictions. We could potentially start the ring in a
> > disabled mode, if asked to. It'd still be visible in terms of having
> > the fd installed, but it'd just error requests. That'd leave you with
> > time to do the various setup routines needed before then flagging it
> > as enabled. My only worry on that would be adding overhead for doing
> > that. It'd be cheap enough to check for IORING_SETUP_DISABLED in
> > ctx->flags in io_uring_enter(), and return -EBADFD or something if
> > that's the case. That doesn't cover the SQPOLL case though, but maybe we
> > just don't start the sq thread if IORING_SETUP_DISABLED is set.
>
> It seems to me a very good approach and easy to implement. In this way
> we can reuse io_uring_register() without having to modify too much
> io_uring_setup().
>
> >
> > We'd need a way to clear IORING_SETUP_DISABLED through
> > io_uring_register(). When clearing, that could then start the sq thread
> > as well, when SQPOLL is set.
>
> Could we do it using io_uring_enter() since we have a flag field or
> do you think it's semantically incorrect?
>
> @Jann, do you think this could work with seccomp?

To clarify that I understood your proposal correctly: Is the idea to
have two types of mostly orthogonal restrictions; one type being
restrictions on the opcode (supplied in io_uring_setup() and enforced
immediately) and the other type being restrictions on
io_uring_register() (enabled via IORING_ENTER_ENABLE_RESTRICTIONS)?

That sounds fine to me. IORING_ENTER_ENABLE_RESTRICTIONS probably
isn't necessary for your usecase though, right? Or is the idea to use
that to suppress grace periods during setup in io_uring_register(), or
something like that?
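
To spell out my reading of the two restriction types, here is a small
userspace model in C. Everything in it is hypothetical (the `rmodel_*`
names, the opcode enums, the bitmask layout, the choice of -EACCES); it
only shows an sqe opcode whitelist that applies from creation and a
register whitelist that starts being enforced at a later point.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcode numbering, for this model only. */
enum rmodel_sqe_op {
    RMODEL_OP_NOP, RMODEL_OP_READV, RMODEL_OP_WRITEV, RMODEL_OP_OPENAT,
    RMODEL_OP_LAST
};
enum rmodel_register_op {
    RMODEL_REGISTER_FILES, RMODEL_REGISTER_BUFFERS, RMODEL_REGISTER_EVENTFD,
    RMODEL_REGISTER_LAST
};

struct rmodel_restrictions {
    uint32_t sqe_allowed;      /* bit i set => sqe opcode i allowed */
    uint32_t register_allowed; /* bit i set => register opcode i allowed */
    bool register_enforced;    /* false while trusted setup is running */
};

/* Both masks are supplied at creation; only the sqe mask applies at once. */
void rmodel_init(struct rmodel_restrictions *r,
                 uint32_t sqe_allowed, uint32_t register_allowed)
{
    assert(r != NULL);
    r->sqe_allowed = sqe_allowed;
    r->register_allowed = register_allowed;
    r->register_enforced = false;
}

/* sqe opcodes are checked against the whitelist from the start. */
int rmodel_check_sqe(const struct rmodel_restrictions *r, unsigned int op)
{
    if (op >= (unsigned int)RMODEL_OP_LAST)
        return -EINVAL;
    return (r->sqe_allowed & (1u << op)) ? 0 : -EACCES;
}

/* register opcodes are unrestricted until enforcement is switched on. */
int rmodel_check_register(const struct rmodel_restrictions *r, unsigned int op)
{
    if (op >= (unsigned int)RMODEL_REGISTER_LAST)
        return -EINVAL;
    if (!r->register_enforced)
        return 0;
    return (r->register_allowed & (1u << op)) ? 0 : -EACCES;
}

/* Models the one-way switch into "safe mode". */
void rmodel_enforce_register(struct rmodel_restrictions *r)
{
    r->register_enforced = true;
}
```

In this model the trusted code can still register files and buffers
after creation, and the ring only becomes fully locked down once
rmodel_enforce_register() has been called.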


* Re: [RFC] io_uring: add restrictions to support untrusted applications and guests
  2020-06-16 11:32         ` Jann Horn
@ 2020-06-16 14:07           ` Stefano Garzarella
  0 siblings, 0 replies; 13+ messages in thread
From: Stefano Garzarella @ 2020-06-16 14:07 UTC (permalink / raw)
  To: Jann Horn
  Cc: Jens Axboe, Kees Cook, Christian Brauner, Sargun Dhillon,
	Aleksa Sarai, Stefan Hajnoczi, Jeff Moyer, io-uring, kernel list,
	Kernel Hardening

On Tue, Jun 16, 2020 at 01:32:54PM +0200, Jann Horn wrote:
> On Tue, Jun 16, 2020 at 11:13 AM Stefano Garzarella <[email protected]> wrote:
> > On Mon, Jun 15, 2020 at 11:00:25AM -0600, Jens Axboe wrote:
> > > On 6/15/20 7:33 AM, Stefano Garzarella wrote:
> > > > On Mon, Jun 15, 2020 at 11:04:06AM +0200, Jann Horn wrote:
> > > >> +Kees, Christian, Sargun, Aleksa, kernel-hardening for their opinions
> > > >> on seccomp-related aspects
> > > >>
> > > >> On Tue, Jun 9, 2020 at 4:24 PM Stefano Garzarella <[email protected]> wrote:
> > > >>> Hi Jens,
> > > >>> Stefan and I have a proposal to share with io_uring community.
> > > >>> Before implementing it we would like to discuss it to receive feedbacks and
> > > >>> to see if it could be accepted:
> > > >>>
> > > >>> Adding restrictions to io_uring
> > > >>> =====================================
> > > >>> The io_uring API provides submission and completion queues for performing
> > > >>> asynchronous I/O operations. The queues are located in memory that is
> > > >>> accessible to both the host userspace application and the kernel, making it
> > > >>> possible to monitor for activity through polling instead of system calls. This
> > > >>> design offers good performance and this makes exposing io_uring to guests an
> > > >>> attractive idea for improving I/O performance in virtualization.
> > > >> [...]
> > > >>> Restrictions
> > > >>> ------------
> > > >>> This document proposes io_uring API changes that safely allow untrusted
> > > >>> applications or guests to use io_uring. io_uring's existing security model is
> > > >>> that of kernel system call handler code. It is designed to reject invalid
> > > >>> inputs from host userspace applications. Supporting guests as io_uring API
> > > >>> clients adds a new trust domain with access to even fewer resources than host
> > > >>> userspace applications.
> > > >>>
> > > >>> Guests do not have direct access to host userspace application file descriptors
> > > >>> or memory. The host userspace application, a Virtual Machine Monitor (VMM) such
> > > >>> as QEMU, grants access to a subset of its file descriptors and memory. The
> > > >>> allowed file descriptors are typically the disk image files belonging to the
> > > >>> guest. The memory is typically the virtual machine's RAM that the VMM has
> > > >>> allocated on behalf of the guest.
> > > >>>
> > > >>> The following extensions to the io_uring API allow the host application to
> > > >>> grant access to some of its file descriptors.
> > > >>>
> > > >>> These extensions are designed to be applicable to other use cases besides
> > > >>> untrusted guests and are not virtualization-specific. For example, the
> > > >>> restrictions can be used to allow only a subset of sqe operations available to
> > > >>> an application similar to seccomp syscall whitelisting.
> > > >>>
> > > >>> An address translation and memory restriction mechanism would also be
> > > >>> necessary, but we can discuss this later.
> > > >>>
> > > >>> The IOURING_REGISTER_RESTRICTIONS opcode
> > > >>> ----------------------------------------
> > > >>> The new io_uring_register(2) IOURING_REGISTER_RESTRICTIONS opcode permanently
> > > >>> installs a feature whitelist on an io_ring_ctx. The io_ring_ctx can then be
> > > >>> passed to untrusted code with the knowledge that only operations present in the
> > > >>> whitelist can be executed.
> > > >>
> > > >> This approach of first creating a normal io_uring instance and then
> > > >> installing restrictions separately in a second syscall means that it
> > > >> won't be possible to use seccomp to restrict newly created io_uring
> > > >> instances; code that should be subject to seccomp restrictions and
> > > >> uring restrictions would only be able to use preexisting io_uring
> > > >> instances that have already been configured by trusted code.
> > > >>
> > > >> So I think that from the seccomp perspective, it might be preferable
> > > >> to set up these restrictions in the io_uring_setup() syscall. It might
> > > >> also be a bit nicer from a code cleanliness perspective, since you
> > > >> won't have to worry about concurrently changing restrictions.
> > > >>
> > > >
> > > > Thank you for these details!
> > > >
> > > > It seems feasible to include the restrictions during io_uring_setup().
> > > >
> > > > The only doubt concerns the possibility of allowing the trusted code to
> > > > do some operations, before passing queues to the untrusted code, for
> > > > example registering file descriptors, buffers, eventfds, etc.
> > > >
> > > > To avoid this, I should include these operations in io_uring_setup(),
> > > > adding some code that I wanted to avoid by reusing io_uring_register().
> > > >
> > > > If I add restrictions in io_uring_setup() and then add an operation to
> > > > go into safe mode (e.g. a flag in io_uring_enter()), we would have the same
> > > > problem, right?
> > > >
> > > > Just to be clear, I mean something like this:
> > > >
> > > >     /* params will include restrictions */
> > > >     fd = io_uring_setup(entries, params);
> > > >
> > > >     /* trusted code */
> > > >     io_uring_register_files(fd, ...);
> > > >     io_uring_register_buffers(fd, ...);
> > > >     io_uring_register_eventfd(fd, ...);
> > > >
> > > >     /* enable safe mode */
> > > >     io_uring_enter(fd, ..., IORING_ENTER_ENABLE_RESTRICTIONS);
> > > >
> > > >
> > > > Anyway, including a list of things to register in the 'params', passed
> > > > > to io_uring_setup(), should be feasible, if Jens agrees :-)
> > >
> > > I wonder how best to deal with this, in terms of ring visibility vs
> > > registering restrictions. We could potentially start the ring in a
> > > disabled mode, if asked to. It'd still be visible in terms of having
> > > the fd installed, but it'd just error requests. That'd leave you with
> > > time to do the various setup routines needed before then flagging it
> > > as enabled. My only worry on that would be adding overhead for doing
> > > that. It'd be cheap enough to check for IORING_SETUP_DISABLED in
> > > ctx->flags in io_uring_enter(), and return -EBADFD or something if
> > > that's the case. That doesn't cover the SQPOLL case though, but maybe we
> > > just don't start the sq thread if IORING_SETUP_DISABLED is set.
> >
> > It seems to me a very good approach and easy to implement. In this way
> > we can reuse io_uring_register() without having to modify
> > io_uring_setup() too much.
> >
> > >
> > > We'd need a way to clear IORING_SETUP_DISABLED through
> > > io_uring_register(). When clearing, that could then start the sq thread
> > > as well, when SQPOLL is set.
> >
> > Could we do it using io_uring_enter() since we have a flag field or
> > do you think it's semantically incorrect?
> >
> > @Jann, do you think this could work with seccomp?
> 
> To clarify that I understood your proposal correctly: Is the idea to
> have two types of mostly orthogonal restrictions; one type being
> restrictions on the opcode (supplied in io_uring_setup() and enforced
> immediately) and the other type being restrictions on
> io_uring_register() (enabled via IORING_ENTER_ENABLE_RESTRICTIONS)?

Slightly different. The idea is to start the ring in a disabled mode,
where no submission operations are allowed.

This way the trusted code can complete its setup (e.g. using
io_uring_register() to register fds, buffers, restrictions, etc.).
When the setup phase is finished, the trusted code enables the ring
using io_uring_register() or io_uring_enter() with a special flag.

After this last syscall, submissions are enabled and restricted
(if restrictions have been registered).
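
To make the ordering concrete, here is a small userspace model of that
life cycle. It is only a sketch: every name in it (model_ring,
model_register_restrictions(), model_enable(), ...) is hypothetical and
not kernel API, and the error codes other than -EBADFD are assumptions
chosen for illustration. It also assumes the whitelist can only be
installed while the ring is still disabled.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#ifndef EBADFD
#define EBADFD 77 /* Linux value, fallback for other platforms */
#endif

/* Hypothetical opcodes for the model; not the real IORING_OP_* values. */
enum model_op { MODEL_OP_NOP, MODEL_OP_READ, MODEL_OP_WRITE };

struct model_ring {
	bool disabled;        /* ring was created in disabled mode */
	bool restricted;      /* a whitelist has been registered */
	uint32_t allowed_ops; /* bit n set => opcode n may be submitted */
};

/* Create the ring in disabled mode, without any restrictions yet. */
static void model_setup_disabled(struct model_ring *r)
{
	r->disabled = true;
	r->restricted = false;
	r->allowed_ops = 0;
}

/* Setup phase only: the model refuses to change the whitelist once the
 * ring has been enabled (assumed behaviour, -EBUSY is arbitrary). */
static int model_register_restrictions(struct model_ring *r, uint32_t ops)
{
	if (!r->disabled)
		return -EBUSY;
	r->restricted = true;
	r->allowed_ops = ops;
	return 0;
}

/* Last step done by the trusted code: one-way switch to enabled. */
static int model_enable(struct model_ring *r)
{
	if (!r->disabled)
		return -EBUSY;
	r->disabled = false;
	return 0;
}

/* Submissions fail while disabled; afterwards they are checked against
 * the whitelist only if one was registered. */
static int model_submit(const struct model_ring *r, enum model_op op)
{
	if (r->disabled)
		return -EBADFD;
	if (r->restricted && !(r->allowed_ops & (1u << op)))
		return -EACCES;
	return 0;
}
```

In this model the untrusted side only ever sees the ring after
model_enable(), so it can neither submit before the whitelist exists
nor widen it afterwards.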

I hope that's a little bit clearer. Sorry for the confusion.

Stefano



* Re: [RFC] io_uring: add restrictions to support untrusted applications and guests
  2020-06-16  9:12       ` Stefano Garzarella
  2020-06-16 11:32         ` Jann Horn
@ 2020-06-16 15:26         ` Jens Axboe
  2020-06-16 16:07           ` Stefano Garzarella
  1 sibling, 1 reply; 13+ messages in thread
From: Jens Axboe @ 2020-06-16 15:26 UTC (permalink / raw)
  To: Stefano Garzarella, Jann Horn
  Cc: Kees Cook, Christian Brauner, Sargun Dhillon, Aleksa Sarai,
	Stefan Hajnoczi, Jeff Moyer, io-uring, kernel list,
	Kernel Hardening

On 6/16/20 3:12 AM, Stefano Garzarella wrote:
> On Mon, Jun 15, 2020 at 11:00:25AM -0600, Jens Axboe wrote:
>> [...]
>>
>> I wonder how best to deal with this, in terms of ring visibility vs
>> registering restrictions. We could potentially start the ring in a
>> disabled mode, if asked to. It'd still be visible in terms of having
>> the fd installed, but it'd just error requests. That'd leave you with
>> time to do the various setup routines needed before then flagging it
>> as enabled. My only worry on that would be adding overhead for doing
>> that. It'd be cheap enough to check for IORING_SETUP_DISABLED in
>> ctx->flags in io_uring_enter(), and return -EBADFD or something if
>> that's the case. That doesn't cover the SQPOLL case though, but maybe we
>> just don't start the sq thread if IORING_SETUP_DISABLED is set.
> 
> It seems to me a very good approach and easy to implement. In this way
> we can reuse io_uring_register() without having to modify
> io_uring_setup() too much.

Right

>> We'd need a way to clear IORING_SETUP_DISABLED through
>> io_uring_register(). When clearing, that could then start the sq thread
>> as well, when SQPOLL is set.
> 
> Could we do it using io_uring_enter() since we have a flag field or
> do you think it's semantically incorrect?

Either way is probably fine. I gravitated towards io_uring_register()
since any io_uring_enter() should fail if the ring is disabled, but I
guess it's fine to allow the "enable" operation through io_uring_enter().
Keep in mind that io_uring_enter() is the hottest path, whereas
io_uring_register() is not nearly as hot, so we can allow ourselves a bit
more flexibility there.

In summary, I'd be fine with io_uring_enter() if the check is slim and lean,
but I'm still leaning towards io_uring_register(), as it seems like a more
natural fit.
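
Roughly what I have in mind, as a standalone userspace sketch rather
than actual kernel code. All sketch_* names, the flag bits and the
enable opcode number are made up for illustration, and the error codes
are placeholders. The point is that the enter path only grows a single
flag test, while the enable logic (including kicking off the sq thread
for SQPOLL) lives in the register path:

```c
#include <errno.h>
#include <stdint.h>

#ifndef EBADFD
#define EBADFD 77 /* Linux value, fallback for other platforms */
#endif

/* Hypothetical flag bits and opcode; not the real uapi values. */
#define SKETCH_SETUP_DISABLED  (1u << 0)
#define SKETCH_SETUP_SQPOLL    (1u << 1)
#define SKETCH_REGISTER_ENABLE 1u

struct sketch_ctx {
	uint32_t flags;
	int sq_thread_running; /* stands in for the sq poll thread */
};

/* Hot path: the only added cost is one flag test. */
static int sketch_enter(const struct sketch_ctx *ctx, unsigned int to_submit)
{
	if (ctx->flags & SKETCH_SETUP_DISABLED)
		return -EBADFD;
	return (int)to_submit; /* pretend everything was submitted */
}

/* Cold path: opcode dispatch, so extra work here is cheap to justify. */
static int sketch_register(struct sketch_ctx *ctx, unsigned int opcode)
{
	switch (opcode) {
	case SKETCH_REGISTER_ENABLE:
		if (!(ctx->flags & SKETCH_SETUP_DISABLED))
			return -EBADFD; /* already enabled */
		ctx->flags &= ~SKETCH_SETUP_DISABLED;
		/* The sq thread was not started at setup; start it now. */
		if (ctx->flags & SKETCH_SETUP_SQPOLL)
			ctx->sq_thread_running = 1;
		return 0;
	default:
		return -EINVAL;
	}
}
```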

-- 
Jens Axboe



* Re: [RFC] io_uring: add restrictions to support untrusted applications and guests
  2020-06-16 15:26         ` Jens Axboe
@ 2020-06-16 16:07           ` Stefano Garzarella
  0 siblings, 0 replies; 13+ messages in thread
From: Stefano Garzarella @ 2020-06-16 16:07 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jann Horn, Kees Cook, Christian Brauner, Sargun Dhillon,
	Aleksa Sarai, Stefan Hajnoczi, Jeff Moyer, io-uring, kernel list,
	Kernel Hardening

On Tue, Jun 16, 2020 at 09:26:31AM -0600, Jens Axboe wrote:
> On 6/16/20 3:12 AM, Stefano Garzarella wrote:
> > On Mon, Jun 15, 2020 at 11:00:25AM -0600, Jens Axboe wrote:
> >> [...]
> >>
> >> I wonder how best to deal with this, in terms of ring visibility vs
> >> registering restrictions. We could potentially start the ring in a
> >> disabled mode, if asked to. It'd still be visible in terms of having
> >> the fd installed, but it'd just error requests. That'd leave you with
> >> time to do the various setup routines needed before then flagging it
> >> as enabled. My only worry on that would be adding overhead for doing
> >> that. It'd be cheap enough to check for IORING_SETUP_DISABLED in
> >> ctx->flags in io_uring_enter(), and return -EBADFD or something if
> >> that's the case. That doesn't cover the SQPOLL case though, but maybe we
> >> just don't start the sq thread if IORING_SETUP_DISABLED is set.
> > 
> > It seems to me a very good approach and easy to implement. In this way
> > we can reuse io_uring_register() without having to modify
> > io_uring_setup() too much.
> 
> Right
> 
> >> We'd need a way to clear IORING_SETUP_DISABLED through
> >> io_uring_register(). When clearing, that could then start the sq thread
> >> as well, when SQPOLL is set.
> > 
> > Could we do it using io_uring_enter() since we have a flag field or
> > do you think it's semantically incorrect?
> 
> Either way is probably fine. I gravitated towards io_uring_register()
> since any io_uring_enter() should fail if the ring is disabled, but I
> guess it's fine to allow the "enable" operation through io_uring_enter().
> Keep in mind that io_uring_enter() is the hottest path, whereas
> io_uring_register() is not nearly as hot, so we can allow ourselves a bit
> more flexibility there.

Right, now I see and I totally agree!

> 
> In summary, I'd be fine with io_uring_enter() if the check is slim and lean,
> but I'm still leaning towards io_uring_register(), as it seems like a more
> natural fit.

Thanks for the clarification. I'll take that into account.

Stefano



end of thread, other threads:[~2020-06-16 16:08 UTC | newest]

Thread overview: 13+ messages
2020-06-09 14:24 [RFC] io_uring: add restrictions to support untrusted applications and guests Stefano Garzarella
2020-06-14 15:52 ` Jens Axboe
2020-06-15  7:23   ` Stefano Garzarella
2020-06-15  9:04 ` Jann Horn
2020-06-15 13:33   ` Stefano Garzarella
2020-06-15 17:00     ` Jens Axboe
2020-06-16  9:12       ` Stefano Garzarella
2020-06-16 11:32         ` Jann Horn
2020-06-16 14:07           ` Stefano Garzarella
2020-06-16 15:26         ` Jens Axboe
2020-06-16 16:07           ` Stefano Garzarella
2020-06-15 22:01   ` Christian Brauner
2020-06-15 23:26     ` Jann Horn
