From: Jens Axboe <axboe@kernel.dk>
To: Aleksa Sarai <cyphar@cyphar.com>
Cc: io-uring@vger.kernel.org, brauner@kernel.org,
Kees Cook <kees@kernel.org>, Jann Horn <jannh@google.com>
Subject: Re: [PATCH 1/6] io_uring: add support for BPF filtering for opcode restrictions
Date: Mon, 19 Jan 2026 13:17:37 -0700
Message-ID: <74c9786a-e9a1-4760-ade8-541a58c296cd@kernel.dk>
In-Reply-To: <2026-01-19-tinted-shifty-storage-skulls-4614XX@cyphar.com>
On 1/19/26 11:51 AM, Aleksa Sarai wrote:
> On 2026-01-18, Jens Axboe <axboe@kernel.dk> wrote:
>> This adds support for loading BPF programs with io_uring, which can
>> restrict the opcodes executed. Unlike IORING_REGISTER_RESTRICTIONS,
>> using BPF programs allow fine grained control over both the opcode in
>> question, as well as other data associated with the request. This
>> initial patch just supports whatever is in the io_kiocb for filtering,
>> but shortly opcode specific support will be added.
>>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>> ---
>> include/linux/io_uring_types.h | 9 +
>> include/uapi/linux/io_uring.h | 3 +
>> include/uapi/linux/io_uring/bpf_filter.h | 47 ++++
>> io_uring/Kconfig | 5 +
>> io_uring/Makefile | 1 +
>> io_uring/bpf_filter.c | 328 +++++++++++++++++++++++
>> io_uring/bpf_filter.h | 42 +++
>> io_uring/io_uring.c | 8 +
>> io_uring/register.c | 8 +
>> 9 files changed, 451 insertions(+)
>> create mode 100644 include/uapi/linux/io_uring/bpf_filter.h
>> create mode 100644 io_uring/bpf_filter.c
>> create mode 100644 io_uring/bpf_filter.h
>>
>> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
>> index 211686ad89fd..37f0a5f7b2f4 100644
>> --- a/include/linux/io_uring_types.h
>> +++ b/include/linux/io_uring_types.h
>> @@ -219,9 +219,18 @@ struct io_rings {
>> struct io_uring_cqe cqes[] ____cacheline_aligned_in_smp;
>> };
>>
>> +struct io_bpf_filter;
>> +struct io_bpf_filters {
>> + refcount_t refs; /* ref for ->bpf_filters */
>> + spinlock_t lock; /* protects ->bpf_filters modifications */
>> + struct io_bpf_filter __rcu **filters;
>> + struct rcu_head rcu_head;
>> +};
>> +
>> struct io_restriction {
>> DECLARE_BITMAP(register_op, IORING_REGISTER_LAST);
>> DECLARE_BITMAP(sqe_op, IORING_OP_LAST);
>> + struct io_bpf_filters *bpf_filters;
>> u8 sqe_flags_allowed;
>> u8 sqe_flags_required;
>> /* IORING_OP_* restrictions exist */
>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
>> index b5b23c0d5283..94669b77fee8 100644
>> --- a/include/uapi/linux/io_uring.h
>> +++ b/include/uapi/linux/io_uring.h
>> @@ -700,6 +700,9 @@ enum io_uring_register_op {
>> /* auxiliary zcrx configuration, see enum zcrx_ctrl_op */
>> IORING_REGISTER_ZCRX_CTRL = 36,
>>
>> + /* register bpf filtering programs */
>> + IORING_REGISTER_BPF_FILTER = 37,
>> +
>> /* this goes last */
>> IORING_REGISTER_LAST,
>>
>> diff --git a/include/uapi/linux/io_uring/bpf_filter.h b/include/uapi/linux/io_uring/bpf_filter.h
>> new file mode 100644
>> index 000000000000..14bd5b7468a7
>> --- /dev/null
>> +++ b/include/uapi/linux/io_uring/bpf_filter.h
>> @@ -0,0 +1,47 @@
>> +/* SPDX-License-Identifier: (GPL-2.0 WITH Linux-syscall-note) OR MIT */
>> +/*
>> + * Header file for the io_uring BPF filters.
>> + */
>> +#ifndef LINUX_IO_URING_BPF_FILTER_H
>> +#define LINUX_IO_URING_BPF_FILTER_H
>> +
>> +#include <linux/types.h>
>> +
>> +struct io_uring_bpf_ctx {
>> + __u8 opcode;
>> + __u8 sqe_flags;
>> + __u8 pad[6];
>> + __u64 user_data;
>> + __u64 resv[6];
>> +};
>
> I had more envisioned this as operating on SQEs directly, not on an
> intermediate representation.
>
> I get why this is much simpler to deal with (operating on the SQE
> directly is going to be racy as a malicious process could change the
> argument values after the filters have run -- this is kind of like the
> classic ptrace-seccomp hole), but since SQEs are a fixed size it seems
> like the most natural analogue to seccomp's model.
>
> While it is a tad ugly, AFAICS (looking at io_socket_prep) this would
> also let you filter socket(2) directly which would obviate the need for
> the second patch in this series.
It fundamentally cannot operate on the sqe directly, exactly for the
reasons you outline. We need to move the data to stable storage first,
which is why I did it that way. It has to be done past the prep stage
for that reason.
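To make that concrete, here's a rough userspace sketch of a filter written
against struct io_uring_bpf_ctx as posted, using the same classic
BPF/sock_fprog path that the registration code further down consumes. The
opcode choice, the flag mask, and whether the verifier ends up permitting
byte-sized loads are assumptions on my part, not something the series
guarantees:

#include <stddef.h>
#include <linux/filter.h>
#include <linux/io_uring.h>
#include <linux/io_uring/bpf_filter.h>

/* example policy: only IOSQE_FIXED_FILE may be set on this opcode */
#define ALLOWED_SQE_FLAGS	(1U << 0)

static struct sock_filter insns[] = {
	/* A = sqe_flags, byte at offset 1 of struct io_uring_bpf_ctx */
	BPF_STMT(BPF_LD | BPF_B | BPF_ABS,
		 offsetof(struct io_uring_bpf_ctx, sqe_flags)),
	/* clear the allowed bits; anything left over means deny */
	BPF_STMT(BPF_ALU | BPF_AND | BPF_K, ~ALLOWED_SQE_FLAGS),
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0, 0, 1),
	BPF_STMT(BPF_RET | BPF_K, 1),	/* allow */
	BPF_STMT(BPF_RET | BPF_K, 0),	/* deny */
};

static struct io_uring_bpf_filter filter = {
	.opcode		= IORING_OP_READ,
	.filter_len	= sizeof(insns) / sizeof(insns[0]),
	.filter_ptr	= (unsigned long long)(unsigned long)insns,
};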
> That being said, if you do really want to go with a custom
> representation, I would suggest including a size and making it variable
There's no other choice than going with a custom representation...
IMHO it also makes it much cleaner. You really don't want filters
dealing with the intricacies of SQE layout. For custom opcode filters,
they become much easier to reason about if they are working with a
sub-struct in the union.
> size so that filtering pointers (especially of extensible struct
> syscalls like openat2) is more trivial to accomplish in the future. [1]
> is the model we came up with for seccomp, which suffers from having to
> deal with arbitrary syscall bodies but if you have your own
> representation you can end up with something reasonably extensible
> without the baggage.
>
> [1]: https://www.youtube.com/watch?v=CHpLLR0CwSw
I'll take a look at that - fwiw, another reason why we need a custom
struct is precisely so we can filter on things that need to be brought
into the kernel first. Things like open_how for openat2 variants, for
example. Otherwise this would not be possible. We can do that in a clean
way, and never have to deal with any kind of pointers.
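Roughly the shape I have in mind, purely as an illustration - none of
these opcode-specific fields or names are in the posted patches:

struct io_uring_bpf_ctx {
	__u8	opcode;
	__u8	sqe_flags;
	__u8	pad[6];
	__u64	user_data;
	union {
		struct {		/* IORING_OP_SOCKET */
			__u32	domain;
			__u32	type;
			__u32	protocol;
		} socket;
		struct {		/* IORING_OP_OPENAT2, copied from open_how */
			__u64	flags;
			__u64	mode;
			__u64	resolve;
		} openat2;
		__u64	op_data[6];
	};
};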
>> +enum {
>> + /*
>> + * If set, any currently unset opcode will have a deny filter attached
>> + */
>> + IO_URING_BPF_FILTER_DENY_REST = 1,
>> +};
>> +
>> +struct io_uring_bpf_filter {
>> + __u32 opcode; /* io_uring opcode to filter */
>> + __u32 flags;
>> + __u32 filter_len; /* number of BPF instructions */
>> + __u32 resv;
>> + __u64 filter_ptr; /* pointer to BPF filter */
>> + __u64 resv2[5];
>> +};
>
> Since io_uring_bpf_ctx contains the opcode, it seems a little strange
> that you would require userspace to set up a separate filter for every
> opcode they wish to filter. seccomp lets you just have one filter for
> all syscall numbers -- I can see the argument that this lets you build
> optimised filters for each opcode but having to set up filters for 65
> opcodes (at time of writing) seems less than optimal...
I agree it's a bit cumbersome. But if you just want to filter a specific
opcode, that's easily doable with the current non-BPF restrictions. You
only really need the BPF filter if you want to allow an opcode, but
restrict certain parts of it.
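For that simple case, a minimal sketch with the existing mechanism and the
liburing helpers (error handling omitted) would be:

#include <string.h>
#include <liburing.h>

/* ring must have been created with IORING_SETUP_R_DISABLED */
static int restrict_to_read_write(struct io_uring *ring)
{
	struct io_uring_restriction res[2];
	int ret;

	memset(res, 0, sizeof(res));
	res[0].opcode = IORING_RESTRICTION_SQE_OP;
	res[0].sqe_op = IORING_OP_READ;
	res[1].opcode = IORING_RESTRICTION_SQE_OP;
	res[1].sqe_op = IORING_OP_WRITE;

	ret = io_uring_register_restrictions(ring, res, 2);
	if (ret)
		return ret;

	return io_uring_enable_rings(ring);
}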
> The optimisation seccomp uses for filters that simply blanket allow a
> syscall is to pre-compute a cached bitmap of those syscalls to avoid
> running the filter for those syscalls (see seccomp_cache_prepare). That
> seems like a more practical solution which provides a similar (if not
> better) optimisation for the allow-filter case.
But I don't think that's useful in this case, as there's already an
existing mechanism to do that in io_uring. And with this series,
particularly the last patch, those can also get set per task and
inherited, so that any new ring will abide by them.
>> + /*
>> + * Iterate registered filters. The opcode is allowed IFF all filters
>> + * return 1. If any filter returns denied, opcode will be denied.
>> + */
>> + do {
>> + if (filter == &dummy_filter)
>> + ret = 0;
>> + else
>> + ret = bpf_prog_run(filter->prog, &bpf_ctx);
>> + if (!ret)
>> + break;
>> + filter = filter->next;
>> + } while (filter);
>
> I understand why you didn't want to replicate the messiness of seccomp's
> arbitrary errno feature (it's almost certainly for the best), but maybe
> it would be prudent to make the expected return values some special
> (large) value so that you have some wiggle room for future expansion?
>
> For instance, if you ever wanted to add support for logging (a-la
> SECCOMP_RET_LOG) then it would need to be lower priority than blocking
> the operation and you would need to have something like the logic in
> seccomp_run_filters to return the highest priority filter return value.
>
> (You could validate that the filter only returns IO_URING_BPF_RET_BLOCK
> or 0 in the verifier.)
Can't you just do that with any value? 0 is currently DENY, 1 is ALLOW,
those are the only documented filter return values. You could just have
2 be DENY_LOG or whatever, and 3 be ALLOW_LOG and so forth. I don't
really see why you'd need to limit yourself to 0..1 as it currently
stands.
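IOW, the space is already there - only the first two values below exist
in this series, the logging variants are purely hypothetical:

enum {
	IO_URING_BPF_RET_DENY		= 0,
	IO_URING_BPF_RET_ALLOW		= 1,
	/* hypothetical future extensions, not in the posted patches */
	IO_URING_BPF_RET_DENY_LOG	= 2,
	IO_URING_BPF_RET_ALLOW_LOG	= 3,
};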
>> +int io_register_bpf_filter(struct io_restriction *res,
>> + struct io_uring_bpf __user *arg)
>> +{
>> + struct io_bpf_filter *filter, *old_filter;
>> + struct io_bpf_filters *filters;
>> + struct io_uring_bpf reg;
>> + struct bpf_prog *prog;
>> + struct sock_fprog fprog;
>> + int ret;
>> +
>> + if (copy_from_user(&reg, arg, sizeof(reg)))
>> + return -EFAULT;
>> + if (reg.cmd_type != IO_URING_BPF_CMD_FILTER)
>> + return -EINVAL;
>> + if (reg.cmd_flags || reg.resv)
>> + return -EINVAL;
>> +
>> + if (reg.filter.opcode >= IORING_OP_LAST)
>> + return -EINVAL;
>> + if (reg.filter.flags & ~IO_URING_BPF_FILTER_FLAGS)
>> + return -EINVAL;
>> + if (reg.filter.resv)
>> + return -EINVAL;
>> + if (!mem_is_zero(reg.filter.resv2, sizeof(reg.filter.resv2)))
>> + return -EINVAL;
>> + if (!reg.filter.filter_len || reg.filter.filter_len > BPF_MAXINSNS)
>> + return -EINVAL;
>
> Similar question to my other mail about copy_struct_from_user().
Ah, I missed that in the other email. It'd help a lot if you trim your
replies and only quote the relevant parts, it's a lot more efficient and
makes it much harder to miss content. Sure, we can use
copy_struct_from_user().
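Something like the below, assuming we also pass down a user-supplied size
(usize here is made up, it's not in the current interface) so the struct
can grow later:

ret = copy_struct_from_user(&reg, sizeof(reg), arg, usize);
if (ret)
	return ret;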
--
Jens Axboe