From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <269f6e6a-362f-4ccf-926f-2e2e0c1ea014@kernel.dk>
Date: Tue, 27 Jan 2026 09:41:00 -0700
X-Mailing-List: io-uring@vger.kernel.org
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [PATCH 1/7] io_uring: add support for BPF filtering for opcode restrictions
To: Christian Brauner
Cc: io-uring@vger.kernel.org, jannh@google.com, kees@kernel.org, linux-kernel@vger.kernel.org
References: <20260119235456.1722452-1-axboe@kernel.dk>
 <20260119235456.1722452-2-axboe@kernel.dk>
 <20260127-lobgesang-lautsprecher-3805340711c8@brauner>
From: Jens Axboe
In-Reply-To: <20260127-lobgesang-lautsprecher-3805340711c8@brauner>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit

On 1/27/26 3:06 AM, Christian Brauner wrote:
>> +int __io_uring_run_bpf_filters(struct io_restriction *res, struct io_kiocb *req)
>> +{
>> +	struct io_bpf_filter *filter;
>> +	struct io_uring_bpf_ctx bpf_ctx;
>> +	int ret;
>> +
>> +	/* Fast check for existence of filters outside of RCU */
>> +	if (!rcu_access_pointer(res->bpf_filters->filters[req->opcode]))
>> +		return 0;
>> +
>> +	/*
>> +	 * req->opcode has already been validated to be within the range
>> +	 * of what we expect, io_init_req() does this.
>> +	 */
>> +	rcu_read_lock();
>> +	filter = rcu_dereference(res->bpf_filters->filters[req->opcode]);
>> +	if (!filter) {
>> +		ret = 1;
>> +		goto out;
>> +	} else if (filter == &dummy_filter) {
>> +		ret = 0;
>> +		goto out;
>> +	}
>> +
>> +	io_uring_populate_bpf_ctx(&bpf_ctx, req);
>> +
>> +	/*
>> +	 * Iterate registered filters. The opcode is allowed IFF all filters
>> +	 * return 1. If any filter returns denied, opcode will be denied.
>> +	 */
>> +	do {
>> +		if (filter == &dummy_filter)
>> +			ret = 0;
>> +		else
>> +			ret = bpf_prog_run(filter->prog, &bpf_ctx);
>> +		if (!ret)
>> +			break;
>> +		filter = filter->next;
>> +	} while (filter);
>> +out:
>> +	rcu_read_unlock();
>> +	return ret ? 0 : -EACCES;
>> +}
> 
> Maybe we can write this a little nicer?:
> 
> int __io_uring_run_bpf_filters(struct io_restriction *res, struct io_kiocb *req)
> {
> 	struct io_bpf_filter *filter;
> 	struct io_uring_bpf_ctx bpf_ctx;
> 
> 	/* Fast check for existence of filters outside of RCU */
> 	if (!rcu_access_pointer(res->bpf_filters->filters[req->opcode]))
> 		return 0;
> 
> 	/*
> 	 * req->opcode has already been validated to be within the range
> 	 * of what we expect, io_init_req() does this.
> 	 */
> 	guard(rcu)();
> 	filter = rcu_dereference(res->bpf_filters->filters[req->opcode]);
> 	if (!filter)
> 		return 0;
> 
> 	if (filter == &dummy_filter)
> 		return -EACCES;
> 
> 	io_uring_populate_bpf_ctx(&bpf_ctx, req);
> 
> 	/*
> 	 * Iterate registered filters. The opcode is allowed IFF all filters
> 	 * return 1. If any filter returns denied, opcode will be denied.
> 	 */
> 	for (; filter; filter = filter->next) {
> 		int ret;
> 
> 		if (filter == &dummy_filter)
> 			return -EACCES;
> 
> 		ret = bpf_prog_run(filter->prog, &bpf_ctx);
> 		if (!ret)
> 			return -EACCES;
> 	}
> 
> 	return 0;
> }

Did a variant based on this, I agree it looks nicer with guard for this
one.
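The semantics both versions implement (an opcode is allowed only if every
filter in the chain allows it) can be sketched in plain userspace C. All
types and names below are illustrative stand-ins, not the kernel API;
`filter->run` plays the role of `bpf_prog_run()` on a registered program:

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types in the patch. */
struct filter {
	int (*run)(int opcode);		/* 1 = allow, 0 = deny */
	struct filter *next;
};

/*
 * The opcode is allowed iff every filter in the chain returns nonzero.
 * Returns 0 on allow, -1 (standing in for -EACCES) on the first deny.
 */
static int run_filters(const struct filter *f, int opcode)
{
	for (; f; f = f->next) {
		if (!f->run(opcode))
			return -1;
	}
	return 0;
}

/* Two toy filters; the opcode values here are arbitrary. */
static int allow_all(int opcode) { (void)opcode; return 1; }
static int deny_odd(int opcode) { return (opcode & 1) == 0; }
```

An empty chain allows everything, which matches the fast-path
`rcu_access_pointer()` check returning 0 when no filters are registered.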
>> +static void io_free_bpf_filters(struct rcu_head *head)
>> +{
>> +	struct io_bpf_filter __rcu **filter;
>> +	struct io_bpf_filters *filters;
>> +	int i;
>> +
>> +	filters = container_of(head, struct io_bpf_filters, rcu_head);
>> +	spin_lock(&filters->lock);
>> +	filter = filters->filters;
>> +	if (!filter) {
>> +		spin_unlock(&filters->lock);
>> +		return;
>> +	}
>> +	spin_unlock(&filters->lock);
> 
> This is minor but I prefer:
> 
> 	scoped_guard(spinlock)(&filters->lock) {
> 		filters = container_of(head, struct io_bpf_filters, rcu_head);
> 		filter = filters->filters;
> 		if (!filter)
> 			return;
> 	}

Reason I tend to never do that is that I always have to grep for the
syntax... And case in point, you need to do:

	scoped_guard(spinlock, &filters->lock) {

so I guess I'm not the only one :-). But it does read better that way,
made this change too.

>> +static struct io_bpf_filters *io_new_bpf_filters(void)
>> +{
>> +	struct io_bpf_filters *filters;
>> +
>> +	filters = kzalloc(sizeof(*filters), GFP_KERNEL_ACCOUNT);
>> +	if (!filters)
>> +		return ERR_PTR(-ENOMEM);
>> +
>> +	filters->filters = kcalloc(IORING_OP_LAST,
>> +				   sizeof(struct io_bpf_filter *),
>> +				   GFP_KERNEL_ACCOUNT);
>> +	if (!filters->filters) {
>> +		kfree(filters);
>> +		return ERR_PTR(-ENOMEM);
>> +	}
>> +
>> +	refcount_set(&filters->refs, 1);
>> +	spin_lock_init(&filters->lock);
>> +	return filters;
>> +}
> 
> static struct io_bpf_filters *io_new_bpf_filters(void)
> {
> 	struct io_bpf_filters *filters __free(kfree) = NULL;
> 
> 	filters = kzalloc(sizeof(*filters), GFP_KERNEL_ACCOUNT);
> 	if (!filters)
> 		return ERR_PTR(-ENOMEM);
> 
> 	filters->filters = kcalloc(IORING_OP_LAST,
> 				   sizeof(struct io_bpf_filter *),
> 				   GFP_KERNEL_ACCOUNT);
> 	if (!filters->filters)
> 		return ERR_PTR(-ENOMEM);
> 
> 	refcount_set(&filters->refs, 1);
> 	spin_lock_init(&filters->lock);
> 	return no_free_ptr(filters);
> }

Adopted as well, thanks.

>> +/*
>> + * Validate classic BPF filter instructions.
>> + * Only allow a safe subset of
>> + * operations - no packet data access, just context field loads and basic
>> + * ALU/jump operations.
>> + */
>> +static int io_uring_check_cbpf_filter(struct sock_filter *filter,
>> +				      unsigned int flen)
>> +{
>> +	int pc;
> 
> Seems fine to me but I can't meaningfully review this.

Yeah seriously... It's just the seccomp filter modified for this use
case, so supposedly should be vetted already.

>> +int io_register_bpf_filter(struct io_restriction *res,
>> +			   struct io_uring_bpf __user *arg)
>> +{
>> +	struct io_bpf_filter *filter, *old_filter;
>> +	struct io_bpf_filters *filters;
>> +	struct io_uring_bpf reg;
>> +	struct bpf_prog *prog;
>> +	struct sock_fprog fprog;
>> +	int ret;
>> +
>> +	if (copy_from_user(&reg, arg, sizeof(reg)))
>> +		return -EFAULT;
>> +	if (reg.cmd_type != IO_URING_BPF_CMD_FILTER)
>> +		return -EINVAL;
>> +	if (reg.cmd_flags || reg.resv)
>> +		return -EINVAL;
>> +
>> +	if (reg.filter.opcode >= IORING_OP_LAST)
>> +		return -EINVAL;
> 
> So you only support per-op-code filtering with cbpf. I assume that you
> would argue that people can use the existing io_uring restrictions. But
> that's not inherited, right? So then this forces users to have a bpf
> program for all opcodes that io_uring on their system supports.

The existing restrictions ARE inherited now, and can be set on a
per-task basis as well. That's in the last patch. And since the classic
"deny this opcode" filtering is way cheaper than running a BPF prog, I
do think that's the right approach.

> I think that this is a bit unfortunate and wasteful for both userspace
> and io_uring. Can't we do a combined thing where we also allow filters
> to attach to all op-codes. Then userspace could start with an allow-list
> or deny-list filter and then attach further per-op-code bpf programs to
> the op-codes they want to manage specifically. Then you also get
> inheritance of the restrictions per-task.
> 
> That would be nicer imho.
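The validation quoted above follows the seccomp approach: walk each
classic-BPF instruction and allowlist the safe opcode classes. A
hypothetical userspace sketch of that style of check, with minimal
stand-ins for the `<linux/filter.h>` definitions and an invented context
size (the real `io_uring_check_cbpf_filter()` will differ in detail):

```c
#include <stdint.h>

/* Minimal stand-in for struct sock_filter. */
struct sf { uint16_t code; uint8_t jt, jf; uint32_t k; };

/* Opcode class/mode encoding as in classic BPF. */
#define BPF_CLASS(code)	((code) & 0x07)
#define BPF_LD		0x00
#define BPF_ALU		0x04
#define BPF_JMP		0x05
#define BPF_RET		0x06
#define BPF_MODE(code)	((code) & 0xe0)
#define BPF_ABS		0x20

#define CTX_SIZE	64	/* hypothetical context struct size */

/*
 * Allowlist walk: only absolute, in-bounds context field loads plus
 * ALU/jump/return classes pass. Returns 0 if the whole program is
 * acceptable, -1 (standing in for -EINVAL) otherwise.
 */
static int check_cbpf(const struct sf *filter, unsigned int flen)
{
	for (unsigned int pc = 0; pc < flen; pc++) {
		uint16_t code = filter[pc].code;

		switch (BPF_CLASS(code)) {
		case BPF_LD:
			/* No packet-data access: reject anything that is
			 * not an absolute load inside the context. */
			if (BPF_MODE(code) != BPF_ABS ||
			    filter[pc].k >= CTX_SIZE)
				return -1;
			break;
		case BPF_ALU:
		case BPF_JMP:
		case BPF_RET:
			break;	/* basic ALU/jump/return ops are fine */
		default:
			return -1;	/* everything else is rejected */
		}
	}
	return 0;
}
```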
I'm considering this one moot given the above on using
IORING_REGISTER_RESTRICTIONS with fd == -1 to set per-task restrictions
using the classic non-bpf filtering, which are inherited as well. You
only need the cBPF filters if you want to deny an opcode based on aux
data for that opcode.

-- 
Jens Axboe
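The guard()/scoped_guard()/__free() helpers discussed in this thread are
all built on the compiler's cleanup attribute (see linux/cleanup.h). A
userspace sketch of the `__free(kfree)` + `no_free_ptr()` constructor
pattern that io_new_bpf_filters() was rewritten to, using the same
GCC/Clang `__attribute__((cleanup))` mechanism; every name here is
illustrative, not the kernel macro:

```c
#include <stdlib.h>

/* Cleanup handler: runs when the annotated variable leaves scope,
 * freeing it on every early return unless ownership was handed off. */
static void free_ptr(void *p)
{
	free(*(void **)p);
}
#define __free_on_scope_exit __attribute__((cleanup(free_ptr)))

struct filters {
	void **table;
	size_t nr;
};

static struct filters *filters_new(size_t nr)
{
	struct filters *f __free_on_scope_exit = calloc(1, sizeof(*f));
	struct filters *ret;

	if (!f)
		return NULL;

	f->table = calloc(nr, sizeof(void *));
	if (!f->table)
		return NULL;	/* f is freed by the cleanup handler */

	f->nr = nr;

	/* Equivalent of no_free_ptr(): transfer ownership out so the
	 * cleanup handler sees NULL and the allocation survives. */
	ret = f;
	f = NULL;
	return ret;
}

static void filters_free(struct filters *f)
{
	if (f) {
		free(f->table);
		free(f);
	}
}
```

The win is the same as in the kernel version: the error paths collapse to
plain `return NULL` with no manual unwind, and only the success path has
to do anything special.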