From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail-wm1-f45.google.com (mail-wm1-f45.google.com [209.85.128.45]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id AB2ED219E0; Fri, 13 Jun 2025 00:25:20 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.128.45 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1749774322; cv=none; b=R5FNQj5748Ppvu6kbgFcUaukVfJ57OL8g1qmwHYoyCAFIUsKrOuIDg7kvRZF2uFnpg3t4MUZYd5e5sjCWS67x8EsAdysdHAFuaoO2KeYUTR7iLsjcuA5T2FoiMzz/UWI+GmWM8mVlUt9gewfVUJxBMvGKQkRp7gNP83UFmYtw8o= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1749774322; c=relaxed/simple; bh=KOlum+J8kvBWXOiz2MOiHDhRZrX3PwYKoOaHJ27JLBA=; h=MIME-Version:References:In-Reply-To:From:Date:Message-ID:Subject: To:Cc:Content-Type; b=fD8TAm0doDarZYNjVi/nTc/LCeJvYYc3Y2sR1lGf/lmCl5mVxsfOUPEu3HCqeRmqg+S1A/+p/NeR/AEmh6EYMweVr3x5Ge0csplEX6I94W7DMha06gnFivxyhopT9t2rYUmXdUmV8gKWx06O/ozd23w0XjuDRqd/ABLOad9HYO0= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=j8J114Y3; arc=none smtp.client-ip=209.85.128.45 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="j8J114Y3" Received: by mail-wm1-f45.google.com with SMTP id 5b1f17b1804b1-442fda876a6so13331615e9.0; Thu, 12 Jun 2025 17:25:20 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1749774319; x=1750379119; darn=vger.kernel.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=BtY2lCk3uzspSzuYUeIi/Qq4T39UA0U6nXtfkQzRaPA=; b=j8J114Y3pPpctKy0PSRMQbEUjcliXudtxBq7v/v6YvbDe10vDFkM0R2Kx/IMASoGUd 3zHj9aNz2Qwvl7T3rxSZL7lCMTCPkwWRQmtH9tF9fwZycSHta0e9Ot2mtcPkTZoLxulS baibHrGrtY+8RTOU1ez3opfN98NdBzweLhreBArQGMJSHZbup2tiVpZOk1kTaNv2KnWJ 4AQZAk9suMW1FcAc8pgn49ifexvkoj46X6TUENdgCC6pLkN98RDgE9LIlQV+vzylWdJO GUDrtwtR0XjO6UBVgcanjAFpKtUyjrd6haVmysJ3RxY4NVGA+a1ZymexW2KK1pBEApCH BSUg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1749774319; x=1750379119; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=BtY2lCk3uzspSzuYUeIi/Qq4T39UA0U6nXtfkQzRaPA=; b=njrwwbUHDX+ImClf95ubFSspgSJ/uxjPjFcBMRzWPhJqrnG47CeiZCZQvRBJVGtOPG /9/4exXY8RkzGoacq4WgBcm/tuIto7eP4zW0y/yjTd13GauHGjr5TwgKkHiIo+gf6v1q 262LZO/CSCb5bCkL9Eov6QsjRSUeAygQA+iyYuJIx94PpYedvyfxAjTe2NQWoGr2aF9G +SMV1V00B48ty8NYTbl8hzCI4DQyTP6gF4CMm7MgrC8GC7k3Ysbjuj/T8tzuLB9F8rIx iCeaWWcC/L+HwMCelwzX7VUagF+m5t/bbkvnFVdhEWzuDD2HPnV+B6LRZnrKbOHV1Vo5 +fHg== X-Forwarded-Encrypted: i=1; AJvYcCV+BDUxJnU/xHjyXF+uZN0w+PBLCRy+gez7XoWeC2xHAOvkwIfRGHrnyBPFTE4Bi5G7pfs=@vger.kernel.org, AJvYcCWQ4KQGWFR9BwgUCDl7GMTSD8BdWaQxlTgGW9Gb0slEEV0Q9ilvIrTF+OsVCfDT7Zp8X6qXGc/tfqheWQyJ@vger.kernel.org X-Gm-Message-State: AOJu0YzeYkQ9PG6zh+DiLw+DYIMf/Bl1ub+EDWaD3g8dhWrdnjfPpaYR vvpPET0F3FRM6/sZnVXw7R/B4jDo43Fj1OJy21eEpbkuKMeDoHrcD61bYvhBEw+FO6V4kvizxhy 1QSU7B3FwWck3kXZXWJsefAhUD/L34n8417PO X-Gm-Gg: ASbGncsS4nUNtSBcOJAQUfAGbxiZQgRt787Jb9sI/Hs7bTAhsYUxS5aL3XvwNyHfJq6 +zRkohnDqq5xI9fJ0H4TLCHYKBN7vOPQ27CabRTjQ9MlnXsQtBwPQb8Gd91OCnrJDWSuSa9/O4H 12tYbMkd+OaRfXxipIpP0cEvms2ei1xx5eMlC/qlehkU7pvWeOrvDKT1QEF30RjhwxBlAM7ehh X-Google-Smtp-Source: AGHT+IF76pGBYxGZYWw+CJokdxXh6p6fIroXhuyYAPs9g7npmaYdejWwOR9kgrMPXPQHyCCG3MzqkJp4opYixEpEZ9E= X-Received: by 2002:a5d:588a:0:b0:3a4:f52d:8b11 with SMTP id ffacd0b85a97d-3a5686f59e0mr1042820f8f.20.1749774318887; Thu, 12 Jun 2025 17:25:18 -0700 (PDT) Precedence: bulk X-Mailing-List: io-uring@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 References: <8aa7b962-40a6-4bbc-8646-86dd7ce3380e@gmail.com> In-Reply-To: <8aa7b962-40a6-4bbc-8646-86dd7ce3380e@gmail.com> From: Alexei Starovoitov Date: Thu, 12 Jun 2025 17:25:07 -0700 X-Gm-Features: AX0GCFsinQTNjdnmxTBWf-QlYZGnOFA6IRDU-2AFsvp0E2RrYTzY6ZCebXQDLVg Message-ID: Subject: Re: [RFC v2 5/5] io_uring/bpf: add basic kfunc helpers To: Pavel Begunkov , Andrii Nakryiko Cc: io-uring@vger.kernel.org, Martin KaFai Lau , bpf , LKML Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable On Thu, Jun 12, 2025 at 6:25=E2=80=AFAM Pavel Begunkov wrote: > > On 6/12/25 03:47, Alexei Starovoitov wrote: > > On Fri, Jun 6, 2025 at 6:58=E2=80=AFAM Pavel Begunkov wrote: > ...>> +__bpf_kfunc > >> +struct io_uring_cqe *bpf_io_uring_extract_next_cqe(struct io_ring_ctx= *ctx) > >> +{ > >> + struct io_rings *rings =3D ctx->rings; > >> + unsigned int mask =3D ctx->cq_entries - 1; > >> + unsigned head =3D rings->cq.head; > >> + struct io_uring_cqe *cqe; > >> + > >> + /* TODO CQE32 */ > >> + if (head =3D=3D rings->cq.tail) > >> + return NULL; > >> + > >> + cqe =3D &rings->cqes[head & mask]; > >> + rings->cq.head++; > >> + return cqe; > >> +} > >> + > >> +__bpf_kfunc_end_defs(); > >> + > >> +BTF_KFUNCS_START(io_uring_kfunc_set) > >> +BTF_ID_FLAGS(func, bpf_io_uring_submit_sqes, KF_SLEEPABLE); > >> +BTF_ID_FLAGS(func, bpf_io_uring_post_cqe, KF_SLEEPABLE); > >> +BTF_ID_FLAGS(func, bpf_io_uring_queue_sqe, KF_SLEEPABLE); > >> +BTF_ID_FLAGS(func, bpf_io_uring_get_cqe, 0); > >> +BTF_ID_FLAGS(func, bpf_io_uring_extract_next_cqe, KF_RET_NULL); > >> +BTF_KFUNCS_END(io_uring_kfunc_set) > > > > This is not safe in general. > > The verifier doesn't enforce argument safety here. > > As a minimum you need to add KF_TRUSTED_ARGS flag to all kfunc. > > And once you do that you'll see that the verifier > > doesn't recognize the cqe returned from bpf_io_uring_get_cqe*() > > as trusted. > > Thanks, will add it. If I read it right, without the flag the > program can, for example, create a struct io_ring_ctx on stack, > fill it with nonsense and pass to kfuncs. Is that right? No. The verifier will only allow a pointer to struct io_ring_ctx to be passed, but it may not be fully trusted. The verifier has 3 types of pointers to kernel structures: 1. ptr_to_btf_id 2. ptr_to_btf_id | trusted 3. ptr_to_btf_id | untrusted 1st was added long ago for tracing and gradually got adopted for non-tracing needs, but it has a foot gun, since all pointer walks keep ptr_to_btf_id type. It's fine in some cases to follow pointers, but not in all. Hence 2nd variant was added and there foo->bar dereference needs to be explicitly allowed instead of allowed by default like for 1st kind. All loads through 1 and 3 are implemented as probe_read_kernel. while loads from 2 are direct loads. So kfuncs without KF_TRUSTED_ARGS with struct io_ring_ctx *ctx argument are likely fine and safe, since it's impossible to get this io_ring_ctx pointer by dereferencing some other pointer. But better to tighten safety from the start. We recommend KF_TRUSTED_ARGS for all kfuncs and eventually it will be the default. > > Looking at your example: > > https://github.com/axboe/liburing/commit/706237127f03e15b4cc9c7c31c16d3= 4dbff37cdc > > it doesn't care about contents of cqe and doesn't pass it further. > > So sort-of ok-ish right now, > > but if you need to pass cqe to another kfunc > > you would need to add an open coded iterator for cqe-s > > with appropriate KF_ITER* flags > > or maybe add acquire/release semantics for cqe. > > Like, get_cqe will be KF_ACQUIRE, and you'd need > > matching KF_RELEASE kfunc, > > so that 'cqe' is not lost. > > Then 'cqe' will be trusted and you can pass it as actual 'cqe' > > into another kfunc. > > Without KF_ACQUIRE the verifier sees that get_cqe*() kfuncs > > return 'struct io_uring_cqe *' and it's ok for tracing > > or passing into kfuncs like bpf_io_uring_queue_sqe() > > that don't care about a particular type, > > but not ok for full tracking of objects. > > I don't need type safety for SQEs / CQEs, they're supposed to be simple > memory blobs containing userspace data only. SQ / CQ are shared with > userspace, and the kfuncs can leak the content of passed CQE / SQE to > userspace. But I'd like to find a way to reject programs stashing > kernel pointers / data into them. That's impossible. If you're worried about bpf prog exposing kernel addresses to user space then abort the whole thing. CAP_PERFMON is required for the majority of bpf progs. > > BPF_PROG(name, struct io_ring_ctx *io_ring) > { > struct io_uring_sqe *cqe =3D ...; > cqe->user_data =3D io_ring; > cqe->res =3D io_ring->private_field; > } > > And I mentioned in the message, I rather want to get rid of half of the > kfuncs, and give BPF direct access to the SQ/CQ instead. Schematically > it should look like this: > > BPF_PROG(name, struct io_ring_ctx *ring) > { > struct io_uring_sqe *sqes =3D get_SQ(ring); > > sqes[ring->sq_tail]->opcode =3D OP_NOP; > bpf_kfunc_submit_sqes(ring, 1); > > struct io_uring_cqe *cqes =3D get_CQ(ring); > print_cqe(&cqes[ring->cq_head]); > } > > I hacked up RET_PTR_TO_MEM for kfuncs, the diff is below, but it'd be > great to get rid of the constness of the size argument. I need to > digest arenas first as conceptually they look very close. arena is a special memory region where every byte is writeable by user space. > > > For next revision please post all selftest, examples, > > and bpf progs on the list, > > so people don't need to search github. > > Did the link in the cover letter not work for you? I'm confused > since it's all in a branch in my tree, but you linked to the same > patches but in Jens' tree, and I have zero clue what they're > doing there or how you found them. External links can disappear. It's not good for reviewers and for keeping the history of conversation. > > diff --git a/io_uring/bpf.c b/io_uring/bpf.c > index 9494e4289605..400a06a74b5d 100644 > --- a/io_uring/bpf.c > +++ b/io_uring/bpf.c > @@ -2,6 +2,7 @@ > #include > > #include "io_uring.h" > +#include "memmap.h" > #include "bpf.h" > #include "register.h" > > @@ -72,6 +73,14 @@ struct io_uring_cqe *bpf_io_uring_extract_next_cqe(str= uct io_ring_ctx *ctx) > return cqe; > } > > +__bpf_kfunc > +void *bpf_io_uring_get_region(struct io_ring_ctx *ctx, u64 size__retsz) > +{ > + if (size__retsz > ((u64)ctx->ring_region.nr_pages << PAGE_SHIFT)) > + return NULL; > + return io_region_get_ptr(&ctx->ring_region); > +} and bpf prog should be able to read/write anything in [ctx->ring_region->ptr, ..ptr + size] region ? Populating (creating) dynptr is probably better. See bpf_dynptr_from*() but what is the lifetime of that memory ?