* io_uring: BPF controlled I/O
From: Pavel Begunkov @ 2021-06-05  9:08 UTC
To: io-uring; +Cc: Jens Axboe, [email protected], LKML, bpf, lsf-pc

One of the core ideas behind io_uring is passing requests via memory
shared between userspace and the kernel, a.k.a. queues or rings. That
serves the purpose of reducing the number of context switches, or
bypassing them, but userspace remains responsible for controlling the
flow: reaping and processing completions (a.k.a. Completion Queue
Entries, CQEs) and submitting new requests, which adds context switches
even when there is not much work to do. A simple illustration is
read(open()), where io_uring is unable to propagate the returned fd to
the read, with more cases piling up.

The big picture idea stays the same since last year: hand some of this
control over to BPF, allowing it to check the results of completed
requests, manipulate memory if needed and submit new requests. Apart
from being just the glue between two requests, it may offer more
flexibility, e.g. maintaining a queue depth (QD), doing
reduce/broadcast and so on.

The prototype [1,2] is in good shape but some work still needs to be
done. The main concern, however, is getting an understanding of what
features and functionality have to be added to make it flexible enough.
Various toy examples can be found at [3] ([1] includes an overview of
cases).

Discussion points:
- Use cases, feature requests, benchmarking
- Userspace programming model, code reuse (e.g. liburing)
- BPF-BPF and userspace-BPF synchronisation. There is a CQE based
  notification approach and plans (see design notes), but we need to
  discuss what else might be needed.
- Do we need more context passed apart from user_data? E.g. specifying
  a BPF map/array/etc. fd in io_uring requests?
- Userspace atomics and efficiency of userspace reads/writes. If they
  prove not performant enough, there are potential ways to take that
  on, e.g. inlining, having them in the BPF ISA, and pre-verifying
  userspace pointers.

[1] https://lore.kernel.org/io-uring/[email protected]/T/#m31d0a2ac6e2213f912a200f5e8d88bd74f81406b
[2] https://github.com/isilence/linux/tree/ebpf_v2
[3] https://github.com/isilence/liburing/tree/ebpf_v2/examples/bpf

-----------------------------------------------------------------------
Design notes:

Instead of basing it on hooks, this adds support for a new type of
io_uring request, which gives better control and lets us reuse the
internal infrastructure. These requests run a new type of io_uring BPF
program wired up with a bunch of new helpers for submitting requests
and dealing with CQEs. They are allowed to read/write userspace memory
by virtue of the recently added sleepable BPF feature, and are also
provided with a token (the generic io_uring token, a.k.a. user_data,
specified at submission and returned in a CQE), which may be used to
pass a userspace pointer serving as a context.

Besides running BPF programs, these requests can also ask to wait.
Currently that supports waiting on the CQ for a number of completions,
but other conditions might be added and/or needed, e.g. futexes and/or
requeueing the current BPF request onto an io_uring request/link being
submitted. That hides the overhead of creating BPF requests by keeping
them alive and invoking them multiple times.

Another big chunk solved was figuring out a good way of feeding CQEs
(potentially many) to a BPF program. The current approach is to enable
multiple completion queues (CQs) and to specify, per request, which CQ
its CQE should be steered to, so all the synchronisation is under
userspace control. For instance, there may be a separate CQ per
in-flight BPF request; the BPF programs can then work with their own
queues and post a CQE to the main CQ to notify userspace. It also opens
up notification-like synchronisation by posting CQEs to neighbours'
CQs.

--
Pavel Begunkov
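To make the above concrete, here is a rough sketch of how the
read(open()) chaining could look as one of these io_uring BPF programs.
The program context layout, the SEC() name and the two iouring_*
helpers are assumptions standing in for whatever the prototype [2]
actually exposes; only bpf_copy_from_user() is an existing helper
(available to sleepable programs). The prototype also lets a request
choose which CQ its CQE is steered to, which is how the open's
completion would land on this program's private CQ.

#include <linux/types.h>
#include <linux/io_uring.h>
#include <bpf/bpf_helpers.h>

/* Userspace context reachable via user_data (layout assumed for this
 * sketch, not mandated by the prototype). */
struct req_ctx {
	__u64 buf;	/* userspace buffer to read into */
	__u32 buf_len;
};

/* Assumed program context and helpers; the real definitions live in
 * the prototype kernel tree [2]. The helper IDs are placeholders. */
struct io_uring_bpf_ctx {
	__u64 user_data;
};
static long (*iouring_reap_cqe)(void *ctx, struct io_uring_cqe *cqe) = (void *)201;
static long (*iouring_queue_sqe)(void *ctx, struct io_uring_sqe *sqe) = (void *)202;

SEC("iouring")	/* section name is also an assumption */
int chain_open_read(struct io_uring_bpf_ctx *ctx)
{
	struct io_uring_cqe cqe = {};
	struct io_uring_sqe sqe = {};
	struct req_ctx rctx = {};

	/* sleepable BPF lets us read the userspace context */
	if (bpf_copy_from_user(&rctx, sizeof(rctx),
			       (void *)(unsigned long)ctx->user_data))
		return 0;

	/* reap the open request's CQE from this program's own CQ */
	if (iouring_reap_cqe(ctx, &cqe) || cqe.res < 0)
		return 0;	/* nothing completed, or open() failed */

	/* queue a read on the freshly opened fd, no return to userspace */
	sqe.opcode = IORING_OP_READ;
	sqe.fd = cqe.res;
	sqe.addr = rctx.buf;
	sqe.len = rctx.buf_len;
	sqe.user_data = ctx->user_data;
	iouring_queue_sqe(ctx, &sqe);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";

Userspace would presumably submit the open and the BPF request as a
link, steering the open's CQE to the BPF request's CQ, and only be
woken once the read posts to the main CQ.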
* Re: [LSF/MM/BPF TOPIC] io_uring: BPF controlled I/O
From: Pavel Begunkov @ 2021-06-05  9:16 UTC
To: io-uring; +Cc: Jens Axboe, [email protected], LKML, bpf, lsf-pc

I botched the subject tags, it should be [LSF/MM/BPF TOPIC].

On 6/5/21 10:08 AM, Pavel Begunkov wrote:
> [...]
--
Pavel Begunkov
* Re: io_uring: BPF controlled I/O
From: Victor Stewart @ 2021-06-07 18:51 UTC
To: Pavel Begunkov
Cc: io-uring, Jens Axboe, [email protected], LKML, bpf, lsf-pc

On Sat, Jun 5, 2021 at 5:09 AM Pavel Begunkov <[email protected]> wrote:
> [...]
> Discussion points:
> - Use cases, feature requests, benchmarking

hi Pavel,

coincidentally, i'm tossing around in my mind at the moment an idea
for offloading the PING/PONG of a QUIC server/client into the kernel
via eBPF.

the problem being that QUIC is a userspace-run transport and that
NAT-ed UDP mappings can't be expected to stay open longer than 30
seconds, QUIC applications bear a large context-switching wake-up cost
to conduct connection lifetime maintenance... especially when managing
a large number of mostly idle, long-lived connections. so offloading
this maintenance service into the kernel would be a great efficiency
boon.

the main impediment is that access to the kernel crypto libraries
isn't currently possible from eBPF. that said, connection-wide crypto
offload into the NIC is a frequently mentioned subject in QUIC
circles, so one could argue it's better to allocate the time to NIC
crypto offload and then simply conduct this PING/PONG offload in
plain text.

CQEs would provide a great way for the offloaded service to wake up
the application when its input is required.

anyway, food for thought.

Victor
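A hedged sketch of what such an offloaded maintenance loop might look
like with the BPF request type from this proposal, reusing the assumed
io_uring_bpf_ctx/iouring_* declarations from the sketch after the
first message and taking the plain-text shortcut Victor mentions (no
crypto): on every timeout completion the program queues a send of a
pre-registered keepalive payload and re-arms the timeout, so the
application never has to wake up.

/* Context the application sets up in its own memory (layout assumed). */
struct ka_ctx {
	__u64 payload;		/* userspace pointer to the keepalive bytes */
	__u32 payload_len;
	__s32 sockfd;		/* connected UDP socket of the QUIC connection */
	__u64 ts;		/* userspace struct __kernel_timespec, e.g. 15s */
};

SEC("iouring")
int quic_keepalive(struct io_uring_bpf_ctx *ctx)
{
	struct io_uring_cqe cqe = {};
	struct io_uring_sqe sqe = {};
	struct ka_ctx kc = {};

	if (bpf_copy_from_user(&kc, sizeof(kc),
			       (void *)(unsigned long)ctx->user_data))
		return 0;
	/* the CQE of the timeout that just fired is steered to this
	 * program's CQ; its res is expected to be -ETIME */
	if (iouring_reap_cqe(ctx, &cqe))
		return 0;

	/* 1) plain-text keepalive on the connected UDP socket */
	sqe.opcode = IORING_OP_SEND;
	sqe.fd = kc.sockfd;
	sqe.addr = kc.payload;
	sqe.len = kc.payload_len;
	iouring_queue_sqe(ctx, &sqe);

	/* 2) re-arm the timer; the application keeps sleeping */
	__builtin_memset(&sqe, 0, sizeof(sqe));
	sqe.opcode = IORING_OP_TIMEOUT;
	sqe.addr = kc.ts;
	sqe.len = 1;		/* one timespec */
	iouring_queue_sqe(ctx, &sqe);
	return 0;
}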
* Re: io_uring: BPF controlled I/O
From: Pavel Begunkov @ 2021-06-10  9:09 UTC
To: Victor Stewart
Cc: io-uring, Jens Axboe, [email protected], LKML, bpf, lsf-pc

On 6/7/21 7:51 PM, Victor Stewart wrote:
> coincidentally, i'm tossing around in my mind at the moment an idea
> for offloading the PING/PONG of a QUIC server/client into the kernel
> via eBPF.
> [...]
> CQEs would provide a great way for the offloaded service to wake up
> the application when its input is required.

Interesting, want to try out the idea? All the pointers are here
and/or in the patchset's cover letter, but if anything is not clear,
inconvenient, lacks needed functionality, etc., let me know.
--
Pavel Begunkov
* RE: io_uring: BPF controlled I/O
From: David Laight @ 2021-06-14  7:54 UTC
To: 'Victor Stewart', Pavel Begunkov
Cc: io-uring, Jens Axboe, [email protected], LKML, bpf, [email protected]

From: Victor Stewart
> Sent: 07 June 2021 19:51
...
> coincidentally, i'm tossing around in my mind at the moment an idea
> for offloading the PING/PONG of a QUIC server/client into the kernel
> via eBPF.
> [...]
> the main impediment is that access to the kernel crypto libraries
> isn't currently possible from eBPF. that said, connection-wide crypto
> offload into the NIC is a frequently mentioned subject in QUIC
> circles, so one could argue it's better to allocate the time to NIC
> crypto offload and then simply conduct this PING/PONG offload in
> plain text.

Hmmmm... a good example of how not to type emails.

Thought: does the UDP tx needed to keep the NAT tables active actually
need to be encrypted? A single-byte UDP packet would do the trick. You
just need something the remote system is designed to ignore.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
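For reference, a minimal userspace sketch (liburing) of that
single-byte nudge, assuming quic_fd is the already-connected UDP
socket; one byte cannot form a valid QUIC packet, so the peer's stack
should simply drop it. Whether every NAT refreshes its mapping for
such a datagram is an assumption worth testing.

#include <liburing.h>
#include <unistd.h>

/* Refresh the NAT mapping with a 1-byte datagram on the connected QUIC
 * UDP socket; sketch only, error handling and integration with the
 * application's event loop are omitted. */
static void nat_keepalive(struct io_uring *ring, int quic_fd)
{
	static const char nudge = 0;	/* payload the peer is expected to ignore */

	for (;;) {
		struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
		struct io_uring_cqe *cqe;

		io_uring_prep_send(sqe, quic_fd, &nudge, 1, 0);
		io_uring_submit(ring);
		io_uring_wait_cqe(ring, &cqe);	/* reap the send's completion */
		io_uring_cqe_seen(ring, cqe);

		sleep(15);	/* well under the ~30s NAT idle timeout */
	}
}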
Thread overview: 5+ messages
2021-06-05  9:08 io_uring: BPF controlled I/O  Pavel Begunkov
2021-06-05  9:16 ` [LSF/MM/BPF TOPIC] io_uring: BPF controlled I/O  Pavel Begunkov
2021-06-07 18:51 ` Re: io_uring: BPF controlled I/O  Victor Stewart
2021-06-10  9:09   ` Re: io_uring: BPF controlled I/O  Pavel Begunkov
2021-06-14  7:54   ` RE: io_uring: BPF controlled I/O  David Laight