* Re: [RFC] Programming model for io_uring + eBPF [not found] ` <[email protected]> @ 2021-04-16 15:49 ` Pavel Begunkov 2021-04-20 16:35 ` Christian Dietrich 0 siblings, 1 reply; 17+ messages in thread From: Pavel Begunkov @ 2021-04-16 15:49 UTC (permalink / raw) To: Christian Dietrich, io-uring; +Cc: Horst Schirmeier, Franz-B. Tuneke On 16/04/2021 13:02, Christian Dietrich wrote: > Hello, > > we [1] are operating-system researchers from Germany and noticed the LWN > article about the combination of io_uring and eBPF programs. We think > that high-performance system-call interfaces in combination with > programability are the perfect substrate for system-call clustering. > Therefore, we had some intense discussions how the combination of eBPF > and io_uring could look like, and we would like to present our ideas > here. In a nutshell, our goal is to give applications the possibility to > perform complex I/O operations, or even their whole application logic, > with less kernel-userspace transitions. Hi Christian, Horst, Franz-Bernhard, It's nice to hear from you again. Sounds good, we need to have this discussion to refine interfaces. The model makes much sense, but let me outline some details and how it matches up from the kernel and implementation perspective > > First, we want to present our idea what the execution model should look > like in order to give the user program the possibility to program its > I/O behavior, before we give a few coarse examples how this execution > model can be used to structure complex I/O operations. > > Execution model > ================ > > Think of the execution model as a list of short high-level statements > about the capabilities that the io_uring+eBPF interface should provide. > For each point, in-detail design decisions will be necessary to > integrate it well with the kernel. > > 1. Program model > 1. **Loading**: An eBPF program, which is a collection of functions, > can be attached to an uring. Each uring has at most one loaded > eBPF program An ebpf program is basically a single entry point (i.e. function), but there is some flexibility on reusing them, iirc even dynamically, not link-time only. So, from the io_uring/bpf perspective, you load a bunch of bpf programs (i.e. separate functions) ... > 2. **Invocation**: Individual functions in that eBPF program can be > called. > 3. These eBPF programs have some kind of global state that remains > valid across invocations. ... where the state is provided from the userspace in a form of bpf map/array. So, it's not necessarily restricted by a single state, but you can have many per io_uring. Hence, if userspace provides a single piece of state, that should be as you described. > > 2. Invocation model > 1. An invocation to any function in the loaded eBPF program can > be queued as an SQE. > 2. If an invocation SQE is part of an SQE link chain, it is > (as usual) executed after the preceeding I/O operation has > finished. Right, just like any other linked request > 3. All invocations on the same uring are serialized in their > execution -- even if the SQEs are part of different link > chains that are otherwise executed in parallel. Don't think it's a good idea to limit the kernel to it, but we may try to find a userspace solution, or do it optionally. > 4. Invocations have run-to-completion semantics. > > 3. Access to previous SQEs/CQEs (in a link chain) > 1. 
In a link chain, the invocation can access the data of > (**all** or **the** -- subject to debate and application-scenario > requirements) previous SQEs and CQEs. > 2. All SQEs get a flag that indicates that no CQE is emitted to > userspace. However, the CQE is always generated and can be > consumed by following invocation-SQEs. There is my last concept, was thinking about because feeding CQE of only last linked request seems very limited to me. - let's allow to register multiple completion queues (CQs), for simplicity extra ones are limited to BPF programs and not exported to the userspace, may be lifted in the future - BPF requests are naturally allowed to reap CQEs from them and free to choose from which exactly CQ (including from multiple). So, that solves 2) without flags because CQEs are posted, but userspace can write BPF in such a way that they are reaped only by a specific BPF request/etc. and not affecting others. As well as effectively CQEs part of 1), without burden of extra allocations for CQEs, etc. That's flexible enough to enable users to do many interesting things, e.g. to project it onto graph-like events synchronisation models. This may need some other things, e.g. like allowing BPF requests to wait for >=N entries in CQ, but not of much concern. > 4. Data and control access within an invocation: Invocations can ... > 1. Access preceeding SQEs and CQEs of the same link chain (see above). It'd be a problem to allow them to access preceeding SQEs. They're never saved in requests, and by the time of execution previous requests are already gone. And SQE is 64B, even alloc + copy would add too much of overhead. > 2. Emit SQEs and CQEs to the same uring. All SQEs that are submitted > by one invocation follow each other directly in the submission > ring to enable SQE linking. fwiw, no need to use the SQ for from BPF submissions, especially since it may be used by the userspace in parallel and synchronises races is painful. > 3. Read and write the whole userspace memory of the uring-owning > process. That's almost completely on BPF. > We think that this execution model provides a few important design > considerations that make programming I/O feasible to application > developers: > > With the capability to access previous SQEs (3.1) and userspace memory > (4.3), invocations can directly manipulate the read and written results > of previous I/O operations. There, we could also avoid to provide buffer > allocation/deallocation mechanisms for invocations. While this might be > a point of debate for the Linux kernel, it is definitely something we > want to try in our academic setting. > > With the serialization guarantees (2.3 and 2.4), the user has not to > worry about synchronization within the eBPF program itself, and we > somehow mimic the (unluckily successful) execution model of JavaScript. > > With the possibility to submit linked pairs of I/O-SQE and > invocation-SQE (4.2), an invocation can submit several parallelly > running I/O requests. With the possibility to supress emitting CQEs to > userspace (3.2), long chains of computation+I/O can be executed without > disturbing userspace. > > By having global state (1.3) and the serialization guarantees, complex > synchronization (e.g. barrier) can be implemented within such eBPF > programs. Great job writing up requirements and use cases. Let's see how we can improve it. My main concerns is 1) to keep it flexible, so use-cases and ideas may bounce around, and allow multiple models to co-exists. 
2) but also keep it slim enough, because wouldn't be of much use if it's faster to go through userspace and normal submissions. Let me know what you think > > > # Application Examples > > In order to make our imagined execution model more plastic, we want to > give a few examples how our proposed interface could be used to program > I/O. For this, we have drawn a pretty picture: > > https://i.imgur.com/HADDfrv.png > Also attached with this mail as an PDF. > > > > > > > ## Postprocess > > We submit two linked SQEs into the queue. The first one performs an I/O > operation and does not emit a CQE. The second is an invocation that can > access the previous SQE, the CQE, and the resulting buffer. After some > postprocessing, it emits a CQE to the completion queue. > > ## I/O -> Computation -> I/O Loops > > This is an extension to the postprocess example. Instead of always > emitting a CQE, the invocation can also decide to queue another pair of > I/O+invocation SQEs to perform another round of postprocessed I/O. > > An example for such an I/O loop could be to parse the header of some > network protocol. The loop parses incoming data until the header is > complete or until some unhandled corner case occurs. In these cases, the > loop can **escalate** to the userspace by emitting an CQE. > > > ## Fork-Join Parallelism > > As an invocation can emit multiple SQEs to its io_uring, it can issue > multiple I/O requests at once that are executed in parallel. In order to > wait for all requests to complete, it initializes a global counter in > its program state and links an invocation to each parallel requests that > counts completions. The last invocation continues the io-program: > > > int counter; > void start_par() { > sqe_barrier->opcode = IORING_OP_EBPF; > ... > sqe_barrier->addr = &finish_par; > counter = 10; > for (int i=0; i<10 ;i++) { > sqe_io->opcode = IORING_OP_READV... > sqe_io->flags = ... | IOSQE_IO_LINK | IOSEQ_IO_NOCQE | ... > ... > uring_submit_sqe(seq_io); > uring_submit_sqe(seq_barrier) > } > } > > void finish_par () { > if (--counter > 0) return; > > // All 10 packets are completed > uring_submit_cqe(....) > } > > > The synchronization with a global variable works as all eBPF programs > within one uring are serialized, regardless of their linking state. > > # Our next steps > > With Franz (Cc), we have a Master student who wants to explore this > execution model. We will first build a prototype in userspace on basis > of Pavel's eBPF patch series. For this prototype, we will put a proxy > before the completion queue to mimic the linked invocation-SQEs and the > data access to userspace. The proxy will execute normal C functions that > adhere to the rules of eBPF as callbacks to completed I/O SQEs. With > this prototype at hand, we can already use the execution model to > restructure a userspace program to use the new interface, before we work > on the necessary kernel extensions. > > Besides that, we would like to discuss our envisioned execution model on > this mailinglist. What do you think about the presented idea? > > Horst & chris > > [1] Christian Dietrich, https://osg.tuhh.de/People/dietrich/ > Horst Schirmeier, https://ess.cs.tu-dortmund.de/~hsc/ > -- Pavel Begunkov ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] Programming model for io_uring + eBPF 2021-04-16 15:49 ` [RFC] Programming model for io_uring + eBPF Pavel Begunkov @ 2021-04-20 16:35 ` Christian Dietrich 2021-04-23 15:34 ` Pavel Begunkov 0 siblings, 1 reply; 17+ messages in thread From: Christian Dietrich @ 2021-04-20 16:35 UTC (permalink / raw) To: Pavel Begunkov, io-uring; +Cc: Horst Schirmeier, Franz-B. Tuneke Hi Pavel, thank you for your insights on the potential kernel integration. Hopefully this mail also lands on the mailinglist, because somehow my last mail was eaten by VGER although I'm a subscriber on that list. Let's see where this mail ends up. Pavel Begunkov <[email protected]> [16. April 2021]: >> 1. Program model >> 1. **Loading**: An eBPF program, which is a collection of functions, >> can be attached to an uring. Each uring has at most one loaded >> eBPF program > > An ebpf program is basically a single entry point (i.e. function), but > there is some flexibility on reusing them, iirc even dynamically, not > link-time only. So, from the io_uring/bpf perspective, you load a bunch > of bpf programs (i.e. separate functions) ... From an intentional perspective, there is not much difference between N programs with one entry function each and one program with N entry functions. As a user of our new interface I want to have the possibility to preattach multiple eBPF functions to a uring in order to shadow the latency for loading (and potentially jitting) the eBPF program. My intention with wanting only a single program was to make it easier to emit SQEs that reference another eBPF function. In an ideal world, the eBPF function only uses a `sqe.callback = &other_func' in its body. >> 2. **Invocation**: Individual functions in that eBPF program can be >> called. >> 3. These eBPF programs have some kind of global state that remains >> valid across invocations. > > ... where the state is provided from the userspace in a form of bpf > map/array. So, it's not necessarily restricted by a single state, but > you can have many per io_uring. > > Hence, if userspace provides a single piece of state, that should be as > you described. That should be fine as long as we do not have to transport the eBPF function pointer and the state pointer in every eBPF-SQE. As I heard, the space in an SQE is quite limited :-) Therefore, I designed an implicit method to pass state between invocations into our model. >> 3. All invocations on the same uring are serialized in their >> execution -- even if the SQEs are part of different link >> chains that are otherwise executed in parallel. > > Don't think it's a good idea to limit the kernel to it, but we > may try to find a userspace solution, or do it optionally. I think this is a crucial point for the execution model, as otherwise SQE executions have to synchronize with each other. I think it is consensus that these callbacks should have read and write access to different buffers, which provokes the question of data races. I see several solutions here: - My proposed serialization promise - Giving atomicity guarantees for the invocation of different eBPF helper functions (e.g. write buffer to user memory). However, what should I do if I want to perform a composed operation in an atomic fashion? Surely, we do not want to have transactions. - Exposing synchronization primitives to the eBPF program. I don't think that we can argue for semaphores in an eBPF program. With the serialization promise, we at least avoid the need to synchronize callbacks with callbacks.
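To make concrete what the promise buys us, here is the finish_par() callback from the fork-join example of our first mail again, sticking to the same made-up pseudo-API (uring_submit_cqe() and friends do not exist anywhere yet): without the promise, two of the ten linked invocations could run concurrently, both read the same counter value, and the final CQE would then be emitted twice or not at all, unless we require an atomic read-modify-write.

    int counter;                    /* global eBPF program state */

    void finish_par(void)
    {
        /* The plain decrement is only safe because invocations on this
         * uring never run concurrently; without the serialization
         * promise this would have to be an atomic fetch-and-decrement. */
        if (--counter > 0)
            return;

        /* All 10 parallel requests have completed: notify userspace. */
        uring_submit_cqe(/* ... */);
    }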
However, synchronization between user space and callback is still a problem. In order to make this serialization promise less strict, we could hide it behind an SQE flag. Thereby, many callbacks can execute in parallel, while some are guaranteed to execute in sequence. >> 3. Access to previous SQEs/CQEs (in a link chain) >> 1. In a link chain, the invocation can access the data of >> (**all** or **the** -- subject to debate and application-scenario >> requirements) previous SQEs and CQEs. >> 2. All SQEs get a flag that indicates that no CQE is emitted to >> userspace. However, the CQE is always generated and can be >> consumed by following invocation-SQEs. > > There is my last concept, was thinking about because feeding CQE of > only last linked request seems very limited to me. > > - let's allow to register multiple completion queues (CQs), for > simplicity extra ones are limited to BPF programs and not exported to > the userspace, may be lifted in the future > > - BPF requests are naturally allowed to reap CQEs from them and free > to choose from which exactly CQ (including from multiple). > > So, that solves 2) without flags because CQEs are posted, but userspace > can write BPF in such a way that they are reaped only by a specific BPF > request/etc. and not affecting others. As well as effectively CQEs part > of 1), without burden of extra allocations for CQEs, etc. Ok, let me summarize that to know if I've understand it correctly. Instead of having one completion queue for a submission queue, there are now the user-exposed completion queue and (N-1) in-kernel completion queues. On these queues, only eBPF programs can reap CQEs. If an eBPF wants to forward the CQE of the previous linked SQE, it reaps it from that queue and write it to the user-visible queue. How do I indicate at the first SQE into which CQ the result should be written? On the first sight, this looks much more complex to me than using a flag to supress the emitting of CQEs. > That's flexible enough to enable users to do many interesting things, > e.g. to project it onto graph-like events synchronisation models. This > may need some other things, e.g. like allowing BPF requests to wait > for >=N entries in CQ, but not of much concern. In your mentioned scenario, I would submit an unlinked eBPF-SQE that says: exeucte this SQE only if that CQ has more than n entries. But wouldn't that introduce a whole new signaling and waiting that sits beside the SQE linking mechanism? Are we able to encode all of that into a single SQE that also holds an eBPF function pointer and (potenitally) an pointer to a context map? >> 4. Data and control access within an invocation: Invocations can ... >> 1. Access preceeding SQEs and CQEs of the same link chain (see above). > > It'd be a problem to allow them to access preceeding SQEs. They're never > saved in requests, and by the time of execution previous requests are > already gone. And SQE is 64B, even alloc + copy would add too much of > overhead. My rational behind this point was to have the arguments to the preceeding request at hand to have both, request and uring, specific state as arguments. But perhaps all of this can be done in userspace (and might even be more flexible) by supplying the correct eBPF map. >> 2. Emit SQEs and CQEs to the same uring. All SQEs that are submitted >> by one invocation follow each other directly in the submission >> ring to enable SQE linking. 
> > fwiw, no need to use the SQ for from BPF submissions, especially since > it may be used by the userspace in parallel and synchronises races is > painful. Sure, we only have to have a possibility so create linked requests. They never have to fly through the SQ. Only on an intentional level it should be the same. And by mimicing the user-level interface, without backing it by the same implementation, it would be more familiar for the user-space developer. >> 3. Read and write the whole userspace memory of the uring-owning >> process. > > That's almost completely on BPF. But probably, people have stark opinions on that, or? I mean, as far as i grasped the situation, people are quite skeptical on the capabilities of eBPF programs and fear the hell of an endless chain of vulnerabilities. > Great job writing up requirements and use cases. Let's see how we can > improve it. My main concerns is 1) to keep it flexible, so use-cases > and ideas may bounce around, and allow multiple models to co-exists. > 2) but also keep it slim enough, because wouldn't be of much use if > it's faster to go through userspace and normal submissions. I completely agree with your here. I would add another point to our bucket list of goals: 3) it should be comprehensible for the user-space dev and follow the principle of the least surprise. I tried to accomplish this, by reproducing the user-space interface at the eBPF level. In the end it would be nice, if a user space program, toghether with the help of a few #defines and a little bit of make-magic, decide for each SQE-callback whether it should be executed in user- or in kernel space. chris -- Christian Dietrich ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] Programming model for io_uring + eBPF 2021-04-20 16:35 ` Christian Dietrich @ 2021-04-23 15:34 ` Pavel Begunkov 2021-04-29 13:27 ` Christian Dietrich 0 siblings, 1 reply; 17+ messages in thread From: Pavel Begunkov @ 2021-04-23 15:34 UTC (permalink / raw) To: Christian Dietrich, io-uring; +Cc: Horst Schirmeier, Franz-B. Tuneke On 4/20/21 5:35 PM, Christian Dietrich wrote: > Hi Pavel, > > thank you for your insights on the potential kernel integration. > Hopefully this mail also lands on the mailinglist, because somehow my > last mail was eaten by VGER although I'm a subscriber on that list. > Let's see where this mail ends up. > > Pavel Begunkov <[email protected]> [16. April 2021]: > >>> 1. Program model >>> 1. **Loading**: An eBPF program, which is a collection of functions, >>> can be attached to an uring. Each uring has at most one loaded >>> eBPF program >> >> An ebpf program is basically a single entry point (i.e. function), but >> there is some flexibility on reusing them, iirc even dynamically, not >> link-time only. So, from the io_uring/bpf perspective, you load a bunch >> of bpf programs (i.e. separate functions) ... > > From an intentional perspective, there is not much difference between N > programs with one entry function each and one program with N entry > functions. As a user of our new interface I want to have the possibility > to preattach multiple eBPF functions to a uring in order to shadow the > latency for loading (and potentially jitting) the eBPF program. Yeah, absolutely. I don't see much profit in registering them dynamically, so for now they will need to be loaded and attached in advance. Or can be done in a more dynamic fashion, doesn't really matter. btw, bpf splits compilation and attach steps, adds some flexibility. > My intention with wanting only a single program was to make it easier to > emit SQEs that reference another eBPF function. In an ideal world, the > eBPF function only uses a `sqe.callback = &other_func' in its body. Should look similar to the userspace, fill a 64B chunk of memory, where the exact program is specified by an index, the same that is used during attach/registration > >>> 2. **Invocation**: Individual functions in that eBPF program can be >>> called. >>> 3. These eBPF programs have some kind of global state that remains >>> valid across invocations. >> >> ... where the state is provided from the userspace in a form of bpf >> map/array. So, it's not necessarily restricted by a single state, but >> you can have many per io_uring. >> >> Hence, if userspace provides a single piece of state, that should be as >> you described. > > That should be fine as long as we do not have to transport the eBPF > function pointer and the state pointer in every eBPF-SQE. As I heard, > the space in an SQE is quite limited :-) Therefore, I designed an > implicit method to pass state between invocations into our model. and context fd is just another field in the SQE. On the space -- it depends. Some opcodes pass more info than others, and even for those we yet have 16 bytes unused. For bpf I don't expect passing much in SQE, so it should be ok. >>> 3. All invocations on the same uring are serialized in their >>> execution -- even if the SQEs are part of different link >>> chains that are otherwise executed in parallel. >> >> Don't think it's a good idea to limit the kernel to it, but we >> may try to find a userspace solution, or do it optionally.
> > I think this is a curcial point for the execution model as otherwise SQE > executions have synchronize to with each other. I think it is consensus > that these callback should have read and write access to different > buffers, which provokes the question of data races. I see several > solutions here: > > - My proposed serialization promise It can be an optional feature, but 1) it may become a bottleneck at some point, 2) users use several rings, e.g. per-thread, so they might need to reimplement serialisation anyway. > > - Giving atomicity guarantees for the invocation of different eBPF > helper functions (e.g. write buffer to user memory). However, what > should I do if I want to perform a composed operation in an atomar > fashin? Surely, we do not want to have transactions. Surely not in the kernel ;) > > - Exposing synchronization primitives to the eBPF program. I don't think > that we can argue for semaphores in an eBPF program. I remember a discussion about sleep-able bpf, we need to look what has happened with it. > With the serialization promise, we at least avoid the need to > synchronize callbacks with callbacks. However, synchronization between > user space and callback is still a problem. Need to look up up-to-date BPF capabilities, but can also be spinlocks, for both: bpf-userspace sync, and between bpf https://lwn.net/ml/netdev/[email protected]/ Either just cancel bpf requests, and resubmit them. > > In order to make this serialization promise less strict, we could hide > it behind an SQE flag. Thereby, many callbacks can execute in parallel, > while some are guaranteed to execute in sequence. > >>> 3. Access to previous SQEs/CQEs (in a link chain) >>> 1. In a link chain, the invocation can access the data of >>> (**all** or **the** -- subject to debate and application-scenario >>> requirements) previous SQEs and CQEs. >>> 2. All SQEs get a flag that indicates that no CQE is emitted to >>> userspace. However, the CQE is always generated and can be >>> consumed by following invocation-SQEs. >> >> There is my last concept, was thinking about because feeding CQE of >> only last linked request seems very limited to me. >> >> - let's allow to register multiple completion queues (CQs), for >> simplicity extra ones are limited to BPF programs and not exported to >> the userspace, may be lifted in the future >> >> - BPF requests are naturally allowed to reap CQEs from them and free >> to choose from which exactly CQ (including from multiple). >> >> So, that solves 2) without flags because CQEs are posted, but userspace >> can write BPF in such a way that they are reaped only by a specific BPF >> request/etc. and not affecting others. As well as effectively CQEs part >> of 1), without burden of extra allocations for CQEs, etc. > > Ok, let me summarize that to know if I've understand it correctly. > Instead of having one completion queue for a submission queue, there are > now the user-exposed completion queue and (N-1) in-kernel completion With a bit of work nothing forbids to make them userspace visible, just next step to the idea. In the end I want to have no difference between CQs, and everyone can reap from anywhere, and it's up to user to use/synchronise properly. > queues. On these queues, only eBPF programs can reap CQEs. If an eBPF > wants to forward the CQE of the previous linked SQE, it reaps it from > that queue and write it to the user-visible queue. CQ is specified by index in SQE, in each SQE. 
So either as you say, or just specify index of the main CQ in that previous linked request in the first place. > > How do I indicate at the first SQE into which CQ the result should be > written? > > On the first sight, this looks much more complex to me than using a flag > to supress the emitting of CQEs. Yes, adds a bit of complexity, but without it you can only get last CQE, 1) it's not flexible enough and shoots off different potential scenarios 2) not performance efficient -- overhead on running a bpf request after each I/O request can be too large. 3) does require mandatory linking if you want to get result. Without it we can submit a single BPF request and let it running multiple times, e.g. waiting for on CQ, but linking would much limit options 4) bodgy from the implementation perspective Just imagine that you will be able to synchronise between BPF programs just by writing to the right CQ. Or can keep constant queue depth with a single BPF program, and many more ideas flying around. >> That's flexible enough to enable users to do many interesting things, >> e.g. to project it onto graph-like events synchronisation models. This >> may need some other things, e.g. like allowing BPF requests to wait >> for >=N entries in CQ, but not of much concern. > > In your mentioned scenario, I would submit an unlinked eBPF-SQE that > says: exeucte this SQE only if that CQ has more than n entries. > But wouldn't that introduce a whole new signaling and waiting that sits > beside the SQE linking mechanism? Should work just like CQ waiting, don't see any problem with it. > Are we able to encode all of that into a single SQE that also holds an > eBPF function pointer and (potenitally) an pointer to a context map? yes, but can be just a separate linked request... >>> 4. Data and control access within an invocation: Invocations can ... >>> 1. Access preceeding SQEs and CQEs of the same link chain (see above). >> >> It'd be a problem to allow them to access preceeding SQEs. They're never >> saved in requests, and by the time of execution previous requests are >> already gone. And SQE is 64B, even alloc + copy would add too much of >> overhead. > > My rational behind this point was to have the arguments to the > preceeding request at hand to have both, request and uring, specific > state as arguments. But perhaps all of this can be done in userspace > (and might even be more flexible) by supplying the correct eBPF map. Right. And it should know what it's doing anyway in most cases. All more complex dispatching / state machines can be pretty well implemented via context. > >>> 2. Emit SQEs and CQEs to the same uring. All SQEs that are submitted >>> by one invocation follow each other directly in the submission >>> ring to enable SQE linking. >> >> fwiw, no need to use the SQ for from BPF submissions, especially since >> it may be used by the userspace in parallel and synchronises races is >> painful. > > Sure, we only have to have a possibility so create linked requests. They > never have to fly through the SQ. Only on an intentional level it should > be the same. And by mimicing the user-level interface, without backing > it by the same implementation, it would be more familiar for the > user-space developer. > >>> 3. Read and write the whole userspace memory of the uring-owning >>> process. >> >> That's almost completely on BPF. > > But probably, people have stark opinions on that, or? 
I mean, as far as > i grasped the situation, people are quite skeptical on the capabilities > of eBPF programs and fear the hell of an endless chain of vulnerabilities. Capabilities are vast already, and there are restrictions because of security reasons, and it's there not without a reason, so io_uring won't try to circumvent bpf policies on that. If it's allowed, great, if not, need to wait while a solution is figured out (if ever). I believe there was something for accessing userspace memory, we need to look it up. >> Great job writing up requirements and use cases. Let's see how we can >> improve it. My main concerns is 1) to keep it flexible, so use-cases >> and ideas may bounce around, and allow multiple models to co-exists. >> 2) but also keep it slim enough, because wouldn't be of much use if >> it's faster to go through userspace and normal submissions. > > I completely agree with your here. I would add another point to our > bucket list of goals: > > 3) it should be comprehensible for the user-space dev and follow the > principle of the least surprise. > > I tried to accomplish this, by reproducing the user-space interface at > the eBPF level. In the end it would be nice, if a user space program, > toghether with the help of a few #defines and a little bit of > make-magic, decide for each SQE-callback whether it should be executed > in user- or in kernel space. sounds interesting! -- Pavel Begunkov ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] Programming model for io_uring + eBPF 2021-04-23 15:34 ` Pavel Begunkov @ 2021-04-29 13:27 ` Christian Dietrich 2021-05-01 9:49 ` Pavel Begunkov 0 siblings, 1 reply; 17+ messages in thread From: Christian Dietrich @ 2021-04-29 13:27 UTC (permalink / raw) To: Pavel Begunkov, io-uring; +Cc: Horst Schirmeier, Franz-B. Tuneke Pavel Begunkov <[email protected]> [23. April 2021]: > Yeah, absolutely. I don't see much profit in registering them > dynamically, so for now they will need to be loaded and attached > in advance. Or can be done in a more dynamic fashion, doesn't really > matter. > > btw, bpf splits compilation and attach steps, adds some flexibility. So, I'm currently working on rebasing your work onto the tag 'for-5.13/io_uring-2021-04-27'. So if you already have some branch on this, just let me know to save the work. > Should look similar to the userspace, fill a 64B chunk of memory, > where the exact program is specified by an index, the same that is > used during attach/registration When looking at the current implementation, we can only perform the attachment once and there is no "append eBPF". While this is probably OK for code, for eBPF maps, we will need some kind of append eBPF map. > and context fd is just another field in the SQE. On the space -- it > depends. Some opcodes pass more info than others, and even for those we > yet have 16 bytes unused. For bpf I don't expect passing much in SQE, so > it should be ok. So besides an eBPF-program ID, we would also pass an ID for an eBPF map in the SQE. One thought that came to my mind: Why do we have to register the eBPF programs and maps? We could also just pass the FDs for those objects in the SQE. As long as there is no other state, it could be the userspace's choice to either attach it or pass it every time. For other FDs we already support both modes, right? >> - My proposed serialization promise > > It can be an optional feature, but 1) it may become a bottleneck at > some point, 2) users use several rings, e.g. per-thread, so they > might need to reimplement serialisation anyway. If we make it possible to pass some FD to a synchronization object (e.g. semaphore), this might do the trick to support both modes at the interface. >> - Exposing synchronization primitives to the eBPF program. I don't think >> that we can argue for semaphores in an eBPF program. > > I remember a discussion about sleep-able bpf, we need to look what has > happened with it. But surely this would hurt a lot as we would have to manage not only eBPF programs, but also eBPF processes. While this is surely possible, I don't know if it is really suitable for a high-performance interface like io_uring. But, don't know about the state. > >> With the serialization promise, we at least avoid the need to >> synchronize callbacks with callbacks. However, synchronization between >> user space and callback is still a problem. > > Need to look up up-to-date BPF capabilities, but can also be spinlocks, > for both: bpf-userspace sync, and between bpf > https://lwn.net/ml/netdev/[email protected]/ Using spinlocks between kernel and userspace just feels wrong, very wrong. But it might be an alternate route to synchronization. > With a bit of work nothing forbids to make them userspace visible, > just next step to the idea. In the end I want to have no difference > between CQs, and everyone can reap from anywhere, and it's up to > user to use/synchronise properly. I like the notion of orthogonality with this route.
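Just to double-check that I understood the multi-CQ idea correctly, the header-parsing loop from our examples would then look roughly like this (still pseudo-API, all names are made up): the I/O requests of the loop post their CQEs to a CQ that only the eBPF callback reaps, and only the escalation is posted to the CQ that userspace waits on.

    void parse_header(ctx)
    {
        /* CQ 0 is the user-visible one, CQ 1 is reserved for this callback */
        cqe = uring_reap_cqe(ctx, /* cq_idx */ 1);

        if (cqe->res < 0 || header_complete(ctx)) {
            /* escalate: post the final result where userspace can see it */
            uring_submit_cqe(ctx, /* cq_idx */ 0, cqe->res);
            return;
        }

        /* otherwise read more data; its completion goes to CQ 1 again */
        sqe = uring_submit_sqe(ctx, IORING_OP_READV /* ... */);
        sqe->cq_idx = 1;
    }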
Perhaps, we don't need to have user-invisible CQs but it can be enough to address the CQ of another uring in my SQE as the sink for the resulting CQE. Downside with that idea would be that the user has to setup another ring with SQ and CQ, but only the CQ is used. > [...] > CQ is specified by index in SQE, in each SQE. So either as you say, or > just specify index of the main CQ in that previous linked request in > the first place. From looking at the code: This is not yet the case, or? >> How do I indicate at the first SQE into which CQ the result should be >> written? > Yes, adds a bit of complexity, but without it you can only get last CQE, > 1) it's not flexible enough and shoots off different potential scenarios > > 2) not performance efficient -- overhead on running a bpf request after > each I/O request can be too large. > > 3) does require mandatory linking if you want to get result. Without it > we can submit a single BPF request and let it running multiple times, > e.g. waiting for on CQ, but linking would much limit options > > 4) bodgy from the implementation perspective When taking a step back, this is nearly a io_uring_enter(minwait=N)-SQE with an attached eBPF callback, or? At that point, we are nearly full circle. >> Are we able to encode all of that into a single SQE that also holds an >> eBPF function pointer and (potenitally) an pointer to a context map? > > yes, but can be just a separate linked request... So, let's make a little collection about the (potential) information that our poor SQE has to hold. Thereby, FDs should be registrable and addressible by an index. - FD to eBPF program - FD to eBPF map - FD to synchronization object during the execution - FD to foreign CQ for waiting on N CQEs That are a lot of references to other object for which we would have to extend the registration interface. > Right. And it should know what it's doing anyway in most cases. All > more complex dispatching / state machines can be pretty well > implemented via context. You convinced me that an eBPF map as a context is the more canonical way of doing it by achieving the same degree of flexibility. > I believe there was something for accessing userspace memory, we > need to look it up. Either way, from a researcher perspective, we can just allow it and look how it can performs. chris -- Dr.-Ing. Christian Dietrich Operating System Group (E-EXK4) Technische Universität Hamburg Am Schwarzenberg-Campus 3 (E) 21073 Hamburg eMail: [email protected] Tel: +49 40 42878 2188 WWW: https://osg.tuhh.de/ ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] Programming model for io_uring + eBPF 2021-04-29 13:27 ` Christian Dietrich @ 2021-05-01 9:49 ` Pavel Begunkov 2021-05-05 12:57 ` Christian Dietrich 0 siblings, 1 reply; 17+ messages in thread From: Pavel Begunkov @ 2021-05-01 9:49 UTC (permalink / raw) To: Christian Dietrich, io-uring; +Cc: Horst Schirmeier, Franz-B. Tuneke On 4/29/21 2:27 PM, Christian Dietrich wrote: > Pavel Begunkov <[email protected]> [23. April 2021]: > >> Yeah, absolutely. I don't see much profit in registering them >> dynamically, so for now they will be needed to be loaded and attached >> in advance. Or can be done in a more dynamic fashion, doesn't really >> matter. >> >> btw, bpf splits compilation and attach steps, adds some flexibility. > > So, I'm currently working on rebasing your work onto the tag > 'for-5.13/io_uring-2021-04-27'. So if you already have some branch on > this, just let me know to save the work. I'll hack the v2 up the next week + patch some bugs I see, so I'd suggest to wait for a bit. Will post an example together, because was using plain bpf assembly for testing. >> Should look similar to the userspace, fill a 64B chunk of memory, >> where the exact program is specified by an index, the same that is >> used during attach/registration > > When looking at the current implementation, when can only perform the > attachment once and there is no "append eBPF". While this is probably OK > for code, for eBPF maps, we will need some kind of append eBPF map. No need to register maps, it's passed as an fd. Looks I have never posted any example, but you can even compile fds in in bpf programs at runtime. fd = ...; bpf_prog[] = { ..., READ_MAP(fd), ...}; compile_bpf(bpf_prog); I was thinking about the opposite, to allow to optionally register them to save on atomic ops. Not in the next patch iteration though Regarding programs, can be updated by unregister+register (slow). And can be made more dynamic as registered files, but don't see much profit in doing so, but open to use cases. >> and context fd is just another field in the SQE. On the space -- it >> depends. Some opcodes pass more info than others, and even for those we >> yet have 16 bytes unused. For bpf I don't expect passing much in SQE, so >> it should be ok. > > So besides an eBPF-Progam ID, we would also pass an ID for an eEBF map > in the SQE. Yes, at least that's how I envision it. But that's a good question what else we want to pass. E.g. some arbitrary ids (u64), that's forwarded to the program, it can be a pointer to somewhere or just a piece of current state. > One thought that came to my mind: Why do we have to register the eBPF > programs and maps? We could also just pass the FDs for those objects in > the SQE. As long as there is no other state, it could be the userspaces > choice to either attach it or pass it every time. For other FDs we > already support both modes, right? It doesn't register maps (see above). For not registering/attaching in advance -- interesting idea, I need it double check with bpf guys. >>> - My proposed serialization promise >> >> It can be an optional feature, but 1) it may become a bottleneck at >> some point, 2) users use several rings, e.g. per-thread, so they >> might need to reimplement serialisation anyway. > > If we make it possible to pass some FD to an synchronization object > (e.g. semaphore), this might do the trick to support both modes at the interface. > >>> - Exposing synchronization primitives to the eBPF program. 
I don't think >>> that we can argue for semaphores in an eBPF program. >> >> I remember a discussion about sleep-able bpf, we need to look what has >> happened with it. > > But surely this would hurt a lot as we would have to manage not only > eBPF programs, but also eBPF processes. While this is surely possible, I > don't know if it is really suitable for a high-performance interface > like io_uring. But, don't know about the state. > >> >>> With the serialization promise, we at least avoid the need to >>> synchronize callbacks with callbacks. However, synchronization between >>> user space and callback is still a problem. >> >> Need to look up up-to-date BPF capabilities, but can also be spinlocks, >> for both: bpf-userspace sync, and between bpf >> https://lwn.net/ml/netdev/[email protected]/ > > Using Spinlocks between kernel and userspace just feels wrong, very > wrong. But it might be an alternate route to synchronization Right, probably not the way to go, can't exist out of try_lock() or blocking, but it's still interesting for me to look into in general >> With a bit of work nothing forbids to make them userspace visible, >> just next step to the idea. In the end I want to have no difference >> between CQs, and everyone can reap from anywhere, and it's up to >> user to use/synchronise properly. > > I like the notion of orthogonality with this route. Perhaps, we don't > need to have user-invisible CQs but it can be enough to address the CQ > of another uring in my SQE as the sink for the resulting CQE. Not going to happen, cross references is always a huge headache. The idea is to add several CQs to a single ring, regardless of BPF, and for any request of any type use specify sqe->cq_idx, and CQE goes there. And then BPF can benefit from it, sending its requests to the right CQ, not necessarily to a single one. It will be the responsibility of the userspace to make use of CQs right, e.g. allocating CQ per BPF program and so avoiding any sync, or do something more complex cross posting to several CQs. > Downside with that idea would be that the user has to setup another > ring with SQ and CQ, but only the CQ is used. > >> [...] > >> CQ is specified by index in SQE, in each SQE. So either as you say, or >> just specify index of the main CQ in that previous linked request in >> the first place. > > From looking at the code: This is not yet the case, or? Right, not yet >>> How do I indicate at the first SQE into which CQ the result should be >>> written? > >> Yes, adds a bit of complexity, but without it you can only get last CQE, >> 1) it's not flexible enough and shoots off different potential scenarios >> >> 2) not performance efficient -- overhead on running a bpf request after >> each I/O request can be too large. >> >> 3) does require mandatory linking if you want to get result. Without it >> we can submit a single BPF request and let it running multiple times, >> e.g. waiting for on CQ, but linking would much limit options >> >> 4) bodgy from the implementation perspective > > When taking a step back, this is nearly a io_uring_enter(minwait=N)-SQE Think of it as a separate request type (won't be a separate, but still...) waiting on CQ, similarly to io_uring_enter(minwait=N) but doesn't block. > with an attached eBPF callback, or? At that point, we are nearly full > circle. What do you mean? >>> Are we able to encode all of that into a single SQE that also holds an >>> eBPF function pointer and (potenitally) an pointer to a context map? 
>> >> yes, but can be just a separate linked request... > > So, let's make a little collection about the (potential) information > that our poor SQE has to hold. Thereby, FDs should be registrable and > addressible by an index. > > - FD to eBPF program yes (index, not fd), but as you mentioned we should check > - FD to eBPF map we pass fd, but not register it (not by default) > - FD to synchronization object during the execution can be in the context, in theory. We need to check available synchronisation means > - FD to foreign CQ for waiting on N CQEs It's a generic field in SQE, so not bpf specific. And a BPF program submitting SQEs is free to specify any CQ index in them. e.g. pseudo-code: my_bpf_program(ctx) { sqe = prep_read(ctx, ...); sqe->cq_idx = 1; sqe = prep_recv(ctx, ...); sqe->cq_idx = 42; } > > That are a lot of references to other object for which we would have > to extend the registration interface. So, it's only bpf programs that it registers. >> Right. And it should know what it's doing anyway in most cases. All >> more complex dispatching / state machines can be pretty well >> implemented via context. > > You convinced me that an eBPF map as a context is the more canonical way > of doing it by achieving the same degree of flexibility. > >> I believe there was something for accessing userspace memory, we >> need to look it up. > > Either way, from a researcher perspective, we can just allow it and look > how it can performs. -- Pavel Begunkov ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] Programming model for io_uring + eBPF 2021-05-01 9:49 ` Pavel Begunkov @ 2021-05-05 12:57 ` Christian Dietrich 2021-05-05 16:13 ` Christian Dietrich 2021-05-07 15:10 ` Pavel Begunkov 0 siblings, 2 replies; 17+ messages in thread From: Christian Dietrich @ 2021-05-05 12:57 UTC (permalink / raw) To: Pavel Begunkov, io-uring; +Cc: Horst Schirmeier, Franz-B. Tuneke Pavel Begunkov <[email protected]> [01. May 2021]: > No need to register maps, it's passed as an fd. Looks I have never > posted any example, but you can even compile fds in in bpf programs > at runtime. > > fd = ...; > bpf_prog[] = { ..., READ_MAP(fd), ...}; > compile_bpf(bpf_prog); > > I was thinking about the opposite, to allow to optionally register > them to save on atomic ops. Not in the next patch iteration though That was exactly my idea: make it possible to register them optionally. I also think that registering the eBPF program FD could be made optionally, since this would be similar to other FD references that are submitted to io_uring. > Yes, at least that's how I envision it. But that's a good question > what else we want to pass. E.g. some arbitrary ids (u64), that's > forwarded to the program, it can be a pointer to somewhere or just > a piece of current state. So, we will have the regular user_data attribute there. So why add another potential argument to the program, besides passing an eBPF map? >> One thought that came to my mind: Why do we have to register the eBPF >> programs and maps? We could also just pass the FDs for those objects in >> the SQE. As long as there is no other state, it could be the userspaces >> choice to either attach it or pass it every time. For other FDs we >> already support both modes, right? > > It doesn't register maps (see above). For not registering/attaching in > advance -- interesting idea, I need it double check with bpf guys. My feeling would be the following: In any case, at submission we have to resolve the SQE references to a 'bpf_prog *', if this is done via a registered program in io_ring_ctx or via calling 'bpf_prog_get_type' is only a matter of the frontend. The only difference would be the reference couting. For registered bpf_progs, we can increment the refcounter only once. For unregistered, we would have to increment it for every SQE. Don't know if the added flexibility is worth the effort. >> If we make it possible to pass some FD to an synchronization object >> (e.g. semaphore), this might do the trick to support both modes at the interface. I was thinking further about this. And, really I could not come up with a proper user-space visible 'object' that we can grab with an FD to attach this semantic to as futexes come in form of a piece of memory. So perhaps, we would do something like // alloc 3 groups io_uring_register(fd, REGISTER_SYNCHRONIZATION_GROUPS, 3); // submit a synchronized SQE sqe->flags |= IOSQE_SYNCHRONIZE; sqe->synchronize_group = 1; // could probably be restricted to uint8_t. When looking at this, this could generally be a nice feature to have with SQEs, or? Hereby, the user could insert all of his SQEs and they would run sequentially. In contrast to SQE linking, the order of SQEs would not be determined, which might be beneficial at some point. >> - FD to synchronization object during the execution > > can be in the context, in theory. We need to check available > synchronisation means In the context you mean: there could be a pointer to user-memory and the bpf program calls futex_wait? 
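Roughly like this is what I have in mind, with a hypothetical bpf_futex_wait() helper and the context map handing the callback a pointer into user memory (nothing of this exists today, it is only meant to illustrate the question):

    /* value of the per-uring context map, filled in by userspace */
    struct callback_ctx {
        __u64 futex_uaddr;      /* futex word in the process' memory */
        __u64 state;            /* program-private state */
    };

    void callback(struct callback_ctx *ctx)
    {
        /* hypothetical helper: block until userspace wakes the futex word */
        bpf_futex_wait((void *)ctx->futex_uaddr, /* expected value */ 0);

        /* ... continue once userspace has signalled us ... */
    }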
> The idea is to add several CQs to a single ring, regardless of BPF, > and for any request of any type use specify sqe->cq_idx, and CQE > goes there. And then BPF can benefit from it, sending its requests > to the right CQ, not necessarily to a single one. > It will be the responsibility of the userspace to make use of CQs > right, e.g. allocating CQ per BPF program and so avoiding any sync, > or do something more complex cross posting to several CQs. Would this be done at creation time or by registering them? I think that creation time would be sufficient and it would ease the housekeeping. In that case, we could get away with making all CQs equal in size and only read out the number of the additional CQs from the io_uring_params. When adding a sqe->cq_idx: Do we have to introduce an IOSQE_SELET_OTHER flag to not break userspace for backwards compatibility or is it sufficient to assume that the user has zeroed the SQE beforehand? However, I like the idea of making it a generic field of SQEs. >> When taking a step back, this is nearly a io_uring_enter(minwait=N)-SQE > > Think of it as a separate request type (won't be a separate, but still...) > waiting on CQ, similarly to io_uring_enter(minwait=N) but doesn't block. > >> with an attached eBPF callback, or? At that point, we are nearly full >> circle. > > What do you mean? I was just joking: With a minwait=N-like field in the eBPF-SQE, we can mimic the behavior of io_uring_enter(minwait=N) but with an SQE. On the result of that eBPF-SQE we can now actually wait with io_uring_enter(..) chris -- Prof. Dr.-Ing. Christian Dietrich Operating System Group (E-EXK4) Technische Universität Hamburg Am Schwarzenberg-Campus 3 (E) 21073 Hamburg eMail: [email protected] Tel: +49 40 42878 2188 WWW: https://osg.tuhh.de/ ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] Programming model for io_uring + eBPF 2021-05-05 12:57 ` Christian Dietrich @ 2021-05-05 16:13 ` Christian Dietrich 2021-05-07 15:13 ` Pavel Begunkov 2021-05-07 15:10 ` Pavel Begunkov 1 sibling, 1 reply; 17+ messages in thread From: Christian Dietrich @ 2021-05-05 16:13 UTC (permalink / raw) To: Pavel Begunkov, io-uring; +Cc: Horst Schirmeier, Franz-B. Tuneke Christian Dietrich <[email protected]> [05. May 2021]: > So perhaps, we would do something like > > // alloc 3 groups > io_uring_register(fd, REGISTER_SYNCHRONIZATION_GROUPS, 3); > > // submit a synchronized SQE > sqe->flags |= IOSQE_SYNCHRONIZE; > sqe->synchronize_group = 1; // could probably be restricted to uint8_t. > > When looking at this, this could generally be a nice feature to have > with SQEs, or? Hereby, the user could insert all of his SQEs and they > would run sequentially. In contrast to SQE linking, the order of SQEs > would not be determined, which might be beneficial at some point. I was thinking further about this statement: "Performing (optional) serialization of eBPF-SQEs is similar to SQE linking". If we would want to implement the above interface of synchronization groups, it could be done without taking locks but by fixing the execution order at submit time. Thereby, synchronization groups would become a form of "implicit SQE linking". The following SQE would become: Append this SQE to the SQE-link chain with the name '1'. If the link chain has completed, start a new one. Thereby, the user could add an SQE to an existing link chain, even other SQEs are already submitted. > sqe->flags |= IOSQE_SYNCHRONIZE; > sqe->synchronize_group = 1; // could probably be restricted to uint8_t. Implementation wise, we would hold a pointer to the last element of the implicitly generated link chain. chris -- Prof. Dr.-Ing. Christian Dietrich Operating System Group (E-EXK4) Technische Universität Hamburg Am Schwarzenberg-Campus 3 (E) 21073 Hamburg eMail: [email protected] Tel: +49 40 42878 2188 WWW: https://osg.tuhh.de/ ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] Programming model for io_uring + eBPF 2021-05-05 16:13 ` Christian Dietrich @ 2021-05-07 15:13 ` Pavel Begunkov 2021-05-12 11:20 ` Christian Dietrich 0 siblings, 1 reply; 17+ messages in thread From: Pavel Begunkov @ 2021-05-07 15:13 UTC (permalink / raw) To: Christian Dietrich, io-uring; +Cc: Horst Schirmeier, Franz-B. Tuneke On 5/5/21 5:13 PM, Christian Dietrich wrote: > Christian Dietrich <[email protected]> [05. May 2021]: > >> So perhaps, we would do something like >> >> // alloc 3 groups >> io_uring_register(fd, REGISTER_SYNCHRONIZATION_GROUPS, 3); >> >> // submit a synchronized SQE >> sqe->flags |= IOSQE_SYNCHRONIZE; >> sqe->synchronize_group = 1; // could probably be restricted to uint8_t. >> >> When looking at this, this could generally be a nice feature to have >> with SQEs, or? Hereby, the user could insert all of his SQEs and they >> would run sequentially. In contrast to SQE linking, the order of SQEs >> would not be determined, which might be beneficial at some point. > > I was thinking further about this statement: "Performing (optional) > serialization of eBPF-SQEs is similar to SQE linking". > > If we would want to implement the above interface of synchronization > groups, it could be done without taking locks but by fixing the > execution order at submit time. Thereby, synchronization groups would > become a form of "implicit SQE linking". > > The following SQE would become: Append this SQE to the SQE-link chain > with the name '1'. If the link chain has completed, start a new one. > Thereby, the user could add an SQE to an existing link chain, even other > SQEs are already submitted. > >> sqe->flags |= IOSQE_SYNCHRONIZE; >> sqe->synchronize_group = 1; // could probably be restricted to uint8_t. > > Implementation wise, we would hold a pointer to the last element of the > implicitly generated link chain. Such things go really horribly with performant APIs as io_uring, even if not used. Just see IOSQE_IO_DRAIN, it maybe almost never used but still in the hot path. -- Pavel Begunkov ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] Programming model for io_uring + eBPF 2021-05-07 15:13 ` Pavel Begunkov @ 2021-05-12 11:20 ` Christian Dietrich 2021-05-18 14:39 ` Pavel Begunkov 0 siblings, 1 reply; 17+ messages in thread From: Christian Dietrich @ 2021-05-12 11:20 UTC (permalink / raw) To: Pavel Begunkov, io-uring; +Cc: Horst Schirmeier, Franz-B. Tuneke Pavel Begunkov <[email protected]> [07. May 2021]: >> The following SQE would become: Append this SQE to the SQE-link chain >> with the name '1'. If the link chain has completed, start a new one. >> Thereby, the user could add an SQE to an existing link chain, even other >> SQEs are already submitted. >> >>> sqe->flags |= IOSQE_SYNCHRONIZE; >>> sqe->synchronize_group = 1; // could probably be restricted to uint8_t. >> >> Implementation wise, we would hold a pointer to the last element of the >> implicitly generated link chain. > > It will be in the common path hurting performance for those not using > it, and with no clear benefit that can't be implemented in userspace. > And io_uring is thin enough for all those extra ifs to affect end > performance. > > Let's consider if we run out of userspace options. So, to summarize my proposal: I want io_uring to support implicit synchronization by sequentialization at submit time. Doing this would avoid the overheads of locking (and potentially sleeping). So the problem that I see with a userspace solution is the following: If I want to sequentialize an SQE with another SQE that was submitted waaaaaay earlier, the usual IOSQE_IO_LINK cannot be used as I cannot set the link flag of that already submitted SQE. Therefore, I would have to wait in userspace for the CQE and submit my second SQE later on. Especially if the goal is to remain in kernel space as long as possible via eBPF-SQEs, this is not optimal. > Such things go really horribly with performant APIs as io_uring, even > if not used. Just see IOSQE_IO_DRAIN, it maybe almost never used but > still in the hot path. If we extend the semantic of IOSQE_IO_LINK instead of introducing a new flag, we should be able to limit the problem, or? - With synchronize_group=0, the usual link-the-next SQE semantic could remain. - While synchronize_group!=0 could expose the described synchronization semantic. Thereby, the overhead is at least hidden behind the existing check for IOSQE_IO_LINK, which is there anyway. Do you consider IOSQE_IO_LINK=1 part of the hot path? chris -- Prof. Dr.-Ing. Christian Dietrich Operating System Group (E-EXK4) Technische Universität Hamburg Am Schwarzenberg-Campus 3 (E), 4.092 21073 Hamburg eMail: [email protected] Tel: +49 40 42878 2188 WWW: https://osg.tuhh.de/ ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] Programming model for io_uring + eBPF 2021-05-12 11:20 ` Christian Dietrich @ 2021-05-18 14:39 ` Pavel Begunkov 2021-05-19 16:55 ` Christian Dietrich 0 siblings, 1 reply; 17+ messages in thread From: Pavel Begunkov @ 2021-05-18 14:39 UTC (permalink / raw) To: Christian Dietrich, io-uring; +Cc: Horst Schirmeier, Franz-B. Tuneke On 5/12/21 12:20 PM, Christian Dietrich wrote: > Pavel Begunkov <[email protected]> [07. May 2021]: > >>> The following SQE would become: Append this SQE to the SQE-link chain >>> with the name '1'. If the link chain has completed, start a new one. >>> Thereby, the user could add an SQE to an existing link chain, even other >>> SQEs are already submitted. >>> >>>> sqe->flags |= IOSQE_SYNCHRONIZE; >>>> sqe->synchronize_group = 1; // could probably be restricted to uint8_t. >>> >>> Implementation wise, we would hold a pointer to the last element of the >>> implicitly generated link chain. >> >> It will be in the common path hurting performance for those not using >> it, and with no clear benefit that can't be implemented in userspace. >> And io_uring is thin enough for all those extra ifs to affect end >> performance. >> >> Let's consider if we run out of userspace options. > > So summarize my proposal: I want io_uring to support implicit > synchronization by sequentialization at submit time. Doing this would > avoid the overheads of locking (and potentially sleeping). > > So the problem that I see with a userspace solution is the following: > If I want to sequentialize an SQE with another SQE that was submitted > waaaaaay earlier, the usual IOSQE_IO_LINK cannot be used as I cannot the > the link flag of that already submitted SQE. Therefore, I would have to > wait in userspace for the CQE and submit my second SQE lateron. > > Especially if the goal is to remain in Kernelspace as long as possible > via eBPF-SQEs this is not optimal. > >> Such things go really horribly with performant APIs as io_uring, even >> if not used. Just see IOSQE_IO_DRAIN, it maybe almost never used but >> still in the hot path. > > If we extend the semantic of IOSEQ_IO_LINK instead of introducing a new > flag, we should be able to limit the problem, or? > > - With synchronize_group=0, the usual link-the-next SQE semantic could > remain. > - While synchronize_group!=0 could expose the described synchronization > semantic. > > Thereby, the overhead is at least hidden behind the existing check for > IOSEQ_IO_LINK, which is there anyway. Do you consider IOSQE_IO_LINK=1 > part of the hot path? Let's clarify in case I misunderstood you. In a snippet below, should it serialise execution of sqe1 and sqe2, so they don't run concurrently? Once request is submitted we don't keep an explicit reference to it, and it's hard and unreliably trying to find it, so would not really be "submission" time, but would require additional locking: 1) either on completion of a request it looks up its group, but then submission should do +1 spinlock to keep e.g. a list for each group. 2) or try to find a running request and append to its linked list, but that won't work. 3) or do some other magic, but all options would rather be far from free. If it shouldn't serialise it this case, then I don't see much difference with IOSEQ_IO_LINK. prep_sqe1(group=1); submit(); prep_sqe2(group=1); submit(); -- Pavel Begunkov ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] Programming model for io_uring + eBPF 2021-05-18 14:39 ` Pavel Begunkov @ 2021-05-19 16:55 ` Christian Dietrich 2021-05-20 11:14 ` Pavel Begunkov 0 siblings, 1 reply; 17+ messages in thread From: Christian Dietrich @ 2021-05-19 16:55 UTC (permalink / raw) To: Pavel Begunkov, io-uring; +Cc: Horst Schirmeier, Franz-B. Tuneke Pavel Begunkov <[email protected]> [18. May 2021]: >> If we extend the semantic of IOSEQ_IO_LINK instead of introducing a new >> flag, we should be able to limit the problem, or? >> >> - With synchronize_group=0, the usual link-the-next SQE semantic could >> remain. >> - While synchronize_group!=0 could expose the described synchronization >> semantic. >> >> Thereby, the overhead is at least hidden behind the existing check for >> IOSEQ_IO_LINK, which is there anyway. Do you consider IOSQE_IO_LINK=1 >> part of the hot path? > > Let's clarify in case I misunderstood you. In a snippet below, should > it serialise execution of sqe1 and sqe2, so they don't run > concurrently? ,---- | > prep_sqe1(group=1); | > submit(); | > prep_sqe2(group=1); | > submit(); `---- Yes, in this snippet both SQEs should serialize. However, in this case, as they are submitted in sequence, it would be sufficient to use group=0. Let's make an example, were synchronization groups actually make a difference: | prep_sqe1(group=1); submit(); | prep_sqe2(group=3); submit(); | prep_sqe3(group=1); submit(); | ... time passes ... no sqe finishes | prep_sqe4(group=3); submit(); | .... time passes... sqe1-sqe3 finish | prep_sqe5(group=1); submit(); In this example, we could execute SQE1 and SQE2 in parallel, while SQE3 must be executed after SQE1. Furthermore, with synchronization groups, we can sequence SQE4 after SEQ2, although SQE3 was submitted in the meantime. This could not be achieved with linking on the same io_uring. For SQE5, we specify a synchronization group, however, as SQE1 and SQE3 have already finished, it can be started right one. > Once request is submitted we don't keep an explicit reference to it, > and it's hard and unreliably trying to find it, so would not really be > "submission" time, but would require additional locking: > > 1) either on completion of a request it looks up its group, but > then submission should do +1 spinlock to keep e.g. a list for each > group. > 2) or try to find a running request and append to its linked list, > but that won't work. > 3) or do some other magic, but all options would rather be far from > free. Ok, by looking at the code, submission side and completion side are currently uncoupled to each other (aka. no common spinlock). And this is one important source of performance. Right? Then this is something we have to keep. Ok, I'm not sure I fully understand these three variants, but I think that my proposal was aiming for option 2. However, I'm not quite sure why this is not possible. What would be wrong with the following proposal, which would also be applied to the regular IO_LINK (sync_group 0). Each io_ring_ctx already has a 'struct io_submit_state'. There, we replace the submit link with an array of N 'struct io_kiocb **': struct io_submit_state { .... struct io_kiocb ** sync_groups[16]; .... } These array elements point directly to the link field of the last element submitted for that synchronization group. Furthermore, we extend io_kiocb to store its synchronization group: struct io_kiocb { .... u8 sync_group; .... 
} On the completion side, we extend __io_req_find_next to unregister itself from the io_submit_state of its ring: u8 sg = req->sync_group; if (req->ctx.submit_state.sync_groups[sg] == &(req->link)) { // We might be the last one. struct io_kiocb ** x = req->link ? &(req->link->link) : NULL; CAS(&(req->ctx.submit_state.sync_groups[sg]), &(req->link), x); // CAS failure is no problem. } // At this point, req->link cannot be changed by the submission side, // but it will start a new chain or append to our successor. nxt = req->link; req->link = NULL; return nxt; With this extension, the cost for removing the completed request from the submit state costs one load and one comparision, if linking is used and we are the last one on the chain. Otherwise, we pay one compare_and_swap for it, which is required if submission and completion should be able to run fully parallel. This isn't for free. At submission time, we have to append requests, if there is a predecessor. For this, we extend io_submit_sqe to work with multiple groups: u8 sg = req->sync_group; struct io_kiocb **link_field_new = (req->flags & (REQ_F_LINK | REQ_F_HARDLINK)) ? &(req->link) : NULL; retry: struct io_kiocb **link_field = ctx->sync_groups[sg] if (link_field) { // Try to append to previous SQE. However, we might run in // parallel to __io_req_find_next. // Edit the link field of the previous SQE. *link_field = req; if(! CAS(&ctx->sync_groups[sg], link_field, link_field_new)) goto retry; // CAS failed. Last SQE was completed while we // prepared the update } else { // There is no previous one, we are alone. ctx->sync_group[sg] = link_field_new; } In essence, the sync_groups would be a lock_free queue with a dangling head that is even wait-free on the completion side. The above is surely not correct, but with a few strategic load_aquire and the store_release it probably can be made correct. And while it is not free, there already should be a similar kind of synchronization between submission and completion if it should be possible to link SQE to SQEs that are already in flight and could complete while we want to link it. Otherwise, SQE linking would only work for SQEs that are submitted in one go, but as io_submit_state_end() does not clear state->link.head, I think this is supposed to work. chris -- Prof. Dr.-Ing. Christian Dietrich Operating System Group (E-EXK4) Technische Universität Hamburg Am Schwarzenberg-Campus 3 (E), 4.092 21073 Hamburg eMail: [email protected] Tel: +49 40 42878 2188 WWW: https://osg.tuhh.de/ ^ permalink raw reply [flat|nested] 17+ messages in thread
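A standalone way to write the "few strategic load_acquire/store_release" variant sketched in the message above is Vyukov-style intrusive MPSC chaining. The toy model below covers a single synchronization group with plain C11 atomics and no io_uring types; it deliberately ignores the request-lifetime and cancellation interactions raised later in the thread, and it also shows where the not-for-free part sits: the completion side may have to spin over the short window in which a submitter has swapped the tail but has not yet written prev->next.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    struct req {
            _Atomic(struct req *) next;
            /* ... */
    };

    /* last queued request of the group, or NULL while the group is idle */
    static _Atomic(struct req *) group_last;

    /* submission side: returns true if req must be started immediately
     * (group was idle), false if it was chained behind a running request */
    static bool group_append(struct req *req)
    {
            struct req *prev;

            atomic_store_explicit(&req->next, NULL, memory_order_relaxed);
            prev = atomic_exchange_explicit(&group_last, req,
                                            memory_order_acq_rel);
            if (!prev)
                    return true;    /* idle group: start a new chain */
            /* publish the link; the completing side may be spinning on it */
            atomic_store_explicit(&prev->next, req, memory_order_release);
            return false;
    }

    /* completion side: returns the next request of the chain, or NULL */
    static struct req *group_complete(struct req *req)
    {
            struct req *expected = req, *next;

            /* if req is still the last one, mark the group idle and stop */
            if (atomic_compare_exchange_strong_explicit(&group_last, &expected,
                                                        NULL,
                                                        memory_order_acq_rel,
                                                        memory_order_acquire))
                    return NULL;
            /* somebody appended after req: wait for the link to appear */
            while (!(next = atomic_load_explicit(&req->next,
                                                 memory_order_acquire)))
                    ;       /* tiny window, but this is the non-free part */
            return next;
    }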
* Re: [RFC] Programming model for io_uring + eBPF 2021-05-19 16:55 ` Christian Dietrich @ 2021-05-20 11:14 ` Pavel Begunkov 2021-05-20 15:01 ` Christian Dietrich 0 siblings, 1 reply; 17+ messages in thread From: Pavel Begunkov @ 2021-05-20 11:14 UTC (permalink / raw) To: Christian Dietrich, io-uring; +Cc: Horst Schirmeier, Franz-B. Tuneke On 5/19/21 5:55 PM, Christian Dietrich wrote: > Pavel Begunkov <[email protected]> [18. May 2021]: > >>> If we extend the semantic of IOSEQ_IO_LINK instead of introducing a new >>> flag, we should be able to limit the problem, or? >>> >>> - With synchronize_group=0, the usual link-the-next SQE semantic could >>> remain. >>> - While synchronize_group!=0 could expose the described synchronization >>> semantic. >>> >>> Thereby, the overhead is at least hidden behind the existing check for >>> IOSEQ_IO_LINK, which is there anyway. Do you consider IOSQE_IO_LINK=1 >>> part of the hot path? >> >> Let's clarify in case I misunderstood you. In a snippet below, should >> it serialise execution of sqe1 and sqe2, so they don't run >> concurrently? > > ,---- > | > prep_sqe1(group=1); > | > submit(); > | > prep_sqe2(group=1); > | > submit(); > `---- > > Yes, in this snippet both SQEs should serialize. However, in this case, > as they are submitted in sequence, it would be sufficient to use > group=0. > > Let's make an example, were synchronization groups actually make a > difference: > > | prep_sqe1(group=1); submit(); > | prep_sqe2(group=3); submit(); > | prep_sqe3(group=1); submit(); > | ... time passes ... no sqe finishes > | prep_sqe4(group=3); submit(); > | .... time passes... sqe1-sqe3 finish > | prep_sqe5(group=1); submit(); > > In this example, we could execute SQE1 and SQE2 in parallel, while SQE3 > must be executed after SQE1. > > Furthermore, with synchronization groups, we can sequence SQE4 after > SEQ2, although SQE3 was submitted in the meantime. This could not be > achieved with linking on the same io_uring. > > For SQE5, we specify a synchronization group, however, as SQE1 and SQE3 > have already finished, it can be started right one. Yeah, got it right, thanks >> Once request is submitted we don't keep an explicit reference to it, >> and it's hard and unreliably trying to find it, so would not really be >> "submission" time, but would require additional locking: >> >> 1) either on completion of a request it looks up its group, but >> then submission should do +1 spinlock to keep e.g. a list for each >> group. >> 2) or try to find a running request and append to its linked list, >> but that won't work. >> 3) or do some other magic, but all options would rather be far from >> free. > > Ok, by looking at the code, submission side and completion side are > currently uncoupled to each other (aka. no common spinlock). And this is > one important source of performance. Right? Then this is something we > have to keep. atomics/spinlocks scale purely, can be used we would surely prefer to avoid them in hot paths if possible. > Ok, I'm not sure I fully understand these three variants, but I think > that my proposal was aiming for option 2. However, I'm not quite sure > why this is not possible. What would be wrong with the following > proposal, which would also be applied to the regular IO_LINK (sync_group 0). You describe more like 1, the second was more like how cancellation requests find their targets. In any case, all variants are possible to implement. 
The question is in viability / performance cost considering that can be done in userspace, and as it might know have more knowledge can actually be implemented in faster manner. E.g., if userspace is single threaded or already finely sync submission/completion, the exactly same algorithm can be implemented without any extra synchronisation there. Also consider that it adds overhead for those not using it, it's a yet another special case/feature that can't ever be undone (ABI compatibility), has maintenance burden, and unrealised future performance costs (when it stands in the way of optimisation, as DRAIN does). Not telling that the feature can't have place, just needs good enough justification (e.g. performance or opening new use cases) comparing to complexity/etc. So the simpler/less intrusive the better. > Each io_ring_ctx already has a 'struct io_submit_state'. There, we > replace the submit link with an array of N 'struct io_kiocb **': > > struct io_submit_state { > .... > struct io_kiocb ** sync_groups[16]; > .... > } > > These array elements point directly to the link field of the last > element submitted for that synchronization group. > Furthermore, we extend io_kiocb to store its synchronization group: > > struct io_kiocb { > .... > u8 sync_group; > .... > } > > On the completion side, we extend __io_req_find_next to unregister > itself from the io_submit_state of its ring: > > u8 sg = req->sync_group; > if (req->ctx.submit_state.sync_groups[sg] == &(req->link)) { > // We might be the last one. > struct io_kiocb ** x = req->link ? &(req->link->link) : NULL; > CAS(&(req->ctx.submit_state.sync_groups[sg]), &(req->link), x); > // CAS failure is no problem. > } > // At this point, req->link cannot be changed by the submission side, > // but it will start a new chain or append to our successor. > nxt = req->link; > req->link = NULL; > return nxt; > > With this extension, the cost for removing the completed request from > the submit state costs one load and one comparision, if linking is used > and we are the last one on the chain. > Otherwise, we pay one compare_and_swap for it, which is required if > submission and completion should be able to run fully parallel. This > isn't for free. > > At submission time, we have to append requests, if there is a > predecessor. For this, we extend io_submit_sqe to work with multiple > groups: > > u8 sg = req->sync_group; > struct io_kiocb **link_field_new = > (req->flags & (REQ_F_LINK | REQ_F_HARDLINK)) ? &(req->link) : NULL; > > retry: > struct io_kiocb **link_field = ctx->sync_groups[sg] > if (link_field) { > // Try to append to previous SQE. However, we might run in > // parallel to __io_req_find_next. > > // Edit the link field of the previous SQE. > *link_field = req; By this time the req might already be gone (and sync_groups[] changed to NULL/next), and so you may get use after free. Also modifying it will break some cancellation bits who want it to be stable and conditionally sync using completion_lock. Can be fixed, but you will find tons of such small things fixing which will be a bone in the throat even for paths not really using it. See, IOSQE_IO_DRAIN, it's in hot completion part, it's in several hot places of the submission path, and got worse after links were added. Would love to kill it off completely, but compatibility. > if(! CAS(&ctx->sync_groups[sg], link_field, link_field_new)) > goto retry; // CAS failed. Last SQE was completed while we > // prepared the update > } else { > // There is no previous one, we are alone. 
> ctx->sync_group[sg] = link_field_new; > } > > In essence, the sync_groups would be a lock_free queue with a dangling > head that is even wait-free on the completion side. The above is surely > not correct, but with a few strategic load_aquire and the store_release > it probably can be made correct.

Neither of those is free -- cache bouncing

> > And while it is not free, there already should be a similar kind of > synchronization between submission and completion if it should be > possible to link SQE to SQEs that are already in flight and could > complete while we want to link it. > Otherwise, SQE linking would only work for SQEs that are submitted in > one go, but as io_submit_state_end() does not clear > state->link.head, I think this is supposed to work.

Just to clarify, it submits all the collected links before returning; the invariant is just based on link.head: if it is NULL there is no link, if not, the tail is initialised.

-- Pavel Begunkov ^ permalink raw reply [flat|nested] 17+ messages in thread
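As a concrete reading of the point above that "the exactly same algorithm can be implemented without any extra synchronisation" by a single-threaded submitter, the per-group bookkeeping can live entirely in the application. In the sketch below only the liburing calls are real API; the op/group structures and helper names are invented for illustration, and submission batching and error handling are omitted:

    #include <liburing.h>
    #include <stdbool.h>

    struct op { int fd; void *buf; unsigned len; off_t off; struct op *next; };
    struct group { struct op *head, *tail; bool inflight; };
    static struct group groups[16];

    static void issue(struct io_uring *ring, unsigned g, struct op *op)
    {
            struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

            io_uring_prep_read(sqe, op->fd, op->buf, op->len, op->off);
            sqe->user_data = g;     /* group id travels with the CQE */
            io_uring_submit(ring);  /* one-by-one for simplicity */
    }

    /* run now if the group is idle, otherwise park behind the running one */
    void group_queue(struct io_uring *ring, unsigned g, struct op *op)
    {
            struct group *grp = &groups[g];

            if (!grp->inflight) {
                    grp->inflight = true;
                    issue(ring, g, op);
                    return;
            }
            op->next = NULL;
            if (grp->tail)
                    grp->tail->next = op;
            else
                    grp->head = op;
            grp->tail = op;
    }

    /* completion handler: kick the next parked op of the same group */
    void group_reap(struct io_uring *ring, struct io_uring_cqe *cqe)
    {
            unsigned g = (unsigned)cqe->user_data;
            struct group *grp = &groups[g];
            struct op *op = grp->head;

            if (!op) {
                    grp->inflight = false;
                    return;
            }
            grp->head = op->next;
            if (!grp->head)
                    grp->tail = NULL;
            issue(ring, g, op);
    }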
* Re: [RFC] Programming model for io_uring + eBPF 2021-05-20 11:14 ` Pavel Begunkov @ 2021-05-20 15:01 ` Christian Dietrich 2021-05-21 10:27 ` Pavel Begunkov 0 siblings, 1 reply; 17+ messages in thread From: Christian Dietrich @ 2021-05-20 15:01 UTC (permalink / raw) To: Pavel Begunkov, io-uring; +Cc: Horst Schirmeier, Franz-B. Tuneke Pavel Begunkov <[email protected]> [20. May 2021]: > atomics/spinlocks scale purely, can be used we would surely prefer to > avoid them in hot paths if possible. I understand that. Do you consider SQE-Linking part of the hot path or is it OK if linking SQE results overhead? > E.g., if userspace is single threaded or already finely sync > submission/completion, the exactly same algorithm can be implemented > without any extra synchronisation there. The problem that I see is that eBPF in io_uring breaks this fine synchronization as eBPF SQE submission and userspace SQE submission can run in parallel. But going back to my original wish: I wanted to ensure that I can serialize eBPF-SQEs such that I'm sure that they do not run in parallel. My idea was to use synchronization groups as a generalization of SQE linking in order to make it also useful for others (not only for eBPF). My reasoning being not doing this serialization in userspace is that I want to use the SQPOLL mode and execute long chains of IO/computation-SQEs without leaving the kernelspace. > Not telling that the feature can't have place, just needs good > enough justification (e.g. performance or opening new use cases) > comparing to complexity/etc. So the simpler/less intrusive the > better. I hope that you find this discussion as fruitful as I do. I really enjoy searching for an abstraction that is simple and yet powerful enough to fulfill user requirements. >> At submission time, we have to append requests, if there is a >> predecessor. For this, we extend io_submit_sqe to work with multiple >> groups: >> >> u8 sg = req->sync_group; >> struct io_kiocb **link_field_new = >> (req->flags & (REQ_F_LINK | REQ_F_HARDLINK)) ? &(req->link) : NULL; >> >> retry: >> struct io_kiocb **link_field = ctx->sync_groups[sg] >> if (link_field) { >> // Try to append to previous SQE. However, we might run in >> // parallel to __io_req_find_next. >> >> // Edit the link field of the previous SQE. >> *link_field = req; > > By this time the req might already be gone (and sync_groups[] changed to > NULL/next), and so you may get use after free. Also modifying it will > break some cancellation bits who want it to be stable and conditionally > sync using completion_lock. Yes, that is right if the completion side frees requests. Is this the case or is the SQE returned to an io_kiocb cache? >> In essence, the sync_groups would be a lock_free queue with a dangling >> head that is even wait-free on the completion side. The above is surely >> not correct, but with a few strategic load_aquire and the store_release >> it probably can be made correct. > > Neither those are free -- cache bouncing The problem that I had when thinking about the implementation is that IO_LINK semantic works in the wrong direction: Link the next SQE, whenever it comes to this SQE. If it would be the other way around ("Link this SQE to the previous one") it would be much easier as the cost would only arise if we actually request linking. But compatibility.. 
>> And while it is not free, there already should be a similar kind of >> synchronization between submission and completion if it should be >> possible to link SQE to SQEs that are already in flight and could >> complete while we want to link it. >> Otherwise, SQE linking would only work for SQEs that are submitted in >> one go, but as io_submit_state_end() does not clear >> state->link.head, I think this is supposed to work. > > Just to clarify, it submits all the collected link before returning, > just the invariant is based on link.head, if NULL there is no link, > if not tail is initialised. Ok, but what happens if the last SQE in an io_submit_sqes() call requests linking? Is it envisioned that the first SQE that comes with the next io_submit_sqes() is linked to that one? If this is not supported, what happens if I use the SQPOLL mode where the poller thread can partition my submitted SQEs at an arbitrary point into multiple io_submit_sqes() calls? If this is supported, link.head has to point to the last submitted SQE after the first io_submit_sqes()-call. Isn't then appending SQEs in the second io_submit_sqes()-call racy with the completion side. (With the same problems that I tried to solve? chris -- Prof. Dr.-Ing. Christian Dietrich Operating System Group (E-EXK4) Technische Universität Hamburg Am Schwarzenberg-Campus 3 (E), 4.092 21073 Hamburg eMail: [email protected] Tel: +49 40 42878 2188 WWW: https://osg.tuhh.de/ ^ permalink raw reply [flat|nested] 17+ messages in thread
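For contrast with the cross-submission linking asked about above, the case the interface does support is a chain that is fully formed before a single submit call. A minimal sketch with standard liburing calls (fds and buffer are placeholders):

    struct io_uring_sqe *sqe;

    /* read -> write -> nop, executed strictly in that order; the chain has
     * to be complete here, it cannot be extended after io_uring_submit() */
    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd_in, buf, sizeof(buf), 0);
    sqe->flags |= IOSQE_IO_LINK;

    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_write(sqe, fd_out, buf, sizeof(buf), 0);
    sqe->flags |= IOSQE_IO_LINK;

    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_nop(sqe);         /* last element: no link flag */

    io_uring_submit(&ring);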
* Re: [RFC] Programming model for io_uring + eBPF 2021-05-20 15:01 ` Christian Dietrich @ 2021-05-21 10:27 ` Pavel Begunkov 2021-05-27 11:12 ` Christian Dietrich 0 siblings, 1 reply; 17+ messages in thread From: Pavel Begunkov @ 2021-05-21 10:27 UTC (permalink / raw) To: Christian Dietrich, io-uring; +Cc: Horst Schirmeier, Franz-B. Tuneke On 5/20/21 4:01 PM, Christian Dietrich wrote: > Pavel Begunkov <[email protected]> [20. May 2021]: > >> atomics/spinlocks scale purely, can be used we would surely prefer to >> avoid them in hot paths if possible. > > I understand that. Do you consider SQE-Linking part of the hot path or > is it OK if linking SQE results overhead? > >> E.g., if userspace is single threaded or already finely sync >> submission/completion, the exactly same algorithm can be implemented >> without any extra synchronisation there. > > The problem that I see is that eBPF in io_uring breaks this fine > synchronization as eBPF SQE submission and userspace SQE submission can > run in parallel. It definitely won't be a part of ABI, but they actually do serialise at the moment. But even if not, there are way doing that sync-less not ctx-wide, but per submitter. > But going back to my original wish: I wanted to ensure that I can > serialize eBPF-SQEs such that I'm sure that they do not run in parallel. > My idea was to use synchronization groups as a generalization of > SQE linking in order to make it also useful for others (not only for eBPF). So, let's dissect it a bit more, why do you need serialising as such? What use case you have in mind, and let's see if it's indeed can't be implemented efficiently with what we have. To recap: BPFs don't share SQ with userspace at all, and may have separate CQs to reap events from. You may post an event and it's wait synchronised, so may act as a message-based synchronisation, see test3 in the recently posted v2 for example. I'll also be adding futex support (bpf + separate requests), it might play handy for some users. > My reasoning being not doing this serialization in userspace is that I > want to use the SQPOLL mode and execute long chains of > IO/computation-SQEs without leaving the kernelspace. btw, "in userspace" is now more vague as it can be done by BPF as well. For some use cases I'd expect BPF acting as a reactor, e.g. on completion of previous CQEs and submitting new requests in response, and so keeping it entirely in kernel space until it have anything to tell to the userspace, e.g. by posting into the main CQ. >> Not telling that the feature can't have place, just needs good >> enough justification (e.g. performance or opening new use cases) >> comparing to complexity/etc. So the simpler/less intrusive the >> better. > > I hope that you find this discussion as fruitful as I do. I really enjoy > searching for an abstraction that is simple and yet powerful enough to > fulfill user requirements. It is, I just have my doubts that it's the most flexible/performant option. Would be easier to reason with specific use cases in mind, so we may see if it can't be done in a better way >>> At submission time, we have to append requests, if there is a >>> predecessor. For this, we extend io_submit_sqe to work with multiple >>> groups: >>> >>> u8 sg = req->sync_group; >>> struct io_kiocb **link_field_new = >>> (req->flags & (REQ_F_LINK | REQ_F_HARDLINK)) ? &(req->link) : NULL; >>> >>> retry: >>> struct io_kiocb **link_field = ctx->sync_groups[sg] >>> if (link_field) { >>> // Try to append to previous SQE. 
However, we might run in >>> // parallel to __io_req_find_next. >>> >>> // Edit the link field of the previous SQE. >>> *link_field = req; >> >> By this time the req might already be gone (and sync_groups[] changed to >> NULL/next), and so you may get use after free. Also modifying it will >> break some cancellation bits who want it to be stable and conditionally >> sync using completion_lock. > > Yes, that is right if the completion side frees requests. Is this the > case or is the SQE returned to an io_kiocb cache? Might be freed. Caching them is a recent feature and doesn't cover all the cases, at least for now. >>> In essence, the sync_groups would be a lock_free queue with a dangling >>> head that is even wait-free on the completion side. The above is surely >>> not correct, but with a few strategic load_aquire and the store_release >>> it probably can be made correct. >> >> Neither those are free -- cache bouncing > > The problem that I had when thinking about the implementation is that > IO_LINK semantic works in the wrong direction: Link the next SQE, > whenever it comes to this SQE. If it would be the other way around > ("Link this SQE to the previous one") it would be much easier as the > cost would only arise if we actually request linking. But compatibility.. Stack vs queue style linking? If I understand what you mean right, that's because this is how SQ is parsed and so that's the most efficient way. >>> And while it is not free, there already should be a similar kind of >>> synchronization between submission and completion if it should be >>> possible to link SQE to SQEs that are already in flight and could >>> complete while we want to link it. >>> Otherwise, SQE linking would only work for SQEs that are submitted in >>> one go, but as io_submit_state_end() does not clear >>> state->link.head, I think this is supposed to work. >> >> Just to clarify, it submits all the collected link before returning, >> just the invariant is based on link.head, if NULL there is no link, >> if not tail is initialised. > > Ok, but what happens if the last SQE in an io_submit_sqes() call > requests linking? Is it envisioned that the first SQE that comes with > the next io_submit_sqes() is linked to that one? No, it doesn't leave submission boundary (e.g. a single syscall). In theory may be left there _not_ submitted, but don't see much profit in it. > If this is not supported, what happens if I use the SQPOLL mode where > the poller thread can partition my submitted SQEs at an arbitrary > point into multiple io_submit_sqes() calls? It's not arbitrary, submission is atomic in nature, first you fill SQEs in memory but they are not visible to SQPOLL in a meanwhile, and then you "commit" them by overwriting SQ tail pointer. Not a great exception for that is shared SQPOLL task, but it just waits someone to take care of the case. if (cap_entries && to_submit > 8) to_submit = 8; > If this is supported, link.head has to point to the last submitted SQE after > the first io_submit_sqes()-call. Isn't then appending SQEs in the > second io_submit_sqes()-call racy with the completion side. (With the > same problems that I tried to solve? Exactly why it's not supported -- Pavel Begunkov ^ permalink raw reply [flat|nested] 17+ messages in thread
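A rough sketch of the "fill SQEs, then commit by overwriting the SQ tail pointer" step described above. This approximates what liburing does internally when flushing the SQ (it uses an io_uring_smp_store_release() helper for the same effect); the ring structure is trimmed down here to the two fields that matter:

    /* ktail points into the SQ ring page shared with the kernel (mmap'ed);
     * sqe_tail counts how many SQEs the application has filled in so far */
    struct sq_view {
            unsigned *ktail;
            unsigned sqe_tail;
    };

    static void sq_commit(struct sq_view *sq)
    {
            /* everything written into the SQE array before this point becomes
             * visible to the kernel -- including an SQPOLL thread -- only once
             * the tail index is published with release semantics */
            __atomic_store_n(sq->ktail, sq->sqe_tail, __ATOMIC_RELEASE);
    }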
* Re: [RFC] Programming model for io_uring + eBPF 2021-05-21 10:27 ` Pavel Begunkov @ 2021-05-27 11:12 ` Christian Dietrich 2021-06-02 10:47 ` Pavel Begunkov 0 siblings, 1 reply; 17+ messages in thread From: Christian Dietrich @ 2021-05-27 11:12 UTC (permalink / raw) To: Pavel Begunkov, io-uring; +Cc: Horst Schirmeier, Franz-B. Tuneke Pavel Begunkov <[email protected]> [21. May 2021]: >> The problem that I see is that eBPF in io_uring breaks this fine >> synchronization as eBPF SQE submission and userspace SQE submission can >> run in parallel. > > It definitely won't be a part of ABI, but they actually do serialise > at the moment. They serialize because they are executed by the same worker thread, right? Perhaps that is the solution to my synchronization problem. If/when io_uring supports more than one eBPF executioners, we should make the number of executors configurable at setup time. Thereby, the user can implicitly manage serialization of eBPF execution. >> But going back to my original wish: I wanted to ensure that I can >> serialize eBPF-SQEs such that I'm sure that they do not run in parallel. >> My idea was to use synchronization groups as a generalization of >> SQE linking in order to make it also useful for others (not only for eBPF). > > So, let's dissect it a bit more, why do you need serialising as > such? What use case you have in mind, and let's see if it's indeed > can't be implemented efficiently with what we have. What I want to do is to manipulate (read-calculate-update) user memory from eBPF without the need to synchronize between eBPF invocations. As eBPF invocations have a run-to-completion semantic, it feels bad to use lock-based synchronization. Besides waiting for user-memory to be swapped in, they will be usually short and plug together results and newly emitted CQEs. > To recap: BPFs don't share SQ with userspace at all, and may have > separate CQs to reap events from. You may post an event and it's > wait synchronised, so may act as a message-based synchronisation, > see test3 in the recently posted v2 for example. I'll also be > adding futex support (bpf + separate requests), it might > play handy for some users. I'm sure that it is possible to use those mechanisms for synchronizing, but I assume that explicit synchronization (locks, passing tokens around) is more expensive than sequenzializing requests (implicit synchronization) before starting to execute them. But probably, we need some benchmarks to see what performs better. >> My reasoning being not doing this serialization in userspace is that I >> want to use the SQPOLL mode and execute long chains of >> IO/computation-SQEs without leaving the kernelspace. > > btw, "in userspace" is now more vague as it can be done by BPF > as well. For some use cases I'd expect BPF acting as a reactor, > e.g. on completion of previous CQEs and submitting new requests > in response, and so keeping it entirely in kernel space until > it have anything to tell to the userspace, e.g. by posting > into the main CQ. Yes, exactly that is my motivation. But I also think that it is a useful pattern to have many eBPF callbacks pending (e.g. one for each connection). With one pending invocation per connection, synchronization with a fixed number of additional CQEs might be problematic: For example, for a per-connection barrier synchronization with the CQ-reap approach, one needs one CQ for each connection. 
>> The problem that I had when thinking about the implementation is that >> IO_LINK semantic works in the wrong direction: Link the next SQE, >> whenever it comes to this SQE. If it would be the other way around >> ("Link this SQE to the previous one") it would be much easier as the >> cost would only arise if we actually request linking. But compatibility.. > > Stack vs queue style linking? If I understand what you mean right, that's > because this is how SQ is parsed and so that's the most efficient way. No, I did not want to argue about the ordering within the link chain, but with the semantic of link flag. I though that it might have been beneficial to let the flag indicate that the SQE should be linked to the previous one. However, after thinking this through in more detail I now believe that it does not make any difference for the submission path. >> Ok, but what happens if the last SQE in an io_submit_sqes() call >> requests linking? Is it envisioned that the first SQE that comes with >> the next io_submit_sqes() is linked to that one? > > No, it doesn't leave submission boundary (e.g. a single syscall). > In theory may be left there _not_ submitted, but don't see much > profit in it. > >> If this is not supported, what happens if I use the SQPOLL mode where >> the poller thread can partition my submitted SQEs at an arbitrary >> point into multiple io_submit_sqes() calls? > > It's not arbitrary, submission is atomic in nature, first you fill > SQEs in memory but they are not visible to SQPOLL in a meanwhile, > and then you "commit" them by overwriting SQ tail pointer. > > Not a great exception for that is shared SQPOLL task, but it > just waits someone to take care of the case. > > if (cap_entries && to_submit > 8) > to_submit = 8; > >> If this is supported, link.head has to point to the last submitted SQE after >> the first io_submit_sqes()-call. Isn't then appending SQEs in the >> second io_submit_sqes()-call racy with the completion side. (With the >> same problems that I tried to solve? > > Exactly why it's not supported Thank you for this detailed explanation. I now understand the design decision with the SQE linking much better and why delayed linking of SQEs introduces linking between completion and submission side that is undesirable. chris -- Prof. Dr.-Ing. Christian Dietrich Operating System Group (E-EXK4) Technische Universität Hamburg Am Schwarzenberg-Campus 3 (E), 4.092 21073 Hamburg eMail: [email protected] Tel: +49 40 42878 2188 WWW: https://osg.tuhh.de/ ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] Programming model for io_uring + eBPF 2021-05-27 11:12 ` Christian Dietrich @ 2021-06-02 10:47 ` Pavel Begunkov 0 siblings, 0 replies; 17+ messages in thread From: Pavel Begunkov @ 2021-06-02 10:47 UTC (permalink / raw) To: Christian Dietrich, io-uring; +Cc: Horst Schirmeier, Franz-B. Tuneke On 5/27/21 12:12 PM, Christian Dietrich wrote: > Pavel Begunkov <[email protected]> [21. May 2021]: > >>> The problem that I see is that eBPF in io_uring breaks this fine >>> synchronization as eBPF SQE submission and userspace SQE submission can >>> run in parallel. >> >> It definitely won't be a part of ABI, but they actually do serialise >> at the moment. > > They serialize because they are executed by the same worker thread, > right? No, because all submissions are currently under ctx->uring_lock mutex and also executed in the submitting task context (no io-wq). I hope to get rid of the second restriction, i.e. prior upstreaming, and maybe do something about the first one in far future if there will be a need. > Perhaps that is the solution to my synchronization problem. If/when > io_uring supports more than one eBPF executioners, we should make the > number of executors configurable at setup time. Thereby, the user can > implicitly manage serialization of eBPF execution. I wouldn't keep it a part of ABI, but may be good enough for experiments. So, it'll be UB and may break unless we figure something out. ... actually it sounds like a sane idea to do lock grabbing lazy, i.e. only if it actually submits requests. That may make sense if you control several requests by BPF, e.g. keeping QD + batching. >>> But going back to my original wish: I wanted to ensure that I can >>> serialize eBPF-SQEs such that I'm sure that they do not run in parallel. >>> My idea was to use synchronization groups as a generalization of >>> SQE linking in order to make it also useful for others (not only for eBPF). >> >> So, let's dissect it a bit more, why do you need serialising as >> such? What use case you have in mind, and let's see if it's indeed >> can't be implemented efficiently with what we have. > > What I want to do is to manipulate (read-calculate-update) user memory > from eBPF without the need to synchronize between eBPF invocations. > > As eBPF invocations have a run-to-completion semantic, it feels bad to > use lock-based synchronization. Besides waiting for user-memory to be > swapped in, they will be usually short and plug together results and > newly emitted CQEs. swapping can't be done under spinlocks, that's a reason why userspace read/write are available only to sleepable BPF. btw, I was thinking about adding atomic ops with userspace memory. I can code an easy solution, but may have too much of additional overhead. Same situation with normal load/stores being slow mentioned below. > >> To recap: BPFs don't share SQ with userspace at all, and may have >> separate CQs to reap events from. You may post an event and it's >> wait synchronised, so may act as a message-based synchronisation, >> see test3 in the recently posted v2 for example. I'll also be >> adding futex support (bpf + separate requests), it might >> play handy for some users. > > I'm sure that it is possible to use those mechanisms for synchronizing, > but I assume that explicit synchronization (locks, passing tokens > around) is more expensive than sequenzializing requests (implicit > synchronization) before starting to execute them. As you know it depends on work/synchronisation ratio, but I'm also concerned about scalability. 
Consider same argument applied to sequenzializing normal user threads (e.g. pthreads). Not quite of the same extent but shows the idea. > But probably, we need some benchmarks to see what performs better. Would be great to have some in any case. Userspace read/write may be slow. It can be solved (if a problem) but would require work to be done on the BPF kernel side >>> My reasoning being not doing this serialization in userspace is that I >>> want to use the SQPOLL mode and execute long chains of >>> IO/computation-SQEs without leaving the kernelspace. >> >> btw, "in userspace" is now more vague as it can be done by BPF >> as well. For some use cases I'd expect BPF acting as a reactor, >> e.g. on completion of previous CQEs and submitting new requests >> in response, and so keeping it entirely in kernel space until >> it have anything to tell to the userspace, e.g. by posting >> into the main CQ. > > Yes, exactly that is my motivation. But I also think that it is a useful > pattern to have many eBPF callbacks pending (e.g. one for each > connection). Good case. One more moment is how much generality such cases need. E.g. there are some data structures in BPF, don't remember names but like perf buffers or perf rings. > > With one pending invocation per connection, synchronization with a fixed > number of additional CQEs might be problematic: For example, for a > per-connection barrier synchronization with the CQ-reap approach, one > needs one CQ for each connection. > >>> The problem that I had when thinking about the implementation is that >>> IO_LINK semantic works in the wrong direction: Link the next SQE, >>> whenever it comes to this SQE. If it would be the other way around >>> ("Link this SQE to the previous one") it would be much easier as the >>> cost would only arise if we actually request linking. But compatibility.. >> >> Stack vs queue style linking? If I understand what you mean right, that's >> because this is how SQ is parsed and so that's the most efficient way. > > No, I did not want to argue about the ordering within the link chain, > but with the semantic of link flag. I though that it might have been > beneficial to let the flag indicate that the SQE should be linked to the > previous one. However, after thinking this through in more detail I now > believe that it does not make any difference for the submission path. > > >>> Ok, but what happens if the last SQE in an io_submit_sqes() call >>> requests linking? Is it envisioned that the first SQE that comes with >>> the next io_submit_sqes() is linked to that one? >> >> No, it doesn't leave submission boundary (e.g. a single syscall). >> In theory may be left there _not_ submitted, but don't see much >> profit in it. >> >>> If this is not supported, what happens if I use the SQPOLL mode where >>> the poller thread can partition my submitted SQEs at an arbitrary >>> point into multiple io_submit_sqes() calls? >> >> It's not arbitrary, submission is atomic in nature, first you fill >> SQEs in memory but they are not visible to SQPOLL in a meanwhile, >> and then you "commit" them by overwriting SQ tail pointer. >> >> Not a great exception for that is shared SQPOLL task, but it >> just waits someone to take care of the case. >> >> if (cap_entries && to_submit > 8) >> to_submit = 8; >> >>> If this is supported, link.head has to point to the last submitted SQE after >>> the first io_submit_sqes()-call. Isn't then appending SQEs in the >>> second io_submit_sqes()-call racy with the completion side. 
(With the >>> same problems that I tried to solve? >> >> Exactly why it's not supported > > Thank you for this detailed explanation. I now understand the design > decision with the SQE linking much better and why delayed linking of > SQEs introduces linking between completion and submission side that is > undesirable.

You're welcome, hope we'll get the API into shape in no time

-- Pavel Begunkov ^ permalink raw reply [flat|nested] 17+ messages in thread
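The BPF data structures alluded to earlier in this message ("like perf buffers or perf rings") are presumably the perf event buffer and the newer BPF ring buffer (BPF_MAP_TYPE_RINGBUF). As a generic illustration unrelated to the io_uring patches, a BPF program can push per-connection events to userspace through the ring buffer like this (struct and map names are invented for the example; userspace would read them via libbpf's ring_buffer__poll()):

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct conn_event {
            __u32 conn_id;
            __u32 state;
    };

    /* BPF ring buffer shared with userspace */
    struct {
            __uint(type, BPF_MAP_TYPE_RINGBUF);
            __uint(max_entries, 1 << 20);
    } events SEC(".maps");

    static __always_inline int emit_event(__u32 conn_id, __u32 state)
    {
            struct conn_event *e;

            e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
            if (!e)
                    return 0;       /* buffer full: drop the event */
            e->conn_id = conn_id;
            e->state = state;
            bpf_ringbuf_submit(e, 0);
            return 0;
    }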
* Re: [RFC] Programming model for io_uring + eBPF 2021-05-05 12:57 ` Christian Dietrich 2021-05-05 16:13 ` Christian Dietrich @ 2021-05-07 15:10 ` Pavel Begunkov 1 sibling, 0 replies; 17+ messages in thread From: Pavel Begunkov @ 2021-05-07 15:10 UTC (permalink / raw) To: Christian Dietrich, io-uring; +Cc: Horst Schirmeier, Franz-B. Tuneke On 5/5/21 1:57 PM, Christian Dietrich wrote: > Pavel Begunkov <[email protected]> [01. May 2021]: > >> No need to register maps, it's passed as an fd. Looks I have never >> posted any example, but you can even compile fds in in bpf programs >> at runtime. >> >> fd = ...; >> bpf_prog[] = { ..., READ_MAP(fd), ...}; >> compile_bpf(bpf_prog); >> >> I was thinking about the opposite, to allow to optionally register >> them to save on atomic ops. Not in the next patch iteration though > > That was exactly my idea: make it possible to register them > optionally. > > I also think that registering the eBPF program FD could be made > optionally, since this would be similar to other FD references that are > submitted to io_uring. To wrap it, map registration wasn't required nor possible from the beginning, but can be interesting to allow it optionally. And the opposite for programs. >> Yes, at least that's how I envision it. But that's a good question >> what else we want to pass. E.g. some arbitrary ids (u64), that's >> forwarded to the program, it can be a pointer to somewhere or just >> a piece of current state. > > So, we will have the regular user_data attribute there. So why add > another potential argument to the program, besides passing an eBPF map? It may be more flexible or performant, I'm not eager adding more to it, but we'll see if it's good enough as is once we write some larger bpf programs. >>> One thought that came to my mind: Why do we have to register the eBPF >>> programs and maps? We could also just pass the FDs for those objects in >>> the SQE. As long as there is no other state, it could be the userspaces >>> choice to either attach it or pass it every time. For other FDs we >>> already support both modes, right? >> >> It doesn't register maps (see above). For not registering/attaching in >> advance -- interesting idea, I need it double check with bpf guys. > > My feeling would be the following: In any case, at submission we have to > resolve the SQE references to a 'bpf_prog *', if this is done via a > registered program in io_ring_ctx or via calling 'bpf_prog_get_type' is > only a matter of the frontend. Yes, I took on this point. > The only difference would be the reference couting. For registered > bpf_progs, we can increment the refcounter only once. For unregistered, > we would have to increment it for every SQE. Don't know if the added > flexibility is worth the effort. > >>> If we make it possible to pass some FD to an synchronization object >>> (e.g. semaphore), this might do the trick to support both modes at the interface. > > I was thinking further about this. And, really I could not come up with > a proper user-space visible 'object' that we can grab with an FD to > attach this semantic to as futexes come in form of a piece of memory. Right, not like, but exactly futexes. There is an interest in having futex requests, so need to write it up. > So perhaps, we would do something like > > // alloc 3 groups > io_uring_register(fd, REGISTER_SYNCHRONIZATION_GROUPS, 3); > > // submit a synchronized SQE > sqe->flags |= IOSQE_SYNCHRONIZE; > sqe->synchronize_group = 1; // could probably be restricted to uint8_t. 
> > When looking at this, this could generally be a nice feature to have > with SQEs, or? Hereby, the user could insert all of his SQEs and they > would run sequentially. In contrast to SQE linking, the order of SQEs > would not be determined, which might be beneficial at some point. It will be in the common path hurting performance for those not using it, and with no clear benefit that can't be implemented in userspace. And io_uring is thin enough for all those extra ifs to affect end performance. Let's consider if we run out of userspace options. >>> - FD to synchronization object during the execution >> >> can be in the context, in theory. We need to check available >> synchronisation means > > In the context you mean: there could be a pointer to user-memory and the > bpf program calls futex_wait? > >> The idea is to add several CQs to a single ring, regardless of BPF, >> and for any request of any type use specify sqe->cq_idx, and CQE >> goes there. And then BPF can benefit from it, sending its requests >> to the right CQ, not necessarily to a single one. >> It will be the responsibility of the userspace to make use of CQs >> right, e.g. allocating CQ per BPF program and so avoiding any sync, >> or do something more complex cross posting to several CQs. > > Would this be done at creation time or by registering them? I think that > creation time would be sufficient and it would ease the housekeeping. In > that case, we could get away with making all CQs equal in size and only > read out the number of the additional CQs from the io_uring_params. At creation time. Don't see a reason to move it to post creation, and see a small optimisation weighting for it. > > When adding a sqe->cq_idx: Do we have to introduce an IOSQE_SELET_OTHER > flag to not break userspace for backwards compatibility or is it > sufficient to assume that the user has zeroed the SQE beforehand? Fine by me assuming zeros. > However, I like the idea of making it a generic field of SQEs. > >>> When taking a step back, this is nearly a io_uring_enter(minwait=N)-SQE >> >> Think of it as a separate request type (won't be a separate, but still...) >> waiting on CQ, similarly to io_uring_enter(minwait=N) but doesn't block. >> >>> with an attached eBPF callback, or? At that point, we are nearly full >>> circle. >> >> What do you mean? > > I was just joking: With a minwait=N-like field in the eBPF-SQE, we can > mimic the behavior of io_uring_enter(minwait=N) but with an SQE. On the > result of that eBPF-SQE we can now actually wait with io_uring_enter(..) -- Pavel Begunkov ^ permalink raw reply [flat|nested] 17+ messages in thread
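For the "passed as an fd" side of this exchange, obtaining a BPF program fd that could then be handed to io_uring is plain libbpf; only the final registration opcode below is hypothetical (IORING_REGISTER_BPF stands in for whatever the experimental series ends up using and is not a mainline constant), and per-SQE fds would simply skip that last call:

    #include <bpf/libbpf.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* placeholder opcode; the real value would come from the BPF series */
    #ifndef IORING_REGISTER_BPF
    #define IORING_REGISTER_BPF 0   /* hypothetical */
    #endif

    int load_and_register(int ring_fd, const char *path, const char *name)
    {
            struct bpf_object *obj;
            struct bpf_program *prog;
            int prog_fd;

            obj = bpf_object__open_file(path, NULL);
            if (libbpf_get_error(obj))
                    return -1;
            if (bpf_object__load(obj))
                    goto err;

            prog = bpf_object__find_program_by_name(obj, name);
            if (!prog)
                    goto err;
            prog_fd = bpf_program__fd(prog);

            /* optional up-front registration, saving per-SQE fd lookup
             * and refcounting as discussed above */
            return syscall(__NR_io_uring_register, ring_fd,
                           IORING_REGISTER_BPF, &prog_fd, 1);
    err:
            bpf_object__close(obj);
            return -1;
    }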