On 01/02/2020 15:02, Andres Freund wrote:
>> Right, after a "failure" occurred for an IOSQE_IO_LINK request, all subsequent
>> requests in the link won't be executed, but completed with -ECANCELED. However,
>> if IOSQE_IO_HARDLINK is set for the request, it won't sever/break the link and
>> will continue to the next one.
>
> I think something along those lines should be added to the manpage... I
> think severing the link isn't really a good description, because it's
> not like it's separating off the tail to be independent, or such. If
> anything it's the opposite.
>
>
>>> Looks like it's defined in a somewhat ad hoc manner. For file read/write
>>> subsequent requests are failed if they are a short read/write. But
>>> e.g. for sendmsg that looks not to be the case.
>>>
>>
>> As you said, it's defined rather sporadically. We should unify it for it to
>> make sense. I'd prefer to follow the read/write pattern.
>
> I think one problem with that is that it's not necessarily useful to
> insist on the length being the maximum allowed length. E.g. for a
> recvmsg you'd likely want to not fail the request if you read less than
> what you provided for, because that's just a normal occurrence. It could
> e.g. be useful to just start the next recv (with a different buffer)
> immediately.
>
> I'm not even sure it's generally sensible for read either, as that
> doesn't work well for EOF, non-file FDs, ... Perhaps there's just no
> good solution though.

People have already asked about this; you can find the discussion somewhere
in the GitHub issues for liburing. In short, there are a lot of different
patterns, and it's not viable to implement them all in the kernel. There are
thoughts, ideas and plans around using BPF to deal with that. I've sent an
LSF/MM/BPF topic proposal about exactly that.

>
>
>>> Perhaps it'd make sense to reject use of IOSQE_IO_LINK outside ops where
>>> it's meaningful?
>>
>> If we disregard it for either length-based operations or the rest (or
>> whatever combination), the feature won't be flexible enough to be useful,
>> but in combination it allows removing a lot of context switches.
>
> I really don't want to make it less useful ;) - In fact I'm pretty
> excited about having it. I haven't yet implemented / benchmarked that,
> but I think for databases it is likely to be very good to achieve low
> but consistent IO queue depths for background tasks like checkpointing,
> readahead, writeback etc, while still having a low context switch
> rate. Without something like IOSQE_IO_LINK it's considerably harder to
> have continuous IO that doesn't impact higher priority IO like journal
> flushes.
>
> Andres Freund
>

-- 
Pavel Begunkov
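
For reference, the link-failure rule described at the top of this message can be sketched as a small userspace toy model. This is not kernel or liburing code: `toy_req`, `link_kind` and `toy_run_chain` are hypothetical names made up for illustration, and the model assumes that once a soft-linked request fails, everything remaining in the chain is cancelled. It also treats only a negative result as a failure and leaves the short read/write question aside.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical link kinds, standing in for "no flag",
 * IOSQE_IO_LINK and IOSQE_IO_HARDLINK respectively. */
enum link_kind { LINK_NONE, LINK_SOFT, LINK_HARD };

struct toy_req {
    int result;           /* what the operation would return if executed */
    enum link_kind link;  /* how this request is linked to the next one */
};

/* Walk one chain in order and store each request's completion in out[i].
 * A failed request with a soft link cancels the rest of the chain;
 * a failed request with a hard link lets the next request run. */
static void toy_run_chain(const struct toy_req *reqs, int n, int *out)
{
    int cancelled = 0;

    assert(reqs != NULL && out != NULL);
    for (int i = 0; i < n; i++) {
        if (cancelled) {
            out[i] = -ECANCELED;  /* not executed, only completed */
            continue;
        }
        out[i] = reqs[i].result;
        if (reqs[i].result < 0 && reqs[i].link == LINK_SOFT)
            cancelled = 1;
    }
}
```

With a chain of three requests where the second one fails with -EIO, the third completes with -ECANCELED if the second uses `LINK_SOFT`, and runs normally if it uses `LINK_HARD`.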