public inbox for [email protected]
* What does IOSQE_IO_[HARD]LINK actually mean?
@ 2020-02-01  9:18 Andres Freund
  2020-02-01 11:30 ` Pavel Begunkov
  2020-02-01 18:06 ` Jens Axboe
  0 siblings, 2 replies; 6+ messages in thread
From: Andres Freund @ 2020-02-01  9:18 UTC
  To: Jens Axboe, io-uring

Hi,

Reading the manpage from liburing I read:
       IOSQE_IO_LINK
              When  this  flag is specified, it forms a link with the next SQE in the submission ring. That next SQE
              will not be started before this one completes.  This, in effect, forms a chain of SQEs, which  can  be
              arbitrarily  long. The tail of the chain is denoted by the first SQE that does not have this flag set.
              This flag has no effect on previous SQE submissions, nor does it impact SQEs that are outside  of  the
              chain  tail.  This  means  that multiple chains can be executing in parallel, or chains and individual
              SQEs. Only members inside the chain are serialized. Available since 5.3.

       IOSQE_IO_HARDLINK
              Like IOSQE_IO_LINK, but it doesn't sever regardless of the completion result.  Note that the link will
              still sever if we fail submitting the parent request, hard links are only resilient in the presence of
              completion results for requests that did submit correctly.  IOSQE_IO_HARDLINK  implies  IOSQE_IO_LINK.
              Available since 5.5.
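
The chain rule in the IOSQE_IO_LINK text above (a flagged SQE links to the
next one, and the first SQE without the flag is the tail) can be sketched as
a toy model in C. This is not kernel or liburing code: MODEL_LINK,
MODEL_HARDLINK and model_assign_chains() are hypothetical names used only for
illustration, and the flag values are stand-ins, not the real bit values.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the link flags; only set/unset matters here. */
enum { MODEL_LINK = 1u << 0, MODEL_HARDLINK = 1u << 1 };

/*
 * Walk the SQE flags in submission order and give every SQE a chain id.
 * An SQE with a link flag is chained to the next SQE; the first SQE
 * without the flag is the tail and closes the chain. An SQE that is
 * neither linked-to nor linking forms a chain of length one.
 * Returns the number of chains found.
 */
size_t model_assign_chains(const unsigned *flags, size_t n, size_t *chain_id)
{
    size_t chains = 0;
    int open = 0; /* nonzero if the previous SQE linked into this one */

    for (size_t i = 0; i < n; i++) {
        if (!open)
            chains++; /* this SQE starts a new chain */
        chain_id[i] = chains - 1;
        open = (flags[i] & (MODEL_LINK | MODEL_HARDLINK)) != 0;
    }
    return chains;
}
```

In this model only SQEs sharing a chain id are serialized; different chain
ids may run in parallel, matching the "multiple chains can be executing in
parallel" sentence.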

I can make some sense out of that description of IOSQE_IO_LINK without
looking at kernel code. But I don't think it's possible to understand
what happens when an earlier chain member fails, and what denotes an
error.  IOSQE_IO_HARDLINK's description kind of implies that
IOSQE_IO_LINK will not start the next request if there was a failure,
but doesn't define failure either.

Looks like it's defined in a somewhat ad hoc manner. For file read/write,
subsequent requests are failed if the preceding one was a short read/write.
But e.g. for sendmsg that doesn't look to be the case.
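
The per-operation behaviour described above can be written down as a small
predicate. This is a toy sketch of what this thread reports, not the kernel's
logic: model_op and model_breaks_link() are hypothetical names, and treating
any negative result as a failure for every operation is an assumption that
the thread does not spell out.

```c
/* Hypothetical operation kinds, named after the ops discussed in the thread. */
enum model_op {
    MODEL_OP_READ,
    MODEL_OP_WRITE,
    MODEL_OP_SENDMSG,
    MODEL_OP_RECVMSG
};

/*
 * Return nonzero if a completion result would count as a "failure" for
 * link purposes in this model.
 *  - Assumption: a negative result (an -errno value) fails for every op.
 *  - File read/write: a short transfer also counts, as observed above.
 *  - sendmsg/recvmsg: a short transfer does not count.
 */
int model_breaks_link(enum model_op op, int res, unsigned requested)
{
    if (res < 0)
        return 1;

    switch (op) {
    case MODEL_OP_READ:
    case MODEL_OP_WRITE:
        return (unsigned)res != requested;
    default:
        return 0;
    }
}
```

Note that under this rule a read returning 0 at EOF, with a nonzero requested
length, counts as a failure like any other short read.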

Perhaps it'd make sense to reject use of IOSQE_IO_LINK outside ops where
it's meaningful?

Or maybe I'm just missing something.

Greetings,

Andres Freund


* Re: What does IOSQE_IO_[HARD]LINK actually mean?
  2020-02-01  9:18 What does IOSQE_IO_[HARD]LINK actually mean? Andres Freund
@ 2020-02-01 11:30 ` Pavel Begunkov
  2020-02-01 12:02   ` Andres Freund
  2020-02-01 18:06 ` Jens Axboe
  1 sibling, 1 reply; 6+ messages in thread
From: Pavel Begunkov @ 2020-02-01 11:30 UTC
  To: Andres Freund, Jens Axboe, io-uring



On 01/02/2020 12:18, Andres Freund wrote:
> Hi,
> 
> Reading the manpage from liburing I read:
>        IOSQE_IO_LINK
>               When  this  flag is specified, it forms a link with the next SQE in the submission ring. That next SQE
>               will not be started before this one completes.  This, in effect, forms a chain of SQEs, which  can  be
>               arbitrarily  long. The tail of the chain is denoted by the first SQE that does not have this flag set.
>               This flag has no effect on previous SQE submissions, nor does it impact SQEs that are outside  of  the
>               chain  tail.  This  means  that multiple chains can be executing in parallel, or chains and individual
>               SQEs. Only members inside the chain are serialized. Available since 5.3.
> 
>        IOSQE_IO_HARDLINK
>               Like IOSQE_IO_LINK, but it doesn't sever regardless of the completion result.  Note that the link will
>               still sever if we fail submitting the parent request, hard links are only resilient in the presence of
>               completion results for requests that did submit correctly.  IOSQE_IO_HARDLINK  implies  IOSQE_IO_LINK.
>               Available since 5.5.
> 
> I can make some sense out of that description of IOSQE_IO_LINK without
> looking at kernel code. But I don't think it's possible to understand
> what happens when an earlier chain member fails, and what denotes an
> error.  IOSQE_IO_HARDLINK's description kind of implies that
> IOSQE_IO_LINK will not start the next request if there was a failure,
> but doesn't define failure either.
> 

Right, after a "failure" occurs for an IOSQE_IO_LINK request, all subsequent
requests in the link won't be executed, but completed with -ECANCELED. However,
if IOSQE_IO_HARDLINK is set for the request, a failure won't sever/break the
link, and the chain will continue to the next one.
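
As a rough illustration of the rule just described, here is a toy model of a
single chain in C. It is a sketch, not kernel code: struct model_req,
model_run_chain() and the LINK_* names are hypothetical, and whether a given
result counts as a "failure" is supplied by the caller because the thread
leaves that per-operation.

```c
#include <errno.h>
#include <stddef.h>

/* How a request links to the next one (hypothetical names). */
enum model_link { LINK_NONE, LINK_SOFT, LINK_HARD };

struct model_req {
    enum model_link link; /* LINK_SOFT ~ IOSQE_IO_LINK, LINK_HARD ~ IOSQE_IO_HARDLINK */
    int planned_res;      /* result the request would produce if executed */
    int failed;           /* nonzero if that result counts as a "failure" */
    int res;              /* output: result posted in the completion */
};

/*
 * Process one chain in order. A failed request with a normal link breaks
 * the chain: every later request is not executed and completes with
 * -ECANCELED. A failed request with a hard link lets the chain continue.
 */
void model_run_chain(struct model_req *reqs, size_t n)
{
    int cancel = 0;

    for (size_t i = 0; i < n; i++) {
        if (cancel) {
            reqs[i].res = -ECANCELED; /* never started */
            continue;
        }
        reqs[i].res = reqs[i].planned_res;
        if (reqs[i].failed && reqs[i].link == LINK_SOFT)
            cancel = 1;
    }
}
```

The model deliberately leaves out submission failures, which according to the
manpage text still sever a hard link.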

> Looks like it's defined in a somewhat adhoc manner. For file read/write
> subsequent requests are failed if they are a short read/write. But
> e.g. for sendmsg that looks not to be the case.
> 

As you said, it's defined rather sporadically. We should unify it so that it
makes sense. I'd prefer to follow the read/write pattern.

> Perhaps it'd make sense to reject use of IOSQE_IO_LINK outside ops where
> it's meaningful?

If we disregard it for either the length-based operations or the remaining
ones (or whatever combination), the feature won't be flexible enough to be
useful, but in combination it allows removing a lot of context switches.

-- 
Pavel Begunkov




* Re: What does IOSQE_IO_[HARD]LINK actually mean?
  2020-02-01 11:30 ` Pavel Begunkov
@ 2020-02-01 12:02   ` Andres Freund
  2020-02-01 15:28     ` Pavel Begunkov
  0 siblings, 1 reply; 6+ messages in thread
From: Andres Freund @ 2020-02-01 12:02 UTC
  To: Pavel Begunkov; +Cc: Jens Axboe, io-uring

Hi,

On 2020-02-01 14:30:06 +0300, Pavel Begunkov wrote:
> On 01/02/2020 12:18, Andres Freund wrote:
> > Hi,
> > 
> > Reading the manpage from liburing I read:
> >        IOSQE_IO_LINK
> >               When  this  flag is specified, it forms a link with the next SQE in the submission ring. That next SQE
> >               will not be started before this one completes.  This, in effect, forms a chain of SQEs, which  can  be
> >               arbitrarily  long. The tail of the chain is denoted by the first SQE that does not have this flag set.
> >               This flag has no effect on previous SQE submissions, nor does it impact SQEs that are outside  of  the
> >               chain  tail.  This  means  that multiple chains can be executing in parallel, or chains and individual
> >               SQEs. Only members inside the chain are serialized. Available since 5.3.
> > 
> >        IOSQE_IO_HARDLINK
> >               Like IOSQE_IO_LINK, but it doesn't sever regardless of the completion result.  Note that the link will
> >               still sever if we fail submitting the parent request, hard links are only resilient in the presence of
> >               completion results for requests that did submit correctly.  IOSQE_IO_HARDLINK  implies  IOSQE_IO_LINK.
> >               Available since 5.5.
> > 
> > I can make some sense out of that description of IOSQE_IO_LINK without
> > looking at kernel code. But I don't think it's possible to understand
> > what happens when an earlier chain member fails, and what denotes an
> > error.  IOSQE_IO_HARDLINK's description kind of implies that
> > IOSQE_IO_LINK will not start the next request if there was a failure,
> > but doesn't define failure either.
> > 
> 
> Right, after a "failure" occurred for a IOSQE_IO_LINK request, all subsequent
> requests in the link won't be executed, but completed with -ECANCELED. However,
> if IOSQE_IO_HARDLINK set for the request, it won't sever/break the link and will
> continue to the next one.

I think something along those lines should be added to the manpage... I
think severing the link isn't really a good description, because it's
not like it's separating off the tail to be independent, or such. If
anything it's the opposite.


> > Looks like it's defined in a somewhat adhoc manner. For file read/write
> > subsequent requests are failed if they are a short read/write. But
> > e.g. for sendmsg that looks not to be the case.
> > 
> 
> As you said, it's defined rather sporadically. We should unify for it to make
> sense. I'd prefer to follow the read/write pattern.

I think one problem with that is that it's not necessarily useful to
insist on the length being the maximum allowed length. E.g. for a
recvmsg you'd likely want to not fail the request if you read less than
what you provided for, because that's just a normal occurrence. It could
e.g. be useful to just start the next recv (with a different buffer)
immediately.

I'm not even sure it's generally sensible for read either, as that
doesn't work well for EOF, non-file FDs, ... Perhaps there's just no
good solution though.


> > Perhaps it'd make sense to reject use of IOSQE_IO_LINK outside ops where
> > it's meaningful?
> 
> If we disregard it for either length-based operations or the rest ones (or
> whatever combination), the feature won't be flexible enough to be useful,
> but in combination it allows to remove much of context switches.

I really don't want to make it less useful ;) - In fact I'm pretty
excited about having it. I haven't yet implemented / benchmarked that,
but I think for databases it is likely to be very good to achieve low
but consistent IO queue depths for background tasks like checkpointing,
readahead, writeback etc, while still having a low context switch
rate. Without something like IOSQE_IO_LINK it's considerably harder to
have continuous IO that doesn't impact higher priority IO like journal
flushes.

Andres Freund


* Re: What does IOSQE_IO_[HARD]LINK actually mean?
  2020-02-01 12:02   ` Andres Freund
@ 2020-02-01 15:28     ` Pavel Begunkov
  0 siblings, 0 replies; 6+ messages in thread
From: Pavel Begunkov @ 2020-02-01 15:28 UTC
  To: Andres Freund; +Cc: Jens Axboe, io-uring



On 01/02/2020 15:02, Andres Freund wrote:
>> Right, after a "failure" occurred for a IOSQE_IO_LINK request, all subsequent
>> requests in the link won't be executed, but completed with -ECANCELED. However,
>> if IOSQE_IO_HARDLINK set for the request, it won't sever/break the link and will
>> continue to the next one.
> 
> I think something along those lines should be added to the manpage... I
> think severing the link isn't really a good description, because it's
> not like it's separating off the tail to be independent, or such. If
> anything it's the opposite.
> 
> 
>>> Looks like it's defined in a somewhat adhoc manner. For file read/write
>>> subsequent requests are failed if they are a short read/write. But
>>> e.g. for sendmsg that looks not to be the case.
>>>
>>
>> As you said, it's defined rather sporadically. We should unify for it to make
>> sense. I'd prefer to follow the read/write pattern.
> 
> I think one problem with that is that it's not necessarily useful to
> insist on the length being the maximum allowed length. E.g. for a
> recvmsg you'd likely want to not fail the request if you read less than
> what you provided for, because that's just a normal occurance. It could
> e.g. be useful to just start the next recv (with a different buffer)
> immediately.
> 
> I'm not even sure it's generally sensible for read either, as that
> doesn't work well for EOF, non-file FDs, ... Perhaps there's just no
> good solution though.

People have already asked about such things; you can find the discussion
somewhere in the GitHub issues for liburing. In short, there are a lot of
different patterns, and it's not viable to implement them all in the kernel.
There are thoughts, ideas and plans around using BPF to deal with that.
I've sent an LSF/MM/BPF topic proposal about exactly that.

> 
> 
>>> Perhaps it'd make sense to reject use of IOSQE_IO_LINK outside ops where
>>> it's meaningful?
>>
>> If we disregard it for either length-based operations or the rest ones (or
>> whatever combination), the feature won't be flexible enough to be useful,
>> but in combination it allows to remove much of context switches.
> 
> I really don't want to make it less useful ;) - In fact I'm pretty
> excited about having it. I haven't yet implemented / benchmarked that,
> but I think for databases it is likely to be very good to achieve low
> but consistent IO queue depths for background tasks like checkpointing,
> readahead, writeback etc, while still having a low context switch
> rates. Without something like IOSQE_IO_LINK it's considerably harder to
> have continuous IO that doesn't impact higher priority IO like journal
> flushes.
> 
> Andres Freund
> 

-- 
Pavel Begunkov




* Re: What does IOSQE_IO_[HARD]LINK actually mean?
  2020-02-01  9:18 What does IOSQE_IO_[HARD]LINK actually mean? Andres Freund
  2020-02-01 11:30 ` Pavel Begunkov
@ 2020-02-01 18:06 ` Jens Axboe
  2020-02-02  7:36   ` Andres Freund
  1 sibling, 1 reply; 6+ messages in thread
From: Jens Axboe @ 2020-02-01 18:06 UTC
  To: Andres Freund, io-uring

On 2/1/20 2:18 AM, Andres Freund wrote:
> Hi,
> 
> Reading the manpage from liburing I read:
>        IOSQE_IO_LINK
>               When  this  flag is specified, it forms a link with the next SQE in the submission ring. That next SQE
>               will not be started before this one completes.  This, in effect, forms a chain of SQEs, which  can  be
>               arbitrarily  long. The tail of the chain is denoted by the first SQE that does not have this flag set.
>               This flag has no effect on previous SQE submissions, nor does it impact SQEs that are outside  of  the
>               chain  tail.  This  means  that multiple chains can be executing in parallel, or chains and individual
>               SQEs. Only members inside the chain are serialized. Available since 5.3.
> 
>        IOSQE_IO_HARDLINK
>               Like IOSQE_IO_LINK, but it doesn't sever regardless of the completion result.  Note that the link will
>               still sever if we fail submitting the parent request, hard links are only resilient in the presence of
>               completion results for requests that did submit correctly.  IOSQE_IO_HARDLINK  implies  IOSQE_IO_LINK.
>               Available since 5.5.
> 
> I can make some sense out of that description of IOSQE_IO_LINK without
> looking at kernel code. But I don't think it's possible to understand
> what happens when an earlier chain member fails, and what denotes an
> error.  IOSQE_IO_HARDLINK's description kind of implies that
> IOSQE_IO_LINK will not start the next request if there was a failure,
> but doesn't define failure either.

I won't touch on the rest since Pavel already did, but I did expand the
explanation of when a normal link will sever, and how:

https://git.kernel.dk/cgit/liburing/commit/?id=9416351377f04211f859667f39a58d2a223cbd21

LSFMM will have a session on BPF with io_uring, which we'll need in order to
have full control of links outside of the basic use cases.

-- 
Jens Axboe



* Re: What does IOSQE_IO_[HARD]LINK actually mean?
  2020-02-01 18:06 ` Jens Axboe
@ 2020-02-02  7:36   ` Andres Freund
  0 siblings, 0 replies; 6+ messages in thread
From: Andres Freund @ 2020-02-02  7:36 UTC
  To: Jens Axboe; +Cc: io-uring

On 2020-02-01 11:06:28 -0700, Jens Axboe wrote:
> I won't touch on the rest since Pavel already did, but I did expand the
> explanation of when a normal link will sever, and how:

Awesome.

