Memory ordering description in io

public inbox for [email protected]
 help / color / mirror / Atom feed

* Memory ordering description in io_uring.pdf
@ 2022-09-18 16:56 J. Hanne
  2022-09-22  1:54 ` Jens Axboe
  0 siblings, 1 reply; 4+ messages in thread
From: J. Hanne @ 2022-09-18 16:56 UTC (permalink / raw)
  To: io-uring

Hi,

I have a couple of questions regarding the necessity of including memory
barriers when using io_uring, as outlined in
https://kernel.dk/io_uring.pdf. I'm fine with using liburing, but still I
do want to understand what is going on behind the scenes, so any comment
would be appreciated.

Firstly, I wonder why memory barriers are required at all, when NOT using
polled mode. Because requiring them in non-polled mode somehow implies that:
- Memory re-ordering occurs across system-call boundaries (i.e. when
  submitting, the tail write could happen after the io_uring_enter
  syscall?!)
- CPU data dependency checks do not work
So, are memory barriers really required when just using a simple
loop around io_uring_enter with completely synchronous processing?

Secondly, the examples in io_uring.pdf suggest that checking completion
entries requires a read_barrier and a write_barrier and submitting entries
requires *two* write_barriers. Really?

My expectation would be, just as with "normal" inter-thread userspace ipc,
that plain store-release and load-acquire semantics are sufficient, e.g.: 
- For reading completion entries:
-- first read the CQ ring head (without any ordering enforcement)
-- then use __atomic_load(__ATOMIC_ACQUIRE) to read the CQ ring tail
-- then use __atomic_store(__ATOMIC_RELEASE) to update the CQ ring head
- For submitting entries:
-- first read the SQ ring tail (without any ordering enforcement)
-- then use __atomic_load(__ATOMIC_ACQUIRE) to read the SQ ring head
-- then use __atomic_store(__ATOMIC_RELEASE) to update the SQ ring tail
Wouldn't these be sufficient?!

Thirdly, io_uring.pdf and
https://github.com/torvalds/linux/blob/master/io_uring/io_uring.c seem a
little contradicting, at least from my reading:

io_uring.pdf, in the completion entry example:
- Includes a read_barrier() **BEFORE** it reads the CQ ring tail
- Include a write_barrier() **AFTER** updating CQ head

io_uring.c says on completion entries:
- **AFTER** the application reads the CQ ring tail, it must use an appropriate
  smp_rmb() [...].
- It also needs a smp_mb() **BEFORE** updating CQ head [...].

io_uring.pdf, in the submission entry example:
- Includes a write_barrier() **BEFORE** updating the SQ tail
- Includes a write_barrier() **AFTER** updating the SQ tail

io_uring.c says on submission entries:
- [...] the application must use an appropriate smp_wmb() **BEFORE**
  writing the SQ tail
  (this matches io_uring.pdf)
- And it needs a barrier ordering the SQ head load before writing new
  SQ entries
  
I know, io_uring.pdf does mention that the memory ordering description
is simplified. So maybe this is the whole explanation for my confusion?

Cheers,
  Johann

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Memory ordering description in io_uring.pdf
  2022-09-18 16:56 Memory ordering description in io_uring.pdf J. Hanne
@ 2022-09-22  1:54 ` Jens Axboe
  2022-09-25 10:34   ` J. Hanne
  0 siblings, 1 reply; 4+ messages in thread
From: Jens Axboe @ 2022-09-22  1:54 UTC (permalink / raw)
  To: J. Hanne, io-uring

On 9/18/22 10:56 AM, J. Hanne wrote:
> Hi,
> 
> I have a couple of questions regarding the necessity of including memory
> barriers when using io_uring, as outlined in
> https://kernel.dk/io_uring.pdf. I'm fine with using liburing, but still I
> do want to understand what is going on behind the scenes, so any comment
> would be appreciated.

In terms of the barriers, that doc is somewhat outdated...

> Firstly, I wonder why memory barriers are required at all, when NOT using
> polled mode. Because requiring them in non-polled mode somehow implies that:
> - Memory re-ordering occurs across system-call boundaries (i.e. when
>   submitting, the tail write could happen after the io_uring_enter
>   syscall?!)
> - CPU data dependency checks do not work
> So, are memory barriers really required when just using a simple
> loop around io_uring_enter with completely synchronous processing?

No, I don't beleive that they are. The exception is SQPOLL, as you mention,
as there's not necessarily a syscall involved with that.

> Secondly, the examples in io_uring.pdf suggest that checking completion
> entries requires a read_barrier and a write_barrier and submitting entries
> requires *two* write_barriers. Really?
> 
> My expectation would be, just as with "normal" inter-thread userspace ipc,
> that plain store-release and load-acquire semantics are sufficient, e.g.: 
> - For reading completion entries:
> -- first read the CQ ring head (without any ordering enforcement)
> -- then use __atomic_load(__ATOMIC_ACQUIRE) to read the CQ ring tail
> -- then use __atomic_store(__ATOMIC_RELEASE) to update the CQ ring head
> - For submitting entries:
> -- first read the SQ ring tail (without any ordering enforcement)
> -- then use __atomic_load(__ATOMIC_ACQUIRE) to read the SQ ring head
> -- then use __atomic_store(__ATOMIC_RELEASE) to update the SQ ring tail
> Wouldn't these be sufficient?!

Please check liburing to see what that does. Would be interested in
your feedback (and patches!). Largely x86 not caring too much about
these have meant that I think we've erred on the side of caution
on that front.

> Thirdly, io_uring.pdf and
> https://github.com/torvalds/linux/blob/master/io_uring/io_uring.c seem a
> little contradicting, at least from my reading:
> 
> io_uring.pdf, in the completion entry example:
> - Includes a read_barrier() **BEFORE** it reads the CQ ring tail
> - Include a write_barrier() **AFTER** updating CQ head
> 
> io_uring.c says on completion entries:
> - **AFTER** the application reads the CQ ring tail, it must use an appropriate
>   smp_rmb() [...].
> - It also needs a smp_mb() **BEFORE** updating CQ head [...].
> 
> io_uring.pdf, in the submission entry example:
> - Includes a write_barrier() **BEFORE** updating the SQ tail
> - Includes a write_barrier() **AFTER** updating the SQ tail
> 
> io_uring.c says on submission entries:
> - [...] the application must use an appropriate smp_wmb() **BEFORE**
>   writing the SQ tail
>   (this matches io_uring.pdf)
> - And it needs a barrier ordering the SQ head load before writing new
>   SQ entries
>   
> I know, io_uring.pdf does mention that the memory ordering description
> is simplified. So maybe this is the whole explanation for my confusion?

The canonical resource at this point is the kernel code, as some of
the revamping of the memory ordering happened way later than when
that doc was written. Would be nice to get it updated at some point.

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Memory ordering description in io_uring.pdf
  2022-09-22  1:54 ` Jens Axboe
@ 2022-09-25 10:34   ` J. Hanne
  2022-09-25 12:03     ` J. Hanne
  0 siblings, 1 reply; 4+ messages in thread
From: J. Hanne @ 2022-09-25 10:34 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring

Hi,

> Am 22.09.2022 um 03:54 schrieb Jens Axboe <[email protected]>:
> 
> On 9/18/22 10:56 AM, J. Hanne wrote:
>> Hi,
>> 
>> I have a couple of questions regarding the necessity of including memory
>> barriers when using io_uring, as outlined in
>> https://kernel.dk/io_uring.pdf. I'm fine with using liburing, but still I
>> do want to understand what is going on behind the scenes, so any comment
>> would be appreciated.
> 
> In terms of the barriers, that doc is somewhat outdated...
Ok, that pretty much explains why I got an inconsistent view after studying multiple sources…

> 
>> Firstly, I wonder why memory barriers are required at all, when NOT using
>> polled mode. Because requiring them in non-polled mode somehow implies that:
>> - Memory re-ordering occurs across system-call boundaries (i.e. when
>>  submitting, the tail write could happen after the io_uring_enter
>>  syscall?!)
>> - CPU data dependency checks do not work
>> So, are memory barriers really required when just using a simple
>> loop around io_uring_enter with completely synchronous processing?
> 
> No, I don't beleive that they are. The exception is SQPOLL, as you mention,
> as there's not necessarily a syscall involved with that.
> 
>> Secondly, the examples in io_uring.pdf suggest that checking completion
>> entries requires a read_barrier and a write_barrier and submitting entries
>> requires *two* write_barriers. Really?
>> 
>> My expectation would be, just as with "normal" inter-thread userspace ipc,
>> that plain store-release and load-acquire semantics are sufficient, e.g.: 
>> - For reading completion entries:
>> -- first read the CQ ring head (without any ordering enforcement)
>> -- then use __atomic_load(__ATOMIC_ACQUIRE) to read the CQ ring tail
>> -- then use __atomic_store(__ATOMIC_RELEASE) to update the CQ ring head
>> - For submitting entries:
>> -- first read the SQ ring tail (without any ordering enforcement)
>> -- then use __atomic_load(__ATOMIC_ACQUIRE) to read the SQ ring head
>> -- then use __atomic_store(__ATOMIC_RELEASE) to update the SQ ring tail
>> Wouldn't these be sufficient?!
> 
> Please check liburing to see what that does. Would be interested in
> your feedback (and patches!). Largely x86 not caring too much about
> these have meant that I think we've erred on the side of caution
> on that front.
Ok, I will check. My practical experience with memory barriers is limited however, so I’m not in the position to give a final judgement

> 
>> Thirdly, io_uring.pdf and
>> https://github.com/torvalds/linux/blob/master/io_uring/io_uring.c seem a
>> little contradicting, at least from my reading:
>> 
>> io_uring.pdf, in the completion entry example:
>> - Includes a read_barrier() **BEFORE** it reads the CQ ring tail
>> - Include a write_barrier() **AFTER** updating CQ head
>> 
>> io_uring.c says on completion entries:
>> - **AFTER** the application reads the CQ ring tail, it must use an appropriate
>>  smp_rmb() [...].
>> - It also needs a smp_mb() **BEFORE** updating CQ head [...].
>> 
>> io_uring.pdf, in the submission entry example:
>> - Includes a write_barrier() **BEFORE** updating the SQ tail
>> - Includes a write_barrier() **AFTER** updating the SQ tail
>> 
>> io_uring.c says on submission entries:
>> - [...] the application must use an appropriate smp_wmb() **BEFORE**
>>  writing the SQ tail
>>  (this matches io_uring.pdf)
>> - And it needs a barrier ordering the SQ head load before writing new
>>  SQ entries
>> 
>> I know, io_uring.pdf does mention that the memory ordering description
>> is simplified. So maybe this is the whole explanation for my confusion?
> 
> The canonical resource at this point is the kernel code, as some of
> the revamping of the memory ordering happened way later than when
> that doc was written. Would be nice to get it updated at some point.
Ok, I will try. Where is the io_uring.pdf source (tex? markdown??)?

Regards,
  Johann


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Memory ordering description in io_uring.pdf
  2022-09-25 10:34   ` J. Hanne
@ 2022-09-25 12:03     ` J. Hanne
  0 siblings, 0 replies; 4+ messages in thread
From: J. Hanne @ 2022-09-25 12:03 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring

Hi,

what needs to be brought into an consistent state:
- https://kernel.dk/io_uring.pdf (where is the source??)
- https://git.kernel.dk/cgit/liburing/tree/man/io_uring.7 (https://manpages.debian.org/testing/liburing-dev/io_uring.7.en.html)
- https://git.kernel.dk/cgit/liburing/tree/src/queue.c (using macros from https://git.kernel.dk/cgit/liburing/tree/src/include/liburing/barrier.h)
- https://github.com/torvalds/linux/blob/master/io_uring/io_uring.c

I’ll start with submission queue handling.

Quoting myself, my (possibly naive) approach for submitting entries would be, using gcc atomic builtins in absence of standardized C functions:
- (1) first read the SQ ring tail (without any ordering enforcement)
- (2) then use __atomic_load(__ATOMIC_ACQUIRE) to read the SQ ring head
- (3) then use __atomic_store(__ATOMIC_RELEASE) to update the SQ ring tail

Comparing with the existing documentation, (3) matches everywhere:
- io_uring.pdf confirms (3), as it inserts a write_barrier() before updating the ring tail
- io_uring.7 confirms (3), as it uses atomic_store_release to update the ring tail
- __io_uring_flush_sq in queue.c confirms (3), as it uses io_uring_smp_store_release to update the ring tail
  (BUT, __io_uring_flush_sq in queue.c also special cases IORING_SETUP_SQPOLL, which I do not fully understand)
- io_uring.c says: "the application must use an appropriate smp_wmb() before writing the SQ tail (ordering SQ entry stores with the tail store)"

However, (2) is not so clear:
- io_uring.pdf never reads the ring head (but at least it mentions that the example is simplified as it is missing a queue full check)
- io_uring.7 never reads the ring head (as it does not check if the ring is full, which it does not even mention)
- __io_uring_flush_sq in queue.c confirms that, usually, acquire semantics are needed for reading the ring head, but seems to handle it elsewhere due to how it works internally (?)
- io_uring.c says: “[the application] needs a barrier ordering the SQ head load before writing new SQ entries (smp_load_acquire to read head will do)."
  (BUT, it does not mention WHY the application needs to load the ring head)

Lastly, I absolutely do not understand the second write_barrier in io_uring.pdf after updating the ring tail. https://git.kernel.dk/cgit/liburing/commit/?id=ecefd7958eb32602df07f12e9808598b2c2de84b more or less just removed it. Before removal, it had this comment: “The kernel has the matching read barrier for reading the SQ tail.“. Yes, the kernel does need such a read barrier, but the write barrier *before* the ring tail update should be enough?!

So, my recommendation for documentation updates is:
- In io_uring.pdf, remove the second write_barrier after the ring tail update.
- In io_uring.pdf, augment the submission example with reading the ring head (to check for a queue-full condition), including a read_barrier after
- In io_uring.7, also add a queue-full check
- In io_uring.c extend the comment to say WHY the application needs to read the ring head

Comments?

Regards,
  Johann

> Am 25.09.2022 um 12:34 schrieb J. Hanne <[email protected]>:
> 
> Hi,
> 
>> Am 22.09.2022 um 03:54 schrieb Jens Axboe <[email protected]>:
>> 
>> On 9/18/22 10:56 AM, J. Hanne wrote:
>>> Hi,
>>> 
>>> I have a couple of questions regarding the necessity of including memory
>>> barriers when using io_uring, as outlined in
>>> https://kernel.dk/io_uring.pdf. I'm fine with using liburing, but still I
>>> do want to understand what is going on behind the scenes, so any comment
>>> would be appreciated.
>> 
>> In terms of the barriers, that doc is somewhat outdated...
> Ok, that pretty much explains why I got an inconsistent view after studying multiple sources…
> 
>> 
>>> Firstly, I wonder why memory barriers are required at all, when NOT using
>>> polled mode. Because requiring them in non-polled mode somehow implies that:
>>> - Memory re-ordering occurs across system-call boundaries (i.e. when
>>> submitting, the tail write could happen after the io_uring_enter
>>> syscall?!)
>>> - CPU data dependency checks do not work
>>> So, are memory barriers really required when just using a simple
>>> loop around io_uring_enter with completely synchronous processing?
>> 
>> No, I don't beleive that they are. The exception is SQPOLL, as you mention,
>> as there's not necessarily a syscall involved with that.
>> 
>>> Secondly, the examples in io_uring.pdf suggest that checking completion
>>> entries requires a read_barrier and a write_barrier and submitting entries
>>> requires *two* write_barriers. Really?
>>> 
>>> My expectation would be, just as with "normal" inter-thread userspace ipc,
>>> that plain store-release and load-acquire semantics are sufficient, e.g.: 
>>> - For reading completion entries:
>>> -- first read the CQ ring head (without any ordering enforcement)
>>> -- then use __atomic_load(__ATOMIC_ACQUIRE) to read the CQ ring tail
>>> -- then use __atomic_store(__ATOMIC_RELEASE) to update the CQ ring head
>>> - For submitting entries:
>>> -- first read the SQ ring tail (without any ordering enforcement)
>>> -- then use __atomic_load(__ATOMIC_ACQUIRE) to read the SQ ring head
>>> -- then use __atomic_store(__ATOMIC_RELEASE) to update the SQ ring tail
>>> Wouldn't these be sufficient?!
>> 
>> Please check liburing to see what that does. Would be interested in
>> your feedback (and patches!). Largely x86 not caring too much about
>> these have meant that I think we've erred on the side of caution
>> on that front.
> Ok, I will check. My practical experience with memory barriers is limited however, so I’m not in the position to give a final judgement
> 
>> 
>>> Thirdly, io_uring.pdf and
>>> https://github.com/torvalds/linux/blob/master/io_uring/io_uring.c seem a
>>> little contradicting, at least from my reading:
>>> 
>>> io_uring.pdf, in the completion entry example:
>>> - Includes a read_barrier() **BEFORE** it reads the CQ ring tail
>>> - Include a write_barrier() **AFTER** updating CQ head
>>> 
>>> io_uring.c says on completion entries:
>>> - **AFTER** the application reads the CQ ring tail, it must use an appropriate
>>> smp_rmb() [...].
>>> - It also needs a smp_mb() **BEFORE** updating CQ head [...].
>>> 
>>> io_uring.pdf, in the submission entry example:
>>> - Includes a write_barrier() **BEFORE** updating the SQ tail
>>> - Includes a write_barrier() **AFTER** updating the SQ tail
>>> 
>>> io_uring.c says on submission entries:
>>> - [...] the application must use an appropriate smp_wmb() **BEFORE**
>>> writing the SQ tail
>>> (this matches io_uring.pdf)
>>> - And it needs a barrier ordering the SQ head load before writing new
>>> SQ entries
>>> 
>>> I know, io_uring.pdf does mention that the memory ordering description
>>> is simplified. So maybe this is the whole explanation for my confusion?
>> 
>> The canonical resource at this point is the kernel code, as some of
>> the revamping of the memory ordering happened way later than when
>> that doc was written. Would be nice to get it updated at some point.
> Ok, I will try. Where is the io_uring.pdf source (tex? markdown??)?
> 
> Regards,
>  Johann


^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2022-09-25 12:03 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2022-09-18 16:56 Memory ordering description in io_uring.pdf J. Hanne
2022-09-22  1:54 ` Jens Axboe
2022-09-25 10:34   ` J. Hanne
2022-09-25 12:03     ` J. Hanne

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox