* io_uring and POSIX read-write concurrency guarantees
From: Niall Douglas @ 2020-06-09 13:19 UTC
To: io-uring
Dear io-uring mailing list,

My name is Niall Douglas, author of the std::file_handle and
std::mapped_file_handle proposal before WG21 for standardisation. I have
been collaborating with Eric Niebler, Kirk Shoop and Lewis Baker from
Facebook who author the Sender-Receiver proposal for standardised async
i/o in future C++, to implement an io_uring backend for file i/o. We
previously tried to email Jens Axboe privately about this on the 18th,
25th and 28th of May, but received no response, hence we have come here.
We are currently working on how best to implement async file i/o on
Linux with io_uring, such that std::file_handle, when used with
Sender-Receiver, does the right thing. Specifically, std::file_handle
guarantees propagation of the system's implementation of the POSIX
read-write concurrency guarantees, and indeed much file i/o code
implicitly assumes those guarantees, i.e. that reads made by thread A
from an inode observe writes made by thread B to overlapping regions as
a consistent sequence, and that a concurrent read never sees a torn
write in progress, up to IOV_MAX scatter-gather buffers.
These guarantees are implemented by a wide range of systems: FreeBSD,
Microsoft Windows and Mac OS have high quality implementations. Linux
varies by filesystem and by the O_DIRECT flag: ext4, for example, does
not implement the guarantees unless O_DIRECT is turned on, whereas ZFS
on Linux always implements them.
What we would like to achieve is that process A using async file i/o
based on io_uring would experience the POSIX read-write concurrency
guarantees when interoperating with process B using sync file i/o upon
the same inode. In other words, whether or not io_uring is used should
make no apparent difference to C++ code.
The existing ordering, pacing and linking of SQEs in io_uring are
insufficient to achieve this goal, because each io_uring ring buffer is
independent of every other io_uring ring buffer, and indeed also
independent of the inode being i/o-ed upon.
What we think io_uring needs in order to implement the POSIX read-write
concurrency guarantees for file i/o is the ability to create a global
submission queue per-inode. All i/o in the system, including from read()
and write(), would submit to that per-inode queue. Each inode would have
an as-if read-write mutex: read i/o can be dispatched in parallel, while
write i/o waits until all preceding operations have completed, and
writes then occur one at a time, per inode.
(Strictly speaking, the POSIX read-write concurrency guarantees only
affect *overlapping* regions: i/o to non-overlapping regions can execute
in parallel. However, figuring out whether regions overlap is slow, so
the simpler mechanism above is probably the best balance of performance
to guarantees.)
I wish to be clear here: this facility should be opt-out for code which
doesn't care about the POSIX read-write concurrency guarantees, e.g. if
only one thread can be accessing a file, we care only about performance,
not concurrency. However, for files shared between processes, I think
the default on Linux ought to be the same as it is on all the other
major platforms. Then portable code works as-is on Linux. Failing that,
standard C++ library implementers ought to be able to implement those
guarantees for C++ code on Linux, and right now I don't believe they can
with io_uring: they would be forced to fall back to a threadpool doing
synchronous i/o, which seems a shame.
Feedback and questions are welcome. My thanks in advance for your time.
Niall