From: Lennert Buytenhek <[email protected]>
To: Jens Axboe <[email protected]>
Cc: [email protected], [email protected],
[email protected], [email protected]
Subject: Re: [RFC PATCH] io_uring: add support for IORING_OP_GETDENTS64
Date: Sat, 23 Jan 2021 20:16:36 +0200 [thread overview]
Message-ID: <[email protected]> (raw)
In-Reply-To: <[email protected]>
On Sat, Jan 23, 2021 at 10:37:25AM -0700, Jens Axboe wrote:
> > IORING_OP_GETDENTS64 behaves like getdents64(2) and takes the same
> > arguments.
> >
> > Signed-off-by: Lennert Buytenhek <[email protected]>
> > ---
> > This seems to work OK, but I'd appreciate a review from someone more
> > familiar with io_uring internals than I am, as I'm not entirely sure
> > I did everything quite right.
> >
> > A dumb test program for IORING_OP_GETDENTS64 is available here:
> >
> > https://krautbox.wantstofly.org/~buytenh/uringfind.c
> >
> > This does more or less what find(1) does: it scans recursively through
> > a directory tree and prints the names of all directories and files it
> > encounters along the way -- but then using io_uring. (The uring version
> > prints the names of encountered files and directories in an order that's
> > determined by SQE completion order, which is somewhat nondeterministic
> > and likely to differ between runs.)
> >
> > On a directory tree with 14-odd million files in it that's on a
> > six-drive (spinning disk) btrfs raid, find(1) takes:
> >
> > # echo 3 > /proc/sys/vm/drop_caches
> > # time find /mnt/repo > /dev/null
> >
> > real 24m7.815s
> > user 0m15.015s
> > sys 0m48.340s
> > #
> >
> > And the io_uring version takes:
> >
> > # echo 3 > /proc/sys/vm/drop_caches
> > # time ./uringfind /mnt/repo > /dev/null
> >
> > real 10m29.064s
> > user 0m4.347s
> > sys 0m1.677s
> > #
> >
> > These timings are repeatable and consistent to within a few seconds.
> >
> > (btrfs seems to be sending most metadata reads to the same drive in the
> > array during this test, even though this filesystem is using the raid1c4
> > profile for metadata, so I suspect that more drive-level parallelism can
> > be extracted with some btrfs tweaks.)
> >
> > The fully cached case also shows some speedup for the io_uring version:
> >
> > # time find /mnt/repo > /dev/null
> >
> > real 0m5.223s
> > user 0m1.926s
> > sys 0m3.268s
> > #
> >
> > vs:
> >
> > # time ./uringfind /mnt/repo > /dev/null
> >
> > real 0m3.604s
> > user 0m2.417s
> > sys 0m0.793s
> > #
> >
> > That said, the point of this patch isn't primarily to enable
> > lightning-fast find(1) or du(1), but more to complete the set of
> > filesystem I/O primitives available via io_uring, so that applications
> > can do all of their filesystem I/O using the same mechanism, without
> > having to manually punt some of their work out to worker threads -- and
> > indeed, an object storage backend server that I wrote a while ago can
> > run with a pure io_uring based event loop with this patch.
>
> The results look nice for sure.
Thanks! And thank you for having a look.
> Once concern is that io_uring generally
> guarantees that any state passed in is stable once submit is done. For
> the below implementation, that doesn't hold as the linux_dirent64 isn't
> used until later in the process. That means if you do:
>
> submit_getdents64(ring)
> {
> struct linux_dirent64 dent;
> struct io_uring_sqe *sqe;
>
> sqe = io_uring_get_sqe(ring);
> io_uring_prep_getdents64(sqe, ..., &dent);
> io_uring_submit(ring);
> }
>
> other_func(ring)
> {
> struct io_uring_cqe *cqe;
>
> submit_getdents64(ring);
> io_uring_wait_cqe(ring, &cqe);
>
> }
>
> then the kernel side might get garbage by the time the sqe is actually
> submitted. This is true because you don't use it inline, only from the
> out-of-line async context. Usually this is solved by having the prep
> side copy in the necessary state, eg see io_openat2_prep() for how we
> make filename and open_how stable by copying them into kernel memory.
> That ensures that if/when these operations need to go async and finish
> out-of-line, the contents are stable and there's no requirement for the
> application to keep them valid once submission is done.
>
> Not sure how best to solve that, since the vfs side relies heavily on
> linux_dirent64 being a user pointer...
No data is passed into the kernel on a getdents64(2) call via user
memory, i.e. getdents64(2) only ever writes into the supplied
linux_dirent64 user pointer, it never reads from it. The only things
that we need to keep stable here are the linux_dirent64 pointer itself
and the 'count' argument and those are both passed in via the SQE, and
we READ_ONCE() them from the SQE in the prep function. I think that's
probably the source of confusion here?
Cheers,
Lennert
next prev parent reply other threads:[~2021-01-23 18:17 UTC|newest]
Thread overview: 15+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-01-23 11:41 [RFC PATCH] io_uring: add support for IORING_OP_GETDENTS64 Lennert Buytenhek
2021-01-23 17:37 ` Jens Axboe
2021-01-23 18:16 ` Lennert Buytenhek [this message]
2021-01-23 18:22 ` Jens Axboe
2021-01-23 23:27 ` Matthew Wilcox
2021-01-23 23:33 ` Jens Axboe
2021-01-28 22:30 ` Lennert Buytenhek
2021-01-23 23:50 ` Andres Freund
2021-01-23 23:56 ` Andres Freund
2021-01-24 1:59 ` Al Viro
2021-01-24 2:17 ` Andres Freund
2021-01-24 22:21 ` David Laight
2021-01-28 23:07 ` Lennert Buytenhek
2021-01-29 5:37 ` Lennert Buytenhek
2021-01-29 9:41 ` David Laight
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox