public inbox for [email protected]
 help / color / mirror / Atom feed
From: Dominique Martinet <[email protected]>
To: Dave Chinner <[email protected]>
Cc: Alexander Viro <[email protected]>,
	Christian Brauner <[email protected]>,
	Jens Axboe <[email protected]>,
	Pavel Begunkov <[email protected]>,
	Stefan Roesch <[email protected]>,
	[email protected], [email protected],
	[email protected]
Subject: Re: [PATCH RFC 2/2] io_uring: add support for getdents
Date: Fri, 28 Apr 2023 15:14:52 +0900	[thread overview]
Message-ID: <[email protected]> (raw)
In-Reply-To: <[email protected]>

Dave Chinner wrote on Fri, Apr 28, 2023 at 03:06:40PM +1000:
> >  - in my understanding io_uring has its own thread pool, so even if the
> > actual getdents is blocking other IOs can progress (assuming there is
> > less blocked getdents than threads), without having to build one's own
> > extra thread pool next to the uring handling.
> > Looking at io_uring/fs.c the other "metadata related" calls there also
> > use the synchronous APIs (renameat, unlinkat, mkdirat, symlinkat and
> > linkat all do), so I didn't think of that as a problem in itself.
> 
> I think you missed the point. getdents is not an exclusive operation
> - it is run under shared locking unlike all the other direcotry
> modification operations you cite above. They use exclusive locking
> so there's no real benefit by trying to run them non-blocking or
> as an async operation as they are single threaded and will consume
> a single thread context from start to end.
> 
> Further, one of the main reasons they get punted to the per-thread
> pool is so that io_uring can optimise away the lock contention
> caused by running multiple work threads on exclusively locked
> objects; it does this by only running one work item per inode at a
> time.

Ah, I didn't know that -- thank you for this piece of information.

If we're serializing readdirs per inode I agree that this implementation
is subpart for filesystems implementing iterate_shared.

> > > Filesystems like XFS can easily do non-blocking getdents calls - we
> > > just need the NOWAIT plumbing (like we added to the IO path with
> > > IOCB_NOWAIT) to tell the filesystem not to block on locks or IO.
> > > Indeed, filesystems often have async readahead built into their
> > > getdents paths (XFS does), so it seems to me that we really want
> > > non-blocking getdents to allow filesystems to take full advantage of
> > > doing work without blocking and then shuffling the remainder off to
> > > a background thread when it actually needs to wait for IO....
> > 
> > I believe that can be done without any change of this API, so that'll be
> > a very welcome addition when it is ready;
> 
> Again, I think you miss the point.
> 
> Non blocking data IO came before io_uring and we had the
> infrastructure in place before io_uring took advantage of it.
> Application developers asked the fs developers to add support for
> non-blocking direct IO operations and because we pretty much had all
> the infrastructure to support already in place it got done quickly
> via preadv2/pwritev2 via RWF_NOWAIT flags.
> 
> We already pass a struct dir_context to ->iterate_shared(), so we
> have a simple way to add context specific flags down the filesystem
> from iterate_dir(). This is similar to the iocb for file data IO
> that contains the flags field that holds the IOCB_NOWAIT context for
> io_uring based IO. So the infrastructure to plumb it all the way
> down the fs implementation of ->iterate_shared is already there.

Sure, that sounds like a good approach that isn't breaking the API (not
breaking iterate/iterate_shared implementations that don't look at the
flags and allowing the fs that want to look at it to do so)

Adding such a flag and setting it on the uring side without doing
anything else is trivial enough if polling/rescheduling can be sorted
out.

(I guess at this rate the dir_context flag could also be modified by the
FS to signal it just filled in the last entry, for the other part of
this thread asking to know when the directory has been done iterating
without wasting a trivial empty getdents)

> XFS also has async metadata IO capability and we use that for
> readahead in the xfs_readdir() implementation. hence we've got all
> the parts we need to do non-blocking readdir already in place. This
> is very similar to how we already had all the pieces in the IO path
> ready to do non-block IO well before anyone asked for IOCB_NOWAIT
> functionality....
> 
> AFAICT, the io_uring code wouldn't need to do much more other than
> punt to the work queue if it receives a -EAGAIN result. Otherwise
> the what the filesystem returns doesn't need to change, and I don't
> see that we need to change how the filldir callbacks work, either.
> We just keep filling the user buffer until we either run out of
> cached directory data or the user buffer is full.

I agree with the fs side of it, I'd like to confirm what the uring
side needs to do before proceeding -- looking at the read/write path
there seems to be a polling mechanism in place to tell uring when to
look again, and I haven't looked at this part of the code yet to see
what happens if no such polling is in place (does uring just retry
periodically?)

I'll have a look this weekend.

> > I don't think the adding the
> > uring op should wait on this if we can agree a simple wrapper API is
> > good enough (or come up with a better one if someone has a Good Idea)
> 
> It doesn't look at all hard to me. If you add a NOWAIT context flag
> to the dir_context it should be relatively trivial to connect all
> the parts together. If you do all the VFS, io_uring and userspace
> testing infrastructure work, I should be able to sort out the
> changes needed to xfs_readdir() to support nonblocking
> ->iterate_shared() behaviour.

I still think the userspace <-> ring API is unrelated to the nowait side
of it, but as said in the other part of the thread I'll be happy to give
a bit more time if that helps moving forward.

I'll send another reply around the end of the weekend with either v2 of
this patch or a few more comments.


Thanks,
-- 
Dominique Martinet | Asmadeus

  reply	other threads:[~2023-04-28  6:15 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-04-22  8:40 [PATCH RFC 0/2] io_uring: add getdents support, take 2 Dominique Martinet
2023-04-22  8:40 ` [PATCH RFC 1/2] fs: split off vfs_getdents function of getdents64 syscall Dominique Martinet
2023-04-22 10:34   ` Dominique Martinet
2023-04-22  8:40 ` [PATCH RFC 2/2] io_uring: add support for getdents Dominique Martinet
2023-04-23 22:40   ` Dave Chinner
2023-04-23 23:43     ` Dominique Martinet
2023-04-24  7:29       ` Clay Harris
2023-04-24  8:41         ` Dominique Martinet
2023-04-24  9:20           ` Clay Harris
2023-04-24 10:55             ` Dominique Martinet
2023-04-28  5:06       ` Dave Chinner
2023-04-28  6:14         ` Dominique Martinet [this message]
2023-04-28 11:27           ` Dominique Martinet
2023-04-30 23:15             ` Dave Chinner
2023-04-29  8:07           ` Dominique Martinet
2023-04-30 23:32             ` Dave Chinner
2023-05-01  0:49               ` Dominique Martinet
2023-05-01  7:16                 ` Dave Chinner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox