public inbox for [email protected]
* [PATCH RFC 0/2] io_uring: add getdents support, take 2
@ 2023-04-22  8:40 Dominique Martinet
  2023-04-22  8:40 ` [PATCH RFC 1/2] fs: split off vfs_getdents function of getdents64 syscall Dominique Martinet
  2023-04-22  8:40 ` [PATCH RFC 2/2] io_uring: add support for getdents Dominique Martinet
  0 siblings, 2 replies; 18+ messages in thread
From: Dominique Martinet @ 2023-04-22  8:40 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jens Axboe, Pavel Begunkov,
	Stefan Roesch
  Cc: linux-fsdevel, linux-kernel, io-uring, Dominique Martinet

This is an attempt to revive the discussion, after bringing it up last
week as a reply to the previous attempt.

Since the problem with the previous attempt at adding getdents to
io_uring was the offset handling, we can simply not add an offset
parameter: using different offsets is rarely useful in practice (maybe
for some cacheless userspace NFS server?) and isn't worth the cost of
allowing arbitrary offsets; just allow a rewind as a flag instead.
[happy to drop even that flag for what I care about, but it might be
useful enough on its own, as io_uring rightfully has no seek API]
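
For clarity, the flag behaves exactly like the following plain syscall
sequence, minus any atomicity guarantee (sketch; dirfd/buf are
placeholders):
-----
lseek(dirfd, 0, SEEK_SET);			/* rewinddir(3) equivalent */
n = getdents64(dirfd, buf, sizeof(buf));	/* then read from the start */
-----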

The new API does nothing that cannot be achieved with plain syscalls,
so it shouldn't introduce any new problem. The only downside is that
keeping the state in the file struct isn't very uring-ish, and if a
better solution is found later it will probably require duplicating
some logic behind a new flag... But that seems like a distant future,
and this version should be usable right away.

To repeat the rationale for having getdents available from uring, as it
has been a while: while I believe there can be real performance
improvements, as suggested in [1], the real reason is coherency in code.
Applications using io_uring usually center their event loop around the
ring, and the occasional synchronous getdents call in there is
cumbersome and hard to do properly when you consider e.g. a slow or
misbehaving network fs...
[1] https://lore.kernel.org/linux-fsdevel/[email protected]/
(unfortunately the source is no longer available...)

liburing implementation:
https://github.com/martinetd/liburing/commits/getdents
(will submit properly after initial discussions here)
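
For reference, usage from an application should look roughly like this
(sketch; io_uring_prep_getdents() is the helper from the branch above,
and its name/signature may still change):
-----
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

/* queue a getdents: fill buf from the directory's current position */
io_uring_prep_getdents(sqe, dirfd, buf, sizeof(buf), 0 /* flags */);
io_uring_submit(&ring);

struct io_uring_cqe *cqe;
io_uring_wait_cqe(&ring, &cqe);
/* cqe->res: bytes placed in buf, 0 at end of directory, or -errno */
io_uring_cqe_seen(&ring, cqe);
-----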

Previous discussion:
https://lore.kernel.org/all/[email protected]/T/#m517583f23502f32b040c819d930359313b3db00c

Signed-off-by: Dominique Martinet <[email protected]>
---
Dominique Martinet (2):
      fs: split off vfs_getdents function of getdents64 syscall
      io_uring: add support for getdents

 fs/internal.h                 |  8 +++++++
 fs/readdir.c                  | 33 +++++++++++++++++++++-------
 include/uapi/linux/io_uring.h |  7 ++++++
 io_uring/fs.c                 | 51 +++++++++++++++++++++++++++++++++++++++++++
 io_uring/fs.h                 |  3 +++
 io_uring/opdef.c              |  8 +++++++
 6 files changed, 102 insertions(+), 8 deletions(-)
---
base-commit: 6a8f57ae2eb07ab39a6f0ccad60c760743051026
change-id: 20230422-uring-getdents-2aab84d240aa

Best regards,
-- 
Dominique Martinet | Asmadeus



* [PATCH RFC 1/2] fs: split off vfs_getdents function of getdents64 syscall
  2023-04-22  8:40 [PATCH RFC 0/2] io_uring: add getdents support, take 2 Dominique Martinet
@ 2023-04-22  8:40 ` Dominique Martinet
  2023-04-22 10:34   ` Dominique Martinet
  2023-04-22  8:40 ` [PATCH RFC 2/2] io_uring: add support for getdents Dominique Martinet
  1 sibling, 1 reply; 18+ messages in thread
From: Dominique Martinet @ 2023-04-22  8:40 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jens Axboe, Pavel Begunkov,
	Stefan Roesch
  Cc: linux-fsdevel, linux-kernel, io-uring, Dominique Martinet

This splits off the vfs_getdents function from the getdents64 system
call.
This will allow io_uring to call the vfs_getdents function.

Co-authored-by: Stefan Roesch <[email protected]>
Signed-off-by: Dominique Martinet <[email protected]>
---
 fs/internal.h |  8 ++++++++
 fs/readdir.c  | 33 +++++++++++++++++++++++++--------
 2 files changed, 33 insertions(+), 8 deletions(-)

diff --git a/fs/internal.h b/fs/internal.h
index dc4eb91a577a..92eeaf3837d1 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -264,3 +264,11 @@ int setattr_should_drop_sgid(struct mnt_idmap *idmap,
 struct mnt_idmap *alloc_mnt_idmap(struct user_namespace *mnt_userns);
 struct mnt_idmap *mnt_idmap_get(struct mnt_idmap *idmap);
 void mnt_idmap_put(struct mnt_idmap *idmap);
+
+/*
+ * fs/readdir.c
+ */
+struct linux_dirent64;
+
+int vfs_getdents(struct file *file, struct linux_dirent64 __user *dirent,
+		 unsigned int count);
diff --git a/fs/readdir.c b/fs/readdir.c
index 9c53edb60c03..1d541a6f2d55 100644
--- a/fs/readdir.c
+++ b/fs/readdir.c
@@ -351,10 +351,16 @@ static bool filldir64(struct dir_context *ctx, const char *name, int namlen,
 	return false;
 }
 
-SYSCALL_DEFINE3(getdents64, unsigned int, fd,
-		struct linux_dirent64 __user *, dirent, unsigned int, count)
+
+/**
+ * vfs_getdents - getdents without fdget
+ * @file: pointer to file struct of directory
+ * @dirent: pointer to user directory structure
+ * @count: size of buffer
+ */
+int vfs_getdents(struct file *file, struct linux_dirent64 __user *dirent,
+		 unsigned int count)
 {
-	struct fd f;
 	struct getdents_callback64 buf = {
 		.ctx.actor = filldir64,
 		.count = count,
@@ -362,11 +368,7 @@ SYSCALL_DEFINE3(getdents64, unsigned int, fd,
 	};
 	int error;
 
-	f = fdget_pos(fd);
-	if (!f.file)
-		return -EBADF;
-
-	error = iterate_dir(f.file, &buf.ctx);
+	error = iterate_dir(file, &buf.ctx);
 	if (error >= 0)
 		error = buf.error;
 	if (buf.prev_reclen) {
@@ -379,6 +381,21 @@ SYSCALL_DEFINE3(getdents64, unsigned int, fd,
 		else
 			error = count - buf.count;
 	}
+	return error;
+}
+
+SYSCALL_DEFINE3(getdents64, unsigned int, fd,
+		struct linux_dirent64 __user *, dirent, unsigned int, count)
+{
+	struct fd f;
+	int error;
+
+	f = fdget_pos(fd);
+	if (!f.file)
+		return -EBADF;
+
+	error = vfs_getdents(f.file, dirent, count);
+
 	fdput_pos(f);
 	return error;
 }

-- 
2.39.2



* [PATCH RFC 2/2] io_uring: add support for getdents
  2023-04-22  8:40 [PATCH RFC 0/2] io_uring: add getdents support, take 2 Dominique Martinet
  2023-04-22  8:40 ` [PATCH RFC 1/2] fs: split off vfs_getdents function of getdents64 syscall Dominique Martinet
@ 2023-04-22  8:40 ` Dominique Martinet
  2023-04-23 22:40   ` Dave Chinner
  1 sibling, 1 reply; 18+ messages in thread
From: Dominique Martinet @ 2023-04-22  8:40 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jens Axboe, Pavel Begunkov,
	Stefan Roesch
  Cc: linux-fsdevel, linux-kernel, io-uring, Dominique Martinet

This adds support for getdents64 to io_uring, acting exactly like the
syscall: the directory is iterated from its current position as
stored in the file struct, and the file's position is updated exactly as
if getdents64 had been called.

Additionally, since io_uring has no way of issuing a seek, a flag
IORING_GETDENTS_REWIND has been added that will seek to the start of the
directory like rewinddir(3), for users that might require such a thing.
This acts exactly as if seek and getdents64 had been called
sequentially, with no atomicity guarantee:
if this wasn't clear, it is the responsibility of the caller not to use
getdents multiple times on a single file in parallel if they want useful
results, as is currently the case when using the syscall from multiple
threads.

Signed-off-by: Dominique Martinet <[email protected]>
---
 include/uapi/linux/io_uring.h |  7 ++++++
 io_uring/fs.c                 | 51 +++++++++++++++++++++++++++++++++++++++++++
 io_uring/fs.h                 |  3 +++
 io_uring/opdef.c              |  8 +++++++
 4 files changed, 69 insertions(+)

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 709de6d4feb2..44c87fad2714 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -65,6 +65,7 @@ struct io_uring_sqe {
 		__u32		xattr_flags;
 		__u32		msg_ring_flags;
 		__u32		uring_cmd_flags;
+		__u32		getdents_flags;
 	};
 	__u64	user_data;	/* data to be passed back at completion time */
 	/* pack this to avoid bogus arm OABI complaints */
@@ -223,6 +224,7 @@ enum io_uring_op {
 	IORING_OP_URING_CMD,
 	IORING_OP_SEND_ZC,
 	IORING_OP_SENDMSG_ZC,
+	IORING_OP_GETDENTS,
 
 	/* this goes last, obviously */
 	IORING_OP_LAST,
@@ -258,6 +260,11 @@ enum io_uring_op {
  */
 #define SPLICE_F_FD_IN_FIXED	(1U << 31) /* the last bit of __u32 */
 
+/*
+ * sqe->getdents_flags
+ */
+#define IORING_GETDENTS_REWIND	(1U << 0)
+
 /*
  * POLL_ADD flags. Note that since sqe->poll_events is the flag space, the
  * command flags for POLL_ADD are stored in sqe->len.
diff --git a/io_uring/fs.c b/io_uring/fs.c
index f6a69a549fd4..cb555bc738bd 100644
--- a/io_uring/fs.c
+++ b/io_uring/fs.c
@@ -47,6 +47,13 @@ struct io_link {
 	int				flags;
 };
 
+struct io_getdents {
+	struct file			*file;
+	struct linux_dirent64 __user	*dirent;
+	unsigned int			count;
+	int				flags;
+};
+
 int io_renameat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
 	struct io_rename *ren = io_kiocb_to_cmd(req, struct io_rename);
@@ -291,3 +298,47 @@ void io_link_cleanup(struct io_kiocb *req)
 	putname(sl->oldpath);
 	putname(sl->newpath);
 }
+
+int io_getdents_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+{
+	struct io_getdents *gd = io_kiocb_to_cmd(req, struct io_getdents);
+
+	if (READ_ONCE(sqe->off) != 0)
+		return -EINVAL;
+
+	gd->dirent = u64_to_user_ptr(READ_ONCE(sqe->addr));
+	gd->count = READ_ONCE(sqe->len);
+	gd->flags = READ_ONCE(sqe->getdents_flags);
+	if (gd->flags & ~IORING_GETDENTS_REWIND)
+		return -EINVAL;
+
+	return 0;
+}
+
+int io_getdents(struct io_kiocb *req, unsigned int issue_flags)
+{
+	struct io_getdents *gd = io_kiocb_to_cmd(req, struct io_getdents);
+	int ret;
+
+	if (issue_flags & IO_URING_F_NONBLOCK)
+		return -EAGAIN;
+
+	if ((gd->flags & IORING_GETDENTS_REWIND)) {
+		ret = vfs_llseek(req->file, 0, SEEK_SET);
+		if (ret < 0)
+			goto out;
+	}
+
+	ret = vfs_getdents(req->file, gd->dirent, gd->count);
+out:
+	if (ret < 0) {
+		if (ret == -ERESTARTSYS)
+			ret = -EINTR;
+
+		req_set_fail(req);
+	}
+
+	io_req_set_res(req, ret, 0);
+	return 0;
+}
+
diff --git a/io_uring/fs.h b/io_uring/fs.h
index 0bb5efe3d6bb..f83a6f3a678d 100644
--- a/io_uring/fs.h
+++ b/io_uring/fs.h
@@ -18,3 +18,6 @@ int io_symlinkat(struct io_kiocb *req, unsigned int issue_flags);
 int io_linkat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
 int io_linkat(struct io_kiocb *req, unsigned int issue_flags);
 void io_link_cleanup(struct io_kiocb *req);
+
+int io_getdents_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
+int io_getdents(struct io_kiocb *req, unsigned int issue_flags);
diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index cca7c5b55208..8f770c582ab3 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -428,6 +428,11 @@ const struct io_issue_def io_issue_defs[] = {
 		.prep			= io_eopnotsupp_prep,
 #endif
 	},
+	[IORING_OP_GETDENTS] = {
+		.needs_file		= 1,
+		.prep			= io_getdents_prep,
+		.issue			= io_getdents,
+	},
 };
 
 
@@ -648,6 +653,9 @@ const struct io_cold_def io_cold_defs[] = {
 		.fail			= io_sendrecv_fail,
 #endif
 	},
+	[IORING_OP_GETDENTS] = {
+		.name			= "GETDENTS",
+	},
 };
 
 const char *io_uring_get_opcode(u8 opcode)

-- 
2.39.2



* Re: [PATCH RFC 1/2] fs: split off vfs_getdents function of getdents64 syscall
  2023-04-22  8:40 ` [PATCH RFC 1/2] fs: split off vfs_getdents function of getdents64 syscall Dominique Martinet
@ 2023-04-22 10:34   ` Dominique Martinet
  0 siblings, 0 replies; 18+ messages in thread
From: Dominique Martinet @ 2023-04-22 10:34 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jens Axboe, Pavel Begunkov,
	Stefan Roesch
  Cc: linux-fsdevel, linux-kernel, io-uring

Dominique Martinet wrote on Sat, Apr 22, 2023 at 05:40:18PM +0900:
> This splits off the vfs_getdents function from the getdents64 system
> call.
> This will allow io_uring to call the vfs_getdents function.
> 
> Co-authored-by: Stefan Roesch <[email protected]>
> Signed-off-by: Dominique Martinet <[email protected]>
> ---
>  fs/internal.h |  8 ++++++++
>  fs/readdir.c  | 33 +++++++++++++++++++++++++--------
>  2 files changed, 33 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/internal.h b/fs/internal.h
> index dc4eb91a577a..92eeaf3837d1 100644
> --- a/fs/internal.h
> +++ b/fs/internal.h
> @@ -264,3 +264,11 @@ int setattr_should_drop_sgid(struct mnt_idmap *idmap,
>  struct mnt_idmap *alloc_mnt_idmap(struct user_namespace *mnt_userns);
>  struct mnt_idmap *mnt_idmap_get(struct mnt_idmap *idmap);
>  void mnt_idmap_put(struct mnt_idmap *idmap);
> +
> +/*
> + * fs/readdir.c
> + */
> +struct linux_dirent64;
> +
> +int vfs_getdents(struct file *file, struct linux_dirent64 __user *dirent,
> +		 unsigned int count);
> diff --git a/fs/readdir.c b/fs/readdir.c
> index 9c53edb60c03..1d541a6f2d55 100644
> --- a/fs/readdir.c
> +++ b/fs/readdir.c

(This needs an extra `#include "internal.h"`; the missing-declaration
warning was reported privately by the Intel build robot... fs/ doesn't
build with W=1 by default. I'll resend a v2 after some comments; it
doesn't make much sense to spam patches at this point.)
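
For the record, the fix itself is just one line at the top of
fs/readdir.c:
-----
#include "internal.h"	/* picks up the vfs_getdents() declaration */
-----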

--
Dominique


* Re: [PATCH RFC 2/2] io_uring: add support for getdents
  2023-04-22  8:40 ` [PATCH RFC 2/2] io_uring: add support for getdents Dominique Martinet
@ 2023-04-23 22:40   ` Dave Chinner
  2023-04-23 23:43     ` Dominique Martinet
  0 siblings, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2023-04-23 22:40 UTC (permalink / raw)
  To: Dominique Martinet
  Cc: Alexander Viro, Christian Brauner, Jens Axboe, Pavel Begunkov,
	Stefan Roesch, linux-fsdevel, linux-kernel, io-uring

On Sat, Apr 22, 2023 at 05:40:19PM +0900, Dominique Martinet wrote:
> This adds support for getdents64 to io_uring, acting exactly like the
> syscall: the directory is iterated from its current position as
> stored in the file struct, and the file's position is updated exactly as
> if getdents64 had been called.
> 
> Additionally, since io_uring has no way of issuing a seek, a flag
> IORING_GETDENTS_REWIND has been added that will seek to the start of the
> directory like rewinddir(3), for users that might require such a thing.
> This acts exactly as if seek and getdents64 had been called
> sequentially, with no atomicity guarantee:
> if this wasn't clear, it is the responsibility of the caller not to use
> getdents multiple times on a single file in parallel if they want useful
> results, as is currently the case when using the syscall from multiple
> threads.
> 
> Signed-off-by: Dominique Martinet <[email protected]>
> ---
>  include/uapi/linux/io_uring.h |  7 ++++++
>  io_uring/fs.c                 | 51 +++++++++++++++++++++++++++++++++++++++++++
>  io_uring/fs.h                 |  3 +++
>  io_uring/opdef.c              |  8 +++++++
>  4 files changed, 69 insertions(+)

This doesn't actually introduce non-blocking getdents operations, so
what's the point? If it just shuffles the getdents call off to a
background thread, why bother with io_uring in the first place?

Filesystems like XFS can easily do non-blocking getdents calls - we
just need the NOWAIT plumbing (like we added to the IO path with
IOCB_NOWAIT) to tell the filesystem not to block on locks or IO.
Indeed, filesystems often have async readahead built into their
getdents paths (XFS does), so it seems to me that we really want
non-blocking getdents to allow filesystems to take full advantage of
doing work without blocking and then shuffling the remainder off to
a background thread when it actually needs to wait for IO....
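
For comparison, this is what the non-blocking data IO path already
looks like from userspace today (sketch):
-----
/* RWF_NOWAIT maps to IOCB_NOWAIT in the kernel: the read fails with
 * -EAGAIN instead of blocking on locks or IO, and the caller decides
 * what to do next (e.g. retry from a worker thread).
 */
struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
ssize_t n = preadv2(fd, &iov, 1, -1 /* current offset */, RWF_NOWAIT);
if (n < 0 && errno == EAGAIN)
	/* not immediately available; fall back to a blocking context */;
-----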

Cheers,

Dave.
-- 
Dave Chinner
[email protected]


* Re: [PATCH RFC 2/2] io_uring: add support for getdents
  2023-04-23 22:40   ` Dave Chinner
@ 2023-04-23 23:43     ` Dominique Martinet
  2023-04-24  7:29       ` Clay Harris
  2023-04-28  5:06       ` Dave Chinner
  0 siblings, 2 replies; 18+ messages in thread
From: Dominique Martinet @ 2023-04-23 23:43 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Alexander Viro, Christian Brauner, Jens Axboe, Pavel Begunkov,
	Stefan Roesch, linux-fsdevel, linux-kernel, io-uring

Dave Chinner wrote on Mon, Apr 24, 2023 at 08:40:45AM +1000:
> This doesn't actually introduce non-blocking getdents operations, so
> what's the point? If it just shuffles the getdents call off to a
> background thread, why bother with io_uring in the first place?

As said in the cover letter, my main motivation really is simplifying
the userspace application:
 - style-wise, mixing a plain old getdents(2) or readdir(3) call into
the middle of an io_uring handling loop just feels wrong; but this may
just be my OCD talking.
 - in my understanding io_uring has its own thread pool, so even if the
actual getdents is blocking, other IOs can progress (assuming there are
fewer blocked getdents calls than threads), without having to build
one's own extra thread pool next to the uring handling.
Looking at io_uring/fs.c, the other "metadata related" calls there also
use the synchronous APIs (renameat, unlinkat, mkdirat, symlinkat and
linkat all do), so I didn't think of that as a problem in itself.


> Filesystems like XFS can easily do non-blocking getdents calls - we
> just need the NOWAIT plumbing (like we added to the IO path with
> IOCB_NOWAIT) to tell the filesystem not to block on locks or IO.
> Indeed, filesystems often have async readahead built into their
> getdents paths (XFS does), so it seems to me that we really want
> non-blocking getdents to allow filesystems to take full advantage of
> doing work without blocking and then shuffling the remainder off to
> a background thread when it actually needs to wait for IO....

I believe that can be done without any change to this API, so that'll
be a very welcome addition when it is ready; I don't think adding the
uring op should wait on this if we can agree a simple wrapper API is
good enough (or come up with a better one if someone has a Good Idea).

(looking at io_uring/rw.c for comparison, io_getdents() will "just" need
to be adjusted to issue an async req if IO_URING_F_NONBLOCK is set, and
the poll/retry logic sorted out)

Thanks,
-- 
Dominique Martinet | Asmadeus


* Re: [PATCH RFC 2/2] io_uring: add support for getdents
  2023-04-23 23:43     ` Dominique Martinet
@ 2023-04-24  7:29       ` Clay Harris
  2023-04-24  8:41         ` Dominique Martinet
  2023-04-28  5:06       ` Dave Chinner
  1 sibling, 1 reply; 18+ messages in thread
From: Clay Harris @ 2023-04-24  7:29 UTC (permalink / raw)
  To: Dominique Martinet
  Cc: Dave Chinner, Alexander Viro, Christian Brauner, Jens Axboe,
	Pavel Begunkov, Stefan Roesch, linux-fsdevel, linux-kernel,
	io-uring

On Mon, Apr 24 2023 at 08:43:00 +0900, Dominique Martinet quoth thus:

> Dave Chinner wrote on Mon, Apr 24, 2023 at 08:40:45AM +1000:
> > This doesn't actually introduce non-blocking getdents operations, so
> > what's the point? If it just shuffles the getdents call off to a
> > background thread, why bother with io_uring in the first place?
> 
> As said in the cover letter, my main motivation really is simplifying
> the userspace application:
>  - style-wise, mixing a plain old getdents(2) or readdir(3) call into
> the middle of an io_uring handling loop just feels wrong; but this may
> just be my OCD talking.
>  - in my understanding io_uring has its own thread pool, so even if the
> actual getdents is blocking, other IOs can progress (assuming there are
> fewer blocked getdents calls than threads), without having to build
> one's own extra thread pool next to the uring handling.
> Looking at io_uring/fs.c, the other "metadata related" calls there also
> use the synchronous APIs (renameat, unlinkat, mkdirat, symlinkat and
> linkat all do), so I didn't think of that as a problem in itself.

Having written code to create additional worker threads in addition
to using io_uring as a main loop, I'm glad to see this proposal back
again.  I think the original work was shot down much too hastily based
on the file positioning issues.  Really only two cases of directory
position are useful*, which can be expressed either as an off_t
(0 = rewind, -1 = curpos), or a single bit flag.  As I understand the
code, the rewind case shouldn't be any problem.  From a practical
viewpoint, I don't think non-blocking would see a lot of use, but it
wouldn't hurt.

This also seems like a good place to bring up a point I made with
the last attempt at this code.  You're missing an optimization here.
getdents knows whether it is returning a buffer because the next entry
won't fit, versus because there are no more entries.  As it doesn't
return that information, callers must always keep calling it back
until EOF.  This means a completely unnecessary call is made for
every open directory.  In other words, for a directory scan where
the buffers are large enough not to overflow, literally twice
as many calls are made to getdents as necessary.  As io_uring is
in-kernel, it could use an internal interface to getdents which would
return an EOF indicator along with the (probably non-empty) buffer.
io_uring would then return that flag with the CQE.
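
To spell out what today's userspace loop looks like (sketch;
process_entries() is a hypothetical helper):
-----
/* A directory that fits in one buffer still costs two getdents64
 * calls, because only the final 0 return says there are no more
 * entries.
 */
for (;;) {
	ssize_t n = syscall(SYS_getdents64, dirfd, buf, sizeof(buf));
	if (n <= 0)
		break;			/* 0 = EOF, < 0 = error */
	process_entries(buf, n);	/* hypothetical helper */
}
-----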


(* As an aside, the only place I've ever seen a non-zero lseek on a
directory is in a very resource-limited environment, e.g. one with a
too-small open files limit.  In the case of a depth-first directory
scan, it must close directories before completely reading them, and
reopen / lseek to their previous position in order to continue.  This
scenario is certainly not worth bothering with for io_uring.)

> > Filesystems like XFS can easily do non-blocking getdents calls - we
> > just need the NOWAIT plumbing (like we added to the IO path with
> > IOCB_NOWAIT) to tell the filesystem not to block on locks or IO.
> > Indeed, filesystems often have async readahead built into their
> > getdents paths (XFS does), so it seems to me that we really want
> > non-blocking getdents to allow filesystems to take full advantage of
> > doing work without blocking and then shuffling the remainder off to
> > a background thread when it actually needs to wait for IO....
> 
> I believe that can be done without any change to this API, so that'll
> be a very welcome addition when it is ready; I don't think adding the
> uring op should wait on this if we can agree a simple wrapper API is
> good enough (or come up with a better one if someone has a Good Idea).
> 
> (looking at io_uring/rw.c for comparison, io_getdents() will "just" need
> to be adjusted to issue an async req if IO_URING_F_NONBLOCK is set, and
> the poll/retry logic sorted out)
> 
> Thanks,
> -- 
> Dominique Martinet | Asmadeus


* Re: [PATCH RFC 2/2] io_uring: add support for getdents
  2023-04-24  7:29       ` Clay Harris
@ 2023-04-24  8:41         ` Dominique Martinet
  2023-04-24  9:20           ` Clay Harris
  0 siblings, 1 reply; 18+ messages in thread
From: Dominique Martinet @ 2023-04-24  8:41 UTC (permalink / raw)
  To: Clay Harris
  Cc: Dave Chinner, Alexander Viro, Christian Brauner, Jens Axboe,
	Pavel Begunkov, Stefan Roesch, linux-fsdevel, linux-kernel,
	io-uring

Thanks!

Clay Harris wrote on Mon, Apr 24, 2023 at 02:29:46AM -0500:
> This also seems like a good place to bring up a point I made with
> the last attempt at this code.  You're missing an optimization here.
> getdents knows whether it is returning a buffer because the next entry
> won't fit, versus because there are no more entries.  As it doesn't
> return that information, callers must always keep calling it back
> until EOF.  This means a completely unnecessary call is made for
> every open directory.  In other words, for a directory scan where
> the buffers are large enough not to overflow, literally twice
> as many calls are made to getdents as necessary.  As io_uring is
> in-kernel, it could use an internal interface to getdents which would
> return an EOF indicator along with the (probably non-empty) buffer.
> io_uring would then return that flag with the CQE.

Sorry I didn't spot that comment in the last iteration of the patch,
that sounds interesting.

This isn't straightforward even in-kernel though: the ctx.actor callback
(filldir64) isn't called when we're done, so we only know when we
couldn't fit an entry in the buffer.
We could have the callback record 'buffer full' and consider we're done
if the buffer is full, or just single-handedly declare we are done if we
have more than `MAXNAMLEN + sizeof(struct linux_dirent64)` left over,
but I assume a filesystem is allowed to return what it has readily
available and expect the user to come back later?
In which case we cannot use this as a heuristic...
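
For illustration, the 'buffer full' variant discussed above would look
something like this (sketch; the buffer_full field is hypothetical):
-----
struct getdents_callback64 {
	struct dir_context ctx;
	struct linux_dirent64 __user *current_dir;
	int prev_reclen;
	int count;
	int error;
	bool buffer_full;	/* hypothetical: set by filldir64 when it
				 * rejects an entry for lack of room; the
				 * caller would treat !buffer_full as EOF,
				 * which is exactly what breaks if a fs
				 * may stop early for other reasons */
};
-----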

So if we do this, it'll require a way for filesystems to say they're
filling in as much as they can, or go the sledgehammer way of adding an
extra dir_context callback; either way I'm not sure I want
to deal with all that immediately unless I'm told all filesystems will
fill as much as possible without ever failing for any temporary reason
in the middle of iterate/iterate_shared().
Call me greedy but I believe such a flag in the CQE could also be added
later on without any bad side effects (as it's optional to check on it
to stop calling early and there's no harm in not setting it)?


> (* As an aside, the only place I've ever seen a non-zero lseek on a
> directory is in a very resource-limited environment, e.g. one with a
> too-small open files limit.  In the case of a depth-first directory
> scan, it must close directories before completely reading them, and
> reopen / lseek to their previous position in order to continue.  This
> scenario is certainly not worth bothering with for io_uring.)

(I also thought of userspace NFS/9P servers, as these two at least get
requests from clients with an arbitrary offset, but I'll be glad to
forget about them for now...)

-- 
Dominique Martinet | Asmadeus


* Re: [PATCH RFC 2/2] io_uring: add support for getdents
  2023-04-24  8:41         ` Dominique Martinet
@ 2023-04-24  9:20           ` Clay Harris
  2023-04-24 10:55             ` Dominique Martinet
  0 siblings, 1 reply; 18+ messages in thread
From: Clay Harris @ 2023-04-24  9:20 UTC (permalink / raw)
  To: Dominique Martinet
  Cc: Dave Chinner, Alexander Viro, Christian Brauner, Jens Axboe,
	Pavel Begunkov, Stefan Roesch, linux-fsdevel, linux-kernel,
	io-uring

On Mon, Apr 24 2023 at 17:41:18 +0900, Dominique Martinet quoth thus:

> Thanks!
> 
> Clay Harris wrote on Mon, Apr 24, 2023 at 02:29:46AM -0500:
> > This also seems like a good place to bring up a point I made with
> > the last attempt at this code.  You're missing an optimization here.
> > getdents knows whether it is returning a buffer because the next entry
> > won't fit, versus because there are no more entries.  As it doesn't
> > return that information, callers must always keep calling it back
> > until EOF.  This means a completely unnecessary call is made for
> > every open directory.  In other words, for a directory scan where
> > the buffers are large enough not to overflow, literally twice
> > as many calls are made to getdents as necessary.  As io_uring is
> > in-kernel, it could use an internal interface to getdents which would
> > return an EOF indicator along with the (probably non-empty) buffer.
> > io_uring would then return that flag with the CQE.
> 
> Sorry I didn't spot that comment in the last iteration of the patch,
> that sounds interesting.
> 
> This isn't straightforward even in-kernel though: the ctx.actor callback
> (filldir64) isn't called when we're done, so we only know when we
> couldn't fit an entry in the buffer.
> We could have the callback record 'buffer full' and consider we're done
> if the buffer is full, or just single-handedly declare we are done if we
> have more than `MAXNAMLEN + sizeof(struct linux_dirent64)` left over,
> but I assume a filesystem is allowed to return what it has readily
> available and expect the user to come back later?
> In which case we cannot use this as a heuristic...
> 
> So if we do this, it'll require a way for filesystems to say they're
> filling in as much as they can, or go the sledgehammer way of adding an
> extra dir_context callback; either way I'm not sure I want
> to deal with all that immediately unless I'm told all filesystems will
> fill as much as possible without ever failing for any temporary reason
> in the middle of iterate/iterate_shared().

I don't have a complete understanding of this area, but my thought was
not that we would look for any buffer full condition, but rather that
an iterator could be tested for next_entry == EOF.

> Call me greedy but I believe such a flag in the CQE could also be added
> later on without any bad side effects (as it's optional to check on it
> to stop calling early and there's no harm in not setting it)?

Certainly it could be added later, but I wanted to make sure some thought
was put into it now.  It would be nice to have it sooner rather than later
though...

> 
> > (* As an aside, the only place I've ever seen a non-zero lseek on a
> > directory is in a very resource-limited environment, e.g. one with a
> > too-small open files limit.  In the case of a depth-first directory
> > scan, it must close directories before completely reading them, and
> > reopen / lseek to their previous position in order to continue.  This
> > scenario is certainly not worth bothering with for io_uring.)
> 
> (I also thought of userspace NFS/9P servers, as these two at least get
> requests from clients with an arbitrary offset, but I'll be glad to
> forget about them for now...)
> 
> -- 
> Dominique Martinet | Asmadeus


* Re: [PATCH RFC 2/2] io_uring: add support for getdents
  2023-04-24  9:20           ` Clay Harris
@ 2023-04-24 10:55             ` Dominique Martinet
  0 siblings, 0 replies; 18+ messages in thread
From: Dominique Martinet @ 2023-04-24 10:55 UTC (permalink / raw)
  To: Clay Harris
  Cc: Dave Chinner, Alexander Viro, Christian Brauner, Jens Axboe,
	Pavel Begunkov, Stefan Roesch, linux-fsdevel, linux-kernel,
	io-uring

Clay Harris wrote on Mon, Apr 24, 2023 at 04:20:55AM -0500:
> > This isn't straightforward even in-kernel though: the ctx.actor callback
> > (filldir64) isn't called when we're done, so we only know when we
> > couldn't fit an entry in the buffer.
> > We could have the callback record 'buffer full' and consider we're done
> > if the buffer is full, or just single-handedly declare we are done if we
> > have more than `MAXNAMLEN + sizeof(struct linux_dirent64)` left over,
> > but I assume a filesystem is allowed to return what it has readily
> > available and expect the user to come back later?
> > In which case we cannot use this as a heuristic...
> > 
> > So if we do this, it'll require a way for filesystems to say they're
> > filling in as much as they can, or go the sledgehammer way of adding an
> > extra dir_context callback; either way I'm not sure I want
> > to deal with all that immediately unless I'm told all filesystems will
> > fill as much as possible without ever failing for any temporary reason
> > in the middle of iterate/iterate_shared().
> 
> I don't have a complete understanding of this area, but my thought was
> not that we would look for any buffer full condition, but rather that
> an iterator could be tested for next_entry == EOF.

I'm afraid I don't see any such iterator readily available in the common
code, so that would also have to be per fs as far as I can see :/

Also some filesystems just don't have a next_entry iterator readily
available; for example on 9p the network protocol is pretty much exactly
like getdents, and a readdir call is issued continuously until either
filldir64 returns an error (e.g. buffer full) or the server returns 0
(or an error); once you leave the v9fs_dir_readdir{,_dotl}() call you
lose this information immediately.
Getting this information out would be doubly appreciable because the
"final getdents" to confirm the dir has been fully read is redundant
with the last 9p readdir which ended the previous iteration
(this could also be fixed by caching...), but that'll really require
fixing up the iterate_shared() operation to signal such a thing...

> > Call me greedy but I believe such a flag in the CQE could also be added
> > later on without any bad side effects (as it's optional to check on it
> > to stop calling early and there's no harm in not setting it)?
> 
> Certainly it could be added later, but I wanted to make sure some thought
> was put into it now.  It would be nice to have it sooner rather than later
> though...

Yes, if there's an easy way forward I overlooked I'd be happy to spend a
bit more time on it; and discussing never hurts.
In for a penny, in for a pound :)

-- 
Dominique Martinet | Asmadeus


* Re: [PATCH RFC 2/2] io_uring: add support for getdents
  2023-04-23 23:43     ` Dominique Martinet
  2023-04-24  7:29       ` Clay Harris
@ 2023-04-28  5:06       ` Dave Chinner
  2023-04-28  6:14         ` Dominique Martinet
  1 sibling, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2023-04-28  5:06 UTC (permalink / raw)
  To: Dominique Martinet
  Cc: Alexander Viro, Christian Brauner, Jens Axboe, Pavel Begunkov,
	Stefan Roesch, linux-fsdevel, linux-kernel, io-uring

On Mon, Apr 24, 2023 at 08:43:00AM +0900, Dominique Martinet wrote:
> Dave Chinner wrote on Mon, Apr 24, 2023 at 08:40:45AM +1000:
> > This doesn't actually introduce non-blocking getdents operations, so
> > what's the point? If it just shuffles the getdents call off to a
> > background thread, why bother with io_uring in the first place?
> 
> As said in the cover letter, my main motivation really is simplifying
> the userspace application:
>  - style-wise, mixing a plain old getdents(2) or readdir(3) call into
> the middle of an io_uring handling loop just feels wrong; but this may
> just be my OCD talking.
>  - in my understanding io_uring has its own thread pool, so even if the
> actual getdents is blocking, other IOs can progress (assuming there are
> fewer blocked getdents calls than threads), without having to build
> one's own extra thread pool next to the uring handling.
> Looking at io_uring/fs.c, the other "metadata related" calls there also
> use the synchronous APIs (renameat, unlinkat, mkdirat, symlinkat and
> linkat all do), so I didn't think of that as a problem in itself.

I think you missed the point. getdents is not an exclusive operation
- it is run under shared locking, unlike all the other directory
modification operations you cite above. They use exclusive locking,
so there's no real benefit in trying to run them non-blocking or
as async operations, as they are single threaded and will consume
a single thread context from start to end.

Further, one of the main reasons they get punted to the per-thread
pool is so that io_uring can optimise away the lock contention
caused by running multiple work threads on exclusively locked
objects; it does this by only running one work item per inode at a
time.

This is exactly what we don't want with getdents - we want to be
able to run as many concurrent getdents and lookup operations in
parallel as we can, as they all use shared locking. IOWs, getdents
and inode lookups are much closer in behaviour and application use
to concurrent buffered data reads than they are to directory
modification operations.

We can already do concurrent getdents/lookup operations on a single
directory from userspace with multiple threads, but the way this
series adds support to io_uring somewhat prevents concurrent
getdents/lookup operations on the same directory inode via io_uring.
IOWs, adding getdents support to io_uring like this is not a step
forwards for applications that use/need concurrency in directory
lookup operations.

Keep in mind that if the directory is small enough to fit in the
inode, XFS can return all the getdents information immediately as it
is guaranteed to be in memory without doing any IO at all. Why
should that fast path that is commonly hit get punted to a work
queue and suddenly cost an application at least two extra context
switches?

> > Filesystems like XFS can easily do non-blocking getdents calls - we
> > just need the NOWAIT plumbing (like we added to the IO path with
> > IOCB_NOWAIT) to tell the filesystem not to block on locks or IO.
> > Indeed, filesystems often have async readahead built into their
> > getdents paths (XFS does), so it seems to me that we really want
> > non-blocking getdents to allow filesystems to take full advantage of
> > doing work without blocking and then shuffling the remainder off to
> > a background thread when it actually needs to wait for IO....
> 
> I believe that can be done without any change to this API, so that'll
> be a very welcome addition when it is ready;

Again, I think you miss the point.

Non-blocking data IO came before io_uring, and we had the
infrastructure in place before io_uring took advantage of it.
Application developers asked the fs developers to add support for
non-blocking direct IO operations, and because we pretty much had all
the supporting infrastructure already in place it got done quickly,
via RWF_NOWAIT flags to preadv2/pwritev2.

We already pass a struct dir_context to ->iterate_shared(), so we
have a simple way to add context specific flags down to the filesystem
from iterate_dir(). This is similar to the iocb for file data IO
that contains the flags field that holds the IOCB_NOWAIT context for
io_uring based IO. So the infrastructure to plumb it all the way
down to the fs implementation of ->iterate_shared is already there.

XFS also has async metadata IO capability and we use that for
readahead in the xfs_readdir() implementation. Hence we've got all
the parts we need to do non-blocking readdir already in place. This
is very similar to how we already had all the pieces in the IO path
ready to do non-blocking IO well before anyone asked for IOCB_NOWAIT
functionality....

AFAICT, the io_uring code wouldn't need to do much more other than
punt to the work queue if it receives a -EAGAIN result. Otherwise
what the filesystem returns doesn't need to change, and I don't
see that we need to change how the filldir callbacks work, either.
We just keep filling the user buffer until we either run out of
cached directory data or the user buffer is full.
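
Roughly, the plumbing would be something like this (sketch; the names
are illustrative, not a final API):
-----
/* a flags field in dir_context, set by iterate_dir() callers... */
struct dir_context {
	filldir_t actor;
	loff_t pos;
	unsigned int flags;			/* new */
};
#define DIR_CONTEXT_NOWAIT	(1 << 0)	/* don't block on locks or IO */

/* ...and checked in a filesystem's ->iterate_shared(), e.g.: */
if ((ctx->flags & DIR_CONTEXT_NOWAIT) && !dir_data_cached(inode))
	return -EAGAIN;		/* dir_data_cached() is hypothetical */
-----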

And as I've already implied, several filesystems perform async
readahead from their ->iterate_shared methods, so there's every
chance they will return some data while there is readahead IO in
progress. By the time the io_uring processing loop gets back to
issue another getdents operation, that IO will have completed and the
application will be able to read more dirents without blocking. The
filesystem will issue more readahead while processing what it
already has available, and around the loop we go.

> I don't think adding the
> uring op should wait on this if we can agree a simple wrapper API is
> good enough (or come up with a better one if someone has a Good Idea).

It doesn't look at all hard to me. If you add a NOWAIT context flag
to the dir_context it should be relatively trivial to connect all
the parts together. If you do all the VFS, io_uring and userspace
testing infrastructure work, I should be able to sort out the
changes needed to xfs_readdir() to support nonblocking
->iterate_shared() behaviour.

Cheers,

Dave.
-- 
Dave Chinner
[email protected]


* Re: [PATCH RFC 2/2] io_uring: add support for getdents
  2023-04-28  5:06       ` Dave Chinner
@ 2023-04-28  6:14         ` Dominique Martinet
  2023-04-28 11:27           ` Dominique Martinet
  2023-04-29  8:07           ` Dominique Martinet
  0 siblings, 2 replies; 18+ messages in thread
From: Dominique Martinet @ 2023-04-28  6:14 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Alexander Viro, Christian Brauner, Jens Axboe, Pavel Begunkov,
	Stefan Roesch, linux-fsdevel, linux-kernel, io-uring

Dave Chinner wrote on Fri, Apr 28, 2023 at 03:06:40PM +1000:
> >  - in my understanding io_uring has its own thread pool, so even if the
> > actual getdents is blocking, other IOs can progress (assuming there are
> > fewer blocked getdents calls than threads), without having to build
> > one's own extra thread pool next to the uring handling.
> > Looking at io_uring/fs.c, the other "metadata related" calls there also
> > use the synchronous APIs (renameat, unlinkat, mkdirat, symlinkat and
> > linkat all do), so I didn't think of that as a problem in itself.
> 
> I think you missed the point. getdents is not an exclusive operation
> - it is run under shared locking, unlike all the other directory
> modification operations you cite above. They use exclusive locking,
> so there's no real benefit in trying to run them non-blocking or
> as async operations, as they are single threaded and will consume
> a single thread context from start to end.
> 
> Further, one of the main reasons they get punted to the per-thread
> pool is so that io_uring can optimise away the lock contention
> caused by running multiple work threads on exclusively locked
> objects; it does this by only running one work item per inode at a
> time.

Ah, I didn't know that -- thank you for this piece of information.

If we're serializing readdirs per inode, I agree that this
implementation is subpar for filesystems implementing iterate_shared.

> > > Filesystems like XFS can easily do non-blocking getdents calls - we
> > > just need the NOWAIT plumbing (like we added to the IO path with
> > > IOCB_NOWAIT) to tell the filesystem not to block on locks or IO.
> > > Indeed, filesystems often have async readahead built into their
> > > getdents paths (XFS does), so it seems to me that we really want
> > > non-blocking getdents to allow filesystems to take full advantage of
> > > doing work without blocking and then shuffling the remainder off to
> > > a background thread when it actually needs to wait for IO....
> > 
> > I believe that can be done without any change to this API, so that'll
> > be a very welcome addition when it is ready;
> 
> Again, I think you miss the point.
> 
> Non-blocking data IO came before io_uring, and we had the
> infrastructure in place before io_uring took advantage of it.
> Application developers asked the fs developers to add support for
> non-blocking direct IO operations, and because we pretty much had all
> the supporting infrastructure already in place it got done quickly,
> via RWF_NOWAIT flags to preadv2/pwritev2.
> 
> We already pass a struct dir_context to ->iterate_shared(), so we
> have a simple way to add context specific flags down to the filesystem
> from iterate_dir(). This is similar to the iocb for file data IO
> that contains the flags field that holds the IOCB_NOWAIT context for
> io_uring based IO. So the infrastructure to plumb it all the way
> down to the fs implementation of ->iterate_shared is already there.

Sure, that sounds like a good approach that doesn't break the API (it
doesn't break iterate/iterate_shared implementations that don't look at
the flags, while allowing the filesystems that want to look at them to
do so).

Adding such a flag and setting it on the uring side without doing
anything else is trivial enough if polling/rescheduling can be sorted
out.

(I guess at this rate the dir_context flags could also be modified by
the FS to signal it just filled in the last entry, for the other part of
this thread asking to know when the directory has been fully iterated
without wasting a trivial empty getdents call)

> XFS also has async metadata IO capability and we use that for
> readahead in the xfs_readdir() implementation. Hence we've got all
> the parts we need to do non-blocking readdir already in place. This
> is very similar to how we already had all the pieces in the IO path
> ready to do non-blocking IO well before anyone asked for IOCB_NOWAIT
> functionality....
> 
> AFAICT, the io_uring code wouldn't need to do much more other than
> punt to the work queue if it receives a -EAGAIN result. Otherwise
> what the filesystem returns doesn't need to change, and I don't
> see that we need to change how the filldir callbacks work, either.
> We just keep filling the user buffer until we either run out of
> cached directory data or the user buffer is full.

I agree with the fs side of it, but I'd like to confirm what the uring
side needs to do before proceeding -- looking at the read/write path
there seems to be a polling mechanism in place to tell uring when to
look again, and I haven't looked at this part of the code yet to see
what happens if no such polling is in place (does uring just retry
periodically?)

I'll have a look this weekend.

> > I don't think adding the
> > uring op should wait on this if we can agree a simple wrapper API is
> > good enough (or come up with a better one if someone has a Good Idea).
> 
> It doesn't look at all hard to me. If you add a NOWAIT context flag
> to the dir_context it should be relatively trivial to connect all
> the parts together. If you do all the VFS, io_uring and userspace
> testing infrastructure work, I should be able to sort out the
> changes needed to xfs_readdir() to support nonblocking
> ->iterate_shared() behaviour.

I still think the userspace <-> ring API is unrelated to the nowait
side of it, but as said in the other part of the thread I'll be happy to
give it a bit more time if that helps moving forward.

I'll send another reply around the end of the weekend with either v2 of
this patch or a few more comments.


Thanks,
-- 
Dominique Martinet | Asmadeus


* Re: [PATCH RFC 2/2] io_uring: add support for getdents
  2023-04-28  6:14         ` Dominique Martinet
@ 2023-04-28 11:27           ` Dominique Martinet
  2023-04-30 23:15             ` Dave Chinner
  2023-04-29  8:07           ` Dominique Martinet
  1 sibling, 1 reply; 18+ messages in thread
From: Dominique Martinet @ 2023-04-28 11:27 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Alexander Viro, Christian Brauner, Jens Axboe, Pavel Begunkov,
	Stefan Roesch, linux-fsdevel, linux-kernel, io-uring

Dominique Martinet wrote on Fri, Apr 28, 2023 at 03:14:52PM +0900:
> > We already pass a struct dir_context to ->iterate_shared(), so we
> > have a simple way to add context specific flags down to the filesystem
> > from iterate_dir(). This is similar to the iocb for file data IO
> > that contains the flags field that holds the IOCB_NOWAIT context for
> > io_uring based IO. So the infrastructure to plumb it all the way
> > down to the fs implementation of ->iterate_shared is already there.
> 
> Sure, that sounds like a good approach that doesn't break the API (it
> doesn't break iterate/iterate_shared implementations that don't look at
> the flags, while allowing the filesystems that want to look at them to
> do so).

Hmm, actually, I said that, but io_getdents() needs to know whether the
flag will be honored or not (if it will be honored, we can call this
when issue_flags & IO_URING_F_NONBLOCK, but if we're not sure the fs
handles it then we risk blocking).

I'm not familiar with this part of the VFS, but I do not see any kind of
flags for the filesystem to signal if it'll handle it or not -- this is
actually similar to iterate vs. iterate_shared so that'll mean adding
iterate_shared_hasnonblock or something, which is getting silly.

I'm sure you understand this better than me and I'm missing something
obvious here, but I don't think I'll be able to make something I'm happy
with here (in a reasonable timeframe anyway).


Thanks,
-- 
Dominique Martinet | Asmadeus


* Re: [PATCH RFC 2/2] io_uring: add support for getdents
  2023-04-28  6:14         ` Dominique Martinet
  2023-04-28 11:27           ` Dominique Martinet
@ 2023-04-29  8:07           ` Dominique Martinet
  2023-04-30 23:32             ` Dave Chinner
  1 sibling, 1 reply; 18+ messages in thread
From: Dominique Martinet @ 2023-04-29  8:07 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Alexander Viro, Christian Brauner, Jens Axboe, Pavel Begunkov,
	Stefan Roesch, linux-fsdevel, linux-kernel, io-uring

Dominique Martinet wrote on Fri, Apr 28, 2023 at 03:14:52PM +0900:
> > AFAICT, the io_uring code wouldn't need to do much more other than
> > punt to the work queue if it receives a -EAGAIN result. Otherwise
> > what the filesystem returns doesn't need to change, and I don't
> > see that we need to change how the filldir callbacks work, either.
> > We just keep filling the user buffer until we either run out of
> > cached directory data or the user buffer is full.
> 
> [...] I'd like to confirm what the uring
> side needs to do before proceeding -- looking at the read/write path
> there seems to be a polling mechanism in place to tell uring when to
> look again, and I haven't looked at this part of the code yet to see
> what happens if no such polling is in place (does uring just retry
> periodically?)

Ok so this part can work out as you said; I hadn't understood what you
meant by "punt to the work queue", but that should work from my new
understanding of the ring: we can just return EAGAIN if the non-blocking
variant doesn't have immediate results, and call the blocking variant
when we're called again without IO_URING_F_NONBLOCK in flags.
(So there's no need to try to add a form of polling, although that is
possible if we ever become able to do that; I'll just forget about this
and be happy this part is easy)
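
Concretely, something like this on the issue side (sketch;
vfs_getdents_nowait() stands in for whatever shape the NOWAIT plumbing
ends up taking, and error handling is trimmed):
-----
int io_getdents(struct io_kiocb *req, unsigned int issue_flags)
{
	struct io_getdents *gd = io_kiocb_to_cmd(req, struct io_getdents);
	int ret;

	if (issue_flags & IO_URING_F_NONBLOCK) {
		ret = vfs_getdents_nowait(req->file, gd->dirent, gd->count);
		if (ret == -EAGAIN)
			return -EAGAIN;	/* punt: io_uring re-issues from
					 * io-wq without F_NONBLOCK set */
	} else {
		ret = vfs_getdents(req->file, gd->dirent, gd->count);
	}

	io_req_set_res(req, ret, 0);
	return 0;
}
-----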


That just leaves deciding if a filesystem handles the non-blocking
variant or not; ideally, if we can know early (at prep time) we can even
mark REQ_F_FORCE_ASYNC in flags to skip the non-blocking call for
filesystems that don't handle it, and we get the best of both worlds.

I've had a second look and I still don't see anything obvious though;
I'd rather avoid adding a new variant of iterate()/iterate_shared() --
we could use that as a chance to add a flag to struct file_operations
instead? e.g., something like mmap_supported_flags:
-----
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c85916e9f7db..2ebbf48ee18b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1761,7 +1761,7 @@ struct file_operations {
 	int (*iopoll)(struct kiocb *kiocb, struct io_comp_batch *,
 			unsigned int flags);
 	int (*iterate) (struct file *, struct dir_context *);
-	int (*iterate_shared) (struct file *, struct dir_context *);
+	unsigned long iterate_supported_flags;
 	__poll_t (*poll) (struct file *, struct poll_table_struct *);
 	long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
 	long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
@@ -1797,6 +1797,10 @@ struct file_operations {
 				unsigned int poll_flags);
 } __randomize_layout;
 
+/** iterate_supported_flags */
+#define ITERATE_SHARED 0x1
+#define ITERATE_NOWAIT 0x2
+
 struct inode_operations {
 	struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int);
 	const char * (*get_link) (struct dentry *, struct inode *, struct delayed_call *);
-----

and fix all usages of iterate_shared.

I guess at this rate it might make sense to rename mmap_supported_flags
to some more generic supported_flags instead?...

It's a bit more than I have signed up for, but I guess it's still
reasonable enough. I'll wait for feedback before doing it though; please
say if this sounds good to you and I'll send a v2 with such a flag, as
well as adding flags to dir_context as you had suggested.

Thanks,
-- 
Dominique Martinet | Asmadeus


* Re: [PATCH RFC 2/2] io_uring: add support for getdents
  2023-04-28 11:27           ` Dominique Martinet
@ 2023-04-30 23:15             ` Dave Chinner
  0 siblings, 0 replies; 18+ messages in thread
From: Dave Chinner @ 2023-04-30 23:15 UTC (permalink / raw)
  To: Dominique Martinet
  Cc: Alexander Viro, Christian Brauner, Jens Axboe, Pavel Begunkov,
	Stefan Roesch, linux-fsdevel, linux-kernel, io-uring

On Fri, Apr 28, 2023 at 08:27:53PM +0900, Dominique Martinet wrote:
> Dominique Martinet wrote on Fri, Apr 28, 2023 at 03:14:52PM +0900:
> > > We already pass a struct dir_context to ->iterate_shared(), so we
> > > have a simple way to add context specific flags down to the filesystem
> > > from iterate_dir(). This is similar to the iocb for file data IO
> > > that contains the flags field that holds the IOCB_NOWAIT context for
> > > io_uring based IO. So the infrastructure to plumb it all the way
> > > down to the fs implementation of ->iterate_shared is already there.
> > 
> > Sure, that sounds like a good approach that doesn't break the API (it
> > doesn't break iterate/iterate_shared implementations that don't look at
> > the flags, while allowing the filesystems that want to look at them to
> > do so).
> 
> Hmm, actually, I said that, but io_getdents() needs to know whether the
> flag will be honored or not (if it will be honored, we can call this
> when issue_flags & IO_URING_F_NONBLOCK, but if we're not sure the fs
> handles it then we risk blocking).

See the new FMODE_DIO_PARALLEL_WRITE flag for triggering filesystem
specific non-blocking io_uring behaviour....

-Dave.
-- 
Dave Chinner
[email protected]


* Re: [PATCH RFC 2/2] io_uring: add support for getdents
  2023-04-29  8:07           ` Dominique Martinet
@ 2023-04-30 23:32             ` Dave Chinner
  2023-05-01  0:49               ` Dominique Martinet
  0 siblings, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2023-04-30 23:32 UTC (permalink / raw)
  To: Dominique Martinet
  Cc: Alexander Viro, Christian Brauner, Jens Axboe, Pavel Begunkov,
	Stefan Roesch, linux-fsdevel, linux-kernel, io-uring

On Sat, Apr 29, 2023 at 05:07:32PM +0900, Dominique Martinet wrote:
> Dominique Martinet wrote on Fri, Apr 28, 2023 at 03:14:52PM +0900:
> > > AFAICT, the io_uring code wouldn't need to do much more other than
> > > punt to the work queue if it receives a -EAGAIN result. Otherwise
> > > what the filesystem returns doesn't need to change, and I don't
> > > see that we need to change how the filldir callbacks work, either.
> > > We just keep filling the user buffer until we either run out of
> > > cached directory data or the user buffer is full.
> > 
> > [...] I'd like to confirm what the uring
> > side needs to do before proceeding -- looking at the read/write path
> > there seems to be a polling mechanism in place to tell uring when to
> > look again, and I haven't looked at this part of the code yet to see
> > what happens if no such polling is in place (does uring just retry
> > periodically?)
> 
> Ok so this part can work out as you said; I hadn't understood what you
> meant by "punt to the work queue", but that should work from my new
> understanding of the ring: we can just return EAGAIN if the non-blocking
> variant doesn't have immediate results, and call the blocking variant
> when we're called again without IO_URING_F_NONBLOCK in flags.
> (So there's no need to try to add a form of polling, although that is
> possible if we ever become able to do that; I'll just forget about this
> and be happy this part is easy)
> 
> 
> That just leaves deciding if a filesystem handles the non-blocking
> variant or not; ideally, if we can know early (at prep time) we can even
> mark REQ_F_FORCE_ASYNC in flags to skip the non-blocking call for
> filesystems that don't handle it, and we get the best of both worlds.
> 
> I've had a second look and I still don't see anything obvious though;
> I'd rather avoid adding a new variant of iterate()/iterate_shared() --
> we could use that as a chance to add a flag to struct file_operations
> instead? e.g., something like mmap_supported_flags:

I don't think that makes sense - the eventual goal is to make
->iterate() go away entirely and all filesystems use
->iterate_shared(). Hence I think adding flags to select iterate vs
iterate_shared and the locking that is needed is the wrong place to
start from here.

Whether the filesystem supports non-blocking ->iterate_shared() or
not is a filesystem implementation option and io_uring needs that
information to be held on the struct file for efficient
determination of whether it should use non-blocking operations or
not.

We already set per-filesystem file modes via the ->open method;
that's how we already tell io_uring that it can do NOWAIT IO, as
well as async read/write IO for regular files. And now we also use
it for FMODE_DIO_PARALLEL_WRITE, too.

See __io_file_supports_nowait()....

Essentially, io_uring already has the mechanism available to it
to determine if it should use NOWAIT semantics for getdents
operations; we just need to set FMODE_NOWAIT correctly for directory
files via ->open() on the filesystems that support it...
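
i.e. the only per-filesystem change needed is in the directory ->open()
method. A made-up example, not from any real filesystem:

/* fs whose ->iterate_shared() implementation understands NOWAIT */
static int example_dir_open(struct inode *inode, struct file *file)
{
	/* tell io_uring it may attempt non-blocking getdents first */
	file->f_mode |= FMODE_NOWAIT;
	return 0;
}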

[ Hmmmm - we probably need to be more careful in XFS about what
types of files we set those flags on.... ]

Cheers,

Dave.
-- 
Dave Chinner
[email protected]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 2/2] io_uring: add support for getdents
  2023-04-30 23:32             ` Dave Chinner
@ 2023-05-01  0:49               ` Dominique Martinet
  2023-05-01  7:16                 ` Dave Chinner
  0 siblings, 1 reply; 18+ messages in thread
From: Dominique Martinet @ 2023-05-01  0:49 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Alexander Viro, Christian Brauner, Jens Axboe, Pavel Begunkov,
	Stefan Roesch, linux-fsdevel, linux-kernel, io-uring

Dave Chinner wrote on Mon, May 01, 2023 at 09:32:41AM +1000:
> > I've had a second look and I still don't see anything obvious though;
> > I'd rather avoid adding a new variant of iterate()/iterate_shared() --
> > we could use that as a chance to add a flag to struct file_operations
> > instead? e.g., something like mmap_supported_flags:
> 
> I don't think that makes sense - the eventual goal is to make
> ->iterate() go away entirely and all filesystems use
> ->iterate_shared(). Hence I think adding flags to select iterate vs
> iterate_shared and the locking that is needed is the wrong place to
> start from here.

(The flag could just go away once all filesystems not supporting it are
gone, and it could be made the other way around (e.g. an explicit
NOT_SHARED to encourage migrations), so I don't really see the problem
with this, but the next point makes it moot anyway)

> Whether the filesystem supports non-blocking ->iterate_shared() or
> not is a filesystem implementation option and io_uring needs that
> information to be held on the struct file for efficient
> determination of whether it should use non-blocking operations or
> not.

Right, sorry. I was thinking that since it's fs/op dependent it made
more sense to keep it next to the iterate operation, but it'd be a
layering violation to look at the file_operations vector directly
from the uring code... So having it in the struct file is
better from that point of view.

> We already set per-filesystem file modes via the ->open method;
> that's how we tell io_uring that it can do NOWAIT IO, as
> well as async read/write IO for regular files. And now we use
> it for FMODE_DIO_PARALLEL_WRITE, too.
> 
> See __io_file_supports_nowait()....
> 
> Essentially, io_uring already has the mechanism available to it
> to determine if it should use NOWAIT semantics for getdents
> operations; we just need to set FMODE_NOWAIT correctly for directory
> files via ->open() on the filesystems that support it...

Great, I wasn't aware of FMODE_NOWAIT; things are starting to fall into
place.
I'll send a v2 around Wed or Thurs (yay national holidays)

> [ Hmmmm - we probably need to be more careful in XFS about what
> types of files we set those flags on.... ]

Yes, FMODE_NOWAIT will be set on directories, as xfs_dir_open calls
xfs_file_open, which sets it unconditionally... So as a bonus I'll check
that other filesystems don't do something similar, but it looks like
none of the ones that set FMODE_NOWAIT on regular files share the open
path with directories, so at least that shouldn't be too bad.
Happy to either fold the xfs fix in as a prerequisite patch of this
series or let you do it, just tell me.
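
For reference, the io_uring side I have in mind for the v2 would look
roughly like this -- completely untested sketch; the nowait argument to
vfs_getdents doesn't exist yet and the struct fields are from my WIP,
so take the names with a grain of salt:

int io_getdents(struct io_kiocb *req, unsigned int issue_flags)
{
	struct io_getdents *gd = io_kiocb_to_cmd(req, struct io_getdents);
	bool nonblock = issue_flags & IO_URING_F_NONBLOCK;
	int ret;

	/* no point in a non-blocking attempt if the fs didn't opt in */
	if (nonblock && !(req->file->f_mode & FMODE_NOWAIT))
		return -EAGAIN;

	ret = vfs_getdents(req->file, gd->dirent, gd->count, nonblock);

	/* io_uring punts to io-wq and retries us without NONBLOCK set */
	if (nonblock && ret == -EAGAIN)
		return -EAGAIN;

	if (ret < 0)
		req_set_fail(req);
	io_req_set_res(req, ret, 0);
	return IOU_OK;
}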


Thanks,
-- 
Dominique Martinet | Asmadeus

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC 2/2] io_uring: add support for getdents
  2023-05-01  0:49               ` Dominique Martinet
@ 2023-05-01  7:16                 ` Dave Chinner
  0 siblings, 0 replies; 18+ messages in thread
From: Dave Chinner @ 2023-05-01  7:16 UTC (permalink / raw)
  To: Dominique Martinet
  Cc: Alexander Viro, Christian Brauner, Jens Axboe, Pavel Begunkov,
	Stefan Roesch, linux-fsdevel, linux-kernel, io-uring

On Mon, May 01, 2023 at 09:49:31AM +0900, Dominique Martinet wrote:
> Dave Chinner wrote on Mon, May 01, 2023 at 09:32:41AM +1000:
> > > I've had a second look and I still don't see anything obvious though;
> > > I'd rather avoid adding a new variant of iterate()/iterate_shared() --
> > > we could use that as a chance to add a flag to struct file_operations
> > > instead? e.g., something like mmap_supported_flags:
> > 
> > I don't think that makes sense - the eventual goal is to make
> > ->iterate() go away entirely and all filesystems use
> > ->iterate_shared(). Hence I think adding flags to select iterate vs
> > iterate_shared and the locking that is needed is the wrong place to
> > start from here.
> 
> (The flag could just go away once all filesystems not supporting it are
> gone, and it could be made the other way around (e.g. an explicit
> NOT_SHARED to encourage migrations), so I don't really see the problem
> with this, but the next point makes it moot anyway)
> 
> > Whether the filesystem supports non-blocking ->iterate_shared() or
> > not is a filesystem implementation option and io_uring needs that
> > information to be held on the struct file for efficient
> > determination of whether it should use non-blocking operations or
> > not.
> 
> Right, sorry. I was thinking that since it's fs/op dependent it made
> more sense to keep it next to the iterate operation, but it'd be a
> layering violation to look at the file_operations vector directly
> from the uring code... So having it in the struct file is
> better from that point of view.
> 
> > We already set per-filesystem file modes via the ->open method;
> > that's how we tell io_uring that it can do NOWAIT IO, as
> > well as async read/write IO for regular files. And now we use
> > it for FMODE_DIO_PARALLEL_WRITE, too.
> > 
> > See __io_file_supports_nowait()....
> > 
> > Essentially, io_uring already has the mechanism available to it
> > to determine if it should use NOWAIT semantics for getdents
> > operations; we just need to set FMODE_NOWAIT correctly for directory
> > files via ->open() on the filesystems that support it...
> 
> Great, I wasn't aware of FMODE_NOWAIT; things are starting to fall into
> place.
> I'll send a v2 around Wed or Thurs (yay national holidays)
> 
> > [ Hmmmm - we probably need to be more careful in XFS about what
> > types of files we set those flags on.... ]
> 
> Yes, FMODE_NOWAIT will be set on directories, as xfs_dir_open calls
> xfs_file_open, which sets it unconditionally... So as a bonus I'll check
> that other filesystems don't do something similar, but it looks like
> none of the ones that set FMODE_NOWAIT on regular files share the open
> path with directories, so at least that shouldn't be too bad.
> Happy to either fold the xfs fix in as a prerequisite patch of this
> series or let you do it, just tell me.

Yeah, we need to audit all the ->open() calls to ensure they do the
right thing - I think what XFS does is a harmless oversight at this
point; we can simply split xfs_file_open() and xfs_dir_open() as
appropriate.
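
Something like this untested sketch - keep the shutdown check that
xfs_file_open() does and the existing readahead, but only set the fmode
flags that actually mean something for directories:

STATIC int
xfs_dir_open(
	struct inode	*inode,
	struct file	*file)
{
	if (xfs_is_shutdown(XFS_M(inode->i_sb)))
		return -EIO;
	/*
	 * Unlike xfs_file_open(), don't inherit the regular file fmode
	 * flags; only advertise FMODE_NOWAIT once ->iterate_shared()
	 * actually handles it.
	 */
	file->f_mode |= FMODE_NOWAIT;
	/* ... existing block 0 readahead of the current xfs_dir_open ... */
	return 0;
}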

Also, the nowait-enabled ->iterate_shared() method for XFS will look
something like the patch below. It's not tested in any way; I just
wrote it quickly to demonstrate the relative simplicity of
converting all the locking and IO interfaces in the xfs_readdir()
path for NOWAIT operation. Making it asynchronous (equivalent of
FMODE_BUF_RASYNC) is a lot more work, but I'm not sure that is
necessary given the async readahead that gets issued...

Cheers,

Dave.
-- 
Dave Chinner
[email protected]

xfs: NOWAIT semantics for readdir

From: Dave Chinner <[email protected]>

Signed-off-by: Dave Chinner <[email protected]>
---
 fs/xfs/libxfs/xfs_da_btree.c   | 20 +++++++++++++++--
 fs/xfs/libxfs/xfs_da_btree.h   |  1 +
 fs/xfs/libxfs/xfs_dir2_block.c |  7 +++---
 fs/xfs/libxfs/xfs_dir2_priv.h  |  2 +-
 fs/xfs/scrub/dir.c             |  2 +-
 fs/xfs/scrub/readdir.c         |  2 +-
 fs/xfs/xfs_dir2_readdir.c      | 61 ++++++++++++++++++++++++++++++++++--------
 fs/xfs/xfs_inode.c             | 17 +++++++++++++
 fs/xfs/xfs_inode.h             | 15 ++++++------
 include/linux/fs.h             |  1 +
 10 files changed, 101 insertions(+), 27 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_da_btree.c b/fs/xfs/libxfs/xfs_da_btree.c
index e576560b46e9..7a1a0af24197 100644
--- a/fs/xfs/libxfs/xfs_da_btree.c
+++ b/fs/xfs/libxfs/xfs_da_btree.c
@@ -2643,16 +2643,32 @@ xfs_da_read_buf(
 	struct xfs_buf_map	map, *mapp = &map;
 	int			nmap = 1;
 	int			error;
+	int			buf_flags = 0;
 
 	*bpp = NULL;
 	error = xfs_dabuf_map(dp, bno, flags, whichfork, &mapp, &nmap);
 	if (error || !nmap)
 		goto out_free;
 
+	/*
+	 * NOWAIT semantics mean we don't wait on the buffer lock nor do we
+	 * issue IO for this buffer if it is not already in memory. Caller will
+	 * retry. This will return -EAGAIN if the buffer is in memory and cannot
+	 * be locked, and no buffer and no error if it isn't in memory.  We
+	 * translate both of those into a return state of -EAGAIN and *bpp =
+	 * NULL.
+	 */
+	if (flags & XFS_DABUF_NOWAIT)
+		buf_flags |= XBF_TRYLOCK | XBF_INCORE;
-	error = xfs_trans_read_buf_map(mp, tp, mp->m_ddev_targp, mapp, nmap, 0,
-			&bp, ops);
+	error = xfs_trans_read_buf_map(mp, tp, mp->m_ddev_targp, mapp, nmap,
+			buf_flags, &bp, ops);
 	if (error)
 		goto out_free;
+	if (!bp) {
+		ASSERT(flags & XFS_DABUF_NOWAIT);
+		error = -EAGAIN;
+		goto out_free;
+	}
 
 	if (whichfork == XFS_ATTR_FORK)
 		xfs_buf_set_ref(bp, XFS_ATTR_BTREE_REF);
diff --git a/fs/xfs/libxfs/xfs_da_btree.h b/fs/xfs/libxfs/xfs_da_btree.h
index ffa3df5b2893..32e7b1cca402 100644
--- a/fs/xfs/libxfs/xfs_da_btree.h
+++ b/fs/xfs/libxfs/xfs_da_btree.h
@@ -205,6 +205,7 @@ int	xfs_da3_node_read_mapped(struct xfs_trans *tp, struct xfs_inode *dp,
  */
 
 #define XFS_DABUF_MAP_HOLE_OK	(1u << 0)
+#define XFS_DABUF_NOWAIT	(1u << 1)
 
 int	xfs_da_grow_inode(xfs_da_args_t *args, xfs_dablk_t *new_blkno);
 int	xfs_da_grow_inode_int(struct xfs_da_args *args, xfs_fileoff_t *bno,
diff --git a/fs/xfs/libxfs/xfs_dir2_block.c b/fs/xfs/libxfs/xfs_dir2_block.c
index 00f960a703b2..59b24a594add 100644
--- a/fs/xfs/libxfs/xfs_dir2_block.c
+++ b/fs/xfs/libxfs/xfs_dir2_block.c
@@ -135,13 +135,14 @@ int
 xfs_dir3_block_read(
 	struct xfs_trans	*tp,
 	struct xfs_inode	*dp,
+	unsigned int		flags,
 	struct xfs_buf		**bpp)
 {
 	struct xfs_mount	*mp = dp->i_mount;
 	xfs_failaddr_t		fa;
 	int			err;
 
-	err = xfs_da_read_buf(tp, dp, mp->m_dir_geo->datablk, 0, bpp,
+	err = xfs_da_read_buf(tp, dp, mp->m_dir_geo->datablk, flags, bpp,
 				XFS_DATA_FORK, &xfs_dir3_block_buf_ops);
 	if (err || !*bpp)
 		return err;
@@ -380,7 +381,7 @@ xfs_dir2_block_addname(
 	tp = args->trans;
 
 	/* Read the (one and only) directory block into bp. */
-	error = xfs_dir3_block_read(tp, dp, &bp);
+	error = xfs_dir3_block_read(tp, dp, 0, &bp);
 	if (error)
 		return error;
 
@@ -695,7 +696,7 @@ xfs_dir2_block_lookup_int(
 	dp = args->dp;
 	tp = args->trans;
 
-	error = xfs_dir3_block_read(tp, dp, &bp);
+	error = xfs_dir3_block_read(tp, dp, 0, &bp);
 	if (error)
 		return error;
 
diff --git a/fs/xfs/libxfs/xfs_dir2_priv.h b/fs/xfs/libxfs/xfs_dir2_priv.h
index 7404a9ff1a92..7d4cf8a0f15b 100644
--- a/fs/xfs/libxfs/xfs_dir2_priv.h
+++ b/fs/xfs/libxfs/xfs_dir2_priv.h
@@ -51,7 +51,7 @@ extern int xfs_dir_cilookup_result(struct xfs_da_args *args,
 
 /* xfs_dir2_block.c */
 extern int xfs_dir3_block_read(struct xfs_trans *tp, struct xfs_inode *dp,
-			       struct xfs_buf **bpp);
+			       unsigned int flags, struct xfs_buf **bpp);
 extern int xfs_dir2_block_addname(struct xfs_da_args *args);
 extern int xfs_dir2_block_lookup(struct xfs_da_args *args);
 extern int xfs_dir2_block_removename(struct xfs_da_args *args);
diff --git a/fs/xfs/scrub/dir.c b/fs/xfs/scrub/dir.c
index 0b491784b759..5cc51f201bd7 100644
--- a/fs/xfs/scrub/dir.c
+++ b/fs/xfs/scrub/dir.c
@@ -313,7 +313,7 @@ xchk_directory_data_bestfree(
 		/* dir block format */
 		if (lblk != XFS_B_TO_FSBT(mp, XFS_DIR2_DATA_OFFSET))
 			xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, lblk);
-		error = xfs_dir3_block_read(sc->tp, sc->ip, &bp);
+		error = xfs_dir3_block_read(sc->tp, sc->ip, 0, &bp);
 	} else {
 		/* dir data format */
 		error = xfs_dir3_data_read(sc->tp, sc->ip, lblk, 0, &bp);
diff --git a/fs/xfs/scrub/readdir.c b/fs/xfs/scrub/readdir.c
index e51c1544be63..f0a727311632 100644
--- a/fs/xfs/scrub/readdir.c
+++ b/fs/xfs/scrub/readdir.c
@@ -101,7 +101,7 @@ xchk_dir_walk_block(
 	unsigned int		off, next_off, end;
 	int			error;
 
-	error = xfs_dir3_block_read(sc->tp, dp, &bp);
+	error = xfs_dir3_block_read(sc->tp, dp, 0, &bp);
 	if (error)
 		return error;
 
diff --git a/fs/xfs/xfs_dir2_readdir.c b/fs/xfs/xfs_dir2_readdir.c
index 9f3ceb461515..e5fcd3786599 100644
--- a/fs/xfs/xfs_dir2_readdir.c
+++ b/fs/xfs/xfs_dir2_readdir.c
@@ -149,6 +149,7 @@ xfs_dir2_block_getdents(
 	struct xfs_da_geometry	*geo = args->geo;
 	unsigned int		offset, next_offset;
 	unsigned int		end;
+	unsigned int		flags = 0;
 
 	/*
 	 * If the block number in the offset is out of range, we're done.
@@ -156,7 +157,9 @@ xfs_dir2_block_getdents(
 	if (xfs_dir2_dataptr_to_db(geo, ctx->pos) > geo->datablk)
 		return 0;
 
-	error = xfs_dir3_block_read(args->trans, dp, &bp);
+	if (ctx->nowait)
+		flags |= XFS_DABUF_NOWAIT;
+	error = xfs_dir3_block_read(args->trans, dp, flags, &bp);
 	if (error)
 		return error;
 
@@ -240,6 +243,7 @@ xfs_dir2_block_getdents(
 STATIC int
 xfs_dir2_leaf_readbuf(
 	struct xfs_da_args	*args,
+	struct dir_context	*ctx,
 	size_t			bufsize,
 	xfs_dir2_off_t		*cur_off,
 	xfs_dablk_t		*ra_blk,
@@ -258,10 +262,15 @@ xfs_dir2_leaf_readbuf(
 	struct xfs_iext_cursor	icur;
 	int			ra_want;
 	int			error = 0;
+	unsigned int		flags = 0;
 
-	error = xfs_iread_extents(args->trans, dp, XFS_DATA_FORK);
-	if (error)
-		goto out;
+	if (ctx->nowait) {
+		flags |= XFS_DABUF_NOWAIT;
+	} else {
+		error = xfs_iread_extents(args->trans, dp, XFS_DATA_FORK);
+		if (error)
+			goto out;
+	}
 
 	/*
 	 * Look for mapped directory blocks at or above the current offset.
@@ -280,7 +289,7 @@ xfs_dir2_leaf_readbuf(
 	new_off = xfs_dir2_da_to_byte(geo, map.br_startoff);
 	if (new_off > *cur_off)
 		*cur_off = new_off;
-	error = xfs_dir3_data_read(args->trans, dp, map.br_startoff, 0, &bp);
+	error = xfs_dir3_data_read(args->trans, dp, map.br_startoff, flags, &bp);
 	if (error)
 		goto out;
 
@@ -337,6 +346,16 @@ xfs_dir2_leaf_readbuf(
 	goto out;
 }
 
+static inline int
+xfs_ilock_for_readdir(
+	struct xfs_inode	*dp,
+	bool			nowait)
+{
+	if (nowait)
+		return xfs_ilock_data_map_shared_nowait(dp);
+	return xfs_ilock_data_map_shared(dp);
+}
+
 /*
  * Getdents (readdir) for leaf and node directories.
  * This reads the data blocks only, so is the same for both forms.
@@ -360,6 +379,7 @@ xfs_dir2_leaf_getdents(
 	int			byteoff;	/* offset in current block */
 	unsigned int		offset = 0;
 	int			error = 0;	/* error return value */
+	int			written = 0;
 
 	/*
 	 * If the offset is at or past the largest allowed value,
@@ -391,10 +411,16 @@ xfs_dir2_leaf_getdents(
 				bp = NULL;
 			}
 
-			if (*lock_mode == 0)
-				*lock_mode = xfs_ilock_data_map_shared(dp);
-			error = xfs_dir2_leaf_readbuf(args, bufsize, &curoff,
-					&rablk, &bp);
+			if (*lock_mode == 0) {
+				*lock_mode = xfs_ilock_for_readdir(dp,
+						ctx->nowait);
+				if (!*lock_mode) {
+					error = -EAGAIN;
+					break;
+				}
+			}
+			error = xfs_dir2_leaf_readbuf(args, ctx, bufsize,
+					&curoff, &rablk, &bp);
 			if (error || !bp)
 				break;
 
@@ -479,6 +505,7 @@ xfs_dir2_leaf_getdents(
 		 */
 		offset += length;
 		curoff += length;
+		written += length;
 		/* bufsize may have just been a guess; don't go negative */
 		bufsize = bufsize > length ? bufsize - length : 0;
 	}
@@ -492,6 +519,8 @@ xfs_dir2_leaf_getdents(
 		ctx->pos = xfs_dir2_byte_to_dataptr(curoff) & 0x7fffffff;
 	if (bp)
 		xfs_trans_brelse(args->trans, bp);
+	if (error == -EAGAIN && written > 0)
+		error = 0;
 	return error;
 }
 
@@ -528,10 +557,16 @@ xfs_readdir(
 	args.geo = dp->i_mount->m_dir_geo;
 	args.trans = tp;
 
+	lock_mode = xfs_ilock_for_readdir(dp, ctx->nowait);
+	if (!lock_mode)
+		return -EAGAIN;
+
-	if (dp->i_df.if_format == XFS_DINODE_FMT_LOCAL)
-		return xfs_dir2_sf_getdents(&args, ctx);
+	if (dp->i_df.if_format == XFS_DINODE_FMT_LOCAL) {
+		/* shortform reads the inode fork directly; don't leak the ilock */
+		error = xfs_dir2_sf_getdents(&args, ctx);
+		goto out_unlock;
+	}
 
-	lock_mode = xfs_ilock_data_map_shared(dp);
 	error = xfs_dir2_isblock(&args, &isblock);
 	if (error)
 		goto out_unlock;
@@ -546,5 +581,7 @@
 out_unlock:
 	if (lock_mode)
 		xfs_iunlock(dp, lock_mode);
+	if (error == -EAGAIN)
+		ASSERT(ctx->nowait);
 	return error;
 }
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 5808abab786c..c0d5f3f06270 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -120,6 +120,23 @@ xfs_ilock_data_map_shared(
 	return lock_mode;
 }
 
+/*
+ * Similar to xfs_ilock_data_map_shared(), except that it will only try to lock
+ * the inode in shared mode if the extents are already in memory. If it fails to
+ * get the lock, or would have to do IO to read in the extent list, fail the
+ * operation by returning 0 as the lock mode.
+ */
+uint
+xfs_ilock_data_map_shared_nowait(
+	struct xfs_inode	*ip)
+{
+	if (xfs_need_iread_extents(&ip->i_df))
+		return 0;
+	if (!xfs_ilock_nowait(ip, XFS_ILOCK_SHARED))
+		return 0;
+	return XFS_ILOCK_SHARED;
+}
+
 uint
 xfs_ilock_attr_map_shared(
 	struct xfs_inode	*ip)
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 69d21e42c10a..f766e1a41d90 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -490,13 +490,14 @@ int		xfs_rename(struct mnt_idmap *idmap,
 			   struct xfs_name *target_name,
 			   struct xfs_inode *target_ip, unsigned int flags);
 
-void		xfs_ilock(xfs_inode_t *, uint);
-int		xfs_ilock_nowait(xfs_inode_t *, uint);
-void		xfs_iunlock(xfs_inode_t *, uint);
-void		xfs_ilock_demote(xfs_inode_t *, uint);
-bool		xfs_isilocked(struct xfs_inode *, uint);
-uint		xfs_ilock_data_map_shared(struct xfs_inode *);
-uint		xfs_ilock_attr_map_shared(struct xfs_inode *);
+void		xfs_ilock(struct xfs_inode *ip, uint lockmode);
+int		xfs_ilock_nowait(struct xfs_inode *ip, uint lockmode);
+void		xfs_iunlock(struct xfs_inode *ip, uint lockmode);
+void		xfs_ilock_demote(struct xfs_inode *ip, uint lockmode);
+bool		xfs_isilocked(struct xfs_inode *ip, uint lockmode);
+uint		xfs_ilock_data_map_shared(struct xfs_inode *ip);
+uint		xfs_ilock_data_map_shared_nowait(struct xfs_inode *ip);
+uint		xfs_ilock_attr_map_shared(struct xfs_inode *ip);
 
 uint		xfs_ip2xflags(struct xfs_inode *);
 int		xfs_ifree(struct xfs_trans *, struct xfs_inode *);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 67495ef79bb2..26c91812ca48 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1702,6 +1702,7 @@ typedef bool (*filldir_t)(struct dir_context *, const char *, int, loff_t, u64,
 struct dir_context {
 	filldir_t actor;
 	loff_t pos;
+	bool nowait;
 };
 
 /*

^ permalink raw reply related	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2023-05-01  7:16 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-04-22  8:40 [PATCH RFC 0/2] io_uring: add getdents support, take 2 Dominique Martinet
2023-04-22  8:40 ` [PATCH RFC 1/2] fs: split off vfs_getdents function of getdents64 syscall Dominique Martinet
2023-04-22 10:34   ` Dominique Martinet
2023-04-22  8:40 ` [PATCH RFC 2/2] io_uring: add support for getdents Dominique Martinet
2023-04-23 22:40   ` Dave Chinner
2023-04-23 23:43     ` Dominique Martinet
2023-04-24  7:29       ` Clay Harris
2023-04-24  8:41         ` Dominique Martinet
2023-04-24  9:20           ` Clay Harris
2023-04-24 10:55             ` Dominique Martinet
2023-04-28  5:06       ` Dave Chinner
2023-04-28  6:14         ` Dominique Martinet
2023-04-28 11:27           ` Dominique Martinet
2023-04-30 23:15             ` Dave Chinner
2023-04-29  8:07           ` Dominique Martinet
2023-04-30 23:32             ` Dave Chinner
2023-05-01  0:49               ` Dominique Martinet
2023-05-01  7:16                 ` Dave Chinner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox