[PATCH v2 0/6] io_uring: add getdents support, take 2

2023-05-10T10:53:18Z

This is an attempt to revive discussion after bringing it up as a reply
to the last attempt last week.

Since the problem with the previous attempt at adding getdents to
io_uring was that offset was problematic, we can just not add an offset
parameter: using different offsets is rarely useful in practice (maybe
for some cacheless userspace NFS server?) and more isn't worth the cost
of allowing arbitrary offsets: just allow rewind as a flag instead.
[happy to drop even that flag for what I care about, but that might be
useful enough on its own as io_uring rightfully has no seek API]

The new API does nothing that cannot be achieved with plain syscalls so
it shouldn't be introducing any new problem, the only downside is that
having the state in the file struct isn't very uring-ish and if a
better solution is found later that will probably require duplicating
some logic in a new flag... But that seems like it would likely be a
distant future, and this version should be usable right away.

To repeat the rationale for having getdents available from uring as it
has been a while, while I believe there can be real performance
improvements as suggested in [1], the real reason is to allow coherency
in code: applications using io_uring usually center their event loop
around the ring, and having the occasional synchronous getdents call in
there is cumbersome and hard to do properly when you consider e.g. a
slow or acting up network fs...
[1] https://lore.kernel.org/linux-fsdevel/20210218122640.GA334506@wantstofly.org/
(unfortunately the source is no longer available...)

liburing implementation:
https://github.com/martinetd/liburing/commits/getdents
(will submit properly after initial discussions here)

Previous discussion:
https://lore.kernel.org/all/20211221164004.119663-1-shr@fb.com/T/#m517583f23502f32b040c819d930359313b3db00c

Signed-off-by: Dominique Martinet 
---
Changes in v2:
- implement NOWAIT as suggested by dchinner; I'm unable to provide
reliable benchmarks but it does indeed look positive (and makes sense)
to avoid spinning out to another thread when not required.
FWIW though, the serializing readdirs per inode argument given in v1
thread isn't valid: serialization is only done in io_prep_async_work
for regular files (REQ_F_ISREG, set from file mode through FFS_ISREG),
so dir operations aren't serialized in our case.
If I was pedantic I'd say we might want that hashing for filesystems
that don't implement interate_shared, but that info comes too late and
these filesystems should become less frequent so leaving as is.
- implement NOWAIT for kernfs and libfs to test with /sys
(the tentative patch for xfs didn't seem to work, didn't take the time
to debug)
- try to implement some EOF flag in CQE as suggested by Clay Harris,
this is less clearly cut and left as RFC.
The liburing test/getdents.t implementation has grown options to test
this flag and also try async explicitly in the latest commit:
https://github.com/martinetd/liburing/commits/getdents
- vfs_getdents split: add missing internal.h include
- Link to v1: https://lore.kernel.org/r/20230422-uring-getdents-v1-0-14c1db36e98c@codewreck.org

---
Dominique Martinet (6):
      fs: split off vfs_getdents function of getdents64 syscall
      vfs_getdents/struct dir_context: add flags field
      io_uring: add support for getdents
      kernfs: implement readdir FMODE_NOWAIT
      libfs: set FMODE_NOWAIT on dir open
      RFC: io_uring getdents: test returning an EOF flag in CQE

 fs/internal.h                 |  8 ++++++
 fs/kernfs/dir.c               | 21 ++++++++++++++-
 fs/libfs.c                    | 10 ++++---
 fs/readdir.c                  | 38 +++++++++++++++++++++------
 include/linux/fs.h            | 10 +++++++
 include/uapi/linux/io_uring.h |  9 +++++++
 io_uring/fs.c                 | 61 +++++++++++++++++++++++++++++++++++++++++++
 io_uring/fs.h                 |  3 +++
 io_uring/opdef.c              |  8 ++++++
 9 files changed, 156 insertions(+), 12 deletions(-)
---
base-commit: 58390c8ce1bddb6c623f62e7ed36383e7fa5c02f
change-id: 20230422-uring-getdents-2aab84d240aa

Best regards,
-- 
Dominique Martinet | Asmadeus

[PATCH v2 1/6] fs: split off vfs_getdents function of getdents64 syscall

2023-05-10T10:53:33Z

This splits off the vfs_getdents function from the getdents64 system
call.
This will allow io_uring to call the vfs_getdents function.

Co-authored-by: Stefan Roesch 
Signed-off-by: Dominique Martinet 
---
 fs/internal.h |  8 ++++++++
 fs/readdir.c  | 34 ++++++++++++++++++++++++++--------
 2 files changed, 34 insertions(+), 8 deletions(-)

diff --git a/fs/internal.h b/fs/internal.h
index bd3b2810a36b..e8ca000e6613 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -260,3 +260,11 @@ ssize_t __kernel_write_iter(struct file *file, struct iov_iter *from, loff_t *po
 struct mnt_idmap *alloc_mnt_idmap(struct user_namespace *mnt_userns);
 struct mnt_idmap *mnt_idmap_get(struct mnt_idmap *idmap);
 void mnt_idmap_put(struct mnt_idmap *idmap);
+
+/*
+ * fs/readdir.c
+ */
+struct linux_dirent64;
+
+int vfs_getdents(struct file *file, struct linux_dirent64 __user *dirent,
+		 unsigned int count);
diff --git a/fs/readdir.c b/fs/readdir.c
index 9c53edb60c03..ed0803d0011e 100644
--- a/fs/readdir.c
+++ b/fs/readdir.c
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include "internal.h"
 
 #include 
 
@@ -351,10 +352,16 @@ static bool filldir64(struct dir_context *ctx, const char *name, int namlen,
 	return false;
 }
 
-SYSCALL_DEFINE3(getdents64, unsigned int, fd,
-		struct linux_dirent64 __user *, dirent, unsigned int, count)
+
+/**
+ * vfs_getdents - getdents without fdget
+ * @file    : pointer to file struct of directory
+ * @dirent  : pointer to user directory structure
+ * @count   : size of buffer
+ */
+int vfs_getdents(struct file *file, struct linux_dirent64 __user *dirent,
+		 unsigned int count)
 {
-	struct fd f;
 	struct getdents_callback64 buf = {
 		.ctx.actor = filldir64,
 		.count = count,
@@ -362,11 +369,7 @@ SYSCALL_DEFINE3(getdents64, unsigned int, fd,
 	};
 	int error;
 
-	f = fdget_pos(fd);
-	if (!f.file)
-		return -EBADF;
-
-	error = iterate_dir(f.file, &buf.ctx);
+	error = iterate_dir(file, &buf.ctx);
 	if (error >= 0)
 		error = buf.error;
 	if (buf.prev_reclen) {
@@ -379,6 +382,21 @@ SYSCALL_DEFINE3(getdents64, unsigned int, fd,
 		else
 			error = count - buf.count;
 	}
+	return error;
+}
+
+SYSCALL_DEFINE3(getdents64, unsigned int, fd,
+		struct linux_dirent64 __user *, dirent, unsigned int, count)
+{
+	struct fd f;
+	int error;
+
+	f = fdget_pos(fd);
+	if (!f.file)
+		return -EBADF;
+
+	error = vfs_getdents(f.file, dirent, count);
+
 	fdput_pos(f);
 	return error;
 }

-- 
2.39.2

[PATCH v2 2/6] vfs_getdents/struct dir_context: add flags field

2023-05-10T10:53:35Z

The flags will allow passing DIR_CONTEXT_F_NOWAIT to iterate()
implementations that support it (as signaled through FMODE_NWAIT
in file->f_mode)

Notes:
- considered using IOCB_NOWAIT but if we add more flags later it
would be confusing to keep track of which values are valid, use
dedicated flags
- might want to check ctx.flags & DIR_CONTEXT_F_NOWAIT is only set
when file->f_mode & FMODE_NOWAIT in iterate_dir() as e.g. WARN_ONCE?

Signed-off-by: Dominique Martinet 
---
 fs/internal.h      | 2 +-
 fs/readdir.c       | 6 ++++--
 include/linux/fs.h | 8 ++++++++
 3 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/fs/internal.h b/fs/internal.h
index e8ca000e6613..0264b001d99a 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -267,4 +267,4 @@ void mnt_idmap_put(struct mnt_idmap *idmap);
 struct linux_dirent64;
 
 int vfs_getdents(struct file *file, struct linux_dirent64 __user *dirent,
-		 unsigned int count);
+		 unsigned int count, unsigned long flags);
diff --git a/fs/readdir.c b/fs/readdir.c
index ed0803d0011e..1311b89d75e1 100644
--- a/fs/readdir.c
+++ b/fs/readdir.c
@@ -358,12 +358,14 @@ static bool filldir64(struct dir_context *ctx, const char *name, int namlen,
  * @file    : pointer to file struct of directory
  * @dirent  : pointer to user directory structure
  * @count   : size of buffer
+ * @flags   : additional dir_context flags
  */
 int vfs_getdents(struct file *file, struct linux_dirent64 __user *dirent,
-		 unsigned int count)
+		 unsigned int count, unsigned long flags)
 {
 	struct getdents_callback64 buf = {
 		.ctx.actor = filldir64,
+		.ctx.flags = flags,
 		.count = count,
 		.current_dir = dirent
 	};
@@ -395,7 +397,7 @@ SYSCALL_DEFINE3(getdents64, unsigned int, fd,
 	if (!f.file)
 		return -EBADF;
 
-	error = vfs_getdents(f.file, dirent, count);
+	error = vfs_getdents(f.file, dirent, count, 0);
 
 	fdput_pos(f);
 	return error;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 21a981680856..f7de2b5ca38e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1716,8 +1716,16 @@ typedef bool (*filldir_t)(struct dir_context *, const char *, int, loff_t, u64,
 struct dir_context {
 	filldir_t actor;
 	loff_t pos;
+	unsigned long flags;
 };
 
+/*
+ * flags for dir_context flags
+ * DIR_CONTEXT_F_NOWAIT: Request non-blocking iterate
+ *                       (requires file->f_mode & FMODE_NOWAIT)
+ */
+#define DIR_CONTEXT_F_NOWAIT	0x1
+
 /*
  * These flags let !MMU mmap() govern direct device mapping vs immediate
  * copying more easily for MAP_PRIVATE, especially for ROM filesystems.

-- 
2.39.2

[PATCH v2 3/6] io_uring: add support for getdents

2023-05-10T10:53:41Z

This add support for getdents64 to io_uring, acting exactly like the
syscall: the directory is iterated from it's current's position as
stored in the file struct, and the file's position is updated exactly as
if getdents64 had been called.

Additionally, since io_uring has no way of issuing a seek, a flag
IORING_GETDENTS_REWIND has been added that will seek to the start of the
directory like rewinddir(3) for users that might require such a thing.
This will act exactly as if seek then getdents64 have been called
sequentially with no atomicity guarantee:
if this wasn't clear it is the responsibility of the caller to not use
getdents multiple time on a single file in parallel if they want useful
results, as is currently the case when using the syscall from multiple
threads.

For filesystems that support NOWAIT in iterate_shared(), try to use it
first; if a user already knows the filesystem they use do not support
nowait they can force async through IOSQE_ASYNC in the sqe flags,
avoiding the need to bounce back through a useless EAGAIN return.
(Note we already do that in prep if rewind is requested)

Signed-off-by: Dominique Martinet 
---
 include/uapi/linux/io_uring.h |  7 ++++++
 io_uring/fs.c                 | 57 +++++++++++++++++++++++++++++++++++++++++++
 io_uring/fs.h                 |  3 +++
 io_uring/opdef.c              |  8 ++++++
 4 files changed, 75 insertions(+)

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 0716cb17e436..35d0de18d893 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -65,6 +65,7 @@ struct io_uring_sqe {
 		__u32		xattr_flags;
 		__u32		msg_ring_flags;
 		__u32		uring_cmd_flags;
+		__u32		getdents_flags;
 	};
 	__u64	user_data;	/* data to be passed back at completion time */
 	/* pack this to avoid bogus arm OABI complaints */
@@ -223,6 +224,7 @@ enum io_uring_op {
 	IORING_OP_URING_CMD,
 	IORING_OP_SEND_ZC,
 	IORING_OP_SENDMSG_ZC,
+	IORING_OP_GETDENTS,
 
 	/* this goes last, obviously */
 	IORING_OP_LAST,
@@ -259,6 +261,11 @@ enum io_uring_op {
  */
 #define SPLICE_F_FD_IN_FIXED	(1U << 31) /* the last bit of __u32 */
 
+/*
+ * sqe->getdents_flags
+ */
+#define IORING_GETDENTS_REWIND	(1U << 0)
+
 /*
  * POLL_ADD flags. Note that since sqe->poll_events is the flag space, the
  * command flags for POLL_ADD are stored in sqe->len.
diff --git a/io_uring/fs.c b/io_uring/fs.c
index f6a69a549fd4..b15ec81c1ed2 100644
--- a/io_uring/fs.c
+++ b/io_uring/fs.c
@@ -47,6 +47,13 @@ struct io_link {
 	int				flags;
 };
 
+struct io_getdents {
+	struct file			*file;
+	struct linux_dirent64 __user	*dirent;
+	unsigned int			count;
+	int				flags;
+};
+
 int io_renameat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
 	struct io_rename *ren = io_kiocb_to_cmd(req, struct io_rename);
@@ -291,3 +298,53 @@ void io_link_cleanup(struct io_kiocb *req)
 	putname(sl->oldpath);
 	putname(sl->newpath);
 }
+
+int io_getdents_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+{
+	struct io_getdents *gd = io_kiocb_to_cmd(req, struct io_getdents);
+
+	if (READ_ONCE(sqe->off) != 0)
+		return -EINVAL;
+
+	gd->dirent = u64_to_user_ptr(READ_ONCE(sqe->addr));
+	gd->count = READ_ONCE(sqe->len);
+	gd->flags = READ_ONCE(sqe->getdents_flags);
+	if (gd->flags & ~IORING_GETDENTS_REWIND)
+		return -EINVAL;
+	/* rewind cannot be nowait */
+	if (gd->flags & IORING_GETDENTS_REWIND)
+		req->flags |= REQ_F_FORCE_ASYNC;
+
+	return 0;
+}
+
+int io_getdents(struct io_kiocb *req, unsigned int issue_flags)
+{
+	struct io_getdents *gd = io_kiocb_to_cmd(req, struct io_getdents);
+	unsigned long getdents_flags = 0;
+	int ret;
+
+	if (issue_flags & IO_URING_F_NONBLOCK) {
+		if (!(req->file->f_mode & FMODE_NOWAIT))
+			return -EAGAIN;
+
+		getdents_flags = DIR_CONTEXT_F_NOWAIT;
+	}
+	if ((gd->flags & IORING_GETDENTS_REWIND)) {
+		WARN_ON_ONCE(issue_flags & IO_URING_F_NONBLOCK);
+
+		ret = vfs_llseek(req->file, 0, SEEK_SET);
+		if (ret < 0)
+			goto out;
+	}
+
+	ret = vfs_getdents(req->file, gd->dirent, gd->count, getdents_flags);
+out:
+	if (ret == -EAGAIN &&
+	    (issue_flags & IO_URING_F_NONBLOCK))
+			return -EAGAIN;
+
+	io_req_set_res(req, ret, 0);
+	return 0;
+}
+
diff --git a/io_uring/fs.h b/io_uring/fs.h
index 0bb5efe3d6bb..f83a6f3a678d 100644
--- a/io_uring/fs.h
+++ b/io_uring/fs.h
@@ -18,3 +18,6 @@ int io_symlinkat(struct io_kiocb *req, unsigned int issue_flags);
 int io_linkat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
 int io_linkat(struct io_kiocb *req, unsigned int issue_flags);
 void io_link_cleanup(struct io_kiocb *req);
+
+int io_getdents_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
+int io_getdents(struct io_kiocb *req, unsigned int issue_flags);
diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index cca7c5b55208..8f770c582ab3 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -428,6 +428,11 @@ const struct io_issue_def io_issue_defs[] = {
 		.prep			= io_eopnotsupp_prep,
 #endif
 	},
+	[IORING_OP_GETDENTS] = {
+		.needs_file		= 1,
+		.prep			= io_getdents_prep,
+		.issue			= io_getdents,
+	},
 };
 
 
@@ -648,6 +653,9 @@ const struct io_cold_def io_cold_defs[] = {
 		.fail			= io_sendrecv_fail,
 #endif
 	},
+	[IORING_OP_GETDENTS] = {
+		.name			= "GETDENTS",
+	},
 };
 
 const char *io_uring_get_opcode(u8 opcode)

-- 
2.39.2

[PATCH v2 4/6] kernfs: implement readdir FMODE_NOWAIT

2023-05-10T10:53:42Z

Since down_read can block, use the _trylock variant if NOWAIT variant
has been requested.
(can probably do a little bit better style-wise)

Signed-off-by: Dominique Martinet 
---
 fs/kernfs/dir.c | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 45b6919903e6..5a5b3e7881bf 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -1824,7 +1824,12 @@ static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
 		return 0;
 
 	root = kernfs_root(parent);
-	down_read(&root->kernfs_rwsem);
+	if (ctx->flags & DIR_CONTEXT_F_NOWAIT) {
+		if (!down_read_trylock(&root->kernfs_rwsem))
+			return -EAGAIN;
+	} else {
+		down_read(&root->kernfs_rwsem);
+	}
 
 	if (kernfs_ns_enabled(parent))
 		ns = kernfs_info(dentry->d_sb)->ns;
@@ -1845,6 +1850,12 @@ static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
 		if (!dir_emit(ctx, name, len, ino, type))
 			return 0;
 		down_read(&root->kernfs_rwsem);
+		if (ctx->flags & DIR_CONTEXT_F_NOWAIT) {
+			if (!down_read_trylock(&root->kernfs_rwsem))
+				return 0;
+		} else {
+			down_read(&root->kernfs_rwsem);
+		}
 	}
 	up_read(&root->kernfs_rwsem);
 	file->private_data = NULL;
@@ -1852,7 +1863,14 @@ static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
 	return 0;
 }
 
+static int kernfs_fop_dir_open(struct inode *inode, struct file *file)
+{
+	file->f_mode |= FMODE_NOWAIT;
+	return 0;
+}
+
 const struct file_operations kernfs_dir_fops = {
+	.open		= kernfs_fop_dir_open,
 	.read		= generic_read_dir,
 	.iterate_shared	= kernfs_fop_readdir,
 	.release	= kernfs_dir_fop_release,

-- 
2.39.2

[PATCH v2 5/6] libfs: set FMODE_NOWAIT on dir open

2023-05-10T10:53:40Z

the readdir can technically wait a bit on a spinlock, but that should
never wait for long enough to return EAGAIN -- just set the capability
flag on directories f_mode

Signed-off-by: Dominique Martinet 
---
 fs/libfs.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/libfs.c b/fs/libfs.c
index 89cf614a3271..a3c7e42d90a7 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -81,6 +81,7 @@ EXPORT_SYMBOL(simple_lookup);
 int dcache_dir_open(struct inode *inode, struct file *file)
 {
 	file->private_data = d_alloc_cursor(file->f_path.dentry);
+	file->f_mode |= FMODE_NOWAIT;
 
 	return file->private_data ? 0 : -ENOMEM;
 }

-- 
2.39.2

[PATCH v2 6/6] RFC: io_uring getdents: test returning an EOF flag in CQE

2023-05-10T10:54:24Z

This turns out to be very slightly faster than an extra call to
getdents, but in practice it doesn't seem to be such an improvement as
the trailing getdents will return almost immediately be absorbed by the
scheduling noise in a find-like context (my ""server"" is too noisy to
get proper benchmarks out, but results look slightly better with this in
async mode, and almost identical in the NOWAIT path)

If the user is waiting the end of a single directory though it might be
worth it, so including the patch for comments.
(in particular I'm not really happy that the flag has become in-out for
vfs_getdents, especially when the getdents64 syscall does not use it,
but I don't see much other way around it)

If this approach is acceptable/wanted then this patch will be split down
further (at least dir_context/vfs_getdents, kernfs, libfs, uring in four
separate commits)

Signed-off-by: Dominique Martinet 
---
 fs/internal.h                 |  2 +-
 fs/kernfs/dir.c               |  1 +
 fs/libfs.c                    |  9 ++++++---
 fs/readdir.c                  | 10 ++++++----
 include/linux/fs.h            |  2 ++
 include/uapi/linux/io_uring.h |  2 ++
 io_uring/fs.c                 |  8 ++++++--
 7 files changed, 24 insertions(+), 10 deletions(-)

diff --git a/fs/internal.h b/fs/internal.h
index 0264b001d99a..0b1552c7a870 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -267,4 +267,4 @@ void mnt_idmap_put(struct mnt_idmap *idmap);
 struct linux_dirent64;
 
 int vfs_getdents(struct file *file, struct linux_dirent64 __user *dirent,
-		 unsigned int count, unsigned long flags);
+		 unsigned int count, unsigned long *flags);
diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 5a5b3e7881bf..53a6b4804c34 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -1860,6 +1860,7 @@ static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
 	up_read(&root->kernfs_rwsem);
 	file->private_data = NULL;
 	ctx->pos = INT_MAX;
+	ctx->flags |= DIR_CONTEXT_F_EOD;
 	return 0;
 }
 
diff --git a/fs/libfs.c b/fs/libfs.c
index a3c7e42d90a7..b2a95dadffbd 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -208,10 +208,12 @@ int dcache_readdir(struct file *file, struct dir_context *ctx)
 		p = &next->d_child;
 	}
 	spin_lock(&dentry->d_lock);
-	if (next)
+	if (next) {
 		list_move_tail(&cursor->d_child, &next->d_child);
-	else
+	} else {
 		list_del_init(&cursor->d_child);
+		ctx->flags |= DIR_CONTEXT_F_EOD;
+	}
 	spin_unlock(&dentry->d_lock);
 	dput(next);
 
@@ -1347,7 +1349,8 @@ static loff_t empty_dir_llseek(struct file *file, loff_t offset, int whence)
 
 static int empty_dir_readdir(struct file *file, struct dir_context *ctx)
 {
-	dir_emit_dots(file, ctx);
+	if (dir_emit_dots(file, ctx))
+		ctx->flags |= DIR_CONTEXT_F_EOD;
 	return 0;
 }
 
diff --git a/fs/readdir.c b/fs/readdir.c
index 1311b89d75e1..be75a2154b4f 100644
--- a/fs/readdir.c
+++ b/fs/readdir.c
@@ -358,14 +358,14 @@ static bool filldir64(struct dir_context *ctx, const char *name, int namlen,
  * @file    : pointer to file struct of directory
  * @dirent  : pointer to user directory structure
  * @count   : size of buffer
- * @flags   : additional dir_context flags
+ * @flags   : pointer to additional dir_context flags
  */
 int vfs_getdents(struct file *file, struct linux_dirent64 __user *dirent,
-		 unsigned int count, unsigned long flags)
+		 unsigned int count, unsigned long *flags)
 {
 	struct getdents_callback64 buf = {
 		.ctx.actor = filldir64,
-		.ctx.flags = flags,
+		.ctx.flags = flags ? *flags : 0,
 		.count = count,
 		.current_dir = dirent
 	};
@@ -384,6 +384,8 @@ int vfs_getdents(struct file *file, struct linux_dirent64 __user *dirent,
 		else
 			error = count - buf.count;
 	}
+	if (flags)
+		*flags = buf.ctx.flags;
 	return error;
 }
 
@@ -397,7 +399,7 @@ SYSCALL_DEFINE3(getdents64, unsigned int, fd,
 	if (!f.file)
 		return -EBADF;
 
-	error = vfs_getdents(f.file, dirent, count, 0);
+	error = vfs_getdents(f.file, dirent, count, NULL);
 
 	fdput_pos(f);
 	return error;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index f7de2b5ca38e..d1e31bccfb4f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1723,8 +1723,10 @@ struct dir_context {
  * flags for dir_context flags
  * DIR_CONTEXT_F_NOWAIT: Request non-blocking iterate
  *                       (requires file->f_mode & FMODE_NOWAIT)
+ * DIR_CONTEXT_F_EOD: Signal directory has been fully iterated, set by the fs
  */
 #define DIR_CONTEXT_F_NOWAIT	0x1
+#define DIR_CONTEXT_F_EOD	0x2
 
 /*
  * These flags let !MMU mmap() govern direct device mapping vs immediate
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 35d0de18d893..35877132027e 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -381,11 +381,13 @@ struct io_uring_cqe {
  * IORING_CQE_F_SOCK_NONEMPTY	If set, more data to read after socket recv
  * IORING_CQE_F_NOTIF	Set for notification CQEs. Can be used to distinct
  * 			them from sends.
+ * IORING_CQE_F_EOF     If set, file or directory has reached end of file.
  */
 #define IORING_CQE_F_BUFFER		(1U << 0)
 #define IORING_CQE_F_MORE		(1U << 1)
 #define IORING_CQE_F_SOCK_NONEMPTY	(1U << 2)
 #define IORING_CQE_F_NOTIF		(1U << 3)
+#define IORING_CQE_F_EOF		(1U << 4)
 
 enum {
 	IORING_CQE_BUFFER_SHIFT		= 16,
diff --git a/io_uring/fs.c b/io_uring/fs.c
index b15ec81c1ed2..f6222b0148ef 100644
--- a/io_uring/fs.c
+++ b/io_uring/fs.c
@@ -322,6 +322,7 @@ int io_getdents(struct io_kiocb *req, unsigned int issue_flags)
 {
 	struct io_getdents *gd = io_kiocb_to_cmd(req, struct io_getdents);
 	unsigned long getdents_flags = 0;
+	u32 cqe_flags = 0;
 	int ret;
 
 	if (issue_flags & IO_URING_F_NONBLOCK) {
@@ -338,13 +339,16 @@ int io_getdents(struct io_kiocb *req, unsigned int issue_flags)
 			goto out;
 	}
 
-	ret = vfs_getdents(req->file, gd->dirent, gd->count, getdents_flags);
+	ret = vfs_getdents(req->file, gd->dirent, gd->count, &getdents_flags);
 out:
 	if (ret == -EAGAIN &&
 	    (issue_flags & IO_URING_F_NONBLOCK))
 			return -EAGAIN;
 
-	io_req_set_res(req, ret, 0);
+	if (getdents_flags & DIR_CONTEXT_F_EOD)
+		cqe_flags |= IORING_CQE_F_EOF;
+
+	io_req_set_res(req, ret, cqe_flags);
 	return 0;
 }
 

-- 
2.39.2

Re: [PATCH v2 4/6] kernfs: implement readdir FMODE_NOWAIT

2023-05-11T10:56:26Z

Hi Dominique,

kernel test robot noticed the following build warnings:

url:    https://github.com/intel-lab-lkp/linux/commits/Dominique-Martinet/fs-split-off-vfs_getdents-function-of-getdents64-syscall/20230510-185542
base:   58390c8ce1bddb6c623f62e7ed36383e7fa5c02f
patch link:    https://lore.kernel.org/r/20230422-uring-getdents-v2-4-2db1e37dc55e%40codewreck.org
patch subject: [PATCH v2 4/6] kernfs: implement readdir FMODE_NOWAIT
config: i386-randconfig-m021 (https://download.01.org/0day-ci/archive/20230511/202305110647.eSnSEulg-lkp@intel.com/config)
compiler: gcc-11 (Debian 11.3.0-12) 11.3.0

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot 
| Reported-by: Dan Carpenter 
| Link: https://lore.kernel.org/r/202305110647.eSnSEulg-lkp@intel.com/

smatch warnings:
fs/kernfs/dir.c:1863 kernfs_fop_readdir() warn: inconsistent returns '&root->kernfs_rwsem'.

vim +1863 fs/kernfs/dir.c

c637b8acbe079e Tejun Heo          2013-12-11  1815  static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
fd7b9f7b9776b1 Tejun Heo          2013-11-28  1816  {
fd7b9f7b9776b1 Tejun Heo          2013-11-28  1817  	struct dentry *dentry = file->f_path.dentry;
319ba91d352a74 Shaohua Li         2017-07-12  1818  	struct kernfs_node *parent = kernfs_dentry_node(dentry);
324a56e16e44ba Tejun Heo          2013-12-11  1819  	struct kernfs_node *pos = file->private_data;
393c3714081a53 Minchan Kim        2021-11-18  1820  	struct kernfs_root *root;
fd7b9f7b9776b1 Tejun Heo          2013-11-28  1821  	const void *ns = NULL;
fd7b9f7b9776b1 Tejun Heo          2013-11-28  1822  
fd7b9f7b9776b1 Tejun Heo          2013-11-28  1823  	if (!dir_emit_dots(file, ctx))
fd7b9f7b9776b1 Tejun Heo          2013-11-28  1824  		return 0;
393c3714081a53 Minchan Kim        2021-11-18  1825  
393c3714081a53 Minchan Kim        2021-11-18  1826  	root = kernfs_root(parent);
a551138c4b3b9f Dominique Martinet 2023-05-10  1827  	if (ctx->flags & DIR_CONTEXT_F_NOWAIT) {
a551138c4b3b9f Dominique Martinet 2023-05-10  1828  		if (!down_read_trylock(&root->kernfs_rwsem))
a551138c4b3b9f Dominique Martinet 2023-05-10  1829  			return -EAGAIN;
a551138c4b3b9f Dominique Martinet 2023-05-10  1830  	} else {
393c3714081a53 Minchan Kim        2021-11-18  1831  		down_read(&root->kernfs_rwsem);
a551138c4b3b9f Dominique Martinet 2023-05-10  1832  	}
fd7b9f7b9776b1 Tejun Heo          2013-11-28  1833  
324a56e16e44ba Tejun Heo          2013-12-11  1834  	if (kernfs_ns_enabled(parent))
c525aaddc366df Tejun Heo          2013-12-11  1835  		ns = kernfs_info(dentry->d_sb)->ns;
fd7b9f7b9776b1 Tejun Heo          2013-11-28  1836  
c637b8acbe079e Tejun Heo          2013-12-11  1837  	for (pos = kernfs_dir_pos(ns, parent, ctx->pos, pos);
fd7b9f7b9776b1 Tejun Heo          2013-11-28  1838  	     pos;
c637b8acbe079e Tejun Heo          2013-12-11  1839  	     pos = kernfs_dir_next_pos(ns, parent, ctx->pos, pos)) {
adc5e8b58f4886 Tejun Heo          2013-12-11  1840  		const char *name = pos->name;
364595a6851bf6 Jeff Layton        2023-03-30  1841  		unsigned int type = fs_umode_to_dtype(pos->mode);
fd7b9f7b9776b1 Tejun Heo          2013-11-28  1842  		int len = strlen(name);
67c0496e87d193 Tejun Heo          2019-11-04  1843  		ino_t ino = kernfs_ino(pos);
fd7b9f7b9776b1 Tejun Heo          2013-11-28  1844  
adc5e8b58f4886 Tejun Heo          2013-12-11  1845  		ctx->pos = pos->hash;
fd7b9f7b9776b1 Tejun Heo          2013-11-28  1846  		file->private_data = pos;
fd7b9f7b9776b1 Tejun Heo          2013-11-28  1847  		kernfs_get(pos);
fd7b9f7b9776b1 Tejun Heo          2013-11-28  1848  
393c3714081a53 Minchan Kim        2021-11-18  1849  		up_read(&root->kernfs_rwsem);
fd7b9f7b9776b1 Tejun Heo          2013-11-28  1850  		if (!dir_emit(ctx, name, len, ino, type))
fd7b9f7b9776b1 Tejun Heo          2013-11-28  1851  			return 0;
393c3714081a53 Minchan Kim        2021-11-18  1852  		down_read(&root->kernfs_rwsem);
                                                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Needs to be deleted.

a551138c4b3b9f Dominique Martinet 2023-05-10  1853  		if (ctx->flags & DIR_CONTEXT_F_NOWAIT) {
a551138c4b3b9f Dominique Martinet 2023-05-10  1854  			if (!down_read_trylock(&root->kernfs_rwsem))
a551138c4b3b9f Dominique Martinet 2023-05-10  1855  				return 0;

It's a bit strange the this doesn't return -EAGAIN;

a551138c4b3b9f Dominique Martinet 2023-05-10  1856  		} else {
a551138c4b3b9f Dominique Martinet 2023-05-10  1857  			down_read(&root->kernfs_rwsem);
a551138c4b3b9f Dominique Martinet 2023-05-10  1858  		}
fd7b9f7b9776b1 Tejun Heo          2013-11-28  1859  	}
393c3714081a53 Minchan Kim        2021-11-18  1860  	up_read(&root->kernfs_rwsem);
fd7b9f7b9776b1 Tejun Heo          2013-11-28  1861  	file->private_data = NULL;
fd7b9f7b9776b1 Tejun Heo          2013-11-28  1862  	ctx->pos = INT_MAX;
fd7b9f7b9776b1 Tejun Heo          2013-11-28 @1863  	return 0;
fd7b9f7b9776b1 Tejun Heo          2013-11-28  1864  }

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests

Re: [PATCH v2 4/6] kernfs: implement readdir FMODE_NOWAIT

2023-05-11T11:03:46Z

Dan Carpenter wrote on Thu, May 11, 2023 at 01:55:57PM +0300:
> fd7b9f7b9776b1 Tejun Heo          2013-11-28  1850  		if (!dir_emit(ctx, name, len, ino, type))
> fd7b9f7b9776b1 Tejun Heo          2013-11-28  1851  			return 0;
> 393c3714081a53 Minchan Kim        2021-11-18  1852  		down_read(&root->kernfs_rwsem);
>                                                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> Needs to be deleted.

Uh, yes, sorry; I'm not sure how I let that slip, I guess I didn't hit
any dead lock because nothing ever tried to take a write lock after
getdents...
Thanks!

I expect there'll be other comments (this might not make it at all), so
I'll keep the v3 of this patch with this fix locally and resend after
other comments.

> a551138c4b3b9f Dominique Martinet 2023-05-10  1853  		if (ctx->flags & DIR_CONTEXT_F_NOWAIT) {
> a551138c4b3b9f Dominique Martinet 2023-05-10  1854  			if (!down_read_trylock(&root->kernfs_rwsem))
> a551138c4b3b9f Dominique Martinet 2023-05-10  1855  				return 0;
> 
> It's a bit strange the this doesn't return -EAGAIN;

That is on purpose: the getdents did work (dir_emit returned success at
least once), so the caller can process whatever was filled in the buffer
before calling iterate_shared() again.

If we were to return -EAGAIN here, we'd actually be throwing out the
entries we just filled in, and that's not what we want.

-- 
Dominique Martinet | Asmadeus

Re: [PATCH v2 6/6] RFC: io_uring getdents: test returning an EOF flag in CQE

2023-05-23T14:30:29Z

On Wed, May 10, 2023 at 07:52:54PM +0900, Dominique Martinet wrote:
> This turns out to be very slightly faster than an extra call to
> getdents, but in practice it doesn't seem to be such an improvement as
> the trailing getdents will return almost immediately be absorbed by the
> scheduling noise in a find-like context (my ""server"" is too noisy to
> get proper benchmarks out, but results look slightly better with this in
> async mode, and almost identical in the NOWAIT path)
> 
> If the user is waiting the end of a single directory though it might be
> worth it, so including the patch for comments.
> (in particular I'm not really happy that the flag has become in-out for
> vfs_getdents, especially when the getdents64 syscall does not use it,
> but I don't see much other way around it)
> 
> If this approach is acceptable/wanted then this patch will be split down
> further (at least dir_context/vfs_getdents, kernfs, libfs, uring in four
> separate commits)
> 
> Signed-off-by: Dominique Martinet 
> ---
>  fs/internal.h                 |  2 +-
>  fs/kernfs/dir.c               |  1 +
>  fs/libfs.c                    |  9 ++++++---
>  fs/readdir.c                  | 10 ++++++----
>  include/linux/fs.h            |  2 ++
>  include/uapi/linux/io_uring.h |  2 ++
>  io_uring/fs.c                 |  8 ++++++--
>  7 files changed, 24 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/internal.h b/fs/internal.h
> index 0264b001d99a..0b1552c7a870 100644
> --- a/fs/internal.h
> +++ b/fs/internal.h
> @@ -267,4 +267,4 @@ void mnt_idmap_put(struct mnt_idmap *idmap);
>  struct linux_dirent64;
>  
>  int vfs_getdents(struct file *file, struct linux_dirent64 __user *dirent,
> -		 unsigned int count, unsigned long flags);
> +		 unsigned int count, unsigned long *flags);
> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> index 5a5b3e7881bf..53a6b4804c34 100644
> --- a/fs/kernfs/dir.c
> +++ b/fs/kernfs/dir.c
> @@ -1860,6 +1860,7 @@ static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
>  	up_read(&root->kernfs_rwsem);
>  	file->private_data = NULL;
>  	ctx->pos = INT_MAX;
> +	ctx->flags |= DIR_CONTEXT_F_EOD;
>  	return 0;
>  }
>  
> diff --git a/fs/libfs.c b/fs/libfs.c
> index a3c7e42d90a7..b2a95dadffbd 100644
> --- a/fs/libfs.c
> +++ b/fs/libfs.c
> @@ -208,10 +208,12 @@ int dcache_readdir(struct file *file, struct dir_context *ctx)
>  		p = &next->d_child;
>  	}
>  	spin_lock(&dentry->d_lock);
> -	if (next)
> +	if (next) {
>  		list_move_tail(&cursor->d_child, &next->d_child);
> -	else
> +	} else {
>  		list_del_init(&cursor->d_child);
> +		ctx->flags |= DIR_CONTEXT_F_EOD;
> +	}
>  	spin_unlock(&dentry->d_lock);
>  	dput(next);
>  
> @@ -1347,7 +1349,8 @@ static loff_t empty_dir_llseek(struct file *file, loff_t offset, int whence)
>  
>  static int empty_dir_readdir(struct file *file, struct dir_context *ctx)
>  {
> -	dir_emit_dots(file, ctx);
> +	if (dir_emit_dots(file, ctx))
> +		ctx->flags |= DIR_CONTEXT_F_EOD;
>  	return 0;
>  }
>  
> diff --git a/fs/readdir.c b/fs/readdir.c
> index 1311b89d75e1..be75a2154b4f 100644
> --- a/fs/readdir.c
> +++ b/fs/readdir.c
> @@ -358,14 +358,14 @@ static bool filldir64(struct dir_context *ctx, const char *name, int namlen,
>   * @file    : pointer to file struct of directory
>   * @dirent  : pointer to user directory structure
>   * @count   : size of buffer
> - * @flags   : additional dir_context flags
> + * @flags   : pointer to additional dir_context flags
>   */
>  int vfs_getdents(struct file *file, struct linux_dirent64 __user *dirent,
> -		 unsigned int count, unsigned long flags)
> +		 unsigned int count, unsigned long *flags)
>  {
>  	struct getdents_callback64 buf = {
>  		.ctx.actor = filldir64,
> -		.ctx.flags = flags,
> +		.ctx.flags = flags ? *flags : 0,
>  		.count = count,
>  		.current_dir = dirent
>  	};
> @@ -384,6 +384,8 @@ int vfs_getdents(struct file *file, struct linux_dirent64 __user *dirent,
>  		else
>  			error = count - buf.count;
>  	}
> +	if (flags)
> +		*flags = buf.ctx.flags;
>  	return error;
>  }
>  
> @@ -397,7 +399,7 @@ SYSCALL_DEFINE3(getdents64, unsigned int, fd,
>  	if (!f.file)
>  		return -EBADF;
>  
> -	error = vfs_getdents(f.file, dirent, count, 0);
> +	error = vfs_getdents(f.file, dirent, count, NULL);
>  
>  	fdput_pos(f);
>  	return error;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index f7de2b5ca38e..d1e31bccfb4f 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1723,8 +1723,10 @@ struct dir_context {
>   * flags for dir_context flags
>   * DIR_CONTEXT_F_NOWAIT: Request non-blocking iterate
>   *                       (requires file->f_mode & FMODE_NOWAIT)
> + * DIR_CONTEXT_F_EOD: Signal directory has been fully iterated, set by the fs
>   */
>  #define DIR_CONTEXT_F_NOWAIT	0x1
> +#define DIR_CONTEXT_F_EOD	0x2
>  
>  /*
>   * These flags let !MMU mmap() govern direct device mapping vs immediate
> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> index 35d0de18d893..35877132027e 100644
> --- a/include/uapi/linux/io_uring.h
> +++ b/include/uapi/linux/io_uring.h
> @@ -381,11 +381,13 @@ struct io_uring_cqe {
>   * IORING_CQE_F_SOCK_NONEMPTY	If set, more data to read after socket recv
>   * IORING_CQE_F_NOTIF	Set for notification CQEs. Can be used to distinct
>   * 			them from sends.
> + * IORING_CQE_F_EOF     If set, file or directory has reached end of file.
>   */
>  #define IORING_CQE_F_BUFFER		(1U << 0)
>  #define IORING_CQE_F_MORE		(1U << 1)
>  #define IORING_CQE_F_SOCK_NONEMPTY	(1U << 2)
>  #define IORING_CQE_F_NOTIF		(1U << 3)
> +#define IORING_CQE_F_EOF		(1U << 4)
>  
>  enum {
>  	IORING_CQE_BUFFER_SHIFT		= 16,
> diff --git a/io_uring/fs.c b/io_uring/fs.c
> index b15ec81c1ed2..f6222b0148ef 100644
> --- a/io_uring/fs.c
> +++ b/io_uring/fs.c
> @@ -322,6 +322,7 @@ int io_getdents(struct io_kiocb *req, unsigned int issue_flags)
>  {
>  	struct io_getdents *gd = io_kiocb_to_cmd(req, struct io_getdents);
>  	unsigned long getdents_flags = 0;
> +	u32 cqe_flags = 0;
>  	int ret;
>  
>  	if (issue_flags & IO_URING_F_NONBLOCK) {
> @@ -338,13 +339,16 @@ int io_getdents(struct io_kiocb *req, unsigned int issue_flags)
>  			goto out;
>  	}
>  
> -	ret = vfs_getdents(req->file, gd->dirent, gd->count, getdents_flags);
> +	ret = vfs_getdents(req->file, gd->dirent, gd->count, &getdents_flags);

I don't understand how synchronization and updating of f_pos works here.
For example, what happens if a concurrent seek happens on the fd while
io_uring is using vfs_getdents which calls into iterate_dir() and
updates f_pos?

Re: [PATCH v2 1/6] fs: split off vfs_getdents function of getdents64 syscall

2023-05-23T15:39:18Z

On Wed, May 10, 2023 at 07:52:49PM +0900, Dominique Martinet wrote:
> This splits off the vfs_getdents function from the getdents64 system
> call.
> This will allow io_uring to call the vfs_getdents function.
> 
> Co-authored-by: Stefan Roesch 
> Signed-off-by: Dominique Martinet 
> ---
>  fs/internal.h |  8 ++++++++
>  fs/readdir.c  | 34 ++++++++++++++++++++++++++--------
>  2 files changed, 34 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/internal.h b/fs/internal.h
> index bd3b2810a36b..e8ca000e6613 100644
> --- a/fs/internal.h
> +++ b/fs/internal.h
> @@ -260,3 +260,11 @@ ssize_t __kernel_write_iter(struct file *file, struct iov_iter *from, loff_t *po
>  struct mnt_idmap *alloc_mnt_idmap(struct user_namespace *mnt_userns);
>  struct mnt_idmap *mnt_idmap_get(struct mnt_idmap *idmap);
>  void mnt_idmap_put(struct mnt_idmap *idmap);
> +
> +/*
> + * fs/readdir.c
> + */
> +struct linux_dirent64;
> +
> +int vfs_getdents(struct file *file, struct linux_dirent64 __user *dirent,
> +		 unsigned int count);
> diff --git a/fs/readdir.c b/fs/readdir.c
> index 9c53edb60c03..ed0803d0011e 100644
> --- a/fs/readdir.c
> +++ b/fs/readdir.c
> @@ -21,6 +21,7 @@
>  #include 
>  #include 
>  #include 
> +#include "internal.h"
>  
>  #include 
>  
> @@ -351,10 +352,16 @@ static bool filldir64(struct dir_context *ctx, const char *name, int namlen,
>  	return false;
>  }
>  
> -SYSCALL_DEFINE3(getdents64, unsigned int, fd,
> -		struct linux_dirent64 __user *, dirent, unsigned int, count)
> +
> +/**
> + * vfs_getdents - getdents without fdget
> + * @file    : pointer to file struct of directory
> + * @dirent  : pointer to user directory structure
> + * @count   : size of buffer
> + */
> +int vfs_getdents(struct file *file, struct linux_dirent64 __user *dirent,
> +		 unsigned int count)
>  {
> -	struct fd f;
>  	struct getdents_callback64 buf = {
>  		.ctx.actor = filldir64,
>  		.count = count,
> @@ -362,11 +369,7 @@ SYSCALL_DEFINE3(getdents64, unsigned int, fd,
>  	};
>  	int error;
>  
> -	f = fdget_pos(fd);
> -	if (!f.file)
> -		return -EBADF;
> -
> -	error = iterate_dir(f.file, &buf.ctx);
> +	error = iterate_dir(file, &buf.ctx);

So afaict this isn't enough.
If you look into iterate_shared() you should see that it uses and
updates f_pos. But that can't work for io_uring and also isn't how
io_uring handles read and write. You probably need to use a local pos
similar to what io_uring does in rw.c for rw->kiocb.ki_pos. But in
contrast simply disallow any offsets for getdents completely. Thus not
relying on f_pos anywhere at all.

Re: [PATCH v2 6/6] RFC: io_uring getdents: test returning an EOF flag in CQE

2023-05-23T21:05:58Z

Christian Brauner wrote on Tue, May 23, 2023 at 04:30:14PM +0200:
> > index b15ec81c1ed2..f6222b0148ef 100644
> > --- a/io_uring/fs.c
> > +++ b/io_uring/fs.c
> > @@ -322,6 +322,7 @@ int io_getdents(struct io_kiocb *req, unsigned int issue_flags)
> >  {
> >  	struct io_getdents *gd = io_kiocb_to_cmd(req, struct io_getdents);
> >  	unsigned long getdents_flags = 0;
> > +	u32 cqe_flags = 0;
> >  	int ret;
> >  
> >  	if (issue_flags & IO_URING_F_NONBLOCK) {
> > @@ -338,13 +339,16 @@ int io_getdents(struct io_kiocb *req, unsigned int issue_flags)
> >  			goto out;
> >  	}
> >  
> > -	ret = vfs_getdents(req->file, gd->dirent, gd->count, getdents_flags);
> > +	ret = vfs_getdents(req->file, gd->dirent, gd->count, &getdents_flags);
> 
> I don't understand how synchronization and updating of f_pos works here.
> For example, what happens if a concurrent seek happens on the fd while
> io_uring is using vfs_getdents which calls into iterate_dir() and
> updates f_pos?

I don't see how different that is from a user spawning two threads and
calling getdents64 + lseek or two getdents64 in parallel?
(or any two other users of iterate_dir)

As far as I understand you'll either get the old or new pos as
obtained/updated by iterate_dir()?

That iterate_dir probably ought to be using READ_ONCE/WRITE_ONCE or some
atomic read/update wrappers as the shared case only has a read lock
around these, but that's not a new problem; and for all I care
about I'm happy to let users shoot themselves in the foot.
(although I guess that with filesystems not validating the offset as
was pointed out in a previous version comment having non-atomic update
might be a security issue at some point on architectures that don't
guarantee atomic 64bit updates, but if someone manages to abuse it
it's already possible to abuse it with the good old syscalls, so I'd
rather leave that up to someone who understand how atomicity in the
kernel works better than me...)

-- 
Dominique Martinet | Asmadeus

Re: [PATCH v2 1/6] fs: split off vfs_getdents function of getdents64 syscall

2023-05-23T21:14:24Z

Christian Brauner wrote on Tue, May 23, 2023 at 05:39:08PM +0200:
> > @@ -362,11 +369,7 @@ SYSCALL_DEFINE3(getdents64, unsigned int, fd,
> >  	};
> >  	int error;
> >  
> > -	f = fdget_pos(fd);
> > -	if (!f.file)
> > -		return -EBADF;
> > -
> > -	error = iterate_dir(f.file, &buf.ctx);
> > +	error = iterate_dir(file, &buf.ctx);
> 
> So afaict this isn't enough.
> If you look into iterate_shared() you should see that it uses and
> updates f_pos. But that can't work for io_uring and also isn't how
> io_uring handles read and write. You probably need to use a local pos
> similar to what io_uring does in rw.c for rw->kiocb.ki_pos. But in
> contrast simply disallow any offsets for getdents completely. Thus not
> relying on f_pos anywhere at all.

Using a private offset from the sqe was the previous implementation
discussed around here[1], and Al Viro pointed out that the iterate
filesystem implementations don't validate the offset makes sense as it's
either costly or for some filesystems downright impossible, so I went
into a don't let users modify it approach.

[1] https://lore.kernel.org/all/20211221164004.119663-1-shr@fb.com/T/#m517583f23502f32b040c819d930359313b3db00c


I agree it's not how io_uring usually works -- it dislikes global
states -- but it works perfectly well as long as you don't have multiple
users on the same file, which the application can take care of.

Not having any offsets would work for small directories but make reading
large directories impossible so some sort of continuation is required,
which means we need to keep the offset around; I also suggested keeping
the offset in argument as the previous version but only allowing the
last known offset (... so ultimately still updating f_pos anyway as we
don't have anywhere else to store it) or 0, but if we're going to do
that it looks much simpler to me to expose the same API as getdents.

-- 
Dominique Martinet | Asmadeus

Re: [PATCH v2 1/6] fs: split off vfs_getdents function of getdents64 syscall

2023-05-24T13:52:55Z

On Wed, May 24, 2023 at 06:13:57AM +0900, Dominique Martinet wrote:
> Christian Brauner wrote on Tue, May 23, 2023 at 05:39:08PM +0200:
> > > @@ -362,11 +369,7 @@ SYSCALL_DEFINE3(getdents64, unsigned int, fd,
> > >  	};
> > >  	int error;
> > >  
> > > -	f = fdget_pos(fd);
> > > -	if (!f.file)
> > > -		return -EBADF;
> > > -
> > > -	error = iterate_dir(f.file, &buf.ctx);
> > > +	error = iterate_dir(file, &buf.ctx);
> > 
> > So afaict this isn't enough.
> > If you look into iterate_shared() you should see that it uses and
> > updates f_pos. But that can't work for io_uring and also isn't how
> > io_uring handles read and write. You probably need to use a local pos
> > similar to what io_uring does in rw.c for rw->kiocb.ki_pos. But in
> > contrast simply disallow any offsets for getdents completely. Thus not
> > relying on f_pos anywhere at all.
> 
> Using a private offset from the sqe was the previous implementation
> discussed around here[1], and Al Viro pointed out that the iterate
> filesystem implementations don't validate the offset makes sense as it's
> either costly or for some filesystems downright impossible, so I went
> into a don't let users modify it approach.
> 
> [1] https://lore.kernel.org/all/20211221164004.119663-1-shr@fb.com/T/#m517583f23502f32b040c819d930359313b3db00c
> 
> 
> I agree it's not how io_uring usually works -- it dislikes global
> states -- but it works perfectly well as long as you don't have multiple
> users on the same file, which the application can take care of.
> 
> Not having any offsets would work for small directories but make reading
> large directories impossible so some sort of continuation is required,
> which means we need to keep the offset around; I also suggested keeping
> the offset in argument as the previous version but only allowing the
> last known offset (... so ultimately still updating f_pos anyway as we
> don't have anywhere else to store it) or 0, but if we're going to do
> that it looks much simpler to me to expose the same API as getdents.
> 
> -- 
> Dominique Martinet | Asmadeus

On Wed, May 24, 2023 at 06:05:06AM +0900, Dominique Martinet wrote:
> Christian Brauner wrote on Tue, May 23, 2023 at 04:30:14PM +0200:
> > > index b15ec81c1ed2..f6222b0148ef 100644
> > > --- a/io_uring/fs.c
> > > +++ b/io_uring/fs.c
> > > @@ -322,6 +322,7 @@ int io_getdents(struct io_kiocb *req, unsigned int issue_flags)
> > >  {
> > >  	struct io_getdents *gd = io_kiocb_to_cmd(req, struct io_getdents);
> > >  	unsigned long getdents_flags = 0;
> > > +	u32 cqe_flags = 0;
> > >  	int ret;
> > >  
> > >  	if (issue_flags & IO_URING_F_NONBLOCK) {
> > > @@ -338,13 +339,16 @@ int io_getdents(struct io_kiocb *req, unsigned int issue_flags)
> > >  			goto out;
> > >  	}
> > >  
> > > -	ret = vfs_getdents(req->file, gd->dirent, gd->count, getdents_flags);
> > > +	ret = vfs_getdents(req->file, gd->dirent, gd->count, &getdents_flags);
> > 
> > I don't understand how synchronization and updating of f_pos works here.
> > For example, what happens if a concurrent seek happens on the fd while
> > io_uring is using vfs_getdents which calls into iterate_dir() and
> > updates f_pos?
> 
> I don't see how different that is from a user spawning two threads and
> calling getdents64 + lseek or two getdents64 in parallel?
> (or any two other users of iterate_dir)
> 
> As far as I understand you'll either get the old or new pos as
> obtained/updated by iterate_dir()?
> 
> That iterate_dir probably ought to be using READ_ONCE/WRITE_ONCE or some
> atomic read/update wrappers as the shared case only has a read lock
> around these, but that's not a new problem; and for all I care
> about I'm happy to let users shoot themselves in the foot.
> (although I guess that with filesystems not validating the offset as
> was pointed out in a previous version comment having non-atomic update
> might be a security issue at some point on architectures that don't
> guarantee atomic 64bit updates, but if someone manages to abuse it
> it's already possible to abuse it with the good old syscalls, so I'd
> rather leave that up to someone who understand how atomicity in the
> kernel works better than me...)

There's multiple issues here.

The main objection in [1] was to allow specifying an arbitrary offset
from userspace. What [3] did was to implement a pread() variant for
directories, i.e., pgetdents(). That can't work in principle/is
prohibitively complex. Which is what your series avoids by not allowing
any offsets to be specified.

However, there's still a problem here. Updates to f_pos happen under an
approriate lock to guarantee consistency of the position between calls
that move the cursor position. In the normal read-write path io_uring
doesn't concern itself with f_pos as it keeps local state in
kiocb->ki_pos.

But then it still does end up running into f_pos consistency problems
for read-write because it does allow operating on the current f_pos if
the offset if struct io_rw is set to -1.

In that case it does retrieve and update f_pos which should take
f_pos_lock and a patchset for this was posted but it didn't go anywhere.
It should probably hold that lock. See Jann's comments in the other
thread how that currently can lead to issues.

For getdents() not protecting f_pos is equally bad or worse. The patch
doesn't hold f_pos_lock and just updates f_pos prior _and_ post
iterate_dir() arguing that this race is fine. But again, f_version and
f_pos are consistent after each system call invocation.

But without that you can have a concurrent seek running and can end up
with an inconsistent f_pos position within the same system call. IOW,
you're breaking f_pos being in a well-known state. And you're not doing
that just for io_uring you're doing it for the regular system call
interface as well as both can be used on the same fd simultaneously.
So that's a no go imho.

> I don't see how different that is from a user spawning two threads and
> calling getdents64 + lseek or two getdents64 in parallel?
> (or any two other users of iterate_dir)

The difference is that in both cases f_pos_lock for both getdents and
lseek is held. So f_pos is in a good known state. You're not taking any
locks so now we're risking inconsistency within the same system call if
getdents and lseek run concurrently. Jens also mentioned that you could
even have this problem from within io_uring itself.

So tl;dr, there's no good reason to declare this an acceptable race
afaict. So either this is fixed properly or we're not doing it as far as
I'm concerned.

[1] https://lore.kernel.org/all/20211221164004.119663-1-shr@fb.com/T/#m517583f23502f32b040c819d930359313b3db00c
[2] https://lore.kernel.org/io-uring/20211221164004.119663-4-shr@fb.com
[3] https://lore.kernel.org/io-uring/20211221164004.119663-1-shr@fb.com

Re: [PATCH v2 1/6] fs: split off vfs_getdents function of getdents64 syscall

2023-05-24T21:36:53Z

Christian Brauner wrote on Wed, May 24, 2023 at 03:52:45PM +0200:
> The main objection in [1] was to allow specifying an arbitrary offset
> from userspace. What [3] did was to implement a pread() variant for
> directories, i.e., pgetdents(). That can't work in principle/is
> prohibitively complex. Which is what your series avoids by not allowing
> any offsets to be specified.

Yes.

> However, there's still a problem here. Updates to f_pos happen under an
> approriate lock to guarantee consistency of the position between calls
> that move the cursor position. In the normal read-write path io_uring
> doesn't concern itself with f_pos as it keeps local state in
> kiocb->ki_pos.
> 
> But then it still does end up running into f_pos consistency problems
> for read-write because it does allow operating on the current f_pos if
> the offset if struct io_rw is set to -1.
> 
> In that case it does retrieve and update f_pos which should take
> f_pos_lock and a patchset for this was posted but it didn't go anywhere.
> It should probably hold that lock. See Jann's comments in the other
> thread how that currently can lead to issues.

Assuming that is this mail:
https://lore.kernel.org/io-uring/CAG48ez1O9VxSuWuLXBjke23YxUA8EhMP+6RCHo5PNQBf3B0pDQ@mail.gmail.com/

So, ok, I didn't realize fdget_pos() actually acted as a lock on the
file's f_pos_lock (apparently only if FMODE_ATMOIC_POS is set? but it
looks set on most if not all directories through finish_open(), that
looks called consistently enough)

What was confusing is that default_llseek updates f_pos under the
inode_lock (write), and getdents also takes that lock (for read only in
shared implem), so I assumed getdents also was just protected by this
read lock, but I guess that was a bad assumption (as I kept pointing
out, a shared read lock isn't good enough, we definitely agree there)


In practice, in the non-registered file case io_uring is also calling
fdget, so the lock is held exactly the same as the syscall and I wasn't
so far off -- but we need to figure something for the registered file
case.
openat variants don't allow working with the registered variant at all
on the parent fd, so as far as I'm concerned I'd be happy setting the
same limitation for getdents: just say it acts on fd and not files, and
call it a day...
It'd also be possible to check if REQ_F_FIXED_FILE flag was set and
manually take the lock somehow but we don't have any primitive that
takes f_pos_lock from a file (the only place that takes it is fdget
which requires a fd), so I'd rather not add such a new exception.
I assume the other patch you mentioned about adding that lock was this
one:
https://lore.kernel.org/all/20220222105504.3331010-1-dylany@fb.com/T/#m3609dc8057d0bc8e41ceab643e4d630f7b91bde6
and it just atkes the lock, but __fdget_pos also checks for
FMODE_ATOMIC_OPS and file_count and I'm not sure I understand how it
sets (f.flags & FDPUT_POS_UNLOCK) (for fdput_pos) so I'd rather not add
such a code path at this point..


So, ok, what do you think about just forbidding registered files?
I can't see where that wouldn't suffice but I might be missing something
else.

> For getdents() not protecting f_pos is equally bad or worse. The patch
> doesn't hold f_pos_lock and just updates f_pos prior _and_ post
> iterate_dir() arguing that this race is fine. But again, f_version and
> f_pos are consistent after each system call invocation.
> 
> But without that you can have a concurrent seek running and can end up
> with an inconsistent f_pos position within the same system call. IOW,
> you're breaking f_pos being in a well-known state. And you're not doing
> that just for io_uring you're doing it for the regular system call
> interface as well as both can be used on the same fd simultaneously.
> So that's a no go imho.

(so seek is fine, but I agree two concurrent getdents on registered
files won't have the required lock)

-- 
Dominique Martinet | Asmadeus

Re: [PATCH v2 1/6] fs: split off vfs_getdents function of getdents64 syscall

2023-05-25T09:22:19Z

On Thu, May 25, 2023 at 06:36:17AM +0900, Dominique Martinet wrote:
> Christian Brauner wrote on Wed, May 24, 2023 at 03:52:45PM +0200:
> > The main objection in [1] was to allow specifying an arbitrary offset
> > from userspace. What [3] did was to implement a pread() variant for
> > directories, i.e., pgetdents(). That can't work in principle/is
> > prohibitively complex. Which is what your series avoids by not allowing
> > any offsets to be specified.
> 
> Yes.
> 
> > However, there's still a problem here. Updates to f_pos happen under an
> > approriate lock to guarantee consistency of the position between calls
> > that move the cursor position. In the normal read-write path io_uring
> > doesn't concern itself with f_pos as it keeps local state in
> > kiocb->ki_pos.
> > 
> > But then it still does end up running into f_pos consistency problems
> > for read-write because it does allow operating on the current f_pos if
> > the offset if struct io_rw is set to -1.
> > 
> > In that case it does retrieve and update f_pos which should take
> > f_pos_lock and a patchset for this was posted but it didn't go anywhere.
> > It should probably hold that lock. See Jann's comments in the other
> > thread how that currently can lead to issues.
> 
> Assuming that is this mail:
> https://lore.kernel.org/io-uring/CAG48ez1O9VxSuWuLXBjke23YxUA8EhMP+6RCHo5PNQBf3B0pDQ@mail.gmail.com/
> 
> So, ok, I didn't realize fdget_pos() actually acted as a lock on the
> file's f_pos_lock (apparently only if FMODE_ATMOIC_POS is set? but it
> looks set on most if not all directories through finish_open(), that
> looks called consistently enough)

It's set for any regular file and directory.

> 
> What was confusing is that default_llseek updates f_pos under the
> inode_lock (write), and getdents also takes that lock (for read only in
> shared implem), so I assumed getdents also was just protected by this
> read lock, but I guess that was a bad assumption (as I kept pointing
> out, a shared read lock isn't good enough, we definitely agree there)
> 
> 
> In practice, in the non-registered file case io_uring is also calling
> fdget, so the lock is held exactly the same as the syscall and I wasn't

No, it really isn't. fdget() doesn't take f_pos_lock at all:

fdget()
-> __fdget()
   -> __fget_light()
      -> __fget()
         -> __fget_files()
            -> __fget_files_rcu()

If that were true then any system call that passes an fd and uses
fdget() would try to acquire a mutex on f_pos_lock. We'd be serializing
every *at based system call on f_pos_lock whenever we have multiple fds
referring to the same file trying to operate on it concurrently.

We do have fdget_pos() and fdput_pos() as a special purpose fdget() for
a select group of system calls that require this synchronization.

> so far off -- but we need to figure something for the registered file
> case.
> openat variants don't allow working with the registered variant at all
> on the parent fd, so as far as I'm concerned I'd be happy setting the
> same limitation for getdents: just say it acts on fd and not files, and
> call it a day...

I don't follow. Also this is hacky so no.

The reason why io_uring *at implementations don't work with fixed files
is that the VFS interface expect regular fds. You could very well make
this work for fixed files but why. It would mean exposing a whole new
set of vfs helpers to io_uring and would probably involve nasty corner
cases.

Also the connection between regular and fixed files in io_uring is
pretty much fluent. While fixed files can only remain pinned in an
io_uring instance it requires that the caller explicitly gave up all
references in their fdtable to that struct file by closing all fds
referring to the same file.

But there's no guarantee. For example, if another thread dups the fd or
the caller sends the fd via SCM_RIGHTS to another process or the caller
simply doesn't close the fd or another thread gets an fd to the same
file from that task via pidfd_getfd before it closed it this doesn't hold.

So it's very well possible to have an fd and a fixed io_uring reference
referring to the same file. The first one can be used with the regular
system call interface and io_uring *at requests that forbid fixed files.
And the other one can be used for i_uring fixed file operations. Doesn't
matter if that shouldn't be done, it's possible afaict.

For regular and fixed files you also have the same problem from within
the same io_uring instance where you can have concurrent getdent
requests. You'd end up producing the exact same inconsistencies.

> It'd also be possible to check if REQ_F_FIXED_FILE flag was set and
> manually take the lock somehow but we don't have any primitive that
> takes f_pos_lock from a file (the only place that takes it is fdget
> which requires a fd), so I'd rather not add such a new exception.
> I assume the other patch you mentioned about adding that lock was this
> one:
> https://lore.kernel.org/all/20220222105504.3331010-1-dylany@fb.com/T/#m3609dc8057d0bc8e41ceab643e4d630f7b91bde6
> and it just atkes the lock, but __fdget_pos also checks for
> FMODE_ATOMIC_OPS and file_count and I'm not sure I understand how it
> sets (f.flags & FDPUT_POS_UNLOCK) (for fdput_pos) so I'd rather not add
> such a code path at this point..
> 
> 
> So, ok, what do you think about just forbidding registered files?
> I can't see where that wouldn't suffice but I might be missing something
> else.

It doesn't help.

Re: [PATCH v2 1/6] fs: split off vfs_getdents function of getdents64 syscall

2023-05-25T11:00:52Z

Christian Brauner wrote on Thu, May 25, 2023 at 11:22:08AM +0200:
> > What was confusing is that default_llseek updates f_pos under the
> > inode_lock (write), and getdents also takes that lock (for read only in
> > shared implem), so I assumed getdents also was just protected by this
> > read lock, but I guess that was a bad assumption (as I kept pointing
> > out, a shared read lock isn't good enough, we definitely agree there)
> > 
> > 
> > In practice, in the non-registered file case io_uring is also calling
> > fdget, so the lock is held exactly the same as the syscall and I wasn't
> 
> No, it really isn't. fdget() doesn't take f_pos_lock at all:
> 
> fdget()
> -> __fdget()
>    -> __fget_light()
>       -> __fget()
>          -> __fget_files()
>             -> __fget_files_rcu()

Ugh, I managed to not notice that I was looking at fdget_pos and that
it's not the same as fdget by the time I wrote two paragraphs... These
functions all have too many wrappers and too similar names for a quick
look before work.

> If that were true then any system call that passes an fd and uses
> fdget() would try to acquire a mutex on f_pos_lock. We'd be serializing
> every *at based system call on f_pos_lock whenever we have multiple fds
> referring to the same file trying to operate on it concurrently.
> 
> We do have fdget_pos() and fdput_pos() as a special purpose fdget() for
> a select group of system calls that require this synchronization.

Right, that makes sense, and invalidates everything I said after that
anyway but it's not like looking stupid ever killed anyone.

Ok so it would require adding a new wrapper from struct file to struct
fd that'd eventually take the lock and set FDPUT_POS_UNLOCK for... not
fdput_pos but another function for that stopping short of fdput...
Then just call that around both vfs_llseek and vfs_getdents calls; which
is the easy part.

(Or possibly call mutex_lock directly like Dylan did in [1]...)
[1] https://lore.kernel.org/all/20220222105504.3331010-1-dylany@fb.com/T/#m3609dc8057d0bc8e41ceab643e4d630f7b91bde6



I'll be honest though I'm thankful for your explanations but I think
I'll just do like Stefan and stop trying for now: the only reason I've
started this was because I wanted to play with io_uring for a new toy
project and it felt awkward without a getdents for crawling a tree; and
I'm long past the point where I should have thrown the towel and just
make that a sequential walk.
There's too many "conditional patches" (NOWAIT, end of dir indicator)
that I don't care about and require additional work to rebase
continuously so I'll just leave it up to someone else who does care.

So to that someone: feel free to continue from these branches (I've
included the fix for kernfs_fop_readdir that Dan Carpenter reported):
https://github.com/martinetd/linux/commits/io_uring_getdents
https://github.com/martinetd/liburing/commits/getdents

Or just start over, there's not that much code now hopefully the
baseline requirements have gotten a little bit clearer.


Sorry for stirring the mess and leaving halfway, if nobody does continue
I might send a v3 when I have more time/energy in a few months, but it
won't be quick.

-- 
Dominique

Re: [PATCH v2 1/6] fs: split off vfs_getdents function of getdents64 syscall

2023-05-25T12:33:56Z

On Thu, May 25, 2023 at 08:00:02PM +0900, Dominique Martinet wrote:
> Christian Brauner wrote on Thu, May 25, 2023 at 11:22:08AM +0200:
> > > What was confusing is that default_llseek updates f_pos under the
> > > inode_lock (write), and getdents also takes that lock (for read only in
> > > shared implem), so I assumed getdents also was just protected by this
> > > read lock, but I guess that was a bad assumption (as I kept pointing
> > > out, a shared read lock isn't good enough, we definitely agree there)
> > > 
> > > 
> > > In practice, in the non-registered file case io_uring is also calling
> > > fdget, so the lock is held exactly the same as the syscall and I wasn't
> > 
> > No, it really isn't. fdget() doesn't take f_pos_lock at all:
> > 
> > fdget()
> > -> __fdget()
> >    -> __fget_light()
> >       -> __fget()
> >          -> __fget_files()
> >             -> __fget_files_rcu()
> 
> Ugh, I managed to not notice that I was looking at fdget_pos and that
> it's not the same as fdget by the time I wrote two paragraphs... These
> functions all have too many wrappers and too similar names for a quick
> look before work.
> 
> > If that were true then any system call that passes an fd and uses
> > fdget() would try to acquire a mutex on f_pos_lock. We'd be serializing
> > every *at based system call on f_pos_lock whenever we have multiple fds
> > referring to the same file trying to operate on it concurrently.
> > 
> > We do have fdget_pos() and fdput_pos() as a special purpose fdget() for
> > a select group of system calls that require this synchronization.
> 
> Right, that makes sense, and invalidates everything I said after that
> anyway but it's not like looking stupid ever killed anyone.

I strongly disagree with the looking stupid part. These callchains are
quite unwieldy and it's easy to get confused. Usually if you receive a
long mail about the semantics involved - as in the earlier thread - it
means there's landmines all over.

> 
> Ok so it would require adding a new wrapper from struct file to struct
> fd that'd eventually take the lock and set FDPUT_POS_UNLOCK for... not
> fdput_pos but another function for that stopping short of fdput...
> Then just call that around both vfs_llseek and vfs_getdents calls; which
> is the easy part.
> 
> (Or possibly call mutex_lock directly like Dylan did in [1]...)
> [1] https://lore.kernel.org/all/20220222105504.3331010-1-dylany@fb.com/T/#m3609dc8057d0bc8e41ceab643e4d630f7b91bde6

We'd need a consistent story whatever it ends up being.

> I'll be honest though I'm thankful for your explanations but I think
> I'll just do like Stefan and stop trying for now: the only reason I've
> started this was because I wanted to play with io_uring for a new toy
> project and it felt awkward without a getdents for crawling a tree; and
> I'm long past the point where I should have thrown the towel and just
> make that a sequential walk.
> There's too many "conditional patches" (NOWAIT, end of dir indicator)
> that I don't care about and require additional work to rebase
> continuously so I'll just leave it up to someone else who does care.
> 
> So to that someone: feel free to continue from these branches (I've
> included the fix for kernfs_fop_readdir that Dan Carpenter reported):
> https://github.com/martinetd/linux/commits/io_uring_getdents
> https://github.com/martinetd/liburing/commits/getdents
> 
> Or just start over, there's not that much code now hopefully the
> baseline requirements have gotten a little bit clearer.
> 
> 
> Sorry for stirring the mess and leaving halfway, if nobody does continue
> I might send a v3 when I have more time/energy in a few months, but it
> won't be quick.

It's fine.

Re: [PATCH v2 1/6] fs: split off vfs_getdents function of getdents64 syscall

2023-07-11T08:17:25Z

On 5/25/23 19:00, Dominique Martinet wrote:
> Christian Brauner wrote on Thu, May 25, 2023 at 11:22:08AM +0200:
>>> What was confusing is that default_llseek updates f_pos under the
>>> inode_lock (write), and getdents also takes that lock (for read only in
>>> shared implem), so I assumed getdents also was just protected by this
>>> read lock, but I guess that was a bad assumption (as I kept pointing
>>> out, a shared read lock isn't good enough, we definitely agree there)
>>>
>>>
>>> In practice, in the non-registered file case io_uring is also calling
>>> fdget, so the lock is held exactly the same as the syscall and I wasn't
>>
>> No, it really isn't. fdget() doesn't take f_pos_lock at all:
>>
>> fdget()
>> -> __fdget()
>>     -> __fget_light()
>>        -> __fget()
>>           -> __fget_files()
>>              -> __fget_files_rcu()
> 
> Ugh, I managed to not notice that I was looking at fdget_pos and that
> it's not the same as fdget by the time I wrote two paragraphs... These
> functions all have too many wrappers and too similar names for a quick
> look before work.
> 
>> If that were true then any system call that passes an fd and uses
>> fdget() would try to acquire a mutex on f_pos_lock. We'd be serializing
>> every *at based system call on f_pos_lock whenever we have multiple fds
>> referring to the same file trying to operate on it concurrently.
>>
>> We do have fdget_pos() and fdput_pos() as a special purpose fdget() for
>> a select group of system calls that require this synchronization.
> 
> Right, that makes sense, and invalidates everything I said after that
> anyway but it's not like looking stupid ever killed anyone.
> 
> Ok so it would require adding a new wrapper from struct file to struct
> fd that'd eventually take the lock and set FDPUT_POS_UNLOCK for... not
> fdput_pos but another function for that stopping short of fdput...
> Then just call that around both vfs_llseek and vfs_getdents calls; which
> is the easy part.
> 
> (Or possibly call mutex_lock directly like Dylan did in [1]...)
> [1] https://lore.kernel.org/all/20220222105504.3331010-1-dylany@fb.com/T/#m3609dc8057d0bc8e41ceab643e4d630f7b91bde6
> 
> 
> 
> I'll be honest though I'm thankful for your explanations but I think
> I'll just do like Stefan and stop trying for now: the only reason I've
> started this was because I wanted to play with io_uring for a new toy
> project and it felt awkward without a getdents for crawling a tree; and
> I'm long past the point where I should have thrown the towel and just
> make that a sequential walk.
> There's too many "conditional patches" (NOWAIT, end of dir indicator)
> that I don't care about and require additional work to rebase
> continuously so I'll just leave it up to someone else who does care.
> 
> So to that someone: feel free to continue from these branches (I've
> included the fix for kernfs_fop_readdir that Dan Carpenter reported):
> https://github.com/martinetd/linux/commits/io_uring_getdents
> https://github.com/martinetd/liburing/commits/getdents
> 
> Or just start over, there's not that much code now hopefully the
> baseline requirements have gotten a little bit clearer.
> 
> 
> Sorry for stirring the mess and leaving halfway, if nobody does continue
> I might send a v3 when I have more time/energy in a few months, but it
> won't be quick.
> 

Hi Dominique,

I'd like to take this if you don't mind.

Regards,
Hao

Re: [PATCH v2 1/6] fs: split off vfs_getdents function of getdents64 syscall

2023-07-11T08:24:53Z

Hao Xu wrote on Tue, Jul 11, 2023 at 04:17:11PM +0800:
> > So to that someone: feel free to continue from these branches (I've
> > included the fix for kernfs_fop_readdir that Dan Carpenter reported):
> > https://github.com/martinetd/linux/commits/io_uring_getdents
> > https://github.com/martinetd/liburing/commits/getdents
> > 
> > Or just start over, there's not that much code now hopefully the
> > baseline requirements have gotten a little bit clearer.
> > 
> > 
> > Sorry for stirring the mess and leaving halfway, if nobody does continue
> > I might send a v3 when I have more time/energy in a few months, but it
> > won't be quick.
> 
> I'd like to take this if you don't mind.

Sure, I haven't been working on this lately, feel free to work on this.

Will be happy to review anything you send based on what came out of the
previous discussions to save Christian and others some time so you can
keep me in Cc if you'd like.

-- 
Dominique Martinet | Asmadeus