[PATCH v3 0/3] io_uring getdents

public inbox for [email protected]
 help / color / mirror / Atom feed

* [PATCH v3 0/3] io_uring getdents
@ 2023-07-11 11:40 Hao Xu
  2023-07-11 11:40 ` [PATCH 1/3] fs: split off vfs_getdents function of getdents64 syscall Hao Xu
                   ` (3 more replies)
  0 siblings, 4 replies; 35+ messages in thread
From: Hao Xu @ 2023-07-11 11:40 UTC (permalink / raw)
  To: io-uring, Jens Axboe
  Cc: Dominique Martinet, Pavel Begunkov, Christian Brauner,
	Alexander Viro, Stefan Roesch, Clay Harris, Dave Chinner,
	linux-fsdevel, Wanpeng Li

From: Hao Xu <[email protected]>

This series introduce getdents64 to io_uring, the code logic is similar
with the snychronized version's. It first try nowait issue, and offload
it to io-wq threads if the first try fails.


v2->v3:
 - removed the kernfs patches
 - add f_pos_lock logic
 - remove the "reduce last EOF getdents try" optimization since
   Dominique reports that doesn't make difference
 - remove the rewind logic, I think the right way is to introduce lseek
   to io_uring not to patch this logic to getdents.
 - add Singed-off-by of Stefan Roesch for patch 1 since checkpatch
   complained that Co-developed-by someone should be accompanied with
   Signed-off-by same person, I can remove them if Stefan thinks that's
   not proper.


Dominique Martinet (1):
  fs: split off vfs_getdents function of getdents64 syscall

Hao Xu (2):
  vfs_getdents/struct dir_context: add flags field
  io_uring: add support for getdents

 fs/internal.h                 |  8 +++++
 fs/readdir.c                  | 36 ++++++++++++++++-----
 include/linux/fs.h            |  8 +++++
 include/uapi/linux/io_uring.h |  7 ++++
 io_uring/fs.c                 | 60 +++++++++++++++++++++++++++++++++++
 io_uring/fs.h                 |  3 ++
 io_uring/opdef.c              |  8 +++++
 7 files changed, 122 insertions(+), 8 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 1/3] fs: split off vfs_getdents function of getdents64 syscall
  2023-07-11 11:40 [PATCH v3 0/3] io_uring getdents Hao Xu
@ 2023-07-11 11:40 ` Hao Xu
  2023-07-11 13:02   ` Ammar Faizi
  2023-07-11 23:41   ` Dave Chinner
  2023-07-11 11:40 ` [PATCH 2/3] vfs_getdents/struct dir_context: add flags field Hao Xu
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 35+ messages in thread
From: Hao Xu @ 2023-07-11 11:40 UTC (permalink / raw)
  To: io-uring, Jens Axboe
  Cc: Dominique Martinet, Pavel Begunkov, Christian Brauner,
	Alexander Viro, Stefan Roesch, Clay Harris, Dave Chinner,
	linux-fsdevel, Wanpeng Li

From: Dominique Martinet <[email protected]>

This splits off the vfs_getdents function from the getdents64 system
call.
This will allow io_uring to call the vfs_getdents function.

Co-developed-by: Stefan Roesch <[email protected]>
Signed-off-by: Stefan Roesch <[email protected]>
Signed-off-by: Dominique Martinet <[email protected]>
---
 fs/internal.h |  8 ++++++++
 fs/readdir.c  | 34 ++++++++++++++++++++++++++--------
 2 files changed, 34 insertions(+), 8 deletions(-)

diff --git a/fs/internal.h b/fs/internal.h
index f7a3dc111026..b1f66e52d61b 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -304,3 +304,11 @@ ssize_t __kernel_write_iter(struct file *file, struct iov_iter *from, loff_t *po
 struct mnt_idmap *alloc_mnt_idmap(struct user_namespace *mnt_userns);
 struct mnt_idmap *mnt_idmap_get(struct mnt_idmap *idmap);
 void mnt_idmap_put(struct mnt_idmap *idmap);
+
+/*
+ * fs/readdir.c
+ */
+struct linux_dirent64;
+
+int vfs_getdents(struct file *file, struct linux_dirent64 __user *dirent,
+		 unsigned int count);
diff --git a/fs/readdir.c b/fs/readdir.c
index b264ce60114d..9592259b7e7f 100644
--- a/fs/readdir.c
+++ b/fs/readdir.c
@@ -21,6 +21,7 @@
 #include <linux/unistd.h>
 #include <linux/compat.h>
 #include <linux/uaccess.h>
+#include "internal.h"
 
 #include <asm/unaligned.h>
 
@@ -351,10 +352,16 @@ static bool filldir64(struct dir_context *ctx, const char *name, int namlen,
 	return false;
 }
 
-SYSCALL_DEFINE3(getdents64, unsigned int, fd,
-		struct linux_dirent64 __user *, dirent, unsigned int, count)
+
+/**
+ * vfs_getdents - getdents without fdget
+ * @file    : pointer to file struct of directory
+ * @dirent  : pointer to user directory structure
+ * @count   : size of buffer
+ */
+int vfs_getdents(struct file *file, struct linux_dirent64 __user *dirent,
+		 unsigned int count)
 {
-	struct fd f;
 	struct getdents_callback64 buf = {
 		.ctx.actor = filldir64,
 		.count = count,
@@ -362,11 +369,7 @@ SYSCALL_DEFINE3(getdents64, unsigned int, fd,
 	};
 	int error;
 
-	f = fdget_pos(fd);
-	if (!f.file)
-		return -EBADF;
-
-	error = iterate_dir(f.file, &buf.ctx);
+	error = iterate_dir(file, &buf.ctx);
 	if (error >= 0)
 		error = buf.error;
 	if (buf.prev_reclen) {
@@ -379,6 +382,21 @@ SYSCALL_DEFINE3(getdents64, unsigned int, fd,
 		else
 			error = count - buf.count;
 	}
+	return error;
+}
+
+SYSCALL_DEFINE3(getdents64, unsigned int, fd,
+		struct linux_dirent64 __user *, dirent, unsigned int, count)
+{
+	struct fd f;
+	int error;
+
+	f = fdget_pos(fd);
+	if (!f.file)
+		return -EBADF;
+
+	error = vfs_getdents(f.file, dirent, count);
+
 	fdput_pos(f);
 	return error;
 }
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 2/3] vfs_getdents/struct dir_context: add flags field
  2023-07-11 11:40 [PATCH v3 0/3] io_uring getdents Hao Xu
  2023-07-11 11:40 ` [PATCH 1/3] fs: split off vfs_getdents function of getdents64 syscall Hao Xu
@ 2023-07-11 11:40 ` Hao Xu
  2023-07-12 11:31   ` Christian Brauner
  2023-07-11 11:40 ` [PATCH 3/3] io_uring: add support for getdents Hao Xu
  2023-07-11 23:47 ` [PATCH v3 0/3] io_uring getdents Dave Chinner
  3 siblings, 1 reply; 35+ messages in thread
From: Hao Xu @ 2023-07-11 11:40 UTC (permalink / raw)
  To: io-uring, Jens Axboe
  Cc: Dominique Martinet, Pavel Begunkov, Christian Brauner,
	Alexander Viro, Stefan Roesch, Clay Harris, Dave Chinner,
	linux-fsdevel, Wanpeng Li

From: Hao Xu <[email protected]>

The flags will allow passing DIR_CONTEXT_F_NOWAIT to iterate()
implementations that support it (as signaled through FMODE_NWAIT
in file->f_mode)

Notes:
- considered using IOCB_NOWAIT but if we add more flags later it
would be confusing to keep track of which values are valid, use
dedicated flags
- might want to check ctx.flags & DIR_CONTEXT_F_NOWAIT is only set
when file->f_mode & FMODE_NOWAIT in iterate_dir() as e.g. WARN_ONCE?

Co-developed-by: Dominique Martinet <[email protected]>
Signed-off-by: Dominique Martinet <[email protected]>
Signed-off-by: Hao Xu <[email protected]>
---
 fs/internal.h      | 2 +-
 fs/readdir.c       | 6 ++++--
 include/linux/fs.h | 8 ++++++++
 3 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/fs/internal.h b/fs/internal.h
index b1f66e52d61b..7508d485c655 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -311,4 +311,4 @@ void mnt_idmap_put(struct mnt_idmap *idmap);
 struct linux_dirent64;
 
 int vfs_getdents(struct file *file, struct linux_dirent64 __user *dirent,
-		 unsigned int count);
+		 unsigned int count, unsigned long flags);
diff --git a/fs/readdir.c b/fs/readdir.c
index 9592259b7e7f..b80caf4c9321 100644
--- a/fs/readdir.c
+++ b/fs/readdir.c
@@ -358,12 +358,14 @@ static bool filldir64(struct dir_context *ctx, const char *name, int namlen,
  * @file    : pointer to file struct of directory
  * @dirent  : pointer to user directory structure
  * @count   : size of buffer
+ * @flags   : additional dir_context flags
  */
 int vfs_getdents(struct file *file, struct linux_dirent64 __user *dirent,
-		 unsigned int count)
+		 unsigned int count, unsigned long flags)
 {
 	struct getdents_callback64 buf = {
 		.ctx.actor = filldir64,
+		.ctx.flags = flags,
 		.count = count,
 		.current_dir = dirent
 	};
@@ -395,7 +397,7 @@ SYSCALL_DEFINE3(getdents64, unsigned int, fd,
 	if (!f.file)
 		return -EBADF;
 
-	error = vfs_getdents(f.file, dirent, count);
+	error = vfs_getdents(f.file, dirent, count, 0);
 
 	fdput_pos(f);
 	return error;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6867512907d6..f3e315e8efdd 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1719,8 +1719,16 @@ typedef bool (*filldir_t)(struct dir_context *, const char *, int, loff_t, u64,
 struct dir_context {
 	filldir_t actor;
 	loff_t pos;
+	unsigned long flags;
 };
 
+/*
+ * flags for dir_context flags
+ * DIR_CONTEXT_F_NOWAIT: Request non-blocking iterate
+ *                       (requires file->f_mode & FMODE_NOWAIT)
+ */
+#define DIR_CONTEXT_F_NOWAIT	(1 << 0)
+
 /*
  * These flags let !MMU mmap() govern direct device mapping vs immediate
  * copying more easily for MAP_PRIVATE, especially for ROM filesystems.
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 3/3] io_uring: add support for getdents
  2023-07-11 11:40 [PATCH v3 0/3] io_uring getdents Hao Xu
  2023-07-11 11:40 ` [PATCH 1/3] fs: split off vfs_getdents function of getdents64 syscall Hao Xu
  2023-07-11 11:40 ` [PATCH 2/3] vfs_getdents/struct dir_context: add flags field Hao Xu
@ 2023-07-11 11:40 ` Hao Xu
  2023-07-11 12:15   ` Dominique Martinet
  2023-07-12 15:27   ` Christian Brauner
  2023-07-11 23:47 ` [PATCH v3 0/3] io_uring getdents Dave Chinner
  3 siblings, 2 replies; 35+ messages in thread
From: Hao Xu @ 2023-07-11 11:40 UTC (permalink / raw)
  To: io-uring, Jens Axboe
  Cc: Dominique Martinet, Pavel Begunkov, Christian Brauner,
	Alexander Viro, Stefan Roesch, Clay Harris, Dave Chinner,
	linux-fsdevel, Wanpeng Li

From: Hao Xu <[email protected]>

This add support for getdents64 to io_uring, acting exactly like the
syscall: the directory is iterated from it's current's position as
stored in the file struct, and the file's position is updated exactly as
if getdents64 had been called.

For filesystems that support NOWAIT in iterate_shared(), try to use it
first; if a user already knows the filesystem they use do not support
nowait they can force async through IOSQE_ASYNC in the sqe flags,
avoiding the need to bounce back through a useless EAGAIN return.

Co-developed-by: Dominique Martinet <[email protected]>
Signed-off-by: Dominique Martinet <[email protected]>
Signed-off-by: Hao Xu <[email protected]>
---
 include/uapi/linux/io_uring.h |  7 ++++
 io_uring/fs.c                 | 60 +++++++++++++++++++++++++++++++++++
 io_uring/fs.h                 |  3 ++
 io_uring/opdef.c              |  8 +++++
 4 files changed, 78 insertions(+)

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 08720c7bd92f..6c0d521135a6 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -65,6 +65,7 @@ struct io_uring_sqe {
 		__u32		xattr_flags;
 		__u32		msg_ring_flags;
 		__u32		uring_cmd_flags;
+		__u32		getdents_flags;
 	};
 	__u64	user_data;	/* data to be passed back at completion time */
 	/* pack this to avoid bogus arm OABI complaints */
@@ -235,6 +236,7 @@ enum io_uring_op {
 	IORING_OP_URING_CMD,
 	IORING_OP_SEND_ZC,
 	IORING_OP_SENDMSG_ZC,
+	IORING_OP_GETDENTS,
 
 	/* this goes last, obviously */
 	IORING_OP_LAST,
@@ -273,6 +275,11 @@ enum io_uring_op {
  */
 #define SPLICE_F_FD_IN_FIXED	(1U << 31) /* the last bit of __u32 */
 
+/*
+ * sqe->getdents_flags
+ */
+#define IORING_GETDENTS_REWIND	(1U << 0)
+
 /*
  * POLL_ADD flags. Note that since sqe->poll_events is the flag space, the
  * command flags for POLL_ADD are stored in sqe->len.
diff --git a/io_uring/fs.c b/io_uring/fs.c
index f6a69a549fd4..77f00577e09c 100644
--- a/io_uring/fs.c
+++ b/io_uring/fs.c
@@ -47,6 +47,13 @@ struct io_link {
 	int				flags;
 };
 
+struct io_getdents {
+	struct file			*file;
+	struct linux_dirent64 __user	*dirent;
+	unsigned int			count;
+	int				flags;
+};
+
 int io_renameat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
 	struct io_rename *ren = io_kiocb_to_cmd(req, struct io_rename);
@@ -291,3 +298,56 @@ void io_link_cleanup(struct io_kiocb *req)
 	putname(sl->oldpath);
 	putname(sl->newpath);
 }
+
+int io_getdents_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+{
+	struct io_getdents *gd = io_kiocb_to_cmd(req, struct io_getdents);
+
+	if (READ_ONCE(sqe->off) != 0)
+		return -EINVAL;
+
+	gd->dirent = u64_to_user_ptr(READ_ONCE(sqe->addr));
+	gd->count = READ_ONCE(sqe->len);
+
+	return 0;
+}
+
+int io_getdents(struct io_kiocb *req, unsigned int issue_flags)
+{
+	struct io_getdents *gd = io_kiocb_to_cmd(req, struct io_getdents);
+	struct file *file;
+	unsigned long getdents_flags = 0;
+	bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
+	bool should_lock = false;
+	int ret;
+
+	if (force_nonblock) {
+		if (!(req->file->f_mode & FMODE_NOWAIT))
+			return -EAGAIN;
+
+		getdents_flags = DIR_CONTEXT_F_NOWAIT;
+	}
+
+	file = req->file;
+	if (file && (file->f_mode & FMODE_ATOMIC_POS)) {
+		if (file_count(file) > 1)
+			should_lock = true;
+	}
+	if (should_lock) {
+		if (!force_nonblock)
+			mutex_lock(&file->f_pos_lock);
+		else if (!mutex_trylock(&file->f_pos_lock))
+			return -EAGAIN;
+	}
+
+	ret = vfs_getdents(file, gd->dirent, gd->count, getdents_flags);
+	if (should_lock)
+		mutex_unlock(&file->f_pos_lock);
+
+	if (ret == -EAGAIN && force_nonblock)
+		return -EAGAIN;
+
+	io_req_set_res(req, ret, 0);
+	return 0;
+}
+
diff --git a/io_uring/fs.h b/io_uring/fs.h
index 0bb5efe3d6bb..f83a6f3a678d 100644
--- a/io_uring/fs.h
+++ b/io_uring/fs.h
@@ -18,3 +18,6 @@ int io_symlinkat(struct io_kiocb *req, unsigned int issue_flags);
 int io_linkat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
 int io_linkat(struct io_kiocb *req, unsigned int issue_flags);
 void io_link_cleanup(struct io_kiocb *req);
+
+int io_getdents_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
+int io_getdents(struct io_kiocb *req, unsigned int issue_flags);
diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index 3b9c6489b8b6..1bae6b2a8d0b 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -428,6 +428,11 @@ const struct io_issue_def io_issue_defs[] = {
 		.prep			= io_eopnotsupp_prep,
 #endif
 	},
+	[IORING_OP_GETDENTS] = {
+		.needs_file		= 1,
+		.prep			= io_getdents_prep,
+		.issue			= io_getdents,
+	},
 };
 
 
@@ -648,6 +653,9 @@ const struct io_cold_def io_cold_defs[] = {
 		.fail			= io_sendrecv_fail,
 #endif
 	},
+	[IORING_OP_GETDENTS] = {
+		.name			= "GETDENTS",
+	},
 };
 
 const char *io_uring_get_opcode(u8 opcode)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH 3/3] io_uring: add support for getdents
  2023-07-11 11:40 ` [PATCH 3/3] io_uring: add support for getdents Hao Xu
@ 2023-07-11 12:15   ` Dominique Martinet
  2023-07-12  7:53     ` Hao Xu
  2023-07-12  8:01     ` Hao Xu
  2023-07-12 15:27   ` Christian Brauner
  1 sibling, 2 replies; 35+ messages in thread
From: Dominique Martinet @ 2023-07-11 12:15 UTC (permalink / raw)
  To: Hao Xu
  Cc: io-uring, Jens Axboe, Pavel Begunkov, Christian Brauner,
	Alexander Viro, Stefan Roesch, Clay Harris, Dave Chinner,
	linux-fsdevel, Wanpeng Li

Hao Xu wrote on Tue, Jul 11, 2023 at 07:40:27PM +0800:
> diff --git a/io_uring/fs.c b/io_uring/fs.c
> index f6a69a549fd4..77f00577e09c 100644
> --- a/io_uring/fs.c
> +++ b/io_uring/fs.c
> @@ -291,3 +298,56 @@ void io_link_cleanup(struct io_kiocb *req)
>  	putname(sl->oldpath);
>  	putname(sl->newpath);
>  }
> +
> +int io_getdents_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
> +{
> +	struct io_getdents *gd = io_kiocb_to_cmd(req, struct io_getdents);
> +
> +	if (READ_ONCE(sqe->off) != 0)
> +		return -EINVAL;
> +
> +	gd->dirent = u64_to_user_ptr(READ_ONCE(sqe->addr));
> +	gd->count = READ_ONCE(sqe->len);
> +
> +	return 0;
> +}
> +
> +int io_getdents(struct io_kiocb *req, unsigned int issue_flags)
> +{
> +	struct io_getdents *gd = io_kiocb_to_cmd(req, struct io_getdents);
> +	struct file *file;
> +	unsigned long getdents_flags = 0;
> +	bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
> +	bool should_lock = false;
> +	int ret;
> +
> +	if (force_nonblock) {
> +		if (!(req->file->f_mode & FMODE_NOWAIT))
> +			return -EAGAIN;
> +
> +		getdents_flags = DIR_CONTEXT_F_NOWAIT;
> +	}
> +
> +	file = req->file;
> +	if (file && (file->f_mode & FMODE_ATOMIC_POS)) {

If file is NULL here things will just blow up in vfs_getdents anyway,
let's remove the useless check

> +		if (file_count(file) > 1)

I was curious about this so I found it's basically what __fdget_pos does
before deciding it should take the f_pos_lock, and as such this is
probably correct... But if someone can chime in here: what guarantees
someone else won't __fdget_pos (or equivalent through this) the file
again between this and the vfs_getdents call?
That second get would make file_count > 1 and it would lock, but lock
hadn't been taken here so the other call could get the lock without
waiting and both would process getdents or seek or whatever in
parallel.


That aside I don't see any obvious problem with this.

-- 
Dominique Martinet | Asmadeus

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/3] fs: split off vfs_getdents function of getdents64 syscall
  2023-07-11 11:40 ` [PATCH 1/3] fs: split off vfs_getdents function of getdents64 syscall Hao Xu
@ 2023-07-11 13:02   ` Ammar Faizi
  2023-07-12  8:03     ` Hao Xu
  2023-07-11 23:41   ` Dave Chinner
  1 sibling, 1 reply; 35+ messages in thread
From: Ammar Faizi @ 2023-07-11 13:02 UTC (permalink / raw)
  To: Hao Xu
  Cc: io-uring Mailing List, Jens Axboe, Dominique Martinet,
	Pavel Begunkov, Christian Brauner, Alexander Viro, Stefan Roesch,
	Clay Harris, Dave Chinner, Linux Fsdevel Mailing List,
	Wanpeng Li

On Tue, Jul 11, 2023 at 07:40:25PM +0800, Hao Xu wrote:
> This splits off the vfs_getdents function from the getdents64 system
> call.
> This will allow io_uring to call the vfs_getdents function.
> 
> Co-developed-by: Stefan Roesch <[email protected]>
> Signed-off-by: Stefan Roesch <[email protected]>
> Signed-off-by: Dominique Martinet <[email protected]>
> ---

Since you took this, it needs your Signed-off-by.

-- 
Ammar Faizi


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/3] fs: split off vfs_getdents function of getdents64 syscall
  2023-07-11 11:40 ` [PATCH 1/3] fs: split off vfs_getdents function of getdents64 syscall Hao Xu
  2023-07-11 13:02   ` Ammar Faizi
@ 2023-07-11 23:41   ` Dave Chinner
  2023-07-11 23:50     ` Jens Axboe
  1 sibling, 1 reply; 35+ messages in thread
From: Dave Chinner @ 2023-07-11 23:41 UTC (permalink / raw)
  To: Hao Xu
  Cc: io-uring, Jens Axboe, Dominique Martinet, Pavel Begunkov,
	Christian Brauner, Alexander Viro, Stefan Roesch, Clay Harris,
	linux-fsdevel, Wanpeng Li

On Tue, Jul 11, 2023 at 07:40:25PM +0800, Hao Xu wrote:
> From: Dominique Martinet <[email protected]>
> 
> This splits off the vfs_getdents function from the getdents64 system
> call.
> This will allow io_uring to call the vfs_getdents function.
> 
> Co-developed-by: Stefan Roesch <[email protected]>
> Signed-off-by: Stefan Roesch <[email protected]>
> Signed-off-by: Dominique Martinet <[email protected]>
> ---
>  fs/internal.h |  8 ++++++++
>  fs/readdir.c  | 34 ++++++++++++++++++++++++++--------
>  2 files changed, 34 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/internal.h b/fs/internal.h
> index f7a3dc111026..b1f66e52d61b 100644
> --- a/fs/internal.h
> +++ b/fs/internal.h
> @@ -304,3 +304,11 @@ ssize_t __kernel_write_iter(struct file *file, struct iov_iter *from, loff_t *po
>  struct mnt_idmap *alloc_mnt_idmap(struct user_namespace *mnt_userns);
>  struct mnt_idmap *mnt_idmap_get(struct mnt_idmap *idmap);
>  void mnt_idmap_put(struct mnt_idmap *idmap);
> +
> +/*
> + * fs/readdir.c
> + */
> +struct linux_dirent64;
> +
> +int vfs_getdents(struct file *file, struct linux_dirent64 __user *dirent,
> +		 unsigned int count);

Uh...

Since when have we allowed code outside fs/ to use fs/internal.h?

-Dave.
-- 
Dave Chinner
[email protected]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 0/3] io_uring getdents
  2023-07-11 11:40 [PATCH v3 0/3] io_uring getdents Hao Xu
                   ` (2 preceding siblings ...)
  2023-07-11 11:40 ` [PATCH 3/3] io_uring: add support for getdents Hao Xu
@ 2023-07-11 23:47 ` Dave Chinner
  2023-07-11 23:51   ` Jens Axboe
  2023-07-12  3:19   ` Hao Xu
  3 siblings, 2 replies; 35+ messages in thread
From: Dave Chinner @ 2023-07-11 23:47 UTC (permalink / raw)
  To: Hao Xu
  Cc: io-uring, Jens Axboe, Dominique Martinet, Pavel Begunkov,
	Christian Brauner, Alexander Viro, Stefan Roesch, Clay Harris,
	linux-fsdevel, Wanpeng Li

On Tue, Jul 11, 2023 at 07:40:24PM +0800, Hao Xu wrote:
> From: Hao Xu <[email protected]>
> 
> This series introduce getdents64 to io_uring, the code logic is similar
> with the snychronized version's. It first try nowait issue, and offload
> it to io-wq threads if the first try fails.
> 
> 
> v2->v3:
>  - removed the kernfs patches
>  - add f_pos_lock logic
>  - remove the "reduce last EOF getdents try" optimization since
>    Dominique reports that doesn't make difference
>  - remove the rewind logic, I think the right way is to introduce lseek
>    to io_uring not to patch this logic to getdents.
>  - add Singed-off-by of Stefan Roesch for patch 1 since checkpatch
>    complained that Co-developed-by someone should be accompanied with
>    Signed-off-by same person, I can remove them if Stefan thinks that's
>    not proper.
> 
> 
> Dominique Martinet (1):
>   fs: split off vfs_getdents function of getdents64 syscall
> 
> Hao Xu (2):
>   vfs_getdents/struct dir_context: add flags field
>   io_uring: add support for getdents

So what filesystem actually uses this new NOWAIT functionality?
Unless I'm blind (quite possibly) I don't see any filesystem
implementation of this functionality in the patch series.

I know I posted a prototype for XFS to use it, and I expected that
it would become part of this patch series to avoid the "we don't add
unused code to the kernel" problem. i.e. the authors would take the
XFS prototype, make it work, add support into for the new io_uring
operation to fsstress in fstests and then use that to stress test
the new infrastructure before it gets merged....

But I don't see any of this?

-Dave.
-- 
Dave Chinner
[email protected]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/3] fs: split off vfs_getdents function of getdents64 syscall
  2023-07-11 23:41   ` Dave Chinner
@ 2023-07-11 23:50     ` Jens Axboe
  2023-07-12 11:14       ` Christian Brauner
  0 siblings, 1 reply; 35+ messages in thread
From: Jens Axboe @ 2023-07-11 23:50 UTC (permalink / raw)
  To: Dave Chinner, Hao Xu
  Cc: io-uring, Dominique Martinet, Pavel Begunkov, Christian Brauner,
	Alexander Viro, Stefan Roesch, Clay Harris, linux-fsdevel,
	Wanpeng Li

On 7/11/23 5:41?PM, Dave Chinner wrote:
> On Tue, Jul 11, 2023 at 07:40:25PM +0800, Hao Xu wrote:
>> From: Dominique Martinet <[email protected]>
>>
>> This splits off the vfs_getdents function from the getdents64 system
>> call.
>> This will allow io_uring to call the vfs_getdents function.
>>
>> Co-developed-by: Stefan Roesch <[email protected]>
>> Signed-off-by: Stefan Roesch <[email protected]>
>> Signed-off-by: Dominique Martinet <[email protected]>
>> ---
>>  fs/internal.h |  8 ++++++++
>>  fs/readdir.c  | 34 ++++++++++++++++++++++++++--------
>>  2 files changed, 34 insertions(+), 8 deletions(-)
>>
>> diff --git a/fs/internal.h b/fs/internal.h
>> index f7a3dc111026..b1f66e52d61b 100644
>> --- a/fs/internal.h
>> +++ b/fs/internal.h
>> @@ -304,3 +304,11 @@ ssize_t __kernel_write_iter(struct file *file, struct iov_iter *from, loff_t *po
>>  struct mnt_idmap *alloc_mnt_idmap(struct user_namespace *mnt_userns);
>>  struct mnt_idmap *mnt_idmap_get(struct mnt_idmap *idmap);
>>  void mnt_idmap_put(struct mnt_idmap *idmap);
>> +
>> +/*
>> + * fs/readdir.c
>> + */
>> +struct linux_dirent64;
>> +
>> +int vfs_getdents(struct file *file, struct linux_dirent64 __user *dirent,
>> +		 unsigned int count);
> 
> Uh...
> 
> Since when have we allowed code outside fs/ to use fs/internal.h?

io_uring does use for things like open/close, statx, and xattr already.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 0/3] io_uring getdents
  2023-07-11 23:47 ` [PATCH v3 0/3] io_uring getdents Dave Chinner
@ 2023-07-11 23:51   ` Jens Axboe
  2023-07-12  0:53     ` Dominique Martinet
  2023-07-12  3:19   ` Hao Xu
  1 sibling, 1 reply; 35+ messages in thread
From: Jens Axboe @ 2023-07-11 23:51 UTC (permalink / raw)
  To: Dave Chinner, Hao Xu
  Cc: io-uring, Dominique Martinet, Pavel Begunkov, Christian Brauner,
	Alexander Viro, Stefan Roesch, Clay Harris, linux-fsdevel,
	Wanpeng Li

On 7/11/23 5:47?PM, Dave Chinner wrote:
> On Tue, Jul 11, 2023 at 07:40:24PM +0800, Hao Xu wrote:
>> From: Hao Xu <[email protected]>
>>
>> This series introduce getdents64 to io_uring, the code logic is similar
>> with the snychronized version's. It first try nowait issue, and offload
>> it to io-wq threads if the first try fails.
>>
>>
>> v2->v3:
>>  - removed the kernfs patches
>>  - add f_pos_lock logic
>>  - remove the "reduce last EOF getdents try" optimization since
>>    Dominique reports that doesn't make difference
>>  - remove the rewind logic, I think the right way is to introduce lseek
>>    to io_uring not to patch this logic to getdents.
>>  - add Singed-off-by of Stefan Roesch for patch 1 since checkpatch
>>    complained that Co-developed-by someone should be accompanied with
>>    Signed-off-by same person, I can remove them if Stefan thinks that's
>>    not proper.
>>
>>
>> Dominique Martinet (1):
>>   fs: split off vfs_getdents function of getdents64 syscall
>>
>> Hao Xu (2):
>>   vfs_getdents/struct dir_context: add flags field
>>   io_uring: add support for getdents
> 
> So what filesystem actually uses this new NOWAIT functionality?
> Unless I'm blind (quite possibly) I don't see any filesystem
> implementation of this functionality in the patch series.
> 
> I know I posted a prototype for XFS to use it, and I expected that
> it would become part of this patch series to avoid the "we don't add
> unused code to the kernel" problem. i.e. the authors would take the
> XFS prototype, make it work, add support into for the new io_uring
> operation to fsstress in fstests and then use that to stress test
> the new infrastructure before it gets merged....
> 
> But I don't see any of this?

That would indeed be great if we could get NOWAIT, that might finally
convince me that it's worth plumbing up! Do you have a link to that
prototype? That seems like what should be the base for this, and be an
inspiration for other file systems to get efficient getdents via this
(rather than io-wq punt, which I'm not a huge fan of...).

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 0/3] io_uring getdents
  2023-07-11 23:51   ` Jens Axboe
@ 2023-07-12  0:53     ` Dominique Martinet
  2023-07-12  0:56       ` Jens Axboe
  2023-07-12  3:12       ` Hao Xu
  0 siblings, 2 replies; 35+ messages in thread
From: Dominique Martinet @ 2023-07-12  0:53 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Dave Chinner, Hao Xu, io-uring, Pavel Begunkov,
	Christian Brauner, Alexander Viro, Stefan Roesch, Clay Harris,
	linux-fsdevel, Wanpeng Li

Jens Axboe wrote on Tue, Jul 11, 2023 at 05:51:46PM -0600:
> > So what filesystem actually uses this new NOWAIT functionality?
> > Unless I'm blind (quite possibly) I don't see any filesystem
> > implementation of this functionality in the patch series.

I had implemented this for kernfs and libfs (so sysfs, debugfs, possibly
tmpfs/proc?) in v2

The patch as of v2's mail has a bug, but my branch has it fixed as of
https://github.com/martinetd/linux/commits/io_uring_getdents

(I guess these aren't "real" enough though)

> > I know I posted a prototype for XFS to use it, and I expected that
> > it would become part of this patch series to avoid the "we don't add
> > unused code to the kernel" problem. i.e. the authors would take the
> > XFS prototype, make it work, add support into for the new io_uring
> > operation to fsstress in fstests and then use that to stress test
> > the new infrastructure before it gets merged....
> > 
> > But I don't see any of this?
> 
> That would indeed be great if we could get NOWAIT, that might finally
> convince me that it's worth plumbing up! Do you have a link to that
> prototype? That seems like what should be the base for this, and be an
> inspiration for other file systems to get efficient getdents via this
> (rather than io-wq punt, which I'm not a huge fan of...).

the xfs poc was in this mail:
https://lore.kernel.org/all/[email protected]/

I never spent time debugging it, but it should definitely be workable

-- 
Dominique Martinet | Asmadeus

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 0/3] io_uring getdents
  2023-07-12  0:53     ` Dominique Martinet
@ 2023-07-12  0:56       ` Jens Axboe
  2023-07-12  3:16         ` Hao Xu
  2023-07-12  3:12       ` Hao Xu
  1 sibling, 1 reply; 35+ messages in thread
From: Jens Axboe @ 2023-07-12  0:56 UTC (permalink / raw)
  To: Dominique Martinet
  Cc: Dave Chinner, Hao Xu, io-uring, Pavel Begunkov,
	Christian Brauner, Alexander Viro, Stefan Roesch, Clay Harris,
	linux-fsdevel, Wanpeng Li

On 7/11/23 6:53 PM, Dominique Martinet wrote:
> Jens Axboe wrote on Tue, Jul 11, 2023 at 05:51:46PM -0600:
>>> So what filesystem actually uses this new NOWAIT functionality?
>>> Unless I'm blind (quite possibly) I don't see any filesystem
>>> implementation of this functionality in the patch series.
> 
> I had implemented this for kernfs and libfs (so sysfs, debugfs, possibly
> tmpfs/proc?) in v2
> 
> The patch as of v2's mail has a bug, but my branch has it fixed as of
> https://github.com/martinetd/linux/commits/io_uring_getdents
> 
> (I guess these aren't "real" enough though)

No, I definitely think those are real and valid. But would be nice with
a "real" file system as well.

>>> I know I posted a prototype for XFS to use it, and I expected that
>>> it would become part of this patch series to avoid the "we don't add
>>> unused code to the kernel" problem. i.e. the authors would take the
>>> XFS prototype, make it work, add support into for the new io_uring
>>> operation to fsstress in fstests and then use that to stress test
>>> the new infrastructure before it gets merged....
>>>
>>> But I don't see any of this?
>>
>> That would indeed be great if we could get NOWAIT, that might finally
>> convince me that it's worth plumbing up! Do you have a link to that
>> prototype? That seems like what should be the base for this, and be an
>> inspiration for other file systems to get efficient getdents via this
>> (rather than io-wq punt, which I'm not a huge fan of...).
> 
> the xfs poc was in this mail:
> https://lore.kernel.org/all/[email protected]/
> 
> I never spent time debugging it, but it should definitely be workable

If either you or Hao wants to take a stab at it and see how it goes,
I think that would be hugely beneficial for this patchset.

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 0/3] io_uring getdents
  2023-07-12  0:53     ` Dominique Martinet
  2023-07-12  0:56       ` Jens Axboe
@ 2023-07-12  3:12       ` Hao Xu
  1 sibling, 0 replies; 35+ messages in thread
From: Hao Xu @ 2023-07-12  3:12 UTC (permalink / raw)
  To: Dominique Martinet, Jens Axboe
  Cc: Dave Chinner, io-uring, Pavel Begunkov, Christian Brauner,
	Alexander Viro, Stefan Roesch, Clay Harris, linux-fsdevel,
	Wanpeng Li

On 7/12/23 08:53, Dominique Martinet wrote:
> Jens Axboe wrote on Tue, Jul 11, 2023 at 05:51:46PM -0600:
>>> So what filesystem actually uses this new NOWAIT functionality?
>>> Unless I'm blind (quite possibly) I don't see any filesystem
>>> implementation of this functionality in the patch series.
> 
> I had implemented this for kernfs and libfs (so sysfs, debugfs, possibly
> tmpfs/proc?) in v2
> 
> The patch as of v2's mail has a bug, but my branch has it fixed as of
> https://github.com/martinetd/linux/commits/io_uring_getdents

I see, I'll try this, those in v2 causes issues when I boot my VM with them.

> 
> (I guess these aren't "real" enough though)
> 
>>> I know I posted a prototype for XFS to use it, and I expected that
>>> it would become part of this patch series to avoid the "we don't add
>>> unused code to the kernel" problem. i.e. the authors would take the
>>> XFS prototype, make it work, add support into for the new io_uring
>>> operation to fsstress in fstests and then use that to stress test
>>> the new infrastructure before it gets merged....
>>>
>>> But I don't see any of this?
>>
>> That would indeed be great if we could get NOWAIT, that might finally
>> convince me that it's worth plumbing up! Do you have a link to that
>> prototype? That seems like what should be the base for this, and be an
>> inspiration for other file systems to get efficient getdents via this
>> (rather than io-wq punt, which I'm not a huge fan of...).
> 
> the xfs poc was in this mail:
> https://lore.kernel.org/all/[email protected]/
> 
> I never spent time debugging it, but it should definitely be workable
> 


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 0/3] io_uring getdents
  2023-07-12  0:56       ` Jens Axboe
@ 2023-07-12  3:16         ` Hao Xu
  0 siblings, 0 replies; 35+ messages in thread
From: Hao Xu @ 2023-07-12  3:16 UTC (permalink / raw)
  To: Jens Axboe, Dominique Martinet
  Cc: Dave Chinner, io-uring, Pavel Begunkov, Christian Brauner,
	Alexander Viro, Stefan Roesch, Clay Harris, linux-fsdevel,
	Wanpeng Li

Hi,

On 7/12/23 08:56, Jens Axboe wrote:
> On 7/11/23 6:53 PM, Dominique Martinet wrote:
>> Jens Axboe wrote on Tue, Jul 11, 2023 at 05:51:46PM -0600:
>>>> So what filesystem actually uses this new NOWAIT functionality?
>>>> Unless I'm blind (quite possibly) I don't see any filesystem
>>>> implementation of this functionality in the patch series.
>>
>> I had implemented this for kernfs and libfs (so sysfs, debugfs, possibly
>> tmpfs/proc?) in v2
>>
>> The patch as of v2's mail has a bug, but my branch has it fixed as of
>> https://github.com/martinetd/linux/commits/io_uring_getdents
>>
>> (I guess these aren't "real" enough though)
> 
> No, I definitely think those are real and valid. But would be nice with
> a "real" file system as well.
> 
>>>> I know I posted a prototype for XFS to use it, and I expected that
>>>> it would become part of this patch series to avoid the "we don't add
>>>> unused code to the kernel" problem. i.e. the authors would take the
>>>> XFS prototype, make it work, add support into for the new io_uring
>>>> operation to fsstress in fstests and then use that to stress test
>>>> the new infrastructure before it gets merged....
>>>>
>>>> But I don't see any of this?
>>>
>>> That would indeed be great if we could get NOWAIT, that might finally
>>> convince me that it's worth plumbing up! Do you have a link to that
>>> prototype? That seems like what should be the base for this, and be an
>>> inspiration for other file systems to get efficient getdents via this
>>> (rather than io-wq punt, which I'm not a huge fan of...).
>>
>> the xfs poc was in this mail:
>> https://lore.kernel.org/all/[email protected]/
>>
>> I never spent time debugging it, but it should definitely be workable
> 
> If either you or Hao wants to take a stab at it and see how it goes,
> I think that would be hugely beneficial for this patchset.
> 


I can take the xfs and kernfs part if Dominique doesn't mind.

Regards,
Hao


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 0/3] io_uring getdents
  2023-07-11 23:47 ` [PATCH v3 0/3] io_uring getdents Dave Chinner
  2023-07-11 23:51   ` Jens Axboe
@ 2023-07-12  3:19   ` Hao Xu
  1 sibling, 0 replies; 35+ messages in thread
From: Hao Xu @ 2023-07-12  3:19 UTC (permalink / raw)
  To: Dave Chinner
  Cc: io-uring, Jens Axboe, Dominique Martinet, Pavel Begunkov,
	Christian Brauner, Alexander Viro, Stefan Roesch, Clay Harris,
	linux-fsdevel, Wanpeng Li

On 7/12/23 07:47, Dave Chinner wrote:
> On Tue, Jul 11, 2023 at 07:40:24PM +0800, Hao Xu wrote:
>> From: Hao Xu <[email protected]>
>>
>> This series introduce getdents64 to io_uring, the code logic is similar
>> with the snychronized version's. It first try nowait issue, and offload
>> it to io-wq threads if the first try fails.
>>
>>
>> v2->v3:
>>   - removed the kernfs patches
>>   - add f_pos_lock logic
>>   - remove the "reduce last EOF getdents try" optimization since
>>     Dominique reports that doesn't make difference
>>   - remove the rewind logic, I think the right way is to introduce lseek
>>     to io_uring not to patch this logic to getdents.
>>   - add Singed-off-by of Stefan Roesch for patch 1 since checkpatch
>>     complained that Co-developed-by someone should be accompanied with
>>     Signed-off-by same person, I can remove them if Stefan thinks that's
>>     not proper.
>>
>>
>> Dominique Martinet (1):
>>    fs: split off vfs_getdents function of getdents64 syscall
>>
>> Hao Xu (2):
>>    vfs_getdents/struct dir_context: add flags field
>>    io_uring: add support for getdents
> 
> So what filesystem actually uses this new NOWAIT functionality?
> Unless I'm blind (quite possibly) I don't see any filesystem
> implementation of this functionality in the patch series.
> 
> I know I posted a prototype for XFS to use it, and I expected that
> it would become part of this patch series to avoid the "we don't add
> unused code to the kernel" problem. i.e. the authors would take the
> XFS prototype, make it work, add support into for the new io_uring
> operation to fsstress in fstests and then use that to stress test
> the new infrastructure before it gets merged....
> 
> But I don't see any of this?
> 
> -Dave.

Hi Dave,
You are right, currently no real filesystem supports that from my 
investigation, I saw the xfs prototype, I'd like to make it work first.
That may cause some time.

Thanks,
Hao

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 3/3] io_uring: add support for getdents
  2023-07-11 12:15   ` Dominique Martinet
@ 2023-07-12  7:53     ` Hao Xu
  2023-07-12 16:10       ` Dominique Martinet
  2023-07-12  8:01     ` Hao Xu
  1 sibling, 1 reply; 35+ messages in thread
From: Hao Xu @ 2023-07-12  7:53 UTC (permalink / raw)
  To: Dominique Martinet
  Cc: io-uring, Jens Axboe, Pavel Begunkov, Christian Brauner,
	Alexander Viro, Stefan Roesch, Clay Harris, Dave Chinner,
	linux-fsdevel, Wanpeng Li

On 7/11/23 20:15, Dominique Martinet wrote:
> Hao Xu wrote on Tue, Jul 11, 2023 at 07:40:27PM +0800:
>> diff --git a/io_uring/fs.c b/io_uring/fs.c
>> index f6a69a549fd4..77f00577e09c 100644
>> --- a/io_uring/fs.c
>> +++ b/io_uring/fs.c
>> @@ -291,3 +298,56 @@ void io_link_cleanup(struct io_kiocb *req)
>>   	putname(sl->oldpath);
>>   	putname(sl->newpath);
>>   }
>> +
>> +int io_getdents_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>> +{
>> +	struct io_getdents *gd = io_kiocb_to_cmd(req, struct io_getdents);
>> +
>> +	if (READ_ONCE(sqe->off) != 0)
>> +		return -EINVAL;
>> +
>> +	gd->dirent = u64_to_user_ptr(READ_ONCE(sqe->addr));
>> +	gd->count = READ_ONCE(sqe->len);
>> +
>> +	return 0;
>> +}
>> +
>> +int io_getdents(struct io_kiocb *req, unsigned int issue_flags)
>> +{
>> +	struct io_getdents *gd = io_kiocb_to_cmd(req, struct io_getdents);
>> +	struct file *file;
>> +	unsigned long getdents_flags = 0;
>> +	bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
>> +	bool should_lock = false;
>> +	int ret;
>> +
>> +	if (force_nonblock) {
>> +		if (!(req->file->f_mode & FMODE_NOWAIT))
>> +			return -EAGAIN;
>> +
>> +		getdents_flags = DIR_CONTEXT_F_NOWAIT;
>> +	}
>> +
>> +	file = req->file;
>> +	if (file && (file->f_mode & FMODE_ATOMIC_POS)) {
> 
> If file is NULL here things will just blow up in vfs_getdents anyway,
> let's remove the useless check
> 
>> +		if (file_count(file) > 1)
> 
> I was curious about this so I found it's basically what __fdget_pos does
> before deciding it should take the f_pos_lock, and as such this is
> probably correct... But if someone can chime in here: what guarantees
> someone else won't __fdget_pos (or equivalent through this) the file
> again between this and the vfs_getdents call?
> That second get would make file_count > 1 and it would lock, but lock
> hadn't been taken here so the other call could get the lock without
> waiting and both would process getdents or seek or whatever in
> parallel.
> 

Hi Dominique,

This file_count(file) is atomic_read, so I believe no race condition here.

> 
> That aside I don't see any obvious problem with this.
> 


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 3/3] io_uring: add support for getdents
  2023-07-11 12:15   ` Dominique Martinet
  2023-07-12  7:53     ` Hao Xu
@ 2023-07-12  8:01     ` Hao Xu
  1 sibling, 0 replies; 35+ messages in thread
From: Hao Xu @ 2023-07-12  8:01 UTC (permalink / raw)
  To: Dominique Martinet
  Cc: io-uring, Jens Axboe, Pavel Begunkov, Christian Brauner,
	Alexander Viro, Stefan Roesch, Clay Harris, Dave Chinner,
	linux-fsdevel, Wanpeng Li

On 7/11/23 20:15, Dominique Martinet wrote:
> Hao Xu wrote on Tue, Jul 11, 2023 at 07:40:27PM +0800:
>> diff --git a/io_uring/fs.c b/io_uring/fs.c
>> index f6a69a549fd4..77f00577e09c 100644
>> --- a/io_uring/fs.c
>> +++ b/io_uring/fs.c
>> @@ -291,3 +298,56 @@ void io_link_cleanup(struct io_kiocb *req)
>>   	putname(sl->oldpath);
>>   	putname(sl->newpath);
>>   }
>> +
>> +int io_getdents_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>> +{
>> +	struct io_getdents *gd = io_kiocb_to_cmd(req, struct io_getdents);
>> +
>> +	if (READ_ONCE(sqe->off) != 0)
>> +		return -EINVAL;
>> +
>> +	gd->dirent = u64_to_user_ptr(READ_ONCE(sqe->addr));
>> +	gd->count = READ_ONCE(sqe->len);
>> +
>> +	return 0;
>> +}
>> +
>> +int io_getdents(struct io_kiocb *req, unsigned int issue_flags)
>> +{
>> +	struct io_getdents *gd = io_kiocb_to_cmd(req, struct io_getdents);
>> +	struct file *file;
>> +	unsigned long getdents_flags = 0;
>> +	bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
>> +	bool should_lock = false;
>> +	int ret;
>> +
>> +	if (force_nonblock) {
>> +		if (!(req->file->f_mode & FMODE_NOWAIT))
>> +			return -EAGAIN;
>> +
>> +		getdents_flags = DIR_CONTEXT_F_NOWAIT;
>> +	}
>> +
>> +	file = req->file;
>> +	if (file && (file->f_mode & FMODE_ATOMIC_POS)) {
> 
> If file is NULL here things will just blow up in vfs_getdents anyway,
> let's remove the useless check

Sorry, forgot this one. Actually I think the file NULL check is not 
necessary here since it has been done in general code path in 
io_assign_file()


> 
>> +		if (file_count(file) > 1)
> 
> I was curious about this so I found it's basically what __fdget_pos does
> before deciding it should take the f_pos_lock, and as such this is
> probably correct... But if someone can chime in here: what guarantees
> someone else won't __fdget_pos (or equivalent through this) the file
> again between this and the vfs_getdents call?
> That second get would make file_count > 1 and it would lock, but lock
> hadn't been taken here so the other call could get the lock without
> waiting and both would process getdents or seek or whatever in
> parallel.
> 
> 
> That aside I don't see any obvious problem with this.
> 


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/3] fs: split off vfs_getdents function of getdents64 syscall
  2023-07-11 13:02   ` Ammar Faizi
@ 2023-07-12  8:03     ` Hao Xu
  2023-07-12 13:55       ` Ammar Faizi
  0 siblings, 1 reply; 35+ messages in thread
From: Hao Xu @ 2023-07-12  8:03 UTC (permalink / raw)
  To: Ammar Faizi
  Cc: io-uring Mailing List, Jens Axboe, Dominique Martinet,
	Pavel Begunkov, Christian Brauner, Alexander Viro, Stefan Roesch,
	Clay Harris, Dave Chinner, Linux Fsdevel Mailing List,
	Wanpeng Li

On 7/11/23 21:02, Ammar Faizi wrote:
> On Tue, Jul 11, 2023 at 07:40:25PM +0800, Hao Xu wrote:
>> This splits off the vfs_getdents function from the getdents64 system
>> call.
>> This will allow io_uring to call the vfs_getdents function.
>>
>> Co-developed-by: Stefan Roesch <[email protected]>
>> Signed-off-by: Stefan Roesch <[email protected]>
>> Signed-off-by: Dominique Martinet <[email protected]>
>> ---
> 
> Since you took this, it needs your Signed-off-by.
> 

Hi Ammar,
I just add this signed-off-by of Stefan to resolve the checkpatch 
complain, no code change.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/3] fs: split off vfs_getdents function of getdents64 syscall
  2023-07-11 23:50     ` Jens Axboe
@ 2023-07-12 11:14       ` Christian Brauner
  0 siblings, 0 replies; 35+ messages in thread
From: Christian Brauner @ 2023-07-12 11:14 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Dave Chinner, Hao Xu, io-uring, Dominique Martinet,
	Pavel Begunkov, Alexander Viro, Stefan Roesch, Clay Harris,
	linux-fsdevel, Wanpeng Li

On Tue, Jul 11, 2023 at 05:50:27PM -0600, Jens Axboe wrote:
> On 7/11/23 5:41?PM, Dave Chinner wrote:
> > On Tue, Jul 11, 2023 at 07:40:25PM +0800, Hao Xu wrote:
> >> From: Dominique Martinet <[email protected]>
> >>
> >> This splits off the vfs_getdents function from the getdents64 system
> >> call.
> >> This will allow io_uring to call the vfs_getdents function.
> >>
> >> Co-developed-by: Stefan Roesch <[email protected]>
> >> Signed-off-by: Stefan Roesch <[email protected]>
> >> Signed-off-by: Dominique Martinet <[email protected]>
> >> ---
> >>  fs/internal.h |  8 ++++++++
> >>  fs/readdir.c  | 34 ++++++++++++++++++++++++++--------
> >>  2 files changed, 34 insertions(+), 8 deletions(-)
> >>
> >> diff --git a/fs/internal.h b/fs/internal.h
> >> index f7a3dc111026..b1f66e52d61b 100644
> >> --- a/fs/internal.h
> >> +++ b/fs/internal.h
> >> @@ -304,3 +304,11 @@ ssize_t __kernel_write_iter(struct file *file, struct iov_iter *from, loff_t *po
> >>  struct mnt_idmap *alloc_mnt_idmap(struct user_namespace *mnt_userns);
> >>  struct mnt_idmap *mnt_idmap_get(struct mnt_idmap *idmap);
> >>  void mnt_idmap_put(struct mnt_idmap *idmap);
> >> +
> >> +/*
> >> + * fs/readdir.c
> >> + */
> >> +struct linux_dirent64;
> >> +
> >> +int vfs_getdents(struct file *file, struct linux_dirent64 __user *dirent,
> >> +		 unsigned int count);
> > 
> > Uh...
> > 
> > Since when have we allowed code outside fs/ to use fs/internal.h?
> 
> io_uring does use for things like open/close, statx, and xattr already.

Arguably though because you io_uring once used to be located under fs/.
In general though, we don't support anyone outside of fs/ to use that
header.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/3] vfs_getdents/struct dir_context: add flags field
  2023-07-11 11:40 ` [PATCH 2/3] vfs_getdents/struct dir_context: add flags field Hao Xu
@ 2023-07-12 11:31   ` Christian Brauner
  2023-07-12 16:02     ` Dominique Martinet
  0 siblings, 1 reply; 35+ messages in thread
From: Christian Brauner @ 2023-07-12 11:31 UTC (permalink / raw)
  To: Hao Xu
  Cc: io-uring, Jens Axboe, Dominique Martinet, Pavel Begunkov,
	Alexander Viro, Stefan Roesch, Clay Harris, Dave Chinner,
	linux-fsdevel, Wanpeng Li

On Tue, Jul 11, 2023 at 07:40:26PM +0800, Hao Xu wrote:
> From: Hao Xu <[email protected]>
> 
> The flags will allow passing DIR_CONTEXT_F_NOWAIT to iterate()
> implementations that support it (as signaled through FMODE_NWAIT
> in file->f_mode)
> 
> Notes:
> - considered using IOCB_NOWAIT but if we add more flags later it
> would be confusing to keep track of which values are valid, use
> dedicated flags
> - might want to check ctx.flags & DIR_CONTEXT_F_NOWAIT is only set
> when file->f_mode & FMODE_NOWAIT in iterate_dir() as e.g. WARN_ONCE?
> 
> Co-developed-by: Dominique Martinet <[email protected]>
> Signed-off-by: Dominique Martinet <[email protected]>
> Signed-off-by: Hao Xu <[email protected]>
> ---
>  fs/internal.h      | 2 +-
>  fs/readdir.c       | 6 ++++--
>  include/linux/fs.h | 8 ++++++++
>  3 files changed, 13 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/internal.h b/fs/internal.h
> index b1f66e52d61b..7508d485c655 100644
> --- a/fs/internal.h
> +++ b/fs/internal.h
> @@ -311,4 +311,4 @@ void mnt_idmap_put(struct mnt_idmap *idmap);
>  struct linux_dirent64;
>  
>  int vfs_getdents(struct file *file, struct linux_dirent64 __user *dirent,
> -		 unsigned int count);
> +		 unsigned int count, unsigned long flags);
> diff --git a/fs/readdir.c b/fs/readdir.c
> index 9592259b7e7f..b80caf4c9321 100644
> --- a/fs/readdir.c
> +++ b/fs/readdir.c
> @@ -358,12 +358,14 @@ static bool filldir64(struct dir_context *ctx, const char *name, int namlen,
>   * @file    : pointer to file struct of directory
>   * @dirent  : pointer to user directory structure
>   * @count   : size of buffer
> + * @flags   : additional dir_context flags

Why do you need that flag argument. The ->iterate{_shared}() i_op gets
passed the file so the filesystem can check
@file->f_mode & FMODE_NOWAIT, no?

>   */
>  int vfs_getdents(struct file *file, struct linux_dirent64 __user *dirent,
> -		 unsigned int count)
> +		 unsigned int count, unsigned long flags)
>  {
>  	struct getdents_callback64 buf = {
>  		.ctx.actor = filldir64,
> +		.ctx.flags = flags,
>  		.count = count,
>  		.current_dir = dirent
>  	};
> @@ -395,7 +397,7 @@ SYSCALL_DEFINE3(getdents64, unsigned int, fd,
>  	if (!f.file)
>  		return -EBADF;
>  
> -	error = vfs_getdents(f.file, dirent, count);
> +	error = vfs_getdents(f.file, dirent, count, 0);
>  
>  	fdput_pos(f);
>  	return error;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 6867512907d6..f3e315e8efdd 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1719,8 +1719,16 @@ typedef bool (*filldir_t)(struct dir_context *, const char *, int, loff_t, u64,
>  struct dir_context {
>  	filldir_t actor;
>  	loff_t pos;
> +	unsigned long flags;
>  };
>  
> +/*
> + * flags for dir_context flags
> + * DIR_CONTEXT_F_NOWAIT: Request non-blocking iterate
> + *                       (requires file->f_mode & FMODE_NOWAIT)
> + */
> +#define DIR_CONTEXT_F_NOWAIT	(1 << 0)

Even if this should be needed, I don't think this needs to use a full
flags field.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/3] fs: split off vfs_getdents function of getdents64 syscall
  2023-07-12  8:03     ` Hao Xu
@ 2023-07-12 13:55       ` Ammar Faizi
  2023-07-13  4:17         ` Hao Xu
  0 siblings, 1 reply; 35+ messages in thread
From: Ammar Faizi @ 2023-07-12 13:55 UTC (permalink / raw)
  To: Hao Xu
  Cc: io-uring Mailing List, Jens Axboe, Dominique Martinet,
	Pavel Begunkov, Christian Brauner, Alexander Viro, Stefan Roesch,
	Clay Harris, Dave Chinner, Linux Fsdevel Mailing List,
	Wanpeng Li

On Wed, Jul 12, 2023 at 04:03:41PM +0800, Hao Xu wrote:
> On 7/11/23 21:02, Ammar Faizi wrote:
> > On Tue, Jul 11, 2023 at 07:40:25PM +0800, Hao Xu wrote:
> > > Co-developed-by: Stefan Roesch <[email protected]>
> > > Signed-off-by: Stefan Roesch <[email protected]>
> > > Signed-off-by: Dominique Martinet <[email protected]>
> > > ---
> > 
> > Since you took this, it needs your Signed-off-by.
> > 
> 
> Hi Ammar,
> I just add this signed-off-by of Stefan to resolve the checkpatch complain,
> no code change.

Both, you and Stefan are required to sign-off. The submitter is also
required to sign-off even if the submitter makes no code change.

See https://www.kernel.org/doc/html/latest/process/submitting-patches.html:
"""
Any further SoBs (Signed-off-by:'s) following the author's SoB are from
people handling and transporting the patch, but were not involved in its
development. SoB chains should reflect the real route a patch took as it
was propagated to the maintainers and ultimately to Linus, with the
first SoB entry signalling primary authorship of a single author.
"""

It also applies to the maintainer when they apply your patches.

-- 
Ammar Faizi

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 3/3] io_uring: add support for getdents
  2023-07-11 11:40 ` [PATCH 3/3] io_uring: add support for getdents Hao Xu
  2023-07-11 12:15   ` Dominique Martinet
@ 2023-07-12 15:27   ` Christian Brauner
  2023-07-13  4:35     ` Hao Xu
  1 sibling, 1 reply; 35+ messages in thread
From: Christian Brauner @ 2023-07-12 15:27 UTC (permalink / raw)
  To: Hao Xu
  Cc: io-uring, Jens Axboe, Dominique Martinet, Pavel Begunkov,
	Alexander Viro, Stefan Roesch, Clay Harris, Dave Chinner,
	linux-fsdevel, Wanpeng Li

On Tue, Jul 11, 2023 at 07:40:27PM +0800, Hao Xu wrote:
> From: Hao Xu <[email protected]>
> 
> This add support for getdents64 to io_uring, acting exactly like the
> syscall: the directory is iterated from it's current's position as
> stored in the file struct, and the file's position is updated exactly as
> if getdents64 had been called.
> 
> For filesystems that support NOWAIT in iterate_shared(), try to use it
> first; if a user already knows the filesystem they use do not support
> nowait they can force async through IOSQE_ASYNC in the sqe flags,
> avoiding the need to bounce back through a useless EAGAIN return.
> 
> Co-developed-by: Dominique Martinet <[email protected]>
> Signed-off-by: Dominique Martinet <[email protected]>
> Signed-off-by: Hao Xu <[email protected]>
> ---
>  include/uapi/linux/io_uring.h |  7 ++++
>  io_uring/fs.c                 | 60 +++++++++++++++++++++++++++++++++++
>  io_uring/fs.h                 |  3 ++
>  io_uring/opdef.c              |  8 +++++
>  4 files changed, 78 insertions(+)
> 
> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> index 08720c7bd92f..6c0d521135a6 100644
> --- a/include/uapi/linux/io_uring.h
> +++ b/include/uapi/linux/io_uring.h
> @@ -65,6 +65,7 @@ struct io_uring_sqe {
>  		__u32		xattr_flags;
>  		__u32		msg_ring_flags;
>  		__u32		uring_cmd_flags;
> +		__u32		getdents_flags;
>  	};
>  	__u64	user_data;	/* data to be passed back at completion time */
>  	/* pack this to avoid bogus arm OABI complaints */
> @@ -235,6 +236,7 @@ enum io_uring_op {
>  	IORING_OP_URING_CMD,
>  	IORING_OP_SEND_ZC,
>  	IORING_OP_SENDMSG_ZC,
> +	IORING_OP_GETDENTS,
>  
>  	/* this goes last, obviously */
>  	IORING_OP_LAST,
> @@ -273,6 +275,11 @@ enum io_uring_op {
>   */
>  #define SPLICE_F_FD_IN_FIXED	(1U << 31) /* the last bit of __u32 */
>  
> +/*
> + * sqe->getdents_flags
> + */
> +#define IORING_GETDENTS_REWIND	(1U << 0)
> +
>  /*
>   * POLL_ADD flags. Note that since sqe->poll_events is the flag space, the
>   * command flags for POLL_ADD are stored in sqe->len.
> diff --git a/io_uring/fs.c b/io_uring/fs.c
> index f6a69a549fd4..77f00577e09c 100644
> --- a/io_uring/fs.c
> +++ b/io_uring/fs.c
> @@ -47,6 +47,13 @@ struct io_link {
>  	int				flags;
>  };
>  
> +struct io_getdents {
> +	struct file			*file;
> +	struct linux_dirent64 __user	*dirent;
> +	unsigned int			count;
> +	int				flags;
> +};
> +
>  int io_renameat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>  {
>  	struct io_rename *ren = io_kiocb_to_cmd(req, struct io_rename);
> @@ -291,3 +298,56 @@ void io_link_cleanup(struct io_kiocb *req)
>  	putname(sl->oldpath);
>  	putname(sl->newpath);
>  }
> +
> +int io_getdents_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
> +{
> +	struct io_getdents *gd = io_kiocb_to_cmd(req, struct io_getdents);
> +
> +	if (READ_ONCE(sqe->off) != 0)
> +		return -EINVAL;
> +
> +	gd->dirent = u64_to_user_ptr(READ_ONCE(sqe->addr));
> +	gd->count = READ_ONCE(sqe->len);
> +
> +	return 0;
> +}
> +
> +int io_getdents(struct io_kiocb *req, unsigned int issue_flags)
> +{
> +	struct io_getdents *gd = io_kiocb_to_cmd(req, struct io_getdents);
> +	struct file *file;
> +	unsigned long getdents_flags = 0;
> +	bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
> +	bool should_lock = false;
> +	int ret;
> +
> +	if (force_nonblock) {
> +		if (!(req->file->f_mode & FMODE_NOWAIT))
> +			return -EAGAIN;
> +
> +		getdents_flags = DIR_CONTEXT_F_NOWAIT;

I mentioned this on the other patch but it seems really pointless to
have that extra flag. I would really like to hear a good reason for
this.

> +	}
> +
> +	file = req->file;
> +	if (file && (file->f_mode & FMODE_ATOMIC_POS)) {
> +		if (file_count(file) > 1)

Assume we have a regular non-threaded process that just opens an fd to a
file. The process registers an async readdir request via that fd for the
file with io_uring and goes to do other stuff while waiting for the
result.

Some time later, io_uring gets to io_getdents() and the task is still
single threaded and the file hasn't been shared in the meantime. So
io_getdents() doesn't take the lock and starts the readdir() call.

Concurrently, the process that registered the io_uring request was free
to do other stuff and issued a synchronous readdir() system call which
calls fdget_pos(). Since the fdtable still isn't shared it doesn't
increment f_count and doesn't acquire the mutex. Now there's another
concurrent readdir() going on.

(Similar thing can happen if the process creates a thread for example.)

Two readdir() requests now proceed concurrently which is not intended.
Now to verify that this race can't happen with io_uring:

* regular fds:
  It seems that io_uring calls fget() on each regular file descriptor
  when an async request is registered. So that means that io_uring
  always hold its own explicit reference here.
  So as long as the original task is alive or another thread is alive
  f_count is guaranteed to be > 1 and so the mutex would always be
  acquired.

  If the registering process dies right before io_uring gets to the
  io_getdents() request no other process can steal the fd anymore and in
  that case the readdir call would not lock. But that's fine.

* fixed fds:
  I don't know the reference counting rules here. io_uring would need to
  ensure that it's impossible for two async readdir requests via a fixed
  fd to race because f_count is == 1.

  Iiuc, if a process registers a file it opened as a fixed file and
  immediately closes the fd afterwards - without anyone else holding a
  reference to that file - and only uses the fixed fd going forward, the
  f_count of that file in io_uring's fixed file table is always 1.

  So one could issue any number of concurrent readdir requests with no
  mutual exclusion. So for fixed files there definitely is a race, no?

All of that could ofc be simplified if we could just always acquire the
mutex in fdget_pos() and other places and drop that file_count(file) > 1
optimization everywhere. But I have no idea if the optimization for not
acquiring the mutex if f_count == 1 is worth it?

I hope I didn't confuse myself here.

Jens, do yo have any input here?

> +			should_lock = true;
> +	}
> +	if (should_lock) {
> +		if (!force_nonblock)
> +			mutex_lock(&file->f_pos_lock);
> +		else if (!mutex_trylock(&file->f_pos_lock))
> +			return -EAGAIN;
> +	}

Open-coding this seems extremely brittle with an invitation for subtle
bugs.

> +
> +	ret = vfs_getdents(file, gd->dirent, gd->count, getdents_flags);
> +	if (should_lock)
> +		mutex_unlock(&file->f_pos_lock);
> +
> +	if (ret == -EAGAIN && force_nonblock)
> +		return -EAGAIN;
> +
> +	io_req_set_res(req, ret, 0);
> +	return 0;
> +}
> +
> diff --git a/io_uring/fs.h b/io_uring/fs.h
> index 0bb5efe3d6bb..f83a6f3a678d 100644
> --- a/io_uring/fs.h
> +++ b/io_uring/fs.h
> @@ -18,3 +18,6 @@ int io_symlinkat(struct io_kiocb *req, unsigned int issue_flags);
>  int io_linkat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
>  int io_linkat(struct io_kiocb *req, unsigned int issue_flags);
>  void io_link_cleanup(struct io_kiocb *req);
> +
> +int io_getdents_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
> +int io_getdents(struct io_kiocb *req, unsigned int issue_flags);
> diff --git a/io_uring/opdef.c b/io_uring/opdef.c
> index 3b9c6489b8b6..1bae6b2a8d0b 100644
> --- a/io_uring/opdef.c
> +++ b/io_uring/opdef.c
> @@ -428,6 +428,11 @@ const struct io_issue_def io_issue_defs[] = {
>  		.prep			= io_eopnotsupp_prep,
>  #endif
>  	},
> +	[IORING_OP_GETDENTS] = {
> +		.needs_file		= 1,
> +		.prep			= io_getdents_prep,
> +		.issue			= io_getdents,
> +	},
>  };
>  
>  
> @@ -648,6 +653,9 @@ const struct io_cold_def io_cold_defs[] = {
>  		.fail			= io_sendrecv_fail,
>  #endif
>  	},
> +	[IORING_OP_GETDENTS] = {
> +		.name			= "GETDENTS",
> +	},
>  };
>  
>  const char *io_uring_get_opcode(u8 opcode)
> -- 
> 2.25.1
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/3] vfs_getdents/struct dir_context: add flags field
  2023-07-12 11:31   ` Christian Brauner
@ 2023-07-12 16:02     ` Dominique Martinet
  2023-07-13  4:12       ` Hao Xu
  0 siblings, 1 reply; 35+ messages in thread
From: Dominique Martinet @ 2023-07-12 16:02 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Hao Xu, io-uring, Jens Axboe, Pavel Begunkov, Alexander Viro,
	Stefan Roesch, Clay Harris, Dave Chinner, linux-fsdevel,
	Wanpeng Li

(replying as that was my code)

Christian Brauner wrote on Wed, Jul 12, 2023 at 01:31:57PM +0200:
> On Tue, Jul 11, 2023 at 07:40:26PM +0800, Hao Xu wrote:
> > diff --git a/fs/readdir.c b/fs/readdir.c
> > index 9592259b7e7f..b80caf4c9321 100644
> > --- a/fs/readdir.c
> > +++ b/fs/readdir.c
> > @@ -358,12 +358,14 @@ static bool filldir64(struct dir_context *ctx, const char *name, int namlen,
> >   * @file    : pointer to file struct of directory
> >   * @dirent  : pointer to user directory structure
> >   * @count   : size of buffer
> > + * @flags   : additional dir_context flags
> 
> Why do you need that flag argument. The ->iterate{_shared}() i_op gets
> passed the file so the filesystem can check
> @file->f_mode & FMODE_NOWAIT, no?

As far as I understand it, it's not because the fd is capable of NOWAIT
that uring will call it in NOWAIT mode:
- if the first getdents call returned -EAGAIN it'll also fall back to
waiting in a separate thread (there's no "getdents poll" implementation,
so there's no other way of rescheduling a non-blocking call)
- it's also possible for the user to specify it wants IOSQE_ASYNC in the
sqe->flags (admitedly I'm not sure why would anyone do this, but that's
useful for benchmarks at least -- it skips the initial NOWAIT call
before falling back to threaded waiting call)

Even outsides of io_uring, a call to getdents64 should block, so even if
the filesystem supports non-blocking it should be explicitely required
by the caller.


> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -1719,8 +1719,16 @@ typedef bool (*filldir_t)(struct dir_context *, const char *, int, loff_t, u64,
> >  struct dir_context {
> >  	filldir_t actor;
> >  	loff_t pos;
> > +	unsigned long flags;
> >  };
> >  
> > +/*
> > + * flags for dir_context flags
> > + * DIR_CONTEXT_F_NOWAIT: Request non-blocking iterate
> > + *                       (requires file->f_mode & FMODE_NOWAIT)
> > + */
> > +#define DIR_CONTEXT_F_NOWAIT	(1 << 0)
> 
> Even if this should be needed, I don't think this needs to use a full
> flags field.

I also got a request to somehow pass back "are there more entries to
read after this call" to the caller in my v1, and I had done this as a
second flag -- in general my understanding was that it's better to add
flags than a specific boolean for extensibility but I have no opinon
here.


-- 
Dominique Martinet | Asmadeus

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 3/3] io_uring: add support for getdents
  2023-07-12  7:53     ` Hao Xu
@ 2023-07-12 16:10       ` Dominique Martinet
  2023-07-13  4:05         ` Hao Xu
  0 siblings, 1 reply; 35+ messages in thread
From: Dominique Martinet @ 2023-07-12 16:10 UTC (permalink / raw)
  To: Hao Xu
  Cc: io-uring, Jens Axboe, Pavel Begunkov, Christian Brauner,
	Alexander Viro, Stefan Roesch, Clay Harris, Dave Chinner,
	linux-fsdevel, Wanpeng Li

Hao Xu wrote on Wed, Jul 12, 2023 at 03:53:24PM +0800:
> > > +		if (file_count(file) > 1)
> > 
> > I was curious about this so I found it's basically what __fdget_pos does
> > before deciding it should take the f_pos_lock, and as such this is
> > probably correct... But if someone can chime in here: what guarantees
> > someone else won't __fdget_pos (or equivalent through this) the file
> > again between this and the vfs_getdents call?
> > That second get would make file_count > 1 and it would lock, but lock
> > hadn't been taken here so the other call could get the lock without
> > waiting and both would process getdents or seek or whatever in
> > parallel.
> > 
> 
> This file_count(file) is atomic_read, so I believe no race condition here.

I don't see how that helps in the presence of another thread getting the
lock after we possibly issued a getdents without the lock, e.g.

t1 call io_uring getdents here
t1 sees file_count(file) == 1 and skips getting lock
t1 starts issuing vfs_getdents [... processing]
t2 calls either io_uring getdents or getdents64 syscall
t2 gets the lock, since it wasn't taken by t1 it can be obtained
t2 issues another vfs_getdents

Christian raised the same issue so I'll leave this to his part of the
thread for reply, but I hope that clarified my concern.

-----

BTW I forgot to point out: this dropped the REWIND bit from my patch; I
believe some form of "seek" is necessary for real applications to make
use of this (for example, a web server could keep the fd open in a LRU
and keep issuing readdir over and over again everytime it gets an
indexing request); not having rewind means it'd need to close and
re-open the fd everytime which doesn't seem optimal.

A previous iteration discussed that real seek is difficult and not
necessarily needed to I settled for rewind, but was there a reason you
decided to stop handling that?

My very egoistical personal use case won't require it, so I can just say
I don't care here, but it would be nice to have a reason explained at
some point

-- 
Dominique Martinet | Asmadeus

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 3/3] io_uring: add support for getdents
  2023-07-12 16:10       ` Dominique Martinet
@ 2023-07-13  4:05         ` Hao Xu
  2023-07-13  4:40           ` Hao Xu
  2023-07-13  4:50           ` Dominique Martinet
  0 siblings, 2 replies; 35+ messages in thread
From: Hao Xu @ 2023-07-13  4:05 UTC (permalink / raw)
  To: Dominique Martinet
  Cc: io-uring, Jens Axboe, Pavel Begunkov, Christian Brauner,
	Alexander Viro, Stefan Roesch, Clay Harris, Dave Chinner,
	linux-fsdevel, Wanpeng Li


On 7/13/23 00:10, Dominique Martinet wrote:
> Hao Xu wrote on Wed, Jul 12, 2023 at 03:53:24PM +0800:
>>>> +		if (file_count(file) > 1)
>>> I was curious about this so I found it's basically what __fdget_pos does
>>> before deciding it should take the f_pos_lock, and as such this is
>>> probably correct... But if someone can chime in here: what guarantees
>>> someone else won't __fdget_pos (or equivalent through this) the file
>>> again between this and the vfs_getdents call?
>>> That second get would make file_count > 1 and it would lock, but lock
>>> hadn't been taken here so the other call could get the lock without
>>> waiting and both would process getdents or seek or whatever in
>>> parallel.
>>>
>> This file_count(file) is atomic_read, so I believe no race condition here.
> I don't see how that helps in the presence of another thread getting the
> lock after we possibly issued a getdents without the lock, e.g.
>
> t1 call io_uring getdents here
> t1 sees file_count(file) == 1 and skips getting lock
> t1 starts issuing vfs_getdents [... processing]
> t2 calls either io_uring getdents or getdents64 syscall
> t2 gets the lock, since it wasn't taken by t1 it can be obtained
> t2 issues another vfs_getdents
>
> Christian raised the same issue so I'll leave this to his part of the
> thread for reply, but I hope that clarified my concern.


Hi Dominique,

Ah, I misunderstood your question, sorry. The thing is f_count is 
init-ed to be 1,

and normal uring requests do fdget first, so I think it's ok for normal 
requests.

What Christian points out is issue with fixed file, that is indeed a 
problem I think.


>
> -----
>
> BTW I forgot to point out: this dropped the REWIND bit from my patch; I
> believe some form of "seek" is necessary for real applications to make
> use of this (for example, a web server could keep the fd open in a LRU
> and keep issuing readdir over and over again everytime it gets an
> indexing request); not having rewind means it'd need to close and
> re-open the fd everytime which doesn't seem optimal.
>
> A previous iteration discussed that real seek is difficult and not
> necessarily needed to I settled for rewind, but was there a reason you
> decided to stop handling that?
>
> My very egoistical personal use case won't require it, so I can just say
> I don't care here, but it would be nice to have a reason explained at
> some point


Yes, like Al pointed out, getdents with an offset is not the right way 
to do it,

So a way to do seek is a must. But like what I said in the cover-letter, 
I do think the right thing is to

import lseek/llseek to io_uring, not increment the complex of getdents.


Thanks,

Hao



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/3] vfs_getdents/struct dir_context: add flags field
  2023-07-12 16:02     ` Dominique Martinet
@ 2023-07-13  4:12       ` Hao Xu
  0 siblings, 0 replies; 35+ messages in thread
From: Hao Xu @ 2023-07-13  4:12 UTC (permalink / raw)
  To: Dominique Martinet, Christian Brauner
  Cc: io-uring, Jens Axboe, Pavel Begunkov, Alexander Viro,
	Stefan Roesch, Clay Harris, Dave Chinner, linux-fsdevel,
	Wanpeng Li

Hi Christian and Dominique,


On 7/13/23 00:02, Dominique Martinet wrote:
> (replying as that was my code)
>
> Christian Brauner wrote on Wed, Jul 12, 2023 at 01:31:57PM +0200:
>> On Tue, Jul 11, 2023 at 07:40:26PM +0800, Hao Xu wrote:
>>> diff --git a/fs/readdir.c b/fs/readdir.c
>>> index 9592259b7e7f..b80caf4c9321 100644
>>> --- a/fs/readdir.c
>>> +++ b/fs/readdir.c
>>> @@ -358,12 +358,14 @@ static bool filldir64(struct dir_context *ctx, const char *name, int namlen,
>>>    * @file    : pointer to file struct of directory
>>>    * @dirent  : pointer to user directory structure
>>>    * @count   : size of buffer
>>> + * @flags   : additional dir_context flags
>> Why do you need that flag argument. The ->iterate{_shared}() i_op gets
>> passed the file so the filesystem can check
>> @file->f_mode & FMODE_NOWAIT, no?
> As far as I understand it, it's not because the fd is capable of NOWAIT
> that uring will call it in NOWAIT mode:
> - if the first getdents call returned -EAGAIN it'll also fall back to
> waiting in a separate thread (there's no "getdents poll" implementation,
> so there's no other way of rescheduling a non-blocking call)
> - it's also possible for the user to specify it wants IOSQE_ASYNC in the
> sqe->flags (admitedly I'm not sure why would anyone do this, but that's
> useful for benchmarks at least -- it skips the initial NOWAIT call
> before falling back to threaded waiting call)
>
> Even outsides of io_uring, a call to getdents64 should block, so even if
> the filesystem supports non-blocking it should be explicitely required
> by the caller.


Hi Christian,

My understanding of FMODE_NOWAIT is "this file support nowait IO". Just 
like what Dominique

said, io_uring issue a request two rounds(let's simplify it here since 
no apoll or task work involved),

and the first round is  a nowait/nonblock try, the second one is an 
offload-ed block try. So besides

a "ability" flag(FMODE_NOWAIT), we still need a "one-round" flag to 
point out that "we do need to

do nowait IO this time".


>
>>> --- a/include/linux/fs.h
>>> +++ b/include/linux/fs.h
>>> @@ -1719,8 +1719,16 @@ typedef bool (*filldir_t)(struct dir_context *, const char *, int, loff_t, u64,
>>>   struct dir_context {
>>>   	filldir_t actor;
>>>   	loff_t pos;
>>> +	unsigned long flags;
>>>   };
>>>   
>>> +/*
>>> + * flags for dir_context flags
>>> + * DIR_CONTEXT_F_NOWAIT: Request non-blocking iterate
>>> + *                       (requires file->f_mode & FMODE_NOWAIT)
>>> + */
>>> +#define DIR_CONTEXT_F_NOWAIT	(1 << 0)
>> Even if this should be needed, I don't think this needs to use a full
>> flags field.
> I also got a request to somehow pass back "are there more entries to
> read after this call" to the caller in my v1, and I had done this as a
> second flag -- in general my understanding was that it's better to add
> flags than a specific boolean for extensibility but I have no opinon
> here.


I've no strong opinion here, I kept it here as a flag variable to make it

more extendable in the future.


Thanks,

Hao


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/3] fs: split off vfs_getdents function of getdents64 syscall
  2023-07-12 13:55       ` Ammar Faizi
@ 2023-07-13  4:17         ` Hao Xu
  0 siblings, 0 replies; 35+ messages in thread
From: Hao Xu @ 2023-07-13  4:17 UTC (permalink / raw)
  To: Ammar Faizi
  Cc: io-uring Mailing List, Jens Axboe, Dominique Martinet,
	Pavel Begunkov, Christian Brauner, Alexander Viro, Stefan Roesch,
	Clay Harris, Dave Chinner, Linux Fsdevel Mailing List,
	Wanpeng Li


On 7/12/23 21:55, Ammar Faizi wrote:
> On Wed, Jul 12, 2023 at 04:03:41PM +0800, Hao Xu wrote:
>> On 7/11/23 21:02, Ammar Faizi wrote:
>>> On Tue, Jul 11, 2023 at 07:40:25PM +0800, Hao Xu wrote:
>>>> Co-developed-by: Stefan Roesch <[email protected]>
>>>> Signed-off-by: Stefan Roesch <[email protected]>
>>>> Signed-off-by: Dominique Martinet <[email protected]>
>>>> ---
>>> Since you took this, it needs your Signed-off-by.
>>>
>> Hi Ammar,
>> I just add this signed-off-by of Stefan to resolve the checkpatch complain,
>> no code change.
> Both, you and Stefan are required to sign-off. The submitter is also
> required to sign-off even if the submitter makes no code change.
>
> See https://www.kernel.org/doc/html/latest/process/submitting-patches.html:
> """
> Any further SoBs (Signed-off-by:'s) following the author's SoB are from
> people handling and transporting the patch, but were not involved in its
> development. SoB chains should reflect the real route a patch took as it
> was propagated to the maintainers and ultimately to Linus, with the
> first SoB entry signalling primary authorship of a single author.
> """
>
> It also applies to the maintainer when they apply your patches.


Hi Ammar,

I see, this information is really helpful, I'll fix it in next version 
to make it

more standardized.


Thanks,

Hao


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 3/3] io_uring: add support for getdents
  2023-07-12 15:27   ` Christian Brauner
@ 2023-07-13  4:35     ` Hao Xu
  2023-07-13  7:10       ` Christian Brauner
  0 siblings, 1 reply; 35+ messages in thread
From: Hao Xu @ 2023-07-13  4:35 UTC (permalink / raw)
  To: Christian Brauner
  Cc: io-uring, Jens Axboe, Dominique Martinet, Pavel Begunkov,
	Alexander Viro, Stefan Roesch, Clay Harris, Dave Chinner,
	linux-fsdevel, Wanpeng Li

On 7/12/23 23:27, Christian Brauner wrote:
> On Tue, Jul 11, 2023 at 07:40:27PM +0800, Hao Xu wrote:
>> From: Hao Xu <[email protected]>
>>
>> This add support for getdents64 to io_uring, acting exactly like the
>> syscall: the directory is iterated from it's current's position as
>> stored in the file struct, and the file's position is updated exactly as
>> if getdents64 had been called.
>>
>> For filesystems that support NOWAIT in iterate_shared(), try to use it
>> first; if a user already knows the filesystem they use do not support
>> nowait they can force async through IOSQE_ASYNC in the sqe flags,
>> avoiding the need to bounce back through a useless EAGAIN return.
>>
>> Co-developed-by: Dominique Martinet <[email protected]>
>> Signed-off-by: Dominique Martinet <[email protected]>
>> Signed-off-by: Hao Xu <[email protected]>
>> ---
>>   include/uapi/linux/io_uring.h |  7 ++++
>>   io_uring/fs.c                 | 60 +++++++++++++++++++++++++++++++++++
>>   io_uring/fs.h                 |  3 ++
>>   io_uring/opdef.c              |  8 +++++
>>   4 files changed, 78 insertions(+)
>>
>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
>> index 08720c7bd92f..6c0d521135a6 100644
>> --- a/include/uapi/linux/io_uring.h
>> +++ b/include/uapi/linux/io_uring.h
>> @@ -65,6 +65,7 @@ struct io_uring_sqe {
>>   		__u32		xattr_flags;
>>   		__u32		msg_ring_flags;
>>   		__u32		uring_cmd_flags;
>> +		__u32		getdents_flags;
>>   	};
>>   	__u64	user_data;	/* data to be passed back at completion time */
>>   	/* pack this to avoid bogus arm OABI complaints */
>> @@ -235,6 +236,7 @@ enum io_uring_op {
>>   	IORING_OP_URING_CMD,
>>   	IORING_OP_SEND_ZC,
>>   	IORING_OP_SENDMSG_ZC,
>> +	IORING_OP_GETDENTS,
>>   
>>   	/* this goes last, obviously */
>>   	IORING_OP_LAST,
>> @@ -273,6 +275,11 @@ enum io_uring_op {
>>    */
>>   #define SPLICE_F_FD_IN_FIXED	(1U << 31) /* the last bit of __u32 */
>>   
>> +/*
>> + * sqe->getdents_flags
>> + */
>> +#define IORING_GETDENTS_REWIND	(1U << 0)
>> +
>>   /*
>>    * POLL_ADD flags. Note that since sqe->poll_events is the flag space, the
>>    * command flags for POLL_ADD are stored in sqe->len.
>> diff --git a/io_uring/fs.c b/io_uring/fs.c
>> index f6a69a549fd4..77f00577e09c 100644
>> --- a/io_uring/fs.c
>> +++ b/io_uring/fs.c
>> @@ -47,6 +47,13 @@ struct io_link {
>>   	int				flags;
>>   };
>>   
>> +struct io_getdents {
>> +	struct file			*file;
>> +	struct linux_dirent64 __user	*dirent;
>> +	unsigned int			count;
>> +	int				flags;
>> +};
>> +
>>   int io_renameat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>   {
>>   	struct io_rename *ren = io_kiocb_to_cmd(req, struct io_rename);
>> @@ -291,3 +298,56 @@ void io_link_cleanup(struct io_kiocb *req)
>>   	putname(sl->oldpath);
>>   	putname(sl->newpath);
>>   }
>> +
>> +int io_getdents_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>> +{
>> +	struct io_getdents *gd = io_kiocb_to_cmd(req, struct io_getdents);
>> +
>> +	if (READ_ONCE(sqe->off) != 0)
>> +		return -EINVAL;
>> +
>> +	gd->dirent = u64_to_user_ptr(READ_ONCE(sqe->addr));
>> +	gd->count = READ_ONCE(sqe->len);
>> +
>> +	return 0;
>> +}
>> +
>> +int io_getdents(struct io_kiocb *req, unsigned int issue_flags)
>> +{
>> +	struct io_getdents *gd = io_kiocb_to_cmd(req, struct io_getdents);
>> +	struct file *file;
>> +	unsigned long getdents_flags = 0;
>> +	bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
>> +	bool should_lock = false;
>> +	int ret;
>> +
>> +	if (force_nonblock) {
>> +		if (!(req->file->f_mode & FMODE_NOWAIT))
>> +			return -EAGAIN;
>> +
>> +		getdents_flags = DIR_CONTEXT_F_NOWAIT;
> 
> I mentioned this on the other patch but it seems really pointless to
> have that extra flag. I would really like to hear a good reason for
> this.
> 
>> +	}
>> +
>> +	file = req->file;
>> +	if (file && (file->f_mode & FMODE_ATOMIC_POS)) {
>> +		if (file_count(file) > 1)
> 
> Assume we have a regular non-threaded process that just opens an fd to a
> file. The process registers an async readdir request via that fd for the
> file with io_uring and goes to do other stuff while waiting for the
> result.
> 
> Some time later, io_uring gets to io_getdents() and the task is still
> single threaded and the file hasn't been shared in the meantime. So
> io_getdents() doesn't take the lock and starts the readdir() call.
> 
> Concurrently, the process that registered the io_uring request was free
> to do other stuff and issued a synchronous readdir() system call which
> calls fdget_pos(). Since the fdtable still isn't shared it doesn't
> increment f_count and doesn't acquire the mutex. Now there's another
> concurrent readdir() going on.
> 
> (Similar thing can happen if the process creates a thread for example.)
> 
> Two readdir() requests now proceed concurrently which is not intended.
> Now to verify that this race can't happen with io_uring:
> 
> * regular fds:
>    It seems that io_uring calls fget() on each regular file descriptor
>    when an async request is registered. So that means that io_uring
>    always hold its own explicit reference here.
>    So as long as the original task is alive or another thread is alive
>    f_count is guaranteed to be > 1 and so the mutex would always be
>    acquired.
> 
>    If the registering process dies right before io_uring gets to the
>    io_getdents() request no other process can steal the fd anymore and in
>    that case the readdir call would not lock. But that's fine.
> 
> * fixed fds:
>    I don't know the reference counting rules here. io_uring would need to
>    ensure that it's impossible for two async readdir requests via a fixed
>    fd to race because f_count is == 1.
> 
>    Iiuc, if a process registers a file it opened as a fixed file and
>    immediately closes the fd afterwards - without anyone else holding a
>    reference to that file - and only uses the fixed fd going forward, the
>    f_count of that file in io_uring's fixed file table is always 1.
> 
>    So one could issue any number of concurrent readdir requests with no
>    mutual exclusion. So for fixed files there definitely is a race, no?

Hi Christian,
The ref logic for fixed file is that it does fdget() when registering
the file, and fdput() when unregistering it. So the ref in between is
always > 1. The fixed file feature is to reduce frequent fdget/fdput,
but it does call them at the register/unregister time.


> 
> All of that could ofc be simplified if we could just always acquire the
> mutex in fdget_pos() and other places and drop that file_count(file) > 1
> optimization everywhere. But I have no idea if the optimization for not
> acquiring the mutex if f_count == 1 is worth it?
> 
> I hope I didn't confuse myself here.
> 
> Jens, do yo have any input here?
> 
>> +			should_lock = true;
>> +	}
>> +	if (should_lock) {
>> +		if (!force_nonblock)
>> +			mutex_lock(&file->f_pos_lock);
>> +		else if (!mutex_trylock(&file->f_pos_lock))
>> +			return -EAGAIN;
>> +	}
> 
> Open-coding this seems extremely brittle with an invitation for subtle
> bugs.

Could you elaborate on this, I'm not sure that I understand it quite
well. Sorry for my poor English.

Thanks,
Hao

> 
>> +
>> +	ret = vfs_getdents(file, gd->dirent, gd->count, getdents_flags);
>> +	if (should_lock)
>> +		mutex_unlock(&file->f_pos_lock);
>> +
>> +	if (ret == -EAGAIN && force_nonblock)
>> +		return -EAGAIN;
>> +
>> +	io_req_set_res(req, ret, 0);
>> +	return 0;
>> +}
>> +
>> diff --git a/io_uring/fs.h b/io_uring/fs.h
>> index 0bb5efe3d6bb..f83a6f3a678d 100644
>> --- a/io_uring/fs.h
>> +++ b/io_uring/fs.h
>> @@ -18,3 +18,6 @@ int io_symlinkat(struct io_kiocb *req, unsigned int issue_flags);
>>   int io_linkat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
>>   int io_linkat(struct io_kiocb *req, unsigned int issue_flags);
>>   void io_link_cleanup(struct io_kiocb *req);
>> +
>> +int io_getdents_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
>> +int io_getdents(struct io_kiocb *req, unsigned int issue_flags);
>> diff --git a/io_uring/opdef.c b/io_uring/opdef.c
>> index 3b9c6489b8b6..1bae6b2a8d0b 100644
>> --- a/io_uring/opdef.c
>> +++ b/io_uring/opdef.c
>> @@ -428,6 +428,11 @@ const struct io_issue_def io_issue_defs[] = {
>>   		.prep			= io_eopnotsupp_prep,
>>   #endif
>>   	},
>> +	[IORING_OP_GETDENTS] = {
>> +		.needs_file		= 1,
>> +		.prep			= io_getdents_prep,
>> +		.issue			= io_getdents,
>> +	},
>>   };
>>   
>>   
>> @@ -648,6 +653,9 @@ const struct io_cold_def io_cold_defs[] = {
>>   		.fail			= io_sendrecv_fail,
>>   #endif
>>   	},
>> +	[IORING_OP_GETDENTS] = {
>> +		.name			= "GETDENTS",
>> +	},
>>   };
>>   
>>   const char *io_uring_get_opcode(u8 opcode)
>> -- 
>> 2.25.1
>>


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 3/3] io_uring: add support for getdents
  2023-07-13  4:05         ` Hao Xu
@ 2023-07-13  4:40           ` Hao Xu
  2023-07-13  4:50           ` Dominique Martinet
  1 sibling, 0 replies; 35+ messages in thread
From: Hao Xu @ 2023-07-13  4:40 UTC (permalink / raw)
  To: Dominique Martinet
  Cc: io-uring, Jens Axboe, Pavel Begunkov, Christian Brauner,
	Alexander Viro, Stefan Roesch, Clay Harris, Dave Chinner,
	linux-fsdevel, Wanpeng Li

On 7/13/23 12:05, Hao Xu wrote:
> 
> On 7/13/23 00:10, Dominique Martinet wrote:
>> Hao Xu wrote on Wed, Jul 12, 2023 at 03:53:24PM +0800:
>>>>> +        if (file_count(file) > 1)
>>>> I was curious about this so I found it's basically what __fdget_pos 
>>>> does
>>>> before deciding it should take the f_pos_lock, and as such this is
>>>> probably correct... But if someone can chime in here: what guarantees
>>>> someone else won't __fdget_pos (or equivalent through this) the file
>>>> again between this and the vfs_getdents call?
>>>> That second get would make file_count > 1 and it would lock, but lock
>>>> hadn't been taken here so the other call could get the lock without
>>>> waiting and both would process getdents or seek or whatever in
>>>> parallel.
>>>>
>>> This file_count(file) is atomic_read, so I believe no race condition 
>>> here.
>> I don't see how that helps in the presence of another thread getting the
>> lock after we possibly issued a getdents without the lock, e.g.
>>
>> t1 call io_uring getdents here
>> t1 sees file_count(file) == 1 and skips getting lock
>> t1 starts issuing vfs_getdents [... processing]
>> t2 calls either io_uring getdents or getdents64 syscall
>> t2 gets the lock, since it wasn't taken by t1 it can be obtained
>> t2 issues another vfs_getdents
>>
>> Christian raised the same issue so I'll leave this to his part of the
>> thread for reply, but I hope that clarified my concern.
> 
> 
> Hi Dominique,
> 
> Ah, I misunderstood your question, sorry. The thing is f_count is 
> init-ed to be 1,
> 
> and normal uring requests do fdget first, so I think it's ok for normal 
> requests.
> 
> What Christian points out is issue with fixed file, that is indeed a 
> problem I think.

After re-think of it, I think there is no race in fixed file case as
well, because the f_count is always >1


> 
> 
>>
>> -----
>>
>> BTW I forgot to point out: this dropped the REWIND bit from my patch; I
>> believe some form of "seek" is necessary for real applications to make
>> use of this (for example, a web server could keep the fd open in a LRU
>> and keep issuing readdir over and over again everytime it gets an
>> indexing request); not having rewind means it'd need to close and
>> re-open the fd everytime which doesn't seem optimal.
>>
>> A previous iteration discussed that real seek is difficult and not
>> necessarily needed to I settled for rewind, but was there a reason you
>> decided to stop handling that?
>>
>> My very egoistical personal use case won't require it, so I can just say
>> I don't care here, but it would be nice to have a reason explained at
>> some point
> 
> 
> Yes, like Al pointed out, getdents with an offset is not the right way 
> to do it,
> 
> So a way to do seek is a must. But like what I said in the cover-letter, 
> I do think the right thing is to
> 
> import lseek/llseek to io_uring, not increment the complex of getdents.
> 
> 
> Thanks,
> 
> Hao
> 
> 


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 3/3] io_uring: add support for getdents
  2023-07-13  4:05         ` Hao Xu
  2023-07-13  4:40           ` Hao Xu
@ 2023-07-13  4:50           ` Dominique Martinet
  1 sibling, 0 replies; 35+ messages in thread
From: Dominique Martinet @ 2023-07-13  4:50 UTC (permalink / raw)
  To: Hao Xu
  Cc: io-uring, Jens Axboe, Pavel Begunkov, Christian Brauner,
	Alexander Viro, Stefan Roesch, Clay Harris, Dave Chinner,
	linux-fsdevel, Wanpeng Li

Hao Xu wrote on Thu, Jul 13, 2023 at 12:05:00PM +0800:
> Yes, like Al pointed out, getdents with an offset is not the right way to do
> it,
> 
> So a way to do seek is a must. But like what I said in the cover-letter, I
> do think the right thing is to
> 
> import lseek/llseek to io_uring, not increment the complex of getdents.

Ok, sorry I hadn't read the cover letter properly


Hao Xu wrote on Thu, Jul 13, 2023 at 12:40:05PM +0800:
> > Ah, I misunderstood your question, sorry. The thing is f_count is
> > init-ed to be 1,
> > 
> > and normal uring requests do fdget first, so I think it's ok for normal
> > requests.
> > 
> > What Christian points out is issue with fixed file, that is indeed a
> > problem I think.
> 
> After re-think of it, I think there is no race in fixed file case as
> well, because the f_count is always >1

Let's remove the if > 1 check then

-- 
Dominique Martinet | Asmadeus

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 3/3] io_uring: add support for getdents
  2023-07-13  4:35     ` Hao Xu
@ 2023-07-13  7:10       ` Christian Brauner
  2023-07-13  9:06         ` Hao Xu
  0 siblings, 1 reply; 35+ messages in thread
From: Christian Brauner @ 2023-07-13  7:10 UTC (permalink / raw)
  To: Hao Xu
  Cc: io-uring, Jens Axboe, Dominique Martinet, Pavel Begunkov,
	Alexander Viro, Stefan Roesch, Clay Harris, Dave Chinner,
	linux-fsdevel, Wanpeng Li

On Thu, Jul 13, 2023 at 12:35:07PM +0800, Hao Xu wrote:
> On 7/12/23 23:27, Christian Brauner wrote:
> > On Tue, Jul 11, 2023 at 07:40:27PM +0800, Hao Xu wrote:
> > > From: Hao Xu <[email protected]>
> > > 
> > > This add support for getdents64 to io_uring, acting exactly like the
> > > syscall: the directory is iterated from it's current's position as
> > > stored in the file struct, and the file's position is updated exactly as
> > > if getdents64 had been called.
> > > 
> > > For filesystems that support NOWAIT in iterate_shared(), try to use it
> > > first; if a user already knows the filesystem they use do not support
> > > nowait they can force async through IOSQE_ASYNC in the sqe flags,
> > > avoiding the need to bounce back through a useless EAGAIN return.
> > > 
> > > Co-developed-by: Dominique Martinet <[email protected]>
> > > Signed-off-by: Dominique Martinet <[email protected]>
> > > Signed-off-by: Hao Xu <[email protected]>
> > > ---
> > >   include/uapi/linux/io_uring.h |  7 ++++
> > >   io_uring/fs.c                 | 60 +++++++++++++++++++++++++++++++++++
> > >   io_uring/fs.h                 |  3 ++
> > >   io_uring/opdef.c              |  8 +++++
> > >   4 files changed, 78 insertions(+)
> > > 
> > > diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> > > index 08720c7bd92f..6c0d521135a6 100644
> > > --- a/include/uapi/linux/io_uring.h
> > > +++ b/include/uapi/linux/io_uring.h
> > > @@ -65,6 +65,7 @@ struct io_uring_sqe {
> > >   		__u32		xattr_flags;
> > >   		__u32		msg_ring_flags;
> > >   		__u32		uring_cmd_flags;
> > > +		__u32		getdents_flags;
> > >   	};
> > >   	__u64	user_data;	/* data to be passed back at completion time */
> > >   	/* pack this to avoid bogus arm OABI complaints */
> > > @@ -235,6 +236,7 @@ enum io_uring_op {
> > >   	IORING_OP_URING_CMD,
> > >   	IORING_OP_SEND_ZC,
> > >   	IORING_OP_SENDMSG_ZC,
> > > +	IORING_OP_GETDENTS,
> > >   	/* this goes last, obviously */
> > >   	IORING_OP_LAST,
> > > @@ -273,6 +275,11 @@ enum io_uring_op {
> > >    */
> > >   #define SPLICE_F_FD_IN_FIXED	(1U << 31) /* the last bit of __u32 */
> > > +/*
> > > + * sqe->getdents_flags
> > > + */
> > > +#define IORING_GETDENTS_REWIND	(1U << 0)
> > > +
> > >   /*
> > >    * POLL_ADD flags. Note that since sqe->poll_events is the flag space, the
> > >    * command flags for POLL_ADD are stored in sqe->len.
> > > diff --git a/io_uring/fs.c b/io_uring/fs.c
> > > index f6a69a549fd4..77f00577e09c 100644
> > > --- a/io_uring/fs.c
> > > +++ b/io_uring/fs.c
> > > @@ -47,6 +47,13 @@ struct io_link {
> > >   	int				flags;
> > >   };
> > > +struct io_getdents {
> > > +	struct file			*file;
> > > +	struct linux_dirent64 __user	*dirent;
> > > +	unsigned int			count;
> > > +	int				flags;
> > > +};
> > > +
> > >   int io_renameat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
> > >   {
> > >   	struct io_rename *ren = io_kiocb_to_cmd(req, struct io_rename);
> > > @@ -291,3 +298,56 @@ void io_link_cleanup(struct io_kiocb *req)
> > >   	putname(sl->oldpath);
> > >   	putname(sl->newpath);
> > >   }
> > > +
> > > +int io_getdents_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
> > > +{
> > > +	struct io_getdents *gd = io_kiocb_to_cmd(req, struct io_getdents);
> > > +
> > > +	if (READ_ONCE(sqe->off) != 0)
> > > +		return -EINVAL;
> > > +
> > > +	gd->dirent = u64_to_user_ptr(READ_ONCE(sqe->addr));
> > > +	gd->count = READ_ONCE(sqe->len);
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +int io_getdents(struct io_kiocb *req, unsigned int issue_flags)
> > > +{
> > > +	struct io_getdents *gd = io_kiocb_to_cmd(req, struct io_getdents);
> > > +	struct file *file;
> > > +	unsigned long getdents_flags = 0;
> > > +	bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
> > > +	bool should_lock = false;
> > > +	int ret;
> > > +
> > > +	if (force_nonblock) {
> > > +		if (!(req->file->f_mode & FMODE_NOWAIT))
> > > +			return -EAGAIN;
> > > +
> > > +		getdents_flags = DIR_CONTEXT_F_NOWAIT;
> > 
> > I mentioned this on the other patch but it seems really pointless to
> > have that extra flag. I would really like to hear a good reason for
> > this.
> > 
> > > +	}
> > > +
> > > +	file = req->file;
> > > +	if (file && (file->f_mode & FMODE_ATOMIC_POS)) {
> > > +		if (file_count(file) > 1)
> > 
> > Assume we have a regular non-threaded process that just opens an fd to a
> > file. The process registers an async readdir request via that fd for the
> > file with io_uring and goes to do other stuff while waiting for the
> > result.
> > 
> > Some time later, io_uring gets to io_getdents() and the task is still
> > single threaded and the file hasn't been shared in the meantime. So
> > io_getdents() doesn't take the lock and starts the readdir() call.
> > 
> > Concurrently, the process that registered the io_uring request was free
> > to do other stuff and issued a synchronous readdir() system call which
> > calls fdget_pos(). Since the fdtable still isn't shared it doesn't
> > increment f_count and doesn't acquire the mutex. Now there's another
> > concurrent readdir() going on.
> > 
> > (Similar thing can happen if the process creates a thread for example.)
> > 
> > Two readdir() requests now proceed concurrently which is not intended.
> > Now to verify that this race can't happen with io_uring:
> > 
> > * regular fds:
> >    It seems that io_uring calls fget() on each regular file descriptor
> >    when an async request is registered. So that means that io_uring
> >    always hold its own explicit reference here.
> >    So as long as the original task is alive or another thread is alive
> >    f_count is guaranteed to be > 1 and so the mutex would always be
> >    acquired.
> > 
> >    If the registering process dies right before io_uring gets to the
> >    io_getdents() request no other process can steal the fd anymore and in
> >    that case the readdir call would not lock. But that's fine.
> > 
> > * fixed fds:
> >    I don't know the reference counting rules here. io_uring would need to
> >    ensure that it's impossible for two async readdir requests via a fixed
> >    fd to race because f_count is == 1.
> > 
> >    Iiuc, if a process registers a file it opened as a fixed file and
> >    immediately closes the fd afterwards - without anyone else holding a
> >    reference to that file - and only uses the fixed fd going forward, the
> >    f_count of that file in io_uring's fixed file table is always 1.
> > 
> >    So one could issue any number of concurrent readdir requests with no
> >    mutual exclusion. So for fixed files there definitely is a race, no?
> 
> Hi Christian,
> The ref logic for fixed file is that it does fdget() when registering

It absolutely can't be the case that io_uring uses fdget()/fdput() for
long-term file references. fdget() internally use __fget_light() which
avoids taking a reference on the file if the file table isn't shared. So
should that file be stashed anywhere for async work its a UAF waiting to
happen.

> the file, and fdput() when unregistering it. So the ref in between is
> always > 1. The fixed file feature is to reduce frequent fdget/fdput,
> but it does call them at the register/unregister time.

So consider:

// Caller opens some file.
fd_register = open("/some/file", ...); // f_count == 1

// Caller registers that file as a fixed file
IORING_REGISTER_FILES
-> io_sqe_files_register()
   -> fget(fd_register) // f_count == 2
   -> io_fixed_file_set()

// Caller trades regular fd reference for fixed file reference completely.
close(fd_register);
-> close_fd(fd_register)
   -> file = pick_file()
   -> filp_close(file)
      -> fput(file)    // f_count == 1


// Caller spawns a second thread. Both treads issue async getdents via
// fixed file.
T1                                              T2
IORING_OP_GETDENTS                              IORING_OP_GETDENTS

// At some point io_assign_file() must be called which has:

          if (req->flags & REQ_F_FIXED_FILE)
                  req->file = io_file_get_fixed(req, req->cqe.fd, issue_flags);
          else
                  req->file = io_file_get_normal(req, req->cqe.fd);

// Since this is REQ_F_FIXED_FILE f_count == 1

if (file && (file->f_mode & FMODE_ATOMIC_POS)) {
        if (file_count(file) > 1)

// No lock is taken; T1 and T2 issue getdents concurrently without any
// locking. -> race on f_pos

I'm happy to be convinced that this is safe, but please someone explain
in detail why this can't happen and where that extra f_count reference
for fixed files that this code wants to rely on is coming from.

Afaik, the whole point is that fixed files don't ever call fget()/fput()
after having been registered anymore. Consequently, f_count should be 1
once io_uring has taken full ownership of the file and the file can only
be addressed via a fixed file reference.

> 
> 
> > 
> > All of that could ofc be simplified if we could just always acquire the
> > mutex in fdget_pos() and other places and drop that file_count(file) > 1
> > optimization everywhere. But I have no idea if the optimization for not
> > acquiring the mutex if f_count == 1 is worth it?
> > 
> > I hope I didn't confuse myself here.
> > 
> > Jens, do yo have any input here?
> > 
> > > +			should_lock = true;
> > > +	}
> > > +	if (should_lock) {
> > > +		if (!force_nonblock)
> > > +			mutex_lock(&file->f_pos_lock);
> > > +		else if (!mutex_trylock(&file->f_pos_lock))
> > > +			return -EAGAIN;
> > > +	}
> > 
> > Open-coding this seems extremely brittle with an invitation for subtle
> > bugs.
> 
> Could you elaborate on this, I'm not sure that I understand it quite
> well. Sorry for my poor English.

No need to apologize. I'm wondering whether this should be moved into a
tiny helper and actually be exposed via a vfs header if we go this
route is all.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 3/3] io_uring: add support for getdents
  2023-07-13  7:10       ` Christian Brauner
@ 2023-07-13  9:06         ` Hao Xu
  2023-07-13 15:14           ` Christian Brauner
  0 siblings, 1 reply; 35+ messages in thread
From: Hao Xu @ 2023-07-13  9:06 UTC (permalink / raw)
  To: Christian Brauner
  Cc: io-uring, Jens Axboe, Dominique Martinet, Pavel Begunkov,
	Alexander Viro, Stefan Roesch, Clay Harris, Dave Chinner,
	linux-fsdevel, Wanpeng Li

Hi Christian,

On 7/13/23 15:10, Christian Brauner wrote:
> On Thu, Jul 13, 2023 at 12:35:07PM +0800, Hao Xu wrote:
>> On 7/12/23 23:27, Christian Brauner wrote:
>>> On Tue, Jul 11, 2023 at 07:40:27PM +0800, Hao Xu wrote:
>>>> From: Hao Xu <[email protected]>
>>>>
>>>> This add support for getdents64 to io_uring, acting exactly like the
>>>> syscall: the directory is iterated from it's current's position as
>>>> stored in the file struct, and the file's position is updated exactly as
>>>> if getdents64 had been called.
>>>>
>>>> For filesystems that support NOWAIT in iterate_shared(), try to use it
>>>> first; if a user already knows the filesystem they use do not support
>>>> nowait they can force async through IOSQE_ASYNC in the sqe flags,
>>>> avoiding the need to bounce back through a useless EAGAIN return.
>>>>
>>>> Co-developed-by: Dominique Martinet <[email protected]>
>>>> Signed-off-by: Dominique Martinet <[email protected]>
>>>> Signed-off-by: Hao Xu <[email protected]>
>>>> ---
>>>>    include/uapi/linux/io_uring.h |  7 ++++
>>>>    io_uring/fs.c                 | 60 +++++++++++++++++++++++++++++++++++
>>>>    io_uring/fs.h                 |  3 ++
>>>>    io_uring/opdef.c              |  8 +++++
>>>>    4 files changed, 78 insertions(+)
>>>>
>>>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
>>>> index 08720c7bd92f..6c0d521135a6 100644
>>>> --- a/include/uapi/linux/io_uring.h
>>>> +++ b/include/uapi/linux/io_uring.h
>>>> @@ -65,6 +65,7 @@ struct io_uring_sqe {
>>>>    		__u32		xattr_flags;
>>>>    		__u32		msg_ring_flags;
>>>>    		__u32		uring_cmd_flags;
>>>> +		__u32		getdents_flags;
>>>>    	};
>>>>    	__u64	user_data;	/* data to be passed back at completion time */
>>>>    	/* pack this to avoid bogus arm OABI complaints */
>>>> @@ -235,6 +236,7 @@ enum io_uring_op {
>>>>    	IORING_OP_URING_CMD,
>>>>    	IORING_OP_SEND_ZC,
>>>>    	IORING_OP_SENDMSG_ZC,
>>>> +	IORING_OP_GETDENTS,
>>>>    	/* this goes last, obviously */
>>>>    	IORING_OP_LAST,
>>>> @@ -273,6 +275,11 @@ enum io_uring_op {
>>>>     */
>>>>    #define SPLICE_F_FD_IN_FIXED	(1U << 31) /* the last bit of __u32 */
>>>> +/*
>>>> + * sqe->getdents_flags
>>>> + */
>>>> +#define IORING_GETDENTS_REWIND	(1U << 0)
>>>> +
>>>>    /*
>>>>     * POLL_ADD flags. Note that since sqe->poll_events is the flag space, the
>>>>     * command flags for POLL_ADD are stored in sqe->len.
>>>> diff --git a/io_uring/fs.c b/io_uring/fs.c
>>>> index f6a69a549fd4..77f00577e09c 100644
>>>> --- a/io_uring/fs.c
>>>> +++ b/io_uring/fs.c
>>>> @@ -47,6 +47,13 @@ struct io_link {
>>>>    	int				flags;
>>>>    };
>>>> +struct io_getdents {
>>>> +	struct file			*file;
>>>> +	struct linux_dirent64 __user	*dirent;
>>>> +	unsigned int			count;
>>>> +	int				flags;
>>>> +};
>>>> +
>>>>    int io_renameat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>>>    {
>>>>    	struct io_rename *ren = io_kiocb_to_cmd(req, struct io_rename);
>>>> @@ -291,3 +298,56 @@ void io_link_cleanup(struct io_kiocb *req)
>>>>    	putname(sl->oldpath);
>>>>    	putname(sl->newpath);
>>>>    }
>>>> +
>>>> +int io_getdents_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>>> +{
>>>> +	struct io_getdents *gd = io_kiocb_to_cmd(req, struct io_getdents);
>>>> +
>>>> +	if (READ_ONCE(sqe->off) != 0)
>>>> +		return -EINVAL;
>>>> +
>>>> +	gd->dirent = u64_to_user_ptr(READ_ONCE(sqe->addr));
>>>> +	gd->count = READ_ONCE(sqe->len);
>>>> +
>>>> +	return 0;
>>>> +}
>>>> +
>>>> +int io_getdents(struct io_kiocb *req, unsigned int issue_flags)
>>>> +{
>>>> +	struct io_getdents *gd = io_kiocb_to_cmd(req, struct io_getdents);
>>>> +	struct file *file;
>>>> +	unsigned long getdents_flags = 0;
>>>> +	bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
>>>> +	bool should_lock = false;
>>>> +	int ret;
>>>> +
>>>> +	if (force_nonblock) {
>>>> +		if (!(req->file->f_mode & FMODE_NOWAIT))
>>>> +			return -EAGAIN;
>>>> +
>>>> +		getdents_flags = DIR_CONTEXT_F_NOWAIT;
>>>
>>> I mentioned this on the other patch but it seems really pointless to
>>> have that extra flag. I would really like to hear a good reason for
>>> this.
>>>
>>>> +	}
>>>> +
>>>> +	file = req->file;
>>>> +	if (file && (file->f_mode & FMODE_ATOMIC_POS)) {
>>>> +		if (file_count(file) > 1)
>>>
>>> Assume we have a regular non-threaded process that just opens an fd to a
>>> file. The process registers an async readdir request via that fd for the
>>> file with io_uring and goes to do other stuff while waiting for the
>>> result.
>>>
>>> Some time later, io_uring gets to io_getdents() and the task is still
>>> single threaded and the file hasn't been shared in the meantime. So
>>> io_getdents() doesn't take the lock and starts the readdir() call.
>>>
>>> Concurrently, the process that registered the io_uring request was free
>>> to do other stuff and issued a synchronous readdir() system call which
>>> calls fdget_pos(). Since the fdtable still isn't shared it doesn't
>>> increment f_count and doesn't acquire the mutex. Now there's another
>>> concurrent readdir() going on.
>>>
>>> (Similar thing can happen if the process creates a thread for example.)
>>>
>>> Two readdir() requests now proceed concurrently which is not intended.
>>> Now to verify that this race can't happen with io_uring:
>>>
>>> * regular fds:
>>>     It seems that io_uring calls fget() on each regular file descriptor
>>>     when an async request is registered. So that means that io_uring
>>>     always hold its own explicit reference here.
>>>     So as long as the original task is alive or another thread is alive
>>>     f_count is guaranteed to be > 1 and so the mutex would always be
>>>     acquired.
>>>
>>>     If the registering process dies right before io_uring gets to the
>>>     io_getdents() request no other process can steal the fd anymore and in
>>>     that case the readdir call would not lock. But that's fine.
>>>
>>> * fixed fds:
>>>     I don't know the reference counting rules here. io_uring would need to
>>>     ensure that it's impossible for two async readdir requests via a fixed
>>>     fd to race because f_count is == 1.
>>>
>>>     Iiuc, if a process registers a file it opened as a fixed file and
>>>     immediately closes the fd afterwards - without anyone else holding a
>>>     reference to that file - and only uses the fixed fd going forward, the
>>>     f_count of that file in io_uring's fixed file table is always 1.
>>>
>>>     So one could issue any number of concurrent readdir requests with no
>>>     mutual exclusion. So for fixed files there definitely is a race, no?
>>
>> Hi Christian,
>> The ref logic for fixed file is that it does fdget() when registering
> 
> It absolutely can't be the case that io_uring uses fdget()/fdput() for
> long-term file references. fdget() internally use __fget_light() which
> avoids taking a reference on the file if the file table isn't shared. So
> should that file be stashed anywhere for async work its a UAF waiting to
> happen.
> 

Yes, I typed the wrong name, should be fget() not fdget().

>> the file, and fdput() when unregistering it. So the ref in between is
>> always > 1. The fixed file feature is to reduce frequent fdget/fdput,
>> but it does call them at the register/unregister time.
> 
> So consider:
> 
> // Caller opens some file.
> fd_register = open("/some/file", ...); // f_count == 1
> 
> // Caller registers that file as a fixed file
> IORING_REGISTER_FILES
> -> io_sqe_files_register()
>     -> fget(fd_register) // f_count == 2
>     -> io_fixed_file_set()
> 
> // Caller trades regular fd reference for fixed file reference completely.
> close(fd_register);
> -> close_fd(fd_register)
>     -> file = pick_file()
>     -> filp_close(file)
>        -> fput(file)    // f_count == 1
> 
> 
> // Caller spawns a second thread. Both treads issue async getdents via
> // fixed file.
> T1                                              T2
> IORING_OP_GETDENTS                              IORING_OP_GETDENTS
> 
> // At some point io_assign_file() must be called which has:
> 
>            if (req->flags & REQ_F_FIXED_FILE)
>                    req->file = io_file_get_fixed(req, req->cqe.fd, issue_flags);
>            else
>                    req->file = io_file_get_normal(req, req->cqe.fd);
> 
> // Since this is REQ_F_FIXED_FILE f_count == 1
> 
> if (file && (file->f_mode & FMODE_ATOMIC_POS)) {
>          if (file_count(file) > 1)
> 
> // No lock is taken; T1 and T2 issue getdents concurrently without any
> // locking. -> race on f_pos
> 
> I'm happy to be convinced that this is safe, but please someone explain
> in detail why this can't happen and where that extra f_count reference
> for fixed files that this code wants to rely on is coming from.
> 
> Afaik, the whole point is that fixed files don't ever call fget()/fput()
> after having been registered anymore. Consequently, f_count should be 1
> once io_uring has taken full ownership of the file and the file can only
> be addressed via a fixed file reference.

Thanks for explanation, I now realize it's an issue, even for non-fixed 
files when io_uring takes full ownership. for example:

io_uring submit a getdents          --> f_count == 2, get the lock
nowait submission fails             --> f_count == 2, release the lock
punt it to io-wq thread and return to userspace
close(fd)                           --> f_count == 1
call sync getdents64                --> doing getdents without lock
the io-wq thread begins to run      --> f_count == 1, doing getdents
                                         without lock.

Though this looks like a silly use case but users can do that anyway.

How about remove this f_count > 1 small optimization in io_uring and 
always get the lock, looks like it makes big trouble for async
situation. and there may often be parallel io_uring getdents in the
same time for a file [1], it may be not very meaningful to do this
file count optimization.

[1] I believe users will issue multiple async getdents at same time 
rather than issue them one by one to get better performance.

Thanks,
Hao

> 
>>
>>
>>>
>>> All of that could ofc be simplified if we could just always acquire the
>>> mutex in fdget_pos() and other places and drop that file_count(file) > 1
>>> optimization everywhere. But I have no idea if the optimization for not
>>> acquiring the mutex if f_count == 1 is worth it?
>>>
>>> I hope I didn't confuse myself here.
>>>
>>> Jens, do yo have any input here?
>>>
>>>> +			should_lock = true;
>>>> +	}
>>>> +	if (should_lock) {
>>>> +		if (!force_nonblock)
>>>> +			mutex_lock(&file->f_pos_lock);
>>>> +		else if (!mutex_trylock(&file->f_pos_lock))
>>>> +			return -EAGAIN;
>>>> +	}
>>>
>>> Open-coding this seems extremely brittle with an invitation for subtle
>>> bugs.
>>
>> Could you elaborate on this, I'm not sure that I understand it quite
>> well. Sorry for my poor English.
> 
> No need to apologize. I'm wondering whether this should be moved into a
> tiny helper and actually be exposed via a vfs header if we go this
> route is all.



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 3/3] io_uring: add support for getdents
  2023-07-13  9:06         ` Hao Xu
@ 2023-07-13 15:14           ` Christian Brauner
  2023-07-16 11:57             ` Hao Xu
  0 siblings, 1 reply; 35+ messages in thread
From: Christian Brauner @ 2023-07-13 15:14 UTC (permalink / raw)
  To: Hao Xu
  Cc: io-uring, Jens Axboe, Dominique Martinet, Pavel Begunkov,
	Alexander Viro, Stefan Roesch, Clay Harris, Dave Chinner,
	linux-fsdevel, Wanpeng Li

On Thu, Jul 13, 2023 at 05:06:32PM +0800, Hao Xu wrote:
> Hi Christian,
> 
> On 7/13/23 15:10, Christian Brauner wrote:
> > On Thu, Jul 13, 2023 at 12:35:07PM +0800, Hao Xu wrote:
> > > On 7/12/23 23:27, Christian Brauner wrote:
> > > > On Tue, Jul 11, 2023 at 07:40:27PM +0800, Hao Xu wrote:
> > > > > From: Hao Xu <[email protected]>
> > > > > 
> > > > > This add support for getdents64 to io_uring, acting exactly like the
> > > > > syscall: the directory is iterated from it's current's position as
> > > > > stored in the file struct, and the file's position is updated exactly as
> > > > > if getdents64 had been called.
> > > > > 
> > > > > For filesystems that support NOWAIT in iterate_shared(), try to use it
> > > > > first; if a user already knows the filesystem they use do not support
> > > > > nowait they can force async through IOSQE_ASYNC in the sqe flags,
> > > > > avoiding the need to bounce back through a useless EAGAIN return.
> > > > > 
> > > > > Co-developed-by: Dominique Martinet <[email protected]>
> > > > > Signed-off-by: Dominique Martinet <[email protected]>
> > > > > Signed-off-by: Hao Xu <[email protected]>
> > > > > ---
> > > > >    include/uapi/linux/io_uring.h |  7 ++++
> > > > >    io_uring/fs.c                 | 60 +++++++++++++++++++++++++++++++++++
> > > > >    io_uring/fs.h                 |  3 ++
> > > > >    io_uring/opdef.c              |  8 +++++
> > > > >    4 files changed, 78 insertions(+)
> > > > > 
> > > > > diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> > > > > index 08720c7bd92f..6c0d521135a6 100644
> > > > > --- a/include/uapi/linux/io_uring.h
> > > > > +++ b/include/uapi/linux/io_uring.h
> > > > > @@ -65,6 +65,7 @@ struct io_uring_sqe {
> > > > >    		__u32		xattr_flags;
> > > > >    		__u32		msg_ring_flags;
> > > > >    		__u32		uring_cmd_flags;
> > > > > +		__u32		getdents_flags;
> > > > >    	};
> > > > >    	__u64	user_data;	/* data to be passed back at completion time */
> > > > >    	/* pack this to avoid bogus arm OABI complaints */
> > > > > @@ -235,6 +236,7 @@ enum io_uring_op {
> > > > >    	IORING_OP_URING_CMD,
> > > > >    	IORING_OP_SEND_ZC,
> > > > >    	IORING_OP_SENDMSG_ZC,
> > > > > +	IORING_OP_GETDENTS,
> > > > >    	/* this goes last, obviously */
> > > > >    	IORING_OP_LAST,
> > > > > @@ -273,6 +275,11 @@ enum io_uring_op {
> > > > >     */
> > > > >    #define SPLICE_F_FD_IN_FIXED	(1U << 31) /* the last bit of __u32 */
> > > > > +/*
> > > > > + * sqe->getdents_flags
> > > > > + */
> > > > > +#define IORING_GETDENTS_REWIND	(1U << 0)
> > > > > +
> > > > >    /*
> > > > >     * POLL_ADD flags. Note that since sqe->poll_events is the flag space, the
> > > > >     * command flags for POLL_ADD are stored in sqe->len.
> > > > > diff --git a/io_uring/fs.c b/io_uring/fs.c
> > > > > index f6a69a549fd4..77f00577e09c 100644
> > > > > --- a/io_uring/fs.c
> > > > > +++ b/io_uring/fs.c
> > > > > @@ -47,6 +47,13 @@ struct io_link {
> > > > >    	int				flags;
> > > > >    };
> > > > > +struct io_getdents {
> > > > > +	struct file			*file;
> > > > > +	struct linux_dirent64 __user	*dirent;
> > > > > +	unsigned int			count;
> > > > > +	int				flags;
> > > > > +};
> > > > > +
> > > > >    int io_renameat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
> > > > >    {
> > > > >    	struct io_rename *ren = io_kiocb_to_cmd(req, struct io_rename);
> > > > > @@ -291,3 +298,56 @@ void io_link_cleanup(struct io_kiocb *req)
> > > > >    	putname(sl->oldpath);
> > > > >    	putname(sl->newpath);
> > > > >    }
> > > > > +
> > > > > +int io_getdents_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
> > > > > +{
> > > > > +	struct io_getdents *gd = io_kiocb_to_cmd(req, struct io_getdents);
> > > > > +
> > > > > +	if (READ_ONCE(sqe->off) != 0)
> > > > > +		return -EINVAL;
> > > > > +
> > > > > +	gd->dirent = u64_to_user_ptr(READ_ONCE(sqe->addr));
> > > > > +	gd->count = READ_ONCE(sqe->len);
> > > > > +
> > > > > +	return 0;
> > > > > +}
> > > > > +
> > > > > +int io_getdents(struct io_kiocb *req, unsigned int issue_flags)
> > > > > +{
> > > > > +	struct io_getdents *gd = io_kiocb_to_cmd(req, struct io_getdents);
> > > > > +	struct file *file;
> > > > > +	unsigned long getdents_flags = 0;
> > > > > +	bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
> > > > > +	bool should_lock = false;
> > > > > +	int ret;
> > > > > +
> > > > > +	if (force_nonblock) {
> > > > > +		if (!(req->file->f_mode & FMODE_NOWAIT))
> > > > > +			return -EAGAIN;
> > > > > +
> > > > > +		getdents_flags = DIR_CONTEXT_F_NOWAIT;
> > > > 
> > > > I mentioned this on the other patch but it seems really pointless to
> > > > have that extra flag. I would really like to hear a good reason for
> > > > this.
> > > > 
> > > > > +	}
> > > > > +
> > > > > +	file = req->file;
> > > > > +	if (file && (file->f_mode & FMODE_ATOMIC_POS)) {
> > > > > +		if (file_count(file) > 1)
> > > > 
> > > > Assume we have a regular non-threaded process that just opens an fd to a
> > > > file. The process registers an async readdir request via that fd for the
> > > > file with io_uring and goes to do other stuff while waiting for the
> > > > result.
> > > > 
> > > > Some time later, io_uring gets to io_getdents() and the task is still
> > > > single threaded and the file hasn't been shared in the meantime. So
> > > > io_getdents() doesn't take the lock and starts the readdir() call.
> > > > 
> > > > Concurrently, the process that registered the io_uring request was free
> > > > to do other stuff and issued a synchronous readdir() system call which
> > > > calls fdget_pos(). Since the fdtable still isn't shared it doesn't
> > > > increment f_count and doesn't acquire the mutex. Now there's another
> > > > concurrent readdir() going on.
> > > > 
> > > > (Similar thing can happen if the process creates a thread for example.)
> > > > 
> > > > Two readdir() requests now proceed concurrently which is not intended.
> > > > Now to verify that this race can't happen with io_uring:
> > > > 
> > > > * regular fds:
> > > >     It seems that io_uring calls fget() on each regular file descriptor
> > > >     when an async request is registered. So that means that io_uring
> > > >     always hold its own explicit reference here.
> > > >     So as long as the original task is alive or another thread is alive
> > > >     f_count is guaranteed to be > 1 and so the mutex would always be
> > > >     acquired.
> > > > 
> > > >     If the registering process dies right before io_uring gets to the
> > > >     io_getdents() request no other process can steal the fd anymore and in
> > > >     that case the readdir call would not lock. But that's fine.
> > > > 
> > > > * fixed fds:
> > > >     I don't know the reference counting rules here. io_uring would need to
> > > >     ensure that it's impossible for two async readdir requests via a fixed
> > > >     fd to race because f_count is == 1.
> > > > 
> > > >     Iiuc, if a process registers a file it opened as a fixed file and
> > > >     immediately closes the fd afterwards - without anyone else holding a
> > > >     reference to that file - and only uses the fixed fd going forward, the
> > > >     f_count of that file in io_uring's fixed file table is always 1.
> > > > 
> > > >     So one could issue any number of concurrent readdir requests with no
> > > >     mutual exclusion. So for fixed files there definitely is a race, no?
> > > 
> > > Hi Christian,
> > > The ref logic for fixed file is that it does fdget() when registering
> > 
> > It absolutely can't be the case that io_uring uses fdget()/fdput() for
> > long-term file references. fdget() internally use __fget_light() which
> > avoids taking a reference on the file if the file table isn't shared. So
> > should that file be stashed anywhere for async work its a UAF waiting to
> > happen.
> > 
> 
> Yes, I typed the wrong name, should be fget() not fdget().
> 
> > > the file, and fdput() when unregistering it. So the ref in between is
> > > always > 1. The fixed file feature is to reduce frequent fdget/fdput,
> > > but it does call them at the register/unregister time.
> > 
> > So consider:
> > 
> > // Caller opens some file.
> > fd_register = open("/some/file", ...); // f_count == 1
> > 
> > // Caller registers that file as a fixed file
> > IORING_REGISTER_FILES
> > -> io_sqe_files_register()
> >     -> fget(fd_register) // f_count == 2
> >     -> io_fixed_file_set()
> > 
> > // Caller trades regular fd reference for fixed file reference completely.
> > close(fd_register);
> > -> close_fd(fd_register)
> >     -> file = pick_file()
> >     -> filp_close(file)
> >        -> fput(file)    // f_count == 1
> > 
> > 
> > // Caller spawns a second thread. Both treads issue async getdents via
> > // fixed file.
> > T1                                              T2
> > IORING_OP_GETDENTS                              IORING_OP_GETDENTS
> > 
> > // At some point io_assign_file() must be called which has:
> > 
> >            if (req->flags & REQ_F_FIXED_FILE)
> >                    req->file = io_file_get_fixed(req, req->cqe.fd, issue_flags);
> >            else
> >                    req->file = io_file_get_normal(req, req->cqe.fd);
> > 
> > // Since this is REQ_F_FIXED_FILE f_count == 1
> > 
> > if (file && (file->f_mode & FMODE_ATOMIC_POS)) {
> >          if (file_count(file) > 1)
> > 
> > // No lock is taken; T1 and T2 issue getdents concurrently without any
> > // locking. -> race on f_pos
> > 
> > I'm happy to be convinced that this is safe, but please someone explain
> > in detail why this can't happen and where that extra f_count reference
> > for fixed files that this code wants to rely on is coming from.
> > 
> > Afaik, the whole point is that fixed files don't ever call fget()/fput()
> > after having been registered anymore. Consequently, f_count should be 1
> > once io_uring has taken full ownership of the file and the file can only
> > be addressed via a fixed file reference.
> 
> Thanks for explanation, I now realize it's an issue, even for non-fixed
> files when io_uring takes full ownership. for example:
> 
> io_uring submit a getdents          --> f_count == 2, get the lock
> nowait submission fails             --> f_count == 2, release the lock
> punt it to io-wq thread and return to userspace
> close(fd)                           --> f_count == 1
> call sync getdents64                --> doing getdents without lock
> the io-wq thread begins to run      --> f_count == 1, doing getdents
>                                         without lock.
> 
> Though this looks like a silly use case but users can do that anyway.
> 
> How about remove this f_count > 1 small optimization in io_uring and always
> get the lock, looks like it makes big trouble for async
> situation. and there may often be parallel io_uring getdents in the
> same time for a file [1], it may be not very meaningful to do this
> file count optimization.
> 
> [1] I believe users will issue multiple async getdents at same time rather
> than issue them one by one to get better performance.

Could someone with perf experience try and remove that f_count == 1
optimization from __fdget_pos() completely and make it always acquire
the mutex? I wonder what the performance impact of that is.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 3/3] io_uring: add support for getdents
  2023-07-13 15:14           ` Christian Brauner
@ 2023-07-16 11:57             ` Hao Xu
  2023-07-18  6:55               ` Hao Xu
  0 siblings, 1 reply; 35+ messages in thread
From: Hao Xu @ 2023-07-16 11:57 UTC (permalink / raw)
  To: Christian Brauner
  Cc: io-uring, Jens Axboe, Dominique Martinet, Pavel Begunkov,
	Alexander Viro, Stefan Roesch, Clay Harris, Dave Chinner,
	linux-fsdevel, Wanpeng Li

On 7/13/23 23:14, Christian Brauner wrote:

> Could someone with perf experience try and remove that f_count == 1
> optimization from __fdget_pos() completely and make it always acquire
> the mutex? I wonder what the performance impact of that is.

Hi Christian,
For your reference, I did a simple test: writed a c program that open a
directory which has 1000 empty files, then call sync getdents64 on it
repeatedly until we get all the entries. I run this program 10 times for
"with f_count==1
optimization" and "always do the lock" version.
Got below data:
with f_count==1:

time cost: 0.000379 

time cost: 0.000116 

time cost: 0.000090 

time cost: 0.000101 

time cost: 0.000095 

time cost: 0.000092 

time cost: 0.000092 

time cost: 0.000095 

time cost: 0.000092 

time cost: 0.000121 

time cost: 0.000092 
 

time cost avg: 0.00009859999999999998

always do the lock:
time cost: 0.000095 

time cost: 0.000099 

time cost: 0.000123 

time cost: 0.000124 

time cost: 0.000092 

time cost: 0.000099 

time cost: 0.000092 

time cost: 0.000092 

time cost: 0.000093 

time cost: 0.000094 
 
             time cost avg: 0.00010029999999999997

So about 1.724% increment

[1] the first run is not showed here since that try does real IO
     and diff a lot.
[2] the time cost calculation is by gettimeofday()
[3] run it in a VM which has 2 CPUs and 1GB memory.

Regards,
Hao

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 3/3] io_uring: add support for getdents
  2023-07-16 11:57             ` Hao Xu
@ 2023-07-18  6:55               ` Hao Xu
  0 siblings, 0 replies; 35+ messages in thread
From: Hao Xu @ 2023-07-18  6:55 UTC (permalink / raw)
  To: Christian Brauner
  Cc: io-uring, Jens Axboe, Dominique Martinet, Pavel Begunkov,
	Alexander Viro, Stefan Roesch, Clay Harris, Dave Chinner,
	linux-fsdevel, Wanpeng Li

On 7/16/23 19:57, Hao Xu wrote:
> On 7/13/23 23:14, Christian Brauner wrote:
> 
>> Could someone with perf experience try and remove that f_count == 1
>> optimization from __fdget_pos() completely and make it always acquire
>> the mutex? I wonder what the performance impact of that is.
> 
> Hi Christian,
> For your reference, I did a simple test: writed a c program that open a
> directory which has 1000 empty files, then call sync getdents64 on it
> repeatedly until we get all the entries. I run this program 10 times for
> "with f_count==1
> optimization" and "always do the lock" version.
> Got below data:
> with f_count==1:
> 
> time cost: 0.000379
> time cost: 0.000116
> time cost: 0.000090
> time cost: 0.000101
> time cost: 0.000095
> time cost: 0.000092
> time cost: 0.000092
> time cost: 0.000095
> time cost: 0.000092
> time cost: 0.000121
> time cost: 0.000092
> 
> time cost avg: 0.00009859999999999998
> 
> always do the lock:
> time cost: 0.000095
> time cost: 0.000099
> time cost: 0.000123
> time cost: 0.000124
> time cost: 0.000092
> time cost: 0.000099
> time cost: 0.000092
> time cost: 0.000092
> time cost: 0.000093
> time cost: 0.000094
>              time cost avg: 0.00010029999999999997
> 
> So about 1.724% increment
> 
> [1] the first run is not showed here since that try does real IO
>      and diff a lot.
> [2] the time cost calculation is by gettimeofday()
> [3] run it in a VM which has 2 CPUs and 1GB memory.
> 
> Regards,
> Hao

Did another similar test for more times(100 rounds), about 1.4%
increment. How about:
if CONFIG_IO_URING: remove the f_count==1 logic
else: do the old logic.

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2023-07-18  6:55 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-07-11 11:40 [PATCH v3 0/3] io_uring getdents Hao Xu
2023-07-11 11:40 ` [PATCH 1/3] fs: split off vfs_getdents function of getdents64 syscall Hao Xu
2023-07-11 13:02   ` Ammar Faizi
2023-07-12  8:03     ` Hao Xu
2023-07-12 13:55       ` Ammar Faizi
2023-07-13  4:17         ` Hao Xu
2023-07-11 23:41   ` Dave Chinner
2023-07-11 23:50     ` Jens Axboe
2023-07-12 11:14       ` Christian Brauner
2023-07-11 11:40 ` [PATCH 2/3] vfs_getdents/struct dir_context: add flags field Hao Xu
2023-07-12 11:31   ` Christian Brauner
2023-07-12 16:02     ` Dominique Martinet
2023-07-13  4:12       ` Hao Xu
2023-07-11 11:40 ` [PATCH 3/3] io_uring: add support for getdents Hao Xu
2023-07-11 12:15   ` Dominique Martinet
2023-07-12  7:53     ` Hao Xu
2023-07-12 16:10       ` Dominique Martinet
2023-07-13  4:05         ` Hao Xu
2023-07-13  4:40           ` Hao Xu
2023-07-13  4:50           ` Dominique Martinet
2023-07-12  8:01     ` Hao Xu
2023-07-12 15:27   ` Christian Brauner
2023-07-13  4:35     ` Hao Xu
2023-07-13  7:10       ` Christian Brauner
2023-07-13  9:06         ` Hao Xu
2023-07-13 15:14           ` Christian Brauner
2023-07-16 11:57             ` Hao Xu
2023-07-18  6:55               ` Hao Xu
2023-07-11 23:47 ` [PATCH v3 0/3] io_uring getdents Dave Chinner
2023-07-11 23:51   ` Jens Axboe
2023-07-12  0:53     ` Dominique Martinet
2023-07-12  0:56       ` Jens Axboe
2023-07-12  3:16         ` Hao Xu
2023-07-12  3:12       ` Hao Xu
2023-07-12  3:19   ` Hao Xu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox