[PATCH v4 0/5] io_uring getdents

public inbox for [email protected]
 help / color / mirror / Atom feed

* [PATCH v4 0/5] io_uring getdents
@ 2023-07-18 13:21 Hao Xu
  2023-07-18 13:21 ` [PATCH 1/5] fs: split off vfs_getdents function of getdents64 syscall Hao Xu
                   ` (5 more replies)
  0 siblings, 6 replies; 41+ messages in thread
From: Hao Xu @ 2023-07-18 13:21 UTC (permalink / raw)
  To: io-uring, Jens Axboe
  Cc: Dominique Martinet, Pavel Begunkov, Christian Brauner,
	Alexander Viro, Stefan Roesch, Clay Harris, Dave Chinner,
	linux-fsdevel, Wanpeng Li

From: Hao Xu <[email protected]>

This series introduce getdents64 to io_uring, the code logic is similar
with the snychronized version's. It first try nowait issue, and offload
it to io-wq threads if the first try fails.

Tested it with a liburing case:
https://github.com/HowHsu/liburing/blob/getdents/test/getdents2.c

The test is controlled by the below script[2] which runs getdents2.t 100
times and calulate the avg.
The result show that io_uring version is about 3% faster:

python3 run_getdents.py
    Average of sync: 0.1036849
    Average of iouring: 0.1005568

(0.1036849-0.1005568)/0.1036849 = 3.017%

note:
[1] the number of getdents call/request in io_uring and normal sync version
are made sure to be same beforehand.

[2] run_getdents.py

```python3

import subprocess

N = 100
sum = 0.0
args = ["/data/home/howeyxu/tmpdir", "sync"]

for i in range(N):
    output = subprocess.check_output(["./liburing/test/getdents2.t"] + args)
    sum += float(output)

average = sum / N
print("Average of sync:", average)

sum = 0.0
args = ["/data/home/howeyxu/tmpdir", "iouring"]

for i in range(N):
    output = subprocess.check_output(["./liburing/test/getdents2.t"] + args)
    sum += float(output)

average = sum / N
print("Average of iouring:", average)

```

v3->v4:
 - add Dave's xfs nowait code and fix a deadlock problem, with some code
   style tweak.
 - disable fixed file to avoid a race problem for now
 - add a test program.

v2->v3:
 - removed the kernfs patches
 - add f_pos_lock logic
 - remove the "reduce last EOF getdents try" optimization since
   Dominique reports that doesn't make difference
 - remove the rewind logic, I think the right way is to introduce lseek
   to io_uring not to patch this logic to getdents.
 - add Singed-off-by of Stefan Roesch for patch 1 since checkpatch
   complained that Co-developed-by someone should be accompanied with
   Signed-off-by same person, I can remove them if Stefan thinks that's
   not proper.


Dominique Martinet (1):
  fs: split off vfs_getdents function of getdents64 syscall

Hao Xu (4):
  vfs_getdents/struct dir_context: add flags field
  io_uring: add support for getdents
  xfs: add NOWAIT semantics for readdir
  disable fixed file for io_uring getdents for now

 fs/internal.h                  |  8 +++++
 fs/readdir.c                   | 36 ++++++++++++++++-----
 fs/xfs/libxfs/xfs_da_btree.c   | 16 ++++++++++
 fs/xfs/libxfs/xfs_da_btree.h   |  1 +
 fs/xfs/libxfs/xfs_dir2_block.c |  7 ++--
 fs/xfs/libxfs/xfs_dir2_priv.h  |  2 +-
 fs/xfs/scrub/dir.c             |  2 +-
 fs/xfs/scrub/readdir.c         |  2 +-
 fs/xfs/xfs_dir2_readdir.c      | 58 +++++++++++++++++++++++++++-------
 fs/xfs/xfs_inode.c             | 17 ++++++++++
 fs/xfs/xfs_inode.h             | 15 +++++----
 include/linux/fs.h             |  8 +++++
 include/uapi/linux/io_uring.h  |  7 ++++
 io_uring/fs.c                  | 57 +++++++++++++++++++++++++++++++++
 io_uring/fs.h                  |  3 ++
 io_uring/opdef.c               |  8 +++++
 16 files changed, 215 insertions(+), 32 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 1/5] fs: split off vfs_getdents function of getdents64 syscall
  2023-07-18 13:21 [PATCH v4 0/5] io_uring getdents Hao Xu
@ 2023-07-18 13:21 ` Hao Xu
  2023-07-18 13:21 ` [PATCH 2/5] vfs_getdents/struct dir_context: add flags field Hao Xu
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 41+ messages in thread
From: Hao Xu @ 2023-07-18 13:21 UTC (permalink / raw)
  To: io-uring, Jens Axboe
  Cc: Dominique Martinet, Pavel Begunkov, Christian Brauner,
	Alexander Viro, Stefan Roesch, Clay Harris, Dave Chinner,
	linux-fsdevel, Wanpeng Li

From: Dominique Martinet <[email protected]>

This splits off the vfs_getdents function from the getdents64 system
call.
This will allow io_uring to call the vfs_getdents function.

Co-developed-by: Stefan Roesch <[email protected]>
Signed-off-by: Stefan Roesch <[email protected]>
Signed-off-by: Dominique Martinet <[email protected]>
Signed-off-by: Hao Xu <[email protected]>
---
 fs/internal.h |  8 ++++++++
 fs/readdir.c  | 34 ++++++++++++++++++++++++++--------
 2 files changed, 34 insertions(+), 8 deletions(-)

diff --git a/fs/internal.h b/fs/internal.h
index f7a3dc111026..b1f66e52d61b 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -304,3 +304,11 @@ ssize_t __kernel_write_iter(struct file *file, struct iov_iter *from, loff_t *po
 struct mnt_idmap *alloc_mnt_idmap(struct user_namespace *mnt_userns);
 struct mnt_idmap *mnt_idmap_get(struct mnt_idmap *idmap);
 void mnt_idmap_put(struct mnt_idmap *idmap);
+
+/*
+ * fs/readdir.c
+ */
+struct linux_dirent64;
+
+int vfs_getdents(struct file *file, struct linux_dirent64 __user *dirent,
+		 unsigned int count);
diff --git a/fs/readdir.c b/fs/readdir.c
index b264ce60114d..9592259b7e7f 100644
--- a/fs/readdir.c
+++ b/fs/readdir.c
@@ -21,6 +21,7 @@
 #include <linux/unistd.h>
 #include <linux/compat.h>
 #include <linux/uaccess.h>
+#include "internal.h"
 
 #include <asm/unaligned.h>
 
@@ -351,10 +352,16 @@ static bool filldir64(struct dir_context *ctx, const char *name, int namlen,
 	return false;
 }
 
-SYSCALL_DEFINE3(getdents64, unsigned int, fd,
-		struct linux_dirent64 __user *, dirent, unsigned int, count)
+
+/**
+ * vfs_getdents - getdents without fdget
+ * @file    : pointer to file struct of directory
+ * @dirent  : pointer to user directory structure
+ * @count   : size of buffer
+ */
+int vfs_getdents(struct file *file, struct linux_dirent64 __user *dirent,
+		 unsigned int count)
 {
-	struct fd f;
 	struct getdents_callback64 buf = {
 		.ctx.actor = filldir64,
 		.count = count,
@@ -362,11 +369,7 @@ SYSCALL_DEFINE3(getdents64, unsigned int, fd,
 	};
 	int error;
 
-	f = fdget_pos(fd);
-	if (!f.file)
-		return -EBADF;
-
-	error = iterate_dir(f.file, &buf.ctx);
+	error = iterate_dir(file, &buf.ctx);
 	if (error >= 0)
 		error = buf.error;
 	if (buf.prev_reclen) {
@@ -379,6 +382,21 @@ SYSCALL_DEFINE3(getdents64, unsigned int, fd,
 		else
 			error = count - buf.count;
 	}
+	return error;
+}
+
+SYSCALL_DEFINE3(getdents64, unsigned int, fd,
+		struct linux_dirent64 __user *, dirent, unsigned int, count)
+{
+	struct fd f;
+	int error;
+
+	f = fdget_pos(fd);
+	if (!f.file)
+		return -EBADF;
+
+	error = vfs_getdents(f.file, dirent, count);
+
 	fdput_pos(f);
 	return error;
 }
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 2/5] vfs_getdents/struct dir_context: add flags field
  2023-07-18 13:21 [PATCH v4 0/5] io_uring getdents Hao Xu
  2023-07-18 13:21 ` [PATCH 1/5] fs: split off vfs_getdents function of getdents64 syscall Hao Xu
@ 2023-07-18 13:21 ` Hao Xu
  2023-07-18 13:21 ` [PATCH 3/5] io_uring: add support for getdents Hao Xu
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 41+ messages in thread
From: Hao Xu @ 2023-07-18 13:21 UTC (permalink / raw)
  To: io-uring, Jens Axboe
  Cc: Dominique Martinet, Pavel Begunkov, Christian Brauner,
	Alexander Viro, Stefan Roesch, Clay Harris, Dave Chinner,
	linux-fsdevel, Wanpeng Li

From: Hao Xu <[email protected]>

The flags will allow passing DIR_CONTEXT_F_NOWAIT to iterate()
implementations that support it (as signaled through FMODE_NWAIT
in file->f_mode)

Notes:
- considered using IOCB_NOWAIT but if we add more flags later it
would be confusing to keep track of which values are valid, use
dedicated flags
- might want to check ctx.flags & DIR_CONTEXT_F_NOWAIT is only set
when file->f_mode & FMODE_NOWAIT in iterate_dir() as e.g. WARN_ONCE?

Co-developed-by: Dominique Martinet <[email protected]>
Signed-off-by: Dominique Martinet <[email protected]>
Signed-off-by: Hao Xu <[email protected]>
---
 fs/internal.h      | 2 +-
 fs/readdir.c       | 6 ++++--
 include/linux/fs.h | 8 ++++++++
 3 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/fs/internal.h b/fs/internal.h
index b1f66e52d61b..7508d485c655 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -311,4 +311,4 @@ void mnt_idmap_put(struct mnt_idmap *idmap);
 struct linux_dirent64;
 
 int vfs_getdents(struct file *file, struct linux_dirent64 __user *dirent,
-		 unsigned int count);
+		 unsigned int count, unsigned long flags);
diff --git a/fs/readdir.c b/fs/readdir.c
index 9592259b7e7f..b80caf4c9321 100644
--- a/fs/readdir.c
+++ b/fs/readdir.c
@@ -358,12 +358,14 @@ static bool filldir64(struct dir_context *ctx, const char *name, int namlen,
  * @file    : pointer to file struct of directory
  * @dirent  : pointer to user directory structure
  * @count   : size of buffer
+ * @flags   : additional dir_context flags
  */
 int vfs_getdents(struct file *file, struct linux_dirent64 __user *dirent,
-		 unsigned int count)
+		 unsigned int count, unsigned long flags)
 {
 	struct getdents_callback64 buf = {
 		.ctx.actor = filldir64,
+		.ctx.flags = flags,
 		.count = count,
 		.current_dir = dirent
 	};
@@ -395,7 +397,7 @@ SYSCALL_DEFINE3(getdents64, unsigned int, fd,
 	if (!f.file)
 		return -EBADF;
 
-	error = vfs_getdents(f.file, dirent, count);
+	error = vfs_getdents(f.file, dirent, count, 0);
 
 	fdput_pos(f);
 	return error;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6867512907d6..f3e315e8efdd 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1719,8 +1719,16 @@ typedef bool (*filldir_t)(struct dir_context *, const char *, int, loff_t, u64,
 struct dir_context {
 	filldir_t actor;
 	loff_t pos;
+	unsigned long flags;
 };
 
+/*
+ * flags for dir_context flags
+ * DIR_CONTEXT_F_NOWAIT: Request non-blocking iterate
+ *                       (requires file->f_mode & FMODE_NOWAIT)
+ */
+#define DIR_CONTEXT_F_NOWAIT	(1 << 0)
+
 /*
  * These flags let !MMU mmap() govern direct device mapping vs immediate
  * copying more easily for MAP_PRIVATE, especially for ROM filesystems.
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 3/5] io_uring: add support for getdents
  2023-07-18 13:21 [PATCH v4 0/5] io_uring getdents Hao Xu
  2023-07-18 13:21 ` [PATCH 1/5] fs: split off vfs_getdents function of getdents64 syscall Hao Xu
  2023-07-18 13:21 ` [PATCH 2/5] vfs_getdents/struct dir_context: add flags field Hao Xu
@ 2023-07-18 13:21 ` Hao Xu
  2023-07-19  8:56   ` Hao Xu
  2023-07-26 15:00   ` Christian Brauner
  2023-07-18 13:21 ` [PATCH 4/5] xfs: add NOWAIT semantics for readdir Hao Xu
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 41+ messages in thread
From: Hao Xu @ 2023-07-18 13:21 UTC (permalink / raw)
  To: io-uring, Jens Axboe
  Cc: Dominique Martinet, Pavel Begunkov, Christian Brauner,
	Alexander Viro, Stefan Roesch, Clay Harris, Dave Chinner,
	linux-fsdevel, Wanpeng Li

From: Hao Xu <[email protected]>

This add support for getdents64 to io_uring, acting exactly like the
syscall: the directory is iterated from it's current's position as
stored in the file struct, and the file's position is updated exactly as
if getdents64 had been called.

For filesystems that support NOWAIT in iterate_shared(), try to use it
first; if a user already knows the filesystem they use do not support
nowait they can force async through IOSQE_ASYNC in the sqe flags,
avoiding the need to bounce back through a useless EAGAIN return.

Co-developed-by: Dominique Martinet <[email protected]>
Signed-off-by: Dominique Martinet <[email protected]>
Signed-off-by: Hao Xu <[email protected]>
---
 include/uapi/linux/io_uring.h |  7 +++++
 io_uring/fs.c                 | 55 +++++++++++++++++++++++++++++++++++
 io_uring/fs.h                 |  3 ++
 io_uring/opdef.c              |  8 +++++
 4 files changed, 73 insertions(+)

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 36f9c73082de..b200b2600622 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -65,6 +65,7 @@ struct io_uring_sqe {
 		__u32		xattr_flags;
 		__u32		msg_ring_flags;
 		__u32		uring_cmd_flags;
+		__u32		getdents_flags;
 	};
 	__u64	user_data;	/* data to be passed back at completion time */
 	/* pack this to avoid bogus arm OABI complaints */
@@ -235,6 +236,7 @@ enum io_uring_op {
 	IORING_OP_URING_CMD,
 	IORING_OP_SEND_ZC,
 	IORING_OP_SENDMSG_ZC,
+	IORING_OP_GETDENTS,
 
 	/* this goes last, obviously */
 	IORING_OP_LAST,
@@ -273,6 +275,11 @@ enum io_uring_op {
  */
 #define SPLICE_F_FD_IN_FIXED	(1U << 31) /* the last bit of __u32 */
 
+/*
+ * sqe->getdents_flags
+ */
+#define IORING_GETDENTS_REWIND	(1U << 0)
+
 /*
  * POLL_ADD flags. Note that since sqe->poll_events is the flag space, the
  * command flags for POLL_ADD are stored in sqe->len.
diff --git a/io_uring/fs.c b/io_uring/fs.c
index f6a69a549fd4..480f25677fed 100644
--- a/io_uring/fs.c
+++ b/io_uring/fs.c
@@ -47,6 +47,13 @@ struct io_link {
 	int				flags;
 };
 
+struct io_getdents {
+	struct file			*file;
+	struct linux_dirent64 __user	*dirent;
+	unsigned int			count;
+	int				flags;
+};
+
 int io_renameat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
 	struct io_rename *ren = io_kiocb_to_cmd(req, struct io_rename);
@@ -291,3 +298,51 @@ void io_link_cleanup(struct io_kiocb *req)
 	putname(sl->oldpath);
 	putname(sl->newpath);
 }
+
+int io_getdents_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+{
+	struct io_getdents *gd = io_kiocb_to_cmd(req, struct io_getdents);
+
+	if (READ_ONCE(sqe->off) != 0)
+		return -EINVAL;
+
+	gd->dirent = u64_to_user_ptr(READ_ONCE(sqe->addr));
+	gd->count = READ_ONCE(sqe->len);
+
+	return 0;
+}
+
+int io_getdents(struct io_kiocb *req, unsigned int issue_flags)
+{
+	struct io_getdents *gd = io_kiocb_to_cmd(req, struct io_getdents);
+	struct file *file = req->file;
+	unsigned long getdents_flags = 0;
+	bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
+	bool should_lock = file->f_mode & FMODE_ATOMIC_POS;
+	int ret;
+
+	if (force_nonblock) {
+		if (!(req->file->f_mode & FMODE_NOWAIT))
+			return -EAGAIN;
+
+		getdents_flags = DIR_CONTEXT_F_NOWAIT;
+	}
+
+	if (should_lock) {
+		if (!force_nonblock)
+			mutex_lock(&file->f_pos_lock);
+		else if (!mutex_trylock(&file->f_pos_lock))
+			return -EAGAIN;
+	}
+
+	ret = vfs_getdents(file, gd->dirent, gd->count, getdents_flags);
+	if (should_lock)
+		mutex_unlock(&file->f_pos_lock);
+
+	if (ret == -EAGAIN && force_nonblock)
+		return -EAGAIN;
+
+	io_req_set_res(req, ret, 0);
+	return 0;
+}
+
diff --git a/io_uring/fs.h b/io_uring/fs.h
index 0bb5efe3d6bb..f83a6f3a678d 100644
--- a/io_uring/fs.h
+++ b/io_uring/fs.h
@@ -18,3 +18,6 @@ int io_symlinkat(struct io_kiocb *req, unsigned int issue_flags);
 int io_linkat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
 int io_linkat(struct io_kiocb *req, unsigned int issue_flags);
 void io_link_cleanup(struct io_kiocb *req);
+
+int io_getdents_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
+int io_getdents(struct io_kiocb *req, unsigned int issue_flags);
diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index 3b9c6489b8b6..1bae6b2a8d0b 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -428,6 +428,11 @@ const struct io_issue_def io_issue_defs[] = {
 		.prep			= io_eopnotsupp_prep,
 #endif
 	},
+	[IORING_OP_GETDENTS] = {
+		.needs_file		= 1,
+		.prep			= io_getdents_prep,
+		.issue			= io_getdents,
+	},
 };
 
 
@@ -648,6 +653,9 @@ const struct io_cold_def io_cold_defs[] = {
 		.fail			= io_sendrecv_fail,
 #endif
 	},
+	[IORING_OP_GETDENTS] = {
+		.name			= "GETDENTS",
+	},
 };
 
 const char *io_uring_get_opcode(u8 opcode)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/5] io_uring: add support for getdents
  2023-07-18 13:21 ` [PATCH 3/5] io_uring: add support for getdents Hao Xu
@ 2023-07-19  8:56   ` Hao Xu
  2023-07-26 15:00   ` Christian Brauner
  1 sibling, 0 replies; 41+ messages in thread
From: Hao Xu @ 2023-07-19  8:56 UTC (permalink / raw)
  To: io-uring, Jens Axboe
  Cc: Dominique Martinet, Pavel Begunkov, Christian Brauner,
	Alexander Viro, Stefan Roesch, Clay Harris, Dave Chinner,
	linux-fsdevel, Wanpeng Li

On 7/18/23 21:21, Hao Xu wrote:
> From: Hao Xu <[email protected]>
> 
> This add support for getdents64 to io_uring, acting exactly like the
> syscall: the directory is iterated from it's current's position as
> stored in the file struct, and the file's position is updated exactly as
> if getdents64 had been called.
> 
> For filesystems that support NOWAIT in iterate_shared(), try to use it
> first; if a user already knows the filesystem they use do not support
> nowait they can force async through IOSQE_ASYNC in the sqe flags,
> avoiding the need to bounce back through a useless EAGAIN return.
> 
> Co-developed-by: Dominique Martinet <[email protected]>
> Signed-off-by: Dominique Martinet <[email protected]>
> Signed-off-by: Hao Xu <[email protected]>
> ---
>   include/uapi/linux/io_uring.h |  7 +++++
>   io_uring/fs.c                 | 55 +++++++++++++++++++++++++++++++++++
>   io_uring/fs.h                 |  3 ++
>   io_uring/opdef.c              |  8 +++++
>   4 files changed, 73 insertions(+)
> 
> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> index 36f9c73082de..b200b2600622 100644
> --- a/include/uapi/linux/io_uring.h
> +++ b/include/uapi/linux/io_uring.h
> @@ -65,6 +65,7 @@ struct io_uring_sqe {
>   		__u32		xattr_flags;
>   		__u32		msg_ring_flags;
>   		__u32		uring_cmd_flags;
> +		__u32		getdents_flags;

Looks this is not needed anymore, I'll remove this in next version.

>   	};
>   	__u64	user_data;	/* data to be passed back at completion time */
>   	/* pack this to avoid bogus arm OABI complaints */
> @@ -235,6 +236,7 @@ enum io_uring_op {
>   	IORING_OP_URING_CMD,
>   	IORING_OP_SEND_ZC,
>   	IORING_OP_SENDMSG_ZC,
> +	IORING_OP_GETDENTS,
>   
>   	/* this goes last, obviously */
>   	IORING_OP_LAST,
> @@ -273,6 +275,11 @@ enum io_uring_op {
>    */
>   #define SPLICE_F_FD_IN_FIXED	(1U << 31) /* the last bit of __u32 */
>   
> +/*
> + * sqe->getdents_flags
> + */
> +#define IORING_GETDENTS_REWIND	(1U << 0)

ditto

> +
>   /*
>    * POLL_ADD flags. Note that since sqe->poll_events is the flag space, the
>    * command flags for POLL_ADD are stored in sqe->len.
> diff --git a/io_uring/fs.c b/io_uring/fs.c
> index f6a69a549fd4..480f25677fed 100644
> --- a/io_uring/fs.c
> +++ b/io_uring/fs.c
> @@ -47,6 +47,13 @@ struct io_link {
>   	int				flags;
>   };
>   
> +struct io_getdents {
> +	struct file			*file;
> +	struct linux_dirent64 __user	*dirent;
> +	unsigned int			count;
> +	int				flags;

ditto


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/5] io_uring: add support for getdents
  2023-07-18 13:21 ` [PATCH 3/5] io_uring: add support for getdents Hao Xu
  2023-07-19  8:56   ` Hao Xu
@ 2023-07-26 15:00   ` Christian Brauner
  2023-07-27 11:51     ` Hao Xu
  1 sibling, 1 reply; 41+ messages in thread
From: Christian Brauner @ 2023-07-26 15:00 UTC (permalink / raw)
  To: Hao Xu
  Cc: io-uring, Jens Axboe, Dominique Martinet, Pavel Begunkov,
	Alexander Viro, Stefan Roesch, Clay Harris, Dave Chinner,
	linux-fsdevel, Wanpeng Li

On Tue, Jul 18, 2023 at 09:21:10PM +0800, Hao Xu wrote:
> From: Hao Xu <[email protected]>
> 
> This add support for getdents64 to io_uring, acting exactly like the
> syscall: the directory is iterated from it's current's position as
> stored in the file struct, and the file's position is updated exactly as
> if getdents64 had been called.
> 
> For filesystems that support NOWAIT in iterate_shared(), try to use it
> first; if a user already knows the filesystem they use do not support
> nowait they can force async through IOSQE_ASYNC in the sqe flags,
> avoiding the need to bounce back through a useless EAGAIN return.
> 
> Co-developed-by: Dominique Martinet <[email protected]>
> Signed-off-by: Dominique Martinet <[email protected]>
> Signed-off-by: Hao Xu <[email protected]>
> ---
>  include/uapi/linux/io_uring.h |  7 +++++
>  io_uring/fs.c                 | 55 +++++++++++++++++++++++++++++++++++
>  io_uring/fs.h                 |  3 ++
>  io_uring/opdef.c              |  8 +++++
>  4 files changed, 73 insertions(+)
> 
> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> index 36f9c73082de..b200b2600622 100644
> --- a/include/uapi/linux/io_uring.h
> +++ b/include/uapi/linux/io_uring.h
> @@ -65,6 +65,7 @@ struct io_uring_sqe {
>  		__u32		xattr_flags;
>  		__u32		msg_ring_flags;
>  		__u32		uring_cmd_flags;
> +		__u32		getdents_flags;
>  	};
>  	__u64	user_data;	/* data to be passed back at completion time */
>  	/* pack this to avoid bogus arm OABI complaints */
> @@ -235,6 +236,7 @@ enum io_uring_op {
>  	IORING_OP_URING_CMD,
>  	IORING_OP_SEND_ZC,
>  	IORING_OP_SENDMSG_ZC,
> +	IORING_OP_GETDENTS,
>  
>  	/* this goes last, obviously */
>  	IORING_OP_LAST,
> @@ -273,6 +275,11 @@ enum io_uring_op {
>   */
>  #define SPLICE_F_FD_IN_FIXED	(1U << 31) /* the last bit of __u32 */
>  
> +/*
> + * sqe->getdents_flags
> + */
> +#define IORING_GETDENTS_REWIND	(1U << 0)
> +
>  /*
>   * POLL_ADD flags. Note that since sqe->poll_events is the flag space, the
>   * command flags for POLL_ADD are stored in sqe->len.
> diff --git a/io_uring/fs.c b/io_uring/fs.c
> index f6a69a549fd4..480f25677fed 100644
> --- a/io_uring/fs.c
> +++ b/io_uring/fs.c
> @@ -47,6 +47,13 @@ struct io_link {
>  	int				flags;
>  };
>  
> +struct io_getdents {
> +	struct file			*file;
> +	struct linux_dirent64 __user	*dirent;
> +	unsigned int			count;
> +	int				flags;
> +};
> +
>  int io_renameat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>  {
>  	struct io_rename *ren = io_kiocb_to_cmd(req, struct io_rename);
> @@ -291,3 +298,51 @@ void io_link_cleanup(struct io_kiocb *req)
>  	putname(sl->oldpath);
>  	putname(sl->newpath);
>  }
> +
> +int io_getdents_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
> +{
> +	struct io_getdents *gd = io_kiocb_to_cmd(req, struct io_getdents);
> +
> +	if (READ_ONCE(sqe->off) != 0)
> +		return -EINVAL;
> +
> +	gd->dirent = u64_to_user_ptr(READ_ONCE(sqe->addr));
> +	gd->count = READ_ONCE(sqe->len);
> +
> +	return 0;
> +}
> +
> +int io_getdents(struct io_kiocb *req, unsigned int issue_flags)
> +{
> +	struct io_getdents *gd = io_kiocb_to_cmd(req, struct io_getdents);
> +	struct file *file = req->file;
> +	unsigned long getdents_flags = 0;
> +	bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;

Hm, I'm not sure what exactly the rules are for IO_URING_F_NONBLOCK.
But to point this out:

vfs_getdents()
-> iterate_dir()
   {
        if (shared)
                res = down_read_killable(&inode->i_rwsem);
        else
                res = down_write_killable(&inode->i_rwsem);
   }

which means you can still end up sleeping here before you go into a
filesystem that does actually support non-waiting getdents. So if you
have concurrent operations that grab inode lock (touch, mkdir etc) you
can end up sleeping here.

Is that intentional or an oversight? If the former can someone please
explain the rules and why it's fine in this case?

> +	bool should_lock = file->f_mode & FMODE_ATOMIC_POS;
> +	int ret;
> +
> +	if (force_nonblock) {
> +		if (!(req->file->f_mode & FMODE_NOWAIT))
> +			return -EAGAIN;
> +
> +		getdents_flags = DIR_CONTEXT_F_NOWAIT;
> +	}
> +
> +	if (should_lock) {
> +		if (!force_nonblock)
> +			mutex_lock(&file->f_pos_lock);
> +		else if (!mutex_trylock(&file->f_pos_lock))
> +			return -EAGAIN;
> +	}

That now looks like it works.

> +
> +	ret = vfs_getdents(file, gd->dirent, gd->count, getdents_flags);
> +	if (should_lock)
> +		mutex_unlock(&file->f_pos_lock);

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/5] io_uring: add support for getdents
  2023-07-26 15:00   ` Christian Brauner
@ 2023-07-27 11:51     ` Hao Xu
  2023-07-27 14:27       ` Christian Brauner
  0 siblings, 1 reply; 41+ messages in thread
From: Hao Xu @ 2023-07-27 11:51 UTC (permalink / raw)
  To: Christian Brauner
  Cc: io-uring, Jens Axboe, Dominique Martinet, Pavel Begunkov,
	Alexander Viro, Stefan Roesch, Clay Harris, Dave Chinner,
	linux-fsdevel, Wanpeng Li

On 7/26/23 23:00, Christian Brauner wrote:
> On Tue, Jul 18, 2023 at 09:21:10PM +0800, Hao Xu wrote:
>> From: Hao Xu <[email protected]>
>>
>> This add support for getdents64 to io_uring, acting exactly like the
>> syscall: the directory is iterated from it's current's position as
>> stored in the file struct, and the file's position is updated exactly as
>> if getdents64 had been called.
>>
>> For filesystems that support NOWAIT in iterate_shared(), try to use it
>> first; if a user already knows the filesystem they use do not support
>> nowait they can force async through IOSQE_ASYNC in the sqe flags,
>> avoiding the need to bounce back through a useless EAGAIN return.
>>
>> Co-developed-by: Dominique Martinet <[email protected]>
>> Signed-off-by: Dominique Martinet <[email protected]>
>> Signed-off-by: Hao Xu <[email protected]>
>> ---
>>   include/uapi/linux/io_uring.h |  7 +++++
>>   io_uring/fs.c                 | 55 +++++++++++++++++++++++++++++++++++
>>   io_uring/fs.h                 |  3 ++
>>   io_uring/opdef.c              |  8 +++++
>>   4 files changed, 73 insertions(+)
>>
>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
>> index 36f9c73082de..b200b2600622 100644
>> --- a/include/uapi/linux/io_uring.h
>> +++ b/include/uapi/linux/io_uring.h
>> @@ -65,6 +65,7 @@ struct io_uring_sqe {
>>   		__u32		xattr_flags;
>>   		__u32		msg_ring_flags;
>>   		__u32		uring_cmd_flags;
>> +		__u32		getdents_flags;
>>   	};
>>   	__u64	user_data;	/* data to be passed back at completion time */
>>   	/* pack this to avoid bogus arm OABI complaints */
>> @@ -235,6 +236,7 @@ enum io_uring_op {
>>   	IORING_OP_URING_CMD,
>>   	IORING_OP_SEND_ZC,
>>   	IORING_OP_SENDMSG_ZC,
>> +	IORING_OP_GETDENTS,
>>   
>>   	/* this goes last, obviously */
>>   	IORING_OP_LAST,
>> @@ -273,6 +275,11 @@ enum io_uring_op {
>>    */
>>   #define SPLICE_F_FD_IN_FIXED	(1U << 31) /* the last bit of __u32 */
>>   
>> +/*
>> + * sqe->getdents_flags
>> + */
>> +#define IORING_GETDENTS_REWIND	(1U << 0)
>> +
>>   /*
>>    * POLL_ADD flags. Note that since sqe->poll_events is the flag space, the
>>    * command flags for POLL_ADD are stored in sqe->len.
>> diff --git a/io_uring/fs.c b/io_uring/fs.c
>> index f6a69a549fd4..480f25677fed 100644
>> --- a/io_uring/fs.c
>> +++ b/io_uring/fs.c
>> @@ -47,6 +47,13 @@ struct io_link {
>>   	int				flags;
>>   };
>>   
>> +struct io_getdents {
>> +	struct file			*file;
>> +	struct linux_dirent64 __user	*dirent;
>> +	unsigned int			count;
>> +	int				flags;
>> +};
>> +
>>   int io_renameat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>   {
>>   	struct io_rename *ren = io_kiocb_to_cmd(req, struct io_rename);
>> @@ -291,3 +298,51 @@ void io_link_cleanup(struct io_kiocb *req)
>>   	putname(sl->oldpath);
>>   	putname(sl->newpath);
>>   }
>> +
>> +int io_getdents_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>> +{
>> +	struct io_getdents *gd = io_kiocb_to_cmd(req, struct io_getdents);
>> +
>> +	if (READ_ONCE(sqe->off) != 0)
>> +		return -EINVAL;
>> +
>> +	gd->dirent = u64_to_user_ptr(READ_ONCE(sqe->addr));
>> +	gd->count = READ_ONCE(sqe->len);
>> +
>> +	return 0;
>> +}
>> +
>> +int io_getdents(struct io_kiocb *req, unsigned int issue_flags)
>> +{
>> +	struct io_getdents *gd = io_kiocb_to_cmd(req, struct io_getdents);
>> +	struct file *file = req->file;
>> +	unsigned long getdents_flags = 0;
>> +	bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
> 
> Hm, I'm not sure what exactly the rules are for IO_URING_F_NONBLOCK.
> But to point this out:
> 
> vfs_getdents()
> -> iterate_dir()
>     {
>          if (shared)
>                  res = down_read_killable(&inode->i_rwsem);
>          else
>                  res = down_write_killable(&inode->i_rwsem);
>     }
> 
> which means you can still end up sleeping here before you go into a
> filesystem that does actually support non-waiting getdents. So if you
> have concurrent operations that grab inode lock (touch, mkdir etc) you
> can end up sleeping here.
> 
> Is that intentional or an oversight? If the former can someone please
> explain the rules and why it's fine in this case?

I actually saw this semaphore, and there is another xfs lock in
file_accessed
   --> touch_atime
     --> inode_update_time
       --> inode->i_op->update_time == xfs_vn_update_time

Forgot to point them out in the cover-letter..., I didn't modify them
since I'm not very sure about if we should do so, and I saw Stefan's
patchset didn't modify them too.

My personnal thinking is we should apply trylock logic for this
inode->i_rwsem. For xfs lock in touch_atime, we should do that since it
doesn't make sense to rollback all the stuff while we are almost at the
end of getdents because of a lock.


> 
>> +	bool should_lock = file->f_mode & FMODE_ATOMIC_POS;
>> +	int ret;
>> +
>> +	if (force_nonblock) {
>> +		if (!(req->file->f_mode & FMODE_NOWAIT))
>> +			return -EAGAIN;
>> +
>> +		getdents_flags = DIR_CONTEXT_F_NOWAIT;
>> +	}
>> +
>> +	if (should_lock) {
>> +		if (!force_nonblock)
>> +			mutex_lock(&file->f_pos_lock);
>> +		else if (!mutex_trylock(&file->f_pos_lock))
>> +			return -EAGAIN;
>> +	}
> 
> That now looks like it works.
> 
>> +
>> +	ret = vfs_getdents(file, gd->dirent, gd->count, getdents_flags);
>> +	if (should_lock)
>> +		mutex_unlock(&file->f_pos_lock);


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/5] io_uring: add support for getdents
  2023-07-27 11:51     ` Hao Xu
@ 2023-07-27 14:27       ` Christian Brauner
  2023-07-27 15:12         ` Pavel Begunkov
                           ` (2 more replies)
  0 siblings, 3 replies; 41+ messages in thread
From: Christian Brauner @ 2023-07-27 14:27 UTC (permalink / raw)
  To: Hao Xu, djwong, Dave Chinner, Jens Axboe
  Cc: io-uring, Dominique Martinet, Pavel Begunkov, Alexander Viro,
	Stefan Roesch, Clay Harris, linux-fsdevel, Wanpeng Li

On Thu, Jul 27, 2023 at 07:51:19PM +0800, Hao Xu wrote:
> On 7/26/23 23:00, Christian Brauner wrote:
> > On Tue, Jul 18, 2023 at 09:21:10PM +0800, Hao Xu wrote:
> > > From: Hao Xu <[email protected]>
> > > 
> > > This add support for getdents64 to io_uring, acting exactly like the
> > > syscall: the directory is iterated from it's current's position as
> > > stored in the file struct, and the file's position is updated exactly as
> > > if getdents64 had been called.
> > > 
> > > For filesystems that support NOWAIT in iterate_shared(), try to use it
> > > first; if a user already knows the filesystem they use do not support
> > > nowait they can force async through IOSQE_ASYNC in the sqe flags,
> > > avoiding the need to bounce back through a useless EAGAIN return.
> > > 
> > > Co-developed-by: Dominique Martinet <[email protected]>
> > > Signed-off-by: Dominique Martinet <[email protected]>
> > > Signed-off-by: Hao Xu <[email protected]>
> > > ---
> > >   include/uapi/linux/io_uring.h |  7 +++++
> > >   io_uring/fs.c                 | 55 +++++++++++++++++++++++++++++++++++
> > >   io_uring/fs.h                 |  3 ++
> > >   io_uring/opdef.c              |  8 +++++
> > >   4 files changed, 73 insertions(+)
> > > 
> > > diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> > > index 36f9c73082de..b200b2600622 100644
> > > --- a/include/uapi/linux/io_uring.h
> > > +++ b/include/uapi/linux/io_uring.h
> > > @@ -65,6 +65,7 @@ struct io_uring_sqe {
> > >   		__u32		xattr_flags;
> > >   		__u32		msg_ring_flags;
> > >   		__u32		uring_cmd_flags;
> > > +		__u32		getdents_flags;
> > >   	};
> > >   	__u64	user_data;	/* data to be passed back at completion time */
> > >   	/* pack this to avoid bogus arm OABI complaints */
> > > @@ -235,6 +236,7 @@ enum io_uring_op {
> > >   	IORING_OP_URING_CMD,
> > >   	IORING_OP_SEND_ZC,
> > >   	IORING_OP_SENDMSG_ZC,
> > > +	IORING_OP_GETDENTS,
> > >   	/* this goes last, obviously */
> > >   	IORING_OP_LAST,
> > > @@ -273,6 +275,11 @@ enum io_uring_op {
> > >    */
> > >   #define SPLICE_F_FD_IN_FIXED	(1U << 31) /* the last bit of __u32 */
> > > +/*
> > > + * sqe->getdents_flags
> > > + */
> > > +#define IORING_GETDENTS_REWIND	(1U << 0)
> > > +
> > >   /*
> > >    * POLL_ADD flags. Note that since sqe->poll_events is the flag space, the
> > >    * command flags for POLL_ADD are stored in sqe->len.
> > > diff --git a/io_uring/fs.c b/io_uring/fs.c
> > > index f6a69a549fd4..480f25677fed 100644
> > > --- a/io_uring/fs.c
> > > +++ b/io_uring/fs.c
> > > @@ -47,6 +47,13 @@ struct io_link {
> > >   	int				flags;
> > >   };
> > > +struct io_getdents {
> > > +	struct file			*file;
> > > +	struct linux_dirent64 __user	*dirent;
> > > +	unsigned int			count;
> > > +	int				flags;
> > > +};
> > > +
> > >   int io_renameat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
> > >   {
> > >   	struct io_rename *ren = io_kiocb_to_cmd(req, struct io_rename);
> > > @@ -291,3 +298,51 @@ void io_link_cleanup(struct io_kiocb *req)
> > >   	putname(sl->oldpath);
> > >   	putname(sl->newpath);
> > >   }
> > > +
> > > +int io_getdents_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
> > > +{
> > > +	struct io_getdents *gd = io_kiocb_to_cmd(req, struct io_getdents);
> > > +
> > > +	if (READ_ONCE(sqe->off) != 0)
> > > +		return -EINVAL;
> > > +
> > > +	gd->dirent = u64_to_user_ptr(READ_ONCE(sqe->addr));
> > > +	gd->count = READ_ONCE(sqe->len);
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +int io_getdents(struct io_kiocb *req, unsigned int issue_flags)
> > > +{
> > > +	struct io_getdents *gd = io_kiocb_to_cmd(req, struct io_getdents);
> > > +	struct file *file = req->file;
> > > +	unsigned long getdents_flags = 0;
> > > +	bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
> > 
> > Hm, I'm not sure what exactly the rules are for IO_URING_F_NONBLOCK.
> > But to point this out:
> > 
> > vfs_getdents()
> > -> iterate_dir()
> >     {
> >          if (shared)
> >                  res = down_read_killable(&inode->i_rwsem);
> >          else
> >                  res = down_write_killable(&inode->i_rwsem);
> >     }
> > 
> > which means you can still end up sleeping here before you go into a
> > filesystem that does actually support non-waiting getdents. So if you
> > have concurrent operations that grab inode lock (touch, mkdir etc) you
> > can end up sleeping here.
> > 
> > Is that intentional or an oversight? If the former can someone please
> > explain the rules and why it's fine in this case?
> 
> I actually saw this semaphore, and there is another xfs lock in
> file_accessed
>   --> touch_atime
>     --> inode_update_time
>       --> inode->i_op->update_time == xfs_vn_update_time
> 
> Forgot to point them out in the cover-letter..., I didn't modify them
> since I'm not very sure about if we should do so, and I saw Stefan's
> patchset didn't modify them too.
> 
> My personnal thinking is we should apply trylock logic for this
> inode->i_rwsem. For xfs lock in touch_atime, we should do that since it
> doesn't make sense to rollback all the stuff while we are almost at the
> end of getdents because of a lock.

That manoeuvres around the problem. Which I'm slightly more sensitive
too as this review is a rather expensive one.

Plus, it seems fixable in at least two ways:

For both we need to be able to tell the filesystem that a nowait atime
update is requested. Simple thing seems to me to add a S_NOWAIT flag to
file_time_flags and passing that via i_op->update_time() which already
has a flag argument. That would likely also help kiocb_modified().

file_accessed()
-> touch_atime()
   -> inode_update_time()
      -> i_op->update_time == xfs_vn_update_time()

Then we have two options afaict:

(1) best-effort atime update

file_accessed() already has the builtin assumption that updating atime
might fail for other reasons - see the comment in there. So it is
somewhat best-effort already.

(2) move atime update before calling into filesystem

If we want to be sure that access time is updated when a readdir request
is issued through io_uring then we need to have file_accessed() give a
return value and expose a new helper for io_uring or modify
vfs_getdents() to do something like:

vfs_getdents()
{
	if (nowait)
		down_read_trylock()

	if (!IS_DEADDIR(inode)) {
		ret = file_accessed(file);
		if (ret == -EAGAIN)
			goto out_unlock;

		f_op->iterate_shared()
	}
}

It's not unprecedented to do update atime before the actual operation
has been done afaict. That's already the case in xfs_file_write_checks()
which is called before anything is written. So that seems ok.

Does any of these two options work for the xfs maintainers and Jens?

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/5] io_uring: add support for getdents
  2023-07-27 14:27       ` Christian Brauner
@ 2023-07-27 15:12         ` Pavel Begunkov
  2023-07-27 15:52           ` Christian Brauner
  2023-07-30 18:02         ` Hao Xu
  2023-07-31  1:33         ` Dave Chinner
  2 siblings, 1 reply; 41+ messages in thread
From: Pavel Begunkov @ 2023-07-27 15:12 UTC (permalink / raw)
  To: Christian Brauner, Hao Xu, djwong, Dave Chinner, Jens Axboe
  Cc: io-uring, Dominique Martinet, Alexander Viro, Stefan Roesch,
	Clay Harris, linux-fsdevel, Wanpeng Li, josef

On 7/27/23 15:27, Christian Brauner wrote:
> On Thu, Jul 27, 2023 at 07:51:19PM +0800, Hao Xu wrote:
>> On 7/26/23 23:00, Christian Brauner wrote:
>>> On Tue, Jul 18, 2023 at 09:21:10PM +0800, Hao Xu wrote:
>>>> From: Hao Xu <[email protected]>
>>>>
>>>> This add support for getdents64 to io_uring, acting exactly like the
>>>> syscall: the directory is iterated from it's current's position as
>>>> stored in the file struct, and the file's position is updated exactly as
>>>> if getdents64 had been called.
>>>>
>>>> For filesystems that support NOWAIT in iterate_shared(), try to use it
>>>> first; if a user already knows the filesystem they use do not support
>>>> nowait they can force async through IOSQE_ASYNC in the sqe flags,
>>>> avoiding the need to bounce back through a useless EAGAIN return.
>>>>
>>>> Co-developed-by: Dominique Martinet <[email protected]>
>>>> Signed-off-by: Dominique Martinet <[email protected]>
>>>> Signed-off-by: Hao Xu <[email protected]>
>>>> ---
[...]
>> I actually saw this semaphore, and there is another xfs lock in
>> file_accessed
>>    --> touch_atime
>>      --> inode_update_time
>>        --> inode->i_op->update_time == xfs_vn_update_time
>>
>> Forgot to point them out in the cover-letter..., I didn't modify them
>> since I'm not very sure about if we should do so, and I saw Stefan's
>> patchset didn't modify them too.
>>
>> My personnal thinking is we should apply trylock logic for this
>> inode->i_rwsem. For xfs lock in touch_atime, we should do that since it
>> doesn't make sense to rollback all the stuff while we are almost at the
>> end of getdents because of a lock.
> 
> That manoeuvres around the problem. Which I'm slightly more sensitive
> too as this review is a rather expensive one.
> 
> Plus, it seems fixable in at least two ways:
> 
> For both we need to be able to tell the filesystem that a nowait atime
> update is requested. Simple thing seems to me to add a S_NOWAIT flag to
> file_time_flags and passing that via i_op->update_time() which already
> has a flag argument. That would likely also help kiocb_modified().

fwiw, we've just recently had similar problems with io_uring read/write
and atime/mtime in prod environment, so we're interested in solving that
regardless of this patchset. I.e. io_uring issues rw with NOWAIT, {a,m}time
touch ignores that, that stalls other submissions and completely screws
latency.

> file_accessed()
> -> touch_atime()
>     -> inode_update_time()
>        -> i_op->update_time == xfs_vn_update_time()
> 
> Then we have two options afaict:
> 
> (1) best-effort atime update
> 
> file_accessed() already has the builtin assumption that updating atime
> might fail for other reasons - see the comment in there. So it is
> somewhat best-effort already.
> 
> (2) move atime update before calling into filesystem
> 
> If we want to be sure that access time is updated when a readdir request
> is issued through io_uring then we need to have file_accessed() give a
> return value and expose a new helper for io_uring or modify
> vfs_getdents() to do something like:
> 
> vfs_getdents()
> {
> 	if (nowait)
> 		down_read_trylock()
> 
> 	if (!IS_DEADDIR(inode)) {
> 		ret = file_accessed(file);
> 		if (ret == -EAGAIN)
> 			goto out_unlock;
> 
> 		f_op->iterate_shared()
> 	}
> }
> 
> It's not unprecedented to do update atime before the actual operation
> has been done afaict. That's already the case in xfs_file_write_checks()
> which is called before anything is written. So that seems ok.
> 
> Does any of these two options work for the xfs maintainers and Jens?

It doesn't look (2) will solve it for reads/writes, at least without
the pain of changing the {write,read}_iter callbacks. 1) sounds good
to me from the io_uring perspective, but I guess it won't work
for mtime?

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/5] io_uring: add support for getdents
  2023-07-27 15:12         ` Pavel Begunkov
@ 2023-07-27 15:52           ` Christian Brauner
  2023-07-27 16:17             ` Pavel Begunkov
  0 siblings, 1 reply; 41+ messages in thread
From: Christian Brauner @ 2023-07-27 15:52 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: Hao Xu, djwong, Dave Chinner, Jens Axboe, io-uring,
	Dominique Martinet, Alexander Viro, Stefan Roesch, Clay Harris,
	linux-fsdevel, Wanpeng Li, josef

On Thu, Jul 27, 2023 at 04:12:12PM +0100, Pavel Begunkov wrote:
> On 7/27/23 15:27, Christian Brauner wrote:
> > On Thu, Jul 27, 2023 at 07:51:19PM +0800, Hao Xu wrote:
> > > On 7/26/23 23:00, Christian Brauner wrote:
> > > > On Tue, Jul 18, 2023 at 09:21:10PM +0800, Hao Xu wrote:
> > > > > From: Hao Xu <[email protected]>
> > > > > 
> > > > > This add support for getdents64 to io_uring, acting exactly like the
> > > > > syscall: the directory is iterated from it's current's position as
> > > > > stored in the file struct, and the file's position is updated exactly as
> > > > > if getdents64 had been called.
> > > > > 
> > > > > For filesystems that support NOWAIT in iterate_shared(), try to use it
> > > > > first; if a user already knows the filesystem they use do not support
> > > > > nowait they can force async through IOSQE_ASYNC in the sqe flags,
> > > > > avoiding the need to bounce back through a useless EAGAIN return.
> > > > > 
> > > > > Co-developed-by: Dominique Martinet <[email protected]>
> > > > > Signed-off-by: Dominique Martinet <[email protected]>
> > > > > Signed-off-by: Hao Xu <[email protected]>
> > > > > ---
> [...]
> > > I actually saw this semaphore, and there is another xfs lock in
> > > file_accessed
> > >    --> touch_atime
> > >      --> inode_update_time
> > >        --> inode->i_op->update_time == xfs_vn_update_time
> > > 
> > > Forgot to point them out in the cover-letter..., I didn't modify them
> > > since I'm not very sure about if we should do so, and I saw Stefan's
> > > patchset didn't modify them too.
> > > 
> > > My personnal thinking is we should apply trylock logic for this
> > > inode->i_rwsem. For xfs lock in touch_atime, we should do that since it
> > > doesn't make sense to rollback all the stuff while we are almost at the
> > > end of getdents because of a lock.
> > 
> > That manoeuvres around the problem. Which I'm slightly more sensitive
> > too as this review is a rather expensive one.
> > 
> > Plus, it seems fixable in at least two ways:
> > 
> > For both we need to be able to tell the filesystem that a nowait atime
> > update is requested. Simple thing seems to me to add a S_NOWAIT flag to
> > file_time_flags and passing that via i_op->update_time() which already
> > has a flag argument. That would likely also help kiocb_modified().
> 
> fwiw, we've just recently had similar problems with io_uring read/write
> and atime/mtime in prod environment, so we're interested in solving that
> regardless of this patchset. I.e. io_uring issues rw with NOWAIT, {a,m}time
> touch ignores that, that stalls other submissions and completely screws
> latency.
> 
> > file_accessed()
> > -> touch_atime()
> >     -> inode_update_time()
> >        -> i_op->update_time == xfs_vn_update_time()
> > 
> > Then we have two options afaict:
> > 
> > (1) best-effort atime update
> > 
> > file_accessed() already has the builtin assumption that updating atime
> > might fail for other reasons - see the comment in there. So it is
> > somewhat best-effort already.
> > 
> > (2) move atime update before calling into filesystem
> > 
> > If we want to be sure that access time is updated when a readdir request
> > is issued through io_uring then we need to have file_accessed() give a
> > return value and expose a new helper for io_uring or modify
> > vfs_getdents() to do something like:
> > 
> > vfs_getdents()
> > {
> > 	if (nowait)
> > 		down_read_trylock()
> > 
> > 	if (!IS_DEADDIR(inode)) {
> > 		ret = file_accessed(file);
> > 		if (ret == -EAGAIN)
> > 			goto out_unlock;
> > 
> > 		f_op->iterate_shared()
> > 	}
> > }
> > 
> > It's not unprecedented to do update atime before the actual operation
> > has been done afaict. That's already the case in xfs_file_write_checks()
> > which is called before anything is written. So that seems ok.
> > 
> > Does any of these two options work for the xfs maintainers and Jens?
> 
> It doesn't look (2) will solve it for reads/writes, at least without

It would also solve it for writes which is what my kiocb_modified()
comment was about. So right now you have:

kiocb_modified(IOCB_NOWAI)
-> file_modified_flags(IOCB_NOWAI)
   -> file_remove_privs(IOCB_NOWAIT) // already fully non-blocking
   -> file_accessed(IOCB_NOWAIT)
      -> i_op->update_time(S_ATIME | S_NOWAIT)

and since xfs_file_write_iter() calls xfs_file_write_checks() before
doing any actual work you'd now be fine.

For reads xfs_file_read_iter() would need to be changed to a similar
logic but that's for xfs to decide ultimately.

> the pain of changing the {write,read}_iter callbacks. 1) sounds good
> to me from the io_uring perspective, but I guess it won't work
> for mtime?

I would prefer 2) which seems cleaner to me. But I might miss why this
won't work. So input needed/wanted.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/5] io_uring: add support for getdents
  2023-07-27 15:52           ` Christian Brauner
@ 2023-07-27 16:17             ` Pavel Begunkov
  2023-07-27 16:28               ` Christian Brauner
  0 siblings, 1 reply; 41+ messages in thread
From: Pavel Begunkov @ 2023-07-27 16:17 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Hao Xu, djwong, Dave Chinner, Jens Axboe, io-uring,
	Dominique Martinet, Alexander Viro, Stefan Roesch, Clay Harris,
	linux-fsdevel, Wanpeng Li, josef

On 7/27/23 16:52, Christian Brauner wrote:
> On Thu, Jul 27, 2023 at 04:12:12PM +0100, Pavel Begunkov wrote:
>> On 7/27/23 15:27, Christian Brauner wrote:
>>> On Thu, Jul 27, 2023 at 07:51:19PM +0800, Hao Xu wrote:
>>>> On 7/26/23 23:00, Christian Brauner wrote:
>>>>> On Tue, Jul 18, 2023 at 09:21:10PM +0800, Hao Xu wrote:
>>>>>> From: Hao Xu <[email protected]>
>>>>>>
>>>>>> This add support for getdents64 to io_uring, acting exactly like the
>>>>>> syscall: the directory is iterated from it's current's position as
>>>>>> stored in the file struct, and the file's position is updated exactly as
>>>>>> if getdents64 had been called.
>>>>>>
>>>>>> For filesystems that support NOWAIT in iterate_shared(), try to use it
>>>>>> first; if a user already knows the filesystem they use do not support
>>>>>> nowait they can force async through IOSQE_ASYNC in the sqe flags,
>>>>>> avoiding the need to bounce back through a useless EAGAIN return.
>>>>>>
>>>>>> Co-developed-by: Dominique Martinet <[email protected]>
>>>>>> Signed-off-by: Dominique Martinet <[email protected]>
>>>>>> Signed-off-by: Hao Xu <[email protected]>
>>>>>> ---
>> [...]
>>>> I actually saw this semaphore, and there is another xfs lock in
>>>> file_accessed
>>>>     --> touch_atime
>>>>       --> inode_update_time
>>>>         --> inode->i_op->update_time == xfs_vn_update_time
>>>>
>>>> Forgot to point them out in the cover-letter..., I didn't modify them
>>>> since I'm not very sure about if we should do so, and I saw Stefan's
>>>> patchset didn't modify them too.
>>>>
>>>> My personnal thinking is we should apply trylock logic for this
>>>> inode->i_rwsem. For xfs lock in touch_atime, we should do that since it
>>>> doesn't make sense to rollback all the stuff while we are almost at the
>>>> end of getdents because of a lock.
>>>
>>> That manoeuvres around the problem. Which I'm slightly more sensitive
>>> too as this review is a rather expensive one.
>>>
>>> Plus, it seems fixable in at least two ways:
>>>
>>> For both we need to be able to tell the filesystem that a nowait atime
>>> update is requested. Simple thing seems to me to add a S_NOWAIT flag to
>>> file_time_flags and passing that via i_op->update_time() which already
>>> has a flag argument. That would likely also help kiocb_modified().
>>
>> fwiw, we've just recently had similar problems with io_uring read/write
>> and atime/mtime in prod environment, so we're interested in solving that
>> regardless of this patchset. I.e. io_uring issues rw with NOWAIT, {a,m}time
>> touch ignores that, that stalls other submissions and completely screws
>> latency.
>>
>>> file_accessed()
>>> -> touch_atime()
>>>      -> inode_update_time()
>>>         -> i_op->update_time == xfs_vn_update_time()
>>>
>>> Then we have two options afaict:
>>>
>>> (1) best-effort atime update
>>>
>>> file_accessed() already has the builtin assumption that updating atime
>>> might fail for other reasons - see the comment in there. So it is
>>> somewhat best-effort already.
>>>
>>> (2) move atime update before calling into filesystem
>>>
>>> If we want to be sure that access time is updated when a readdir request
>>> is issued through io_uring then we need to have file_accessed() give a
>>> return value and expose a new helper for io_uring or modify
>>> vfs_getdents() to do something like:
>>>
>>> vfs_getdents()
>>> {
>>> 	if (nowait)
>>> 		down_read_trylock()
>>>
>>> 	if (!IS_DEADDIR(inode)) {
>>> 		ret = file_accessed(file);
>>> 		if (ret == -EAGAIN)
>>> 			goto out_unlock;
>>>
>>> 		f_op->iterate_shared()
>>> 	}
>>> }
>>>
>>> It's not unprecedented to do update atime before the actual operation
>>> has been done afaict. That's already the case in xfs_file_write_checks()
>>> which is called before anything is written. So that seems ok.
>>>
>>> Does any of these two options work for the xfs maintainers and Jens?
>>
>> It doesn't look (2) will solve it for reads/writes, at least without
> 
> It would also solve it for writes which is what my kiocb_modified()
> comment was about. So right now you have:

Great, I assumed there are stricter requirements for mtime not
transiently failing.


> kiocb_modified(IOCB_NOWAI)
> -> file_modified_flags(IOCB_NOWAI)
>     -> file_remove_privs(IOCB_NOWAIT) // already fully non-blocking
>     -> file_accessed(IOCB_NOWAIT)
>        -> i_op->update_time(S_ATIME | S_NOWAIT)
> 
> and since xfs_file_write_iter() calls xfs_file_write_checks() before
> doing any actual work you'd now be fine.
> 
> For reads xfs_file_read_iter() would need to be changed to a similar
> logic but that's for xfs to decide ultimately.
> 
>> the pain of changing the {write,read}_iter callbacks. 1) sounds good
>> to me from the io_uring perspective, but I guess it won't work
>> for mtime?
> 
> I would prefer 2) which seems cleaner to me. But I might miss why this
> won't work. So input needed/wanted.

Maybe I didn't fully grasp the (2) idea

2.1: all read_iter, write_iter, etc. callbacks should do file_accessed()
before doing IO, which sounds like a good option if everyone agrees with
that. Taking a look at direct block io, it's already like this.

2.2: Having io_uring doing file_accessed(). Since it's all currently
hidden behind {read,write}_iter() callbacks and not easily extractable,
it doesn't like a good option, unless I missed sth.
E.g. this ugliness comes to mind.

io_uring_read() {
	file_accessed();
	file->f_op->read_iter(DONT_TOUCH_ATIME);
	...
}

read_iter_impl() {
	// some pre processing

	if (!(flags & DONT_TOUCH_ATIME))
		file_accessed();
}

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/5] io_uring: add support for getdents
  2023-07-27 16:17             ` Pavel Begunkov
@ 2023-07-27 16:28               ` Christian Brauner
  2023-07-31  1:58                 ` Dave Chinner
  0 siblings, 1 reply; 41+ messages in thread
From: Christian Brauner @ 2023-07-27 16:28 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: Hao Xu, djwong, Dave Chinner, Jens Axboe, io-uring,
	Dominique Martinet, Alexander Viro, Stefan Roesch, Clay Harris,
	linux-fsdevel, Wanpeng Li, josef

On Thu, Jul 27, 2023 at 05:17:30PM +0100, Pavel Begunkov wrote:
> On 7/27/23 16:52, Christian Brauner wrote:
> > On Thu, Jul 27, 2023 at 04:12:12PM +0100, Pavel Begunkov wrote:
> > > On 7/27/23 15:27, Christian Brauner wrote:
> > > > On Thu, Jul 27, 2023 at 07:51:19PM +0800, Hao Xu wrote:
> > > > > On 7/26/23 23:00, Christian Brauner wrote:
> > > > > > On Tue, Jul 18, 2023 at 09:21:10PM +0800, Hao Xu wrote:
> > > > > > > From: Hao Xu <[email protected]>
> > > > > > > 
> > > > > > > This add support for getdents64 to io_uring, acting exactly like the
> > > > > > > syscall: the directory is iterated from it's current's position as
> > > > > > > stored in the file struct, and the file's position is updated exactly as
> > > > > > > if getdents64 had been called.
> > > > > > > 
> > > > > > > For filesystems that support NOWAIT in iterate_shared(), try to use it
> > > > > > > first; if a user already knows the filesystem they use do not support
> > > > > > > nowait they can force async through IOSQE_ASYNC in the sqe flags,
> > > > > > > avoiding the need to bounce back through a useless EAGAIN return.
> > > > > > > 
> > > > > > > Co-developed-by: Dominique Martinet <[email protected]>
> > > > > > > Signed-off-by: Dominique Martinet <[email protected]>
> > > > > > > Signed-off-by: Hao Xu <[email protected]>
> > > > > > > ---
> > > [...]
> > > > > I actually saw this semaphore, and there is another xfs lock in
> > > > > file_accessed
> > > > >     --> touch_atime
> > > > >       --> inode_update_time
> > > > >         --> inode->i_op->update_time == xfs_vn_update_time
> > > > > 
> > > > > Forgot to point them out in the cover-letter..., I didn't modify them
> > > > > since I'm not very sure about if we should do so, and I saw Stefan's
> > > > > patchset didn't modify them too.
> > > > > 
> > > > > My personnal thinking is we should apply trylock logic for this
> > > > > inode->i_rwsem. For xfs lock in touch_atime, we should do that since it
> > > > > doesn't make sense to rollback all the stuff while we are almost at the
> > > > > end of getdents because of a lock.
> > > > 
> > > > That manoeuvres around the problem. Which I'm slightly more sensitive
> > > > too as this review is a rather expensive one.
> > > > 
> > > > Plus, it seems fixable in at least two ways:
> > > > 
> > > > For both we need to be able to tell the filesystem that a nowait atime
> > > > update is requested. Simple thing seems to me to add a S_NOWAIT flag to
> > > > file_time_flags and passing that via i_op->update_time() which already
> > > > has a flag argument. That would likely also help kiocb_modified().
> > > 
> > > fwiw, we've just recently had similar problems with io_uring read/write
> > > and atime/mtime in prod environment, so we're interested in solving that
> > > regardless of this patchset. I.e. io_uring issues rw with NOWAIT, {a,m}time
> > > touch ignores that, that stalls other submissions and completely screws
> > > latency.
> > > 
> > > > file_accessed()
> > > > -> touch_atime()
> > > >      -> inode_update_time()
> > > >         -> i_op->update_time == xfs_vn_update_time()
> > > > 
> > > > Then we have two options afaict:
> > > > 
> > > > (1) best-effort atime update
> > > > 
> > > > file_accessed() already has the builtin assumption that updating atime
> > > > might fail for other reasons - see the comment in there. So it is
> > > > somewhat best-effort already.
> > > > 
> > > > (2) move atime update before calling into filesystem
> > > > 
> > > > If we want to be sure that access time is updated when a readdir request
> > > > is issued through io_uring then we need to have file_accessed() give a
> > > > return value and expose a new helper for io_uring or modify
> > > > vfs_getdents() to do something like:
> > > > 
> > > > vfs_getdents()
> > > > {
> > > > 	if (nowait)
> > > > 		down_read_trylock()
> > > > 
> > > > 	if (!IS_DEADDIR(inode)) {
> > > > 		ret = file_accessed(file);
> > > > 		if (ret == -EAGAIN)
> > > > 			goto out_unlock;
> > > > 
> > > > 		f_op->iterate_shared()
> > > > 	}
> > > > }
> > > > 
> > > > It's not unprecedented to do update atime before the actual operation
> > > > has been done afaict. That's already the case in xfs_file_write_checks()
> > > > which is called before anything is written. So that seems ok.
> > > > 
> > > > Does any of these two options work for the xfs maintainers and Jens?
> > > 
> > > It doesn't look (2) will solve it for reads/writes, at least without
> > 
> > It would also solve it for writes which is what my kiocb_modified()
> > comment was about. So right now you have:
> 
> Great, I assumed there are stricter requirements for mtime not
> transiently failing.

But I mean then wouldn't this already be a problem today?
kiocb_modified() can error out with EAGAIN today:

          ret = inode_needs_update_time(inode, &now);
          if (ret <= 0)
                  return ret;
          if (flags & IOCB_NOWAIT)
                  return -EAGAIN;

          return __file_update_time(file, &now, ret);

the thing is that it doesn't matter for ->write_iter() - for xfs at
least - because xfs does it as part of preparatory checks before
actually doing any real work. The problem happens when you do actual
work and afterwards call kiocb_modified(). That's why I think (2) is
preferable.

> 
> 
> > kiocb_modified(IOCB_NOWAI)
> > -> file_modified_flags(IOCB_NOWAI)
> >     -> file_remove_privs(IOCB_NOWAIT) // already fully non-blocking
> >     -> file_accessed(IOCB_NOWAIT)
> >        -> i_op->update_time(S_ATIME | S_NOWAIT)
> > 
> > and since xfs_file_write_iter() calls xfs_file_write_checks() before
> > doing any actual work you'd now be fine.
> > 
> > For reads xfs_file_read_iter() would need to be changed to a similar
> > logic but that's for xfs to decide ultimately.
> > 
> > > the pain of changing the {write,read}_iter callbacks. 1) sounds good
> > > to me from the io_uring perspective, but I guess it won't work
> > > for mtime?
> > 
> > I would prefer 2) which seems cleaner to me. But I might miss why this
> > won't work. So input needed/wanted.
> 
> Maybe I didn't fully grasp the (2) idea
> 
> 2.1: all read_iter, write_iter, etc. callbacks should do file_accessed()
> before doing IO, which sounds like a good option if everyone agrees with
> that. Taking a look at direct block io, it's already like this.

Yes, that's what I'm talking about. I'm asking whether that's ok for xfs
maintainers basically. i_op->write_iter() already works like that since
the dawn of time but i_op->read_iter doesn't and I'm proposing to make
it work like that and wondering if there's any issues I'm unaware of.

> 
> 2.2: Having io_uring doing file_accessed(). Since it's all currently
> hidden behind {read,write}_iter() callbacks and not easily extractable,
> it doesn't like a good option, unless I missed sth.

I think that would be the wrong approach and is definitely not what I
meant.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/5] io_uring: add support for getdents
  2023-07-27 16:28               ` Christian Brauner
@ 2023-07-31  1:58                 ` Dave Chinner
  2023-07-31  7:34                   ` Hao Xu
  2023-07-31  7:40                   ` Christian Brauner
  0 siblings, 2 replies; 41+ messages in thread
From: Dave Chinner @ 2023-07-31  1:58 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Pavel Begunkov, Hao Xu, djwong, Jens Axboe, io-uring,
	Dominique Martinet, Alexander Viro, Stefan Roesch, Clay Harris,
	linux-fsdevel, Wanpeng Li, josef

On Thu, Jul 27, 2023 at 06:28:52PM +0200, Christian Brauner wrote:
> On Thu, Jul 27, 2023 at 05:17:30PM +0100, Pavel Begunkov wrote:
> > On 7/27/23 16:52, Christian Brauner wrote:
> > > On Thu, Jul 27, 2023 at 04:12:12PM +0100, Pavel Begunkov wrote:
> > > It would also solve it for writes which is what my kiocb_modified()
> > > comment was about. So right now you have:
> > 
> > Great, I assumed there are stricter requirements for mtime not
> > transiently failing.
> 
> But I mean then wouldn't this already be a problem today?
> kiocb_modified() can error out with EAGAIN today:
> 
>           ret = inode_needs_update_time(inode, &now);
>           if (ret <= 0)
>                   return ret;
>           if (flags & IOCB_NOWAIT)
>                   return -EAGAIN;
> 
>           return __file_update_time(file, &now, ret);
> 
> the thing is that it doesn't matter for ->write_iter() - for xfs at
> least - because xfs does it as part of preparatory checks before
> actually doing any real work. The problem happens when you do actual
> work and afterwards call kiocb_modified(). That's why I think (2) is
> preferable.

This has nothing to do with what "XFS does". It's actually an
IOCB_NOWAIT API design constraint.

That is, IOCB_NOWAIT means "complete the whole operation without
blocking or return -EAGAIN having done nothing".  If we have to do
something that might block (like a timestamp update) then we need to
punt the entire operation before anything has been modified.  This
requires all the "do we need to modify this" checks to be done up
front before we start modifying anything.

So while it looks like this might be "an XFS thing", that's because
XFS tends to be the first filesystem that most io_uring NOWAIT
functionality is implemented on. IOWs, what you see is XFS is doing
things the way IOCB_NOWAIT requires to be done. i.e. it's a
demonstration of how nonblocking filesystem modification operations
need to be run, not an "XFS thing"...

> > > I would prefer 2) which seems cleaner to me. But I might miss why this
> > > won't work. So input needed/wanted.
> > 
> > Maybe I didn't fully grasp the (2) idea
> > 
> > 2.1: all read_iter, write_iter, etc. callbacks should do file_accessed()
> > before doing IO, which sounds like a good option if everyone agrees with
> > that. Taking a look at direct block io, it's already like this.
> 
> Yes, that's what I'm talking about. I'm asking whether that's ok for xfs
> maintainers basically. i_op->write_iter() already works like that since
> the dawn of time but i_op->read_iter doesn't and I'm proposing to make
> it work like that and wondering if there's any issues I'm unaware of.

XFS already calls file_accessed() in the DIO read path before the
read gets issued. I don't see any problem with lifting it to before
the copy-out loop in filemap_read() because it is run regardless of
whether any data is read or any error occurred.  Hence it just
doesn't look like it matters if it is run before or after the
copy-out loop to me....

-Dave.
-- 
Dave Chinner
[email protected]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/5] io_uring: add support for getdents
  2023-07-31  1:58                 ` Dave Chinner
@ 2023-07-31  7:34                   ` Hao Xu
  2023-07-31  7:50                     ` Christian Brauner
  2023-07-31  7:40                   ` Christian Brauner
  1 sibling, 1 reply; 41+ messages in thread
From: Hao Xu @ 2023-07-31  7:34 UTC (permalink / raw)
  To: Dave Chinner, Christian Brauner
  Cc: Pavel Begunkov, djwong, Jens Axboe, io-uring, Dominique Martinet,
	Alexander Viro, Stefan Roesch, Clay Harris, linux-fsdevel,
	Wanpeng Li, josef

On 7/31/23 09:58, Dave Chinner wrote:
> On Thu, Jul 27, 2023 at 06:28:52PM +0200, Christian Brauner wrote:
>> On Thu, Jul 27, 2023 at 05:17:30PM +0100, Pavel Begunkov wrote:
>>> On 7/27/23 16:52, Christian Brauner wrote:
>>>> On Thu, Jul 27, 2023 at 04:12:12PM +0100, Pavel Begunkov wrote:
>>>> It would also solve it for writes which is what my kiocb_modified()
>>>> comment was about. So right now you have:
>>>
>>> Great, I assumed there are stricter requirements for mtime not
>>> transiently failing.
>>
>> But I mean then wouldn't this already be a problem today?
>> kiocb_modified() can error out with EAGAIN today:
>>
>>            ret = inode_needs_update_time(inode, &now);
>>            if (ret <= 0)
>>                    return ret;
>>            if (flags & IOCB_NOWAIT)
>>                    return -EAGAIN;
>>
>>            return __file_update_time(file, &now, ret);
>>
>> the thing is that it doesn't matter for ->write_iter() - for xfs at
>> least - because xfs does it as part of preparatory checks before
>> actually doing any real work. The problem happens when you do actual
>> work and afterwards call kiocb_modified(). That's why I think (2) is
>> preferable.
> 
> This has nothing to do with what "XFS does". It's actually an
> IOCB_NOWAIT API design constraint.
> 
> That is, IOCB_NOWAIT means "complete the whole operation without
> blocking or return -EAGAIN having done nothing".  If we have to do
> something that might block (like a timestamp update) then we need to
> punt the entire operation before anything has been modified.  This
> requires all the "do we need to modify this" checks to be done up
> front before we start modifying anything.
> 
> So while it looks like this might be "an XFS thing", that's because
> XFS tends to be the first filesystem that most io_uring NOWAIT
> functionality is implemented on. IOWs, what you see is XFS is doing
> things the way IOCB_NOWAIT requires to be done. i.e. it's a
> demonstration of how nonblocking filesystem modification operations
> need to be run, not an "XFS thing"...
> 
>>>> I would prefer 2) which seems cleaner to me. But I might miss why this
>>>> won't work. So input needed/wanted.
>>>
>>> Maybe I didn't fully grasp the (2) idea
>>>
>>> 2.1: all read_iter, write_iter, etc. callbacks should do file_accessed()
>>> before doing IO, which sounds like a good option if everyone agrees with
>>> that. Taking a look at direct block io, it's already like this.
>>
>> Yes, that's what I'm talking about. I'm asking whether that's ok for xfs
>> maintainers basically. i_op->write_iter() already works like that since
>> the dawn of time but i_op->read_iter doesn't and I'm proposing to make
>> it work like that and wondering if there's any issues I'm unaware of.
> 
> XFS already calls file_accessed() in the DIO read path before the
> read gets issued. I don't see any problem with lifting it to before

Hi Dave,

Here I've a question, in DIO read path, if we update the time but
later somehow got errors before actual reading, e.g. return -EAGAIN
from the xfs_ilock_iocb(), shouldn't we revert the time update since
we actually doesn't read the file? We can lazily update the time but
on the contrary a false update sounds weird to me.

Thanks,
Hao

> the copy-out loop in filemap_read() because it is run regardless of
> whether any data is read or any error occurred.  Hence it just
> doesn't look like it matters if it is run before or after the
> copy-out loop to me....
> 
> -Dave.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/5] io_uring: add support for getdents
  2023-07-31  7:34                   ` Hao Xu
@ 2023-07-31  7:50                     ` Christian Brauner
  0 siblings, 0 replies; 41+ messages in thread
From: Christian Brauner @ 2023-07-31  7:50 UTC (permalink / raw)
  To: Hao Xu
  Cc: Dave Chinner, Pavel Begunkov, djwong, Jens Axboe, io-uring,
	Dominique Martinet, Alexander Viro, Stefan Roesch, Clay Harris,
	linux-fsdevel, Wanpeng Li, josef

On Mon, Jul 31, 2023 at 03:34:48PM +0800, Hao Xu wrote:
> On 7/31/23 09:58, Dave Chinner wrote:
> > On Thu, Jul 27, 2023 at 06:28:52PM +0200, Christian Brauner wrote:
> > > On Thu, Jul 27, 2023 at 05:17:30PM +0100, Pavel Begunkov wrote:
> > > > On 7/27/23 16:52, Christian Brauner wrote:
> > > > > On Thu, Jul 27, 2023 at 04:12:12PM +0100, Pavel Begunkov wrote:
> > > > > It would also solve it for writes which is what my kiocb_modified()
> > > > > comment was about. So right now you have:
> > > > 
> > > > Great, I assumed there are stricter requirements for mtime not
> > > > transiently failing.
> > > 
> > > But I mean then wouldn't this already be a problem today?
> > > kiocb_modified() can error out with EAGAIN today:
> > > 
> > >            ret = inode_needs_update_time(inode, &now);
> > >            if (ret <= 0)
> > >                    return ret;
> > >            if (flags & IOCB_NOWAIT)
> > >                    return -EAGAIN;
> > > 
> > >            return __file_update_time(file, &now, ret);
> > > 
> > > the thing is that it doesn't matter for ->write_iter() - for xfs at
> > > least - because xfs does it as part of preparatory checks before
> > > actually doing any real work. The problem happens when you do actual
> > > work and afterwards call kiocb_modified(). That's why I think (2) is
> > > preferable.
> > 
> > This has nothing to do with what "XFS does". It's actually an
> > IOCB_NOWAIT API design constraint.
> > 
> > That is, IOCB_NOWAIT means "complete the whole operation without
> > blocking or return -EAGAIN having done nothing".  If we have to do
> > something that might block (like a timestamp update) then we need to
> > punt the entire operation before anything has been modified.  This
> > requires all the "do we need to modify this" checks to be done up
> > front before we start modifying anything.
> > 
> > So while it looks like this might be "an XFS thing", that's because
> > XFS tends to be the first filesystem that most io_uring NOWAIT
> > functionality is implemented on. IOWs, what you see is XFS is doing
> > things the way IOCB_NOWAIT requires to be done. i.e. it's a
> > demonstration of how nonblocking filesystem modification operations
> > need to be run, not an "XFS thing"...
> > 
> > > > > I would prefer 2) which seems cleaner to me. But I might miss why this
> > > > > won't work. So input needed/wanted.
> > > > 
> > > > Maybe I didn't fully grasp the (2) idea
> > > > 
> > > > 2.1: all read_iter, write_iter, etc. callbacks should do file_accessed()
> > > > before doing IO, which sounds like a good option if everyone agrees with
> > > > that. Taking a look at direct block io, it's already like this.
> > > 
> > > Yes, that's what I'm talking about. I'm asking whether that's ok for xfs
> > > maintainers basically. i_op->write_iter() already works like that since
> > > the dawn of time but i_op->read_iter doesn't and I'm proposing to make
> > > it work like that and wondering if there's any issues I'm unaware of.
> > 
> > XFS already calls file_accessed() in the DIO read path before the
> > read gets issued. I don't see any problem with lifting it to before
> 
> Hi Dave,
> 
> Here I've a question, in DIO read path, if we update the time but
> later somehow got errors before actual reading, e.g. return -EAGAIN
> from the xfs_ilock_iocb(), shouldn't we revert the time update since
> we actually doesn't read the file? We can lazily update the time but
> on the contrary a false update sounds weird to me.

You've read the file and you failed but you did access the file. We have
the same logic kinda for readdir in the sense that we call
file_accessed() no matter if ->iterate_shared() or ->iterate() succeeded
or not.

Plus, you should just be able to reorder the read checks to mirror the
write checks. Afaict, the order of the write checks are:

xfs_file_write_checks()
-> xfs_ilock_iocb()
-> kiocb_modified()

so we can just do:

+ xfs_file_read_checks()
  +-> xfs_ilock_iocb()
      +-> file_accessed()/kiocb_accessed()

for the read checks. Then we managed to acquire the necessary lock.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/5] io_uring: add support for getdents
  2023-07-31  1:58                 ` Dave Chinner
  2023-07-31  7:34                   ` Hao Xu
@ 2023-07-31  7:40                   ` Christian Brauner
  1 sibling, 0 replies; 41+ messages in thread
From: Christian Brauner @ 2023-07-31  7:40 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Pavel Begunkov, Hao Xu, djwong, Jens Axboe, io-uring,
	Dominique Martinet, Alexander Viro, Stefan Roesch, Clay Harris,
	linux-fsdevel, Wanpeng Li, josef

On Mon, Jul 31, 2023 at 11:58:50AM +1000, Dave Chinner wrote:
> On Thu, Jul 27, 2023 at 06:28:52PM +0200, Christian Brauner wrote:
> > On Thu, Jul 27, 2023 at 05:17:30PM +0100, Pavel Begunkov wrote:
> > > On 7/27/23 16:52, Christian Brauner wrote:
> > > > On Thu, Jul 27, 2023 at 04:12:12PM +0100, Pavel Begunkov wrote:
> > > > It would also solve it for writes which is what my kiocb_modified()
> > > > comment was about. So right now you have:
> > > 
> > > Great, I assumed there are stricter requirements for mtime not
> > > transiently failing.
> > 
> > But I mean then wouldn't this already be a problem today?
> > kiocb_modified() can error out with EAGAIN today:
> > 
> >           ret = inode_needs_update_time(inode, &now);
> >           if (ret <= 0)
> >                   return ret;
> >           if (flags & IOCB_NOWAIT)
> >                   return -EAGAIN;
> > 
> >           return __file_update_time(file, &now, ret);
> > 
> > the thing is that it doesn't matter for ->write_iter() - for xfs at
> > least - because xfs does it as part of preparatory checks before
> > actually doing any real work. The problem happens when you do actual
> > work and afterwards call kiocb_modified(). That's why I think (2) is
> > preferable.
> 
> This has nothing to do with what "XFS does". It's actually an
> IOCB_NOWAIT API design constraint.
> 
> That is, IOCB_NOWAIT means "complete the whole operation without
> blocking or return -EAGAIN having done nothing".  If we have to do
> something that might block (like a timestamp update) then we need to
> punt the entire operation before anything has been modified.  This
> requires all the "do we need to modify this" checks to be done up
> front before we start modifying anything.
> 
> So while it looks like this might be "an XFS thing", that's because
> XFS tends to be the first filesystem that most io_uring NOWAIT
> functionality is implemented on. IOWs, what you see is XFS is doing
> things the way IOCB_NOWAIT requires to be done. i.e. it's a
> demonstration of how nonblocking filesystem modification operations
> need to be run, not an "XFS thing"...

Yes, I'm aware. I was trying to pay xfs a compliment for that but
somehow that didn't come through.

> 
> > > > I would prefer 2) which seems cleaner to me. But I might miss why this
> > > > won't work. So input needed/wanted.
> > > 
> > > Maybe I didn't fully grasp the (2) idea
> > > 
> > > 2.1: all read_iter, write_iter, etc. callbacks should do file_accessed()
> > > before doing IO, which sounds like a good option if everyone agrees with
> > > that. Taking a look at direct block io, it's already like this.
> > 
> > Yes, that's what I'm talking about. I'm asking whether that's ok for xfs
> > maintainers basically. i_op->write_iter() already works like that since
> > the dawn of time but i_op->read_iter doesn't and I'm proposing to make
> > it work like that and wondering if there's any issues I'm unaware of.
> 
> XFS already calls file_accessed() in the DIO read path before the
> read gets issued. I don't see any problem with lifting it to before
> the copy-out loop in filemap_read() because it is run regardless of
> whether any data is read or any error occurred.  Hence it just
> doesn't look like it matters if it is run before or after the
> copy-out loop to me....

Great.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/5] io_uring: add support for getdents
  2023-07-27 14:27       ` Christian Brauner
  2023-07-27 15:12         ` Pavel Begunkov
@ 2023-07-30 18:02         ` Hao Xu
  2023-07-31  8:18           ` Christian Brauner
  2023-07-31  1:33         ` Dave Chinner
  2 siblings, 1 reply; 41+ messages in thread
From: Hao Xu @ 2023-07-30 18:02 UTC (permalink / raw)
  To: Christian Brauner, djwong, Dave Chinner, Jens Axboe
  Cc: io-uring, Dominique Martinet, Pavel Begunkov, Alexander Viro,
	Stefan Roesch, Clay Harris, linux-fsdevel, Wanpeng Li

Hi Christian,

On 7/27/23 22:27, Christian Brauner wrote:
> On Thu, Jul 27, 2023 at 07:51:19PM +0800, Hao Xu wrote:
>> On 7/26/23 23:00, Christian Brauner wrote:
>>> On Tue, Jul 18, 2023 at 09:21:10PM +0800, Hao Xu wrote:
>>>> From: Hao Xu <[email protected]>
>>>>
>>>> This add support for getdents64 to io_uring, acting exactly like the
>>>> syscall: the directory is iterated from it's current's position as
>>>> stored in the file struct, and the file's position is updated exactly as
>>>> if getdents64 had been called.
>>>>
>>>> For filesystems that support NOWAIT in iterate_shared(), try to use it
>>>> first; if a user already knows the filesystem they use do not support
>>>> nowait they can force async through IOSQE_ASYNC in the sqe flags,
>>>> avoiding the need to bounce back through a useless EAGAIN return.
>>>>
>>>> Co-developed-by: Dominique Martinet <[email protected]>
>>>> Signed-off-by: Dominique Martinet <[email protected]>
>>>> Signed-off-by: Hao Xu <[email protected]>
>>>> ---
>>>>    include/uapi/linux/io_uring.h |  7 +++++
>>>>    io_uring/fs.c                 | 55 +++++++++++++++++++++++++++++++++++
>>>>    io_uring/fs.h                 |  3 ++
>>>>    io_uring/opdef.c              |  8 +++++
>>>>    4 files changed, 73 insertions(+)
>>>>
>>>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
>>>> index 36f9c73082de..b200b2600622 100644
>>>> --- a/include/uapi/linux/io_uring.h
>>>> +++ b/include/uapi/linux/io_uring.h
>>>> @@ -65,6 +65,7 @@ struct io_uring_sqe {
>>>>    		__u32		xattr_flags;
>>>>    		__u32		msg_ring_flags;
>>>>    		__u32		uring_cmd_flags;
>>>> +		__u32		getdents_flags;
>>>>    	};
>>>>    	__u64	user_data;	/* data to be passed back at completion time */
>>>>    	/* pack this to avoid bogus arm OABI complaints */
>>>> @@ -235,6 +236,7 @@ enum io_uring_op {
>>>>    	IORING_OP_URING_CMD,
>>>>    	IORING_OP_SEND_ZC,
>>>>    	IORING_OP_SENDMSG_ZC,
>>>> +	IORING_OP_GETDENTS,
>>>>    	/* this goes last, obviously */
>>>>    	IORING_OP_LAST,
>>>> @@ -273,6 +275,11 @@ enum io_uring_op {
>>>>     */
>>>>    #define SPLICE_F_FD_IN_FIXED	(1U << 31) /* the last bit of __u32 */
>>>> +/*
>>>> + * sqe->getdents_flags
>>>> + */
>>>> +#define IORING_GETDENTS_REWIND	(1U << 0)
>>>> +
>>>>    /*
>>>>     * POLL_ADD flags. Note that since sqe->poll_events is the flag space, the
>>>>     * command flags for POLL_ADD are stored in sqe->len.
>>>> diff --git a/io_uring/fs.c b/io_uring/fs.c
>>>> index f6a69a549fd4..480f25677fed 100644
>>>> --- a/io_uring/fs.c
>>>> +++ b/io_uring/fs.c
>>>> @@ -47,6 +47,13 @@ struct io_link {
>>>>    	int				flags;
>>>>    };
>>>> +struct io_getdents {
>>>> +	struct file			*file;
>>>> +	struct linux_dirent64 __user	*dirent;
>>>> +	unsigned int			count;
>>>> +	int				flags;
>>>> +};
>>>> +
>>>>    int io_renameat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>>>    {
>>>>    	struct io_rename *ren = io_kiocb_to_cmd(req, struct io_rename);
>>>> @@ -291,3 +298,51 @@ void io_link_cleanup(struct io_kiocb *req)
>>>>    	putname(sl->oldpath);
>>>>    	putname(sl->newpath);
>>>>    }
>>>> +
>>>> +int io_getdents_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>>> +{
>>>> +	struct io_getdents *gd = io_kiocb_to_cmd(req, struct io_getdents);
>>>> +
>>>> +	if (READ_ONCE(sqe->off) != 0)
>>>> +		return -EINVAL;
>>>> +
>>>> +	gd->dirent = u64_to_user_ptr(READ_ONCE(sqe->addr));
>>>> +	gd->count = READ_ONCE(sqe->len);
>>>> +
>>>> +	return 0;
>>>> +}
>>>> +
>>>> +int io_getdents(struct io_kiocb *req, unsigned int issue_flags)
>>>> +{
>>>> +	struct io_getdents *gd = io_kiocb_to_cmd(req, struct io_getdents);
>>>> +	struct file *file = req->file;
>>>> +	unsigned long getdents_flags = 0;
>>>> +	bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
>>>
>>> Hm, I'm not sure what exactly the rules are for IO_URING_F_NONBLOCK.
>>> But to point this out:
>>>
>>> vfs_getdents()
>>> -> iterate_dir()
>>>      {
>>>           if (shared)
>>>                   res = down_read_killable(&inode->i_rwsem);
>>>           else
>>>                   res = down_write_killable(&inode->i_rwsem);
>>>      }
>>>
>>> which means you can still end up sleeping here before you go into a
>>> filesystem that does actually support non-waiting getdents. So if you
>>> have concurrent operations that grab inode lock (touch, mkdir etc) you
>>> can end up sleeping here.
>>>
>>> Is that intentional or an oversight? If the former can someone please
>>> explain the rules and why it's fine in this case?
>>
>> I actually saw this semaphore, and there is another xfs lock in
>> file_accessed
>>    --> touch_atime
>>      --> inode_update_time
>>        --> inode->i_op->update_time == xfs_vn_update_time
>>
>> Forgot to point them out in the cover-letter..., I didn't modify them
>> since I'm not very sure about if we should do so, and I saw Stefan's
>> patchset didn't modify them too.
>>
>> My personnal thinking is we should apply trylock logic for this
>> inode->i_rwsem. For xfs lock in touch_atime, we should do that since it
>> doesn't make sense to rollback all the stuff while we are almost at the
>> end of getdents because of a lock.
> 
> That manoeuvres around the problem. Which I'm slightly more sensitive
> too as this review is a rather expensive one.
> 
> Plus, it seems fixable in at least two ways:
> 
> For both we need to be able to tell the filesystem that a nowait atime
> update is requested. Simple thing seems to me to add a S_NOWAIT flag to
> file_time_flags and passing that via i_op->update_time() which already
> has a flag argument. That would likely also help kiocb_modified().
> 
> file_accessed()
> -> touch_atime()
>     -> inode_update_time()
>        -> i_op->update_time == xfs_vn_update_time()
> 
> Then we have two options afaict:
> 
> (1) best-effort atime update
> 
> file_accessed() already has the builtin assumption that updating atime
> might fail for other reasons - see the comment in there. So it is
> somewhat best-effort already.
> 
> (2) move atime update before calling into filesystem
> 
> If we want to be sure that access time is updated when a readdir request
> is issued through io_uring then we need to have file_accessed() give a
> return value and expose a new helper for io_uring or modify
> vfs_getdents() to do something like:
> 
> vfs_getdents()
> {
> 	if (nowait)
> 		down_read_trylock()
> 
> 	if (!IS_DEADDIR(inode)) {
> 		ret = file_accessed(file);
> 		if (ret == -EAGAIN)
> 			goto out_unlock;
> 
> 		f_op->iterate_shared()
> 	}
> }
> 
> It's not unprecedented to do update atime before the actual operation
> has been done afaict. That's already the case in xfs_file_write_checks()
> which is called before anything is written. So that seems ok.

I'm not familiar with this part(the time update), I guess we should
revert the updated time if we succeed to do file_accessed(file) but
fail somewhere later in f_op->iterate_shared()? Or is it definitely
counted as an "access" as long as we start to call getdents to a file?

Thanks,
Hao

> 
> Does any of these two options work for the xfs maintainers and Jens?


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/5] io_uring: add support for getdents
  2023-07-30 18:02         ` Hao Xu
@ 2023-07-31  8:18           ` Christian Brauner
  2023-07-31  9:31             ` Hao Xu
  0 siblings, 1 reply; 41+ messages in thread
From: Christian Brauner @ 2023-07-31  8:18 UTC (permalink / raw)
  To: Hao Xu
  Cc: djwong, Dave Chinner, Jens Axboe, io-uring, Dominique Martinet,
	Pavel Begunkov, Alexander Viro, Stefan Roesch, Clay Harris,
	linux-fsdevel, Wanpeng Li

On Mon, Jul 31, 2023 at 02:02:25AM +0800, Hao Xu wrote:
> Hi Christian,
> 
> On 7/27/23 22:27, Christian Brauner wrote:
> > On Thu, Jul 27, 2023 at 07:51:19PM +0800, Hao Xu wrote:
> > > On 7/26/23 23:00, Christian Brauner wrote:
> > > > On Tue, Jul 18, 2023 at 09:21:10PM +0800, Hao Xu wrote:
> > > > > From: Hao Xu <[email protected]>
> > > > > 
> > > > > This add support for getdents64 to io_uring, acting exactly like the
> > > > > syscall: the directory is iterated from it's current's position as
> > > > > stored in the file struct, and the file's position is updated exactly as
> > > > > if getdents64 had been called.
> > > > > 
> > > > > For filesystems that support NOWAIT in iterate_shared(), try to use it
> > > > > first; if a user already knows the filesystem they use do not support
> > > > > nowait they can force async through IOSQE_ASYNC in the sqe flags,
> > > > > avoiding the need to bounce back through a useless EAGAIN return.
> > > > > 
> > > > > Co-developed-by: Dominique Martinet <[email protected]>
> > > > > Signed-off-by: Dominique Martinet <[email protected]>
> > > > > Signed-off-by: Hao Xu <[email protected]>
> > > > > ---
> > > > >    include/uapi/linux/io_uring.h |  7 +++++
> > > > >    io_uring/fs.c                 | 55 +++++++++++++++++++++++++++++++++++
> > > > >    io_uring/fs.h                 |  3 ++
> > > > >    io_uring/opdef.c              |  8 +++++
> > > > >    4 files changed, 73 insertions(+)
> > > > > 
> > > > > diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> > > > > index 36f9c73082de..b200b2600622 100644
> > > > > --- a/include/uapi/linux/io_uring.h
> > > > > +++ b/include/uapi/linux/io_uring.h
> > > > > @@ -65,6 +65,7 @@ struct io_uring_sqe {
> > > > >    		__u32		xattr_flags;
> > > > >    		__u32		msg_ring_flags;
> > > > >    		__u32		uring_cmd_flags;
> > > > > +		__u32		getdents_flags;
> > > > >    	};
> > > > >    	__u64	user_data;	/* data to be passed back at completion time */
> > > > >    	/* pack this to avoid bogus arm OABI complaints */
> > > > > @@ -235,6 +236,7 @@ enum io_uring_op {
> > > > >    	IORING_OP_URING_CMD,
> > > > >    	IORING_OP_SEND_ZC,
> > > > >    	IORING_OP_SENDMSG_ZC,
> > > > > +	IORING_OP_GETDENTS,
> > > > >    	/* this goes last, obviously */
> > > > >    	IORING_OP_LAST,
> > > > > @@ -273,6 +275,11 @@ enum io_uring_op {
> > > > >     */
> > > > >    #define SPLICE_F_FD_IN_FIXED	(1U << 31) /* the last bit of __u32 */
> > > > > +/*
> > > > > + * sqe->getdents_flags
> > > > > + */
> > > > > +#define IORING_GETDENTS_REWIND	(1U << 0)
> > > > > +
> > > > >    /*
> > > > >     * POLL_ADD flags. Note that since sqe->poll_events is the flag space, the
> > > > >     * command flags for POLL_ADD are stored in sqe->len.
> > > > > diff --git a/io_uring/fs.c b/io_uring/fs.c
> > > > > index f6a69a549fd4..480f25677fed 100644
> > > > > --- a/io_uring/fs.c
> > > > > +++ b/io_uring/fs.c
> > > > > @@ -47,6 +47,13 @@ struct io_link {
> > > > >    	int				flags;
> > > > >    };
> > > > > +struct io_getdents {
> > > > > +	struct file			*file;
> > > > > +	struct linux_dirent64 __user	*dirent;
> > > > > +	unsigned int			count;
> > > > > +	int				flags;
> > > > > +};
> > > > > +
> > > > >    int io_renameat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
> > > > >    {
> > > > >    	struct io_rename *ren = io_kiocb_to_cmd(req, struct io_rename);
> > > > > @@ -291,3 +298,51 @@ void io_link_cleanup(struct io_kiocb *req)
> > > > >    	putname(sl->oldpath);
> > > > >    	putname(sl->newpath);
> > > > >    }
> > > > > +
> > > > > +int io_getdents_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
> > > > > +{
> > > > > +	struct io_getdents *gd = io_kiocb_to_cmd(req, struct io_getdents);
> > > > > +
> > > > > +	if (READ_ONCE(sqe->off) != 0)
> > > > > +		return -EINVAL;
> > > > > +
> > > > > +	gd->dirent = u64_to_user_ptr(READ_ONCE(sqe->addr));
> > > > > +	gd->count = READ_ONCE(sqe->len);
> > > > > +
> > > > > +	return 0;
> > > > > +}
> > > > > +
> > > > > +int io_getdents(struct io_kiocb *req, unsigned int issue_flags)
> > > > > +{
> > > > > +	struct io_getdents *gd = io_kiocb_to_cmd(req, struct io_getdents);
> > > > > +	struct file *file = req->file;
> > > > > +	unsigned long getdents_flags = 0;
> > > > > +	bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
> > > > 
> > > > Hm, I'm not sure what exactly the rules are for IO_URING_F_NONBLOCK.
> > > > But to point this out:
> > > > 
> > > > vfs_getdents()
> > > > -> iterate_dir()
> > > >      {
> > > >           if (shared)
> > > >                   res = down_read_killable(&inode->i_rwsem);
> > > >           else
> > > >                   res = down_write_killable(&inode->i_rwsem);
> > > >      }
> > > > 
> > > > which means you can still end up sleeping here before you go into a
> > > > filesystem that does actually support non-waiting getdents. So if you
> > > > have concurrent operations that grab inode lock (touch, mkdir etc) you
> > > > can end up sleeping here.
> > > > 
> > > > Is that intentional or an oversight? If the former can someone please
> > > > explain the rules and why it's fine in this case?
> > > 
> > > I actually saw this semaphore, and there is another xfs lock in
> > > file_accessed
> > >    --> touch_atime
> > >      --> inode_update_time
> > >        --> inode->i_op->update_time == xfs_vn_update_time
> > > 
> > > Forgot to point them out in the cover-letter..., I didn't modify them
> > > since I'm not very sure about if we should do so, and I saw Stefan's
> > > patchset didn't modify them too.
> > > 
> > > My personnal thinking is we should apply trylock logic for this
> > > inode->i_rwsem. For xfs lock in touch_atime, we should do that since it
> > > doesn't make sense to rollback all the stuff while we are almost at the
> > > end of getdents because of a lock.
> > 
> > That manoeuvres around the problem. Which I'm slightly more sensitive
> > too as this review is a rather expensive one.
> > 
> > Plus, it seems fixable in at least two ways:
> > 
> > For both we need to be able to tell the filesystem that a nowait atime
> > update is requested. Simple thing seems to me to add a S_NOWAIT flag to
> > file_time_flags and passing that via i_op->update_time() which already
> > has a flag argument. That would likely also help kiocb_modified().
> > 
> > file_accessed()
> > -> touch_atime()
> >     -> inode_update_time()
> >        -> i_op->update_time == xfs_vn_update_time()
> > 
> > Then we have two options afaict:
> > 
> > (1) best-effort atime update
> > 
> > file_accessed() already has the builtin assumption that updating atime
> > might fail for other reasons - see the comment in there. So it is
> > somewhat best-effort already.
> > 
> > (2) move atime update before calling into filesystem
> > 
> > If we want to be sure that access time is updated when a readdir request
> > is issued through io_uring then we need to have file_accessed() give a
> > return value and expose a new helper for io_uring or modify
> > vfs_getdents() to do something like:
> > 
> > vfs_getdents()
> > {
> > 	if (nowait)
> > 		down_read_trylock()
> > 
> > 	if (!IS_DEADDIR(inode)) {
> > 		ret = file_accessed(file);
> > 		if (ret == -EAGAIN)
> > 			goto out_unlock;
> > 
> > 		f_op->iterate_shared()
> > 	}
> > }
> > 
> > It's not unprecedented to do update atime before the actual operation
> > has been done afaict. That's already the case in xfs_file_write_checks()
> > which is called before anything is written. So that seems ok.
> 
> I'm not familiar with this part(the time update), I guess we should
> revert the updated time if we succeed to do file_accessed(file) but
> fail somewhere later in f_op->iterate_shared()? Or is it definitely
> counted as an "access" as long as we start to call getdents to a file?

To answer that you can simply take a look at readdir rn

res = -ENOENT;
if (!IS_DEADDIR(inode)) {
        ctx->pos = file->f_pos;
        if (shared)
                res = file->f_op->iterate_shared(file, ctx);
        else
                res = file->f_op->iterate(file, ctx);
        file->f_pos = ctx->pos;
        fsnotify_access(file);
        file_accessed(file);
}

Also, I've said this before: touch_atime() is currently best effort. It
may fail for any kind of reason.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/5] io_uring: add support for getdents
  2023-07-31  8:18           ` Christian Brauner
@ 2023-07-31  9:31             ` Hao Xu
  0 siblings, 0 replies; 41+ messages in thread
From: Hao Xu @ 2023-07-31  9:31 UTC (permalink / raw)
  To: Christian Brauner
  Cc: djwong, Dave Chinner, Jens Axboe, io-uring, Dominique Martinet,
	Pavel Begunkov, Alexander Viro, Stefan Roesch, Clay Harris,
	linux-fsdevel, Wanpeng Li


On 7/31/23 16:18, Christian Brauner wrote:
> On Mon, Jul 31, 2023 at 02:02:25AM +0800, Hao Xu wrote:
>> Hi Christian,
>>
>> On 7/27/23 22:27, Christian Brauner wrote:
>>> On Thu, Jul 27, 2023 at 07:51:19PM +0800, Hao Xu wrote:
>>>> On 7/26/23 23:00, Christian Brauner wrote:
>>>>> On Tue, Jul 18, 2023 at 09:21:10PM +0800, Hao Xu wrote:
>>>>>> From: Hao Xu <[email protected]>
>>>>>>
>>>>>> This add support for getdents64 to io_uring, acting exactly like the
>>>>>> syscall: the directory is iterated from it's current's position as
>>>>>> stored in the file struct, and the file's position is updated exactly as
>>>>>> if getdents64 had been called.
>>>>>>
>>>>>> For filesystems that support NOWAIT in iterate_shared(), try to use it
>>>>>> first; if a user already knows the filesystem they use do not support
>>>>>> nowait they can force async through IOSQE_ASYNC in the sqe flags,
>>>>>> avoiding the need to bounce back through a useless EAGAIN return.
>>>>>>
>>>>>> Co-developed-by: Dominique Martinet <[email protected]>
>>>>>> Signed-off-by: Dominique Martinet <[email protected]>
>>>>>> Signed-off-by: Hao Xu <[email protected]>
>>>>>> ---
>>>>>>     include/uapi/linux/io_uring.h |  7 +++++
>>>>>>     io_uring/fs.c                 | 55 +++++++++++++++++++++++++++++++++++
>>>>>>     io_uring/fs.h                 |  3 ++
>>>>>>     io_uring/opdef.c              |  8 +++++
>>>>>>     4 files changed, 73 insertions(+)
>>>>>>
>>>>>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
>>>>>> index 36f9c73082de..b200b2600622 100644
>>>>>> --- a/include/uapi/linux/io_uring.h
>>>>>> +++ b/include/uapi/linux/io_uring.h
>>>>>> @@ -65,6 +65,7 @@ struct io_uring_sqe {
>>>>>>     		__u32		xattr_flags;
>>>>>>     		__u32		msg_ring_flags;
>>>>>>     		__u32		uring_cmd_flags;
>>>>>> +		__u32		getdents_flags;
>>>>>>     	};
>>>>>>     	__u64	user_data;	/* data to be passed back at completion time */
>>>>>>     	/* pack this to avoid bogus arm OABI complaints */
>>>>>> @@ -235,6 +236,7 @@ enum io_uring_op {
>>>>>>     	IORING_OP_URING_CMD,
>>>>>>     	IORING_OP_SEND_ZC,
>>>>>>     	IORING_OP_SENDMSG_ZC,
>>>>>> +	IORING_OP_GETDENTS,
>>>>>>     	/* this goes last, obviously */
>>>>>>     	IORING_OP_LAST,
>>>>>> @@ -273,6 +275,11 @@ enum io_uring_op {
>>>>>>      */
>>>>>>     #define SPLICE_F_FD_IN_FIXED	(1U << 31) /* the last bit of __u32 */
>>>>>> +/*
>>>>>> + * sqe->getdents_flags
>>>>>> + */
>>>>>> +#define IORING_GETDENTS_REWIND	(1U << 0)
>>>>>> +
>>>>>>     /*
>>>>>>      * POLL_ADD flags. Note that since sqe->poll_events is the flag space, the
>>>>>>      * command flags for POLL_ADD are stored in sqe->len.
>>>>>> diff --git a/io_uring/fs.c b/io_uring/fs.c
>>>>>> index f6a69a549fd4..480f25677fed 100644
>>>>>> --- a/io_uring/fs.c
>>>>>> +++ b/io_uring/fs.c
>>>>>> @@ -47,6 +47,13 @@ struct io_link {
>>>>>>     	int				flags;
>>>>>>     };
>>>>>> +struct io_getdents {
>>>>>> +	struct file			*file;
>>>>>> +	struct linux_dirent64 __user	*dirent;
>>>>>> +	unsigned int			count;
>>>>>> +	int				flags;
>>>>>> +};
>>>>>> +
>>>>>>     int io_renameat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>>>>>     {
>>>>>>     	struct io_rename *ren = io_kiocb_to_cmd(req, struct io_rename);
>>>>>> @@ -291,3 +298,51 @@ void io_link_cleanup(struct io_kiocb *req)
>>>>>>     	putname(sl->oldpath);
>>>>>>     	putname(sl->newpath);
>>>>>>     }
>>>>>> +
>>>>>> +int io_getdents_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>>>>> +{
>>>>>> +	struct io_getdents *gd = io_kiocb_to_cmd(req, struct io_getdents);
>>>>>> +
>>>>>> +	if (READ_ONCE(sqe->off) != 0)
>>>>>> +		return -EINVAL;
>>>>>> +
>>>>>> +	gd->dirent = u64_to_user_ptr(READ_ONCE(sqe->addr));
>>>>>> +	gd->count = READ_ONCE(sqe->len);
>>>>>> +
>>>>>> +	return 0;
>>>>>> +}
>>>>>> +
>>>>>> +int io_getdents(struct io_kiocb *req, unsigned int issue_flags)
>>>>>> +{
>>>>>> +	struct io_getdents *gd = io_kiocb_to_cmd(req, struct io_getdents);
>>>>>> +	struct file *file = req->file;
>>>>>> +	unsigned long getdents_flags = 0;
>>>>>> +	bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
>>>>> Hm, I'm not sure what exactly the rules are for IO_URING_F_NONBLOCK.
>>>>> But to point this out:
>>>>>
>>>>> vfs_getdents()
>>>>> -> iterate_dir()
>>>>>       {
>>>>>            if (shared)
>>>>>                    res = down_read_killable(&inode->i_rwsem);
>>>>>            else
>>>>>                    res = down_write_killable(&inode->i_rwsem);
>>>>>       }
>>>>>
>>>>> which means you can still end up sleeping here before you go into a
>>>>> filesystem that does actually support non-waiting getdents. So if you
>>>>> have concurrent operations that grab inode lock (touch, mkdir etc) you
>>>>> can end up sleeping here.
>>>>>
>>>>> Is that intentional or an oversight? If the former can someone please
>>>>> explain the rules and why it's fine in this case?
>>>> I actually saw this semaphore, and there is another xfs lock in
>>>> file_accessed
>>>>     --> touch_atime
>>>>       --> inode_update_time
>>>>         --> inode->i_op->update_time == xfs_vn_update_time
>>>>
>>>> Forgot to point them out in the cover-letter..., I didn't modify them
>>>> since I'm not very sure about if we should do so, and I saw Stefan's
>>>> patchset didn't modify them too.
>>>>
>>>> My personnal thinking is we should apply trylock logic for this
>>>> inode->i_rwsem. For xfs lock in touch_atime, we should do that since it
>>>> doesn't make sense to rollback all the stuff while we are almost at the
>>>> end of getdents because of a lock.
>>> That manoeuvres around the problem. Which I'm slightly more sensitive
>>> too as this review is a rather expensive one.
>>>
>>> Plus, it seems fixable in at least two ways:
>>>
>>> For both we need to be able to tell the filesystem that a nowait atime
>>> update is requested. Simple thing seems to me to add a S_NOWAIT flag to
>>> file_time_flags and passing that via i_op->update_time() which already
>>> has a flag argument. That would likely also help kiocb_modified().
>>>
>>> file_accessed()
>>> -> touch_atime()
>>>      -> inode_update_time()
>>>         -> i_op->update_time == xfs_vn_update_time()
>>>
>>> Then we have two options afaict:
>>>
>>> (1) best-effort atime update
>>>
>>> file_accessed() already has the builtin assumption that updating atime
>>> might fail for other reasons - see the comment in there. So it is
>>> somewhat best-effort already.
>>>
>>> (2) move atime update before calling into filesystem
>>>
>>> If we want to be sure that access time is updated when a readdir request
>>> is issued through io_uring then we need to have file_accessed() give a
>>> return value and expose a new helper for io_uring or modify
>>> vfs_getdents() to do something like:
>>>
>>> vfs_getdents()
>>> {
>>> 	if (nowait)
>>> 		down_read_trylock()
>>>
>>> 	if (!IS_DEADDIR(inode)) {
>>> 		ret = file_accessed(file);
>>> 		if (ret == -EAGAIN)
>>> 			goto out_unlock;
>>>
>>> 		f_op->iterate_shared()
>>> 	}
>>> }
>>>
>>> It's not unprecedented to do update atime before the actual operation
>>> has been done afaict. That's already the case in xfs_file_write_checks()
>>> which is called before anything is written. So that seems ok.
>> I'm not familiar with this part(the time update), I guess we should
>> revert the updated time if we succeed to do file_accessed(file) but
>> fail somewhere later in f_op->iterate_shared()? Or is it definitely
>> counted as an "access" as long as we start to call getdents to a file?
> To answer that you can simply take a look at readdir rn
>
> res = -ENOENT;
> if (!IS_DEADDIR(inode)) {
>          ctx->pos = file->f_pos;
>          if (shared)
>                  res = file->f_op->iterate_shared(file, ctx);
>          else
>                  res = file->f_op->iterate(file, ctx);
>          file->f_pos = ctx->pos;
>          fsnotify_access(file);
>          file_accessed(file);
> }
>
> Also, I've said this before: touch_atime() is currently best effort. It
> may fail for any kind of reason.


Gotcha, I checked the code and I think you are right, time updates

even the fop->iterate() quits at the very beginning. I'll move 
file_accessed(file)

to before f_op->the iterate(). Thanks.



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/5] io_uring: add support for getdents
  2023-07-27 14:27       ` Christian Brauner
  2023-07-27 15:12         ` Pavel Begunkov
  2023-07-30 18:02         ` Hao Xu
@ 2023-07-31  1:33         ` Dave Chinner
  2023-07-31  8:13           ` Christian Brauner
  2 siblings, 1 reply; 41+ messages in thread
From: Dave Chinner @ 2023-07-31  1:33 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Hao Xu, djwong, Jens Axboe, io-uring, Dominique Martinet,
	Pavel Begunkov, Alexander Viro, Stefan Roesch, Clay Harris,
	linux-fsdevel, Wanpeng Li

On Thu, Jul 27, 2023 at 04:27:30PM +0200, Christian Brauner wrote:
> On Thu, Jul 27, 2023 at 07:51:19PM +0800, Hao Xu wrote:
> > I actually saw this semaphore, and there is another xfs lock in
> > file_accessed
> >   --> touch_atime
> >     --> inode_update_time
> >       --> inode->i_op->update_time == xfs_vn_update_time
> > 
> > Forgot to point them out in the cover-letter..., I didn't modify them
> > since I'm not very sure about if we should do so, and I saw Stefan's
> > patchset didn't modify them too.
> > 
> > My personnal thinking is we should apply trylock logic for this
> > inode->i_rwsem. For xfs lock in touch_atime, we should do that since it
> > doesn't make sense to rollback all the stuff while we are almost at the
> > end of getdents because of a lock.
> 
> That manoeuvres around the problem. Which I'm slightly more sensitive
> too as this review is a rather expensive one.
> 
> Plus, it seems fixable in at least two ways:
> 
> For both we need to be able to tell the filesystem that a nowait atime
> update is requested. Simple thing seems to me to add a S_NOWAIT flag to
> file_time_flags and passing that via i_op->update_time() which already
> has a flag argument. That would likely also help kiocb_modified().

Wait - didn't we already fix this for mtime updates on IOCB_NOWAIT
modification operations? Yeah, we did:

kiocb_modified(iocb)
  file_modified_flags(iocb->ki_file, iocb->ki_flags)
    ....
    ret = inode_needs_update_time()
    if (ret <= 0)
	return ret;
    if (flags & IOCB_NOWAIT)
	return -EAGAIN;
    <does timestamp update>

> file_accessed()
> -> touch_atime()
>    -> inode_update_time()
>       -> i_op->update_time == xfs_vn_update_time()

Yeah, so this needs the same treatment as file_modified_flags() -
touch_atime() needs a flag variant that passes IOCB_NOWAIT, and
after atime_needs_update() returns trues we should check IOCB_NOWAIT
and return EAGAIN if it is set. That will punt the operation that
needs to the update to a worker thread that can block....

> Then we have two options afaict:
> 
> (1) best-effort atime update
> 
> file_accessed() already has the builtin assumption that updating atime
> might fail for other reasons - see the comment in there. So it is
> somewhat best-effort already.
> 
> (2) move atime update before calling into filesystem
> 
> If we want to be sure that access time is updated when a readdir request
> is issued through io_uring then we need to have file_accessed() give a
> return value and expose a new helper for io_uring or modify
> vfs_getdents() to do something like:
> 
> vfs_getdents()
> {
> 	if (nowait)
> 		down_read_trylock()
> 
> 	if (!IS_DEADDIR(inode)) {
> 		ret = file_accessed(file);
> 		if (ret == -EAGAIN)
> 			goto out_unlock;
> 
> 		f_op->iterate_shared()
> 	}
> }

Yup, that's the sort of thing that needs to be done.

But as I said in the "llseek for io-uring" thread, we need to stop
the game of whack-a-mole passing random nowait boolean flags to VFS
operations before it starts in earnest.  We really need a common
context structure (like we have a kiocb for IO operations) that
holds per operation control state so we have consistency across all
the operations that we need different behaviours for.

> It's not unprecedented to do update atime before the actual operation
> has been done afaict. That's already the case in xfs_file_write_checks()
> which is called before anything is written. So that seems ok.

Writes don't update atime - they update mtime, and there are other
considerations for doing the mtime update before the data
modification. e.g. we have to strip permissions from the inode prior
to any changes that are going to be made, so we've already got to do
all the "change inode metadata" checks and modifications before the
data is written anyway....

Cheers,

Dave.
-- 
Dave Chinner
[email protected]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/5] io_uring: add support for getdents
  2023-07-31  1:33         ` Dave Chinner
@ 2023-07-31  8:13           ` Christian Brauner
  2023-07-31 15:26             ` Darrick J. Wong
  2023-08-01 18:39             ` Hao Xu
  0 siblings, 2 replies; 41+ messages in thread
From: Christian Brauner @ 2023-07-31  8:13 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Hao Xu, djwong, Jens Axboe, io-uring, Dominique Martinet,
	Pavel Begunkov, Alexander Viro, Stefan Roesch, Clay Harris,
	linux-fsdevel, Wanpeng Li

On Mon, Jul 31, 2023 at 11:33:05AM +1000, Dave Chinner wrote:
> On Thu, Jul 27, 2023 at 04:27:30PM +0200, Christian Brauner wrote:
> > On Thu, Jul 27, 2023 at 07:51:19PM +0800, Hao Xu wrote:
> > > I actually saw this semaphore, and there is another xfs lock in
> > > file_accessed
> > >   --> touch_atime
> > >     --> inode_update_time
> > >       --> inode->i_op->update_time == xfs_vn_update_time
> > > 
> > > Forgot to point them out in the cover-letter..., I didn't modify them
> > > since I'm not very sure about if we should do so, and I saw Stefan's
> > > patchset didn't modify them too.
> > > 
> > > My personnal thinking is we should apply trylock logic for this
> > > inode->i_rwsem. For xfs lock in touch_atime, we should do that since it
> > > doesn't make sense to rollback all the stuff while we are almost at the
> > > end of getdents because of a lock.
> > 
> > That manoeuvres around the problem. Which I'm slightly more sensitive
> > too as this review is a rather expensive one.
> > 
> > Plus, it seems fixable in at least two ways:
> > 
> > For both we need to be able to tell the filesystem that a nowait atime
> > update is requested. Simple thing seems to me to add a S_NOWAIT flag to
> > file_time_flags and passing that via i_op->update_time() which already
> > has a flag argument. That would likely also help kiocb_modified().
> 
> Wait - didn't we already fix this for mtime updates on IOCB_NOWAIT
> modification operations? Yeah, we did:
> 
> kiocb_modified(iocb)
>   file_modified_flags(iocb->ki_file, iocb->ki_flags)
>     ....
>     ret = inode_needs_update_time()
>     if (ret <= 0)
> 	return ret;
>     if (flags & IOCB_NOWAIT)
> 	return -EAGAIN;
>     <does timestamp update>
> 
> > file_accessed()
> > -> touch_atime()
> >    -> inode_update_time()
> >       -> i_op->update_time == xfs_vn_update_time()
> 
> Yeah, so this needs the same treatment as file_modified_flags() -
> touch_atime() needs a flag variant that passes IOCB_NOWAIT, and
> after atime_needs_update() returns trues we should check IOCB_NOWAIT
> and return EAGAIN if it is set. That will punt the operation that
> needs to the update to a worker thread that can block....

As I tried to explain, I would prefer if we could inform the filesystem
through i_op->update_time() itself that this is async and give the
filesystem the ability to try and acquire the locks it needs and return
EAGAIN from i_op->update_time() itself if it can't acquire them.

> 
> > Then we have two options afaict:
> > 
> > (1) best-effort atime update
> > 
> > file_accessed() already has the builtin assumption that updating atime
> > might fail for other reasons - see the comment in there. So it is
> > somewhat best-effort already.
> > 
> > (2) move atime update before calling into filesystem
> > 
> > If we want to be sure that access time is updated when a readdir request
> > is issued through io_uring then we need to have file_accessed() give a
> > return value and expose a new helper for io_uring or modify
> > vfs_getdents() to do something like:
> > 
> > vfs_getdents()
> > {
> > 	if (nowait)
> > 		down_read_trylock()
> > 
> > 	if (!IS_DEADDIR(inode)) {
> > 		ret = file_accessed(file);
> > 		if (ret == -EAGAIN)
> > 			goto out_unlock;
> > 
> > 		f_op->iterate_shared()
> > 	}
> > }
> 
> Yup, that's the sort of thing that needs to be done.
> 
> But as I said in the "llseek for io-uring" thread, we need to stop
> the game of whack-a-mole passing random nowait boolean flags to VFS
> operations before it starts in earnest.  We really need a common
> context structure (like we have a kiocb for IO operations) that
> holds per operation control state so we have consistency across all
> the operations that we need different behaviours for.

Yes, I tend to agree and thought about the same. But right now we don't
have a lot of context. So I would lean towards a flag argument at most.

But I also wouldn't consider it necessarily wrong to start with booleans
or a flag first and in a couple of months if the need for more context
arises we know what kind of struct we want or need.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/5] io_uring: add support for getdents
  2023-07-31  8:13           ` Christian Brauner
@ 2023-07-31 15:26             ` Darrick J. Wong
  2023-07-31 22:18               ` Dave Chinner
  2023-08-01  0:28               ` Jens Axboe
  2023-08-01 18:39             ` Hao Xu
  1 sibling, 2 replies; 41+ messages in thread
From: Darrick J. Wong @ 2023-07-31 15:26 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Dave Chinner, Hao Xu, Jens Axboe, io-uring, Dominique Martinet,
	Pavel Begunkov, Alexander Viro, Stefan Roesch, Clay Harris,
	linux-fsdevel, Wanpeng Li

On Mon, Jul 31, 2023 at 10:13:21AM +0200, Christian Brauner wrote:
> On Mon, Jul 31, 2023 at 11:33:05AM +1000, Dave Chinner wrote:
> > On Thu, Jul 27, 2023 at 04:27:30PM +0200, Christian Brauner wrote:
> > > On Thu, Jul 27, 2023 at 07:51:19PM +0800, Hao Xu wrote:
> > > > I actually saw this semaphore, and there is another xfs lock in
> > > > file_accessed
> > > >   --> touch_atime
> > > >     --> inode_update_time
> > > >       --> inode->i_op->update_time == xfs_vn_update_time
> > > > 
> > > > Forgot to point them out in the cover-letter..., I didn't modify them
> > > > since I'm not very sure about if we should do so, and I saw Stefan's
> > > > patchset didn't modify them too.
> > > > 
> > > > My personnal thinking is we should apply trylock logic for this
> > > > inode->i_rwsem. For xfs lock in touch_atime, we should do that since it
> > > > doesn't make sense to rollback all the stuff while we are almost at the
> > > > end of getdents because of a lock.
> > > 
> > > That manoeuvres around the problem. Which I'm slightly more sensitive
> > > too as this review is a rather expensive one.
> > > 
> > > Plus, it seems fixable in at least two ways:
> > > 
> > > For both we need to be able to tell the filesystem that a nowait atime
> > > update is requested. Simple thing seems to me to add a S_NOWAIT flag to
> > > file_time_flags and passing that via i_op->update_time() which already
> > > has a flag argument. That would likely also help kiocb_modified().
> > 
> > Wait - didn't we already fix this for mtime updates on IOCB_NOWAIT
> > modification operations? Yeah, we did:
> > 
> > kiocb_modified(iocb)
> >   file_modified_flags(iocb->ki_file, iocb->ki_flags)
> >     ....
> >     ret = inode_needs_update_time()
> >     if (ret <= 0)
> > 	return ret;
> >     if (flags & IOCB_NOWAIT)
> > 	return -EAGAIN;
> >     <does timestamp update>
> > 
> > > file_accessed()
> > > -> touch_atime()
> > >    -> inode_update_time()
> > >       -> i_op->update_time == xfs_vn_update_time()
> > 
> > Yeah, so this needs the same treatment as file_modified_flags() -
> > touch_atime() needs a flag variant that passes IOCB_NOWAIT, and
> > after atime_needs_update() returns trues we should check IOCB_NOWAIT
> > and return EAGAIN if it is set. That will punt the operation that
> > needs to the update to a worker thread that can block....
> 
> As I tried to explain, I would prefer if we could inform the filesystem
> through i_op->update_time() itself that this is async and give the
> filesystem the ability to try and acquire the locks it needs and return
> EAGAIN from i_op->update_time() itself if it can't acquire them.
> 
> > 
> > > Then we have two options afaict:
> > > 
> > > (1) best-effort atime update
> > > 
> > > file_accessed() already has the builtin assumption that updating atime
> > > might fail for other reasons - see the comment in there. So it is
> > > somewhat best-effort already.
> > > 
> > > (2) move atime update before calling into filesystem
> > > 
> > > If we want to be sure that access time is updated when a readdir request
> > > is issued through io_uring then we need to have file_accessed() give a
> > > return value and expose a new helper for io_uring or modify
> > > vfs_getdents() to do something like:
> > > 
> > > vfs_getdents()
> > > {
> > > 	if (nowait)
> > > 		down_read_trylock()
> > > 
> > > 	if (!IS_DEADDIR(inode)) {
> > > 		ret = file_accessed(file);
> > > 		if (ret == -EAGAIN)
> > > 			goto out_unlock;
> > > 
> > > 		f_op->iterate_shared()
> > > 	}
> > > }
> > 
> > Yup, that's the sort of thing that needs to be done.
> > 
> > But as I said in the "llseek for io-uring" thread, we need to stop
> > the game of whack-a-mole passing random nowait boolean flags to VFS
> > operations before it starts in earnest.  We really need a common
> > context structure (like we have a kiocb for IO operations) that
> > holds per operation control state so we have consistency across all
> > the operations that we need different behaviours for.
> 
> Yes, I tend to agree and thought about the same. But right now we don't
> have a lot of context. So I would lean towards a flag argument at most.
> 
> But I also wouldn't consider it necessarily wrong to start with booleans
> or a flag first and in a couple of months if the need for more context
> arises we know what kind of struct we want or need.

I'm probably missing a ton of context (because at the end of the day I
don't care all that much about NOWAIT and still have never installed
liburing) but AFAICT the goal seems to be that for a given io request,
uring tries to execute it with trylocks in the originating process
context.  If that attempt fails, it'll punt the io to a workqueue and
rerun the request with blocking locks.  Right?

I've watched quite a bit of NOWAIT whackamole going on over the past few
years (i_rwsem, the ILOCK, the IO layer, memory allocations...).  IIRC
these filesystem ios all have to run in process context, right?  If so,
why don't we capture the NOWAIT state in a PF flag?  We already do that
for NOFS/NOIO memory allocations to make sure that /all/ reclaim
attempts cannot recurse into the fs/io stacks.

"I prefer EAGAIN errors to this process blocking" doesn't seem all that
much different.  But, what do I know...

--D

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/5] io_uring: add support for getdents
  2023-07-31 15:26             ` Darrick J. Wong
@ 2023-07-31 22:18               ` Dave Chinner
  2023-08-01  0:28               ` Jens Axboe
  1 sibling, 0 replies; 41+ messages in thread
From: Dave Chinner @ 2023-07-31 22:18 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christian Brauner, Hao Xu, Jens Axboe, io-uring,
	Dominique Martinet, Pavel Begunkov, Alexander Viro, Stefan Roesch,
	Clay Harris, linux-fsdevel, Wanpeng Li

On Mon, Jul 31, 2023 at 08:26:23AM -0700, Darrick J. Wong wrote:
> On Mon, Jul 31, 2023 at 10:13:21AM +0200, Christian Brauner wrote:
> > On Mon, Jul 31, 2023 at 11:33:05AM +1000, Dave Chinner wrote:
> > > On Thu, Jul 27, 2023 at 04:27:30PM +0200, Christian Brauner wrote:
> > > But as I said in the "llseek for io-uring" thread, we need to stop
> > > the game of whack-a-mole passing random nowait boolean flags to VFS
> > > operations before it starts in earnest.  We really need a common
> > > context structure (like we have a kiocb for IO operations) that
> > > holds per operation control state so we have consistency across all
> > > the operations that we need different behaviours for.
> > 
> > Yes, I tend to agree and thought about the same. But right now we don't
> > have a lot of context. So I would lean towards a flag argument at most.
> > 
> > But I also wouldn't consider it necessarily wrong to start with booleans
> > or a flag first and in a couple of months if the need for more context
> > arises we know what kind of struct we want or need.
> 
> I'm probably missing a ton of context (because at the end of the day I
> don't care all that much about NOWAIT and still have never installed
> liburing) but AFAICT the goal seems to be that for a given io request,
> uring tries to execute it with trylocks in the originating process
> context.  If that attempt fails, it'll punt the io to a workqueue and
> rerun the request with blocking locks.  Right?

Yes, that might be the case for the VFS level code we are talking
about right now....

... but, for example, I have no clue what task context
nvmet_file_execute_rw() runs in but it definitely issues file IO
with IOCB_NOWAIT...

> I've watched quite a bit of NOWAIT whackamole going on over the past few
> years (i_rwsem, the ILOCK, the IO layer, memory allocations...).  IIRC
> these filesystem ios all have to run in process context, right?  If so,
> why don't we capture the NOWAIT state in a PF flag?  We already do that
> for NOFS/NOIO memory allocations to make sure that /all/ reclaim
> attempts cannot recurse into the fs/io stacks.

Interesting idea.

That would mean the high level code would have to toggle the task
flags before calling into the VFS, which means we'd have to capture
RWF_NOWAIT flags at the syscall level rather than propagating them
into the iocb. That may impact speed racers, because the RWF flag
propagation has been a target of significant micro-optimisation
since io_uring came along....

> "I prefer EAGAIN errors to this process blocking" doesn't seem all that
> much different.  But, what do I know...

Yeah, I can see how that might be advantageous from an API
persepective, though my gut says "that's a can of worms" but I
haven't spent enough time thinking about it to work out why I feel
that way.

-Dave.
-- 
Dave Chinner
[email protected]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/5] io_uring: add support for getdents
  2023-07-31 15:26             ` Darrick J. Wong
  2023-07-31 22:18               ` Dave Chinner
@ 2023-08-01  0:28               ` Jens Axboe
  2023-08-01  0:47                 ` Matthew Wilcox
                                   ` (3 more replies)
  1 sibling, 4 replies; 41+ messages in thread
From: Jens Axboe @ 2023-08-01  0:28 UTC (permalink / raw)
  To: Darrick J. Wong, Christian Brauner
  Cc: Dave Chinner, Hao Xu, io-uring, Dominique Martinet,
	Pavel Begunkov, Alexander Viro, Stefan Roesch, Clay Harris,
	linux-fsdevel, Wanpeng Li

On 7/31/23 9:26?AM, Darrick J. Wong wrote:
> I've watched quite a bit of NOWAIT whackamole going on over the past few
> years (i_rwsem, the ILOCK, the IO layer, memory allocations...).  IIRC
> these filesystem ios all have to run in process context, right?  If so,
> why don't we capture the NOWAIT state in a PF flag?  We already do that
> for NOFS/NOIO memory allocations to make sure that /all/ reclaim
> attempts cannot recurse into the fs/io stacks.

I would greatly prefer passing down the context rather than capitulating
and adding a task_struct flag for this. I think it _kind of_ makes sense
for things like allocations, as you cannot easily track that all the way
down, but it's a really ugly solution. It certainly creates more churn
passing it down, but it also reveals the parts that need to check it.
WHen new code is added, it's much more likely you'll spot the fact that
there's passed in context. For allocation, you end up in the allocator
anyway, which can augment the gfp mask with whatever is set in the task.
The same is not true for locking and other bits, as they don't return a
value to begin with. When we know they are sane, we can flag the fs as
supporting it (like we've done for async buffered reads, for example).

It's also not an absolute thing, like memory allocations are. It's
perfectly fine to grab a mutex under NOWAIT issue. What you should not
do is grab a mutex that someone else can grab while waiting on IO. This
kind of extra context is only available in the code in question, not
generically for eg mutex locking.

I'm not a huge fan of the "let's add a bool nowait". Most/all use cases
pass down state anyway, putting it in a suitable type struct seems much
cleaner to me than the out-of-band approach of just adding a
current->please_nowait.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/5] io_uring: add support for getdents
  2023-08-01  0:28               ` Jens Axboe
@ 2023-08-01  0:47                 ` Matthew Wilcox
  2023-08-01  0:49                   ` Jens Axboe
  2023-08-01  7:17                 ` Christian Brauner
                                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 41+ messages in thread
From: Matthew Wilcox @ 2023-08-01  0:47 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Darrick J. Wong, Christian Brauner, Dave Chinner, Hao Xu,
	io-uring, Dominique Martinet, Pavel Begunkov, Alexander Viro,
	Stefan Roesch, Clay Harris, linux-fsdevel, Wanpeng Li

On Mon, Jul 31, 2023 at 06:28:02PM -0600, Jens Axboe wrote:
> It's also not an absolute thing, like memory allocations are. It's
> perfectly fine to grab a mutex under NOWAIT issue. What you should not
> do is grab a mutex that someone else can grab while waiting on IO. This
> kind of extra context is only available in the code in question, not
> generically for eg mutex locking.

Is that information documented somewhere?  I didn't know that was the
rule, and I wouldn't be surprised if that's news to several of the other
people on this thread.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/5] io_uring: add support for getdents
  2023-08-01  0:47                 ` Matthew Wilcox
@ 2023-08-01  0:49                   ` Jens Axboe
  2023-08-01  1:01                     ` Matthew Wilcox
  2023-08-01  6:59                     ` Christian Brauner
  0 siblings, 2 replies; 41+ messages in thread
From: Jens Axboe @ 2023-08-01  0:49 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Darrick J. Wong, Christian Brauner, Dave Chinner, Hao Xu,
	io-uring, Dominique Martinet, Pavel Begunkov, Alexander Viro,
	Stefan Roesch, Clay Harris, linux-fsdevel, Wanpeng Li

On 7/31/23 6:47?PM, Matthew Wilcox wrote:
> On Mon, Jul 31, 2023 at 06:28:02PM -0600, Jens Axboe wrote:
>> It's also not an absolute thing, like memory allocations are. It's
>> perfectly fine to grab a mutex under NOWAIT issue. What you should not
>> do is grab a mutex that someone else can grab while waiting on IO. This
>> kind of extra context is only available in the code in question, not
>> generically for eg mutex locking.
> 
> Is that information documented somewhere?  I didn't know that was the
> rule, and I wouldn't be surprised if that's news to several of the other
> people on this thread.

I've mentioned it in various threads, but would probably be good to
write down somewhere if that actually makes more people see it. I'm
dubious, but since it applies to NOWAIT in general, would be a good
thing to do regardless. And then at least I could point to that rather
than have to write up a long description every time.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/5] io_uring: add support for getdents
  2023-08-01  0:49                   ` Jens Axboe
@ 2023-08-01  1:01                     ` Matthew Wilcox
  2023-08-01  7:00                       ` Christian Brauner
  2023-08-01  6:59                     ` Christian Brauner
  1 sibling, 1 reply; 41+ messages in thread
From: Matthew Wilcox @ 2023-08-01  1:01 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Darrick J. Wong, Christian Brauner, Dave Chinner, Hao Xu,
	io-uring, Dominique Martinet, Pavel Begunkov, Alexander Viro,
	Stefan Roesch, Clay Harris, linux-fsdevel, Wanpeng Li

On Mon, Jul 31, 2023 at 06:49:45PM -0600, Jens Axboe wrote:
> On 7/31/23 6:47?PM, Matthew Wilcox wrote:
> > On Mon, Jul 31, 2023 at 06:28:02PM -0600, Jens Axboe wrote:
> >> It's also not an absolute thing, like memory allocations are. It's
> >> perfectly fine to grab a mutex under NOWAIT issue. What you should not
> >> do is grab a mutex that someone else can grab while waiting on IO. This
> >> kind of extra context is only available in the code in question, not
> >> generically for eg mutex locking.
> > 
> > Is that information documented somewhere?  I didn't know that was the
> > rule, and I wouldn't be surprised if that's news to several of the other
> > people on this thread.
> 
> I've mentioned it in various threads, but would probably be good to
> write down somewhere if that actually makes more people see it. I'm
> dubious, but since it applies to NOWAIT in general, would be a good
> thing to do regardless. And then at least I could point to that rather
> than have to write up a long description every time.

Would be good to include whether it's OK to wait on a mutex that's held
by another thread that's allocating memory without __GFP_NOFS (since
obviously that can block on I/O)

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/5] io_uring: add support for getdents
  2023-08-01  1:01                     ` Matthew Wilcox
@ 2023-08-01  7:00                       ` Christian Brauner
  0 siblings, 0 replies; 41+ messages in thread
From: Christian Brauner @ 2023-08-01  7:00 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jens Axboe, Darrick J. Wong, Dave Chinner, Hao Xu, io-uring,
	Dominique Martinet, Pavel Begunkov, Alexander Viro, Stefan Roesch,
	Clay Harris, linux-fsdevel, Wanpeng Li

On Tue, Aug 01, 2023 at 02:01:59AM +0100, Matthew Wilcox wrote:
> On Mon, Jul 31, 2023 at 06:49:45PM -0600, Jens Axboe wrote:
> > On 7/31/23 6:47?PM, Matthew Wilcox wrote:
> > > On Mon, Jul 31, 2023 at 06:28:02PM -0600, Jens Axboe wrote:
> > >> It's also not an absolute thing, like memory allocations are. It's
> > >> perfectly fine to grab a mutex under NOWAIT issue. What you should not
> > >> do is grab a mutex that someone else can grab while waiting on IO. This
> > >> kind of extra context is only available in the code in question, not
> > >> generically for eg mutex locking.
> > > 
> > > Is that information documented somewhere?  I didn't know that was the
> > > rule, and I wouldn't be surprised if that's news to several of the other
> > > people on this thread.
> > 
> > I've mentioned it in various threads, but would probably be good to
> > write down somewhere if that actually makes more people see it. I'm
> > dubious, but since it applies to NOWAIT in general, would be a good
> > thing to do regardless. And then at least I could point to that rather
> > than have to write up a long description every time.
> 
> Would be good to include whether it's OK to wait on a mutex that's held
> by another thread that's allocating memory without __GFP_NOFS (since
> obviously that can block on I/O)

And adding a few illustrative examples would be helpful.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/5] io_uring: add support for getdents
  2023-08-01  0:49                   ` Jens Axboe
  2023-08-01  1:01                     ` Matthew Wilcox
@ 2023-08-01  6:59                     ` Christian Brauner
  1 sibling, 0 replies; 41+ messages in thread
From: Christian Brauner @ 2023-08-01  6:59 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Matthew Wilcox, Darrick J. Wong, Dave Chinner, Hao Xu, io-uring,
	Dominique Martinet, Pavel Begunkov, Alexander Viro, Stefan Roesch,
	Clay Harris, linux-fsdevel, Wanpeng Li

On Mon, Jul 31, 2023 at 06:49:45PM -0600, Jens Axboe wrote:
> On 7/31/23 6:47?PM, Matthew Wilcox wrote:
> > On Mon, Jul 31, 2023 at 06:28:02PM -0600, Jens Axboe wrote:
> >> It's also not an absolute thing, like memory allocations are. It's
> >> perfectly fine to grab a mutex under NOWAIT issue. What you should not
> >> do is grab a mutex that someone else can grab while waiting on IO. This
> >> kind of extra context is only available in the code in question, not
> >> generically for eg mutex locking.
> > 
> > Is that information documented somewhere?  I didn't know that was the
> > rule, and I wouldn't be surprised if that's news to several of the other
> > people on this thread.
> 
> I've mentioned it in various threads, but would probably be good to
> write down somewhere if that actually makes more people see it. I'm
> dubious, but since it applies to NOWAIT in general, would be a good
> thing to do regardless. And then at least I could point to that rather
> than have to write up a long description every time.

I've only been aware of this because we talked about it somewhere else.
So adding it as documentation would be great.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/5] io_uring: add support for getdents
  2023-08-01  0:28               ` Jens Axboe
  2023-08-01  0:47                 ` Matthew Wilcox
@ 2023-08-01  7:17                 ` Christian Brauner
  2023-08-08  4:34                 ` Hao Xu
  2023-08-08  9:33                 ` Hao Xu
  3 siblings, 0 replies; 41+ messages in thread
From: Christian Brauner @ 2023-08-01  7:17 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Darrick J. Wong, Dave Chinner, Hao Xu, io-uring,
	Dominique Martinet, Pavel Begunkov, Alexander Viro, Stefan Roesch,
	Clay Harris, linux-fsdevel, Wanpeng Li

On Mon, Jul 31, 2023 at 06:28:02PM -0600, Jens Axboe wrote:
> On 7/31/23 9:26?AM, Darrick J. Wong wrote:
> > I've watched quite a bit of NOWAIT whackamole going on over the past few
> > years (i_rwsem, the ILOCK, the IO layer, memory allocations...).  IIRC
> > these filesystem ios all have to run in process context, right?  If so,
> > why don't we capture the NOWAIT state in a PF flag?  We already do that
> > for NOFS/NOIO memory allocations to make sure that /all/ reclaim
> > attempts cannot recurse into the fs/io stacks.
> 
> I would greatly prefer passing down the context rather than capitulating
> and adding a task_struct flag for this. I think it _kind of_ makes sense
> for things like allocations, as you cannot easily track that all the way
> down, but it's a really ugly solution. It certainly creates more churn
> passing it down, but it also reveals the parts that need to check it.
> WHen new code is added, it's much more likely you'll spot the fact that
> there's passed in context. For allocation, you end up in the allocator
> anyway, which can augment the gfp mask with whatever is set in the task.
> The same is not true for locking and other bits, as they don't return a
> value to begin with. When we know they are sane, we can flag the fs as
> supporting it (like we've done for async buffered reads, for example).
> 
> It's also not an absolute thing, like memory allocations are. It's
> perfectly fine to grab a mutex under NOWAIT issue. What you should not
> do is grab a mutex that someone else can grab while waiting on IO. This
> kind of extra context is only available in the code in question, not
> generically for eg mutex locking.
> 
> I'm not a huge fan of the "let's add a bool nowait". Most/all use cases
> pass down state anyway, putting it in a suitable type struct seems much

We're only going to pass a struct if there really is a need for one
though. Meaning, we're shouldn't end up passing a struct with a single
element around in the hopes that we'll add more members at some point. 

> cleaner to me than the out-of-band approach of just adding a
> current->please_nowait.

I'm not convinced that abusing current/task_struct for this is sane. I
not just very much doubt this will go down well with reviewers outside
of fs/ we'd also rightly be told that we're punting on a design problem
because it would be more churn.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/5] io_uring: add support for getdents
  2023-08-01  0:28               ` Jens Axboe
  2023-08-01  0:47                 ` Matthew Wilcox
  2023-08-01  7:17                 ` Christian Brauner
@ 2023-08-08  4:34                 ` Hao Xu
  2023-08-08  5:18                   ` Hao Xu
  2023-08-08  9:33                 ` Hao Xu
  3 siblings, 1 reply; 41+ messages in thread
From: Hao Xu @ 2023-08-08  4:34 UTC (permalink / raw)
  To: Jens Axboe, Darrick J. Wong, Christian Brauner
  Cc: Dave Chinner, io-uring, Dominique Martinet, Pavel Begunkov,
	Alexander Viro, Stefan Roesch, Clay Harris, linux-fsdevel,
	Wanpeng Li

On 8/1/23 08:28, Jens Axboe wrote:
> On 7/31/23 9:26?AM, Darrick J. Wong wrote:
>> I've watched quite a bit of NOWAIT whackamole going on over the past few
>> years (i_rwsem, the ILOCK, the IO layer, memory allocations...).  IIRC
>> these filesystem ios all have to run in process context, right?  If so,
>> why don't we capture the NOWAIT state in a PF flag?  We already do that
>> for NOFS/NOIO memory allocations to make sure that /all/ reclaim
>> attempts cannot recurse into the fs/io stacks.
> 
> I would greatly prefer passing down the context rather than capitulating
> and adding a task_struct flag for this. I think it _kind of_ makes sense
> for things like allocations, as you cannot easily track that all the way
> down, but it's a really ugly solution. It certainly creates more churn
> passing it down, but it also reveals the parts that need to check it.
> WHen new code is added, it's much more likely you'll spot the fact that
> there's passed in context. For allocation, you end up in the allocator
> anyway, which can augment the gfp mask with whatever is set in the task.
> The same is not true for locking and other bits, as they don't return a
> value to begin with. When we know they are sane, we can flag the fs as
> supporting it (like we've done for async buffered reads, for example).
> 
> It's also not an absolute thing, like memory allocations are. It's
> perfectly fine to grab a mutex under NOWAIT issue. What you should not

Hi Jens,
To make sure, I'd like to ask, for memory allocation, GFP_NOIO semantics
is all we need in NOWAIT issue, GFP_NOWAIT is not necessary, do I
understand it right?

Thanks,
Hao

> do is grab a mutex that someone else can grab while waiting on IO. This
> kind of extra context is only available in the code in question, not
> generically for eg mutex locking.
> 
> I'm not a huge fan of the "let's add a bool nowait". Most/all use cases
> pass down state anyway, putting it in a suitable type struct seems much
> cleaner to me than the out-of-band approach of just adding a
> current->please_nowait.
> 


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/5] io_uring: add support for getdents
  2023-08-08  4:34                 ` Hao Xu
@ 2023-08-08  5:18                   ` Hao Xu
  0 siblings, 0 replies; 41+ messages in thread
From: Hao Xu @ 2023-08-08  5:18 UTC (permalink / raw)
  To: Jens Axboe, Darrick J. Wong, Christian Brauner
  Cc: Dave Chinner, io-uring, Dominique Martinet, Pavel Begunkov,
	Alexander Viro, Stefan Roesch, Clay Harris, linux-fsdevel,
	Wanpeng Li

On 8/8/23 12:34, Hao Xu wrote:
> On 8/1/23 08:28, Jens Axboe wrote:
>> On 7/31/23 9:26?AM, Darrick J. Wong wrote:
>>> I've watched quite a bit of NOWAIT whackamole going on over the past 
>>> few
>>> years (i_rwsem, the ILOCK, the IO layer, memory allocations...).  IIRC
>>> these filesystem ios all have to run in process context, right?  If so,
>>> why don't we capture the NOWAIT state in a PF flag?  We already do that
>>> for NOFS/NOIO memory allocations to make sure that /all/ reclaim
>>> attempts cannot recurse into the fs/io stacks.
>>
>> I would greatly prefer passing down the context rather than capitulating
>> and adding a task_struct flag for this. I think it _kind of_ makes sense
>> for things like allocations, as you cannot easily track that all the way
>> down, but it's a really ugly solution. It certainly creates more churn
>> passing it down, but it also reveals the parts that need to check it.
>> WHen new code is added, it's much more likely you'll spot the fact that
>> there's passed in context. For allocation, you end up in the allocator
>> anyway, which can augment the gfp mask with whatever is set in the task.
>> The same is not true for locking and other bits, as they don't return a
>> value to begin with. When we know they are sane, we can flag the fs as
>> supporting it (like we've done for async buffered reads, for example).
>>
>> It's also not an absolute thing, like memory allocations are. It's
>> perfectly fine to grab a mutex under NOWAIT issue. What you should not
>
> Hi Jens,
> To make sure, I'd like to ask, for memory allocation, GFP_NOIO semantics
> is all we need in NOWAIT issue, GFP_NOWAIT is not necessary, do I
> understand it right?
>
> Thanks,
> Hao


Trying to find a lock in mem allocation process that GFP_NOIO holds it while

other normal GFP_* like GFP_KERNEL also holds it and does IO.

Failed to find one such lock.  So I guess though GFP_NOIO may cause sleep

but won't wait on IO.





^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/5] io_uring: add support for getdents
  2023-08-01  0:28               ` Jens Axboe
                                   ` (2 preceding siblings ...)
  2023-08-08  4:34                 ` Hao Xu
@ 2023-08-08  9:33                 ` Hao Xu
  2023-08-08 22:55                   ` Jens Axboe
  3 siblings, 1 reply; 41+ messages in thread
From: Hao Xu @ 2023-08-08  9:33 UTC (permalink / raw)
  To: Jens Axboe, Darrick J. Wong, Christian Brauner
  Cc: Dave Chinner, io-uring, Dominique Martinet, Pavel Begunkov,
	Alexander Viro, Stefan Roesch, Clay Harris, linux-fsdevel,
	Wanpeng Li

On 8/1/23 08:28, Jens Axboe wrote:
> On 7/31/23 9:26?AM, Darrick J. Wong wrote:
>> I've watched quite a bit of NOWAIT whackamole going on over the past few
>> years (i_rwsem, the ILOCK, the IO layer, memory allocations...). IIRC
>> these filesystem ios all have to run in process context, right? If so,
>> why don't we capture the NOWAIT state in a PF flag? We already do that
>> for NOFS/NOIO memory allocations to make sure that /all/ reclaim
>> attempts cannot recurse into the fs/io stacks.
>
> I would greatly prefer passing down the context rather than capitulating
> and adding a task_struct flag for this. I think it _kind of_ makes sense
> for things like allocations, as you cannot easily track that all the way
> down, but it's a really ugly solution. It certainly creates more churn
> passing it down, but it also reveals the parts that need to check it.
> WHen new code is added, it's much more likely you'll spot the fact that
> there's passed in context. For allocation, you end up in the allocator
> anyway, which can augment the gfp mask with whatever is set in the task.
> The same is not true for locking and other bits, as they don't return a
> value to begin with. When we know they are sane, we can flag the fs as
> supporting it (like we've done for async buffered reads, for example).
>
> It's also not an absolute thing, like memory allocations are. It's
> perfectly fine to grab a mutex under NOWAIT issue. What you should not

Hi Jens,
To make sure, I'd like to ask, for memory allocation, GFP_NOIO semantics
is all we need in NOWAIT issue, GFP_NOWAIT is not necessary, do I
understand it right?

Thanks,
Hao

> do is grab a mutex that someone else can grab while waiting on IO. This
> kind of extra context is only available in the code in question, not
> generically for eg mutex locking.
>
> I'm not a huge fan of the "let's add a bool nowait". Most/all use cases
> pass down state anyway, putting it in a suitable type struct seems much
> cleaner to me than the out-of-band approach of just adding a
> current->please_nowait.
>



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/5] io_uring: add support for getdents
  2023-08-08  9:33                 ` Hao Xu
@ 2023-08-08 22:55                   ` Jens Axboe
  0 siblings, 0 replies; 41+ messages in thread
From: Jens Axboe @ 2023-08-08 22:55 UTC (permalink / raw)
  To: Hao Xu, Darrick J. Wong, Christian Brauner
  Cc: Dave Chinner, io-uring, Dominique Martinet, Pavel Begunkov,
	Alexander Viro, Stefan Roesch, Clay Harris, linux-fsdevel,
	Wanpeng Li

On 8/8/23 3:33?AM, Hao Xu wrote:
> On 8/1/23 08:28, Jens Axboe wrote:
>> On 7/31/23 9:26?AM, Darrick J. Wong wrote:
>>> I've watched quite a bit of NOWAIT whackamole going on over the past few
>>> years (i_rwsem, the ILOCK, the IO layer, memory allocations...). IIRC
>>> these filesystem ios all have to run in process context, right? If so,
>>> why don't we capture the NOWAIT state in a PF flag? We already do that
>>> for NOFS/NOIO memory allocations to make sure that /all/ reclaim
>>> attempts cannot recurse into the fs/io stacks.
>>
>> I would greatly prefer passing down the context rather than capitulating
>> and adding a task_struct flag for this. I think it _kind of_ makes sense
>> for things like allocations, as you cannot easily track that all the way
>> down, but it's a really ugly solution. It certainly creates more churn
>> passing it down, but it also reveals the parts that need to check it.
>> WHen new code is added, it's much more likely you'll spot the fact that
>> there's passed in context. For allocation, you end up in the allocator
>> anyway, which can augment the gfp mask with whatever is set in the task.
>> The same is not true for locking and other bits, as they don't return a
>> value to begin with. When we know they are sane, we can flag the fs as
>> supporting it (like we've done for async buffered reads, for example).
>>
>> It's also not an absolute thing, like memory allocations are. It's
>> perfectly fine to grab a mutex under NOWAIT issue. What you should not
> 
> Hi Jens,
> To make sure, I'd like to ask, for memory allocation, GFP_NOIO semantics
> is all we need in NOWAIT issue, GFP_NOWAIT is not necessary, do I
> understand it right?

Yep, GFP_NOIO should be just fine.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/5] io_uring: add support for getdents
  2023-07-31  8:13           ` Christian Brauner
  2023-07-31 15:26             ` Darrick J. Wong
@ 2023-08-01 18:39             ` Hao Xu
  1 sibling, 0 replies; 41+ messages in thread
From: Hao Xu @ 2023-08-01 18:39 UTC (permalink / raw)
  To: Christian Brauner, Dave Chinner
  Cc: djwong, Jens Axboe, io-uring, Dominique Martinet, Pavel Begunkov,
	Alexander Viro, Stefan Roesch, Clay Harris, linux-fsdevel,
	Wanpeng Li

On 7/31/23 16:13, Christian Brauner wrote:
> On Mon, Jul 31, 2023 at 11:33:05AM +1000, Dave Chinner wrote:
>> On Thu, Jul 27, 2023 at 04:27:30PM +0200, Christian Brauner wrote:
>>> On Thu, Jul 27, 2023 at 07:51:19PM +0800, Hao Xu wrote:
>>>> I actually saw this semaphore, and there is another xfs lock in
>>>> file_accessed
>>>>    --> touch_atime
>>>>      --> inode_update_time
>>>>        --> inode->i_op->update_time == xfs_vn_update_time
>>>>
>>>> Forgot to point them out in the cover-letter..., I didn't modify them
>>>> since I'm not very sure about if we should do so, and I saw Stefan's
>>>> patchset didn't modify them too.
>>>>
>>>> My personnal thinking is we should apply trylock logic for this
>>>> inode->i_rwsem. For xfs lock in touch_atime, we should do that since it
>>>> doesn't make sense to rollback all the stuff while we are almost at the
>>>> end of getdents because of a lock.
>>>
>>> That manoeuvres around the problem. Which I'm slightly more sensitive
>>> too as this review is a rather expensive one.
>>>
>>> Plus, it seems fixable in at least two ways:
>>>
>>> For both we need to be able to tell the filesystem that a nowait atime
>>> update is requested. Simple thing seems to me to add a S_NOWAIT flag to
>>> file_time_flags and passing that via i_op->update_time() which already
>>> has a flag argument. That would likely also help kiocb_modified().
>>
>> Wait - didn't we already fix this for mtime updates on IOCB_NOWAIT
>> modification operations? Yeah, we did:
>>
>> kiocb_modified(iocb)
>>    file_modified_flags(iocb->ki_file, iocb->ki_flags)
>>      ....
>>      ret = inode_needs_update_time()
>>      if (ret <= 0)
>> 	return ret;
>>      if (flags & IOCB_NOWAIT)
>> 	return -EAGAIN;
>>      <does timestamp update>
>>
>>> file_accessed()
>>> -> touch_atime()
>>>     -> inode_update_time()
>>>        -> i_op->update_time == xfs_vn_update_time()
>>
>> Yeah, so this needs the same treatment as file_modified_flags() -
>> touch_atime() needs a flag variant that passes IOCB_NOWAIT, and
>> after atime_needs_update() returns trues we should check IOCB_NOWAIT
>> and return EAGAIN if it is set. That will punt the operation that
>> needs to the update to a worker thread that can block....
> 
> As I tried to explain, I would prefer if we could inform the filesystem
> through i_op->update_time() itself that this is async and give the
> filesystem the ability to try and acquire the locks it needs and return
> EAGAIN from i_op->update_time() itself if it can't acquire them.

I browse code in i_op->update_time = xfs_vn_update_time, it's mainly
about xfs journal code. It creates a transaction and adds a item to
it, not familiar with this part, from a quick look I found some
locks and sleep point in it to modify. I think I need some time to sort
out this part. Or maybe we can do it like what Dave said as a short term
solution and change the block points in journal code later as a separate
patchset, those journal code I believe are common code for xfs IO
operations. I'm ok with both though.

> 
>>
>>> Then we have two options afaict:
>>>
>>> (1) best-effort atime update
>>>
>>> file_accessed() already has the builtin assumption that updating atime
>>> might fail for other reasons - see the comment in there. So it is
>>> somewhat best-effort already.
>>>
>>> (2) move atime update before calling into filesystem
>>>
>>> If we want to be sure that access time is updated when a readdir request
>>> is issued through io_uring then we need to have file_accessed() give a
>>> return value and expose a new helper for io_uring or modify
>>> vfs_getdents() to do something like:
>>>
>>> vfs_getdents()
>>> {
>>> 	if (nowait)
>>> 		down_read_trylock()
>>>
>>> 	if (!IS_DEADDIR(inode)) {
>>> 		ret = file_accessed(file);
>>> 		if (ret == -EAGAIN)
>>> 			goto out_unlock;
>>>
>>> 		f_op->iterate_shared()
>>> 	}
>>> }
>>
>> Yup, that's the sort of thing that needs to be done.
>>
>> But as I said in the "llseek for io-uring" thread, we need to stop
>> the game of whack-a-mole passing random nowait boolean flags to VFS
>> operations before it starts in earnest.  We really need a common
>> context structure (like we have a kiocb for IO operations) that
>> holds per operation control state so we have consistency across all
>> the operations that we need different behaviours for.
> 
> Yes, I tend to agree and thought about the same. But right now we don't
> have a lot of context. So I would lean towards a flag argument at most.
> 
> But I also wouldn't consider it necessarily wrong to start with booleans
> or a flag first and in a couple of months if the need for more context
> arises we know what kind of struct we want or need.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 4/5] xfs: add NOWAIT semantics for readdir
  2023-07-18 13:21 [PATCH v4 0/5] io_uring getdents Hao Xu
                   ` (2 preceding siblings ...)
  2023-07-18 13:21 ` [PATCH 3/5] io_uring: add support for getdents Hao Xu
@ 2023-07-18 13:21 ` Hao Xu
  2023-07-19  2:35   ` kernel test robot
  2023-07-18 13:21 ` [PATCH RFC 5/5] disable fixed file for io_uring getdents for now Hao Xu
  2023-07-19  6:04 ` [PATCH v4 0/5] io_uring getdents Christian Brauner
  5 siblings, 1 reply; 41+ messages in thread
From: Hao Xu @ 2023-07-18 13:21 UTC (permalink / raw)
  To: io-uring, Jens Axboe
  Cc: Dominique Martinet, Pavel Begunkov, Christian Brauner,
	Alexander Viro, Stefan Roesch, Clay Harris, Dave Chinner,
	linux-fsdevel, Wanpeng Li

From: Hao Xu <[email protected]>

Implement NOWAIT semantics for readdir. Return EAGAIN error to the
caller if it would block, like failing to get locks, or going to
do IO.

Co-developed-by: Dave Chinner <[email protected]>
Signed-off-by: Dave Chinner <[email protected]>
Signed-off-by: Hao Xu <[email protected]>
[fixes deadlock issue, tweak code style]
---
 fs/xfs/libxfs/xfs_da_btree.c   | 16 ++++++++++
 fs/xfs/libxfs/xfs_da_btree.h   |  1 +
 fs/xfs/libxfs/xfs_dir2_block.c |  7 ++--
 fs/xfs/libxfs/xfs_dir2_priv.h  |  2 +-
 fs/xfs/scrub/dir.c             |  2 +-
 fs/xfs/scrub/readdir.c         |  2 +-
 fs/xfs/xfs_dir2_readdir.c      | 58 +++++++++++++++++++++++++++-------
 fs/xfs/xfs_inode.c             | 17 ++++++++++
 fs/xfs/xfs_inode.h             | 15 +++++----
 9 files changed, 96 insertions(+), 24 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_da_btree.c b/fs/xfs/libxfs/xfs_da_btree.c
index e576560b46e9..7a1a0af24197 100644
--- a/fs/xfs/libxfs/xfs_da_btree.c
+++ b/fs/xfs/libxfs/xfs_da_btree.c
@@ -2643,16 +2643,32 @@ xfs_da_read_buf(
 	struct xfs_buf_map	map, *mapp = &map;
 	int			nmap = 1;
 	int			error;
+	int			buf_flags = 0;
 
 	*bpp = NULL;
 	error = xfs_dabuf_map(dp, bno, flags, whichfork, &mapp, &nmap);
 	if (error || !nmap)
 		goto out_free;
 
+	/*
+	 * NOWAIT semantics mean we don't wait on the buffer lock nor do we
+	 * issue IO for this buffer if it is not already in memory. Caller will
+	 * retry. This will return -EAGAIN if the buffer is in memory and cannot
+	 * be locked, and no buffer and no error if it isn't in memory.  We
+	 * translate both of those into a return state of -EAGAIN and *bpp =
+	 * NULL.
+	 */
+	if (flags & XFS_DABUF_NOWAIT)
+		buf_flags |= XBF_TRYLOCK | XBF_INCORE;
 	error = xfs_trans_read_buf_map(mp, tp, mp->m_ddev_targp, mapp, nmap, 0,
 			&bp, ops);
 	if (error)
 		goto out_free;
+	if (!bp) {
+		ASSERT(flags & XFS_DABUF_NOWAIT);
+		error = -EAGAIN;
+		goto out_free;
+	}
 
 	if (whichfork == XFS_ATTR_FORK)
 		xfs_buf_set_ref(bp, XFS_ATTR_BTREE_REF);
diff --git a/fs/xfs/libxfs/xfs_da_btree.h b/fs/xfs/libxfs/xfs_da_btree.h
index ffa3df5b2893..32e7b1cca402 100644
--- a/fs/xfs/libxfs/xfs_da_btree.h
+++ b/fs/xfs/libxfs/xfs_da_btree.h
@@ -205,6 +205,7 @@ int	xfs_da3_node_read_mapped(struct xfs_trans *tp, struct xfs_inode *dp,
  */
 
 #define XFS_DABUF_MAP_HOLE_OK	(1u << 0)
+#define XFS_DABUF_NOWAIT	(1u << 1)
 
 int	xfs_da_grow_inode(xfs_da_args_t *args, xfs_dablk_t *new_blkno);
 int	xfs_da_grow_inode_int(struct xfs_da_args *args, xfs_fileoff_t *bno,
diff --git a/fs/xfs/libxfs/xfs_dir2_block.c b/fs/xfs/libxfs/xfs_dir2_block.c
index 00f960a703b2..59b24a594add 100644
--- a/fs/xfs/libxfs/xfs_dir2_block.c
+++ b/fs/xfs/libxfs/xfs_dir2_block.c
@@ -135,13 +135,14 @@ int
 xfs_dir3_block_read(
 	struct xfs_trans	*tp,
 	struct xfs_inode	*dp,
+	unsigned int		flags,
 	struct xfs_buf		**bpp)
 {
 	struct xfs_mount	*mp = dp->i_mount;
 	xfs_failaddr_t		fa;
 	int			err;
 
-	err = xfs_da_read_buf(tp, dp, mp->m_dir_geo->datablk, 0, bpp,
+	err = xfs_da_read_buf(tp, dp, mp->m_dir_geo->datablk, flags, bpp,
 				XFS_DATA_FORK, &xfs_dir3_block_buf_ops);
 	if (err || !*bpp)
 		return err;
@@ -380,7 +381,7 @@ xfs_dir2_block_addname(
 	tp = args->trans;
 
 	/* Read the (one and only) directory block into bp. */
-	error = xfs_dir3_block_read(tp, dp, &bp);
+	error = xfs_dir3_block_read(tp, dp, 0, &bp);
 	if (error)
 		return error;
 
@@ -695,7 +696,7 @@ xfs_dir2_block_lookup_int(
 	dp = args->dp;
 	tp = args->trans;
 
-	error = xfs_dir3_block_read(tp, dp, &bp);
+	error = xfs_dir3_block_read(tp, dp, 0, &bp);
 	if (error)
 		return error;
 
diff --git a/fs/xfs/libxfs/xfs_dir2_priv.h b/fs/xfs/libxfs/xfs_dir2_priv.h
index 7404a9ff1a92..7d4cf8a0f15b 100644
--- a/fs/xfs/libxfs/xfs_dir2_priv.h
+++ b/fs/xfs/libxfs/xfs_dir2_priv.h
@@ -51,7 +51,7 @@ extern int xfs_dir_cilookup_result(struct xfs_da_args *args,
 
 /* xfs_dir2_block.c */
 extern int xfs_dir3_block_read(struct xfs_trans *tp, struct xfs_inode *dp,
-			       struct xfs_buf **bpp);
+			       unsigned int flags, struct xfs_buf **bpp);
 extern int xfs_dir2_block_addname(struct xfs_da_args *args);
 extern int xfs_dir2_block_lookup(struct xfs_da_args *args);
 extern int xfs_dir2_block_removename(struct xfs_da_args *args);
diff --git a/fs/xfs/scrub/dir.c b/fs/xfs/scrub/dir.c
index 0b491784b759..5cc51f201bd7 100644
--- a/fs/xfs/scrub/dir.c
+++ b/fs/xfs/scrub/dir.c
@@ -313,7 +313,7 @@ xchk_directory_data_bestfree(
 		/* dir block format */
 		if (lblk != XFS_B_TO_FSBT(mp, XFS_DIR2_DATA_OFFSET))
 			xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, lblk);
-		error = xfs_dir3_block_read(sc->tp, sc->ip, &bp);
+		error = xfs_dir3_block_read(sc->tp, sc->ip, 0, &bp);
 	} else {
 		/* dir data format */
 		error = xfs_dir3_data_read(sc->tp, sc->ip, lblk, 0, &bp);
diff --git a/fs/xfs/scrub/readdir.c b/fs/xfs/scrub/readdir.c
index e51c1544be63..f0a727311632 100644
--- a/fs/xfs/scrub/readdir.c
+++ b/fs/xfs/scrub/readdir.c
@@ -101,7 +101,7 @@ xchk_dir_walk_block(
 	unsigned int		off, next_off, end;
 	int			error;
 
-	error = xfs_dir3_block_read(sc->tp, dp, &bp);
+	error = xfs_dir3_block_read(sc->tp, dp, 0, &bp);
 	if (error)
 		return error;
 
diff --git a/fs/xfs/xfs_dir2_readdir.c b/fs/xfs/xfs_dir2_readdir.c
index 9f3ceb461515..55bccc2d0da7 100644
--- a/fs/xfs/xfs_dir2_readdir.c
+++ b/fs/xfs/xfs_dir2_readdir.c
@@ -149,6 +149,7 @@ xfs_dir2_block_getdents(
 	struct xfs_da_geometry	*geo = args->geo;
 	unsigned int		offset, next_offset;
 	unsigned int		end;
+	unsigned int		flags = 0;
 
 	/*
 	 * If the block number in the offset is out of range, we're done.
@@ -156,7 +157,9 @@ xfs_dir2_block_getdents(
 	if (xfs_dir2_dataptr_to_db(geo, ctx->pos) > geo->datablk)
 		return 0;
 
-	error = xfs_dir3_block_read(args->trans, dp, &bp);
+	if (ctx->flags & DIR_CONTEXT_F_NOWAIT)
+		flags |= XFS_DABUF_NOWAIT;
+	error = xfs_dir3_block_read(args->trans, dp, flags, &bp);
 	if (error)
 		return error;
 
@@ -240,6 +243,7 @@ xfs_dir2_block_getdents(
 STATIC int
 xfs_dir2_leaf_readbuf(
 	struct xfs_da_args	*args,
+	struct dir_context	*ctx,
 	size_t			bufsize,
 	xfs_dir2_off_t		*cur_off,
 	xfs_dablk_t		*ra_blk,
@@ -258,10 +262,15 @@ xfs_dir2_leaf_readbuf(
 	struct xfs_iext_cursor	icur;
 	int			ra_want;
 	int			error = 0;
-
-	error = xfs_iread_extents(args->trans, dp, XFS_DATA_FORK);
-	if (error)
-		goto out;
+	unsigned int		flags = 0;
+
+	if (ctx->flags & DIR_CONTEXT_F_NOWAIT) {
+		flags |= XFS_DABUF_NOWAIT;
+	} else {
+		error = xfs_iread_extents(args->trans, dp, XFS_DATA_FORK);
+		if (error)
+			goto out;
+	}
 
 	/*
 	 * Look for mapped directory blocks at or above the current offset.
@@ -280,7 +289,7 @@ xfs_dir2_leaf_readbuf(
 	new_off = xfs_dir2_da_to_byte(geo, map.br_startoff);
 	if (new_off > *cur_off)
 		*cur_off = new_off;
-	error = xfs_dir3_data_read(args->trans, dp, map.br_startoff, 0, &bp);
+	error = xfs_dir3_data_read(args->trans, dp, map.br_startoff, flags, &bp);
 	if (error)
 		goto out;
 
@@ -337,6 +346,16 @@ xfs_dir2_leaf_readbuf(
 	goto out;
 }
 
+static inline int
+xfs_ilock_for_readdir(
+	struct xfs_inode	*dp,
+	bool			nowait)
+{
+	if (nowait)
+		return xfs_ilock_data_map_shared_nowait(dp);
+	return xfs_ilock_data_map_shared(dp);
+}
+
 /*
  * Getdents (readdir) for leaf and node directories.
  * This reads the data blocks only, so is the same for both forms.
@@ -360,6 +379,7 @@ xfs_dir2_leaf_getdents(
 	int			byteoff;	/* offset in current block */
 	unsigned int		offset = 0;
 	int			error = 0;	/* error return value */
+	int			written = 0;
 
 	/*
 	 * If the offset is at or past the largest allowed value,
@@ -391,10 +411,16 @@ xfs_dir2_leaf_getdents(
 				bp = NULL;
 			}
 
-			if (*lock_mode == 0)
-				*lock_mode = xfs_ilock_data_map_shared(dp);
-			error = xfs_dir2_leaf_readbuf(args, bufsize, &curoff,
-					&rablk, &bp);
+			if (*lock_mode == 0) {
+				*lock_mode = xfs_ilock_for_readdir(dp,
+					ctx->flags & DIR_CONTEXT_F_NOWAIT);
+				if (!*lock_mode) {
+					error = -EAGAIN;
+					break;
+				}
+			}
+			error = xfs_dir2_leaf_readbuf(args, ctx, bufsize,
+					&curoff, &rablk, &bp);
 			if (error || !bp)
 				break;
 
@@ -479,6 +505,7 @@ xfs_dir2_leaf_getdents(
 		 */
 		offset += length;
 		curoff += length;
+		written += length;
 		/* bufsize may have just been a guess; don't go negative */
 		bufsize = bufsize > length ? bufsize - length : 0;
 	}
@@ -492,6 +519,8 @@ xfs_dir2_leaf_getdents(
 		ctx->pos = xfs_dir2_byte_to_dataptr(curoff) & 0x7fffffff;
 	if (bp)
 		xfs_trans_brelse(args->trans, bp);
+	if (error == -EAGAIN && written > 0)
+		error = 0;
 	return error;
 }
 
@@ -514,6 +543,7 @@ xfs_readdir(
 	unsigned int		lock_mode;
 	bool			isblock;
 	int			error;
+	bool			nowait;
 
 	trace_xfs_readdir(dp);
 
@@ -531,7 +561,11 @@ xfs_readdir(
 	if (dp->i_df.if_format == XFS_DINODE_FMT_LOCAL)
 		return xfs_dir2_sf_getdents(&args, ctx);
 
-	lock_mode = xfs_ilock_data_map_shared(dp);
+	nowait = ctx->flags & DIR_CONTEXT_F_NOWAIT;
+	lock_mode = xfs_ilock_for_readdir(dp, nowait);
+	if (!lock_mode)
+		return -EAGAIN;
+
 	error = xfs_dir2_isblock(&args, &isblock);
 	if (error)
 		goto out_unlock;
@@ -546,5 +580,7 @@ xfs_readdir(
 out_unlock:
 	if (lock_mode)
 		xfs_iunlock(dp, lock_mode);
+	if (error == -EAGAIN)
+		ASSERT(nowait);
 	return error;
 }
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 9e62cc500140..48b1257b3ec4 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -120,6 +120,23 @@ xfs_ilock_data_map_shared(
 	return lock_mode;
 }
 
+/*
+ * Similar to xfs_ilock_data_map_shared(), except that it will only try to lock
+ * the inode in shared mode if the extents are already in memory. If it fails to
+ * get the lock or has to do IO to read the extent list, fail the operation by
+ * returning 0 as the lock mode.
+ */
+uint
+xfs_ilock_data_map_shared_nowait(
+	struct xfs_inode	*ip)
+{
+	if (xfs_need_iread_extents(&ip->i_df))
+		return 0;
+	if (!xfs_ilock_nowait(ip, XFS_ILOCK_SHARED))
+		return 0;
+	return XFS_ILOCK_SHARED;
+}
+
 uint
 xfs_ilock_attr_map_shared(
 	struct xfs_inode	*ip)
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 7547caf2f2ab..2cdb4daadacf 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -490,13 +490,14 @@ int		xfs_rename(struct mnt_idmap *idmap,
 			   struct xfs_name *target_name,
 			   struct xfs_inode *target_ip, unsigned int flags);
 
-void		xfs_ilock(xfs_inode_t *, uint);
-int		xfs_ilock_nowait(xfs_inode_t *, uint);
-void		xfs_iunlock(xfs_inode_t *, uint);
-void		xfs_ilock_demote(xfs_inode_t *, uint);
-bool		xfs_isilocked(struct xfs_inode *, uint);
-uint		xfs_ilock_data_map_shared(struct xfs_inode *);
-uint		xfs_ilock_attr_map_shared(struct xfs_inode *);
+void		xfs_ilock(struct xfs_inode *ip, uint lockmode);
+int		xfs_ilock_nowait(struct xfs_inode *ip, uint lockmode);
+void		xfs_iunlock(struct xfs_inode *ip, uint lockmode);
+void		xfs_ilock_demote(struct xfs_inode *ip, uint lockmode);
+bool		xfs_isilocked(struct xfs_inode *ip, uint lockmode);
+uint		xfs_ilock_data_map_shared(struct xfs_inode *ip);
+uint		xfs_ilock_data_map_shared_nowait(struct xfs_inode *ip);
+uint		xfs_ilock_attr_map_shared(struct xfs_inode *ip);
 
 uint		xfs_ip2xflags(struct xfs_inode *);
 int		xfs_ifree(struct xfs_trans *, struct xfs_inode *);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH 4/5] xfs: add NOWAIT semantics for readdir
  2023-07-18 13:21 ` [PATCH 4/5] xfs: add NOWAIT semantics for readdir Hao Xu
@ 2023-07-19  2:35   ` kernel test robot
  0 siblings, 0 replies; 41+ messages in thread
From: kernel test robot @ 2023-07-19  2:35 UTC (permalink / raw)
  To: Hao Xu, io-uring, Jens Axboe
  Cc: llvm, oe-kbuild-all, Dominique Martinet, Pavel Begunkov,
	Christian Brauner, Alexander Viro, Stefan Roesch, Clay Harris,
	Dave Chinner, linux-fsdevel, Wanpeng Li

Hi Hao,

kernel test robot noticed the following build warnings:

[auto build test WARNING on xfs-linux/for-next]
[also build test WARNING on linus/master v6.5-rc2 next-20230718]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Hao-Xu/fs-split-off-vfs_getdents-function-of-getdents64-syscall/20230718-212529
base:   https://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git for-next
patch link:    https://lore.kernel.org/r/20230718132112.461218-5-hao.xu%40linux.dev
patch subject: [PATCH 4/5] xfs: add NOWAIT semantics for readdir
config: x86_64-randconfig-r012-20230718 (https://download.01.org/0day-ci/archive/20230719/[email protected]/config)
compiler: clang version 15.0.7 (https://github.com/llvm/llvm-project.git 8dfdcc7b7bf66834a761bd8de445840ef68e4d1a)
reproduce: (https://download.01.org/0day-ci/archive/20230719/[email protected]/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/

All warnings (new ones prefixed by >>):

>> fs/xfs/libxfs/xfs_da_btree.c:2646:8: warning: variable 'buf_flags' set but not used [-Wunused-but-set-variable]
           int                     buf_flags = 0;
                                   ^
   1 warning generated.


vim +/buf_flags +2646 fs/xfs/libxfs/xfs_da_btree.c

  2627	
  2628	/*
  2629	 * Get a buffer for the dir/attr block, fill in the contents.
  2630	 */
  2631	int
  2632	xfs_da_read_buf(
  2633		struct xfs_trans	*tp,
  2634		struct xfs_inode	*dp,
  2635		xfs_dablk_t		bno,
  2636		unsigned int		flags,
  2637		struct xfs_buf		**bpp,
  2638		int			whichfork,
  2639		const struct xfs_buf_ops *ops)
  2640	{
  2641		struct xfs_mount	*mp = dp->i_mount;
  2642		struct xfs_buf		*bp;
  2643		struct xfs_buf_map	map, *mapp = &map;
  2644		int			nmap = 1;
  2645		int			error;
> 2646		int			buf_flags = 0;
  2647	
  2648		*bpp = NULL;
  2649		error = xfs_dabuf_map(dp, bno, flags, whichfork, &mapp, &nmap);
  2650		if (error || !nmap)
  2651			goto out_free;
  2652	
  2653		/*
  2654		 * NOWAIT semantics mean we don't wait on the buffer lock nor do we
  2655		 * issue IO for this buffer if it is not already in memory. Caller will
  2656		 * retry. This will return -EAGAIN if the buffer is in memory and cannot
  2657		 * be locked, and no buffer and no error if it isn't in memory.  We
  2658		 * translate both of those into a return state of -EAGAIN and *bpp =
  2659		 * NULL.
  2660		 */
  2661		if (flags & XFS_DABUF_NOWAIT)
  2662			buf_flags |= XBF_TRYLOCK | XBF_INCORE;
  2663		error = xfs_trans_read_buf_map(mp, tp, mp->m_ddev_targp, mapp, nmap, 0,
  2664				&bp, ops);
  2665		if (error)
  2666			goto out_free;
  2667		if (!bp) {
  2668			ASSERT(flags & XFS_DABUF_NOWAIT);
  2669			error = -EAGAIN;
  2670			goto out_free;
  2671		}
  2672	
  2673		if (whichfork == XFS_ATTR_FORK)
  2674			xfs_buf_set_ref(bp, XFS_ATTR_BTREE_REF);
  2675		else
  2676			xfs_buf_set_ref(bp, XFS_DIR_BTREE_REF);
  2677		*bpp = bp;
  2678	out_free:
  2679		if (mapp != &map)
  2680			kmem_free(mapp);
  2681	
  2682		return error;
  2683	}
  2684	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH RFC 5/5] disable fixed file for io_uring getdents for now
  2023-07-18 13:21 [PATCH v4 0/5] io_uring getdents Hao Xu
                   ` (3 preceding siblings ...)
  2023-07-18 13:21 ` [PATCH 4/5] xfs: add NOWAIT semantics for readdir Hao Xu
@ 2023-07-18 13:21 ` Hao Xu
  2023-07-26 14:23   ` Christian Brauner
  2023-07-19  6:04 ` [PATCH v4 0/5] io_uring getdents Christian Brauner
  5 siblings, 1 reply; 41+ messages in thread
From: Hao Xu @ 2023-07-18 13:21 UTC (permalink / raw)
  To: io-uring, Jens Axboe
  Cc: Dominique Martinet, Pavel Begunkov, Christian Brauner,
	Alexander Viro, Stefan Roesch, Clay Harris, Dave Chinner,
	linux-fsdevel, Wanpeng Li

From: Hao Xu <[email protected]>

Fixed file for io_uring getdents can trigger race problem. Users can
register a file to be fixed file in io_uring and then remove other
reference so that there are only fixed file reference of that file.
And then they can issue concurrent async getdents requests or both
async and sync getdents requests without holding the f_pos_lock
since there is a f_count == 1 optimization.

Signed-off-by: Hao Xu <[email protected]>
---
 io_uring/fs.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/io_uring/fs.c b/io_uring/fs.c
index 480f25677fed..dc74676b1499 100644
--- a/io_uring/fs.c
+++ b/io_uring/fs.c
@@ -303,6 +303,8 @@ int io_getdents_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
 	struct io_getdents *gd = io_kiocb_to_cmd(req, struct io_getdents);
 
+	if (unlikely(req->flags & REQ_F_FIXED_FILE))
+		return -EBADF;
 	if (READ_ONCE(sqe->off) != 0)
 		return -EINVAL;
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC 5/5] disable fixed file for io_uring getdents for now
  2023-07-18 13:21 ` [PATCH RFC 5/5] disable fixed file for io_uring getdents for now Hao Xu
@ 2023-07-26 14:23   ` Christian Brauner
  2023-07-27 12:09     ` Hao Xu
  0 siblings, 1 reply; 41+ messages in thread
From: Christian Brauner @ 2023-07-26 14:23 UTC (permalink / raw)
  To: Hao Xu
  Cc: io-uring, Jens Axboe, Dominique Martinet, Pavel Begunkov,
	Alexander Viro, Stefan Roesch, Clay Harris, Dave Chinner,
	linux-fsdevel, Wanpeng Li

On Tue, Jul 18, 2023 at 09:21:12PM +0800, Hao Xu wrote:
> From: Hao Xu <[email protected]>
> 
> Fixed file for io_uring getdents can trigger race problem. Users can
> register a file to be fixed file in io_uring and then remove other
> reference so that there are only fixed file reference of that file.
> And then they can issue concurrent async getdents requests or both
> async and sync getdents requests without holding the f_pos_lock
> since there is a f_count == 1 optimization.

Afaict, that limitation isn't necessary. This version ow works fine with
fixed files.

Based on the commit message there seems to be a misunderstanding.
Your previous version of the patchset copied the f_count optimization
into io_uring's locking which would've caused the race I described
in the other thread.

There regular system call interface was always safe because as long as
the original fd is kept the file count will be greater than 1 and both
the fixed file and regular system call interface will acquire the lock.

So fixed file's not being usable was entirely causes by copying the
f_count optimization into io_uring. Since this patchset now doesn't use
that optimization and unconditionally locks things are fine. (And even
if, the point is now moot anyway since we dropped that optimization from
the regular system call path anyway because of another issue.)

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC 5/5] disable fixed file for io_uring getdents for now
  2023-07-26 14:23   ` Christian Brauner
@ 2023-07-27 12:09     ` Hao Xu
  0 siblings, 0 replies; 41+ messages in thread
From: Hao Xu @ 2023-07-27 12:09 UTC (permalink / raw)
  To: Christian Brauner
  Cc: io-uring, Jens Axboe, Dominique Martinet, Pavel Begunkov,
	Alexander Viro, Stefan Roesch, Clay Harris, Dave Chinner,
	linux-fsdevel, Wanpeng Li

On 7/26/23 22:23, Christian Brauner wrote:
> On Tue, Jul 18, 2023 at 09:21:12PM +0800, Hao Xu wrote:
>> From: Hao Xu <[email protected]>
>>
>> Fixed file for io_uring getdents can trigger race problem. Users can
>> register a file to be fixed file in io_uring and then remove other
>> reference so that there are only fixed file reference of that file.
>> And then they can issue concurrent async getdents requests or both
>> async and sync getdents requests without holding the f_pos_lock
>> since there is a f_count == 1 optimization.
> 
> Afaict, that limitation isn't necessary. This version ow works fine with
> fixed files.
> 
> Based on the commit message there seems to be a misunderstanding.
> Your previous version of the patchset copied the f_count optimization
> into io_uring's locking which would've caused the race I described
> in the other thread.
> 
> There regular system call interface was always safe because as long as
> the original fd is kept the file count will be greater than 1 and both
> the fixed file and regular system call interface will acquire the lock.
> 
> So fixed file's not being usable was entirely causes by copying the
> f_count optimization into io_uring. Since this patchset now doesn't use
> that optimization and unconditionally locks things are fine. (And even
> if, the point is now moot anyway since we dropped that optimization from
> the regular system call path anyway because of another issue.)

I see, forgot we already remove it from fdtable after close(fd) thus
cannot access it by fd any more. I'll fix it in next version. Thanks.

Regards,
Hao

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v4 0/5] io_uring getdents
  2023-07-18 13:21 [PATCH v4 0/5] io_uring getdents Hao Xu
                   ` (4 preceding siblings ...)
  2023-07-18 13:21 ` [PATCH RFC 5/5] disable fixed file for io_uring getdents for now Hao Xu
@ 2023-07-19  6:04 ` Christian Brauner
  5 siblings, 0 replies; 41+ messages in thread
From: Christian Brauner @ 2023-07-19  6:04 UTC (permalink / raw)
  To: Hao Xu
  Cc: io-uring, Jens Axboe, Dominique Martinet, Pavel Begunkov,
	Alexander Viro, Stefan Roesch, Clay Harris, Dave Chinner,
	linux-fsdevel, Wanpeng Li

On Tue, Jul 18, 2023 at 09:21:07PM +0800, Hao Xu wrote:
> From: Hao Xu <[email protected]>
> 
> This series introduce getdents64 to io_uring, the code logic is similar
> with the snychronized version's. It first try nowait issue, and offload
> it to io-wq threads if the first try fails.
> 
> Tested it with a liburing case:
> https://github.com/HowHsu/liburing/blob/getdents/test/getdents2.c
> 
> The test is controlled by the below script[2] which runs getdents2.t 100
> times and calulate the avg.
> The result show that io_uring version is about 3% faster:
> 
> python3 run_getdents.py
>     Average of sync: 0.1036849
>     Average of iouring: 0.1005568
> 
> (0.1036849-0.1005568)/0.1036849 = 3.017%
> 
> note:
> [1] the number of getdents call/request in io_uring and normal sync version
> are made sure to be same beforehand.
> 
> [2] run_getdents.py
> 
> ```python3
> 
> import subprocess
> 
> N = 100
> sum = 0.0
> args = ["/data/home/howeyxu/tmpdir", "sync"]
> 
> for i in range(N):
>     output = subprocess.check_output(["./liburing/test/getdents2.t"] + args)
>     sum += float(output)
> 
> average = sum / N
> print("Average of sync:", average)
> 
> sum = 0.0
> args = ["/data/home/howeyxu/tmpdir", "iouring"]
> 
> for i in range(N):
>     output = subprocess.check_output(["./liburing/test/getdents2.t"] + args)
>     sum += float(output)
> 
> average = sum / N
> print("Average of iouring:", average)
> 
> ```
> 
> v3->v4:

I'm out this week so will review next week.

^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2023-08-08 22:55 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-07-18 13:21 [PATCH v4 0/5] io_uring getdents Hao Xu
2023-07-18 13:21 ` [PATCH 1/5] fs: split off vfs_getdents function of getdents64 syscall Hao Xu
2023-07-18 13:21 ` [PATCH 2/5] vfs_getdents/struct dir_context: add flags field Hao Xu
2023-07-18 13:21 ` [PATCH 3/5] io_uring: add support for getdents Hao Xu
2023-07-19  8:56   ` Hao Xu
2023-07-26 15:00   ` Christian Brauner
2023-07-27 11:51     ` Hao Xu
2023-07-27 14:27       ` Christian Brauner
2023-07-27 15:12         ` Pavel Begunkov
2023-07-27 15:52           ` Christian Brauner
2023-07-27 16:17             ` Pavel Begunkov
2023-07-27 16:28               ` Christian Brauner
2023-07-31  1:58                 ` Dave Chinner
2023-07-31  7:34                   ` Hao Xu
2023-07-31  7:50                     ` Christian Brauner
2023-07-31  7:40                   ` Christian Brauner
2023-07-30 18:02         ` Hao Xu
2023-07-31  8:18           ` Christian Brauner
2023-07-31  9:31             ` Hao Xu
2023-07-31  1:33         ` Dave Chinner
2023-07-31  8:13           ` Christian Brauner
2023-07-31 15:26             ` Darrick J. Wong
2023-07-31 22:18               ` Dave Chinner
2023-08-01  0:28               ` Jens Axboe
2023-08-01  0:47                 ` Matthew Wilcox
2023-08-01  0:49                   ` Jens Axboe
2023-08-01  1:01                     ` Matthew Wilcox
2023-08-01  7:00                       ` Christian Brauner
2023-08-01  6:59                     ` Christian Brauner
2023-08-01  7:17                 ` Christian Brauner
2023-08-08  4:34                 ` Hao Xu
2023-08-08  5:18                   ` Hao Xu
2023-08-08  9:33                 ` Hao Xu
2023-08-08 22:55                   ` Jens Axboe
2023-08-01 18:39             ` Hao Xu
2023-07-18 13:21 ` [PATCH 4/5] xfs: add NOWAIT semantics for readdir Hao Xu
2023-07-19  2:35   ` kernel test robot
2023-07-18 13:21 ` [PATCH RFC 5/5] disable fixed file for io_uring getdents for now Hao Xu
2023-07-26 14:23   ` Christian Brauner
2023-07-27 12:09     ` Hao Xu
2023-07-19  6:04 ` [PATCH v4 0/5] io_uring getdents Christian Brauner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox