This is an attempt to revive discussion after bringing it up as a reply to the last attempt last week. Since the problem with the previous attempt at adding getdents to io_uring was that offset was problematic, we can just not add an offset parameter: using different offsets is rarely useful in practice (maybe for some cacheless userspace NFS server?) and more isn't worth the cost of allowing arbitrary offsets: just allow rewind as a flag instead. [happy to drop even that flag for what I care about, but that might be useful enough on its own as io_uring rightfully has no seek API] The new API does nothing that cannot be achieved with plain syscalls so it shouldn't be introducing any new problem, the only downside is that having the state in the file struct isn't very uring-ish and if a better solution is found later that will probably require duplicating some logic in a new flag... But that seems like it would likely be a distant future, and this version should be usable right away. To repeat the rationale for having getdents available from uring as it has been a while, while I believe there can be real performance improvements as suggested in [1], the real reason is to allow coherency in code: applications using io_uring usually center their event loop around the ring, and having the occasional synchronous getdents call in there is cumbersome and hard to do properly when you consider e.g. a slow or acting up network fs... [1] https://lore.kernel.org/linux-fsdevel/20210218122640.GA334506@wantstofly.org/ (unfortunately the source is no longer available...) liburing implementation: https://github.com/martinetd/liburing/commits/getdents (will submit properly after initial discussions here) Previous discussion: https://lore.kernel.org/all/20211221164004.119663-1-shr@fb.com/T/#m517583f23502f32b040c819d930359313b3db00c Signed-off-by: Dominique Martinet <asmadeus@codewreck.org> --- Changes in v2: - implement NOWAIT as suggested by dchinner; I'm unable to provide reliable benchmarks but it does indeed look positive (and makes sense) to avoid spinning out to another thread when not required. FWIW though, the serializing readdirs per inode argument given in v1 thread isn't valid: serialization is only done in io_prep_async_work for regular files (REQ_F_ISREG, set from file mode through FFS_ISREG), so dir operations aren't serialized in our case. If I was pedantic I'd say we might want that hashing for filesystems that don't implement interate_shared, but that info comes too late and these filesystems should become less frequent so leaving as is. - implement NOWAIT for kernfs and libfs to test with /sys (the tentative patch for xfs didn't seem to work, didn't take the time to debug) - try to implement some EOF flag in CQE as suggested by Clay Harris, this is less clearly cut and left as RFC. The liburing test/getdents.t implementation has grown options to test this flag and also try async explicitly in the latest commit: https://github.com/martinetd/liburing/commits/getdents - vfs_getdents split: add missing internal.h include - Link to v1: https://lore.kernel.org/r/20230422-uring-getdents-v1-0-14c1db36e98c@codewreck.org --- Dominique Martinet (6): fs: split off vfs_getdents function of getdents64 syscall vfs_getdents/struct dir_context: add flags field io_uring: add support for getdents kernfs: implement readdir FMODE_NOWAIT libfs: set FMODE_NOWAIT on dir open RFC: io_uring getdents: test returning an EOF flag in CQE fs/internal.h | 8 ++++++ fs/kernfs/dir.c | 21 ++++++++++++++- fs/libfs.c | 10 ++++--- fs/readdir.c | 38 +++++++++++++++++++++------ include/linux/fs.h | 10 +++++++ include/uapi/linux/io_uring.h | 9 +++++++ io_uring/fs.c | 61 +++++++++++++++++++++++++++++++++++++++++++ io_uring/fs.h | 3 +++ io_uring/opdef.c | 8 ++++++ 9 files changed, 156 insertions(+), 12 deletions(-) --- base-commit: 58390c8ce1bddb6c623f62e7ed36383e7fa5c02f change-id: 20230422-uring-getdents-2aab84d240aa Best regards, -- Dominique Martinet | Asmadeus
This splits off the vfs_getdents function from the getdents64 system call. This will allow io_uring to call the vfs_getdents function. Co-authored-by: Stefan Roesch <shr@fb.com> Signed-off-by: Dominique Martinet <asmadeus@codewreck.org> --- fs/internal.h | 8 ++++++++ fs/readdir.c | 34 ++++++++++++++++++++++++++-------- 2 files changed, 34 insertions(+), 8 deletions(-) diff --git a/fs/internal.h b/fs/internal.h index bd3b2810a36b..e8ca000e6613 100644 --- a/fs/internal.h +++ b/fs/internal.h @@ -260,3 +260,11 @@ ssize_t __kernel_write_iter(struct file *file, struct iov_iter *from, loff_t *po struct mnt_idmap *alloc_mnt_idmap(struct user_namespace *mnt_userns); struct mnt_idmap *mnt_idmap_get(struct mnt_idmap *idmap); void mnt_idmap_put(struct mnt_idmap *idmap); + +/* + * fs/readdir.c + */ +struct linux_dirent64; + +int vfs_getdents(struct file *file, struct linux_dirent64 __user *dirent, + unsigned int count); diff --git a/fs/readdir.c b/fs/readdir.c index 9c53edb60c03..ed0803d0011e 100644 --- a/fs/readdir.c +++ b/fs/readdir.c @@ -21,6 +21,7 @@ #include <linux/unistd.h> #include <linux/compat.h> #include <linux/uaccess.h> +#include "internal.h" #include <asm/unaligned.h> @@ -351,10 +352,16 @@ static bool filldir64(struct dir_context *ctx, const char *name, int namlen, return false; } -SYSCALL_DEFINE3(getdents64, unsigned int, fd, - struct linux_dirent64 __user *, dirent, unsigned int, count) + +/** + * vfs_getdents - getdents without fdget + * @file : pointer to file struct of directory + * @dirent : pointer to user directory structure + * @count : size of buffer + */ +int vfs_getdents(struct file *file, struct linux_dirent64 __user *dirent, + unsigned int count) { - struct fd f; struct getdents_callback64 buf = { .ctx.actor = filldir64, .count = count, @@ -362,11 +369,7 @@ SYSCALL_DEFINE3(getdents64, unsigned int, fd, }; int error; - f = fdget_pos(fd); - if (!f.file) - return -EBADF; - - error = iterate_dir(f.file, &buf.ctx); + error = iterate_dir(file, &buf.ctx); if (error >= 0) error = buf.error; if (buf.prev_reclen) { @@ -379,6 +382,21 @@ SYSCALL_DEFINE3(getdents64, unsigned int, fd, else error = count - buf.count; } + return error; +} + +SYSCALL_DEFINE3(getdents64, unsigned int, fd, + struct linux_dirent64 __user *, dirent, unsigned int, count) +{ + struct fd f; + int error; + + f = fdget_pos(fd); + if (!f.file) + return -EBADF; + + error = vfs_getdents(f.file, dirent, count); + fdput_pos(f); return error; } -- 2.39.2
The flags will allow passing DIR_CONTEXT_F_NOWAIT to iterate() implementations that support it (as signaled through FMODE_NWAIT in file->f_mode) Notes: - considered using IOCB_NOWAIT but if we add more flags later it would be confusing to keep track of which values are valid, use dedicated flags - might want to check ctx.flags & DIR_CONTEXT_F_NOWAIT is only set when file->f_mode & FMODE_NOWAIT in iterate_dir() as e.g. WARN_ONCE? Signed-off-by: Dominique Martinet <asmadeus@codewreck.org> --- fs/internal.h | 2 +- fs/readdir.c | 6 ++++-- include/linux/fs.h | 8 ++++++++ 3 files changed, 13 insertions(+), 3 deletions(-) diff --git a/fs/internal.h b/fs/internal.h index e8ca000e6613..0264b001d99a 100644 --- a/fs/internal.h +++ b/fs/internal.h @@ -267,4 +267,4 @@ void mnt_idmap_put(struct mnt_idmap *idmap); struct linux_dirent64; int vfs_getdents(struct file *file, struct linux_dirent64 __user *dirent, - unsigned int count); + unsigned int count, unsigned long flags); diff --git a/fs/readdir.c b/fs/readdir.c index ed0803d0011e..1311b89d75e1 100644 --- a/fs/readdir.c +++ b/fs/readdir.c @@ -358,12 +358,14 @@ static bool filldir64(struct dir_context *ctx, const char *name, int namlen, * @file : pointer to file struct of directory * @dirent : pointer to user directory structure * @count : size of buffer + * @flags : additional dir_context flags */ int vfs_getdents(struct file *file, struct linux_dirent64 __user *dirent, - unsigned int count) + unsigned int count, unsigned long flags) { struct getdents_callback64 buf = { .ctx.actor = filldir64, + .ctx.flags = flags, .count = count, .current_dir = dirent }; @@ -395,7 +397,7 @@ SYSCALL_DEFINE3(getdents64, unsigned int, fd, if (!f.file) return -EBADF; - error = vfs_getdents(f.file, dirent, count); + error = vfs_getdents(f.file, dirent, count, 0); fdput_pos(f); return error; diff --git a/include/linux/fs.h b/include/linux/fs.h index 21a981680856..f7de2b5ca38e 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1716,8 +1716,16 @@ typedef bool (*filldir_t)(struct dir_context *, const char *, int, loff_t, u64, struct dir_context { filldir_t actor; loff_t pos; + unsigned long flags; }; +/* + * flags for dir_context flags + * DIR_CONTEXT_F_NOWAIT: Request non-blocking iterate + * (requires file->f_mode & FMODE_NOWAIT) + */ +#define DIR_CONTEXT_F_NOWAIT 0x1 + /* * These flags let !MMU mmap() govern direct device mapping vs immediate * copying more easily for MAP_PRIVATE, especially for ROM filesystems. -- 2.39.2
This add support for getdents64 to io_uring, acting exactly like the syscall: the directory is iterated from it's current's position as stored in the file struct, and the file's position is updated exactly as if getdents64 had been called. Additionally, since io_uring has no way of issuing a seek, a flag IORING_GETDENTS_REWIND has been added that will seek to the start of the directory like rewinddir(3) for users that might require such a thing. This will act exactly as if seek then getdents64 have been called sequentially with no atomicity guarantee: if this wasn't clear it is the responsibility of the caller to not use getdents multiple time on a single file in parallel if they want useful results, as is currently the case when using the syscall from multiple threads. For filesystems that support NOWAIT in iterate_shared(), try to use it first; if a user already knows the filesystem they use do not support nowait they can force async through IOSQE_ASYNC in the sqe flags, avoiding the need to bounce back through a useless EAGAIN return. (Note we already do that in prep if rewind is requested) Signed-off-by: Dominique Martinet <asmadeus@codewreck.org> --- include/uapi/linux/io_uring.h | 7 ++++++ io_uring/fs.c | 57 +++++++++++++++++++++++++++++++++++++++++++ io_uring/fs.h | 3 +++ io_uring/opdef.c | 8 ++++++ 4 files changed, 75 insertions(+) diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 0716cb17e436..35d0de18d893 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -65,6 +65,7 @@ struct io_uring_sqe { __u32 xattr_flags; __u32 msg_ring_flags; __u32 uring_cmd_flags; + __u32 getdents_flags; }; __u64 user_data; /* data to be passed back at completion time */ /* pack this to avoid bogus arm OABI complaints */ @@ -223,6 +224,7 @@ enum io_uring_op { IORING_OP_URING_CMD, IORING_OP_SEND_ZC, IORING_OP_SENDMSG_ZC, + IORING_OP_GETDENTS, /* this goes last, obviously */ IORING_OP_LAST, @@ -259,6 +261,11 @@ enum io_uring_op { */ #define SPLICE_F_FD_IN_FIXED (1U << 31) /* the last bit of __u32 */ +/* + * sqe->getdents_flags + */ +#define IORING_GETDENTS_REWIND (1U << 0) + /* * POLL_ADD flags. Note that since sqe->poll_events is the flag space, the * command flags for POLL_ADD are stored in sqe->len. diff --git a/io_uring/fs.c b/io_uring/fs.c index f6a69a549fd4..b15ec81c1ed2 100644 --- a/io_uring/fs.c +++ b/io_uring/fs.c @@ -47,6 +47,13 @@ struct io_link { int flags; }; +struct io_getdents { + struct file *file; + struct linux_dirent64 __user *dirent; + unsigned int count; + int flags; +}; + int io_renameat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) { struct io_rename *ren = io_kiocb_to_cmd(req, struct io_rename); @@ -291,3 +298,53 @@ void io_link_cleanup(struct io_kiocb *req) putname(sl->oldpath); putname(sl->newpath); } + +int io_getdents_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) +{ + struct io_getdents *gd = io_kiocb_to_cmd(req, struct io_getdents); + + if (READ_ONCE(sqe->off) != 0) + return -EINVAL; + + gd->dirent = u64_to_user_ptr(READ_ONCE(sqe->addr)); + gd->count = READ_ONCE(sqe->len); + gd->flags = READ_ONCE(sqe->getdents_flags); + if (gd->flags & ~IORING_GETDENTS_REWIND) + return -EINVAL; + /* rewind cannot be nowait */ + if (gd->flags & IORING_GETDENTS_REWIND) + req->flags |= REQ_F_FORCE_ASYNC; + + return 0; +} + +int io_getdents(struct io_kiocb *req, unsigned int issue_flags) +{ + struct io_getdents *gd = io_kiocb_to_cmd(req, struct io_getdents); + unsigned long getdents_flags = 0; + int ret; + + if (issue_flags & IO_URING_F_NONBLOCK) { + if (!(req->file->f_mode & FMODE_NOWAIT)) + return -EAGAIN; + + getdents_flags = DIR_CONTEXT_F_NOWAIT; + } + if ((gd->flags & IORING_GETDENTS_REWIND)) { + WARN_ON_ONCE(issue_flags & IO_URING_F_NONBLOCK); + + ret = vfs_llseek(req->file, 0, SEEK_SET); + if (ret < 0) + goto out; + } + + ret = vfs_getdents(req->file, gd->dirent, gd->count, getdents_flags); +out: + if (ret == -EAGAIN && + (issue_flags & IO_URING_F_NONBLOCK)) + return -EAGAIN; + + io_req_set_res(req, ret, 0); + return 0; +} + diff --git a/io_uring/fs.h b/io_uring/fs.h index 0bb5efe3d6bb..f83a6f3a678d 100644 --- a/io_uring/fs.h +++ b/io_uring/fs.h @@ -18,3 +18,6 @@ int io_symlinkat(struct io_kiocb *req, unsigned int issue_flags); int io_linkat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe); int io_linkat(struct io_kiocb *req, unsigned int issue_flags); void io_link_cleanup(struct io_kiocb *req); + +int io_getdents_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe); +int io_getdents(struct io_kiocb *req, unsigned int issue_flags); diff --git a/io_uring/opdef.c b/io_uring/opdef.c index cca7c5b55208..8f770c582ab3 100644 --- a/io_uring/opdef.c +++ b/io_uring/opdef.c @@ -428,6 +428,11 @@ const struct io_issue_def io_issue_defs[] = { .prep = io_eopnotsupp_prep, #endif }, + [IORING_OP_GETDENTS] = { + .needs_file = 1, + .prep = io_getdents_prep, + .issue = io_getdents, + }, }; @@ -648,6 +653,9 @@ const struct io_cold_def io_cold_defs[] = { .fail = io_sendrecv_fail, #endif }, + [IORING_OP_GETDENTS] = { + .name = "GETDENTS", + }, }; const char *io_uring_get_opcode(u8 opcode) -- 2.39.2
Since down_read can block, use the _trylock variant if NOWAIT variant has been requested. (can probably do a little bit better style-wise) Signed-off-by: Dominique Martinet <asmadeus@codewreck.org> --- fs/kernfs/dir.c | 20 +++++++++++++++++++- 1 file changed, 19 insertions(+), 1 deletion(-) diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c index 45b6919903e6..5a5b3e7881bf 100644 --- a/fs/kernfs/dir.c +++ b/fs/kernfs/dir.c @@ -1824,7 +1824,12 @@ static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx) return 0; root = kernfs_root(parent); - down_read(&root->kernfs_rwsem); + if (ctx->flags & DIR_CONTEXT_F_NOWAIT) { + if (!down_read_trylock(&root->kernfs_rwsem)) + return -EAGAIN; + } else { + down_read(&root->kernfs_rwsem); + } if (kernfs_ns_enabled(parent)) ns = kernfs_info(dentry->d_sb)->ns; @@ -1845,6 +1850,12 @@ static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx) if (!dir_emit(ctx, name, len, ino, type)) return 0; down_read(&root->kernfs_rwsem); + if (ctx->flags & DIR_CONTEXT_F_NOWAIT) { + if (!down_read_trylock(&root->kernfs_rwsem)) + return 0; + } else { + down_read(&root->kernfs_rwsem); + } } up_read(&root->kernfs_rwsem); file->private_data = NULL; @@ -1852,7 +1863,14 @@ static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx) return 0; } +static int kernfs_fop_dir_open(struct inode *inode, struct file *file) +{ + file->f_mode |= FMODE_NOWAIT; + return 0; +} + const struct file_operations kernfs_dir_fops = { + .open = kernfs_fop_dir_open, .read = generic_read_dir, .iterate_shared = kernfs_fop_readdir, .release = kernfs_dir_fop_release, -- 2.39.2
the readdir can technically wait a bit on a spinlock, but that should never wait for long enough to return EAGAIN -- just set the capability flag on directories f_mode Signed-off-by: Dominique Martinet <asmadeus@codewreck.org> --- fs/libfs.c | 1 + 1 file changed, 1 insertion(+) diff --git a/fs/libfs.c b/fs/libfs.c index 89cf614a3271..a3c7e42d90a7 100644 --- a/fs/libfs.c +++ b/fs/libfs.c @@ -81,6 +81,7 @@ EXPORT_SYMBOL(simple_lookup); int dcache_dir_open(struct inode *inode, struct file *file) { file->private_data = d_alloc_cursor(file->f_path.dentry); + file->f_mode |= FMODE_NOWAIT; return file->private_data ? 0 : -ENOMEM; } -- 2.39.2
This turns out to be very slightly faster than an extra call to getdents, but in practice it doesn't seem to be such an improvement as the trailing getdents will return almost immediately be absorbed by the scheduling noise in a find-like context (my ""server"" is too noisy to get proper benchmarks out, but results look slightly better with this in async mode, and almost identical in the NOWAIT path) If the user is waiting the end of a single directory though it might be worth it, so including the patch for comments. (in particular I'm not really happy that the flag has become in-out for vfs_getdents, especially when the getdents64 syscall does not use it, but I don't see much other way around it) If this approach is acceptable/wanted then this patch will be split down further (at least dir_context/vfs_getdents, kernfs, libfs, uring in four separate commits) Signed-off-by: Dominique Martinet <asmadeus@codewreck.org> --- fs/internal.h | 2 +- fs/kernfs/dir.c | 1 + fs/libfs.c | 9 ++++++--- fs/readdir.c | 10 ++++++---- include/linux/fs.h | 2 ++ include/uapi/linux/io_uring.h | 2 ++ io_uring/fs.c | 8 ++++++-- 7 files changed, 24 insertions(+), 10 deletions(-) diff --git a/fs/internal.h b/fs/internal.h index 0264b001d99a..0b1552c7a870 100644 --- a/fs/internal.h +++ b/fs/internal.h @@ -267,4 +267,4 @@ void mnt_idmap_put(struct mnt_idmap *idmap); struct linux_dirent64; int vfs_getdents(struct file *file, struct linux_dirent64 __user *dirent, - unsigned int count, unsigned long flags); + unsigned int count, unsigned long *flags); diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c index 5a5b3e7881bf..53a6b4804c34 100644 --- a/fs/kernfs/dir.c +++ b/fs/kernfs/dir.c @@ -1860,6 +1860,7 @@ static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx) up_read(&root->kernfs_rwsem); file->private_data = NULL; ctx->pos = INT_MAX; + ctx->flags |= DIR_CONTEXT_F_EOD; return 0; } diff --git a/fs/libfs.c b/fs/libfs.c index a3c7e42d90a7..b2a95dadffbd 100644 --- a/fs/libfs.c +++ b/fs/libfs.c @@ -208,10 +208,12 @@ int dcache_readdir(struct file *file, struct dir_context *ctx) p = &next->d_child; } spin_lock(&dentry->d_lock); - if (next) + if (next) { list_move_tail(&cursor->d_child, &next->d_child); - else + } else { list_del_init(&cursor->d_child); + ctx->flags |= DIR_CONTEXT_F_EOD; + } spin_unlock(&dentry->d_lock); dput(next); @@ -1347,7 +1349,8 @@ static loff_t empty_dir_llseek(struct file *file, loff_t offset, int whence) static int empty_dir_readdir(struct file *file, struct dir_context *ctx) { - dir_emit_dots(file, ctx); + if (dir_emit_dots(file, ctx)) + ctx->flags |= DIR_CONTEXT_F_EOD; return 0; } diff --git a/fs/readdir.c b/fs/readdir.c index 1311b89d75e1..be75a2154b4f 100644 --- a/fs/readdir.c +++ b/fs/readdir.c @@ -358,14 +358,14 @@ static bool filldir64(struct dir_context *ctx, const char *name, int namlen, * @file : pointer to file struct of directory * @dirent : pointer to user directory structure * @count : size of buffer - * @flags : additional dir_context flags + * @flags : pointer to additional dir_context flags */ int vfs_getdents(struct file *file, struct linux_dirent64 __user *dirent, - unsigned int count, unsigned long flags) + unsigned int count, unsigned long *flags) { struct getdents_callback64 buf = { .ctx.actor = filldir64, - .ctx.flags = flags, + .ctx.flags = flags ? *flags : 0, .count = count, .current_dir = dirent }; @@ -384,6 +384,8 @@ int vfs_getdents(struct file *file, struct linux_dirent64 __user *dirent, else error = count - buf.count; } + if (flags) + *flags = buf.ctx.flags; return error; } @@ -397,7 +399,7 @@ SYSCALL_DEFINE3(getdents64, unsigned int, fd, if (!f.file) return -EBADF; - error = vfs_getdents(f.file, dirent, count, 0); + error = vfs_getdents(f.file, dirent, count, NULL); fdput_pos(f); return error; diff --git a/include/linux/fs.h b/include/linux/fs.h index f7de2b5ca38e..d1e31bccfb4f 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1723,8 +1723,10 @@ struct dir_context { * flags for dir_context flags * DIR_CONTEXT_F_NOWAIT: Request non-blocking iterate * (requires file->f_mode & FMODE_NOWAIT) + * DIR_CONTEXT_F_EOD: Signal directory has been fully iterated, set by the fs */ #define DIR_CONTEXT_F_NOWAIT 0x1 +#define DIR_CONTEXT_F_EOD 0x2 /* * These flags let !MMU mmap() govern direct device mapping vs immediate diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 35d0de18d893..35877132027e 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -381,11 +381,13 @@ struct io_uring_cqe { * IORING_CQE_F_SOCK_NONEMPTY If set, more data to read after socket recv * IORING_CQE_F_NOTIF Set for notification CQEs. Can be used to distinct * them from sends. + * IORING_CQE_F_EOF If set, file or directory has reached end of file. */ #define IORING_CQE_F_BUFFER (1U << 0) #define IORING_CQE_F_MORE (1U << 1) #define IORING_CQE_F_SOCK_NONEMPTY (1U << 2) #define IORING_CQE_F_NOTIF (1U << 3) +#define IORING_CQE_F_EOF (1U << 4) enum { IORING_CQE_BUFFER_SHIFT = 16, diff --git a/io_uring/fs.c b/io_uring/fs.c index b15ec81c1ed2..f6222b0148ef 100644 --- a/io_uring/fs.c +++ b/io_uring/fs.c @@ -322,6 +322,7 @@ int io_getdents(struct io_kiocb *req, unsigned int issue_flags) { struct io_getdents *gd = io_kiocb_to_cmd(req, struct io_getdents); unsigned long getdents_flags = 0; + u32 cqe_flags = 0; int ret; if (issue_flags & IO_URING_F_NONBLOCK) { @@ -338,13 +339,16 @@ int io_getdents(struct io_kiocb *req, unsigned int issue_flags) goto out; } - ret = vfs_getdents(req->file, gd->dirent, gd->count, getdents_flags); + ret = vfs_getdents(req->file, gd->dirent, gd->count, &getdents_flags); out: if (ret == -EAGAIN && (issue_flags & IO_URING_F_NONBLOCK)) return -EAGAIN; - io_req_set_res(req, ret, 0); + if (getdents_flags & DIR_CONTEXT_F_EOD) + cqe_flags |= IORING_CQE_F_EOF; + + io_req_set_res(req, ret, cqe_flags); return 0; } -- 2.39.2
Hi Dominique, kernel test robot noticed the following build warnings: url: https://github.com/intel-lab-lkp/linux/commits/Dominique-Martinet/fs-split-off-vfs_getdents-function-of-getdents64-syscall/20230510-185542 base: 58390c8ce1bddb6c623f62e7ed36383e7fa5c02f patch link: https://lore.kernel.org/r/20230422-uring-getdents-v2-4-2db1e37dc55e%40codewreck.org patch subject: [PATCH v2 4/6] kernfs: implement readdir FMODE_NOWAIT config: i386-randconfig-m021 (https://download.01.org/0day-ci/archive/20230511/202305110647.eSnSEulg-lkp@intel.com/config) compiler: gcc-11 (Debian 11.3.0-12) 11.3.0 If you fix the issue, kindly add following tag where applicable | Reported-by: kernel test robot <lkp@intel.com> | Reported-by: Dan Carpenter <error27@gmail.com> | Link: https://lore.kernel.org/r/202305110647.eSnSEulg-lkp@intel.com/ smatch warnings: fs/kernfs/dir.c:1863 kernfs_fop_readdir() warn: inconsistent returns '&root->kernfs_rwsem'. vim +1863 fs/kernfs/dir.c c637b8acbe079e Tejun Heo 2013-12-11 1815 static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx) fd7b9f7b9776b1 Tejun Heo 2013-11-28 1816 { fd7b9f7b9776b1 Tejun Heo 2013-11-28 1817 struct dentry *dentry = file->f_path.dentry; 319ba91d352a74 Shaohua Li 2017-07-12 1818 struct kernfs_node *parent = kernfs_dentry_node(dentry); 324a56e16e44ba Tejun Heo 2013-12-11 1819 struct kernfs_node *pos = file->private_data; 393c3714081a53 Minchan Kim 2021-11-18 1820 struct kernfs_root *root; fd7b9f7b9776b1 Tejun Heo 2013-11-28 1821 const void *ns = NULL; fd7b9f7b9776b1 Tejun Heo 2013-11-28 1822 fd7b9f7b9776b1 Tejun Heo 2013-11-28 1823 if (!dir_emit_dots(file, ctx)) fd7b9f7b9776b1 Tejun Heo 2013-11-28 1824 return 0; 393c3714081a53 Minchan Kim 2021-11-18 1825 393c3714081a53 Minchan Kim 2021-11-18 1826 root = kernfs_root(parent); a551138c4b3b9f Dominique Martinet 2023-05-10 1827 if (ctx->flags & DIR_CONTEXT_F_NOWAIT) { a551138c4b3b9f Dominique Martinet 2023-05-10 1828 if (!down_read_trylock(&root->kernfs_rwsem)) a551138c4b3b9f Dominique Martinet 2023-05-10 1829 return -EAGAIN; a551138c4b3b9f Dominique Martinet 2023-05-10 1830 } else { 393c3714081a53 Minchan Kim 2021-11-18 1831 down_read(&root->kernfs_rwsem); a551138c4b3b9f Dominique Martinet 2023-05-10 1832 } fd7b9f7b9776b1 Tejun Heo 2013-11-28 1833 324a56e16e44ba Tejun Heo 2013-12-11 1834 if (kernfs_ns_enabled(parent)) c525aaddc366df Tejun Heo 2013-12-11 1835 ns = kernfs_info(dentry->d_sb)->ns; fd7b9f7b9776b1 Tejun Heo 2013-11-28 1836 c637b8acbe079e Tejun Heo 2013-12-11 1837 for (pos = kernfs_dir_pos(ns, parent, ctx->pos, pos); fd7b9f7b9776b1 Tejun Heo 2013-11-28 1838 pos; c637b8acbe079e Tejun Heo 2013-12-11 1839 pos = kernfs_dir_next_pos(ns, parent, ctx->pos, pos)) { adc5e8b58f4886 Tejun Heo 2013-12-11 1840 const char *name = pos->name; 364595a6851bf6 Jeff Layton 2023-03-30 1841 unsigned int type = fs_umode_to_dtype(pos->mode); fd7b9f7b9776b1 Tejun Heo 2013-11-28 1842 int len = strlen(name); 67c0496e87d193 Tejun Heo 2019-11-04 1843 ino_t ino = kernfs_ino(pos); fd7b9f7b9776b1 Tejun Heo 2013-11-28 1844 adc5e8b58f4886 Tejun Heo 2013-12-11 1845 ctx->pos = pos->hash; fd7b9f7b9776b1 Tejun Heo 2013-11-28 1846 file->private_data = pos; fd7b9f7b9776b1 Tejun Heo 2013-11-28 1847 kernfs_get(pos); fd7b9f7b9776b1 Tejun Heo 2013-11-28 1848 393c3714081a53 Minchan Kim 2021-11-18 1849 up_read(&root->kernfs_rwsem); fd7b9f7b9776b1 Tejun Heo 2013-11-28 1850 if (!dir_emit(ctx, name, len, ino, type)) fd7b9f7b9776b1 Tejun Heo 2013-11-28 1851 return 0; 393c3714081a53 Minchan Kim 2021-11-18 1852 down_read(&root->kernfs_rwsem); ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Needs to be deleted. a551138c4b3b9f Dominique Martinet 2023-05-10 1853 if (ctx->flags & DIR_CONTEXT_F_NOWAIT) { a551138c4b3b9f Dominique Martinet 2023-05-10 1854 if (!down_read_trylock(&root->kernfs_rwsem)) a551138c4b3b9f Dominique Martinet 2023-05-10 1855 return 0; It's a bit strange the this doesn't return -EAGAIN; a551138c4b3b9f Dominique Martinet 2023-05-10 1856 } else { a551138c4b3b9f Dominique Martinet 2023-05-10 1857 down_read(&root->kernfs_rwsem); a551138c4b3b9f Dominique Martinet 2023-05-10 1858 } fd7b9f7b9776b1 Tejun Heo 2013-11-28 1859 } 393c3714081a53 Minchan Kim 2021-11-18 1860 up_read(&root->kernfs_rwsem); fd7b9f7b9776b1 Tejun Heo 2013-11-28 1861 file->private_data = NULL; fd7b9f7b9776b1 Tejun Heo 2013-11-28 1862 ctx->pos = INT_MAX; fd7b9f7b9776b1 Tejun Heo 2013-11-28 @1863 return 0; fd7b9f7b9776b1 Tejun Heo 2013-11-28 1864 } -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests
Dan Carpenter wrote on Thu, May 11, 2023 at 01:55:57PM +0300: > fd7b9f7b9776b1 Tejun Heo 2013-11-28 1850 if (!dir_emit(ctx, name, len, ino, type)) > fd7b9f7b9776b1 Tejun Heo 2013-11-28 1851 return 0; > 393c3714081a53 Minchan Kim 2021-11-18 1852 down_read(&root->kernfs_rwsem); > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > Needs to be deleted. Uh, yes, sorry; I'm not sure how I let that slip, I guess I didn't hit any dead lock because nothing ever tried to take a write lock after getdents... Thanks! I expect there'll be other comments (this might not make it at all), so I'll keep the v3 of this patch with this fix locally and resend after other comments. > a551138c4b3b9f Dominique Martinet 2023-05-10 1853 if (ctx->flags & DIR_CONTEXT_F_NOWAIT) { > a551138c4b3b9f Dominique Martinet 2023-05-10 1854 if (!down_read_trylock(&root->kernfs_rwsem)) > a551138c4b3b9f Dominique Martinet 2023-05-10 1855 return 0; > > It's a bit strange the this doesn't return -EAGAIN; That is on purpose: the getdents did work (dir_emit returned success at least once), so the caller can process whatever was filled in the buffer before calling iterate_shared() again. If we were to return -EAGAIN here, we'd actually be throwing out the entries we just filled in, and that's not what we want. -- Dominique Martinet | Asmadeus
On Wed, May 10, 2023 at 07:52:54PM +0900, Dominique Martinet wrote:
> This turns out to be very slightly faster than an extra call to
> getdents, but in practice it doesn't seem to be such an improvement as
> the trailing getdents will return almost immediately be absorbed by the
> scheduling noise in a find-like context (my ""server"" is too noisy to
> get proper benchmarks out, but results look slightly better with this in
> async mode, and almost identical in the NOWAIT path)
>
> If the user is waiting the end of a single directory though it might be
> worth it, so including the patch for comments.
> (in particular I'm not really happy that the flag has become in-out for
> vfs_getdents, especially when the getdents64 syscall does not use it,
> but I don't see much other way around it)
>
> If this approach is acceptable/wanted then this patch will be split down
> further (at least dir_context/vfs_getdents, kernfs, libfs, uring in four
> separate commits)
>
> Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
> ---
> fs/internal.h | 2 +-
> fs/kernfs/dir.c | 1 +
> fs/libfs.c | 9 ++++++---
> fs/readdir.c | 10 ++++++----
> include/linux/fs.h | 2 ++
> include/uapi/linux/io_uring.h | 2 ++
> io_uring/fs.c | 8 ++++++--
> 7 files changed, 24 insertions(+), 10 deletions(-)
>
> diff --git a/fs/internal.h b/fs/internal.h
> index 0264b001d99a..0b1552c7a870 100644
> --- a/fs/internal.h
> +++ b/fs/internal.h
> @@ -267,4 +267,4 @@ void mnt_idmap_put(struct mnt_idmap *idmap);
> struct linux_dirent64;
>
> int vfs_getdents(struct file *file, struct linux_dirent64 __user *dirent,
> - unsigned int count, unsigned long flags);
> + unsigned int count, unsigned long *flags);
> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> index 5a5b3e7881bf..53a6b4804c34 100644
> --- a/fs/kernfs/dir.c
> +++ b/fs/kernfs/dir.c
> @@ -1860,6 +1860,7 @@ static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
> up_read(&root->kernfs_rwsem);
> file->private_data = NULL;
> ctx->pos = INT_MAX;
> + ctx->flags |= DIR_CONTEXT_F_EOD;
> return 0;
> }
>
> diff --git a/fs/libfs.c b/fs/libfs.c
> index a3c7e42d90a7..b2a95dadffbd 100644
> --- a/fs/libfs.c
> +++ b/fs/libfs.c
> @@ -208,10 +208,12 @@ int dcache_readdir(struct file *file, struct dir_context *ctx)
> p = &next->d_child;
> }
> spin_lock(&dentry->d_lock);
> - if (next)
> + if (next) {
> list_move_tail(&cursor->d_child, &next->d_child);
> - else
> + } else {
> list_del_init(&cursor->d_child);
> + ctx->flags |= DIR_CONTEXT_F_EOD;
> + }
> spin_unlock(&dentry->d_lock);
> dput(next);
>
> @@ -1347,7 +1349,8 @@ static loff_t empty_dir_llseek(struct file *file, loff_t offset, int whence)
>
> static int empty_dir_readdir(struct file *file, struct dir_context *ctx)
> {
> - dir_emit_dots(file, ctx);
> + if (dir_emit_dots(file, ctx))
> + ctx->flags |= DIR_CONTEXT_F_EOD;
> return 0;
> }
>
> diff --git a/fs/readdir.c b/fs/readdir.c
> index 1311b89d75e1..be75a2154b4f 100644
> --- a/fs/readdir.c
> +++ b/fs/readdir.c
> @@ -358,14 +358,14 @@ static bool filldir64(struct dir_context *ctx, const char *name, int namlen,
> * @file : pointer to file struct of directory
> * @dirent : pointer to user directory structure
> * @count : size of buffer
> - * @flags : additional dir_context flags
> + * @flags : pointer to additional dir_context flags
> */
> int vfs_getdents(struct file *file, struct linux_dirent64 __user *dirent,
> - unsigned int count, unsigned long flags)
> + unsigned int count, unsigned long *flags)
> {
> struct getdents_callback64 buf = {
> .ctx.actor = filldir64,
> - .ctx.flags = flags,
> + .ctx.flags = flags ? *flags : 0,
> .count = count,
> .current_dir = dirent
> };
> @@ -384,6 +384,8 @@ int vfs_getdents(struct file *file, struct linux_dirent64 __user *dirent,
> else
> error = count - buf.count;
> }
> + if (flags)
> + *flags = buf.ctx.flags;
> return error;
> }
>
> @@ -397,7 +399,7 @@ SYSCALL_DEFINE3(getdents64, unsigned int, fd,
> if (!f.file)
> return -EBADF;
>
> - error = vfs_getdents(f.file, dirent, count, 0);
> + error = vfs_getdents(f.file, dirent, count, NULL);
>
> fdput_pos(f);
> return error;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index f7de2b5ca38e..d1e31bccfb4f 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1723,8 +1723,10 @@ struct dir_context {
> * flags for dir_context flags
> * DIR_CONTEXT_F_NOWAIT: Request non-blocking iterate
> * (requires file->f_mode & FMODE_NOWAIT)
> + * DIR_CONTEXT_F_EOD: Signal directory has been fully iterated, set by the fs
> */
> #define DIR_CONTEXT_F_NOWAIT 0x1
> +#define DIR_CONTEXT_F_EOD 0x2
>
> /*
> * These flags let !MMU mmap() govern direct device mapping vs immediate
> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> index 35d0de18d893..35877132027e 100644
> --- a/include/uapi/linux/io_uring.h
> +++ b/include/uapi/linux/io_uring.h
> @@ -381,11 +381,13 @@ struct io_uring_cqe {
> * IORING_CQE_F_SOCK_NONEMPTY If set, more data to read after socket recv
> * IORING_CQE_F_NOTIF Set for notification CQEs. Can be used to distinct
> * them from sends.
> + * IORING_CQE_F_EOF If set, file or directory has reached end of file.
> */
> #define IORING_CQE_F_BUFFER (1U << 0)
> #define IORING_CQE_F_MORE (1U << 1)
> #define IORING_CQE_F_SOCK_NONEMPTY (1U << 2)
> #define IORING_CQE_F_NOTIF (1U << 3)
> +#define IORING_CQE_F_EOF (1U << 4)
>
> enum {
> IORING_CQE_BUFFER_SHIFT = 16,
> diff --git a/io_uring/fs.c b/io_uring/fs.c
> index b15ec81c1ed2..f6222b0148ef 100644
> --- a/io_uring/fs.c
> +++ b/io_uring/fs.c
> @@ -322,6 +322,7 @@ int io_getdents(struct io_kiocb *req, unsigned int issue_flags)
> {
> struct io_getdents *gd = io_kiocb_to_cmd(req, struct io_getdents);
> unsigned long getdents_flags = 0;
> + u32 cqe_flags = 0;
> int ret;
>
> if (issue_flags & IO_URING_F_NONBLOCK) {
> @@ -338,13 +339,16 @@ int io_getdents(struct io_kiocb *req, unsigned int issue_flags)
> goto out;
> }
>
> - ret = vfs_getdents(req->file, gd->dirent, gd->count, getdents_flags);
> + ret = vfs_getdents(req->file, gd->dirent, gd->count, &getdents_flags);
I don't understand how synchronization and updating of f_pos works here.
For example, what happens if a concurrent seek happens on the fd while
io_uring is using vfs_getdents which calls into iterate_dir() and
updates f_pos?
On Wed, May 10, 2023 at 07:52:49PM +0900, Dominique Martinet wrote:
> This splits off the vfs_getdents function from the getdents64 system
> call.
> This will allow io_uring to call the vfs_getdents function.
>
> Co-authored-by: Stefan Roesch <shr@fb.com>
> Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
> ---
> fs/internal.h | 8 ++++++++
> fs/readdir.c | 34 ++++++++++++++++++++++++++--------
> 2 files changed, 34 insertions(+), 8 deletions(-)
>
> diff --git a/fs/internal.h b/fs/internal.h
> index bd3b2810a36b..e8ca000e6613 100644
> --- a/fs/internal.h
> +++ b/fs/internal.h
> @@ -260,3 +260,11 @@ ssize_t __kernel_write_iter(struct file *file, struct iov_iter *from, loff_t *po
> struct mnt_idmap *alloc_mnt_idmap(struct user_namespace *mnt_userns);
> struct mnt_idmap *mnt_idmap_get(struct mnt_idmap *idmap);
> void mnt_idmap_put(struct mnt_idmap *idmap);
> +
> +/*
> + * fs/readdir.c
> + */
> +struct linux_dirent64;
> +
> +int vfs_getdents(struct file *file, struct linux_dirent64 __user *dirent,
> + unsigned int count);
> diff --git a/fs/readdir.c b/fs/readdir.c
> index 9c53edb60c03..ed0803d0011e 100644
> --- a/fs/readdir.c
> +++ b/fs/readdir.c
> @@ -21,6 +21,7 @@
> #include <linux/unistd.h>
> #include <linux/compat.h>
> #include <linux/uaccess.h>
> +#include "internal.h"
>
> #include <asm/unaligned.h>
>
> @@ -351,10 +352,16 @@ static bool filldir64(struct dir_context *ctx, const char *name, int namlen,
> return false;
> }
>
> -SYSCALL_DEFINE3(getdents64, unsigned int, fd,
> - struct linux_dirent64 __user *, dirent, unsigned int, count)
> +
> +/**
> + * vfs_getdents - getdents without fdget
> + * @file : pointer to file struct of directory
> + * @dirent : pointer to user directory structure
> + * @count : size of buffer
> + */
> +int vfs_getdents(struct file *file, struct linux_dirent64 __user *dirent,
> + unsigned int count)
> {
> - struct fd f;
> struct getdents_callback64 buf = {
> .ctx.actor = filldir64,
> .count = count,
> @@ -362,11 +369,7 @@ SYSCALL_DEFINE3(getdents64, unsigned int, fd,
> };
> int error;
>
> - f = fdget_pos(fd);
> - if (!f.file)
> - return -EBADF;
> -
> - error = iterate_dir(f.file, &buf.ctx);
> + error = iterate_dir(file, &buf.ctx);
So afaict this isn't enough.
If you look into iterate_shared() you should see that it uses and
updates f_pos. But that can't work for io_uring and also isn't how
io_uring handles read and write. You probably need to use a local pos
similar to what io_uring does in rw.c for rw->kiocb.ki_pos. But in
contrast simply disallow any offsets for getdents completely. Thus not
relying on f_pos anywhere at all.
Christian Brauner wrote on Tue, May 23, 2023 at 04:30:14PM +0200:
> > index b15ec81c1ed2..f6222b0148ef 100644
> > --- a/io_uring/fs.c
> > +++ b/io_uring/fs.c
> > @@ -322,6 +322,7 @@ int io_getdents(struct io_kiocb *req, unsigned int issue_flags)
> > {
> > struct io_getdents *gd = io_kiocb_to_cmd(req, struct io_getdents);
> > unsigned long getdents_flags = 0;
> > + u32 cqe_flags = 0;
> > int ret;
> >
> > if (issue_flags & IO_URING_F_NONBLOCK) {
> > @@ -338,13 +339,16 @@ int io_getdents(struct io_kiocb *req, unsigned int issue_flags)
> > goto out;
> > }
> >
> > - ret = vfs_getdents(req->file, gd->dirent, gd->count, getdents_flags);
> > + ret = vfs_getdents(req->file, gd->dirent, gd->count, &getdents_flags);
>
> I don't understand how synchronization and updating of f_pos works here.
> For example, what happens if a concurrent seek happens on the fd while
> io_uring is using vfs_getdents which calls into iterate_dir() and
> updates f_pos?
I don't see how different that is from a user spawning two threads and
calling getdents64 + lseek or two getdents64 in parallel?
(or any two other users of iterate_dir)
As far as I understand you'll either get the old or new pos as
obtained/updated by iterate_dir()?
That iterate_dir probably ought to be using READ_ONCE/WRITE_ONCE or some
atomic read/update wrappers as the shared case only has a read lock
around these, but that's not a new problem; and for all I care
about I'm happy to let users shoot themselves in the foot.
(although I guess that with filesystems not validating the offset as
was pointed out in a previous version comment having non-atomic update
might be a security issue at some point on architectures that don't
guarantee atomic 64bit updates, but if someone manages to abuse it
it's already possible to abuse it with the good old syscalls, so I'd
rather leave that up to someone who understand how atomicity in the
kernel works better than me...)
--
Dominique Martinet | Asmadeus
Christian Brauner wrote on Tue, May 23, 2023 at 05:39:08PM +0200: > > @@ -362,11 +369,7 @@ SYSCALL_DEFINE3(getdents64, unsigned int, fd, > > }; > > int error; > > > > - f = fdget_pos(fd); > > - if (!f.file) > > - return -EBADF; > > - > > - error = iterate_dir(f.file, &buf.ctx); > > + error = iterate_dir(file, &buf.ctx); > > So afaict this isn't enough. > If you look into iterate_shared() you should see that it uses and > updates f_pos. But that can't work for io_uring and also isn't how > io_uring handles read and write. You probably need to use a local pos > similar to what io_uring does in rw.c for rw->kiocb.ki_pos. But in > contrast simply disallow any offsets for getdents completely. Thus not > relying on f_pos anywhere at all. Using a private offset from the sqe was the previous implementation discussed around here[1], and Al Viro pointed out that the iterate filesystem implementations don't validate the offset makes sense as it's either costly or for some filesystems downright impossible, so I went into a don't let users modify it approach. [1] https://lore.kernel.org/all/20211221164004.119663-1-shr@fb.com/T/#m517583f23502f32b040c819d930359313b3db00c I agree it's not how io_uring usually works -- it dislikes global states -- but it works perfectly well as long as you don't have multiple users on the same file, which the application can take care of. Not having any offsets would work for small directories but make reading large directories impossible so some sort of continuation is required, which means we need to keep the offset around; I also suggested keeping the offset in argument as the previous version but only allowing the last known offset (... so ultimately still updating f_pos anyway as we don't have anywhere else to store it) or 0, but if we're going to do that it looks much simpler to me to expose the same API as getdents. -- Dominique Martinet | Asmadeus
On Wed, May 24, 2023 at 06:13:57AM +0900, Dominique Martinet wrote: > Christian Brauner wrote on Tue, May 23, 2023 at 05:39:08PM +0200: > > > @@ -362,11 +369,7 @@ SYSCALL_DEFINE3(getdents64, unsigned int, fd, > > > }; > > > int error; > > > > > > - f = fdget_pos(fd); > > > - if (!f.file) > > > - return -EBADF; > > > - > > > - error = iterate_dir(f.file, &buf.ctx); > > > + error = iterate_dir(file, &buf.ctx); > > > > So afaict this isn't enough. > > If you look into iterate_shared() you should see that it uses and > > updates f_pos. But that can't work for io_uring and also isn't how > > io_uring handles read and write. You probably need to use a local pos > > similar to what io_uring does in rw.c for rw->kiocb.ki_pos. But in > > contrast simply disallow any offsets for getdents completely. Thus not > > relying on f_pos anywhere at all. > > Using a private offset from the sqe was the previous implementation > discussed around here[1], and Al Viro pointed out that the iterate > filesystem implementations don't validate the offset makes sense as it's > either costly or for some filesystems downright impossible, so I went > into a don't let users modify it approach. > > [1] https://lore.kernel.org/all/20211221164004.119663-1-shr@fb.com/T/#m517583f23502f32b040c819d930359313b3db00c > > > I agree it's not how io_uring usually works -- it dislikes global > states -- but it works perfectly well as long as you don't have multiple > users on the same file, which the application can take care of. > > Not having any offsets would work for small directories but make reading > large directories impossible so some sort of continuation is required, > which means we need to keep the offset around; I also suggested keeping > the offset in argument as the previous version but only allowing the > last known offset (... so ultimately still updating f_pos anyway as we > don't have anywhere else to store it) or 0, but if we're going to do > that it looks much simpler to me to expose the same API as getdents. > > -- > Dominique Martinet | Asmadeus On Wed, May 24, 2023 at 06:05:06AM +0900, Dominique Martinet wrote: > Christian Brauner wrote on Tue, May 23, 2023 at 04:30:14PM +0200: > > > index b15ec81c1ed2..f6222b0148ef 100644 > > > --- a/io_uring/fs.c > > > +++ b/io_uring/fs.c > > > @@ -322,6 +322,7 @@ int io_getdents(struct io_kiocb *req, unsigned int issue_flags) > > > { > > > struct io_getdents *gd = io_kiocb_to_cmd(req, struct io_getdents); > > > unsigned long getdents_flags = 0; > > > + u32 cqe_flags = 0; > > > int ret; > > > > > > if (issue_flags & IO_URING_F_NONBLOCK) { > > > @@ -338,13 +339,16 @@ int io_getdents(struct io_kiocb *req, unsigned int issue_flags) > > > goto out; > > > } > > > > > > - ret = vfs_getdents(req->file, gd->dirent, gd->count, getdents_flags); > > > + ret = vfs_getdents(req->file, gd->dirent, gd->count, &getdents_flags); > > > > I don't understand how synchronization and updating of f_pos works here. > > For example, what happens if a concurrent seek happens on the fd while > > io_uring is using vfs_getdents which calls into iterate_dir() and > > updates f_pos? > > I don't see how different that is from a user spawning two threads and > calling getdents64 + lseek or two getdents64 in parallel? > (or any two other users of iterate_dir) > > As far as I understand you'll either get the old or new pos as > obtained/updated by iterate_dir()? > > That iterate_dir probably ought to be using READ_ONCE/WRITE_ONCE or some > atomic read/update wrappers as the shared case only has a read lock > around these, but that's not a new problem; and for all I care > about I'm happy to let users shoot themselves in the foot. > (although I guess that with filesystems not validating the offset as > was pointed out in a previous version comment having non-atomic update > might be a security issue at some point on architectures that don't > guarantee atomic 64bit updates, but if someone manages to abuse it > it's already possible to abuse it with the good old syscalls, so I'd > rather leave that up to someone who understand how atomicity in the > kernel works better than me...) There's multiple issues here. The main objection in [1] was to allow specifying an arbitrary offset from userspace. What [3] did was to implement a pread() variant for directories, i.e., pgetdents(). That can't work in principle/is prohibitively complex. Which is what your series avoids by not allowing any offsets to be specified. However, there's still a problem here. Updates to f_pos happen under an approriate lock to guarantee consistency of the position between calls that move the cursor position. In the normal read-write path io_uring doesn't concern itself with f_pos as it keeps local state in kiocb->ki_pos. But then it still does end up running into f_pos consistency problems for read-write because it does allow operating on the current f_pos if the offset if struct io_rw is set to -1. In that case it does retrieve and update f_pos which should take f_pos_lock and a patchset for this was posted but it didn't go anywhere. It should probably hold that lock. See Jann's comments in the other thread how that currently can lead to issues. For getdents() not protecting f_pos is equally bad or worse. The patch doesn't hold f_pos_lock and just updates f_pos prior _and_ post iterate_dir() arguing that this race is fine. But again, f_version and f_pos are consistent after each system call invocation. But without that you can have a concurrent seek running and can end up with an inconsistent f_pos position within the same system call. IOW, you're breaking f_pos being in a well-known state. And you're not doing that just for io_uring you're doing it for the regular system call interface as well as both can be used on the same fd simultaneously. So that's a no go imho. > I don't see how different that is from a user spawning two threads and > calling getdents64 + lseek or two getdents64 in parallel? > (or any two other users of iterate_dir) The difference is that in both cases f_pos_lock for both getdents and lseek is held. So f_pos is in a good known state. You're not taking any locks so now we're risking inconsistency within the same system call if getdents and lseek run concurrently. Jens also mentioned that you could even have this problem from within io_uring itself. So tl;dr, there's no good reason to declare this an acceptable race afaict. So either this is fixed properly or we're not doing it as far as I'm concerned. [1] https://lore.kernel.org/all/20211221164004.119663-1-shr@fb.com/T/#m517583f23502f32b040c819d930359313b3db00c [2] https://lore.kernel.org/io-uring/20211221164004.119663-4-shr@fb.com [3] https://lore.kernel.org/io-uring/20211221164004.119663-1-shr@fb.com
Christian Brauner wrote on Wed, May 24, 2023 at 03:52:45PM +0200: > The main objection in [1] was to allow specifying an arbitrary offset > from userspace. What [3] did was to implement a pread() variant for > directories, i.e., pgetdents(). That can't work in principle/is > prohibitively complex. Which is what your series avoids by not allowing > any offsets to be specified. Yes. > However, there's still a problem here. Updates to f_pos happen under an > approriate lock to guarantee consistency of the position between calls > that move the cursor position. In the normal read-write path io_uring > doesn't concern itself with f_pos as it keeps local state in > kiocb->ki_pos. > > But then it still does end up running into f_pos consistency problems > for read-write because it does allow operating on the current f_pos if > the offset if struct io_rw is set to -1. > > In that case it does retrieve and update f_pos which should take > f_pos_lock and a patchset for this was posted but it didn't go anywhere. > It should probably hold that lock. See Jann's comments in the other > thread how that currently can lead to issues. Assuming that is this mail: https://lore.kernel.org/io-uring/CAG48ez1O9VxSuWuLXBjke23YxUA8EhMP+6RCHo5PNQBf3B0pDQ@mail.gmail.com/ So, ok, I didn't realize fdget_pos() actually acted as a lock on the file's f_pos_lock (apparently only if FMODE_ATMOIC_POS is set? but it looks set on most if not all directories through finish_open(), that looks called consistently enough) What was confusing is that default_llseek updates f_pos under the inode_lock (write), and getdents also takes that lock (for read only in shared implem), so I assumed getdents also was just protected by this read lock, but I guess that was a bad assumption (as I kept pointing out, a shared read lock isn't good enough, we definitely agree there) In practice, in the non-registered file case io_uring is also calling fdget, so the lock is held exactly the same as the syscall and I wasn't so far off -- but we need to figure something for the registered file case. openat variants don't allow working with the registered variant at all on the parent fd, so as far as I'm concerned I'd be happy setting the same limitation for getdents: just say it acts on fd and not files, and call it a day... It'd also be possible to check if REQ_F_FIXED_FILE flag was set and manually take the lock somehow but we don't have any primitive that takes f_pos_lock from a file (the only place that takes it is fdget which requires a fd), so I'd rather not add such a new exception. I assume the other patch you mentioned about adding that lock was this one: https://lore.kernel.org/all/20220222105504.3331010-1-dylany@fb.com/T/#m3609dc8057d0bc8e41ceab643e4d630f7b91bde6 and it just atkes the lock, but __fdget_pos also checks for FMODE_ATOMIC_OPS and file_count and I'm not sure I understand how it sets (f.flags & FDPUT_POS_UNLOCK) (for fdput_pos) so I'd rather not add such a code path at this point.. So, ok, what do you think about just forbidding registered files? I can't see where that wouldn't suffice but I might be missing something else. > For getdents() not protecting f_pos is equally bad or worse. The patch > doesn't hold f_pos_lock and just updates f_pos prior _and_ post > iterate_dir() arguing that this race is fine. But again, f_version and > f_pos are consistent after each system call invocation. > > But without that you can have a concurrent seek running and can end up > with an inconsistent f_pos position within the same system call. IOW, > you're breaking f_pos being in a well-known state. And you're not doing > that just for io_uring you're doing it for the regular system call > interface as well as both can be used on the same fd simultaneously. > So that's a no go imho. (so seek is fine, but I agree two concurrent getdents on registered files won't have the required lock) -- Dominique Martinet | Asmadeus
On Thu, May 25, 2023 at 06:36:17AM +0900, Dominique Martinet wrote: > Christian Brauner wrote on Wed, May 24, 2023 at 03:52:45PM +0200: > > The main objection in [1] was to allow specifying an arbitrary offset > > from userspace. What [3] did was to implement a pread() variant for > > directories, i.e., pgetdents(). That can't work in principle/is > > prohibitively complex. Which is what your series avoids by not allowing > > any offsets to be specified. > > Yes. > > > However, there's still a problem here. Updates to f_pos happen under an > > approriate lock to guarantee consistency of the position between calls > > that move the cursor position. In the normal read-write path io_uring > > doesn't concern itself with f_pos as it keeps local state in > > kiocb->ki_pos. > > > > But then it still does end up running into f_pos consistency problems > > for read-write because it does allow operating on the current f_pos if > > the offset if struct io_rw is set to -1. > > > > In that case it does retrieve and update f_pos which should take > > f_pos_lock and a patchset for this was posted but it didn't go anywhere. > > It should probably hold that lock. See Jann's comments in the other > > thread how that currently can lead to issues. > > Assuming that is this mail: > https://lore.kernel.org/io-uring/CAG48ez1O9VxSuWuLXBjke23YxUA8EhMP+6RCHo5PNQBf3B0pDQ@mail.gmail.com/ > > So, ok, I didn't realize fdget_pos() actually acted as a lock on the > file's f_pos_lock (apparently only if FMODE_ATMOIC_POS is set? but it > looks set on most if not all directories through finish_open(), that > looks called consistently enough) It's set for any regular file and directory. > > What was confusing is that default_llseek updates f_pos under the > inode_lock (write), and getdents also takes that lock (for read only in > shared implem), so I assumed getdents also was just protected by this > read lock, but I guess that was a bad assumption (as I kept pointing > out, a shared read lock isn't good enough, we definitely agree there) > > > In practice, in the non-registered file case io_uring is also calling > fdget, so the lock is held exactly the same as the syscall and I wasn't No, it really isn't. fdget() doesn't take f_pos_lock at all: fdget() -> __fdget() -> __fget_light() -> __fget() -> __fget_files() -> __fget_files_rcu() If that were true then any system call that passes an fd and uses fdget() would try to acquire a mutex on f_pos_lock. We'd be serializing every *at based system call on f_pos_lock whenever we have multiple fds referring to the same file trying to operate on it concurrently. We do have fdget_pos() and fdput_pos() as a special purpose fdget() for a select group of system calls that require this synchronization. > so far off -- but we need to figure something for the registered file > case. > openat variants don't allow working with the registered variant at all > on the parent fd, so as far as I'm concerned I'd be happy setting the > same limitation for getdents: just say it acts on fd and not files, and > call it a day... I don't follow. Also this is hacky so no. The reason why io_uring *at implementations don't work with fixed files is that the VFS interface expect regular fds. You could very well make this work for fixed files but why. It would mean exposing a whole new set of vfs helpers to io_uring and would probably involve nasty corner cases. Also the connection between regular and fixed files in io_uring is pretty much fluent. While fixed files can only remain pinned in an io_uring instance it requires that the caller explicitly gave up all references in their fdtable to that struct file by closing all fds referring to the same file. But there's no guarantee. For example, if another thread dups the fd or the caller sends the fd via SCM_RIGHTS to another process or the caller simply doesn't close the fd or another thread gets an fd to the same file from that task via pidfd_getfd before it closed it this doesn't hold. So it's very well possible to have an fd and a fixed io_uring reference referring to the same file. The first one can be used with the regular system call interface and io_uring *at requests that forbid fixed files. And the other one can be used for i_uring fixed file operations. Doesn't matter if that shouldn't be done, it's possible afaict. For regular and fixed files you also have the same problem from within the same io_uring instance where you can have concurrent getdent requests. You'd end up producing the exact same inconsistencies. > It'd also be possible to check if REQ_F_FIXED_FILE flag was set and > manually take the lock somehow but we don't have any primitive that > takes f_pos_lock from a file (the only place that takes it is fdget > which requires a fd), so I'd rather not add such a new exception. > I assume the other patch you mentioned about adding that lock was this > one: > https://lore.kernel.org/all/20220222105504.3331010-1-dylany@fb.com/T/#m3609dc8057d0bc8e41ceab643e4d630f7b91bde6 > and it just atkes the lock, but __fdget_pos also checks for > FMODE_ATOMIC_OPS and file_count and I'm not sure I understand how it > sets (f.flags & FDPUT_POS_UNLOCK) (for fdput_pos) so I'd rather not add > such a code path at this point.. > > > So, ok, what do you think about just forbidding registered files? > I can't see where that wouldn't suffice but I might be missing something > else. It doesn't help.
Christian Brauner wrote on Thu, May 25, 2023 at 11:22:08AM +0200: > > What was confusing is that default_llseek updates f_pos under the > > inode_lock (write), and getdents also takes that lock (for read only in > > shared implem), so I assumed getdents also was just protected by this > > read lock, but I guess that was a bad assumption (as I kept pointing > > out, a shared read lock isn't good enough, we definitely agree there) > > > > > > In practice, in the non-registered file case io_uring is also calling > > fdget, so the lock is held exactly the same as the syscall and I wasn't > > No, it really isn't. fdget() doesn't take f_pos_lock at all: > > fdget() > -> __fdget() > -> __fget_light() > -> __fget() > -> __fget_files() > -> __fget_files_rcu() Ugh, I managed to not notice that I was looking at fdget_pos and that it's not the same as fdget by the time I wrote two paragraphs... These functions all have too many wrappers and too similar names for a quick look before work. > If that were true then any system call that passes an fd and uses > fdget() would try to acquire a mutex on f_pos_lock. We'd be serializing > every *at based system call on f_pos_lock whenever we have multiple fds > referring to the same file trying to operate on it concurrently. > > We do have fdget_pos() and fdput_pos() as a special purpose fdget() for > a select group of system calls that require this synchronization. Right, that makes sense, and invalidates everything I said after that anyway but it's not like looking stupid ever killed anyone. Ok so it would require adding a new wrapper from struct file to struct fd that'd eventually take the lock and set FDPUT_POS_UNLOCK for... not fdput_pos but another function for that stopping short of fdput... Then just call that around both vfs_llseek and vfs_getdents calls; which is the easy part. (Or possibly call mutex_lock directly like Dylan did in [1]...) [1] https://lore.kernel.org/all/20220222105504.3331010-1-dylany@fb.com/T/#m3609dc8057d0bc8e41ceab643e4d630f7b91bde6 I'll be honest though I'm thankful for your explanations but I think I'll just do like Stefan and stop trying for now: the only reason I've started this was because I wanted to play with io_uring for a new toy project and it felt awkward without a getdents for crawling a tree; and I'm long past the point where I should have thrown the towel and just make that a sequential walk. There's too many "conditional patches" (NOWAIT, end of dir indicator) that I don't care about and require additional work to rebase continuously so I'll just leave it up to someone else who does care. So to that someone: feel free to continue from these branches (I've included the fix for kernfs_fop_readdir that Dan Carpenter reported): https://github.com/martinetd/linux/commits/io_uring_getdents https://github.com/martinetd/liburing/commits/getdents Or just start over, there's not that much code now hopefully the baseline requirements have gotten a little bit clearer. Sorry for stirring the mess and leaving halfway, if nobody does continue I might send a v3 when I have more time/energy in a few months, but it won't be quick. -- Dominique
On Thu, May 25, 2023 at 08:00:02PM +0900, Dominique Martinet wrote: > Christian Brauner wrote on Thu, May 25, 2023 at 11:22:08AM +0200: > > > What was confusing is that default_llseek updates f_pos under the > > > inode_lock (write), and getdents also takes that lock (for read only in > > > shared implem), so I assumed getdents also was just protected by this > > > read lock, but I guess that was a bad assumption (as I kept pointing > > > out, a shared read lock isn't good enough, we definitely agree there) > > > > > > > > > In practice, in the non-registered file case io_uring is also calling > > > fdget, so the lock is held exactly the same as the syscall and I wasn't > > > > No, it really isn't. fdget() doesn't take f_pos_lock at all: > > > > fdget() > > -> __fdget() > > -> __fget_light() > > -> __fget() > > -> __fget_files() > > -> __fget_files_rcu() > > Ugh, I managed to not notice that I was looking at fdget_pos and that > it's not the same as fdget by the time I wrote two paragraphs... These > functions all have too many wrappers and too similar names for a quick > look before work. > > > If that were true then any system call that passes an fd and uses > > fdget() would try to acquire a mutex on f_pos_lock. We'd be serializing > > every *at based system call on f_pos_lock whenever we have multiple fds > > referring to the same file trying to operate on it concurrently. > > > > We do have fdget_pos() and fdput_pos() as a special purpose fdget() for > > a select group of system calls that require this synchronization. > > Right, that makes sense, and invalidates everything I said after that > anyway but it's not like looking stupid ever killed anyone. I strongly disagree with the looking stupid part. These callchains are quite unwieldy and it's easy to get confused. Usually if you receive a long mail about the semantics involved - as in the earlier thread - it means there's landmines all over. > > Ok so it would require adding a new wrapper from struct file to struct > fd that'd eventually take the lock and set FDPUT_POS_UNLOCK for... not > fdput_pos but another function for that stopping short of fdput... > Then just call that around both vfs_llseek and vfs_getdents calls; which > is the easy part. > > (Or possibly call mutex_lock directly like Dylan did in [1]...) > [1] https://lore.kernel.org/all/20220222105504.3331010-1-dylany@fb.com/T/#m3609dc8057d0bc8e41ceab643e4d630f7b91bde6 We'd need a consistent story whatever it ends up being. > I'll be honest though I'm thankful for your explanations but I think > I'll just do like Stefan and stop trying for now: the only reason I've > started this was because I wanted to play with io_uring for a new toy > project and it felt awkward without a getdents for crawling a tree; and > I'm long past the point where I should have thrown the towel and just > make that a sequential walk. > There's too many "conditional patches" (NOWAIT, end of dir indicator) > that I don't care about and require additional work to rebase > continuously so I'll just leave it up to someone else who does care. > > So to that someone: feel free to continue from these branches (I've > included the fix for kernfs_fop_readdir that Dan Carpenter reported): > https://github.com/martinetd/linux/commits/io_uring_getdents > https://github.com/martinetd/liburing/commits/getdents > > Or just start over, there's not that much code now hopefully the > baseline requirements have gotten a little bit clearer. > > > Sorry for stirring the mess and leaving halfway, if nobody does continue > I might send a v3 when I have more time/energy in a few months, but it > won't be quick. It's fine.
On 5/25/23 19:00, Dominique Martinet wrote:
> Christian Brauner wrote on Thu, May 25, 2023 at 11:22:08AM +0200:
>>> What was confusing is that default_llseek updates f_pos under the
>>> inode_lock (write), and getdents also takes that lock (for read only in
>>> shared implem), so I assumed getdents also was just protected by this
>>> read lock, but I guess that was a bad assumption (as I kept pointing
>>> out, a shared read lock isn't good enough, we definitely agree there)
>>>
>>>
>>> In practice, in the non-registered file case io_uring is also calling
>>> fdget, so the lock is held exactly the same as the syscall and I wasn't
>>
>> No, it really isn't. fdget() doesn't take f_pos_lock at all:
>>
>> fdget()
>> -> __fdget()
>> -> __fget_light()
>> -> __fget()
>> -> __fget_files()
>> -> __fget_files_rcu()
>
> Ugh, I managed to not notice that I was looking at fdget_pos and that
> it's not the same as fdget by the time I wrote two paragraphs... These
> functions all have too many wrappers and too similar names for a quick
> look before work.
>
>> If that were true then any system call that passes an fd and uses
>> fdget() would try to acquire a mutex on f_pos_lock. We'd be serializing
>> every *at based system call on f_pos_lock whenever we have multiple fds
>> referring to the same file trying to operate on it concurrently.
>>
>> We do have fdget_pos() and fdput_pos() as a special purpose fdget() for
>> a select group of system calls that require this synchronization.
>
> Right, that makes sense, and invalidates everything I said after that
> anyway but it's not like looking stupid ever killed anyone.
>
> Ok so it would require adding a new wrapper from struct file to struct
> fd that'd eventually take the lock and set FDPUT_POS_UNLOCK for... not
> fdput_pos but another function for that stopping short of fdput...
> Then just call that around both vfs_llseek and vfs_getdents calls; which
> is the easy part.
>
> (Or possibly call mutex_lock directly like Dylan did in [1]...)
> [1] https://lore.kernel.org/all/20220222105504.3331010-1-dylany@fb.com/T/#m3609dc8057d0bc8e41ceab643e4d630f7b91bde6
>
>
>
> I'll be honest though I'm thankful for your explanations but I think
> I'll just do like Stefan and stop trying for now: the only reason I've
> started this was because I wanted to play with io_uring for a new toy
> project and it felt awkward without a getdents for crawling a tree; and
> I'm long past the point where I should have thrown the towel and just
> make that a sequential walk.
> There's too many "conditional patches" (NOWAIT, end of dir indicator)
> that I don't care about and require additional work to rebase
> continuously so I'll just leave it up to someone else who does care.
>
> So to that someone: feel free to continue from these branches (I've
> included the fix for kernfs_fop_readdir that Dan Carpenter reported):
> https://github.com/martinetd/linux/commits/io_uring_getdents
> https://github.com/martinetd/liburing/commits/getdents
>
> Or just start over, there's not that much code now hopefully the
> baseline requirements have gotten a little bit clearer.
>
>
> Sorry for stirring the mess and leaving halfway, if nobody does continue
> I might send a v3 when I have more time/energy in a few months, but it
> won't be quick.
>
Hi Dominique,
I'd like to take this if you don't mind.
Regards,
Hao
Hao Xu wrote on Tue, Jul 11, 2023 at 04:17:11PM +0800:
> > So to that someone: feel free to continue from these branches (I've
> > included the fix for kernfs_fop_readdir that Dan Carpenter reported):
> > https://github.com/martinetd/linux/commits/io_uring_getdents
> > https://github.com/martinetd/liburing/commits/getdents
> >
> > Or just start over, there's not that much code now hopefully the
> > baseline requirements have gotten a little bit clearer.
> >
> >
> > Sorry for stirring the mess and leaving halfway, if nobody does continue
> > I might send a v3 when I have more time/energy in a few months, but it
> > won't be quick.
>
> I'd like to take this if you don't mind.
Sure, I haven't been working on this lately, feel free to work on this.
Will be happy to review anything you send based on what came out of the
previous discussions to save Christian and others some time so you can
keep me in Cc if you'd like.
--
Dominique Martinet | Asmadeus