* [GIT PULL] io_uring fixes for 5.10-rc
@ 2020-10-30 17:09 Jens Axboe
2020-10-30 22:10 ` pr-tracker-bot
0 siblings, 1 reply; 17+ messages in thread
From: Jens Axboe @ 2020-10-30 17:09 UTC (permalink / raw)
To: Linus Torvalds; +Cc: io-uring, [email protected]
Hi Linus,
Fixes that should go into this release:
- Fixes for linked timeouts (Pavel)
- Set IO_WQ_WORK_CONCURRENT early for async offload (Pavel)
- Two minor simplifications that make the code easier to read and follow
(Pavel)
Please pull!
The following changes since commit ee6e00c868221f5f7d0b6eb4e8379a148e26bc20:
splice: change exported internal do_splice() helper to take kernel offset (2020-10-22 14:15:51 -0600)
are available in the Git repository at:
git://git.kernel.dk/linux-block.git tags/io_uring-5.10-2020-10-30
for you to fetch changes up to c8b5e2600a2cfa1cdfbecf151afd67aee227381d:
io_uring: use type appropriate io_kiocb handler for double poll (2020-10-25 13:53:26 -0600)
----------------------------------------------------------------
io_uring-5.10-2020-10-30
----------------------------------------------------------------
Jens Axboe (1):
io_uring: use type appropriate io_kiocb handler for double poll
Pavel Begunkov (7):
io_uring: remove opcode check on ltimeout kill
io_uring: don't adjust LINK_HEAD in cancel ltimeout
io_uring: always clear LINK_TIMEOUT after cancel
io_uring: don't defer put of cancelled ltimeout
io_uring: don't miss setting IO_WQ_WORK_CONCURRENT
io_uring: simplify nxt propagation in io_queue_sqe
io_uring: simplify __io_queue_sqe()
fs/io_uring.c | 108 +++++++++++++++++++++-------------------------------------
1 file changed, 38 insertions(+), 70 deletions(-)
--
Jens Axboe
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [GIT PULL] io_uring fixes for 5.10-rc
2020-10-30 17:09 Jens Axboe
@ 2020-10-30 22:10 ` pr-tracker-bot
0 siblings, 0 replies; 17+ messages in thread
From: pr-tracker-bot @ 2020-10-30 22:10 UTC (permalink / raw)
To: Jens Axboe; +Cc: Linus Torvalds, io-uring, [email protected]
The pull request you sent on Fri, 30 Oct 2020 11:09:13 -0600:
> git://git.kernel.dk/linux-block.git tags/io_uring-5.10-2020-10-30
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/cf9446cc8e6d85355642209538dde619f53770dc
Thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html
^ permalink raw reply [flat|nested] 17+ messages in thread
* [GIT PULL] io_uring fixes for 5.10-rc
@ 2020-11-07 20:13 Jens Axboe
2020-11-07 22:08 ` pr-tracker-bot
0 siblings, 1 reply; 17+ messages in thread
From: Jens Axboe @ 2020-11-07 20:13 UTC (permalink / raw)
To: Linus Torvalds; +Cc: io-uring, [email protected]
Hi Linus,
A set of fixes for io_uring that should go into this release:
- SQPOLL cancelation fixes
- Two fixes for the io_identity COW
- Cancelation overflow fix (Pavel)
- Drain request cancelation fix (Pavel)
- Link timeout race fix (Pavel)
Please pull!
The following changes since commit 3cea11cd5e3b00d91caf0b4730194039b45c5891:
Linux 5.10-rc2 (2020-11-01 14:43:51 -0800)
are available in the Git repository at:
git://git.kernel.dk/linux-block.git tags/io_uring-5.10-2020-11-07
for you to fetch changes up to 9a472ef7a3690ac0b77ebfb04c88fa795de2adea:
io_uring: fix link lookup racing with link timeout (2020-11-05 15:36:40 -0700)
----------------------------------------------------------------
io_uring-5.10-2020-11-07
----------------------------------------------------------------
Jens Axboe (5):
io-wq: cancel request if it's asking for files and we don't have them
io_uring: properly handle SQPOLL request cancelations
io_uring: ensure consistent view of original task ->mm from SQPOLL
io_uring: drop req/tctx io_identity separately
io_uring: use correct pointer for io_uring_show_cred()
Pavel Begunkov (3):
io_uring: fix overflowed cancel w/ linked ->files
io_uring: don't forget to task-cancel drained reqs
io_uring: fix link lookup racing with link timeout
fs/io-wq.c | 4 ++
fs/io_uring.c | 183 +++++++++++++++++++++++++++++++++++------------
include/linux/io_uring.h | 3 +-
3 files changed, 142 insertions(+), 48 deletions(-)
--
Jens Axboe
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [GIT PULL] io_uring fixes for 5.10-rc
2020-11-07 20:13 Jens Axboe
@ 2020-11-07 22:08 ` pr-tracker-bot
0 siblings, 0 replies; 17+ messages in thread
From: pr-tracker-bot @ 2020-11-07 22:08 UTC (permalink / raw)
To: Jens Axboe; +Cc: Linus Torvalds, io-uring, [email protected]
The pull request you sent on Sat, 7 Nov 2020 13:13:57 -0700:
> git://git.kernel.dk/linux-block.git tags/io_uring-5.10-2020-11-07
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/e9c02d68cc26b28a9a12ebd1aeaed673ad0e73e2
Thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html
^ permalink raw reply [flat|nested] 17+ messages in thread
* [GIT PULL] io_uring fixes for 5.10-rc
@ 2020-11-13 21:18 Jens Axboe
2020-11-14 0:15 ` pr-tracker-bot
0 siblings, 1 reply; 17+ messages in thread
From: Jens Axboe @ 2020-11-13 21:18 UTC (permalink / raw)
To: Linus Torvalds; +Cc: io-uring
Hi Linus,
Single fix in here, for a missed rounding case at setup time, which
caused an otherwise legitimate setup case to return -EINVAL if used with
unaligned ring size values. Please pull!
The following changes since commit 9a472ef7a3690ac0b77ebfb04c88fa795de2adea:
io_uring: fix link lookup racing with link timeout (2020-11-05 15:36:40 -0700)
are available in the Git repository at:
git://git.kernel.dk/linux-block.git tags/io_uring-5.10-2020-11-13
for you to fetch changes up to 88ec3211e46344a7d10cf6cb5045f839f7785f8e:
io_uring: round-up cq size before comparing with rounded sq size (2020-11-11 10:42:41 -0700)
----------------------------------------------------------------
io_uring-5.10-2020-11-13
----------------------------------------------------------------
Jens Axboe (1):
io_uring: round-up cq size before comparing with rounded sq size
fs/io_uring.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--
Jens Axboe
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [GIT PULL] io_uring fixes for 5.10-rc
2020-11-13 21:18 Jens Axboe
@ 2020-11-14 0:15 ` pr-tracker-bot
0 siblings, 0 replies; 17+ messages in thread
From: pr-tracker-bot @ 2020-11-14 0:15 UTC (permalink / raw)
To: Jens Axboe; +Cc: Linus Torvalds, io-uring
The pull request you sent on Fri, 13 Nov 2020 14:18:58 -0700:
> git://git.kernel.dk/linux-block.git tags/io_uring-5.10-2020-11-13
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/1b1e9262ca644b5b7f1d12b2f8c2edfff420c5f3
Thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html
^ permalink raw reply [flat|nested] 17+ messages in thread
* [GIT PULL] io_uring fixes for 5.10-rc
@ 2020-11-20 18:45 Jens Axboe
2020-11-20 20:02 ` Linus Torvalds
2020-11-21 0:29 ` pr-tracker-bot
0 siblings, 2 replies; 17+ messages in thread
From: Jens Axboe @ 2020-11-20 18:45 UTC (permalink / raw)
To: Linus Torvalds; +Cc: io-uring, [email protected]
Hi Linus,
Mostly regression or stable fodder:
- Disallow async path resolution of /proc/self
- Tighten constraints for segmented async buffered reads
- Fix double completion for a retry error case
- Fix for fixed file life times (Pavel)
Please pull!
The following changes since commit 88ec3211e46344a7d10cf6cb5045f839f7785f8e:
io_uring: round-up cq size before comparing with rounded sq size (2020-11-11 10:42:41 -0700)
are available in the Git repository at:
git://git.kernel.dk/linux-block.git tags/io_uring-5.10-2020-11-20
for you to fetch changes up to e297822b20e7fe683e107aea46e6402adcf99c70:
io_uring: order refnode recycling (2020-11-18 08:02:10 -0700)
----------------------------------------------------------------
io_uring-5.10-2020-11-20
----------------------------------------------------------------
Jens Axboe (4):
proc: don't allow async path resolution of /proc/self components
io_uring: handle -EOPNOTSUPP on path resolution
mm: never attempt async page lock if we've transferred data already
io_uring: don't double complete failed reissue request
Pavel Begunkov (2):
io_uring: get an active ref_node from files_data
io_uring: order refnode recycling
fs/io_uring.c | 57 ++++++++++++++++++++++++++++++++++++++++++---------------
fs/proc/self.c | 7 +++++++
mm/filemap.c | 18 ++++++++++++++----
3 files changed, 63 insertions(+), 19 deletions(-)
--
Jens Axboe
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [GIT PULL] io_uring fixes for 5.10-rc
2020-11-20 18:45 Jens Axboe
@ 2020-11-20 20:02 ` Linus Torvalds
2020-11-20 21:36 ` Jens Axboe
2020-11-21 0:29 ` pr-tracker-bot
1 sibling, 1 reply; 17+ messages in thread
From: Linus Torvalds @ 2020-11-20 20:02 UTC (permalink / raw)
To: Jens Axboe; +Cc: io-uring, [email protected]
On Fri, Nov 20, 2020 at 10:45 AM Jens Axboe <[email protected]> wrote:
> Jens Axboe (4):
> proc: don't allow async path resolution of /proc/self components
This one is ok.
> io_uring: handle -EOPNOTSUPP on path resolution
But this one smells. It talks about how it shouldn't block, but the
fact is, it can easily block when the path going through another
filesystem (think ".." to get to root before even hitting /proc/self,
but also think /proc/self/cwd/randompathgoeshere).
The whole concept seems entirely broken anyway. Why would you retry
the failure after doing it asynchronously? If it really doesn't block,
then it shouldn't have been done async in the first place.
IMNSHO, the openat logic is just wrong. And that "ignore_nonblock"
thing is a disgusting hack that is everything that is wrong with
io_uring. Stop doing these kinds of hacky things that will just cause
problems down the line.
I think the correct thing to do is to just start the open
synchronously with an RCU lookup, and if that fails, go to the async
one. And if the async one fails because it's /proc/self, then it just
fails. None of this kind of "it should be ok" stuff.
And that would likely be the faster model anyway - do it synchronously
and immediately for the easy cases.
And if it really is something like "/proc/self/cwd/randompathgoeshere"
that actually will block, maybe io_uring just shouldn't support it?
I've pulled this, but I really object to how io_uring keeps having
subtle bugs, and then they get worked around with this kind of hackery
which really smells like "this will be a subtle bug some time in the
future".
Linus
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [GIT PULL] io_uring fixes for 5.10-rc
2020-11-20 20:02 ` Linus Torvalds
@ 2020-11-20 21:36 ` Jens Axboe
2020-11-21 0:23 ` Linus Torvalds
0 siblings, 1 reply; 17+ messages in thread
From: Jens Axboe @ 2020-11-20 21:36 UTC (permalink / raw)
To: Linus Torvalds; +Cc: io-uring, [email protected]
On 11/20/20 1:02 PM, Linus Torvalds wrote:
> On Fri, Nov 20, 2020 at 10:45 AM Jens Axboe <[email protected]> wrote:
>> Jens Axboe (4):
>> proc: don't allow async path resolution of /proc/self components
>
> This one is ok.
>
>> io_uring: handle -EOPNOTSUPP on path resolution
>
> But this one smells. It talks about how it shouldn't block, but the
> fact is, it can easily block when the path going through another
> filesystem (think ".." to get to root before even hitting /proc/self,
> but also think /proc/self/cwd/randompathgoeshere).
>
> The whole concept seems entirely broken anyway. Why would you retry
> the failure after doing it asynchronously? If it really doesn't block,
> then it shouldn't have been done async in the first place.
>
> IMNSHO, the openat logic is just wrong. And that "ignore_nonblock"
> thing is a disgusting hack that is everything that is wrong with
> io_uring. Stop doing these kinds of hacky things that will just cause
> problems down the line.
>
> I think the correct thing to do is to just start the open
> synchronously with an RCU lookup, and if that fails, go to the async
> one. And if the async one fails because it's /proc/self, then it just
> fails. None of this kind of "it should be ok" stuff.
>
> And that would likely be the faster model anyway - do it synchronously
> and immediately for the easy cases.
>
> And if it really is something like "/proc/self/cwd/randompathgoeshere"
> that actually will block, maybe io_uring just shouldn't support it?
>
> I've pulled this, but I really object to how io_uring keeps having
> subtle bugs, and then they get worked around with this kind of hackery
> which really smells like "this will be a subtle bug some time in the
> future".
I don't disagree with you on that. I've been a bit gun shy on touching
the VFS side of things, but this one isn't too bad. I hacked up a patch
that allows io_uring to do LOOKUP_RCU and a quick test seems to indicate
it's fine. On top of that, we just propagate the error if we do fail and
get rid of that odd retry loop.
And yes, it should be much better performance as well, for any sort of
cached lookup. There's a reason why we made the close side more
efficient like that, too.
Lightly tested patch below, needs to be split into 2 parts of course.
But the VFS side is just adding a few functions to fs/internal.h and the
struct nameidata structure, no other changes needed.
diff --git a/fs/internal.h b/fs/internal.h
index 6fd14ea213c3..e100d5bca42d 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -131,11 +131,41 @@ struct open_flags {
};
extern struct file *do_filp_open(int dfd, struct filename *pathname,
const struct open_flags *op);
+extern struct file *path_openat(struct nameidata *nd,
+ const struct open_flags *op, unsigned flags);
extern struct file *do_file_open_root(struct dentry *, struct vfsmount *,
const char *, const struct open_flags *);
extern struct open_how build_open_how(int flags, umode_t mode);
extern int build_open_flags(const struct open_how *how, struct open_flags *op);
+#define EMBEDDED_LEVELS 2
+struct nameidata {
+ struct path path;
+ struct qstr last;
+ struct path root;
+ struct inode *inode; /* path.dentry.d_inode */
+ unsigned int flags;
+ unsigned seq, m_seq, r_seq;
+ int last_type;
+ unsigned depth;
+ int total_link_count;
+ struct saved {
+ struct path link;
+ struct delayed_call done;
+ const char *name;
+ unsigned seq;
+ } *stack, internal[EMBEDDED_LEVELS];
+ struct filename *name;
+ struct nameidata *saved;
+ unsigned root_seq;
+ int dfd;
+ kuid_t dir_uid;
+ umode_t dir_mode;
+} __randomize_layout;
+
+extern void set_nameidata(struct nameidata *p, int dfd, struct filename *name);
+extern void restore_nameidata(void);
+
long do_sys_ftruncate(unsigned int fd, loff_t length, int small);
int chmod_common(const struct path *path, umode_t mode);
int do_fchownat(int dfd, const char __user *filename, uid_t user, gid_t group,
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 43ba815e4107..896b7f92cfed 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -4069,9 +4069,6 @@ static int io_openat2(struct io_kiocb *req, bool force_nonblock)
struct file *file;
int ret;
- if (force_nonblock && !req->open.ignore_nonblock)
- return -EAGAIN;
-
ret = build_open_flags(&req->open.how, &op);
if (ret)
goto err;
@@ -4080,25 +4077,28 @@ static int io_openat2(struct io_kiocb *req, bool force_nonblock)
if (ret < 0)
goto err;
- file = do_filp_open(req->open.dfd, req->open.filename, &op);
- if (IS_ERR(file)) {
- put_unused_fd(ret);
- ret = PTR_ERR(file);
+ if (!force_nonblock) {
+ struct nameidata nd;
+
+ set_nameidata(&nd, req->open.dfd, req->open.filename);
+ file = path_openat(&nd, &op, op.lookup_flags | LOOKUP_RCU);
+ restore_nameidata();
+
/*
- * A work-around to ensure that /proc/self works that way
- * that it should - if we get -EOPNOTSUPP back, then assume
- * that proc_self_get_link() failed us because we're in async
- * context. We should be safe to retry this from the task
- * itself with force_nonblock == false set, as it should not
- * block on lookup. Would be nice to know this upfront and
- * avoid the async dance, but doesn't seem feasible.
+ * If RCU lookup fails, then we need to retry this from
+ * async context.
*/
- if (ret == -EOPNOTSUPP && io_wq_current_is_worker()) {
- req->open.ignore_nonblock = true;
- refcount_inc(&req->refs);
- io_req_task_queue(req);
- return 0;
+ if (file == ERR_PTR(-ECHILD)) {
+ put_unused_fd(ret);
+ return -EAGAIN;
}
+ } else {
+ file = do_filp_open(req->open.dfd, req->open.filename, &op);
+ }
+
+ if (IS_ERR(file)) {
+ put_unused_fd(ret);
+ ret = PTR_ERR(file);
} else {
fsnotify_open(file);
fd_install(ret, file);
diff --git a/fs/namei.c b/fs/namei.c
index 03d0e11e4f36..288fdae18221 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -498,32 +498,7 @@ void path_put(const struct path *path)
}
EXPORT_SYMBOL(path_put);
-#define EMBEDDED_LEVELS 2
-struct nameidata {
- struct path path;
- struct qstr last;
- struct path root;
- struct inode *inode; /* path.dentry.d_inode */
- unsigned int flags;
- unsigned seq, m_seq, r_seq;
- int last_type;
- unsigned depth;
- int total_link_count;
- struct saved {
- struct path link;
- struct delayed_call done;
- const char *name;
- unsigned seq;
- } *stack, internal[EMBEDDED_LEVELS];
- struct filename *name;
- struct nameidata *saved;
- unsigned root_seq;
- int dfd;
- kuid_t dir_uid;
- umode_t dir_mode;
-} __randomize_layout;
-
-static void set_nameidata(struct nameidata *p, int dfd, struct filename *name)
+void set_nameidata(struct nameidata *p, int dfd, struct filename *name)
{
struct nameidata *old = current->nameidata;
p->stack = p->internal;
@@ -534,7 +509,7 @@ static void set_nameidata(struct nameidata *p, int dfd, struct filename *name)
current->nameidata = p;
}
-static void restore_nameidata(void)
+void restore_nameidata(void)
{
struct nameidata *now = current->nameidata, *old = now->saved;
@@ -3346,8 +3321,8 @@ static int do_o_path(struct nameidata *nd, unsigned flags, struct file *file)
return error;
}
-static struct file *path_openat(struct nameidata *nd,
- const struct open_flags *op, unsigned flags)
+struct file *path_openat(struct nameidata *nd, const struct open_flags *op,
+ unsigned flags)
{
struct file *file;
int error;
--
Jens Axboe
^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [GIT PULL] io_uring fixes for 5.10-rc
2020-11-20 21:36 ` Jens Axboe
@ 2020-11-21 0:23 ` Linus Torvalds
2020-11-21 2:41 ` Jens Axboe
0 siblings, 1 reply; 17+ messages in thread
From: Linus Torvalds @ 2020-11-21 0:23 UTC (permalink / raw)
To: Jens Axboe, Al Viro; +Cc: io-uring, [email protected]
On Fri, Nov 20, 2020 at 1:36 PM Jens Axboe <[email protected]> wrote:
>
> I don't disagree with you on that. I've been a bit gun shy on touching
> the VFS side of things, but this one isn't too bad. I hacked up a patch
> that allows io_uring to do LOOKUP_RCU and a quick test seems to indicate
> it's fine. On top of that, we just propagate the error if we do fail and
> get rid of that odd retry loop.
Ok, this looks better to me (but is obviously not 5.10 material).
That said, I think I'd prefer to keep 'struct nameidata' internal to
just fs/namei.c, and maybe we can just expert that
struct nameidata nd;
set_nameidata(&nd, req->open.dfd, req->open.filename);
file = path_openat(&nd, &op, op.lookup_flags | LOOKUP_RCU);
restore_nameidata();
return filp == ERR_PTR(-ECHILD) ? -EAGAIN : filp;
as a helper from namei.c instead? Call it "do_filp_open_rcu()" or something?
That "force_nonblock" test seems a bit off, though. Why is that RCU
case only done when "!force_nonblock"? It would seem that if
force_nonblock is set, you want to do this too?
Al? You can see the background on lkml, but basically io_uring wants
to punt file open to a kernel thread, except if it can just be done
directly without blocking (which is pretty much that RCU lookup case).
And the thing that triggered this is that /proc/self/ can only be done
directly - not in a kernel thread. So the RCU case actually ends up
being interesting in that it would handl those things.
Linus
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [GIT PULL] io_uring fixes for 5.10-rc
2020-11-20 18:45 Jens Axboe
2020-11-20 20:02 ` Linus Torvalds
@ 2020-11-21 0:29 ` pr-tracker-bot
1 sibling, 0 replies; 17+ messages in thread
From: pr-tracker-bot @ 2020-11-21 0:29 UTC (permalink / raw)
To: Jens Axboe; +Cc: Linus Torvalds, io-uring, [email protected]
The pull request you sent on Fri, 20 Nov 2020 11:45:31 -0700:
> git://git.kernel.dk/linux-block.git tags/io_uring-5.10-2020-11-20
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/fa5fca78bb2fe7a58ae7297407dcda1914ea8353
Thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [GIT PULL] io_uring fixes for 5.10-rc
2020-11-21 0:23 ` Linus Torvalds
@ 2020-11-21 2:41 ` Jens Axboe
2020-11-21 3:00 ` Jens Axboe
0 siblings, 1 reply; 17+ messages in thread
From: Jens Axboe @ 2020-11-21 2:41 UTC (permalink / raw)
To: Linus Torvalds, Al Viro; +Cc: io-uring, [email protected]
On 11/20/20 5:23 PM, Linus Torvalds wrote:
> On Fri, Nov 20, 2020 at 1:36 PM Jens Axboe <[email protected]> wrote:
>>
>> I don't disagree with you on that. I've been a bit gun shy on touching
>> the VFS side of things, but this one isn't too bad. I hacked up a patch
>> that allows io_uring to do LOOKUP_RCU and a quick test seems to indicate
>> it's fine. On top of that, we just propagate the error if we do fail and
>> get rid of that odd retry loop.
>
> Ok, this looks better to me (but is obviously not 5.10 material).
>
> That said, I think I'd prefer to keep 'struct nameidata' internal to
> just fs/namei.c, and maybe we can just expert that
>
> struct nameidata nd;
>
> set_nameidata(&nd, req->open.dfd, req->open.filename);
> file = path_openat(&nd, &op, op.lookup_flags | LOOKUP_RCU);
> restore_nameidata();
> return filp == ERR_PTR(-ECHILD) ? -EAGAIN : filp;
>
> as a helper from namei.c instead? Call it "do_filp_open_rcu()" or something?
Yes, that's probably a better idea. I'll move in that direction.
> That "force_nonblock" test seems a bit off, though. Why is that RCU
> case only done when "!force_nonblock"? It would seem that if
> force_nonblock is set, you want to do this too?
Taking a second look at it, it's inverted. So if force_nonblock == true,
we want to do just the RCU lookup.
But I think the bit that you're missing is that the other case will do
the normal lookup, which does an RCU lookup first. It looks needs to
look like this:
if (force_nonblock)
file = do_filp_open_rcu();
else
file = do_filp_open();
--
Jens Axboe
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [GIT PULL] io_uring fixes for 5.10-rc
2020-11-21 2:41 ` Jens Axboe
@ 2020-11-21 3:00 ` Jens Axboe
2020-11-21 18:07 ` Linus Torvalds
0 siblings, 1 reply; 17+ messages in thread
From: Jens Axboe @ 2020-11-21 3:00 UTC (permalink / raw)
To: Linus Torvalds, Al Viro; +Cc: io-uring, [email protected]
On 11/20/20 7:41 PM, Jens Axboe wrote:
> On 11/20/20 5:23 PM, Linus Torvalds wrote:
>> On Fri, Nov 20, 2020 at 1:36 PM Jens Axboe <[email protected]> wrote:
>>>
>>> I don't disagree with you on that. I've been a bit gun shy on touching
>>> the VFS side of things, but this one isn't too bad. I hacked up a patch
>>> that allows io_uring to do LOOKUP_RCU and a quick test seems to indicate
>>> it's fine. On top of that, we just propagate the error if we do fail and
>>> get rid of that odd retry loop.
>>
>> Ok, this looks better to me (but is obviously not 5.10 material).
>>
>> That said, I think I'd prefer to keep 'struct nameidata' internal to
>> just fs/namei.c, and maybe we can just expert that
>>
>> struct nameidata nd;
>>
>> set_nameidata(&nd, req->open.dfd, req->open.filename);
>> file = path_openat(&nd, &op, op.lookup_flags | LOOKUP_RCU);
>> restore_nameidata();
>> return filp == ERR_PTR(-ECHILD) ? -EAGAIN : filp;
>>
>> as a helper from namei.c instead? Call it "do_filp_open_rcu()" or something?
>
> Yes, that's probably a better idea. I'll move in that direction.
Actually, I think we can do even better. How about just having
do_filp_open() exit after LOOKUP_RCU fails, if LOOKUP_RCU was already
set in the lookup flags? Then we don't need to change much else, and
most of it falls out naturally.
Except it seems that should work, except LOOKUP_RCU does not guarantee
that we're not going to do IO:
[ 20.463195] schedule+0x5f/0xd0
[ 20.463444] io_schedule+0x45/0x70
[ 20.463712] bit_wait_io+0x11/0x50
[ 20.463981] __wait_on_bit+0x2c/0x90
[ 20.464264] out_of_line_wait_on_bit+0x86/0x90
[ 20.464611] ? var_wake_function+0x30/0x30
[ 20.464932] __ext4_find_entry+0x2b5/0x410
[ 20.465254] ? d_alloc_parallel+0x241/0x4e0
[ 20.465581] ext4_lookup+0x51/0x1b0
[ 20.465855] ? __d_lookup+0x77/0x120
[ 20.466136] path_openat+0x4e8/0xe40
[ 20.466417] do_filp_open+0x79/0x100
[ 20.466720] ? __kernel_text_address+0x30/0x70
[ 20.467068] ? __alloc_fd+0xb3/0x150
[ 20.467349] io_openat2+0x65/0x210
[ 20.467618] io_issue_sqe+0x3e/0xf70
Which I'm actually pretty sure that I discovered before and attempted to
do a LOOKUP_NONBLOCK, which was kind of half assed and that Al
(rightfully) hated because of that.
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 43ba815e4107..9a0a21ac5227 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -4069,36 +4069,25 @@ static int io_openat2(struct io_kiocb *req, bool force_nonblock)
struct file *file;
int ret;
- if (force_nonblock && !req->open.ignore_nonblock)
- return -EAGAIN;
-
ret = build_open_flags(&req->open.how, &op);
if (ret)
goto err;
+ if (force_nonblock)
+ op.lookup_flags |= LOOKUP_RCU;
ret = __get_unused_fd_flags(req->open.how.flags, req->open.nofile);
if (ret < 0)
goto err;
file = do_filp_open(req->open.dfd, req->open.filename, &op);
+ if (force_nonblock && file == ERR_PTR(-ECHILD)) {
+ put_unused_fd(ret);
+ return -EAGAIN;
+ }
+
if (IS_ERR(file)) {
put_unused_fd(ret);
ret = PTR_ERR(file);
- /*
- * A work-around to ensure that /proc/self works that way
- * that it should - if we get -EOPNOTSUPP back, then assume
- * that proc_self_get_link() failed us because we're in async
- * context. We should be safe to retry this from the task
- * itself with force_nonblock == false set, as it should not
- * block on lookup. Would be nice to know this upfront and
- * avoid the async dance, but doesn't seem feasible.
- */
- if (ret == -EOPNOTSUPP && io_wq_current_is_worker()) {
- req->open.ignore_nonblock = true;
- refcount_inc(&req->refs);
- io_req_task_queue(req);
- return 0;
- }
} else {
fsnotify_open(file);
fd_install(ret, file);
diff --git a/fs/namei.c b/fs/namei.c
index 03d0e11e4f36..eb2c917986a5 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3394,8 +3394,11 @@ struct file *do_filp_open(int dfd, struct filename *pathname,
set_nameidata(&nd, dfd, pathname);
filp = path_openat(&nd, op, flags | LOOKUP_RCU);
- if (unlikely(filp == ERR_PTR(-ECHILD)))
+ if (unlikely(filp == ERR_PTR(-ECHILD))) {
+ if (flags & LOOKUP_RCU)
+ return filp;
filp = path_openat(&nd, op, flags);
+ }
if (unlikely(filp == ERR_PTR(-ESTALE)))
filp = path_openat(&nd, op, flags | LOOKUP_REVAL);
restore_nameidata();
--
Jens Axboe
^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [GIT PULL] io_uring fixes for 5.10-rc
2020-11-21 3:00 ` Jens Axboe
@ 2020-11-21 18:07 ` Linus Torvalds
2020-11-21 22:58 ` Jens Axboe
0 siblings, 1 reply; 17+ messages in thread
From: Linus Torvalds @ 2020-11-21 18:07 UTC (permalink / raw)
To: Jens Axboe; +Cc: Al Viro, io-uring, [email protected]
On Fri, Nov 20, 2020 at 7:00 PM Jens Axboe <[email protected]> wrote:
>
> Actually, I think we can do even better. How about just having
> do_filp_open() exit after LOOKUP_RCU fails, if LOOKUP_RCU was already
> set in the lookup flags? Then we don't need to change much else, and
> most of it falls out naturally.
So I was thinking doing the RCU lookup unconditionally, and then doing
the nn-RCU lookup if that fails afterwards.
But your patch looks good to me.
Except for the issue you noticed.
> Except it seems that should work, except LOOKUP_RCU does not guarantee
> that we're not going to do IO:
Well, almost nothing guarantees lack of IO, since allocations etc can
still block, but..
> [ 20.463195] schedule+0x5f/0xd0
> [ 20.463444] io_schedule+0x45/0x70
> [ 20.463712] bit_wait_io+0x11/0x50
> [ 20.463981] __wait_on_bit+0x2c/0x90
> [ 20.464264] out_of_line_wait_on_bit+0x86/0x90
> [ 20.464611] ? var_wake_function+0x30/0x30
> [ 20.464932] __ext4_find_entry+0x2b5/0x410
> [ 20.465254] ? d_alloc_parallel+0x241/0x4e0
> [ 20.465581] ext4_lookup+0x51/0x1b0
> [ 20.465855] ? __d_lookup+0x77/0x120
> [ 20.466136] path_openat+0x4e8/0xe40
> [ 20.466417] do_filp_open+0x79/0x100
Hmm. Is this perhaps an O_CREAT case? I think we only do the dcache
lookups under RCU, not the final path component creation.
And there are probably lots of other situations where we finish with
LOOKUP_RCU (with unlazy_walk()), and then continue.
Example: look at "may_lookup()" - if inode_permission() says "I can't
do this without blocking" the logic actually just tries to validate
the current state (that "unlazy_walk()" thing), and then continue
without RCU.
It obviously hasn't been about lockless semantics, it's been about
really being lockless. So LOOKUP_RCU has been a "try to do this
locklessly" rather than "you cannot take any locks".
I guess we would have to add a LOOKUP_NOBLOCK thing to actually then
say "if the RCU lookup fails, return -EAGAIN".
That's probably not a huge undertaking, but yeah, I didn't think of
it. I think this is a "we need to have Al tell us if it's reasonable".
Linus
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [GIT PULL] io_uring fixes for 5.10-rc
2020-11-21 18:07 ` Linus Torvalds
@ 2020-11-21 22:58 ` Jens Axboe
0 siblings, 0 replies; 17+ messages in thread
From: Jens Axboe @ 2020-11-21 22:58 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Al Viro, io-uring, [email protected]
On 11/21/20 11:07 AM, Linus Torvalds wrote:
> On Fri, Nov 20, 2020 at 7:00 PM Jens Axboe <[email protected]> wrote:
>>
>> Actually, I think we can do even better. How about just having
>> do_filp_open() exit after LOOKUP_RCU fails, if LOOKUP_RCU was already
>> set in the lookup flags? Then we don't need to change much else, and
>> most of it falls out naturally.
>
> So I was thinking doing the RCU lookup unconditionally, and then doing
> the nn-RCU lookup if that fails afterwards.
>
> But your patch looks good to me.
>
> Except for the issue you noticed.
After having taken a closer look, I think the saner approach is
LOOKUP_NONBLOCK instead of using LOOKUP_RCU which is used more as
a state than lookup flag. I'll try and hack something up that looks
passable.
>> Except it seems that should work, except LOOKUP_RCU does not guarantee
>> that we're not going to do IO:
>
> Well, almost nothing guarantees lack of IO, since allocations etc can
> still block, but..
Sure, and we can't always avoid that - but blatant block on waiting
for IO should be avoided.
>> [ 20.463195] schedule+0x5f/0xd0
>> [ 20.463444] io_schedule+0x45/0x70
>> [ 20.463712] bit_wait_io+0x11/0x50
>> [ 20.463981] __wait_on_bit+0x2c/0x90
>> [ 20.464264] out_of_line_wait_on_bit+0x86/0x90
>> [ 20.464611] ? var_wake_function+0x30/0x30
>> [ 20.464932] __ext4_find_entry+0x2b5/0x410
>> [ 20.465254] ? d_alloc_parallel+0x241/0x4e0
>> [ 20.465581] ext4_lookup+0x51/0x1b0
>> [ 20.465855] ? __d_lookup+0x77/0x120
>> [ 20.466136] path_openat+0x4e8/0xe40
>> [ 20.466417] do_filp_open+0x79/0x100
>
> Hmm. Is this perhaps an O_CREAT case? I think we only do the dcache
> lookups under RCU, not the final path component creation.
It's just a basic test that opens all files under a directory. So
no O_CREAT, it's all existing files. I think this is just a case of not
aborting early enough for LOOKUP_NONBLOCK, and we've obviously already
dropped LOOKUP_RCU (and done rcu_read_unlock() again) at this point.
> And there are probably lots of other situations where we finish with
> LOOKUP_RCU (with unlazy_walk()), and then continue.>
> Example: look at "may_lookup()" - if inode_permission() says "I can't
> do this without blocking" the logic actually just tries to validate
> the current state (that "unlazy_walk()" thing), and then continue
> without RCU.
>
> It obviously hasn't been about lockless semantics, it's been about
> really being lockless. So LOOKUP_RCU has been a "try to do this
> locklessly" rather than "you cannot take any locks".
>
> I guess we would have to add a LOOKUP_NOBLOCK thing to actually then
> say "if the RCU lookup fails, return -EAGAIN".
>
> That's probably not a huge undertaking, but yeah, I didn't think of
> it. I think this is a "we need to have Al tell us if it's reasonable".
Definitely. I did have a weak attempt at LOOKUP_NONBLOCK earlier, I'll
try and resurrect it and see what that leads to. Outside of just pure
lookup, the d_revalidate() was a bit interesting as it may block for
certain cases, but those should be (hopefully) detectable upfront.
--
Jens Axboe
^ permalink raw reply [flat|nested] 17+ messages in thread
* [GIT PULL] io_uring fixes for 5.10-rc
@ 2020-11-27 20:47 Jens Axboe
2020-11-27 21:21 ` pr-tracker-bot
0 siblings, 1 reply; 17+ messages in thread
From: Jens Axboe @ 2020-11-27 20:47 UTC (permalink / raw)
To: Linus Torvalds; +Cc: io-uring
Hi Linus,
- Out of bounds fix for the cq size cap from earlier this release
(Joseph)
- iov_iter type check fix (Pavel)
- Files grab + cancelation fix (Pavel)
Please pull!
The following changes since commit 418baf2c28f3473039f2f7377760bd8f6897ae18:
Linux 5.10-rc5 (2020-11-22 15:36:08 -0800)
are available in the Git repository at:
git://git.kernel.dk/linux-block.git tags/io_uring-5.10-2020-11-27
for you to fetch changes up to af60470347de6ac2b9f0cc3703975a543a3de075:
io_uring: fix files grab/cancel race (2020-11-26 08:50:21 -0700)
----------------------------------------------------------------
io_uring-5.10-2020-11-27
----------------------------------------------------------------
Joseph Qi (1):
io_uring: fix shift-out-of-bounds when round up cq size
Pavel Begunkov (2):
io_uring: fix ITER_BVEC check
io_uring: fix files grab/cancel race
fs/io_uring.c | 39 ++++++++++++++++++++-------------------
1 file changed, 20 insertions(+), 19 deletions(-)
--
Jens Axboe
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [GIT PULL] io_uring fixes for 5.10-rc
2020-11-27 20:47 [GIT PULL] io_uring fixes for 5.10-rc Jens Axboe
@ 2020-11-27 21:21 ` pr-tracker-bot
0 siblings, 0 replies; 17+ messages in thread
From: pr-tracker-bot @ 2020-11-27 21:21 UTC (permalink / raw)
To: Jens Axboe; +Cc: Linus Torvalds, io-uring
The pull request you sent on Fri, 27 Nov 2020 13:47:10 -0700:
> git://git.kernel.dk/linux-block.git tags/io_uring-5.10-2020-11-27
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/9223e74f9960778bd3edd39e15edd5532708b7fb
Thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html
^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2020-11-27 21:33 UTC | newest]
Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2020-11-27 20:47 [GIT PULL] io_uring fixes for 5.10-rc Jens Axboe
2020-11-27 21:21 ` pr-tracker-bot
-- strict thread matches above, loose matches on Subject: below --
2020-11-20 18:45 Jens Axboe
2020-11-20 20:02 ` Linus Torvalds
2020-11-20 21:36 ` Jens Axboe
2020-11-21 0:23 ` Linus Torvalds
2020-11-21 2:41 ` Jens Axboe
2020-11-21 3:00 ` Jens Axboe
2020-11-21 18:07 ` Linus Torvalds
2020-11-21 22:58 ` Jens Axboe
2020-11-21 0:29 ` pr-tracker-bot
2020-11-13 21:18 Jens Axboe
2020-11-14 0:15 ` pr-tracker-bot
2020-11-07 20:13 Jens Axboe
2020-11-07 22:08 ` pr-tracker-bot
2020-10-30 17:09 Jens Axboe
2020-10-30 22:10 ` pr-tracker-bot
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox