[GIT PULL] io_uring fixes for 5.10-rc

public inbox for [email protected]
 help / color / mirror / Atom feed

* [GIT PULL] io_uring fixes for 5.10-rc
@ 2020-11-20 18:45 Jens Axboe
  2020-11-20 20:02 ` Linus Torvalds
  2020-11-21  0:29 ` [GIT PULL] io_uring fixes for 5.10-rc pr-tracker-bot
  0 siblings, 2 replies; 12+ messages in thread
From: Jens Axboe @ 2020-11-20 18:45 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: io-uring, [email protected]

Hi Linus,

Mostly regression or stable fodder:

- Disallow async path resolution of /proc/self

- Tighten constraints for segmented async buffered reads

- Fix double completion for a retry error case

- Fix for fixed file life times (Pavel)

Please pull!


The following changes since commit 88ec3211e46344a7d10cf6cb5045f839f7785f8e:

  io_uring: round-up cq size before comparing with rounded sq size (2020-11-11 10:42:41 -0700)

are available in the Git repository at:

  git://git.kernel.dk/linux-block.git tags/io_uring-5.10-2020-11-20

for you to fetch changes up to e297822b20e7fe683e107aea46e6402adcf99c70:

  io_uring: order refnode recycling (2020-11-18 08:02:10 -0700)

----------------------------------------------------------------
io_uring-5.10-2020-11-20

----------------------------------------------------------------
Jens Axboe (4):
      proc: don't allow async path resolution of /proc/self components
      io_uring: handle -EOPNOTSUPP on path resolution
      mm: never attempt async page lock if we've transferred data already
      io_uring: don't double complete failed reissue request

Pavel Begunkov (2):
      io_uring: get an active ref_node from files_data
      io_uring: order refnode recycling

 fs/io_uring.c  | 57 ++++++++++++++++++++++++++++++++++++++++++---------------
 fs/proc/self.c |  7 +++++++
 mm/filemap.c   | 18 ++++++++++++++----
 3 files changed, 63 insertions(+), 19 deletions(-)

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [GIT PULL] io_uring fixes for 5.10-rc
  2020-11-20 18:45 [GIT PULL] io_uring fixes for 5.10-rc Jens Axboe
@ 2020-11-20 20:02 ` Linus Torvalds
  2020-11-20 21:36   ` Jens Axboe
  2020-11-21  0:29 ` [GIT PULL] io_uring fixes for 5.10-rc pr-tracker-bot
  1 sibling, 1 reply; 12+ messages in thread
From: Linus Torvalds @ 2020-11-20 20:02 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, [email protected]

On Fri, Nov 20, 2020 at 10:45 AM Jens Axboe <[email protected]> wrote:
> Jens Axboe (4):
>       proc: don't allow async path resolution of /proc/self components

This one is ok.

>       io_uring: handle -EOPNOTSUPP on path resolution

But this one smells. It talks about how it shouldn't block, but the
fact is, it can easily block when the path going through another
filesystem (think ".." to get to root before even hitting /proc/self,
but also think /proc/self/cwd/randompathgoeshere).

The whole concept seems entirely broken anyway. Why would you retry
the failure after doing it asynchronously? If it really doesn't block,
then it shouldn't have been done async in the first place.

IMNSHO, the openat logic is just wrong. And that "ignore_nonblock"
thing is a disgusting hack that is everything that is wrong with
io_uring. Stop doing these kinds of hacky things that will just cause
problems down the line.

I think the correct thing to do is to just start the open
synchronously with an RCU lookup, and if that fails, go to the async
one. And if the async one fails because it's /proc/self, then it just
fails. None of this kind of "it should be ok" stuff.

And that would likely be the faster model anyway - do it synchronously
and immediately for the easy cases.

And if it really is something like "/proc/self/cwd/randompathgoeshere"
that actually will block, maybe io_uring just shouldn't support it?

I've pulled this, but I really object to how io_uring keeps having
subtle bugs, and then they get worked around with this kind of hackery
which really smells like "this will be a subtle bug some time in the
future".

       Linus

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [GIT PULL] io_uring fixes for 5.10-rc
  2020-11-20 20:02 ` Linus Torvalds
@ 2020-11-20 21:36   ` Jens Axboe
  2020-11-21  0:23     ` Linus Torvalds
  0 siblings, 1 reply; 12+ messages in thread
From: Jens Axboe @ 2020-11-20 21:36 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: io-uring, [email protected]

On 11/20/20 1:02 PM, Linus Torvalds wrote:
> On Fri, Nov 20, 2020 at 10:45 AM Jens Axboe <[email protected]> wrote:
>> Jens Axboe (4):
>>       proc: don't allow async path resolution of /proc/self components
> 
> This one is ok.
> 
>>       io_uring: handle -EOPNOTSUPP on path resolution
> 
> But this one smells. It talks about how it shouldn't block, but the
> fact is, it can easily block when the path going through another
> filesystem (think ".." to get to root before even hitting /proc/self,
> but also think /proc/self/cwd/randompathgoeshere).
> 
> The whole concept seems entirely broken anyway. Why would you retry
> the failure after doing it asynchronously? If it really doesn't block,
> then it shouldn't have been done async in the first place.
> 
> IMNSHO, the openat logic is just wrong. And that "ignore_nonblock"
> thing is a disgusting hack that is everything that is wrong with
> io_uring. Stop doing these kinds of hacky things that will just cause
> problems down the line.
> 
> I think the correct thing to do is to just start the open
> synchronously with an RCU lookup, and if that fails, go to the async
> one. And if the async one fails because it's /proc/self, then it just
> fails. None of this kind of "it should be ok" stuff.
> 
> And that would likely be the faster model anyway - do it synchronously
> and immediately for the easy cases.
> 
> And if it really is something like "/proc/self/cwd/randompathgoeshere"
> that actually will block, maybe io_uring just shouldn't support it?
> 
> I've pulled this, but I really object to how io_uring keeps having
> subtle bugs, and then they get worked around with this kind of hackery
> which really smells like "this will be a subtle bug some time in the
> future".

I don't disagree with you on that. I've been a bit gun shy on touching
the VFS side of things, but this one isn't too bad. I hacked up a patch
that allows io_uring to do LOOKUP_RCU and a quick test seems to indicate
it's fine. On top of that, we just propagate the error if we do fail and
get rid of that odd retry loop.

And yes, it should be much better performance as well, for any sort of
cached lookup. There's a reason why we made the close side more
efficient like that, too.

Lightly tested patch below, needs to be split into 2 parts of course.
But the VFS side is just adding a few functions to fs/internal.h and the
struct nameidata structure, no other changes needed.


diff --git a/fs/internal.h b/fs/internal.h
index 6fd14ea213c3..e100d5bca42d 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -131,11 +131,41 @@ struct open_flags {
 };
 extern struct file *do_filp_open(int dfd, struct filename *pathname,
 		const struct open_flags *op);
+extern struct file *path_openat(struct nameidata *nd,
+		const struct open_flags *op, unsigned flags);
 extern struct file *do_file_open_root(struct dentry *, struct vfsmount *,
 		const char *, const struct open_flags *);
 extern struct open_how build_open_how(int flags, umode_t mode);
 extern int build_open_flags(const struct open_how *how, struct open_flags *op);
 
+#define EMBEDDED_LEVELS 2
+struct nameidata {
+	struct path	path;
+	struct qstr	last;
+	struct path	root;
+	struct inode	*inode; /* path.dentry.d_inode */
+	unsigned int	flags;
+	unsigned	seq, m_seq, r_seq;
+	int		last_type;
+	unsigned	depth;
+	int		total_link_count;
+	struct saved {
+		struct path link;
+		struct delayed_call done;
+		const char *name;
+		unsigned seq;
+	} *stack, internal[EMBEDDED_LEVELS];
+	struct filename	*name;
+	struct nameidata *saved;
+	unsigned	root_seq;
+	int		dfd;
+	kuid_t		dir_uid;
+	umode_t		dir_mode;
+} __randomize_layout;
+
+extern void set_nameidata(struct nameidata *p, int dfd, struct filename *name);
+extern void restore_nameidata(void);
+
 long do_sys_ftruncate(unsigned int fd, loff_t length, int small);
 int chmod_common(const struct path *path, umode_t mode);
 int do_fchownat(int dfd, const char __user *filename, uid_t user, gid_t group,
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 43ba815e4107..896b7f92cfed 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -4069,9 +4069,6 @@ static int io_openat2(struct io_kiocb *req, bool force_nonblock)
 	struct file *file;
 	int ret;
 
-	if (force_nonblock && !req->open.ignore_nonblock)
-		return -EAGAIN;
-
 	ret = build_open_flags(&req->open.how, &op);
 	if (ret)
 		goto err;
@@ -4080,25 +4077,28 @@ static int io_openat2(struct io_kiocb *req, bool force_nonblock)
 	if (ret < 0)
 		goto err;
 
-	file = do_filp_open(req->open.dfd, req->open.filename, &op);
-	if (IS_ERR(file)) {
-		put_unused_fd(ret);
-		ret = PTR_ERR(file);
+	if (!force_nonblock) {
+		struct nameidata nd;
+
+		set_nameidata(&nd, req->open.dfd, req->open.filename);
+		file = path_openat(&nd, &op, op.lookup_flags | LOOKUP_RCU);
+		restore_nameidata();
+
 		/*
-		 * A work-around to ensure that /proc/self works that way
-		 * that it should - if we get -EOPNOTSUPP back, then assume
-		 * that proc_self_get_link() failed us because we're in async
-		 * context. We should be safe to retry this from the task
-		 * itself with force_nonblock == false set, as it should not
-		 * block on lookup. Would be nice to know this upfront and
-		 * avoid the async dance, but doesn't seem feasible.
+		 * If RCU lookup fails, then we need to retry this from
+		 * async context.
 		 */
-		if (ret == -EOPNOTSUPP && io_wq_current_is_worker()) {
-			req->open.ignore_nonblock = true;
-			refcount_inc(&req->refs);
-			io_req_task_queue(req);
-			return 0;
+		if (file == ERR_PTR(-ECHILD)) {
+			put_unused_fd(ret);
+			return -EAGAIN;
 		}
+	} else {
+		file = do_filp_open(req->open.dfd, req->open.filename, &op);
+	}
+
+	if (IS_ERR(file)) {
+		put_unused_fd(ret);
+		ret = PTR_ERR(file);
 	} else {
 		fsnotify_open(file);
 		fd_install(ret, file);
diff --git a/fs/namei.c b/fs/namei.c
index 03d0e11e4f36..288fdae18221 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -498,32 +498,7 @@ void path_put(const struct path *path)
 }
 EXPORT_SYMBOL(path_put);
 
-#define EMBEDDED_LEVELS 2
-struct nameidata {
-	struct path	path;
-	struct qstr	last;
-	struct path	root;
-	struct inode	*inode; /* path.dentry.d_inode */
-	unsigned int	flags;
-	unsigned	seq, m_seq, r_seq;
-	int		last_type;
-	unsigned	depth;
-	int		total_link_count;
-	struct saved {
-		struct path link;
-		struct delayed_call done;
-		const char *name;
-		unsigned seq;
-	} *stack, internal[EMBEDDED_LEVELS];
-	struct filename	*name;
-	struct nameidata *saved;
-	unsigned	root_seq;
-	int		dfd;
-	kuid_t		dir_uid;
-	umode_t		dir_mode;
-} __randomize_layout;
-
-static void set_nameidata(struct nameidata *p, int dfd, struct filename *name)
+void set_nameidata(struct nameidata *p, int dfd, struct filename *name)
 {
 	struct nameidata *old = current->nameidata;
 	p->stack = p->internal;
@@ -534,7 +509,7 @@ static void set_nameidata(struct nameidata *p, int dfd, struct filename *name)
 	current->nameidata = p;
 }
 
-static void restore_nameidata(void)
+void restore_nameidata(void)
 {
 	struct nameidata *now = current->nameidata, *old = now->saved;
 
@@ -3346,8 +3321,8 @@ static int do_o_path(struct nameidata *nd, unsigned flags, struct file *file)
 	return error;
 }
 
-static struct file *path_openat(struct nameidata *nd,
-			const struct open_flags *op, unsigned flags)
+struct file *path_openat(struct nameidata *nd, const struct open_flags *op,
+			 unsigned flags)
 {
 	struct file *file;
 	int error;

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [GIT PULL] io_uring fixes for 5.10-rc
  2020-11-20 21:36   ` Jens Axboe
@ 2020-11-21  0:23     ` Linus Torvalds
  2020-11-21  2:41       ` Jens Axboe
  0 siblings, 1 reply; 12+ messages in thread
From: Linus Torvalds @ 2020-11-21  0:23 UTC (permalink / raw)
  To: Jens Axboe, Al Viro; +Cc: io-uring, [email protected]

On Fri, Nov 20, 2020 at 1:36 PM Jens Axboe <[email protected]> wrote:
>
> I don't disagree with you on that. I've been a bit gun shy on touching
> the VFS side of things, but this one isn't too bad. I hacked up a patch
> that allows io_uring to do LOOKUP_RCU and a quick test seems to indicate
> it's fine. On top of that, we just propagate the error if we do fail and
> get rid of that odd retry loop.

Ok, this looks better to me (but is obviously not 5.10 material).

That said, I think I'd prefer to keep 'struct nameidata' internal to
just fs/namei.c, and maybe we can just expert that

        struct nameidata nd;

        set_nameidata(&nd, req->open.dfd, req->open.filename);
        file = path_openat(&nd, &op, op.lookup_flags | LOOKUP_RCU);
        restore_nameidata();
        return filp == ERR_PTR(-ECHILD) ? -EAGAIN : filp;

as a helper from namei.c instead? Call it "do_filp_open_rcu()" or something?

That "force_nonblock" test seems a bit off, though. Why is that RCU
case only done when "!force_nonblock"? It would seem that if
force_nonblock is set, you want to do this too?

Al? You can see the background on lkml, but basically io_uring wants
to punt file open to a kernel thread, except if it can just be done
directly without blocking (which is pretty much that RCU lookup case).

And the thing that triggered this is that /proc/self/ can only be done
directly - not in a kernel thread. So the RCU case actually ends up
being interesting in that it would handl those things.

            Linus

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [GIT PULL] io_uring fixes for 5.10-rc
  2020-11-21  0:23     ` Linus Torvalds
@ 2020-11-21  2:41       ` Jens Axboe
  2020-11-21  3:00         ` Jens Axboe
  0 siblings, 1 reply; 12+ messages in thread
From: Jens Axboe @ 2020-11-21  2:41 UTC (permalink / raw)
  To: Linus Torvalds, Al Viro; +Cc: io-uring, [email protected]

On 11/20/20 5:23 PM, Linus Torvalds wrote:
> On Fri, Nov 20, 2020 at 1:36 PM Jens Axboe <[email protected]> wrote:
>>
>> I don't disagree with you on that. I've been a bit gun shy on touching
>> the VFS side of things, but this one isn't too bad. I hacked up a patch
>> that allows io_uring to do LOOKUP_RCU and a quick test seems to indicate
>> it's fine. On top of that, we just propagate the error if we do fail and
>> get rid of that odd retry loop.
> 
> Ok, this looks better to me (but is obviously not 5.10 material).
> 
> That said, I think I'd prefer to keep 'struct nameidata' internal to
> just fs/namei.c, and maybe we can just expert that
> 
>         struct nameidata nd;
> 
>         set_nameidata(&nd, req->open.dfd, req->open.filename);
>         file = path_openat(&nd, &op, op.lookup_flags | LOOKUP_RCU);
>         restore_nameidata();
>         return filp == ERR_PTR(-ECHILD) ? -EAGAIN : filp;
> 
> as a helper from namei.c instead? Call it "do_filp_open_rcu()" or something?

Yes, that's probably a better idea. I'll move in that direction.

> That "force_nonblock" test seems a bit off, though. Why is that RCU
> case only done when "!force_nonblock"? It would seem that if
> force_nonblock is set, you want to do this too?

Taking a second look at it, it's inverted. So if force_nonblock == true,
we want to do just the RCU lookup.

But I think the bit that you're missing is that the other case will do
the normal lookup, which does an RCU lookup first. It looks needs to
look like this:

if (force_nonblock)
	file = do_filp_open_rcu();
else
	file = do_filp_open();

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [GIT PULL] io_uring fixes for 5.10-rc
  2020-11-21  2:41       ` Jens Axboe
@ 2020-11-21  3:00         ` Jens Axboe
  2020-11-21 18:07           ` Linus Torvalds
  0 siblings, 1 reply; 12+ messages in thread
From: Jens Axboe @ 2020-11-21  3:00 UTC (permalink / raw)
  To: Linus Torvalds, Al Viro; +Cc: io-uring, [email protected]

On 11/20/20 7:41 PM, Jens Axboe wrote:
> On 11/20/20 5:23 PM, Linus Torvalds wrote:
>> On Fri, Nov 20, 2020 at 1:36 PM Jens Axboe <[email protected]> wrote:
>>>
>>> I don't disagree with you on that. I've been a bit gun shy on touching
>>> the VFS side of things, but this one isn't too bad. I hacked up a patch
>>> that allows io_uring to do LOOKUP_RCU and a quick test seems to indicate
>>> it's fine. On top of that, we just propagate the error if we do fail and
>>> get rid of that odd retry loop.
>>
>> Ok, this looks better to me (but is obviously not 5.10 material).
>>
>> That said, I think I'd prefer to keep 'struct nameidata' internal to
>> just fs/namei.c, and maybe we can just expert that
>>
>>         struct nameidata nd;
>>
>>         set_nameidata(&nd, req->open.dfd, req->open.filename);
>>         file = path_openat(&nd, &op, op.lookup_flags | LOOKUP_RCU);
>>         restore_nameidata();
>>         return filp == ERR_PTR(-ECHILD) ? -EAGAIN : filp;
>>
>> as a helper from namei.c instead? Call it "do_filp_open_rcu()" or something?
> 
> Yes, that's probably a better idea. I'll move in that direction.

Actually, I think we can do even better. How about just having
do_filp_open() exit after LOOKUP_RCU fails, if LOOKUP_RCU was already
set in the lookup flags? Then we don't need to change much else, and
most of it falls out naturally.

Except it seems that should work, except LOOKUP_RCU does not guarantee
that we're not going to do IO:

[   20.463195]  schedule+0x5f/0xd0
[   20.463444]  io_schedule+0x45/0x70
[   20.463712]  bit_wait_io+0x11/0x50
[   20.463981]  __wait_on_bit+0x2c/0x90
[   20.464264]  out_of_line_wait_on_bit+0x86/0x90
[   20.464611]  ? var_wake_function+0x30/0x30
[   20.464932]  __ext4_find_entry+0x2b5/0x410
[   20.465254]  ? d_alloc_parallel+0x241/0x4e0
[   20.465581]  ext4_lookup+0x51/0x1b0
[   20.465855]  ? __d_lookup+0x77/0x120
[   20.466136]  path_openat+0x4e8/0xe40
[   20.466417]  do_filp_open+0x79/0x100
[   20.466720]  ? __kernel_text_address+0x30/0x70
[   20.467068]  ? __alloc_fd+0xb3/0x150
[   20.467349]  io_openat2+0x65/0x210
[   20.467618]  io_issue_sqe+0x3e/0xf70

Which I'm actually pretty sure that I discovered before and attempted to
do a LOOKUP_NONBLOCK, which was kind of half assed and that Al
(rightfully) hated because of that.

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 43ba815e4107..9a0a21ac5227 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -4069,36 +4069,25 @@ static int io_openat2(struct io_kiocb *req, bool force_nonblock)
 	struct file *file;
 	int ret;
 
-	if (force_nonblock && !req->open.ignore_nonblock)
-		return -EAGAIN;
-
 	ret = build_open_flags(&req->open.how, &op);
 	if (ret)
 		goto err;
+	if (force_nonblock)
+		op.lookup_flags |= LOOKUP_RCU;
 
 	ret = __get_unused_fd_flags(req->open.how.flags, req->open.nofile);
 	if (ret < 0)
 		goto err;
 
 	file = do_filp_open(req->open.dfd, req->open.filename, &op);
+	if (force_nonblock && file == ERR_PTR(-ECHILD)) {
+		put_unused_fd(ret);
+		return -EAGAIN;
+	}
+
 	if (IS_ERR(file)) {
 		put_unused_fd(ret);
 		ret = PTR_ERR(file);
-		/*
-		 * A work-around to ensure that /proc/self works that way
-		 * that it should - if we get -EOPNOTSUPP back, then assume
-		 * that proc_self_get_link() failed us because we're in async
-		 * context. We should be safe to retry this from the task
-		 * itself with force_nonblock == false set, as it should not
-		 * block on lookup. Would be nice to know this upfront and
-		 * avoid the async dance, but doesn't seem feasible.
-		 */
-		if (ret == -EOPNOTSUPP && io_wq_current_is_worker()) {
-			req->open.ignore_nonblock = true;
-			refcount_inc(&req->refs);
-			io_req_task_queue(req);
-			return 0;
-		}
 	} else {
 		fsnotify_open(file);
 		fd_install(ret, file);
diff --git a/fs/namei.c b/fs/namei.c
index 03d0e11e4f36..eb2c917986a5 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3394,8 +3394,11 @@ struct file *do_filp_open(int dfd, struct filename *pathname,
 
 	set_nameidata(&nd, dfd, pathname);
 	filp = path_openat(&nd, op, flags | LOOKUP_RCU);
-	if (unlikely(filp == ERR_PTR(-ECHILD)))
+	if (unlikely(filp == ERR_PTR(-ECHILD))) {
+		if (flags & LOOKUP_RCU)
+			return filp;
 		filp = path_openat(&nd, op, flags);
+	}
 	if (unlikely(filp == ERR_PTR(-ESTALE)))
 		filp = path_openat(&nd, op, flags | LOOKUP_REVAL);
 	restore_nameidata();

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [GIT PULL] io_uring fixes for 5.10-rc
  2020-11-21  3:00         ` Jens Axboe
@ 2020-11-21 18:07           ` Linus Torvalds
  2020-11-21 22:58             ` Jens Axboe
  0 siblings, 1 reply; 12+ messages in thread
From: Linus Torvalds @ 2020-11-21 18:07 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Al Viro, io-uring, [email protected]

On Fri, Nov 20, 2020 at 7:00 PM Jens Axboe <[email protected]> wrote:
>
> Actually, I think we can do even better. How about just having
> do_filp_open() exit after LOOKUP_RCU fails, if LOOKUP_RCU was already
> set in the lookup flags? Then we don't need to change much else, and
> most of it falls out naturally.

So I was thinking doing the RCU lookup unconditionally, and then doing
the nn-RCU lookup if that fails afterwards.

But your patch looks good to me.

Except for the issue you noticed.

> Except it seems that should work, except LOOKUP_RCU does not guarantee
> that we're not going to do IO:

Well, almost nothing guarantees lack of IO, since allocations etc can
still block, but..

> [   20.463195]  schedule+0x5f/0xd0
> [   20.463444]  io_schedule+0x45/0x70
> [   20.463712]  bit_wait_io+0x11/0x50
> [   20.463981]  __wait_on_bit+0x2c/0x90
> [   20.464264]  out_of_line_wait_on_bit+0x86/0x90
> [   20.464611]  ? var_wake_function+0x30/0x30
> [   20.464932]  __ext4_find_entry+0x2b5/0x410
> [   20.465254]  ? d_alloc_parallel+0x241/0x4e0
> [   20.465581]  ext4_lookup+0x51/0x1b0
> [   20.465855]  ? __d_lookup+0x77/0x120
> [   20.466136]  path_openat+0x4e8/0xe40
> [   20.466417]  do_filp_open+0x79/0x100

Hmm. Is this perhaps an O_CREAT case? I think we only do the dcache
lookups under RCU, not the final path component creation.

And there are probably lots of other situations where we finish with
LOOKUP_RCU (with unlazy_walk()), and then continue.

Example: look at "may_lookup()" - if inode_permission() says "I can't
do this without blocking" the logic actually just tries to validate
the current state (that "unlazy_walk()" thing), and then continue
without RCU.

It obviously hasn't been about lockless semantics, it's been about
really being lockless. So LOOKUP_RCU has been a "try to do this
locklessly" rather than "you cannot take any locks".

I guess we would have to add a LOOKUP_NOBLOCK thing to actually then
say "if the RCU lookup fails, return -EAGAIN".

That's probably not a huge undertaking, but yeah, I didn't think of
it. I think this is a "we need to have Al tell us if it's reasonable".

                Linus

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [GIT PULL] io_uring fixes for 5.10-rc
  2020-11-21 18:07           ` Linus Torvalds
@ 2020-11-21 22:58             ` Jens Axboe
  2020-12-10 17:32               ` namei.c LOOKUP_NONBLOCK (was "Re: [GIT PULL] io_uring fixes for 5.10-rc") Jens Axboe
  0 siblings, 1 reply; 12+ messages in thread
From: Jens Axboe @ 2020-11-21 22:58 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Al Viro, io-uring, [email protected]

On 11/21/20 11:07 AM, Linus Torvalds wrote:
> On Fri, Nov 20, 2020 at 7:00 PM Jens Axboe <[email protected]> wrote:
>>
>> Actually, I think we can do even better. How about just having
>> do_filp_open() exit after LOOKUP_RCU fails, if LOOKUP_RCU was already
>> set in the lookup flags? Then we don't need to change much else, and
>> most of it falls out naturally.
> 
> So I was thinking doing the RCU lookup unconditionally, and then doing
> the nn-RCU lookup if that fails afterwards.
> 
> But your patch looks good to me.
> 
> Except for the issue you noticed.

After having taken a closer look, I think the saner approach is
LOOKUP_NONBLOCK instead of using LOOKUP_RCU which is used more as
a state than lookup flag. I'll try and hack something up that looks
passable.

>> Except it seems that should work, except LOOKUP_RCU does not guarantee
>> that we're not going to do IO:
> 
> Well, almost nothing guarantees lack of IO, since allocations etc can
> still block, but..

Sure, and we can't always avoid that - but blatant block on waiting
for IO should be avoided.

>> [   20.463195]  schedule+0x5f/0xd0
>> [   20.463444]  io_schedule+0x45/0x70
>> [   20.463712]  bit_wait_io+0x11/0x50
>> [   20.463981]  __wait_on_bit+0x2c/0x90
>> [   20.464264]  out_of_line_wait_on_bit+0x86/0x90
>> [   20.464611]  ? var_wake_function+0x30/0x30
>> [   20.464932]  __ext4_find_entry+0x2b5/0x410
>> [   20.465254]  ? d_alloc_parallel+0x241/0x4e0
>> [   20.465581]  ext4_lookup+0x51/0x1b0
>> [   20.465855]  ? __d_lookup+0x77/0x120
>> [   20.466136]  path_openat+0x4e8/0xe40
>> [   20.466417]  do_filp_open+0x79/0x100
> 
> Hmm. Is this perhaps an O_CREAT case? I think we only do the dcache
> lookups under RCU, not the final path component creation.

It's just a basic test that opens all files under a directory. So
no O_CREAT, it's all existing files. I think this is just a case of not
aborting early enough for LOOKUP_NONBLOCK, and we've obviously already
dropped LOOKUP_RCU (and done rcu_read_unlock() again) at this point.

> And there are probably lots of other situations where we finish with
> LOOKUP_RCU (with unlazy_walk()), and then continue.> 
> Example: look at "may_lookup()" - if inode_permission() says "I can't
> do this without blocking" the logic actually just tries to validate
> the current state (that "unlazy_walk()" thing), and then continue
> without RCU.
> 
> It obviously hasn't been about lockless semantics, it's been about
> really being lockless. So LOOKUP_RCU has been a "try to do this
> locklessly" rather than "you cannot take any locks".
> 
> I guess we would have to add a LOOKUP_NOBLOCK thing to actually then
> say "if the RCU lookup fails, return -EAGAIN".
> 
> That's probably not a huge undertaking, but yeah, I didn't think of
> it. I think this is a "we need to have Al tell us if it's reasonable".

Definitely. I did have a weak attempt at LOOKUP_NONBLOCK earlier, I'll
try and resurrect it and see what that leads to. Outside of just pure
lookup, the d_revalidate() was a bit interesting as it may block for
certain cases, but those should be (hopefully) detectable upfront.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 12+ messages in thread

* namei.c LOOKUP_NONBLOCK (was "Re: [GIT PULL] io_uring fixes for 5.10-rc")
  2020-11-21 22:58             ` Jens Axboe
@ 2020-12-10 17:32               ` Jens Axboe
  2020-12-10 18:55                 ` Linus Torvalds
  0 siblings, 1 reply; 12+ messages in thread
From: Jens Axboe @ 2020-12-10 17:32 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Al Viro, io-uring, [email protected]

On 11/21/20 3:58 PM, Jens Axboe wrote:
> On 11/21/20 11:07 AM, Linus Torvalds wrote:
>> On Fri, Nov 20, 2020 at 7:00 PM Jens Axboe <[email protected]> wrote:
>>>
>>> Actually, I think we can do even better. How about just having
>>> do_filp_open() exit after LOOKUP_RCU fails, if LOOKUP_RCU was already
>>> set in the lookup flags? Then we don't need to change much else, and
>>> most of it falls out naturally.
>>
>> So I was thinking doing the RCU lookup unconditionally, and then doing
>> the nn-RCU lookup if that fails afterwards.
>>
>> But your patch looks good to me.
>>
>> Except for the issue you noticed.
> 
> After having taken a closer look, I think the saner approach is
> LOOKUP_NONBLOCK instead of using LOOKUP_RCU which is used more as
> a state than lookup flag. I'll try and hack something up that looks
> passable.
> 
>>> Except it seems that should work, except LOOKUP_RCU does not guarantee
>>> that we're not going to do IO:
>>
>> Well, almost nothing guarantees lack of IO, since allocations etc can
>> still block, but..
> 
> Sure, and we can't always avoid that - but blatant block on waiting
> for IO should be avoided.
> 
>>> [   20.463195]  schedule+0x5f/0xd0
>>> [   20.463444]  io_schedule+0x45/0x70
>>> [   20.463712]  bit_wait_io+0x11/0x50
>>> [   20.463981]  __wait_on_bit+0x2c/0x90
>>> [   20.464264]  out_of_line_wait_on_bit+0x86/0x90
>>> [   20.464611]  ? var_wake_function+0x30/0x30
>>> [   20.464932]  __ext4_find_entry+0x2b5/0x410
>>> [   20.465254]  ? d_alloc_parallel+0x241/0x4e0
>>> [   20.465581]  ext4_lookup+0x51/0x1b0
>>> [   20.465855]  ? __d_lookup+0x77/0x120
>>> [   20.466136]  path_openat+0x4e8/0xe40
>>> [   20.466417]  do_filp_open+0x79/0x100
>>
>> Hmm. Is this perhaps an O_CREAT case? I think we only do the dcache
>> lookups under RCU, not the final path component creation.
> 
> It's just a basic test that opens all files under a directory. So
> no O_CREAT, it's all existing files. I think this is just a case of not
> aborting early enough for LOOKUP_NONBLOCK, and we've obviously already
> dropped LOOKUP_RCU (and done rcu_read_unlock() again) at this point.
> 
>> And there are probably lots of other situations where we finish with
>> LOOKUP_RCU (with unlazy_walk()), and then continue.> 
>> Example: look at "may_lookup()" - if inode_permission() says "I can't
>> do this without blocking" the logic actually just tries to validate
>> the current state (that "unlazy_walk()" thing), and then continue
>> without RCU.
>>
>> It obviously hasn't been about lockless semantics, it's been about
>> really being lockless. So LOOKUP_RCU has been a "try to do this
>> locklessly" rather than "you cannot take any locks".
>>
>> I guess we would have to add a LOOKUP_NOBLOCK thing to actually then
>> say "if the RCU lookup fails, return -EAGAIN".
>>
>> That's probably not a huge undertaking, but yeah, I didn't think of
>> it. I think this is a "we need to have Al tell us if it's reasonable".
> 
> Definitely. I did have a weak attempt at LOOKUP_NONBLOCK earlier, I'll
> try and resurrect it and see what that leads to. Outside of just pure
> lookup, the d_revalidate() was a bit interesting as it may block for
> certain cases, but those should be (hopefully) detectable upfront.

Here's a potentially better attempt - basically we allow LOOKUP_NONBLOCK
with LOOKUP_RCU, and if we end up dropping LOOKUP_RCU, then we generally
return -EAGAIN if LOOKUP_NONBLOCK is set as we can no longer guarantee
that we won't block.

Al, what do you think? I didn't include the io_uring part here, as
that's just dropping the existing hack and setting LOOKUP_NONBLOCK if
we're in task context. I can send it out as a separate patch, of course.

diff --git a/fs/namei.c b/fs/namei.c
index 03d0e11e4f36..303874f1b9f1 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -679,7 +679,7 @@ static bool legitimize_root(struct nameidata *nd)
  * Nothing should touch nameidata between unlazy_walk() failure and
  * terminate_walk().
  */
-static int unlazy_walk(struct nameidata *nd)
+static int __unlazy_walk(struct nameidata *nd)
 {
 	struct dentry *parent = nd->path.dentry;
 
@@ -704,6 +704,18 @@ static int unlazy_walk(struct nameidata *nd)
 	return -ECHILD;
 }
 
+static int unlazy_walk(struct nameidata *nd)
+{
+	int ret;
+
+	ret = __unlazy_walk(nd);
+	/* If caller is asking for NONBLOCK lookup, assume we can't satisfy it */
+	if (!ret && (nd->flags & LOOKUP_NONBLOCK))
+		ret = -EAGAIN;
+
+	return ret;
+}
+
 /**
  * unlazy_child - try to switch to ref-walk mode.
  * @nd: nameidata pathwalk data
@@ -764,10 +776,13 @@ static int unlazy_child(struct nameidata *nd, struct dentry *dentry, unsigned se
 
 static inline int d_revalidate(struct dentry *dentry, unsigned int flags)
 {
-	if (unlikely(dentry->d_flags & DCACHE_OP_REVALIDATE))
+	if (unlikely(dentry->d_flags & DCACHE_OP_REVALIDATE)) {
+		if ((flags & (LOOKUP_RCU | LOOKUP_NONBLOCK)) == LOOKUP_NONBLOCK)
+			return -EAGAIN;
 		return dentry->d_op->d_revalidate(dentry, flags);
-	else
-		return 1;
+	}
+
+	return 1;
 }
 
 /**
@@ -792,7 +807,7 @@ static int complete_walk(struct nameidata *nd)
 		 */
 		if (!(nd->flags & (LOOKUP_ROOT | LOOKUP_IS_SCOPED)))
 			nd->root.mnt = NULL;
-		if (unlikely(unlazy_walk(nd)))
+		if (unlikely(__unlazy_walk(nd)))
 			return -ECHILD;
 	}
 
@@ -1466,8 +1481,9 @@ static struct dentry *lookup_fast(struct nameidata *nd,
 		unsigned seq;
 		dentry = __d_lookup_rcu(parent, &nd->last, &seq);
 		if (unlikely(!dentry)) {
-			if (unlazy_walk(nd))
-				return ERR_PTR(-ECHILD);
+			int ret = unlazy_walk(nd);
+			if (ret)
+				return ERR_PTR(ret);
 			return NULL;
 		}
 
@@ -1569,8 +1585,9 @@ static inline int may_lookup(struct nameidata *nd)
 		int err = inode_permission(nd->inode, MAY_EXEC|MAY_NOT_BLOCK);
 		if (err != -ECHILD)
 			return err;
-		if (unlazy_walk(nd))
-			return -ECHILD;
+		err = unlazy_walk(nd);
+		if (err)
+			return err;
 	}
 	return inode_permission(nd->inode, MAY_EXEC);
 }
@@ -1591,9 +1608,11 @@ static int reserve_stack(struct nameidata *nd, struct path *link, unsigned seq)
 		// we need to grab link before we do unlazy.  And we can't skip
 		// unlazy even if we fail to grab the link - cleanup needs it
 		bool grabbed_link = legitimize_path(nd, link, seq);
+		int ret;
 
-		if (unlazy_walk(nd) != 0 || !grabbed_link)
-			return -ECHILD;
+		ret = unlazy_walk(nd);
+		if (ret && !grabbed_link)
+			return ret;
 
 		if (nd_alloc_stack(nd))
 			return 0;
@@ -1634,8 +1653,9 @@ static const char *pick_link(struct nameidata *nd, struct path *link,
 		touch_atime(&last->link);
 		cond_resched();
 	} else if (atime_needs_update(&last->link, inode)) {
-		if (unlikely(unlazy_walk(nd)))
-			return ERR_PTR(-ECHILD);
+		error = unlazy_walk(nd);
+		if (unlikely(error))
+			return ERR_PTR(error);
 		touch_atime(&last->link);
 	}
 
@@ -1652,8 +1672,9 @@ static const char *pick_link(struct nameidata *nd, struct path *link,
 		if (nd->flags & LOOKUP_RCU) {
 			res = get(NULL, inode, &last->done);
 			if (res == ERR_PTR(-ECHILD)) {
-				if (unlikely(unlazy_walk(nd)))
-					return ERR_PTR(-ECHILD);
+				error = unlazy_walk(nd);
+				if (unlikely(error))
+					return ERR_PTR(error);
 				res = get(link->dentry, inode, &last->done);
 			}
 		} else {
@@ -2193,8 +2214,9 @@ static int link_path_walk(const char *name, struct nameidata *nd)
 		}
 		if (unlikely(!d_can_lookup(nd->path.dentry))) {
 			if (nd->flags & LOOKUP_RCU) {
-				if (unlazy_walk(nd))
-					return -ECHILD;
+				err = unlazy_walk(nd);
+				if (err)
+					return err;
 			}
 			return -ENOTDIR;
 		}
@@ -3394,10 +3416,14 @@ struct file *do_filp_open(int dfd, struct filename *pathname,
 
 	set_nameidata(&nd, dfd, pathname);
 	filp = path_openat(&nd, op, flags | LOOKUP_RCU);
+	/* If we fail RCU lookup, assume NONBLOCK cannot be honored */
+	if (flags & LOOKUP_NONBLOCK)
+		goto out;
 	if (unlikely(filp == ERR_PTR(-ECHILD)))
 		filp = path_openat(&nd, op, flags);
 	if (unlikely(filp == ERR_PTR(-ESTALE)))
 		filp = path_openat(&nd, op, flags | LOOKUP_REVAL);
+out:
 	restore_nameidata();
 	return filp;
 }
diff --git a/include/linux/namei.h b/include/linux/namei.h
index a4bb992623c4..c36c4e0805fc 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -46,6 +46,7 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT};
 #define LOOKUP_NO_XDEV		0x040000 /* No mountpoint crossing. */
 #define LOOKUP_BENEATH		0x080000 /* No escaping from starting point. */
 #define LOOKUP_IN_ROOT		0x100000 /* Treat dirfd as fs root. */
+#define LOOKUP_NONBLOCK		0x200000 /* don't block for lookup */
 /* LOOKUP_* flags which do scope-related checks based on the dirfd. */
 #define LOOKUP_IS_SCOPED (LOOKUP_BENEATH | LOOKUP_IN_ROOT)
 

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: namei.c LOOKUP_NONBLOCK (was "Re: [GIT PULL] io_uring fixes for 5.10-rc")
  2020-12-10 17:32               ` namei.c LOOKUP_NONBLOCK (was "Re: [GIT PULL] io_uring fixes for 5.10-rc") Jens Axboe
@ 2020-12-10 18:55                 ` Linus Torvalds
  2020-12-10 19:21                   ` Jens Axboe
  0 siblings, 1 reply; 12+ messages in thread
From: Linus Torvalds @ 2020-12-10 18:55 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Al Viro, io-uring, [email protected]

On Thu, Dec 10, 2020 at 9:32 AM Jens Axboe <[email protected]> wrote:
>
> Here's a potentially better attempt - basically we allow LOOKUP_NONBLOCK
> with LOOKUP_RCU, and if we end up dropping LOOKUP_RCU, then we generally
> return -EAGAIN if LOOKUP_NONBLOCK is set as we can no longer guarantee
> that we won't block.

Looks sane to me.

I don't love the "__unlazy_walk vs unlazy_walk" naming - I think it
needs to be more clear about what the difference is, but I think the
basic patch looks sane, and looks about as big as I would have
expected it to be.

But yes, I'll leave it to Al.

And if we do this - and I think we should - I'd also love to see a new
flag in 'struct open_how' to openat2(), even if it's only to enable
tests. RESOLVE_NONBLOCK?

               Linus

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: namei.c LOOKUP_NONBLOCK (was "Re: [GIT PULL] io_uring fixes for 5.10-rc")
  2020-12-10 18:55                 ` Linus Torvalds
@ 2020-12-10 19:21                   ` Jens Axboe
  0 siblings, 0 replies; 12+ messages in thread
From: Jens Axboe @ 2020-12-10 19:21 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Al Viro, io-uring, [email protected]

On 12/10/20 11:55 AM, Linus Torvalds wrote:
> On Thu, Dec 10, 2020 at 9:32 AM Jens Axboe <[email protected]> wrote:
>>
>> Here's a potentially better attempt - basically we allow LOOKUP_NONBLOCK
>> with LOOKUP_RCU, and if we end up dropping LOOKUP_RCU, then we generally
>> return -EAGAIN if LOOKUP_NONBLOCK is set as we can no longer guarantee
>> that we won't block.
> 
> Looks sane to me.
> 
> I don't love the "__unlazy_walk vs unlazy_walk" naming - I think it
> needs to be more clear about what the difference is, but I think the
> basic patch looks sane, and looks about as big as I would have
> expected it to be.

Agree, would probably make more sense as __unlazy_walk ->
complete_walk_rcu(), which then also falls out naturally from
complete_walk() being the sole caller of that.

> But yes, I'll leave it to Al.
> 
> And if we do this - and I think we should - I'd also love to see a new
> flag in 'struct open_how' to openat2(), even if it's only to enable
> tests. RESOLVE_NONBLOCK?

Sure, enabling cached opens from userspace through regular openat2().
Let's wrap up this one first though, that needs to be a separate patch
anyway. I'll follow up with that once this is in.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [GIT PULL] io_uring fixes for 5.10-rc
  2020-11-20 18:45 [GIT PULL] io_uring fixes for 5.10-rc Jens Axboe
  2020-11-20 20:02 ` Linus Torvalds
@ 2020-11-21  0:29 ` pr-tracker-bot
  1 sibling, 0 replies; 12+ messages in thread
From: pr-tracker-bot @ 2020-11-21  0:29 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Linus Torvalds, io-uring, [email protected]

The pull request you sent on Fri, 20 Nov 2020 11:45:31 -0700:

> git://git.kernel.dk/linux-block.git tags/io_uring-5.10-2020-11-20

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/fa5fca78bb2fe7a58ae7297407dcda1914ea8353

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2020-12-10 19:22 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2020-11-20 18:45 [GIT PULL] io_uring fixes for 5.10-rc Jens Axboe
2020-11-20 20:02 ` Linus Torvalds
2020-11-20 21:36   ` Jens Axboe
2020-11-21  0:23     ` Linus Torvalds
2020-11-21  2:41       ` Jens Axboe
2020-11-21  3:00         ` Jens Axboe
2020-11-21 18:07           ` Linus Torvalds
2020-11-21 22:58             ` Jens Axboe
2020-12-10 17:32               ` namei.c LOOKUP_NONBLOCK (was "Re: [GIT PULL] io_uring fixes for 5.10-rc") Jens Axboe
2020-12-10 18:55                 ` Linus Torvalds
2020-12-10 19:21                   ` Jens Axboe
2020-11-21  0:29 ` [GIT PULL] io_uring fixes for 5.10-rc pr-tracker-bot

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox