* io_uring force_nonblock vs POSIX_FADV_WILLNEED
@ 2020-02-01 9:43 Andres Freund
2020-02-01 16:22 ` Jens Axboe
2020-02-03 6:40 ` Matthew Wilcox
0 siblings, 2 replies; 6+ messages in thread
From: Andres Freund @ 2020-02-01 9:43 UTC (permalink / raw)
To: Jens Axboe, io-uring, linux-fsdevel
Hi
Currently io_uring executes fadvise in submission context except for
DONTNEED:
static int io_fadvise(struct io_kiocb *req, struct io_kiocb **nxt,
bool force_nonblock)
{
...
/* DONTNEED may block, others _should_ not */
if (fa->advice == POSIX_FADV_DONTNEED && force_nonblock)
return -EAGAIN;
which makes sense for POSIX_FADV_{NORMAL, RANDOM, WILLNEED}, but doesn't
seem like it's true for POSIX_FADV_WILLNEED?
As far as I can tell POSIX_FADV_WILLNEED synchronously starts readahead,
including page allocation etc, which of course might trigger quite
blocking. The fs also quite possibly needs to read metadata.
Seems like either WILLNEED would have to always be deferred, or
force_page_cache_readahead, __do_page_cache_readahead would etc need to
be wired up to know not to block. Including returning EAGAIN, despite
force_page_cache_readahead and generic_readahead() intentially ignoring
return values / errors.
I guess it's also possible to just add a separate precheck that looks
whether there's any IO needing to be done for the range. That could
potentially also be used to make DONTNEED nonblocking in case everything
is clean already, which seems like it could be nice. But that seems
weird modularity wise.
Context: postgres has long used POSIX_FADV_WILLNEED to do a poor man's
version of async buffered reads, when it knows it needs to do a fair bit
of random reads that are already known (e.g. for bitmap heap scans).
Greetings,
Andres Freund
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: io_uring force_nonblock vs POSIX_FADV_WILLNEED
2020-02-01 9:43 io_uring force_nonblock vs POSIX_FADV_WILLNEED Andres Freund
@ 2020-02-01 16:22 ` Jens Axboe
2020-02-02 7:14 ` Andres Freund
2020-02-03 6:40 ` Matthew Wilcox
1 sibling, 1 reply; 6+ messages in thread
From: Jens Axboe @ 2020-02-01 16:22 UTC (permalink / raw)
To: Andres Freund, io-uring, linux-fsdevel
On 2/1/20 2:43 AM, Andres Freund wrote:
> Hi
>
> Currently io_uring executes fadvise in submission context except for
> DONTNEED:
>
> static int io_fadvise(struct io_kiocb *req, struct io_kiocb **nxt,
> bool force_nonblock)
> {
> ...
> /* DONTNEED may block, others _should_ not */
> if (fa->advice == POSIX_FADV_DONTNEED && force_nonblock)
> return -EAGAIN;
>
> which makes sense for POSIX_FADV_{NORMAL, RANDOM, WILLNEED}, but doesn't
> seem like it's true for POSIX_FADV_WILLNEED?
>
> As far as I can tell POSIX_FADV_WILLNEED synchronously starts readahead,
> including page allocation etc, which of course might trigger quite
> blocking. The fs also quite possibly needs to read metadata.
>
>
> Seems like either WILLNEED would have to always be deferred, or
> force_page_cache_readahead, __do_page_cache_readahead would etc need to
> be wired up to know not to block. Including returning EAGAIN, despite
> force_page_cache_readahead and generic_readahead() intentially ignoring
> return values / errors.
>
> I guess it's also possible to just add a separate precheck that looks
> whether there's any IO needing to be done for the range. That could
> potentially also be used to make DONTNEED nonblocking in case everything
> is clean already, which seems like it could be nice. But that seems
> weird modularity wise.
Good point, we can block on the read-ahead. Which is counter intuitive,
but true.
I'll queue up the below for now, better safe than sorry.
diff --git a/fs/io_uring.c b/fs/io_uring.c
index fb5c5b3e23f4..1464e4c9b04c 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2728,8 +2728,7 @@ static int io_fadvise(struct io_kiocb *req, struct io_kiocb **nxt,
struct io_fadvise *fa = &req->fadvise;
int ret;
- /* DONTNEED may block, others _should_ not */
- if (fa->advice == POSIX_FADV_DONTNEED && force_nonblock)
+ if (force_nonblock)
return -EAGAIN;
ret = vfs_fadvise(req->file, fa->offset, fa->len, fa->advice);
--
Jens Axboe
^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: io_uring force_nonblock vs POSIX_FADV_WILLNEED
2020-02-01 16:22 ` Jens Axboe
@ 2020-02-02 7:14 ` Andres Freund
2020-02-02 16:34 ` Jens Axboe
0 siblings, 1 reply; 6+ messages in thread
From: Andres Freund @ 2020-02-02 7:14 UTC (permalink / raw)
To: Jens Axboe; +Cc: io-uring, linux-fsdevel
Hi,
On 2020-02-01 09:22:45 -0700, Jens Axboe wrote:
> On 2/1/20 2:43 AM, Andres Freund wrote:
> > Seems like either WILLNEED would have to always be deferred, or
> > force_page_cache_readahead, __do_page_cache_readahead would etc need to
> > be wired up to know not to block. Including returning EAGAIN, despite
> > force_page_cache_readahead and generic_readahead() intentially ignoring
> > return values / errors.
> >
> > I guess it's also possible to just add a separate precheck that looks
> > whether there's any IO needing to be done for the range. That could
> > potentially also be used to make DONTNEED nonblocking in case everything
> > is clean already, which seems like it could be nice. But that seems
> > weird modularity wise.
>
> Good point, we can block on the read-ahead. Which is counter intuitive,
> but true.
> I'll queue up the below for now, better safe than sorry.
>
>
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index fb5c5b3e23f4..1464e4c9b04c 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -2728,8 +2728,7 @@ static int io_fadvise(struct io_kiocb *req, struct io_kiocb **nxt,
> struct io_fadvise *fa = &req->fadvise;
> int ret;
>
> - /* DONTNEED may block, others _should_ not */
> - if (fa->advice == POSIX_FADV_DONTNEED && force_nonblock)
> + if (force_nonblock)
> return -EAGAIN;
>
> ret = vfs_fadvise(req->file, fa->offset, fa->len, fa->advice);
Hm, that seems a bit broad. It seems fairly safe to leave
POSIX_FADV_{NORMAL,RANDOM,SEQUENTIAL} as sync. I guess there's there's
the argument that that's not something one does frequently enough to
care, but it's not hard to imagine wanting to change to RANDOM for a few
reads and then back to NORMAL.
Greetings,
Andres Freund
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: io_uring force_nonblock vs POSIX_FADV_WILLNEED
2020-02-02 7:14 ` Andres Freund
@ 2020-02-02 16:34 ` Jens Axboe
0 siblings, 0 replies; 6+ messages in thread
From: Jens Axboe @ 2020-02-02 16:34 UTC (permalink / raw)
To: Andres Freund; +Cc: io-uring, linux-fsdevel
On 2/2/20 12:14 AM, Andres Freund wrote:
> Hi,
>
> On 2020-02-01 09:22:45 -0700, Jens Axboe wrote:
>> On 2/1/20 2:43 AM, Andres Freund wrote:
>>> Seems like either WILLNEED would have to always be deferred, or
>>> force_page_cache_readahead, __do_page_cache_readahead would etc need to
>>> be wired up to know not to block. Including returning EAGAIN, despite
>>> force_page_cache_readahead and generic_readahead() intentially ignoring
>>> return values / errors.
>>>
>>> I guess it's also possible to just add a separate precheck that looks
>>> whether there's any IO needing to be done for the range. That could
>>> potentially also be used to make DONTNEED nonblocking in case everything
>>> is clean already, which seems like it could be nice. But that seems
>>> weird modularity wise.
>>
>> Good point, we can block on the read-ahead. Which is counter intuitive,
>> but true.
>
>> I'll queue up the below for now, better safe than sorry.
>>
>>
>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>> index fb5c5b3e23f4..1464e4c9b04c 100644
>> --- a/fs/io_uring.c
>> +++ b/fs/io_uring.c
>> @@ -2728,8 +2728,7 @@ static int io_fadvise(struct io_kiocb *req, struct io_kiocb **nxt,
>> struct io_fadvise *fa = &req->fadvise;
>> int ret;
>>
>> - /* DONTNEED may block, others _should_ not */
>> - if (fa->advice == POSIX_FADV_DONTNEED && force_nonblock)
>> + if (force_nonblock)
>> return -EAGAIN;
>>
>> ret = vfs_fadvise(req->file, fa->offset, fa->len, fa->advice);
>
> Hm, that seems a bit broad. It seems fairly safe to leave
> POSIX_FADV_{NORMAL,RANDOM,SEQUENTIAL} as sync. I guess there's there's
> the argument that that's not something one does frequently enough to
> care, but it's not hard to imagine wanting to change to RANDOM for a few
> reads and then back to NORMAL.
Yeah agree, not sure why I didn't cater to the normal cases. I'll
adjust.
--
Jens Axboe
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: io_uring force_nonblock vs POSIX_FADV_WILLNEED
2020-02-01 9:43 io_uring force_nonblock vs POSIX_FADV_WILLNEED Andres Freund
2020-02-01 16:22 ` Jens Axboe
@ 2020-02-03 6:40 ` Matthew Wilcox
2020-02-03 7:42 ` Andres Freund
1 sibling, 1 reply; 6+ messages in thread
From: Matthew Wilcox @ 2020-02-03 6:40 UTC (permalink / raw)
To: Andres Freund; +Cc: Jens Axboe, io-uring, linux-fsdevel
On Sat, Feb 01, 2020 at 01:43:09AM -0800, Andres Freund wrote:
> As far as I can tell POSIX_FADV_WILLNEED synchronously starts readahead,
> including page allocation etc, which of course might trigger quite
> blocking. The fs also quite possibly needs to read metadata.
>
>
> Seems like either WILLNEED would have to always be deferred, or
> force_page_cache_readahead, __do_page_cache_readahead would etc need to
> be wired up to know not to block. Including returning EAGAIN, despite
> force_page_cache_readahead and generic_readahead() intentially ignoring
> return values / errors.
The first step is going to be letting the readahead code know that it
should have this behaviour, which is tricky because the code flow looks
like this:
io_fadvise
vfs_fadvise
file->f_op->fadvise()
... and we'd be breaking brand new ground trying to add a gfp_t to a
file_operations method. Which is not to say it couldn't be done, but
would mean changing filesystems, just so we could pass the gfp
flags through from the top level to the low level. It wouldn't be
too bad; only two filesystems implement an ->fadvise op today.
Next possibility, we could add a POSIX_FADV_WILLNEED_ASYNC advice flag.
This would be kind of gnarly; look at XFS for example:
if (advice == POSIX_FADV_WILLNEED) {
lockflags = XFS_IOLOCK_SHARED;
xfs_ilock(ip, lockflags);
}
ret = generic_fadvise(file, start, end, advice);
if (lockflags)
xfs_iunlock(ip, lockflags);
so if there's some other filesystem which decides to start taking a lock
here and we miss it, it'll break when executing async.
Something I already want to see in an entirely different context is
a flag in the task_struct which says, essentially, "don't block in
memory allocations" -- ie behave as if __GFP_NOWAIT | __GFP_NOWARN
is set. See my proposal here:
https://lore.kernel.org/linux-mm/[email protected]/
(option 2)
You can see Kirill, Vlastimil and Michal are in favour of adding a
memalloc_nowait_*() API, and it would also save us here from having to
pass this information down the stack to force_page_cache_readahead()
and friends.
I've got my head stuck in the middle of the readahead code right now,
so this seems like a good time to add this functionality. Once I'm done
with finding out who broke my test VM, I'll take a shot at adding this.
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: io_uring force_nonblock vs POSIX_FADV_WILLNEED
2020-02-03 6:40 ` Matthew Wilcox
@ 2020-02-03 7:42 ` Andres Freund
0 siblings, 0 replies; 6+ messages in thread
From: Andres Freund @ 2020-02-03 7:42 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: Jens Axboe, io-uring, linux-fsdevel
Hi Matthew,
On 2020-02-02 22:40:47 -0800, Matthew Wilcox wrote:
> On Sat, Feb 01, 2020 at 01:43:09AM -0800, Andres Freund wrote:
> > As far as I can tell POSIX_FADV_WILLNEED synchronously starts readahead,
> > including page allocation etc, which of course might trigger quite
> > blocking. The fs also quite possibly needs to read metadata.
> >
> >
> > Seems like either WILLNEED would have to always be deferred, or
> > force_page_cache_readahead, __do_page_cache_readahead would etc need to
> > be wired up to know not to block. Including returning EAGAIN, despite
> > force_page_cache_readahead and generic_readahead() intentially ignoring
> > return values / errors.
>
> The first step is going to be letting the readahead code know that it
> should have this behaviour, which is tricky because the code flow looks
> like this:
>
> io_fadvise
> vfs_fadvise
> file->f_op->fadvise()
Yea.
> ... and we'd be breaking brand new ground trying to add a gfp_t to a
> file_operations method. Which is not to say it couldn't be done, but
> would mean changing filesystems, just so we could pass the gfp
> flags through from the top level to the low level. It wouldn't be
> too bad; only two filesystems implement an ->fadvise op today.
I was wondering if the right approach could be to pass through a kiocb
instead of gfp_t. There's obviously precedent for that in
file_operations, and then IOCB_NOWAIT could be used to represent the the
intent to not block. It'd be a bit weird in the sense that currently
there'd probably be no callback - but that seems fairly minor. And who
knows,
> Next possibility, we could add a POSIX_FADV_WILLNEED_ASYNC advice
> flag.
POSIX_FADV_DONTNEED has similar problems to POSIX_FADV_WILLNEED, so it'd
be nice to come up with an API change to vfs_fadvise that'd support
both. Obviously there also could be a POSIX_FADV_DONTNEED_ASYNC, but ...
> Something I already want to see in an entirely different context is
> a flag in the task_struct which says, essentially, "don't block in
> memory allocations" -- ie behave as if __GFP_NOWAIT | __GFP_NOWARN
> is set. See my proposal here:
I'm a bit out of my depth here: Would __GFP_NOWAIT actually be suitable
to indicate that no blocking IO is to be executed by the FS? E.g. for
metadata? As far as I can tell that's also a problem, not just reclaim
to make space for the to-be-read data.
> I've got my head stuck in the middle of the readahead code right now,
> so this seems like a good time to add this functionality. Once I'm done
> with finding out who broke my test VM, I'll take a shot at adding
> this.
Cool!
Greetings,
Andres Freund
^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2020-02-03 11:06 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2020-02-01 9:43 io_uring force_nonblock vs POSIX_FADV_WILLNEED Andres Freund
2020-02-01 16:22 ` Jens Axboe
2020-02-02 7:14 ` Andres Freund
2020-02-02 16:34 ` Jens Axboe
2020-02-03 6:40 ` Matthew Wilcox
2020-02-03 7:42 ` Andres Freund
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox