* [PATCH 01/15] block: provide plug based way of signaling forced no-wait semantics
2020-06-18 14:43 Jens Axboe
@ 2020-06-18 14:43 ` Jens Axboe
2020-06-18 14:43 ` [PATCH 02/15] io_uring: always plug for any number of IOs Jens Axboe
` (14 subsequent siblings)
15 siblings, 0 replies; 38+ messages in thread
From: Jens Axboe @ 2020-06-18 14:43 UTC (permalink / raw)
To: io-uring; +Cc: linux-fsdevel, linux-kernel, linux-mm, akpm, Jens Axboe
Provide a way for the caller to specify that IO should be marked
with REQ_NOWAIT to avoid blocking on allocation.
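As a usage sketch (illustrative only, assuming CONFIG_BLOCK and an
already prepared struct bio; the nowait field only exists with this
patch applied, and io_uring wires it up later in this series), a
submitter that must not block would do:
	struct blk_plug plug;
	blk_start_plug(&plug);
	plug.nowait = true;	/* bios issued under this plug get REQ_NOWAIT */
	submit_bio(bio);	/* generic_make_request_checks() applies the flag */
	blk_finish_plug(&plug);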
Signed-off-by: Jens Axboe <[email protected]>
---
block/blk-core.c | 6 ++++++
include/linux/blkdev.h | 1 +
2 files changed, 7 insertions(+)
diff --git a/block/blk-core.c b/block/blk-core.c
index 03252af8c82c..62a4904db921 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -958,6 +958,7 @@ generic_make_request_checks(struct bio *bio)
struct request_queue *q;
int nr_sectors = bio_sectors(bio);
blk_status_t status = BLK_STS_IOERR;
+ struct blk_plug *plug;
char b[BDEVNAME_SIZE];
might_sleep();
@@ -971,6 +972,10 @@ generic_make_request_checks(struct bio *bio)
goto end_io;
}
+ plug = blk_mq_plug(q, bio);
+ if (plug && plug->nowait)
+ bio->bi_opf |= REQ_NOWAIT;
+
/*
* For a REQ_NOWAIT based request, return -EOPNOTSUPP
* if queue is not a request based queue.
@@ -1800,6 +1805,7 @@ void blk_start_plug(struct blk_plug *plug)
INIT_LIST_HEAD(&plug->cb_list);
plug->rq_count = 0;
plug->multiple_queues = false;
+ plug->nowait = false;
/*
* Store ordering should not be needed here, since a potential
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 8fd900998b4e..6e067dca94cf 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1189,6 +1189,7 @@ struct blk_plug {
struct list_head cb_list; /* md requires an unplug callback */
unsigned short rq_count;
bool multiple_queues;
+ bool nowait;
};
#define BLK_MAX_REQUEST_COUNT 16
#define BLK_PLUG_FLUSH_SIZE (128 * 1024)
--
2.27.0
* [PATCH 02/15] io_uring: always plug for any number of IOs
2020-06-18 14:43 Jens Axboe
2020-06-18 14:43 ` [PATCH 01/15] block: provide plug based way of signaling forced no-wait semantics Jens Axboe
@ 2020-06-18 14:43 ` Jens Axboe
2020-06-18 14:43 ` [PATCH 03/15] io_uring: catch -EIO from buffered issue request failure Jens Axboe
` (13 subsequent siblings)
15 siblings, 0 replies; 38+ messages in thread
From: Jens Axboe @ 2020-06-18 14:43 UTC (permalink / raw)
To: io-uring; +Cc: linux-fsdevel, linux-kernel, linux-mm, akpm, Jens Axboe
Currently we only plug if we're doing more than two requests. We're going
to be relying on always having the plug there to pass down information,
so plug unconditionally.
Signed-off-by: Jens Axboe <[email protected]>
---
fs/io_uring.c | 15 +++++----------
1 file changed, 5 insertions(+), 10 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index b14a8e6a0e15..ca78dd7c79da 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -676,7 +676,6 @@ struct io_kiocb {
};
};
-#define IO_PLUG_THRESHOLD 2
#define IO_IOPOLL_BATCH 8
struct io_submit_state {
@@ -5914,7 +5913,7 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
struct file *ring_file, int ring_fd)
{
- struct io_submit_state state, *statep = NULL;
+ struct io_submit_state state;
struct io_kiocb *link = NULL;
int i, submitted = 0;
@@ -5931,10 +5930,7 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
if (!percpu_ref_tryget_many(&ctx->refs, nr))
return -EAGAIN;
- if (nr > IO_PLUG_THRESHOLD) {
- io_submit_state_start(&state, nr);
- statep = &state;
- }
+ io_submit_state_start(&state, nr);
ctx->ring_fd = ring_fd;
ctx->ring_file = ring_file;
@@ -5949,14 +5945,14 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
io_consume_sqe(ctx);
break;
}
- req = io_alloc_req(ctx, statep);
+ req = io_alloc_req(ctx, &state);
if (unlikely(!req)) {
if (!submitted)
submitted = -EAGAIN;
break;
}
- err = io_init_req(ctx, req, sqe, statep);
+ err = io_init_req(ctx, req, sqe, &state);
io_consume_sqe(ctx);
/* will complete beyond this point, count as submitted */
submitted++;
@@ -5982,8 +5978,7 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
}
if (link)
io_queue_link_head(link);
- if (statep)
- io_submit_state_end(&state);
+ io_submit_state_end(&state);
/* Commit SQ ring head once we've consumed and submitted all SQEs */
io_commit_sqring(ctx);
--
2.27.0
* [PATCH 03/15] io_uring: catch -EIO from buffered issue request failure
2020-06-18 14:43 Jens Axboe
2020-06-18 14:43 ` [PATCH 01/15] block: provide plug based way of signaling forced no-wait semantics Jens Axboe
2020-06-18 14:43 ` [PATCH 02/15] io_uring: always plug for any number of IOs Jens Axboe
@ 2020-06-18 14:43 ` Jens Axboe
2020-06-18 14:43 ` [PATCH 04/15] io_uring: re-issue block requests that failed because of resources Jens Axboe
` (12 subsequent siblings)
15 siblings, 0 replies; 38+ messages in thread
From: Jens Axboe @ 2020-06-18 14:43 UTC (permalink / raw)
To: io-uring; +Cc: linux-fsdevel, linux-kernel, linux-mm, akpm, Jens Axboe
-EIO bubbles up like -EAGAIN if we fail to allocate a request at the
lower level. Play it safe and treat it like -EAGAIN in terms of sync
retry, to avoid passing back an errant -EIO.
Catch some of these early for block-based files, as non-mq devices
generally do not support NOWAIT. That saves us some overhead by
not first trying, then retrying from async context. We can go straight
to async punt instead.
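For reference, the "straight to async punt" case then looks roughly like
this on the io_uring side (conceptual sketch only; io_file_supports_async()
is defined in the diff below, and the actual punt happens via the copy_iov
path in io_read()/io_write()):
	if (force_nonblock && !io_file_supports_async(req->file, rw))
		return -EAGAIN;	/* skip the NOWAIT attempt, queue for io-wq */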
Signed-off-by: Jens Axboe <[email protected]>
---
fs/io_uring.c | 28 +++++++++++++++++++++++-----
1 file changed, 23 insertions(+), 5 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index ca78dd7c79da..2e257c5a1866 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2088,6 +2088,15 @@ static struct file *__io_file_get(struct io_submit_state *state, int fd)
return state->file;
}
+static bool io_bdev_nowait(struct block_device *bdev)
+{
+#ifdef CONFIG_BLOCK
+ return !bdev || queue_is_mq(bdev_get_queue(bdev));
+#else
+ return true;
+#endif
+}
+
/*
* If we tracked the file through the SCM inflight mechanism, we could support
* any file. For now, just ensure that anything potentially problematic is done
@@ -2097,10 +2106,19 @@ static bool io_file_supports_async(struct file *file, int rw)
{
umode_t mode = file_inode(file)->i_mode;
- if (S_ISBLK(mode) || S_ISCHR(mode) || S_ISSOCK(mode))
- return true;
- if (S_ISREG(mode) && file->f_op != &io_uring_fops)
+ if (S_ISBLK(mode)) {
+ if (io_bdev_nowait(file->f_inode->i_bdev))
+ return true;
+ return false;
+ }
+ if (S_ISCHR(mode) || S_ISSOCK(mode))
return true;
+ if (S_ISREG(mode)) {
+ if (io_bdev_nowait(file->f_inode->i_sb->s_bdev) &&
+ file->f_op != &io_uring_fops)
+ return true;
+ return false;
+ }
/* any ->read/write should understand O_NONBLOCK */
if (file->f_flags & O_NONBLOCK)
@@ -2650,7 +2668,7 @@ static int io_read(struct io_kiocb *req, bool force_nonblock)
iov_count = iov_iter_count(&iter);
ret = rw_verify_area(READ, req->file, &kiocb->ki_pos, iov_count);
if (!ret) {
- ssize_t ret2;
+ ssize_t ret2 = 0;
if (req->file->f_op->read_iter)
ret2 = call_read_iter(req->file, kiocb, &iter);
@@ -2658,7 +2676,7 @@ static int io_read(struct io_kiocb *req, bool force_nonblock)
ret2 = loop_rw_iter(READ, req->file, kiocb, &iter);
/* Catch -EAGAIN return for forced non-blocking submission */
- if (!force_nonblock || ret2 != -EAGAIN) {
+ if (!force_nonblock || (ret2 != -EAGAIN && ret2 != -EIO)) {
kiocb_done(kiocb, ret2);
} else {
copy_iov:
--
2.27.0
* [PATCH 04/15] io_uring: re-issue block requests that failed because of resources
2020-06-18 14:43 Jens Axboe
` (2 preceding siblings ...)
2020-06-18 14:43 ` [PATCH 03/15] io_uring: catch -EIO from buffered issue request failure Jens Axboe
@ 2020-06-18 14:43 ` Jens Axboe
2020-06-19 14:12 ` Pavel Begunkov
2020-06-18 14:43 ` [PATCH 05/15] mm: allow read-ahead with IOCB_NOWAIT set Jens Axboe
` (11 subsequent siblings)
15 siblings, 1 reply; 38+ messages in thread
From: Jens Axboe @ 2020-06-18 14:43 UTC (permalink / raw)
To: io-uring; +Cc: linux-fsdevel, linux-kernel, linux-mm, akpm, Jens Axboe
Mark the plug with nowait == true, which will cause requests to avoid
blocking on request allocation. If they do, we catch them and reissue
them from a task_work based handler.
Normally we can catch -EAGAIN directly, but the hard case is for split
requests. As an example, the application issues a 512KB request. The
block core will split this into 128KB chunks if that's the max size for
the device. The first request issues just fine, but we run into -EAGAIN
for some later splits of the same request. As the bio is split, we don't
get to see the -EAGAIN until one of the actual reads completes, and hence
we cannot handle it inline as part of submission.
This does potentially cause re-reads of parts of the range, as the whole
request is reissued. There's currently no better way to handle this.
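To make the sequencing concrete, an illustrative walk-through (not part
of the patch; assumes the 128KB split size from the example above, and
the function names refer to the diff below):
	/*
	 * app submits a 512KB read     -> block core splits the bio 4 ways
	 * splits 1-2 allocate requests -> issue fine
	 * split 3 cannot allocate      -> fails fast with -EAGAIN thanks to
	 *                                 REQ_NOWAIT from the nowait plug;
	 *                                 submission has already returned
	 * parent bio completes         -> io_complete_rw() sees -EAGAIN and
	 *                                 io_rw_reissue() queues task_work
	 * task_work runs in the task   -> io_resubmit_prep() re-imports the
	 *                                 iovec and io_queue_async_work()
	 *                                 requeues the whole 512KB request
	 */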
Signed-off-by: Jens Axboe <[email protected]>
---
fs/io_uring.c | 148 ++++++++++++++++++++++++++++++++++++++++++--------
1 file changed, 124 insertions(+), 24 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 2e257c5a1866..40413fb9d07b 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -900,6 +900,13 @@ static int io_file_get(struct io_submit_state *state, struct io_kiocb *req,
static void __io_queue_sqe(struct io_kiocb *req,
const struct io_uring_sqe *sqe);
+static ssize_t io_import_iovec(int rw, struct io_kiocb *req,
+ struct iovec **iovec, struct iov_iter *iter,
+ bool needs_lock);
+static int io_setup_async_rw(struct io_kiocb *req, ssize_t io_size,
+ struct iovec *iovec, struct iovec *fast_iov,
+ struct iov_iter *iter);
+
static struct kmem_cache *req_cachep;
static const struct file_operations io_uring_fops;
@@ -1978,12 +1985,115 @@ static void io_complete_rw_common(struct kiocb *kiocb, long res)
__io_cqring_add_event(req, res, cflags);
}
+static void io_sq_thread_drop_mm(struct io_ring_ctx *ctx)
+{
+ struct mm_struct *mm = current->mm;
+
+ if (mm) {
+ kthread_unuse_mm(mm);
+ mmput(mm);
+ }
+}
+
+static int io_sq_thread_acquire_mm(struct io_ring_ctx *ctx,
+ struct io_kiocb *req)
+{
+ if (io_op_defs[req->opcode].needs_mm && !current->mm) {
+ if (unlikely(!mmget_not_zero(ctx->sqo_mm)))
+ return -EFAULT;
+ kthread_use_mm(ctx->sqo_mm);
+ }
+
+ return 0;
+}
+
+#ifdef CONFIG_BLOCK
+static bool io_resubmit_prep(struct io_kiocb *req, int error)
+{
+ struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
+ ssize_t ret = -ECANCELED;
+ struct iov_iter iter;
+ int rw;
+
+ if (error) {
+ ret = error;
+ goto end_req;
+ }
+
+ switch (req->opcode) {
+ case IORING_OP_READV:
+ case IORING_OP_READ_FIXED:
+ case IORING_OP_READ:
+ rw = READ;
+ break;
+ case IORING_OP_WRITEV:
+ case IORING_OP_WRITE_FIXED:
+ case IORING_OP_WRITE:
+ rw = WRITE;
+ break;
+ default:
+ printk_once(KERN_WARNING "io_uring: bad opcode in resubmit %d\n",
+ req->opcode);
+ goto end_req;
+ }
+
+ ret = io_import_iovec(rw, req, &iovec, &iter, false);
+ if (ret < 0)
+ goto end_req;
+ ret = io_setup_async_rw(req, ret, iovec, inline_vecs, &iter);
+ if (!ret)
+ return true;
+ kfree(iovec);
+end_req:
+ io_cqring_add_event(req, ret);
+ req_set_fail_links(req);
+ io_put_req(req);
+ return false;
+}
+
+static void io_rw_resubmit(struct callback_head *cb)
+{
+ struct io_kiocb *req = container_of(cb, struct io_kiocb, task_work);
+ struct io_ring_ctx *ctx = req->ctx;
+ int err;
+
+ __set_current_state(TASK_RUNNING);
+
+ err = io_sq_thread_acquire_mm(ctx, req);
+
+ if (io_resubmit_prep(req, err)) {
+ refcount_inc(&req->refs);
+ io_queue_async_work(req);
+ }
+}
+#endif
+
+static bool io_rw_reissue(struct io_kiocb *req, long res)
+{
+#ifdef CONFIG_BLOCK
+ struct task_struct *tsk;
+ int ret;
+
+ if ((res != -EAGAIN && res != -EOPNOTSUPP) || io_wq_current_is_worker())
+ return false;
+
+ tsk = req->task;
+ init_task_work(&req->task_work, io_rw_resubmit);
+ ret = task_work_add(tsk, &req->task_work, true);
+ if (!ret)
+ return true;
+#endif
+ return false;
+}
+
static void io_complete_rw(struct kiocb *kiocb, long res, long res2)
{
struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw.kiocb);
- io_complete_rw_common(kiocb, res);
- io_put_req(req);
+ if (!io_rw_reissue(req, res)) {
+ io_complete_rw_common(kiocb, res);
+ io_put_req(req);
+ }
}
static void io_complete_rw_iopoll(struct kiocb *kiocb, long res, long res2)
@@ -2169,6 +2279,9 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
if (kiocb->ki_flags & IOCB_NOWAIT)
req->flags |= REQ_F_NOWAIT;
+ if (kiocb->ki_flags & IOCB_DIRECT)
+ io_get_req_task(req);
+
if (force_nonblock)
kiocb->ki_flags |= IOCB_NOWAIT;
@@ -2668,6 +2781,7 @@ static int io_read(struct io_kiocb *req, bool force_nonblock)
iov_count = iov_iter_count(&iter);
ret = rw_verify_area(READ, req->file, &kiocb->ki_pos, iov_count);
if (!ret) {
+ unsigned long nr_segs = iter.nr_segs;
ssize_t ret2 = 0;
if (req->file->f_op->read_iter)
@@ -2679,6 +2793,8 @@ static int io_read(struct io_kiocb *req, bool force_nonblock)
if (!force_nonblock || (ret2 != -EAGAIN && ret2 != -EIO)) {
kiocb_done(kiocb, ret2);
} else {
+ iter.count = iov_count;
+ iter.nr_segs = nr_segs;
copy_iov:
ret = io_setup_async_rw(req, io_size, iovec,
inline_vecs, &iter);
@@ -2765,6 +2881,7 @@ static int io_write(struct io_kiocb *req, bool force_nonblock)
iov_count = iov_iter_count(&iter);
ret = rw_verify_area(WRITE, req->file, &kiocb->ki_pos, iov_count);
if (!ret) {
+ unsigned long nr_segs = iter.nr_segs;
ssize_t ret2;
/*
@@ -2802,6 +2919,8 @@ static int io_write(struct io_kiocb *req, bool force_nonblock)
if (!force_nonblock || ret2 != -EAGAIN) {
kiocb_done(kiocb, ret2);
} else {
+ iter.count = iov_count;
+ iter.nr_segs = nr_segs;
copy_iov:
ret = io_setup_async_rw(req, io_size, iovec,
inline_vecs, &iter);
@@ -4282,28 +4401,6 @@ static void io_async_queue_proc(struct file *file, struct wait_queue_head *head,
__io_queue_proc(&pt->req->apoll->poll, pt, head);
}
-static void io_sq_thread_drop_mm(struct io_ring_ctx *ctx)
-{
- struct mm_struct *mm = current->mm;
-
- if (mm) {
- kthread_unuse_mm(mm);
- mmput(mm);
- }
-}
-
-static int io_sq_thread_acquire_mm(struct io_ring_ctx *ctx,
- struct io_kiocb *req)
-{
- if (io_op_defs[req->opcode].needs_mm && !current->mm) {
- if (unlikely(!mmget_not_zero(ctx->sqo_mm)))
- return -EFAULT;
- kthread_use_mm(ctx->sqo_mm);
- }
-
- return 0;
-}
-
static void io_async_task_func(struct callback_head *cb)
{
struct io_kiocb *req = container_of(cb, struct io_kiocb, task_work);
@@ -5814,6 +5911,9 @@ static void io_submit_state_start(struct io_submit_state *state,
unsigned int max_ios)
{
blk_start_plug(&state->plug);
+#ifdef CONFIG_BLOCK
+ state->plug.nowait = true;
+#endif
state->free_reqs = 0;
state->file = NULL;
state->ios_left = max_ios;
--
2.27.0
* Re: [PATCH 04/15] io_uring: re-issue block requests that failed because of resources
2020-06-18 14:43 ` [PATCH 04/15] io_uring: re-issue block requests that failed because of resources Jens Axboe
@ 2020-06-19 14:12 ` Pavel Begunkov
2020-06-19 14:22 ` Jens Axboe
0 siblings, 1 reply; 38+ messages in thread
From: Pavel Begunkov @ 2020-06-19 14:12 UTC (permalink / raw)
To: io-uring, Jens Axboe; +Cc: linux-fsdevel, linux-kernel, linux-mm, akpm
On 18/06/2020 17:43, Jens Axboe wrote:
> Mark the plug with nowait == true, which will cause requests to avoid
> blocking on request allocation. If they do, we catch them and reissue
> them from a task_work based handler.
>
> Normally we can catch -EAGAIN directly, but the hard case is for split
> requests. As an example, the application issues a 512KB request. The
> block core will split this into 128KB if that's the max size for the
> device. The first request issues just fine, but we run into -EAGAIN for
> some latter splits for the same request. As the bio is split, we don't
> get to see the -EAGAIN until one of the actual reads complete, and hence
> we cannot handle it inline as part of submission.
>
> This does potentially cause re-reads of parts of the range, as the whole
> request is reissued. There's currently no better way to handle this.
>
> Signed-off-by: Jens Axboe <[email protected]>
> ---
> fs/io_uring.c | 148 ++++++++++++++++++++++++++++++++++++++++++--------
> 1 file changed, 124 insertions(+), 24 deletions(-)
>
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 2e257c5a1866..40413fb9d07b 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -900,6 +900,13 @@ static int io_file_get(struct io_submit_state *state, struct io_kiocb *req,
> static void __io_queue_sqe(struct io_kiocb *req,
> const struct io_uring_sqe *sqe);
>
...> +
> +static void io_rw_resubmit(struct callback_head *cb)
> +{
> + struct io_kiocb *req = container_of(cb, struct io_kiocb, task_work);
> + struct io_ring_ctx *ctx = req->ctx;
> + int err;
> +
> + __set_current_state(TASK_RUNNING);
> +
> + err = io_sq_thread_acquire_mm(ctx, req);
> +
> + if (io_resubmit_prep(req, err)) {
> + refcount_inc(&req->refs);
> + io_queue_async_work(req);
> + }
Hmm, I have similar stuff but for iopoll. On top removing grab_env* for
linked reqs and some extra. I think I'll rebase on top of this.
> +}
> +#endif
> +
> +static bool io_rw_reissue(struct io_kiocb *req, long res)
> +{
> +#ifdef CONFIG_BLOCK
> + struct task_struct *tsk;
> + int ret;
> +
> + if ((res != -EAGAIN && res != -EOPNOTSUPP) || io_wq_current_is_worker())
> + return false;
> +
> + tsk = req->task;
> + init_task_work(&req->task_work, io_rw_resubmit);
> + ret = task_work_add(tsk, &req->task_work, true);
I don't like that the request becomes un-discoverable for cancellation
while sitting in the task_work list. The poll stuff at least has a
hash_node for that.
> + if (!ret)
> + return true;
> +#endif
> + return false;
> +}
> +
--
Pavel Begunkov
* Re: [PATCH 04/15] io_uring: re-issue block requests that failed because of resources
2020-06-19 14:12 ` Pavel Begunkov
@ 2020-06-19 14:22 ` Jens Axboe
2020-06-19 14:30 ` Pavel Begunkov
0 siblings, 1 reply; 38+ messages in thread
From: Jens Axboe @ 2020-06-19 14:22 UTC (permalink / raw)
To: Pavel Begunkov, io-uring; +Cc: linux-fsdevel, linux-kernel, linux-mm, akpm
On 6/19/20 8:12 AM, Pavel Begunkov wrote:
> On 18/06/2020 17:43, Jens Axboe wrote:
>> Mark the plug with nowait == true, which will cause requests to avoid
>> blocking on request allocation. If they do, we catch them and reissue
>> them from a task_work based handler.
>>
>> Normally we can catch -EAGAIN directly, but the hard case is for split
>> requests. As an example, the application issues a 512KB request. The
>> block core will split this into 128KB if that's the max size for the
>> device. The first request issues just fine, but we run into -EAGAIN for
>> some latter splits for the same request. As the bio is split, we don't
>> get to see the -EAGAIN until one of the actual reads complete, and hence
>> we cannot handle it inline as part of submission.
>>
>> This does potentially cause re-reads of parts of the range, as the whole
>> request is reissued. There's currently no better way to handle this.
>>
>> Signed-off-by: Jens Axboe <[email protected]>
>> ---
>> fs/io_uring.c | 148 ++++++++++++++++++++++++++++++++++++++++++--------
>> 1 file changed, 124 insertions(+), 24 deletions(-)
>>
>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>> index 2e257c5a1866..40413fb9d07b 100644
>> --- a/fs/io_uring.c
>> +++ b/fs/io_uring.c
>> @@ -900,6 +900,13 @@ static int io_file_get(struct io_submit_state *state, struct io_kiocb *req,
>> static void __io_queue_sqe(struct io_kiocb *req,
>> const struct io_uring_sqe *sqe);
>>
> ...> +
>> +static void io_rw_resubmit(struct callback_head *cb)
>> +{
>> + struct io_kiocb *req = container_of(cb, struct io_kiocb, task_work);
>> + struct io_ring_ctx *ctx = req->ctx;
>> + int err;
>> +
>> + __set_current_state(TASK_RUNNING);
>> +
>> + err = io_sq_thread_acquire_mm(ctx, req);
>> +
>> + if (io_resubmit_prep(req, err)) {
>> + refcount_inc(&req->refs);
>> + io_queue_async_work(req);
>> + }
>
> Hmm, I have similar stuff but for iopoll. On top removing grab_env* for
> linked reqs and some extra. I think I'll rebase on top of this.
Yes, there's certainly overlap there. I consider this series basically
wrapped up, so feel free to just base on top of it.
>> +static bool io_rw_reissue(struct io_kiocb *req, long res)
>> +{
>> +#ifdef CONFIG_BLOCK
>> + struct task_struct *tsk;
>> + int ret;
>> +
>> + if ((res != -EAGAIN && res != -EOPNOTSUPP) || io_wq_current_is_worker())
>> + return false;
>> +
>> + tsk = req->task;
>> + init_task_work(&req->task_work, io_rw_resubmit);
>> + ret = task_work_add(tsk, &req->task_work, true);
>
> I don't like that the request becomes un-discoverable for cancellation
> awhile sitting in the task_work list. Poll stuff at least have hash_node
> for that.
Async buffered IO was never cancelable, so it doesn't really matter.
It's tied to the task, so we know it'll get executed - either run, or
canceled if the task is going away. This is really not that different
from having the work discoverable through io-wq queueing before, since
the latter could never be canceled anyway as it sits there
uninterruptibly waiting for IO completion.
--
Jens Axboe
* Re: [PATCH 04/15] io_uring: re-issue block requests that failed because of resources
2020-06-19 14:22 ` Jens Axboe
@ 2020-06-19 14:30 ` Pavel Begunkov
2020-06-19 14:36 ` Jens Axboe
0 siblings, 1 reply; 38+ messages in thread
From: Pavel Begunkov @ 2020-06-19 14:30 UTC (permalink / raw)
To: Jens Axboe, io-uring; +Cc: linux-fsdevel, linux-kernel, linux-mm, akpm
On 19/06/2020 17:22, Jens Axboe wrote:
> On 6/19/20 8:12 AM, Pavel Begunkov wrote:
>> On 18/06/2020 17:43, Jens Axboe wrote:
>>> Mark the plug with nowait == true, which will cause requests to avoid
>>> blocking on request allocation. If they do, we catch them and reissue
>>> them from a task_work based handler.
>>>
>>> Normally we can catch -EAGAIN directly, but the hard case is for split
>>> requests. As an example, the application issues a 512KB request. The
>>> block core will split this into 128KB if that's the max size for the
>>> device. The first request issues just fine, but we run into -EAGAIN for
>>> some latter splits for the same request. As the bio is split, we don't
>>> get to see the -EAGAIN until one of the actual reads complete, and hence
>>> we cannot handle it inline as part of submission.
>>>
>>> This does potentially cause re-reads of parts of the range, as the whole
>>> request is reissued. There's currently no better way to handle this.
>>>
>>> Signed-off-by: Jens Axboe <[email protected]>
>>> ---
>>> fs/io_uring.c | 148 ++++++++++++++++++++++++++++++++++++++++++--------
>>> 1 file changed, 124 insertions(+), 24 deletions(-)
>>>
>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>> index 2e257c5a1866..40413fb9d07b 100644
>>> --- a/fs/io_uring.c
>>> +++ b/fs/io_uring.c
>>> @@ -900,6 +900,13 @@ static int io_file_get(struct io_submit_state *state, struct io_kiocb *req,
>>> static void __io_queue_sqe(struct io_kiocb *req,
>>> const struct io_uring_sqe *sqe);
>>>
>> ...> +
>>> +static void io_rw_resubmit(struct callback_head *cb)
>>> +{
>>> + struct io_kiocb *req = container_of(cb, struct io_kiocb, task_work);
>>> + struct io_ring_ctx *ctx = req->ctx;
>>> + int err;
>>> +
>>> + __set_current_state(TASK_RUNNING);
>>> +
>>> + err = io_sq_thread_acquire_mm(ctx, req);
>>> +
>>> + if (io_resubmit_prep(req, err)) {
>>> + refcount_inc(&req->refs);
>>> + io_queue_async_work(req);
>>> + }
>>
>> Hmm, I have similar stuff but for iopoll. On top removing grab_env* for
>> linked reqs and some extra. I think I'll rebase on top of this.
>
> Yes, there's certainly overlap there. I consider this series basically
> wrapped up, so feel free to just base on top of it.
>
>>> +static bool io_rw_reissue(struct io_kiocb *req, long res)
>>> +{
>>> +#ifdef CONFIG_BLOCK
>>> + struct task_struct *tsk;
>>> + int ret;
>>> +
>>> + if ((res != -EAGAIN && res != -EOPNOTSUPP) || io_wq_current_is_worker())
>>> + return false;
>>> +
>>> + tsk = req->task;
>>> + init_task_work(&req->task_work, io_rw_resubmit);
>>> + ret = task_work_add(tsk, &req->task_work, true);
>>
>> I don't like that the request becomes un-discoverable for cancellation
>> awhile sitting in the task_work list. Poll stuff at least have hash_node
>> for that.
>
> Async buffered IO was never cancelable, so it doesn't really matter.
> It's tied to the task, so we know it'll get executed - either run, or
> canceled if the task is going away. This is really not that different
> from having the work discoverable through io-wq queueing before, since
> the latter could never be canceled anyway as it sits there
> uninterruptibly waiting for IO completion.
Makes sense. I was thinking about using this task-requeue approach for
all kinds of requests. Though, instead of speculating, it'd be better for
me to turn the ideas into patches and see.
--
Pavel Begunkov
* Re: [PATCH 04/15] io_uring: re-issue block requests that failed because of resources
2020-06-19 14:30 ` Pavel Begunkov
@ 2020-06-19 14:36 ` Jens Axboe
0 siblings, 0 replies; 38+ messages in thread
From: Jens Axboe @ 2020-06-19 14:36 UTC (permalink / raw)
To: Pavel Begunkov, io-uring; +Cc: linux-fsdevel, linux-kernel, linux-mm, akpm
On 6/19/20 8:30 AM, Pavel Begunkov wrote:
> On 19/06/2020 17:22, Jens Axboe wrote:
>> On 6/19/20 8:12 AM, Pavel Begunkov wrote:
>>> On 18/06/2020 17:43, Jens Axboe wrote:
>>>> Mark the plug with nowait == true, which will cause requests to avoid
>>>> blocking on request allocation. If they do, we catch them and reissue
>>>> them from a task_work based handler.
>>>>
>>>> Normally we can catch -EAGAIN directly, but the hard case is for split
>>>> requests. As an example, the application issues a 512KB request. The
>>>> block core will split this into 128KB if that's the max size for the
>>>> device. The first request issues just fine, but we run into -EAGAIN for
>>>> some latter splits for the same request. As the bio is split, we don't
>>>> get to see the -EAGAIN until one of the actual reads complete, and hence
>>>> we cannot handle it inline as part of submission.
>>>>
>>>> This does potentially cause re-reads of parts of the range, as the whole
>>>> request is reissued. There's currently no better way to handle this.
>>>>
>>>> Signed-off-by: Jens Axboe <[email protected]>
>>>> ---
>>>> fs/io_uring.c | 148 ++++++++++++++++++++++++++++++++++++++++++--------
>>>> 1 file changed, 124 insertions(+), 24 deletions(-)
>>>>
>>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>>> index 2e257c5a1866..40413fb9d07b 100644
>>>> --- a/fs/io_uring.c
>>>> +++ b/fs/io_uring.c
>>>> @@ -900,6 +900,13 @@ static int io_file_get(struct io_submit_state *state, struct io_kiocb *req,
>>>> static void __io_queue_sqe(struct io_kiocb *req,
>>>> const struct io_uring_sqe *sqe);
>>>>
>>> ...> +
>>>> +static void io_rw_resubmit(struct callback_head *cb)
>>>> +{
>>>> + struct io_kiocb *req = container_of(cb, struct io_kiocb, task_work);
>>>> + struct io_ring_ctx *ctx = req->ctx;
>>>> + int err;
>>>> +
>>>> + __set_current_state(TASK_RUNNING);
>>>> +
>>>> + err = io_sq_thread_acquire_mm(ctx, req);
>>>> +
>>>> + if (io_resubmit_prep(req, err)) {
>>>> + refcount_inc(&req->refs);
>>>> + io_queue_async_work(req);
>>>> + }
>>>
>>> Hmm, I have similar stuff but for iopoll. On top removing grab_env* for
>>> linked reqs and some extra. I think I'll rebase on top of this.
>>
>> Yes, there's certainly overlap there. I consider this series basically
>> wrapped up, so feel free to just base on top of it.
>>
>>>> +static bool io_rw_reissue(struct io_kiocb *req, long res)
>>>> +{
>>>> +#ifdef CONFIG_BLOCK
>>>> + struct task_struct *tsk;
>>>> + int ret;
>>>> +
>>>> + if ((res != -EAGAIN && res != -EOPNOTSUPP) || io_wq_current_is_worker())
>>>> + return false;
>>>> +
>>>> + tsk = req->task;
>>>> + init_task_work(&req->task_work, io_rw_resubmit);
>>>> + ret = task_work_add(tsk, &req->task_work, true);
>>>
>>> I don't like that the request becomes un-discoverable for cancellation
>>> awhile sitting in the task_work list. Poll stuff at least have hash_node
>>> for that.
>>
>> Async buffered IO was never cancelable, so it doesn't really matter.
>> It's tied to the task, so we know it'll get executed - either run, or
>> canceled if the task is going away. This is really not that different
>> from having the work discoverable through io-wq queueing before, since
>> the latter could never be canceled anyway as it sits there
>> uninterruptibly waiting for IO completion.
>
> Makes sense. I was thinking about using this task-requeue for all kinds
> of requests. Though, instead of speculating it'd be better for me to embody
> ideas into patches and see.
And that's fine; for requests where it matters, on-the-side
discoverability can still be a thing. If we're in the task itself where
it is queued, that provides us safety from the work going away from under
us. Then we just have to mark it appropriately, if it needs to get
canceled instead of run to completion.
Some care is needed, of course, but there's nothing that would prevent
this from working. Ideally we'd be able to peel off a task_work entry,
but that's kind of difficult with the singly linked non-locked list.
--
Jens Axboe
* [PATCH 05/15] mm: allow read-ahead with IOCB_NOWAIT set
2020-06-18 14:43 Jens Axboe
` (3 preceding siblings ...)
2020-06-18 14:43 ` [PATCH 04/15] io_uring: re-issue block requests that failed because of resources Jens Axboe
@ 2020-06-18 14:43 ` Jens Axboe
2020-06-24 1:02 ` Dave Chinner
2020-06-24 4:38 ` Dave Chinner
2020-06-18 14:43 ` [PATCH 06/15] mm: abstract out wake_page_match() from wake_page_function() Jens Axboe
` (10 subsequent siblings)
15 siblings, 2 replies; 38+ messages in thread
From: Jens Axboe @ 2020-06-18 14:43 UTC (permalink / raw)
To: io-uring
Cc: linux-fsdevel, linux-kernel, linux-mm, akpm, Jens Axboe,
Johannes Weiner
The read-ahead shouldn't block, so allow it to be done even if
IOCB_NOWAIT is set in the kiocb.
Acked-by: Johannes Weiner <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
---
mm/filemap.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index f0ae9a6308cb..3378d4fca883 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2028,8 +2028,6 @@ ssize_t generic_file_buffered_read(struct kiocb *iocb,
page = find_get_page(mapping, index);
if (!page) {
- if (iocb->ki_flags & IOCB_NOWAIT)
- goto would_block;
page_cache_sync_readahead(mapping,
ra, filp,
index, last_index - index);
--
2.27.0
* Re: [PATCH 05/15] mm: allow read-ahead with IOCB_NOWAIT set
2020-06-18 14:43 ` [PATCH 05/15] mm: allow read-ahead with IOCB_NOWAIT set Jens Axboe
@ 2020-06-24 1:02 ` Dave Chinner
2020-06-24 1:46 ` Matthew Wilcox
2020-06-24 4:38 ` Dave Chinner
1 sibling, 1 reply; 38+ messages in thread
From: Dave Chinner @ 2020-06-24 1:02 UTC (permalink / raw)
To: Add, support, for, async, buffered, reads
Cc: io-uring, linux-fsdevel, linux-kernel, linux-mm, akpm, Jens Axboe,
Johannes Weiner
On Thu, Jun 18, 2020 at 08:43:45AM -0600, Jens Axboe wrote:
> The read-ahead shouldn't block, so allow it to be done even if
> IOCB_NOWAIT is set in the kiocb.
>
> Acked-by: Johannes Weiner <[email protected]>
> Signed-off-by: Jens Axboe <[email protected]>
> ---
> mm/filemap.c | 2 --
> 1 file changed, 2 deletions(-)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index f0ae9a6308cb..3378d4fca883 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2028,8 +2028,6 @@ ssize_t generic_file_buffered_read(struct kiocb *iocb,
>
> page = find_get_page(mapping, index);
> if (!page) {
> - if (iocb->ki_flags & IOCB_NOWAIT)
> - goto would_block;
> page_cache_sync_readahead(mapping,
> ra, filp,
> index, last_index - index);
Doesn't this break preadv2(RWF_NOWAIT) semantics for buffered
reads? i.e. this can now block on memory allocation for the page
cache, which is something RWF_NOWAIT IO should not do....
Cheers,
Dave.
--
Dave Chinner
[email protected]
* Re: [PATCH 05/15] mm: allow read-ahead with IOCB_NOWAIT set
2020-06-24 1:02 ` Dave Chinner
@ 2020-06-24 1:46 ` Matthew Wilcox
2020-06-24 15:00 ` Jens Axboe
0 siblings, 1 reply; 38+ messages in thread
From: Matthew Wilcox @ 2020-06-24 1:46 UTC (permalink / raw)
To: Dave Chinner
Cc: Add, support, for, async, buffered, reads, io-uring,
linux-fsdevel, linux-kernel, linux-mm, akpm, Jens Axboe,
Johannes Weiner
On Wed, Jun 24, 2020 at 11:02:53AM +1000, Dave Chinner wrote:
> On Thu, Jun 18, 2020 at 08:43:45AM -0600, Jens Axboe wrote:
> > The read-ahead shouldn't block, so allow it to be done even if
> > IOCB_NOWAIT is set in the kiocb.
>
> Doesn't think break preadv2(RWF_NOWAIT) semantics for on buffered
> reads? i.e. this can now block on memory allocation for the page
> cache, which is something RWF_NOWAIT IO should not do....
Yes. This eventually ends up in page_cache_readahead_unbounded()
which gets its gfp flags from readahead_gfp_mask(mapping).
I'd be quite happy to add a gfp_t to struct readahead_control.
The other thing I've been looking into for other reasons is adding
a memalloc_nowait_{save,restore}, which would avoid passing down
the gfp_t.
* Re: [PATCH 05/15] mm: allow read-ahead with IOCB_NOWAIT set
2020-06-24 1:46 ` Matthew Wilcox
@ 2020-06-24 15:00 ` Jens Axboe
2020-06-24 15:35 ` Jens Axboe
0 siblings, 1 reply; 38+ messages in thread
From: Jens Axboe @ 2020-06-24 15:00 UTC (permalink / raw)
To: Matthew Wilcox, Dave Chinner
Cc: io-uring, linux-fsdevel, linux-kernel, linux-mm, akpm,
Johannes Weiner
On 6/23/20 7:46 PM, Matthew Wilcox wrote:
> On Wed, Jun 24, 2020 at 11:02:53AM +1000, Dave Chinner wrote:
>> On Thu, Jun 18, 2020 at 08:43:45AM -0600, Jens Axboe wrote:
>>> The read-ahead shouldn't block, so allow it to be done even if
>>> IOCB_NOWAIT is set in the kiocb.
>>
>> Doesn't think break preadv2(RWF_NOWAIT) semantics for on buffered
>> reads? i.e. this can now block on memory allocation for the page
>> cache, which is something RWF_NOWAIT IO should not do....
>
> Yes. This eventually ends up in page_cache_readahead_unbounded()
> which gets its gfp flags from readahead_gfp_mask(mapping).
>
> I'd be quite happy to add a gfp_t to struct readahead_control.
> The other thing I've been looking into for other reasons is adding
> a memalloc_nowait_{save,restore}, which would avoid passing down
> the gfp_t.
That was my first thought, having the memalloc_foo_save/restore for
this. I don't think adding a gfp_t to readahead_control is going
to be super useful; it seems like the kind of thing that should be
non-blocking by default.
--
Jens Axboe
* Re: [PATCH 05/15] mm: allow read-ahead with IOCB_NOWAIT set
2020-06-24 15:00 ` Jens Axboe
@ 2020-06-24 15:35 ` Jens Axboe
2020-06-24 16:41 ` Matthew Wilcox
0 siblings, 1 reply; 38+ messages in thread
From: Jens Axboe @ 2020-06-24 15:35 UTC (permalink / raw)
To: Matthew Wilcox, Dave Chinner
Cc: io-uring, linux-fsdevel, linux-kernel, linux-mm, akpm,
Johannes Weiner
On 6/24/20 9:00 AM, Jens Axboe wrote:
> On 6/23/20 7:46 PM, Matthew Wilcox wrote:
>> On Wed, Jun 24, 2020 at 11:02:53AM +1000, Dave Chinner wrote:
>>> On Thu, Jun 18, 2020 at 08:43:45AM -0600, Jens Axboe wrote:
>>>> The read-ahead shouldn't block, so allow it to be done even if
>>>> IOCB_NOWAIT is set in the kiocb.
>>>
>>> Doesn't think break preadv2(RWF_NOWAIT) semantics for on buffered
>>> reads? i.e. this can now block on memory allocation for the page
>>> cache, which is something RWF_NOWAIT IO should not do....
>>
>> Yes. This eventually ends up in page_cache_readahead_unbounded()
>> which gets its gfp flags from readahead_gfp_mask(mapping).
>>
>> I'd be quite happy to add a gfp_t to struct readahead_control.
>> The other thing I've been looking into for other reasons is adding
>> a memalloc_nowait_{save,restore}, which would avoid passing down
>> the gfp_t.
>
> That was my first thought, having the memalloc_foo_save/restore for
> this. I don't think adding a gfp_t to readahead_control is going
> to be super useful, seems like the kind of thing that should be
> non-blocking by default.
We're already doing memalloc_nofs_save/restore in
page_cache_readahead_unbounded(), so I think all we need is to just do a
noio dance in generic_file_buffered_read() and that should be enough.
diff --git a/mm/filemap.c b/mm/filemap.c
index a5b1fa8f7ce4..c29d4b310ed6 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -41,6 +41,7 @@
#include <linux/delayacct.h>
#include <linux/psi.h>
#include <linux/ramfs.h>
+#include <linux/sched/mm.h>
#include "internal.h"
#define CREATE_TRACE_POINTS
@@ -2011,6 +2012,7 @@ ssize_t generic_file_buffered_read(struct kiocb *iocb,
struct address_space *mapping = filp->f_mapping;
struct inode *inode = mapping->host;
struct file_ra_state *ra = &filp->f_ra;
+ const bool nowait = (iocb->ki_flags & IOCB_NOWAIT) != 0;
loff_t *ppos = &iocb->ki_pos;
pgoff_t index;
pgoff_t last_index;
@@ -2044,9 +2046,15 @@ ssize_t generic_file_buffered_read(struct kiocb *iocb,
page = find_get_page(mapping, index);
if (!page) {
+ unsigned int flags;
+
+ if (nowait)
+ flags = memalloc_noio_save();
page_cache_sync_readahead(mapping,
ra, filp,
index, last_index - index);
+ if (nowait)
+ memalloc_noio_restore(flags);
page = find_get_page(mapping, index);
if (unlikely(page == NULL))
goto no_cached_page;
@@ -2070,7 +2078,7 @@ ssize_t generic_file_buffered_read(struct kiocb *iocb,
error = wait_on_page_locked_async(page,
iocb->ki_waitq);
} else {
- if (iocb->ki_flags & IOCB_NOWAIT) {
+ if (nowait) {
put_page(page);
goto would_block;
}
@@ -2185,7 +2193,7 @@ ssize_t generic_file_buffered_read(struct kiocb *iocb,
}
readpage:
- if (iocb->ki_flags & IOCB_NOWAIT) {
+ if (nowait) {
unlock_page(page);
put_page(page);
goto would_block;
--
Jens Axboe
* Re: [PATCH 05/15] mm: allow read-ahead with IOCB_NOWAIT set
2020-06-24 15:35 ` Jens Axboe
@ 2020-06-24 16:41 ` Matthew Wilcox
2020-06-24 16:44 ` Jens Axboe
0 siblings, 1 reply; 38+ messages in thread
From: Matthew Wilcox @ 2020-06-24 16:41 UTC (permalink / raw)
To: Jens Axboe
Cc: Dave Chinner, io-uring, linux-fsdevel, linux-kernel, linux-mm,
akpm, Johannes Weiner
On Wed, Jun 24, 2020 at 09:35:19AM -0600, Jens Axboe wrote:
> On 6/24/20 9:00 AM, Jens Axboe wrote:
> > On 6/23/20 7:46 PM, Matthew Wilcox wrote:
> >> I'd be quite happy to add a gfp_t to struct readahead_control.
> >> The other thing I've been looking into for other reasons is adding
> >> a memalloc_nowait_{save,restore}, which would avoid passing down
> >> the gfp_t.
> >
> > That was my first thought, having the memalloc_foo_save/restore for
> > this. I don't think adding a gfp_t to readahead_control is going
> > to be super useful, seems like the kind of thing that should be
> > non-blocking by default.
>
> We're already doing memalloc_nofs_save/restore in
> page_cache_readahead_unbounded(), so I think all we need is to just do a
> noio dance in generic_file_buffered_read() and that should be enough.
I think we can still sleep though, right? I was thinking more
like this:
http://git.infradead.org/users/willy/linux.git/shortlog/refs/heads/memalloc
* Re: [PATCH 05/15] mm: allow read-ahead with IOCB_NOWAIT set
2020-06-24 16:41 ` Matthew Wilcox
@ 2020-06-24 16:44 ` Jens Axboe
2020-07-07 11:38 ` Andreas Grünbacher
2020-08-10 22:56 ` Dave Chinner
0 siblings, 2 replies; 38+ messages in thread
From: Jens Axboe @ 2020-06-24 16:44 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Dave Chinner, io-uring, linux-fsdevel, linux-kernel, linux-mm,
akpm, Johannes Weiner
On 6/24/20 10:41 AM, Matthew Wilcox wrote:
> On Wed, Jun 24, 2020 at 09:35:19AM -0600, Jens Axboe wrote:
>> On 6/24/20 9:00 AM, Jens Axboe wrote:
>>> On 6/23/20 7:46 PM, Matthew Wilcox wrote:
>>>> I'd be quite happy to add a gfp_t to struct readahead_control.
>>>> The other thing I've been looking into for other reasons is adding
>>>> a memalloc_nowait_{save,restore}, which would avoid passing down
>>>> the gfp_t.
>>>
>>> That was my first thought, having the memalloc_foo_save/restore for
>>> this. I don't think adding a gfp_t to readahead_control is going
>>> to be super useful, seems like the kind of thing that should be
>>> non-blocking by default.
>>
>> We're already doing memalloc_nofs_save/restore in
>> page_cache_readahead_unbounded(), so I think all we need is to just do a
>> noio dance in generic_file_buffered_read() and that should be enough.
>
> I think we can still sleep though, right? I was thinking more
> like this:
>
> http://git.infradead.org/users/willy/linux.git/shortlog/refs/heads/memalloc
Yeah, that's probably better. How do we want to handle this? I've already
got the other bits queued up. I can either add them to the series, or
pull a branch that'll go into Linus as well.
--
Jens Axboe
* Re: [PATCH 05/15] mm: allow read-ahead with IOCB_NOWAIT set
2020-06-24 16:44 ` Jens Axboe
@ 2020-07-07 11:38 ` Andreas Grünbacher
2020-07-07 14:31 ` Jens Axboe
2020-08-10 22:56 ` Dave Chinner
1 sibling, 1 reply; 38+ messages in thread
From: Andreas Grünbacher @ 2020-07-07 11:38 UTC (permalink / raw)
To: Jens Axboe
Cc: Matthew Wilcox, Dave Chinner, io-uring,
Linux FS-devel Mailing List, Linux Kernel Mailing List, Linux-MM,
Andrew Morton, Johannes Weiner
On Wed, Jun 24, 2020 at 18:48, Jens Axboe <[email protected]> wrote:
>
> On 6/24/20 10:41 AM, Matthew Wilcox wrote:
> > On Wed, Jun 24, 2020 at 09:35:19AM -0600, Jens Axboe wrote:
> >> On 6/24/20 9:00 AM, Jens Axboe wrote:
> >>> On 6/23/20 7:46 PM, Matthew Wilcox wrote:
> >>>> I'd be quite happy to add a gfp_t to struct readahead_control.
> >>>> The other thing I've been looking into for other reasons is adding
> >>>> a memalloc_nowait_{save,restore}, which would avoid passing down
> >>>> the gfp_t.
> >>>
> >>> That was my first thought, having the memalloc_foo_save/restore for
> >>> this. I don't think adding a gfp_t to readahead_control is going
> >>> to be super useful, seems like the kind of thing that should be
> >>> non-blocking by default.
> >>
> >> We're already doing memalloc_nofs_save/restore in
> >> page_cache_readahead_unbounded(), so I think all we need is to just do a
> >> noio dance in generic_file_buffered_read() and that should be enough.
> >
> > I think we can still sleep though, right? I was thinking more
> > like this:
> >
> > http://git.infradead.org/users/willy/linux.git/shortlog/refs/heads/memalloc
>
> Yeah, that's probably better. How do we want to handle this? I've already
> got the other bits queued up. I can either add them to the series, or
> pull a branch that'll go into Linus as well.
Also note my conflicting patch that introduces an IOCB_NOIO flag for
fixing a gfs2 regression:
https://lore.kernel.org/linux-fsdevel/[email protected]/
Thanks,
Andreas
* Re: [PATCH 05/15] mm: allow read-ahead with IOCB_NOWAIT set
2020-07-07 11:38 ` Andreas Grünbacher
@ 2020-07-07 14:31 ` Jens Axboe
0 siblings, 0 replies; 38+ messages in thread
From: Jens Axboe @ 2020-07-07 14:31 UTC (permalink / raw)
To: Andreas Grünbacher
Cc: Matthew Wilcox, Dave Chinner, io-uring,
Linux FS-devel Mailing List, Linux Kernel Mailing List, Linux-MM,
Andrew Morton, Johannes Weiner
On 7/7/20 5:38 AM, Andreas Grünbacher wrote:
> Am Mi., 24. Juni 2020 um 18:48 Uhr schrieb Jens Axboe <[email protected]>:
>>
>> On 6/24/20 10:41 AM, Matthew Wilcox wrote:
>>> On Wed, Jun 24, 2020 at 09:35:19AM -0600, Jens Axboe wrote:
>>>> On 6/24/20 9:00 AM, Jens Axboe wrote:
>>>>> On 6/23/20 7:46 PM, Matthew Wilcox wrote:
>>>>>> I'd be quite happy to add a gfp_t to struct readahead_control.
>>>>>> The other thing I've been looking into for other reasons is adding
>>>>>> a memalloc_nowait_{save,restore}, which would avoid passing down
>>>>>> the gfp_t.
>>>>>
>>>>> That was my first thought, having the memalloc_foo_save/restore for
>>>>> this. I don't think adding a gfp_t to readahead_control is going
>>>>> to be super useful, seems like the kind of thing that should be
>>>>> non-blocking by default.
>>>>
>>>> We're already doing memalloc_nofs_save/restore in
>>>> page_cache_readahead_unbounded(), so I think all we need is to just do a
>>>> noio dance in generic_file_buffered_read() and that should be enough.
>>>
>>> I think we can still sleep though, right? I was thinking more
>>> like this:
>>>
>>> http://git.infradead.org/users/willy/linux.git/shortlog/refs/heads/memalloc
>>
>> Yeah, that's probably better. How do we want to handle this? I've already
>> got the other bits queued up. I can either add them to the series, or
>> pull a branch that'll go into Linus as well.
>
> Also note my conflicting patch that introduces a IOCB_NOIO flag for
> fixing a gfs2 regression:
>
> https://lore.kernel.org/linux-fsdevel/[email protected]/
Yeah I noticed, pretty easy to resolve though.
--
Jens Axboe
* Re: [PATCH 05/15] mm: allow read-ahead with IOCB_NOWAIT set
2020-06-24 16:44 ` Jens Axboe
2020-07-07 11:38 ` Andreas Grünbacher
@ 2020-08-10 22:56 ` Dave Chinner
2020-08-10 23:03 ` Jens Axboe
1 sibling, 1 reply; 38+ messages in thread
From: Dave Chinner @ 2020-08-10 22:56 UTC (permalink / raw)
To: Jens Axboe
Cc: Matthew Wilcox, io-uring, linux-fsdevel, linux-kernel, linux-mm,
akpm, Johannes Weiner
On Wed, Jun 24, 2020 at 10:44:21AM -0600, Jens Axboe wrote:
> On 6/24/20 10:41 AM, Matthew Wilcox wrote:
> > On Wed, Jun 24, 2020 at 09:35:19AM -0600, Jens Axboe wrote:
> >> On 6/24/20 9:00 AM, Jens Axboe wrote:
> >>> On 6/23/20 7:46 PM, Matthew Wilcox wrote:
> >>>> I'd be quite happy to add a gfp_t to struct readahead_control.
> >>>> The other thing I've been looking into for other reasons is adding
> >>>> a memalloc_nowait_{save,restore}, which would avoid passing down
> >>>> the gfp_t.
> >>>
> >>> That was my first thought, having the memalloc_foo_save/restore for
> >>> this. I don't think adding a gfp_t to readahead_control is going
> >>> to be super useful, seems like the kind of thing that should be
> >>> non-blocking by default.
> >>
> >> We're already doing memalloc_nofs_save/restore in
> >> page_cache_readahead_unbounded(), so I think all we need is to just do a
> >> noio dance in generic_file_buffered_read() and that should be enough.
> >
> > I think we can still sleep though, right? I was thinking more
> > like this:
> >
> > http://git.infradead.org/users/willy/linux.git/shortlog/refs/heads/memalloc
>
> Yeah, that's probably better. How do we want to handle this? I've already
> got the other bits queued up. I can either add them to the series, or
> pull a branch that'll go into Linus as well.
Jens, Willy,
Now that this patch has been merged and IOCB_NOWAIT semantics for
buffered reads are broken in Linus' tree, what's the plan to get
this regression fixed before 5.9 releases?
Cheers,
Dave.
--
Dave Chinner
[email protected]
* Re: [PATCH 05/15] mm: allow read-ahead with IOCB_NOWAIT set
2020-08-10 22:56 ` Dave Chinner
@ 2020-08-10 23:03 ` Jens Axboe
0 siblings, 0 replies; 38+ messages in thread
From: Jens Axboe @ 2020-08-10 23:03 UTC (permalink / raw)
To: Dave Chinner
Cc: Matthew Wilcox, io-uring, linux-fsdevel, linux-kernel, linux-mm,
akpm, Johannes Weiner
On 8/10/20 4:56 PM, Dave Chinner wrote:
> On Wed, Jun 24, 2020 at 10:44:21AM -0600, Jens Axboe wrote:
>> On 6/24/20 10:41 AM, Matthew Wilcox wrote:
>>> On Wed, Jun 24, 2020 at 09:35:19AM -0600, Jens Axboe wrote:
>>>> On 6/24/20 9:00 AM, Jens Axboe wrote:
>>>>> On 6/23/20 7:46 PM, Matthew Wilcox wrote:
>>>>>> I'd be quite happy to add a gfp_t to struct readahead_control.
>>>>>> The other thing I've been looking into for other reasons is adding
>>>>>> a memalloc_nowait_{save,restore}, which would avoid passing down
>>>>>> the gfp_t.
>>>>>
>>>>> That was my first thought, having the memalloc_foo_save/restore for
>>>>> this. I don't think adding a gfp_t to readahead_control is going
>>>>> to be super useful, seems like the kind of thing that should be
>>>>> non-blocking by default.
>>>>
>>>> We're already doing memalloc_nofs_save/restore in
>>>> page_cache_readahead_unbounded(), so I think all we need is to just do a
>>>> noio dance in generic_file_buffered_read() and that should be enough.
>>>
>>> I think we can still sleep though, right? I was thinking more
>>> like this:
>>>
>>> http://git.infradead.org/users/willy/linux.git/shortlog/refs/heads/memalloc
>>
>> Yeah, that's probably better. How do we want to handle this? I've already
>> got the other bits queued up. I can either add them to the series, or
>> pull a branch that'll go into Linus as well.
>
> Jens, Willy,
>
> Now that this patch has been merged and IOCB_NOWAIT semantics ifor
> buffered reads are broken in Linus' tree, what's the plan to get
> this regression fixed before 5.9 releases?
Not sure where Willy's work on this topic ended up, but it is on my radar.
In the meantime, I think we can do something truly simple now that we have
IOCB_NOIO:
diff --git a/include/linux/fs.h b/include/linux/fs.h
index bd7ec3eaeed0..f1cca4bfdd7b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3293,7 +3293,7 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags)
if (flags & RWF_NOWAIT) {
if (!(ki->ki_filp->f_mode & FMODE_NOWAIT))
return -EOPNOTSUPP;
- kiocb_flags |= IOCB_NOWAIT;
+ kiocb_flags |= IOCB_NOWAIT | IOCB_NOIO;
}
if (flags & RWF_HIPRI)
kiocb_flags |= IOCB_HIPRI;
--
Jens Axboe
* Re: [PATCH 05/15] mm: allow read-ahead with IOCB_NOWAIT set
2020-06-18 14:43 ` [PATCH 05/15] mm: allow read-ahead with IOCB_NOWAIT set Jens Axboe
2020-06-24 1:02 ` Dave Chinner
@ 2020-06-24 4:38 ` Dave Chinner
2020-06-24 15:01 ` Jens Axboe
1 sibling, 1 reply; 38+ messages in thread
From: Dave Chinner @ 2020-06-24 4:38 UTC (permalink / raw)
To: Jens Axboe
Cc: io-uring, linux-fsdevel, linux-kernel, linux-mm, akpm, Jens Axboe,
Johannes Weiner
On Thu, Jun 18, 2020 at 08:43:45AM -0600, Jens Axboe wrote:
> The read-ahead shouldn't block, so allow it to be done even if
> IOCB_NOWAIT is set in the kiocb.
>
> Acked-by: Johannes Weiner <[email protected]>
> Signed-off-by: Jens Axboe <[email protected]>
BTW, Jens, in case nobody had mentioned it, the Reply-To field for
the patches in this patchset is screwed up:
| Reply-To: [email protected], [email protected], [email protected],
| [email protected], [email protected],
| [email protected]
Cheers,
Dave.
--
Dave Chinner
[email protected]
* Re: [PATCH 05/15] mm: allow read-ahead with IOCB_NOWAIT set
2020-06-24 4:38 ` Dave Chinner
@ 2020-06-24 15:01 ` Jens Axboe
0 siblings, 0 replies; 38+ messages in thread
From: Jens Axboe @ 2020-06-24 15:01 UTC (permalink / raw)
To: Dave Chinner
Cc: io-uring, linux-fsdevel, linux-kernel, linux-mm, akpm,
Johannes Weiner
On 6/23/20 10:38 PM, Dave Chinner wrote:
> On Thu, Jun 18, 2020 at 08:43:45AM -0600, Jens Axboe wrote:
>> The read-ahead shouldn't block, so allow it to be done even if
>> IOCB_NOWAIT is set in the kiocb.
>>
>> Acked-by: Johannes Weiner <[email protected]>
>> Signed-off-by: Jens Axboe <[email protected]>
>
> BTW, Jens, in case nobody had mentioned it, the Reply-To field for
> the patches in this patchset is screwed up:
>
> | Reply-To: [email protected], [email protected], [email protected],
> | [email protected], [email protected],
> | [email protected]
Yeah, I pasted the subject line into the wrong spot for git send-email,
hence the reply-to is boogered, and the subject line was empty for the
cover letter...
--
Jens Axboe
* [PATCH 06/15] mm: abstract out wake_page_match() from wake_page_function()
2020-06-18 14:43 Jens Axboe
` (4 preceding siblings ...)
2020-06-18 14:43 ` [PATCH 05/15] mm: allow read-ahead with IOCB_NOWAIT set Jens Axboe
@ 2020-06-18 14:43 ` Jens Axboe
2020-06-18 14:43 ` [PATCH 07/15] mm: add support for async page locking Jens Axboe
` (9 subsequent siblings)
15 siblings, 0 replies; 38+ messages in thread
From: Jens Axboe @ 2020-06-18 14:43 UTC (permalink / raw)
To: io-uring
Cc: linux-fsdevel, linux-kernel, linux-mm, akpm, Jens Axboe,
Johannes Weiner
No functional changes in this patch, just in preparation for allowing
more callers.
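As an illustration of the intended reuse (hedged sketch; the wake
function here is hypothetical, and the first real external caller only
arrives later in this series), a wake function outside mm/filemap.c can
now do:
	static int my_wake(struct wait_queue_entry *wait, unsigned mode,
			   int sync, void *key)
	{
		struct wait_page_queue *wpq = container_of(wait, struct wait_page_queue, wait);
		int ret = wake_page_match(wpq, key);
		if (ret != 1)
			return ret;	/* 0: not our page/bit, -1: stop the walk */
		/* page/bit matched and the bit is clear: kick the retry here */
		return 1;
	}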
Acked-by: Johannes Weiner <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
---
include/linux/pagemap.h | 37 +++++++++++++++++++++++++++++++++++++
mm/filemap.c | 35 ++++-------------------------------
2 files changed, 41 insertions(+), 31 deletions(-)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index cf2468da68e9..2f18221bb5c8 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -496,6 +496,43 @@ static inline pgoff_t linear_page_index(struct vm_area_struct *vma,
return pgoff;
}
+/* This has the same layout as wait_bit_key - see fs/cachefiles/rdwr.c */
+struct wait_page_key {
+ struct page *page;
+ int bit_nr;
+ int page_match;
+};
+
+struct wait_page_queue {
+ struct page *page;
+ int bit_nr;
+ wait_queue_entry_t wait;
+};
+
+static inline int wake_page_match(struct wait_page_queue *wait_page,
+ struct wait_page_key *key)
+{
+ if (wait_page->page != key->page)
+ return 0;
+ key->page_match = 1;
+
+ if (wait_page->bit_nr != key->bit_nr)
+ return 0;
+
+ /*
+ * Stop walking if it's locked.
+ * Is this safe if put_and_wait_on_page_locked() is in use?
+ * Yes: the waker must hold a reference to this page, and if PG_locked
+ * has now already been set by another task, that task must also hold
+ * a reference to the *same usage* of this page; so there is no need
+ * to walk on to wake even the put_and_wait_on_page_locked() callers.
+ */
+ if (test_bit(key->bit_nr, &key->page->flags))
+ return -1;
+
+ return 1;
+}
+
extern void __lock_page(struct page *page);
extern int __lock_page_killable(struct page *page);
extern int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
diff --git a/mm/filemap.c b/mm/filemap.c
index 3378d4fca883..c3175dbd8fba 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -987,43 +987,16 @@ void __init pagecache_init(void)
page_writeback_init();
}
-/* This has the same layout as wait_bit_key - see fs/cachefiles/rdwr.c */
-struct wait_page_key {
- struct page *page;
- int bit_nr;
- int page_match;
-};
-
-struct wait_page_queue {
- struct page *page;
- int bit_nr;
- wait_queue_entry_t wait;
-};
-
static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync, void *arg)
{
struct wait_page_key *key = arg;
struct wait_page_queue *wait_page
= container_of(wait, struct wait_page_queue, wait);
+ int ret;
- if (wait_page->page != key->page)
- return 0;
- key->page_match = 1;
-
- if (wait_page->bit_nr != key->bit_nr)
- return 0;
-
- /*
- * Stop walking if it's locked.
- * Is this safe if put_and_wait_on_page_locked() is in use?
- * Yes: the waker must hold a reference to this page, and if PG_locked
- * has now already been set by another task, that task must also hold
- * a reference to the *same usage* of this page; so there is no need
- * to walk on to wake even the put_and_wait_on_page_locked() callers.
- */
- if (test_bit(key->bit_nr, &key->page->flags))
- return -1;
-
+ ret = wake_page_match(wait_page, key);
+ if (ret != 1)
+ return ret;
return autoremove_wake_function(wait, mode, sync, key);
}
--
2.27.0
* [PATCH 07/15] mm: add support for async page locking
2020-06-18 14:43 Jens Axboe
` (5 preceding siblings ...)
2020-06-18 14:43 ` [PATCH 06/15] mm: abstract out wake_page_match() from wake_page_function() Jens Axboe
@ 2020-06-18 14:43 ` Jens Axboe
2020-07-07 11:32 ` Andreas Grünbacher
2020-06-18 14:43 ` [PATCH 08/15] mm: support async buffered reads in generic_file_buffered_read() Jens Axboe
` (8 subsequent siblings)
15 siblings, 1 reply; 38+ messages in thread
From: Jens Axboe @ 2020-06-18 14:43 UTC (permalink / raw)
To: io-uring
Cc: linux-fsdevel, linux-kernel, linux-mm, akpm, Jens Axboe,
Johannes Weiner
Normally waiting for a page to become unlocked, or locking the page,
requires waiting for IO to complete. Add support for lock_page_async()
and wait_on_page_locked_async(), which are callback based instead. This
allows a caller to get notified when a page becomes unlocked, rather
than wait for it.
We add a new iocb field, ki_waitq, to pass in the necessary data for this
to happen. We can unionize this with ki_cookie, since that is only used
for polled IO. Polled IO can never co-exist with async callbacks, as it
(by definition) relies on polled completions. struct wait_page_key is made
public, and struct wait_page_queue serves as the interface between the
caller and the core.
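For illustration, a read path honoring the new hook could look roughly like
the sketch below (the function name is made up for this sketch; the in-tree
consumer is generic_file_buffered_read(), updated further down):
static int example_lock_or_queue(struct kiocb *iocb, struct page *page)
{
	if (!(iocb->ki_flags & IOCB_WAITQ))
		return lock_page_killable(page);	/* old blocking path */
	/*
	 * Returns 0 if the lock was acquired, or -EIOCBQUEUED if
	 * iocb->ki_waitq was queued; its wake function runs when the
	 * page is unlocked and can retry the operation from there.
	 */
	return lock_page_async(page, iocb->ki_waitq);
}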
Acked-by: Johannes Weiner <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
---
include/linux/fs.h | 7 ++++++-
include/linux/pagemap.h | 17 ++++++++++++++++
mm/filemap.c | 45 ++++++++++++++++++++++++++++++++++++++++-
3 files changed, 67 insertions(+), 2 deletions(-)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6c4ab4dc1cd7..6ac919b40596 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -315,6 +315,8 @@ enum rw_hint {
#define IOCB_SYNC (1 << 5)
#define IOCB_WRITE (1 << 6)
#define IOCB_NOWAIT (1 << 7)
+/* iocb->ki_waitq is valid */
+#define IOCB_WAITQ (1 << 8)
struct kiocb {
struct file *ki_filp;
@@ -328,7 +330,10 @@ struct kiocb {
int ki_flags;
u16 ki_hint;
u16 ki_ioprio; /* See linux/ioprio.h */
- unsigned int ki_cookie; /* for ->iopoll */
+ union {
+ unsigned int ki_cookie; /* for ->iopoll */
+ struct wait_page_queue *ki_waitq; /* for async buffered IO */
+ };
randomized_struct_fields_end
};
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 2f18221bb5c8..e053e1d9a4d7 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -535,6 +535,7 @@ static inline int wake_page_match(struct wait_page_queue *wait_page,
extern void __lock_page(struct page *page);
extern int __lock_page_killable(struct page *page);
+extern int __lock_page_async(struct page *page, struct wait_page_queue *wait);
extern int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
unsigned int flags);
extern void unlock_page(struct page *page);
@@ -571,6 +572,22 @@ static inline int lock_page_killable(struct page *page)
return 0;
}
+/*
+ * lock_page_async - Lock the page, unless this would block. If the page
+ * is already locked, then queue a callback when the page becomes unlocked.
+ * This callback can then retry the operation.
+ *
+ * Returns 0 if the page is locked successfully, or -EIOCBQUEUED if the page
+ * was already locked and the callback defined in 'wait' was queued.
+ */
+static inline int lock_page_async(struct page *page,
+ struct wait_page_queue *wait)
+{
+ if (!trylock_page(page))
+ return __lock_page_async(page, wait);
+ return 0;
+}
+
/*
* lock_page_or_retry - Lock the page, unless this would block and the
* caller indicated that it can handle a retry.
diff --git a/mm/filemap.c b/mm/filemap.c
index c3175dbd8fba..e8aaf43bee9f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1180,6 +1180,36 @@ int wait_on_page_bit_killable(struct page *page, int bit_nr)
}
EXPORT_SYMBOL(wait_on_page_bit_killable);
+static int __wait_on_page_locked_async(struct page *page,
+ struct wait_page_queue *wait, bool set)
+{
+ struct wait_queue_head *q = page_waitqueue(page);
+ int ret = 0;
+
+ wait->page = page;
+ wait->bit_nr = PG_locked;
+
+ spin_lock_irq(&q->lock);
+ __add_wait_queue_entry_tail(q, &wait->wait);
+ SetPageWaiters(page);
+ if (set)
+ ret = !trylock_page(page);
+ else
+ ret = PageLocked(page);
+ /*
+ * If we were successful now, we know we're still on the
+ * waitqueue as we're still under the lock. This means it's
+ * safe to remove and return success, we know the callback
+ * isn't going to trigger.
+ */
+ if (!ret)
+ __remove_wait_queue(q, &wait->wait);
+ else
+ ret = -EIOCBQUEUED;
+ spin_unlock_irq(&q->lock);
+ return ret;
+}
+
/**
* put_and_wait_on_page_locked - Drop a reference and wait for it to be unlocked
* @page: The page to wait for.
@@ -1342,6 +1372,11 @@ int __lock_page_killable(struct page *__page)
}
EXPORT_SYMBOL_GPL(__lock_page_killable);
+int __lock_page_async(struct page *page, struct wait_page_queue *wait)
+{
+ return __wait_on_page_locked_async(page, wait, true);
+}
+
/*
* Return values:
* 1 - page is locked; mmap_lock is still held.
@@ -2131,6 +2166,11 @@ ssize_t generic_file_buffered_read(struct kiocb *iocb,
}
readpage:
+ if (iocb->ki_flags & IOCB_NOWAIT) {
+ unlock_page(page);
+ put_page(page);
+ goto would_block;
+ }
/*
* A previous I/O error may have been due to temporary
* failures, eg. multipath errors.
@@ -2150,7 +2190,10 @@ ssize_t generic_file_buffered_read(struct kiocb *iocb,
}
if (!PageUptodate(page)) {
- error = lock_page_killable(page);
+ if (iocb->ki_flags & IOCB_WAITQ)
+ error = lock_page_async(page, iocb->ki_waitq);
+ else
+ error = lock_page_killable(page);
if (unlikely(error))
goto readpage_error;
if (!PageUptodate(page)) {
--
2.27.0
^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH 07/15] mm: add support for async page locking
2020-06-18 14:43 ` [PATCH 07/15] mm: add support for async page locking Jens Axboe
@ 2020-07-07 11:32 ` Andreas Grünbacher
2020-07-07 14:32 ` Jens Axboe
0 siblings, 1 reply; 38+ messages in thread
From: Andreas Grünbacher @ 2020-07-07 11:32 UTC (permalink / raw)
To: Jens Axboe
Cc: io-uring, Linux FS-devel Mailing List, Linux Kernel Mailing List,
Linux-MM, Andrew Morton, Johannes Weiner
Jens,
Am Do., 18. Juni 2020 um 16:47 Uhr schrieb Jens Axboe <[email protected]>:
> Normally waiting for a page to become unlocked, or locking the page,
> requires waiting for IO to complete. Add support for lock_page_async()
> and wait_on_page_locked_async(), which are callback based instead. This
> allows a caller to get notified when a page becomes unlocked, rather
> than wait for it.
>
> We add a new iocb field, ki_waitq, to pass in the necessary data for this
> to happen. We can unionize this with ki_cookie, since that is only used
> for polled IO. Polled IO can never co-exist with async callbacks, as it
> (by definition) relies on polled completions. struct wait_page_key is made
> public, and struct wait_page_queue serves as the interface between the
> caller and the core.
>
> Acked-by: Johannes Weiner <[email protected]>
> Signed-off-by: Jens Axboe <[email protected]>
> ---
> include/linux/fs.h | 7 ++++++-
> include/linux/pagemap.h | 17 ++++++++++++++++
> mm/filemap.c | 45 ++++++++++++++++++++++++++++++++++++++++-
> 3 files changed, 67 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 6c4ab4dc1cd7..6ac919b40596 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -315,6 +315,8 @@ enum rw_hint {
> #define IOCB_SYNC (1 << 5)
> #define IOCB_WRITE (1 << 6)
> #define IOCB_NOWAIT (1 << 7)
> +/* iocb->ki_waitq is valid */
> +#define IOCB_WAITQ (1 << 8)
>
> struct kiocb {
> struct file *ki_filp;
> @@ -328,7 +330,10 @@ struct kiocb {
> int ki_flags;
> u16 ki_hint;
> u16 ki_ioprio; /* See linux/ioprio.h */
> - unsigned int ki_cookie; /* for ->iopoll */
> + union {
> + unsigned int ki_cookie; /* for ->iopoll */
> + struct wait_page_queue *ki_waitq; /* for async buffered IO */
> + };
>
> randomized_struct_fields_end
> };
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 2f18221bb5c8..e053e1d9a4d7 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -535,6 +535,7 @@ static inline int wake_page_match(struct wait_page_queue *wait_page,
>
> extern void __lock_page(struct page *page);
> extern int __lock_page_killable(struct page *page);
> +extern int __lock_page_async(struct page *page, struct wait_page_queue *wait);
> extern int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
> unsigned int flags);
> extern void unlock_page(struct page *page);
> @@ -571,6 +572,22 @@ static inline int lock_page_killable(struct page *page)
> return 0;
> }
>
> +/*
> + * lock_page_async - Lock the page, unless this would block. If the page
> + * is already locked, then queue a callback when the page becomes unlocked.
> + * This callback can then retry the operation.
> + *
> + * Returns 0 if the page is locked successfully, or -EIOCBQUEUED if the page
> + * was already locked and the callback defined in 'wait' was queued.
> + */
> +static inline int lock_page_async(struct page *page,
> + struct wait_page_queue *wait)
> +{
> + if (!trylock_page(page))
> + return __lock_page_async(page, wait);
> + return 0;
> +}
> +
> /*
> * lock_page_or_retry - Lock the page, unless this would block and the
> * caller indicated that it can handle a retry.
> diff --git a/mm/filemap.c b/mm/filemap.c
> index c3175dbd8fba..e8aaf43bee9f 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1180,6 +1180,36 @@ int wait_on_page_bit_killable(struct page *page, int bit_nr)
> }
> EXPORT_SYMBOL(wait_on_page_bit_killable);
>
> +static int __wait_on_page_locked_async(struct page *page,
> + struct wait_page_queue *wait, bool set)
> +{
> + struct wait_queue_head *q = page_waitqueue(page);
> + int ret = 0;
> +
> + wait->page = page;
> + wait->bit_nr = PG_locked;
> +
> + spin_lock_irq(&q->lock);
> + __add_wait_queue_entry_tail(q, &wait->wait);
> + SetPageWaiters(page);
> + if (set)
> + ret = !trylock_page(page);
> + else
> + ret = PageLocked(page);
> + /*
> + * If we were successful now, we know we're still on the
> + * waitqueue as we're still under the lock. This means it's
> + * safe to remove and return success, we know the callback
> + * isn't going to trigger.
> + */
> + if (!ret)
> + __remove_wait_queue(q, &wait->wait);
> + else
> + ret = -EIOCBQUEUED;
> + spin_unlock_irq(&q->lock);
> + return ret;
> +}
> +
> /**
> * put_and_wait_on_page_locked - Drop a reference and wait for it to be unlocked
> * @page: The page to wait for.
> @@ -1342,6 +1372,11 @@ int __lock_page_killable(struct page *__page)
> }
> EXPORT_SYMBOL_GPL(__lock_page_killable);
>
> +int __lock_page_async(struct page *page, struct wait_page_queue *wait)
> +{
> + return __wait_on_page_locked_async(page, wait, true);
> +}
> +
> /*
> * Return values:
> * 1 - page is locked; mmap_lock is still held.
> @@ -2131,6 +2166,11 @@ ssize_t generic_file_buffered_read(struct kiocb *iocb,
> }
>
> readpage:
> + if (iocb->ki_flags & IOCB_NOWAIT) {
> + unlock_page(page);
> + put_page(page);
> + goto would_block;
> + }
This hunk should have been part of "mm: allow read-ahead with
IOCB_NOWAIT set" ...
> /*
> * A previous I/O error may have been due to temporary
> * failures, eg. multipath errors.
> @@ -2150,7 +2190,10 @@ ssize_t generic_file_buffered_read(struct kiocb *iocb,
> }
>
> if (!PageUptodate(page)) {
> - error = lock_page_killable(page);
> + if (iocb->ki_flags & IOCB_WAITQ)
> + error = lock_page_async(page, iocb->ki_waitq);
> + else
> + error = lock_page_killable(page);
> if (unlikely(error))
> goto readpage_error;
> if (!PageUptodate(page)) {
> --
> 2.27.0
>
Andreas
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 07/15] mm: add support for async page locking
2020-07-07 11:32 ` Andreas Grünbacher
@ 2020-07-07 14:32 ` Jens Axboe
0 siblings, 0 replies; 38+ messages in thread
From: Jens Axboe @ 2020-07-07 14:32 UTC (permalink / raw)
To: Andreas Grünbacher
Cc: io-uring, Linux FS-devel Mailing List, Linux Kernel Mailing List,
Linux-MM, Andrew Morton, Johannes Weiner
On 7/7/20 5:32 AM, Andreas Grünbacher wrote:
>> @@ -2131,6 +2166,11 @@ ssize_t generic_file_buffered_read(struct kiocb *iocb,
>> }
>>
>> readpage:
>> + if (iocb->ki_flags & IOCB_NOWAIT) {
>> + unlock_page(page);
>> + put_page(page);
>> + goto would_block;
>> + }
>
> This hunk should have been part of "mm: allow read-ahead with
> IOCB_NOWAIT set" ...
It probably should have been... Oh well, it's been queued up for multiple
weeks at this point; it's not worth rebasing just to move this hunk.
--
Jens Axboe
^ permalink raw reply [flat|nested] 38+ messages in thread
* [PATCH 08/15] mm: support async buffered reads in generic_file_buffered_read()
2020-06-18 14:43 Jens Axboe
` (6 preceding siblings ...)
2020-06-18 14:43 ` [PATCH 07/15] mm: add support for async page locking Jens Axboe
@ 2020-06-18 14:43 ` Jens Axboe
2020-06-18 14:43 ` [PATCH 09/15] fs: add FMODE_BUF_RASYNC Jens Axboe
` (7 subsequent siblings)
15 siblings, 0 replies; 38+ messages in thread
From: Jens Axboe @ 2020-06-18 14:43 UTC (permalink / raw)
To: io-uring
Cc: linux-fsdevel, linux-kernel, linux-mm, akpm, Jens Axboe,
Johannes Weiner
Use the async page locking infrastructure if IOCB_WAITQ is set in the
passed-in iocb. The caller must expect an -EIOCBQUEUED return value,
which means that IO has been started but is not yet done. This is similar
to how O_DIRECT signals the same condition. Once the caller receives the
completion callback, it must retry the operation.
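As a rough sketch of the calling convention (illustrative only; this assumes
the iocb has already been armed with IOCB_WAITQ and a valid ki_waitq):
	ssize_t ret = generic_file_read_iter(iocb, iter);
	if (ret == -EIOCBQUEUED) {
		/*
		 * IO was kicked off but the data isn't ready yet. The
		 * wait_page_queue callback in iocb->ki_waitq fires when
		 * the page unlocks, and the caller retries the read from
		 * that callback.
		 */
		return ret;
	}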
Acked-by: Johannes Weiner <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
---
mm/filemap.c | 38 +++++++++++++++++++++++++++-----------
1 file changed, 27 insertions(+), 11 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index e8aaf43bee9f..a5b1fa8f7ce4 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1210,6 +1210,14 @@ static int __wait_on_page_locked_async(struct page *page,
return ret;
}
+static int wait_on_page_locked_async(struct page *page,
+ struct wait_page_queue *wait)
+{
+ if (!PageLocked(page))
+ return 0;
+ return __wait_on_page_locked_async(compound_head(page), wait, false);
+}
+
/**
* put_and_wait_on_page_locked - Drop a reference and wait for it to be unlocked
* @page: The page to wait for.
@@ -2049,17 +2057,25 @@ ssize_t generic_file_buffered_read(struct kiocb *iocb,
index, last_index - index);
}
if (!PageUptodate(page)) {
- if (iocb->ki_flags & IOCB_NOWAIT) {
- put_page(page);
- goto would_block;
- }
-
/*
* See comment in do_read_cache_page on why
* wait_on_page_locked is used to avoid unnecessarily
* serialisations and why it's safe.
*/
- error = wait_on_page_locked_killable(page);
+ if (iocb->ki_flags & IOCB_WAITQ) {
+ if (written) {
+ put_page(page);
+ goto out;
+ }
+ error = wait_on_page_locked_async(page,
+ iocb->ki_waitq);
+ } else {
+ if (iocb->ki_flags & IOCB_NOWAIT) {
+ put_page(page);
+ goto would_block;
+ }
+ error = wait_on_page_locked_killable(page);
+ }
if (unlikely(error))
goto readpage_error;
if (PageUptodate(page))
@@ -2147,7 +2163,10 @@ ssize_t generic_file_buffered_read(struct kiocb *iocb,
page_not_up_to_date:
/* Get exclusive access to the page ... */
- error = lock_page_killable(page);
+ if (iocb->ki_flags & IOCB_WAITQ)
+ error = lock_page_async(page, iocb->ki_waitq);
+ else
+ error = lock_page_killable(page);
if (unlikely(error))
goto readpage_error;
@@ -2190,10 +2209,7 @@ ssize_t generic_file_buffered_read(struct kiocb *iocb,
}
if (!PageUptodate(page)) {
- if (iocb->ki_flags & IOCB_WAITQ)
- error = lock_page_async(page, iocb->ki_waitq);
- else
- error = lock_page_killable(page);
+ error = lock_page_killable(page);
if (unlikely(error))
goto readpage_error;
if (!PageUptodate(page)) {
--
2.27.0
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH 09/15] fs: add FMODE_BUF_RASYNC
2020-06-18 14:43 Jens Axboe
` (7 preceding siblings ...)
2020-06-18 14:43 ` [PATCH 08/15] mm: support async buffered reads in generic_file_buffered_read() Jens Axboe
@ 2020-06-18 14:43 ` Jens Axboe
2020-06-18 14:43 ` [PATCH 10/15] block: flag block devices as supporting IOCB_WAITQ Jens Axboe
` (6 subsequent siblings)
15 siblings, 0 replies; 38+ messages in thread
From: Jens Axboe @ 2020-06-18 14:43 UTC (permalink / raw)
To: io-uring; +Cc: linux-fsdevel, linux-kernel, linux-mm, akpm, Jens Axboe
If set, this indicates that the file system supports IOCB_WAITQ for
buffered reads.
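For a filesystem that already goes through generic_file_read_iter(), opting
in is a one-line change in its open routine, along these lines (sketch only;
the per-filesystem patches later in this series do exactly this):
static int example_file_open(struct inode *inode, struct file *filp)
{
	/* buffered reads may now be retried via IOCB_WAITQ callbacks */
	filp->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC;
	return generic_file_open(inode, filp);
}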
Signed-off-by: Jens Axboe <[email protected]>
---
include/linux/fs.h | 3 +++
1 file changed, 3 insertions(+)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6ac919b40596..3f9de90e0266 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -175,6 +175,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
/* File does not contribute to nr_files count */
#define FMODE_NOACCOUNT ((__force fmode_t)0x20000000)
+/* File supports async buffered reads */
+#define FMODE_BUF_RASYNC ((__force fmode_t)0x40000000)
+
/*
* Flag for rw_copy_check_uvector and compat_rw_copy_check_uvector
* that indicates that they should check the contents of the iovec are
--
2.27.0
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH 10/15] block: flag block devices as supporting IOCB_WAITQ
2020-06-18 14:43 Jens Axboe
` (8 preceding siblings ...)
2020-06-18 14:43 ` [PATCH 09/15] fs: add FMODE_BUF_RASYNC Jens Axboe
@ 2020-06-18 14:43 ` Jens Axboe
2020-06-18 14:43 ` [PATCH 11/15] xfs: flag files as supporting buffered async reads Jens Axboe
` (5 subsequent siblings)
15 siblings, 0 replies; 38+ messages in thread
From: Jens Axboe @ 2020-06-18 14:43 UTC (permalink / raw)
To: io-uring; +Cc: linux-fsdevel, linux-kernel, linux-mm, akpm, Jens Axboe
Signed-off-by: Jens Axboe <[email protected]>
---
fs/block_dev.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 47860e589388..54720c90dad0 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1850,7 +1850,7 @@ static int blkdev_open(struct inode * inode, struct file * filp)
*/
filp->f_flags |= O_LARGEFILE;
- filp->f_mode |= FMODE_NOWAIT;
+ filp->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC;
if (filp->f_flags & O_NDELAY)
filp->f_mode |= FMODE_NDELAY;
--
2.27.0
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH 11/15] xfs: flag files as supporting buffered async reads
2020-06-18 14:43 Jens Axboe
` (9 preceding siblings ...)
2020-06-18 14:43 ` [PATCH 10/15] block: flag block devices as supporting IOCB_WAITQ Jens Axboe
@ 2020-06-18 14:43 ` Jens Axboe
2020-06-18 14:43 ` [PATCH 12/15] btrfs: " Jens Axboe
` (4 subsequent siblings)
15 siblings, 0 replies; 38+ messages in thread
From: Jens Axboe @ 2020-06-18 14:43 UTC (permalink / raw)
To: io-uring
Cc: linux-fsdevel, linux-kernel, linux-mm, akpm, Jens Axboe,
Darrick J . Wong
XFS uses generic_file_read_iter(), which already supports this.
Acked-by: Darrick J. Wong <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
---
fs/xfs/xfs_file.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 00db81eac80d..fdbff4860d61 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1080,7 +1080,7 @@ xfs_file_open(
return -EFBIG;
if (XFS_FORCED_SHUTDOWN(XFS_M(inode->i_sb)))
return -EIO;
- file->f_mode |= FMODE_NOWAIT;
+ file->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC;
return 0;
}
--
2.27.0
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH 12/15] btrfs: flag files as supporting buffered async reads
2020-06-18 14:43 Jens Axboe
` (10 preceding siblings ...)
2020-06-18 14:43 ` [PATCH 11/15] xfs: flag files as supporting buffered async reads Jens Axboe
@ 2020-06-18 14:43 ` Jens Axboe
2020-06-19 11:11 ` David Sterba
2020-06-18 14:43 ` [PATCH 13/15] ext4: flag " Jens Axboe
` (3 subsequent siblings)
15 siblings, 1 reply; 38+ messages in thread
From: Jens Axboe @ 2020-06-18 14:43 UTC (permalink / raw)
To: io-uring
Cc: linux-fsdevel, linux-kernel, linux-mm, akpm, Jens Axboe,
Chris Mason
btrfs uses generic_file_read_iter(), which already supports this.
Acked-by: Chris Mason <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
---
fs/btrfs/file.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 2c14312b05e8..234a418eb6da 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -3472,7 +3472,7 @@ static loff_t btrfs_file_llseek(struct file *file, loff_t offset, int whence)
static int btrfs_file_open(struct inode *inode, struct file *filp)
{
- filp->f_mode |= FMODE_NOWAIT;
+ filp->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC;
return generic_file_open(inode, filp);
}
--
2.27.0
^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH 12/15] btrfs: flag files as supporting buffered async reads
2020-06-18 14:43 ` [PATCH 12/15] btrfs: " Jens Axboe
@ 2020-06-19 11:11 ` David Sterba
0 siblings, 0 replies; 38+ messages in thread
From: David Sterba @ 2020-06-19 11:11 UTC (permalink / raw)
To: Jens Axboe
Cc: io-uring, linux-fsdevel, linux-kernel, linux-mm, akpm, Jens Axboe,
Chris Mason
On Thu, Jun 18, 2020 at 08:43:52AM -0600, Jens Axboe wrote:
> btrfs uses generic_file_read_iter(), which already supports this.
>
> Acked-by: Chris Mason <[email protected]>
> Signed-off-by: Jens Axboe <[email protected]>
Can you please CC the mailinglist for btrfs patches? Thanks.
^ permalink raw reply [flat|nested] 38+ messages in thread
* [PATCH 13/15] ext4: flag as supporting buffered async reads
2020-06-18 14:43 Jens Axboe
` (11 preceding siblings ...)
2020-06-18 14:43 ` [PATCH 12/15] btrfs: " Jens Axboe
@ 2020-06-18 14:43 ` Jens Axboe
2020-06-18 14:43 ` [PATCH 14/15] mm: add kiocb_wait_page_queue_init() helper Jens Axboe
` (2 subsequent siblings)
15 siblings, 0 replies; 38+ messages in thread
From: Jens Axboe @ 2020-06-18 14:43 UTC (permalink / raw)
To: io-uring
Cc: linux-fsdevel, linux-kernel, linux-mm, akpm, Jens Axboe,
Theodore Ts'o
ext4 uses generic_file_read_iter(), which already supports this.
Cc: Theodore Ts'o <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
---
fs/ext4/file.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 2a01e31a032c..1e827410e9e1 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -839,7 +839,7 @@ static int ext4_file_open(struct inode * inode, struct file * filp)
return ret;
}
- filp->f_mode |= FMODE_NOWAIT;
+ filp->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC;
return dquot_file_open(inode, filp);
}
--
2.27.0
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH 14/15] mm: add kiocb_wait_page_queue_init() helper
2020-06-18 14:43 Jens Axboe
` (12 preceding siblings ...)
2020-06-18 14:43 ` [PATCH 13/15] ext4: flag " Jens Axboe
@ 2020-06-18 14:43 ` Jens Axboe
2020-06-18 14:43 ` [PATCH 15/15] io_uring: support true async buffered reads, if file provides it Jens Axboe
2020-06-18 14:45 ` [PATCHSET v7 0/12] Add support for async buffered reads Jens Axboe
15 siblings, 0 replies; 38+ messages in thread
From: Jens Axboe @ 2020-06-18 14:43 UTC (permalink / raw)
To: io-uring
Cc: linux-fsdevel, linux-kernel, linux-mm, akpm, Jens Axboe,
Johannes Weiner
Check whether the file supports callback-based waits (FMODE_BUF_RASYNC),
and initialize the values we need. The caller passes in a 'data' pointer,
if any, and the wake callback function to be used.
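A caller might arm a kiocb with it roughly as follows (the wake function and
private pointer are placeholders for whatever the caller uses):
static bool example_arm_async_wait(struct kiocb *kiocb,
				   struct wait_page_queue *wpq,
				   wait_queue_func_t wake_fn,
				   void *private)
{
	/* -EINVAL for polled IO, -EOPNOTSUPP without FMODE_BUF_RASYNC */
	return kiocb_wait_page_queue_init(kiocb, wpq, wake_fn, private) == 0;
}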
Acked-by: Johannes Weiner <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
---
include/linux/pagemap.h | 21 +++++++++++++++++++++
1 file changed, 21 insertions(+)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index e053e1d9a4d7..7386bc67cc5a 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -533,6 +533,27 @@ static inline int wake_page_match(struct wait_page_queue *wait_page,
return 1;
}
+static inline int kiocb_wait_page_queue_init(struct kiocb *kiocb,
+ struct wait_page_queue *wait,
+ wait_queue_func_t func,
+ void *data)
+{
+ /* Can't support async wakeup with polled IO */
+ if (kiocb->ki_flags & IOCB_HIPRI)
+ return -EINVAL;
+ if (kiocb->ki_filp->f_mode & FMODE_BUF_RASYNC) {
+ wait->wait.func = func;
+ wait->wait.private = data;
+ wait->wait.flags = 0;
+ INIT_LIST_HEAD(&wait->wait.entry);
+ kiocb->ki_flags |= IOCB_WAITQ;
+ kiocb->ki_waitq = wait;
+ return 0;
+ }
+
+ return -EOPNOTSUPP;
+}
+
extern void __lock_page(struct page *page);
extern int __lock_page_killable(struct page *page);
extern int __lock_page_async(struct page *page, struct wait_page_queue *wait);
--
2.27.0
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH 15/15] io_uring: support true async buffered reads, if file provides it
2020-06-18 14:43 Jens Axboe
` (13 preceding siblings ...)
2020-06-18 14:43 ` [PATCH 14/15] mm: add kiocb_wait_page_queue_init() helper Jens Axboe
@ 2020-06-18 14:43 ` Jens Axboe
2020-06-23 12:39 ` Pavel Begunkov
2020-06-18 14:45 ` [PATCHSET v7 0/12] Add support for async buffered reads Jens Axboe
15 siblings, 1 reply; 38+ messages in thread
From: Jens Axboe @ 2020-06-18 14:43 UTC (permalink / raw)
To: io-uring; +Cc: linux-fsdevel, linux-kernel, linux-mm, akpm, Jens Axboe
If the file is flagged with FMODE_BUF_RASYNC, then we don't have to punt
the buffered read to an io-wq worker. Instead we can rely on page
unlocking callbacks to support retry based async IO. This is a lot more
efficient than doing async thread offload.
The retry is done similarly to how we handle poll based retry. From
the unlock callback, we simply queue the retry to a task_work based
handler.
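From the application side nothing changes; a normal buffered read through
io_uring simply avoids the io-wq punt when the file supports it. A
liburing-style sketch (assuming liburing is available; fd, buf, and len are
whatever the application already has open and allocated):
#include <liburing.h>
static int read_once(int fd, void *buf, unsigned len)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	int ret;
	ret = io_uring_queue_init(8, &ring, 0);
	if (ret)
		return ret;
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_read(sqe, fd, buf, len, 0);
	io_uring_submit(&ring);
	ret = io_uring_wait_cqe(&ring, &cqe);
	if (!ret) {
		ret = cqe->res;		/* bytes read, or -errno */
		io_uring_cqe_seen(&ring, cqe);
	}
	io_uring_queue_exit(&ring);
	return ret;
}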
Signed-off-by: Jens Axboe <[email protected]>
---
fs/io_uring.c | 145 +++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 137 insertions(+), 8 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 40413fb9d07b..94282be1c413 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -78,6 +78,7 @@
#include <linux/fs_struct.h>
#include <linux/splice.h>
#include <linux/task_work.h>
+#include <linux/pagemap.h>
#define CREATE_TRACE_POINTS
#include <trace/events/io_uring.h>
@@ -503,6 +504,8 @@ struct io_async_rw {
struct iovec *iov;
ssize_t nr_segs;
ssize_t size;
+ struct wait_page_queue wpq;
+ struct callback_head task_work;
};
struct io_async_ctx {
@@ -2750,6 +2753,126 @@ static int io_read_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe,
return 0;
}
+static void __io_async_buf_error(struct io_kiocb *req, int error)
+{
+ struct io_ring_ctx *ctx = req->ctx;
+
+ spin_lock_irq(&ctx->completion_lock);
+ io_cqring_fill_event(req, error);
+ io_commit_cqring(ctx);
+ spin_unlock_irq(&ctx->completion_lock);
+
+ io_cqring_ev_posted(ctx);
+ req_set_fail_links(req);
+ io_double_put_req(req);
+}
+
+static void io_async_buf_cancel(struct callback_head *cb)
+{
+ struct io_async_rw *rw;
+ struct io_kiocb *req;
+
+ rw = container_of(cb, struct io_async_rw, task_work);
+ req = rw->wpq.wait.private;
+ __io_async_buf_error(req, -ECANCELED);
+}
+
+static void io_async_buf_retry(struct callback_head *cb)
+{
+ struct io_async_rw *rw;
+ struct io_ring_ctx *ctx;
+ struct io_kiocb *req;
+
+ rw = container_of(cb, struct io_async_rw, task_work);
+ req = rw->wpq.wait.private;
+ ctx = req->ctx;
+
+ __set_current_state(TASK_RUNNING);
+ if (!io_sq_thread_acquire_mm(ctx, req)) {
+ mutex_lock(&ctx->uring_lock);
+ __io_queue_sqe(req, NULL);
+ mutex_unlock(&ctx->uring_lock);
+ } else {
+ __io_async_buf_error(req, -EFAULT);
+ }
+}
+
+static int io_async_buf_func(struct wait_queue_entry *wait, unsigned mode,
+ int sync, void *arg)
+{
+ struct wait_page_queue *wpq;
+ struct io_kiocb *req = wait->private;
+ struct io_async_rw *rw = &req->io->rw;
+ struct wait_page_key *key = arg;
+ struct task_struct *tsk;
+ int ret;
+
+ wpq = container_of(wait, struct wait_page_queue, wait);
+
+ ret = wake_page_match(wpq, key);
+ if (ret != 1)
+ return ret;
+
+ list_del_init(&wait->entry);
+
+ init_task_work(&rw->task_work, io_async_buf_retry);
+ /* submit ref gets dropped, acquire a new one */
+ refcount_inc(&req->refs);
+ tsk = req->task;
+ ret = task_work_add(tsk, &rw->task_work, true);
+ if (unlikely(ret)) {
+ /* queue just for cancelation */
+ init_task_work(&rw->task_work, io_async_buf_cancel);
+ tsk = io_wq_get_task(req->ctx->io_wq);
+ task_work_add(tsk, &rw->task_work, true);
+ }
+ wake_up_process(tsk);
+ return 1;
+}
+
+static bool io_rw_should_retry(struct io_kiocb *req)
+{
+ struct kiocb *kiocb = &req->rw.kiocb;
+ int ret;
+
+ /* never retry for NOWAIT, we just complete with -EAGAIN */
+ if (req->flags & REQ_F_NOWAIT)
+ return false;
+
+ /* already tried, or we're doing O_DIRECT */
+ if (kiocb->ki_flags & (IOCB_DIRECT | IOCB_WAITQ))
+ return false;
+ /*
+ * just use poll if we can, and don't attempt if the fs doesn't
+ * support callback based unlocks
+ */
+ if (file_can_poll(req->file) || !(req->file->f_mode & FMODE_BUF_RASYNC))
+ return false;
+
+ /*
+ * If request type doesn't require req->io to defer in general,
+ * we need to allocate it here
+ */
+ if (!req->io && __io_alloc_async_ctx(req))
+ return false;
+
+ ret = kiocb_wait_page_queue_init(kiocb, &req->io->rw.wpq,
+ io_async_buf_func, req);
+ if (!ret) {
+ io_get_req_task(req);
+ return true;
+ }
+
+ return false;
+}
+
+static int io_iter_do_read(struct io_kiocb *req, struct iov_iter *iter)
+{
+ if (req->file->f_op->read_iter)
+ return call_read_iter(req->file, &req->rw.kiocb, iter);
+ return loop_rw_iter(READ, req->file, &req->rw.kiocb, iter);
+}
+
static int io_read(struct io_kiocb *req, bool force_nonblock)
{
struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
@@ -2784,10 +2907,7 @@ static int io_read(struct io_kiocb *req, bool force_nonblock)
unsigned long nr_segs = iter.nr_segs;
ssize_t ret2 = 0;
- if (req->file->f_op->read_iter)
- ret2 = call_read_iter(req->file, kiocb, &iter);
- else
- ret2 = loop_rw_iter(READ, req->file, kiocb, &iter);
+ ret2 = io_iter_do_read(req, &iter);
/* Catch -EAGAIN return for forced non-blocking submission */
if (!force_nonblock || (ret2 != -EAGAIN && ret2 != -EIO)) {
@@ -2799,17 +2919,26 @@ static int io_read(struct io_kiocb *req, bool force_nonblock)
ret = io_setup_async_rw(req, io_size, iovec,
inline_vecs, &iter);
if (ret)
- goto out_free;
+ goto out;
/* any defer here is final, must blocking retry */
if (!(req->flags & REQ_F_NOWAIT) &&
!file_can_poll(req->file))
req->flags |= REQ_F_MUST_PUNT;
+ /* if we can retry, do so with the callbacks armed */
+ if (io_rw_should_retry(req)) {
+ ret2 = io_iter_do_read(req, &iter);
+ if (ret2 == -EIOCBQUEUED) {
+ goto out;
+ } else if (ret2 != -EAGAIN) {
+ kiocb_done(kiocb, ret2);
+ goto out;
+ }
+ }
+ kiocb->ki_flags &= ~IOCB_WAITQ;
return -EAGAIN;
}
}
-out_free:
- kfree(iovec);
- req->flags &= ~REQ_F_NEED_CLEANUP;
+out:
return ret;
}
--
2.27.0
^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH 15/15] io_uring: support true async buffered reads, if file provides it
2020-06-18 14:43 ` [PATCH 15/15] io_uring: support true async buffered reads, if file provides it Jens Axboe
@ 2020-06-23 12:39 ` Pavel Begunkov
2020-06-23 14:38 ` Jens Axboe
0 siblings, 1 reply; 38+ messages in thread
From: Pavel Begunkov @ 2020-06-23 12:39 UTC (permalink / raw)
To: io-uring; +Cc: linux-fsdevel, linux-kernel, linux-mm, akpm, Jens Axboe
On 18/06/2020 17:43, Jens Axboe wrote:
> If the file is flagged with FMODE_BUF_RASYNC, then we don't have to punt
> the buffered read to an io-wq worker. Instead we can rely on page
> unlocking callbacks to support retry based async IO. This is a lot more
> efficient than doing async thread offload.
>
> The retry is done similarly to how we handle poll based retry. From
> the unlock callback, we simply queue the retry to a task_work based
> handler.
>
> Signed-off-by: Jens Axboe <[email protected]>
> ---
> fs/io_uring.c | 145 +++++++++++++++++++++++++++++++++++++++++++++++---
> 1 file changed, 137 insertions(+), 8 deletions(-)
>
...
> static int io_read(struct io_kiocb *req, bool force_nonblock)
> {
> struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
> @@ -2784,10 +2907,7 @@ static int io_read(struct io_kiocb *req, bool force_nonblock)
> unsigned long nr_segs = iter.nr_segs;
> ssize_t ret2 = 0;
>
> - if (req->file->f_op->read_iter)
> - ret2 = call_read_iter(req->file, kiocb, &iter);
> - else
> - ret2 = loop_rw_iter(READ, req->file, kiocb, &iter);
> + ret2 = io_iter_do_read(req, &iter);
>
> /* Catch -EAGAIN return for forced non-blocking submission */
> if (!force_nonblock || (ret2 != -EAGAIN && ret2 != -EIO)) {
> @@ -2799,17 +2919,26 @@ static int io_read(struct io_kiocb *req, bool force_nonblock)
> ret = io_setup_async_rw(req, io_size, iovec,
> inline_vecs, &iter);
> if (ret)
> - goto out_free;
> + goto out;
> /* any defer here is final, must blocking retry */
> if (!(req->flags & REQ_F_NOWAIT) &&
> !file_can_poll(req->file))
> req->flags |= REQ_F_MUST_PUNT;
> + /* if we can retry, do so with the callbacks armed */
> + if (io_rw_should_retry(req)) {
> + ret2 = io_iter_do_read(req, &iter);
> + if (ret2 == -EIOCBQUEUED) {
> + goto out;
> + } else if (ret2 != -EAGAIN) {
> + kiocb_done(kiocb, ret2);
> + goto out;
> + }
> + }
> + kiocb->ki_flags &= ~IOCB_WAITQ;
> return -EAGAIN;
> }
> }
> -out_free:
> - kfree(iovec);
> - req->flags &= ~REQ_F_NEED_CLEANUP;
This looks fishy. For instance, if it fails early on rw_verify_area(), how
does the iovec get freed? Is this handled somewhere?
> +out:
> return ret;
> }
>
>
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 15/15] io_uring: support true async buffered reads, if file provides it
2020-06-23 12:39 ` Pavel Begunkov
@ 2020-06-23 14:38 ` Jens Axboe
0 siblings, 0 replies; 38+ messages in thread
From: Jens Axboe @ 2020-06-23 14:38 UTC (permalink / raw)
To: Pavel Begunkov, io-uring; +Cc: linux-fsdevel, linux-kernel, linux-mm, akpm
On 6/23/20 6:39 AM, Pavel Begunkov wrote:
> On 18/06/2020 17:43, Jens Axboe wrote:
>> If the file is flagged with FMODE_BUF_RASYNC, then we don't have to punt
>> the buffered read to an io-wq worker. Instead we can rely on page
>> unlocking callbacks to support retry based async IO. This is a lot more
>> efficient than doing async thread offload.
>>
>> The retry is done similarly to how we handle poll based retry. From
>> the unlock callback, we simply queue the retry to a task_work based
>> handler.
>>
>> Signed-off-by: Jens Axboe <[email protected]>
>> ---
>> fs/io_uring.c | 145 +++++++++++++++++++++++++++++++++++++++++++++++---
>> 1 file changed, 137 insertions(+), 8 deletions(-)
>>
> ...
>> static int io_read(struct io_kiocb *req, bool force_nonblock)
>> {
>> struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
>> @@ -2784,10 +2907,7 @@ static int io_read(struct io_kiocb *req, bool force_nonblock)
>> unsigned long nr_segs = iter.nr_segs;
>> ssize_t ret2 = 0;
>>
>> - if (req->file->f_op->read_iter)
>> - ret2 = call_read_iter(req->file, kiocb, &iter);
>> - else
>> - ret2 = loop_rw_iter(READ, req->file, kiocb, &iter);
>> + ret2 = io_iter_do_read(req, &iter);
>>
>> /* Catch -EAGAIN return for forced non-blocking submission */
>> if (!force_nonblock || (ret2 != -EAGAIN && ret2 != -EIO)) {
>> @@ -2799,17 +2919,26 @@ static int io_read(struct io_kiocb *req, bool force_nonblock)
>> ret = io_setup_async_rw(req, io_size, iovec,
>> inline_vecs, &iter);
>> if (ret)
>> - goto out_free;
>> + goto out;
>> /* any defer here is final, must blocking retry */
>> if (!(req->flags & REQ_F_NOWAIT) &&
>> !file_can_poll(req->file))
>> req->flags |= REQ_F_MUST_PUNT;
>> + /* if we can retry, do so with the callbacks armed */
>> + if (io_rw_should_retry(req)) {
>> + ret2 = io_iter_do_read(req, &iter);
>> + if (ret2 == -EIOCBQUEUED) {
>> + goto out;
>> + } else if (ret2 != -EAGAIN) {
>> + kiocb_done(kiocb, ret2);
>> + goto out;
>> + }
>> + }
>> + kiocb->ki_flags &= ~IOCB_WAITQ;
>> return -EAGAIN;
>> }
>> }
>> -out_free:
>> - kfree(iovec);
>> - req->flags &= ~REQ_F_NEED_CLEANUP;
>
> This looks fishy. For instance, if it fails early on rw_verify_area(), how would
> it free yet on-stack iovec? Is it handled somehow?
This was tweaked and rebased on top of the REQ_F_NEED_CLEANUP change;
it should be correct in the tree:
https://git.kernel.dk/cgit/linux-block/tree/fs/io_uring.c?h=for-5.9/io_uring#n2908
--
Jens Axboe
^ permalink raw reply [flat|nested] 38+ messages in thread
* [PATCHSET v7 0/12] Add support for async buffered reads
2020-06-18 14:43 Jens Axboe
` (14 preceding siblings ...)
2020-06-18 14:43 ` [PATCH 15/15] io_uring: support true async buffered reads, if file provides it Jens Axboe
@ 2020-06-18 14:45 ` Jens Axboe
15 siblings, 0 replies; 38+ messages in thread
From: Jens Axboe @ 2020-06-18 14:45 UTC (permalink / raw)
To: io-uring; +Cc: linux-fsdevel, linux-kernel, linux-mm, akpm
On 6/18/20 8:43 AM, Jens Axboe wrote:
> We technically support this already through io_uring, but it's
> implemented with a thread backend to support cases where we would
> block. This isn't ideal.
Apparently I'm cursed when I send these out; I've corrected the subject
line to indicate that this is v7 of the series...
--
Jens Axboe
^ permalink raw reply [flat|nested] 38+ messages in thread