* [PATCHSET v3 0/3] Add ability to save/restore iov_iter state @ 2021-09-15 16:29 Jens Axboe 2021-09-15 16:29 ` [PATCH 1/3] iov_iter: add helper to save " Jens Axboe ` (4 more replies) 0 siblings, 5 replies; 17+ messages in thread From: Jens Axboe @ 2021-09-15 16:29 UTC (permalink / raw) To: io-uring, linux-fsdevel; +Cc: torvalds, viro Hi, Linus didn't particularly love the iov_iter->truncated addition and how it was used, and it was hard to disagree with that. Instead of relying on tracking ->truncated, add a few pieces of state so we can safely handle partial or errored read/write attempts (that we want to retry). Then we can get rid of the iov_iter addition, and at the same time handle cases that weren't handled correctly before. I've run this through vectored read/write with io_uring on the commonly problematic cases (dm and low depth SCSI device) which trigger these conditions often, and it seems to pass muster. I've also hacked in randomly faked short reads, and that helped find an issue with double accounting. But it did validate the state handling otherwise. For a discussion on this topic, see the thread here: https://lore.kernel.org/linux-fsdevel/CAHk-=wiacKV4Gh-MYjteU0LwNBSGpWrK-Ov25HdqB1ewinrFPg@mail.gmail.com/ You can find these patches here: https://git.kernel.dk/cgit/linux-block/log/?h=iov_iter.3 Changes since v2: - Add comments in io_read() on the flow - Fix issue with rw->bytes_done being incremented too early, and hence double accounting, if we enter that bottom do {} while loop in io_read() - Restore iter at the bottom of the do {} while loop in io_read() Changes since v1: - Drop 'did_bytes' from iov_iter_restore(). Only two cases in io_uring used it, and one of them had to be changed with v2. Better to just make the subsequent iov_iter_advance() explicit at that point. - Cleanup and sanitize the io_uring side, and ensure it's sane around worker retries. No more digging into iov_iter_state from io_uring, we use it just for save/restore purposes. -- Jens Axboe ^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH 1/3] iov_iter: add helper to save iov_iter state 2021-09-15 16:29 [PATCHSET v3 0/3] Add ability to save/restore iov_iter state Jens Axboe @ 2021-09-15 16:29 ` Jens Axboe 2021-09-15 16:29 ` [PATCH 2/3] io_uring: use iov_iter state save/restore helpers Jens Axboe ` (3 subsequent siblings) 4 siblings, 0 replies; 17+ messages in thread From: Jens Axboe @ 2021-09-15 16:29 UTC (permalink / raw) To: io-uring, linux-fsdevel; +Cc: torvalds, viro, Jens Axboe In an ideal world, when someone is passed an iov_iter and returns X bytes, then X bytes would have been consumed/advanced from the iov_iter. But we have use cases that always consume the entire iterator; a few examples of that are iomap and bdev O_DIRECT. This means we cannot rely on the state of the iov_iter once we've called ->read_iter() or ->write_iter(). This would be easier if we didn't always have to deal with truncate of the iov_iter, as rewinding would be trivial without that. We recently added a commit to track the truncate state, but that grew the iov_iter by 8 bytes and wasn't the best solution. Implement a helper to save enough of the iov_iter state to sanely restore it after we've called the read/write iterator helpers. This currently only works for IOVEC/BVEC/KVEC as that's all we need; support for other iterator types is left as an exercise for the reader. Link: https://lore.kernel.org/linux-fsdevel/CAHk-=wiacKV4Gh-MYjteU0LwNBSGpWrK-Ov25HdqB1ewinrFPg@mail.gmail.com/ Signed-off-by: Jens Axboe <[email protected]> --- include/linux/uio.h | 15 +++++++++++++++ lib/iov_iter.c | 36 ++++++++++++++++++++++++++++++++++++ 2 files changed, 51 insertions(+) diff --git a/include/linux/uio.h b/include/linux/uio.h index 5265024e8b90..984c4ab74859 100644 --- a/include/linux/uio.h +++ b/include/linux/uio.h @@ -27,6 +27,12 @@ enum iter_type { ITER_DISCARD, }; +struct iov_iter_state { + size_t iov_offset; + size_t count; + unsigned long nr_segs; +}; + struct iov_iter { u8 iter_type; bool data_source; @@ -55,6 +61,14 @@ static inline enum iter_type iov_iter_type(const struct iov_iter *i) return i->iter_type; } +static inline void iov_iter_save_state(struct iov_iter *iter, + struct iov_iter_state *state) +{ + state->iov_offset = iter->iov_offset; + state->count = iter->count; + state->nr_segs = iter->nr_segs; +} + static inline bool iter_is_iovec(const struct iov_iter *i) { return iov_iter_type(i) == ITER_IOVEC; } @@ -233,6 +247,7 @@ ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages, ssize_t iov_iter_get_pages_alloc(struct iov_iter *i, struct page ***pages, size_t maxsize, size_t *start); int iov_iter_npages(const struct iov_iter *i, int maxpages); +void iov_iter_restore(struct iov_iter *i, struct iov_iter_state *state); const void *dup_iter(struct iov_iter *new, struct iov_iter *old, gfp_t flags); diff --git a/lib/iov_iter.c b/lib/iov_iter.c index f2d50d69a6c3..755c10c5138c 100644 --- a/lib/iov_iter.c +++ b/lib/iov_iter.c @@ -1972,3 +1972,39 @@ int import_single_range(int rw, void __user *buf, size_t len, return 0; } EXPORT_SYMBOL(import_single_range); + +/** + * iov_iter_restore() - Restore a &struct iov_iter to the same state as when + * iov_iter_save_state() was called. + * + * @i: &struct iov_iter to restore + * @state: state to restore from + * + * Used after iov_iter_save_state() to restore @i, if operations may + * have advanced it.
+ * + * Note: only works on ITER_IOVEC, ITER_BVEC, and ITER_KVEC + */ +void iov_iter_restore(struct iov_iter *i, struct iov_iter_state *state) +{ + if (WARN_ON_ONCE(!iov_iter_is_bvec(i) && !iter_is_iovec(i) && + !iov_iter_is_kvec(i))) + return; + i->iov_offset = state->iov_offset; + i->count = state->count; + /* + * For the *vec iters, nr_segs + iov is constant - if we increment + * the vec, then we also decrement the nr_segs count. Hence we don't + * need to track both of these, just one is enough and we can deduce + * the other from that. ITER_KVEC and ITER_IOVEC are the same struct + * size, so we can just increment the iov pointer as they alias in the + * same union. ITER_BVEC _may_ be the same size on some archs, but on + * others it is not. Be safe and handle it separately. + */ + BUILD_BUG_ON(sizeof(struct iovec) != sizeof(struct kvec)); + if (iov_iter_is_bvec(i)) + i->bvec -= state->nr_segs - i->nr_segs; + else + i->iov -= state->nr_segs - i->nr_segs; + i->nr_segs = state->nr_segs; +} -- 2.33.0 ^ permalink raw reply related [flat|nested] 17+ messages in thread
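To make the intended usage concrete, here is a minimal sketch of the calling pattern the new helpers enable. This is an illustration, not part of the patch; call_read_iter() is the regular VFS helper, and 'expected' stands for whatever byte count the caller asked for:

	struct iov_iter_state state;
	ssize_t ret;

	iov_iter_save_state(&iter, &state);	/* snapshot iov_offset/count/nr_segs */
	ret = call_read_iter(file, &kiocb, &iter);
	if (ret == -EAGAIN || (ret > 0 && ret < expected)) {
		/* ->read_iter() may have advanced or fully consumed the iter */
		iov_iter_restore(&iter, &state);
		if (ret > 0)
			iov_iter_advance(&iter, ret);	/* keep only completed bytes */
		/* ... arrange to retry the remainder ... */
	}

The key property is that iov_iter_restore() rewinds to the snapshot unconditionally, and any real progress is then re-applied explicitly with iov_iter_advance().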
* [PATCH 2/3] io_uring: use iov_iter state save/restore helpers 2021-09-15 16:29 [PATCHSET v3 0/3] Add ability to save/restore iov_iter state Jens Axboe 2021-09-15 16:29 ` [PATCH 1/3] iov_iter: add helper to save " Jens Axboe @ 2021-09-15 16:29 ` Jens Axboe 2021-09-15 16:29 ` [PATCH 3/3] Revert "iov_iter: track truncated size" Jens Axboe ` (2 subsequent siblings) 4 siblings, 0 replies; 17+ messages in thread From: Jens Axboe @ 2021-09-15 16:29 UTC (permalink / raw) To: io-uring, linux-fsdevel; +Cc: torvalds, viro, Jens Axboe Get rid of the need to do re-expand and revert on an iterator when we encounter a short IO, or failure that warrants a retry. Use the new state save/restore helpers instead. We keep the iov_iter_state persistent across retries, if we need to restart the read or write operation. If there's a pending retry, the operation will always exit with the state correctly saved. Signed-off-by: Jens Axboe <[email protected]> --- fs/io_uring.c | 82 ++++++++++++++++++++++++++++++++++++++------------- 1 file changed, 61 insertions(+), 21 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index 855ea544807f..25bda8a5a4e5 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -712,6 +712,7 @@ struct io_async_rw { struct iovec fast_iov[UIO_FASTIOV]; const struct iovec *free_iovec; struct iov_iter iter; + struct iov_iter_state iter_state; size_t bytes_done; struct wait_page_queue wpq; }; @@ -2608,8 +2609,7 @@ static bool io_resubmit_prep(struct io_kiocb *req) if (!rw) return !io_req_prep_async(req); - /* may have left rw->iter inconsistent on -EIOCBQUEUED */ - iov_iter_revert(&rw->iter, req->result - iov_iter_count(&rw->iter)); + iov_iter_restore(&rw->iter, &rw->iter_state); return true; } @@ -3310,12 +3310,17 @@ static int io_setup_async_rw(struct io_kiocb *req, const struct iovec *iovec, if (!force && !io_op_defs[req->opcode].needs_async_setup) return 0; if (!req->async_data) { + struct io_async_rw *iorw; + if (io_alloc_async_data(req)) { kfree(iovec); return -ENOMEM; } io_req_map_rw(req, iovec, fast_iov, iter); + iorw = req->async_data; + /* we've copied and mapped the iter, ensure state is saved */ + iov_iter_save_state(&iorw->iter, &iorw->iter_state); } return 0; } @@ -3334,6 +3339,7 @@ static inline int io_rw_prep_async(struct io_kiocb *req, int rw) iorw->free_iovec = iov; if (iov) req->flags |= REQ_F_NEED_CLEANUP; + iov_iter_save_state(&iorw->iter, &iorw->iter_state); return 0; } @@ -3437,19 +3443,28 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags) struct kiocb *kiocb = &req->rw.kiocb; struct iov_iter __iter, *iter = &__iter; struct io_async_rw *rw = req->async_data; - ssize_t io_size, ret, ret2; bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK; + struct iov_iter_state __state, *state; + ssize_t ret, ret2; if (rw) { iter = &rw->iter; + state = &rw->iter_state; + /* + * We come here from an earlier attempt, restore our state to + * match in case it doesn't. It's cheap enough that we don't + * need to make this conditional. 
+ */ + iov_iter_restore(iter, state); iovec = NULL; } else { ret = io_import_iovec(READ, req, &iovec, iter, !force_nonblock); if (ret < 0) return ret; + state = &__state; + iov_iter_save_state(iter, state); } - io_size = iov_iter_count(iter); - req->result = io_size; + req->result = iov_iter_count(iter); /* Ensure we clear previously set non-block flag */ if (!force_nonblock) @@ -3463,7 +3478,7 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags) return ret ?: -EAGAIN; } - ret = rw_verify_area(READ, req->file, io_kiocb_ppos(kiocb), io_size); + ret = rw_verify_area(READ, req->file, io_kiocb_ppos(kiocb), req->result); if (unlikely(ret)) { kfree(iovec); return ret; @@ -3479,30 +3494,49 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags) /* no retry on NONBLOCK nor RWF_NOWAIT */ if (req->flags & REQ_F_NOWAIT) goto done; - /* some cases will consume bytes even on error returns */ - iov_iter_reexpand(iter, iter->count + iter->truncated); - iov_iter_revert(iter, io_size - iov_iter_count(iter)); ret = 0; } else if (ret == -EIOCBQUEUED) { goto out_free; - } else if (ret <= 0 || ret == io_size || !force_nonblock || + } else if (ret <= 0 || ret == req->result || !force_nonblock || (req->flags & REQ_F_NOWAIT) || !need_read_all(req)) { /* read all, failed, already did sync or don't want to retry */ goto done; } + /* + * Don't depend on the iter state matching what was consumed, or being + * untouched in case of error. Restore it and we'll advance it + * manually if we need to. + */ + iov_iter_restore(iter, state); + ret2 = io_setup_async_rw(req, iovec, inline_vecs, iter, true); if (ret2) return ret2; iovec = NULL; rw = req->async_data; - /* now use our persistent iterator, if we aren't already */ - iter = &rw->iter; + /* + * Now use our persistent iterator and state, if we aren't already. + * We've restored and mapped the iter to match. + */ + if (iter != &rw->iter) { + iter = &rw->iter; + state = &rw->iter_state; + } do { - io_size -= ret; + /* + * We end up here because of a partial read, either from + * above or inside this loop. Advance the iter by the bytes + * that were consumed. + */ + iov_iter_advance(iter, ret); + if (!iov_iter_count(iter)) + break; rw->bytes_done += ret; + iov_iter_save_state(iter, state); + /* if we can retry, do so with the callbacks armed */ if (!io_rw_should_retry(req)) { kiocb->ki_flags &= ~IOCB_WAITQ; @@ -3520,7 +3554,8 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags) return 0; /* we got some bytes, but not all. retry. 
*/ kiocb->ki_flags &= ~IOCB_WAITQ; - } while (ret > 0 && ret < io_size); + iov_iter_restore(iter, state); + } while (ret > 0); done: kiocb_done(kiocb, ret, issue_flags); out_free: @@ -3543,19 +3578,24 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags) struct kiocb *kiocb = &req->rw.kiocb; struct iov_iter __iter, *iter = &__iter; struct io_async_rw *rw = req->async_data; - ssize_t ret, ret2, io_size; bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK; + struct iov_iter_state __state, *state; + ssize_t ret, ret2; if (rw) { iter = &rw->iter; + state = &rw->iter_state; + iov_iter_restore(iter, state); iovec = NULL; } else { ret = io_import_iovec(WRITE, req, &iovec, iter, !force_nonblock); if (ret < 0) return ret; + state = &__state; + iov_iter_save_state(iter, state); } - io_size = iov_iter_count(iter); - req->result = io_size; + req->result = iov_iter_count(iter); + ret2 = 0; /* Ensure we clear previously set non-block flag */ if (!force_nonblock) @@ -3572,7 +3612,7 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags) (req->flags & REQ_F_ISREG)) goto copy_iov; - ret = rw_verify_area(WRITE, req->file, io_kiocb_ppos(kiocb), io_size); + ret = rw_verify_area(WRITE, req->file, io_kiocb_ppos(kiocb), req->result); if (unlikely(ret)) goto out_free; @@ -3619,9 +3659,9 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags) kiocb_done(kiocb, ret2, issue_flags); } else { copy_iov: - /* some cases will consume bytes even on error returns */ - iov_iter_reexpand(iter, iter->count + iter->truncated); - iov_iter_revert(iter, io_size - iov_iter_count(iter)); + iov_iter_restore(iter, state); + if (ret2 > 0) + iov_iter_advance(iter, ret2); ret = io_setup_async_rw(req, iovec, inline_vecs, iter, false); return ret ?: -EAGAIN; } -- 2.33.0 ^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH 3/3] Revert "iov_iter: track truncated size" 2021-09-15 16:29 [PATCHSET v3 0/3] Add ability to save/restore iov_iter state Jens Axboe 2021-09-15 16:29 ` [PATCH 1/3] iov_iter: add helper to save " Jens Axboe 2021-09-15 16:29 ` [PATCH 2/3] io_uring: use iov_iter state save/restore helpers Jens Axboe @ 2021-09-15 16:29 ` Jens Axboe 2021-09-15 18:32 ` [PATCHSET v3 0/3] Add ability to save/restore iov_iter state Linus Torvalds 2021-09-16 4:47 ` Al Viro 4 siblings, 0 replies; 17+ messages in thread From: Jens Axboe @ 2021-09-15 16:29 UTC (permalink / raw) To: io-uring, linux-fsdevel; +Cc: torvalds, viro, Jens Axboe This reverts commit 2112ff5ce0c1128fe7b4d19cfe7f2b8ce5b595fa. We no longer need to track the truncation count; the one user that did need it has been converted to using iov_iter_restore() instead. Signed-off-by: Jens Axboe <[email protected]> --- include/linux/uio.h | 6 +----- 1 file changed, 1 insertion(+), 5 deletions(-) diff --git a/include/linux/uio.h b/include/linux/uio.h index 984c4ab74859..207101a9c5c3 100644 --- a/include/linux/uio.h +++ b/include/linux/uio.h @@ -53,7 +53,6 @@ struct iov_iter { }; loff_t xarray_start; }; - size_t truncated; }; static inline enum iter_type iov_iter_type(const struct iov_iter *i) @@ -270,10 +269,8 @@ static inline void iov_iter_truncate(struct iov_iter *i, u64 count) * conversion in assignment is by definition greater than all * values of size_t, including old i->count. */ - if (i->count > count) { - i->truncated += i->count - count; + if (i->count > count) i->count = count; - } } /* @@ -282,7 +279,6 @@ static inline void iov_iter_truncate(struct iov_iter *i, u64 count) */ static inline void iov_iter_reexpand(struct iov_iter *i, size_t count) { - i->truncated -= count - i->count; i->count = count; } -- 2.33.0 ^ permalink raw reply related [flat|nested] 17+ messages in thread
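For context on the pair being reverted: iov_iter_truncate() and iov_iter_reexpand() are used by callers that temporarily cap an iterator, roughly like this (sketch; max_bytes and do_io() are placeholders, not from the patch):

	size_t old_count = iov_iter_count(iter);

	iov_iter_truncate(iter, max_bytes);	/* cap what the op may consume */
	ret = do_io(iter);
	iov_iter_reexpand(iter, old_count);	/* put the original count back */

With ->truncated gone, reexpand is once again a plain count assignment, and rewinding after a short or failed IO is handled by iov_iter_restore() instead.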
* Re: [PATCHSET v3 0/3] Add ability to save/restore iov_iter state 2021-09-15 16:29 [PATCHSET v3 0/3] Add ability to save/restore iov_iter state Jens Axboe ` (2 preceding siblings ...) 2021-09-15 16:29 ` [PATCH 3/3] Revert "iov_iter: track truncated size" Jens Axboe @ 2021-09-15 18:32 ` Linus Torvalds 2021-09-15 18:46 ` Jens Axboe 2021-09-16 4:47 ` Al Viro 4 siblings, 1 reply; 17+ messages in thread From: Linus Torvalds @ 2021-09-15 18:32 UTC (permalink / raw) To: Jens Axboe; +Cc: io-uring, linux-fsdevel, Al Viro On Wed, Sep 15, 2021 at 9:29 AM Jens Axboe <[email protected]> wrote: > > I've run this through vectored read/write with io_uring on the commonly > problematic cases (dm and low depth SCSI device) which trigger these > conditions often, and it seems to pass muster. I've also hacked in > randomly faked short reads, and that helped find an issue with double > accounting. But it did validate the state handling otherwise. Ok, so I can't see anything obviously wrong with this, or anything I can object to. It's still fairly complicated, and I don't love how hard it is to follow some of it, but I do believe it's better. IOW, I don't have any objections. Al was saying he was looking at the io_uring code, so maybe he'll find something. Do you have these test-cases as some kind of test-suite so that this all stays correct? Linus ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCHSET v3 0/3] Add ability to save/restore iov_iter state 2021-09-15 18:32 ` [PATCHSET v3 0/3] Add ability to save/restore iov_iter state Linus Torvalds @ 2021-09-15 18:46 ` Jens Axboe 2021-09-15 19:26 ` Linus Torvalds 0 siblings, 1 reply; 17+ messages in thread From: Jens Axboe @ 2021-09-15 18:46 UTC (permalink / raw) To: Linus Torvalds; +Cc: io-uring, linux-fsdevel, Al Viro On 9/15/21 12:32 PM, Linus Torvalds wrote: > On Wed, Sep 15, 2021 at 9:29 AM Jens Axboe <[email protected]> wrote: >> >> I've run this through vectored read/write with io_uring on the commonly >> problematic cases (dm and low depth SCSI device) which trigger these >> conditions often, and it seems to pass muster. I've also hacked in >> randomly faked short reads, and that helped find an issue with double >> accounting. But it did validate the state handling otherwise. > > Ok, so I can't see anything obviously wrong with this, or anything I > can object to. It's still fairly complicated, and I don't love how > hard it is to follow some of it, but I do believe it's better. OK good > IOW, I don't have any objections. Al was saying he was looking at the > io_uring code, so maybe he'll find something. > > Do you have these test-cases as some kind of test-suite so that this > all stays correct? Yep, liburing has a whole bunch of regression tests that we always run for any change, and new cases are added as problems are found. That also has test cases for new features, etc. This one is particularly difficult to test and have confidence in, which is why I ended up hacking up that faked short return so I knew I had exercised all of it. The usual tests do end up hitting the -EAGAIN path quite easily for certain device types, but not the short read/write. -- Jens Axboe ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCHSET v3 0/3] Add ability to save/restore iov_iter state 2021-09-15 18:46 ` Jens Axboe @ 2021-09-15 19:26 ` Linus Torvalds 2021-09-15 19:40 ` Jens Axboe 0 siblings, 1 reply; 17+ messages in thread From: Linus Torvalds @ 2021-09-15 19:26 UTC (permalink / raw) To: Jens Axboe; +Cc: io-uring, linux-fsdevel, Al Viro On Wed, Sep 15, 2021 at 11:46 AM Jens Axboe <[email protected]> wrote: > > The usual tests > do end up hitting the -EAGAIN path quite easily for certain device > types, but not the short read/write. No way to do something like "read in file to make sure it's cached, then invalidate caches from position X with POSIX_FADV_DONTNEED, then do a read that crosses that cached/uncached boundary"? To at least verify that "partly synchronous, but partly punted to async" case? Or were you talking about some other situation? Linus ^ permalink raw reply [flat|nested] 17+ messages in thread
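In code terms, Linus's suggestion amounts to something like this userspace sketch (illustrative only; fd, off, and len are assumptions):

	#include <fcntl.h>
	#include <stdlib.h>
	#include <unistd.h>

	/*
	 * Warm the cache for [0, len), then drop [off, len) from the page
	 * cache so a read crossing 'off' is partly cached, partly not: the
	 * cached part can complete inline, the rest gets punted async.
	 */
	static int make_cache_boundary(int fd, off_t off, size_t len)
	{
		char *buf = malloc(len);

		if (!buf || pread(fd, buf, len, 0) != (ssize_t) len) {
			free(buf);
			return -1;
		}
		free(buf);
		return posix_fadvise(fd, off, len - off, POSIX_FADV_DONTNEED);
	}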
* Re: [PATCHSET v3 0/3] Add ability to save/restore iov_iter state 2021-09-15 19:26 ` Linus Torvalds @ 2021-09-15 19:40 ` Jens Axboe 2021-09-15 22:42 ` Jens Axboe 0 siblings, 1 reply; 17+ messages in thread From: Jens Axboe @ 2021-09-15 19:40 UTC (permalink / raw) To: Linus Torvalds; +Cc: io-uring, linux-fsdevel, Al Viro On 9/15/21 1:26 PM, Linus Torvalds wrote: > On Wed, Sep 15, 2021 at 11:46 AM Jens Axboe <[email protected]> wrote: >> >> The usual tests >> do end up hitting the -EAGAIN path quite easily for certain device >> types, but not the short read/write. > > No way to do something like "read in file to make sure it's cached, > then invalidate caches from position X with POSIX_FADV_DONTNEED, then > do a read that crosses that cached/uncached boundary"? > > To at least verify that "partly synchronous, but partly punted to > async" case? > > Or were you talking about some other situation? No, that covers some of it, and that happens naturally with buffered IO. The typical case is -EAGAIN on the first try, then you get a partial or all of it the next loop, and then done or continue. I tend to run fio verification workloads for that, as you get all the flexibility of fio with the data verification. And there are tests in there that run DONTNEED in parallel with buffered IO, exactly to catch some of these cases. But they don't verify the data, generally. In that sense buffered is a lot easier than O_DIRECT, as it's easier to provoke these cases. And that does hit all the save/restore parts and looping, and if you do it with registered buffers then you get to work with bvec iter as well. O_DIRECT may get you -EAGAIN for low queue depth devices, but it'll never do a short read/write after that. But that's not in the regression tests. I'll write a test case that can go with the liburing regressions for it. -- Jens Axboe ^ permalink raw reply [flat|nested] 17+ messages in thread
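As a reference point for the fio verification workloads mentioned above, a job of that kind might look roughly like this (illustrative; filename, sizes, and depth are arbitrary choices):

	[verify-bufio]
	ioengine=io_uring
	rw=randwrite
	bs=64k
	iodepth=16
	size=1g
	filename=/data/fio-verify
	verify=crc32c

fio writes the data with per-block verification headers and reads it all back to check once the writes complete; leaving direct=0 keeps it on the buffered path being discussed.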
* Re: [PATCHSET v3 0/3] Add ability to save/restore iov_iter state 2021-09-15 19:40 ` Jens Axboe @ 2021-09-15 22:42 ` Jens Axboe 2021-09-16 1:15 ` Jens Axboe 0 siblings, 1 reply; 17+ messages in thread From: Jens Axboe @ 2021-09-15 22:42 UTC (permalink / raw) To: Linus Torvalds; +Cc: io-uring, linux-fsdevel, Al Viro On 9/15/21 1:40 PM, Jens Axboe wrote: > On 9/15/21 1:26 PM, Linus Torvalds wrote: >> On Wed, Sep 15, 2021 at 11:46 AM Jens Axboe <[email protected]> wrote: >>> >>> The usual tests >>> do end up hitting the -EAGAIN path quite easily for certain device >>> types, but not the short read/write. >> >> No way to do something like "read in file to make sure it's cached, >> then invalidate caches from position X with POSIX_FADV_DONTNEED, then >> do a read that crosses that cached/uncached boundary"? >> >> To at least verify that "partly synchronous, but partly punted to >> async" case? >> >> Or were you talking about some other situation? > > No, that covers some of it, and that happens naturally with buffered IO. > The typical case is -EAGAIN on the first try, then you get a partial > or all of it the next loop, and then done or continue. I tend to run > fio verification workloads for that, as you get all the flexibility > of fio with the data verification. And there are tests in there that run > DONTNEED in parallel with buffered IO, exactly to catch some of these > cases. But they don't verify the data, generally. > > In that sense buffered is a lot easier than O_DIRECT, as it's easier to > provoke these cases. And that does hit all the save/restore parts and > looping, and if you do it with registered buffers then you get to work > with bvec iter as well. O_DIRECT may get you -EAGAIN for low queue depth > devices, but it'll never do a short read/write after that. > > But that's not in the regression tests. I'll write a test case > that can go with the liburing regressions for it. OK I wrote one, quick'n dirty. It's written as a liburing test, which means it can take no arguments (in which case it creates a 128MB file), or it can take an argument and it'll use that argument as the file. We fill the first 128MB of the file with known data, basically the offset of the file. Then we read it back in any of the following ways: 1) Using non-vectored read 2) Using vectored read, segments that fit in UIO_FASTIOV 3) Using vectored read, segments larger than UIO_FASTIOV This catches all the different cases for a read. We do that with both buffered and O_DIRECT, and before each pass, we randomly DONTNEED either the first, middle, or end part of each segment in the read size. I ran this on my laptop, and I found this: axboe@p1 ~/gi/liburing (master)> test/file-verify 0.100s bad read 229376, read 3 Buffered novec test failed axboe@p1 ~/gi/liburing (master)> test/file-verify 0.213s bad read 294912, read 0 Buffered novec test failed which is because I'm running the iov_iter.2 stuff, and we're hitting that double accounting issue that I mentioned in the cover letter for this series. That's why the read return is larger than we ask for (128K). Running it on the current branch passes: [root@archlinux liburing]# for i in $(seq 10); do test/file-verify; done [root@archlinux liburing]# (this is in my test vm that I run on the laptop for kernel testing, hence the root and different hostname). I will add this as a liburing regression test case. Probably needs a bit of cleaning up first, it was just a quick prototype as I thought your suggestion was a good one.
Will probably change it to run at a higher queue depth than just the 1 it does now. /* SPDX-License-Identifier: MIT */ /* * Description: run various read verify tests * */ #include <errno.h> #include <stdio.h> #include <unistd.h> #include <stdlib.h> #include <string.h> #include <fcntl.h> #include <assert.h> #include "helpers.h" #include "liburing.h" #define FSIZE 128*1024*1024 #define CHUNK_SIZE 131072 #define PUNCH_SIZE 32768 static int verify_buf(void *buf, size_t size, off_t off) { int i, u_in_buf = size / sizeof(unsigned int); unsigned int *ptr; off /= sizeof(unsigned int); ptr = buf; for (i = 0; i < u_in_buf; i++) { if (off != *ptr) { fprintf(stderr, "Found %u, wanted %lu\n", *ptr, (unsigned long) off); return 1; } ptr++; off++; } return 0; } enum { PUNCH_NONE, PUNCH_FRONT, PUNCH_MIDDLE, PUNCH_END, }; /* * For each chunk in the file, DONTNEED a start, end, or middle segment of it. * We enter here with the file fully cached every time, either freshly * written or after other reads. */ static int do_punch(int fd) { off_t offset = 0; int punch_type; while (offset + CHUNK_SIZE <= FSIZE) { off_t punch_off; punch_type = rand() % (PUNCH_END + 1); switch (punch_type) { default: case PUNCH_NONE: punch_off = -1; /* gcc... */ break; case PUNCH_FRONT: punch_off = offset; break; case PUNCH_MIDDLE: punch_off = offset + PUNCH_SIZE; break; case PUNCH_END: punch_off = offset + CHUNK_SIZE - PUNCH_SIZE; break; } offset += CHUNK_SIZE; if (punch_type == PUNCH_NONE) continue; if (posix_fadvise(fd, punch_off, PUNCH_SIZE, POSIX_FADV_DONTNEED) < 0) { perror("posix_fadvise"); return 1; } } return 0; } static int test(struct io_uring *ring, const char *fname, int buffered, int vectored, int small_vecs) { struct io_uring_cqe *cqe; struct io_uring_sqe *sqe; struct iovec *vecs; int ret, fd, flags; void *buf; size_t left; off_t off, voff; int i, j, nr_vecs; flags = O_RDONLY; if (!buffered) flags |= O_DIRECT; fd = open(fname, flags); if (fd < 0) { perror("open"); return 1; } if (do_punch(fd)) return 1; if (vectored) { int vec_size; if (small_vecs) nr_vecs = 8; else nr_vecs = 16; vecs = t_malloc(nr_vecs * sizeof(struct iovec)); vec_size = CHUNK_SIZE / nr_vecs; for (i = 0; i < nr_vecs; i++) { t_posix_memalign(&buf, 4096, vec_size); vecs[i].iov_base = buf; vecs[i].iov_len = vec_size; } } else { t_posix_memalign(&buf, 4096, CHUNK_SIZE); nr_vecs = 0; vecs = NULL; } i = 0; left = FSIZE; off = 0; while (left) { size_t this = left; if (this > CHUNK_SIZE) this = CHUNK_SIZE; sqe = io_uring_get_sqe(ring); if (!sqe) { fprintf(stderr, "get sqe failed\n"); goto err; } if (vectored) io_uring_prep_readv(sqe, fd, vecs, nr_vecs, off); else io_uring_prep_read(sqe, fd, buf, this, off); sqe->user_data = off; off += this; ret = io_uring_submit(ring); if (ret <= 0) { fprintf(stderr, "sqe submit failed: %d\n", ret); goto err; } ret = io_uring_wait_cqe(ring, &cqe); if (ret < 0) { fprintf(stderr, "wait completion %d\n", ret); goto err; } if (cqe->res != this) { fprintf(stderr, "bad read %d, read %d\n", cqe->res, i); goto err; } if (vectored) { voff = cqe->user_data; for (j = 0; j < nr_vecs; j++) { if (verify_buf(vecs[j].iov_base, vecs[j].iov_len, voff)) { fprintf(stderr, "failed at off %lu\n", (unsigned long) voff); goto err; } voff += vecs[j].iov_len; } } else { if (verify_buf(buf, CHUNK_SIZE, cqe->user_data)) { fprintf(stderr, "failed at off %lu\n", (unsigned long) cqe->user_data); goto err; } } io_uring_cqe_seen(ring, cqe); i++; left -= CHUNK_SIZE; } ret = 0; done: if (vectored) { for (i = 0; i < nr_vecs; i++) free(vecs[i].iov_base); } else { free(buf); }
close(fd); return ret; err: ret = 1; goto done; } static int fill_pattern(const char *fname) { size_t left = FSIZE; unsigned int val, *ptr; void *buf; int fd, i; fd = open(fname, O_WRONLY); if (fd < 0) { perror("open"); return 1; } val = 0; buf = t_malloc(4096); while (left) { int u_in_buf = 4096 / sizeof(val); size_t this = left; if (this > 4096) this = 4096; ptr = buf; for (i = 0; i < u_in_buf; i++) { *ptr = val; val++; ptr++; } if (write(fd, buf, this) != (ssize_t) this) return 1; left -= this; } fsync(fd); close(fd); free(buf); return 0; } int main(int argc, char *argv[]) { struct io_uring ring; const char *fname; char buf[32]; int ret; srand(getpid()); if (argc > 1) { fname = argv[1]; } else { sprintf(buf, ".%d", getpid()); fname = buf; t_create_file(fname, FSIZE); } ret = io_uring_queue_init(64, &ring, 0); if (ret) { fprintf(stderr, "ring setup failed: %d\n", ret); goto err; } if (fill_pattern(fname)) goto err; ret = test(&ring, fname, 1, 0, 0); if (ret) { fprintf(stderr, "Buffered novec test failed\n"); goto err; } ret = test(&ring, fname, 1, 1, 0); if (ret) { fprintf(stderr, "Buffered vec test failed\n"); goto err; } ret = test(&ring, fname, 1, 1, 1); if (ret) { fprintf(stderr, "Buffered small vec test failed\n"); goto err; } ret = test(&ring, fname, 0, 0, 0); if (ret) { fprintf(stderr, "O_DIRECT novec test failed\n"); goto err; } ret = test(&ring, fname, 0, 1, 0); if (ret) { fprintf(stderr, "O_DIRECT vec test failed\n"); goto err; } ret = test(&ring, fname, 0, 1, 1); if (ret) { fprintf(stderr, "O_DIRECT small vec test failed\n"); goto err; } if (buf == fname) unlink(fname); return 0; err: if (buf == fname) unlink(fname); return 1; } -- Jens Axboe ^ permalink raw reply [flat|nested] 17+ messages in thread
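A note for anyone wanting to try the test above outside the liburing tree: the t_*() helpers (t_malloc(), t_posix_memalign(), t_create_file()) come from liburing's test harness, so the path of least resistance is to build inside a liburing checkout, along these lines (illustrative; the file name is assumed, and you may need to hook the test into test/Makefile):

	$ git clone https://github.com/axboe/liburing
	$ cp file-verify.c liburing/test/
	$ make -C liburing/test file-verify
	$ liburing/test/file-verify             # creates and uses a temporary 128MB file
	$ liburing/test/file-verify /dev/sdc2   # or point it at an existing file/device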
* Re: [PATCHSET v3 0/3] Add ability to save/restore iov_iter state 2021-09-15 22:42 ` Jens Axboe @ 2021-09-16 1:15 ` Jens Axboe 0 siblings, 0 replies; 17+ messages in thread From: Jens Axboe @ 2021-09-16 1:15 UTC (permalink / raw) To: Linus Torvalds; +Cc: io-uring, linux-fsdevel, Al Viro On 9/15/21 4:42 PM, Jens Axboe wrote: > On 9/15/21 1:40 PM, Jens Axboe wrote: >> On 9/15/21 1:26 PM, Linus Torvalds wrote: >>> On Wed, Sep 15, 2021 at 11:46 AM Jens Axboe <[email protected]> wrote: >>>> >>>> The usual tests >>>> do end up hitting the -EAGAIN path quite easily for certain device >>>> types, but not the short read/write. >>> >>> No way to do something like "read in file to make sure it's cached, >>> then invalidate caches from position X with POSIX_FADV_DONTNEED, then >>> do a read that crosses that cached/uncached boundary"? >>> >>> To at least verify that "partly synchronous, but partly punted to >>> async" case? >>> >>> Or were you talking about some other situation? >> >> No, that covers some of it, and that happens naturally with buffered IO. >> The typical case is -EAGAIN on the first try, then you get a partial >> or all of it the next loop, and then done or continue. I tend to run >> fio verification workloads for that, as you get all the flexibility >> of fio with the data verification. And there are tests in there that run >> DONTNEED in parallel with buffered IO, exactly to catch some of these >> cases. But they don't verify the data, generally. >> >> In that sense buffered is a lot easier than O_DIRECT, as it's easier to >> provoke these cases. And that does hit all the save/restore parts and >> looping, and if you do it with registered buffers then you get to work >> with bvec iter as well. O_DIRECT may get you -EAGAIN for low queue depth >> devices, but it'll never do a short read/write after that. >> >> But that's not in the regression tests. I'll write a test case >> that can go with the liburing regressions for it. > > OK I wrote one, quick'n dirty. It's written as a liburing test, which > means it can take no arguments (in which case it creates a 128MB file), > or it can take an argument and it'll use that argument as the file. We > fill the first 128MB of the file with known data, basically the offset > of the file. Then we read it back in any of the following ways: > > 1) Using non-vectored read > 2) Using vectored read, segments that fit in UIO_FASTIOV > 3) Using vectored read, segments larger than UIO_FASTIOV > > This catches all the different cases for a read. > > We do that with both buffered and O_DIRECT, and before each pass, we > randomly DONTNEED either the first, middle, or end part of each segment > in the read size. > > I ran this on my laptop, and I found this: > axboe@p1 ~/gi/liburing (master)> test/file-verify 0.100s > bad read 229376, read 3 > Buffered novec test failed > axboe@p1 ~/gi/liburing (master)> test/file-verify 0.213s > bad read 294912, read 0 > Buffered novec test failed > > which is because I'm running the iov_iter.2 stuff, and we're hitting > that double accounting issue that I mentioned in the cover letter for > this series. That's why the read return is larger than we ask for > (128K). Running it on the current branch passes: > > [root@archlinux liburing]# for i in $(seq 10); do test/file-verify; done > [root@archlinux liburing]# > > (this is in my test vm that I run on the laptop for kernel testing, > hence the root and different hostname). > > I will add this as a liburing regression test case.
Probably needs a bit > of cleaning up first, it was just a quick prototype as I thought your > suggestion was a good one. Will probably change it to run at a higher > queue depth than just the 1 it does now. Cleaned it up a bit, and added registered buffer support (another variant of non-vectored reads) as well as queued IO support: https://git.kernel.dk/cgit/liburing/commit/?id=6ab387dab745aff2af760d9fed56a4154669edec and it's now part of the regular testing. Here's my usual run: Running test file-verify 3 sec Running test file-verify /dev/nvme0n1p2 3 sec Running test file-verify /dev/nvme1n1p1 3 sec Running test file-verify /dev/sdc2 Test file-verify timed out (may not be a failure) Running test file-verify /dev/dm-0 3 sec Running test file-verify /data/file 3 sec Note that the sdc2 timeout isn't a failure; it's just that emulation on qemu is slow enough that it takes 1min20s to run, and I time out tests after 60s in the harness to prevent something stalling forever. -- Jens Axboe ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCHSET v3 0/3] Add ability to save/restore iov_iter state 2021-09-15 16:29 [PATCHSET v3 0/3] Add ability to save/restore iov_iter state Jens Axboe ` (3 preceding siblings ...) 2021-09-15 18:32 ` [PATCHSET v3 0/3] Add ability to save/restore iov_iter state Linus Torvalds @ 2021-09-16 4:47 ` Al Viro 2021-09-16 16:10 ` Jens Axboe 4 siblings, 1 reply; 17+ messages in thread From: Al Viro @ 2021-09-16 4:47 UTC (permalink / raw) To: Jens Axboe; +Cc: io-uring, linux-fsdevel, torvalds Jens, may I politely inquire why is struct io_rw playing these games with overloading ->rw.addr, instead of simply having struct io_buffer *kbuf in it? Another question: what the hell are the rules for REQ_F_BUFFER_SELECT? The first time around io_iov_buffer_select() will * read iovec from ->rw.addr * replace iovec.iov_base with value derived from ->buf_index * cap iovec.iov_len with value derived from ->buf_index Next time around it will use the same base *AND* replace the length with the value used to cap the original. Is that deliberate? ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCHSET v3 0/3] Add ability to save/restore iov_iter state 2021-09-16 4:47 ` Al Viro @ 2021-09-16 16:10 ` Jens Axboe 0 siblings, 0 replies; 17+ messages in thread From: Jens Axboe @ 2021-09-16 16:10 UTC (permalink / raw) To: Al Viro; +Cc: io-uring, linux-fsdevel, torvalds On 9/15/21 10:47 PM, Al Viro wrote: > Jens, may I politely inquire why is struct io_rw playing > these games with overloading ->rw.addr, instead of simply having > struct io_buffer *kbuf in it? Very simply to avoid growing the union command part of io_kiocb beyond a cacheline. We're pretty sensitive to io_kiocb size in general, and io_rw is already the biggest member in there. > Another question: what the hell are the rules for > REQ_F_BUFFER_SELECT? The first time around io_iov_buffer_select() > will > * read iovec from ->rw.addr > * replace iovec.iov_base with value derived from > ->buf_index > * cap iovec.iov_len with value derived from ->buf_index > Next time around it will use the same base *AND* replace the > length with the value used to cap the original. > Is that deliberate? Probably not strictly needed, but doesn't harm anything. The buffer is being consumed (and hence removed) at completion anyway, it's not a persistent change. Selected buffers must be re-provided by the application as the kernel has no way of knowing when the application would otherwise be ready for it to get reused, and that's done by issuing a new provide buffers request for the buffers that can get recycled. -- Jens Axboe ^ permalink raw reply [flat|nested] 17+ messages in thread
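To make the buffer recycling described above concrete: with liburing, handing a consumed buffer back to the kernel looks roughly like this (sketch; ring, buf, len, bgid, and bid are whatever the application used when it originally provided the buffer group):

	struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

	/* re-provide one buffer 'bid' to group 'bgid' so it can be selected again */
	io_uring_prep_provide_buffers(sqe, buf, len, 1, bgid, bid);
	io_uring_submit(&ring);

A subsequent read armed with IOSQE_BUFFER_SELECT and sqe->buf_group set to bgid can then pick the buffer up again, with the chosen buffer ID reported back in the CQE flags.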
* [PATCHSET v2 0/3] Add ability to save/restore iov_iter state @ 2021-09-14 14:17 Jens Axboe 2021-09-14 14:17 ` [PATCH 2/3] io_uring: use iov_iter state save/restore helpers Jens Axboe 0 siblings, 1 reply; 17+ messages in thread From: Jens Axboe @ 2021-09-14 14:17 UTC (permalink / raw) To: io-uring, linux-fsdevel; +Cc: torvalds, viro Hi, Linus didn't particularly love the iov_iter->truncated addition and how it was used, and it was hard to disagree with that. Instead of relying on tracking ->truncated, add a few pieces of state so we can safely handle partial or errored read/write attempts (that we want to retry). Then we can get rid of the iov_iter addition, and at the same time handle cases that weren't handled correctly before. I've run this through vectored read/write with io_uring on the commonly problematic cases (dm and low depth SCSI device) which trigger these conditions often, and it seems to pass muster. For a discussion on this topic, see the thread here: https://lore.kernel.org/linux-fsdevel/CAHk-=wiacKV4Gh-MYjteU0LwNBSGpWrK-Ov25HdqB1ewinrFPg@mail.gmail.com/ You can find these patches here: https://git.kernel.dk/cgit/linux-block/log/?h=iov_iter.2 Changes since v1: - Drop 'did_bytes' from iov_iter_restore(). Only two cases in io_uring used it, and one of them had to be changed with v2. Better to just make the subsequent iov_iter_advance() explicit at that point. - Cleanup and sanitize the io_uring side, and ensure it's sane around worker retries. No more digging into iov_iter_state from io_uring, we use it just for save/restore purposes. -- Jens Axboe ^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH 2/3] io_uring: use iov_iter state save/restore helpers 2021-09-14 14:17 [PATCHSET v2 " Jens Axboe @ 2021-09-14 14:17 ` Jens Axboe 2021-09-14 18:45 ` Linus Torvalds 0 siblings, 1 reply; 17+ messages in thread From: Jens Axboe @ 2021-09-14 14:17 UTC (permalink / raw) To: io-uring, linux-fsdevel; +Cc: torvalds, viro, Jens Axboe Get rid of the need to do re-expand and revert on an iterator when we encounter a short IO, or failure that warrants a retry. Use the new state save/restore helpers instead. We keep the iov_iter_state persistent across retries, if we need to restart the read or write operation. If there's a pending retry, the operation will always exit with the state correctly saved. Signed-off-by: Jens Axboe <[email protected]> --- fs/io_uring.c | 62 ++++++++++++++++++++++++++++++++++----------------- 1 file changed, 41 insertions(+), 21 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index 855ea544807f..dbc97d440801 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -712,6 +712,7 @@ struct io_async_rw { struct iovec fast_iov[UIO_FASTIOV]; const struct iovec *free_iovec; struct iov_iter iter; + struct iov_iter_state iter_state; size_t bytes_done; struct wait_page_queue wpq; }; @@ -2608,8 +2609,7 @@ static bool io_resubmit_prep(struct io_kiocb *req) if (!rw) return !io_req_prep_async(req); - /* may have left rw->iter inconsistent on -EIOCBQUEUED */ - iov_iter_revert(&rw->iter, req->result - iov_iter_count(&rw->iter)); + iov_iter_restore(&rw->iter, &rw->iter_state); return true; } @@ -3310,12 +3310,16 @@ static int io_setup_async_rw(struct io_kiocb *req, const struct iovec *iovec, if (!force && !io_op_defs[req->opcode].needs_async_setup) return 0; if (!req->async_data) { + struct io_async_rw *iorw; + if (io_alloc_async_data(req)) { kfree(iovec); return -ENOMEM; } io_req_map_rw(req, iovec, fast_iov, iter); + iorw = req->async_data; + iov_iter_save_state(&iorw->iter, &iorw->iter_state); } return 0; } @@ -3334,6 +3338,7 @@ static inline int io_rw_prep_async(struct io_kiocb *req, int rw) iorw->free_iovec = iov; if (iov) req->flags |= REQ_F_NEED_CLEANUP; + iov_iter_save_state(&iorw->iter, &iorw->iter_state); return 0; } @@ -3437,19 +3442,23 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags) struct kiocb *kiocb = &req->rw.kiocb; struct iov_iter __iter, *iter = &__iter; struct io_async_rw *rw = req->async_data; - ssize_t io_size, ret, ret2; bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK; + struct iov_iter_state __state, *state; + ssize_t ret, ret2; if (rw) { iter = &rw->iter; + state = &rw->iter_state; + iov_iter_restore(iter, state); iovec = NULL; } else { ret = io_import_iovec(READ, req, &iovec, iter, !force_nonblock); if (ret < 0) return ret; + state = &__state; + iov_iter_save_state(iter, state); } - io_size = iov_iter_count(iter); - req->result = io_size; + req->result = iov_iter_count(iter); /* Ensure we clear previously set non-block flag */ if (!force_nonblock) @@ -3463,7 +3472,7 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags) return ret ?: -EAGAIN; } - ret = rw_verify_area(READ, req->file, io_kiocb_ppos(kiocb), io_size); + ret = rw_verify_area(READ, req->file, io_kiocb_ppos(kiocb), req->result); if (unlikely(ret)) { kfree(iovec); return ret; @@ -3479,30 +3488,36 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags) /* no retry on NONBLOCK nor RWF_NOWAIT */ if (req->flags & REQ_F_NOWAIT) goto done; - /* some cases will consume bytes even on error returns */ - iov_iter_reexpand(iter, iter->count + 
iter->truncated); - iov_iter_revert(iter, io_size - iov_iter_count(iter)); ret = 0; } else if (ret == -EIOCBQUEUED) { goto out_free; - } else if (ret <= 0 || ret == io_size || !force_nonblock || + } else if (ret <= 0 || ret == req->result || !force_nonblock || (req->flags & REQ_F_NOWAIT) || !need_read_all(req)) { /* read all, failed, already did sync or don't want to retry */ goto done; } + iov_iter_restore(iter, state); + ret2 = io_setup_async_rw(req, iovec, inline_vecs, iter, true); if (ret2) return ret2; iovec = NULL; rw = req->async_data; - /* now use our persistent iterator, if we aren't already */ - iter = &rw->iter; + /* now use our persistent iterator and state, if we aren't already */ + if (iter != &rw->iter) { + iter = &rw->iter; + state = &rw->iter_state; + } do { - io_size -= ret; rw->bytes_done += ret; + iov_iter_advance(iter, ret); + if (!iov_iter_count(iter)) + break; + iov_iter_save_state(iter, state); + /* if we can retry, do so with the callbacks armed */ if (!io_rw_should_retry(req)) { kiocb->ki_flags &= ~IOCB_WAITQ; @@ -3520,7 +3535,7 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags) return 0; /* we got some bytes, but not all. retry. */ kiocb->ki_flags &= ~IOCB_WAITQ; - } while (ret > 0 && ret < io_size); + } while (ret > 0); done: kiocb_done(kiocb, ret, issue_flags); out_free: @@ -3543,19 +3558,24 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags) struct kiocb *kiocb = &req->rw.kiocb; struct iov_iter __iter, *iter = &__iter; struct io_async_rw *rw = req->async_data; - ssize_t ret, ret2, io_size; bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK; + struct iov_iter_state __state, *state; + ssize_t ret, ret2; if (rw) { iter = &rw->iter; + state = &rw->iter_state; + iov_iter_restore(iter, state); iovec = NULL; } else { ret = io_import_iovec(WRITE, req, &iovec, iter, !force_nonblock); if (ret < 0) return ret; + state = &__state; + iov_iter_save_state(iter, state); } - io_size = iov_iter_count(iter); - req->result = io_size; + req->result = iov_iter_count(iter); + ret2 = 0; /* Ensure we clear previously set non-block flag */ if (!force_nonblock) @@ -3572,7 +3592,7 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags) (req->flags & REQ_F_ISREG)) goto copy_iov; - ret = rw_verify_area(WRITE, req->file, io_kiocb_ppos(kiocb), io_size); + ret = rw_verify_area(WRITE, req->file, io_kiocb_ppos(kiocb), req->result); if (unlikely(ret)) goto out_free; @@ -3619,9 +3639,9 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags) kiocb_done(kiocb, ret2, issue_flags); } else { copy_iov: - /* some cases will consume bytes even on error returns */ - iov_iter_reexpand(iter, iter->count + iter->truncated); - iov_iter_revert(iter, io_size - iov_iter_count(iter)); + iov_iter_restore(iter, state); + if (ret2 > 0) + iov_iter_advance(iter, ret2); ret = io_setup_async_rw(req, iovec, inline_vecs, iter, false); return ret ?: -EAGAIN; } -- 2.33.0 ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH 2/3] io_uring: use iov_iter state save/restore helpers 2021-09-14 14:17 ` [PATCH 2/3] io_uring: use iov_iter state save/restore helpers Jens Axboe @ 2021-09-14 18:45 ` Linus Torvalds 2021-09-14 19:37 ` Jens Axboe 0 siblings, 1 reply; 17+ messages in thread From: Linus Torvalds @ 2021-09-14 18:45 UTC (permalink / raw) To: Jens Axboe; +Cc: io-uring, linux-fsdevel, Al Viro On Tue, Sep 14, 2021 at 7:18 AM Jens Axboe <[email protected]> wrote: > > > + iov_iter_restore(iter, state); > + ... > rw->bytes_done += ret; > + iov_iter_advance(iter, ret); > + if (!iov_iter_count(iter)) > + break; > + iov_iter_save_state(iter, state); Ok, so now you keep iov_iter and the state always in sync by just always resetting the iter back and then walking it forward explicitly - and re-saving the state. That seems safe, if potentially unnecessarily expensive. I guess re-walking lots of iovec entries is actually very unlikely in practice, so maybe this "stupid brute-force" model is the right one. I do find the odd "use __state vs rw->state" to be very confusing, though. Particularly in io_read(), where you do this: + iov_iter_restore(iter, state); + ret2 = io_setup_async_rw(req, iovec, inline_vecs, iter, true); if (ret2) return ret2; iovec = NULL; rw = req->async_data; - /* now use our persistent iterator, if we aren't already */ - iter = &rw->iter; + /* now use our persistent iterator and state, if we aren't already */ + if (iter != &rw->iter) { + iter = &rw->iter; + state = &rw->iter_state; + } do { - io_size -= ret; rw->bytes_done += ret; + iov_iter_advance(iter, ret); + if (!iov_iter_count(iter)) + break; + iov_iter_save_state(iter, state); Note how it first does that iov_iter_restore() on iter/state, but then it *replaces* the iter/state pointers, and then it does iov_iter_advance() on the replacement ones. I don't see how that could be right. You're doing iov_iter_advance() on something other than the one you restored to the original values. And if it is right, it's sure confusing as hell. Linus ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 2/3] io_uring: use iov_iter state save/restore helpers 2021-09-14 18:45 ` Linus Torvalds @ 2021-09-14 19:37 ` Jens Axboe 2021-09-14 23:02 ` Jens Axboe 0 siblings, 1 reply; 17+ messages in thread From: Jens Axboe @ 2021-09-14 19:37 UTC (permalink / raw) To: Linus Torvalds; +Cc: io-uring, linux-fsdevel, Al Viro On 9/14/21 12:45 PM, Linus Torvalds wrote: > On Tue, Sep 14, 2021 at 7:18 AM Jens Axboe <[email protected]> wrote: >> >> >> + iov_iter_restore(iter, state); >> + > ... >> rw->bytes_done += ret; >> + iov_iter_advance(iter, ret); >> + if (!iov_iter_count(iter)) >> + break; >> + iov_iter_save_state(iter, state); > > Ok, so now you keep iov_iter and the state always in sync by just > always resetting the iter back and then walking it forward explicitly > - and re-saving the state. > > That seems safe, if potentially unnecessarily expensive. Right, it's not ideal if it's a big range of IO, then it'll definitely be noticeable. But not too worried about it, at least not for now... > I guess re-walking lots of iovec entries is actually very unlikely in > practice, so maybe this "stupid brute-force" model is the right one. Not sure what the alternative is here. We could do something similar to __io_import_fixed() as we're only dealing with iter types where we can do that, but probably best left as a later optimization if it's deemed necessary. > I do find the odd "use __state vs rw->state" to be very confusing, > though. Particularly in io_read(), where you do this: > > + iov_iter_restore(iter, state); > + > ret2 = io_setup_async_rw(req, iovec, inline_vecs, iter, true); > if (ret2) > return ret2; > > iovec = NULL; > rw = req->async_data; > - /* now use our persistent iterator, if we aren't already */ > - iter = &rw->iter; > + /* now use our persistent iterator and state, if we aren't already */ > + if (iter != &rw->iter) { > + iter = &rw->iter; > + state = &rw->iter_state; > + } > > do { > - io_size -= ret; > rw->bytes_done += ret; > + iov_iter_advance(iter, ret); > + if (!iov_iter_count(iter)) > + break; > + iov_iter_save_state(iter, state); > > > Note how it first does that iov_iter_restore() on iter/state, but > then it *replaces* the iter/state pointers, and then it does > iov_iter_advance() on the replacement ones. We restore the iter so it's the same as before we did the read_iter call, and then set up a consistent copy of the iov/iter in case we need to punt this request for retry. rw->iter should have the same state as iter at this point, and since rw->iter is the copy we'll use going forward, we're advancing that one in case ret > 0. The other case is that no persistent state is needed, and then iter remains the same. I'll take a second look at this part and see if I can make it a bit more straightforward, or at least comment it properly. -- Jens Axboe ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 2/3] io_uring: use iov_iter state save/restore helpers 2021-09-14 19:37 ` Jens Axboe @ 2021-09-14 23:02 ` Jens Axboe 0 siblings, 0 replies; 17+ messages in thread From: Jens Axboe @ 2021-09-14 23:02 UTC (permalink / raw) To: Linus Torvalds; +Cc: io-uring, linux-fsdevel, Al Viro On 9/14/21 1:37 PM, Jens Axboe wrote: > On 9/14/21 12:45 PM, Linus Torvalds wrote: >> On Tue, Sep 14, 2021 at 7:18 AM Jens Axboe <[email protected]> wrote: >>> >>> >>> + iov_iter_restore(iter, state); >>> + >> ... >>> rw->bytes_done += ret; >>> + iov_iter_advance(iter, ret); >>> + if (!iov_iter_count(iter)) >>> + break; >>> + iov_iter_save_state(iter, state); >> >> Ok, so now you keep iov_iter and the state always in sync by just >> always resetting the iter back and then walking it forward explicitly >> - and re-saving the state. >> >> That seems safe, if potentially unnecessarily expensive. > > Right, it's not ideal if it's a big range of IO, then it'll definitely > be noticeable. But not too worried about it, at least not for now... > >> I guess re-walking lots of iovec entries is actually very unlikely in >> practice, so maybe this "stupid brute-force" model is the right one. > > Not sure what the alternative is here. We could do something similar to > __io_import_fixed() as we're only dealing with iter types where we can > do that, but probably best left as a later optimization if it's deemed > necessary. > >> I do find the odd "use __state vs rw->state" to be very confusing, >> though. Particularly in io_read(), where you do this: >> >> + iov_iter_restore(iter, state); >> + >> ret2 = io_setup_async_rw(req, iovec, inline_vecs, iter, true); >> if (ret2) >> return ret2; >> >> iovec = NULL; >> rw = req->async_data; >> - /* now use our persistent iterator, if we aren't already */ >> - iter = &rw->iter; >> + /* now use our persistent iterator and state, if we aren't already */ >> + if (iter != &rw->iter) { >> + iter = &rw->iter; >> + state = &rw->iter_state; >> + } >> >> do { >> - io_size -= ret; >> rw->bytes_done += ret; >> + iov_iter_advance(iter, ret); >> + if (!iov_iter_count(iter)) >> + break; >> + iov_iter_save_state(iter, state); >> >> >> Note how it first does that iov_iter_restore() on iter/state, but >> then it *replaces* the iter/state pointers, and then it does >> iov_iter_advance() on the replacement ones. > > We restore the iter so it's the same as before we did the read_iter > call, and then set up a consistent copy of the iov/iter in case we need > to punt this request for retry. rw->iter should have the same state as > iter at this point, and since rw->iter is the copy we'll use going > forward, we're advancing that one in case ret > 0. > > The other case is that no persistent state is needed, and then iter > remains the same. > > I'll take a second look at this part and see if I can make it a bit more > straightforward, or at least comment it properly. I hacked up something that shortens the iter for the initial IO, so we could more easily test the retry path and the state. It really is a hack, but the idea was to issue 64K IO from fio, and then the initial attempt would be anywhere from 4K-60K truncated. That forces retry. I ran this with both 16 segments and 8 segments, verifying that it hits both the UIO_FASTIOV and alloc paths. I did find one issue with that; see the last hunk in the hack. We need to increment rw->bytes_done if we don't break, or set ret to 0 if we do. Otherwise that last ret ends up being accounted twice. But apart from that, it passes data verification runs.
diff --git a/fs/io_uring.c b/fs/io_uring.c index dc1ff47e3221..484c86252f9d 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -744,6 +744,7 @@ enum { REQ_F_NOWAIT_READ_BIT, REQ_F_NOWAIT_WRITE_BIT, REQ_F_ISREG_BIT, + REQ_F_TRUNCATED_BIT, /* not a real bit, just to check we're not overflowing the space */ __REQ_F_LAST_BIT, @@ -797,6 +798,7 @@ enum { REQ_F_REFCOUNT = BIT(REQ_F_REFCOUNT_BIT), /* there is a linked timeout that has to be armed */ REQ_F_ARM_LTIMEOUT = BIT(REQ_F_ARM_LTIMEOUT_BIT), + REQ_F_TRUNCATED = BIT(REQ_F_TRUNCATED_BIT), }; struct async_poll { @@ -3454,11 +3456,12 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags) { struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; struct kiocb *kiocb = &req->rw.kiocb; - struct iov_iter __iter, *iter = &__iter; + struct iov_iter __i, __iter, *iter = &__iter; struct io_async_rw *rw = req->async_data; bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK; struct iov_iter_state __state, *state; ssize_t ret, ret2; + bool do_restore = false; if (rw) { iter = &rw->iter; @@ -3492,8 +3495,25 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags) return ret; } + if (!(req->flags & REQ_F_TRUNCATED) && !(iov_iter_count(iter) & 4095)) { + int nr_vecs; + + __i = *iter; + nr_vecs = 1 + (prandom_u32() % iter->nr_segs); + iter->nr_segs = nr_vecs; + iter->count = nr_vecs * 8192; + req->flags |= REQ_F_TRUNCATED; + do_restore = true; + } + ret = io_iter_do_read(req, iter); + if (ret == -EAGAIN) { + req->flags &= ~REQ_F_TRUNCATED; + *iter = __i; + do_restore = false; + } + if (ret == -EAGAIN || (req->flags & REQ_F_REISSUE)) { req->flags &= ~REQ_F_REISSUE; /* IOPOLL retry should happen for io-wq threads */ @@ -3513,6 +3533,9 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags) iov_iter_restore(iter, state); + if (do_restore) + *iter = __i; + ret2 = io_setup_async_rw(req, iovec, inline_vecs, iter, true); if (ret2) return ret2; @@ -3526,10 +3549,10 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags) } do { - rw->bytes_done += ret; iov_iter_advance(iter, ret); if (!iov_iter_count(iter)) break; + rw->bytes_done += ret; iov_iter_save_state(iter, state); /* if we can retry, do so with the callbacks armed */ -- Jens Axboe ^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCHSET 0/3] Add ability to save/restore iov_iter state
@ 2021-09-10 18:25 Jens Axboe
  2021-09-10 18:25 ` [PATCH 2/3] io_uring: use iov_iter state save/restore helpers Jens Axboe
  0 siblings, 1 reply; 17+ messages in thread
From: Jens Axboe @ 2021-09-10 18:25 UTC (permalink / raw)
  To: io-uring, linux-fsdevel; +Cc: torvalds, viro

Hi,

Linus didn't particularly love the iov_iter->truncated addition and how
it was used, and it was hard to disagree with that. Instead of relying
on tracking ->truncated, add a few pieces of state so we can safely
handle partial or errored read/write attempts (that we want to retry).
Then we can get rid of the iov_iter addition, and at the same time
handle cases that weren't handled correctly before.

I've run this through vectored read/write with io_uring on the commonly
problematic cases (dm and low depth SCSI device) which trigger these
conditions often, and it seems to pass muster.

For a discussion on this topic, see the thread here:

https://lore.kernel.org/linux-fsdevel/CAHk-=wiacKV4Gh-MYjteU0LwNBSGpWrK-Ov25HdqB1ewinrFPg@mail.gmail.com/

You can find these patches here:

https://git.kernel.dk/cgit/linux-block/log/?h=iov_iter

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 17+ messages in thread
* [PATCH 2/3] io_uring: use iov_iter state save/restore helpers
  2021-09-10 18:25 [PATCHSET 0/3] Add ability to save/restore iov_iter state Jens Axboe
@ 2021-09-10 18:25 ` Jens Axboe
  0 siblings, 0 replies; 17+ messages in thread
From: Jens Axboe @ 2021-09-10 18:25 UTC (permalink / raw)
  To: io-uring, linux-fsdevel; +Cc: torvalds, viro, Jens Axboe

Get rid of the need to do re-expand and revert on an iterator when we
encounter a short IO, or failure that warrants a retry. Use the new
state save/restore helpers instead.

Signed-off-by: Jens Axboe <[email protected]>
---
 fs/io_uring.c | 42 +++++++++++++++++++++++-------------------
 1 file changed, 23 insertions(+), 19 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 855ea544807f..84e33f751372 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -712,6 +712,7 @@ struct io_async_rw {
 	struct iovec			fast_iov[UIO_FASTIOV];
 	const struct iovec		*free_iovec;
 	struct iov_iter			iter;
+	struct iov_iter_state		iter_state;
 	size_t				bytes_done;
 	struct wait_page_queue		wpq;
 };
@@ -2608,8 +2609,7 @@ static bool io_resubmit_prep(struct io_kiocb *req)
 
 	if (!rw)
 		return !io_req_prep_async(req);
-	/* may have left rw->iter inconsistent on -EIOCBQUEUED */
-	iov_iter_revert(&rw->iter, req->result - iov_iter_count(&rw->iter));
+	iov_iter_restore(&rw->iter, &rw->iter_state, 0);
 	return true;
 }
 
@@ -3437,19 +3437,22 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 	struct kiocb *kiocb = &req->rw.kiocb;
 	struct iov_iter __iter, *iter = &__iter;
 	struct io_async_rw *rw = req->async_data;
-	ssize_t io_size, ret, ret2;
 	bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
+	struct iov_iter_state __state, *state;
+	ssize_t ret, ret2;
 
 	if (rw) {
 		iter = &rw->iter;
+		state = &rw->iter_state;
 		iovec = NULL;
 	} else {
 		ret = io_import_iovec(READ, req, &iovec, iter, !force_nonblock);
 		if (ret < 0)
 			return ret;
+		state = &__state;
 	}
-	io_size = iov_iter_count(iter);
-	req->result = io_size;
+	req->result = iov_iter_count(iter);
+	iov_iter_save_state(iter, state);
 
 	/* Ensure we clear previously set non-block flag */
 	if (!force_nonblock)
@@ -3463,7 +3466,7 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 		return ret ?: -EAGAIN;
 	}
 
-	ret = rw_verify_area(READ, req->file, io_kiocb_ppos(kiocb), io_size);
+	ret = rw_verify_area(READ, req->file, io_kiocb_ppos(kiocb), state->count);
 	if (unlikely(ret)) {
 		kfree(iovec);
 		return ret;
@@ -3479,18 +3482,17 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 		/* no retry on NONBLOCK nor RWF_NOWAIT */
 		if (req->flags & REQ_F_NOWAIT)
 			goto done;
-		/* some cases will consume bytes even on error returns */
-		iov_iter_reexpand(iter, iter->count + iter->truncated);
-		iov_iter_revert(iter, io_size - iov_iter_count(iter));
 		ret = 0;
 	} else if (ret == -EIOCBQUEUED) {
 		goto out_free;
-	} else if (ret <= 0 || ret == io_size || !force_nonblock ||
+	} else if (ret <= 0 || ret == state->count || !force_nonblock ||
 		   (req->flags & REQ_F_NOWAIT) || !need_read_all(req)) {
 		/* read all, failed, already did sync or don't want to retry */
 		goto done;
 	}
 
+	iov_iter_restore(iter, state, ret);
+
 	ret2 = io_setup_async_rw(req, iovec, inline_vecs, iter, true);
 	if (ret2)
 		return ret2;
@@ -3501,7 +3503,7 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 	iter = &rw->iter;
 
 	do {
-		io_size -= ret;
+		state->count -= ret;
 		rw->bytes_done += ret;
 		/* if we can retry, do so with the callbacks armed */
 		if (!io_rw_should_retry(req)) {
@@ -3520,7 +3522,7 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 			return 0;
 		/* we got some bytes, but not all. retry. */
 		kiocb->ki_flags &= ~IOCB_WAITQ;
-	} while (ret > 0 && ret < io_size);
+	} while (ret > 0 && ret < state->count);
 done:
 	kiocb_done(kiocb, ret, issue_flags);
 out_free:
@@ -3543,19 +3545,23 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
 	struct kiocb *kiocb = &req->rw.kiocb;
 	struct iov_iter __iter, *iter = &__iter;
 	struct io_async_rw *rw = req->async_data;
-	ssize_t ret, ret2, io_size;
 	bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
+	struct iov_iter_state __state, *state;
+	ssize_t ret, ret2;
 
 	if (rw) {
 		iter = &rw->iter;
+		state = &rw->iter_state;
 		iovec = NULL;
 	} else {
 		ret = io_import_iovec(WRITE, req, &iovec, iter, !force_nonblock);
 		if (ret < 0)
 			return ret;
+		state = &__state;
 	}
-	io_size = iov_iter_count(iter);
-	req->result = io_size;
+	req->result = iov_iter_count(iter);
+	iov_iter_save_state(iter, state);
+	ret2 = 0;
 
 	/* Ensure we clear previously set non-block flag */
 	if (!force_nonblock)
@@ -3572,7 +3578,7 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
 	    (req->flags & REQ_F_ISREG))
 		goto copy_iov;
 
-	ret = rw_verify_area(WRITE, req->file, io_kiocb_ppos(kiocb), io_size);
+	ret = rw_verify_area(WRITE, req->file, io_kiocb_ppos(kiocb), state->count);
 	if (unlikely(ret))
 		goto out_free;
 
@@ -3619,9 +3625,7 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
 		kiocb_done(kiocb, ret2, issue_flags);
 	} else {
copy_iov:
-		/* some cases will consume bytes even on error returns */
-		iov_iter_reexpand(iter, iter->count + iter->truncated);
-		iov_iter_revert(iter, io_size - iov_iter_count(iter));
+		iov_iter_restore(iter, state, ret2);
 		ret = io_setup_async_rw(req, iovec, inline_vecs, iter, false);
 		return ret ?: -EAGAIN;
 	}
-- 
2.33.0

^ permalink raw reply related	[flat|nested] 17+ messages in thread
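Conceptually, the three-argument iov_iter_restore() used in this v1
rolls the iterator back to the saved checkpoint and then re-advances by
the bytes already done. A rough sketch of that semantic (a hypothetical
simplification, not the actual lib/iov_iter.c implementation):

	/* sketch: what iov_iter_restore(i, state, did_bytes) achieves */
	static void restore_sketch(struct iov_iter *i,
				   struct iov_iter_state *state,
				   size_t did_bytes)
	{
		i->iov_offset = state->iov_offset;
		i->count = state->count;
		i->nr_segs = state->nr_segs;
		/*
		 * NB: the real helper must also rewind the iov/bvec/kvec
		 * base pointer by the number of segments consumed since
		 * the state was saved.
		 */
		if (did_bytes)
			iov_iter_advance(i, did_bytes);
	}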