* [PATCH v3 RESEND] iomap: set REQ_NOWAIT according to IOCB_NOWAIT in Direct IO
@ 2020-12-04  9:44 Hao Xu
  2020-12-04 11:44 ` Pavel Begunkov
  2020-12-07  2:21 ` Dave Chinner
  0 siblings, 2 replies; 15+ messages in thread
From: Hao Xu @ 2020-12-04  9:44 UTC (permalink / raw)
  To: Alexander Viro, Christoph Hellwig
  Cc: Darrick J. Wong, linux-fsdevel, Jeffle Xu, Konstantin Khlebnikov,
	Jens Axboe, io-uring, Joseph Qi

Currently, IOCB_NOWAIT is ignored in Direct IO; REQ_NOWAIT is only set
when IOCB_HIPRI is set. But REQ_NOWAIT should be set as well when
IOCB_NOWAIT is set.

Suggested-by: Jeffle Xu <[email protected]>
Signed-off-by: Konstantin Khlebnikov <[email protected]>
Signed-off-by: Hao Xu <[email protected]>
---

Hi all,
I tested fio io_uring direct read for a file on an ext4 filesystem on an
nvme ssd. I found that IOCB_NOWAIT is ignored in the iomap layer, which
means REQ_NOWAIT is not set in bio->bi_opf. This makes nowait IO a
normal IO. Since I'm new to iomap and the block layer, I sincerely ask
your opinions in case I misunderstand the code, which is very likely to
happen. :)
The example I use: io_uring direct randread; the first try is with
IOCB_NOWAIT but not IOCB_HIPRI, and the IOCB_NOWAIT is ignored in the
block layer, which I think is not the designed behaviour.

I found that Konstantin reported this issue before, in May 2020
(https://www.spinics.net/lists/linux-block/msg53275.html), so I add his
signature here, and Jeffle's as well since he gave me some help.

v1->v2:
* add same logic in __blkdev_direct_IO_simple()
v2->v3:
* add same logic in do_blockdev_direct_IO()

 fs/block_dev.c       | 7 +++++++
 fs/direct-io.c       | 6 ++++--
 fs/iomap/direct-io.c | 3 +++
 3 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 9e84b1928b94..ca6f365c2f14 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -263,6 +263,10 @@ static void blkdev_bio_end_io_simple(struct bio *bio)
 		bio.bi_opf = dio_bio_write_op(iocb);
 		task_io_account_write(ret);
 	}
+
+	if (iocb->ki_flags & IOCB_NOWAIT)
+		bio.bi_opf |= REQ_NOWAIT;
+
 	if (iocb->ki_flags & IOCB_HIPRI)
 		bio_set_polled(&bio, iocb);
 
@@ -417,6 +421,9 @@ static void blkdev_bio_end_io(struct bio *bio)
 		task_io_account_write(bio->bi_iter.bi_size);
 	}
 
+	if (iocb->ki_flags & IOCB_NOWAIT)
+		bio->bi_opf |= REQ_NOWAIT;
+
 	dio->size += bio->bi_iter.bi_size;
 	pos += bio->bi_iter.bi_size;
 
diff --git a/fs/direct-io.c b/fs/direct-io.c
index d53fa92a1ab6..b221ed351c1c 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -1206,11 +1206,13 @@ static inline int drop_refcount(struct dio *dio)
 	if (iov_iter_rw(iter) == WRITE) {
 		dio->op = REQ_OP_WRITE;
 		dio->op_flags = REQ_SYNC | REQ_IDLE;
-		if (iocb->ki_flags & IOCB_NOWAIT)
-			dio->op_flags |= REQ_NOWAIT;
 	} else {
 		dio->op = REQ_OP_READ;
 	}
+
+	if (iocb->ki_flags & IOCB_NOWAIT)
+		dio->op_flags |= REQ_NOWAIT;
+
 	if (iocb->ki_flags & IOCB_HIPRI)
 		dio->op_flags |= REQ_HIPRI;
 
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 933f234d5bec..2e897688ed6d 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -64,6 +64,9 @@ static void iomap_dio_submit_bio(struct iomap_dio *dio, struct iomap *iomap,
 {
 	atomic_inc(&dio->ref);
 
+	if (dio->iocb->ki_flags & IOCB_NOWAIT)
+		bio->bi_opf |= REQ_NOWAIT;
+
 	if (dio->iocb->ki_flags & IOCB_HIPRI)
 		bio_set_polled(bio, dio->iocb);
 
-- 
1.8.3.1

^ permalink raw reply related [flat|nested] 15+ messages in thread
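[Editor's note] The patch applies the same IOCB_NOWAIT -> REQ_NOWAIT mapping in three submission paths. The mapping can be condensed into one pure function; this is an illustrative sketch only, not kernel code: the flag bit values below are invented for the demo (the real IOCB_* and REQ_* definitions live in the kernel's include/linux/fs.h and include/linux/blk_types.h), and iocb_to_bio_flags() is a made-up helper name.

```c
#include <assert.h>

/* Illustrative bit values only -- the real IOCB_* and REQ_* flags are
 * defined by the kernel and differ from these. */
#define IOCB_HIPRI   (1u << 3)
#define IOCB_NOWAIT  (1u << 7)
#define REQ_NOWAIT   (1u << 21)
#define REQ_HIPRI    (1u << 25)

/* Sketch of the patch's intent: derive bio op flags from iocb flags,
 * with IOCB_NOWAIT mapped independently of IOCB_HIPRI and of the
 * read/write direction. */
unsigned int iocb_to_bio_flags(unsigned int ki_flags)
{
	unsigned int op_flags = 0;

	if (ki_flags & IOCB_NOWAIT)
		op_flags |= REQ_NOWAIT;
	if (ki_flags & IOCB_HIPRI)
		op_flags |= REQ_HIPRI;
	return op_flags;
}
```

Before the patch, only the IOCB_HIPRI branch existed in the iomap path; the point of the change is that the two flags propagate independently.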
* Re: [PATCH v3 RESEND] iomap: set REQ_NOWAIT according to IOCB_NOWAIT in Direct IO
  2020-12-04  9:44 [PATCH v3 RESEND] iomap: set REQ_NOWAIT according to IOCB_NOWAIT in Direct IO Hao Xu
@ 2020-12-04 11:44 ` Pavel Begunkov
  2020-12-07  2:21 ` Dave Chinner
  1 sibling, 0 replies; 15+ messages in thread
From: Pavel Begunkov @ 2020-12-04 11:44 UTC (permalink / raw)
  To: Hao Xu, Alexander Viro, Christoph Hellwig
  Cc: Darrick J. Wong, linux-fsdevel, Jeffle Xu, Konstantin Khlebnikov,
	Jens Axboe, io-uring, Joseph Qi, linux-block

On 04/12/2020 09:44, Hao Xu wrote:
> Currently, IOCB_NOWAIT is ignored in Direct IO, REQ_NOWAIT is only set
> when IOCB_HIPRI is set. But REQ_NOWAIT should be set as well when
> IOCB_NOWAIT is set.

I believe Jens took my patch fixing that for blkdev_direct_IO*() (but
not iomap) a while ago. BTW, even though get_maintainer.pl doesn't
think so, AFAIK fs/block_dev.c is managed by [email protected].
Please CC it next time.

> Suggested-by: Jeffle Xu <[email protected]>
> Signed-off-by: Konstantin Khlebnikov <[email protected]>
> Signed-off-by: Hao Xu <[email protected]>
> ---
[...]

-- 
Pavel Begunkov
* Re: [PATCH v3 RESEND] iomap: set REQ_NOWAIT according to IOCB_NOWAIT in Direct IO
  2020-12-04  9:44 [PATCH v3 RESEND] iomap: set REQ_NOWAIT according to IOCB_NOWAIT in Direct IO Hao Xu
  2020-12-04 11:44 ` Pavel Begunkov
@ 2020-12-07  2:21 ` Dave Chinner
  2020-12-07 23:40   ` Jens Axboe
  2020-12-08  5:46   ` JeffleXu
  1 sibling, 2 replies; 15+ messages in thread
From: Dave Chinner @ 2020-12-07  2:21 UTC (permalink / raw)
  To: Hao Xu
  Cc: Alexander Viro, Christoph Hellwig, Darrick J. Wong, linux-fsdevel,
	Jeffle Xu, Konstantin Khlebnikov, Jens Axboe, io-uring, Joseph Qi

On Fri, Dec 04, 2020 at 05:44:56PM +0800, Hao Xu wrote:
> Currently, IOCB_NOWAIT is ignored in Direct IO, REQ_NOWAIT is only set
> when IOCB_HIPRI is set. But REQ_NOWAIT should be set as well when
> IOCB_NOWAIT is set.
>
> Suggested-by: Jeffle Xu <[email protected]>
> Signed-off-by: Konstantin Khlebnikov <[email protected]>
> Signed-off-by: Hao Xu <[email protected]>
> ---
>
> Hi all,
> I tested fio io_uring direct read for a file on ext4 filesystem on a
> nvme ssd. I found that IOCB_NOWAIT is ignored in iomap layer, which
> means REQ_NOWAIT is not set in bio->bi_opf.

What iomap is doing is correct behaviour. IOCB_NOWAIT applies to the
filesystem behaviour, not the block device.

REQ_NOWAIT can result in partial IO failures because the error is
only reported to the iomap layer via IO completions. Hence we can
split a DIO into multiple bios and have random bios in that IO fail
with EAGAIN because REQ_NOWAIT is set. This error will get reported
to the submitter via completion, and it will override any of the
partial IOs that actually completed.

Hence, like the recently reported multi-mapping IOCB_NOWAIT bug
reported by Jens and fixed in commit 883a790a8440 ("xfs: don't allow
NOWAIT DIO across extent boundaries") we'll get silent partial
writes occurring because the second submitted bio in an IO can
trigger EAGAIN errors with partial IO completion having already
occurred.

Further, we don't allow partial IO completion for DIO on XFS at all.
DIO must be completely submitted and completed or return an error
without having issued any IO at all. Hence using REQ_NOWAIT for DIO
bios is incorrect and not desirable.

Cheers,

Dave.
-- 
Dave Chinner
[email protected]
* Re: [PATCH v3 RESEND] iomap: set REQ_NOWAIT according to IOCB_NOWAIT in Direct IO
  2020-12-07  2:21 ` Dave Chinner
@ 2020-12-07 23:40   ` Jens Axboe
  2020-12-09 21:15     ` Dave Chinner
  2020-12-08  5:46   ` JeffleXu
  1 sibling, 1 reply; 15+ messages in thread
From: Jens Axboe @ 2020-12-07 23:40 UTC (permalink / raw)
  To: Dave Chinner, Hao Xu
  Cc: Alexander Viro, Christoph Hellwig, Darrick J. Wong, linux-fsdevel,
	Jeffle Xu, Konstantin Khlebnikov, io-uring, Joseph Qi

On 12/6/20 7:21 PM, Dave Chinner wrote:
> On Fri, Dec 04, 2020 at 05:44:56PM +0800, Hao Xu wrote:
[...]
> What iomap is doing is correct behaviour. IOCB_NOWAIT applies to the
> filesystem behaviour, not the block device.
>
> REQ_NOWAIT can result in partial IO failures because the error is
> only reported to the iomap layer via IO completions. Hence we can
> split a DIO into multiple bios and have random bios in that IO fail
> with EAGAIN because REQ_NOWAIT is set. This error will get reported
> to the submitter via completion, and it will override any of the
> partial IOs that actually completed.
>
> Hence, like the recently reported multi-mapping IOCB_NOWAIT bug
> reported by Jens and fixed in commit 883a790a8440 ("xfs: don't allow
> NOWAIT DIO across extent boundaries") we'll get silent partial
> writes occurring because the second submitted bio in an IO can
> trigger EAGAIN errors with partial IO completion having already
> occurred.
>
> Further, we don't allow partial IO completion for DIO on XFS at all.
> DIO must be completely submitted and completed or return an error
> without having issued any IO at all. Hence using REQ_NOWAIT for
> DIO bios is incorrect and not desirable.

What you say makes total sense for a user using RWF_NOWAIT, but it
doesn't make a lot of sense for io_uring where we really want
IOCB_NOWAIT to be what it suggests it is - don't wait for other IO to
complete, if avoidable. One of the things that really suck with
aio/libaio is the "yeah it's probably async, but lol, might not be"
aspect of it.

For io_uring, if we do get -EAGAIN, we'll retry without NOWAIT set. So
the concern about fractured/short writes doesn't bubble up to the
application. Hence we really want an IOCB_NOWAIT_REALLY on that side,
instead of the poor mans IOCB_MAYBE_NOWAIT semantics.

-- 
Jens Axboe
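[Editor's note] The userspace analogue of the io_uring fallback Jens describes can be sketched with preadv2(2) and RWF_NOWAIT: attempt the non-blocking read first and, on EAGAIN, retry the same read without the flag. A minimal sketch under stated assumptions: the helper and demo names are invented, and EOPNOTSUPP is also handled for kernels that lack RWF_NOWAIT support.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/uio.h>

/* Try a non-blocking read first; if the kernel reports EAGAIN (or the
 * flag is unsupported), fall back to a normal read that may block --
 * the same shape as io_uring's internal retry-without-NOWAIT. */
ssize_t read_nowait_then_block(int fd, struct iovec *iov, int iovcnt,
			       off_t off)
{
	ssize_t n = preadv2(fd, iov, iovcnt, off, RWF_NOWAIT);

	if (n < 0 && (errno == EAGAIN || errno == EOPNOTSUPP))
		n = preadv2(fd, iov, iovcnt, off, 0);
	return n;
}

/* Demo: write 5 bytes to a temp file and read them back through the
 * helper; returns the byte count read, or -1 on setup failure. */
long nowait_demo(void)
{
	char path[] = "/tmp/nowait_demo_XXXXXX";
	char buf[8];
	struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
	int fd = mkstemp(path);
	long n;

	if (fd < 0)
		return -1;
	if (pwrite(fd, "hello", 5, 0) != 5) {
		close(fd);
		return -1;
	}
	n = read_nowait_then_block(fd, &iov, 1, 0);
	unlink(path);
	close(fd);
	return n;
}
```

The retry is transparent to the caller, which is exactly why the thread distinguishes "maybe nowait" (caller sees a clean result either way) from "really nowait" (caller wants the EAGAIN surfaced).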
* Re: [PATCH v3 RESEND] iomap: set REQ_NOWAIT according to IOCB_NOWAIT in Direct IO
  2020-12-07 23:40   ` Jens Axboe
@ 2020-12-09 21:15     ` Dave Chinner
  2020-12-10  2:33       ` JeffleXu
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2020-12-09 21:15 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Hao Xu, Alexander Viro, Christoph Hellwig, Darrick J. Wong,
	linux-fsdevel, Jeffle Xu, Konstantin Khlebnikov, io-uring, Joseph Qi

On Mon, Dec 07, 2020 at 04:40:38PM -0700, Jens Axboe wrote:
> On 12/6/20 7:21 PM, Dave Chinner wrote:
[...]
> What you say makes total sense for a user using RWF_NOWAIT, but it
> doesn't make a lot of sense for io_uring where we really want
> IOCB_NOWAIT to be what it suggests it is - don't wait for other IO to
> complete, if avoidable. One of the things that really suck with
> aio/libaio is the "yeah it's probably async, but lol, might not be"
> aspect of it.

Sure, but we have no way of telling what semantics the IO issuer
actually requires from above. And because IOCB_NOWAIT behaviour is
directly exposed to userspace by RWF_NOWAIT, that's the behaviour we
have to implement.

> For io_uring, if we do get -EAGAIN, we'll retry without NOWAIT set. So
> the concern about fractured/short writes doesn't bubble up to the
> application. Hence we really want an IOCB_NOWAIT_REALLY on that side,
> instead of the poor mans IOCB_MAYBE_NOWAIT semantics.

Yup, perhaps what we really want is a true IOCB_NONBLOCK flag as an
internal kernel implementation. i.e. don't block anywhere in the
stack, and the caller must handle retrying/completing the entire IO
regardless of where the -EAGAIN comes from during the IO, including
from a partial completion that the caller is waiting for...

i.e. rather than hacking around this specific instance of "it blocks
and we don't want it to", define the semantics and behaviour of a
fully non-blocking IO through all layers from the VFS down to the
hardware and let's implement that. Then we can stop playing
whack-a-mole with all the "but it blocks when I do this, doctor!"
issues that we seem to keep having.... :)

Cheers,

Dave.
-- 
Dave Chinner
[email protected]
* Re: [PATCH v3 RESEND] iomap: set REQ_NOWAIT according to IOCB_NOWAIT in Direct IO
  2020-12-09 21:15     ` Dave Chinner
@ 2020-12-10  2:33       ` JeffleXu
  0 siblings, 0 replies; 15+ messages in thread
From: JeffleXu @ 2020-12-10  2:33 UTC (permalink / raw)
  To: Dave Chinner, Jens Axboe
  Cc: Hao Xu, Alexander Viro, Christoph Hellwig, Darrick J. Wong,
	linux-fsdevel, Konstantin Khlebnikov, io-uring, Joseph Qi

On 12/10/20 5:15 AM, Dave Chinner wrote:
> On Mon, Dec 07, 2020 at 04:40:38PM -0700, Jens Axboe wrote:
[...]
> Yup, perhaps what we really want is a true IOCB_NONBLOCK flag as an
> internal kernel implementation. i.e. don't block anywhere in the
> stack, and the caller must handle retrying/completing the entire IO
> regardless of where the -EAGAIN comes from during the IO, including
> from a partial completion that the caller is waiting for...
>
> i.e. rather than hacking around this specific instance of "it blocks
> and we don't want it to", define the semantics and behaviour of a
> fully non-blocking IO through all layers from the VFS down to the
> hardware and let's implement that. Then we can stop playing
> whack-a-mole with all the "but it blocks when I do this, doctor!"
> issues that we seem to keep having.... :)

In my opinion, the semantics of NOWAIT is clear, just like Jens said:
"don't wait, if **avoidable**". Also, it should be the responsibility
of the callers who set this flag to resubmit the failed IO once
-EAGAIN is returned.

e.g. if the user app sets RWF_NOWAIT, then it's the responsibility of
the user app to resubmit the failed IO once -EAGAIN is returned. If
the io_uring layer newly sets the IOCB_NONBLOCK flag, then it's the
responsibility of io_uring to resubmit.

The limitation here is that once a big DIO/bio gets split, the
atomicity of the whole DIO/bio cannot be guaranteed. This seems to be
a limitation of the direct IO routine and the block layer, and work
in the layers above, e.g. the filesystem, is needed to guarantee
atomicity.

-- 
Thanks,
Jeffle
* Re: [PATCH v3 RESEND] iomap: set REQ_NOWAIT according to IOCB_NOWAIT in Direct IO
  2020-12-07  2:21 ` Dave Chinner
  2020-12-07 23:40   ` Jens Axboe
@ 2020-12-08  5:46   ` JeffleXu
  2020-12-09 21:23     ` Dave Chinner
  1 sibling, 1 reply; 15+ messages in thread
From: JeffleXu @ 2020-12-08  5:46 UTC (permalink / raw)
  To: Dave Chinner, Hao Xu
  Cc: Alexander Viro, Christoph Hellwig, Darrick J. Wong, linux-fsdevel,
	Konstantin Khlebnikov, Jens Axboe, io-uring, Joseph Qi

On 12/7/20 10:21 AM, Dave Chinner wrote:
> On Fri, Dec 04, 2020 at 05:44:56PM +0800, Hao Xu wrote:
[...]
> Further, we don't allow partial IO completion for DIO on XFS at all.
> DIO must be completely submitted and completed or return an error
> without having issued any IO at all. Hence using REQ_NOWAIT for
> DIO bios is incorrect and not desirable.

Not familiar with xfs though, just out of curiosity: how do you
achieve 'no partial completion'? I mean, you could avoid a partial
-EAGAIN by not setting REQ_NOWAIT, but you could still get other
partial errors such as -ENOMEM or something, as long as one DIO can
be split into multiple bios.

-- 
Thanks,
Jeffle
* Re: [PATCH v3 RESEND] iomap: set REQ_NOWAIT according to IOCB_NOWAIT in Direct IO
  2020-12-08  5:46   ` JeffleXu
@ 2020-12-09 21:23     ` Dave Chinner
  2020-12-10  1:55       ` JeffleXu
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2020-12-09 21:23 UTC (permalink / raw)
  To: JeffleXu
  Cc: Hao Xu, Alexander Viro, Christoph Hellwig, Darrick J. Wong,
	linux-fsdevel, Konstantin Khlebnikov, Jens Axboe, io-uring, Joseph Qi

On Tue, Dec 08, 2020 at 01:46:47PM +0800, JeffleXu wrote:
> On 12/7/20 10:21 AM, Dave Chinner wrote:
[...]
> Not familiar with xfs though, just out of curiosity: how do you
> achieve 'no partial completion'? I mean, you could avoid a partial
> -EAGAIN by not setting REQ_NOWAIT, but you could still get other
> partial errors such as -ENOMEM or something, as long as one DIO can
> be split into multiple bios.

If any part of a DIO fails, we fail the entire IO. When we split a
DIO into multiple bios and one reports an error, we don't track
where in the IO it actually failed; we just fail the entire IO.

e.g. how do you report correct partial completion to userspace when
a DIO gets split into 3 pieces and the middle one fails? There are
two ranges that actually completed, but we can only report one of
them....

And, really, we still need to report that an IO failed to userspace,
because mission critical apps care more about the fact that an IO
failure occurred than silently swallowing the IO error with a
(potentially incorrect) partial IO completion notification.

Cheers,

Dave.
-- 
Dave Chinner
[email protected]
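[Editor's note] The "error overrides any partial completion" rule Dave describes can be modelled as a tiny completion-side state machine. This is an illustrative sketch under assumed names (dio_state, bio_done and dio_result are not the kernel's API): each split bio reports into shared state, the first error is sticky, and the final result is either that error or the full byte count, never a partial range.

```c
#include <assert.h>
#include <stddef.h>

/* Model of a multi-bio DIO's shared completion state. */
struct dio_state {
	size_t done;	/* bytes completed so far */
	int error;	/* first error seen; sticky */
};

/* Completion of one split bio: account the bytes, latch the first
 * error. Later successes cannot clear an earlier error. */
void bio_done(struct dio_state *dio, size_t bytes, int err)
{
	if (err && !dio->error)
		dio->error = err;
	dio->done += bytes;
}

/* Final result once every split bio has completed: any error wins,
 * so the submitter never sees a partial byte count for a failed DIO. */
long dio_result(const struct dio_state *dio)
{
	return dio->error ? dio->error : (long)dio->done;
}
```

This mirrors the 3-piece example in the mail: if the middle piece fails, the two completed ranges are not reportable through a single byte-count interface, so the whole IO reports the error.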
* Re: [PATCH v3 RESEND] iomap: set REQ_NOWAIT according to IOCB_NOWAIT in Direct IO
  2020-12-09 21:23     ` Dave Chinner
@ 2020-12-10  1:55       ` JeffleXu
  2020-12-10  5:18         ` Dave Chinner
  0 siblings, 1 reply; 15+ messages in thread
From: JeffleXu @ 2020-12-10  1:55 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Hao Xu, Alexander Viro, Christoph Hellwig, Darrick J. Wong,
	linux-fsdevel, Konstantin Khlebnikov, Jens Axboe, io-uring, Joseph Qi

Sorry, I'm still a little confused.

On 12/10/20 5:23 AM, Dave Chinner wrote:
> On Tue, Dec 08, 2020 at 01:46:47PM +0800, JeffleXu wrote:
[...]
>>> Further, we don't allow partial IO completion for DIO on XFS at all.
>>> DIO must be completely submitted and completed or return an error
>>> without having issued any IO at all. Hence using REQ_NOWAIT for
>>> DIO bios is incorrect and not desirable.

With the current block layer implementation, as long as one split bio
fails, the whole DIO fails, in which case several split bios may have
succeeded and their content has been written to the disk. This is
obviously what you called "partial IO completion".

I'm just wondering how you achieve "DIO must return an error without
having issued any IO at all". Do you have some method of reverting
content that has already been written to the disk when a partial
error happens?

> If any part of a DIO fails, we fail the entire IO. When we split a
> DIO into multiple bios and one reports an error, we don't track
> where in the IO it actually failed; we just fail the entire IO.
>
> e.g. how do you report correct partial completion to userspace when
> a DIO gets split into 3 pieces and the middle one fails? There are
> two ranges that actually completed, but we can only report one of
> them....
>
> And, really, we still need to report that an IO failed to userspace,
> because mission critical apps care more about the fact that an IO
> failure occurred than silently swallowing the IO error with a
> (potentially incorrect) partial IO completion notification.

-- 
Thanks,
Jeffle
* Re: [PATCH v3 RESEND] iomap: set REQ_NOWAIT according to IOCB_NOWAIT in Direct IO 2020-12-10 1:55 ` JeffleXu @ 2020-12-10 5:18 ` Dave Chinner 2020-12-11 2:50 ` JeffleXu 0 siblings, 1 reply; 15+ messages in thread From: Dave Chinner @ 2020-12-10 5:18 UTC (permalink / raw) To: JeffleXu Cc: Hao Xu, Alexander Viro, Christoph Hellwig, Darrick J. Wong, linux-fsdevel, Konstantin Khlebnikov, Jens Axboe, io-uring, Joseph Qi On Thu, Dec 10, 2020 at 09:55:32AM +0800, JeffleXu wrote: > Sorry I'm still a little confused. > > > On 12/10/20 5:23 AM, Dave Chinner wrote: > > On Tue, Dec 08, 2020 at 01:46:47PM +0800, JeffleXu wrote: > >> > >> > >> On 12/7/20 10:21 AM, Dave Chinner wrote: > >>> On Fri, Dec 04, 2020 at 05:44:56PM +0800, Hao Xu wrote: > >>>> Currently, IOCB_NOWAIT is ignored in Direct IO, REQ_NOWAIT is only set > >>>> when IOCB_HIPRI is set. But REQ_NOWAIT should be set as well when > >>>> IOCB_NOWAIT is set. > >>>> > >>>> Suggested-by: Jeffle Xu <[email protected]> > >>>> Signed-off-by: Konstantin Khlebnikov <[email protected]> > >>>> Signed-off-by: Hao Xu <[email protected]> > >>>> --- > >>>> > >>>> Hi all, > >>>> I tested fio io_uring direct read for a file on ext4 filesystem on a > >>>> nvme ssd. I found that IOCB_NOWAIT is ignored in iomap layer, which > >>>> means REQ_NOWAIT is not set in bio->bi_opf. > >>> > >>> What iomap is doing is correct behaviour. IOCB_NOWAIT applies to the > >>> filesystem behaviour, not the block device. > >>> > >>> REQ_NOWAIT can result in partial IO failures because the error is > >>> only reported to the iomap layer via IO completions. Hence we can > >>> split a DIO into multiple bios and have random bios in that IO fail > >>> with EAGAIN because REQ_NOWAIT is set. This error will > >>> get reported to the submitter via completion, and it will override > >>> any of the partial IOs that actually completed. 
> >>> > >>> Hence, like the recently reported multi-mapping IOCB_NOWAIT bug > >>> reported by Jens and fixed in commit 883a790a8440 ("xfs: don't allow > >>> NOWAIT DIO across extent boundaries") we'll get silent partial > >>> writes occurring because the second submitted bio in an IO can > >>> trigger EAGAIN errors with partial IO completion having already > >>> occurred. > >>> > > >>> Further, we don't allow partial IO completion for DIO on XFS at all. > >>> DIO must be completely submitted and completed or return an error > >>> without having issued any IO at all. Hence using REQ_NOWAIT for > >>> DIO bios is incorrect and not desirable. > > > The current block layer implementation causes that, as long as one split > bio fails, then the whole DIO fails, in which case several split bios > maybe have succeeded and the content has been written to the disk. This > is obviously what you called "partial IO completion". > > I'm just concerned on how do you achieve that "DIO must return an error > without having issued any IO at all". Do you have some method of > reverting the content has already been written into the disk when a > partial error happened? I think you've misunderstood what I was saying. I did not say "DIO must return an error without having issued any IO at all". There are two parts to my statement, and you just smashed part of the first statement into part of the second statement and came up something I didn't actually say. The first statement is: 1. "DIO must be fully submitted and completed ...." That is, if we need to break an IO up into multiple parts, the entire IO must be submitted and completed as a whole. We do not allow partial submission or completion of the IO at all because we cannot accurately report what parts of a multi-bio DIO that failed through the completion interface. 
IOWs, if any of the IOs after the first one fail submission, then we must complete all the IOs that have already been submitted before we can report the failure that occurred during the IO. The second statement is: 2. "... or return an error without having issued any IO at all." IO submission errors are only reported by the DIO layer through IO completion, in which case #1 is applied. #2 only applies to errors that occur before IO submission is started, and these errors are directly reported to the caller. IOCB_NOWAIT is a check done before we start submission, hence can return -EAGAIN directly to the caller. IOWs, if an error is returned to the caller, we have either not submitted any IO at all, or we have fully submitted and completed the IO and there was some error reported by the IO completions. There is no scope for "partial IO" here - it either completed fully or we got an error. This is necessary for correct AIO semantics. We aren't waiting for completions to be able to report submission errors to submitters. Hence for async IO, the only way for an error in the DIO layer to be reported to the submitter is if the error occurs before the first IO is submitted.(*) RWF_NOWAIT was explicitly intended to enable applications using AIO+DIO to avoid long latencies that occur as a result of blocking on filesystem locks and resources. Blocking in the request queue is minimal latency compared to waiting for (tens of) thousands of IOs to complete ((tens of) seconds!) so the filesystem iomap path can run a transaction to allocate disk space for the DIO. IOWs, IOCB_NOWAIT was pretty much intended to only be seen at the filesystem layers to avoid the extremely high latencies that filesystem layers might cause. Blocking for a few milliseconds or even tens of milliseconds in the request queue is not a problem IOCB_NOWAIT was ever intended to solve.
Really, if io_uring needs DIO to avoid blocking in the request queues below the filesystem, it should be providing that guidance directly. IOCB_NOWAIT is -part- of the semantics being asked for, but it does not provide them all and we can't change them to provide exactly what io_uring wants because IOCB_NOWAIT == RWF_NOWAIT semantics. Cheers, Dave. (*) Yes, yes, I know that if you have really fast storage the IO might complete before submission has finished, but that's just the final completion being done by the submitter, and so #1 is actually being followed in this case. i.e. IO is fully submitted and completed. -- Dave Chinner [email protected]
* Re: [PATCH v3 RESEND] iomap: set REQ_NOWAIT according to IOCB_NOWAIT in Direct IO 2020-12-10 5:18 ` Dave Chinner @ 2020-12-11 2:50 ` JeffleXu 2020-12-14 2:56 ` Dave Chinner 0 siblings, 1 reply; 15+ messages in thread From: JeffleXu @ 2020-12-11 2:50 UTC (permalink / raw) To: Dave Chinner Cc: Hao Xu, Alexander Viro, Christoph Hellwig, Darrick J. Wong, linux-fsdevel, Konstantin Khlebnikov, Jens Axboe, io-uring, Joseph Qi Excellent explanation! Thanks a lot. Still some questions below. On 12/10/20 1:18 PM, Dave Chinner wrote: > On Thu, Dec 10, 2020 at 09:55:32AM +0800, JeffleXu wrote: >> Sorry I'm still a little confused. >> >> >> On 12/10/20 5:23 AM, Dave Chinner wrote: >>> On Tue, Dec 08, 2020 at 01:46:47PM +0800, JeffleXu wrote: >>>> >>>> >>>> On 12/7/20 10:21 AM, Dave Chinner wrote: >>>>> On Fri, Dec 04, 2020 at 05:44:56PM +0800, Hao Xu wrote: >>>>>> Currently, IOCB_NOWAIT is ignored in Direct IO, REQ_NOWAIT is only set >>>>>> when IOCB_HIPRI is set. But REQ_NOWAIT should be set as well when >>>>>> IOCB_NOWAIT is set. >>>>>> >>>>>> Suggested-by: Jeffle Xu <[email protected]> >>>>>> Signed-off-by: Konstantin Khlebnikov <[email protected]> >>>>>> Signed-off-by: Hao Xu <[email protected]> >>>>>> --- >>>>>> >>>>>> Hi all, >>>>>> I tested fio io_uring direct read for a file on ext4 filesystem on a >>>>>> nvme ssd. I found that IOCB_NOWAIT is ignored in iomap layer, which >>>>>> means REQ_NOWAIT is not set in bio->bi_opf. >>>>> >>>>> What iomap is doing is correct behaviour. IOCB_NOWAIT applies to the >>>>> filesystem behaviour, not the block device. >>>>> >>>>> REQ_NOWAIT can result in partial IO failures because the error is >>>>> only reported to the iomap layer via IO completions. Hence we can >>>>> split a DIO into multiple bios and have random bios in that IO fail >>>>> with EAGAIN because REQ_NOWAIT is set. This error will >>>>> get reported to the submitter via completion, and it will override >>>>> any of the partial IOs that actually completed. 
>>>>> >>>>> Hence, like the recently reported multi-mapping IOCB_NOWAIT bug >>>>> reported by Jens and fixed in commit 883a790a8440 ("xfs: don't allow >>>>> NOWAIT DIO across extent boundaries") we'll get silent partial >>>>> writes occurring because the second submitted bio in an IO can >>>>> trigger EAGAIN errors with partial IO completion having already >>>>> occurred. >>>>> >> >>>>> Further, we don't allow partial IO completion for DIO on XFS at all. >>>>> DIO must be completely submitted and completed or return an error >>>>> without having issued any IO at all. Hence using REQ_NOWAIT for >>>>> DIO bios is incorrect and not desirable. >> >> >> The current block layer implementation causes that, as long as one split >> bio fails, then the whole DIO fails, in which case several split bios >> maybe have succeeded and the content has been written to the disk. This >> is obviously what you called "partial IO completion". >> >> I'm just concerned on how do you achieve that "DIO must return an error >> without having issued any IO at all". Do you have some method of >> reverting the content has already been written into the disk when a >> partial error happened? > > I think you've misunderstood what I was saying. I did not say > "DIO must return an error without having issued any IO at all". > There are two parts to my statement, and you just smashed part of > the first statement into part of the second statement and came up > something I didn't actually say. > > The first statement is: > > 1. "DIO must be fully submitted and completed ...." > > That is, if we need to break an IO up into multiple parts, the > entire IO must be submitted and completed as a whole. We do not > allow partial submission or completion of the IO at all because we > cannot accurately report what parts of a multi-bio DIO that failed > through the completion interface. 
> IOWs, if any of the IOs after the > first one fail submission, then we must complete all the IOs that > have already been submitted before we can report the failure that > occurred during the IO. > 1. Actually I'm still not quite clear on what you called "partial submission or completion". Even when REQ_NOWAIT is set for all split bios of one DIO, all these split bios are **submitted** to the block layer through submit_bio(). Even when one split bio after the first one fails **inside** submit_bio() because of REQ_NOWAIT, submit_bio() only returns BLK_QC_T_NONE, and the DIO layer (such as __blkdev_direct_IO()) will still call submit_bio() for the remaining split bios. And then only when all split bios complete will the whole kiocb complete. So if you define "submission" as submitting to the hardware disk (such as the nvme device driver), then it is indeed **partial submission** when REQ_NOWAIT is set. But if the definition of "submission" is actually "submitting to the block layer by calling submit_bio()", then all split bios of one DIO are indeed submitted to the block layer, even when one split bio gets BLK_STS_AGAIN because of REQ_NOWAIT. 2. One DIO could be split into multiple bios in the DIO layer. Similarly one big bio could be split into multiple bios in the block layer. In the situation when all split bios have already been submitted to the block layer, since the IO completion interface can return only one error code, the whole DIO could fail as long as one split bio fails, while several other split bios could have already succeeded and the corresponding disk sectors have already been overwritten. I'm not sure if this is what you called "silent partial writes", and if this is the core reason for commit 883a790a8440 ("xfs: don't allow NOWAIT DIO across extent boundaries") you mentioned in your previous mail. If this is the case, then it seems that the IO completion interface should be blamed for this issue. Besides REQ_NOWAIT, there may be more situations that will cause "silent partial writes".
As long as one split bio could fail (besides EAGAIN, maybe EOPNOTSUPP, if possible) while other split bios may still succeed, the error of that one split bio will still override the completion status of the whole DIO. In this case "silent partial writes" are still possible? In my understanding, passing REQ_NOWAIT flags from IOCB_NOWAIT in the DIO layer only makes the problem more likely to happen, but it's not the only source of this problem? > The second statement is: > > 2. "... or return an error without having issued any IO at > all." > > IO submission errors are only reported by the DIO layer through IO > completion, in which case #1 is applied. #2 only applies to errors > that occur before IO submission is started, and these errors are > directly reported to the caller. IOCB_NOWAIT is a check done before > we start submission, hence can return -EAGAIN directly to the > caller. > > IOWs, if an error is returned to the caller, we have either not > submitted any IO at all, or we have fully submitted and completed > the IO and there was some error reported by the IO completions. > There is no scope for "partial IO" here - it either completed fully > or we got an error. > > This is necessary for correct AIO semantics. We aren't waiting for > completions to be able to report submission errors to submitters. > Hence for async IO, the only way for an error in the DIO layer to be > reported to the submitter is if the error occurs before the first IO > is submitted.(*) > > RWF_NOWAIT was explicitly intended to enable applications using > AIO+DIO to avoid long latencies that occur as a result of blocking > on filesystem locks and resources. Blocking in the request queue is > minimal latency compared to waiting for (tens of) thousands of IOs > to complete ((tens of) seconds!) so the filesystem iomap path can run a > transaction to allocate disk space for the DIO.
> > IOWs, IOCB_NOWAIT was pretty much intended to only be seen at the > filesystem layers to avoid the extremely high latencies that > filesystem layers might cause. Blocking for a few milliseconds or > even tens of milliseconds in the request queue is not a problem > IOCB_NOWAIT was ever intended to solve. I didn't know the initial intention and history of IOCB_NOWAIT. Learned a lot. Thanks. > > Really, if io_uring needs DIO to avoid blocking in the request > queues below the filesystem, it should be providing that guidance > directly. IOCB_NOWAIT is -part- of the semantics being asked for, > but it does not provide them all and we can't change them to provide > exactly what io_uring wants because IOCB_NOWAIT == RWF_NOWAIT > semantics. > > Cheers, > > Dave. > > (*) Yes, yes, I know that if you have really fast storage the IO > might complete before submission has finished, but that's just the > final completion being done by the submitter and so #1 is actually > being followed in this case. i.e. IO is fully submitted and > completed. > -- Thanks, Jeffle
* Re: [PATCH v3 RESEND] iomap: set REQ_NOWAIT according to IOCB_NOWAIT in Direct IO 2020-12-11 2:50 ` JeffleXu @ 2020-12-14 2:56 ` Dave Chinner 2020-12-15 9:43 ` JeffleXu 0 siblings, 1 reply; 15+ messages in thread From: Dave Chinner @ 2020-12-14 2:56 UTC (permalink / raw) To: JeffleXu Cc: Hao Xu, Alexander Viro, Christoph Hellwig, Darrick J. Wong, linux-fsdevel, Konstantin Khlebnikov, Jens Axboe, io-uring, Joseph Qi On Fri, Dec 11, 2020 at 10:50:44AM +0800, JeffleXu wrote: > Excellent explanation! Thanks a lot. > > Still some questions below. > > On 12/10/20 1:18 PM, Dave Chinner wrote: > > On Thu, Dec 10, 2020 at 09:55:32AM +0800, JeffleXu wrote: > >> Sorry I'm still a little confused. > >> > >> > >> On 12/10/20 5:23 AM, Dave Chinner wrote: > >>> On Tue, Dec 08, 2020 at 01:46:47PM +0800, JeffleXu wrote: > >>>> > >>>> > >>>> On 12/7/20 10:21 AM, Dave Chinner wrote: > >>>>> On Fri, Dec 04, 2020 at 05:44:56PM +0800, Hao Xu wrote: > >>>>>> Currently, IOCB_NOWAIT is ignored in Direct IO, REQ_NOWAIT is only set > >>>>>> when IOCB_HIPRI is set. But REQ_NOWAIT should be set as well when > >>>>>> IOCB_NOWAIT is set. > >>>>>> > >>>>>> Suggested-by: Jeffle Xu <[email protected]> > >>>>>> Signed-off-by: Konstantin Khlebnikov <[email protected]> > >>>>>> Signed-off-by: Hao Xu <[email protected]> > >>>>>> --- > >>>>>> > >>>>>> Hi all, > >>>>>> I tested fio io_uring direct read for a file on ext4 filesystem on a > >>>>>> nvme ssd. I found that IOCB_NOWAIT is ignored in iomap layer, which > >>>>>> means REQ_NOWAIT is not set in bio->bi_opf. > >>>>> > >>>>> What iomap is doing is correct behaviour. IOCB_NOWAIT applies to the > >>>>> filesystem behaviour, not the block device. > >>>>> > >>>>> REQ_NOWAIT can result in partial IO failures because the error is > >>>>> only reported to the iomap layer via IO completions. Hence we can > >>>>> split a DIO into multiple bios and have random bios in that IO fail > >>>>> with EAGAIN because REQ_NOWAIT is set. 
This error will > >>>>> get reported to the submitter via completion, and it will override > >>>>> any of the partial IOs that actually completed. > >>>>> > >>>>> Hence, like the recently reported multi-mapping IOCB_NOWAIT bug > >>>>> reported by Jens and fixed in commit 883a790a8440 ("xfs: don't allow > >>>>> NOWAIT DIO across extent boundaries") we'll get silent partial > >>>>> writes occurring because the second submitted bio in an IO can > >>>>> trigger EAGAIN errors with partial IO completion having already > >>>>> occurred. > >>>>> > >> > >>>>> Further, we don't allow partial IO completion for DIO on XFS at all. > >>>>> DIO must be completely submitted and completed or return an error > >>>>> without having issued any IO at all. Hence using REQ_NOWAIT for > >>>>> DIO bios is incorrect and not desirable. > >> > >> > >> The current block layer implementation causes that, as long as one split > >> bio fails, then the whole DIO fails, in which case several split bios > >> maybe have succeeded and the content has been written to the disk. This > >> is obviously what you called "partial IO completion". > >> > >> I'm just concerned on how do you achieve that "DIO must return an error > >> without having issued any IO at all". Do you have some method of > >> reverting the content has already been written into the disk when a > >> partial error happened? > > > > I think you've misunderstood what I was saying. I did not say > > "DIO must return an error without having issued any IO at all". > > There are two parts to my statement, and you just smashed part of > > the first statement into part of the second statement and came up > > something I didn't actually say. > > > > The first statement is: > > > > 1. "DIO must be fully submitted and completed ...." > > > > That is, if we need to break an IO up into multiple parts, the > > entire IO must be submitted and completed as a whole. 
We do not > > allow partial submission or completion of the IO at all because we > > cannot accurately report what parts of a multi-bio DIO that failed > > through the completion interface. IOWs, if any of the IOs after the > > first one fail submission, then we must complete all the IOs that > > have already been submitted before we can report the failure that > > occurred during the IO. > > > > 1. Actually I'm quite not clear on what you called "partial submission > or completion". Even when REQ_NOWAIT is set for all split bios of one > DIO, then all these split bios are **submitted** to block layer through > submit_bio(). Even when one split bio after the first one failed > **inside** submit_bio() because of REQ_NOWAIT, submit_bio() only returns > BLK_QC_T_NONE, and the DIO layer (such as __blkdev_direct_IO()) will > still call submit_bio() for the remaining split bios. And then only when > all split bios complete, will the whole kiocb complete. Yes, so what you have there is complete submission and completion. i.e. what I described as #1. > So if you define "submission" as submitting to hardware disk (such as > nvme device driver), then it is indeed **partial submission** when > REQ_NOWAIT set. But if the definition of "submission" is actually > "submitting to block layer by calling submit_bio()", then all split bios > of one DIO are indeed submitted to block layer, even when one split bio > gets BLK_STS_AGAIN because of REQ_NOWIAT. The latter is the definition we using in iomap, because at this layer we are the ones submitting the bios and defining the bounds of the IO. What happens in the lower layers does not concern us here. Partial submission can occur at the iomap layer if IOCB_NOWAIT is set - that was what 883a790a8440 fixed. i.e. 
we do

	loop for all mapped extents {
		loop for mapped range {
			submit_bio(subrange)
		}
	}

So if we have an extent range like so:

	start                                    end
	+-----------+---------+------+-----------+
	    ext 1      ext 2     ext 3    ext 4

We actually have to loop above 4 times to map the 4 extents that span the user IO range. That means we need to submit at least 4 independent bios to run this IO. So if we submit all the bios in a first mapped range (ext 1), then go back to the filesystem to get the map for the next range to submit and it returns -EAGAIN, we break out of the outer loop without having completely submitted bios for the range spanning ext 2, 3 or 4. That's partial IO submission as the *iomap layer* defines it. And the "silent" part of the "partial write" came from the fact that this propagated -EAGAIN to the user despite the fact that some IO was actually successfully committed and completed. IOWs, we may have corrupted data on disk by telling the user nothing was done with -EAGAIN rather than failing the partial IO with -EIO. This is the problem commit 883a790a8440 fixed. > 2. One DIO could be split into multiple bios in DIO layer. Similarly one > big bio could be split into multiple bios in block layer. In the > situation when all split bios have already been submitted to block > layer, since the IO completion interface could return only one error > code, the whole DIO could fail as long as one split bio fails, while > several other split bios could have already succeeded and the > corresponding disk sectors have already been overwritten. Yup. That can happen. That's a data corruption for which prevention is the responsibility of the layer doing the splitting and recombining. The iomap layer knows nothing of this - this situation needs to result in the entire bio that was split being marked with an EIO error because we do not know what parts of the bio made it to disk or not. IOWs, the result of the IO is undefined, and so we need to return EIO to the user. This "data is inconsistent until all sub-devices complete their IO successfully" situation is also known as the "RAID 5 write hole". Crash while these IOs are in flight, and some might hit the disk and some might not. When the system comes back up, you've got no idea which part of that IO was completed or not, so the data in the entire stripe is corrupt. > I'm not sure > if this is what you called "silent partial writes", and if this is the > core reason for commit 883a790a8440 ("xfs: don't allow NOWAIT DIO across > extent boundaries") you mentioned in previous mail. No, it's not the same thing. This "split IO failed" should never result in a silent partial write - it should always result in EIO. > If this is the case, then it seems that the IO completion interface > should be blamed for this issue. Besides REQ_NOWAIT, there may be more > situations that will cause "silent partial writes". Yes, there are many potential causes of this behaviour. They are all considered to be a data corruption, and so in all cases they should be returning EIO as the completion status to the submitter of the bio. > As long as one split bio could fail (besides EAGAIN, maybe EOPNOTSUPP, > if possible), while other split bios may still succeed, then the error > of one split bio will still override the completion status of the whole > DIO. If you get EOPNOTSUPP from one device in a bio split across many devices, then you have a bug in the layer that is splitting the IO. All the sub-devices need to support the operation, or the top level device cannot safely support that operation. And if one device didn't do the operation, then we have inconsistent data on disk now and so EIO is the result that should be returned to the user. > In this case "silent partial writes" is still possible? In my > understanding, passing REQ_NOWAIT flags from IOCB_NOWAIT in DIO layer > only makes the problem more likely to happen, but it's not the only > source of this problem?
Passing REQ_NOWAIT from the iomap dio layer means that we will see data corruptions in stable storage simply due to high IO load, rather than the much, much rarer situations of software bugs or actual hardware errors. IOWs, REQ_NOWAIT essentially guarantees data corruption in stable storage of a perfectly working storage stack will occur and there is nothing the application can do to prevent that occurring. The worst part of this is that REQ_NOWAIT also requires the application to detect and correct the corruption caused by the kernel using REQ_NOWAIT inappropriately. And if the application crashes before it can correct the data corruption, it's suddenly a permanent, uncorrectable data corruption. REQ_NOWAIT is dangerous and can easily cause harm to user data by aborting parts of submitted IOs. IOWs, it's more like a random IO error injection mechanism than a useful method of doing safe non-blocking IO. Cheers, Dave. -- Dave Chinner [email protected]
* Re: [PATCH v3 RESEND] iomap: set REQ_NOWAIT according to IOCB_NOWAIT in Direct IO 2020-12-14 2:56 ` Dave Chinner @ 2020-12-15 9:43 ` JeffleXu 2021-04-02 14:32 ` Pavel Begunkov 0 siblings, 1 reply; 15+ messages in thread From: JeffleXu @ 2020-12-15 9:43 UTC (permalink / raw) To: Dave Chinner Cc: Hao Xu, Alexander Viro, Christoph Hellwig, Darrick J. Wong, linux-fsdevel, Konstantin Khlebnikov, Jens Axboe, io-uring, Joseph Qi Thanks for your explanation, again. On 12/14/20 10:56 AM, Dave Chinner wrote: > On Fri, Dec 11, 2020 at 10:50:44AM +0800, JeffleXu wrote: >> Excellent explanation! Thanks a lot. >> >> Still some questions below. >> >> On 12/10/20 1:18 PM, Dave Chinner wrote: >>> On Thu, Dec 10, 2020 at 09:55:32AM +0800, JeffleXu wrote: >>>> Sorry I'm still a little confused. >>>> >>>> >>>> On 12/10/20 5:23 AM, Dave Chinner wrote: >>>>> On Tue, Dec 08, 2020 at 01:46:47PM +0800, JeffleXu wrote: >>>>>> >>>>>> >>>>>> On 12/7/20 10:21 AM, Dave Chinner wrote: >>>>>>> On Fri, Dec 04, 2020 at 05:44:56PM +0800, Hao Xu wrote: >>>>>>>> Currently, IOCB_NOWAIT is ignored in Direct IO, REQ_NOWAIT is only set >>>>>>>> when IOCB_HIPRI is set. But REQ_NOWAIT should be set as well when >>>>>>>> IOCB_NOWAIT is set. >>>>>>>> >>>>>>>> Suggested-by: Jeffle Xu <[email protected]> >>>>>>>> Signed-off-by: Konstantin Khlebnikov <[email protected]> >>>>>>>> Signed-off-by: Hao Xu <[email protected]> >>>>>>>> --- >>>>>>>> >>>>>>>> Hi all, >>>>>>>> I tested fio io_uring direct read for a file on ext4 filesystem on a >>>>>>>> nvme ssd. I found that IOCB_NOWAIT is ignored in iomap layer, which >>>>>>>> means REQ_NOWAIT is not set in bio->bi_opf. >>>>>>> >>>>>>> What iomap is doing is correct behaviour. IOCB_NOWAIT applies to the >>>>>>> filesystem behaviour, not the block device. >>>>>>> >>>>>>> REQ_NOWAIT can result in partial IO failures because the error is >>>>>>> only reported to the iomap layer via IO completions. 
Hence we can >>>>>>> split a DIO into multiple bios and have random bios in that IO fail >>>>>>> with EAGAIN because REQ_NOWAIT is set. This error will >>>>>>> get reported to the submitter via completion, and it will override >>>>>>> any of the partial IOs that actually completed. >>>>>>> >>>>>>> Hence, like the recently reported multi-mapping IOCB_NOWAIT bug >>>>>>> reported by Jens and fixed in commit 883a790a8440 ("xfs: don't allow >>>>>>> NOWAIT DIO across extent boundaries") we'll get silent partial >>>>>>> writes occurring because the second submitted bio in an IO can >>>>>>> trigger EAGAIN errors with partial IO completion having already >>>>>>> occurred. >>>>>>> >>>> >>>>>>> Further, we don't allow partial IO completion for DIO on XFS at all. >>>>>>> DIO must be completely submitted and completed or return an error >>>>>>> without having issued any IO at all. Hence using REQ_NOWAIT for >>>>>>> DIO bios is incorrect and not desirable. >>>> >>>> >>>> The current block layer implementation causes that, as long as one split >>>> bio fails, then the whole DIO fails, in which case several split bios >>>> maybe have succeeded and the content has been written to the disk. This >>>> is obviously what you called "partial IO completion". >>>> >>>> I'm just concerned on how do you achieve that "DIO must return an error >>>> without having issued any IO at all". Do you have some method of >>>> reverting the content has already been written into the disk when a >>>> partial error happened? >>> >>> I think you've misunderstood what I was saying. I did not say >>> "DIO must return an error without having issued any IO at all". >>> There are two parts to my statement, and you just smashed part of >>> the first statement into part of the second statement and came up >>> something I didn't actually say. >>> >>> The first statement is: >>> >>> 1. "DIO must be fully submitted and completed ...." 
>>> >>> That is, if we need to break an IO up into multiple parts, the >>> entire IO must be submitted and completed as a whole. We do not >>> allow partial submission or completion of the IO at all because we >>> cannot accurately report what parts of a multi-bio DIO that failed >>> through the completion interface. IOWs, if any of the IOs after the >>> first one fail submission, then we must complete all the IOs that >>> have already been submitted before we can report the failure that >>> occurred during the IO. >>> >> >> 1. Actually I'm quite not clear on what you called "partial submission >> or completion". Even when REQ_NOWAIT is set for all split bios of one >> DIO, then all these split bios are **submitted** to block layer through >> submit_bio(). Even when one split bio after the first one failed >> **inside** submit_bio() because of REQ_NOWAIT, submit_bio() only returns >> BLK_QC_T_NONE, and the DIO layer (such as __blkdev_direct_IO()) will >> still call submit_bio() for the remaining split bios. And then only when >> all split bios complete, will the whole kiocb complete. > > Yes, so what you have there is complete submission and completion. > i.e. what I described as #1. > >> So if you define "submission" as submitting to hardware disk (such as >> nvme device driver), then it is indeed **partial submission** when >> REQ_NOWAIT set. But if the definition of "submission" is actually >> "submitting to block layer by calling submit_bio()", then all split bios >> of one DIO are indeed submitted to block layer, even when one split bio >> gets BLK_STS_AGAIN because of REQ_NOWIAT. > > The latter is the definition we using in iomap, because at this > layer we are the ones submitting the bios and defining the bounds of > the IO. What happens in the lower layers does not concern us here. > > Partial submission can occur at the iomap layer if IOCB_NOWAIT is > set - that was what 883a790a8440 fixed. i.e. 
> we do
>
> 	loop for all mapped extents {
> 		loop for mapped range {
> 			submit_bio(subrange)
> 		}
> 	}
>
> So if we have an extent range like so:
>
> 	start                                    end
> 	+-----------+---------+------+-----------+
> 	    ext 1      ext 2     ext 3    ext 4
>
> We actually have to loop above 4 times to map the 4 extents that > span the user IO range. That means we need to submit at least 4 > independent bios to run this IO. > > So if we submit all the bios in a first mapped range (ext 1), then > go back to the filesystem to get the map for the next range to > submit and it returns -EAGAIN, we break out of the outer loop > without having completely submitted bios for the range spanning ext > 2, 3 or 4. That's partial IO submission as the *iomap layer* defines > it. Now I can understand the context of commit 883a790a8440 and the root issue here. > And the "silent" part of the "partial write" came from the fact > that this propagated -EAGAIN to the user despite the fact that some IO > was actually successfully committed and completed. IOWs, we may have > corrupted data on disk by telling the user nothing was done with > -EAGAIN rather than failing the partial IO with -EIO. > > This is the problem commit 883a790a8440 fixed. > >> 2. One DIO could be split into multiple bios in DIO layer. Similarly one >> big bio could be split into multiple bios in block layer. In the >> situation when all split bios have already been submitted to block >> layer, since the IO completion interface could return only one error >> code, the whole DIO could fail as long as one split bio fails, while >> several other split bios could have already succeeded and the >> corresponding disk sectors have already been overwritten. > > Yup. That can happen. That's a data corruption for which prevention > is the responsibility of the layer doing the splitting and > recombining. Unfortunately the block layer doesn't handle this well currently.
> The iomap layer knows nothing of this - this situation > needs to result in the entire bio that was split being marked with > an EIO error because we do not know what parts of the bio made it to > disk or not. IOWs, the result of the IO is undefined, and so we > need to return EIO to the user. What's the difference between EIO and EAGAIN here, besides the literal semantics, when the data corruption has already happened? The upper user has to resubmit the bio when EAGAIN is received, and the data corruption should disappear once the resubmission completes successfully. But the bio may not be resubmitted when EIO is received, and the data corruption is just left there. > > This "data is inconsistent until all sub-devices complete their IO > successfully" situation is also known as the "RAID 5 write hole". > Crash while these IOs are in flight, and some might hit the disk and > some might not. When the system comes back up, you've got no idea > which part of that IO was completed or not, so the data in the > entire stripe is corrupt. > >> I'm not sure >> if this is what you called "silent partial writes", and if this is the >> core reason for commit 883a790a8440 ("xfs: don't allow NOWAIT DIO across >> extent boundaries") you mentioned in previous mail. > > No, it's not the same thing. This "split IO failed" should never > result in a silent partial write - it should always result in EIO. I'm doubtful about this. The block layer could split one big bio into multiple split bios. Also one big DIO from one extent could be split into multiple split bios, since a single bio can hold at most BIO_MAX_PAGES, i.e. 256, segments (see the loop in iomap_dio_bio_actor()). If the input iov_iter contains more than 256 segments, then it will be split into multiple split bios, though this iov_iter may be mapped to one single extent.
And once one split bio (among all these split bios) fails with EAGAIN, the completion status of the whole IO is EAGAIN, rather than EIO, at least in the current implementation of the block layer. So shouldn't this also be called a "silent partial write", since some of the bios may have been written to the disk successfully, while **EAGAIN** is still returned in this situation? > >> If this is the case, then it seems that the IO completion interface >> should be blamed for this issue. Besides REQ_NOWAIT, there may be more >> situations that will cause "silent partial writes". > > Yes, there are many potential causes of this behaviour. They are all > considered to be a data corruption, and so in all cases they should > be returning EIO as the completion status to the submitter of the > bio. > >> As long as one split bio could fail (besides EAGAIN, maybe EOPNOTSUPP, >> if possible), while other split bios may still succeed, then the error >> of that one split bio will still override the completion status of the whole >> DIO. > > If you get EOPNOTSUPP from one device in a bio split across many > devices, then you have a bug in the layer that is splitting the IO. > All the sub-devices need to support the operation, or the top level > device cannot safely support that operation. > > And if one device didn't do the operation, then we have inconsistent > data on disk now and so EIO is the result that should be returned to > the user. EOPNOTSUPP is just an example here. I admit that NOWAIT may be problematic here. But what I mean is that, once one split bio fails, the whole IO fails with the same error code, e.g. EAGAIN/EOPNOTSUPP rather than EIO, while some of the split bios may have already succeeded. IOWs, the block layer may also trigger a "silent partial write" even without REQ_NOWAIT, though the possibility of a "silent partial write" increases a lot when REQ_NOWAIT is applied. 
In my >> understanding, passing the REQ_NOWAIT flag from IOCB_NOWAIT in the DIO layer >> only makes the problem more likely to happen, but it's not the only >> source of this problem? > > Passing REQ_NOWAIT from the iomap dio layer means that we will see > data corruptions in stable storage simply due to high IO load, rather > than the much, much rarer situations of software bugs or actual > hardware errors. IOWs, REQ_NOWAIT essentially guarantees that data > corruption in stable storage of a perfectly working storage stack > will occur, and there is nothing the application can do to prevent > that occurring. > > The worst part of this is that REQ_NOWAIT also requires the > application to detect and correct the corruption caused by the > kernel using REQ_NOWAIT inappropriately. And if the application > crashes before it can correct the data corruption, it's suddenly a > permanent, uncorrectable data corruption. > > REQ_NOWAIT is dangerous and can easily cause harm to user data by > aborting parts of submitted IOs. IOWs, it's more like a random IO > error injection mechanism than a useful method of doing safe > non-blocking IO. > > Cheers, > > Dave. > -- Thanks, Jeffle
* Re: [PATCH v3 RESEND] iomap: set REQ_NOWAIT according to IOCB_NOWAIT in Direct IO 2020-12-15 9:43 ` JeffleXu @ 2021-04-02 14:32 ` Pavel Begunkov 2021-04-02 16:26 ` Jens Axboe 0 siblings, 1 reply; 15+ messages in thread From: Pavel Begunkov @ 2021-04-02 14:32 UTC (permalink / raw) To: JeffleXu, Dave Chinner Cc: Hao Xu, Alexander Viro, Christoph Hellwig, Darrick J. Wong, linux-fsdevel, Konstantin Khlebnikov, Jens Axboe, io-uring, Joseph Qi On 15/12/2020 09:43, JeffleXu wrote: > Thanks for your explanation, again. Got stale, let's bring it up again. > > On 12/14/20 10:56 AM, Dave Chinner wrote: >> On Fri, Dec 11, 2020 at 10:50:44AM +0800, JeffleXu wrote: >>> Excellent explanation! Thanks a lot. >>> >>> Still some questions below. >>> >>> On 12/10/20 1:18 PM, Dave Chinner wrote: >>>> On Thu, Dec 10, 2020 at 09:55:32AM +0800, JeffleXu wrote: >>>>> Sorry I'm still a little confused. >>>>> >>>>> >>>>> On 12/10/20 5:23 AM, Dave Chinner wrote: >>>>>> On Tue, Dec 08, 2020 at 01:46:47PM +0800, JeffleXu wrote: >>>>>>> >>>>>>> >>>>>>> On 12/7/20 10:21 AM, Dave Chinner wrote: >>>>>>>> On Fri, Dec 04, 2020 at 05:44:56PM +0800, Hao Xu wrote: >>>>>>>>> Currently, IOCB_NOWAIT is ignored in Direct IO, REQ_NOWAIT is only set >>>>>>>>> when IOCB_HIPRI is set. But REQ_NOWAIT should be set as well when >>>>>>>>> IOCB_NOWAIT is set. >>>>>>>>> >>>>>>>>> Suggested-by: Jeffle Xu <[email protected]> >>>>>>>>> Signed-off-by: Konstantin Khlebnikov <[email protected]> >>>>>>>>> Signed-off-by: Hao Xu <[email protected]> >>>>>>>>> --- >>>>>>>>> >>>>>>>>> Hi all, >>>>>>>>> I tested fio io_uring direct read for a file on ext4 filesystem on a >>>>>>>>> nvme ssd. I found that IOCB_NOWAIT is ignored in iomap layer, which >>>>>>>>> means REQ_NOWAIT is not set in bio->bi_opf. >>>>>>>> >>>>>>>> What iomap is doing is correct behaviour. IOCB_NOWAIT applies to the >>>>>>>> filesystem behaviour, not the block device. 
>>>>>>>> >>>>>>>> REQ_NOWAIT can result in partial IO failures because the error is >>>>>>>> only reported to the iomap layer via IO completions. Hence we can >>>>>>>> split a DIO into multiple bios and have random bios in that IO fail >>>>>>>> with EAGAIN because REQ_NOWAIT is set. This error will >>>>>>>> get reported to the submitter via completion, and it will override >>>>>>>> any of the partial IOs that actually completed. >>>>>>>> >>>>>>>> Hence, like the recently reported multi-mapping IOCB_NOWAIT bug >>>>>>>> reported by Jens and fixed in commit 883a790a8440 ("xfs: don't allow >>>>>>>> NOWAIT DIO across extent boundaries") we'll get silent partial >>>>>>>> writes occurring because the second submitted bio in an IO can >>>>>>>> trigger EAGAIN errors with partial IO completion having already >>>>>>>> occurred. >>>>>>>> >>>>> >>>>>>>> Further, we don't allow partial IO completion for DIO on XFS at all. >>>>>>>> DIO must be completely submitted and completed or return an error >>>>>>>> without having issued any IO at all. Hence using REQ_NOWAIT for >>>>>>>> DIO bios is incorrect and not desirable. >>>>> >>>>> >>>>> The current block layer implementation causes that, as long as one split >>>>> bio fails, then the whole DIO fails, in which case several split bios >>>>> may have succeeded and the content has been written to the disk. This >>>>> is obviously what you called "partial IO completion". >>>>> >>>>> I'm just wondering how you achieve that "DIO must return an error >>>>> without having issued any IO at all". Do you have some method of >>>>> reverting the content that has already been written to the disk when a >>>>> partial error happens? 
>>>> There are two parts to my statement, and you just smashed part of >>>> the first statement into part of the second statement and came up >>>> with something I didn't actually say. >>>> >>>> The first statement is: >>>> >>>> 1. "DIO must be fully submitted and completed ...." >>>> >>>> That is, if we need to break an IO up into multiple parts, the >>>> entire IO must be submitted and completed as a whole. We do not >>>> allow partial submission or completion of the IO at all because we >>>> cannot accurately report what parts of a multi-bio DIO failed >>>> through the completion interface. IOWs, if any of the IOs after the >>>> first one fail submission, then we must complete all the IOs that >>>> have already been submitted before we can report the failure that >>>> occurred during the IO. >>>> >>> >>> 1. Actually I'm still not quite clear on what you called "partial submission >>> or completion". Even when REQ_NOWAIT is set for all split bios of one >>> DIO, all these split bios are **submitted** to the block layer through >>> submit_bio(). Even when one split bio after the first one fails >>> **inside** submit_bio() because of REQ_NOWAIT, submit_bio() only returns >>> BLK_QC_T_NONE, and the DIO layer (such as __blkdev_direct_IO()) will >>> still call submit_bio() for the remaining split bios. And then only when >>> all split bios complete will the whole kiocb complete. >> >> Yes, so what you have there is complete submission and completion. >> i.e. what I described as #1. >> >>> So if you define "submission" as submitting to the hardware disk (such as >>> the nvme device driver), then it is indeed **partial submission** when >>> REQ_NOWAIT is set. But if the definition of "submission" is actually >>> "submitting to the block layer by calling submit_bio()", then all split bios >>> of one DIO are indeed submitted to the block layer, even when one split bio >>> gets BLK_STS_AGAIN because of REQ_NOWAIT. 
>> >> The latter is the definition we are using in iomap, because at this >> layer we are the ones submitting the bios and defining the bounds of >> the IO. What happens in the lower layers does not concern us here. >> >> Partial submission can occur at the iomap layer if IOCB_NOWAIT is >> set - that was what 883a790a8440 fixed. i.e. we do

>>	loop for all mapped extents {
>>		loop for mapped range {
>>			submit_bio(subrange)
>>		}
>>	}

>> So if we have an extent range like so:

>>	start                                  end
>>	+-----------+---------+------+-----------+
>>	    ext 1      ext 2    ext 3     ext 4

>> We actually have to loop above 4 times to map the 4 extents that >> span the user IO range. That means we need to submit at least 4 >> independent bios to run this IO. >> >> So if we submit all the bios in the first mapped range (ext 1), then >> go back to the filesystem to get the map for the next range to >> submit and it returns -EAGAIN, we break out of the outer loop >> without having completely submitted bios for the range spanning ext >> 2, 3 or 4. That's partial IO submission as the *iomap layer* defines >> it. > > Now I can understand the context of commit 883a790a8440 and the root > issue here. > >> >> And the "silent" part of the "partial write" came from the fact >> this propagated -EAGAIN to the user despite the fact that some IO >> was actually successfully committed and completed. IOWs, we may have >> corrupted data on disk by telling the user nothing was done with >> -EAGAIN rather than failing the partial IO with -EIO. >> >> This is the problem commit 883a790a8440 fixed. >> >>> 2. One DIO could be split into multiple bios in the DIO layer. Similarly, one >>> big bio could be split into multiple bios in the block layer. 
In the >>> situation when all split bios have already been submitted to the block >>> layer, since the IO completion interface can return only one error >>> code, the whole DIO could fail as long as one split bio fails, while >>> several other split bios could have already succeeded and the >>> corresponding disk sectors have already been overwritten. >> >> Yup. That can happen. That's a data corruption for which prevention >> is the responsibility of the layer doing the splitting and >> recombining. > > Unfortunately the block layer doesn't handle this well currently. > > >> The iomap layer knows nothing of this - this situation >> needs to result in the entire bio that was split being marked with >> an EIO error because we do not know what parts of the bio made it to >> disk or not. IOWs, the result of the IO is undefined, and so we >> need to return EIO to the user. > > What's the difference between EIO and EAGAIN here, besides the literal > semantics, when the data corruption has already happened? The upper user > has to resubmit the bio when EAGAIN is received, and the data corruption > should disappear once the resubmission completes successfully. But the > bio may not be resubmitted when EIO is received, and the data corruption > is just left there. > > > >> >> This "data is inconsistent until all sub-devices complete their IO >> successfully" situation is also known as the "RAID 5 write hole". >> Crash while these IOs are in flight, and some might hit the disk and >> some might not. When the system comes back up, you've got no idea >> which part of that IO was completed or not, so the data in the >> entire stripe is corrupt. >> >>> I'm not sure >>> if this is what you called "silent partial writes", and if this is the >>> core reason for commit 883a790a8440 ("xfs: don't allow NOWAIT DIO across >>> extent boundaries") you mentioned in a previous mail. >> >> No, it's not the same thing. 
This "split IO failed" should never >> result in a silent partial write - it should always result in EIO. > I have my doubts about this. > > The block layer could split one big bio into multiple split bios. > One big DIO from one extent could also be split into multiple > bios, since a single bio can hold at most BIO_MAX_PAGES, i.e. 256, > segments (see the loop in iomap_dio_bio_actor()). If the input iov_iter > contains more than 256 segments, then it will be split into multiple > bios, even though this iov_iter may be mapped to one single extent. > > And once one split bio (among all these split bios) fails with EAGAIN, > the completion status of the whole IO is EAGAIN, rather than EIO, at > least in the current implementation of the block layer. So shouldn't this also > be called a "silent partial write", since some of the bios may have been > written to the disk successfully, while **EAGAIN** is still returned in > this situation? > > >> >>> If this is the case, then it seems that the IO completion interface >>> should be blamed for this issue. Besides REQ_NOWAIT, there may be more >>> situations that will cause "silent partial writes". >> >> Yes, there are many potential causes of this behaviour. They are all >> considered to be a data corruption, and so in all cases they should >> be returning EIO as the completion status to the submitter of the >> bio. >> >>> As long as one split bio could fail (besides EAGAIN, maybe EOPNOTSUPP, >>> if possible), while other split bios may still succeed, then the error >>> of that one split bio will still override the completion status of the whole >>> DIO. >> >> If you get EOPNOTSUPP from one device in a bio split across many >> devices, then you have a bug in the layer that is splitting the IO. >> All the sub-devices need to support the operation, or the top level >> device cannot safely support that operation. 
>> >> And if one device didn't do the operation, then we have inconsistent >> data on disk now and so EIO is the result that should be returned to >> the user. > > EOPNOTSUPP is just an example here. I admit that NOWAIT may be > problematic here. But what I mean is that, once one split bio > fails, the whole IO fails with the same error code, e.g. > EAGAIN/EOPNOTSUPP rather than EIO, while some of the split bios may have > already succeeded. IOWs, the block layer may also trigger a "silent > partial write" even without REQ_NOWAIT, though the possibility of a > "silent partial write" increases a lot when REQ_NOWAIT is applied. > > >> >>> In this case "silent partial writes" is still possible? In my >>> understanding, passing the REQ_NOWAIT flag from IOCB_NOWAIT in the DIO layer >>> only makes the problem more likely to happen, but it's not the only >>> source of this problem? >> >> Passing REQ_NOWAIT from the iomap dio layer means that we will see >> data corruptions in stable storage simply due to high IO load, rather >> than the much, much rarer situations of software bugs or actual >> hardware errors. IOWs, REQ_NOWAIT essentially guarantees that data >> corruption in stable storage of a perfectly working storage stack >> will occur, and there is nothing the application can do to prevent >> that occurring. >> >> The worst part of this is that REQ_NOWAIT also requires the >> application to detect and correct the corruption caused by the >> kernel using REQ_NOWAIT inappropriately. And if the application >> crashes before it can correct the data corruption, it's suddenly a >> permanent, uncorrectable data corruption. >> >> REQ_NOWAIT is dangerous and can easily cause harm to user data by >> aborting parts of submitted IOs. IOWs, it's more like a random IO >> error injection mechanism than a useful method of doing safe >> non-blocking IO. >> >> Cheers, >> >> Dave. >> > -- Pavel Begunkov
* Re: [PATCH v3 RESEND] iomap: set REQ_NOWAIT according to IOCB_NOWAIT in Direct IO 2021-04-02 14:32 ` Pavel Begunkov @ 2021-04-02 16:26 ` Jens Axboe 0 siblings, 0 replies; 15+ messages in thread From: Jens Axboe @ 2021-04-02 16:26 UTC (permalink / raw) To: Pavel Begunkov, JeffleXu, Dave Chinner Cc: Hao Xu, Alexander Viro, Christoph Hellwig, Darrick J. Wong, linux-fsdevel, Konstantin Khlebnikov, io-uring, Joseph Qi On 4/2/21 8:32 AM, Pavel Begunkov wrote: > On 15/12/2020 09:43, JeffleXu wrote: >> Thanks for your explanation, again. > > Got stale, let's bring it up again. How about something like this - check upfront if we're going to be using multiple bios, and return -EAGAIN if NOWAIT is set in that case. That avoids the partial problem, and still retains (what I consider) proper NOWAIT behavior for O_DIRECT with IOCB_NOWAIT set. It's also worth noting that this condition exists already for polled IO. If the bio is marked as polled, then we implicitly set NOWAIT as well, as there's no way to support polled IO with sleeping request allocations. Hence it's worth considering this a fix for that case, too. 
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index e2c4991833b8..6f932fe99440 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -66,6 +66,8 @@ static void iomap_dio_submit_bio(struct iomap_dio *dio, struct iomap *iomap,
 	if (dio->iocb->ki_flags & IOCB_HIPRI)
 		bio_set_polled(bio, dio->iocb);
 
+	if (dio->iocb->ki_flags & IOCB_NOWAIT)
+		bio->bi_opf |= REQ_NOWAIT;
 	dio->submit.last_queue = bdev_get_queue(iomap->bdev);
 	if (dio->dops && dio->dops->submit_io)
@@ -236,6 +238,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 	unsigned int blkbits = blksize_bits(bdev_logical_block_size(iomap->bdev));
 	unsigned int fs_block_size = i_blocksize(inode), pad;
 	unsigned int align = iov_iter_alignment(dio->submit.iter);
+	bool nowait = dio->iocb->ki_flags & (IOCB_HIPRI | IOCB_NOWAIT);
 	unsigned int bio_opf;
 	struct bio *bio;
 	bool need_zeroout = false;
@@ -296,7 +299,17 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 	 */
 	bio_opf = iomap_dio_bio_opflags(dio, iomap, use_fua);
 
-	nr_pages = bio_iov_vecs_to_alloc(dio->submit.iter, BIO_MAX_PAGES);
+	nr_pages = bio_iov_vecs_to_alloc(dio->submit.iter, INT_MAX);
+
+	/* Can't handle IOCB_NOWAIT for split bios */
+	if (nr_pages > BIO_MAX_PAGES) {
+		if (nowait) {
+			ret = -EAGAIN;
+			goto out;
+		}
+		nr_pages = BIO_MAX_PAGES;
+	}
+
 	do {
 		size_t n;
 		if (dio->error) {
@@ -326,6 +339,19 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 			goto zero_tail;
 		}
 
+		/*
+		 * If there are leftover pages, bail if nowait is set to avoid
+		 * multiple bios and potentially having one of them -EAGAIN
+		 * with the other succeeding.
+		 */
+		nr_pages = bio_iov_vecs_to_alloc(dio->submit.iter,
+						 BIO_MAX_PAGES);
+		if (nr_pages && nowait) {
+			ret = -EAGAIN;
+			bio_put(bio);
+			goto out;
+		}
+
 		n = bio->bi_iter.bi_size;
 		if (dio->flags & IOMAP_DIO_WRITE) {
 			task_io_account_write(n);
@@ -337,8 +363,6 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 		dio->size += n;
 		copied += n;
 
-		nr_pages = bio_iov_vecs_to_alloc(dio->submit.iter,
-						 BIO_MAX_PAGES);
 		iomap_dio_submit_bio(dio, iomap, bio, pos);
 		pos += n;
 	} while (nr_pages);

-- 
Jens Axboe
end of thread, other threads:[~2021-04-02 16:26 UTC | newest]

Thread overview: 15+ messages
2020-12-04  9:44 [PATCH v3 RESEND] iomap: set REQ_NOWAIT according to IOCB_NOWAIT in Direct IO Hao Xu
2020-12-04 11:44 ` Pavel Begunkov
2020-12-07  2:21 ` Dave Chinner
2020-12-07 23:40   ` Jens Axboe
2020-12-09 21:15     ` Dave Chinner
2020-12-10  2:33       ` JeffleXu
2020-12-08  5:46   ` JeffleXu
2020-12-09 21:23     ` Dave Chinner
2020-12-10  1:55       ` JeffleXu
2020-12-10  5:18         ` Dave Chinner
2020-12-11  2:50           ` JeffleXu
2020-12-14  2:56             ` Dave Chinner
2020-12-15  9:43               ` JeffleXu
2021-04-02 14:32                 ` Pavel Begunkov
2021-04-02 16:26                   ` Jens Axboe