From: Dave Chinner <[email protected]>
To: JeffleXu <[email protected]>
Cc: Hao Xu <[email protected]>,
Alexander Viro <[email protected]>,
Christoph Hellwig <[email protected]>,
"Darrick J. Wong" <[email protected]>,
[email protected],
Konstantin Khlebnikov <[email protected]>,
Jens Axboe <[email protected]>,
[email protected], Joseph Qi <[email protected]>
Subject: Re: [PATCH v3 RESEND] iomap: set REQ_NOWAIT according to IOCB_NOWAIT in Direct IO
Date: Mon, 14 Dec 2020 13:56:55 +1100 [thread overview]
Message-ID: <[email protected]> (raw)
In-Reply-To: <[email protected]>
On Fri, Dec 11, 2020 at 10:50:44AM +0800, JeffleXu wrote:
> Excellent explanation! Thanks a lot.
>
> Still some questions below.
>
> On 12/10/20 1:18 PM, Dave Chinner wrote:
> > On Thu, Dec 10, 2020 at 09:55:32AM +0800, JeffleXu wrote:
> >> Sorry I'm still a little confused.
> >>
> >>
> >> On 12/10/20 5:23 AM, Dave Chinner wrote:
> >>> On Tue, Dec 08, 2020 at 01:46:47PM +0800, JeffleXu wrote:
> >>>>
> >>>>
> >>>> On 12/7/20 10:21 AM, Dave Chinner wrote:
> >>>>> On Fri, Dec 04, 2020 at 05:44:56PM +0800, Hao Xu wrote:
> >>>>>> Currently, IOCB_NOWAIT is ignored in Direct IO, REQ_NOWAIT is only set
> >>>>>> when IOCB_HIPRI is set. But REQ_NOWAIT should be set as well when
> >>>>>> IOCB_NOWAIT is set.
> >>>>>>
> >>>>>> Suggested-by: Jeffle Xu <[email protected]>
> >>>>>> Signed-off-by: Konstantin Khlebnikov <[email protected]>
> >>>>>> Signed-off-by: Hao Xu <[email protected]>
> >>>>>> ---
> >>>>>>
> >>>>>> Hi all,
> >>>>>> I tested fio io_uring direct read for a file on ext4 filesystem on a
> >>>>>> nvme ssd. I found that IOCB_NOWAIT is ignored in iomap layer, which
> >>>>>> means REQ_NOWAIT is not set in bio->bi_opf.
> >>>>>
> >>>>> What iomap is doing is correct behaviour. IOCB_NOWAIT applies to the
> >>>>> filesystem behaviour, not the block device.
> >>>>>
> >>>>> REQ_NOWAIT can result in partial IO failures because the error is
> >>>>> only reported to the iomap layer via IO completions. Hence we can
> >>>>> split a DIO into multiple bios and have random bios in that IO fail
> >>>>> with EAGAIN because REQ_NOWAIT is set. This error will
> >>>>> get reported to the submitter via completion, and it will override
> >>>>> any of the partial IOs that actually completed.
> >>>>>
> >>>>> Hence, like the recently reported multi-mapping IOCB_NOWAIT bug
> >>>>> reported by Jens and fixed in commit 883a790a8440 ("xfs: don't allow
> >>>>> NOWAIT DIO across extent boundaries") we'll get silent partial
> >>>>> writes occurring because the second submitted bio in an IO can
> >>>>> trigger EAGAIN errors with partial IO completion having already
> >>>>> occurred.
> >>>>>
> >>
> >>>>> Further, we don't allow partial IO completion for DIO on XFS at all.
> >>>>> DIO must be completely submitted and completed or return an error
> >>>>> without having issued any IO at all. Hence using REQ_NOWAIT for
> >>>>> DIO bios is incorrect and not desirable.
> >>
> >>
> >> The current block layer implementation causes that, as long as one split
> >> bio fails, then the whole DIO fails, in which case several split bios
> >> maybe have succeeded and the content has been written to the disk. This
> >> is obviously what you called "partial IO completion".
> >>
> >> I'm just concerned on how do you achieve that "DIO must return an error
> >> without having issued any IO at all". Do you have some method of
> >> reverting the content has already been written into the disk when a
> >> partial error happened?
> >
> > I think you've misunderstood what I was saying. I did not say
> > "DIO must return an error without having issued any IO at all".
> > There are two parts to my statement, and you just smashed part of
> > the first statement into part of the second statement and came up
> > something I didn't actually say.
> >
> > The first statement is:
> >
> > 1. "DIO must be fully submitted and completed ...."
> >
> > That is, if we need to break an IO up into multiple parts, the
> > entire IO must be submitted and completed as a whole. We do not
> > allow partial submission or completion of the IO at all because we
> > cannot accurately report what parts of a multi-bio DIO that failed
> > through the completion interface. IOWs, if any of the IOs after the
> > first one fail submission, then we must complete all the IOs that
> > have already been submitted before we can report the failure that
> > occurred during the IO.
> >
>
> 1. Actually I'm quite not clear on what you called "partial submission
> or completion". Even when REQ_NOWAIT is set for all split bios of one
> DIO, then all these split bios are **submitted** to block layer through
> submit_bio(). Even when one split bio after the first one failed
> **inside** submit_bio() because of REQ_NOWAIT, submit_bio() only returns
> BLK_QC_T_NONE, and the DIO layer (such as __blkdev_direct_IO()) will
> still call submit_bio() for the remaining split bios. And then only when
> all split bios complete, will the whole kiocb complete.
Yes, so what you have there is complete submission and completion.
i.e. what I described as #1.
> So if you define "submission" as submitting to hardware disk (such as
> nvme device driver), then it is indeed **partial submission** when
> REQ_NOWAIT set. But if the definition of "submission" is actually
> "submitting to block layer by calling submit_bio()", then all split bios
> of one DIO are indeed submitted to block layer, even when one split bio
> gets BLK_STS_AGAIN because of REQ_NOWIAT.
The latter is the definition we using in iomap, because at this
layer we are the ones submitting the bios and defining the bounds of
the IO. What happens in the lower layers does not concern us here.
Partial submission can occur at the iomap layer if IOCB_NOWAIT is
set - that was what 883a790a8440 fixed. i.e. we do
loop for all mapped extents {
loop for mapped range {
submit_bio(subrange)
}
}
So if we have an extent range like so:
start end
+-----------+---------+------+-----------+
ext 1 ext 2 ext 3 ext 4
We actually have to loop above 4 times to map the 4 extents that
span the user IO range. That means we need to submit at least 4
independent bios to run this IO.
So if we submit all the bios in a first mapped range (ext 1), then
go back to the filesystem to get the map for the next range to
submit and it returns -EAGAIN, we break out of the outer loop
without having completely submitted bios for the range spanning ext
2, 3 or 4. That's partial IO submission as the *iomap layer* defines
it.
And the "silent" part of the "partial write" it came from the fact
this propagated -EAGAIN to the user despite the fact that some IO
was actually successfully committed and completed. IOws, we may have
corrupted data on disk by telling the user nothing was done with
-EAGAIN rather than failing the partial IO with -EIO.
This is the problem commit 883a790a8440 fixed.
> 2. One DIO could be split into multiple bios in DIO layer. Similarly one
> big bio could be split into multiple bios in block layer. In the
> situation when all split bios have already been submitted to block
> layer, since the IO completion interface could return only one error
> code, the whole DIO could fail as long as one split bio fails, while
> several other split bios could have already succeeded and the
> corresponding disk sectors have already been overwritten.
Yup. That can happen. That's a data corruption for which prevention
is the responsibility of the layer doing the splitting and
recombining. The iomap layer knows nothing of this - this situation
needs to result in the entire bio that was split being marked with
an EIO error because we do not know what parts of the bio made it to
disk or not. IOWs, the result of the IO is undefined, and so we
need to return EIO to the user.
This "data is inconsistent until all sub-devices complete their IO
successfully" situation is also known as the "RAID 5 write hole".
Crash while these IOs are in flight, and some might hit the disk and
some might not. When the system comes back up, you've got no idea
which part of that IO was completed or not, so the data in the
entire stripe is corrupt.
> I'm not sure
> if this is what you called "silent partial writes", and if this is the
> core reason for commit 883a790a8440 ("xfs: don't allow NOWAIT DIO across
> extent boundaries") you mentioned in previous mail.
No, it's not the same thing. This "split IO failed" should never
result in a silent partial write - it should always result in EIO.
> If this is the case, then it seems that the IO completion interface
> should be blamed for this issue. Besides REQ_NOWIAT, there may be more
> situations that will cause "silent partial writes".
Yes, there are many potential causes of this behaviour. They are all
considered to be a data corruption, and so in all cases they should
be returning EIO as the completion status to the submitter of the
bio.
> As long as one split bios could fail (besides EAGAIN, maybe EOPNOTSUPP,
> if possible), while other split bios may still succeeds, then the error
> of one split bio will still override the completion status of the whole
> DIO.
If you get EOPNOTSUPP from on device in bio split across many
devices, then you have a bug in the layer that is splitting the IO.
All the sub-devices need to support the operation, or the top level
device cannot safely support that operation.
And if one device didn't do the opreation, then we have inconsistent
data on disk now and so EIO is the result that should be returned to
the user.
> In this case "silent partial writes" is still possible? In my
> understanding, passing REQ_NOWAIT flags from IOCB_NOWAIT in DIO layer
> only makes the problem more likely to happen, but it's not the only
> source of this problem?
Passing REQ_NOWAIT from the iomap dio layer means that we will see
data corruptions in stable storage simply due to high IO load, rather
than the much, much rarer situations of software bugs or actual
hardware errors. IOWs, REQ_NOWAIT essentially guarantees data
corruption in stable storage of a perfectly working storage stack
will occur and there is nothing the application can do to prevent
that occurring.
The worst part of this is that REQ_NOWAIT also requires the
application to detect and correct the corruption caused by the
kernel using REQ_NOWAIT inappropriately. And if the application
crashes before it can correct the data corruption, it's suddenly a
permanent, uncorrectable data corruption.
REQ_NOWAIT is dangerous and can easily cause harm to user data by
aborting parts of submitted IOs. IOWs, it's more like a random IO
error injection mechanism than a useful method of doing safe
non-blocking IO.
Cheers,
Dave.
--
Dave Chinner
[email protected]
next prev parent reply other threads:[~2020-12-14 2:58 UTC|newest]
Thread overview: 15+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-12-04 9:44 [PATCH v3 RESEND] iomap: set REQ_NOWAIT according to IOCB_NOWAIT in Direct IO Hao Xu
2020-12-04 11:44 ` Pavel Begunkov
2020-12-07 2:21 ` Dave Chinner
2020-12-07 23:40 ` Jens Axboe
2020-12-09 21:15 ` Dave Chinner
2020-12-10 2:33 ` JeffleXu
2020-12-08 5:46 ` JeffleXu
2020-12-09 21:23 ` Dave Chinner
2020-12-10 1:55 ` JeffleXu
2020-12-10 5:18 ` Dave Chinner
2020-12-11 2:50 ` JeffleXu
2020-12-14 2:56 ` Dave Chinner [this message]
2020-12-15 9:43 ` JeffleXu
2021-04-02 14:32 ` Pavel Begunkov
2021-04-02 16:26 ` Jens Axboe
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox