* [PATCH 1/5] iomap: complete polled writes inline
2023-07-11 20:33 [PATCHSET 0/5] Improve async iomap DIO performance Jens Axboe
@ 2023-07-11 20:33 ` Jens Axboe
2023-07-12 1:02 ` Dave Chinner
2023-07-11 20:33 ` [PATCH 2/5] fs: add IOCB flags related to passing back dio completions Jens Axboe
` (3 subsequent siblings)
4 siblings, 1 reply; 10+ messages in thread
From: Jens Axboe @ 2023-07-11 20:33 UTC (permalink / raw)
To: io-uring, linux-xfs; +Cc: hch, andres, Jens Axboe
Polled IO is always reaped in the context of the process itself, so it
does not need to be punted to a workqueue for the completion. This is
different than IRQ driven IO, where iomap_dio_bio_end_io() will be
invoked from hard/soft IRQ context. For those cases we currently need
to punt to a workqueue for further processing. For the polled case,
since it's the task itself reaping completions, we're already in task
context. That makes it identical to the sync completion case.
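To make the dispatch concrete, the three-way completion choice after this patch can be sketched as a userspace model (plain C; the flag values and names below are illustrative stand-ins for the kernel's definitions, not the real ones):

```c
#include <stdbool.h>

/* Illustrative stand-ins for the kernel flags discussed above. */
#define REQ_POLLED      (1u << 0)
#define IOMAP_DIO_WRITE (1u << 1)

enum completion_path {
	COMPLETE_WAKE_WAITER,	/* sync IO: wake the sleeping submitter */
	COMPLETE_INLINE,	/* polled IO or reads: complete right here */
	COMPLETE_WORKQUEUE,	/* IRQ-driven async writes: punt to a workqueue */
};

/* Models the dispatch order in iomap_dio_bio_end_io() after this patch. */
enum completion_path choose_completion_path(bool wait_for_completion,
					    unsigned int bio_opf,
					    unsigned int dio_flags)
{
	if (wait_for_completion)
		return COMPLETE_WAKE_WAITER;
	/*
	 * Polled IO is reaped by the submitting task itself, so we are
	 * already in task context and can complete inline, just like
	 * reads, which have no completion work to do.
	 */
	if ((bio_opf & REQ_POLLED) || !(dio_flags & IOMAP_DIO_WRITE))
		return COMPLETE_INLINE;
	return COMPLETE_WORKQUEUE;
}
```

The point of the model is just the ordering: the polled check must come before the write check, so that polled writes fall into the inline branch.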
Testing a basic QD 1..8 dio random write with polled IO with the
following fio job:
fio --name=polled-dio-write --filename=/data1/file --time_based=1 \
--runtime=10 --bs=4096 --rw=randwrite --norandommap --buffered=0 \
--cpus_allowed=4 --ioengine=io_uring --iodepth=$depth --hipri=1
yields:
Stock Patched Diff
=======================================
QD1 180K 201K +11%
QD2 356K 394K +10%
QD4 608K 650K +7%
QD8 827K 831K +0.5%
which shows a nice win, particularly for lower queue depth writes.
This is expected, as higher queue depths will be busy polling
completions while the offloaded workqueue completions can happen in
parallel.
Signed-off-by: Jens Axboe <[email protected]>
---
fs/iomap/direct-io.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index ea3b868c8355..343bde5d50d3 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -161,15 +161,16 @@ void iomap_dio_bio_end_io(struct bio *bio)
struct task_struct *waiter = dio->submit.waiter;
WRITE_ONCE(dio->submit.waiter, NULL);
blk_wake_io_task(waiter);
- } else if (dio->flags & IOMAP_DIO_WRITE) {
+ } else if ((bio->bi_opf & REQ_POLLED) ||
+ !(dio->flags & IOMAP_DIO_WRITE)) {
+ WRITE_ONCE(dio->iocb->private, NULL);
+ iomap_dio_complete_work(&dio->aio.work);
+ } else {
struct inode *inode = file_inode(dio->iocb->ki_filp);
WRITE_ONCE(dio->iocb->private, NULL);
INIT_WORK(&dio->aio.work, iomap_dio_complete_work);
queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work);
- } else {
- WRITE_ONCE(dio->iocb->private, NULL);
- iomap_dio_complete_work(&dio->aio.work);
}
}
--
2.40.1
* Re: [PATCH 1/5] iomap: complete polled writes inline
2023-07-11 20:33 ` [PATCH 1/5] iomap: complete polled writes inline Jens Axboe
@ 2023-07-12 1:02 ` Dave Chinner
2023-07-12 1:17 ` Jens Axboe
2023-07-12 15:22 ` Christoph Hellwig
0 siblings, 2 replies; 10+ messages in thread
From: Dave Chinner @ 2023-07-12 1:02 UTC (permalink / raw)
To: Jens Axboe; +Cc: io-uring, linux-xfs, hch, andres
On Tue, Jul 11, 2023 at 02:33:21PM -0600, Jens Axboe wrote:
> Polled IO is always reaped in the context of the process itself, so it
> does not need to be punted to a workqueue for the completion. This is
> different than IRQ driven IO, where iomap_dio_bio_end_io() will be
> invoked from hard/soft IRQ context. For those cases we currently need
> to punt to a workqueue for further processing. For the polled case,
> since it's the task itself reaping completions, we're already in task
> context. That makes it identical to the sync completion case.
>
> Testing a basic QD 1..8 dio random write with polled IO with the
> following fio job:
>
> fio --name=polled-dio-write --filename=/data1/file --time_based=1 \
> --runtime=10 --bs=4096 --rw=randwrite --norandommap --buffered=0 \
> --cpus_allowed=4 --ioengine=io_uring --iodepth=$depth --hipri=1
Ok, so this is testing pure overwrite DIOs as fio pre-writes the
file prior to starting the random write part of the test.
> yields:
>
> Stock Patched Diff
> =======================================
> QD1 180K 201K +11%
> QD2 356K 394K +10%
> QD4 608K 650K +7%
> QD8 827K 831K +0.5%
>
> which shows a nice win, particularly for lower queue depth writes.
> This is expected, as higher queue depths will be busy polling
> completions while the offloaded workqueue completions can happen in
> parallel.
>
> Signed-off-by: Jens Axboe <[email protected]>
> ---
> fs/iomap/direct-io.c | 9 +++++----
> 1 file changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index ea3b868c8355..343bde5d50d3 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -161,15 +161,16 @@ void iomap_dio_bio_end_io(struct bio *bio)
> struct task_struct *waiter = dio->submit.waiter;
> WRITE_ONCE(dio->submit.waiter, NULL);
> blk_wake_io_task(waiter);
> - } else if (dio->flags & IOMAP_DIO_WRITE) {
> + } else if ((bio->bi_opf & REQ_POLLED) ||
> + !(dio->flags & IOMAP_DIO_WRITE)) {
> + WRITE_ONCE(dio->iocb->private, NULL);
> + iomap_dio_complete_work(&dio->aio.work);
I'm not sure this is safe for all polled writes. What if the DIO
write was into a hole and we have to run unwritten extent
completion via:
iomap_dio_complete_work(work)
iomap_dio_complete(dio)
dio->end_io(iocb)
xfs_dio_write_end_io()
xfs_iomap_write_unwritten()
<runs transactions, takes rwsems, does IO>
.....
ki->ki_complete()
io_complete_rw_iopoll()
.....
I don't see anything in the iomap DIO path that prevents us from
doing HIPRI/REQ_POLLED IO on IOMAP_UNWRITTEN extents, hence I think
this change will result in bad things happening in general.
> + } else {
> struct inode *inode = file_inode(dio->iocb->ki_filp);
>
> WRITE_ONCE(dio->iocb->private, NULL);
> INIT_WORK(&dio->aio.work, iomap_dio_complete_work);
> queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work);
> - } else {
> - WRITE_ONCE(dio->iocb->private, NULL);
> - iomap_dio_complete_work(&dio->aio.work);
> }
> }
Regardless of the correctness of the code, I don't think adding this
special case is the right thing to do here. We should be able to
complete all writes that don't require blocking completions directly
here, not just polled writes.
We recently had this discussion over hacking a special case "don't
queue for writes" for ext4 into this code - I had to point out the
broken O_DSYNC completion cases it resulted in there, too. I also
pointed out that we already had generic mechanisms in iomap to
enable us to make a submission time decision as to whether
completion needed to be queued or not. Thread here:
https://lore.kernel.org/linux-xfs/[email protected]/
Essentially, we shouldn't be using IOMAP_DIO_WRITE as the
determining factor for queuing completions - we should be using
the information the iocb and the iomap provides us at submission
time similar to how we determine if we can use REQ_FUA for O_DSYNC
writes to determine if iomap IO completion queuing is required.
This will do the correct *and* optimal thing for all types of
writes, polled or not...
Cheers,
Dave.
--
Dave Chinner
[email protected]
* Re: [PATCH 1/5] iomap: complete polled writes inline
2023-07-12 1:02 ` Dave Chinner
@ 2023-07-12 1:17 ` Jens Axboe
2023-07-12 2:51 ` Dave Chinner
2023-07-12 15:22 ` Christoph Hellwig
1 sibling, 1 reply; 10+ messages in thread
From: Jens Axboe @ 2023-07-12 1:17 UTC (permalink / raw)
To: Dave Chinner; +Cc: io-uring, linux-xfs, hch, andres
On 7/11/23 7:02 PM, Dave Chinner wrote:
> On Tue, Jul 11, 2023 at 02:33:21PM -0600, Jens Axboe wrote:
>> Polled IO is always reaped in the context of the process itself, so it
>> does not need to be punted to a workqueue for the completion. This is
>> different than IRQ driven IO, where iomap_dio_bio_end_io() will be
>> invoked from hard/soft IRQ context. For those cases we currently need
>> to punt to a workqueue for further processing. For the polled case,
>> since it's the task itself reaping completions, we're already in task
>> context. That makes it identical to the sync completion case.
>>
>> Testing a basic QD 1..8 dio random write with polled IO with the
>> following fio job:
>>
>> fio --name=polled-dio-write --filename=/data1/file --time_based=1 \
>> --runtime=10 --bs=4096 --rw=randwrite --norandommap --buffered=0 \
>> --cpus_allowed=4 --ioengine=io_uring --iodepth=$depth --hipri=1
>
> Ok, so this is testing pure overwrite DIOs as fio pre-writes the
> file prior to starting the random write part of the test.
Correct.
>> yields:
>>
>> Stock Patched Diff
>> =======================================
>> QD1 180K 201K +11%
>> QD2 356K 394K +10%
>> QD4 608K 650K +7%
>> QD8 827K 831K +0.5%
>>
>> which shows a nice win, particularly for lower queue depth writes.
>> This is expected, as higher queue depths will be busy polling
>> completions while the offloaded workqueue completions can happen in
>> parallel.
>>
>> Signed-off-by: Jens Axboe <[email protected]>
>> ---
>> fs/iomap/direct-io.c | 9 +++++----
>> 1 file changed, 5 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
>> index ea3b868c8355..343bde5d50d3 100644
>> --- a/fs/iomap/direct-io.c
>> +++ b/fs/iomap/direct-io.c
>> @@ -161,15 +161,16 @@ void iomap_dio_bio_end_io(struct bio *bio)
>> struct task_struct *waiter = dio->submit.waiter;
>> WRITE_ONCE(dio->submit.waiter, NULL);
>> blk_wake_io_task(waiter);
>> - } else if (dio->flags & IOMAP_DIO_WRITE) {
>> + } else if ((bio->bi_opf & REQ_POLLED) ||
>> + !(dio->flags & IOMAP_DIO_WRITE)) {
>> + WRITE_ONCE(dio->iocb->private, NULL);
>> + iomap_dio_complete_work(&dio->aio.work);
>
> I'm not sure this is safe for all polled writes. What if the DIO
> write was into a hole and we have to run unwritten extent
> completion via:
>
> iomap_dio_complete_work(work)
> iomap_dio_complete(dio)
> dio->end_io(iocb)
> xfs_dio_write_end_io()
> xfs_iomap_write_unwritten()
> <runs transactions, takes rwsems, does IO>
> .....
> ki->ki_complete()
> io_complete_rw_iopoll()
> .....
>
> I don't see anything in the iomap DIO path that prevents us from
> doing HIPRI/REQ_POLLED IO on IOMAP_UNWRITTEN extents, hence I think
> this change will result in bad things happening in general.
There is a check related to writing beyond the size of the inode:
if (need_zeroout ||
((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode)))
dio->iocb->ki_flags &= ~IOCB_HIPRI;
but whether that is enough or not, I'm not so sure.
>> + } else {
>> struct inode *inode = file_inode(dio->iocb->ki_filp);
>>
>> WRITE_ONCE(dio->iocb->private, NULL);
>> INIT_WORK(&dio->aio.work, iomap_dio_complete_work);
>> queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work);
>> - } else {
>> - WRITE_ONCE(dio->iocb->private, NULL);
>> - iomap_dio_complete_work(&dio->aio.work);
>> }
>> }
>
> Regardless of the correctness of the code, I don't think adding this
> special case is the right thing to do here. We should be able to
> complete all writes that don't require blocking completions directly
> here, not just polled writes.
>
> We recently had this discussion over hacking a special case "don't
> queue for writes" for ext4 into this code - I had to point out the
> broken O_DSYNC completion cases it resulted in there, too. I also
> pointed out that we already had generic mechanisms in iomap to
> enable us to make a submission time decision as to whether
> completion needed to be queued or not. Thread here:
>
> https://lore.kernel.org/linux-xfs/[email protected]/
>
> Essentially, we shouldn't be using IOMAP_DIO_WRITE as the
> determining factor for queuing completions - we should be using
> the information the iocb and the iomap provides us at submission
> time similar to how we determine if we can use REQ_FUA for O_DSYNC
> writes to determine if iomap IO completion queuing is required.
>
> This will do the correct *and* optimal thing for all types of
> writes, polled or not...
There's a fundamental difference between "cannot block, ever" as we have
from any kind of irq/rcu/preemption context, and "we should not block
waiting for unrelated IO", which is really the NOIO kind of issue that
async dio or polled async dio has. This obviously goes beyond
just this single patch and addresses the whole patchset, but it applies
equally to the polled completions here and the task punted callbacks for
the dio async writes. For the latter, we can certainly grab a mutex, for
the former we cannot, ever.
I do hear your point that gating this on writes is somewhat odd, but
that's mostly because the read completions don't really need to do
anything. Would you like it more if we made that explicit with another
IOMAP flag? Only concern here for the polled part is that REQ_POLLED may
be set for submission on the iomap side, but then later cleared through
the block stack if we cannot do polled IO for this bio. This means it
really has to be checked on the completion side, you cannot rely on any
iocb or iomap flags set at submission time.
For the write checking, that's already there... And while I'm all for
making that code cleaner, I don't necessarily think cleaning that up
first is a fair ask. At least not without more details on what you want,
specifically?
--
Jens Axboe
* Re: [PATCH 1/5] iomap: complete polled writes inline
2023-07-12 1:17 ` Jens Axboe
@ 2023-07-12 2:51 ` Dave Chinner
0 siblings, 0 replies; 10+ messages in thread
From: Dave Chinner @ 2023-07-12 2:51 UTC (permalink / raw)
To: Jens Axboe; +Cc: io-uring, linux-xfs, hch, andres
On Tue, Jul 11, 2023 at 07:17:43PM -0600, Jens Axboe wrote:
> On 7/11/23 7:02 PM, Dave Chinner wrote:
> > On Tue, Jul 11, 2023 at 02:33:21PM -0600, Jens Axboe wrote:
> >> Polled IO is always reaped in the context of the process itself, so it
> >> does not need to be punted to a workqueue for the completion. This is
> >> different than IRQ driven IO, where iomap_dio_bio_end_io() will be
> >> invoked from hard/soft IRQ context. For those cases we currently need
> >> to punt to a workqueue for further processing. For the polled case,
> >> since it's the task itself reaping completions, we're already in task
> >> context. That makes it identical to the sync completion case.
> >>
> >> Testing a basic QD 1..8 dio random write with polled IO with the
> >> following fio job:
> >>
> >> fio --name=polled-dio-write --filename=/data1/file --time_based=1 \
> >> --runtime=10 --bs=4096 --rw=randwrite --norandommap --buffered=0 \
> >> --cpus_allowed=4 --ioengine=io_uring --iodepth=$depth --hipri=1
> >
> > Ok, so this is testing pure overwrite DIOs as fio pre-writes the
> > file prior to starting the random write part of the test.
>
> Correct.
What is the differential when you use O_DSYNC writes?
> >> yields:
> >>
> >> Stock Patched Diff
> >> =======================================
> >> QD1 180K 201K +11%
> >> QD2 356K 394K +10%
> >> QD4 608K 650K +7%
> >> QD8 827K 831K +0.5%
> >>
> >> which shows a nice win, particularly for lower queue depth writes.
> >> This is expected, as higher queue depths will be busy polling
> >> completions while the offloaded workqueue completions can happen in
> >> parallel.
> >>
> >> Signed-off-by: Jens Axboe <[email protected]>
> >> ---
> >> fs/iomap/direct-io.c | 9 +++++----
> >> 1 file changed, 5 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> >> index ea3b868c8355..343bde5d50d3 100644
> >> --- a/fs/iomap/direct-io.c
> >> +++ b/fs/iomap/direct-io.c
> >> @@ -161,15 +161,16 @@ void iomap_dio_bio_end_io(struct bio *bio)
> >> struct task_struct *waiter = dio->submit.waiter;
> >> WRITE_ONCE(dio->submit.waiter, NULL);
> >> blk_wake_io_task(waiter);
> >> - } else if (dio->flags & IOMAP_DIO_WRITE) {
> >> + } else if ((bio->bi_opf & REQ_POLLED) ||
> >> + !(dio->flags & IOMAP_DIO_WRITE)) {
> >> + WRITE_ONCE(dio->iocb->private, NULL);
> >> + iomap_dio_complete_work(&dio->aio.work);
> >
> > I'm not sure this is safe for all polled writes. What if the DIO
> > write was into a hole and we have to run unwritten extent
> > completion via:
> >
> > iomap_dio_complete_work(work)
> > iomap_dio_complete(dio)
> > dio->end_io(iocb)
> > xfs_dio_write_end_io()
> > xfs_iomap_write_unwritten()
> > <runs transactions, takes rwsems, does IO>
> > .....
> > ki->ki_complete()
> > io_complete_rw_iopoll()
> > .....
> >
> > I don't see anything in the iomap DIO path that prevents us from
> > doing HIPRI/REQ_POLLED IO on IOMAP_UNWRITTEN extents, hence I think
> > this change will result in bad things happening in general.
>
> There is a check related to writing beyond the size of the inode:
>
> if (need_zeroout ||
> ((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode)))
> dio->iocb->ki_flags &= ~IOCB_HIPRI;
>
> but whether that is enough of what, I'm not so sure.
Ah, need-zeroout covers unwritten extents. Ok, I missed that - I
knew it covered sub-block zeroing, but I missed the fact it
explicitly turned off HIPRI for block aligned IO to unwritten
extents.
Hence HIPRI is turned off for new extents, unwritten extents and
writes that extend file size. It does not get turned off for O_DSYNC
writes, but they have exactly the same completion constraints as all
these other cases, except where we use REQ_FUA to avoid them. That
seems like an oversight to me.
IOWs, HIPRI is already turned off in *most* of the cases where
completion queuing is required. These are all the same cases that
IOMAP_F_DIRTY is used by the filesystem to tell iomap if a pure
overwrite is being done. i.e. HIPRI/REQ_POLLED is just another
pure-overwrite IO optimisation at the iomap level.
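The submission-time rule being described amounts to: polling is only kept for pure overwrites. A hedged sketch of that check (an illustrative function, not the kernel's actual code; need_zeroout stands in for the new-extent, unwritten-extent and sub-block-zeroing cases):

```c
#include <stdbool.h>

/*
 * Sketch of the submission-time IOCB_HIPRI decision discussed in this
 * thread. need_zeroout is set for new extents, unwritten extents and
 * sub-block zeroing; size-extending writes also require completion-time
 * transactions, so they cannot be completed inline from polling either.
 */
bool keep_hipri(bool need_zeroout, bool is_write, long long pos,
		long long i_size)
{
	if (need_zeroout || (is_write && pos >= i_size))
		return false;	/* completion must be queued; don't poll */
	return true;		/* pure overwrite (or read): polling is fine */
}
```

Note that O_DSYNC writes without REQ_FUA slip through this check today, which is the oversight being pointed out above.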
> >> + } else {
> >> struct inode *inode = file_inode(dio->iocb->ki_filp);
> >>
> >> WRITE_ONCE(dio->iocb->private, NULL);
> >> INIT_WORK(&dio->aio.work, iomap_dio_complete_work);
> >> queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work);
> >> - } else {
> >> - WRITE_ONCE(dio->iocb->private, NULL);
> >> - iomap_dio_complete_work(&dio->aio.work);
> >> }
> >> }
> >
> > Regardless of the correctness of the code, I don't think adding this
> > special case is the right thing to do here. We should be able to
> > complete all writes that don't require blocking completions directly
> > here, not just polled writes.
> >
> > We recently had this discussion over hacking a special case "don't
> > queue for writes" for ext4 into this code - I had to point out the
> > broken O_DSYNC completion cases it resulted in there, too. I also
> > pointed out that we already had generic mechanisms in iomap to
> > enable us to make a submission time decision as to whether
> > completion needed to be queued or not. Thread here:
> >
> > https://lore.kernel.org/linux-xfs/[email protected]/
> >
> > Essentially, we shouldn't be using IOMAP_DIO_WRITE as the
> > determining factor for queuing completions - we should be using
> > the information the iocb and the iomap provides us at submission
> > time similar to how we determine if we can use REQ_FUA for O_DSYNC
> > writes to determine if iomap IO completion queuing is required.
> >
> > This will do the correct *and* optimal thing for all types of
> > writes, polled or not...
>
> There's a fundamental difference between "cannot block, ever" as we have
> from any kind of irq/rcu/preemption context, and "we should not block
> waiting for unrelated IO", which is really the NOIO kind of issue that
> async dio or polled async dio has. This obviously goes beyond
> just this single patch and addresses the whole patchset, but it applies
> equally to the polled completions here and the task punted callbacks for
> the dio async writes. For the latter, we can certainly grab a mutex, for
> the former we cannot, ever.
Yes, but I don't think that matters....
> I do hear your point that gating this on writes is somewhat odd, but
> that's mostly because the read completions don't really need to do
> anything.
That was just a simplification we did because nobody was concerned
with micro-optimisation of the iomap IO path. It was much faster
than what we had before to begin with without lots of special case
micro-optimisation - the thing that made the old direct IO code
completely unmaintainable was all the crazy micro-optimisations that
had occurred over time....
> Would you like it more if we made that explicit with another
> IOMAP flag? Only concern here for the polled part is that REQ_POLLED may
> be set for submission on the iomap side, but then later cleared through
> the block stack if we cannot do polled IO for this bio. This means it
> really has to be checked on the completion side, you cannot rely on any
> iocb or iomap flags set at submission time.
I think REQ_POLLED is completely irrelevant here because it is only
used on iomaps that are pure overwrites. Hence at the iomap level it
should be perfectly OK to do direct completion for a pure overwrite
regardless of whether REQ_POLLED is set at completion time or not.
As such, I think completion queuing can be keyed off a specific
dio->flag set at submission time, rather than keying off
IOMAP_DIO_WRITE or some complex logic based on current IO
contexts...
> For the write checking, that's already there... And while I'm all for
> making that code cleaner, I don't necessarily think cleaning that up
> first is a fair ask. At least not without more details on what you want,
> specifically?
I think it's very fair to ask you to fix the problem the right way.
It's pretty clear what needs to be done, it's solvable in a generic
manner, it is not complex to do and we've even got pure overwrite
detection already in the dio write path that we can formalise for
this purpose.
Fix it once, fix it for everyone.
-Dave.
--
Dave Chinner
[email protected]
* Re: [PATCH 1/5] iomap: complete polled writes inline
2023-07-12 1:02 ` Dave Chinner
2023-07-12 1:17 ` Jens Axboe
@ 2023-07-12 15:22 ` Christoph Hellwig
1 sibling, 0 replies; 10+ messages in thread
From: Christoph Hellwig @ 2023-07-12 15:22 UTC (permalink / raw)
To: Dave Chinner; +Cc: Jens Axboe, io-uring, linux-xfs, hch, andres
On Wed, Jul 12, 2023 at 11:02:07AM +1000, Dave Chinner wrote:
> I'm not sure this is safe for all polled writes. What if the DIO
> write was into a hole and we have to run unwritten extent
> completion via:
>
> iomap_dio_complete_work(work)
> iomap_dio_complete(dio)
> dio->end_io(iocb)
> xfs_dio_write_end_io()
> xfs_iomap_write_unwritten()
> <runs transactions, takes rwsems, does IO>
> .....
> ki->ki_complete()
> io_complete_rw_iopoll()
> .....
>
> I don't see anything in the iomap DIO path that prevents us from
> doing HIPRI/REQ_POLLED IO on IOMAP_UNWRITTEN extents, hence I think
> this change will result in bad things happening in general.
Where the bad thing is that we're doing fairly expensive work in the
completion thread. Which is probably horrible for performance, but
should be otherwise unproblematic.
> Regardless of the correctness of the code, I don't think adding this
> special case is the right thing to do here. We should be able to
> complete all writes that don't require blocking completions directly
> here, not just polled writes.
Note that we have quite a few completion handlers that don't block,
but still require user context, as they take a spinlock without
irq protection.
Things are a bit more complicated now compared to the legacy direct
I/O code, because back then non-XFS file systems usually didn't support
i_size updates from asynchronous dio.
> Essentially, we shouldn't be using IOMAP_DIO_WRITE as the
> determining factor for queuing completions - we should be using
> the information the iocb and the iomap provides us at submission
> time similar to how we determine if we can use REQ_FUA for O_DSYNC
> writes to determine if iomap IO completion queuing is required.
We also need information from the file system, e.g. zonefs always
takes a mutex at least for the zone files.
In other words, optimizing non-sync or FUA pure overwrites has a fair
bit of overlap with this, but is actually a more complex issue.
* [PATCH 2/5] fs: add IOCB flags related to passing back dio completions
2023-07-11 20:33 [PATCHSET 0/5] Improve async iomap DIO performance Jens Axboe
2023-07-11 20:33 ` [PATCH 1/5] iomap: complete polled writes inline Jens Axboe
@ 2023-07-11 20:33 ` Jens Axboe
2023-07-11 20:33 ` [PATCH 3/5] io_uring/rw: add write support for IOCB_DIO_DEFER Jens Axboe
` (2 subsequent siblings)
4 siblings, 0 replies; 10+ messages in thread
From: Jens Axboe @ 2023-07-11 20:33 UTC (permalink / raw)
To: io-uring, linux-xfs; +Cc: hch, andres, Jens Axboe
Async dio completions generally happen from hard/soft IRQ context, which
means that users like iomap may need to defer some of the completion
handling to a workqueue. This is less efficient than having the original
issuer handle it, like we do for sync IO, and it adds latency to the
completions.
Add IOCB_DIO_DEFER, which the issuer can set if it is able to defer
these completions back to a safe task context. If the dio handler is aware
of this flag, it can assign a callback handler in kiocb->dio_complete and
the associated data in kiocb->private. The issuer will then call this handler
with that data from task context.
No functional changes in this patch.
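The handshake described above can be modeled in userspace C (toy struct and an illustrative flag value; the real kiocb lives in include/linux/fs.h and is laid out differently):

```c
#include <stddef.h>

#define IOCB_DIO_DEFER (1u << 22)	/* illustrative value */

/* Toy model of the kiocb fields involved in the handshake. */
struct kiocb {
	unsigned int ki_flags;
	void *private;			/* argument for dio_complete */
	long (*dio_complete)(void *data);
};

static long dio_deferred_complete(void *data)
{
	return *(long *)data;		/* e.g. bytes transferred */
}

/* dio handler side: if the issuer groks deferral, stash a callback
 * and its argument instead of punting completion to a workqueue. */
void dio_end_io(struct kiocb *iocb, long *result)
{
	if (iocb->ki_flags & IOCB_DIO_DEFER) {
		iocb->private = result;
		iocb->dio_complete = dio_deferred_complete;
	}
}

/* issuer side, later, from task context: run the deferred completion. */
long issuer_complete(struct kiocb *iocb)
{
	if (iocb->dio_complete)
		return iocb->dio_complete(iocb->private);
	return -1;	/* was completed through the normal path */
}
```

The issuer sets the flag at submission, and checks for a non-NULL dio_complete when its ki_complete handler is eventually invoked.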
Signed-off-by: Jens Axboe <[email protected]>
---
include/linux/fs.h | 30 ++++++++++++++++++++++++++++--
1 file changed, 28 insertions(+), 2 deletions(-)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6867512907d6..115382f66d79 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -338,6 +338,16 @@ enum rw_hint {
#define IOCB_NOIO (1 << 20)
/* can use bio alloc cache */
#define IOCB_ALLOC_CACHE (1 << 21)
+/*
+ * IOCB_DIO_DEFER can be set by the iocb owner, to indicate that the
+ * iocb completion can be passed back to the owner for execution from a safe
+ * context rather than needing to be punted through a workqueue. If this
+ * flag is set, the completion handling may set iocb->dio_complete to a
+ * handler, which the issuer will then call from task context to complete
+ * the processing of the iocb. iocb->private should then also be set to
+ * the argument being passed to this handler.
+ */
+#define IOCB_DIO_DEFER (1 << 22)
/* for use in trace events */
#define TRACE_IOCB_STRINGS \
@@ -351,7 +361,8 @@ enum rw_hint {
{ IOCB_WRITE, "WRITE" }, \
{ IOCB_WAITQ, "WAITQ" }, \
{ IOCB_NOIO, "NOIO" }, \
- { IOCB_ALLOC_CACHE, "ALLOC_CACHE" }
+ { IOCB_ALLOC_CACHE, "ALLOC_CACHE" }, \
+ { IOCB_DIO_DEFER, "DIO_DEFER" }
struct kiocb {
struct file *ki_filp;
@@ -360,7 +371,22 @@ struct kiocb {
void *private;
int ki_flags;
u16 ki_ioprio; /* See linux/ioprio.h */
- struct wait_page_queue *ki_waitq; /* for async buffered IO */
+ union {
+ /*
+ * Only used for async buffered reads, where it denotes the
+ * page waitqueue associated with completing the read. Valid
+ * IFF IOCB_WAITQ is set.
+ */
+ struct wait_page_queue *ki_waitq;
+ /*
+ * Can be used for O_DIRECT IO, where the completion handling
+ * is punted back to the issuer of the IO. May only be set
+ * if IOCB_DIO_DEFER is set by the issuer, and the issuer must
+ * then check for presence of this handler when ki_complete is
+ * invoked.
+ */
+ ssize_t (*dio_complete)(void *data);
+ };
};
static inline bool is_sync_kiocb(struct kiocb *kiocb)
--
2.40.1
* [PATCH 3/5] io_uring/rw: add write support for IOCB_DIO_DEFER
2023-07-11 20:33 [PATCHSET 0/5] Improve async iomap DIO performance Jens Axboe
2023-07-11 20:33 ` [PATCH 1/5] iomap: complete polled writes inline Jens Axboe
2023-07-11 20:33 ` [PATCH 2/5] fs: add IOCB flags related to passing back dio completions Jens Axboe
@ 2023-07-11 20:33 ` Jens Axboe
2023-07-11 20:33 ` [PATCH 4/5] iomap: add local 'iocb' variable in iomap_dio_bio_end_io() Jens Axboe
2023-07-11 20:33 ` [PATCH 5/5] iomap: support IOCB_DIO_DEFER Jens Axboe
4 siblings, 0 replies; 10+ messages in thread
From: Jens Axboe @ 2023-07-11 20:33 UTC (permalink / raw)
To: io-uring, linux-xfs; +Cc: hch, andres, Jens Axboe
If the filesystem dio handler understands IOCB_DIO_DEFER, we'll get
a kiocb->ki_complete() callback with kiocb->dio_complete set. In that
case, rather than complete the IO directly through task_work, queue
up an intermediate task_work handler that first processes this
callback and then immediately completes the request.
For XFS, this avoids a punt through a workqueue, which is a lot less
efficient and adds latency to lower queue depth (or sync) O_DIRECT
writes.
Signed-off-by: Jens Axboe <[email protected]>
---
io_uring/rw.c | 24 ++++++++++++++++++++----
1 file changed, 20 insertions(+), 4 deletions(-)
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 1bce2208b65c..4ed378c70249 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -285,6 +285,14 @@ static inline int io_fixup_rw_res(struct io_kiocb *req, long res)
void io_req_rw_complete(struct io_kiocb *req, struct io_tw_state *ts)
{
+ struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
+
+ if (rw->kiocb.dio_complete) {
+ long res = rw->kiocb.dio_complete(rw->kiocb.private);
+
+ io_req_set_res(req, io_fixup_rw_res(req, res), 0);
+ }
+
io_req_io_end(req);
if (req->flags & (REQ_F_BUFFER_SELECTED|REQ_F_BUFFER_RING)) {
@@ -300,9 +308,11 @@ static void io_complete_rw(struct kiocb *kiocb, long res)
struct io_rw *rw = container_of(kiocb, struct io_rw, kiocb);
struct io_kiocb *req = cmd_to_io_kiocb(rw);
- if (__io_complete_rw_common(req, res))
- return;
- io_req_set_res(req, io_fixup_rw_res(req, res), 0);
+ if (!rw->kiocb.dio_complete) {
+ if (__io_complete_rw_common(req, res))
+ return;
+ io_req_set_res(req, io_fixup_rw_res(req, res), 0);
+ }
req->io_task_work.func = io_req_rw_complete;
__io_req_task_work_add(req, IOU_F_TWQ_LAZY_WAKE);
}
@@ -914,7 +924,13 @@ int io_write(struct io_kiocb *req, unsigned int issue_flags)
__sb_writers_release(file_inode(req->file)->i_sb,
SB_FREEZE_WRITE);
}
- kiocb->ki_flags |= IOCB_WRITE;
+
+ /*
+ * Set IOCB_DIO_DEFER, stating that our handler groks deferring the
+ * completion to task context.
+ */
+ kiocb->ki_flags |= IOCB_WRITE | IOCB_DIO_DEFER;
+ kiocb->dio_complete = NULL;
if (likely(req->file->f_op->write_iter))
ret2 = call_write_iter(req->file, kiocb, &s->iter);
--
2.40.1
* [PATCH 4/5] iomap: add local 'iocb' variable in iomap_dio_bio_end_io()
2023-07-11 20:33 [PATCHSET 0/5] Improve async iomap DIO performance Jens Axboe
` (2 preceding siblings ...)
2023-07-11 20:33 ` [PATCH 3/5] io_uring/rw: add write support for IOCB_DIO_DEFER Jens Axboe
@ 2023-07-11 20:33 ` Jens Axboe
2023-07-11 20:33 ` [PATCH 5/5] iomap: support IOCB_DIO_DEFER Jens Axboe
4 siblings, 0 replies; 10+ messages in thread
From: Jens Axboe @ 2023-07-11 20:33 UTC (permalink / raw)
To: io-uring, linux-xfs; +Cc: hch, andres, Jens Axboe
We use this multiple times; add a local variable for the kiocb.
Signed-off-by: Jens Axboe <[email protected]>
---
fs/iomap/direct-io.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 343bde5d50d3..94ef78b25b76 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -157,18 +157,20 @@ void iomap_dio_bio_end_io(struct bio *bio)
iomap_dio_set_error(dio, blk_status_to_errno(bio->bi_status));
if (atomic_dec_and_test(&dio->ref)) {
+ struct kiocb *iocb = dio->iocb;
+
if (dio->wait_for_completion) {
struct task_struct *waiter = dio->submit.waiter;
WRITE_ONCE(dio->submit.waiter, NULL);
blk_wake_io_task(waiter);
} else if ((bio->bi_opf & REQ_POLLED) ||
!(dio->flags & IOMAP_DIO_WRITE)) {
- WRITE_ONCE(dio->iocb->private, NULL);
+ WRITE_ONCE(iocb->private, NULL);
iomap_dio_complete_work(&dio->aio.work);
} else {
- struct inode *inode = file_inode(dio->iocb->ki_filp);
+ struct inode *inode = file_inode(iocb->ki_filp);
- WRITE_ONCE(dio->iocb->private, NULL);
+ WRITE_ONCE(iocb->private, NULL);
INIT_WORK(&dio->aio.work, iomap_dio_complete_work);
queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work);
}
--
2.40.1
* [PATCH 5/5] iomap: support IOCB_DIO_DEFER
2023-07-11 20:33 [PATCHSET 0/5] Improve async iomap DIO performance Jens Axboe
` (3 preceding siblings ...)
2023-07-11 20:33 ` [PATCH 4/5] iomap: add local 'iocb' variable in iomap_dio_bio_end_io() Jens Axboe
@ 2023-07-11 20:33 ` Jens Axboe
4 siblings, 0 replies; 10+ messages in thread
From: Jens Axboe @ 2023-07-11 20:33 UTC (permalink / raw)
To: io-uring, linux-xfs; +Cc: hch, andres, Jens Axboe
If IOCB_DIO_DEFER is set, utilize that to set kiocb->dio_complete handler
and data for that callback. Rather than punt the completion to a
workqueue, we pass back the handler and data to the issuer and will get a
callback from a safe task context.
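With this patch, the async completion choice in iomap_dio_bio_end_io() becomes a three-way decision. A userspace sketch of just that decision (flag values are illustrative; the sync-waiter case is omitted):

```c
/* Illustrative flag values, not the kernel's. */
#define REQ_POLLED          (1u << 0)
#define IOMAP_DIO_WRITE     (1u << 1)
#define IOMAP_DIO_NEED_SYNC (1u << 2)
#define IOCB_DIO_DEFER      (1u << 3)

enum dio_path {
	PATH_INLINE,	/* polled IO or reads: complete here */
	PATH_DEFER,	/* hand dio_complete back to the issuer */
	PATH_WORKQUEUE,	/* e.g. O_DSYNC without FUA: punt to workqueue */
};

enum dio_path async_completion_path(unsigned int bio_opf,
				    unsigned int dio_flags,
				    unsigned int ki_flags)
{
	if ((bio_opf & REQ_POLLED) || !(dio_flags & IOMAP_DIO_WRITE))
		return PATH_INLINE;
	/* Deferral needs an aware issuer and no remaining sync work. */
	if ((ki_flags & IOCB_DIO_DEFER) && !(dio_flags & IOMAP_DIO_NEED_SYNC))
		return PATH_DEFER;
	return PATH_WORKQUEUE;
}
```

The workqueue now only handles the cases that genuinely need it, which is where the large QD8/QD16 wins come from.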
Using the following fio job to randomly dio write 4k blocks at
queue depths of 1..16:
fio --name=dio-write --filename=/data1/file --time_based=1 \
--runtime=10 --bs=4096 --rw=randwrite --norandommap --buffered=0 \
--cpus_allowed=4 --ioengine=io_uring --iodepth=16
shows the following results before and after this patch:
Stock Patched Diff
=======================================
QD1 155K 162K + 4.5%
QD2 290K 313K + 7.9%
QD4 533K 597K +12.0%
QD8 604K 827K +36.9%
QD16 615K 845K +37.4%
which shows nice wins all around. If we factor in per-IOP efficiency,
the wins look even nicer. This becomes apparent as queue depth rises,
where the offloaded workqueue completions run out of steam.
Signed-off-by: Jens Axboe <[email protected]>
---
fs/iomap/direct-io.c | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 94ef78b25b76..bd7b948a29a7 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -130,6 +130,11 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio)
}
EXPORT_SYMBOL_GPL(iomap_dio_complete);
+static ssize_t iomap_dio_deferred_complete(void *data)
+{
+ return iomap_dio_complete(data);
+}
+
static void iomap_dio_complete_work(struct work_struct *work)
{
struct iomap_dio *dio = container_of(work, struct iomap_dio, aio.work);
@@ -167,6 +172,25 @@ void iomap_dio_bio_end_io(struct bio *bio)
!(dio->flags & IOMAP_DIO_WRITE)) {
WRITE_ONCE(iocb->private, NULL);
iomap_dio_complete_work(&dio->aio.work);
+ } else if ((iocb->ki_flags & IOCB_DIO_DEFER) &&
+ !(dio->flags & IOMAP_DIO_NEED_SYNC)) {
+ /* only polled IO cares about private cleared */
+ iocb->private = dio;
+ iocb->dio_complete = iomap_dio_deferred_complete;
+ /*
+ * Invoke ->ki_complete() directly. We've assigned
+ * our dio_complete callback handler, and since the
+ * issuer set IOCB_DIO_DEFER, we know their
+ * ki_complete handler will notice ->dio_complete
+ * being set and will defer calling that handler
+ * until it can be done from a safe task context.
+ *
+ * Note that the 'res' being passed in here is
+ * not important for this case. The actual completion
+ * value of the request will be gotten from dio_complete
+ * when that is run by the issuer.
+ */
+ iocb->ki_complete(iocb, 0);
} else {
struct inode *inode = file_inode(iocb->ki_filp);
--
2.40.1