public inbox for [email protected]
 help / color / mirror / Atom feed
* [PATCHSET v3 0/6] Improve async iomap DIO performance
@ 2023-07-19 19:54 Jens Axboe
  2023-07-19 19:54 ` [PATCH 1/6] iomap: clean up iomap_dio_bio_end_io() Jens Axboe
                   ` (5 more replies)
  0 siblings, 6 replies; 17+ messages in thread
From: Jens Axboe @ 2023-07-19 19:54 UTC (permalink / raw)
  To: io-uring, linux-xfs; +Cc: hch, andres, david

Hi,

iomap always punts async dio write completions to a workqueue, which has
a cost in terms of efficiency (now you need an unrelated worker to
process it) and latency (now you're bouncing a completion through an
async worker, which is a classic slowdown scenario).

Even for writes that should, in theory, be able to complete inline,
if we race with truncate or need to invalidate pages post completion,
we cannot sanely be in IRQ context as the locking types don't allow
for that.

io_uring handles IRQ completions via task_work, and for writes that
don't need to do extra IO at completion time, we can safely complete
them inline from that. This patchset adds IOCB_DIO_DEFER, which an IO
issuer can set to inform the completion side that any extra work that
needs doing for that completion can be punted to a safe task context.

The iomap dio completion will happen in hard/soft irq context, and we
need a saner context to process these completions. IOCB_DIO_DEFER is
added, which can be set in struct kiocb->ki_flags by the issuer. If
the completion side of the iocb handling understands this flag, it can
choose to set a kiocb->dio_complete() handler and just call ki_complete
from IRQ context. The issuer must then ensure that this callback is
processed from a task. io_uring punts IRQ completions to task_work
already, so it's trivial to wire it up to run more of the completion
before posting a CQE. This is good for up to a 37% improvement in
throughput/latency for low queue depth IO; patch 6 has the details.

If we need to do real work at completion time, iomap will clear the
IOMAP_DIO_DEFER_COMP flag.
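
As a rough sketch of how the two halves fit together (abbreviated, the
real wiring is in patches 5 and 6):

    /* issuer (io_uring), at submission time: opt in */
    kiocb->ki_flags |= IOCB_WRITE | IOCB_DIO_DEFER;
    kiocb->dio_complete = NULL;

    /* completion side (iomap), from IRQ context */
    iocb->private = dio;
    iocb->dio_complete = iomap_dio_deferred_complete;
    iocb->ki_complete(iocb, 0);

    /* issuer again, from task_work, before posting the CQE */
    if (kiocb->dio_complete)
        res = kiocb->dio_complete(kiocb->private);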

This work came about when Andres tested low queue depth dio writes
for postgres and compared it to doing sync dio writes, showing that the
async processing slows us down a lot.

Dave, I would appreciate your input on whether the logic is now right
in terms of when we can complete inline with IOCB_DIO_DEFER set!

 fs/iomap/direct-io.c | 102 +++++++++++++++++++++++++++++++++++--------
 include/linux/fs.h   |  30 ++++++++++++-
 io_uring/rw.c        |  27 ++++++++++--
 3 files changed, 136 insertions(+), 23 deletions(-)

Can also be found in a git branch here:

https://git.kernel.dk/cgit/linux/log/?h=xfs-async-dio.3

Since v2:
- Drop the poll specific bits. They end up folding under the last patch
  now anyway, and this avoids needing a weird "is bio still polled" or
  in_task() check.
- Keep non-IOCB_DIO_DEFER writes in the workqueue.
- Cleanup the iomap completion path first.
- Add patch treating a write through cache the same as FUA
- Add explicit IOMAP_DIO_DEFER_COMP flag
- Add comments

-- 
Jens Axboe




^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 1/6] iomap: clean up iomap_dio_bio_end_io()
  2023-07-19 19:54 [PATCHSET v3 0/6] Improve async iomap DIO performance Jens Axboe
@ 2023-07-19 19:54 ` Jens Axboe
  2023-07-20  4:50   ` Christoph Hellwig
  2023-07-19 19:54 ` [PATCH 2/6] iomap: add IOMAP_DIO_INLINE_COMP Jens Axboe
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 17+ messages in thread
From: Jens Axboe @ 2023-07-19 19:54 UTC (permalink / raw)
  To: io-uring, linux-xfs; +Cc: hch, andres, david, Jens Axboe

Make the logic a bit easier to follow:

1) Add a release_bio out path, as everybody needs to touch that, and
   have our bio ref check jump there if it's non-zero.
2) Add a kiocb local variable.
3) Add comments for each of the three conditions (sync, inline, or
   async workqueue punt).

No functional changes in this patch.

Signed-off-by: Jens Axboe <[email protected]>
---
 fs/iomap/direct-io.c | 43 ++++++++++++++++++++++++++++---------------
 1 file changed, 28 insertions(+), 15 deletions(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index ea3b868c8355..1c32f734c767 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -152,27 +152,40 @@ void iomap_dio_bio_end_io(struct bio *bio)
 {
 	struct iomap_dio *dio = bio->bi_private;
 	bool should_dirty = (dio->flags & IOMAP_DIO_DIRTY);
+	struct kiocb *iocb = dio->iocb;
 
 	if (bio->bi_status)
 		iomap_dio_set_error(dio, blk_status_to_errno(bio->bi_status));
+	if (!atomic_dec_and_test(&dio->ref))
+		goto release_bio;
 
-	if (atomic_dec_and_test(&dio->ref)) {
-		if (dio->wait_for_completion) {
-			struct task_struct *waiter = dio->submit.waiter;
-			WRITE_ONCE(dio->submit.waiter, NULL);
-			blk_wake_io_task(waiter);
-		} else if (dio->flags & IOMAP_DIO_WRITE) {
-			struct inode *inode = file_inode(dio->iocb->ki_filp);
-
-			WRITE_ONCE(dio->iocb->private, NULL);
-			INIT_WORK(&dio->aio.work, iomap_dio_complete_work);
-			queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work);
-		} else {
-			WRITE_ONCE(dio->iocb->private, NULL);
-			iomap_dio_complete_work(&dio->aio.work);
-		}
+	/*
+	 * Synchronous dio, task itself will handle any completion work
+	 * that needs doing after IO. All we need to do is wake the task.
+	 */
+	if (dio->wait_for_completion) {
+		struct task_struct *waiter = dio->submit.waiter;
+		WRITE_ONCE(dio->submit.waiter, NULL);
+		blk_wake_io_task(waiter);
+		goto release_bio;
+	}
+
+	/*
+	 * If this dio is an async write, queue completion work for async
+	 * handling. Reads can always complete inline.
+	 */
+	if (dio->flags & IOMAP_DIO_WRITE) {
+		struct inode *inode = file_inode(iocb->ki_filp);
+
+		WRITE_ONCE(iocb->private, NULL);
+		INIT_WORK(&dio->aio.work, iomap_dio_complete_work);
+		queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work);
+	} else {
+		WRITE_ONCE(iocb->private, NULL);
+		iomap_dio_complete_work(&dio->aio.work);
 	}
 
+release_bio:
 	if (should_dirty) {
 		bio_check_pages_dirty(bio);
 	} else {
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 2/6] iomap: add IOMAP_DIO_INLINE_COMP
  2023-07-19 19:54 [PATCHSET v3 0/6] Improve async iomap DIO performance Jens Axboe
  2023-07-19 19:54 ` [PATCH 1/6] iomap: clean up iomap_dio_bio_end_io() Jens Axboe
@ 2023-07-19 19:54 ` Jens Axboe
  2023-07-20  4:51   ` Christoph Hellwig
  2023-07-19 19:54 ` [PATCH 3/6] iomap: treat a write through cache the same as FUA Jens Axboe
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 17+ messages in thread
From: Jens Axboe @ 2023-07-19 19:54 UTC (permalink / raw)
  To: io-uring, linux-xfs; +Cc: hch, andres, david, Jens Axboe

Rather than gate whether or not we need to punt a dio completion to a
workqueue on whether the IO is a write or not, add an explicit flag for
it. For now the behavior is unchanged: reads always set the flag, and
async writes do not.

No functional changes in this patch.

Signed-off-by: Jens Axboe <[email protected]>
---
 fs/iomap/direct-io.c | 27 ++++++++++++++++++---------
 1 file changed, 18 insertions(+), 9 deletions(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 1c32f734c767..6b302bf8790b 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -20,6 +20,7 @@
  * Private flags for iomap_dio, must not overlap with the public ones in
  * iomap.h:
  */
+#define IOMAP_DIO_INLINE_COMP	(1 << 27)
 #define IOMAP_DIO_WRITE_FUA	(1 << 28)
 #define IOMAP_DIO_NEED_SYNC	(1 << 29)
 #define IOMAP_DIO_WRITE		(1 << 30)
@@ -171,20 +172,25 @@ void iomap_dio_bio_end_io(struct bio *bio)
 	}
 
 	/*
-	 * If this dio is an async write, queue completion work for async
-	 * handling. Reads can always complete inline.
+	 * If this dio is flagged with IOMAP_DIO_INLINE_COMP, complete it inline.
 	 */
-	if (dio->flags & IOMAP_DIO_WRITE) {
-		struct inode *inode = file_inode(iocb->ki_filp);
-
-		WRITE_ONCE(iocb->private, NULL);
-		INIT_WORK(&dio->aio.work, iomap_dio_complete_work);
-		queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work);
-	} else {
+	if (dio->flags & IOMAP_DIO_INLINE_COMP) {
 		WRITE_ONCE(iocb->private, NULL);
 		iomap_dio_complete_work(&dio->aio.work);
+		goto release_bio;
 	}
 
+	/*
+	 * Async DIO completion that requires filesystem level completion work
+	 * gets punted to a work queue to complete as the operation may require
+	 * more IO to be issued to finalise filesystem metadata changes or
+	 * guarantee data integrity.
+	 */
+	WRITE_ONCE(iocb->private, NULL);
+	INIT_WORK(&dio->aio.work, iomap_dio_complete_work);
+	queue_work(file_inode(iocb->ki_filp)->i_sb->s_dio_done_wq,
+			&dio->aio.work);
+
 release_bio:
 	if (should_dirty) {
 		bio_check_pages_dirty(bio);
@@ -524,6 +530,9 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		iomi.flags |= IOMAP_NOWAIT;
 
 	if (iov_iter_rw(iter) == READ) {
+		/* reads can always complete inline */
+		dio->flags |= IOMAP_DIO_INLINE_COMP;
+
 		if (iomi.pos >= dio->i_size)
 			goto out_free_dio;
 
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 3/6] iomap: treat a write through cache the same as FUA
  2023-07-19 19:54 [PATCHSET v3 0/6] Improve async iomap DIO performance Jens Axboe
  2023-07-19 19:54 ` [PATCH 1/6] iomap: clean up iomap_dio_bio_end_io() Jens Axboe
  2023-07-19 19:54 ` [PATCH 2/6] iomap: add IOMAP_DIO_INLINE_COMP Jens Axboe
@ 2023-07-19 19:54 ` Jens Axboe
  2023-07-20  4:54   ` Christoph Hellwig
  2023-07-19 19:54 ` [PATCH 4/6] fs: add IOCB flags related to passing back dio completions Jens Axboe
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 17+ messages in thread
From: Jens Axboe @ 2023-07-19 19:54 UTC (permalink / raw)
  To: io-uring, linux-xfs; +Cc: hch, andres, david, Jens Axboe

Whether we have a write back cache and are using FUA, or don't have a
write back cache at all, the end result is the same: the data is stable
when the write completes. Treat the two cases the same.

Signed-off-by: Jens Axboe <[email protected]>
---
 fs/iomap/direct-io.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 6b302bf8790b..b30c3edf2ef3 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -280,7 +280,8 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 		 * cache flushes on IO completion.
 		 */
 		if (!(iomap->flags & (IOMAP_F_SHARED|IOMAP_F_DIRTY)) &&
-		    (dio->flags & IOMAP_DIO_WRITE_FUA) && bdev_fua(iomap->bdev))
+		    (dio->flags & IOMAP_DIO_WRITE_FUA) &&
+		    (bdev_fua(iomap->bdev) || !bdev_write_cache(iomap->bdev)))
 			use_fua = true;
 	}
 
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 4/6] fs: add IOCB flags related to passing back dio completions
  2023-07-19 19:54 [PATCHSET v3 0/6] Improve async iomap DIO performance Jens Axboe
                   ` (2 preceding siblings ...)
  2023-07-19 19:54 ` [PATCH 3/6] iomap: treat a write through cache the same as FUA Jens Axboe
@ 2023-07-19 19:54 ` Jens Axboe
  2023-07-20  5:01   ` Christoph Hellwig
  2023-07-19 19:54 ` [PATCH 5/6] io_uring/rw: add write support for IOCB_DIO_DEFER Jens Axboe
  2023-07-19 19:54 ` [PATCH 6/6] iomap: support IOCB_DIO_DEFER Jens Axboe
  5 siblings, 1 reply; 17+ messages in thread
From: Jens Axboe @ 2023-07-19 19:54 UTC (permalink / raw)
  To: io-uring, linux-xfs; +Cc: hch, andres, david, Jens Axboe

Async dio completions generally happen from hard/soft IRQ context, which
means that users like iomap may need to defer some of the completion
handling to a workqueue. This is less efficient than having the original
issuer handle it, like we do for sync IO, and it adds latency to the
completions.

Add IOCB_DIO_DEFER, which the issuer can set if it is able to safely
punt these completions to task context. If the dio handler is aware
of this flag, it can assign a callback handler in kiocb->dio_complete
and the associated data in kiocb->private. The issuer will then call
this handler with that data from task context.
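
As a minimal sketch of the contract (my_handler/my_data are
placeholders here; the real assignments land in the iomap patch later
in this series):

    /* dio completion side, possibly in IRQ context */
    if (iocb->ki_flags & IOCB_DIO_DEFER) {
        iocb->private = my_data;            /* argument for the handler */
        iocb->dio_complete = my_handler;    /* runs from task context */
    }
    iocb->ki_complete(iocb, 0);

    /* issuer, later, from task context */
    if (iocb->dio_complete)
        res = iocb->dio_complete(iocb->private);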

No functional changes in this patch.

Signed-off-by: Jens Axboe <[email protected]>
---
 include/linux/fs.h | 30 ++++++++++++++++++++++++++++--
 1 file changed, 28 insertions(+), 2 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6867512907d6..115382f66d79 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -338,6 +338,16 @@ enum rw_hint {
 #define IOCB_NOIO		(1 << 20)
 /* can use bio alloc cache */
 #define IOCB_ALLOC_CACHE	(1 << 21)
+/*
+ * IOCB_DIO_DEFER can be set by the iocb owner, to indicate that the
+ * iocb completion can be passed back to the owner for execution from a safe
+ * context rather than needing to be punted through a workqueue. If this
+ * flag is set, the completion handling may set iocb->dio_complete to a
+ * handler, which the issuer will then call from task context to complete
+ * the processing of the iocb. iocb->private should then also be set to
+ * the argument being passed to this handler.
+ */
+#define IOCB_DIO_DEFER		(1 << 22)
 
 /* for use in trace events */
 #define TRACE_IOCB_STRINGS \
@@ -351,7 +361,8 @@ enum rw_hint {
 	{ IOCB_WRITE,		"WRITE" }, \
 	{ IOCB_WAITQ,		"WAITQ" }, \
 	{ IOCB_NOIO,		"NOIO" }, \
-	{ IOCB_ALLOC_CACHE,	"ALLOC_CACHE" }
+	{ IOCB_ALLOC_CACHE,	"ALLOC_CACHE" }, \
+	{ IOCB_DIO_DEFER,	"DIO_DEFER" }
 
 struct kiocb {
 	struct file		*ki_filp;
@@ -360,7 +371,22 @@ struct kiocb {
 	void			*private;
 	int			ki_flags;
 	u16			ki_ioprio; /* See linux/ioprio.h */
-	struct wait_page_queue	*ki_waitq; /* for async buffered IO */
+	union {
+		/*
+		 * Only used for async buffered reads, where it denotes the
+		 * page waitqueue associated with completing the read. Valid
+		 * IFF IOCB_WAITQ is set.
+		 */
+		struct wait_page_queue	*ki_waitq;
+		/*
+		 * Can be used for O_DIRECT IO, where the completion handling
+		 * is punted back to the issuer of the IO. May only be set
+		 * if IOCB_DIO_DEFER is set by the issuer, and the issuer must
+		 * then check for presence of this handler when ki_complete is
+		 * invoked.
+		 */
+		ssize_t (*dio_complete)(void *data);
+	};
 };
 
 static inline bool is_sync_kiocb(struct kiocb *kiocb)
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 5/6] io_uring/rw: add write support for IOCB_DIO_DEFER
  2023-07-19 19:54 [PATCHSET v3 0/6] Improve async iomap DIO performance Jens Axboe
                   ` (3 preceding siblings ...)
  2023-07-19 19:54 ` [PATCH 4/6] fs: add IOCB flags related to passing back dio completions Jens Axboe
@ 2023-07-19 19:54 ` Jens Axboe
  2023-07-19 19:54 ` [PATCH 6/6] iomap: support IOCB_DIO_DEFER Jens Axboe
  5 siblings, 0 replies; 17+ messages in thread
From: Jens Axboe @ 2023-07-19 19:54 UTC (permalink / raw)
  To: io-uring, linux-xfs; +Cc: hch, andres, david, Jens Axboe

If the filesystem dio handler understands IOCB_DIO_DEFER, we'll get
a kiocb->ki_complete() callback with kiocb->dio_complete set. In that
case, rather than complete the IO directly through task_work, queue
up an intermediate task_work handler that first processes this
callback and then immediately completes the request.

For XFS, this avoids a punt through a workqueue, which is a lot less
efficient and adds latency to lower queue depth (or sync) O_DIRECT
writes.
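
Note that nothing changes for userspace; any O_DIRECT write issued
through io_uring picks this up transparently on a supporting
filesystem. For reference, a bare-bones liburing sketch of such a
write (error handling and cleanup trimmed, the file path is just a
placeholder):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <liburing.h>

    int main(void)
    {
        struct io_uring ring;
        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;
        void *buf;
        int fd;

        io_uring_queue_init(8, &ring, 0);
        fd = open("/data1/file", O_WRONLY | O_DIRECT);
        posix_memalign(&buf, 4096, 4096);

        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_write(sqe, fd, buf, 4096, 0);
        io_uring_submit(&ring);

        /* completion runs ->dio_complete from task context */
        io_uring_wait_cqe(&ring, &cqe);
        io_uring_cqe_seen(&ring, cqe);
        io_uring_queue_exit(&ring);
        return 0;
    }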

Signed-off-by: Jens Axboe <[email protected]>
---
 io_uring/rw.c | 27 +++++++++++++++++++++++----
 1 file changed, 23 insertions(+), 4 deletions(-)

diff --git a/io_uring/rw.c b/io_uring/rw.c
index 1bce2208b65c..4657e11acf02 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -285,6 +285,14 @@ static inline int io_fixup_rw_res(struct io_kiocb *req, long res)
 
 void io_req_rw_complete(struct io_kiocb *req, struct io_tw_state *ts)
 {
+	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
+
+	if (rw->kiocb.dio_complete) {
+		long res = rw->kiocb.dio_complete(rw->kiocb.private);
+
+		io_req_set_res(req, io_fixup_rw_res(req, res), 0);
+	}
+
 	io_req_io_end(req);
 
 	if (req->flags & (REQ_F_BUFFER_SELECTED|REQ_F_BUFFER_RING)) {
@@ -300,9 +308,11 @@ static void io_complete_rw(struct kiocb *kiocb, long res)
 	struct io_rw *rw = container_of(kiocb, struct io_rw, kiocb);
 	struct io_kiocb *req = cmd_to_io_kiocb(rw);
 
-	if (__io_complete_rw_common(req, res))
-		return;
-	io_req_set_res(req, io_fixup_rw_res(req, res), 0);
+	if (!rw->kiocb.dio_complete) {
+		if (__io_complete_rw_common(req, res))
+			return;
+		io_req_set_res(req, io_fixup_rw_res(req, res), 0);
+	}
 	req->io_task_work.func = io_req_rw_complete;
 	__io_req_task_work_add(req, IOU_F_TWQ_LAZY_WAKE);
 }
@@ -312,6 +322,9 @@ static void io_complete_rw_iopoll(struct kiocb *kiocb, long res)
 	struct io_rw *rw = container_of(kiocb, struct io_rw, kiocb);
 	struct io_kiocb *req = cmd_to_io_kiocb(rw);
 
+	if (rw->kiocb.dio_complete)
+		res = rw->kiocb.dio_complete(rw->kiocb.private);
+
 	if (kiocb->ki_flags & IOCB_WRITE)
 		kiocb_end_write(req);
 	if (unlikely(res != req->cqe.res)) {
@@ -914,7 +927,13 @@ int io_write(struct io_kiocb *req, unsigned int issue_flags)
 		__sb_writers_release(file_inode(req->file)->i_sb,
 					SB_FREEZE_WRITE);
 	}
-	kiocb->ki_flags |= IOCB_WRITE;
+
+	/*
+	 * Set IOCB_DIO_DEFER, stating that our handler groks deferring the
+	 * completion to task context.
+	 */
+	kiocb->ki_flags |= IOCB_WRITE | IOCB_DIO_DEFER;
+	kiocb->dio_complete = NULL;
 
 	if (likely(req->file->f_op->write_iter))
 		ret2 = call_write_iter(req->file, kiocb, &s->iter);
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 6/6] iomap: support IOCB_DIO_DEFER
  2023-07-19 19:54 [PATCHSET v3 0/6] Improve async iomap DIO performance Jens Axboe
                   ` (4 preceding siblings ...)
  2023-07-19 19:54 ` [PATCH 5/6] io_uring/rw: add write support for IOCB_DIO_DEFER Jens Axboe
@ 2023-07-19 19:54 ` Jens Axboe
  2023-07-20  4:59   ` Christoph Hellwig
  5 siblings, 1 reply; 17+ messages in thread
From: Jens Axboe @ 2023-07-19 19:54 UTC (permalink / raw)
  To: io-uring, linux-xfs; +Cc: hch, andres, david, Jens Axboe

If IOCB_DIO_DEFER is set, utilize that to set the kiocb->dio_complete
handler and data for that callback. Rather than punt the completion to
a workqueue, we pass back the handler and data to the issuer and will
get a callback from a safe task context.

Using the following fio job to randomly dio write 4k blocks at
queue depths of 1..16:

fio --name=dio-write --filename=/data1/file --time_based=1 \
--runtime=10 --bs=4096 --rw=randwrite --norandommap --buffered=0 \
--cpus_allowed=4 --ioengine=io_uring --iodepth=$depth

shows the following results before and after this patch:

	Stock	Patched		Diff
=======================================
QD1	155K	162K		+ 4.5%
QD2	290K	313K		+ 7.9%
QD4	533K	597K		+12.0%
QD8	604K	827K		+36.9%
QD16	615K	845K		+37.4%

which shows nice wins all around. If we factored in per-IOP efficiency,
the wins look even nicer. This becomes apparent as queue depth rises,
where the offloaded workqueue completions run out of steam.

Signed-off-by: Jens Axboe <[email protected]>
---
 fs/iomap/direct-io.c | 47 +++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 46 insertions(+), 1 deletion(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index b30c3edf2ef3..b7055d50dd99 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -20,6 +20,7 @@
  * Private flags for iomap_dio, must not overlap with the public ones in
  * iomap.h:
  */
+#define IOMAP_DIO_DEFER_COMP	(1 << 26)
 #define IOMAP_DIO_INLINE_COMP	(1 << 27)
 #define IOMAP_DIO_WRITE_FUA	(1 << 28)
 #define IOMAP_DIO_NEED_SYNC	(1 << 29)
@@ -131,6 +132,11 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio)
 }
 EXPORT_SYMBOL_GPL(iomap_dio_complete);
 
+static ssize_t iomap_dio_deferred_complete(void *data)
+{
+	return iomap_dio_complete(data);
+}
+
 static void iomap_dio_complete_work(struct work_struct *work)
 {
 	struct iomap_dio *dio = container_of(work, struct iomap_dio, aio.work);
@@ -180,6 +186,31 @@ void iomap_dio_bio_end_io(struct bio *bio)
 		goto release_bio;
 	}
 
+	/*
+	 * If this dio is flagged with IOMAP_DIO_DEFER_COMP, then schedule
+	 * our completion that way to avoid an async punt to a workqueue.
+	 */
+	if (dio->flags & IOMAP_DIO_DEFER_COMP) {
+		/* only polled IO cares about private cleared */
+		iocb->private = dio;
+		iocb->dio_complete = iomap_dio_deferred_complete;
+
+		/*
+		 * Invoke ->ki_complete() directly. We've assigned our
+		 * dio_complete callback handler, and since the issuer set
+		 * IOCB_DIO_DEFER, we know their ki_complete handler will
+		 * notice ->dio_complete being set and will defer calling that
+		 * handler until it can be done from a safe task context.
+		 *
+		 * Note that the 'res' being passed in here is not important
+		 * for this case. The actual completion value of the request
+		 * will be retrieved from ->dio_complete when the issuer
+		 * runs it.
+		 */
+		iocb->ki_complete(iocb, 0);
+		goto release_bio;
+	}
+
 	/*
 	 * Async DIO completion that requires filesystem level completion work
 	 * gets punted to a work queue to complete as the operation may require
@@ -277,12 +308,15 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 		 * data IO that doesn't require any metadata updates (including
 		 * after IO completion such as unwritten extent conversion) and
 		 * the underlying device supports FUA. This allows us to avoid
-		 * cache flushes on IO completion.
+		 * cache flushes on IO completion. If we can't use FUA and
+		 * need to sync, disable in-task completions.
 		 */
 		if (!(iomap->flags & (IOMAP_F_SHARED|IOMAP_F_DIRTY)) &&
 		    (dio->flags & IOMAP_DIO_WRITE_FUA) &&
 		    (bdev_fua(iomap->bdev) || !bdev_write_cache(iomap->bdev)))
 			use_fua = true;
+		else if (dio->flags & IOMAP_DIO_NEED_SYNC)
+			dio->flags &= ~IOMAP_DIO_DEFER_COMP;
 	}
 
 	/*
@@ -308,6 +342,8 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 		pad = pos & (fs_block_size - 1);
 		if (pad)
 			iomap_dio_zero(iter, dio, pos - pad, pad);
+
+		dio->flags &= ~IOMAP_DIO_DEFER_COMP;
 	}
 
 	/*
@@ -547,6 +583,15 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		iomi.flags |= IOMAP_WRITE;
 		dio->flags |= IOMAP_DIO_WRITE;
 
+		/*
+		 * Flag as supporting deferred completions, if the issuer
+		 * groks it. This can avoid a workqueue punt for writes.
+		 * We may later clear this flag if we need to do other IO
+		 * as part of this IO completion.
+		 */
+		if (iocb->ki_flags & IOCB_DIO_DEFER)
+			dio->flags |= IOMAP_DIO_DEFER_COMP;
+
 		if (dio_flags & IOMAP_DIO_OVERWRITE_ONLY) {
 			ret = -EAGAIN;
 			if (iomi.pos >= dio->i_size ||
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/6] iomap: clean up iomap_dio_bio_end_io()
  2023-07-19 19:54 ` [PATCH 1/6] iomap: clean up iomap_dio_bio_end_io() Jens Axboe
@ 2023-07-20  4:50   ` Christoph Hellwig
  2023-07-20 16:13     ` Jens Axboe
  0 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2023-07-20  4:50 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, linux-xfs, hch, andres, david

> +	/*
> +	 * Synchronous dio, task itself will handle any completion work
> +	 * that needs doing after IO. All we need to do is wake the task.
> +	 */
> +	if (dio->wait_for_completion) {
> +		struct task_struct *waiter = dio->submit.waiter;
> +		WRITE_ONCE(dio->submit.waiter, NULL);

I know the existing code got it wrong, but can you please add an empty
line after the variable declaration here?

> +	/*
> +	 * If this dio is an async write, queue completion work for async
> +	 * handling. Reads can always complete inline.
> +	 */
> +	if (dio->flags & IOMAP_DIO_WRITE) {
> +		struct inode *inode = file_inode(iocb->ki_filp);
> +
> +		WRITE_ONCE(iocb->private, NULL);
> +		INIT_WORK(&dio->aio.work, iomap_dio_complete_work);
> +		queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work);
> +	} else {

If we already do the goto style I'd probably do it here as well instead
of the else.

Otherwise this looks good to me.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/6] iomap: add IOMAP_DIO_INLINE_COMP
  2023-07-19 19:54 ` [PATCH 2/6] iomap: add IOMAP_DIO_INLINE_COMP Jens Axboe
@ 2023-07-20  4:51   ` Christoph Hellwig
  2023-07-20 16:19     ` Jens Axboe
  0 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2023-07-20  4:51 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, linux-xfs, hch, andres, david

> -	if (dio->flags & IOMAP_DIO_WRITE) {
> -		struct inode *inode = file_inode(iocb->ki_filp);
> -
> -		WRITE_ONCE(iocb->private, NULL);
> -		INIT_WORK(&dio->aio.work, iomap_dio_complete_work);
> -		queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work);
> -	} else {
> +	if (dio->flags & IOMAP_DIO_INLINE_COMP) {
>  		WRITE_ONCE(iocb->private, NULL);
>  		iomap_dio_complete_work(&dio->aio.work);
> +		goto release_bio;
>  	}

I would have probably just structured the code to match this in the
last path with a

	if (!(dio->flags & IOMAP_DIO_WRITE)) {

so that this becomes trivial.  But that's just nitpicking and the
result looks good to me.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 3/6] iomap: treat a write through cache the same as FUA
  2023-07-19 19:54 ` [PATCH 3/6] iomap: treat a write through cache the same as FUA Jens Axboe
@ 2023-07-20  4:54   ` Christoph Hellwig
  2023-07-20 16:23     ` Jens Axboe
  0 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2023-07-20  4:54 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, linux-xfs, hch, andres, david

On Wed, Jul 19, 2023 at 01:54:14PM -0600, Jens Axboe wrote:
> Whether we have a write back cache and are using FUA or don't have
> a write back cache at all is the same situation. Treat them the same.

This looks correct, but I think IOMAP_DIO_WRITE_FUA is rather
misnamed now, which could lead to confusion.  The comment in
__iomap_dio_rw when checking the flag and clearing IOMAP_DIO_NEED_SYNC
also needs a little update to talk about writethrough semantics and
not just FUA now.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 6/6] iomap: support IOCB_DIO_DEFER
  2023-07-19 19:54 ` [PATCH 6/6] iomap: support IOCB_DIO_DEFER Jens Axboe
@ 2023-07-20  4:59   ` Christoph Hellwig
  2023-07-20 16:27     ` Jens Axboe
  0 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2023-07-20  4:59 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, linux-xfs, hch, andres, david

> +	if (dio->flags & IOMAP_DIO_DEFER_COMP) {
> +		/* only polled IO cares about private cleared */
> +		iocb->private = dio;

FYI, I find this comment a bit weird as it comments on what we're
not doing in a path where it is irrelevant.  I'd rather only clear
the private data in the path where polling is applicable and have
a comment there explaining why it is cleared.  That probably belongs
in the first patch restructuring the function.

> @@ -277,12 +308,15 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>  		 * data IO that doesn't require any metadata updates (including
>  		 * after IO completion such as unwritten extent conversion) and
>  		 * the underlying device supports FUA. This allows us to avoid
> -		 * cache flushes on IO completion.
> +		 * cache flushes on IO completion. If we can't use FUA and
> +		 * need to sync, disable in-task completions.

... because iomap_dio_complete will have to call generic_write_sync to
do a blocking ->fsync call.

> @@ -308,6 +342,8 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>  		pad = pos & (fs_block_size - 1);
>  		if (pad)
>  			iomap_dio_zero(iter, dio, pos - pad, pad);
> +
> +		dio->flags &= ~IOMAP_DIO_DEFER_COMP;

Why does zeroing disable the deferred completions?  I can't really think
of why, which is probably a strong indicator why this needs a comment.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 4/6] fs: add IOCB flags related to passing back dio completions
  2023-07-19 19:54 ` [PATCH 4/6] fs: add IOCB flags related to passing back dio completions Jens Axboe
@ 2023-07-20  5:01   ` Christoph Hellwig
  2023-07-20 16:25     ` Jens Axboe
  0 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2023-07-20  5:01 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, linux-xfs, hch, andres, david

>  /* can use bio alloc cache */
>  #define IOCB_ALLOC_CACHE	(1 << 21)
> +/*
> + * IOCB_DIO_DEFER can be set by the iocb owner, to indicate that the
> + * iocb completion can be passed back to the owner for execution from a safe
> + * context rather than needing to be punted through a workqueue. If this
> + * flag is set, the completion handling may set iocb->dio_complete to a
> + * handler, which the issuer will then call from task context to complete
> + * the processing of the iocb. iocb->private should then also be set to
> + * the argument being passed to this handler.

Can you add an explanation of when it is safe/desirable to do the deferred
completion?  As of the last patch we seem to avoid anything that does
I/O or transaction commits, but we'd still allow blocking operations
like mutexes used in the zonefs completion handler.  We need to catch
this so future users know what to do.

Similarly on the iomap side I think we need clear documentation for
what context ->end_io can be called in now.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/6] iomap: clean up iomap_dio_bio_end_io()
  2023-07-20  4:50   ` Christoph Hellwig
@ 2023-07-20 16:13     ` Jens Axboe
  0 siblings, 0 replies; 17+ messages in thread
From: Jens Axboe @ 2023-07-20 16:13 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: io-uring, linux-xfs, andres, david

On 7/19/23 10:50 PM, Christoph Hellwig wrote:
>> +	/*
>> +	 * Synchronous dio, task itself will handle any completion work
>> +	 * that needs doing after IO. All we need to do is wake the task.
>> +	 */
>> +	if (dio->wait_for_completion) {
>> +		struct task_struct *waiter = dio->submit.waiter;
>> +		WRITE_ONCE(dio->submit.waiter, NULL);
> 
> I know the existing code got it wrong, but can you please add an empty
> line after the variable declaration here?

Sure, will add.

>> +	/*
>> +	 * If this dio is an async write, queue completion work for async
>> +	 * handling. Reads can always complete inline.
>> +	 */
>> +	if (dio->flags & IOMAP_DIO_WRITE) {
>> +		struct inode *inode = file_inode(iocb->ki_filp);
>> +
>> +		WRITE_ONCE(iocb->private, NULL);
>> +		INIT_WORK(&dio->aio.work, iomap_dio_complete_work);
>> +		queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work);
>> +	} else {
> 
> If we already do the goto style I'd probably do it here as well instead
> of the else.

It does end up like that later on, but I can do it earlier and leave the
least desirable method (workqueue) at the end from this patch.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/6] iomap: add IOMAP_DIO_INLINE_COMP
  2023-07-20  4:51   ` Christoph Hellwig
@ 2023-07-20 16:19     ` Jens Axboe
  0 siblings, 0 replies; 17+ messages in thread
From: Jens Axboe @ 2023-07-20 16:19 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: io-uring, linux-xfs, andres, david

On 7/19/23 10:51 PM, Christoph Hellwig wrote:
>> -	if (dio->flags & IOMAP_DIO_WRITE) {
>> -		struct inode *inode = file_inode(iocb->ki_filp);
>> -
>> -		WRITE_ONCE(iocb->private, NULL);
>> -		INIT_WORK(&dio->aio.work, iomap_dio_complete_work);
>> -		queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work);
>> -	} else {
>> +	if (dio->flags & IOMAP_DIO_INLINE_COMP) {
>>  		WRITE_ONCE(iocb->private, NULL);
>>  		iomap_dio_complete_work(&dio->aio.work);
>> +		goto release_bio;
>>  	}
> 
> I would have probably just structured the code to match this in the
> last path with a
> 
> 	if (!(dio->flags & IOMAP_DIO_WRITE)) {
> 
> so that this becomes trivial.  But that's just nitpicking and the
> result looks good to me.

Ends up being done that way anyway with the rework of patch 1 to put the
non-write side first, so that's what it looks like now in v3.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 3/6] iomap: treat a write through cache the same as FUA
  2023-07-20  4:54   ` Christoph Hellwig
@ 2023-07-20 16:23     ` Jens Axboe
  0 siblings, 0 replies; 17+ messages in thread
From: Jens Axboe @ 2023-07-20 16:23 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: io-uring, linux-xfs, andres, david

On 7/19/23 10:54 PM, Christoph Hellwig wrote:
> On Wed, Jul 19, 2023 at 01:54:14PM -0600, Jens Axboe wrote:
>> Whether we have a write back cache and are using FUA or don't have
>> a write back cache at all is the same situation. Treat them the same.
> 
> This looks correct, but I think IOMAP_DIO_WRITE_FUA is rather
> misnamed now, which could lead to confusion.  The comment in

It is - should I rename it to IOMAP_DIO_STABLE_WRITE or something like
that as part of this change?

> __iomap_dio_rw when checking the flag and clearing IOMAP_DIO_NEED_SYNC
> also needs a little update to talk about writethrough semantics and
> not just FUA now.

Will do.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 4/6] fs: add IOCB flags related to passing back dio completions
  2023-07-20  5:01   ` Christoph Hellwig
@ 2023-07-20 16:25     ` Jens Axboe
  0 siblings, 0 replies; 17+ messages in thread
From: Jens Axboe @ 2023-07-20 16:25 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: io-uring, linux-xfs, andres, david

On 7/19/23 11:01 PM, Christoph Hellwig wrote:
>>  /* can use bio alloc cache */
>>  #define IOCB_ALLOC_CACHE	(1 << 21)
>> +/*
>> + * IOCB_DIO_DEFER can be set by the iocb owner, to indicate that the
>> + * iocb completion can be passed back to the owner for execution from a safe
>> + * context rather than needing to be punted through a workqueue. If this
>> + * flag is set, the completion handling may set iocb->dio_complete to a
>> + * handler, which the issuer will then call from task context to complete
>> + * the processing of the iocb. iocb->private should then also be set to
>> + * the argument being passed to this handler.
> 
> Can you add an explanation when it is safe/destirable to do the deferred
> completion?  As of the last patch we seem to avoid anything that does
> I/O or transaction commits, but we'd still allow blocking operations
> like mutexes used in the zonefs completion handler.  We need to catch
> this so future users know what to do.

Sure can do - generally it's exactly as you write, anything that does
extra IO should still be done in a workqueue to avoid stalling this
particular IO completion on further IO.
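
Something along these lines, perhaps (rough wording, not final):

    /*
     * Handlers that may have to issue further IO to complete the
     * original IO (eg unwritten extent conversion, cache flushes)
     * must still punt to a workqueue. Blocking operations like
     * grabbing a mutex are fine, as ->dio_complete is invoked from
     * task context.
     */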

> Similarly on the iomap side I think we need clear documentation for
> what context ->end_io can be called in now.

Honestly that was needed before as well, not really related to this
change (or the next two patches) as they don't change the context of how
it is called.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 6/6] iomap: support IOCB_DIO_DEFER
  2023-07-20  4:59   ` Christoph Hellwig
@ 2023-07-20 16:27     ` Jens Axboe
  0 siblings, 0 replies; 17+ messages in thread
From: Jens Axboe @ 2023-07-20 16:27 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: io-uring, linux-xfs, andres, david

On 7/19/23 10:59 PM, Christoph Hellwig wrote:
>> +	if (dio->flags & IOMAP_DIO_DEFER_COMP) {
>> +		/* only polled IO cares about private cleared */
>> +		iocb->private = dio;
> 
> FYI, I find this comment a bit weird as it comments on what we're
> not doing in a path where it is irrelevant.  I'd rather only clear
> the private data in the path where polling is applicable and have
> a comment there explaining why it is cleared.  That probably belongs
> in the first patch restructuring the function.

OK, that makes sense.

>> @@ -277,12 +308,15 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>>  		 * data IO that doesn't require any metadata updates (including
>>  		 * after IO completion such as unwritten extent conversion) and
>>  		 * the underlying device supports FUA. This allows us to avoid
>> -		 * cache flushes on IO completion.
>> +		 * cache flushes on IO completion. If we can't use FUA and
>> +		 * need to sync, disable in-task completions.
> 
> ... because iomap_dio_complete will have to call generic_write_sync to
> do a blocking ->fsync call.

Will add to that comment.

>> @@ -308,6 +342,8 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>>  		pad = pos & (fs_block_size - 1);
>>  		if (pad)
>>  			iomap_dio_zero(iter, dio, pos - pad, pad);
>> +
>> +		dio->flags &= ~IOMAP_DIO_DEFER_COMP;
> 
> Why does zeroing disable the deferred completions?  I can't really
> think of why, which is probably a strong indicator why this needs a
> comment.

If zeroout is set, then it's a new or unwritten extent and we need
further processing at IO completion time, which may trigger more IO.
I'll add a comment.
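
Something like this, perhaps (just a sketch):

    /*
     * New or unwritten extents need further processing after this
     * write completes, which may involve issuing more IO. Don't
     * defer this completion to the issuer.
     */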

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2023-07-20 16:27 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-07-19 19:54 [PATCHSET v3 0/6] Improve async iomap DIO performance Jens Axboe
2023-07-19 19:54 ` [PATCH 1/6] iomap: clean up iomap_dio_bio_end_io() Jens Axboe
2023-07-20  4:50   ` Christoph Hellwig
2023-07-20 16:13     ` Jens Axboe
2023-07-19 19:54 ` [PATCH 2/6] iomap: add IOMAP_DIO_INLINE_COMP Jens Axboe
2023-07-20  4:51   ` Christoph Hellwig
2023-07-20 16:19     ` Jens Axboe
2023-07-19 19:54 ` [PATCH 3/6] iomap: treat a write through cache the same as FUA Jens Axboe
2023-07-20  4:54   ` Christoph Hellwig
2023-07-20 16:23     ` Jens Axboe
2023-07-19 19:54 ` [PATCH 4/6] fs: add IOCB flags related to passing back dio completions Jens Axboe
2023-07-20  5:01   ` Christoph Hellwig
2023-07-20 16:25     ` Jens Axboe
2023-07-19 19:54 ` [PATCH 5/6] io_uring/rw: add write support for IOCB_DIO_DEFER Jens Axboe
2023-07-19 19:54 ` [PATCH 6/6] iomap: support IOCB_DIO_DEFER Jens Axboe
2023-07-20  4:59   ` Christoph Hellwig
2023-07-20 16:27     ` Jens Axboe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox