public inbox for [email protected]
 help / color / mirror / Atom feed
* [PATCH v2 0/7] implement async block discards/etc. via io_uring
@ 2024-08-22  3:35 Pavel Begunkov
  2024-08-22  3:35 ` [PATCH v2 1/7] io_uring/cmd: expose iowq to cmds Pavel Begunkov
                   ` (6 more replies)
  0 siblings, 7 replies; 20+ messages in thread
From: Pavel Begunkov @ 2024-08-22  3:35 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence, Conrad Meyer, linux-block, linux-mm

There is an interest in having asynchronous block operations like
discard and write zeroes. The series implements that as io_uring commands,
which is an io_uring request type allowing to implement custom file
specific operations.

First 4 patches are simple preps, patches 5-7 implement support for
discards, write zeroes and secure erases correspondingly.
While the two latter commands use the same code path, discards only
reuses common callbacks and not bio allocation loop because of
differences in how the range is sliced into bios, see granularity
handling in bio_discard_limit().

Note that there are differences with ioctl() versions, these are
asynchronous and looser on synchronisation with page cache allowing
more races, see comments to patch 5.

liburing tests:

https://github.com/isilence/liburing.git discard-cmd-test

Pavel Begunkov (7):
  io_uring/cmd: expose iowq to cmds
  io_uring/cmd: give inline space in request to cmds
  filemap: introduce filemap_invalidate_pages
  block: introduce blk_validate_write()
  block: implement async discard as io_uring cmd
  block: implement async wire write zeroes
  block: implement async secure erase

 block/blk.h                  |   1 +
 block/fops.c                 |   2 +
 block/ioctl.c                | 231 ++++++++++++++++++++++++++++++++---
 include/linux/blkdev.h       |   4 +
 include/linux/io_uring/cmd.h |  15 +++
 include/linux/pagemap.h      |   2 +
 include/uapi/linux/fs.h      |   4 +
 io_uring/io_uring.c          |  11 ++
 io_uring/io_uring.h          |   1 +
 io_uring/uring_cmd.c         |   7 ++
 mm/filemap.c                 |  18 ++-
 11 files changed, 273 insertions(+), 23 deletions(-)


base-commit: 15dadb5430367959a455818fef80350a68c010f4
-- 
2.45.2


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v2 1/7] io_uring/cmd: expose iowq to cmds
  2024-08-22  3:35 [PATCH v2 0/7] implement async block discards/etc. via io_uring Pavel Begunkov
@ 2024-08-22  3:35 ` Pavel Begunkov
  2024-08-22  3:35 ` [PATCH v2 2/7] io_uring/cmd: give inline space in request " Pavel Begunkov
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 20+ messages in thread
From: Pavel Begunkov @ 2024-08-22  3:35 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence, Conrad Meyer, linux-block, linux-mm

When an io_uring request needs blocking context we offload it to the
io_uring's thread pool called io-wq. We can get there off ->uring_cmd
by returning -EAGAIN, but there is no straightforward way of doing that
from an asynchronous callback. Add a helper that would transfer a
command to a blocking context.

Note, we do an extra hop via task_work before io_queue_iowq(), that's a
limitation of io_uring infra we have that can likely be lifted later
if that would ever become a problem.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 include/linux/io_uring/cmd.h |  6 ++++++
 io_uring/io_uring.c          | 11 +++++++++++
 io_uring/io_uring.h          |  1 +
 io_uring/uring_cmd.c         |  7 +++++++
 4 files changed, 25 insertions(+)

diff --git a/include/linux/io_uring/cmd.h b/include/linux/io_uring/cmd.h
index 447fbfd32215..86ceb3383e49 100644
--- a/include/linux/io_uring/cmd.h
+++ b/include/linux/io_uring/cmd.h
@@ -48,6 +48,9 @@ void __io_uring_cmd_do_in_task(struct io_uring_cmd *ioucmd,
 void io_uring_cmd_mark_cancelable(struct io_uring_cmd *cmd,
 		unsigned int issue_flags);
 
+/* Execute the request from a blocking context */
+void io_uring_cmd_issue_blocking(struct io_uring_cmd *ioucmd);
+
 #else
 static inline int io_uring_cmd_import_fixed(u64 ubuf, unsigned long len, int rw,
 			      struct iov_iter *iter, void *ioucmd)
@@ -67,6 +70,9 @@ static inline void io_uring_cmd_mark_cancelable(struct io_uring_cmd *cmd,
 		unsigned int issue_flags)
 {
 }
+static inline void io_uring_cmd_issue_blocking(struct io_uring_cmd *ioucmd)
+{
+}
 #endif
 
 /*
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index a53f2f25a80b..323cad8175e9 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -533,6 +533,17 @@ static void io_queue_iowq(struct io_kiocb *req)
 		io_queue_linked_timeout(link);
 }
 
+static void io_req_queue_iowq_tw(struct io_kiocb *req, struct io_tw_state *ts)
+{
+	io_queue_iowq(req);
+}
+
+void io_req_queue_iowq(struct io_kiocb *req)
+{
+	req->io_task_work.func = io_req_queue_iowq_tw;
+	io_req_task_work_add(req);
+}
+
 static __cold void io_queue_deferred(struct io_ring_ctx *ctx)
 {
 	while (!list_empty(&ctx->defer_list)) {
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 65078e641390..9d70b2cf7b1e 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -94,6 +94,7 @@ int io_uring_alloc_task_context(struct task_struct *task,
 
 int io_ring_add_registered_file(struct io_uring_task *tctx, struct file *file,
 				     int start, int end);
+void io_req_queue_iowq(struct io_kiocb *req);
 
 int io_poll_issue(struct io_kiocb *req, struct io_tw_state *ts);
 int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr);
diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
index 8391c7c7c1ec..39c3c816ec78 100644
--- a/io_uring/uring_cmd.c
+++ b/io_uring/uring_cmd.c
@@ -277,6 +277,13 @@ int io_uring_cmd_import_fixed(u64 ubuf, unsigned long len, int rw,
 }
 EXPORT_SYMBOL_GPL(io_uring_cmd_import_fixed);
 
+void io_uring_cmd_issue_blocking(struct io_uring_cmd *ioucmd)
+{
+	struct io_kiocb *req = cmd_to_io_kiocb(ioucmd);
+
+	io_req_queue_iowq(req);
+}
+
 static inline int io_uring_cmd_getsockopt(struct socket *sock,
 					  struct io_uring_cmd *cmd,
 					  unsigned int issue_flags)
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v2 2/7] io_uring/cmd: give inline space in request to cmds
  2024-08-22  3:35 [PATCH v2 0/7] implement async block discards/etc. via io_uring Pavel Begunkov
  2024-08-22  3:35 ` [PATCH v2 1/7] io_uring/cmd: expose iowq to cmds Pavel Begunkov
@ 2024-08-22  3:35 ` Pavel Begunkov
  2024-08-22  3:35 ` [PATCH v2 3/7] filemap: introduce filemap_invalidate_pages Pavel Begunkov
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 20+ messages in thread
From: Pavel Begunkov @ 2024-08-22  3:35 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence, Conrad Meyer, linux-block, linux-mm

Some io_uring commands can use some inline space in io_kiocb. We have 32
bytes in struct io_uring_cmd, expose it.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 include/linux/io_uring/cmd.h | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/include/linux/io_uring/cmd.h b/include/linux/io_uring/cmd.h
index 86ceb3383e49..c189d36ad55e 100644
--- a/include/linux/io_uring/cmd.h
+++ b/include/linux/io_uring/cmd.h
@@ -23,6 +23,15 @@ static inline const void *io_uring_sqe_cmd(const struct io_uring_sqe *sqe)
 	return sqe->cmd;
 }
 
+static inline void io_uring_cmd_private_sz_check(size_t cmd_sz)
+{
+	BUILD_BUG_ON(cmd_sz > sizeof_field(struct io_uring_cmd, pdu));
+}
+#define io_uring_cmd_to_pdu(cmd, pdu_type) ( \
+	io_uring_cmd_private_sz_check(sizeof(pdu_type)), \
+	((pdu_type *)&(cmd)->pdu) \
+)
+
 #if defined(CONFIG_IO_URING)
 int io_uring_cmd_import_fixed(u64 ubuf, unsigned long len, int rw,
 			      struct iov_iter *iter, void *ioucmd);
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v2 3/7] filemap: introduce filemap_invalidate_pages
  2024-08-22  3:35 [PATCH v2 0/7] implement async block discards/etc. via io_uring Pavel Begunkov
  2024-08-22  3:35 ` [PATCH v2 1/7] io_uring/cmd: expose iowq to cmds Pavel Begunkov
  2024-08-22  3:35 ` [PATCH v2 2/7] io_uring/cmd: give inline space in request " Pavel Begunkov
@ 2024-08-22  3:35 ` Pavel Begunkov
  2024-08-22  6:37   ` Christoph Hellwig
  2024-08-22  3:35 ` [PATCH v2 4/7] block: introduce blk_validate_write() Pavel Begunkov
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 20+ messages in thread
From: Pavel Begunkov @ 2024-08-22  3:35 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence, Conrad Meyer, linux-block, linux-mm

kiocb_invalidate_pages() is useful for the write path, however not
everything is backed by kiocb and we want to reuse the function for bio
based discard implementation. Extract and and reuse a new helper called
filemap_invalidate_pages(), which takes a argument indicating whether it
should be non-blocking and might return -EAGAIN.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 include/linux/pagemap.h |  2 ++
 mm/filemap.c            | 18 +++++++++++++-----
 2 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index d9c7edb6422b..e39c3a7ce33c 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -32,6 +32,8 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 		pgoff_t start, pgoff_t end);
 int kiocb_invalidate_pages(struct kiocb *iocb, size_t count);
 void kiocb_invalidate_post_direct_write(struct kiocb *iocb, size_t count);
+int filemap_invalidate_pages(struct address_space *mapping,
+			     loff_t pos, loff_t end, bool nowait);
 
 int write_inode_now(struct inode *, int sync);
 int filemap_fdatawrite(struct address_space *);
diff --git a/mm/filemap.c b/mm/filemap.c
index d62150418b91..74baec119239 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2712,14 +2712,12 @@ int kiocb_write_and_wait(struct kiocb *iocb, size_t count)
 }
 EXPORT_SYMBOL_GPL(kiocb_write_and_wait);
 
-int kiocb_invalidate_pages(struct kiocb *iocb, size_t count)
+int filemap_invalidate_pages(struct address_space *mapping,
+			     loff_t pos, loff_t end, bool nowait)
 {
-	struct address_space *mapping = iocb->ki_filp->f_mapping;
-	loff_t pos = iocb->ki_pos;
-	loff_t end = pos + count - 1;
 	int ret;
 
-	if (iocb->ki_flags & IOCB_NOWAIT) {
+	if (nowait) {
 		/* we could block if there are any pages in the range */
 		if (filemap_range_has_page(mapping, pos, end))
 			return -EAGAIN;
@@ -2738,6 +2736,16 @@ int kiocb_invalidate_pages(struct kiocb *iocb, size_t count)
 	return invalidate_inode_pages2_range(mapping, pos >> PAGE_SHIFT,
 					     end >> PAGE_SHIFT);
 }
+
+int kiocb_invalidate_pages(struct kiocb *iocb, size_t count)
+{
+	struct address_space *mapping = iocb->ki_filp->f_mapping;
+	loff_t pos = iocb->ki_pos;
+	loff_t end = pos + count - 1;
+
+	return filemap_invalidate_pages(mapping, pos, end,
+					iocb->ki_flags & IOCB_NOWAIT);
+}
 EXPORT_SYMBOL_GPL(kiocb_invalidate_pages);
 
 /**
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v2 4/7] block: introduce blk_validate_write()
  2024-08-22  3:35 [PATCH v2 0/7] implement async block discards/etc. via io_uring Pavel Begunkov
                   ` (2 preceding siblings ...)
  2024-08-22  3:35 ` [PATCH v2 3/7] filemap: introduce filemap_invalidate_pages Pavel Begunkov
@ 2024-08-22  3:35 ` Pavel Begunkov
  2024-08-22  6:33   ` Christoph Hellwig
  2024-08-22  3:35 ` [PATCH v2 5/7] block: implement async discard as io_uring cmd Pavel Begunkov
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 20+ messages in thread
From: Pavel Begunkov @ 2024-08-22  3:35 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence, Conrad Meyer, linux-block, linux-mm

In preparation to further changes extract a helper function out of
blk_ioctl_discard() that validates if it's allowed to do a write-like
operation for the given range.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 block/ioctl.c | 47 +++++++++++++++++++++++++++++------------------
 1 file changed, 29 insertions(+), 18 deletions(-)

diff --git a/block/ioctl.c b/block/ioctl.c
index e8e4a4190f18..8df0bc8002f5 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -92,38 +92,49 @@ static int compat_blkpg_ioctl(struct block_device *bdev,
 }
 #endif
 
+static int blk_validate_write(struct block_device *bdev, blk_mode_t mode,
+			      uint64_t start, uint64_t len)
+{
+	unsigned int bs_mask;
+	uint64_t end;
+
+	if (!(mode & BLK_OPEN_WRITE))
+		return -EBADF;
+	if (bdev_read_only(bdev))
+		return -EPERM;
+
+	bs_mask = bdev_logical_block_size(bdev) - 1;
+	if ((start | len) & bs_mask)
+		return -EINVAL;
+	if (!len)
+		return -EINVAL;
+
+	if (check_add_overflow(start, len, &end) || end > bdev_nr_bytes(bdev))
+		return -EINVAL;
+
+	return 0;
+}
+
 static int blk_ioctl_discard(struct block_device *bdev, blk_mode_t mode,
 		unsigned long arg)
 {
-	unsigned int bs_mask = bdev_logical_block_size(bdev) - 1;
-	uint64_t range[2], start, len, end;
+	uint64_t range[2], start, len;
 	struct bio *prev = NULL, *bio;
 	sector_t sector, nr_sects;
 	struct blk_plug plug;
 	int err;
 
-	if (!(mode & BLK_OPEN_WRITE))
-		return -EBADF;
-
-	if (!bdev_max_discard_sectors(bdev))
-		return -EOPNOTSUPP;
-	if (bdev_read_only(bdev))
-		return -EPERM;
-
 	if (copy_from_user(range, (void __user *)arg, sizeof(range)))
 		return -EFAULT;
-
 	start = range[0];
 	len = range[1];
 
-	if (!len)
-		return -EINVAL;
-	if ((start | len) & bs_mask)
-		return -EINVAL;
+	if (!bdev_max_discard_sectors(bdev))
+		return -EOPNOTSUPP;
 
-	if (check_add_overflow(start, len, &end) ||
-	    end > bdev_nr_bytes(bdev))
-		return -EINVAL;
+	err = blk_validate_write(bdev, mode, start, len);
+	if (err)
+		return err;
 
 	filemap_invalidate_lock(bdev->bd_mapping);
 	err = truncate_bdev_range(bdev, mode, start, start + len - 1);
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v2 5/7] block: implement async discard as io_uring cmd
  2024-08-22  3:35 [PATCH v2 0/7] implement async block discards/etc. via io_uring Pavel Begunkov
                   ` (3 preceding siblings ...)
  2024-08-22  3:35 ` [PATCH v2 4/7] block: introduce blk_validate_write() Pavel Begunkov
@ 2024-08-22  3:35 ` Pavel Begunkov
  2024-08-22  6:46   ` Christoph Hellwig
  2024-08-22  3:35 ` [PATCH v2 6/7] block: implement async wire write zeroes Pavel Begunkov
  2024-08-22  3:35 ` [PATCH v2 7/7] block: implement async secure erase Pavel Begunkov
  6 siblings, 1 reply; 20+ messages in thread
From: Pavel Begunkov @ 2024-08-22  3:35 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence, Conrad Meyer, linux-block, linux-mm

io_uring allows to implement custom file specific operations via
fops->uring_cmd callback. Use it to wire up asynchronous discard
commands. Normally, first it tries to do a non-blocking issue, and if
fails we'd retry from a blocking context by returning -EAGAIN to
core io_uring.

Note, unlike ioctl(BLKDISCARD) with stronger guarantees against races,
we only do a best effort attempt to invalidate page cache, and it can
race with any writes and reads and leave page cache stale. It's the
same kind of races we allow to direct writes.

Suggested-by: Conrad Meyer <[email protected]>
Signed-off-by: Pavel Begunkov <[email protected]>
---
 block/blk.h             |   1 +
 block/fops.c            |   2 +
 block/ioctl.c           | 101 ++++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/fs.h |   2 +
 4 files changed, 106 insertions(+)

diff --git a/block/blk.h b/block/blk.h
index e180863f918b..5178c5ba6852 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -571,6 +571,7 @@ blk_mode_t file_to_blk_mode(struct file *file);
 int truncate_bdev_range(struct block_device *bdev, blk_mode_t mode,
 		loff_t lstart, loff_t lend);
 long blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg);
+int blkdev_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags);
 long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg);
 
 extern const struct address_space_operations def_blk_aops;
diff --git a/block/fops.c b/block/fops.c
index 9825c1713a49..8154b10b5abf 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -17,6 +17,7 @@
 #include <linux/fs.h>
 #include <linux/iomap.h>
 #include <linux/module.h>
+#include <linux/io_uring/cmd.h>
 #include "blk.h"
 
 static inline struct inode *bdev_file_inode(struct file *file)
@@ -873,6 +874,7 @@ const struct file_operations def_blk_fops = {
 	.splice_read	= filemap_splice_read,
 	.splice_write	= iter_file_splice_write,
 	.fallocate	= blkdev_fallocate,
+	.uring_cmd	= blkdev_uring_cmd,
 	.fop_flags	= FOP_BUFFER_RASYNC,
 };
 
diff --git a/block/ioctl.c b/block/ioctl.c
index 8df0bc8002f5..a9aaa7cb7f73 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -11,6 +11,8 @@
 #include <linux/blktrace_api.h>
 #include <linux/pr.h>
 #include <linux/uaccess.h>
+#include <linux/pagemap.h>
+#include <linux/io_uring/cmd.h>
 #include "blk.h"
 
 static int blkpg_do_ioctl(struct block_device *bdev,
@@ -745,3 +747,102 @@ long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg)
 	return ret;
 }
 #endif
+
+struct blk_cmd {
+	blk_status_t status;
+	bool nowait;
+};
+
+static void blk_cmd_complete(struct io_uring_cmd *cmd, unsigned int issue_flags)
+{
+	struct blk_cmd *bc = io_uring_cmd_to_pdu(cmd, struct blk_cmd);
+	int res = blk_status_to_errno(bc->status);
+
+	if (res == -EAGAIN && bc->nowait)
+		io_uring_cmd_issue_blocking(cmd);
+	else
+		io_uring_cmd_done(cmd, res, 0, issue_flags);
+}
+
+static void bio_cmd_end(struct bio *bio)
+{
+	struct io_uring_cmd *cmd = bio->bi_private;
+	struct blk_cmd *bc = io_uring_cmd_to_pdu(cmd, struct blk_cmd);
+
+	if (unlikely(bio->bi_status) && !bc->status)
+		bc->status = bio->bi_status;
+
+	io_uring_cmd_do_in_task_lazy(cmd, blk_cmd_complete);
+	bio_put(bio);
+}
+
+static int blkdev_cmd_discard(struct io_uring_cmd *cmd,
+			      struct block_device *bdev,
+			      uint64_t start, uint64_t len, bool nowait)
+{
+	sector_t sector = start >> SECTOR_SHIFT;
+	sector_t nr_sects = len >> SECTOR_SHIFT;
+	struct bio *prev = NULL, *bio;
+	int err;
+
+	if (!bdev_max_discard_sectors(bdev))
+		return -EOPNOTSUPP;
+
+	err = blk_validate_write(bdev, file_to_blk_mode(cmd->file), start, len);
+	if (err)
+		return err;
+	err = filemap_invalidate_pages(bdev->bd_mapping, start,
+					start + len - 1, nowait);
+	if (err)
+		return err;
+
+	while ((bio = blk_alloc_discard_bio(bdev, &sector, &nr_sects,
+					    GFP_KERNEL))) {
+		if (nowait) {
+			/*
+			 * Don't allow multi-bio non-blocking submissions as
+			 * subsequent bios may fail but we won't get direct
+			 * feedback about that. Normally, the caller should
+			 * retry from a blocking context.
+			 */
+			if (unlikely(nr_sects)) {
+				bio_put(bio);
+				return -EAGAIN;
+			}
+			bio->bi_opf |= REQ_NOWAIT;
+		}
+		prev = bio_chain_and_submit(prev, bio);
+	}
+	if (!prev)
+		return -EFAULT;
+
+	prev->bi_private = cmd;
+	prev->bi_end_io = bio_cmd_end;
+	submit_bio(prev);
+	return -EIOCBQUEUED;
+}
+
+int blkdev_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
+{
+	struct block_device *bdev = I_BDEV(cmd->file->f_mapping->host);
+	struct blk_cmd *bc = io_uring_cmd_to_pdu(cmd, struct blk_cmd);
+	const struct io_uring_sqe *sqe = cmd->sqe;
+	u32 cmd_op = cmd->cmd_op;
+	uint64_t start, len;
+
+	if (unlikely(sqe->ioprio || sqe->__pad1 || sqe->len ||
+		     sqe->rw_flags || sqe->file_index))
+		return -EINVAL;
+
+	bc->status = BLK_STS_OK;
+	bc->nowait = issue_flags & IO_URING_F_NONBLOCK;
+
+	start = READ_ONCE(sqe->addr);
+	len = READ_ONCE(sqe->addr3);
+
+	switch (cmd_op) {
+	case BLOCK_URING_CMD_DISCARD:
+		return blkdev_cmd_discard(cmd, bdev, start, len, bc->nowait);
+	}
+	return -EINVAL;
+}
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 753971770733..0016e38ed33c 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -208,6 +208,8 @@ struct fsxattr {
  * (see uapi/linux/blkzoned.h)
  */
 
+#define BLOCK_URING_CMD_DISCARD			0
+
 #define BMAP_IOCTL 1		/* obsolete - kept for compatibility */
 #define FIBMAP	   _IO(0x00,1)	/* bmap access */
 #define FIGETBSZ   _IO(0x00,2)	/* get the block size used for bmap */
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v2 6/7] block: implement async wire write zeroes
  2024-08-22  3:35 [PATCH v2 0/7] implement async block discards/etc. via io_uring Pavel Begunkov
                   ` (4 preceding siblings ...)
  2024-08-22  3:35 ` [PATCH v2 5/7] block: implement async discard as io_uring cmd Pavel Begunkov
@ 2024-08-22  3:35 ` Pavel Begunkov
  2024-08-22  6:50   ` Christoph Hellwig
  2024-08-22  3:35 ` [PATCH v2 7/7] block: implement async secure erase Pavel Begunkov
  6 siblings, 1 reply; 20+ messages in thread
From: Pavel Begunkov @ 2024-08-22  3:35 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence, Conrad Meyer, linux-block, linux-mm

Add another io_uring cmd for block layer implementing asynchronous write
zeroes. It reuses helpers we've added for async discards, and inherits
the code structure as well as all considerations in regards to page
cache races.

Suggested-by: Conrad Meyer <[email protected]>
Signed-off-by: Pavel Begunkov <[email protected]>
---
 block/ioctl.c           | 68 +++++++++++++++++++++++++++++++++++++++++
 include/linux/blkdev.h  |  4 +++
 include/uapi/linux/fs.h |  1 +
 3 files changed, 73 insertions(+)

diff --git a/block/ioctl.c b/block/ioctl.c
index a9aaa7cb7f73..6f0676f21e7b 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -776,6 +776,71 @@ static void bio_cmd_end(struct bio *bio)
 	bio_put(bio);
 }
 
+static int blkdev_queue_cmd(struct io_uring_cmd *cmd, struct block_device *bdev,
+			    uint64_t start, uint64_t len, sector_t limit,
+			    blk_opf_t opf)
+{
+	sector_t bs_mask = (bdev_logical_block_size(bdev) >> SECTOR_SHIFT) - 1;
+	sector_t sector = start >> SECTOR_SHIFT;
+	sector_t nr_sects = len >> SECTOR_SHIFT;
+	struct bio *prev = NULL, *bio;
+	int err;
+
+	if (!limit)
+		return -EOPNOTSUPP;
+
+	err = blk_validate_write(bdev, file_to_blk_mode(cmd->file), start, len);
+	if (err)
+		return err;
+
+	err = filemap_invalidate_pages(bdev->bd_mapping, start,
+					start + len - 1, opf & REQ_NOWAIT);
+	if (err)
+		return err;
+
+	limit = min(limit, (UINT_MAX >> SECTOR_SHIFT) & ~bs_mask);
+	while (nr_sects) {
+		sector_t bio_sects = min(nr_sects, limit);
+
+		/*
+		 * Don't allow multi-bio non-blocking submissions as subsequent
+		 * bios may fail but we won't get direct feedback about that.
+		 * Normally, the caller should retry from a blocking context.
+		 */
+		if ((opf & REQ_NOWAIT) && bio_sects != nr_sects)
+			return -EAGAIN;
+
+		bio = bio_alloc(bdev, 0, opf, GFP_KERNEL);
+		if (!bio)
+			break;
+		bio->bi_iter.bi_sector = sector;
+		bio->bi_iter.bi_size = bio_sects << SECTOR_SHIFT;
+		sector += bio_sects;
+		nr_sects -= bio_sects;
+
+		prev = bio_chain_and_submit(prev, bio);
+	}
+	if (!prev)
+		return -EFAULT;
+
+	prev->bi_private = cmd;
+	prev->bi_end_io = bio_cmd_end;
+	submit_bio(prev);
+	return -EIOCBQUEUED;
+}
+
+static int blkdev_cmd_write_zeroes(struct io_uring_cmd *cmd,
+				   struct block_device *bdev,
+				   uint64_t start, uint64_t len, bool nowait)
+{
+	blk_opf_t opf = REQ_OP_WRITE_ZEROES | REQ_NOUNMAP;
+
+	if (nowait)
+		opf |= REQ_NOWAIT;
+	return blkdev_queue_cmd(cmd, bdev, start, len,
+				bdev_write_zeroes_sectors(bdev), opf);
+}
+
 static int blkdev_cmd_discard(struct io_uring_cmd *cmd,
 			      struct block_device *bdev,
 			      uint64_t start, uint64_t len, bool nowait)
@@ -843,6 +908,9 @@ int blkdev_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
 	switch (cmd_op) {
 	case BLOCK_URING_CMD_DISCARD:
 		return blkdev_cmd_discard(cmd, bdev, start, len, bc->nowait);
+	case BLOCK_URING_CMD_WRITE_ZEROES:
+		return blkdev_cmd_write_zeroes(cmd, bdev, start, len,
+					       bc->nowait);
 	}
 	return -EINVAL;
 }
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index e85ec73a07d5..82bbe1e3e278 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1095,6 +1095,10 @@ int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 int blkdev_issue_secure_erase(struct block_device *bdev, sector_t sector,
 		sector_t nr_sects, gfp_t gfp);
 
+struct bio *blk_alloc_write_zeroes_bio(struct block_device *bdev,
+					sector_t *sector, sector_t *nr_sects,
+					gfp_t gfp_mask);
+
 #define BLKDEV_ZERO_NOUNMAP	(1 << 0)  /* do not free blocks */
 #define BLKDEV_ZERO_NOFALLBACK	(1 << 1)  /* don't write explicit zeroes */
 #define BLKDEV_ZERO_KILLABLE	(1 << 2)  /* interruptible by fatal signals */
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 0016e38ed33c..b9e20ce57a28 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -209,6 +209,7 @@ struct fsxattr {
  */
 
 #define BLOCK_URING_CMD_DISCARD			0
+#define BLOCK_URING_CMD_WRITE_ZEROES		1
 
 #define BMAP_IOCTL 1		/* obsolete - kept for compatibility */
 #define FIBMAP	   _IO(0x00,1)	/* bmap access */
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v2 7/7] block: implement async secure erase
  2024-08-22  3:35 [PATCH v2 0/7] implement async block discards/etc. via io_uring Pavel Begunkov
                   ` (5 preceding siblings ...)
  2024-08-22  3:35 ` [PATCH v2 6/7] block: implement async wire write zeroes Pavel Begunkov
@ 2024-08-22  3:35 ` Pavel Begunkov
  2024-08-22  6:36   ` Christoph Hellwig
  6 siblings, 1 reply; 20+ messages in thread
From: Pavel Begunkov @ 2024-08-22  3:35 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence, Conrad Meyer, linux-block, linux-mm

Add yet another io_uring cmd implementing async secure erases.
It has same page cache races as async discards and write zeroes and
reuses the common paths in general.

Suggested-by: Conrad Meyer <[email protected]>
Signed-off-by: Pavel Begunkov <[email protected]>
---
 block/ioctl.c           | 15 +++++++++++++++
 include/uapi/linux/fs.h |  1 +
 2 files changed, 16 insertions(+)

diff --git a/block/ioctl.c b/block/ioctl.c
index 6f0676f21e7b..ab8bab6ee806 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -841,6 +841,18 @@ static int blkdev_cmd_write_zeroes(struct io_uring_cmd *cmd,
 				bdev_write_zeroes_sectors(bdev), opf);
 }
 
+static int blkdev_cmd_secure_erase(struct io_uring_cmd *cmd,
+				   struct block_device *bdev,
+				   uint64_t start, uint64_t len, bool nowait)
+{
+	blk_opf_t opf = REQ_OP_SECURE_ERASE;
+
+	if (nowait)
+		opf |= REQ_NOWAIT;
+	return blkdev_queue_cmd(cmd, bdev, start, len,
+				bdev_max_secure_erase_sectors(bdev), opf);
+}
+
 static int blkdev_cmd_discard(struct io_uring_cmd *cmd,
 			      struct block_device *bdev,
 			      uint64_t start, uint64_t len, bool nowait)
@@ -911,6 +923,9 @@ int blkdev_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
 	case BLOCK_URING_CMD_WRITE_ZEROES:
 		return blkdev_cmd_write_zeroes(cmd, bdev, start, len,
 					       bc->nowait);
+	case BLOCK_URING_CMD_SECURE_ERASE:
+		return blkdev_cmd_secure_erase(cmd, bdev, start, len,
+					       bc->nowait);
 	}
 	return -EINVAL;
 }
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index b9e20ce57a28..425957589bdf 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -210,6 +210,7 @@ struct fsxattr {
 
 #define BLOCK_URING_CMD_DISCARD			0
 #define BLOCK_URING_CMD_WRITE_ZEROES		1
+#define BLOCK_URING_CMD_SECURE_ERASE		2
 
 #define BMAP_IOCTL 1		/* obsolete - kept for compatibility */
 #define FIBMAP	   _IO(0x00,1)	/* bmap access */
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 4/7] block: introduce blk_validate_write()
  2024-08-22  3:35 ` [PATCH v2 4/7] block: introduce blk_validate_write() Pavel Begunkov
@ 2024-08-22  6:33   ` Christoph Hellwig
  2024-08-22 12:36     ` Pavel Begunkov
  0 siblings, 1 reply; 20+ messages in thread
From: Christoph Hellwig @ 2024-08-22  6:33 UTC (permalink / raw)
  To: Pavel Begunkov; +Cc: io-uring, Jens Axboe, Conrad Meyer, linux-block, linux-mm

On Thu, Aug 22, 2024 at 04:35:54AM +0100, Pavel Begunkov wrote:
> In preparation to further changes extract a helper function out of
> blk_ioctl_discard() that validates if it's allowed to do a write-like
> operation for the given range.

This isn't about a write, it is about a discard.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 7/7] block: implement async secure erase
  2024-08-22  3:35 ` [PATCH v2 7/7] block: implement async secure erase Pavel Begunkov
@ 2024-08-22  6:36   ` Christoph Hellwig
  2024-08-22 12:36     ` Pavel Begunkov
  0 siblings, 1 reply; 20+ messages in thread
From: Christoph Hellwig @ 2024-08-22  6:36 UTC (permalink / raw)
  To: Pavel Begunkov; +Cc: io-uring, Jens Axboe, Conrad Meyer, linux-block, linux-mm

On Thu, Aug 22, 2024 at 04:35:57AM +0100, Pavel Begunkov wrote:
> Add yet another io_uring cmd implementing async secure erases.
> It has same page cache races as async discards and write zeroes and
> reuses the common paths in general.

How did you test this?  Asking because my interruption fix for secure
discard is still pending because I could not find a test environment
for it.  And also because for a "secure" [1] erase proper invalidation
of the page cache actually matters, as otherwise you can still leak
data to the device.

[2] LBA based "secure" erase can't actually be secure on any storage
device that can write out of place.  Which is all of flash, but also
modern HDDs when encountering error.  It's a really bad interface
we should never have have started to support.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 3/7] filemap: introduce filemap_invalidate_pages
  2024-08-22  3:35 ` [PATCH v2 3/7] filemap: introduce filemap_invalidate_pages Pavel Begunkov
@ 2024-08-22  6:37   ` Christoph Hellwig
  0 siblings, 0 replies; 20+ messages in thread
From: Christoph Hellwig @ 2024-08-22  6:37 UTC (permalink / raw)
  To: Pavel Begunkov; +Cc: io-uring, Jens Axboe, Conrad Meyer, linux-block, linux-mm

On Thu, Aug 22, 2024 at 04:35:53AM +0100, Pavel Begunkov wrote:
> kiocb_invalidate_pages() is useful for the write path, however not
> everything is backed by kiocb and we want to reuse the function for bio
> based discard implementation. Extract and and reuse a new helper called
> filemap_invalidate_pages(), which takes a argument indicating whether it
> should be non-blocking and might return -EAGAIN.
> 
> Signed-off-by: Pavel Begunkov <[email protected]>
> ---
>  include/linux/pagemap.h |  2 ++
>  mm/filemap.c            | 18 +++++++++++++-----
>  2 files changed, 15 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index d9c7edb6422b..e39c3a7ce33c 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -32,6 +32,8 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
>  		pgoff_t start, pgoff_t end);
>  int kiocb_invalidate_pages(struct kiocb *iocb, size_t count);
>  void kiocb_invalidate_post_direct_write(struct kiocb *iocb, size_t count);
> +int filemap_invalidate_pages(struct address_space *mapping,
> +			     loff_t pos, loff_t end, bool nowait);
>  
>  int write_inode_now(struct inode *, int sync);
>  int filemap_fdatawrite(struct address_space *);
> diff --git a/mm/filemap.c b/mm/filemap.c
> index d62150418b91..74baec119239 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2712,14 +2712,12 @@ int kiocb_write_and_wait(struct kiocb *iocb, size_t count)
>  }
>  EXPORT_SYMBOL_GPL(kiocb_write_and_wait);
>  
> -int kiocb_invalidate_pages(struct kiocb *iocb, size_t count)
> +int filemap_invalidate_pages(struct address_space *mapping,
> +			     loff_t pos, loff_t end, bool nowait)
>  {
> -	struct address_space *mapping = iocb->ki_filp->f_mapping;
> -	loff_t pos = iocb->ki_pos;
> -	loff_t end = pos + count - 1;
>  	int ret;
>  
> -	if (iocb->ki_flags & IOCB_NOWAIT) {
> +	if (nowait) {
>  		/* we could block if there are any pages in the range */
>  		if (filemap_range_has_page(mapping, pos, end))
>  			return -EAGAIN;
> @@ -2738,6 +2736,16 @@ int kiocb_invalidate_pages(struct kiocb *iocb, size_t count)
>  	return invalidate_inode_pages2_range(mapping, pos >> PAGE_SHIFT,
>  					     end >> PAGE_SHIFT);
>  }
> +
> +int kiocb_invalidate_pages(struct kiocb *iocb, size_t count)
> +{
> +	struct address_space *mapping = iocb->ki_filp->f_mapping;
> +	loff_t pos = iocb->ki_pos;
> +	loff_t end = pos + count - 1;
> +
> +	return filemap_invalidate_pages(mapping, pos, end,

No real need for the end variable here.  And maybe not for pos either.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 5/7] block: implement async discard as io_uring cmd
  2024-08-22  3:35 ` [PATCH v2 5/7] block: implement async discard as io_uring cmd Pavel Begunkov
@ 2024-08-22  6:46   ` Christoph Hellwig
  2024-08-22 13:07     ` Pavel Begunkov
  0 siblings, 1 reply; 20+ messages in thread
From: Christoph Hellwig @ 2024-08-22  6:46 UTC (permalink / raw)
  To: Pavel Begunkov; +Cc: io-uring, Jens Axboe, Conrad Meyer, linux-block, linux-mm

On Thu, Aug 22, 2024 at 04:35:55AM +0100, Pavel Begunkov wrote:
> io_uring allows to implement custom file specific operations via
> fops->uring_cmd callback. Use it to wire up asynchronous discard
> commands. Normally, first it tries to do a non-blocking issue, and if
> fails we'd retry from a blocking context by returning -EAGAIN to
> core io_uring.
> 
> Note, unlike ioctl(BLKDISCARD) with stronger guarantees against races,
> we only do a best effort attempt to invalidate page cache, and it can
> race with any writes and reads and leave page cache stale. It's the
> same kind of races we allow to direct writes.

Can you please write up a man page for this that clear documents the
expecvted semantics?

> +static void bio_cmd_end(struct bio *bio)

This is really weird function name.  blk_cmd_end_io or
blk_cmd_bio_end_io would be the usual choices.

> +	while ((bio = blk_alloc_discard_bio(bdev, &sector, &nr_sects,
> +					    GFP_KERNEL))) {

GFP_KERNEL can often will block.  You'll probably want a GFP_NOWAIT
allocation here for the nowait case.

> +		if (nowait) {
> +			/*
> +			 * Don't allow multi-bio non-blocking submissions as
> +			 * subsequent bios may fail but we won't get direct
> +			 * feedback about that. Normally, the caller should
> +			 * retry from a blocking context.
> +			 */
> +			if (unlikely(nr_sects)) {
> +				bio_put(bio);
> +				return -EAGAIN;
> +			}
> +			bio->bi_opf |= REQ_NOWAIT;
> +		}

And this really looks weird.  It first allocates a bio, potentially
blocking, and then gives up?  I think you're much better off with
something like:

	if (nowait) {
		if (nr_sects > bio_discard_limit(bdev, sector))
			return -EAGAIN;
		bio = blk_alloc_discard_bio(bdev, &sector, &nr_sects,
						    GFP_NOWAIT);
		if (!bio)
			return -EAGAIN
		bio->bi_opf |= REQ_NOWAIT;
		goto submit;
	}

	/* submission loop here */

> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index 753971770733..0016e38ed33c 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -208,6 +208,8 @@ struct fsxattr {
>   * (see uapi/linux/blkzoned.h)
>   */
>  
> +#define BLOCK_URING_CMD_DISCARD			0

Is fs.h the reight place for this?

Curious:  how to we deal with conflicting uring cmds on different
device and how do we probe for them?  The NVMe uring_cmds
use the ioctl-style _IO* encoding which at least helps a bit with
that and which seem like a good idea.  Maybe someone needs to write
up a few lose rules on uring commands?


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 6/7] block: implement async wire write zeroes
  2024-08-22  3:35 ` [PATCH v2 6/7] block: implement async wire write zeroes Pavel Begunkov
@ 2024-08-22  6:50   ` Christoph Hellwig
  2024-08-22 13:09     ` Pavel Begunkov
  0 siblings, 1 reply; 20+ messages in thread
From: Christoph Hellwig @ 2024-08-22  6:50 UTC (permalink / raw)
  To: Pavel Begunkov; +Cc: io-uring, Jens Axboe, Conrad Meyer, linux-block, linux-mm

On Thu, Aug 22, 2024 at 04:35:56AM +0100, Pavel Begunkov wrote:
> Add another io_uring cmd for block layer implementing asynchronous write
> zeroes. It reuses helpers we've added for async discards, and inherits
> the code structure as well as all considerations in regards to page
> cache races.

Most comments from discard apply here as well.

> +static int blkdev_queue_cmd(struct io_uring_cmd *cmd, struct block_device *bdev,
> +			    uint64_t start, uint64_t len, sector_t limit,
> +			    blk_opf_t opf)

This feels a little over generic as it doesn't just queue any random
command.

> +static int blkdev_cmd_write_zeroes(struct io_uring_cmd *cmd,
> +				   struct block_device *bdev,
> +				   uint64_t start, uint64_t len, bool nowait)
> +{
> +	blk_opf_t opf = REQ_OP_WRITE_ZEROES | REQ_NOUNMAP;
> +
> +	if (nowait)
> +		opf |= REQ_NOWAIT;
> +	return blkdev_queue_cmd(cmd, bdev, start, len,
> +				bdev_write_zeroes_sectors(bdev), opf);

So no support for fallback to Write of zero page here?  That's probably
the case where the async offload is needed most.

> +struct bio *blk_alloc_write_zeroes_bio(struct block_device *bdev,
> +					sector_t *sector, sector_t *nr_sects,
> +					gfp_t gfp_mask);

Please keep this in block/blk.h, no need to expose it to the entire
kernel.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 4/7] block: introduce blk_validate_write()
  2024-08-22  6:33   ` Christoph Hellwig
@ 2024-08-22 12:36     ` Pavel Begunkov
  2024-08-23 11:52       ` Christoph Hellwig
  0 siblings, 1 reply; 20+ messages in thread
From: Pavel Begunkov @ 2024-08-22 12:36 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: io-uring, Jens Axboe, Conrad Meyer, linux-block, linux-mm

On 8/22/24 07:33, Christoph Hellwig wrote:
> On Thu, Aug 22, 2024 at 04:35:54AM +0100, Pavel Begunkov wrote:
>> In preparation to further changes extract a helper function out of
>> blk_ioctl_discard() that validates if it's allowed to do a write-like
>> operation for the given range.
> 
> This isn't about a write, it is about a discard.

It's used for other commands in the series, all of them are
semantically "writes", or modifying data operations if that's
better. How would you call it? Some blk_modify_validate_args,
maybe?

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 7/7] block: implement async secure erase
  2024-08-22  6:36   ` Christoph Hellwig
@ 2024-08-22 12:36     ` Pavel Begunkov
  0 siblings, 0 replies; 20+ messages in thread
From: Pavel Begunkov @ 2024-08-22 12:36 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: io-uring, Jens Axboe, Conrad Meyer, linux-block, linux-mm

On 8/22/24 07:36, Christoph Hellwig wrote:
> On Thu, Aug 22, 2024 at 04:35:57AM +0100, Pavel Begunkov wrote:
>> Add yet another io_uring cmd implementing async secure erases.
>> It has same page cache races as async discards and write zeroes and
>> reuses the common paths in general.
> 
> How did you test this?  Asking because my interruption fix for secure

In short I didn't, exactly because it's a challenge finding
anything that supports it. At least the change here is minimal,
but ...

> discard is still pending because I could not find a test environment
> for it.  And also because for a "secure" [1] erase proper invalidation
> of the page cache actually matters, as otherwise you can still leak
> data to the device.
> 
> [2] LBA based "secure" erase can't actually be secure on any storage
> device that can write out of place.  Which is all of flash, but also
> modern HDDs when encountering error.  It's a really bad interface
> we should never have have started to support.

... I should just drop it, not like async support makes much
sense.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 5/7] block: implement async discard as io_uring cmd
  2024-08-22  6:46   ` Christoph Hellwig
@ 2024-08-22 13:07     ` Pavel Begunkov
  2024-08-23 11:59       ` Christoph Hellwig
  0 siblings, 1 reply; 20+ messages in thread
From: Pavel Begunkov @ 2024-08-22 13:07 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: io-uring, Jens Axboe, Conrad Meyer, linux-block, linux-mm

On 8/22/24 07:46, Christoph Hellwig wrote:
> On Thu, Aug 22, 2024 at 04:35:55AM +0100, Pavel Begunkov wrote:
>> io_uring allows to implement custom file specific operations via
>> fops->uring_cmd callback. Use it to wire up asynchronous discard
>> commands. Normally, first it tries to do a non-blocking issue, and if
>> fails we'd retry from a blocking context by returning -EAGAIN to
>> core io_uring.
>>
>> Note, unlike ioctl(BLKDISCARD) with stronger guarantees against races,
>> we only do a best effort attempt to invalidate page cache, and it can
>> race with any writes and reads and leave page cache stale. It's the
>> same kind of races we allow to direct writes.
> 
> Can you please write up a man page for this that clear documents the
> expecvted semantics?

Do we have it documented anywhere how O_DIRECT writes interact
with page cache, so I can refer to it?

  
>> +static void bio_cmd_end(struct bio *bio)
> 
> This is really weird function name.  blk_cmd_end_io or
> blk_cmd_bio_end_io would be the usual choices.

Will change with other cosmetics.


>> +	while ((bio = blk_alloc_discard_bio(bdev, &sector, &nr_sects,
>> +					    GFP_KERNEL))) {
> 
> GFP_KERNEL can often will block.  You'll probably want a GFP_NOWAIT
> allocation here for the nowait case.

I can change it for clarity, but I don't think it's much of a concern
since the read/write path and pretty sure a bunch of other places never
cared about it. It does the main thing, propagating it down e.g. for
tag allocation.


>> +		if (nowait) {
>> +			/*
>> +			 * Don't allow multi-bio non-blocking submissions as
>> +			 * subsequent bios may fail but we won't get direct
>> +			 * feedback about that. Normally, the caller should
>> +			 * retry from a blocking context.
>> +			 */
>> +			if (unlikely(nr_sects)) {
>> +				bio_put(bio);
>> +				return -EAGAIN;
>> +			}
>> +			bio->bi_opf |= REQ_NOWAIT;
>> +		}
> 
> And this really looks weird.  It first allocates a bio, potentially

That's what the write path does.

> blocking, and then gives up?  I think you're much better off with
> something like:

I'd rather avoid calling bio_discard_limit() an extra time, it does
too much stuff inside, when the expected case is a single bio and
for multi-bio that overhead would really matter.

Maybe I should uniline blk_alloc_discard_bio() and dedup it with
write zeroes, I'll see if can be done after other write zeroes
changes.

> 
> 	if (nowait) {
> 		if (nr_sects > bio_discard_limit(bdev, sector))
> 			return -EAGAIN;
> 		bio = blk_alloc_discard_bio(bdev, &sector, &nr_sects,
> 						    GFP_NOWAIT);
> 		if (!bio)
> 			return -EAGAIN
> 		bio->bi_opf |= REQ_NOWAIT;
> 		goto submit;
> 	}
> 
> 	/* submission loop here */
> 
>> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
>> index 753971770733..0016e38ed33c 100644
>> --- a/include/uapi/linux/fs.h
>> +++ b/include/uapi/linux/fs.h
>> @@ -208,6 +208,8 @@ struct fsxattr {
>>    * (see uapi/linux/blkzoned.h)
>>    */
>>   
>> +#define BLOCK_URING_CMD_DISCARD			0
> 
> Is fs.h the reight place for this?

Arguable, but I can move it to io_uring, makes things simpler
for me.

> Curious:  how to we deal with conflicting uring cmds on different
> device and how do we probe for them?  The NVMe uring_cmds
> use the ioctl-style _IO* encoding which at least helps a bit with
> that and which seem like a good idea.  Maybe someone needs to write
> up a few lose rules on uring commands?

My concern is that we're sacrificing compiler optimisations
(well, jump tables are disabled IIRC) for something that doesn't even
guarantee uniqueness. I'd like to see some degree of reflection,
like user querying a file class in terms of what operations it
supports, but that's beyond the scope of the series.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 6/7] block: implement async wire write zeroes
  2024-08-22  6:50   ` Christoph Hellwig
@ 2024-08-22 13:09     ` Pavel Begunkov
  0 siblings, 0 replies; 20+ messages in thread
From: Pavel Begunkov @ 2024-08-22 13:09 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: io-uring, Jens Axboe, Conrad Meyer, linux-block, linux-mm

On 8/22/24 07:50, Christoph Hellwig wrote:
> On Thu, Aug 22, 2024 at 04:35:56AM +0100, Pavel Begunkov wrote:
>> Add another io_uring cmd for block layer implementing asynchronous write
>> zeroes. It reuses helpers we've added for async discards, and inherits
>> the code structure as well as all considerations in regards to page
>> cache races.
> 
> Most comments from discard apply here as well.
> 
>> +static int blkdev_queue_cmd(struct io_uring_cmd *cmd, struct block_device *bdev,
>> +			    uint64_t start, uint64_t len, sector_t limit,
>> +			    blk_opf_t opf)
> 
> This feels a little over generic as it doesn't just queue any random
> command.
> 
>> +static int blkdev_cmd_write_zeroes(struct io_uring_cmd *cmd,
>> +				   struct block_device *bdev,
>> +				   uint64_t start, uint64_t len, bool nowait)
>> +{
>> +	blk_opf_t opf = REQ_OP_WRITE_ZEROES | REQ_NOUNMAP;
>> +
>> +	if (nowait)
>> +		opf |= REQ_NOWAIT;
>> +	return blkdev_queue_cmd(cmd, bdev, start, len,
>> +				bdev_write_zeroes_sectors(bdev), opf);
> 
> So no support for fallback to Write of zero page here?  That's probably
> the case where the async offload is needed most.

Fair enough, but I believe it's more reasonable to have a separate
cmd for it instead of silently falling back.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 4/7] block: introduce blk_validate_write()
  2024-08-22 12:36     ` Pavel Begunkov
@ 2024-08-23 11:52       ` Christoph Hellwig
  0 siblings, 0 replies; 20+ messages in thread
From: Christoph Hellwig @ 2024-08-23 11:52 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: Christoph Hellwig, io-uring, Jens Axboe, Conrad Meyer,
	linux-block, linux-mm

On Thu, Aug 22, 2024 at 01:36:10PM +0100, Pavel Begunkov wrote:
> It's used for other commands in the series, all of them are
> semantically "writes", or modifying data operations if that's
> better. How would you call it? Some blk_modify_validate_args,
> maybe?

Maybe just split the writability checks out and rename the rest
blk_validate_byte_range, that way the usage is pretty clearly
defined.  Bonus points for writing a kerneldoc other other
comment header explaining it.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 5/7] block: implement async discard as io_uring cmd
  2024-08-22 13:07     ` Pavel Begunkov
@ 2024-08-23 11:59       ` Christoph Hellwig
  2024-09-04 14:08         ` Pavel Begunkov
  0 siblings, 1 reply; 20+ messages in thread
From: Christoph Hellwig @ 2024-08-23 11:59 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: Christoph Hellwig, io-uring, Jens Axboe, Conrad Meyer,
	linux-block, linux-mm, dchinner

On Thu, Aug 22, 2024 at 02:07:16PM +0100, Pavel Begunkov wrote:
> > > Note, unlike ioctl(BLKDISCARD) with stronger guarantees against races,
> > > we only do a best effort attempt to invalidate page cache, and it can
> > > race with any writes and reads and leave page cache stale. It's the
> > > same kind of races we allow to direct writes.
> > 
> > Can you please write up a man page for this that clear documents the
> > expecvted semantics?
> 
> Do we have it documented anywhere how O_DIRECT writes interact
> with page cache, so I can refer to it?

I can't find a good writeup.  Adding Dave as he tends to do long
emails on topic like this so he might have one hiding somewhere.

> > GFP_KERNEL can often will block.  You'll probably want a GFP_NOWAIT
> > allocation here for the nowait case.
> 
> I can change it for clarity, but I don't think it's much of a concern
> since the read/write path and pretty sure a bunch of other places never
> cared about it. It does the main thing, propagating it down e.g. for
> tag allocation.

True, we're only doing the nowait allocation for larger data
structures.  Which is a bit odd indeed.

> I'd rather avoid calling bio_discard_limit() an extra time, it does
> too much stuff inside, when the expected case is a single bio and
> for multi-bio that overhead would really matter.

Compared to a memory allocation it's not really doing all the much.
In the long run we really should move splitting discard bios down
the stack like we do for normal I/O anyway.

> Maybe I should uniline blk_alloc_discard_bio() and dedup it with

uniline?  I read that as unіnline, but as it's not inline I don't
understand what you mean either.

> > > +#define BLOCK_URING_CMD_DISCARD			0
> > 
> > Is fs.h the reight place for this?
> 
> Arguable, but I can move it to io_uring, makes things simpler
> for me.

I would have expected a uapi/linux/blkdev.h for it (and I'm kinda
surprised we don't have that yet).

> 
> > Curious:  how to we deal with conflicting uring cmds on different
> > device and how do we probe for them?  The NVMe uring_cmds
> > use the ioctl-style _IO* encoding which at least helps a bit with
> > that and which seem like a good idea.  Maybe someone needs to write
> > up a few lose rules on uring commands?
> 
> My concern is that we're sacrificing compiler optimisations
> (well, jump tables are disabled IIRC) for something that doesn't even
> guarantee uniqueness. I'd like to see some degree of reflection,
> like user querying a file class in terms of what operations it
> supports, but that's beyond the scope of the series.

We can't guaranteed uniqueness, but between the class, the direction,
and the argument size we get a pretty good one.  There is a reason
pretty much all ioctls added in the last 25 years are using this scheme.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 5/7] block: implement async discard as io_uring cmd
  2024-08-23 11:59       ` Christoph Hellwig
@ 2024-09-04 14:08         ` Pavel Begunkov
  0 siblings, 0 replies; 20+ messages in thread
From: Pavel Begunkov @ 2024-09-04 14:08 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: io-uring, Jens Axboe, Conrad Meyer, linux-block, linux-mm,
	dchinner

On 8/23/24 12:59, Christoph Hellwig wrote:
> On Thu, Aug 22, 2024 at 02:07:16PM +0100, Pavel Begunkov wrote:
>>>> Note, unlike ioctl(BLKDISCARD) with stronger guarantees against races,
>>>> we only do a best effort attempt to invalidate page cache, and it can
>>>> race with any writes and reads and leave page cache stale. It's the
>>>> same kind of races we allow to direct writes.
>>>
>>> Can you please write up a man page for this that clear documents the
>>> expecvted semantics?
>>
>> Do we have it documented anywhere how O_DIRECT writes interact
>> with page cache, so I can refer to it?
> 
> I can't find a good writeup.  Adding Dave as he tends to do long
> emails on topic like this so he might have one hiding somewhere.
> 
>>> GFP_KERNEL can often will block.  You'll probably want a GFP_NOWAIT
>>> allocation here for the nowait case.
>>
>> I can change it for clarity, but I don't think it's much of a concern
>> since the read/write path and pretty sure a bunch of other places never
>> cared about it. It does the main thing, propagating it down e.g. for
>> tag allocation.
> 
> True, we're only doing the nowait allocation for larger data
> structures.  Which is a bit odd indeed.

That's widespread, last time I looked into it no amount of patching
saved io_uring and tasks being killed by the oom reaper under memory
pressure.

>> I'd rather avoid calling bio_discard_limit() an extra time, it does
>> too much stuff inside, when the expected case is a single bio and
>> for multi-bio that overhead would really matter.
> 
> Compared to a memory allocation it's not really doing all the much.
> In the long run we really should move splitting discard bios down
> the stack like we do for normal I/O anyway.
> 
>> Maybe I should uniline blk_alloc_discard_bio() and dedup it with
> 
> uniline?  I read that as unіnline, but as it's not inline I don't
> understand what you mean either.

"Hand code" if you wish, but you can just ignore it


>>>> +#define BLOCK_URING_CMD_DISCARD			0
>>>
>>> Is fs.h the reight place for this?
>>
>> Arguable, but I can move it to io_uring, makes things simpler
>> for me.
> 
> I would have expected a uapi/linux/blkdev.h for it (and I'm kinda
> surprised we don't have that yet).

I think that would be overkill, we don't need it for just these
commands, and it's only adds pain with probing the header with
autotools or so. If there is a future vision for it I'd say we
can drop a patch on top.

>>> Curious:  how to we deal with conflicting uring cmds on different
>>> device and how do we probe for them?  The NVMe uring_cmds
>>> use the ioctl-style _IO* encoding which at least helps a bit with
>>> that and which seem like a good idea.  Maybe someone needs to write
>>> up a few lose rules on uring commands?
>>
>> My concern is that we're sacrificing compiler optimisations
>> (well, jump tables are disabled IIRC) for something that doesn't even
>> guarantee uniqueness. I'd like to see some degree of reflection,
>> like user querying a file class in terms of what operations it
>> supports, but that's beyond the scope of the series.
> 
> We can't guaranteed uniqueness, but between the class, the direction,
> and the argument size we get a pretty good one.  There is a reason
> pretty much all ioctls added in the last 25 years are using this scheme.

which is likely because some people insisted on it and not because
the scheme is so great that everyone became acolytes. Not to mention
only 256 possible "types" and the endless mess of sharing them and
trying to find a range to use. I'll convert to have less headache,
but either way we're just propagating the problem into the future.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2024-09-04 14:08 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-08-22  3:35 [PATCH v2 0/7] implement async block discards/etc. via io_uring Pavel Begunkov
2024-08-22  3:35 ` [PATCH v2 1/7] io_uring/cmd: expose iowq to cmds Pavel Begunkov
2024-08-22  3:35 ` [PATCH v2 2/7] io_uring/cmd: give inline space in request " Pavel Begunkov
2024-08-22  3:35 ` [PATCH v2 3/7] filemap: introduce filemap_invalidate_pages Pavel Begunkov
2024-08-22  6:37   ` Christoph Hellwig
2024-08-22  3:35 ` [PATCH v2 4/7] block: introduce blk_validate_write() Pavel Begunkov
2024-08-22  6:33   ` Christoph Hellwig
2024-08-22 12:36     ` Pavel Begunkov
2024-08-23 11:52       ` Christoph Hellwig
2024-08-22  3:35 ` [PATCH v2 5/7] block: implement async discard as io_uring cmd Pavel Begunkov
2024-08-22  6:46   ` Christoph Hellwig
2024-08-22 13:07     ` Pavel Begunkov
2024-08-23 11:59       ` Christoph Hellwig
2024-09-04 14:08         ` Pavel Begunkov
2024-08-22  3:35 ` [PATCH v2 6/7] block: implement async wire write zeroes Pavel Begunkov
2024-08-22  6:50   ` Christoph Hellwig
2024-08-22 13:09     ` Pavel Begunkov
2024-08-22  3:35 ` [PATCH v2 7/7] block: implement async secure erase Pavel Begunkov
2024-08-22  6:36   ` Christoph Hellwig
2024-08-22 12:36     ` Pavel Begunkov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox