public inbox for [email protected]
 help / color / mirror / Atom feed
* [RFC 0/5] implement asynchronous BLKDISCARD via io_uring
@ 2024-08-14 10:45 Pavel Begunkov
  2024-08-14 10:45 ` [RFC 1/5] io_uring/cmd: expose iowq to cmds Pavel Begunkov
                   ` (6 more replies)
  0 siblings, 7 replies; 26+ messages in thread
From: Pavel Begunkov @ 2024-08-14 10:45 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence, Conrad Meyer, linux-block, linux-mm

There is an interest in having asynchronous block operations like
discard. The patch set implements that as io_uring commands, which is
an io_uring request type allowing to implement custom file specific
operations.

First 4 patches are simple preps, and the main part is in Patch 5.
Not tested with a real drive yet, hence sending as an RFC.

I'm also going to add BLKDISCARDZEROES and BLKSECDISCARD, which should
reuse structures and helpers from Patch 5.

liburing tests for reference:

https://github.com/isilence/liburing.git discard-cmd-test

Pavel Begunkov (5):
  io_uring/cmd: expose iowq to cmds
  io_uring/cmd: give inline space in request to cmds
  filemap: introduce filemap_invalidate_pages
  block: introduce blk_validate_discard()
  block: implement io_uring discard cmd

 block/blk.h                  |   1 +
 block/fops.c                 |   2 +
 block/ioctl.c                | 139 ++++++++++++++++++++++++++++++-----
 include/linux/io_uring/cmd.h |  15 ++++
 include/linux/pagemap.h      |   2 +
 include/uapi/linux/fs.h      |   2 +
 io_uring/io_uring.c          |  11 +++
 io_uring/io_uring.h          |   1 +
 io_uring/uring_cmd.c         |   7 ++
 mm/filemap.c                 |  18 +++--
 10 files changed, 176 insertions(+), 22 deletions(-)

-- 
2.45.2


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [RFC 1/5] io_uring/cmd: expose iowq to cmds
  2024-08-14 10:45 [RFC 0/5] implement asynchronous BLKDISCARD via io_uring Pavel Begunkov
@ 2024-08-14 10:45 ` Pavel Begunkov
  2024-08-14 10:45 ` [RFC 2/5] io_uring/cmd: give inline space in request " Pavel Begunkov
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 26+ messages in thread
From: Pavel Begunkov @ 2024-08-14 10:45 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence, Conrad Meyer, linux-block, linux-mm

When an io_uring request needs blocking context we offload it to the
io_uring's thread pool called io-wq. We can get there off ->uring_cmd
by returning -EAGAIN, but there is no straightforward way of doing that
from an asynchronous callback. Add a helper that would transfer a
command to a blocking context.

Note, we do an extra hop via task_work before io_queue_iowq(), that's a
limitation of io_uring infra we have that can likely be lifted later
if that would ever become a problem.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 include/linux/io_uring/cmd.h |  6 ++++++
 io_uring/io_uring.c          | 11 +++++++++++
 io_uring/io_uring.h          |  1 +
 io_uring/uring_cmd.c         |  7 +++++++
 4 files changed, 25 insertions(+)

diff --git a/include/linux/io_uring/cmd.h b/include/linux/io_uring/cmd.h
index 447fbfd32215..86ceb3383e49 100644
--- a/include/linux/io_uring/cmd.h
+++ b/include/linux/io_uring/cmd.h
@@ -48,6 +48,9 @@ void __io_uring_cmd_do_in_task(struct io_uring_cmd *ioucmd,
 void io_uring_cmd_mark_cancelable(struct io_uring_cmd *cmd,
 		unsigned int issue_flags);
 
+/* Execute the request from a blocking context */
+void io_uring_cmd_issue_blocking(struct io_uring_cmd *ioucmd);
+
 #else
 static inline int io_uring_cmd_import_fixed(u64 ubuf, unsigned long len, int rw,
 			      struct iov_iter *iter, void *ioucmd)
@@ -67,6 +70,9 @@ static inline void io_uring_cmd_mark_cancelable(struct io_uring_cmd *cmd,
 		unsigned int issue_flags)
 {
 }
+static inline void io_uring_cmd_issue_blocking(struct io_uring_cmd *ioucmd)
+{
+}
 #endif
 
 /*
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 3942db160f18..da819018088d 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -533,6 +533,17 @@ static void io_queue_iowq(struct io_kiocb *req)
 		io_queue_linked_timeout(link);
 }
 
+static void io_req_queue_iowq_tw(struct io_kiocb *req, struct io_tw_state *ts)
+{
+	io_queue_iowq(req);
+}
+
+void io_req_queue_iowq(struct io_kiocb *req)
+{
+	req->io_task_work.func = io_req_queue_iowq_tw;
+	io_req_task_work_add(req);
+}
+
 static __cold void io_queue_deferred(struct io_ring_ctx *ctx)
 {
 	while (!list_empty(&ctx->defer_list)) {
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index c2acf6180845..2896587a5221 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -90,6 +90,7 @@ int io_uring_alloc_task_context(struct task_struct *task,
 
 int io_ring_add_registered_file(struct io_uring_task *tctx, struct file *file,
 				     int start, int end);
+void io_req_queue_iowq(struct io_kiocb *req);
 
 int io_poll_issue(struct io_kiocb *req, struct io_tw_state *ts);
 int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr);
diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
index 8391c7c7c1ec..39c3c816ec78 100644
--- a/io_uring/uring_cmd.c
+++ b/io_uring/uring_cmd.c
@@ -277,6 +277,13 @@ int io_uring_cmd_import_fixed(u64 ubuf, unsigned long len, int rw,
 }
 EXPORT_SYMBOL_GPL(io_uring_cmd_import_fixed);
 
+void io_uring_cmd_issue_blocking(struct io_uring_cmd *ioucmd)
+{
+	struct io_kiocb *req = cmd_to_io_kiocb(ioucmd);
+
+	io_req_queue_iowq(req);
+}
+
 static inline int io_uring_cmd_getsockopt(struct socket *sock,
 					  struct io_uring_cmd *cmd,
 					  unsigned int issue_flags)
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC 2/5] io_uring/cmd: give inline space in request to cmds
  2024-08-14 10:45 [RFC 0/5] implement asynchronous BLKDISCARD via io_uring Pavel Begunkov
  2024-08-14 10:45 ` [RFC 1/5] io_uring/cmd: expose iowq to cmds Pavel Begunkov
@ 2024-08-14 10:45 ` Pavel Begunkov
  2024-08-14 10:45 ` [RFC 3/5] filemap: introduce filemap_invalidate_pages Pavel Begunkov
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 26+ messages in thread
From: Pavel Begunkov @ 2024-08-14 10:45 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence, Conrad Meyer, linux-block, linux-mm

Some io_uring commands can use some inline space in io_kiocb. We have 32
bytes in struct io_uring_cmd, expose it.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 include/linux/io_uring/cmd.h | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/include/linux/io_uring/cmd.h b/include/linux/io_uring/cmd.h
index 86ceb3383e49..c189d36ad55e 100644
--- a/include/linux/io_uring/cmd.h
+++ b/include/linux/io_uring/cmd.h
@@ -23,6 +23,15 @@ static inline const void *io_uring_sqe_cmd(const struct io_uring_sqe *sqe)
 	return sqe->cmd;
 }
 
+static inline void io_uring_cmd_private_sz_check(size_t cmd_sz)
+{
+	BUILD_BUG_ON(cmd_sz > sizeof_field(struct io_uring_cmd, pdu));
+}
+#define io_uring_cmd_to_pdu(cmd, pdu_type) ( \
+	io_uring_cmd_private_sz_check(sizeof(pdu_type)), \
+	((pdu_type *)&(cmd)->pdu) \
+)
+
 #if defined(CONFIG_IO_URING)
 int io_uring_cmd_import_fixed(u64 ubuf, unsigned long len, int rw,
 			      struct iov_iter *iter, void *ioucmd);
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC 3/5] filemap: introduce filemap_invalidate_pages
  2024-08-14 10:45 [RFC 0/5] implement asynchronous BLKDISCARD via io_uring Pavel Begunkov
  2024-08-14 10:45 ` [RFC 1/5] io_uring/cmd: expose iowq to cmds Pavel Begunkov
  2024-08-14 10:45 ` [RFC 2/5] io_uring/cmd: give inline space in request " Pavel Begunkov
@ 2024-08-14 10:45 ` Pavel Begunkov
  2024-08-14 10:45 ` [RFC 4/5] block: introduce blk_validate_discard() Pavel Begunkov
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 26+ messages in thread
From: Pavel Begunkov @ 2024-08-14 10:45 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence, Conrad Meyer, linux-block, linux-mm

kiocb_invalidate_pages() is useful for the write path, however not
everything is backed by kiocb and we want to reuse the function for bio
based discard implementation. Extract and and reuse a new helper called
filemap_invalidate_pages(), which takes a argument indicating whether it
should be non-blocking and might return -EAGAIN.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 include/linux/pagemap.h |  2 ++
 mm/filemap.c            | 18 +++++++++++++-----
 2 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index d9c7edb6422b..e39c3a7ce33c 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -32,6 +32,8 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 		pgoff_t start, pgoff_t end);
 int kiocb_invalidate_pages(struct kiocb *iocb, size_t count);
 void kiocb_invalidate_post_direct_write(struct kiocb *iocb, size_t count);
+int filemap_invalidate_pages(struct address_space *mapping,
+			     loff_t pos, loff_t end, bool nowait);
 
 int write_inode_now(struct inode *, int sync);
 int filemap_fdatawrite(struct address_space *);
diff --git a/mm/filemap.c b/mm/filemap.c
index d62150418b91..74baec119239 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2712,14 +2712,12 @@ int kiocb_write_and_wait(struct kiocb *iocb, size_t count)
 }
 EXPORT_SYMBOL_GPL(kiocb_write_and_wait);
 
-int kiocb_invalidate_pages(struct kiocb *iocb, size_t count)
+int filemap_invalidate_pages(struct address_space *mapping,
+			     loff_t pos, loff_t end, bool nowait)
 {
-	struct address_space *mapping = iocb->ki_filp->f_mapping;
-	loff_t pos = iocb->ki_pos;
-	loff_t end = pos + count - 1;
 	int ret;
 
-	if (iocb->ki_flags & IOCB_NOWAIT) {
+	if (nowait) {
 		/* we could block if there are any pages in the range */
 		if (filemap_range_has_page(mapping, pos, end))
 			return -EAGAIN;
@@ -2738,6 +2736,16 @@ int kiocb_invalidate_pages(struct kiocb *iocb, size_t count)
 	return invalidate_inode_pages2_range(mapping, pos >> PAGE_SHIFT,
 					     end >> PAGE_SHIFT);
 }
+
+int kiocb_invalidate_pages(struct kiocb *iocb, size_t count)
+{
+	struct address_space *mapping = iocb->ki_filp->f_mapping;
+	loff_t pos = iocb->ki_pos;
+	loff_t end = pos + count - 1;
+
+	return filemap_invalidate_pages(mapping, pos, end,
+					iocb->ki_flags & IOCB_NOWAIT);
+}
 EXPORT_SYMBOL_GPL(kiocb_invalidate_pages);
 
 /**
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC 4/5] block: introduce blk_validate_discard()
  2024-08-14 10:45 [RFC 0/5] implement asynchronous BLKDISCARD via io_uring Pavel Begunkov
                   ` (2 preceding siblings ...)
  2024-08-14 10:45 ` [RFC 3/5] filemap: introduce filemap_invalidate_pages Pavel Begunkov
@ 2024-08-14 10:45 ` Pavel Begunkov
  2024-08-14 10:45 ` [RFC 5/5] block: implement io_uring discard cmd Pavel Begunkov
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 26+ messages in thread
From: Pavel Begunkov @ 2024-08-14 10:45 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence, Conrad Meyer, linux-block, linux-mm

In preparation to further changes extract a helper function out of
blk_ioctl_discard() that validates discard arguments.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 block/ioctl.c | 45 ++++++++++++++++++++++++++++-----------------
 1 file changed, 28 insertions(+), 17 deletions(-)

diff --git a/block/ioctl.c b/block/ioctl.c
index e8e4a4190f18..c7a3e6c6f5fa 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -92,39 +92,50 @@ static int compat_blkpg_ioctl(struct block_device *bdev,
 }
 #endif
 
-static int blk_ioctl_discard(struct block_device *bdev, blk_mode_t mode,
-		unsigned long arg)
+static int blk_validate_discard(struct block_device *bdev, blk_mode_t mode,
+				uint64_t start, uint64_t len)
 {
-	unsigned int bs_mask = bdev_logical_block_size(bdev) - 1;
-	uint64_t range[2], start, len, end;
-	struct bio *prev = NULL, *bio;
-	sector_t sector, nr_sects;
-	struct blk_plug plug;
-	int err;
+	unsigned int bs_mask;
+	uint64_t end;
 
 	if (!(mode & BLK_OPEN_WRITE))
 		return -EBADF;
-
 	if (!bdev_max_discard_sectors(bdev))
 		return -EOPNOTSUPP;
 	if (bdev_read_only(bdev))
 		return -EPERM;
 
-	if (copy_from_user(range, (void __user *)arg, sizeof(range)))
-		return -EFAULT;
-
-	start = range[0];
-	len = range[1];
-
-	if (!len)
-		return -EINVAL;
+	bs_mask = bdev_logical_block_size(bdev) - 1;
 	if ((start | len) & bs_mask)
 		return -EINVAL;
+	if (!len)
+		return -EINVAL;
 
 	if (check_add_overflow(start, len, &end) ||
 	    end > bdev_nr_bytes(bdev))
 		return -EINVAL;
 
+	return 0;
+}
+
+static int blk_ioctl_discard(struct block_device *bdev, blk_mode_t mode,
+		unsigned long arg)
+{
+	uint64_t range[2], start, len;
+	struct bio *prev = NULL, *bio;
+	sector_t sector, nr_sects;
+	struct blk_plug plug;
+	int err;
+
+	if (copy_from_user(range, (void __user *)arg, sizeof(range)))
+		return -EFAULT;
+	start = range[0];
+	len = range[1];
+
+	err = blk_validate_discard(bdev, mode, start, len);
+	if (err)
+		return err;
+
 	filemap_invalidate_lock(bdev->bd_mapping);
 	err = truncate_bdev_range(bdev, mode, start, start + len - 1);
 	if (err)
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC 5/5] block: implement io_uring discard cmd
  2024-08-14 10:45 [RFC 0/5] implement asynchronous BLKDISCARD via io_uring Pavel Begunkov
                   ` (3 preceding siblings ...)
  2024-08-14 10:45 ` [RFC 4/5] block: introduce blk_validate_discard() Pavel Begunkov
@ 2024-08-14 10:45 ` Pavel Begunkov
  2024-08-15  1:42   ` Ming Lei
  2024-08-15 14:42   ` Jens Axboe
  2024-08-15 15:50 ` [RFC 0/5] implement asynchronous BLKDISCARD via io_uring Jens Axboe
  2024-08-15 16:15 ` Martin K. Petersen
  6 siblings, 2 replies; 26+ messages in thread
From: Pavel Begunkov @ 2024-08-14 10:45 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence, Conrad Meyer, linux-block, linux-mm

Add ->uring_cmd callback for block device files and use it to implement
asynchronous discard. Normally, it first tries to execute the command
from non-blocking context, which we limit to a single bio because
otherwise one of sub-bios may need to wait for other bios, and we don't
want to deal with partial IO. If non-blocking attempt fails, we'll retry
it in a blocking context.

Suggested-by: Conrad Meyer <[email protected]>
Signed-off-by: Pavel Begunkov <[email protected]>
---
 block/blk.h             |  1 +
 block/fops.c            |  2 +
 block/ioctl.c           | 94 +++++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/fs.h |  2 +
 4 files changed, 99 insertions(+)

diff --git a/block/blk.h b/block/blk.h
index e180863f918b..5178c5ba6852 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -571,6 +571,7 @@ blk_mode_t file_to_blk_mode(struct file *file);
 int truncate_bdev_range(struct block_device *bdev, blk_mode_t mode,
 		loff_t lstart, loff_t lend);
 long blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg);
+int blkdev_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags);
 long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg);
 
 extern const struct address_space_operations def_blk_aops;
diff --git a/block/fops.c b/block/fops.c
index 9825c1713a49..8154b10b5abf 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -17,6 +17,7 @@
 #include <linux/fs.h>
 #include <linux/iomap.h>
 #include <linux/module.h>
+#include <linux/io_uring/cmd.h>
 #include "blk.h"
 
 static inline struct inode *bdev_file_inode(struct file *file)
@@ -873,6 +874,7 @@ const struct file_operations def_blk_fops = {
 	.splice_read	= filemap_splice_read,
 	.splice_write	= iter_file_splice_write,
 	.fallocate	= blkdev_fallocate,
+	.uring_cmd	= blkdev_uring_cmd,
 	.fop_flags	= FOP_BUFFER_RASYNC,
 };
 
diff --git a/block/ioctl.c b/block/ioctl.c
index c7a3e6c6f5fa..f7f9c4c6d6b5 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -11,6 +11,8 @@
 #include <linux/blktrace_api.h>
 #include <linux/pr.h>
 #include <linux/uaccess.h>
+#include <linux/pagemap.h>
+#include <linux/io_uring/cmd.h>
 #include "blk.h"
 
 static int blkpg_do_ioctl(struct block_device *bdev,
@@ -744,4 +746,96 @@ long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg)
 
 	return ret;
 }
+
+struct blk_cmd {
+	blk_status_t status;
+	bool nowait;
+};
+
+static void blk_cmd_complete(struct io_uring_cmd *cmd, unsigned int issue_flags)
+{
+	struct blk_cmd *bc = io_uring_cmd_to_pdu(cmd, struct blk_cmd);
+	int res = blk_status_to_errno(bc->status);
+
+	if (res == -EAGAIN && bc->nowait)
+		io_uring_cmd_issue_blocking(cmd);
+	else
+		io_uring_cmd_done(cmd, res, 0, issue_flags);
+}
+
+static void bio_cmd_end(struct bio *bio)
+{
+	struct io_uring_cmd *cmd = bio->bi_private;
+	struct blk_cmd *bc = io_uring_cmd_to_pdu(cmd, struct blk_cmd);
+
+	if (unlikely(bio->bi_status) && !bc->status)
+		bc->status = bio->bi_status;
+
+	io_uring_cmd_do_in_task_lazy(cmd, blk_cmd_complete);
+	bio_put(bio);
+}
+
+static int blkdev_cmd_discard(struct io_uring_cmd *cmd,
+			      struct block_device *bdev,
+			      uint64_t start, uint64_t len, bool nowait)
+{
+	sector_t sector = start >> SECTOR_SHIFT;
+	sector_t nr_sects = len >> SECTOR_SHIFT;
+	struct bio *prev = NULL, *bio;
+	int err;
+
+	err = blk_validate_discard(bdev, file_to_blk_mode(cmd->file),
+				   start, len);
+	if (err)
+		return err;
+	err = filemap_invalidate_pages(bdev->bd_mapping, start,
+					start + len - 1, nowait);
+	if (err)
+		return err;
+
+	while ((bio = blk_alloc_discard_bio(bdev, &sector, &nr_sects,
+					    GFP_KERNEL))) {
+		if (nowait) {
+			if (unlikely(nr_sects)) {
+				bio_put(bio);
+				return -EAGAIN;
+			}
+			bio->bi_opf |= REQ_NOWAIT;
+		}
+		prev = bio_chain_and_submit(prev, bio);
+	}
+	if (!prev)
+		return -EFAULT;
+
+	prev->bi_private = cmd;
+	prev->bi_end_io = bio_cmd_end;
+	submit_bio(prev);
+	return -EIOCBQUEUED;
+}
+
+int blkdev_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
+{
+	struct block_device *bdev = I_BDEV(cmd->file->f_mapping->host);
+	struct blk_cmd *bc = io_uring_cmd_to_pdu(cmd, struct blk_cmd);
+	const struct io_uring_sqe *sqe = cmd->sqe;
+	u32 cmd_op = cmd->cmd_op;
+	uint64_t start, len;
+
+	if (unlikely(sqe->ioprio || sqe->__pad1 || sqe->len ||
+		     sqe->rw_flags || sqe->file_index))
+		return -EINVAL;
+
+	bc->status = BLK_STS_OK;
+	bc->nowait = issue_flags & IO_URING_F_NONBLOCK;
+
+	start = READ_ONCE(sqe->addr);
+	len = READ_ONCE(sqe->addr3);
+
+	switch (cmd_op) {
+	case BLOCK_URING_CMD_DISCARD:
+		return blkdev_cmd_discard(cmd, bdev, start, len, bc->nowait);
+	}
+	return -EINVAL;
+}
+
 #endif
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 753971770733..0016e38ed33c 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -208,6 +208,8 @@ struct fsxattr {
  * (see uapi/linux/blkzoned.h)
  */
 
+#define BLOCK_URING_CMD_DISCARD			0
+
 #define BMAP_IOCTL 1		/* obsolete - kept for compatibility */
 #define FIBMAP	   _IO(0x00,1)	/* bmap access */
 #define FIGETBSZ   _IO(0x00,2)	/* get the block size used for bmap */
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [RFC 5/5] block: implement io_uring discard cmd
  2024-08-14 10:45 ` [RFC 5/5] block: implement io_uring discard cmd Pavel Begunkov
@ 2024-08-15  1:42   ` Ming Lei
  2024-08-15 14:33     ` Jens Axboe
  2024-08-15 14:42   ` Jens Axboe
  1 sibling, 1 reply; 26+ messages in thread
From: Ming Lei @ 2024-08-15  1:42 UTC (permalink / raw)
  To: Pavel Begunkov; +Cc: io-uring, Jens Axboe, Conrad Meyer, linux-block, linux-mm

On Wed, Aug 14, 2024 at 6:46 PM Pavel Begunkov <[email protected]> wrote:
>
> Add ->uring_cmd callback for block device files and use it to implement
> asynchronous discard. Normally, it first tries to execute the command
> from non-blocking context, which we limit to a single bio because
> otherwise one of sub-bios may need to wait for other bios, and we don't
> want to deal with partial IO. If non-blocking attempt fails, we'll retry
> it in a blocking context.
>
> Suggested-by: Conrad Meyer <[email protected]>
> Signed-off-by: Pavel Begunkov <[email protected]>
> ---
>  block/blk.h             |  1 +
>  block/fops.c            |  2 +
>  block/ioctl.c           | 94 +++++++++++++++++++++++++++++++++++++++++
>  include/uapi/linux/fs.h |  2 +
>  4 files changed, 99 insertions(+)
>
> diff --git a/block/blk.h b/block/blk.h
> index e180863f918b..5178c5ba6852 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -571,6 +571,7 @@ blk_mode_t file_to_blk_mode(struct file *file);
>  int truncate_bdev_range(struct block_device *bdev, blk_mode_t mode,
>                 loff_t lstart, loff_t lend);
>  long blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg);
> +int blkdev_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags);
>  long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg);
>
>  extern const struct address_space_operations def_blk_aops;
> diff --git a/block/fops.c b/block/fops.c
> index 9825c1713a49..8154b10b5abf 100644
> --- a/block/fops.c
> +++ b/block/fops.c
> @@ -17,6 +17,7 @@
>  #include <linux/fs.h>
>  #include <linux/iomap.h>
>  #include <linux/module.h>
> +#include <linux/io_uring/cmd.h>
>  #include "blk.h"
>
>  static inline struct inode *bdev_file_inode(struct file *file)
> @@ -873,6 +874,7 @@ const struct file_operations def_blk_fops = {
>         .splice_read    = filemap_splice_read,
>         .splice_write   = iter_file_splice_write,
>         .fallocate      = blkdev_fallocate,
> +       .uring_cmd      = blkdev_uring_cmd,

Just be curious, we have IORING_OP_FALLOCATE already for sending
discard to block device, why is .uring_cmd added for this purpose?

Thanks,


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC 5/5] block: implement io_uring discard cmd
  2024-08-15  1:42   ` Ming Lei
@ 2024-08-15 14:33     ` Jens Axboe
  2024-08-15 17:11       ` Pavel Begunkov
  0 siblings, 1 reply; 26+ messages in thread
From: Jens Axboe @ 2024-08-15 14:33 UTC (permalink / raw)
  To: Ming Lei, Pavel Begunkov; +Cc: io-uring, Conrad Meyer, linux-block, linux-mm

On 8/14/24 7:42 PM, Ming Lei wrote:
> On Wed, Aug 14, 2024 at 6:46?PM Pavel Begunkov <[email protected]> wrote:
>>
>> Add ->uring_cmd callback for block device files and use it to implement
>> asynchronous discard. Normally, it first tries to execute the command
>> from non-blocking context, which we limit to a single bio because
>> otherwise one of sub-bios may need to wait for other bios, and we don't
>> want to deal with partial IO. If non-blocking attempt fails, we'll retry
>> it in a blocking context.
>>
>> Suggested-by: Conrad Meyer <[email protected]>
>> Signed-off-by: Pavel Begunkov <[email protected]>
>> ---
>>  block/blk.h             |  1 +
>>  block/fops.c            |  2 +
>>  block/ioctl.c           | 94 +++++++++++++++++++++++++++++++++++++++++
>>  include/uapi/linux/fs.h |  2 +
>>  4 files changed, 99 insertions(+)
>>
>> diff --git a/block/blk.h b/block/blk.h
>> index e180863f918b..5178c5ba6852 100644
>> --- a/block/blk.h
>> +++ b/block/blk.h
>> @@ -571,6 +571,7 @@ blk_mode_t file_to_blk_mode(struct file *file);
>>  int truncate_bdev_range(struct block_device *bdev, blk_mode_t mode,
>>                 loff_t lstart, loff_t lend);
>>  long blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg);
>> +int blkdev_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags);
>>  long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg);
>>
>>  extern const struct address_space_operations def_blk_aops;
>> diff --git a/block/fops.c b/block/fops.c
>> index 9825c1713a49..8154b10b5abf 100644
>> --- a/block/fops.c
>> +++ b/block/fops.c
>> @@ -17,6 +17,7 @@
>>  #include <linux/fs.h>
>>  #include <linux/iomap.h>
>>  #include <linux/module.h>
>> +#include <linux/io_uring/cmd.h>
>>  #include "blk.h"
>>
>>  static inline struct inode *bdev_file_inode(struct file *file)
>> @@ -873,6 +874,7 @@ const struct file_operations def_blk_fops = {
>>         .splice_read    = filemap_splice_read,
>>         .splice_write   = iter_file_splice_write,
>>         .fallocate      = blkdev_fallocate,
>> +       .uring_cmd      = blkdev_uring_cmd,
> 
> Just be curious, we have IORING_OP_FALLOCATE already for sending
> discard to block device, why is .uring_cmd added for this purpose?

I think wiring up a bdev uring_cmd makes sense, because:

1) The existing FALLOCATE op is using vfs_fallocate, which is inherently
   sync and hence always punted to io-wq.

2) There will most certainly be other async ops that would be
   interesting to add, at which point we'd need it anyway.

3) It arguably makes more sense to have a direct discard op than use
   fallocate for this, if working on a raw bdev.

And probably others...

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC 5/5] block: implement io_uring discard cmd
  2024-08-14 10:45 ` [RFC 5/5] block: implement io_uring discard cmd Pavel Begunkov
  2024-08-15  1:42   ` Ming Lei
@ 2024-08-15 14:42   ` Jens Axboe
  1 sibling, 0 replies; 26+ messages in thread
From: Jens Axboe @ 2024-08-15 14:42 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring; +Cc: Conrad Meyer, linux-block, linux-mm

On 8/14/24 4:45 AM, Pavel Begunkov wrote:
> diff --git a/block/ioctl.c b/block/ioctl.c
> index c7a3e6c6f5fa..f7f9c4c6d6b5 100644
> --- a/block/ioctl.c
> +++ b/block/ioctl.c
> @@ -11,6 +11,8 @@
>  #include <linux/blktrace_api.h>
>  #include <linux/pr.h>
>  #include <linux/uaccess.h>
> +#include <linux/pagemap.h>
> +#include <linux/io_uring/cmd.h>
>  #include "blk.h"
>  
>  static int blkpg_do_ioctl(struct block_device *bdev,
> @@ -744,4 +746,96 @@ long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg)
>  
>  	return ret;
>  }
> +
> +struct blk_cmd {
> +	blk_status_t status;
> +	bool nowait;
> +};
> +
> +static void blk_cmd_complete(struct io_uring_cmd *cmd, unsigned int issue_flags)
> +{
> +	struct blk_cmd *bc = io_uring_cmd_to_pdu(cmd, struct blk_cmd);
> +	int res = blk_status_to_errno(bc->status);
> +
> +	if (res == -EAGAIN && bc->nowait)
> +		io_uring_cmd_issue_blocking(cmd);
> +	else
> +		io_uring_cmd_done(cmd, res, 0, issue_flags);
> +}
> +
> +static void bio_cmd_end(struct bio *bio)
> +{
> +	struct io_uring_cmd *cmd = bio->bi_private;
> +	struct blk_cmd *bc = io_uring_cmd_to_pdu(cmd, struct blk_cmd);
> +
> +	if (unlikely(bio->bi_status) && !bc->status)
> +		bc->status = bio->bi_status;
> +
> +	io_uring_cmd_do_in_task_lazy(cmd, blk_cmd_complete);
> +	bio_put(bio);
> +}
> +
> +static int blkdev_cmd_discard(struct io_uring_cmd *cmd,
> +			      struct block_device *bdev,
> +			      uint64_t start, uint64_t len, bool nowait)
> +{
> +	sector_t sector = start >> SECTOR_SHIFT;
> +	sector_t nr_sects = len >> SECTOR_SHIFT;
> +	struct bio *prev = NULL, *bio;
> +	int err;
> +
> +	err = blk_validate_discard(bdev, file_to_blk_mode(cmd->file),
> +				   start, len);
> +	if (err)
> +		return err;
> +	err = filemap_invalidate_pages(bdev->bd_mapping, start,
> +					start + len - 1, nowait);
> +	if (err)
> +		return err;
> +
> +	while ((bio = blk_alloc_discard_bio(bdev, &sector, &nr_sects,
> +					    GFP_KERNEL))) {
> +		if (nowait) {
> +			if (unlikely(nr_sects)) {
> +				bio_put(bio);
> +				return -EAGAIN;
> +			}
> +			bio->bi_opf |= REQ_NOWAIT;
> +		}
> +		prev = bio_chain_and_submit(prev, bio);
> +	}
> +	if (!prev)
> +		return -EFAULT;
> +
> +	prev->bi_private = cmd;
> +	prev->bi_end_io = bio_cmd_end;
> +	submit_bio(prev);
> +	return -EIOCBQUEUED;
> +}
> +
> +int blkdev_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
> +{
> +	struct block_device *bdev = I_BDEV(cmd->file->f_mapping->host);
> +	struct blk_cmd *bc = io_uring_cmd_to_pdu(cmd, struct blk_cmd);
> +	const struct io_uring_sqe *sqe = cmd->sqe;
> +	u32 cmd_op = cmd->cmd_op;
> +	uint64_t start, len;
> +
> +	if (unlikely(sqe->ioprio || sqe->__pad1 || sqe->len ||
> +		     sqe->rw_flags || sqe->file_index))
> +		return -EINVAL;
> +
> +	bc->status = BLK_STS_OK;
> +	bc->nowait = issue_flags & IO_URING_F_NONBLOCK;
> +
> +	start = READ_ONCE(sqe->addr);
> +	len = READ_ONCE(sqe->addr3);
> +
> +	switch (cmd_op) {
> +	case BLOCK_URING_CMD_DISCARD:
> +		return blkdev_cmd_discard(cmd, bdev, start, len, bc->nowait);
> +	}
> +	return -EINVAL;
> +}

This is inside the CONFIG_COMPAT section, which doesn't seem right.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC 0/5] implement asynchronous BLKDISCARD via io_uring
  2024-08-14 10:45 [RFC 0/5] implement asynchronous BLKDISCARD via io_uring Pavel Begunkov
                   ` (4 preceding siblings ...)
  2024-08-14 10:45 ` [RFC 5/5] block: implement io_uring discard cmd Pavel Begunkov
@ 2024-08-15 15:50 ` Jens Axboe
  2024-08-15 17:26   ` Pavel Begunkov
  2024-08-15 16:15 ` Martin K. Petersen
  6 siblings, 1 reply; 26+ messages in thread
From: Jens Axboe @ 2024-08-15 15:50 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring; +Cc: Conrad Meyer, linux-block, linux-mm

On 8/14/24 4:45 AM, Pavel Begunkov wrote:
> There is an interest in having asynchronous block operations like
> discard. The patch set implements that as io_uring commands, which is
> an io_uring request type allowing to implement custom file specific
> operations.
> 
> First 4 patches are simple preps, and the main part is in Patch 5.
> Not tested with a real drive yet, hence sending as an RFC.
> 
> I'm also going to add BLKDISCARDZEROES and BLKSECDISCARD, which should
> reuse structures and helpers from Patch 5.
> 
> liburing tests for reference:
> 
> https://github.com/isilence/liburing.git discard-cmd-test

FWIW, did a quick patch to wire it up for fio. Using this
job file:

[trim]
filename=/dev/nvme31n1
ioengine=io_uring
iodepth=64
rw=randtrim
norandommap
bs=4k

the stock kernel gets:

  trim: IOPS=21.6k, BW=84.4MiB/s (88.5MB/s)(847MiB/10036msec); 0 zone resets

using ~5% CPU, and with the process predictably stuck in D state all of
the time, waiting on a sync trim.

With the patches:

  trim: IOPS=75.8k, BW=296MiB/s (310MB/s)(2653MiB/8961msec); 0 zone resets

using ~11% CPU.

Didn't verify actual functionality, but did check trims are going up and
down. Drive used is:

Dell NVMe PM1743 RI E3.S 7.68TB

Outside of async trim being useful for actual workloads rather than the
sync trim we have now, it'll also help characterizing real world mixed
workloads that have trims with reads/writes.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC 0/5] implement asynchronous BLKDISCARD via io_uring
  2024-08-14 10:45 [RFC 0/5] implement asynchronous BLKDISCARD via io_uring Pavel Begunkov
                   ` (5 preceding siblings ...)
  2024-08-15 15:50 ` [RFC 0/5] implement asynchronous BLKDISCARD via io_uring Jens Axboe
@ 2024-08-15 16:15 ` Martin K. Petersen
  2024-08-15 17:12   ` Pavel Begunkov
  6 siblings, 1 reply; 26+ messages in thread
From: Martin K. Petersen @ 2024-08-15 16:15 UTC (permalink / raw)
  To: Pavel Begunkov; +Cc: io-uring, Jens Axboe, Conrad Meyer, linux-block, linux-mm


Hi Pavel!

> I'm also going to add BLKDISCARDZEROES and BLKSECDISCARD, which should
> reuse structures and helpers from Patch 5.

Adding these is going to be very useful.

Just a nit: Please use either ZEROOUT (or WRITE_ZEROES) and SECURE_ERASE
terminology for the additional operations when you add them.

DISCARDZEROES is obsolete and secure discard has been replaced by secure
erase. We are stuck with the legacy ioctl names but we should avoid
perpetuating unfortunate naming choices from the past.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC 5/5] block: implement io_uring discard cmd
  2024-08-15 14:33     ` Jens Axboe
@ 2024-08-15 17:11       ` Pavel Begunkov
  2024-08-15 23:44         ` Ming Lei
  0 siblings, 1 reply; 26+ messages in thread
From: Pavel Begunkov @ 2024-08-15 17:11 UTC (permalink / raw)
  To: Jens Axboe, Ming Lei; +Cc: io-uring, Conrad Meyer, linux-block, linux-mm

On 8/15/24 15:33, Jens Axboe wrote:
> On 8/14/24 7:42 PM, Ming Lei wrote:
>> On Wed, Aug 14, 2024 at 6:46?PM Pavel Begunkov <[email protected]> wrote:
>>>
>>> Add ->uring_cmd callback for block device files and use it to implement
>>> asynchronous discard. Normally, it first tries to execute the command
>>> from non-blocking context, which we limit to a single bio because
>>> otherwise one of sub-bios may need to wait for other bios, and we don't
>>> want to deal with partial IO. If non-blocking attempt fails, we'll retry
>>> it in a blocking context.
>>>
>>> Suggested-by: Conrad Meyer <[email protected]>
>>> Signed-off-by: Pavel Begunkov <[email protected]>
>>> ---
>>>   block/blk.h             |  1 +
>>>   block/fops.c            |  2 +
>>>   block/ioctl.c           | 94 +++++++++++++++++++++++++++++++++++++++++
>>>   include/uapi/linux/fs.h |  2 +
>>>   4 files changed, 99 insertions(+)
>>>
>>> diff --git a/block/blk.h b/block/blk.h
>>> index e180863f918b..5178c5ba6852 100644
>>> --- a/block/blk.h
>>> +++ b/block/blk.h
>>> @@ -571,6 +571,7 @@ blk_mode_t file_to_blk_mode(struct file *file);
>>>   int truncate_bdev_range(struct block_device *bdev, blk_mode_t mode,
>>>                  loff_t lstart, loff_t lend);
>>>   long blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg);
>>> +int blkdev_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags);
>>>   long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg);
>>>
>>>   extern const struct address_space_operations def_blk_aops;
>>> diff --git a/block/fops.c b/block/fops.c
>>> index 9825c1713a49..8154b10b5abf 100644
>>> --- a/block/fops.c
>>> +++ b/block/fops.c
>>> @@ -17,6 +17,7 @@
>>>   #include <linux/fs.h>
>>>   #include <linux/iomap.h>
>>>   #include <linux/module.h>
>>> +#include <linux/io_uring/cmd.h>
>>>   #include "blk.h"
>>>
>>>   static inline struct inode *bdev_file_inode(struct file *file)
>>> @@ -873,6 +874,7 @@ const struct file_operations def_blk_fops = {
>>>          .splice_read    = filemap_splice_read,
>>>          .splice_write   = iter_file_splice_write,
>>>          .fallocate      = blkdev_fallocate,
>>> +       .uring_cmd      = blkdev_uring_cmd,
>>
>> Just be curious, we have IORING_OP_FALLOCATE already for sending
>> discard to block device, why is .uring_cmd added for this purpose?

Which is a good question, I haven't thought about it, but I tend to
agree with Jens. Because vfs_fallocate is created synchronous
IORING_OP_FALLOCATE is slow for anything but pretty large requests.
Probably can be patched up, which would  involve changing the
fops->fallocate protot, but I'm not sure async there makes sense
outside of bdev (?), and cmd approach is simpler, can be made
somewhat more efficient (1 less layer in the way), and it's not
really something completely new since we have it in ioctl.


> I think wiring up a bdev uring_cmd makes sense, because:
> 
> 1) The existing FALLOCATE op is using vfs_fallocate, which is inherently
>     sync and hence always punted to io-wq.
> 
> 2) There will most certainly be other async ops that would be
>     interesting to add, at which point we'd need it anyway.
> 
> 3) It arguably makes more sense to have a direct discard op than use
>     fallocate for this, if working on a raw bdev.
> 
> And probably others...
> 

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC 0/5] implement asynchronous BLKDISCARD via io_uring
  2024-08-15 16:15 ` Martin K. Petersen
@ 2024-08-15 17:12   ` Pavel Begunkov
  0 siblings, 0 replies; 26+ messages in thread
From: Pavel Begunkov @ 2024-08-15 17:12 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: io-uring, Jens Axboe, Conrad Meyer, linux-block, linux-mm

On 8/15/24 17:15, Martin K. Petersen wrote:
> 
> Hi Pavel!
> 
>> I'm also going to add BLKDISCARDZEROES and BLKSECDISCARD, which should
>> reuse structures and helpers from Patch 5.
> 
> Adding these is going to be very useful.
> 
> Just a nit: Please use either ZEROOUT (or WRITE_ZEROES) and SECURE_ERASE
> terminology for the additional operations when you add them.
> 
> DISCARDZEROES is obsolete and secure discard has been replaced by secure
> erase. We are stuck with the legacy ioctl names but we should avoid
> perpetuating unfortunate naming choices from the past.

Noted, thanks, the ioctl names are indeed confusing.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC 0/5] implement asynchronous BLKDISCARD via io_uring
  2024-08-15 15:50 ` [RFC 0/5] implement asynchronous BLKDISCARD via io_uring Jens Axboe
@ 2024-08-15 17:26   ` Pavel Begunkov
  0 siblings, 0 replies; 26+ messages in thread
From: Pavel Begunkov @ 2024-08-15 17:26 UTC (permalink / raw)
  To: Jens Axboe, io-uring; +Cc: Conrad Meyer, linux-block, linux-mm

On 8/15/24 16:50, Jens Axboe wrote:
> On 8/14/24 4:45 AM, Pavel Begunkov wrote:
>> There is an interest in having asynchronous block operations like
>> discard. The patch set implements that as io_uring commands, which is
>> an io_uring request type allowing to implement custom file specific
>> operations.
>>
>> First 4 patches are simple preps, and the main part is in Patch 5.
>> Not tested with a real drive yet, hence sending as an RFC.
>>
>> I'm also going to add BLKDISCARDZEROES and BLKSECDISCARD, which should
>> reuse structures and helpers from Patch 5.
>>
>> liburing tests for reference:
>>
>> https://github.com/isilence/liburing.git discard-cmd-test
> 
> FWIW, did a quick patch to wire it up for fio. Using this
> job file:
> 
> [trim]
> filename=/dev/nvme31n1
> ioengine=io_uring
> iodepth=64
> rw=randtrim
> norandommap
> bs=4k
> 
> the stock kernel gets:
> 
>    trim: IOPS=21.6k, BW=84.4MiB/s (88.5MB/s)(847MiB/10036msec); 0 zone resets
> 
> using ~5% CPU, and with the process predictably stuck in D state all of
> the time, waiting on a sync trim.
> 
> With the patches:
> 
>    trim: IOPS=75.8k, BW=296MiB/s (310MB/s)(2653MiB/8961msec); 0 zone resets
> 
> using ~11% CPU.

Thanks for giving it a run. Can be further improved for particular use cases,
e.g. by adding a vectored version as an additional feature, i.e. multiple
discard ranges per request, and something like retry of short IO, but for
that we'd want getting -EAGAIN directly from submit_bio unlike failing via
the callback.


> Didn't verify actual functionality, but did check trims are going up and
> down. Drive used is:
> 
> Dell NVMe PM1743 RI E3.S 7.68TB
> 
> Outside of async trim being useful for actual workloads rather than the
> sync trim we have now, it'll also help characterizing real world mixed
> workloads that have trims with reads/writes.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC 5/5] block: implement io_uring discard cmd
  2024-08-15 17:11       ` Pavel Begunkov
@ 2024-08-15 23:44         ` Ming Lei
  2024-08-16  1:24           ` Jens Axboe
  0 siblings, 1 reply; 26+ messages in thread
From: Ming Lei @ 2024-08-15 23:44 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: Jens Axboe, io-uring, Conrad Meyer, linux-block, linux-mm,
	ming.lei

On Thu, Aug 15, 2024 at 06:11:13PM +0100, Pavel Begunkov wrote:
> On 8/15/24 15:33, Jens Axboe wrote:
> > On 8/14/24 7:42 PM, Ming Lei wrote:
> > > On Wed, Aug 14, 2024 at 6:46?PM Pavel Begunkov <[email protected]> wrote:
> > > > 
> > > > Add ->uring_cmd callback for block device files and use it to implement
> > > > asynchronous discard. Normally, it first tries to execute the command
> > > > from non-blocking context, which we limit to a single bio because
> > > > otherwise one of sub-bios may need to wait for other bios, and we don't
> > > > want to deal with partial IO. If non-blocking attempt fails, we'll retry
> > > > it in a blocking context.
> > > > 
> > > > Suggested-by: Conrad Meyer <[email protected]>
> > > > Signed-off-by: Pavel Begunkov <[email protected]>
> > > > ---
> > > >   block/blk.h             |  1 +
> > > >   block/fops.c            |  2 +
> > > >   block/ioctl.c           | 94 +++++++++++++++++++++++++++++++++++++++++
> > > >   include/uapi/linux/fs.h |  2 +
> > > >   4 files changed, 99 insertions(+)
> > > > 
> > > > diff --git a/block/blk.h b/block/blk.h
> > > > index e180863f918b..5178c5ba6852 100644
> > > > --- a/block/blk.h
> > > > +++ b/block/blk.h
> > > > @@ -571,6 +571,7 @@ blk_mode_t file_to_blk_mode(struct file *file);
> > > >   int truncate_bdev_range(struct block_device *bdev, blk_mode_t mode,
> > > >                  loff_t lstart, loff_t lend);
> > > >   long blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg);
> > > > +int blkdev_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags);
> > > >   long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg);
> > > > 
> > > >   extern const struct address_space_operations def_blk_aops;
> > > > diff --git a/block/fops.c b/block/fops.c
> > > > index 9825c1713a49..8154b10b5abf 100644
> > > > --- a/block/fops.c
> > > > +++ b/block/fops.c
> > > > @@ -17,6 +17,7 @@
> > > >   #include <linux/fs.h>
> > > >   #include <linux/iomap.h>
> > > >   #include <linux/module.h>
> > > > +#include <linux/io_uring/cmd.h>
> > > >   #include "blk.h"
> > > > 
> > > >   static inline struct inode *bdev_file_inode(struct file *file)
> > > > @@ -873,6 +874,7 @@ const struct file_operations def_blk_fops = {
> > > >          .splice_read    = filemap_splice_read,
> > > >          .splice_write   = iter_file_splice_write,
> > > >          .fallocate      = blkdev_fallocate,
> > > > +       .uring_cmd      = blkdev_uring_cmd,
> > > 
> > > Just be curious, we have IORING_OP_FALLOCATE already for sending
> > > discard to block device, why is .uring_cmd added for this purpose?
> 
> Which is a good question, I haven't thought about it, but I tend to
> agree with Jens. Because vfs_fallocate is created synchronous
> IORING_OP_FALLOCATE is slow for anything but pretty large requests.
> Probably can be patched up, which would  involve changing the
> fops->fallocate protot, but I'm not sure async there makes sense
> outside of bdev (?), and cmd approach is simpler, can be made
> somewhat more efficient (1 less layer in the way), and it's not
> really something completely new since we have it in ioctl.

Yeah, we have ioctl(DISCARD), which acquires filemap_invalidate_lock,
same with blkdev_fallocate().

But this patch drops this exclusive lock, so it becomes async friendly,
but may cause stale page cache. However, if the lock is required, it can't
be efficient anymore and io-wq may be inevitable, :-)

Thanks,
Ming


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC 5/5] block: implement io_uring discard cmd
  2024-08-15 23:44         ` Ming Lei
@ 2024-08-16  1:24           ` Jens Axboe
  2024-08-16  1:45             ` Ming Lei
  0 siblings, 1 reply; 26+ messages in thread
From: Jens Axboe @ 2024-08-16  1:24 UTC (permalink / raw)
  To: Ming Lei, Pavel Begunkov; +Cc: io-uring, Conrad Meyer, linux-block, linux-mm

On 8/15/24 5:44 PM, Ming Lei wrote:
> On Thu, Aug 15, 2024 at 06:11:13PM +0100, Pavel Begunkov wrote:
>> On 8/15/24 15:33, Jens Axboe wrote:
>>> On 8/14/24 7:42 PM, Ming Lei wrote:
>>>> On Wed, Aug 14, 2024 at 6:46?PM Pavel Begunkov <[email protected]> wrote:
>>>>>
>>>>> Add ->uring_cmd callback for block device files and use it to implement
>>>>> asynchronous discard. Normally, it first tries to execute the command
>>>>> from non-blocking context, which we limit to a single bio because
>>>>> otherwise one of sub-bios may need to wait for other bios, and we don't
>>>>> want to deal with partial IO. If non-blocking attempt fails, we'll retry
>>>>> it in a blocking context.
>>>>>
>>>>> Suggested-by: Conrad Meyer <[email protected]>
>>>>> Signed-off-by: Pavel Begunkov <[email protected]>
>>>>> ---
>>>>>   block/blk.h             |  1 +
>>>>>   block/fops.c            |  2 +
>>>>>   block/ioctl.c           | 94 +++++++++++++++++++++++++++++++++++++++++
>>>>>   include/uapi/linux/fs.h |  2 +
>>>>>   4 files changed, 99 insertions(+)
>>>>>
>>>>> diff --git a/block/blk.h b/block/blk.h
>>>>> index e180863f918b..5178c5ba6852 100644
>>>>> --- a/block/blk.h
>>>>> +++ b/block/blk.h
>>>>> @@ -571,6 +571,7 @@ blk_mode_t file_to_blk_mode(struct file *file);
>>>>>   int truncate_bdev_range(struct block_device *bdev, blk_mode_t mode,
>>>>>                  loff_t lstart, loff_t lend);
>>>>>   long blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg);
>>>>> +int blkdev_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags);
>>>>>   long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg);
>>>>>
>>>>>   extern const struct address_space_operations def_blk_aops;
>>>>> diff --git a/block/fops.c b/block/fops.c
>>>>> index 9825c1713a49..8154b10b5abf 100644
>>>>> --- a/block/fops.c
>>>>> +++ b/block/fops.c
>>>>> @@ -17,6 +17,7 @@
>>>>>   #include <linux/fs.h>
>>>>>   #include <linux/iomap.h>
>>>>>   #include <linux/module.h>
>>>>> +#include <linux/io_uring/cmd.h>
>>>>>   #include "blk.h"
>>>>>
>>>>>   static inline struct inode *bdev_file_inode(struct file *file)
>>>>> @@ -873,6 +874,7 @@ const struct file_operations def_blk_fops = {
>>>>>          .splice_read    = filemap_splice_read,
>>>>>          .splice_write   = iter_file_splice_write,
>>>>>          .fallocate      = blkdev_fallocate,
>>>>> +       .uring_cmd      = blkdev_uring_cmd,
>>>>
>>>> Just be curious, we have IORING_OP_FALLOCATE already for sending
>>>> discard to block device, why is .uring_cmd added for this purpose?
>>
>> Which is a good question, I haven't thought about it, but I tend to
>> agree with Jens. Because vfs_fallocate is created synchronous
>> IORING_OP_FALLOCATE is slow for anything but pretty large requests.
>> Probably can be patched up, which would  involve changing the
>> fops->fallocate protot, but I'm not sure async there makes sense
>> outside of bdev (?), and cmd approach is simpler, can be made
>> somewhat more efficient (1 less layer in the way), and it's not
>> really something completely new since we have it in ioctl.
> 
> Yeah, we have ioctl(DISCARD), which acquires filemap_invalidate_lock,
> same with blkdev_fallocate().
> 
> But this patch drops this exclusive lock, so it becomes async friendly,
> but may cause stale page cache. However, if the lock is required, it can't
> be efficient anymore and io-wq may be inevitable, :-)

If you want to grab the lock, you can still opportunistically grab it.
For (by far) the common case, you'll get it, and you can still do it
inline.

Really not that unusual.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC 5/5] block: implement io_uring discard cmd
  2024-08-16  1:24           ` Jens Axboe
@ 2024-08-16  1:45             ` Ming Lei
  2024-08-16  1:59               ` Pavel Begunkov
  2024-08-19 20:01               ` Jens Axboe
  0 siblings, 2 replies; 26+ messages in thread
From: Ming Lei @ 2024-08-16  1:45 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Pavel Begunkov, io-uring, Conrad Meyer, linux-block, linux-mm,
	ming.lei

On Thu, Aug 15, 2024 at 07:24:16PM -0600, Jens Axboe wrote:
> On 8/15/24 5:44 PM, Ming Lei wrote:
> > On Thu, Aug 15, 2024 at 06:11:13PM +0100, Pavel Begunkov wrote:
> >> On 8/15/24 15:33, Jens Axboe wrote:
> >>> On 8/14/24 7:42 PM, Ming Lei wrote:
> >>>> On Wed, Aug 14, 2024 at 6:46?PM Pavel Begunkov <[email protected]> wrote:
> >>>>>
> >>>>> Add ->uring_cmd callback for block device files and use it to implement
> >>>>> asynchronous discard. Normally, it first tries to execute the command
> >>>>> from non-blocking context, which we limit to a single bio because
> >>>>> otherwise one of sub-bios may need to wait for other bios, and we don't
> >>>>> want to deal with partial IO. If non-blocking attempt fails, we'll retry
> >>>>> it in a blocking context.
> >>>>>
> >>>>> Suggested-by: Conrad Meyer <[email protected]>
> >>>>> Signed-off-by: Pavel Begunkov <[email protected]>
> >>>>> ---
> >>>>>   block/blk.h             |  1 +
> >>>>>   block/fops.c            |  2 +
> >>>>>   block/ioctl.c           | 94 +++++++++++++++++++++++++++++++++++++++++
> >>>>>   include/uapi/linux/fs.h |  2 +
> >>>>>   4 files changed, 99 insertions(+)
> >>>>>
> >>>>> diff --git a/block/blk.h b/block/blk.h
> >>>>> index e180863f918b..5178c5ba6852 100644
> >>>>> --- a/block/blk.h
> >>>>> +++ b/block/blk.h
> >>>>> @@ -571,6 +571,7 @@ blk_mode_t file_to_blk_mode(struct file *file);
> >>>>>   int truncate_bdev_range(struct block_device *bdev, blk_mode_t mode,
> >>>>>                  loff_t lstart, loff_t lend);
> >>>>>   long blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg);
> >>>>> +int blkdev_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags);
> >>>>>   long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg);
> >>>>>
> >>>>>   extern const struct address_space_operations def_blk_aops;
> >>>>> diff --git a/block/fops.c b/block/fops.c
> >>>>> index 9825c1713a49..8154b10b5abf 100644
> >>>>> --- a/block/fops.c
> >>>>> +++ b/block/fops.c
> >>>>> @@ -17,6 +17,7 @@
> >>>>>   #include <linux/fs.h>
> >>>>>   #include <linux/iomap.h>
> >>>>>   #include <linux/module.h>
> >>>>> +#include <linux/io_uring/cmd.h>
> >>>>>   #include "blk.h"
> >>>>>
> >>>>>   static inline struct inode *bdev_file_inode(struct file *file)
> >>>>> @@ -873,6 +874,7 @@ const struct file_operations def_blk_fops = {
> >>>>>          .splice_read    = filemap_splice_read,
> >>>>>          .splice_write   = iter_file_splice_write,
> >>>>>          .fallocate      = blkdev_fallocate,
> >>>>> +       .uring_cmd      = blkdev_uring_cmd,
> >>>>
> >>>> Just be curious, we have IORING_OP_FALLOCATE already for sending
> >>>> discard to block device, why is .uring_cmd added for this purpose?
> >>
> >> Which is a good question, I haven't thought about it, but I tend to
> >> agree with Jens. Because vfs_fallocate is created synchronous
> >> IORING_OP_FALLOCATE is slow for anything but pretty large requests.
> >> Probably can be patched up, which would  involve changing the
> >> fops->fallocate protot, but I'm not sure async there makes sense
> >> outside of bdev (?), and cmd approach is simpler, can be made
> >> somewhat more efficient (1 less layer in the way), and it's not
> >> really something completely new since we have it in ioctl.
> > 
> > Yeah, we have ioctl(DISCARD), which acquires filemap_invalidate_lock,
> > same with blkdev_fallocate().
> > 
> > But this patch drops this exclusive lock, so it becomes async friendly,
> > but may cause stale page cache. However, if the lock is required, it can't
> > be efficient anymore and io-wq may be inevitable, :-)
> 
> If you want to grab the lock, you can still opportunistically grab it.
> For (by far) the common case, you'll get it, and you can still do it
> inline.

If the lock is grabbed in the whole cmd lifetime, it is basically one sync
interface cause there is at most one async discard cmd in-flight for each
device.

Meantime the handling has to move to io-wq for avoiding to block current
context, the interface becomes same with IORING_OP_FALLOCATE?


thanks,
Ming


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC 5/5] block: implement io_uring discard cmd
  2024-08-16  1:45             ` Ming Lei
@ 2024-08-16  1:59               ` Pavel Begunkov
  2024-08-16  2:08                 ` Ming Lei
  2024-08-19 20:01               ` Jens Axboe
  1 sibling, 1 reply; 26+ messages in thread
From: Pavel Begunkov @ 2024-08-16  1:59 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe; +Cc: io-uring, Conrad Meyer, linux-block, linux-mm

On 8/16/24 02:45, Ming Lei wrote:
> On Thu, Aug 15, 2024 at 07:24:16PM -0600, Jens Axboe wrote:
>> On 8/15/24 5:44 PM, Ming Lei wrote:
>>> On Thu, Aug 15, 2024 at 06:11:13PM +0100, Pavel Begunkov wrote:
>>>> On 8/15/24 15:33, Jens Axboe wrote:
>>>>> On 8/14/24 7:42 PM, Ming Lei wrote:
>>>>>> On Wed, Aug 14, 2024 at 6:46?PM Pavel Begunkov <[email protected]> wrote:
>>>>>>>
>>>>>>> Add ->uring_cmd callback for block device files and use it to implement
>>>>>>> asynchronous discard. Normally, it first tries to execute the command
>>>>>>> from non-blocking context, which we limit to a single bio because
>>>>>>> otherwise one of sub-bios may need to wait for other bios, and we don't
>>>>>>> want to deal with partial IO. If non-blocking attempt fails, we'll retry
>>>>>>> it in a blocking context.
>>>>>>>
>>>>>>> Suggested-by: Conrad Meyer <[email protected]>
>>>>>>> Signed-off-by: Pavel Begunkov <[email protected]>
>>>>>>> ---
>>>>>>>    block/blk.h             |  1 +
>>>>>>>    block/fops.c            |  2 +
>>>>>>>    block/ioctl.c           | 94 +++++++++++++++++++++++++++++++++++++++++
>>>>>>>    include/uapi/linux/fs.h |  2 +
>>>>>>>    4 files changed, 99 insertions(+)
>>>>>>>
>>>>>>> diff --git a/block/blk.h b/block/blk.h
>>>>>>> index e180863f918b..5178c5ba6852 100644
>>>>>>> --- a/block/blk.h
>>>>>>> +++ b/block/blk.h
>>>>>>> @@ -571,6 +571,7 @@ blk_mode_t file_to_blk_mode(struct file *file);
>>>>>>>    int truncate_bdev_range(struct block_device *bdev, blk_mode_t mode,
>>>>>>>                   loff_t lstart, loff_t lend);
>>>>>>>    long blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg);
>>>>>>> +int blkdev_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags);
>>>>>>>    long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg);
>>>>>>>
>>>>>>>    extern const struct address_space_operations def_blk_aops;
>>>>>>> diff --git a/block/fops.c b/block/fops.c
>>>>>>> index 9825c1713a49..8154b10b5abf 100644
>>>>>>> --- a/block/fops.c
>>>>>>> +++ b/block/fops.c
>>>>>>> @@ -17,6 +17,7 @@
>>>>>>>    #include <linux/fs.h>
>>>>>>>    #include <linux/iomap.h>
>>>>>>>    #include <linux/module.h>
>>>>>>> +#include <linux/io_uring/cmd.h>
>>>>>>>    #include "blk.h"
>>>>>>>
>>>>>>>    static inline struct inode *bdev_file_inode(struct file *file)
>>>>>>> @@ -873,6 +874,7 @@ const struct file_operations def_blk_fops = {
>>>>>>>           .splice_read    = filemap_splice_read,
>>>>>>>           .splice_write   = iter_file_splice_write,
>>>>>>>           .fallocate      = blkdev_fallocate,
>>>>>>> +       .uring_cmd      = blkdev_uring_cmd,
>>>>>>
>>>>>> Just be curious, we have IORING_OP_FALLOCATE already for sending
>>>>>> discard to block device, why is .uring_cmd added for this purpose?
>>>>
>>>> Which is a good question, I haven't thought about it, but I tend to
>>>> agree with Jens. Because vfs_fallocate is created synchronous
>>>> IORING_OP_FALLOCATE is slow for anything but pretty large requests.
>>>> Probably can be patched up, which would  involve changing the
>>>> fops->fallocate protot, but I'm not sure async there makes sense
>>>> outside of bdev (?), and cmd approach is simpler, can be made
>>>> somewhat more efficient (1 less layer in the way), and it's not
>>>> really something completely new since we have it in ioctl.
>>>
>>> Yeah, we have ioctl(DISCARD), which acquires filemap_invalidate_lock,
>>> same with blkdev_fallocate().
>>>
>>> But this patch drops this exclusive lock, so it becomes async friendly,
>>> but may cause stale page cache. However, if the lock is required, it can't
>>> be efficient anymore and io-wq may be inevitable, :-)
>>
>> If you want to grab the lock, you can still opportunistically grab it.
>> For (by far) the common case, you'll get it, and you can still do it
>> inline.
> 
> If the lock is grabbed in the whole cmd lifetime, it is basically one sync
> interface cause there is at most one async discard cmd in-flight for each
> device.
> 
> Meantime the handling has to move to io-wq for avoiding to block current
> context, the interface becomes same with IORING_OP_FALLOCATE?

Right, and agree that we can't trylock because we'd need to keep it
locked until IO completes, at least the sync versions does that.

But I think *invalidate_pages() in the patch should be enough. That's
what the write path does, so it shouldn't cause any problem to the
kernel. As for user space, that'd be more relaxed than the ioctl,
just as writes are, so nothing new to the user. I hope someone with
better filemap understanding can confirm it (or not).

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC 5/5] block: implement io_uring discard cmd
  2024-08-16  1:59               ` Pavel Begunkov
@ 2024-08-16  2:08                 ` Ming Lei
  2024-08-16  2:16                   ` Pavel Begunkov
  0 siblings, 1 reply; 26+ messages in thread
From: Ming Lei @ 2024-08-16  2:08 UTC (permalink / raw)
  To: Pavel Begunkov; +Cc: Jens Axboe, io-uring, Conrad Meyer, linux-block, linux-mm

On Fri, Aug 16, 2024 at 02:59:49AM +0100, Pavel Begunkov wrote:
> On 8/16/24 02:45, Ming Lei wrote:
> > On Thu, Aug 15, 2024 at 07:24:16PM -0600, Jens Axboe wrote:
> > > On 8/15/24 5:44 PM, Ming Lei wrote:
> > > > On Thu, Aug 15, 2024 at 06:11:13PM +0100, Pavel Begunkov wrote:
> > > > > On 8/15/24 15:33, Jens Axboe wrote:
> > > > > > On 8/14/24 7:42 PM, Ming Lei wrote:
> > > > > > > On Wed, Aug 14, 2024 at 6:46?PM Pavel Begunkov <[email protected]> wrote:
> > > > > > > > 
> > > > > > > > Add ->uring_cmd callback for block device files and use it to implement
> > > > > > > > asynchronous discard. Normally, it first tries to execute the command
> > > > > > > > from non-blocking context, which we limit to a single bio because
> > > > > > > > otherwise one of sub-bios may need to wait for other bios, and we don't
> > > > > > > > want to deal with partial IO. If non-blocking attempt fails, we'll retry
> > > > > > > > it in a blocking context.
> > > > > > > > 
> > > > > > > > Suggested-by: Conrad Meyer <[email protected]>
> > > > > > > > Signed-off-by: Pavel Begunkov <[email protected]>
> > > > > > > > ---
> > > > > > > >    block/blk.h             |  1 +
> > > > > > > >    block/fops.c            |  2 +
> > > > > > > >    block/ioctl.c           | 94 +++++++++++++++++++++++++++++++++++++++++
> > > > > > > >    include/uapi/linux/fs.h |  2 +
> > > > > > > >    4 files changed, 99 insertions(+)
> > > > > > > > 
> > > > > > > > diff --git a/block/blk.h b/block/blk.h
> > > > > > > > index e180863f918b..5178c5ba6852 100644
> > > > > > > > --- a/block/blk.h
> > > > > > > > +++ b/block/blk.h
> > > > > > > > @@ -571,6 +571,7 @@ blk_mode_t file_to_blk_mode(struct file *file);
> > > > > > > >    int truncate_bdev_range(struct block_device *bdev, blk_mode_t mode,
> > > > > > > >                   loff_t lstart, loff_t lend);
> > > > > > > >    long blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg);
> > > > > > > > +int blkdev_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags);
> > > > > > > >    long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg);
> > > > > > > > 
> > > > > > > >    extern const struct address_space_operations def_blk_aops;
> > > > > > > > diff --git a/block/fops.c b/block/fops.c
> > > > > > > > index 9825c1713a49..8154b10b5abf 100644
> > > > > > > > --- a/block/fops.c
> > > > > > > > +++ b/block/fops.c
> > > > > > > > @@ -17,6 +17,7 @@
> > > > > > > >    #include <linux/fs.h>
> > > > > > > >    #include <linux/iomap.h>
> > > > > > > >    #include <linux/module.h>
> > > > > > > > +#include <linux/io_uring/cmd.h>
> > > > > > > >    #include "blk.h"
> > > > > > > > 
> > > > > > > >    static inline struct inode *bdev_file_inode(struct file *file)
> > > > > > > > @@ -873,6 +874,7 @@ const struct file_operations def_blk_fops = {
> > > > > > > >           .splice_read    = filemap_splice_read,
> > > > > > > >           .splice_write   = iter_file_splice_write,
> > > > > > > >           .fallocate      = blkdev_fallocate,
> > > > > > > > +       .uring_cmd      = blkdev_uring_cmd,
> > > > > > > 
> > > > > > > Just be curious, we have IORING_OP_FALLOCATE already for sending
> > > > > > > discard to block device, why is .uring_cmd added for this purpose?
> > > > > 
> > > > > Which is a good question, I haven't thought about it, but I tend to
> > > > > agree with Jens. Because vfs_fallocate is created synchronous
> > > > > IORING_OP_FALLOCATE is slow for anything but pretty large requests.
> > > > > Probably can be patched up, which would  involve changing the
> > > > > fops->fallocate protot, but I'm not sure async there makes sense
> > > > > outside of bdev (?), and cmd approach is simpler, can be made
> > > > > somewhat more efficient (1 less layer in the way), and it's not
> > > > > really something completely new since we have it in ioctl.
> > > > 
> > > > Yeah, we have ioctl(DISCARD), which acquires filemap_invalidate_lock,
> > > > same with blkdev_fallocate().
> > > > 
> > > > But this patch drops this exclusive lock, so it becomes async friendly,
> > > > but may cause stale page cache. However, if the lock is required, it can't
> > > > be efficient anymore and io-wq may be inevitable, :-)
> > > 
> > > If you want to grab the lock, you can still opportunistically grab it.
> > > For (by far) the common case, you'll get it, and you can still do it
> > > inline.
> > 
> > If the lock is grabbed in the whole cmd lifetime, it is basically one sync
> > interface cause there is at most one async discard cmd in-flight for each
> > device.
> > 
> > Meantime the handling has to move to io-wq for avoiding to block current
> > context, the interface becomes same with IORING_OP_FALLOCATE?
> 
> Right, and agree that we can't trylock because we'd need to keep it
> locked until IO completes, at least the sync versions does that.
> 
> But I think *invalidate_pages() in the patch should be enough. That's
> what the write path does, so it shouldn't cause any problem to the
> kernel. As for user space, that'd be more relaxed than the ioctl,
> just as writes are, so nothing new to the user. I hope someone with
> better filemap understanding can confirm it (or not).

I may not be familiar with filemap enough, but looks *invalidate_pages()
is only for removing pages from the page cache range, and the lock is added
for preventing new page cache read from being started, so stale data read
can be avoided when DISCARD is in-progress.

Thanks,
Ming


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC 5/5] block: implement io_uring discard cmd
  2024-08-16  2:08                 ` Ming Lei
@ 2024-08-16  2:16                   ` Pavel Begunkov
  2024-08-19 20:02                     ` Jens Axboe
  0 siblings, 1 reply; 26+ messages in thread
From: Pavel Begunkov @ 2024-08-16  2:16 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, io-uring, Conrad Meyer, linux-block, linux-mm

On 8/16/24 03:08, Ming Lei wrote:
> On Fri, Aug 16, 2024 at 02:59:49AM +0100, Pavel Begunkov wrote:
>> On 8/16/24 02:45, Ming Lei wrote:
>>> On Thu, Aug 15, 2024 at 07:24:16PM -0600, Jens Axboe wrote:
>>>> On 8/15/24 5:44 PM, Ming Lei wrote:
>>>>> On Thu, Aug 15, 2024 at 06:11:13PM +0100, Pavel Begunkov wrote:
>>>>>> On 8/15/24 15:33, Jens Axboe wrote:
>>>>>>> On 8/14/24 7:42 PM, Ming Lei wrote:
>>>>>>>> On Wed, Aug 14, 2024 at 6:46?PM Pavel Begunkov <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> Add ->uring_cmd callback for block device files and use it to implement
>>>>>>>>> asynchronous discard. Normally, it first tries to execute the command
>>>>>>>>> from non-blocking context, which we limit to a single bio because
>>>>>>>>> otherwise one of sub-bios may need to wait for other bios, and we don't
>>>>>>>>> want to deal with partial IO. If non-blocking attempt fails, we'll retry
>>>>>>>>> it in a blocking context.
>>>>>>>>>
>>>>>>>>> Suggested-by: Conrad Meyer <[email protected]>
>>>>>>>>> Signed-off-by: Pavel Begunkov <[email protected]>
>>>>>>>>> ---
>>>>>>>>>     block/blk.h             |  1 +
>>>>>>>>>     block/fops.c            |  2 +
>>>>>>>>>     block/ioctl.c           | 94 +++++++++++++++++++++++++++++++++++++++++
>>>>>>>>>     include/uapi/linux/fs.h |  2 +
>>>>>>>>>     4 files changed, 99 insertions(+)
>>>>>>>>>
>>>>>>>>> diff --git a/block/blk.h b/block/blk.h
>>>>>>>>> index e180863f918b..5178c5ba6852 100644
>>>>>>>>> --- a/block/blk.h
>>>>>>>>> +++ b/block/blk.h
>>>>>>>>> @@ -571,6 +571,7 @@ blk_mode_t file_to_blk_mode(struct file *file);
>>>>>>>>>     int truncate_bdev_range(struct block_device *bdev, blk_mode_t mode,
>>>>>>>>>                    loff_t lstart, loff_t lend);
>>>>>>>>>     long blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg);
>>>>>>>>> +int blkdev_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags);
>>>>>>>>>     long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg);
>>>>>>>>>
>>>>>>>>>     extern const struct address_space_operations def_blk_aops;
>>>>>>>>> diff --git a/block/fops.c b/block/fops.c
>>>>>>>>> index 9825c1713a49..8154b10b5abf 100644
>>>>>>>>> --- a/block/fops.c
>>>>>>>>> +++ b/block/fops.c
>>>>>>>>> @@ -17,6 +17,7 @@
>>>>>>>>>     #include <linux/fs.h>
>>>>>>>>>     #include <linux/iomap.h>
>>>>>>>>>     #include <linux/module.h>
>>>>>>>>> +#include <linux/io_uring/cmd.h>
>>>>>>>>>     #include "blk.h"
>>>>>>>>>
>>>>>>>>>     static inline struct inode *bdev_file_inode(struct file *file)
>>>>>>>>> @@ -873,6 +874,7 @@ const struct file_operations def_blk_fops = {
>>>>>>>>>            .splice_read    = filemap_splice_read,
>>>>>>>>>            .splice_write   = iter_file_splice_write,
>>>>>>>>>            .fallocate      = blkdev_fallocate,
>>>>>>>>> +       .uring_cmd      = blkdev_uring_cmd,
>>>>>>>>
>>>>>>>> Just be curious, we have IORING_OP_FALLOCATE already for sending
>>>>>>>> discard to block device, why is .uring_cmd added for this purpose?
>>>>>>
>>>>>> Which is a good question, I haven't thought about it, but I tend to
>>>>>> agree with Jens. Because vfs_fallocate is created synchronous
>>>>>> IORING_OP_FALLOCATE is slow for anything but pretty large requests.
>>>>>> Probably can be patched up, which would  involve changing the
>>>>>> fops->fallocate protot, but I'm not sure async there makes sense
>>>>>> outside of bdev (?), and cmd approach is simpler, can be made
>>>>>> somewhat more efficient (1 less layer in the way), and it's not
>>>>>> really something completely new since we have it in ioctl.
>>>>>
>>>>> Yeah, we have ioctl(DISCARD), which acquires filemap_invalidate_lock,
>>>>> same with blkdev_fallocate().
>>>>>
>>>>> But this patch drops this exclusive lock, so it becomes async friendly,
>>>>> but may cause stale page cache. However, if the lock is required, it can't
>>>>> be efficient anymore and io-wq may be inevitable, :-)
>>>>
>>>> If you want to grab the lock, you can still opportunistically grab it.
>>>> For (by far) the common case, you'll get it, and you can still do it
>>>> inline.
>>>
>>> If the lock is grabbed in the whole cmd lifetime, it is basically one sync
>>> interface cause there is at most one async discard cmd in-flight for each
>>> device.
>>>
>>> Meantime the handling has to move to io-wq for avoiding to block current
>>> context, the interface becomes same with IORING_OP_FALLOCATE?
>>
>> Right, and agree that we can't trylock because we'd need to keep it
>> locked until IO completes, at least the sync versions does that.
>>
>> But I think *invalidate_pages() in the patch should be enough. That's
>> what the write path does, so it shouldn't cause any problem to the
>> kernel. As for user space, that'd be more relaxed than the ioctl,
>> just as writes are, so nothing new to the user. I hope someone with
>> better filemap understanding can confirm it (or not).
> 
> I may not be familiar with filemap enough, but looks *invalidate_pages()
> is only for removing pages from the page cache range, and the lock is added
> for preventing new page cache read from being started, so stale data read
> can be avoided when DISCARD is in-progress.

Sounds like it, but the point is it's the same data race for the
user as if it would've had a write in progress.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC 5/5] block: implement io_uring discard cmd
  2024-08-16  1:45             ` Ming Lei
  2024-08-16  1:59               ` Pavel Begunkov
@ 2024-08-19 20:01               ` Jens Axboe
  2024-08-20  2:36                 ` Ming Lei
  1 sibling, 1 reply; 26+ messages in thread
From: Jens Axboe @ 2024-08-19 20:01 UTC (permalink / raw)
  To: Ming Lei; +Cc: Pavel Begunkov, io-uring, Conrad Meyer, linux-block, linux-mm

On 8/15/24 7:45 PM, Ming Lei wrote:
> On Thu, Aug 15, 2024 at 07:24:16PM -0600, Jens Axboe wrote:
>> On 8/15/24 5:44 PM, Ming Lei wrote:
>>> On Thu, Aug 15, 2024 at 06:11:13PM +0100, Pavel Begunkov wrote:
>>>> On 8/15/24 15:33, Jens Axboe wrote:
>>>>> On 8/14/24 7:42 PM, Ming Lei wrote:
>>>>>> On Wed, Aug 14, 2024 at 6:46?PM Pavel Begunkov <[email protected]> wrote:
>>>>>>>
>>>>>>> Add ->uring_cmd callback for block device files and use it to implement
>>>>>>> asynchronous discard. Normally, it first tries to execute the command
>>>>>>> from non-blocking context, which we limit to a single bio because
>>>>>>> otherwise one of sub-bios may need to wait for other bios, and we don't
>>>>>>> want to deal with partial IO. If non-blocking attempt fails, we'll retry
>>>>>>> it in a blocking context.
>>>>>>>
>>>>>>> Suggested-by: Conrad Meyer <[email protected]>
>>>>>>> Signed-off-by: Pavel Begunkov <[email protected]>
>>>>>>> ---
>>>>>>>   block/blk.h             |  1 +
>>>>>>>   block/fops.c            |  2 +
>>>>>>>   block/ioctl.c           | 94 +++++++++++++++++++++++++++++++++++++++++
>>>>>>>   include/uapi/linux/fs.h |  2 +
>>>>>>>   4 files changed, 99 insertions(+)
>>>>>>>
>>>>>>> diff --git a/block/blk.h b/block/blk.h
>>>>>>> index e180863f918b..5178c5ba6852 100644
>>>>>>> --- a/block/blk.h
>>>>>>> +++ b/block/blk.h
>>>>>>> @@ -571,6 +571,7 @@ blk_mode_t file_to_blk_mode(struct file *file);
>>>>>>>   int truncate_bdev_range(struct block_device *bdev, blk_mode_t mode,
>>>>>>>                  loff_t lstart, loff_t lend);
>>>>>>>   long blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg);
>>>>>>> +int blkdev_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags);
>>>>>>>   long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg);
>>>>>>>
>>>>>>>   extern const struct address_space_operations def_blk_aops;
>>>>>>> diff --git a/block/fops.c b/block/fops.c
>>>>>>> index 9825c1713a49..8154b10b5abf 100644
>>>>>>> --- a/block/fops.c
>>>>>>> +++ b/block/fops.c
>>>>>>> @@ -17,6 +17,7 @@
>>>>>>>   #include <linux/fs.h>
>>>>>>>   #include <linux/iomap.h>
>>>>>>>   #include <linux/module.h>
>>>>>>> +#include <linux/io_uring/cmd.h>
>>>>>>>   #include "blk.h"
>>>>>>>
>>>>>>>   static inline struct inode *bdev_file_inode(struct file *file)
>>>>>>> @@ -873,6 +874,7 @@ const struct file_operations def_blk_fops = {
>>>>>>>          .splice_read    = filemap_splice_read,
>>>>>>>          .splice_write   = iter_file_splice_write,
>>>>>>>          .fallocate      = blkdev_fallocate,
>>>>>>> +       .uring_cmd      = blkdev_uring_cmd,
>>>>>>
>>>>>> Just be curious, we have IORING_OP_FALLOCATE already for sending
>>>>>> discard to block device, why is .uring_cmd added for this purpose?
>>>>
>>>> Which is a good question, I haven't thought about it, but I tend to
>>>> agree with Jens. Because vfs_fallocate is created synchronous
>>>> IORING_OP_FALLOCATE is slow for anything but pretty large requests.
>>>> Probably can be patched up, which would  involve changing the
>>>> fops->fallocate protot, but I'm not sure async there makes sense
>>>> outside of bdev (?), and cmd approach is simpler, can be made
>>>> somewhat more efficient (1 less layer in the way), and it's not
>>>> really something completely new since we have it in ioctl.
>>>
>>> Yeah, we have ioctl(DISCARD), which acquires filemap_invalidate_lock,
>>> same with blkdev_fallocate().
>>>
>>> But this patch drops this exclusive lock, so it becomes async friendly,
>>> but may cause stale page cache. However, if the lock is required, it can't
>>> be efficient anymore and io-wq may be inevitable, :-)
>>
>> If you want to grab the lock, you can still opportunistically grab it.
>> For (by far) the common case, you'll get it, and you can still do it
>> inline.
> 
> If the lock is grabbed in the whole cmd lifetime, it is basically one sync
> interface cause there is at most one async discard cmd in-flight for each
> device.

Oh for sure, you could not do that anyway as you'd be holding a lock
across the syscall boundary, which isn't allowed.

> Meantime the handling has to move to io-wq for avoiding to block current
> context, the interface becomes same with IORING_OP_FALLOCATE?

I think the current truncate is overkill, we should be able to get by
without. And no, I will not entertain an option that's "oh just punt it
to io-wq".

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC 5/5] block: implement io_uring discard cmd
  2024-08-16  2:16                   ` Pavel Begunkov
@ 2024-08-19 20:02                     ` Jens Axboe
  0 siblings, 0 replies; 26+ messages in thread
From: Jens Axboe @ 2024-08-19 20:02 UTC (permalink / raw)
  To: Pavel Begunkov, Ming Lei; +Cc: io-uring, Conrad Meyer, linux-block, linux-mm

On 8/15/24 8:16 PM, Pavel Begunkov wrote:
> On 8/16/24 03:08, Ming Lei wrote:
>> On Fri, Aug 16, 2024 at 02:59:49AM +0100, Pavel Begunkov wrote:
>>> On 8/16/24 02:45, Ming Lei wrote:
>>>> On Thu, Aug 15, 2024 at 07:24:16PM -0600, Jens Axboe wrote:
>>>>> On 8/15/24 5:44 PM, Ming Lei wrote:
>>>>>> On Thu, Aug 15, 2024 at 06:11:13PM +0100, Pavel Begunkov wrote:
>>>>>>> On 8/15/24 15:33, Jens Axboe wrote:
>>>>>>>> On 8/14/24 7:42 PM, Ming Lei wrote:
>>>>>>>>> On Wed, Aug 14, 2024 at 6:46?PM Pavel Begunkov <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> Add ->uring_cmd callback for block device files and use it to implement
>>>>>>>>>> asynchronous discard. Normally, it first tries to execute the command
>>>>>>>>>> from non-blocking context, which we limit to a single bio because
>>>>>>>>>> otherwise one of sub-bios may need to wait for other bios, and we don't
>>>>>>>>>> want to deal with partial IO. If non-blocking attempt fails, we'll retry
>>>>>>>>>> it in a blocking context.
>>>>>>>>>>
>>>>>>>>>> Suggested-by: Conrad Meyer <[email protected]>
>>>>>>>>>> Signed-off-by: Pavel Begunkov <[email protected]>
>>>>>>>>>> ---
>>>>>>>>>>     block/blk.h             |  1 +
>>>>>>>>>>     block/fops.c            |  2 +
>>>>>>>>>>     block/ioctl.c           | 94 +++++++++++++++++++++++++++++++++++++++++
>>>>>>>>>>     include/uapi/linux/fs.h |  2 +
>>>>>>>>>>     4 files changed, 99 insertions(+)
>>>>>>>>>>
>>>>>>>>>> diff --git a/block/blk.h b/block/blk.h
>>>>>>>>>> index e180863f918b..5178c5ba6852 100644
>>>>>>>>>> --- a/block/blk.h
>>>>>>>>>> +++ b/block/blk.h
>>>>>>>>>> @@ -571,6 +571,7 @@ blk_mode_t file_to_blk_mode(struct file *file);
>>>>>>>>>>     int truncate_bdev_range(struct block_device *bdev, blk_mode_t mode,
>>>>>>>>>>                    loff_t lstart, loff_t lend);
>>>>>>>>>>     long blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg);
>>>>>>>>>> +int blkdev_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags);
>>>>>>>>>>     long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg);
>>>>>>>>>>
>>>>>>>>>>     extern const struct address_space_operations def_blk_aops;
>>>>>>>>>> diff --git a/block/fops.c b/block/fops.c
>>>>>>>>>> index 9825c1713a49..8154b10b5abf 100644
>>>>>>>>>> --- a/block/fops.c
>>>>>>>>>> +++ b/block/fops.c
>>>>>>>>>> @@ -17,6 +17,7 @@
>>>>>>>>>>     #include <linux/fs.h>
>>>>>>>>>>     #include <linux/iomap.h>
>>>>>>>>>>     #include <linux/module.h>
>>>>>>>>>> +#include <linux/io_uring/cmd.h>
>>>>>>>>>>     #include "blk.h"
>>>>>>>>>>
>>>>>>>>>>     static inline struct inode *bdev_file_inode(struct file *file)
>>>>>>>>>> @@ -873,6 +874,7 @@ const struct file_operations def_blk_fops = {
>>>>>>>>>>            .splice_read    = filemap_splice_read,
>>>>>>>>>>            .splice_write   = iter_file_splice_write,
>>>>>>>>>>            .fallocate      = blkdev_fallocate,
>>>>>>>>>> +       .uring_cmd      = blkdev_uring_cmd,
>>>>>>>>>
>>>>>>>>> Just be curious, we have IORING_OP_FALLOCATE already for sending
>>>>>>>>> discard to block device, why is .uring_cmd added for this purpose?
>>>>>>>
>>>>>>> Which is a good question, I haven't thought about it, but I tend to
>>>>>>> agree with Jens. Because vfs_fallocate is created synchronous
>>>>>>> IORING_OP_FALLOCATE is slow for anything but pretty large requests.
>>>>>>> Probably can be patched up, which would  involve changing the
>>>>>>> fops->fallocate protot, but I'm not sure async there makes sense
>>>>>>> outside of bdev (?), and cmd approach is simpler, can be made
>>>>>>> somewhat more efficient (1 less layer in the way), and it's not
>>>>>>> really something completely new since we have it in ioctl.
>>>>>>
>>>>>> Yeah, we have ioctl(DISCARD), which acquires filemap_invalidate_lock,
>>>>>> same with blkdev_fallocate().
>>>>>>
>>>>>> But this patch drops this exclusive lock, so it becomes async friendly,
>>>>>> but may cause stale page cache. However, if the lock is required, it can't
>>>>>> be efficient anymore and io-wq may be inevitable, :-)
>>>>>
>>>>> If you want to grab the lock, you can still opportunistically grab it.
>>>>> For (by far) the common case, you'll get it, and you can still do it
>>>>> inline.
>>>>
>>>> If the lock is grabbed in the whole cmd lifetime, it is basically one sync
>>>> interface cause there is at most one async discard cmd in-flight for each
>>>> device.
>>>>
>>>> Meantime the handling has to move to io-wq for avoiding to block current
>>>> context, the interface becomes same with IORING_OP_FALLOCATE?
>>>
>>> Right, and agree that we can't trylock because we'd need to keep it
>>> locked until IO completes, at least the sync versions does that.
>>>
>>> But I think *invalidate_pages() in the patch should be enough. That's
>>> what the write path does, so it shouldn't cause any problem to the
>>> kernel. As for user space, that'd be more relaxed than the ioctl,
>>> just as writes are, so nothing new to the user. I hope someone with
>>> better filemap understanding can confirm it (or not).
>>
>> I may not be familiar with filemap enough, but looks *invalidate_pages()
>> is only for removing pages from the page cache range, and the lock is added
>> for preventing new page cache read from being started, so stale data read
>> can be avoided when DISCARD is in-progress.
> 
> Sounds like it, but the point is it's the same data race for the
> user as if it would've had a write in progress.

Right, which is why it should not matter. I think it's pretty silly to
take the sync implementation as gospel here, assuming that the original
author knew what they were doing in full detail. It just needs proper
documenting.

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC 5/5] block: implement io_uring discard cmd
  2024-08-19 20:01               ` Jens Axboe
@ 2024-08-20  2:36                 ` Ming Lei
  2024-08-20 16:30                   ` Jens Axboe
  0 siblings, 1 reply; 26+ messages in thread
From: Ming Lei @ 2024-08-20  2:36 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Pavel Begunkov, io-uring, Conrad Meyer, linux-block, linux-mm,
	Jan Kara, Shin'ichiro Kawasaki

On Mon, Aug 19, 2024 at 02:01:21PM -0600, Jens Axboe wrote:
> On 8/15/24 7:45 PM, Ming Lei wrote:
> > On Thu, Aug 15, 2024 at 07:24:16PM -0600, Jens Axboe wrote:
> >> On 8/15/24 5:44 PM, Ming Lei wrote:
> >>> On Thu, Aug 15, 2024 at 06:11:13PM +0100, Pavel Begunkov wrote:
> >>>> On 8/15/24 15:33, Jens Axboe wrote:
> >>>>> On 8/14/24 7:42 PM, Ming Lei wrote:
> >>>>>> On Wed, Aug 14, 2024 at 6:46?PM Pavel Begunkov <[email protected]> wrote:
> >>>>>>>
> >>>>>>> Add ->uring_cmd callback for block device files and use it to implement
> >>>>>>> asynchronous discard. Normally, it first tries to execute the command
> >>>>>>> from non-blocking context, which we limit to a single bio because
> >>>>>>> otherwise one of sub-bios may need to wait for other bios, and we don't
> >>>>>>> want to deal with partial IO. If non-blocking attempt fails, we'll retry
> >>>>>>> it in a blocking context.
> >>>>>>>
> >>>>>>> Suggested-by: Conrad Meyer <[email protected]>
> >>>>>>> Signed-off-by: Pavel Begunkov <[email protected]>
> >>>>>>> ---
> >>>>>>>   block/blk.h             |  1 +
> >>>>>>>   block/fops.c            |  2 +
> >>>>>>>   block/ioctl.c           | 94 +++++++++++++++++++++++++++++++++++++++++
> >>>>>>>   include/uapi/linux/fs.h |  2 +
> >>>>>>>   4 files changed, 99 insertions(+)
> >>>>>>>
> >>>>>>> diff --git a/block/blk.h b/block/blk.h
> >>>>>>> index e180863f918b..5178c5ba6852 100644
> >>>>>>> --- a/block/blk.h
> >>>>>>> +++ b/block/blk.h
> >>>>>>> @@ -571,6 +571,7 @@ blk_mode_t file_to_blk_mode(struct file *file);
> >>>>>>>   int truncate_bdev_range(struct block_device *bdev, blk_mode_t mode,
> >>>>>>>                  loff_t lstart, loff_t lend);
> >>>>>>>   long blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg);
> >>>>>>> +int blkdev_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags);
> >>>>>>>   long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg);
> >>>>>>>
> >>>>>>>   extern const struct address_space_operations def_blk_aops;
> >>>>>>> diff --git a/block/fops.c b/block/fops.c
> >>>>>>> index 9825c1713a49..8154b10b5abf 100644
> >>>>>>> --- a/block/fops.c
> >>>>>>> +++ b/block/fops.c
> >>>>>>> @@ -17,6 +17,7 @@
> >>>>>>>   #include <linux/fs.h>
> >>>>>>>   #include <linux/iomap.h>
> >>>>>>>   #include <linux/module.h>
> >>>>>>> +#include <linux/io_uring/cmd.h>
> >>>>>>>   #include "blk.h"
> >>>>>>>
> >>>>>>>   static inline struct inode *bdev_file_inode(struct file *file)
> >>>>>>> @@ -873,6 +874,7 @@ const struct file_operations def_blk_fops = {
> >>>>>>>          .splice_read    = filemap_splice_read,
> >>>>>>>          .splice_write   = iter_file_splice_write,
> >>>>>>>          .fallocate      = blkdev_fallocate,
> >>>>>>> +       .uring_cmd      = blkdev_uring_cmd,
> >>>>>>
> >>>>>> Just be curious, we have IORING_OP_FALLOCATE already for sending
> >>>>>> discard to block device, why is .uring_cmd added for this purpose?
> >>>>
> >>>> Which is a good question, I haven't thought about it, but I tend to
> >>>> agree with Jens. Because vfs_fallocate is created synchronous
> >>>> IORING_OP_FALLOCATE is slow for anything but pretty large requests.
> >>>> Probably can be patched up, which would  involve changing the
> >>>> fops->fallocate protot, but I'm not sure async there makes sense
> >>>> outside of bdev (?), and cmd approach is simpler, can be made
> >>>> somewhat more efficient (1 less layer in the way), and it's not
> >>>> really something completely new since we have it in ioctl.
> >>>
> >>> Yeah, we have ioctl(DISCARD), which acquires filemap_invalidate_lock,
> >>> same with blkdev_fallocate().
> >>>
> >>> But this patch drops this exclusive lock, so it becomes async friendly,
> >>> but may cause stale page cache. However, if the lock is required, it can't
> >>> be efficient anymore and io-wq may be inevitable, :-)
> >>
> >> If you want to grab the lock, you can still opportunistically grab it.
> >> For (by far) the common case, you'll get it, and you can still do it
> >> inline.
> > 
> > If the lock is grabbed in the whole cmd lifetime, it is basically one sync
> > interface cause there is at most one async discard cmd in-flight for each
> > device.
> 
> Oh for sure, you could not do that anyway as you'd be holding a lock
> across the syscall boundary, which isn't allowed.

Indeed.

> 
> > Meantime the handling has to move to io-wq for avoiding to block current
> > context, the interface becomes same with IORING_OP_FALLOCATE?
> 
> I think the current truncate is overkill, we should be able to get by
> without. And no, I will not entertain an option that's "oh just punt it
> to io-wq".

BTW, the truncate is added by 351499a172c0 ("block: Invalidate cache on discard v2"),
and block/009 serves as regression test for covering page cache
coherency and discard.

Here the issue is actually related with the exclusive lock of
filemap_invalidate_lock(). IMO, it is reasonable to prevent page read during
discard for not polluting page cache. block/009 may fail too without the lock.

It is just that concurrent discards can't be allowed any more by
down_write() of rw_semaphore, and block device is really capable of doing
that. It can be thought as one regression of 7607c44c157d ("block: Hold invalidate_lock in
BLKDISCARD ioctl").

Cc Jan Kara and Shin'ichiro Kawasaki.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC 5/5] block: implement io_uring discard cmd
  2024-08-20  2:36                 ` Ming Lei
@ 2024-08-20 16:30                   ` Jens Axboe
  2024-08-20 17:19                     ` Pavel Begunkov
  0 siblings, 1 reply; 26+ messages in thread
From: Jens Axboe @ 2024-08-20 16:30 UTC (permalink / raw)
  To: Ming Lei
  Cc: Pavel Begunkov, io-uring, Conrad Meyer, linux-block, linux-mm,
	Jan Kara, Shin'ichiro Kawasaki

On 8/19/24 8:36 PM, Ming Lei wrote:
> On Mon, Aug 19, 2024 at 02:01:21PM -0600, Jens Axboe wrote:
>> On 8/15/24 7:45 PM, Ming Lei wrote:
>>> On Thu, Aug 15, 2024 at 07:24:16PM -0600, Jens Axboe wrote:
>>>> On 8/15/24 5:44 PM, Ming Lei wrote:
>>>>> On Thu, Aug 15, 2024 at 06:11:13PM +0100, Pavel Begunkov wrote:
>>>>>> On 8/15/24 15:33, Jens Axboe wrote:
>>>>>>> On 8/14/24 7:42 PM, Ming Lei wrote:
>>>>>>>> On Wed, Aug 14, 2024 at 6:46?PM Pavel Begunkov <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> Add ->uring_cmd callback for block device files and use it to implement
>>>>>>>>> asynchronous discard. Normally, it first tries to execute the command
>>>>>>>>> from non-blocking context, which we limit to a single bio because
>>>>>>>>> otherwise one of sub-bios may need to wait for other bios, and we don't
>>>>>>>>> want to deal with partial IO. If non-blocking attempt fails, we'll retry
>>>>>>>>> it in a blocking context.
>>>>>>>>>
>>>>>>>>> Suggested-by: Conrad Meyer <[email protected]>
>>>>>>>>> Signed-off-by: Pavel Begunkov <[email protected]>
>>>>>>>>> ---
>>>>>>>>>   block/blk.h             |  1 +
>>>>>>>>>   block/fops.c            |  2 +
>>>>>>>>>   block/ioctl.c           | 94 +++++++++++++++++++++++++++++++++++++++++
>>>>>>>>>   include/uapi/linux/fs.h |  2 +
>>>>>>>>>   4 files changed, 99 insertions(+)
>>>>>>>>>
>>>>>>>>> diff --git a/block/blk.h b/block/blk.h
>>>>>>>>> index e180863f918b..5178c5ba6852 100644
>>>>>>>>> --- a/block/blk.h
>>>>>>>>> +++ b/block/blk.h
>>>>>>>>> @@ -571,6 +571,7 @@ blk_mode_t file_to_blk_mode(struct file *file);
>>>>>>>>>   int truncate_bdev_range(struct block_device *bdev, blk_mode_t mode,
>>>>>>>>>                  loff_t lstart, loff_t lend);
>>>>>>>>>   long blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg);
>>>>>>>>> +int blkdev_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags);
>>>>>>>>>   long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg);
>>>>>>>>>
>>>>>>>>>   extern const struct address_space_operations def_blk_aops;
>>>>>>>>> diff --git a/block/fops.c b/block/fops.c
>>>>>>>>> index 9825c1713a49..8154b10b5abf 100644
>>>>>>>>> --- a/block/fops.c
>>>>>>>>> +++ b/block/fops.c
>>>>>>>>> @@ -17,6 +17,7 @@
>>>>>>>>>   #include <linux/fs.h>
>>>>>>>>>   #include <linux/iomap.h>
>>>>>>>>>   #include <linux/module.h>
>>>>>>>>> +#include <linux/io_uring/cmd.h>
>>>>>>>>>   #include "blk.h"
>>>>>>>>>
>>>>>>>>>   static inline struct inode *bdev_file_inode(struct file *file)
>>>>>>>>> @@ -873,6 +874,7 @@ const struct file_operations def_blk_fops = {
>>>>>>>>>          .splice_read    = filemap_splice_read,
>>>>>>>>>          .splice_write   = iter_file_splice_write,
>>>>>>>>>          .fallocate      = blkdev_fallocate,
>>>>>>>>> +       .uring_cmd      = blkdev_uring_cmd,
>>>>>>>>
>>>>>>>> Just be curious, we have IORING_OP_FALLOCATE already for sending
>>>>>>>> discard to block device, why is .uring_cmd added for this purpose?
>>>>>>
>>>>>> Which is a good question, I haven't thought about it, but I tend to
>>>>>> agree with Jens. Because vfs_fallocate is created synchronous
>>>>>> IORING_OP_FALLOCATE is slow for anything but pretty large requests.
>>>>>> Probably can be patched up, which would  involve changing the
>>>>>> fops->fallocate protot, but I'm not sure async there makes sense
>>>>>> outside of bdev (?), and cmd approach is simpler, can be made
>>>>>> somewhat more efficient (1 less layer in the way), and it's not
>>>>>> really something completely new since we have it in ioctl.
>>>>>
>>>>> Yeah, we have ioctl(DISCARD), which acquires filemap_invalidate_lock,
>>>>> same with blkdev_fallocate().
>>>>>
>>>>> But this patch drops this exclusive lock, so it becomes async friendly,
>>>>> but may cause stale page cache. However, if the lock is required, it can't
>>>>> be efficient anymore and io-wq may be inevitable, :-)
>>>>
>>>> If you want to grab the lock, you can still opportunistically grab it.
>>>> For (by far) the common case, you'll get it, and you can still do it
>>>> inline.
>>>
>>> If the lock is grabbed in the whole cmd lifetime, it is basically one sync
>>> interface cause there is at most one async discard cmd in-flight for each
>>> device.
>>
>> Oh for sure, you could not do that anyway as you'd be holding a lock
>> across the syscall boundary, which isn't allowed.
> 
> Indeed.
> 
>>
>>> Meantime the handling has to move to io-wq for avoiding to block current
>>> context, the interface becomes same with IORING_OP_FALLOCATE?
>>
>> I think the current truncate is overkill, we should be able to get by
>> without. And no, I will not entertain an option that's "oh just punt it
>> to io-wq".
> 
> BTW, the truncate is added by 351499a172c0 ("block: Invalidate cache on discard v2"),
> and block/009 serves as regression test for covering page cache
> coherency and discard.
> 
> Here the issue is actually related with the exclusive lock of
> filemap_invalidate_lock(). IMO, it is reasonable to prevent page read during
> discard for not polluting page cache. block/009 may fail too without the lock.
> 
> It is just that concurrent discards can't be allowed any more by
> down_write() of rw_semaphore, and block device is really capable of doing
> that. It can be thought as one regression of 7607c44c157d ("block: Hold invalidate_lock in
> BLKDISCARD ioctl").
> 
> Cc Jan Kara and Shin'ichiro Kawasaki.

Honestly I just think that's nonsense. It's like mixing direct and
buffered writes. Can you get corruption? Yes you most certainly can.
There should be no reason why we can't run discards without providing
page cache coherency. The sync interface attempts to do that, but that
doesn't mean that an async (or a different sync one, if that made sense)
should.

If you do discards to the same range as you're doing buffered IO, you
get to keep both potentially pieces. Fact is that most folks are doing
dio for performant IO exactly because buffered writes tend to be
horrible, and you could certainly use that with async discards and have
the application manage it just fine.

So I really think any attempts to provide page cache synchronization for
this is futile. And the existing sync one looks pretty abysmal, but it
doesn't really matter as it's a sync interfce. If one were to do
something about it for an async interface, then just pretend it's dio
and increment i_dio_count.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC 5/5] block: implement io_uring discard cmd
  2024-08-20 16:30                   ` Jens Axboe
@ 2024-08-20 17:19                     ` Pavel Begunkov
  2024-08-21  2:55                       ` Ming Lei
  0 siblings, 1 reply; 26+ messages in thread
From: Pavel Begunkov @ 2024-08-20 17:19 UTC (permalink / raw)
  To: Jens Axboe, Ming Lei
  Cc: io-uring, Conrad Meyer, linux-block, linux-mm, Jan Kara,
	Shin'ichiro Kawasaki

On 8/20/24 17:30, Jens Axboe wrote:
> On 8/19/24 8:36 PM, Ming Lei wrote:
>> On Mon, Aug 19, 2024 at 02:01:21PM -0600, Jens Axboe wrote:
>>> On 8/15/24 7:45 PM, Ming Lei wrote:
...
>>>> Meantime the handling has to move to io-wq for avoiding to block current
>>>> context, the interface becomes same with IORING_OP_FALLOCATE?
>>>
>>> I think the current truncate is overkill, we should be able to get by
>>> without. And no, I will not entertain an option that's "oh just punt it
>>> to io-wq".
>>
>> BTW, the truncate is added by 351499a172c0 ("block: Invalidate cache on discard v2"),
>> and block/009 serves as regression test for covering page cache
>> coherency and discard.
>>
>> Here the issue is actually related with the exclusive lock of
>> filemap_invalidate_lock(). IMO, it is reasonable to prevent page read during
>> discard for not polluting page cache. block/009 may fail too without the lock.
>>
>> It is just that concurrent discards can't be allowed any more by
>> down_write() of rw_semaphore, and block device is really capable of doing
>> that. It can be thought as one regression of 7607c44c157d ("block: Hold invalidate_lock in
>> BLKDISCARD ioctl").
>>
>> Cc Jan Kara and Shin'ichiro Kawasaki.
> 
> Honestly I just think that's nonsense. It's like mixing direct and
> buffered writes. Can you get corruption? Yes you most certainly can.
> There should be no reason why we can't run discards without providing
> page cache coherency. The sync interface attempts to do that, but that
> doesn't mean that an async (or a different sync one, if that made sense)
> should.

I don't see it as a problem either, it's a new interface, just need
to be upfront on what guarantees it provides (one more reason why
not fallocate), I'll elaborate on it in the commit message and so.

I think a reasonable thing to do is to have one rule for all write-like
operations starting from plain writes, which is currently allowing races
to happen and shift it to the user. Purely in theory we can get inventive
with likes of range lock trees, but that's unwarranted for all sorts of
reasons.

> If you do discards to the same range as you're doing buffered IO, you
> get to keep both potentially pieces. Fact is that most folks are doing
> dio for performant IO exactly because buffered writes tend to be
> horrible, and you could certainly use that with async discards and have
> the application manage it just fine.
> 
> So I really think any attempts to provide page cache synchronization for
> this is futile. And the existing sync one looks pretty abysmal, but it
> doesn't really matter as it's a sync interfce. If one were to do

It should be a pain for sync as well, you can't even spin another process
and parallelise this way.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC 5/5] block: implement io_uring discard cmd
  2024-08-20 17:19                     ` Pavel Begunkov
@ 2024-08-21  2:55                       ` Ming Lei
  0 siblings, 0 replies; 26+ messages in thread
From: Ming Lei @ 2024-08-21  2:55 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: Jens Axboe, io-uring, Conrad Meyer, linux-block, linux-mm,
	Jan Kara, Shin'ichiro Kawasaki

On Tue, Aug 20, 2024 at 06:19:00PM +0100, Pavel Begunkov wrote:
> On 8/20/24 17:30, Jens Axboe wrote:
> > On 8/19/24 8:36 PM, Ming Lei wrote:
> > > On Mon, Aug 19, 2024 at 02:01:21PM -0600, Jens Axboe wrote:
> > > > On 8/15/24 7:45 PM, Ming Lei wrote:
> ...
> > > > > Meantime the handling has to move to io-wq for avoiding to block current
> > > > > context, the interface becomes same with IORING_OP_FALLOCATE?
> > > > 
> > > > I think the current truncate is overkill, we should be able to get by
> > > > without. And no, I will not entertain an option that's "oh just punt it
> > > > to io-wq".
> > > 
> > > BTW, the truncate is added by 351499a172c0 ("block: Invalidate cache on discard v2"),
> > > and block/009 serves as regression test for covering page cache
> > > coherency and discard.
> > > 
> > > Here the issue is actually related with the exclusive lock of
> > > filemap_invalidate_lock(). IMO, it is reasonable to prevent page read during
> > > discard for not polluting page cache. block/009 may fail too without the lock.
> > > 
> > > It is just that concurrent discards can't be allowed any more by
> > > down_write() of rw_semaphore, and block device is really capable of doing
> > > that. It can be thought as one regression of 7607c44c157d ("block: Hold invalidate_lock in
> > > BLKDISCARD ioctl").
> > > 
> > > Cc Jan Kara and Shin'ichiro Kawasaki.
> > 
> > Honestly I just think that's nonsense. It's like mixing direct and
> > buffered writes. Can you get corruption? Yes you most certainly can.
> > There should be no reason why we can't run discards without providing
> > page cache coherency. The sync interface attempts to do that, but that
> > doesn't mean that an async (or a different sync one, if that made sense)
> > should.
> 
> I don't see it as a problem either, it's a new interface, just need
> to be upfront on what guarantees it provides (one more reason why
> not fallocate), I'll elaborate on it in the commit message and so.

Fair enough.

> 
> I think a reasonable thing to do is to have one rule for all write-like
> operations starting from plain writes, which is currently allowing races
> to happen and shift it to the user. Purely in theory we can get inventive
> with likes of range lock trees, but that's unwarranted for all sorts of
> reasons.
> 
> > If you do discards to the same range as you're doing buffered IO, you
> > get to keep both potentially pieces. Fact is that most folks are doing
> > dio for performant IO exactly because buffered writes tend to be
> > horrible, and you could certainly use that with async discards and have
> > the application manage it just fine.
> > 
> > So I really think any attempts to provide page cache synchronization for
> > this is futile. And the existing sync one looks pretty abysmal, but it
> > doesn't really matter as it's a sync interfce. If one were to do
> 
> It should be a pain for sync as well, you can't even spin another process
> and parallelise this way.

Yes, this way has degraded some sync discard workloads perf a lot.

Thanks,
Ming


^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2024-08-21  2:55 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-08-14 10:45 [RFC 0/5] implement asynchronous BLKDISCARD via io_uring Pavel Begunkov
2024-08-14 10:45 ` [RFC 1/5] io_uring/cmd: expose iowq to cmds Pavel Begunkov
2024-08-14 10:45 ` [RFC 2/5] io_uring/cmd: give inline space in request " Pavel Begunkov
2024-08-14 10:45 ` [RFC 3/5] filemap: introduce filemap_invalidate_pages Pavel Begunkov
2024-08-14 10:45 ` [RFC 4/5] block: introduce blk_validate_discard() Pavel Begunkov
2024-08-14 10:45 ` [RFC 5/5] block: implement io_uring discard cmd Pavel Begunkov
2024-08-15  1:42   ` Ming Lei
2024-08-15 14:33     ` Jens Axboe
2024-08-15 17:11       ` Pavel Begunkov
2024-08-15 23:44         ` Ming Lei
2024-08-16  1:24           ` Jens Axboe
2024-08-16  1:45             ` Ming Lei
2024-08-16  1:59               ` Pavel Begunkov
2024-08-16  2:08                 ` Ming Lei
2024-08-16  2:16                   ` Pavel Begunkov
2024-08-19 20:02                     ` Jens Axboe
2024-08-19 20:01               ` Jens Axboe
2024-08-20  2:36                 ` Ming Lei
2024-08-20 16:30                   ` Jens Axboe
2024-08-20 17:19                     ` Pavel Begunkov
2024-08-21  2:55                       ` Ming Lei
2024-08-15 14:42   ` Jens Axboe
2024-08-15 15:50 ` [RFC 0/5] implement asynchronous BLKDISCARD via io_uring Jens Axboe
2024-08-15 17:26   ` Pavel Begunkov
2024-08-15 16:15 ` Martin K. Petersen
2024-08-15 17:12   ` Pavel Begunkov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox