public inbox for [email protected]
* [PATCH 1/5] btrfs: remove pointless addition in btrfs_encoded_read
  2024-10-14 17:18 [PATCH v3 0/5] btrfs: encoded reads via io_uring Mark Harmstone
@ 2024-10-14 17:18 ` Mark Harmstone
  0 siblings, 0 replies; 14+ messages in thread
From: Mark Harmstone @ 2024-10-14 17:18 UTC
  To: linux-btrfs, io-uring; +Cc: Mark Harmstone

iocb->ki_pos isn't used after this function, so there's no point in
changing its value.

Signed-off-by: Mark Harmstone <[email protected]>
---
 fs/btrfs/inode.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b1b6564ab68f..b024ebc3dcd6 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9280,7 +9280,7 @@ ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter,
 		ret = btrfs_encoded_read_inline(iocb, iter, start, lockend,
 						&cached_state, extent_start,
 						count, encoded, &unlocked);
-		goto out;
+		goto out_em;
 	}
 
 	/*
@@ -9346,9 +9346,6 @@ ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter,
 						 &unlocked);
 	}
 
-out:
-	if (ret >= 0)
-		iocb->ki_pos += encoded->len;
 out_em:
 	free_extent_map(em);
 out_unlock_extent:
-- 
2.44.2



* [PATCH v4 0/5] btrfs: io_uring interface for encoded reads
@ 2024-10-22 14:50 Mark Harmstone
  2024-10-22 14:50 ` [PATCH 1/5] btrfs: remove pointless addition in btrfs_encoded_read Mark Harmstone
                   ` (6 more replies)
  0 siblings, 7 replies; 14+ messages in thread
From: Mark Harmstone @ 2024-10-22 14:50 UTC
  To: linux-btrfs; +Cc: io-uring, Mark Harmstone

This is version 4 of a patch series to add an io_uring interface for
encoded reads. The principal use case for this is to eventually allow
btrfs send and receive to operate asynchronously, the lack of io_uring
encoded I/O being one of the main blockers for this.

I've written a test program for this, which demonstrates that the ioctl
and io_uring interfaces produce identical results: https://github.com/maharmstone/io_uring-encoded
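
For reference, a minimal sketch of issuing the new command from
userspace might look like the following (assuming liburing and the uapi
definitions from this series; buffer setup, CAP_SYS_ADMIN, and error
handling are omitted):

	struct btrfs_ioctl_encoded_io_args args = {
		.iov = &iov,	/* struct iovec describing the destination */
		.iovcnt = 1,
		.offset = 0,	/* logical file offset to read from */
	};
	struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

	io_uring_prep_rw(IORING_OP_URING_CMD, sqe, fd, NULL, 0, 0);
	sqe->cmd_op = BTRFS_IOC_ENCODED_READ;
	sqe->addr = (__u64)(uintptr_t)&args;	/* kernel reads sqe->addr */
	io_uring_submit(&ring);
	/*
	 * The CQE result is the number of bytes read or a negative errno;
	 * the remaining args fields are written back by the kernel.
	 */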

Changelog:
v4:
* Rewritten to avoid taking function pointer
* Removed nowait parameter, as this could be derived from iocb flags
* Fixed structure not getting properly initialized
* Followed ioctl by capping uncompressed reads at EOF
* Rebased against btrfs/for-next
* Formatting fixes
* Rearranged structs to minimize holes
* Published test program
* Fixed potential data race with userspace
* Changed to use io_uring_cmd_to_pdu helper function
* Added comments for potentially confusing parts of the code

v3:
* Redo of previous versions

Mark Harmstone (5):
  btrfs: remove pointless addition in btrfs_encoded_read
  btrfs: change btrfs_encoded_read so that reading of extent is done by
    caller
  btrfs: don't sleep in btrfs_encoded_read if IOCB_NOWAIT set
  btrfs: move priv off stack in btrfs_encoded_read_regular_fill_pages
  btrfs: add io_uring command for encoded reads

 fs/btrfs/btrfs_inode.h |  13 +-
 fs/btrfs/file.c        |   1 +
 fs/btrfs/inode.c       | 175 ++++++++++++++-------
 fs/btrfs/ioctl.c       | 342 ++++++++++++++++++++++++++++++++++++++++-
 fs/btrfs/ioctl.h       |   2 +
 fs/btrfs/send.c        |   3 +-
 6 files changed, 473 insertions(+), 63 deletions(-)

-- 
2.45.2



* [PATCH 1/5] btrfs: remove pointless addition in btrfs_encoded_read
  2024-10-22 14:50 [PATCH v4 0/5] btrfs: io_uring interface for encoded reads Mark Harmstone
@ 2024-10-22 14:50 ` Mark Harmstone
  2024-10-30  1:46   ` Anand Jain
  2024-10-22 14:50 ` [PATCH 2/5] btrfs: change btrfs_encoded_read so that reading of extent is done by caller Mark Harmstone
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 14+ messages in thread
From: Mark Harmstone @ 2024-10-22 14:50 UTC
  To: linux-btrfs; +Cc: io-uring, Mark Harmstone

iocb->ki_pos isn't used after this function, so there's no point in
changing its value.

Signed-off-by: Mark Harmstone <[email protected]>
---
 fs/btrfs/inode.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 7c5ef2c5c7e8..94098a4c782d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9252,7 +9252,7 @@ ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter,
 		ret = btrfs_encoded_read_inline(iocb, iter, start, lockend,
 						&cached_state, extent_start,
 						count, encoded, &unlocked);
-		goto out;
+		goto out_em;
 	}
 
 	/*
@@ -9318,9 +9318,6 @@ ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter,
 						 &unlocked);
 	}
 
-out:
-	if (ret >= 0)
-		iocb->ki_pos += encoded->len;
 out_em:
 	free_extent_map(em);
 out_unlock_extent:
-- 
2.45.2



* [PATCH 2/5] btrfs: change btrfs_encoded_read so that reading of extent is done by caller
  2024-10-22 14:50 [PATCH v4 0/5] btrfs: io_uring interface for encoded reads Mark Harmstone
  2024-10-22 14:50 ` [PATCH 1/5] btrfs: remove pointless addition in btrfs_encoded_read Mark Harmstone
@ 2024-10-22 14:50 ` Mark Harmstone
  2024-10-22 14:50 ` [PATCH 3/5] btrfs: don't sleep in btrfs_encoded_read if IOCB_NOWAIT set Mark Harmstone
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 14+ messages in thread
From: Mark Harmstone @ 2024-10-22 14:50 UTC
  To: linux-btrfs; +Cc: io-uring, Mark Harmstone

Change the behaviour of btrfs_encoded_read so that if it needs to read
an extent from disk, it leaves the extent and inode locked and returns
-EIOCBQUEUED. The caller is then responsible for doing the I/O via
btrfs_encoded_read_regular and unlocking the extent and inode.
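
In sketch form, the caller-side contract after this change is as
follows (condensed from the ioctl path in the diff below):

	ret = btrfs_encoded_read(&kiocb, &iter, &args, &cached_state,
				 &disk_bytenr, &disk_io_size);
	if (ret == -EIOCBQUEUED) {
		/* Extent and inode are still locked; do the read ourselves. */
		ret = btrfs_encoded_read_regular(&kiocb, &iter, start, lockend,
						 &cached_state, disk_bytenr,
						 disk_io_size, count,
						 args.compression, &unlocked);
		if (!unlocked) {
			unlock_extent(io_tree, start, lockend, &cached_state);
			btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED);
		}
	}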

Signed-off-by: Mark Harmstone <[email protected]>
---
 fs/btrfs/btrfs_inode.h | 10 +++++++-
 fs/btrfs/inode.c       | 58 ++++++++++++++++++++----------------------
 fs/btrfs/ioctl.c       | 33 +++++++++++++++++++++++-
 3 files changed, 69 insertions(+), 32 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 157fd3f4cb33..ab1fbde97cee 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -615,7 +615,15 @@ int btrfs_encoded_read_regular_fill_pages(struct btrfs_inode *inode,
 					  u64 disk_bytenr, u64 disk_io_size,
 					  struct page **pages);
 ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter,
-			   struct btrfs_ioctl_encoded_io_args *encoded);
+			   struct btrfs_ioctl_encoded_io_args *encoded,
+			   struct extent_state **cached_state,
+			   u64 *disk_bytenr, u64 *disk_io_size);
+ssize_t btrfs_encoded_read_regular(struct kiocb *iocb, struct iov_iter *iter,
+				   u64 start, u64 lockend,
+				   struct extent_state **cached_state,
+				   u64 disk_bytenr, u64 disk_io_size,
+				   size_t count, bool compressed,
+				   bool *unlocked);
 ssize_t btrfs_do_encoded_write(struct kiocb *iocb, struct iov_iter *from,
 			       const struct btrfs_ioctl_encoded_io_args *encoded);
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 94098a4c782d..0a4dc85769c7 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9123,13 +9123,12 @@ int btrfs_encoded_read_regular_fill_pages(struct btrfs_inode *inode,
 	return blk_status_to_errno(READ_ONCE(priv.status));
 }
 
-static ssize_t btrfs_encoded_read_regular(struct kiocb *iocb,
-					  struct iov_iter *iter,
-					  u64 start, u64 lockend,
-					  struct extent_state **cached_state,
-					  u64 disk_bytenr, u64 disk_io_size,
-					  size_t count, bool compressed,
-					  bool *unlocked)
+ssize_t btrfs_encoded_read_regular(struct kiocb *iocb, struct iov_iter *iter,
+				   u64 start, u64 lockend,
+				   struct extent_state **cached_state,
+				   u64 disk_bytenr, u64 disk_io_size,
+				   size_t count, bool compressed,
+				   bool *unlocked)
 {
 	struct btrfs_inode *inode = BTRFS_I(file_inode(iocb->ki_filp));
 	struct extent_io_tree *io_tree = &inode->io_tree;
@@ -9190,15 +9189,16 @@ static ssize_t btrfs_encoded_read_regular(struct kiocb *iocb,
 }
 
 ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter,
-			   struct btrfs_ioctl_encoded_io_args *encoded)
+			   struct btrfs_ioctl_encoded_io_args *encoded,
+			   struct extent_state **cached_state,
+			   u64 *disk_bytenr, u64 *disk_io_size)
 {
 	struct btrfs_inode *inode = BTRFS_I(file_inode(iocb->ki_filp));
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 	struct extent_io_tree *io_tree = &inode->io_tree;
 	ssize_t ret;
 	size_t count = iov_iter_count(iter);
-	u64 start, lockend, disk_bytenr, disk_io_size;
-	struct extent_state *cached_state = NULL;
+	u64 start, lockend;
 	struct extent_map *em;
 	bool unlocked = false;
 
@@ -9224,13 +9224,13 @@ ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter,
 					       lockend - start + 1);
 		if (ret)
 			goto out_unlock_inode;
-		lock_extent(io_tree, start, lockend, &cached_state);
+		lock_extent(io_tree, start, lockend, cached_state);
 		ordered = btrfs_lookup_ordered_range(inode, start,
 						     lockend - start + 1);
 		if (!ordered)
 			break;
 		btrfs_put_ordered_extent(ordered);
-		unlock_extent(io_tree, start, lockend, &cached_state);
+		unlock_extent(io_tree, start, lockend, cached_state);
 		cond_resched();
 	}
 
@@ -9250,7 +9250,7 @@ ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter,
 		free_extent_map(em);
 		em = NULL;
 		ret = btrfs_encoded_read_inline(iocb, iter, start, lockend,
-						&cached_state, extent_start,
+						cached_state, extent_start,
 						count, encoded, &unlocked);
 		goto out_em;
 	}
@@ -9263,12 +9263,12 @@ ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter,
 			     inode->vfs_inode.i_size) - iocb->ki_pos;
 	if (em->disk_bytenr == EXTENT_MAP_HOLE ||
 	    (em->flags & EXTENT_FLAG_PREALLOC)) {
-		disk_bytenr = EXTENT_MAP_HOLE;
+		*disk_bytenr = EXTENT_MAP_HOLE;
 		count = min_t(u64, count, encoded->len);
 		encoded->len = count;
 		encoded->unencoded_len = count;
 	} else if (extent_map_is_compressed(em)) {
-		disk_bytenr = em->disk_bytenr;
+		*disk_bytenr = em->disk_bytenr;
 		/*
 		 * Bail if the buffer isn't large enough to return the whole
 		 * compressed extent.
@@ -9277,7 +9277,7 @@ ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter,
 			ret = -ENOBUFS;
 			goto out_em;
 		}
-		disk_io_size = em->disk_num_bytes;
+		*disk_io_size = em->disk_num_bytes;
 		count = em->disk_num_bytes;
 		encoded->unencoded_len = em->ram_bytes;
 		encoded->unencoded_offset = iocb->ki_pos - (em->start - em->offset);
@@ -9287,44 +9287,42 @@ ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter,
 			goto out_em;
 		encoded->compression = ret;
 	} else {
-		disk_bytenr = extent_map_block_start(em) + (start - em->start);
+		*disk_bytenr = extent_map_block_start(em) + (start - em->start);
 		if (encoded->len > count)
 			encoded->len = count;
 		/*
 		 * Don't read beyond what we locked. This also limits the page
 		 * allocations that we'll do.
 		 */
-		disk_io_size = min(lockend + 1, iocb->ki_pos + encoded->len) - start;
-		count = start + disk_io_size - iocb->ki_pos;
+		*disk_io_size = min(lockend + 1, iocb->ki_pos + encoded->len) - start;
+		count = start + *disk_io_size - iocb->ki_pos;
 		encoded->len = count;
 		encoded->unencoded_len = count;
-		disk_io_size = ALIGN(disk_io_size, fs_info->sectorsize);
+		*disk_io_size = ALIGN(*disk_io_size, fs_info->sectorsize);
 	}
 	free_extent_map(em);
 	em = NULL;
 
-	if (disk_bytenr == EXTENT_MAP_HOLE) {
-		unlock_extent(io_tree, start, lockend, &cached_state);
+	if (*disk_bytenr == EXTENT_MAP_HOLE) {
+		unlock_extent(io_tree, start, lockend, cached_state);
 		btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED);
 		unlocked = true;
 		ret = iov_iter_zero(count, iter);
 		if (ret != count)
 			ret = -EFAULT;
 	} else {
-		ret = btrfs_encoded_read_regular(iocb, iter, start, lockend,
-						 &cached_state, disk_bytenr,
-						 disk_io_size, count,
-						 encoded->compression,
-						 &unlocked);
+		ret = -EIOCBQUEUED;
+		goto out_em;
 	}
 
 out_em:
 	free_extent_map(em);
 out_unlock_extent:
-	if (!unlocked)
-		unlock_extent(io_tree, start, lockend, &cached_state);
+	/* Leave inode and extent locked if we need to do a read */
+	if (!unlocked && ret != -EIOCBQUEUED)
+		unlock_extent(io_tree, start, lockend, cached_state);
 out_unlock_inode:
-	if (!unlocked)
+	if (!unlocked && ret != -EIOCBQUEUED)
 		btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED);
 	return ret;
 }
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 28b9b7fda578..d502b31010bc 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -4513,12 +4513,17 @@ static int btrfs_ioctl_encoded_read(struct file *file, void __user *argp,
 	size_t copy_end_kernel = offsetofend(struct btrfs_ioctl_encoded_io_args,
 					     flags);
 	size_t copy_end;
+	struct btrfs_inode *inode = BTRFS_I(file_inode(file));
+	struct btrfs_fs_info *fs_info = inode->root->fs_info;
+	struct extent_io_tree *io_tree = &inode->io_tree;
 	struct iovec iovstack[UIO_FASTIOV];
 	struct iovec *iov = iovstack;
 	struct iov_iter iter;
 	loff_t pos;
 	struct kiocb kiocb;
 	ssize_t ret;
+	u64 disk_bytenr, disk_io_size;
+	struct extent_state *cached_state = NULL;
 
 	if (!capable(CAP_SYS_ADMIN)) {
 		ret = -EPERM;
@@ -4571,7 +4576,33 @@ static int btrfs_ioctl_encoded_read(struct file *file, void __user *argp,
 	init_sync_kiocb(&kiocb, file);
 	kiocb.ki_pos = pos;
 
-	ret = btrfs_encoded_read(&kiocb, &iter, &args);
+	ret = btrfs_encoded_read(&kiocb, &iter, &args, &cached_state,
+				 &disk_bytenr, &disk_io_size);
+
+	if (ret == -EIOCBQUEUED) {
+		bool unlocked = false;
+		u64 start, lockend, count;
+
+		start = ALIGN_DOWN(kiocb.ki_pos, fs_info->sectorsize);
+		lockend = start + BTRFS_MAX_UNCOMPRESSED - 1;
+
+		if (args.compression)
+			count = disk_io_size;
+		else
+			count = args.len;
+
+		ret = btrfs_encoded_read_regular(&kiocb, &iter, start, lockend,
+						 &cached_state, disk_bytenr,
+						 disk_io_size, count,
+						 args.compression,
+						 &unlocked);
+
+		if (!unlocked) {
+			unlock_extent(io_tree, start, lockend, &cached_state);
+			btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED);
+		}
+	}
+
 	if (ret >= 0) {
 		fsnotify_access(file);
 		if (copy_to_user(argp + copy_end,
-- 
2.45.2



* [PATCH 3/5] btrfs: don't sleep in btrfs_encoded_read if IOCB_NOWAIT set
  2024-10-22 14:50 [PATCH v4 0/5] btrfs: io_uring interface for encoded reads Mark Harmstone
  2024-10-22 14:50 ` [PATCH 1/5] btrfs: remove pointless addition in btrfs_encoded_read Mark Harmstone
  2024-10-22 14:50 ` [PATCH 2/5] btrfs: change btrfs_encoded_read so that reading of extent is done by caller Mark Harmstone
@ 2024-10-22 14:50 ` Mark Harmstone
  2024-10-22 14:50 ` [PATCH 4/5] btrfs: move priv off stack in btrfs_encoded_read_regular_fill_pages Mark Harmstone
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 14+ messages in thread
From: Mark Harmstone @ 2024-10-22 14:50 UTC
  To: linux-btrfs; +Cc: io-uring, Mark Harmstone

Change btrfs_encoded_read so that it returns -EAGAIN rather than
sleeping if IOCB_NOWAIT is set in iocb->ki_flags.
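
A hypothetical caller that must not block would then do something like
the following (illustrative only; io_uring similarly retries from a
context that may sleep when it sees -EAGAIN):

	kiocb.ki_flags |= IOCB_NOWAIT;
	ret = btrfs_encoded_read(&kiocb, &iter, &args, &cached_state,
				 &disk_bytenr, &disk_io_size);
	if (ret == -EAGAIN) {
		/* Punt to a context that may sleep, clear IOCB_NOWAIT, retry. */
	}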

Signed-off-by: Mark Harmstone <[email protected]>
---
 fs/btrfs/inode.c | 54 ++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 43 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 0a4dc85769c7..0c0753f20d54 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8984,12 +8984,16 @@ static ssize_t btrfs_encoded_read_inline(
 	unsigned long ptr;
 	void *tmp;
 	ssize_t ret;
+	bool nowait = iocb->ki_flags & IOCB_NOWAIT;
 
 	path = btrfs_alloc_path();
 	if (!path) {
 		ret = -ENOMEM;
 		goto out;
 	}
+
+	path->nowait = !!nowait;
+
 	ret = btrfs_lookup_file_extent(NULL, root, path, btrfs_ino(inode),
 				       extent_start, 0);
 	if (ret) {
@@ -9200,11 +9204,15 @@ ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter,
 	size_t count = iov_iter_count(iter);
 	u64 start, lockend;
 	struct extent_map *em;
+	bool nowait = iocb->ki_flags & IOCB_NOWAIT;
 	bool unlocked = false;
 
 	file_accessed(iocb->ki_filp);
 
-	btrfs_inode_lock(inode, BTRFS_ILOCK_SHARED);
+	ret = btrfs_inode_lock(inode,
+			       BTRFS_ILOCK_SHARED | (nowait ? BTRFS_ILOCK_TRY : 0));
+	if (ret)
+		return ret;
 
 	if (iocb->ki_pos >= inode->vfs_inode.i_size) {
 		btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED);
@@ -9217,21 +9225,45 @@ ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter,
 	 */
 	lockend = start + BTRFS_MAX_UNCOMPRESSED - 1;
 
-	for (;;) {
+	if (nowait) {
 		struct btrfs_ordered_extent *ordered;
 
-		ret = btrfs_wait_ordered_range(inode, start,
-					       lockend - start + 1);
-		if (ret)
+		if (filemap_range_needs_writeback(inode->vfs_inode.i_mapping,
+						  start, lockend)) {
+			ret = -EAGAIN;
 			goto out_unlock_inode;
-		lock_extent(io_tree, start, lockend, cached_state);
+		}
+
+		if (!try_lock_extent(io_tree, start, lockend, cached_state)) {
+			ret = -EAGAIN;
+			goto out_unlock_inode;
+		}
+
 		ordered = btrfs_lookup_ordered_range(inode, start,
 						     lockend - start + 1);
-		if (!ordered)
-			break;
-		btrfs_put_ordered_extent(ordered);
-		unlock_extent(io_tree, start, lockend, cached_state);
-		cond_resched();
+		if (ordered) {
+			btrfs_put_ordered_extent(ordered);
+			unlock_extent(io_tree, start, lockend, cached_state);
+			ret = -EAGAIN;
+			goto out_unlock_inode;
+		}
+	} else {
+		for (;;) {
+			struct btrfs_ordered_extent *ordered;
+
+			ret = btrfs_wait_ordered_range(inode, start,
+						lockend - start + 1);
+			if (ret)
+				goto out_unlock_inode;
+			lock_extent(io_tree, start, lockend, cached_state);
+			ordered = btrfs_lookup_ordered_range(inode, start,
+							lockend - start + 1);
+			if (!ordered)
+				break;
+			btrfs_put_ordered_extent(ordered);
+			unlock_extent(io_tree, start, lockend, cached_state);
+			cond_resched();
+		}
 	}
 
 	em = btrfs_get_extent(inode, NULL, start, lockend - start + 1);
-- 
2.45.2



* [PATCH 4/5] btrfs: move priv off stack in btrfs_encoded_read_regular_fill_pages
  2024-10-22 14:50 [PATCH v4 0/5] btrfs: io_uring interface for encoded reads Mark Harmstone
                   ` (2 preceding siblings ...)
  2024-10-22 14:50 ` [PATCH 3/5] btrfs: don't sleep in btrfs_encoded_read if IOCB_NOWAIT set Mark Harmstone
@ 2024-10-22 14:50 ` Mark Harmstone
  2024-10-22 14:50 ` [PATCH 5/5] btrfs: add io_uring command for encoded reads Mark Harmstone
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 14+ messages in thread
From: Mark Harmstone @ 2024-10-22 14:50 UTC
  To: linux-btrfs; +Cc: io-uring, Mark Harmstone

Change btrfs_encoded_read_regular_fill_pages so that the priv struct is
dynamically allocated rather than stored on the stack, in preparation
for adding an asynchronous mode to the function. Once the function can
return before its bios complete, priv has to outlive the submitter's
stack frame.
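
As a sketch of where this is heading (the final patch in this series
adds it), the endio handler becomes responsible for freeing priv when
there is an asynchronous submitter that has already returned:

	if (atomic_dec_return(&priv->pending) == 0) {
		int err = blk_status_to_errno(READ_ONCE(priv->status));

		if (priv->uring_ctx) {
			btrfs_uring_read_extent_endio(priv->uring_ctx, err);
			kfree(priv);	/* no one is left waiting on the stack */
		} else {
			wake_up(&priv->wait);
		}
	}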

Signed-off-by: Mark Harmstone <[email protected]>
---
 fs/btrfs/inode.c | 29 ++++++++++++++++++-----------
 1 file changed, 18 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 0c0753f20d54..5aedb85696f4 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9086,16 +9086,21 @@ int btrfs_encoded_read_regular_fill_pages(struct btrfs_inode *inode,
 					  struct page **pages)
 {
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
-	struct btrfs_encoded_read_private priv = {
-		.pending = ATOMIC_INIT(1),
-	};
+	struct btrfs_encoded_read_private *priv;
 	unsigned long i = 0;
 	struct btrfs_bio *bbio;
+	int ret;
 
-	init_waitqueue_head(&priv.wait);
+	priv = kmalloc(sizeof(struct btrfs_encoded_read_private), GFP_NOFS);
+	if (!priv)
+		return -ENOMEM;
+
+	init_waitqueue_head(&priv->wait);
+	atomic_set(&priv->pending, 1);
+	priv->status = 0;
 
 	bbio = btrfs_bio_alloc(BIO_MAX_VECS, REQ_OP_READ, fs_info,
-			       btrfs_encoded_read_endio, &priv);
+			       btrfs_encoded_read_endio, priv);
 	bbio->bio.bi_iter.bi_sector = disk_bytenr >> SECTOR_SHIFT;
 	bbio->inode = inode;
 
@@ -9103,11 +9108,11 @@ int btrfs_encoded_read_regular_fill_pages(struct btrfs_inode *inode,
 		size_t bytes = min_t(u64, disk_io_size, PAGE_SIZE);
 
 		if (bio_add_page(&bbio->bio, pages[i], bytes, 0) < bytes) {
-			atomic_inc(&priv.pending);
+			atomic_inc(&priv->pending);
 			btrfs_submit_bbio(bbio, 0);
 
 			bbio = btrfs_bio_alloc(BIO_MAX_VECS, REQ_OP_READ, fs_info,
-					       btrfs_encoded_read_endio, &priv);
+					       btrfs_encoded_read_endio, priv);
 			bbio->bio.bi_iter.bi_sector = disk_bytenr >> SECTOR_SHIFT;
 			bbio->inode = inode;
 			continue;
@@ -9118,13 +9123,15 @@ int btrfs_encoded_read_regular_fill_pages(struct btrfs_inode *inode,
 		disk_io_size -= bytes;
 	} while (disk_io_size);
 
-	atomic_inc(&priv.pending);
+	atomic_inc(&priv->pending);
 	btrfs_submit_bbio(bbio, 0);
 
-	if (atomic_dec_return(&priv.pending))
-		io_wait_event(priv.wait, !atomic_read(&priv.pending));
+	if (atomic_dec_return(&priv->pending))
+		io_wait_event(priv->wait, !atomic_read(&priv->pending));
 	/* See btrfs_encoded_read_endio() for ordering. */
-	return blk_status_to_errno(READ_ONCE(priv.status));
+	ret = blk_status_to_errno(READ_ONCE(priv->status));
+	kfree(priv);
+	return ret;
 }
 
 ssize_t btrfs_encoded_read_regular(struct kiocb *iocb, struct iov_iter *iter,
-- 
2.45.2



* [PATCH 5/5] btrfs: add io_uring command for encoded reads
  2024-10-22 14:50 [PATCH v4 0/5] btrfs: io_uring interface for encoded reads Mark Harmstone
                   ` (3 preceding siblings ...)
  2024-10-22 14:50 ` [PATCH 4/5] btrfs: move priv off stack in btrfs_encoded_read_regular_fill_pages Mark Harmstone
@ 2024-10-22 14:50 ` Mark Harmstone
  2024-10-29 21:51   ` David Sterba
  2024-10-30  0:59   ` Pavel Begunkov
  2024-10-29 21:29 ` [PATCH v4 0/5] btrfs: io_uring interface " David Sterba
  2024-10-30  1:22 ` Anand Jain
  6 siblings, 2 replies; 14+ messages in thread
From: Mark Harmstone @ 2024-10-22 14:50 UTC
  To: linux-btrfs; +Cc: io-uring, Mark Harmstone

Adds an io_uring command for encoded reads, using the same interface as
the existing BTRFS_IOC_ENCODED_READ ioctl.

btrfs_uring_encoded_read is an io_uring version of
btrfs_ioctl_encoded_read, which validates the user input and calls
btrfs_encoded_read to read the appropriate metadata. If we determine
that we need to read an extent from disk, we call
btrfs_encoded_read_regular_fill_pages through btrfs_uring_read_extent to
prepare the bio.

The existing btrfs_encoded_read_regular_fill_pages is changed so that if
it is passed a uring_ctx value, rather than waking up any waiting
threads it calls btrfs_uring_read_extent_endio. This in turn copies the
read data back to userspace, and calls io_uring_cmd_done to complete the
io_uring command.

Because we're potentially doing a non-blocking read,
btrfs_uring_read_extent doesn't clean up after itself if it returns
-EIOCBQUEUED. Instead, it allocates a priv struct, populates the fields
there that we will need to unlock the inode and free our allocations,
and defers this to btrfs_uring_read_finished, which gets called when
the bio completes.
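
The resulting call flow, summarized from the diff below:

	submission:
	  btrfs_uring_encoded_read()
	    -> btrfs_encoded_read()		(metadata; -EIOCBQUEUED if a
						 disk read is needed)
	    -> btrfs_uring_read_extent()	(allocate priv)
	    -> btrfs_encoded_read_regular_fill_pages()	(submit bios)
	completion (last bio):
	  btrfs_encoded_read_endio()
	    -> btrfs_uring_read_extent_endio()
	    -> io_uring_cmd_complete_in_task()
	    -> btrfs_uring_read_finished()	(copy to userspace, unlock,
						 io_uring_cmd_done(), free priv)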

Signed-off-by: Mark Harmstone <[email protected]>
---
 fs/btrfs/btrfs_inode.h |   3 +-
 fs/btrfs/file.c        |   1 +
 fs/btrfs/inode.c       |  43 ++++--
 fs/btrfs/ioctl.c       | 309 +++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/ioctl.h       |   2 +
 fs/btrfs/send.c        |   3 +-
 6 files changed, 349 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index ab1fbde97cee..f551444b498e 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -613,7 +613,8 @@ int btrfs_encoded_io_compression_from_extent(struct btrfs_fs_info *fs_info,
 					     int compress_type);
 int btrfs_encoded_read_regular_fill_pages(struct btrfs_inode *inode,
 					  u64 disk_bytenr, u64 disk_io_size,
-					  struct page **pages);
+					  struct page **pages,
+					  void *uring_ctx);
 ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter,
 			   struct btrfs_ioctl_encoded_io_args *encoded,
 			   struct extent_state **cached_state,
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 5e0a1805e897..fbb753300071 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -3710,6 +3710,7 @@ const struct file_operations btrfs_file_operations = {
 	.compat_ioctl	= btrfs_compat_ioctl,
 #endif
 	.remap_file_range = btrfs_remap_file_range,
+	.uring_cmd	= btrfs_uring_cmd,
 	.fop_flags	= FOP_BUFFER_RASYNC | FOP_BUFFER_WASYNC,
 };
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 5aedb85696f4..759ae076f90b 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9057,6 +9057,7 @@ static ssize_t btrfs_encoded_read_inline(
 
 struct btrfs_encoded_read_private {
 	wait_queue_head_t wait;
+	void *uring_ctx;
 	atomic_t pending;
 	blk_status_t status;
 };
@@ -9076,14 +9077,23 @@ static void btrfs_encoded_read_endio(struct btrfs_bio *bbio)
 		 */
 		WRITE_ONCE(priv->status, bbio->bio.bi_status);
 	}
-	if (!atomic_dec_return(&priv->pending))
-		wake_up(&priv->wait);
+	if (atomic_dec_return(&priv->pending) == 0) {
+		int err = blk_status_to_errno(READ_ONCE(priv->status));
+
+		if (priv->uring_ctx) {
+			btrfs_uring_read_extent_endio(priv->uring_ctx, err);
+			kfree(priv);
+		} else {
+			wake_up(&priv->wait);
+		}
+	}
 	bio_put(&bbio->bio);
 }
 
 int btrfs_encoded_read_regular_fill_pages(struct btrfs_inode *inode,
 					  u64 disk_bytenr, u64 disk_io_size,
-					  struct page **pages)
+					  struct page **pages,
+					  void *uring_ctx)
 {
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 	struct btrfs_encoded_read_private *priv;
@@ -9098,6 +9108,7 @@ int btrfs_encoded_read_regular_fill_pages(struct btrfs_inode *inode,
 	init_waitqueue_head(&priv->wait);
 	atomic_set(&priv->pending, 1);
 	priv->status = 0;
+	priv->uring_ctx = uring_ctx;
 
 	bbio = btrfs_bio_alloc(BIO_MAX_VECS, REQ_OP_READ, fs_info,
 			       btrfs_encoded_read_endio, priv);
@@ -9126,12 +9137,23 @@ int btrfs_encoded_read_regular_fill_pages(struct btrfs_inode *inode,
 	atomic_inc(&priv->pending);
 	btrfs_submit_bbio(bbio, 0);
 
-	if (atomic_dec_return(&priv->pending))
-		io_wait_event(priv->wait, !atomic_read(&priv->pending));
-	/* See btrfs_encoded_read_endio() for ordering. */
-	ret = blk_status_to_errno(READ_ONCE(priv->status));
-	kfree(priv);
-	return ret;
+	if (uring_ctx) {
+		if (atomic_dec_return(&priv->pending) == 0) {
+			ret = blk_status_to_errno(READ_ONCE(priv->status));
+			btrfs_uring_read_extent_endio(uring_ctx, ret);
+			kfree(priv);
+			return ret;
+		}
+
+		return -EIOCBQUEUED;
+	} else {
+		if (atomic_dec_return(&priv->pending) != 0)
+			io_wait_event(priv->wait, !atomic_read(&priv->pending));
+		/* See btrfs_encoded_read_endio() for ordering. */
+		ret = blk_status_to_errno(READ_ONCE(priv->status));
+		kfree(priv);
+		return ret;
+	}
 }
 
 ssize_t btrfs_encoded_read_regular(struct kiocb *iocb, struct iov_iter *iter,
@@ -9160,7 +9182,8 @@ ssize_t btrfs_encoded_read_regular(struct kiocb *iocb, struct iov_iter *iter,
 		}
 
 	ret = btrfs_encoded_read_regular_fill_pages(inode, disk_bytenr,
-						    disk_io_size, pages);
+						    disk_io_size, pages,
+						    NULL);
 	if (ret)
 		goto out;
 
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index d502b31010bc..7f2731ef3dbb 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -29,6 +29,7 @@
 #include <linux/fileattr.h>
 #include <linux/fsverity.h>
 #include <linux/sched/xacct.h>
+#include <linux/io_uring/cmd.h>
 #include "ctree.h"
 #include "disk-io.h"
 #include "export.h"
@@ -4720,6 +4721,314 @@ static int btrfs_ioctl_encoded_write(struct file *file, void __user *argp, bool
 	return ret;
 }
 
+/*
+ * struct btrfs_uring_priv is the context that's attached to an encoded read
+ * io_uring command, in cmd->pdu. It contains the fields in
+ * btrfs_uring_read_extent that are necessary to finish off and cleanup the I/O
+ * in btrfs_uring_read_finished.
+ */
+struct btrfs_uring_priv {
+	struct io_uring_cmd *cmd;
+	struct page **pages;
+	unsigned long nr_pages;
+	struct kiocb iocb;
+	struct iovec *iov;
+	struct iov_iter iter;
+	struct extent_state *cached_state;
+	u64 count;
+	u64 start;
+	u64 lockend;
+	int err;
+	bool compressed;
+};
+
+static void btrfs_uring_read_finished(struct io_uring_cmd *cmd,
+				      unsigned int issue_flags)
+{
+	struct btrfs_uring_priv *priv =
+		*io_uring_cmd_to_pdu(cmd, struct btrfs_uring_priv *);
+	struct btrfs_inode *inode = BTRFS_I(file_inode(priv->iocb.ki_filp));
+	struct extent_io_tree *io_tree = &inode->io_tree;
+	unsigned long i;
+	u64 cur;
+	size_t page_offset;
+	ssize_t ret;
+
+	if (priv->err) {
+		ret = priv->err;
+		goto out;
+	}
+
+	if (priv->compressed) {
+		i = 0;
+		page_offset = 0;
+	} else {
+		i = (priv->iocb.ki_pos - priv->start) >> PAGE_SHIFT;
+		page_offset = offset_in_page(priv->iocb.ki_pos - priv->start);
+	}
+	cur = 0;
+	while (cur < priv->count) {
+		size_t bytes = min_t(size_t, priv->count - cur,
+				     PAGE_SIZE - page_offset);
+
+		if (copy_page_to_iter(priv->pages[i], page_offset, bytes,
+				      &priv->iter) != bytes) {
+			ret = -EFAULT;
+			goto out;
+		}
+
+		i++;
+		cur += bytes;
+		page_offset = 0;
+	}
+	ret = priv->count;
+
+out:
+	unlock_extent(io_tree, priv->start, priv->lockend, &priv->cached_state);
+	btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED);
+
+	io_uring_cmd_done(cmd, ret, 0, issue_flags);
+	add_rchar(current, ret);
+
+	for (unsigned long index = 0; index < priv->nr_pages; index++)
+		__free_page(priv->pages[index]);
+
+	kfree(priv->pages);
+	kfree(priv->iov);
+	kfree(priv);
+}
+
+void btrfs_uring_read_extent_endio(void *ctx, int err)
+{
+	struct btrfs_uring_priv *priv = ctx;
+
+	priv->err = err;
+
+	*io_uring_cmd_to_pdu(priv->cmd, struct btrfs_uring_priv *) = priv;
+	io_uring_cmd_complete_in_task(priv->cmd, btrfs_uring_read_finished);
+}
+
+static int btrfs_uring_read_extent(struct kiocb *iocb, struct iov_iter *iter,
+				   u64 start, u64 lockend,
+				   struct extent_state *cached_state,
+				   u64 disk_bytenr, u64 disk_io_size,
+				   size_t count, bool compressed,
+				   struct iovec *iov,
+				   struct io_uring_cmd *cmd)
+{
+	struct btrfs_inode *inode = BTRFS_I(file_inode(iocb->ki_filp));
+	struct extent_io_tree *io_tree = &inode->io_tree;
+	struct page **pages;
+	struct btrfs_uring_priv *priv = NULL;
+	unsigned long nr_pages;
+	int ret;
+
+	nr_pages = DIV_ROUND_UP(disk_io_size, PAGE_SIZE);
+	pages = kcalloc(nr_pages, sizeof(struct page *), GFP_NOFS);
+	if (!pages)
+		return -ENOMEM;
+	ret = btrfs_alloc_page_array(nr_pages, pages, 0);
+	if (ret) {
+		ret = -ENOMEM;
+		goto fail;
+	}
+
+	priv = kmalloc(sizeof(*priv), GFP_NOFS);
+	if (!priv) {
+		ret = -ENOMEM;
+		goto fail;
+	}
+
+	priv->iocb = *iocb;
+	priv->iov = iov;
+	priv->iter = *iter;
+	priv->count = count;
+	priv->cmd = cmd;
+	priv->cached_state = cached_state;
+	priv->compressed = compressed;
+	priv->nr_pages = nr_pages;
+	priv->pages = pages;
+	priv->start = start;
+	priv->lockend = lockend;
+	priv->err = 0;
+
+	ret = btrfs_encoded_read_regular_fill_pages(inode, disk_bytenr,
+						    disk_io_size, pages,
+						    priv);
+	if (ret && ret != -EIOCBQUEUED)
+		goto fail;
+
+	/*
+	 * If we return -EIOCBQUEUED, we're deferring the cleanup to
+	 * btrfs_uring_read_finished, which will handle unlocking the extent
+	 * and inode and freeing the allocations.
+	 */
+
+	return -EIOCBQUEUED;
+
+fail:
+	unlock_extent(io_tree, start, lockend, &cached_state);
+	btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED);
+	kfree(priv);
+	return ret;
+}
+
+static int btrfs_uring_encoded_read(struct io_uring_cmd *cmd,
+				    unsigned int issue_flags)
+{
+	size_t copy_end_kernel = offsetofend(struct btrfs_ioctl_encoded_io_args,
+					     flags);
+	size_t copy_end;
+	struct btrfs_ioctl_encoded_io_args args = { 0 };
+	int ret;
+	u64 disk_bytenr, disk_io_size;
+	struct file *file = cmd->file;
+	struct btrfs_inode *inode = BTRFS_I(file->f_inode);
+	struct btrfs_fs_info *fs_info = inode->root->fs_info;
+	struct extent_io_tree *io_tree = &inode->io_tree;
+	struct iovec iovstack[UIO_FASTIOV];
+	struct iovec *iov = iovstack;
+	struct iov_iter iter;
+	loff_t pos;
+	struct kiocb kiocb;
+	struct extent_state *cached_state = NULL;
+	u64 start, lockend;
+	void __user *sqe_addr = u64_to_user_ptr(READ_ONCE(cmd->sqe->addr));
+
+	if (!capable(CAP_SYS_ADMIN)) {
+		ret = -EPERM;
+		goto out_acct;
+	}
+
+	if (issue_flags & IO_URING_F_COMPAT) {
+#if defined(CONFIG_64BIT) && defined(CONFIG_COMPAT)
+		struct btrfs_ioctl_encoded_io_args_32 args32;
+
+		copy_end = offsetofend(struct btrfs_ioctl_encoded_io_args_32,
+				       flags);
+		if (copy_from_user(&args32, sqe_addr, copy_end)) {
+			ret = -EFAULT;
+			goto out_acct;
+		}
+		args.iov = compat_ptr(args32.iov);
+		args.iovcnt = args32.iovcnt;
+		args.offset = args32.offset;
+		args.flags = args32.flags;
+#else
+		return -ENOTTY;
+#endif
+	} else {
+		copy_end = copy_end_kernel;
+		if (copy_from_user(&args, sqe_addr, copy_end)) {
+			ret = -EFAULT;
+			goto out_acct;
+		}
+	}
+
+	if (args.flags != 0)
+		return -EINVAL;
+
+	ret = import_iovec(ITER_DEST, args.iov, args.iovcnt, ARRAY_SIZE(iovstack),
+			   &iov, &iter);
+	if (ret < 0)
+		goto out_acct;
+
+	if (iov_iter_count(&iter) == 0) {
+		ret = 0;
+		goto out_free;
+	}
+
+	pos = args.offset;
+	ret = rw_verify_area(READ, file, &pos, args.len);
+	if (ret < 0)
+		goto out_free;
+
+	init_sync_kiocb(&kiocb, file);
+	kiocb.ki_pos = pos;
+
+	if (issue_flags & IO_URING_F_NONBLOCK)
+		kiocb.ki_flags |= IOCB_NOWAIT;
+
+	start = ALIGN_DOWN(pos, fs_info->sectorsize);
+	lockend = start + BTRFS_MAX_UNCOMPRESSED - 1;
+
+	ret = btrfs_encoded_read(&kiocb, &iter, &args, &cached_state,
+				 &disk_bytenr, &disk_io_size);
+	if (ret < 0 && ret != -EIOCBQUEUED)
+		goto out_free;
+
+	file_accessed(file);
+
+	if (copy_to_user(sqe_addr + copy_end, (char *)&args + copy_end_kernel,
+			 sizeof(args) - copy_end_kernel)) {
+		if (ret == -EIOCBQUEUED) {
+			unlock_extent(io_tree, start, lockend, &cached_state);
+			btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED);
+		}
+		ret = -EFAULT;
+		goto out_free;
+	}
+
+	if (ret == -EIOCBQUEUED) {
+		u64 count;
+
+		/*
+		 * If we've optimized things by storing the iovecs on the stack,
+		 * undo this.
+		 */
+		if (!iov) {
+			iov = kmalloc(sizeof(struct iovec) * args.iovcnt,
+				      GFP_NOFS);
+			if (!iov) {
+				unlock_extent(io_tree, start, lockend,
+					      &cached_state);
+				btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED);
+				ret = -ENOMEM;
+				goto out_acct;
+			}
+
+			memcpy(iov, iovstack,
+			       sizeof(struct iovec) * args.iovcnt);
+		}
+
+		count = min_t(u64, iov_iter_count(&iter), disk_io_size);
+
+		/* Match ioctl by not returning past EOF if uncompressed */
+		if (!args.compression)
+			count = min_t(u64, count, args.len);
+
+		ret = btrfs_uring_read_extent(&kiocb, &iter, start, lockend,
+					      cached_state, disk_bytenr,
+					      disk_io_size, count,
+					      args.compression, iov, cmd);
+
+		goto out_acct;
+	}
+
+out_free:
+	kfree(iov);
+
+out_acct:
+	if (ret > 0)
+		add_rchar(current, ret);
+	inc_syscr(current);
+
+	return ret;
+}
+
+int btrfs_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
+{
+	switch (cmd->cmd_op) {
+	case BTRFS_IOC_ENCODED_READ:
+#if defined(CONFIG_64BIT) && defined(CONFIG_COMPAT)
+	case BTRFS_IOC_ENCODED_READ_32:
+#endif
+		return btrfs_uring_encoded_read(cmd, issue_flags);
+	}
+
+	return -EINVAL;
+}
+
 long btrfs_ioctl(struct file *file, unsigned int
 		cmd, unsigned long arg)
 {
diff --git a/fs/btrfs/ioctl.h b/fs/btrfs/ioctl.h
index 19cd26b0244a..2b760c8778f8 100644
--- a/fs/btrfs/ioctl.h
+++ b/fs/btrfs/ioctl.h
@@ -22,5 +22,7 @@ void btrfs_sync_inode_flags_to_i_flags(struct inode *inode);
 int __pure btrfs_is_empty_uuid(const u8 *uuid);
 void btrfs_update_ioctl_balance_args(struct btrfs_fs_info *fs_info,
 				     struct btrfs_ioctl_balance_args *bargs);
+int btrfs_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags);
+void btrfs_uring_read_extent_endio(void *ctx, int err);
 
 #endif
diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index 03b31b1c39be..cadb945bb345 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -5669,7 +5669,8 @@ static int send_encoded_extent(struct send_ctx *sctx, struct btrfs_path *path,
 	ret = btrfs_encoded_read_regular_fill_pages(BTRFS_I(inode),
 						    disk_bytenr, disk_num_bytes,
 						    sctx->send_buf_pages +
-						    (data_offset >> PAGE_SHIFT));
+						    (data_offset >> PAGE_SHIFT),
+						    NULL);
 	if (ret)
 		goto out;
 
-- 
2.45.2



* Re: [PATCH v4 0/5] btrfs: io_uring interface for encoded reads
  2024-10-22 14:50 [PATCH v4 0/5] btrfs: io_uring interface for encoded reads Mark Harmstone
                   ` (4 preceding siblings ...)
  2024-10-22 14:50 ` [PATCH 5/5] btrfs: add io_uring command for encoded reads Mark Harmstone
@ 2024-10-29 21:29 ` David Sterba
  2024-10-30  1:22 ` Anand Jain
  6 siblings, 0 replies; 14+ messages in thread
From: David Sterba @ 2024-10-29 21:29 UTC
  To: Mark Harmstone; +Cc: linux-btrfs, io-uring

On Tue, Oct 22, 2024 at 03:50:15PM +0100, Mark Harmstone wrote:
> This is version 4 of a patch series to add an io_uring interface for
> encoded reads. The principal use case for this is to eventually allow
> btrfs send and receive to operate asynchronously, the lack of io_uring
> encoded I/O being one of the main blockers for this.
> 
> I've written a test program for this, which demonstrates that the ioctl
> and io_uring interfaces produce identical results: https://github.com/maharmstone/io_uring-encoded

We'll need a test utility for fstests too.

> Changelog:
> v4:
> * Rewritten to avoid taking function pointer
> * Removed nowait parameter, as this could be derived from iocb flags
> * Fixed structure not getting properly initialized
> * Followed ioctl by capping uncompressed reads at EOF
> * Rebased against btrfs/for-next
> * Formatting fixes
> * Rearranged structs to minimize holes
> * Published test program
> * Fixed potential data race with userspace
> * Changed to use io_uring_cmd_to_pdu helper function
> * Added comments for potentially confusing parts of the code

There are some more style issues and changelog updates, but overall it
looks ok to me. We're now in rc5, so we need to add it to for-next now
or postpone it for the next cycle (though I don't see a reason why we
would).

I've noticed CONFIG_IO_URING is a config option; do we need some ifdef
protection, or are the definitions provided unconditionally? We may also
need to ifdef out the ioctl code if io_uring is not built in.
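
(For illustration only, assuming the io_uring_cmd helpers have
!CONFIG_IO_URING stubs, such a guard in btrfs_file_operations might
look like:

	#ifdef CONFIG_IO_URING
		.uring_cmd	= btrfs_uring_cmd,
	#endif
)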


* Re: [PATCH 5/5] btrfs: add io_uring command for encoded reads
  2024-10-22 14:50 ` [PATCH 5/5] btrfs: add io_uring command for encoded reads Mark Harmstone
@ 2024-10-29 21:51   ` David Sterba
  2024-10-30  0:59   ` Pavel Begunkov
  1 sibling, 0 replies; 14+ messages in thread
From: David Sterba @ 2024-10-29 21:51 UTC
  To: Mark Harmstone; +Cc: linux-btrfs, io-uring

On Tue, Oct 22, 2024 at 03:50:20PM +0100, Mark Harmstone wrote:
> +static int btrfs_uring_encoded_read(struct io_uring_cmd *cmd,
> +				    unsigned int issue_flags)
> +{
> +	size_t copy_end_kernel = offsetofend(struct btrfs_ioctl_encoded_io_args,
> +					     flags);
> +	size_t copy_end;
> +	struct btrfs_ioctl_encoded_io_args args = { 0 };
> +	int ret;
> +	u64 disk_bytenr, disk_io_size;
> +	struct file *file = cmd->file;
> +	struct btrfs_inode *inode = BTRFS_I(file->f_inode);
> +	struct btrfs_fs_info *fs_info = inode->root->fs_info;
> +	struct extent_io_tree *io_tree = &inode->io_tree;
> +	struct iovec iovstack[UIO_FASTIOV];
> +	struct iovec *iov = iovstack;
> +	struct iov_iter iter;
> +	loff_t pos;
> +	struct kiocb kiocb;
> +	struct extent_state *cached_state = NULL;
> +	u64 start, lockend;
> +	void __user *sqe_addr = u64_to_user_ptr(READ_ONCE(cmd->sqe->addr));
> +
> +	if (!capable(CAP_SYS_ADMIN)) {
> +		ret = -EPERM;
> +		goto out_acct;
> +	}

The access level check must be done first, before any data is read (in
this case cmd->file and sqe_addr). Fixed.


* Re: [PATCH 5/5] btrfs: add io_uring command for encoded reads
  2024-10-22 14:50 ` [PATCH 5/5] btrfs: add io_uring command for encoded reads Mark Harmstone
  2024-10-29 21:51   ` David Sterba
@ 2024-10-30  0:59   ` Pavel Begunkov
  2024-10-30  1:24     ` David Sterba
  1 sibling, 1 reply; 14+ messages in thread
From: Pavel Begunkov @ 2024-10-30  0:59 UTC
  To: Mark Harmstone, linux-btrfs; +Cc: io-uring

On 10/22/24 15:50, Mark Harmstone wrote:
...
> +static void btrfs_uring_read_finished(struct io_uring_cmd *cmd,
> +				      unsigned int issue_flags)
> +{
> +	struct btrfs_uring_priv *priv =
> +		*io_uring_cmd_to_pdu(cmd, struct btrfs_uring_priv *);
> +	struct btrfs_inode *inode = BTRFS_I(file_inode(priv->iocb.ki_filp));
> +	struct extent_io_tree *io_tree = &inode->io_tree;
> +	unsigned long i;
> +	u64 cur;
> +	size_t page_offset;
> +	ssize_t ret;
> +
> +	if (priv->err) {
> +		ret = priv->err;
> +		goto out;
> +	}
> +
> +	if (priv->compressed) {
> +		i = 0;
> +		page_offset = 0;
> +	} else {
> +		i = (priv->iocb.ki_pos - priv->start) >> PAGE_SHIFT;
> +		page_offset = offset_in_page(priv->iocb.ki_pos - priv->start);
> +	}
> +	cur = 0;
> +	while (cur < priv->count) {
> +		size_t bytes = min_t(size_t, priv->count - cur,
> +				     PAGE_SIZE - page_offset);
> +
> +		if (copy_page_to_iter(priv->pages[i], page_offset, bytes,
> +				      &priv->iter) != bytes) {

If that's an iovec-backed iter, that copy might fail; you'd need to
steal this patch

https://lore.kernel.org/all/[email protected]/

and fail if "issue_flags & IO_URING_F_TASK_DEAD", see

https://lore.kernel.org/all/[email protected]/


> +			ret = -EFAULT;
> +			goto out;
> +		}
> +
> +		i++;
> +		cur += bytes;
> +		page_offset = 0;
> +	}
> +	ret = priv->count;
> +
> +out:
> +	unlock_extent(io_tree, priv->start, priv->lockend, &priv->cached_state);
> +	btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED);

When called via io_uring_cmd_complete_in_task() this function might
not get run in any reasonable amount of time. Even worse, a
misbehaving user can block it until the task dies.

I don't remember if rwsem allows unlock being executed in a different
task than the pairing lock, but blocking it for that long could be a
problem. I might not remember it right but I think Boris mentioned
that the O_DIRECT path drops the inode lock right after submission
without waiting for bios to complete. Is that right? Can we do it
here as well?

> +
> +	io_uring_cmd_done(cmd, ret, 0, issue_flags);
> +	add_rchar(current, ret);
> +
> +	for (unsigned long index = 0; index < priv->nr_pages; index++)
> +		__free_page(priv->pages[index]);
> +
> +	kfree(priv->pages);
> +	kfree(priv->iov);
> +	kfree(priv);
> +}
> +
> +void btrfs_uring_read_extent_endio(void *ctx, int err)
> +{
> +	struct btrfs_uring_priv *priv = ctx;
> +
> +	priv->err = err;
> +
> +	*io_uring_cmd_to_pdu(priv->cmd, struct btrfs_uring_priv *) = priv;

a nit: I'd suggest creating a temp var, which should be easier to
read. It'd be even nicer if you wrapped it into a structure
as suggested last time.

struct io_btrfs_cmd {
	struct btrfs_uring_priv *priv;
};

struct io_btrfs_cmd *bc = io_uring_cmd_to_pdu(cmd, struct io_btrfs_cmd);
bc->priv = priv;

> +	io_uring_cmd_complete_in_task(priv->cmd, btrfs_uring_read_finished);
> +}
> +
> +static int btrfs_uring_read_extent(struct kiocb *iocb, struct iov_iter *iter,
> +				   u64 start, u64 lockend,
> +				   struct extent_state *cached_state,
> +				   u64 disk_bytenr, u64 disk_io_size,
> +				   size_t count, bool compressed,
> +				   struct iovec *iov,
> +				   struct io_uring_cmd *cmd)
> +{
> +	struct btrfs_inode *inode = BTRFS_I(file_inode(iocb->ki_filp));
> +	struct extent_io_tree *io_tree = &inode->io_tree;
> +	struct page **pages;
> +	struct btrfs_uring_priv *priv = NULL;
> +	unsigned long nr_pages;
> +	int ret;
> +
> +	nr_pages = DIV_ROUND_UP(disk_io_size, PAGE_SIZE);
> +	pages = kcalloc(nr_pages, sizeof(struct page *), GFP_NOFS);
> +	if (!pages)
> +		return -ENOMEM;
> +	ret = btrfs_alloc_page_array(nr_pages, pages, 0);
> +	if (ret) {
> +		ret = -ENOMEM;
> +		goto fail;
> +	}
> +
> +	priv = kmalloc(sizeof(*priv), GFP_NOFS);
> +	if (!priv) {
> +		ret = -ENOMEM;
> +		goto fail;
> +	}
> +
> +	priv->iocb = *iocb;
> +	priv->iov = iov;
> +	priv->iter = *iter;
> +	priv->count = count;
> +	priv->cmd = cmd;
> +	priv->cached_state = cached_state;
> +	priv->compressed = compressed;
> +	priv->nr_pages = nr_pages;
> +	priv->pages = pages;
> +	priv->start = start;
> +	priv->lockend = lockend;
> +	priv->err = 0;
> +
> +	ret = btrfs_encoded_read_regular_fill_pages(inode, disk_bytenr,
> +						    disk_io_size, pages,
> +						    priv);
> +	if (ret && ret != -EIOCBQUEUED)
> +		goto fail;

Turning both into return -EIOCBQUEUED is a bit suspicious, but
I lack the context to say. It might make sense to return ret and let
the caller handle it.

> +
> +	/*
> +	 * If we return -EIOCBQUEUED, we're deferring the cleanup to
> +	 * btrfs_uring_read_finished, which will handle unlocking the extent
> +	 * and inode and freeing the allocations.
> +	 */
> +
> +	return -EIOCBQUEUED;
> +
> +fail:
> +	unlock_extent(io_tree, start, lockend, &cached_state);
> +	btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED);
> +	kfree(priv);
> +	return ret;
> +}
> +
> +static int btrfs_uring_encoded_read(struct io_uring_cmd *cmd,
> +				    unsigned int issue_flags)
> +{
> +	size_t copy_end_kernel = offsetofend(struct btrfs_ioctl_encoded_io_args,
> +					     flags);
> +	size_t copy_end;
> +	struct btrfs_ioctl_encoded_io_args args = { 0 };
> +	int ret;
> +	u64 disk_bytenr, disk_io_size;
> +	struct file *file = cmd->file;
> +	struct btrfs_inode *inode = BTRFS_I(file->f_inode);
> +	struct btrfs_fs_info *fs_info = inode->root->fs_info;
> +	struct extent_io_tree *io_tree = &inode->io_tree;
> +	struct iovec iovstack[UIO_FASTIOV];
> +	struct iovec *iov = iovstack;
> +	struct iov_iter iter;
> +	loff_t pos;
> +	struct kiocb kiocb;
> +	struct extent_state *cached_state = NULL;
> +	u64 start, lockend;
> +	void __user *sqe_addr = u64_to_user_ptr(READ_ONCE(cmd->sqe->addr));

Let's rename it; I was taken aback for a brief second, wondering why
you're doing copy_from_user() from an SQE / the ring, which turns
out to be a user pointer to a btrfs structure.

...
> +	ret = btrfs_encoded_read(&kiocb, &iter, &args, &cached_state,
> +				 &disk_bytenr, &disk_io_size);
> +	if (ret < 0 && ret != -EIOCBQUEUED)
> +		goto out_free;
> +
> +	file_accessed(file);
> +
> +	if (copy_to_user(sqe_addr + copy_end, (char *)&args + copy_end_kernel,
> +			 sizeof(args) - copy_end_kernel)) {
> +		if (ret == -EIOCBQUEUED) {
> +			unlock_extent(io_tree, start, lockend, &cached_state);
> +			btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED);
> +		}
> +		ret = -EFAULT;
> +		goto out_free;

It seems we're saving iov in the priv structure; who can access the iovec
after the request is submitted? -EIOCBQUEUED in general means that the
request is submitted and will get completed async, e.g. via a callback, and
if the bio callback can use the iov, maybe via the iter, this goto will be
a use-after-free.

Also, you're returning -EFAULT back to io_uring, which will kill the
io_uring request / cmd while there might still be in-flight bios that
can try to access it.

Can you inject errors into the copy and test please?

> +	}
> +
> +	if (ret == -EIOCBQUEUED) {
> +		u64 count;
> +
> +		/*
> +		 * If we've optimized things by storing the iovecs on the stack,
> +		 * undo this.
> +		 */
> +		if (!iov) {
> +			iov = kmalloc(sizeof(struct iovec) * args.iovcnt,
> +				      GFP_NOFS);
> +			if (!iov) {
> +				unlock_extent(io_tree, start, lockend,
> +					      &cached_state);
> +				btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED);
> +				ret = -ENOMEM;
> +				goto out_acct;
> +			}
> +
> +			memcpy(iov, iovstack,
> +			       sizeof(struct iovec) * args.iovcnt);
> +		}
> +
> +		count = min_t(u64, iov_iter_count(&iter), disk_io_size);
> +
> +		/* Match ioctl by not returning past EOF if uncompressed */
> +		if (!args.compression)
> +			count = min_t(u64, count, args.len);
> +
> +		ret = btrfs_uring_read_extent(&kiocb, &iter, start, lockend,
> +					      cached_state, disk_bytenr,
> +					      disk_io_size, count,
> +					      args.compression, iov, cmd);
> +
> +		goto out_acct;
> +	}
> +
> +out_free:
> +	kfree(iov);
> +
> +out_acct:
> +	if (ret > 0)
> +		add_rchar(current, ret);
> +	inc_syscr(current);
> +
> +	return ret;
> +}

-- 
Pavel Begunkov


* Re: [PATCH v4 0/5] btrfs: io_uring interface for encoded reads
  2024-10-22 14:50 [PATCH v4 0/5] btrfs: io_uring interface for encoded reads Mark Harmstone
                   ` (5 preceding siblings ...)
  2024-10-29 21:29 ` [PATCH v4 0/5] btrfs: io_uring interface " David Sterba
@ 2024-10-30  1:22 ` Anand Jain
  6 siblings, 0 replies; 14+ messages in thread
From: Anand Jain @ 2024-10-30  1:22 UTC
  To: Mark Harmstone; +Cc: io-uring, linux-btrfs

On 22/10/24 22:50, Mark Harmstone wrote:
> This is version 4 of a patch series to add an io_uring interface for
> encoded reads. The principal use case for this is to eventually allow

> btrfs send and receive to operate asynchronously,

How would you define an asynchronously operated Btrfs send and receive?
Are you referring to Btrfs send and receive leveraging io_uring's
asynchronous operation?

> the lack of io_uring
> encoded I/O being one of the main blockers for this.

> 
> I've written a test program for this, which demonstrates that the ioctl
> and io_uring interfaces produce identical results: https://github.com/maharmstone/io_uring-encoded

Thanks, Anand


> 
> Changelog:
> v4:
> * Rewritten to avoid taking function pointer
> * Removed nowait parameter, as this could be derived from iocb flags
> * Fixed structure not getting properly initialized
> * Followed ioctl by capping uncompressed reads at EOF
> * Rebased against btrfs/for-next
> * Formatting fixes
> * Rearranged structs to minimize holes
> * Published test program
> * Fixed potential data race with userspace
> * Changed to use io_uring_cmd_to_pdu helper function
> * Added comments for potentially confusing parts of the code
> 
> v3:
> * Redo of previous versions
> 
> Mark Harmstone (5):
>    btrfs: remove pointless addition in btrfs_encoded_read
>    btrfs: change btrfs_encoded_read so that reading of extent is done by
>      caller
>    btrfs: don't sleep in btrfs_encoded_read if IOCB_NOWAIT set
>    btrfs: move priv off stack in btrfs_encoded_read_regular_fill_pages
>    btrfs: add io_uring command for encoded reads
> 
>   fs/btrfs/btrfs_inode.h |  13 +-
>   fs/btrfs/file.c        |   1 +
>   fs/btrfs/inode.c       | 175 ++++++++++++++-------
>   fs/btrfs/ioctl.c       | 342 ++++++++++++++++++++++++++++++++++++++++-
>   fs/btrfs/ioctl.h       |   2 +
>   fs/btrfs/send.c        |   3 +-
>   6 files changed, 473 insertions(+), 63 deletions(-)
> 



* Re: [PATCH 5/5] btrfs: add io_uring command for encoded reads
  2024-10-30  0:59   ` Pavel Begunkov
@ 2024-10-30  1:24     ` David Sterba
  2024-10-30  2:32       ` Pavel Begunkov
  0 siblings, 1 reply; 14+ messages in thread
From: David Sterba @ 2024-10-30  1:24 UTC
  To: Pavel Begunkov; +Cc: Mark Harmstone, linux-btrfs, io-uring

On Wed, Oct 30, 2024 at 12:59:33AM +0000, Pavel Begunkov wrote:
> On 10/22/24 15:50, Mark Harmstone wrote:
> ...
> > +static void btrfs_uring_read_finished(struct io_uring_cmd *cmd,
> > +				      unsigned int issue_flags)
> > +{
> > +	struct btrfs_uring_priv *priv =
> > +		*io_uring_cmd_to_pdu(cmd, struct btrfs_uring_priv *);
> > +	struct btrfs_inode *inode = BTRFS_I(file_inode(priv->iocb.ki_filp));
> > +	struct extent_io_tree *io_tree = &inode->io_tree;
> > +	unsigned long i;
> > +	u64 cur;
> > +	size_t page_offset;
> > +	ssize_t ret;
> > +
> > +	if (priv->err) {
> > +		ret = priv->err;
> > +		goto out;
> > +	}
> > +
> > +	if (priv->compressed) {
> > +		i = 0;
> > +		page_offset = 0;
> > +	} else {
> > +		i = (priv->iocb.ki_pos - priv->start) >> PAGE_SHIFT;
> > +		page_offset = offset_in_page(priv->iocb.ki_pos - priv->start);
> > +	}
> > +	cur = 0;
> > +	while (cur < priv->count) {
> > +		size_t bytes = min_t(size_t, priv->count - cur,
> > +				     PAGE_SIZE - page_offset);
> > +
> > +		if (copy_page_to_iter(priv->pages[i], page_offset, bytes,
> > +				      &priv->iter) != bytes) {
> 
> If that's an iovec-backed iter, that copy might fail; you'd need to
> steal this patch
> 
> https://lore.kernel.org/all/[email protected]/
> 
> and fail if "issue_flags & IO_URING_F_TASK_DEAD", see
> 
> https://lore.kernel.org/all/[email protected]/
> 
> 
> > +			ret = -EFAULT;
> > +			goto out;
> > +		}
> > +
> > +		i++;
> > +		cur += bytes;
> > +		page_offset = 0;
> > +	}
> > +	ret = priv->count;
> > +
> > +out:
> > +	unlock_extent(io_tree, priv->start, priv->lockend, &priv->cached_state);
> > +	btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED);
> 
> When called via io_uring_cmd_complete_in_task() this function might
> not get run in any reasonable amount of time. Even worse, a
> misbehaving user can block it until the task dies.
> 
> I don't remember if rwsem allows unlock being executed in a different
> task than the pairing lock, but blocking it for that long could be a
> problem. I might not remember it right but I think Boris mentioned
> that the O_DIRECT path drops the inode lock right after submission
> without waiting for bios to complete. Is that right? Can we do it
> here as well?
> 
> > +
> > +	io_uring_cmd_done(cmd, ret, 0, issue_flags);
> > +	add_rchar(current, ret);
> > +
> > +	for (unsigned long index = 0; index < priv->nr_pages; index++)
> > +		__free_page(priv->pages[index]);
> > +
> > +	kfree(priv->pages);
> > +	kfree(priv->iov);
> > +	kfree(priv);
> > +}
> > +
> > +void btrfs_uring_read_extent_endio(void *ctx, int err)
> > +{
> > +	struct btrfs_uring_priv *priv = ctx;
> > +
> > +	priv->err = err;
> > +
> > +	*io_uring_cmd_to_pdu(priv->cmd, struct btrfs_uring_priv *) = priv;
> 
> a nit: I'd suggest creating a temp var, which should be easier to
> read. It'd be even nicer if you wrapped it into a structure
> as suggested last time.
> 
> struct io_btrfs_cmd {
> 	struct btrfs_uring_priv *priv;
> };
> 
> struct io_btrfs_cmd *bc = io_uring_cmd_to_pdu(cmd, struct io_btrfs_cmd);
> bc->priv = priv;
> 
> > +	io_uring_cmd_complete_in_task(priv->cmd, btrfs_uring_read_finished);
> > +}
> > +
> > +static int btrfs_uring_read_extent(struct kiocb *iocb, struct iov_iter *iter,
> > +				   u64 start, u64 lockend,
> > +				   struct extent_state *cached_state,
> > +				   u64 disk_bytenr, u64 disk_io_size,
> > +				   size_t count, bool compressed,
> > +				   struct iovec *iov,
> > +				   struct io_uring_cmd *cmd)
> > +{
> > +	struct btrfs_inode *inode = BTRFS_I(file_inode(iocb->ki_filp));
> > +	struct extent_io_tree *io_tree = &inode->io_tree;
> > +	struct page **pages;
> > +	struct btrfs_uring_priv *priv = NULL;
> > +	unsigned long nr_pages;
> > +	int ret;
> > +
> > +	nr_pages = DIV_ROUND_UP(disk_io_size, PAGE_SIZE);
> > +	pages = kcalloc(nr_pages, sizeof(struct page *), GFP_NOFS);
> > +	if (!pages)
> > +		return -ENOMEM;
> > +	ret = btrfs_alloc_page_array(nr_pages, pages, 0);
> > +	if (ret) {
> > +		ret = -ENOMEM;
> > +		goto fail;
> > +	}
> > +
> > +	priv = kmalloc(sizeof(*priv), GFP_NOFS);
> > +	if (!priv) {
> > +		ret = -ENOMEM;
> > +		goto fail;
> > +	}
> > +
> > +	priv->iocb = *iocb;
> > +	priv->iov = iov;
> > +	priv->iter = *iter;
> > +	priv->count = count;
> > +	priv->cmd = cmd;
> > +	priv->cached_state = cached_state;
> > +	priv->compressed = compressed;
> > +	priv->nr_pages = nr_pages;
> > +	priv->pages = pages;
> > +	priv->start = start;
> > +	priv->lockend = lockend;
> > +	priv->err = 0;
> > +
> > +	ret = btrfs_encoded_read_regular_fill_pages(inode, disk_bytenr,
> > +						    disk_io_size, pages,
> > +						    priv);
> > +	if (ret && ret != -EIOCBQUEUED)
> > +		goto fail;
> 
> Turning both into return -EIOCBQUEUED is a bit suspicious, but
> I lack the context to say. It might make sense to return ret and let
> the caller handle it.
> 
> > +
> > +	/*
> > +	 * If we return -EIOCBQUEUED, we're deferring the cleanup to
> > +	 * btrfs_uring_read_finished, which will handle unlocking the extent
> > +	 * and inode and freeing the allocations.
> > +	 */
> > +
> > +	return -EIOCBQUEUED;
> > +
> > +fail:
> > +	unlock_extent(io_tree, start, lockend, &cached_state);
> > +	btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED);
> > +	kfree(priv);
> > +	return ret;
> > +}
> > +
> > +static int btrfs_uring_encoded_read(struct io_uring_cmd *cmd,
> > +				    unsigned int issue_flags)
> > +{
> > +	size_t copy_end_kernel = offsetofend(struct btrfs_ioctl_encoded_io_args,
> > +					     flags);
> > +	size_t copy_end;
> > +	struct btrfs_ioctl_encoded_io_args args = { 0 };
> > +	int ret;
> > +	u64 disk_bytenr, disk_io_size;
> > +	struct file *file = cmd->file;
> > +	struct btrfs_inode *inode = BTRFS_I(file->f_inode);
> > +	struct btrfs_fs_info *fs_info = inode->root->fs_info;
> > +	struct extent_io_tree *io_tree = &inode->io_tree;
> > +	struct iovec iovstack[UIO_FASTIOV];
> > +	struct iovec *iov = iovstack;
> > +	struct iov_iter iter;
> > +	loff_t pos;
> > +	struct kiocb kiocb;
> > +	struct extent_state *cached_state = NULL;
> > +	u64 start, lockend;
> > +	void __user *sqe_addr = u64_to_user_ptr(READ_ONCE(cmd->sqe->addr));
> 
> Let's rename it; I was briefly taken aback as to why you're doing a
> copy_from_user() from an SQE / the ring, when it turns out to be a
> user pointer to a btrfs structure.
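> 
> E.g., the name being just a suggestion:
> 
> 	void __user *uarg = u64_to_user_ptr(READ_ONCE(cmd->sqe->addr));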
> 
> ...
> > +	ret = btrfs_encoded_read(&kiocb, &iter, &args, &cached_state,
> > +				 &disk_bytenr, &disk_io_size);
> > +	if (ret < 0 && ret != -EIOCBQUEUED)
> > +		goto out_free;
> > +
> > +	file_accessed(file);
> > +
> > +	if (copy_to_user(sqe_addr + copy_end, (char *)&args + copy_end_kernel,
> > +			 sizeof(args) - copy_end_kernel)) {
> > +		if (ret == -EIOCBQUEUED) {
> > +			unlock_extent(io_tree, start, lockend, &cached_state);
> > +			btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED);
> > +		}
> > +		ret = -EFAULT;
> > +		goto out_free;
> 
> It seems we're saving iov in the priv structure; who can access the iovec
> after the request is submitted? -EIOCBQUEUED in general means that the
> request has been submitted and will get completed async, e.g. via a
> callback, and if the bio callback can use the iov, maybe via the iter,
> this goto will be a use after free.
> 
> Also, you're returning -EFAULT back to io_uring, which will kill the
> io_uring request / cmd while there might still be in-flight bios that
> can try to access it.
> 
> Can you inject errors into the copy and test please?
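> 
> One way out, as a sketch only: do the copy-back from the completion
> side instead, so nothing user-visible is touched after submission.
> The priv fields used below (uarg, args, copy_end, copy_end_kernel)
> would be hypothetical additions to struct btrfs_uring_priv:
> 
> 	/* in btrfs_uring_read_finished(), i.e. task-work context */
> 	if (copy_to_user(priv->uarg + priv->copy_end,
> 			 (char *)&priv->args + priv->copy_end_kernel,
> 			 sizeof(priv->args) - priv->copy_end_kernel))
> 		ret = -EFAULT;
> 
> 	/* only complete the cmd and free priv/iov once the bios are done */
> 	io_uring_cmd_done(cmd, ret, 0, issue_flags);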

Thanks for the comments. I get the impression that there are known
problems on the io_uring side, so until that is resolved the btrfs part
may be insecure or have known runtime bugs, but in the end it does not
need any change. We just need to wait until it's resolved at the
interface level.

The patches you point to are from FUSE trying to wire up io_uring so
this looks like an interface problem. We have recently gained a config
option-level guard for experimental and unstable features so we can add
the code but don't have to expose users to the functionality unless they
know there are risks or known problems. The io_uring encoded read
has a performance benefit and I'd like to get the patches in for 6.13
but if there's something serious, one option is to not add the code or at
least guard it (behind a config option).

I'm open to both and we have at least one -rc kernel to decide.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/5] btrfs: remove pointless addition in btrfs_encoded_read
  2024-10-22 14:50 ` [PATCH 1/5] btrfs: remove pointless addition in btrfs_encoded_read Mark Harmstone
@ 2024-10-30  1:46   ` Anand Jain
  0 siblings, 0 replies; 14+ messages in thread
From: Anand Jain @ 2024-10-30  1:46 UTC (permalink / raw)
  To: Mark Harmstone; +Cc: io-uring, linux-btrfs

On 22/10/24 22:50, Mark Harmstone wrote:
> iocb->ki_pos isn't used after this function, so there's no point in
> changing its value.
> 
> Signed-off-by: Mark Harmstone <[email protected]>
> ---
>   fs/btrfs/inode.c | 5 +----
>   1 file changed, 1 insertion(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 7c5ef2c5c7e8..94098a4c782d 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -9252,7 +9252,7 @@ ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter,
>   		ret = btrfs_encoded_read_inline(iocb, iter, start, lockend,
>   						&cached_state, extent_start,
>   						count, encoded, &unlocked);
> -		goto out;
> +		goto out_em;


This should jump to out_unlock_extent instead: free_extent_map() has
already been called two lines above this hunk, and %em is NULL at
this point.
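
That is, the hunk above would become something like (untested):

 		ret = btrfs_encoded_read_inline(iocb, iter, start, lockend,
 						&cached_state, extent_start,
 						count, encoded, &unlocked);
-		goto out;
+		goto out_unlock_extent;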

Thanks, Anand

>   	}
>   
>   	/*
> @@ -9318,9 +9318,6 @@ ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter,
>   						 &unlocked);
>   	}
>   
> -out:
> -	if (ret >= 0)
> -		iocb->ki_pos += encoded->len;
>   out_em:
>   	free_extent_map(em);
>   out_unlock_extent:


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 5/5] btrfs: add io_uring command for encoded reads
  2024-10-30  1:24     ` David Sterba
@ 2024-10-30  2:32       ` Pavel Begunkov
  0 siblings, 0 replies; 14+ messages in thread
From: Pavel Begunkov @ 2024-10-30  2:32 UTC (permalink / raw)
  To: dsterba; +Cc: Mark Harmstone, linux-btrfs, io-uring

On 10/30/24 01:24, David Sterba wrote:
> On Wed, Oct 30, 2024 at 12:59:33AM +0000, Pavel Begunkov wrote:
>> On 10/22/24 15:50, Mark Harmstone wrote:
...
>> It seems we're saving iov in the priv structure; who can access the iovec
>> after the request is submitted? -EIOCBQUEUED in general means that the
>> request has been submitted and will get completed async, e.g. via a
>> callback, and if the bio callback can use the iov, maybe via the iter,
>> this goto will be a use after free.
>>
>> Also, you're returning -EFAULT back to io_uring, which will kill the
>> io_uring request / cmd while there might still be in-flight bios that
>> can try to access it.
>>
>> Can you inject errors into the copy and test please?
> 
> Thanks for the comments. I get the impression that there are known
> problems on the io_uring side, so until that is resolved the btrfs part
> may be insecure or have known runtime bugs, but in the end it does not
> need any change. We just need to wait until it's resolved at the
> interface level.

There is nothing wrong with io_uring; it's the jump from synchronous
to asynchronous that concerns me, or more specifically how this series
handles it and all the races. Basic stuff like not freeing or changing,
without protection, memory that the async part might still be using.
That's up to this series to get right.
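
To be concrete, the usual pattern is a refcount on the private data;
the refs field and btrfs_uring_priv_free() below are hypothetical:

	/* at submission */
	refcount_set(&priv->refs, 2);	/* submit path + completion path */

	/* in both the submit error path and the endio/task-work path */
	if (refcount_dec_and_test(&priv->refs))
		btrfs_uring_priv_free(priv);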

> The patches you point to are from FUSE trying to wire up io_uring so
> this looks like an interface problem. We have recently gained a config

That's the easiest part of all: it can only happen when the
task dies and the mm becomes unavailable, and sane userspace shouldn't
have problems like that. Mark just needs to include the referenced
patch in the series and handle the request as mentioned.

> option-level guard for experimental and unstable features so we can add
> the code but don't have to expose users to the functionality unless they
> know there are risks or known problems. The io_uring encoded read
> has a performance benefit and

Good to hear that

> I'd like to get the patches in for 6.13
> but if there's something serious, one option is to not add the code or at
> least guard it (behind a config option).

Let's see what Mark replies; I might be missing some things, and
you and other btrfs folks can help answer the unlock question.

> I'm open to both and we have at least one -rc kernel to decide.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2024-10-30  2:32 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-10-22 14:50 [PATCH v4 0/5] btrfs: io_uring interface for encoded reads Mark Harmstone
2024-10-22 14:50 ` [PATCH 1/5] btrfs: remove pointless addition in btrfs_encoded_read Mark Harmstone
2024-10-30  1:46   ` Anand Jain
2024-10-22 14:50 ` [PATCH 2/5] btrfs: change btrfs_encoded_read so that reading of extent is done by caller Mark Harmstone
2024-10-22 14:50 ` [PATCH 3/5] btrfs: don't sleep in btrfs_encoded_read if IOCB_NOWAIT set Mark Harmstone
2024-10-22 14:50 ` [PATCH 4/5] btrfs: move priv off stack in btrfs_encoded_read_regular_fill_pages Mark Harmstone
2024-10-22 14:50 ` [PATCH 5/5] btrfs: add io_uring command for encoded reads Mark Harmstone
2024-10-29 21:51   ` David Sterba
2024-10-30  0:59   ` Pavel Begunkov
2024-10-30  1:24     ` David Sterba
2024-10-30  2:32       ` Pavel Begunkov
2024-10-29 21:29 ` [PATCH v4 0/5] btrfs: io_uring interface " David Sterba
2024-10-30  1:22 ` Anand Jain
  -- strict thread matches above, loose matches on Subject: below --
2024-10-14 17:18 [PATCH v3 0/5] btrfs: encoded reads via io_uring Mark Harmstone
2024-10-14 17:18 ` [PATCH 1/5] btrfs: remove pointless addition in btrfs_encoded_read Mark Harmstone

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox