public inbox for [email protected]
* [PATCH v6 00/10] Read/Write with metadata/integrity
       [not found] <CGME20241030180957epcas5p3312b0a582e8562f8c2169e64d41592b2@epcas5p3.samsung.com>
@ 2024-10-30 18:01 ` Kanchan Joshi
       [not found]   ` <CGME20241030181000epcas5p2bfb47a79f1e796116135f646c6f0ccc7@epcas5p2.samsung.com>
                     ` (9 more replies)
  0 siblings, 10 replies; 24+ messages in thread
From: Kanchan Joshi @ 2024-10-30 18:01 UTC (permalink / raw)
  To: axboe, hch, kbusch, martin.petersen, asml.silence, brauner, viro,
	jack
  Cc: linux-nvme, linux-fsdevel, io-uring, linux-block, linux-scsi,
	gost.dev, vishak.g, anuj1072538, Kanchan Joshi

This adds a new io_uring interface to exchange additional integrity/pi
metadata with read/write.

The patchset is on top of block/for-next.

Interface:

The application sets up an SQE128 ring and populates a new 'struct io_uring_meta_pi'
within the second SQE. This structure enables passing:

* pi_flags: Three flags are exposed for integrity checks,
 namely IO_INTEGRITY_CHK_GUARD/APPTAG/REFTAG.
* len: length of the meta buffer
* addr: address of the meta buffer
* seed: seed value for reftag remapping
* app_tag: application-specific 16b value
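
A quick way to sanity-check this layout from user-space tooling is to pack the
same fields with Python's struct module. The format string below is an
assumption based on the natural packed little-endian layout of
'struct io_uring_meta_pi' as defined in patch 6, including its two trailing
reserved u64s:

```python
import struct

# pi_flags (u16), app_tag (u16), len (u32), addr (u64), seed (u64), rsvd[2]
META_PI_FMT = "<HHIQQ2Q"

size = struct.calcsize(META_PI_FMT)
print(size)  # 40 -- fits comfortably within the 64 bytes of the second SQE
```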

The block path (direct IO), NVMe driver, and SCSI driver are modified to
support this.

Patch 1 is an enhancement patch.
Patch 2 is required to make the bounce buffer copy back work correctly.
Patches 3 to 5 are prep patches.
Patch 6 adds the io_uring support.
Patch 7 gives us a unified interface for user and kernel generated
integrity.
Patch 8 adds support in SCSI and patch 9 in NVMe.
Patch 10 adds the support for block direct IO.

Testing has been done by modifying fio to use this interface.
Example program for the interface is appended below [1].

Changes since v5:
https://lore.kernel.org/linux-block/[email protected]/

- remove meta_type field from SQE (hch, keith)
- remove __bitwise annotation (hch)
- remove BIP_CTRL_NOCHECK from scsi (hch)

Changes since v4:
https://lore.kernel.org/linux-block/[email protected]/

- better variable names to describe bounce buffer copy back (hch)
- move definition of flags to the same patch introducing uio_meta (hch)
- move uio_meta definition to include/linux/uio.h (hch)
- bump seed size in uio_meta to 8 bytes (martin)
- move flags definition to include/uapi/linux/fs.h (hch)
- s/meta/metadata in commit description of io-uring (hch)
- rearrange the meta fields in sqe for cleaner layout
- partial submission case is not applicable, as we are only plumbing for the async case
- s/META_TYPE_INTEGRITY/META_TYPE_PI (hch, martin)
- remove unlikely branching (hch)
- Better formatting, misc cleanups, better commit descriptions, reordering commits (hch)

Changes since v3:
https://lore.kernel.org/linux-block/[email protected]/

- add reftag seed support (Martin)
- fix incorrect formatting in uio_meta (hch)
- s/IOCB_HAS_META/IOCB_HAS_METADATA (hch)
- move integrity check flags to block layer header (hch)
- add comments for BIP_CHECK_GUARD/REFTAG/APPTAG flags (hch)
- remove bio_integrity check during completion if IOCB_HAS_METADATA is set (hch)
- use goto label to get rid of duplicate error handling (hch)
- add warn_on if trying to do sync io with iocb_has_metadata flag (hch)
- remove check for disabling reftag remapping (hch)
- remove BIP_INTEGRITY_USER flag (hch)
- add comment for app_tag field introduced in bio_integrity_payload (hch)
- pass request to nvme_set_app_tag function (hch)
- right indentation at a place in scsi patch (hch)
- move IOCB_HAS_METADATA to a separate fs patch (hch)

Changes since v2:
https://lore.kernel.org/linux-block/[email protected]/
- io_uring error handling styling (Gabriel)
- add documented helper to get metadata bytes from data iter (hch)
- during clone specify "what flags to clone" rather than
"what not to clone" (hch)
- Move uio_meta definition to bio-integrity.h (hch)
- Rename apptag field to app_tag (hch)
- Change datatype of flags field in uio_meta to bitwise (hch)
- Don't introduce BIP_USER_CHK_FOO flags (hch, martin)
- Driver should rely on block layer flags instead of seeing if it is
user-passthrough (hch)
- update the scsi code for handling user-meta (hch, martin)

Changes since v1:
https://lore.kernel.org/linux-block/[email protected]/
- Do not use a new opcode for meta, and add the provision to introduce new
meta types beyond integrity (Pavel)
- Stuff IOCB_HAS_META check in need_complete_io (Jens)
- Split meta handling in NVMe into a separate handler (Keith)
- Add meta handling for __blkdev_direct_IO too (Keith)
- Don't inherit BIP_COPY_USER flag for cloned bio's (Christoph)
- Better commit descriptions (Christoph)

Changes since RFC:
- modify io_uring plumbing based on recent async handling state changes
- fixes/enhancements to correctly handle the split for meta buffer
- add flags to specify guard/reftag/apptag checks
- add support to send apptag

[1]
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <linux/io_uring.h>
#include <linux/types.h>
#include "liburing.h"

/* write data/meta. read both. compare. send apptag too.
* prerequisite:
* unprotected xfer: format namespace with 4KB + 8b, pi_type = 0
* protected xfer: format namespace with 4KB + 8b, pi_type = 1
*/

#define DATA_LEN 4096
#define META_LEN 8

struct t10_pi_tuple {
        __be16  guard;
        __be16  apptag;
        __be32  reftag;
};

int main(int argc, char *argv[])
{
         struct io_uring ring;
         struct io_uring_sqe *sqe = NULL;
         struct io_uring_cqe *cqe = NULL;
         void *wdb,*rdb;
         char wmb[META_LEN], rmb[META_LEN];
         char *data_str = "data buffer";
         char *meta_str = "meta";
         int fd, ret, blksize;
         struct stat fstat;
         unsigned long long offset = DATA_LEN;
         struct t10_pi_tuple *pi;
         struct io_uring_meta_pi *md;

         if (argc != 2) {
                 fprintf(stderr, "Usage: %s <block-device>\n", argv[0]);
                 return 1;
         }

         if (stat(argv[1], &fstat) == 0) {
                 blksize = (int)fstat.st_blksize;
         } else {
                 perror("stat");
                 return 1;
         }

         if (posix_memalign(&wdb, blksize, DATA_LEN)) {
                 perror("posix_memalign failed");
                 return 1;
         }
         if (posix_memalign(&rdb, blksize, DATA_LEN)) {
                 perror("posix_memalign failed");
                 return 1;
         }

         strcpy(wdb, data_str);
         strcpy(wmb, meta_str);

         fd = open(argv[1], O_RDWR | O_DIRECT);
         if (fd < 0) {
                 perror("open");
                 return 1;
         }

         ret = io_uring_queue_init(8, &ring, IORING_SETUP_SQE128);
         if (ret) {
                 fprintf(stderr, "ring setup failed: %d\n", ret);
                 return 1;
         }

         /* write data + meta-buffer to device */
         sqe = io_uring_get_sqe(&ring);
         if (!sqe) {
                 fprintf(stderr, "get sqe failed\n");
                 return 1;
         }

         io_uring_prep_write(sqe, fd, wdb, DATA_LEN, offset);

         md = (struct io_uring_meta_pi *) sqe->big_sqe;
         md->addr = (__u64)wmb;
         md->len = META_LEN;
         /* flags to ask for guard/reftag/apptag checks */
         md->pi_flags = IO_INTEGRITY_CHK_APPTAG;
         md->app_tag = 0x1234;
         md->seed = 0;

         pi = (struct t10_pi_tuple *)wmb;
         /* apptag is big-endian on the wire; 0x3412 is 0x1234 byte-swapped */
         pi->apptag = 0x3412;

         ret = io_uring_submit(&ring);
         if (ret <= 0) {
                 fprintf(stderr, "sqe submit failed: %d\n", ret);
                 return 1;
         }

         ret = io_uring_wait_cqe(&ring, &cqe);
         if (!cqe) {
                 fprintf(stderr, "cqe is NULL :%d\n", ret);
                 return 1;
         }
         if (cqe->res < 0) {
                 fprintf(stderr, "write cqe failure: %d", cqe->res);
                 return 1;
         }

         io_uring_cqe_seen(&ring, cqe);

         /* read data + meta-buffer back from device */
         sqe = io_uring_get_sqe(&ring);
         if (!sqe) {
                 fprintf(stderr, "get sqe failed\n");
                 return 1;
         }

         io_uring_prep_read(sqe, fd, rdb, DATA_LEN, offset);

         md = (struct io_uring_meta_pi *) sqe->big_sqe;
         md->addr = (__u64)rmb;
         md->len = META_LEN;
         md->pi_flags = IO_INTEGRITY_CHK_APPTAG;
         md->app_tag = 0x1234;
         md->seed = 0;

         ret = io_uring_submit(&ring);
         if (ret <= 0) {
                 fprintf(stderr, "sqe submit failed: %d\n", ret);
                 return 1;
         }

         ret = io_uring_wait_cqe(&ring, &cqe);
         if (!cqe) {
                 fprintf(stderr, "cqe is NULL :%d\n", ret);
                 return 1;
         }

         if (cqe->res < 0) {
                 fprintf(stderr, "read cqe failure: %d", cqe->res);
                 return 1;
         }
         io_uring_cqe_seen(&ring, cqe);

         if (strncmp(wmb, rmb, META_LEN))
                 printf("Failure: meta mismatch!, wmb=%s, rmb=%s\n", wmb, rmb);

         if (strncmp(wdb, rdb, DATA_LEN))
                 printf("Failure: data mismatch!\n");

         io_uring_queue_exit(&ring);
         close(fd);
         free(rdb);
         free(wdb);
         return 0;
}
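
As an aside for anyone inspecting meta buffers by hand: the 'guard' field of
t10_pi_tuple above is a CRC16 over the data block using the T10-DIF polynomial
(0x8BB7). Below is a minimal bitwise sketch of that checksum in Python, a
reference implementation for debugging only, not a claim about the kernel's
optimized version:

```python
def crc16_t10dif(data: bytes) -> int:
    """CRC-16/T10-DIF: poly 0x8BB7, init 0, no reflection, no final xor."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            # MSB-first shift; xor in the polynomial when the top bit is set
            crc = ((crc << 1) ^ 0x8BB7) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc

print(hex(crc16_t10dif(b"123456789")))  # standard catalogue check value: 0xd0db
```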

Anuj Gupta (7):
  block: define set of integrity flags to be inherited by cloned bip
  block: modify bio_integrity_map_user to accept iov_iter as argument
  fs, iov_iter: define meta io descriptor
  fs: introduce IOCB_HAS_METADATA for metadata
  io_uring/rw: add support to send metadata along with read/write
  block: introduce BIP_CHECK_GUARD/REFTAG/APPTAG bip_flags
  scsi: add support for user-meta interface

Christoph Hellwig (1):
  block: copy back bounce buffer to user-space correctly in case of
    split

Kanchan Joshi (2):
  nvme: add support for passing on the application tag
  block: add support to pass user meta buffer

 block/bio-integrity.c         | 84 ++++++++++++++++++++++++++++-------
 block/blk-integrity.c         | 10 ++++-
 block/fops.c                  | 42 ++++++++++++++----
 drivers/nvme/host/core.c      | 21 +++++----
 drivers/scsi/sd.c             |  4 +-
 include/linux/bio-integrity.h | 26 ++++++++---
 include/linux/fs.h            |  1 +
 include/linux/uio.h           |  9 ++++
 include/uapi/linux/fs.h       |  9 ++++
 include/uapi/linux/io_uring.h | 16 +++++++
 io_uring/io_uring.c           |  4 ++
 io_uring/rw.c                 | 71 ++++++++++++++++++++++++++++-
 io_uring/rw.h                 | 14 +++++-
 13 files changed, 266 insertions(+), 45 deletions(-)

-- 
2.25.1



* [PATCH v6 01/10] block: define set of integrity flags to be inherited by cloned bip
       [not found]   ` <CGME20241030181000epcas5p2bfb47a79f1e796116135f646c6f0ccc7@epcas5p2.samsung.com>
@ 2024-10-30 18:01     ` Kanchan Joshi
  0 siblings, 0 replies; 24+ messages in thread
From: Kanchan Joshi @ 2024-10-30 18:01 UTC (permalink / raw)
  To: axboe, hch, kbusch, martin.petersen, asml.silence, brauner, viro,
	jack
  Cc: linux-nvme, linux-fsdevel, io-uring, linux-block, linux-scsi,
	gost.dev, vishak.g, anuj1072538, Anuj Gupta

From: Anuj Gupta <[email protected]>

Introduce BIP_CLONE_FLAGS describing integrity flags that should be
inherited in the cloned bip from the parent.

Suggested-by: Christoph Hellwig <[email protected]>
Signed-off-by: Anuj Gupta <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Martin K. Petersen <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
---
 block/bio-integrity.c         | 2 +-
 include/linux/bio-integrity.h | 3 +++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index 2a4bd6611692..a448a25d13de 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -559,7 +559,7 @@ int bio_integrity_clone(struct bio *bio, struct bio *bio_src,
 
 	bip->bip_vec = bip_src->bip_vec;
 	bip->bip_iter = bip_src->bip_iter;
-	bip->bip_flags = bip_src->bip_flags & ~BIP_BLOCK_INTEGRITY;
+	bip->bip_flags = bip_src->bip_flags & BIP_CLONE_FLAGS;
 
 	return 0;
 }
diff --git a/include/linux/bio-integrity.h b/include/linux/bio-integrity.h
index dbf0f74c1529..0f0cf10222e8 100644
--- a/include/linux/bio-integrity.h
+++ b/include/linux/bio-integrity.h
@@ -30,6 +30,9 @@ struct bio_integrity_payload {
 	struct bio_vec		bip_inline_vecs[];/* embedded bvec array */
 };
 
+#define BIP_CLONE_FLAGS (BIP_MAPPED_INTEGRITY | BIP_CTRL_NOCHECK | \
+			 BIP_DISK_NOCHECK | BIP_IP_CHECKSUM)
+
 #ifdef CONFIG_BLK_DEV_INTEGRITY
 
 #define bip_for_each_vec(bvl, bip, iter)				\
-- 
2.25.1



* [PATCH v6 02/10] block: copy back bounce buffer to user-space correctly in case of split
       [not found]   ` <CGME20241030181002epcas5p2b44e244bcd0c49d0a379f0f4fe07dc3f@epcas5p2.samsung.com>
@ 2024-10-30 18:01     ` Kanchan Joshi
  0 siblings, 0 replies; 24+ messages in thread
From: Kanchan Joshi @ 2024-10-30 18:01 UTC (permalink / raw)
  To: axboe, hch, kbusch, martin.petersen, asml.silence, brauner, viro,
	jack
  Cc: linux-nvme, linux-fsdevel, io-uring, linux-block, linux-scsi,
	gost.dev, vishak.g, anuj1072538, Anuj Gupta

From: Christoph Hellwig <[email protected]>

Copy back the bounce buffer to user-space in entirety when the parent
bio completes. The existing code uses bip_iter.bi_size for sizing the
copy, which can be modified. So move away from that and fetch it from
the vector passed to the block layer. While at it, switch to using
better variable names.

Fixes: 492c5d455969f ("block: bio-integrity: directly map user buffers")
Signed-off-by: Anuj Gupta <[email protected]>
[hch: better names for variables]
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
---
 block/bio-integrity.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index a448a25d13de..4341b0d4efa1 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -118,17 +118,18 @@ static void bio_integrity_unpin_bvec(struct bio_vec *bv, int nr_vecs,
 
 static void bio_integrity_uncopy_user(struct bio_integrity_payload *bip)
 {
-	unsigned short nr_vecs = bip->bip_max_vcnt - 1;
-	struct bio_vec *copy = &bip->bip_vec[1];
-	size_t bytes = bip->bip_iter.bi_size;
-	struct iov_iter iter;
+	unsigned short orig_nr_vecs = bip->bip_max_vcnt - 1;
+	struct bio_vec *orig_bvecs = &bip->bip_vec[1];
+	struct bio_vec *bounce_bvec = &bip->bip_vec[0];
+	size_t bytes = bounce_bvec->bv_len;
+	struct iov_iter orig_iter;
 	int ret;
 
-	iov_iter_bvec(&iter, ITER_DEST, copy, nr_vecs, bytes);
-	ret = copy_to_iter(bvec_virt(bip->bip_vec), bytes, &iter);
+	iov_iter_bvec(&orig_iter, ITER_DEST, orig_bvecs, orig_nr_vecs, bytes);
+	ret = copy_to_iter(bvec_virt(bounce_bvec), bytes, &orig_iter);
 	WARN_ON_ONCE(ret != bytes);
 
-	bio_integrity_unpin_bvec(copy, nr_vecs, true);
+	bio_integrity_unpin_bvec(orig_bvecs, orig_nr_vecs, true);
 }
 
 /**
-- 
2.25.1



* [PATCH v6 03/10] block: modify bio_integrity_map_user to accept iov_iter as argument
       [not found]   ` <CGME20241030181005epcas5p43b40adb5af1029c9ffaecde317bf1c5d@epcas5p4.samsung.com>
@ 2024-10-30 18:01     ` Kanchan Joshi
  2024-10-31  4:33       ` kernel test robot
  0 siblings, 1 reply; 24+ messages in thread
From: Kanchan Joshi @ 2024-10-30 18:01 UTC (permalink / raw)
  To: axboe, hch, kbusch, martin.petersen, asml.silence, brauner, viro,
	jack
  Cc: linux-nvme, linux-fsdevel, io-uring, linux-block, linux-scsi,
	gost.dev, vishak.g, anuj1072538, Anuj Gupta, Kanchan Joshi

From: Anuj Gupta <[email protected]>

This patch refactors bio_integrity_map_user to accept an iov_iter as
argument. This is a prep patch.

Signed-off-by: Anuj Gupta <[email protected]>
Signed-off-by: Kanchan Joshi <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
---
 block/bio-integrity.c         | 12 +++++-------
 block/blk-integrity.c         | 10 +++++++++-
 include/linux/bio-integrity.h |  5 ++---
 3 files changed, 16 insertions(+), 11 deletions(-)

diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index 4341b0d4efa1..f56d01cec689 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -302,16 +302,15 @@ static unsigned int bvec_from_pages(struct bio_vec *bvec, struct page **pages,
 	return nr_bvecs;
 }
 
-int bio_integrity_map_user(struct bio *bio, void __user *ubuf, ssize_t bytes)
+int bio_integrity_map_user(struct bio *bio, struct iov_iter *iter)
 {
 	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 	unsigned int align = blk_lim_dma_alignment_and_pad(&q->limits);
 	struct page *stack_pages[UIO_FASTIOV], **pages = stack_pages;
 	struct bio_vec stack_vec[UIO_FASTIOV], *bvec = stack_vec;
+	size_t offset, bytes = iter->count;
 	unsigned int direction, nr_bvecs;
-	struct iov_iter iter;
 	int ret, nr_vecs;
-	size_t offset;
 	bool copy;
 
 	if (bio_integrity(bio))
@@ -324,8 +323,7 @@ int bio_integrity_map_user(struct bio *bio, void __user *ubuf, ssize_t bytes)
 	else
 		direction = ITER_SOURCE;
 
-	iov_iter_ubuf(&iter, direction, ubuf, bytes);
-	nr_vecs = iov_iter_npages(&iter, BIO_MAX_VECS + 1);
+	nr_vecs = iov_iter_npages(iter, BIO_MAX_VECS + 1);
 	if (nr_vecs > BIO_MAX_VECS)
 		return -E2BIG;
 	if (nr_vecs > UIO_FASTIOV) {
@@ -335,8 +333,8 @@ int bio_integrity_map_user(struct bio *bio, void __user *ubuf, ssize_t bytes)
 		pages = NULL;
 	}
 
-	copy = !iov_iter_is_aligned(&iter, align, align);
-	ret = iov_iter_extract_pages(&iter, &pages, bytes, nr_vecs, 0, &offset);
+	copy = !iov_iter_is_aligned(iter, align, align);
+	ret = iov_iter_extract_pages(iter, &pages, bytes, nr_vecs, 0, &offset);
 	if (unlikely(ret < 0))
 		goto free_bvec;
 
diff --git a/block/blk-integrity.c b/block/blk-integrity.c
index b180cac61a9d..4a29754f1bc2 100644
--- a/block/blk-integrity.c
+++ b/block/blk-integrity.c
@@ -115,8 +115,16 @@ EXPORT_SYMBOL(blk_rq_map_integrity_sg);
 int blk_rq_integrity_map_user(struct request *rq, void __user *ubuf,
 			      ssize_t bytes)
 {
-	int ret = bio_integrity_map_user(rq->bio, ubuf, bytes);
+	int ret;
+	struct iov_iter iter;
+	unsigned int direction;
 
+	if (op_is_write(req_op(rq)))
+		direction = ITER_DEST;
+	else
+		direction = ITER_SOURCE;
+	iov_iter_ubuf(&iter, direction, ubuf, bytes);
+	ret = bio_integrity_map_user(rq->bio, &iter);
 	if (ret)
 		return ret;
 
diff --git a/include/linux/bio-integrity.h b/include/linux/bio-integrity.h
index 0f0cf10222e8..58ff9988433a 100644
--- a/include/linux/bio-integrity.h
+++ b/include/linux/bio-integrity.h
@@ -75,7 +75,7 @@ struct bio_integrity_payload *bio_integrity_alloc(struct bio *bio, gfp_t gfp,
 		unsigned int nr);
 int bio_integrity_add_page(struct bio *bio, struct page *page, unsigned int len,
 		unsigned int offset);
-int bio_integrity_map_user(struct bio *bio, void __user *ubuf, ssize_t len);
+int bio_integrity_map_user(struct bio *bio, struct iov_iter *iter);
 void bio_integrity_unmap_user(struct bio *bio);
 bool bio_integrity_prep(struct bio *bio);
 void bio_integrity_advance(struct bio *bio, unsigned int bytes_done);
@@ -101,8 +101,7 @@ static inline void bioset_integrity_free(struct bio_set *bs)
 {
 }
 
-static inline int bio_integrity_map_user(struct bio *bio, void __user *ubuf,
-					 ssize_t len)
+static int bio_integrity_map_user(struct bio *bio, struct iov_iter *iter)
 {
 	return -EINVAL;
 }
-- 
2.25.1



* [PATCH v6 04/10] fs, iov_iter: define meta io descriptor
       [not found]   ` <CGME20241030181008epcas5p333603fdbf3afb60947d3fc51138d11bf@epcas5p3.samsung.com>
@ 2024-10-30 18:01     ` Kanchan Joshi
  2024-10-31  6:55       ` Christoph Hellwig
  0 siblings, 1 reply; 24+ messages in thread
From: Kanchan Joshi @ 2024-10-30 18:01 UTC (permalink / raw)
  To: axboe, hch, kbusch, martin.petersen, asml.silence, brauner, viro,
	jack
  Cc: linux-nvme, linux-fsdevel, io-uring, linux-block, linux-scsi,
	gost.dev, vishak.g, anuj1072538, Anuj Gupta, Kanchan Joshi

From: Anuj Gupta <[email protected]>

Add flags to describe checks for the integrity meta buffer. Also, introduce
a new 'uio_meta' structure that the upper layer can use to pass the
meta/integrity information.

Signed-off-by: Kanchan Joshi <[email protected]>
Signed-off-by: Anuj Gupta <[email protected]>
---
 include/linux/uio.h     | 9 +++++++++
 include/uapi/linux/fs.h | 9 +++++++++
 2 files changed, 18 insertions(+)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 853f9de5aa05..8ada84e85447 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -82,6 +82,15 @@ struct iov_iter {
 	};
 };
 
+typedef __u16 uio_meta_flags_t;
+
+struct uio_meta {
+	uio_meta_flags_t	flags;
+	u16			app_tag;
+	u64			seed;
+	struct iov_iter		iter;
+};
+
 static inline const struct iovec *iter_iov(const struct iov_iter *iter)
 {
 	if (iter->iter_type == ITER_UBUF)
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 753971770733..9070ef19f0a3 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -40,6 +40,15 @@
 #define BLOCK_SIZE_BITS 10
 #define BLOCK_SIZE (1<<BLOCK_SIZE_BITS)
 
+/* flags for integrity meta */
+#define IO_INTEGRITY_CHK_GUARD		(1U << 0) /* enforce guard check */
+#define IO_INTEGRITY_CHK_REFTAG		(1U << 1) /* enforce ref check */
+#define IO_INTEGRITY_CHK_APPTAG		(1U << 2) /* enforce app check */
+
+#define IO_INTEGRITY_VALID_FLAGS (IO_INTEGRITY_CHK_GUARD | \
+				  IO_INTEGRITY_CHK_REFTAG | \
+				  IO_INTEGRITY_CHK_APPTAG)
+
 #define SEEK_SET	0	/* seek relative to beginning of file */
 #define SEEK_CUR	1	/* seek relative to current file position */
 #define SEEK_END	2	/* seek relative to end of file */
-- 
2.25.1



* [PATCH v6 05/10] fs: introduce IOCB_HAS_METADATA for metadata
       [not found]   ` <CGME20241030181010epcas5p2c399ecea97ed6d0e5fb228b5d15c2089@epcas5p2.samsung.com>
@ 2024-10-30 18:01     ` Kanchan Joshi
  0 siblings, 0 replies; 24+ messages in thread
From: Kanchan Joshi @ 2024-10-30 18:01 UTC (permalink / raw)
  To: axboe, hch, kbusch, martin.petersen, asml.silence, brauner, viro,
	jack
  Cc: linux-nvme, linux-fsdevel, io-uring, linux-block, linux-scsi,
	gost.dev, vishak.g, anuj1072538, Anuj Gupta

From: Anuj Gupta <[email protected]>

Introduce an IOCB_HAS_METADATA flag for the kiocb struct, for handling
requests containing meta payload.

Signed-off-by: Anuj Gupta <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
 include/linux/fs.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 4b5cad44a126..7f14675b02df 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -346,6 +346,7 @@ struct readahead_control;
 #define IOCB_DIO_CALLER_COMP	(1 << 22)
 /* kiocb is a read or write operation submitted by fs/aio.c. */
 #define IOCB_AIO_RW		(1 << 23)
+#define IOCB_HAS_METADATA	(1 << 24)
 
 /* for use in trace events */
 #define TRACE_IOCB_STRINGS \
-- 
2.25.1



* [PATCH v6 06/10] io_uring/rw: add support to send metadata along with read/write
       [not found]   ` <CGME20241030181013epcas5p2762403c83e29c81ec34b2a7755154245@epcas5p2.samsung.com>
@ 2024-10-30 18:01     ` Kanchan Joshi
  2024-10-30 21:09       ` Keith Busch
  2024-10-31  6:55       ` Christoph Hellwig
  0 siblings, 2 replies; 24+ messages in thread
From: Kanchan Joshi @ 2024-10-30 18:01 UTC (permalink / raw)
  To: axboe, hch, kbusch, martin.petersen, asml.silence, brauner, viro,
	jack
  Cc: linux-nvme, linux-fsdevel, io-uring, linux-block, linux-scsi,
	gost.dev, vishak.g, anuj1072538, Anuj Gupta, Kanchan Joshi

From: Anuj Gupta <[email protected]>

This patch adds the capability to pass integrity metadata along with
read/write.

Introduce a new 'struct io_uring_meta_pi' that contains following:
- pi_flags: integrity check flags namely
IO_INTEGRITY_CHK_{GUARD/APPTAG/REFTAG}
- len: length of the pi/metadata buffer
- buf: address of the metadata buffer
- seed: seed value for reftag remapping
- app_tag: application defined 16b value

The application sets up an SQE128 ring and prepares io_uring_meta_pi within
the second SQE.
The patch processes this information to prepare uio_meta descriptor
and passes it down using kiocb->private.

Meta exchange is supported only for direct IO. Vectored read/write
operations with meta are currently not supported.

Signed-off-by: Anuj Gupta <[email protected]>
Signed-off-by: Kanchan Joshi <[email protected]>
---
 include/uapi/linux/io_uring.h | 16 ++++++++
 io_uring/io_uring.c           |  4 ++
 io_uring/rw.c                 | 71 ++++++++++++++++++++++++++++++++++-
 io_uring/rw.h                 | 14 ++++++-
 4 files changed, 102 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 024745283783..48dcca125db3 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -105,6 +105,22 @@ struct io_uring_sqe {
 		 */
 		__u8	cmd[0];
 	};
+	/*
+	 * If the ring is initialized with IORING_SETUP_SQE128, then
+	 * this field is starting offset for 64 bytes of data. For meta io
+	 * this contains 'struct io_uring_meta_pi'
+	 */
+	__u8	big_sqe[0];
+};
+
+/* this is placed in SQE128 */
+struct io_uring_meta_pi {
+	__u16		pi_flags;
+	__u16		app_tag;
+	__u32		len;
+	__u64		addr;
+	__u64		seed;
+	__u64		rsvd[2];
 };
 
 /*
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 44a772013c09..c5fd74e42c04 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -3879,6 +3879,7 @@ static int __init io_uring_init(void)
 	BUILD_BUG_SQE_ELEM(48, __u64,  addr3);
 	BUILD_BUG_SQE_ELEM_SIZE(48, 0, cmd);
 	BUILD_BUG_SQE_ELEM(56, __u64,  __pad2);
+	BUILD_BUG_SQE_ELEM_SIZE(64, 0, big_sqe);
 
 	BUILD_BUG_ON(sizeof(struct io_uring_files_update) !=
 		     sizeof(struct io_uring_rsrc_update));
@@ -3902,6 +3903,9 @@ static int __init io_uring_init(void)
 	/* top 8bits are for internal use */
 	BUILD_BUG_ON((IORING_URING_CMD_MASK & 0xff000000) != 0);
 
+	BUILD_BUG_ON(sizeof(struct io_uring_meta_pi) >
+		     sizeof(struct io_uring_sqe));
+
 	io_uring_optable_init();
 
 	/*
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 30448f343c7f..cbb74fcfd0d1 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -257,6 +257,46 @@ static int io_prep_rw_setup(struct io_kiocb *req, int ddir, bool do_import)
 	return 0;
 }
 
+static inline void io_meta_save_state(struct io_async_rw *io)
+{
+	io->meta_state.seed = io->meta.seed;
+	iov_iter_save_state(&io->meta.iter, &io->meta_state.iter_meta);
+}
+
+static inline void io_meta_restore(struct io_async_rw *io)
+{
+	io->meta.seed = io->meta_state.seed;
+	iov_iter_restore(&io->meta.iter, &io->meta_state.iter_meta);
+}
+
+static int io_prep_rw_meta(struct io_kiocb *req, const struct io_uring_sqe *sqe,
+			   struct io_rw *rw, int ddir)
+{
+	const struct io_uring_meta_pi *md = (struct io_uring_meta_pi *)sqe->big_sqe;
+	const struct io_issue_def *def;
+	struct io_async_rw *io;
+	int ret;
+
+	if (READ_ONCE(md->rsvd[0]) || READ_ONCE(md->rsvd[1]))
+		return -EINVAL;
+
+	def = &io_issue_defs[req->opcode];
+	if (def->vectored)
+		return -EOPNOTSUPP;
+
+	io = req->async_data;
+	io->meta.flags = READ_ONCE(md->pi_flags);
+	io->meta.app_tag = READ_ONCE(md->app_tag);
+	io->meta.seed = READ_ONCE(md->seed);
+	ret = import_ubuf(ddir, u64_to_user_ptr(READ_ONCE(md->addr)),
+			  READ_ONCE(md->len), &io->meta.iter);
+	if (unlikely(ret < 0))
+		return ret;
+	rw->kiocb.ki_flags |= IOCB_HAS_METADATA;
+	io_meta_save_state(io);
+	return ret;
+}
+
 static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 		      int ddir, bool do_import)
 {
@@ -279,11 +319,19 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 		rw->kiocb.ki_ioprio = get_current_ioprio();
 	}
 	rw->kiocb.dio_complete = NULL;
+	rw->kiocb.ki_flags = 0;
 
 	rw->addr = READ_ONCE(sqe->addr);
 	rw->len = READ_ONCE(sqe->len);
 	rw->flags = READ_ONCE(sqe->rw_flags);
-	return io_prep_rw_setup(req, ddir, do_import);
+	ret = io_prep_rw_setup(req, ddir, do_import);
+
+	if (unlikely(ret))
+		return ret;
+
+	if (req->ctx->flags & IORING_SETUP_SQE128)
+		ret = io_prep_rw_meta(req, sqe, rw, ddir);
+	return ret;
 }
 
 int io_prep_read(struct io_kiocb *req, const struct io_uring_sqe *sqe)
@@ -409,7 +457,10 @@ static inline loff_t *io_kiocb_update_pos(struct io_kiocb *req)
 static void io_resubmit_prep(struct io_kiocb *req)
 {
 	struct io_async_rw *io = req->async_data;
+	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
 
+	if (rw->kiocb.ki_flags & IOCB_HAS_METADATA)
+		io_meta_restore(io);
 	iov_iter_restore(&io->iter, &io->iter_state);
 }
 
@@ -794,7 +845,7 @@ static int io_rw_init_file(struct io_kiocb *req, fmode_t mode, int rw_type)
 	if (!(req->flags & REQ_F_FIXED_FILE))
 		req->flags |= io_file_get_flags(file);
 
-	kiocb->ki_flags = file->f_iocb_flags;
+	kiocb->ki_flags |= file->f_iocb_flags;
 	ret = kiocb_set_rw_flags(kiocb, rw->flags, rw_type);
 	if (unlikely(ret))
 		return ret;
@@ -823,6 +874,18 @@ static int io_rw_init_file(struct io_kiocb *req, fmode_t mode, int rw_type)
 		kiocb->ki_complete = io_complete_rw;
 	}
 
+	if (kiocb->ki_flags & IOCB_HAS_METADATA) {
+		struct io_async_rw *io = req->async_data;
+
+		/*
+		 * We have a union of meta fields with wpq used for buffered-io
+		 * in io_async_rw, so fail it here.
+		 */
+		if (!(req->file->f_flags & O_DIRECT))
+			return -EOPNOTSUPP;
+		kiocb->private = &io->meta;
+	}
+
 	return 0;
 }
 
@@ -897,6 +960,8 @@ static int __io_read(struct io_kiocb *req, unsigned int issue_flags)
 	 * manually if we need to.
 	 */
 	iov_iter_restore(&io->iter, &io->iter_state);
+	if (kiocb->ki_flags & IOCB_HAS_METADATA)
+		io_meta_restore(io);
 
 	do {
 		/*
@@ -1101,6 +1166,8 @@ int io_write(struct io_kiocb *req, unsigned int issue_flags)
 	} else {
 ret_eagain:
 		iov_iter_restore(&io->iter, &io->iter_state);
+		if (kiocb->ki_flags & IOCB_HAS_METADATA)
+			io_meta_restore(io);
 		if (kiocb->ki_flags & IOCB_WRITE)
 			io_req_end_write(req);
 		return -EAGAIN;
diff --git a/io_uring/rw.h b/io_uring/rw.h
index 3f432dc75441..2d7656bd268d 100644
--- a/io_uring/rw.h
+++ b/io_uring/rw.h
@@ -2,6 +2,11 @@
 
 #include <linux/pagemap.h>
 
+struct io_meta_state {
+	u32			seed;
+	struct iov_iter_state	iter_meta;
+};
+
 struct io_async_rw {
 	size_t				bytes_done;
 	struct iov_iter			iter;
@@ -9,7 +14,14 @@ struct io_async_rw {
 	struct iovec			fast_iov;
 	struct iovec			*free_iovec;
 	int				free_iov_nr;
-	struct wait_page_queue		wpq;
+	/* wpq is for buffered io, while meta fields are used with direct io */
+	union {
+		struct wait_page_queue		wpq;
+		struct {
+			struct uio_meta			meta;
+			struct io_meta_state		meta_state;
+		};
+	};
 };
 
 int io_prep_read_fixed(struct io_kiocb *req, const struct io_uring_sqe *sqe);
-- 
2.25.1



* [PATCH v6 07/10] block: introduce BIP_CHECK_GUARD/REFTAG/APPTAG bip_flags
       [not found]   ` <CGME20241030181016epcas5p3da284aa997e81d9855207584ab4bace3@epcas5p3.samsung.com>
@ 2024-10-30 18:01     ` Kanchan Joshi
  0 siblings, 0 replies; 24+ messages in thread
From: Kanchan Joshi @ 2024-10-30 18:01 UTC (permalink / raw)
  To: axboe, hch, kbusch, martin.petersen, asml.silence, brauner, viro,
	jack
  Cc: linux-nvme, linux-fsdevel, io-uring, linux-block, linux-scsi,
	gost.dev, vishak.g, anuj1072538, Anuj Gupta, Kanchan Joshi

From: Anuj Gupta <[email protected]>

This patch introduces BIP_CHECK_GUARD/REFTAG/APPTAG bip_flags which
indicate how the hardware should check the integrity payload.
BIP_CHECK_GUARD/REFTAG are a conversion of existing semantics, while
BIP_CHECK_APPTAG is a new flag. The driver can now just rely on block
layer flags, and doesn't need to know the integrity source. Submitter
of PI decides which tags to check. This would also give us a unified
interface for user and kernel generated integrity.

Signed-off-by: Anuj Gupta <[email protected]>
Signed-off-by: Kanchan Joshi <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
---
 block/bio-integrity.c         |  5 +++++
 drivers/nvme/host/core.c      | 11 +++--------
 include/linux/bio-integrity.h |  6 +++++-
 3 files changed, 13 insertions(+), 9 deletions(-)

diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index f56d01cec689..3bee43b87001 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -434,6 +434,11 @@ bool bio_integrity_prep(struct bio *bio)
 	if (bi->csum_type == BLK_INTEGRITY_CSUM_IP)
 		bip->bip_flags |= BIP_IP_CHECKSUM;
 
+	/* describe what tags to check in payload */
+	if (bi->csum_type)
+		bip->bip_flags |= BIP_CHECK_GUARD;
+	if (bi->flags & BLK_INTEGRITY_REF_TAG)
+		bip->bip_flags |= BIP_CHECK_REFTAG;
 	if (bio_integrity_add_page(bio, virt_to_page(buf), len,
 			offset_in_page(buf)) < len) {
 		printk(KERN_ERR "could not attach integrity payload\n");
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 3de7555a7de7..79bd6b22e88d 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1004,18 +1004,13 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
 			control |= NVME_RW_PRINFO_PRACT;
 		}
 
-		switch (ns->head->pi_type) {
-		case NVME_NS_DPS_PI_TYPE3:
+		if (bio_integrity_flagged(req->bio, BIP_CHECK_GUARD))
 			control |= NVME_RW_PRINFO_PRCHK_GUARD;
-			break;
-		case NVME_NS_DPS_PI_TYPE1:
-		case NVME_NS_DPS_PI_TYPE2:
-			control |= NVME_RW_PRINFO_PRCHK_GUARD |
-					NVME_RW_PRINFO_PRCHK_REF;
+		if (bio_integrity_flagged(req->bio, BIP_CHECK_REFTAG)) {
+			control |= NVME_RW_PRINFO_PRCHK_REF;
 			if (op == nvme_cmd_zone_append)
 				control |= NVME_RW_APPEND_PIREMAP;
 			nvme_set_ref_tag(ns, cmnd, req);
-			break;
 		}
 	}
 
diff --git a/include/linux/bio-integrity.h b/include/linux/bio-integrity.h
index 58ff9988433a..fe2bfe122db2 100644
--- a/include/linux/bio-integrity.h
+++ b/include/linux/bio-integrity.h
@@ -11,6 +11,9 @@ enum bip_flags {
 	BIP_DISK_NOCHECK	= 1 << 3, /* disable disk integrity checking */
 	BIP_IP_CHECKSUM		= 1 << 4, /* IP checksum */
 	BIP_COPY_USER		= 1 << 5, /* Kernel bounce buffer in use */
+	BIP_CHECK_GUARD		= 1 << 6, /* guard check */
+	BIP_CHECK_REFTAG	= 1 << 7, /* reftag check */
+	BIP_CHECK_APPTAG	= 1 << 8, /* apptag check */
 };
 
 struct bio_integrity_payload {
@@ -31,7 +34,8 @@ struct bio_integrity_payload {
 };
 
 #define BIP_CLONE_FLAGS (BIP_MAPPED_INTEGRITY | BIP_CTRL_NOCHECK | \
-			 BIP_DISK_NOCHECK | BIP_IP_CHECKSUM)
+			 BIP_DISK_NOCHECK | BIP_IP_CHECKSUM | \
+			 BIP_CHECK_GUARD | BIP_CHECK_REFTAG | BIP_CHECK_APPTAG)
 
 #ifdef CONFIG_BLK_DEV_INTEGRITY
 
-- 
2.25.1



* [PATCH v6 08/10] nvme: add support for passing on the application tag
       [not found]   ` <CGME20241030181019epcas5p135961d721959d80f1f60bd4790ed52cf@epcas5p1.samsung.com>
@ 2024-10-30 18:01     ` Kanchan Joshi
  0 siblings, 0 replies; 24+ messages in thread
From: Kanchan Joshi @ 2024-10-30 18:01 UTC (permalink / raw)
  To: axboe, hch, kbusch, martin.petersen, asml.silence, brauner, viro,
	jack
  Cc: linux-nvme, linux-fsdevel, io-uring, linux-block, linux-scsi,
	gost.dev, vishak.g, anuj1072538, Kanchan Joshi, Anuj Gupta

With a user integrity buffer, there is a way to specify the app_tag.
Set the corresponding protocol-specific flags and send the app_tag down.

Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Anuj Gupta <[email protected]>
Signed-off-by: Kanchan Joshi <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
---
 drivers/nvme/host/core.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 79bd6b22e88d..3b329e036d33 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -872,6 +872,12 @@ static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req,
 	return BLK_STS_OK;
 }
 
+static void nvme_set_app_tag(struct request *req, struct nvme_command *cmnd)
+{
+	cmnd->rw.lbat = cpu_to_le16(bio_integrity(req->bio)->app_tag);
+	cmnd->rw.lbatm = cpu_to_le16(0xffff);
+}
+
 static void nvme_set_ref_tag(struct nvme_ns *ns, struct nvme_command *cmnd,
 			      struct request *req)
 {
@@ -1012,6 +1018,10 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
 				control |= NVME_RW_APPEND_PIREMAP;
 			nvme_set_ref_tag(ns, cmnd, req);
 		}
+		if (bio_integrity_flagged(req->bio, BIP_CHECK_APPTAG)) {
+			control |= NVME_RW_PRINFO_PRCHK_APP;
+			nvme_set_app_tag(req, cmnd);
+		}
 	}
 
 	cmnd->rw.control = cpu_to_le16(control);
-- 
2.25.1



* [PATCH v6 09/10] scsi: add support for user-meta interface
       [not found]   ` <CGME20241030181021epcas5p1c61b7980358f3120014b4f99390d1595@epcas5p1.samsung.com>
@ 2024-10-30 18:01     ` Kanchan Joshi
  2024-10-31  5:09       ` kernel test robot
  2024-10-31  5:10       ` kernel test robot
  0 siblings, 2 replies; 24+ messages in thread
From: Kanchan Joshi @ 2024-10-30 18:01 UTC (permalink / raw)
  To: axboe, hch, kbusch, martin.petersen, asml.silence, brauner, viro,
	jack
  Cc: linux-nvme, linux-fsdevel, io-uring, linux-block, linux-scsi,
	gost.dev, vishak.g, anuj1072538, Anuj Gupta

From: Anuj Gupta <[email protected]>

Add support for sending a user-meta buffer. Set the tags to be checked
using the flags specified by the user/block layer.
With this change, BIP_CTRL_NOCHECK becomes unused. Remove it.

Signed-off-by: Anuj Gupta <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
 drivers/scsi/sd.c             |  4 ++--
 include/linux/bio-integrity.h | 17 ++++++++---------
 2 files changed, 10 insertions(+), 11 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index ca4bc0ac76ad..d1a2ae0d4c29 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -814,14 +814,14 @@ static unsigned char sd_setup_protect_cmnd(struct scsi_cmnd *scmd,
 		if (bio_integrity_flagged(bio, BIP_IP_CHECKSUM))
 			scmd->prot_flags |= SCSI_PROT_IP_CHECKSUM;
 
-		if (bio_integrity_flagged(bio, BIP_CTRL_NOCHECK) == false)
+		if (bio_integrity_flagged(bio, BIP_CHECK_GUARD))
 			scmd->prot_flags |= SCSI_PROT_GUARD_CHECK;
 	}
 
 	if (dif != T10_PI_TYPE3_PROTECTION) {	/* DIX/DIF Type 0, 1, 2 */
 		scmd->prot_flags |= SCSI_PROT_REF_INCREMENT;
 
-		if (bio_integrity_flagged(bio, BIP_CTRL_NOCHECK) == false)
+		if (bio_integrity_flagged(bio, BIP_CHECK_REFTAG))
 			scmd->prot_flags |= SCSI_PROT_REF_CHECK;
 	}
 
diff --git a/include/linux/bio-integrity.h b/include/linux/bio-integrity.h
index fe2bfe122db2..0046c744ea53 100644
--- a/include/linux/bio-integrity.h
+++ b/include/linux/bio-integrity.h
@@ -7,13 +7,12 @@
 enum bip_flags {
 	BIP_BLOCK_INTEGRITY	= 1 << 0, /* block layer owns integrity data */
 	BIP_MAPPED_INTEGRITY	= 1 << 1, /* ref tag has been remapped */
-	BIP_CTRL_NOCHECK	= 1 << 2, /* disable HBA integrity checking */
-	BIP_DISK_NOCHECK	= 1 << 3, /* disable disk integrity checking */
-	BIP_IP_CHECKSUM		= 1 << 4, /* IP checksum */
-	BIP_COPY_USER		= 1 << 5, /* Kernel bounce buffer in use */
-	BIP_CHECK_GUARD		= 1 << 6, /* guard check */
-	BIP_CHECK_REFTAG	= 1 << 7, /* reftag check */
-	BIP_CHECK_APPTAG	= 1 << 8, /* apptag check */
+	BIP_DISK_NOCHECK	= 1 << 2, /* disable disk integrity checking */
+	BIP_IP_CHECKSUM		= 1 << 3, /* IP checksum */
+	BIP_COPY_USER		= 1 << 4, /* Kernel bounce buffer in use */
+	BIP_CHECK_GUARD		= 1 << 5, /* guard check */
+	BIP_CHECK_REFTAG	= 1 << 6, /* reftag check */
+	BIP_CHECK_APPTAG	= 1 << 7, /* apptag check */
 };
 
 struct bio_integrity_payload {
@@ -34,8 +33,8 @@ struct bio_integrity_payload {
 };
 
 #define BIP_CLONE_FLAGS (BIP_MAPPED_INTEGRITY | BIP_CTRL_NOCHECK | \
-			 BIP_DISK_NOCHECK | BIP_IP_CHECKSUM | \
-			 BIP_CHECK_GUARD | BIP_CHECK_REFTAG | BIP_CHECK_APPTAG)
+			 BIP_IP_CHECKSUM | BIP_CHECK_GUARD | \
+			 BIP_CHECK_REFTAG | BIP_CHECK_APPTAG)
 
 #ifdef CONFIG_BLK_DEV_INTEGRITY
 
-- 
2.25.1



* [PATCH v6 10/10] block: add support to pass user meta buffer
       [not found]   ` <CGME20241030181024epcas5p3964697a08159f8593a6f94764f77a7f3@epcas5p3.samsung.com>
@ 2024-10-30 18:01     ` Kanchan Joshi
  0 siblings, 0 replies; 24+ messages in thread
From: Kanchan Joshi @ 2024-10-30 18:01 UTC (permalink / raw)
  To: axboe, hch, kbusch, martin.petersen, asml.silence, brauner, viro,
	jack
  Cc: linux-nvme, linux-fsdevel, io-uring, linux-block, linux-scsi,
	gost.dev, vishak.g, anuj1072538, Kanchan Joshi, Anuj Gupta

If an iocb contains metadata, extract it and prepare the bip. Based on
the flags specified by the user, set the corresponding guard/app/ref
tags to be checked in the bip.

Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Anuj Gupta <[email protected]>
Signed-off-by: Kanchan Joshi <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
---
 block/bio-integrity.c         | 50 +++++++++++++++++++++++++++++++++++
 block/fops.c                  | 42 ++++++++++++++++++++++-------
 include/linux/bio-integrity.h |  7 +++++
 3 files changed, 90 insertions(+), 9 deletions(-)

diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index 3bee43b87001..5d81ad9a3d20 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -364,6 +364,55 @@ int bio_integrity_map_user(struct bio *bio, struct iov_iter *iter)
 	return ret;
 }
 
+static void bio_uio_meta_to_bip(struct bio *bio, struct uio_meta *meta)
+{
+	struct bio_integrity_payload *bip = bio_integrity(bio);
+
+	if (meta->flags & IO_INTEGRITY_CHK_GUARD)
+		bip->bip_flags |= BIP_CHECK_GUARD;
+	if (meta->flags & IO_INTEGRITY_CHK_APPTAG)
+		bip->bip_flags |= BIP_CHECK_APPTAG;
+	if (meta->flags & IO_INTEGRITY_CHK_REFTAG)
+		bip->bip_flags |= BIP_CHECK_REFTAG;
+
+	bip->app_tag = meta->app_tag;
+}
+
+int bio_integrity_map_iter(struct bio *bio, struct uio_meta *meta)
+{
+	struct blk_integrity *bi = blk_get_integrity(bio->bi_bdev->bd_disk);
+	unsigned int integrity_bytes;
+	int ret;
+	struct iov_iter it;
+
+	if (!bi)
+		return -EINVAL;
+	/*
+	 * original meta iterator can be bigger.
+	 * process integrity info corresponding to current data buffer only.
+	 */
+	it = meta->iter;
+	integrity_bytes = bio_integrity_bytes(bi, bio_sectors(bio));
+	if (it.count < integrity_bytes)
+		return -EINVAL;
+
+	/* should fit into two bytes */
+	BUILD_BUG_ON(IO_INTEGRITY_VALID_FLAGS >= (1 << 16));
+
+	if (meta->flags && (meta->flags & ~IO_INTEGRITY_VALID_FLAGS))
+		return -EINVAL;
+
+	it.count = integrity_bytes;
+	ret = bio_integrity_map_user(bio, &it);
+	if (!ret) {
+		bio_uio_meta_to_bip(bio, meta);
+		bip_set_seed(bio_integrity(bio), meta->seed);
+		iov_iter_advance(&meta->iter, integrity_bytes);
+		meta->seed += bio_integrity_intervals(bi, bio_sectors(bio));
+	}
+	return ret;
+}
+
 /**
  * bio_integrity_prep - Prepare bio for integrity I/O
  * @bio:	bio to prepare
@@ -564,6 +613,7 @@ int bio_integrity_clone(struct bio *bio, struct bio *bio_src,
 	bip->bip_vec = bip_src->bip_vec;
 	bip->bip_iter = bip_src->bip_iter;
 	bip->bip_flags = bip_src->bip_flags & BIP_CLONE_FLAGS;
+	bip->app_tag = bip_src->app_tag;
 
 	return 0;
 }
diff --git a/block/fops.c b/block/fops.c
index 2d01c9007681..3cf7e15eabbc 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -54,6 +54,7 @@ static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
 	struct bio bio;
 	ssize_t ret;
 
+	WARN_ON_ONCE(iocb->ki_flags & IOCB_HAS_METADATA);
 	if (nr_pages <= DIO_INLINE_BIO_VECS)
 		vecs = inline_vecs;
 	else {
@@ -128,6 +129,9 @@ static void blkdev_bio_end_io(struct bio *bio)
 	if (bio->bi_status && !dio->bio.bi_status)
 		dio->bio.bi_status = bio->bi_status;
 
+	if (dio->iocb->ki_flags & IOCB_HAS_METADATA)
+		bio_integrity_unmap_user(bio);
+
 	if (atomic_dec_and_test(&dio->ref)) {
 		if (!(dio->flags & DIO_IS_SYNC)) {
 			struct kiocb *iocb = dio->iocb;
@@ -221,14 +225,16 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 			 * a retry of this from blocking context.
 			 */
 			if (unlikely(iov_iter_count(iter))) {
-				bio_release_pages(bio, false);
-				bio_clear_flag(bio, BIO_REFFED);
-				bio_put(bio);
-				blk_finish_plug(&plug);
-				return -EAGAIN;
+				ret = -EAGAIN;
+				goto fail;
 			}
 			bio->bi_opf |= REQ_NOWAIT;
 		}
+		if (!is_sync && (iocb->ki_flags & IOCB_HAS_METADATA)) {
+			ret = bio_integrity_map_iter(bio, iocb->private);
+			if (unlikely(ret))
+				goto fail;
+		}
 
 		if (is_read) {
 			if (dio->flags & DIO_SHOULD_DIRTY)
@@ -269,6 +275,12 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 
 	bio_put(&dio->bio);
 	return ret;
+fail:
+	bio_release_pages(bio, false);
+	bio_clear_flag(bio, BIO_REFFED);
+	bio_put(bio);
+	blk_finish_plug(&plug);
+	return ret;
 }
 
 static void blkdev_bio_end_io_async(struct bio *bio)
@@ -286,6 +298,9 @@ static void blkdev_bio_end_io_async(struct bio *bio)
 		ret = blk_status_to_errno(bio->bi_status);
 	}
 
+	if (iocb->ki_flags & IOCB_HAS_METADATA)
+		bio_integrity_unmap_user(bio);
+
 	iocb->ki_complete(iocb, ret);
 
 	if (dio->flags & DIO_SHOULD_DIRTY) {
@@ -330,10 +345,8 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
 		bio_iov_bvec_set(bio, iter);
 	} else {
 		ret = bio_iov_iter_get_pages(bio, iter);
-		if (unlikely(ret)) {
-			bio_put(bio);
-			return ret;
-		}
+		if (unlikely(ret))
+			goto out_bio_put;
 	}
 	dio->size = bio->bi_iter.bi_size;
 
@@ -346,6 +359,13 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
 		task_io_account_write(bio->bi_iter.bi_size);
 	}
 
+	if (iocb->ki_flags & IOCB_HAS_METADATA) {
+		ret = bio_integrity_map_iter(bio, iocb->private);
+		WRITE_ONCE(iocb->private, NULL);
+		if (unlikely(ret))
+			goto out_bio_put;
+	}
+
 	if (iocb->ki_flags & IOCB_ATOMIC)
 		bio->bi_opf |= REQ_ATOMIC;
 
@@ -360,6 +380,10 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
 		submit_bio(bio);
 	}
 	return -EIOCBQUEUED;
+
+out_bio_put:
+	bio_put(bio);
+	return ret;
 }
 
 static ssize_t blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
diff --git a/include/linux/bio-integrity.h b/include/linux/bio-integrity.h
index 0046c744ea53..a42b7fe0eee9 100644
--- a/include/linux/bio-integrity.h
+++ b/include/linux/bio-integrity.h
@@ -23,6 +23,7 @@ struct bio_integrity_payload {
 	unsigned short		bip_vcnt;	/* # of integrity bio_vecs */
 	unsigned short		bip_max_vcnt;	/* integrity bio_vec slots */
 	unsigned short		bip_flags;	/* control flags */
+	u16			app_tag;	/* application tag value */
 
 	struct bvec_iter	bio_iter;	/* for rewinding parent bio */
 
@@ -79,6 +80,7 @@ struct bio_integrity_payload *bio_integrity_alloc(struct bio *bio, gfp_t gfp,
 int bio_integrity_add_page(struct bio *bio, struct page *page, unsigned int len,
 		unsigned int offset);
 int bio_integrity_map_user(struct bio *bio, struct iov_iter *iter);
+int bio_integrity_map_iter(struct bio *bio, struct uio_meta *meta);
 void bio_integrity_unmap_user(struct bio *bio);
 bool bio_integrity_prep(struct bio *bio);
 void bio_integrity_advance(struct bio *bio, unsigned int bytes_done);
@@ -109,6 +111,11 @@ static int bio_integrity_map_user(struct bio *bio, struct iov_iter *iter)
 	return -EINVAL;
 }
 
+static inline int bio_integrity_map_iter(struct bio *bio, struct uio_meta *meta)
+{
+	return -EINVAL;
+}
+
 static inline void bio_integrity_unmap_user(struct bio *bio)
 {
 }
-- 
2.25.1



* Re: [PATCH v6 06/10] io_uring/rw: add support to send metadata along with read/write
  2024-10-30 18:01     ` [PATCH v6 06/10] io_uring/rw: add support to send metadata along with read/write Kanchan Joshi
@ 2024-10-30 21:09       ` Keith Busch
  2024-10-31 14:39         ` Pavel Begunkov
  2024-10-31  6:55       ` Christoph Hellwig
  1 sibling, 1 reply; 24+ messages in thread
From: Keith Busch @ 2024-10-30 21:09 UTC (permalink / raw)
  To: Kanchan Joshi
  Cc: axboe, hch, martin.petersen, asml.silence, brauner, viro, jack,
	linux-nvme, linux-fsdevel, io-uring, linux-block, linux-scsi,
	gost.dev, vishak.g, anuj1072538, Anuj Gupta

On Wed, Oct 30, 2024 at 11:31:08PM +0530, Kanchan Joshi wrote:
> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> index 024745283783..48dcca125db3 100644
> --- a/include/uapi/linux/io_uring.h
> +++ b/include/uapi/linux/io_uring.h
> @@ -105,6 +105,22 @@ struct io_uring_sqe {
>  		 */
>  		__u8	cmd[0];
>  	};
> +	/*
> +	 * If the ring is initialized with IORING_SETUP_SQE128, then
> +	 * this field is starting offset for 64 bytes of data. For meta io
> +	 * this contains 'struct io_uring_meta_pi'
> +	 */
> +	__u8	big_sqe[0];
> +};
> +
> +/* this is placed in SQE128 */
> +struct io_uring_meta_pi {
> +	__u16		pi_flags;
> +	__u16		app_tag;
> +	__u32		len;
> +	__u64		addr;
> +	__u64		seed;
> +	__u64		rsvd[2];
>  };

On the previous version, I was more questioning whether it aligns with
what Pavel was trying to do here. I didn't quite get it, so I was more
confused than saying it should be this way now.

But I personally think this path makes sense. I would set it up just a
little differently for extended SQEs so that the PI overlays a more
generic struct that other opcodes might find a way to use later.
Something like:

struct io_uring_sqe_ext {
	union {
		__u32	rsvd0[8];
		struct {
			__u16		pi_flags;
			__u16		app_tag;
			__u32		len;
			__u64		addr;
			__u64		seed;
		} rw_pi;
	};
	__u32	rsvd1[8];
};
  
> @@ -3902,6 +3903,9 @@ static int __init io_uring_init(void)
>  	/* top 8bits are for internal use */
>  	BUILD_BUG_ON((IORING_URING_CMD_MASK & 0xff000000) != 0);
>  
> +	BUILD_BUG_ON(sizeof(struct io_uring_meta_pi) >
> +		     sizeof(struct io_uring_sqe));

Then this check would become:

	BUILD_BUG_ON(sizeof(struct io_uring_sqe_ext) != sizeof(struct io_uring_sqe));
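For context, the v6 layout quoted above could be exercised from
userspace roughly as follows. This is only a sketch: the helper name
`set_meta_pi` and the raw-buffer handling are invented for
illustration, not part of the series or of liburing; the struct is
copied from the quoted patch.

```c
/* Sketch: place 'struct io_uring_meta_pi' into the second 64 bytes of
 * an SQE128 entry. Struct layout copied from the quoted v6 patch; the
 * helper and its calling convention are hypothetical. */
#include <stdint.h>
#include <string.h>

struct io_uring_meta_pi {
	uint16_t pi_flags;	/* IO_INTEGRITY_CHK_GUARD/APPTAG/REFTAG */
	uint16_t app_tag;	/* application-specific 16b value */
	uint32_t len;		/* length of the meta buffer */
	uint64_t addr;		/* address of the meta buffer */
	uint64_t seed;		/* seed value for reftag remapping */
	uint64_t rsvd[2];
};

/* 'entry' points at a full 128-byte SQE128 slot; the PI descriptor
 * lives in its second half. */
static void set_meta_pi(void *entry, void *meta, uint32_t meta_len,
			uint16_t flags, uint16_t app_tag, uint64_t seed)
{
	struct io_uring_meta_pi pi = {
		.pi_flags = flags,
		.app_tag = app_tag,
		.len = meta_len,
		.addr = (uint64_t)(uintptr_t)meta,
		.seed = seed,
	};

	memcpy((uint8_t *)entry + 64, &pi, sizeof(pi));
}
```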


* Re: [PATCH v6 03/10] block: modify bio_integrity_map_user to accept iov_iter as argument
  2024-10-30 18:01     ` [PATCH v6 03/10] block: modify bio_integrity_map_user to accept iov_iter as argument Kanchan Joshi
@ 2024-10-31  4:33       ` kernel test robot
  0 siblings, 0 replies; 24+ messages in thread
From: kernel test robot @ 2024-10-31  4:33 UTC (permalink / raw)
  To: Kanchan Joshi, axboe, hch, kbusch, martin.petersen, asml.silence,
	brauner, viro, jack
  Cc: oe-kbuild-all, linux-nvme, linux-fsdevel, io-uring, linux-block,
	linux-scsi, gost.dev, vishak.g, anuj1072538, Anuj Gupta,
	Kanchan Joshi

Hi Kanchan,

kernel test robot noticed the following build warnings:

[auto build test WARNING on axboe-block/for-next]
[cannot apply to brauner-vfs/vfs.all mkp-scsi/for-next jejb-scsi/for-next linus/master v6.12-rc5 next-20241030]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Kanchan-Joshi/block-define-set-of-integrity-flags-to-be-inherited-by-cloned-bip/20241031-021248
base:   https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git for-next
patch link:    https://lore.kernel.org/r/20241030180112.4635-4-joshi.k%40samsung.com
patch subject: [PATCH v6 03/10] block: modify bio_integrity_map_user to accept iov_iter as argument
config: alpha-allnoconfig (https://download.01.org/0day-ci/archive/20241031/[email protected]/config)
compiler: alpha-linux-gcc (GCC) 13.3.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241031/[email protected]/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/

All warnings (new ones prefixed by >>):

   In file included from include/linux/blk-integrity.h:6,
                    from block/bdev.c:15:
>> include/linux/bio-integrity.h:104:12: warning: 'bio_integrity_map_user' defined but not used [-Wunused-function]
     104 | static int bio_integrity_map_user(struct bio *bio, struct iov_iter *iter)
         |            ^~~~~~~~~~~~~~~~~~~~~~


vim +/bio_integrity_map_user +104 include/linux/bio-integrity.h

   103	
 > 104	static int bio_integrity_map_user(struct bio *bio, struct iov_iter *iter)
   105	{
   106		return -EINVAL;
   107	}
   108	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [PATCH v6 09/10] scsi: add support for user-meta interface
  2024-10-30 18:01     ` [PATCH v6 09/10] scsi: add support for user-meta interface Kanchan Joshi
@ 2024-10-31  5:09       ` kernel test robot
  2024-10-31  5:10       ` kernel test robot
  1 sibling, 0 replies; 24+ messages in thread
From: kernel test robot @ 2024-10-31  5:09 UTC (permalink / raw)
  To: Kanchan Joshi, axboe, hch, kbusch, martin.petersen, asml.silence,
	brauner, viro, jack
  Cc: llvm, oe-kbuild-all, linux-nvme, linux-fsdevel, io-uring,
	linux-block, linux-scsi, gost.dev, vishak.g, anuj1072538,
	Anuj Gupta

Hi Kanchan,

kernel test robot noticed the following build errors:

[auto build test ERROR on axboe-block/for-next]
[cannot apply to brauner-vfs/vfs.all mkp-scsi/for-next jejb-scsi/for-next linus/master v6.12-rc5 next-20241030]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Kanchan-Joshi/block-define-set-of-integrity-flags-to-be-inherited-by-cloned-bip/20241031-021248
base:   https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git for-next
patch link:    https://lore.kernel.org/r/20241030180112.4635-10-joshi.k%40samsung.com
patch subject: [PATCH v6 09/10] scsi: add support for user-meta interface
config: arm-randconfig-001-20241031 (https://download.01.org/0day-ci/archive/20241031/[email protected]/config)
compiler: clang version 17.0.6 (https://github.com/llvm/llvm-project 6009708b4367171ccdbf4b5905cb6a803753fe18)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241031/[email protected]/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/

All errors (new ones prefixed by >>):

>> block/bio-integrity.c:566:40: error: use of undeclared identifier 'BIP_CTRL_NOCHECK'; did you mean 'BIP_DISK_NOCHECK'?
     566 |         bip->bip_flags = bip_src->bip_flags & BIP_CLONE_FLAGS;
         |                                               ^
   include/linux/bio-integrity.h:35:49: note: expanded from macro 'BIP_CLONE_FLAGS'
      35 | #define BIP_CLONE_FLAGS (BIP_MAPPED_INTEGRITY | BIP_CTRL_NOCHECK | \
         |                                                 ^
   include/linux/bio-integrity.h:10:2: note: 'BIP_DISK_NOCHECK' declared here
      10 |         BIP_DISK_NOCHECK        = 1 << 2, /* disable disk integrity checking */
         |         ^
   1 error generated.

Kconfig warnings: (for reference only)
   WARNING: unmet direct dependencies detected for GET_FREE_REGION
   Depends on [n]: SPARSEMEM [=n]
   Selected by [y]:
   - RESOURCE_KUNIT_TEST [=y] && RUNTIME_TESTING_MENU [=y] && KUNIT [=y]


vim +566 block/bio-integrity.c

7ba1ba12eeef0aa fs/bio-integrity.c    Martin K. Petersen 2008-06-30  543  
7ba1ba12eeef0aa fs/bio-integrity.c    Martin K. Petersen 2008-06-30  544  /**
7ba1ba12eeef0aa fs/bio-integrity.c    Martin K. Petersen 2008-06-30  545   * bio_integrity_clone - Callback for cloning bios with integrity metadata
7ba1ba12eeef0aa fs/bio-integrity.c    Martin K. Petersen 2008-06-30  546   * @bio:	New bio
7ba1ba12eeef0aa fs/bio-integrity.c    Martin K. Petersen 2008-06-30  547   * @bio_src:	Original bio
87092698c665e0a fs/bio-integrity.c    un'ichi Nomura     2009-03-09  548   * @gfp_mask:	Memory allocation mask
7ba1ba12eeef0aa fs/bio-integrity.c    Martin K. Petersen 2008-06-30  549   *
7ba1ba12eeef0aa fs/bio-integrity.c    Martin K. Petersen 2008-06-30  550   * Description:	Called to allocate a bip when cloning a bio
7ba1ba12eeef0aa fs/bio-integrity.c    Martin K. Petersen 2008-06-30  551   */
7878cba9f0037f5 fs/bio-integrity.c    Martin K. Petersen 2009-06-26  552  int bio_integrity_clone(struct bio *bio, struct bio *bio_src,
1e2a410ff71504a fs/bio-integrity.c    Kent Overstreet    2012-09-06  553  			gfp_t gfp_mask)
7ba1ba12eeef0aa fs/bio-integrity.c    Martin K. Petersen 2008-06-30  554  {
180b2f95dd33101 block/bio-integrity.c Martin K. Petersen 2014-09-26  555  	struct bio_integrity_payload *bip_src = bio_integrity(bio_src);
7ba1ba12eeef0aa fs/bio-integrity.c    Martin K. Petersen 2008-06-30  556  	struct bio_integrity_payload *bip;
7ba1ba12eeef0aa fs/bio-integrity.c    Martin K. Petersen 2008-06-30  557  
7ba1ba12eeef0aa fs/bio-integrity.c    Martin K. Petersen 2008-06-30  558  	BUG_ON(bip_src == NULL);
7ba1ba12eeef0aa fs/bio-integrity.c    Martin K. Petersen 2008-06-30  559  
ba942238056584e block/bio-integrity.c Anuj Gupta         2024-07-02  560  	bip = bio_integrity_alloc(bio, gfp_mask, 0);
7b6c0f8034d7839 block/bio-integrity.c Dan Carpenter      2015-12-09  561  	if (IS_ERR(bip))
7b6c0f8034d7839 block/bio-integrity.c Dan Carpenter      2015-12-09  562  		return PTR_ERR(bip);
7ba1ba12eeef0aa fs/bio-integrity.c    Martin K. Petersen 2008-06-30  563  
ba942238056584e block/bio-integrity.c Anuj Gupta         2024-07-02  564  	bip->bip_vec = bip_src->bip_vec;
d57a5f7c6605f15 fs/bio-integrity.c    Kent Overstreet    2013-11-23  565  	bip->bip_iter = bip_src->bip_iter;
be32c1180d327a0 block/bio-integrity.c Anuj Gupta         2024-10-30 @566  	bip->bip_flags = bip_src->bip_flags & BIP_CLONE_FLAGS;
7ba1ba12eeef0aa fs/bio-integrity.c    Martin K. Petersen 2008-06-30  567  
7ba1ba12eeef0aa fs/bio-integrity.c    Martin K. Petersen 2008-06-30  568  	return 0;
7ba1ba12eeef0aa fs/bio-integrity.c    Martin K. Petersen 2008-06-30  569  }
7ba1ba12eeef0aa fs/bio-integrity.c    Martin K. Petersen 2008-06-30  570  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [PATCH v6 09/10] scsi: add support for user-meta interface
  2024-10-30 18:01     ` [PATCH v6 09/10] scsi: add support for user-meta interface Kanchan Joshi
  2024-10-31  5:09       ` kernel test robot
@ 2024-10-31  5:10       ` kernel test robot
  1 sibling, 0 replies; 24+ messages in thread
From: kernel test robot @ 2024-10-31  5:10 UTC (permalink / raw)
  To: Kanchan Joshi, axboe, hch, kbusch, martin.petersen, asml.silence,
	brauner, viro, jack
  Cc: oe-kbuild-all, linux-nvme, linux-fsdevel, io-uring, linux-block,
	linux-scsi, gost.dev, vishak.g, anuj1072538, Anuj Gupta

Hi Kanchan,

kernel test robot noticed the following build errors:

[auto build test ERROR on axboe-block/for-next]
[cannot apply to brauner-vfs/vfs.all mkp-scsi/for-next jejb-scsi/for-next linus/master v6.12-rc5 next-20241030]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Kanchan-Joshi/block-define-set-of-integrity-flags-to-be-inherited-by-cloned-bip/20241031-021248
base:   https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git for-next
patch link:    https://lore.kernel.org/r/20241030180112.4635-10-joshi.k%40samsung.com
patch subject: [PATCH v6 09/10] scsi: add support for user-meta interface
config: x86_64-rhel-8.3 (https://download.01.org/0day-ci/archive/20241031/[email protected]/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241031/[email protected]/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/

All errors (new ones prefixed by >>):

   In file included from include/linux/blk-integrity.h:6,
                    from block/bio-integrity.c:9:
   block/bio-integrity.c: In function 'bio_integrity_clone':
>> include/linux/bio-integrity.h:35:49: error: 'BIP_CTRL_NOCHECK' undeclared (first use in this function); did you mean 'BIP_DISK_NOCHECK'?
      35 | #define BIP_CLONE_FLAGS (BIP_MAPPED_INTEGRITY | BIP_CTRL_NOCHECK | \
         |                                                 ^~~~~~~~~~~~~~~~
   block/bio-integrity.c:566:47: note: in expansion of macro 'BIP_CLONE_FLAGS'
     566 |         bip->bip_flags = bip_src->bip_flags & BIP_CLONE_FLAGS;
         |                                               ^~~~~~~~~~~~~~~
   include/linux/bio-integrity.h:35:49: note: each undeclared identifier is reported only once for each function it appears in
      35 | #define BIP_CLONE_FLAGS (BIP_MAPPED_INTEGRITY | BIP_CTRL_NOCHECK | \
         |                                                 ^~~~~~~~~~~~~~~~
   block/bio-integrity.c:566:47: note: in expansion of macro 'BIP_CLONE_FLAGS'
     566 |         bip->bip_flags = bip_src->bip_flags & BIP_CLONE_FLAGS;
         |                                               ^~~~~~~~~~~~~~~


vim +35 include/linux/bio-integrity.h

da042a365515115 Christoph Hellwig 2024-07-02  34  
be32c1180d327a0 Anuj Gupta        2024-10-30 @35  #define BIP_CLONE_FLAGS (BIP_MAPPED_INTEGRITY | BIP_CTRL_NOCHECK | \
ed538815d9325f6 Anuj Gupta        2024-10-30  36  			 BIP_IP_CHECKSUM | BIP_CHECK_GUARD | \
ed538815d9325f6 Anuj Gupta        2024-10-30  37  			 BIP_CHECK_REFTAG | BIP_CHECK_APPTAG)
be32c1180d327a0 Anuj Gupta        2024-10-30  38  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [PATCH v6 04/10] fs, iov_iter: define meta io descriptor
  2024-10-30 18:01     ` [PATCH v6 04/10] fs, iov_iter: define meta io descriptor Kanchan Joshi
@ 2024-10-31  6:55       ` Christoph Hellwig
  0 siblings, 0 replies; 24+ messages in thread
From: Christoph Hellwig @ 2024-10-31  6:55 UTC (permalink / raw)
  To: Kanchan Joshi
  Cc: axboe, hch, kbusch, martin.petersen, asml.silence, brauner, viro,
	jack, linux-nvme, linux-fsdevel, io-uring, linux-block,
	linux-scsi, gost.dev, vishak.g, anuj1072538, Anuj Gupta

On Wed, Oct 30, 2024 at 11:31:06PM +0530, Kanchan Joshi wrote:
> +typedef __u16 uio_meta_flags_t;

I would have just skipped the typedef, but I don't have strong feelings
here.

Reviewed-by: Christoph Hellwig <[email protected]>


* Re: [PATCH v6 06/10] io_uring/rw: add support to send metadata along with read/write
  2024-10-30 18:01     ` [PATCH v6 06/10] io_uring/rw: add support to send metadata along with read/write Kanchan Joshi
  2024-10-30 21:09       ` Keith Busch
@ 2024-10-31  6:55       ` Christoph Hellwig
  1 sibling, 0 replies; 24+ messages in thread
From: Christoph Hellwig @ 2024-10-31  6:55 UTC (permalink / raw)
  To: Kanchan Joshi
  Cc: axboe, hch, kbusch, martin.petersen, asml.silence, brauner, viro,
	jack, linux-nvme, linux-fsdevel, io-uring, linux-block,
	linux-scsi, gost.dev, vishak.g, anuj1072538, Anuj Gupta

Looks good:

Reviewed-by: Christoph Hellwig <[email protected]>


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v6 06/10] io_uring/rw: add support to send metadata along with read/write
  2024-10-30 21:09       ` Keith Busch
@ 2024-10-31 14:39         ` Pavel Begunkov
  2024-11-01 17:54           ` Kanchan Joshi
  0 siblings, 1 reply; 24+ messages in thread
From: Pavel Begunkov @ 2024-10-31 14:39 UTC (permalink / raw)
  To: Keith Busch, Kanchan Joshi
  Cc: axboe, hch, martin.petersen, brauner, viro, jack, linux-nvme,
	linux-fsdevel, io-uring, linux-block, linux-scsi, gost.dev,
	vishak.g, anuj1072538, Anuj Gupta

On 10/30/24 21:09, Keith Busch wrote:
> On Wed, Oct 30, 2024 at 11:31:08PM +0530, Kanchan Joshi wrote:
>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
>> index 024745283783..48dcca125db3 100644
>> --- a/include/uapi/linux/io_uring.h
>> +++ b/include/uapi/linux/io_uring.h
>> @@ -105,6 +105,22 @@ struct io_uring_sqe {
>>   		 */
>>   		__u8	cmd[0];
>>   	};
>> +	/*
>> +	 * If the ring is initialized with IORING_SETUP_SQE128, then
>> +	 * this field is starting offset for 64 bytes of data. For meta io
>> +	 * this contains 'struct io_uring_meta_pi'
>> +	 */
>> +	__u8	big_sqe[0];
>> +};

I don't think zero sized arrays are good as a uapi regardless of
cmd[0] above, let's just do

sqe = get_sqe();
big_sqe = (void *)(sqe + 1)

with an appropriate helper.
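A rough sketch of what that could look like; the trimmed-down struct and the `get_big_sqe` name here are purely illustrative, not existing liburing API:

```c
#include <stdint.h>

/* Trimmed-down stand-in for the real 64-byte struct io_uring_sqe. */
struct io_uring_sqe {
	uint8_t pad[64];
};

/*
 * With IORING_SETUP_SQE128 every ring slot is 128 bytes, so the
 * extended area starts right after the first 64 bytes; no zero-sized
 * trailing array is needed to reach it.
 */
static inline void *get_big_sqe(struct io_uring_sqe *sqe)
{
	return (void *)(sqe + 1);
}
```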

>> +
>> +/* this is placed in SQE128 */
>> +struct io_uring_meta_pi {
>> +	__u16		pi_flags;
>> +	__u16		app_tag;
>> +	__u32		len;
>> +	__u64		addr;
>> +	__u64		seed;
>> +	__u64		rsvd[2];
>>   };
> 
> On the previous version, I was more questioning if it aligns with what

I missed that discussion, let me know if I need to look it up

> Pavel was trying to do here. I didn't quite get it, so I was more
> confused than saying it should be this way now.

The point is, SQEs don't have nearly enough space to accommodate all
such optional features, especially when it's taking so much space and
not applicable to all reads but rather some specific use cases and
files. Consider that there might be more similar extensions and we might
even want to use them together.

1. SQE128 makes it big for all requests, intermixing with requests that
don't need additional space wastes space. SQE128 is fine to use but at
the same time we should be mindful about it and try to avoid enabling it
if feasible.

2. This API hard codes io_uring_meta_pi into the extended part of the
SQE. If we want to add another feature it'd need to go after the meta
struct. SQE256? And what if the user doesn't need PI but only the second
feature?

In short, the uAPI needs to have a clear vision of how it can be used
with / extended to multiple optional features, not just PI.

One option I mentioned before is passing a user pointer to an array of
structures, each would have a type specifying what kind of
feature / meta information it is, e.g. META_TYPE_PI. It's not a
complete solution but a base idea to extend upon. I separately
mentioned before, if copy_from_user is expensive we can optimise it
with pre-registering memory. I think Jens even tried something similar
with structures we pass as waiting parameters.
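To make the shape of that idea concrete, a self-describing attribute array could look something like this; every name below is hypothetical, invented only to illustrate the proposal (META_TYPE_PI is the only one mentioned in the discussion):

```c
#include <stdint.h>

/* Hypothetical attribute types. */
enum {
	META_TYPE_PI	= 1,
	META_TYPE_HINT	= 2,
};

/* Each attribute announces its own type and total size, so the kernel
 * can walk a user-supplied array of mixed attributes after one copy. */
struct meta_attr_hdr {
	uint16_t type;	/* META_TYPE_* */
	uint16_t len;	/* total size of this attribute, header included */
	uint32_t rsvd;
};

/* PI payload mirroring the fields of struct io_uring_meta_pi. */
struct meta_pi_attr {
	struct meta_attr_hdr hdr;
	uint16_t pi_flags;
	uint16_t app_tag;
	uint32_t len;
	uint64_t addr;
	uint64_t seed;
};
```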

I didn't read through all iterations of the series, so if there is
some other approach described that ticks the boxes and flexible
enough, I'd be absolutely fine with it.


-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v6 06/10] io_uring/rw: add support to send metadata along with read/write
  2024-10-31 14:39         ` Pavel Begunkov
@ 2024-11-01 17:54           ` Kanchan Joshi
  2024-11-07 17:23             ` Pavel Begunkov
  0 siblings, 1 reply; 24+ messages in thread
From: Kanchan Joshi @ 2024-11-01 17:54 UTC (permalink / raw)
  To: Pavel Begunkov, Keith Busch
  Cc: axboe, hch, martin.petersen, brauner, viro, jack, linux-nvme,
	linux-fsdevel, io-uring, linux-block, linux-scsi, gost.dev,
	vishak.g, anuj1072538, Anuj Gupta

On 10/31/2024 8:09 PM, Pavel Begunkov wrote:
> On 10/30/24 21:09, Keith Busch wrote:
>> On Wed, Oct 30, 2024 at 11:31:08PM +0530, Kanchan Joshi wrote:
>>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/ 
>>> io_uring.h
>>> index 024745283783..48dcca125db3 100644
>>> --- a/include/uapi/linux/io_uring.h
>>> +++ b/include/uapi/linux/io_uring.h
>>> @@ -105,6 +105,22 @@ struct io_uring_sqe {
>>>            */
>>>           __u8    cmd[0];
>>>       };
>>> +    /*
>>> +     * If the ring is initialized with IORING_SETUP_SQE128, then
>>> +     * this field is starting offset for 64 bytes of data. For meta io
>>> +     * this contains 'struct io_uring_meta_pi'
>>> +     */
>>> +    __u8    big_sqe[0];
>>> +};
> 
> I don't think zero sized arrays are good as a uapi regardless of
> cmd[0] above, let's just do
> 
> sqe = get_sqe();
> big_sqe = (void *)(sqe + 1)
> 
> with an appropriate helper.

In one of the internal versions I did just that (i.e., sqe + 1), and 
that's fine for the kernel.
But afterwards I added big_sqe so that userspace can directly access 
the second half of SQE_128. We have a similar big_cqe[] within 
io_uring_cqe too.

Is this still an eyesore?

>>> +
>>> +/* this is placed in SQE128 */
>>> +struct io_uring_meta_pi {
>>> +    __u16        pi_flags;
>>> +    __u16        app_tag;
>>> +    __u32        len;
>>> +    __u64        addr;
>>> +    __u64        seed;
>>> +    __u64        rsvd[2];
>>>   };
>>
>> On the previous version, I was more questioning if it aligns with what
> 
> I missed that discussion, let me know if I need to look it up

Yes, please take a look at previous iteration (v5):
https://lore.kernel.org/io-uring/[email protected]/

Also the corresponding code, since my other answers will use that.

>> Pavel was trying to do here. I didn't quite get it, so I was more
>> confused than saying it should be this way now.
> 
> The point is, SQEs don't have nearly enough space to accommodate all
> such optional features, especially when it's taking so much space and
> not applicable to all reads but rather some specific  use cases and
> files. Consider that there might be more similar extensions and we might
> even want to use them together.
> 
> 1. SQE128 makes it big for all requests, intermixing with requests that
> don't need additional space wastes space. SQE128 is fine to use but at
> the same time we should be mindful about it and try to avoid enabling it
> if feasible.

Right. And initial versions of this series did not use SQE128. But as 
we moved towards passing more comprehensive PI information, the first 
SQE was not enough. And we chose to make use of SQE128 rather than pay 
the copy_from_user cost.

 > 2. This API hard codes io_uring_meta_pi into the extended part of the
> SQE. If we want to add another feature it'd need to go after the meta
> struct. SQE256?

Not necessarily. It depends on how much extra space another feature 
needs. To keep free space in the first SQE, I chose to place PI in the 
second one. Anyone requiring 20b (in v6) or 18b (in v5) of space does 
not even have to ask for SQE128.
For more, they can use the leftover space in the second SQE (about half 
of the second SQE will still be free). In v5, they have the entire 
second SQE if they don't want to use PI.
If contiguity is a concern, we can move all the PI bytes (about 32b) to 
the end of the second SQE.


 > And what if the user doesn't need PI but only the second
> feature?

Not this version, but v5 exposed meta_type as bit flags.
And with that, the user will not pass the PI flag, which enables using 
all the PI bytes for something else. We will have a union of PI with 
some other info that is known not to co-exist.

> In short, the uAPI need to have a clear vision of how it can be used
> with / extended to multiple optional features and not just PI.
> 
> One option I mentioned before is passing a user pointer to an array of
> structures, each would will have the type specifying what kind of
> feature / meta information it is, e.g. META_TYPE_PI. It's not a
> complete solution but a base idea to extend upon. I separately
> mentioned before, if copy_from_user is expensive we can optimise it
> with pre-registering memory. I think Jens even tried something similar
> with structures we pass as waiting parameters.
> 
> I didn't read through all iterations of the series, so if there is
> some other approach described that ticks the boxes and flexible
> enough, I'd be absolutely fine with it.

Please just read v5. I think it ticks as many boxes as possible without 
having to resort to copy_from_user.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v6 06/10] io_uring/rw: add support to send metadata along with read/write
  2024-11-01 17:54           ` Kanchan Joshi
@ 2024-11-07 17:23             ` Pavel Begunkov
  2024-11-10 17:41               ` Kanchan Joshi
  2024-11-10 18:36               ` Kanchan Joshi
  0 siblings, 2 replies; 24+ messages in thread
From: Pavel Begunkov @ 2024-11-07 17:23 UTC (permalink / raw)
  To: Kanchan Joshi, Keith Busch
  Cc: axboe, hch, martin.petersen, brauner, viro, jack, linux-nvme,
	linux-fsdevel, io-uring, linux-block, linux-scsi, gost.dev,
	vishak.g, anuj1072538, Anuj Gupta

On 11/1/24 17:54, Kanchan Joshi wrote:
> On 10/31/2024 8:09 PM, Pavel Begunkov wrote:
>> On 10/30/24 21:09, Keith Busch wrote:
>>> On Wed, Oct 30, 2024 at 11:31:08PM +0530, Kanchan Joshi wrote:
>>>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/
>>>> io_uring.h
>>>> index 024745283783..48dcca125db3 100644
>>>> --- a/include/uapi/linux/io_uring.h
>>>> +++ b/include/uapi/linux/io_uring.h
>>>> @@ -105,6 +105,22 @@ struct io_uring_sqe {
>>>>             */
>>>>            __u8    cmd[0];
>>>>        };
>>>> +    /*
>>>> +     * If the ring is initialized with IORING_SETUP_SQE128, then
>>>> +     * this field is starting offset for 64 bytes of data. For meta io
>>>> +     * this contains 'struct io_uring_meta_pi'
>>>> +     */
>>>> +    __u8    big_sqe[0];
>>>> +};
>>
>> I don't think zero sized arrays are good as a uapi regardless of
>> cmd[0] above, let's just do
>>
>> sqe = get_sqe();
>> big_sqe = (void *)(sqe + 1)
>>
>> with an appropriate helper.
> 
> In one of the internal version I did just that (i.e., sqe + 1), and
> that's fine for kernel.
> But afterwards added big_sqe so that userspace can directly access
> access second-half of SQE_128. We have the similar big_cqe[] within
> io_uring_cqe too.
> 
> Is this still an eyesore?

Yes, let's kill it as well please, and I don't think the feature
really cares about it, so should be easy to do if not already in
later revisions.

>>>> +
>>>> +/* this is placed in SQE128 */
>>>> +struct io_uring_meta_pi {
>>>> +    __u16        pi_flags;
>>>> +    __u16        app_tag;
>>>> +    __u32        len;
>>>> +    __u64        addr;
>>>> +    __u64        seed;
>>>> +    __u64        rsvd[2];
>>>>    };
>>>
>>> On the previous version, I was more questioning if it aligns with what
>>
>> I missed that discussion, let me know if I need to look it up
> 
> Yes, please take a look at previous iteration (v5):
> https://lore.kernel.org/io-uring/[email protected]/

"But in general, this is about seeing metadata as a generic term to
encode extra information into io_uring SQE."

Yep, that's the idea, and it also sounds to me like stream hints
are one potential user as well. To summarise, the end goal is to be
able to add more meta types/attributes in the future, which can be
file specific, e.g. pipes don't care about integrity data, and to
be able to pass an arbitrary number of such attributes to a single
request.

We don't need to implement it here, but the uapi needs to be flexible
enough to accommodate that, or we should have an understanding of how
it can be extended without dirty hacks.

> Also the corresponding code, since my other answers will use that.
> 
>>> Pavel was trying to do here. I didn't quite get it, so I was more
>>> confused than saying it should be this way now.
>>
>> The point is, SQEs don't have nearly enough space to accommodate all
>> such optional features, especially when it's taking so much space and
>> not applicable to all reads but rather some specific  use cases and
>> files. Consider that there might be more similar extensions and we might
>> even want to use them together.
>>
>> 1. SQE128 makes it big for all requests, intermixing with requests that
>> don't need additional space wastes space. SQE128 is fine to use but at
>> the same time we should be mindful about it and try to avoid enabling it
>> if feasible.
> 
> Right. And initial versions of this series did not use SQE128. But as we
> moved towards passing more comprehensive PI information, first SQE was
> not enough. And we thought to make use of SQE128 rather than taking
> copy_from_user cost.

Do we have any data how expensive it is? I don't think I've ever
tried to profile it. And where the overhead comes from? speculation
prevention?

If it's indeed costly, we can add sth to io_uring like pre-mapping
memory to optimise it, which would be useful in other places as
well.
  
>   > 2. This API hard codes io_uring_meta_pi into the extended part of the
>> SQE. If we want to add another feature it'd need to go after the meta
>> struct. SQE256?
> 
> Not necessarily. It depends on how much extra space it needs for another
> feature. To keep free space in first SQE, I chose to place PI in the
> second one. Anyone requiring 20b (in v6) or 18b (in v5) space, does not
> even have to ask for SQE128.
> For more, they can use leftover space in second SQE (about half of
> second sqe will still be free). In v5, they have entire second SQE if
> they don't want to use PI.
> If contiguity is a concern, we can move all PI bytes (about 32b) to the
> end of second SQE.
> 
> 
>   > And what if the user doesn't need PI but only the second
>> feature?
> 
> Not this version, but v5 exposed meta_type as bit flags.

There has to be a type, I assume it's being added back.

> And with that, user will not pass the PI flag and that enables to use
> all the PI bytes for something else. We will have union of PI with some
> other info that is known not to co-exist.

Let's say we have 3 different attributes META_TYPE{1,2,3}.

How are they placed in an SQE?

meta1 = (void *)get_big_sqe(sqe);
meta2 = meta1 + sizeof(?); // sizeof(struct meta1_struct)
meta3 = meta2 + sizeof(struct meta2_struct);

Structures are likely not fixed size (?). At least the PI looks large
enough to force every other type to just be aliased to it.

And can the user pass first meta2 in the sqe and then meta1?

meta2 = (void *)get_big_sqe(sqe);
meta1 = meta2 + sizeof(?); // sizeof(struct meta2_struct)

If yes, how should parsing look? Does the kernel need to read each
chunk's type and look up its size to iterate to the next one?

If no, what happens if we want to pass meta2 and meta3, do they start
from the big_sqe?

How do we pass how many of such attributes is there for the request?

It should support arbitrary number of attributes in the long run, which
we can't pass in an SQE, bumping the SQE size is not scalable in
general, so it'd need to support user pointers or sth similar at some
point. Placing them in an SQE can serve as an optimisation, and a first
step, though it might be easier to start with user pointer instead.

Also, when we eventually come to user pointers, we want it to be
performant as well and e.g. get by just one copy_from_user, and the
api/struct layouts would need to be able to support it. And once it's
copied we'll want it to be handled uniformly with the SQE variant, which
requires a common format. With different formats there will be a question
of performance, maintainability, and duplication of kernel and userspace
code.

All that doesn't need to be implemented, but we need a clear direction
for the API. Maybe we can get some simplified userspace pseudo code
showing how the end API is supposed to look?
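As one hedged starting point for that pseudo code, the userspace side might look like the sketch below; the types and the helper are entirely hypothetical, and a real API would likely use self-sized attributes rather than a fixed entry size:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for the 64-byte struct io_uring_sqe. */
struct io_uring_sqe {
	uint8_t pad[64];
};

/* Hypothetical fixed-size attribute entry. */
struct meta_attr {
	uint16_t type;
	uint16_t rsvd;
	uint8_t  payload[28];
};

/*
 * Place the attribute array inline after the SQE when it fits in the
 * second 64 bytes of an SQE128 slot (the optimisation); otherwise
 * report that the caller must fall back to a user pointer the kernel
 * copies once. Returns 1 on inline placement, 0 otherwise.
 */
static int place_attrs_inline(struct io_uring_sqe *sqe128,
			      const struct meta_attr *attrs, int nr)
{
	size_t sz = (size_t)nr * sizeof(*attrs);

	if (sz > sizeof(struct io_uring_sqe))
		return 0;
	memcpy(sqe128 + 1, attrs, sz);
	return 1;
}
```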

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v6 06/10] io_uring/rw: add support to send metadata along with read/write
  2024-11-07 17:23             ` Pavel Begunkov
@ 2024-11-10 17:41               ` Kanchan Joshi
  2024-11-12  0:54                 ` Pavel Begunkov
  2024-11-10 18:36               ` Kanchan Joshi
  1 sibling, 1 reply; 24+ messages in thread
From: Kanchan Joshi @ 2024-11-10 17:41 UTC (permalink / raw)
  To: Pavel Begunkov, Keith Busch
  Cc: axboe, hch, martin.petersen, brauner, viro, jack, linux-nvme,
	linux-fsdevel, io-uring, linux-block, linux-scsi, gost.dev,
	vishak.g, anuj1072538, Anuj Gupta

On 11/7/2024 10:53 PM, Pavel Begunkov wrote:

>>> 1. SQE128 makes it big for all requests, intermixing with requests that
>>> don't need additional space wastes space. SQE128 is fine to use but at
>>> the same time we should be mindful about it and try to avoid enabling it
>>> if feasible.
>>
>> Right. And initial versions of this series did not use SQE128. But as we
>> moved towards passing more comprehensive PI information, first SQE was
>> not enough. And we thought to make use of SQE128 rather than taking
>> copy_from_user cost.
> 
> Do we have any data how expensive it is? I don't think I've ever
> tried to profile it. And where the overhead comes from? speculation
> prevention?

We did measure this for nvme passthru commands in the past (and that 
was the motivation for building SQE128). The perf profile showed about 
3% overhead for the copy [*].

> If it's indeed costly, we can add sth to io_uring like pre-mapping
> memory to optimise it, which would be useful in other places as
> well.

But why operate as if SQE128 does not exist?
Reads/writes, at this point, are clearly not using about 20b in the 
first SQE and the entire second SQE. Not using the second SQE at all 
does not seem like the best way to protect it from being used by 
future users.

Pre-mapping may be better for opcodes for which copy_from_user has 
already been done. For something new (like this), why start in a 
suboptimal way and later put the burden of jumping through hoops on 
userspace to get to the same level it could have reached by simply 
passing a flag at ring-setup time.

[*]
perf record -a fio -iodepth=256 -rw=randread -ioengine=io_uring -bs=512 \
    -numjobs=1 -size=50G -group_reporting -iodepth_batch_submit=64 \
    -iodepth_batch_complete_min=1 -iodepth_batch_complete_max=64 \
    -fixedbufs=1 -hipri=1 -sqthread_poll=0 -filename=/dev/ng0n1 \
    -name=io_uring_1 -uring_cmd=1


# Overhead  Command  Shared Object     Symbol
# ........  .......  ................  ..................................
#
    14.37%  fio      fio               [.] axmap_isset
     6.30%  fio      fio               [.] __fio_gettime
     3.69%  fio      fio               [.] get_io_u
     3.16%  fio      [kernel.vmlinux]  [k] copy_user_enhanced_fast_string
     2.61%  fio      [kernel.vmlinux]  [k] io_submit_sqes
     1.99%  fio      [kernel.vmlinux]  [k] fget
     1.96%  fio      [nvme_core]       [k] nvme_alloc_request
     1.82%  fio      [nvme]            [k] nvme_poll
     1.79%  fio      fio               [.] add_clat_sample
     1.69%  fio      fio               [.] fio_ioring_prep
     1.59%  fio      fio               [.] thread_main
     1.59%  fio      [nvme]            [k] nvme_queue_rqs
     1.56%  fio      [kernel.vmlinux]  [k] io_issue_sqe
     1.52%  fio      [kernel.vmlinux]  [k] __put_user_nocheck_8
     1.44%  fio      fio               [.] account_io_completion
     1.37%  fio      fio               [.] get_next_rand_block
     1.37%  fio      fio               [.] __get_next_rand_offset.isra.0
     1.34%  fio      fio               [.] io_completed
     1.34%  fio      fio               [.] td_io_queue
     1.27%  fio      [kernel.vmlinux]  [k] blk_mq_alloc_request
     1.27%  fio      [nvme_core]       [k] nvme_user_cmd64

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v6 06/10] io_uring/rw: add support to send metadata along with read/write
  2024-11-07 17:23             ` Pavel Begunkov
  2024-11-10 17:41               ` Kanchan Joshi
@ 2024-11-10 18:36               ` Kanchan Joshi
  2024-11-12  1:32                 ` Pavel Begunkov
  1 sibling, 1 reply; 24+ messages in thread
From: Kanchan Joshi @ 2024-11-10 18:36 UTC (permalink / raw)
  To: Pavel Begunkov, Keith Busch
  Cc: axboe, hch, martin.petersen, brauner, viro, jack, linux-nvme,
	linux-fsdevel, io-uring, linux-block, linux-scsi, gost.dev,
	vishak.g, anuj1072538, Anuj Gupta

On 11/7/2024 10:53 PM, Pavel Begunkov wrote:

> Let's say we have 3 different attributes META_TYPE{1,2,3}.
> 
> How are they placed in an SQE?
> 
> meta1 = (void *)get_big_sqe(sqe);
> meta2 = meta1 + sizeof(?); // sizeof(struct meta1_struct)
> meta3 = meta2 + sizeof(struct meta2_struct);

It's not necessary to do this kind of addition or to think in terms of 
sequential ordering for the extra information placed into the 
primary/secondary SQE.

Please see v8:
https://lore.kernel.org/io-uring/[email protected]/

It exposes a distinct flag (sqe->ext_cap) for each attribute/cap, and 
userspace should place the corresponding information where the kernel 
has mandated.

If a particular attribute (for example, a write hint) requires <20b of 
extra information, we should just place that in the first SQE. PI 
requires more, so we are placing it in the second SQE.

When both the PI and write-hint flags are specified by the user, they 
can get processed fine without actually having to care about the above 
additions/ordering.

> Structures are likely not fixed size (?). At least the PI looks large
> enough to force everyone to be just aliased to it.
> 
> And can the user pass first meta2 in the sqe and then meta1?

Yes. Just set the ext_cap flags without bothering about first/second. 
The user can pass either or both, along with the corresponding info; 
they just don't have to assume a specific placement within the SQE.


> meta2 = (void *)get_big_sqe(sqe);
> meta1 = meta2 + sizeof(?); // sizeof(struct meta2_struct)
> 
> If yes, how parsing should look like? Does the kernel need to read each
> chunk's type and look up its size to iterate to the next one?

We don't need to iterate if we are not assuming any ordering.

> If no, what happens if we want to pass meta2 and meta3, do they start
> from the big_sqe?

Whoever adds the support for meta2/meta3 in the kernel decides where 
to place them within the first/second SQE, or has them fetched via a 
pointer from userspace.

> How do we pass how many of such attributes is there for the request?

ext_cap allows passing 16 cap/attribute flags. Maybe all of them can 
or cannot be passed inline in the SQE, but I have no real visibility 
into the space requirements of future users.


> It should support arbitrary number of attributes in the long run, which
> we can't pass in an SQE, bumping the SQE size is not scalable in
> general, so it'd need to support user pointers or sth similar at some
> point. Placing them in an SQE can serve as an optimisation, and a first
> step, though it might be easier to start with user pointer instead.
> 
> Also, when we eventually come to user pointers, we want it to be
> performant as well and e.g. get by just one copy_from_user, and the
> api/struct layouts would need to be able to support it. And once it's
> copied we'll want it to be handled uniformly with the SQE variant, that
> requires a common format. For different formats there will be a question
> of performance, maintainability, duplicating kernel and userspace code.
> 
> All that doesn't need to be implemented, but we need a clear direction
> for the API. Maybe we can get a simplified user space pseudo code
> showing how the end API is supposed to look like?

Yes. For a large/arbitrary number, we may have to fetch the entire 
attribute list using a user pointer/len combo. And parse it (that's 
where all your previous questions fit).

And that can still be added on top of v8.
For example, adding a flag (in ext_cap) that disables inline-SQE 
processing and switches to an external attribute buffer:

/* Second SQE has PI information */
#define EXT_CAP_PI		(1U << 0)
/* First SQE has hint information */
#define EXT_CAP_WRITE_HINT	(1U << 1)	
/* Do not assume CAP presence in SQE, and fetch capability buffer page 
instead */
#define EXT_CAP_INDIRECT 	(1U << 2)

A corresponding pointer (and/or len) can be put into the last 16b of 
the SQE. Use the same flags/structures for the given attributes within 
this buffer.
That will keep things uniform and will reuse the same handling that we 
add for inline attributes.
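A sketch of how such an external buffer could stay uniform with the inline layout; the EXT_CAP flag values mirror the ones proposed above, while the struct and its field names are hypothetical:

```c
#include <stdint.h>

#define EXT_CAP_PI		(1U << 0)
#define EXT_CAP_WRITE_HINT	(1U << 1)
#define EXT_CAP_INDIRECT	(1U << 2)

/* Hypothetical external attribute buffer, fetched by the kernel when
 * EXT_CAP_INDIRECT is set; individual members are valid only when the
 * matching EXT_CAP_* bit is present in 'caps'. */
struct ext_cap_buf {
	uint16_t caps;		/* same flag bits as sqe->ext_cap */
	uint16_t rsvd;
	uint32_t write_hint;	/* EXT_CAP_WRITE_HINT */
	/* PI block mirroring struct io_uring_meta_pi (EXT_CAP_PI) */
	uint16_t pi_flags;
	uint16_t app_tag;
	uint32_t pi_len;
	uint64_t pi_addr;
	uint64_t pi_seed;
};
```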

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v6 06/10] io_uring/rw: add support to send metadata along with read/write
  2024-11-10 17:41               ` Kanchan Joshi
@ 2024-11-12  0:54                 ` Pavel Begunkov
  0 siblings, 0 replies; 24+ messages in thread
From: Pavel Begunkov @ 2024-11-12  0:54 UTC (permalink / raw)
  To: Kanchan Joshi, Keith Busch
  Cc: axboe, hch, martin.petersen, brauner, viro, jack, linux-nvme,
	linux-fsdevel, io-uring, linux-block, linux-scsi, gost.dev,
	vishak.g, anuj1072538, Anuj Gupta

On 11/10/24 17:41, Kanchan Joshi wrote:
> On 11/7/2024 10:53 PM, Pavel Begunkov wrote:
> 
>>>> 1. SQE128 makes it big for all requests, intermixing with requests that
>>>> don't need additional space wastes space. SQE128 is fine to use but at
>>>> the same time we should be mindful about it and try to avoid enabling it
>>>> if feasible.
>>>
>>> Right. And initial versions of this series did not use SQE128. But as we
>>> moved towards passing more comprehensive PI information, first SQE was
>>> not enough. And we thought to make use of SQE128 rather than taking
>>> copy_from_user cost.
>>
>> Do we have any data how expensive it is? I don't think I've ever
>> tried to profile it. And where the overhead comes from? speculation
>> prevention?
> 
> We did measure this for nvme passthru commands in past (and that was the
> motivation for building SQE128). Perf profile showed about 3% overhead
> for copy [*].

Interesting. Sounds like the 3% is not accounting for spec barriers,
and then I'm a bit curious how much of it comes from the generic
memcpy when it could've been several 64-bit reads. But regardless,
let's assume it is expensive.

>> If it's indeed costly, we can add sth to io_uring like pre-mapping
>> memory to optimise it, which would be useful in other places as
>> well.
> 
> But why to operate as if SQE128 does not exist?
> Reads/Writes, at this point, are clearly not using about 20b in first
> SQE and entire second SQE. Not using second SQE at all does not seem
> like the best way to protect it from being used by future users.

You missed the point, if you take another look at the rest of my
reply I even mentioned that SQE128 could be used as an optimisation
and the only mode for this patchset, but the API has to be nicely
extendable with more attributes in the future.

You can't fit everything into SQE128. Even if we grow the SQE size
further, it's one size for all requests; mixing requests would mean
initialising the entire SQE256/512/... for all requests, even for those
that don't need it. It might be reasonable for some applications
but not for the generic case.

I know you care about having that particular integrity feature,
but it'd be bad for io_uring to lock into a suboptimal API and
special-casing PI implementation. Let's shift a discussion about
details to the other sub-thread.

> Pre-mapping may be better for opcodes for which copy_from_user has already
> been done. For something new (like this), why to start in a suboptimal
> way, and later, put the burden of taking hoops on userspace to get to
> the same level where it can get by simply passing a flag at the time of
> ring setup.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v6 06/10] io_uring/rw: add support to send metadata along with read/write
  2024-11-10 18:36               ` Kanchan Joshi
@ 2024-11-12  1:32                 ` Pavel Begunkov
  0 siblings, 0 replies; 24+ messages in thread
From: Pavel Begunkov @ 2024-11-12  1:32 UTC (permalink / raw)
  To: Kanchan Joshi, Keith Busch
  Cc: axboe, hch, martin.petersen, brauner, viro, jack, linux-nvme,
	linux-fsdevel, io-uring, linux-block, linux-scsi, gost.dev,
	vishak.g, anuj1072538, Anuj Gupta

On 11/10/24 18:36, Kanchan Joshi wrote:
> On 11/7/2024 10:53 PM, Pavel Begunkov wrote:
> 
>> Let's say we have 3 different attributes META_TYPE{1,2,3}.
>>
>> How are they placed in an SQE?
>>
>> meta1 = (void *)get_big_sqe(sqe);
>> meta2 = meta1 + sizeof(?); // sizeof(struct meta1_struct)
>> meta3 = meta2 + sizeof(struct meta2_struct);
> 
> Not necessary to do this kind of additions and think in terms of
> sequential ordering for the extra information placed into
> primary/secondary SQE.
> 
> Please see v8:
> https://lore.kernel.org/io-uring/[email protected]/
> 
> It exposes a distinct flag (sqe->ext_cap) for each attribute/cap, and
> userspace should place the corresponding information where kernel has
> mandated.
> 
> If a particular attribute (example write-hint) requires <20b of extra
> information, we should just place that in first SQE. PI requires more so
> we are placing that into second SQE.
> 
> When both PI and write-hint flags are specified by user they can get
> processed fine without actually having to care about above
> additions/ordering.

Ok, this option is to statically define a place in the SQE for each
meta type. The problem is that we can't place everything into
an SQE, and the next big meta would need to be a user pointer,
at which point copy_from_user() is expensive again and we need
to invent something new. PI becomes a special case, most likely
handled in a special way, and either becomes one of a few "optimised"
types or needlessly forces its users into SQE128 (with all the
additional costs) when it could've been aligned with other, later
meta types.

>> Structures are likely not fixed size (?). At least the PI looks large
>> enough to force everyone to be just aliased to it.
>>
>> And can the user pass first meta2 in the sqe and then meta1?
> 
> Yes. Just set the ext_cap flags without bothering about first/second.
> User can pass either or both, along with the corresponding info. Just
> don't have to assume specific placement into SQE.
> 
> 
>> meta2 = (void *)get_big_sqe(sqe);
>> meta1 = meta2 + sizeof(?); // sizeof(struct meta2_struct)
>>
>> If yes, how parsing should look like? Does the kernel need to read each
>> chunk's type and look up its size to iterate to the next one?
> 
> We don't need to iterate if we are not assuming any ordering.
> 
>> If no, what happens if we want to pass meta2 and meta3, do they start
>> from the big_sqe?
> 
> The one who adds the support for meta2/meta3 in kernel decides where to
> place them within first/second SQE or get them fetched via a pointer
> from userspace.
> 
>> How do we pass how many of such attributes is there for the request?
> 
> ext_cap allows to pass 16 cap/attribute flags. Maybe all can or can not
> be passed inline in SQE, but I have no real visibility about the space
> requirement of future users.

I like ext_cap, if not in the current form / API, then as a user
hint: a quick map of which meta types are passed.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2024-11-12  1:32 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <CGME20241030180957epcas5p3312b0a582e8562f8c2169e64d41592b2@epcas5p3.samsung.com>
2024-10-30 18:01 ` [PATCH v6 00/10] Read/Write with metadata/integrity Kanchan Joshi
     [not found]   ` <CGME20241030181000epcas5p2bfb47a79f1e796116135f646c6f0ccc7@epcas5p2.samsung.com>
2024-10-30 18:01     ` [PATCH v6 01/10] block: define set of integrity flags to be inherited by cloned bip Kanchan Joshi
     [not found]   ` <CGME20241030181002epcas5p2b44e244bcd0c49d0a379f0f4fe07dc3f@epcas5p2.samsung.com>
2024-10-30 18:01     ` [PATCH v6 02/10] block: copy back bounce buffer to user-space correctly in case of split Kanchan Joshi
     [not found]   ` <CGME20241030181005epcas5p43b40adb5af1029c9ffaecde317bf1c5d@epcas5p4.samsung.com>
2024-10-30 18:01     ` [PATCH v6 03/10] block: modify bio_integrity_map_user to accept iov_iter as argument Kanchan Joshi
2024-10-31  4:33       ` kernel test robot
     [not found]   ` <CGME20241030181008epcas5p333603fdbf3afb60947d3fc51138d11bf@epcas5p3.samsung.com>
2024-10-30 18:01     ` [PATCH v6 04/10] fs, iov_iter: define meta io descriptor Kanchan Joshi
2024-10-31  6:55       ` Christoph Hellwig
     [not found]   ` <CGME20241030181010epcas5p2c399ecea97ed6d0e5fb228b5d15c2089@epcas5p2.samsung.com>
2024-10-30 18:01     ` [PATCH v6 05/10] fs: introduce IOCB_HAS_METADATA for metadata Kanchan Joshi
     [not found]   ` <CGME20241030181013epcas5p2762403c83e29c81ec34b2a7755154245@epcas5p2.samsung.com>
2024-10-30 18:01     ` [PATCH v6 06/10] io_uring/rw: add support to send metadata along with read/write Kanchan Joshi
2024-10-30 21:09       ` Keith Busch
2024-10-31 14:39         ` Pavel Begunkov
2024-11-01 17:54           ` Kanchan Joshi
2024-11-07 17:23             ` Pavel Begunkov
2024-11-10 17:41               ` Kanchan Joshi
2024-11-12  0:54                 ` Pavel Begunkov
2024-11-10 18:36               ` Kanchan Joshi
2024-11-12  1:32                 ` Pavel Begunkov
2024-10-31  6:55       ` Christoph Hellwig
     [not found]   ` <CGME20241030181016epcas5p3da284aa997e81d9855207584ab4bace3@epcas5p3.samsung.com>
2024-10-30 18:01     ` [PATCH v6 07/10] block: introduce BIP_CHECK_GUARD/REFTAG/APPTAG bip_flags Kanchan Joshi
     [not found]   ` <CGME20241030181019epcas5p135961d721959d80f1f60bd4790ed52cf@epcas5p1.samsung.com>
2024-10-30 18:01     ` [PATCH v6 08/10] nvme: add support for passing on the application tag Kanchan Joshi
     [not found]   ` <CGME20241030181021epcas5p1c61b7980358f3120014b4f99390d1595@epcas5p1.samsung.com>
2024-10-30 18:01     ` [PATCH v6 09/10] scsi: add support for user-meta interface Kanchan Joshi
2024-10-31  5:09       ` kernel test robot
2024-10-31  5:10       ` kernel test robot
     [not found]   ` <CGME20241030181024epcas5p3964697a08159f8593a6f94764f77a7f3@epcas5p3.samsung.com>
2024-10-30 18:01     ` [PATCH v6 10/10] block: add support to pass user meta buffer Kanchan Joshi
