public inbox for [email protected]
* [PATCH 00/10] Read/Write with meta/integrity
       [not found] <CGME20240425184649epcas5p42f6ddbfb1c579f043a919973c70ebd03@epcas5p4.samsung.com>
@ 2024-04-25 18:39 ` Kanchan Joshi
       [not found]   ` <CGME20240425184651epcas5p3404f2390d6cf05148eb96e1af093e7bc@epcas5p3.samsung.com>
                     ` (10 more replies)
  0 siblings, 11 replies; 41+ messages in thread
From: Kanchan Joshi @ 2024-04-25 18:39 UTC (permalink / raw)
  To: axboe, martin.petersen, kbusch, hch, brauner
  Cc: asml.silence, dw, io-uring, linux-nvme, linux-block, gost.dev,
	Kanchan Joshi

This adds a new io_uring interface to specify meta/integrity information
along with read/write. Beyond reading/writing the meta buffer, the
interface also enables (a) flags to control data-integrity checks, and
(b) passing an application tag.

The block path (direct IO) and the NVMe driver are modified to support
this.

The first five patches are enhancements/fixes in block/nvme so that the
user meta buffer is handled correctly, mostly when the bio gets split.
Patches 6 and 7 are preparatory.
Patch 8 adds the io_uring support.
Patch 9 adds the support to block direct IO, and patch 10 to NVMe.

Interface:
Two new io_uring opcodes: IORING_OP_READ_META and IORING_OP_WRITE_META.
The leftover space in the SQE is used to pass the meta buffer, its
length, the apptag, and the meta flags (guard/reftag/apptag checks for
now). An example program showing how to use the interface is appended
below [1].

An alternative design choice would be to not introduce new opcodes and
instead add a new RWF_META flag. Open to doing that in the next version.
As for the new meta check flags, RWF_* space seemed a bit too precious
to use, hence the route of carving them out within the SQE itself.

Performance:
Non-meta IO is not affected by these patches.

Testing:
Done by modifying fio to use this interface:
https://github.com/SamsungDS/fio/commits/feat/test-meta-v2

Changes since RFC:
- modify io_uring plumbing based on recent async handling state changes
- fixes/enhancements to correctly handle the split for meta buffer
- add flags to specify guard/reftag/apptag checks
- add support to send apptag

[1]
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <linux/io_uring.h>
#include <linux/types.h>
#include "liburing.h"

/*
 * Write data + meta, read both back, and compare; send the apptag too.
 * Prerequisite:
 * unprotected xfer: format namespace with 4KB + 8b meta, pi_type = 0
 * protected xfer: format namespace with 4KB + 8b meta, pi_type = 1
 */

#define DATA_LEN 4096
#define META_LEN 8

struct t10_pi_tuple {
        __be16  guard;
        __be16  apptag;
        __be32  reftag;
};

int main(int argc, char *argv[])
{
         struct io_uring ring;
         struct io_uring_sqe *sqe = NULL;
         struct io_uring_cqe *cqe = NULL;
         void *wdb,*rdb;
         char wmb[META_LEN], rmb[META_LEN];
         char *data_str = "data buffer";
         char *meta_str = "meta";
         int fd, ret, blksize;
         struct stat fstat;
         unsigned long long offset = DATA_LEN;
         struct t10_pi_tuple *pi;

         if (argc != 2) {
                  fprintf(stderr, "Usage: %s <block-device>\n", argv[0]);
                  return 1;
         }

         if (stat(argv[1], &fstat) == 0) {
                 blksize = (int)fstat.st_blksize;
         } else {
                 perror("stat");
                 return 1;
         }

         if (posix_memalign(&wdb, blksize, DATA_LEN)) {
                 perror("posix_memalign failed");
                 return 1;
         }
         if (posix_memalign(&rdb, blksize, DATA_LEN)) {
                 perror("posix_memalign failed");
                 return 1;
         }

         memset(wdb, 0, DATA_LEN);
         strcpy(wdb, data_str);
         memset(wmb, 0, META_LEN);
         strcpy(wmb, meta_str);

         fd = open(argv[1], O_RDWR | O_DIRECT);
         if (fd < 0) {
                  perror("open");
                  return 1;
         }

         ret = io_uring_queue_init(8, &ring, 0);
         if (ret) {
                 fprintf(stderr, "ring setup failed: %d\n", ret);
                 return 1;
         }

         /* write data + meta-buffer to device */
         sqe = io_uring_get_sqe(&ring);
         if (!sqe) {
                 fprintf(stderr, "get sqe failed\n");
                 return 1;
         }

         io_uring_prep_write(sqe, fd, wdb, DATA_LEN, offset);
         sqe->opcode = IORING_OP_WRITE_META;
         sqe->meta_addr = (__u64)wmb;
         sqe->meta_len = META_LEN;
         /* flags to ask for guard/reftag/apptag*/
         sqe->meta_flags = META_CHK_APPTAG;
         sqe->apptag = 0x1234;

         /* t10_pi_tuple fields are big-endian; storing 0x3412 on a
          * little-endian host places 0x1234 in the buffer's apptag */
         pi = (struct t10_pi_tuple *)wmb;
         pi->apptag = 0x3412;

         ret = io_uring_submit(&ring);
         if (ret <= 0) {
                 fprintf(stderr, "sqe submit failed: %d\n", ret);
                 return 1;
         }

         ret = io_uring_wait_cqe(&ring, &cqe);
         if (ret) {
                  fprintf(stderr, "wait_cqe failed: %d\n", ret);
                  return 1;
         }
         if (cqe->res < 0) {
                  fprintf(stderr, "write cqe failure: %d\n", cqe->res);
                  return 1;
         }

         io_uring_cqe_seen(&ring, cqe);

         /* read data + meta-buffer back from device */
         sqe = io_uring_get_sqe(&ring);
         if (!sqe) {
                 fprintf(stderr, "get sqe failed\n");
                 return 1;
         }

         io_uring_prep_read(sqe, fd, rdb, DATA_LEN, offset);
         sqe->opcode = IORING_OP_READ_META;
         sqe->meta_addr = (__u64)rmb;
         sqe->meta_len = META_LEN;
         sqe->meta_flags = META_CHK_APPTAG;
         sqe->apptag = 0x1234;

         ret = io_uring_submit(&ring);
         if (ret <= 0) {
                 fprintf(stderr, "sqe submit failed: %d\n", ret);
                 return 1;
         }

         ret = io_uring_wait_cqe(&ring, &cqe);
         if (ret) {
                  fprintf(stderr, "wait_cqe failed: %d\n", ret);
                  return 1;
         }

         if (cqe->res < 0) {
                  fprintf(stderr, "read cqe failure: %d\n", cqe->res);
                  return 1;
         }
         io_uring_cqe_seen(&ring, cqe);

         if (strncmp(wmb, rmb, META_LEN))
                  printf("Failure: meta mismatch! wmb=%.*s, rmb=%.*s\n",
                         META_LEN, wmb, META_LEN, rmb);

         if (strncmp(wdb, rdb, DATA_LEN))
                 printf("Failure: data mismatch!\n");

         io_uring_queue_exit(&ring);
         free(rdb);
         free(wdb);
         return 0;
}


Anuj Gupta (6):
  block: set bip_vcnt correctly
  block: copy bip_max_vcnt vecs instead of bip_vcnt during clone
  block: copy result back to user meta buffer correctly in case of split
  block: avoid unpinning/freeing the bio_vec in case of cloned bio
  block: modify bio_integrity_map_user argument
  io_uring/rw: add support to send meta along with read/write

Kanchan Joshi (4):
  block, nvme: modify rq_integrity_vec function
  block: define meta io descriptor
  block: add support to send meta buffer
  nvme: add separate handling for user integrity buffer

 block/bio-integrity.c         | 69 +++++++++++++++++++++++--------
 block/fops.c                  |  9 +++++
 block/t10-pi.c                |  6 +++
 drivers/nvme/host/core.c      | 36 ++++++++++++++++-
 drivers/nvme/host/ioctl.c     | 11 ++++-
 drivers/nvme/host/pci.c       |  9 +++--
 include/linux/bio.h           | 23 +++++++++--
 include/linux/blk-integrity.h | 13 +++---
 include/linux/fs.h            |  1 +
 include/uapi/linux/io_uring.h | 15 +++++++
 io_uring/io_uring.c           |  4 ++
 io_uring/opdef.c              | 30 ++++++++++++++
 io_uring/rw.c                 | 76 +++++++++++++++++++++++++++++++++--
 io_uring/rw.h                 | 11 ++++-
 14 files changed, 276 insertions(+), 37 deletions(-)


base-commit: 24c3fc5c75c5b9d471783b4a4958748243828613
-- 
2.25.1


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 01/10] block: set bip_vcnt correctly
       [not found]   ` <CGME20240425184651epcas5p3404f2390d6cf05148eb96e1af093e7bc@epcas5p3.samsung.com>
@ 2024-04-25 18:39     ` Kanchan Joshi
  2024-04-27  7:02       ` Christoph Hellwig
  0 siblings, 1 reply; 41+ messages in thread
From: Kanchan Joshi @ 2024-04-25 18:39 UTC (permalink / raw)
  To: axboe, martin.petersen, kbusch, hch, brauner
  Cc: asml.silence, dw, io-uring, linux-nvme, linux-block, gost.dev,
	Anuj Gupta, Kanchan Joshi

From: Anuj Gupta <[email protected]>

Set the bip_vcnt correctly in bio_integrity_init_user and
bio_integrity_copy_user. If the bio gets split at a later point,
this value is required to set the right bip_vcnt in the cloned bio.

Signed-off-by: Anuj Gupta <[email protected]>
Signed-off-by: Kanchan Joshi <[email protected]>
---
 block/bio-integrity.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index 2e3e8e04961e..e3390424e6b5 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -254,6 +254,7 @@ static int bio_integrity_copy_user(struct bio *bio, struct bio_vec *bvec,
 
 	bip->bip_flags |= BIP_INTEGRITY_USER | BIP_COPY_USER;
 	bip->bip_iter.bi_sector = seed;
+	bip->bip_vcnt = nr_vecs;
 	return 0;
 free_bip:
 	bio_integrity_free(bio);
@@ -275,6 +276,7 @@ static int bio_integrity_init_user(struct bio *bio, struct bio_vec *bvec,
 	bip->bip_flags |= BIP_INTEGRITY_USER;
 	bip->bip_iter.bi_sector = seed;
 	bip->bip_iter.bi_size = len;
+	bip->bip_vcnt = nr_vecs;
 	return 0;
 }
 
-- 
2.25.1



* [PATCH 02/10] block: copy bip_max_vcnt vecs instead of bip_vcnt during clone
       [not found]   ` <CGME20240425184653epcas5p28de1473090e0141ae74f8b0a6eb921a7@epcas5p2.samsung.com>
@ 2024-04-25 18:39     ` Kanchan Joshi
  2024-04-27  7:03       ` Christoph Hellwig
  0 siblings, 1 reply; 41+ messages in thread
From: Kanchan Joshi @ 2024-04-25 18:39 UTC (permalink / raw)
  To: axboe, martin.petersen, kbusch, hch, brauner
  Cc: asml.silence, dw, io-uring, linux-nvme, linux-block, gost.dev,
	Anuj Gupta, Kanchan Joshi

From: Anuj Gupta <[email protected]>

If bio_integrity_copy_user is used to process the meta buffer,
bip_max_vcnt is one greater than bip_vcnt. In this case the
bip_max_vcnt vecs need to be copied to the cloned bip.

Signed-off-by: Anuj Gupta <[email protected]>
Signed-off-by: Kanchan Joshi <[email protected]>
---
 block/bio-integrity.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index e3390424e6b5..c1955f01412e 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -622,12 +622,12 @@ int bio_integrity_clone(struct bio *bio, struct bio *bio_src,
 
 	BUG_ON(bip_src == NULL);
 
-	bip = bio_integrity_alloc(bio, gfp_mask, bip_src->bip_vcnt);
+	bip = bio_integrity_alloc(bio, gfp_mask, bip_src->bip_max_vcnt);
 	if (IS_ERR(bip))
 		return PTR_ERR(bip);
 
 	memcpy(bip->bip_vec, bip_src->bip_vec,
-	       bip_src->bip_vcnt * sizeof(struct bio_vec));
+	       bip_src->bip_max_vcnt * sizeof(struct bio_vec));
 
 	bip->bip_vcnt = bip_src->bip_vcnt;
 	bip->bip_iter = bip_src->bip_iter;
-- 
2.25.1



* [PATCH 03/10] block: copy result back to user meta buffer correctly in case of split
       [not found]   ` <CGME20240425184656epcas5p42228cdef753cf20a266d12de5bc130f0@epcas5p4.samsung.com>
@ 2024-04-25 18:39     ` Kanchan Joshi
  2024-04-27  7:04       ` Christoph Hellwig
  0 siblings, 1 reply; 41+ messages in thread
From: Kanchan Joshi @ 2024-04-25 18:39 UTC (permalink / raw)
  To: axboe, martin.petersen, kbusch, hch, brauner
  Cc: asml.silence, dw, io-uring, linux-nvme, linux-block, gost.dev,
	Anuj Gupta, Kanchan Joshi

From: Anuj Gupta <[email protected]>

In case of a split, the len and offset of the bvec get updated in the
iter. Use the iter to fetch the right bvec, and copy it back to the
user meta buffer.

Signed-off-by: Anuj Gupta <[email protected]>
Signed-off-by: Kanchan Joshi <[email protected]>
---
 block/bio-integrity.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index c1955f01412e..b4042414a08f 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -108,11 +108,15 @@ static void bio_integrity_uncopy_user(struct bio_integrity_payload *bip)
 	unsigned short nr_vecs = bip->bip_max_vcnt - 1;
 	struct bio_vec *copy = &bip->bip_vec[1];
 	size_t bytes = bip->bip_iter.bi_size;
+	struct bio_vec copy_bvec, src_bvec;
 	struct iov_iter iter;
 	int ret;
 
-	iov_iter_bvec(&iter, ITER_DEST, copy, nr_vecs, bytes);
-	ret = copy_to_iter(bvec_virt(bip->bip_vec), bytes, &iter);
+	copy_bvec = mp_bvec_iter_bvec(copy, bip->bip_iter);
+	src_bvec = mp_bvec_iter_bvec(bip->bip_vec, bip->bip_iter);
+
+	iov_iter_bvec(&iter, ITER_DEST, &copy_bvec, nr_vecs, bytes);
+	ret = copy_to_iter(bvec_virt(&src_bvec), bytes, &iter);
 	WARN_ON_ONCE(ret != bytes);
 
 	bio_integrity_unpin_bvec(copy, nr_vecs, true);
-- 
2.25.1



* [PATCH 04/10] block: avoid unpinning/freeing the bio_vec in case of cloned bio
       [not found]   ` <CGME20240425184658epcas5p2adb6bf01a5c56ffaac3a55ab57afaf8e@epcas5p2.samsung.com>
@ 2024-04-25 18:39     ` Kanchan Joshi
  2024-04-27  7:05       ` Christoph Hellwig
  0 siblings, 1 reply; 41+ messages in thread
From: Kanchan Joshi @ 2024-04-25 18:39 UTC (permalink / raw)
  To: axboe, martin.petersen, kbusch, hch, brauner
  Cc: asml.silence, dw, io-uring, linux-nvme, linux-block, gost.dev,
	Anuj Gupta, Kanchan Joshi

From: Anuj Gupta <[email protected]>

Unpin/free the bio_vec only once, when the parent bio completes; skip
it for cloned bios.

Signed-off-by: Anuj Gupta <[email protected]>
Signed-off-by: Kanchan Joshi <[email protected]>
---
 block/bio-integrity.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index b4042414a08f..b698eb77515d 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -119,7 +119,8 @@ static void bio_integrity_uncopy_user(struct bio_integrity_payload *bip)
 	ret = copy_to_iter(bvec_virt(&src_bvec), bytes, &iter);
 	WARN_ON_ONCE(ret != bytes);
 
-	bio_integrity_unpin_bvec(copy, nr_vecs, true);
+	if (!bio_flagged((bip->bip_bio), BIO_CLONED))
+		bio_integrity_unpin_bvec(copy, nr_vecs, true);
 }
 
 static void bio_integrity_unmap_user(struct bio_integrity_payload *bip)
@@ -129,11 +130,14 @@ static void bio_integrity_unmap_user(struct bio_integrity_payload *bip)
 	if (bip->bip_flags & BIP_COPY_USER) {
 		if (dirty)
 			bio_integrity_uncopy_user(bip);
-		kfree(bvec_virt(bip->bip_vec));
+		if (!bio_flagged((bip->bip_bio), BIO_CLONED))
+			kfree(bvec_virt(bip->bip_vec));
 		return;
 	}
 
-	bio_integrity_unpin_bvec(bip->bip_vec, bip->bip_max_vcnt, dirty);
+	if (!bio_flagged((bip->bip_bio), BIO_CLONED))
+		bio_integrity_unpin_bvec(bip->bip_vec, bip->bip_max_vcnt,
+					 dirty);
 }
 
 /**
-- 
2.25.1



* [PATCH 05/10] block, nvme: modify rq_integrity_vec function
       [not found]   ` <CGME20240425184700epcas5p1687590f7e4a3f3c3620ac27af514f0ca@epcas5p1.samsung.com>
@ 2024-04-25 18:39     ` Kanchan Joshi
  2024-04-27  7:18       ` Christoph Hellwig
  0 siblings, 1 reply; 41+ messages in thread
From: Kanchan Joshi @ 2024-04-25 18:39 UTC (permalink / raw)
  To: axboe, martin.petersen, kbusch, hch, brauner
  Cc: asml.silence, dw, io-uring, linux-nvme, linux-block, gost.dev,
	Kanchan Joshi, Anuj Gupta

rq_integrity_vec always returns the first bio_vec and does not take
bip_iter into consideration. Modify it so that it does. This is similar
to how req_bvec() works for data buffers.

This is necessary for correct dma map/unmap of the meta buffer when it
has been split in the block layer.

Signed-off-by: Anuj Gupta <[email protected]>
Signed-off-by: Kanchan Joshi <[email protected]>
---
 drivers/nvme/host/pci.c       |  9 +++++----
 include/linux/blk-integrity.h | 13 +++++++------
 2 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 8e0bb9692685..bc5177ea6330 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -825,9 +825,9 @@ static blk_status_t nvme_map_metadata(struct nvme_dev *dev, struct request *req,
 		struct nvme_command *cmnd)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+	struct bio_vec bv = rq_integrity_vec(req);
 
-	iod->meta_dma = dma_map_bvec(dev->dev, rq_integrity_vec(req),
-			rq_dma_dir(req), 0);
+	iod->meta_dma = dma_map_bvec(dev->dev, &bv, rq_dma_dir(req), 0);
 	if (dma_mapping_error(dev->dev, iod->meta_dma))
 		return BLK_STS_IOERR;
 	cmnd->rw.metadata = cpu_to_le64(iod->meta_dma);
@@ -964,9 +964,10 @@ static __always_inline void nvme_pci_unmap_rq(struct request *req)
 
 	if (blk_integrity_rq(req)) {
 	        struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+		struct bio_vec bv = rq_integrity_vec(req);
 
-		dma_unmap_page(dev->dev, iod->meta_dma,
-			       rq_integrity_vec(req)->bv_len, rq_dma_dir(req));
+		dma_unmap_page(dev->dev, iod->meta_dma, bv.bv_len,
+			       rq_dma_dir(req));
 	}
 
 	if (blk_rq_nr_phys_segments(req))
diff --git a/include/linux/blk-integrity.h b/include/linux/blk-integrity.h
index e253e7bd0d17..9223050c0f75 100644
--- a/include/linux/blk-integrity.h
+++ b/include/linux/blk-integrity.h
@@ -109,11 +109,12 @@ static inline bool blk_integrity_rq(struct request *rq)
  * Return the first bvec that contains integrity data.  Only drivers that are
  * limited to a single integrity segment should use this helper.
  */
-static inline struct bio_vec *rq_integrity_vec(struct request *rq)
+static inline struct bio_vec rq_integrity_vec(struct request *rq)
 {
-	if (WARN_ON_ONCE(queue_max_integrity_segments(rq->q) > 1))
-		return NULL;
-	return rq->bio->bi_integrity->bip_vec;
+	WARN_ON_ONCE(queue_max_integrity_segments(rq->q) > 1);
+
+	return mp_bvec_iter_bvec(rq->bio->bi_integrity->bip_vec,
+				 rq->bio->bi_integrity->bip_iter);
 }
 #else /* CONFIG_BLK_DEV_INTEGRITY */
 static inline int blk_rq_count_integrity_sg(struct request_queue *q,
@@ -177,9 +178,9 @@ static inline int blk_integrity_rq(struct request *rq)
 	return 0;
 }
 
-static inline struct bio_vec *rq_integrity_vec(struct request *rq)
+static inline struct bio_vec rq_integrity_vec(struct request *rq)
 {
-	return NULL;
+	return (struct bio_vec){0};
 }
 #endif /* CONFIG_BLK_DEV_INTEGRITY */
 #endif /* _LINUX_BLK_INTEGRITY_H */
-- 
2.25.1



* [PATCH 06/10] block: modify bio_integrity_map_user argument
       [not found]   ` <CGME20240425184702epcas5p1ccb0df41b07845bc252d69007558e3fa@epcas5p1.samsung.com>
@ 2024-04-25 18:39     ` Kanchan Joshi
  2024-04-27  7:19       ` Christoph Hellwig
  0 siblings, 1 reply; 41+ messages in thread
From: Kanchan Joshi @ 2024-04-25 18:39 UTC (permalink / raw)
  To: axboe, martin.petersen, kbusch, hch, brauner
  Cc: asml.silence, dw, io-uring, linux-nvme, linux-block, gost.dev,
	Anuj Gupta, Kanchan Joshi

From: Anuj Gupta <[email protected]>

Refactor bio_integrity_map_user to accept an iov_iter as argument.
This is a prep patch.

Signed-off-by: Anuj Gupta <[email protected]>
Signed-off-by: Kanchan Joshi <[email protected]>
---
 block/bio-integrity.c     | 14 ++++++--------
 drivers/nvme/host/ioctl.c | 11 +++++++++--
 include/linux/bio.h       |  7 ++++---
 3 files changed, 19 insertions(+), 13 deletions(-)

diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index b698eb77515d..1085cf45f51e 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -318,17 +318,16 @@ static unsigned int bvec_from_pages(struct bio_vec *bvec, struct page **pages,
 	return nr_bvecs;
 }
 
-int bio_integrity_map_user(struct bio *bio, void __user *ubuf, ssize_t bytes,
-			   u32 seed)
+int bio_integrity_map_user(struct bio *bio, struct iov_iter *iter,
+			  u32 seed)
 {
 	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 	unsigned int align = q->dma_pad_mask | queue_dma_alignment(q);
 	struct page *stack_pages[UIO_FASTIOV], **pages = stack_pages;
 	struct bio_vec stack_vec[UIO_FASTIOV], *bvec = stack_vec;
+	size_t offset, bytes = iter->count;
 	unsigned int direction, nr_bvecs;
-	struct iov_iter iter;
 	int ret, nr_vecs;
-	size_t offset;
 	bool copy;
 
 	if (bio_integrity(bio))
@@ -341,8 +340,7 @@ int bio_integrity_map_user(struct bio *bio, void __user *ubuf, ssize_t bytes,
 	else
 		direction = ITER_SOURCE;
 
-	iov_iter_ubuf(&iter, direction, ubuf, bytes);
-	nr_vecs = iov_iter_npages(&iter, BIO_MAX_VECS + 1);
+	nr_vecs = iov_iter_npages(iter, BIO_MAX_VECS + 1);
 	if (nr_vecs > BIO_MAX_VECS)
 		return -E2BIG;
 	if (nr_vecs > UIO_FASTIOV) {
@@ -352,8 +350,8 @@ int bio_integrity_map_user(struct bio *bio, void __user *ubuf, ssize_t bytes,
 		pages = NULL;
 	}
 
-	copy = !iov_iter_is_aligned(&iter, align, align);
-	ret = iov_iter_extract_pages(&iter, &pages, bytes, nr_vecs, 0, &offset);
+	copy = !iov_iter_is_aligned(iter, align, align);
+	ret = iov_iter_extract_pages(iter, &pages, bytes, nr_vecs, 0, &offset);
 	if (unlikely(ret < 0))
 		goto free_bvec;
 
diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index 499a8bb7cac7..4ed389edb3b6 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -145,8 +145,15 @@ static int nvme_map_user_request(struct request *req, u64 ubuffer,
 	if (bdev) {
 		bio_set_dev(bio, bdev);
 		if (meta_buffer && meta_len) {
-			ret = bio_integrity_map_user(bio, meta_buffer, meta_len,
-						     meta_seed);
+			struct iov_iter iter;
+			unsigned int direction;
+
+			if (bio_data_dir(bio) == READ)
+				direction = ITER_DEST;
+			else
+				direction = ITER_SOURCE;
+			iov_iter_ubuf(&iter, direction, meta_buffer, meta_len);
+			ret = bio_integrity_map_user(bio, &iter, meta_seed);
 			if (ret)
 				goto out_unmap;
 			req->cmd_flags |= REQ_INTEGRITY;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 875d792bffff..20cf47fc851f 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -723,7 +723,7 @@ static inline bool bioset_initialized(struct bio_set *bs)
 	for_each_bio(_bio)						\
 		bip_for_each_vec(_bvl, _bio->bi_integrity, _iter)
 
-int bio_integrity_map_user(struct bio *bio, void __user *ubuf, ssize_t len, u32 seed);
+int bio_integrity_map_user(struct bio *bio, struct iov_iter *iter, u32 seed);
 extern struct bio_integrity_payload *bio_integrity_alloc(struct bio *, gfp_t, unsigned int);
 extern int bio_integrity_add_page(struct bio *, struct page *, unsigned int, unsigned int);
 extern bool bio_integrity_prep(struct bio *);
@@ -795,8 +795,9 @@ static inline int bio_integrity_add_page(struct bio *bio, struct page *page,
 	return 0;
 }
 
-static inline int bio_integrity_map_user(struct bio *bio, void __user *ubuf,
-					 ssize_t len, u32 seed)
+static inline int bio_integrity_map_user(struct bio *bio, struct iov_iter *iter,
+					u32 seed)
+
 {
 	return -EINVAL;
 }
-- 
2.25.1



* [PATCH 07/10] block: define meta io descriptor
       [not found]   ` <CGME20240425184704epcas5p3b9eb6cce9c9658eb1d0d32937e778a5d@epcas5p3.samsung.com>
@ 2024-04-25 18:39     ` Kanchan Joshi
  0 siblings, 0 replies; 41+ messages in thread
From: Kanchan Joshi @ 2024-04-25 18:39 UTC (permalink / raw)
  To: axboe, martin.petersen, kbusch, hch, brauner
  Cc: asml.silence, dw, io-uring, linux-nvme, linux-block, gost.dev,
	Kanchan Joshi

Introduce a new 'uio_meta' structure that the upper layer (io_uring)
can use to pass the meta/integrity information. This is a prep patch.

Signed-off-by: Kanchan Joshi <[email protected]>
---
 include/linux/bio.h | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/include/linux/bio.h b/include/linux/bio.h
index 20cf47fc851f..0281b356935a 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -331,6 +331,12 @@ enum bip_flags {
 	BIP_COPY_USER		= 1 << 6, /* Kernel bounce buffer in use */
 };
 
+struct uio_meta {
+	u16 flags;
+	u16 apptag;
+	struct iov_iter iter;
+};
+
 /*
  * bio integrity payload
  */
-- 
2.25.1



* [PATCH 08/10] io_uring/rw: add support to send meta along with read/write
       [not found]   ` <CGME20240425184706epcas5p1d75c19d1d1458c52fc4009f150c7dc7d@epcas5p1.samsung.com>
@ 2024-04-25 18:39     ` Kanchan Joshi
  2024-04-26 14:25       ` Jens Axboe
  0 siblings, 1 reply; 41+ messages in thread
From: Kanchan Joshi @ 2024-04-25 18:39 UTC (permalink / raw)
  To: axboe, martin.petersen, kbusch, hch, brauner
  Cc: asml.silence, dw, io-uring, linux-nvme, linux-block, gost.dev,
	Anuj Gupta, Kanchan Joshi, Nitesh Shetty

From: Anuj Gupta <[email protected]>

This patch introduces the IORING_OP_READ_META and IORING_OP_WRITE_META
opcodes, which allow sending a meta buffer along with a read/write. The
meta buffer, its length, the apptag, and the integrity check flags can
be specified by the application in the newly introduced meta_addr,
meta_len, apptag and meta_flags fields of the SQE.

Use the user-passed information to prepare a uio_meta descriptor, and
pass it down using kiocb->private.

Meta exchange is supported only for direct IO.

Signed-off-by: Anuj Gupta <[email protected]>
Signed-off-by: Kanchan Joshi <[email protected]>
Signed-off-by: Nitesh Shetty <[email protected]>
---
 include/linux/fs.h            |  1 +
 include/uapi/linux/io_uring.h | 15 +++++++
 io_uring/io_uring.c           |  4 ++
 io_uring/opdef.c              | 30 ++++++++++++++
 io_uring/rw.c                 | 76 +++++++++++++++++++++++++++++++++--
 io_uring/rw.h                 | 11 ++++-
 6 files changed, 132 insertions(+), 5 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 8dfd53b52744..8868d17ae8f9 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -329,6 +329,7 @@ struct readahead_control;
 #define IOCB_NOIO		(1 << 20)
 /* can use bio alloc cache */
 #define IOCB_ALLOC_CACHE	(1 << 21)
+#define IOCB_USE_META		(1 << 22)
 /*
  * IOCB_DIO_CALLER_COMP can be set by the iocb owner, to indicate that the
  * iocb completion can be passed back to the owner for execution from a safe
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index a7f847543a7f..d4653b52fdd6 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -97,6 +97,12 @@ struct io_uring_sqe {
 			__u64	addr3;
 			__u64	__pad2[1];
 		};
+		struct {
+			__u64	meta_addr;
+			__u32	meta_len;
+			__u16	meta_flags;
+			__u16	apptag;
+		};
 		__u64	optval;
 		/*
 		 * If the ring is initialized with IORING_SETUP_SQE128, then
@@ -106,6 +112,13 @@ struct io_uring_sqe {
 	};
 };
 
+/*
+ * meta io flags
+ */
+#define META_CHK_GUARD	(1U << 0)	/* guard is valid */
+#define META_CHK_APPTAG	(1U << 1)	/* app tag is valid */
+#define META_CHK_REFTAG	(1U << 2)	/* ref tag is valid */
+
 /*
  * If sqe->file_index is set to this for opcodes that instantiate a new
  * direct descriptor (like openat/openat2/accept), then io_uring will allocate
@@ -256,6 +269,8 @@ enum io_uring_op {
 	IORING_OP_FUTEX_WAITV,
 	IORING_OP_FIXED_FD_INSTALL,
 	IORING_OP_FTRUNCATE,
+	IORING_OP_READ_META,
+	IORING_OP_WRITE_META,
 
 	/* this goes last, obviously */
 	IORING_OP_LAST,
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 3c9087f37c43..af95fc8d988c 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -3723,7 +3723,11 @@ static int __init io_uring_init(void)
 	BUILD_BUG_SQE_ELEM(44, __u16,  addr_len);
 	BUILD_BUG_SQE_ELEM(46, __u16,  __pad3[0]);
 	BUILD_BUG_SQE_ELEM(48, __u64,  addr3);
+	BUILD_BUG_SQE_ELEM(48, __u64,  meta_addr);
 	BUILD_BUG_SQE_ELEM_SIZE(48, 0, cmd);
+	BUILD_BUG_SQE_ELEM(56, __u32,  meta_len);
+	BUILD_BUG_SQE_ELEM(60, __u16,  meta_flags);
+	BUILD_BUG_SQE_ELEM(62, __u16,  apptag);
 	BUILD_BUG_SQE_ELEM(56, __u64,  __pad2);
 
 	BUILD_BUG_ON(sizeof(struct io_uring_files_update) !=
diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index a16f73938ebb..8b8fdcfb7f30 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -444,6 +444,28 @@ const struct io_issue_def io_issue_defs[] = {
 		.prep			= io_eopnotsupp_prep,
 #endif
 	},
+	[IORING_OP_READ_META] = {
+		.needs_file		= 1,
+		.plug			= 1,
+		.audit_skip		= 1,
+		.ioprio			= 1,
+		.iopoll			= 1,
+		.iopoll_queue		= 1,
+		.async_size		= sizeof(struct io_async_rw),
+		.prep			= io_prep_read_meta,
+		.issue			= io_rw_meta,
+	},
+	[IORING_OP_WRITE_META] = {
+		.needs_file		= 1,
+		.plug			= 1,
+		.audit_skip		= 1,
+		.ioprio			= 1,
+		.iopoll			= 1,
+		.iopoll_queue		= 1,
+		.async_size		= sizeof(struct io_async_rw),
+		.prep			= io_prep_write_meta,
+		.issue			= io_rw_meta,
+	},
 	[IORING_OP_READ_MULTISHOT] = {
 		.needs_file		= 1,
 		.unbound_nonreg_file	= 1,
@@ -510,6 +532,14 @@ const struct io_cold_def io_cold_defs[] = {
 		.cleanup		= io_readv_writev_cleanup,
 		.fail			= io_rw_fail,
 	},
+	[IORING_OP_READ_META] = {
+		.name			= "READ_META",
+		.fail			= io_rw_fail,
+	},
+	[IORING_OP_WRITE_META] = {
+		.name			= "WRITE_META",
+		.fail			= io_rw_fail,
+	},
 	[IORING_OP_FSYNC] = {
 		.name			= "FSYNC",
 	},
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 3134a6ece1be..b2c9ac91d5e5 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -269,6 +269,7 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 		rw->kiocb.ki_ioprio = get_current_ioprio();
 	}
 	rw->kiocb.dio_complete = NULL;
+	rw->kiocb.ki_flags = 0;
 
 	rw->addr = READ_ONCE(sqe->addr);
 	rw->len = READ_ONCE(sqe->len);
@@ -286,6 +287,41 @@ int io_prep_write(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	return io_prep_rw(req, sqe, ITER_SOURCE, true);
 }
 
+static int io_prep_rw_meta(struct io_kiocb *req, const struct io_uring_sqe *sqe,
+		    int ddir, bool import)
+{
+	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
+	struct io_async_rw *io;
+	struct kiocb *kiocb = &rw->kiocb;
+	int ret;
+
+	ret = io_prep_rw(req, sqe, ddir, import);
+	if (unlikely(ret))
+		return ret;
+
+	io = req->async_data;
+	kiocb->ki_flags |= IOCB_USE_META;
+	io->meta.flags = READ_ONCE(sqe->meta_flags);
+	io->meta.apptag = READ_ONCE(sqe->apptag);
+	ret = import_ubuf(ddir, u64_to_user_ptr(READ_ONCE(sqe->meta_addr)),
+			     READ_ONCE(sqe->meta_len), &io->meta.iter);
+	if (unlikely(ret < 0))
+		return ret;
+
+	iov_iter_save_state(&io->meta.iter, &io->iter_meta_state);
+	return 0;
+}
+
+int io_prep_read_meta(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+{
+	return io_prep_rw_meta(req, sqe, ITER_DEST, true);
+}
+
+int io_prep_write_meta(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+{
+	return io_prep_rw_meta(req, sqe, ITER_SOURCE, true);
+}
+
 static int io_prep_rwv(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 		       int ddir)
 {
@@ -587,6 +623,8 @@ static int kiocb_done(struct io_kiocb *req, ssize_t ret,
 
 		req->flags &= ~REQ_F_REISSUE;
 		iov_iter_restore(&io->iter, &io->iter_state);
+		if (unlikely(rw->kiocb.ki_flags & IOCB_USE_META))
+			iov_iter_restore(&io->meta.iter, &io->iter_meta_state);
 		return -EAGAIN;
 	}
 	return IOU_ISSUE_SKIP_COMPLETE;
@@ -768,7 +806,7 @@ static int io_rw_init_file(struct io_kiocb *req, fmode_t mode)
 	if (!(req->flags & REQ_F_FIXED_FILE))
 		req->flags |= io_file_get_flags(file);
 
-	kiocb->ki_flags = file->f_iocb_flags;
+	kiocb->ki_flags |= file->f_iocb_flags;
 	ret = kiocb_set_rw_flags(kiocb, rw->flags);
 	if (unlikely(ret))
 		return ret;
@@ -787,7 +825,8 @@ static int io_rw_init_file(struct io_kiocb *req, fmode_t mode)
 		if (!(kiocb->ki_flags & IOCB_DIRECT) || !file->f_op->iopoll)
 			return -EOPNOTSUPP;
 
-		kiocb->private = NULL;
+		if (likely(!(kiocb->ki_flags & IOCB_USE_META)))
+			kiocb->private = NULL;
 		kiocb->ki_flags |= IOCB_HIPRI;
 		kiocb->ki_complete = io_complete_rw_iopoll;
 		req->iopoll_completed = 0;
@@ -853,7 +892,8 @@ static int __io_read(struct io_kiocb *req, unsigned int issue_flags)
 	} else if (ret == -EIOCBQUEUED) {
 		return IOU_ISSUE_SKIP_COMPLETE;
 	} else if (ret == req->cqe.res || ret <= 0 || !force_nonblock ||
-		   (req->flags & REQ_F_NOWAIT) || !need_complete_io(req)) {
+		   (req->flags & REQ_F_NOWAIT) || !need_complete_io(req) ||
+		   (kiocb->ki_flags & IOCB_USE_META)) {
 		/* read all, failed, already did sync or don't want to retry */
 		goto done;
 	}
@@ -864,6 +904,12 @@ static int __io_read(struct io_kiocb *req, unsigned int issue_flags)
 	 * manually if we need to.
 	 */
 	iov_iter_restore(&io->iter, &io->iter_state);
+	if (unlikely(kiocb->ki_flags & IOCB_USE_META)) {
+		/* don't handle partial completion for read + meta */
+		if (ret > 0)
+			goto done;
+		iov_iter_restore(&io->meta.iter, &io->iter_meta_state);
+	}
 
 	do {
 		/*
@@ -1053,7 +1099,8 @@ int io_write(struct io_kiocb *req, unsigned int issue_flags)
 		if (ret2 == -EAGAIN && (req->ctx->flags & IORING_SETUP_IOPOLL))
 			goto ret_eagain;
 
-		if (ret2 != req->cqe.res && ret2 >= 0 && need_complete_io(req)) {
+		if (ret2 != req->cqe.res && ret2 >= 0 && need_complete_io(req)
+				&& !(kiocb->ki_flags & IOCB_USE_META)) {
 			trace_io_uring_short_write(req->ctx, kiocb->ki_pos - ret2,
 						req->cqe.res, ret2);
 
@@ -1074,12 +1121,33 @@ int io_write(struct io_kiocb *req, unsigned int issue_flags)
 	} else {
 ret_eagain:
 		iov_iter_restore(&io->iter, &io->iter_state);
+		if (unlikely(kiocb->ki_flags & IOCB_USE_META))
+			iov_iter_restore(&io->meta.iter, &io->iter_meta_state);
 		if (kiocb->ki_flags & IOCB_WRITE)
 			io_req_end_write(req);
 		return -EAGAIN;
 	}
 }
 
+int io_rw_meta(struct io_kiocb *req, unsigned int issue_flags)
+{
+	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
+	struct io_async_rw *io = req->async_data;
+	struct kiocb *kiocb = &rw->kiocb;
+	int ret;
+
+	if (!(req->file->f_flags & O_DIRECT))
+		return -EOPNOTSUPP;
+
+	kiocb->private = &io->meta;
+	if (req->opcode == IORING_OP_READ_META)
+		ret = io_read(req, issue_flags);
+	else
+		ret = io_write(req, issue_flags);
+
+	return ret;
+}
+
 void io_rw_fail(struct io_kiocb *req)
 {
 	int res;
diff --git a/io_uring/rw.h b/io_uring/rw.h
index 3f432dc75441..a640071064e3 100644
--- a/io_uring/rw.h
+++ b/io_uring/rw.h
@@ -9,7 +9,13 @@ struct io_async_rw {
 	struct iovec			fast_iov;
 	struct iovec			*free_iovec;
 	int				free_iov_nr;
-	struct wait_page_queue		wpq;
+	union {
+		struct wait_page_queue		wpq;
+		struct {
+			struct uio_meta			meta;
+			struct iov_iter_state		iter_meta_state;
+		};
+	};
 };
 
 int io_prep_read_fixed(struct io_kiocb *req, const struct io_uring_sqe *sqe);
@@ -17,9 +23,12 @@ int io_prep_write_fixed(struct io_kiocb *req, const struct io_uring_sqe *sqe);
 int io_prep_readv(struct io_kiocb *req, const struct io_uring_sqe *sqe);
 int io_prep_writev(struct io_kiocb *req, const struct io_uring_sqe *sqe);
 int io_prep_read(struct io_kiocb *req, const struct io_uring_sqe *sqe);
+int io_prep_read_meta(struct io_kiocb *req, const struct io_uring_sqe *sqe);
 int io_prep_write(struct io_kiocb *req, const struct io_uring_sqe *sqe);
+int io_prep_write_meta(struct io_kiocb *req, const struct io_uring_sqe *sqe);
 int io_read(struct io_kiocb *req, unsigned int issue_flags);
 int io_write(struct io_kiocb *req, unsigned int issue_flags);
+int io_rw_meta(struct io_kiocb *req, unsigned int issue_flags);
 void io_readv_writev_cleanup(struct io_kiocb *req);
 void io_rw_fail(struct io_kiocb *req);
 void io_req_rw_complete(struct io_kiocb *req, struct io_tw_state *ts);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 09/10] block: add support to send meta buffer
       [not found]   ` <CGME20240425184708epcas5p4f1d95cd8d285614f712868d205a23115@epcas5p4.samsung.com>
@ 2024-04-25 18:39     ` Kanchan Joshi
  2024-04-26 15:21       ` Keith Busch
  0 siblings, 1 reply; 41+ messages in thread
From: Kanchan Joshi @ 2024-04-25 18:39 UTC (permalink / raw)
  To: axboe, martin.petersen, kbusch, hch, brauner
  Cc: asml.silence, dw, io-uring, linux-nvme, linux-block, gost.dev,
	Kanchan Joshi, Anuj Gupta

If iocb contains the meta, extract that and prepare the bip.
Extend bip so that it can carry three new integrity-check flags
and application tag.

Make sure that ->prepare_fn and ->complete_fn are skipped for
user-owned meta buffer.

Signed-off-by: Anuj Gupta <[email protected]>
Signed-off-by: Kanchan Joshi <[email protected]>
---
 block/bio-integrity.c | 31 ++++++++++++++++++++++++++++++-
 block/fops.c          |  9 +++++++++
 block/t10-pi.c        |  6 ++++++
 include/linux/bio.h   | 10 ++++++++++
 4 files changed, 55 insertions(+), 1 deletion(-)

diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index 1085cf45f51e..dab76fd73813 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -11,7 +11,7 @@
 #include <linux/export.h>
 #include <linux/bio.h>
 #include <linux/workqueue.h>
-#include <linux/slab.h>
+#include <uapi/linux/io_uring.h>
 #include "blk.h"
 
 static struct kmem_cache *bip_slab;
@@ -318,6 +318,35 @@ static unsigned int bvec_from_pages(struct bio_vec *bvec, struct page **pages,
 	return nr_bvecs;
 }
 
+int bio_integrity_map_iter(struct bio *bio, struct uio_meta *meta)
+{
+	struct blk_integrity *bi = blk_get_integrity(bio->bi_bdev->bd_disk);
+	u16 bip_flags = 0;
+	int ret;
+
+	if (!bi)
+		return -EINVAL;
+
+	if (meta->iter.count < bio_integrity_bytes(bi, bio_sectors(bio)))
+		return -EINVAL;
+
+	ret = bio_integrity_map_user(bio, &meta->iter, 0);
+	if (!ret) {
+		struct bio_integrity_payload *bip = bio_integrity(bio);
+
+		if (meta->flags & META_CHK_GUARD)
+			bip_flags |= BIP_USER_CHK_GUARD;
+		if (meta->flags & META_CHK_APPTAG)
+			bip_flags |= BIP_USER_CHK_APPTAG;
+		if (meta->flags & META_CHK_REFTAG)
+			bip_flags |= BIP_USER_CHK_REFTAG;
+
+		bip->bip_flags |= bip_flags;
+		bip->apptag = meta->apptag;
+	}
+	return ret;
+}
+
 int bio_integrity_map_user(struct bio *bio, struct iov_iter *iter,
 			  u32 seed)
 {
diff --git a/block/fops.c b/block/fops.c
index 679d9b752fe8..e488fa66dd60 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -353,6 +353,15 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
 		task_io_account_write(bio->bi_iter.bi_size);
 	}
 
+	if (unlikely(iocb->ki_flags & IOCB_USE_META)) {
+		ret = bio_integrity_map_iter(bio, iocb->private);
+		WRITE_ONCE(iocb->private, NULL);
+		if (unlikely(ret)) {
+			bio_put(bio);
+			return ret;
+		}
+	}
+
 	if (iocb->ki_flags & IOCB_NOWAIT)
 		bio->bi_opf |= REQ_NOWAIT;
 
diff --git a/block/t10-pi.c b/block/t10-pi.c
index d90892fd6f2a..72d1522417a1 100644
--- a/block/t10-pi.c
+++ b/block/t10-pi.c
@@ -156,6 +156,8 @@ static void t10_pi_type1_prepare(struct request *rq)
 		/* Already remapped? */
 		if (bip->bip_flags & BIP_MAPPED_INTEGRITY)
 			break;
+		if (bip->bip_flags & BIP_INTEGRITY_USER)
+			break;
 
 		bip_for_each_vec(iv, bip, iter) {
 			unsigned int j;
@@ -205,6 +207,8 @@ static void t10_pi_type1_complete(struct request *rq, unsigned int nr_bytes)
 		struct bio_vec iv;
 		struct bvec_iter iter;
 
+		if (bip->bip_flags & BIP_INTEGRITY_USER)
+			break;
 		bip_for_each_vec(iv, bip, iter) {
 			unsigned int j;
 			void *p;
@@ -408,6 +412,8 @@ static void ext_pi_type1_prepare(struct request *rq)
 		/* Already remapped? */
 		if (bip->bip_flags & BIP_MAPPED_INTEGRITY)
 			break;
+		if (bip->bip_flags & BIP_INTEGRITY_USER)
+			break;
 
 		bip_for_each_vec(iv, bip, iter) {
 			unsigned int j;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 0281b356935a..8724305bce62 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -329,6 +329,9 @@ enum bip_flags {
 	BIP_IP_CHECKSUM		= 1 << 4, /* IP checksum */
 	BIP_INTEGRITY_USER	= 1 << 5, /* Integrity payload is user address */
 	BIP_COPY_USER		= 1 << 6, /* Kernel bounce buffer in use */
+	BIP_USER_CHK_GUARD	= 1 << 7,
+	BIP_USER_CHK_APPTAG	= 1 << 8,
+	BIP_USER_CHK_REFTAG	= 1 << 9,
 };
 
 struct uio_meta {
@@ -348,6 +351,7 @@ struct bio_integrity_payload {
 	unsigned short		bip_vcnt;	/* # of integrity bio_vecs */
 	unsigned short		bip_max_vcnt;	/* integrity bio_vec slots */
 	unsigned short		bip_flags;	/* control flags */
+	u16			apptag;		/* apptag */
 
 	struct bvec_iter	bio_iter;	/* for rewinding parent bio */
 
@@ -729,6 +733,7 @@ static inline bool bioset_initialized(struct bio_set *bs)
 	for_each_bio(_bio)						\
 		bip_for_each_vec(_bvl, _bio->bi_integrity, _iter)
 
+int bio_integrity_map_iter(struct bio *bio, struct uio_meta *meta);
 int bio_integrity_map_user(struct bio *bio, struct iov_iter *iter, u32 seed);
 extern struct bio_integrity_payload *bio_integrity_alloc(struct bio *, gfp_t, unsigned int);
 extern int bio_integrity_add_page(struct bio *, struct page *, unsigned int, unsigned int);
@@ -808,6 +813,11 @@ static inline int bio_integrity_map_user(struct bio *bio, struct iov_iter *iter,
 	return -EINVAL;
 }
 
+static inline int bio_integrity_map_iter(struct bio *bio, struct uio_meta *meta)
+{
+	return -EINVAL;
+}
+
 #endif /* CONFIG_BLK_DEV_INTEGRITY */
 
 /*
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 10/10] nvme: add separate handling for user integrity buffer
       [not found]   ` <CGME20240425184710epcas5p2968bbc40ed10d1f0184bb511af054fcb@epcas5p2.samsung.com>
@ 2024-04-25 18:39     ` Kanchan Joshi
  2024-04-25 19:56       ` Keith Busch
  2024-04-26 10:57       ` kernel test robot
  0 siblings, 2 replies; 41+ messages in thread
From: Kanchan Joshi @ 2024-04-25 18:39 UTC (permalink / raw)
  To: axboe, martin.petersen, kbusch, hch, brauner
  Cc: asml.silence, dw, io-uring, linux-nvme, linux-block, gost.dev,
	Kanchan Joshi

For user provided integrity buffer, convert bip flags
(guard/reftag/apptag checks) to protocol specific flags.
Also pass apptag and reftag down.

Signed-off-by: Kanchan Joshi <[email protected]>
---
 drivers/nvme/host/core.c | 36 +++++++++++++++++++++++++++++++++++-
 1 file changed, 35 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 27281a9a8951..3b719be4eedb 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -886,6 +886,13 @@ static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req,
 	return BLK_STS_OK;
 }
 
+static void nvme_set_app_tag(struct nvme_command *cmnd, u16 apptag)
+{
+	cmnd->rw.apptag = cpu_to_le16(apptag);
+	/* use 0xffff as mask so that apptag is used in its entirety */
+	cmnd->rw.appmask = cpu_to_le16(0xffff);
+}
+
 static void nvme_set_ref_tag(struct nvme_ns *ns, struct nvme_command *cmnd,
 			      struct request *req)
 {
@@ -943,6 +950,25 @@ static inline blk_status_t nvme_setup_write_zeroes(struct nvme_ns *ns,
 	return BLK_STS_OK;
 }
 
+static inline void nvme_setup_user_integrity(struct nvme_ns *ns,
+		struct request *req, struct nvme_command *cmnd,
+		u16 *control)
+{
+	struct bio_integrity_payload *bip = bio_integrity(req->bio);
+	unsigned short bip_flags = bip->bip_flags;
+
+	if (bip_flags & BIP_USER_CHK_GUARD)
+		*control |= NVME_RW_PRINFO_PRCHK_GUARD;
+	if (bip_flags & BIP_USER_CHK_REFTAG) {
+		*control |= NVME_RW_PRINFO_PRCHK_REF;
+		nvme_set_ref_tag(ns, cmnd, req);
+	}
+	if (bip_flags & BIP_USER_CHK_APPTAG) {
+		*control |= NVME_RW_PRINFO_PRCHK_APP;
+		nvme_set_app_tag(cmnd, bip->apptag);
+	}
+}
+
 static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
 		struct request *req, struct nvme_command *cmnd,
 		enum nvme_opcode op)
@@ -983,6 +1009,14 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
 			if (WARN_ON_ONCE(!nvme_ns_has_pi(ns->head)))
 				return BLK_STS_NOTSUPP;
 			control |= NVME_RW_PRINFO_PRACT;
+		} else {
+			/* process user-created integrity */
+			if (bio_integrity(req->bio)->bip_flags &
+					BIP_INTEGRITY_USER) {
+				nvme_setup_user_integrity(ns, req, cmnd,
+							  &control);
+				goto out;
+			}
 		}
 
 		switch (ns->head->pi_type) {
@@ -999,7 +1033,7 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
 			break;
 		}
 	}
-
+out:
 	cmnd->rw.control = cpu_to_le16(control);
 	cmnd->rw.dsmgmt = cpu_to_le32(dsmgmt);
 	return 0;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH 10/10] nvme: add separate handling for user integrity buffer
  2024-04-25 18:39     ` [PATCH 10/10] nvme: add separate handling for user integrity buffer Kanchan Joshi
@ 2024-04-25 19:56       ` Keith Busch
  2024-04-26 10:57       ` kernel test robot
  1 sibling, 0 replies; 41+ messages in thread
From: Keith Busch @ 2024-04-25 19:56 UTC (permalink / raw)
  To: Kanchan Joshi
  Cc: axboe, martin.petersen, hch, brauner, asml.silence, dw, io-uring,
	linux-nvme, linux-block, gost.dev

On Fri, Apr 26, 2024 at 12:09:43AM +0530, Kanchan Joshi wrote:
> @@ -983,6 +1009,14 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
>  			if (WARN_ON_ONCE(!nvme_ns_has_pi(ns->head)))
>  				return BLK_STS_NOTSUPP;
>  			control |= NVME_RW_PRINFO_PRACT;
> +		} else {
> +			/* process user-created integrity */
> +			if (bio_integrity(req->bio)->bip_flags &
> +					BIP_INTEGRITY_USER) {

Make this an "else if" instead of nesting it an extra level.

> +				nvme_setup_user_integrity(ns, req, cmnd,
> +							  &control);
> +				goto out;
> +			}

And this can be structured a little differently so that you don't need
the "goto"; IMO, goto is good for error unwinding, but using it in a
good path harms readability.

This is getting complex enough that splitting it off in a helper
function, maybe nvme_setup_rw_meta(), might be a good idea.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 10/10] nvme: add separate handling for user integrity buffer
  2024-04-25 18:39     ` [PATCH 10/10] nvme: add separate handling for user integrity buffer Kanchan Joshi
  2024-04-25 19:56       ` Keith Busch
@ 2024-04-26 10:57       ` kernel test robot
  1 sibling, 0 replies; 41+ messages in thread
From: kernel test robot @ 2024-04-26 10:57 UTC (permalink / raw)
  To: Kanchan Joshi, axboe, martin.petersen, kbusch, hch, brauner
  Cc: llvm, oe-kbuild-all, asml.silence, dw, io-uring, linux-nvme,
	linux-block, gost.dev, Kanchan Joshi

Hi Kanchan,

kernel test robot noticed the following build errors:

[auto build test ERROR on 24c3fc5c75c5b9d471783b4a4958748243828613]

url:    https://github.com/intel-lab-lkp/linux/commits/Kanchan-Joshi/block-set-bip_vcnt-correctly/20240426-024916
base:   24c3fc5c75c5b9d471783b4a4958748243828613
patch link:    https://lore.kernel.org/r/20240425183943.6319-11-joshi.k%40samsung.com
patch subject: [PATCH 10/10] nvme: add separate handling for user integrity buffer
config: x86_64-kexec (https://download.01.org/0day-ci/archive/20240426/[email protected]/config)
compiler: clang version 18.1.4 (https://github.com/llvm/llvm-project e6c3289804a67ea0bb6a86fadbe454dd93b8d855)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240426/[email protected]/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/

All errors (new ones prefixed by >>):

>> drivers/nvme/host/core.c:1014:31: error: member reference base type 'void' is not a structure or union
    1014 |                         if (bio_integrity(req->bio)->bip_flags &
         |                             ~~~~~~~~~~~~~~~~~~~~~~~^ ~~~~~~~~~
   1 error generated.


vim +/void +1014 drivers/nvme/host/core.c

   971	
   972	static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
   973			struct request *req, struct nvme_command *cmnd,
   974			enum nvme_opcode op)
   975	{
   976		u16 control = 0;
   977		u32 dsmgmt = 0;
   978	
   979		if (req->cmd_flags & REQ_FUA)
   980			control |= NVME_RW_FUA;
   981		if (req->cmd_flags & (REQ_FAILFAST_DEV | REQ_RAHEAD))
   982			control |= NVME_RW_LR;
   983	
   984		if (req->cmd_flags & REQ_RAHEAD)
   985			dsmgmt |= NVME_RW_DSM_FREQ_PREFETCH;
   986	
   987		cmnd->rw.opcode = op;
   988		cmnd->rw.flags = 0;
   989		cmnd->rw.nsid = cpu_to_le32(ns->head->ns_id);
   990		cmnd->rw.cdw2 = 0;
   991		cmnd->rw.cdw3 = 0;
   992		cmnd->rw.metadata = 0;
   993		cmnd->rw.slba =
   994			cpu_to_le64(nvme_sect_to_lba(ns->head, blk_rq_pos(req)));
   995		cmnd->rw.length =
   996			cpu_to_le16((blk_rq_bytes(req) >> ns->head->lba_shift) - 1);
   997		cmnd->rw.reftag = 0;
   998		cmnd->rw.apptag = 0;
   999		cmnd->rw.appmask = 0;
  1000	
  1001		if (ns->head->ms) {
  1002			/*
  1003			 * If formated with metadata, the block layer always provides a
  1004			 * metadata buffer if CONFIG_BLK_DEV_INTEGRITY is enabled.  Else
  1005			 * we enable the PRACT bit for protection information or set the
  1006			 * namespace capacity to zero to prevent any I/O.
  1007			 */
  1008			if (!blk_integrity_rq(req)) {
  1009				if (WARN_ON_ONCE(!nvme_ns_has_pi(ns->head)))
  1010					return BLK_STS_NOTSUPP;
  1011				control |= NVME_RW_PRINFO_PRACT;
  1012			} else {
  1013				/* process user-created integrity */
> 1014				if (bio_integrity(req->bio)->bip_flags &
  1015						BIP_INTEGRITY_USER) {
  1016					nvme_setup_user_integrity(ns, req, cmnd,
  1017								  &control);
  1018					goto out;
  1019				}
  1020			}
  1021	
  1022			switch (ns->head->pi_type) {
  1023			case NVME_NS_DPS_PI_TYPE3:
  1024				control |= NVME_RW_PRINFO_PRCHK_GUARD;
  1025				break;
  1026			case NVME_NS_DPS_PI_TYPE1:
  1027			case NVME_NS_DPS_PI_TYPE2:
  1028				control |= NVME_RW_PRINFO_PRCHK_GUARD |
  1029						NVME_RW_PRINFO_PRCHK_REF;
  1030				if (op == nvme_cmd_zone_append)
  1031					control |= NVME_RW_APPEND_PIREMAP;
  1032				nvme_set_ref_tag(ns, cmnd, req);
  1033				break;
  1034			}
  1035		}
  1036	out:
  1037		cmnd->rw.control = cpu_to_le16(control);
  1038		cmnd->rw.dsmgmt = cpu_to_le32(dsmgmt);
  1039		return 0;
  1040	}
  1041	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 00/10] Read/Write with meta/integrity
  2024-04-25 18:39 ` [PATCH 00/10] Read/Write with meta/integrity Kanchan Joshi
                     ` (9 preceding siblings ...)
       [not found]   ` <CGME20240425184710epcas5p2968bbc40ed10d1f0184bb511af054fcb@epcas5p2.samsung.com>
@ 2024-04-26 14:19   ` Jens Axboe
  10 siblings, 0 replies; 41+ messages in thread
From: Jens Axboe @ 2024-04-26 14:19 UTC (permalink / raw)
  To: Kanchan Joshi, martin.petersen, kbusch, hch, brauner
  Cc: asml.silence, dw, io-uring, linux-nvme, linux-block, gost.dev

A lot of the initial patches appear to be fixes on the block side,
maybe it'd make more sense to submit those separately?

Comments in other patches coming.

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 08/10] io_uring/rw: add support to send meta along with read/write
  2024-04-25 18:39     ` [PATCH 08/10] io_uring/rw: add support to send meta along with read/write Kanchan Joshi
@ 2024-04-26 14:25       ` Jens Axboe
  2024-04-29 20:11         ` Kanchan Joshi
  0 siblings, 1 reply; 41+ messages in thread
From: Jens Axboe @ 2024-04-26 14:25 UTC (permalink / raw)
  To: Kanchan Joshi, martin.petersen, kbusch, hch, brauner
  Cc: asml.silence, dw, io-uring, linux-nvme, linux-block, gost.dev,
	Anuj Gupta, Nitesh Shetty

> diff --git a/io_uring/rw.c b/io_uring/rw.c
> index 3134a6ece1be..b2c9ac91d5e5 100644
> --- a/io_uring/rw.c
> +++ b/io_uring/rw.c
> @@ -587,6 +623,8 @@ static int kiocb_done(struct io_kiocb *req, ssize_t ret,
>  
>  		req->flags &= ~REQ_F_REISSUE;
>  		iov_iter_restore(&io->iter, &io->iter_state);
> +		if (unlikely(rw->kiocb.ki_flags & IOCB_USE_META))
> +			iov_iter_restore(&io->meta.iter, &io->iter_meta_state);
>  		return -EAGAIN;
>  	}
>  	return IOU_ISSUE_SKIP_COMPLETE;
This puzzles me a bit, why is the restore now dependent on
IOCB_USE_META?

> @@ -768,7 +806,7 @@ static int io_rw_init_file(struct io_kiocb *req, fmode_t mode)
>  	if (!(req->flags & REQ_F_FIXED_FILE))
>  		req->flags |= io_file_get_flags(file);
>  
> -	kiocb->ki_flags = file->f_iocb_flags;
> +	kiocb->ki_flags |= file->f_iocb_flags;
>  	ret = kiocb_set_rw_flags(kiocb, rw->flags);
>  	if (unlikely(ret))
>  		return ret;
> @@ -787,7 +825,8 @@ static int io_rw_init_file(struct io_kiocb *req, fmode_t mode)
>  		if (!(kiocb->ki_flags & IOCB_DIRECT) || !file->f_op->iopoll)
>  			return -EOPNOTSUPP;
>  
> -		kiocb->private = NULL;
> +		if (likely(!(kiocb->ki_flags & IOCB_USE_META)))
> +			kiocb->private = NULL;
>  		kiocb->ki_flags |= IOCB_HIPRI;
>  		kiocb->ki_complete = io_complete_rw_iopoll;
>  		req->iopoll_completed = 0;

Why don't we just set ->private generically earlier, eg like we do for
the ki_flags, rather than have it be a branch in here?

> @@ -853,7 +892,8 @@ static int __io_read(struct io_kiocb *req, unsigned int issue_flags)
>  	} else if (ret == -EIOCBQUEUED) {
>  		return IOU_ISSUE_SKIP_COMPLETE;
>  	} else if (ret == req->cqe.res || ret <= 0 || !force_nonblock ||
> -		   (req->flags & REQ_F_NOWAIT) || !need_complete_io(req)) {
> +		   (req->flags & REQ_F_NOWAIT) || !need_complete_io(req) ||
> +		   (kiocb->ki_flags & IOCB_USE_META)) {
>  		/* read all, failed, already did sync or don't want to retry */
>  		goto done;
>  	}

Would it be cleaner to stuff that IOCB_USE_META check in
need_complete_io(), as that would seem to better describe why that check
is there in the first place? With a comment.

> @@ -864,6 +904,12 @@ static int __io_read(struct io_kiocb *req, unsigned int issue_flags)
>  	 * manually if we need to.
>  	 */
>  	iov_iter_restore(&io->iter, &io->iter_state);
> +	if (unlikely(kiocb->ki_flags & IOCB_USE_META)) {
> +		/* don't handle partial completion for read + meta */
> +		if (ret > 0)
> +			goto done;
> +		iov_iter_restore(&io->meta.iter, &io->iter_meta_state);
> +	}

Also seems a bit odd why we need this check here, surely if this is
needed other "don't do retry IOs" conditions would be the same?

> @@ -1053,7 +1099,8 @@ int io_write(struct io_kiocb *req, unsigned int issue_flags)
>  		if (ret2 == -EAGAIN && (req->ctx->flags & IORING_SETUP_IOPOLL))
>  			goto ret_eagain;
>  
> -		if (ret2 != req->cqe.res && ret2 >= 0 && need_complete_io(req)) {
> +		if (ret2 != req->cqe.res && ret2 >= 0 && need_complete_io(req)
> +				&& !(kiocb->ki_flags & IOCB_USE_META)) {
>  			trace_io_uring_short_write(req->ctx, kiocb->ki_pos - ret2,
>  						req->cqe.res, ret2);

Same here. Would be nice to integrate this a bit nicer rather than have
a bunch of "oh we also need this extra check here" conditions.

> @@ -1074,12 +1121,33 @@ int io_write(struct io_kiocb *req, unsigned int issue_flags)
>  	} else {
>  ret_eagain:
>  		iov_iter_restore(&io->iter, &io->iter_state);
> +		if (unlikely(kiocb->ki_flags & IOCB_USE_META))
> +			iov_iter_restore(&io->meta.iter, &io->iter_meta_state);
>  		if (kiocb->ki_flags & IOCB_WRITE)
>  			io_req_end_write(req);
>  		return -EAGAIN;
>  	}
>  }

Same question here on the (now) conditional restore.

> +int io_rw_meta(struct io_kiocb *req, unsigned int issue_flags)
> +{
> +	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
> +	struct io_async_rw *io = req->async_data;
> +	struct kiocb *kiocb = &rw->kiocb;
> +	int ret;
> +
> +	if (!(req->file->f_flags & O_DIRECT))
> +		return -EOPNOTSUPP;

Why isn't this just caught at init time when IOCB_DIRECT is checked?

> +	kiocb->private = &io->meta;
> +	if (req->opcode == IORING_OP_READ_META)
> +		ret = io_read(req, issue_flags);
> +	else
> +		ret = io_write(req, issue_flags);
> +
> +	return ret;
> +}

kiocb->private is a bit of an odd beast, and ownership isn't clear at
all. It would make the most sense if the owner of the kiocb (eg io_uring
in this case) owned it, but take a look at eg ocfs2 and see what they do
with it... I think this would blow up as a result.

Outside of that, and with the O_DIRECT check fixed, this should
just be two separate functions, one for read and one for write.

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 09/10] block: add support to send meta buffer
  2024-04-25 18:39     ` [PATCH 09/10] block: add support to send meta buffer Kanchan Joshi
@ 2024-04-26 15:21       ` Keith Busch
  2024-04-29 11:47         ` Kanchan Joshi
  0 siblings, 1 reply; 41+ messages in thread
From: Keith Busch @ 2024-04-26 15:21 UTC (permalink / raw)
  To: Kanchan Joshi
  Cc: axboe, martin.petersen, hch, brauner, asml.silence, dw, io-uring,
	linux-nvme, linux-block, gost.dev, Anuj Gupta

On Fri, Apr 26, 2024 at 12:09:42AM +0530, Kanchan Joshi wrote:
> diff --git a/block/fops.c b/block/fops.c
> index 679d9b752fe8..e488fa66dd60 100644
> --- a/block/fops.c
> +++ b/block/fops.c
> @@ -353,6 +353,15 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
>  		task_io_account_write(bio->bi_iter.bi_size);
>  	}
>  
> +	if (unlikely(iocb->ki_flags & IOCB_USE_META)) {
> +		ret = bio_integrity_map_iter(bio, iocb->private);
> +		WRITE_ONCE(iocb->private, NULL);
> +		if (unlikely(ret)) {
> +			bio_put(bio);
> +			return ret;
> +		}
> +	}

Should this also be done for __blkdev_direct_IO and
__blkdev_direct_IO_simple?

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 01/10] block: set bip_vcnt correctly
  2024-04-25 18:39     ` [PATCH 01/10] block: set bip_vcnt correctly Kanchan Joshi
@ 2024-04-27  7:02       ` Christoph Hellwig
  2024-04-27 14:16         ` Keith Busch
  0 siblings, 1 reply; 41+ messages in thread
From: Christoph Hellwig @ 2024-04-27  7:02 UTC (permalink / raw)
  To: Kanchan Joshi
  Cc: axboe, martin.petersen, kbusch, hch, brauner, asml.silence, dw,
	io-uring, linux-nvme, linux-block, gost.dev, Anuj Gupta

On Fri, Apr 26, 2024 at 12:09:34AM +0530, Kanchan Joshi wrote:
> From: Anuj Gupta <[email protected]>
> 
> Set the bip_vcnt correctly in bio_integrity_init_user and
> bio_integrity_copy_user. If the bio gets split at a later point,
> this value is required to set the right bip_vcnt in the cloned bio.
> 
> Signed-off-by: Anuj Gupta <[email protected]>
> Signed-off-by: Kanchan Joshi <[email protected]>

Looks good:

Reviewed-by: Christoph Hellwig <[email protected]>

Please add a Fixes tag and submit it separately from the features.

I'm actually kinda surprised the direct user mapping of integrity data
survived so far without this.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 02/10] block: copy bip_max_vcnt vecs instead of bip_vcnt during clone
  2024-04-25 18:39     ` [PATCH 02/10] block: copy bip_max_vcnt vecs instead of bip_vcnt during clone Kanchan Joshi
@ 2024-04-27  7:03       ` Christoph Hellwig
  2024-04-29 11:28         ` Kanchan Joshi
  0 siblings, 1 reply; 41+ messages in thread
From: Christoph Hellwig @ 2024-04-27  7:03 UTC (permalink / raw)
  To: Kanchan Joshi
  Cc: axboe, martin.petersen, kbusch, hch, brauner, asml.silence, dw,
	io-uring, linux-nvme, linux-block, gost.dev, Anuj Gupta

On Fri, Apr 26, 2024 at 12:09:35AM +0530, Kanchan Joshi wrote:
> From: Anuj Gupta <[email protected]>
> 
> If bio_integrity_copy_user is used to process the meta buffer, bip_max_vcnt
> is one greater than bip_vcnt. In this case bip_max_vcnt vecs needs to be
> copied to cloned bip.

Can you explain this a bit more?  The clone should only allocate what
is actually used, so this leaves be a bit confused.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 03/10] block: copy result back to user meta buffer correctly in case of split
  2024-04-25 18:39     ` [PATCH 03/10] block: copy result back to user meta buffer correctly in case of split Kanchan Joshi
@ 2024-04-27  7:04       ` Christoph Hellwig
  0 siblings, 0 replies; 41+ messages in thread
From: Christoph Hellwig @ 2024-04-27  7:04 UTC (permalink / raw)
  To: Kanchan Joshi
  Cc: axboe, martin.petersen, kbusch, hch, brauner, asml.silence, dw,
	io-uring, linux-nvme, linux-block, gost.dev, Anuj Gupta

On Fri, Apr 26, 2024 at 12:09:36AM +0530, Kanchan Joshi wrote:
> From: Anuj Gupta <[email protected]>
> 
> In case of split, the len and offset of bvec gets updated in the iter.
> Use it to fetch the right bvec, and copy it back to user meta buffer.
> 
> Signed-off-by: Anuj Gupta <[email protected]>
> Signed-off-by: Kanchan Joshi <[email protected]>

Looks good, but needs a fixes tag.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 04/10] block: avoid unpinning/freeing the bio_vec incase of cloned bio
  2024-04-25 18:39     ` [PATCH 04/10] block: avoid unpinning/freeing the bio_vec incase of cloned bio Kanchan Joshi
@ 2024-04-27  7:05       ` Christoph Hellwig
  2024-04-29 11:40         ` Kanchan Joshi
  0 siblings, 1 reply; 41+ messages in thread
From: Christoph Hellwig @ 2024-04-27  7:05 UTC (permalink / raw)
  To: Kanchan Joshi
  Cc: axboe, martin.petersen, kbusch, hch, brauner, asml.silence, dw,
	io-uring, linux-nvme, linux-block, gost.dev, Anuj Gupta

On Fri, Apr 26, 2024 at 12:09:37AM +0530, Kanchan Joshi wrote:
> From: Anuj Gupta <[email protected]>
> 
> Do it only once when the parent bio completes.
> 
> Signed-off-by: Anuj Gupta <[email protected]>
> Signed-off-by: Kanchan Joshi <[email protected]>
> ---
>  block/bio-integrity.c | 10 +++++++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/block/bio-integrity.c b/block/bio-integrity.c
> index b4042414a08f..b698eb77515d 100644
> --- a/block/bio-integrity.c
> +++ b/block/bio-integrity.c
> @@ -119,7 +119,8 @@ static void bio_integrity_uncopy_user(struct bio_integrity_payload *bip)
>  	ret = copy_to_iter(bvec_virt(&src_bvec), bytes, &iter);
>  	WARN_ON_ONCE(ret != bytes);
>  
> -	bio_integrity_unpin_bvec(copy, nr_vecs, true);
> +	if (!bio_flagged((bip->bip_bio), BIO_CLONED))
> +		bio_integrity_unpin_bvec(copy, nr_vecs, true);
>  }

This feels wrong.  I suspect the problem is that BIP_COPY_USER is
inherited for clone bios while it shouldn't be.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 05/10] block, nvme: modify rq_integrity_vec function
  2024-04-25 18:39     ` [PATCH 05/10] block, nvme: modify rq_integrity_vec function Kanchan Joshi
@ 2024-04-27  7:18       ` Christoph Hellwig
  2024-04-29 11:34         ` Kanchan Joshi
  0 siblings, 1 reply; 41+ messages in thread
From: Christoph Hellwig @ 2024-04-27  7:18 UTC (permalink / raw)
  To: Kanchan Joshi
  Cc: axboe, martin.petersen, kbusch, hch, brauner, asml.silence, dw,
	io-uring, linux-nvme, linux-block, gost.dev, Anuj Gupta

The subject is about as useless as it gets :)

The essence is that it should take the iter into account, so name that.

> --- a/include/linux/blk-integrity.h
> +++ b/include/linux/blk-integrity.h
> @@ -109,11 +109,12 @@ static inline bool blk_integrity_rq(struct request *rq)
>   * Return the first bvec that contains integrity data.  Only drivers that are
>   * limited to a single integrity segment should use this helper.
>   */

The comment really needs an update.  With this rq_integrity_vec now
is a "normal" iter based operation, that can actually be used by
drivers with multiple integrity segments if it is properly advanced.

> +static inline struct bio_vec rq_integrity_vec(struct request *rq)
>  {
> -	return NULL;
> +	return (struct bio_vec){0};

No need for the 0 here.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 06/10] block: modify bio_integrity_map_user argument
  2024-04-25 18:39     ` [PATCH 06/10] block: modify bio_integrity_map_user argument Kanchan Joshi
@ 2024-04-27  7:19       ` Christoph Hellwig
  0 siblings, 0 replies; 41+ messages in thread
From: Christoph Hellwig @ 2024-04-27  7:19 UTC (permalink / raw)
  To: Kanchan Joshi
  Cc: axboe, martin.petersen, kbusch, hch, brauner, asml.silence, dw,
	io-uring, linux-nvme, linux-block, gost.dev, Anuj Gupta

On Fri, Apr 26, 2024 at 12:09:39AM +0530, Kanchan Joshi wrote:
> From: Anuj Gupta <[email protected]>
> 
> This patch refactors bio_integrity_map_user to accept iov_iter as
> argument. This is a prep patch.

Please put the fact that you pass the iov_iter into the subject, and
use the main commit message to explain why.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 01/10] block: set bip_vcnt correctly
  2024-04-27  7:02       ` Christoph Hellwig
@ 2024-04-27 14:16         ` Keith Busch
  2024-04-29 10:59           ` Kanchan Joshi
  2024-05-01  7:45           ` Christoph Hellwig
  0 siblings, 2 replies; 41+ messages in thread
From: Keith Busch @ 2024-04-27 14:16 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Kanchan Joshi, axboe, martin.petersen, brauner, asml.silence, dw,
	io-uring, linux-nvme, linux-block, gost.dev, Anuj Gupta

On Sat, Apr 27, 2024 at 09:02:14AM +0200, Christoph Hellwig wrote:
> On Fri, Apr 26, 2024 at 12:09:34AM +0530, Kanchan Joshi wrote:
> > From: Anuj Gupta <[email protected]>
> > 
> > Set the bip_vcnt correctly in bio_integrity_init_user and
> > bio_integrity_copy_user. If the bio gets split at a later point,
> > this value is required to set the right bip_vcnt in the cloned bio.
> > 
> > Signed-off-by: Anuj Gupta <[email protected]>
> > Signed-off-by: Kanchan Joshi <[email protected]>
> 
> Looks good:
> 
> Reviewed-by: Christoph Hellwig <[email protected]>
> 
> Please add a Fixes tag and submit it separately from the features.
> 
> I'm actually kinda surprised the direct user mapping of integrity data
> survived so far without this.

The only existing use case for user metadata is REQ_OP_DRV_IN/OUT, which
never splits, so these initial fixes only really matter after this
series adds new usage for generic READ/WRITE.


* Re: [PATCH 01/10] block: set bip_vcnt correctly
  2024-04-27 14:16         ` Keith Busch
@ 2024-04-29 10:59           ` Kanchan Joshi
  2024-05-01  7:45           ` Christoph Hellwig
  1 sibling, 0 replies; 41+ messages in thread
From: Kanchan Joshi @ 2024-04-29 10:59 UTC (permalink / raw)
  To: Keith Busch, Christoph Hellwig
  Cc: axboe, martin.petersen, brauner, asml.silence, dw, io-uring,
	linux-nvme, linux-block, gost.dev, Anuj Gupta

On 4/27/2024 7:46 PM, Keith Busch wrote:
>> Looks good:
>>
>> Reviewed-by: Christoph Hellwig<[email protected]>
>>
>> Please add a Fixes tag and submit it separately from the features.
>>
>> I'm actually kinda surprised the direct user mapping of integrity data
>> survived so far without this.
> The only existing use case for user metadata is REQ_OP_DRV_IN/OUT, which
> never splits, so these initial fixes only really matter after this
> series adds new usage for generic READ/WRITE.
> 

Yes. These missing pieces (the first 6 patches) did not seem to cause 
any harm on their own.
That is why the "Fixes" tag is not there, and the patches were not sent 
separately from the features.



* Re: [PATCH 02/10] block: copy bip_max_vcnt vecs instead of bip_vcnt during clone
  2024-04-27  7:03       ` Christoph Hellwig
@ 2024-04-29 11:28         ` Kanchan Joshi
  2024-04-29 12:04           ` Keith Busch
  2024-05-01  7:50           ` Christoph Hellwig
  0 siblings, 2 replies; 41+ messages in thread
From: Kanchan Joshi @ 2024-04-29 11:28 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: axboe, martin.petersen, kbusch, brauner, asml.silence, dw,
	io-uring, linux-nvme, linux-block, gost.dev, Anuj Gupta

On 4/27/2024 12:33 PM, Christoph Hellwig wrote:
>> If bio_integrity_copy_user is used to process the meta buffer, bip_max_vcnt
>> is one greater than bip_vcnt. In this case bip_max_vcnt vecs needs to be
>> copied to cloned bip.
> Can you explain this a bit more?  The clone should only allocate what
> is actually used, so this leaves me a bit confused.
> 

Will expand the commit description.

Usually the meta buffer is pinned and used directly (say N bio vecs).
In case kernel has to make a copy (in bio_integrity_copy_user), it 
factors these N vecs in, and one extra for the bounce buffer.
So for read IO, bip_max_vcnt is N+1, while bip_vcnt is N.

The clone bio also needs to be aware of all N+1 vecs, so that we can 
copy the data from the bounce buffer to pinned user pages correctly 
during read-completion.


* Re: [PATCH 05/10] block, nvme: modify rq_integrity_vec function
  2024-04-27  7:18       ` Christoph Hellwig
@ 2024-04-29 11:34         ` Kanchan Joshi
  2024-04-29 17:11           ` Christoph Hellwig
  0 siblings, 1 reply; 41+ messages in thread
From: Kanchan Joshi @ 2024-04-29 11:34 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: axboe, martin.petersen, kbusch, brauner, asml.silence, dw,
	io-uring, linux-nvme, linux-block, gost.dev, Anuj Gupta

On 4/27/2024 12:48 PM, Christoph Hellwig wrote:
> The subject is about as useless as it gets :)
> 
> The essence is that it should take the iter into account, so name that.

Sure.

>> --- a/include/linux/blk-integrity.h
>> +++ b/include/linux/blk-integrity.h
>> @@ -109,11 +109,12 @@ static inline bool blk_integrity_rq(struct request *rq)
>>    * Return the first bvec that contains integrity data.  Only drivers that are
>>    * limited to a single integrity segment should use this helper.
>>    */
> 
> The comment really needs an update.  With this rq_integrity_vec now
> is a "normal" iter based operation, that can actually be used by
> drivers with multiple integrity segments if it is properly advanced.

Right, will update.

>> +static inline struct bio_vec rq_integrity_vec(struct request *rq)
>>   {
>> -	return NULL;
>> +	return (struct bio_vec){0};
> 
> No need for the 0 here.
Um, I did not follow. That is needed to keep the compiler happy.
Are you suggesting changing the prototype?



* Re: [PATCH 04/10] block: avoid unpinning/freeing the bio_vec incase of cloned bio
  2024-04-27  7:05       ` Christoph Hellwig
@ 2024-04-29 11:40         ` Kanchan Joshi
  2024-04-29 17:09           ` Christoph Hellwig
  0 siblings, 1 reply; 41+ messages in thread
From: Kanchan Joshi @ 2024-04-29 11:40 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: axboe, martin.petersen, kbusch, brauner, asml.silence, dw,
	io-uring, linux-nvme, linux-block, gost.dev, Anuj Gupta

On 4/27/2024 12:35 PM, Christoph Hellwig wrote:
> On Fri, Apr 26, 2024 at 12:09:37AM +0530, Kanchan Joshi wrote:
>> From: Anuj Gupta <[email protected]>
>>
>> Do it only once when the parent bio completes.
>>
>> Signed-off-by: Anuj Gupta <[email protected]>
>> Signed-off-by: Kanchan Joshi <[email protected]>
>> ---
>>   block/bio-integrity.c | 10 +++++++---
>>   1 file changed, 7 insertions(+), 3 deletions(-)
>>
>> diff --git a/block/bio-integrity.c b/block/bio-integrity.c
>> index b4042414a08f..b698eb77515d 100644
>> --- a/block/bio-integrity.c
>> +++ b/block/bio-integrity.c
>> @@ -119,7 +119,8 @@ static void bio_integrity_uncopy_user(struct bio_integrity_payload *bip)
>>   	ret = copy_to_iter(bvec_virt(&src_bvec), bytes, &iter);
>>   	WARN_ON_ONCE(ret != bytes);
>>   
>> -	bio_integrity_unpin_bvec(copy, nr_vecs, true);
>> +	if (!bio_flagged((bip->bip_bio), BIO_CLONED))
>> +		bio_integrity_unpin_bvec(copy, nr_vecs, true);
>>   }
> 
> This feels wrong.  I suspect the problem is that BIP_COPY_USER is
> inherited for clone bios while it shouldn't.
> 

But BIP_COPY_USER flag is really required in the clone bio. So that we 
can copy the subset of the metadata back (from kernel bounce buffer to 
user space pinned buffer) in case of read io.

Overall, copy-back will happen in installments (for each cloned bio), 
while the unpin will happen in one shot (for the source bio).


* Re: [PATCH 09/10] block: add support to send meta buffer
  2024-04-26 15:21       ` Keith Busch
@ 2024-04-29 11:47         ` Kanchan Joshi
  0 siblings, 0 replies; 41+ messages in thread
From: Kanchan Joshi @ 2024-04-29 11:47 UTC (permalink / raw)
  To: Keith Busch
  Cc: axboe, martin.petersen, hch, brauner, asml.silence, dw, io-uring,
	linux-nvme, linux-block, gost.dev, Anuj Gupta

On 4/26/2024 8:51 PM, Keith Busch wrote:
> Should this also be done for __blkdev_direct_IO and
> __blkdev_direct_IO_simple?


Right, this needs to be done for __blkdev_direct_IO.

But not for __blkdev_direct_IO_simple, as that path is for sync io and 
we are wiring this interface up only for async io.


* Re: [PATCH 02/10] block: copy bip_max_vcnt vecs instead of bip_vcnt during clone
  2024-04-29 11:28         ` Kanchan Joshi
@ 2024-04-29 12:04           ` Keith Busch
  2024-04-29 17:07             ` Christoph Hellwig
  2024-05-01  7:50           ` Christoph Hellwig
  1 sibling, 1 reply; 41+ messages in thread
From: Keith Busch @ 2024-04-29 12:04 UTC (permalink / raw)
  To: Kanchan Joshi
  Cc: Christoph Hellwig, axboe, martin.petersen, brauner, asml.silence,
	dw, io-uring, linux-nvme, linux-block, gost.dev, Anuj Gupta

On Mon, Apr 29, 2024 at 04:58:37PM +0530, Kanchan Joshi wrote:
> On 4/27/2024 12:33 PM, Christoph Hellwig wrote:
> >> If bio_integrity_copy_user is used to process the meta buffer, bip_max_vcnt
> >> is one greater than bip_vcnt. In this case bip_max_vcnt vecs needs to be
> >> copied to cloned bip.
> > Can you explain this a bit more?  The clone should only allocate what
> > is actually used, so this leaves me a bit confused.
> > 
> 
> Will expand the commit description.
> 
> Usually the meta buffer is pinned and used directly (say N bio vecs).
> In case kernel has to make a copy (in bio_integrity_copy_user), it 
> factors these N vecs in, and one extra for the bounce buffer.
> So for read IO, bip_max_vcnt is N+1, while bip_vcnt is N.
> 
> The clone bio also needs to be aware of all N+1 vecs, so that we can 
> copy the data from the bounce buffer to pinned user pages correctly 
> during read-completion.

An earlier version added a field in the bip to point to the original
bvec from the user address. That extra field wouldn't be used in the far
majority of cases, so moving the user bvec to the end of the existing
bip_vec is a spatial optimization. The code may look a little more
confusing that way, but I think it's better than making the bip bigger.


* Re: [PATCH 02/10] block: copy bip_max_vcnt vecs instead of bip_vcnt during clone
  2024-04-29 12:04           ` Keith Busch
@ 2024-04-29 17:07             ` Christoph Hellwig
  2024-04-30  8:25               ` Keith Busch
  0 siblings, 1 reply; 41+ messages in thread
From: Christoph Hellwig @ 2024-04-29 17:07 UTC (permalink / raw)
  To: Keith Busch
  Cc: Kanchan Joshi, Christoph Hellwig, axboe, martin.petersen, brauner,
	asml.silence, dw, io-uring, linux-nvme, linux-block, gost.dev,
	Anuj Gupta

On Mon, Apr 29, 2024 at 01:04:12PM +0100, Keith Busch wrote:
> An earlier version added a field in the bip to point to the original
> bvec from the user address. That extra field wouldn't be used in the far
> majority of cases, so moving the user bvec to the end of the existing
> bip_vec is a spatial optimization. The code may look a little more
> confusing that way, but I think it's better than making the bip bigger.

I think we need to do something like that - just hiding the bounce
buffer is not really maintainable once we get multiple levels of stacking
and other creative bio cloning.


* Re: [PATCH 04/10] block: avoid unpinning/freeing the bio_vec incase of cloned bio
  2024-04-29 11:40         ` Kanchan Joshi
@ 2024-04-29 17:09           ` Christoph Hellwig
  2024-05-01 13:02             ` Kanchan Joshi
  0 siblings, 1 reply; 41+ messages in thread
From: Christoph Hellwig @ 2024-04-29 17:09 UTC (permalink / raw)
  To: Kanchan Joshi
  Cc: Christoph Hellwig, axboe, martin.petersen, kbusch, brauner,
	asml.silence, dw, io-uring, linux-nvme, linux-block, gost.dev,
	Anuj Gupta

On Mon, Apr 29, 2024 at 05:10:59PM +0530, Kanchan Joshi wrote:
> > This feels wrong.  I suspect the problem is that BIP_COPY_USER is
> > inherited for clone bios while it shouldn't.
> > 
> 
> But BIP_COPY_USER flag is really required in the clone bio. So that we 
> can copy the subset of the metadata back (from kernel bounce buffer to 
> user space pinned buffer) in case of read io.
> 
> Overall, copy-back will happen in installments (for each cloned bio), 
> while the unpin will happen in one shot (for the source bio).

That seems a bit odd compared to the bio data path.  If you think this
is better than the version used in the data path let's convert the
data path to this scheme first to make sure we don't diverge and get
the far better testing on the main data map side.


* Re: [PATCH 05/10] block, nvme: modify rq_integrity_vec function
  2024-04-29 11:34         ` Kanchan Joshi
@ 2024-04-29 17:11           ` Christoph Hellwig
  0 siblings, 0 replies; 41+ messages in thread
From: Christoph Hellwig @ 2024-04-29 17:11 UTC (permalink / raw)
  To: Kanchan Joshi
  Cc: Christoph Hellwig, axboe, martin.petersen, kbusch, brauner,
	asml.silence, dw, io-uring, linux-nvme, linux-block, gost.dev,
	Anuj Gupta

On Mon, Apr 29, 2024 at 05:04:30PM +0530, Kanchan Joshi wrote:
> >> +	return (struct bio_vec){0};
> > 
> > No need for the 0 here.
>   Um, I did not follow. That is needed to keep the compiler happy.
> Are you suggesting changing the prototype?

Just removing the NULL as in:

	return (struct bio_vec){ };

compiles just fine here, and at least for non-anonymous struct
initializers we use it all the time.


* Re: [PATCH 08/10] io_uring/rw: add support to send meta along with read/write
  2024-04-26 14:25       ` Jens Axboe
@ 2024-04-29 20:11         ` Kanchan Joshi
  0 siblings, 0 replies; 41+ messages in thread
From: Kanchan Joshi @ 2024-04-29 20:11 UTC (permalink / raw)
  To: Jens Axboe, martin.petersen, kbusch, hch, brauner
  Cc: asml.silence, dw, io-uring, linux-nvme, linux-block, gost.dev,
	Anuj Gupta, Nitesh Shetty

On 4/26/2024 7:55 PM, Jens Axboe wrote:
>> diff --git a/io_uring/rw.c b/io_uring/rw.c
>> index 3134a6ece1be..b2c9ac91d5e5 100644
>> --- a/io_uring/rw.c
>> +++ b/io_uring/rw.c
>> @@ -587,6 +623,8 @@ static int kiocb_done(struct io_kiocb *req, ssize_t ret,
>>   
>>   		req->flags &= ~REQ_F_REISSUE;
>>   		iov_iter_restore(&io->iter, &io->iter_state);
>> +		if (unlikely(rw->kiocb.ki_flags & IOCB_USE_META))
>> +			iov_iter_restore(&io->meta.iter, &io->iter_meta_state);
>>   		return -EAGAIN;
>>   	}
>>   	return IOU_ISSUE_SKIP_COMPLETE;
> This puzzles me a bit, why is the restore now dependent on
> IOCB_USE_META?

Both the save and the restore for meta are under this condition (so it 
seemed natural).
Also, to avoid growing "struct io_async_rw" too much, this patch keeps 
meta/iter_meta_state in the same memory as wpq. So doing this 
unconditionally can corrupt wpq for buffered io.

>> @@ -768,7 +806,7 @@ static int io_rw_init_file(struct io_kiocb *req, fmode_t mode)
>>   	if (!(req->flags & REQ_F_FIXED_FILE))
>>   		req->flags |= io_file_get_flags(file);
>>   
>> -	kiocb->ki_flags = file->f_iocb_flags;
>> +	kiocb->ki_flags |= file->f_iocb_flags;
>>   	ret = kiocb_set_rw_flags(kiocb, rw->flags);
>>   	if (unlikely(ret))
>>   		return ret;
>> @@ -787,7 +825,8 @@ static int io_rw_init_file(struct io_kiocb *req, fmode_t mode)
>>   		if (!(kiocb->ki_flags & IOCB_DIRECT) || !file->f_op->iopoll)
>>   			return -EOPNOTSUPP;
>>   
>> -		kiocb->private = NULL;
>> +		if (likely(!(kiocb->ki_flags & IOCB_USE_META)))
>> +			kiocb->private = NULL;
>>   		kiocb->ki_flags |= IOCB_HIPRI;
>>   		kiocb->ki_complete = io_complete_rw_iopoll;
>>   		req->iopoll_completed = 0;
> 
> Why don't we just set ->private generically earlier, eg like we do for
> the ki_flags, rather than have it be a branch in here?

I am not sure I follow what you have in mind.
But kiocb->private was already set before we reached this point (in 
io_rw_meta), so we don't overwrite it here.

>> @@ -853,7 +892,8 @@ static int __io_read(struct io_kiocb *req, unsigned int issue_flags)
>>   	} else if (ret == -EIOCBQUEUED) {
>>   		return IOU_ISSUE_SKIP_COMPLETE;
>>   	} else if (ret == req->cqe.res || ret <= 0 || !force_nonblock ||
>> -		   (req->flags & REQ_F_NOWAIT) || !need_complete_io(req)) {
>> +		   (req->flags & REQ_F_NOWAIT) || !need_complete_io(req) ||
>> +		   (kiocb->ki_flags & IOCB_USE_META)) {
>>   		/* read all, failed, already did sync or don't want to retry */
>>   		goto done;
>>   	}
> 
> Would it be cleaner to stuff that IOCB_USE_META check in
> need_complete_io(), as that would closer seem to describe why that check
> is there in the first place? With a comment.

Yes, will do.

>> @@ -864,6 +904,12 @@ static int __io_read(struct io_kiocb *req, unsigned int issue_flags)
>>   	 * manually if we need to.
>>   	 */
>>   	iov_iter_restore(&io->iter, &io->iter_state);
>> +	if (unlikely(kiocb->ki_flags & IOCB_USE_META)) {
>> +		/* don't handle partial completion for read + meta */
>> +		if (ret > 0)
>> +			goto done;
>> +		iov_iter_restore(&io->meta.iter, &io->iter_meta_state);
>> +	}
> 
> Also seems a bit odd why we need this check here, surely if this is
> needed other "don't do retry IOs" conditions would be the same?

Yes, will revisit.
>> @@ -1053,7 +1099,8 @@ int io_write(struct io_kiocb *req, unsigned int issue_flags)
>>   		if (ret2 == -EAGAIN && (req->ctx->flags & IORING_SETUP_IOPOLL))
>>   			goto ret_eagain;
>>   
>> -		if (ret2 != req->cqe.res && ret2 >= 0 && need_complete_io(req)) {
>> +		if (ret2 != req->cqe.res && ret2 >= 0 && need_complete_io(req)
>> +				&& !(kiocb->ki_flags & IOCB_USE_META)) {
>>   			trace_io_uring_short_write(req->ctx, kiocb->ki_pos - ret2,
>>   						req->cqe.res, ret2);
> 
> Same here. Would be nice to integrate this a bit nicer rather than have
> a bunch of "oh we also need this extra check here" conditions.

Will look into this too.
>> @@ -1074,12 +1121,33 @@ int io_write(struct io_kiocb *req, unsigned int issue_flags)
>>   	} else {
>>   ret_eagain:
>>   		iov_iter_restore(&io->iter, &io->iter_state);
>> +		if (unlikely(kiocb->ki_flags & IOCB_USE_META))
>> +			iov_iter_restore(&io->meta.iter, &io->iter_meta_state);
>>   		if (kiocb->ki_flags & IOCB_WRITE)
>>   			io_req_end_write(req);
>>   		return -EAGAIN;
>>   	}
>>   }
> 
> Same question here on the (now) conditional restore.

I did not get the concern. Would you prefer the restore to be unconditional?

>> +int io_rw_meta(struct io_kiocb *req, unsigned int issue_flags)
>> +{
>> +	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
>> +	struct io_async_rw *io = req->async_data;
>> +	struct kiocb *kiocb = &rw->kiocb;
>> +	int ret;
>> +
>> +	if (!(req->file->f_flags & O_DIRECT))
>> +		return -EOPNOTSUPP;
> 
> Why isn't this just caught at init time when IOCB_DIRECT is checked?

io_rw_init_file() gets invoked after this, and its IOCB_DIRECT check 
applies only to the IOPOLL case. We want to check/fail this regardless 
of IOPOLL.

> 
>> +	kiocb->private = &io->meta;
>> +	if (req->opcode == IORING_OP_READ_META)
>> +		ret = io_read(req, issue_flags);
>> +	else
>> +		ret = io_write(req, issue_flags);
>> +
>> +	return ret;
>> +}
> 
> kiocb->private is a bit of an odd beast, and ownership isn't clear at
> all. It would make the most sense if the owner of the kiocb (eg io_uring
> in this case) owned it, but take a look at eg ocfs2 and see what they do
> with it... I think this would blow up as a result.

Yes, ocfs2 is making use of kiocb->private. But that seems fine. In 
io_uring we use the field only to send the information down. ocfs2 (or 
anything else unaware of this interface) may just overwrite 
kiocb->private.
If a lower layer wants to support meta exchange, it is supposed to 
extract the meta descriptor from kiocb->private before altering it.

This is the same for the block direct path too when we are doing polled io.


* Re: [PATCH 02/10] block: copy bip_max_vcnt vecs instead of bip_vcnt during clone
  2024-04-29 17:07             ` Christoph Hellwig
@ 2024-04-30  8:25               ` Keith Busch
  2024-05-01  7:46                 ` Christoph Hellwig
  0 siblings, 1 reply; 41+ messages in thread
From: Keith Busch @ 2024-04-30  8:25 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Kanchan Joshi, axboe, martin.petersen, brauner, asml.silence, dw,
	io-uring, linux-nvme, linux-block, gost.dev, Anuj Gupta

On Mon, Apr 29, 2024 at 07:07:29PM +0200, Christoph Hellwig wrote:
> On Mon, Apr 29, 2024 at 01:04:12PM +0100, Keith Busch wrote:
> > An earlier version added a field in the bip to point to the original
> > bvec from the user address. That extra field wouldn't be used in the far
> > majority of cases, so moving the user bvec to the end of the existing
> > bip_vec is a spatial optimization. The code may look a little more
> > confusing that way, but I think it's better than making the bip bigger.
> 
> I think we need to do something like that - just hiding the bounce
> buffer is not really maintainable once we get multiple levels of stacking
> and other creative bio cloning.

Not sure I follow that. From patches 2-4 here, I think that pretty much
covers it. It's just missing a good code comment, but the implementation
side looks complete for any amount of stacking and splitting.


* Re: [PATCH 01/10] block: set bip_vcnt correctly
  2024-04-27 14:16         ` Keith Busch
  2024-04-29 10:59           ` Kanchan Joshi
@ 2024-05-01  7:45           ` Christoph Hellwig
  2024-05-01  8:03             ` Keith Busch
  1 sibling, 1 reply; 41+ messages in thread
From: Christoph Hellwig @ 2024-05-01  7:45 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Kanchan Joshi, axboe, martin.petersen, brauner,
	asml.silence, dw, io-uring, linux-nvme, linux-block, gost.dev,
	Anuj Gupta

On Sat, Apr 27, 2024 at 08:16:52AM -0600, Keith Busch wrote:
> > Please add a Fixes tag and submit it separately from the features.
> > 
> > I'm actually kinda surprised the direct user mapping of integrity data
> > survived so far without this.
> 
> The only existing use case for user metadata is REQ_OP_DRV_IN/OUT, which
> never splits, so these initial fixes only really matter after this
> series adds new usage for generic READ/WRITE.

Well, it matters to keep our contract up, even if we're not hitting it.

And apparently another user just came out of the woods in dm land..


* Re: [PATCH 02/10] block: copy bip_max_vcnt vecs instead of bip_vcnt during clone
  2024-04-30  8:25               ` Keith Busch
@ 2024-05-01  7:46                 ` Christoph Hellwig
  0 siblings, 0 replies; 41+ messages in thread
From: Christoph Hellwig @ 2024-05-01  7:46 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Kanchan Joshi, axboe, martin.petersen, brauner,
	asml.silence, dw, io-uring, linux-nvme, linux-block, gost.dev,
	Anuj Gupta

On Tue, Apr 30, 2024 at 09:25:38AM +0100, Keith Busch wrote:
> > I think we need to do something like that - just hiding the bounce
> > buffer is not really maintainable once we get multiple levels of stacking
> > and other creative bio cloning.
> 
> Not sure I follow that. From patches 2-4 here, I think that pretty much
> covers it. It's just missing a good code comment, but the implementation
> side looks complete for any amount of stacking and splitting.

I can't see how it would work, but I'll gladly wait for a better
description.


* Re: [PATCH 02/10] block: copy bip_max_vcnt vecs instead of bip_vcnt during clone
  2024-04-29 11:28         ` Kanchan Joshi
  2024-04-29 12:04           ` Keith Busch
@ 2024-05-01  7:50           ` Christoph Hellwig
  1 sibling, 0 replies; 41+ messages in thread
From: Christoph Hellwig @ 2024-05-01  7:50 UTC (permalink / raw)
  To: Kanchan Joshi
  Cc: Christoph Hellwig, axboe, martin.petersen, kbusch, brauner,
	asml.silence, dw, io-uring, linux-nvme, linux-block, gost.dev,
	Anuj Gupta

On Mon, Apr 29, 2024 at 04:58:37PM +0530, Kanchan Joshi wrote:
> On 4/27/2024 12:33 PM, Christoph Hellwig wrote:
> >> If bio_integrity_copy_user is used to process the meta buffer, bip_max_vcnt
> >> is one greater than bip_vcnt. In this case bip_max_vcnt vecs needs to be
> >> copied to cloned bip.
> > Can you explain this a bit more?  The clone should only allocate what
> > is actually used, so this leaves me a bit confused.
> > 
> 
> Will expand the commit description.
> 
> Usually the meta buffer is pinned and used directly (say N bio vecs).
> In case kernel has to make a copy (in bio_integrity_copy_user), it 
> factors these N vecs in, and one extra for the bounce buffer.
> So for read IO, bip_max_vcnt is N+1, while bip_vcnt is N.
> 
> The clone bio also needs to be aware of all N+1 vecs, so that we can 
> copy the data from the bounce buffer to pinned user pages correctly 
> during read-completion.

No.  The underlying layer below the clone/split/etc should never have
to care about your bounce buffer.  The bvecs are just data containers;
whether they are mapped, copied or used in any other way should remain
entirely encapsulated in the caller.


* Re: [PATCH 01/10] block: set bip_vcnt correctly
  2024-05-01  7:45           ` Christoph Hellwig
@ 2024-05-01  8:03             ` Keith Busch
  0 siblings, 0 replies; 41+ messages in thread
From: Keith Busch @ 2024-05-01  8:03 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Kanchan Joshi, axboe, martin.petersen, brauner, asml.silence, dw,
	io-uring, linux-nvme, linux-block, gost.dev, Anuj Gupta

On Wed, May 01, 2024 at 09:45:44AM +0200, Christoph Hellwig wrote:
> On Sat, Apr 27, 2024 at 08:16:52AM -0600, Keith Busch wrote:
> > > Please add a Fixes tag and submit it separately from the features.
> > > 
> > > I'm actually kinda surprised the direct user mapping of integrity data
> > > survived so far without this.
> > 
> > The only existing use case for user metadata is REQ_OP_DRV_IN/OUT, which
> > never splits, so these initial fixes only really matter after this
> > series adds new usage for generic READ/WRITE.
> 
> Well, it matters to keep our contract up, even if we're not hitting it.
> 
> And apparently another user just came out of the woods in dm land..

But the bug report from dm has nothing to do with user mapped metadata.
That bug existed before that was added, so yeah, patch 5 from this
series (or something like it) should be applied on its own.


* Re: [PATCH 04/10] block: avoid unpinning/freeing the bio_vec incase of cloned bio
  2024-04-29 17:09           ` Christoph Hellwig
@ 2024-05-01 13:02             ` Kanchan Joshi
  2024-05-02  7:12               ` Christoph Hellwig
  0 siblings, 1 reply; 41+ messages in thread
From: Kanchan Joshi @ 2024-05-01 13:02 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: axboe, martin.petersen, kbusch, brauner, asml.silence, dw,
	io-uring, linux-nvme, linux-block, gost.dev, Anuj Gupta

On 4/29/2024 10:39 PM, Christoph Hellwig wrote:
> On Mon, Apr 29, 2024 at 05:10:59PM +0530, Kanchan Joshi wrote:
>>> This feels wrong.  I suspect the problem is that BIP_COPY_USER is
>>> inherited for clone bios while it shouldn't.
>>>
>>
>> But BIP_COPY_USER flag is really required in the clone bio. So that we
>> can copy the subset of the metadata back (from kernel bounce buffer to
>> user space pinned buffer) in case of read io.
>>
>> Overall, copy-back will happen in installments (for each cloned bio),
>> while the unpin will happen in one shot (for the source bio).
> 
> That seems a bit odd compared to the bio data path.  If you think this
> is better than the version used in the data path let's convert the
> data path to this scheme first to make sure we don't diverge and get
> the far better testing on the main data map side.
> 

Can you please tell which function(s) in the bio data path need this 
conversion?
To me the data path handling seems similar: each cloned bio leads to 
some amount of data transfer to pinned user memory, and the same is 
happening for the meta transfer here.


* Re: [PATCH 04/10] block: avoid unpinning/freeing the bio_vec incase of cloned bio
  2024-05-01 13:02             ` Kanchan Joshi
@ 2024-05-02  7:12               ` Christoph Hellwig
  2024-05-03 12:01                 ` Kanchan Joshi
  0 siblings, 1 reply; 41+ messages in thread
From: Christoph Hellwig @ 2024-05-02  7:12 UTC (permalink / raw)
  To: Kanchan Joshi
  Cc: Christoph Hellwig, axboe, martin.petersen, kbusch, brauner,
	asml.silence, dw, io-uring, linux-nvme, linux-block, gost.dev,
	Anuj Gupta

On Wed, May 01, 2024 at 06:32:45PM +0530, Kanchan Joshi wrote:
> Can you please tell what function(s) in bio data path that need this 
> conversion?
> To me data path handling seems similar. Each cloned bio will lead to 
> some amount of data transfer to pinned user-memory. The same is 
> happening for meta transfer here.

Well, everywhere.  e.g. for direct I/O everything is just driven from
the fs/direct-io.c and fs/iomap/direct-io.c code without any
knowledge in the underlying driver if data has been pinned (no bounce
buffering in this case).  Or for passthrough I/O none of the underlying
logic knows about the pinning or bounce buffering, everything is handled
in block/blk-map.c.


* Re: [PATCH 04/10] block: avoid unpinning/freeing the bio_vec incase of cloned bio
  2024-05-02  7:12               ` Christoph Hellwig
@ 2024-05-03 12:01                 ` Kanchan Joshi
  0 siblings, 0 replies; 41+ messages in thread
From: Kanchan Joshi @ 2024-05-03 12:01 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: axboe, martin.petersen, kbusch, brauner, asml.silence, dw,
	io-uring, linux-nvme, linux-block, gost.dev, Anuj Gupta

On 5/2/2024 12:42 PM, Christoph Hellwig wrote:
> On Wed, May 01, 2024 at 06:32:45PM +0530, Kanchan Joshi wrote:
>> Can you please tell what function(s) in bio data path that need this
>> conversion?
>> To me data path handling seems similar. Each cloned bio will lead to
>> some amount of data transfer to pinned user-memory. The same is
>> happening for meta transfer here.
> Well, everywhere.

Next version will not inherit BIP_COPY_USER for clone bios.


end of thread, other threads:[~2024-05-03 12:01 UTC | newest]

Thread overview: 41+ messages
     [not found] <CGME20240425184649epcas5p42f6ddbfb1c579f043a919973c70ebd03@epcas5p4.samsung.com>
2024-04-25 18:39 ` [PATCH 00/10] Read/Write with meta/integrity Kanchan Joshi
     [not found]   ` <CGME20240425184651epcas5p3404f2390d6cf05148eb96e1af093e7bc@epcas5p3.samsung.com>
2024-04-25 18:39     ` [PATCH 01/10] block: set bip_vcnt correctly Kanchan Joshi
2024-04-27  7:02       ` Christoph Hellwig
2024-04-27 14:16         ` Keith Busch
2024-04-29 10:59           ` Kanchan Joshi
2024-05-01  7:45           ` Christoph Hellwig
2024-05-01  8:03             ` Keith Busch
     [not found]   ` <CGME20240425184653epcas5p28de1473090e0141ae74f8b0a6eb921a7@epcas5p2.samsung.com>
2024-04-25 18:39     ` [PATCH 02/10] block: copy bip_max_vcnt vecs instead of bip_vcnt during clone Kanchan Joshi
2024-04-27  7:03       ` Christoph Hellwig
2024-04-29 11:28         ` Kanchan Joshi
2024-04-29 12:04           ` Keith Busch
2024-04-29 17:07             ` Christoph Hellwig
2024-04-30  8:25               ` Keith Busch
2024-05-01  7:46                 ` Christoph Hellwig
2024-05-01  7:50           ` Christoph Hellwig
     [not found]   ` <CGME20240425184656epcas5p42228cdef753cf20a266d12de5bc130f0@epcas5p4.samsung.com>
2024-04-25 18:39     ` [PATCH 03/10] block: copy result back to user meta buffer correctly in case of split Kanchan Joshi
2024-04-27  7:04       ` Christoph Hellwig
     [not found]   ` <CGME20240425184658epcas5p2adb6bf01a5c56ffaac3a55ab57afaf8e@epcas5p2.samsung.com>
2024-04-25 18:39     ` [PATCH 04/10] block: avoid unpinning/freeing the bio_vec incase of cloned bio Kanchan Joshi
2024-04-27  7:05       ` Christoph Hellwig
2024-04-29 11:40         ` Kanchan Joshi
2024-04-29 17:09           ` Christoph Hellwig
2024-05-01 13:02             ` Kanchan Joshi
2024-05-02  7:12               ` Christoph Hellwig
2024-05-03 12:01                 ` Kanchan Joshi
     [not found]   ` <CGME20240425184700epcas5p1687590f7e4a3f3c3620ac27af514f0ca@epcas5p1.samsung.com>
2024-04-25 18:39     ` [PATCH 05/10] block, nvme: modify rq_integrity_vec function Kanchan Joshi
2024-04-27  7:18       ` Christoph Hellwig
2024-04-29 11:34         ` Kanchan Joshi
2024-04-29 17:11           ` Christoph Hellwig
     [not found]   ` <CGME20240425184702epcas5p1ccb0df41b07845bc252d69007558e3fa@epcas5p1.samsung.com>
2024-04-25 18:39     ` [PATCH 06/10] block: modify bio_integrity_map_user argument Kanchan Joshi
2024-04-27  7:19       ` Christoph Hellwig
     [not found]   ` <CGME20240425184704epcas5p3b9eb6cce9c9658eb1d0d32937e778a5d@epcas5p3.samsung.com>
2024-04-25 18:39     ` [PATCH 07/10] block: define meta io descriptor Kanchan Joshi
     [not found]   ` <CGME20240425184706epcas5p1d75c19d1d1458c52fc4009f150c7dc7d@epcas5p1.samsung.com>
2024-04-25 18:39     ` [PATCH 08/10] io_uring/rw: add support to send meta along with read/write Kanchan Joshi
2024-04-26 14:25       ` Jens Axboe
2024-04-29 20:11         ` Kanchan Joshi
     [not found]   ` <CGME20240425184708epcas5p4f1d95cd8d285614f712868d205a23115@epcas5p4.samsung.com>
2024-04-25 18:39     ` [PATCH 09/10] block: add support to send meta buffer Kanchan Joshi
2024-04-26 15:21       ` Keith Busch
2024-04-29 11:47         ` Kanchan Joshi
     [not found]   ` <CGME20240425184710epcas5p2968bbc40ed10d1f0184bb511af054fcb@epcas5p2.samsung.com>
2024-04-25 18:39     ` [PATCH 10/10] nvme: add separate handling for user integrity buffer Kanchan Joshi
2024-04-25 19:56       ` Keith Busch
2024-04-26 10:57       ` kernel test robot
2024-04-26 14:19   ` [PATCH 00/10] Read/Write with meta/integrity Jens Axboe
