* [PATCH 00/10] Read/Write with meta/integrity
[not found] <CGME20240425184649epcas5p42f6ddbfb1c579f043a919973c70ebd03@epcas5p4.samsung.com>
@ 2024-04-25 18:39 ` Kanchan Joshi
[not found] ` <CGME20240425184651epcas5p3404f2390d6cf05148eb96e1af093e7bc@epcas5p3.samsung.com>
` (10 more replies)
0 siblings, 11 replies; 41+ messages in thread
From: Kanchan Joshi @ 2024-04-25 18:39 UTC (permalink / raw)
To: axboe, martin.petersen, kbusch, hch, brauner
Cc: asml.silence, dw, io-uring, linux-nvme, linux-block, gost.dev,
Kanchan Joshi
This adds a new io_uring interface to specify meta along with
read/write. Beyond reading/writing meta, the interface also enables
(a) flags to control data-integrity checks, (b) application tag.
Block path (direct IO) and NVMe driver are modified to support
this.
First 5 patches are enhancements/fixes in the block/nvme so that user meta buffer
(mostly when it gets split) is handled correctly.
Patch 8 adds the io_uring support.
Patch 9 adds the support for block direct IO, and patch 10 for NVMe.
Interface:
Two new opcodes in io_uring: IORING_OP_READ/WRITE_META.
The leftover space in SQE is used to send meta buffer, its length,
apptag, and meta flags (guard/reftag/apptag check for now). Example
program on how to use the interface is appended below [1]
Another design choice will be not to introduce the new opcodes, and add
new RWF_META flag instead. Open to that in next version.
As for new meta flags, RWF_* seemed a bit precious to use. Hence took the route
to carve those within the SQE itself.
Performance:
of non-meta io is not affected due to these patches.
Testing:
has been done by modifying fio to use this interface.
https://github.com/SamsungDS/fio/commits/feat/test-meta-v2
Changes since RFC:
- modify io_uring plumbing based on recent async handling state changes
- fixes/enhancements to correctly handle the split for meta buffer
- add flags to specify guard/reftag/apptag checks
- add support to send apptag
[1]
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <linux/io_uring.h>
#include <linux/types.h>
#include "liburing.h"
/* write data/meta. read both. compare. send apptag too.
* prerequisite:
* unprotected xfer: format namespace with 4KB + 8b, pi_type = 0
* protected xfer: format namespace with 4KB + 8b, pi_type = 1
*/
#define DATA_LEN 4096
#define META_LEN 8
struct t10_pi_tuple {
__be16 guard;
__be16 apptag;
__be32 reftag;
};
int main(int argc, char *argv[])
{
struct io_uring ring;
struct io_uring_sqe *sqe = NULL;
struct io_uring_cqe *cqe = NULL;
void *wdb,*rdb;
char wmb[META_LEN], rmb[META_LEN];
char *data_str = "data buffer";
char *meta_str = "meta";
int fd, ret, blksize;
struct stat fstat;
unsigned long long offset = DATA_LEN;
struct t10_pi_tuple *pi;
if (argc != 2) {
fprintf(stderr, "Usage: %s <block-device>", argv[0]);
return 1;
};
if (stat(argv[1], &fstat) == 0) {
blksize = (int)fstat.st_blksize;
} else {
perror("stat");
return 1;
}
if (posix_memalign(&wdb, blksize, DATA_LEN)) {
perror("posix_memalign failed");
return 1;
}
if (posix_memalign(&rdb, blksize, DATA_LEN)) {
perror("posix_memalign failed");
return 1;
}
strcpy(wdb, data_str);
strcpy(wmb, meta_str);
fd = open(argv[1], O_RDWR | O_DIRECT);
if (fd < 0) {
printf("Error in opening device\n");
return 0;
}
ret = io_uring_queue_init(8, &ring, 0);
if (ret) {
fprintf(stderr, "ring setup failed: %d\n", ret);
return 1;
}
/* write data + meta-buffer to device */
sqe = io_uring_get_sqe(&ring);
if (!sqe) {
fprintf(stderr, "get sqe failed\n");
return 1;
}
io_uring_prep_write(sqe, fd, wdb, DATA_LEN, offset);
sqe->opcode = IORING_OP_WRITE_META;
sqe->meta_addr = (__u64)wmb;
sqe->meta_len = META_LEN;
/* flags to ask for guard/reftag/apptag*/
sqe->meta_flags = META_CHK_APPTAG;
sqe->apptag = 0x1234;
pi = (struct t10_pi_tuple *)wmb;
pi->apptag = 0x3412;
ret = io_uring_submit(&ring);
if (ret <= 0) {
fprintf(stderr, "sqe submit failed: %d\n", ret);
return 1;
}
ret = io_uring_wait_cqe(&ring, &cqe);
if (!cqe) {
fprintf(stderr, "cqe is NULL :%d\n", ret);
return 1;
}
if (cqe->res < 0) {
fprintf(stderr, "write cqe failure: %d", cqe->res);
return 1;
}
io_uring_cqe_seen(&ring, cqe);
/* read data + meta-buffer back from device */
sqe = io_uring_get_sqe(&ring);
if (!sqe) {
fprintf(stderr, "get sqe failed\n");
return 1;
}
io_uring_prep_read(sqe, fd, rdb, DATA_LEN, offset);
sqe->opcode = IORING_OP_READ_META;
sqe->meta_addr = (__u64)rmb;
sqe->meta_len = META_LEN;
sqe->meta_flags = META_CHK_APPTAG;
sqe->apptag = 0x1234;
ret = io_uring_submit(&ring);
if (ret <= 0) {
fprintf(stderr, "sqe submit failed: %d\n", ret);
return 1;
}
ret = io_uring_wait_cqe(&ring, &cqe);
if (!cqe) {
fprintf(stderr, "cqe is NULL :%d\n", ret);
return 1;
}
if (cqe->res < 0) {
fprintf(stderr, "read cqe failure: %d", cqe->res);
return 1;
}
io_uring_cqe_seen(&ring, cqe);
if (strncmp(wmb, rmb, META_LEN))
printf("Failure: meta mismatch!, wmb=%s, rmb=%s\n", wmb, rmb);
if (strncmp(wdb, rdb, DATA_LEN))
printf("Failure: data mismatch!\n");
io_uring_queue_exit(&ring);
free(rdb);
free(wdb);
return 0;
}
Anuj Gupta (6):
block: set bip_vcnt correctly
block: copy bip_max_vcnt vecs instead of bip_vcnt during clone
block: copy result back to user meta buffer correctly in case of split
block: avoid unpinning/freeing the bio_vec incase of cloned bio
block: modify bio_integrity_map_user argument
io_uring/rw: add support to send meta along with read/write
Kanchan Joshi (4):
block, nvme: modify rq_integrity_vec function
block: define meta io descriptor
block: add support to send meta buffer
nvme: add separate handling for user integrity buffer
block/bio-integrity.c | 69 +++++++++++++++++++++++--------
block/fops.c | 9 +++++
block/t10-pi.c | 6 +++
drivers/nvme/host/core.c | 36 ++++++++++++++++-
drivers/nvme/host/ioctl.c | 11 ++++-
drivers/nvme/host/pci.c | 9 +++--
include/linux/bio.h | 23 +++++++++--
include/linux/blk-integrity.h | 13 +++---
include/linux/fs.h | 1 +
include/uapi/linux/io_uring.h | 15 +++++++
io_uring/io_uring.c | 4 ++
io_uring/opdef.c | 30 ++++++++++++++
io_uring/rw.c | 76 +++++++++++++++++++++++++++++++++--
io_uring/rw.h | 11 ++++-
14 files changed, 276 insertions(+), 37 deletions(-)
base-commit: 24c3fc5c75c5b9d471783b4a4958748243828613
--
2.25.1
^ permalink raw reply [flat|nested] 41+ messages in thread
* [PATCH 01/10] block: set bip_vcnt correctly
[not found] ` <CGME20240425184651epcas5p3404f2390d6cf05148eb96e1af093e7bc@epcas5p3.samsung.com>
@ 2024-04-25 18:39 ` Kanchan Joshi
2024-04-27 7:02 ` Christoph Hellwig
0 siblings, 1 reply; 41+ messages in thread
From: Kanchan Joshi @ 2024-04-25 18:39 UTC (permalink / raw)
To: axboe, martin.petersen, kbusch, hch, brauner
Cc: asml.silence, dw, io-uring, linux-nvme, linux-block, gost.dev,
Anuj Gupta, Kanchan Joshi
From: Anuj Gupta <[email protected]>
Set the bip_vcnt correctly in bio_integrity_init_user and
bio_integrity_copy_user. If the bio gets split at a later point,
this value is required to set the right bip_vcnt in the cloned bio.
Signed-off-by: Anuj Gupta <[email protected]>
Signed-off-by: Kanchan Joshi <[email protected]>
---
block/bio-integrity.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index 2e3e8e04961e..e3390424e6b5 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -254,6 +254,7 @@ static int bio_integrity_copy_user(struct bio *bio, struct bio_vec *bvec,
bip->bip_flags |= BIP_INTEGRITY_USER | BIP_COPY_USER;
bip->bip_iter.bi_sector = seed;
+ bip->bip_vcnt = nr_vecs;
return 0;
free_bip:
bio_integrity_free(bio);
@@ -275,6 +276,7 @@ static int bio_integrity_init_user(struct bio *bio, struct bio_vec *bvec,
bip->bip_flags |= BIP_INTEGRITY_USER;
bip->bip_iter.bi_sector = seed;
bip->bip_iter.bi_size = len;
+ bip->bip_vcnt = nr_vecs;
return 0;
}
--
2.25.1
^ permalink raw reply related [flat|nested] 41+ messages in thread
* [PATCH 02/10] block: copy bip_max_vcnt vecs instead of bip_vcnt during clone
[not found] ` <CGME20240425184653epcas5p28de1473090e0141ae74f8b0a6eb921a7@epcas5p2.samsung.com>
@ 2024-04-25 18:39 ` Kanchan Joshi
2024-04-27 7:03 ` Christoph Hellwig
0 siblings, 1 reply; 41+ messages in thread
From: Kanchan Joshi @ 2024-04-25 18:39 UTC (permalink / raw)
To: axboe, martin.petersen, kbusch, hch, brauner
Cc: asml.silence, dw, io-uring, linux-nvme, linux-block, gost.dev,
Anuj Gupta, Kanchan Joshi
From: Anuj Gupta <[email protected]>
If bio_integrity_copy_user is used to process the meta buffer, bip_max_vcnt
is one greater than bip_vcnt. In this case bip_max_vcnt vecs needs to be
copied to cloned bip.
Signed-off-by: Anuj Gupta <[email protected]>
Signed-off-by: Kanchan Joshi <[email protected]>
---
block/bio-integrity.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index e3390424e6b5..c1955f01412e 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -622,12 +622,12 @@ int bio_integrity_clone(struct bio *bio, struct bio *bio_src,
BUG_ON(bip_src == NULL);
- bip = bio_integrity_alloc(bio, gfp_mask, bip_src->bip_vcnt);
+ bip = bio_integrity_alloc(bio, gfp_mask, bip_src->bip_max_vcnt);
if (IS_ERR(bip))
return PTR_ERR(bip);
memcpy(bip->bip_vec, bip_src->bip_vec,
- bip_src->bip_vcnt * sizeof(struct bio_vec));
+ bip_src->bip_max_vcnt * sizeof(struct bio_vec));
bip->bip_vcnt = bip_src->bip_vcnt;
bip->bip_iter = bip_src->bip_iter;
--
2.25.1
^ permalink raw reply related [flat|nested] 41+ messages in thread
* [PATCH 03/10] block: copy result back to user meta buffer correctly in case of split
[not found] ` <CGME20240425184656epcas5p42228cdef753cf20a266d12de5bc130f0@epcas5p4.samsung.com>
@ 2024-04-25 18:39 ` Kanchan Joshi
2024-04-27 7:04 ` Christoph Hellwig
0 siblings, 1 reply; 41+ messages in thread
From: Kanchan Joshi @ 2024-04-25 18:39 UTC (permalink / raw)
To: axboe, martin.petersen, kbusch, hch, brauner
Cc: asml.silence, dw, io-uring, linux-nvme, linux-block, gost.dev,
Anuj Gupta, Kanchan Joshi
From: Anuj Gupta <[email protected]>
In case of split, the len and offset of bvec gets updated in the iter.
Use it to fetch the right bvec, and copy it back to user meta buffer.
Signed-off-by: Anuj Gupta <[email protected]>
Signed-off-by: Kanchan Joshi <[email protected]>
---
block/bio-integrity.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index c1955f01412e..b4042414a08f 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -108,11 +108,15 @@ static void bio_integrity_uncopy_user(struct bio_integrity_payload *bip)
unsigned short nr_vecs = bip->bip_max_vcnt - 1;
struct bio_vec *copy = &bip->bip_vec[1];
size_t bytes = bip->bip_iter.bi_size;
+ struct bio_vec copy_bvec, src_bvec;
struct iov_iter iter;
int ret;
- iov_iter_bvec(&iter, ITER_DEST, copy, nr_vecs, bytes);
- ret = copy_to_iter(bvec_virt(bip->bip_vec), bytes, &iter);
+ copy_bvec = mp_bvec_iter_bvec(copy, bip->bip_iter);
+ src_bvec = mp_bvec_iter_bvec(bip->bip_vec, bip->bip_iter);
+
+ iov_iter_bvec(&iter, ITER_DEST, ©_bvec, nr_vecs, bytes);
+ ret = copy_to_iter(bvec_virt(&src_bvec), bytes, &iter);
WARN_ON_ONCE(ret != bytes);
bio_integrity_unpin_bvec(copy, nr_vecs, true);
--
2.25.1
^ permalink raw reply related [flat|nested] 41+ messages in thread
* [PATCH 04/10] block: avoid unpinning/freeing the bio_vec incase of cloned bio
[not found] ` <CGME20240425184658epcas5p2adb6bf01a5c56ffaac3a55ab57afaf8e@epcas5p2.samsung.com>
@ 2024-04-25 18:39 ` Kanchan Joshi
2024-04-27 7:05 ` Christoph Hellwig
0 siblings, 1 reply; 41+ messages in thread
From: Kanchan Joshi @ 2024-04-25 18:39 UTC (permalink / raw)
To: axboe, martin.petersen, kbusch, hch, brauner
Cc: asml.silence, dw, io-uring, linux-nvme, linux-block, gost.dev,
Anuj Gupta, Kanchan Joshi
From: Anuj Gupta <[email protected]>
Do it only once when the parent bio completes.
Signed-off-by: Anuj Gupta <[email protected]>
Signed-off-by: Kanchan Joshi <[email protected]>
---
block/bio-integrity.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index b4042414a08f..b698eb77515d 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -119,7 +119,8 @@ static void bio_integrity_uncopy_user(struct bio_integrity_payload *bip)
ret = copy_to_iter(bvec_virt(&src_bvec), bytes, &iter);
WARN_ON_ONCE(ret != bytes);
- bio_integrity_unpin_bvec(copy, nr_vecs, true);
+ if (!bio_flagged((bip->bip_bio), BIO_CLONED))
+ bio_integrity_unpin_bvec(copy, nr_vecs, true);
}
static void bio_integrity_unmap_user(struct bio_integrity_payload *bip)
@@ -129,11 +130,14 @@ static void bio_integrity_unmap_user(struct bio_integrity_payload *bip)
if (bip->bip_flags & BIP_COPY_USER) {
if (dirty)
bio_integrity_uncopy_user(bip);
- kfree(bvec_virt(bip->bip_vec));
+ if (!bio_flagged((bip->bip_bio), BIO_CLONED))
+ kfree(bvec_virt(bip->bip_vec));
return;
}
- bio_integrity_unpin_bvec(bip->bip_vec, bip->bip_max_vcnt, dirty);
+ if (!bio_flagged((bip->bip_bio), BIO_CLONED))
+ bio_integrity_unpin_bvec(bip->bip_vec, bip->bip_max_vcnt,
+ dirty);
}
/**
--
2.25.1
^ permalink raw reply related [flat|nested] 41+ messages in thread
* [PATCH 05/10] block, nvme: modify rq_integrity_vec function
[not found] ` <CGME20240425184700epcas5p1687590f7e4a3f3c3620ac27af514f0ca@epcas5p1.samsung.com>
@ 2024-04-25 18:39 ` Kanchan Joshi
2024-04-27 7:18 ` Christoph Hellwig
0 siblings, 1 reply; 41+ messages in thread
From: Kanchan Joshi @ 2024-04-25 18:39 UTC (permalink / raw)
To: axboe, martin.petersen, kbusch, hch, brauner
Cc: asml.silence, dw, io-uring, linux-nvme, linux-block, gost.dev,
Kanchan Joshi, Anuj Gupta
rq_integrity_vec always returns the first bio_vec, and does not take
bip_iter into consideration.
Modify it so that it does. This is similar to how req_bvec() works for
data buffers.
This is necessary for correct dma map/unmap of meta buffer when it was
split in the block layer.
Signed-off-by: Anuj Gupta <[email protected]>
Signed-off-by: Kanchan Joshi <[email protected]>
---
drivers/nvme/host/pci.c | 9 +++++----
include/linux/blk-integrity.h | 13 +++++++------
2 files changed, 12 insertions(+), 10 deletions(-)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 8e0bb9692685..bc5177ea6330 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -825,9 +825,9 @@ static blk_status_t nvme_map_metadata(struct nvme_dev *dev, struct request *req,
struct nvme_command *cmnd)
{
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+ struct bio_vec bv = rq_integrity_vec(req);
- iod->meta_dma = dma_map_bvec(dev->dev, rq_integrity_vec(req),
- rq_dma_dir(req), 0);
+ iod->meta_dma = dma_map_bvec(dev->dev, &bv, rq_dma_dir(req), 0);
if (dma_mapping_error(dev->dev, iod->meta_dma))
return BLK_STS_IOERR;
cmnd->rw.metadata = cpu_to_le64(iod->meta_dma);
@@ -964,9 +964,10 @@ static __always_inline void nvme_pci_unmap_rq(struct request *req)
if (blk_integrity_rq(req)) {
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+ struct bio_vec bv = rq_integrity_vec(req);
- dma_unmap_page(dev->dev, iod->meta_dma,
- rq_integrity_vec(req)->bv_len, rq_dma_dir(req));
+ dma_unmap_page(dev->dev, iod->meta_dma, bv.bv_len,
+ rq_dma_dir(req));
}
if (blk_rq_nr_phys_segments(req))
diff --git a/include/linux/blk-integrity.h b/include/linux/blk-integrity.h
index e253e7bd0d17..9223050c0f75 100644
--- a/include/linux/blk-integrity.h
+++ b/include/linux/blk-integrity.h
@@ -109,11 +109,12 @@ static inline bool blk_integrity_rq(struct request *rq)
* Return the first bvec that contains integrity data. Only drivers that are
* limited to a single integrity segment should use this helper.
*/
-static inline struct bio_vec *rq_integrity_vec(struct request *rq)
+static inline struct bio_vec rq_integrity_vec(struct request *rq)
{
- if (WARN_ON_ONCE(queue_max_integrity_segments(rq->q) > 1))
- return NULL;
- return rq->bio->bi_integrity->bip_vec;
+ WARN_ON_ONCE(queue_max_integrity_segments(rq->q) > 1);
+
+ return mp_bvec_iter_bvec(rq->bio->bi_integrity->bip_vec,
+ rq->bio->bi_integrity->bip_iter);
}
#else /* CONFIG_BLK_DEV_INTEGRITY */
static inline int blk_rq_count_integrity_sg(struct request_queue *q,
@@ -177,9 +178,9 @@ static inline int blk_integrity_rq(struct request *rq)
return 0;
}
-static inline struct bio_vec *rq_integrity_vec(struct request *rq)
+static inline struct bio_vec rq_integrity_vec(struct request *rq)
{
- return NULL;
+ return (struct bio_vec){0};
}
#endif /* CONFIG_BLK_DEV_INTEGRITY */
#endif /* _LINUX_BLK_INTEGRITY_H */
--
2.25.1
^ permalink raw reply related [flat|nested] 41+ messages in thread
* [PATCH 06/10] block: modify bio_integrity_map_user argument
[not found] ` <CGME20240425184702epcas5p1ccb0df41b07845bc252d69007558e3fa@epcas5p1.samsung.com>
@ 2024-04-25 18:39 ` Kanchan Joshi
2024-04-27 7:19 ` Christoph Hellwig
0 siblings, 1 reply; 41+ messages in thread
From: Kanchan Joshi @ 2024-04-25 18:39 UTC (permalink / raw)
To: axboe, martin.petersen, kbusch, hch, brauner
Cc: asml.silence, dw, io-uring, linux-nvme, linux-block, gost.dev,
Anuj Gupta, Kanchan Joshi
From: Anuj Gupta <[email protected]>
This patch refactors bio_integrity_map_user to accept iov_iter as
argument. This is a prep patch.
Signed-off-by: Anuj Gupta <[email protected]>
Signed-off-by: Kanchan Joshi <[email protected]>
---
block/bio-integrity.c | 14 ++++++--------
drivers/nvme/host/ioctl.c | 11 +++++++++--
include/linux/bio.h | 7 ++++---
3 files changed, 19 insertions(+), 13 deletions(-)
diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index b698eb77515d..1085cf45f51e 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -318,17 +318,16 @@ static unsigned int bvec_from_pages(struct bio_vec *bvec, struct page **pages,
return nr_bvecs;
}
-int bio_integrity_map_user(struct bio *bio, void __user *ubuf, ssize_t bytes,
- u32 seed)
+int bio_integrity_map_user(struct bio *bio, struct iov_iter *iter,
+ u32 seed)
{
struct request_queue *q = bdev_get_queue(bio->bi_bdev);
unsigned int align = q->dma_pad_mask | queue_dma_alignment(q);
struct page *stack_pages[UIO_FASTIOV], **pages = stack_pages;
struct bio_vec stack_vec[UIO_FASTIOV], *bvec = stack_vec;
+ size_t offset, bytes = iter->count;
unsigned int direction, nr_bvecs;
- struct iov_iter iter;
int ret, nr_vecs;
- size_t offset;
bool copy;
if (bio_integrity(bio))
@@ -341,8 +340,7 @@ int bio_integrity_map_user(struct bio *bio, void __user *ubuf, ssize_t bytes,
else
direction = ITER_SOURCE;
- iov_iter_ubuf(&iter, direction, ubuf, bytes);
- nr_vecs = iov_iter_npages(&iter, BIO_MAX_VECS + 1);
+ nr_vecs = iov_iter_npages(iter, BIO_MAX_VECS + 1);
if (nr_vecs > BIO_MAX_VECS)
return -E2BIG;
if (nr_vecs > UIO_FASTIOV) {
@@ -352,8 +350,8 @@ int bio_integrity_map_user(struct bio *bio, void __user *ubuf, ssize_t bytes,
pages = NULL;
}
- copy = !iov_iter_is_aligned(&iter, align, align);
- ret = iov_iter_extract_pages(&iter, &pages, bytes, nr_vecs, 0, &offset);
+ copy = !iov_iter_is_aligned(iter, align, align);
+ ret = iov_iter_extract_pages(iter, &pages, bytes, nr_vecs, 0, &offset);
if (unlikely(ret < 0))
goto free_bvec;
diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index 499a8bb7cac7..4ed389edb3b6 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -145,8 +145,15 @@ static int nvme_map_user_request(struct request *req, u64 ubuffer,
if (bdev) {
bio_set_dev(bio, bdev);
if (meta_buffer && meta_len) {
- ret = bio_integrity_map_user(bio, meta_buffer, meta_len,
- meta_seed);
+ struct iov_iter iter;
+ unsigned int direction;
+
+ if (bio_data_dir(bio) == READ)
+ direction = ITER_DEST;
+ else
+ direction = ITER_SOURCE;
+ iov_iter_ubuf(&iter, direction, meta_buffer, meta_len);
+ ret = bio_integrity_map_user(bio, &iter, meta_seed);
if (ret)
goto out_unmap;
req->cmd_flags |= REQ_INTEGRITY;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 875d792bffff..20cf47fc851f 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -723,7 +723,7 @@ static inline bool bioset_initialized(struct bio_set *bs)
for_each_bio(_bio) \
bip_for_each_vec(_bvl, _bio->bi_integrity, _iter)
-int bio_integrity_map_user(struct bio *bio, void __user *ubuf, ssize_t len, u32 seed);
+int bio_integrity_map_user(struct bio *bio, struct iov_iter *iter, u32 seed);
extern struct bio_integrity_payload *bio_integrity_alloc(struct bio *, gfp_t, unsigned int);
extern int bio_integrity_add_page(struct bio *, struct page *, unsigned int, unsigned int);
extern bool bio_integrity_prep(struct bio *);
@@ -795,8 +795,9 @@ static inline int bio_integrity_add_page(struct bio *bio, struct page *page,
return 0;
}
-static inline int bio_integrity_map_user(struct bio *bio, void __user *ubuf,
- ssize_t len, u32 seed)
+static inline int bio_integrity_map_user(struct bio *bio, struct iov_iter *iter,
+ u32 seed)
+
{
return -EINVAL;
}
--
2.25.1
^ permalink raw reply related [flat|nested] 41+ messages in thread
* [PATCH 07/10] block: define meta io descriptor
[not found] ` <CGME20240425184704epcas5p3b9eb6cce9c9658eb1d0d32937e778a5d@epcas5p3.samsung.com>
@ 2024-04-25 18:39 ` Kanchan Joshi
0 siblings, 0 replies; 41+ messages in thread
From: Kanchan Joshi @ 2024-04-25 18:39 UTC (permalink / raw)
To: axboe, martin.petersen, kbusch, hch, brauner
Cc: asml.silence, dw, io-uring, linux-nvme, linux-block, gost.dev,
Kanchan Joshi
Introduces a new 'uio_meta' structure that upper layer (io_uring) can
use to pass the meta/integrity information. This is a prep patch.
Signed-off-by: Kanchan Joshi <[email protected]>
---
include/linux/bio.h | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 20cf47fc851f..0281b356935a 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -331,6 +331,12 @@ enum bip_flags {
BIP_COPY_USER = 1 << 6, /* Kernel bounce buffer in use */
};
+struct uio_meta {
+ u16 flags;
+ u16 apptag;
+ struct iov_iter iter;
+};
+
/*
* bio integrity payload
*/
--
2.25.1
^ permalink raw reply related [flat|nested] 41+ messages in thread
* [PATCH 08/10] io_uring/rw: add support to send meta along with read/write
[not found] ` <CGME20240425184706epcas5p1d75c19d1d1458c52fc4009f150c7dc7d@epcas5p1.samsung.com>
@ 2024-04-25 18:39 ` Kanchan Joshi
2024-04-26 14:25 ` Jens Axboe
0 siblings, 1 reply; 41+ messages in thread
From: Kanchan Joshi @ 2024-04-25 18:39 UTC (permalink / raw)
To: axboe, martin.petersen, kbusch, hch, brauner
Cc: asml.silence, dw, io-uring, linux-nvme, linux-block, gost.dev,
Anuj Gupta, Kanchan Joshi, Nitesh Shetty
From: Anuj Gupta <[email protected]>
This patch introduces IORING_OP_READ_META and IORING_OP_WRITE_META
opcodes which allow sending a meta buffer along with read/write. The
meta buffer, its length, apptag and integrity check flags can be specified
by the application in the newly introduced meta_buf, meta_len, apptag and
meta_flags fields of SQE.
Use the user-passed information to prepare uio_meta descriptor, and
pass it down using kiocb->private.
Meta exchange is supported only for direct IO.
Signed-off-by: Anuj Gupta <[email protected]>
Signed-off-by: Kanchan Joshi <[email protected]>
Signed-off-by: Nitesh Shetty <[email protected]>
---
include/linux/fs.h | 1 +
include/uapi/linux/io_uring.h | 15 +++++++
io_uring/io_uring.c | 4 ++
io_uring/opdef.c | 30 ++++++++++++++
io_uring/rw.c | 76 +++++++++++++++++++++++++++++++++--
io_uring/rw.h | 11 ++++-
6 files changed, 132 insertions(+), 5 deletions(-)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 8dfd53b52744..8868d17ae8f9 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -329,6 +329,7 @@ struct readahead_control;
#define IOCB_NOIO (1 << 20)
/* can use bio alloc cache */
#define IOCB_ALLOC_CACHE (1 << 21)
+#define IOCB_USE_META (1 << 22)
/*
* IOCB_DIO_CALLER_COMP can be set by the iocb owner, to indicate that the
* iocb completion can be passed back to the owner for execution from a safe
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index a7f847543a7f..d4653b52fdd6 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -97,6 +97,12 @@ struct io_uring_sqe {
__u64 addr3;
__u64 __pad2[1];
};
+ struct {
+ __u64 meta_addr;
+ __u32 meta_len;
+ __u16 meta_flags;
+ __u16 apptag;
+ };
__u64 optval;
/*
* If the ring is initialized with IORING_SETUP_SQE128, then
@@ -106,6 +112,13 @@ struct io_uring_sqe {
};
};
+/*
+ * meta io flags
+ */
+#define META_CHK_GUARD (1U << 0) /* guard is valid */
+#define META_CHK_APPTAG (1U << 1) /* app tag is valid */
+#define META_CHK_REFTAG (1U << 2) /* ref tag is valid */
+
/*
* If sqe->file_index is set to this for opcodes that instantiate a new
* direct descriptor (like openat/openat2/accept), then io_uring will allocate
@@ -256,6 +269,8 @@ enum io_uring_op {
IORING_OP_FUTEX_WAITV,
IORING_OP_FIXED_FD_INSTALL,
IORING_OP_FTRUNCATE,
+ IORING_OP_READ_META,
+ IORING_OP_WRITE_META,
/* this goes last, obviously */
IORING_OP_LAST,
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 3c9087f37c43..af95fc8d988c 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -3723,7 +3723,11 @@ static int __init io_uring_init(void)
BUILD_BUG_SQE_ELEM(44, __u16, addr_len);
BUILD_BUG_SQE_ELEM(46, __u16, __pad3[0]);
BUILD_BUG_SQE_ELEM(48, __u64, addr3);
+ BUILD_BUG_SQE_ELEM(48, __u64, meta_addr);
BUILD_BUG_SQE_ELEM_SIZE(48, 0, cmd);
+ BUILD_BUG_SQE_ELEM(56, __u32, meta_len);
+ BUILD_BUG_SQE_ELEM(60, __u16, meta_flags);
+ BUILD_BUG_SQE_ELEM(62, __u16, apptag);
BUILD_BUG_SQE_ELEM(56, __u64, __pad2);
BUILD_BUG_ON(sizeof(struct io_uring_files_update) !=
diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index a16f73938ebb..8b8fdcfb7f30 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -444,6 +444,28 @@ const struct io_issue_def io_issue_defs[] = {
.prep = io_eopnotsupp_prep,
#endif
},
+ [IORING_OP_READ_META] = {
+ .needs_file = 1,
+ .plug = 1,
+ .audit_skip = 1,
+ .ioprio = 1,
+ .iopoll = 1,
+ .iopoll_queue = 1,
+ .async_size = sizeof(struct io_async_rw),
+ .prep = io_prep_read_meta,
+ .issue = io_rw_meta,
+ },
+ [IORING_OP_WRITE_META] = {
+ .needs_file = 1,
+ .plug = 1,
+ .audit_skip = 1,
+ .ioprio = 1,
+ .iopoll = 1,
+ .iopoll_queue = 1,
+ .async_size = sizeof(struct io_async_rw),
+ .prep = io_prep_write_meta,
+ .issue = io_rw_meta,
+ },
[IORING_OP_READ_MULTISHOT] = {
.needs_file = 1,
.unbound_nonreg_file = 1,
@@ -510,6 +532,14 @@ const struct io_cold_def io_cold_defs[] = {
.cleanup = io_readv_writev_cleanup,
.fail = io_rw_fail,
},
+ [IORING_OP_READ_META] = {
+ .name = "READ_META",
+ .fail = io_rw_fail,
+ },
+ [IORING_OP_WRITE_META] = {
+ .name = "WRITE_META",
+ .fail = io_rw_fail,
+ },
[IORING_OP_FSYNC] = {
.name = "FSYNC",
},
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 3134a6ece1be..b2c9ac91d5e5 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -269,6 +269,7 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
rw->kiocb.ki_ioprio = get_current_ioprio();
}
rw->kiocb.dio_complete = NULL;
+ rw->kiocb.ki_flags = 0;
rw->addr = READ_ONCE(sqe->addr);
rw->len = READ_ONCE(sqe->len);
@@ -286,6 +287,41 @@ int io_prep_write(struct io_kiocb *req, const struct io_uring_sqe *sqe)
return io_prep_rw(req, sqe, ITER_SOURCE, true);
}
+static int io_prep_rw_meta(struct io_kiocb *req, const struct io_uring_sqe *sqe,
+ int ddir, bool import)
+{
+ struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
+ struct io_async_rw *io;
+ struct kiocb *kiocb = &rw->kiocb;
+ int ret;
+
+ ret = io_prep_rw(req, sqe, ddir, import);
+ if (unlikely(ret))
+ return ret;
+
+ io = req->async_data;
+ kiocb->ki_flags |= IOCB_USE_META;
+ io->meta.flags = READ_ONCE(sqe->meta_flags);
+ io->meta.apptag = READ_ONCE(sqe->apptag);
+ ret = import_ubuf(ddir, u64_to_user_ptr(READ_ONCE(sqe->meta_addr)),
+ READ_ONCE(sqe->meta_len), &io->meta.iter);
+ if (unlikely(ret < 0))
+ return ret;
+
+ iov_iter_save_state(&io->meta.iter, &io->iter_meta_state);
+ return 0;
+}
+
+int io_prep_read_meta(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+{
+ return io_prep_rw_meta(req, sqe, ITER_DEST, true);
+}
+
+int io_prep_write_meta(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+{
+ return io_prep_rw_meta(req, sqe, ITER_SOURCE, true);
+}
+
static int io_prep_rwv(struct io_kiocb *req, const struct io_uring_sqe *sqe,
int ddir)
{
@@ -587,6 +623,8 @@ static int kiocb_done(struct io_kiocb *req, ssize_t ret,
req->flags &= ~REQ_F_REISSUE;
iov_iter_restore(&io->iter, &io->iter_state);
+ if (unlikely(rw->kiocb.ki_flags & IOCB_USE_META))
+ iov_iter_restore(&io->meta.iter, &io->iter_meta_state);
return -EAGAIN;
}
return IOU_ISSUE_SKIP_COMPLETE;
@@ -768,7 +806,7 @@ static int io_rw_init_file(struct io_kiocb *req, fmode_t mode)
if (!(req->flags & REQ_F_FIXED_FILE))
req->flags |= io_file_get_flags(file);
- kiocb->ki_flags = file->f_iocb_flags;
+ kiocb->ki_flags |= file->f_iocb_flags;
ret = kiocb_set_rw_flags(kiocb, rw->flags);
if (unlikely(ret))
return ret;
@@ -787,7 +825,8 @@ static int io_rw_init_file(struct io_kiocb *req, fmode_t mode)
if (!(kiocb->ki_flags & IOCB_DIRECT) || !file->f_op->iopoll)
return -EOPNOTSUPP;
- kiocb->private = NULL;
+ if (likely(!(kiocb->ki_flags & IOCB_USE_META)))
+ kiocb->private = NULL;
kiocb->ki_flags |= IOCB_HIPRI;
kiocb->ki_complete = io_complete_rw_iopoll;
req->iopoll_completed = 0;
@@ -853,7 +892,8 @@ static int __io_read(struct io_kiocb *req, unsigned int issue_flags)
} else if (ret == -EIOCBQUEUED) {
return IOU_ISSUE_SKIP_COMPLETE;
} else if (ret == req->cqe.res || ret <= 0 || !force_nonblock ||
- (req->flags & REQ_F_NOWAIT) || !need_complete_io(req)) {
+ (req->flags & REQ_F_NOWAIT) || !need_complete_io(req) ||
+ (kiocb->ki_flags & IOCB_USE_META)) {
/* read all, failed, already did sync or don't want to retry */
goto done;
}
@@ -864,6 +904,12 @@ static int __io_read(struct io_kiocb *req, unsigned int issue_flags)
* manually if we need to.
*/
iov_iter_restore(&io->iter, &io->iter_state);
+ if (unlikely(kiocb->ki_flags & IOCB_USE_META)) {
+ /* don't handle partial completion for read + meta */
+ if (ret > 0)
+ goto done;
+ iov_iter_restore(&io->meta.iter, &io->iter_meta_state);
+ }
do {
/*
@@ -1053,7 +1099,8 @@ int io_write(struct io_kiocb *req, unsigned int issue_flags)
if (ret2 == -EAGAIN && (req->ctx->flags & IORING_SETUP_IOPOLL))
goto ret_eagain;
- if (ret2 != req->cqe.res && ret2 >= 0 && need_complete_io(req)) {
+ if (ret2 != req->cqe.res && ret2 >= 0 && need_complete_io(req)
+ && !(kiocb->ki_flags & IOCB_USE_META)) {
trace_io_uring_short_write(req->ctx, kiocb->ki_pos - ret2,
req->cqe.res, ret2);
@@ -1074,12 +1121,33 @@ int io_write(struct io_kiocb *req, unsigned int issue_flags)
} else {
ret_eagain:
iov_iter_restore(&io->iter, &io->iter_state);
+ if (unlikely(kiocb->ki_flags & IOCB_USE_META))
+ iov_iter_restore(&io->meta.iter, &io->iter_meta_state);
if (kiocb->ki_flags & IOCB_WRITE)
io_req_end_write(req);
return -EAGAIN;
}
}
+int io_rw_meta(struct io_kiocb *req, unsigned int issue_flags)
+{
+ struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
+ struct io_async_rw *io = req->async_data;
+ struct kiocb *kiocb = &rw->kiocb;
+ int ret;
+
+ if (!(req->file->f_flags & O_DIRECT))
+ return -EOPNOTSUPP;
+
+ kiocb->private = &io->meta;
+ if (req->opcode == IORING_OP_READ_META)
+ ret = io_read(req, issue_flags);
+ else
+ ret = io_write(req, issue_flags);
+
+ return ret;
+}
+
void io_rw_fail(struct io_kiocb *req)
{
int res;
diff --git a/io_uring/rw.h b/io_uring/rw.h
index 3f432dc75441..a640071064e3 100644
--- a/io_uring/rw.h
+++ b/io_uring/rw.h
@@ -9,7 +9,13 @@ struct io_async_rw {
struct iovec fast_iov;
struct iovec *free_iovec;
int free_iov_nr;
- struct wait_page_queue wpq;
+ union {
+ struct wait_page_queue wpq;
+ struct {
+ struct uio_meta meta;
+ struct iov_iter_state iter_meta_state;
+ };
+ };
};
int io_prep_read_fixed(struct io_kiocb *req, const struct io_uring_sqe *sqe);
@@ -17,9 +23,12 @@ int io_prep_write_fixed(struct io_kiocb *req, const struct io_uring_sqe *sqe);
int io_prep_readv(struct io_kiocb *req, const struct io_uring_sqe *sqe);
int io_prep_writev(struct io_kiocb *req, const struct io_uring_sqe *sqe);
int io_prep_read(struct io_kiocb *req, const struct io_uring_sqe *sqe);
+int io_prep_read_meta(struct io_kiocb *req, const struct io_uring_sqe *sqe);
int io_prep_write(struct io_kiocb *req, const struct io_uring_sqe *sqe);
+int io_prep_write_meta(struct io_kiocb *req, const struct io_uring_sqe *sqe);
int io_read(struct io_kiocb *req, unsigned int issue_flags);
int io_write(struct io_kiocb *req, unsigned int issue_flags);
+int io_rw_meta(struct io_kiocb *req, unsigned int issue_flags);
void io_readv_writev_cleanup(struct io_kiocb *req);
void io_rw_fail(struct io_kiocb *req);
void io_req_rw_complete(struct io_kiocb *req, struct io_tw_state *ts);
--
2.25.1
^ permalink raw reply related [flat|nested] 41+ messages in thread
* [PATCH 09/10] block: add support to send meta buffer
[not found] ` <CGME20240425184708epcas5p4f1d95cd8d285614f712868d205a23115@epcas5p4.samsung.com>
@ 2024-04-25 18:39 ` Kanchan Joshi
2024-04-26 15:21 ` Keith Busch
0 siblings, 1 reply; 41+ messages in thread
From: Kanchan Joshi @ 2024-04-25 18:39 UTC (permalink / raw)
To: axboe, martin.petersen, kbusch, hch, brauner
Cc: asml.silence, dw, io-uring, linux-nvme, linux-block, gost.dev,
Kanchan Joshi, Anuj Gupta
If iocb contains the meta, extract that and prepare the bip.
Extend bip so that can it can carry three new integrity-check flags
and application tag.
Make sure that ->prepare_fn and ->complete_fn are skipped for
user-owned meta buffer.
Signed-off-by: Anuj Gupta <[email protected]>
Signed-off-by: Kanchan Joshi <[email protected]>
---
block/bio-integrity.c | 31 ++++++++++++++++++++++++++++++-
block/fops.c | 9 +++++++++
block/t10-pi.c | 6 ++++++
include/linux/bio.h | 10 ++++++++++
4 files changed, 55 insertions(+), 1 deletion(-)
diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index 1085cf45f51e..dab76fd73813 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -11,7 +11,7 @@
#include <linux/export.h>
#include <linux/bio.h>
#include <linux/workqueue.h>
-#include <linux/slab.h>
+#include <uapi/linux/io_uring.h>
#include "blk.h"
static struct kmem_cache *bip_slab;
@@ -318,6 +318,35 @@ static unsigned int bvec_from_pages(struct bio_vec *bvec, struct page **pages,
return nr_bvecs;
}
+int bio_integrity_map_iter(struct bio *bio, struct uio_meta *meta)
+{
+ struct blk_integrity *bi = blk_get_integrity(bio->bi_bdev->bd_disk);
+ u16 bip_flags = 0;
+ int ret;
+
+ if (!bi)
+ return -EINVAL;
+
+ if (meta->iter.count < bio_integrity_bytes(bi, bio_sectors(bio)))
+ return -EINVAL;
+
+ ret = bio_integrity_map_user(bio, &meta->iter, 0);
+ if (!ret) {
+ struct bio_integrity_payload *bip = bio_integrity(bio);
+
+ if (meta->flags & META_CHK_GUARD)
+ bip_flags |= BIP_USER_CHK_GUARD;
+ if (meta->flags & META_CHK_APPTAG)
+ bip_flags |= BIP_USER_CHK_APPTAG;
+ if (meta->flags & META_CHK_REFTAG)
+ bip_flags |= BIP_USER_CHK_REFTAG;
+
+ bip->bip_flags |= bip_flags;
+ bip->apptag = meta->apptag;
+ }
+ return ret;
+}
+
int bio_integrity_map_user(struct bio *bio, struct iov_iter *iter,
u32 seed)
{
diff --git a/block/fops.c b/block/fops.c
index 679d9b752fe8..e488fa66dd60 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -353,6 +353,15 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
task_io_account_write(bio->bi_iter.bi_size);
}
+ if (unlikely(iocb->ki_flags & IOCB_USE_META)) {
+ ret = bio_integrity_map_iter(bio, iocb->private);
+ WRITE_ONCE(iocb->private, NULL);
+ if (unlikely(ret)) {
+ bio_put(bio);
+ return ret;
+ }
+ }
+
if (iocb->ki_flags & IOCB_NOWAIT)
bio->bi_opf |= REQ_NOWAIT;
diff --git a/block/t10-pi.c b/block/t10-pi.c
index d90892fd6f2a..72d1522417a1 100644
--- a/block/t10-pi.c
+++ b/block/t10-pi.c
@@ -156,6 +156,8 @@ static void t10_pi_type1_prepare(struct request *rq)
/* Already remapped? */
if (bip->bip_flags & BIP_MAPPED_INTEGRITY)
break;
+ if (bip->bip_flags & BIP_INTEGRITY_USER)
+ break;
bip_for_each_vec(iv, bip, iter) {
unsigned int j;
@@ -205,6 +207,8 @@ static void t10_pi_type1_complete(struct request *rq, unsigned int nr_bytes)
struct bio_vec iv;
struct bvec_iter iter;
+ if (bip->bip_flags & BIP_INTEGRITY_USER)
+ break;
bip_for_each_vec(iv, bip, iter) {
unsigned int j;
void *p;
@@ -408,6 +412,8 @@ static void ext_pi_type1_prepare(struct request *rq)
/* Already remapped? */
if (bip->bip_flags & BIP_MAPPED_INTEGRITY)
break;
+ if (bip->bip_flags & BIP_INTEGRITY_USER)
+ break;
bip_for_each_vec(iv, bip, iter) {
unsigned int j;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 0281b356935a..8724305bce62 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -329,6 +329,9 @@ enum bip_flags {
BIP_IP_CHECKSUM = 1 << 4, /* IP checksum */
BIP_INTEGRITY_USER = 1 << 5, /* Integrity payload is user address */
BIP_COPY_USER = 1 << 6, /* Kernel bounce buffer in use */
+ BIP_USER_CHK_GUARD = 1 << 7,
+ BIP_USER_CHK_APPTAG = 1 << 8,
+ BIP_USER_CHK_REFTAG = 1 << 9,
};
struct uio_meta {
@@ -348,6 +351,7 @@ struct bio_integrity_payload {
unsigned short bip_vcnt; /* # of integrity bio_vecs */
unsigned short bip_max_vcnt; /* integrity bio_vec slots */
unsigned short bip_flags; /* control flags */
+ u16 apptag; /* apptag */
struct bvec_iter bio_iter; /* for rewinding parent bio */
@@ -729,6 +733,7 @@ static inline bool bioset_initialized(struct bio_set *bs)
for_each_bio(_bio) \
bip_for_each_vec(_bvl, _bio->bi_integrity, _iter)
+int bio_integrity_map_iter(struct bio *bio, struct uio_meta *meta);
int bio_integrity_map_user(struct bio *bio, struct iov_iter *iter, u32 seed);
extern struct bio_integrity_payload *bio_integrity_alloc(struct bio *, gfp_t, unsigned int);
extern int bio_integrity_add_page(struct bio *, struct page *, unsigned int, unsigned int);
@@ -808,6 +813,11 @@ static inline int bio_integrity_map_user(struct bio *bio, struct iov_iter *iter,
return -EINVAL;
}
+static inline int bio_integrity_map_iter(struct bio *bio, struct uio_meta *meta)
+{
+ return -EINVAL;
+}
+
#endif /* CONFIG_BLK_DEV_INTEGRITY */
/*
--
2.25.1
^ permalink raw reply related [flat|nested] 41+ messages in thread
* [PATCH 10/10] nvme: add separate handling for user integrity buffer
[not found] ` <CGME20240425184710epcas5p2968bbc40ed10d1f0184bb511af054fcb@epcas5p2.samsung.com>
@ 2024-04-25 18:39 ` Kanchan Joshi
2024-04-25 19:56 ` Keith Busch
2024-04-26 10:57 ` kernel test robot
0 siblings, 2 replies; 41+ messages in thread
From: Kanchan Joshi @ 2024-04-25 18:39 UTC (permalink / raw)
To: axboe, martin.petersen, kbusch, hch, brauner
Cc: asml.silence, dw, io-uring, linux-nvme, linux-block, gost.dev,
Kanchan Joshi
For user provided integrity buffer, convert bip flags
(guard/reftag/apptag checks) to protocol specific flags.
Also pass apptag and reftag down.
Signed-off-by: Kanchan Joshi <[email protected]>
---
drivers/nvme/host/core.c | 36 +++++++++++++++++++++++++++++++++++-
1 file changed, 35 insertions(+), 1 deletion(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 27281a9a8951..3b719be4eedb 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -886,6 +886,13 @@ static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req,
return BLK_STS_OK;
}
+static void nvme_set_app_tag(struct nvme_command *cmnd, u16 apptag)
+{
+ cmnd->rw.apptag = cpu_to_le16(apptag);
+ /* use 0xfff as mask so that apptag is used in entirety*/
+ cmnd->rw.appmask = cpu_to_le16(0xffff);
+}
+
static void nvme_set_ref_tag(struct nvme_ns *ns, struct nvme_command *cmnd,
struct request *req)
{
@@ -943,6 +950,25 @@ static inline blk_status_t nvme_setup_write_zeroes(struct nvme_ns *ns,
return BLK_STS_OK;
}
+static inline void nvme_setup_user_integrity(struct nvme_ns *ns,
+ struct request *req, struct nvme_command *cmnd,
+ u16 *control)
+{
+ struct bio_integrity_payload *bip = bio_integrity(req->bio);
+ unsigned short bip_flags = bip->bip_flags;
+
+ if (bip_flags & BIP_USER_CHK_GUARD)
+ *control |= NVME_RW_PRINFO_PRCHK_GUARD;
+ if (bip_flags & BIP_USER_CHK_REFTAG) {
+ *control |= NVME_RW_PRINFO_PRCHK_REF;
+ nvme_set_ref_tag(ns, cmnd, req);
+ }
+ if (bip_flags & BIP_USER_CHK_APPTAG) {
+ *control |= NVME_RW_PRINFO_PRCHK_APP;
+ nvme_set_app_tag(cmnd, bip->apptag);
+ }
+}
+
static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
struct request *req, struct nvme_command *cmnd,
enum nvme_opcode op)
@@ -983,6 +1009,14 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
if (WARN_ON_ONCE(!nvme_ns_has_pi(ns->head)))
return BLK_STS_NOTSUPP;
control |= NVME_RW_PRINFO_PRACT;
+ } else {
+ /* process user-created integrity */
+ if (bio_integrity(req->bio)->bip_flags &
+ BIP_INTEGRITY_USER) {
+ nvme_setup_user_integrity(ns, req, cmnd,
+ &control);
+ goto out;
+ }
}
switch (ns->head->pi_type) {
@@ -999,7 +1033,7 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
break;
}
}
-
+out:
cmnd->rw.control = cpu_to_le16(control);
cmnd->rw.dsmgmt = cpu_to_le32(dsmgmt);
return 0;
--
2.25.1
^ permalink raw reply related [flat|nested] 41+ messages in thread
* Re: [PATCH 10/10] nvme: add separate handling for user integrity buffer
2024-04-25 18:39 ` [PATCH 10/10] nvme: add separate handling for user integrity buffer Kanchan Joshi
@ 2024-04-25 19:56 ` Keith Busch
2024-04-26 10:57 ` kernel test robot
1 sibling, 0 replies; 41+ messages in thread
From: Keith Busch @ 2024-04-25 19:56 UTC (permalink / raw)
To: Kanchan Joshi
Cc: axboe, martin.petersen, hch, brauner, asml.silence, dw, io-uring,
linux-nvme, linux-block, gost.dev
On Fri, Apr 26, 2024 at 12:09:43AM +0530, Kanchan Joshi wrote:
> @@ -983,6 +1009,14 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
> if (WARN_ON_ONCE(!nvme_ns_has_pi(ns->head)))
> return BLK_STS_NOTSUPP;
> control |= NVME_RW_PRINFO_PRACT;
> + } else {
> + /* process user-created integrity */
> + if (bio_integrity(req->bio)->bip_flags &
> + BIP_INTEGRITY_USER) {
Make this an "else if" instead of nesting it an extra level.
> + nvme_setup_user_integrity(ns, req, cmnd,
> + &control);
> + goto out;
> + }
And this can be structured a little differently so that you don't need
the "goto"; IMO, goto is good for error unwinding, but using it in a
good path harms readablilty.
This is getting complex enough that splitting it off in a helper
funciton, maybe nvme_setup_rw_meta(), might be a good idea.
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 10/10] nvme: add separate handling for user integrity buffer
2024-04-25 18:39 ` [PATCH 10/10] nvme: add separate handling for user integrity buffer Kanchan Joshi
2024-04-25 19:56 ` Keith Busch
@ 2024-04-26 10:57 ` kernel test robot
1 sibling, 0 replies; 41+ messages in thread
From: kernel test robot @ 2024-04-26 10:57 UTC (permalink / raw)
To: Kanchan Joshi, axboe, martin.petersen, kbusch, hch, brauner
Cc: llvm, oe-kbuild-all, asml.silence, dw, io-uring, linux-nvme,
linux-block, gost.dev, Kanchan Joshi
Hi Kanchan,
kernel test robot noticed the following build errors:
[auto build test ERROR on 24c3fc5c75c5b9d471783b4a4958748243828613]
url: https://github.com/intel-lab-lkp/linux/commits/Kanchan-Joshi/block-set-bip_vcnt-correctly/20240426-024916
base: 24c3fc5c75c5b9d471783b4a4958748243828613
patch link: https://lore.kernel.org/r/20240425183943.6319-11-joshi.k%40samsung.com
patch subject: [PATCH 10/10] nvme: add separate handling for user integrity buffer
config: x86_64-kexec (https://download.01.org/0day-ci/archive/20240426/[email protected]/config)
compiler: clang version 18.1.4 (https://github.com/llvm/llvm-project e6c3289804a67ea0bb6a86fadbe454dd93b8d855)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240426/[email protected]/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
All errors (new ones prefixed by >>):
>> drivers/nvme/host/core.c:1014:31: error: member reference base type 'void' is not a structure or union
1014 | if (bio_integrity(req->bio)->bip_flags &
| ~~~~~~~~~~~~~~~~~~~~~~~^ ~~~~~~~~~
1 error generated.
vim +/void +1014 drivers/nvme/host/core.c
971
972 static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
973 struct request *req, struct nvme_command *cmnd,
974 enum nvme_opcode op)
975 {
976 u16 control = 0;
977 u32 dsmgmt = 0;
978
979 if (req->cmd_flags & REQ_FUA)
980 control |= NVME_RW_FUA;
981 if (req->cmd_flags & (REQ_FAILFAST_DEV | REQ_RAHEAD))
982 control |= NVME_RW_LR;
983
984 if (req->cmd_flags & REQ_RAHEAD)
985 dsmgmt |= NVME_RW_DSM_FREQ_PREFETCH;
986
987 cmnd->rw.opcode = op;
988 cmnd->rw.flags = 0;
989 cmnd->rw.nsid = cpu_to_le32(ns->head->ns_id);
990 cmnd->rw.cdw2 = 0;
991 cmnd->rw.cdw3 = 0;
992 cmnd->rw.metadata = 0;
993 cmnd->rw.slba =
994 cpu_to_le64(nvme_sect_to_lba(ns->head, blk_rq_pos(req)));
995 cmnd->rw.length =
996 cpu_to_le16((blk_rq_bytes(req) >> ns->head->lba_shift) - 1);
997 cmnd->rw.reftag = 0;
998 cmnd->rw.apptag = 0;
999 cmnd->rw.appmask = 0;
1000
1001 if (ns->head->ms) {
1002 /*
1003 * If formated with metadata, the block layer always provides a
1004 * metadata buffer if CONFIG_BLK_DEV_INTEGRITY is enabled. Else
1005 * we enable the PRACT bit for protection information or set the
1006 * namespace capacity to zero to prevent any I/O.
1007 */
1008 if (!blk_integrity_rq(req)) {
1009 if (WARN_ON_ONCE(!nvme_ns_has_pi(ns->head)))
1010 return BLK_STS_NOTSUPP;
1011 control |= NVME_RW_PRINFO_PRACT;
1012 } else {
1013 /* process user-created integrity */
> 1014 if (bio_integrity(req->bio)->bip_flags &
1015 BIP_INTEGRITY_USER) {
1016 nvme_setup_user_integrity(ns, req, cmnd,
1017 &control);
1018 goto out;
1019 }
1020 }
1021
1022 switch (ns->head->pi_type) {
1023 case NVME_NS_DPS_PI_TYPE3:
1024 control |= NVME_RW_PRINFO_PRCHK_GUARD;
1025 break;
1026 case NVME_NS_DPS_PI_TYPE1:
1027 case NVME_NS_DPS_PI_TYPE2:
1028 control |= NVME_RW_PRINFO_PRCHK_GUARD |
1029 NVME_RW_PRINFO_PRCHK_REF;
1030 if (op == nvme_cmd_zone_append)
1031 control |= NVME_RW_APPEND_PIREMAP;
1032 nvme_set_ref_tag(ns, cmnd, req);
1033 break;
1034 }
1035 }
1036 out:
1037 cmnd->rw.control = cpu_to_le16(control);
1038 cmnd->rw.dsmgmt = cpu_to_le32(dsmgmt);
1039 return 0;
1040 }
1041
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 00/10] Read/Write with meta/integrity
2024-04-25 18:39 ` [PATCH 00/10] Read/Write with meta/integrity Kanchan Joshi
` (9 preceding siblings ...)
[not found] ` <CGME20240425184710epcas5p2968bbc40ed10d1f0184bb511af054fcb@epcas5p2.samsung.com>
@ 2024-04-26 14:19 ` Jens Axboe
10 siblings, 0 replies; 41+ messages in thread
From: Jens Axboe @ 2024-04-26 14:19 UTC (permalink / raw)
To: Kanchan Joshi, martin.petersen, kbusch, hch, brauner
Cc: asml.silence, dw, io-uring, linux-nvme, linux-block, gost.dev
A lot of the initial patches appear to be fixes on the block side,
maybe it'd make more sense to submti those separately?
Comments in other patches coming.
--
Jens Axboe
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 08/10] io_uring/rw: add support to send meta along with read/write
2024-04-25 18:39 ` [PATCH 08/10] io_uring/rw: add support to send meta along with read/write Kanchan Joshi
@ 2024-04-26 14:25 ` Jens Axboe
2024-04-29 20:11 ` Kanchan Joshi
0 siblings, 1 reply; 41+ messages in thread
From: Jens Axboe @ 2024-04-26 14:25 UTC (permalink / raw)
To: Kanchan Joshi, martin.petersen, kbusch, hch, brauner
Cc: asml.silence, dw, io-uring, linux-nvme, linux-block, gost.dev,
Anuj Gupta, Nitesh Shetty
> diff --git a/io_uring/rw.c b/io_uring/rw.c
> index 3134a6ece1be..b2c9ac91d5e5 100644
> --- a/io_uring/rw.c
> +++ b/io_uring/rw.c
> @@ -587,6 +623,8 @@ static int kiocb_done(struct io_kiocb *req, ssize_t ret,
>
> req->flags &= ~REQ_F_REISSUE;
> iov_iter_restore(&io->iter, &io->iter_state);
> + if (unlikely(rw->kiocb.ki_flags & IOCB_USE_META))
> + iov_iter_restore(&io->meta.iter, &io->iter_meta_state);
> return -EAGAIN;
> }
> return IOU_ISSUE_SKIP_COMPLETE;
This puzzles me a bit, why is the restore now dependent on
IOCB_USE_META?
> @@ -768,7 +806,7 @@ static int io_rw_init_file(struct io_kiocb *req, fmode_t mode)
> if (!(req->flags & REQ_F_FIXED_FILE))
> req->flags |= io_file_get_flags(file);
>
> - kiocb->ki_flags = file->f_iocb_flags;
> + kiocb->ki_flags |= file->f_iocb_flags;
> ret = kiocb_set_rw_flags(kiocb, rw->flags);
> if (unlikely(ret))
> return ret;
> @@ -787,7 +825,8 @@ static int io_rw_init_file(struct io_kiocb *req, fmode_t mode)
> if (!(kiocb->ki_flags & IOCB_DIRECT) || !file->f_op->iopoll)
> return -EOPNOTSUPP;
>
> - kiocb->private = NULL;
> + if (likely(!(kiocb->ki_flags & IOCB_USE_META)))
> + kiocb->private = NULL;
> kiocb->ki_flags |= IOCB_HIPRI;
> kiocb->ki_complete = io_complete_rw_iopoll;
> req->iopoll_completed = 0;
Why don't we just set ->private generically earlier, eg like we do for
the ki_flags, rather than have it be a branch in here?
> @@ -853,7 +892,8 @@ static int __io_read(struct io_kiocb *req, unsigned int issue_flags)
> } else if (ret == -EIOCBQUEUED) {
> return IOU_ISSUE_SKIP_COMPLETE;
> } else if (ret == req->cqe.res || ret <= 0 || !force_nonblock ||
> - (req->flags & REQ_F_NOWAIT) || !need_complete_io(req)) {
> + (req->flags & REQ_F_NOWAIT) || !need_complete_io(req) ||
> + (kiocb->ki_flags & IOCB_USE_META)) {
> /* read all, failed, already did sync or don't want to retry */
> goto done;
> }
Would it be cleaner to stuff that IOCB_USE_META check in
need_complete_io(), as that would closer seem to describe why that check
is there in the first place? With a comment.
> @@ -864,6 +904,12 @@ static int __io_read(struct io_kiocb *req, unsigned int issue_flags)
> * manually if we need to.
> */
> iov_iter_restore(&io->iter, &io->iter_state);
> + if (unlikely(kiocb->ki_flags & IOCB_USE_META)) {
> + /* don't handle partial completion for read + meta */
> + if (ret > 0)
> + goto done;
> + iov_iter_restore(&io->meta.iter, &io->iter_meta_state);
> + }
Also seems a bit odd why we need this check here, surely if this is
needed other "don't do retry IOs" conditions would be the same?
> @@ -1053,7 +1099,8 @@ int io_write(struct io_kiocb *req, unsigned int issue_flags)
> if (ret2 == -EAGAIN && (req->ctx->flags & IORING_SETUP_IOPOLL))
> goto ret_eagain;
>
> - if (ret2 != req->cqe.res && ret2 >= 0 && need_complete_io(req)) {
> + if (ret2 != req->cqe.res && ret2 >= 0 && need_complete_io(req)
> + && !(kiocb->ki_flags & IOCB_USE_META)) {
> trace_io_uring_short_write(req->ctx, kiocb->ki_pos - ret2,
> req->cqe.res, ret2);
Same here. Would be nice to integrate this a bit nicer rather than have
a bunch of "oh we also need this extra check here" conditions.
> @@ -1074,12 +1121,33 @@ int io_write(struct io_kiocb *req, unsigned int issue_flags)
> } else {
> ret_eagain:
> iov_iter_restore(&io->iter, &io->iter_state);
> + if (unlikely(kiocb->ki_flags & IOCB_USE_META))
> + iov_iter_restore(&io->meta.iter, &io->iter_meta_state);
> if (kiocb->ki_flags & IOCB_WRITE)
> io_req_end_write(req);
> return -EAGAIN;
> }
> }
Same question here on the (now) conditional restore.
> +int io_rw_meta(struct io_kiocb *req, unsigned int issue_flags)
> +{
> + struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
> + struct io_async_rw *io = req->async_data;
> + struct kiocb *kiocb = &rw->kiocb;
> + int ret;
> +
> + if (!(req->file->f_flags & O_DIRECT))
> + return -EOPNOTSUPP;
Why isn't this just caught at init time when IOCB_DIRECT is checked?
> + kiocb->private = &io->meta;
> + if (req->opcode == IORING_OP_READ_META)
> + ret = io_read(req, issue_flags);
> + else
> + ret = io_write(req, issue_flags);
> +
> + return ret;
> +}
kiocb->private is a bit of an odd beast, and ownership isn't clear at
all. It would make the most sense if the owner of the kiocb (eg io_uring
in this case) owned it, but take a look at eg ocfs2 and see what they do
with it... I think this would blow up as a result.
Outside of that, and with the O_DIRECT thing check fixed, this should
just be two separate functions, one for read and one for write.
--
Jens Axboe
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 09/10] block: add support to send meta buffer
2024-04-25 18:39 ` [PATCH 09/10] block: add support to send meta buffer Kanchan Joshi
@ 2024-04-26 15:21 ` Keith Busch
2024-04-29 11:47 ` Kanchan Joshi
0 siblings, 1 reply; 41+ messages in thread
From: Keith Busch @ 2024-04-26 15:21 UTC (permalink / raw)
To: Kanchan Joshi
Cc: axboe, martin.petersen, hch, brauner, asml.silence, dw, io-uring,
linux-nvme, linux-block, gost.dev, Anuj Gupta
On Fri, Apr 26, 2024 at 12:09:42AM +0530, Kanchan Joshi wrote:
> diff --git a/block/fops.c b/block/fops.c
> index 679d9b752fe8..e488fa66dd60 100644
> --- a/block/fops.c
> +++ b/block/fops.c
> @@ -353,6 +353,15 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
> task_io_account_write(bio->bi_iter.bi_size);
> }
>
> + if (unlikely(iocb->ki_flags & IOCB_USE_META)) {
> + ret = bio_integrity_map_iter(bio, iocb->private);
> + WRITE_ONCE(iocb->private, NULL);
> + if (unlikely(ret)) {
> + bio_put(bio);
> + return ret;
> + }
> + }
Should this also be done for __blkdev_direct_IO and
__blkdev_direct_IO_simple?
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 01/10] block: set bip_vcnt correctly
2024-04-25 18:39 ` [PATCH 01/10] block: set bip_vcnt correctly Kanchan Joshi
@ 2024-04-27 7:02 ` Christoph Hellwig
2024-04-27 14:16 ` Keith Busch
0 siblings, 1 reply; 41+ messages in thread
From: Christoph Hellwig @ 2024-04-27 7:02 UTC (permalink / raw)
To: Kanchan Joshi
Cc: axboe, martin.petersen, kbusch, hch, brauner, asml.silence, dw,
io-uring, linux-nvme, linux-block, gost.dev, Anuj Gupta
On Fri, Apr 26, 2024 at 12:09:34AM +0530, Kanchan Joshi wrote:
> From: Anuj Gupta <[email protected]>
>
> Set the bip_vcnt correctly in bio_integrity_init_user and
> bio_integrity_copy_user. If the bio gets split at a later point,
> this value is required to set the right bip_vcnt in the cloned bio.
>
> Signed-off-by: Anuj Gupta <[email protected]>
> Signed-off-by: Kanchan Joshi <[email protected]>
Looks good:
Reviewed-by: Christoph Hellwig <[email protected]>
Please add a Fixes tag and submit it separately from the features.
I'm actually kinda surprised the direct user mapping of integrity data
survived so far without this.
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 02/10] block: copy bip_max_vcnt vecs instead of bip_vcnt during clone
2024-04-25 18:39 ` [PATCH 02/10] block: copy bip_max_vcnt vecs instead of bip_vcnt during clone Kanchan Joshi
@ 2024-04-27 7:03 ` Christoph Hellwig
2024-04-29 11:28 ` Kanchan Joshi
0 siblings, 1 reply; 41+ messages in thread
From: Christoph Hellwig @ 2024-04-27 7:03 UTC (permalink / raw)
To: Kanchan Joshi
Cc: axboe, martin.petersen, kbusch, hch, brauner, asml.silence, dw,
io-uring, linux-nvme, linux-block, gost.dev, Anuj Gupta
On Fri, Apr 26, 2024 at 12:09:35AM +0530, Kanchan Joshi wrote:
> From: Anuj Gupta <[email protected]>
>
> If bio_integrity_copy_user is used to process the meta buffer, bip_max_vcnt
> is one greater than bip_vcnt. In this case bip_max_vcnt vecs needs to be
> copied to cloned bip.
Can you explain this a bit more? The clone should only allocate what
is actually used, so this leaves be a bit confused.
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 03/10] block: copy result back to user meta buffer correctly in case of split
2024-04-25 18:39 ` [PATCH 03/10] block: copy result back to user meta buffer correctly in case of split Kanchan Joshi
@ 2024-04-27 7:04 ` Christoph Hellwig
0 siblings, 0 replies; 41+ messages in thread
From: Christoph Hellwig @ 2024-04-27 7:04 UTC (permalink / raw)
To: Kanchan Joshi
Cc: axboe, martin.petersen, kbusch, hch, brauner, asml.silence, dw,
io-uring, linux-nvme, linux-block, gost.dev, Anuj Gupta
On Fri, Apr 26, 2024 at 12:09:36AM +0530, Kanchan Joshi wrote:
> From: Anuj Gupta <[email protected]>
>
> In case of split, the len and offset of bvec gets updated in the iter.
> Use it to fetch the right bvec, and copy it back to user meta buffer.
>
> Signed-off-by: Anuj Gupta <[email protected]>
> Signed-off-by: Kanchan Joshi <[email protected]>
Looks good, but needs a fixes tag.
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 04/10] block: avoid unpinning/freeing the bio_vec incase of cloned bio
2024-04-25 18:39 ` [PATCH 04/10] block: avoid unpinning/freeing the bio_vec incase of cloned bio Kanchan Joshi
@ 2024-04-27 7:05 ` Christoph Hellwig
2024-04-29 11:40 ` Kanchan Joshi
0 siblings, 1 reply; 41+ messages in thread
From: Christoph Hellwig @ 2024-04-27 7:05 UTC (permalink / raw)
To: Kanchan Joshi
Cc: axboe, martin.petersen, kbusch, hch, brauner, asml.silence, dw,
io-uring, linux-nvme, linux-block, gost.dev, Anuj Gupta
On Fri, Apr 26, 2024 at 12:09:37AM +0530, Kanchan Joshi wrote:
> From: Anuj Gupta <[email protected]>
>
> Do it only once when the parent bio completes.
>
> Signed-off-by: Anuj Gupta <[email protected]>
> Signed-off-by: Kanchan Joshi <[email protected]>
> ---
> block/bio-integrity.c | 10 +++++++---
> 1 file changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/block/bio-integrity.c b/block/bio-integrity.c
> index b4042414a08f..b698eb77515d 100644
> --- a/block/bio-integrity.c
> +++ b/block/bio-integrity.c
> @@ -119,7 +119,8 @@ static void bio_integrity_uncopy_user(struct bio_integrity_payload *bip)
> ret = copy_to_iter(bvec_virt(&src_bvec), bytes, &iter);
> WARN_ON_ONCE(ret != bytes);
>
> - bio_integrity_unpin_bvec(copy, nr_vecs, true);
> + if (!bio_flagged((bip->bip_bio), BIO_CLONED))
> + bio_integrity_unpin_bvec(copy, nr_vecs, true);
> }
This feels wrong. I suspect the problem is that BIP_COPY_USER is
inherited for clone bios while it shouldn't.
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 05/10] block, nvme: modify rq_integrity_vec function
2024-04-25 18:39 ` [PATCH 05/10] block, nvme: modify rq_integrity_vec function Kanchan Joshi
@ 2024-04-27 7:18 ` Christoph Hellwig
2024-04-29 11:34 ` Kanchan Joshi
0 siblings, 1 reply; 41+ messages in thread
From: Christoph Hellwig @ 2024-04-27 7:18 UTC (permalink / raw)
To: Kanchan Joshi
Cc: axboe, martin.petersen, kbusch, hch, brauner, asml.silence, dw,
io-uring, linux-nvme, linux-block, gost.dev, Anuj Gupta
The subjet is about as useless as it gets :)
The essence is that it should take the iter into account, so name that.
> --- a/include/linux/blk-integrity.h
> +++ b/include/linux/blk-integrity.h
> @@ -109,11 +109,12 @@ static inline bool blk_integrity_rq(struct request *rq)
> * Return the first bvec that contains integrity data. Only drivers that are
> * limited to a single integrity segment should use this helper.
> */
The comment really needs an update. With this rq_integrity_vec now
is a "normal" iter based operation, that can actually be used by
drivers with multiple integrity segments if it is properly advanced.
> +static inline struct bio_vec rq_integrity_vec(struct request *rq)
> {
> - return NULL;
> + return (struct bio_vec){0};
No need for the 0 here.
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 06/10] block: modify bio_integrity_map_user argument
2024-04-25 18:39 ` [PATCH 06/10] block: modify bio_integrity_map_user argument Kanchan Joshi
@ 2024-04-27 7:19 ` Christoph Hellwig
0 siblings, 0 replies; 41+ messages in thread
From: Christoph Hellwig @ 2024-04-27 7:19 UTC (permalink / raw)
To: Kanchan Joshi
Cc: axboe, martin.petersen, kbusch, hch, brauner, asml.silence, dw,
io-uring, linux-nvme, linux-block, gost.dev, Anuj Gupta
On Fri, Apr 26, 2024 at 12:09:39AM +0530, Kanchan Joshi wrote:
> From: Anuj Gupta <[email protected]>
>
> This patch refactors bio_integrity_map_user to accept iov_iter as
> argument. This is a prep patch.
Please put the fact that you pass the iov_iter into the subject, and
use the main commit message to explain why.
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 01/10] block: set bip_vcnt correctly
2024-04-27 7:02 ` Christoph Hellwig
@ 2024-04-27 14:16 ` Keith Busch
2024-04-29 10:59 ` Kanchan Joshi
2024-05-01 7:45 ` Christoph Hellwig
0 siblings, 2 replies; 41+ messages in thread
From: Keith Busch @ 2024-04-27 14:16 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Kanchan Joshi, axboe, martin.petersen, brauner, asml.silence, dw,
io-uring, linux-nvme, linux-block, gost.dev, Anuj Gupta
On Sat, Apr 27, 2024 at 09:02:14AM +0200, Christoph Hellwig wrote:
> On Fri, Apr 26, 2024 at 12:09:34AM +0530, Kanchan Joshi wrote:
> > From: Anuj Gupta <[email protected]>
> >
> > Set the bip_vcnt correctly in bio_integrity_init_user and
> > bio_integrity_copy_user. If the bio gets split at a later point,
> > this value is required to set the right bip_vcnt in the cloned bio.
> >
> > Signed-off-by: Anuj Gupta <[email protected]>
> > Signed-off-by: Kanchan Joshi <[email protected]>
>
> Looks good:
>
> Reviewed-by: Christoph Hellwig <[email protected]>
>
> Please add a Fixes tag and submit it separately from the features.
>
> I'm actually kinda surprised the direct user mapping of integrity data
> survived so far without this.
The only existing use case for user metadata is REQ_OP_DRV_IN/OUT, which
never splits, so these initial fixes only really matter after this
series adds new usage for generic READ/WRITE.
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 01/10] block: set bip_vcnt correctly
2024-04-27 14:16 ` Keith Busch
@ 2024-04-29 10:59 ` Kanchan Joshi
2024-05-01 7:45 ` Christoph Hellwig
1 sibling, 0 replies; 41+ messages in thread
From: Kanchan Joshi @ 2024-04-29 10:59 UTC (permalink / raw)
To: Keith Busch, Christoph Hellwig
Cc: axboe, martin.petersen, brauner, asml.silence, dw, io-uring,
linux-nvme, linux-block, gost.dev, Anuj Gupta
On 4/27/2024 7:46 PM, Keith Busch wrote:
>> Looks good:
>>
>> Reviewed-by: Christoph Hellwig<[email protected]>
>>
>> Please add a Fixes tag and submit it separately from the features.
>>
>> I'm actually kinda surprised the direct user mapping of integrity data
>> survived so far without this.
> The only existing use case for user metadata is REQ_OP_DRV_IN/OUT, which
> never splits, so these initial fixes only really matter after this
> series adds new usage for generic READ/WRITE.
>
Yes. It did not seem that there is any harm due to these missing pieces
(first 6 patches).
Therefore, "Fixes" tag is not there, and patches were not sent
separately from the features.
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 02/10] block: copy bip_max_vcnt vecs instead of bip_vcnt during clone
2024-04-27 7:03 ` Christoph Hellwig
@ 2024-04-29 11:28 ` Kanchan Joshi
2024-04-29 12:04 ` Keith Busch
2024-05-01 7:50 ` Christoph Hellwig
0 siblings, 2 replies; 41+ messages in thread
From: Kanchan Joshi @ 2024-04-29 11:28 UTC (permalink / raw)
To: Christoph Hellwig
Cc: axboe, martin.petersen, kbusch, brauner, asml.silence, dw,
io-uring, linux-nvme, linux-block, gost.dev, Anuj Gupta
On 4/27/2024 12:33 PM, Christoph Hellwig wrote:
>> If bio_integrity_copy_user is used to process the meta buffer, bip_max_vcnt
>> is one greater than bip_vcnt. In this case bip_max_vcnt vecs needs to be
>> copied to cloned bip.
> Can you explain this a bit more? The clone should only allocate what
> is actually used, so this leaves be a bit confused.
>
Will expand the commit description.
Usually the meta buffer is pinned and used directly (say N bio vecs).
In case kernel has to make a copy (in bio_integrity_copy_user), it
factors these N vecs in, and one extra for the bounce buffer.
So for read IO, bip_max_vcnt is N+1, while bip_vcnt is N.
The clone bio also needs to be aware of all N+1 vecs, so that we can
copy the data from the bounce buffer to pinned user pages correctly
during read-completion.
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 05/10] block, nvme: modify rq_integrity_vec function
2024-04-27 7:18 ` Christoph Hellwig
@ 2024-04-29 11:34 ` Kanchan Joshi
2024-04-29 17:11 ` Christoph Hellwig
0 siblings, 1 reply; 41+ messages in thread
From: Kanchan Joshi @ 2024-04-29 11:34 UTC (permalink / raw)
To: Christoph Hellwig
Cc: axboe, martin.petersen, kbusch, brauner, asml.silence, dw,
io-uring, linux-nvme, linux-block, gost.dev, Anuj Gupta
On 4/27/2024 12:48 PM, Christoph Hellwig wrote:
> The subjet is about as useless as it gets :)
>
> The essence is that it should take the iter into account, so name that.
Sure.
>> --- a/include/linux/blk-integrity.h
>> +++ b/include/linux/blk-integrity.h
>> @@ -109,11 +109,12 @@ static inline bool blk_integrity_rq(struct request *rq)
>> * Return the first bvec that contains integrity data. Only drivers that are
>> * limited to a single integrity segment should use this helper.
>> */
>
> The comment really needs an update. With this rq_integrity_vec now
> is a "normal" iter based operation, that can actually be used by
> drivers with multiple integrity segments if it is properly advanced.
Right, will update.
>> +static inline struct bio_vec rq_integrity_vec(struct request *rq)
>> {
>> - return NULL;
>> + return (struct bio_vec){0};
>
> No need for the 0 here.
Um, I did not follow. Need that to keep the compiler happy.
Do you suggest to change the prototype.
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 04/10] block: avoid unpinning/freeing the bio_vec incase of cloned bio
2024-04-27 7:05 ` Christoph Hellwig
@ 2024-04-29 11:40 ` Kanchan Joshi
2024-04-29 17:09 ` Christoph Hellwig
0 siblings, 1 reply; 41+ messages in thread
From: Kanchan Joshi @ 2024-04-29 11:40 UTC (permalink / raw)
To: Christoph Hellwig
Cc: axboe, martin.petersen, kbusch, brauner, asml.silence, dw,
io-uring, linux-nvme, linux-block, gost.dev, Anuj Gupta
On 4/27/2024 12:35 PM, Christoph Hellwig wrote:
> On Fri, Apr 26, 2024 at 12:09:37AM +0530, Kanchan Joshi wrote:
>> From: Anuj Gupta <[email protected]>
>>
>> Do it only once when the parent bio completes.
>>
>> Signed-off-by: Anuj Gupta <[email protected]>
>> Signed-off-by: Kanchan Joshi <[email protected]>
>> ---
>> block/bio-integrity.c | 10 +++++++---
>> 1 file changed, 7 insertions(+), 3 deletions(-)
>>
>> diff --git a/block/bio-integrity.c b/block/bio-integrity.c
>> index b4042414a08f..b698eb77515d 100644
>> --- a/block/bio-integrity.c
>> +++ b/block/bio-integrity.c
>> @@ -119,7 +119,8 @@ static void bio_integrity_uncopy_user(struct bio_integrity_payload *bip)
>> ret = copy_to_iter(bvec_virt(&src_bvec), bytes, &iter);
>> WARN_ON_ONCE(ret != bytes);
>>
>> - bio_integrity_unpin_bvec(copy, nr_vecs, true);
>> + if (!bio_flagged((bip->bip_bio), BIO_CLONED))
>> + bio_integrity_unpin_bvec(copy, nr_vecs, true);
>> }
>
> This feels wrong. I suspect the problem is that BIP_COPY_USER is
> inherited for clone bios while it shouldn't.
>
But BIP_COPY_USER flag is really required in the clone bio. So that we
can copy the subset of the metadata back (from kernel bounce buffer to
user space pinned buffer) in case of read io.
Overall, copy-back will happen in installments (for each cloned bio),
while the unpin will happen in one shot (for the source bio).
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 09/10] block: add support to send meta buffer
2024-04-26 15:21 ` Keith Busch
@ 2024-04-29 11:47 ` Kanchan Joshi
0 siblings, 0 replies; 41+ messages in thread
From: Kanchan Joshi @ 2024-04-29 11:47 UTC (permalink / raw)
To: Keith Busch
Cc: axboe, martin.petersen, hch, brauner, asml.silence, dw, io-uring,
linux-nvme, linux-block, gost.dev, Anuj Gupta
On 4/26/2024 8:51 PM, Keith Busch wrote:
> Should this also be done for __blkdev_direct_IO and
> __blkdev_direct_IO_simple?
Right, this needs to be done for __blkdev_direct_IO.
But not for __blkdev_direct_IO_simple as that is for sync io. And we are
wiring this interface up only for async io.
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 02/10] block: copy bip_max_vcnt vecs instead of bip_vcnt during clone
2024-04-29 11:28 ` Kanchan Joshi
@ 2024-04-29 12:04 ` Keith Busch
2024-04-29 17:07 ` Christoph Hellwig
2024-05-01 7:50 ` Christoph Hellwig
1 sibling, 1 reply; 41+ messages in thread
From: Keith Busch @ 2024-04-29 12:04 UTC (permalink / raw)
To: Kanchan Joshi
Cc: Christoph Hellwig, axboe, martin.petersen, brauner, asml.silence,
dw, io-uring, linux-nvme, linux-block, gost.dev, Anuj Gupta
On Mon, Apr 29, 2024 at 04:58:37PM +0530, Kanchan Joshi wrote:
> On 4/27/2024 12:33 PM, Christoph Hellwig wrote:
> >> If bio_integrity_copy_user is used to process the meta buffer, bip_max_vcnt
> >> is one greater than bip_vcnt. In this case bip_max_vcnt vecs needs to be
> >> copied to cloned bip.
> > Can you explain this a bit more? The clone should only allocate what
> > is actually used, so this leaves be a bit confused.
> >
>
> Will expand the commit description.
>
> Usually the meta buffer is pinned and used directly (say N bio vecs).
> In case kernel has to make a copy (in bio_integrity_copy_user), it
> factors these N vecs in, and one extra for the bounce buffer.
> So for read IO, bip_max_vcnt is N+1, while bip_vcnt is N.
>
> The clone bio also needs to be aware of all N+1 vecs, so that we can
> copy the data from the bounce buffer to pinned user pages correctly
> during read-completion.
An earlier version added a field in the bip to point to the original
bvec from the user address. That extra field wouldn't be used in the far
majority of cases, so moving the user bvec to the end of the existing
bip_vec is a spatial optimization. The code may look a little more
confusing that way, but I think it's better than making the bip bigger.
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 02/10] block: copy bip_max_vcnt vecs instead of bip_vcnt during clone
2024-04-29 12:04 ` Keith Busch
@ 2024-04-29 17:07 ` Christoph Hellwig
2024-04-30 8:25 ` Keith Busch
0 siblings, 1 reply; 41+ messages in thread
From: Christoph Hellwig @ 2024-04-29 17:07 UTC (permalink / raw)
To: Keith Busch
Cc: Kanchan Joshi, Christoph Hellwig, axboe, martin.petersen, brauner,
asml.silence, dw, io-uring, linux-nvme, linux-block, gost.dev,
Anuj Gupta
On Mon, Apr 29, 2024 at 01:04:12PM +0100, Keith Busch wrote:
> An earlier version added a field in the bip to point to the original
> bvec from the user address. That extra field wouldn't be used in the far
> majority of cases, so moving the user bvec to the end of the existing
> bip_vec is a spatial optimization. The code may look a little more
> confusing that way, but I think it's better than making the bip bigger.
I think we need to do something like that - just hiding the bounce
buffer is not really maintainable once we get multiple levels of stacking
and other creative bio cloning.
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 04/10] block: avoid unpinning/freeing the bio_vec incase of cloned bio
2024-04-29 11:40 ` Kanchan Joshi
@ 2024-04-29 17:09 ` Christoph Hellwig
2024-05-01 13:02 ` Kanchan Joshi
0 siblings, 1 reply; 41+ messages in thread
From: Christoph Hellwig @ 2024-04-29 17:09 UTC (permalink / raw)
To: Kanchan Joshi
Cc: Christoph Hellwig, axboe, martin.petersen, kbusch, brauner,
asml.silence, dw, io-uring, linux-nvme, linux-block, gost.dev,
Anuj Gupta
On Mon, Apr 29, 2024 at 05:10:59PM +0530, Kanchan Joshi wrote:
> > This feels wrong. I suspect the problem is that BIP_COPY_USER is
> > inherited for clone bios while it shouldn't.
> >
>
> But BIP_COPY_USER flag is really required in the clone bio. So that we
> can copy the subset of the metadata back (from kernel bounce buffer to
> user space pinned buffer) in case of read io.
>
> Overall, copy-back will happen in installments (for each cloned bio),
> while the unpin will happen in one shot (for the source bio).
That seems a bit odd compared to the bio data path. If you think this
is better than the version used in the data path let's convert the
data path to this scheme first to make sure we don't diverge and get
the far better testing on the main data map side.
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 05/10] block, nvme: modify rq_integrity_vec function
2024-04-29 11:34 ` Kanchan Joshi
@ 2024-04-29 17:11 ` Christoph Hellwig
0 siblings, 0 replies; 41+ messages in thread
From: Christoph Hellwig @ 2024-04-29 17:11 UTC (permalink / raw)
To: Kanchan Joshi
Cc: Christoph Hellwig, axboe, martin.petersen, kbusch, brauner,
asml.silence, dw, io-uring, linux-nvme, linux-block, gost.dev,
Anuj Gupta
On Mon, Apr 29, 2024 at 05:04:30PM +0530, Kanchan Joshi wrote:
> >> + return (struct bio_vec){0};
> >
> > No need for the 0 here.
> Um, I did not follow. Need that to keep the compiler happy.
> Do you suggest to change the prototype.
Just removing the NULL as in:
return (struct bio_vec){ };
compiles just fine here, and at least for non-anonymous struct
initializers we use it all the time.
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 08/10] io_uring/rw: add support to send meta along with read/write
2024-04-26 14:25 ` Jens Axboe
@ 2024-04-29 20:11 ` Kanchan Joshi
0 siblings, 0 replies; 41+ messages in thread
From: Kanchan Joshi @ 2024-04-29 20:11 UTC (permalink / raw)
To: Jens Axboe, martin.petersen, kbusch, hch, brauner
Cc: asml.silence, dw, io-uring, linux-nvme, linux-block, gost.dev,
Anuj Gupta, Nitesh Shetty
On 4/26/2024 7:55 PM, Jens Axboe wrote:
>> diff --git a/io_uring/rw.c b/io_uring/rw.c
>> index 3134a6ece1be..b2c9ac91d5e5 100644
>> --- a/io_uring/rw.c
>> +++ b/io_uring/rw.c
>> @@ -587,6 +623,8 @@ static int kiocb_done(struct io_kiocb *req, ssize_t ret,
>>
>> req->flags &= ~REQ_F_REISSUE;
>> iov_iter_restore(&io->iter, &io->iter_state);
>> + if (unlikely(rw->kiocb.ki_flags & IOCB_USE_META))
>> + iov_iter_restore(&io->meta.iter, &io->iter_meta_state);
>> return -EAGAIN;
>> }
>> return IOU_ISSUE_SKIP_COMPLETE;
> This puzzles me a bit, why is the restore now dependent on
> IOCB_USE_META?
Both saving/restore for meta is under this condition (so seemed natural).
Also, to avoid growing "struct io_async_rw" too much, this patch keeps
keeps meta/iter_meta_state in the same memory as wpq. So doing this
unconditionally can corrupt wpq for buffered io.
>> @@ -768,7 +806,7 @@ static int io_rw_init_file(struct io_kiocb *req, fmode_t mode)
>> if (!(req->flags & REQ_F_FIXED_FILE))
>> req->flags |= io_file_get_flags(file);
>>
>> - kiocb->ki_flags = file->f_iocb_flags;
>> + kiocb->ki_flags |= file->f_iocb_flags;
>> ret = kiocb_set_rw_flags(kiocb, rw->flags);
>> if (unlikely(ret))
>> return ret;
>> @@ -787,7 +825,8 @@ static int io_rw_init_file(struct io_kiocb *req, fmode_t mode)
>> if (!(kiocb->ki_flags & IOCB_DIRECT) || !file->f_op->iopoll)
>> return -EOPNOTSUPP;
>>
>> - kiocb->private = NULL;
>> + if (likely(!(kiocb->ki_flags & IOCB_USE_META)))
>> + kiocb->private = NULL;
>> kiocb->ki_flags |= IOCB_HIPRI;
>> kiocb->ki_complete = io_complete_rw_iopoll;
>> req->iopoll_completed = 0;
>
> Why don't we just set ->private generically earlier, eg like we do for
> the ki_flags, rather than have it be a branch in here?
Not sure if I am missing what you have in mind.
But kiocb->private was set before we reached to this point (in
io_rw_meta). So we don't overwrite that here.
>> @@ -853,7 +892,8 @@ static int __io_read(struct io_kiocb *req, unsigned int issue_flags)
>> } else if (ret == -EIOCBQUEUED) {
>> return IOU_ISSUE_SKIP_COMPLETE;
>> } else if (ret == req->cqe.res || ret <= 0 || !force_nonblock ||
>> - (req->flags & REQ_F_NOWAIT) || !need_complete_io(req)) {
>> + (req->flags & REQ_F_NOWAIT) || !need_complete_io(req) ||
>> + (kiocb->ki_flags & IOCB_USE_META)) {
>> /* read all, failed, already did sync or don't want to retry */
>> goto done;
>> }
>
> Would it be cleaner to stuff that IOCB_USE_META check in
> need_complete_io(), as that would closer seem to describe why that check
> is there in the first place? With a comment.
Yes, will do.
>> @@ -864,6 +904,12 @@ static int __io_read(struct io_kiocb *req, unsigned int issue_flags)
>> * manually if we need to.
>> */
>> iov_iter_restore(&io->iter, &io->iter_state);
>> + if (unlikely(kiocb->ki_flags & IOCB_USE_META)) {
>> + /* don't handle partial completion for read + meta */
>> + if (ret > 0)
>> + goto done;
>> + iov_iter_restore(&io->meta.iter, &io->iter_meta_state);
>> + }
>
> Also seems a bit odd why we need this check here, surely if this is
> needed other "don't do retry IOs" conditions would be the same?
Yes, will revisit.
>> @@ -1053,7 +1099,8 @@ int io_write(struct io_kiocb *req, unsigned int issue_flags)
>> if (ret2 == -EAGAIN && (req->ctx->flags & IORING_SETUP_IOPOLL))
>> goto ret_eagain;
>>
>> - if (ret2 != req->cqe.res && ret2 >= 0 && need_complete_io(req)) {
>> + if (ret2 != req->cqe.res && ret2 >= 0 && need_complete_io(req)
>> + && !(kiocb->ki_flags & IOCB_USE_META)) {
>> trace_io_uring_short_write(req->ctx, kiocb->ki_pos - ret2,
>> req->cqe.res, ret2);
>
> Same here. Would be nice to integrate this a bit nicer rather than have
> a bunch of "oh we also need this extra check here" conditions.
Will look into this too.
>> @@ -1074,12 +1121,33 @@ int io_write(struct io_kiocb *req, unsigned int issue_flags)
>> } else {
>> ret_eagain:
>> iov_iter_restore(&io->iter, &io->iter_state);
>> + if (unlikely(kiocb->ki_flags & IOCB_USE_META))
>> + iov_iter_restore(&io->meta.iter, &io->iter_meta_state);
>> if (kiocb->ki_flags & IOCB_WRITE)
>> io_req_end_write(req);
>> return -EAGAIN;
>> }
>> }
>
> Same question here on the (now) conditional restore.
Did not get the concern. Do you prefer it unconditional.
>> +int io_rw_meta(struct io_kiocb *req, unsigned int issue_flags)
>> +{
>> + struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
>> + struct io_async_rw *io = req->async_data;
>> + struct kiocb *kiocb = &rw->kiocb;
>> + int ret;
>> +
>> + if (!(req->file->f_flags & O_DIRECT))
>> + return -EOPNOTSUPP;
>
> Why isn't this just caught at init time when IOCB_DIRECT is checked?
io_rw_init_file() gets invoked after this, and IOCB_DIRECT check is only
for IOPOLL situation. We want to check/fail it regardless of IOPOLL.
>
>> + kiocb->private = &io->meta;
>> + if (req->opcode == IORING_OP_READ_META)
>> + ret = io_read(req, issue_flags);
>> + else
>> + ret = io_write(req, issue_flags);
>> +
>> + return ret;
>> +}
>
> kiocb->private is a bit of an odd beast, and ownership isn't clear at
> all. It would make the most sense if the owner of the kiocb (eg io_uring
> in this case) owned it, but take a look at eg ocfs2 and see what they do
> with it... I think this would blow up as a result.
Yes, ocfs2 is making use of kiocb->private. But seems that's fine. In
io_uring we use the field only to send the information down. ocfs2 (or
anything else unaware of this interface) may just overwrite the
kiocb->private.
If the lower layer want to support meta exchange, it is supposed to
extract meta-descriptor from kiocb->private before altering it.
This case is same for block direct path too when we are doing polled io.
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 02/10] block: copy bip_max_vcnt vecs instead of bip_vcnt during clone
2024-04-29 17:07 ` Christoph Hellwig
@ 2024-04-30 8:25 ` Keith Busch
2024-05-01 7:46 ` Christoph Hellwig
0 siblings, 1 reply; 41+ messages in thread
From: Keith Busch @ 2024-04-30 8:25 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Kanchan Joshi, axboe, martin.petersen, brauner, asml.silence, dw,
io-uring, linux-nvme, linux-block, gost.dev, Anuj Gupta
On Mon, Apr 29, 2024 at 07:07:29PM +0200, Christoph Hellwig wrote:
> On Mon, Apr 29, 2024 at 01:04:12PM +0100, Keith Busch wrote:
> > An earlier version added a field in the bip to point to the original
> > bvec from the user address. That extra field wouldn't be used in the far
> > majority of cases, so moving the user bvec to the end of the existing
> > bip_vec is a spatial optimization. The code may look a little more
> > confusing that way, but I think it's better than making the bip bigger.
>
> I think we need to do something like that - just hiding the bounce
> buffer is not really maintainable once we get multiple levels of stacking
> and other creative bio cloning.
Not sure I follow that. From patches 2-4 here, I think that pretty much
covers it. It's just missing a good code comment, but the implementation
side looks complete for any amount of stacking and splitting.
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 01/10] block: set bip_vcnt correctly
2024-04-27 14:16 ` Keith Busch
2024-04-29 10:59 ` Kanchan Joshi
@ 2024-05-01 7:45 ` Christoph Hellwig
2024-05-01 8:03 ` Keith Busch
1 sibling, 1 reply; 41+ messages in thread
From: Christoph Hellwig @ 2024-05-01 7:45 UTC (permalink / raw)
To: Keith Busch
Cc: Christoph Hellwig, Kanchan Joshi, axboe, martin.petersen, brauner,
asml.silence, dw, io-uring, linux-nvme, linux-block, gost.dev,
Anuj Gupta
On Sat, Apr 27, 2024 at 08:16:52AM -0600, Keith Busch wrote:
> > Please add a Fixes tag and submit it separately from the features.
> >
> > I'm actually kinda surprised the direct user mapping of integrity data
> > survived so far without this.
>
> The only existing use case for user metadata is REQ_OP_DRV_IN/OUT, which
> never splits, so these initial fixes only really matter after this
> series adds new usage for generic READ/WRITE.
Well, it matters to keep our contract up, even if we're not hitting it.
And apparently another user just came out of the woods in dm land..
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 02/10] block: copy bip_max_vcnt vecs instead of bip_vcnt during clone
2024-04-30 8:25 ` Keith Busch
@ 2024-05-01 7:46 ` Christoph Hellwig
0 siblings, 0 replies; 41+ messages in thread
From: Christoph Hellwig @ 2024-05-01 7:46 UTC (permalink / raw)
To: Keith Busch
Cc: Christoph Hellwig, Kanchan Joshi, axboe, martin.petersen, brauner,
asml.silence, dw, io-uring, linux-nvme, linux-block, gost.dev,
Anuj Gupta
On Tue, Apr 30, 2024 at 09:25:38AM +0100, Keith Busch wrote:
> > I think we need to do something like that - just hiding the bounce
> > buffer is not really maintainable once we get multiple levels of stacking
> > and other creative bio cloning.
>
> Not sure I follow that. From patches 2-4 here, I think that pretty much
> covers it. It's just missing a good code comment, but the implementation
> side looks complete for any amount of stacking and splitting.
I can't see how it would work, but I'll gladly wait for a better
description.
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 02/10] block: copy bip_max_vcnt vecs instead of bip_vcnt during clone
2024-04-29 11:28 ` Kanchan Joshi
2024-04-29 12:04 ` Keith Busch
@ 2024-05-01 7:50 ` Christoph Hellwig
1 sibling, 0 replies; 41+ messages in thread
From: Christoph Hellwig @ 2024-05-01 7:50 UTC (permalink / raw)
To: Kanchan Joshi
Cc: Christoph Hellwig, axboe, martin.petersen, kbusch, brauner,
asml.silence, dw, io-uring, linux-nvme, linux-block, gost.dev,
Anuj Gupta
On Mon, Apr 29, 2024 at 04:58:37PM +0530, Kanchan Joshi wrote:
> On 4/27/2024 12:33 PM, Christoph Hellwig wrote:
> >> If bio_integrity_copy_user is used to process the meta buffer, bip_max_vcnt
> >> is one greater than bip_vcnt. In this case bip_max_vcnt vecs needs to be
> >> copied to cloned bip.
> > Can you explain this a bit more? The clone should only allocate what
> > is actually used, so this leaves be a bit confused.
> >
>
> Will expand the commit description.
>
> Usually the meta buffer is pinned and used directly (say N bio vecs).
> In case kernel has to make a copy (in bio_integrity_copy_user), it
> factors these N vecs in, and one extra for the bounce buffer.
> So for read IO, bip_max_vcnt is N+1, while bip_vcnt is N.
>
> The clone bio also needs to be aware of all N+1 vecs, so that we can
> copy the data from the bounce buffer to pinned user pages correctly
> during read-completion.
No. The underlying layer below the clone/split/etc should never have
to care about your bounce buffer. The bvecs are just data containers,
and if they are mapped, copied or used in any other way should remain
entirely encapsulated in the caller.
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 01/10] block: set bip_vcnt correctly
2024-05-01 7:45 ` Christoph Hellwig
@ 2024-05-01 8:03 ` Keith Busch
0 siblings, 0 replies; 41+ messages in thread
From: Keith Busch @ 2024-05-01 8:03 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Kanchan Joshi, axboe, martin.petersen, brauner, asml.silence, dw,
io-uring, linux-nvme, linux-block, gost.dev, Anuj Gupta
On Wed, May 01, 2024 at 09:45:44AM +0200, Christoph Hellwig wrote:
> On Sat, Apr 27, 2024 at 08:16:52AM -0600, Keith Busch wrote:
> > > Please add a Fixes tag and submit it separately from the features.
> > >
> > > I'm actually kinda surprised the direct user mapping of integrity data
> > > survived so far without this.
> >
> > The only existing use case for user metadata is REQ_OP_DRV_IN/OUT, which
> > never splits, so these initial fixes only really matter after this
> > series adds new usage for generic READ/WRITE.
>
> Well, it matters to keep our contract up, even if we're not hitting it.
>
> And apparently another user just came out of the woods in dm land..
But the bug report from dm has nothing to do with user mapped metadata.
That bug existed before that was added, so yeah, patch 5 from this
series (or something like it) should be applied on its own.
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 04/10] block: avoid unpinning/freeing the bio_vec incase of cloned bio
2024-04-29 17:09 ` Christoph Hellwig
@ 2024-05-01 13:02 ` Kanchan Joshi
2024-05-02 7:12 ` Christoph Hellwig
0 siblings, 1 reply; 41+ messages in thread
From: Kanchan Joshi @ 2024-05-01 13:02 UTC (permalink / raw)
To: Christoph Hellwig
Cc: axboe, martin.petersen, kbusch, brauner, asml.silence, dw,
io-uring, linux-nvme, linux-block, gost.dev, Anuj Gupta
On 4/29/2024 10:39 PM, Christoph Hellwig wrote:
> On Mon, Apr 29, 2024 at 05:10:59PM +0530, Kanchan Joshi wrote:
>>> This feels wrong. I suspect the problem is that BIP_COPY_USER is
>>> inherited for clone bios while it shouldn't.
>>>
>>
>> But BIP_COPY_USER flag is really required in the clone bio. So that we
>> can copy the subset of the metadata back (from kernel bounce buffer to
>> user space pinned buffer) in case of read io.
>>
>> Overall, copy-back will happen in installments (for each cloned bio),
>> while the unpin will happen in one shot (for the source bio).
>
> That seems a bit odd compared to the bio data path. If you think this
> is better than the version used in the data path let's convert the
> data path to this scheme first to make sure we don't diverge and get
> the far better testing on the main data map side.
>
Can you please tell what function(s) in bio data path that need this
conversion?
To me data path handling seems similar. Each cloned bio will lead to
some amount of data transfer to pinned user-memory. The same is
happening for meta transfer here.
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 04/10] block: avoid unpinning/freeing the bio_vec incase of cloned bio
2024-05-01 13:02 ` Kanchan Joshi
@ 2024-05-02 7:12 ` Christoph Hellwig
2024-05-03 12:01 ` Kanchan Joshi
0 siblings, 1 reply; 41+ messages in thread
From: Christoph Hellwig @ 2024-05-02 7:12 UTC (permalink / raw)
To: Kanchan Joshi
Cc: Christoph Hellwig, axboe, martin.petersen, kbusch, brauner,
asml.silence, dw, io-uring, linux-nvme, linux-block, gost.dev,
Anuj Gupta
On Wed, May 01, 2024 at 06:32:45PM +0530, Kanchan Joshi wrote:
> Can you please tell what function(s) in bio data path that need this
> conversion?
> To me data path handling seems similar. Each cloned bio will lead to
> some amount of data transfer to pinned user-memory. The same is
> happening for meta transfer here.
Well, everywhere. e.g. for direct I/O everything is just driven from
the fs/direct-io.c and and fs/iomap/direct-io.c code without any
knowledge in the underlying driver if data has been pinned (no bounce
buffering in this case). Or for passthrough I/O none of the underlying
logic knows about the pinning or bounce buffering, everything is handled
in block/blk-map.c.
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 04/10] block: avoid unpinning/freeing the bio_vec incase of cloned bio
2024-05-02 7:12 ` Christoph Hellwig
@ 2024-05-03 12:01 ` Kanchan Joshi
0 siblings, 0 replies; 41+ messages in thread
From: Kanchan Joshi @ 2024-05-03 12:01 UTC (permalink / raw)
To: Christoph Hellwig
Cc: axboe, martin.petersen, kbusch, brauner, asml.silence, dw,
io-uring, linux-nvme, linux-block, gost.dev, Anuj Gupta
On 5/2/2024 12:42 PM, Christoph Hellwig wrote:
> On Wed, May 01, 2024 at 06:32:45PM +0530, Kanchan Joshi wrote:
>> Can you please tell what function(s) in bio data path that need this
>> conversion?
>> To me data path handling seems similar. Each cloned bio will lead to
>> some amount of data transfer to pinned user-memory. The same is
>> happening for meta transfer here.
> Well, everywhere.
Next version will not inherit BIP_COPY_USER for clone bios.
^ permalink raw reply [flat|nested] 41+ messages in thread
end of thread, other threads:[~2024-05-03 12:01 UTC | newest]
Thread overview: 41+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <CGME20240425184649epcas5p42f6ddbfb1c579f043a919973c70ebd03@epcas5p4.samsung.com>
2024-04-25 18:39 ` [PATCH 00/10] Read/Write with meta/integrity Kanchan Joshi
[not found] ` <CGME20240425184651epcas5p3404f2390d6cf05148eb96e1af093e7bc@epcas5p3.samsung.com>
2024-04-25 18:39 ` [PATCH 01/10] block: set bip_vcnt correctly Kanchan Joshi
2024-04-27 7:02 ` Christoph Hellwig
2024-04-27 14:16 ` Keith Busch
2024-04-29 10:59 ` Kanchan Joshi
2024-05-01 7:45 ` Christoph Hellwig
2024-05-01 8:03 ` Keith Busch
[not found] ` <CGME20240425184653epcas5p28de1473090e0141ae74f8b0a6eb921a7@epcas5p2.samsung.com>
2024-04-25 18:39 ` [PATCH 02/10] block: copy bip_max_vcnt vecs instead of bip_vcnt during clone Kanchan Joshi
2024-04-27 7:03 ` Christoph Hellwig
2024-04-29 11:28 ` Kanchan Joshi
2024-04-29 12:04 ` Keith Busch
2024-04-29 17:07 ` Christoph Hellwig
2024-04-30 8:25 ` Keith Busch
2024-05-01 7:46 ` Christoph Hellwig
2024-05-01 7:50 ` Christoph Hellwig
[not found] ` <CGME20240425184656epcas5p42228cdef753cf20a266d12de5bc130f0@epcas5p4.samsung.com>
2024-04-25 18:39 ` [PATCH 03/10] block: copy result back to user meta buffer correctly in case of split Kanchan Joshi
2024-04-27 7:04 ` Christoph Hellwig
[not found] ` <CGME20240425184658epcas5p2adb6bf01a5c56ffaac3a55ab57afaf8e@epcas5p2.samsung.com>
2024-04-25 18:39 ` [PATCH 04/10] block: avoid unpinning/freeing the bio_vec incase of cloned bio Kanchan Joshi
2024-04-27 7:05 ` Christoph Hellwig
2024-04-29 11:40 ` Kanchan Joshi
2024-04-29 17:09 ` Christoph Hellwig
2024-05-01 13:02 ` Kanchan Joshi
2024-05-02 7:12 ` Christoph Hellwig
2024-05-03 12:01 ` Kanchan Joshi
[not found] ` <CGME20240425184700epcas5p1687590f7e4a3f3c3620ac27af514f0ca@epcas5p1.samsung.com>
2024-04-25 18:39 ` [PATCH 05/10] block, nvme: modify rq_integrity_vec function Kanchan Joshi
2024-04-27 7:18 ` Christoph Hellwig
2024-04-29 11:34 ` Kanchan Joshi
2024-04-29 17:11 ` Christoph Hellwig
[not found] ` <CGME20240425184702epcas5p1ccb0df41b07845bc252d69007558e3fa@epcas5p1.samsung.com>
2024-04-25 18:39 ` [PATCH 06/10] block: modify bio_integrity_map_user argument Kanchan Joshi
2024-04-27 7:19 ` Christoph Hellwig
[not found] ` <CGME20240425184704epcas5p3b9eb6cce9c9658eb1d0d32937e778a5d@epcas5p3.samsung.com>
2024-04-25 18:39 ` [PATCH 07/10] block: define meta io descriptor Kanchan Joshi
[not found] ` <CGME20240425184706epcas5p1d75c19d1d1458c52fc4009f150c7dc7d@epcas5p1.samsung.com>
2024-04-25 18:39 ` [PATCH 08/10] io_uring/rw: add support to send meta along with read/write Kanchan Joshi
2024-04-26 14:25 ` Jens Axboe
2024-04-29 20:11 ` Kanchan Joshi
[not found] ` <CGME20240425184708epcas5p4f1d95cd8d285614f712868d205a23115@epcas5p4.samsung.com>
2024-04-25 18:39 ` [PATCH 09/10] block: add support to send meta buffer Kanchan Joshi
2024-04-26 15:21 ` Keith Busch
2024-04-29 11:47 ` Kanchan Joshi
[not found] ` <CGME20240425184710epcas5p2968bbc40ed10d1f0184bb511af054fcb@epcas5p2.samsung.com>
2024-04-25 18:39 ` [PATCH 10/10] nvme: add separate handling for user integrity buffer Kanchan Joshi
2024-04-25 19:56 ` Keith Busch
2024-04-26 10:57 ` kernel test robot
2024-04-26 14:19 ` [PATCH 00/10] Read/Write with meta/integrity Jens Axboe
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox