* [PATCH v9 00/11] Read/Write with meta/integrity
[not found] <CGME20241114105326epcas5p103b2c996293fa680092b97c747fdbd59@epcas5p1.samsung.com>
@ 2024-11-14 10:45 ` Anuj Gupta
[not found] ` <CGME20241114105352epcas5p109c1742fa8a6552296b9c104f2271308@epcas5p1.samsung.com>
` (10 more replies)
0 siblings, 11 replies; 37+ messages in thread
From: Anuj Gupta @ 2024-11-14 10:45 UTC (permalink / raw)
To: axboe, hch, kbusch, martin.petersen, asml.silence, anuj1072538,
brauner, jack, viro
Cc: io-uring, linux-nvme, linux-block, gost.dev, linux-scsi, vishak.g,
linux-fsdevel, Anuj Gupta
This adds a new io_uring interface to pass additional attributes with
read/write. This can be done in one of two ways:
1. Passing the attribute information via user pointer.
2. Passing the attribute information inline with SQE/SQE128.
We have had discussions in the past about using the SQE128 space for passing
PI fields. Still, if there are concerns about it, can we please consider
this patchset for inclusion with only the user-pointer approach for now
(and drop patch 7)?
An example program using the interface is appended below [1].
In the program, the write is done via the inline scheme and the read is
done via the user-pointer scheme.
The patchset is on top of block/for-next.
The block path (direct IO), and the NVMe and SCSI drivers are modified to
support this.
Patch 1 is an enhancement patch.
Patch 2 is required to make the bounce-buffer copy back work correctly.
Patches 3 to 5 are prep patches.
Patches 6 and 7 add the io_uring support.
Patch 8 gives us a unified interface for user- and kernel-generated
integrity.
Patch 9 adds support in NVMe and patch 10 in SCSI.
Patch 11 adds support for block direct IO.
Changes since v8:
https://lore.kernel.org/io-uring/[email protected]/
- add the option to pass the PI information from user space via a
pointer (Pavel)
Changes since v7:
https://lore.kernel.org/io-uring/[email protected]/
- change the sign-off order (hch)
- add a check for doing metadata completion handling only for async-io
- change meta_type name to something more meaningful (hch, keith)
- add detail description in io-uring patch (hch)
Changes since v6:
https://lore.kernel.org/linux-block/[email protected]/
- io_uring changes (bring back meta_type, move PI to the end of SQE128)
- Fix robot warnings
Changes since v5:
https://lore.kernel.org/linux-block/[email protected]/
- remove meta_type field from SQE (hch, keith)
- remove __bitwise annotation (hch)
- remove BIP_CTRL_NOCHECK from scsi (hch)
Changes since v4:
https://lore.kernel.org/linux-block/[email protected]/
- better variable names to describe bounce buffer copy back (hch)
- move definition of flags to the same patch introducing uio_meta (hch)
- move uio_meta definition to include/linux/uio.h (hch)
- bump seed size in uio_meta to 8 bytes (martin)
- move flags definition to include/uapi/linux/fs.h (hch)
- s/meta/metadata in commit description of io-uring (hch)
- rearrange the meta fields in sqe for cleaner layout
- partial submission case is not applicable as we are only plumbing for the async case
- s/META_TYPE_INTEGRITY/META_TYPE_PI (hch, martin)
- remove unlikely branching (hch)
- Better formatting, misc cleanups, better commit descriptions, reordering commits (hch)
Changes since v3:
https://lore.kernel.org/linux-block/[email protected]/
- add reftag seed support (Martin)
- fix incorrect formatting in uio_meta (hch)
- s/IOCB_HAS_META/IOCB_HAS_METADATA (hch)
- move integrity check flags to block layer header (hch)
- add comments for BIP_CHECK_GUARD/REFTAG/APPTAG flags (hch)
- remove bio_integrity check during completion if IOCB_HAS_METADATA is set (hch)
- use goto label to get rid of duplicate error handling (hch)
- add warn_on if trying to do sync io with iocb_has_metadata flag (hch)
- remove check for disabling reftag remapping (hch)
- remove BIP_INTEGRITY_USER flag (hch)
- add comment for app_tag field introduced in bio_integrity_payload (hch)
- pass request to nvme_set_app_tag function (hch)
- right indentation at a place in scsi patch (hch)
- move IOCB_HAS_METADATA to a separate fs patch (hch)
Changes since v2:
https://lore.kernel.org/linux-block/[email protected]/
- io_uring error handling styling (Gabriel)
- add documented helper to get metadata bytes from data iter (hch)
- during clone specify "what flags to clone" rather than
"what not to clone" (hch)
- Move uio_meta definition to bio-integrity.h (hch)
- Rename apptag field to app_tag (hch)
- Change datatype of flags field in uio_meta to bitwise (hch)
- Don't introduce BIP_USER_CHK_FOO flags (hch, martin)
- Driver should rely on block layer flags instead of seeing if it is
user-passthrough (hch)
- update the scsi code for handling user-meta (hch, martin)
Changes since v1:
https://lore.kernel.org/linux-block/[email protected]/
- Do not use a new opcode for meta, and also add the provision to introduce
new meta types beyond integrity (Pavel)
- Stuff IOCB_HAS_META check in need_complete_io (Jens)
- Split meta handling in NVMe into a separate handler (Keith)
- Add meta handling for __blkdev_direct_IO too (Keith)
- Don't inherit BIP_COPY_USER flag for cloned bio's (Christoph)
- Better commit descriptions (Christoph)
Changes since RFC:
- modify io_uring plumbing based on recent async handling state changes
- fixes/enhancements to correctly handle the split for meta buffer
- add flags to specify guard/reftag/apptag checks
- add support to send apptag
[1]
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <linux/fs.h>
#include <linux/io_uring.h>
#include <linux/types.h>
#include "liburing.h"
/*
 * write data/meta. read both. compare. send apptag too.
 * prerequisite:
 * protected xfer: format namespace with 4KB + 8b, pi_type = 1
 * For testing reftag remapping on device-mapper, create a
 * device-mapper and run this program. Device mapper creation:
 * # echo 0 80 linear /dev/nvme0n1 0 > /tmp/table
 * # echo 80 160 linear /dev/nvme0n1 200 >> /tmp/table
 * # dmsetup create two /tmp/table
 * # ./a.out /dev/dm-0
 */
#define DATA_LEN 4096
#define META_LEN 8

struct t10_pi_tuple {
        __be16 guard;
        __be16 apptag;
        __be32 reftag;
};

int main(int argc, char *argv[])
{
        struct io_uring ring;
        struct io_uring_sqe *sqe = NULL;
        struct io_uring_cqe *cqe = NULL;
        void *wdb, *rdb;
        char wmb[META_LEN], rmb[META_LEN];
        int fd, ret, blksize;
        struct stat fstat;
        unsigned long long offset = DATA_LEN * 10;
        struct t10_pi_tuple *pi;
        struct io_uring_sqe_ext *sqe_ext;
        /* zero-initialize so the reserved fields are 0 */
        struct io_uring_attr_pi pi_attr = {0};
        struct io_uring_attr_vec attr_vec = {0};

        if (argc != 2) {
                fprintf(stderr, "Usage: %s <block-device>\n", argv[0]);
                return 1;
        }

        if (stat(argv[1], &fstat) == 0) {
                blksize = (int)fstat.st_blksize;
        } else {
                perror("stat");
                return 1;
        }

        if (posix_memalign(&wdb, blksize, DATA_LEN)) {
                perror("posix_memalign failed");
                return 1;
        }
        if (posix_memalign(&rdb, blksize, DATA_LEN)) {
                perror("posix_memalign failed");
                return 1;
        }
        memset(wdb, 0, DATA_LEN);

        fd = open(argv[1], O_RDWR | O_DIRECT);
        if (fd < 0) {
                printf("Error in opening device\n");
                return 1;
        }

        ret = io_uring_queue_init(8, &ring, IORING_SETUP_SQE128);
        if (ret) {
                fprintf(stderr, "ring setup failed: %d\n", ret);
                return 1;
        }

        /* write data + meta-buffer to device; PI is passed inline in SQE128 */
        sqe = io_uring_get_sqe(&ring);
        if (!sqe) {
                fprintf(stderr, "get sqe failed\n");
                return 1;
        }
        io_uring_prep_write(sqe, fd, wdb, DATA_LEN, offset);
        sqe->attr_inline_flags = ATTR_FLAG_PI;
        sqe_ext = (struct io_uring_sqe_ext *)(sqe + 1);
        /* reserved space of the extended SQE must be zero */
        memset(sqe_ext, 0, sizeof(*sqe_ext));
        sqe_ext->rw_pi.addr = (__u64)wmb;
        sqe_ext->rw_pi.len = META_LEN;
        /* flags to ask for guard/reftag/apptag checks */
        sqe_ext->rw_pi.flags = IO_INTEGRITY_CHK_GUARD | IO_INTEGRITY_CHK_REFTAG |
                               IO_INTEGRITY_CHK_APPTAG;
        sqe_ext->rw_pi.app_tag = 0x1234;
        sqe_ext->rw_pi.seed = 10;

        pi = (struct t10_pi_tuple *)wmb;
        pi->guard = 0;
        pi->reftag = 0x0A000000;
        pi->apptag = 0x3412;

        ret = io_uring_submit(&ring);
        if (ret <= 0) {
                fprintf(stderr, "sqe submit failed: %d\n", ret);
                return 1;
        }
        ret = io_uring_wait_cqe(&ring, &cqe);
        if (!cqe) {
                fprintf(stderr, "cqe is NULL :%d\n", ret);
                return 1;
        }
        if (cqe->res < 0) {
                fprintf(stderr, "write cqe failure: %d", cqe->res);
                return 1;
        }
        io_uring_cqe_seen(&ring, cqe);

        /* read data + meta-buffer back from device; PI passed via user pointer */
        sqe = io_uring_get_sqe(&ring);
        if (!sqe) {
                fprintf(stderr, "get sqe failed\n");
                return 1;
        }
        io_uring_prep_read(sqe, fd, rdb, DATA_LEN, offset);
        sqe->nr_attr_indirect = 1;
        sqe->attr_vec_addr = (__u64)&attr_vec;
        attr_vec.type = ATTR_TYPE_PI;
        attr_vec.addr = (__u64)&pi_attr;
        pi_attr.addr = (__u64)rmb;
        pi_attr.len = META_LEN;
        pi_attr.flags = IO_INTEGRITY_CHK_GUARD | IO_INTEGRITY_CHK_REFTAG |
                        IO_INTEGRITY_CHK_APPTAG;
        pi_attr.app_tag = 0x1234;
        pi_attr.seed = 10;

        ret = io_uring_submit(&ring);
        if (ret <= 0) {
                fprintf(stderr, "sqe submit failed: %d\n", ret);
                return 1;
        }
        ret = io_uring_wait_cqe(&ring, &cqe);
        if (!cqe) {
                fprintf(stderr, "cqe is NULL :%d\n", ret);
                return 1;
        }
        if (cqe->res < 0) {
                fprintf(stderr, "read cqe failure: %d", cqe->res);
                return 1;
        }

        pi = (struct t10_pi_tuple *)rmb;
        if (pi->apptag != 0x3412)
                printf("Failure: apptag mismatch!\n");
        if (pi->reftag != 0x0A000000)
                printf("Failure: reftag mismatch!\n");
        io_uring_cqe_seen(&ring, cqe);

        /* metadata and data are binary buffers, so compare with memcmp */
        if (memcmp(wmb, rmb, META_LEN))
                printf("Failure: meta mismatch!\n");
        if (memcmp(wdb, rdb, DATA_LEN))
                printf("Failure: data mismatch!\n");

        io_uring_queue_exit(&ring);
        close(fd);
        free(rdb);
        free(wdb);
        return 0;
}
Anuj Gupta (8):
block: define set of integrity flags to be inherited by cloned bip
block: modify bio_integrity_map_user to accept iov_iter as argument
fs, iov_iter: define meta io descriptor
fs: introduce IOCB_HAS_METADATA for metadata
io_uring: introduce attributes for read/write and PI support
io_uring: inline read/write attributes and PI
block: introduce BIP_CHECK_GUARD/REFTAG/APPTAG bip_flags
scsi: add support for user-meta interface
Christoph Hellwig (1):
block: copy back bounce buffer to user-space correctly in case of
split
Kanchan Joshi (2):
nvme: add support for passing on the application tag
block: add support to pass user meta buffer
block/bio-integrity.c | 84 ++++++++++++++----
block/blk-integrity.c | 10 ++-
block/fops.c | 45 +++++++---
drivers/nvme/host/core.c | 21 +++--
drivers/scsi/sd.c | 4 +-
include/linux/bio-integrity.h | 25 ++++--
include/linux/fs.h | 1 +
include/linux/uio.h | 9 ++
include/uapi/linux/fs.h | 9 ++
include/uapi/linux/io_uring.h | 40 +++++++++
io_uring/io_uring.c | 5 ++
io_uring/rw.c | 160 +++++++++++++++++++++++++++++++++-
io_uring/rw.h | 14 ++-
13 files changed, 381 insertions(+), 46 deletions(-)
--
2.25.1
* [PATCH v9 01/11] block: define set of integrity flags to be inherited by cloned bip
[not found] ` <CGME20241114105352epcas5p109c1742fa8a6552296b9c104f2271308@epcas5p1.samsung.com>
@ 2024-11-14 10:45 ` Anuj Gupta
0 siblings, 0 replies; 37+ messages in thread
From: Anuj Gupta @ 2024-11-14 10:45 UTC (permalink / raw)
To: axboe, hch, kbusch, martin.petersen, asml.silence, anuj1072538,
brauner, jack, viro
Cc: io-uring, linux-nvme, linux-block, gost.dev, linux-scsi, vishak.g,
linux-fsdevel, Anuj Gupta
Introduce BIP_CLONE_FLAGS describing integrity flags that should be
inherited in the cloned bip from the parent.
Suggested-by: Christoph Hellwig <[email protected]>
Signed-off-by: Anuj Gupta <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Martin K. Petersen <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
---
block/bio-integrity.c | 2 +-
include/linux/bio-integrity.h | 3 +++
2 files changed, 4 insertions(+), 1 deletion(-)
diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index 2a4bd6611692..a448a25d13de 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -559,7 +559,7 @@ int bio_integrity_clone(struct bio *bio, struct bio *bio_src,
bip->bip_vec = bip_src->bip_vec;
bip->bip_iter = bip_src->bip_iter;
- bip->bip_flags = bip_src->bip_flags & ~BIP_BLOCK_INTEGRITY;
+ bip->bip_flags = bip_src->bip_flags & BIP_CLONE_FLAGS;
return 0;
}
diff --git a/include/linux/bio-integrity.h b/include/linux/bio-integrity.h
index dbf0f74c1529..0f0cf10222e8 100644
--- a/include/linux/bio-integrity.h
+++ b/include/linux/bio-integrity.h
@@ -30,6 +30,9 @@ struct bio_integrity_payload {
struct bio_vec bip_inline_vecs[];/* embedded bvec array */
};
+#define BIP_CLONE_FLAGS (BIP_MAPPED_INTEGRITY | BIP_CTRL_NOCHECK | \
+ BIP_DISK_NOCHECK | BIP_IP_CHECKSUM)
+
#ifdef CONFIG_BLK_DEV_INTEGRITY
#define bip_for_each_vec(bvl, bip, iter) \
--
2.25.1
* [PATCH v9 02/11] block: copy back bounce buffer to user-space correctly in case of split
[not found] ` <CGME20241114105354epcas5p49a73947c3d37be4189023f66fb7ba413@epcas5p4.samsung.com>
@ 2024-11-14 10:45 ` Anuj Gupta
0 siblings, 0 replies; 37+ messages in thread
From: Anuj Gupta @ 2024-11-14 10:45 UTC (permalink / raw)
To: axboe, hch, kbusch, martin.petersen, asml.silence, anuj1072538,
brauner, jack, viro
Cc: io-uring, linux-nvme, linux-block, gost.dev, linux-scsi, vishak.g,
linux-fsdevel, Anuj Gupta
From: Christoph Hellwig <[email protected]>
Copy back the bounce buffer to user-space in its entirety when the parent
bio completes. The existing code uses bip_iter.bi_size for sizing the
copy, which can be modified. So move away from that and fetch it from
the vector passed to the block layer. While at it, switch to using
better variable names.
Fixes: 492c5d455969f ("block: bio-integrity: directly map user buffers")
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Anuj Gupta <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
---
block/bio-integrity.c | 15 ++++++++-------
1 file changed, 8 insertions(+), 7 deletions(-)
diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index a448a25d13de..4341b0d4efa1 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -118,17 +118,18 @@ static void bio_integrity_unpin_bvec(struct bio_vec *bv, int nr_vecs,
static void bio_integrity_uncopy_user(struct bio_integrity_payload *bip)
{
- unsigned short nr_vecs = bip->bip_max_vcnt - 1;
- struct bio_vec *copy = &bip->bip_vec[1];
- size_t bytes = bip->bip_iter.bi_size;
- struct iov_iter iter;
+ unsigned short orig_nr_vecs = bip->bip_max_vcnt - 1;
+ struct bio_vec *orig_bvecs = &bip->bip_vec[1];
+ struct bio_vec *bounce_bvec = &bip->bip_vec[0];
+ size_t bytes = bounce_bvec->bv_len;
+ struct iov_iter orig_iter;
int ret;
- iov_iter_bvec(&iter, ITER_DEST, copy, nr_vecs, bytes);
- ret = copy_to_iter(bvec_virt(bip->bip_vec), bytes, &iter);
+ iov_iter_bvec(&orig_iter, ITER_DEST, orig_bvecs, orig_nr_vecs, bytes);
+ ret = copy_to_iter(bvec_virt(bounce_bvec), bytes, &orig_iter);
WARN_ON_ONCE(ret != bytes);
- bio_integrity_unpin_bvec(copy, nr_vecs, true);
+ bio_integrity_unpin_bvec(orig_bvecs, orig_nr_vecs, true);
}
/**
--
2.25.1
* [PATCH v9 03/11] block: modify bio_integrity_map_user to accept iov_iter as argument
[not found] ` <CGME20241114105357epcas5p41fd14282d4abfe564e858b37babe708a@epcas5p4.samsung.com>
@ 2024-11-14 10:45 ` Anuj Gupta
0 siblings, 0 replies; 37+ messages in thread
From: Anuj Gupta @ 2024-11-14 10:45 UTC (permalink / raw)
To: axboe, hch, kbusch, martin.petersen, asml.silence, anuj1072538,
brauner, jack, viro
Cc: io-uring, linux-nvme, linux-block, gost.dev, linux-scsi, vishak.g,
linux-fsdevel, Anuj Gupta, Kanchan Joshi
This patch refactors bio_integrity_map_user to accept an iov_iter as
argument. This is a prep patch.
Signed-off-by: Anuj Gupta <[email protected]>
Signed-off-by: Kanchan Joshi <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
---
block/bio-integrity.c | 12 +++++-------
block/blk-integrity.c | 10 +++++++++-
include/linux/bio-integrity.h | 5 ++---
3 files changed, 16 insertions(+), 11 deletions(-)
diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index 4341b0d4efa1..f56d01cec689 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -302,16 +302,15 @@ static unsigned int bvec_from_pages(struct bio_vec *bvec, struct page **pages,
return nr_bvecs;
}
-int bio_integrity_map_user(struct bio *bio, void __user *ubuf, ssize_t bytes)
+int bio_integrity_map_user(struct bio *bio, struct iov_iter *iter)
{
struct request_queue *q = bdev_get_queue(bio->bi_bdev);
unsigned int align = blk_lim_dma_alignment_and_pad(&q->limits);
struct page *stack_pages[UIO_FASTIOV], **pages = stack_pages;
struct bio_vec stack_vec[UIO_FASTIOV], *bvec = stack_vec;
+ size_t offset, bytes = iter->count;
unsigned int direction, nr_bvecs;
- struct iov_iter iter;
int ret, nr_vecs;
- size_t offset;
bool copy;
if (bio_integrity(bio))
@@ -324,8 +323,7 @@ int bio_integrity_map_user(struct bio *bio, void __user *ubuf, ssize_t bytes)
else
direction = ITER_SOURCE;
- iov_iter_ubuf(&iter, direction, ubuf, bytes);
- nr_vecs = iov_iter_npages(&iter, BIO_MAX_VECS + 1);
+ nr_vecs = iov_iter_npages(iter, BIO_MAX_VECS + 1);
if (nr_vecs > BIO_MAX_VECS)
return -E2BIG;
if (nr_vecs > UIO_FASTIOV) {
@@ -335,8 +333,8 @@ int bio_integrity_map_user(struct bio *bio, void __user *ubuf, ssize_t bytes)
pages = NULL;
}
- copy = !iov_iter_is_aligned(&iter, align, align);
- ret = iov_iter_extract_pages(&iter, &pages, bytes, nr_vecs, 0, &offset);
+ copy = !iov_iter_is_aligned(iter, align, align);
+ ret = iov_iter_extract_pages(iter, &pages, bytes, nr_vecs, 0, &offset);
if (unlikely(ret < 0))
goto free_bvec;
diff --git a/block/blk-integrity.c b/block/blk-integrity.c
index b180cac61a9d..4a29754f1bc2 100644
--- a/block/blk-integrity.c
+++ b/block/blk-integrity.c
@@ -115,8 +115,16 @@ EXPORT_SYMBOL(blk_rq_map_integrity_sg);
int blk_rq_integrity_map_user(struct request *rq, void __user *ubuf,
ssize_t bytes)
{
- int ret = bio_integrity_map_user(rq->bio, ubuf, bytes);
+ int ret;
+ struct iov_iter iter;
+ unsigned int direction;
+ if (op_is_write(req_op(rq)))
+ direction = ITER_DEST;
+ else
+ direction = ITER_SOURCE;
+ iov_iter_ubuf(&iter, direction, ubuf, bytes);
+ ret = bio_integrity_map_user(rq->bio, &iter);
if (ret)
return ret;
diff --git a/include/linux/bio-integrity.h b/include/linux/bio-integrity.h
index 0f0cf10222e8..58ff9988433a 100644
--- a/include/linux/bio-integrity.h
+++ b/include/linux/bio-integrity.h
@@ -75,7 +75,7 @@ struct bio_integrity_payload *bio_integrity_alloc(struct bio *bio, gfp_t gfp,
unsigned int nr);
int bio_integrity_add_page(struct bio *bio, struct page *page, unsigned int len,
unsigned int offset);
-int bio_integrity_map_user(struct bio *bio, void __user *ubuf, ssize_t len);
+int bio_integrity_map_user(struct bio *bio, struct iov_iter *iter);
void bio_integrity_unmap_user(struct bio *bio);
bool bio_integrity_prep(struct bio *bio);
void bio_integrity_advance(struct bio *bio, unsigned int bytes_done);
@@ -101,8 +101,7 @@ static inline void bioset_integrity_free(struct bio_set *bs)
{
}
-static inline int bio_integrity_map_user(struct bio *bio, void __user *ubuf,
- ssize_t len)
+static int bio_integrity_map_user(struct bio *bio, struct iov_iter *iter)
{
return -EINVAL;
}
--
2.25.1
* [PATCH v9 04/11] fs, iov_iter: define meta io descriptor
[not found] ` <CGME20241114105400epcas5p270b8062a0c4f26833a5b497f057d65a7@epcas5p2.samsung.com>
@ 2024-11-14 10:45 ` Anuj Gupta
0 siblings, 0 replies; 37+ messages in thread
From: Anuj Gupta @ 2024-11-14 10:45 UTC (permalink / raw)
To: axboe, hch, kbusch, martin.petersen, asml.silence, anuj1072538,
brauner, jack, viro
Cc: io-uring, linux-nvme, linux-block, gost.dev, linux-scsi, vishak.g,
linux-fsdevel, Anuj Gupta, Kanchan Joshi
Add flags to describe checks for the integrity meta buffer. Also, introduce
a new 'uio_meta' structure that the upper layer can use to pass the
meta/integrity information.
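As an illustration (not part of this patch), a minimal sketch of how an
upper layer could describe a user metadata buffer for a direct-IO write;
'ubuf', 'len' and 'slba' are placeholders, and this mirrors what the
io_uring plumbing later in this series does:

        struct uio_meta meta = { 0 };
        int ret;

        meta.flags = IO_INTEGRITY_CHK_GUARD | IO_INTEGRITY_CHK_REFTAG;
        meta.app_tag = 0x1234;
        meta.seed = slba;       /* seed for reftag remapping */
        /* metadata is read from user memory for a write, hence ITER_SOURCE */
        ret = import_ubuf(ITER_SOURCE, ubuf, len, &meta.iter);
        if (ret)
                return ret;
        /* the descriptor is then handed to the block layer via kiocb->private */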
Signed-off-by: Kanchan Joshi <[email protected]>
Signed-off-by: Anuj Gupta <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
include/linux/uio.h | 9 +++++++++
include/uapi/linux/fs.h | 9 +++++++++
2 files changed, 18 insertions(+)
diff --git a/include/linux/uio.h b/include/linux/uio.h
index 853f9de5aa05..8ada84e85447 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -82,6 +82,15 @@ struct iov_iter {
};
};
+typedef __u16 uio_meta_flags_t;
+
+struct uio_meta {
+ uio_meta_flags_t flags;
+ u16 app_tag;
+ u64 seed;
+ struct iov_iter iter;
+};
+
static inline const struct iovec *iter_iov(const struct iov_iter *iter)
{
if (iter->iter_type == ITER_UBUF)
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 753971770733..9070ef19f0a3 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -40,6 +40,15 @@
#define BLOCK_SIZE_BITS 10
#define BLOCK_SIZE (1<<BLOCK_SIZE_BITS)
+/* flags for integrity meta */
+#define IO_INTEGRITY_CHK_GUARD (1U << 0) /* enforce guard check */
+#define IO_INTEGRITY_CHK_REFTAG (1U << 1) /* enforce ref check */
+#define IO_INTEGRITY_CHK_APPTAG (1U << 2) /* enforce app check */
+
+#define IO_INTEGRITY_VALID_FLAGS (IO_INTEGRITY_CHK_GUARD | \
+ IO_INTEGRITY_CHK_REFTAG | \
+ IO_INTEGRITY_CHK_APPTAG)
+
#define SEEK_SET 0 /* seek relative to beginning of file */
#define SEEK_CUR 1 /* seek relative to current file position */
#define SEEK_END 2 /* seek relative to end of file */
--
2.25.1
* [PATCH v9 05/11] fs: introduce IOCB_HAS_METADATA for metadata
[not found] ` <CGME20241114105402epcas5p41b1f6054a557f1bda2cfddfdfb9a9477@epcas5p4.samsung.com>
@ 2024-11-14 10:45 ` Anuj Gupta
0 siblings, 0 replies; 37+ messages in thread
From: Anuj Gupta @ 2024-11-14 10:45 UTC (permalink / raw)
To: axboe, hch, kbusch, martin.petersen, asml.silence, anuj1072538,
brauner, jack, viro
Cc: io-uring, linux-nvme, linux-block, gost.dev, linux-scsi, vishak.g,
linux-fsdevel, Anuj Gupta
Introduce an IOCB_HAS_METADATA flag for the kiocb struct, for handling
requests containing a meta payload.
Signed-off-by: Anuj Gupta <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
include/linux/fs.h | 1 +
1 file changed, 1 insertion(+)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 4b5cad44a126..7f14675b02df 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -346,6 +346,7 @@ struct readahead_control;
#define IOCB_DIO_CALLER_COMP (1 << 22)
/* kiocb is a read or write operation submitted by fs/aio.c. */
#define IOCB_AIO_RW (1 << 23)
+#define IOCB_HAS_METADATA (1 << 24)
/* for use in trace events */
#define TRACE_IOCB_STRINGS \
--
2.25.1
* [PATCH v9 06/11] io_uring: introduce attributes for read/write and PI support
[not found] ` <CGME20241114105405epcas5p24ca2fb9017276ff8a50ef447638fd739@epcas5p2.samsung.com>
@ 2024-11-14 10:45 ` Anuj Gupta
2024-11-14 12:16 ` Christoph Hellwig
` (2 more replies)
0 siblings, 3 replies; 37+ messages in thread
From: Anuj Gupta @ 2024-11-14 10:45 UTC (permalink / raw)
To: axboe, hch, kbusch, martin.petersen, asml.silence, anuj1072538,
brauner, jack, viro
Cc: io-uring, linux-nvme, linux-block, gost.dev, linux-scsi, vishak.g,
linux-fsdevel, Anuj Gupta, Kanchan Joshi
Add the ability to pass additional attributes along with read/write.
The application can populate an array of 'struct io_uring_attr_vec' and
pass its address using the SQE field:
        __u64   attr_vec_addr;
along with the number of attributes using:
        __u8    nr_attr_indirect;
Overall, 16 attributes are allowed, and currently one attribute,
ATTR_TYPE_PI, is supported.
With the PI attribute, userspace can pass the following information:
- flags: integrity check flags IO_INTEGRITY_CHK_{GUARD/APPTAG/REFTAG}
- len: length of the PI/metadata buffer
- addr: address of the metadata buffer
- seed: seed value for reftag remapping
- app_tag: application-defined 16b value
This information is processed to prepare a 'uio_meta' descriptor, which
is passed down using kiocb->private.
The PI attribute is supported only for direct IO. Also, vectored
read/write operations are not supported with PI currently.
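For reference, a minimal userspace sketch of the indirect scheme
(liburing-based, error handling omitted; 'meta_buf', 'meta_len' and 'slba'
are placeholders):

        struct io_uring_attr_pi pi = { 0 };     /* zeroed so rsvd is 0 */
        struct io_uring_attr_vec vec = { 0 };

        pi.addr = (__u64)(uintptr_t)meta_buf;
        pi.len = meta_len;
        pi.flags = IO_INTEGRITY_CHK_GUARD | IO_INTEGRITY_CHK_REFTAG;
        pi.app_tag = 0x1234;
        pi.seed = slba;

        vec.type = ATTR_TYPE_PI;
        vec.addr = (__u64)(uintptr_t)&pi;

        io_uring_prep_read(sqe, fd, buf, len, offset);
        sqe->nr_attr_indirect = 1;
        sqe->attr_vec_addr = (__u64)(uintptr_t)&vec;

Both structures are copied from user space during submission by the prep
path, so they only need to stay valid until io_uring_submit() returns.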
Signed-off-by: Anuj Gupta <[email protected]>
Signed-off-by: Kanchan Joshi <[email protected]>
---
include/uapi/linux/io_uring.h | 29 ++++++++
io_uring/io_uring.c | 1 +
io_uring/rw.c | 128 +++++++++++++++++++++++++++++++++-
io_uring/rw.h | 14 +++-
4 files changed, 169 insertions(+), 3 deletions(-)
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 5d08435b95a8..2e6808f6ba28 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -92,12 +92,18 @@ struct io_uring_sqe {
__u16 addr_len;
__u16 __pad3[1];
};
+ struct {
+ /* number of elements in the attribute vector */
+ __u8 nr_attr_indirect;
+ __u8 __pad4[3];
+ };
};
union {
struct {
__u64 addr3;
__u64 __pad2[1];
};
+ __u64 attr_vec_addr;
__u64 optval;
/*
* If the ring is initialized with IORING_SETUP_SQE128, then
@@ -107,6 +113,29 @@ struct io_uring_sqe {
};
};
+
+/* Attributes to be passed with read/write */
+enum io_uring_attr_type {
+ ATTR_TYPE_PI,
+ /* max supported attributes */
+ ATTR_TYPE_LAST = 16,
+};
+
+struct io_uring_attr_vec {
+ enum io_uring_attr_type type;
+ __u64 addr;
+};
+
+/* PI attribute information */
+struct io_uring_attr_pi {
+ __u16 flags;
+ __u16 app_tag;
+ __u32 len;
+ __u64 addr;
+ __u64 seed;
+ __u64 rsvd;
+};
+
/*
* If sqe->file_index is set to this for opcodes that instantiate a new
* direct descriptor (like openat/openat2/accept), then io_uring will allocate
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index bd71782057de..e32dd118d7c8 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -3867,6 +3867,7 @@ static int __init io_uring_init(void)
BUILD_BUG_SQE_ELEM(44, __u32, file_index);
BUILD_BUG_SQE_ELEM(44, __u16, addr_len);
BUILD_BUG_SQE_ELEM(46, __u16, __pad3[0]);
+ BUILD_BUG_SQE_ELEM(44, __u8, nr_attr_indirect);
BUILD_BUG_SQE_ELEM(48, __u64, addr3);
BUILD_BUG_SQE_ELEM_SIZE(48, 0, cmd);
BUILD_BUG_SQE_ELEM(56, __u64, __pad2);
diff --git a/io_uring/rw.c b/io_uring/rw.c
index cce8bc2ecd3f..93d7451b9370 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -257,11 +257,98 @@ static int io_prep_rw_setup(struct io_kiocb *req, int ddir, bool do_import)
return 0;
}
+static inline void io_meta_save_state(struct io_async_rw *io)
+{
+ io->meta_state.seed = io->meta.seed;
+ iov_iter_save_state(&io->meta.iter, &io->meta_state.iter_meta);
+}
+
+static inline void io_meta_restore(struct io_async_rw *io)
+{
+ io->meta.seed = io->meta_state.seed;
+ iov_iter_restore(&io->meta.iter, &io->meta_state.iter_meta);
+}
+
+static int io_prep_rw_pi(struct io_kiocb *req, struct io_rw *rw, int ddir,
+ const struct io_uring_attr_pi *pi_attr)
+{
+ const struct io_issue_def *def;
+ struct io_async_rw *io;
+ int ret;
+
+ if (READ_ONCE(pi_attr->rsvd))
+ return -EINVAL;
+
+ def = &io_issue_defs[req->opcode];
+ if (def->vectored)
+ return -EOPNOTSUPP;
+
+ io = req->async_data;
+ io->meta.flags = READ_ONCE(pi_attr->flags);
+ io->meta.app_tag = READ_ONCE(pi_attr->app_tag);
+ io->meta.seed = READ_ONCE(pi_attr->seed);
+ ret = import_ubuf(ddir, u64_to_user_ptr(READ_ONCE(pi_attr->addr)),
+ READ_ONCE(pi_attr->len), &io->meta.iter);
+ if (unlikely(ret < 0))
+ return ret;
+ rw->kiocb.ki_flags |= IOCB_HAS_METADATA;
+ io_meta_save_state(io);
+ return ret;
+}
+
+
+static inline int io_prep_pi_indirect(struct io_kiocb *req, struct io_rw *rw,
+ int ddir, u64 pi_attr_addr)
+{
+ struct io_uring_attr_pi pi_attr;
+
+ if (copy_from_user(&pi_attr, (void __user *)pi_attr_addr, sizeof(pi_attr)))
+ return -EFAULT;
+ return io_prep_rw_pi(req, rw, ddir, &pi_attr);
+}
+
+static int io_prep_attr_vec(struct io_kiocb *req, struct io_rw *rw, int ddir,
+ u64 attr_addr, u8 nr_attr)
+{
+ struct io_uring_attr_vec attr_vec[ATTR_TYPE_LAST];
+ size_t attr_vec_size = sizeof(struct io_uring_attr_vec) * nr_attr;
+ u8 dup[ATTR_TYPE_LAST] = {0};
+ enum io_uring_attr_type t;
+ int i, ret;
+
+ if (nr_attr > ATTR_TYPE_LAST)
+ return -EINVAL;
+ if (copy_from_user(attr_vec, (void __user *)attr_addr, attr_vec_size))
+ return -EFAULT;
+
+ for (i = 0; i < nr_attr; i++) {
+ t = attr_vec[i].type;
+ if (t >= ATTR_TYPE_LAST)
+ return -EINVAL;
+ /* allow each attribute only once */
+ if (dup[ATTR_TYPE_PI])
+ return -EBUSY;
+ dup[ATTR_TYPE_PI] = 1;
+
+ switch (t) {
+ case ATTR_TYPE_PI:
+ ret = io_prep_pi_indirect(req, rw, ddir, attr_vec[i].addr);
+ break;
+ default:
+ ret = -EOPNOTSUPP;
+ }
+ if (unlikely(ret))
+ return ret;
+ }
+ return 0;
+}
+
static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
int ddir, bool do_import)
{
struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
unsigned ioprio;
+ u8 nr_attr_indirect;
int ret;
rw->kiocb.ki_pos = READ_ONCE(sqe->off);
@@ -279,11 +366,29 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
rw->kiocb.ki_ioprio = get_current_ioprio();
}
rw->kiocb.dio_complete = NULL;
+ rw->kiocb.ki_flags = 0;
rw->addr = READ_ONCE(sqe->addr);
rw->len = READ_ONCE(sqe->len);
rw->flags = READ_ONCE(sqe->rw_flags);
- return io_prep_rw_setup(req, ddir, do_import);
+ ret = io_prep_rw_setup(req, ddir, do_import);
+
+ if (unlikely(ret))
+ return ret;
+
+ nr_attr_indirect = READ_ONCE(sqe->nr_attr_indirect);
+ if (nr_attr_indirect) {
+ u64 attr_vec_usr_addr = READ_ONCE(sqe->attr_vec_addr);
+
+ if (READ_ONCE(sqe->__pad4[0]) || READ_ONCE(sqe->__pad4[1]) ||
+ READ_ONCE(sqe->__pad4[2]))
+ return -EINVAL;
+
+ ret = io_prep_attr_vec(req, rw, ddir, attr_vec_usr_addr,
+ nr_attr_indirect);
+ }
+
+ return ret;
}
int io_prep_read(struct io_kiocb *req, const struct io_uring_sqe *sqe)
@@ -409,7 +514,10 @@ static inline loff_t *io_kiocb_update_pos(struct io_kiocb *req)
static void io_resubmit_prep(struct io_kiocb *req)
{
struct io_async_rw *io = req->async_data;
+ struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
+ if (rw->kiocb.ki_flags & IOCB_HAS_METADATA)
+ io_meta_restore(io);
iov_iter_restore(&io->iter, &io->iter_state);
}
@@ -794,7 +902,7 @@ static int io_rw_init_file(struct io_kiocb *req, fmode_t mode, int rw_type)
if (!(req->flags & REQ_F_FIXED_FILE))
req->flags |= io_file_get_flags(file);
- kiocb->ki_flags = file->f_iocb_flags;
+ kiocb->ki_flags |= file->f_iocb_flags;
ret = kiocb_set_rw_flags(kiocb, rw->flags, rw_type);
if (unlikely(ret))
return ret;
@@ -828,6 +936,18 @@ static int io_rw_init_file(struct io_kiocb *req, fmode_t mode, int rw_type)
kiocb->ki_complete = io_complete_rw;
}
+ if (kiocb->ki_flags & IOCB_HAS_METADATA) {
+ struct io_async_rw *io = req->async_data;
+
+ /*
+ * We have a union of meta fields with wpq used for buffered-io
+ * in io_async_rw, so fail it here.
+ */
+ if (!(req->file->f_flags & O_DIRECT))
+ return -EOPNOTSUPP;
+ kiocb->private = &io->meta;
+ }
+
return 0;
}
@@ -902,6 +1022,8 @@ static int __io_read(struct io_kiocb *req, unsigned int issue_flags)
* manually if we need to.
*/
iov_iter_restore(&io->iter, &io->iter_state);
+ if (kiocb->ki_flags & IOCB_HAS_METADATA)
+ io_meta_restore(io);
do {
/*
@@ -1125,6 +1247,8 @@ int io_write(struct io_kiocb *req, unsigned int issue_flags)
} else {
ret_eagain:
iov_iter_restore(&io->iter, &io->iter_state);
+ if (kiocb->ki_flags & IOCB_HAS_METADATA)
+ io_meta_restore(io);
if (kiocb->ki_flags & IOCB_WRITE)
io_req_end_write(req);
return -EAGAIN;
diff --git a/io_uring/rw.h b/io_uring/rw.h
index 3f432dc75441..2d7656bd268d 100644
--- a/io_uring/rw.h
+++ b/io_uring/rw.h
@@ -2,6 +2,11 @@
#include <linux/pagemap.h>
+struct io_meta_state {
+ u32 seed;
+ struct iov_iter_state iter_meta;
+};
+
struct io_async_rw {
size_t bytes_done;
struct iov_iter iter;
@@ -9,7 +14,14 @@ struct io_async_rw {
struct iovec fast_iov;
struct iovec *free_iovec;
int free_iov_nr;
- struct wait_page_queue wpq;
+ /* wpq is for buffered io, while meta fields are used with direct io */
+ union {
+ struct wait_page_queue wpq;
+ struct {
+ struct uio_meta meta;
+ struct io_meta_state meta_state;
+ };
+ };
};
int io_prep_read_fixed(struct io_kiocb *req, const struct io_uring_sqe *sqe);
--
2.25.1
* [PATCH v9 07/11] io_uring: inline read/write attributes and PI
[not found] ` <CGME20241114105408epcas5p3c77cda2faf7ccb37abbfd8e95b4ad1f5@epcas5p3.samsung.com>
@ 2024-11-14 10:45 ` Anuj Gupta
0 siblings, 0 replies; 37+ messages in thread
From: Anuj Gupta @ 2024-11-14 10:45 UTC (permalink / raw)
To: axboe, hch, kbusch, martin.petersen, asml.silence, anuj1072538,
brauner, jack, viro
Cc: io-uring, linux-nvme, linux-block, gost.dev, linux-scsi, vishak.g,
linux-fsdevel, Anuj Gupta, Kanchan Joshi
Add the ability to place attributes inline within the SQE.
Carve out a new field that can accommodate 16 attribute flags:
        __u16   attr_inline_flags;
Currently ATTR_FLAG_PI is defined, and future flags can be OR-ed in to
specify the attributes that are placed inline.
When ATTR_FLAG_PI is passed, the application should also set up an SQE128
ring and place the PI information (i.e., struct io_uring_attr_pi) in the
first 32 bytes of the second SQE.
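For reference, a minimal userspace sketch of the inline scheme; the ring
must be created with IORING_SETUP_SQE128, (sqe + 1) points at the second
half of the 128-byte SQE, and 'buf', 'len', 'offset', 'meta_buf',
'meta_len' and 'slba' are placeholders:

        struct io_uring_sqe_ext *sqe_ext = (struct io_uring_sqe_ext *)(sqe + 1);

        io_uring_prep_write(sqe, fd, buf, len, offset);
        sqe->attr_inline_flags = ATTR_FLAG_PI;
        memset(sqe_ext, 0, sizeof(*sqe_ext));   /* rsvd1[] must be zero */
        sqe_ext->rw_pi.addr = (__u64)(uintptr_t)meta_buf;
        sqe_ext->rw_pi.len = meta_len;
        sqe_ext->rw_pi.flags = IO_INTEGRITY_CHK_GUARD | IO_INTEGRITY_CHK_APPTAG;
        sqe_ext->rw_pi.app_tag = 0x1234;
        sqe_ext->rw_pi.seed = slba;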
Signed-off-by: Anuj Gupta <[email protected]>
Signed-off-by: Kanchan Joshi <[email protected]>
---
include/uapi/linux/io_uring.h | 13 +++++++++++-
io_uring/io_uring.c | 6 +++++-
io_uring/rw.c | 38 ++++++++++++++++++++++++++++++++---
3 files changed, 52 insertions(+), 5 deletions(-)
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 2e6808f6ba28..9c290c16e543 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -93,9 +93,11 @@ struct io_uring_sqe {
__u16 __pad3[1];
};
struct {
+ /* used when extra attribute is passed inline SQE/SQE128 */
+ __u16 attr_inline_flags;
/* number of elements in the attribute vector */
__u8 nr_attr_indirect;
- __u8 __pad4[3];
+ __u8 __pad4[1];
};
};
union {
@@ -126,6 +128,8 @@ struct io_uring_attr_vec {
__u64 addr;
};
+/* sqe->attr_inline_flags */
+#define ATTR_FLAG_PI (1U << ATTR_TYPE_PI)
/* PI attribute information */
struct io_uring_attr_pi {
__u16 flags;
@@ -136,6 +140,13 @@ struct io_uring_attr_pi {
__u64 rsvd;
};
+/* Second half of SQE128 for IORING_OP_READ/WRITE */
+struct io_uring_sqe_ext {
+ /* if sqe->attr_inline_flags has ATTR_PI, first 32 bytes are for PI */
+ struct io_uring_attr_pi rw_pi;
+ __u64 rsvd1[4];
+};
+
/*
* If sqe->file_index is set to this for opcodes that instantiate a new
* direct descriptor (like openat/openat2/accept), then io_uring will allocate
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index e32dd118d7c8..3f975befe82e 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -3866,8 +3866,9 @@ static int __init io_uring_init(void)
BUILD_BUG_SQE_ELEM(44, __s32, splice_fd_in);
BUILD_BUG_SQE_ELEM(44, __u32, file_index);
BUILD_BUG_SQE_ELEM(44, __u16, addr_len);
+ BUILD_BUG_SQE_ELEM(44, __u16, attr_inline_flags);
BUILD_BUG_SQE_ELEM(46, __u16, __pad3[0]);
- BUILD_BUG_SQE_ELEM(44, __u8, nr_attr_indirect);
+ BUILD_BUG_SQE_ELEM(46, __u8, nr_attr_indirect);
BUILD_BUG_SQE_ELEM(48, __u64, addr3);
BUILD_BUG_SQE_ELEM_SIZE(48, 0, cmd);
BUILD_BUG_SQE_ELEM(56, __u64, __pad2);
@@ -3894,6 +3895,9 @@ static int __init io_uring_init(void)
/* top 8bits are for internal use */
BUILD_BUG_ON((IORING_URING_CMD_MASK & 0xff000000) != 0);
+ BUILD_BUG_ON(sizeof(struct io_uring_sqe_ext) !=
+ sizeof(struct io_uring_sqe));
+
io_uring_optable_init();
/*
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 93d7451b9370..d2d403ca6eb3 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -269,6 +269,11 @@ static inline void io_meta_restore(struct io_async_rw *io)
iov_iter_restore(&io->meta.iter, &io->meta_state.iter_meta);
}
+static inline const void *io_uring_sqe_ext(const struct io_uring_sqe *sqe)
+{
+ return (sqe + 1);
+}
+
static int io_prep_rw_pi(struct io_kiocb *req, struct io_rw *rw, int ddir,
const struct io_uring_attr_pi *pi_attr)
{
@@ -343,11 +348,34 @@ static int io_prep_attr_vec(struct io_kiocb *req, struct io_rw *rw, int ddir,
return 0;
}
+static int io_prep_inline_attr(struct io_kiocb *req, struct io_rw *rw,
+ const struct io_uring_sqe *sqe, int ddir,
+ u16 attr_flags)
+{
+ const struct io_uring_sqe_ext *sqe_ext;
+ const struct io_uring_attr_pi *pi_attr;
+
+ if (!(attr_flags & ATTR_FLAG_PI))
+ return -EINVAL;
+
+ if (!(req->ctx->flags & IORING_SETUP_SQE128))
+ return -EINVAL;
+
+ sqe_ext = io_uring_sqe_ext(sqe);
+ if (READ_ONCE(sqe_ext->rsvd1[0]) || READ_ONCE(sqe_ext->rsvd1[1])
+ || READ_ONCE(sqe_ext->rsvd1[2]) || READ_ONCE(sqe_ext->rsvd1[3]))
+ return -EINVAL;
+
+ pi_attr = &sqe_ext->rw_pi;
+ return io_prep_rw_pi(req, rw, ddir, pi_attr);
+}
+
static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
int ddir, bool do_import)
{
struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
unsigned ioprio;
+ u16 attr_flags;
u8 nr_attr_indirect;
int ret;
@@ -376,12 +404,16 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
if (unlikely(ret))
return ret;
+ attr_flags = READ_ONCE(sqe->attr_inline_flags);
nr_attr_indirect = READ_ONCE(sqe->nr_attr_indirect);
- if (nr_attr_indirect) {
+ if (attr_flags) {
+ if (READ_ONCE(sqe->__pad4[0]) || nr_attr_indirect)
+ return -EINVAL;
+ ret = io_prep_inline_attr(req, rw, sqe, ddir, attr_flags);
+ } else if (nr_attr_indirect) {
u64 attr_vec_usr_addr = READ_ONCE(sqe->attr_vec_addr);
- if (READ_ONCE(sqe->__pad4[0]) || READ_ONCE(sqe->__pad4[1]) ||
- READ_ONCE(sqe->__pad4[2]))
+ if (READ_ONCE(sqe->__pad4[0]))
return -EINVAL;
ret = io_prep_attr_vec(req, rw, ddir, attr_vec_usr_addr,
--
2.25.1
* [PATCH v9 08/11] block: introduce BIP_CHECK_GUARD/REFTAG/APPTAG bip_flags
[not found] ` <CGME20241114105410epcas5p1c6a4e6141b073ccfd6277288f7d5e28b@epcas5p1.samsung.com>
@ 2024-11-14 10:45 ` Anuj Gupta
0 siblings, 0 replies; 37+ messages in thread
From: Anuj Gupta @ 2024-11-14 10:45 UTC (permalink / raw)
To: axboe, hch, kbusch, martin.petersen, asml.silence, anuj1072538,
brauner, jack, viro
Cc: io-uring, linux-nvme, linux-block, gost.dev, linux-scsi, vishak.g,
linux-fsdevel, Anuj Gupta, Kanchan Joshi
This patch introduces the BIP_CHECK_GUARD/REFTAG/APPTAG bip_flags, which
indicate how the hardware should check the integrity payload.
BIP_CHECK_GUARD/REFTAG are conversions of existing semantics, while
BIP_CHECK_APPTAG is a new flag. The driver can now just rely on block
layer flags, and doesn't need to know the integrity source. The submitter
of PI decides which tags to check. This also gives us a unified
interface for user- and kernel-generated integrity.
Signed-off-by: Anuj Gupta <[email protected]>
Signed-off-by: Kanchan Joshi <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
---
block/bio-integrity.c | 5 +++++
drivers/nvme/host/core.c | 11 +++--------
include/linux/bio-integrity.h | 6 +++++-
3 files changed, 13 insertions(+), 9 deletions(-)
diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index f56d01cec689..3bee43b87001 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -434,6 +434,11 @@ bool bio_integrity_prep(struct bio *bio)
if (bi->csum_type == BLK_INTEGRITY_CSUM_IP)
bip->bip_flags |= BIP_IP_CHECKSUM;
+ /* describe what tags to check in payload */
+ if (bi->csum_type)
+ bip->bip_flags |= BIP_CHECK_GUARD;
+ if (bi->flags & BLK_INTEGRITY_REF_TAG)
+ bip->bip_flags |= BIP_CHECK_REFTAG;
if (bio_integrity_add_page(bio, virt_to_page(buf), len,
offset_in_page(buf)) < len) {
printk(KERN_ERR "could not attach integrity payload\n");
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 1a8d32a4a5c3..bd309e98ffac 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1017,18 +1017,13 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
control |= NVME_RW_PRINFO_PRACT;
}
- switch (ns->head->pi_type) {
- case NVME_NS_DPS_PI_TYPE3:
+ if (bio_integrity_flagged(req->bio, BIP_CHECK_GUARD))
control |= NVME_RW_PRINFO_PRCHK_GUARD;
- break;
- case NVME_NS_DPS_PI_TYPE1:
- case NVME_NS_DPS_PI_TYPE2:
- control |= NVME_RW_PRINFO_PRCHK_GUARD |
- NVME_RW_PRINFO_PRCHK_REF;
+ if (bio_integrity_flagged(req->bio, BIP_CHECK_REFTAG)) {
+ control |= NVME_RW_PRINFO_PRCHK_REF;
if (op == nvme_cmd_zone_append)
control |= NVME_RW_APPEND_PIREMAP;
nvme_set_ref_tag(ns, cmnd, req);
- break;
}
}
diff --git a/include/linux/bio-integrity.h b/include/linux/bio-integrity.h
index 58ff9988433a..fe2bfe122db2 100644
--- a/include/linux/bio-integrity.h
+++ b/include/linux/bio-integrity.h
@@ -11,6 +11,9 @@ enum bip_flags {
BIP_DISK_NOCHECK = 1 << 3, /* disable disk integrity checking */
BIP_IP_CHECKSUM = 1 << 4, /* IP checksum */
BIP_COPY_USER = 1 << 5, /* Kernel bounce buffer in use */
+ BIP_CHECK_GUARD = 1 << 6, /* guard check */
+ BIP_CHECK_REFTAG = 1 << 7, /* reftag check */
+ BIP_CHECK_APPTAG = 1 << 8, /* apptag check */
};
struct bio_integrity_payload {
@@ -31,7 +34,8 @@ struct bio_integrity_payload {
};
#define BIP_CLONE_FLAGS (BIP_MAPPED_INTEGRITY | BIP_CTRL_NOCHECK | \
- BIP_DISK_NOCHECK | BIP_IP_CHECKSUM)
+ BIP_DISK_NOCHECK | BIP_IP_CHECKSUM | \
+ BIP_CHECK_GUARD | BIP_CHECK_REFTAG | BIP_CHECK_APPTAG)
#ifdef CONFIG_BLK_DEV_INTEGRITY
--
2.25.1
* [PATCH v9 09/11] nvme: add support for passing on the application tag
[not found] ` <CGME20241114105413epcas5p2d7da8675df2de0d1efba3057144e691d@epcas5p2.samsung.com>
@ 2024-11-14 10:45 ` Anuj Gupta
0 siblings, 0 replies; 37+ messages in thread
From: Anuj Gupta @ 2024-11-14 10:45 UTC (permalink / raw)
To: axboe, hch, kbusch, martin.petersen, asml.silence, anuj1072538,
brauner, jack, viro
Cc: io-uring, linux-nvme, linux-block, gost.dev, linux-scsi, vishak.g,
linux-fsdevel, Kanchan Joshi, Anuj Gupta
From: Kanchan Joshi <[email protected]>
With a user integrity buffer, there is a way to specify the app_tag.
Set the corresponding protocol-specific flags and send the app_tag down.
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Anuj Gupta <[email protected]>
Signed-off-by: Kanchan Joshi <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
---
drivers/nvme/host/core.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index bd309e98ffac..4a6ed8f38b77 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -885,6 +885,12 @@ static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req,
return BLK_STS_OK;
}
+static void nvme_set_app_tag(struct request *req, struct nvme_command *cmnd)
+{
+ cmnd->rw.lbat = cpu_to_le16(bio_integrity(req->bio)->app_tag);
+ cmnd->rw.lbatm = cpu_to_le16(0xffff);
+}
+
static void nvme_set_ref_tag(struct nvme_ns *ns, struct nvme_command *cmnd,
struct request *req)
{
@@ -1025,6 +1031,10 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
control |= NVME_RW_APPEND_PIREMAP;
nvme_set_ref_tag(ns, cmnd, req);
}
+ if (bio_integrity_flagged(req->bio, BIP_CHECK_APPTAG)) {
+ control |= NVME_RW_PRINFO_PRCHK_APP;
+ nvme_set_app_tag(req, cmnd);
+ }
}
cmnd->rw.control = cpu_to_le16(control);
--
2.25.1
* [PATCH v9 10/11] scsi: add support for user-meta interface
[not found] ` <CGME20241114105416epcas5p3a7aa552775cfe50f60ef89f7d982ea12@epcas5p3.samsung.com>
@ 2024-11-14 10:45 ` Anuj Gupta
0 siblings, 0 replies; 37+ messages in thread
From: Anuj Gupta @ 2024-11-14 10:45 UTC (permalink / raw)
To: axboe, hch, kbusch, martin.petersen, asml.silence, anuj1072538,
brauner, jack, viro
Cc: io-uring, linux-nvme, linux-block, gost.dev, linux-scsi, vishak.g,
linux-fsdevel, Anuj Gupta
Add support for sending a user-meta buffer. Set the tags to be checked
using the flags specified by the user/block layer.
With this change, BIP_CTRL_NOCHECK becomes unused. Remove it.
Signed-off-by: Anuj Gupta <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
drivers/scsi/sd.c | 4 ++--
include/linux/bio-integrity.h | 16 +++++++---------
2 files changed, 9 insertions(+), 11 deletions(-)
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 8947dab132d7..cb7ac8736b91 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -814,14 +814,14 @@ static unsigned char sd_setup_protect_cmnd(struct scsi_cmnd *scmd,
if (bio_integrity_flagged(bio, BIP_IP_CHECKSUM))
scmd->prot_flags |= SCSI_PROT_IP_CHECKSUM;
- if (bio_integrity_flagged(bio, BIP_CTRL_NOCHECK) == false)
+ if (bio_integrity_flagged(bio, BIP_CHECK_GUARD))
scmd->prot_flags |= SCSI_PROT_GUARD_CHECK;
}
if (dif != T10_PI_TYPE3_PROTECTION) { /* DIX/DIF Type 0, 1, 2 */
scmd->prot_flags |= SCSI_PROT_REF_INCREMENT;
- if (bio_integrity_flagged(bio, BIP_CTRL_NOCHECK) == false)
+ if (bio_integrity_flagged(bio, BIP_CHECK_REFTAG))
scmd->prot_flags |= SCSI_PROT_REF_CHECK;
}
diff --git a/include/linux/bio-integrity.h b/include/linux/bio-integrity.h
index fe2bfe122db2..2195bc06dcde 100644
--- a/include/linux/bio-integrity.h
+++ b/include/linux/bio-integrity.h
@@ -7,13 +7,12 @@
enum bip_flags {
BIP_BLOCK_INTEGRITY = 1 << 0, /* block layer owns integrity data */
BIP_MAPPED_INTEGRITY = 1 << 1, /* ref tag has been remapped */
- BIP_CTRL_NOCHECK = 1 << 2, /* disable HBA integrity checking */
- BIP_DISK_NOCHECK = 1 << 3, /* disable disk integrity checking */
- BIP_IP_CHECKSUM = 1 << 4, /* IP checksum */
- BIP_COPY_USER = 1 << 5, /* Kernel bounce buffer in use */
- BIP_CHECK_GUARD = 1 << 6, /* guard check */
- BIP_CHECK_REFTAG = 1 << 7, /* reftag check */
- BIP_CHECK_APPTAG = 1 << 8, /* apptag check */
+ BIP_DISK_NOCHECK = 1 << 2, /* disable disk integrity checking */
+ BIP_IP_CHECKSUM = 1 << 3, /* IP checksum */
+ BIP_COPY_USER = 1 << 4, /* Kernel bounce buffer in use */
+ BIP_CHECK_GUARD = 1 << 5, /* guard check */
+ BIP_CHECK_REFTAG = 1 << 6, /* reftag check */
+ BIP_CHECK_APPTAG = 1 << 7, /* apptag check */
};
struct bio_integrity_payload {
@@ -33,8 +32,7 @@ struct bio_integrity_payload {
struct bio_vec bip_inline_vecs[];/* embedded bvec array */
};
-#define BIP_CLONE_FLAGS (BIP_MAPPED_INTEGRITY | BIP_CTRL_NOCHECK | \
- BIP_DISK_NOCHECK | BIP_IP_CHECKSUM | \
+#define BIP_CLONE_FLAGS (BIP_MAPPED_INTEGRITY | BIP_IP_CHECKSUM | \
BIP_CHECK_GUARD | BIP_CHECK_REFTAG | BIP_CHECK_APPTAG)
#ifdef CONFIG_BLK_DEV_INTEGRITY
--
2.25.1
* [PATCH v9 11/11] block: add support to pass user meta buffer
[not found] ` <CGME20241114105418epcas5p1537d72b9016d10670cf97751704e2cc8@epcas5p1.samsung.com>
@ 2024-11-14 10:45 ` Anuj Gupta
0 siblings, 0 replies; 37+ messages in thread
From: Anuj Gupta @ 2024-11-14 10:45 UTC (permalink / raw)
To: axboe, hch, kbusch, martin.petersen, asml.silence, anuj1072538,
brauner, jack, viro
Cc: io-uring, linux-nvme, linux-block, gost.dev, linux-scsi, vishak.g,
linux-fsdevel, Kanchan Joshi, Anuj Gupta
From: Kanchan Joshi <[email protected]>
If an iocb contains metadata, extract it and prepare the bip.
Based on the flags specified by the user, set the corresponding
guard/app/ref tags to be checked in the bip.
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Anuj Gupta <[email protected]>
Signed-off-by: Kanchan Joshi <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
---
block/bio-integrity.c | 50 +++++++++++++++++++++++++++++++++++
block/fops.c | 45 ++++++++++++++++++++++++-------
include/linux/bio-integrity.h | 7 +++++
3 files changed, 92 insertions(+), 10 deletions(-)
diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index 3bee43b87001..5d81ad9a3d20 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -364,6 +364,55 @@ int bio_integrity_map_user(struct bio *bio, struct iov_iter *iter)
return ret;
}
+static void bio_uio_meta_to_bip(struct bio *bio, struct uio_meta *meta)
+{
+ struct bio_integrity_payload *bip = bio_integrity(bio);
+
+ if (meta->flags & IO_INTEGRITY_CHK_GUARD)
+ bip->bip_flags |= BIP_CHECK_GUARD;
+ if (meta->flags & IO_INTEGRITY_CHK_APPTAG)
+ bip->bip_flags |= BIP_CHECK_APPTAG;
+ if (meta->flags & IO_INTEGRITY_CHK_REFTAG)
+ bip->bip_flags |= BIP_CHECK_REFTAG;
+
+ bip->app_tag = meta->app_tag;
+}
+
+int bio_integrity_map_iter(struct bio *bio, struct uio_meta *meta)
+{
+ struct blk_integrity *bi = blk_get_integrity(bio->bi_bdev->bd_disk);
+ unsigned int integrity_bytes;
+ int ret;
+ struct iov_iter it;
+
+ if (!bi)
+ return -EINVAL;
+ /*
+ * original meta iterator can be bigger.
+ * process integrity info corresponding to current data buffer only.
+ */
+ it = meta->iter;
+ integrity_bytes = bio_integrity_bytes(bi, bio_sectors(bio));
+ if (it.count < integrity_bytes)
+ return -EINVAL;
+
+ /* should fit into two bytes */
+ BUILD_BUG_ON(IO_INTEGRITY_VALID_FLAGS >= (1 << 16));
+
+ if (meta->flags && (meta->flags & ~IO_INTEGRITY_VALID_FLAGS))
+ return -EINVAL;
+
+ it.count = integrity_bytes;
+ ret = bio_integrity_map_user(bio, &it);
+ if (!ret) {
+ bio_uio_meta_to_bip(bio, meta);
+ bip_set_seed(bio_integrity(bio), meta->seed);
+ iov_iter_advance(&meta->iter, integrity_bytes);
+ meta->seed += bio_integrity_intervals(bi, bio_sectors(bio));
+ }
+ return ret;
+}
+
/**
* bio_integrity_prep - Prepare bio for integrity I/O
* @bio: bio to prepare
@@ -564,6 +613,7 @@ int bio_integrity_clone(struct bio *bio, struct bio *bio_src,
bip->bip_vec = bip_src->bip_vec;
bip->bip_iter = bip_src->bip_iter;
bip->bip_flags = bip_src->bip_flags & BIP_CLONE_FLAGS;
+ bip->app_tag = bip_src->app_tag;
return 0;
}
diff --git a/block/fops.c b/block/fops.c
index 2d01c9007681..412ae74032ad 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -54,6 +54,7 @@ static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
struct bio bio;
ssize_t ret;
+ WARN_ON_ONCE(iocb->ki_flags & IOCB_HAS_METADATA);
if (nr_pages <= DIO_INLINE_BIO_VECS)
vecs = inline_vecs;
else {
@@ -124,12 +125,16 @@ static void blkdev_bio_end_io(struct bio *bio)
{
struct blkdev_dio *dio = bio->bi_private;
bool should_dirty = dio->flags & DIO_SHOULD_DIRTY;
+ bool is_sync = dio->flags & DIO_IS_SYNC;
if (bio->bi_status && !dio->bio.bi_status)
dio->bio.bi_status = bio->bi_status;
+ if (!is_sync && (dio->iocb->ki_flags & IOCB_HAS_METADATA))
+ bio_integrity_unmap_user(bio);
+
if (atomic_dec_and_test(&dio->ref)) {
- if (!(dio->flags & DIO_IS_SYNC)) {
+ if (!is_sync) {
struct kiocb *iocb = dio->iocb;
ssize_t ret;
@@ -221,14 +226,16 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
* a retry of this from blocking context.
*/
if (unlikely(iov_iter_count(iter))) {
- bio_release_pages(bio, false);
- bio_clear_flag(bio, BIO_REFFED);
- bio_put(bio);
- blk_finish_plug(&plug);
- return -EAGAIN;
+ ret = -EAGAIN;
+ goto fail;
}
bio->bi_opf |= REQ_NOWAIT;
}
+ if (!is_sync && (iocb->ki_flags & IOCB_HAS_METADATA)) {
+ ret = bio_integrity_map_iter(bio, iocb->private);
+ if (unlikely(ret))
+ goto fail;
+ }
if (is_read) {
if (dio->flags & DIO_SHOULD_DIRTY)
@@ -269,6 +276,12 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
bio_put(&dio->bio);
return ret;
+fail:
+ bio_release_pages(bio, false);
+ bio_clear_flag(bio, BIO_REFFED);
+ bio_put(bio);
+ blk_finish_plug(&plug);
+ return ret;
}
static void blkdev_bio_end_io_async(struct bio *bio)
@@ -286,6 +299,9 @@ static void blkdev_bio_end_io_async(struct bio *bio)
ret = blk_status_to_errno(bio->bi_status);
}
+ if (iocb->ki_flags & IOCB_HAS_METADATA)
+ bio_integrity_unmap_user(bio);
+
iocb->ki_complete(iocb, ret);
if (dio->flags & DIO_SHOULD_DIRTY) {
@@ -330,10 +346,8 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
bio_iov_bvec_set(bio, iter);
} else {
ret = bio_iov_iter_get_pages(bio, iter);
- if (unlikely(ret)) {
- bio_put(bio);
- return ret;
- }
+ if (unlikely(ret))
+ goto out_bio_put;
}
dio->size = bio->bi_iter.bi_size;
@@ -346,6 +360,13 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
task_io_account_write(bio->bi_iter.bi_size);
}
+ if (iocb->ki_flags & IOCB_HAS_METADATA) {
+ ret = bio_integrity_map_iter(bio, iocb->private);
+ WRITE_ONCE(iocb->private, NULL);
+ if (unlikely(ret))
+ goto out_bio_put;
+ }
+
if (iocb->ki_flags & IOCB_ATOMIC)
bio->bi_opf |= REQ_ATOMIC;
@@ -360,6 +381,10 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
submit_bio(bio);
}
return -EIOCBQUEUED;
+
+out_bio_put:
+ bio_put(bio);
+ return ret;
}
static ssize_t blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
diff --git a/include/linux/bio-integrity.h b/include/linux/bio-integrity.h
index 2195bc06dcde..de0a6c9de4d1 100644
--- a/include/linux/bio-integrity.h
+++ b/include/linux/bio-integrity.h
@@ -23,6 +23,7 @@ struct bio_integrity_payload {
unsigned short bip_vcnt; /* # of integrity bio_vecs */
unsigned short bip_max_vcnt; /* integrity bio_vec slots */
unsigned short bip_flags; /* control flags */
+ u16 app_tag; /* application tag value */
struct bvec_iter bio_iter; /* for rewinding parent bio */
@@ -78,6 +79,7 @@ struct bio_integrity_payload *bio_integrity_alloc(struct bio *bio, gfp_t gfp,
int bio_integrity_add_page(struct bio *bio, struct page *page, unsigned int len,
unsigned int offset);
int bio_integrity_map_user(struct bio *bio, struct iov_iter *iter);
+int bio_integrity_map_iter(struct bio *bio, struct uio_meta *meta);
void bio_integrity_unmap_user(struct bio *bio);
bool bio_integrity_prep(struct bio *bio);
void bio_integrity_advance(struct bio *bio, unsigned int bytes_done);
@@ -108,6 +110,11 @@ static int bio_integrity_map_user(struct bio *bio, struct iov_iter *iter)
return -EINVAL;
}
+static inline int bio_integrity_map_iter(struct bio *bio, struct uio_meta *meta)
+{
+ return -EINVAL;
+}
+
static inline void bio_integrity_unmap_user(struct bio *bio)
{
}
--
2.25.1
* Re: [PATCH v9 06/11] io_uring: introduce attributes for read/write and PI support
2024-11-14 10:45 ` [PATCH v9 06/11] io_uring: introduce attributes for read/write and PI support Anuj Gupta
@ 2024-11-14 12:16 ` Christoph Hellwig
2024-11-14 13:09 ` Pavel Begunkov
2024-11-15 13:29 ` Anuj gupta
2024-11-16 0:00 ` Pavel Begunkov
2024-11-16 23:09 ` kernel test robot
2 siblings, 2 replies; 37+ messages in thread
From: Christoph Hellwig @ 2024-11-14 12:16 UTC (permalink / raw)
To: Anuj Gupta
Cc: axboe, hch, kbusch, martin.petersen, asml.silence, anuj1072538,
brauner, jack, viro, io-uring, linux-nvme, linux-block, gost.dev,
linux-scsi, vishak.g, linux-fsdevel, Kanchan Joshi
On Thu, Nov 14, 2024 at 04:15:12PM +0530, Anuj Gupta wrote:
> PI attribute is supported only for direct IO. Also, vectored read/write
> operations are not supported with PI currently.
Eww. I know it's frustrating for you if maintainers give contradicting
guidance, but this is really an awful interface. Not only the pointless
indirection which makes the interface hard to use, but limiting it to
not support vectored I/O makes it pretty useless.
I guess I need to do a little read-up on why Pavel wants this, but
from the block/fs perspective the previous interface made so much
more sense.
* Re: [PATCH v9 06/11] io_uring: introduce attributes for read/write and PI support
2024-11-14 12:16 ` Christoph Hellwig
@ 2024-11-14 13:09 ` Pavel Begunkov
2024-11-14 15:19 ` Christoph Hellwig
2024-11-15 18:04 ` Matthew Wilcox
2024-11-15 13:29 ` Anuj gupta
1 sibling, 2 replies; 37+ messages in thread
From: Pavel Begunkov @ 2024-11-14 13:09 UTC (permalink / raw)
To: Christoph Hellwig, Anuj Gupta
Cc: axboe, kbusch, martin.petersen, anuj1072538, brauner, jack, viro,
io-uring, linux-nvme, linux-block, gost.dev, linux-scsi, vishak.g,
linux-fsdevel, Kanchan Joshi
On 11/14/24 12:16, Christoph Hellwig wrote:
> On Thu, Nov 14, 2024 at 04:15:12PM +0530, Anuj Gupta wrote:
>> PI attribute is supported only for direct IO. Also, vectored read/write
>> operations are not supported with PI currently.
And my apologies Anuj, I've been busy, I hope to take a look
at this series today / tomorrow.
> Eww. I know it's frustration for your if maintainers give contradicting
> guidance, but this is really an awful interface. Not only the pointless
Because once you place it at a fixed location, nothing realistically
will be able to reuse it. Not everyone will need PI, but the assumption
is that there will be more additional types of attributes / parameters.
With SQE128 it's also a problem that now all SQEs are 128 bytes regardless
of whether a particular request needs it or not, and the user will need
to zero them for each request.
The discussion continued in the v6 thread, here
https://lore.kernel.org/all/[email protected]/T/#m12beca2ede2bd2017796adb391bedec9c95d85c3
and a little bit more here:
https://lore.kernel.org/all/[email protected]/T/#mc3f7a95915a64551e061d37b33a643676c5d87b2
> indirection which make the interface hard to use, but limiting it to
> not support vectored I/O makes it pretty useless.
I'm not sure why that's the case (and I need to take a look), but I
don't immediately see how it's relevant to that part of the API. It
shouldn't really matter where the main PI structure is located; you
get an iovec pointer and the code from there wouldn't be any different.
> I guess I need to do a little reading up on why Pavel wants this, but
> from the block/fs perspective the previous interface made so much
> more sense.
--
Pavel Begunkov
* Re: [PATCH v9 06/11] io_uring: introduce attributes for read/write and PI support
2024-11-14 13:09 ` Pavel Begunkov
@ 2024-11-14 15:19 ` Christoph Hellwig
2024-11-15 16:40 ` Pavel Begunkov
2024-11-15 18:04 ` Matthew Wilcox
1 sibling, 1 reply; 37+ messages in thread
From: Christoph Hellwig @ 2024-11-14 15:19 UTC (permalink / raw)
To: Pavel Begunkov
Cc: Christoph Hellwig, Anuj Gupta, axboe, kbusch, martin.petersen,
anuj1072538, brauner, jack, viro, io-uring, linux-nvme,
linux-block, gost.dev, linux-scsi, vishak.g, linux-fsdevel,
Kanchan Joshi
On Thu, Nov 14, 2024 at 01:09:44PM +0000, Pavel Begunkov wrote:
>> Eww. I know it's frustrating for you if maintainers give contradicting
>> guidance, but this is really an awful interface. Not only the pointless
>
> Because once you place it at a fixed location nothing realistically
> will be able to reuse it. Not everyone will need PI, but the assumption
> is that there will be more types of attributes / parameters.
So? If we have a strong enough requirement for something else we
can trivially add another opcode. Maybe we should just add different
opcodes for read/write with metadata so that folks don't freak out
about this?
> With SQE128 it's also a problem that now all SQEs are 128 bytes regardless
> of whether a particular request needs it or not, and the user will need
> to zero them for each request.
The user is not going to create a SQE128 ring unless they need to,
so this seems like a bit of an odd objection.
* Re: [PATCH v9 06/11] io_uring: introduce attributes for read/write and PI support
2024-11-14 12:16 ` Christoph Hellwig
2024-11-14 13:09 ` Pavel Begunkov
@ 2024-11-15 13:29 ` Anuj gupta
1 sibling, 0 replies; 37+ messages in thread
From: Anuj gupta @ 2024-11-15 13:29 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Anuj Gupta, axboe, kbusch, martin.petersen, asml.silence, brauner,
jack, viro, io-uring, linux-nvme, linux-block, gost.dev,
linux-scsi, vishak.g, linux-fsdevel, Kanchan Joshi
On Thu, Nov 14, 2024 at 5:46 PM Christoph Hellwig <[email protected]> wrote:
>
> On Thu, Nov 14, 2024 at 04:15:12PM +0530, Anuj Gupta wrote:
> > PI attribute is supported only for direct IO. Also, vectored read/write
> > operations are not supported with PI currently.
>
> Eww. I know it's frustrating for you if maintainers give contradicting
> guidance, but this is really an awful interface. Not only the pointless
> indirection, which makes the interface hard to use, but limiting it to
> not support vectored I/O makes it pretty useless.
>
The check added in this patch that returns failure for vectored I/O is a
mistake. The application can prepare protection information for a vectored
read/write and send it, so vectored I/O works with the current patchset.
I just need to remove the check in this patch.
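(For illustration, a rough sketch of the vectored case, assuming 4096-byte
logical blocks carrying 8 bytes of PI each; struct io_uring_attr_pi and the
IO_INTEGRITY_CHK_* flags are the names from this patchset's commit message,
while buf_a/buf_b and the sizes are made up:)
	/* a 3-block vectored read: the PI buffer is laid out per logical
	 * block for the whole request, independent of how the data is
	 * split across iovecs */
	struct iovec iov[2] = {
		{ .iov_base = buf_a, .iov_len = 8192 },	/* blocks 0-1 */
		{ .iov_base = buf_b, .iov_len = 4096 },	/* block  2  */
	};
	unsigned char pi_buf[3 * 8];	/* one 8-byte PI tuple per block */

	struct io_uring_attr_pi pi = {
		.flags	= IO_INTEGRITY_CHK_GUARD | IO_INTEGRITY_CHK_REFTAG,
		.addr	= (__u64)(uintptr_t)pi_buf,
		.len	= sizeof(pi_buf),
	};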
* Re: [PATCH v9 06/11] io_uring: introduce attributes for read/write and PI support
2024-11-14 15:19 ` Christoph Hellwig
@ 2024-11-15 16:40 ` Pavel Begunkov
2024-11-15 17:12 ` Christoph Hellwig
0 siblings, 1 reply; 37+ messages in thread
From: Pavel Begunkov @ 2024-11-15 16:40 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Anuj Gupta, axboe, kbusch, martin.petersen, anuj1072538, brauner,
jack, viro, io-uring, linux-nvme, linux-block, gost.dev,
linux-scsi, vishak.g, linux-fsdevel, Kanchan Joshi
On 11/14/24 15:19, Christoph Hellwig wrote:
> On Thu, Nov 14, 2024 at 01:09:44PM +0000, Pavel Begunkov wrote:
>>> Eww. I know it's frustrating for you if maintainers give contradicting
>>> guidance, but this is really an awful interface. Not only the pointless
>>
>> Because once you place it at a fixed location nothing realistically
>> will be able to reuse it. Not everyone will need PI, but the assumption
>> is that there will be more types of attributes / parameters.
>
> So? If we have a strong enough requirement for something else we
> can trivially add another opcode. Maybe we should just add different
> opcodes for read/write with metadata so that folks don't freak out
> about this?
IMHO, PI is not so special as to deserve its own opcode, unlike
some more generic read/write with meta / attributes, but that one
would face the same questions.
FWIW, the series was steered away from the separate opcode approach to
avoid duplicating things; for example, there are 3 different OP_READ*
opcodes varying by the buffer type, and there is no reason meta reads
wouldn't want to support all of them as well. I have to admit that the
effort has been a bit unfortunate on that side, switching back and forth
at least a couple of times, including attempts from 2+ years ago by
someone else.
>> With SQE128 it's also a problem that now all SQEs are 128 bytes regardless
>> of whether a particular request needs it or not, and the user will need
>> to zero them for each request.
>
> The user is not going to create a SQE128 ring unless they need to,
> so this seems like a bit of an odd objection.
It doesn't bring this overhead to those who don't use meta/PI, that's
right, but it does add it if you want to mix PI with nearly all other
request types, and such mixing is desirable.
As I mentioned before, it's just one downside but not a deal breaker.
I'm more concerned that the next type of meta information won't be
able to fit into the SQE, and then we'll need to solve the same problem
(indirection + optimising copy_from_user by other means) while having
PI as a special case. And that's more of a problem of the static
placement from the previous version, e.g. it wouldn't be a problem if
in the long run it becomes something like:
struct attr attr, *p;
if (flags & META_IN_USE_SQE128)
p = sqe + 1;
else {
copy_from_user(&attr);
p = &attr;
}
but that shouldn't be PI specific.
--
Pavel Begunkov
* Re: [PATCH v9 06/11] io_uring: introduce attributes for read/write and PI support
2024-11-15 16:40 ` Pavel Begunkov
@ 2024-11-15 17:12 ` Christoph Hellwig
2024-11-15 17:44 ` Jens Axboe
2024-11-15 19:03 ` Pavel Begunkov
0 siblings, 2 replies; 37+ messages in thread
From: Christoph Hellwig @ 2024-11-15 17:12 UTC (permalink / raw)
To: Pavel Begunkov
Cc: Christoph Hellwig, Anuj Gupta, axboe, kbusch, martin.petersen,
anuj1072538, brauner, jack, viro, io-uring, linux-nvme,
linux-block, gost.dev, linux-scsi, vishak.g, linux-fsdevel,
Kanchan Joshi
On Fri, Nov 15, 2024 at 04:40:58PM +0000, Pavel Begunkov wrote:
>> So? If we have a strong enough requirement for something else we
>> can trivially add another opcode. Maybe we should just add different
>> opcodes for read/write with metadata so that folks don't freak out
>> about this?
>
> IMHO, PI is not so special as to deserve its own opcode, unlike
> some more generic read/write with meta / attributes, but that one
> would face the same questions.
Well, apparently it is on the one hand not general enough for you to
want to give it SQE128 space, but on the other you also don't want to
give it an opcode.
Maybe we just need to make it a uring_cmd to get out of these conflicting
requirements.
Just to make it clear: I'm not a huge fan of a separate opcode or
uring_cmd, but compared to the version in this patch it is much better.
> PI as a special case. And that's more of a problem of the static
> placing from previous version, e.g. it wouldn't be a problem if in the
> long run it becomes sth like:
>
> struct attr attr, *p;
>
> if (flags & META_IN_USE_SQE128)
> p = sqe + 1;
> else {
> copy_from_user(&attr);
> p = &attr;
> }
>
> but that shouldn't be PI specific.
Why would anyone not use the SQE128 version?
* Re: [PATCH v9 06/11] io_uring: introduce attributes for read/write and PI support
2024-11-15 17:12 ` Christoph Hellwig
@ 2024-11-15 17:44 ` Jens Axboe
2024-11-15 18:00 ` Christoph Hellwig
2024-11-15 19:03 ` Pavel Begunkov
1 sibling, 1 reply; 37+ messages in thread
From: Jens Axboe @ 2024-11-15 17:44 UTC (permalink / raw)
To: Christoph Hellwig, Pavel Begunkov
Cc: Anuj Gupta, kbusch, martin.petersen, anuj1072538, brauner, jack,
viro, io-uring, linux-nvme, linux-block, gost.dev, linux-scsi,
vishak.g, linux-fsdevel, Kanchan Joshi
On 11/15/24 10:12 AM, Christoph Hellwig wrote:
> On Fri, Nov 15, 2024 at 04:40:58PM +0000, Pavel Begunkov wrote:
>>> So? If we have a strong enough requirement for something else we
>>> can trivially add another opcode. Maybe we should just add different
>>> opcodes for read/write with metadata so that folks don't freak out
>>> about this?
>>
>> IMHO, PI is not so special as to deserve its own opcode, unlike
>> some more generic read/write with meta / attributes, but that one
>> would face the same questions.
>
> Well, apparently it is on the one hand not general enough for you to
> want to give it SQE128 space, but on the other you also don't want to
> give it an opcode.
>
> Maybe we just need to make it a uring_cmd to get out of these conflicting
> requirements.
Let's please lay off the hyperbole here, uring_cmd would be a terrible
way to do this. We're working through the flags requirements. Obviously
this is now missing 6.13, but there's no reason why it's not on track to
make 6.14 in a saner way.
--
Jens Axboe
* Re: [PATCH v9 06/11] io_uring: introduce attributes for read/write and PI support
2024-11-15 17:44 ` Jens Axboe
@ 2024-11-15 18:00 ` Christoph Hellwig
0 siblings, 0 replies; 37+ messages in thread
From: Christoph Hellwig @ 2024-11-15 18:00 UTC (permalink / raw)
To: Jens Axboe
Cc: Christoph Hellwig, Pavel Begunkov, Anuj Gupta, kbusch,
martin.petersen, anuj1072538, brauner, jack, viro, io-uring,
linux-nvme, linux-block, gost.dev, linux-scsi, vishak.g,
linux-fsdevel, Kanchan Joshi
On Fri, Nov 15, 2024 at 10:44:58AM -0700, Jens Axboe wrote:
> Let's please lay off the hyperbole here, uring_cmd would be a terrible
> way to do this. We're working through the flags requirements. Obviously
> this is now missing 6.13, but there's no reason why it's not on track to
> make 6.14 in a saner way.
I don't think it would actually be all that terrible. Still not my
preferred option of course.
* Re: [PATCH v9 06/11] io_uring: introduce attributes for read/write and PI support
2024-11-14 13:09 ` Pavel Begunkov
2024-11-14 15:19 ` Christoph Hellwig
@ 2024-11-15 18:04 ` Matthew Wilcox
2024-11-20 17:35 ` Darrick J. Wong
1 sibling, 1 reply; 37+ messages in thread
From: Matthew Wilcox @ 2024-11-15 18:04 UTC (permalink / raw)
To: Pavel Begunkov
Cc: Christoph Hellwig, Anuj Gupta, axboe, kbusch, martin.petersen,
anuj1072538, brauner, jack, viro, io-uring, linux-nvme,
linux-block, gost.dev, linux-scsi, vishak.g, linux-fsdevel,
Kanchan Joshi
On Thu, Nov 14, 2024 at 01:09:44PM +0000, Pavel Begunkov wrote:
> With SQE128 it's also a problem that now all SQEs are 128 bytes regardless
> of whether a particular request needs it or not, and the user will need
> to zero them for each request.
The way we handled this in NVMe was to use a bit in the command that
was called (iirc) FUSED, which let you use two consecutive entries for
a single command.
Some variant on that could surely be used for io_uring. Perhaps a
special opcode that says "the real opcode is here, and this is a two-slot
command". Processing gets a little spicy when one slot is the last in
the buffer and the next is the first in the buffer, but that's a SMOP.
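(Purely illustrative, since no such two-slot command exists in io_uring
today; the wraparound case mentioned above is just ring-index masking on
the consumer side, something like:)
	/* sq_entries is a power of two, so slot indexing wraps with a mask;
	 * the second half of a hypothetical two-slot command may land in
	 * slot 0 while the first half sits in the last slot */
	unsigned mask = sq_entries - 1;
	struct io_uring_sqe *first  = &sqes[head & mask];
	struct io_uring_sqe *second = &sqes[(head + 1) & mask];
	head += 2;	/* consume both slots */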
* Re: [PATCH v9 06/11] io_uring: introduce attributes for read/write and PI support
2024-11-15 17:12 ` Christoph Hellwig
2024-11-15 17:44 ` Jens Axboe
@ 2024-11-15 19:03 ` Pavel Begunkov
2024-11-18 12:49 ` Christoph Hellwig
1 sibling, 1 reply; 37+ messages in thread
From: Pavel Begunkov @ 2024-11-15 19:03 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Anuj Gupta, axboe, kbusch, martin.petersen, anuj1072538, brauner,
jack, viro, io-uring, linux-nvme, linux-block, gost.dev,
linux-scsi, vishak.g, linux-fsdevel, Kanchan Joshi
On 11/15/24 17:12, Christoph Hellwig wrote:
> On Fri, Nov 15, 2024 at 04:40:58PM +0000, Pavel Begunkov wrote:
>>> So? If we have a strong enough requirement for something else we
>>> can trivially add another opcode. Maybe we should just add different
>>> opcodes for read/write with metadata so that folks don't freak out
>>> about this?
>>
>> IMHO, PI is not so special as to deserve its own opcode, unlike
>> some more generic read/write with meta / attributes, but that one
>> would face the same questions.
>
> Well, apparently it is on the one hand not general enough for you to
> want to give it SQE128 space, but on the other you also don't want to
> give it an opcode.
It's not like there are no other options. It can be user pointers,
and now we have the infra to optimise it if copy_from_user is
expensive.
One thing that doesn't feel right is the double indirection, i.e.
a user pointer into an array of pointers, at least IIUC from a quick
glance. I'll follow up on that.
> Maybe we just need to make it a uring_cmd to get out of these conflicting
> requirements.
>
> Just to make it clear: I'm not a huge fan of a separate opcode or
> uring_cmd, but compared to the version in this patch it is much better.
>
>> PI as a special case. And that's more of a problem of the static
>> placing from previous version, e.g. it wouldn't be a problem if in the
>> long run it becomes sth like:
>>
>> struct attr attr, *p;
>>
>> if (flags & META_IN_USE_SQE128)
>> p = sqe + 1;
>> else {
>> copy_from_user(&attr);
>> p = &attr;
>> }
>>
>> but that shouldn't be PI specific.
>
> Why would anyone not use the SQE128 version?
!SQE128 with a user pointer can easily be faster depending on the
ratio of requests that use SQE128 and those that don't, e.g. one PI read
followed by a hundred send/recv on average. copy_from_user
is not _that_ expensive, and we're talking about zeroing an
extra cache line that is never used afterwards.
Though the main reason would be when you pass 2+ different
attributes and there is no space to put them in SQEs.
--
Pavel Begunkov
* Re: [PATCH v9 06/11] io_uring: introduce attributes for read/write and PI support
2024-11-14 10:45 ` [PATCH v9 06/11] io_uring: introduce attributes for read/write and PI support Anuj Gupta
2024-11-14 12:16 ` Christoph Hellwig
@ 2024-11-16 0:00 ` Pavel Begunkov
2024-11-16 0:32 ` Pavel Begunkov
2024-11-16 23:09 ` kernel test robot
2 siblings, 1 reply; 37+ messages in thread
From: Pavel Begunkov @ 2024-11-16 0:00 UTC (permalink / raw)
To: Anuj Gupta, axboe, hch, kbusch, martin.petersen, anuj1072538,
brauner, jack, viro
Cc: io-uring, linux-nvme, linux-block, gost.dev, linux-scsi, vishak.g,
linux-fsdevel, Kanchan Joshi
On 11/14/24 10:45, Anuj Gupta wrote:
> Add the ability to pass additional attributes along with read/write.
> Application can populate an array of 'struct io_uring_attr_vec' and pass
> its address using the SQE field:
> __u64 attr_vec_addr;
>
> Along with number of attributes using:
> __u8 nr_attr_indirect;
>
> Overall 16 attributes are allowed and currently one attribute
> 'ATTR_TYPE_PI' is supported.
Why only 16? It's possible that we might need more; 256 would
be a safer choice and fits into a u8. I don't think you even
need to commit to a number; it should be ok to add more as
long as it fits into the given types (u8 above). It could also
be a u16.
> With PI attribute, userspace can pass following information:
> - flags: integrity check flags IO_INTEGRITY_CHK_{GUARD/APPTAG/REFTAG}
> - len: length of PI/metadata buffer
> - addr: address of metadata buffer
> - seed: seed value for reftag remapping
> - app_tag: application defined 16b value
In terms of flexibility I like it apart from small nits,
but the double indirection could be a bit inefficient,
this thing:
struct pi_attr pi = {};
attr_array = { &pi, ... };
sqe->attr_addr = attr_array;
So maybe we should just flatten it? An attempt to pseudo-code
it to understand what it entails is below. It's certainly
buggy and some handling is omitted, but it should show the
idea.
// uapi/.../io_uring.h
struct sqe {
...
u64 attr_addr;
/* the total size of the array pointed by attr_addr */
u16 attr_size; /* max 64KB, more than enough */
}
struct io_attr_header {
/* bit mask of attributes passed, can be helpful in the future
* for optimising processing.
*/
u64 attr_type_map;
};
/* each attribute should start with a preamble */
struct io_uring_attr_preamble {
u16 attr_type;
};
// user space
struct PI_param {
struct io_attr_header header;
struct io_uring_attr_preamble preamble;
struct io_uring_attr_pi pi;
};
struct PI_param p = {};
p.header.attr_type_map = 1 << ATTR_TYPE_PI;
p.preamble.attr_type = ATTR_TYPE_PI;
p.pi = {...};
sqe->attr_addr = &p;
sqe->attr_size = sizeof(p);
The holes between structures should be packed better. For the same
reason I don't like a separate preamble structure much; maybe it
should be embedded into the attribute structures, e.g.
struct io_uring_attr_pi {
u16 attr_type;
...
}
The user side looks ok to me; it should be pretty straightforward
if the user can define a structure like PI_param, i.e. knows
at compile time which attributes it wants to use.
// kernel space (current patch set, PI only)
addr = READ_ONCE(sqe->attr_addr);
if (addr) {
size = READ_ONCE(sqe->attr_size);
process_pi(addr, size);
}
process_pi(addr, size) {
struct PI_param p;
if (size != sizeof(struct PI_param))
return -EINVAL;
copy_from_user(&p, addr, sizeof(p));
if (p.preamble.attr_type != ATTR_TYPE_PI)
return -EINVAL;
do_pi_setup(&p.pi);
}
This one is pretty simple as well. A bit more troublesome if
extended with many attributes, but it'd need additional handling
regardless:
process_pi(addr, size) {
if (size < sizeof(struct io_attr_header) + sizeof(struct io_uring_attr_preamble))
return -EINVAL;
attr_array = malloc(size); // +caching by io_uring
copy_from_user(attr_array, addr, size);
handle_attributes(attr_array, size);
}
handle_attributes(attr_array, size) {
struct io_attr_header *hdr = attr_array;
offset = sizeof(*hdr);
while (1) {
if (offset + sizeof(struct preamble) > size)
break;
struct preamble *pr = attr_array + offset;
if (pr->type > MAX_TYPES)
return -EINVAL;
attr_size = attr_sizes[pr->type];
if (offset + sizeof(preamble) + attr_size > size)
return -EINVAL;
offset += sizeof(preamble) + attr_size;
process_attr(pr->type, (void *)(pr + 1));
}
}
Some checks can probably be optimised by playing with the uapi
a bit.
attr_type_map is unused here, but I like the idea. I'd love
to see all actual attribute handling move deeper into the
stack, to those who actually need it, but that's for a far
away, undecided future. E.g.
io_uring_rw {
p = malloc();
copy_from_user(p, sqe->attr_addr);
kiocb->attributes = p;
}
block_do_read {
hdr = kiocb->attributes;
type_mask = /* all types block layer recognises */
if (hdr->attr_type_map & type_mask)
use_attributes();
}
copy_from_user can be optimised; as I mentioned before, we can
have a pre-mapped area into which the indirection can point.
The infra is already in there and is even used for passing
waiting arguments.
process_pi(addr, size) {
struct PI_param *p, __p;
if (some_flags & USE_REGISTERED_REGION) {
// Glorified p = ctx->ptr; with some checks
p = io_uring_get_mem(addr, size);
} else {
copy_from_user(__p, addr, sizeof(__p));
p = &__p;
}
...
}
In this case all reads would need to be READ_ONCE, but that
shouldn't be a problem. It might also optimise out the kmalloc
in the extended version.
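(A tiny sketch of what that means concretely, reusing names from the
pseudo-code above; illustrative only:)
	/* the attribute block may sit in memory user space can modify
	 * concurrently, so each field is loaded exactly once */
	u64 attr_map = READ_ONCE(hdr->attr_type_map);
	u64 pi_addr = READ_ONCE(p->pi.addr);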
--
Pavel Begunkov
* Re: [PATCH v9 06/11] io_uring: introduce attributes for read/write and PI support
2024-11-16 0:00 ` Pavel Begunkov
@ 2024-11-16 0:32 ` Pavel Begunkov
2024-11-18 12:50 ` Christoph Hellwig
0 siblings, 1 reply; 37+ messages in thread
From: Pavel Begunkov @ 2024-11-16 0:32 UTC (permalink / raw)
To: Anuj Gupta, axboe, hch, kbusch, martin.petersen, anuj1072538,
brauner, jack, viro
Cc: io-uring, linux-nvme, linux-block, gost.dev, linux-scsi, vishak.g,
linux-fsdevel, Kanchan Joshi
On 11/16/24 00:00, Pavel Begunkov wrote:
> On 11/14/24 10:45, Anuj Gupta wrote:
>> Add the ability to pass additional attributes along with read/write.
>> Application can populate an array of 'struct io_uring_attr_vec' and pass
>> its address using the SQE field:
>> __u64 attr_vec_addr;
>>
>> Along with number of attributes using:
>> __u8 nr_attr_indirect;
>>
>> Overall 16 attributes are allowed and currently one attribute
>> 'ATTR_TYPE_PI' is supported.
>
> Why only 16? It's possible that we might need more; 256 would
> be a safer choice and fits into a u8. I don't think you even
> need to commit to a number; it should be ok to add more as
> long as it fits into the given types (u8 above). It could also
> be a u16.
>
>> With PI attribute, userspace can pass following information:
>> - flags: integrity check flags IO_INTEGRITY_CHK_{GUARD/APPTAG/REFTAG}
>> - len: length of PI/metadata buffer
>> - addr: address of metadata buffer
>> - seed: seed value for reftag remapping
>> - app_tag: application defined 16b value
>
> In terms of flexibility I like it apart from small nits,
> but the double indirection could be a bit inefficient,
> this thing:
>
> struct pi_attr pi = {};
> attr_array = { &pi, ... };
> sqe->attr_addr = attr_array;
We can also reuse your idea from your previous iterations and
use the bitmap to list all attributes. Then preamble and
the explicit attr_type field are not needed, type checking
in the loop is removed and packing is better. And just
by looking at the map we can calculate the size of the
array and remove all size checks in the loop.
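(To make that concrete, a small sketch of the size calculation;
attr_sizes[] is the per-type size table from the earlier pseudo-code,
the helper itself is hypothetical:)
	static const size_t attr_sizes[] = {
		[ATTR_TYPE_PI] = sizeof(struct io_uring_attr_pi),
	};

	/* the bitmap alone determines how many attribute bytes follow */
	size_t attr_array_size(u64 attr_type_map)
	{
		size_t total = 0;

		while (attr_type_map) {
			int bit = __builtin_ctzll(attr_type_map);

			total += attr_sizes[bit]; /* bounds-check bit in real code */
			attr_type_map &= attr_type_map - 1; /* clear lowest set bit */
		}
		return total;
	}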
--
Pavel Begunkov
* Re: [PATCH v9 06/11] io_uring: introduce attributes for read/write and PI support
2024-11-14 10:45 ` [PATCH v9 06/11] io_uring: introduce attributes for read/write and PI support Anuj Gupta
2024-11-14 12:16 ` Christoph Hellwig
2024-11-16 0:00 ` Pavel Begunkov
@ 2024-11-16 23:09 ` kernel test robot
2 siblings, 0 replies; 37+ messages in thread
From: kernel test robot @ 2024-11-16 23:09 UTC (permalink / raw)
To: Anuj Gupta, axboe, hch, kbusch, martin.petersen, asml.silence,
anuj1072538, brauner, jack, viro
Cc: oe-kbuild-all, io-uring, linux-nvme, linux-block, gost.dev,
linux-scsi, vishak.g, linux-fsdevel, Anuj Gupta, Kanchan Joshi
Hi Anuj,
kernel test robot noticed the following build warnings:
[auto build test WARNING on axboe-block/for-next]
[also build test WARNING on next-20241115]
[cannot apply to brauner-vfs/vfs.all mkp-scsi/for-next hch-configfs/for-next linus/master jejb-scsi/for-next v6.12-rc7]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Anuj-Gupta/block-define-set-of-integrity-flags-to-be-inherited-by-cloned-bip/20241114-193419
base: https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git for-next
patch link: https://lore.kernel.org/r/20241114104517.51726-7-anuj20.g%40samsung.com
patch subject: [PATCH v9 06/11] io_uring: introduce attributes for read/write and PI support
config: arc-nsimosci_hs_smp_defconfig (https://download.01.org/0day-ci/archive/20241117/[email protected]/config)
compiler: arc-elf-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241117/[email protected]/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
All warnings (new ones prefixed by >>):
io_uring/rw.c: In function 'io_prep_pi_indirect':
>> io_uring/rw.c:305:38: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
305 | if (copy_from_user(&pi_attr, (void __user *)pi_attr_addr, sizeof(pi_attr)))
| ^
io_uring/rw.c: In function 'io_prep_attr_vec':
io_uring/rw.c:321:38: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
321 | if (copy_from_user(attr_vec, (void __user *)attr_addr, attr_vec_size))
| ^
vim +305 io_uring/rw.c
298
299
300 static inline int io_prep_pi_indirect(struct io_kiocb *req, struct io_rw *rw,
301 int ddir, u64 pi_attr_addr)
302 {
303 struct io_uring_attr_pi pi_attr;
304
> 305 if (copy_from_user(&pi_attr, (void __user *)pi_attr_addr, sizeof(pi_attr)))
306 return -EFAULT;
307 return io_prep_rw_pi(req, rw, ddir, &pi_attr);
308 }
309
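(For reference, the warning above is the usual 32-bit pointer-width
complaint; the conventional fix in io_uring code is the kernel's
u64_to_user_ptr() helper rather than the open-coded cast, e.g.:)
	if (copy_from_user(&pi_attr, u64_to_user_ptr(pi_attr_addr), sizeof(pi_attr)))
		return -EFAULT;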
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH v9 06/11] io_uring: introduce attributes for read/write and PI support
2024-11-15 19:03 ` Pavel Begunkov
@ 2024-11-18 12:49 ` Christoph Hellwig
0 siblings, 0 replies; 37+ messages in thread
From: Christoph Hellwig @ 2024-11-18 12:49 UTC (permalink / raw)
To: Pavel Begunkov
Cc: Christoph Hellwig, Anuj Gupta, axboe, kbusch, martin.petersen,
anuj1072538, brauner, jack, viro, io-uring, linux-nvme,
linux-block, gost.dev, linux-scsi, vishak.g, linux-fsdevel,
Kanchan Joshi
On Fri, Nov 15, 2024 at 07:03:28PM +0000, Pavel Begunkov wrote:
>>> but that shouldn't be PI specific.
>>
>> Why would anyone not use the SQE128 version?
>
> !SQE128 with a user pointer can easily be faster depending on the
> ratio of requests that use SQE128 and those that don't, e.g. one PI read
> followed by a hundred send/recv on average. copy_from_user
> is not _that_ expensive, and we're talking about zeroing an
> extra cache line that is never used afterwards.
Why would you use the same ring for it? Remember PI is typically
used by things like databases. Everything that does disk I/O
will use it, so optimizing for it actually being used absolutely
makes sense.
* Re: [PATCH v9 06/11] io_uring: introduce attributes for read/write and PI support
2024-11-16 0:32 ` Pavel Begunkov
@ 2024-11-18 12:50 ` Christoph Hellwig
2024-11-18 16:59 ` Pavel Begunkov
0 siblings, 1 reply; 37+ messages in thread
From: Christoph Hellwig @ 2024-11-18 12:50 UTC (permalink / raw)
To: Pavel Begunkov
Cc: Anuj Gupta, axboe, hch, kbusch, martin.petersen, anuj1072538,
brauner, jack, viro, io-uring, linux-nvme, linux-block, gost.dev,
linux-scsi, vishak.g, linux-fsdevel, Kanchan Joshi
On Sat, Nov 16, 2024 at 12:32:25AM +0000, Pavel Begunkov wrote:
> We can also reuse your idea from your previous iterations and
> use the bitmap to list all attributes. Then preamble and
> the explicit attr_type field are not needed, type checking
> in the loop is removed and packing is better. And just
> by looking at the map we can calculate the size of the
> array and remove all size checks in the loop.
Can we please stop overdesigning the f**k out of this? Really,
either we're fine using the space in the extended SQE, or
we're fine using a separate opcode, or if we really have to just
make it uring_cmd. But stop making things extensible for
the sake of being extensible.
* Re: [PATCH v9 06/11] io_uring: introduce attributes for read/write and PI support
2024-11-18 12:50 ` Christoph Hellwig
@ 2024-11-18 16:59 ` Pavel Begunkov
2024-11-18 17:03 ` Christoph Hellwig
2024-11-21 8:59 ` Anuj Gupta
0 siblings, 2 replies; 37+ messages in thread
From: Pavel Begunkov @ 2024-11-18 16:59 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Anuj Gupta, axboe, kbusch, martin.petersen, anuj1072538, brauner,
jack, viro, io-uring, linux-nvme, linux-block, gost.dev,
linux-scsi, vishak.g, linux-fsdevel, Kanchan Joshi
On 11/18/24 12:50, Christoph Hellwig wrote:
> On Sat, Nov 16, 2024 at 12:32:25AM +0000, Pavel Begunkov wrote:
>> We can also reuse your idea from your previous iterations and
>> use the bitmap to list all attributes. Then preamble and
>> the explicit attr_type field are not needed, type checking
>> in the loop is removed and packing is better. And just
>> by looking at the map we can calculate the size of the
>> array and remove all size checks in the loop.
>
> Can we please stop overdesigning the f**k out of this? Really,
Please stop it, it doesn't add weight to your argument. The design
requirement has never changed, at least not during this patchset's
iterations.
> either we're fine using the space in the extended SQE, or
> we're fine using a separate opcode, or if we really have to just
> make it uring_cmd. But stop making things extensible for
> the sake of being extensible.
It's asked to be extendible because there is a good chance it'll need to
be extended, and no, I'm not suggesting anyone implement the entire
thing; only the PI bits are fine.
And no, it doesn't have to be "this or that" while there are other
options suggested for consideration. And the problem with the SQE128
option is not even about SQE128 but how it's placed inside, i.e.
at a fixed spot.
Do we have technical arguments against the direction in the last
suggestion? It's extendible and _very_ simple. The entire (generic)
handling for the bitmask approach for this set would be something like:
struct sqe {
u64 attr_type_mask;
u64 attr_ptr;
};
if (sqe->attr_type_mask) {
if (sqe->attr_type_mask != TYPE_PI)
return -EINVAL;
struct uapi_pi_structure pi;
copy_from_user(&pi, sqe->attr_ptr, sizeof(pi));
handle_pi(&pi);
}
And the user side:
struct uapi_pi_structure pi = { ... };
sqe->attr_ptr = &pi;
sqe->attr_type_mask = TYPE_PI;
--
Pavel Begunkov
* Re: [PATCH v9 06/11] io_uring: introduce attributes for read/write and PI support
2024-11-18 16:59 ` Pavel Begunkov
@ 2024-11-18 17:03 ` Christoph Hellwig
2024-11-18 17:45 ` Pavel Begunkov
2024-11-21 8:59 ` Anuj Gupta
1 sibling, 1 reply; 37+ messages in thread
From: Christoph Hellwig @ 2024-11-18 17:03 UTC (permalink / raw)
To: Pavel Begunkov
Cc: Christoph Hellwig, Anuj Gupta, axboe, kbusch, martin.petersen,
anuj1072538, brauner, jack, viro, io-uring, linux-nvme,
linux-block, gost.dev, linux-scsi, vishak.g, linux-fsdevel,
Kanchan Joshi
On Mon, Nov 18, 2024 at 04:59:22PM +0000, Pavel Begunkov wrote:
>>
>> Can we please stop overdesigning the f**k out of this? Really,
>
> Please stop it, it doesn't add weight to your argument. The design
> requirement has never changed, at least not during this patchset's
> iterations.
That's what you think because you are overdesigning the hell out of
it. And at least for me that rings every single alarm bell about
horrible interface design.
>> either we're fine using the space in the extended SQE, or
>> we're fine using a separate opcode, or if we really have to just
>> make it uring_cmd. But stop making things extensible for
>> the sake of being extensible.
>
> It's asked to be extendible because there is a good chance it'll need to
> be extended, and no, I'm not suggesting anyone implement the entire
> thing; only the PI bits are fine.
Extensibility as in having reserved fields that can be checked for
is one thing. "Extensibility" by adding indirections over indirections
without a concrete use case is another thing. And we're deep into the
latter phase now.
> And no, it doesn't have to be "this or that" while there are other
> options suggested for consideration. And the problem with the SQE128
> option is not even about SQE128 but how it's placed inside, i.e.
> at a fixed spot.
>
> Do we have technical arguments against the direction in the last
> suggestion?
Yes. It adds completely pointless indirections and variable offsets.
How do you expect people to actually use that sanely without
introducing bugs left right and center?
I really don't get why you want to make an I/O fast path as complicated
as possible.
* Re: [PATCH v9 06/11] io_uring: introduce attributes for read/write and PI support
2024-11-18 17:03 ` Christoph Hellwig
@ 2024-11-18 17:45 ` Pavel Begunkov
2024-11-19 12:49 ` Christoph Hellwig
0 siblings, 1 reply; 37+ messages in thread
From: Pavel Begunkov @ 2024-11-18 17:45 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Anuj Gupta, axboe, kbusch, martin.petersen, anuj1072538, brauner,
jack, viro, io-uring, linux-nvme, linux-block, gost.dev,
linux-scsi, vishak.g, linux-fsdevel, Kanchan Joshi
On 11/18/24 17:03, Christoph Hellwig wrote:
> On Mon, Nov 18, 2024 at 04:59:22PM +0000, Pavel Begunkov wrote:
>>>
>>> Can we please stop overdesigning the f**k out of this? Really,
>>
>> Please stop it, it doesn't add weight to your argument. The design
>> requirement has never changed, at least not during this patchset's
>> iterations.
>
> That's what you think because you are overdesigning the hell out of
> it. And at least for me that rings every single alarm bell about
> horrible interface design.
Well, and that's what you think, terribly incorrectly as far as
I can say.
>>> either we're fine using the space in the extended SQE, or
>>> we're fine using a separate opcode, or if we really have to just
>>> make it uring_cmd. But stop making things extensible for
>>> the sake of being extensible.
>>
>> It's asked to be extendible because there is a good chance it'll need to
>> be extended, and no, I'm not suggesting anyone implement the entire
>> thing; only the PI bits are fine.
>
> Extensibility as in having reserved fields that can be checked for
> is one thing. "Extensibility" by adding indirections over indirections
I don't know where you found indirections over indirections.
> without a concrete use case is another thing. And we're deep into the
> latter phase now.
>
>> And no, it doesn't have to be "this or that" while there are other
>> options suggested for consideration. And the problem with the SQE128
>> option is not even about SQE128 but how it's placed inside, i.e.
>> at a fixed spot.
>>
>> Do we have technical arguments against the direction in the last
>> suggestion?
>
> Yes. It adds completely pointless indirections and variable offsets.
One indirection, and there are no variable offsets while PI remains
the only user around.
> How do you expect people to actually use that sanely without
> introducing bugs left right and center?
I've just given you an example of what the user space can look like; I
have absolutely no idea what you're talking about.
> I really don't get why you want to make an I/O fast path as complicated
> as possible.
Exactly, _fast path_. PI-only handling is very simple, I don't buy
that "complicated". If we'd need to add more without an API expecting
that, that'll mean yet another forest of never-ending checks in the
fast path affecting all users.
--
Pavel Begunkov
* Re: [PATCH v9 06/11] io_uring: introduce attributes for read/write and PI support
2024-11-18 17:45 ` Pavel Begunkov
@ 2024-11-19 12:49 ` Christoph Hellwig
2024-11-21 13:29 ` Pavel Begunkov
0 siblings, 1 reply; 37+ messages in thread
From: Christoph Hellwig @ 2024-11-19 12:49 UTC (permalink / raw)
To: Pavel Begunkov
Cc: Christoph Hellwig, Anuj Gupta, axboe, kbusch, martin.petersen,
anuj1072538, brauner, jack, viro, io-uring, linux-nvme,
linux-block, gost.dev, linux-scsi, vishak.g, linux-fsdevel,
Kanchan Joshi
On Mon, Nov 18, 2024 at 05:45:02PM +0000, Pavel Begunkov wrote:
> Exactly, _fast path_. PI-only handling is very simple, I don't buy
> that "complicated". If we'd need to add more without an API expecting
> that, that'll mean yet another forest of never-ending checks in the
> fast path affecting all users.
Well, that's a good argument for a separate opcode for PI, or at least
for a 128-byte write, isn't it? I have a real hard time trying to find
a coherent line in your arguments.
* Re: [PATCH v9 06/11] io_uring: introduce attributes for read/write and PI support
2024-11-15 18:04 ` Matthew Wilcox
@ 2024-11-20 17:35 ` Darrick J. Wong
2024-11-21 6:54 ` Christoph Hellwig
2024-11-21 13:45 ` Pavel Begunkov
0 siblings, 2 replies; 37+ messages in thread
From: Darrick J. Wong @ 2024-11-20 17:35 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Pavel Begunkov, Christoph Hellwig, Anuj Gupta, axboe, kbusch,
martin.petersen, anuj1072538, brauner, jack, viro, io-uring,
linux-nvme, linux-block, gost.dev, linux-scsi, vishak.g,
linux-fsdevel, Kanchan Joshi
On Fri, Nov 15, 2024 at 06:04:01PM +0000, Matthew Wilcox wrote:
> On Thu, Nov 14, 2024 at 01:09:44PM +0000, Pavel Begunkov wrote:
> > With SQE128 it's also a problem that now all SQEs are 128 bytes regardless
> > of whether a particular request needs it or not, and the user will need
> > to zero them for each request.
>
> The way we handled this in NVMe was to use a bit in the command that
> was called (iirc) FUSED, which let you use two consecutive entries for
> a single command.
>
> Some variant on that could surely be used for io_uring. Perhaps a
> special opcode that says "the real opcode is here, and this is a two-slot
> command". Processing gets a little spicy when one slot is the last in
> the buffer and the next is the first in the buffer, but that's a SMOP.
I like willy's suggestion -- what's the difficulty in having a SQE flag
that says "...and keep going into the next SQE"? I guess that
introduces the problem that you can no longer react to the observation
of 4 new SQEs by creating 4 new contexts to process those SQEs and throw
all 4 of them at background threads, since you don't know how many IOs
are there.
That said, depending on the size of the PI metadata, it might be more
convenient for the app programmer to supply one pointer to a single
array of PI information for the entire IO request, packed in whatever
format the underlying device wants.
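(To make the sizing concrete: with the common 4096-byte-block, 8-byte-PI
format the single packed array stays small; the numbers below are only an
example:)
	/* one 8-byte PI tuple per 4096-byte logical block: a 64 KiB request
	 * needs a single 128-byte PI array, however the data itself is
	 * scattered across iovecs or discontiguous LBA ranges */
	size_t data_len = 64 * 1024;
	size_t nr_blocks = data_len / 4096;	/* 16 */
	size_t pi_len = nr_blocks * 8;		/* 128 bytes */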
Thinking with my xfs(progs) hat on, if we ever wanted to run xfs_buf(fer
cache) IOs through io_uring with PI metadata, we'd probably want a
vectored io submission interface (xfs_buffers can map to discontiguous
LBA ranges on disk), but we'd probably have a single memory object to
hold all the PI information.
But really, AFAICT it's 6 of one or half a dozen of the other, so I
don't care all that much so long as you all pick something and merge it.
:)
--D
* Re: [PATCH v9 06/11] io_uring: introduce attributes for read/write and PI support
2024-11-20 17:35 ` Darrick J. Wong
@ 2024-11-21 6:54 ` Christoph Hellwig
2024-11-21 13:45 ` Pavel Begunkov
1 sibling, 0 replies; 37+ messages in thread
From: Christoph Hellwig @ 2024-11-21 6:54 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Matthew Wilcox, Pavel Begunkov, Christoph Hellwig, Anuj Gupta,
axboe, kbusch, martin.petersen, anuj1072538, brauner, jack, viro,
io-uring, linux-nvme, linux-block, gost.dev, linux-scsi, vishak.g,
linux-fsdevel, Kanchan Joshi
On Wed, Nov 20, 2024 at 09:35:17AM -0800, Darrick J. Wong wrote:
> I like willy's suggestion -- what's the difficulty in having a SQE flag
> that says "...and keep going into the next SQE"? I guess that
> introduces the problem that you can no longer react to the observation
> of 4 new SQEs by creating 4 new contexts to process those SQEs and throw
> all 4 of them at background threads, since you don't know how many IOs
> are there.
Which is why everyone hates the nvme fused commands with a passion, and no
one but vmware actually uses them, and no other fused command pair
except for compare and write ever materialized.
> That said, depending on the size of the PI metadata, it might be more
> convenient for the app programmer to supply one pointer to a single
> array of PI information for the entire IO request, packed in whatever
> format the underlying device wants.
>
> Thinking with my xfs(progs) hat on, if we ever wanted to run xfs_buf(fer
> cache) IOs through io_uring with PI metadata, we'd probably want a
> vectored io submission interface (xfs_buffers can map to discontiguous
> LBA ranges on disk), but we'd probably have a single memory object to
> hold all the PI information.
Agreed. And unless I'm misremembering something, all proposals so far
had a single PI buffer for vectored read/writes.
* Re: [PATCH v9 06/11] io_uring: introduce attributes for read/write and PI support
2024-11-18 16:59 ` Pavel Begunkov
2024-11-18 17:03 ` Christoph Hellwig
@ 2024-11-21 8:59 ` Anuj Gupta
2024-11-21 15:45 ` Pavel Begunkov
1 sibling, 1 reply; 37+ messages in thread
From: Anuj Gupta @ 2024-11-21 8:59 UTC (permalink / raw)
To: Pavel Begunkov
Cc: Christoph Hellwig, axboe, kbusch, martin.petersen, anuj1072538,
brauner, jack, viro, io-uring, linux-nvme, linux-block, gost.dev,
linux-scsi, vishak.g, linux-fsdevel, Kanchan Joshi
On Mon, Nov 18, 2024 at 04:59:22PM +0000, Pavel Begunkov wrote:
> On 11/18/24 12:50, Christoph Hellwig wrote:
> > On Sat, Nov 16, 2024 at 12:32:25AM +0000, Pavel Begunkov wrote:
> > > We can also reuse your idea from your previous iterations and
> > > use the bitmap to list all attributes. Then preamble and
> > > the explicit attr_type field are not needed, type checking
> > > in the loop is removed and packing is better. And just
> > > by looking at the map we can calculate the size of the
> > > array and remove all size checks in the loop.
> >
> > Can we please stop overdesigning the f**k out of this? Really,
>
> Please stop it, it doesn't add weight to your argument. The design
> requirement has never changed, at least not during this patchset's
> iterations.
>
> > either we're fine using the space in the extended SQE, or
> > we're fine using a separate opcode, or if we really have to just
> > make it uring_cmd. But stop making things extensible for
> > the sake of being extensible.
>
> It's asked to be extendible because there is a good chance it'll need to
> be extended, and no, I'm not suggesting anyone implement the entire
> thing; only the PI bits are fine.
>
> And no, it doesn't have to be "this or that" while there are other
> options suggested for consideration. And the problem with the SQE128
> option is not even about SQE128 but how it's placed inside, i.e.
> at a fixed spot.
>
> Do we have technical arguments against the direction in the last
> suggestion? It's extendible and _very_ simple. The entire (generic)
> handling for the bitmask approach for this set would be something like:
>
> struct sqe {
> u64 attr_type_mask;
> u64 attr_ptr;
> };
> if (sqe->attr_type_mask) {
> if (sqe->attr_type_mask != TYPE_PI)
> return -EINVAL;
>
> struct uapi_pi_structure pi;
> copy_from_user(&pi, sqe->attr_ptr, sizeof(pi));
> handle_pi(&pi);
> }
>
> And the user side:
>
> struct uapi_pi_structure pi = { ... };
> sqe->attr_ptr = &pi;
> sqe->attr_type_mask = TYPE_PI;
>
How about using this, but also having the ability to keep PI inline?
Attributes added down the line can take one of these options:
1. If the available space in the SQE/SQE128 is sufficient for keeping the
attribute fields, one can choose to keep them inline and introduce a
TYPE_XYZ_INLINE attribute flag.
2. If the available space is not sufficient, pass the attribute information
as a pointer and introduce a TYPE_XYZ attribute flag.
3. One can choose to support an attribute via both the pointer and inline
schemes. The pointer scheme can help with scenarios where the user wants to
avoid SQE128 for whatever reasons (e.g. mixed workloads).
Something like this:
// uapi/.../io_uring.h
struct sqe {
...
u64 attr_type_mask;
u64 attr_ptr;
};
// kernel space
if (sqe->attr_type_mask) {
struct uapi_pi_struct pi, *piptr;
if ((sqe->attr_type_mask & TYPE_PI_INLINE) &&
(sqe->attr_type_mask & TYPE_PI))
return -EINVAL;
/* inline PI case */
if (sqe->attr_type_mask & TYPE_PI_INLINE) {
if (!(ctx->flags & IORING_SETUP_SQE128))
return -EINVAL;
piptr = (sqe + 1);
} else if (sqe->attr_type_mask & TYPE_PI) {
/* indirect PI case */
copy_from_user(&pi, sqe->attr_ptr, sizeof(pi));
piptr = &pi;
}
handle_pi(piptr);
}
And the user side:
// send PI using pointer:
struct uapi_pi_structure pi = { ... };
sqe->attr_ptr = &pi;
sqe->attr_type_mask = TYPE_PI;
// send PI inline:
/* setup a big-sqe ring */
struct uapi_pi_structure *pi = (sqe+1);
prepare_pi(pi);
sqe->attr_type_mask = TYPE_PI_INLINE;
* Re: [PATCH v9 06/11] io_uring: introduce attributes for read/write and PI support
2024-11-19 12:49 ` Christoph Hellwig
@ 2024-11-21 13:29 ` Pavel Begunkov
0 siblings, 0 replies; 37+ messages in thread
From: Pavel Begunkov @ 2024-11-21 13:29 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Anuj Gupta, axboe, kbusch, martin.petersen, anuj1072538, brauner,
jack, viro, io-uring, linux-nvme, linux-block, gost.dev,
linux-scsi, vishak.g, linux-fsdevel, Kanchan Joshi
On 11/19/24 12:49, Christoph Hellwig wrote:
> On Mon, Nov 18, 2024 at 05:45:02PM +0000, Pavel Begunkov wrote:
>> Exactly, _fast path_. PI-only handling is very simple, I don't buy
>> that "complicated". If we'd need to add more without an API expecting
>> that, that'll mean yet another forest of never-ending checks in the
>> fast path affecting all users.
>
> Well, that's a good argument for a separate opcode for PI, or at least
No, it's not. Apart from full duplication I haven't seen any PI
implementation that doesn't add overhead to the io_uring read/write
path, which is ok, but pretending that dropping in a new opcode solves
everything is ill-advised.
And I hope there is no misunderstanding on the fact that there are
other criteria as well, and what's not explicitly mentioned is usually
common sense. For example, it's supposed to be correct and bug-free
as well as maintainable.
> for a 128-byte write, isn't it? I have a real hard time trying to find
> a coherent line in your arguments.
When coming from invalid assumptions everything would seem incoherent.
And please, I'm not here to humour you; you can keep your crude
statements to yourself, it's getting old.
--
Pavel Begunkov
* Re: [PATCH v9 06/11] io_uring: introduce attributes for read/write and PI support
2024-11-20 17:35 ` Darrick J. Wong
2024-11-21 6:54 ` Christoph Hellwig
@ 2024-11-21 13:45 ` Pavel Begunkov
1 sibling, 0 replies; 37+ messages in thread
From: Pavel Begunkov @ 2024-11-21 13:45 UTC (permalink / raw)
To: Darrick J. Wong, Matthew Wilcox
Cc: Christoph Hellwig, Anuj Gupta, axboe, kbusch, martin.petersen,
anuj1072538, brauner, jack, viro, io-uring, linux-nvme,
linux-block, gost.dev, linux-scsi, vishak.g, linux-fsdevel,
Kanchan Joshi
On 11/20/24 17:35, Darrick J. Wong wrote:
> On Fri, Nov 15, 2024 at 06:04:01PM +0000, Matthew Wilcox wrote:
>> On Thu, Nov 14, 2024 at 01:09:44PM +0000, Pavel Begunkov wrote:
>>> With SQE128 it's also a problem that now all SQEs are 128 bytes regardless
>>> of whether a particular request needs it or not, and the user will need
>>> to zero them for each request.
>>
>> The way we handled this in NVMe was to use a bit in the command that
>> was called (iirc) FUSED, which let you use two consecutive entries for
>> a single command.
>>
>> Some variant on that could surely be used for io_uring. Perhaps a
>> special opcode that says "the real opcode is here, and this is a two-slot
>> command". Processing gets a little spicy when one slot is the last in
>> the buffer and the next is the first in the buffer, but that's a SMOP.
>
> I like willy's suggestion -- what's the difficulty in having a SQE flag
> that says "...and keep going into the next SQE"? I guess that
> introduces the problem that you can no longer react to the observation
> of 4 new SQEs by creating 4 new contexts to process those SQEs and throw
> all 4 of them at background threads, since you don't know how many IOs
> are there.
Some variation on "variable size SQE" was discussed back in the day
as an option instead of SQE128. I don't remember why it was refused
exactly, but I'd think it was exactly the "spicy" moment Matthew
mentioned, especially since nvme passthrough was spanning its payload
across both parts of the SQE.
I'm pretty sure I can find more than a couple of downsides: for
it to be truly generic you need a flag in each SQE and finding a bit
is not that easy, and it also adds some overhead for everyone else
while this extension is not even needed. At the end of the day, the
main concern is how it's placed and not where specifically,
SQE / user memory / etc.
--
Pavel Begunkov
* Re: [PATCH v9 06/11] io_uring: introduce attributes for read/write and PI support
2024-11-21 8:59 ` Anuj Gupta
@ 2024-11-21 15:45 ` Pavel Begunkov
0 siblings, 0 replies; 37+ messages in thread
From: Pavel Begunkov @ 2024-11-21 15:45 UTC (permalink / raw)
To: Anuj Gupta
Cc: Christoph Hellwig, axboe, kbusch, martin.petersen, anuj1072538,
brauner, jack, viro, io-uring, linux-nvme, linux-block, gost.dev,
linux-scsi, vishak.g, linux-fsdevel, Kanchan Joshi
On 11/21/24 08:59, Anuj Gupta wrote:
> On Mon, Nov 18, 2024 at 04:59:22PM +0000, Pavel Begunkov wrote:
>> On 11/18/24 12:50, Christoph Hellwig wrote:
>>> On Sat, Nov 16, 2024 at 12:32:25AM +0000, Pavel Begunkov wrote:
...
>> Do we have technical arguments against the direction in the last
>> suggestion? It's extendible and _very_ simple. The entire (generic)
>> handling for the bitmask approach for this set would be something like:
>>
>> struct sqe {
>> u64 attr_type_mask;
>> u64 attr_ptr;
>> };
>> if (sqe->attr_type_mask) {
>> if (sqe->attr_type_mask != TYPE_PI)
>> return -EINVAL;
>>
>> struct uapi_pi_structure pi;
>> copy_from_user(&pi, sqe->attr_ptr, sizeof(pi));
>> handle_pi(&pi);
>> }
>>
>> And the user side:
>>
>> struct uapi_pi_structure pi = { ... };
>> sqe->attr_ptr = &pi;
>> sqe->attr_type_mask = TYPE_PI;
>>
>
> How about using this, but also having the ability to keep PI inline?
> Attributes added down the line can take one of these options:
> 1. If the available space in the SQE/SQE128 is sufficient for keeping the
> attribute fields, one can choose to keep them inline and introduce a
> TYPE_XYZ_INLINE attribute flag.
> 2. If the available space is not sufficient, pass the attribute information
> as a pointer and introduce a TYPE_XYZ attribute flag.
> 3. One can choose to support an attribute via both the pointer and inline
> schemes. The pointer scheme can help with scenarios where the user wants to
> avoid SQE128 for whatever reasons (e.g. mixed workloads).
Right, the idea would work. It'd need to be not type specific but
rather a separate flag covering all attributes of a request, though.
IOW, either all of them are in user memory or all optimised. We probably
don't have a good place for a flag, but then you can just chip away a
bit from attr_type_mask as you're doing for INLINE.
enum {
TYPE_PI = 1,
...
TYPE_FLAG_INLINE = 1 << 63,
};
// sqe->attr_type_mask = TYPE_PI | TYPE_FLAG_INLINE;
Another question is whether it's better to use the SQE or another mapping
like the reg-wait thing does. My suggestion is: send it without the INLINE
optimisation, targeting 6.14 (I assume the block bits are sorted?). We'll
figure out that optimisation separately and target the same release; there
is plenty of time for that.
--
Pavel Begunkov
end of thread, newest: ~2024-11-21 15:44 UTC
Thread overview: 37+ messages
[not found] <CGME20241114105326epcas5p103b2c996293fa680092b97c747fdbd59@epcas5p1.samsung.com>
2024-11-14 10:45 ` [PATCH v9 00/11] Read/Write with meta/integrity Anuj Gupta
[not found] ` <CGME20241114105352epcas5p109c1742fa8a6552296b9c104f2271308@epcas5p1.samsung.com>
2024-11-14 10:45 ` [PATCH v9 01/11] block: define set of integrity flags to be inherited by cloned bip Anuj Gupta
[not found] ` <CGME20241114105354epcas5p49a73947c3d37be4189023f66fb7ba413@epcas5p4.samsung.com>
2024-11-14 10:45 ` [PATCH v9 02/11] block: copy back bounce buffer to user-space correctly in case of split Anuj Gupta
[not found] ` <CGME20241114105357epcas5p41fd14282d4abfe564e858b37babe708a@epcas5p4.samsung.com>
2024-11-14 10:45 ` [PATCH v9 03/11] block: modify bio_integrity_map_user to accept iov_iter as argument Anuj Gupta
[not found] ` <CGME20241114105400epcas5p270b8062a0c4f26833a5b497f057d65a7@epcas5p2.samsung.com>
2024-11-14 10:45 ` [PATCH v9 04/11] fs, iov_iter: define meta io descriptor Anuj Gupta
[not found] ` <CGME20241114105402epcas5p41b1f6054a557f1bda2cfddfdfb9a9477@epcas5p4.samsung.com>
2024-11-14 10:45 ` [PATCH v9 05/11] fs: introduce IOCB_HAS_METADATA for metadata Anuj Gupta
[not found] ` <CGME20241114105405epcas5p24ca2fb9017276ff8a50ef447638fd739@epcas5p2.samsung.com>
2024-11-14 10:45 ` [PATCH v9 06/11] io_uring: introduce attributes for read/write and PI support Anuj Gupta
2024-11-14 12:16 ` Christoph Hellwig
2024-11-14 13:09 ` Pavel Begunkov
2024-11-14 15:19 ` Christoph Hellwig
2024-11-15 16:40 ` Pavel Begunkov
2024-11-15 17:12 ` Christoph Hellwig
2024-11-15 17:44 ` Jens Axboe
2024-11-15 18:00 ` Christoph Hellwig
2024-11-15 19:03 ` Pavel Begunkov
2024-11-18 12:49 ` Christoph Hellwig
2024-11-15 18:04 ` Matthew Wilcox
2024-11-20 17:35 ` Darrick J. Wong
2024-11-21 6:54 ` Christoph Hellwig
2024-11-21 13:45 ` Pavel Begunkov
2024-11-15 13:29 ` Anuj gupta
2024-11-16 0:00 ` Pavel Begunkov
2024-11-16 0:32 ` Pavel Begunkov
2024-11-18 12:50 ` Christoph Hellwig
2024-11-18 16:59 ` Pavel Begunkov
2024-11-18 17:03 ` Christoph Hellwig
2024-11-18 17:45 ` Pavel Begunkov
2024-11-19 12:49 ` Christoph Hellwig
2024-11-21 13:29 ` Pavel Begunkov
2024-11-21 8:59 ` Anuj Gupta
2024-11-21 15:45 ` Pavel Begunkov
2024-11-16 23:09 ` kernel test robot
[not found] ` <CGME20241114105408epcas5p3c77cda2faf7ccb37abbfd8e95b4ad1f5@epcas5p3.samsung.com>
2024-11-14 10:45 ` [PATCH v9 07/11] io_uring: inline read/write attributes and PI Anuj Gupta
[not found] ` <CGME20241114105410epcas5p1c6a4e6141b073ccfd6277288f7d5e28b@epcas5p1.samsung.com>
2024-11-14 10:45 ` [PATCH v9 08/11] block: introduce BIP_CHECK_GUARD/REFTAG/APPTAG bip_flags Anuj Gupta
[not found] ` <CGME20241114105413epcas5p2d7da8675df2de0d1efba3057144e691d@epcas5p2.samsung.com>
2024-11-14 10:45 ` [PATCH v9 09/11] nvme: add support for passing on the application tag Anuj Gupta
[not found] ` <CGME20241114105416epcas5p3a7aa552775cfe50f60ef89f7d982ea12@epcas5p3.samsung.com>
2024-11-14 10:45 ` [PATCH v9 10/11] scsi: add support for user-meta interface Anuj Gupta
[not found] ` <CGME20241114105418epcas5p1537d72b9016d10670cf97751704e2cc8@epcas5p1.samsung.com>
2024-11-14 10:45 ` [PATCH v9 11/11] block: add support to pass user meta buffer Anuj Gupta