* [PATCHv3 1/5] bvec: introduce multi-page bvec iterating
2023-11-20 22:40 [PATCHv3 0/5] block integrity: directly map user space addresses Keith Busch
@ 2023-11-20 22:40 ` Keith Busch
2023-11-21 5:01 ` Christoph Hellwig
2023-11-21 8:37 ` Ming Lei
2023-11-20 22:40 ` [PATCHv3 2/5] block: bio-integrity: directly map user buffers Keith Busch
` (3 subsequent siblings)
4 siblings, 2 replies; 17+ messages in thread
From: Keith Busch @ 2023-11-20 22:40 UTC (permalink / raw)
To: linux-block, linux-nvme, io-uring
Cc: axboe, hch, joshi.k, martin.petersen, Keith Busch
From: Keith Busch <[email protected]>
Some bio_vec iterators can handle physically contiguous memory and have
no need to split bvec consideration on page boundaries.
Signed-off-by: Keith Busch <[email protected]>
---
include/linux/bvec.h | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index 555aae5448ae4..9364c258513e0 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -184,6 +184,12 @@ static inline void bvec_iter_advance_single(const struct bio_vec *bv,
((bvl = bvec_iter_bvec((bio_vec), (iter))), 1); \
bvec_iter_advance_single((bio_vec), &(iter), (bvl).bv_len))
+#define for_each_mp_bvec(bvl, bio_vec, iter, start) \
+ for (iter = (start); \
+ (iter).bi_size && \
+ ((bvl = mp_bvec_iter_bvec((bio_vec), (iter))), 1); \
+ bvec_iter_advance_single((bio_vec), &(iter), (bvl).bv_len))
+
/* for iterating one bio from start to end */
#define BVEC_ITER_ALL_INIT (struct bvec_iter) \
{ \
--
2.34.1
^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCHv3 1/5] bvec: introduce multi-page bvec iterating
2023-11-20 22:40 ` [PATCHv3 1/5] bvec: introduce multi-page bvec iterating Keith Busch
@ 2023-11-21 5:01 ` Christoph Hellwig
2023-11-21 8:37 ` Ming Lei
1 sibling, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2023-11-21 5:01 UTC (permalink / raw)
To: Keith Busch
Cc: linux-block, linux-nvme, io-uring, axboe, hch, joshi.k,
martin.petersen, Keith Busch
On Mon, Nov 20, 2023 at 02:40:54PM -0800, Keith Busch wrote:
> diff --git a/include/linux/bvec.h b/include/linux/bvec.h
> index 555aae5448ae4..9364c258513e0 100644
> --- a/include/linux/bvec.h
> +++ b/include/linux/bvec.h
> @@ -184,6 +184,12 @@ static inline void bvec_iter_advance_single(const struct bio_vec *bv,
> ((bvl = bvec_iter_bvec((bio_vec), (iter))), 1); \
> bvec_iter_advance_single((bio_vec), &(iter), (bvl).bv_len))
>
> +#define for_each_mp_bvec(bvl, bio_vec, iter, start) \
> + for (iter = (start); \
> + (iter).bi_size && \
> + ((bvl = mp_bvec_iter_bvec((bio_vec), (iter))), 1); \
> + bvec_iter_advance_single((bio_vec), &(iter), (bvl).bv_len))
Hope this isn't too much bike-shedding, but in the block layer
we generally use _segment for the single-page bvecs and plain bvec
for the ones not limited to page size. That naming isn't the best
either, but I wonder if it's worth changing the existing 4 callers
to be consistent. (And maybe one or two of them don't want the limit anyway?)
Otherwise this looks good to me.
* Re: [PATCHv3 1/5] bvec: introduce multi-page bvec iterating
2023-11-20 22:40 ` [PATCHv3 1/5] bvec: introduce multi-page bvec iterating Keith Busch
2023-11-21 5:01 ` Christoph Hellwig
@ 2023-11-21 8:37 ` Ming Lei
2023-11-21 15:49 ` Keith Busch
1 sibling, 1 reply; 17+ messages in thread
From: Ming Lei @ 2023-11-21 8:37 UTC (permalink / raw)
To: Keith Busch
Cc: linux-block, linux-nvme, io-uring, axboe, hch, joshi.k,
martin.petersen, Keith Busch, ming.lei
On Mon, Nov 20, 2023 at 02:40:54PM -0800, Keith Busch wrote:
> From: Keith Busch <[email protected]>
>
> Some bio_vec iterators can handle physically contiguous memory and have
> no need to split bvec consideration on page boundaries.
Then I am wondering why this helper is needed, and you can use each bvec
directly, which is supposed to be physically contiguous.
>
> Signed-off-by: Keith Busch <[email protected]>
> ---
> include/linux/bvec.h | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/include/linux/bvec.h b/include/linux/bvec.h
> index 555aae5448ae4..9364c258513e0 100644
> --- a/include/linux/bvec.h
> +++ b/include/linux/bvec.h
> @@ -184,6 +184,12 @@ static inline void bvec_iter_advance_single(const struct bio_vec *bv,
> ((bvl = bvec_iter_bvec((bio_vec), (iter))), 1); \
> bvec_iter_advance_single((bio_vec), &(iter), (bvl).bv_len))
>
> +#define for_each_mp_bvec(bvl, bio_vec, iter, start) \
> + for (iter = (start); \
> + (iter).bi_size && \
> + ((bvl = mp_bvec_iter_bvec((bio_vec), (iter))), 1); \
> + bvec_iter_advance_single((bio_vec), &(iter), (bvl).bv_len))
> +
We already have bio_for_each_bvec() to iterate over (multipage)bvecs
from bio.
Thanks,
Ming
* Re: [PATCHv3 1/5] bvec: introduce multi-page bvec iterating
2023-11-21 8:37 ` Ming Lei
@ 2023-11-21 15:49 ` Keith Busch
2023-11-22 0:43 ` Ming Lei
0 siblings, 1 reply; 17+ messages in thread
From: Keith Busch @ 2023-11-21 15:49 UTC (permalink / raw)
To: Ming Lei
Cc: Keith Busch, linux-block, linux-nvme, io-uring, axboe, hch,
joshi.k, martin.petersen
On Tue, Nov 21, 2023 at 04:37:01PM +0800, Ming Lei wrote:
> On Mon, Nov 20, 2023 at 02:40:54PM -0800, Keith Busch wrote:
> > From: Keith Busch <[email protected]>
> >
> > Some bio_vec iterators can handle physically contiguous memory and have
> > no need to split bvec consideration on page boundaries.
>
> Then I am wondering why this helper is needed, and you can use each bvec
> directly, which is supposed to be physically contiguous.
It's just a helper function to iterate a generic bvec.
> > diff --git a/include/linux/bvec.h b/include/linux/bvec.h
> > index 555aae5448ae4..9364c258513e0 100644
> > --- a/include/linux/bvec.h
> > +++ b/include/linux/bvec.h
> > @@ -184,6 +184,12 @@ static inline void bvec_iter_advance_single(const struct bio_vec *bv,
> > ((bvl = bvec_iter_bvec((bio_vec), (iter))), 1); \
> > bvec_iter_advance_single((bio_vec), &(iter), (bvl).bv_len))
> >
> > +#define for_each_mp_bvec(bvl, bio_vec, iter, start) \
> > + for (iter = (start); \
> > + (iter).bi_size && \
> > + ((bvl = mp_bvec_iter_bvec((bio_vec), (iter))), 1); \
> > + bvec_iter_advance_single((bio_vec), &(iter), (bvl).bv_len))
> > +
>
> We already have bio_for_each_bvec() to iterate over (multipage)bvecs
> from bio.
Right, but we are not dealing with a bio here. We have a bip bvec
instead, so can't use bio_for_each_bvec().
* Re: [PATCHv3 1/5] bvec: introduce multi-page bvec iterating
2023-11-21 15:49 ` Keith Busch
@ 2023-11-22 0:43 ` Ming Lei
2023-11-22 0:54 ` Keith Busch
0 siblings, 1 reply; 17+ messages in thread
From: Ming Lei @ 2023-11-22 0:43 UTC (permalink / raw)
To: Keith Busch
Cc: Keith Busch, linux-block, linux-nvme, io-uring, axboe, hch,
joshi.k, martin.petersen, ming.lei
On Tue, Nov 21, 2023 at 08:49:45AM -0700, Keith Busch wrote:
> On Tue, Nov 21, 2023 at 04:37:01PM +0800, Ming Lei wrote:
> > On Mon, Nov 20, 2023 at 02:40:54PM -0800, Keith Busch wrote:
> > > From: Keith Busch <[email protected]>
> > >
> > > Some bio_vec iterators can handle physically contiguous memory and have
> > > no need to split bvec consideration on page boundaries.
> >
> > Then I am wondering why this helper is needed, and you can use each bvec
> > directly, which is supposed to be physically contiguous.
>
> It's just a helper function to iterate a generic bvec.
I just looked into patch 3 to see the use; it seems what you need is
for_each_bvec_all(), which is safe & efficient to use when freeing the
host data (bio or bip), but can't be used on a split bio/bip, where the
generic iterator is needed.
And you can open-code it in bio_integrity_unmap_user():
for (i = 0; i < bip->bip_vcnt; i++) {
struct bio_vec *v = &bip->bip_vec[i];
...
}
Thanks,
Ming
* Re: [PATCHv3 1/5] bvec: introduce multi-page bvec iterating
2023-11-22 0:43 ` Ming Lei
@ 2023-11-22 0:54 ` Keith Busch
0 siblings, 0 replies; 17+ messages in thread
From: Keith Busch @ 2023-11-22 0:54 UTC (permalink / raw)
To: Ming Lei
Cc: Keith Busch, linux-block, linux-nvme, io-uring, axboe, hch,
joshi.k, martin.petersen
On Wed, Nov 22, 2023 at 08:43:48AM +0800, Ming Lei wrote:
>
> And you can open-code it in bio_integrity_unmap_user():
>
> for (i = 0; i < bip->bip_vcnt; i++) {
> struct bio_vec *v = &bip->bip_vec[i];
>
> ...
> }
That works for me. io_uring/rsrc.c does something similar too, which I
referenced when implementing this. I thought the macro might help make
this optimisation more reachable for future use, but I don't need to
introduce it with only the one user here.
* [PATCHv3 2/5] block: bio-integrity: directly map user buffers
2023-11-20 22:40 [PATCHv3 0/5] block integrity: directly map user space addresses Keith Busch
2023-11-20 22:40 ` [PATCHv3 1/5] bvec: introduce multi-page bvec iterating Keith Busch
@ 2023-11-20 22:40 ` Keith Busch
2023-11-20 23:19 ` Jens Axboe
` (2 more replies)
2023-11-20 22:40 ` [PATCHv3 3/5] nvme: use bio_integrity_map_user Keith Busch
` (2 subsequent siblings)
4 siblings, 3 replies; 17+ messages in thread
From: Keith Busch @ 2023-11-20 22:40 UTC (permalink / raw)
To: linux-block, linux-nvme, io-uring
Cc: axboe, hch, joshi.k, martin.petersen, Keith Busch
From: Keith Busch <[email protected]>
Passthrough commands that utilize metadata currently bounce the user
space buffer through the kernel. Add support for mapping user space
directly so that we can avoid this costly overhead. This is similiar to
how the normal bio data payload utilizes user addresses with
bio_map_user_iov().
If the user address can't directly be used for reasons like too many
segments or address unalignement, fallback to a copy of the user vec
while keeping the user address pinned for the IO duration so that it
can safely be copied on completion in any process context.
Signed-off-by: Keith Busch <[email protected]>
---
block/bio-integrity.c | 212 ++++++++++++++++++++++++++++++++++++++++++
include/linux/bio.h | 12 +++
2 files changed, 224 insertions(+)
diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index ec8ac8cf6e1b9..b761058bfb92f 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -91,6 +91,37 @@ struct bio_integrity_payload *bio_integrity_alloc(struct bio *bio,
}
EXPORT_SYMBOL(bio_integrity_alloc);
+static void bio_integrity_unmap_user(struct bio_integrity_payload *bip)
+{
+ bool dirty = bio_data_dir(bip->bip_bio) == READ;
+ struct bvec_iter iter;
+ struct bio_vec bv;
+
+ if (bip->bip_flags & BIP_COPY_USER) {
+ unsigned short nr_vecs = bip->bip_max_vcnt - 1;
+ struct bio_vec *copy = bvec_virt(&bip->bip_vec[nr_vecs]);
+ size_t bytes = bip->bip_iter.bi_size;
+ void *buf = bvec_virt(bip->bip_vec);
+
+ if (dirty) {
+ struct iov_iter iter;
+
+ iov_iter_bvec(&iter, ITER_DEST, copy, nr_vecs, bytes);
+ WARN_ON_ONCE(copy_to_iter(buf, bytes, &iter) != bytes);
+ }
+
+ memcpy(bip->bip_vec, copy, nr_vecs * sizeof(*copy));
+ kfree(copy);
+ kfree(buf);
+ }
+
+ bip_for_each_mp_vec(bv, bip, iter) {
+ if (dirty && !PageCompound(bv.bv_page))
+ set_page_dirty_lock(bv.bv_page);
+ unpin_user_page(bv.bv_page);
+ }
+}
+
/**
* bio_integrity_free - Free bio integrity payload
* @bio: bio containing bip to be freed
@@ -105,6 +136,8 @@ void bio_integrity_free(struct bio *bio)
if (bip->bip_flags & BIP_BLOCK_INTEGRITY)
kfree(bvec_virt(bip->bip_vec));
+ else if (bip->bip_flags & BIP_INTEGRITY_USER)
+ bio_integrity_unmap_user(bip);;
__bio_integrity_free(bs, bip);
bio->bi_integrity = NULL;
@@ -160,6 +193,185 @@ int bio_integrity_add_page(struct bio *bio, struct page *page,
}
EXPORT_SYMBOL(bio_integrity_add_page);
+static int bio_integrity_copy_user(struct bio *bio, struct bio_vec *bvec,
+ int nr_vecs, unsigned int len,
+ unsigned int direction, u32 seed)
+{
+ struct bio_integrity_payload *bip;
+ struct bio_vec *copy_vec = NULL;
+ struct iov_iter iter;
+ void *buf;
+ int ret;
+
+ /* We need to allocate a copy for the completion if bvec is on stack */
+ if (nr_vecs <= UIO_FASTIOV) {
+ copy_vec = kcalloc(sizeof(*bvec), nr_vecs, GFP_KERNEL);
+ if (!copy_vec)
+ return -ENOMEM;
+ memcpy(copy_vec, bvec, nr_vecs * sizeof(*bvec));
+ bvec = copy_vec;
+ }
+
+ buf = kmalloc(len, GFP_KERNEL);
+ if (!buf) {
+ ret = -ENOMEM;
+ goto free_copy;
+ }
+
+ if (direction == ITER_SOURCE) {
+ iov_iter_bvec(&iter, direction, bvec, nr_vecs, len);
+ if (!copy_from_iter_full(buf, len, &iter)) {
+ ret = -EFAULT;
+ goto free_buf;
+ }
+ } else {
+ memset(buf, 0, len);
+ }
+
+ /*
+ * We need just one vec for this bip, but we also need to preserve the
+ * a pointer to the original bvec and the number of vecs in it for
+ * completion handling
+ */
+ bip = bio_integrity_alloc(bio, GFP_KERNEL, nr_vecs + 1);
+ if (IS_ERR(bip)) {
+ ret = PTR_ERR(bip);
+ goto free_buf;
+ }
+
+ ret = bio_integrity_add_page(bio, virt_to_page(buf), len,
+ offset_in_page(buf));
+ if (ret != len) {
+ ret = -ENOMEM;
+ goto free_bip;
+ }
+
+ /*
+ * Save a pointer to the user bvec at the end of this bip's bvec for
+ * completion handling: we know the index won't be used for anything
+ * else.
+ */
+ bvec_set_page(&bip->bip_vec[nr_vecs], virt_to_page(bvec), 0,
+ offset_in_page(bvec));
+ bip->bip_flags |= BIP_INTEGRITY_USER | BIP_COPY_USER;
+ return 0;
+
+free_bip:
+ bio_integrity_free(bio);
+free_buf:
+ kfree(buf);
+free_copy:
+ kfree(copy_vec);
+ return ret;
+}
+
+static unsigned int bvec_from_pages(struct bio_vec *bvec, struct page **pages,
+ int nr_vecs, ssize_t bytes, ssize_t offset)
+{
+ unsigned int nr_bvecs = 0;
+ int i, j;
+
+ for (i = 0; i < nr_vecs; i = j) {
+ size_t size = min_t(size_t, bytes, PAGE_SIZE - offset);
+ struct folio *folio = page_folio(pages[i]);
+
+ bytes -= size;
+ for (j = i + 1; j < nr_vecs; j++) {
+ size_t next = min_t(size_t, PAGE_SIZE, bytes);
+
+ if (page_folio(pages[j]) != folio ||
+ pages[j] != pages[j - 1] + 1)
+ break;
+ unpin_user_page(pages[j]);
+ size += next;
+ bytes -= next;
+ }
+
+ bvec_set_page(&bvec[nr_bvecs], pages[i], size, offset);
+ offset = 0;
+ nr_bvecs++;
+ }
+
+ return nr_bvecs;
+}
+
+int bio_integrity_map_user(struct bio *bio, void __user *ubuf, ssize_t bytes,
+ u32 seed)
+{
+ struct request_queue *q = bdev_get_queue(bio->bi_bdev);
+ unsigned int align = q->dma_pad_mask | queue_dma_alignment(q);
+ struct bio_vec bv, stack_vec[UIO_FASTIOV], *bvec = stack_vec;
+ struct page *stack_pages[UIO_FASTIOV], **pages = stack_pages;
+ struct bvec_iter bi = { bi.bi_size = bytes, };
+ unsigned int direction, nr_bvecs;
+ struct iov_iter iter;
+ int ret, nr_vecs;
+ size_t offset;
+ bool copy;
+
+ if (bio_integrity(bio))
+ return -EINVAL;
+ if (bytes >> SECTOR_SHIFT > queue_max_hw_sectors(q))
+ return -E2BIG;
+
+ if (bio_data_dir(bio) == READ)
+ direction = ITER_DEST;
+ else
+ direction = ITER_SOURCE;
+
+ iov_iter_ubuf(&iter, direction, ubuf, bytes);
+ nr_vecs = iov_iter_npages(&iter, BIO_MAX_VECS + 1);
+ if (nr_vecs > BIO_MAX_VECS)
+ return -E2BIG;
+ if (nr_vecs > UIO_FASTIOV) {
+ bvec = kcalloc(sizeof(*bvec), nr_vecs, GFP_KERNEL);
+ if (!bvec)
+ return -ENOMEM;
+ pages = NULL;
+ }
+
+ copy = !iov_iter_is_aligned(&iter, align, align);
+ ret = iov_iter_extract_pages(&iter, &pages, bytes, nr_vecs, 0, &offset);
+ if (unlikely(ret < 0))
+ goto free_bvec;
+
+ nr_bvecs = bvec_from_pages(bvec, pages, nr_vecs, bytes, offset);
+ if (pages != stack_pages)
+ kvfree(pages);
+
+ if (nr_bvecs > queue_max_integrity_segments(q) || copy) {
+ ret = bio_integrity_copy_user(bio, bvec, nr_bvecs, bytes,
+ direction, seed);
+ if (ret)
+ goto release_pages;
+ } else {
+ struct bio_integrity_payload *bip;
+
+ bip = bio_integrity_alloc(bio, GFP_KERNEL, nr_bvecs);
+ if (IS_ERR(bip)) {
+ ret = PTR_ERR(bip);
+ goto release_pages;
+ }
+
+ memcpy(bip->bip_vec, bvec, nr_bvecs * sizeof(*bvec));
+ bip->bip_flags |= BIP_INTEGRITY_USER;
+ bip->bip_iter.bi_size = bytes;
+ if (bvec != stack_vec)
+ kfree(bvec);
+ }
+
+ return 0;
+
+release_pages:
+ for_each_bvec(bv, bvec, bi, bi)
+ unpin_user_page(bv.bv_page);
+free_bvec:
+ if (bvec != stack_vec)
+ kfree(bvec);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(bio_integrity_map_user);
+
/**
* bio_integrity_process - Process integrity metadata for a bio
* @bio: bio to generate/verify integrity metadata for
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 41d417ee13499..09e123e7c4941 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -324,6 +324,8 @@ enum bip_flags {
BIP_CTRL_NOCHECK = 1 << 2, /* disable HBA integrity checking */
BIP_DISK_NOCHECK = 1 << 3, /* disable disk integrity checking */
BIP_IP_CHECKSUM = 1 << 4, /* IP checksum */
+ BIP_INTEGRITY_USER = 1 << 5, /* Integrity payload is user address */
+ BIP_COPY_USER = 1 << 6, /* Kernel bounce buffer in use */
};
/*
@@ -714,12 +716,16 @@ static inline bool bioset_initialized(struct bio_set *bs)
#define bip_for_each_vec(bvl, bip, iter) \
for_each_bvec(bvl, (bip)->bip_vec, iter, (bip)->bip_iter)
+#define bip_for_each_mp_vec(bvl, bip, iter) \
+ for_each_mp_bvec(bvl, (bip)->bip_vec, iter, (bip)->bip_iter)
+
#define bio_for_each_integrity_vec(_bvl, _bio, _iter) \
for_each_bio(_bio) \
bip_for_each_vec(_bvl, _bio->bi_integrity, _iter)
extern struct bio_integrity_payload *bio_integrity_alloc(struct bio *, gfp_t, unsigned int);
extern int bio_integrity_add_page(struct bio *, struct page *, unsigned int, unsigned int);
+extern int bio_integrity_map_user(struct bio *, void __user *, ssize_t, u32);
extern bool bio_integrity_prep(struct bio *);
extern void bio_integrity_advance(struct bio *, unsigned int);
extern void bio_integrity_trim(struct bio *);
@@ -789,6 +795,12 @@ static inline int bio_integrity_add_page(struct bio *bio, struct page *page,
return 0;
}
+static inline int bio_integrity_map_user(struct bio *bio, void __user *ubuf,
+ ssize_t len, u32 seed)
+{
+ return -EINVAL;
+}
+
#endif /* CONFIG_BLK_DEV_INTEGRITY */
/*
--
2.34.1
* Re: [PATCHv3 2/5] block: bio-integrity: directly map user buffers
2023-11-20 22:40 ` [PATCHv3 2/5] block: bio-integrity: directly map user buffers Keith Busch
@ 2023-11-20 23:19 ` Jens Axboe
2023-11-21 5:04 ` Christoph Hellwig
2023-11-21 16:10 ` Anuj gupta
2 siblings, 0 replies; 17+ messages in thread
From: Jens Axboe @ 2023-11-20 23:19 UTC (permalink / raw)
To: Keith Busch, linux-block, linux-nvme, io-uring
Cc: hch, joshi.k, martin.petersen, Keith Busch
On 11/20/23 3:40 PM, Keith Busch wrote:
> From: Keith Busch <[email protected]>
>
> Passthrough commands that utilize metadata currently bounce the user
> space buffer through the kernel. Add support for mapping user space
> directly so that we can avoid this costly overhead. This is similiar to
> how the normal bio data payload utilizes user addresses with
> bio_map_user_iov().
>
> If the user address can't directly be used for reasons like too many
> segments or address unalignement, fallback to a copy of the user vec
> while keeping the user address pinned for the IO duration so that it
> can safely be copied on completion in any process context.
>
> Signed-off-by: Keith Busch <[email protected]>
> ---
> block/bio-integrity.c | 212 ++++++++++++++++++++++++++++++++++++++++++
> include/linux/bio.h | 12 +++
> 2 files changed, 224 insertions(+)
>
> diff --git a/block/bio-integrity.c b/block/bio-integrity.c
> index ec8ac8cf6e1b9..b761058bfb92f 100644
> --- a/block/bio-integrity.c
> +++ b/block/bio-integrity.c
> @@ -91,6 +91,37 @@ struct bio_integrity_payload *bio_integrity_alloc(struct bio *bio,
> }
> EXPORT_SYMBOL(bio_integrity_alloc);
>
> +static void bio_integrity_unmap_user(struct bio_integrity_payload *bip)
> +{
> + bool dirty = bio_data_dir(bip->bip_bio) == READ;
> + struct bvec_iter iter;
> + struct bio_vec bv;
> +
> + if (bip->bip_flags & BIP_COPY_USER) {
> + unsigned short nr_vecs = bip->bip_max_vcnt - 1;
> + struct bio_vec *copy = bvec_virt(&bip->bip_vec[nr_vecs]);
> + size_t bytes = bip->bip_iter.bi_size;
> + void *buf = bvec_virt(bip->bip_vec);
> +
> + if (dirty) {
> + struct iov_iter iter;
> +
> + iov_iter_bvec(&iter, ITER_DEST, copy, nr_vecs, bytes);
> + WARN_ON_ONCE(copy_to_iter(buf, bytes, &iter) != bytes);
> + }
Minor nit, but I don't like hiding function calls with side effects
inside what are potentially debug-only statements. Would be better to do:
ret = copy_to_iter(buf, bytes, &iter);
WARN_ON_ONCE(ret != bytes);
which is also easier to read, imho.
Apart from that, looks good to me.
--
Jens Axboe
* Re: [PATCHv3 2/5] block: bio-integrity: directly map user buffers
2023-11-20 22:40 ` [PATCHv3 2/5] block: bio-integrity: directly map user buffers Keith Busch
2023-11-20 23:19 ` Jens Axboe
@ 2023-11-21 5:04 ` Christoph Hellwig
2023-11-21 16:10 ` Anuj gupta
2 siblings, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2023-11-21 5:04 UTC (permalink / raw)
To: Keith Busch
Cc: linux-block, linux-nvme, io-uring, axboe, hch, joshi.k,
martin.petersen, Keith Busch
On Mon, Nov 20, 2023 at 02:40:55PM -0800, Keith Busch wrote:
> +static void bio_integrity_unmap_user(struct bio_integrity_payload *bip)
> +{
> + bool dirty = bio_data_dir(bip->bip_bio) == READ;
> + struct bvec_iter iter;
> + struct bio_vec bv;
> +
> + if (bip->bip_flags & BIP_COPY_USER) {
> + unsigned short nr_vecs = bip->bip_max_vcnt - 1;
> + struct bio_vec *copy = bvec_virt(&bip->bip_vec[nr_vecs]);
> + size_t bytes = bip->bip_iter.bi_size;
> + void *buf = bvec_virt(bip->bip_vec);
> +
> + if (dirty) {
> + struct iov_iter iter;
> +
> + iov_iter_bvec(&iter, ITER_DEST, copy, nr_vecs, bytes);
> + WARN_ON_ONCE(copy_to_iter(buf, bytes, &iter) != bytes);
> + }
> +
> + memcpy(bip->bip_vec, copy, nr_vecs * sizeof(*copy));
> + kfree(copy);
> + kfree(buf);
Nit: I'd probably just split the user-copy version into a separate
helper for clarity. Nice trick with the temporary iter; we could
probably use this for the data path too.
> +extern int bio_integrity_map_user(struct bio *, void __user *, ssize_t, u32);
Can you drop the pointless extern and just spell out the parameters?
I know this follows the existing style, but that style is pretty
horrible :)
* Re: [PATCHv3 2/5] block: bio-integrity: directly map user buffers
2023-11-20 22:40 ` [PATCHv3 2/5] block: bio-integrity: directly map user buffers Keith Busch
2023-11-20 23:19 ` Jens Axboe
2023-11-21 5:04 ` Christoph Hellwig
@ 2023-11-21 16:10 ` Anuj gupta
2 siblings, 0 replies; 17+ messages in thread
From: Anuj gupta @ 2023-11-21 16:10 UTC (permalink / raw)
To: Keith Busch
Cc: linux-block, linux-nvme, io-uring, axboe, hch, joshi.k,
martin.petersen, Keith Busch
On Tue, Nov 21, 2023 at 4:11 AM Keith Busch <[email protected]> wrote:
>
> From: Keith Busch <[email protected]>
>
> Passthrough commands that utilize metadata currently bounce the user
> space buffer through the kernel. Add support for mapping user space
> directly so that we can avoid this costly overhead. This is similiar to
Nit: s/similiar/similar
> /**
> * bio_integrity_free - Free bio integrity payload
> * @bio: bio containing bip to be freed
> @@ -105,6 +136,8 @@ void bio_integrity_free(struct bio *bio)
>
> if (bip->bip_flags & BIP_BLOCK_INTEGRITY)
> kfree(bvec_virt(bip->bip_vec));
> + else if (bip->bip_flags & BIP_INTEGRITY_USER)
> + bio_integrity_unmap_user(bip);;
Nit: extra semicolon here
--
Anuj Gupta
* [PATCHv3 3/5] nvme: use bio_integrity_map_user
2023-11-20 22:40 [PATCHv3 0/5] block integrity: directly map user space addresses Keith Busch
2023-11-20 22:40 ` [PATCHv3 1/5] bvec: introduce multi-page bvec iterating Keith Busch
2023-11-20 22:40 ` [PATCHv3 2/5] block: bio-integrity: directly map user buffers Keith Busch
@ 2023-11-20 22:40 ` Keith Busch
2023-11-21 5:05 ` Christoph Hellwig
2023-11-20 22:40 ` [PATCHv3 4/5] iouring: remove IORING_URING_CMD_POLLED Keith Busch
2023-11-20 22:40 ` [PATCHv3 5/5] io_uring: remove uring_cmd cookie Keith Busch
4 siblings, 1 reply; 17+ messages in thread
From: Keith Busch @ 2023-11-20 22:40 UTC (permalink / raw)
To: linux-block, linux-nvme, io-uring
Cc: axboe, hch, joshi.k, martin.petersen, Keith Busch
From: Keith Busch <[email protected]>
Map user metadata buffers directly for a command request. Now that the
bio bip tracks the metadata, nvme doesn't need special handling for
tracking callbacks and additional fields in the driver pdu.
Signed-off-by: Keith Busch <[email protected]>
---
drivers/nvme/host/ioctl.c | 197 ++++++--------------------------------
1 file changed, 29 insertions(+), 168 deletions(-)
diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index 529b9954d2b8c..32c9bcf491a33 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -96,58 +96,6 @@ static void __user *nvme_to_user_ptr(uintptr_t ptrval)
return (void __user *)ptrval;
}
-static void *nvme_add_user_metadata(struct request *req, void __user *ubuf,
- unsigned len, u32 seed)
-{
- struct bio_integrity_payload *bip;
- int ret = -ENOMEM;
- void *buf;
- struct bio *bio = req->bio;
-
- buf = kmalloc(len, GFP_KERNEL);
- if (!buf)
- goto out;
-
- if (req_op(req) == REQ_OP_DRV_OUT) {
- ret = -EFAULT;
- if (copy_from_user(buf, ubuf, len))
- goto out_free_meta;
- } else {
- memset(buf, 0, len);
- }
-
- bip = bio_integrity_alloc(bio, GFP_KERNEL, 1);
- if (IS_ERR(bip)) {
- ret = PTR_ERR(bip);
- goto out_free_meta;
- }
-
- bip->bip_iter.bi_sector = seed;
- ret = bio_integrity_add_page(bio, virt_to_page(buf), len,
- offset_in_page(buf));
- if (ret != len) {
- ret = -ENOMEM;
- goto out_free_meta;
- }
-
- req->cmd_flags |= REQ_INTEGRITY;
- return buf;
-out_free_meta:
- kfree(buf);
-out:
- return ERR_PTR(ret);
-}
-
-static int nvme_finish_user_metadata(struct request *req, void __user *ubuf,
- void *meta, unsigned len, int ret)
-{
- if (!ret && req_op(req) == REQ_OP_DRV_IN &&
- copy_to_user(ubuf, meta, len))
- ret = -EFAULT;
- kfree(meta);
- return ret;
-}
-
static struct request *nvme_alloc_user_request(struct request_queue *q,
struct nvme_command *cmd, blk_opf_t rq_flags,
blk_mq_req_flags_t blk_flags)
@@ -164,14 +112,12 @@ static struct request *nvme_alloc_user_request(struct request_queue *q,
static int nvme_map_user_request(struct request *req, u64 ubuffer,
unsigned bufflen, void __user *meta_buffer, unsigned meta_len,
- u32 meta_seed, void **metap, struct io_uring_cmd *ioucmd,
- unsigned int flags)
+ u32 meta_seed, struct io_uring_cmd *ioucmd, unsigned int flags)
{
struct request_queue *q = req->q;
struct nvme_ns *ns = q->queuedata;
struct block_device *bdev = ns ? ns->disk->part0 : NULL;
struct bio *bio = NULL;
- void *meta = NULL;
int ret;
if (ioucmd && (ioucmd->flags & IORING_URING_CMD_FIXED)) {
@@ -193,18 +139,17 @@ static int nvme_map_user_request(struct request *req, u64 ubuffer,
if (ret)
goto out;
+
bio = req->bio;
- if (bdev)
+ if (bdev) {
bio_set_dev(bio, bdev);
-
- if (bdev && meta_buffer && meta_len) {
- meta = nvme_add_user_metadata(req, meta_buffer, meta_len,
- meta_seed);
- if (IS_ERR(meta)) {
- ret = PTR_ERR(meta);
- goto out_unmap;
+ if (meta_buffer && meta_len) {
+ ret = bio_integrity_map_user(bio, meta_buffer, meta_len,
+ meta_seed);
+ if (ret)
+ goto out_unmap;
+ req->cmd_flags |= REQ_INTEGRITY;
}
- *metap = meta;
}
return ret;
@@ -225,7 +170,6 @@ static int nvme_submit_user_cmd(struct request_queue *q,
struct nvme_ns *ns = q->queuedata;
struct nvme_ctrl *ctrl;
struct request *req;
- void *meta = NULL;
struct bio *bio;
u32 effects;
int ret;
@@ -237,7 +181,7 @@ static int nvme_submit_user_cmd(struct request_queue *q,
req->timeout = timeout;
if (ubuffer && bufflen) {
ret = nvme_map_user_request(req, ubuffer, bufflen, meta_buffer,
- meta_len, meta_seed, &meta, NULL, flags);
+ meta_len, meta_seed, NULL, flags);
if (ret)
return ret;
}
@@ -249,9 +193,6 @@ static int nvme_submit_user_cmd(struct request_queue *q,
ret = nvme_execute_rq(req, false);
if (result)
*result = le64_to_cpu(nvme_req(req)->result.u64);
- if (meta)
- ret = nvme_finish_user_metadata(req, meta_buffer, meta,
- meta_len, ret);
if (bio)
blk_rq_unmap_user(bio);
blk_mq_free_request(req);
@@ -446,19 +387,10 @@ struct nvme_uring_data {
* Expect build errors if this grows larger than that.
*/
struct nvme_uring_cmd_pdu {
- union {
- struct bio *bio;
- struct request *req;
- };
- u32 meta_len;
- u32 nvme_status;
- union {
- struct {
- void *meta; /* kernel-resident buffer */
- void __user *meta_buffer;
- };
- u64 result;
- } u;
+ struct request *req;
+ struct bio *bio;
+ u64 result;
+ int status;
};
static inline struct nvme_uring_cmd_pdu *nvme_uring_cmd_pdu(
@@ -467,31 +399,6 @@ static inline struct nvme_uring_cmd_pdu *nvme_uring_cmd_pdu(
return (struct nvme_uring_cmd_pdu *)&ioucmd->pdu;
}
-static void nvme_uring_task_meta_cb(struct io_uring_cmd *ioucmd,
- unsigned issue_flags)
-{
- struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd);
- struct request *req = pdu->req;
- int status;
- u64 result;
-
- if (nvme_req(req)->flags & NVME_REQ_CANCELLED)
- status = -EINTR;
- else
- status = nvme_req(req)->status;
-
- result = le64_to_cpu(nvme_req(req)->result.u64);
-
- if (pdu->meta_len)
- status = nvme_finish_user_metadata(req, pdu->u.meta_buffer,
- pdu->u.meta, pdu->meta_len, status);
- if (req->bio)
- blk_rq_unmap_user(req->bio);
- blk_mq_free_request(req);
-
- io_uring_cmd_done(ioucmd, status, result, issue_flags);
-}
-
static void nvme_uring_task_cb(struct io_uring_cmd *ioucmd,
unsigned issue_flags)
{
@@ -499,8 +406,7 @@ static void nvme_uring_task_cb(struct io_uring_cmd *ioucmd,
if (pdu->bio)
blk_rq_unmap_user(pdu->bio);
-
- io_uring_cmd_done(ioucmd, pdu->nvme_status, pdu->u.result, issue_flags);
+ io_uring_cmd_done(ioucmd, pdu->status, pdu->result, issue_flags);
}
static enum rq_end_io_ret nvme_uring_cmd_end_io(struct request *req,
@@ -509,53 +415,24 @@ static enum rq_end_io_ret nvme_uring_cmd_end_io(struct request *req,
struct io_uring_cmd *ioucmd = req->end_io_data;
struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd);
- req->bio = pdu->bio;
- if (nvme_req(req)->flags & NVME_REQ_CANCELLED) {
- pdu->nvme_status = -EINTR;
- } else {
- pdu->nvme_status = nvme_req(req)->status;
- if (!pdu->nvme_status)
- pdu->nvme_status = blk_status_to_errno(err);
- }
- pdu->u.result = le64_to_cpu(nvme_req(req)->result.u64);
+ if (nvme_req(req)->flags & NVME_REQ_CANCELLED)
+ pdu->status = -EINTR;
+ else
+ pdu->status = nvme_req(req)->status;
+ pdu->result = le64_to_cpu(nvme_req(req)->result.u64);
/*
* For iopoll, complete it directly.
* Otherwise, move the completion to task work.
*/
- if (blk_rq_is_poll(req)) {
- WRITE_ONCE(ioucmd->cookie, NULL);
+ if (blk_rq_is_poll(req))
nvme_uring_task_cb(ioucmd, IO_URING_F_UNLOCKED);
- } else {
+ else
io_uring_cmd_do_in_task_lazy(ioucmd, nvme_uring_task_cb);
- }
return RQ_END_IO_FREE;
}
-static enum rq_end_io_ret nvme_uring_cmd_end_io_meta(struct request *req,
- blk_status_t err)
-{
- struct io_uring_cmd *ioucmd = req->end_io_data;
- struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd);
-
- req->bio = pdu->bio;
- pdu->req = req;
-
- /*
- * For iopoll, complete it directly.
- * Otherwise, move the completion to task work.
- */
- if (blk_rq_is_poll(req)) {
- WRITE_ONCE(ioucmd->cookie, NULL);
- nvme_uring_task_meta_cb(ioucmd, IO_URING_F_UNLOCKED);
- } else {
- io_uring_cmd_do_in_task_lazy(ioucmd, nvme_uring_task_meta_cb);
- }
-
- return RQ_END_IO_NONE;
-}
-
static int nvme_uring_cmd_io(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
struct io_uring_cmd *ioucmd, unsigned int issue_flags, bool vec)
{
@@ -567,7 +444,6 @@ static int nvme_uring_cmd_io(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
struct request *req;
blk_opf_t rq_flags = REQ_ALLOC_CACHE;
blk_mq_req_flags_t blk_flags = 0;
- void *meta = NULL;
int ret;
c.common.opcode = READ_ONCE(cmd->opcode);
@@ -615,27 +491,16 @@ static int nvme_uring_cmd_io(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
if (d.addr && d.data_len) {
ret = nvme_map_user_request(req, d.addr,
d.data_len, nvme_to_user_ptr(d.metadata),
- d.metadata_len, 0, &meta, ioucmd, vec);
+ d.metadata_len, 0, ioucmd, vec);
if (ret)
return ret;
}
- if (blk_rq_is_poll(req)) {
- ioucmd->flags |= IORING_URING_CMD_POLLED;
- WRITE_ONCE(ioucmd->cookie, req);
- }
-
/* to free bio on completion, as req->bio will be null at that time */
pdu->bio = req->bio;
- pdu->meta_len = d.metadata_len;
+ pdu->req = req;
req->end_io_data = ioucmd;
- if (pdu->meta_len) {
- pdu->u.meta = meta;
- pdu->u.meta_buffer = nvme_to_user_ptr(d.metadata);
- req->end_io = nvme_uring_cmd_end_io_meta;
- } else {
- req->end_io = nvme_uring_cmd_end_io;
- }
+ req->end_io = nvme_uring_cmd_end_io;
blk_execute_rq_nowait(req, false);
return -EIOCBQUEUED;
}
@@ -786,16 +651,12 @@ int nvme_ns_chr_uring_cmd_iopoll(struct io_uring_cmd *ioucmd,
struct io_comp_batch *iob,
unsigned int poll_flags)
{
- struct request *req;
- int ret = 0;
-
- if (!(ioucmd->flags & IORING_URING_CMD_POLLED))
- return 0;
+ struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd);
+ struct request *req = pdu->req;
- req = READ_ONCE(ioucmd->cookie);
if (req && blk_rq_is_poll(req))
- ret = blk_rq_poll(req, iob, poll_flags);
- return ret;
+ return blk_rq_poll(req, iob, poll_flags);
+ return 0;
}
#ifdef CONFIG_NVME_MULTIPATH
static int nvme_ns_head_ctrl_ioctl(struct nvme_ns *ns, unsigned int cmd,
--
2.34.1
* [PATCHv3 4/5] io_uring: remove IORING_URING_CMD_POLLED
2023-11-20 22:40 [PATCHv3 0/5] block integrity: directly map user space addresses Keith Busch
` (2 preceding siblings ...)
2023-11-20 22:40 ` [PATCHv3 3/5] nvme: use bio_integrity_map_user Keith Busch
@ 2023-11-20 22:40 ` Keith Busch
2023-11-21 5:05 ` Christoph Hellwig
2023-11-20 22:40 ` [PATCHv3 5/5] io_uring: remove uring_cmd cookie Keith Busch
4 siblings, 1 reply; 17+ messages in thread
From: Keith Busch @ 2023-11-20 22:40 UTC (permalink / raw)
To: linux-block, linux-nvme, io-uring
Cc: axboe, hch, joshi.k, martin.petersen, Keith Busch
From: Keith Busch <[email protected]>
No more users of this flag.
Signed-off-by: Keith Busch <[email protected]>
---
include/linux/io_uring.h | 1 -
1 file changed, 1 deletion(-)
diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
index aefb73eeeebff..fe23bf88f86fa 100644
--- a/include/linux/io_uring.h
+++ b/include/linux/io_uring.h
@@ -28,7 +28,6 @@ enum io_uring_cmd_flags {
/* only top 8 bits of sqe->uring_cmd_flags for kernel internal use */
#define IORING_URING_CMD_CANCELABLE (1U << 30)
-#define IORING_URING_CMD_POLLED (1U << 31)
struct io_uring_cmd {
struct file *file;
--
2.34.1
* [PATCHv3 5/5] io_uring: remove uring_cmd cookie
2023-11-20 22:40 [PATCHv3 0/5] block integrity: directly map user space addresses Keith Busch
` (3 preceding siblings ...)
2023-11-20 22:40 ` [PATCHv3 4/5] io_uring: remove IORING_URING_CMD_POLLED Keith Busch
@ 2023-11-20 22:40 ` Keith Busch
2023-11-21 5:05 ` Christoph Hellwig
4 siblings, 1 reply; 17+ messages in thread
From: Keith Busch @ 2023-11-20 22:40 UTC (permalink / raw)
To: linux-block, linux-nvme, io-uring
Cc: axboe, hch, joshi.k, martin.petersen, Keith Busch
From: Keith Busch <[email protected]>
No more users of this field.
Signed-off-by: Keith Busch <[email protected]>
---
include/linux/io_uring.h | 8 ++------
io_uring/uring_cmd.c | 1 -
2 files changed, 2 insertions(+), 7 deletions(-)
diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
index fe23bf88f86fa..9e6ce6d4ab51f 100644
--- a/include/linux/io_uring.h
+++ b/include/linux/io_uring.h
@@ -32,12 +32,8 @@ enum io_uring_cmd_flags {
struct io_uring_cmd {
struct file *file;
const struct io_uring_sqe *sqe;
- union {
- /* callback to defer completions to task context */
- void (*task_work_cb)(struct io_uring_cmd *cmd, unsigned);
- /* used for polled completion */
- void *cookie;
- };
+ /* callback to defer completions to task context */
+ void (*task_work_cb)(struct io_uring_cmd *cmd, unsigned);
u32 cmd_op;
u32 flags;
u8 pdu[32]; /* available inline for free use */
diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
index acbc2924ecd21..b39ec25c36bc3 100644
--- a/io_uring/uring_cmd.c
+++ b/io_uring/uring_cmd.c
@@ -182,7 +182,6 @@ int io_uring_cmd(struct io_kiocb *req, unsigned int issue_flags)
return -EOPNOTSUPP;
issue_flags |= IO_URING_F_IOPOLL;
req->iopoll_completed = 0;
- WRITE_ONCE(ioucmd->cookie, NULL);
}
ret = file->f_op->uring_cmd(ioucmd, issue_flags);
--
2.34.1