* [PATCH v2 0/3] implement direct IO with integrity
@ 2022-02-10 13:08 Alexander V. Buev
2022-02-10 13:08 ` [PATCH v2 1/3] block: bio-integrity: add PI iovec to bio Alexander V. Buev
` (2 more replies)
0 siblings, 3 replies; 10+ messages in thread
From: Alexander V. Buev @ 2022-02-10 13:08 UTC (permalink / raw)
To: linux-block
Cc: io-uring, Jens Axboe, Christoph Hellwig, Martin K . Petersen,
Pavel Begunkov, Chaitanya Kulkarni, Mikhail Malygin, linux,
Alexander V. Buev
This series of patches makes it possible to do direct block IO
with an integrity payload using the io_uring kernel interface.
A userspace app can use the new READV_PI/WRITEV_PI operations with the new
fields in the sqe struct (pi_addr/pi_len) to provide iovecs with
integrity data.
Changes since v1:
- switch to using separate io_uring op codes and additional
sqe fields (instead of an sqe flag)
- ability to process multiple iovecs
- use unused space in bio to pin user pages
Alexander V. Buev (3):
block: bio-integrity: add PI iovec to bio
block: io_uring: add READV_PI/WRITEV_PI operations
block: fops: handle IOCB_USE_PI in direct IO
block/bio-integrity.c | 151 +++++++++++++++++++++++
block/fops.c | 62 ++++++++++
fs/io_uring.c | 209 ++++++++++++++++++++++++++++++++
include/linux/bio.h | 8 ++
include/linux/fs.h | 2 +
include/trace/events/io_uring.h | 17 +--
include/uapi/linux/io_uring.h | 6 +-
include/uapi/linux/uio.h | 3 +-
8 files changed, 449 insertions(+), 9 deletions(-)
--
2.34.1
^ permalink raw reply [flat|nested] 10+ messages in thread
* [PATCH v2 1/3] block: bio-integrity: add PI iovec to bio
2022-02-10 13:08 [PATCH v2 0/3] implement direct IO with integrity Alexander V. Buev
@ 2022-02-10 13:08 ` Alexander V. Buev
2022-02-10 17:49 ` Chaitanya Kulkarni
2022-02-10 13:08 ` [PATCH v2 2/3] block: io_uring: add READV_PI/WRITEV_PI operations Alexander V. Buev
2022-02-10 13:08 ` [PATCH v2 3/3] block: fops: handle IOCB_USE_PI in direct IO Alexander V. Buev
2 siblings, 1 reply; 10+ messages in thread
From: Alexander V. Buev @ 2022-02-10 13:08 UTC (permalink / raw)
To: linux-block
Cc: io-uring, Jens Axboe, Christoph Hellwig, Martin K . Petersen,
Pavel Begunkov, Chaitanya Kulkarni, Mikhail Malygin, linux,
Alexander V. Buev
Added functions to attach user PI iovec pages to a bio
and release these pages via bio_integrity_free().
Signed-off-by: Alexander V. Buev <[email protected]>
---
block/bio-integrity.c | 151 ++++++++++++++++++++++++++++++++++++++++++
include/linux/bio.h | 8 +++
2 files changed, 159 insertions(+)
diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index 0827b19820c5..8e57aea9c9eb 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -10,6 +10,7 @@
#include <linux/mempool.h>
#include <linux/export.h>
#include <linux/bio.h>
+#include <linux/uio.h>
#include <linux/workqueue.h>
#include <linux/slab.h>
#include "blk.h"
@@ -91,6 +92,19 @@ struct bio_integrity_payload *bio_integrity_alloc(struct bio *bio,
}
EXPORT_SYMBOL(bio_integrity_alloc);
+void bio_integrity_release_pages(struct bio *bio)
+{
+ struct bio_integrity_payload *bip = bio_integrity(bio);
+ unsigned short i;
+
+ for (i = 0; i < bip->bip_vcnt; ++i) {
+ struct bio_vec *bv;
+
+ bv = bip->bip_vec + i;
+ put_page(bv->bv_page);
+ }
+}
+
/**
* bio_integrity_free - Free bio integrity payload
* @bio: bio containing bip to be freed
@@ -105,6 +119,10 @@ void bio_integrity_free(struct bio *bio)
if (bip->bip_flags & BIP_BLOCK_INTEGRITY)
kfree(bvec_virt(bip->bip_vec));
+ else {
+ if (bip->bip_flags & BIP_RELEASE_PAGES)
+ bio_integrity_release_pages(bio);
+ }
__bio_integrity_free(bs, bip);
bio->bi_integrity = NULL;
@@ -377,6 +395,139 @@ void bio_integrity_advance(struct bio *bio, unsigned int bytes_done)
bvec_iter_advance(bip->bip_vec, &bip->bip_iter, bytes);
}
+static inline
+struct page **__bio_integrity_temp_pages(struct bio *bio, unsigned int nr_needed_page)
+{
+ struct page **pages = 0;
+ unsigned int nr_avail_page;
+ struct bio_integrity_payload *bip = bio_integrity(bio);
+
+ if (bip->bip_max_vcnt > nr_needed_page) {
+ nr_avail_page = (bip->bip_max_vcnt - nr_needed_page) *
+ sizeof(struct bio_vec)/sizeof(struct page *);
+ } else
+ nr_avail_page = 0;
+
+ if (nr_avail_page >= nr_needed_page)
+ pages = (struct page **) (bip->bip_vec + nr_needed_page);
+ else {
+ if (bio->bi_max_vecs - bio->bi_vcnt) {
+ nr_avail_page = (bio->bi_max_vecs - bio->bi_vcnt) *
+ sizeof(struct bio_vec)/sizeof(struct page *);
+ if (nr_avail_page >= nr_needed_page)
+ pages = (struct page **) (bio->bi_io_vec + bio->bi_vcnt);
+ }
+ }
+
+ return pages;
+}
+
+/**
+ * bio_integrity_add_iovec - Add PI io vector
+ * @bio: bio whose integrity vector to update
+ * @pi_iter: iov_iter pointing to the data added to @bio's integrity
+ *
+ * Description: Pins pages for @pi_iter and appends them to @bio's integrity.
+ */
+int bio_integrity_add_iovec(struct bio *bio, struct iov_iter *pi_iter)
+{
+ struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
+ struct bio_integrity_payload *bip;
+ struct page **pi_page = NULL, **bio_page;
+ unsigned int nr_vec_page, intervals;
+ int ret;
+ ssize_t size;
+ size_t offset, len, pg_num, page_count;
+
+ if (unlikely(!bi || !bi->tuple_size || !bi->interval_exp)) {
+ pr_err("Device is not integrity capable\n");
+ return -EINVAL;
+ }
+
+ intervals = bio_integrity_intervals(bi, bio_sectors(bio));
+ if (unlikely(intervals * bi->tuple_size < pi_iter->count)) {
+ pr_err("Wrong number of integrity intervals, intervals=%u, tuple_size=%u, pi_len=%zu\n",
+ intervals, bi->tuple_size, pi_iter->count);
+ return -EINVAL;
+ }
+
+ nr_vec_page = (pi_iter->count + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ /* a buffer spanning N pages' worth of data may pin N+1 pages */
+ nr_vec_page += 1;
+
+ bip = bio_integrity_alloc(bio, GFP_NOIO, nr_vec_page);
+ if (IS_ERR(bip))
+ return PTR_ERR(bip);
+
+ /* get space for page pointers array */
+ bio_page = __bio_integrity_temp_pages(bio, nr_vec_page);
+
+ if (likely(bio_page))
+ pi_page = bio_page;
+ else {
+ pi_page = kcalloc(nr_vec_page, sizeof(struct page *), GFP_NOIO);
+ if (!pi_page) {
+ ret = -ENOMEM;
+ goto error;
+ }
+ }
+
+ bip->bip_iter.bi_size = pi_iter->count;
+ bip->bio_iter = bio->bi_iter;
+ bip_set_seed(bip, bio->bi_iter.bi_sector);
+
+ if (bi->flags & BLK_INTEGRITY_IP_CHECKSUM)
+ bip->bip_flags |= BIP_IP_CHECKSUM;
+
+ size = iov_iter_get_pages(pi_iter, pi_page, LONG_MAX, nr_vec_page, &offset);
+ if (unlikely(size <= 0)) {
+ pr_err("Failed to pin PI buffer to page (%zi)\n", size);
+ ret = (size) ? size : -EFAULT;
+ goto error;
+ }
+
+ /* calc count of pinned pages */
+ if (size > (PAGE_SIZE - offset))
+ page_count = DIV_ROUND_UP(size - (PAGE_SIZE - offset), PAGE_SIZE) + 1;
+ else
+ page_count = 1;
+
+ /* fill bio integrity bvecs with the given pages */
+ len = pi_iter->count;
+ for (pg_num = 0; pg_num < page_count; ++pg_num) {
+ size_t page_len;
+
+ page_len = PAGE_SIZE - offset;
+ if (page_len > len)
+ page_len = len;
+ ret = bio_integrity_add_page(bio, pi_page[pg_num], page_len, offset);
+ if (unlikely(ret != page_len)) {
+ ret = -ENOMEM;
+ goto error;
+ }
+ len -= page_len;
+ offset = 0;
+ bip->bip_flags |= BIP_RELEASE_PAGES;
+ }
+
+ iov_iter_advance(pi_iter, size);
+
+ if (pi_page != bio_page)
+ kfree(pi_page);
+
+ return 0;
+
+error:
+ if (bio_integrity(bio))
+ bio_integrity_free(bio);
+
+ if (pi_page && pi_page != bio_page)
+ kfree(pi_page);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(bio_integrity_add_iovec);
+
/**
* bio_integrity_trim - Trim integrity vector
* @bio: bio whose integrity vector to update
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 117d7f248ac9..ce008eeeb160 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -317,6 +317,7 @@ enum bip_flags {
BIP_CTRL_NOCHECK = 1 << 2, /* disable HBA integrity checking */
BIP_DISK_NOCHECK = 1 << 3, /* disable disk integrity checking */
BIP_IP_CHECKSUM = 1 << 4, /* IP checksum */
+ BIP_RELEASE_PAGES = 1 << 5, /* release pages after io completion */
};
/*
@@ -707,6 +708,7 @@ extern struct bio_integrity_payload *bio_integrity_alloc(struct bio *, gfp_t, un
extern int bio_integrity_add_page(struct bio *, struct page *, unsigned int, unsigned int);
extern bool bio_integrity_prep(struct bio *);
extern void bio_integrity_advance(struct bio *, unsigned int);
+extern int bio_integrity_add_iovec(struct bio *bio, struct iov_iter *iter);
extern void bio_integrity_trim(struct bio *);
extern int bio_integrity_clone(struct bio *, struct bio *, gfp_t);
extern int bioset_integrity_create(struct bio_set *, int);
@@ -747,6 +749,12 @@ static inline void bio_integrity_advance(struct bio *bio,
return;
}
+static inline int bio_integrity_add_iovec(struct bio *bio,
+ struct iov_iter *pi_iter)
+{
+ return 0;
+}
+
static inline void bio_integrity_trim(struct bio *bio)
{
return;
--
2.34.1
* [PATCH v2 2/3] block: io_uring: add READV_PI/WRITEV_PI operations
2022-02-10 13:08 [PATCH v2 0/3] implement direct IO with integrity Alexander V. Buev
2022-02-10 13:08 ` [PATCH v2 1/3] block: bio-integrity: add PI iovec to bio Alexander V. Buev
@ 2022-02-10 13:08 ` Alexander V. Buev
2022-02-10 15:39 ` Jens Axboe
2022-02-10 13:08 ` [PATCH v2 3/3] block: fops: handle IOCB_USE_PI in direct IO Alexander V. Buev
2 siblings, 1 reply; 10+ messages in thread
From: Alexander V. Buev @ 2022-02-10 13:08 UTC (permalink / raw)
To: linux-block
Cc: io-uring, Jens Axboe, Christoph Hellwig, Martin K . Petersen,
Pavel Begunkov, Chaitanya Kulkarni, Mikhail Malygin, linux,
Alexander V. Buev
Added new READV_PI/WRITEV_PI operations to io_uring.
Added new pi_addr & pi_len fields to the SQE struct.
Added new pi_iter field and IOCB_USE_PI flag to the kiocb struct.
Made corresponding changes to the io_uring trace event.
Signed-off-by: Alexander V. Buev <[email protected]>
---
fs/io_uring.c | 209 ++++++++++++++++++++++++++++++++
include/linux/fs.h | 2 +
include/trace/events/io_uring.h | 17 +--
include/uapi/linux/io_uring.h | 6 +-
include/uapi/linux/uio.h | 3 +-
5 files changed, 228 insertions(+), 9 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 2e04f718319d..6e941040f228 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -563,6 +563,19 @@ struct io_rw {
u64 len;
};
+struct io_rw_pi_state {
+ struct iov_iter iter;
+ struct iov_iter_state iter_state;
+ struct iovec fast_iov[UIO_FASTIOV_PI];
+};
+
+struct io_rw_pi {
+ struct io_rw rw;
+ struct iovec *pi_iov;
+ u32 nr_pi_segs;
+ struct io_rw_pi_state *s;
+};
+
struct io_connect {
struct file *file;
struct sockaddr __user *addr;
@@ -716,6 +729,12 @@ struct io_async_rw {
struct wait_page_queue wpq;
};
+struct io_async_rw_pi {
+ struct io_async_rw async;
+ const struct iovec *free_iovec;
+ struct io_rw_pi_state s;
+};
+
enum {
REQ_F_FIXED_FILE_BIT = IOSQE_FIXED_FILE_BIT,
REQ_F_IO_DRAIN_BIT = IOSQE_IO_DRAIN_BIT,
@@ -744,6 +763,7 @@ enum {
/* keep async read/write and isreg together and in order */
REQ_F_SUPPORT_NOWAIT_BIT,
REQ_F_ISREG_BIT,
+ REQ_F_USE_PI_BIT,
/* not a real bit, just to check we're not overflowing the space */
__REQ_F_LAST_BIT,
@@ -799,6 +819,8 @@ enum {
REQ_F_ASYNC_DATA = BIT(REQ_F_ASYNC_DATA_BIT),
/* don't post CQEs while failing linked requests */
REQ_F_SKIP_LINK_CQES = BIT(REQ_F_SKIP_LINK_CQES_BIT),
+ /* pi metadata present */
+ REQ_F_USE_PI = BIT(REQ_F_USE_PI_BIT)
};
struct async_poll {
@@ -855,6 +877,7 @@ struct io_kiocb {
struct io_mkdir mkdir;
struct io_symlink symlink;
struct io_hardlink hardlink;
+ struct io_rw_pi rw_pi;
};
u8 opcode;
@@ -1105,6 +1128,24 @@ static const struct io_op_def io_op_defs[] = {
[IORING_OP_MKDIRAT] = {},
[IORING_OP_SYMLINKAT] = {},
[IORING_OP_LINKAT] = {},
+ [IORING_OP_READV_PI] = {
+ .needs_file = 1,
+ .unbound_nonreg_file = 1,
+ .pollin = 1,
+ .buffer_select = 1,
+ .needs_async_setup = 1,
+ .plug = 1,
+ .async_size = sizeof(struct io_async_rw_pi),
+ },
+ [IORING_OP_WRITEV_PI] = {
+ .needs_file = 1,
+ .hash_reg_file = 1,
+ .unbound_nonreg_file = 1,
+ .pollout = 1,
+ .needs_async_setup = 1,
+ .plug = 1,
+ .async_size = sizeof(struct io_async_rw_pi),
+ },
};
/* requests with any of those set should undergo io_disarm_next() */
@@ -3053,6 +3094,18 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe)
return 0;
}
+static int io_prep_rw_pi(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+{
+ if (!(req->file->f_flags & O_DIRECT))
+ return -EINVAL;
+
+ req->rw.kiocb.ki_flags |= IOCB_USE_PI;
+ req->flags |= REQ_F_USE_PI;
+ req->rw_pi.pi_iov = u64_to_user_ptr(READ_ONCE(sqe->pi_addr));
+ req->rw_pi.nr_pi_segs = READ_ONCE(sqe->pi_len);
+ return 0;
+}
+
static inline void io_rw_done(struct kiocb *kiocb, ssize_t ret)
{
switch (ret) {
@@ -3505,10 +3558,39 @@ static int io_setup_async_rw(struct io_kiocb *req, const struct iovec *iovec,
iorw = req->async_data;
/* we've copied and mapped the iter, ensure state is saved */
iov_iter_save_state(&iorw->s.iter, &iorw->s.iter_state);
+ if (req->flags & REQ_F_USE_PI) {
+ struct io_async_rw_pi *iorw_pi = req->async_data;
+
+ /* copy iter from req to async ctx */
+ iorw_pi->s.iter = req->rw_pi.s->iter;
+
+ if (req->rw_pi.s->iter.iov == req->rw_pi.s->fast_iov) {
+ memcpy(iorw_pi->s.fast_iov, req->rw_pi.s->fast_iov,
+ sizeof(iorw_pi->s.fast_iov));
+ iorw_pi->s.iter.iov = iorw_pi->s.fast_iov;
+ iorw_pi->free_iovec = NULL;
+ } else {
+ req->flags |= REQ_F_NEED_CLEANUP;
+ iorw_pi->free_iovec = req->rw_pi.s->iter.iov;
+ }
+
+ iov_iter_save_state(&iorw_pi->s.iter, &iorw_pi->s.iter_state);
+ }
}
return 0;
}
+static inline
+int io_import_pi_iovec(struct io_kiocb *req, int rw, unsigned int fast_segs,
+ struct iovec **fast_iov, struct iov_iter *iter)
+{
+ void __user *buf = req->rw_pi.pi_iov;
+
+ return __import_iovec(rw, buf, req->rw_pi.nr_pi_segs, fast_segs,
+ fast_iov, iter, req->ctx->compat);
+}
+
+
static inline int io_rw_prep_async(struct io_kiocb *req, int rw)
{
struct io_async_rw *iorw = req->async_data;
@@ -3527,6 +3609,25 @@ static inline int io_rw_prep_async(struct io_kiocb *req, int rw)
return 0;
}
+static inline int io_rw_prep_async_pi(struct io_kiocb *req, int rw)
+{
+ int ret = 0;
+ struct io_async_rw_pi *iorw_pi = req->async_data;
+ struct iovec *pi_iov = iorw_pi->s.fast_iov;
+
+ ret = io_import_pi_iovec(req, rw, UIO_FASTIOV_PI, &pi_iov, &iorw_pi->s.iter);
+ if (unlikely(ret < 0))
+ return ret;
+
+ iorw_pi->free_iovec = pi_iov;
+
+ if (pi_iov)
+ req->flags |= REQ_F_NEED_CLEANUP;
+ iov_iter_save_state(&iorw_pi->s.iter, &iorw_pi->s.iter_state);
+
+ return io_rw_prep_async(req, rw);
+}
+
static int io_read_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
{
if (unlikely(!(req->file->f_mode & FMODE_READ)))
@@ -3534,6 +3635,15 @@ static int io_read_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
return io_prep_rw(req, sqe);
}
+static int io_read_prep_pi(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+{
+ int ret = io_read_prep(req, sqe);
+
+ if (ret)
+ return ret;
+ return io_prep_rw_pi(req, sqe);
+}
+
/*
* This is our waitqueue callback handler, registered through __folio_lock_async()
* when we initially tried to do the IO with the iocb armed our waitqueue.
@@ -3690,6 +3800,9 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
* manually if we need to.
*/
iov_iter_restore(&s->iter, &s->iter_state);
+ if (req->flags & REQ_F_USE_PI)
+ iov_iter_restore(kiocb->pi_iter, &req->rw_pi.s->iter_state);
+
ret2 = io_setup_async_rw(req, iovec, s, true);
if (ret2)
@@ -3714,6 +3827,8 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
break;
rw->bytes_done += ret;
iov_iter_save_state(&s->iter, &s->iter_state);
+ if (req->flags & REQ_F_USE_PI)
+ iov_iter_save_state(kiocb->pi_iter, &req->rw_pi.s->iter_state);
/* if we can retry, do so with the callbacks armed */
if (!io_rw_should_retry(req)) {
@@ -3733,6 +3848,8 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
/* we got some bytes, but not all. retry. */
kiocb->ki_flags &= ~IOCB_WAITQ;
iov_iter_restore(&s->iter, &s->iter_state);
+ if (req->flags & REQ_F_USE_PI)
+ iov_iter_restore(kiocb->pi_iter, &req->rw_pi.s->iter_state);
} while (ret > 0);
done:
kiocb_done(req, ret, issue_flags);
@@ -3743,6 +3860,34 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
return 0;
}
+static int io_read_pi(struct io_kiocb *req, unsigned int issue_flags)
+{
+ if (req_has_async_data(req)) {
+ struct io_async_rw_pi *iorw_pi = req->async_data;
+
+ iov_iter_restore(&iorw_pi->s.iter, &iorw_pi->s.iter_state);
+ req->rw.kiocb.pi_iter = &iorw_pi->s.iter;
+ req->rw_pi.s = &iorw_pi->s;
+ return io_read(req, issue_flags);
+ } else {
+ int ret;
+ struct io_rw_pi_state __s, *s = &__s;
+ struct iovec *pi_iov = __s.fast_iov;
+
+ ret = io_import_pi_iovec(req, READ, UIO_FASTIOV_PI, &pi_iov, &s->iter);
+ if (unlikely(ret < 0))
+ return ret;
+ iov_iter_save_state(&s->iter, &s->iter_state);
+ req->rw.kiocb.pi_iter = &s->iter;
+ req->rw_pi.s = s;
+
+ ret = io_read(req, issue_flags);
+ if (pi_iov && !(ret == -EAGAIN && req_has_async_data(req)))
+ kfree(pi_iov);
+ return ret;
+ }
+}
+
static int io_write_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
{
if (unlikely(!(req->file->f_mode & FMODE_WRITE)))
@@ -3751,6 +3896,15 @@ static int io_write_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
return io_prep_rw(req, sqe);
}
+static int io_write_prep_pi(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+{
+ int ret = io_write_prep(req, sqe);
+
+ if (ret)
+ return ret;
+ return io_prep_rw_pi(req, sqe);
+}
+
static int io_write(struct io_kiocb *req, unsigned int issue_flags)
{
struct io_rw_state __s, *s = &__s;
@@ -3836,6 +3990,10 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
} else {
copy_iov:
iov_iter_restore(&s->iter, &s->iter_state);
+
+ if (req->flags & REQ_F_USE_PI)
+ iov_iter_restore(kiocb->pi_iter, &req->rw_pi.s->iter_state);
+
ret = io_setup_async_rw(req, iovec, s, false);
return ret ?: -EAGAIN;
}
@@ -3846,6 +4004,34 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
return ret;
}
+static int io_write_pi(struct io_kiocb *req, unsigned int issue_flags)
+{
+ if (req_has_async_data(req)) {
+ struct io_async_rw_pi *iorw_pi = req->async_data;
+
+ req->rw.kiocb.pi_iter = &iorw_pi->s.iter;
+ req->rw_pi.s = &iorw_pi->s;
+ iov_iter_restore(&iorw_pi->s.iter, &iorw_pi->s.iter_state);
+ return io_write(req, issue_flags);
+ } else {
+ int ret;
+ struct io_rw_pi_state __s, *s = &__s;
+ struct iovec *pi_iov = __s.fast_iov;
+
+ ret = io_import_pi_iovec(req, WRITE, UIO_FASTIOV_PI, &pi_iov, &s->iter);
+ if (unlikely(ret < 0))
+ return ret;
+ iov_iter_save_state(&s->iter, &s->iter_state);
+ req->rw.kiocb.pi_iter = &s->iter;
+ req->rw_pi.s = s;
+
+ ret = io_write(req, issue_flags);
+ if (pi_iov && !(ret == -EAGAIN && req_has_async_data(req)))
+ kfree(pi_iov);
+ return ret;
+ }
+}
+
static int io_renameat_prep(struct io_kiocb *req,
const struct io_uring_sqe *sqe)
{
@@ -6500,10 +6686,14 @@ static int io_req_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
case IORING_OP_READ_FIXED:
case IORING_OP_READ:
return io_read_prep(req, sqe);
+ case IORING_OP_READV_PI:
+ return io_read_prep_pi(req, sqe);
case IORING_OP_WRITEV:
case IORING_OP_WRITE_FIXED:
case IORING_OP_WRITE:
return io_write_prep(req, sqe);
+ case IORING_OP_WRITEV_PI:
+ return io_write_prep_pi(req, sqe);
case IORING_OP_POLL_ADD:
return io_poll_add_prep(req, sqe);
case IORING_OP_POLL_REMOVE:
@@ -6589,6 +6779,10 @@ static int io_req_prep_async(struct io_kiocb *req)
return io_rw_prep_async(req, READ);
case IORING_OP_WRITEV:
return io_rw_prep_async(req, WRITE);
+ case IORING_OP_READV_PI:
+ return io_rw_prep_async_pi(req, READ);
+ case IORING_OP_WRITEV_PI:
+ return io_rw_prep_async_pi(req, WRITE);
case IORING_OP_SENDMSG:
return io_sendmsg_prep_async(req);
case IORING_OP_RECVMSG:
@@ -6670,7 +6864,14 @@ static void io_clean_op(struct io_kiocb *req)
case IORING_OP_WRITE_FIXED:
case IORING_OP_WRITE: {
struct io_async_rw *io = req->async_data;
+ kfree(io->free_iovec);
+ break;
+ }
+ case IORING_OP_READV_PI:
+ case IORING_OP_WRITEV_PI: {
+ struct io_async_rw_pi *io = req->async_data;
+ kfree(io->async.free_iovec);
kfree(io->free_iovec);
break;
}
@@ -6750,11 +6951,17 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
case IORING_OP_READ:
ret = io_read(req, issue_flags);
break;
+ case IORING_OP_READV_PI:
+ ret = io_read_pi(req, issue_flags);
+ break;
case IORING_OP_WRITEV:
case IORING_OP_WRITE_FIXED:
case IORING_OP_WRITE:
ret = io_write(req, issue_flags);
break;
+ case IORING_OP_WRITEV_PI:
+ ret = io_write_pi(req, issue_flags);
+ break;
case IORING_OP_FSYNC:
ret = io_fsync(req, issue_flags);
break;
@@ -11218,6 +11425,8 @@ static int __init io_uring_init(void)
BUILD_BUG_SQE_ELEM(42, __u16, personality);
BUILD_BUG_SQE_ELEM(44, __s32, splice_fd_in);
BUILD_BUG_SQE_ELEM(44, __u32, file_index);
+ BUILD_BUG_SQE_ELEM(48, __u64, pi_addr);
+ BUILD_BUG_SQE_ELEM(56, __u32, pi_len);
BUILD_BUG_ON(sizeof(struct io_uring_files_update) !=
sizeof(struct io_uring_rsrc_update));
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e2d892b201b0..c45ec5073300 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -316,6 +316,7 @@ enum rw_hint {
#define IOCB_NOIO (1 << 20)
/* can use bio alloc cache */
#define IOCB_ALLOC_CACHE (1 << 21)
+#define IOCB_USE_PI (1 << 22)
struct kiocb {
struct file *ki_filp;
@@ -330,6 +331,7 @@ struct kiocb {
u16 ki_hint;
u16 ki_ioprio; /* See linux/ioprio.h */
struct wait_page_queue *ki_waitq; /* for async buffered IO */
+ struct iov_iter *pi_iter;
randomized_struct_fields_end
};
diff --git a/include/trace/events/io_uring.h b/include/trace/events/io_uring.h
index 7346f0164cf4..8a435df796b9 100644
--- a/include/trace/events/io_uring.h
+++ b/include/trace/events/io_uring.h
@@ -524,8 +524,9 @@ TRACE_EVENT(io_uring_req_failed,
__field( u16, buf_index )
__field( u16, personality )
__field( u32, file_index )
- __field( u64, pad1 )
- __field( u64, pad2 )
+ __field(u64, pi_addr)
+ __field(u32, pi_len)
+ __field(u32, pad)
__field( int, error )
),
@@ -541,21 +542,23 @@ TRACE_EVENT(io_uring_req_failed,
__entry->buf_index = sqe->buf_index;
__entry->personality = sqe->personality;
__entry->file_index = sqe->file_index;
- __entry->pad1 = sqe->__pad2[0];
- __entry->pad2 = sqe->__pad2[1];
+ __entry->pi_addr = sqe->pi_addr;
+ __entry->pi_len = sqe->pi_len;
+ __entry->pad = sqe->__pad2;
__entry->error = error;
),
TP_printk("op %d, flags=0x%x, prio=%d, off=%llu, addr=%llu, "
"len=%u, rw_flags=0x%x, user_data=0x%llx, buf_index=%d, "
- "personality=%d, file_index=%d, pad=0x%llx/%llx, error=%d",
+ "personality=%d, file_index=%d, pi_addr=0x%llx, pi_len=%u, "
+ "pad=%u, error=%d",
__entry->opcode, __entry->flags, __entry->ioprio,
(unsigned long long)__entry->off,
(unsigned long long) __entry->addr, __entry->len,
__entry->op_flags, (unsigned long long) __entry->user_data,
__entry->buf_index, __entry->personality, __entry->file_index,
- (unsigned long long) __entry->pad1,
- (unsigned long long) __entry->pad2, __entry->error)
+ (unsigned long long) __entry->pi_addr, __entry->pi_len,
+ __entry->pad, __entry->error)
);
#endif /* _TRACE_IO_URING_H */
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 787f491f0d2a..87ea512c2c8d 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -60,7 +60,9 @@ struct io_uring_sqe {
__s32 splice_fd_in;
__u32 file_index;
};
- __u64 __pad2[2];
+ __u64 pi_addr; /* pointer to iovec */
+ __u32 pi_len;
+ __u32 __pad2;
};
enum {
@@ -143,6 +145,8 @@ enum {
IORING_OP_MKDIRAT,
IORING_OP_SYMLINKAT,
IORING_OP_LINKAT,
+ IORING_OP_READV_PI,
+ IORING_OP_WRITEV_PI,
/* this goes last, obviously */
IORING_OP_LAST,
diff --git a/include/uapi/linux/uio.h b/include/uapi/linux/uio.h
index 059b1a9147f4..c9eaaa6cdb0f 100644
--- a/include/uapi/linux/uio.h
+++ b/include/uapi/linux/uio.h
@@ -23,9 +23,10 @@ struct iovec
/*
* UIO_MAXIOV shall be at least 16 1003.1g (5.4.1.1)
*/
-
+
#define UIO_FASTIOV 8
#define UIO_MAXIOV 1024
+#define UIO_FASTIOV_PI 1
#endif /* _UAPI__LINUX_UIO_H */
--
2.34.1
* [PATCH v2 3/3] block: fops: handle IOCB_USE_PI in direct IO
2022-02-10 13:08 [PATCH v2 0/3] implement direct IO with integrity Alexander V. Buev
2022-02-10 13:08 ` [PATCH v2 1/3] block: bio-integrity: add PI iovec to bio Alexander V. Buev
2022-02-10 13:08 ` [PATCH v2 2/3] block: io_uring: add READV_PI/WRITEV_PI operations Alexander V. Buev
@ 2022-02-10 13:08 ` Alexander V. Buev
2022-02-10 17:53 ` Chaitanya Kulkarni
2 siblings, 1 reply; 10+ messages in thread
From: Alexander V. Buev @ 2022-02-10 13:08 UTC (permalink / raw)
To: linux-block
Cc: io-uring, Jens Axboe, Christoph Hellwig, Martin K . Petersen,
Pavel Begunkov, Chaitanya Kulkarni, Mikhail Malygin, linux,
Alexander V. Buev
Check that the size of the integrity data corresponds to the device
integrity profile and the data size. Split the integrity data across
multiple bios (together with the normal data) when the original bio
is too big.
Signed-off-by: Alexander V. Buev <[email protected]>
---
block/fops.c | 62 ++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 62 insertions(+)
diff --git a/block/fops.c b/block/fops.c
index 4f59e0f5bf30..99c670b9f7d4 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -16,6 +16,7 @@
#include <linux/suspend.h>
#include <linux/fs.h>
#include <linux/module.h>
+#include <linux/blk-integrity.h>
#include "blk.h"
static inline struct inode *bdev_file_inode(struct file *file)
@@ -44,6 +45,19 @@ static unsigned int dio_bio_write_op(struct kiocb *iocb)
#define DIO_INLINE_BIO_VECS 4
+static int __bio_integrity_add_iovec(struct bio *bio, struct iov_iter *pi_iter)
+{
+ struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
+ unsigned int pi_len = bio_integrity_bytes(bi, bio->bi_iter.bi_size >> SECTOR_SHIFT);
+ size_t iter_count = pi_iter->count - pi_len;
+ int ret;
+
+ iov_iter_truncate(pi_iter, pi_len);
+ ret = bio_integrity_add_iovec(bio, pi_iter);
+ pi_iter->count = iter_count;
+ return ret;
+}
+
static void blkdev_bio_end_io_simple(struct bio *bio)
{
struct task_struct *waiter = bio->bi_private;
@@ -101,6 +115,14 @@ static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
if (iocb->ki_flags & IOCB_HIPRI)
bio_set_polled(&bio, iocb);
+ if (iocb->ki_flags & IOCB_USE_PI) {
+ ret = __bio_integrity_add_iovec(&bio, iocb->pi_iter);
+ if (ret) {
+ bio_release_pages(&bio, should_dirty);
+ goto out;
+ }
+ }
+
submit_bio(&bio);
for (;;) {
set_current_state(TASK_UNINTERRUPTIBLE);
@@ -252,6 +274,16 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
pos += bio->bi_iter.bi_size;
nr_pages = bio_iov_vecs_to_alloc(iter, BIO_MAX_VECS);
+
+ if (iocb->ki_flags & IOCB_USE_PI) {
+ ret = __bio_integrity_add_iovec(bio, iocb->pi_iter);
+ if (ret) {
+ bio->bi_status = BLK_STS_IOERR;
+ bio_endio(bio);
+ break;
+ }
+ }
+
if (!nr_pages) {
submit_bio(bio);
break;
@@ -358,6 +390,15 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
task_io_account_write(bio->bi_iter.bi_size);
}
+ if (iocb->ki_flags & IOCB_USE_PI) {
+ ret = __bio_integrity_add_iovec(bio, iocb->pi_iter);
+ if (ret) {
+ bio->bi_status = BLK_STS_IOERR;
+ bio_endio(bio);
+ return ret;
+ }
+ }
+
if (iocb->ki_flags & IOCB_HIPRI) {
bio->bi_opf |= REQ_POLLED | REQ_NOWAIT;
submit_bio(bio);
@@ -377,6 +418,27 @@ static ssize_t blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
if (!iov_iter_count(iter))
return 0;
+ if (iocb->ki_flags & IOCB_USE_PI) {
+ struct block_device *bdev = iocb->ki_filp->private_data;
+ struct blk_integrity *bi = bdev_get_integrity(bdev);
+ unsigned int intervals;
+
+ if (unlikely(!(bi && bi->tuple_size &&
+ bi->flags & BLK_INTEGRITY_DEVICE_CAPABLE))) {
+ pr_err("Device %d:%d is not integrity capable\n",
+ MAJOR(bdev->bd_dev), MINOR(bdev->bd_dev));
+ return -EINVAL;
+ }
+
+ intervals = bio_integrity_intervals(bi, iter->count >> SECTOR_SHIFT);
+ if (unlikely(intervals * bi->tuple_size > iocb->pi_iter->count)) {
+ pr_err("Integrity & data size mismatch data=%zu integrity=%zu intervals=%u tuple=%u\n",
+ iter->count, iocb->pi_iter->count,
+ intervals, bi->tuple_size);
+ return -EINVAL;
+ }
+ }
+
nr_pages = bio_iov_vecs_to_alloc(iter, BIO_MAX_VECS + 1);
if (likely(nr_pages <= BIO_MAX_VECS)) {
if (is_sync_kiocb(iocb))
--
2.34.1
* Re: [PATCH v2 2/3] block: io_uring: add READV_PI/WRITEV_PI operations
2022-02-10 13:08 ` [PATCH v2 2/3] block: io_uring: add READV_PI/WRITEV_PI operations Alexander V. Buev
@ 2022-02-10 15:39 ` Jens Axboe
2022-02-10 19:03 ` Alexander V. Buev
0 siblings, 1 reply; 10+ messages in thread
From: Jens Axboe @ 2022-02-10 15:39 UTC (permalink / raw)
To: Alexander V. Buev, linux-block
Cc: io-uring, Christoph Hellwig, Martin K . Petersen, Pavel Begunkov,
Chaitanya Kulkarni, Mikhail Malygin, linux
On 2/10/22 6:08 AM, Alexander V. Buev wrote:
> Added new READV_PI/WRITEV_PI operations to io_uring.
> Added new pi_addr & pi_len fields to SQE struct.
> Added new pi_iter field and IOCB_USE_PI flag to kiocb struct.
> Make corresponding corrections to io uring trace event.
>
> Signed-off-by: Alexander V. Buev <[email protected]>
> ---
> fs/io_uring.c | 209 ++++++++++++++++++++++++++++++++
> include/linux/fs.h | 2 +
> include/trace/events/io_uring.h | 17 +--
> include/uapi/linux/io_uring.h | 6 +-
> include/uapi/linux/uio.h | 3 +-
> 5 files changed, 228 insertions(+), 9 deletions(-)
>
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 2e04f718319d..6e941040f228 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -563,6 +563,19 @@ struct io_rw {
> u64 len;
> };
>
> +struct io_rw_pi_state {
> + struct iov_iter iter;
> + struct iov_iter_state iter_state;
> + struct iovec fast_iov[UIO_FASTIOV_PI];
> +};
> +
> +struct io_rw_pi {
> + struct io_rw rw;
> + struct iovec *pi_iov;
> + u32 nr_pi_segs;
> + struct io_rw_pi_state *s;
> +};
One immediate issue I see here is that io_rw_pi is big, and we try very
hard to keep the per-command payload to 64-bytes. This would be 88 bytes
by my count :-/
Do you need everything from io_rw? If not, I'd just make io_rw_pi
contain the bits you need and see if you can squeeze it into the
existing cacheline.
--
Jens Axboe
* Re: [PATCH v2 1/3] block: bio-integrity: add PI iovec to bio
2022-02-10 13:08 ` [PATCH v2 1/3] block: bio-integrity: add PI iovec to bio Alexander V. Buev
@ 2022-02-10 17:49 ` Chaitanya Kulkarni
0 siblings, 0 replies; 10+ messages in thread
From: Chaitanya Kulkarni @ 2022-02-10 17:49 UTC (permalink / raw)
To: Alexander V. Buev, [email protected]
Cc: [email protected], Jens Axboe, Christoph Hellwig,
Martin K . Petersen, Pavel Begunkov, Mikhail Malygin,
[email protected]
On 2/10/22 5:08 AM, Alexander V. Buev wrote:
> Added functions to attach user PI iovec pages to bio
> and release this pages via bio_integrity_free.
>
consider :-
Added functions to attach user PI iovec pages to bio and release this
pages via bio_integrity_free.
> Signed-off-by: Alexander V. Buev <[email protected]>
> ---
> block/bio-integrity.c | 151 ++++++++++++++++++++++++++++++++++++++++++
> include/linux/bio.h | 8 +++
> 2 files changed, 159 insertions(+)
>
> diff --git a/block/bio-integrity.c b/block/bio-integrity.c
> index 0827b19820c5..8e57aea9c9eb 100644
> --- a/block/bio-integrity.c
> +++ b/block/bio-integrity.c
> @@ -10,6 +10,7 @@
> #include <linux/mempool.h>
> #include <linux/export.h>
> #include <linux/bio.h>
> +#include <linux/uio.h>
> #include <linux/workqueue.h>
> #include <linux/slab.h>
> #include "blk.h"
> @@ -91,6 +92,19 @@ struct bio_integrity_payload *bio_integrity_alloc(struct bio *bio,
> }
> EXPORT_SYMBOL(bio_integrity_alloc);
>
> +void bio_integrity_release_pages(struct bio *bio)
> +{
> + struct bio_integrity_payload *bip = bio_integrity(bio);
> + unsigned short i;
> +
> + for (i = 0; i < bip->bip_vcnt; ++i) {
++i is common in kernel code..
> + struct bio_vec *bv;
> +
> + bv = bip->bip_vec + i;
is it possible to use bip->bip_vec++ instead of the above?
> + put_page(bv->bv_page);
> + }
> +}
> +
> /**
> * bio_integrity_free - Free bio integrity payload
> * @bio: bio containing bip to be freed
> @@ -105,6 +119,10 @@ void bio_integrity_free(struct bio *bio)
>
> if (bip->bip_flags & BIP_BLOCK_INTEGRITY)
> kfree(bvec_virt(bip->bip_vec));
> + else {
> + if (bip->bip_flags & BIP_RELEASE_PAGES)
> + bio_integrity_release_pages(bio);
> + }
>
why not use if() ... else if () {} ?
> __bio_integrity_free(bs, bip);
> bio->bi_integrity = NULL;
> @@ -377,6 +395,139 @@ void bio_integrity_advance(struct bio *bio, unsigned int bytes_done)
> bvec_iter_advance(bip->bip_vec, &bip->bip_iter, bytes);
> }
>
> +static inline
> +struct page **__bio_integrity_temp_pages(struct bio *bio, unsigned int nr_needed_page)
> +{
> + struct page **pages = 0;
use NULL please, and reverse xmas tree order for the declarations ..
> + unsigned int nr_avail_page;
> + struct bio_integrity_payload *bip = bio_integrity(bio);
> +
> + if (bip->bip_max_vcnt > nr_needed_page) {
> + nr_avail_page = (bip->bip_max_vcnt - nr_needed_page) *
> + sizeof(struct bio_vec)/sizeof(struct page *);
> + } else
> + nr_avail_page = 0;
> +
I'd initialize nr_avail_page at the time of declaration and remove the
else part.
> + if (nr_avail_page >= nr_needed_page)
> + pages = (struct page **) (bip->bip_vec + nr_needed_page);
> + else {
> + if (bio->bi_max_vecs - bio->bi_vcnt) {
> + nr_avail_page = (bio->bi_max_vecs - bio->bi_vcnt) *
> + sizeof(struct bio_vec)/sizeof(struct page *);
> + if (nr_avail_page >= nr_needed_page)
> + pages = (struct page **) (bio->bi_io_vec + bio->bi_vcnt);
> + }
> + }
> +
How about the following ? It is much clearer and removes the need for
the else part :-
if (nr_avail_page >= nr_needed_page)
return (struct page **) (bip->bip_vec +
nr_needed_page);
if (bio->bi_max_vecs - bio->bi_vcnt) {
nr_avail_page = (bio->bi_max_vecs - bio->bi_vcnt) *
sizeof(struct bio_vec)/sizeof(struct page *);
if (nr_avail_page >= nr_needed_page)
return (struct page **) (bio->bi_io_vec +
bio->bi_vcnt);
}
return NULL;
I think if you use the above then we don't need the pages variable on
the stack in the fast path.
> + return pages;
> +}
> +
> +/**
> + * bio_integrity_add_iovec - Add PI io vector
> + * @bio: bio whose integrity vector to update
> + * @pi_iter: iov_iter pointed to data added to @bio's integrity
> + *
> + * Description: Pins pages for *pi_iov and appends them to @bio's integrity.
> + */
> +int bio_integrity_add_iovec(struct bio *bio, struct iov_iter *pi_iter)
> +{
> + struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
> + struct bio_integrity_payload *bip;
> + struct page **pi_page = 0, **bio_page;
> + unsigned int nr_vec_page, intervals;
> + int ret;
> + ssize_t size;
> + size_t offset, len, pg_num, page_count;
> +
> + if (unlikely(!bi && bi->tuple_size && bi->interval_exp)) {
> + pr_err("Device is not integrity capable");
> + return -EINVAL;
> + }
> +
> + intervals = bio_integrity_intervals(bi, bio_sectors(bio));
> + if (unlikely(intervals * bi->tuple_size < pi_iter->count)) {
> + pr_err("Intervals number is wrong, intervals=%u, tuple_size=%u, pi_len=%zu",
> + intervals, bi->tuple_size, pi_iter->count);
> + return -EINVAL;
> + }
> +
> + nr_vec_page = (pi_iter->count + PAGE_SIZE - 1) >> PAGE_SHIFT;
> + /* data of size N pages can be pinned to N+1 page */
> + nr_vec_page += 1;
> +
> + bip = bio_integrity_alloc(bio, GFP_NOIO, nr_vec_page);
> + if (IS_ERR(bip))
> + return PTR_ERR(bip);
> +
> + /* get space for page pointers array */
> + bio_page = __bio_integrity_temp_pages(bio, nr_vec_page);
> +
> + if (likely(bio_page))
> + pi_page = bio_page;
> + else {
> + pi_page = kcalloc(nr_vec_page, sizeof(struct pi_page *), GFP_NOIO);
overly long line above..
> + if (!pi_page) {
> + ret = -ENOMEM;
> + goto error;
> + }
> + }
> +
> + bip->bip_iter.bi_size = pi_iter->count;
> + bip->bio_iter = bio->bi_iter;
> + bip_set_seed(bip, bio->bi_iter.bi_sector);
> +
> + if (bi->flags & BLK_INTEGRITY_IP_CHECKSUM)
> + bip->bip_flags |= BIP_IP_CHECKSUM;
> +
> + size = iov_iter_get_pages(pi_iter, pi_page, LONG_MAX, nr_vec_page, &offset);
same as above comment...
> + if (unlikely(size <= 0)) {
> + pr_err("Failed to pin PI buffer to page (%zi)", size);
> + ret = (size) ? size : -EFAULT;
> + goto error;
> + }
> +
> + /* calc count of pined pages */
> + if (size > (PAGE_SIZE - offset))
> + page_count = DIV_ROUND_UP(size - (PAGE_SIZE - offset), PAGE_SIZE) + 1;
> + else
> + page_count = 1;
> +
> + /* fill bio integrity biovecs the given pages */
> + len = pi_iter->count;
> + for (pg_num = 0; pg_num < page_count; ++pg_num) {
> + size_t page_len;
> +
> + page_len = PAGE_SIZE - offset;
> + if (page_len > len)
> + page_len = len;
> + ret = bio_integrity_add_page(bio, pi_page[pg_num], page_len, offset);
> + if (unlikely(ret != page_len)) {
> + ret = -ENOMEM;
> + goto error;
> + }
> + len -= page_len;
> + offset = 0;
> + bip->bip_flags |= BIP_RELEASE_PAGES;
> + }
> +
> + iov_iter_advance(pi_iter, size);
> +
> + if (pi_page != bio_page)
> + kfree(pi_page);
> +
> + return 0;
> +
> +error:
> + if (bio_integrity(bio))
> + bio_integrity_free(bio);
> +
> + if (pi_page && pi_page != bio_page)
> + kfree(pi_page);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(bio_integrity_add_iovec);
> +
> /**
> * bio_integrity_trim - Trim integrity vector
> * @bio: bio whose integrity vector to update
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index 117d7f248ac9..ce008eeeb160 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -317,6 +317,7 @@ enum bip_flags {
> BIP_CTRL_NOCHECK = 1 << 2, /* disable HBA integrity checking */
> BIP_DISK_NOCHECK = 1 << 3, /* disable disk integrity checking */
> BIP_IP_CHECKSUM = 1 << 4, /* IP checksum */
> + BIP_RELEASE_PAGES = 1 << 5, /* release pages after io completion */
> };
>
> /*
> @@ -707,6 +708,7 @@ extern struct bio_integrity_payload *bio_integrity_alloc(struct bio *, gfp_t, un
> extern int bio_integrity_add_page(struct bio *, struct page *, unsigned int, unsigned int);
> extern bool bio_integrity_prep(struct bio *);
> extern void bio_integrity_advance(struct bio *, unsigned int);
> +extern int bio_integrity_add_iovec(struct bio *bio, struct iov_iter *iter);
can we drop extern keyword ?
> extern void bio_integrity_trim(struct bio *);
> extern int bio_integrity_clone(struct bio *, struct bio *, gfp_t);
> extern int bioset_integrity_create(struct bio_set *, int);
> @@ -747,6 +749,12 @@ static inline void bio_integrity_advance(struct bio *bio,
> return;
> }
>
> +static inline int bio_integrity_add_iovec(struct bio *bio,
> + struct iov_iter *pi_iter)
> +{
> + return 0;
> +}
> +
> static inline void bio_integrity_trim(struct bio *bio)
> {
> return;
>
* Re: [PATCH v2 3/3] block: fops: handle IOCB_USE_PI in direct IO
2022-02-10 13:08 ` [PATCH v2 3/3] block: fops: handle IOCB_USE_PI in direct IO Alexander V. Buev
@ 2022-02-10 17:53 ` Chaitanya Kulkarni
0 siblings, 0 replies; 10+ messages in thread
From: Chaitanya Kulkarni @ 2022-02-10 17:53 UTC (permalink / raw)
To: Alexander V. Buev, [email protected]
Cc: [email protected], Jens Axboe, Christoph Hellwig,
Martin K . Petersen, Pavel Begunkov, Mikhail Malygin,
[email protected]
On 2/10/22 5:08 AM, Alexander V. Buev wrote:
> Check that the size of the integrity data corresponds to the device
> integrity profile and the data size. Split the integrity data across
> the different bios (together with the normal data) when the original
> bio is too big.
>
> Signed-off-by: Alexander V. Buev <[email protected]>
> ---
> block/fops.c | 62 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 62 insertions(+)
>
> diff --git a/block/fops.c b/block/fops.c
> index 4f59e0f5bf30..99c670b9f7d4 100644
> --- a/block/fops.c
> +++ b/block/fops.c
> @@ -16,6 +16,7 @@
> #include <linux/suspend.h>
> #include <linux/fs.h>
> #include <linux/module.h>
> +#include <linux/blk-integrity.h>
> #include "blk.h"
>
> static inline struct inode *bdev_file_inode(struct file *file)
> @@ -44,6 +45,19 @@ static unsigned int dio_bio_write_op(struct kiocb *iocb)
>
> #define DIO_INLINE_BIO_VECS 4
>
> +static int __bio_integrity_add_iovec(struct bio *bio, struct iov_iter *pi_iter)
> +{
> + struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
> + unsigned int pi_len = bio_integrity_bytes(bi, bio->bi_iter.bi_size >> SECTOR_SHIFT);
> + size_t iter_count = pi_iter->count-pi_len;
> + int ret;
> +
> + iov_iter_truncate(pi_iter, pi_len);
> + ret = bio_integrity_add_iovec(bio, pi_iter);
> + pi_iter->count = iter_count;
> + return ret;
> +}
> +
> static void blkdev_bio_end_io_simple(struct bio *bio)
> {
> struct task_struct *waiter = bio->bi_private;
> @@ -101,6 +115,14 @@ static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
> if (iocb->ki_flags & IOCB_HIPRI)
> bio_set_polled(&bio, iocb);
>
> + if (iocb->ki_flags & IOCB_USE_PI) {
> + ret = __bio_integrity_add_iovec(&bio, iocb->pi_iter);
> + if (ret) {
> + bio_release_pages(&bio, should_dirty);
> + goto out;
> + }
> + }
> +
> submit_bio(&bio);
> for (;;) {
> set_current_state(TASK_UNINTERRUPTIBLE);
> @@ -252,6 +274,16 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
> pos += bio->bi_iter.bi_size;
>
> nr_pages = bio_iov_vecs_to_alloc(iter, BIO_MAX_VECS);
> +
> + if (iocb->ki_flags & IOCB_USE_PI) {
> + ret = __bio_integrity_add_iovec(bio, iocb->pi_iter);
> + if (ret) {
> + bio->bi_status = BLK_STS_IOERR;
> + bio_endio(bio);
> + break;
> + }
> + }
> +
> if (!nr_pages) {
> submit_bio(bio);
> break;
> @@ -358,6 +390,15 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
> task_io_account_write(bio->bi_iter.bi_size);
> }
>
> + if (iocb->ki_flags & IOCB_USE_PI) {
> + ret = __bio_integrity_add_iovec(bio, iocb->pi_iter);
> + if (ret) {
> + bio->bi_status = BLK_STS_IOERR;
> + bio_endio(bio);
> + return ret;
> + }
> + }
> +
> if (iocb->ki_flags & IOCB_HIPRI) {
> bio->bi_opf |= REQ_POLLED | REQ_NOWAIT;
> submit_bio(bio);
> @@ -377,6 +418,27 @@ static ssize_t blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
> if (!iov_iter_count(iter))
> return 0;
>
> + if (iocb->ki_flags & IOCB_USE_PI) {
> + struct block_device *bdev = iocb->ki_filp->private_data;
> + struct blk_integrity *bi = bdev_get_integrity(bdev);
> + unsigned int intervals;
> +
> + if (unlikely(!(bi && bi->tuple_size &&
> + bi->flags & BLK_INTEGRITY_DEVICE_CAPABLE))) {
> + pr_err("Device %d:%d is not integrity capable",
> + MAJOR(bdev->bd_dev), MINOR(bdev->bd_dev));
> + return -EINVAL;
> + }
> +
> + intervals = bio_integrity_intervals(bi, iter->count >> 9);
> + if (unlikely(intervals * bi->tuple_size > iocb->pi_iter->count)) {
> + pr_err("Integrity & data size mismatch data=%zu integrity=%zu intervals=%u tuple=%u",
> + iter->count, iocb->pi_iter->count,
> + intervals, bi->tuple_size);
> + return -EINVAL;
> + }
> + }
> +
Is there any reason the above code is not moved into an inline helper ?
> nr_pages = bio_iov_vecs_to_alloc(iter, BIO_MAX_VECS + 1);
> if (likely(nr_pages <= BIO_MAX_VECS)) {
> if (is_sync_kiocb(iocb))
>
* Re: [PATCH v2 2/3] block: io_uring: add READV_PI/WRITEV_PI operations
2022-02-10 15:39 ` Jens Axboe
@ 2022-02-10 19:03 ` Alexander V. Buev
2022-02-10 19:07 ` Jens Axboe
0 siblings, 1 reply; 10+ messages in thread
From: Alexander V. Buev @ 2022-02-10 19:03 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, io-uring, Christoph Hellwig, Martin K . Petersen,
Pavel Begunkov, Chaitanya Kulkarni, Mikhail Malygin, linux
> On 2/10/22 6:08 AM, Alexander V. Buev wrote:
> > Added new READV_PI/WRITEV_PI operations to io_uring.
> > Added new pi_addr & pi_len fields to SQE struct.
> > Added new pi_iter field and IOCB_USE_PI flag to kiocb struct.
> > Make corresponding corrections to io uring trace event.
> >
> > +struct io_rw_pi_state {
> > + struct iov_iter iter;
> > + struct iov_iter_state iter_state;
> > + struct iovec fast_iov[UIO_FASTIOV_PI];
> > +};
> > +
> > +struct io_rw_pi {
> > + struct io_rw rw;
> > + struct iovec *pi_iov;
> > + u32 nr_pi_segs;
> > + struct io_rw_pi_state *s;
> > +};
>
> One immediate issue I see here is that io_rw_pi is big, and we try very
> hard to keep the per-command payload to 64-bytes. This would be 88 bytes
> by my count :-/
>
> Do you need everything from io_rw? If not, I'd just make io_rw_pi
> contain the bits you need and see if you can squeeze it into the
> existing cacheline.
In short: yes. The current patch code calls the existing io_read/io_write
functions, and those functions use the io_rw struct to process the data.
I wanted to reuse the existing functions, but maybe that is the wrong
approach in this case.
The second problem with the request size is that the patch adds a pi_iter
pointer to the kiocb struct. This also increases the length of the whole
request union.
So I can see some possible solutions for this:
1) do not store the whole kiocb struct in the request
and write fully separate io_read/write_pi functions
2) add a special CONFIG_XXX option and simply hide this code
by default
Any other variants?
--
Alexander Buev
* Re: [PATCH v2 2/3] block: io_uring: add READV_PI/WRITEV_PI operations
2022-02-10 19:03 ` Alexander V. Buev
@ 2022-02-10 19:07 ` Jens Axboe
2022-02-11 9:39 ` Alexander V. Buev
0 siblings, 1 reply; 10+ messages in thread
From: Jens Axboe @ 2022-02-10 19:07 UTC (permalink / raw)
To: Alexander V. Buev, linux-block, io-uring, Christoph Hellwig,
Martin K . Petersen, Pavel Begunkov, Chaitanya Kulkarni,
Mikhail Malygin, linux
On 2/10/22 12:03 PM, Alexander V. Buev wrote:
>> On 2/10/22 6:08 AM, Alexander V. Buev wrote:
>>> Added new READV_PI/WRITEV_PI operations to io_uring.
>>> Added new pi_addr & pi_len fields to SQE struct.
>>> Added new pi_iter field and IOCB_USE_PI flag to kiocb struct.
>>> Make corresponding corrections to io uring trace event.
>>>
>>> +struct io_rw_pi_state {
>>> + struct iov_iter iter;
>>> + struct iov_iter_state iter_state;
>>> + struct iovec fast_iov[UIO_FASTIOV_PI];
>>> +};
>>> +
>>> +struct io_rw_pi {
>>> + struct io_rw rw;
>>> + struct iovec *pi_iov;
>>> + u32 nr_pi_segs;
>>> + struct io_rw_pi_state *s;
>>> +};
>>
>> One immediate issue I see here is that io_rw_pi is big, and we try very
>> hard to keep the per-command payload to 64-bytes. This would be 88 bytes
>> by my count :-/
>>
>> Do you need everything from io_rw? If not, I'd just make io_rw_pi
>> contain the bits you need and see if you can squeeze it into the
>> existing cacheline.
>
> In short: yes. The current patch code calls the existing io_read/io_write
> functions, and those functions use the io_rw struct to process the data.
> I wanted to reuse the existing functions, but maybe that is the wrong
> approach in this case.
>
> The second problem with the request size is that the patch adds a pi_iter
> pointer to the kiocb struct. This also increases the length of the whole
> request union.
>
> So I can see some possible solutions for this:
>
> 1) do not store the whole kiocb struct in the request
> and write fully separate io_read/write_pi functions
>
> 2) add a special CONFIG_XXX option and simply hide this code
> by default
Option 2 really sucks, because then obviously everyone wants their
feature enabled, and then we are back to square one. So never rely on a
config option if it can be avoided.
I'd like to see what option 1 looks like; that sounds like a far better
solution.
--
Jens Axboe
* Re: [PATCH v2 2/3] block: io_uring: add READV_PI/WRITEV_PI operations
2022-02-10 19:07 ` Jens Axboe
@ 2022-02-11 9:39 ` Alexander V. Buev
0 siblings, 0 replies; 10+ messages in thread
From: Alexander V. Buev @ 2022-02-11 9:39 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, io-uring, Christoph Hellwig, Martin K . Petersen,
Pavel Begunkov, Chaitanya Kulkarni, Mikhail Malygin, linux
> On 2/10/22 12:03 PM, Alexander V. Buev wrote:
> >> On 2/10/22 6:08 AM, Alexander V. Buev wrote:
> >>> Added new READV_PI/WRITEV_PI operations to io_uring.
> >>> Added new pi_addr & pi_len fields to SQE struct.
> >>> Added new pi_iter field and IOCB_USE_PI flag to kiocb struct.
> >>> Make corresponding corrections to io uring trace event.
> >>>
> >>> +struct io_rw_pi_state {
> >>> + struct iov_iter iter;
> >>> + struct iov_iter_state iter_state;
> >>> + struct iovec fast_iov[UIO_FASTIOV_PI];
> >>> +};
> >>> +
> >>> +struct io_rw_pi {
> >>> + struct io_rw rw;
> >>> + struct iovec *pi_iov;
> >>> + u32 nr_pi_segs;
> >>> + struct io_rw_pi_state *s;
> >>> +};
> >>
> >> One immediate issue I see here is that io_rw_pi is big, and we try very
> >> hard to keep the per-command payload to 64-bytes. This would be 88 bytes
> >> by my count :-/
> >>
> >> Do you need everything from io_rw? If not, I'd just make io_rw_pi
> >> contain the bits you need and see if you can squeeze it into the
> >> existing cacheline.
> >
> > In short: yes. The current patch code calls the existing io_read/io_write
> > functions, and those functions use the io_rw struct to process the data.
> > I wanted to reuse the existing functions, but maybe that is the wrong
> > approach in this case.
> >
> > The second problem with the request size is that the patch adds a pi_iter
> > pointer to the kiocb struct. This also increases the length of the whole
> > request union.
> >
> > So I can see some possible solutions for this:
> >
> > 1) do not store the whole kiocb struct in the request
> > and write fully separate io_read/write_pi functions
> >
> > 2) add a special CONFIG_XXX option and simply hide this code
> > by default
>
> Option 2 really sucks, because then obviously everyone wants their
> feature enabled, and then we are back to square one. So never rely on a
> config option, if it can be avoided.
>
> I'd like to see what option 1 looks like; that sounds like a far better
> solution.
>
Accepted. I am starting to prepare v3 this way.
Thanks to all for the feedback!
--
Alexander Buev
Thread overview: 10+ messages
2022-02-10 13:08 [PATCH v2 0/3] implement direct IO with integrity Alexander V. Buev
2022-02-10 13:08 ` [PATCH v2 1/3] block: bio-integrity: add PI iovec to bio Alexander V. Buev
2022-02-10 17:49 ` Chaitanya Kulkarni
2022-02-10 13:08 ` [PATCH v2 2/3] block: io_uring: add READV_PI/WRITEV_PI operations Alexander V. Buev
2022-02-10 15:39 ` Jens Axboe
2022-02-10 19:03 ` Alexander V. Buev
2022-02-10 19:07 ` Jens Axboe
2022-02-11 9:39 ` Alexander V. Buev
2022-02-10 13:08 ` [PATCH v2 3/3] block: fops: handle IOCB_USE_PI in direct IO Alexander V. Buev
2022-02-10 17:53 ` Chaitanya Kulkarni