* [RFC PATCH 0/4] Read/Write with meta buffer
       [not found] <CGME20240322185729epcas5p350c5054b5b519a6aa9d1b35ba3709563@epcas5p3.samsung.com>
@ 2024-03-22 18:50 ` Kanchan Joshi
  [not found] ` <CGME20240322185731epcas5p20fc525f793a537310f7b3ae5ba5bc75b@epcas5p2.samsung.com>
                   ` (5 more replies)
  0 siblings, 6 replies; 11+ messages in thread
From: Kanchan Joshi @ 2024-03-22 18:50 UTC (permalink / raw)
  To: martin.petersen, axboe, kbusch, hch
  Cc: io-uring, linux-block, anuj1072538, Kanchan Joshi

This patchset is aimed at getting the feedback on a new io_uring
interface that userspace can use to exchange meta buffer along with
read/write.

Two new opcodes for that: IORING_OP_READ_META and IORING_OP_WRITE_META.
The leftover space in the SQE is used to send meta buffer pointer
and its length. Patch #2 for this.

The interface is supported for block direct IO. Patch #4 for this.
Other two are prep patches.

It has been tried not to touch the hot read/write path, as much as
possible. Performance for non-meta IO is same after the patches [2].
There is some code in the cold path (worker-based async) though.

Moderately tested by modifying the fio [1] to use this interface
(only for NVMe block devices)

[1] https://github.com/OpenMPDK/fio/tree/feat/test-meta

[2]
without this:

taskset -c 2,5 t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n2 -r4 /dev/nvme0n1 /dev/nvme1n1
submitter=1, tid=2453, file=/dev/nvme1n1, node=-1
submitter=0, tid=2452, file=/dev/nvme0n1, node=-1
polled=1, fixedbufs=1, register_files=1, buffered=0, QD=128
Engine=io_uring, sq_ring=128, cq_ring=128
IOPS=10.02M, BW=4.89GiB/s, IOS/call=31/31
IOPS=10.04M, BW=4.90GiB/s, IOS/call=31/31

With this:
taskset -c 2,5 t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n2 -r4 /dev/nvme0n1 /dev/nvme1n1
submitter=1, tid=2453, file=/dev/nvme1n1, node=-1
submitter=0, tid=2452, file=/dev/nvme0n1, node=-1
polled=1, fixedbufs=1, register_files=1, buffered=0, QD=128
Engine=io_uring, sq_ring=128, cq_ring=128
IOPS=10.02M, BW=4.89GiB/s, IOS/call=31/31
IOPS=10.04M, BW=4.90GiB/s, IOS/call=31/31

Anuj Gupta (3):
  io_uring/rw: Get rid of flags field in struct io_rw
  io_uring/rw: support read/write with metadata
  block: modify bio_integrity_map_user to accept iov_iter as argument

Kanchan Joshi (1):
  block: add support to pass the meta buffer

 block/bio-integrity.c         |  27 ++++++---
 block/fops.c                  |   9 +++
 block/t10-pi.c                |   6 ++
 drivers/nvme/host/ioctl.c     |  11 +++-
 include/linux/bio.h           |  13 +++-
 include/linux/fs.h            |   1 +
 include/uapi/linux/io_uring.h |   6 ++
 io_uring/io_uring.c           |   2 +
 io_uring/opdef.c              |  29 +++++++++
 io_uring/rw.c                 | 108 +++++++++++++++++++++++++++++-----
 io_uring/rw.h                 |   8 +++
 11 files changed, 193 insertions(+), 27 deletions(-)


base-commit: 6f0974eccbf78baead1735722c4f1ee3eb9422cd
-- 
2.25.1

^ permalink raw reply	[flat|nested] 11+ messages in thread
[parent not found: <CGME20240322185731epcas5p20fc525f793a537310f7b3ae5ba5bc75b@epcas5p2.samsung.com>]
* [RFC PATCH 1/4] io_uring/rw: Get rid of flags field in struct io_rw
  [not found] ` <CGME20240322185731epcas5p20fc525f793a537310f7b3ae5ba5bc75b@epcas5p2.samsung.com>
@ 2024-03-22 18:50   ` Kanchan Joshi
  2024-03-27 23:30     ` David Wei
  0 siblings, 1 reply; 11+ messages in thread
From: Kanchan Joshi @ 2024-03-22 18:50 UTC (permalink / raw)
  To: martin.petersen, axboe, kbusch, hch
  Cc: io-uring, linux-block, anuj1072538, Anuj Gupta

From: Anuj Gupta <[email protected]>

Get rid of the flags field in io_rw. Flags can be set in kiocb->flags
during prep rather than doing it while issuing the I/O in
io_read/io_write.

Signed-off-by: Anuj Gupta <[email protected]>
---
 io_uring/rw.c | 22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/io_uring/rw.c b/io_uring/rw.c
index 47e097ab5d7e..40f6c2a59928 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -27,7 +27,6 @@ struct io_rw {
 	struct kiocb			kiocb;
 	u64				addr;
 	u32				len;
-	rwf_t				flags;
 };
 
 static inline bool io_file_supports_nowait(struct io_kiocb *req)
@@ -78,10 +77,16 @@ static int io_iov_buffer_select_prep(struct io_kiocb *req)
 int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
 	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
+	struct kiocb *kiocb = &rw->kiocb;
 	unsigned ioprio;
 	int ret;
 
-	rw->kiocb.ki_pos = READ_ONCE(sqe->off);
+	kiocb->ki_flags = 0;
+	ret = kiocb_set_rw_flags(kiocb, READ_ONCE(sqe->rw_flags));
+	if (unlikely(ret))
+		return ret;
+
+	kiocb->ki_pos = READ_ONCE(sqe->off);
 	/* used for fixed read/write too - just read unconditionally */
 	req->buf_index = READ_ONCE(sqe->buf_index);
 
@@ -91,15 +96,14 @@ int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 		if (ret)
 			return ret;
 
-		rw->kiocb.ki_ioprio = ioprio;
+		kiocb->ki_ioprio = ioprio;
 	} else {
-		rw->kiocb.ki_ioprio = get_current_ioprio();
+		kiocb->ki_ioprio = get_current_ioprio();
 	}
 
-	rw->kiocb.dio_complete = NULL;
+	kiocb->dio_complete = NULL;
 	rw->addr = READ_ONCE(sqe->addr);
 	rw->len = READ_ONCE(sqe->len);
-	rw->flags = READ_ONCE(sqe->rw_flags);
 	return 0;
 }
 
@@ -720,7 +724,6 @@ static int io_rw_init_file(struct io_kiocb *req, fmode_t mode)
 	struct kiocb *kiocb = &rw->kiocb;
 	struct io_ring_ctx *ctx = req->ctx;
 	struct file *file = req->file;
-	int ret;
 
 	if (unlikely(!(file->f_mode & mode)))
 		return -EBADF;
@@ -728,10 +731,7 @@ static int io_rw_init_file(struct io_kiocb *req, fmode_t mode)
 	if (!(req->flags & REQ_F_FIXED_FILE))
 		req->flags |= io_file_get_flags(file);
 
-	kiocb->ki_flags = file->f_iocb_flags;
-	ret = kiocb_set_rw_flags(kiocb, rw->flags);
-	if (unlikely(ret))
-		return ret;
+	kiocb->ki_flags |= file->f_iocb_flags;
 	kiocb->ki_flags |= IOCB_ALLOC_CACHE;
 
 	/*
-- 
2.25.1

^ permalink raw reply related	[flat|nested] 11+ messages in thread
* Re: [RFC PATCH 1/4] io_uring/rw: Get rid of flags field in struct io_rw
  2024-03-22 18:50 ` [RFC PATCH 1/4] io_uring/rw: Get rid of flags field in struct io_rw Kanchan Joshi
@ 2024-03-27 23:30   ` David Wei
  2024-03-27 23:32     ` David Wei
  0 siblings, 1 reply; 11+ messages in thread
From: David Wei @ 2024-03-27 23:30 UTC (permalink / raw)
  To: Kanchan Joshi, martin.petersen, axboe, kbusch, hch
  Cc: io-uring, linux-block, anuj1072538, Anuj Gupta

On 2024-03-22 11:50, Kanchan Joshi wrote:
> From: Anuj Gupta <[email protected]>
>
> Get rid of the flags field in io_rw. Flags can be set in kiocb->flags
> during prep rather than doing it while issuing the I/O in io_read/io_write.
>
> Signed-off-by: Anuj Gupta <[email protected]>
> ---
>  io_uring/rw.c | 22 +++++++++++-----------
>  1 file changed, 11 insertions(+), 11 deletions(-)

This patch looks fine and is a no-op on its own, but I think there is a
subtle semantic change. If the rw_flags is invalid (i.e.
kiocb_set_rw_flags() returns an err) and prep() fails, then the
remaining submissions won't be submitted unless IORING_SETUP_SUBMIT_ALL
is set.

Currently if kiocb_set_rw_flags() fails in prep(), only the request will
fail.

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [RFC PATCH 1/4] io_uring/rw: Get rid of flags field in struct io_rw
  2024-03-27 23:30 ` David Wei
@ 2024-03-27 23:32   ` David Wei
  0 siblings, 0 replies; 11+ messages in thread
From: David Wei @ 2024-03-27 23:32 UTC (permalink / raw)
  To: Kanchan Joshi, martin.petersen, axboe, kbusch, hch
  Cc: io-uring, linux-block, anuj1072538, Anuj Gupta

On 2024-03-27 16:30, David Wei wrote:
> On 2024-03-22 11:50, Kanchan Joshi wrote:
>> From: Anuj Gupta <[email protected]>
>>
>> Get rid of the flags field in io_rw. Flags can be set in kiocb->flags
>> during prep rather than doing it while issuing the I/O in io_read/io_write.
>>
>> Signed-off-by: Anuj Gupta <[email protected]>
>> ---
>>  io_uring/rw.c | 22 +++++++++++-----------
>>  1 file changed, 11 insertions(+), 11 deletions(-)
>
> This patch looks fine and is a no-op on its own, but I think there is a
> subtle semantic change. If the rw_flags is invalid (i.e.
> kiocb_set_rw_flags() returns an err) and prep() fails, then the
> remaining submissions won't be submitted unless IORING_SETUP_SUBMIT_ALL
> is set.
>
> Currently if kiocb_set_rw_flags() fails in prep(), only the request will
> fail.

Sorry, that should say fails in _issue()_.

^ permalink raw reply	[flat|nested] 11+ messages in thread
[parent not found: <CGME20240322185734epcas5p2cd407dac97cd157c1833c4022ea84805@epcas5p2.samsung.com>]
* [RFC PATCH 2/4] io_uring/rw: support read/write with metadata
  [not found] ` <CGME20240322185734epcas5p2cd407dac97cd157c1833c4022ea84805@epcas5p2.samsung.com>
@ 2024-03-22 18:50   ` Kanchan Joshi
  0 siblings, 0 replies; 11+ messages in thread
From: Kanchan Joshi @ 2024-03-22 18:50 UTC (permalink / raw)
  To: martin.petersen, axboe, kbusch, hch
  Cc: io-uring, linux-block, anuj1072538, Anuj Gupta, Kanchan Joshi,
	Nitesh Shetty

From: Anuj Gupta <[email protected]>

This patch introduces IORING_OP_READ_META and IORING_OP_WRITE_META
opcodes which allow sending a meta buffer along with read/write. The
application can do that by using the newly added meta_addr and meta_len
fields of the SQE. These opcodes are supported only for direct IO.

Signed-off-by: Anuj Gupta <[email protected]>
Signed-off-by: Kanchan Joshi <[email protected]>
Signed-off-by: Nitesh Shetty <[email protected]>
---
 include/linux/fs.h            |  1 +
 include/uapi/linux/io_uring.h |  6 +++
 io_uring/io_uring.c           |  2 +
 io_uring/opdef.c              | 29 ++++++++++++
 io_uring/rw.c                 | 86 +++++++++++++++++++++++++++++++++--
 io_uring/rw.h                 |  8 ++++
 6 files changed, 129 insertions(+), 3 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 0a22b7245982..c3a483a4fdac 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -327,6 +327,7 @@ struct readahead_control;
 #define IOCB_NOIO		(1 << 20)
 /* can use bio alloc cache */
 #define IOCB_ALLOC_CACHE	(1 << 21)
+#define IOCB_USE_META		(1 << 22)
 /*
  * IOCB_DIO_CALLER_COMP can be set by the iocb owner, to indicate that the
  * iocb completion can be passed back to the owner for execution from a safe
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 7bd10201a02b..87bd44098037 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -97,6 +97,10 @@ struct io_uring_sqe {
 			__u64	addr3;
 			__u64	__pad2[1];
 		};
+		struct {
+			__u64	meta_addr;
+			__u32	meta_len;
+		};
 		__u64	optval;
 		/*
 		 * If the ring is initialized with IORING_SETUP_SQE128, then
@@ -256,6 +260,8 @@ enum io_uring_op {
 	IORING_OP_FUTEX_WAITV,
 	IORING_OP_FIXED_FD_INSTALL,
 	IORING_OP_FTRUNCATE,
+	IORING_OP_READ_META,
+	IORING_OP_WRITE_META,
 
 	/* this goes last, obviously */
 	IORING_OP_LAST,
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 49a124daa359..7c380cac4465 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -4134,7 +4134,9 @@ static int __init io_uring_init(void)
 	BUILD_BUG_SQE_ELEM(44, __u16,  addr_len);
 	BUILD_BUG_SQE_ELEM(46, __u16,  __pad3[0]);
 	BUILD_BUG_SQE_ELEM(48, __u64,  addr3);
+	BUILD_BUG_SQE_ELEM(48, __u64,  meta_addr);
 	BUILD_BUG_SQE_ELEM_SIZE(48, 0, cmd);
+	BUILD_BUG_SQE_ELEM(56, __u32,  meta_len);
 	BUILD_BUG_SQE_ELEM(56, __u64,  __pad2);
 
 	BUILD_BUG_ON(sizeof(struct io_uring_files_update) !=
diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index 9c080aadc5a6..cb31573ac4ad 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -146,6 +146,26 @@ const struct io_issue_def io_issue_defs[] = {
 		.prep			= io_eopnotsupp_prep,
 #endif
 	},
+	[IORING_OP_READ_META] = {
+		.needs_file		= 1,
+		.plug			= 1,
+		.audit_skip		= 1,
+		.ioprio			= 1,
+		.iopoll			= 1,
+		.iopoll_queue		= 1,
+		.prep			= io_prep_rw_meta,
+		.issue			= io_rw_meta,
+	},
+	[IORING_OP_WRITE_META] = {
+		.needs_file		= 1,
+		.plug			= 1,
+		.audit_skip		= 1,
+		.ioprio			= 1,
+		.iopoll			= 1,
+		.iopoll_queue		= 1,
+		.prep			= io_prep_rw_meta,
+		.issue			= io_rw_meta,
+	},
 	[IORING_OP_RECVMSG] = {
 		.needs_file		= 1,
 		.unbound_nonreg_file	= 1,
@@ -501,6 +521,15 @@ const struct io_cold_def io_cold_defs[] = {
 		.cleanup		= io_readv_writev_cleanup,
 		.fail			= io_rw_fail,
 	},
+	[IORING_OP_READ_META] = {
+		.async_size		= sizeof(struct io_async_rw),
+		.name			= "READ_META",
+		.fail			= io_rw_fail,
+	},
+	[IORING_OP_WRITE_META] = {
+		.async_size		= sizeof(struct io_async_rw),
+		.name			= "WRITE_META",
+	},
 	[IORING_OP_FSYNC] = {
 		.name			= "FSYNC",
 	},
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 40f6c2a59928..87a6304052f0 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -27,6 +27,7 @@ struct io_rw {
 	struct kiocb			kiocb;
 	u64				addr;
 	u32				len;
+	u32				meta_len;
 };
 
 static inline bool io_file_supports_nowait(struct io_kiocb *req)
@@ -107,6 +108,22 @@ int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	return 0;
 }
 
+int io_prep_rw_meta(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+{
+	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
+	struct kiocb *kiocb = &rw->kiocb;
+	int ret;
+
+	ret = io_prep_rw(req, sqe);
+	if (unlikely(ret))
+		return ret;
+	kiocb->private = u64_to_user_ptr(READ_ONCE(sqe->meta_addr));
+	rw->meta_len = READ_ONCE(sqe->meta_len);
+
+	kiocb->ki_flags |= IOCB_USE_META;
+	return 0;
+}
+
 int io_prep_rwv(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
 	int ret;
@@ -571,9 +588,18 @@ static void io_req_map_rw(struct io_kiocb *req, const struct iovec *iovec,
 	}
 }
 
+static inline void io_req_map_meta(struct io_async_rw *iorw, struct io_rw_state_meta *sm)
+{
+	memcpy(&iorw->s_meta.iter_meta, &sm->iter_meta, sizeof(struct iov_iter));
+	iov_iter_save_state(&iorw->s_meta.iter_meta, &iorw->s_meta.iter_state_meta);
+}
+
 static int io_setup_async_rw(struct io_kiocb *req, const struct iovec *iovec,
 			     struct io_rw_state *s, bool force)
 {
+	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
+	struct kiocb *kiocb = &rw->kiocb;
+
 	if (!force && !io_cold_defs[req->opcode].prep_async)
 		return 0;
 	/* opcode type doesn't need async data */
@@ -591,6 +617,11 @@ static int io_setup_async_rw(struct io_kiocb *req, const struct iovec *iovec,
 		iorw = req->async_data;
 		/* we've copied and mapped the iter, ensure state is saved */
 		iov_iter_save_state(&iorw->s.iter, &iorw->s.iter_state);
+		if (unlikely(kiocb->ki_flags & IOCB_USE_META)) {
+			struct io_rw_state_meta *sm = kiocb->private;
+
+			io_req_map_meta(iorw, sm);
+		}
 	}
 	return 0;
 }
@@ -747,7 +778,8 @@ static int io_rw_init_file(struct io_kiocb *req, fmode_t mode)
 		if (!(kiocb->ki_flags & IOCB_DIRECT) || !file->f_op->iopoll)
 			return -EOPNOTSUPP;
 
-		kiocb->private = NULL;
+		if (likely(!(kiocb->ki_flags & IOCB_USE_META)))
+			kiocb->private = NULL;
 		kiocb->ki_flags |= IOCB_HIPRI;
 		kiocb->ki_complete = io_complete_rw_iopoll;
 		req->iopoll_completed = 0;
@@ -766,6 +798,7 @@ static int __io_read(struct io_kiocb *req, unsigned int issue_flags)
 	struct io_rw_state __s, *s = &__s;
 	struct iovec *iovec;
 	struct kiocb *kiocb = &rw->kiocb;
+	struct io_rw_state_meta *sm = kiocb->private;
 	bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
 	struct io_async_rw *io;
 	ssize_t ret, ret2;
@@ -840,13 +873,16 @@ static int __io_read(struct io_kiocb *req, unsigned int issue_flags)
 		/* no retry on NONBLOCK nor RWF_NOWAIT */
 		if (req->flags & REQ_F_NOWAIT)
 			goto done;
+		if (kiocb->ki_flags & IOCB_USE_META)
+			kiocb->private = sm;
 		ret = 0;
 	} else if (ret == -EIOCBQUEUED) {
 		if (iovec)
 			kfree(iovec);
 		return IOU_ISSUE_SKIP_COMPLETE;
 	} else if (ret == req->cqe.res || ret <= 0 || !force_nonblock ||
-		   (req->flags & REQ_F_NOWAIT) || !need_complete_io(req)) {
+		   (req->flags & REQ_F_NOWAIT) || !need_complete_io(req) ||
+		   (kiocb->ki_flags & IOCB_USE_META)) {
 		/* read all, failed, already did sync or don't want to retry */
 		goto done;
 	}
@@ -857,6 +893,12 @@ static int __io_read(struct io_kiocb *req, unsigned int issue_flags)
 	 * manually if we need to.
 	 */
 	iov_iter_restore(&s->iter, &s->iter_state);
+	if (unlikely(kiocb->ki_flags & IOCB_USE_META)) {
+		/* don't handle partial completion for read + meta */
+		if (ret > 0)
+			goto done;
+		iov_iter_restore(&sm->iter_meta, &sm->iter_state_meta);
+	}
 
 	ret2 = io_setup_async_rw(req, iovec, s, true);
 	iovec = NULL;
@@ -1070,7 +1112,8 @@ int io_write(struct io_kiocb *req, unsigned int issue_flags)
 		if (ret2 == -EAGAIN && (req->ctx->flags & IORING_SETUP_IOPOLL))
 			goto copy_iov;
 
-		if (ret2 != req->cqe.res && ret2 >= 0 && need_complete_io(req)) {
+		if (ret2 != req->cqe.res && ret2 >= 0 && need_complete_io(req)
+		    && !(kiocb->ki_flags & IOCB_USE_META)) {
 			struct io_async_rw *io;
 
 			trace_io_uring_short_write(req->ctx, kiocb->ki_pos - ret2,
@@ -1111,6 +1154,43 @@ int io_write(struct io_kiocb *req, unsigned int issue_flags)
 	return ret;
 }
 
+int io_rw_meta(struct io_kiocb *req, unsigned int issue_flags)
+{
+	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
+	void __user *meta_addr = u64_to_user_ptr((u64)rw->kiocb.private);
+	struct io_rw_state_meta __sm, *sm = &__sm;
+	struct kiocb *kiocb = &rw->kiocb;
+	int ret;
+
+	if (!(req->file->f_flags & O_DIRECT))
+		return -EOPNOTSUPP;
+	/* prepare iter for meta-buffer */
+	if (!req_has_async_data(req)) {
+		ret = import_ubuf(ITER_SOURCE, meta_addr, rw->meta_len, &sm->iter_meta);
+		iov_iter_save_state(&sm->iter_meta, &sm->iter_state_meta);
+		if (unlikely(ret < 0))
+			return ret;
+	} else {
+		struct io_async_rw *io = req->async_data;
+
+		sm = &io->s_meta;
+		iov_iter_restore(&sm->iter_meta, &sm->iter_state_meta);
+	}
+	/* Store iter for meta-buf in private, will be used later */
+	kiocb->private = sm;
+	if (req->opcode == IORING_OP_READ_META) {
+		ret = __io_read(req, issue_flags);
+		if (ret >= 0)
+			return kiocb_done(req, ret, issue_flags);
+	} else {
+		ret = io_write(req, issue_flags);
+	}
+	if (ret == -EAGAIN)
+		kiocb->private = meta_addr;
+	return ret;
+}
+
 void io_rw_fail(struct io_kiocb *req)
 {
 	int res;
diff --git a/io_uring/rw.h b/io_uring/rw.h
index f9e89b4fe4da..7c12216776bc 100644
--- a/io_uring/rw.h
+++ b/io_uring/rw.h
@@ -8,19 +8,27 @@ struct io_rw_state {
 	struct iovec			fast_iov[UIO_FASTIOV];
 };
 
+struct io_rw_state_meta {
+	struct iov_iter			iter_meta;
+	struct iov_iter_state		iter_state_meta;
+};
+
 struct io_async_rw {
 	struct io_rw_state		s;
+	struct io_rw_state_meta		s_meta;
 	const struct iovec		*free_iovec;
 	size_t				bytes_done;
 	struct wait_page_queue		wpq;
};
 
 int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe);
+int io_prep_rw_meta(struct io_kiocb *req, const struct io_uring_sqe *sqe);
 int io_prep_rwv(struct io_kiocb *req, const struct io_uring_sqe *sqe);
 int io_prep_rw_fixed(struct io_kiocb *req, const struct io_uring_sqe *sqe);
 int io_read(struct io_kiocb *req, unsigned int issue_flags);
 int io_readv_prep_async(struct io_kiocb *req);
 int io_write(struct io_kiocb *req, unsigned int issue_flags);
+int io_rw_meta(struct io_kiocb *req, unsigned int issue_flags);
 int io_writev_prep_async(struct io_kiocb *req);
 void io_readv_writev_cleanup(struct io_kiocb *req);
 void io_rw_fail(struct io_kiocb *req);
-- 
2.25.1

^ permalink raw reply related	[flat|nested] 11+ messages in thread
[parent not found: <CGME20240322185736epcas5p3d0093948e9904e775994bcbe735ea0c5@epcas5p3.samsung.com>]
* [RFC PATCH 3/4] block: modify bio_integrity_map_user to accept iov_iter as argument
  [not found] ` <CGME20240322185736epcas5p3d0093948e9904e775994bcbe735ea0c5@epcas5p3.samsung.com>
@ 2024-03-22 18:50   ` Kanchan Joshi
  0 siblings, 0 replies; 11+ messages in thread
From: Kanchan Joshi @ 2024-03-22 18:50 UTC (permalink / raw)
  To: martin.petersen, axboe, kbusch, hch
  Cc: io-uring, linux-block, anuj1072538, Anuj Gupta, Kanchan Joshi

From: Anuj Gupta <[email protected]>

Refactor bio_integrity_map_user to accept iov_iter as argument.
This is a prep patch.

Signed-off-by: Anuj Gupta <[email protected]>
Signed-off-by: Kanchan Joshi <[email protected]>
---
 block/bio-integrity.c     | 14 ++++++--------
 drivers/nvme/host/ioctl.c | 11 +++++++++--
 include/linux/bio.h       |  7 ++++---
 3 files changed, 19 insertions(+), 13 deletions(-)

diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index 2e3e8e04961e..e340b5a03cdf 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -308,17 +308,16 @@ static unsigned int bvec_from_pages(struct bio_vec *bvec, struct page **pages,
 	return nr_bvecs;
 }
 
-int bio_integrity_map_user(struct bio *bio, void __user *ubuf, ssize_t bytes,
-			   u32 seed)
+int bio_integrity_map_user(struct bio *bio, struct iov_iter *iter,
+			   u32 seed)
 {
 	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 	unsigned int align = q->dma_pad_mask | queue_dma_alignment(q);
 	struct page *stack_pages[UIO_FASTIOV], **pages = stack_pages;
 	struct bio_vec stack_vec[UIO_FASTIOV], *bvec = stack_vec;
+	size_t offset, bytes = iter->count;
 	unsigned int direction, nr_bvecs;
-	struct iov_iter iter;
 	int ret, nr_vecs;
-	size_t offset;
 	bool copy;
 
 	if (bio_integrity(bio))
@@ -331,8 +330,7 @@ int bio_integrity_map_user(struct bio *bio, void __user *ubuf, ssize_t bytes,
 	else
 		direction = ITER_SOURCE;
 
-	iov_iter_ubuf(&iter, direction, ubuf, bytes);
-	nr_vecs = iov_iter_npages(&iter, BIO_MAX_VECS + 1);
+	nr_vecs = iov_iter_npages(iter, BIO_MAX_VECS + 1);
 	if (nr_vecs > BIO_MAX_VECS)
 		return -E2BIG;
 	if (nr_vecs > UIO_FASTIOV) {
@@ -342,8 +340,8 @@ int bio_integrity_map_user(struct bio *bio, void __user *ubuf, ssize_t bytes,
 		pages = NULL;
 	}
 
-	copy = !iov_iter_is_aligned(&iter, align, align);
-	ret = iov_iter_extract_pages(&iter, &pages, bytes, nr_vecs, 0, &offset);
+	copy = !iov_iter_is_aligned(iter, align, align);
+	ret = iov_iter_extract_pages(iter, &pages, bytes, nr_vecs, 0, &offset);
 	if (unlikely(ret < 0))
 		goto free_bvec;
 
diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index 3dfd5ae99ae0..deee13000f08 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -145,8 +145,15 @@ static int nvme_map_user_request(struct request *req, u64 ubuffer,
 	if (bdev) {
 		bio_set_dev(bio, bdev);
 		if (meta_buffer && meta_len) {
-			ret = bio_integrity_map_user(bio, meta_buffer, meta_len,
-						     meta_seed);
+			struct iov_iter iter;
+			unsigned int direction;
+
+			if (bio_data_dir(bio) == READ)
+				direction = ITER_DEST;
+			else
+				direction = ITER_SOURCE;
+			iov_iter_ubuf(&iter, direction, meta_buffer, meta_len);
+			ret = bio_integrity_map_user(bio, &iter, meta_seed);
 			if (ret)
 				goto out_unmap;
 			req->cmd_flags |= REQ_INTEGRITY;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 875d792bffff..20cf47fc851f 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -723,7 +723,7 @@ static inline bool bioset_initialized(struct bio_set *bs)
 	for_each_bio(_bio)						\
 		bip_for_each_vec(_bvl, _bio->bi_integrity, _iter)
 
-int bio_integrity_map_user(struct bio *bio, void __user *ubuf, ssize_t len, u32 seed);
+int bio_integrity_map_user(struct bio *bio, struct iov_iter *iter, u32 seed);
 extern struct bio_integrity_payload *bio_integrity_alloc(struct bio *, gfp_t, unsigned int);
 extern int bio_integrity_add_page(struct bio *, struct page *, unsigned int, unsigned int);
 extern bool bio_integrity_prep(struct bio *);
@@ -795,8 +795,9 @@ static inline int bio_integrity_add_page(struct bio *bio, struct page *page,
 	return 0;
 }
 
-static inline int bio_integrity_map_user(struct bio *bio, void __user *ubuf,
-					 ssize_t len, u32 seed)
+static inline int bio_integrity_map_user(struct bio *bio, struct iov_iter *iter,
+					 u32 seed)
 {
 	return -EINVAL;
 }
-- 
2.25.1

^ permalink raw reply related	[flat|nested] 11+ messages in thread
[parent not found: <CGME20240322185738epcas5p20e5bd448ce83350eb9e79c929c4a9b2b@epcas5p2.samsung.com>]
* [RFC PATCH 4/4] block: add support to pass the meta buffer
  [not found] ` <CGME20240322185738epcas5p20e5bd448ce83350eb9e79c929c4a9b2b@epcas5p2.samsung.com>
@ 2024-03-22 18:50   ` Kanchan Joshi
  0 siblings, 0 replies; 11+ messages in thread
From: Kanchan Joshi @ 2024-03-22 18:50 UTC (permalink / raw)
  To: martin.petersen, axboe, kbusch, hch
  Cc: io-uring, linux-block, anuj1072538, Kanchan Joshi, Anuj Gupta

If IOCB_USE_META flag is set, iocb->private points to iov_iter
corresponding to the meta-buffer. Use it to prepare the bip and attach
that to the bio that we send down. Make sure that block-integrity checks
are skipped for this user-owned meta buffer.

Signed-off-by: Anuj Gupta <[email protected]>
Signed-off-by: Kanchan Joshi <[email protected]>
---
 block/bio-integrity.c | 13 +++++++++++++
 block/fops.c          |  9 +++++++++
 block/t10-pi.c        |  6 ++++++
 include/linux/bio.h   |  6 ++++++
 4 files changed, 34 insertions(+)

diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index e340b5a03cdf..c46b70aff840 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -308,6 +308,19 @@ static unsigned int bvec_from_pages(struct bio_vec *bvec, struct page **pages,
 	return nr_bvecs;
 }
 
+int bio_integrity_map_iter(struct bio *bio, struct iov_iter *iter)
+{
+	struct blk_integrity *bi = blk_get_integrity(bio->bi_bdev->bd_disk);
+
+	if (!bi)
+		return -EINVAL;
+
+	if (iter->count < bio_integrity_bytes(bi, bio_sectors(bio)))
+		return -EINVAL;
+
+	return bio_integrity_map_user(bio, iter, 0);
+}
+
 int bio_integrity_map_user(struct bio *bio, struct iov_iter *iter,
 			   u32 seed)
 {
diff --git a/block/fops.c b/block/fops.c
index 679d9b752fe8..e488fa66dd60 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -353,6 +353,15 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
 		task_io_account_write(bio->bi_iter.bi_size);
 	}
 
+	if (unlikely(iocb->ki_flags & IOCB_USE_META)) {
+		ret = bio_integrity_map_iter(bio, iocb->private);
+		WRITE_ONCE(iocb->private, NULL);
+		if (unlikely(ret)) {
+			bio_put(bio);
+			return ret;
+		}
+	}
+
 	if (iocb->ki_flags & IOCB_NOWAIT)
 		bio->bi_opf |= REQ_NOWAIT;
 
diff --git a/block/t10-pi.c b/block/t10-pi.c
index d90892fd6f2a..72d1522417a1 100644
--- a/block/t10-pi.c
+++ b/block/t10-pi.c
@@ -156,6 +156,8 @@ static void t10_pi_type1_prepare(struct request *rq)
 			/* Already remapped? */
 			if (bip->bip_flags & BIP_MAPPED_INTEGRITY)
 				break;
+			if (bip->bip_flags & BIP_INTEGRITY_USER)
+				break;
 
 			bip_for_each_vec(iv, bip, iter) {
 				unsigned int j;
@@ -205,6 +207,8 @@ static void t10_pi_type1_complete(struct request *rq, unsigned int nr_bytes)
 			struct bio_vec iv;
 			struct bvec_iter iter;
 
+			if (bip->bip_flags & BIP_INTEGRITY_USER)
+				break;
 			bip_for_each_vec(iv, bip, iter) {
 				unsigned int j;
 				void *p;
@@ -408,6 +412,8 @@ static void ext_pi_type1_prepare(struct request *rq)
 			/* Already remapped? */
 			if (bip->bip_flags & BIP_MAPPED_INTEGRITY)
 				break;
+			if (bip->bip_flags & BIP_INTEGRITY_USER)
+				break;
 
 			bip_for_each_vec(iv, bip, iter) {
 				unsigned int j;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 20cf47fc851f..34ea387dfc59 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -723,6 +723,7 @@ static inline bool bioset_initialized(struct bio_set *bs)
 	for_each_bio(_bio)						\
 		bip_for_each_vec(_bvl, _bio->bi_integrity, _iter)
 
+int bio_integrity_map_iter(struct bio *bio, struct iov_iter *iter);
 int bio_integrity_map_user(struct bio *bio, struct iov_iter *iter, u32 seed);
 extern struct bio_integrity_payload *bio_integrity_alloc(struct bio *, gfp_t, unsigned int);
 extern int bio_integrity_add_page(struct bio *, struct page *, unsigned int, unsigned int);
@@ -802,6 +803,11 @@ static inline int bio_integrity_map_user(struct bio *bio, struct iov_iter *iter,
 	return -EINVAL;
 }
 
+static inline int bio_integrity_map_iter(struct bio *bio, struct iov_iter *iter)
+{
+	return -EINVAL;
+}
+
 #endif /* CONFIG_BLK_DEV_INTEGRITY */
 
 /*
-- 
2.25.1

^ permalink raw reply related	[flat|nested] 11+ messages in thread
* Re: [RFC PATCH 0/4] Read/Write with meta buffer
  2024-03-22 18:50 ` [RFC PATCH 0/4] Read/Write with meta buffer Kanchan Joshi
                   ` (3 preceding siblings ...)
  [not found] ` <CGME20240322185738epcas5p20e5bd448ce83350eb9e79c929c4a9b2b@epcas5p2.samsung.com>
@ 2024-03-27 23:38 ` Jens Axboe
  2024-03-28 12:03   ` Kanchan Joshi
  2024-04-06 21:30 ` Pavel Begunkov
  5 siblings, 1 reply; 11+ messages in thread
From: Jens Axboe @ 2024-03-27 23:38 UTC (permalink / raw)
  To: Kanchan Joshi, martin.petersen, kbusch, hch
  Cc: io-uring, linux-block, anuj1072538

On 3/22/24 12:50 PM, Kanchan Joshi wrote:
> This patchset is aimed at getting the feedback on a new io_uring
> interface that userspace can use to exchange meta buffer along with
> read/write.
>
> Two new opcodes for that: IORING_OP_READ_META and IORING_OP_WRITE_META.
> The leftover space in the SQE is used to send meta buffer pointer
> and its length. Patch #2 for this.
>
> The interface is supported for block direct IO. Patch #4 for this.
> Other two are prep patches.
>
> It has been tried not to touch the hot read/write path, as much as
> possible. Performance for non-meta IO is same after the patches [2].
> There is some code in the cold path (worker-based async)
> though.

This patchset should look cleaner if you rebase it on top of the current
for-6.10/io_uring branch, as it gets rid of the async nastiness. Since
that'll need doing anyway, could you repost a v2 where it's rebased on
top of that?

Also in terms of the cover letter, would be good with a bit more of a
description of what this enables. It's a bit scant on detail on what
exactly this gives you.

> taskset -c 2,5 t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n2 -r4 /dev/nvme0n1 /dev/nvme1n1
> submitter=1, tid=2453, file=/dev/nvme1n1, node=-1
> submitter=0, tid=2452, file=/dev/nvme0n1, node=-1
> polled=1, fixedbufs=1, register_files=1, buffered=0, QD=128
> Engine=io_uring, sq_ring=128, cq_ring=128
> IOPS=10.02M, BW=4.89GiB/s, IOS/call=31/31
> IOPS=10.04M, BW=4.90GiB/s, IOS/call=31/31
>
> With this:
> taskset -c 2,5 t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n2 -r4 /dev/nvme0n1 /dev/nvme1n1
> submitter=1, tid=2453, file=/dev/nvme1n1, node=-1
> submitter=0, tid=2452, file=/dev/nvme0n1, node=-1
> polled=1, fixedbufs=1, register_files=1, buffered=0, QD=128
> Engine=io_uring, sq_ring=128, cq_ring=128
> IOPS=10.02M, BW=4.89GiB/s, IOS/call=31/31
> IOPS=10.04M, BW=4.90GiB/s, IOS/call=31/31

Not that I don't believe you, but that looks like you pasted the same
stuff in there twice? It's the exact same perf and pids.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [RFC PATCH 0/4] Read/Write with meta buffer
  2024-03-27 23:38 ` [RFC PATCH 0/4] Read/Write with meta buffer Jens Axboe
@ 2024-03-28 12:03   ` Kanchan Joshi
  0 siblings, 0 replies; 11+ messages in thread
From: Kanchan Joshi @ 2024-03-28 12:03 UTC (permalink / raw)
  To: Jens Axboe, martin.petersen, kbusch, hch
  Cc: io-uring, linux-block, anuj1072538

On 3/28/2024 5:08 AM, Jens Axboe wrote:
> This patchset should look cleaner if you rebase it on top of the current
> for-6.10/io_uring branch, as it gets rid of the async nastiness. Since
> that'll need doing anyway, could you repost a v2 where it's rebased on
> top of that?

Yes, next iteration will use that as the base.

> Also in terms of the cover letter, would be good with a bit more of a
> description of what this enables. It's a bit scant on detail on what
> exactly this gives you.

Will fix that. But currently the only thing it gives is - pass meta
buffer to/from the block-device. It keeps things simple, and fine for
PI type 0 (normal unprotected IO). For other PI types, exposing few
knobs may help. Using "sqe->rw_flags" if there is no other way.

>> taskset -c 2,5 t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n2 -r4 /dev/nvme0n1 /dev/nvme1n1
>> submitter=1, tid=2453, file=/dev/nvme1n1, node=-1
>> submitter=0, tid=2452, file=/dev/nvme0n1, node=-1
>> polled=1, fixedbufs=1, register_files=1, buffered=0, QD=128
>> Engine=io_uring, sq_ring=128, cq_ring=128
>> IOPS=10.02M, BW=4.89GiB/s, IOS/call=31/31
>> IOPS=10.04M, BW=4.90GiB/s, IOS/call=31/31
>>
>> With this:
>> taskset -c 2,5 t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n2 -r4 /dev/nvme0n1 /dev/nvme1n1
>> submitter=1, tid=2453, file=/dev/nvme1n1, node=-1
>> submitter=0, tid=2452, file=/dev/nvme0n1, node=-1
>> polled=1, fixedbufs=1, register_files=1, buffered=0, QD=128
>> Engine=io_uring, sq_ring=128, cq_ring=128
>> IOPS=10.02M, BW=4.89GiB/s, IOS/call=31/31
>> IOPS=10.04M, BW=4.90GiB/s, IOS/call=31/31
>
> Not that I don't believe you, but that looks like you pasted the same
> stuff in there twice? It's the exact same perf and pids.

Indeed :-( Made a goof-up while pasting stuff [1] to the cover letter.

[1]
Before the patch:
# taskset -c 2,5 t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n2 -r4 /dev/nvme0n1 /dev/nvme1n1
submitter=1, tid=2453, file=/dev/nvme1n1, node=-1
submitter=0, tid=2452, file=/dev/nvme0n1, node=-1
polled=1, fixedbufs=1, register_files=1, buffered=0, QD=128
Engine=io_uring, sq_ring=128, cq_ring=128
IOPS=10.02M, BW=4.89GiB/s, IOS/call=31/31
IOPS=10.04M, BW=4.90GiB/s, IOS/call=31/31
IOPS=10.04M, BW=4.90GiB/s, IOS/call=31/31
Exiting on timeout
Maximum IOPS=10.04M

After the patch:
# taskset -c 2,5 t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n2 -r4 /dev/nvme0n1 /dev/nvme1n1
submitter=1, tid=2412, file=/dev/nvme1n1, node=-1
submitter=0, tid=2411, file=/dev/nvme0n1, node=-1
polled=1, fixedbufs=1, register_files=1, buffered=0, QD=128
Engine=io_uring, sq_ring=128, cq_ring=128
IOPS=10.02M, BW=4.89GiB/s, IOS/call=31/31
IOPS=10.03M, BW=4.90GiB/s, IOS/call=31/31
IOPS=10.04M, BW=4.90GiB/s, IOS/call=31/31
Exiting on timeout
Maximum IOPS=10.04M

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [RFC PATCH 0/4] Read/Write with meta buffer
  2024-03-22 18:50 ` [RFC PATCH 0/4] Read/Write with meta buffer Kanchan Joshi
                    ` (4 preceding siblings ...)
  2024-03-27 23:38 ` [RFC PATCH 0/4] Read/Write with " Jens Axboe
@ 2024-04-06 21:30 ` Pavel Begunkov
  2024-04-25 19:05   ` Kanchan Joshi
  5 siblings, 1 reply; 11+ messages in thread
From: Pavel Begunkov @ 2024-04-06 21:30 UTC (permalink / raw)
To: Kanchan Joshi, martin.petersen, axboe, kbusch, hch
Cc: io-uring, linux-block, anuj1072538

On 3/22/24 18:50, Kanchan Joshi wrote:
> This patchset is aimed at getting the feedback on a new io_uring
> interface that userspace can use to exchange meta buffer along with
> read/write.
>
> Two new opcodes for that: IORING_OP_READ_META and IORING_OP_WRITE_META.
> The leftover space in the SQE is used to send meta buffer pointer
> and its length. Patch #2 for this.

I do remember there were back and forth design discussions about that
back when some other guy attempted to implement it, but have you tried
to do it not as a separate opcode? It reads like all read/write opcodes
might benefit from it, and it'd be unfortunate to then be adding
IORING_OP_READ_META_FIXED and multiplicatively all other variants.

> The interface is supported for block direct IO. Patch #4 for this.
> Other two are prep patches.
>
> It has been tried not to touch the hot read/write path, as much as
> possible. Performance for non-meta IO is same after the patches [2].
> There is some code in the cold path (worker-based async)
> though.
>
> Moderately tested by modifying the fio [1] to use this interface
> (only for NVMe block devices)
>
> [1] https://github.com/OpenMPDK/fio/tree/feat/test-meta
>
> [2]
> without this:
>
> taskset -c 2,5 t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n2 -r4 /dev/nvme0n1 /dev/nvme1n1
> submitter=1, tid=2453, file=/dev/nvme1n1, node=-1
> submitter=0, tid=2452, file=/dev/nvme0n1, node=-1
> polled=1, fixedbufs=1, register_files=1, buffered=0, QD=128
> Engine=io_uring, sq_ring=128, cq_ring=128
> IOPS=10.02M, BW=4.89GiB/s, IOS/call=31/31
> IOPS=10.04M, BW=4.90GiB/s, IOS/call=31/31
>
> With this:
> taskset -c 2,5 t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n2 -r4 /dev/nvme0n1 /dev/nvme1n1
> submitter=1, tid=2453, file=/dev/nvme1n1, node=-1
> submitter=0, tid=2452, file=/dev/nvme0n1, node=-1
> polled=1, fixedbufs=1, register_files=1, buffered=0, QD=128
> Engine=io_uring, sq_ring=128, cq_ring=128
> IOPS=10.02M, BW=4.89GiB/s, IOS/call=31/31
> IOPS=10.04M, BW=4.90GiB/s, IOS/call=31/31
>
> Anuj Gupta (3):
>   io_uring/rw: Get rid of flags field in struct io_rw
>   io_uring/rw: support read/write with metadata
>   block: modify bio_integrity_map_user to accept iov_iter as argument
>
> Kanchan Joshi (1):
>   block: add support to pass the meta buffer
>
>  block/bio-integrity.c         |  27 ++++++---
>  block/fops.c                  |   9 +++
>  block/t10-pi.c                |   6 ++
>  drivers/nvme/host/ioctl.c     |  11 +++-
>  include/linux/bio.h           |  13 +++-
>  include/linux/fs.h            |   1 +
>  include/uapi/linux/io_uring.h |   6 ++
>  io_uring/io_uring.c           |   2 +
>  io_uring/opdef.c              |  29 +++++++++
>  io_uring/rw.c                 | 108 +++++++++++++++++++++++++++++-----
>  io_uring/rw.h                 |   8 +++
>  11 files changed, 193 insertions(+), 27 deletions(-)
>
>
> base-commit: 6f0974eccbf78baead1735722c4f1ee3eb9422cd

-- 
Pavel Begunkov
* Re: [RFC PATCH 0/4] Read/Write with meta buffer
  2024-04-06 21:30 ` Pavel Begunkov
@ 2024-04-25 19:05   ` Kanchan Joshi
  0 siblings, 0 replies; 11+ messages in thread
From: Kanchan Joshi @ 2024-04-25 19:05 UTC (permalink / raw)
To: Pavel Begunkov, martin.petersen, axboe, kbusch, hch
Cc: io-uring, linux-block, anuj1072538

On 4/7/2024 3:00 AM, Pavel Begunkov wrote:
> On 3/22/24 18:50, Kanchan Joshi wrote:
>> This patchset is aimed at getting the feedback on a new io_uring
>> interface that userspace can use to exchange meta buffer along with
>> read/write.
>>
>> Two new opcodes for that: IORING_OP_READ_META and IORING_OP_WRITE_META.
>> The leftover space in the SQE is used to send meta buffer pointer
>> and its length. Patch #2 for this.
>
> I do remember there were back and forth design discussions about that
> back when some other guy attempted to implement it, but have you tried
> to do it not as a separate opcode?

Did not try that in the first cut, thinking it would help in not
touching the hot (non-meta) io path. But open to this.

> It reads like all read/write opcodes might benefit from it, and it'd
> be unfortunate to then be adding IORING_OP_READ_META_FIXED and
> multiplicatively all other variants.

Right, that's a good point.
end of thread, other threads:[~2024-04-25 19:05 UTC | newest]

Thread overview: 11+ messages
[not found] <CGME20240322185729epcas5p350c5054b5b519a6aa9d1b35ba3709563@epcas5p3.samsung.com>
2024-03-22 18:50 ` [RFC PATCH 0/4] Read/Write with meta buffer Kanchan Joshi
  [not found] ` <CGME20240322185731epcas5p20fc525f793a537310f7b3ae5ba5bc75b@epcas5p2.samsung.com>
  2024-03-22 18:50   ` [RFC PATCH 1/4] io_uring/rw: Get rid of flags field in struct io_rw Kanchan Joshi
    2024-03-27 23:30     ` David Wei
    2024-03-27 23:32     ` David Wei
  [not found] ` <CGME20240322185734epcas5p2cd407dac97cd157c1833c4022ea84805@epcas5p2.samsung.com>
  2024-03-22 18:50   ` [RFC PATCH 2/4] io_uring/rw: support read/write with metadata Kanchan Joshi
  [not found] ` <CGME20240322185736epcas5p3d0093948e9904e775994bcbe735ea0c5@epcas5p3.samsung.com>
  2024-03-22 18:50   ` [RFC PATCH 3/4] block: modify bio_integrity_map_user to accept iov_iter as argument Kanchan Joshi
  [not found] ` <CGME20240322185738epcas5p20e5bd448ce83350eb9e79c929c4a9b2b@epcas5p2.samsung.com>
  2024-03-22 18:50   ` [RFC PATCH 4/4] block: add support to pass the meta buffer Kanchan Joshi
  2024-03-27 23:38 ` [RFC PATCH 0/4] Read/Write with meta buffer Jens Axboe
    2024-03-28 12:03   ` Kanchan Joshi
  2024-04-06 21:30 ` Pavel Begunkov
    2024-04-25 19:05   ` Kanchan Joshi