* [RFC 0/5] big-cqe based uring-passthru [not found] <CGME20220401110829epcas5p39f3cf4d3f6eb8a5c59794787a2b72b15@epcas5p3.samsung.com> @ 2022-04-01 11:03 ` Kanchan Joshi [not found] ` <CGME20220401110831epcas5p403bacabe8f7e5262356fdc1a2e66df90@epcas5p4.samsung.com> ` (5 more replies) 0 siblings, 6 replies; 28+ messages in thread From: Kanchan Joshi @ 2022-04-01 11:03 UTC (permalink / raw) To: axboe, hch Cc: io-uring, linux-nvme, asml.silence, ming.lei, mcgrof, pankydev8, javier, joshiiitr, anuj20.g So this is a trimmed-down series primarily to test the waters with 32-byte CQE (patch 4). This big-cqe is created by combining two adjacent 16b CQEs. And that is done when ring is setup with IORING_SETUP_CQE32 flag. nvme-passthru sends one result to bigcqe->res and another result (8 bytes) gets updated to bigcqe->res2. As always, fio is modified to test the interface: https://github.com/joshkan/fio/tree/big-cqe Limited testing ATM, as plumbing itself is in early stage with patch 4. Patches are against for-next (linux-block), on top of 9e9d83faa ("io_uring: Remove unneeded test in io_run_task_work_sig") Jens Axboe (3): io_uring: add support for 128-byte SQEs fs: add file_operations->async_cmd() io_uring: add infra and support for IORING_OP_URING_CMD Kanchan Joshi (2): io_uring: add support for big-cqe nvme: wire-up support for async-passthru on char-device. drivers/nvme/host/core.c | 1 + drivers/nvme/host/ioctl.c | 187 ++++++++++++++++++++++++++++------ drivers/nvme/host/multipath.c | 1 + drivers/nvme/host/nvme.h | 3 + fs/io_uring.c | 168 ++++++++++++++++++++++++++---- include/linux/fs.h | 2 + include/linux/io_uring.h | 33 ++++++ include/uapi/linux/io_uring.h | 26 ++++- 8 files changed, 369 insertions(+), 52 deletions(-) -- 2.25.1 ^ permalink raw reply [flat|nested] 28+ messages in thread
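To make the new completion format concrete, here is a small userspace sketch (not part of the posted patches) of reaping one 32-byte completion. It assumes a ring created with IORING_SETUP_SQE128 | IORING_SETUP_CQE32 and mirrors the io_uring_cqe32 layout from patch 4 locally; publishing the updated head back to the ring (with a release store) is left to the caller.

#include <linux/io_uring.h>

/* local mirror of the 32-byte CQE added in patch 4 */
struct big_cqe {
	struct io_uring_cqe cqe;	/* user_data, res, flags */
	__u64 res2;			/* second 8-byte result */
	__u64 unused;
};

/*
 * cqes: the mmap'ed CQ array, entries: p.cq_entries, head/tail: ring indices.
 * With CQE32 each logical entry spans two regular 16-byte slots, so the
 * index is doubled - the same "tail <<= 1" the kernel does in io_get_cqe().
 */
static int reap_one_big_cqe(struct io_uring_cqe *cqes, unsigned entries,
			    unsigned head, unsigned tail, struct big_cqe *out)
{
	struct big_cqe *bcqe;

	if (head == tail)
		return 0;		/* nothing completed yet */

	bcqe = (struct big_cqe *)&cqes[(head & (entries - 1)) << 1];
	*out = *bcqe;			/* cqe.res and res2 hold the two results */
	return 1;
}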
* [RFC 1/5] io_uring: add support for 128-byte SQEs [not found] ` <CGME20220401110831epcas5p403bacabe8f7e5262356fdc1a2e66df90@epcas5p4.samsung.com> @ 2022-04-01 11:03 ` Kanchan Joshi 0 siblings, 0 replies; 28+ messages in thread From: Kanchan Joshi @ 2022-04-01 11:03 UTC (permalink / raw) To: axboe, hch Cc: io-uring, linux-nvme, asml.silence, ming.lei, mcgrof, pankydev8, javier, joshiiitr, anuj20.g From: Jens Axboe <[email protected]> Normal SQEs are 64-bytes in length, which is fine for all the commands we support. However, in preparation for supporting passthrough IO, provide an option for setting up a ring with 128-byte SQEs. We continue to use the same type for io_uring_sqe, it's marked and commented with a zero sized array pad at the end. This provides up to 80 bytes of data for a passthrough command - 64 bytes for the extra added data, and 16 bytes available at the end of the existing SQE. Signed-off-by: Jens Axboe <[email protected]> --- fs/io_uring.c | 13 ++++++++++--- include/uapi/linux/io_uring.h | 7 +++++++ 2 files changed, 17 insertions(+), 3 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index a7412f6862fc..241ba1cd6dcf 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -7431,8 +7431,12 @@ static const struct io_uring_sqe *io_get_sqe(struct io_ring_ctx *ctx) * though the application is the one updating it. */ head = READ_ONCE(ctx->sq_array[sq_idx]); - if (likely(head < ctx->sq_entries)) + if (likely(head < ctx->sq_entries)) { + /* double index for 128-byte SQEs, twice as long */ + if (ctx->flags & IORING_SETUP_SQE128) + head <<= 1; return &ctx->sq_sqes[head]; + } /* drop invalid entries */ ctx->cq_extra--; @@ -10431,7 +10435,10 @@ static __cold int io_allocate_scq_urings(struct io_ring_ctx *ctx, rings->sq_ring_entries = p->sq_entries; rings->cq_ring_entries = p->cq_entries; - size = array_size(sizeof(struct io_uring_sqe), p->sq_entries); + if (p->flags & IORING_SETUP_SQE128) + size = array_size(2 * sizeof(struct io_uring_sqe), p->sq_entries); + else + size = array_size(sizeof(struct io_uring_sqe), p->sq_entries); if (size == SIZE_MAX) { io_mem_free(ctx->rings); ctx->rings = NULL; @@ -10643,7 +10650,7 @@ static long io_uring_setup(u32 entries, struct io_uring_params __user *params) if (p.flags & ~(IORING_SETUP_IOPOLL | IORING_SETUP_SQPOLL | IORING_SETUP_SQ_AFF | IORING_SETUP_CQSIZE | IORING_SETUP_CLAMP | IORING_SETUP_ATTACH_WQ | - IORING_SETUP_R_DISABLED)) + IORING_SETUP_R_DISABLED | IORING_SETUP_SQE128)) return -EINVAL; return io_uring_create(entries, &p, params); diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 787f491f0d2a..c5db68433ca5 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -61,6 +61,12 @@ struct io_uring_sqe { __u32 file_index; }; __u64 __pad2[2]; + + /* + * If the ring is initializefd with IORING_SETUP_SQE128, then this field + * contains 64-bytes of padding, doubling the size of the SQE. + */ + __u64 __big_sqe_pad[0]; }; enum { @@ -101,6 +107,7 @@ enum { #define IORING_SETUP_CLAMP (1U << 4) /* clamp SQ/CQ ring sizes */ #define IORING_SETUP_ATTACH_WQ (1U << 5) /* attach to existing wq */ #define IORING_SETUP_R_DISABLED (1U << 6) /* start with ring disabled */ +#define IORING_SETUP_SQE128 (1U << 7) /* SQEs are 128b */ enum { IORING_OP_NOP, -- 2.25.1 ^ permalink raw reply related [flat|nested] 28+ messages in thread
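For reference, the userspace side of this doubling might look like the sketch below (not from this patch): the SQ array is laid out as sq_entries pairs of 64-byte slots, so the logical index is shifted left by one, mirroring the kernel's head <<= 1 above.

#include <string.h>
#include <linux/io_uring.h>

/* sqes: mmap'ed SQE array sized for 2 * sq_entries regular 64-byte slots */
static struct io_uring_sqe *get_big_sqe(struct io_uring_sqe *sqes,
					unsigned idx, unsigned sq_entries)
{
	struct io_uring_sqe *sqe = &sqes[(idx & (sq_entries - 1)) << 1];

	memset(sqe, 0, 2 * sizeof(*sqe));	/* clear both 64-byte halves */
	return sqe;
}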
* [RFC 2/5] fs: add file_operations->async_cmd() [not found] ` <CGME20220401110833epcas5p18e828a307a646cef5b7aa429be4396e0@epcas5p1.samsung.com> @ 2022-04-01 11:03 ` Kanchan Joshi 2022-04-04 7:09 ` Christoph Hellwig 0 siblings, 1 reply; 28+ messages in thread From: Kanchan Joshi @ 2022-04-01 11:03 UTC (permalink / raw) To: axboe, hch Cc: io-uring, linux-nvme, asml.silence, ming.lei, mcgrof, pankydev8, javier, joshiiitr, anuj20.g From: Jens Axboe <[email protected]> This is a file private handler, similar to ioctls but hopefully a lot more sane and useful. Signed-off-by: Jens Axboe <[email protected]> --- include/linux/fs.h | 2 ++ 1 file changed, 2 insertions(+) diff --git a/include/linux/fs.h b/include/linux/fs.h index e2d892b201b0..a32f83b70435 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1977,6 +1977,7 @@ struct dir_context { #define REMAP_FILE_ADVISORY (REMAP_FILE_CAN_SHORTEN) struct iov_iter; +struct io_uring_cmd; struct file_operations { struct module *owner; @@ -2019,6 +2020,7 @@ struct file_operations { struct file *file_out, loff_t pos_out, loff_t len, unsigned int remap_flags); int (*fadvise)(struct file *, loff_t, loff_t, int); + int (*async_cmd)(struct io_uring_cmd *ioucmd); } __randomize_layout; struct inode_operations { -- 2.25.1 ^ permalink raw reply related [flat|nested] 28+ messages in thread
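For context, a minimal sketch of how a character driver might wire up the new hook (hypothetical my_chr_* names, not part of this series; the -EIOCBQUEUED / io_uring_cmd_done() completion convention only arrives with the next patch):

#include <linux/fs.h>
#include <linux/module.h>
#include <linux/io_uring.h>

static int my_chr_async_cmd(struct io_uring_cmd *ioucmd)
{
	/* decode ioucmd and queue the command for async execution here */
	return -EIOCBQUEUED;	/* complete later via io_uring_cmd_done() */
}

static const struct file_operations my_chr_fops = {
	.owner		= THIS_MODULE,
	.async_cmd	= my_chr_async_cmd,	/* hook added by this patch */
};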
* Re: [RFC 2/5] fs: add file_operations->async_cmd() 2022-04-01 11:03 ` [RFC 2/5] fs: add file_operations->async_cmd() Kanchan Joshi @ 2022-04-04 7:09 ` Christoph Hellwig 0 siblings, 0 replies; 28+ messages in thread From: Christoph Hellwig @ 2022-04-04 7:09 UTC (permalink / raw) To: Kanchan Joshi Cc: axboe, hch, io-uring, linux-nvme, asml.silence, ming.lei, mcgrof, pankydev8, javier, joshiiitr, anuj20.g On Fri, Apr 01, 2022 at 04:33:07PM +0530, Kanchan Joshi wrote: > From: Jens Axboe <[email protected]> > > This is a file private handler, similar to ioctls but hopefully a lot > more sane and useful. Without the next patch this is rather pointless (and confusing), so I'd suggest to move it into that. > int (*fadvise)(struct file *, loff_t, loff_t, int); > + int (*async_cmd)(struct io_uring_cmd *ioucmd); Given that it takes a io_uring_cmd argument I also thnink that the name is a bit misleading. Caling this uring_cmd or io_uring_cmd would be more descriptive. ^ permalink raw reply [flat|nested] 28+ messages in thread
* [RFC 3/5] io_uring: add infra and support for IORING_OP_URING_CMD [not found] ` <CGME20220401110834epcas5p4d1e5e8d1beb1a6205d670bbcb932bf77@epcas5p4.samsung.com> @ 2022-04-01 11:03 ` Kanchan Joshi 2022-04-04 7:16 ` Christoph Hellwig 0 siblings, 1 reply; 28+ messages in thread From: Kanchan Joshi @ 2022-04-01 11:03 UTC (permalink / raw) To: axboe, hch Cc: io-uring, linux-nvme, asml.silence, ming.lei, mcgrof, pankydev8, javier, joshiiitr, anuj20.g From: Jens Axboe <[email protected]> This is a file private kind of request. io_uring doesn't know what's in this command type, it's for the file_operations->async_cmd() handler to deal with. Signed-off-by: Jens Axboe <[email protected]> Signed-off-by: Kanchan Joshi <[email protected]> Signed-off-by: Anuj Gupta <[email protected]> --- fs/io_uring.c | 79 +++++++++++++++++++++++++++++++---- include/linux/io_uring.h | 29 +++++++++++++ include/uapi/linux/io_uring.h | 8 +++- 3 files changed, 108 insertions(+), 8 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index 241ba1cd6dcf..bd0e6b102a7b 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -200,13 +200,6 @@ struct io_rings { struct io_uring_cqe cqes[] ____cacheline_aligned_in_smp; }; -enum io_uring_cmd_flags { - IO_URING_F_COMPLETE_DEFER = 1, - IO_URING_F_UNLOCKED = 2, - /* int's last bit, sign checks are usually faster than a bit test */ - IO_URING_F_NONBLOCK = INT_MIN, -}; - struct io_mapped_ubuf { u64 ubuf; u64 ubuf_end; @@ -860,6 +853,7 @@ struct io_kiocb { struct io_mkdir mkdir; struct io_symlink symlink; struct io_hardlink hardlink; + struct io_uring_cmd uring_cmd; }; u8 opcode; @@ -1110,6 +1104,9 @@ static const struct io_op_def io_op_defs[] = { [IORING_OP_MKDIRAT] = {}, [IORING_OP_SYMLINKAT] = {}, [IORING_OP_LINKAT] = {}, + [IORING_OP_URING_CMD] = { + .needs_file = 1, + }, }; /* requests with any of those set should undergo io_disarm_next() */ @@ -2464,6 +2461,22 @@ static void io_req_task_submit(struct io_kiocb *req, bool *locked) io_req_complete_failed(req, -EFAULT); } +static void io_uring_cmd_work(struct io_kiocb *req, bool *locked) +{ + req->uring_cmd.driver_cb(&req->uring_cmd); +} + +void io_uring_cmd_complete_in_task(struct io_uring_cmd *ioucmd, + void (*driver_cb)(struct io_uring_cmd *)) +{ + struct io_kiocb *req = container_of(ioucmd, struct io_kiocb, uring_cmd); + + req->uring_cmd.driver_cb = driver_cb; + req->io_task_work.func = io_uring_cmd_work; + io_req_task_work_add(req, !!(req->ctx->flags & IORING_SETUP_SQPOLL)); +} +EXPORT_SYMBOL_GPL(io_uring_cmd_complete_in_task); + static void io_req_task_queue_fail(struct io_kiocb *req, int ret) { req->result = ret; @@ -4109,6 +4122,51 @@ static int io_linkat(struct io_kiocb *req, unsigned int issue_flags) return 0; } +/* + * Called by consumers of io_uring_cmd, if they originally returned + * -EIOCBQUEUED upon receiving the command. 
+ */ +void io_uring_cmd_done(struct io_uring_cmd *ioucmd, ssize_t ret) +{ + struct io_kiocb *req = container_of(ioucmd, struct io_kiocb, uring_cmd); + + if (ret < 0) + req_set_fail(req); + io_req_complete(req, ret); +} +EXPORT_SYMBOL_GPL(io_uring_cmd_done); + +static int io_uring_cmd_prep(struct io_kiocb *req, + const struct io_uring_sqe *sqe) +{ + struct io_uring_cmd *ioucmd = &req->uring_cmd; + + if (!req->file->f_op->async_cmd) + return -EOPNOTSUPP; + if (req->ctx->flags & IORING_SETUP_IOPOLL) + return -EOPNOTSUPP; + ioucmd->cmd = (void *) &sqe->cmd; + ioucmd->cmd_op = READ_ONCE(sqe->cmd_op); + ioucmd->cmd_len = READ_ONCE(sqe->cmd_len); + ioucmd->flags = 0; + return 0; +} + +static int io_uring_cmd(struct io_kiocb *req, unsigned int issue_flags) +{ + struct file *file = req->file; + int ret; + struct io_uring_cmd *ioucmd = &req->uring_cmd; + + ioucmd->flags |= issue_flags; + ret = file->f_op->async_cmd(ioucmd); + /* queued async, consumer will call io_uring_cmd_done() when complete */ + if (ret == -EIOCBQUEUED) + return 0; + io_uring_cmd_done(ioucmd, ret); + return 0; +} + static int io_shutdown_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) { @@ -6588,6 +6646,8 @@ static int io_req_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) return io_symlinkat_prep(req, sqe); case IORING_OP_LINKAT: return io_linkat_prep(req, sqe); + case IORING_OP_URING_CMD: + return io_uring_cmd_prep(req, sqe); } printk_once(KERN_WARNING "io_uring: unhandled opcode %d\n", @@ -6871,6 +6931,9 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags) case IORING_OP_LINKAT: ret = io_linkat(req, issue_flags); break; + case IORING_OP_URING_CMD: + ret = io_uring_cmd(req, issue_flags); + break; default: ret = -EINVAL; break; @@ -11215,6 +11278,8 @@ static int __init io_uring_init(void) BUILD_BUG_ON(ARRAY_SIZE(io_op_defs) != IORING_OP_LAST); BUILD_BUG_ON(__REQ_F_LAST_BIT > 8 * sizeof(int)); + BUILD_BUG_ON(sizeof(struct io_uring_cmd) > 64); + req_cachep = KMEM_CACHE(io_kiocb, SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT); return 0; diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h index 649a4d7c241b..cedc68201469 100644 --- a/include/linux/io_uring.h +++ b/include/linux/io_uring.h @@ -5,7 +5,29 @@ #include <linux/sched.h> #include <linux/xarray.h> +enum io_uring_cmd_flags { + IO_URING_F_COMPLETE_DEFER = 1, + IO_URING_F_UNLOCKED = 2, + /* int's last bit, sign checks are usually faster than a bit test */ + IO_URING_F_NONBLOCK = INT_MIN, +}; + +struct io_uring_cmd { + struct file *file; + void *cmd; + /* for irq-completion - if driver requires doing stuff in task-context*/ + void (*driver_cb)(struct io_uring_cmd *cmd); + u32 flags; + u32 cmd_op; + u16 cmd_len; + u16 unused; + u8 pdu[28]; /* available inline for free use */ +}; + #if defined(CONFIG_IO_URING) +void io_uring_cmd_done(struct io_uring_cmd *cmd, ssize_t ret); +void io_uring_cmd_complete_in_task(struct io_uring_cmd *ioucmd, + void (*driver_cb)(struct io_uring_cmd *)); struct sock *io_uring_get_socket(struct file *file); void __io_uring_cancel(bool cancel_all); void __io_uring_free(struct task_struct *tsk); @@ -26,6 +48,13 @@ static inline void io_uring_free(struct task_struct *tsk) __io_uring_free(tsk); } #else +static inline void io_uring_cmd_done(struct io_uring_cmd *cmd, ssize_t ret) +{ +} +static inline void io_uring_cmd_complete_in_task(struct io_uring_cmd *ioucmd, + void (*driver_cb)(struct io_uring_cmd *)) +{ +} static inline struct sock *io_uring_get_socket(struct file *file) { return NULL; diff --git 
a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index c5db68433ca5..d7a4bdb9bf3b 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -22,10 +22,12 @@ struct io_uring_sqe { union { __u64 off; /* offset into file */ __u64 addr2; + __u32 cmd_op; }; union { __u64 addr; /* pointer to buffer or iovecs */ __u64 splice_off_in; + __u16 cmd_len; }; __u32 len; /* buffer size or number of iovecs */ union { @@ -60,7 +62,10 @@ struct io_uring_sqe { __s32 splice_fd_in; __u32 file_index; }; - __u64 __pad2[2]; + union { + __u64 __pad2[2]; + __u64 cmd; + }; /* * If the ring is initializefd with IORING_SETUP_SQE128, then this field @@ -150,6 +155,7 @@ enum { IORING_OP_MKDIRAT, IORING_OP_SYMLINKAT, IORING_OP_LINKAT, + IORING_OP_URING_CMD, /* this goes last, obviously */ IORING_OP_LAST, -- 2.25.1 ^ permalink raw reply related [flat|nested] 28+ messages in thread
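On the submission side, filling such a request from userspace might look like the sketch below (an illustration against this RFC's sqe field names, not code from the series): the payload is copied inline starting at sqe->cmd, i.e. the last 16 bytes of the base SQE plus the 64-byte extension from IORING_SETUP_SQE128.

#include <string.h>
#include <linux/io_uring.h>

static void prep_uring_cmd(struct io_uring_sqe *sqe, int fd,
			   __u32 cmd_op, const void *cmd, __u16 cmd_len)
{
	memset(sqe, 0, 2 * sizeof(*sqe));	/* big SQE: two 64-byte slots */
	sqe->opcode  = IORING_OP_URING_CMD;
	sqe->fd      = fd;
	sqe->cmd_op  = cmd_op;			/* e.g. NVME_IOCTL_IO64_CMD in patch 5 */
	sqe->cmd_len = cmd_len;			/* bounded by the 80 inline bytes */
	memcpy((void *)&sqe->cmd, cmd, cmd_len);
}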
* Re: [RFC 3/5] io_uring: add infra and support for IORING_OP_URING_CMD 2022-04-01 11:03 ` [RFC 3/5] io_uring: add infra and support for IORING_OP_URING_CMD Kanchan Joshi @ 2022-04-04 7:16 ` Christoph Hellwig 2022-04-04 8:20 ` Pavel Begunkov 2022-04-04 15:14 ` Kanchan Joshi 0 siblings, 2 replies; 28+ messages in thread From: Christoph Hellwig @ 2022-04-04 7:16 UTC (permalink / raw) To: Kanchan Joshi Cc: axboe, hch, io-uring, linux-nvme, asml.silence, ming.lei, mcgrof, pankydev8, javier, joshiiitr, anuj20.g Cann we plese spell out instastructure here? Or did you mean infraread anyway :) > -enum io_uring_cmd_flags { > - IO_URING_F_COMPLETE_DEFER = 1, > - IO_URING_F_UNLOCKED = 2, > - /* int's last bit, sign checks are usually faster than a bit test */ > - IO_URING_F_NONBLOCK = INT_MIN, > -}; This doesn't actually get used anywhere outside of io_uring.c, so why move it? > +static void io_uring_cmd_work(struct io_kiocb *req, bool *locked) > +{ > + req->uring_cmd.driver_cb(&req->uring_cmd); > +} > + > +void io_uring_cmd_complete_in_task(struct io_uring_cmd *ioucmd, > + void (*driver_cb)(struct io_uring_cmd *)) > +{ > + struct io_kiocb *req = container_of(ioucmd, struct io_kiocb, uring_cmd); > + > + req->uring_cmd.driver_cb = driver_cb; > + req->io_task_work.func = io_uring_cmd_work; > + io_req_task_work_add(req, !!(req->ctx->flags & IORING_SETUP_SQPOLL)); > +} > +EXPORT_SYMBOL_GPL(io_uring_cmd_complete_in_task); I'm still not a fund of the double indirect call here. I don't really have a good idea yet, but I plan to look into it. > static void io_req_task_queue_fail(struct io_kiocb *req, int ret) Also it would be great to not add it between io_req_task_queue_fail and the callback set by it. > +void io_uring_cmd_done(struct io_uring_cmd *ioucmd, ssize_t ret) > +{ > + struct io_kiocb *req = container_of(ioucmd, struct io_kiocb, uring_cmd); > + > + if (ret < 0) > + req_set_fail(req); > + io_req_complete(req, ret); > +} > +EXPORT_SYMBOL_GPL(io_uring_cmd_done); It seems like all callers of io_req_complete actually call req_set_fail on failure. So maybe it would be nice pre-cleanup to handle the req_set_fail call from ĩo_req_complete? > + /* queued async, consumer will call io_uring_cmd_done() when complete */ > + if (ret == -EIOCBQUEUED) > + return 0; > + io_uring_cmd_done(ioucmd, ret); Why not: if (ret != -EIOCBQUEUED) io_uring_cmd_done(ioucmd, ret); return 0; That being said I wonder why not just remove the retun value from ->async_cmd entirely and just require the implementation to always call io_uring_cmd_done? That removes the confusion on who needs to call it entirely, similarly to what we do in the block layer for ->submit_bio. > +struct io_uring_cmd { > + struct file *file; > + void *cmd; > + /* for irq-completion - if driver requires doing stuff in task-context*/ > + void (*driver_cb)(struct io_uring_cmd *cmd); > + u32 flags; > + u32 cmd_op; > + u16 cmd_len; The cmd_len field does not seem to actually be used anywhere. > +++ b/include/uapi/linux/io_uring.h > @@ -22,10 +22,12 @@ struct io_uring_sqe { > union { > __u64 off; /* offset into file */ > __u64 addr2; > + __u32 cmd_op; > }; > union { > __u64 addr; /* pointer to buffer or iovecs */ > __u64 splice_off_in; > + __u16 cmd_len; > }; > __u32 len; /* buffer size or number of iovecs */ > union { > @@ -60,7 +62,10 @@ struct io_uring_sqe { > __s32 splice_fd_in; > __u32 file_index; > }; > - __u64 __pad2[2]; > + union { > + __u64 __pad2[2]; > + __u64 cmd; > + }; Can someone explain these changes to me a little more? 
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC 3/5] io_uring: add infra and support for IORING_OP_URING_CMD 2022-04-04 7:16 ` Christoph Hellwig @ 2022-04-04 8:20 ` Pavel Begunkov 2022-04-05 5:58 ` Christoph Hellwig 2022-04-04 15:14 ` Kanchan Joshi 1 sibling, 1 reply; 28+ messages in thread From: Pavel Begunkov @ 2022-04-04 8:20 UTC (permalink / raw) To: Christoph Hellwig, Kanchan Joshi Cc: axboe, io-uring, linux-nvme, ming.lei, mcgrof, pankydev8, javier, joshiiitr, anuj20.g On 4/4/22 08:16, Christoph Hellwig wrote: [...] >> +static void io_uring_cmd_work(struct io_kiocb *req, bool *locked) >> +{ >> + req->uring_cmd.driver_cb(&req->uring_cmd); >> +} >> + >> +void io_uring_cmd_complete_in_task(struct io_uring_cmd *ioucmd, >> + void (*driver_cb)(struct io_uring_cmd *)) >> +{ >> + struct io_kiocb *req = container_of(ioucmd, struct io_kiocb, uring_cmd); >> + >> + req->uring_cmd.driver_cb = driver_cb; >> + req->io_task_work.func = io_uring_cmd_work; >> + io_req_task_work_add(req, !!(req->ctx->flags & IORING_SETUP_SQPOLL)); >> +} >> +EXPORT_SYMBOL_GPL(io_uring_cmd_complete_in_task); > > I'm still not a fund of the double indirect call here. I don't really > have a good idea yet, but I plan to look into it. I haven't familiarised myself with the series properly, but if it's about driver_cb, we can expose struct io_kiocb and io_req_task_work_add() so the lower layers can implement their own io_task_work.func. Hopefully, it won't be inventively abused... # io_uring.h static inline struct io_uring_cmd *io_req_to_ucmd(struct io_kiocb *req) { return container_of(); } typedef void (*io_tw_cb_t)(struct io_kiocb *req, bool *locked); static inline void io_cmd_tw_add(struct io_uring_cmd *ioucmd, io_tw_cb_t foo) { struct io_kiocb *req = container_of(ioucmb...); req->io_task_work.func = foo; io_req_task_work_add(); } >> static void io_req_task_queue_fail(struct io_kiocb *req, int ret) > > Also it would be great to not add it between io_req_task_queue_fail and > the callback set by it. > >> +void io_uring_cmd_done(struct io_uring_cmd *ioucmd, ssize_t ret) >> +{ >> + struct io_kiocb *req = container_of(ioucmd, struct io_kiocb, uring_cmd); >> + >> + if (ret < 0) >> + req_set_fail(req); >> + io_req_complete(req, ret); >> +} >> +EXPORT_SYMBOL_GPL(io_uring_cmd_done); > > It seems like all callers of io_req_complete actually call req_set_fail > on failure. So maybe it would be nice pre-cleanup to handle the > req_set_fail call from ĩo_req_complete? Interpretation of the result is different, e.g. io_tee(), that was the reason it was left in the callers. [...] >> @@ -60,7 +62,10 @@ struct io_uring_sqe { >> __s32 splice_fd_in; >> __u32 file_index; >> }; >> - __u64 __pad2[2]; >> + union { >> + __u64 __pad2[2]; >> + __u64 cmd; >> + }; > > Can someone explain these changes to me a little more? not required indeed, just - __u64 __pad2[2]; + __u64 cmd; + __u64 __pad2; -- Pavel Begunkov ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC 3/5] io_uring: add infra and support for IORING_OP_URING_CMD 2022-04-04 8:20 ` Pavel Begunkov @ 2022-04-05 5:58 ` Christoph Hellwig 2022-04-06 6:37 ` Kanchan Joshi 0 siblings, 1 reply; 28+ messages in thread From: Christoph Hellwig @ 2022-04-05 5:58 UTC (permalink / raw) To: Pavel Begunkov Cc: Christoph Hellwig, Kanchan Joshi, axboe, io-uring, linux-nvme, ming.lei, mcgrof, pankydev8, javier, joshiiitr, anuj20.g On Mon, Apr 04, 2022 at 09:20:00AM +0100, Pavel Begunkov wrote: >> I'm still not a fund of the double indirect call here. I don't really >> have a good idea yet, but I plan to look into it. > > I haven't familiarised myself with the series properly, but if it's about > driver_cb, we can expose struct io_kiocb and io_req_task_work_add() so > the lower layers can implement their own io_task_work.func. Hopefully, it > won't be inventively abused... If we move io_kiocb out avoiding one indirection would be very easy indeed. But I think that just invites abuse. Note that we also have at least one and potentially more indirections in this path. The request rq_end_io handler is a guranteed one, and the IPI or softirq for the request indirectin is another one. So my plan was to look into having an io_uring specific hook in the core block code to deliver completions directly to the right I/O uring thread. In the best case that should allow us to do a single indirect call for the completion instead of 4 and a pointless IPI/softirq. >>> + struct io_kiocb *req = container_of(ioucmd, struct io_kiocb, uring_cmd); >>> + >>> + if (ret < 0) >>> + req_set_fail(req); >>> + io_req_complete(req, ret); >>> +} >>> +EXPORT_SYMBOL_GPL(io_uring_cmd_done); >> >> It seems like all callers of io_req_complete actually call req_set_fail >> on failure. So maybe it would be nice pre-cleanup to handle the >> req_set_fail call from ĩo_req_complete? > > Interpretation of the result is different, e.g. io_tee(), that was the > reason it was left in the callers. Yes, there is about two of them that would then need to be open coded using __io_req_complete. > > [...] >>> @@ -60,7 +62,10 @@ struct io_uring_sqe { >>> __s32 splice_fd_in; >>> __u32 file_index; >>> }; >>> - __u64 __pad2[2]; >>> + union { >>> + __u64 __pad2[2]; >>> + __u64 cmd; >>> + }; >> >> Can someone explain these changes to me a little more? > > not required indeed, just > > - __u64 __pad2[2]; > + __u64 cmd; > + __u64 __pad2; Do we still want a union for cmd and document it to say what opcode it is for? ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC 3/5] io_uring: add infra and support for IORING_OP_URING_CMD 2022-04-05 5:58 ` Christoph Hellwig @ 2022-04-06 6:37 ` Kanchan Joshi 0 siblings, 0 replies; 28+ messages in thread From: Kanchan Joshi @ 2022-04-06 6:37 UTC (permalink / raw) To: Christoph Hellwig Cc: Pavel Begunkov, axboe, io-uring, linux-nvme, ming.lei, mcgrof, pankydev8, javier, joshiiitr, anuj20.g [-- Attachment #1: Type: text/plain, Size: 5689 bytes --] >>>> + struct io_kiocb *req = container_of(ioucmd, struct io_kiocb, uring_cmd); >>>> + >>>> + if (ret < 0) >>>> + req_set_fail(req); >>>> + io_req_complete(req, ret); >>>> +} >>>> +EXPORT_SYMBOL_GPL(io_uring_cmd_done); >>> >>> It seems like all callers of io_req_complete actually call req_set_fail >>> on failure. So maybe it would be nice pre-cleanup to handle the >>> req_set_fail call from ĩo_req_complete? >> >> Interpretation of the result is different, e.g. io_tee(), that was the >> reason it was left in the callers. > >Yes, there is about two of them that would then need to be open coded >using __io_req_complete. And this is how it looks. Pavel, Jens: would you prefer this as independent patch? commit 2be578326b80f7a9e603b2a3224644b0cb620e25 Author: Kanchan Joshi <[email protected]> Date: Wed Apr 6 11:22:07 2022 +0530 io_uring: cleanup error handling Move common error handling to io_req_complete, so that various callers avoid repeating that. Callers requiring different handling still keep that outside, and call __io_req_complete instead. Suggested-by: Christoph Hellwig <[email protected]> Signed-off-by: Kanchan Joshi <[email protected]> diff --git a/fs/io_uring.c b/fs/io_uring.c index 7445084e48ce..7df465bd489a 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2162,6 +2162,8 @@ static inline void __io_req_complete(struct io_kiocb *req, unsigned issue_flags, static inline void io_req_complete(struct io_kiocb *req, s32 res) { + if (res < 0) + req_set_fail(req); __io_req_complete(req, 0, res, 0); } @@ -4100,8 +4102,6 @@ static int io_renameat(struct io_kiocb *req, unsigned int issue_flags) ren->newpath, ren->flags); req->flags &= ~REQ_F_NEED_CLEANUP; - if (ret < 0) - req_set_fail(req); io_req_complete(req, ret); return 0; } @@ -4122,8 +4122,6 @@ static void io_xattr_finish(struct io_kiocb *req, int ret) req->flags &= ~REQ_F_NEED_CLEANUP; __io_xattr_finish(req); - if (ret < 0) - req_set_fail(req); io_req_complete(req, ret); } @@ -4443,8 +4441,6 @@ static int io_mkdirat(struct io_kiocb *req, unsigned int issue_flags) ret = do_mkdirat(mkd->dfd, mkd->filename, mkd->mode); req->flags &= ~REQ_F_NEED_CLEANUP; - if (ret < 0) - req_set_fail(req); io_req_complete(req, ret); return 0; } @@ -4492,8 +4488,6 @@ static int io_symlinkat(struct io_kiocb *req, unsigned int issue_flags) ret = do_symlinkat(sl->oldpath, sl->new_dfd, sl->newpath); req->flags &= ~REQ_F_NEED_CLEANUP; - if (ret < 0) - req_set_fail(req); io_req_complete(req, ret); return 0; } @@ -4543,8 +4537,6 @@ static int io_linkat(struct io_kiocb *req, unsigned int issue_flags) lnk->newpath, lnk->flags); req->flags &= ~REQ_F_NEED_CLEANUP; - if (ret < 0) - req_set_fail(req); io_req_complete(req, ret); return 0; } @@ -4580,8 +4572,6 @@ static int io_shutdown(struct io_kiocb *req, unsigned int issue_flags) return -ENOTSOCK; ret = __sys_shutdown_sock(sock, req->shutdown.how); - if (ret < 0) - req_set_fail(req); io_req_complete(req, ret); return 0; #else @@ -4641,7 +4631,7 @@ static int io_tee(struct io_kiocb *req, unsigned int issue_flags) done: if (ret != sp->len) req_set_fail(req); - io_req_complete(req, ret); + 
__io_req_complete(req, 0, ret, 0); return 0; } @@ -4685,7 +4675,7 @@ static int io_splice(struct io_kiocb *req, unsigned int issue_flags) done: if (ret != sp->len) req_set_fail(req); - io_req_complete(req, ret); + __io_req_complete(req, 0, ret, 0); return 0; } @@ -4777,8 +4767,6 @@ static int io_fsync(struct io_kiocb *req, unsigned int issue_flags) ret = vfs_fsync_range(req->file, req->sync.off, end > 0 ? end : LLONG_MAX, req->sync.flags & IORING_FSYNC_DATASYNC); - if (ret < 0) - req_set_fail(req); io_req_complete(req, ret); return 0; } @@ -4807,9 +4795,7 @@ static int io_fallocate(struct io_kiocb *req, unsigned int issue_flags) return -EAGAIN; ret = vfs_fallocate(req->file, req->sync.mode, req->sync.off, req->sync.len); - if (ret < 0) - req_set_fail(req); - else + if (ret >= 0) fsnotify_modify(req->file); io_req_complete(req, ret); return 0; @@ -5221,8 +5207,6 @@ static int io_madvise(struct io_kiocb *req, unsigned int issue_flags) return -EAGAIN; ret = do_madvise(current->mm, ma->addr, ma->len, ma->advice); - if (ret < 0) - req_set_fail(req); io_req_complete(req, ret); return 0; #else @@ -5309,8 +5293,6 @@ static int io_statx(struct io_kiocb *req, unsigned int issue_flags) ret = do_statx(ctx->dfd, ctx->filename, ctx->flags, ctx->mask, ctx->buffer); - if (ret < 0) - req_set_fail(req); io_req_complete(req, ret); return 0; } @@ -5410,8 +5392,6 @@ static int io_sync_file_range(struct io_kiocb *req, unsigned int issue_flags) ret = sync_file_range(req->file, req->sync.off, req->sync.len, req->sync.flags); - if (ret < 0) - req_set_fail(req); io_req_complete(req, ret); return 0; [-- Attachment #2: Type: text/plain, Size: 0 bytes --] ^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [RFC 3/5] io_uring: add infra and support for IORING_OP_URING_CMD 2022-04-04 7:16 ` Christoph Hellwig 2022-04-04 8:20 ` Pavel Begunkov @ 2022-04-04 15:14 ` Kanchan Joshi 2022-04-05 6:00 ` Christoph Hellwig 1 sibling, 1 reply; 28+ messages in thread From: Kanchan Joshi @ 2022-04-04 15:14 UTC (permalink / raw) To: Christoph Hellwig Cc: Kanchan Joshi, Jens Axboe, io-uring, linux-nvme, Pavel Begunkov, Ming Lei, Luis Chamberlain, Pankaj Raghav, Javier González, Anuj Gupta On Mon, Apr 4, 2022 at 12:46 PM Christoph Hellwig <[email protected]> wrote: > > Cann we plese spell out instastructure here? Or did you mean > infraread anyway :) :-) sure, I see the problem with this shorthand now. > > -enum io_uring_cmd_flags { > > - IO_URING_F_COMPLETE_DEFER = 1, > > - IO_URING_F_UNLOCKED = 2, > > - /* int's last bit, sign checks are usually faster than a bit test */ > > - IO_URING_F_NONBLOCK = INT_MIN, > > -}; > > This doesn't actually get used anywhere outside of io_uring.c, so why > move it? This got missed out. We are going to use it for things like this in nvme (from past series): + if (ioucmd && (ioucmd->flags & IO_URING_F_NONBLOCK)) { + rq_flags |= REQ_NOWAIT; + blk_flags |= BLK_MQ_REQ_NOWAIT; + } + req = nvme_alloc_request(q, cmd, blk_flags, rq_flags); Also for polling too. Will fix this up in the next version. > > +static void io_uring_cmd_work(struct io_kiocb *req, bool *locked) > > +{ > > + req->uring_cmd.driver_cb(&req->uring_cmd); > > +} > > + > > +void io_uring_cmd_complete_in_task(struct io_uring_cmd *ioucmd, > > + void (*driver_cb)(struct io_uring_cmd *)) > > +{ > > + struct io_kiocb *req = container_of(ioucmd, struct io_kiocb, uring_cmd); > > + > > + req->uring_cmd.driver_cb = driver_cb; > > + req->io_task_work.func = io_uring_cmd_work; > > + io_req_task_work_add(req, !!(req->ctx->flags & IORING_SETUP_SQPOLL)); > > +} > > +EXPORT_SYMBOL_GPL(io_uring_cmd_complete_in_task); > > I'm still not a fund of the double indirect call here. I don't really > have a good idea yet, but I plan to look into it. > > > static void io_req_task_queue_fail(struct io_kiocb *req, int ret) > > Also it would be great to not add it between io_req_task_queue_fail and > the callback set by it. Right. Will change. > > +void io_uring_cmd_done(struct io_uring_cmd *ioucmd, ssize_t ret) > > +{ > > + struct io_kiocb *req = container_of(ioucmd, struct io_kiocb, uring_cmd); > > + > > + if (ret < 0) > > + req_set_fail(req); > > + io_req_complete(req, ret); > > +} > > +EXPORT_SYMBOL_GPL(io_uring_cmd_done); > > It seems like all callers of io_req_complete actually call req_set_fail > on failure. So maybe it would be nice pre-cleanup to handle the > req_set_fail call from ĩo_req_complete? io_tee and io_splice do slightly different handling. If we take this route, it seems they can use __io_req_complete() instead. > > + /* queued async, consumer will call io_uring_cmd_done() when complete */ > > + if (ret == -EIOCBQUEUED) > > + return 0; > > + io_uring_cmd_done(ioucmd, ret); > > Why not: > if (ret != -EIOCBQUEUED) > io_uring_cmd_done(ioucmd, ret); > return 0; > > That being said I wonder why not just remove the retun value from > ->async_cmd entirely and just require the implementation to always call > io_uring_cmd_done? That removes the confusion on who needs to call it > entirely, similarly to what we do in the block layer for ->submit_bio. Right, this seems doable at this point as we do not do any special handling (in io_uring) on the return code. 
> > +struct io_uring_cmd { > > + struct file *file; > > + void *cmd; > > + /* for irq-completion - if driver requires doing stuff in task-context*/ > > + void (*driver_cb)(struct io_uring_cmd *cmd); > > + u32 flags; > > + u32 cmd_op; > > + u16 cmd_len; > > The cmd_len field does not seem to actually be used anywhere. Another stuff that got left out from the previous series :-( Using this field for a bit of sanity checking at the moment. Like this in nvme: + if (ioucmd->cmd_len != sizeof(struct nvme_passthru_cmd64)) + return -EINVAL; + cptr = (struct nvme_passthru_cmd64 *)ioucmd->cmd; > > +++ b/include/uapi/linux/io_uring.h > > @@ -22,10 +22,12 @@ struct io_uring_sqe { > > union { > > __u64 off; /* offset into file */ > > __u64 addr2; > > + __u32 cmd_op; > > }; > > union { > > __u64 addr; /* pointer to buffer or iovecs */ > > __u64 splice_off_in; > > + __u16 cmd_len; > > }; > > __u32 len; /* buffer size or number of iovecs */ > > union { > > @@ -60,7 +62,10 @@ struct io_uring_sqe { > > __s32 splice_fd_in; > > __u32 file_index; > > }; > > - __u64 __pad2[2]; > > + union { > > + __u64 __pad2[2]; > > + __u64 cmd; > > + }; > > Can someone explain these changes to me a little more? All these three (cmd_op, cmd_len and cmd) are operation specific fields. user-space supplies these into SQE, io_uring packages those into "struct io_uring_cmd" and pass that down to the provider for doing the real processing. 1. cmd_op = operation code for async cmd (e.g. passthru ioctl opcode or whatever else we want to turn async from user-space) 2. cmd_len = actual length of async command (e.g. we have max 80 bytes with big-sqe and for nvme-passthru we use the max, but for some other usecase one can go with smaller length) 3. cmd = this is the starting-offset where async-command is placed (by user-space) inside the big-sqe. 16 bytes in first-sqe, and next 64 bytes comes from second-sqe. And if we were doing pointer-based command submission, this "cmd" would have pointed to user-space command (of whatever length). Just to put the entire thought-process across; I understand that we are not taking that route. ^ permalink raw reply [flat|nested] 28+ messages in thread
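A quick userspace layout check makes the above concrete (a sketch against this RFC's uapi with patches 1 and 3 applied): the inline command area really is 80 bytes, the last 16 bytes of the base SQE plus the 64-byte SQE128 extension.

#include <stddef.h>
#include <assert.h>
#include <linux/io_uring.h>

static_assert(offsetof(struct io_uring_sqe, cmd) == 48,
	      "cmd overlays __pad2 at offset 48");
static_assert(2 * sizeof(struct io_uring_sqe) -
	      offsetof(struct io_uring_sqe, cmd) == 80,
	      "16 bytes of the first SQE plus 64 bytes of the second");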
* Re: [RFC 3/5] io_uring: add infra and support for IORING_OP_URING_CMD 2022-04-04 15:14 ` Kanchan Joshi @ 2022-04-05 6:00 ` Christoph Hellwig 2022-04-05 16:27 ` Kanchan Joshi 0 siblings, 1 reply; 28+ messages in thread From: Christoph Hellwig @ 2022-04-05 6:00 UTC (permalink / raw) To: Kanchan Joshi Cc: Christoph Hellwig, Kanchan Joshi, Jens Axboe, io-uring, linux-nvme, Pavel Begunkov, Ming Lei, Luis Chamberlain, Pankaj Raghav, Javier González, Anuj Gupta On Mon, Apr 04, 2022 at 08:44:20PM +0530, Kanchan Joshi wrote: > Another stuff that got left out from the previous series :-( > Using this field for a bit of sanity checking at the moment. Like this in nvme: > > + if (ioucmd->cmd_len != sizeof(struct nvme_passthru_cmd64)) > + return -EINVAL; > + cptr = (struct nvme_passthru_cmd64 *)ioucmd->cmd; Do we actually need that sanity checking? Each command should have a known length bound by the SQE size, right? ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC 3/5] io_uring: add infra and support for IORING_OP_URING_CMD 2022-04-05 6:00 ` Christoph Hellwig @ 2022-04-05 16:27 ` Kanchan Joshi 0 siblings, 0 replies; 28+ messages in thread From: Kanchan Joshi @ 2022-04-05 16:27 UTC (permalink / raw) To: Christoph Hellwig Cc: Kanchan Joshi, Jens Axboe, io-uring, linux-nvme, Pavel Begunkov, Ming Lei, Luis Chamberlain, Pankaj Raghav, Javier González, Anuj Gupta On Tue, Apr 5, 2022 at 11:30 AM Christoph Hellwig <[email protected]> wrote: > > On Mon, Apr 04, 2022 at 08:44:20PM +0530, Kanchan Joshi wrote: > > Another stuff that got left out from the previous series :-( > > Using this field for a bit of sanity checking at the moment. Like this in nvme: > > > > + if (ioucmd->cmd_len != sizeof(struct nvme_passthru_cmd64)) > > + return -EINVAL; > > + cptr = (struct nvme_passthru_cmd64 *)ioucmd->cmd; > > Do we actually need that sanity checking? Each command should have > a known length bound by the SQE size, right? Right, and that check can go in io_uring without needing this field (as we keep cmd_len in sqe already). Will remove this from io_uring_cmd struct. ^ permalink raw reply [flat|nested] 28+ messages in thread
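If it helps to visualize, one way that prep-time bound check could look (a sketch against this RFC's fields, with a hypothetical helper name; not code from the posted patches):

static int io_uring_check_cmd_len(const struct io_uring_sqe *sqe)
{
	/* inline payload area: rest of the SQE after 'cmd' plus the SQE128 pad */
	size_t max_len = 2 * sizeof(*sqe) - offsetof(struct io_uring_sqe, cmd);

	if (READ_ONCE(sqe->cmd_len) > max_len)
		return -EINVAL;
	return 0;
}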
* [RFC 4/5] io_uring: add support for big-cqe [not found] ` <CGME20220401110836epcas5p37bd59ab5a48cf77ca3ac05052a164b0b@epcas5p3.samsung.com> @ 2022-04-01 11:03 ` Kanchan Joshi 2022-04-04 7:07 ` Christoph Hellwig 0 siblings, 1 reply; 28+ messages in thread From: Kanchan Joshi @ 2022-04-01 11:03 UTC (permalink / raw) To: axboe, hch Cc: io-uring, linux-nvme, asml.silence, ming.lei, mcgrof, pankydev8, javier, joshiiitr, anuj20.g Add IORING_SETUP_CQE32 flag to allow setting up ring with big-cqe which is 32 bytes in size. Also modify uring-cmd completion infra to accept additional result and fill that up in big-cqe. Signed-off-by: Kanchan Joshi <[email protected]> Signed-off-by: Anuj Gupta <[email protected]> --- fs/io_uring.c | 82 +++++++++++++++++++++++++++++------ include/linux/io_uring.h | 10 +++-- include/uapi/linux/io_uring.h | 11 +++++ 3 files changed, 87 insertions(+), 16 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index bd0e6b102a7b..b819c0ad47fc 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -211,8 +211,8 @@ struct io_mapped_ubuf { struct io_ring_ctx; struct io_overflow_cqe { - struct io_uring_cqe cqe; struct list_head list; + struct io_uring_cqe cqe; /* this must be kept at end */ }; struct io_fixed_file { @@ -1713,6 +1713,13 @@ static inline struct io_uring_cqe *io_get_cqe(struct io_ring_ctx *ctx) return NULL; tail = ctx->cached_cq_tail++; + + /* double index for large CQE */ + if (ctx->flags & IORING_SETUP_CQE32) { + mask = 2 * ctx->cq_entries - 1; + tail <<= 1; + } + return &rings->cqes[tail & mask]; } @@ -1792,13 +1799,16 @@ static bool __io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force) while (!list_empty(&ctx->cq_overflow_list)) { struct io_uring_cqe *cqe = io_get_cqe(ctx); struct io_overflow_cqe *ocqe; + int cqeshift = 0; if (!cqe && !force) break; + /* copy more for big-cqe */ + cqeshift = ctx->flags & IORING_SETUP_CQE32 ? 
1 : 0; ocqe = list_first_entry(&ctx->cq_overflow_list, struct io_overflow_cqe, list); if (cqe) - memcpy(cqe, &ocqe->cqe, sizeof(*cqe)); + memcpy(cqe, &ocqe->cqe, sizeof(*cqe) << cqeshift); else io_account_cq_overflow(ctx); @@ -1884,11 +1894,17 @@ static __cold void io_uring_drop_tctx_refs(struct task_struct *task) } static bool io_cqring_event_overflow(struct io_ring_ctx *ctx, u64 user_data, - s32 res, u32 cflags) + s32 res, u32 cflags, u64 res2, + int bigcqe) { struct io_overflow_cqe *ocqe; + int size = sizeof(*ocqe); + + /* allocate more for big-cqe */ + if (bigcqe) + size += sizeof(struct io_uring_cqe); - ocqe = kmalloc(sizeof(*ocqe), GFP_ATOMIC | __GFP_ACCOUNT); + ocqe = kmalloc(size, GFP_ATOMIC | __GFP_ACCOUNT); if (!ocqe) { /* * If we're in ring overflow flush mode, or in task cancel mode, @@ -1907,6 +1923,11 @@ static bool io_cqring_event_overflow(struct io_ring_ctx *ctx, u64 user_data, ocqe->cqe.user_data = user_data; ocqe->cqe.res = res; ocqe->cqe.flags = cflags; + if (bigcqe) { + struct io_uring_cqe32 *bcqe = (struct io_uring_cqe32 *)&ocqe->cqe; + + bcqe->res2 = res2; + } list_add_tail(&ocqe->list, &ctx->cq_overflow_list); return true; } @@ -1928,13 +1949,38 @@ static inline bool __fill_cqe(struct io_ring_ctx *ctx, u64 user_data, WRITE_ONCE(cqe->flags, cflags); return true; } - return io_cqring_event_overflow(ctx, user_data, res, cflags); + return io_cqring_event_overflow(ctx, user_data, res, cflags, 0, false); } +static inline bool __fill_big_cqe(struct io_ring_ctx *ctx, u64 user_data, + s32 res, u32 cflags, u64 res2) +{ + struct io_uring_cqe32 *bcqe; + + /* + * If we can't get a cq entry, userspace overflowed the + * submission (by quite a lot). Increment the overflow count in + * the ring. + */ + bcqe = (struct io_uring_cqe32 *) io_get_cqe(ctx); + if (likely(bcqe)) { + WRITE_ONCE(bcqe->cqe.user_data, user_data); + WRITE_ONCE(bcqe->cqe.res, res); + WRITE_ONCE(bcqe->cqe.flags, cflags); + WRITE_ONCE(bcqe->res2, res2); + return true; + } + return io_cqring_event_overflow(ctx, user_data, res, cflags, res2, + true); +} static inline bool __io_fill_cqe(struct io_kiocb *req, s32 res, u32 cflags) { trace_io_uring_complete(req->ctx, req, req->user_data, res, cflags); - return __fill_cqe(req->ctx, req->user_data, res, cflags); + if (!(req->ctx->flags & IORING_SETUP_CQE32)) + return __fill_cqe(req->ctx, req->user_data, res, cflags); + else + return __fill_big_cqe(req->ctx, req->user_data, res, cflags, + req->uring_cmd.res2); } static noinline void io_fill_cqe_req(struct io_kiocb *req, s32 res, u32 cflags) @@ -4126,10 +4172,12 @@ static int io_linkat(struct io_kiocb *req, unsigned int issue_flags) * Called by consumers of io_uring_cmd, if they originally returned * -EIOCBQUEUED upon receiving the command. 
*/ -void io_uring_cmd_done(struct io_uring_cmd *ioucmd, ssize_t ret) +void io_uring_cmd_done(struct io_uring_cmd *ioucmd, ssize_t ret, ssize_t res2) { struct io_kiocb *req = container_of(ioucmd, struct io_kiocb, uring_cmd); + /* store secondary result in res2 */ + req->uring_cmd.res2 = res2; if (ret < 0) req_set_fail(req); io_req_complete(req, ret); @@ -4163,7 +4211,7 @@ static int io_uring_cmd(struct io_kiocb *req, unsigned int issue_flags) /* queued async, consumer will call io_uring_cmd_done() when complete */ if (ret == -EIOCBQUEUED) return 0; - io_uring_cmd_done(ioucmd, ret); + io_uring_cmd_done(ioucmd, ret, 0); return 0; } @@ -9026,13 +9074,20 @@ static void *io_mem_alloc(size_t size) return (void *) __get_free_pages(gfp_flags, get_order(size)); } -static unsigned long rings_size(unsigned sq_entries, unsigned cq_entries, - size_t *sq_offset) +static unsigned long rings_size(struct io_uring_params *p, + size_t *sq_offset) { + unsigned sq_entries, cq_entries; struct io_rings *rings; size_t off, sq_array_size; - off = struct_size(rings, cqes, cq_entries); + sq_entries = p->sq_entries; + cq_entries = p->cq_entries; + + if (p->flags & IORING_SETUP_CQE32) + off = struct_size(rings, cqes, 2 * cq_entries); + else + off = struct_size(rings, cqes, cq_entries); if (off == SIZE_MAX) return SIZE_MAX; @@ -10483,7 +10538,7 @@ static __cold int io_allocate_scq_urings(struct io_ring_ctx *ctx, ctx->sq_entries = p->sq_entries; ctx->cq_entries = p->cq_entries; - size = rings_size(p->sq_entries, p->cq_entries, &sq_array_offset); + size = rings_size(p, &sq_array_offset); if (size == SIZE_MAX) return -EOVERFLOW; @@ -10713,7 +10768,8 @@ static long io_uring_setup(u32 entries, struct io_uring_params __user *params) if (p.flags & ~(IORING_SETUP_IOPOLL | IORING_SETUP_SQPOLL | IORING_SETUP_SQ_AFF | IORING_SETUP_CQSIZE | IORING_SETUP_CLAMP | IORING_SETUP_ATTACH_WQ | - IORING_SETUP_R_DISABLED | IORING_SETUP_SQE128)) + IORING_SETUP_R_DISABLED | IORING_SETUP_SQE128 | + IORING_SETUP_CQE32)) return -EINVAL; return io_uring_create(entries, &p, params); diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h index cedc68201469..0aba7b50cde6 100644 --- a/include/linux/io_uring.h +++ b/include/linux/io_uring.h @@ -14,7 +14,10 @@ enum io_uring_cmd_flags { struct io_uring_cmd { struct file *file; - void *cmd; + union { + void *cmd; /* used on submission */ + u64 res2; /* used on completion */ + }; /* for irq-completion - if driver requires doing stuff in task-context*/ void (*driver_cb)(struct io_uring_cmd *cmd); u32 flags; @@ -25,7 +28,7 @@ struct io_uring_cmd { }; #if defined(CONFIG_IO_URING) -void io_uring_cmd_done(struct io_uring_cmd *cmd, ssize_t ret); +void io_uring_cmd_done(struct io_uring_cmd *cmd, ssize_t ret, ssize_t res2); void io_uring_cmd_complete_in_task(struct io_uring_cmd *ioucmd, void (*driver_cb)(struct io_uring_cmd *)); struct sock *io_uring_get_socket(struct file *file); @@ -48,7 +51,8 @@ static inline void io_uring_free(struct task_struct *tsk) __io_uring_free(tsk); } #else -static inline void io_uring_cmd_done(struct io_uring_cmd *cmd, ssize_t ret) +static inline void io_uring_cmd_done(struct io_uring_cmd *cmd, ssize_t ret, + ssize_t ret2) { } static inline void io_uring_cmd_complete_in_task(struct io_uring_cmd *ioucmd, diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index d7a4bdb9bf3b..85b8ff046496 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -113,6 +113,7 @@ enum { #define IORING_SETUP_ATTACH_WQ (1U << 5) /* attach to existing 
wq */ #define IORING_SETUP_R_DISABLED (1U << 6) /* start with ring disabled */ #define IORING_SETUP_SQE128 (1U << 7) /* SQEs are 128b */ +#define IORING_SETUP_CQE32 (1U << 8) /* CQEs are 32b */ enum { IORING_OP_NOP, @@ -207,6 +208,16 @@ struct io_uring_cqe { __u32 flags; }; +/* + * If the ring is initializefd with IORING_SETUP_CQE32, we setup large cqe. + * Large CQE is created by combining two adjacent regular CQES. + */ +struct io_uring_cqe32 { + struct io_uring_cqe cqe; + __u64 res2; + __u64 unused; +}; + /* * cqe->flags * -- 2.25.1 ^ permalink raw reply related [flat|nested] 28+ messages in thread
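For completeness, a userspace sketch of setting up and mapping the CQ ring in this mode (assuming a kernel with this series applied, which sizes cqes[] for 2 * cq_entries regular CQEs as in rings_size() above; error handling trimmed):

#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <linux/io_uring.h>

static void *map_cqe32_ring(unsigned entries, int *ring_fd,
			    struct io_uring_params *p)
{
	memset(p, 0, sizeof(*p));
	p->flags = IORING_SETUP_SQE128 | IORING_SETUP_CQE32;
	*ring_fd = syscall(__NR_io_uring_setup, entries, p);
	if (*ring_fd < 0)
		return MAP_FAILED;
	/* each CQE now occupies two struct io_uring_cqe slots */
	return mmap(NULL, p->cq_off.cqes +
			  p->cq_entries * 2 * sizeof(struct io_uring_cqe),
		    PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE,
		    *ring_fd, IORING_OFF_CQ_RING);
}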
* Re: [RFC 4/5] io_uring: add support for big-cqe 2022-04-01 11:03 ` [RFC 4/5] io_uring: add support for big-cqe Kanchan Joshi @ 2022-04-04 7:07 ` Christoph Hellwig 2022-04-04 14:04 ` Kanchan Joshi 0 siblings, 1 reply; 28+ messages in thread From: Christoph Hellwig @ 2022-04-04 7:07 UTC (permalink / raw) To: Kanchan Joshi Cc: axboe, hch, io-uring, linux-nvme, asml.silence, ming.lei, mcgrof, pankydev8, javier, joshiiitr, anuj20.g On Fri, Apr 01, 2022 at 04:33:09PM +0530, Kanchan Joshi wrote: > Add IORING_SETUP_CQE32 flag to allow setting up ring with big-cqe which > is 32 bytes in size. Also modify uring-cmd completion infra to accept > additional result and fill that up in big-cqe. This should probably be patch 2 in the series. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC 4/5] io_uring: add support for big-cqe 2022-04-04 7:07 ` Christoph Hellwig @ 2022-04-04 14:04 ` Kanchan Joshi 0 siblings, 0 replies; 28+ messages in thread From: Kanchan Joshi @ 2022-04-04 14:04 UTC (permalink / raw) To: Christoph Hellwig Cc: Kanchan Joshi, Jens Axboe, io-uring, linux-nvme, Pavel Begunkov, Ming Lei, Luis Chamberlain, Pankaj Raghav, Javier González, Anuj Gupta On Mon, Apr 4, 2022 at 12:37 PM Christoph Hellwig <[email protected]> wrote: > > On Fri, Apr 01, 2022 at 04:33:09PM +0530, Kanchan Joshi wrote: > > Add IORING_SETUP_CQE32 flag to allow setting up ring with big-cqe which > > is 32 bytes in size. Also modify uring-cmd completion infra to accept > > additional result and fill that up in big-cqe. > > This should probably be patch 2 in the series. Yes, indeed. Thought (for the next version) is placing this immediately after big-sqe. And make uring-cmd infrastructure to be on the top. ^ permalink raw reply [flat|nested] 28+ messages in thread
* [RFC 5/5] nvme: wire-up support for async-passthru on char-device. [not found] ` <CGME20220401110838epcas5p2c1a2e776923dfe5bf65a3e7946820150@epcas5p2.samsung.com> @ 2022-04-01 11:03 ` Kanchan Joshi 2022-04-04 7:20 ` Christoph Hellwig 0 siblings, 1 reply; 28+ messages in thread From: Kanchan Joshi @ 2022-04-01 11:03 UTC (permalink / raw) To: axboe, hch Cc: io-uring, linux-nvme, asml.silence, ming.lei, mcgrof, pankydev8, javier, joshiiitr, anuj20.g Introduce handler for fops->async_cmd(), implementing async passthru on char device (/dev/ngX). The handler supports NVME_IOCTL_IO64_CMD. Signed-off-by: Kanchan Joshi <[email protected]> Signed-off-by: Anuj Gupta <[email protected]> --- drivers/nvme/host/core.c | 1 + drivers/nvme/host/ioctl.c | 187 ++++++++++++++++++++++++++++------ drivers/nvme/host/multipath.c | 1 + drivers/nvme/host/nvme.h | 3 + 4 files changed, 161 insertions(+), 31 deletions(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 961a5f8a44d2..38b9630c2cb7 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -3665,6 +3665,7 @@ static const struct file_operations nvme_ns_chr_fops = { .release = nvme_ns_chr_release, .unlocked_ioctl = nvme_ns_chr_ioctl, .compat_ioctl = compat_ptr_ioctl, + .async_cmd = nvme_ns_chr_async_cmd, }; static int nvme_add_ns_cdev(struct nvme_ns *ns) diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c index 22314962842d..1d15694d411c 100644 --- a/drivers/nvme/host/ioctl.c +++ b/drivers/nvme/host/ioctl.c @@ -19,6 +19,81 @@ static void __user *nvme_to_user_ptr(uintptr_t ptrval) return (void __user *)ptrval; } +/* + * This overlays struct io_uring_cmd pdu. + * Expect build errors if this grows larger than that. + */ +struct nvme_uring_cmd_pdu { + union { + struct bio *bio; + struct request *req; + }; + void *meta; /* kernel-resident buffer */ + void __user *meta_buffer; + u32 meta_len; +} __packed; + +static struct nvme_uring_cmd_pdu *nvme_uring_cmd_pdu(struct io_uring_cmd *ioucmd) +{ + return (struct nvme_uring_cmd_pdu *)&ioucmd->pdu; +} + +static void nvme_pt_task_cb(struct io_uring_cmd *ioucmd) +{ + struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd); + struct request *req = pdu->req; + struct bio *bio = req->bio; + bool write = (req_op(req) == REQ_OP_DRV_OUT); + int status; + u64 result; + + if (nvme_req(req)->flags & NVME_REQ_CANCELLED) + status = -EINTR; + else + status = nvme_req(req)->status; + + /* we can free request */ + blk_mq_free_request(req); + blk_rq_unmap_user(bio); + + if (pdu->meta && !status && !write) { + if (copy_to_user(pdu->meta_buffer, pdu->meta, pdu->meta_len)) + status = -EFAULT; + } + kfree(pdu->meta); + + result = le64_to_cpu(nvme_req(req)->result.u64); + io_uring_cmd_done(ioucmd, status, result); +} + +static void nvme_end_async_pt(struct request *req, blk_status_t err) +{ + struct io_uring_cmd *ioucmd = req->end_io_data; + struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd); + /* extract bio before reusing the same field for request */ + struct bio *bio = pdu->bio; + + pdu->req = req; + req->bio = bio; + + /* this takes care of moving rest of completion-work to task context */ + io_uring_cmd_complete_in_task(ioucmd, nvme_pt_task_cb); +} + +static void nvme_setup_uring_cmd_data(struct request *rq, + struct io_uring_cmd *ioucmd, void *meta, + void __user *meta_buffer, u32 meta_len) +{ + struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd); + + /* to free bio on completion, as req->bio will be null at that time */ + pdu->bio = rq->bio; + pdu->meta = meta; 
+ pdu->meta_buffer = meta_buffer; + pdu->meta_len = meta_len; + rq->end_io_data = ioucmd; +} + static void *nvme_add_user_metadata(struct bio *bio, void __user *ubuf, unsigned len, u32 seed, bool write) { @@ -56,7 +131,8 @@ static void *nvme_add_user_metadata(struct bio *bio, void __user *ubuf, static int nvme_submit_user_cmd(struct request_queue *q, struct nvme_command *cmd, void __user *ubuffer, unsigned bufflen, void __user *meta_buffer, unsigned meta_len, - u32 meta_seed, u64 *result, unsigned timeout) + u32 meta_seed, u64 *result, unsigned timeout, + struct io_uring_cmd *ioucmd) { bool write = nvme_is_write(cmd); struct nvme_ns *ns = q->queuedata; @@ -93,6 +169,12 @@ static int nvme_submit_user_cmd(struct request_queue *q, } } + if (ioucmd) { /* async dispatch */ + nvme_setup_uring_cmd_data(req, ioucmd, meta, meta_buffer, + meta_len); + blk_execute_rq_nowait(req, 0, nvme_end_async_pt); + return -EIOCBQUEUED; + } ret = nvme_execute_passthru_rq(req); if (result) *result = le64_to_cpu(nvme_req(req)->result.u64); @@ -170,7 +252,7 @@ static int nvme_submit_io(struct nvme_ns *ns, struct nvme_user_io __user *uio) return nvme_submit_user_cmd(ns->queue, &c, nvme_to_user_ptr(io.addr), length, - metadata, meta_len, lower_32_bits(io.slba), NULL, 0); + metadata, meta_len, lower_32_bits(io.slba), NULL, 0, NULL); } static bool nvme_validate_passthru_nsid(struct nvme_ctrl *ctrl, @@ -224,7 +306,7 @@ static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns, status = nvme_submit_user_cmd(ns ? ns->queue : ctrl->admin_q, &c, nvme_to_user_ptr(cmd.addr), cmd.data_len, nvme_to_user_ptr(cmd.metadata), cmd.metadata_len, - 0, &result, timeout); + 0, &result, timeout, NULL); if (status >= 0) { if (put_user(result, &ucmd->result)) @@ -235,45 +317,51 @@ static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns, } static int nvme_user_cmd64(struct nvme_ctrl *ctrl, struct nvme_ns *ns, - struct nvme_passthru_cmd64 __user *ucmd) + struct nvme_passthru_cmd64 __user *ucmd, + struct io_uring_cmd *ioucmd) { - struct nvme_passthru_cmd64 cmd; + struct nvme_passthru_cmd64 cmd, *cptr; struct nvme_command c; unsigned timeout = 0; int status; if (!capable(CAP_SYS_ADMIN)) return -EACCES; - if (copy_from_user(&cmd, ucmd, sizeof(cmd))) - return -EFAULT; - if (cmd.flags) + if (!ioucmd) { + if (copy_from_user(&cmd, ucmd, sizeof(cmd))) + return -EFAULT; + cptr = &cmd; + } else { + cptr = (struct nvme_passthru_cmd64 *)ioucmd->cmd; + } + if (cptr->flags) return -EINVAL; - if (!nvme_validate_passthru_nsid(ctrl, ns, cmd.nsid)) + if (!nvme_validate_passthru_nsid(ctrl, ns, cptr->nsid)) return -EINVAL; memset(&c, 0, sizeof(c)); - c.common.opcode = cmd.opcode; - c.common.flags = cmd.flags; - c.common.nsid = cpu_to_le32(cmd.nsid); - c.common.cdw2[0] = cpu_to_le32(cmd.cdw2); - c.common.cdw2[1] = cpu_to_le32(cmd.cdw3); - c.common.cdw10 = cpu_to_le32(cmd.cdw10); - c.common.cdw11 = cpu_to_le32(cmd.cdw11); - c.common.cdw12 = cpu_to_le32(cmd.cdw12); - c.common.cdw13 = cpu_to_le32(cmd.cdw13); - c.common.cdw14 = cpu_to_le32(cmd.cdw14); - c.common.cdw15 = cpu_to_le32(cmd.cdw15); - - if (cmd.timeout_ms) - timeout = msecs_to_jiffies(cmd.timeout_ms); + c.common.opcode = cptr->opcode; + c.common.flags = cptr->flags; + c.common.nsid = cpu_to_le32(cptr->nsid); + c.common.cdw2[0] = cpu_to_le32(cptr->cdw2); + c.common.cdw2[1] = cpu_to_le32(cptr->cdw3); + c.common.cdw10 = cpu_to_le32(cptr->cdw10); + c.common.cdw11 = cpu_to_le32(cptr->cdw11); + c.common.cdw12 = cpu_to_le32(cptr->cdw12); + c.common.cdw13 = cpu_to_le32(cptr->cdw13); + 
c.common.cdw14 = cpu_to_le32(cptr->cdw14); + c.common.cdw15 = cpu_to_le32(cptr->cdw15); + + if (cptr->timeout_ms) + timeout = msecs_to_jiffies(cptr->timeout_ms); status = nvme_submit_user_cmd(ns ? ns->queue : ctrl->admin_q, &c, - nvme_to_user_ptr(cmd.addr), cmd.data_len, - nvme_to_user_ptr(cmd.metadata), cmd.metadata_len, - 0, &cmd.result, timeout); + nvme_to_user_ptr(cptr->addr), cptr->data_len, + nvme_to_user_ptr(cptr->metadata), cptr->metadata_len, + 0, &cptr->result, timeout, ioucmd); - if (status >= 0) { - if (put_user(cmd.result, &ucmd->result)) + if (!ioucmd && status >= 0) { + if (put_user(cptr->result, &ucmd->result)) return -EFAULT; } @@ -296,7 +384,7 @@ static int nvme_ctrl_ioctl(struct nvme_ctrl *ctrl, unsigned int cmd, case NVME_IOCTL_ADMIN_CMD: return nvme_user_cmd(ctrl, NULL, argp); case NVME_IOCTL_ADMIN64_CMD: - return nvme_user_cmd64(ctrl, NULL, argp); + return nvme_user_cmd64(ctrl, NULL, argp, NULL); default: return sed_ioctl(ctrl->opal_dev, cmd, argp); } @@ -340,7 +428,7 @@ static int nvme_ns_ioctl(struct nvme_ns *ns, unsigned int cmd, case NVME_IOCTL_SUBMIT_IO: return nvme_submit_io(ns, argp); case NVME_IOCTL_IO64_CMD: - return nvme_user_cmd64(ns->ctrl, ns, argp); + return nvme_user_cmd64(ns->ctrl, ns, argp, NULL); default: return -ENOTTY; } @@ -369,6 +457,29 @@ long nvme_ns_chr_ioctl(struct file *file, unsigned int cmd, unsigned long arg) return __nvme_ioctl(ns, cmd, (void __user *)arg); } +static int nvme_ns_async_ioctl(struct nvme_ns *ns, struct io_uring_cmd *ioucmd) +{ + int ret; + + switch (ioucmd->cmd_op) { + case NVME_IOCTL_IO64_CMD: + ret = nvme_user_cmd64(ns->ctrl, ns, NULL, ioucmd); + break; + default: + ret = -ENOTTY; + } + + return ret; +} + +int nvme_ns_chr_async_cmd(struct io_uring_cmd *ioucmd) +{ + struct nvme_ns *ns = container_of(file_inode(ioucmd->file)->i_cdev, + struct nvme_ns, cdev); + return nvme_ns_async_ioctl(ns, ioucmd); +} + + #ifdef CONFIG_NVME_MULTIPATH static int nvme_ns_head_ctrl_ioctl(struct nvme_ns *ns, unsigned int cmd, void __user *argp, struct nvme_ns_head *head, int srcu_idx) @@ -435,6 +546,20 @@ long nvme_ns_head_chr_ioctl(struct file *file, unsigned int cmd, srcu_read_unlock(&head->srcu, srcu_idx); return ret; } + +int nvme_ns_head_chr_async_cmd(struct io_uring_cmd *ioucmd) +{ + struct cdev *cdev = file_inode(ioucmd->file)->i_cdev; + struct nvme_ns_head *head = container_of(cdev, struct nvme_ns_head, cdev); + int srcu_idx = srcu_read_lock(&head->srcu); + struct nvme_ns *ns = nvme_find_path(head); + int ret = -EWOULDBLOCK; + + if (ns) + ret = nvme_ns_async_ioctl(ns, ioucmd); + srcu_read_unlock(&head->srcu, srcu_idx); + return ret; +} #endif /* CONFIG_NVME_MULTIPATH */ static int nvme_dev_user_cmd(struct nvme_ctrl *ctrl, void __user *argp) @@ -480,7 +605,7 @@ long nvme_dev_ioctl(struct file *file, unsigned int cmd, case NVME_IOCTL_ADMIN_CMD: return nvme_user_cmd(ctrl, NULL, argp); case NVME_IOCTL_ADMIN64_CMD: - return nvme_user_cmd64(ctrl, NULL, argp); + return nvme_user_cmd64(ctrl, NULL, argp, NULL); case NVME_IOCTL_IO_CMD: return nvme_dev_user_cmd(ctrl, argp); case NVME_IOCTL_RESET: diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c index f8bf6606eb2f..1d798d09456f 100644 --- a/drivers/nvme/host/multipath.c +++ b/drivers/nvme/host/multipath.c @@ -459,6 +459,7 @@ static const struct file_operations nvme_ns_head_chr_fops = { .release = nvme_ns_head_chr_release, .unlocked_ioctl = nvme_ns_head_chr_ioctl, .compat_ioctl = compat_ptr_ioctl, + .async_cmd = nvme_ns_head_chr_async_cmd, }; static int 
nvme_add_ns_head_cdev(struct nvme_ns_head *head) diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h index a162f6c6da6e..7b801c241d26 100644 --- a/drivers/nvme/host/nvme.h +++ b/drivers/nvme/host/nvme.h @@ -16,6 +16,7 @@ #include <linux/rcupdate.h> #include <linux/wait.h> #include <linux/t10-pi.h> +#include <linux/io_uring.h> #include <trace/events/block.h> @@ -751,6 +752,8 @@ long nvme_ns_head_chr_ioctl(struct file *file, unsigned int cmd, unsigned long arg); long nvme_dev_ioctl(struct file *file, unsigned int cmd, unsigned long arg); +int nvme_ns_chr_async_cmd(struct io_uring_cmd *ioucmd); +int nvme_ns_head_chr_async_cmd(struct io_uring_cmd *ioucmd); int nvme_getgeo(struct block_device *bdev, struct hd_geometry *geo); extern const struct attribute_group *nvme_ns_id_attr_groups[]; -- 2.25.1 ^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [RFC 5/5] nvme: wire-up support for async-passthru on char-device. 2022-04-01 11:03 ` [RFC 5/5] nvme: wire-up support for async-passthru on char-device Kanchan Joshi @ 2022-04-04 7:20 ` Christoph Hellwig 2022-04-04 14:25 ` Kanchan Joshi 0 siblings, 1 reply; 28+ messages in thread From: Christoph Hellwig @ 2022-04-04 7:20 UTC (permalink / raw) To: Kanchan Joshi Cc: axboe, hch, io-uring, linux-nvme, asml.silence, ming.lei, mcgrof, pankydev8, javier, joshiiitr, anuj20.g On Fri, Apr 01, 2022 at 04:33:10PM +0530, Kanchan Joshi wrote: > Introduce handler for fops->async_cmd(), implementing async passthru > on char device (/dev/ngX). The handler supports NVME_IOCTL_IO64_CMD. I really don't like how this still mixes up ioctls and uring cmds, as mentioned close to half a dozend times we really should not mix them up. Something like this (untested) patch should help to separate the much better: --- From e145d515929f086b2dadf8816e1397eb287f8ae0 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig <[email protected]> Date: Mon, 4 Apr 2022 07:22:33 +0200 Subject: nvme: split the uring cmd path from the regular ioctl path io_uring async commands are not ioctls, and we should not reuse opcodes or otherwise pretend that they are the same. Signed-off-by: Christoph Hellwig <[email protected]> --- drivers/nvme/host/ioctl.c | 222 ++++++++++++++++++++------------ include/uapi/linux/nvme_ioctl.h | 3 + 2 files changed, 141 insertions(+), 84 deletions(-) diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c index 1d15694d411c0..ea6cfd4321942 100644 --- a/drivers/nvme/host/ioctl.c +++ b/drivers/nvme/host/ioctl.c @@ -3,6 +3,7 @@ * Copyright (c) 2011-2014, Intel Corporation. * Copyright (c) 2017-2021 Christoph Hellwig. */ +#include <linux/blk-integrity.h> #include <linux/ptrace.h> /* for force_successful_syscall_return */ #include <linux/nvme_ioctl.h> #include "nvme.h" @@ -19,6 +20,23 @@ static void __user *nvme_to_user_ptr(uintptr_t ptrval) return (void __user *)ptrval; } +static int nvme_ioctl_finish_metadata(struct bio *bio, int ret, + void __user *meta_ubuf) +{ + struct bio_integrity_payload *bip = bio_integrity(bio); + + if (bip) { + void *meta = bvec_virt(bip->bip_vec); + + if (!ret && bio_op(bio) == REQ_OP_DRV_IN && + copy_to_user(meta_ubuf, meta, bip->bip_vec->bv_len)) + ret = -EFAULT; + kfree(meta); + } + + return ret; +} + /* * This overlays struct io_uring_cmd pdu. * Expect build errors if this grows larger than that. 
@@ -28,9 +46,7 @@ struct nvme_uring_cmd_pdu { struct bio *bio; struct request *req; }; - void *meta; /* kernel-resident buffer */ void __user *meta_buffer; - u32 meta_len; } __packed; static struct nvme_uring_cmd_pdu *nvme_uring_cmd_pdu(struct io_uring_cmd *ioucmd) @@ -38,12 +54,11 @@ static struct nvme_uring_cmd_pdu *nvme_uring_cmd_pdu(struct io_uring_cmd *ioucmd return (struct nvme_uring_cmd_pdu *)&ioucmd->pdu; } -static void nvme_pt_task_cb(struct io_uring_cmd *ioucmd) +static void nvme_uring_task_cb(struct io_uring_cmd *ioucmd) { struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd); struct request *req = pdu->req; struct bio *bio = req->bio; - bool write = (req_op(req) == REQ_OP_DRV_OUT); int status; u64 result; @@ -56,12 +71,7 @@ static void nvme_pt_task_cb(struct io_uring_cmd *ioucmd) blk_mq_free_request(req); blk_rq_unmap_user(bio); - if (pdu->meta && !status && !write) { - if (copy_to_user(pdu->meta_buffer, pdu->meta, pdu->meta_len)) - status = -EFAULT; - } - kfree(pdu->meta); - + status = nvme_ioctl_finish_metadata(bio, status, pdu->meta_buffer); result = le64_to_cpu(nvme_req(req)->result.u64); io_uring_cmd_done(ioucmd, status, result); } @@ -77,21 +87,7 @@ static void nvme_end_async_pt(struct request *req, blk_status_t err) req->bio = bio; /* this takes care of moving rest of completion-work to task context */ - io_uring_cmd_complete_in_task(ioucmd, nvme_pt_task_cb); -} - -static void nvme_setup_uring_cmd_data(struct request *rq, - struct io_uring_cmd *ioucmd, void *meta, - void __user *meta_buffer, u32 meta_len) -{ - struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd); - - /* to free bio on completion, as req->bio will be null at that time */ - pdu->bio = rq->bio; - pdu->meta = meta; - pdu->meta_buffer = meta_buffer; - pdu->meta_len = meta_len; - rq->end_io_data = ioucmd; + io_uring_cmd_complete_in_task(ioucmd, nvme_uring_task_cb); } static void *nvme_add_user_metadata(struct bio *bio, void __user *ubuf, @@ -128,11 +124,10 @@ static void *nvme_add_user_metadata(struct bio *bio, void __user *ubuf, return ERR_PTR(ret); } -static int nvme_submit_user_cmd(struct request_queue *q, +static struct request *nvme_alloc_user_request(struct request_queue *q, struct nvme_command *cmd, void __user *ubuffer, unsigned bufflen, void __user *meta_buffer, unsigned meta_len, - u32 meta_seed, u64 *result, unsigned timeout, - struct io_uring_cmd *ioucmd) + u32 meta_seed, unsigned timeout) { bool write = nvme_is_write(cmd); struct nvme_ns *ns = q->queuedata; @@ -144,7 +139,7 @@ static int nvme_submit_user_cmd(struct request_queue *q, req = nvme_alloc_request(q, cmd, 0); if (IS_ERR(req)) - return PTR_ERR(req); + return req; if (timeout) req->timeout = timeout; @@ -169,28 +164,47 @@ static int nvme_submit_user_cmd(struct request_queue *q, } } - if (ioucmd) { /* async dispatch */ - nvme_setup_uring_cmd_data(req, ioucmd, meta, meta_buffer, - meta_len); - blk_execute_rq_nowait(req, 0, nvme_end_async_pt); - return -EIOCBQUEUED; - } + return req; + +out_unmap: + if (bio) + blk_rq_unmap_user(bio); +out: + blk_mq_free_request(req); + return ERR_PTR(ret); +} + +static int nvme_execute_user_rq(struct request *req, + void __user *meta_buffer, u64 *result) +{ + struct bio *bio = req->bio; + int ret; + ret = nvme_execute_passthru_rq(req); + if (result) *result = le64_to_cpu(nvme_req(req)->result.u64); - if (meta && !ret && !write) { - if (copy_to_user(meta_buffer, meta, meta_len)) - ret = -EFAULT; - } - kfree(meta); - out_unmap: + ret = nvme_ioctl_finish_metadata(bio, ret, meta_buffer); + if (bio) 
blk_rq_unmap_user(bio); - out: blk_mq_free_request(req); return ret; } +static int nvme_submit_user_cmd(struct request_queue *q, + struct nvme_command *cmd, void __user *ubuffer, + unsigned bufflen, void __user *meta_buffer, unsigned meta_len, + u32 meta_seed, u64 *result, unsigned timeout) +{ + struct request *req; + + req = nvme_alloc_user_request(q, cmd, ubuffer, bufflen, meta_buffer, + meta_len, meta_seed, timeout); + if (IS_ERR(req)) + return PTR_ERR(req); + return nvme_execute_user_rq(req, meta_buffer, result); +} static int nvme_submit_io(struct nvme_ns *ns, struct nvme_user_io __user *uio) { @@ -252,7 +266,7 @@ static int nvme_submit_io(struct nvme_ns *ns, struct nvme_user_io __user *uio) return nvme_submit_user_cmd(ns->queue, &c, nvme_to_user_ptr(io.addr), length, - metadata, meta_len, lower_32_bits(io.slba), NULL, 0, NULL); + metadata, meta_len, lower_32_bits(io.slba), NULL, 0); } static bool nvme_validate_passthru_nsid(struct nvme_ctrl *ctrl, @@ -306,7 +320,7 @@ static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns, status = nvme_submit_user_cmd(ns ? ns->queue : ctrl->admin_q, &c, nvme_to_user_ptr(cmd.addr), cmd.data_len, nvme_to_user_ptr(cmd.metadata), cmd.metadata_len, - 0, &result, timeout, NULL); + 0, &result, timeout); if (status >= 0) { if (put_user(result, &ucmd->result)) @@ -317,57 +331,98 @@ static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns, } static int nvme_user_cmd64(struct nvme_ctrl *ctrl, struct nvme_ns *ns, - struct nvme_passthru_cmd64 __user *ucmd, - struct io_uring_cmd *ioucmd) + struct nvme_passthru_cmd64 __user *ucmd) { - struct nvme_passthru_cmd64 cmd, *cptr; + struct nvme_passthru_cmd64 cmd; struct nvme_command c; unsigned timeout = 0; int status; if (!capable(CAP_SYS_ADMIN)) return -EACCES; - if (!ioucmd) { - if (copy_from_user(&cmd, ucmd, sizeof(cmd))) - return -EFAULT; - cptr = &cmd; - } else { - cptr = (struct nvme_passthru_cmd64 *)ioucmd->cmd; - } - if (cptr->flags) + if (copy_from_user(&cmd, ucmd, sizeof(cmd))) + return -EFAULT; + if (cmd.flags) return -EINVAL; - if (!nvme_validate_passthru_nsid(ctrl, ns, cptr->nsid)) + if (!nvme_validate_passthru_nsid(ctrl, ns, cmd.nsid)) return -EINVAL; memset(&c, 0, sizeof(c)); - c.common.opcode = cptr->opcode; - c.common.flags = cptr->flags; - c.common.nsid = cpu_to_le32(cptr->nsid); - c.common.cdw2[0] = cpu_to_le32(cptr->cdw2); - c.common.cdw2[1] = cpu_to_le32(cptr->cdw3); - c.common.cdw10 = cpu_to_le32(cptr->cdw10); - c.common.cdw11 = cpu_to_le32(cptr->cdw11); - c.common.cdw12 = cpu_to_le32(cptr->cdw12); - c.common.cdw13 = cpu_to_le32(cptr->cdw13); - c.common.cdw14 = cpu_to_le32(cptr->cdw14); - c.common.cdw15 = cpu_to_le32(cptr->cdw15); - - if (cptr->timeout_ms) - timeout = msecs_to_jiffies(cptr->timeout_ms); + c.common.opcode = cmd.opcode; + c.common.flags = cmd.flags; + c.common.nsid = cpu_to_le32(cmd.nsid); + c.common.cdw2[0] = cpu_to_le32(cmd.cdw2); + c.common.cdw2[1] = cpu_to_le32(cmd.cdw3); + c.common.cdw10 = cpu_to_le32(cmd.cdw10); + c.common.cdw11 = cpu_to_le32(cmd.cdw11); + c.common.cdw12 = cpu_to_le32(cmd.cdw12); + c.common.cdw13 = cpu_to_le32(cmd.cdw13); + c.common.cdw14 = cpu_to_le32(cmd.cdw14); + c.common.cdw15 = cpu_to_le32(cmd.cdw15); + + if (cmd.timeout_ms) + timeout = msecs_to_jiffies(cmd.timeout_ms); status = nvme_submit_user_cmd(ns ? 
ns->queue : ctrl->admin_q, &c, - nvme_to_user_ptr(cptr->addr), cptr->data_len, - nvme_to_user_ptr(cptr->metadata), cptr->metadata_len, - 0, &cptr->result, timeout, ioucmd); + nvme_to_user_ptr(cmd.addr), cmd.data_len, + nvme_to_user_ptr(cmd.metadata), cmd.metadata_len, + 0, &cmd.result, timeout); - if (!ioucmd && status >= 0) { - if (put_user(cptr->result, &ucmd->result)) + if (status >= 0) { + if (put_user(cmd.result, &ucmd->result)) return -EFAULT; } return status; } +static int nvme_uring_cmd_io(struct nvme_ctrl *ctrl, struct nvme_ns *ns, + struct io_uring_cmd *ioucmd) +{ + struct nvme_passthru_cmd64 *cmd = + (struct nvme_passthru_cmd64 *)ioucmd->cmd; + struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd); + struct request_queue *q = ns ? ns->queue : ctrl->admin_q; + struct nvme_command c; + struct request *req; + + if (!capable(CAP_SYS_ADMIN)) + return -EACCES; + + if (cmd->flags || cmd->result) + return -EINVAL; + if (!nvme_validate_passthru_nsid(ctrl, ns, cmd->nsid)) + return -EINVAL; + + memset(&c, 0, sizeof(c)); + c.common.opcode = cmd->opcode; + c.common.flags = cmd->flags; + c.common.nsid = cpu_to_le32(cmd->nsid); + c.common.cdw2[0] = cpu_to_le32(cmd->cdw2); + c.common.cdw2[1] = cpu_to_le32(cmd->cdw3); + c.common.cdw10 = cpu_to_le32(cmd->cdw10); + c.common.cdw11 = cpu_to_le32(cmd->cdw11); + c.common.cdw12 = cpu_to_le32(cmd->cdw12); + c.common.cdw13 = cpu_to_le32(cmd->cdw13); + c.common.cdw14 = cpu_to_le32(cmd->cdw14); + c.common.cdw15 = cpu_to_le32(cmd->cdw15); + + req = nvme_alloc_user_request(q, &c, nvme_to_user_ptr(cmd->addr), + cmd->data_len, nvme_to_user_ptr(cmd->metadata), + cmd->metadata_len, 0, cmd->timeout_ms ? + msecs_to_jiffies(cmd->timeout_ms) : 0); + if (IS_ERR(req)) + return PTR_ERR(req); + + /* to free bio on completion, as req->bio will be null at that time */ + pdu->bio = req->bio; + pdu->meta_buffer = nvme_to_user_ptr(cmd->metadata); + req->end_io_data = ioucmd; + + blk_execute_rq_nowait(req, 0, nvme_end_async_pt); + return -EIOCBQUEUED; +} + static bool is_ctrl_ioctl(unsigned int cmd) { if (cmd == NVME_IOCTL_ADMIN_CMD || cmd == NVME_IOCTL_ADMIN64_CMD) @@ -384,7 +439,7 @@ static int nvme_ctrl_ioctl(struct nvme_ctrl *ctrl, unsigned int cmd, case NVME_IOCTL_ADMIN_CMD: return nvme_user_cmd(ctrl, NULL, argp); case NVME_IOCTL_ADMIN64_CMD: - return nvme_user_cmd64(ctrl, NULL, argp, NULL); + return nvme_user_cmd64(ctrl, NULL, argp); default: return sed_ioctl(ctrl->opal_dev, cmd, argp); } @@ -428,7 +483,7 @@ static int nvme_ns_ioctl(struct nvme_ns *ns, unsigned int cmd, case NVME_IOCTL_SUBMIT_IO: return nvme_submit_io(ns, argp); case NVME_IOCTL_IO64_CMD: - return nvme_user_cmd64(ns->ctrl, ns, argp, NULL); + return nvme_user_cmd64(ns->ctrl, ns, argp); default: return -ENOTTY; } @@ -457,13 +512,13 @@ long nvme_ns_chr_ioctl(struct file *file, unsigned int cmd, unsigned long arg) return __nvme_ioctl(ns, cmd, (void __user *)arg); } -static int nvme_ns_async_ioctl(struct nvme_ns *ns, struct io_uring_cmd *ioucmd) +static int nvme_ns_async_cmd(struct nvme_ns *ns, struct io_uring_cmd *ioucmd) { int ret; switch (ioucmd->cmd_op) { - case NVME_IOCTL_IO64_CMD: - ret = nvme_user_cmd64(ns->ctrl, ns, NULL, ioucmd); + case NVME_URING_CMD_IO: + ret = nvme_uring_cmd_io(ns->ctrl, ns, ioucmd); break; default: ret = -ENOTTY; @@ -476,10 +531,9 @@ int nvme_ns_chr_async_cmd(struct io_uring_cmd *ioucmd) { struct nvme_ns *ns = container_of(file_inode(ioucmd->file)->i_cdev, struct nvme_ns, cdev); - return nvme_ns_async_ioctl(ns, ioucmd); + return nvme_ns_async_cmd(ns, ioucmd); } - #ifdef 
CONFIG_NVME_MULTIPATH static int nvme_ns_head_ctrl_ioctl(struct nvme_ns *ns, unsigned int cmd, void __user *argp, struct nvme_ns_head *head, int srcu_idx) @@ -556,7 +610,7 @@ int nvme_ns_head_chr_async_cmd(struct io_uring_cmd *ioucmd) int ret = -EWOULDBLOCK; if (ns) - ret = nvme_ns_async_ioctl(ns, ioucmd); + ret = nvme_ns_async_cmd(ns, ioucmd); srcu_read_unlock(&head->srcu, srcu_idx); return ret; } @@ -605,7 +659,7 @@ long nvme_dev_ioctl(struct file *file, unsigned int cmd, case NVME_IOCTL_ADMIN_CMD: return nvme_user_cmd(ctrl, NULL, argp); case NVME_IOCTL_ADMIN64_CMD: - return nvme_user_cmd64(ctrl, NULL, argp, NULL); + return nvme_user_cmd64(ctrl, NULL, argp); case NVME_IOCTL_IO_CMD: return nvme_dev_user_cmd(ctrl, argp); case NVME_IOCTL_RESET: diff --git a/include/uapi/linux/nvme_ioctl.h b/include/uapi/linux/nvme_ioctl.h index d99b5a7726980..39c9d3ecfef88 100644 --- a/include/uapi/linux/nvme_ioctl.h +++ b/include/uapi/linux/nvme_ioctl.h @@ -79,4 +79,7 @@ struct nvme_passthru_cmd64 { #define NVME_IOCTL_ADMIN64_CMD _IOWR('N', 0x47, struct nvme_passthru_cmd64) #define NVME_IOCTL_IO64_CMD _IOWR('N', 0x48, struct nvme_passthru_cmd64) +/* io_uring async commands: */ +#define NVME_URING_CMD_IO _IOWR('N', 0x80, struct nvme_passthru_cmd64) + #endif /* _UAPI_LINUX_NVME_IOCTL_H */ -- 2.30.2 ^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [RFC 5/5] nvme: wire-up support for async-passthru on char-device. 2022-04-04 7:20 ` Christoph Hellwig @ 2022-04-04 14:25 ` Kanchan Joshi 2022-04-05 6:02 ` Christoph Hellwig 0 siblings, 1 reply; 28+ messages in thread From: Kanchan Joshi @ 2022-04-04 14:25 UTC (permalink / raw) To: Christoph Hellwig Cc: Kanchan Joshi, Jens Axboe, io-uring, linux-nvme, Pavel Begunkov, Ming Lei, Luis Chamberlain, Pankaj Raghav, Javier González, Anuj Gupta On Mon, Apr 4, 2022 at 12:50 PM Christoph Hellwig <[email protected]> wrote: > > On Fri, Apr 01, 2022 at 04:33:10PM +0530, Kanchan Joshi wrote: > > Introduce handler for fops->async_cmd(), implementing async passthru > > on char device (/dev/ngX). The handler supports NVME_IOCTL_IO64_CMD. > > I really don't like how this still mixes up ioctls and uring cmds, > as mentioned close to half a dozend times we really should not mix > them up. Sorry, I too had the plans to use a different opcode eventually (i.e. for the full series). Just wanted to focus on big-cqe this time. > Something like this (untested) patch should help to separate > the much better: It does, thanks. But the only thing is - it would be good to support vectored-passthru too (i.e. NVME_IOCTL_IO64_CMD_VEC) for this path. For the new opcode "NVME_URING_CMD_IO" , either we can change the cmd-structure or flag-based handling so that vectored-io is supported. Or we introduce NVME_URING_CMD_IO_VEC also for that. Which one do you prefer? > +static int nvme_ioctl_finish_metadata(struct bio *bio, int ret, > + void __user *meta_ubuf) > +{ > + struct bio_integrity_payload *bip = bio_integrity(bio); > + > + if (bip) { > + void *meta = bvec_virt(bip->bip_vec); > + > + if (!ret && bio_op(bio) == REQ_OP_DRV_IN && > + copy_to_user(meta_ubuf, meta, bip->bip_vec->bv_len)) > + ret = -EFAULT; Maybe it is better to move the check "bio_op(bio) != REQ_OP_DRV_IN" outside. Because this can be common, and for that we can avoid entering into the function call itself (i.e. nvme_ioctl_finish_metadata). ^ permalink raw reply [flat|nested] 28+ messages in thread
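For concreteness, the separate-opcode option mentioned above could look roughly like the sketch below. The NVME_URING_CMD_IO_VEC name, the 0x81 number, and the extra vec argument to nvme_uring_cmd_io() are illustrative assumptions, not anything posted in this thread:

/* Hypothetical vectored companion to the NVME_URING_CMD_IO opcode defined
 * in the patch above; 0x81 is a placeholder, not a reserved value. */
#define NVME_URING_CMD_IO_VEC	_IOWR('N', 0x81, struct nvme_passthru_cmd64)

static int nvme_ns_async_cmd(struct nvme_ns *ns, struct io_uring_cmd *ioucmd)
{
	switch (ioucmd->cmd_op) {
	case NVME_URING_CMD_IO:
		/* flat buffer: cmd->addr/cmd->data_len describe one contiguous range */
		return nvme_uring_cmd_io(ns->ctrl, ns, ioucmd, false);
	case NVME_URING_CMD_IO_VEC:
		/* vectored: cmd->addr would point to a user iovec array (assumed) */
		return nvme_uring_cmd_io(ns->ctrl, ns, ioucmd, true);
	default:
		return -ENOTTY;
	}
}

The flag-based alternative would instead keep a single opcode and carve a bit out of the command structure, at the cost of overloading a field that the code above currently requires to be zero.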
* Re: [RFC 5/5] nvme: wire-up support for async-passthru on char-device. 2022-04-04 14:25 ` Kanchan Joshi @ 2022-04-05 6:02 ` Christoph Hellwig 2022-04-05 15:40 ` Jens Axboe 2022-04-05 15:49 ` Kanchan Joshi 0 siblings, 2 replies; 28+ messages in thread From: Christoph Hellwig @ 2022-04-05 6:02 UTC (permalink / raw) To: Kanchan Joshi Cc: Christoph Hellwig, Kanchan Joshi, Jens Axboe, io-uring, linux-nvme, Pavel Begunkov, Ming Lei, Luis Chamberlain, Pankaj Raghav, Javier González, Anuj Gupta On Mon, Apr 04, 2022 at 07:55:05PM +0530, Kanchan Joshi wrote: > > Something like this (untested) patch should help to separate > > the much better: > > It does, thanks. But the only thing is - it would be good to support > vectored-passthru too (i.e. NVME_IOCTL_IO64_CMD_VEC) for this path. > For the new opcode "NVME_URING_CMD_IO" , either we can change the > cmd-structure or flag-based handling so that vectored-io is supported. > Or we introduce NVME_URING_CMD_IO_VEC also for that. > Which one do you prefer? I agree vectored I/O support is useful. Do we even need to support the non-vectored case? Also I think we'll want admin command passthrough on /dev/nvmeX as well, but I'm fine solving the other items first. > > +static int nvme_ioctl_finish_metadata(struct bio *bio, int ret, > > + void __user *meta_ubuf) > > +{ > > + struct bio_integrity_payload *bip = bio_integrity(bio); > > + > > + if (bip) { > > + void *meta = bvec_virt(bip->bip_vec); > > + > > + if (!ret && bio_op(bio) == REQ_OP_DRV_IN && > > + copy_to_user(meta_ubuf, meta, bip->bip_vec->bv_len)) > > + ret = -EFAULT; > > Maybe it is better to move the check "bio_op(bio) != REQ_OP_DRV_IN" outside. > Because this can be common, and for that we can avoid entering into > the function call itself (i.e. nvme_ioctl_finish_metadata). Function calls are pretty cheap, but I'll see what we can do. I'll try to come up with a prep series to refactor the passthrough support for easier adding of the io_uring in the next days. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC 5/5] nvme: wire-up support for async-passthru on char-device. 2022-04-05 6:02 ` Christoph Hellwig @ 2022-04-05 15:40 ` Jens Axboe 2022-04-05 15:49 ` Kanchan Joshi 1 sibling, 0 replies; 28+ messages in thread From: Jens Axboe @ 2022-04-05 15:40 UTC (permalink / raw) To: Christoph Hellwig, Kanchan Joshi Cc: Kanchan Joshi, io-uring, linux-nvme, Pavel Begunkov, Ming Lei, Luis Chamberlain, Pankaj Raghav, Javier González, Anuj Gupta On 4/5/22 12:02 AM, Christoph Hellwig wrote: > On Mon, Apr 04, 2022 at 07:55:05PM +0530, Kanchan Joshi wrote: >>> Something like this (untested) patch should help to separate >>> the much better: >> >> It does, thanks. But the only thing is - it would be good to support >> vectored-passthru too (i.e. NVME_IOCTL_IO64_CMD_VEC) for this path. >> For the new opcode "NVME_URING_CMD_IO" , either we can change the >> cmd-structure or flag-based handling so that vectored-io is supported. >> Or we introduce NVME_URING_CMD_IO_VEC also for that. >> Which one do you prefer? > > I agree vectored I/O support is useful. > > Do we even need to support the non-vectored case? I would argue that 99% of the use cases will be non-vectored, and non-vectored is a lot cheaper to handle for deferrals as there's no iovec to keep persistent on the io_uring side. So yes, I'd say we _definitely_ want to have non-vectored be available and the default thing that applications use unless they explicitly want more than 1 segment in a request. -- Jens Axboe ^ permalink raw reply [flat|nested] 28+ messages in thread
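To illustrate the deferral cost mentioned here: a vectored request's iovec array lives in user memory, so if the submission cannot be issued inline, the kernel first has to copy that array somewhere stable and keep it around until the request is actually sent. Below is a generic sketch of that pattern; the helper name is made up and this is not actual io_uring code.

#include <linux/err.h>
#include <linux/slab.h>
#include <linux/uaccess.h>
#include <linux/uio.h>

/* Keep a user-supplied iovec array alive across a deferred submission. */
static struct iovec *persist_user_iovec(const struct iovec __user *uvec,
					unsigned int nr_segs)
{
	struct iovec *iov;

	iov = kmalloc_array(nr_segs, sizeof(*iov), GFP_KERNEL);
	if (!iov)
		return ERR_PTR(-ENOMEM);
	if (copy_from_user(iov, uvec, nr_segs * sizeof(*iov))) {
		kfree(iov);
		return ERR_PTR(-EFAULT);
	}
	return iov;	/* caller frees once the request has been issued */
}

A non-vectored request, by contrast, carries just an (address, length) pair inside the SQE, so there is nothing extra to copy or keep alive across the deferral.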
* Re: [RFC 5/5] nvme: wire-up support for async-passthru on char-device. 2022-04-05 6:02 ` Christoph Hellwig 2022-04-05 15:40 ` Jens Axboe @ 2022-04-05 15:49 ` Kanchan Joshi 2022-04-06 5:20 ` Kanchan Joshi 1 sibling, 1 reply; 28+ messages in thread From: Kanchan Joshi @ 2022-04-05 15:49 UTC (permalink / raw) To: Christoph Hellwig Cc: Kanchan Joshi, Jens Axboe, io-uring, linux-nvme, Pavel Begunkov, Ming Lei, Luis Chamberlain, Pankaj Raghav, Javier González, Anuj Gupta On Tue, Apr 5, 2022 at 11:32 AM Christoph Hellwig <[email protected]> wrote: > > On Mon, Apr 04, 2022 at 07:55:05PM +0530, Kanchan Joshi wrote: > > > Something like this (untested) patch should help to separate > > > the much better: > > > > It does, thanks. But the only thing is - it would be good to support > > vectored-passthru too (i.e. NVME_IOCTL_IO64_CMD_VEC) for this path. > > For the new opcode "NVME_URING_CMD_IO" , either we can change the > > cmd-structure or flag-based handling so that vectored-io is supported. > > Or we introduce NVME_URING_CMD_IO_VEC also for that. > > Which one do you prefer? > > I agree vectored I/O support is useful. > > Do we even need to support the non-vectored case? Would be good to have, I suppose. It helps keep things simple when user-space wants to use a single buffer (otherwise it must carry a pseudo iovec for that too). > Also I think we'll want admin command passthrough on /dev/nvmeX as > well, but I'm fine solving the other items first. > > > > +static int nvme_ioctl_finish_metadata(struct bio *bio, int ret, > > > + void __user *meta_ubuf) > > > +{ > > > + struct bio_integrity_payload *bip = bio_integrity(bio); > > > + > > > + if (bip) { > > > + void *meta = bvec_virt(bip->bip_vec); > > > + > > > + if (!ret && bio_op(bio) == REQ_OP_DRV_IN && > > > + copy_to_user(meta_ubuf, meta, bip->bip_vec->bv_len)) > > > + ret = -EFAULT; > > > > Maybe it is better to move the check "bio_op(bio) != REQ_OP_DRV_IN" outside. > > Because this can be common, and for that we can avoid entering into > > the function call itself (i.e. nvme_ioctl_finish_metadata). > > Function calls are pretty cheap, but I'll see what we can do. I'll try > to come up with a prep series to refactor the passthrough support for > easier adding of the io_uring in the next days. In that case we will base the newer version on its top. ^ permalink raw reply [flat|nested] 28+ messages in thread
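To make the user-space side of this concrete, here is a rough sketch of filling struct nvme_passthru_cmd64 both ways. The vectored convention used below (addr pointing at the iovec array, data_len carrying the entry count) is an assumption modeled on the existing blocking vectored ioctl, not something defined by the posted patches.

#include <stdint.h>
#include <sys/uio.h>
#include <linux/nvme_ioctl.h>

/* Non-vectored: addr is the flat data buffer, data_len its size in bytes. */
static void fill_plain(struct nvme_passthru_cmd64 *cmd, void *buf, uint32_t len)
{
	cmd->addr = (uintptr_t)buf;
	cmd->data_len = len;
}

/* With a vectored-only interface, the same single buffer has to be wrapped
 * in a one-entry "pseudo" iovec first (field usage assumed, see above). */
static void fill_pseudo_vec(struct nvme_passthru_cmd64 *cmd, struct iovec *one,
			    void *buf, uint32_t len)
{
	one->iov_base = buf;
	one->iov_len = len;
	cmd->addr = (uintptr_t)one;
	cmd->data_len = 1;
}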
* Re: [RFC 5/5] nvme: wire-up support for async-passthru on char-device. 2022-04-05 15:49 ` Kanchan Joshi @ 2022-04-06 5:20 ` Kanchan Joshi 2022-04-06 5:23 ` Christoph Hellwig 2022-04-23 17:53 ` Christoph Hellwig 0 siblings, 2 replies; 28+ messages in thread From: Kanchan Joshi @ 2022-04-06 5:20 UTC (permalink / raw) To: Christoph Hellwig Cc: Kanchan Joshi, Jens Axboe, io-uring, linux-nvme, Pavel Begunkov, Ming Lei, Luis Chamberlain, Pankaj Raghav, Javier González, Anuj Gupta > > Also I think we'll want admin command passthrough on /dev/nvmeX as > > well, but I'm fine solving the other items first. Sure, will add that in the second round. Should be fairly simple as we can reuse io-command work anyway. > > > > +static int nvme_ioctl_finish_metadata(struct bio *bio, int ret, > > > > + void __user *meta_ubuf) > > > > +{ > > > > + struct bio_integrity_payload *bip = bio_integrity(bio); > > > > + > > > > + if (bip) { > > > > + void *meta = bvec_virt(bip->bip_vec); > > > > + > > > > + if (!ret && bio_op(bio) == REQ_OP_DRV_IN && > > > > + copy_to_user(meta_ubuf, meta, bip->bip_vec->bv_len)) > > > > + ret = -EFAULT; > > > > > > Maybe it is better to move the check "bio_op(bio) != REQ_OP_DRV_IN" outside. > > > Because this can be common, and for that we can avoid entering into > > > the function call itself (i.e. nvme_ioctl_finish_metadata). > > > > Function calls are pretty cheap, but I'll see what we can do. I'll try > > to come up with a prep series to refactor the passthrough support for > > easier adding of the io_uring in the next days. > > In that case we will base the newer version on its top. But if it saves some cycles for you, and also the travel from nvme to linux-block tree - I can carry that refactoring as a prep patch in this series. Your call. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC 5/5] nvme: wire-up support for async-passthru on char-device. 2022-04-06 5:20 ` Kanchan Joshi @ 2022-04-06 5:23 ` Christoph Hellwig 2022-04-23 17:53 ` Christoph Hellwig 1 sibling, 0 replies; 28+ messages in thread From: Christoph Hellwig @ 2022-04-06 5:23 UTC (permalink / raw) To: Kanchan Joshi Cc: Christoph Hellwig, Kanchan Joshi, Jens Axboe, io-uring, linux-nvme, Pavel Begunkov, Ming Lei, Luis Chamberlain, Pankaj Raghav, Javier González, Anuj Gupta On Wed, Apr 06, 2022 at 10:50:14AM +0530, Kanchan Joshi wrote: > > > Also I think we'll want admin command passthrough on /dev/nvmeX as > > > well, but I'm fine solving the other items first. > > Sure, will add that in the second round. Should be fairly simple as we > can reuse io-command work anyway. Yes, I don't think it is a priority right now. And I'm actually glad you did this simple version first as that allows us to get the basics right first. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC 5/5] nvme: wire-up support for async-passthru on char-device. 2022-04-06 5:20 ` Kanchan Joshi 2022-04-06 5:23 ` Christoph Hellwig @ 2022-04-23 17:53 ` Christoph Hellwig 2022-04-25 17:38 ` Kanchan Joshi 1 sibling, 1 reply; 28+ messages in thread From: Christoph Hellwig @ 2022-04-23 17:53 UTC (permalink / raw) To: Kanchan Joshi Cc: Christoph Hellwig, Kanchan Joshi, Jens Axboe, io-uring, linux-nvme, Pavel Begunkov, Ming Lei, Luis Chamberlain, Pankaj Raghav, Javier González, Anuj Gupta On Wed, Apr 06, 2022 at 10:50:14AM +0530, Kanchan Joshi wrote: > > In that case we will base the newer version on its top. > But if it saves some cycles for you, and also the travel from nvme to > linux-block tree - I can carry that refactoring as a prep patch in > this series. Your call. FYI, this is what I have so far: http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/nvme-passthrough-refactor the idea would be to use these lower level helpers for uring, and not really share the higher level function at all. This does create a little extra code, but I think it'll be more modular and better maintainable. Feel free to pull this in if it helps you, otherwise I'll try to find some time to do more than just light testing and will post it. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC 5/5] nvme: wire-up support for async-passthru on char-device. 2022-04-23 17:53 ` Christoph Hellwig @ 2022-04-25 17:38 ` Kanchan Joshi 2022-04-29 13:16 ` Kanchan Joshi 0 siblings, 1 reply; 28+ messages in thread From: Kanchan Joshi @ 2022-04-25 17:38 UTC (permalink / raw) To: Christoph Hellwig Cc: Kanchan Joshi, Jens Axboe, io-uring, linux-nvme, Pavel Begunkov, Ming Lei, Luis Chamberlain, Pankaj Raghav, Javier González, Anuj Gupta [-- Attachment #1: Type: text/plain, Size: 1389 bytes --] On Sat, Apr 23, 2022 at 07:53:09PM +0200, Christoph Hellwig wrote: >On Wed, Apr 06, 2022 at 10:50:14AM +0530, Kanchan Joshi wrote: >> > In that case we will base the newer version on its top. >> But if it saves some cycles for you, and also the travel from nvme to >> linux-block tree - I can carry that refactoring as a prep patch in >> this series. Your call. > >FYI, this is what I have so far: > >http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/nvme-passthrough-refactor > >the idea would be to use these lower level helpers for uring, and >not really share the higher level function at all. This does create >a little extra code, but I think it'll be more modular and better >maintainable. Feel free to pull this in if it helps you, otherwise >I'll try to find some time to do more than just light testing and >will post it. Thanks for sharing. So I had picked your previous version, and this one streamlines meta handling further. But the problem is that the bip gets freed before we reach this point - +static int nvme_free_user_metadata(struct bio *bio, void __user *ubuf, int ret) +{ + struct bio_integrity_payload *bip = bio_integrity(bio); + void *buf = bvec_virt(bip->bip_vec); + + if (!ret && bio_op(bio) == REQ_OP_DRV_IN && + copy_to_user(ubuf, buf, bip->bip_vec->bv_len)) Without the bip, we cannot kill the current meta/meta_len fields. [-- Attachment #2: Type: text/plain, Size: 0 bytes --] ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC 5/5] nvme: wire-up support for async-passthru on char-device. 2022-04-25 17:38 ` Kanchan Joshi @ 2022-04-29 13:16 ` Kanchan Joshi 0 siblings, 0 replies; 28+ messages in thread From: Kanchan Joshi @ 2022-04-29 13:16 UTC (permalink / raw) To: Christoph Hellwig Cc: Kanchan Joshi, Jens Axboe, io-uring, linux-nvme, Pavel Begunkov, Ming Lei, Luis Chamberlain, Pankaj Raghav, Javier González, Anuj Gupta [-- Attachment #1: Type: text/plain, Size: 2776 bytes --] On Mon, Apr 25, 2022 at 11:08:03PM +0530, Kanchan Joshi wrote: >On Sat, Apr 23, 2022 at 07:53:09PM +0200, Christoph Hellwig wrote: >>On Wed, Apr 06, 2022 at 10:50:14AM +0530, Kanchan Joshi wrote: >>>> In that case we will base the newer version on its top. >>>But if it saves some cycles for you, and also the travel from nvme to >>>linux-block tree - I can carry that refactoring as a prep patch in >>>this series. Your call. >> >>FYI, this is what I have so far: >> >>http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/nvme-passthrough-refactor >> >>the idea would be to use these lower level helpers for uring, and >>not really share the higher level function at all. This does create >>a little extra code, but I think it'll be more modular and better >>maintainable. Feel free to pull this in if it helps you, otherwise >>I'll try to find some time to do more than just light testing and >>will post it. > >Thanks for sharing. >So I had picked your previous version, and this one streamlines meta >handling further. But the problem is bip gets freed before we reach to >this point - > >+static int nvme_free_user_metadata(struct bio *bio, void __user *ubuf, int ret) >+{ >+ struct bio_integrity_payload *bip = bio_integrity(bio); >+ void *buf = bvec_virt(bip->bip_vec); >+ >+ if (!ret && bio_op(bio) == REQ_OP_DRV_IN && >+ copy_to_user(ubuf, buf, bip->bip_vec->bv_len)) > >Without bip, we cannot kill current meta/meta_len fields. And by this I mean we cannot keep io_uring_cmd this way - +struct io_uring_cmd { + struct file *file; + void *cmd; + /* for irq-completion - if driver requires doing stuff in task-context*/ + void (*driver_cb)(struct io_uring_cmd *cmd); + u32 flags; + u32 cmd_op; + + void *private; + + /* + * Out of band data can be used for data that is not the main data. + * E.g. block device PI/metadata or additional information. + */ + void __user *oob_user; +}; Rather we need to backtrack to pdu[28], since nvme would need all that space. +struct io_uring_cmd { + struct file *file; + void *cmd; + /* for irq-completion - if driver requires doing stuff in task-context*/ + void (*driver_cb)(struct io_uring_cmd *cmd); + u32 flags; + u32 cmd_op; + u32 unused; + u8 pdu[28]; /* available inline for free use */ +}; +struct nvme_uring_cmd_pdu { + union { + struct bio *bio; + struct request *req; + }; + void *meta; /* kernel-resident buffer */ + void __user *meta_buffer; + u32 meta_len; +} __packed; [-- Attachment #2: Type: text/plain, Size: 0 bytes --] ^ permalink raw reply [flat|nested] 28+ messages in thread
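As an aside on the pdu[28] layout shown above: the overlay approach leans on a compile-time size guard, which is what the "Expect build errors if this grows larger than that" comment earlier in the thread refers to. A minimal sketch of such a guard, with its exact form assumed:

#include <linux/build_bug.h>
#include <linux/io_uring.h>

static struct nvme_uring_cmd_pdu *nvme_uring_cmd_pdu(struct io_uring_cmd *ioucmd)
{
	/* refuse to build if the driver-private pdu outgrows the generic space */
	BUILD_BUG_ON(sizeof(struct nvme_uring_cmd_pdu) > sizeof(ioucmd->pdu));
	return (struct nvme_uring_cmd_pdu *)&ioucmd->pdu;
}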
* Re: [RFC 0/5] big-cqe based uring-passthru 2022-04-01 11:03 ` [RFC 0/5] big-cqe based uring-passthru Kanchan Joshi ` (4 preceding siblings ...) [not found] ` <CGME20220401110838epcas5p2c1a2e776923dfe5bf65a3e7946820150@epcas5p2.samsung.com> @ 2022-04-04 7:21 ` Christoph Hellwig 2022-04-05 15:37 ` Kanchan Joshi 5 siblings, 1 reply; 28+ messages in thread From: Christoph Hellwig @ 2022-04-04 7:21 UTC (permalink / raw) To: Kanchan Joshi Cc: axboe, hch, io-uring, linux-nvme, asml.silence, ming.lei, mcgrof, pankydev8, javier, joshiiitr, anuj20.g I really can't get excited about the pdu thingy. Here is a patch (on top of the series and the patch sent in reply to patch 4) that does away with it and just adds a oob_user field to struct io_uring_cmd to simplify the handling a fair bit: --- From 426fa5de1d5f5a718b797eda2fc3ea47010662f7 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig <[email protected]> Date: Mon, 4 Apr 2022 08:24:43 +0200 Subject: io_uring: explicit support for out of band data in io_uring_cmd Instead of the magic pdu byte array, which in its current form causes unaligned pointers and a lot of casting add explicit support for out of band data in struct io_uring_cmd and just leave a normal private data pointer to the driver. Signed-off-by: Christoph Hellwig <[email protected]> --- drivers/nvme/host/ioctl.c | 35 +++++++---------------------------- include/linux/io_uring.h | 10 ++++++++-- 2 files changed, 15 insertions(+), 30 deletions(-) diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c index ea6cfd4321942..b93c6ecfcd2ab 100644 --- a/drivers/nvme/host/ioctl.c +++ b/drivers/nvme/host/ioctl.c @@ -37,27 +37,9 @@ static int nvme_ioctl_finish_metadata(struct bio *bio, int ret, return ret; } -/* - * This overlays struct io_uring_cmd pdu. - * Expect build errors if this grows larger than that. 
- */ -struct nvme_uring_cmd_pdu { - union { - struct bio *bio; - struct request *req; - }; - void __user *meta_buffer; -} __packed; - -static struct nvme_uring_cmd_pdu *nvme_uring_cmd_pdu(struct io_uring_cmd *ioucmd) -{ - return (struct nvme_uring_cmd_pdu *)&ioucmd->pdu; -} - static void nvme_uring_task_cb(struct io_uring_cmd *ioucmd) { - struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd); - struct request *req = pdu->req; + struct request *req = ioucmd->private; struct bio *bio = req->bio; int status; u64 result; @@ -71,7 +53,7 @@ static void nvme_uring_task_cb(struct io_uring_cmd *ioucmd) blk_mq_free_request(req); blk_rq_unmap_user(bio); - status = nvme_ioctl_finish_metadata(bio, status, pdu->meta_buffer); + status = nvme_ioctl_finish_metadata(bio, status, ioucmd->oob_user); result = le64_to_cpu(nvme_req(req)->result.u64); io_uring_cmd_done(ioucmd, status, result); } @@ -79,12 +61,10 @@ static void nvme_uring_task_cb(struct io_uring_cmd *ioucmd) static void nvme_end_async_pt(struct request *req, blk_status_t err) { struct io_uring_cmd *ioucmd = req->end_io_data; - struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd); - /* extract bio before reusing the same field for request */ - struct bio *bio = pdu->bio; + struct bio *bio = ioucmd->private; - pdu->req = req; req->bio = bio; + ioucmd->private = req; /* this takes care of moving rest of completion-work to task context */ io_uring_cmd_complete_in_task(ioucmd, nvme_uring_task_cb); @@ -381,7 +361,6 @@ static int nvme_uring_cmd_io(struct nvme_ctrl *ctrl, struct nvme_ns *ns, { struct nvme_passthru_cmd64 *cmd = (struct nvme_passthru_cmd64 *)ioucmd->cmd; - struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd); struct request_queue *q = ns ? ns->queue : ctrl->admin_q; struct nvme_command c; struct request *req; @@ -415,10 +394,10 @@ static int nvme_uring_cmd_io(struct nvme_ctrl *ctrl, struct nvme_ns *ns, return PTR_ERR(req); /* to free bio on completion, as req->bio will be null at that time */ - pdu->bio = req->bio; - pdu->meta_buffer = nvme_to_user_ptr(cmd->metadata); - req->end_io_data = ioucmd; + ioucmd->private = req->bio; + ioucmd->oob_user = nvme_to_user_ptr(cmd->metadata); + req->end_io_data = ioucmd; blk_execute_rq_nowait(req, 0, nvme_end_async_pt); return -EIOCBQUEUED; } diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h index 0aba7b50cde65..95b56e45cd539 100644 --- a/include/linux/io_uring.h +++ b/include/linux/io_uring.h @@ -23,8 +23,14 @@ struct io_uring_cmd { u32 flags; u32 cmd_op; u16 cmd_len; - u16 unused; - u8 pdu[28]; /* available inline for free use */ + + void *private; + + /* + * Out of band data can be used for data that is not the main data. + * E.g. block device PI/metadata or additional information. + */ + void __user *oob_user; }; #if defined(CONFIG_IO_URING) -- 2.30.2 ^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [RFC 0/5] big-cqe based uring-passthru 2022-04-04 7:21 ` [RFC 0/5] big-cqe based uring-passthru Christoph Hellwig @ 2022-04-05 15:37 ` Kanchan Joshi 0 siblings, 0 replies; 28+ messages in thread From: Kanchan Joshi @ 2022-04-05 15:37 UTC (permalink / raw) To: Christoph Hellwig Cc: Kanchan Joshi, Jens Axboe, io-uring, linux-nvme, Pavel Begunkov, Ming Lei, Luis Chamberlain, Pankaj Raghav, Javier González, Anuj Gupta On Mon, Apr 4, 2022 at 12:51 PM Christoph Hellwig <[email protected]> wrote: > > I really can't get excited about the pdu thingy. Here is a patch > (on top of the series and the patch sent in reply to patch 4) that > does away with it and just adds a oob_user field to struct io_uring_cmd > to simplify the handling a fair bit: Fine. We discussed the tradeoff before, and the concern was generic-space reduction for other use-cases. But this is a kernel-only change, so we can alter it if/when a new use-case requires more space. It seems fine to move ahead with what you proposed. ^ permalink raw reply [flat|nested] 28+ messages in thread