* [PATCH for-next 0/4] nvme-multipathing for uring-passthrough
From: Kanchan Joshi @ 2022-07-11 11:01 UTC
To: hch, sagi, kbusch, axboe
Cc: io-uring, linux-nvme, linux-block, asml.silence, joshiiitr, anuj20.g, gost.dev, Kanchan Joshi

NVMe passthrough lacks multipathing capability, and some of us have
already expressed interest in seeing this plumbed - most recently
during LSFMM, around two months back.

This series wires up multipathing for uring-passthrough commands. The
attempt is not to affect the common path (i.e. when requeue/failover
does not trigger) with allocation or deferral.

The most important design bit is to treat "struct io_uring_cmd" the
same way "struct bio" is treated by the block-based nvme multipath.
Uring-commands are queued when no path is available, and resubmitted
on discovery of a new path. Also, if a passthrough command on the
multipath node fails, it is resubmitted on a different path.

Testing:
Done with upstream fio, which supports uring-passthrough:

fio -iodepth=16 -rw=randread -ioengine=io_uring_cmd -bs=4k -numjobs=4
-size=1G -iodepth_batch_submit=16 -group_reporting -cmd_type=nvme
-filename=/dev/ng0n1 -name=uring-pt

1. Multiple failover - every command is retried 1-5 times before
   completion
2. Fail nvme_find_path() - this tests completion post requeue
3. Combine the above two
4. Repeat the above, but for passthrough commands which do not
   generate a bio (e.g. the flush command)

Anuj Gupta (2):
  nvme: compact nvme_uring_cmd_pdu struct
  nvme-multipath: add multipathing for uring-passthrough commands

Kanchan Joshi (2):
  io_uring, nvme: rename a function
  io_uring: grow a field in struct io_uring_cmd

 drivers/nvme/host/ioctl.c     | 157 ++++++++++++++++++++++++++++++----
 drivers/nvme/host/multipath.c |  36 +++++++-
 drivers/nvme/host/nvme.h      |  26 ++++++
 include/linux/io_uring.h      |  44 +++++++++-
 io_uring/uring_cmd.c          |   4 +-
 5 files changed, 237 insertions(+), 30 deletions(-)

-- 
2.25.1
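For readers new to the interface being extended: uring-passthrough drives
NVMe commands through the per-namespace char device (/dev/ngXnY) via
IORING_OP_URING_CMD, which is what the fio invocation above exercises.
Below is a minimal single-command sketch of that submission path. It is
illustrative only, not part of the series; it assumes liburing >= 2.2, a
namespace formatted with 4KiB LBAs, and the struct nvme_uring_cmd layout
from <linux/nvme_ioctl.h> of this era.

/*
 * Minimal userspace sketch: one uring-passthrough read.
 * Assumptions: liburing >= 2.2, 4KiB LBA format, NVME_URING_CMD_IO uAPI.
 */
#include <liburing.h>
#include <linux/nvme_ioctl.h>
#include <stdint.h>
#include <string.h>

static int pt_read(int ngfd, uint32_t nsid, uint64_t slba, void *buf)
{
        struct io_uring ring;
        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;
        struct nvme_uring_cmd *cmd;
        int ret;

        /* passthrough needs big SQEs (inline command) and big CQEs */
        ret = io_uring_queue_init(8, &ring,
                                  IORING_SETUP_SQE128 | IORING_SETUP_CQE32);
        if (ret < 0)
                return ret;

        sqe = io_uring_get_sqe(&ring);
        memset(sqe, 0, sizeof(*sqe));
        sqe->opcode = IORING_OP_URING_CMD;
        sqe->fd = ngfd;                 /* char device, e.g. /dev/ng0n1 */
        sqe->cmd_op = NVME_URING_CMD_IO;

        cmd = (struct nvme_uring_cmd *)sqe->cmd;
        memset(cmd, 0, sizeof(*cmd));
        cmd->opcode = 0x02;                     /* NVMe Read */
        cmd->nsid = nsid;                       /* e.g. from NVME_IOCTL_ID */
        cmd->addr = (uint64_t)(uintptr_t)buf;
        cmd->data_len = 4096;
        cmd->cdw10 = slba & 0xffffffff;         /* SLBA, low 32 bits */
        cmd->cdw11 = slba >> 32;                /* SLBA, high 32 bits */
        cmd->cdw12 = 0;                         /* NLB is 0's based: 1 LBA */

        io_uring_submit(&ring);
        ret = io_uring_wait_cqe(&ring, &cqe);
        if (!ret)
                ret = cqe->res;                 /* 0 on success */
        io_uring_cqe_seen(&ring, cqe);
        io_uring_queue_exit(&ring);
        return ret;
}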
* [PATCH for-next 1/4] io_uring, nvme: rename a function
From: Kanchan Joshi @ 2022-07-11 11:01 UTC
To: hch, sagi, kbusch, axboe
Cc: io-uring, linux-nvme, linux-block, asml.silence, joshiiitr, anuj20.g, gost.dev, Kanchan Joshi

io_uring_cmd_complete_in_task() is a bit of a misnomer. It schedules a
callback function for execution in task context. What the callback does
is private to the provider, and it does not have to be a completion. So
rename it to io_uring_cmd_execute_in_task() to allow more generic use.

Signed-off-by: Kanchan Joshi <[email protected]>
---
 drivers/nvme/host/ioctl.c | 2 +-
 include/linux/io_uring.h  | 4 ++--
 io_uring/uring_cmd.c      | 4 ++--
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index a2e89db1cd63..9227e07f717e 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -395,7 +395,7 @@ static void nvme_uring_cmd_end_io(struct request *req, blk_status_t err)
        pdu->req = req;
        req->bio = bio;
        /* this takes care of moving rest of completion-work to task context */
-       io_uring_cmd_complete_in_task(ioucmd, nvme_uring_task_cb);
+       io_uring_cmd_execute_in_task(ioucmd, nvme_uring_task_cb);
 }
 
 static int nvme_uring_cmd_io(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
index 4a2f6cc5a492..54063d67506b 100644
--- a/include/linux/io_uring.h
+++ b/include/linux/io_uring.h
@@ -29,7 +29,7 @@ struct io_uring_cmd {
 
 #if defined(CONFIG_IO_URING)
 void io_uring_cmd_done(struct io_uring_cmd *cmd, ssize_t ret, ssize_t res2);
-void io_uring_cmd_complete_in_task(struct io_uring_cmd *ioucmd,
+void io_uring_cmd_execute_in_task(struct io_uring_cmd *ioucmd,
                        void (*task_work_cb)(struct io_uring_cmd *));
 struct sock *io_uring_get_socket(struct file *file);
 void __io_uring_cancel(bool cancel_all);
@@ -59,7 +59,7 @@ static inline void io_uring_cmd_done(struct io_uring_cmd *cmd, ssize_t ret,
                ssize_t ret2)
 {
 }
-static inline void io_uring_cmd_complete_in_task(struct io_uring_cmd *ioucmd,
+static inline void io_uring_cmd_execute_in_task(struct io_uring_cmd *ioucmd,
                        void (*task_work_cb)(struct io_uring_cmd *))
 {
 }
diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
index 0a421ed51e7e..d409b99abac5 100644
--- a/io_uring/uring_cmd.c
+++ b/io_uring/uring_cmd.c
@@ -16,7 +16,7 @@ static void io_uring_cmd_work(struct io_kiocb *req, bool *locked)
        ioucmd->task_work_cb(ioucmd);
 }
 
-void io_uring_cmd_complete_in_task(struct io_uring_cmd *ioucmd,
+void io_uring_cmd_execute_in_task(struct io_uring_cmd *ioucmd,
                        void (*task_work_cb)(struct io_uring_cmd *))
 {
        struct io_kiocb *req = cmd_to_io_kiocb(ioucmd);
@@ -25,7 +25,7 @@ void io_uring_cmd_complete_in_task(struct io_uring_cmd *ioucmd,
        req->io_task_work.func = io_uring_cmd_work;
        io_req_task_work_add(req);
 }
-EXPORT_SYMBOL_GPL(io_uring_cmd_complete_in_task);
+EXPORT_SYMBOL_GPL(io_uring_cmd_execute_in_task);
 
 static inline void io_req_set_cqe32_extra(struct io_kiocb *req,
                                          u64 extra1, u64 extra2)
-- 
2.25.1
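To illustrate the "more generic use" the rename enables, here is a sketch
of a hypothetical ->uring_cmd() provider that uses the helper for work
that is not a completion. All mydrv_* names are made up for illustration;
only io_uring_cmd_execute_in_task() and io_uring_cmd_done() come from the
actual API.

/* Hypothetical provider code, not from this series. */
static int mydrv_resubmit(struct io_uring_cmd *ioucmd);  /* illustrative */

static void mydrv_retry_cb(struct io_uring_cmd *ioucmd)
{
        /*
         * runs in the original submitter's task context, so it may
         * sleep, touch the task's mm/creds, and re-issue the command
         */
        if (mydrv_resubmit(ioucmd) != -EIOCBQUEUED)
                io_uring_cmd_done(ioucmd, -EIO, 0);
}

static void mydrv_handle_path_loss(struct io_uring_cmd *ioucmd)
{
        /*
         * not a completion: defer a retry to task context, which is
         * exactly the pattern the multipath patches later in this
         * series build on
         */
        io_uring_cmd_execute_in_task(ioucmd, mydrv_retry_cb);
}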
* [PATCH for-next 2/4] nvme: compact nvme_uring_cmd_pdu struct
From: Kanchan Joshi @ 2022-07-11 11:01 UTC
To: hch, sagi, kbusch, axboe
Cc: io-uring, linux-nvme, linux-block, asml.silence, joshiiitr, anuj20.g, gost.dev

From: Anuj Gupta <[email protected]>

Mark this struct packed so that we can create a bit more space in its
container, i.e. io_uring_cmd. This is in preparation to support
multipathing on uring-passthrough. Move its definition to nvme.h as
well.

Signed-off-by: Anuj Gupta <[email protected]>
---
 drivers/nvme/host/ioctl.c | 14 --------------
 drivers/nvme/host/nvme.h  | 14 ++++++++++++++
 2 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index 9227e07f717e..fc02eddd4977 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -340,20 +340,6 @@ struct nvme_uring_data {
        __u32   timeout_ms;
 };
 
-/*
- * This overlays struct io_uring_cmd pdu.
- * Expect build errors if this grows larger than that.
- */
-struct nvme_uring_cmd_pdu {
-       union {
-               struct bio *bio;
-               struct request *req;
-       };
-       void *meta; /* kernel-resident buffer */
-       void __user *meta_buffer;
-       u32 meta_len;
-};
-
 static inline struct nvme_uring_cmd_pdu *nvme_uring_cmd_pdu(
                struct io_uring_cmd *ioucmd)
 {
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 7323a2f61126..9d3ff6feda06 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -165,6 +165,20 @@ struct nvme_request {
        struct nvme_ctrl        *ctrl;
 };
 
+/*
+ * This overlays struct io_uring_cmd pdu.
+ * Expect build errors if this grows larger than that.
+ */
+struct nvme_uring_cmd_pdu {
+       union {
+               struct bio *bio;
+               struct request *req;
+       };
+       void *meta; /* kernel-resident buffer */
+       void __user *meta_buffer;
+       u32 meta_len;
+} __packed;
+
 /*
  * Mark a bio as coming in through the mpath node.
  */
-- 
2.25.1
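The arithmetic behind the compaction, shown as a standalone check (a
sketch with stand-in types, assuming a 64-bit build; not code from the
series): without __packed the trailing u32 pads the struct out to 32
bytes, while packed it is 8 + 8 + 8 + 4 = 28 bytes. That is what lets
the next patch shrink io_uring_cmd's pdu[] from 32 to 28 bytes and still
fit this pdu inside it.

/* Layout sketch only; plain types stand in for the kernel ones. */
struct pdu_layout_demo {
        union {
                void *bio;              /* stand-in for struct bio * */
                void *req;              /* stand-in for struct request * */
        };                              /* 8 bytes */
        void *meta;                     /* 8 bytes */
        void *meta_buffer;              /* 8 bytes */
        unsigned int meta_len;          /* 4 bytes */
} __attribute__((packed));

_Static_assert(sizeof(struct pdu_layout_demo) == 28,
               "packed pdu must be 28 bytes to fit pdu[28]");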
* [PATCH for-next 3/4] io_uring: grow a field in struct io_uring_cmd
From: Kanchan Joshi @ 2022-07-11 11:01 UTC
To: hch, sagi, kbusch, axboe
Cc: io-uring, linux-nvme, linux-block, asml.silence, joshiiitr, anuj20.g, gost.dev, Kanchan Joshi

Use the leftover space to carve out a 'next' field that enables linking
of io_uring_cmd structs. Also introduce a list head and a few helpers.

This is in preparation to support nvme-multipath, allowing multiple
uring passthrough commands to be queued.

Signed-off-by: Kanchan Joshi <[email protected]>
Signed-off-by: Anuj Gupta <[email protected]>
---
 include/linux/io_uring.h | 38 ++++++++++++++++++++++++++++++++++++--
 1 file changed, 36 insertions(+), 2 deletions(-)

diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
index 54063d67506b..d734599cbcd7 100644
--- a/include/linux/io_uring.h
+++ b/include/linux/io_uring.h
@@ -22,9 +22,14 @@ struct io_uring_cmd {
        const void      *cmd;
        /* callback to defer completions to task context */
        void (*task_work_cb)(struct io_uring_cmd *cmd);
+       struct io_uring_cmd     *next;
        u32             cmd_op;
-       u32             pad;
-       u8              pdu[32]; /* available inline for free use */
+       u8              pdu[28]; /* available inline for free use */
+};
+
+struct ioucmd_list {
+       struct io_uring_cmd *head;
+       struct io_uring_cmd *tail;
 };
 
 #if defined(CONFIG_IO_URING)
@@ -54,6 +59,27 @@ static inline void io_uring_free(struct task_struct *tsk)
        if (tsk->io_uring)
                __io_uring_free(tsk);
 }
+
+static inline struct io_uring_cmd *ioucmd_list_get(struct ioucmd_list *il)
+{
+       struct io_uring_cmd *ioucmd = il->head;
+
+       il->head = il->tail = NULL;
+
+       return ioucmd;
+}
+
+static inline void ioucmd_list_add(struct ioucmd_list *il,
+               struct io_uring_cmd *ioucmd)
+{
+       ioucmd->next = NULL;
+
+       if (il->tail)
+               il->tail->next = ioucmd;
+       else
+               il->head = ioucmd;
+       il->tail = ioucmd;
+}
 #else
 static inline void io_uring_cmd_done(struct io_uring_cmd *cmd, ssize_t ret,
                ssize_t ret2)
@@ -80,6 +106,14 @@ static inline const char *io_uring_get_opcode(u8 opcode)
 {
        return "";
 }
+static inline struct io_uring_cmd *ioucmd_list_get(struct ioucmd_list *il)
+{
+       return NULL;
+}
+static inline void ioucmd_list_add(struct ioucmd_list *il,
+               struct io_uring_cmd *ioucmd)
+{
+}
 #endif
 
 #endif
-- 
2.25.1
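The helpers are meant for the classic parked-list pattern: producers
append under a lock, and the consumer detaches the whole chain in one
shot and then walks it unlocked. A sketch follows, mirroring how patch 4
uses them; the lock placement and the resubmit() helper are illustrative
names, not series code.

static void park_cmd(struct ioucmd_list *il, spinlock_t *lock,
                     struct io_uring_cmd *ioucmd)
{
        /* producer: no usable path right now, park the command */
        spin_lock_irq(lock);
        ioucmd_list_add(il, ioucmd);
        spin_unlock_irq(lock);
}

static void drain_cmds(struct ioucmd_list *il, spinlock_t *lock)
{
        struct io_uring_cmd *ioucmd, *next;

        /* consumer: detach the whole chain in one shot... */
        spin_lock_irq(lock);
        next = ioucmd_list_get(il);
        spin_unlock_irq(lock);

        /* ...then walk it lock-free via the new 'next' field */
        while ((ioucmd = next) != NULL) {
                next = ioucmd->next;
                ioucmd->next = NULL;
                resubmit(ioucmd);       /* hypothetical helper */
        }
}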
* [PATCH for-next 4/4] nvme-multipath: add multipathing for uring-passthrough commands
From: Kanchan Joshi @ 2022-07-11 11:01 UTC
To: hch, sagi, kbusch, axboe
Cc: io-uring, linux-nvme, linux-block, asml.silence, joshiiitr, anuj20.g, gost.dev, Kanchan Joshi

From: Anuj Gupta <[email protected]>

This heavily reuses the nvme-mpath infrastructure for the block path.
The block path operates on a bio, and uses it as the long-lived object
across requeue attempts. For the passthrough interface, io_uring_cmd
happens to be that object, so add a requeue list for it.

If no path is available, uring-cmds are added to that list and
resubmitted on discovery of a new path. The existing requeue handler
(nvme_requeue_work) is extended to do this resubmission.

For failed commands, failover is done directly (i.e. without first
queuing into the list) by resubmitting the individual command from a
task-work based callback.

Suggested-by: Sagi Grimberg <[email protected]>
Co-developed-by: Kanchan Joshi <[email protected]>
Signed-off-by: Kanchan Joshi <[email protected]>
Signed-off-by: Anuj Gupta <[email protected]>
---
 drivers/nvme/host/ioctl.c     | 141 ++++++++++++++++++++++++++++++++--
 drivers/nvme/host/multipath.c |  36 ++++++++-
 drivers/nvme/host/nvme.h      |  12 +++
 include/linux/io_uring.h      |   2 +
 4 files changed, 182 insertions(+), 9 deletions(-)

diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index fc02eddd4977..281c21d3c67d 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -340,12 +340,6 @@ struct nvme_uring_data {
        __u32   timeout_ms;
 };
 
-static inline struct nvme_uring_cmd_pdu *nvme_uring_cmd_pdu(
-               struct io_uring_cmd *ioucmd)
-{
-       return (struct nvme_uring_cmd_pdu *)&ioucmd->pdu;
-}
-
 static void nvme_uring_task_cb(struct io_uring_cmd *ioucmd)
 {
        struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd);
@@ -448,6 +442,14 @@ static int nvme_uring_cmd_io(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
        pdu->meta_buffer = nvme_to_user_ptr(d.metadata);
        pdu->meta_len = d.metadata_len;
 
+       if (issue_flags & IO_URING_F_MPATH) {
+               req->cmd_flags |= REQ_NVME_MPATH;
+               /*
+                * we would need the buffer address (d.addr field) if we have
+                * to retry the command. Store it by repurposing ioucmd->cmd
+                */
+               ioucmd->cmd = (void *)d.addr;
+       }
        blk_execute_rq_nowait(req, false);
        return -EIOCBQUEUED;
 }
@@ -665,12 +667,135 @@ int nvme_ns_head_chr_uring_cmd(struct io_uring_cmd *ioucmd,
        int srcu_idx = srcu_read_lock(&head->srcu);
        struct nvme_ns *ns = nvme_find_path(head);
        int ret = -EINVAL;
+       struct device *dev = &head->cdev_device;
+
+       if (likely(ns)) {
+               ret = nvme_ns_uring_cmd(ns, ioucmd,
+                               issue_flags | IO_URING_F_MPATH);
+       } else if (nvme_available_path(head)) {
+               struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd);
+               struct nvme_uring_cmd *payload = NULL;
+
+               dev_warn_ratelimited(dev, "no usable path - requeuing I/O\n");
+               /*
+                * We may requeue two kinds of uring commands:
+                * 1. command failed post submission. pdu->req will be non-null
+                * for this
+                * 2. command that could not be submitted because path was not
+                * available. For this pdu->req is set to NULL.
+                */
+               pdu->req = NULL;
+               /*
+                * passthrough command will not be available during retry as it
+                * is embedded in io_uring's SQE. Hence we allocate/copy here
+                */
+               payload = kmalloc(sizeof(struct nvme_uring_cmd), GFP_KERNEL);
+               if (!payload) {
+                       ret = -ENOMEM;
+                       goto out_unlock;
+               }
+               memcpy(payload, ioucmd->cmd, sizeof(struct nvme_uring_cmd));
+               ioucmd->cmd = payload;
 
-       if (ns)
-               ret = nvme_ns_uring_cmd(ns, ioucmd, issue_flags);
+               spin_lock_irq(&head->requeue_ioucmd_lock);
+               ioucmd_list_add(&head->requeue_ioucmd_list, ioucmd);
+               spin_unlock_irq(&head->requeue_ioucmd_lock);
+               ret = -EIOCBQUEUED;
+       } else {
+               dev_warn_ratelimited(dev, "no available path - failing I/O\n");
+       }
+out_unlock:
        srcu_read_unlock(&head->srcu, srcu_idx);
        return ret;
 }
+
+int nvme_uring_cmd_io_retry(struct nvme_ns *ns, struct request *oreq,
+               struct io_uring_cmd *ioucmd, struct nvme_uring_cmd_pdu *pdu)
+{
+       struct nvme_ctrl *ctrl = ns->ctrl;
+       struct request_queue *q = ns ? ns->queue : ctrl->admin_q;
+       struct nvme_uring_data d;
+       struct nvme_command c;
+       struct request *req;
+       struct bio *obio = oreq->bio;
+       void *meta = NULL;
+
+       memcpy(&c, nvme_req(oreq)->cmd, sizeof(struct nvme_command));
+       d.metadata = (__u64)pdu->meta_buffer;
+       d.metadata_len = pdu->meta_len;
+       d.timeout_ms = oreq->timeout;
+       d.addr = (__u64)ioucmd->cmd;
+       if (obio) {
+               d.data_len = obio->bi_iter.bi_size;
+               blk_rq_unmap_user(obio);
+       } else {
+               d.data_len = 0;
+       }
+       blk_mq_free_request(oreq);
+       req = nvme_alloc_user_request(q, &c, nvme_to_user_ptr(d.addr),
+                       d.data_len, nvme_to_user_ptr(d.metadata),
+                       d.metadata_len, 0, &meta, d.timeout_ms ?
+                       msecs_to_jiffies(d.timeout_ms) : 0,
+                       ioucmd->cmd_op == NVME_URING_CMD_IO_VEC, 0, 0);
+       if (IS_ERR(req))
+               return PTR_ERR(req);
+
+       req->end_io = nvme_uring_cmd_end_io;
+       req->end_io_data = ioucmd;
+       pdu->bio = req->bio;
+       pdu->meta = meta;
+       req->cmd_flags |= REQ_NVME_MPATH;
+       blk_execute_rq_nowait(req, false);
+       return -EIOCBQUEUED;
+}
+
+void nvme_ioucmd_mpath_retry(struct io_uring_cmd *ioucmd)
+{
+       struct cdev *cdev = file_inode(ioucmd->file)->i_cdev;
+       struct nvme_ns_head *head = container_of(cdev, struct nvme_ns_head,
+                       cdev);
+       int srcu_idx = srcu_read_lock(&head->srcu);
+       struct nvme_ns *ns = nvme_find_path(head);
+       unsigned int issue_flags = IO_URING_F_SQE128 | IO_URING_F_CQE32 |
+                       IO_URING_F_MPATH;
+       struct device *dev = &head->cdev_device;
+
+       if (likely(ns)) {
+               struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd);
+               struct request *oreq = pdu->req;
+               int ret;
+
+               if (oreq == NULL) {
+                       /*
+                        * this was not submitted (to device) earlier. For this
+                        * ioucmd->cmd points to persistent memory. Free that
+                        * up post submission
+                        */
+                       const void *cmd = ioucmd->cmd;
+
+                       ret = nvme_ns_uring_cmd(ns, ioucmd, issue_flags);
+                       kfree(cmd);
+               } else {
+                       /*
+                        * this was submitted (to device) earlier. Use old
+                        * request, bio (if it exists) and nvme-pdu to recreate
+                        * the command for the discovered path
+                        */
+                       ret = nvme_uring_cmd_io_retry(ns, oreq, ioucmd, pdu);
+               }
+               if (ret != -EIOCBQUEUED)
+                       io_uring_cmd_done(ioucmd, ret, 0);
+       } else if (nvme_available_path(head)) {
+               dev_warn_ratelimited(dev, "no usable path - requeuing I/O\n");
+               spin_lock_irq(&head->requeue_ioucmd_lock);
+               ioucmd_list_add(&head->requeue_ioucmd_list, ioucmd);
+               spin_unlock_irq(&head->requeue_ioucmd_lock);
+       } else {
+               dev_warn_ratelimited(dev, "no available path - failing I/O\n");
+               io_uring_cmd_done(ioucmd, -EINVAL, 0);
+       }
+       srcu_read_unlock(&head->srcu, srcu_idx);
+}
 #endif /* CONFIG_NVME_MULTIPATH */
 
 int nvme_dev_uring_cmd(struct io_uring_cmd *ioucmd, unsigned int issue_flags)
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index f26640ccb955..fe5655d98c36 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -6,6 +6,7 @@
 #include <linux/backing-dev.h>
 #include <linux/moduleparam.h>
 #include <linux/vmalloc.h>
+#include <linux/io_uring.h>
 #include <trace/events/block.h>
 #include "nvme.h"
 
@@ -80,6 +81,17 @@ void nvme_mpath_start_freeze(struct nvme_subsystem *subsys)
                blk_freeze_queue_start(h->disk->queue);
 }
 
+static void nvme_ioucmd_failover_req(struct request *req, struct nvme_ns *ns)
+{
+       struct io_uring_cmd *ioucmd = req->end_io_data;
+       struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd);
+
+       /* store the request, to reconstruct the command during retry */
+       pdu->req = req;
+       /* retry in original submitter context */
+       io_uring_cmd_execute_in_task(ioucmd, nvme_ioucmd_mpath_retry);
+}
+
 void nvme_failover_req(struct request *req)
 {
        struct nvme_ns *ns = req->q->queuedata;
@@ -99,6 +111,11 @@ void nvme_failover_req(struct request *req)
                queue_work(nvme_wq, &ns->ctrl->ana_work);
        }
 
+       if (blk_rq_is_passthrough(req)) {
+               nvme_ioucmd_failover_req(req, ns);
+               return;
+       }
+
        spin_lock_irqsave(&ns->head->requeue_lock, flags);
        for (bio = req->bio; bio; bio = bio->bi_next) {
                bio_set_dev(bio, ns->head->disk->part0);
@@ -314,7 +331,7 @@ inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
        return ns;
 }
 
-static bool nvme_available_path(struct nvme_ns_head *head)
+bool nvme_available_path(struct nvme_ns_head *head)
 {
        struct nvme_ns *ns;
 
@@ -459,7 +476,9 @@ static void nvme_requeue_work(struct work_struct *work)
        struct nvme_ns_head *head =
                container_of(work, struct nvme_ns_head, requeue_work);
        struct bio *bio, *next;
+       struct io_uring_cmd *ioucmd, *ioucmd_next;
 
+       /* process requeued bios */
        spin_lock_irq(&head->requeue_lock);
        next = bio_list_get(&head->requeue_list);
        spin_unlock_irq(&head->requeue_lock);
@@ -470,6 +489,21 @@ static void nvme_requeue_work(struct work_struct *work)
 
                submit_bio_noacct(bio);
        }
+
+       /* process requeued passthrough-commands */
+       spin_lock_irq(&head->requeue_ioucmd_lock);
+       ioucmd_next = ioucmd_list_get(&head->requeue_ioucmd_list);
+       spin_unlock_irq(&head->requeue_ioucmd_lock);
+
+       while ((ioucmd = ioucmd_next) != NULL) {
+               ioucmd_next = ioucmd->next;
+               ioucmd->next = NULL;
+               /*
+                * Retry in original submitter context as we would be
+                * processing the passthrough command again
+                */
+               io_uring_cmd_execute_in_task(ioucmd, nvme_ioucmd_mpath_retry);
+       }
 }
 
 int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head)
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 9d3ff6feda06..125b48e74e72 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -16,6 +16,7 @@
 #include <linux/rcupdate.h>
 #include <linux/wait.h>
 #include <linux/t10-pi.h>
+#include <linux/io_uring.h>
 
 #include <trace/events/block.h>
 
@@ -189,6 +190,12 @@ enum {
        NVME_REQ_USERCMD        = (1 << 1),
 };
 
+static inline struct nvme_uring_cmd_pdu *nvme_uring_cmd_pdu(
+               struct io_uring_cmd *ioucmd)
+{
+       return (struct nvme_uring_cmd_pdu *)&ioucmd->pdu;
+}
+
 static inline struct nvme_request *nvme_req(struct request *req)
 {
        return blk_mq_rq_to_pdu(req);
@@ -442,6 +449,9 @@ struct nvme_ns_head {
        struct work_struct      requeue_work;
        struct mutex            lock;
        unsigned long           flags;
+       /* for uring-passthru multipath handling */
+       struct ioucmd_list      requeue_ioucmd_list;
+       spinlock_t              requeue_ioucmd_lock;
 #define NVME_NSHEAD_DISK_LIVE  0
        struct nvme_ns __rcu    *current_path[];
 #endif
@@ -830,6 +840,7 @@ int nvme_ns_chr_uring_cmd(struct io_uring_cmd *ioucmd,
                unsigned int issue_flags);
 int nvme_ns_head_chr_uring_cmd(struct io_uring_cmd *ioucmd,
                unsigned int issue_flags);
+void nvme_ioucmd_mpath_retry(struct io_uring_cmd *ioucmd);
 int nvme_getgeo(struct block_device *bdev, struct hd_geometry *geo);
 int nvme_dev_uring_cmd(struct io_uring_cmd *ioucmd, unsigned int issue_flags);
 
@@ -844,6 +855,7 @@ static inline bool nvme_ctrl_use_ana(struct nvme_ctrl *ctrl)
        return ctrl->ana_log_buf != NULL;
 }
 
+bool nvme_available_path(struct nvme_ns_head *head);
 void nvme_mpath_unfreeze(struct nvme_subsystem *subsys);
 void nvme_mpath_wait_freeze(struct nvme_subsystem *subsys);
 void nvme_mpath_start_freeze(struct nvme_subsystem *subsys);
diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
index d734599cbcd7..57f4dfc83316 100644
--- a/include/linux/io_uring.h
+++ b/include/linux/io_uring.h
@@ -15,6 +15,8 @@ enum io_uring_cmd_flags {
        IO_URING_F_SQE128       = 4,
        IO_URING_F_CQE32        = 8,
        IO_URING_F_IOPOLL       = 16,
+       /* to indicate that it is a MPATH req*/
+       IO_URING_F_MPATH        = 32,
 };
 
 struct io_uring_cmd {
-- 
2.25.1
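As a condensed summary of the diff above (drawn from the patch itself,
not part of the posted series), the two retry entry points fit together
like this:

/*
 * Flow summary; illustrative, not literal kernel code.
 *
 * 1) Submission finds no usable path:
 *    nvme_ns_head_chr_uring_cmd()
 *      -> copy the SQE payload (the SQE will be recycled),
 *         pdu->req = NULL
 *      -> ioucmd_list_add(&head->requeue_ioucmd_list, ioucmd)
 *    ...a path appears -> nvme_requeue_work()
 *      -> io_uring_cmd_execute_in_task(ioucmd, nvme_ioucmd_mpath_retry)
 *      -> nvme_ns_uring_cmd() on the new path; kfree() the copied payload
 *
 * 2) A submitted command fails over on completion:
 *    nvme_failover_req() -> nvme_ioucmd_failover_req()
 *      -> pdu->req = req (old request kept to rebuild the command)
 *      -> io_uring_cmd_execute_in_task(ioucmd, nvme_ioucmd_mpath_retry)
 *      -> nvme_uring_cmd_io_retry() reconstructs and re-issues it
 */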
* Re: [PATCH for-next 4/4] nvme-multipath: add multipathing for uring-passthrough commands
From: Sagi Grimberg @ 2022-07-11 13:51 UTC
To: Kanchan Joshi, hch, kbusch, axboe
Cc: io-uring, linux-nvme, linux-block, asml.silence, joshiiitr, anuj20.g, gost.dev

> This heavily reuses the nvme-mpath infrastructure for the block path.
> The block path operates on a bio, and uses it as the long-lived object
> across requeue attempts. For the passthrough interface, io_uring_cmd
> happens to be that object, so add a requeue list for it.
>
> If no path is available, uring-cmds are added to that list and
> resubmitted on discovery of a new path. The existing requeue handler
> (nvme_requeue_work) is extended to do this resubmission.
>
> For failed commands, failover is done directly (i.e. without first
> queuing into the list) by resubmitting the individual command from a
> task-work based callback.

Nice!

> Suggested-by: Sagi Grimberg <[email protected]>
> Co-developed-by: Kanchan Joshi <[email protected]>
> Signed-off-by: Kanchan Joshi <[email protected]>
> Signed-off-by: Anuj Gupta <[email protected]>
> ---
>  drivers/nvme/host/ioctl.c     | 141 ++++++++++++++++++++++++++++++++--
>  drivers/nvme/host/multipath.c |  36 ++++++++-
>  drivers/nvme/host/nvme.h      |  12 +++
>  include/linux/io_uring.h      |   2 +
>  4 files changed, 182 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
> index fc02eddd4977..281c21d3c67d 100644
> --- a/drivers/nvme/host/ioctl.c
> +++ b/drivers/nvme/host/ioctl.c
> @@ -340,12 +340,6 @@ struct nvme_uring_data {
>         __u32   timeout_ms;
>  };
>
> -static inline struct nvme_uring_cmd_pdu *nvme_uring_cmd_pdu(
> -               struct io_uring_cmd *ioucmd)
> -{
> -       return (struct nvme_uring_cmd_pdu *)&ioucmd->pdu;
> -}
> -
>  static void nvme_uring_task_cb(struct io_uring_cmd *ioucmd)
>  {
>         struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd);
> @@ -448,6 +442,14 @@ static int nvme_uring_cmd_io(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
>         pdu->meta_buffer = nvme_to_user_ptr(d.metadata);
>         pdu->meta_len = d.metadata_len;
>
> +       if (issue_flags & IO_URING_F_MPATH) {
> +               req->cmd_flags |= REQ_NVME_MPATH;
> +               /*
> +                * we would need the buffer address (d.addr field) if we have
> +                * to retry the command. Store it by repurposing ioucmd->cmd
> +                */
> +               ioucmd->cmd = (void *)d.addr;

What does repurposing mean here?

> +       }
>         blk_execute_rq_nowait(req, false);
>         return -EIOCBQUEUED;
>  }
> @@ -665,12 +667,135 @@ int nvme_ns_head_chr_uring_cmd(struct io_uring_cmd *ioucmd,
>         int srcu_idx = srcu_read_lock(&head->srcu);
>         struct nvme_ns *ns = nvme_find_path(head);
>         int ret = -EINVAL;
> +       struct device *dev = &head->cdev_device;
> +
> +       if (likely(ns)) {
> +               ret = nvme_ns_uring_cmd(ns, ioucmd,
> +                               issue_flags | IO_URING_F_MPATH);
> +       } else if (nvme_available_path(head)) {
> +               struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd);
> +               struct nvme_uring_cmd *payload = NULL;
> +
> +               dev_warn_ratelimited(dev, "no usable path - requeuing I/O\n");
> +               /*
> +                * We may requeue two kinds of uring commands:
> +                * 1. command failed post submission. pdu->req will be non-null
> +                * for this
> +                * 2. command that could not be submitted because path was not
> +                * available. For this pdu->req is set to NULL.
> +                */
> +               pdu->req = NULL;

Relying on a pointer does not sound like a good idea to me.
But why do you care at all where this command came from?
This code should not concern itself with what happened prior to this
execution.

> +               /*
> +                * passthrough command will not be available during retry as it
> +                * is embedded in io_uring's SQE. Hence we allocate/copy here
> +                */

OK, that is a nice solution.

> +               payload = kmalloc(sizeof(struct nvme_uring_cmd), GFP_KERNEL);
> +               if (!payload) {
> +                       ret = -ENOMEM;
> +                       goto out_unlock;
> +               }
> +               memcpy(payload, ioucmd->cmd, sizeof(struct nvme_uring_cmd));
> +               ioucmd->cmd = payload;
>
> -       if (ns)
> -               ret = nvme_ns_uring_cmd(ns, ioucmd, issue_flags);
> +               spin_lock_irq(&head->requeue_ioucmd_lock);
> +               ioucmd_list_add(&head->requeue_ioucmd_list, ioucmd);
> +               spin_unlock_irq(&head->requeue_ioucmd_lock);
> +               ret = -EIOCBQUEUED;
> +       } else {
> +               dev_warn_ratelimited(dev, "no available path - failing I/O\n");

ret=-EIO ?

> +       }
> +out_unlock:
>         srcu_read_unlock(&head->srcu, srcu_idx);
>         return ret;
>  }
> +
> +int nvme_uring_cmd_io_retry(struct nvme_ns *ns, struct request *oreq,
> +               struct io_uring_cmd *ioucmd, struct nvme_uring_cmd_pdu *pdu)
> +{
> +       struct nvme_ctrl *ctrl = ns->ctrl;
> +       struct request_queue *q = ns ? ns->queue : ctrl->admin_q;
> +       struct nvme_uring_data d;
> +       struct nvme_command c;
> +       struct request *req;
> +       struct bio *obio = oreq->bio;
> +       void *meta = NULL;
> +
> +       memcpy(&c, nvme_req(oreq)->cmd, sizeof(struct nvme_command));
> +       d.metadata = (__u64)pdu->meta_buffer;
> +       d.metadata_len = pdu->meta_len;
> +       d.timeout_ms = oreq->timeout;
> +       d.addr = (__u64)ioucmd->cmd;
> +       if (obio) {
> +               d.data_len = obio->bi_iter.bi_size;
> +               blk_rq_unmap_user(obio);
> +       } else {
> +               d.data_len = 0;
> +       }
> +       blk_mq_free_request(oreq);

The way I would do this is: in nvme_ioucmd_failover_req (or in the retry
driven from a retriable command failure) I would do the above, requeue
it and kick the requeue work, to go over the requeue_list and just
execute them again. Not sure why you even need explicit retry code.

> +       req = nvme_alloc_user_request(q, &c, nvme_to_user_ptr(d.addr),
> +                       d.data_len, nvme_to_user_ptr(d.metadata),
> +                       d.metadata_len, 0, &meta, d.timeout_ms ?
> +                       msecs_to_jiffies(d.timeout_ms) : 0,
> +                       ioucmd->cmd_op == NVME_URING_CMD_IO_VEC, 0, 0);
> +       if (IS_ERR(req))
> +               return PTR_ERR(req);
> +
> +       req->end_io = nvme_uring_cmd_end_io;
> +       req->end_io_data = ioucmd;
> +       pdu->bio = req->bio;
> +       pdu->meta = meta;
> +       req->cmd_flags |= REQ_NVME_MPATH;
> +       blk_execute_rq_nowait(req, false);
> +       return -EIOCBQUEUED;
> +}
> +
> +void nvme_ioucmd_mpath_retry(struct io_uring_cmd *ioucmd)
> +{
> +       struct cdev *cdev = file_inode(ioucmd->file)->i_cdev;
> +       struct nvme_ns_head *head = container_of(cdev, struct nvme_ns_head,
> +                       cdev);
> +       int srcu_idx = srcu_read_lock(&head->srcu);
> +       struct nvme_ns *ns = nvme_find_path(head);
> +       unsigned int issue_flags = IO_URING_F_SQE128 | IO_URING_F_CQE32 |
> +                       IO_URING_F_MPATH;
> +       struct device *dev = &head->cdev_device;
> +
> +       if (likely(ns)) {
> +               struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd);
> +               struct request *oreq = pdu->req;
> +               int ret;
> +
> +               if (oreq == NULL) {
> +                       /*
> +                        * this was not submitted (to device) earlier. For this
> +                        * ioucmd->cmd points to persistent memory. Free that
> +                        * up post submission
> +                        */
> +                       const void *cmd = ioucmd->cmd;
> +
> +                       ret = nvme_ns_uring_cmd(ns, ioucmd, issue_flags);
> +                       kfree(cmd);
> +               } else {
> +                       /*
> +                        * this was submitted (to device) earlier. Use old
> +                        * request, bio (if it exists) and nvme-pdu to recreate
> +                        * the command for the discovered path
> +                        */
> +                       ret = nvme_uring_cmd_io_retry(ns, oreq, ioucmd, pdu);

Why is this needed? Why is reuse important here?
Why not always call nvme_ns_uring_cmd?

> +               }
> +               if (ret != -EIOCBQUEUED)
> +                       io_uring_cmd_done(ioucmd, ret, 0);
> +       } else if (nvme_available_path(head)) {
> +               dev_warn_ratelimited(dev, "no usable path - requeuing I/O\n");
> +               spin_lock_irq(&head->requeue_ioucmd_lock);
> +               ioucmd_list_add(&head->requeue_ioucmd_list, ioucmd);
> +               spin_unlock_irq(&head->requeue_ioucmd_lock);
> +       } else {
> +               dev_warn_ratelimited(dev, "no available path - failing I/O\n");
> +               io_uring_cmd_done(ioucmd, -EINVAL, 0);

-EIO?

> +       }
> +       srcu_read_unlock(&head->srcu, srcu_idx);
> +}
>  #endif /* CONFIG_NVME_MULTIPATH */
>
>  int nvme_dev_uring_cmd(struct io_uring_cmd *ioucmd, unsigned int issue_flags)
> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> index f26640ccb955..fe5655d98c36 100644
> --- a/drivers/nvme/host/multipath.c
> +++ b/drivers/nvme/host/multipath.c
> @@ -6,6 +6,7 @@
>  #include <linux/backing-dev.h>
>  #include <linux/moduleparam.h>
>  #include <linux/vmalloc.h>
> +#include <linux/io_uring.h>
>  #include <trace/events/block.h>
>  #include "nvme.h"
>
> @@ -80,6 +81,17 @@ void nvme_mpath_start_freeze(struct nvme_subsystem *subsys)
>                 blk_freeze_queue_start(h->disk->queue);
>  }
>
> +static void nvme_ioucmd_failover_req(struct request *req, struct nvme_ns *ns)
> +{
> +       struct io_uring_cmd *ioucmd = req->end_io_data;
> +       struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd);
> +
> +       /* store the request, to reconstruct the command during retry */
> +       pdu->req = req;
> +       /* retry in original submitter context */
> +       io_uring_cmd_execute_in_task(ioucmd, nvme_ioucmd_mpath_retry);

Why not deinit the command, put it on the request list and just schedule
requeue_work?
Why not just have requeue_work go over the request list and call
nvme_ns_head_chr_uring_cmd similar to what it does for block requests?

> +}
> +
>  void nvme_failover_req(struct request *req)
>  {
>         struct nvme_ns *ns = req->q->queuedata;
> @@ -99,6 +111,11 @@ void nvme_failover_req(struct request *req)
>                 queue_work(nvme_wq, &ns->ctrl->ana_work);
>         }
>
> +       if (blk_rq_is_passthrough(req)) {
> +               nvme_ioucmd_failover_req(req, ns);
> +               return;
> +       }
> +
>         spin_lock_irqsave(&ns->head->requeue_lock, flags);
>         for (bio = req->bio; bio; bio = bio->bi_next) {
>                 bio_set_dev(bio, ns->head->disk->part0);
> @@ -314,7 +331,7 @@ inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
>         return ns;
>  }
>
> -static bool nvme_available_path(struct nvme_ns_head *head)
> +bool nvme_available_path(struct nvme_ns_head *head)
>  {
>         struct nvme_ns *ns;
>
> @@ -459,7 +476,9 @@ static void nvme_requeue_work(struct work_struct *work)
>         struct nvme_ns_head *head =
>                 container_of(work, struct nvme_ns_head, requeue_work);
>         struct bio *bio, *next;
> +       struct io_uring_cmd *ioucmd, *ioucmd_next;
>
> +       /* process requeued bios */
>         spin_lock_irq(&head->requeue_lock);
>         next = bio_list_get(&head->requeue_list);
>         spin_unlock_irq(&head->requeue_lock);
> @@ -470,6 +489,21 @@ static void nvme_requeue_work(struct work_struct *work)
>
>                 submit_bio_noacct(bio);
>         }
> +
> +       /* process requeued passthrough-commands */
> +       spin_lock_irq(&head->requeue_ioucmd_lock);
> +       ioucmd_next = ioucmd_list_get(&head->requeue_ioucmd_list);
> +       spin_unlock_irq(&head->requeue_ioucmd_lock);
> +
> +       while ((ioucmd = ioucmd_next) != NULL) {
> +               ioucmd_next = ioucmd->next;
> +               ioucmd->next = NULL;
> +               /*
> +                * Retry in original submitter context as we would be
> +                * processing the passthrough command again
> +                */
> +               io_uring_cmd_execute_in_task(ioucmd, nvme_ioucmd_mpath_retry);

Why retry? why not nvme_ns_head_chr_uring_cmd ?

> +       }
>  }
>
>  int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head)
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index 9d3ff6feda06..125b48e74e72 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -16,6 +16,7 @@
>  #include <linux/rcupdate.h>
>  #include <linux/wait.h>
>  #include <linux/t10-pi.h>
> +#include <linux/io_uring.h>
>
>  #include <trace/events/block.h>
>
> @@ -189,6 +190,12 @@ enum {
>         NVME_REQ_USERCMD        = (1 << 1),
>  };
>
> +static inline struct nvme_uring_cmd_pdu *nvme_uring_cmd_pdu(
> +               struct io_uring_cmd *ioucmd)
> +{
> +       return (struct nvme_uring_cmd_pdu *)&ioucmd->pdu;
> +}
> +
>  static inline struct nvme_request *nvme_req(struct request *req)
>  {
>         return blk_mq_rq_to_pdu(req);
> @@ -442,6 +449,9 @@ struct nvme_ns_head {
>         struct work_struct      requeue_work;
>         struct mutex            lock;
>         unsigned long           flags;
> +       /* for uring-passthru multipath handling */
> +       struct ioucmd_list      requeue_ioucmd_list;
> +       spinlock_t              requeue_ioucmd_lock;
>  #define NVME_NSHEAD_DISK_LIVE 0
>         struct nvme_ns __rcu    *current_path[];
>  #endif
> @@ -830,6 +840,7 @@ int nvme_ns_chr_uring_cmd(struct io_uring_cmd *ioucmd,
>                 unsigned int issue_flags);
>  int nvme_ns_head_chr_uring_cmd(struct io_uring_cmd *ioucmd,
>                 unsigned int issue_flags);
> +void nvme_ioucmd_mpath_retry(struct io_uring_cmd *ioucmd);
>  int nvme_getgeo(struct block_device *bdev, struct hd_geometry *geo);
>  int nvme_dev_uring_cmd(struct io_uring_cmd *ioucmd, unsigned int issue_flags);
>
> @@ -844,6 +855,7 @@ static inline bool nvme_ctrl_use_ana(struct nvme_ctrl *ctrl)
>         return ctrl->ana_log_buf != NULL;
>  }
>
> +bool nvme_available_path(struct nvme_ns_head *head);
>  void nvme_mpath_unfreeze(struct nvme_subsystem *subsys);
>  void nvme_mpath_wait_freeze(struct nvme_subsystem *subsys);
>  void nvme_mpath_start_freeze(struct nvme_subsystem *subsys);
> diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
> index d734599cbcd7..57f4dfc83316 100644
> --- a/include/linux/io_uring.h
> +++ b/include/linux/io_uring.h
> @@ -15,6 +15,8 @@ enum io_uring_cmd_flags {
>         IO_URING_F_SQE128       = 4,
>         IO_URING_F_CQE32        = 8,
>         IO_URING_F_IOPOLL       = 16,
> +       /* to indicate that it is a MPATH req*/
> +       IO_URING_F_MPATH        = 32,

Unrelated, but this should probably move to bitwise representation...
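For concreteness, the alternative Sagi sketches in the two comments above
might look roughly like this - hypothetical reviewer-side code that
glosses over completion-context and locking constraints, not something
posted in the thread: tear the failed command back down to its
pre-submission state, park it, and let the one existing resubmission
path handle everything.

/* Hypothetical sketch of the suggested alternative. */
static void nvme_ioucmd_failover_req_alt(struct request *req,
                                         struct nvme_ns *ns)
{
        struct io_uring_cmd *ioucmd = req->end_io_data;
        struct nvme_ns_head *head = ns->head;
        unsigned long flags;

        /* deinit: drop everything tied to the dead request */
        if (req->bio)
                blk_rq_unmap_user(req->bio);
        blk_mq_free_request(req);

        /* park it next to the requeued bios... */
        spin_lock_irqsave(&head->requeue_ioucmd_lock, flags);
        ioucmd_list_add(&head->requeue_ioucmd_list, ioucmd);
        spin_unlock_irqrestore(&head->requeue_ioucmd_lock, flags);

        /* ...and let nvme_requeue_work() resubmit it through
         * nvme_ns_head_chr_uring_cmd(), as it does for block I/O */
        kblockd_schedule_work(&head->requeue_work);
}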
* Re: [PATCH for-next 4/4] nvme-multipath: add multipathing for uring-passthrough commands
From: Stefan Metzmacher @ 2022-07-11 15:12 UTC
To: Sagi Grimberg, Kanchan Joshi, hch, kbusch, axboe
Cc: io-uring, linux-nvme, linux-block, asml.silence, joshiiitr, anuj20.g, gost.dev

Hi Sagi,

>> @@ -189,6 +190,12 @@ enum {
>>         NVME_REQ_USERCMD        = (1 << 1),
>>  };
>> +static inline struct nvme_uring_cmd_pdu *nvme_uring_cmd_pdu(
>> +               struct io_uring_cmd *ioucmd)
>> +{

Shouldn't we have a
BUILD_BUG_ON(sizeof(struct nvme_uring_cmd_pdu) > sizeof(ioucmd->pdu));
here?

>> +       return (struct nvme_uring_cmd_pdu *)&ioucmd->pdu;
>> +}
>> +

>> diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
>> index d734599cbcd7..57f4dfc83316 100644
>> --- a/include/linux/io_uring.h
>> +++ b/include/linux/io_uring.h
>> @@ -15,6 +15,8 @@ enum io_uring_cmd_flags {
>>         IO_URING_F_SQE128       = 4,
>>         IO_URING_F_CQE32        = 8,
>>         IO_URING_F_IOPOLL       = 16,
>> +       /* to indicate that it is a MPATH req*/
>> +       IO_URING_F_MPATH        = 32,

Isn't that nvme-specific?
If so, I don't think it belongs in io_uring.h at all...

metze