* [PATCH for-next 0/4] nvme-multipathing for uring-passthrough
From: Kanchan Joshi @ 2022-07-11 11:01 UTC (permalink / raw)
To: hch, sagi, kbusch, axboe
Cc: io-uring, linux-nvme, linux-block, asml.silence, joshiiitr,
anuj20.g, gost.dev, Kanchan Joshi
NVMe passthrough lacks multipathing capability, and some of us have
already expressed interest in seeing this plumbed - most recently
during LSFMM, around two months back.
This series wires up multipathing for uring-passthrough commands.
The attempt is to avoid affecting the common path (i.e. when
requeue/failover does not trigger) with any allocation or deferral. The
most important design bit is to treat "struct io_uring_cmd" the same
way "struct bio" is treated by the block-based nvme multipath.
Uring commands are queued when a path is not available, and resubmitted
on discovery of a new path. Also, if a passthrough command on a
multipath node fails, it is resubmitted on a different path.
Testing:
Using the upstream fio that supports uring-passthrough:
fio -iodepth=16 -rw=randread -ioengine=io_uring_cmd -bs=4k -numjobs=4
-size=1G -iodepth_batch_submit=16 -group_reporting -cmd_type=nvme
-filename=/dev/ng0n1 -name=uring-pt
1. Multiple failover - every command is retried 1-5 times before completion
2. Fail nvme_find_path() - this tests completion post requeue
3. Combine above two
4. Repeat above but for passthrough commands which do not generate a bio
(e.g. the flush command)
Anuj Gupta (2):
nvme: compact nvme_uring_cmd_pdu struct
nvme-multipath: add multipathing for uring-passthrough commands
Kanchan Joshi (2):
io_uring, nvme: rename a function
io_uring: grow a field in struct io_uring_cmd
drivers/nvme/host/ioctl.c | 157 +++++++++++++++++++++++++++++-----
drivers/nvme/host/multipath.c | 36 +++++++-
drivers/nvme/host/nvme.h | 26 ++++++
include/linux/io_uring.h | 44 +++++++++-
io_uring/uring_cmd.c | 4 +-
5 files changed, 237 insertions(+), 30 deletions(-)
--
2.25.1
* [PATCH for-next 1/4] io_uring, nvme: rename a function
From: Kanchan Joshi @ 2022-07-11 11:01 UTC (permalink / raw)
To: hch, sagi, kbusch, axboe
Cc: io-uring, linux-nvme, linux-block, asml.silence, joshiiitr,
anuj20.g, gost.dev, Kanchan Joshi
io_uring_cmd_complete_in_task() is a bit of a misnomer. It schedules a
callback function for execution in task context. What the callback does
is private to the provider, and it does not have to be a completion. So
rename it to io_uring_cmd_execute_in_task() to allow more generic use.
Signed-off-by: Kanchan Joshi <[email protected]>
---
drivers/nvme/host/ioctl.c | 2 +-
include/linux/io_uring.h | 4 ++--
io_uring/uring_cmd.c | 4 ++--
3 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index a2e89db1cd63..9227e07f717e 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -395,7 +395,7 @@ static void nvme_uring_cmd_end_io(struct request *req, blk_status_t err)
pdu->req = req;
req->bio = bio;
/* this takes care of moving rest of completion-work to task context */
- io_uring_cmd_complete_in_task(ioucmd, nvme_uring_task_cb);
+ io_uring_cmd_execute_in_task(ioucmd, nvme_uring_task_cb);
}
static int nvme_uring_cmd_io(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
index 4a2f6cc5a492..54063d67506b 100644
--- a/include/linux/io_uring.h
+++ b/include/linux/io_uring.h
@@ -29,7 +29,7 @@ struct io_uring_cmd {
#if defined(CONFIG_IO_URING)
void io_uring_cmd_done(struct io_uring_cmd *cmd, ssize_t ret, ssize_t res2);
-void io_uring_cmd_complete_in_task(struct io_uring_cmd *ioucmd,
+void io_uring_cmd_execute_in_task(struct io_uring_cmd *ioucmd,
void (*task_work_cb)(struct io_uring_cmd *));
struct sock *io_uring_get_socket(struct file *file);
void __io_uring_cancel(bool cancel_all);
@@ -59,7 +59,7 @@ static inline void io_uring_cmd_done(struct io_uring_cmd *cmd, ssize_t ret,
ssize_t ret2)
{
}
-static inline void io_uring_cmd_complete_in_task(struct io_uring_cmd *ioucmd,
+static inline void io_uring_cmd_execute_in_task(struct io_uring_cmd *ioucmd,
void (*task_work_cb)(struct io_uring_cmd *))
{
}
diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
index 0a421ed51e7e..d409b99abac5 100644
--- a/io_uring/uring_cmd.c
+++ b/io_uring/uring_cmd.c
@@ -16,7 +16,7 @@ static void io_uring_cmd_work(struct io_kiocb *req, bool *locked)
ioucmd->task_work_cb(ioucmd);
}
-void io_uring_cmd_complete_in_task(struct io_uring_cmd *ioucmd,
+void io_uring_cmd_execute_in_task(struct io_uring_cmd *ioucmd,
void (*task_work_cb)(struct io_uring_cmd *))
{
struct io_kiocb *req = cmd_to_io_kiocb(ioucmd);
@@ -25,7 +25,7 @@ void io_uring_cmd_complete_in_task(struct io_uring_cmd *ioucmd,
req->io_task_work.func = io_uring_cmd_work;
io_req_task_work_add(req);
}
-EXPORT_SYMBOL_GPL(io_uring_cmd_complete_in_task);
+EXPORT_SYMBOL_GPL(io_uring_cmd_execute_in_task);
static inline void io_req_set_cqe32_extra(struct io_kiocb *req,
u64 extra1, u64 extra2)
--
2.25.1
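The rename rationale can be illustrated with a minimal userspace model (all names below are illustrative, not the kernel API): the helper only schedules a caller-provided callback for later execution in "task context"; whether that callback completes the command or does something else entirely is up to the provider.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal userspace model of the deferred-execution idea behind
 * io_uring_cmd_execute_in_task(). All names here are illustrative. */
struct model_cmd {
	void (*task_work_cb)(struct model_cmd *cmd);
	int state;			/* provider-private payload */
	struct model_cmd *next;		/* pending-work link */
};

static struct model_cmd *pending;	/* work parked for "task context" */

/* Schedule cb to run later; note that nothing is completed here. */
static void cmd_execute_in_task(struct model_cmd *cmd,
				void (*cb)(struct model_cmd *))
{
	cmd->task_work_cb = cb;
	cmd->next = pending;
	pending = cmd;
}

/* Drain point, standing in for task-work execution. */
static int run_task_work(void)
{
	int ran = 0;

	while (pending) {
		struct model_cmd *cmd = pending;

		pending = cmd->next;
		cmd->task_work_cb(cmd);
		ran++;
	}
	return ran;
}

/* One provider callback completes, another retries: both are valid,
 * which is why "execute" is a better name than "complete". */
static void complete_cb(struct model_cmd *cmd) { cmd->state = 1; }
static void retry_cb(struct model_cmd *cmd)    { cmd->state = 2; }
```

Patch 4 relies on exactly this generality: it passes a retry function, not a completion, through the same hook.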
* [PATCH for-next 2/4] nvme: compact nvme_uring_cmd_pdu struct
From: Kanchan Joshi @ 2022-07-11 11:01 UTC (permalink / raw)
To: hch, sagi, kbusch, axboe
Cc: io-uring, linux-nvme, linux-block, asml.silence, joshiiitr,
anuj20.g, gost.dev
From: Anuj Gupta <[email protected]>
Mark this packed so that we can create a bit more space in its
container, i.e. io_uring_cmd. This is in preparation for supporting
multipathing on uring-passthrough.
Move its definition to nvme.h as well.
Signed-off-by: Anuj Gupta <[email protected]>
---
drivers/nvme/host/ioctl.c | 14 --------------
drivers/nvme/host/nvme.h | 14 ++++++++++++++
2 files changed, 14 insertions(+), 14 deletions(-)
diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index 9227e07f717e..fc02eddd4977 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -340,20 +340,6 @@ struct nvme_uring_data {
__u32 timeout_ms;
};
-/*
- * This overlays struct io_uring_cmd pdu.
- * Expect build errors if this grows larger than that.
- */
-struct nvme_uring_cmd_pdu {
- union {
- struct bio *bio;
- struct request *req;
- };
- void *meta; /* kernel-resident buffer */
- void __user *meta_buffer;
- u32 meta_len;
-};
-
static inline struct nvme_uring_cmd_pdu *nvme_uring_cmd_pdu(
struct io_uring_cmd *ioucmd)
{
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 7323a2f61126..9d3ff6feda06 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -165,6 +165,20 @@ struct nvme_request {
struct nvme_ctrl *ctrl;
};
+/*
+ * This overlays struct io_uring_cmd pdu.
+ * Expect build errors if this grows larger than that.
+ */
+struct nvme_uring_cmd_pdu {
+ union {
+ struct bio *bio;
+ struct request *req;
+ };
+ void *meta; /* kernel-resident buffer */
+ void __user *meta_buffer;
+ u32 meta_len;
+} __packed;
+
/*
* Mark a bio as coming in through the mpath node.
*/
--
2.25.1
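The space gained by __packed can be checked with a userspace mirror of the layout (illustrative names; an LP64 build with 8-byte pointers is assumed, and u32 is replaced by uint32_t). Without packing, the trailing u32 forces tail padding up to pointer alignment; packed, the struct fits in 28 bytes, which is what lets the next patch shrink the inline pdu area and carve out a pointer-sized field.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace mirrors of nvme_uring_cmd_pdu; LP64 assumed. */
struct pdu_unpacked {
	union {
		void *bio;
		void *req;
	};
	void *meta;		/* kernel-resident buffer */
	void *meta_buffer;
	uint32_t meta_len;
};				/* 28 bytes of members + 4 bytes tail pad */

struct pdu_packed {
	union {
		void *bio;
		void *req;
	};
	void *meta;
	void *meta_buffer;
	uint32_t meta_len;
} __attribute__((packed));	/* no tail padding: 28 bytes */
```

All members are naturally aligned here, so packing only drops the tail padding rather than misaligning any field.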
* [PATCH for-next 3/4] io_uring: grow a field in struct io_uring_cmd
From: Kanchan Joshi @ 2022-07-11 11:01 UTC (permalink / raw)
To: hch, sagi, kbusch, axboe
Cc: io-uring, linux-nvme, linux-block, asml.silence, joshiiitr,
anuj20.g, gost.dev, Kanchan Joshi
Use the leftover space to carve out a 'next' field that enables linking
of io_uring_cmd structs. Also introduce a list head and a few helpers.
This is in preparation for supporting nvme-multipath, allowing multiple
uring passthrough commands to be queued.
Signed-off-by: Kanchan Joshi <[email protected]>
Signed-off-by: Anuj Gupta <[email protected]>
---
include/linux/io_uring.h | 38 ++++++++++++++++++++++++++++++++++++--
1 file changed, 36 insertions(+), 2 deletions(-)
diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
index 54063d67506b..d734599cbcd7 100644
--- a/include/linux/io_uring.h
+++ b/include/linux/io_uring.h
@@ -22,9 +22,14 @@ struct io_uring_cmd {
const void *cmd;
/* callback to defer completions to task context */
void (*task_work_cb)(struct io_uring_cmd *cmd);
+ struct io_uring_cmd *next;
u32 cmd_op;
- u32 pad;
- u8 pdu[32]; /* available inline for free use */
+ u8 pdu[28]; /* available inline for free use */
+};
+
+struct ioucmd_list {
+ struct io_uring_cmd *head;
+ struct io_uring_cmd *tail;
};
#if defined(CONFIG_IO_URING)
@@ -54,6 +59,27 @@ static inline void io_uring_free(struct task_struct *tsk)
if (tsk->io_uring)
__io_uring_free(tsk);
}
+
+static inline struct io_uring_cmd *ioucmd_list_get(struct ioucmd_list *il)
+{
+ struct io_uring_cmd *ioucmd = il->head;
+
+ il->head = il->tail = NULL;
+
+ return ioucmd;
+}
+
+static inline void ioucmd_list_add(struct ioucmd_list *il,
+ struct io_uring_cmd *ioucmd)
+{
+ ioucmd->next = NULL;
+
+ if (il->tail)
+ il->tail->next = ioucmd;
+ else
+ il->head = ioucmd;
+ il->tail = ioucmd;
+}
#else
static inline void io_uring_cmd_done(struct io_uring_cmd *cmd, ssize_t ret,
ssize_t ret2)
@@ -80,6 +106,14 @@ static inline const char *io_uring_get_opcode(u8 opcode)
{
return "";
}
+static inline struct io_uring_cmd *ioucmd_list_get(struct ioucmd_list *il)
+{
+ return NULL;
+}
+static inline void ioucmd_list_add(struct ioucmd_list *il,
+ struct io_uring_cmd *ioucmd)
+{
+}
#endif
#endif
--
2.25.1
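The helpers above implement a simple head/tail FIFO threaded through the embedded 'next' pointer: add appends at the tail, get detaches the whole chain at once and resets the list. A userspace model of the same semantics (illustrative names) behaves as follows.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model of ioucmd_list: a singly-linked FIFO threaded
 * through an embedded 'next' pointer, with head and tail cursors. */
struct mcmd {
	int id;
	struct mcmd *next;
};

struct mcmd_list {
	struct mcmd *head;
	struct mcmd *tail;
};

/* Append at the tail, as ioucmd_list_add() does. */
static void mcmd_list_add(struct mcmd_list *il, struct mcmd *cmd)
{
	cmd->next = NULL;

	if (il->tail)
		il->tail->next = cmd;
	else
		il->head = cmd;
	il->tail = cmd;
}

/* Detach the whole chain in one shot, as ioucmd_list_get() does;
 * the caller then walks it via the 'next' pointers. */
static struct mcmd *mcmd_list_get(struct mcmd_list *il)
{
	struct mcmd *cmd = il->head;

	il->head = il->tail = NULL;
	return cmd;
}
```

Detaching the whole chain is what lets the requeue worker in patch 4 hold the protecting lock only briefly, then process the commands outside it.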
* [PATCH for-next 4/4] nvme-multipath: add multipathing for uring-passthrough commands
From: Kanchan Joshi @ 2022-07-11 11:01 UTC (permalink / raw)
To: hch, sagi, kbusch, axboe
Cc: io-uring, linux-nvme, linux-block, asml.silence, joshiiitr,
anuj20.g, gost.dev, Kanchan Joshi
From: Anuj Gupta <[email protected]>
This heavily reuses the nvme-mpath infrastructure for the block path.
The block path operates on a bio and uses it as a long-lived object
across requeue attempts. For the passthrough interface, io_uring_cmd
happens to be such an object, so add a requeue list for it.
If a path is not available, uring-cmds are added to that list and
resubmitted on discovery of a new path. The existing requeue handler
(nvme_requeue_work) is extended to do this resubmission.
For failed commands, failover is done directly (i.e. without first
queuing into the list) by resubmitting the individual command from a
task-work based callback.
Suggested-by: Sagi Grimberg <[email protected]>
Co-developed-by: Kanchan Joshi <[email protected]>
Signed-off-by: Kanchan Joshi <[email protected]>
Signed-off-by: Anuj Gupta <[email protected]>
---
drivers/nvme/host/ioctl.c | 141 ++++++++++++++++++++++++++++++++--
drivers/nvme/host/multipath.c | 36 ++++++++-
drivers/nvme/host/nvme.h | 12 +++
include/linux/io_uring.h | 2 +
4 files changed, 182 insertions(+), 9 deletions(-)
diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index fc02eddd4977..281c21d3c67d 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -340,12 +340,6 @@ struct nvme_uring_data {
__u32 timeout_ms;
};
-static inline struct nvme_uring_cmd_pdu *nvme_uring_cmd_pdu(
- struct io_uring_cmd *ioucmd)
-{
- return (struct nvme_uring_cmd_pdu *)&ioucmd->pdu;
-}
-
static void nvme_uring_task_cb(struct io_uring_cmd *ioucmd)
{
struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd);
@@ -448,6 +442,14 @@ static int nvme_uring_cmd_io(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
pdu->meta_buffer = nvme_to_user_ptr(d.metadata);
pdu->meta_len = d.metadata_len;
+ if (issue_flags & IO_URING_F_MPATH) {
+ req->cmd_flags |= REQ_NVME_MPATH;
+ /*
+ * we would need the buffer address (d.addr field) if we have
+ * to retry the command. Store it by repurposing ioucmd->cmd
+ */
+ ioucmd->cmd = (void *)d.addr;
+ }
blk_execute_rq_nowait(req, false);
return -EIOCBQUEUED;
}
@@ -665,12 +667,135 @@ int nvme_ns_head_chr_uring_cmd(struct io_uring_cmd *ioucmd,
int srcu_idx = srcu_read_lock(&head->srcu);
struct nvme_ns *ns = nvme_find_path(head);
int ret = -EINVAL;
+ struct device *dev = &head->cdev_device;
+
+ if (likely(ns)) {
+ ret = nvme_ns_uring_cmd(ns, ioucmd,
+ issue_flags | IO_URING_F_MPATH);
+ } else if (nvme_available_path(head)) {
+ struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd);
+ struct nvme_uring_cmd *payload = NULL;
+
+ dev_warn_ratelimited(dev, "no usable path - requeuing I/O\n");
+ /*
+ * We may requeue two kinds of uring commands:
+ * 1. command failed post submission. pdu->req will be non-null
+ * for this
+ * 2. command that could not be submitted because path was not
+ * available. For this pdu->req is set to NULL.
+ */
+ pdu->req = NULL;
+ /*
+ * passthrough command will not be available during retry as it
+ * is embedded in io_uring's SQE. Hence we allocate/copy here
+ */
+ payload = kmalloc(sizeof(struct nvme_uring_cmd), GFP_KERNEL);
+ if (!payload) {
+ ret = -ENOMEM;
+ goto out_unlock;
+ }
+ memcpy(payload, ioucmd->cmd, sizeof(struct nvme_uring_cmd));
+ ioucmd->cmd = payload;
- if (ns)
- ret = nvme_ns_uring_cmd(ns, ioucmd, issue_flags);
+ spin_lock_irq(&head->requeue_ioucmd_lock);
+ ioucmd_list_add(&head->requeue_ioucmd_list, ioucmd);
+ spin_unlock_irq(&head->requeue_ioucmd_lock);
+ ret = -EIOCBQUEUED;
+ } else {
+ dev_warn_ratelimited(dev, "no available path - failing I/O\n");
+ }
+out_unlock:
srcu_read_unlock(&head->srcu, srcu_idx);
return ret;
}
+
+int nvme_uring_cmd_io_retry(struct nvme_ns *ns, struct request *oreq,
+ struct io_uring_cmd *ioucmd, struct nvme_uring_cmd_pdu *pdu)
+{
+ struct nvme_ctrl *ctrl = ns->ctrl;
+ struct request_queue *q = ns ? ns->queue : ctrl->admin_q;
+ struct nvme_uring_data d;
+ struct nvme_command c;
+ struct request *req;
+ struct bio *obio = oreq->bio;
+ void *meta = NULL;
+
+ memcpy(&c, nvme_req(oreq)->cmd, sizeof(struct nvme_command));
+ d.metadata = (__u64)pdu->meta_buffer;
+ d.metadata_len = pdu->meta_len;
+ d.timeout_ms = oreq->timeout;
+ d.addr = (__u64)ioucmd->cmd;
+ if (obio) {
+ d.data_len = obio->bi_iter.bi_size;
+ blk_rq_unmap_user(obio);
+ } else {
+ d.data_len = 0;
+ }
+ blk_mq_free_request(oreq);
+ req = nvme_alloc_user_request(q, &c, nvme_to_user_ptr(d.addr),
+ d.data_len, nvme_to_user_ptr(d.metadata),
+ d.metadata_len, 0, &meta, d.timeout_ms ?
+ msecs_to_jiffies(d.timeout_ms) : 0,
+ ioucmd->cmd_op == NVME_URING_CMD_IO_VEC, 0, 0);
+ if (IS_ERR(req))
+ return PTR_ERR(req);
+
+ req->end_io = nvme_uring_cmd_end_io;
+ req->end_io_data = ioucmd;
+ pdu->bio = req->bio;
+ pdu->meta = meta;
+ req->cmd_flags |= REQ_NVME_MPATH;
+ blk_execute_rq_nowait(req, false);
+ return -EIOCBQUEUED;
+}
+
+void nvme_ioucmd_mpath_retry(struct io_uring_cmd *ioucmd)
+{
+ struct cdev *cdev = file_inode(ioucmd->file)->i_cdev;
+ struct nvme_ns_head *head = container_of(cdev, struct nvme_ns_head,
+ cdev);
+ int srcu_idx = srcu_read_lock(&head->srcu);
+ struct nvme_ns *ns = nvme_find_path(head);
+ unsigned int issue_flags = IO_URING_F_SQE128 | IO_URING_F_CQE32 |
+ IO_URING_F_MPATH;
+ struct device *dev = &head->cdev_device;
+
+ if (likely(ns)) {
+ struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd);
+ struct request *oreq = pdu->req;
+ int ret;
+
+ if (oreq == NULL) {
+ /*
+ * this was not submitted (to device) earlier. For this
+ * ioucmd->cmd points to persistent memory. Free that
+ * up post submission
+ */
+ const void *cmd = ioucmd->cmd;
+
+ ret = nvme_ns_uring_cmd(ns, ioucmd, issue_flags);
+ kfree(cmd);
+ } else {
+ /*
+ * this was submitted (to device) earlier. Use old
+ * request, bio (if it exists) and nvme-pdu to recreate
+ * the command for the discovered path
+ */
+ ret = nvme_uring_cmd_io_retry(ns, oreq, ioucmd, pdu);
+ }
+ if (ret != -EIOCBQUEUED)
+ io_uring_cmd_done(ioucmd, ret, 0);
+ } else if (nvme_available_path(head)) {
+ dev_warn_ratelimited(dev, "no usable path - requeuing I/O\n");
+ spin_lock_irq(&head->requeue_ioucmd_lock);
+ ioucmd_list_add(&head->requeue_ioucmd_list, ioucmd);
+ spin_unlock_irq(&head->requeue_ioucmd_lock);
+ } else {
+ dev_warn_ratelimited(dev, "no available path - failing I/O\n");
+ io_uring_cmd_done(ioucmd, -EINVAL, 0);
+ }
+ srcu_read_unlock(&head->srcu, srcu_idx);
+}
#endif /* CONFIG_NVME_MULTIPATH */
int nvme_dev_uring_cmd(struct io_uring_cmd *ioucmd, unsigned int issue_flags)
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index f26640ccb955..fe5655d98c36 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -6,6 +6,7 @@
#include <linux/backing-dev.h>
#include <linux/moduleparam.h>
#include <linux/vmalloc.h>
+#include <linux/io_uring.h>
#include <trace/events/block.h>
#include "nvme.h"
@@ -80,6 +81,17 @@ void nvme_mpath_start_freeze(struct nvme_subsystem *subsys)
blk_freeze_queue_start(h->disk->queue);
}
+static void nvme_ioucmd_failover_req(struct request *req, struct nvme_ns *ns)
+{
+ struct io_uring_cmd *ioucmd = req->end_io_data;
+ struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd);
+
+ /* store the request, to reconstruct the command during retry */
+ pdu->req = req;
+ /* retry in original submitter context */
+ io_uring_cmd_execute_in_task(ioucmd, nvme_ioucmd_mpath_retry);
+}
+
void nvme_failover_req(struct request *req)
{
struct nvme_ns *ns = req->q->queuedata;
@@ -99,6 +111,11 @@ void nvme_failover_req(struct request *req)
queue_work(nvme_wq, &ns->ctrl->ana_work);
}
+ if (blk_rq_is_passthrough(req)) {
+ nvme_ioucmd_failover_req(req, ns);
+ return;
+ }
+
spin_lock_irqsave(&ns->head->requeue_lock, flags);
for (bio = req->bio; bio; bio = bio->bi_next) {
bio_set_dev(bio, ns->head->disk->part0);
@@ -314,7 +331,7 @@ inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
return ns;
}
-static bool nvme_available_path(struct nvme_ns_head *head)
+bool nvme_available_path(struct nvme_ns_head *head)
{
struct nvme_ns *ns;
@@ -459,7 +476,9 @@ static void nvme_requeue_work(struct work_struct *work)
struct nvme_ns_head *head =
container_of(work, struct nvme_ns_head, requeue_work);
struct bio *bio, *next;
+ struct io_uring_cmd *ioucmd, *ioucmd_next;
+ /* process requeued bios */
spin_lock_irq(&head->requeue_lock);
next = bio_list_get(&head->requeue_list);
spin_unlock_irq(&head->requeue_lock);
@@ -470,6 +489,21 @@ static void nvme_requeue_work(struct work_struct *work)
submit_bio_noacct(bio);
}
+
+ /* process requeued passthrough-commands */
+ spin_lock_irq(&head->requeue_ioucmd_lock);
+ ioucmd_next = ioucmd_list_get(&head->requeue_ioucmd_list);
+ spin_unlock_irq(&head->requeue_ioucmd_lock);
+
+ while ((ioucmd = ioucmd_next) != NULL) {
+ ioucmd_next = ioucmd->next;
+ ioucmd->next = NULL;
+ /*
+ * Retry in original submitter context as we would be
+ * processing the passthrough command again
+ */
+ io_uring_cmd_execute_in_task(ioucmd, nvme_ioucmd_mpath_retry);
+ }
}
int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head)
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 9d3ff6feda06..125b48e74e72 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -16,6 +16,7 @@
#include <linux/rcupdate.h>
#include <linux/wait.h>
#include <linux/t10-pi.h>
+#include <linux/io_uring.h>
#include <trace/events/block.h>
@@ -189,6 +190,12 @@ enum {
NVME_REQ_USERCMD = (1 << 1),
};
+static inline struct nvme_uring_cmd_pdu *nvme_uring_cmd_pdu(
+ struct io_uring_cmd *ioucmd)
+{
+ return (struct nvme_uring_cmd_pdu *)&ioucmd->pdu;
+}
+
static inline struct nvme_request *nvme_req(struct request *req)
{
return blk_mq_rq_to_pdu(req);
@@ -442,6 +449,9 @@ struct nvme_ns_head {
struct work_struct requeue_work;
struct mutex lock;
unsigned long flags;
+ /* for uring-passthru multipath handling */
+ struct ioucmd_list requeue_ioucmd_list;
+ spinlock_t requeue_ioucmd_lock;
#define NVME_NSHEAD_DISK_LIVE 0
struct nvme_ns __rcu *current_path[];
#endif
@@ -830,6 +840,7 @@ int nvme_ns_chr_uring_cmd(struct io_uring_cmd *ioucmd,
unsigned int issue_flags);
int nvme_ns_head_chr_uring_cmd(struct io_uring_cmd *ioucmd,
unsigned int issue_flags);
+void nvme_ioucmd_mpath_retry(struct io_uring_cmd *ioucmd);
int nvme_getgeo(struct block_device *bdev, struct hd_geometry *geo);
int nvme_dev_uring_cmd(struct io_uring_cmd *ioucmd, unsigned int issue_flags);
@@ -844,6 +855,7 @@ static inline bool nvme_ctrl_use_ana(struct nvme_ctrl *ctrl)
return ctrl->ana_log_buf != NULL;
}
+bool nvme_available_path(struct nvme_ns_head *head);
void nvme_mpath_unfreeze(struct nvme_subsystem *subsys);
void nvme_mpath_wait_freeze(struct nvme_subsystem *subsys);
void nvme_mpath_start_freeze(struct nvme_subsystem *subsys);
diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
index d734599cbcd7..57f4dfc83316 100644
--- a/include/linux/io_uring.h
+++ b/include/linux/io_uring.h
@@ -15,6 +15,8 @@ enum io_uring_cmd_flags {
IO_URING_F_SQE128 = 4,
IO_URING_F_CQE32 = 8,
IO_URING_F_IOPOLL = 16,
+ /* to indicate that it is an MPATH req */
+ IO_URING_F_MPATH = 32,
};
struct io_uring_cmd {
--
2.25.1
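The overall flow this patch adds, park the command when no path is usable and drain-and-resubmit when one appears, can be sketched in userspace (illustrative names; locking, task-work deferral and the failover case are elided).

```c
#include <assert.h>
#include <stddef.h>

/* Userspace sketch of the requeue flow; all names are illustrative. */
struct path { int live; };

struct pcmd {
	int completed;
	struct pcmd *next;
};

static struct path *paths[2];
static struct pcmd *requeue_head, *requeue_tail;

/* Stand-in for nvme_find_path(): first live path, if any. */
static struct path *find_path(void)
{
	for (size_t i = 0; i < 2; i++)
		if (paths[i] && paths[i]->live)
			return paths[i];
	return NULL;
}

/* Submit: execute on a live path, else park on the requeue list. */
static int submit(struct pcmd *cmd)
{
	if (find_path()) {
		cmd->completed = 1;
		return 0;
	}
	cmd->next = NULL;
	if (requeue_tail)
		requeue_tail->next = cmd;
	else
		requeue_head = cmd;
	requeue_tail = cmd;
	return -1;		/* parked, like returning -EIOCBQUEUED */
}

/* Requeue work: detach the list, then resubmit each parked command. */
static int requeue_work(void)
{
	struct pcmd *cmd = requeue_head, *next;
	int resubmitted = 0;

	requeue_head = requeue_tail = NULL;
	while (cmd) {
		next = cmd->next;
		cmd->next = NULL;
		submit(cmd);
		resubmitted++;
		cmd = next;
	}
	return resubmitted;
}
```

In the real patch the park/drain happens under requeue_ioucmd_lock and the resubmission runs from task context via io_uring_cmd_execute_in_task(); the control flow is the same.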
* Re: [PATCH for-next 4/4] nvme-multipath: add multipathing for uring-passthrough commands
From: Sagi Grimberg @ 2022-07-11 13:51 UTC (permalink / raw)
To: Kanchan Joshi, hch, kbusch, axboe
Cc: io-uring, linux-nvme, linux-block, asml.silence, joshiiitr,
anuj20.g, gost.dev
> This heavily reuses the nvme-mpath infrastructure for the block path.
> The block path operates on a bio and uses it as a long-lived object
> across requeue attempts. For the passthrough interface, io_uring_cmd
> happens to be such an object, so add a requeue list for it.
>
> If a path is not available, uring-cmds are added to that list and
> resubmitted on discovery of a new path. The existing requeue handler
> (nvme_requeue_work) is extended to do this resubmission.
>
> For failed commands, failover is done directly (i.e. without first
> queuing into the list) by resubmitting the individual command from a
> task-work based callback.
Nice!
>
> Suggested-by: Sagi Grimberg <[email protected]>
> Co-developed-by: Kanchan Joshi <[email protected]>
> Signed-off-by: Kanchan Joshi <[email protected]>
> Signed-off-by: Anuj Gupta <[email protected]>
> ---
> drivers/nvme/host/ioctl.c | 141 ++++++++++++++++++++++++++++++++--
> drivers/nvme/host/multipath.c | 36 ++++++++-
> drivers/nvme/host/nvme.h | 12 +++
> include/linux/io_uring.h | 2 +
> 4 files changed, 182 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
> index fc02eddd4977..281c21d3c67d 100644
> --- a/drivers/nvme/host/ioctl.c
> +++ b/drivers/nvme/host/ioctl.c
> @@ -340,12 +340,6 @@ struct nvme_uring_data {
> __u32 timeout_ms;
> };
>
> -static inline struct nvme_uring_cmd_pdu *nvme_uring_cmd_pdu(
> - struct io_uring_cmd *ioucmd)
> -{
> - return (struct nvme_uring_cmd_pdu *)&ioucmd->pdu;
> -}
> -
> static void nvme_uring_task_cb(struct io_uring_cmd *ioucmd)
> {
> struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd);
> @@ -448,6 +442,14 @@ static int nvme_uring_cmd_io(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
> pdu->meta_buffer = nvme_to_user_ptr(d.metadata);
> pdu->meta_len = d.metadata_len;
>
> + if (issue_flags & IO_URING_F_MPATH) {
> + req->cmd_flags |= REQ_NVME_MPATH;
> + /*
> + * we would need the buffer address (d.addr field) if we have
> + * to retry the command. Store it by repurposing ioucmd->cmd
> + */
> + ioucmd->cmd = (void *)d.addr;
What does repurposing mean here?
> + }
> blk_execute_rq_nowait(req, false);
> return -EIOCBQUEUED;
> }
> @@ -665,12 +667,135 @@ int nvme_ns_head_chr_uring_cmd(struct io_uring_cmd *ioucmd,
> int srcu_idx = srcu_read_lock(&head->srcu);
> struct nvme_ns *ns = nvme_find_path(head);
> int ret = -EINVAL;
> + struct device *dev = &head->cdev_device;
> +
> + if (likely(ns)) {
> + ret = nvme_ns_uring_cmd(ns, ioucmd,
> + issue_flags | IO_URING_F_MPATH);
> + } else if (nvme_available_path(head)) {
> + struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd);
> + struct nvme_uring_cmd *payload = NULL;
> +
> + dev_warn_ratelimited(dev, "no usable path - requeuing I/O\n");
> + /*
> + * We may requeue two kinds of uring commands:
> + * 1. command failed post submission. pdu->req will be non-null
> + * for this
> + * 2. command that could not be submitted because path was not
> + * available. For this pdu->req is set to NULL.
> + */
> + pdu->req = NULL;
Relying on a pointer does not sound like a good idea to me.
But why do you care at all where this command came from?
This code should not concern itself what happened prior to this
execution.
> + /*
> + * passthrough command will not be available during retry as it
> + * is embedded in io_uring's SQE. Hence we allocate/copy here
> + */
OK, that is a nice solution.
> + payload = kmalloc(sizeof(struct nvme_uring_cmd), GFP_KERNEL);
> + if (!payload) {
> + ret = -ENOMEM;
> + goto out_unlock;
> + }
> + memcpy(payload, ioucmd->cmd, sizeof(struct nvme_uring_cmd));
> + ioucmd->cmd = payload;
>
> - if (ns)
> - ret = nvme_ns_uring_cmd(ns, ioucmd, issue_flags);
> + spin_lock_irq(&head->requeue_ioucmd_lock);
> + ioucmd_list_add(&head->requeue_ioucmd_list, ioucmd);
> + spin_unlock_irq(&head->requeue_ioucmd_lock);
> + ret = -EIOCBQUEUED;
> + } else {
> + dev_warn_ratelimited(dev, "no available path - failing I/O\n");
ret=-EIO ?
> + }
> +out_unlock:
> srcu_read_unlock(&head->srcu, srcu_idx);
> return ret;
> }
> +
> +int nvme_uring_cmd_io_retry(struct nvme_ns *ns, struct request *oreq,
> + struct io_uring_cmd *ioucmd, struct nvme_uring_cmd_pdu *pdu)
> +{
> + struct nvme_ctrl *ctrl = ns->ctrl;
> + struct request_queue *q = ns ? ns->queue : ctrl->admin_q;
> + struct nvme_uring_data d;
> + struct nvme_command c;
> + struct request *req;
> + struct bio *obio = oreq->bio;
> + void *meta = NULL;
> +
> + memcpy(&c, nvme_req(oreq)->cmd, sizeof(struct nvme_command));
> + d.metadata = (__u64)pdu->meta_buffer;
> + d.metadata_len = pdu->meta_len;
> + d.timeout_ms = oreq->timeout;
> + d.addr = (__u64)ioucmd->cmd;
> + if (obio) {
> + d.data_len = obio->bi_iter.bi_size;
> + blk_rq_unmap_user(obio);
> + } else {
> + d.data_len = 0;
> + }
> + blk_mq_free_request(oreq);
The way I would do this is that in nvme_ioucmd_failover_req (or in the
retry driven from a retriable command failure) I would do the above,
requeue it, and kick the requeue work to go over the requeue_list and
just execute them again. Not sure why you even need an explicit retry
path.
> + req = nvme_alloc_user_request(q, &c, nvme_to_user_ptr(d.addr),
> + d.data_len, nvme_to_user_ptr(d.metadata),
> + d.metadata_len, 0, &meta, d.timeout_ms ?
> + msecs_to_jiffies(d.timeout_ms) : 0,
> + ioucmd->cmd_op == NVME_URING_CMD_IO_VEC, 0, 0);
> + if (IS_ERR(req))
> + return PTR_ERR(req);
> +
> + req->end_io = nvme_uring_cmd_end_io;
> + req->end_io_data = ioucmd;
> + pdu->bio = req->bio;
> + pdu->meta = meta;
> + req->cmd_flags |= REQ_NVME_MPATH;
> + blk_execute_rq_nowait(req, false);
> + return -EIOCBQUEUED;
> +}
> +
> +void nvme_ioucmd_mpath_retry(struct io_uring_cmd *ioucmd)
> +{
> + struct cdev *cdev = file_inode(ioucmd->file)->i_cdev;
> + struct nvme_ns_head *head = container_of(cdev, struct nvme_ns_head,
> + cdev);
> + int srcu_idx = srcu_read_lock(&head->srcu);
> + struct nvme_ns *ns = nvme_find_path(head);
> + unsigned int issue_flags = IO_URING_F_SQE128 | IO_URING_F_CQE32 |
> + IO_URING_F_MPATH;
> + struct device *dev = &head->cdev_device;
> +
> + if (likely(ns)) {
> + struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd);
> + struct request *oreq = pdu->req;
> + int ret;
> +
> + if (oreq == NULL) {
> + /*
> + * this was not submitted (to device) earlier. For this
> + * ioucmd->cmd points to persistent memory. Free that
> + * up post submission
> + */
> + const void *cmd = ioucmd->cmd;
> +
> + ret = nvme_ns_uring_cmd(ns, ioucmd, issue_flags);
> + kfree(cmd);
> + } else {
> + /*
> + * this was submitted (to device) earlier. Use old
> + * request, bio (if it exists) and nvme-pdu to recreate
> + * the command for the discovered path
> + */
> + ret = nvme_uring_cmd_io_retry(ns, oreq, ioucmd, pdu);
Why is this needed? Why is reuse important here? Why not always call
nvme_ns_uring_cmd?
> + }
> + if (ret != -EIOCBQUEUED)
> + io_uring_cmd_done(ioucmd, ret, 0);
> + } else if (nvme_available_path(head)) {
> + dev_warn_ratelimited(dev, "no usable path - requeuing I/O\n");
> + spin_lock_irq(&head->requeue_ioucmd_lock);
> + ioucmd_list_add(&head->requeue_ioucmd_list, ioucmd);
> + spin_unlock_irq(&head->requeue_ioucmd_lock);
> + } else {
> + dev_warn_ratelimited(dev, "no available path - failing I/O\n");
> + io_uring_cmd_done(ioucmd, -EINVAL, 0);
-EIO?
> + }
> + srcu_read_unlock(&head->srcu, srcu_idx);
> +}
> #endif /* CONFIG_NVME_MULTIPATH */
>
> int nvme_dev_uring_cmd(struct io_uring_cmd *ioucmd, unsigned int issue_flags)
> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> index f26640ccb955..fe5655d98c36 100644
> --- a/drivers/nvme/host/multipath.c
> +++ b/drivers/nvme/host/multipath.c
> @@ -6,6 +6,7 @@
> #include <linux/backing-dev.h>
> #include <linux/moduleparam.h>
> #include <linux/vmalloc.h>
> +#include <linux/io_uring.h>
> #include <trace/events/block.h>
> #include "nvme.h"
>
> @@ -80,6 +81,17 @@ void nvme_mpath_start_freeze(struct nvme_subsystem *subsys)
> blk_freeze_queue_start(h->disk->queue);
> }
>
> +static void nvme_ioucmd_failover_req(struct request *req, struct nvme_ns *ns)
> +{
> + struct io_uring_cmd *ioucmd = req->end_io_data;
> + struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd);
> +
> + /* store the request, to reconstruct the command during retry */
> + pdu->req = req;
> + /* retry in original submitter context */
> + io_uring_cmd_execute_in_task(ioucmd, nvme_ioucmd_mpath_retry);
Why not deinit the command, put it on the request list and just schedule
requeue_work? Why not just have requeue_work go over the request list
and call nvme_ns_head_chr_uring_cmd similar to what it does for block
requests?
> +}
> +
> void nvme_failover_req(struct request *req)
> {
> struct nvme_ns *ns = req->q->queuedata;
> @@ -99,6 +111,11 @@ void nvme_failover_req(struct request *req)
> queue_work(nvme_wq, &ns->ctrl->ana_work);
> }
>
> + if (blk_rq_is_passthrough(req)) {
> + nvme_ioucmd_failover_req(req, ns);
> + return;
> + }
> +
> spin_lock_irqsave(&ns->head->requeue_lock, flags);
> for (bio = req->bio; bio; bio = bio->bi_next) {
> bio_set_dev(bio, ns->head->disk->part0);
> @@ -314,7 +331,7 @@ inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
> return ns;
> }
>
> -static bool nvme_available_path(struct nvme_ns_head *head)
> +bool nvme_available_path(struct nvme_ns_head *head)
> {
> struct nvme_ns *ns;
>
> @@ -459,7 +476,9 @@ static void nvme_requeue_work(struct work_struct *work)
> struct nvme_ns_head *head =
> container_of(work, struct nvme_ns_head, requeue_work);
> struct bio *bio, *next;
> + struct io_uring_cmd *ioucmd, *ioucmd_next;
>
> + /* process requeued bios*/
> spin_lock_irq(&head->requeue_lock);
> next = bio_list_get(&head->requeue_list);
> spin_unlock_irq(&head->requeue_lock);
> @@ -470,6 +489,21 @@ static void nvme_requeue_work(struct work_struct *work)
>
> submit_bio_noacct(bio);
> }
> +
> + /* process requeued passthrough-commands */
> + spin_lock_irq(&head->requeue_ioucmd_lock);
> + ioucmd_next = ioucmd_list_get(&head->requeue_ioucmd_list);
> + spin_unlock_irq(&head->requeue_ioucmd_lock);
> +
> + while ((ioucmd = ioucmd_next) != NULL) {
> + ioucmd_next = ioucmd->next;
> + ioucmd->next = NULL;
> + /*
> + * Retry in original submitter context as we would be
> + * processing the passthrough command again
> + */
> + io_uring_cmd_execute_in_task(ioucmd, nvme_ioucmd_mpath_retry);
Why retry? Why not nvme_ns_head_chr_uring_cmd?
> + }
> }
>
> int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head)
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index 9d3ff6feda06..125b48e74e72 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -16,6 +16,7 @@
> #include <linux/rcupdate.h>
> #include <linux/wait.h>
> #include <linux/t10-pi.h>
> +#include <linux/io_uring.h>
>
> #include <trace/events/block.h>
>
> @@ -189,6 +190,12 @@ enum {
> NVME_REQ_USERCMD = (1 << 1),
> };
>
> +static inline struct nvme_uring_cmd_pdu *nvme_uring_cmd_pdu(
> + struct io_uring_cmd *ioucmd)
> +{
> + return (struct nvme_uring_cmd_pdu *)&ioucmd->pdu;
> +}
> +
> static inline struct nvme_request *nvme_req(struct request *req)
> {
> return blk_mq_rq_to_pdu(req);
> @@ -442,6 +449,9 @@ struct nvme_ns_head {
> struct work_struct requeue_work;
> struct mutex lock;
> unsigned long flags;
> + /* for uring-passthru multipath handling */
> + struct ioucmd_list requeue_ioucmd_list;
> + spinlock_t requeue_ioucmd_lock;
> #define NVME_NSHEAD_DISK_LIVE 0
> struct nvme_ns __rcu *current_path[];
> #endif
> @@ -830,6 +840,7 @@ int nvme_ns_chr_uring_cmd(struct io_uring_cmd *ioucmd,
> unsigned int issue_flags);
> int nvme_ns_head_chr_uring_cmd(struct io_uring_cmd *ioucmd,
> unsigned int issue_flags);
> +void nvme_ioucmd_mpath_retry(struct io_uring_cmd *ioucmd);
> int nvme_getgeo(struct block_device *bdev, struct hd_geometry *geo);
> int nvme_dev_uring_cmd(struct io_uring_cmd *ioucmd, unsigned int issue_flags);
>
> @@ -844,6 +855,7 @@ static inline bool nvme_ctrl_use_ana(struct nvme_ctrl *ctrl)
> return ctrl->ana_log_buf != NULL;
> }
>
> +bool nvme_available_path(struct nvme_ns_head *head);
> void nvme_mpath_unfreeze(struct nvme_subsystem *subsys);
> void nvme_mpath_wait_freeze(struct nvme_subsystem *subsys);
> void nvme_mpath_start_freeze(struct nvme_subsystem *subsys);
> diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
> index d734599cbcd7..57f4dfc83316 100644
> --- a/include/linux/io_uring.h
> +++ b/include/linux/io_uring.h
> @@ -15,6 +15,8 @@ enum io_uring_cmd_flags {
> IO_URING_F_SQE128 = 4,
> IO_URING_F_CQE32 = 8,
> IO_URING_F_IOPOLL = 16,
> + /* to indicate that it is a MPATH req*/
> + IO_URING_F_MPATH = 32,
Unrelated, but this should probably move to bitwise representation...
* Re: [PATCH for-next 4/4] nvme-multipath: add multipathing for uring-passthrough commands
2022-07-11 13:51 ` Sagi Grimberg
@ 2022-07-11 15:12 ` Stefan Metzmacher
0 siblings, 0 replies; 7+ messages in thread
From: Stefan Metzmacher @ 2022-07-11 15:12 UTC (permalink / raw)
To: Sagi Grimberg, Kanchan Joshi, hch, kbusch, axboe
Cc: io-uring, linux-nvme, linux-block, asml.silence, joshiiitr,
anuj20.g, gost.dev
Hi Sagi,
>> @@ -189,6 +190,12 @@ enum {
>> NVME_REQ_USERCMD = (1 << 1),
>> };
>> +static inline struct nvme_uring_cmd_pdu *nvme_uring_cmd_pdu(
>> + struct io_uring_cmd *ioucmd)
>> +{
Shouldn't we have a BUILD_BUG_ON(sizeof(struct nvme_uring_cmd_pdu) > sizeof(ioucmd->pdu));
here?
>> + return (struct nvme_uring_cmd_pdu *)&ioucmd->pdu;
>> +}
>> +
>> diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
>> index d734599cbcd7..57f4dfc83316 100644
>> --- a/include/linux/io_uring.h
>> +++ b/include/linux/io_uring.h
>> @@ -15,6 +15,8 @@ enum io_uring_cmd_flags {
>> IO_URING_F_SQE128 = 4,
>> IO_URING_F_CQE32 = 8,
>> IO_URING_F_IOPOLL = 16,
>> + /* to indicate that it is a MPATH req*/
>> + IO_URING_F_MPATH = 32,
Isn't that nvme specific? If so I don't think it belongs in io_uring.h at all...
metze
Thread overview: 7+ messages
[not found] <CGME20220711110753epcas5p4169b9e288d15ca35740dbb66a6f6983a@epcas5p4.samsung.com>
2022-07-11 11:01 ` [PATCH for-next 0/4] nvme-multipathing for uring-passthrough Kanchan Joshi
[not found] ` <CGME20220711110800epcas5p3d338dd486fd778c5ba5bfe93a91ec8bd@epcas5p3.samsung.com>
2022-07-11 11:01 ` [PATCH for-next 1/4] io_uring, nvme: rename a function Kanchan Joshi
[not found] ` <CGME20220711110812epcas5p33aa90b23aa62fb11722aa8195754becf@epcas5p3.samsung.com>
2022-07-11 11:01 ` [PATCH for-next 2/4] nvme: compact nvme_uring_cmd_pdu struct Kanchan Joshi
[not found] ` <CGME20220711110824epcas5p22c8e945cb8c3c3ac46c8c2b5ab55db9b@epcas5p2.samsung.com>
2022-07-11 11:01 ` [PATCH for-next 3/4] io_uring: grow a field in struct io_uring_cmd Kanchan Joshi
[not found] ` <CGME20220711110827epcas5p3fd81f142f55ca3048abc38a9ef0d0089@epcas5p3.samsung.com>
2022-07-11 11:01 ` [PATCH for-next 4/4] nvme-multipath: add multipathing for uring-passthrough commands Kanchan Joshi
2022-07-11 13:51 ` Sagi Grimberg
2022-07-11 15:12 ` Stefan Metzmacher