* [PATCH for-next 0/4] nvme-multipathing for uring-passthrough [not found] <CGME20220711110753epcas5p4169b9e288d15ca35740dbb66a6f6983a@epcas5p4.samsung.com> @ 2022-07-11 11:01 ` Kanchan Joshi [not found] ` <CGME20220711110800epcas5p3d338dd486fd778c5ba5bfe93a91ec8bd@epcas5p3.samsung.com> ` (3 more replies) 0 siblings, 4 replies; 52+ messages in thread From: Kanchan Joshi @ 2022-07-11 11:01 UTC (permalink / raw) To: hch, sagi, kbusch, axboe Cc: io-uring, linux-nvme, linux-block, asml.silence, joshiiitr, anuj20.g, gost.dev, Kanchan Joshi nvme passthrough lacks multipathing capability and some of us have already expressed interest to see this plumbed. Most recently during LSFMM, around 2 months back. This series wires up multipathing for uring-passthrough commands. Attempt is not to affect the common path (i.e. when requeue/failover does not trigger) with allocation or deferral. The most important design bit is to treat "struct io_uring_cmd" in the same way as "struct bio" is treated by the block-based nvme multipath. Uring-commands are queued when path is not available, and resubmitted on discovery of new path. Also if passthrough command on multipath-node is failed, it is resubmitted on a different path. Testing: Using the upstream fio that support uring-passthrough: fio -iodepth=16 -rw=randread -ioengine=io_uring_cmd -bs=4k -numjobs=4 -size=1G -iodepth_batch_submit=16 -group_reporting -cmd_type=nvme -filename=/dev/ng0n1 -name=uring-pt 1. Multiple failover - every command is retried 1-5 times before completion 2. Fail nvme_find_path() - this tests completion post requeue 3. Combine above two 4. Repeat above but for passthrough commands which do not generate bio (e.g. flush command) Anuj Gupta (2): nvme: compact nvme_uring_cmd_pdu struct nvme-multipath: add multipathing for uring-passthrough commands Kanchan Joshi (2): io_uring, nvme: rename a function io_uring: grow a field in struct io_uring_cmd drivers/nvme/host/ioctl.c | 157 +++++++++++++++++++++++++++++----- drivers/nvme/host/multipath.c | 36 +++++++- drivers/nvme/host/nvme.h | 26 ++++++ include/linux/io_uring.h | 44 +++++++++- io_uring/uring_cmd.c | 4 +- 5 files changed, 237 insertions(+), 30 deletions(-) -- 2.25.1 ^ permalink raw reply [flat|nested] 52+ messages in thread
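For readers who have not used the interface being wired up here: below is a minimal sketch of a single uring-passthrough read against the per-namespace char device, i.e. roughly one I/O of the fio workload quoted above. It is illustrative only and not part of the posted series; it assumes a liburing new enough to know about 128-byte SQEs / 32-byte CQEs, root privileges, nsid 1 and a 512-byte-LBA namespace behind /dev/ng0n1, and it skips most error handling.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <fcntl.h>
#include <liburing.h>
#include <linux/nvme_ioctl.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	struct nvme_uring_cmd *cmd;
	void *buf;
	int fd, ret;

	fd = open("/dev/ng0n1", O_RDONLY);	/* per-ns char device */
	if (fd < 0 || posix_memalign(&buf, 4096, 4096))
		return 1;

	/* uring passthrough needs the big SQE/CQE formats */
	ret = io_uring_queue_init(8, &ring,
				  IORING_SETUP_SQE128 | IORING_SETUP_CQE32);
	if (ret)
		return 1;

	sqe = io_uring_get_sqe(&ring);
	memset(sqe, 0, 2 * sizeof(*sqe));	/* zero the full 128B SQE */
	sqe->opcode = IORING_OP_URING_CMD;
	sqe->fd = fd;
	sqe->cmd_op = NVME_URING_CMD_IO;

	/* the nvme command sits in the big-SQE payload area */
	cmd = (struct nvme_uring_cmd *)sqe->cmd;
	cmd->opcode = 0x02;			/* nvme read */
	cmd->nsid = 1;				/* assumed; can be queried via NVME_IOCTL_ID */
	cmd->addr = (__u64)(uintptr_t)buf;
	cmd->data_len = 4096;
	cmd->cdw10 = 0;				/* starting LBA, lower 32 bits */
	cmd->cdw11 = 0;				/* starting LBA, upper 32 bits */
	cmd->cdw12 = 7;				/* 0-based count: 8 x 512B LBAs = 4KiB */

	io_uring_submit(&ring);
	ret = io_uring_wait_cqe(&ring, &cqe);
	if (!ret)
		printf("res: %d\n", cqe->res);
	io_uring_cqe_seen(&ring, cqe);
	io_uring_queue_exit(&ring);
	return 0;
}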
* [PATCH for-next 1/4] io_uring, nvme: rename a function [not found] ` <CGME20220711110800epcas5p3d338dd486fd778c5ba5bfe93a91ec8bd@epcas5p3.samsung.com> @ 2022-07-11 11:01 ` Kanchan Joshi 2022-07-14 13:55 ` Ming Lei 0 siblings, 1 reply; 52+ messages in thread From: Kanchan Joshi @ 2022-07-11 11:01 UTC (permalink / raw) To: hch, sagi, kbusch, axboe Cc: io-uring, linux-nvme, linux-block, asml.silence, joshiiitr, anuj20.g, gost.dev, Kanchan Joshi io_uring_cmd_complete_in_task() is bit of a misnomer. It schedules a callback function for execution in task context. What callback does is private to provider, and does not have to be completion. So rename it to io_uring_cmd_execute_in_task() to allow more generic use. Signed-off-by: Kanchan Joshi <[email protected]> --- drivers/nvme/host/ioctl.c | 2 +- include/linux/io_uring.h | 4 ++-- io_uring/uring_cmd.c | 4 ++-- 3 files changed, 5 insertions(+), 5 deletions(-) diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c index a2e89db1cd63..9227e07f717e 100644 --- a/drivers/nvme/host/ioctl.c +++ b/drivers/nvme/host/ioctl.c @@ -395,7 +395,7 @@ static void nvme_uring_cmd_end_io(struct request *req, blk_status_t err) pdu->req = req; req->bio = bio; /* this takes care of moving rest of completion-work to task context */ - io_uring_cmd_complete_in_task(ioucmd, nvme_uring_task_cb); + io_uring_cmd_execute_in_task(ioucmd, nvme_uring_task_cb); } static int nvme_uring_cmd_io(struct nvme_ctrl *ctrl, struct nvme_ns *ns, diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h index 4a2f6cc5a492..54063d67506b 100644 --- a/include/linux/io_uring.h +++ b/include/linux/io_uring.h @@ -29,7 +29,7 @@ struct io_uring_cmd { #if defined(CONFIG_IO_URING) void io_uring_cmd_done(struct io_uring_cmd *cmd, ssize_t ret, ssize_t res2); -void io_uring_cmd_complete_in_task(struct io_uring_cmd *ioucmd, +void io_uring_cmd_execute_in_task(struct io_uring_cmd *ioucmd, void (*task_work_cb)(struct io_uring_cmd *)); struct sock *io_uring_get_socket(struct file *file); void __io_uring_cancel(bool cancel_all); @@ -59,7 +59,7 @@ static inline void io_uring_cmd_done(struct io_uring_cmd *cmd, ssize_t ret, ssize_t ret2) { } -static inline void io_uring_cmd_complete_in_task(struct io_uring_cmd *ioucmd, +static inline void io_uring_cmd_execute_in_task(struct io_uring_cmd *ioucmd, void (*task_work_cb)(struct io_uring_cmd *)) { } diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c index 0a421ed51e7e..d409b99abac5 100644 --- a/io_uring/uring_cmd.c +++ b/io_uring/uring_cmd.c @@ -16,7 +16,7 @@ static void io_uring_cmd_work(struct io_kiocb *req, bool *locked) ioucmd->task_work_cb(ioucmd); } -void io_uring_cmd_complete_in_task(struct io_uring_cmd *ioucmd, +void io_uring_cmd_execute_in_task(struct io_uring_cmd *ioucmd, void (*task_work_cb)(struct io_uring_cmd *)) { struct io_kiocb *req = cmd_to_io_kiocb(ioucmd); @@ -25,7 +25,7 @@ void io_uring_cmd_complete_in_task(struct io_uring_cmd *ioucmd, req->io_task_work.func = io_uring_cmd_work; io_req_task_work_add(req); } -EXPORT_SYMBOL_GPL(io_uring_cmd_complete_in_task); +EXPORT_SYMBOL_GPL(io_uring_cmd_execute_in_task); static inline void io_req_set_cqe32_extra(struct io_kiocb *req, u64 extra1, u64 extra2) -- 2.25.1 ^ permalink raw reply related [flat|nested] 52+ messages in thread
* Re: [PATCH for-next 1/4] io_uring, nvme: rename a function 2022-07-11 11:01 ` [PATCH for-next 1/4] io_uring, nvme: rename a function Kanchan Joshi @ 2022-07-14 13:55 ` Ming Lei 0 siblings, 0 replies; 52+ messages in thread From: Ming Lei @ 2022-07-14 13:55 UTC (permalink / raw) To: Kanchan Joshi Cc: hch, sagi, kbusch, axboe, io-uring, linux-nvme, linux-block, asml.silence, joshiiitr, anuj20.g, gost.dev, ming.lei On Mon, Jul 11, 2022 at 04:31:52PM +0530, Kanchan Joshi wrote: > io_uring_cmd_complete_in_task() is bit of a misnomer. It schedules a > callback function for execution in task context. What callback does is > private to provider, and does not have to be completion. So rename it to > io_uring_cmd_execute_in_task() to allow more generic use. > > Signed-off-by: Kanchan Joshi <[email protected]> ublk driver has same usage, and this change makes sense: Reviewed-by: Ming Lei <[email protected]> thanks, Ming ^ permalink raw reply [flat|nested] 52+ messages in thread
* [PATCH for-next 2/4] nvme: compact nvme_uring_cmd_pdu struct [not found] ` <CGME20220711110812epcas5p33aa90b23aa62fb11722aa8195754becf@epcas5p3.samsung.com> @ 2022-07-11 11:01 ` Kanchan Joshi 2022-07-12 6:32 ` Christoph Hellwig 0 siblings, 1 reply; 52+ messages in thread From: Kanchan Joshi @ 2022-07-11 11:01 UTC (permalink / raw) To: hch, sagi, kbusch, axboe Cc: io-uring, linux-nvme, linux-block, asml.silence, joshiiitr, anuj20.g, gost.dev From: Anuj Gupta <[email protected]> Mark this packed so that we can create bit more space in its container i.e. io_uring_cmd. This is in preparation to support multipathing on uring-passthrough. Move its definition to nvme.h as well. Signed-off-by: Anuj Gupta <[email protected]> --- drivers/nvme/host/ioctl.c | 14 -------------- drivers/nvme/host/nvme.h | 14 ++++++++++++++ 2 files changed, 14 insertions(+), 14 deletions(-) diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c index 9227e07f717e..fc02eddd4977 100644 --- a/drivers/nvme/host/ioctl.c +++ b/drivers/nvme/host/ioctl.c @@ -340,20 +340,6 @@ struct nvme_uring_data { __u32 timeout_ms; }; -/* - * This overlays struct io_uring_cmd pdu. - * Expect build errors if this grows larger than that. - */ -struct nvme_uring_cmd_pdu { - union { - struct bio *bio; - struct request *req; - }; - void *meta; /* kernel-resident buffer */ - void __user *meta_buffer; - u32 meta_len; -}; - static inline struct nvme_uring_cmd_pdu *nvme_uring_cmd_pdu( struct io_uring_cmd *ioucmd) { diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h index 7323a2f61126..9d3ff6feda06 100644 --- a/drivers/nvme/host/nvme.h +++ b/drivers/nvme/host/nvme.h @@ -165,6 +165,20 @@ struct nvme_request { struct nvme_ctrl *ctrl; }; +/* + * This overlays struct io_uring_cmd pdu. + * Expect build errors if this grows larger than that. + */ +struct nvme_uring_cmd_pdu { + union { + struct bio *bio; + struct request *req; + }; + void *meta; /* kernel-resident buffer */ + void __user *meta_buffer; + u32 meta_len; +} __packed; + /* * Mark a bio as coming in through the mpath node. */ -- 2.25.1 ^ permalink raw reply related [flat|nested] 52+ messages in thread
* Re: [PATCH for-next 2/4] nvme: compact nvme_uring_cmd_pdu struct 2022-07-11 11:01 ` [PATCH for-next 2/4] nvme: compact nvme_uring_cmd_pdu struct Kanchan Joshi @ 2022-07-12 6:32 ` Christoph Hellwig 0 siblings, 0 replies; 52+ messages in thread From: Christoph Hellwig @ 2022-07-12 6:32 UTC (permalink / raw) To: Kanchan Joshi Cc: hch, sagi, kbusch, axboe, io-uring, linux-nvme, linux-block, asml.silence, joshiiitr, anuj20.g, gost.dev On Mon, Jul 11, 2022 at 04:31:53PM +0530, Kanchan Joshi wrote: > From: Anuj Gupta <[email protected]> > > Mark this packed so that we can create bit more space in its container > i.e. io_uring_cmd. This is in preparation to support multipathing on > uring-passthrough. > Move its definition to nvme.h as well. I do not like this. packed structures that contain pointers are inherently dangerous as that will get us into unaligned accesses very quickly. I also do not think we should expose it any more widely than absolutely required. In fact if possible I'd really like to figure out how we can remove this pdu concept entirely an just have a small number of well typed field directly in the uring cmd. This will involved some rework of the passthrough I/O completions so that we can get at the metadata biovecs and integrity data. ^ permalink raw reply [flat|nested] 52+ messages in thread
* [PATCH for-next 3/4] io_uring: grow a field in struct io_uring_cmd [not found] ` <CGME20220711110824epcas5p22c8e945cb8c3c3ac46c8c2b5ab55db9b@epcas5p2.samsung.com> @ 2022-07-11 11:01 ` Kanchan Joshi 2022-07-11 17:00 ` Sagi Grimberg 2022-07-11 17:18 ` Jens Axboe 0 siblings, 2 replies; 52+ messages in thread From: Kanchan Joshi @ 2022-07-11 11:01 UTC (permalink / raw) To: hch, sagi, kbusch, axboe Cc: io-uring, linux-nvme, linux-block, asml.silence, joshiiitr, anuj20.g, gost.dev, Kanchan Joshi Use the leftover space to carve 'next' field that enables linking of io_uring_cmd structs. Also introduce a list head and few helpers. This is in preparation to support nvme-mulitpath, allowing multiple uring passthrough commands to be queued. Signed-off-by: Kanchan Joshi <[email protected]> Signed-off-by: Anuj Gupta <[email protected]> --- include/linux/io_uring.h | 38 ++++++++++++++++++++++++++++++++++++-- 1 file changed, 36 insertions(+), 2 deletions(-) diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h index 54063d67506b..d734599cbcd7 100644 --- a/include/linux/io_uring.h +++ b/include/linux/io_uring.h @@ -22,9 +22,14 @@ struct io_uring_cmd { const void *cmd; /* callback to defer completions to task context */ void (*task_work_cb)(struct io_uring_cmd *cmd); + struct io_uring_cmd *next; u32 cmd_op; - u32 pad; - u8 pdu[32]; /* available inline for free use */ + u8 pdu[28]; /* available inline for free use */ +}; + +struct ioucmd_list { + struct io_uring_cmd *head; + struct io_uring_cmd *tail; }; #if defined(CONFIG_IO_URING) @@ -54,6 +59,27 @@ static inline void io_uring_free(struct task_struct *tsk) if (tsk->io_uring) __io_uring_free(tsk); } + +static inline struct io_uring_cmd *ioucmd_list_get(struct ioucmd_list *il) +{ + struct io_uring_cmd *ioucmd = il->head; + + il->head = il->tail = NULL; + + return ioucmd; +} + +static inline void ioucmd_list_add(struct ioucmd_list *il, + struct io_uring_cmd *ioucmd) +{ + ioucmd->next = NULL; + + if (il->tail) + il->tail->next = ioucmd; + else + il->head = ioucmd; + il->tail = ioucmd; +} #else static inline void io_uring_cmd_done(struct io_uring_cmd *cmd, ssize_t ret, ssize_t ret2) @@ -80,6 +106,14 @@ static inline const char *io_uring_get_opcode(u8 opcode) { return ""; } +static inline struct io_uring_cmd *ioucmd_list_get(struct ioucmd_list *il) +{ + return NULL; +} +static inline void ioucmd_list_add(struct ioucmd_list *il, + struct io_uring_cmd *ioucmd) +{ +} #endif #endif -- 2.25.1 ^ permalink raw reply related [flat|nested] 52+ messages in thread
* Re: [PATCH for-next 3/4] io_uring: grow a field in struct io_uring_cmd 2022-07-11 11:01 ` [PATCH for-next 3/4] io_uring: grow a field in struct io_uring_cmd Kanchan Joshi @ 2022-07-11 17:00 ` Sagi Grimberg 2022-07-11 17:19 ` Jens Axboe 2022-07-11 17:18 ` Jens Axboe 1 sibling, 1 reply; 52+ messages in thread From: Sagi Grimberg @ 2022-07-11 17:00 UTC (permalink / raw) To: Kanchan Joshi, hch, kbusch, axboe Cc: io-uring, linux-nvme, linux-block, asml.silence, joshiiitr, anuj20.g, gost.dev > Use the leftover space to carve 'next' field that enables linking of > io_uring_cmd structs. Also introduce a list head and few helpers. > > This is in preparation to support nvme-mulitpath, allowing multiple > uring passthrough commands to be queued. > > Signed-off-by: Kanchan Joshi <[email protected]> > Signed-off-by: Anuj Gupta <[email protected]> > --- > include/linux/io_uring.h | 38 ++++++++++++++++++++++++++++++++++++-- > 1 file changed, 36 insertions(+), 2 deletions(-) > > diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h > index 54063d67506b..d734599cbcd7 100644 > --- a/include/linux/io_uring.h > +++ b/include/linux/io_uring.h > @@ -22,9 +22,14 @@ struct io_uring_cmd { > const void *cmd; > /* callback to defer completions to task context */ > void (*task_work_cb)(struct io_uring_cmd *cmd); > + struct io_uring_cmd *next; > u32 cmd_op; > - u32 pad; > - u8 pdu[32]; /* available inline for free use */ > + u8 pdu[28]; /* available inline for free use */ > +}; I think io_uring_cmd will at some point become two cachelines and may not be worth the effort to limit a pdu to 28 bytes... ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH for-next 3/4] io_uring: grow a field in struct io_uring_cmd 2022-07-11 17:00 ` Sagi Grimberg @ 2022-07-11 17:19 ` Jens Axboe 0 siblings, 0 replies; 52+ messages in thread From: Jens Axboe @ 2022-07-11 17:19 UTC (permalink / raw) To: Sagi Grimberg, Kanchan Joshi, hch, kbusch Cc: io-uring, linux-nvme, linux-block, asml.silence, joshiiitr, anuj20.g, gost.dev On 7/11/22 11:00 AM, Sagi Grimberg wrote: > >> Use the leftover space to carve 'next' field that enables linking of >> io_uring_cmd structs. Also introduce a list head and few helpers. >> >> This is in preparation to support nvme-mulitpath, allowing multiple >> uring passthrough commands to be queued. >> >> Signed-off-by: Kanchan Joshi <[email protected]> >> Signed-off-by: Anuj Gupta <[email protected]> >> --- >> include/linux/io_uring.h | 38 ++++++++++++++++++++++++++++++++++++-- >> 1 file changed, 36 insertions(+), 2 deletions(-) >> >> diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h >> index 54063d67506b..d734599cbcd7 100644 >> --- a/include/linux/io_uring.h >> +++ b/include/linux/io_uring.h >> @@ -22,9 +22,14 @@ struct io_uring_cmd { >> const void *cmd; >> /* callback to defer completions to task context */ >> void (*task_work_cb)(struct io_uring_cmd *cmd); >> + struct io_uring_cmd *next; >> u32 cmd_op; >> - u32 pad; >> - u8 pdu[32]; /* available inline for free use */ >> + u8 pdu[28]; /* available inline for free use */ >> +}; > > I think io_uring_cmd will at some point become two cachelines and may > not be worth the effort to limit a pdu to 28 bytes... -- Jens Axboe ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH for-next 3/4] io_uring: grow a field in struct io_uring_cmd 2022-07-11 11:01 ` [PATCH for-next 3/4] io_uring: grow a field in struct io_uring_cmd Kanchan Joshi 2022-07-11 17:00 ` Sagi Grimberg @ 2022-07-11 17:18 ` Jens Axboe 2022-07-11 17:55 ` Sagi Grimberg 1 sibling, 1 reply; 52+ messages in thread From: Jens Axboe @ 2022-07-11 17:18 UTC (permalink / raw) To: Kanchan Joshi, hch, sagi, kbusch Cc: io-uring, linux-nvme, linux-block, asml.silence, joshiiitr, anuj20.g, gost.dev On 7/11/22 5:01 AM, Kanchan Joshi wrote: > Use the leftover space to carve 'next' field that enables linking of > io_uring_cmd structs. Also introduce a list head and few helpers. > > This is in preparation to support nvme-mulitpath, allowing multiple > uring passthrough commands to be queued. It's not clear to me why we need linking at that level? -- Jens Axboe ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH for-next 3/4] io_uring: grow a field in struct io_uring_cmd 2022-07-11 17:18 ` Jens Axboe @ 2022-07-11 17:55 ` Sagi Grimberg 2022-07-11 18:22 ` Sagi Grimberg 0 siblings, 1 reply; 52+ messages in thread From: Sagi Grimberg @ 2022-07-11 17:55 UTC (permalink / raw) To: Jens Axboe, Kanchan Joshi, hch, kbusch Cc: io-uring, linux-nvme, linux-block, asml.silence, joshiiitr, anuj20.g, gost.dev >> Use the leftover space to carve 'next' field that enables linking of >> io_uring_cmd structs. Also introduce a list head and few helpers. >> >> This is in preparation to support nvme-mulitpath, allowing multiple >> uring passthrough commands to be queued. > > It's not clear to me why we need linking at that level? I think the attempt is to allow something like blk_steal_bios that nvme leverages for io_uring_cmd(s). nvme failover steals all the bios from requests that fail (and should failover) and puts them on a requeue list, and then schedules a work that takes these bios one-by-one and submits them on a different bottom namespace (see nvme_failover_req/nvme_requeue_work). ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH for-next 3/4] io_uring: grow a field in struct io_uring_cmd 2022-07-11 17:55 ` Sagi Grimberg @ 2022-07-11 18:22 ` Sagi Grimberg 2022-07-11 18:24 ` Jens Axboe 2022-07-14 3:40 ` Ming Lei 0 siblings, 2 replies; 52+ messages in thread From: Sagi Grimberg @ 2022-07-11 18:22 UTC (permalink / raw) To: Jens Axboe, Kanchan Joshi, hch, kbusch Cc: io-uring, linux-nvme, linux-block, asml.silence, joshiiitr, anuj20.g, gost.dev >>> Use the leftover space to carve 'next' field that enables linking of >>> io_uring_cmd structs. Also introduce a list head and few helpers. >>> >>> This is in preparation to support nvme-mulitpath, allowing multiple >>> uring passthrough commands to be queued. >> >> It's not clear to me why we need linking at that level? > > I think the attempt is to allow something like blk_steal_bios that > nvme leverages for io_uring_cmd(s). I'll rephrase because now that I read it, I think my phrasing is confusing. I think the attempt is to allow something like blk_steal_bios that nvme leverages, but for io_uring_cmd(s). Essentially allow io_uring_cmd to be linked in a requeue_list. > > nvme failover steals all the bios from requests that fail (and should > failover) and puts them on a requeue list, and then schedules > a work that takes these bios one-by-one and submits them on a different > bottom namespace (see nvme_failover_req/nvme_requeue_work). Maybe if io_kiocb could exposed to nvme, and it had some generic space that nvme could use, that would work as well... ^ permalink raw reply [flat|nested] 52+ messages in thread
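For readers following along, the park-and-drain scheme referred to above looks roughly like the sketch below. All names are made up for illustration (they are not the actual nvme or io_uring symbols), and the path lookup plus actual resubmission is elided; the point is only the shape: a failure or missing path parks the command on a list and returns, and a work item drains the list once a usable path shows up.

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/workqueue.h>

struct pt_requeue {
	spinlock_t		lock;
	struct list_head	cmds;	/* commands parked waiting for a path */
	struct work_struct	work;	/* kicked when a usable path appears */
};

struct pt_cmd {
	struct list_head	node;
	/* enough saved state to rebuild and resubmit the command */
};

/* failure / no-path side: park the command and give control back */
static void pt_requeue_add(struct pt_requeue *rq, struct pt_cmd *cmd)
{
	unsigned long flags;

	spin_lock_irqsave(&rq->lock, flags);
	list_add_tail(&cmd->node, &rq->cmds);
	spin_unlock_irqrestore(&rq->lock, flags);
}

/* path-discovery side: drain the parked commands one by one */
static void pt_requeue_work(struct work_struct *work)
{
	struct pt_requeue *rq = container_of(work, struct pt_requeue, work);
	struct pt_cmd *cmd, *next;
	LIST_HEAD(resubmit);

	spin_lock_irq(&rq->lock);
	list_splice_init(&rq->cmds, &resubmit);
	spin_unlock_irq(&rq->lock);

	list_for_each_entry_safe(cmd, next, &resubmit, node) {
		list_del_init(&cmd->node);
		/* look up a fresh path and resubmit 'cmd' on it */
	}
}

Patch 3/4 implements this same shape with a hand-rolled singly-linked ioucmd_list rather than a list_head, since struct io_uring_cmd only has room left for a single 'next' pointer, which is what the discussion below is about.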
* Re: [PATCH for-next 3/4] io_uring: grow a field in struct io_uring_cmd 2022-07-11 18:22 ` Sagi Grimberg @ 2022-07-11 18:24 ` Jens Axboe 2022-07-11 18:58 ` Sagi Grimberg 2022-07-12 11:40 ` Kanchan Joshi 2022-07-14 3:40 ` Ming Lei 1 sibling, 2 replies; 52+ messages in thread From: Jens Axboe @ 2022-07-11 18:24 UTC (permalink / raw) To: Sagi Grimberg, Kanchan Joshi, hch, kbusch Cc: io-uring, linux-nvme, linux-block, asml.silence, joshiiitr, anuj20.g, gost.dev On 7/11/22 12:22 PM, Sagi Grimberg wrote: > >>>> Use the leftover space to carve 'next' field that enables linking of >>>> io_uring_cmd structs. Also introduce a list head and few helpers. >>>> >>>> This is in preparation to support nvme-mulitpath, allowing multiple >>>> uring passthrough commands to be queued. >>> >>> It's not clear to me why we need linking at that level? >> >> I think the attempt is to allow something like blk_steal_bios that >> nvme leverages for io_uring_cmd(s). > > I'll rephrase because now that I read it, I think my phrasing is > confusing. > > I think the attempt is to allow something like blk_steal_bios that > nvme leverages, but for io_uring_cmd(s). Essentially allow io_uring_cmd > to be linked in a requeue_list. I see. I wonder if there's some other way we can accomplish that, so we don't have to shrink the current space. io_kiocb already support linking, so seems like that should be workable. >> nvme failover steals all the bios from requests that fail (and should >> failover) and puts them on a requeue list, and then schedules >> a work that takes these bios one-by-one and submits them on a different >> bottom namespace (see nvme_failover_req/nvme_requeue_work). > > Maybe if io_kiocb could exposed to nvme, and it had some generic space > that nvme could use, that would work as well... It will be more exposed in 5.20, but passthrough is already using the per-op allotted space in the io_kiocb. But as mentioned above, there's already linking support between io_kiocbs, and that is likely what should be used here too. -- Jens Axboe ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH for-next 3/4] io_uring: grow a field in struct io_uring_cmd 2022-07-11 18:24 ` Jens Axboe @ 2022-07-11 18:58 ` Sagi Grimberg 2022-07-12 11:40 ` Kanchan Joshi 1 sibling, 0 replies; 52+ messages in thread From: Sagi Grimberg @ 2022-07-11 18:58 UTC (permalink / raw) To: Jens Axboe, Kanchan Joshi, hch, kbusch Cc: io-uring, linux-nvme, linux-block, asml.silence, joshiiitr, anuj20.g, gost.dev >>>>> Use the leftover space to carve 'next' field that enables linking of >>>>> io_uring_cmd structs. Also introduce a list head and few helpers. >>>>> >>>>> This is in preparation to support nvme-mulitpath, allowing multiple >>>>> uring passthrough commands to be queued. >>>> >>>> It's not clear to me why we need linking at that level? >>> >>> I think the attempt is to allow something like blk_steal_bios that >>> nvme leverages for io_uring_cmd(s). >> >> I'll rephrase because now that I read it, I think my phrasing is >> confusing. >> >> I think the attempt is to allow something like blk_steal_bios that >> nvme leverages, but for io_uring_cmd(s). Essentially allow io_uring_cmd >> to be linked in a requeue_list. > > I see. I wonder if there's some other way we can accomplish that, so we > don't have to shrink the current space. io_kiocb already support > linking, so seems like that should be workable. > >>> nvme failover steals all the bios from requests that fail (and should >>> failover) and puts them on a requeue list, and then schedules >>> a work that takes these bios one-by-one and submits them on a different >>> bottom namespace (see nvme_failover_req/nvme_requeue_work). >> >> Maybe if io_kiocb could exposed to nvme, and it had some generic space >> that nvme could use, that would work as well... > > It will be more exposed in 5.20, but passthrough is already using the > per-op allotted space in the io_kiocb. But as mentioned above, there's > already linking support between io_kiocbs, and that is likely what > should be used here too. Agree. I don't think there is an issue with passing uring_cmd() an io_kiocb instead of a io_uring_cmd which is less space strict. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH for-next 3/4] io_uring: grow a field in struct io_uring_cmd 2022-07-11 18:24 ` Jens Axboe 2022-07-11 18:58 ` Sagi Grimberg @ 2022-07-12 11:40 ` Kanchan Joshi 1 sibling, 0 replies; 52+ messages in thread From: Kanchan Joshi @ 2022-07-12 11:40 UTC (permalink / raw) To: Jens Axboe Cc: Sagi Grimberg, hch, kbusch, io-uring, linux-nvme, linux-block, asml.silence, joshiiitr, anuj20.g, gost.dev [-- Attachment #1: Type: text/plain, Size: 1815 bytes --] On Mon, Jul 11, 2022 at 12:24:42PM -0600, Jens Axboe wrote: >On 7/11/22 12:22 PM, Sagi Grimberg wrote: >> >>>>> Use the leftover space to carve 'next' field that enables linking of >>>>> io_uring_cmd structs. Also introduce a list head and few helpers. >>>>> >>>>> This is in preparation to support nvme-mulitpath, allowing multiple >>>>> uring passthrough commands to be queued. >>>> >>>> It's not clear to me why we need linking at that level? >>> >>> I think the attempt is to allow something like blk_steal_bios that >>> nvme leverages for io_uring_cmd(s). >> >> I'll rephrase because now that I read it, I think my phrasing is >> confusing. >> >> I think the attempt is to allow something like blk_steal_bios that >> nvme leverages, but for io_uring_cmd(s). Essentially allow io_uring_cmd >> to be linked in a requeue_list. > >I see. I wonder if there's some other way we can accomplish that, so we >don't have to shrink the current space. io_kiocb already support >linking, so seems like that should be workable. > >>> nvme failover steals all the bios from requests that fail (and should >>> failover) and puts them on a requeue list, and then schedules >>> a work that takes these bios one-by-one and submits them on a different >>> bottom namespace (see nvme_failover_req/nvme_requeue_work). >> >> Maybe if io_kiocb could exposed to nvme, and it had some generic space >> that nvme could use, that would work as well... > >It will be more exposed in 5.20, but passthrough is already using the >per-op allotted space in the io_kiocb. But as mentioned above, there's >already linking support between io_kiocbs, and that is likely what >should be used here too. > io_kiocb linking is used if user-space wants to link SQEs for any ordering. If we go this route, we give up that feature for uring-command SQEs. [-- Attachment #2: Type: text/plain, Size: 0 bytes --] ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH for-next 3/4] io_uring: grow a field in struct io_uring_cmd 2022-07-11 18:22 ` Sagi Grimberg 2022-07-11 18:24 ` Jens Axboe @ 2022-07-14 3:40 ` Ming Lei 2022-07-14 8:19 ` Kanchan Joshi 1 sibling, 1 reply; 52+ messages in thread From: Ming Lei @ 2022-07-14 3:40 UTC (permalink / raw) To: Sagi Grimberg Cc: Jens Axboe, Kanchan Joshi, hch, kbusch, io-uring, linux-nvme, linux-block, asml.silence, joshiiitr, anuj20.g, gost.dev, ming.lei On Mon, Jul 11, 2022 at 09:22:54PM +0300, Sagi Grimberg wrote: > > > > > Use the leftover space to carve 'next' field that enables linking of > > > > io_uring_cmd structs. Also introduce a list head and few helpers. > > > > > > > > This is in preparation to support nvme-mulitpath, allowing multiple > > > > uring passthrough commands to be queued. > > > > > > It's not clear to me why we need linking at that level? > > > > I think the attempt is to allow something like blk_steal_bios that > > nvme leverages for io_uring_cmd(s). > > I'll rephrase because now that I read it, I think my phrasing is > confusing. > > I think the attempt is to allow something like blk_steal_bios that > nvme leverages, but for io_uring_cmd(s). Essentially allow io_uring_cmd > to be linked in a requeue_list. io_uring_cmd is 1:1 with pt request, so I am wondering why not retry on io_uring_cmd instance directly via io_uring_cmd_execute_in_task(). I feels it isn't necessary to link io_uring_cmd into list. Thanks, Ming ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH for-next 3/4] io_uring: grow a field in struct io_uring_cmd 2022-07-14 3:40 ` Ming Lei @ 2022-07-14 8:19 ` Kanchan Joshi 2022-07-14 15:30 ` Daniel Wagner 0 siblings, 1 reply; 52+ messages in thread From: Kanchan Joshi @ 2022-07-14 8:19 UTC (permalink / raw) To: Ming Lei Cc: Sagi Grimberg, Jens Axboe, hch, kbusch, io-uring, linux-nvme, linux-block, asml.silence, joshiiitr, anuj20.g, gost.dev [-- Attachment #1: Type: text/plain, Size: 1606 bytes --] On Thu, Jul 14, 2022 at 11:40:46AM +0800, Ming Lei wrote: >On Mon, Jul 11, 2022 at 09:22:54PM +0300, Sagi Grimberg wrote: >> >> > > > Use the leftover space to carve 'next' field that enables linking of >> > > > io_uring_cmd structs. Also introduce a list head and few helpers. >> > > > >> > > > This is in preparation to support nvme-mulitpath, allowing multiple >> > > > uring passthrough commands to be queued. >> > > >> > > It's not clear to me why we need linking at that level? >> > >> > I think the attempt is to allow something like blk_steal_bios that >> > nvme leverages for io_uring_cmd(s). >> >> I'll rephrase because now that I read it, I think my phrasing is >> confusing. >> >> I think the attempt is to allow something like blk_steal_bios that >> nvme leverages, but for io_uring_cmd(s). Essentially allow io_uring_cmd >> to be linked in a requeue_list. > >io_uring_cmd is 1:1 with pt request, so I am wondering why not retry on >io_uring_cmd instance directly via io_uring_cmd_execute_in_task(). > >I feels it isn't necessary to link io_uring_cmd into list. If path is not available, retry is not done immediately rather we wait for path to be available (as underlying controller may still be resetting/connecting). List helped as command gets added into it (and submitter/io_uring gets the control back), and retry is done exact point in time. But yes, it won't harm if we do couple of retries even if path is known not to be available (somewhat like iopoll). As this situation is not common. And with that scheme, we don't have to link io_uring_cmd. Sagi: does this sound fine to you? [-- Attachment #2: Type: text/plain, Size: 0 bytes --] ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH for-next 3/4] io_uring: grow a field in struct io_uring_cmd 2022-07-14 8:19 ` Kanchan Joshi @ 2022-07-14 15:30 ` Daniel Wagner 2022-07-15 11:07 ` Kanchan Joshi 0 siblings, 1 reply; 52+ messages in thread From: Daniel Wagner @ 2022-07-14 15:30 UTC (permalink / raw) To: Kanchan Joshi Cc: Ming Lei, Sagi Grimberg, Jens Axboe, hch, kbusch, io-uring, linux-nvme, linux-block, asml.silence, joshiiitr, anuj20.g, gost.dev On Thu, Jul 14, 2022 at 01:49:23PM +0530, Kanchan Joshi wrote: > If path is not available, retry is not done immediately rather we wait for > path to be available (as underlying controller may still be > resetting/connecting). List helped as command gets added into > it (and submitter/io_uring gets the control back), and retry is done > exact point in time. > But yes, it won't harm if we do couple of retries even if path is known > not to be available (somewhat like iopoll). As this situation is > not common. And with that scheme, we don't have to link io_uring_cmd. Stupid question does it only fail over immediately when the path is not available or any failure? If it fails over for everything it's possible the target gets the same request twice. FWIW, we are just debugging this scenario right now. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH for-next 3/4] io_uring: grow a field in struct io_uring_cmd 2022-07-14 15:30 ` Daniel Wagner @ 2022-07-15 11:07 ` Kanchan Joshi 2022-07-18 9:03 ` Daniel Wagner 0 siblings, 1 reply; 52+ messages in thread From: Kanchan Joshi @ 2022-07-15 11:07 UTC (permalink / raw) To: Daniel Wagner Cc: Ming Lei, Sagi Grimberg, Jens Axboe, hch, kbusch, io-uring, linux-nvme, linux-block, asml.silence, joshiiitr, anuj20.g, gost.dev [-- Attachment #1: Type: text/plain, Size: 1046 bytes --] On Thu, Jul 14, 2022 at 05:30:51PM +0200, Daniel Wagner wrote: >On Thu, Jul 14, 2022 at 01:49:23PM +0530, Kanchan Joshi wrote: >> If path is not available, retry is not done immediately rather we wait for >> path to be available (as underlying controller may still be >> resetting/connecting). List helped as command gets added into >> it (and submitter/io_uring gets the control back), and retry is done >> exact point in time. >> But yes, it won't harm if we do couple of retries even if path is known >> not to be available (somewhat like iopoll). As this situation is >> not common. And with that scheme, we don't have to link io_uring_cmd. > >Stupid question does it only fail over immediately when the path is not >available or any failure? If it fails over for everything it's possible >the target gets the same request twice. FWIW, we are just debugging this >scenario right now. failover is only for path-related failure, and not otherwise. you might want to take a look at nvme_decide_disposition routine where it makes that decision. [-- Attachment #2: Type: text/plain, Size: 0 bytes --] ^ permalink raw reply [flat|nested] 52+ messages in thread
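For completeness: "path-related" is decided purely from the NVMe status code type, so ordinary command errors still complete or retry on the same path and are never resent through another one. Paraphrasing the in-tree helper from memory (treat this as a sketch, not verbatim kernel code):

#include <linux/types.h>

/* status code type 3h is "path related status" */
static inline bool nvme_is_path_error(u16 status)
{
	return (status & 0x700) == 0x300;
}

As far as I recall, nvme_decide_disposition() returns FAILOVER only when the request carries REQ_NVME_MPATH and this check (or a dying queue) holds; everything else goes down the usual complete/retry path.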
* Re: [PATCH for-next 3/4] io_uring: grow a field in struct io_uring_cmd 2022-07-15 11:07 ` Kanchan Joshi @ 2022-07-18 9:03 ` Daniel Wagner 0 siblings, 0 replies; 52+ messages in thread From: Daniel Wagner @ 2022-07-18 9:03 UTC (permalink / raw) To: Kanchan Joshi Cc: Ming Lei, Sagi Grimberg, Jens Axboe, hch, kbusch, io-uring, linux-nvme, linux-block, asml.silence, joshiiitr, anuj20.g, gost.dev On Fri, Jul 15, 2022 at 04:37:00PM +0530, Kanchan Joshi wrote: > On Thu, Jul 14, 2022 at 05:30:51PM +0200, Daniel Wagner wrote: > > On Thu, Jul 14, 2022 at 01:49:23PM +0530, Kanchan Joshi wrote: > > > If path is not available, retry is not done immediately rather we wait for > > > path to be available (as underlying controller may still be > > > resetting/connecting). List helped as command gets added into > > > it (and submitter/io_uring gets the control back), and retry is done > > > exact point in time. > > > But yes, it won't harm if we do couple of retries even if path is known > > > not to be available (somewhat like iopoll). As this situation is > > > not common. And with that scheme, we don't have to link io_uring_cmd. > > > > Stupid question does it only fail over immediately when the path is not > > available or any failure? If it fails over for everything it's possible > > the target gets the same request twice. FWIW, we are just debugging this > > scenario right now. > > failover is only for path-related failure, and not otherwise. > you might want to take a look at nvme_decide_disposition routine where > it makes that decision. Ah okay, never mind. I somewhat got the impression there is special handling code added for this case. Sorry for the noise. ^ permalink raw reply [flat|nested] 52+ messages in thread
* [PATCH for-next 4/4] nvme-multipath: add multipathing for uring-passthrough commands [not found] ` <CGME20220711110827epcas5p3fd81f142f55ca3048abc38a9ef0d0089@epcas5p3.samsung.com> @ 2022-07-11 11:01 ` Kanchan Joshi 2022-07-11 13:51 ` Sagi Grimberg 2022-07-12 6:52 ` Christoph Hellwig 0 siblings, 2 replies; 52+ messages in thread From: Kanchan Joshi @ 2022-07-11 11:01 UTC (permalink / raw) To: hch, sagi, kbusch, axboe Cc: io-uring, linux-nvme, linux-block, asml.silence, joshiiitr, anuj20.g, gost.dev, Kanchan Joshi From: Anuj Gupta <[email protected]> This heavily reuses nvme-mpath infrastructure for block-path. Block-path operates on bio, and uses that as long-lived object across requeue attempts. For passthrough interface io_uring_cmd happens to be such object, so add a requeue list for that. If path is not available, uring-cmds are added into that list and resubmitted on discovery of new path. Existing requeue handler (nvme_requeue_work) is extended to do this resubmission. For failed commands, failover is done directly (i.e. without first queuing into list) by resubmitting individual command from task-work based callback. Suggested-by: Sagi Grimberg <[email protected]> Co-developed-by: Kanchan Joshi <[email protected]> Signed-off-by: Kanchan Joshi <[email protected]> Signed-off-by: Anuj Gupta <[email protected]> --- drivers/nvme/host/ioctl.c | 141 ++++++++++++++++++++++++++++++++-- drivers/nvme/host/multipath.c | 36 ++++++++- drivers/nvme/host/nvme.h | 12 +++ include/linux/io_uring.h | 2 + 4 files changed, 182 insertions(+), 9 deletions(-) diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c index fc02eddd4977..281c21d3c67d 100644 --- a/drivers/nvme/host/ioctl.c +++ b/drivers/nvme/host/ioctl.c @@ -340,12 +340,6 @@ struct nvme_uring_data { __u32 timeout_ms; }; -static inline struct nvme_uring_cmd_pdu *nvme_uring_cmd_pdu( - struct io_uring_cmd *ioucmd) -{ - return (struct nvme_uring_cmd_pdu *)&ioucmd->pdu; -} - static void nvme_uring_task_cb(struct io_uring_cmd *ioucmd) { struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd); @@ -448,6 +442,14 @@ static int nvme_uring_cmd_io(struct nvme_ctrl *ctrl, struct nvme_ns *ns, pdu->meta_buffer = nvme_to_user_ptr(d.metadata); pdu->meta_len = d.metadata_len; + if (issue_flags & IO_URING_F_MPATH) { + req->cmd_flags |= REQ_NVME_MPATH; + /* + * we would need the buffer address (d.addr field) if we have + * to retry the command. Store it by repurposing ioucmd->cmd + */ + ioucmd->cmd = (void *)d.addr; + } blk_execute_rq_nowait(req, false); return -EIOCBQUEUED; } @@ -665,12 +667,135 @@ int nvme_ns_head_chr_uring_cmd(struct io_uring_cmd *ioucmd, int srcu_idx = srcu_read_lock(&head->srcu); struct nvme_ns *ns = nvme_find_path(head); int ret = -EINVAL; + struct device *dev = &head->cdev_device; + + if (likely(ns)) { + ret = nvme_ns_uring_cmd(ns, ioucmd, + issue_flags | IO_URING_F_MPATH); + } else if (nvme_available_path(head)) { + struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd); + struct nvme_uring_cmd *payload = NULL; + + dev_warn_ratelimited(dev, "no usable path - requeuing I/O\n"); + /* + * We may requeue two kinds of uring commands: + * 1. command failed post submission. pdu->req will be non-null + * for this + * 2. command that could not be submitted because path was not + * available. For this pdu->req is set to NULL. + */ + pdu->req = NULL; + /* + * passthrough command will not be available during retry as it + * is embedded in io_uring's SQE. 
Hence we allocate/copy here + */ + payload = kmalloc(sizeof(struct nvme_uring_cmd), GFP_KERNEL); + if (!payload) { + ret = -ENOMEM; + goto out_unlock; + } + memcpy(payload, ioucmd->cmd, sizeof(struct nvme_uring_cmd)); + ioucmd->cmd = payload; - if (ns) - ret = nvme_ns_uring_cmd(ns, ioucmd, issue_flags); + spin_lock_irq(&head->requeue_ioucmd_lock); + ioucmd_list_add(&head->requeue_ioucmd_list, ioucmd); + spin_unlock_irq(&head->requeue_ioucmd_lock); + ret = -EIOCBQUEUED; + } else { + dev_warn_ratelimited(dev, "no available path - failing I/O\n"); + } +out_unlock: srcu_read_unlock(&head->srcu, srcu_idx); return ret; } + +int nvme_uring_cmd_io_retry(struct nvme_ns *ns, struct request *oreq, + struct io_uring_cmd *ioucmd, struct nvme_uring_cmd_pdu *pdu) +{ + struct nvme_ctrl *ctrl = ns->ctrl; + struct request_queue *q = ns ? ns->queue : ctrl->admin_q; + struct nvme_uring_data d; + struct nvme_command c; + struct request *req; + struct bio *obio = oreq->bio; + void *meta = NULL; + + memcpy(&c, nvme_req(oreq)->cmd, sizeof(struct nvme_command)); + d.metadata = (__u64)pdu->meta_buffer; + d.metadata_len = pdu->meta_len; + d.timeout_ms = oreq->timeout; + d.addr = (__u64)ioucmd->cmd; + if (obio) { + d.data_len = obio->bi_iter.bi_size; + blk_rq_unmap_user(obio); + } else { + d.data_len = 0; + } + blk_mq_free_request(oreq); + req = nvme_alloc_user_request(q, &c, nvme_to_user_ptr(d.addr), + d.data_len, nvme_to_user_ptr(d.metadata), + d.metadata_len, 0, &meta, d.timeout_ms ? + msecs_to_jiffies(d.timeout_ms) : 0, + ioucmd->cmd_op == NVME_URING_CMD_IO_VEC, 0, 0); + if (IS_ERR(req)) + return PTR_ERR(req); + + req->end_io = nvme_uring_cmd_end_io; + req->end_io_data = ioucmd; + pdu->bio = req->bio; + pdu->meta = meta; + req->cmd_flags |= REQ_NVME_MPATH; + blk_execute_rq_nowait(req, false); + return -EIOCBQUEUED; +} + +void nvme_ioucmd_mpath_retry(struct io_uring_cmd *ioucmd) +{ + struct cdev *cdev = file_inode(ioucmd->file)->i_cdev; + struct nvme_ns_head *head = container_of(cdev, struct nvme_ns_head, + cdev); + int srcu_idx = srcu_read_lock(&head->srcu); + struct nvme_ns *ns = nvme_find_path(head); + unsigned int issue_flags = IO_URING_F_SQE128 | IO_URING_F_CQE32 | + IO_URING_F_MPATH; + struct device *dev = &head->cdev_device; + + if (likely(ns)) { + struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd); + struct request *oreq = pdu->req; + int ret; + + if (oreq == NULL) { + /* + * this was not submitted (to device) earlier. For this + * ioucmd->cmd points to persistent memory. Free that + * up post submission + */ + const void *cmd = ioucmd->cmd; + + ret = nvme_ns_uring_cmd(ns, ioucmd, issue_flags); + kfree(cmd); + } else { + /* + * this was submitted (to device) earlier. 
Use old + * request, bio (if it exists) and nvme-pdu to recreate + * the command for the discovered path + */ + ret = nvme_uring_cmd_io_retry(ns, oreq, ioucmd, pdu); + } + if (ret != -EIOCBQUEUED) + io_uring_cmd_done(ioucmd, ret, 0); + } else if (nvme_available_path(head)) { + dev_warn_ratelimited(dev, "no usable path - requeuing I/O\n"); + spin_lock_irq(&head->requeue_ioucmd_lock); + ioucmd_list_add(&head->requeue_ioucmd_list, ioucmd); + spin_unlock_irq(&head->requeue_ioucmd_lock); + } else { + dev_warn_ratelimited(dev, "no available path - failing I/O\n"); + io_uring_cmd_done(ioucmd, -EINVAL, 0); + } + srcu_read_unlock(&head->srcu, srcu_idx); +} #endif /* CONFIG_NVME_MULTIPATH */ int nvme_dev_uring_cmd(struct io_uring_cmd *ioucmd, unsigned int issue_flags) diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c index f26640ccb955..fe5655d98c36 100644 --- a/drivers/nvme/host/multipath.c +++ b/drivers/nvme/host/multipath.c @@ -6,6 +6,7 @@ #include <linux/backing-dev.h> #include <linux/moduleparam.h> #include <linux/vmalloc.h> +#include <linux/io_uring.h> #include <trace/events/block.h> #include "nvme.h" @@ -80,6 +81,17 @@ void nvme_mpath_start_freeze(struct nvme_subsystem *subsys) blk_freeze_queue_start(h->disk->queue); } +static void nvme_ioucmd_failover_req(struct request *req, struct nvme_ns *ns) +{ + struct io_uring_cmd *ioucmd = req->end_io_data; + struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd); + + /* store the request, to reconstruct the command during retry */ + pdu->req = req; + /* retry in original submitter context */ + io_uring_cmd_execute_in_task(ioucmd, nvme_ioucmd_mpath_retry); +} + void nvme_failover_req(struct request *req) { struct nvme_ns *ns = req->q->queuedata; @@ -99,6 +111,11 @@ void nvme_failover_req(struct request *req) queue_work(nvme_wq, &ns->ctrl->ana_work); } + if (blk_rq_is_passthrough(req)) { + nvme_ioucmd_failover_req(req, ns); + return; + } + spin_lock_irqsave(&ns->head->requeue_lock, flags); for (bio = req->bio; bio; bio = bio->bi_next) { bio_set_dev(bio, ns->head->disk->part0); @@ -314,7 +331,7 @@ inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head) return ns; } -static bool nvme_available_path(struct nvme_ns_head *head) +bool nvme_available_path(struct nvme_ns_head *head) { struct nvme_ns *ns; @@ -459,7 +476,9 @@ static void nvme_requeue_work(struct work_struct *work) struct nvme_ns_head *head = container_of(work, struct nvme_ns_head, requeue_work); struct bio *bio, *next; + struct io_uring_cmd *ioucmd, *ioucmd_next; + /* process requeued bios*/ spin_lock_irq(&head->requeue_lock); next = bio_list_get(&head->requeue_list); spin_unlock_irq(&head->requeue_lock); @@ -470,6 +489,21 @@ static void nvme_requeue_work(struct work_struct *work) submit_bio_noacct(bio); } + + /* process requeued passthrough-commands */ + spin_lock_irq(&head->requeue_ioucmd_lock); + ioucmd_next = ioucmd_list_get(&head->requeue_ioucmd_list); + spin_unlock_irq(&head->requeue_ioucmd_lock); + + while ((ioucmd = ioucmd_next) != NULL) { + ioucmd_next = ioucmd->next; + ioucmd->next = NULL; + /* + * Retry in original submitter context as we would be + * processing the passthrough command again + */ + io_uring_cmd_execute_in_task(ioucmd, nvme_ioucmd_mpath_retry); + } } int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head) diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h index 9d3ff6feda06..125b48e74e72 100644 --- a/drivers/nvme/host/nvme.h +++ b/drivers/nvme/host/nvme.h @@ -16,6 +16,7 @@ #include 
<linux/rcupdate.h> #include <linux/wait.h> #include <linux/t10-pi.h> +#include <linux/io_uring.h> #include <trace/events/block.h> @@ -189,6 +190,12 @@ enum { NVME_REQ_USERCMD = (1 << 1), }; +static inline struct nvme_uring_cmd_pdu *nvme_uring_cmd_pdu( + struct io_uring_cmd *ioucmd) +{ + return (struct nvme_uring_cmd_pdu *)&ioucmd->pdu; +} + static inline struct nvme_request *nvme_req(struct request *req) { return blk_mq_rq_to_pdu(req); @@ -442,6 +449,9 @@ struct nvme_ns_head { struct work_struct requeue_work; struct mutex lock; unsigned long flags; + /* for uring-passthru multipath handling */ + struct ioucmd_list requeue_ioucmd_list; + spinlock_t requeue_ioucmd_lock; #define NVME_NSHEAD_DISK_LIVE 0 struct nvme_ns __rcu *current_path[]; #endif @@ -830,6 +840,7 @@ int nvme_ns_chr_uring_cmd(struct io_uring_cmd *ioucmd, unsigned int issue_flags); int nvme_ns_head_chr_uring_cmd(struct io_uring_cmd *ioucmd, unsigned int issue_flags); +void nvme_ioucmd_mpath_retry(struct io_uring_cmd *ioucmd); int nvme_getgeo(struct block_device *bdev, struct hd_geometry *geo); int nvme_dev_uring_cmd(struct io_uring_cmd *ioucmd, unsigned int issue_flags); @@ -844,6 +855,7 @@ static inline bool nvme_ctrl_use_ana(struct nvme_ctrl *ctrl) return ctrl->ana_log_buf != NULL; } +bool nvme_available_path(struct nvme_ns_head *head); void nvme_mpath_unfreeze(struct nvme_subsystem *subsys); void nvme_mpath_wait_freeze(struct nvme_subsystem *subsys); void nvme_mpath_start_freeze(struct nvme_subsystem *subsys); diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h index d734599cbcd7..57f4dfc83316 100644 --- a/include/linux/io_uring.h +++ b/include/linux/io_uring.h @@ -15,6 +15,8 @@ enum io_uring_cmd_flags { IO_URING_F_SQE128 = 4, IO_URING_F_CQE32 = 8, IO_URING_F_IOPOLL = 16, + /* to indicate that it is a MPATH req*/ + IO_URING_F_MPATH = 32, }; struct io_uring_cmd { -- 2.25.1 ^ permalink raw reply related [flat|nested] 52+ messages in thread
* Re: [PATCH for-next 4/4] nvme-multipath: add multipathing for uring-passthrough commands 2022-07-11 11:01 ` [PATCH for-next 4/4] nvme-multipath: add multipathing for uring-passthrough commands Kanchan Joshi @ 2022-07-11 13:51 ` Sagi Grimberg 2022-07-11 15:12 ` Stefan Metzmacher 2022-07-11 18:37 ` Kanchan Joshi 2022-07-12 6:52 ` Christoph Hellwig 1 sibling, 2 replies; 52+ messages in thread From: Sagi Grimberg @ 2022-07-11 13:51 UTC (permalink / raw) To: Kanchan Joshi, hch, kbusch, axboe Cc: io-uring, linux-nvme, linux-block, asml.silence, joshiiitr, anuj20.g, gost.dev > This heavily reuses nvme-mpath infrastructure for block-path. > Block-path operates on bio, and uses that as long-lived object across > requeue attempts. For passthrough interface io_uring_cmd happens to > be such object, so add a requeue list for that. > > If path is not available, uring-cmds are added into that list and > resubmitted on discovery of new path. Existing requeue handler > (nvme_requeue_work) is extended to do this resubmission. > > For failed commands, failover is done directly (i.e. without first > queuing into list) by resubmitting individual command from task-work > based callback. Nice! > > Suggested-by: Sagi Grimberg <[email protected]> > Co-developed-by: Kanchan Joshi <[email protected]> > Signed-off-by: Kanchan Joshi <[email protected]> > Signed-off-by: Anuj Gupta <[email protected]> > --- > drivers/nvme/host/ioctl.c | 141 ++++++++++++++++++++++++++++++++-- > drivers/nvme/host/multipath.c | 36 ++++++++- > drivers/nvme/host/nvme.h | 12 +++ > include/linux/io_uring.h | 2 + > 4 files changed, 182 insertions(+), 9 deletions(-) > > diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c > index fc02eddd4977..281c21d3c67d 100644 > --- a/drivers/nvme/host/ioctl.c > +++ b/drivers/nvme/host/ioctl.c > @@ -340,12 +340,6 @@ struct nvme_uring_data { > __u32 timeout_ms; > }; > > -static inline struct nvme_uring_cmd_pdu *nvme_uring_cmd_pdu( > - struct io_uring_cmd *ioucmd) > -{ > - return (struct nvme_uring_cmd_pdu *)&ioucmd->pdu; > -} > - > static void nvme_uring_task_cb(struct io_uring_cmd *ioucmd) > { > struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd); > @@ -448,6 +442,14 @@ static int nvme_uring_cmd_io(struct nvme_ctrl *ctrl, struct nvme_ns *ns, > pdu->meta_buffer = nvme_to_user_ptr(d.metadata); > pdu->meta_len = d.metadata_len; > > + if (issue_flags & IO_URING_F_MPATH) { > + req->cmd_flags |= REQ_NVME_MPATH; > + /* > + * we would need the buffer address (d.addr field) if we have > + * to retry the command. Store it by repurposing ioucmd->cmd > + */ > + ioucmd->cmd = (void *)d.addr; What does repurposing mean here? > + } > blk_execute_rq_nowait(req, false); > return -EIOCBQUEUED; > } > @@ -665,12 +667,135 @@ int nvme_ns_head_chr_uring_cmd(struct io_uring_cmd *ioucmd, > int srcu_idx = srcu_read_lock(&head->srcu); > struct nvme_ns *ns = nvme_find_path(head); > int ret = -EINVAL; > + struct device *dev = &head->cdev_device; > + > + if (likely(ns)) { > + ret = nvme_ns_uring_cmd(ns, ioucmd, > + issue_flags | IO_URING_F_MPATH); > + } else if (nvme_available_path(head)) { > + struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd); > + struct nvme_uring_cmd *payload = NULL; > + > + dev_warn_ratelimited(dev, "no usable path - requeuing I/O\n"); > + /* > + * We may requeue two kinds of uring commands: > + * 1. command failed post submission. pdu->req will be non-null > + * for this > + * 2. command that could not be submitted because path was not > + * available. 
For this pdu->req is set to NULL. > + */ > + pdu->req = NULL; Relying on a pointer does not sound like a good idea to me. But why do you care at all where did this command came from? This code should not concern itself what happened prior to this execution. > + /* > + * passthrough command will not be available during retry as it > + * is embedded in io_uring's SQE. Hence we allocate/copy here > + */ OK, that is a nice solution. > + payload = kmalloc(sizeof(struct nvme_uring_cmd), GFP_KERNEL); > + if (!payload) { > + ret = -ENOMEM; > + goto out_unlock; > + } > + memcpy(payload, ioucmd->cmd, sizeof(struct nvme_uring_cmd)); > + ioucmd->cmd = payload; > > - if (ns) > - ret = nvme_ns_uring_cmd(ns, ioucmd, issue_flags); > + spin_lock_irq(&head->requeue_ioucmd_lock); > + ioucmd_list_add(&head->requeue_ioucmd_list, ioucmd); > + spin_unlock_irq(&head->requeue_ioucmd_lock); > + ret = -EIOCBQUEUED; > + } else { > + dev_warn_ratelimited(dev, "no available path - failing I/O\n"); ret=-EIO ? > + } > +out_unlock: > srcu_read_unlock(&head->srcu, srcu_idx); > return ret; > } > + > +int nvme_uring_cmd_io_retry(struct nvme_ns *ns, struct request *oreq, > + struct io_uring_cmd *ioucmd, struct nvme_uring_cmd_pdu *pdu) > +{ > + struct nvme_ctrl *ctrl = ns->ctrl; > + struct request_queue *q = ns ? ns->queue : ctrl->admin_q; > + struct nvme_uring_data d; > + struct nvme_command c; > + struct request *req; > + struct bio *obio = oreq->bio; > + void *meta = NULL; > + > + memcpy(&c, nvme_req(oreq)->cmd, sizeof(struct nvme_command)); > + d.metadata = (__u64)pdu->meta_buffer; > + d.metadata_len = pdu->meta_len; > + d.timeout_ms = oreq->timeout; > + d.addr = (__u64)ioucmd->cmd; > + if (obio) { > + d.data_len = obio->bi_iter.bi_size; > + blk_rq_unmap_user(obio); > + } else { > + d.data_len = 0; > + } > + blk_mq_free_request(oreq); The way I would do this that in nvme_ioucmd_failover_req (or in the retry driven from command retriable failure) I would do the above, requeue it and kick the requeue work, to go over the requeue_list and just execute them again. Not sure why you even need an explicit retry code. > + req = nvme_alloc_user_request(q, &c, nvme_to_user_ptr(d.addr), > + d.data_len, nvme_to_user_ptr(d.metadata), > + d.metadata_len, 0, &meta, d.timeout_ms ? > + msecs_to_jiffies(d.timeout_ms) : 0, > + ioucmd->cmd_op == NVME_URING_CMD_IO_VEC, 0, 0); > + if (IS_ERR(req)) > + return PTR_ERR(req); > + > + req->end_io = nvme_uring_cmd_end_io; > + req->end_io_data = ioucmd; > + pdu->bio = req->bio; > + pdu->meta = meta; > + req->cmd_flags |= REQ_NVME_MPATH; > + blk_execute_rq_nowait(req, false); > + return -EIOCBQUEUED; > +} > + > +void nvme_ioucmd_mpath_retry(struct io_uring_cmd *ioucmd) > +{ > + struct cdev *cdev = file_inode(ioucmd->file)->i_cdev; > + struct nvme_ns_head *head = container_of(cdev, struct nvme_ns_head, > + cdev); > + int srcu_idx = srcu_read_lock(&head->srcu); > + struct nvme_ns *ns = nvme_find_path(head); > + unsigned int issue_flags = IO_URING_F_SQE128 | IO_URING_F_CQE32 | > + IO_URING_F_MPATH; > + struct device *dev = &head->cdev_device; > + > + if (likely(ns)) { > + struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd); > + struct request *oreq = pdu->req; > + int ret; > + > + if (oreq == NULL) { > + /* > + * this was not submitted (to device) earlier. For this > + * ioucmd->cmd points to persistent memory. 
Free that > + * up post submission > + */ > + const void *cmd = ioucmd->cmd; > + > + ret = nvme_ns_uring_cmd(ns, ioucmd, issue_flags); > + kfree(cmd); > + } else { > + /* > + * this was submitted (to device) earlier. Use old > + * request, bio (if it exists) and nvme-pdu to recreate > + * the command for the discovered path > + */ > + ret = nvme_uring_cmd_io_retry(ns, oreq, ioucmd, pdu); Why is this needed? Why is reuse important here? Why not always call nvme_ns_uring_cmd? > + } > + if (ret != -EIOCBQUEUED) > + io_uring_cmd_done(ioucmd, ret, 0); > + } else if (nvme_available_path(head)) { > + dev_warn_ratelimited(dev, "no usable path - requeuing I/O\n"); > + spin_lock_irq(&head->requeue_ioucmd_lock); > + ioucmd_list_add(&head->requeue_ioucmd_list, ioucmd); > + spin_unlock_irq(&head->requeue_ioucmd_lock); > + } else { > + dev_warn_ratelimited(dev, "no available path - failing I/O\n"); > + io_uring_cmd_done(ioucmd, -EINVAL, 0); -EIO? > + } > + srcu_read_unlock(&head->srcu, srcu_idx); > +} > #endif /* CONFIG_NVME_MULTIPATH */ > > int nvme_dev_uring_cmd(struct io_uring_cmd *ioucmd, unsigned int issue_flags) > diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c > index f26640ccb955..fe5655d98c36 100644 > --- a/drivers/nvme/host/multipath.c > +++ b/drivers/nvme/host/multipath.c > @@ -6,6 +6,7 @@ > #include <linux/backing-dev.h> > #include <linux/moduleparam.h> > #include <linux/vmalloc.h> > +#include <linux/io_uring.h> > #include <trace/events/block.h> > #include "nvme.h" > > @@ -80,6 +81,17 @@ void nvme_mpath_start_freeze(struct nvme_subsystem *subsys) > blk_freeze_queue_start(h->disk->queue); > } > > +static void nvme_ioucmd_failover_req(struct request *req, struct nvme_ns *ns) > +{ > + struct io_uring_cmd *ioucmd = req->end_io_data; > + struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd); > + > + /* store the request, to reconstruct the command during retry */ > + pdu->req = req; > + /* retry in original submitter context */ > + io_uring_cmd_execute_in_task(ioucmd, nvme_ioucmd_mpath_retry); Why not deinit the command, put it on the request list and just schedule requeue_work? Why not just have requeue_work go over the request list and call nvme_ns_head_chr_uring_cmd similar to what it does for block requests? 
> +} > + > void nvme_failover_req(struct request *req) > { > struct nvme_ns *ns = req->q->queuedata; > @@ -99,6 +111,11 @@ void nvme_failover_req(struct request *req) > queue_work(nvme_wq, &ns->ctrl->ana_work); > } > > + if (blk_rq_is_passthrough(req)) { > + nvme_ioucmd_failover_req(req, ns); > + return; > + } > + > spin_lock_irqsave(&ns->head->requeue_lock, flags); > for (bio = req->bio; bio; bio = bio->bi_next) { > bio_set_dev(bio, ns->head->disk->part0); > @@ -314,7 +331,7 @@ inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head) > return ns; > } > > -static bool nvme_available_path(struct nvme_ns_head *head) > +bool nvme_available_path(struct nvme_ns_head *head) > { > struct nvme_ns *ns; > > @@ -459,7 +476,9 @@ static void nvme_requeue_work(struct work_struct *work) > struct nvme_ns_head *head = > container_of(work, struct nvme_ns_head, requeue_work); > struct bio *bio, *next; > + struct io_uring_cmd *ioucmd, *ioucmd_next; > > + /* process requeued bios*/ > spin_lock_irq(&head->requeue_lock); > next = bio_list_get(&head->requeue_list); > spin_unlock_irq(&head->requeue_lock); > @@ -470,6 +489,21 @@ static void nvme_requeue_work(struct work_struct *work) > > submit_bio_noacct(bio); > } > + > + /* process requeued passthrough-commands */ > + spin_lock_irq(&head->requeue_ioucmd_lock); > + ioucmd_next = ioucmd_list_get(&head->requeue_ioucmd_list); > + spin_unlock_irq(&head->requeue_ioucmd_lock); > + > + while ((ioucmd = ioucmd_next) != NULL) { > + ioucmd_next = ioucmd->next; > + ioucmd->next = NULL; > + /* > + * Retry in original submitter context as we would be > + * processing the passthrough command again > + */ > + io_uring_cmd_execute_in_task(ioucmd, nvme_ioucmd_mpath_retry); Why retry? why not nvme_ns_head_chr_uring_cmd ? > + } > } > > int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head) > diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h > index 9d3ff6feda06..125b48e74e72 100644 > --- a/drivers/nvme/host/nvme.h > +++ b/drivers/nvme/host/nvme.h > @@ -16,6 +16,7 @@ > #include <linux/rcupdate.h> > #include <linux/wait.h> > #include <linux/t10-pi.h> > +#include <linux/io_uring.h> > > #include <trace/events/block.h> > > @@ -189,6 +190,12 @@ enum { > NVME_REQ_USERCMD = (1 << 1), > }; > > +static inline struct nvme_uring_cmd_pdu *nvme_uring_cmd_pdu( > + struct io_uring_cmd *ioucmd) > +{ > + return (struct nvme_uring_cmd_pdu *)&ioucmd->pdu; > +} > + > static inline struct nvme_request *nvme_req(struct request *req) > { > return blk_mq_rq_to_pdu(req); > @@ -442,6 +449,9 @@ struct nvme_ns_head { > struct work_struct requeue_work; > struct mutex lock; > unsigned long flags; > + /* for uring-passthru multipath handling */ > + struct ioucmd_list requeue_ioucmd_list; > + spinlock_t requeue_ioucmd_lock; > #define NVME_NSHEAD_DISK_LIVE 0 > struct nvme_ns __rcu *current_path[]; > #endif > @@ -830,6 +840,7 @@ int nvme_ns_chr_uring_cmd(struct io_uring_cmd *ioucmd, > unsigned int issue_flags); > int nvme_ns_head_chr_uring_cmd(struct io_uring_cmd *ioucmd, > unsigned int issue_flags); > +void nvme_ioucmd_mpath_retry(struct io_uring_cmd *ioucmd); > int nvme_getgeo(struct block_device *bdev, struct hd_geometry *geo); > int nvme_dev_uring_cmd(struct io_uring_cmd *ioucmd, unsigned int issue_flags); > > @@ -844,6 +855,7 @@ static inline bool nvme_ctrl_use_ana(struct nvme_ctrl *ctrl) > return ctrl->ana_log_buf != NULL; > } > > +bool nvme_available_path(struct nvme_ns_head *head); > void nvme_mpath_unfreeze(struct nvme_subsystem *subsys); > void 
nvme_mpath_wait_freeze(struct nvme_subsystem *subsys); > void nvme_mpath_start_freeze(struct nvme_subsystem *subsys); > diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h > index d734599cbcd7..57f4dfc83316 100644 > --- a/include/linux/io_uring.h > +++ b/include/linux/io_uring.h > @@ -15,6 +15,8 @@ enum io_uring_cmd_flags { > IO_URING_F_SQE128 = 4, > IO_URING_F_CQE32 = 8, > IO_URING_F_IOPOLL = 16, > + /* to indicate that it is a MPATH req*/ > + IO_URING_F_MPATH = 32, Unrelated, but this should probably move to bitwise representation... ^ permalink raw reply [flat|nested] 52+ messages in thread
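For reference, a minimal sketch of the bitwise form suggested above, limited to the flags visible in this hunk and keeping the same numeric values (4, 8, 16, 32); whether IO_URING_F_MPATH should live in io_uring.h at all is questioned further down the thread:

#include <linux/bits.h>

enum io_uring_cmd_flags {
	IO_URING_F_SQE128	= BIT(2),	/* == 4 */
	IO_URING_F_CQE32	= BIT(3),	/* == 8 */
	IO_URING_F_IOPOLL	= BIT(4),	/* == 16 */
	/* to indicate that it is a MPATH req */
	IO_URING_F_MPATH	= BIT(5),	/* == 32 */
};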
* Re: [PATCH for-next 4/4] nvme-multipath: add multipathing for uring-passthrough commands 2022-07-11 13:51 ` Sagi Grimberg @ 2022-07-11 15:12 ` Stefan Metzmacher 2022-07-11 16:58 ` Sagi Grimberg 2022-07-11 18:54 ` Kanchan Joshi 2022-07-11 18:37 ` Kanchan Joshi 1 sibling, 2 replies; 52+ messages in thread From: Stefan Metzmacher @ 2022-07-11 15:12 UTC (permalink / raw) To: Sagi Grimberg, Kanchan Joshi, hch, kbusch, axboe Cc: io-uring, linux-nvme, linux-block, asml.silence, joshiiitr, anuj20.g, gost.dev Hi Sagi, >> @@ -189,6 +190,12 @@ enum { >> NVME_REQ_USERCMD = (1 << 1), >> }; >> +static inline struct nvme_uring_cmd_pdu *nvme_uring_cmd_pdu( >> + struct io_uring_cmd *ioucmd) >> +{ Shouldn't we have a BUILD_BUG_ON(sizeof(struct nvme_uring_cmd_pdu) > sizeof(ioucmd->pdu)); here? >> + return (struct nvme_uring_cmd_pdu *)&ioucmd->pdu; >> +} >> + >> diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h >> index d734599cbcd7..57f4dfc83316 100644 >> --- a/include/linux/io_uring.h >> +++ b/include/linux/io_uring.h >> @@ -15,6 +15,8 @@ enum io_uring_cmd_flags { >> IO_URING_F_SQE128 = 4, >> IO_URING_F_CQE32 = 8, >> IO_URING_F_IOPOLL = 16, >> + /* to indicate that it is a MPATH req*/ >> + IO_URING_F_MPATH = 32, Isn't that nvme specific? If so I don't think it belongs in io_uring.h at all... metze ^ permalink raw reply [flat|nested] 52+ messages in thread
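For reference, a minimal sketch of the check Stefan is asking for, placed inside the new inline helper; as the replies below note, an equivalent BUILD_BUG_ON already exists inside nvme_ns_uring_cmd, so this only illustrates where it could alternatively sit:

#include <linux/build_bug.h>

static inline struct nvme_uring_cmd_pdu *nvme_uring_cmd_pdu(
		struct io_uring_cmd *ioucmd)
{
	/* the nvme pdu must fit in the space io_uring reserves per command */
	BUILD_BUG_ON(sizeof(struct nvme_uring_cmd_pdu) > sizeof(ioucmd->pdu));
	return (struct nvme_uring_cmd_pdu *)&ioucmd->pdu;
}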
* Re: [PATCH for-next 4/4] nvme-multipath: add multipathing for uring-passthrough commands 2022-07-11 15:12 ` Stefan Metzmacher @ 2022-07-11 16:58 ` Sagi Grimberg 2022-07-11 18:54 ` Kanchan Joshi 1 sibling, 0 replies; 52+ messages in thread From: Sagi Grimberg @ 2022-07-11 16:58 UTC (permalink / raw) To: Stefan Metzmacher, Kanchan Joshi, hch, kbusch, axboe Cc: io-uring, linux-nvme, linux-block, asml.silence, joshiiitr, anuj20.g, gost.dev >>> +static inline struct nvme_uring_cmd_pdu *nvme_uring_cmd_pdu( >>> + struct io_uring_cmd *ioucmd) >>> +{ > > Shouldn't we have a BUILD_BUG_ON(sizeof(struct nvme_uring_cmd_pdu) > > sizeof(ioucmd->pdu)); > here? Probably... >>> + return (struct nvme_uring_cmd_pdu *)&ioucmd->pdu; >>> +} >>> + > >>> diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h >>> index d734599cbcd7..57f4dfc83316 100644 >>> --- a/include/linux/io_uring.h >>> +++ b/include/linux/io_uring.h >>> @@ -15,6 +15,8 @@ enum io_uring_cmd_flags { >>> IO_URING_F_SQE128 = 4, >>> IO_URING_F_CQE32 = 8, >>> IO_URING_F_IOPOLL = 16, >>> + /* to indicate that it is a MPATH req*/ >>> + IO_URING_F_MPATH = 32, > > Isn't that nvme specific? If so I don't think it belongs in io_uring.h > at all... Yes, it doesn't, it should be completely internal to nvme. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH for-next 4/4] nvme-multipath: add multipathing for uring-passthrough commands 2022-07-11 15:12 ` Stefan Metzmacher 2022-07-11 16:58 ` Sagi Grimberg @ 2022-07-11 18:54 ` Kanchan Joshi 1 sibling, 0 replies; 52+ messages in thread From: Kanchan Joshi @ 2022-07-11 18:54 UTC (permalink / raw) To: Stefan Metzmacher Cc: Sagi Grimberg, hch, kbusch, axboe, io-uring, linux-nvme, linux-block, asml.silence, joshiiitr, anuj20.g, gost.dev [-- Attachment #1: Type: text/plain, Size: 1198 bytes --] On Mon, Jul 11, 2022 at 05:12:23PM +0200, Stefan Metzmacher wrote: >Hi Sagi, > >>>@@ -189,6 +190,12 @@ enum { >>> NVME_REQ_USERCMD = (1 << 1), >>> }; >>>+static inline struct nvme_uring_cmd_pdu *nvme_uring_cmd_pdu( >>>+ struct io_uring_cmd *ioucmd) >>>+{ > >Shouldn't we have a BUILD_BUG_ON(sizeof(struct nvme_uring_cmd_pdu) > sizeof(ioucmd->pdu)); >here? Not sure if I got the concern. We have this already inside nvme_ns_uring_cmd function. >>>+ return (struct nvme_uring_cmd_pdu *)&ioucmd->pdu; >>>+} >>>+ > >>>diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h >>>index d734599cbcd7..57f4dfc83316 100644 >>>--- a/include/linux/io_uring.h >>>+++ b/include/linux/io_uring.h >>>@@ -15,6 +15,8 @@ enum io_uring_cmd_flags { >>> IO_URING_F_SQE128 = 4, >>> IO_URING_F_CQE32 = 8, >>> IO_URING_F_IOPOLL = 16, >>>+ /* to indicate that it is a MPATH req*/ >>>+ IO_URING_F_MPATH = 32, > >Isn't that nvme specific? If so I don't think it belongs in io_uring.h at all... Right. Removing this was bit ugly in nvme, but yes, need to kill this in v2. [-- Attachment #2: Type: text/plain, Size: 0 bytes --] ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH for-next 4/4] nvme-multipath: add multipathing for uring-passthrough commands 2022-07-11 13:51 ` Sagi Grimberg 2022-07-11 15:12 ` Stefan Metzmacher @ 2022-07-11 18:37 ` Kanchan Joshi 2022-07-11 19:56 ` Sagi Grimberg 1 sibling, 1 reply; 52+ messages in thread From: Kanchan Joshi @ 2022-07-11 18:37 UTC (permalink / raw) To: Sagi Grimberg Cc: hch, kbusch, axboe, io-uring, linux-nvme, linux-block, asml.silence, joshiiitr, anuj20.g, gost.dev [-- Attachment #1: Type: text/plain, Size: 13031 bytes --] On Mon, Jul 11, 2022 at 04:51:44PM +0300, Sagi Grimberg wrote: > >>This heavily reuses nvme-mpath infrastructure for block-path. >>Block-path operates on bio, and uses that as long-lived object across >>requeue attempts. For passthrough interface io_uring_cmd happens to >>be such object, so add a requeue list for that. >> >>If path is not available, uring-cmds are added into that list and >>resubmitted on discovery of new path. Existing requeue handler >>(nvme_requeue_work) is extended to do this resubmission. >> >>For failed commands, failover is done directly (i.e. without first >>queuing into list) by resubmitting individual command from task-work >>based callback. > >Nice! > >> >>Suggested-by: Sagi Grimberg <[email protected]> >>Co-developed-by: Kanchan Joshi <[email protected]> >>Signed-off-by: Kanchan Joshi <[email protected]> >>Signed-off-by: Anuj Gupta <[email protected]> >>--- >> drivers/nvme/host/ioctl.c | 141 ++++++++++++++++++++++++++++++++-- >> drivers/nvme/host/multipath.c | 36 ++++++++- >> drivers/nvme/host/nvme.h | 12 +++ >> include/linux/io_uring.h | 2 + >> 4 files changed, 182 insertions(+), 9 deletions(-) >> >>diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c >>index fc02eddd4977..281c21d3c67d 100644 >>--- a/drivers/nvme/host/ioctl.c >>+++ b/drivers/nvme/host/ioctl.c >>@@ -340,12 +340,6 @@ struct nvme_uring_data { >> __u32 timeout_ms; >> }; >>-static inline struct nvme_uring_cmd_pdu *nvme_uring_cmd_pdu( >>- struct io_uring_cmd *ioucmd) >>-{ >>- return (struct nvme_uring_cmd_pdu *)&ioucmd->pdu; >>-} >>- >> static void nvme_uring_task_cb(struct io_uring_cmd *ioucmd) >> { >> struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd); >>@@ -448,6 +442,14 @@ static int nvme_uring_cmd_io(struct nvme_ctrl *ctrl, struct nvme_ns *ns, >> pdu->meta_buffer = nvme_to_user_ptr(d.metadata); >> pdu->meta_len = d.metadata_len; >>+ if (issue_flags & IO_URING_F_MPATH) { >>+ req->cmd_flags |= REQ_NVME_MPATH; >>+ /* >>+ * we would need the buffer address (d.addr field) if we have >>+ * to retry the command. Store it by repurposing ioucmd->cmd >>+ */ >>+ ioucmd->cmd = (void *)d.addr; > >What does repurposing mean here? This field (ioucmd->cmd) was pointing to passthrough command (which is embedded in SQE of io_uring). At this point we have consumed passthrough command, so this field can be reused if we have to. And we have to beceause failover needs recreating passthrough command. Please see nvme_uring_cmd_io_retry to see how this helps in recreating the fields of passthrough command. And more on this below. 
>>+ } >> blk_execute_rq_nowait(req, false); >> return -EIOCBQUEUED; >> } >>@@ -665,12 +667,135 @@ int nvme_ns_head_chr_uring_cmd(struct io_uring_cmd *ioucmd, >> int srcu_idx = srcu_read_lock(&head->srcu); >> struct nvme_ns *ns = nvme_find_path(head); >> int ret = -EINVAL; >>+ struct device *dev = &head->cdev_device; >>+ >>+ if (likely(ns)) { >>+ ret = nvme_ns_uring_cmd(ns, ioucmd, >>+ issue_flags | IO_URING_F_MPATH); >>+ } else if (nvme_available_path(head)) { >>+ struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd); >>+ struct nvme_uring_cmd *payload = NULL; >>+ >>+ dev_warn_ratelimited(dev, "no usable path - requeuing I/O\n"); >>+ /* >>+ * We may requeue two kinds of uring commands: >>+ * 1. command failed post submission. pdu->req will be non-null >>+ * for this >>+ * 2. command that could not be submitted because path was not >>+ * available. For this pdu->req is set to NULL. >>+ */ >>+ pdu->req = NULL; > >Relying on a pointer does not sound like a good idea to me. >But why do you care at all where did this command came from? >This code should not concern itself what happened prior to this >execution. Required, please see nvme_uring_cmd_io_retry. And this should be more clear as part of responses to your other questions. >>+ /* >>+ * passthrough command will not be available during retry as it >>+ * is embedded in io_uring's SQE. Hence we allocate/copy here >>+ */ > >OK, that is a nice solution. Please note that prefered way is to recreate the passthrough command, and not to allocate it. We allocate it here because this happens very early (i.e. before processing passthrough command and setting that up inside struct request). Recreating requires a populated 'struct request' . > >>+ payload = kmalloc(sizeof(struct nvme_uring_cmd), GFP_KERNEL); >>+ if (!payload) { >>+ ret = -ENOMEM; >>+ goto out_unlock; >>+ } >>+ memcpy(payload, ioucmd->cmd, sizeof(struct nvme_uring_cmd)); >>+ ioucmd->cmd = payload; >>- if (ns) >>- ret = nvme_ns_uring_cmd(ns, ioucmd, issue_flags); >>+ spin_lock_irq(&head->requeue_ioucmd_lock); >>+ ioucmd_list_add(&head->requeue_ioucmd_list, ioucmd); >>+ spin_unlock_irq(&head->requeue_ioucmd_lock); >>+ ret = -EIOCBQUEUED; >>+ } else { >>+ dev_warn_ratelimited(dev, "no available path - failing I/O\n"); > >ret=-EIO ? Did not do as it was initialized to -EINVAL. Do you prefer -EIO instead. >>+ } >>+out_unlock: >> srcu_read_unlock(&head->srcu, srcu_idx); >> return ret; >> } >>+ >>+int nvme_uring_cmd_io_retry(struct nvme_ns *ns, struct request *oreq, >>+ struct io_uring_cmd *ioucmd, struct nvme_uring_cmd_pdu *pdu) >>+{ >>+ struct nvme_ctrl *ctrl = ns->ctrl; >>+ struct request_queue *q = ns ? ns->queue : ctrl->admin_q; >>+ struct nvme_uring_data d; >>+ struct nvme_command c; >>+ struct request *req; >>+ struct bio *obio = oreq->bio; >>+ void *meta = NULL; >>+ >>+ memcpy(&c, nvme_req(oreq)->cmd, sizeof(struct nvme_command)); >>+ d.metadata = (__u64)pdu->meta_buffer; >>+ d.metadata_len = pdu->meta_len; >>+ d.timeout_ms = oreq->timeout; >>+ d.addr = (__u64)ioucmd->cmd; >>+ if (obio) { >>+ d.data_len = obio->bi_iter.bi_size; >>+ blk_rq_unmap_user(obio); >>+ } else { >>+ d.data_len = 0; >>+ } >>+ blk_mq_free_request(oreq); > >The way I would do this that in nvme_ioucmd_failover_req (or in the >retry driven from command retriable failure) I would do the above, >requeue it and kick the requeue work, to go over the requeue_list and >just execute them again. Not sure why you even need an explicit retry >code. During retry we need passthrough command. 
But passthrough command is not stable (i.e. valid only during first submission). We can make it stable either by: (a) allocating in nvme (b) return -EAGAIN to io_uring, and it will do allocate + deferral Both add a cost. And since any command can potentially fail, that means taking that cost for every IO that we issue on mpath node. Even if no failure (initial or subsquent after IO) occcured. So to avoid commmon-path cost, we go about doing nothing (no allocation, no deferral) in the outset and choose to recreate the passthrough command if failure occured. Hope this explains the purpose of nvme_uring_cmd_io_retry? >>+ req = nvme_alloc_user_request(q, &c, nvme_to_user_ptr(d.addr), >>+ d.data_len, nvme_to_user_ptr(d.metadata), >>+ d.metadata_len, 0, &meta, d.timeout_ms ? >>+ msecs_to_jiffies(d.timeout_ms) : 0, >>+ ioucmd->cmd_op == NVME_URING_CMD_IO_VEC, 0, 0); >>+ if (IS_ERR(req)) >>+ return PTR_ERR(req); >>+ >>+ req->end_io = nvme_uring_cmd_end_io; >>+ req->end_io_data = ioucmd; >>+ pdu->bio = req->bio; >>+ pdu->meta = meta; >>+ req->cmd_flags |= REQ_NVME_MPATH; >>+ blk_execute_rq_nowait(req, false); >>+ return -EIOCBQUEUED; >>+} >>+ >>+void nvme_ioucmd_mpath_retry(struct io_uring_cmd *ioucmd) >>+{ >>+ struct cdev *cdev = file_inode(ioucmd->file)->i_cdev; >>+ struct nvme_ns_head *head = container_of(cdev, struct nvme_ns_head, >>+ cdev); >>+ int srcu_idx = srcu_read_lock(&head->srcu); >>+ struct nvme_ns *ns = nvme_find_path(head); >>+ unsigned int issue_flags = IO_URING_F_SQE128 | IO_URING_F_CQE32 | >>+ IO_URING_F_MPATH; >>+ struct device *dev = &head->cdev_device; >>+ >>+ if (likely(ns)) { >>+ struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd); >>+ struct request *oreq = pdu->req; >>+ int ret; >>+ >>+ if (oreq == NULL) { >>+ /* >>+ * this was not submitted (to device) earlier. For this >>+ * ioucmd->cmd points to persistent memory. Free that >>+ * up post submission >>+ */ >>+ const void *cmd = ioucmd->cmd; >>+ >>+ ret = nvme_ns_uring_cmd(ns, ioucmd, issue_flags); >>+ kfree(cmd); >>+ } else { >>+ /* >>+ * this was submitted (to device) earlier. Use old >>+ * request, bio (if it exists) and nvme-pdu to recreate >>+ * the command for the discovered path >>+ */ >>+ ret = nvme_uring_cmd_io_retry(ns, oreq, ioucmd, pdu); > >Why is this needed? Why is reuse important here? Why not always call >nvme_ns_uring_cmd? Please see the previous explanation. If condition is for the case when we made the passthrough command stable by allocating beforehand. Else is for the case when we avoided taking that cost. >>+ } >>+ if (ret != -EIOCBQUEUED) >>+ io_uring_cmd_done(ioucmd, ret, 0); >>+ } else if (nvme_available_path(head)) { >>+ dev_warn_ratelimited(dev, "no usable path - requeuing I/O\n"); >>+ spin_lock_irq(&head->requeue_ioucmd_lock); >>+ ioucmd_list_add(&head->requeue_ioucmd_list, ioucmd); >>+ spin_unlock_irq(&head->requeue_ioucmd_lock); >>+ } else { >>+ dev_warn_ratelimited(dev, "no available path - failing I/O\n"); >>+ io_uring_cmd_done(ioucmd, -EINVAL, 0); > >-EIO? Can change -EINVAL to -EIO if that is what you prefer. 
>>+ } >>+ srcu_read_unlock(&head->srcu, srcu_idx); >>+} >> #endif /* CONFIG_NVME_MULTIPATH */ >> int nvme_dev_uring_cmd(struct io_uring_cmd *ioucmd, unsigned int issue_flags) >>diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c >>index f26640ccb955..fe5655d98c36 100644 >>--- a/drivers/nvme/host/multipath.c >>+++ b/drivers/nvme/host/multipath.c >>@@ -6,6 +6,7 @@ >> #include <linux/backing-dev.h> >> #include <linux/moduleparam.h> >> #include <linux/vmalloc.h> >>+#include <linux/io_uring.h> >> #include <trace/events/block.h> >> #include "nvme.h" >>@@ -80,6 +81,17 @@ void nvme_mpath_start_freeze(struct nvme_subsystem *subsys) >> blk_freeze_queue_start(h->disk->queue); >> } >>+static void nvme_ioucmd_failover_req(struct request *req, struct nvme_ns *ns) >>+{ >>+ struct io_uring_cmd *ioucmd = req->end_io_data; >>+ struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd); >>+ >>+ /* store the request, to reconstruct the command during retry */ >>+ pdu->req = req; >>+ /* retry in original submitter context */ >>+ io_uring_cmd_execute_in_task(ioucmd, nvme_ioucmd_mpath_retry); > >Why not deinit the command, put it on the request list and just schedule >requeue_work? Why not just have requeue_work go over the request list >and call nvme_ns_head_chr_uring_cmd similar to what it does for block >requests? We cannot process passthrough command, map user/meta buffer (in bio) etc. in worker context. We need to do that in task context. So worker is a hoop that we try to avoid here. Even if we user worker, we need to switch to task as we do for other case (i.e. when command could not be submitted in the first place). >>+} >>+ >> void nvme_failover_req(struct request *req) >> { >> struct nvme_ns *ns = req->q->queuedata; >>@@ -99,6 +111,11 @@ void nvme_failover_req(struct request *req) >> queue_work(nvme_wq, &ns->ctrl->ana_work); >> } >>+ if (blk_rq_is_passthrough(req)) { >>+ nvme_ioucmd_failover_req(req, ns); >>+ return; >>+ } >>+ >> spin_lock_irqsave(&ns->head->requeue_lock, flags); >> for (bio = req->bio; bio; bio = bio->bi_next) { >> bio_set_dev(bio, ns->head->disk->part0); >>@@ -314,7 +331,7 @@ inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head) >> return ns; >> } >>-static bool nvme_available_path(struct nvme_ns_head *head) >>+bool nvme_available_path(struct nvme_ns_head *head) >> { >> struct nvme_ns *ns; >>@@ -459,7 +476,9 @@ static void nvme_requeue_work(struct work_struct *work) >> struct nvme_ns_head *head = >> container_of(work, struct nvme_ns_head, requeue_work); >> struct bio *bio, *next; >>+ struct io_uring_cmd *ioucmd, *ioucmd_next; >>+ /* process requeued bios*/ >> spin_lock_irq(&head->requeue_lock); >> next = bio_list_get(&head->requeue_list); >> spin_unlock_irq(&head->requeue_lock); >>@@ -470,6 +489,21 @@ static void nvme_requeue_work(struct work_struct *work) >> submit_bio_noacct(bio); >> } >>+ >>+ /* process requeued passthrough-commands */ >>+ spin_lock_irq(&head->requeue_ioucmd_lock); >>+ ioucmd_next = ioucmd_list_get(&head->requeue_ioucmd_list); >>+ spin_unlock_irq(&head->requeue_ioucmd_lock); >>+ >>+ while ((ioucmd = ioucmd_next) != NULL) { >>+ ioucmd_next = ioucmd->next; >>+ ioucmd->next = NULL; >>+ /* >>+ * Retry in original submitter context as we would be >>+ * processing the passthrough command again >>+ */ >>+ io_uring_cmd_execute_in_task(ioucmd, nvme_ioucmd_mpath_retry); > >Why retry? why not nvme_ns_head_chr_uring_cmd ? what nvme_ioucmd_mpath_retry does different is explained above. 
Putting that processing inside nvme_ns_head_chr_uring_cmd makes the function too long and hard to read/understand. [-- Attachment #2: Type: text/plain, Size: 0 bytes --] ^ permalink raw reply [flat|nested] 52+ messages in thread
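For reference, ioucmd_list_add()/ioucmd_list_get() and the ioucmd->next link used by the patch are introduced by an earlier patch in the series and are not quoted in this thread. A plausible minimal form, modelled on bio_list and consistent with how nvme_requeue_work() walks the chain, would be:

struct ioucmd_list {
	struct io_uring_cmd *head;
	struct io_uring_cmd *tail;
};

static inline void ioucmd_list_add(struct ioucmd_list *il,
				   struct io_uring_cmd *ioucmd)
{
	/* append to the tail so requeued commands keep submission order */
	ioucmd->next = NULL;
	if (il->tail)
		il->tail->next = ioucmd;
	else
		il->head = ioucmd;
	il->tail = ioucmd;
}

/* detach the whole chain; the caller walks it via ioucmd->next */
static inline struct io_uring_cmd *ioucmd_list_get(struct ioucmd_list *il)
{
	struct io_uring_cmd *ioucmd = il->head;

	il->head = il->tail = NULL;
	return ioucmd;
}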
* Re: [PATCH for-next 4/4] nvme-multipath: add multipathing for uring-passthrough commands 2022-07-11 18:37 ` Kanchan Joshi @ 2022-07-11 19:56 ` Sagi Grimberg 2022-07-12 4:23 ` Kanchan Joshi 0 siblings, 1 reply; 52+ messages in thread From: Sagi Grimberg @ 2022-07-11 19:56 UTC (permalink / raw) To: Kanchan Joshi Cc: hch, kbusch, axboe, io-uring, linux-nvme, linux-block, asml.silence, joshiiitr, anuj20.g, gost.dev >>> @@ -448,6 +442,14 @@ static int nvme_uring_cmd_io(struct nvme_ctrl >>> *ctrl, struct nvme_ns *ns, >>> pdu->meta_buffer = nvme_to_user_ptr(d.metadata); >>> pdu->meta_len = d.metadata_len; >>> + if (issue_flags & IO_URING_F_MPATH) { >>> + req->cmd_flags |= REQ_NVME_MPATH; >>> + /* >>> + * we would need the buffer address (d.addr field) if we have >>> + * to retry the command. Store it by repurposing ioucmd->cmd >>> + */ >>> + ioucmd->cmd = (void *)d.addr; >> >> What does repurposing mean here? > > This field (ioucmd->cmd) was pointing to passthrough command (which > is embedded in SQE of io_uring). At this point we have consumed > passthrough command, so this field can be reused if we have to. And we > have to beceause failover needs recreating passthrough command. > Please see nvme_uring_cmd_io_retry to see how this helps in recreating > the fields of passthrough command. And more on this below. so ioucmd->cmd starts as an nvme_uring_cmd pointer and continues as an address of buffer for a later processing that may or may not happen in its lifetime? Sounds a bit of a backwards design to me. >>> + } >>> blk_execute_rq_nowait(req, false); >>> return -EIOCBQUEUED; >>> } >>> @@ -665,12 +667,135 @@ int nvme_ns_head_chr_uring_cmd(struct >>> io_uring_cmd *ioucmd, >>> int srcu_idx = srcu_read_lock(&head->srcu); >>> struct nvme_ns *ns = nvme_find_path(head); >>> int ret = -EINVAL; >>> + struct device *dev = &head->cdev_device; >>> + >>> + if (likely(ns)) { >>> + ret = nvme_ns_uring_cmd(ns, ioucmd, >>> + issue_flags | IO_URING_F_MPATH); >>> + } else if (nvme_available_path(head)) { >>> + struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd); >>> + struct nvme_uring_cmd *payload = NULL; >>> + >>> + dev_warn_ratelimited(dev, "no usable path - requeuing I/O\n"); >>> + /* >>> + * We may requeue two kinds of uring commands: >>> + * 1. command failed post submission. pdu->req will be non-null >>> + * for this >>> + * 2. command that could not be submitted because path was not >>> + * available. For this pdu->req is set to NULL. >>> + */ >>> + pdu->req = NULL; >> >> Relying on a pointer does not sound like a good idea to me. >> But why do you care at all where did this command came from? >> This code should not concern itself what happened prior to this >> execution. > Required, please see nvme_uring_cmd_io_retry. And this should be more > clear as part of responses to your other questions. I think I understand. But it looks fragile to me. > >>> + /* >>> + * passthrough command will not be available during retry as it >>> + * is embedded in io_uring's SQE. Hence we allocate/copy here >>> + */ >> >> OK, that is a nice solution. > Please note that prefered way is to recreate the passthrough command, > and not to allocate it. We allocate it here because this happens very early > (i.e. before processing passthrough command and setting that up inside > struct request). Recreating requires a populated 'struct request' . I think that once a driver consumes a command as queued, it needs a stable copy of the command (could be in a percpu pool). 
The current design of nvme multipathing is that the request is stripped from its bios and placed on a requeue_list, and if a request was not formed yet (i.e. nvme available path exist but no usable) the bio is placed on the requeue_list. So ideally we have something sufficient like a bio, that can be linked on a requeue list, and is sufficient to build a request, at any point in time... >> >>> + payload = kmalloc(sizeof(struct nvme_uring_cmd), GFP_KERNEL); >>> + if (!payload) { >>> + ret = -ENOMEM; >>> + goto out_unlock; >>> + } >>> + memcpy(payload, ioucmd->cmd, sizeof(struct nvme_uring_cmd)); >>> + ioucmd->cmd = payload; >>> - if (ns) >>> - ret = nvme_ns_uring_cmd(ns, ioucmd, issue_flags); >>> + spin_lock_irq(&head->requeue_ioucmd_lock); >>> + ioucmd_list_add(&head->requeue_ioucmd_list, ioucmd); >>> + spin_unlock_irq(&head->requeue_ioucmd_lock); >>> + ret = -EIOCBQUEUED; >>> + } else { >>> + dev_warn_ratelimited(dev, "no available path - failing I/O\n"); >> >> ret=-EIO ? > Did not do as it was initialized to -EINVAL. Do you prefer -EIO instead. It is not an invalid argument error here, it is essentially an IO error. So yes, that would be my preference. >>> + } >>> +out_unlock: >>> srcu_read_unlock(&head->srcu, srcu_idx); >>> return ret; >>> } >>> + >>> +int nvme_uring_cmd_io_retry(struct nvme_ns *ns, struct request *oreq, >>> + struct io_uring_cmd *ioucmd, struct nvme_uring_cmd_pdu *pdu) >>> +{ >>> + struct nvme_ctrl *ctrl = ns->ctrl; >>> + struct request_queue *q = ns ? ns->queue : ctrl->admin_q; >>> + struct nvme_uring_data d; >>> + struct nvme_command c; >>> + struct request *req; >>> + struct bio *obio = oreq->bio; >>> + void *meta = NULL; >>> + >>> + memcpy(&c, nvme_req(oreq)->cmd, sizeof(struct nvme_command)); >>> + d.metadata = (__u64)pdu->meta_buffer; >>> + d.metadata_len = pdu->meta_len; >>> + d.timeout_ms = oreq->timeout; >>> + d.addr = (__u64)ioucmd->cmd; >>> + if (obio) { >>> + d.data_len = obio->bi_iter.bi_size; >>> + blk_rq_unmap_user(obio); >>> + } else { >>> + d.data_len = 0; >>> + } >>> + blk_mq_free_request(oreq); >> >> The way I would do this that in nvme_ioucmd_failover_req (or in the >> retry driven from command retriable failure) I would do the above, >> requeue it and kick the requeue work, to go over the requeue_list and >> just execute them again. Not sure why you even need an explicit retry >> code. > During retry we need passthrough command. But passthrough command is not > stable (i.e. valid only during first submission). We can make it stable > either by: > (a) allocating in nvme (b) return -EAGAIN to io_uring, and it will do > allocate + deferral > Both add a cost. And since any command can potentially fail, that > means taking that cost for every IO that we issue on mpath node. Even if > no failure (initial or subsquent after IO) occcured. As mentioned, I think that if a driver consumes a command as queued, it needs a stable copy for a later reformation of the request for failover purposes. > So to avoid commmon-path cost, we go about doing nothing (no allocation, > no deferral) in the outset and choose to recreate the passthrough > command if failure occured. Hope this explains the purpose of > nvme_uring_cmd_io_retry? I think it does, but I'm still having a hard time with it... > > >>> + req = nvme_alloc_user_request(q, &c, nvme_to_user_ptr(d.addr), >>> + d.data_len, nvme_to_user_ptr(d.metadata), >>> + d.metadata_len, 0, &meta, d.timeout_ms ? 
>>> + msecs_to_jiffies(d.timeout_ms) : 0, >>> + ioucmd->cmd_op == NVME_URING_CMD_IO_VEC, 0, 0); >>> + if (IS_ERR(req)) >>> + return PTR_ERR(req); >>> + >>> + req->end_io = nvme_uring_cmd_end_io; >>> + req->end_io_data = ioucmd; >>> + pdu->bio = req->bio; >>> + pdu->meta = meta; >>> + req->cmd_flags |= REQ_NVME_MPATH; >>> + blk_execute_rq_nowait(req, false); >>> + return -EIOCBQUEUED; >>> +} >>> + >>> +void nvme_ioucmd_mpath_retry(struct io_uring_cmd *ioucmd) >>> +{ >>> + struct cdev *cdev = file_inode(ioucmd->file)->i_cdev; >>> + struct nvme_ns_head *head = container_of(cdev, struct nvme_ns_head, >>> + cdev); >>> + int srcu_idx = srcu_read_lock(&head->srcu); >>> + struct nvme_ns *ns = nvme_find_path(head); >>> + unsigned int issue_flags = IO_URING_F_SQE128 | IO_URING_F_CQE32 | >>> + IO_URING_F_MPATH; >>> + struct device *dev = &head->cdev_device; >>> + >>> + if (likely(ns)) { >>> + struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd); >>> + struct request *oreq = pdu->req; >>> + int ret; >>> + >>> + if (oreq == NULL) { >>> + /* >>> + * this was not submitted (to device) earlier. For this >>> + * ioucmd->cmd points to persistent memory. Free that >>> + * up post submission >>> + */ >>> + const void *cmd = ioucmd->cmd; >>> + >>> + ret = nvme_ns_uring_cmd(ns, ioucmd, issue_flags); >>> + kfree(cmd); >>> + } else { >>> + /* >>> + * this was submitted (to device) earlier. Use old >>> + * request, bio (if it exists) and nvme-pdu to recreate >>> + * the command for the discovered path >>> + */ >>> + ret = nvme_uring_cmd_io_retry(ns, oreq, ioucmd, pdu); >> >> Why is this needed? Why is reuse important here? Why not always call >> nvme_ns_uring_cmd? > > Please see the previous explanation. > If condition is for the case when we made the passthrough command stable > by allocating beforehand. > Else is for the case when we avoided taking that cost. The current design of the multipath failover code is clean: 1. extract bio(s) from request 2. link in requeue_list 3. schedule requeue_work that, 3.1 takes bios 1-by-1, and submits them again (exactly the same way) I'd like us to try to follow the same design where retry is literally "do the exact same thing, again". That would eliminate two submission paths that do the same thing, but slightly different. > >>> + } >>> + if (ret != -EIOCBQUEUED) >>> + io_uring_cmd_done(ioucmd, ret, 0); >>> + } else if (nvme_available_path(head)) { >>> + dev_warn_ratelimited(dev, "no usable path - requeuing I/O\n"); >>> + spin_lock_irq(&head->requeue_ioucmd_lock); >>> + ioucmd_list_add(&head->requeue_ioucmd_list, ioucmd); >>> + spin_unlock_irq(&head->requeue_ioucmd_lock); >>> + } else { >>> + dev_warn_ratelimited(dev, "no available path - failing I/O\n"); >>> + io_uring_cmd_done(ioucmd, -EINVAL, 0); >> >> -EIO? > Can change -EINVAL to -EIO if that is what you prefer. Yes. ^ permalink raw reply [flat|nested] 52+ messages in thread
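For reference, the bio-based flow being described here (steps 1-3.1), abridged from the existing nvme_failover_req()/nvme_requeue_work() pair that the diff above extends (ANA and status handling omitted):

void nvme_failover_req(struct request *req)
{
	struct nvme_ns *ns = req->q->queuedata;
	unsigned long flags;
	struct bio *bio;

	nvme_mpath_clear_current_path(ns);

	/* 1. strip the bios off the failed request and retarget them at
	 *    the ns_head device, 2. link them on the shared requeue_list */
	spin_lock_irqsave(&ns->head->requeue_lock, flags);
	for (bio = req->bio; bio; bio = bio->bi_next)
		bio_set_dev(bio, ns->head->disk->part0);
	blk_steal_bios(&ns->head->requeue_list, req);
	spin_unlock_irqrestore(&ns->head->requeue_lock, flags);

	/* the request itself is done; only the bios live on */
	blk_mq_end_request(req, 0);

	/* 3. kick requeue_work, which resubmits every queued bio with
	 *    submit_bio_noacct(), i.e. the exact same submission path */
	kblockd_schedule_work(&ns->head->requeue_work);
}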
* Re: [PATCH for-next 4/4] nvme-multipath: add multipathing for uring-passthrough commands 2022-07-11 19:56 ` Sagi Grimberg @ 2022-07-12 4:23 ` Kanchan Joshi 2022-07-12 21:26 ` Sagi Grimberg 0 siblings, 1 reply; 52+ messages in thread From: Kanchan Joshi @ 2022-07-12 4:23 UTC (permalink / raw) To: Sagi Grimberg Cc: hch, kbusch, axboe, io-uring, linux-nvme, linux-block, asml.silence, joshiiitr, anuj20.g, gost.dev [-- Attachment #1: Type: text/plain, Size: 13063 bytes --] On Mon, Jul 11, 2022 at 10:56:56PM +0300, Sagi Grimberg wrote: > >>>>@@ -448,6 +442,14 @@ static int nvme_uring_cmd_io(struct >>>>nvme_ctrl *ctrl, struct nvme_ns *ns, >>>> pdu->meta_buffer = nvme_to_user_ptr(d.metadata); >>>> pdu->meta_len = d.metadata_len; >>>>+ if (issue_flags & IO_URING_F_MPATH) { >>>>+ req->cmd_flags |= REQ_NVME_MPATH; >>>>+ /* >>>>+ * we would need the buffer address (d.addr field) if we have >>>>+ * to retry the command. Store it by repurposing ioucmd->cmd >>>>+ */ >>>>+ ioucmd->cmd = (void *)d.addr; >>> >>>What does repurposing mean here? >> >>This field (ioucmd->cmd) was pointing to passthrough command (which >>is embedded in SQE of io_uring). At this point we have consumed >>passthrough command, so this field can be reused if we have to. And we >>have to beceause failover needs recreating passthrough command. >>Please see nvme_uring_cmd_io_retry to see how this helps in recreating >>the fields of passthrough command. And more on this below. > >so ioucmd->cmd starts as an nvme_uring_cmd pointer and continues as >an address of buffer for a later processing that may or may not >happen in its lifetime? Yes. See this as a no-cost way to handle fast/common path. We manage by doing just this assignment. Otherwise this would have involved doing high-cost (allocate/copy/deferral) stuff for later processing that may or may not happen. >Sounds a bit of a backwards design to me. > >>>>+ } >>>> blk_execute_rq_nowait(req, false); >>>> return -EIOCBQUEUED; >>>> } >>>>@@ -665,12 +667,135 @@ int nvme_ns_head_chr_uring_cmd(struct >>>>io_uring_cmd *ioucmd, >>>> int srcu_idx = srcu_read_lock(&head->srcu); >>>> struct nvme_ns *ns = nvme_find_path(head); >>>> int ret = -EINVAL; >>>>+ struct device *dev = &head->cdev_device; >>>>+ >>>>+ if (likely(ns)) { >>>>+ ret = nvme_ns_uring_cmd(ns, ioucmd, >>>>+ issue_flags | IO_URING_F_MPATH); >>>>+ } else if (nvme_available_path(head)) { >>>>+ struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd); >>>>+ struct nvme_uring_cmd *payload = NULL; >>>>+ >>>>+ dev_warn_ratelimited(dev, "no usable path - requeuing I/O\n"); >>>>+ /* >>>>+ * We may requeue two kinds of uring commands: >>>>+ * 1. command failed post submission. pdu->req will be non-null >>>>+ * for this >>>>+ * 2. command that could not be submitted because path was not >>>>+ * available. For this pdu->req is set to NULL. >>>>+ */ >>>>+ pdu->req = NULL; >>> >>>Relying on a pointer does not sound like a good idea to me. >>>But why do you care at all where did this command came from? >>>This code should not concern itself what happened prior to this >>>execution. >>Required, please see nvme_uring_cmd_io_retry. And this should be more >>clear as part of responses to your other questions. > >I think I understand. But it looks fragile to me. > >> >>>>+ /* >>>>+ * passthrough command will not be available during retry as it >>>>+ * is embedded in io_uring's SQE. Hence we allocate/copy here >>>>+ */ >>> >>>OK, that is a nice solution. 
>>Please note that prefered way is to recreate the passthrough command, >>and not to allocate it. We allocate it here because this happens very early >>(i.e. before processing passthrough command and setting that up inside >>struct request). Recreating requires a populated 'struct request' . > >I think that once a driver consumes a command as queued, it needs a >stable copy of the command (could be in a percpu pool). > >The current design of nvme multipathing is that the request is stripped >from its bios and placed on a requeue_list, and if a request was not >formed yet (i.e. nvme available path exist but no usable) the bio is >placed on the requeue_list. > >So ideally we have something sufficient like a bio, that can be linked >on a requeue list, and is sufficient to build a request, at any point in >time... we could be dealing with any passthrough command here. bio is not guranteed to exist in first place. It can very well be NULL. As I mentioned in cover letter, this was tested for such passthrough commands too. And bio is not a good fit for this interface. For block-path, entry function is nvme_ns_head_submit_bio() which speaks bio. Request is allocated afterwards. But since request formation can happen only after discovering a valid path, it may have to queue the bio if path does not exist. For passthrough, interface speaks/understands struct io_uring_cmd. Request is allocated after. And bio may (or may not) be attached with this request. It may have to queue the command even before request/bio gets formed. So cardinal structure (which is really available all the time) in this case is struct io_uring_cmd. Hence we use that as the object around which requeuing/failover is built. Conceptually io_uring_cmd is the bio of this interface. >>>>+ payload = kmalloc(sizeof(struct nvme_uring_cmd), GFP_KERNEL); >>>>+ if (!payload) { >>>>+ ret = -ENOMEM; >>>>+ goto out_unlock; >>>>+ } >>>>+ memcpy(payload, ioucmd->cmd, sizeof(struct nvme_uring_cmd)); >>>>+ ioucmd->cmd = payload; >>>>- if (ns) >>>>- ret = nvme_ns_uring_cmd(ns, ioucmd, issue_flags); >>>>+ spin_lock_irq(&head->requeue_ioucmd_lock); >>>>+ ioucmd_list_add(&head->requeue_ioucmd_list, ioucmd); >>>>+ spin_unlock_irq(&head->requeue_ioucmd_lock); >>>>+ ret = -EIOCBQUEUED; >>>>+ } else { >>>>+ dev_warn_ratelimited(dev, "no available path - failing I/O\n"); >>> >>>ret=-EIO ? >>Did not do as it was initialized to -EINVAL. Do you prefer -EIO instead. > >It is not an invalid argument error here, it is essentially an IO error. >So yes, that would be my preference. sure, will change. >>>>+ } >>>>+out_unlock: >>>> srcu_read_unlock(&head->srcu, srcu_idx); >>>> return ret; >>>> } >>>>+ >>>>+int nvme_uring_cmd_io_retry(struct nvme_ns *ns, struct request *oreq, >>>>+ struct io_uring_cmd *ioucmd, struct nvme_uring_cmd_pdu *pdu) >>>>+{ >>>>+ struct nvme_ctrl *ctrl = ns->ctrl; >>>>+ struct request_queue *q = ns ? 
ns->queue : ctrl->admin_q; >>>>+ struct nvme_uring_data d; >>>>+ struct nvme_command c; >>>>+ struct request *req; >>>>+ struct bio *obio = oreq->bio; >>>>+ void *meta = NULL; >>>>+ >>>>+ memcpy(&c, nvme_req(oreq)->cmd, sizeof(struct nvme_command)); >>>>+ d.metadata = (__u64)pdu->meta_buffer; >>>>+ d.metadata_len = pdu->meta_len; >>>>+ d.timeout_ms = oreq->timeout; >>>>+ d.addr = (__u64)ioucmd->cmd; >>>>+ if (obio) { >>>>+ d.data_len = obio->bi_iter.bi_size; >>>>+ blk_rq_unmap_user(obio); >>>>+ } else { >>>>+ d.data_len = 0; >>>>+ } >>>>+ blk_mq_free_request(oreq); >>> >>>The way I would do this that in nvme_ioucmd_failover_req (or in the >>>retry driven from command retriable failure) I would do the above, >>>requeue it and kick the requeue work, to go over the requeue_list and >>>just execute them again. Not sure why you even need an explicit retry >>>code. >>During retry we need passthrough command. But passthrough command is not >>stable (i.e. valid only during first submission). We can make it stable >>either by: >>(a) allocating in nvme (b) return -EAGAIN to io_uring, and it will >>do allocate + deferral >>Both add a cost. And since any command can potentially fail, that >>means taking that cost for every IO that we issue on mpath node. Even if >>no failure (initial or subsquent after IO) occcured. > >As mentioned, I think that if a driver consumes a command as queued, >it needs a stable copy for a later reformation of the request for >failover purposes. So what do you propose to make that stable? As I mentioned earlier, stable copy requires allocating/copying in fast path. And for a condition (failover) that may not even occur. I really think currrent solution is much better as it does not try to make it stable. Rather it assembles pieces of passthrough command if retry (which is rare) happens. >>So to avoid commmon-path cost, we go about doing nothing (no allocation, >>no deferral) in the outset and choose to recreate the passthrough >>command if failure occured. Hope this explains the purpose of >>nvme_uring_cmd_io_retry? > >I think it does, but I'm still having a hard time with it... > Maybe I am reiterating but few main differences that should help - - io_uring_cmd is at the forefront, and bio/request as secondary objects. Former is persistent object across requeue attempts and the only thing available when we discover the path, while other two are created every time we retry. - Receiving bio from upper layer is a luxury that we do not have for passthrough. When we receive bio, pages are already mapped and we do not have to deal with user-specific fields, so there is more freedom in using arbitrary context (workers etc.). But passthrough command continues to point to user-space fields/buffers, so we need that task context. >> >>>>+ req = nvme_alloc_user_request(q, &c, nvme_to_user_ptr(d.addr), >>>>+ d.data_len, nvme_to_user_ptr(d.metadata), >>>>+ d.metadata_len, 0, &meta, d.timeout_ms ? 
>>>>+ msecs_to_jiffies(d.timeout_ms) : 0, >>>>+ ioucmd->cmd_op == NVME_URING_CMD_IO_VEC, 0, 0); >>>>+ if (IS_ERR(req)) >>>>+ return PTR_ERR(req); >>>>+ >>>>+ req->end_io = nvme_uring_cmd_end_io; >>>>+ req->end_io_data = ioucmd; >>>>+ pdu->bio = req->bio; >>>>+ pdu->meta = meta; >>>>+ req->cmd_flags |= REQ_NVME_MPATH; >>>>+ blk_execute_rq_nowait(req, false); >>>>+ return -EIOCBQUEUED; >>>>+} >>>>+ >>>>+void nvme_ioucmd_mpath_retry(struct io_uring_cmd *ioucmd) >>>>+{ >>>>+ struct cdev *cdev = file_inode(ioucmd->file)->i_cdev; >>>>+ struct nvme_ns_head *head = container_of(cdev, struct nvme_ns_head, >>>>+ cdev); >>>>+ int srcu_idx = srcu_read_lock(&head->srcu); >>>>+ struct nvme_ns *ns = nvme_find_path(head); >>>>+ unsigned int issue_flags = IO_URING_F_SQE128 | IO_URING_F_CQE32 | >>>>+ IO_URING_F_MPATH; >>>>+ struct device *dev = &head->cdev_device; >>>>+ >>>>+ if (likely(ns)) { >>>>+ struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd); >>>>+ struct request *oreq = pdu->req; >>>>+ int ret; >>>>+ >>>>+ if (oreq == NULL) { >>>>+ /* >>>>+ * this was not submitted (to device) earlier. For this >>>>+ * ioucmd->cmd points to persistent memory. Free that >>>>+ * up post submission >>>>+ */ >>>>+ const void *cmd = ioucmd->cmd; >>>>+ >>>>+ ret = nvme_ns_uring_cmd(ns, ioucmd, issue_flags); >>>>+ kfree(cmd); >>>>+ } else { >>>>+ /* >>>>+ * this was submitted (to device) earlier. Use old >>>>+ * request, bio (if it exists) and nvme-pdu to recreate >>>>+ * the command for the discovered path >>>>+ */ >>>>+ ret = nvme_uring_cmd_io_retry(ns, oreq, ioucmd, pdu); >>> >>>Why is this needed? Why is reuse important here? Why not always call >>>nvme_ns_uring_cmd? >> >>Please see the previous explanation. >>If condition is for the case when we made the passthrough command stable >>by allocating beforehand. >>Else is for the case when we avoided taking that cost. > >The current design of the multipath failover code is clean: >1. extract bio(s) from request >2. link in requeue_list >3. schedule requeue_work that, > 3.1 takes bios 1-by-1, and submits them again (exactly the same way) It is really clean, and fits really well with bio based entry interface. But as I said earlier, it does not go well with uring-cmd based entry interface, and bunch of of other things which needs to be done differently for generic passthrough command. >I'd like us to try to follow the same design where retry is >literally "do the exact same thing, again". That would eliminate >two submission paths that do the same thing, but slightly different. Exact same thing is possible if we make the common path slow i.e. allocate/copy passthrough command and keep it alive until completion. But that is really not the way to go I suppose. [-- Attachment #2: Type: text/plain, Size: 0 bytes --] ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH for-next 4/4] nvme-multipath: add multipathing for uring-passthrough commands 2022-07-12 4:23 ` Kanchan Joshi @ 2022-07-12 21:26 ` Sagi Grimberg 2022-07-13 5:37 ` Kanchan Joshi 0 siblings, 1 reply; 52+ messages in thread From: Sagi Grimberg @ 2022-07-12 21:26 UTC (permalink / raw) To: Kanchan Joshi Cc: hch, kbusch, axboe, io-uring, linux-nvme, linux-block, asml.silence, joshiiitr, anuj20.g, gost.dev >>>>> @@ -448,6 +442,14 @@ static int nvme_uring_cmd_io(struct nvme_ctrl >>>>> *ctrl, struct nvme_ns *ns, >>>>> pdu->meta_buffer = nvme_to_user_ptr(d.metadata); >>>>> pdu->meta_len = d.metadata_len; >>>>> + if (issue_flags & IO_URING_F_MPATH) { >>>>> + req->cmd_flags |= REQ_NVME_MPATH; >>>>> + /* >>>>> + * we would need the buffer address (d.addr field) if we have >>>>> + * to retry the command. Store it by repurposing ioucmd->cmd >>>>> + */ >>>>> + ioucmd->cmd = (void *)d.addr; >>>> >>>> What does repurposing mean here? >>> >>> This field (ioucmd->cmd) was pointing to passthrough command (which >>> is embedded in SQE of io_uring). At this point we have consumed >>> passthrough command, so this field can be reused if we have to. And we >>> have to beceause failover needs recreating passthrough command. >>> Please see nvme_uring_cmd_io_retry to see how this helps in recreating >>> the fields of passthrough command. And more on this below. >> >> so ioucmd->cmd starts as an nvme_uring_cmd pointer and continues as >> an address of buffer for a later processing that may or may not >> happen in its lifetime? > > Yes. See this as a no-cost way to handle fast/common path. We manage by > doing just this assignment. > Otherwise this would have involved doing high-cost (allocate/copy/deferral) > stuff for later processing that may or may not happen. I see it as a hacky no-cost way... > >> Sounds a bit of a backwards design to me. >> >>>>> + } >>>>> blk_execute_rq_nowait(req, false); >>>>> return -EIOCBQUEUED; >>>>> } >>>>> @@ -665,12 +667,135 @@ int nvme_ns_head_chr_uring_cmd(struct >>>>> io_uring_cmd *ioucmd, >>>>> int srcu_idx = srcu_read_lock(&head->srcu); >>>>> struct nvme_ns *ns = nvme_find_path(head); >>>>> int ret = -EINVAL; >>>>> + struct device *dev = &head->cdev_device; >>>>> + >>>>> + if (likely(ns)) { >>>>> + ret = nvme_ns_uring_cmd(ns, ioucmd, >>>>> + issue_flags | IO_URING_F_MPATH); >>>>> + } else if (nvme_available_path(head)) { >>>>> + struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd); >>>>> + struct nvme_uring_cmd *payload = NULL; >>>>> + >>>>> + dev_warn_ratelimited(dev, "no usable path - requeuing >>>>> I/O\n"); >>>>> + /* >>>>> + * We may requeue two kinds of uring commands: >>>>> + * 1. command failed post submission. pdu->req will be >>>>> non-null >>>>> + * for this >>>>> + * 2. command that could not be submitted because path was >>>>> not >>>>> + * available. For this pdu->req is set to NULL. >>>>> + */ >>>>> + pdu->req = NULL; >>>> >>>> Relying on a pointer does not sound like a good idea to me. >>>> But why do you care at all where did this command came from? >>>> This code should not concern itself what happened prior to this >>>> execution. >>> Required, please see nvme_uring_cmd_io_retry. And this should be more >>> clear as part of responses to your other questions. >> >> I think I understand. But it looks fragile to me. >> >>> >>>>> + /* >>>>> + * passthrough command will not be available during retry >>>>> as it >>>>> + * is embedded in io_uring's SQE. Hence we allocate/copy here >>>>> + */ >>>> >>>> OK, that is a nice solution. 
>>> Please note that prefered way is to recreate the passthrough command, >>> and not to allocate it. We allocate it here because this happens very >>> early >>> (i.e. before processing passthrough command and setting that up inside >>> struct request). Recreating requires a populated 'struct request' . >> >> I think that once a driver consumes a command as queued, it needs a >> stable copy of the command (could be in a percpu pool). >> >> The current design of nvme multipathing is that the request is stripped >> from its bios and placed on a requeue_list, and if a request was not >> formed yet (i.e. nvme available path exist but no usable) the bio is >> placed on the requeue_list. >> >> So ideally we have something sufficient like a bio, that can be linked >> on a requeue list, and is sufficient to build a request, at any point in >> time... > > we could be dealing with any passthrough command here. bio is not > guranteed to exist in first place. It can very well be NULL. > As I mentioned in cover letter, this was tested for such passthrough > commands too. > And bio is not a good fit for this interface. For block-path, entry > function is nvme_ns_head_submit_bio() which speaks bio. Request is > allocated afterwards. But since request formation can happen only after > discovering a valid path, it may have to queue the bio if path does not > exist. > For passthrough, interface speaks/understands struct io_uring_cmd. > Request is allocated after. And bio may (or may not) be attached with > this request. It may have to queue the command even before request/bio > gets formed. So cardinal structure (which is > really available all the time) in this case is struct io_uring_cmd. > Hence we use that as the object around which requeuing/failover is > built. > Conceptually io_uring_cmd is the bio of this interface. I didn't say bio, I said something like a bio that holds stable information that could be used for requeue purposes... > >>>>> + payload = kmalloc(sizeof(struct nvme_uring_cmd), GFP_KERNEL); >>>>> + if (!payload) { >>>>> + ret = -ENOMEM; >>>>> + goto out_unlock; >>>>> + } >>>>> + memcpy(payload, ioucmd->cmd, sizeof(struct nvme_uring_cmd)); >>>>> + ioucmd->cmd = payload; >>>>> - if (ns) >>>>> - ret = nvme_ns_uring_cmd(ns, ioucmd, issue_flags); >>>>> + spin_lock_irq(&head->requeue_ioucmd_lock); >>>>> + ioucmd_list_add(&head->requeue_ioucmd_list, ioucmd); >>>>> + spin_unlock_irq(&head->requeue_ioucmd_lock); >>>>> + ret = -EIOCBQUEUED; >>>>> + } else { >>>>> + dev_warn_ratelimited(dev, "no available path - failing >>>>> I/O\n"); >>>> >>>> ret=-EIO ? >>> Did not do as it was initialized to -EINVAL. Do you prefer -EIO instead. >> >> It is not an invalid argument error here, it is essentially an IO error. >> So yes, that would be my preference. > > sure, will change. > >>>>> + } >>>>> +out_unlock: >>>>> srcu_read_unlock(&head->srcu, srcu_idx); >>>>> return ret; >>>>> } >>>>> + >>>>> +int nvme_uring_cmd_io_retry(struct nvme_ns *ns, struct request *oreq, >>>>> + struct io_uring_cmd *ioucmd, struct nvme_uring_cmd_pdu *pdu) >>>>> +{ >>>>> + struct nvme_ctrl *ctrl = ns->ctrl; >>>>> + struct request_queue *q = ns ? 
ns->queue : ctrl->admin_q; >>>>> + struct nvme_uring_data d; >>>>> + struct nvme_command c; >>>>> + struct request *req; >>>>> + struct bio *obio = oreq->bio; >>>>> + void *meta = NULL; >>>>> + >>>>> + memcpy(&c, nvme_req(oreq)->cmd, sizeof(struct nvme_command)); >>>>> + d.metadata = (__u64)pdu->meta_buffer; >>>>> + d.metadata_len = pdu->meta_len; >>>>> + d.timeout_ms = oreq->timeout; >>>>> + d.addr = (__u64)ioucmd->cmd; >>>>> + if (obio) { >>>>> + d.data_len = obio->bi_iter.bi_size; >>>>> + blk_rq_unmap_user(obio); >>>>> + } else { >>>>> + d.data_len = 0; >>>>> + } >>>>> + blk_mq_free_request(oreq); >>>> >>>> The way I would do this that in nvme_ioucmd_failover_req (or in the >>>> retry driven from command retriable failure) I would do the above, >>>> requeue it and kick the requeue work, to go over the requeue_list and >>>> just execute them again. Not sure why you even need an explicit retry >>>> code. >>> During retry we need passthrough command. But passthrough command is not >>> stable (i.e. valid only during first submission). We can make it stable >>> either by: >>> (a) allocating in nvme (b) return -EAGAIN to io_uring, and it will do >>> allocate + deferral >>> Both add a cost. And since any command can potentially fail, that >>> means taking that cost for every IO that we issue on mpath node. Even if >>> no failure (initial or subsquent after IO) occcured. >> >> As mentioned, I think that if a driver consumes a command as queued, >> it needs a stable copy for a later reformation of the request for >> failover purposes. > > So what do you propose to make that stable? > As I mentioned earlier, stable copy requires allocating/copying in fast > path. And for a condition (failover) that may not even occur. > I really think currrent solution is much better as it does not try to make > it stable. Rather it assembles pieces of passthrough command if retry > (which is rare) happens. Well, I can understand that io_uring_cmd is space constrained, otherwise we wouldn't be having this discussion. However io_kiocb is less constrained, and could be used as a context to hold such a space. Even if it is undesired to have io_kiocb be passed to uring_cmd(), it can still hold a driver specific space paired with a helper to obtain it (i.e. something like io_uring_cmd_to_driver_ctx(ioucmd) ). Then if the space is pre-allocated it is only a small memory copy for a stable copy that would allow a saner failover design. >>> So to avoid commmon-path cost, we go about doing nothing (no allocation, >>> no deferral) in the outset and choose to recreate the passthrough >>> command if failure occured. Hope this explains the purpose of >>> nvme_uring_cmd_io_retry? >> >> I think it does, but I'm still having a hard time with it... >> > Maybe I am reiterating but few main differences that should help - > > - io_uring_cmd is at the forefront, and bio/request as secondary > objects. Former is persistent object across requeue attempts and the > only thing available when we discover the path, while other two are > created every time we retry. > > - Receiving bio from upper layer is a luxury that we do not have for > passthrough. When we receive bio, pages are already mapped and we > do not have to deal with user-specific fields, so there is more > freedom in using arbitrary context (workers etc.). But passthrough > command continues to point to user-space fields/buffers, so we need > that task context. 
> >>> >>>>> + req = nvme_alloc_user_request(q, &c, nvme_to_user_ptr(d.addr), >>>>> + d.data_len, nvme_to_user_ptr(d.metadata), >>>>> + d.metadata_len, 0, &meta, d.timeout_ms ? >>>>> + msecs_to_jiffies(d.timeout_ms) : 0, >>>>> + ioucmd->cmd_op == NVME_URING_CMD_IO_VEC, 0, 0); >>>>> + if (IS_ERR(req)) >>>>> + return PTR_ERR(req); >>>>> + >>>>> + req->end_io = nvme_uring_cmd_end_io; >>>>> + req->end_io_data = ioucmd; >>>>> + pdu->bio = req->bio; >>>>> + pdu->meta = meta; >>>>> + req->cmd_flags |= REQ_NVME_MPATH; >>>>> + blk_execute_rq_nowait(req, false); >>>>> + return -EIOCBQUEUED; >>>>> +} >>>>> + >>>>> +void nvme_ioucmd_mpath_retry(struct io_uring_cmd *ioucmd) >>>>> +{ >>>>> + struct cdev *cdev = file_inode(ioucmd->file)->i_cdev; >>>>> + struct nvme_ns_head *head = container_of(cdev, struct >>>>> nvme_ns_head, >>>>> + cdev); >>>>> + int srcu_idx = srcu_read_lock(&head->srcu); >>>>> + struct nvme_ns *ns = nvme_find_path(head); >>>>> + unsigned int issue_flags = IO_URING_F_SQE128 | IO_URING_F_CQE32 | >>>>> + IO_URING_F_MPATH; >>>>> + struct device *dev = &head->cdev_device; >>>>> + >>>>> + if (likely(ns)) { >>>>> + struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd); >>>>> + struct request *oreq = pdu->req; >>>>> + int ret; >>>>> + >>>>> + if (oreq == NULL) { >>>>> + /* >>>>> + * this was not submitted (to device) earlier. For this >>>>> + * ioucmd->cmd points to persistent memory. Free that >>>>> + * up post submission >>>>> + */ >>>>> + const void *cmd = ioucmd->cmd; >>>>> + >>>>> + ret = nvme_ns_uring_cmd(ns, ioucmd, issue_flags); >>>>> + kfree(cmd); >>>>> + } else { >>>>> + /* >>>>> + * this was submitted (to device) earlier. Use old >>>>> + * request, bio (if it exists) and nvme-pdu to recreate >>>>> + * the command for the discovered path >>>>> + */ >>>>> + ret = nvme_uring_cmd_io_retry(ns, oreq, ioucmd, pdu); >>>> >>>> Why is this needed? Why is reuse important here? Why not always call >>>> nvme_ns_uring_cmd? >>> >>> Please see the previous explanation. >>> If condition is for the case when we made the passthrough command stable >>> by allocating beforehand. >>> Else is for the case when we avoided taking that cost. >> >> The current design of the multipath failover code is clean: >> 1. extract bio(s) from request >> 2. link in requeue_list >> 3. schedule requeue_work that, >> 3.1 takes bios 1-by-1, and submits them again (exactly the same way) > > It is really clean, and fits really well with bio based entry interface. > But as I said earlier, it does not go well with uring-cmd based entry > interface, and bunch of of other things which needs to be done > differently for generic passthrough command. > >> I'd like us to try to follow the same design where retry is >> literally "do the exact same thing, again". That would eliminate >> two submission paths that do the same thing, but slightly different. > > Exact same thing is possible if we make the common path slow i.e. > allocate/copy passthrough command and keep it alive until completion. > But that is really not the way to go I suppose. I'm not sure. With Christoph's response, I'm not sure it is universally desired to support failover (in my opinion it should). But if we do in fact choose to support it, I think we need a better solution. If fast-path allocation is your prime concern, then let's try to address that with space pre-allocation. ^ permalink raw reply [flat|nested] 52+ messages in thread
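For reference, a hypothetical sketch of the pre-allocated space discussed above. Neither the scratch area nor io_uring_cmd_to_driver_ctx() exists in the posted series; the field and function names here are made up, and the point is only that a stable copy then becomes one small memcpy on the submission path:

/* hypothetical: io_kiocb reserves a per-command scratch area large
 * enough for a stable copy of the SQE128 passthrough payload */
static inline void *io_uring_cmd_to_driver_ctx(struct io_uring_cmd *ioucmd)
{
	struct io_kiocb *req = cmd_to_io_kiocb(ioucmd);

	return req->driver_ctx;		/* made-up field, does not exist today */
}

/* nvme side: take the stable copy at submission time, so a later
 * requeue/failover needs neither a kmalloc nor command reconstruction */
static void nvme_uring_cmd_make_stable(struct io_uring_cmd *ioucmd)
{
	struct nvme_uring_cmd *stable = io_uring_cmd_to_driver_ctx(ioucmd);

	memcpy(stable, ioucmd->cmd, sizeof(*stable));
	ioucmd->cmd = stable;
}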
* Re: [PATCH for-next 4/4] nvme-multipath: add multipathing for uring-passthrough commands 2022-07-12 21:26 ` Sagi Grimberg @ 2022-07-13 5:37 ` Kanchan Joshi 2022-07-13 9:03 ` Sagi Grimberg 2022-07-14 15:14 ` Ming Lei 0 siblings, 2 replies; 52+ messages in thread From: Kanchan Joshi @ 2022-07-13 5:37 UTC (permalink / raw) To: Sagi Grimberg Cc: hch, kbusch, axboe, io-uring, linux-nvme, linux-block, asml.silence, joshiiitr, anuj20.g, gost.dev [-- Attachment #1: Type: text/plain, Size: 7793 bytes --] >>>>>The way I would do this that in nvme_ioucmd_failover_req (or in the >>>>>retry driven from command retriable failure) I would do the above, >>>>>requeue it and kick the requeue work, to go over the requeue_list and >>>>>just execute them again. Not sure why you even need an explicit retry >>>>>code. >>>>During retry we need passthrough command. But passthrough command is not >>>>stable (i.e. valid only during first submission). We can make it stable >>>>either by: >>>>(a) allocating in nvme (b) return -EAGAIN to io_uring, and it >>>>will do allocate + deferral >>>>Both add a cost. And since any command can potentially fail, that >>>>means taking that cost for every IO that we issue on mpath node. Even if >>>>no failure (initial or subsquent after IO) occcured. >>> >>>As mentioned, I think that if a driver consumes a command as queued, >>>it needs a stable copy for a later reformation of the request for >>>failover purposes. >> >>So what do you propose to make that stable? >>As I mentioned earlier, stable copy requires allocating/copying in fast >>path. And for a condition (failover) that may not even occur. >>I really think currrent solution is much better as it does not try to make >>it stable. Rather it assembles pieces of passthrough command if retry >>(which is rare) happens. > >Well, I can understand that io_uring_cmd is space constrained, otherwise >we wouldn't be having this discussion. Indeed. If we had space for keeping passthrough command stable for retry, that would really have simplified the plumbing. Retry logic would be same as first submission. >However io_kiocb is less >constrained, and could be used as a context to hold such a space. > >Even if it is undesired to have io_kiocb be passed to uring_cmd(), it >can still hold a driver specific space paired with a helper to obtain it >(i.e. something like io_uring_cmd_to_driver_ctx(ioucmd) ). Then if the >space is pre-allocated it is only a small memory copy for a stable copy >that would allow a saner failover design. I am thinking along the same lines, but it's not about few bytes of space rather we need 80 (72 to be precise). Will think more, but these 72 bytes really stand tall in front of my optimism. Do you see anything is possible in nvme-side? Now also passthrough command (although in a modified form) gets copied into this preallocated space i.e. nvme_req(req)->cmd. This part - void nvme_init_request(struct request *req, struct nvme_command *cmd) { ... memcpy(nvme_req(req)->cmd, cmd, sizeof(*cmd)); } >>>>So to avoid commmon-path cost, we go about doing nothing (no allocation, >>>>no deferral) in the outset and choose to recreate the passthrough >>>>command if failure occured. Hope this explains the purpose of >>>>nvme_uring_cmd_io_retry? >>> >>>I think it does, but I'm still having a hard time with it... >>> >>Maybe I am reiterating but few main differences that should help - >> >>- io_uring_cmd is at the forefront, and bio/request as secondary >>objects. 
Former is persistent object across requeue attempts and the >>only thing available when we discover the path, while other two are >>created every time we retry. >> >>- Receiving bio from upper layer is a luxury that we do not have for >> passthrough. When we receive bio, pages are already mapped and we >> do not have to deal with user-specific fields, so there is more >> freedom in using arbitrary context (workers etc.). But passthrough >> command continues to point to user-space fields/buffers, so we need >> that task context. >> >>>> >>>>>>+ req = nvme_alloc_user_request(q, &c, nvme_to_user_ptr(d.addr), >>>>>>+ d.data_len, nvme_to_user_ptr(d.metadata), >>>>>>+ d.metadata_len, 0, &meta, d.timeout_ms ? >>>>>>+ msecs_to_jiffies(d.timeout_ms) : 0, >>>>>>+ ioucmd->cmd_op == NVME_URING_CMD_IO_VEC, 0, 0); >>>>>>+ if (IS_ERR(req)) >>>>>>+ return PTR_ERR(req); >>>>>>+ >>>>>>+ req->end_io = nvme_uring_cmd_end_io; >>>>>>+ req->end_io_data = ioucmd; >>>>>>+ pdu->bio = req->bio; >>>>>>+ pdu->meta = meta; >>>>>>+ req->cmd_flags |= REQ_NVME_MPATH; >>>>>>+ blk_execute_rq_nowait(req, false); >>>>>>+ return -EIOCBQUEUED; >>>>>>+} >>>>>>+ >>>>>>+void nvme_ioucmd_mpath_retry(struct io_uring_cmd *ioucmd) >>>>>>+{ >>>>>>+ struct cdev *cdev = file_inode(ioucmd->file)->i_cdev; >>>>>>+ struct nvme_ns_head *head = container_of(cdev, struct >>>>>>nvme_ns_head, >>>>>>+ cdev); >>>>>>+ int srcu_idx = srcu_read_lock(&head->srcu); >>>>>>+ struct nvme_ns *ns = nvme_find_path(head); >>>>>>+ unsigned int issue_flags = IO_URING_F_SQE128 | IO_URING_F_CQE32 | >>>>>>+ IO_URING_F_MPATH; >>>>>>+ struct device *dev = &head->cdev_device; >>>>>>+ >>>>>>+ if (likely(ns)) { >>>>>>+ struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd); >>>>>>+ struct request *oreq = pdu->req; >>>>>>+ int ret; >>>>>>+ >>>>>>+ if (oreq == NULL) { >>>>>>+ /* >>>>>>+ * this was not submitted (to device) earlier. For this >>>>>>+ * ioucmd->cmd points to persistent memory. Free that >>>>>>+ * up post submission >>>>>>+ */ >>>>>>+ const void *cmd = ioucmd->cmd; >>>>>>+ >>>>>>+ ret = nvme_ns_uring_cmd(ns, ioucmd, issue_flags); >>>>>>+ kfree(cmd); >>>>>>+ } else { >>>>>>+ /* >>>>>>+ * this was submitted (to device) earlier. Use old >>>>>>+ * request, bio (if it exists) and nvme-pdu to recreate >>>>>>+ * the command for the discovered path >>>>>>+ */ >>>>>>+ ret = nvme_uring_cmd_io_retry(ns, oreq, ioucmd, pdu); >>>>> >>>>>Why is this needed? Why is reuse important here? Why not always call >>>>>nvme_ns_uring_cmd? >>>> >>>>Please see the previous explanation. >>>>If condition is for the case when we made the passthrough command stable >>>>by allocating beforehand. >>>>Else is for the case when we avoided taking that cost. >>> >>>The current design of the multipath failover code is clean: >>>1. extract bio(s) from request >>>2. link in requeue_list >>>3. schedule requeue_work that, >>>3.1 takes bios 1-by-1, and submits them again (exactly the same way) >> >>It is really clean, and fits really well with bio based entry interface. >>But as I said earlier, it does not go well with uring-cmd based entry >>interface, and bunch of of other things which needs to be done >>differently for generic passthrough command. >> >>>I'd like us to try to follow the same design where retry is >>>literally "do the exact same thing, again". That would eliminate >>>two submission paths that do the same thing, but slightly different. >> >>Exact same thing is possible if we make the common path slow i.e. >>allocate/copy passthrough command and keep it alive until completion. 
>>But that is really not the way to go I suppose.
>
>I'm not sure. With Christoph's response, I'm not sure it is
>universally desired to support failover (in my opinion it should). But
>if we do in fact choose to support it, I think we need a better
>solution. If fast-path allocation is your prime concern, then let's try
>to address that with space pre-allocation.

Sure. I understand the benefit that space pre-allocation will give.
And overall, these are the top three things to iron out:
- to do (failover) or not to do
- better way to keep the passthrough-cmd stable
- better way to link io_uring_cmd

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread
* Re: [PATCH for-next 4/4] nvme-multipath: add multipathing for uring-passthrough commands 2022-07-13 5:37 ` Kanchan Joshi @ 2022-07-13 9:03 ` Sagi Grimberg 2022-07-13 11:28 ` Kanchan Joshi 2022-07-14 15:14 ` Ming Lei 1 sibling, 1 reply; 52+ messages in thread From: Sagi Grimberg @ 2022-07-13 9:03 UTC (permalink / raw) To: Kanchan Joshi Cc: hch, kbusch, axboe, io-uring, linux-nvme, linux-block, asml.silence, joshiiitr, anuj20.g, gost.dev >> However io_kiocb is less >> constrained, and could be used as a context to hold such a space. >> >> Even if it is undesired to have io_kiocb be passed to uring_cmd(), it >> can still hold a driver specific space paired with a helper to obtain it >> (i.e. something like io_uring_cmd_to_driver_ctx(ioucmd) ). Then if the >> space is pre-allocated it is only a small memory copy for a stable copy >> that would allow a saner failover design. > > I am thinking along the same lines, but it's not about few bytes of > space rather we need 80 (72 to be precise). Will think more, but > these 72 bytes really stand tall in front of my optimism. You don't have to populate this space on every I/O, you can just populate it when there is no usable path and when you failover a request... ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH for-next 4/4] nvme-multipath: add multipathing for uring-passthrough commands
  2022-07-13  9:03                 ` Sagi Grimberg
@ 2022-07-13 11:28                   ` Kanchan Joshi
  2022-07-13 12:17                     ` Sagi Grimberg
  0 siblings, 1 reply; 52+ messages in thread
From: Kanchan Joshi @ 2022-07-13 11:28 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: hch, kbusch, axboe, io-uring, linux-nvme, linux-block,
	asml.silence, joshiiitr, anuj20.g, gost.dev

[-- Attachment #1: Type: text/plain, Size: 1582 bytes --]

On Wed, Jul 13, 2022 at 12:03:48PM +0300, Sagi Grimberg wrote:
>
>>>However io_kiocb is less
>>>constrained, and could be used as a context to hold such a space.
>>>
>>>Even if it is undesired to have io_kiocb be passed to uring_cmd(), it
>>>can still hold a driver specific space paired with a helper to obtain it
>>>(i.e. something like io_uring_cmd_to_driver_ctx(ioucmd) ). Then if the
>>>space is pre-allocated it is only a small memory copy for a stable copy
>>>that would allow a saner failover design.
>>
>>I am thinking along the same lines, but it's not about few bytes of
>>space rather we need 80 (72 to be precise). Will think more, but
>>these 72 bytes really stand tall in front of my optimism.
>
>You don't have to populate this space on every I/O, you can just
>populate it when there is no usable path and when you failover a
>request...

Getting the space and when/how to populate it - related but different
topics in this context.

It is about the lifetime of the SQE, which is valid only for the first
submission. If we don't make the command stable at that point, we don't
have another chance. And that is exactly what happens for failover: we
only learn that the IO has failed when it fails, and by that time the
original passthrough command is out of our hands. I think if we somehow
get the space (preallocated), it is ok to copy the command for every IO
in the mpath case.

The other part (no usable path) is fine, because we hit that condition
during the initial submission and therefore have the chance to
allocate/copy the passthrough command. This patch already does that.

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread
* Re: [PATCH for-next 4/4] nvme-multipath: add multipathing for uring-passthrough commands 2022-07-13 11:28 ` Kanchan Joshi @ 2022-07-13 12:17 ` Sagi Grimberg 0 siblings, 0 replies; 52+ messages in thread From: Sagi Grimberg @ 2022-07-13 12:17 UTC (permalink / raw) To: Kanchan Joshi Cc: hch, kbusch, axboe, io-uring, linux-nvme, linux-block, asml.silence, joshiiitr, anuj20.g, gost.dev >>>> However io_kiocb is less >>>> constrained, and could be used as a context to hold such a space. >>>> >>>> Even if it is undesired to have io_kiocb be passed to uring_cmd(), it >>>> can still hold a driver specific space paired with a helper to >>>> obtain it >>>> (i.e. something like io_uring_cmd_to_driver_ctx(ioucmd) ). Then if the >>>> space is pre-allocated it is only a small memory copy for a stable copy >>>> that would allow a saner failover design. >>> >>> I am thinking along the same lines, but it's not about few bytes of >>> space rather we need 80 (72 to be precise). Will think more, but >>> these 72 bytes really stand tall in front of my optimism. >> >> You don't have to populate this space on every I/O, you can just >> populate it when there is no usable path and when you failover a >> request... > > Getting the space and when/how to populate it - related but diferent > topics in this context. > > It is about the lifetime of SQE which is valid only for the first > submission. If we don't make the command stable at that point, we don't > have another chance. And that is exactly what happens for failover. > Since we know IO is failed only when it fails, but by that time > original passthrough-command is gone out of hand. I think if we somehow > get the space (preallocated), it is ok to copy to command for every IO > in mpath case. Yea you're right. you need to populate it as soon as you queue the uring command. ^ permalink raw reply [flat|nested] 52+ messages in thread
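Below is a minimal sketch of the stable-copy idea this sub-thread converges
on: copy the 72-byte passthrough command out of the SQE into driver-private
space as soon as the uring command is queued, so a later failover can rebuild
the request from it. The io_uring_cmd_to_driver_ctx() helper is the
hypothetical name floated above, and the nvme_uring_stable_cmd structure is
likewise made up here; none of this is code from the posted series.

	/* assumed per-command driver-private area bound to the io_uring_cmd */
	struct nvme_uring_stable_cmd {
		struct nvme_uring_cmd cmd;	/* the 72-byte passthrough command */
	};

	static void nvme_uring_cmd_make_stable(struct io_uring_cmd *ioucmd)
	{
		/* hypothetical helper returning preallocated space held in
		 * the io_kiocb that carries this uring command */
		struct nvme_uring_stable_cmd *sc = io_uring_cmd_to_driver_ctx(ioucmd);

		/* must run at first submission, while the SQE is still valid */
		memcpy(&sc->cmd, ioucmd->cmd, sizeof(sc->cmd));
	}

With such a copy in place, a retry would not need to reassemble the command
from the old request and bio; it could simply resubmit from sc->cmd the same
way the first submission did.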
* Re: [PATCH for-next 4/4] nvme-multipath: add multipathing for uring-passthrough commands 2022-07-13 5:37 ` Kanchan Joshi 2022-07-13 9:03 ` Sagi Grimberg @ 2022-07-14 15:14 ` Ming Lei 2022-07-14 23:05 ` Kanchan Joshi 1 sibling, 1 reply; 52+ messages in thread From: Ming Lei @ 2022-07-14 15:14 UTC (permalink / raw) To: Kanchan Joshi Cc: Sagi Grimberg, hch, kbusch, axboe, io-uring, linux-nvme, linux-block, asml.silence, joshiiitr, anuj20.g, gost.dev, ming.lei On Wed, Jul 13, 2022 at 11:07:57AM +0530, Kanchan Joshi wrote: > > > > > > The way I would do this that in nvme_ioucmd_failover_req (or in the > > > > > > retry driven from command retriable failure) I would do the above, > > > > > > requeue it and kick the requeue work, to go over the requeue_list and > > > > > > just execute them again. Not sure why you even need an explicit retry > > > > > > code. > > > > > During retry we need passthrough command. But passthrough command is not > > > > > stable (i.e. valid only during first submission). We can make it stable > > > > > either by: > > > > > (a) allocating in nvme (b) return -EAGAIN to io_uring, and > > > > > it will do allocate + deferral > > > > > Both add a cost. And since any command can potentially fail, that > > > > > means taking that cost for every IO that we issue on mpath node. Even if > > > > > no failure (initial or subsquent after IO) occcured. > > > > > > > > As mentioned, I think that if a driver consumes a command as queued, > > > > it needs a stable copy for a later reformation of the request for > > > > failover purposes. > > > > > > So what do you propose to make that stable? > > > As I mentioned earlier, stable copy requires allocating/copying in fast > > > path. And for a condition (failover) that may not even occur. > > > I really think currrent solution is much better as it does not try to make > > > it stable. Rather it assembles pieces of passthrough command if retry > > > (which is rare) happens. > > > > Well, I can understand that io_uring_cmd is space constrained, otherwise > > we wouldn't be having this discussion. > > Indeed. If we had space for keeping passthrough command stable for > retry, that would really have simplified the plumbing. Retry logic would > be same as first submission. > > > However io_kiocb is less > > constrained, and could be used as a context to hold such a space. > > > > Even if it is undesired to have io_kiocb be passed to uring_cmd(), it > > can still hold a driver specific space paired with a helper to obtain it > > (i.e. something like io_uring_cmd_to_driver_ctx(ioucmd) ). Then if the > > space is pre-allocated it is only a small memory copy for a stable copy > > that would allow a saner failover design. > > I am thinking along the same lines, but it's not about few bytes of > space rather we need 80 (72 to be precise). Will think more, but > these 72 bytes really stand tall in front of my optimism. > > Do you see anything is possible in nvme-side? > Now also passthrough command (although in a modified form) gets copied > into this preallocated space i.e. nvme_req(req)->cmd. This part - I understand it can't be allocated in nvme request which is freed during retry, and looks the extra space has to be bound with io_uring_cmd. thanks, Ming ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH for-next 4/4] nvme-multipath: add multipathing for uring-passthrough commands 2022-07-14 15:14 ` Ming Lei @ 2022-07-14 23:05 ` Kanchan Joshi 2022-07-15 1:35 ` Ming Lei 0 siblings, 1 reply; 52+ messages in thread From: Kanchan Joshi @ 2022-07-14 23:05 UTC (permalink / raw) To: Ming Lei Cc: Sagi Grimberg, hch, kbusch, axboe, io-uring, linux-nvme, linux-block, asml.silence, joshiiitr, anuj20.g, gost.dev [-- Attachment #1: Type: text/plain, Size: 4231 bytes --] On Thu, Jul 14, 2022 at 11:14:32PM +0800, Ming Lei wrote: >On Wed, Jul 13, 2022 at 11:07:57AM +0530, Kanchan Joshi wrote: >> > > > > > The way I would do this that in nvme_ioucmd_failover_req (or in the >> > > > > > retry driven from command retriable failure) I would do the above, >> > > > > > requeue it and kick the requeue work, to go over the requeue_list and >> > > > > > just execute them again. Not sure why you even need an explicit retry >> > > > > > code. >> > > > > During retry we need passthrough command. But passthrough command is not >> > > > > stable (i.e. valid only during first submission). We can make it stable >> > > > > either by: >> > > > > (a) allocating in nvme (b) return -EAGAIN to io_uring, and >> > > > > it will do allocate + deferral >> > > > > Both add a cost. And since any command can potentially fail, that >> > > > > means taking that cost for every IO that we issue on mpath node. Even if >> > > > > no failure (initial or subsquent after IO) occcured. >> > > > >> > > > As mentioned, I think that if a driver consumes a command as queued, >> > > > it needs a stable copy for a later reformation of the request for >> > > > failover purposes. >> > > >> > > So what do you propose to make that stable? >> > > As I mentioned earlier, stable copy requires allocating/copying in fast >> > > path. And for a condition (failover) that may not even occur. >> > > I really think currrent solution is much better as it does not try to make >> > > it stable. Rather it assembles pieces of passthrough command if retry >> > > (which is rare) happens. >> > >> > Well, I can understand that io_uring_cmd is space constrained, otherwise >> > we wouldn't be having this discussion. >> >> Indeed. If we had space for keeping passthrough command stable for >> retry, that would really have simplified the plumbing. Retry logic would >> be same as first submission. >> >> > However io_kiocb is less >> > constrained, and could be used as a context to hold such a space. >> > >> > Even if it is undesired to have io_kiocb be passed to uring_cmd(), it >> > can still hold a driver specific space paired with a helper to obtain it >> > (i.e. something like io_uring_cmd_to_driver_ctx(ioucmd) ). Then if the >> > space is pre-allocated it is only a small memory copy for a stable copy >> > that would allow a saner failover design. >> >> I am thinking along the same lines, but it's not about few bytes of >> space rather we need 80 (72 to be precise). Will think more, but >> these 72 bytes really stand tall in front of my optimism. >> >> Do you see anything is possible in nvme-side? >> Now also passthrough command (although in a modified form) gets copied >> into this preallocated space i.e. nvme_req(req)->cmd. This part - > >I understand it can't be allocated in nvme request which is freed >during retry, Why not. Yes it gets freed, but we have control over when it gets freed and we can do if anything needs to be done before freeing it. Please see below as well. >and looks the extra space has to be bound with >io_uring_cmd. 
if extra space is bound with io_uring_cmd, it helps to reduce the code
(and just that; I don't see that efficiency will improve - rather it
will be a tad bit less because of one more 72-byte copy operation in the
fast path).

The alternative is to use the space that is bound to the request, i.e.
nvme_req(req)->cmd. This is also preallocated, and the passthrough command
already gets copied into it. But it is only ~80% of the original command.
The remaining 20% is a few fields (data/meta buffer address, respective
lengths) which are not maintained there (as the bio/request can supply
those). During retry, we take the needed pieces out of nvme_req(req)->cmd
(and the bio/request) and then free the request. Please see
nvme_uring_cmd_io_retry in the patch. And here is the fragment for a
quick glance -

+	memcpy(&c, nvme_req(oreq)->cmd, sizeof(struct nvme_command));
+	d.metadata = (__u64)pdu->meta_buffer;
+	d.metadata_len = pdu->meta_len;
+	d.timeout_ms = oreq->timeout;
+	d.addr = (__u64)ioucmd->cmd;
+	if (obio) {
+		d.data_len = obio->bi_iter.bi_size;
+		blk_rq_unmap_user(obio);
+	} else {
+		d.data_len = 0;
+	}
+	blk_mq_free_request(oreq);

Do you see any chinks in the above?

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread
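For reference, piecing the fragment above together with the tail of the
posted patch (quoted earlier in the thread) gives roughly the following
retry flow. This is a condensed reconstruction for illustration, not a
verbatim copy of the series: error handling is omitted, and the locals
q, obio and meta are filled in here for completeness.

	static int nvme_uring_cmd_io_retry(struct nvme_ns *ns, struct request *oreq,
			struct io_uring_cmd *ioucmd, struct nvme_uring_cmd_pdu *pdu)
	{
		struct request_queue *q = ns->queue;
		struct nvme_command c;
		struct nvme_uring_cmd d;
		struct bio *obio = oreq->bio;
		void *meta = NULL;
		struct request *req;

		/* recover the stable pieces kept in the old request and pdu;
		 * the old request is read and freed right here, it is not
		 * held until a usable path shows up */
		memcpy(&c, nvme_req(oreq)->cmd, sizeof(struct nvme_command));
		d.metadata = (__u64)pdu->meta_buffer;
		d.metadata_len = pdu->meta_len;
		d.timeout_ms = oreq->timeout;
		d.addr = (__u64)ioucmd->cmd;
		if (obio) {
			d.data_len = obio->bi_iter.bi_size;
			blk_rq_unmap_user(obio);
		} else {
			d.data_len = 0;
		}
		blk_mq_free_request(oreq);

		/* re-issue on the newly discovered path, as in the first submission */
		req = nvme_alloc_user_request(q, &c, nvme_to_user_ptr(d.addr),
				d.data_len, nvme_to_user_ptr(d.metadata),
				d.metadata_len, 0, &meta, d.timeout_ms ?
				msecs_to_jiffies(d.timeout_ms) : 0,
				ioucmd->cmd_op == NVME_URING_CMD_IO_VEC, 0, 0);
		if (IS_ERR(req))
			return PTR_ERR(req);

		req->end_io = nvme_uring_cmd_end_io;
		req->end_io_data = ioucmd;
		pdu->bio = req->bio;
		pdu->meta = meta;
		req->cmd_flags |= REQ_NVME_MPATH;
		blk_execute_rq_nowait(req, false);
		return -EIOCBQUEUED;
	}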
* Re: [PATCH for-next 4/4] nvme-multipath: add multipathing for uring-passthrough commands 2022-07-14 23:05 ` Kanchan Joshi @ 2022-07-15 1:35 ` Ming Lei 2022-07-15 1:46 ` Ming Lei 0 siblings, 1 reply; 52+ messages in thread From: Ming Lei @ 2022-07-15 1:35 UTC (permalink / raw) To: Kanchan Joshi Cc: Sagi Grimberg, hch, kbusch, axboe, io-uring, linux-nvme, linux-block, asml.silence, joshiiitr, anuj20.g, gost.dev, ming.lei On Fri, Jul 15, 2022 at 04:35:23AM +0530, Kanchan Joshi wrote: > On Thu, Jul 14, 2022 at 11:14:32PM +0800, Ming Lei wrote: > > On Wed, Jul 13, 2022 at 11:07:57AM +0530, Kanchan Joshi wrote: > > > > > > > > The way I would do this that in nvme_ioucmd_failover_req (or in the > > > > > > > > retry driven from command retriable failure) I would do the above, > > > > > > > > requeue it and kick the requeue work, to go over the requeue_list and > > > > > > > > just execute them again. Not sure why you even need an explicit retry > > > > > > > > code. > > > > > > > During retry we need passthrough command. But passthrough command is not > > > > > > > stable (i.e. valid only during first submission). We can make it stable > > > > > > > either by: > > > > > > > (a) allocating in nvme (b) return -EAGAIN to io_uring, and > > > > > > > it will do allocate + deferral > > > > > > > Both add a cost. And since any command can potentially fail, that > > > > > > > means taking that cost for every IO that we issue on mpath node. Even if > > > > > > > no failure (initial or subsquent after IO) occcured. > > > > > > > > > > > > As mentioned, I think that if a driver consumes a command as queued, > > > > > > it needs a stable copy for a later reformation of the request for > > > > > > failover purposes. > > > > > > > > > > So what do you propose to make that stable? > > > > > As I mentioned earlier, stable copy requires allocating/copying in fast > > > > > path. And for a condition (failover) that may not even occur. > > > > > I really think currrent solution is much better as it does not try to make > > > > > it stable. Rather it assembles pieces of passthrough command if retry > > > > > (which is rare) happens. > > > > > > > > Well, I can understand that io_uring_cmd is space constrained, otherwise > > > > we wouldn't be having this discussion. > > > > > > Indeed. If we had space for keeping passthrough command stable for > > > retry, that would really have simplified the plumbing. Retry logic would > > > be same as first submission. > > > > > > > However io_kiocb is less > > > > constrained, and could be used as a context to hold such a space. > > > > > > > > Even if it is undesired to have io_kiocb be passed to uring_cmd(), it > > > > can still hold a driver specific space paired with a helper to obtain it > > > > (i.e. something like io_uring_cmd_to_driver_ctx(ioucmd) ). Then if the > > > > space is pre-allocated it is only a small memory copy for a stable copy > > > > that would allow a saner failover design. > > > > > > I am thinking along the same lines, but it's not about few bytes of > > > space rather we need 80 (72 to be precise). Will think more, but > > > these 72 bytes really stand tall in front of my optimism. > > > > > > Do you see anything is possible in nvme-side? > > > Now also passthrough command (although in a modified form) gets copied > > > into this preallocated space i.e. nvme_req(req)->cmd. This part - > > > > I understand it can't be allocated in nvme request which is freed > > during retry, > > Why not. 
Yes it gets freed, but we have control over when it gets freed
> and we can do if anything needs to be done before freeing it. Please see
> below as well.

This way requires you to hold the old request until a new path is found,
and it is fragile.

What if there isn't any path available and the controller then tries to
reset the path? If the requeue or the io_uring_cmd holds the old request,
it might cause the error recovery to hang or make the error handler code
more complicated.

>
> > and looks the extra space has to be bound with
> > io_uring_cmd.
>
> if extra space is bound with io_uring_cmd, it helps to reduce the code
> (and just that. I don't see that efficiency will improve - rather it

Does retry have to be efficient?

> will be tad bit less because of one more 72 byte copy opeation in fast-path).

Allocating one buffer and binding it to the io_uring_cmd in case of retry
is actually similar to the current model: retry is triggered by the FS
bio, and the allocated buffer can play a similar role to the FS bio.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 52+ messages in thread
* Re: [PATCH for-next 4/4] nvme-multipath: add multipathing for uring-passthrough commands 2022-07-15 1:35 ` Ming Lei @ 2022-07-15 1:46 ` Ming Lei 2022-07-15 4:24 ` Kanchan Joshi 0 siblings, 1 reply; 52+ messages in thread From: Ming Lei @ 2022-07-15 1:46 UTC (permalink / raw) To: Kanchan Joshi Cc: Sagi Grimberg, hch, kbusch, axboe, io-uring, linux-nvme, linux-block, asml.silence, joshiiitr, anuj20.g, gost.dev, ming.lei On Fri, Jul 15, 2022 at 09:35:38AM +0800, Ming Lei wrote: > On Fri, Jul 15, 2022 at 04:35:23AM +0530, Kanchan Joshi wrote: > > On Thu, Jul 14, 2022 at 11:14:32PM +0800, Ming Lei wrote: > > > On Wed, Jul 13, 2022 at 11:07:57AM +0530, Kanchan Joshi wrote: > > > > > > > > > The way I would do this that in nvme_ioucmd_failover_req (or in the > > > > > > > > > retry driven from command retriable failure) I would do the above, > > > > > > > > > requeue it and kick the requeue work, to go over the requeue_list and > > > > > > > > > just execute them again. Not sure why you even need an explicit retry > > > > > > > > > code. > > > > > > > > During retry we need passthrough command. But passthrough command is not > > > > > > > > stable (i.e. valid only during first submission). We can make it stable > > > > > > > > either by: > > > > > > > > (a) allocating in nvme (b) return -EAGAIN to io_uring, and > > > > > > > > it will do allocate + deferral > > > > > > > > Both add a cost. And since any command can potentially fail, that > > > > > > > > means taking that cost for every IO that we issue on mpath node. Even if > > > > > > > > no failure (initial or subsquent after IO) occcured. > > > > > > > > > > > > > > As mentioned, I think that if a driver consumes a command as queued, > > > > > > > it needs a stable copy for a later reformation of the request for > > > > > > > failover purposes. > > > > > > > > > > > > So what do you propose to make that stable? > > > > > > As I mentioned earlier, stable copy requires allocating/copying in fast > > > > > > path. And for a condition (failover) that may not even occur. > > > > > > I really think currrent solution is much better as it does not try to make > > > > > > it stable. Rather it assembles pieces of passthrough command if retry > > > > > > (which is rare) happens. > > > > > > > > > > Well, I can understand that io_uring_cmd is space constrained, otherwise > > > > > we wouldn't be having this discussion. > > > > > > > > Indeed. If we had space for keeping passthrough command stable for > > > > retry, that would really have simplified the plumbing. Retry logic would > > > > be same as first submission. > > > > > > > > > However io_kiocb is less > > > > > constrained, and could be used as a context to hold such a space. > > > > > > > > > > Even if it is undesired to have io_kiocb be passed to uring_cmd(), it > > > > > can still hold a driver specific space paired with a helper to obtain it > > > > > (i.e. something like io_uring_cmd_to_driver_ctx(ioucmd) ). Then if the > > > > > space is pre-allocated it is only a small memory copy for a stable copy > > > > > that would allow a saner failover design. > > > > > > > > I am thinking along the same lines, but it's not about few bytes of > > > > space rather we need 80 (72 to be precise). Will think more, but > > > > these 72 bytes really stand tall in front of my optimism. > > > > > > > > Do you see anything is possible in nvme-side? > > > > Now also passthrough command (although in a modified form) gets copied > > > > into this preallocated space i.e. nvme_req(req)->cmd. 
This part - > > > > > > I understand it can't be allocated in nvme request which is freed > > > during retry, > > > > Why not. Yes it gets freed, but we have control over when it gets freed > > and we can do if anything needs to be done before freeing it. Please see > > below as well. > > This way requires you to hold the old request until one new path is > found, and it is fragile. > > What if there isn't any path available then controller tries to > reset the path? If the requeue or io_uring_cmd holds the old request, > it might cause error recovery hang or make error handler code more > complicated. > > > > > > and looks the extra space has to be bound with > > > io_uring_cmd. > > > > if extra space is bound with io_uring_cmd, it helps to reduce the code > > (and just that. I don't see that efficiency will improve - rather it > > Does retry have to be efficient? > > > will be tad bit less because of one more 72 byte copy opeation in fast-path). > > Allocating one buffer and bind it with io_uring_cmd in case of retry is actually > similar with current model, retry is triggered by FS bio, and the allocated > buffer can play similar role with FS bio. oops, just think of SQE data only valid during submission, so the buffer has to be allocated during 1st submission, but the allocation for each io_uring_cmd shouldn't cost much by creating one slab cache, especially inline data size of io_uring_cmd has been 56(24 + 32) bytes. thanks, Ming ^ permalink raw reply [flat|nested] 52+ messages in thread
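A minimal sketch of the slab-cache idea above, with made-up names and not
taken from the posted series: a 72-byte stable copy of the passthrough
command is allocated per uring command at first submission (while the SQE
is still valid) and would be freed once the command finally completes.

	static struct kmem_cache *nvme_uring_cmd_cachep;

	static int __init nvme_uring_cmd_cache_init(void)
	{
		nvme_uring_cmd_cachep = kmem_cache_create("nvme_uring_cmd",
					sizeof(struct nvme_uring_cmd), 0,
					SLAB_HWCACHE_ALIGN, NULL);
		return nvme_uring_cmd_cachep ? 0 : -ENOMEM;
	}

	/* called at first submission; the returned copy would have to be
	 * linked to the ioucmd (e.g. via its pdu) and freed on completion */
	static struct nvme_uring_cmd *nvme_uring_cmd_stash(struct io_uring_cmd *ioucmd)
	{
		struct nvme_uring_cmd *copy;

		copy = kmem_cache_alloc(nvme_uring_cmd_cachep, GFP_KERNEL);
		if (copy)
			memcpy(copy, ioucmd->cmd, sizeof(*copy));
		return copy;
	}

Whether this per-IO allocation and copy is acceptable on the fast path is
exactly the trade-off being debated in the replies below.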
* Re: [PATCH for-next 4/4] nvme-multipath: add multipathing for uring-passthrough commands 2022-07-15 1:46 ` Ming Lei @ 2022-07-15 4:24 ` Kanchan Joshi 0 siblings, 0 replies; 52+ messages in thread From: Kanchan Joshi @ 2022-07-15 4:24 UTC (permalink / raw) To: Ming Lei Cc: Sagi Grimberg, hch, kbusch, axboe, io-uring, linux-nvme, linux-block, asml.silence, joshiiitr, anuj20.g, gost.dev [-- Attachment #1: Type: text/plain, Size: 5712 bytes --] On Fri, Jul 15, 2022 at 09:46:16AM +0800, Ming Lei wrote: >On Fri, Jul 15, 2022 at 09:35:38AM +0800, Ming Lei wrote: >> On Fri, Jul 15, 2022 at 04:35:23AM +0530, Kanchan Joshi wrote: >> > On Thu, Jul 14, 2022 at 11:14:32PM +0800, Ming Lei wrote: >> > > On Wed, Jul 13, 2022 at 11:07:57AM +0530, Kanchan Joshi wrote: >> > > > > > > > > The way I would do this that in nvme_ioucmd_failover_req (or in the >> > > > > > > > > retry driven from command retriable failure) I would do the above, >> > > > > > > > > requeue it and kick the requeue work, to go over the requeue_list and >> > > > > > > > > just execute them again. Not sure why you even need an explicit retry >> > > > > > > > > code. >> > > > > > > > During retry we need passthrough command. But passthrough command is not >> > > > > > > > stable (i.e. valid only during first submission). We can make it stable >> > > > > > > > either by: >> > > > > > > > (a) allocating in nvme (b) return -EAGAIN to io_uring, and >> > > > > > > > it will do allocate + deferral >> > > > > > > > Both add a cost. And since any command can potentially fail, that >> > > > > > > > means taking that cost for every IO that we issue on mpath node. Even if >> > > > > > > > no failure (initial or subsquent after IO) occcured. >> > > > > > > >> > > > > > > As mentioned, I think that if a driver consumes a command as queued, >> > > > > > > it needs a stable copy for a later reformation of the request for >> > > > > > > failover purposes. >> > > > > > >> > > > > > So what do you propose to make that stable? >> > > > > > As I mentioned earlier, stable copy requires allocating/copying in fast >> > > > > > path. And for a condition (failover) that may not even occur. >> > > > > > I really think currrent solution is much better as it does not try to make >> > > > > > it stable. Rather it assembles pieces of passthrough command if retry >> > > > > > (which is rare) happens. >> > > > > >> > > > > Well, I can understand that io_uring_cmd is space constrained, otherwise >> > > > > we wouldn't be having this discussion. >> > > > >> > > > Indeed. If we had space for keeping passthrough command stable for >> > > > retry, that would really have simplified the plumbing. Retry logic would >> > > > be same as first submission. >> > > > >> > > > > However io_kiocb is less >> > > > > constrained, and could be used as a context to hold such a space. >> > > > > >> > > > > Even if it is undesired to have io_kiocb be passed to uring_cmd(), it >> > > > > can still hold a driver specific space paired with a helper to obtain it >> > > > > (i.e. something like io_uring_cmd_to_driver_ctx(ioucmd) ). Then if the >> > > > > space is pre-allocated it is only a small memory copy for a stable copy >> > > > > that would allow a saner failover design. >> > > > >> > > > I am thinking along the same lines, but it's not about few bytes of >> > > > space rather we need 80 (72 to be precise). Will think more, but >> > > > these 72 bytes really stand tall in front of my optimism. >> > > > >> > > > Do you see anything is possible in nvme-side? 
>> > > > Now also passthrough command (although in a modified form) gets copied >> > > > into this preallocated space i.e. nvme_req(req)->cmd. This part - >> > > >> > > I understand it can't be allocated in nvme request which is freed >> > > during retry, >> > >> > Why not. Yes it gets freed, but we have control over when it gets freed >> > and we can do if anything needs to be done before freeing it. Please see >> > below as well. >> >> This way requires you to hold the old request until one new path is >> found, and it is fragile. >> >> What if there isn't any path available then controller tries to >> reset the path? If the requeue or io_uring_cmd holds the old request, >> it might cause error recovery hang or make error handler code more >> complicated. no, no, this code does not hold the old request until path is found. It is not tied to path discovery at all. old request is freed much early, just after extracting pieces of passthrough-commands from it (i.e. from nvme_req(req)->cmd). Seems you are yet to look at the snippet I shared. >> > >> > > and looks the extra space has to be bound with >> > > io_uring_cmd. >> > >> > if extra space is bound with io_uring_cmd, it helps to reduce the code >> > (and just that. I don't see that efficiency will improve - rather it >> >> Does retry have to be efficient? I also do not care about that. But as mentioned earlier, it is each-io cost not failure-only. Lifetime of original passthrough-cmd is first submission. So if we take this route, we need to do extra copy or allocate+copy (which is trivial by just returing -EAGAIN to io_uring) for each io that is issued on mpath-node. >> > will be tad bit less because of one more 72 byte copy opeation in fast-path). >> >> Allocating one buffer and bind it with io_uring_cmd in case of retry is actually >> similar with current model, retry is triggered by FS bio, and the allocated >> buffer can play similar role with FS bio. > >oops, just think of SQE data only valid during submission, so the buffer >has to be allocated during 1st submission, Exactly. > but the allocation for each io_uring_cmd >shouldn't cost much by creating one slab cache, especially inline data >size of io_uring_cmd has been 56(24 + 32) bytes. io_uring_cmd does not take any extra memory currently. It is part of the existing (per-op) space of io_kiocb. Is it a good idea to amplify + decouple that into its own slab cache for each io (that may not be going to mpath node, and may not be failing either). I really wonder why current solution using nvme_req(req)->cmd (which is also preallocated stable storage) is not much better. [-- Attachment #2: Type: text/plain, Size: 0 bytes --] ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH for-next 4/4] nvme-multipath: add multipathing for uring-passthrough commands
  2022-07-11 11:01   ` [PATCH for-next 4/4] nvme-multipath: add multipathing for uring-passthrough commands Kanchan Joshi
  2022-07-11 13:51     ` Sagi Grimberg
@ 2022-07-12  6:52     ` Christoph Hellwig
  2022-07-12 11:33       ` Kanchan Joshi
  2022-07-12 20:13       ` Sagi Grimberg
  1 sibling, 2 replies; 52+ messages in thread
From: Christoph Hellwig @ 2022-07-12  6:52 UTC (permalink / raw)
  To: Kanchan Joshi
  Cc: hch, sagi, kbusch, axboe, io-uring, linux-nvme, linux-block,
	asml.silence, joshiiitr, anuj20.g, gost.dev

Hmm, I'm a little confused on what this is trying to achieve.

The io_uring passthrough already does support multipathing: it picks
an available path in nvme_ns_head_chr_uring_cmd and uses that.

What this does is add support for requeueing on failure or on the lack
of an available path, which is very strongly against our passthrough
philosophy both in SCSI and NVMe, where error handling is left entirely
to the userspace program issuing the I/O.

So this does radically change behavior in a very unexpected way. Why?

^ permalink raw reply	[flat|nested] 52+ messages in thread
* Re: [PATCH for-next 4/4] nvme-multipath: add multipathing for uring-passthrough commands
  2022-07-12  6:52     ` Christoph Hellwig
@ 2022-07-12 11:33       ` Kanchan Joshi
  0 siblings, 0 replies; 52+ messages in thread
From: Kanchan Joshi @ 2022-07-12 11:33 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: sagi, kbusch, axboe, io-uring, linux-nvme, linux-block,
	asml.silence, joshiiitr, anuj20.g, gost.dev

[-- Attachment #1: Type: text/plain, Size: 781 bytes --]

On Tue, Jul 12, 2022 at 08:52:50AM +0200, Christoph Hellwig wrote:
>Hmm, I'm a little confused on what this is trying to archive.
>
>The io_uring passthrough already does support multipathing, it picks
>an available path in nvme_ns_head_chr_uring_cmd and uses that.
>
>What this does is adding support for requeing on failure or the
>lack of an available path. Which very strongly is against our
>passthrough philosophy both in SCSI and NVMe where error handling
>is left entirely to the userspace program issuing the I/O.
>
>So this does radically change behavior in a very unexpected way.
>Why?

I think the ask has been to add requeue/failover support for
async-passthrough. This came up here -
https://lore.kernel.org/linux-nvme/[email protected]/

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread
* Re: [PATCH for-next 4/4] nvme-multipath: add multipathing for uring-passthrough commands 2022-07-12 6:52 ` Christoph Hellwig 2022-07-12 11:33 ` Kanchan Joshi @ 2022-07-12 20:13 ` Sagi Grimberg 2022-07-13 5:36 ` Christoph Hellwig 1 sibling, 1 reply; 52+ messages in thread From: Sagi Grimberg @ 2022-07-12 20:13 UTC (permalink / raw) To: Christoph Hellwig, Kanchan Joshi Cc: kbusch, axboe, io-uring, linux-nvme, linux-block, asml.silence, joshiiitr, anuj20.g, gost.dev > Hmm, I'm a little confused on what this is trying to archive. > > The io_uring passthrough already does support multipathing, it picks > an available path in nvme_ns_head_chr_uring_cmd and uses that. > > What this does is adding support for requeing on failure or the > lack of an available path. Which very strongly is against our > passthrough philosophy both in SCSI and NVMe where error handling > is left entirely to the userspace program issuing the I/O. > > So this does radically change behavior in a very unexpected way. > Why? I think the difference from scsi-generic and controller nvme passthru is that this is a mpath device node (or mpath chardev). This is why I think that users would expect that it would have equivalent multipath capabilities (i.e. failover). In general, I think that uring passthru as an alternative I/O interface and as such needs to be able to failover. If this is not expected from the interface, then why are we exposing a chardev for the mpath device node? why not only the bottom namespaces? I can't really imagine a user that would use uring passthru and when it gets an error completion, would then try to reconcile if there is an available path (from sysfs?), and submitting it again in hope that an available path is selected by the driver (without really being able to control any of this)... Maybe I'm wrong, but it looks like an awkward interface to operate on a multipath device node, but implement failover yourself, based on some information that is not necessarily in-sync with the driver. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH for-next 4/4] nvme-multipath: add multipathing for uring-passthrough commands 2022-07-12 20:13 ` Sagi Grimberg @ 2022-07-13 5:36 ` Christoph Hellwig 2022-07-13 8:04 ` Sagi Grimberg 0 siblings, 1 reply; 52+ messages in thread From: Christoph Hellwig @ 2022-07-13 5:36 UTC (permalink / raw) To: Sagi Grimberg Cc: Christoph Hellwig, Kanchan Joshi, kbusch, axboe, io-uring, linux-nvme, linux-block, asml.silence, joshiiitr, anuj20.g, gost.dev On Tue, Jul 12, 2022 at 11:13:57PM +0300, Sagi Grimberg wrote: > I think the difference from scsi-generic and controller nvme passthru is > that this is a mpath device node (or mpath chardev). This is why I think > that users would expect that it would have equivalent multipath > capabilities (i.e. failover). How is that different from /dev/sg? > In general, I think that uring passthru as an alternative I/O interface > and as such needs to be able to failover. If this is not expected from > the interface, then why are we exposing a chardev for the mpath device > node? why not only the bottom namespaces? The failover will happen when you retry, but we leave that retry to userspace. There even is the uevent to tell userspace when a new path is up. ^ permalink raw reply [flat|nested] 52+ messages in thread
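To make the userspace-retry model described above concrete, here is a rough
sketch of what a consumer might do. The two helpers are hypothetical
stand-ins for the application's own code: one preps and submits an
IORING_OP_URING_CMD SQE carrying a struct nvme_uring_cmd, the other waits
for the udev/uevent that signals a new path; they are not part of liburing
or of this series, and a real consumer would also distinguish path errors
from other failures before retrying.

	/* resubmit the same passthrough command until it completes or we give up */
	static int submit_with_retry(struct io_uring *ring,
				     struct nvme_uring_cmd *cmd, int max_tries)
	{
		struct io_uring_cqe *cqe;
		int i, ret = -EIO;

		for (i = 0; i < max_tries; i++) {
			queue_nvme_uring_cmd(ring, cmd);	/* hypothetical: prep + submit SQE */
			ret = io_uring_wait_cqe(ring, &cqe);
			if (ret < 0)
				return ret;
			ret = cqe->res;
			io_uring_cqe_seen(ring, cqe);
			if (ret >= 0)
				return ret;		/* completed on some path */
			wait_for_nvme_path_uevent();	/* hypothetical: wait for a new path */
		}
		return ret;
	}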
* Re: [PATCH for-next 4/4] nvme-multipath: add multipathing for uring-passthrough commands 2022-07-13 5:36 ` Christoph Hellwig @ 2022-07-13 8:04 ` Sagi Grimberg 2022-07-13 10:12 ` Christoph Hellwig 0 siblings, 1 reply; 52+ messages in thread From: Sagi Grimberg @ 2022-07-13 8:04 UTC (permalink / raw) To: Christoph Hellwig Cc: Kanchan Joshi, kbusch, axboe, io-uring, linux-nvme, linux-block, asml.silence, joshiiitr, anuj20.g, gost.dev >> I think the difference from scsi-generic and controller nvme passthru is >> that this is a mpath device node (or mpath chardev). This is why I think >> that users would expect that it would have equivalent multipath >> capabilities (i.e. failover). > > How is that different from /dev/sg? /dev/sg it is not a multipath device. Maybe the solution is to just not expose a /dev/ng for the mpath device node, but only for bottom namespaces. Then it would be completely equivalent to scsi-generic devices. It just creates an unexpected mix of semantics of best-effort multipathing with just path selection, but no requeue/failover... >> In general, I think that uring passthru as an alternative I/O interface >> and as such needs to be able to failover. If this is not expected from >> the interface, then why are we exposing a chardev for the mpath device >> node? why not only the bottom namespaces? > > The failover will happen when you retry, but we leave that retry to > userspace. There even is the uevent to tell userspace when a new > path is up. If the user needs to do the retry, discover and understand namespace paths, ANA states, controller states, etc. Why does the user need a mpath chardev at all? The user basically needs to understand everything including indirectly path selection for the I/O. IMO it is cleaner for the user to just have the bottom devices and do the path selection directly. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH for-next 4/4] nvme-multipath: add multipathing for uring-passthrough commands
  2022-07-13  8:04               ` Sagi Grimberg
@ 2022-07-13 10:12                 ` Christoph Hellwig
  2022-07-13 11:00                   ` Sagi Grimberg
  0 siblings, 1 reply; 52+ messages in thread
From: Christoph Hellwig @ 2022-07-13 10:12 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, Kanchan Joshi, kbusch, axboe, io-uring,
	linux-nvme, linux-block, asml.silence, joshiiitr, anuj20.g, gost.dev

On Wed, Jul 13, 2022 at 11:04:31AM +0300, Sagi Grimberg wrote:
> Maybe the solution is to just not expose a /dev/ng for the mpath device
> node, but only for bottom namespaces. Then it would be completely
> equivalent to scsi-generic devices.
>
> It just creates an unexpected mix of semantics of best-effort
> multipathing with just path selection, but no requeue/failover...

Which is exactly the same semantics as SG_IO on the dm-mpath nodes.

> If the user needs to do the retry, discover and understand namespace
> paths, ANA states, controller states, etc. Why does the user need a
> mpath chardev at all?

The user needs to do that for all kinds of other reasons anyway,
as we don't do any retries for passthrough at all.

^ permalink raw reply	[flat|nested] 52+ messages in thread
* Re: [PATCH for-next 4/4] nvme-multipath: add multipathing for uring-passthrough commands 2022-07-13 10:12 ` Christoph Hellwig @ 2022-07-13 11:00 ` Sagi Grimberg 2022-07-13 11:28 ` Christoph Hellwig 2022-07-13 11:49 ` Hannes Reinecke 0 siblings, 2 replies; 52+ messages in thread From: Sagi Grimberg @ 2022-07-13 11:00 UTC (permalink / raw) To: Christoph Hellwig Cc: Kanchan Joshi, kbusch, axboe, io-uring, linux-nvme, linux-block, asml.silence, joshiiitr, anuj20.g, gost.dev >> Maybe the solution is to just not expose a /dev/ng for the mpath device >> node, but only for bottom namespaces. Then it would be completely >> equivalent to scsi-generic devices. >> >> It just creates an unexpected mix of semantics of best-effort >> multipathing with just path selection, but no requeue/failover... > > Which is exactly the same semanics as SG_IO on the dm-mpath nodes. I view uring passthru somewhat as a different thing than sending SG_IO ioctls to dm-mpath. But it can be argued otherwise. BTW, the only consumer of it that I'm aware of commented that he expects dm-mpath to retry SG_IO when dm-mpath retry for SG_IO submission was attempted (https://www.spinics.net/lists/dm-devel/msg46924.html). From Paolo: "The problem is that userspace does not have a way to direct the command to a different path in the resubmission. It may not even have permission to issue DM_TABLE_STATUS, or to access the /dev nodes for the underlying paths, so without Martin's patches SG_IO on dm-mpath is basically unreliable by design." I didn't manage to track down any followup after that email though... >> If the user needs to do the retry, discover and understand namespace >> paths, ANA states, controller states, etc. Why does the user need a >> mpath chardev at all? > > The user needs to do that for all kinds of other resons anyway, > as we don't do any retries for passthrough at all. I still think that there is a problem with the existing semantics for passthru requests over mpath device nodes. Again, I think it will actually be cleaner not to expose passthru devices for mpath at all if we are not going to support retry/failover. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH for-next 4/4] nvme-multipath: add multipathing for uring-passthrough commands 2022-07-13 11:00 ` Sagi Grimberg @ 2022-07-13 11:28 ` Christoph Hellwig 2022-07-13 12:16 ` Sagi Grimberg 2022-07-13 11:49 ` Hannes Reinecke 1 sibling, 1 reply; 52+ messages in thread From: Christoph Hellwig @ 2022-07-13 11:28 UTC (permalink / raw) To: Sagi Grimberg Cc: Christoph Hellwig, Kanchan Joshi, kbusch, axboe, io-uring, linux-nvme, linux-block, asml.silence, joshiiitr, anuj20.g, gost.dev On Wed, Jul 13, 2022 at 02:00:56PM +0300, Sagi Grimberg wrote: > I view uring passthru somewhat as a different thing than sending SG_IO > ioctls to dm-mpath. But it can be argued otherwise. > > BTW, the only consumer of it that I'm aware of commented that he > expects dm-mpath to retry SG_IO when dm-mpath retry for SG_IO submission > was attempted (https://www.spinics.net/lists/dm-devel/msg46924.html). Yeah. But the point is that if we have a path failure, the kernel will pick a new path next time anyway, both in dm-mpath and nvme-mpath. > I still think that there is a problem with the existing semantics for > passthru requests over mpath device nodes. > > Again, I think it will actually be cleaner not to expose passthru > devices for mpath at all if we are not going to support retry/failover. I think they are very useful here. Users of passthrough interface need to be able to retry anyway, even on non-multipath setups. And a dumb retry will do the right thing. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH for-next 4/4] nvme-multipath: add multipathing for uring-passthrough commands 2022-07-13 11:28 ` Christoph Hellwig @ 2022-07-13 12:16 ` Sagi Grimberg 0 siblings, 0 replies; 52+ messages in thread From: Sagi Grimberg @ 2022-07-13 12:16 UTC (permalink / raw) To: Christoph Hellwig Cc: Kanchan Joshi, kbusch, axboe, io-uring, linux-nvme, linux-block, asml.silence, joshiiitr, anuj20.g, gost.dev >> I view uring passthru somewhat as a different thing than sending SG_IO >> ioctls to dm-mpath. But it can be argued otherwise. >> >> BTW, the only consumer of it that I'm aware of commented that he >> expects dm-mpath to retry SG_IO when dm-mpath retry for SG_IO submission >> was attempted (https://www.spinics.net/lists/dm-devel/msg46924.html). > > Yeah. But the point is that if we have a path failure, the kernel > will pick a new path next time anyway, both in dm-mpath and nvme-mpath. If such a path is available at all. >> I still think that there is a problem with the existing semantics for >> passthru requests over mpath device nodes. >> >> Again, I think it will actually be cleaner not to expose passthru >> devices for mpath at all if we are not going to support retry/failover. > > I think they are very useful here. Users of passthrough interface > need to be able to retry anyway, even on non-multipath setups. And > a dumb retry will do the right thing. I think you are painting a simple picture while this is not the case necessarily. It is not a dumb retry, because the user needs to determine if an available path for this particular namespace exists or wait for one if it doesn't want to do a submit/fail constant loop. A passthru interface does not mean that the user by definition needs to understand multipathing, ana/ctrl/ns states/mappings, etc. The user may just want to be able issue vendor-specific commands to the device. If the user needs to understand multipathing by definition, he/she has zero use of a mpath passthru device if it doesn't retry IMO. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH for-next 4/4] nvme-multipath: add multipathing for uring-passthrough commands 2022-07-13 11:00 ` Sagi Grimberg 2022-07-13 11:28 ` Christoph Hellwig @ 2022-07-13 11:49 ` Hannes Reinecke 2022-07-13 12:43 ` Sagi Grimberg 1 sibling, 1 reply; 52+ messages in thread From: Hannes Reinecke @ 2022-07-13 11:49 UTC (permalink / raw) To: Sagi Grimberg, Christoph Hellwig Cc: Kanchan Joshi, kbusch, axboe, io-uring, linux-nvme, linux-block, asml.silence, joshiiitr, anuj20.g, gost.dev On 7/13/22 13:00, Sagi Grimberg wrote: > >>> Maybe the solution is to just not expose a /dev/ng for the mpath device >>> node, but only for bottom namespaces. Then it would be completely >>> equivalent to scsi-generic devices. >>> >>> It just creates an unexpected mix of semantics of best-effort >>> multipathing with just path selection, but no requeue/failover... >> >> Which is exactly the same semanics as SG_IO on the dm-mpath nodes. > > I view uring passthru somewhat as a different thing than sending SG_IO > ioctls to dm-mpath. But it can be argued otherwise. > > BTW, the only consumer of it that I'm aware of commented that he > expects dm-mpath to retry SG_IO when dm-mpath retry for SG_IO submission > was attempted (https://www.spinics.net/lists/dm-devel/msg46924.html). > > From Paolo: > "The problem is that userspace does not have a way to direct the command > to a different path in the resubmission. It may not even have permission > to issue DM_TABLE_STATUS, or to access the /dev nodes for the underlying > paths, so without Martin's patches SG_IO on dm-mpath is basically > unreliable by design." > > I didn't manage to track down any followup after that email though... > I did; 'twas me who was involved in the initial customer issue leading up to that. Amongst all the other issue we've found the prime problem with SG_IO is that it needs to be directed to the 'active' path. For the device-mapper has a distinct callout (dm_prepare_ioctl), which essentially returns the current active path device. And then the device-mapper core issues the command on that active path. All nice and good, _unless_ that command triggers an error. Normally it'd be intercepted by the dm-multipath end_io handler, and would set the path to offline. But as ioctls do not use the normal I/O path the end_io handler is never called, and further SG_IO calls are happily routed down the failed path. And the customer had to use SG_IO (or, in qemu-speak, LUN passthrough) as his application/filesystem makes heavy use of persistent reservations. So yeah, nvme-multipathing should be okay here, as io_uring and normal I/O are using the same code paths, hence the above scenario really won't occur. Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect [email protected] +49 911 74053 688 SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg HRB 36809 (AG Nürnberg), GF: Felix Imendörffer ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH for-next 4/4] nvme-multipath: add multipathing for uring-passthrough commands 2022-07-13 11:49 ` Hannes Reinecke @ 2022-07-13 12:43 ` Sagi Grimberg 2022-07-13 13:30 ` Hannes Reinecke 0 siblings, 1 reply; 52+ messages in thread From: Sagi Grimberg @ 2022-07-13 12:43 UTC (permalink / raw) To: Hannes Reinecke, Christoph Hellwig Cc: Kanchan Joshi, kbusch, axboe, io-uring, linux-nvme, linux-block, asml.silence, joshiiitr, anuj20.g, gost.dev On 7/13/22 14:49, Hannes Reinecke wrote: > On 7/13/22 13:00, Sagi Grimberg wrote: >> >>>> Maybe the solution is to just not expose a /dev/ng for the mpath device >>>> node, but only for bottom namespaces. Then it would be completely >>>> equivalent to scsi-generic devices. >>>> >>>> It just creates an unexpected mix of semantics of best-effort >>>> multipathing with just path selection, but no requeue/failover... >>> >>> Which is exactly the same semanics as SG_IO on the dm-mpath nodes. >> >> I view uring passthru somewhat as a different thing than sending SG_IO >> ioctls to dm-mpath. But it can be argued otherwise. >> >> BTW, the only consumer of it that I'm aware of commented that he >> expects dm-mpath to retry SG_IO when dm-mpath retry for SG_IO submission >> was attempted (https://www.spinics.net/lists/dm-devel/msg46924.html). >> >> From Paolo: >> "The problem is that userspace does not have a way to direct the >> command to a different path in the resubmission. It may not even have >> permission to issue DM_TABLE_STATUS, or to access the /dev nodes for >> the underlying paths, so without Martin's patches SG_IO on dm-mpath is >> basically unreliable by design." >> >> I didn't manage to track down any followup after that email though... >> > I did; 'twas me who was involved in the initial customer issue leading > up to that. > > Amongst all the other issue we've found the prime problem with SG_IO is > that it needs to be directed to the 'active' path. > For the device-mapper has a distinct callout (dm_prepare_ioctl), which > essentially returns the current active path device. And then the > device-mapper core issues the command on that active path. > > All nice and good, _unless_ that command triggers an error. > Normally it'd be intercepted by the dm-multipath end_io handler, and > would set the path to offline. > But as ioctls do not use the normal I/O path the end_io handler is never > called, and further SG_IO calls are happily routed down the failed path. > > And the customer had to use SG_IO (or, in qemu-speak, LUN passthrough) > as his application/filesystem makes heavy use of persistent reservations. How did this conclude Hannes? ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH for-next 4/4] nvme-multipath: add multipathing for uring-passthrough commands 2022-07-13 12:43 ` Sagi Grimberg @ 2022-07-13 13:30 ` Hannes Reinecke 2022-07-13 13:41 ` Sagi Grimberg 0 siblings, 1 reply; 52+ messages in thread From: Hannes Reinecke @ 2022-07-13 13:30 UTC (permalink / raw) To: Sagi Grimberg, Christoph Hellwig Cc: Kanchan Joshi, kbusch, axboe, io-uring, linux-nvme, linux-block, asml.silence, joshiiitr, anuj20.g, gost.dev On 7/13/22 14:43, Sagi Grimberg wrote: > > > On 7/13/22 14:49, Hannes Reinecke wrote: >> On 7/13/22 13:00, Sagi Grimberg wrote: >>> >>>>> Maybe the solution is to just not expose a /dev/ng for the mpath >>>>> device >>>>> node, but only for bottom namespaces. Then it would be completely >>>>> equivalent to scsi-generic devices. >>>>> >>>>> It just creates an unexpected mix of semantics of best-effort >>>>> multipathing with just path selection, but no requeue/failover... >>>> >>>> Which is exactly the same semanics as SG_IO on the dm-mpath nodes. >>> >>> I view uring passthru somewhat as a different thing than sending SG_IO >>> ioctls to dm-mpath. But it can be argued otherwise. >>> >>> BTW, the only consumer of it that I'm aware of commented that he >>> expects dm-mpath to retry SG_IO when dm-mpath retry for SG_IO submission >>> was attempted (https://www.spinics.net/lists/dm-devel/msg46924.html). >>> >>> From Paolo: >>> "The problem is that userspace does not have a way to direct the >>> command to a different path in the resubmission. It may not even have >>> permission to issue DM_TABLE_STATUS, or to access the /dev nodes for >>> the underlying paths, so without Martin's patches SG_IO on dm-mpath >>> is basically unreliable by design." >>> >>> I didn't manage to track down any followup after that email though... >>> >> I did; 'twas me who was involved in the initial customer issue leading >> up to that. >> >> Amongst all the other issue we've found the prime problem with SG_IO >> is that it needs to be directed to the 'active' path. >> For the device-mapper has a distinct callout (dm_prepare_ioctl), which >> essentially returns the current active path device. And then the >> device-mapper core issues the command on that active path. >> >> All nice and good, _unless_ that command triggers an error. >> Normally it'd be intercepted by the dm-multipath end_io handler, and >> would set the path to offline. >> But as ioctls do not use the normal I/O path the end_io handler is >> never called, and further SG_IO calls are happily routed down the >> failed path. >> >> And the customer had to use SG_IO (or, in qemu-speak, LUN passthrough) >> as his application/filesystem makes heavy use of persistent reservations. > > How did this conclude Hannes? It didn't. The proposed interface got rejected, and now we need to come up with an alternative solution. Which we haven't found yet. Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect [email protected] +49 911 74053 688 SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg HRB 36809 (AG Nürnberg), GF: Felix Imendörffer ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH for-next 4/4] nvme-multipath: add multipathing for uring-passthrough commands 2022-07-13 13:30 ` Hannes Reinecke @ 2022-07-13 13:41 ` Sagi Grimberg 2022-07-13 14:07 ` Hannes Reinecke 0 siblings, 1 reply; 52+ messages in thread From: Sagi Grimberg @ 2022-07-13 13:41 UTC (permalink / raw) To: Hannes Reinecke, Christoph Hellwig Cc: Kanchan Joshi, kbusch, axboe, io-uring, linux-nvme, linux-block, asml.silence, joshiiitr, anuj20.g, gost.dev >>>>>> Maybe the solution is to just not expose a /dev/ng for the mpath >>>>>> device >>>>>> node, but only for bottom namespaces. Then it would be completely >>>>>> equivalent to scsi-generic devices. >>>>>> >>>>>> It just creates an unexpected mix of semantics of best-effort >>>>>> multipathing with just path selection, but no requeue/failover... >>>>> >>>>> Which is exactly the same semanics as SG_IO on the dm-mpath nodes. >>>> >>>> I view uring passthru somewhat as a different thing than sending SG_IO >>>> ioctls to dm-mpath. But it can be argued otherwise. >>>> >>>> BTW, the only consumer of it that I'm aware of commented that he >>>> expects dm-mpath to retry SG_IO when dm-mpath retry for SG_IO >>>> submission >>>> was attempted (https://www.spinics.net/lists/dm-devel/msg46924.html). >>>> >>>> From Paolo: >>>> "The problem is that userspace does not have a way to direct the >>>> command to a different path in the resubmission. It may not even >>>> have permission to issue DM_TABLE_STATUS, or to access the /dev >>>> nodes for the underlying paths, so without Martin's patches SG_IO on >>>> dm-mpath is basically unreliable by design." >>>> >>>> I didn't manage to track down any followup after that email though... >>>> >>> I did; 'twas me who was involved in the initial customer issue >>> leading up to that. >>> >>> Amongst all the other issue we've found the prime problem with SG_IO >>> is that it needs to be directed to the 'active' path. >>> For the device-mapper has a distinct callout (dm_prepare_ioctl), >>> which essentially returns the current active path device. And then >>> the device-mapper core issues the command on that active path. >>> >>> All nice and good, _unless_ that command triggers an error. >>> Normally it'd be intercepted by the dm-multipath end_io handler, and >>> would set the path to offline. >>> But as ioctls do not use the normal I/O path the end_io handler is >>> never called, and further SG_IO calls are happily routed down the >>> failed path. >>> >>> And the customer had to use SG_IO (or, in qemu-speak, LUN >>> passthrough) as his application/filesystem makes heavy use of >>> persistent reservations. >> >> How did this conclude Hannes? > > It didn't. The proposed interface got rejected, and now we need to come > up with an alternative solution. > Which we haven't found yet. Let's assume for the sake of discussion, had dm-mpath set a path to be offline on ioctl errors, what would qemu do upon this error? Blindly retry? Until when? Or would qemu need to learn about the path tables in order to know when there is at least one online path in order to retry? What is the model that a passthru consumer needs to follow when operating against a mpath device? ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH for-next 4/4] nvme-multipath: add multipathing for uring-passthrough commands 2022-07-13 13:41 ` Sagi Grimberg @ 2022-07-13 14:07 ` Hannes Reinecke 2022-07-13 15:59 ` Sagi Grimberg 0 siblings, 1 reply; 52+ messages in thread From: Hannes Reinecke @ 2022-07-13 14:07 UTC (permalink / raw) To: Sagi Grimberg, Christoph Hellwig Cc: Kanchan Joshi, kbusch, axboe, io-uring, linux-nvme, linux-block, asml.silence, joshiiitr, anuj20.g, gost.dev On 7/13/22 15:41, Sagi Grimberg wrote: > >>>>>>> Maybe the solution is to just not expose a /dev/ng for the mpath >>>>>>> device >>>>>>> node, but only for bottom namespaces. Then it would be completely >>>>>>> equivalent to scsi-generic devices. >>>>>>> >>>>>>> It just creates an unexpected mix of semantics of best-effort >>>>>>> multipathing with just path selection, but no requeue/failover... >>>>>> >>>>>> Which is exactly the same semanics as SG_IO on the dm-mpath nodes. >>>>> >>>>> I view uring passthru somewhat as a different thing than sending SG_IO >>>>> ioctls to dm-mpath. But it can be argued otherwise. >>>>> >>>>> BTW, the only consumer of it that I'm aware of commented that he >>>>> expects dm-mpath to retry SG_IO when dm-mpath retry for SG_IO >>>>> submission >>>>> was attempted (https://www.spinics.net/lists/dm-devel/msg46924.html). >>>>> >>>>> From Paolo: >>>>> "The problem is that userspace does not have a way to direct the >>>>> command to a different path in the resubmission. It may not even >>>>> have permission to issue DM_TABLE_STATUS, or to access the /dev >>>>> nodes for the underlying paths, so without Martin's patches SG_IO >>>>> on dm-mpath is basically unreliable by design." >>>>> >>>>> I didn't manage to track down any followup after that email though... >>>>> >>>> I did; 'twas me who was involved in the initial customer issue >>>> leading up to that. >>>> >>>> Amongst all the other issue we've found the prime problem with SG_IO >>>> is that it needs to be directed to the 'active' path. >>>> For the device-mapper has a distinct callout (dm_prepare_ioctl), >>>> which essentially returns the current active path device. And then >>>> the device-mapper core issues the command on that active path. >>>> >>>> All nice and good, _unless_ that command triggers an error. >>>> Normally it'd be intercepted by the dm-multipath end_io handler, and >>>> would set the path to offline. >>>> But as ioctls do not use the normal I/O path the end_io handler is >>>> never called, and further SG_IO calls are happily routed down the >>>> failed path. >>>> >>>> And the customer had to use SG_IO (or, in qemu-speak, LUN >>>> passthrough) as his application/filesystem makes heavy use of >>>> persistent reservations. >>> >>> How did this conclude Hannes? >> >> It didn't. The proposed interface got rejected, and now we need to >> come up with an alternative solution. >> Which we haven't found yet. > > Lets assume for the sake of discussion, had dm-mpath set a path to be > offline on ioctl errors, what would qemu do upon this error? blindly > retry? Until When? Or would qemu need to learn about the path tables in > order to know when there is at least one online path in order to retry? > IIRC that was one of the points why it got rejected. Ideally we would return an errno indicating that the path had failed, but further paths are available, so a retry is in order. Once no paths are available qemu would be getting a different error indicating that all paths are failed. 
But we would be overloading existing error numbers with a new meaning, or even inventing our own error numbers, which makes it rather awkward to use. Ideally we would be able to return this as the SG_IO status, as that is well capable of expressing these situations. But then we would need to parse and/or return the error ourselves, essentially moving sg_io functionality into dm-mpath. Also not what one wants. > What is the model that a passthru consumer needs to follow when > operating against a mpath device? The model really is that a passthru consumer needs to deal with these errors: - No error (obviously) - I/O error (error status will not change with a retry) - Temporary/path-related error (error status might change with a retry) Then the consumer can decide whether to invoke a retry (for the last class), or whether it should pass up that error, as maybe there are applications which need a quick response time and can handle temporary failures (or, in fact, want to be informed about temporary failures). I.e. the 'DNR' bit should serve nicely here, keeping in mind that we might need to 'fake' an NVMe error status if the connection is severed. Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect [email protected] +49 911 74053 688 SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg HRB 36809 (AG Nürnberg), GF: Felix Imendörffer ^ permalink raw reply [flat|nested] 52+ messages in thread
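As a rough illustration of the retry policy Hannes sketches here (this is not code from the series), a userspace passthrough consumer could classify completions along exactly those three lines. The one assumption is that the passthrough CQE result carries either a negative errno or the NVMe status word in the kernel's internal encoding, i.e. shifted right by one so that the DNR bit lands at bit 14; the SKETCH_ and PT_ names below are made up for the example.

/*
 * Illustrative only, not part of this series: classify a passthrough
 * completion into the three classes described above. Assumption: the
 * CQE result is either a negative errno or the NVMe status word in
 * the kernel's internal encoding (shifted right by one, so the DNR
 * bit sits at bit 14).
 */
#include <stdint.h>

#define SKETCH_NVME_STATUS_DNR  (1u << 14)      /* Do Not Retry */

enum pt_disposition {
        PT_OK,          /* command succeeded */
        PT_FATAL,       /* retrying will not change the outcome */
        PT_RETRY,       /* temporary/path-related, a retry may succeed */
};

enum pt_disposition classify_pt_completion(int32_t cqe_res)
{
        if (cqe_res == 0)
                return PT_OK;

        if (cqe_res < 0)
                /*
                 * Negative errno: no NVMe completion was produced,
                 * e.g. the connection was severed. Treat it as a
                 * path problem unless local policy says otherwise.
                 */
                return PT_RETRY;

        if (cqe_res & SKETCH_NVME_STATUS_DNR)
                return PT_FATAL;

        return PT_RETRY;
}

A consumer that wants a quick response would surface PT_RETRY to the application right away; one that wants transparent failover would resubmit, bounded by its own retry budget.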
* Re: [PATCH for-next 4/4] nvme-multipath: add multipathing for uring-passthrough commands 2022-07-13 14:07 ` Hannes Reinecke @ 2022-07-13 15:59 ` Sagi Grimberg 0 siblings, 0 replies; 52+ messages in thread From: Sagi Grimberg @ 2022-07-13 15:59 UTC (permalink / raw) To: Hannes Reinecke, Christoph Hellwig Cc: Kanchan Joshi, kbusch, axboe, io-uring, linux-nvme, linux-block, asml.silence, joshiiitr, anuj20.g, gost.dev >>>>> Amongst all the other issue we've found the prime problem with >>>>> SG_IO is that it needs to be directed to the 'active' path. >>>>> For the device-mapper has a distinct callout (dm_prepare_ioctl), >>>>> which essentially returns the current active path device. And then >>>>> the device-mapper core issues the command on that active path. >>>>> >>>>> All nice and good, _unless_ that command triggers an error. >>>>> Normally it'd be intercepted by the dm-multipath end_io handler, >>>>> and would set the path to offline. >>>>> But as ioctls do not use the normal I/O path the end_io handler is >>>>> never called, and further SG_IO calls are happily routed down the >>>>> failed path. >>>>> >>>>> And the customer had to use SG_IO (or, in qemu-speak, LUN >>>>> passthrough) as his application/filesystem makes heavy use of >>>>> persistent reservations. >>>> >>>> How did this conclude Hannes? >>> >>> It didn't. The proposed interface got rejected, and now we need to >>> come up with an alternative solution. >>> Which we haven't found yet. >> >> Lets assume for the sake of discussion, had dm-mpath set a path to be >> offline on ioctl errors, what would qemu do upon this error? blindly >> retry? Until When? Or would qemu need to learn about the path tables in >> order to know when there is at least one online path in order to retry? >> > IIRC that was one of the points why it got rejected. > Ideally we would return an errno indicating that the path had failed, > but further paths are available, so a retry is in order. > Once no paths are available qemu would be getting a different error > indicating that all paths are failed. There is no such no-paths-available error. > > But we would be overloading a new meaning to existing error numbers, or > even inventing our own error numbers. Which makes it rather awkward to use. I agree that this sounds awkward. > Ideally we would be able to return this as the SG_IO status, as that is > well capable of expressing these situations. But then we would need to > parse and/or return the error ourselves, essentially moving sg_io > funtionality into dm-mpath. Also not what one wants. uring actually should send back the cqe for passthru, but there is no concept like "Path error, but no paths are available". > >> What is the model that a passthru consumer needs to follow when >> operating against a mpath device? > > The model really is that passthru consumer needs to deal with these errors: > - No error (obviously) > - I/O error (error status will not change with a retry) > - Temporary/path related error (error status might change with a retry) > > Then the consumer can decide whether to invoke a retry (for the last > class), or whether it should pass up that error, as maybe there are > applications with need a quick response time and can handle temporary > failures (or, in fact, want to be informed about temporary failures). > > IE the 'DNR' bit should serve nicely here, keeping in mind that we might > need to 'fake' an NVMe error status if the connection is severed. uring passthru sends the cqe status to userspace IIRC. 
But nothing in there says anything about path availability. That would be something that userspace would need to reconcile on its own by traversing sysfs or the like... ^ permalink raw reply [flat|nested] 52+ messages in thread
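For completeness, a small sketch (again not part of the series) of the kind of sysfs reconciliation referred to here: counting how many controllers of a subsystem still report a "live" state before deciding whether another retry can possibly reach the target. The /sys/class/nvme-subsystem/<subsys>/nvme<N>/state layout and the "live" string are assumptions about current kernels and may vary across versions; a library such as libnvme can provide the same information.

/*
 * Illustrative only, not part of this series: count the controllers
 * of one NVMe subsystem that report a "live" state in sysfs. The
 * directory layout and the "live" string are assumptions and may
 * differ between kernel versions.
 */
#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>

int count_live_controllers(const char *subsys)
{
        char path[PATH_MAX], state[32];
        struct dirent *de;
        unsigned int idx;
        char trailing;
        int live = 0;
        DIR *dir;
        FILE *f;

        snprintf(path, sizeof(path), "/sys/class/nvme-subsystem/%s", subsys);
        dir = opendir(path);
        if (!dir)
                return -1;

        while ((de = readdir(dir))) {
                /* controllers are named nvme<N>; skip namespaces etc. */
                if (sscanf(de->d_name, "nvme%u%c", &idx, &trailing) != 1)
                        continue;

                snprintf(path, sizeof(path),
                         "/sys/class/nvme-subsystem/%s/%s/state",
                         subsys, de->d_name);
                f = fopen(path, "r");
                if (!f)
                        continue;
                if (fgets(state, sizeof(state), f) && !strncmp(state, "live", 4))
                        live++;
                fclose(f);
        }
        closedir(dir);
        return live;
}

A passthrough consumer would run such a check when a retryable completion comes back and give up once no live controller remains.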
end of thread, other threads:[~2022-07-18 9:03 UTC | newest] Thread overview: 52+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- [not found] <CGME20220711110753epcas5p4169b9e288d15ca35740dbb66a6f6983a@epcas5p4.samsung.com> 2022-07-11 11:01 ` [PATCH for-next 0/4] nvme-multipathing for uring-passthrough Kanchan Joshi [not found] ` <CGME20220711110800epcas5p3d338dd486fd778c5ba5bfe93a91ec8bd@epcas5p3.samsung.com> 2022-07-11 11:01 ` [PATCH for-next 1/4] io_uring, nvme: rename a function Kanchan Joshi 2022-07-14 13:55 ` Ming Lei [not found] ` <CGME20220711110812epcas5p33aa90b23aa62fb11722aa8195754becf@epcas5p3.samsung.com> 2022-07-11 11:01 ` [PATCH for-next 2/4] nvme: compact nvme_uring_cmd_pdu struct Kanchan Joshi 2022-07-12 6:32 ` Christoph Hellwig [not found] ` <CGME20220711110824epcas5p22c8e945cb8c3c3ac46c8c2b5ab55db9b@epcas5p2.samsung.com> 2022-07-11 11:01 ` [PATCH for-next 3/4] io_uring: grow a field in struct io_uring_cmd Kanchan Joshi 2022-07-11 17:00 ` Sagi Grimberg 2022-07-11 17:19 ` Jens Axboe 2022-07-11 17:18 ` Jens Axboe 2022-07-11 17:55 ` Sagi Grimberg 2022-07-11 18:22 ` Sagi Grimberg 2022-07-11 18:24 ` Jens Axboe 2022-07-11 18:58 ` Sagi Grimberg 2022-07-12 11:40 ` Kanchan Joshi 2022-07-14 3:40 ` Ming Lei 2022-07-14 8:19 ` Kanchan Joshi 2022-07-14 15:30 ` Daniel Wagner 2022-07-15 11:07 ` Kanchan Joshi 2022-07-18 9:03 ` Daniel Wagner [not found] ` <CGME20220711110827epcas5p3fd81f142f55ca3048abc38a9ef0d0089@epcas5p3.samsung.com> 2022-07-11 11:01 ` [PATCH for-next 4/4] nvme-multipath: add multipathing for uring-passthrough commands Kanchan Joshi 2022-07-11 13:51 ` Sagi Grimberg 2022-07-11 15:12 ` Stefan Metzmacher 2022-07-11 16:58 ` Sagi Grimberg 2022-07-11 18:54 ` Kanchan Joshi 2022-07-11 18:37 ` Kanchan Joshi 2022-07-11 19:56 ` Sagi Grimberg 2022-07-12 4:23 ` Kanchan Joshi 2022-07-12 21:26 ` Sagi Grimberg 2022-07-13 5:37 ` Kanchan Joshi 2022-07-13 9:03 ` Sagi Grimberg 2022-07-13 11:28 ` Kanchan Joshi 2022-07-13 12:17 ` Sagi Grimberg 2022-07-14 15:14 ` Ming Lei 2022-07-14 23:05 ` Kanchan Joshi 2022-07-15 1:35 ` Ming Lei 2022-07-15 1:46 ` Ming Lei 2022-07-15 4:24 ` Kanchan Joshi 2022-07-12 6:52 ` Christoph Hellwig 2022-07-12 11:33 ` Kanchan Joshi 2022-07-12 20:13 ` Sagi Grimberg 2022-07-13 5:36 ` Christoph Hellwig 2022-07-13 8:04 ` Sagi Grimberg 2022-07-13 10:12 ` Christoph Hellwig 2022-07-13 11:00 ` Sagi Grimberg 2022-07-13 11:28 ` Christoph Hellwig 2022-07-13 12:16 ` Sagi Grimberg 2022-07-13 11:49 ` Hannes Reinecke 2022-07-13 12:43 ` Sagi Grimberg 2022-07-13 13:30 ` Hannes Reinecke 2022-07-13 13:41 ` Sagi Grimberg 2022-07-13 14:07 ` Hannes Reinecke 2022-07-13 15:59 ` Sagi Grimberg