From: Kanchan Joshi <[email protected]>
To: Sagi Grimberg <[email protected]>
Cc: [email protected], [email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected], [email protected]
Subject: Re: [PATCH for-next 4/4] nvme-multipath: add multipathing for uring-passthrough commands
Date: Tue, 12 Jul 2022 09:53:32 +0530 [thread overview]
Message-ID: <20220712042332.GA14780@test-zns> (raw)
In-Reply-To: <[email protected]>
On Mon, Jul 11, 2022 at 10:56:56PM +0300, Sagi Grimberg wrote:
>
>>>>@@ -448,6 +442,14 @@ static int nvme_uring_cmd_io(struct
>>>>nvme_ctrl *ctrl, struct nvme_ns *ns,
>>>> pdu->meta_buffer = nvme_to_user_ptr(d.metadata);
>>>> pdu->meta_len = d.metadata_len;
>>>>+ if (issue_flags & IO_URING_F_MPATH) {
>>>>+ req->cmd_flags |= REQ_NVME_MPATH;
>>>>+ /*
>>>>+ * we would need the buffer address (d.addr field) if we have
>>>>+ * to retry the command. Store it by repurposing ioucmd->cmd
>>>>+ */
>>>>+ ioucmd->cmd = (void *)d.addr;
>>>
>>>What does repurposing mean here?
>>
>>This field (ioucmd->cmd) was pointing to the passthrough command (which
>>is embedded in the SQE of io_uring). At this point we have consumed the
>>passthrough command, so this field can be reused if we have to. And we
>>have to, because failover needs to recreate the passthrough command.
>>Please see nvme_uring_cmd_io_retry to see how this helps in recreating
>>the fields of the passthrough command. And more on this below.
>
>so ioucmd->cmd starts as an nvme_uring_cmd pointer and continues as
>an address of buffer for a later processing that may or may not
>happen in its lifetime?
Yes. See this as a no-cost way to handle the fast/common path. We manage
by doing just this one assignment.
Otherwise this would have involved doing high-cost work
(allocate/copy/deferral) for later processing that may or may not happen.
>Sounds a bit of a backwards design to me.
>
>>>>+ }
>>>> blk_execute_rq_nowait(req, false);
>>>> return -EIOCBQUEUED;
>>>> }
>>>>@@ -665,12 +667,135 @@ int nvme_ns_head_chr_uring_cmd(struct
>>>>io_uring_cmd *ioucmd,
>>>> int srcu_idx = srcu_read_lock(&head->srcu);
>>>> struct nvme_ns *ns = nvme_find_path(head);
>>>> int ret = -EINVAL;
>>>>+ struct device *dev = &head->cdev_device;
>>>>+
>>>>+ if (likely(ns)) {
>>>>+ ret = nvme_ns_uring_cmd(ns, ioucmd,
>>>>+ issue_flags | IO_URING_F_MPATH);
>>>>+ } else if (nvme_available_path(head)) {
>>>>+ struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd);
>>>>+ struct nvme_uring_cmd *payload = NULL;
>>>>+
>>>>+ dev_warn_ratelimited(dev, "no usable path - requeuing I/O\n");
>>>>+ /*
>>>>+ * We may requeue two kinds of uring commands:
>>>>+ * 1. command failed post submission. pdu->req will be non-null
>>>>+ * for this
>>>>+ * 2. command that could not be submitted because path was not
>>>>+ * available. For this pdu->req is set to NULL.
>>>>+ */
>>>>+ pdu->req = NULL;
>>>
>>>Relying on a pointer does not sound like a good idea to me.
>>>But why do you care at all where did this command came from?
>>>This code should not concern itself what happened prior to this
>>>execution.
>>Required, please see nvme_uring_cmd_io_retry. And this should become
>>clearer as part of the responses to your other questions.
>
>I think I understand. But it looks fragile to me.
>
>>
>>>>+ /*
>>>>+ * passthrough command will not be available during retry as it
>>>>+ * is embedded in io_uring's SQE. Hence we allocate/copy here
>>>>+ */
>>>
>>>OK, that is a nice solution.
>>Please note that the preferred way is to recreate the passthrough
>>command, not to allocate it. We allocate it here because this happens
>>very early (i.e. before processing the passthrough command and setting
>>it up inside struct request). Recreating requires a populated
>>'struct request'.
>
>I think that once a driver consumes a command as queued, it needs a
>stable copy of the command (could be in a percpu pool).
>
>The current design of nvme multipathing is that the request is stripped
>from its bios and placed on a requeue_list, and if a request was not
>formed yet (i.e. nvme available path exist but no usable) the bio is
>placed on the requeue_list.
>
>So ideally we have something sufficient like a bio, that can be linked
>on a requeue list, and is sufficient to build a request, at any point in
>time...
We could be dealing with any passthrough command here. A bio is not
guaranteed to exist in the first place; it can very well be NULL.
As I mentioned in the cover letter, this was tested for such passthrough
commands too.
And bio is not a good fit for this interface. For the block path, the
entry function is nvme_ns_head_submit_bio(), which speaks bio; the
request is allocated afterwards. But since request formation can happen
only after discovering a valid path, it may have to queue the bio if no
path exists.
For passthrough, the interface speaks/understands struct io_uring_cmd.
The request is allocated afterwards, and a bio may (or may not) be
attached to that request. We may have to queue the command even before
the request/bio gets formed. So the cardinal structure (the one that is
really available all the time) in this case is struct io_uring_cmd.
Hence we use it as the object around which requeuing/failover is built.
Conceptually, io_uring_cmd is the bio of this interface.
>>>>+ payload = kmalloc(sizeof(struct nvme_uring_cmd), GFP_KERNEL);
>>>>+ if (!payload) {
>>>>+ ret = -ENOMEM;
>>>>+ goto out_unlock;
>>>>+ }
>>>>+ memcpy(payload, ioucmd->cmd, sizeof(struct nvme_uring_cmd));
>>>>+ ioucmd->cmd = payload;
>>>>- if (ns)
>>>>- ret = nvme_ns_uring_cmd(ns, ioucmd, issue_flags);
>>>>+ spin_lock_irq(&head->requeue_ioucmd_lock);
>>>>+ ioucmd_list_add(&head->requeue_ioucmd_list, ioucmd);
>>>>+ spin_unlock_irq(&head->requeue_ioucmd_lock);
>>>>+ ret = -EIOCBQUEUED;
>>>>+ } else {
>>>>+ dev_warn_ratelimited(dev, "no available path - failing I/O\n");
>>>
>>>ret=-EIO ?
>>Did not do so as it was initialized to -EINVAL. Do you prefer -EIO instead?
>
>It is not an invalid argument error here, it is essentially an IO error.
>So yes, that would be my preference.
Sure, will change.
>>>>+ }
>>>>+out_unlock:
>>>> srcu_read_unlock(&head->srcu, srcu_idx);
>>>> return ret;
>>>> }
>>>>+
>>>>+int nvme_uring_cmd_io_retry(struct nvme_ns *ns, struct request *oreq,
>>>>+ struct io_uring_cmd *ioucmd, struct nvme_uring_cmd_pdu *pdu)
>>>>+{
>>>>+ struct nvme_ctrl *ctrl = ns->ctrl;
>>>>+ struct request_queue *q = ns ? ns->queue : ctrl->admin_q;
>>>>+ struct nvme_uring_data d;
>>>>+ struct nvme_command c;
>>>>+ struct request *req;
>>>>+ struct bio *obio = oreq->bio;
>>>>+ void *meta = NULL;
>>>>+
>>>>+ memcpy(&c, nvme_req(oreq)->cmd, sizeof(struct nvme_command));
>>>>+ d.metadata = (__u64)pdu->meta_buffer;
>>>>+ d.metadata_len = pdu->meta_len;
>>>>+ d.timeout_ms = oreq->timeout;
>>>>+ d.addr = (__u64)ioucmd->cmd;
>>>>+ if (obio) {
>>>>+ d.data_len = obio->bi_iter.bi_size;
>>>>+ blk_rq_unmap_user(obio);
>>>>+ } else {
>>>>+ d.data_len = 0;
>>>>+ }
>>>>+ blk_mq_free_request(oreq);
>>>
>>>The way I would do this that in nvme_ioucmd_failover_req (or in the
>>>retry driven from command retriable failure) I would do the above,
>>>requeue it and kick the requeue work, to go over the requeue_list and
>>>just execute them again. Not sure why you even need an explicit retry
>>>code.
>>During retry we need the passthrough command. But the passthrough
>>command is not stable (i.e. valid only during the first submission).
>>We can make it stable either by:
>>(a) allocating in nvme, or (b) returning -EAGAIN to io_uring, which
>>will do the allocation + deferral.
>>Both add a cost. And since any command can potentially fail, that means
>>taking that cost for every IO that we issue on the mpath node, even if
>>no failure (initial or subsequent to the IO) occurred.
>
>As mentioned, I think that if a driver consumes a command as queued,
>it needs a stable copy for a later reformation of the request for
>failover purposes.
So what do you propose to make that stable?
As I mentioned earlier, a stable copy requires allocating/copying in the
fast path, and for a condition (failover) that may not even occur.
I really think the current solution is much better, as it does not try to
make the command stable. Rather it assembles the pieces of the
passthrough command if a retry (which is rare) happens.
>>So to avoid the common-path cost, we go about doing nothing (no
>>allocation, no deferral) at the outset and choose to recreate the
>>passthrough command if a failure occurred. Hope this explains the
>>purpose of nvme_uring_cmd_io_retry?
>
>I think it does, but I'm still having a hard time with it...
>
Maybe I am reiterating, but a few main differences should help -
- io_uring_cmd is at the forefront, with bio/request as secondary
  objects. The former is a persistent object across requeue attempts, and
  the only thing available when we discover the path, while the other two
  are created every time we retry.
- Receiving a bio from the upper layer is a luxury that we do not have
  for passthrough. When we receive a bio, pages are already mapped and we
  do not have to deal with user-specific fields, so there is more freedom
  in using an arbitrary context (workers etc.). But a passthrough command
  continues to point to user-space fields/buffers, so we need that task
  context.
>>
>>>>+ req = nvme_alloc_user_request(q, &c, nvme_to_user_ptr(d.addr),
>>>>+ d.data_len, nvme_to_user_ptr(d.metadata),
>>>>+ d.metadata_len, 0, &meta, d.timeout_ms ?
>>>>+ msecs_to_jiffies(d.timeout_ms) : 0,
>>>>+ ioucmd->cmd_op == NVME_URING_CMD_IO_VEC, 0, 0);
>>>>+ if (IS_ERR(req))
>>>>+ return PTR_ERR(req);
>>>>+
>>>>+ req->end_io = nvme_uring_cmd_end_io;
>>>>+ req->end_io_data = ioucmd;
>>>>+ pdu->bio = req->bio;
>>>>+ pdu->meta = meta;
>>>>+ req->cmd_flags |= REQ_NVME_MPATH;
>>>>+ blk_execute_rq_nowait(req, false);
>>>>+ return -EIOCBQUEUED;
>>>>+}
>>>>+
>>>>+void nvme_ioucmd_mpath_retry(struct io_uring_cmd *ioucmd)
>>>>+{
>>>>+ struct cdev *cdev = file_inode(ioucmd->file)->i_cdev;
>>>>+ struct nvme_ns_head *head = container_of(cdev, struct nvme_ns_head,
>>>>+ cdev);
>>>>+ int srcu_idx = srcu_read_lock(&head->srcu);
>>>>+ struct nvme_ns *ns = nvme_find_path(head);
>>>>+ unsigned int issue_flags = IO_URING_F_SQE128 | IO_URING_F_CQE32 |
>>>>+ IO_URING_F_MPATH;
>>>>+ struct device *dev = &head->cdev_device;
>>>>+
>>>>+ if (likely(ns)) {
>>>>+ struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd);
>>>>+ struct request *oreq = pdu->req;
>>>>+ int ret;
>>>>+
>>>>+ if (oreq == NULL) {
>>>>+ /*
>>>>+ * this was not submitted (to device) earlier. For this
>>>>+ * ioucmd->cmd points to persistent memory. Free that
>>>>+ * up post submission
>>>>+ */
>>>>+ const void *cmd = ioucmd->cmd;
>>>>+
>>>>+ ret = nvme_ns_uring_cmd(ns, ioucmd, issue_flags);
>>>>+ kfree(cmd);
>>>>+ } else {
>>>>+ /*
>>>>+ * this was submitted (to device) earlier. Use old
>>>>+ * request, bio (if it exists) and nvme-pdu to recreate
>>>>+ * the command for the discovered path
>>>>+ */
>>>>+ ret = nvme_uring_cmd_io_retry(ns, oreq, ioucmd, pdu);
>>>
>>>Why is this needed? Why is reuse important here? Why not always call
>>>nvme_ns_uring_cmd?
>>
>>Please see the previous explanation.
>>If condition is for the case when we made the passthrough command stable
>>by allocating beforehand.
>>Else is for the case when we avoided taking that cost.
>
>The current design of the multipath failover code is clean:
>1. extract bio(s) from request
>2. link in requeue_list
>3. schedule requeue_work that,
> 3.1 takes bios 1-by-1, and submits them again (exactly the same way)
It is really clean, and fits really well with the bio-based entry
interface. But as I said earlier, it does not go well with the
uring-cmd based entry interface, and a bunch of other things need to be
done differently for a generic passthrough command.
>I'd like us to try to follow the same design where retry is
>literally "do the exact same thing, again". That would eliminate
>two submission paths that do the same thing, but slightly different.
The exact same thing is possible only if we make the common path slow,
i.e. allocate/copy the passthrough command and keep it alive until
completion. But that is really not the way to go, I suppose.