Re: [PATCH for-next 4/4] nvme-multipath: add multipathing for uring-passthrough commands

public inbox for io-uring@vger.kernel.org
 help / color / mirror / Atom feed

From: Kanchan Joshi <joshi.k@samsung.com>
To: Sagi Grimberg <sagi@grimberg.me>
Cc: hch@lst.de, kbusch@kernel.org, axboe@kernel.dk,
	io-uring@vger.kernel.org, linux-nvme@lists.infradead.org,
	linux-block@vger.kernel.org, asml.silence@gmail.com,
	joshiiitr@gmail.com, anuj20.g@samsung.com, gost.dev@samsung.com
Subject: Re: [PATCH for-next 4/4] nvme-multipath: add multipathing for uring-passthrough commands
Date: Tue, 12 Jul 2022 09:53:32 +0530	[thread overview]
Message-ID: <20220712042332.GA14780@test-zns> (raw)
In-Reply-To: <5f30c7de-03b1-768a-d44f-594ed2d1dc75@grimberg.me>

[-- Attachment #1: Type: text/plain, Size: 13063 bytes --]

On Mon, Jul 11, 2022 at 10:56:56PM +0300, Sagi Grimberg wrote:
>
>>>>@@ -448,6 +442,14 @@ static int nvme_uring_cmd_io(struct 
>>>>nvme_ctrl *ctrl, struct nvme_ns *ns,
>>>>     pdu->meta_buffer = nvme_to_user_ptr(d.metadata);
>>>>     pdu->meta_len = d.metadata_len;
>>>>+    if (issue_flags & IO_URING_F_MPATH) {
>>>>+        req->cmd_flags |= REQ_NVME_MPATH;
>>>>+        /*
>>>>+         * we would need the buffer address (d.addr field) if we have
>>>>+         * to retry the command. Store it by repurposing ioucmd->cmd
>>>>+         */
>>>>+        ioucmd->cmd = (void *)d.addr;
>>>
>>>What does repurposing mean here?
>>
>>This field (ioucmd->cmd) was pointing to passthrough command (which
>>is embedded in SQE of io_uring). At this point we have consumed
>>passthrough command, so this field can be reused if we have to. And we
>>have to beceause failover needs recreating passthrough command.
>>Please see nvme_uring_cmd_io_retry to see how this helps in recreating
>>the fields of passthrough command. And more on this below.
>
>so ioucmd->cmd starts as an nvme_uring_cmd pointer and continues as
>an address of buffer for a later processing that may or may not
>happen in its lifetime?

Yes. See this as a no-cost way to handle fast/common path. We manage by
doing just this assignment.
Otherwise this would have involved doing high-cost (allocate/copy/deferral)
stuff for later processing that may or may not happen.

>Sounds a bit of a backwards design to me.
>
>>>>+    }
>>>>     blk_execute_rq_nowait(req, false);
>>>>     return -EIOCBQUEUED;
>>>> }
>>>>@@ -665,12 +667,135 @@ int nvme_ns_head_chr_uring_cmd(struct 
>>>>io_uring_cmd *ioucmd,
>>>>     int srcu_idx = srcu_read_lock(&head->srcu);
>>>>     struct nvme_ns *ns = nvme_find_path(head);
>>>>     int ret = -EINVAL;
>>>>+    struct device *dev = &head->cdev_device;
>>>>+
>>>>+    if (likely(ns)) {
>>>>+        ret = nvme_ns_uring_cmd(ns, ioucmd,
>>>>+                issue_flags | IO_URING_F_MPATH);
>>>>+    } else if (nvme_available_path(head)) {
>>>>+        struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd);
>>>>+        struct nvme_uring_cmd *payload = NULL;
>>>>+
>>>>+        dev_warn_ratelimited(dev, "no usable path - requeuing I/O\n");
>>>>+        /*
>>>>+         * We may requeue two kinds of uring commands:
>>>>+         * 1. command failed post submission. pdu->req will be non-null
>>>>+         * for this
>>>>+         * 2. command that could not be submitted because path was not
>>>>+         * available. For this pdu->req is set to NULL.
>>>>+         */
>>>>+        pdu->req = NULL;
>>>
>>>Relying on a pointer does not sound like a good idea to me.
>>>But why do you care at all where did this command came from?
>>>This code should not concern itself what happened prior to this
>>>execution.
>>Required, please see nvme_uring_cmd_io_retry. And this should be more
>>clear as part of responses to your other questions.
>
>I think I understand. But it looks fragile to me.
>
>>
>>>>+        /*
>>>>+         * passthrough command will not be available during retry as it
>>>>+         * is embedded in io_uring's SQE. Hence we allocate/copy here
>>>>+         */
>>>
>>>OK, that is a nice solution.
>>Please note that prefered way is to recreate the passthrough command,
>>and not to allocate it. We allocate it here because this happens very early
>>(i.e. before processing passthrough command and setting that up inside
>>struct request). Recreating requires a populated 'struct request' .
>
>I think that once a driver consumes a command as queued, it needs a
>stable copy of the command (could be in a percpu pool).
>
>The current design of nvme multipathing is that the request is stripped
>from its bios and placed on a requeue_list, and if a request was not
>formed yet (i.e. nvme available path exist but no usable) the bio is
>placed on the requeue_list.
>
>So ideally we have something sufficient like a bio, that can be linked
>on a requeue list, and is sufficient to build a request, at any point in
>time...

we could be dealing with any passthrough command here. bio is not
guranteed to exist in first place. It can very well be NULL.
As I mentioned in cover letter, this was tested for such passthrough
commands too.
And bio is not a good fit for this interface. For block-path, entry
function is nvme_ns_head_submit_bio() which speaks bio. Request is
allocated afterwards. But since request formation can happen only after
discovering a valid path, it may have to queue the bio if path does not
exist.
For passthrough, interface speaks/understands struct io_uring_cmd.
Request is allocated after. And bio may (or may not) be attached with
this request. It may have to queue the command even before request/bio
gets formed. So cardinal structure (which is
really available all the time) in this case is struct io_uring_cmd.
Hence we use that as the object around which requeuing/failover is
built.
Conceptually io_uring_cmd is the bio of this interface.

>>>>+        payload = kmalloc(sizeof(struct nvme_uring_cmd), GFP_KERNEL);
>>>>+        if (!payload) {
>>>>+            ret = -ENOMEM;
>>>>+            goto out_unlock;
>>>>+        }
>>>>+        memcpy(payload, ioucmd->cmd, sizeof(struct nvme_uring_cmd));
>>>>+        ioucmd->cmd = payload;
>>>>-    if (ns)
>>>>-        ret = nvme_ns_uring_cmd(ns, ioucmd, issue_flags);
>>>>+        spin_lock_irq(&head->requeue_ioucmd_lock);
>>>>+        ioucmd_list_add(&head->requeue_ioucmd_list, ioucmd);
>>>>+        spin_unlock_irq(&head->requeue_ioucmd_lock);
>>>>+        ret = -EIOCBQUEUED;
>>>>+    } else {
>>>>+        dev_warn_ratelimited(dev, "no available path - failing I/O\n");
>>>
>>>ret=-EIO ?
>>Did not do as it was initialized to -EINVAL. Do you prefer -EIO instead.
>
>It is not an invalid argument error here, it is essentially an IO error.
>So yes, that would be my preference.

sure, will change.

>>>>+    }
>>>>+out_unlock:
>>>>     srcu_read_unlock(&head->srcu, srcu_idx);
>>>>     return ret;
>>>> }
>>>>+
>>>>+int nvme_uring_cmd_io_retry(struct nvme_ns *ns, struct request *oreq,
>>>>+        struct io_uring_cmd *ioucmd, struct nvme_uring_cmd_pdu *pdu)
>>>>+{
>>>>+    struct nvme_ctrl *ctrl = ns->ctrl;
>>>>+    struct request_queue *q = ns ? ns->queue : ctrl->admin_q;
>>>>+    struct nvme_uring_data d;
>>>>+    struct nvme_command c;
>>>>+    struct request *req;
>>>>+    struct bio *obio = oreq->bio;
>>>>+    void *meta = NULL;
>>>>+
>>>>+    memcpy(&c, nvme_req(oreq)->cmd, sizeof(struct nvme_command));
>>>>+    d.metadata = (__u64)pdu->meta_buffer;
>>>>+    d.metadata_len = pdu->meta_len;
>>>>+    d.timeout_ms = oreq->timeout;
>>>>+    d.addr = (__u64)ioucmd->cmd;
>>>>+    if (obio) {
>>>>+        d.data_len = obio->bi_iter.bi_size;
>>>>+        blk_rq_unmap_user(obio);
>>>>+    } else {
>>>>+        d.data_len = 0;
>>>>+    }
>>>>+    blk_mq_free_request(oreq);
>>>
>>>The way I would do this that in nvme_ioucmd_failover_req (or in the
>>>retry driven from command retriable failure) I would do the above,
>>>requeue it and kick the requeue work, to go over the requeue_list and
>>>just execute them again. Not sure why you even need an explicit retry
>>>code.
>>During retry we need passthrough command. But passthrough command is not
>>stable (i.e. valid only during first submission). We can make it stable
>>either by:
>>(a) allocating in nvme (b) return -EAGAIN to io_uring, and it will 
>>do allocate + deferral
>>Both add a cost. And since any command can potentially fail, that
>>means taking that cost for every IO that we issue on mpath node. Even if
>>no failure (initial or subsquent after IO) occcured.
>
>As mentioned, I think that if a driver consumes a command as queued,
>it needs a stable copy for a later reformation of the request for
>failover purposes.

So what do you propose to make that stable?
As I mentioned earlier, stable copy requires allocating/copying in fast
path. And for a condition (failover) that may not even occur.
I really think currrent solution is much better as it does not try to make
it stable. Rather it assembles pieces of passthrough command if retry
(which is rare) happens.

>>So to avoid commmon-path cost, we go about doing nothing (no allocation,
>>no deferral) in the outset and choose to recreate the passthrough
>>command if failure occured. Hope this explains the purpose of
>>nvme_uring_cmd_io_retry?
>
>I think it does, but I'm still having a hard time with it...
>
Maybe I am reiterating but few main differences that should help -

- io_uring_cmd is at the forefront, and bio/request as secondary
objects. Former is persistent object across requeue attempts and the
only thing available when we discover the path, while other two are
created every time we retry.

- Receiving bio from upper layer is a luxury that we do not have for
  passthrough. When we receive bio, pages are already mapped and we
  do not have to deal with user-specific fields, so there is more
  freedom in using arbitrary context (workers etc.). But passthrough
  command continues to point to user-space fields/buffers, so we need
  that task context.

>>
>>>>+    req = nvme_alloc_user_request(q, &c, nvme_to_user_ptr(d.addr),
>>>>+            d.data_len, nvme_to_user_ptr(d.metadata),
>>>>+            d.metadata_len, 0, &meta, d.timeout_ms ?
>>>>+            msecs_to_jiffies(d.timeout_ms) : 0,
>>>>+            ioucmd->cmd_op == NVME_URING_CMD_IO_VEC, 0, 0);
>>>>+    if (IS_ERR(req))
>>>>+        return PTR_ERR(req);
>>>>+
>>>>+    req->end_io = nvme_uring_cmd_end_io;
>>>>+    req->end_io_data = ioucmd;
>>>>+    pdu->bio = req->bio;
>>>>+    pdu->meta = meta;
>>>>+    req->cmd_flags |= REQ_NVME_MPATH;
>>>>+    blk_execute_rq_nowait(req, false);
>>>>+    return -EIOCBQUEUED;
>>>>+}
>>>>+
>>>>+void nvme_ioucmd_mpath_retry(struct io_uring_cmd *ioucmd)
>>>>+{
>>>>+    struct cdev *cdev = file_inode(ioucmd->file)->i_cdev;
>>>>+    struct nvme_ns_head *head = container_of(cdev, struct nvme_ns_head,
>>>>+            cdev);
>>>>+    int srcu_idx = srcu_read_lock(&head->srcu);
>>>>+    struct nvme_ns *ns = nvme_find_path(head);
>>>>+    unsigned int issue_flags = IO_URING_F_SQE128 | IO_URING_F_CQE32 |
>>>>+        IO_URING_F_MPATH;
>>>>+    struct device *dev = &head->cdev_device;
>>>>+
>>>>+    if (likely(ns)) {
>>>>+        struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd);
>>>>+        struct request *oreq = pdu->req;
>>>>+        int ret;
>>>>+
>>>>+        if (oreq == NULL) {
>>>>+            /*
>>>>+             * this was not submitted (to device) earlier. For this
>>>>+             * ioucmd->cmd points to persistent memory. Free that
>>>>+             * up post submission
>>>>+             */
>>>>+            const void *cmd = ioucmd->cmd;
>>>>+
>>>>+            ret = nvme_ns_uring_cmd(ns, ioucmd, issue_flags);
>>>>+            kfree(cmd);
>>>>+        } else {
>>>>+            /*
>>>>+             * this was submitted (to device) earlier. Use old
>>>>+             * request, bio (if it exists) and nvme-pdu to recreate
>>>>+             * the command for the discovered path
>>>>+             */
>>>>+            ret = nvme_uring_cmd_io_retry(ns, oreq, ioucmd, pdu);
>>>
>>>Why is this needed? Why is reuse important here? Why not always call
>>>nvme_ns_uring_cmd?
>>
>>Please see the previous explanation.
>>If condition is for the case when we made the passthrough command stable
>>by allocating beforehand.
>>Else is for the case when we avoided taking that cost.
>
>The current design of the multipath failover code is clean:
>1. extract bio(s) from request
>2. link in requeue_list
>3. schedule requeue_work that,
> 3.1 takes bios 1-by-1, and submits them again (exactly the same way)

It is really clean, and fits really well with bio based entry interface.
But as I said earlier, it does not go well with uring-cmd based entry
interface, and bunch of of other things which needs to be done
differently for generic passthrough command.

>I'd like us to try to follow the same design where retry is
>literally "do the exact same thing, again". That would eliminate
>two submission paths that do the same thing, but slightly different.

Exact same thing is possible if we make the common path slow i.e.
allocate/copy passthrough command and keep it alive until completion.
But that is really not the way to go I suppose.


[-- Attachment #2: Type: text/plain, Size: 0 bytes --]

next prev parent reply	other threads:[~2022-07-12  4:29 UTC|newest]

Thread overview: 52+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <CGME20220711110753epcas5p4169b9e288d15ca35740dbb66a6f6983a@epcas5p4.samsung.com>
2022-07-11 11:01 ` [PATCH for-next 0/4] nvme-multipathing for uring-passthrough Kanchan Joshi
     [not found]   ` <CGME20220711110800epcas5p3d338dd486fd778c5ba5bfe93a91ec8bd@epcas5p3.samsung.com>
2022-07-11 11:01     ` [PATCH for-next 1/4] io_uring, nvme: rename a function Kanchan Joshi
2022-07-14 13:55       ` Ming Lei
     [not found]   ` <CGME20220711110812epcas5p33aa90b23aa62fb11722aa8195754becf@epcas5p3.samsung.com>
2022-07-11 11:01     ` [PATCH for-next 2/4] nvme: compact nvme_uring_cmd_pdu struct Kanchan Joshi
2022-07-12  6:32       ` Christoph Hellwig
     [not found]   ` <CGME20220711110824epcas5p22c8e945cb8c3c3ac46c8c2b5ab55db9b@epcas5p2.samsung.com>
2022-07-11 11:01     ` [PATCH for-next 3/4] io_uring: grow a field in struct io_uring_cmd Kanchan Joshi
2022-07-11 17:00       ` Sagi Grimberg
2022-07-11 17:19         ` Jens Axboe
2022-07-11 17:18       ` Jens Axboe
2022-07-11 17:55         ` Sagi Grimberg
2022-07-11 18:22           ` Sagi Grimberg
2022-07-11 18:24             ` Jens Axboe
2022-07-11 18:58               ` Sagi Grimberg
2022-07-12 11:40               ` Kanchan Joshi
2022-07-14  3:40             ` Ming Lei
2022-07-14  8:19               ` Kanchan Joshi
2022-07-14 15:30                 ` Daniel Wagner
2022-07-15 11:07                   ` Kanchan Joshi
2022-07-18  9:03                     ` Daniel Wagner
     [not found]   ` <CGME20220711110827epcas5p3fd81f142f55ca3048abc38a9ef0d0089@epcas5p3.samsung.com>
2022-07-11 11:01     ` [PATCH for-next 4/4] nvme-multipath: add multipathing for uring-passthrough commands Kanchan Joshi
2022-07-11 13:51       ` Sagi Grimberg
2022-07-11 15:12         ` Stefan Metzmacher
2022-07-11 16:58           ` Sagi Grimberg
2022-07-11 18:54           ` Kanchan Joshi
2022-07-11 18:37         ` Kanchan Joshi
2022-07-11 19:56           ` Sagi Grimberg
2022-07-12  4:23             ` Kanchan Joshi [this message]
2022-07-12 21:26               ` Sagi Grimberg
2022-07-13  5:37                 ` Kanchan Joshi
2022-07-13  9:03                   ` Sagi Grimberg
2022-07-13 11:28                     ` Kanchan Joshi
2022-07-13 12:17                       ` Sagi Grimberg
2022-07-14 15:14                   ` Ming Lei
2022-07-14 23:05                     ` Kanchan Joshi
2022-07-15  1:35                       ` Ming Lei
2022-07-15  1:46                         ` Ming Lei
2022-07-15  4:24                           ` Kanchan Joshi
2022-07-12  6:52       ` Christoph Hellwig
2022-07-12 11:33         ` Kanchan Joshi
2022-07-12 20:13         ` Sagi Grimberg
2022-07-13  5:36           ` Christoph Hellwig
2022-07-13  8:04             ` Sagi Grimberg
2022-07-13 10:12               ` Christoph Hellwig
2022-07-13 11:00                 ` Sagi Grimberg
2022-07-13 11:28                   ` Christoph Hellwig
2022-07-13 12:16                     ` Sagi Grimberg
2022-07-13 11:49                   ` Hannes Reinecke
2022-07-13 12:43                     ` Sagi Grimberg
2022-07-13 13:30                       ` Hannes Reinecke
2022-07-13 13:41                         ` Sagi Grimberg
2022-07-13 14:07                           ` Hannes Reinecke
2022-07-13 15:59                             ` Sagi Grimberg

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220712042332.GA14780@test-zns \
    --to=joshi.k@samsung.com \
    --cc=anuj20.g@samsung.com \
    --cc=asml.silence@gmail.com \
    --cc=axboe@kernel.dk \
    --cc=gost.dev@samsung.com \
    --cc=hch@lst.de \
    --cc=io-uring@vger.kernel.org \
    --cc=joshiiitr@gmail.com \
    --cc=kbusch@kernel.org \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-nvme@lists.infradead.org \
    --cc=sagi@grimberg.me \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox