On Fri, Mar 11, 2022 at 08:01:48AM +0100, Christoph Hellwig wrote:
>On Tue, Mar 08, 2022 at 08:50:53PM +0530, Kanchan Joshi wrote:
>> +/*
>> + * This overlays struct io_uring_cmd pdu.
>> + * Expect build errors if this grows larger than that.
>> + */
>> +struct nvme_uring_cmd_pdu {
>> +        u32 meta_len;
>> +        union {
>> +                struct bio *bio;
>> +                struct request *req;
>> +        };
>> +        void *meta; /* kernel-resident buffer */
>> +        void __user *meta_buffer;
>> +} __packed;
>
>Why is this marked __packed?

I did not like doing it, but had to. If it is not packed, the struct takes
32 bytes, while the driver pdu in struct io_uring_cmd can take 30 bytes at
most. Packing the nvme pdu brought it down to 28 bytes, which fits and gives
2 bytes back.

For quick reference -

struct io_uring_cmd {
        struct file *              file;                 /*     0     8 */
        void *                     cmd;                  /*     8     8 */
        union {
                void *             bio;                  /*    16     8 */
                void               (*driver_cb)(struct io_uring_cmd *); /*    16     8 */
        };                                               /*    16     8 */
        u32                        flags;                /*    24     4 */
        u32                        cmd_op;               /*    28     4 */
        u16                        cmd_len;              /*    32     2 */
        u16                        unused;               /*    34     2 */
        u8                         pdu[28];              /*    36    28 */

        /* size: 64, cachelines: 1, members: 8 */
};

The io_uring_cmd struct goes into the first cacheline of io_kiocb. The last
field is pdu, taking 28 bytes; it would be 30 if I dropped the unused field
above.

The nvme pdu after packing:

struct nvme_uring_cmd_pdu {
        u32                        meta_len;             /*     0     4 */
        union {
                struct bio *       bio;                  /*     4     8 */
                struct request *   req;                  /*     4     8 */
        };                                               /*     4     8 */
        void *                     meta;                 /*    12     8 */
        void *                     meta_buffer;          /*    20     8 */

        /* size: 28, cachelines: 1, members: 4 */
        /* last cacheline: 28 bytes */
} __attribute__((__packed__));
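For illustration only (not part of the patch), here is a standalone sketch
of where the extra 4 bytes in the unpacked layout come from: the 8-byte
union that follows the u32 needs 4 bytes of alignment padding unless the
struct is packed. It uses userspace types, plain void pointers standing in
for struct bio / struct request, and assumes 64-bit pointers.

#include <assert.h>
#include <stdint.h>

struct pdu_unpacked {
        uint32_t meta_len;
        union {
                void *bio;              /* stand-in for struct bio * */
                void *req;              /* stand-in for struct request * */
        };
        void *meta;
        void *meta_buffer;
};

struct pdu_packed {
        uint32_t meta_len;
        union {
                void *bio;
                void *req;
        };
        void *meta;
        void *meta_buffer;
} __attribute__((__packed__));

/* 4-byte u32 + 4 bytes of padding + three 8-byte pointers */
static_assert(sizeof(struct pdu_unpacked) == 32, "unpacked layout");
/* padding removed, so the overlay fits into the 28-byte pdu area */
static_assert(sizeof(struct pdu_packed) == 28, "packed layout");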
>In general I'd be much more happy if the meta elelements were a
>io_uring-level feature handled outside the driver and typesafe in
>struct io_uring_cmd, with just a normal private data pointer for the
>actual user, which would remove all the crazy casting.

Not sure if I got your point.

+static struct nvme_uring_cmd_pdu *nvme_uring_cmd_pdu(struct io_uring_cmd *ioucmd)
+{
+        return (struct nvme_uring_cmd_pdu *)&ioucmd->pdu;
+}
+
+static void nvme_pt_task_cb(struct io_uring_cmd *ioucmd)
+{
+        struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd);

Do you mean the casting inside nvme_uring_cmd_pdu()? Somehow this looks sane
to me (perhaps because it used to be crazier earlier).

And on moving the meta elements outside the driver, my worry is that it
reduces the scope of the uring-cmd infrastructure and makes it nvme-passthru
specific. At this point uring-cmd is still a generic async ioctl/fsctl
facility which may find other users (than nvme passthru) down the line. The
fields within "struct io_uring_cmd" are organized around the rule that a
field is kept out of the 28-byte pdu only if it is accessed by both io_uring
and the driver.

>> +static void nvme_end_async_pt(struct request *req, blk_status_t err)
>> +{
>> +        struct io_uring_cmd *ioucmd = req->end_io_data;
>> +        struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd);
>> +        /* extract bio before reusing the same field for request */
>> +        struct bio *bio = pdu->bio;
>> +
>> +        pdu->req = req;
>> +        req->bio = bio;
>> +        /* this takes care of setting up task-work */
>> +        io_uring_cmd_complete_in_task(ioucmd, nvme_pt_task_cb);
>
>This is a bit silly. First we defer the actual request I/O completion
>from the block layer to a different CPU or softirq and then we have
>another callback here. I think it would be much more useful if we
>could find a way to enhance blk_mq_complete_request so that it could
>directly complet in a given task. That would also be really nice for
>say normal synchronous direct I/O.

I see, so there is room for adding some efficiency. I hope it is OK if I
carry this out as a separate effort; since it means touching
blk_mq_complete_request at its heart, and also improving sync direct I/O,
it does not seem the best fit for this series and would slow it down.

FWIW, I ran the tests with counters inside blk_mq_complete_request_remote():

        if (blk_mq_complete_need_ipi(rq)) {
                blk_mq_complete_send_ipi(rq);
                return true;
        }

        if (rq->q->nr_hw_queues == 1) {
                blk_mq_raise_softirq(rq);
                return true;
        }

Deferring by IPI or softirq never occurred, neither for block nor for char.
The softirq part is obvious, since I was not running against scsi (or nvme
with a single queue). So I could not spot whether this is really an
overhead, at least for nvme.

>> +        if (ioucmd) { /* async dispatch */
>> +                if (cmd->common.opcode == nvme_cmd_write ||
>> +                                cmd->common.opcode == nvme_cmd_read) {
>
>No we can't just check for specific commands in the passthrough handler.

Right. This is for the inline-cmd approach; the last two patches of the
series undo it (for indirect-cmd). I will do something about it.

>> +                        nvme_setup_uring_cmd_data(req, ioucmd, meta, meta_buffer,
>> +                                        meta_len);
>> +                        blk_execute_rq_nowait(req, 0, nvme_end_async_pt);
>> +                        return 0;
>> +                } else {
>> +                        /* support only read and write for now. */
>> +                        ret = -EINVAL;
>> +                        goto out_meta;
>> +                }
>
>Pleae always handle error in the first branch and don't bother with an
>else after a goto or return.

Yes, that'll be better.

>> +static int nvme_ns_async_ioctl(struct nvme_ns *ns, struct io_uring_cmd *ioucmd)
>> +{
>> +        int ret;
>> +
>> +        BUILD_BUG_ON(sizeof(struct nvme_uring_cmd_pdu) > sizeof(ioucmd->pdu));
>> +
>> +        switch (ioucmd->cmd_op) {
>> +        case NVME_IOCTL_IO64_CMD:
>> +                ret = nvme_user_cmd64(ns->ctrl, ns, NULL, ioucmd);
>> +                break;
>> +        default:
>> +                ret = -ENOTTY;
>> +        }
>> +
>> +        if (ret >= 0)
>> +                ret = -EIOCBQUEUED;
>
>That's a weird way to handle the returns. Just return -EIOCBQUEUED
>directly from the handler (which as said before should be split from
>the ioctl handler anyway).

Indeed. That will make it cleaner.
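One possible reading of that suggestion, only as a rough sketch and reusing
the names quoted above: the async variant split out of the ioctl path would
itself return -EIOCBQUEUED once the request is dispatched, so the switch can
return directly and the ret >= 0 remapping goes away.

static int nvme_ns_async_ioctl(struct nvme_ns *ns, struct io_uring_cmd *ioucmd)
{
        BUILD_BUG_ON(sizeof(struct nvme_uring_cmd_pdu) > sizeof(ioucmd->pdu));

        switch (ioucmd->cmd_op) {
        case NVME_IOCTL_IO64_CMD:
                /*
                 * the split-off async handler returns -EIOCBQUEUED on
                 * successful submission and a negative error otherwise
                 */
                return nvme_user_cmd64(ns->ctrl, ns, NULL, ioucmd);
        default:
                return -ENOTTY;
        }
}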