* [PATCH v4 0/5] io_uring passthrough for nvme
[not found] <CGME20220505061142epcas5p2c943572766bfd5088138fe0f7873c96c@epcas5p2.samsung.com>
@ 2022-05-05 6:06 ` Kanchan Joshi
[not found] ` <CGME20220505061144epcas5p3821a9516dad2b5eff5a25c56dbe164df@epcas5p3.samsung.com>
` (5 more replies)
0 siblings, 6 replies; 45+ messages in thread
From: Kanchan Joshi @ 2022-05-05 6:06 UTC (permalink / raw)
To: axboe, hch
Cc: io-uring, linux-nvme, asml.silence, ming.lei, mcgrof, shr,
joshiiitr, anuj20.g, gost.dev
This iteration is against the io_uring-big-sqe branch (linux-block).
On top of a739b2354 ("io_uring: enable CQE32").
fio testing branch:
https://github.com/joshkan/fio/tree/big-cqe-pt.v4
Changes since v3:
https://lore.kernel.org/linux-nvme/[email protected]/
- Cleaned up placements of new fields in sqe and io_uring_cmd
- Removed packed-attribute from nvme_uring_cmd struct
- Addressed all other feedback from Christoph
- Addressed feedback from Jens
- Collected Reviewed-by tags
Changes since v2:
https://lore.kernel.org/linux-nvme/[email protected]/
- Rewire uring-cmd infrastructure on top of new big CQE
- Prep patch (refactored) and feedback from Christoph
- Add new opcode and structure in nvme for uring-cmd
- Enable vectored-io
Changes since v1:
https://lore.kernel.org/linux-nvme/[email protected]/
- Trim down by removing patches for polling, fixed-buffer and bio-cache
- Add big CQE and move uring-cmd to use that
- Remove indirect (pointer) submission
Anuj Gupta (1):
nvme: add vectored-io support for uring-cmd
Christoph Hellwig (1):
nvme: refactor nvme_submit_user_cmd()
Jens Axboe (2):
fs,io_uring: add infrastructure for uring-cmd
block: wire-up support for passthrough plugging
Kanchan Joshi (1):
nvme: wire-up uring-cmd support for io-passthru on char-device.
block/blk-mq.c | 73 ++++++-----
drivers/nvme/host/core.c | 1 +
drivers/nvme/host/ioctl.c | 215 +++++++++++++++++++++++++++++++-
drivers/nvme/host/multipath.c | 1 +
drivers/nvme/host/nvme.h | 5 +
fs/io_uring.c | 82 ++++++++++--
include/linux/fs.h | 2 +
include/linux/io_uring.h | 27 ++++
include/uapi/linux/io_uring.h | 21 ++--
include/uapi/linux/nvme_ioctl.h | 26 ++++
10 files changed, 398 insertions(+), 55 deletions(-)
--
2.25.1
^ permalink raw reply [flat|nested] 45+ messages in thread
* [PATCH v4 1/5] fs,io_uring: add infrastructure for uring-cmd
[not found] ` <CGME20220505061144epcas5p3821a9516dad2b5eff5a25c56dbe164df@epcas5p3.samsung.com>
@ 2022-05-05 6:06 ` Kanchan Joshi
2022-05-05 12:52 ` Jens Axboe
` (2 more replies)
0 siblings, 3 replies; 45+ messages in thread
From: Kanchan Joshi @ 2022-05-05 6:06 UTC (permalink / raw)
To: axboe, hch
Cc: io-uring, linux-nvme, asml.silence, ming.lei, mcgrof, shr,
joshiiitr, anuj20.g, gost.dev
From: Jens Axboe <[email protected]>
file_operations->uring_cmd is a file private handler.
This is somewhat similar to ioctl but hopefully a lot more sane and
useful as it can be used to enable many io_uring capabilities for the
underlying operation.
IORING_OP_URING_CMD is a file private kind of request. io_uring doesn't
know what is in this command type, it's for the provider of ->uring_cmd()
to deal with. This operation can be issued only on the ring that is
setup with both IORING_SETUP_SQE128 and IORING_SETUP_CQE32 flags.
Co-developed-by: Kanchan Joshi <[email protected]>
Signed-off-by: Kanchan Joshi <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
---
fs/io_uring.c | 82 ++++++++++++++++++++++++++++++++---
include/linux/fs.h | 2 +
include/linux/io_uring.h | 27 ++++++++++++
include/uapi/linux/io_uring.h | 21 +++++----
4 files changed, 117 insertions(+), 15 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index c7e3f7e74d92..3436507f04eb 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -202,13 +202,6 @@ struct io_rings {
struct io_uring_cqe cqes[] ____cacheline_aligned_in_smp;
};
-enum io_uring_cmd_flags {
- IO_URING_F_COMPLETE_DEFER = 1,
- IO_URING_F_UNLOCKED = 2,
- /* int's last bit, sign checks are usually faster than a bit test */
- IO_URING_F_NONBLOCK = INT_MIN,
-};
-
struct io_mapped_ubuf {
u64 ubuf;
u64 ubuf_end;
@@ -969,6 +962,7 @@ struct io_kiocb {
struct io_xattr xattr;
struct io_socket sock;
struct io_nop nop;
+ struct io_uring_cmd uring_cmd;
};
u8 opcode;
@@ -1254,6 +1248,10 @@ static const struct io_op_def io_op_defs[] = {
[IORING_OP_SOCKET] = {
.audit_skip = 1,
},
+ [IORING_OP_URING_CMD] = {
+ .needs_file = 1,
+ .plug = 1,
+ },
};
/* requests with any of those set should undergo io_disarm_next() */
@@ -1393,6 +1391,8 @@ const char *io_uring_get_opcode(u8 opcode)
return "GETXATTR";
case IORING_OP_SOCKET:
return "SOCKET";
+ case IORING_OP_URING_CMD:
+ return "URING_CMD";
case IORING_OP_LAST:
return "INVALID";
}
@@ -4907,6 +4907,67 @@ static int io_linkat(struct io_kiocb *req, unsigned int issue_flags)
return 0;
}
+static void io_uring_cmd_work(struct io_kiocb *req, bool *locked)
+{
+ req->uring_cmd.task_work_cb(&req->uring_cmd);
+}
+
+void io_uring_cmd_complete_in_task(struct io_uring_cmd *ioucmd,
+ void (*task_work_cb)(struct io_uring_cmd *))
+{
+ struct io_kiocb *req = container_of(ioucmd, struct io_kiocb, uring_cmd);
+
+ req->uring_cmd.task_work_cb = task_work_cb;
+ req->io_task_work.func = io_uring_cmd_work;
+ io_req_task_work_add(req, !!(req->ctx->flags & IORING_SETUP_SQPOLL));
+}
+EXPORT_SYMBOL_GPL(io_uring_cmd_complete_in_task);
+
+/*
+ * Called by consumers of io_uring_cmd, if they originally returned
+ * -EIOCBQUEUED upon receiving the command.
+ */
+void io_uring_cmd_done(struct io_uring_cmd *ioucmd, ssize_t ret, ssize_t res2)
+{
+ struct io_kiocb *req = container_of(ioucmd, struct io_kiocb, uring_cmd);
+
+ if (ret < 0)
+ req_set_fail(req);
+ __io_req_complete32(req, 0, ret, 0, res2, 0);
+}
+EXPORT_SYMBOL_GPL(io_uring_cmd_done);
+
+static int io_uring_cmd_prep(struct io_kiocb *req,
+ const struct io_uring_sqe *sqe)
+{
+ struct io_uring_cmd *ioucmd = &req->uring_cmd;
+ struct io_ring_ctx *ctx = req->ctx;
+
+ if (ctx->flags & IORING_SETUP_IOPOLL)
+ return -EOPNOTSUPP;
+ /* do not support uring-cmd without big SQE/CQE */
+ if (!(ctx->flags & IORING_SETUP_SQE128))
+ return -EOPNOTSUPP;
+ if (!(ctx->flags & IORING_SETUP_CQE32))
+ return -EOPNOTSUPP;
+ if (sqe->ioprio || sqe->rw_flags)
+ return -EINVAL;
+ ioucmd->cmd = sqe->cmd;
+ ioucmd->cmd_op = READ_ONCE(sqe->cmd_op);
+ return 0;
+}
+
+static int io_uring_cmd(struct io_kiocb *req, unsigned int issue_flags)
+{
+ struct file *file = req->file;
+ struct io_uring_cmd *ioucmd = &req->uring_cmd;
+
+ if (!req->file->f_op->uring_cmd)
+ return -EOPNOTSUPP;
+ file->f_op->uring_cmd(ioucmd, issue_flags);
+ return 0;
+}
+
static int io_shutdown_prep(struct io_kiocb *req,
const struct io_uring_sqe *sqe)
{
@@ -7764,6 +7825,8 @@ static int io_req_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
return io_getxattr_prep(req, sqe);
case IORING_OP_SOCKET:
return io_socket_prep(req, sqe);
+ case IORING_OP_URING_CMD:
+ return io_uring_cmd_prep(req, sqe);
}
printk_once(KERN_WARNING "io_uring: unhandled opcode %d\n",
@@ -8085,6 +8148,9 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
case IORING_OP_SOCKET:
ret = io_socket(req, issue_flags);
break;
+ case IORING_OP_URING_CMD:
+ ret = io_uring_cmd(req, issue_flags);
+ break;
default:
ret = -EINVAL;
break;
@@ -12688,6 +12754,8 @@ static int __init io_uring_init(void)
BUILD_BUG_ON(ARRAY_SIZE(io_op_defs) != IORING_OP_LAST);
BUILD_BUG_ON(__REQ_F_LAST_BIT > 8 * sizeof(int));
+ BUILD_BUG_ON(sizeof(struct io_uring_cmd) > 64);
+
req_cachep = KMEM_CACHE(io_kiocb, SLAB_HWCACHE_ALIGN | SLAB_PANIC |
SLAB_ACCOUNT);
return 0;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index bbde95387a23..30c98d9993df 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1953,6 +1953,7 @@ struct dir_context {
#define REMAP_FILE_ADVISORY (REMAP_FILE_CAN_SHORTEN)
struct iov_iter;
+struct io_uring_cmd;
struct file_operations {
struct module *owner;
@@ -1995,6 +1996,7 @@ struct file_operations {
struct file *file_out, loff_t pos_out,
loff_t len, unsigned int remap_flags);
int (*fadvise)(struct file *, loff_t, loff_t, int);
+ void (*uring_cmd)(struct io_uring_cmd *ioucmd, unsigned int issue_flags);
} __randomize_layout;
struct inode_operations {
diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
index 24651c229ed2..e4cbc80949ce 100644
--- a/include/linux/io_uring.h
+++ b/include/linux/io_uring.h
@@ -5,7 +5,26 @@
#include <linux/sched.h>
#include <linux/xarray.h>
+enum io_uring_cmd_flags {
+ IO_URING_F_COMPLETE_DEFER = 1,
+ IO_URING_F_UNLOCKED = 2,
+ /* int's last bit, sign checks are usually faster than a bit test */
+ IO_URING_F_NONBLOCK = INT_MIN,
+};
+
+struct io_uring_cmd {
+ struct file *file;
+ const u8 *cmd;
+ /* callback to defer completions to task context */
+ void (*task_work_cb)(struct io_uring_cmd *cmd);
+ u32 cmd_op;
+ u8 pdu[32]; /* available inline for free use */
+};
+
#if defined(CONFIG_IO_URING)
+void io_uring_cmd_done(struct io_uring_cmd *cmd, ssize_t ret, ssize_t res2);
+void io_uring_cmd_complete_in_task(struct io_uring_cmd *ioucmd,
+ void (*task_work_cb)(struct io_uring_cmd *));
struct sock *io_uring_get_socket(struct file *file);
void __io_uring_cancel(bool cancel_all);
void __io_uring_free(struct task_struct *tsk);
@@ -30,6 +49,14 @@ static inline void io_uring_free(struct task_struct *tsk)
__io_uring_free(tsk);
}
#else
+static inline void io_uring_cmd_done(struct io_uring_cmd *cmd, ssize_t ret,
+ ssize_t ret2)
+{
+}
+static inline void io_uring_cmd_complete_in_task(struct io_uring_cmd *ioucmd,
+ void (*task_work_cb)(struct io_uring_cmd *))
+{
+}
static inline struct sock *io_uring_get_socket(struct file *file)
{
return NULL;
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 881e508767f8..92af2169ef58 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -22,6 +22,7 @@ struct io_uring_sqe {
union {
__u64 off; /* offset into file */
__u64 addr2;
+ __u32 cmd_op;
};
union {
__u64 addr; /* pointer to buffer or iovecs */
@@ -61,14 +62,17 @@ struct io_uring_sqe {
__s32 splice_fd_in;
__u32 file_index;
};
- __u64 addr3;
- __u64 __pad2[1];
-
- /*
- * If the ring is initialized with IORING_SETUP_SQE128, then this field
- * contains 64-bytes of padding, doubling the size of the SQE.
- */
- __u64 __big_sqe_pad[0];
+ union {
+ struct {
+ __u64 addr3;
+ __u64 __pad2[1];
+ };
+ /*
+ * If the ring is initialized with IORING_SETUP_SQE128, then
+ * this field is used for 80 bytes of arbitrary command data
+ */
+ __u8 cmd[0];
+ };
};
enum {
@@ -160,6 +164,7 @@ enum io_uring_op {
IORING_OP_FGETXATTR,
IORING_OP_GETXATTR,
IORING_OP_SOCKET,
+ IORING_OP_URING_CMD,
/* this goes last, obviously */
IORING_OP_LAST,
--
2.25.1
* [PATCH v4 2/5] block: wire-up support for passthrough plugging
[not found] ` <CGME20220505061146epcas5p3919c48d58d353a62a5858ee10ad162a0@epcas5p3.samsung.com>
@ 2022-05-05 6:06 ` Kanchan Joshi
2022-05-05 14:21 ` Ming Lei
0 siblings, 1 reply; 45+ messages in thread
From: Kanchan Joshi @ 2022-05-05 6:06 UTC (permalink / raw)
To: axboe, hch
Cc: io-uring, linux-nvme, asml.silence, ming.lei, mcgrof, shr,
joshiiitr, anuj20.g, gost.dev
From: Jens Axboe <[email protected]>
Add support for plugging in the passthrough path. When plugging is
enabled, requests are added to a plug instead of being dispatched to the
driver immediately. When the plug is finished, the whole batch is
dispatched via ->queue_rqs, which is more efficient than the previous
behavior of dispatching one request at a time via ->queue_rq.
Signed-off-by: Jens Axboe <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
block/blk-mq.c | 73 +++++++++++++++++++++++++++-----------------------
1 file changed, 39 insertions(+), 34 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 84d749511f55..2cf011b57cf9 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2340,6 +2340,40 @@ void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
blk_mq_hctx_mark_pending(hctx, ctx);
}
+/*
+ * Allow 2x BLK_MAX_REQUEST_COUNT requests on plug queue for multiple
+ * queues. This is important for md arrays to benefit from merging
+ * requests.
+ */
+static inline unsigned short blk_plug_max_rq_count(struct blk_plug *plug)
+{
+ if (plug->multiple_queues)
+ return BLK_MAX_REQUEST_COUNT * 2;
+ return BLK_MAX_REQUEST_COUNT;
+}
+
+static void blk_add_rq_to_plug(struct blk_plug *plug, struct request *rq)
+{
+ struct request *last = rq_list_peek(&plug->mq_list);
+
+ if (!plug->rq_count) {
+ trace_block_plug(rq->q);
+ } else if (plug->rq_count >= blk_plug_max_rq_count(plug) ||
+ (!blk_queue_nomerges(rq->q) &&
+ blk_rq_bytes(last) >= BLK_PLUG_FLUSH_SIZE)) {
+ blk_mq_flush_plug_list(plug, false);
+ trace_block_plug(rq->q);
+ }
+
+ if (!plug->multiple_queues && last && last->q != rq->q)
+ plug->multiple_queues = true;
+ if (!plug->has_elevator && (rq->rq_flags & RQF_ELV))
+ plug->has_elevator = true;
+ rq->rq_next = NULL;
+ rq_list_add(&plug->mq_list, rq);
+ plug->rq_count++;
+}
+
/**
* blk_mq_request_bypass_insert - Insert a request at dispatch list.
* @rq: Pointer to request to be inserted.
@@ -2353,7 +2387,12 @@ void blk_mq_request_bypass_insert(struct request *rq, bool at_head,
bool run_queue)
{
struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
+ struct blk_plug *plug = current->plug;
+ if (plug) {
+ blk_add_rq_to_plug(plug, rq);
+ return;
+ }
spin_lock(&hctx->lock);
if (at_head)
list_add(&rq->queuelist, &hctx->dispatch);
@@ -2676,40 +2715,6 @@ void blk_mq_try_issue_list_directly(struct blk_mq_hw_ctx *hctx,
hctx->queue->mq_ops->commit_rqs(hctx);
}
-/*
- * Allow 2x BLK_MAX_REQUEST_COUNT requests on plug queue for multiple
- * queues. This is important for md arrays to benefit from merging
- * requests.
- */
-static inline unsigned short blk_plug_max_rq_count(struct blk_plug *plug)
-{
- if (plug->multiple_queues)
- return BLK_MAX_REQUEST_COUNT * 2;
- return BLK_MAX_REQUEST_COUNT;
-}
-
-static void blk_add_rq_to_plug(struct blk_plug *plug, struct request *rq)
-{
- struct request *last = rq_list_peek(&plug->mq_list);
-
- if (!plug->rq_count) {
- trace_block_plug(rq->q);
- } else if (plug->rq_count >= blk_plug_max_rq_count(plug) ||
- (!blk_queue_nomerges(rq->q) &&
- blk_rq_bytes(last) >= BLK_PLUG_FLUSH_SIZE)) {
- blk_mq_flush_plug_list(plug, false);
- trace_block_plug(rq->q);
- }
-
- if (!plug->multiple_queues && last && last->q != rq->q)
- plug->multiple_queues = true;
- if (!plug->has_elevator && (rq->rq_flags & RQF_ELV))
- plug->has_elevator = true;
- rq->rq_next = NULL;
- rq_list_add(&plug->mq_list, rq);
- plug->rq_count++;
-}
-
static bool blk_mq_attempt_bio_merge(struct request_queue *q,
struct bio *bio, unsigned int nr_segs)
{
--
2.25.1
* [PATCH v4 3/5] nvme: refactor nvme_submit_user_cmd()
[not found] ` <CGME20220505061148epcas5p188618b5b15a95cbe48c8c1559a18c994@epcas5p1.samsung.com>
@ 2022-05-05 6:06 ` Kanchan Joshi
2022-05-05 13:30 ` Christoph Hellwig
2022-05-05 18:37 ` Clay Mayers
0 siblings, 2 replies; 45+ messages in thread
From: Kanchan Joshi @ 2022-05-05 6:06 UTC (permalink / raw)
To: axboe, hch
Cc: io-uring, linux-nvme, asml.silence, ming.lei, mcgrof, shr,
joshiiitr, anuj20.g, gost.dev
From: Christoph Hellwig <[email protected]>
Divide the work into two helpers, namely nvme_alloc_user_request and
nvme_execute_user_rq. This is a prep patch to help wire up
uring-cmd support in nvme.
Signed-off-by: Christoph Hellwig <[email protected]>
---
drivers/nvme/host/ioctl.c | 47 ++++++++++++++++++++++++++++++++++-----
1 file changed, 42 insertions(+), 5 deletions(-)
diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index 554566371ffa..3531de8073a6 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -19,6 +19,13 @@ static void __user *nvme_to_user_ptr(uintptr_t ptrval)
return (void __user *)ptrval;
}
+static inline void *nvme_meta_from_bio(struct bio *bio)
+{
+ struct bio_integrity_payload *bip = bio_integrity(bio);
+
+ return bip ? bvec_virt(bip->bip_vec) : NULL;
+}
+
static void *nvme_add_user_metadata(struct bio *bio, void __user *ubuf,
unsigned len, u32 seed, bool write)
{
@@ -53,10 +60,10 @@ static void *nvme_add_user_metadata(struct bio *bio, void __user *ubuf,
return ERR_PTR(ret);
}
-static int nvme_submit_user_cmd(struct request_queue *q,
+static struct request *nvme_alloc_user_request(struct request_queue *q,
struct nvme_command *cmd, void __user *ubuffer,
unsigned bufflen, void __user *meta_buffer, unsigned meta_len,
- u32 meta_seed, u64 *result, unsigned timeout, bool vec)
+ u32 meta_seed, unsigned timeout, bool vec)
{
bool write = nvme_is_write(cmd);
struct nvme_ns *ns = q->queuedata;
@@ -68,7 +75,7 @@ static int nvme_submit_user_cmd(struct request_queue *q,
req = blk_mq_alloc_request(q, nvme_req_op(cmd), 0);
if (IS_ERR(req))
- return PTR_ERR(req);
+ return req;
nvme_init_request(req, cmd);
if (timeout)
@@ -108,7 +115,26 @@ static int nvme_submit_user_cmd(struct request_queue *q,
}
}
+ return req;
+
+out_unmap:
+ if (bio)
+ blk_rq_unmap_user(bio);
+out:
+ blk_mq_free_request(req);
+ return ERR_PTR(ret);
+}
+
+static int nvme_execute_user_rq(struct request *req, void __user *meta_buffer,
+ unsigned meta_len, u64 *result)
+{
+ struct bio *bio = req->bio;
+ bool write = bio_op(bio) == REQ_OP_DRV_OUT;
+ int ret;
+ void *meta = nvme_meta_from_bio(bio);
+
ret = nvme_execute_passthru_rq(req);
+
if (result)
*result = le64_to_cpu(nvme_req(req)->result.u64);
if (meta && !ret && !write) {
@@ -116,14 +142,25 @@ static int nvme_submit_user_cmd(struct request_queue *q,
ret = -EFAULT;
}
kfree(meta);
- out_unmap:
if (bio)
blk_rq_unmap_user(bio);
- out:
blk_mq_free_request(req);
return ret;
}
+static int nvme_submit_user_cmd(struct request_queue *q,
+ struct nvme_command *cmd, void __user *ubuffer,
+ unsigned bufflen, void __user *meta_buffer, unsigned meta_len,
+ u32 meta_seed, u64 *result, unsigned timeout, bool vec)
+{
+ struct request *req;
+
+ req = nvme_alloc_user_request(q, cmd, ubuffer, bufflen, meta_buffer,
+ meta_len, meta_seed, timeout, vec);
+ if (IS_ERR(req))
+ return PTR_ERR(req);
+ return nvme_execute_user_rq(req, meta_buffer, meta_len, result);
+}
static int nvme_submit_io(struct nvme_ns *ns, struct nvme_user_io __user *uio)
{
--
2.25.1
* [PATCH v4 4/5] nvme: wire-up uring-cmd support for io-passthru on char-device.
[not found] ` <CGME20220505061150epcas5p2b60880c541a4b2f144c348834c7cbf0b@epcas5p2.samsung.com>
@ 2022-05-05 6:06 ` Kanchan Joshi
2022-05-05 13:33 ` Christoph Hellwig
2022-05-05 13:38 ` Jens Axboe
0 siblings, 2 replies; 45+ messages in thread
From: Kanchan Joshi @ 2022-05-05 6:06 UTC (permalink / raw)
To: axboe, hch
Cc: io-uring, linux-nvme, asml.silence, ming.lei, mcgrof, shr,
joshiiitr, anuj20.g, gost.dev
Introduce a handler for fops->uring_cmd(), implementing async passthru
on the char device (/dev/ngX). The handler supports the newly introduced
operation NVME_URING_CMD_IO. This operates on a new structure,
nvme_uring_cmd, which is similar to struct nvme_passthru_cmd64 but
without the embedded 8-byte result field. That field is not needed
since uring-cmd can return an additional result via the big CQE.
Signed-off-by: Kanchan Joshi <[email protected]>
Signed-off-by: Anuj Gupta <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
drivers/nvme/host/core.c | 1 +
drivers/nvme/host/ioctl.c | 168 +++++++++++++++++++++++++++++++-
drivers/nvme/host/multipath.c | 1 +
drivers/nvme/host/nvme.h | 5 +
include/uapi/linux/nvme_ioctl.h | 25 +++++
5 files changed, 197 insertions(+), 3 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index e1846d04817f..682df98db341 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3699,6 +3699,7 @@ static const struct file_operations nvme_ns_chr_fops = {
.release = nvme_ns_chr_release,
.unlocked_ioctl = nvme_ns_chr_ioctl,
.compat_ioctl = compat_ptr_ioctl,
+ .uring_cmd = nvme_ns_chr_uring_cmd,
};
static int nvme_add_ns_cdev(struct nvme_ns *ns)
diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index 3531de8073a6..3687cb8d7428 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -26,6 +26,80 @@ static inline void *nvme_meta_from_bio(struct bio *bio)
return bip ? bvec_virt(bip->bip_vec) : NULL;
}
+/*
+ * This overlays struct io_uring_cmd pdu.
+ * Expect build errors if this grows larger than that.
+ */
+struct nvme_uring_cmd_pdu {
+ union {
+ struct bio *bio;
+ struct request *req;
+ };
+ void *meta; /* kernel-resident buffer */
+ void __user *meta_buffer;
+ u32 meta_len;
+};
+
+static inline struct nvme_uring_cmd_pdu *nvme_uring_cmd_pdu(
+ struct io_uring_cmd *ioucmd)
+{
+ return (struct nvme_uring_cmd_pdu *)&ioucmd->pdu;
+}
+
+static void nvme_uring_task_cb(struct io_uring_cmd *ioucmd)
+{
+ struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd);
+ struct request *req = pdu->req;
+ struct bio *bio = req->bio;
+ bool write = req_op(req) == REQ_OP_DRV_OUT;
+ int status;
+ u64 result;
+
+ if (nvme_req(req)->flags & NVME_REQ_CANCELLED)
+ status = -EINTR;
+ else
+ status = nvme_req(req)->status;
+
+ result = le64_to_cpu(nvme_req(req)->result.u64);
+ blk_mq_free_request(req);
+ if (bio)
+ blk_rq_unmap_user(bio);
+
+ if (pdu->meta && !status && !write) {
+ if (copy_to_user(pdu->meta_buffer, pdu->meta, pdu->meta_len))
+ status = -EFAULT;
+ }
+ kfree(pdu->meta);
+ io_uring_cmd_done(ioucmd, status, result);
+}
+
+static void nvme_end_async_pt(struct request *req, blk_status_t err)
+{
+ struct io_uring_cmd *ioucmd = req->end_io_data;
+ struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd);
+ /* extract bio before reusing the same field for request */
+ struct bio *bio = pdu->bio;
+
+ pdu->req = req;
+ req->bio = bio;
+ /* this takes care of moving rest of completion-work to task context */
+ io_uring_cmd_complete_in_task(ioucmd, nvme_uring_task_cb);
+}
+
+static void nvme_setup_uring_cmd_data(struct request *rq,
+ struct io_uring_cmd *ioucmd, void __user *meta_buffer,
+ u32 meta_len)
+{
+ struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd);
+
+ /* to free bio on completion, as req->bio will be null at that time */
+ pdu->bio = rq->bio;
+ pdu->meta = nvme_meta_from_bio(rq->bio);
+ pdu->meta_buffer = meta_buffer;
+ pdu->meta_len = meta_len;
+ rq->end_io_data = ioucmd;
+}
+
static void *nvme_add_user_metadata(struct bio *bio, void __user *ubuf,
unsigned len, u32 seed, bool write)
{
@@ -63,7 +137,8 @@ static void *nvme_add_user_metadata(struct bio *bio, void __user *ubuf,
static struct request *nvme_alloc_user_request(struct request_queue *q,
struct nvme_command *cmd, void __user *ubuffer,
unsigned bufflen, void __user *meta_buffer, unsigned meta_len,
- u32 meta_seed, unsigned timeout, bool vec)
+ u32 meta_seed, unsigned timeout, bool vec, unsigned int rq_flags,
+ blk_mq_req_flags_t blk_flags)
{
bool write = nvme_is_write(cmd);
struct nvme_ns *ns = q->queuedata;
@@ -73,7 +148,7 @@ static struct request *nvme_alloc_user_request(struct request_queue *q,
void *meta = NULL;
int ret;
- req = blk_mq_alloc_request(q, nvme_req_op(cmd), 0);
+ req = blk_mq_alloc_request(q, nvme_req_op(cmd) | rq_flags, blk_flags);
if (IS_ERR(req))
return req;
nvme_init_request(req, cmd);
@@ -156,7 +231,7 @@ static int nvme_submit_user_cmd(struct request_queue *q,
struct request *req;
req = nvme_alloc_user_request(q, cmd, ubuffer, bufflen, meta_buffer,
- meta_len, meta_seed, timeout, vec);
+ meta_len, meta_seed, timeout, vec, 0, 0);
if (IS_ERR(req))
return PTR_ERR(req);
return nvme_execute_user_rq(req, meta_buffer, meta_len, result);
@@ -333,6 +408,55 @@ static int nvme_user_cmd64(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
return status;
}
+static int nvme_uring_cmd_io(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
+ struct io_uring_cmd *ioucmd, unsigned int issue_flags)
+{
+ struct nvme_uring_cmd *cmd =
+ (struct nvme_uring_cmd *)ioucmd->cmd;
+ struct request_queue *q = ns ? ns->queue : ctrl->admin_q;
+ struct nvme_command c;
+ struct request *req;
+ unsigned int rq_flags = 0;
+ blk_mq_req_flags_t blk_flags = 0;
+
+ if (!capable(CAP_SYS_ADMIN))
+ return -EACCES;
+ if (cmd->flags)
+ return -EINVAL;
+ if (!nvme_validate_passthru_nsid(ctrl, ns, cmd->nsid))
+ return -EINVAL;
+
+ if (issue_flags & IO_URING_F_NONBLOCK) {
+ rq_flags = REQ_NOWAIT;
+ blk_flags = BLK_MQ_REQ_NOWAIT;
+ }
+ memset(&c, 0, sizeof(c));
+ c.common.opcode = cmd->opcode;
+ c.common.flags = cmd->flags;
+ c.common.nsid = cpu_to_le32(cmd->nsid);
+ c.common.cdw2[0] = cpu_to_le32(cmd->cdw2);
+ c.common.cdw2[1] = cpu_to_le32(cmd->cdw3);
+ c.common.cdw10 = cpu_to_le32(cmd->cdw10);
+ c.common.cdw11 = cpu_to_le32(cmd->cdw11);
+ c.common.cdw12 = cpu_to_le32(cmd->cdw12);
+ c.common.cdw13 = cpu_to_le32(cmd->cdw13);
+ c.common.cdw14 = cpu_to_le32(cmd->cdw14);
+ c.common.cdw15 = cpu_to_le32(cmd->cdw15);
+
+ req = nvme_alloc_user_request(q, &c, nvme_to_user_ptr(cmd->addr),
+ cmd->data_len, nvme_to_user_ptr(cmd->metadata),
+ cmd->metadata_len, 0, cmd->timeout_ms ?
+ msecs_to_jiffies(cmd->timeout_ms) : 0, 0, rq_flags,
+ blk_flags);
+ if (IS_ERR(req))
+ return PTR_ERR(req);
+
+ nvme_setup_uring_cmd_data(req, ioucmd, nvme_to_user_ptr(cmd->metadata),
+ cmd->metadata_len);
+ blk_execute_rq_nowait(req, 0, nvme_end_async_pt);
+ return -EIOCBQUEUED;
+}
+
static bool is_ctrl_ioctl(unsigned int cmd)
{
if (cmd == NVME_IOCTL_ADMIN_CMD || cmd == NVME_IOCTL_ADMIN64_CMD)
@@ -424,6 +548,32 @@ long nvme_ns_chr_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
return __nvme_ioctl(ns, cmd, (void __user *)arg);
}
+static void nvme_ns_uring_cmd(struct nvme_ns *ns, struct io_uring_cmd *ioucmd,
+ unsigned int issue_flags)
+{
+ int ret;
+
+ BUILD_BUG_ON(sizeof(struct nvme_uring_cmd_pdu) > sizeof(ioucmd->pdu));
+
+ switch (ioucmd->cmd_op) {
+ case NVME_URING_CMD_IO:
+ ret = nvme_uring_cmd_io(ns->ctrl, ns, ioucmd, issue_flags);
+ break;
+ default:
+ ret = -ENOTTY;
+ }
+
+ if (ret != -EIOCBQUEUED)
+ io_uring_cmd_done(ioucmd, ret, 0);
+}
+
+void nvme_ns_chr_uring_cmd(struct io_uring_cmd *ioucmd, unsigned int issue_flags)
+{
+ struct nvme_ns *ns = container_of(file_inode(ioucmd->file)->i_cdev,
+ struct nvme_ns, cdev);
+ nvme_ns_uring_cmd(ns, ioucmd, issue_flags);
+}
+
#ifdef CONFIG_NVME_MULTIPATH
static int nvme_ns_head_ctrl_ioctl(struct nvme_ns *ns, unsigned int cmd,
void __user *argp, struct nvme_ns_head *head, int srcu_idx)
@@ -490,6 +640,18 @@ long nvme_ns_head_chr_ioctl(struct file *file, unsigned int cmd,
srcu_read_unlock(&head->srcu, srcu_idx);
return ret;
}
+void nvme_ns_head_chr_uring_cmd(struct io_uring_cmd *ioucmd,
+ unsigned int issue_flags)
+{
+ struct cdev *cdev = file_inode(ioucmd->file)->i_cdev;
+ struct nvme_ns_head *head = container_of(cdev, struct nvme_ns_head, cdev);
+ int srcu_idx = srcu_read_lock(&head->srcu);
+ struct nvme_ns *ns = nvme_find_path(head);
+
+ if (ns)
+ nvme_ns_uring_cmd(ns, ioucmd, issue_flags);
+ srcu_read_unlock(&head->srcu, srcu_idx);
+}
#endif /* CONFIG_NVME_MULTIPATH */
static int nvme_dev_user_cmd(struct nvme_ctrl *ctrl, void __user *argp)
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index d464fdf978fb..d3e2440d8abb 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -437,6 +437,7 @@ static const struct file_operations nvme_ns_head_chr_fops = {
.release = nvme_ns_head_chr_release,
.unlocked_ioctl = nvme_ns_head_chr_ioctl,
.compat_ioctl = compat_ptr_ioctl,
+ .uring_cmd = nvme_ns_head_chr_uring_cmd,
};
static int nvme_add_ns_head_cdev(struct nvme_ns_head *head)
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index a2b53ca63335..761ad6c629c4 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -16,6 +16,7 @@
#include <linux/rcupdate.h>
#include <linux/wait.h>
#include <linux/t10-pi.h>
+#include <linux/io_uring.h>
#include <trace/events/block.h>
@@ -782,6 +783,10 @@ long nvme_ns_head_chr_ioctl(struct file *file, unsigned int cmd,
unsigned long arg);
long nvme_dev_ioctl(struct file *file, unsigned int cmd,
unsigned long arg);
+void nvme_ns_chr_uring_cmd(struct io_uring_cmd *ioucmd,
+ unsigned int issue_flags);
+void nvme_ns_head_chr_uring_cmd(struct io_uring_cmd *ioucmd,
+ unsigned int issue_flags);
int nvme_getgeo(struct block_device *bdev, struct hd_geometry *geo);
extern const struct attribute_group *nvme_ns_id_attr_groups[];
diff --git a/include/uapi/linux/nvme_ioctl.h b/include/uapi/linux/nvme_ioctl.h
index b2e43185e3b5..04e458c649ab 100644
--- a/include/uapi/linux/nvme_ioctl.h
+++ b/include/uapi/linux/nvme_ioctl.h
@@ -70,6 +70,28 @@ struct nvme_passthru_cmd64 {
__u64 result;
};
+/* same as struct nvme_passthru_cmd64, minus the 8b result field */
+struct nvme_uring_cmd {
+ __u8 opcode;
+ __u8 flags;
+ __u16 rsvd1;
+ __u32 nsid;
+ __u32 cdw2;
+ __u32 cdw3;
+ __u64 metadata;
+ __u64 addr;
+ __u32 metadata_len;
+ __u32 data_len;
+ __u32 cdw10;
+ __u32 cdw11;
+ __u32 cdw12;
+ __u32 cdw13;
+ __u32 cdw14;
+ __u32 cdw15;
+ __u32 timeout_ms;
+ __u32 rsvd2;
+};
+
#define nvme_admin_cmd nvme_passthru_cmd
#define NVME_IOCTL_ID _IO('N', 0x40)
@@ -83,4 +105,7 @@ struct nvme_passthru_cmd64 {
#define NVME_IOCTL_IO64_CMD _IOWR('N', 0x48, struct nvme_passthru_cmd64)
#define NVME_IOCTL_IO64_CMD_VEC _IOWR('N', 0x49, struct nvme_passthru_cmd64)
+/* io_uring async commands: */
+#define NVME_URING_CMD_IO _IOWR('N', 0x80, struct nvme_uring_cmd)
+
#endif /* _UAPI_LINUX_NVME_IOCTL_H */
--
2.25.1
* [PATCH v4 5/5] nvme: add vectored-io support for uring-cmd
[not found] ` <CGME20220505061151epcas5p2523dc661a0daf3e6185dee771eade393@epcas5p2.samsung.com>
@ 2022-05-05 6:06 ` Kanchan Joshi
0 siblings, 0 replies; 45+ messages in thread
From: Kanchan Joshi @ 2022-05-05 6:06 UTC (permalink / raw)
To: axboe, hch
Cc: io-uring, linux-nvme, asml.silence, ming.lei, mcgrof, shr,
joshiiitr, anuj20.g, gost.dev
From: Anuj Gupta <[email protected]>
Wire up support for async passthru that takes an array of buffers (using
iovec). Exposed via a new op, NVME_URING_CMD_IO_VEC. The same 'struct
nvme_uring_cmd' is used, with -
1. cmd.addr as the base address of the user iovec array
2. cmd.data_len as the count of iovec array elements
Signed-off-by: Kanchan Joshi <[email protected]>
Signed-off-by: Anuj Gupta <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
drivers/nvme/host/ioctl.c | 10 +++++++---
include/uapi/linux/nvme_ioctl.h | 1 +
2 files changed, 8 insertions(+), 3 deletions(-)
diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index 3687cb8d7428..8c3b15d3e86d 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -409,7 +409,7 @@ static int nvme_user_cmd64(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
}
static int nvme_uring_cmd_io(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
- struct io_uring_cmd *ioucmd, unsigned int issue_flags)
+ struct io_uring_cmd *ioucmd, unsigned int issue_flags, bool vec)
{
struct nvme_uring_cmd *cmd =
(struct nvme_uring_cmd *)ioucmd->cmd;
@@ -446,7 +446,7 @@ static int nvme_uring_cmd_io(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
req = nvme_alloc_user_request(q, &c, nvme_to_user_ptr(cmd->addr),
cmd->data_len, nvme_to_user_ptr(cmd->metadata),
cmd->metadata_len, 0, cmd->timeout_ms ?
- msecs_to_jiffies(cmd->timeout_ms) : 0, 0, rq_flags,
+ msecs_to_jiffies(cmd->timeout_ms) : 0, vec, rq_flags,
blk_flags);
if (IS_ERR(req))
return PTR_ERR(req);
@@ -557,7 +557,11 @@ static void nvme_ns_uring_cmd(struct nvme_ns *ns, struct io_uring_cmd *ioucmd,
switch (ioucmd->cmd_op) {
case NVME_URING_CMD_IO:
- ret = nvme_uring_cmd_io(ns->ctrl, ns, ioucmd, issue_flags);
+ ret = nvme_uring_cmd_io(ns->ctrl, ns, ioucmd, issue_flags,
+ false);
+ break;
+ case NVME_URING_CMD_IO_VEC:
+ ret = nvme_uring_cmd_io(ns->ctrl, ns, ioucmd, issue_flags, true);
break;
default:
ret = -ENOTTY;
diff --git a/include/uapi/linux/nvme_ioctl.h b/include/uapi/linux/nvme_ioctl.h
index 04e458c649ab..0b1876aa5a59 100644
--- a/include/uapi/linux/nvme_ioctl.h
+++ b/include/uapi/linux/nvme_ioctl.h
@@ -107,5 +107,6 @@ struct nvme_uring_cmd {
/* io_uring async commands: */
#define NVME_URING_CMD_IO _IOWR('N', 0x80, struct nvme_uring_cmd)
+#define NVME_URING_CMD_IO_VEC _IOWR('N', 0x81, struct nvme_uring_cmd)
#endif /* _UAPI_LINUX_NVME_IOCTL_H */
--
2.25.1
^ permalink raw reply related [flat|nested] 45+ messages in thread
* Re: [PATCH v4 1/5] fs,io_uring: add infrastructure for uring-cmd
2022-05-05 6:06 ` [PATCH v4 1/5] fs,io_uring: add infrastructure for uring-cmd Kanchan Joshi
@ 2022-05-05 12:52 ` Jens Axboe
2022-05-05 13:48 ` Ming Lei
2022-05-05 13:29 ` Christoph Hellwig
2022-05-05 16:17 ` Jens Axboe
2 siblings, 1 reply; 45+ messages in thread
From: Jens Axboe @ 2022-05-05 12:52 UTC (permalink / raw)
To: Kanchan Joshi, hch
Cc: io-uring, linux-nvme, asml.silence, ming.lei, mcgrof, shr,
joshiiitr, anuj20.g, gost.dev
On 5/5/22 12:06 AM, Kanchan Joshi wrote:
> From: Jens Axboe <[email protected]>
>
> file_operations->uring_cmd is a file private handler.
> This is somewhat similar to ioctl but hopefully a lot more sane and
> useful as it can be used to enable many io_uring capabilities for the
> underlying operation.
>
> IORING_OP_URING_CMD is a file private kind of request. io_uring doesn't
> know what is in this command type, it's for the provider of ->uring_cmd()
> to deal with. This operation can be issued only on the ring that is
> setup with both IORING_SETUP_SQE128 and IORING_SETUP_CQE32 flags.
One thing that occurred to me that I think we need to change is what you
mention above, code here:
> +static int io_uring_cmd_prep(struct io_kiocb *req,
> + const struct io_uring_sqe *sqe)
> +{
> + struct io_uring_cmd *ioucmd = &req->uring_cmd;
> + struct io_ring_ctx *ctx = req->ctx;
> +
> + if (ctx->flags & IORING_SETUP_IOPOLL)
> + return -EOPNOTSUPP;
> + /* do not support uring-cmd without big SQE/CQE */
> + if (!(ctx->flags & IORING_SETUP_SQE128))
> + return -EOPNOTSUPP;
> + if (!(ctx->flags & IORING_SETUP_CQE32))
> + return -EOPNOTSUPP;
> + if (sqe->ioprio || sqe->rw_flags)
> + return -EINVAL;
> + ioucmd->cmd = sqe->cmd;
> + ioucmd->cmd_op = READ_ONCE(sqe->cmd_op);
> + return 0;
> +}
I've been thinking of this mostly in the context of passthrough for
nvme, but it originally started as a generic feature to be able to wire
up anything for these types of commands. The SQE128/CQE32 requirement is
really an nvme passthrough restriction, we don't necessarily need this
for any kind of URING_CMD. Ditto IOPOLL as well. These are all things
that should be validated further down, but there's no way to do that
currently.
Let's not have that hold up merging this, but we do need it fixed up for
5.19-final so we don't have this restriction. Suggestions welcome...
--
Jens Axboe
* Re: [PATCH v4 1/5] fs,io_uring: add infrastructure for uring-cmd
2022-05-05 6:06 ` [PATCH v4 1/5] fs,io_uring: add infrastructure for uring-cmd Kanchan Joshi
2022-05-05 12:52 ` Jens Axboe
@ 2022-05-05 13:29 ` Christoph Hellwig
2022-05-05 16:17 ` Jens Axboe
2 siblings, 0 replies; 45+ messages in thread
From: Christoph Hellwig @ 2022-05-05 13:29 UTC (permalink / raw)
To: Kanchan Joshi
Cc: axboe, hch, io-uring, linux-nvme, asml.silence, ming.lei, mcgrof,
shr, joshiiitr, anuj20.g, gost.dev
On Thu, May 05, 2022 at 11:36:12AM +0530, Kanchan Joshi wrote:
> + void (*uring_cmd)(struct io_uring_cmd *ioucmd, unsigned int issue_flags);
This adds an overly long line now.
> } __randomize_layout;
> +struct io_uring_cmd {
> + struct file *file;
> + const u8 *cmd;
> + /* callback to defer completions to task context */
> + void (*task_work_cb)(struct io_uring_cmd *cmd);
> + u32 cmd_op;
> + u8 pdu[32]; /* available inline for free use */
And with both unused and flags gone, pdu is unaligned again and will
crash the kernel on 64-bit architectures that do not support unaligned
loads and stores, so we'll need a "u32 __pad" here now.
* Re: [PATCH v4 3/5] nvme: refactor nvme_submit_user_cmd()
2022-05-05 6:06 ` [PATCH v4 3/5] nvme: refactor nvme_submit_user_cmd() Kanchan Joshi
@ 2022-05-05 13:30 ` Christoph Hellwig
2022-05-05 18:37 ` Clay Mayers
1 sibling, 0 replies; 45+ messages in thread
From: Christoph Hellwig @ 2022-05-05 13:30 UTC (permalink / raw)
To: Kanchan Joshi
Cc: axboe, hch, io-uring, linux-nvme, asml.silence, ming.lei, mcgrof,
shr, joshiiitr, anuj20.g, gost.dev
On Thu, May 05, 2022 at 11:36:14AM +0530, Kanchan Joshi wrote:
> +{
> + struct bio_integrity_payload *bip = bio_integrity(bio);
> +
> + return bip ? bvec_virt(bip->bip_vec) : NULL;
> +}
Nit, I'd prefer this as the slightly more verbose:
if (!bip)
return NULL;
return bvec_virt(bip->bip_vec);
* Re: [PATCH v4 4/5] nvme: wire-up uring-cmd support for io-passthru on char-device.
2022-05-05 6:06 ` [PATCH v4 4/5] nvme: wire-up uring-cmd support for io-passthru on char-device Kanchan Joshi
@ 2022-05-05 13:33 ` Christoph Hellwig
2022-05-05 13:38 ` Jens Axboe
1 sibling, 0 replies; 45+ messages in thread
From: Christoph Hellwig @ 2022-05-05 13:33 UTC (permalink / raw)
To: Kanchan Joshi
Cc: axboe, hch, io-uring, linux-nvme, asml.silence, ming.lei, mcgrof,
shr, joshiiitr, anuj20.g, gost.dev
Sorry, a few nits while looking over this again:
> +static void nvme_end_async_pt(struct request *req, blk_status_t err)
The naming here is a bit silly and doesn't match what we use elsewhere
in the code. I'd suggest nvme_uring_cmd_end_io instead.
> +{
> + struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd);
> +
> + /* to free bio on completion, as req->bio will be null at that time */
> + pdu->bio = rq->bio;
> + pdu->meta = nvme_meta_from_bio(rq->bio);
> + pdu->meta_buffer = meta_buffer;
> + pdu->meta_len = meta_len;
> + rq->end_io_data = ioucmd;
> +}
And this really should be folded into the only caller.
* Re: [PATCH v4 4/5] nvme: wire-up uring-cmd support for io-passthru on char-device.
2022-05-05 6:06 ` [PATCH v4 4/5] nvme: wire-up uring-cmd support for io-passthru on char-device Kanchan Joshi
2022-05-05 13:33 ` Christoph Hellwig
@ 2022-05-05 13:38 ` Jens Axboe
2022-05-05 13:42 ` Christoph Hellwig
1 sibling, 1 reply; 45+ messages in thread
From: Jens Axboe @ 2022-05-05 13:38 UTC (permalink / raw)
To: Kanchan Joshi, hch
Cc: io-uring, linux-nvme, asml.silence, ming.lei, mcgrof, shr,
joshiiitr, anuj20.g, gost.dev
On 5/5/22 12:06 AM, Kanchan Joshi wrote:
> +static int nvme_uring_cmd_io(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
> + struct io_uring_cmd *ioucmd, unsigned int issue_flags)
> +{
> + struct nvme_uring_cmd *cmd =
> + (struct nvme_uring_cmd *)ioucmd->cmd;
> + struct request_queue *q = ns ? ns->queue : ctrl->admin_q;
> + struct nvme_command c;
> + struct request *req;
> + unsigned int rq_flags = 0;
> + blk_mq_req_flags_t blk_flags = 0;
> +
> + if (!capable(CAP_SYS_ADMIN))
> + return -EACCES;
> + if (cmd->flags)
> + return -EINVAL;
> + if (!nvme_validate_passthru_nsid(ctrl, ns, cmd->nsid))
> + return -EINVAL;
> +
> + if (issue_flags & IO_URING_F_NONBLOCK) {
> + rq_flags = REQ_NOWAIT;
> + blk_flags = BLK_MQ_REQ_NOWAIT;
> + }
> + memset(&c, 0, sizeof(c));
> + c.common.opcode = cmd->opcode;
> + c.common.flags = cmd->flags;
> + c.common.nsid = cpu_to_le32(cmd->nsid);
> + c.common.cdw2[0] = cpu_to_le32(cmd->cdw2);
> + c.common.cdw2[1] = cpu_to_le32(cmd->cdw3);
> + c.common.cdw10 = cpu_to_le32(cmd->cdw10);
> + c.common.cdw11 = cpu_to_le32(cmd->cdw11);
> + c.common.cdw12 = cpu_to_le32(cmd->cdw12);
> + c.common.cdw13 = cpu_to_le32(cmd->cdw13);
> + c.common.cdw14 = cpu_to_le32(cmd->cdw14);
> + c.common.cdw15 = cpu_to_le32(cmd->cdw15);
> +
> + req = nvme_alloc_user_request(q, &c, nvme_to_user_ptr(cmd->addr),
> + cmd->data_len, nvme_to_user_ptr(cmd->metadata),
> + cmd->metadata_len, 0, cmd->timeout_ms ?
> + msecs_to_jiffies(cmd->timeout_ms) : 0, 0, rq_flags,
> + blk_flags);
You need to be careful with reading/re-reading the shared memory. For
example, you do:
if (!nvme_validate_passthru_nsid(ctrl, ns, cmd->nsid))
return -EINVAL;
but then later read it again:
c.common.nsid = cpu_to_le32(cmd->nsid);
What happens if this changes in between the validation and assigning it
here? Either this needs to be a single read and validation, or the
validation doesn't really matter. I'd make this:
c.common.opcode = READ_ONCE(cmd->opcode);
c.common.flags = READ_ONCE(cmd->flags);
c.common.nsid = cpu_to_le32(READ_ONCE(cmd->nsid));
if (!nvme_validate_passthru_nsid(ctrl, ns, le32_to_cpu(c.common.nsid)))
return -EINVAL;
c.common.cdw2[0] = cpu_to_le32(READ_ONCE(cmd->cdw2));
c.common.cdw2[1] = cpu_to_le32(READ_ONCE(cmd->cdw3));
c.common.metadata = 0;
memset(&c.common.dptr, 0, sizeof(c.common.dptr));
c.common.cdw10 = cpu_to_le32(READ_ONCE(cmd->cdw10));
c.common.cdw11 = cpu_to_le32(READ_ONCE(cmd->cdw11));
c.common.cdw12 = cpu_to_le32(READ_ONCE(cmd->cdw12));
c.common.cdw13 = cpu_to_le32(READ_ONCE(cmd->cdw13));
c.common.cdw14 = cpu_to_le32(READ_ONCE(cmd->cdw14));
c.common.cdw15 = cpu_to_le32(READ_ONCE(cmd->cdw15));
and then consider the ones passed in to nvme_alloc_user_request() as
well.
--
Jens Axboe
* Re: [PATCH v4 4/5] nvme: wire-up uring-cmd support for io-passthru on char-device.
2022-05-05 13:38 ` Jens Axboe
@ 2022-05-05 13:42 ` Christoph Hellwig
2022-05-05 13:50 ` Jens Axboe
0 siblings, 1 reply; 45+ messages in thread
From: Christoph Hellwig @ 2022-05-05 13:42 UTC (permalink / raw)
To: Jens Axboe
Cc: Kanchan Joshi, hch, io-uring, linux-nvme, asml.silence, ming.lei,
mcgrof, shr, joshiiitr, anuj20.g, gost.dev
On Thu, May 05, 2022 at 07:38:31AM -0600, Jens Axboe wrote:
> > + req = nvme_alloc_user_request(q, &c, nvme_to_user_ptr(cmd->addr),
> > + cmd->data_len, nvme_to_user_ptr(cmd->metadata),
> > + cmd->metadata_len, 0, cmd->timeout_ms ?
> > + msecs_to_jiffies(cmd->timeout_ms) : 0, 0, rq_flags,
> > + blk_flags);
>
> You need to be careful with reading/re-reading the shared memory. For
> example, you do:
Uh, yes. With ioucmd->cmd pointing to the user space mapped SQ
we need to be very careful here. To the point where I'd almost prefer
to memcpy it out first, although there might be performance implications.
On something unrelated while looking over the code again: the cast
when assigning cmd in nvme_uring_cmd_io should not be needed any more.
* Re: [PATCH v4 1/5] fs,io_uring: add infrastructure for uring-cmd
2022-05-05 12:52 ` Jens Axboe
@ 2022-05-05 13:48 ` Ming Lei
2022-05-05 13:54 ` Jens Axboe
0 siblings, 1 reply; 45+ messages in thread
From: Ming Lei @ 2022-05-05 13:48 UTC (permalink / raw)
To: Jens Axboe
Cc: Kanchan Joshi, hch, io-uring, linux-nvme, asml.silence, mcgrof,
shr, joshiiitr, anuj20.g, gost.dev
On Thu, May 05, 2022 at 06:52:25AM -0600, Jens Axboe wrote:
> On 5/5/22 12:06 AM, Kanchan Joshi wrote:
> > From: Jens Axboe <[email protected]>
> >
> > file_operations->uring_cmd is a file private handler.
> > This is somewhat similar to ioctl but hopefully a lot more sane and
> > useful as it can be used to enable many io_uring capabilities for the
> > underlying operation.
> >
> > IORING_OP_URING_CMD is a file private kind of request. io_uring doesn't
> > know what is in this command type, it's for the provider of ->uring_cmd()
> > to deal with. This operation can be issued only on the ring that is
> > setup with both IORING_SETUP_SQE128 and IORING_SETUP_CQE32 flags.
>
> One thing that occurred to me that I think we need to change is what you
> mention above, code here:
>
> > +static int io_uring_cmd_prep(struct io_kiocb *req,
> > + const struct io_uring_sqe *sqe)
> > +{
> > + struct io_uring_cmd *ioucmd = &req->uring_cmd;
> > + struct io_ring_ctx *ctx = req->ctx;
> > +
> > + if (ctx->flags & IORING_SETUP_IOPOLL)
> > + return -EOPNOTSUPP;
> > + /* do not support uring-cmd without big SQE/CQE */
> > + if (!(ctx->flags & IORING_SETUP_SQE128))
> > + return -EOPNOTSUPP;
> > + if (!(ctx->flags & IORING_SETUP_CQE32))
> > + return -EOPNOTSUPP;
> > + if (sqe->ioprio || sqe->rw_flags)
> > + return -EINVAL;
> > + ioucmd->cmd = sqe->cmd;
> > + ioucmd->cmd_op = READ_ONCE(sqe->cmd_op);
> > + return 0;
> > +}
>
> I've been thinking of this mostly in the context of passthrough for
> nvme, but it originally started as a generic feature to be able to wire
> up anything for these types of commands. The SQE128/CQE32 requirement is
> really an nvme passthrough restriction, we don't necessarily need this
> for any kind of URING_CMD. Ditto IOPOLL as well. These are all things
> that should be validated further down, but there's no way to do that
> currently.
>
> Let's not have that hold up merging this, but we do need it fixed up for
> 5.19-final so we don't have this restriction. Suggestions welcome...
The validation has to be done in the consumer of SQE128/CQE32 (nvme). One way
is to add SQE128/CQE32 io_uring_cmd_flags and pass them via
->uring_cmd(issue_flags).
Thanks,
Ming
* Re: [PATCH v4 4/5] nvme: wire-up uring-cmd support for io-passthru on char-device.
2022-05-05 13:42 ` Christoph Hellwig
@ 2022-05-05 13:50 ` Jens Axboe
2022-05-05 17:23 ` Jens Axboe
0 siblings, 1 reply; 45+ messages in thread
From: Jens Axboe @ 2022-05-05 13:50 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Kanchan Joshi, io-uring, linux-nvme, asml.silence, ming.lei,
mcgrof, shr, joshiiitr, anuj20.g, gost.dev
On 5/5/22 7:42 AM, Christoph Hellwig wrote:
> On Thu, May 05, 2022 at 07:38:31AM -0600, Jens Axboe wrote:
>>> + req = nvme_alloc_user_request(q, &c, nvme_to_user_ptr(cmd->addr),
>>> + cmd->data_len, nvme_to_user_ptr(cmd->metadata),
>>> + cmd->metadata_len, 0, cmd->timeout_ms ?
>>> + msecs_to_jiffies(cmd->timeout_ms) : 0, 0, rq_flags,
>>> + blk_flags);
>>
>> You need to be careful with reading/re-reading the shared memory. For
>> example, you do:
>
> Uh, yes. With ioucmd->cmd pointing to the user space mapped SQ
> we need to be very careful here. To the point where I'd almost prfer
> to memcpy it out first, altough there might be performance implications.
Most of it is just copied individually to the on-stack command, so that
part is fine just with READ_ONCE(). Things like timeout don't matter too
much I think, but addr/metadata/metadata_len definitely should be
ensured stable and read/verified just once.
IOW, I don't think we need the full on-stack copy, as it's just a few
items that are currently an issue. But let's see what the new iteration
of that patch looks like. Maybe we can just shove nvme_uring_cmd
metadata_len and data_len at the end of that struct, make it identical
to nvme_command, and just make that the memcpy, then use the copy going
forward?
--
Jens Axboe
* Re: [PATCH v4 1/5] fs,io_uring: add infrastructure for uring-cmd
2022-05-05 13:48 ` Ming Lei
@ 2022-05-05 13:54 ` Jens Axboe
0 siblings, 0 replies; 45+ messages in thread
From: Jens Axboe @ 2022-05-05 13:54 UTC (permalink / raw)
To: Ming Lei
Cc: Kanchan Joshi, hch, io-uring, linux-nvme, asml.silence, mcgrof,
shr, joshiiitr, anuj20.g, gost.dev
On 5/5/22 7:48 AM, Ming Lei wrote:
> On Thu, May 05, 2022 at 06:52:25AM -0600, Jens Axboe wrote:
>> On 5/5/22 12:06 AM, Kanchan Joshi wrote:
>>> From: Jens Axboe <[email protected]>
>>>
>>> file_operations->uring_cmd is a file private handler.
>>> This is somewhat similar to ioctl but hopefully a lot more sane and
>>> useful as it can be used to enable many io_uring capabilities for the
>>> underlying operation.
>>>
>>> IORING_OP_URING_CMD is a file private kind of request. io_uring doesn't
>>> know what is in this command type, it's for the provider of ->uring_cmd()
>>> to deal with. This operation can be issued only on the ring that is
>>> setup with both IORING_SETUP_SQE128 and IORING_SETUP_CQE32 flags.
>>
>> One thing that occurred to me that I think we need to change is what you
>> mention above, code here:
>>
>>> +static int io_uring_cmd_prep(struct io_kiocb *req,
>>> + const struct io_uring_sqe *sqe)
>>> +{
>>> + struct io_uring_cmd *ioucmd = &req->uring_cmd;
>>> + struct io_ring_ctx *ctx = req->ctx;
>>> +
>>> + if (ctx->flags & IORING_SETUP_IOPOLL)
>>> + return -EOPNOTSUPP;
>>> + /* do not support uring-cmd without big SQE/CQE */
>>> + if (!(ctx->flags & IORING_SETUP_SQE128))
>>> + return -EOPNOTSUPP;
>>> + if (!(ctx->flags & IORING_SETUP_CQE32))
>>> + return -EOPNOTSUPP;
>>> + if (sqe->ioprio || sqe->rw_flags)
>>> + return -EINVAL;
>>> + ioucmd->cmd = sqe->cmd;
>>> + ioucmd->cmd_op = READ_ONCE(sqe->cmd_op);
>>> + return 0;
>>> +}
>>
>> I've been thinking of this mostly in the context of passthrough for
>> nvme, but it originally started as a generic feature to be able to wire
>> up anything for these types of commands. The SQE128/CQE32 requirement is
>> really an nvme passthrough restriction, we don't necessarily need this
>> for any kind of URING_CMD. Ditto IOPOLL as well. These are all things
>> that should be validated further down, but there's no way to do that
>> currently.
>>
>> Let's not have that hold up merging this, but we do need it fixed up for
>> 5.19-final so we don't have this restriction. Suggestions welcome...
>
> The validation has to be done in the consumer of SQE128/CQE32 (nvme). One
> way is to add SQE128/CQE32 io_uring_cmd_flags and pass them via
> ->uring_cmd(issue_flags).
Right, that's what I tried to say: it needs to be validated further down
as we can (and will) have URING_CMD users that don't care about any of
those 3 things and can work fine with whatever sqe/cqe size we have.
IOPOLL also only applies if the handler potentially can block.
Using the issue_flags makes sense to me, it's probably the easiest
approach. Doesn't take space in the command itself, and there's plenty
of room in that flag space to pass in the ring sqe/cqe/iopoll state.
--
Jens Axboe
* Re: [PATCH v4 2/5] block: wire-up support for passthrough plugging
2022-05-05 6:06 ` [PATCH v4 2/5] block: wire-up support for passthrough plugging Kanchan Joshi
@ 2022-05-05 14:21 ` Ming Lei
0 siblings, 0 replies; 45+ messages in thread
From: Ming Lei @ 2022-05-05 14:21 UTC (permalink / raw)
To: Kanchan Joshi
Cc: axboe, hch, io-uring, linux-nvme, asml.silence, mcgrof, shr,
joshiiitr, anuj20.g, gost.dev
On Thu, May 05, 2022 at 11:36:13AM +0530, Kanchan Joshi wrote:
> From: Jens Axboe <[email protected]>
>
> Add support for plugging in passthrough path. When plugging is enabled, the
> requests are added to a plug instead of getting dispatched to the driver.
> And when the plug is finished, the whole batch gets dispatched via
> ->queue_rqs which turns out to be more efficient. Otherwise dispatching
> used to happen via ->queue_rq, one request at a time.
>
> Signed-off-by: Jens Axboe <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> ---
> block/blk-mq.c | 73 +++++++++++++++++++++++++++-----------------------
> 1 file changed, 39 insertions(+), 34 deletions(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 84d749511f55..2cf011b57cf9 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2340,6 +2340,40 @@ void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
> blk_mq_hctx_mark_pending(hctx, ctx);
> }
>
> +/*
> + * Allow 2x BLK_MAX_REQUEST_COUNT requests on plug queue for multiple
> + * queues. This is important for md arrays to benefit from merging
> + * requests.
> + */
> +static inline unsigned short blk_plug_max_rq_count(struct blk_plug *plug)
> +{
> + if (plug->multiple_queues)
> + return BLK_MAX_REQUEST_COUNT * 2;
> + return BLK_MAX_REQUEST_COUNT;
> +}
> +
> +static void blk_add_rq_to_plug(struct blk_plug *plug, struct request *rq)
> +{
> + struct request *last = rq_list_peek(&plug->mq_list);
> +
> + if (!plug->rq_count) {
> + trace_block_plug(rq->q);
> + } else if (plug->rq_count >= blk_plug_max_rq_count(plug) ||
> + (!blk_queue_nomerges(rq->q) &&
> + blk_rq_bytes(last) >= BLK_PLUG_FLUSH_SIZE)) {
> + blk_mq_flush_plug_list(plug, false);
> + trace_block_plug(rq->q);
> + }
> +
> + if (!plug->multiple_queues && last && last->q != rq->q)
> + plug->multiple_queues = true;
> + if (!plug->has_elevator && (rq->rq_flags & RQF_ELV))
> + plug->has_elevator = true;
> + rq->rq_next = NULL;
> + rq_list_add(&plug->mq_list, rq);
> + plug->rq_count++;
> +}
> +
> /**
> * blk_mq_request_bypass_insert - Insert a request at dispatch list.
> * @rq: Pointer to request to be inserted.
> @@ -2353,7 +2387,12 @@ void blk_mq_request_bypass_insert(struct request *rq, bool at_head,
> bool run_queue)
> {
> struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
> + struct blk_plug *plug = current->plug;
>
> + if (plug) {
> + blk_add_rq_to_plug(plug, rq);
> + return;
> + }
This way looks a bit fragile.
blk_mq_request_bypass_insert() is also called when dispatching I/O
requests, such as from blk_insert_cloned_request(); the request may then
end up inserted into the scheduler via blk_mq_flush_plug_list().
Another issue is in blk_execute_rq(): the request may sit on the plug
list before polling and then hang forever.
Just wondering: why not add the passthrough request to the plug
explicitly in blk_execute_rq_nowait()?
Thanks,
Ming
* Re: [PATCH v4 1/5] fs,io_uring: add infrastructure for uring-cmd
2022-05-05 6:06 ` [PATCH v4 1/5] fs,io_uring: add infrastructure for uring-cmd Kanchan Joshi
2022-05-05 12:52 ` Jens Axboe
2022-05-05 13:29 ` Christoph Hellwig
@ 2022-05-05 16:17 ` Jens Axboe
2022-05-05 17:04 ` Jens Axboe
` (2 more replies)
2 siblings, 3 replies; 45+ messages in thread
From: Jens Axboe @ 2022-05-05 16:17 UTC (permalink / raw)
To: Kanchan Joshi, hch
Cc: io-uring, linux-nvme, asml.silence, ming.lei, mcgrof, shr,
joshiiitr, anuj20.g, gost.dev
On 5/5/22 12:06 AM, Kanchan Joshi wrote:
> +static int io_uring_cmd_prep(struct io_kiocb *req,
> + const struct io_uring_sqe *sqe)
> +{
> + struct io_uring_cmd *ioucmd = &req->uring_cmd;
> + struct io_ring_ctx *ctx = req->ctx;
> +
> + if (ctx->flags & IORING_SETUP_IOPOLL)
> + return -EOPNOTSUPP;
> + /* do not support uring-cmd without big SQE/CQE */
> + if (!(ctx->flags & IORING_SETUP_SQE128))
> + return -EOPNOTSUPP;
> + if (!(ctx->flags & IORING_SETUP_CQE32))
> + return -EOPNOTSUPP;
> + if (sqe->ioprio || sqe->rw_flags)
> + return -EINVAL;
> + ioucmd->cmd = sqe->cmd;
> + ioucmd->cmd_op = READ_ONCE(sqe->cmd_op);
> + return 0;
> +}
While looking at the other suggested changes, I noticed a more
fundamental issue with the passthrough support. For any other command,
SQE contents are stable once prep has been done. The above does do that
for the basic items, but this case is special as the lower level command
itself resides in the SQE.
For cases where the command needs deferral, it's problematic. There are
two main cases where this can happen:
- The issue attempt yields -EAGAIN (we ran out of requests, etc). If you
look at other commands, if they have data that doesn't fit in the
io_kiocb itself, then they need to allocate room for that data and have
it be persistent
- Deferral is specified by the application, using eg IOSQE_IO_LINK or
IOSQE_ASYNC.
We're totally missing support for both of these cases. Consider the case
where the ring is setup with an SQ size of 1. You prep a passthrough
command (request A) and issue it with io_uring_submit(). Due to one of
the two above mentioned conditions, the internal request is deferred.
Either it was sent to ->uring_cmd() but we got -EAGAIN, or it was
deferred even before that happened. The application doesn't know this
happened, it gets another SQE to submit a new request (request B). Fills
it in, calls io_uring_submit(). Since we only have one SQE available in
that ring, when request A gets re-issued, it's now happily reading SQE
contents from command B. Oops.
This is why prep handlers are the only ones that get an sqe passed to
them. They are supposed to ensure that we no longer read from the SQE
past that. Applications can always rely on the fact that once
io_uring_submit() has been done, which consumes the SQE in the SQ ring,
no further reads are done from that SQE.
IOW, we need io_req_prep_async() handling for URING_CMD, which needs to
allocate the full 80 bytes and copy them over. Subsequent issue attempts
will then use that memory rather than the SQE parts. Just need to find a
sane way to do that so we don't need to make the usual prep + direct
issue path any slower.
--
Jens Axboe
* Re: [PATCH v4 1/5] fs,io_uring: add infrastructure for uring-cmd
2022-05-05 16:17 ` Jens Axboe
@ 2022-05-05 17:04 ` Jens Axboe
2022-05-06 7:12 ` Kanchan Joshi
2022-05-10 14:23 ` Kanchan Joshi
2 siblings, 0 replies; 45+ messages in thread
From: Jens Axboe @ 2022-05-05 17:04 UTC (permalink / raw)
To: Kanchan Joshi, hch
Cc: io-uring, linux-nvme, asml.silence, ming.lei, mcgrof, shr,
joshiiitr, anuj20.g, gost.dev
On 5/5/22 10:17 AM, Jens Axboe wrote:
> On 5/5/22 12:06 AM, Kanchan Joshi wrote:
>> +static int io_uring_cmd_prep(struct io_kiocb *req,
>> + const struct io_uring_sqe *sqe)
>> +{
>> + struct io_uring_cmd *ioucmd = &req->uring_cmd;
>> + struct io_ring_ctx *ctx = req->ctx;
>> +
>> + if (ctx->flags & IORING_SETUP_IOPOLL)
>> + return -EOPNOTSUPP;
>> + /* do not support uring-cmd without big SQE/CQE */
>> + if (!(ctx->flags & IORING_SETUP_SQE128))
>> + return -EOPNOTSUPP;
>> + if (!(ctx->flags & IORING_SETUP_CQE32))
>> + return -EOPNOTSUPP;
>> + if (sqe->ioprio || sqe->rw_flags)
>> + return -EINVAL;
>> + ioucmd->cmd = sqe->cmd;
>> + ioucmd->cmd_op = READ_ONCE(sqe->cmd_op);
>> + return 0;
>> +}
>
> While looking at the other suggested changes, I noticed a more
> fundamental issue with the passthrough support. For any other command,
> SQE contents are stable once prep has been done. The above does do that
> for the basic items, but this case is special as the lower level command
> itself resides in the SQE.
>
> For cases where the command needs deferral, it's problematic. There are
> two main cases where this can happen:
>
> - The issue attempt yields -EAGAIN (we ran out of requests, etc). If you
> look at other commands, if they have data that doesn't fit in the
> io_kiocb itself, then they need to allocate room for that data and have
> it be persistent
>
> - Deferral is specified by the application, using eg IOSQE_IO_LINK or
> IOSQE_ASYNC.
>
> We're totally missing support for both of these cases. Consider the case
> where the ring is setup with an SQ size of 1. You prep a passthrough
> command (request A) and issue it with io_uring_submit(). Due to one of
> the two above mentioned conditions, the internal request is deferred.
> Either it was sent to ->uring_cmd() but we got -EAGAIN, or it was
> deferred even before that happened. The application doesn't know this
> happened, it gets another SQE to submit a new request (request B). Fills
> it in, calls io_uring_submit(). Since we only have one SQE available in
> that ring, when request A gets re-issued, it's now happily reading SQE
> contents from command B. Oops.
>
> This is why prep handlers are the only ones that get an sqe passed to
> them. They are supposed to ensure that we no longer read from the SQE
> past that. Applications can always rely on the fact that once
> io_uring_submit() has been done, which consumes the SQE in the SQ ring,
> that no further reads are done from that SQE.
>
> IOW, we need io_req_prep_async() handling for URING_CMD, which needs to
> allocate the full 80 bytes and copy them over. Subsequent issue attempts
> will then use that memory rather than the SQE parts. Just need to find a
> sane way to do that so we don't need to make the usual prep + direct
> issue path any slower.
This one should take care of that. Some parts should be folded into
patch 1, other bits into the nvme adoption.
diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index 8c3b15d3e86d..7daa95d50db1 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -411,8 +411,7 @@ static int nvme_user_cmd64(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
static int nvme_uring_cmd_io(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
struct io_uring_cmd *ioucmd, unsigned int issue_flags, bool vec)
{
- struct nvme_uring_cmd *cmd =
- (struct nvme_uring_cmd *)ioucmd->cmd;
+ struct nvme_uring_cmd *cmd = ioucmd->cmd;
struct request_queue *q = ns ? ns->queue : ctrl->admin_q;
struct nvme_command c;
struct request *req;
@@ -548,8 +547,8 @@ long nvme_ns_chr_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
return __nvme_ioctl(ns, cmd, (void __user *)arg);
}
-static void nvme_ns_uring_cmd(struct nvme_ns *ns, struct io_uring_cmd *ioucmd,
- unsigned int issue_flags)
+static int nvme_ns_uring_cmd(struct nvme_ns *ns, struct io_uring_cmd *ioucmd,
+ unsigned int issue_flags)
{
int ret;
@@ -567,15 +566,15 @@ static void nvme_ns_uring_cmd(struct nvme_ns *ns, struct io_uring_cmd *ioucmd,
ret = -ENOTTY;
}
- if (ret != -EIOCBQUEUED)
- io_uring_cmd_done(ioucmd, ret, 0);
+ return ret;
}
-void nvme_ns_chr_uring_cmd(struct io_uring_cmd *ioucmd, unsigned int issue_flags)
+int nvme_ns_chr_uring_cmd(struct io_uring_cmd *ioucmd, unsigned int issue_flags)
{
struct nvme_ns *ns = container_of(file_inode(ioucmd->file)->i_cdev,
struct nvme_ns, cdev);
- nvme_ns_uring_cmd(ns, ioucmd, issue_flags);
+
+ return nvme_ns_uring_cmd(ns, ioucmd, issue_flags);
}
#ifdef CONFIG_NVME_MULTIPATH
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 761ad6c629c4..26adc7660d28 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -783,7 +783,7 @@ long nvme_ns_head_chr_ioctl(struct file *file, unsigned int cmd,
unsigned long arg);
long nvme_dev_ioctl(struct file *file, unsigned int cmd,
unsigned long arg);
-void nvme_ns_chr_uring_cmd(struct io_uring_cmd *ioucmd,
+int nvme_ns_chr_uring_cmd(struct io_uring_cmd *ioucmd,
unsigned int issue_flags);
void nvme_ns_head_chr_uring_cmd(struct io_uring_cmd *ioucmd,
unsigned int issue_flags);
diff --git a/fs/io_uring.c b/fs/io_uring.c
index cf7087dc12b3..cee7fc95f2c2 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1251,6 +1251,8 @@ static const struct io_op_def io_op_defs[] = {
[IORING_OP_URING_CMD] = {
.needs_file = 1,
.plug = 1,
+ .async_size = 2 * sizeof(struct io_uring_sqe) -
+ offsetof(struct io_uring_sqe, cmd),
},
};
@@ -4936,6 +4938,20 @@ void io_uring_cmd_done(struct io_uring_cmd *ioucmd, ssize_t ret, ssize_t res2)
}
EXPORT_SYMBOL_GPL(io_uring_cmd_done);
+static int io_uring_cmd_prep_async(struct io_kiocb *req)
+{
+ struct io_uring_cmd *ioucmd = &req->uring_cmd;
+ struct io_ring_ctx *ctx = req->ctx;
+ size_t cmd_size = sizeof(struct io_uring_sqe) -
+ offsetof(struct io_uring_sqe, cmd);
+
+ if (ctx->flags & IORING_SETUP_SQE128)
+ cmd_size += sizeof(struct io_uring_sqe);
+
+ memcpy(req->async_data, ioucmd->cmd, cmd_size);
+ return 0;
+}
+
static int io_uring_cmd_prep(struct io_kiocb *req,
const struct io_uring_sqe *sqe)
{
@@ -4951,7 +4967,7 @@ static int io_uring_cmd_prep(struct io_kiocb *req,
return -EOPNOTSUPP;
if (sqe->ioprio || sqe->rw_flags)
return -EINVAL;
- ioucmd->cmd = sqe->cmd;
+ ioucmd->cmd = (void *) sqe->cmd;
ioucmd->cmd_op = READ_ONCE(sqe->cmd_op);
return 0;
}
@@ -4960,10 +4976,18 @@ static int io_uring_cmd(struct io_kiocb *req, unsigned int issue_flags)
{
struct file *file = req->file;
struct io_uring_cmd *ioucmd = &req->uring_cmd;
+ int ret;
if (!req->file->f_op->uring_cmd)
return -EOPNOTSUPP;
- file->f_op->uring_cmd(ioucmd, issue_flags);
+
+ if (req_has_async_data(req))
+ ioucmd->cmd = req->async_data;
+
+ ret = file->f_op->uring_cmd(ioucmd, issue_flags);
+ if (ret != -EIOCBQUEUED)
+ io_uring_cmd_done(ioucmd, ret, 0);
+
return 0;
}
@@ -7849,6 +7873,8 @@ static int io_req_prep_async(struct io_kiocb *req)
return io_recvmsg_prep_async(req);
case IORING_OP_CONNECT:
return io_connect_prep_async(req);
+ case IORING_OP_URING_CMD:
+ return io_uring_cmd_prep_async(req);
}
printk_once(KERN_WARNING "io_uring: prep_async() bad opcode %d\n",
req->opcode);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 30c98d9993df..87b5af1d9fbe 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1996,7 +1996,7 @@ struct file_operations {
struct file *file_out, loff_t pos_out,
loff_t len, unsigned int remap_flags);
int (*fadvise)(struct file *, loff_t, loff_t, int);
- void (*uring_cmd)(struct io_uring_cmd *ioucmd, unsigned int issue_flags);
+ int (*uring_cmd)(struct io_uring_cmd *ioucmd, unsigned int issue_flags);
} __randomize_layout;
struct inode_operations {
diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
index e4cbc80949ce..74371503478a 100644
--- a/include/linux/io_uring.h
+++ b/include/linux/io_uring.h
@@ -14,10 +14,11 @@ enum io_uring_cmd_flags {
struct io_uring_cmd {
struct file *file;
- const u8 *cmd;
+ void *cmd;
/* callback to defer completions to task context */
void (*task_work_cb)(struct io_uring_cmd *cmd);
u32 cmd_op;
+ u32 pad;
u8 pdu[32]; /* available inline for free use */
};
--
Jens Axboe
^ permalink raw reply related [flat|nested] 45+ messages in thread
* Re: [PATCH v4 4/5] nvme: wire-up uring-cmd support for io-passthru on char-device.
2022-05-05 13:50 ` Jens Axboe
@ 2022-05-05 17:23 ` Jens Axboe
2022-05-06 8:28 ` Christoph Hellwig
0 siblings, 1 reply; 45+ messages in thread
From: Jens Axboe @ 2022-05-05 17:23 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Kanchan Joshi, io-uring, linux-nvme, asml.silence, ming.lei,
mcgrof, shr, joshiiitr, anuj20.g, gost.dev
On 5/5/22 7:50 AM, Jens Axboe wrote:
> On 5/5/22 7:42 AM, Christoph Hellwig wrote:
>> On Thu, May 05, 2022 at 07:38:31AM -0600, Jens Axboe wrote:
>>>> + req = nvme_alloc_user_request(q, &c, nvme_to_user_ptr(cmd->addr),
>>>> + cmd->data_len, nvme_to_user_ptr(cmd->metadata),
>>>> + cmd->metadata_len, 0, cmd->timeout_ms ?
>>>> + msecs_to_jiffies(cmd->timeout_ms) : 0, 0, rq_flags,
>>>> + blk_flags);
>>>
>>> You need to be careful with reading/re-reading the shared memory. For
>>> example, you do:
>>
>> Uh, yes. With ioucmd->cmd pointing to the user space mapped SQ
>> we need to be very careful here. To the point where I'd almost prefer
>> to memcpy it out first, although there might be performance implications.
>
> Most of it is just copied individually to the on-stack command, so that
> part is fine just with READ_ONCE(). Things like timeout don't matter too
> much I think, but addr/metadata/metadata_len definitely should be
> ensured stable and read/verified just once.
>
> IOW, I don't think we need the full on-stack copy, as it's just a few
> items that are currently an issue. But let's see what the new iteration
> looks like of that patch. Maybe we can just shove nvme_uring_cmd
> metadata_len and data_len at the end of that struct and make it
> identical to nvme_command_command and just make that the memcpy, then
> use the copy going forward?
The top three patches here have a proposed solution for the 3 issues
that I highlighted:
https://git.kernel.dk/cgit/linux-block/log/?h=for-5.19/io_uring-passthrough
Totally untested... Kanchan, can you take a look and see what you think?
They all need folding obviously, I just tried to do them separately.
Should also get tested :-)
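The single-read pattern discussed above (snapshotting fields out of user-mapped memory exactly once before validating or using them) can be sketched in plain userspace C. The struct and function names below are illustrative stand-ins, not the kernel's:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a command block living in memory that user
 * space can still write to, like the mapped SQ ring. */
struct shared_cmd {
    uint64_t addr;
    uint32_t data_len;
    uint32_t timeout_ms;
};

/* Safe pattern: load every field a single time into stable,
 * kernel-owned storage, then only ever use the copy. In the kernel the
 * per-field loads would be READ_ONCE(), or one memcpy of the whole
 * command. Re-dereferencing the shared block on every use would let a
 * concurrent writer change a field between validation and use. */
static void snapshot_cmd(const volatile struct shared_cmd *shared,
                         struct shared_cmd *stable)
{
    stable->addr = shared->addr;
    stable->data_len = shared->data_len;
    stable->timeout_ms = shared->timeout_ms;
}
```

Once snapshot_cmd() returns, a concurrent writer can scribble on the shared block without affecting the request, which is the property being asked for with addr/metadata/metadata_len.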
--
Jens Axboe
* Re: [PATCH v4 0/5] io_uring passthrough for nvme
2022-05-05 6:06 ` [PATCH v4 0/5] io_uring passthrough for nvme Kanchan Joshi
` (4 preceding siblings ...)
[not found] ` <CGME20220505061151epcas5p2523dc661a0daf3e6185dee771eade393@epcas5p2.samsung.com>
@ 2022-05-05 18:20 ` Jens Axboe
2022-05-05 18:29 ` Jens Axboe
2022-05-10 7:20 ` Christoph Hellwig
5 siblings, 2 replies; 45+ messages in thread
From: Jens Axboe @ 2022-05-05 18:20 UTC (permalink / raw)
To: Kanchan Joshi, hch
Cc: io-uring, linux-nvme, asml.silence, ming.lei, mcgrof, shr,
joshiiitr, anuj20.g, gost.dev
On 5/5/22 12:06 AM, Kanchan Joshi wrote:
> This iteration is against io_uring-big-sqe branch (linux-block).
> On top of a739b2354 ("io_uring: enable CQE32").
>
> fio testing branch:
> https://github.com/joshkan/fio/tree/big-cqe-pt.v4
I folded in the suggested changes, the branch is here:
https://git.kernel.dk/cgit/linux-block/log/?h=for-5.19/io_uring-passthrough
I'll try and run the fio test branch, but please take a look and see what
you think.
--
Jens Axboe
* Re: [PATCH v4 0/5] io_uring passthrough for nvme
2022-05-05 18:20 ` [PATCH v4 0/5] io_uring passthrough for nvme Jens Axboe
@ 2022-05-05 18:29 ` Jens Axboe
2022-05-06 6:42 ` Kanchan Joshi
2022-05-10 7:20 ` Christoph Hellwig
1 sibling, 1 reply; 45+ messages in thread
From: Jens Axboe @ 2022-05-05 18:29 UTC (permalink / raw)
To: Kanchan Joshi, hch
Cc: io-uring, linux-nvme, asml.silence, ming.lei, mcgrof, shr,
joshiiitr, anuj20.g, gost.dev
On 5/5/22 12:20 PM, Jens Axboe wrote:
> On 5/5/22 12:06 AM, Kanchan Joshi wrote:
>> This iteration is against io_uring-big-sqe branch (linux-block).
>> On top of a739b2354 ("io_uring: enable CQE32").
>>
>> fio testing branch:
>> https://github.com/joshkan/fio/tree/big-cqe-pt.v4
>
> I folded in the suggested changes, the branch is here:
>
> https://git.kernel.dk/cgit/linux-block/log/?h=for-5.19/io_uring-passthrough
>
> I'll try and run the fio test branch, but please take a look and see what
> you think.
Tested that fio branch and it works for me with what I had pushed out.
Also tested explicit deferral of requests.
--
Jens Axboe
* RE: [PATCH v4 3/5] nvme: refactor nvme_submit_user_cmd()
2022-05-05 6:06 ` [PATCH v4 3/5] nvme: refactor nvme_submit_user_cmd() Kanchan Joshi
2022-05-05 13:30 ` Christoph Hellwig
@ 2022-05-05 18:37 ` Clay Mayers
2022-05-05 19:03 ` Jens Axboe
1 sibling, 1 reply; 45+ messages in thread
From: Clay Mayers @ 2022-05-05 18:37 UTC (permalink / raw)
To: Kanchan Joshi, [email protected], [email protected]
Cc: [email protected], [email protected],
[email protected], [email protected], [email protected],
[email protected], [email protected], [email protected],
[email protected]
> From: Kanchan Joshi
> Sent: Wednesday, May 4, 2022 11:06 PM
> ---
> drivers/nvme/host/ioctl.c | 47 ++++++++++++++++++++++++++++++++++-----
> 1 file changed, 42 insertions(+), 5 deletions(-)
>
> +static int nvme_execute_user_rq(struct request *req, void __user
> *meta_buffer,
> + unsigned meta_len, u64 *result)
> +{
> + struct bio *bio = req->bio;
> + bool write = bio_op(bio) == REQ_OP_DRV_OUT;
I'm getting a NULL ptr access on the first ioctl(NVME_IOCTL_ADMIN64_CMD)
I send - it has no ubuffer so I think there's no req->bio.
> + int ret;
> + void *meta = nvme_meta_from_bio(bio);
> +
> ret = nvme_execute_passthru_rq(req);
> +
> if (result)
> *result = le64_to_cpu(nvme_req(req)->result.u64);
> if (meta && !ret && !write) {
* Re: [PATCH v4 3/5] nvme: refactor nvme_submit_user_cmd()
2022-05-05 18:37 ` Clay Mayers
@ 2022-05-05 19:03 ` Jens Axboe
2022-05-05 19:11 ` Jens Axboe
0 siblings, 1 reply; 45+ messages in thread
From: Jens Axboe @ 2022-05-05 19:03 UTC (permalink / raw)
To: Clay Mayers, Kanchan Joshi, [email protected]
Cc: [email protected], [email protected],
[email protected], [email protected], [email protected],
[email protected], [email protected], [email protected],
[email protected]
On 5/5/22 12:37 PM, Clay Mayers wrote:
>> From: Kanchan Joshi
>> Sent: Wednesday, May 4, 2022 11:06 PM
>> ---
>
>> drivers/nvme/host/ioctl.c | 47 ++++++++++++++++++++++++++++++++++-----
>> 1 file changed, 42 insertions(+), 5 deletions(-)
>>
>> +static int nvme_execute_user_rq(struct request *req, void __user
>> *meta_buffer,
>> + unsigned meta_len, u64 *result)
>> +{
>> + struct bio *bio = req->bio;
>> + bool write = bio_op(bio) == REQ_OP_DRV_OUT;
>
> I'm getting a NULL ptr access on the first ioctl(NVME_IOCTL_ADMIN64_CMD)
> I send - it has no ubuffer so I think there's no req->bio.
Does this work?
diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index 8fe7ad18a709..f615a791a7cd 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -205,7 +205,6 @@ static int nvme_execute_user_rq(struct request *req, void __user *meta_buffer,
unsigned meta_len, u64 *result)
{
struct bio *bio = req->bio;
- bool write = bio_op(bio) == REQ_OP_DRV_OUT;
int ret;
void *meta = nvme_meta_from_bio(bio);
@@ -213,11 +212,13 @@ static int nvme_execute_user_rq(struct request *req, void __user *meta_buffer,
if (result)
*result = le64_to_cpu(nvme_req(req)->result.u64);
- if (meta && !ret && !write) {
- if (copy_to_user(meta_buffer, meta, meta_len))
+ if (meta) {
+ bool write = bio_op(bio) == REQ_OP_DRV_OUT;
+
+ if (!ret && !write && copy_to_user(meta_buffer, meta, meta_len))
ret = -EFAULT;
+ kfree(meta);
}
- kfree(meta);
if (bio)
blk_rq_unmap_user(bio);
blk_mq_free_request(req);
--
Jens Axboe
* Re: [PATCH v4 3/5] nvme: refactor nvme_submit_user_cmd()
2022-05-05 19:03 ` Jens Axboe
@ 2022-05-05 19:11 ` Jens Axboe
2022-05-05 19:30 ` Clay Mayers
0 siblings, 1 reply; 45+ messages in thread
From: Jens Axboe @ 2022-05-05 19:11 UTC (permalink / raw)
To: Clay Mayers, Kanchan Joshi, [email protected]
Cc: [email protected], [email protected],
[email protected], [email protected], [email protected],
[email protected], [email protected], [email protected],
[email protected]
On 5/5/22 1:03 PM, Jens Axboe wrote:
> On 5/5/22 12:37 PM, Clay Mayers wrote:
>>> From: Kanchan Joshi
>>> Sent: Wednesday, May 4, 2022 11:06 PM
>>> ---
>>
>>> drivers/nvme/host/ioctl.c | 47 ++++++++++++++++++++++++++++++++++-----
>>> 1 file changed, 42 insertions(+), 5 deletions(-)
>>>
>>> +static int nvme_execute_user_rq(struct request *req, void __user
>>> *meta_buffer,
>>> + unsigned meta_len, u64 *result)
>>> +{
>>> + struct bio *bio = req->bio;
>>> + bool write = bio_op(bio) == REQ_OP_DRV_OUT;
>>
>> I'm getting a NULL ptr access on the first ioctl(NVME_IOCTL_ADMIN64_CMD)
>> I send - it has no ubuffer so I think there's no req->bio.
>
> Does this work?
This might be better, though you'd only notice if you had integrity
enabled. Christoph, I'm folding this in with patch 3...
diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index 8fe7ad18a709..3d827789b536 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -21,9 +21,13 @@ static void __user *nvme_to_user_ptr(uintptr_t ptrval)
static inline void *nvme_meta_from_bio(struct bio *bio)
{
- struct bio_integrity_payload *bip = bio_integrity(bio);
+ if (bio) {
+ struct bio_integrity_payload *bip = bio_integrity(bio);
- return bip ? bvec_virt(bip->bip_vec) : NULL;
+ return bip ? bvec_virt(bip->bip_vec) : NULL;
+ }
+
+ return NULL;
}
/*
@@ -205,19 +209,20 @@ static int nvme_execute_user_rq(struct request *req, void __user *meta_buffer,
unsigned meta_len, u64 *result)
{
struct bio *bio = req->bio;
- bool write = bio_op(bio) == REQ_OP_DRV_OUT;
- int ret;
void *meta = nvme_meta_from_bio(bio);
+ int ret;
ret = nvme_execute_passthru_rq(req);
if (result)
*result = le64_to_cpu(nvme_req(req)->result.u64);
- if (meta && !ret && !write) {
- if (copy_to_user(meta_buffer, meta, meta_len))
+ if (meta) {
+ bool write = bio_op(bio) == REQ_OP_DRV_OUT;
+
+ if (!ret && !write && copy_to_user(meta_buffer, meta, meta_len))
ret = -EFAULT;
+ kfree(meta);
}
- kfree(meta);
if (bio)
blk_rq_unmap_user(bio);
blk_mq_free_request(req);
--
Jens Axboe
* RE: [PATCH v4 3/5] nvme: refactor nvme_submit_user_cmd()
2022-05-05 19:11 ` Jens Axboe
@ 2022-05-05 19:30 ` Clay Mayers
2022-05-05 19:31 ` Jens Axboe
0 siblings, 1 reply; 45+ messages in thread
From: Clay Mayers @ 2022-05-05 19:30 UTC (permalink / raw)
To: Jens Axboe, Kanchan Joshi, [email protected]
Cc: [email protected], [email protected],
[email protected], [email protected], [email protected],
[email protected], [email protected], [email protected],
[email protected]
On 5/5/22 12:11 PM, Jens Axboe wrote:
> On 5/5/22 1:03 PM, Jens Axboe wrote:
> > On 5/5/22 12:37 PM, Clay Mayers wrote:
> >>> From: Kanchan Joshi
> >>> Sent: Wednesday, May 4, 2022 11:06 PM
> >>> ---
> >>
> >>> drivers/nvme/host/ioctl.c | 47 ++++++++++++++++++++++++++++++++++-----
> >>> 1 file changed, 42 insertions(+), 5 deletions(-)
> >>>
> >>> +static int nvme_execute_user_rq(struct request *req, void __user
> >>> *meta_buffer,
> >>> + unsigned meta_len, u64 *result)
> >>> +{
> >>> + struct bio *bio = req->bio;
> >>> + bool write = bio_op(bio) == REQ_OP_DRV_OUT;
> >>
> >> I'm getting a NULL ptr access on the first
> ioctl(NVME_IOCTL_ADMIN64_CMD)
> >> I send - it has no ubuffer so I think there's no req->bio.
> >
> > Does this work?
It did not! Same NULL ptr deref at nearly, if not exactly, the same location.
I didn't investigate down to the line of code since you had already sent v2.
>
> This might be better, though you'd only notice if you had integrity
> enabled. Christoph, I'm folding this in with patch 3...
>
>
> diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
> index 8fe7ad18a709..3d827789b536 100644
> --- a/drivers/nvme/host/ioctl.c
> +++ b/drivers/nvme/host/ioctl.c
> @@ -21,9 +21,13 @@ static void __user *nvme_to_user_ptr(uintptr_t ptrval)
>
> static inline void *nvme_meta_from_bio(struct bio *bio)
> {
> - struct bio_integrity_payload *bip = bio_integrity(bio);
> + if (bio) {
> + struct bio_integrity_payload *bip = bio_integrity(bio);
>
> - return bip ? bvec_virt(bip->bip_vec) : NULL;
> + return bip ? bvec_virt(bip->bip_vec) : NULL;
> + }
> +
> + return NULL;
> }
>
> /*
> @@ -205,19 +209,20 @@ static int nvme_execute_user_rq(struct request *req,
> void __user *meta_buffer,
> unsigned meta_len, u64 *result)
> {
> struct bio *bio = req->bio;
> - bool write = bio_op(bio) == REQ_OP_DRV_OUT;
> - int ret;
> void *meta = nvme_meta_from_bio(bio);
> + int ret;
>
> ret = nvme_execute_passthru_rq(req);
>
> if (result)
> *result = le64_to_cpu(nvme_req(req)->result.u64);
> - if (meta && !ret && !write) {
> - if (copy_to_user(meta_buffer, meta, meta_len))
> + if (meta) {
> + bool write = bio_op(bio) == REQ_OP_DRV_OUT;
> +
> + if (!ret && !write && copy_to_user(meta_buffer, meta,
> meta_len))
> ret = -EFAULT;
> + kfree(meta);
> }
> - kfree(meta);
> if (bio)
> blk_rq_unmap_user(bio);
> blk_mq_free_request(req);
>
> --
> Jens Axboe
This does work and got me past the null ptr segfault.
Clay.
* Re: [PATCH v4 3/5] nvme: refactor nvme_submit_user_cmd()
2022-05-05 19:30 ` Clay Mayers
@ 2022-05-05 19:31 ` Jens Axboe
2022-05-05 19:50 ` hch
0 siblings, 1 reply; 45+ messages in thread
From: Jens Axboe @ 2022-05-05 19:31 UTC (permalink / raw)
To: Clay Mayers, Kanchan Joshi, [email protected]
Cc: [email protected], [email protected],
[email protected], [email protected], [email protected],
[email protected], [email protected], [email protected],
[email protected]
On 5/5/22 1:30 PM, Clay Mayers wrote:
>> This might be better, though you'd only notice if you had integrity
>> enabled. Christoph, I'm folding this in with patch 3...
>>
>>
>> diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
>> index 8fe7ad18a709..3d827789b536 100644
>> --- a/drivers/nvme/host/ioctl.c
>> +++ b/drivers/nvme/host/ioctl.c
>> @@ -21,9 +21,13 @@ static void __user *nvme_to_user_ptr(uintptr_t ptrval)
>>
>> static inline void *nvme_meta_from_bio(struct bio *bio)
>> {
>> - struct bio_integrity_payload *bip = bio_integrity(bio);
>> + if (bio) {
>> + struct bio_integrity_payload *bip = bio_integrity(bio);
>>
>> - return bip ? bvec_virt(bip->bip_vec) : NULL;
>> + return bip ? bvec_virt(bip->bip_vec) : NULL;
>> + }
>> +
>> + return NULL;
>> }
>>
>> /*
>> @@ -205,19 +209,20 @@ static int nvme_execute_user_rq(struct request *req,
>> void __user *meta_buffer,
>> unsigned meta_len, u64 *result)
>> {
>> struct bio *bio = req->bio;
>> - bool write = bio_op(bio) == REQ_OP_DRV_OUT;
>> - int ret;
>> void *meta = nvme_meta_from_bio(bio);
>> + int ret;
>>
>> ret = nvme_execute_passthru_rq(req);
>>
>> if (result)
>> *result = le64_to_cpu(nvme_req(req)->result.u64);
>> - if (meta && !ret && !write) {
>> - if (copy_to_user(meta_buffer, meta, meta_len))
>> + if (meta) {
>> + bool write = bio_op(bio) == REQ_OP_DRV_OUT;
>> +
>> + if (!ret && !write && copy_to_user(meta_buffer, meta,
>> meta_len))
>> ret = -EFAULT;
>> + kfree(meta);
>> }
>> - kfree(meta);
>> if (bio)
>> blk_rq_unmap_user(bio);
>> blk_mq_free_request(req);
>>
>> --
>> Jens Axboe
>
> This does work and got me past the null ptr segfault.
OK good, thanks for testing. I did fold it in.
--
Jens Axboe
* Re: [PATCH v4 3/5] nvme: refactor nvme_submit_user_cmd()
2022-05-05 19:31 ` Jens Axboe
@ 2022-05-05 19:50 ` hch
2022-05-05 20:44 ` Jens Axboe
0 siblings, 1 reply; 45+ messages in thread
From: hch @ 2022-05-05 19:50 UTC (permalink / raw)
To: Jens Axboe
Cc: Clay Mayers, Kanchan Joshi, [email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected], [email protected],
[email protected], [email protected], [email protected]
On Thu, May 05, 2022 at 01:31:28PM -0600, Jens Axboe wrote:
> >> Jens Axboe
> >
> > This does work and got me past the null ptr segfault.
>
> OK good, thanks for testing. I did fold it in.
It might make sense to just kill nvme_meta_from_bio and pass the
meta pointer directly with this version of the code.
* Re: [PATCH v4 3/5] nvme: refactor nvme_submit_user_cmd()
2022-05-05 19:50 ` hch
@ 2022-05-05 20:44 ` Jens Axboe
2022-05-06 5:56 ` hch
0 siblings, 1 reply; 45+ messages in thread
From: Jens Axboe @ 2022-05-05 20:44 UTC (permalink / raw)
To: [email protected]
Cc: Clay Mayers, Kanchan Joshi, [email protected],
[email protected], [email protected],
[email protected], [email protected], [email protected],
[email protected], [email protected], [email protected]
On 5/5/22 1:50 PM, [email protected] wrote:
> On Thu, May 05, 2022 at 01:31:28PM -0600, Jens Axboe wrote:
>>>> Jens Axboe
>>>
>>> This does work and got me past the null ptr segfault.
>>
>> OK good, thanks for testing. I did fold it in.
>
> It might make sense to just kill nvme_meta_from_bio and pass the
> meta pointer directly with this version of the code.
Do you want to do an incremental for that? Looking at
nvme_execute_user_rq() and nvme_uring_task_cb() there's a fair bit of
duplication of the meta copy.
--
Jens Axboe
* Re: [PATCH v4 3/5] nvme: refactor nvme_submit_user_cmd()
2022-05-05 20:44 ` Jens Axboe
@ 2022-05-06 5:56 ` hch
0 siblings, 0 replies; 45+ messages in thread
From: hch @ 2022-05-06 5:56 UTC (permalink / raw)
To: Jens Axboe
Cc: [email protected], Clay Mayers, Kanchan Joshi, [email protected],
[email protected], [email protected],
[email protected], [email protected], [email protected],
[email protected], [email protected], [email protected]
On Thu, May 05, 2022 at 02:44:40PM -0600, Jens Axboe wrote:
> On 5/5/22 1:50 PM, [email protected] wrote:
> > On Thu, May 05, 2022 at 01:31:28PM -0600, Jens Axboe wrote:
> >>>> Jens Axboe
> >>>
> >>> This does work and got me past the null ptr segfault.
> >>
> >> OK good, thanks for testing. I did fold it in.
> >
> > It might make sense to just kill nvme_meta_from_bio and pass the
> > meta pointer directly with this version of the code.
>
> Do you want to do an incremental for that? Looking at
> nvme_execute_user_rq() and nvme_uring_task_cb() there's a fair bit of
> duplication of the meta copy.
Yes, there is. And the right way is to keep the integrity payload alive
longer, but I'm not sure we are going to get that done in time.
* Re: [PATCH v4 0/5] io_uring passthrough for nvme
2022-05-05 18:29 ` Jens Axboe
@ 2022-05-06 6:42 ` Kanchan Joshi
2022-05-06 13:14 ` Jens Axboe
0 siblings, 1 reply; 45+ messages in thread
From: Kanchan Joshi @ 2022-05-06 6:42 UTC (permalink / raw)
To: Jens Axboe
Cc: hch, io-uring, linux-nvme, asml.silence, ming.lei, mcgrof, shr,
joshiiitr, anuj20.g, gost.dev
On Thu, May 05, 2022 at 12:29:27PM -0600, Jens Axboe wrote:
>On 5/5/22 12:20 PM, Jens Axboe wrote:
>> On 5/5/22 12:06 AM, Kanchan Joshi wrote:
>>> This iteration is against io_uring-big-sqe branch (linux-block).
>>> On top of a739b2354 ("io_uring: enable CQE32").
>>>
>>> fio testing branch:
>>> https://github.com/joshkan/fio/tree/big-cqe-pt.v4
>>
>> I folded in the suggested changes, the branch is here:
>>
>> https://git.kernel.dk/cgit/linux-block/log/?h=for-5.19/io_uring-passthrough
>>
>> I'll try and run the fio test branch, but please take a look and see what
>> you think.
>
>Tested that fio branch and it works for me with what I had pushed out.
>Also tested explicit deferral of requests.
Thanks for sorting everything out! I could only test this out now :-(
Fio scripts ran fine (after refreshing the SQE128/CQE32 flag values; repushed fio).
I think these two changes are needed in your branch:
1. since uring-cmd can be without large-cqe, we need to add that
condition in io_uring_cmd_done(). This change in patch 1 -
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 884f40f51536..c24469564ebc 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -4938,7 +4938,10 @@ void io_uring_cmd_done(struct io_uring_cmd *ioucmd, ssize_t ret, ssize_t res2)
if (ret < 0)
req_set_fail(req);
- __io_req_complete32(req, 0, ret, 0, res2, 0);
+ if (req->ctx->flags & IORING_SETUP_CQE32)
+ __io_req_complete32(req, 0, ret, 0, res2, 0);
+ else
+ io_req_complete(req, ret);
}
EXPORT_SYMBOL_GPL(io_uring_cmd_done);
2. In the commit-message of patch 1, we should delete last line.
i.e.
"This operation can be issued only on the ring that is
setup with both IORING_SETUP_SQE128 and IORING_SETUP_CQE32 flags."
And/or we can move this to the commit-message of patch 4.
You can either take these changes in, or I can respin the series. LMK.
* Re: [PATCH v4 1/5] fs,io_uring: add infrastructure for uring-cmd
2022-05-05 16:17 ` Jens Axboe
2022-05-05 17:04 ` Jens Axboe
@ 2022-05-06 7:12 ` Kanchan Joshi
2022-05-10 14:23 ` Kanchan Joshi
2 siblings, 0 replies; 45+ messages in thread
From: Kanchan Joshi @ 2022-05-06 7:12 UTC (permalink / raw)
To: Jens Axboe
Cc: hch, io-uring, linux-nvme, asml.silence, ming.lei, mcgrof, shr,
joshiiitr, anuj20.g, gost.dev
On Thu, May 05, 2022 at 10:17:39AM -0600, Jens Axboe wrote:
>On 5/5/22 12:06 AM, Kanchan Joshi wrote:
>> +static int io_uring_cmd_prep(struct io_kiocb *req,
>> + const struct io_uring_sqe *sqe)
>> +{
>> + struct io_uring_cmd *ioucmd = &req->uring_cmd;
>> + struct io_ring_ctx *ctx = req->ctx;
>> +
>> + if (ctx->flags & IORING_SETUP_IOPOLL)
>> + return -EOPNOTSUPP;
>> + /* do not support uring-cmd without big SQE/CQE */
>> + if (!(ctx->flags & IORING_SETUP_SQE128))
>> + return -EOPNOTSUPP;
>> + if (!(ctx->flags & IORING_SETUP_CQE32))
>> + return -EOPNOTSUPP;
>> + if (sqe->ioprio || sqe->rw_flags)
>> + return -EINVAL;
>> + ioucmd->cmd = sqe->cmd;
>> + ioucmd->cmd_op = READ_ONCE(sqe->cmd_op);
>> + return 0;
>> +}
>
>While looking at the other suggested changes, I noticed a more
>fundamental issue with the passthrough support. For any other command,
>SQE contents are stable once prep has been done. The above does do that
>for the basic items, but this case is special as the lower level command
>itself resides in the SQE.
>
>For cases where the command needs deferral, it's problematic. There are
>two main cases where this can happen:
>
>- The issue attempt yields -EAGAIN (we ran out of requests, etc). If you
> look at other commands, if they have data that doesn't fit in the
> io_kiocb itself, then they need to allocate room for that data and have
> it be persistent
>
>- Deferral is specified by the application, using eg IOSQE_IO_LINK or
> IOSQE_ASYNC.
>
>We're totally missing support for both of these cases. Consider the case
>where the ring is setup with an SQ size of 1. You prep a passthrough
>command (request A) and issue it with io_uring_submit(). Due to one of
>the two above mentioned conditions, the internal request is deferred.
>Either it was sent to ->uring_cmd() but we got -EAGAIN, or it was
>deferred even before that happened. The application doesn't know this
>happened, it gets another SQE to submit a new request (request B). Fills
>it in, calls io_uring_submit(). Since we only have one SQE available in
>that ring, when request A gets re-issued, it's now happily reading SQE
>contents from command B. Oops.
>
>This is why prep handlers are the only ones that get an sqe passed to
>them. They are supposed to ensure that we no longer read from the SQE
>past that. Applications can always rely on that fact that once
>io_uring_submit() has been done, which consumes the SQE in the SQ ring,
>that no further reads are done from that SQE.
>
Thanks for explaining; that gives a great deal of clarity.
Are there already some tests (liburing, fio etc.) that you use to test
this part?
Different from what you mentioned, but I was forcing a failure scenario by
setting a low QD in nvme and pumping commands at a higher QD than that.
But this was just testing that we return failure to userspace (since
deferral was not there).
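The SQE-reuse hazard described above can be reduced to a toy model: one shared SQE slot, and a request that either keeps pointing into it or copies it out at prep time, as io_uring_cmd_prep_async() does. Everything below is an illustrative userspace simulation, not kernel code:

```c
#include <string.h>

/* Toy model: an SQ with a single 128-byte slot, reused by the
 * application as soon as submit returns. Not the real io_uring layout. */
#define SLOT_SIZE 128

static char sq_slot[SLOT_SIZE];          /* the one shared SQE slot */

struct pending_req {
    const char *cmd;                     /* what gets read at issue time */
    char async_copy[SLOT_SIZE];          /* stable per-request storage */
};

/* Broken pattern: keep pointing into the ring slot, so a deferred
 * request later reads whatever command the application wrote next. */
static void prep_no_copy(struct pending_req *req)
{
    req->cmd = sq_slot;
}

/* Fixed pattern, mirroring io_uring_cmd_prep_async(): copy the command
 * out of the SQE into per-request async data before the slot can be
 * reused by the application. */
static void prep_async_copy(struct pending_req *req)
{
    memcpy(req->async_copy, sq_slot, SLOT_SIZE);
    req->cmd = req->async_copy;
}
```

With a one-slot SQ, request A prepped via prep_no_copy() and then deferred will read request B's bytes once the application refills the slot; the prep_async_copy() variant keeps A's command stable.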
* Re: [PATCH v4 4/5] nvme: wire-up uring-cmd support for io-passthru on char-device.
2022-05-05 17:23 ` Jens Axboe
@ 2022-05-06 8:28 ` Christoph Hellwig
2022-05-06 13:37 ` Jens Axboe
0 siblings, 1 reply; 45+ messages in thread
From: Christoph Hellwig @ 2022-05-06 8:28 UTC (permalink / raw)
To: Jens Axboe
Cc: Christoph Hellwig, Kanchan Joshi, io-uring, linux-nvme,
asml.silence, ming.lei, mcgrof, shr, joshiiitr, anuj20.g,
gost.dev
On Thu, May 05, 2022 at 11:23:01AM -0600, Jens Axboe wrote:
> The top three patches here have a proposed solution for the 3 issues
> that I highlighted:
>
> https://git.kernel.dk/cgit/linux-block/log/?h=for-5.19/io_uring-passthrough
>
> Totally untested... Kanchan, can you take a look and see what you think?
> They all need folding obviously, I just tried to do them separately.
> Should also get tested :-)
I've also pushed out a tree based on this, which contains my little
fixups that I'd suggest to be folded. Totally untested and written
while jetlagged:
http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/io_uring-passthrough
Note that while I tried to keep all my changes in separate patches, the
main passthrough patch had conflicts during a rebase, which I had to
fix up, but I tried to touch as little as possible.
* Re: [PATCH v4 0/5] io_uring passthrough for nvme
2022-05-06 6:42 ` Kanchan Joshi
@ 2022-05-06 13:14 ` Jens Axboe
0 siblings, 0 replies; 45+ messages in thread
From: Jens Axboe @ 2022-05-06 13:14 UTC (permalink / raw)
To: Kanchan Joshi
Cc: hch, io-uring, linux-nvme, asml.silence, ming.lei, mcgrof, shr,
joshiiitr, anuj20.g, gost.dev
On 5/6/22 12:42 AM, Kanchan Joshi wrote:
> On Thu, May 05, 2022 at 12:29:27PM -0600, Jens Axboe wrote:
>> On 5/5/22 12:20 PM, Jens Axboe wrote:
>>> On 5/5/22 12:06 AM, Kanchan Joshi wrote:
>>>> This iteration is against io_uring-big-sqe branch (linux-block).
>>>> On top of a739b2354 ("io_uring: enable CQE32").
>>>>
>>>> fio testing branch:
>>>> https://github.com/joshkan/fio/tree/big-cqe-pt.v4
>>>
>>> I folded in the suggested changes, the branch is here:
>>>
>>> https://git.kernel.dk/cgit/linux-block/log/?h=for-5.19/io_uring-passthrough
>>>
>>> I'll try and run the fio test branch, but please take a look and see what
>>> you think.
>>
>> Tested that fio branch and it works for me with what I had pushed out.
>> Also tested explicit deferral of requests.
>
> Thanks for sorting everything out! I could test this out now only :-(
> Fio scripts ran fine (post refreshing SQE128/CQE32 flag values;repushed fio).
>
> I think these two changes are needed in your branch:
>
> 1. since uring-cmd can be without large-cqe, we need to add that
> condition in io_uring_cmd_done(). This change in patch 1 -
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 884f40f51536..c24469564ebc 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -4938,7 +4938,10 @@ void io_uring_cmd_done(struct io_uring_cmd *ioucmd, ssize_t ret, ssize_t res2)
>
> if (ret < 0)
> req_set_fail(req);
> - __io_req_complete32(req, 0, ret, 0, res2, 0);
> + if (req->ctx->flags & IORING_SETUP_CQE32)
> + __io_req_complete32(req, 0, ret, 0, res2, 0);
> + else
> + io_req_complete(req, ret);
> }
> EXPORT_SYMBOL_GPL(io_uring_cmd_done);
Ah yes, good catch, I missed that it's cqe32 exclusive right now.
> 2. In the commit-message of patch 1, we should delete last line.
> i.e.
> "This operation can be issued only on the ring that is
> setup with both IORING_SETUP_SQE128 and IORING_SETUP_CQE32 flags."
>
> And/or we can move this commit-message of patch 4.
I killed the line.
> You can either take these changes in, or I can respin the series. LMK.
I folded it in and will process Christoph's suggestions.
--
Jens Axboe
* Re: [PATCH v4 4/5] nvme: wire-up uring-cmd support for io-passthru on char-device.
2022-05-06 8:28 ` Christoph Hellwig
@ 2022-05-06 13:37 ` Jens Axboe
2022-05-06 14:50 ` Christoph Hellwig
0 siblings, 1 reply; 45+ messages in thread
From: Jens Axboe @ 2022-05-06 13:37 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Kanchan Joshi, io-uring, linux-nvme, asml.silence, ming.lei,
mcgrof, shr, joshiiitr, anuj20.g, gost.dev
On 5/6/22 2:28 AM, Christoph Hellwig wrote:
> On Thu, May 05, 2022 at 11:23:01AM -0600, Jens Axboe wrote:
>> The top three patches here have a proposed solution for the 3 issues
>> that I highlighted:
>>
>> https://git.kernel.dk/cgit/linux-block/log/?h=for-5.19/io_uring-passthrough
>>
>> Totally untested... Kanchan, can you take a look and see what you think?
>> They all need folding obviously, I just tried to do them separately.
>> Should also get tested :-)
>
> I've also pushed out a tree based on this, which contains my little
> fixups that I'd suggest to be folded. Totally untested and written
> while jetlagged:
>
> http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/io_uring-passthrough
>
> Note that while I tried to keep all my changes in separate patches, the
> main passthrough patch had conflicts during a rebase, which I had to
> fix up, but I tried to touch as little as possible.
Folded most of it, but I left your two metadata related patches as
separate, as I think they really should be separate. However, they need a
proper commit message and signed-off-by from you. It's these two:
https://git.kernel.dk/cgit/linux-block/commit/?h=for-5.19/io_uring-passthrough&id=b855a4458068722235bdf69688448820c8ddae8e
https://git.kernel.dk/cgit/linux-block/commit/?h=for-5.19/io_uring-passthrough&id=2be698bdd668daeb1aad2ecd516484a62e948547
Can you do those as proper patches?
I did not do your async_size changes; I think your jetlagged eyes
missed that this isn't a sizeof thing on a flexible array, it's just the
offset of it. Hence for non-sqe128, the async size is sizeof(io_uring_sqe) -
offsetof where the pdu starts, and so forth.
--
Jens Axboe
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v4 4/5] nvme: wire-up uring-cmd support for io-passthru on char-device.
2022-05-06 13:37 ` Jens Axboe
@ 2022-05-06 14:50 ` Christoph Hellwig
2022-05-06 14:57 ` Jens Axboe
0 siblings, 1 reply; 45+ messages in thread
From: Christoph Hellwig @ 2022-05-06 14:50 UTC (permalink / raw)
To: Jens Axboe
Cc: Christoph Hellwig, Kanchan Joshi, io-uring, linux-nvme,
asml.silence, ming.lei, mcgrof, shr, joshiiitr, anuj20.g,
gost.dev
On Fri, May 06, 2022 at 07:37:55AM -0600, Jens Axboe wrote:
> Folded most of it, but I left your two metadata related patches as
> separate as I think they really should be separate. However, they need a
> proper commit message and signed-off-by from you. It's these two:
>
> https://git.kernel.dk/cgit/linux-block/commit/?h=for-5.19/io_uring-passthrough&id=b855a4458068722235bdf69688448820c8ddae8e
This one should be folded into "nvme: refactor nvme_submit_user_cmd()",
which is the patch just before it that adds nvme_meta_from_bio.
> https://git.kernel.dk/cgit/linux-block/commit/?h=for-5.19/io_uring-passthrough&id=2be698bdd668daeb1aad2ecd516484a62e948547
Just add this:
"Add a small helper to act as the counterpart to nvme_add_user_metadata."
with my signoff:
Signed-off-by: Christoph Hellwig <[email protected]>
> I did not do your async_size changes; I think your jetlagged eyes
> missed that this isn't a sizeof thing on a flexible array, it's just the
> offset of it. Hence for non-sqe128, the async size is sizeof(io_uring_sqe) -
> offsetof where the pdu starts, and so forth.
Hmm, this still seems a bit odd to me. So without sqe128 you don't even
get the cmd data that would fit into the 64-bit SQE?
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v4 4/5] nvme: wire-up uring-cmd support for io-passthru on char-device.
2022-05-06 14:50 ` Christoph Hellwig
@ 2022-05-06 14:57 ` Jens Axboe
2022-05-07 5:03 ` Christoph Hellwig
0 siblings, 1 reply; 45+ messages in thread
From: Jens Axboe @ 2022-05-06 14:57 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Kanchan Joshi, io-uring, linux-nvme, asml.silence, ming.lei,
mcgrof, shr, joshiiitr, anuj20.g, gost.dev
On 5/6/22 8:50 AM, Christoph Hellwig wrote:
> On Fri, May 06, 2022 at 07:37:55AM -0600, Jens Axboe wrote:
>> Folded most of it, but I left your two metadata related patches as
>> separate as I think they really should be separate. However, they need a
>> proper commit message and signed-off-by from you. It's these two:
>>
>> https://git.kernel.dk/cgit/linux-block/commit/?h=for-5.19/io_uring-passthrough&id=b855a4458068722235bdf69688448820c8ddae8e
>
> This one should be folded into "nvme: refactor nvme_submit_user_cmd()",
> which is the patch just before it that adds nvme_meta_from_bio.
>
>> https://git.kernel.dk/cgit/linux-block/commit/?h=for-5.19/io_uring-passthrough&id=2be698bdd668daeb1aad2ecd516484a62e948547
>
> Just add this:
>
> "Add a small helper to act as the counterpart to nvme_add_user_metadata."
>
> with my signoff:
>
> Signed-off-by: Christoph Hellwig <[email protected]>
Both done, thanks.
>> I did not do your async_size changes; I think your jetlagged eyes
>> missed that this isn't a sizeof thing on a flexible array, it's just the
>> offset of it. Hence for non-sqe128, the async size is sizeof(io_uring_sqe) -
>> offsetof where the pdu starts, and so forth.
>
> Hmm, this still seems a bit odd to me. So without sqe128 you don't even
> get the cmd data that would fit into the 64-bit SQE?
You do. Without sqe128, you get sizeof(sqe) - offsetof(cmd) == 16 bytes.
With, you get 16 + 64, 80.
--
Jens Axboe
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v4 4/5] nvme: wire-up uring-cmd support for io-passthru on char-device.
2022-05-06 14:57 ` Jens Axboe
@ 2022-05-07 5:03 ` Christoph Hellwig
2022-05-07 12:53 ` Jens Axboe
0 siblings, 1 reply; 45+ messages in thread
From: Christoph Hellwig @ 2022-05-07 5:03 UTC (permalink / raw)
To: Jens Axboe
Cc: Christoph Hellwig, Kanchan Joshi, io-uring, linux-nvme,
asml.silence, ming.lei, mcgrof, shr, joshiiitr, anuj20.g,
gost.dev
Getting back to this after a good night's worth of sleep:
On Fri, May 06, 2022 at 08:57:53AM -0600, Jens Axboe wrote:
> > Just add this:
> >
> > "Add a small helper to act as the counterpart to nvme_add_user_metadata."
> >
> > with my signoff:
> >
> > Signed-off-by: Christoph Hellwig <[email protected]>
>
> Both done, thanks.
I think we're much better off folding "nvme: add nvme_finish_user_metadata
helper" into "nvme: refactor nvme_submit_user_cmd()", as the former basically
just redoes the split done in the first patch in a more fine-grained way
to allow sharing some of the metadata end-I/O code with the uring path,
and basically only touches code changed in the first patch again.
> >> I did not do your async_size changes; I think your jetlagged eyes
> >> missed that this isn't a sizeof thing on a flexible array, it's just the
> >> offset of it. Hence for non-sqe128, the async size is sizeof(io_uring_sqe) -
> >> offsetof where the pdu starts, and so forth.
> >
> > Hmm, this still seems a bit odd to me. So without sqe128 you don't even
> > get the cmd data that would fit into the 64-bit SQE?
>
> You do. Without sqe128, you get sizeof(sqe) - offsetof(cmd) == 16 bytes.
> With, you get 16 + 64, 80.
Can we please get a little documented helper that does this instead of
the two open coded places?
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v4 4/5] nvme: wire-up uring-cmd support for io-passthru on char-device.
2022-05-07 5:03 ` Christoph Hellwig
@ 2022-05-07 12:53 ` Jens Axboe
2022-05-09 6:00 ` Christoph Hellwig
0 siblings, 1 reply; 45+ messages in thread
From: Jens Axboe @ 2022-05-07 12:53 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Kanchan Joshi, io-uring, linux-nvme, asml.silence, ming.lei,
mcgrof, shr, joshiiitr, anuj20.g, gost.dev
On 5/6/22 11:03 PM, Christoph Hellwig wrote:
> Getting back to this after a good night's worth of sleep:
>
> On Fri, May 06, 2022 at 08:57:53AM -0600, Jens Axboe wrote:
>>> Just add this:
>>>
>>> "Add a small helper to act as the counterpart to nvme_add_user_metadata."
>>>
>>> with my signoff:
>>>
>>> Signed-off-by: Christoph Hellwig <[email protected]>
>>
>> Both done, thanks.
>
> I think we're much better off folding "nvme: add nvme_finish_user_metadata
> helper" into "nvme: refactor nvme_submit_user_cmd()", as the former basically
> just redoes the split done in the first patch in a more fine-grained way
> to allow sharing some of the metadata end-I/O code with the uring path,
> and basically only touches code changed in the first patch again.
Yes good point, I've folded the two.
>>>> I did not do your async_size changes; I think your jetlagged eyes
>>>> missed that this isn't a sizeof thing on a flexible array, it's just the
>>>> offset of it. Hence for non-sqe128, the async size is sizeof(io_uring_sqe) -
>>>> offsetof where the pdu starts, and so forth.
>>>
>>> Hmm, this still seems a bit odd to me. So without sqe128 you don't even
>>> get the cmd data that would fit into the 64-bit SQE?
>>
>> You do. Without sqe128, you get sizeof(sqe) - offsetof(cmd) == 16 bytes.
>> With, you get 16 + 64, 80.
>
> Can we please get a little documented helper that does this instead of
> the two open coded places?
How about we just add a comment? We use it in two spots, but one has
knowledge of the sqe64 vs sqe128 state, the other one does not. Hence
not sure how best to add a helper for this. One also must be a compile
time constant. Best I can think of is the below. Not the prettiest, but
it does keep it in one spot and with a single comment rather than in two
spots.
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 1860c50f7f8e..0a9b0fde55af 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1044,6 +1044,14 @@ struct io_cancel_data {
int seq;
};
+/*
+ * The URING_CMD payload starts at 'cmd' in the first sqe, and continues into
+ * the following sqe if SQE128 is used.
+ */
+#define uring_cmd_pdu_size(is_sqe128) \
+ ((1 + !!(is_sqe128)) * sizeof(struct io_uring_sqe) - \
+ offsetof(struct io_uring_sqe, cmd))
+
struct io_op_def {
/* needs req->file assigned */
unsigned needs_file : 1;
@@ -1286,8 +1294,7 @@ static const struct io_op_def io_op_defs[] = {
[IORING_OP_URING_CMD] = {
.needs_file = 1,
.plug = 1,
- .async_size = 2 * sizeof(struct io_uring_sqe) -
- offsetof(struct io_uring_sqe, cmd),
+ .async_size = uring_cmd_pdu_size(1),
},
};
@@ -4947,11 +4954,9 @@ EXPORT_SYMBOL_GPL(io_uring_cmd_done);
static int io_uring_cmd_prep_async(struct io_kiocb *req)
{
- size_t cmd_size = sizeof(struct io_uring_sqe) -
- offsetof(struct io_uring_sqe, cmd);
+ size_t cmd_size;
- if (req->ctx->flags & IORING_SETUP_SQE128)
- cmd_size += sizeof(struct io_uring_sqe);
+ cmd_size = uring_cmd_pdu_size(req->ctx->flags & IORING_SETUP_SQE128);
memcpy(req->async_data, req->uring_cmd.cmd, cmd_size);
return 0;
--
Jens Axboe
^ permalink raw reply related [flat|nested] 45+ messages in thread
* Re: [PATCH v4 4/5] nvme: wire-up uring-cmd support for io-passthru on char-device.
2022-05-07 12:53 ` Jens Axboe
@ 2022-05-09 6:00 ` Christoph Hellwig
2022-05-09 12:52 ` Jens Axboe
0 siblings, 1 reply; 45+ messages in thread
From: Christoph Hellwig @ 2022-05-09 6:00 UTC (permalink / raw)
To: Jens Axboe
Cc: Christoph Hellwig, Kanchan Joshi, io-uring, linux-nvme,
asml.silence, ming.lei, mcgrof, shr, joshiiitr, anuj20.g,
gost.dev
On Sat, May 07, 2022 at 06:53:30AM -0600, Jens Axboe wrote:
> How about we just add a comment? We use it in two spots, but one has
> knowledge of the sqe64 vs sqe128 state, the other one does not. Hence
> not sure how best to add a helper for this. One also must be a compile
> time constant. Best I can think of is the below. Not the prettiest, but
> it does keep it in one spot and with a single comment rather than in two
> spots.
If you think just a comment is better I can live with that, but the
proposed macro also looks fine to me.
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v4 4/5] nvme: wire-up uring-cmd support for io-passthru on char-device.
2022-05-09 6:00 ` Christoph Hellwig
@ 2022-05-09 12:52 ` Jens Axboe
0 siblings, 0 replies; 45+ messages in thread
From: Jens Axboe @ 2022-05-09 12:52 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Kanchan Joshi, io-uring, linux-nvme, asml.silence, ming.lei,
mcgrof, shr, joshiiitr, anuj20.g, gost.dev
On 5/9/22 12:00 AM, Christoph Hellwig wrote:
> On Sat, May 07, 2022 at 06:53:30AM -0600, Jens Axboe wrote:
>> How about we just add a comment? We use it in two spots, but one has
>> knowledge of the sqe64 vs sqe128 state, the other one does not. Hence
>> not sure how best to add a helper for this. One also must be a compile
>> time constant. Best I can think of is the below. Not the prettiest, but
>> it does keep it in one spot and with a single comment rather than in two
>> spots.
>
> If you think just a comment is better I can live with that, also the
> proposed macro also looks fine to me.
I folded in the macro patch, seems safer and better to just have a
single spot where it's done rather than rely on two comments if things
do change there in the future.
--
Jens Axboe
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v4 0/5] io_uring passthrough for nvme
2022-05-05 18:20 ` [PATCH v4 0/5] io_uring passthrough for nvme Jens Axboe
2022-05-05 18:29 ` Jens Axboe
@ 2022-05-10 7:20 ` Christoph Hellwig
2022-05-10 12:29 ` Jens Axboe
1 sibling, 1 reply; 45+ messages in thread
From: Christoph Hellwig @ 2022-05-10 7:20 UTC (permalink / raw)
To: Jens Axboe
Cc: Kanchan Joshi, hch, io-uring, linux-nvme, asml.silence, ming.lei,
mcgrof, shr, joshiiitr, anuj20.g, gost.dev
On Thu, May 05, 2022 at 12:20:37PM -0600, Jens Axboe wrote:
> On 5/5/22 12:06 AM, Kanchan Joshi wrote:
> > This iteration is against io_uring-big-sqe brach (linux-block).
> > On top of a739b2354 ("io_uring: enable CQE32").
> >
> > fio testing branch:
> > https://github.com/joshkan/fio/tree/big-cqe-pt.v4
>
> I folded in the suggested changes, the branch is here:
>
> https://git.kernel.dk/cgit/linux-block/log/?h=for-5.19/io_uring-passthrough
>
> I'll try and run the fio test branch, but please take a look and see what
> you think.
I think what is in the branch now looks pretty good. Can either of you
two send it out to the lists for a final review pass?
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v4 0/5] io_uring passthrough for nvme
2022-05-10 7:20 ` Christoph Hellwig
@ 2022-05-10 12:29 ` Jens Axboe
2022-05-10 14:21 ` Kanchan Joshi
0 siblings, 1 reply; 45+ messages in thread
From: Jens Axboe @ 2022-05-10 12:29 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Kanchan Joshi, io-uring, linux-nvme, asml.silence, ming.lei,
mcgrof, shr, joshiiitr, anuj20.g, gost.dev
On 5/10/22 1:20 AM, Christoph Hellwig wrote:
> On Thu, May 05, 2022 at 12:20:37PM -0600, Jens Axboe wrote:
>> On 5/5/22 12:06 AM, Kanchan Joshi wrote:
>>> This iteration is against io_uring-big-sqe brach (linux-block).
>>> On top of a739b2354 ("io_uring: enable CQE32").
>>>
>>> fio testing branch:
>>> https://github.com/joshkan/fio/tree/big-cqe-pt.v4
>>
>> I folded in the suggested changes, the branch is here:
>>
>> https://git.kernel.dk/cgit/linux-block/log/?h=for-5.19/io_uring-passthrough
>>
>> I'll try and run the fio test branch, but please take a look and see what
>> you think.
>
> I think what is in the branch now looks pretty good. Can either of you
> two send it out to the lists for a final review pass?
Kanchan, do you want to do this?
--
Jens Axboe
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v4 0/5] io_uring passthrough for nvme
2022-05-10 12:29 ` Jens Axboe
@ 2022-05-10 14:21 ` Kanchan Joshi
0 siblings, 0 replies; 45+ messages in thread
From: Kanchan Joshi @ 2022-05-10 14:21 UTC (permalink / raw)
To: Jens Axboe
Cc: Christoph Hellwig, Kanchan Joshi, io-uring, linux-nvme,
Pavel Begunkov, Ming Lei, Luis Chamberlain, Stefan Roesch,
Anuj Gupta, gost.dev
On Tue, May 10, 2022 at 5:59 PM Jens Axboe <[email protected]> wrote:
>
> On 5/10/22 1:20 AM, Christoph Hellwig wrote:
> > On Thu, May 05, 2022 at 12:20:37PM -0600, Jens Axboe wrote:
> >> On 5/5/22 12:06 AM, Kanchan Joshi wrote:
> >>> This iteration is against io_uring-big-sqe brach (linux-block).
> >>> On top of a739b2354 ("io_uring: enable CQE32").
> >>>
> >>> fio testing branch:
> >>> https://github.com/joshkan/fio/tree/big-cqe-pt.v4
> >>
> >> I folded in the suggested changes, the branch is here:
> >>
> >> https://git.kernel.dk/cgit/linux-block/log/?h=for-5.19/io_uring-passthrough
> >>
> >> I'll try and run the fio test branch, but please take a look and see what
> >> you think.
> >
> > I think what is in the branch now looks pretty good. Can either of you
> > two send it out to the lists for a final review pass?
>
> Kanchan, do you want to do this?
Sure, will do that.
I am looking more into the cases when the cmd needs deferral; we need a
few changes there. I will put those in the other email (for better
context). Please take a look. If you are fine with them, I will fold
them into the newer version.
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v4 1/5] fs,io_uring: add infrastructure for uring-cmd
2022-05-05 16:17 ` Jens Axboe
2022-05-05 17:04 ` Jens Axboe
2022-05-06 7:12 ` Kanchan Joshi
@ 2022-05-10 14:23 ` Kanchan Joshi
2022-05-10 14:35 ` Jens Axboe
2 siblings, 1 reply; 45+ messages in thread
From: Kanchan Joshi @ 2022-05-10 14:23 UTC (permalink / raw)
To: Jens Axboe
Cc: Kanchan Joshi, Christoph Hellwig, io-uring, linux-nvme,
Pavel Begunkov, Ming Lei, Luis Chamberlain, Stefan Roesch,
Anuj Gupta, gost.dev
On Thu, May 5, 2022 at 9:47 PM Jens Axboe <[email protected]> wrote:
>
> On 5/5/22 12:06 AM, Kanchan Joshi wrote:
> > +static int io_uring_cmd_prep(struct io_kiocb *req,
> > + const struct io_uring_sqe *sqe)
> > +{
> > + struct io_uring_cmd *ioucmd = &req->uring_cmd;
> > + struct io_ring_ctx *ctx = req->ctx;
> > +
> > + if (ctx->flags & IORING_SETUP_IOPOLL)
> > + return -EOPNOTSUPP;
> > + /* do not support uring-cmd without big SQE/CQE */
> > + if (!(ctx->flags & IORING_SETUP_SQE128))
> > + return -EOPNOTSUPP;
> > + if (!(ctx->flags & IORING_SETUP_CQE32))
> > + return -EOPNOTSUPP;
> > + if (sqe->ioprio || sqe->rw_flags)
> > + return -EINVAL;
> > + ioucmd->cmd = sqe->cmd;
> > + ioucmd->cmd_op = READ_ONCE(sqe->cmd_op);
> > + return 0;
> > +}
>
> While looking at the other suggested changes, I noticed a more
> fundamental issue with the passthrough support. For any other command,
> SQE contents are stable once prep has been done. The above does do that
> for the basic items, but this case is special as the lower level command
> itself resides in the SQE.
>
> For cases where the command needs deferral, it's problematic. There are
> two main cases where this can happen:
>
> - The issue attempt yields -EAGAIN (we ran out of requests, etc). If you
> look at other commands, if they have data that doesn't fit in the
> io_kiocb itself, then they need to allocate room for that data and have
> it be persistent
While we have io-wq retrying for this case, async_data is not allocated.
We need to do that explicitly inside io_uring_cmd(). Something like this -
if (ret == -EAGAIN) {
	if (!req_has_async_data(req)) {
		if (io_alloc_async_data(req))
			return -ENOMEM;
		io_uring_cmd_prep_async(req);
	}
	return ret;
}
> - Deferral is specified by the application, using eg IOSQE_IO_LINK or
> IOSQE_ASYNC.
For this to work, we are missing ".needs_async_setup = 1" for
IORING_OP_URING_CMD.
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v4 1/5] fs,io_uring: add infrastructure for uring-cmd
2022-05-10 14:23 ` Kanchan Joshi
@ 2022-05-10 14:35 ` Jens Axboe
0 siblings, 0 replies; 45+ messages in thread
From: Jens Axboe @ 2022-05-10 14:35 UTC (permalink / raw)
To: Kanchan Joshi
Cc: Kanchan Joshi, Christoph Hellwig, io-uring, linux-nvme,
Pavel Begunkov, Ming Lei, Luis Chamberlain, Stefan Roesch,
Anuj Gupta, gost.dev
On 5/10/22 8:23 AM, Kanchan Joshi wrote:
> On Thu, May 5, 2022 at 9:47 PM Jens Axboe <[email protected]> wrote:
>>
>> On 5/5/22 12:06 AM, Kanchan Joshi wrote:
>>> +static int io_uring_cmd_prep(struct io_kiocb *req,
>>> + const struct io_uring_sqe *sqe)
>>> +{
>>> + struct io_uring_cmd *ioucmd = &req->uring_cmd;
>>> + struct io_ring_ctx *ctx = req->ctx;
>>> +
>>> + if (ctx->flags & IORING_SETUP_IOPOLL)
>>> + return -EOPNOTSUPP;
>>> + /* do not support uring-cmd without big SQE/CQE */
>>> + if (!(ctx->flags & IORING_SETUP_SQE128))
>>> + return -EOPNOTSUPP;
>>> + if (!(ctx->flags & IORING_SETUP_CQE32))
>>> + return -EOPNOTSUPP;
>>> + if (sqe->ioprio || sqe->rw_flags)
>>> + return -EINVAL;
>>> + ioucmd->cmd = sqe->cmd;
>>> + ioucmd->cmd_op = READ_ONCE(sqe->cmd_op);
>>> + return 0;
>>> +}
>>
>> While looking at the other suggested changes, I noticed a more
>> fundamental issue with the passthrough support. For any other command,
>> SQE contents are stable once prep has been done. The above does do that
>> for the basic items, but this case is special as the lower level command
>> itself resides in the SQE.
>>
>> For cases where the command needs deferral, it's problematic. There are
>> two main cases where this can happen:
>>
>> - The issue attempt yields -EAGAIN (we ran out of requests, etc). If you
>> look at other commands, if they have data that doesn't fit in the
>> io_kiocb itself, then they need to allocate room for that data and have
>> it be persistent
>
> While we have io-wq retrying for this case, async_data is not allocated.
> We need to do that explicitly inside io_uring_cmd(). Something like this -
>
> if (ret == -EAGAIN) {
> 	if (!req_has_async_data(req)) {
> 		if (io_alloc_async_data(req))
> 			return -ENOMEM;
> 		io_uring_cmd_prep_async(req);
> 	}
> 	return ret;
> }
>
>> - Deferral is specified by the application, using eg IOSQE_IO_LINK or
>> IOSQE_ASYNC.
> For this to work, we are missing ".needs_async_setup = 1" for
> IORING_OP_URING_CMD.
Agree on both, the op handler itself should alloc the async_data for
this case and that flag does need to be set.
--
Jens Axboe
^ permalink raw reply [flat|nested] 45+ messages in thread
end of thread, other threads:[~2022-05-10 15:05 UTC | newest]
Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <CGME20220505061142epcas5p2c943572766bfd5088138fe0f7873c96c@epcas5p2.samsung.com>
2022-05-05 6:06 ` [PATCH v4 0/5] io_uring passthrough for nvme Kanchan Joshi
[not found] ` <CGME20220505061144epcas5p3821a9516dad2b5eff5a25c56dbe164df@epcas5p3.samsung.com>
2022-05-05 6:06 ` [PATCH v4 1/5] fs,io_uring: add infrastructure for uring-cmd Kanchan Joshi
2022-05-05 12:52 ` Jens Axboe
2022-05-05 13:48 ` Ming Lei
2022-05-05 13:54 ` Jens Axboe
2022-05-05 13:29 ` Christoph Hellwig
2022-05-05 16:17 ` Jens Axboe
2022-05-05 17:04 ` Jens Axboe
2022-05-06 7:12 ` Kanchan Joshi
2022-05-10 14:23 ` Kanchan Joshi
2022-05-10 14:35 ` Jens Axboe
[not found] ` <CGME20220505061146epcas5p3919c48d58d353a62a5858ee10ad162a0@epcas5p3.samsung.com>
2022-05-05 6:06 ` [PATCH v4 2/5] block: wire-up support for passthrough plugging Kanchan Joshi
2022-05-05 14:21 ` Ming Lei
[not found] ` <CGME20220505061148epcas5p188618b5b15a95cbe48c8c1559a18c994@epcas5p1.samsung.com>
2022-05-05 6:06 ` [PATCH v4 3/5] nvme: refactor nvme_submit_user_cmd() Kanchan Joshi
2022-05-05 13:30 ` Christoph Hellwig
2022-05-05 18:37 ` Clay Mayers
2022-05-05 19:03 ` Jens Axboe
2022-05-05 19:11 ` Jens Axboe
2022-05-05 19:30 ` Clay Mayers
2022-05-05 19:31 ` Jens Axboe
2022-05-05 19:50 ` hch
2022-05-05 20:44 ` Jens Axboe
2022-05-06 5:56 ` hch
[not found] ` <CGME20220505061150epcas5p2b60880c541a4b2f144c348834c7cbf0b@epcas5p2.samsung.com>
2022-05-05 6:06 ` [PATCH v4 4/5] nvme: wire-up uring-cmd support for io-passthru on char-device Kanchan Joshi
2022-05-05 13:33 ` Christoph Hellwig
2022-05-05 13:38 ` Jens Axboe
2022-05-05 13:42 ` Christoph Hellwig
2022-05-05 13:50 ` Jens Axboe
2022-05-05 17:23 ` Jens Axboe
2022-05-06 8:28 ` Christoph Hellwig
2022-05-06 13:37 ` Jens Axboe
2022-05-06 14:50 ` Christoph Hellwig
2022-05-06 14:57 ` Jens Axboe
2022-05-07 5:03 ` Christoph Hellwig
2022-05-07 12:53 ` Jens Axboe
2022-05-09 6:00 ` Christoph Hellwig
2022-05-09 12:52 ` Jens Axboe
[not found] ` <CGME20220505061151epcas5p2523dc661a0daf3e6185dee771eade393@epcas5p2.samsung.com>
2022-05-05 6:06 ` [PATCH v4 5/5] nvme: add vectored-io support for uring-cmd Kanchan Joshi
2022-05-05 18:20 ` [PATCH v4 0/5] io_uring passthrough for nvme Jens Axboe
2022-05-05 18:29 ` Jens Axboe
2022-05-06 6:42 ` Kanchan Joshi
2022-05-06 13:14 ` Jens Axboe
2022-05-10 7:20 ` Christoph Hellwig
2022-05-10 12:29 ` Jens Axboe
2022-05-10 14:21 ` Kanchan Joshi