public inbox for [email protected]
* [RFC PATCH 00/12] io_uring attached nvme queue
       [not found] <CGME20230429094228epcas5p4a80d8ed77433989fa804ecf449f83b0b@epcas5p4.samsung.com>
@ 2023-04-29  9:39 ` Kanchan Joshi
       [not found]   ` <CGME20230429094238epcas5p4efa3dc785fa54ab974852c7f90113025@epcas5p4.samsung.com>
                     ` (12 more replies)
  0 siblings, 13 replies; 15+ messages in thread
From: Kanchan Joshi @ 2023-04-29  9:39 UTC (permalink / raw)
  To: axboe, hch, sagi, kbusch
  Cc: io-uring, linux-nvme, linux-block, gost.dev, anuj1072538,
	xiaoguang.wang, Kanchan Joshi

This series shows one way to do what the title says.
It puts up a more direct/lean path that enables
 - submission from io_uring SQE to NVMe SQE
 - completion from NVMe CQE to io_uring CQE
Essentially, it cuts out the hoops (request/bio) in the nvme io path.

Also, the io_uring ring is not meant to be shared among application
threads; the application is responsible for building any sharing it
needs. This means the queue attached exclusively to a ring can do away
with some of the synchronization costs that a shared queue incurs.

The primary objective is to further amp up the efficiency of the kernel
io path (towards PCIe gen N, N+1 hardware).
And we are seeing some asks for this too [1].

Building-blocks
===============
At a high level, the series can be divided into the following parts -

1. The nvme driver starts exposing some queue-pairs (SQ+CQ) that can
be attached on demand to another in-kernel user (not just to the
block-layer, which is the case at the moment).

Example:
insmod nvme.ko poll_queues=1 raw_queues=2

nvme0: 24/0/1/2 default/read/poll queues/raw queues

While the driver registers the other queues with the block-layer,
raw-queues are instead reserved for exclusive attachment by other
in-kernel users. At this point each raw-queue is interrupt-disabled
(similar to poll_queues). Maybe we need a better name for these
(e.g. app/user queues).
[Refer: patch 2]

2. register/unregister queue interface
(a) one for the io_uring application to ask for a device-queue and
register it with the ring. [Refer: patch 4]
(b) another at nvme so that other in-kernel users (io_uring for now) can
ask for a raw-queue. [Refer: patch 3, 5, 6]

The latter returns a qid, which io_uring stores internally (not exposed
to user-space) in the ring ctx. At most one queue per ring is enabled.
The ring has no other special properties except the fact that it stores
a qid that it can use exclusively, so the application can very well use
the ring to do things other than nvme io.
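
For reference, attaching a raw queue from user-space could look roughly
like the sketch below. This is not code from the series (liburing has no
wrapper for it), so the raw io_uring_register(2) syscall is used with the
opcode values added in patch 4; error handling is omitted:

/* hypothetical user-space sketch */
#include <fcntl.h>
#include <liburing.h>
#include <sys/syscall.h>
#include <unistd.h>

#define IORING_REGISTER_QUEUE	26	/* values added by patch 4 */
#define IORING_UNREGISTER_QUEUE	27

static int attach_raw_queue(struct io_uring *ring, const char *path)
{
	int fd = open(path, O_RDONLY);	/* e.g. /dev/ng0n1 */
	int fixed_idx = 0;		/* index in the fixed-file table */

	/* the file must already be in the ring's registered file-set */
	io_uring_register_files(ring, &fd, 1);

	/* arg points to the fixed-file index, nr_args must be 1 */
	return syscall(__NR_io_uring_register, ring->ring_fd,
		       IORING_REGISTER_QUEUE, &fixed_idx, 1);
}

static void detach_raw_queue(struct io_uring *ring)
{
	/* arg must be NULL and nr_args 0 */
	syscall(__NR_io_uring_register, ring->ring_fd,
		IORING_UNREGISTER_QUEUE, NULL, 0);
}

On success 0 is returned; the qid itself stays inside the kernel,
attached to the ring.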

3. user-interface to send commands down this way
(a) uring-cmd is extended to support a new flag "IORING_URING_CMD_DIRECT"
that the application passes in the SQE. That is all.
(b) the flag goes down to the provider of ->uring_cmd, which may choose
to do things differently based on it (or ignore it).
[Refer: patch 7]
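
A hypothetical SQE setup for a 4KiB passthrough read over the registered
queue is sketched below (again, not code from the series). The ring is
assumed to be created with IORING_SETUP_IOPOLL | IORING_SETUP_SQE128 |
IORING_SETUP_CQE32, and the buffer is assumed to be a registered (fixed)
buffer, since the direct path here handles polled, fixed-buffer,
single-segment IO; IORING_URING_CMD_DIRECT carries the value added by
patch 7:

/* hypothetical user-space sketch */
#include <liburing.h>
#include <linux/nvme_ioctl.h>
#include <stdint.h>
#include <string.h>

#ifndef IORING_URING_CMD_DIRECT
#define IORING_URING_CMD_DIRECT	(1U << 1)	/* added by patch 7 */
#endif

static void prep_direct_read(struct io_uring *ring, void *buf,
			     __u32 nsid, __u64 slba)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
	struct nvme_uring_cmd *cmd;

	memset(sqe, 0, 2 * sizeof(*sqe));	/* big SQE (SQE128) */
	sqe->opcode = IORING_OP_URING_CMD;
	sqe->flags = IOSQE_FIXED_FILE;
	sqe->fd = 0;				/* fixed-file index of /dev/ng0n1 */
	sqe->cmd_op = NVME_URING_CMD_IO;
	sqe->uring_cmd_flags = IORING_URING_CMD_FIXED |	/* registered buffer */
			       IORING_URING_CMD_DIRECT;	/* use the raw queue */
	sqe->buf_index = 0;

	cmd = (struct nvme_uring_cmd *)sqe->cmd;
	cmd->opcode = 0x02;			/* nvme read */
	cmd->nsid = nsid;
	cmd->addr = (__u64)(uintptr_t)buf;
	cmd->data_len = 4096;
	cmd->cdw10 = (__u32)slba;
	cmd->cdw11 = (__u32)(slba >> 32);
	cmd->cdw12 = (4096 >> 9) - 1;		/* 0-based count of 512b blocks */
}

If the kernel cannot honour the flag for a given IO, submission falls
back to the existing passthrough path (patch 10); setting the flag
without a registered queue is rejected at prep time (patch 7).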

4. nvme uring-cmd understands the above flag. It submits the command
into the known pre-registered queue, and completes (polled-completion)
from it. The transformation from "struct io_uring_cmd" to "nvme command"
is done directly, without building other intermediate constructs.
[Refer: patch 8, 10, 12]

Testing and Performance
=======================
fio and t/io_uring are modified to exercise this path (a sample fio
invocation follows the list).
- fio: new "registerqueues" option
- t/io_uring: new "k" option
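
A hypothetical fio invocation for this path ("registerqueues" is the
option added in the linked fio branch; the other options are the stock
io_uring_cmd engine options, and exact spellings may differ there):

# fio --name=rawq --ioengine=io_uring_cmd --cmd_type=nvme \
      --filename=/dev/ng0n1 --rw=randread --bs=512 --iodepth=64 \
      --fixedbufs --registerfiles --hipri --registerqueues=1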

Good part:
2.96M -> 5.02M IOPS

nvme io (without this):
# t/io_uring -b512 -d64 -c2 -s2 -p1 -F1 -B1 -O0 -n1 -u1 -r4 -k0 /dev/ng0n1
submitter=0, tid=2922, file=/dev/ng0n1, node=-1
polled=1, fixedbufs=1/0, register_files=1, buffered=1, register_queues=0 QD=64
Engine=io_uring, sq_ring=64, cq_ring=64
IOPS=2.89M, BW=1412MiB/s, IOS/call=2/1
IOPS=2.92M, BW=1426MiB/s, IOS/call=2/2
IOPS=2.96M, BW=1444MiB/s, IOS/call=2/1
Exiting on timeout
Maximum IOPS=2.96M

nvme io (with this):
# t/io_uring -b512 -d64 -c2 -s2 -p1 -F1 -B1 -O0 -n1 -u1 -r4 -k1 /dev/ng0n1
submitter=0, tid=2927, file=/dev/ng0n1, node=-1
polled=1, fixedbufs=1/0, register_files=1, buffered=1, register_queues=1 QD=64
Engine=io_uring, sq_ring=64, cq_ring=64
IOPS=4.99M, BW=2.43GiB/s, IOS/call=2/1
IOPS=5.02M, BW=2.45GiB/s, IOS/call=2/1
IOPS=5.02M, BW=2.45GiB/s, IOS/call=2/1
Exiting on timeout
Maximum IOPS=5.02M

Not so good part:
While a single IO is fast this way, we do not have batching abilities
for the multi-io scenario. Plugging, submission batching and completion
batching are tied to block-layer constructs. Things should look better
if we could do something about that.
In particular, something seems off with the completion-batching.

With -s32 and -c32, the numbers decline:

# t/io_uring -b512 -d64 -c32 -s32 -p1 -F1 -B1 -O0 -n1 -u1 -r4 -k1 /dev/ng0n1
submitter=0, tid=3674, file=/dev/ng0n1, node=-1
polled=1, fixedbufs=1/0, register_files=1, buffered=1, register_queues=1 QD=64
Engine=io_uring, sq_ring=64, cq_ring=64
IOPS=3.70M, BW=1806MiB/s, IOS/call=32/31
IOPS=3.71M, BW=1812MiB/s, IOS/call=32/31
IOPS=3.71M, BW=1812MiB/s, IOS/call=32/32
Exiting on timeout
Maximum IOPS=3.71M

And perf gets restored if we go back to -c2:

# t/io_uring -b512 -d64 -c2 -s32 -p1 -F1 -B1 -O0 -n1 -u1 -r4 -k1 /dev/ng0n1
submitter=0, tid=3677, file=/dev/ng0n1, node=-1
polled=1, fixedbufs=1/0, register_files=1, buffered=1, register_queues=1 QD=64
Engine=io_uring, sq_ring=64, cq_ring=64
IOPS=4.99M, BW=2.44GiB/s, IOS/call=5/5
IOPS=5.02M, BW=2.45GiB/s, IOS/call=5/5
IOPS=5.02M, BW=2.45GiB/s, IOS/call=5/5
Exiting on timeout
Maximum IOPS=5.02M

Source
======
Kernel: https://github.com/OpenMPDK/linux/tree/feat/directq-v1
fio: https://github.com/OpenMPDK/fio/commits/feat/rawq-v2

Please take a look.

[1]
https://lore.kernel.org/io-uring/[email protected]/

Anuj Gupta (5):
  fs, block: interface to register/unregister the raw-queue
  io_uring, fs: plumb support to register/unregister raw-queue
  nvme: wire-up register/unregister queue f_op callback
  block: add mq_ops to submit and complete commands from raw-queue
  pci: modify nvme_setup_prp_simple parameters

Kanchan Joshi (7):
  nvme: refactor nvme_alloc_io_tag_set
  pci: enable "raw_queues = N" module parameter
  pci: implement register/unregister functionality
  io_uring: support for using registered queue in uring-cmd
  nvme: carve out a helper to prepare nvme_command from ioucmd->cmd
  nvme: submission/completion of uring_cmd to/from the registered queue
  pci: implement submission/completion for rawq commands

 drivers/nvme/host/core.c       |  31 ++-
 drivers/nvme/host/fc.c         |   3 +-
 drivers/nvme/host/ioctl.c      | 234 +++++++++++++++----
 drivers/nvme/host/multipath.c  |   2 +
 drivers/nvme/host/nvme.h       |  19 +-
 drivers/nvme/host/pci.c        | 409 +++++++++++++++++++++++++++++++--
 drivers/nvme/host/rdma.c       |   2 +-
 drivers/nvme/host/tcp.c        |   3 +-
 drivers/nvme/target/loop.c     |   3 +-
 fs/file.c                      |  14 ++
 include/linux/blk-mq.h         |   5 +
 include/linux/fs.h             |   4 +
 include/linux/io_uring.h       |   6 +
 include/linux/io_uring_types.h |   3 +
 include/uapi/linux/io_uring.h  |   6 +
 io_uring/io_uring.c            |  60 +++++
 io_uring/uring_cmd.c           |  14 +-
 17 files changed, 739 insertions(+), 79 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [RFC PATCH 01/12] nvme: refactor nvme_alloc_io_tag_set
       [not found]   ` <CGME20230429094238epcas5p4efa3dc785fa54ab974852c7f90113025@epcas5p4.samsung.com>
@ 2023-04-29  9:39     ` Kanchan Joshi
  0 siblings, 0 replies; 15+ messages in thread
From: Kanchan Joshi @ 2023-04-29  9:39 UTC (permalink / raw)
  To: axboe, hch, sagi, kbusch
  Cc: io-uring, linux-nvme, linux-block, gost.dev, anuj1072538,
	xiaoguang.wang, Kanchan Joshi

Allow this function to take nr_hw_queues as a parameter, and change all
the callers. This is in preparation for introducing queues which do not
have to be registered with the block-layer.

Signed-off-by: Kanchan Joshi <[email protected]>
---
 drivers/nvme/host/core.c   | 5 +++--
 drivers/nvme/host/fc.c     | 3 ++-
 drivers/nvme/host/nvme.h   | 2 +-
 drivers/nvme/host/pci.c    | 3 ++-
 drivers/nvme/host/rdma.c   | 2 +-
 drivers/nvme/host/tcp.c    | 3 ++-
 drivers/nvme/target/loop.c | 3 ++-
 7 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 1bfd52eae2ee..ba476c48d566 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4962,7 +4962,7 @@ EXPORT_SYMBOL_GPL(nvme_remove_admin_tag_set);
 
 int nvme_alloc_io_tag_set(struct nvme_ctrl *ctrl, struct blk_mq_tag_set *set,
 		const struct blk_mq_ops *ops, unsigned int nr_maps,
-		unsigned int cmd_size)
+		unsigned int cmd_size, unsigned int nr_hw_queues)
 {
 	int ret;
 
@@ -4983,9 +4983,10 @@ int nvme_alloc_io_tag_set(struct nvme_ctrl *ctrl, struct blk_mq_tag_set *set,
 		set->flags |= BLK_MQ_F_BLOCKING;
 	set->cmd_size = cmd_size,
 	set->driver_data = ctrl;
-	set->nr_hw_queues = ctrl->queue_count - 1;
+	set->nr_hw_queues = nr_hw_queues - 1;
 	set->timeout = NVME_IO_TIMEOUT;
 	set->nr_maps = nr_maps;
+
 	ret = blk_mq_alloc_tag_set(set);
 	if (ret)
 		return ret;
diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index 456ee42a6133..c7dd084c1999 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -2918,7 +2918,8 @@ nvme_fc_create_io_queues(struct nvme_fc_ctrl *ctrl)
 	ret = nvme_alloc_io_tag_set(&ctrl->ctrl, &ctrl->tag_set,
 			&nvme_fc_mq_ops, 1,
 			struct_size((struct nvme_fcp_op_w_sgl *)NULL, priv,
-				    ctrl->lport->ops->fcprqst_priv_sz));
+				    ctrl->lport->ops->fcprqst_priv_sz),
+				    ctrl->ctrl.queue_count);
 	if (ret)
 		return ret;
 
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index bf46f122e9e1..73992dc9dec7 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -748,7 +748,7 @@ int nvme_alloc_admin_tag_set(struct nvme_ctrl *ctrl, struct blk_mq_tag_set *set,
 void nvme_remove_admin_tag_set(struct nvme_ctrl *ctrl);
 int nvme_alloc_io_tag_set(struct nvme_ctrl *ctrl, struct blk_mq_tag_set *set,
 		const struct blk_mq_ops *ops, unsigned int nr_maps,
-		unsigned int cmd_size);
+		unsigned int cmd_size, unsigned int nr_hw_queues);
 void nvme_remove_io_tag_set(struct nvme_ctrl *ctrl);
 
 void nvme_remove_namespaces(struct nvme_ctrl *ctrl);
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 7f25c0fe3a0b..3a38ee6ee129 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -3033,7 +3033,8 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 
 	if (dev->online_queues > 1) {
 		nvme_alloc_io_tag_set(&dev->ctrl, &dev->tagset, &nvme_mq_ops,
-				nvme_pci_nr_maps(dev), sizeof(struct nvme_iod));
+				nvme_pci_nr_maps(dev), sizeof(struct nvme_iod),
+				dev->ctrl.queue_count);
 		nvme_dbbuf_set(dev);
 	}
 
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 0eb79696fb73..2e2e131edc62 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -795,7 +795,7 @@ static int nvme_rdma_alloc_tag_set(struct nvme_ctrl *ctrl)
 	return nvme_alloc_io_tag_set(ctrl, &to_rdma_ctrl(ctrl)->tag_set,
 			&nvme_rdma_mq_ops,
 			ctrl->opts->nr_poll_queues ? HCTX_MAX_TYPES : 2,
-			cmd_size);
+			cmd_size, ctrl->queue_count);
 }
 
 static void nvme_rdma_destroy_admin_queue(struct nvme_rdma_ctrl *ctrl)
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index bf0230442d57..50efbd724284 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1893,7 +1893,8 @@ static int nvme_tcp_configure_io_queues(struct nvme_ctrl *ctrl, bool new)
 		ret = nvme_alloc_io_tag_set(ctrl, &to_tcp_ctrl(ctrl)->tag_set,
 				&nvme_tcp_mq_ops,
 				ctrl->opts->nr_poll_queues ? HCTX_MAX_TYPES : 2,
-				sizeof(struct nvme_tcp_request));
+				sizeof(struct nvme_tcp_request),
+				ctrl->queue_count);
 		if (ret)
 			goto out_free_io_queues;
 	}
diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
index f2d24b2d992f..f02419b537f7 100644
--- a/drivers/nvme/target/loop.c
+++ b/drivers/nvme/target/loop.c
@@ -496,7 +496,8 @@ static int nvme_loop_create_io_queues(struct nvme_loop_ctrl *ctrl)
 	ret = nvme_alloc_io_tag_set(&ctrl->ctrl, &ctrl->tag_set,
 			&nvme_loop_mq_ops, 1,
 			sizeof(struct nvme_loop_iod) +
-			NVME_INLINE_SG_CNT * sizeof(struct scatterlist));
+			NVME_INLINE_SG_CNT * sizeof(struct scatterlist),
+			ctrl->ctrl.queue_count);
 	if (ret)
 		goto out_destroy_queues;
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC PATCH 02/12] pci: enable "raw_queues = N" module parameter
       [not found]   ` <CGME20230429094240epcas5p1a7411f266412244115411b05da509e4a@epcas5p1.samsung.com>
@ 2023-04-29  9:39     ` Kanchan Joshi
  0 siblings, 0 replies; 15+ messages in thread
From: Kanchan Joshi @ 2023-04-29  9:39 UTC (permalink / raw)
  To: axboe, hch, sagi, kbusch
  Cc: io-uring, linux-nvme, linux-block, gost.dev, anuj1072538,
	xiaoguang.wang, Kanchan Joshi

Add the infrastructure that carves out N nvme queue-pairs (struct
nvme_queue) which are not registered with the block layer.
The last N entries in dev->queues[] are available to be attached
on demand.
Similar to poll_queues, these are interrupt-disabled.

This patch does not introduce the interface to attach/detach these
queues to/from any user. That follows in subsequent patches.

Signed-off-by: Kanchan Joshi <[email protected]>
---
 drivers/nvme/host/pci.c | 49 +++++++++++++++++++++++++++++------------
 1 file changed, 35 insertions(+), 14 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 3a38ee6ee129..d366a76cc304 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -101,6 +101,10 @@ static unsigned int poll_queues;
 module_param_cb(poll_queues, &io_queue_count_ops, &poll_queues, 0644);
 MODULE_PARM_DESC(poll_queues, "Number of queues to use for polled IO.");
 
+static unsigned int raw_queues;
+module_param_cb(raw_queues, &io_queue_count_ops, &raw_queues, 0644);
+MODULE_PARM_DESC(raw_queues, "Number of polled, unmanaged queues.");
+
 static bool noacpi;
 module_param(noacpi, bool, 0444);
 MODULE_PARM_DESC(noacpi, "disable acpi bios quirks");
@@ -159,6 +163,7 @@ struct nvme_dev {
 	unsigned int nr_allocated_queues;
 	unsigned int nr_write_queues;
 	unsigned int nr_poll_queues;
+	unsigned int nr_raw_queues;
 };
 
 static int io_queue_depth_set(const char *val, const struct kernel_param *kp)
@@ -209,6 +214,7 @@ struct nvme_queue {
 #define NVMEQ_SQ_CMB		1
 #define NVMEQ_DELETE_ERROR	2
 #define NVMEQ_POLLED		3
+#define NVMEQ_RAW		4
 	__le32 *dbbuf_sq_db;
 	__le32 *dbbuf_cq_db;
 	__le32 *dbbuf_sq_ei;
@@ -1599,7 +1605,8 @@ static int nvme_setup_io_queues_trylock(struct nvme_dev *dev)
 	return 0;
 }
 
-static int nvme_create_queue(struct nvme_queue *nvmeq, int qid, bool polled)
+static int nvme_create_queue(struct nvme_queue *nvmeq, int qid, bool polled,
+				bool rawq)
 {
 	struct nvme_dev *dev = nvmeq->dev;
 	int result;
@@ -1613,8 +1620,11 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid, bool polled)
 	 */
 	if (!polled)
 		vector = dev->num_vecs == 1 ? 0 : qid;
-	else
+	else {
 		set_bit(NVMEQ_POLLED, &nvmeq->flags);
+		if (rawq)
+			set_bit(NVMEQ_RAW, &nvmeq->flags);
+	}
 
 	result = adapter_alloc_cq(dev, qid, nvmeq, vector);
 	if (result)
@@ -1770,7 +1780,7 @@ static int nvme_pci_configure_admin_queue(struct nvme_dev *dev)
 
 static int nvme_create_io_queues(struct nvme_dev *dev)
 {
-	unsigned i, max, rw_queues;
+	unsigned i, max, rw_queues, rw_poll_queues;
 	int ret = 0;
 
 	for (i = dev->ctrl.queue_count; i <= dev->max_qid; i++) {
@@ -1781,17 +1791,20 @@ static int nvme_create_io_queues(struct nvme_dev *dev)
 	}
 
 	max = min(dev->max_qid, dev->ctrl.queue_count - 1);
-	if (max != 1 && dev->io_queues[HCTX_TYPE_POLL]) {
+	if (max != 1 &&
+		(dev->io_queues[HCTX_TYPE_POLL] || dev->nr_raw_queues)) {
 		rw_queues = dev->io_queues[HCTX_TYPE_DEFAULT] +
 				dev->io_queues[HCTX_TYPE_READ];
+		rw_poll_queues = rw_queues + dev->io_queues[HCTX_TYPE_POLL];
 	} else {
 		rw_queues = max;
+		rw_poll_queues = max;
 	}
-
 	for (i = dev->online_queues; i <= max; i++) {
 		bool polled = i > rw_queues;
+		bool rawq = i > rw_poll_queues;
 
-		ret = nvme_create_queue(&dev->queues[i], i, polled);
+		ret = nvme_create_queue(&dev->queues[i], i, polled, rawq);
 		if (ret)
 			break;
 	}
@@ -2212,7 +2225,7 @@ static int nvme_setup_irqs(struct nvme_dev *dev, unsigned int nr_io_queues)
 		.calc_sets	= nvme_calc_irq_sets,
 		.priv		= dev,
 	};
-	unsigned int irq_queues, poll_queues;
+	unsigned int irq_queues, poll_queues, raw_queues;
 
 	/*
 	 * Poll queues don't need interrupts, but we need at least one I/O queue
@@ -2220,6 +2233,7 @@ static int nvme_setup_irqs(struct nvme_dev *dev, unsigned int nr_io_queues)
 	 */
 	poll_queues = min(dev->nr_poll_queues, nr_io_queues - 1);
 	dev->io_queues[HCTX_TYPE_POLL] = poll_queues;
+	raw_queues = dev->nr_raw_queues;
 
 	/*
 	 * Initialize for the single interrupt case, will be updated in
@@ -2235,7 +2249,7 @@ static int nvme_setup_irqs(struct nvme_dev *dev, unsigned int nr_io_queues)
 	 */
 	irq_queues = 1;
 	if (!(dev->ctrl.quirks & NVME_QUIRK_SINGLE_VECTOR))
-		irq_queues += (nr_io_queues - poll_queues);
+		irq_queues += (nr_io_queues - poll_queues - raw_queues);
 	return pci_alloc_irq_vectors_affinity(pdev, 1, irq_queues,
 			      PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY, &affd);
 }
@@ -2248,7 +2262,9 @@ static unsigned int nvme_max_io_queues(struct nvme_dev *dev)
 	 */
 	if (dev->ctrl.quirks & NVME_QUIRK_SHARED_TAGS)
 		return 1;
-	return num_possible_cpus() + dev->nr_write_queues + dev->nr_poll_queues;
+
+	return num_possible_cpus() + dev->nr_write_queues + dev->nr_poll_queues
+		+ dev->nr_raw_queues;
 }
 
 static int nvme_setup_io_queues(struct nvme_dev *dev)
@@ -2265,6 +2281,7 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 	 */
 	dev->nr_write_queues = write_queues;
 	dev->nr_poll_queues = poll_queues;
+	dev->nr_raw_queues = raw_queues;
 
 	nr_io_queues = dev->nr_allocated_queues - 1;
 	result = nvme_set_queue_count(&dev->ctrl, &nr_io_queues);
@@ -2329,7 +2346,8 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 
 	dev->num_vecs = result;
 	result = max(result - 1, 1);
-	dev->max_qid = result + dev->io_queues[HCTX_TYPE_POLL];
+	dev->max_qid = result + dev->io_queues[HCTX_TYPE_POLL] +
+			dev->nr_raw_queues;
 
 	/*
 	 * Should investigate if there's a performance win from allocating
@@ -2356,10 +2374,11 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 		nvme_suspend_io_queues(dev);
 		goto retry;
 	}
-	dev_info(dev->ctrl.device, "%d/%d/%d default/read/poll queues\n",
+	dev_info(dev->ctrl.device, "%d/%d/%d/%d default/read/poll queues/raw queues\n",
 					dev->io_queues[HCTX_TYPE_DEFAULT],
 					dev->io_queues[HCTX_TYPE_READ],
-					dev->io_queues[HCTX_TYPE_POLL]);
+					dev->io_queues[HCTX_TYPE_POLL],
+					dev->nr_raw_queues);
 	return 0;
 out_unlock:
 	mutex_unlock(&dev->shutdown_lock);
@@ -2457,7 +2476,8 @@ static unsigned int nvme_pci_nr_maps(struct nvme_dev *dev)
 
 static void nvme_pci_update_nr_queues(struct nvme_dev *dev)
 {
-	blk_mq_update_nr_hw_queues(&dev->tagset, dev->online_queues - 1);
+	blk_mq_update_nr_hw_queues(&dev->tagset,
+			dev->online_queues - dev->nr_raw_queues - 1);
 	/* free previously allocated queues that are no longer usable */
 	nvme_free_queues(dev, dev->online_queues);
 }
@@ -2921,6 +2941,7 @@ static struct nvme_dev *nvme_pci_alloc_dev(struct pci_dev *pdev,
 
 	dev->nr_write_queues = write_queues;
 	dev->nr_poll_queues = poll_queues;
+	dev->nr_raw_queues = raw_queues;
 	dev->nr_allocated_queues = nvme_max_io_queues(dev) + 1;
 	dev->queues = kcalloc_node(dev->nr_allocated_queues,
 			sizeof(struct nvme_queue), GFP_KERNEL, node);
@@ -3034,7 +3055,7 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	if (dev->online_queues > 1) {
 		nvme_alloc_io_tag_set(&dev->ctrl, &dev->tagset, &nvme_mq_ops,
 				nvme_pci_nr_maps(dev), sizeof(struct nvme_iod),
-				dev->ctrl.queue_count);
+				dev->ctrl.queue_count - dev->nr_raw_queues);
 		nvme_dbbuf_set(dev);
 	}
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC PATCH 03/12] fs, block: interface to register/unregister the raw-queue
       [not found]   ` <CGME20230429094243epcas5p13be3ca62dc2b03299d09cafaf11923c1@epcas5p1.samsung.com>
@ 2023-04-29  9:39     ` Kanchan Joshi
  0 siblings, 0 replies; 15+ messages in thread
From: Kanchan Joshi @ 2023-04-29  9:39 UTC (permalink / raw)
  To: axboe, hch, sagi, kbusch
  Cc: io-uring, linux-nvme, linux-block, gost.dev, anuj1072538,
	xiaoguang.wang, Anuj Gupta, Kanchan Joshi

From: Anuj Gupta <[email protected]>

Introduce register/unregister callbacks in fops and mq_ops, so that
callers can attach to/detach from raw-queues.

Signed-off-by: Anuj Gupta <[email protected]>
Signed-off-by: Kanchan Joshi <[email protected]>
---
 include/linux/blk-mq.h | 2 ++
 include/linux/fs.h     | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 06caacd77ed6..7d6790be4847 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -657,6 +657,8 @@ struct blk_mq_ops {
 	 * information about a request.
 	 */
 	void (*show_rq)(struct seq_file *m, struct request *rq);
+	int (*register_queue)(void *data);
+	int (*unregister_queue)(void *data, int qid);
 #endif
 };
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 475d88640d3d..79acccc5e7d4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1798,6 +1798,8 @@ struct file_operations {
 	int (*uring_cmd)(struct io_uring_cmd *ioucmd, unsigned int issue_flags);
 	int (*uring_cmd_iopoll)(struct io_uring_cmd *, struct io_comp_batch *,
 				unsigned int poll_flags);
+	int (*register_queue)(struct file *f);
+	int (*unregister_queue)(struct file *f, int qid);
 } __randomize_layout;
 
 struct inode_operations {
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC PATCH 04/12] io_uring, fs: plumb support to register/unregister raw-queue
       [not found]   ` <CGME20230429094245epcas5p2843abc5cd54ffe301d36459543bcd228@epcas5p2.samsung.com>
@ 2023-04-29  9:39     ` Kanchan Joshi
  0 siblings, 0 replies; 15+ messages in thread
From: Kanchan Joshi @ 2023-04-29  9:39 UTC (permalink / raw)
  To: axboe, hch, sagi, kbusch
  Cc: io-uring, linux-nvme, linux-block, gost.dev, anuj1072538,
	xiaoguang.wang, Anuj Gupta, Kanchan Joshi

From: Anuj Gupta <[email protected]>

Extend io_uring's registration interface with

- IORING_REGISTER_QUEUE: to ask for a queue. It goes down via
fops->register_queue and returns the identifier of the queue. This qid
is stored in the ring's ctx.

- IORING_UNREGISTER_QUEUE: to return the previously registered queue.

At most one queue is allowed to be attached to a ring.
The file for which the queue is requested is expected to be in the
registered file-set.

Signed-off-by: Kanchan Joshi <[email protected]>
Signed-off-by: Anuj Gupta <[email protected]>
---
 fs/file.c                      | 14 ++++++++
 include/linux/fs.h             |  2 ++
 include/linux/io_uring_types.h |  3 ++
 include/uapi/linux/io_uring.h  |  4 +++
 io_uring/io_uring.c            | 60 ++++++++++++++++++++++++++++++++++
 5 files changed, 83 insertions(+)

diff --git a/fs/file.c b/fs/file.c
index 7893ea161d77..7dada9cd0911 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -1318,3 +1318,17 @@ int iterate_fd(struct files_struct *files, unsigned n,
 	return res;
 }
 EXPORT_SYMBOL(iterate_fd);
+
+int file_register_queue(struct file *file)
+{
+	if (file->f_op->register_queue)
+		return file->f_op->register_queue(file);
+	return -EINVAL;
+}
+
+int file_unregister_queue(struct file *file, int qid)
+{
+	if (file->f_op->unregister_queue)
+		return file->f_op->unregister_queue(file, qid);
+	return -EINVAL;
+}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 79acccc5e7d4..0a82aac6868b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3190,5 +3190,7 @@ extern int vfs_fadvise(struct file *file, loff_t offset, loff_t len,
 		       int advice);
 extern int generic_fadvise(struct file *file, loff_t offset, loff_t len,
 			   int advice);
+int file_register_queue(struct file *file);
+int file_unregister_queue(struct file *file, int qid);
 
 #endif /* _LINUX_FS_H */
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 1b2a20a42413..8d4e721493d6 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -364,6 +364,9 @@ struct io_ring_ctx {
 	unsigned			sq_thread_idle;
 	/* protected by ->completion_lock */
 	unsigned			evfd_last_cq_tail;
+	/* for io_uring attached device queue */
+	int				dev_qid;
+	int				dev_fd;
 };
 
 struct io_tw_state {
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 0716cb17e436..a9d59bfd26f7 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -523,6 +523,10 @@ enum {
 	/* register a range of fixed file slots for automatic slot allocation */
 	IORING_REGISTER_FILE_ALLOC_RANGE	= 25,
 
+	/* register a device-queue with the ring */
+	IORING_REGISTER_QUEUE			= 26,
+	IORING_UNREGISTER_QUEUE			= 27,
+
 	/* this goes last */
 	IORING_REGISTER_LAST,
 
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 3bca7a79efda..5a9b7adf438e 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -337,6 +337,8 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 	INIT_WQ_LIST(&ctx->locked_free_list);
 	INIT_DELAYED_WORK(&ctx->fallback_work, io_fallback_req_func);
 	INIT_WQ_LIST(&ctx->submit_state.compl_reqs);
+	/* -EINVAL implies nothing is registered with this ring */
+	ctx->dev_qid = -EINVAL;
 	return ctx;
 err:
 	kfree(ctx->dummy_ubuf);
@@ -2822,6 +2824,51 @@ static void io_req_caches_free(struct io_ring_ctx *ctx)
 	mutex_unlock(&ctx->uring_lock);
 }
 
+static int io_register_queue(struct io_ring_ctx *ctx, void __user *arg)
+{
+	struct io_fixed_file *file_slot;
+	struct file *file;
+	__s32 __user *fds = arg;
+	int fd, qid;
+
+	if (ctx->dev_qid != -EINVAL)
+		return -EINVAL;
+	if (copy_from_user(&fd, fds, sizeof(*fds)))
+		return -EFAULT;
+	file_slot = io_fixed_file_slot(&ctx->file_table,
+			array_index_nospec(fd, ctx->nr_user_files));
+	if (!file_slot->file_ptr)
+		return -EBADF;
+	file = (struct file *)(file_slot->file_ptr & FFS_MASK);
+	qid = file_register_queue(file);
+	if (qid < 0)
+		return qid;
+	ctx->dev_fd = fd;
+	ctx->dev_qid = qid;
+	return 0;
+}
+
+static int io_unregister_queue(struct io_ring_ctx *ctx)
+{
+	struct io_fixed_file *file_slot;
+	struct file *file;
+	int ret;
+
+	if (ctx->dev_qid == -EINVAL)
+		return 0;
+	file_slot = io_fixed_file_slot(&ctx->file_table,
+			array_index_nospec(ctx->dev_fd, ctx->nr_user_files));
+	if (!file_slot)
+		return -EBADF;
+	if (!file_slot->file_ptr)
+		return -EBADF;
+	file = (struct file *)(file_slot->file_ptr & FFS_MASK);
+	ret = file_unregister_queue(file, ctx->dev_qid);
+	if (!ret)
+		ctx->dev_qid = -EINVAL;
+	return ret;
+}
+
 static void io_rsrc_node_cache_free(struct io_cache_entry *entry)
 {
 	kfree(container_of(entry, struct io_rsrc_node, cache));
@@ -2835,6 +2882,7 @@ static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
 		return;
 
 	mutex_lock(&ctx->uring_lock);
+	io_unregister_queue(ctx);
 	if (ctx->buf_data)
 		__io_sqe_buffers_unregister(ctx);
 	if (ctx->file_data)
@@ -4418,6 +4466,18 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
 			break;
 		ret = io_register_file_alloc_range(ctx, arg);
 		break;
+	case IORING_REGISTER_QUEUE:
+		ret = -EINVAL;
+		if (!arg || nr_args != 1)
+			break;
+		ret = io_register_queue(ctx, arg);
+		break;
+	case IORING_UNREGISTER_QUEUE:
+		ret = -EINVAL;
+		if (arg || nr_args)
+			break;
+		ret = io_unregister_queue(ctx);
+		break;
 	default:
 		ret = -EINVAL;
 		break;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC PATCH 05/12] nvme: wire-up register/unregister queue f_op callback
       [not found]   ` <CGME20230429094247epcas5p333e0f515000de60fb64dc2590cf9fcd8@epcas5p3.samsung.com>
@ 2023-04-29  9:39     ` Kanchan Joshi
  0 siblings, 0 replies; 15+ messages in thread
From: Kanchan Joshi @ 2023-04-29  9:39 UTC (permalink / raw)
  To: axboe, hch, sagi, kbusch
  Cc: io-uring, linux-nvme, linux-block, gost.dev, anuj1072538,
	xiaoguang.wang, Anuj Gupta, Kanchan Joshi

From: Anuj Gupta <[email protected]>

Implement register/unregister handlers for the nvme char-device files.

Signed-off-by: Anuj Gupta <[email protected]>
Signed-off-by: Kanchan Joshi <[email protected]>
---
 drivers/nvme/host/core.c      | 26 +++++++++++++++++++
 drivers/nvme/host/ioctl.c     | 48 +++++++++++++++++++++++++++++++++++
 drivers/nvme/host/multipath.c |  2 ++
 drivers/nvme/host/nvme.h      |  2 ++
 4 files changed, 78 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index ba476c48d566..4462ce50d076 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4059,6 +4059,30 @@ static int nvme_ns_chr_release(struct inode *inode, struct file *file)
 	return 0;
 }
 
+int nvme_register_queue(struct file *file)
+{
+	struct nvme_ns *ns = container_of(file_inode(file)->i_cdev,
+			struct nvme_ns, cdev);
+	struct nvme_ctrl *ctrl = ns->ctrl;
+	struct request_queue *q = ns ? ns->queue : ctrl->admin_q;
+
+	if (q->mq_ops && q->mq_ops->register_queue)
+		return q->mq_ops->register_queue(ns);
+	return -EINVAL;
+}
+
+int nvme_unregister_queue(struct file *file, int qid)
+{
+	struct nvme_ns *ns = container_of(file_inode(file)->i_cdev,
+			struct nvme_ns, cdev);
+	struct nvme_ctrl *ctrl = ns->ctrl;
+	struct request_queue *q = ns ? ns->queue : ctrl->admin_q;
+
+	if (q->mq_ops && q->mq_ops->unregister_queue)
+		return q->mq_ops->unregister_queue(ns, qid);
+	return -EINVAL;
+}
+
 static const struct file_operations nvme_ns_chr_fops = {
 	.owner		= THIS_MODULE,
 	.open		= nvme_ns_chr_open,
@@ -4067,6 +4091,8 @@ static const struct file_operations nvme_ns_chr_fops = {
 	.compat_ioctl	= compat_ptr_ioctl,
 	.uring_cmd	= nvme_ns_chr_uring_cmd,
 	.uring_cmd_iopoll = nvme_ns_chr_uring_cmd_iopoll,
+	.register_queue	= nvme_register_queue,
+	.unregister_queue = nvme_unregister_queue,
 };
 
 static int nvme_add_ns_cdev(struct nvme_ns *ns)
diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index d24ea2e05156..292a578686b6 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -871,6 +871,54 @@ long nvme_ns_head_chr_ioctl(struct file *file, unsigned int cmd,
 	return ret;
 }
 
+int nvme_ns_head_register_queue(struct file *file)
+{
+	struct cdev *cdev = file_inode(file)->i_cdev;
+	struct nvme_ns_head *head =
+		container_of(cdev, struct nvme_ns_head, cdev);
+	struct nvme_ns *ns;
+	struct nvme_ctrl *ctrl;
+	struct request_queue *q;
+	int srcu_idx, ret = -EINVAL;
+
+	srcu_idx = srcu_read_lock(&head->srcu);
+	ns = nvme_find_path(head);
+	if (!ns)
+		goto out_unlock;
+
+	ctrl = ns->ctrl;
+	q = ns ? ns->queue : ctrl->admin_q;
+	if (q->mq_ops && q->mq_ops->register_queue)
+		ret = q->mq_ops->register_queue(ns);
+out_unlock:
+	srcu_read_unlock(&head->srcu, srcu_idx);
+	return ret;
+}
+
+int nvme_ns_head_unregister_queue(struct file *file, int qid)
+{
+	struct cdev *cdev = file_inode(file)->i_cdev;
+	struct nvme_ns_head *head =
+		container_of(cdev, struct nvme_ns_head, cdev);
+	struct nvme_ns *ns;
+	struct nvme_ctrl *ctrl;
+	struct request_queue *q;
+	int srcu_idx, ret = -EINVAL;
+
+	srcu_idx = srcu_read_lock(&head->srcu);
+	ns = nvme_find_path(head);
+	if (!ns)
+		goto out_unlock;
+
+	ctrl = ns->ctrl;
+	q = ns ? ns->queue : ctrl->admin_q;
+	if (q->mq_ops && q->mq_ops->unregister_queue)
+		ret = q->mq_ops->unregister_queue(ns, qid);
+out_unlock:
+	srcu_read_unlock(&head->srcu, srcu_idx);
+	return ret;
+}
+
 int nvme_ns_head_chr_uring_cmd(struct io_uring_cmd *ioucmd,
 		unsigned int issue_flags)
 {
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 9171452e2f6d..eed30daf1a37 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -471,6 +471,8 @@ static const struct file_operations nvme_ns_head_chr_fops = {
 	.compat_ioctl	= compat_ptr_ioctl,
 	.uring_cmd	= nvme_ns_head_chr_uring_cmd,
 	.uring_cmd_iopoll = nvme_ns_head_chr_uring_cmd_iopoll,
+	.register_queue	= nvme_ns_head_register_queue,
+	.unregister_queue = nvme_ns_head_unregister_queue,
 };
 
 static int nvme_add_ns_head_cdev(struct nvme_ns_head *head)
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 73992dc9dec7..4619a7498f8e 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -843,6 +843,8 @@ int nvme_ns_head_ioctl(struct block_device *bdev, fmode_t mode,
 		unsigned int cmd, unsigned long arg);
 long nvme_ns_head_chr_ioctl(struct file *file, unsigned int cmd,
 		unsigned long arg);
+int nvme_ns_head_register_queue(struct file *file);
+int nvme_ns_head_unregister_queue(struct file *file, int qid);
 long nvme_dev_ioctl(struct file *file, unsigned int cmd,
 		unsigned long arg);
 int nvme_ns_chr_uring_cmd_iopoll(struct io_uring_cmd *ioucmd,
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC PATCH 06/12] pci: implement register/unregister functionality
       [not found]   ` <CGME20230429094249epcas5p18bd717f4e34077c0fcf28458f11de8d1@epcas5p1.samsung.com>
@ 2023-04-29  9:39     ` Kanchan Joshi
  0 siblings, 0 replies; 15+ messages in thread
From: Kanchan Joshi @ 2023-04-29  9:39 UTC (permalink / raw)
  To: axboe, hch, sagi, kbusch
  Cc: io-uring, linux-nvme, linux-block, gost.dev, anuj1072538,
	xiaoguang.wang, Kanchan Joshi, Anuj Gupta

Implement the register callback. It checks whether any raw-queue is
available to be attached and, if found, returns its qid.
During queue registration, the iod array and command-id bitmap are also
preallocated.

The unregister callback does the opposite and returns the corresponding
queue to the pool of available raw-queues.

Signed-off-by: Kanchan Joshi <[email protected]>
Signed-off-by: Anuj Gupta <[email protected]>
---
 drivers/nvme/host/pci.c | 154 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 154 insertions(+)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index d366a76cc304..b4498e198e8a 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -116,6 +116,15 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown);
 static void nvme_delete_io_queues(struct nvme_dev *dev);
 static void nvme_update_attrs(struct nvme_dev *dev);
 
+enum {
+	Q_FREE,
+	Q_ALLOC
+};
+struct rawq_info {
+	int nr_free;
+	int q_state[];
+};
+
 /*
  * Represents an NVM Express device.  Each nvme_dev is a PCI function.
  */
@@ -164,6 +173,8 @@ struct nvme_dev {
 	unsigned int nr_write_queues;
 	unsigned int nr_poll_queues;
 	unsigned int nr_raw_queues;
+	struct mutex rawq_lock;
+	struct rawq_info *rawqi;
 };
 
 static int io_queue_depth_set(const char *val, const struct kernel_param *kp)
@@ -195,6 +206,10 @@ struct nvme_queue {
 	struct nvme_dev *dev;
 	spinlock_t sq_lock;
 	void *sq_cmds;
+	 /* only used for raw queues: */
+	unsigned long *cmdid_bmp;
+	spinlock_t cmdid_lock;
+	struct nvme_iod *iod;
 	 /* only used for poll queues: */
 	spinlock_t cq_poll_lock ____cacheline_aligned_in_smp;
 	struct nvme_completion *cqes;
@@ -1661,6 +1676,141 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid, bool polled,
 	return result;
 }
 
+static int setup_rawq_info(struct nvme_dev *dev, int nr_rawq)
+{
+	struct rawq_info *rawqi;
+	int size = sizeof(struct rawq_info) + nr_rawq * sizeof(int);
+
+	rawqi = kzalloc(size, GFP_KERNEL);
+	if (rawqi == NULL)
+		return -ENOMEM;
+	rawqi->nr_free = nr_rawq;
+	dev->rawqi = rawqi;
+	return 0;
+}
+
+static int nvme_pci_get_rawq(struct nvme_dev *dev)
+{
+	int i, qid, nr_rawq;
+	struct rawq_info *rawqi = NULL;
+	int ret = -EINVAL;
+
+	nr_rawq = dev->nr_raw_queues;
+	if (!nr_rawq)
+		return ret;
+
+	mutex_lock(&dev->rawq_lock);
+	if (dev->rawqi == NULL) {
+		ret = setup_rawq_info(dev, nr_rawq);
+		if (ret)
+			goto unlock;
+	}
+	rawqi = dev->rawqi;
+	if (rawqi->nr_free == 0) {
+		ret = -EINVAL;
+		goto unlock;
+	}
+	for (i = 0; i < nr_rawq; i++) {
+		if (rawqi->q_state[i] == Q_FREE) {
+			rawqi->q_state[i] = Q_ALLOC;
+			qid = dev->nr_allocated_queues - nr_rawq - i;
+			rawqi->nr_free--;
+			ret = qid;
+			goto unlock;
+		}
+	}
+unlock:
+	mutex_unlock(&dev->rawq_lock);
+	return ret;
+}
+
+static int nvme_pci_put_rawq(struct nvme_dev *dev, int qid)
+{
+	int i, nr_rawq;
+	struct rawq_info *rawqi = NULL;
+	struct nvme_queue *nvmeq;
+
+	nr_rawq = dev->nr_raw_queues;
+	if (!nr_rawq || dev->rawqi == NULL)
+		return -EINVAL;
+
+	i = dev->nr_allocated_queues - nr_rawq - qid;
+	mutex_lock(&dev->rawq_lock);
+	rawqi = dev->rawqi;
+	if (rawqi->q_state[i] == Q_ALLOC) {
+		rawqi->q_state[i] = Q_FREE;
+		rawqi->nr_free++;
+	}
+	mutex_unlock(&dev->rawq_lock);
+	nvmeq = &dev->queues[qid];
+	kfree(nvmeq->cmdid_bmp);
+	kfree(nvmeq->iod);
+	return 0;
+}
+
+static int nvme_pci_alloc_cmdid_bmp(struct nvme_queue *nvmeq)
+{
+	int size = BITS_TO_LONGS(nvmeq->q_depth) * sizeof(unsigned long);
+
+	if (!test_bit(NVMEQ_RAW, &nvmeq->flags))
+		return -EINVAL;
+	nvmeq->cmdid_bmp = kzalloc(size, GFP_KERNEL);
+	if (!nvmeq->cmdid_bmp)
+		return -ENOMEM;
+	spin_lock_init(&nvmeq->cmdid_lock);
+	return 0;
+}
+
+static int nvme_pci_alloc_iod_array(struct nvme_queue *nvmeq)
+{
+	if (!test_bit(NVMEQ_RAW, &nvmeq->flags))
+		return -EINVAL;
+	nvmeq->iod = kcalloc(nvmeq->q_depth - 1, sizeof(struct nvme_iod),
+				 GFP_KERNEL);
+	if (!nvmeq->iod)
+		return -ENOMEM;
+	return 0;
+}
+
+static int nvme_pci_setup_rawq(struct nvme_queue *nvmeq)
+{
+	int ret;
+
+	ret = nvme_pci_alloc_cmdid_bmp(nvmeq);
+	if (ret)
+		return ret;
+	ret = nvme_pci_alloc_iod_array(nvmeq);
+	if (ret) {
+		kfree(nvmeq->cmdid_bmp);
+		return ret;
+	}
+	return ret;
+}
+
+static int nvme_pci_register_queue(void *data)
+{
+	struct nvme_ns *ns = (struct nvme_ns *) data;
+	struct nvme_dev *dev = to_nvme_dev(ns->ctrl);
+	int qid, ret;
+
+	qid = nvme_pci_get_rawq(dev);
+	if (qid > 0) {
+		/* setup command-id bitmap and iod array */
+		ret = nvme_pci_setup_rawq(&dev->queues[qid]);
+		if (ret < 0)
+			qid = ret;
+	}
+	return qid;
+}
+
+static int nvme_pci_unregister_queue(void *data, int qid)
+{
+	struct nvme_ns *ns = (struct nvme_ns *) data;
+	struct nvme_dev *dev = to_nvme_dev(ns->ctrl);
+
+	return nvme_pci_put_rawq(dev, qid);
+}
+
 static const struct blk_mq_ops nvme_mq_admin_ops = {
 	.queue_rq	= nvme_queue_rq,
 	.complete	= nvme_pci_complete_rq,
@@ -1679,6 +1829,8 @@ static const struct blk_mq_ops nvme_mq_ops = {
 	.map_queues	= nvme_pci_map_queues,
 	.timeout	= nvme_timeout,
 	.poll		= nvme_poll,
+	.register_queue	= nvme_pci_register_queue,
+	.unregister_queue =  nvme_pci_unregister_queue,
 };
 
 static void nvme_dev_remove_admin(struct nvme_dev *dev)
@@ -2698,6 +2850,7 @@ static void nvme_pci_free_ctrl(struct nvme_ctrl *ctrl)
 	nvme_free_tagset(dev);
 	put_device(dev->dev);
 	kfree(dev->queues);
+	kfree(dev->rawqi);
 	kfree(dev);
 }
 
@@ -2938,6 +3091,7 @@ static struct nvme_dev *nvme_pci_alloc_dev(struct pci_dev *pdev,
 		return ERR_PTR(-ENOMEM);
 	INIT_WORK(&dev->ctrl.reset_work, nvme_reset_work);
 	mutex_init(&dev->shutdown_lock);
+	mutex_init(&dev->rawq_lock);
 
 	dev->nr_write_queues = write_queues;
 	dev->nr_poll_queues = poll_queues;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC PATCH 07/12] io_uring: support for using registered queue in uring-cmd
       [not found]   ` <CGME20230429094251epcas5p144d042853e10f090e3119338c2306546@epcas5p1.samsung.com>
@ 2023-04-29  9:39     ` Kanchan Joshi
  0 siblings, 0 replies; 15+ messages in thread
From: Kanchan Joshi @ 2023-04-29  9:39 UTC (permalink / raw)
  To: axboe, hch, sagi, kbusch
  Cc: io-uring, linux-nvme, linux-block, gost.dev, anuj1072538,
	xiaoguang.wang, Kanchan Joshi

Add a new flag IORING_URING_CMD_DIRECT that user-space can specify in
the SQE. If a queue is registered with the ring, this flag goes down to
the provider of ->uring_cmd. The provider may choose to do things
differently or ignore this flag.
Also export a helper that allows retrieving the identifier of the
registered queue.

Signed-off-by: Kanchan Joshi <[email protected]>
---
 include/linux/io_uring.h      |  6 ++++++
 include/uapi/linux/io_uring.h |  2 ++
 io_uring/uring_cmd.c          | 14 +++++++++++++-
 3 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
index 35b9328ca335..bb6f900411e1 100644
--- a/include/linux/io_uring.h
+++ b/include/linux/io_uring.h
@@ -37,6 +37,7 @@ struct io_uring_cmd {
 };
 
 #if defined(CONFIG_IO_URING)
+int io_uring_cmd_import_qid(void *ioucmd);
 int io_uring_cmd_import_fixed(u64 ubuf, unsigned long len, int rw,
 			      struct iov_iter *iter, void *ioucmd);
 void io_uring_cmd_done(struct io_uring_cmd *cmd, ssize_t ret, ssize_t res2,
@@ -67,6 +68,11 @@ static inline void io_uring_free(struct task_struct *tsk)
 		__io_uring_free(tsk);
 }
 #else
+static inline int io_uring_cmd_import_qid(void *ioucmd)
+{
+	return -EOPNOTSUPP;
+}
+
 static inline int io_uring_cmd_import_fixed(u64 ubuf, unsigned long len, int rw,
 			      struct iov_iter *iter, void *ioucmd)
 {
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index a9d59bfd26f7..67fbcfd3f676 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -232,8 +232,10 @@ enum io_uring_op {
  * sqe->uring_cmd_flags
  * IORING_URING_CMD_FIXED	use registered buffer; pass this flag
  *				along with setting sqe->buf_index.
+ * IORING_URING_CMD_DIRECT	use registered queue for this cmd.
  */
 #define IORING_URING_CMD_FIXED	(1U << 0)
+#define IORING_URING_CMD_DIRECT	(1U << 1)
 
 
 /*
diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
index 5113c9a48583..2a543b490045 100644
--- a/io_uring/uring_cmd.c
+++ b/io_uring/uring_cmd.c
@@ -89,9 +89,12 @@ int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 		return -EINVAL;
 
 	ioucmd->flags = READ_ONCE(sqe->uring_cmd_flags);
-	if (ioucmd->flags & ~IORING_URING_CMD_FIXED)
+	if (ioucmd->flags & ~(IORING_URING_CMD_FIXED | IORING_URING_CMD_DIRECT))
 		return -EINVAL;
 
+	if (ioucmd->flags & IORING_URING_CMD_DIRECT &&
+			req->ctx->dev_qid == -EINVAL)
+		return -EINVAL;
 	if (ioucmd->flags & IORING_URING_CMD_FIXED) {
 		struct io_ring_ctx *ctx = req->ctx;
 		u16 index;
@@ -162,3 +165,12 @@ int io_uring_cmd_import_fixed(u64 ubuf, unsigned long len, int rw,
 	return io_import_fixed(rw, iter, req->imu, ubuf, len);
 }
 EXPORT_SYMBOL_GPL(io_uring_cmd_import_fixed);
+
+int io_uring_cmd_import_qid(void *ioucmd)
+{
+	struct io_kiocb *req = cmd_to_io_kiocb(ioucmd);
+	struct io_ring_ctx *ctx = req->ctx;
+
+	return ctx->dev_qid;
+}
+EXPORT_SYMBOL_GPL(io_uring_cmd_import_qid);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC PATCH 08/12] block: add mq_ops to submit and complete commands from raw-queue
       [not found]   ` <CGME20230429094253epcas5p3cfff90e1c003b6fc9c7c4a61287beecb@epcas5p3.samsung.com>
@ 2023-04-29  9:39     ` Kanchan Joshi
  0 siblings, 0 replies; 15+ messages in thread
From: Kanchan Joshi @ 2023-04-29  9:39 UTC (permalink / raw)
  To: axboe, hch, sagi, kbusch
  Cc: io-uring, linux-nvme, linux-block, gost.dev, anuj1072538,
	xiaoguang.wang, Anuj Gupta, Kanchan Joshi

From: Anuj Gupta <[email protected]>

This patch introduces ->queue_uring_cmd and ->poll_uring_cmd in mq_ops,
which allow submitting and completing (via polling) from the specified
raw-queue (denoted by qid).

Signed-off-by: Kanchan Joshi <[email protected]>
Signed-off-by: Anuj Gupta <[email protected]>
---
 include/linux/blk-mq.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 7d6790be4847..dcce2939ff1e 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -659,6 +659,9 @@ struct blk_mq_ops {
 	void (*show_rq)(struct seq_file *m, struct request *rq);
 	int (*register_queue)(void *data);
 	int (*unregister_queue)(void *data, int qid);
+	int (*queue_uring_cmd)(struct io_uring_cmd *ioucmd, int qid);
+	int (*poll_uring_cmd)(struct io_uring_cmd *ioucmd, int qid,
+			      struct io_comp_batch *);
 #endif
 };
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC PATCH 09/12] nvme: carve out a helper to prepare nvme_command from ioucmd->cmd
       [not found]   ` <CGME20230429094255epcas5p11bcbe76772289f27c41a50ce502c998d@epcas5p1.samsung.com>
@ 2023-04-29  9:39     ` Kanchan Joshi
  0 siblings, 0 replies; 15+ messages in thread
From: Kanchan Joshi @ 2023-04-29  9:39 UTC (permalink / raw)
  To: axboe, hch, sagi, kbusch
  Cc: io-uring, linux-nvme, linux-block, gost.dev, anuj1072538,
	xiaoguang.wang, Kanchan Joshi, Anuj Gupta

This helper will be used in the subsequent patch.

Signed-off-by: Kanchan Joshi <[email protected]>
Signed-off-by: Anuj Gupta <[email protected]>
---
 drivers/nvme/host/ioctl.c | 81 +++++++++++++++++++++------------------
 drivers/nvme/host/nvme.h  | 11 ++++++
 2 files changed, 54 insertions(+), 38 deletions(-)

diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index 292a578686b6..18f4f20f5e76 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -430,14 +430,6 @@ static int nvme_user_cmd64(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
 	return status;
 }
 
-struct nvme_uring_data {
-	__u64	metadata;
-	__u64	addr;
-	__u32	data_len;
-	__u32	metadata_len;
-	__u32	timeout_ms;
-};
-
 /*
  * This overlays struct io_uring_cmd pdu.
  * Expect build errors if this grows larger than that.
@@ -548,11 +540,50 @@ static enum rq_end_io_ret nvme_uring_cmd_end_io_meta(struct request *req,
 	return RQ_END_IO_NONE;
 }
 
+int nvme_prep_cmd_from_ioucmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
+		struct io_uring_cmd *ioucmd, struct nvme_command *c,
+		struct nvme_uring_data *d)
+{
+	const struct nvme_uring_cmd *cmd = ioucmd->cmd;
+
+	c->common.opcode = READ_ONCE(cmd->opcode);
+	c->common.flags = READ_ONCE(cmd->flags);
+	if (c->common.flags)
+		return -EINVAL;
+
+	c->common.command_id = 0;
+	c->common.nsid = cpu_to_le32(cmd->nsid);
+	if (!nvme_validate_passthru_nsid(ctrl, ns, le32_to_cpu(c->common.nsid)))
+		return -EINVAL;
+
+	c->common.cdw2[0] = cpu_to_le32(READ_ONCE(cmd->cdw2));
+	c->common.cdw2[1] = cpu_to_le32(READ_ONCE(cmd->cdw3));
+	c->common.metadata = 0;
+	c->common.dptr.prp1 = c->common.dptr.prp2 = 0;
+	c->common.cdw10 = cpu_to_le32(READ_ONCE(cmd->cdw10));
+	c->common.cdw11 = cpu_to_le32(READ_ONCE(cmd->cdw11));
+	c->common.cdw12 = cpu_to_le32(READ_ONCE(cmd->cdw12));
+	c->common.cdw13 = cpu_to_le32(READ_ONCE(cmd->cdw13));
+	c->common.cdw14 = cpu_to_le32(READ_ONCE(cmd->cdw14));
+	c->common.cdw15 = cpu_to_le32(READ_ONCE(cmd->cdw15));
+
+	if (!nvme_cmd_allowed(ns, c, 0, ioucmd->file->f_mode))
+		return -EACCES;
+
+	d->metadata = READ_ONCE(cmd->metadata);
+	d->addr = READ_ONCE(cmd->addr);
+	d->data_len = READ_ONCE(cmd->data_len);
+	d->metadata_len = READ_ONCE(cmd->metadata_len);
+	d->timeout_ms = READ_ONCE(cmd->timeout_ms);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(nvme_prep_cmd_from_ioucmd);
+
+
 static int nvme_uring_cmd_io(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
 		struct io_uring_cmd *ioucmd, unsigned int issue_flags, bool vec)
 {
 	struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd);
-	const struct nvme_uring_cmd *cmd = ioucmd->cmd;
 	struct request_queue *q = ns ? ns->queue : ctrl->admin_q;
 	struct nvme_uring_data d;
 	struct nvme_command c;
@@ -562,35 +593,9 @@ static int nvme_uring_cmd_io(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
 	void *meta = NULL;
 	int ret;
 
-	c.common.opcode = READ_ONCE(cmd->opcode);
-	c.common.flags = READ_ONCE(cmd->flags);
-	if (c.common.flags)
-		return -EINVAL;
-
-	c.common.command_id = 0;
-	c.common.nsid = cpu_to_le32(cmd->nsid);
-	if (!nvme_validate_passthru_nsid(ctrl, ns, le32_to_cpu(c.common.nsid)))
-		return -EINVAL;
-
-	c.common.cdw2[0] = cpu_to_le32(READ_ONCE(cmd->cdw2));
-	c.common.cdw2[1] = cpu_to_le32(READ_ONCE(cmd->cdw3));
-	c.common.metadata = 0;
-	c.common.dptr.prp1 = c.common.dptr.prp2 = 0;
-	c.common.cdw10 = cpu_to_le32(READ_ONCE(cmd->cdw10));
-	c.common.cdw11 = cpu_to_le32(READ_ONCE(cmd->cdw11));
-	c.common.cdw12 = cpu_to_le32(READ_ONCE(cmd->cdw12));
-	c.common.cdw13 = cpu_to_le32(READ_ONCE(cmd->cdw13));
-	c.common.cdw14 = cpu_to_le32(READ_ONCE(cmd->cdw14));
-	c.common.cdw15 = cpu_to_le32(READ_ONCE(cmd->cdw15));
-
-	if (!nvme_cmd_allowed(ns, &c, 0, ioucmd->file->f_mode))
-		return -EACCES;
-
-	d.metadata = READ_ONCE(cmd->metadata);
-	d.addr = READ_ONCE(cmd->addr);
-	d.data_len = READ_ONCE(cmd->data_len);
-	d.metadata_len = READ_ONCE(cmd->metadata_len);
-	d.timeout_ms = READ_ONCE(cmd->timeout_ms);
+	ret = nvme_prep_cmd_from_ioucmd(ctrl, ns, ioucmd, &c, &d);
+	if (ret)
+		return ret;
 
 	if (issue_flags & IO_URING_F_NONBLOCK) {
 		rq_flags |= REQ_NOWAIT;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 4619a7498f8e..4eb45afc9484 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -168,6 +168,14 @@ struct nvme_request {
 	struct nvme_ctrl	*ctrl;
 };
 
+struct nvme_uring_data {
+	__u64	metadata;
+	__u64	addr;
+	__u32	data_len;
+	__u32	metadata_len;
+	__u32	timeout_ms;
+};
+
 /*
  * Mark a bio as coming in through the mpath node.
  */
@@ -811,6 +819,9 @@ static inline bool nvme_is_unique_nsid(struct nvme_ctrl *ctrl,
 		(ctrl->ctratt & NVME_CTRL_CTRATT_NVM_SETS);
 }
 
+int nvme_prep_cmd_from_ioucmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
+		struct io_uring_cmd *ioucmd, struct nvme_command *c,
+		struct nvme_uring_data *d);
 int nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
 		void *buf, unsigned bufflen);
 int __nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC PATCH 10/12] nvme: submission/completion of uring_cmd to/from the registered queue
       [not found]   ` <CGME20230429094257epcas5p463574920bba26cd219275e57c2063d85@epcas5p4.samsung.com>
@ 2023-04-29  9:39     ` Kanchan Joshi
  0 siblings, 0 replies; 15+ messages in thread
From: Kanchan Joshi @ 2023-04-29  9:39 UTC (permalink / raw)
  To: axboe, hch, sagi, kbusch
  Cc: io-uring, linux-nvme, linux-block, gost.dev, anuj1072538,
	xiaoguang.wang, Kanchan Joshi

If the IORING_URING_CMD_DIRECT flag is set, get the registered qid and
- submit the io_uring command via mq_ops->queue_uring_cmd.
- complete it via mq_ops->poll_uring_cmd.

If the command could not be submitted this way for any reason, absorb
the error and fall back to the old way of submission.
This keeps the IORING_URING_CMD_DIRECT flag advisory.

Signed-off-by: Kanchan Joshi <[email protected]>
---
 drivers/nvme/host/ioctl.c | 105 +++++++++++++++++++++++++++++++++-----
 drivers/nvme/host/nvme.h  |   4 ++
 2 files changed, 97 insertions(+), 12 deletions(-)

diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index 18f4f20f5e76..df86fb4f132b 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -653,6 +653,23 @@ static bool is_ctrl_ioctl(unsigned int cmd)
 	return false;
 }
 
+static int nvme_uring_cmd_io_direct(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
+		struct io_uring_cmd *ioucmd, unsigned int issue_flags)
+{
+	struct request_queue *q = ns ? ns->queue : ctrl->admin_q;
+	int qid = io_uring_cmd_import_qid(ioucmd);
+	struct nvme_uring_direct_pdu *pdu =
+		(struct nvme_uring_direct_pdu *)&ioucmd->pdu;
+
+	if ((issue_flags & IO_URING_F_IOPOLL) != IO_URING_F_IOPOLL)
+		return -EOPNOTSUPP;
+
+	pdu->ns = ns;
+	if (q->mq_ops && q->mq_ops->queue_uring_cmd)
+		return q->mq_ops->queue_uring_cmd(ioucmd, qid);
+	return -EOPNOTSUPP;
+}
+
 static int nvme_ctrl_ioctl(struct nvme_ctrl *ctrl, unsigned int cmd,
 		void __user *argp, fmode_t mode)
 {
@@ -763,6 +780,14 @@ static int nvme_ns_uring_cmd(struct nvme_ns *ns, struct io_uring_cmd *ioucmd,
 
 	switch (ioucmd->cmd_op) {
 	case NVME_URING_CMD_IO:
+		if (ioucmd->flags & IORING_URING_CMD_DIRECT) {
+			ret = nvme_uring_cmd_io_direct(ctrl, ns, ioucmd,
+					issue_flags);
+			if (ret == -EIOCBQUEUED)
+				return ret;
+			/* in case of any error, just fallback */
+			ioucmd->flags &= ~(IORING_URING_CMD_DIRECT);
+		}
 		ret = nvme_uring_cmd_io(ctrl, ns, ioucmd, issue_flags, false);
 		break;
 	case NVME_URING_CMD_IO_VEC:
@@ -783,6 +808,38 @@ int nvme_ns_chr_uring_cmd(struct io_uring_cmd *ioucmd, unsigned int issue_flags)
 	return nvme_ns_uring_cmd(ns, ioucmd, issue_flags);
 }
 
+/* similar to blk_mq_poll; may be possible to unify */
+int nvme_uring_cmd_iopoll_qid(struct request_queue *q,
+				 struct io_uring_cmd *ioucmd, int qid,
+				 struct io_comp_batch *iob,
+				 unsigned int flags)
+{
+	long state = get_current_state();
+	int ret;
+
+	if (!(q->mq_ops && q->mq_ops->poll_uring_cmd))
+		return 0;
+	do {
+		ret = q->mq_ops->poll_uring_cmd(ioucmd, qid, iob);
+		if (ret > 0) {
+			__set_current_state(TASK_RUNNING);
+			return ret;
+		}
+		if (signal_pending_state(state, current))
+			__set_current_state(TASK_RUNNING);
+		if (task_is_running(current))
+			return 1;
+
+		if (ret < 0 || (flags & BLK_POLL_ONESHOT))
+			break;
+		cpu_relax();
+
+	} while (!need_resched());
+
+	__set_current_state(TASK_RUNNING);
+	return 0;
+}
+
 int nvme_ns_chr_uring_cmd_iopoll(struct io_uring_cmd *ioucmd,
 				 struct io_comp_batch *iob,
 				 unsigned int poll_flags)
@@ -792,14 +849,26 @@ int nvme_ns_chr_uring_cmd_iopoll(struct io_uring_cmd *ioucmd,
 	struct nvme_ns *ns;
 	struct request_queue *q;
 
-	rcu_read_lock();
-	bio = READ_ONCE(ioucmd->cookie);
 	ns = container_of(file_inode(ioucmd->file)->i_cdev,
 			struct nvme_ns, cdev);
 	q = ns->queue;
-	if (test_bit(QUEUE_FLAG_POLL, &q->queue_flags) && bio && bio->bi_bdev)
-		ret = bio_poll(bio, iob, poll_flags);
-	rcu_read_unlock();
+	if (!(ioucmd->flags & IORING_URING_CMD_DIRECT)) {
+		rcu_read_lock();
+		bio = READ_ONCE(ioucmd->cookie);
+		if (test_bit(QUEUE_FLAG_POLL, &q->queue_flags) && bio && bio->bi_bdev)
+			ret = bio_poll(bio, iob, poll_flags);
+
+		rcu_read_unlock();
+	} else {
+		int qid = io_uring_cmd_import_qid(ioucmd);
+
+		if (qid <= 0)
+			return 0;
+		if (!percpu_ref_tryget(&q->q_usage_counter))
+			return 0;
+		ret = nvme_uring_cmd_iopoll_qid(q, ioucmd, qid, iob, poll_flags);
+		percpu_ref_put(&q->q_usage_counter);
+	}
 	return ret;
 }
 #ifdef CONFIG_NVME_MULTIPATH
@@ -952,13 +1021,25 @@ int nvme_ns_head_chr_uring_cmd_iopoll(struct io_uring_cmd *ioucmd,
 	struct request_queue *q;
 
 	if (ns) {
-		rcu_read_lock();
-		bio = READ_ONCE(ioucmd->cookie);
-		q = ns->queue;
-		if (test_bit(QUEUE_FLAG_POLL, &q->queue_flags) && bio
-				&& bio->bi_bdev)
-			ret = bio_poll(bio, iob, poll_flags);
-		rcu_read_unlock();
+		if (!(ioucmd->flags & IORING_URING_CMD_DIRECT)) {
+			rcu_read_lock();
+			bio = READ_ONCE(ioucmd->cookie);
+			q = ns->queue;
+			if (test_bit(QUEUE_FLAG_POLL, &q->queue_flags) && bio
+					&& bio->bi_bdev)
+				ret = bio_poll(bio, iob, poll_flags);
+			rcu_read_unlock();
+		} else {
+			int qid = io_uring_cmd_import_qid(ioucmd);
+
+			if (qid <= 0)
+				return 0;
+			if (!percpu_ref_tryget(&q->q_usage_counter))
+				return 0;
+			ret = nvme_uring_cmd_iopoll_qid(q, ioucmd, qid, iob,
+							poll_flags);
+			percpu_ref_put(&q->q_usage_counter);
+		}
 	}
 	srcu_read_unlock(&head->srcu, srcu_idx);
 	return ret;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 4eb45afc9484..2fd4432fbe12 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -176,6 +176,10 @@ struct nvme_uring_data {
 	__u32	timeout_ms;
 };
 
+struct nvme_uring_direct_pdu {
+	struct nvme_ns *ns;
+};
+
 /*
  * Mark a bio as coming in through the mpath node.
  */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC PATCH 11/12] pci: modify nvme_setup_prp_simple parameters
       [not found]   ` <CGME20230429094259epcas5p11f0f3422eb4aa4e3ebf00e0666790efa@epcas5p1.samsung.com>
@ 2023-04-29  9:39     ` Kanchan Joshi
  0 siblings, 0 replies; 15+ messages in thread
From: Kanchan Joshi @ 2023-04-29  9:39 UTC (permalink / raw)
  To: axboe, hch, sagi, kbusch
  Cc: io-uring, linux-nvme, linux-block, gost.dev, anuj1072538,
	xiaoguang.wang, Anuj Gupta

From: Anuj Gupta <[email protected]>

Refactor parameters of nvme_setup_prp_simple a bit.
This is a prep patch.

Signed-off-by: Anuj Gupta <[email protected]>
---
 drivers/nvme/host/pci.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index b4498e198e8a..30d7a1a6eaab 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -750,14 +750,13 @@ static blk_status_t nvme_pci_setup_sgls(struct nvme_dev *dev,
 }
 
 static blk_status_t nvme_setup_prp_simple(struct nvme_dev *dev,
-		struct request *req, struct nvme_rw_command *cmnd,
-		struct bio_vec *bv)
+		struct nvme_iod *iod, struct nvme_rw_command *cmnd,
+		struct bio_vec *bv, int dma_dir)
 {
-	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
 	unsigned int offset = bv->bv_offset & (NVME_CTRL_PAGE_SIZE - 1);
 	unsigned int first_prp_len = NVME_CTRL_PAGE_SIZE - offset;
 
-	iod->first_dma = dma_map_bvec(dev->dev, bv, rq_dma_dir(req), 0);
+	iod->first_dma = dma_map_bvec(dev->dev, bv, dma_dir, 0);
 	if (dma_mapping_error(dev->dev, iod->first_dma))
 		return BLK_STS_RESOURCE;
 	iod->dma_len = bv->bv_len;
@@ -801,8 +800,10 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
 
 		if (!is_pci_p2pdma_page(bv.bv_page)) {
 			if (bv.bv_offset + bv.bv_len <= NVME_CTRL_PAGE_SIZE * 2)
-				return nvme_setup_prp_simple(dev, req,
-							     &cmnd->rw, &bv);
+				return nvme_setup_prp_simple(dev,
+					       blk_mq_rq_to_pdu(req),
+					       &cmnd->rw, &bv,
+					       rq_dma_dir(req));
 
 			if (nvmeq->qid && sgl_threshold &&
 			    nvme_ctrl_sgl_supported(&dev->ctrl))
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC PATCH 12/12] pci: implement submission/completion for rawq commands
       [not found]   ` <CGME20230429094301epcas5p48cf45da2f83d9ca8140ee777c7446d11@epcas5p4.samsung.com>
@ 2023-04-29  9:39     ` Kanchan Joshi
  0 siblings, 0 replies; 15+ messages in thread
From: Kanchan Joshi @ 2023-04-29  9:39 UTC (permalink / raw)
  To: axboe, hch, sagi, kbusch
  Cc: io-uring, linux-nvme, linux-block, gost.dev, anuj1072538,
	xiaoguang.wang, Kanchan Joshi, Anuj Gupta

Implement ->queue_uring_cmd and ->poll_uring_cmd for leaner submission
and completion. Both operate on "struct io_uring_cmd" directly, without
going through the intermediate request/bio hoops.

This is currently enabled only for small, single-segment, premapped
(fixedbufs) IOs.
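
For context, a minimal userspace sketch for sending one such IO is shown
below. It assumes liburing and the nvme uapi headers, uses the
IORING_URING_CMD_DIRECT flag added in patch 7, and attaches a queue via
the ring registration added in patch 4; IORING_REGISTER_QUEUE is only a
placeholder name for that opcode, and error handling is omitted:

#include <liburing.h>
#include <linux/nvme_ioctl.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

static int direct_read_one_block(const char *path)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	struct nvme_uring_cmd *cmd;
	void *buf = aligned_alloc(4096, 4096);
	struct iovec iov = { .iov_base = buf, .iov_len = 4096 };
	int fd = open(path, O_RDONLY);		/* e.g. /dev/ng0n1 */

	io_uring_queue_init(64, &ring, IORING_SETUP_IOPOLL |
			    IORING_SETUP_SQE128 | IORING_SETUP_CQE32);
	io_uring_register_files(&ring, &fd, 1);
	io_uring_register_buffers(&ring, &iov, 1);
	/* attach an nvme queue to this ring; IORING_REGISTER_QUEUE stands
	 * in for the register opcode that patch 4 introduces */
	syscall(__NR_io_uring_register, ring.ring_fd,
		IORING_REGISTER_QUEUE, NULL, 0);

	sqe = io_uring_get_sqe(&ring);
	memset(sqe, 0, 2 * sizeof(*sqe));	/* full 128B SQE */
	sqe->opcode = IORING_OP_URING_CMD;
	sqe->fd = 0;				/* index into registered files */
	sqe->flags = IOSQE_FIXED_FILE;
	sqe->cmd_op = NVME_URING_CMD_IO;
	/* FIXED is upstream; DIRECT is the new flag from patch 7 */
	sqe->uring_cmd_flags = IORING_URING_CMD_FIXED | IORING_URING_CMD_DIRECT;
	sqe->buf_index = 0;			/* index into registered buffers */

	cmd = (struct nvme_uring_cmd *)sqe->cmd;
	cmd->opcode = 0x02;			/* nvme read */
	cmd->nsid = ioctl(fd, NVME_IOCTL_ID);
	cmd->addr = (__u64)(unsigned long)buf;
	cmd->data_len = 512;			/* LBA 0, nlb=1, 512B LBA format */

	io_uring_submit(&ring);
	io_uring_wait_cqe(&ring, &cqe);
	return cqe->res;
}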

Signed-off-by: Kanchan Joshi <[email protected]>
Signed-off-by: Anuj Gupta <[email protected]>
---
 drivers/nvme/host/pci.c | 194 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 192 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 30d7a1a6eaab..ca0580c4e977 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -28,6 +28,8 @@
 #include <linux/io-64-nonatomic-hi-lo.h>
 #include <linux/sed-opal.h>
 #include <linux/pci-p2pdma.h>
+#include <linux/nvme_ioctl.h>
+#include <linux/io_uring.h>
 
 #include "trace.h"
 #include "nvme.h"
@@ -210,6 +212,7 @@ struct nvme_queue {
 	unsigned long *cmdid_bmp;
 	spinlock_t cmdid_lock;
 	struct nvme_iod *iod;
+	u8	reg_id;
 	 /* only used for poll queues: */
 	spinlock_t cq_poll_lock ____cacheline_aligned_in_smp;
 	struct nvme_completion *cqes;
@@ -249,7 +252,11 @@ union nvme_descriptor {
  * to the actual struct scatterlist.
  */
 struct nvme_iod {
-	struct nvme_request req;
+	union {
+		struct nvme_request req;
+		/* for raw-queue */
+		struct io_uring_cmd *ioucmd;
+	};
 	struct nvme_command cmd;
 	bool aborted;
 	s8 nr_allocations;	/* PRP list pool allocations. 0 means small
@@ -1025,6 +1032,13 @@ static inline void nvme_ring_cq_doorbell(struct nvme_queue *nvmeq)
 		writel(head, nvmeq->q_db + nvmeq->dev->db_stride);
 }
 
+static void nvme_pci_put_cmdid(struct nvme_queue *nvmeq, int id)
+{
+	spin_lock(&nvmeq->cmdid_lock);
+	clear_bit(id, nvmeq->cmdid_bmp);
+	spin_unlock(&nvmeq->cmdid_lock);
+}
+
 static inline struct blk_mq_tags *nvme_queue_tagset(struct nvme_queue *nvmeq)
 {
 	if (!nvmeq->qid)
@@ -1032,6 +1046,37 @@ static inline struct blk_mq_tags *nvme_queue_tagset(struct nvme_queue *nvmeq)
 	return nvmeq->dev->tagset.tags[nvmeq->qid - 1];
 }
 
+static inline struct nvme_uring_direct_pdu *nvme_uring_cmd_direct_pdu(
+		struct io_uring_cmd *ioucmd)
+{
+	return (struct nvme_uring_direct_pdu *)&ioucmd->pdu;
+}
+
+static inline void nvme_handle_cqe_rawq(struct nvme_queue *nvmeq,
+					struct nvme_completion *cqe,
+					__u16 command_id)
+{
+	struct nvme_dev *dev = nvmeq->dev;
+	u32 status = le16_to_cpu(cqe->status) >> 1;
+	u64 result = cqe->result.u64;
+	struct nvme_iod *iod;
+	int id, reg_id;
+
+	reg_id = nvme_genctr_from_cid(command_id);
+	/* we should not encounter completions from a past registration */
+	WARN_ON_ONCE(nvmeq->reg_id != reg_id);
+
+	id = nvme_tag_from_cid(command_id);
+	iod = &nvmeq->iod[id];
+
+	if (iod->dma_len)
+		dma_unmap_page(dev->dev, iod->first_dma, iod->dma_len,
+		nvme_is_write(&iod->cmd) ? DMA_TO_DEVICE : DMA_FROM_DEVICE);
+
+	nvme_pci_put_cmdid(nvmeq, id);
+	io_uring_cmd_done(iod->ioucmd, status, result, IO_URING_F_UNLOCKED);
+}
+
 static inline void nvme_handle_cqe(struct nvme_queue *nvmeq,
 				   struct io_comp_batch *iob, u16 idx)
 {
@@ -1039,6 +1084,9 @@ static inline void nvme_handle_cqe(struct nvme_queue *nvmeq,
 	__u16 command_id = READ_ONCE(cqe->command_id);
 	struct request *req;
 
+	if (test_bit(NVMEQ_RAW, &nvmeq->flags))
+		return nvme_handle_cqe_rawq(nvmeq, cqe, command_id);
+
 	/*
 	 * AEN requests are special as they don't time out and can
 	 * survive any kind of queue freeze and often don't respond to
@@ -1151,6 +1199,25 @@ static int nvme_poll(struct blk_mq_hw_ctx *hctx, struct io_comp_batch *iob)
 	return found;
 }
 
+static int nvme_poll_uring_cmd(struct io_uring_cmd *ioucmd, int qid,
+			       struct io_comp_batch *iob)
+
+{
+	struct nvme_uring_direct_pdu *pdu = nvme_uring_cmd_direct_pdu(ioucmd);
+	struct nvme_dev *dev = to_nvme_dev(pdu->ns->ctrl);
+	struct nvme_queue *nvmeq = &dev->queues[qid];
+	bool found;
+
+	if (!nvme_cqe_pending(nvmeq))
+		return 0;
+
+	spin_lock(&nvmeq->cq_poll_lock);
+	found = nvme_poll_cq(nvmeq, iob);
+	spin_unlock(&nvmeq->cq_poll_lock);
+
+	return found;
+}
+
 static void nvme_pci_submit_async_event(struct nvme_ctrl *ctrl)
 {
 	struct nvme_dev *dev = to_nvme_dev(ctrl);
@@ -1762,6 +1829,22 @@ static int nvme_pci_alloc_cmdid_bmp(struct nvme_queue *nvmeq)
 	return 0;
 }
 
+static int nvme_pci_get_cmdid(struct nvme_queue *nvmeq)
+{
+	int id = 0, size = nvmeq->q_depth - 1;
+
+	spin_lock(&nvmeq->cmdid_lock);
+	id = find_first_zero_bit(nvmeq->cmdid_bmp, size);
+	if (id >= size) {
+		id = -EBUSY;
+		goto unlock;
+	}
+	set_bit(id, nvmeq->cmdid_bmp);
+unlock:
+	spin_unlock(&nvmeq->cmdid_lock);
+	return id;
+}
+
 static int nvme_pci_alloc_iod_array(struct nvme_queue *nvmeq)
 {
 	if (!test_bit(NVMEQ_RAW, &nvmeq->flags))
@@ -1793,13 +1876,17 @@ static int nvme_pci_register_queue(void *data)
 	struct nvme_ns *ns = (struct nvme_ns *) data;
 	struct nvme_dev *dev = to_nvme_dev(ns->ctrl);
 	int qid, ret;
+	struct nvme_queue *nvmeq;
 
 	qid = nvme_pci_get_rawq(dev);
 	if (qid > 0) {
 		/* setup command-id bitmap and iod array */
-		ret = nvme_pci_setup_rawq(&dev->queues[qid]);
+		nvmeq = &dev->queues[qid];
+		ret = nvme_pci_setup_rawq(nvmeq);
 		if (ret < 0)
 			qid = ret;
+		else
+			nvmeq->reg_id++;
 	}
 	return qid;
 }
@@ -1812,6 +1899,107 @@ static int nvme_pci_unregister_queue(void *data, int qid)
 	return nvme_pci_put_rawq(dev, qid);
 }
 
+static int nvme_map_io_fb(struct request_queue *q, struct io_uring_cmd *ioucmd,
+			struct nvme_dev *dev, struct nvme_iod *iod,
+			unsigned long addr, unsigned int buf_len)
+{
+	struct nvme_command *cmnd = &iod->cmd;
+	int ret, rw = nvme_is_write(cmnd);
+	struct iov_iter iter;
+	size_t nr_iter, nr_segs;
+	struct bio_vec *bv;
+
+	if (!(ioucmd->flags & IORING_URING_CMD_FIXED))
+		return -EINVAL;
+
+	ret = io_uring_cmd_import_fixed(addr, buf_len, rw, &iter, ioucmd);
+	if (ret < 0)
+		return ret;
+	nr_iter = iov_iter_count(&iter);
+	nr_segs = iter.nr_segs;
+
+	/* device will do these checks anyway, so why do duplicate work? */
+	if (!nr_iter || (nr_iter >> SECTOR_SHIFT) > queue_max_hw_sectors(q))
+		goto err;
+	if (nr_segs > queue_max_segments(q))
+		goto err;
+	/* this will be attempted from regular path instead */
+	if (nr_segs > 1)
+		goto err;
+
+	bv = (struct bio_vec *)&iter.bvec[0];
+	if (bv->bv_offset + bv->bv_len <= NVME_CTRL_PAGE_SIZE * 2)
+		return nvme_setup_prp_simple(dev, iod,
+				(struct nvme_rw_command *)cmnd,
+				bv, rw ? DMA_TO_DEVICE : DMA_FROM_DEVICE);
+err:
+	return -EINVAL;
+}
+
+static int nvme_alloc_cmd_from_q(struct nvme_queue *nvmeq, int *cmdid,
+				struct nvme_iod **iod)
+{
+	int id;
+
+	id = nvme_pci_get_cmdid(nvmeq);
+	if (id < 0)
+		return id;
+	*cmdid = id | nvme_cid_install_genctr(nvmeq->reg_id);
+	*iod = &nvmeq->iod[id];
+	return 0;
+}
+
+static int nvme_pci_queue_ucmd(struct io_uring_cmd *ioucmd, int qid)
+{
+	struct nvme_uring_direct_pdu *pdu = nvme_uring_cmd_direct_pdu(ioucmd);
+	struct nvme_ns *ns = pdu->ns;
+	struct nvme_ctrl *ctrl = ns->ctrl;
+	struct nvme_dev *dev = to_nvme_dev(ctrl);
+	struct nvme_command *nvme_cmd;
+	struct nvme_uring_data d;
+	int ret, cmdid;
+	struct nvme_iod *iod;
+	struct nvme_queue *nvmeq = &dev->queues[qid];
+
+	ret = nvme_alloc_cmd_from_q(nvmeq, &cmdid, &iod);
+	if (ret)
+		return ret;
+
+	iod->ioucmd = ioucmd;
+	nvme_cmd = &iod->cmd;
+	ret = nvme_prep_cmd_from_ioucmd(ctrl, ns, ioucmd, nvme_cmd, &d);
+	if (ret)
+		goto out;
+
+	nvme_cmd->common.command_id = cmdid;
+	iod->aborted = false;
+	iod->nr_allocations = -1;
+	iod->sgt.nents = 0;
+	ret = nvme_map_io_fb(ns->queue, ioucmd, dev, iod, d.addr, d.data_len);
+	if (ret)
+		goto out;
+
+	/*
+	 * since this nvmeq is exclusive to single submitter (io_uring
+	 * follows one-thread-per-ring model), we do not need the lock
+	 * ideally. But we have lifetime difference between io_uring
+	 * and nvme sqe. io_uring SQE can be reused just after submission
+	 * but for nvme we have to wait until completion.
+	 * So if we run out of space, submission will be deferred to
+	 * io_uring worker. So we can't skip the lock.
+	 * But not all is lost! Due to one-thread-per-ring model and
+	 * polled-completion, contention is not common in most cases.
+	 */
+	spin_lock(&nvmeq->sq_lock);
+	nvme_sq_copy_cmd(nvmeq, &iod->cmd);
+	nvme_write_sq_db(nvmeq, true);
+	spin_unlock(&nvmeq->sq_lock);
+	return -EIOCBQUEUED;
+out:
+	nvme_pci_put_cmdid(nvmeq, nvme_tag_from_cid(cmdid));
+	return ret;
+}
+
 static const struct blk_mq_ops nvme_mq_admin_ops = {
 	.queue_rq	= nvme_queue_rq,
 	.complete	= nvme_pci_complete_rq,
@@ -1832,6 +2020,8 @@ static const struct blk_mq_ops nvme_mq_ops = {
 	.poll		= nvme_poll,
 	.register_queue	= nvme_pci_register_queue,
 	.unregister_queue =  nvme_pci_unregister_queue,
+	.queue_uring_cmd = nvme_pci_queue_ucmd,
+	.poll_uring_cmd	= nvme_poll_uring_cmd,
 };
 
 static void nvme_dev_remove_admin(struct nvme_dev *dev)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 00/12] io_uring attached nvme queue
  2023-04-29  9:39 ` [RFC PATCH 00/12] io_uring attached nvme queue Kanchan Joshi
                     ` (11 preceding siblings ...)
       [not found]   ` <CGME20230429094301epcas5p48cf45da2f83d9ca8140ee777c7446d11@epcas5p4.samsung.com>
@ 2023-04-29 17:17   ` Jens Axboe
  2023-05-01 11:36     ` Kanchan Joshi
  12 siblings, 1 reply; 15+ messages in thread
From: Jens Axboe @ 2023-04-29 17:17 UTC (permalink / raw)
  To: Kanchan Joshi, hch, sagi, kbusch
  Cc: io-uring, linux-nvme, linux-block, gost.dev, anuj1072538, xiaoguang.wang

On 4/29/23 3:39 AM, Kanchan Joshi wrote:
> This series shows one way to do what the title says.
> This puts up a more direct/lean path that enables
>  - submission from io_uring SQE to NVMe SQE
>  - completion from NVMe CQE to io_uring CQE
> Essentially cutting the hoops (involving request/bio) for nvme io path.
> 
> Also, io_uring ring is not to be shared among application threads.
> Application is responsible for building the sharing (if it feels the
> need). This means ring-associated exclusive queue can do away with some
> synchronization costs that occur for shared queue.
> 
> Primary objective is to amp up of efficiency of kernel io path further
> (towards PCIe gen N, N+1 hardware).
> And we are seeing some asks too [1].
> 
> Building-blocks
> ===============
> At high level, series can be divided into following parts -
> 
> 1. nvme driver starts exposing some queue-pairs (SQ+CQ) that can
> be attached to other in-kernel user (not just to block-layer, which is
> the case at the moment) on demand.
> 
> Example:
> insmod nvme.ko poll_queus=1 raw_queues=2
> 
> nvme0: 24/0/1/2 default/read/poll queues/raw queues
> 
> While driver registers other queues with block-layer, raw-queues are
> rather reserved for exclusive attachment with other in-kernel users.
> At this point, each raw-queue is interrupt-disabled (similar to
> poll_queues). Maybe we need a better name for these (e.g. app/user queues).
> [Refer: patch 2]
> 
> 2. register/unregister queue interface
> (a) one for io_uring application to ask for device-queue and register
> with the ring. [Refer: patch 4]
> (b) another at nvme so that other in-kernel users (io_uring for now) can
> ask for a raw-queue. [Refer: patch 3, 5, 6]
> 
> The latter returns a qid, that io_uring stores internally (not exposed
> to user-space) in the ring ctx. At max one queue per ring is enabled.
> Ring has no other special properties except the fact that it stores a
> qid that it can use exclusively. So application can very well use the
> ring to do other things than nvme io.
> 
> 3. user-interface to send commands down this way
> (a) uring-cmd is extended to support a new flag "IORING_URING_CMD_DIRECT"
> that application passes in the SQE. That is all.
> (b) the flag goes down to provider of ->uring_cmd which may choose to do
>   things differently based on it (or ignore it).
> [Refer: patch 7]
> 
> 4. nvme uring-cmd understands the above flag. It submits the command
> into the known pre-registered queue, and completes (polled-completion)
> from it. Transformation from "struct io_uring_cmd" to "nvme command" is
> done directly without building other intermediate constructs.
> [Refer: patch 8, 10, 12]
> 
> Testing and Performance
> =======================
> fio and t/io_uring is modified to exercise this path.
> - fio: new "registerqueues" option
> - t/io_uring: new "k" option
> 
> Good part:
> 2.96M -> 5.02M
> 
> nvme io (without this):
> # t/io_uring -b512 -d64 -c2 -s2 -p1 -F1 -B1 -O0 -n1 -u1 -r4 -k0 /dev/ng0n1
> submitter=0, tid=2922, file=/dev/ng0n1, node=-1
> polled=1, fixedbufs=1/0, register_files=1, buffered=1, register_queues=0 QD=64
> Engine=io_uring, sq_ring=64, cq_ring=64
> IOPS=2.89M, BW=1412MiB/s, IOS/call=2/1
> IOPS=2.92M, BW=1426MiB/s, IOS/call=2/2
> IOPS=2.96M, BW=1444MiB/s, IOS/call=2/1
> Exiting on timeout
> Maximum IOPS=2.96M
> 
> nvme io (with this):
> # t/io_uring -b512 -d64 -c2 -s2 -p1 -F1 -B1 -O0 -n1 -u1 -r4 -k1 /dev/ng0n1
> submitter=0, tid=2927, file=/dev/ng0n1, node=-1
> polled=1, fixedbufs=1/0, register_files=1, buffered=1, register_queues=1 QD=64
> Engine=io_uring, sq_ring=64, cq_ring=64
> IOPS=4.99M, BW=2.43GiB/s, IOS/call=2/1
> IOPS=5.02M, BW=2.45GiB/s, IOS/call=2/1
> IOPS=5.02M, BW=2.45GiB/s, IOS/call=2/1
> Exiting on timeout
> Maximum IOPS=5.02M
> 
> Not so good part:
> While single IO is fast this way, we do not have batching abilities for
> multi-io scenario. Plugging, submission and completion batching are tied to
> block-layer constructs. Things should look better if we could do something
> about that.
> Particularly something is off with the completion-batching.
> 
> With -s32 and -c32, the numbers decline:
> 
> # t/io_uring -b512 -d64 -c32 -s32 -p1 -F1 -B1 -O0 -n1 -u1 -r4 -k1 /dev/ng0n1
> submitter=0, tid=3674, file=/dev/ng0n1, node=-1
> polled=1, fixedbufs=1/0, register_files=1, buffered=1, register_queues=1 QD=64
> Engine=io_uring, sq_ring=64, cq_ring=64
> IOPS=3.70M, BW=1806MiB/s, IOS/call=32/31
> IOPS=3.71M, BW=1812MiB/s, IOS/call=32/31
> IOPS=3.71M, BW=1812MiB/s, IOS/call=32/32
> Exiting on timeout
> Maximum IOPS=3.71M
> 
> And perf gets restored if we go back to -c2
> 
> # t/io_uring -b512 -d64 -c2 -s32 -p1 -F1 -B1 -O0 -n1 -u1 -r4 -k1 /dev/ng0n1
> submitter=0, tid=3677, file=/dev/ng0n1, node=-1
> polled=1, fixedbufs=1/0, register_files=1, buffered=1, register_queues=1 QD=64
> Engine=io_uring, sq_ring=64, cq_ring=64
> IOPS=4.99M, BW=2.44GiB/s, IOS/call=5/5
> IOPS=5.02M, BW=2.45GiB/s, IOS/call=5/5
> IOPS=5.02M, BW=2.45GiB/s, IOS/call=5/5
> Exiting on timeout
> Maximum IOPS=5.02M
> 
> Source
> ======
> Kernel: https://github.com/OpenMPDK/linux/tree/feat/directq-v1
> fio: https://github.com/OpenMPDK/fio/commits/feat/rawq-v2
> 
> Please take a look.

This looks like a great starting point! Unfortunately I won't be at
LSFMM this year to discuss it in person, but I'll be taking a closer
look at this. Some quick initial reactions:

- I'd call them "user" queues rather than raw or whatever, I think that
  more accurately describes what they are for.

- I guess there's no way around needing to pre-allocate these user
  queues, just like we do for polled_queues right now? In terms of user
  API, it'd be nicer if you could just do IORING_REGISTER_QUEUE (insert
  right name here...) and it'd allocate and return you an ID.

- Need to take a look at the uring_cmd stuff again, but would be nice if
  we did not have to add more stuff to fops for this. Maybe we can set
  aside a range of "ioctl" type commands through uring_cmd for this
  instead, and go that way for registering/unregistering queues.
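
To illustrate the register-and-get-an-ID shape from the second point
(the opcode and struct names below are purely made up, nothing in the
series or upstream defines them):

	/* hypothetical: ring allocates/attaches a device queue, returns an id */
	struct io_uring_queue_reg reg = { .fd = nvme_char_fd };
	int qid = syscall(__NR_io_uring_register, ring_fd,
			  IORING_REGISTER_QUEUE, &reg, 1);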

We do have some users that are CPU constrained, and while my testing
easily maxes out a gen2 optane (actually 2 or 3) with the generic IO
path, that's also with all the fat that adds overhead removed. Most
people don't have this luxury, necessarily, or actually need some of
this fat for their monitoring, for example. This would provide a nice
way to have pretty consistent and efficient performance across distro
type configs, which would be great, while still retaining the fattier
bits for "normal" IO.
 

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 00/12] io_uring attached nvme queue
  2023-04-29 17:17   ` [RFC PATCH 00/12] io_uring attached nvme queue Jens Axboe
@ 2023-05-01 11:36     ` Kanchan Joshi
  0 siblings, 0 replies; 15+ messages in thread
From: Kanchan Joshi @ 2023-05-01 11:36 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Kanchan Joshi, hch, sagi, kbusch, io-uring, linux-nvme,
	linux-block, gost.dev, anuj1072538, xiaoguang.wang

On Sat, Apr 29, 2023 at 10:55 PM Jens Axboe <[email protected]> wrote:
>
> On 4/29/23 3:39 AM, Kanchan Joshi wrote:
> > This series shows one way to do what the title says.
> > This puts up a more direct/lean path that enables
> >  - submission from io_uring SQE to NVMe SQE
> >  - completion from NVMe CQE to io_uring CQE
> > Essentially cutting the hoops (involving request/bio) for nvme io path.
> >
> > Also, io_uring ring is not to be shared among application threads.
> > Application is responsible for building the sharing (if it feels the
> > need). This means ring-associated exclusive queue can do away with some
> > synchronization costs that occur for shared queue.
> >
> > Primary objective is to amp up of efficiency of kernel io path further
> > (towards PCIe gen N, N+1 hardware).
> > And we are seeing some asks too [1].
> >
> > Building-blocks
> > ===============
> > At high level, series can be divided into following parts -
> >
> > 1. nvme driver starts exposing some queue-pairs (SQ+CQ) that can
> > be attached to other in-kernel user (not just to block-layer, which is
> > the case at the moment) on demand.
> >
> > Example:
> > insmod nvme.ko poll_queus=1 raw_queues=2
> >
> > nvme0: 24/0/1/2 default/read/poll queues/raw queues
> >
> > While driver registers other queues with block-layer, raw-queues are
> > rather reserved for exclusive attachment with other in-kernel users.
> > At this point, each raw-queue is interrupt-disabled (similar to
> > poll_queues). Maybe we need a better name for these (e.g. app/user queues).
> > [Refer: patch 2]
> >
> > 2. register/unregister queue interface
> > (a) one for io_uring application to ask for device-queue and register
> > with the ring. [Refer: patch 4]
> > (b) another at nvme so that other in-kernel users (io_uring for now) can
> > ask for a raw-queue. [Refer: patch 3, 5, 6]
> >
> > The latter returns a qid, that io_uring stores internally (not exposed
> > to user-space) in the ring ctx. At max one queue per ring is enabled.
> > Ring has no other special properties except the fact that it stores a
> > qid that it can use exclusively. So application can very well use the
> > ring to do other things than nvme io.
> >
> > 3. user-interface to send commands down this way
> > (a) uring-cmd is extended to support a new flag "IORING_URING_CMD_DIRECT"
> > that application passes in the SQE. That is all.
> > (b) the flag goes down to provider of ->uring_cmd which may choose to do
> >   things differently based on it (or ignore it).
> > [Refer: patch 7]
> >
> > 4. nvme uring-cmd understands the above flag. It submits the command
> > into the known pre-registered queue, and completes (polled-completion)
> > from it. Transformation from "struct io_uring_cmd" to "nvme command" is
> > done directly without building other intermediate constructs.
> > [Refer: patch 8, 10, 12]
> >
> > Testing and Performance
> > =======================
> > fio and t/io_uring is modified to exercise this path.
> > - fio: new "registerqueues" option
> > - t/io_uring: new "k" option
> >
> > Good part:
> > 2.96M -> 5.02M
> >
> > nvme io (without this):
> > # t/io_uring -b512 -d64 -c2 -s2 -p1 -F1 -B1 -O0 -n1 -u1 -r4 -k0 /dev/ng0n1
> > submitter=0, tid=2922, file=/dev/ng0n1, node=-1
> > polled=1, fixedbufs=1/0, register_files=1, buffered=1, register_queues=0 QD=64
> > Engine=io_uring, sq_ring=64, cq_ring=64
> > IOPS=2.89M, BW=1412MiB/s, IOS/call=2/1
> > IOPS=2.92M, BW=1426MiB/s, IOS/call=2/2
> > IOPS=2.96M, BW=1444MiB/s, IOS/call=2/1
> > Exiting on timeout
> > Maximum IOPS=2.96M
> >
> > nvme io (with this):
> > # t/io_uring -b512 -d64 -c2 -s2 -p1 -F1 -B1 -O0 -n1 -u1 -r4 -k1 /dev/ng0n1
> > submitter=0, tid=2927, file=/dev/ng0n1, node=-1
> > polled=1, fixedbufs=1/0, register_files=1, buffered=1, register_queues=1 QD=64
> > Engine=io_uring, sq_ring=64, cq_ring=64
> > IOPS=4.99M, BW=2.43GiB/s, IOS/call=2/1
> > IOPS=5.02M, BW=2.45GiB/s, IOS/call=2/1
> > IOPS=5.02M, BW=2.45GiB/s, IOS/call=2/1
> > Exiting on timeout
> > Maximum IOPS=5.02M
> >
> > Not so good part:
> > While single IO is fast this way, we do not have batching abilities for
> > multi-io scenario. Plugging, submission and completion batching are tied to
> > block-layer constructs. Things should look better if we could do something
> > about that.
> > Particularly something is off with the completion-batching.
> >
> > With -s32 and -c32, the numbers decline:
> >
> > # t/io_uring -b512 -d64 -c32 -s32 -p1 -F1 -B1 -O0 -n1 -u1 -r4 -k1 /dev/ng0n1
> > submitter=0, tid=3674, file=/dev/ng0n1, node=-1
> > polled=1, fixedbufs=1/0, register_files=1, buffered=1, register_queues=1 QD=64
> > Engine=io_uring, sq_ring=64, cq_ring=64
> > IOPS=3.70M, BW=1806MiB/s, IOS/call=32/31
> > IOPS=3.71M, BW=1812MiB/s, IOS/call=32/31
> > IOPS=3.71M, BW=1812MiB/s, IOS/call=32/32
> > Exiting on timeout
> > Maximum IOPS=3.71M
> >
> > And perf gets restored if we go back to -c2
> >
> > # t/io_uring -b512 -d64 -c2 -s32 -p1 -F1 -B1 -O0 -n1 -u1 -r4 -k1 /dev/ng0n1
> > submitter=0, tid=3677, file=/dev/ng0n1, node=-1
> > polled=1, fixedbufs=1/0, register_files=1, buffered=1, register_queues=1 QD=64
> > Engine=io_uring, sq_ring=64, cq_ring=64
> > IOPS=4.99M, BW=2.44GiB/s, IOS/call=5/5
> > IOPS=5.02M, BW=2.45GiB/s, IOS/call=5/5
> > IOPS=5.02M, BW=2.45GiB/s, IOS/call=5/5
> > Exiting on timeout
> > Maximum IOPS=5.02M
> >
> > Source
> > ======
> > Kernel: https://github.com/OpenMPDK/linux/tree/feat/directq-v1
> > fio: https://github.com/OpenMPDK/fio/commits/feat/rawq-v2
> >
> > Please take a look.
>
> This looks like a great starting point! Unfortunately I won't be at
> LSFMM this year to discuss it in person, but I'll be taking a closer
> look at this.

That will help, thanks.

> Some quick initial reactions:
>
> - I'd call them "user" queues rather than raw or whatever, I think that
>   more accurately describes what they are for.

Right, that is better.

> - I guess there's no way around needing to pre-allocate these user
>   queues, just like we do for polled_queues right now?

Right, we would need to allocate nvme sq/cq at the outset.
Changing the count at run-time is a bit murky. I will have another look though.

>In terms of user
>   API, it'd be nicer if you could just do IORING_REGISTER_QUEUE (insert
>   right name here...) and it'd allocate and return you an ID.

But that is what the patchset already implements at the moment (a new
register code in io_uring).
So it seems I am missing your point?

> - Need to take a look at the uring_cmd stuff again, but would be nice if
>   we did not have to add more stuff to fops for this. Maybe we can set
>   aside a range of "ioctl" type commands through uring_cmd for this
>   instead, and go that way for registering/unregistering queues.

Yes, I see your point about not having to add new fops.
But a new uring_cmd opcode would exist only at the nvme level.
It is a good way to allocate/deallocate an nvme queue, but it cannot
attach that queue to the io_uring ring.
Or do you have a different view? This seems connected to the previous point.

> We do have some users that are CPU constrained, and while my testing
> easily maxes out a gen2 optane (actually 2 or 3) with the generic IO
> path, that's also with all the fat that adds overhead removed. Most
> people don't have this luxury, necessarily, or actually need some of
> this fat for their monitoring, for example. This would provide a nice
> way to have pretty consistent and efficient performance across distro
> type configs, which would be great, while still retaining the fattier
> bits for "normal" IO.
Makes total sense.

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2023-05-01 11:36 UTC | newest]

Thread overview: 15+ messages
     [not found] <CGME20230429094228epcas5p4a80d8ed77433989fa804ecf449f83b0b@epcas5p4.samsung.com>
2023-04-29  9:39 ` [RFC PATCH 00/12] io_uring attached nvme queue Kanchan Joshi
     [not found]   ` <CGME20230429094238epcas5p4efa3dc785fa54ab974852c7f90113025@epcas5p4.samsung.com>
2023-04-29  9:39     ` [RFC PATCH 01/12] nvme: refactor nvme_alloc_io_tag_set Kanchan Joshi
     [not found]   ` <CGME20230429094240epcas5p1a7411f266412244115411b05da509e4a@epcas5p1.samsung.com>
2023-04-29  9:39     ` [RFC PATCH 02/12] pci: enable "raw_queues = N" module parameter Kanchan Joshi
     [not found]   ` <CGME20230429094243epcas5p13be3ca62dc2b03299d09cafaf11923c1@epcas5p1.samsung.com>
2023-04-29  9:39     ` [RFC PATCH 03/12] fs, block: interface to register/unregister the raw-queue Kanchan Joshi
     [not found]   ` <CGME20230429094245epcas5p2843abc5cd54ffe301d36459543bcd228@epcas5p2.samsung.com>
2023-04-29  9:39     ` [RFC PATCH 04/12] io_uring, fs: plumb support to register/unregister raw-queue Kanchan Joshi
     [not found]   ` <CGME20230429094247epcas5p333e0f515000de60fb64dc2590cf9fcd8@epcas5p3.samsung.com>
2023-04-29  9:39     ` [RFC PATCH 05/12] nvme: wire-up register/unregister queue f_op callback Kanchan Joshi
     [not found]   ` <CGME20230429094249epcas5p18bd717f4e34077c0fcf28458f11de8d1@epcas5p1.samsung.com>
2023-04-29  9:39     ` [RFC PATCH 06/12] pci: implement register/unregister functionality Kanchan Joshi
     [not found]   ` <CGME20230429094251epcas5p144d042853e10f090e3119338c2306546@epcas5p1.samsung.com>
2023-04-29  9:39     ` [RFC PATCH 07/12] io_uring: support for using registered queue in uring-cmd Kanchan Joshi
     [not found]   ` <CGME20230429094253epcas5p3cfff90e1c003b6fc9c7c4a61287beecb@epcas5p3.samsung.com>
2023-04-29  9:39     ` [RFC PATCH 08/12] block: add mq_ops to submit and complete commands from raw-queue Kanchan Joshi
     [not found]   ` <CGME20230429094255epcas5p11bcbe76772289f27c41a50ce502c998d@epcas5p1.samsung.com>
2023-04-29  9:39     ` [RFC PATCH 09/12] nvme: carve out a helper to prepare nvme_command from ioucmd->cmd Kanchan Joshi
     [not found]   ` <CGME20230429094257epcas5p463574920bba26cd219275e57c2063d85@epcas5p4.samsung.com>
2023-04-29  9:39     ` [RFC PATCH 10/12] nvme: submisssion/completion of uring_cmd to/from the registered queue Kanchan Joshi
     [not found]   ` <CGME20230429094259epcas5p11f0f3422eb4aa4e3ebf00e0666790efa@epcas5p1.samsung.com>
2023-04-29  9:39     ` [RFC PATCH 11/12] pci: modify nvme_setup_prp_simple parameters Kanchan Joshi
     [not found]   ` <CGME20230429094301epcas5p48cf45da2f83d9ca8140ee777c7446d11@epcas5p4.samsung.com>
2023-04-29  9:39     ` [RFC PATCH 12/12] pci: implement submission/completion for rawq commands Kanchan Joshi
2023-04-29 17:17   ` [RFC PATCH 00/12] io_uring attached nvme queue Jens Axboe
2023-05-01 11:36     ` Kanchan Joshi
