* [PATCH 00/17] io_uring passthru over nvme
From: Kanchan Joshi @ 2022-03-08 15:20 UTC (permalink / raw)
To: axboe, hch, kbusch, asml.silence
Cc: io-uring, linux-nvme, linux-block, sbates, logang, pankydev8,
javier, mcgrof, a.manzanares, joshiiitr, anuj20.g
This is a streamlined series with a new way of doing uring-cmd, and it
connects nvme passthrough (over the char device /dev/ngX) to it.

uring-cmd enables using io_uring for any arbitrary command (ioctl,
fsctl etc.) exposed by the command provider (e.g. driver, fs etc.).

To store the command inline within the sqe, Jens added an option to set up
the ring with 128-byte SQEs. This gives 80 bytes of space (16 bytes at
the end of the first sqe + 64 bytes in the second sqe). With the command
inline in the sqe, the application avoids an explicit allocation and, in
the kernel, we avoid doing copy_from_user. Command opcode, length etc.
are stored in per-op fields of io_uring_sqe.

Non-inline submission (when the command is a user-space pointer rather than
housed inside the sqe) is also supported.

io_uring sends this command down via the newly introduced ->async_cmd()
handler in file_operations. The handler does whatever is required to
submit the command and indicates a queued completion. Infrastructure has
been added to process the completion when it arrives.
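For illustration, here is a rough user-space sketch of an inline passthru
submission (not part of this series). It assumes a ring set up with
IORING_SETUP_SQE128 and an io_uring library that understands 128-byte SQE
slots (such as the custom fio linked below), plus the sqe->cmd_op/cmd_len/cmd
fields added later in this series; names may differ in the final ABI:

#include <stdint.h>
#include <string.h>
#include <liburing.h>
#include <linux/nvme_ioctl.h>

/* Sketch: submit one inline NVMe read passthru; ngfd is an open /dev/ng0n1 */
static int submit_pt_read(struct io_uring *ring, int ngfd, void *buf,
			  __u32 nlb, __u64 slba)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
	struct nvme_passthru_cmd64 cmd = {
		.opcode   = 0x02,		/* nvme_cmd_read */
		.nsid     = 1,
		.addr     = (__u64)(uintptr_t)buf,
		.data_len = (nlb + 1) * 512,
		.cdw10    = (__u32)slba,
		.cdw11    = (__u32)(slba >> 32),
		.cdw12    = nlb,		/* 0's based block count */
	};

	if (!sqe)
		return -1;
	/* ring was created with IORING_SETUP_SQE128, so each slot is 128b */
	memset(sqe, 0, 2 * sizeof(*sqe));
	sqe->opcode  = IORING_OP_URING_CMD;
	sqe->fd      = ngfd;
	sqe->cmd_op  = NVME_IOCTL_IO64_CMD;
	sqe->cmd_len = sizeof(cmd);
	/* the 80-byte command fits exactly in the inline area of the big SQE */
	memcpy(&sqe->cmd, &cmd, sizeof(cmd));
	return io_uring_submit(ring);
}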
Overall the patches wire up the following capabilities for this path:
- async
- fixed-buffer
- plugging
- bio-cache
- sync and async polling.
This scales well. 512b randread perf (KIOPS) comparing
uring-passthru-over-char (/dev/ng0n1) to
uring-over-block (/dev/nvme0n1):

QD     uring    pt      uring-poll  pt-poll
8      538      589     831         902
64     967      1131    1351        1378
256    1043     1230    1376        1429
Testing/perf is done with a custom fio that turns regular io into
passthru io when the "uring_cmd=1" option is supplied:
https://github.com/joshkan/fio/tree/big-sqe-pt.v1
Example command-line:
fio -iodepth=256 -rw=randread -ioengine=io_uring -bs=512 -numjobs=1
-runtime=60 -group_reporting -iodepth_batch_submit=64
-iodepth_batch_complete_min=1 -iodepth_batch_complete_max=64
-fixedbufs=1 -hipri=1 -sqthread_poll=0 -filename=/dev/ng0n1
-name=io_uring_256 -uring_cmd=1
Anuj Gupta (3):
io_uring: prep for fixed-buffer enabled uring-cmd
nvme: enable passthrough with fixed-buffer
nvme: enable non-inline passthru commands
Jens Axboe (5):
io_uring: add support for 128-byte SQEs
fs: add file_operations->async_cmd()
io_uring: add infra and support for IORING_OP_URING_CMD
io_uring: plug for async bypass
block: wire-up support for plugging
Kanchan Joshi (5):
nvme: wire-up support for async-passthru on char-device.
io_uring: add support for uring_cmd with fixed-buffer
block: factor out helper for bio allocation from cache
nvme: enable bio-cache for fixed-buffer passthru
io_uring: add support for non-inline uring-cmd
Keith Busch (2):
nvme: modify nvme_alloc_request to take an additional parameter
nvme: allow user passthrough commands to poll
Pankaj Raghav (2):
io_uring: add polling support for uring-cmd
nvme: wire-up polling for uring-passthru
block/bio.c | 43 ++--
block/blk-map.c | 45 +++++
block/blk-mq.c | 93 ++++-----
drivers/nvme/host/core.c | 21 +-
drivers/nvme/host/ioctl.c | 336 +++++++++++++++++++++++++++-----
drivers/nvme/host/multipath.c | 2 +
drivers/nvme/host/nvme.h | 11 +-
drivers/nvme/host/pci.c | 4 +-
drivers/nvme/target/passthru.c | 2 +-
fs/io_uring.c | 188 ++++++++++++++++--
include/linux/bio.h | 1 +
include/linux/blk-mq.h | 4 +
include/linux/fs.h | 2 +
include/linux/io_uring.h | 43 ++++
include/uapi/linux/io_uring.h | 21 +-
include/uapi/linux/nvme_ioctl.h | 4 +
16 files changed, 689 insertions(+), 131 deletions(-)
--
2.25.1
* [PATCH 01/17] io_uring: add support for 128-byte SQEs
From: Kanchan Joshi @ 2022-03-08 15:20 UTC (permalink / raw)
To: axboe, hch, kbusch, asml.silence
Cc: io-uring, linux-nvme, linux-block, sbates, logang, pankydev8,
javier, mcgrof, a.manzanares, joshiiitr, anuj20.g
From: Jens Axboe <[email protected]>
Normal SQEs are 64 bytes in length, which is fine for all the commands
we support. However, in preparation for supporting passthrough IO,
provide an option for setting up a ring with 128-byte SQEs.

We continue to use the same type for io_uring_sqe; it's marked and
commented with a zero-sized array pad at the end. This provides up
to 80 bytes of data for a passthrough command - 64 bytes for the
extra added data, and 16 bytes available at the end of the existing
SQE.
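For reference, a minimal sketch (not part of the patch) of how an
application could request the larger SQEs with the raw setup syscall;
only the new flag comes from this patch:

#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/io_uring.h>

/* Returns a ring fd whose SQ slots are 128 bytes each, or -1 on error. */
static int setup_big_sqe_ring(unsigned int entries)
{
	struct io_uring_params p;

	memset(&p, 0, sizeof(p));
	p.flags = IORING_SETUP_SQE128;
	return syscall(__NR_io_uring_setup, entries, &p);
}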
Signed-off-by: Jens Axboe <[email protected]>
---
fs/io_uring.c | 13 ++++++++++---
include/uapi/linux/io_uring.h | 7 +++++++
2 files changed, 17 insertions(+), 3 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index a7412f6862fc..241ba1cd6dcf 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -7431,8 +7431,12 @@ static const struct io_uring_sqe *io_get_sqe(struct io_ring_ctx *ctx)
* though the application is the one updating it.
*/
head = READ_ONCE(ctx->sq_array[sq_idx]);
- if (likely(head < ctx->sq_entries))
+ if (likely(head < ctx->sq_entries)) {
+ /* double index for 128-byte SQEs, twice as long */
+ if (ctx->flags & IORING_SETUP_SQE128)
+ head <<= 1;
return &ctx->sq_sqes[head];
+ }
/* drop invalid entries */
ctx->cq_extra--;
@@ -10431,7 +10435,10 @@ static __cold int io_allocate_scq_urings(struct io_ring_ctx *ctx,
rings->sq_ring_entries = p->sq_entries;
rings->cq_ring_entries = p->cq_entries;
- size = array_size(sizeof(struct io_uring_sqe), p->sq_entries);
+ if (p->flags & IORING_SETUP_SQE128)
+ size = array_size(2 * sizeof(struct io_uring_sqe), p->sq_entries);
+ else
+ size = array_size(sizeof(struct io_uring_sqe), p->sq_entries);
if (size == SIZE_MAX) {
io_mem_free(ctx->rings);
ctx->rings = NULL;
@@ -10643,7 +10650,7 @@ static long io_uring_setup(u32 entries, struct io_uring_params __user *params)
if (p.flags & ~(IORING_SETUP_IOPOLL | IORING_SETUP_SQPOLL |
IORING_SETUP_SQ_AFF | IORING_SETUP_CQSIZE |
IORING_SETUP_CLAMP | IORING_SETUP_ATTACH_WQ |
- IORING_SETUP_R_DISABLED))
+ IORING_SETUP_R_DISABLED | IORING_SETUP_SQE128))
return -EINVAL;
return io_uring_create(entries, &p, params);
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 787f491f0d2a..c5db68433ca5 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -61,6 +61,12 @@ struct io_uring_sqe {
__u32 file_index;
};
__u64 __pad2[2];
+
+ /*
+ * If the ring is initialized with IORING_SETUP_SQE128, then this field
+ * contains 64-bytes of padding, doubling the size of the SQE.
+ */
+ __u64 __big_sqe_pad[0];
};
enum {
@@ -101,6 +107,7 @@ enum {
#define IORING_SETUP_CLAMP (1U << 4) /* clamp SQ/CQ ring sizes */
#define IORING_SETUP_ATTACH_WQ (1U << 5) /* attach to existing wq */
#define IORING_SETUP_R_DISABLED (1U << 6) /* start with ring disabled */
+#define IORING_SETUP_SQE128 (1U << 7) /* SQEs are 128b */
enum {
IORING_OP_NOP,
--
2.25.1
* [PATCH 02/17] fs: add file_operations->async_cmd()
From: Kanchan Joshi @ 2022-03-08 15:20 UTC (permalink / raw)
To: axboe, hch, kbusch, asml.silence
Cc: io-uring, linux-nvme, linux-block, sbates, logang, pankydev8,
javier, mcgrof, a.manzanares, joshiiitr, anuj20.g
From: Jens Axboe <[email protected]>
This is a file private handler, similar to ioctls but hopefully a lot
more sane and useful.
Signed-off-by: Jens Axboe <[email protected]>
---
include/linux/fs.h | 2 ++
1 file changed, 2 insertions(+)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e2d892b201b0..a32f83b70435 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1977,6 +1977,7 @@ struct dir_context {
#define REMAP_FILE_ADVISORY (REMAP_FILE_CAN_SHORTEN)
struct iov_iter;
+struct io_uring_cmd;
struct file_operations {
struct module *owner;
@@ -2019,6 +2020,7 @@ struct file_operations {
struct file *file_out, loff_t pos_out,
loff_t len, unsigned int remap_flags);
int (*fadvise)(struct file *, loff_t, loff_t, int);
+ int (*async_cmd)(struct io_uring_cmd *ioucmd);
} __randomize_layout;
struct inode_operations {
--
2.25.1
* [PATCH 03/17] io_uring: add infra and support for IORING_OP_URING_CMD
From: Kanchan Joshi @ 2022-03-08 15:20 UTC (permalink / raw)
To: axboe, hch, kbusch, asml.silence
Cc: io-uring, linux-nvme, linux-block, sbates, logang, pankydev8,
javier, mcgrof, a.manzanares, joshiiitr, anuj20.g
From: Jens Axboe <[email protected]>
This is a file-private kind of request. io_uring doesn't know what's
in this command type; it's for the file_operations->async_cmd()
handler to deal with.
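For illustration, the contract a provider is expected to follow looks
roughly like this (hypothetical "foo" driver and opcode, not part of the
patch):

#include <linux/errno.h>
#include <linux/io_uring.h>

/* Runs in task context via io_uring's task-work; safe place to finish up. */
static void foo_uring_cmd_end(struct io_uring_cmd *ioucmd)
{
	io_uring_cmd_done(ioucmd, 0);
}

static int foo_async_cmd(struct io_uring_cmd *ioucmd)
{
	switch (ioucmd->cmd_op) {
	case 0x1234:	/* hypothetical opcode; hardware submission elided */
		/*
		 * On IRQ completion the driver would call
		 * io_uring_cmd_complete_in_task(ioucmd, foo_uring_cmd_end).
		 * Returning -EIOCBQUEUED tells io_uring that the completion
		 * will be posted later via io_uring_cmd_done().
		 */
		return -EIOCBQUEUED;
	default:
		return -ENOTTY;
	}
}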
Signed-off-by: Jens Axboe <[email protected]>
Signed-off-by: Kanchan Joshi <[email protected]>
---
fs/io_uring.c | 79 +++++++++++++++++++++++++++++++----
include/linux/io_uring.h | 29 +++++++++++++
include/uapi/linux/io_uring.h | 9 +++-
3 files changed, 108 insertions(+), 9 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 241ba1cd6dcf..1f228a79e68f 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -200,13 +200,6 @@ struct io_rings {
struct io_uring_cqe cqes[] ____cacheline_aligned_in_smp;
};
-enum io_uring_cmd_flags {
- IO_URING_F_COMPLETE_DEFER = 1,
- IO_URING_F_UNLOCKED = 2,
- /* int's last bit, sign checks are usually faster than a bit test */
- IO_URING_F_NONBLOCK = INT_MIN,
-};
-
struct io_mapped_ubuf {
u64 ubuf;
u64 ubuf_end;
@@ -860,6 +853,7 @@ struct io_kiocb {
struct io_mkdir mkdir;
struct io_symlink symlink;
struct io_hardlink hardlink;
+ struct io_uring_cmd uring_cmd;
};
u8 opcode;
@@ -1110,6 +1104,9 @@ static const struct io_op_def io_op_defs[] = {
[IORING_OP_MKDIRAT] = {},
[IORING_OP_SYMLINKAT] = {},
[IORING_OP_LINKAT] = {},
+ [IORING_OP_URING_CMD] = {
+ .needs_file = 1,
+ },
};
/* requests with any of those set should undergo io_disarm_next() */
@@ -2464,6 +2461,22 @@ static void io_req_task_submit(struct io_kiocb *req, bool *locked)
io_req_complete_failed(req, -EFAULT);
}
+static void io_uring_cmd_work(struct io_kiocb *req, bool *locked)
+{
+ req->uring_cmd.driver_cb(&req->uring_cmd);
+}
+
+void io_uring_cmd_complete_in_task(struct io_uring_cmd *ioucmd,
+ void (*driver_cb)(struct io_uring_cmd *))
+{
+ struct io_kiocb *req = container_of(ioucmd, struct io_kiocb, uring_cmd);
+
+ req->uring_cmd.driver_cb = driver_cb;
+ req->io_task_work.func = io_uring_cmd_work;
+ io_req_task_work_add(req, !!(req->ctx->flags & IORING_SETUP_SQPOLL));
+}
+EXPORT_SYMBOL_GPL(io_uring_cmd_complete_in_task);
+
static void io_req_task_queue_fail(struct io_kiocb *req, int ret)
{
req->result = ret;
@@ -4109,6 +4122,51 @@ static int io_linkat(struct io_kiocb *req, unsigned int issue_flags)
return 0;
}
+/*
+ * Called by consumers of io_uring_cmd, if they originally returned
+ * -EIOCBQUEUED upon receiving the command.
+ */
+void io_uring_cmd_done(struct io_uring_cmd *ioucmd, ssize_t ret)
+{
+ struct io_kiocb *req = container_of(ioucmd, struct io_kiocb, uring_cmd);
+
+ if (ret < 0)
+ req_set_fail(req);
+ io_req_complete(req, ret);
+}
+EXPORT_SYMBOL_GPL(io_uring_cmd_done);
+
+static int io_uring_cmd_prep(struct io_kiocb *req,
+ const struct io_uring_sqe *sqe)
+{
+ struct io_uring_cmd *ioucmd = &req->uring_cmd;
+
+ if (!req->file->f_op->async_cmd || !(req->ctx->flags & IORING_SETUP_SQE128))
+ return -EOPNOTSUPP;
+ if (req->ctx->flags & IORING_SETUP_IOPOLL)
+ return -EOPNOTSUPP;
+ ioucmd->cmd = (void *) &sqe->cmd;
+ ioucmd->cmd_op = READ_ONCE(sqe->cmd_op);
+ ioucmd->cmd_len = READ_ONCE(sqe->cmd_len);
+ ioucmd->flags = 0;
+ return 0;
+}
+
+static int io_uring_cmd(struct io_kiocb *req, unsigned int issue_flags)
+{
+ struct file *file = req->file;
+ int ret;
+ struct io_uring_cmd *ioucmd = &req->uring_cmd;
+
+ ioucmd->flags |= issue_flags;
+ ret = file->f_op->async_cmd(ioucmd);
+ /* queued async, consumer will call io_uring_cmd_done() when complete */
+ if (ret == -EIOCBQUEUED)
+ return 0;
+ io_uring_cmd_done(ioucmd, ret);
+ return 0;
+}
+
static int io_shutdown_prep(struct io_kiocb *req,
const struct io_uring_sqe *sqe)
{
@@ -6588,6 +6646,8 @@ static int io_req_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
return io_symlinkat_prep(req, sqe);
case IORING_OP_LINKAT:
return io_linkat_prep(req, sqe);
+ case IORING_OP_URING_CMD:
+ return io_uring_cmd_prep(req, sqe);
}
printk_once(KERN_WARNING "io_uring: unhandled opcode %d\n",
@@ -6871,6 +6931,9 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
case IORING_OP_LINKAT:
ret = io_linkat(req, issue_flags);
break;
+ case IORING_OP_URING_CMD:
+ ret = io_uring_cmd(req, issue_flags);
+ break;
default:
ret = -EINVAL;
break;
@@ -11215,6 +11278,8 @@ static int __init io_uring_init(void)
BUILD_BUG_ON(ARRAY_SIZE(io_op_defs) != IORING_OP_LAST);
BUILD_BUG_ON(__REQ_F_LAST_BIT > 8 * sizeof(int));
+ BUILD_BUG_ON(sizeof(struct io_uring_cmd) > 64);
+
req_cachep = KMEM_CACHE(io_kiocb, SLAB_HWCACHE_ALIGN | SLAB_PANIC |
SLAB_ACCOUNT);
return 0;
diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
index 649a4d7c241b..cedc68201469 100644
--- a/include/linux/io_uring.h
+++ b/include/linux/io_uring.h
@@ -5,7 +5,29 @@
#include <linux/sched.h>
#include <linux/xarray.h>
+enum io_uring_cmd_flags {
+ IO_URING_F_COMPLETE_DEFER = 1,
+ IO_URING_F_UNLOCKED = 2,
+ /* int's last bit, sign checks are usually faster than a bit test */
+ IO_URING_F_NONBLOCK = INT_MIN,
+};
+
+struct io_uring_cmd {
+ struct file *file;
+ void *cmd;
+ /* for irq-completion - if driver requires doing stuff in task-context */
+ void (*driver_cb)(struct io_uring_cmd *cmd);
+ u32 flags;
+ u32 cmd_op;
+ u16 cmd_len;
+ u16 unused;
+ u8 pdu[28]; /* available inline for free use */
+};
+
#if defined(CONFIG_IO_URING)
+void io_uring_cmd_done(struct io_uring_cmd *cmd, ssize_t ret);
+void io_uring_cmd_complete_in_task(struct io_uring_cmd *ioucmd,
+ void (*driver_cb)(struct io_uring_cmd *));
struct sock *io_uring_get_socket(struct file *file);
void __io_uring_cancel(bool cancel_all);
void __io_uring_free(struct task_struct *tsk);
@@ -26,6 +48,13 @@ static inline void io_uring_free(struct task_struct *tsk)
__io_uring_free(tsk);
}
#else
+static inline void io_uring_cmd_done(struct io_uring_cmd *cmd, ssize_t ret)
+{
+}
+static inline void io_uring_cmd_complete_in_task(struct io_uring_cmd *ioucmd,
+ void (*driver_cb)(struct io_uring_cmd *))
+{
+}
static inline struct sock *io_uring_get_socket(struct file *file)
{
return NULL;
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index c5db68433ca5..9bf1d6c0ed7f 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -22,10 +22,12 @@ struct io_uring_sqe {
union {
__u64 off; /* offset into file */
__u64 addr2;
+ __u32 cmd_op;
};
union {
__u64 addr; /* pointer to buffer or iovecs */
__u64 splice_off_in;
+ __u16 cmd_len;
};
__u32 len; /* buffer size or number of iovecs */
union {
@@ -60,8 +62,10 @@ struct io_uring_sqe {
__s32 splice_fd_in;
__u32 file_index;
};
- __u64 __pad2[2];
-
+ union {
+ __u64 __pad2[2];
+ __u64 cmd;
+ };
/*
* If the ring is initialized with IORING_SETUP_SQE128, then this field
* contains 64-bytes of padding, doubling the size of the SQE.
@@ -150,6 +154,7 @@ enum {
IORING_OP_MKDIRAT,
IORING_OP_SYMLINKAT,
IORING_OP_LINKAT,
+ IORING_OP_URING_CMD,
/* this goes last, obviously */
IORING_OP_LAST,
--
2.25.1
* [PATCH 04/17] nvme: modify nvme_alloc_request to take an additional parameter
From: Kanchan Joshi @ 2022-03-08 15:20 UTC (permalink / raw)
To: axboe, hch, kbusch, asml.silence
Cc: io-uring, linux-nvme, linux-block, sbates, logang, pankydev8,
javier, mcgrof, a.manzanares, joshiiitr, anuj20.g
From: Keith Busch <[email protected]>
This is a prep patch. It modifies nvme_alloc_request to take an
additional parameter, allowing request flags to be passed.
Signed-off-by: Keith Busch <[email protected]>
---
drivers/nvme/host/core.c | 10 ++++++----
drivers/nvme/host/ioctl.c | 2 +-
drivers/nvme/host/nvme.h | 3 ++-
drivers/nvme/host/pci.c | 4 ++--
drivers/nvme/target/passthru.c | 2 +-
5 files changed, 12 insertions(+), 9 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 961a5f8a44d2..159944499c4f 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -630,11 +630,13 @@ static inline void nvme_init_request(struct request *req,
}
struct request *nvme_alloc_request(struct request_queue *q,
- struct nvme_command *cmd, blk_mq_req_flags_t flags)
+ struct nvme_command *cmd, blk_mq_req_flags_t flags,
+ unsigned int rq_flags)
{
+ unsigned int cmd_flags = nvme_req_op(cmd) | rq_flags;
struct request *req;
- req = blk_mq_alloc_request(q, nvme_req_op(cmd), flags);
+ req = blk_mq_alloc_request(q, cmd_flags, flags);
if (!IS_ERR(req))
nvme_init_request(req, cmd);
return req;
@@ -1075,7 +1077,7 @@ int __nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
int ret;
if (qid == NVME_QID_ANY)
- req = nvme_alloc_request(q, cmd, flags);
+ req = nvme_alloc_request(q, cmd, flags, 0);
else
req = nvme_alloc_request_qid(q, cmd, flags, qid);
if (IS_ERR(req))
@@ -1271,7 +1273,7 @@ static void nvme_keep_alive_work(struct work_struct *work)
}
rq = nvme_alloc_request(ctrl->admin_q, &ctrl->ka_cmd,
- BLK_MQ_REQ_RESERVED | BLK_MQ_REQ_NOWAIT);
+ BLK_MQ_REQ_RESERVED | BLK_MQ_REQ_NOWAIT, 0);
if (IS_ERR(rq)) {
/* allocation failure, reset the controller */
dev_err(ctrl->device, "keep-alive failed: %ld\n", PTR_ERR(rq));
diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index 22314962842d..5c9cd9695519 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -66,7 +66,7 @@ static int nvme_submit_user_cmd(struct request_queue *q,
void *meta = NULL;
int ret;
- req = nvme_alloc_request(q, cmd, 0);
+ req = nvme_alloc_request(q, cmd, 0, 0);
if (IS_ERR(req))
return PTR_ERR(req);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index a162f6c6da6e..b32f4e2c68fd 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -698,7 +698,8 @@ void nvme_start_freeze(struct nvme_ctrl *ctrl);
#define NVME_QID_ANY -1
struct request *nvme_alloc_request(struct request_queue *q,
- struct nvme_command *cmd, blk_mq_req_flags_t flags);
+ struct nvme_command *cmd, blk_mq_req_flags_t flags,
+ unsigned int rq_flags);
void nvme_cleanup_cmd(struct request *req);
blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req);
blk_status_t nvme_fail_nonready_command(struct nvme_ctrl *ctrl,
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 6a99ed680915..655c26589ac3 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1429,7 +1429,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
req->tag, nvmeq->qid);
abort_req = nvme_alloc_request(dev->ctrl.admin_q, &cmd,
- BLK_MQ_REQ_NOWAIT);
+ BLK_MQ_REQ_NOWAIT, 0);
if (IS_ERR(abort_req)) {
atomic_inc(&dev->ctrl.abort_limit);
return BLK_EH_RESET_TIMER;
@@ -2475,7 +2475,7 @@ static int nvme_delete_queue(struct nvme_queue *nvmeq, u8 opcode)
cmd.delete_queue.opcode = opcode;
cmd.delete_queue.qid = cpu_to_le16(nvmeq->qid);
- req = nvme_alloc_request(q, &cmd, BLK_MQ_REQ_NOWAIT);
+ req = nvme_alloc_request(q, &cmd, BLK_MQ_REQ_NOWAIT, 0);
if (IS_ERR(req))
return PTR_ERR(req);
diff --git a/drivers/nvme/target/passthru.c b/drivers/nvme/target/passthru.c
index 9e5b89ae29df..2a9e2fd3b137 100644
--- a/drivers/nvme/target/passthru.c
+++ b/drivers/nvme/target/passthru.c
@@ -253,7 +253,7 @@ static void nvmet_passthru_execute_cmd(struct nvmet_req *req)
timeout = nvmet_req_subsys(req)->admin_timeout;
}
- rq = nvme_alloc_request(q, req->cmd, 0);
+ rq = nvme_alloc_request(q, req->cmd, 0, 0);
if (IS_ERR(rq)) {
status = NVME_SC_INTERNAL;
goto out_put_ns;
--
2.25.1
* [PATCH 05/17] nvme: wire-up support for async-passthru on char-device.
From: Kanchan Joshi @ 2022-03-08 15:20 UTC (permalink / raw)
To: axboe, hch, kbusch, asml.silence
Cc: io-uring, linux-nvme, linux-block, sbates, logang, pankydev8,
javier, mcgrof, a.manzanares, joshiiitr, anuj20.g
Introduce a handler for fops->async_cmd(), implementing async passthru
on the char device (/dev/ngX). The handler supports NVME_IOCTL_IO64_CMD for
read and write commands, and returns failure for other commands.

This is a low-overhead path for processing the inline commands housed
inside io_uring's sqe. Neither is the command fetched via
copy_from_user, nor is the result (inside the passthru command) updated via
put_user.
Signed-off-by: Kanchan Joshi <[email protected]>
Signed-off-by: Anuj Gupta <[email protected]>
---
drivers/nvme/host/core.c | 1 +
drivers/nvme/host/ioctl.c | 205 ++++++++++++++++++++++++++++------
drivers/nvme/host/multipath.c | 1 +
drivers/nvme/host/nvme.h | 3 +
4 files changed, 178 insertions(+), 32 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 159944499c4f..3fe8f5901cd9 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3667,6 +3667,7 @@ static const struct file_operations nvme_ns_chr_fops = {
.release = nvme_ns_chr_release,
.unlocked_ioctl = nvme_ns_chr_ioctl,
.compat_ioctl = compat_ptr_ioctl,
+ .async_cmd = nvme_ns_chr_async_cmd,
};
static int nvme_add_ns_cdev(struct nvme_ns *ns)
diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index 5c9cd9695519..1df270b47af5 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -18,6 +18,76 @@ static void __user *nvme_to_user_ptr(uintptr_t ptrval)
ptrval = (compat_uptr_t)ptrval;
return (void __user *)ptrval;
}
+/*
+ * This overlays struct io_uring_cmd pdu.
+ * Expect build errors if this grows larger than that.
+ */
+struct nvme_uring_cmd_pdu {
+ u32 meta_len;
+ union {
+ struct bio *bio;
+ struct request *req;
+ };
+ void *meta; /* kernel-resident buffer */
+ void __user *meta_buffer;
+} __packed;
+
+static struct nvme_uring_cmd_pdu *nvme_uring_cmd_pdu(struct io_uring_cmd *ioucmd)
+{
+ return (struct nvme_uring_cmd_pdu *)&ioucmd->pdu;
+}
+
+static void nvme_pt_task_cb(struct io_uring_cmd *ioucmd)
+{
+ struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd);
+ struct request *req = pdu->req;
+ int status;
+ struct bio *bio = req->bio;
+
+ if (nvme_req(req)->flags & NVME_REQ_CANCELLED)
+ status = -EINTR;
+ else
+ status = nvme_req(req)->status;
+
+ /* we can free request */
+ blk_mq_free_request(req);
+ blk_rq_unmap_user(bio);
+
+ if (!status && pdu->meta_buffer) {
+ if (copy_to_user(pdu->meta_buffer, pdu->meta, pdu->meta_len))
+ status = -EFAULT;
+ }
+ kfree(pdu->meta);
+
+ io_uring_cmd_done(ioucmd, status);
+}
+
+static void nvme_end_async_pt(struct request *req, blk_status_t err)
+{
+ struct io_uring_cmd *ioucmd = req->end_io_data;
+ struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd);
+ /* extract bio before reusing the same field for request */
+ struct bio *bio = pdu->bio;
+
+ pdu->req = req;
+ req->bio = bio;
+ /* this takes care of setting up task-work */
+ io_uring_cmd_complete_in_task(ioucmd, nvme_pt_task_cb);
+}
+
+static void nvme_setup_uring_cmd_data(struct request *rq,
+ struct io_uring_cmd *ioucmd, void *meta,
+ void __user *meta_buffer, u32 meta_len)
+{
+ struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd);
+
+ /* to free bio on completion, as req->bio will be null at that time */
+ pdu->bio = rq->bio;
+ pdu->meta = meta;
+ pdu->meta_buffer = meta_buffer;
+ pdu->meta_len = meta_len;
+ rq->end_io_data = ioucmd;
+}
static void *nvme_add_user_metadata(struct bio *bio, void __user *ubuf,
unsigned len, u32 seed, bool write)
@@ -56,7 +126,8 @@ static void *nvme_add_user_metadata(struct bio *bio, void __user *ubuf,
static int nvme_submit_user_cmd(struct request_queue *q,
struct nvme_command *cmd, void __user *ubuffer,
unsigned bufflen, void __user *meta_buffer, unsigned meta_len,
- u32 meta_seed, u64 *result, unsigned timeout)
+ u32 meta_seed, u64 *result, unsigned timeout,
+ struct io_uring_cmd *ioucmd)
{
bool write = nvme_is_write(cmd);
struct nvme_ns *ns = q->queuedata;
@@ -64,9 +135,15 @@ static int nvme_submit_user_cmd(struct request_queue *q,
struct request *req;
struct bio *bio = NULL;
void *meta = NULL;
+ unsigned int rq_flags = 0;
+ blk_mq_req_flags_t blk_flags = 0;
int ret;
- req = nvme_alloc_request(q, cmd, 0, 0);
+ if (ioucmd && (ioucmd->flags & IO_URING_F_NONBLOCK)) {
+ rq_flags |= REQ_NOWAIT;
+ blk_flags |= BLK_MQ_REQ_NOWAIT;
+ }
+ req = nvme_alloc_request(q, cmd, blk_flags, rq_flags);
if (IS_ERR(req))
return PTR_ERR(req);
@@ -92,6 +169,19 @@ static int nvme_submit_user_cmd(struct request_queue *q,
req->cmd_flags |= REQ_INTEGRITY;
}
}
+ if (ioucmd) { /* async dispatch */
+ if (cmd->common.opcode == nvme_cmd_write ||
+ cmd->common.opcode == nvme_cmd_read) {
+ nvme_setup_uring_cmd_data(req, ioucmd, meta, meta_buffer,
+ meta_len);
+ blk_execute_rq_nowait(req, 0, nvme_end_async_pt);
+ return 0;
+ } else {
+ /* support only read and write for now. */
+ ret = -EINVAL;
+ goto out_meta;
+ }
+ }
ret = nvme_execute_passthru_rq(req);
if (result)
@@ -100,6 +190,7 @@ static int nvme_submit_user_cmd(struct request_queue *q,
if (copy_to_user(meta_buffer, meta, meta_len))
ret = -EFAULT;
}
+ out_meta:
kfree(meta);
out_unmap:
if (bio)
@@ -170,7 +261,8 @@ static int nvme_submit_io(struct nvme_ns *ns, struct nvme_user_io __user *uio)
return nvme_submit_user_cmd(ns->queue, &c,
nvme_to_user_ptr(io.addr), length,
- metadata, meta_len, lower_32_bits(io.slba), NULL, 0);
+ metadata, meta_len, lower_32_bits(io.slba), NULL, 0,
+ NULL);
}
static bool nvme_validate_passthru_nsid(struct nvme_ctrl *ctrl,
@@ -224,7 +316,7 @@ static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
status = nvme_submit_user_cmd(ns ? ns->queue : ctrl->admin_q, &c,
nvme_to_user_ptr(cmd.addr), cmd.data_len,
nvme_to_user_ptr(cmd.metadata), cmd.metadata_len,
- 0, &result, timeout);
+ 0, &result, timeout, NULL);
if (status >= 0) {
if (put_user(result, &ucmd->result))
@@ -235,45 +327,53 @@ static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
}
static int nvme_user_cmd64(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
- struct nvme_passthru_cmd64 __user *ucmd)
+ struct nvme_passthru_cmd64 __user *ucmd,
+ struct io_uring_cmd *ioucmd)
{
- struct nvme_passthru_cmd64 cmd;
+ struct nvme_passthru_cmd64 cmd, *cptr;
struct nvme_command c;
unsigned timeout = 0;
int status;
if (!capable(CAP_SYS_ADMIN))
return -EACCES;
- if (copy_from_user(&cmd, ucmd, sizeof(cmd)))
- return -EFAULT;
- if (cmd.flags)
+ if (!ioucmd) {
+ if (copy_from_user(&cmd, ucmd, sizeof(cmd)))
+ return -EFAULT;
+ cptr = &cmd;
+ } else {
+ if (ioucmd->cmd_len != sizeof(struct nvme_passthru_cmd64))
+ return -EINVAL;
+ cptr = (struct nvme_passthru_cmd64 *)ioucmd->cmd;
+ }
+ if (cptr->flags)
return -EINVAL;
- if (!nvme_validate_passthru_nsid(ctrl, ns, cmd.nsid))
+ if (!nvme_validate_passthru_nsid(ctrl, ns, cptr->nsid))
return -EINVAL;
memset(&c, 0, sizeof(c));
- c.common.opcode = cmd.opcode;
- c.common.flags = cmd.flags;
- c.common.nsid = cpu_to_le32(cmd.nsid);
- c.common.cdw2[0] = cpu_to_le32(cmd.cdw2);
- c.common.cdw2[1] = cpu_to_le32(cmd.cdw3);
- c.common.cdw10 = cpu_to_le32(cmd.cdw10);
- c.common.cdw11 = cpu_to_le32(cmd.cdw11);
- c.common.cdw12 = cpu_to_le32(cmd.cdw12);
- c.common.cdw13 = cpu_to_le32(cmd.cdw13);
- c.common.cdw14 = cpu_to_le32(cmd.cdw14);
- c.common.cdw15 = cpu_to_le32(cmd.cdw15);
-
- if (cmd.timeout_ms)
- timeout = msecs_to_jiffies(cmd.timeout_ms);
+ c.common.opcode = cptr->opcode;
+ c.common.flags = cptr->flags;
+ c.common.nsid = cpu_to_le32(cptr->nsid);
+ c.common.cdw2[0] = cpu_to_le32(cptr->cdw2);
+ c.common.cdw2[1] = cpu_to_le32(cptr->cdw3);
+ c.common.cdw10 = cpu_to_le32(cptr->cdw10);
+ c.common.cdw11 = cpu_to_le32(cptr->cdw11);
+ c.common.cdw12 = cpu_to_le32(cptr->cdw12);
+ c.common.cdw13 = cpu_to_le32(cptr->cdw13);
+ c.common.cdw14 = cpu_to_le32(cptr->cdw14);
+ c.common.cdw15 = cpu_to_le32(cptr->cdw15);
+
+ if (cptr->timeout_ms)
+ timeout = msecs_to_jiffies(cptr->timeout_ms);
status = nvme_submit_user_cmd(ns ? ns->queue : ctrl->admin_q, &c,
- nvme_to_user_ptr(cmd.addr), cmd.data_len,
- nvme_to_user_ptr(cmd.metadata), cmd.metadata_len,
- 0, &cmd.result, timeout);
+ nvme_to_user_ptr(cptr->addr), cptr->data_len,
+ nvme_to_user_ptr(cptr->metadata), cptr->metadata_len,
+ 0, &cptr->result, timeout, ioucmd);
- if (status >= 0) {
- if (put_user(cmd.result, &ucmd->result))
+ if (!ioucmd && status >= 0) {
+ if (put_user(cptr->result, &ucmd->result))
return -EFAULT;
}
@@ -296,7 +396,7 @@ static int nvme_ctrl_ioctl(struct nvme_ctrl *ctrl, unsigned int cmd,
case NVME_IOCTL_ADMIN_CMD:
return nvme_user_cmd(ctrl, NULL, argp);
case NVME_IOCTL_ADMIN64_CMD:
- return nvme_user_cmd64(ctrl, NULL, argp);
+ return nvme_user_cmd64(ctrl, NULL, argp, NULL);
default:
return sed_ioctl(ctrl->opal_dev, cmd, argp);
}
@@ -340,7 +440,7 @@ static int nvme_ns_ioctl(struct nvme_ns *ns, unsigned int cmd,
case NVME_IOCTL_SUBMIT_IO:
return nvme_submit_io(ns, argp);
case NVME_IOCTL_IO64_CMD:
- return nvme_user_cmd64(ns->ctrl, ns, argp);
+ return nvme_user_cmd64(ns->ctrl, ns, argp, NULL);
default:
return -ENOTTY;
}
@@ -369,6 +469,33 @@ long nvme_ns_chr_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
return __nvme_ioctl(ns, cmd, (void __user *)arg);
}
+static int nvme_ns_async_ioctl(struct nvme_ns *ns, struct io_uring_cmd *ioucmd)
+{
+ int ret;
+
+ BUILD_BUG_ON(sizeof(struct nvme_uring_cmd_pdu) > sizeof(ioucmd->pdu));
+
+ switch (ioucmd->cmd_op) {
+ case NVME_IOCTL_IO64_CMD:
+ ret = nvme_user_cmd64(ns->ctrl, ns, NULL, ioucmd);
+ break;
+ default:
+ ret = -ENOTTY;
+ }
+
+ if (ret >= 0)
+ ret = -EIOCBQUEUED;
+ return ret;
+}
+
+int nvme_ns_chr_async_cmd(struct io_uring_cmd *ioucmd)
+{
+ struct nvme_ns *ns = container_of(file_inode(ioucmd->file)->i_cdev,
+ struct nvme_ns, cdev);
+
+ return nvme_ns_async_ioctl(ns, ioucmd);
+}
+
#ifdef CONFIG_NVME_MULTIPATH
static int nvme_ns_head_ctrl_ioctl(struct nvme_ns *ns, unsigned int cmd,
void __user *argp, struct nvme_ns_head *head, int srcu_idx)
@@ -412,6 +539,20 @@ int nvme_ns_head_ioctl(struct block_device *bdev, fmode_t mode,
return ret;
}
+int nvme_ns_head_chr_async_cmd(struct io_uring_cmd *ioucmd)
+{
+ struct cdev *cdev = file_inode(ioucmd->file)->i_cdev;
+ struct nvme_ns_head *head = container_of(cdev, struct nvme_ns_head, cdev);
+ int srcu_idx = srcu_read_lock(&head->srcu);
+ struct nvme_ns *ns = nvme_find_path(head);
+ int ret = -EWOULDBLOCK;
+
+ if (ns)
+ ret = nvme_ns_async_ioctl(ns, ioucmd);
+ srcu_read_unlock(&head->srcu, srcu_idx);
+ return ret;
+}
+
long nvme_ns_head_chr_ioctl(struct file *file, unsigned int cmd,
unsigned long arg)
{
@@ -480,7 +621,7 @@ long nvme_dev_ioctl(struct file *file, unsigned int cmd,
case NVME_IOCTL_ADMIN_CMD:
return nvme_user_cmd(ctrl, NULL, argp);
case NVME_IOCTL_ADMIN64_CMD:
- return nvme_user_cmd64(ctrl, NULL, argp);
+ return nvme_user_cmd64(ctrl, NULL, argp, NULL);
case NVME_IOCTL_IO_CMD:
return nvme_dev_user_cmd(ctrl, argp);
case NVME_IOCTL_RESET:
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index f8bf6606eb2f..1d798d09456f 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -459,6 +459,7 @@ static const struct file_operations nvme_ns_head_chr_fops = {
.release = nvme_ns_head_chr_release,
.unlocked_ioctl = nvme_ns_head_chr_ioctl,
.compat_ioctl = compat_ptr_ioctl,
+ .async_cmd = nvme_ns_head_chr_async_cmd,
};
static int nvme_add_ns_head_cdev(struct nvme_ns_head *head)
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index b32f4e2c68fd..e6a30543d7c8 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -16,6 +16,7 @@
#include <linux/rcupdate.h>
#include <linux/wait.h>
#include <linux/t10-pi.h>
+#include <linux/io_uring.h>
#include <trace/events/block.h>
@@ -752,6 +753,8 @@ long nvme_ns_head_chr_ioctl(struct file *file, unsigned int cmd,
unsigned long arg);
long nvme_dev_ioctl(struct file *file, unsigned int cmd,
unsigned long arg);
+int nvme_ns_chr_async_cmd(struct io_uring_cmd *ioucmd);
+int nvme_ns_head_chr_async_cmd(struct io_uring_cmd *ioucmd);
int nvme_getgeo(struct block_device *bdev, struct hd_geometry *geo);
extern const struct attribute_group *nvme_ns_id_attr_groups[];
--
2.25.1
* [PATCH 06/17] io_uring: prep for fixed-buffer enabled uring-cmd
From: Kanchan Joshi @ 2022-03-08 15:20 UTC (permalink / raw)
To: axboe, hch, kbusch, asml.silence
Cc: io-uring, linux-nvme, linux-block, sbates, logang, pankydev8,
javier, mcgrof, a.manzanares, joshiiitr, anuj20.g
From: Anuj Gupta <[email protected]>
Refactor the existing code and factor out a helper that can be used for
passthrough with fixed-buffers. This is a prep patch.
Signed-off-by: Anuj Gupta <[email protected]>
Signed-off-by: Kanchan Joshi <[email protected]>
---
fs/io_uring.c | 19 ++++++++++++++-----
include/linux/io_uring.h | 7 +++++++
2 files changed, 21 insertions(+), 5 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 1f228a79e68f..ead0cbae8416 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -3140,11 +3140,10 @@ static void kiocb_done(struct io_kiocb *req, ssize_t ret,
}
}
-static int __io_import_fixed(struct io_kiocb *req, int rw, struct iov_iter *iter,
- struct io_mapped_ubuf *imu)
+static int __io_import_fixed(u64 buf_addr, size_t len, int rw,
+ struct iov_iter *iter, struct io_mapped_ubuf *imu)
{
- size_t len = req->rw.len;
- u64 buf_end, buf_addr = req->rw.addr;
+ u64 buf_end;
size_t offset;
if (unlikely(check_add_overflow(buf_addr, (u64)len, &buf_end)))
@@ -3213,9 +3212,19 @@ static int io_import_fixed(struct io_kiocb *req, int rw, struct iov_iter *iter)
imu = READ_ONCE(ctx->user_bufs[index]);
req->imu = imu;
}
- return __io_import_fixed(req, rw, iter, imu);
+ return __io_import_fixed(req->rw.addr, req->rw.len, rw, iter, imu);
}
+int io_uring_cmd_import_fixed(u64 ubuf, unsigned long len,
+ int rw, struct iov_iter *iter, void *ioucmd)
+{
+ struct io_kiocb *req = container_of(ioucmd, struct io_kiocb, uring_cmd);
+ struct io_mapped_ubuf *imu = req->imu;
+
+ return __io_import_fixed(ubuf, len, rw, iter, imu);
+}
+EXPORT_SYMBOL_GPL(io_uring_cmd_import_fixed);
+
static void io_ring_submit_unlock(struct io_ring_ctx *ctx, bool needs_lock)
{
if (needs_lock)
diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
index cedc68201469..1888a5ea7dbe 100644
--- a/include/linux/io_uring.h
+++ b/include/linux/io_uring.h
@@ -25,6 +25,8 @@ struct io_uring_cmd {
};
#if defined(CONFIG_IO_URING)
+int io_uring_cmd_import_fixed(u64 ubuf, unsigned long len,
+ int rw, struct iov_iter *iter, void *ioucmd);
void io_uring_cmd_done(struct io_uring_cmd *cmd, ssize_t ret);
void io_uring_cmd_complete_in_task(struct io_uring_cmd *ioucmd,
void (*driver_cb)(struct io_uring_cmd *));
@@ -48,6 +50,11 @@ static inline void io_uring_free(struct task_struct *tsk)
__io_uring_free(tsk);
}
#else
+static inline int io_uring_cmd_import_fixed(u64 ubuf, unsigned long len,
+ int rw, struct iov_iter *iter, void *ioucmd)
+{
+ return -1;
+}
static inline void io_uring_cmd_done(struct io_uring_cmd *cmd, ssize_t ret)
{
}
--
2.25.1
* [PATCH 07/17] io_uring: add support for uring_cmd with fixed-buffer
From: Kanchan Joshi @ 2022-03-08 15:20 UTC (permalink / raw)
To: axboe, hch, kbusch, asml.silence
Cc: io-uring, linux-nvme, linux-block, sbates, logang, pankydev8,
javier, mcgrof, a.manzanares, joshiiitr, anuj20.g
Add an IORING_OP_URING_CMD_FIXED opcode that enables performing the
operation with previously registered (fixed) buffers.
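A rough user-space sketch of how this could be used (assuming liburing's
io_uring_register_buffers() and the big-SQE fields from the earlier
patches; the helper name here is illustrative only):

#include <liburing.h>

/* Register one buffer up front, then reference it by index in the SQE. */
static int prep_fixed_uring_cmd(struct io_uring *ring, void *buf, size_t len)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	struct io_uring_sqe *sqe;
	int ret;

	ret = io_uring_register_buffers(ring, &iov, 1);	/* pre-map pages once */
	if (ret)
		return ret;

	sqe = io_uring_get_sqe(ring);
	if (!sqe)
		return -1;
	sqe->opcode = IORING_OP_URING_CMD_FIXED;
	sqe->buf_index = 0;	/* which registered buffer this command uses */
	/* fd, cmd_op, cmd_len and the inline command are set up as before */
	return 0;
}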
Signed-off-by: Kanchan Joshi <[email protected]>
Signed-off-by: Anuj Gupta <[email protected]>
---
fs/io_uring.c | 29 ++++++++++++++++++++++++++++-
include/linux/io_uring.h | 1 +
include/uapi/linux/io_uring.h | 1 +
3 files changed, 30 insertions(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index ead0cbae8416..6a1dcea0f538 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1107,6 +1107,9 @@ static const struct io_op_def io_op_defs[] = {
[IORING_OP_URING_CMD] = {
.needs_file = 1,
},
+ [IORING_OP_URING_CMD_FIXED] = {
+ .needs_file = 1,
+ },
};
/* requests with any of those set should undergo io_disarm_next() */
@@ -4148,16 +4151,25 @@ EXPORT_SYMBOL_GPL(io_uring_cmd_done);
static int io_uring_cmd_prep(struct io_kiocb *req,
const struct io_uring_sqe *sqe)
{
+ struct io_ring_ctx *ctx = req->ctx;
struct io_uring_cmd *ioucmd = &req->uring_cmd;
if (!req->file->f_op->async_cmd || !(req->ctx->flags & IORING_SETUP_SQE128))
return -EOPNOTSUPP;
if (req->ctx->flags & IORING_SETUP_IOPOLL)
return -EOPNOTSUPP;
+ if (req->opcode == IORING_OP_URING_CMD_FIXED) {
+ req->imu = NULL;
+ io_req_set_rsrc_node(req, ctx);
+ req->buf_index = READ_ONCE(sqe->buf_index);
+ ioucmd->flags = IO_URING_F_UCMD_FIXEDBUFS;
+ } else {
+ ioucmd->flags = 0;
+ }
+
ioucmd->cmd = (void *) &sqe->cmd;
ioucmd->cmd_op = READ_ONCE(sqe->cmd_op);
ioucmd->cmd_len = READ_ONCE(sqe->cmd_len);
- ioucmd->flags = 0;
return 0;
}
@@ -4167,6 +4179,19 @@ static int io_uring_cmd(struct io_kiocb *req, unsigned int issue_flags)
int ret;
struct io_uring_cmd *ioucmd = &req->uring_cmd;
+ if (req->opcode == IORING_OP_URING_CMD_FIXED) {
+ u32 index, buf_index = req->buf_index;
+ struct io_ring_ctx *ctx = req->ctx;
+ struct io_mapped_ubuf *imu = req->imu;
+
+ if (likely(!imu)) {
+ if (unlikely(buf_index >= ctx->nr_user_bufs))
+ return -EFAULT;
+ index = array_index_nospec(buf_index, ctx->nr_user_bufs);
+ imu = READ_ONCE(ctx->user_bufs[index]);
+ req->imu = imu;
+ }
+ }
ioucmd->flags |= issue_flags;
ret = file->f_op->async_cmd(ioucmd);
/* queued async, consumer will call io_uring_cmd_done() when complete */
@@ -6656,6 +6681,7 @@ static int io_req_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
case IORING_OP_LINKAT:
return io_linkat_prep(req, sqe);
case IORING_OP_URING_CMD:
+ case IORING_OP_URING_CMD_FIXED:
return io_uring_cmd_prep(req, sqe);
}
@@ -6941,6 +6967,7 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
ret = io_linkat(req, issue_flags);
break;
case IORING_OP_URING_CMD:
+ case IORING_OP_URING_CMD_FIXED:
ret = io_uring_cmd(req, issue_flags);
break;
default:
diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
index 1888a5ea7dbe..abad6175739e 100644
--- a/include/linux/io_uring.h
+++ b/include/linux/io_uring.h
@@ -8,6 +8,7 @@
enum io_uring_cmd_flags {
IO_URING_F_COMPLETE_DEFER = 1,
IO_URING_F_UNLOCKED = 2,
+ IO_URING_F_UCMD_FIXEDBUFS = 4,
/* int's last bit, sign checks are usually faster than a bit test */
IO_URING_F_NONBLOCK = INT_MIN,
};
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 9bf1d6c0ed7f..ee84be4b6be8 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -155,6 +155,7 @@ enum {
IORING_OP_SYMLINKAT,
IORING_OP_LINKAT,
IORING_OP_URING_CMD,
+ IORING_OP_URING_CMD_FIXED,
/* this goes last, obviously */
IORING_OP_LAST,
--
2.25.1
* [PATCH 08/17] nvme: enable passthrough with fixed-buffer
From: Kanchan Joshi @ 2022-03-08 15:20 UTC (permalink / raw)
To: axboe, hch, kbusch, asml.silence
Cc: io-uring, linux-nvme, linux-block, sbates, logang, pankydev8,
javier, mcgrof, a.manzanares, joshiiitr, anuj20.g
From: Anuj Gupta <[email protected]>
Add support to carry out passthrough command with pre-mapped buffers.
Signed-off-by: Anuj Gupta <[email protected]>
Signed-off-by: Kanchan Joshi <[email protected]>
---
block/blk-map.c | 45 +++++++++++++++++++++++++++++++++++++++
drivers/nvme/host/ioctl.c | 27 ++++++++++++++---------
include/linux/blk-mq.h | 2 ++
3 files changed, 64 insertions(+), 10 deletions(-)
diff --git a/block/blk-map.c b/block/blk-map.c
index 4526adde0156..027e8216e313 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -8,6 +8,7 @@
#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/uio.h>
+#include <linux/io_uring.h>
#include "blk.h"
@@ -577,6 +578,50 @@ int blk_rq_map_user(struct request_queue *q, struct request *rq,
}
EXPORT_SYMBOL(blk_rq_map_user);
+/* Unlike blk_rq_map_user () this is only for fixed-buffer async passthrough. */
+int blk_rq_map_user_fixedb(struct request_queue *q, struct request *rq,
+ u64 ubuf, unsigned long len, gfp_t gfp_mask,
+ struct io_uring_cmd *ioucmd)
+{
+ struct iov_iter iter;
+ size_t iter_count, nr_segs;
+ struct bio *bio;
+ int ret;
+
+ /*
+ * Talk to io_uring to obtain BVEC iterator for the buffer.
+ * And use that iterator to form bio/request.
+ */
+ ret = io_uring_cmd_import_fixed(ubuf, len, rq_data_dir(rq), &iter,
+ ioucmd);
+ if (unlikely(ret < 0))
+ return ret;
+ iter_count = iov_iter_count(&iter);
+ nr_segs = iter.nr_segs;
+
+ if (!iter_count || (iter_count >> 9) > queue_max_hw_sectors(q))
+ return -EINVAL;
+ if (nr_segs > queue_max_segments(q))
+ return -EINVAL;
+ /* no iovecs to alloc, as we already have a BVEC iterator */
+ bio = bio_alloc(gfp_mask, 0);
+ if (!bio)
+ return -ENOMEM;
+
+ ret = bio_iov_iter_get_pages(bio, &iter);
+ if (ret)
+ goto out_free;
+
+ blk_rq_bio_prep(rq, bio, nr_segs);
+ return 0;
+
+out_free:
+ bio_release_pages(bio, false);
+ bio_put(bio);
+ return ret;
+}
+EXPORT_SYMBOL(blk_rq_map_user_fixedb);
+
/**
* blk_rq_unmap_user - unmap a request with user data
* @bio: start of bio list
diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index 1df270b47af5..91d893eedc82 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -123,8 +123,13 @@ static void *nvme_add_user_metadata(struct bio *bio, void __user *ubuf,
return ERR_PTR(ret);
}
+static inline bool nvme_is_fixedb_passthru(struct io_uring_cmd *ioucmd)
+{
+ return ((ioucmd) && (ioucmd->flags & IO_URING_F_UCMD_FIXEDBUFS));
+}
+
static int nvme_submit_user_cmd(struct request_queue *q,
- struct nvme_command *cmd, void __user *ubuffer,
+ struct nvme_command *cmd, u64 ubuffer,
unsigned bufflen, void __user *meta_buffer, unsigned meta_len,
u32 meta_seed, u64 *result, unsigned timeout,
struct io_uring_cmd *ioucmd)
@@ -152,8 +157,12 @@ static int nvme_submit_user_cmd(struct request_queue *q,
nvme_req(req)->flags |= NVME_REQ_USERCMD;
if (ubuffer && bufflen) {
- ret = blk_rq_map_user(q, req, NULL, ubuffer, bufflen,
- GFP_KERNEL);
+ if (likely(nvme_is_fixedb_passthru(ioucmd)))
+ ret = blk_rq_map_user_fixedb(q, req, ubuffer, bufflen,
+ GFP_KERNEL, ioucmd);
+ else
+ ret = blk_rq_map_user(q, req, NULL, nvme_to_user_ptr(ubuffer),
+ bufflen, GFP_KERNEL);
if (ret)
goto out;
bio = req->bio;
@@ -260,9 +269,8 @@ static int nvme_submit_io(struct nvme_ns *ns, struct nvme_user_io __user *uio)
c.rw.appmask = cpu_to_le16(io.appmask);
return nvme_submit_user_cmd(ns->queue, &c,
- nvme_to_user_ptr(io.addr), length,
- metadata, meta_len, lower_32_bits(io.slba), NULL, 0,
- NULL);
+ io.addr, length, metadata, meta_len,
+ lower_32_bits(io.slba), NULL, 0, NULL);
}
static bool nvme_validate_passthru_nsid(struct nvme_ctrl *ctrl,
@@ -314,9 +322,8 @@ static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
timeout = msecs_to_jiffies(cmd.timeout_ms);
status = nvme_submit_user_cmd(ns ? ns->queue : ctrl->admin_q, &c,
- nvme_to_user_ptr(cmd.addr), cmd.data_len,
- nvme_to_user_ptr(cmd.metadata), cmd.metadata_len,
- 0, &result, timeout, NULL);
+ cmd.addr, cmd.data_len, nvme_to_user_ptr(cmd.metadata),
+ cmd.metadata_len, 0, &result, timeout, NULL);
if (status >= 0) {
if (put_user(result, &ucmd->result))
@@ -368,7 +375,7 @@ static int nvme_user_cmd64(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
timeout = msecs_to_jiffies(cptr->timeout_ms);
status = nvme_submit_user_cmd(ns ? ns->queue : ctrl->admin_q, &c,
- nvme_to_user_ptr(cptr->addr), cptr->data_len,
+ cptr->addr, cptr->data_len,
nvme_to_user_ptr(cptr->metadata), cptr->metadata_len,
0, &cptr->result, timeout, ioucmd);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index d319ffa59354..48bcfd194bdc 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -966,6 +966,8 @@ struct rq_map_data {
int blk_rq_map_user(struct request_queue *, struct request *,
struct rq_map_data *, void __user *, unsigned long, gfp_t);
+int blk_rq_map_user_fixedb(struct request_queue *, struct request *,
+ u64 ubuf, unsigned long, gfp_t, struct io_uring_cmd *);
int blk_rq_map_user_iov(struct request_queue *, struct request *,
struct rq_map_data *, const struct iov_iter *, gfp_t);
int blk_rq_unmap_user(struct bio *);
--
2.25.1
* [PATCH 09/17] io_uring: plug for async bypass
From: Kanchan Joshi @ 2022-03-08 15:20 UTC (permalink / raw)
To: axboe, hch, kbusch, asml.silence
Cc: io-uring, linux-nvme, linux-block, sbates, logang, pankydev8,
javier, mcgrof, a.manzanares, joshiiitr, anuj20.g
From: Jens Axboe <[email protected]>
Enable .plug for uring-cmd.
Signed-off-by: Jens Axboe <[email protected]>
---
fs/io_uring.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 6a1dcea0f538..f04bb497bd88 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1106,9 +1106,11 @@ static const struct io_op_def io_op_defs[] = {
[IORING_OP_LINKAT] = {},
[IORING_OP_URING_CMD] = {
.needs_file = 1,
+ .plug = 1,
},
[IORING_OP_URING_CMD_FIXED] = {
.needs_file = 1,
+ .plug = 1,
},
};
--
2.25.1
* [PATCH 10/17] block: wire-up support for plugging
From: Kanchan Joshi @ 2022-03-08 15:20 UTC (permalink / raw)
To: axboe, hch, kbusch, asml.silence
Cc: io-uring, linux-nvme, linux-block, sbates, logang, pankydev8,
javier, mcgrof, a.manzanares, joshiiitr, anuj20.g
From: Jens Axboe <[email protected]>
Add support to use plugging if it is enabled, else use default path.
Signed-off-by: Jens Axboe <[email protected]>
---
block/blk-mq.c | 90 ++++++++++++++++++++++++++------------------------
1 file changed, 47 insertions(+), 43 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 1adfe4824ef5..29f65eaf3e6b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2326,6 +2326,40 @@ void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
blk_mq_hctx_mark_pending(hctx, ctx);
}
+/*
+ * Allow 2x BLK_MAX_REQUEST_COUNT requests on plug queue for multiple
+ * queues. This is important for md arrays to benefit from merging
+ * requests.
+ */
+static inline unsigned short blk_plug_max_rq_count(struct blk_plug *plug)
+{
+ if (plug->multiple_queues)
+ return BLK_MAX_REQUEST_COUNT * 2;
+ return BLK_MAX_REQUEST_COUNT;
+}
+
+static void blk_add_rq_to_plug(struct blk_plug *plug, struct request *rq)
+{
+ struct request *last = rq_list_peek(&plug->mq_list);
+
+ if (!plug->rq_count) {
+ trace_block_plug(rq->q);
+ } else if (plug->rq_count >= blk_plug_max_rq_count(plug) ||
+ (!blk_queue_nomerges(rq->q) &&
+ blk_rq_bytes(last) >= BLK_PLUG_FLUSH_SIZE)) {
+ blk_mq_flush_plug_list(plug, false);
+ trace_block_plug(rq->q);
+ }
+
+ if (!plug->multiple_queues && last && last->q != rq->q)
+ plug->multiple_queues = true;
+ if (!plug->has_elevator && (rq->rq_flags & RQF_ELV))
+ plug->has_elevator = true;
+ rq->rq_next = NULL;
+ rq_list_add(&plug->mq_list, rq);
+ plug->rq_count++;
+}
+
/**
* blk_mq_request_bypass_insert - Insert a request at dispatch list.
* @rq: Pointer to request to be inserted.
@@ -2339,16 +2373,20 @@ void blk_mq_request_bypass_insert(struct request *rq, bool at_head,
bool run_queue)
{
struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
+ struct blk_plug *plug = current->plug;
- spin_lock(&hctx->lock);
- if (at_head)
- list_add(&rq->queuelist, &hctx->dispatch);
- else
- list_add_tail(&rq->queuelist, &hctx->dispatch);
- spin_unlock(&hctx->lock);
-
- if (run_queue)
- blk_mq_run_hw_queue(hctx, false);
+ if (plug) {
+ blk_add_rq_to_plug(plug, rq);
+ } else {
+ spin_lock(&hctx->lock);
+ if (at_head)
+ list_add(&rq->queuelist, &hctx->dispatch);
+ else
+ list_add_tail(&rq->queuelist, &hctx->dispatch);
+ spin_unlock(&hctx->lock);
+ if (run_queue)
+ blk_mq_run_hw_queue(hctx, false);
+ }
}
void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
@@ -2666,40 +2704,6 @@ void blk_mq_try_issue_list_directly(struct blk_mq_hw_ctx *hctx,
hctx->queue->mq_ops->commit_rqs(hctx);
}
-/*
- * Allow 2x BLK_MAX_REQUEST_COUNT requests on plug queue for multiple
- * queues. This is important for md arrays to benefit from merging
- * requests.
- */
-static inline unsigned short blk_plug_max_rq_count(struct blk_plug *plug)
-{
- if (plug->multiple_queues)
- return BLK_MAX_REQUEST_COUNT * 2;
- return BLK_MAX_REQUEST_COUNT;
-}
-
-static void blk_add_rq_to_plug(struct blk_plug *plug, struct request *rq)
-{
- struct request *last = rq_list_peek(&plug->mq_list);
-
- if (!plug->rq_count) {
- trace_block_plug(rq->q);
- } else if (plug->rq_count >= blk_plug_max_rq_count(plug) ||
- (!blk_queue_nomerges(rq->q) &&
- blk_rq_bytes(last) >= BLK_PLUG_FLUSH_SIZE)) {
- blk_mq_flush_plug_list(plug, false);
- trace_block_plug(rq->q);
- }
-
- if (!plug->multiple_queues && last && last->q != rq->q)
- plug->multiple_queues = true;
- if (!plug->has_elevator && (rq->rq_flags & RQF_ELV))
- plug->has_elevator = true;
- rq->rq_next = NULL;
- rq_list_add(&plug->mq_list, rq);
- plug->rq_count++;
-}
-
static bool blk_mq_attempt_bio_merge(struct request_queue *q,
struct bio *bio, unsigned int nr_segs)
{
--
2.25.1
* [PATCH 11/17] block: factor out helper for bio allocation from cache
From: Kanchan Joshi @ 2022-03-08 15:20 UTC (permalink / raw)
To: axboe, hch, kbusch, asml.silence
Cc: io-uring, linux-nvme, linux-block, sbates, logang, pankydev8,
javier, mcgrof, a.manzanares, joshiiitr, anuj20.g
Factor out a bio_from_cache() helper that is not tied to a kiocb. This is
a prep patch.
Signed-off-by: Kanchan Joshi <[email protected]>
---
block/bio.c | 43 ++++++++++++++++++++++++++-----------------
include/linux/bio.h | 1 +
2 files changed, 27 insertions(+), 17 deletions(-)
diff --git a/block/bio.c b/block/bio.c
index 4312a8085396..5e12c6bd43d3 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1705,27 +1705,12 @@ int bioset_init_from_src(struct bio_set *bs, struct bio_set *src)
}
EXPORT_SYMBOL(bioset_init_from_src);
-/**
- * bio_alloc_kiocb - Allocate a bio from bio_set based on kiocb
- * @kiocb: kiocb describing the IO
- * @nr_vecs: number of iovecs to pre-allocate
- * @bs: bio_set to allocate from
- *
- * Description:
- * Like @bio_alloc_bioset, but pass in the kiocb. The kiocb is only
- * used to check if we should dip into the per-cpu bio_set allocation
- * cache. The allocation uses GFP_KERNEL internally. On return, the
- * bio is marked BIO_PERCPU_CACHEABLE, and the final put of the bio
- * MUST be done from process context, not hard/soft IRQ.
- *
- */
-struct bio *bio_alloc_kiocb(struct kiocb *kiocb, unsigned short nr_vecs,
- struct bio_set *bs)
+struct bio *bio_from_cache(unsigned short nr_vecs, struct bio_set *bs)
{
struct bio_alloc_cache *cache;
struct bio *bio;
- if (!(kiocb->ki_flags & IOCB_ALLOC_CACHE) || nr_vecs > BIO_INLINE_VECS)
+ if (nr_vecs > BIO_INLINE_VECS)
return bio_alloc_bioset(GFP_KERNEL, nr_vecs, bs);
cache = per_cpu_ptr(bs->cache, get_cpu());
@@ -1744,6 +1729,30 @@ struct bio *bio_alloc_kiocb(struct kiocb *kiocb, unsigned short nr_vecs,
bio_set_flag(bio, BIO_PERCPU_CACHE);
return bio;
}
+EXPORT_SYMBOL_GPL(bio_from_cache);
+
+/**
+ * bio_alloc_kiocb - Allocate a bio from bio_set based on kiocb
+ * @kiocb: kiocb describing the IO
+ * @nr_vecs: number of iovecs to pre-allocate
+ * @bs: bio_set to allocate from
+ *
+ * Description:
+ * Like @bio_alloc_bioset, but pass in the kiocb. The kiocb is only
+ * used to check if we should dip into the per-cpu bio_set allocation
+ * cache. The allocation uses GFP_KERNEL internally. On return, the
+ * bio is marked BIO_PERCPU_CACHEABLE, and the final put of the bio
+ * MUST be done from process context, not hard/soft IRQ.
+ *
+ */
+struct bio *bio_alloc_kiocb(struct kiocb *kiocb, unsigned short nr_vecs,
+ struct bio_set *bs)
+{
+ if (!(kiocb->ki_flags & IOCB_ALLOC_CACHE))
+ return bio_alloc_bioset(GFP_KERNEL, nr_vecs, bs);
+
+ return bio_from_cache(nr_vecs, bs);
+}
EXPORT_SYMBOL_GPL(bio_alloc_kiocb);
static int __init init_bio(void)
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 117d7f248ac9..3216401f75b0 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -409,6 +409,7 @@ struct bio *bio_alloc_bioset(gfp_t gfp, unsigned short nr_iovecs,
struct bio_set *bs);
struct bio *bio_alloc_kiocb(struct kiocb *kiocb, unsigned short nr_vecs,
struct bio_set *bs);
+struct bio *bio_from_cache(unsigned short nr_vecs, struct bio_set *bs);
struct bio *bio_kmalloc(gfp_t gfp_mask, unsigned short nr_iovecs);
extern void bio_put(struct bio *);
--
2.25.1
* [PATCH 12/17] nvme: enable bio-cache for fixed-buffer passthru
From: Kanchan Joshi @ 2022-03-08 15:21 UTC (permalink / raw)
To: axboe, hch, kbusch, asml.silence
Cc: io-uring, linux-nvme, linux-block, sbates, logang, pankydev8,
javier, mcgrof, a.manzanares, joshiiitr, anuj20.g
Since we do submission/completion in task context, we can enable this.
Add a bio_set for nvme, as we need one for the bio-cache.
Signed-off-by: Kanchan Joshi <[email protected]>
---
block/blk-map.c | 4 ++--
drivers/nvme/host/core.c | 9 +++++++++
drivers/nvme/host/ioctl.c | 2 +-
drivers/nvme/host/nvme.h | 1 +
include/linux/blk-mq.h | 3 ++-
5 files changed, 15 insertions(+), 4 deletions(-)
diff --git a/block/blk-map.c b/block/blk-map.c
index 027e8216e313..c39917f0eb78 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -580,7 +580,7 @@ EXPORT_SYMBOL(blk_rq_map_user);
/* Unlike blk_rq_map_user () this is only for fixed-buffer async passthrough. */
int blk_rq_map_user_fixedb(struct request_queue *q, struct request *rq,
- u64 ubuf, unsigned long len, gfp_t gfp_mask,
+ u64 ubuf, unsigned long len, struct bio_set *bs,
struct io_uring_cmd *ioucmd)
{
struct iov_iter iter;
@@ -604,7 +604,7 @@ int blk_rq_map_user_fixedb(struct request_queue *q, struct request *rq,
if (nr_segs > queue_max_segments(q))
return -EINVAL;
/* no iovecs to alloc, as we already have a BVEC iterator */
- bio = bio_alloc(gfp_mask, 0);
+ bio = bio_from_cache(0, bs);
if (!bio)
return -ENOMEM;
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 3fe8f5901cd9..4a385001f124 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -30,6 +30,9 @@
#define NVME_MINORS (1U << MINORBITS)
+#define NVME_BIO_POOL_SZ (4)
+struct bio_set nvme_bio_pool;
+
unsigned int admin_timeout = 60;
module_param(admin_timeout, uint, 0644);
MODULE_PARM_DESC(admin_timeout, "timeout in seconds for admin commands");
@@ -4797,6 +4800,11 @@ static int __init nvme_core_init(void)
goto unregister_generic_ns;
}
+ result = bioset_init(&nvme_bio_pool, NVME_BIO_POOL_SZ, 0,
+ BIOSET_NEED_BVECS | BIOSET_PERCPU_CACHE);
+ if (result < 0)
+ goto unregister_generic_ns;
+
return 0;
unregister_generic_ns:
@@ -4819,6 +4827,7 @@ static int __init nvme_core_init(void)
static void __exit nvme_core_exit(void)
{
+ bioset_exit(&nvme_bio_pool);
class_destroy(nvme_ns_chr_class);
class_destroy(nvme_subsys_class);
class_destroy(nvme_class);
diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index 91d893eedc82..a4cde210aab9 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -159,7 +159,7 @@ static int nvme_submit_user_cmd(struct request_queue *q,
if (ubuffer && bufflen) {
if (likely(nvme_is_fixedb_passthru(ioucmd)))
ret = blk_rq_map_user_fixedb(q, req, ubuffer, bufflen,
- GFP_KERNEL, ioucmd);
+ &nvme_bio_pool, ioucmd);
else
ret = blk_rq_map_user(q, req, NULL, nvme_to_user_ptr(ubuffer),
bufflen, GFP_KERNEL);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index e6a30543d7c8..9a3e5093dedc 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -47,6 +47,7 @@ extern unsigned int admin_timeout;
extern struct workqueue_struct *nvme_wq;
extern struct workqueue_struct *nvme_reset_wq;
extern struct workqueue_struct *nvme_delete_wq;
+extern struct bio_set nvme_bio_pool;
/*
* List of workarounds for devices that required behavior not specified in
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 48bcfd194bdc..5f21f71b2529 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -967,7 +967,8 @@ struct rq_map_data {
int blk_rq_map_user(struct request_queue *, struct request *,
struct rq_map_data *, void __user *, unsigned long, gfp_t);
int blk_rq_map_user_fixedb(struct request_queue *, struct request *,
- u64 ubuf, unsigned long, gfp_t, struct io_uring_cmd *);
+ u64 ubuf, unsigned long, struct bio_set *,
+ struct io_uring_cmd *);
int blk_rq_map_user_iov(struct request_queue *, struct request *,
struct rq_map_data *, const struct iov_iter *, gfp_t);
int blk_rq_unmap_user(struct bio *);
--
2.25.1
^ permalink raw reply related [flat|nested] 122+ messages in thread
* [PATCH 13/17] nvme: allow user passthrough commands to poll
[not found] ` <CGME20220308152720epcas5p19653942458e160714444942ddb8b8579@epcas5p1.samsung.com>
@ 2022-03-08 15:21 ` Kanchan Joshi
2022-03-08 17:08 ` Keith Busch
0 siblings, 1 reply; 122+ messages in thread
From: Kanchan Joshi @ 2022-03-08 15:21 UTC (permalink / raw)
To: axboe, hch, kbusch, asml.silence
Cc: io-uring, linux-nvme, linux-block, sbates, logang, pankydev8,
javier, mcgrof, a.manzanares, joshiiitr, anuj20.g
From: Keith Busch <[email protected]>
The block layer knows how to deal with polled requests. Let the NVMe
driver use the previously reserved user "flags" field to define an
option to allocate the request from the polled hardware contexts. If
polling is not enabled, then the block layer will automatically fall
back to a non-polled request.[1]
[1] https://lore.kernel.org/linux-block/[email protected]/
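For illustration only (not part of this patch), a user-space caller
could request a polled passthru read roughly like this; fd, nsid and
buf are placeholders, and the namespace is assumed to be 512b-formatted:
	/* sketch: request polling via the previously reserved flags field */
	struct nvme_passthru_cmd64 cmd = {
		.opcode   = 0x02,		/* nvme_cmd_read, 1 block at LBA 0 */
		.nsid     = nsid,
		.addr     = (__u64)(uintptr_t)buf,
		.data_len = 512,
		.flags    = NVME_HIPRI,		/* use a polling queue if available */
	};
	if (ioctl(fd, NVME_IOCTL_IO64_CMD, &cmd) < 0)
		perror("NVME_IOCTL_IO64_CMD");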
Signed-off-by: Keith Busch <[email protected]>
---
drivers/nvme/host/ioctl.c | 30 ++++++++++++++++--------------
include/uapi/linux/nvme_ioctl.h | 4 ++++
2 files changed, 20 insertions(+), 14 deletions(-)
diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index a4cde210aab9..a6712fb3eb98 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -132,7 +132,7 @@ static int nvme_submit_user_cmd(struct request_queue *q,
struct nvme_command *cmd, u64 ubuffer,
unsigned bufflen, void __user *meta_buffer, unsigned meta_len,
u32 meta_seed, u64 *result, unsigned timeout,
- struct io_uring_cmd *ioucmd)
+ struct io_uring_cmd *ioucmd, unsigned int rq_flags)
{
bool write = nvme_is_write(cmd);
struct nvme_ns *ns = q->queuedata;
@@ -140,7 +140,6 @@ static int nvme_submit_user_cmd(struct request_queue *q,
struct request *req;
struct bio *bio = NULL;
void *meta = NULL;
- unsigned int rq_flags = 0;
blk_mq_req_flags_t blk_flags = 0;
int ret;
@@ -216,11 +215,12 @@ static int nvme_submit_io(struct nvme_ns *ns, struct nvme_user_io __user *uio)
struct nvme_command c;
unsigned length, meta_len;
void __user *metadata;
+ unsigned int rq_flags = 0;
if (copy_from_user(&io, uio, sizeof(io)))
return -EFAULT;
- if (io.flags)
- return -EINVAL;
+ if (io.flags & NVME_HIPRI)
+ rq_flags |= REQ_POLLED;
switch (io.opcode) {
case nvme_cmd_write:
@@ -258,7 +258,7 @@ static int nvme_submit_io(struct nvme_ns *ns, struct nvme_user_io __user *uio)
memset(&c, 0, sizeof(c));
c.rw.opcode = io.opcode;
- c.rw.flags = io.flags;
+ c.rw.flags = 0;
c.rw.nsid = cpu_to_le32(ns->head->ns_id);
c.rw.slba = cpu_to_le64(io.slba);
c.rw.length = cpu_to_le16(io.nblocks);
@@ -270,7 +270,7 @@ static int nvme_submit_io(struct nvme_ns *ns, struct nvme_user_io __user *uio)
return nvme_submit_user_cmd(ns->queue, &c,
io.addr, length, metadata, meta_len,
- lower_32_bits(io.slba), NULL, 0, NULL);
+ lower_32_bits(io.slba), NULL, 0, NULL, rq_flags);
}
static bool nvme_validate_passthru_nsid(struct nvme_ctrl *ctrl,
@@ -292,6 +292,7 @@ static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
{
struct nvme_passthru_cmd cmd;
struct nvme_command c;
+ unsigned int rq_flags = 0;
unsigned timeout = 0;
u64 result;
int status;
@@ -300,14 +301,14 @@ static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
return -EACCES;
if (copy_from_user(&cmd, ucmd, sizeof(cmd)))
return -EFAULT;
- if (cmd.flags)
- return -EINVAL;
+ if (cmd.flags & NVME_HIPRI)
+ rq_flags |= REQ_POLLED;
if (!nvme_validate_passthru_nsid(ctrl, ns, cmd.nsid))
return -EINVAL;
memset(&c, 0, sizeof(c));
c.common.opcode = cmd.opcode;
- c.common.flags = cmd.flags;
+ c.common.flags = 0;
c.common.nsid = cpu_to_le32(cmd.nsid);
c.common.cdw2[0] = cpu_to_le32(cmd.cdw2);
c.common.cdw2[1] = cpu_to_le32(cmd.cdw3);
@@ -323,7 +324,7 @@ static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
status = nvme_submit_user_cmd(ns ? ns->queue : ctrl->admin_q, &c,
cmd.addr, cmd.data_len, nvme_to_user_ptr(cmd.metadata),
- cmd.metadata_len, 0, &result, timeout, NULL);
+ cmd.metadata_len, 0, &result, timeout, NULL, rq_flags);
if (status >= 0) {
if (put_user(result, &ucmd->result))
@@ -339,6 +340,7 @@ static int nvme_user_cmd64(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
{
struct nvme_passthru_cmd64 cmd, *cptr;
struct nvme_command c;
+ unsigned int rq_flags = 0;
unsigned timeout = 0;
int status;
@@ -353,14 +355,14 @@ static int nvme_user_cmd64(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
return -EINVAL;
cptr = (struct nvme_passthru_cmd64 *)ioucmd->cmd;
}
- if (cptr->flags)
- return -EINVAL;
+ if (cptr->flags & NVME_HIPRI)
+ rq_flags |= REQ_POLLED;
if (!nvme_validate_passthru_nsid(ctrl, ns, cptr->nsid))
return -EINVAL;
memset(&c, 0, sizeof(c));
c.common.opcode = cptr->opcode;
- c.common.flags = cptr->flags;
+ c.common.flags = 0;
c.common.nsid = cpu_to_le32(cptr->nsid);
c.common.cdw2[0] = cpu_to_le32(cptr->cdw2);
c.common.cdw2[1] = cpu_to_le32(cptr->cdw3);
@@ -377,7 +379,7 @@ static int nvme_user_cmd64(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
status = nvme_submit_user_cmd(ns ? ns->queue : ctrl->admin_q, &c,
cptr->addr, cptr->data_len,
nvme_to_user_ptr(cptr->metadata), cptr->metadata_len,
- 0, &cptr->result, timeout, ioucmd);
+ 0, &cptr->result, timeout, ioucmd, rq_flags);
if (!ioucmd && status >= 0) {
if (put_user(cptr->result, &ucmd->result))
diff --git a/include/uapi/linux/nvme_ioctl.h b/include/uapi/linux/nvme_ioctl.h
index d99b5a772698..df2c138c38d9 100644
--- a/include/uapi/linux/nvme_ioctl.h
+++ b/include/uapi/linux/nvme_ioctl.h
@@ -9,6 +9,10 @@
#include <linux/types.h>
+enum nvme_io_flags {
+ NVME_HIPRI = 1 << 0, /* use polling queue if available */
+};
+
struct nvme_user_io {
__u8 opcode;
__u8 flags;
--
2.25.1
^ permalink raw reply related [flat|nested] 122+ messages in thread
* [PATCH 14/17] io_uring: add polling support for uring-cmd
[not found] ` <CGME20220308152723epcas5p34460b4af720e515317f88dbb78295f06@epcas5p3.samsung.com>
@ 2022-03-08 15:21 ` Kanchan Joshi
2022-03-11 6:50 ` Christoph Hellwig
0 siblings, 1 reply; 122+ messages in thread
From: Kanchan Joshi @ 2022-03-08 15:21 UTC (permalink / raw)
To: axboe, hch, kbusch, asml.silence
Cc: io-uring, linux-nvme, linux-block, sbates, logang, pankydev8,
javier, mcgrof, a.manzanares, joshiiitr, anuj20.g
From: Pankaj Raghav <[email protected]>
Enable polling support infra for uring-cmd.
Collect the bio during submission, and use that to implement polling on
completion.
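For illustration only (not part of this patch), polled uring-cmd
assumes the ring was created with both 128-byte SQEs and IOPOLL; a
rough liburing sketch, assuming headers updated for this series:
	struct io_uring ring;
	struct io_uring_params p = {
		.flags = IORING_SETUP_SQE128 | IORING_SETUP_IOPOLL,
	};
	int ret = io_uring_queue_init_params(256, &ring, &p);
	if (ret < 0)
		fprintf(stderr, "ring setup failed: %d\n", ret);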
Signed-off-by: Pankaj Raghav <[email protected]>
---
fs/io_uring.c | 51 ++++++++++++++++++++++++++++++++++------
include/linux/io_uring.h | 9 +++++--
2 files changed, 51 insertions(+), 9 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index f04bb497bd88..8bd9401f9964 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2664,7 +2664,20 @@ static int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin)
if (READ_ONCE(req->iopoll_completed))
break;
- ret = kiocb->ki_filp->f_op->iopoll(kiocb, &iob, poll_flags);
+ if (req->opcode == IORING_OP_URING_CMD ||
+ req->opcode == IORING_OP_URING_CMD_FIXED) {
+ /* uring_cmd structure does not contain kiocb struct */
+ struct kiocb kiocb_uring_cmd;
+
+ kiocb_uring_cmd.private = req->uring_cmd.bio;
+ kiocb_uring_cmd.ki_filp = req->uring_cmd.file;
+ ret = req->uring_cmd.file->f_op->iopoll(&kiocb_uring_cmd,
+ &iob, poll_flags);
+ } else {
+ ret = kiocb->ki_filp->f_op->iopoll(kiocb, &iob,
+ poll_flags);
+ }
+
if (unlikely(ret < 0))
return ret;
else if (ret)
@@ -2777,6 +2790,15 @@ static int io_iopoll_check(struct io_ring_ctx *ctx, long min)
wq_list_empty(&ctx->iopoll_list))
break;
}
+
+ /*
+ * In some scenarios, completion callback has been queued up to be
+ * completed in-task context but polling happens in the same task
+ * not giving a chance for the completion callback to complete.
+ */
+ if (current->task_works)
+ io_run_task_work();
+
ret = io_do_iopoll(ctx, !min);
if (ret < 0)
break;
@@ -4136,6 +4158,14 @@ static int io_linkat(struct io_kiocb *req, unsigned int issue_flags)
return 0;
}
+static void io_complete_uring_cmd_iopoll(struct io_kiocb *req, long res)
+{
+ WRITE_ONCE(req->result, res);
+ /* order with io_iopoll_complete() checking ->result */
+ smp_wmb();
+ WRITE_ONCE(req->iopoll_completed, 1);
+}
+
/*
* Called by consumers of io_uring_cmd, if they originally returned
* -EIOCBQUEUED upon receiving the command.
@@ -4146,7 +4176,11 @@ void io_uring_cmd_done(struct io_uring_cmd *ioucmd, ssize_t ret)
if (ret < 0)
req_set_fail(req);
- io_req_complete(req, ret);
+
+ if (req->uring_cmd.flags & IO_URING_F_UCMD_POLLED)
+ io_complete_uring_cmd_iopoll(req, ret);
+ else
+ io_req_complete(req, ret);
}
EXPORT_SYMBOL_GPL(io_uring_cmd_done);
@@ -4158,15 +4192,18 @@ static int io_uring_cmd_prep(struct io_kiocb *req,
if (!req->file->f_op->async_cmd || !(req->ctx->flags & IORING_SETUP_SQE128))
return -EOPNOTSUPP;
- if (req->ctx->flags & IORING_SETUP_IOPOLL)
- return -EOPNOTSUPP;
+ if (req->ctx->flags & IORING_SETUP_IOPOLL) {
+ ioucmd->flags = IO_URING_F_UCMD_POLLED;
+ ioucmd->bio = NULL;
+ req->iopoll_completed = 0;
+ } else {
+ ioucmd->flags = 0;
+ }
if (req->opcode == IORING_OP_URING_CMD_FIXED) {
req->imu = NULL;
io_req_set_rsrc_node(req, ctx);
req->buf_index = READ_ONCE(sqe->buf_index);
- ioucmd->flags = IO_URING_F_UCMD_FIXEDBUFS;
- } else {
- ioucmd->flags = 0;
+ ioucmd->flags |= IO_URING_F_UCMD_FIXEDBUFS;
}
ioucmd->cmd = (void *) &sqe->cmd;
diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
index abad6175739e..65db83d703b7 100644
--- a/include/linux/io_uring.h
+++ b/include/linux/io_uring.h
@@ -9,6 +9,7 @@ enum io_uring_cmd_flags {
IO_URING_F_COMPLETE_DEFER = 1,
IO_URING_F_UNLOCKED = 2,
IO_URING_F_UCMD_FIXEDBUFS = 4,
+ IO_URING_F_UCMD_POLLED = 8,
/* int's last bit, sign checks are usually faster than a bit test */
IO_URING_F_NONBLOCK = INT_MIN,
};
@@ -16,8 +17,12 @@ enum io_uring_cmd_flags {
struct io_uring_cmd {
struct file *file;
void *cmd;
- /* for irq-completion - if driver requires doing stuff in task-context*/
- void (*driver_cb)(struct io_uring_cmd *cmd);
+ union {
+ void *bio; /* used for polled completion */
+
+ /* for irq-completion - if driver requires doing stuff in task-context*/
+ void (*driver_cb)(struct io_uring_cmd *cmd);
+ };
u32 flags;
u32 cmd_op;
u16 cmd_len;
--
2.25.1
^ permalink raw reply related [flat|nested] 122+ messages in thread
* [PATCH 15/17] nvme: wire-up polling for uring-passthru
[not found] ` <CGME20220308152725epcas5p36d1ce3269a47c1c22cc0d66bdc2b9eb3@epcas5p3.samsung.com>
@ 2022-03-08 15:21 ` Kanchan Joshi
0 siblings, 0 replies; 122+ messages in thread
From: Kanchan Joshi @ 2022-03-08 15:21 UTC (permalink / raw)
To: axboe, hch, kbusch, asml.silence
Cc: io-uring, linux-nvme, linux-block, sbates, logang, pankydev8,
javier, mcgrof, a.manzanares, joshiiitr, anuj20.g
From: Pankaj Raghav <[email protected]>
Add a .iopoll handler for the char device (/dev/ngX), which in turn
redirects to bio_poll() to implement polling.
Signed-off-by: Pankaj Raghav <[email protected]>
---
block/blk-mq.c | 3 +-
drivers/nvme/host/core.c | 1 +
drivers/nvme/host/ioctl.c | 79 ++++++++++++++++++++++++++++++++++-
drivers/nvme/host/multipath.c | 1 +
drivers/nvme/host/nvme.h | 4 ++
include/linux/blk-mq.h | 1 +
6 files changed, 86 insertions(+), 3 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 29f65eaf3e6b..6b37774b0d59 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1193,7 +1193,7 @@ void blk_execute_rq_nowait(struct request *rq, bool at_head, rq_end_io_fn *done)
}
EXPORT_SYMBOL_GPL(blk_execute_rq_nowait);
-static bool blk_rq_is_poll(struct request *rq)
+bool blk_rq_is_poll(struct request *rq)
{
if (!rq->mq_hctx)
return false;
@@ -1203,6 +1203,7 @@ static bool blk_rq_is_poll(struct request *rq)
return false;
return true;
}
+EXPORT_SYMBOL_GPL(blk_rq_is_poll);
static void blk_rq_poll_completion(struct request *rq, struct completion *wait)
{
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 4a385001f124..64254771a28e 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3671,6 +3671,7 @@ static const struct file_operations nvme_ns_chr_fops = {
.unlocked_ioctl = nvme_ns_chr_ioctl,
.compat_ioctl = compat_ptr_ioctl,
.async_cmd = nvme_ns_chr_async_cmd,
+ .iopoll = nvme_iopoll,
};
static int nvme_add_ns_cdev(struct nvme_ns *ns)
diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index a6712fb3eb98..701feaecabbe 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -37,6 +37,12 @@ static struct nvme_uring_cmd_pdu *nvme_uring_cmd_pdu(struct io_uring_cmd *ioucmd
return (struct nvme_uring_cmd_pdu *)&ioucmd->pdu;
}
+static inline bool is_polling_enabled(struct io_uring_cmd *ioucmd,
+ struct request *req)
+{
+ return (ioucmd->flags & IO_URING_F_UCMD_POLLED) && blk_rq_is_poll(req);
+}
+
static void nvme_pt_task_cb(struct io_uring_cmd *ioucmd)
{
struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd);
@@ -71,8 +77,17 @@ static void nvme_end_async_pt(struct request *req, blk_status_t err)
pdu->req = req;
req->bio = bio;
- /* this takes care of setting up task-work */
- io_uring_cmd_complete_in_task(ioucmd, nvme_pt_task_cb);
+
+ /*
+ * IO can be completed directly (i.e. without task work) if we are
+ * polling and in the task context already
+ */
+ if (is_polling_enabled(ioucmd, req)) {
+ nvme_pt_task_cb(ioucmd);
+ } else {
+ /* this takes care of setting up task-work */
+ io_uring_cmd_complete_in_task(ioucmd, nvme_pt_task_cb);
+ }
}
static void nvme_setup_uring_cmd_data(struct request *rq,
@@ -180,6 +195,10 @@ static int nvme_submit_user_cmd(struct request_queue *q,
if (ioucmd) { /* async dispatch */
if (cmd->common.opcode == nvme_cmd_write ||
cmd->common.opcode == nvme_cmd_read) {
+ if (bio && is_polling_enabled(ioucmd, req)) {
+ ioucmd->bio = bio;
+ bio->bi_opf |= REQ_POLLED;
+ }
nvme_setup_uring_cmd_data(req, ioucmd, meta, meta_buffer,
meta_len);
blk_execute_rq_nowait(req, 0, nvme_end_async_pt);
@@ -505,6 +524,32 @@ int nvme_ns_chr_async_cmd(struct io_uring_cmd *ioucmd)
return nvme_ns_async_ioctl(ns, ioucmd);
}
+int nvme_iopoll(struct kiocb *kiocb, struct io_comp_batch *iob,
+ unsigned int flags)
+{
+ struct bio *bio = NULL;
+ struct nvme_ns *ns = NULL;
+ struct request_queue *q = NULL;
+ int ret = 0;
+
+ rcu_read_lock();
+ bio = READ_ONCE(kiocb->private);
+ ns = container_of(file_inode(kiocb->ki_filp)->i_cdev, struct nvme_ns,
+ cdev);
+ q = ns->queue;
+ /*
+ * bio and driver_cb are a part of the same union inside io_uring_cmd
+ * struct. If driver is loaded without poll queues, completion will be
+ * IRQ based and driver_cb is populated. We do not want to treat that
+ * as bio and get into troubles. Avoid this by checking if queue is
+ * polled and bail out if not.
+ */
+ if ((test_bit(QUEUE_FLAG_POLL, &q->queue_flags)) && bio && bio->bi_bdev)
+ ret = bio_poll(bio, iob, flags);
+ rcu_read_unlock();
+ return ret;
+}
+
#ifdef CONFIG_NVME_MULTIPATH
static int nvme_ns_head_ctrl_ioctl(struct nvme_ns *ns, unsigned int cmd,
void __user *argp, struct nvme_ns_head *head, int srcu_idx)
@@ -585,6 +630,36 @@ long nvme_ns_head_chr_ioctl(struct file *file, unsigned int cmd,
srcu_read_unlock(&head->srcu, srcu_idx);
return ret;
}
+
+int nvme_ns_head_iopoll(struct kiocb *kiocb, struct io_comp_batch *iob,
+ unsigned int flags)
+{
+ struct bio *bio = NULL;
+ struct request_queue *q = NULL;
+ struct cdev *cdev = file_inode(kiocb->ki_filp)->i_cdev;
+ struct nvme_ns_head *head = container_of(cdev, struct nvme_ns_head, cdev);
+ int srcu_idx = srcu_read_lock(&head->srcu);
+ struct nvme_ns *ns = nvme_find_path(head);
+ int ret = -EWOULDBLOCK;
+
+ if (ns) {
+ bio = READ_ONCE(kiocb->private);
+ q = ns->queue;
+ /*
+ * bio and driver_cb are part of same union inside io_uring_cmd
+ * struct. If driver is loaded without poll queues, completion
+ * will be IRQ based and driver_cb is populated. We do not want
+ * to treat that as bio and get into troubles. Avoid this by
+ * checking if queue is polled, and bail out if not.
+ */
+ if ((test_bit(QUEUE_FLAG_POLL, &q->queue_flags)) && bio &&
+ bio->bi_bdev)
+ ret = bio_poll(bio, iob, flags);
+ }
+
+ srcu_read_unlock(&head->srcu, srcu_idx);
+ return ret;
+}
#endif /* CONFIG_NVME_MULTIPATH */
static int nvme_dev_user_cmd(struct nvme_ctrl *ctrl, void __user *argp)
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 1d798d09456f..27995358c847 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -460,6 +460,7 @@ static const struct file_operations nvme_ns_head_chr_fops = {
.unlocked_ioctl = nvme_ns_head_chr_ioctl,
.compat_ioctl = compat_ptr_ioctl,
.async_cmd = nvme_ns_head_chr_async_cmd,
+ .iopoll = nvme_ns_head_iopoll,
};
static int nvme_add_ns_head_cdev(struct nvme_ns_head *head)
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 9a3e5093dedc..0be437c25077 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -755,7 +755,11 @@ long nvme_ns_head_chr_ioctl(struct file *file, unsigned int cmd,
long nvme_dev_ioctl(struct file *file, unsigned int cmd,
unsigned long arg);
int nvme_ns_chr_async_cmd(struct io_uring_cmd *ioucmd);
+int nvme_iopoll(struct kiocb *kiocb, struct io_comp_batch *iob,
+ unsigned int flags);
int nvme_ns_head_chr_async_cmd(struct io_uring_cmd *ioucmd);
+int nvme_ns_head_iopoll(struct kiocb *kiocb, struct io_comp_batch *iob,
+ unsigned int flags);
int nvme_getgeo(struct block_device *bdev, struct hd_geometry *geo);
extern const struct attribute_group *nvme_ns_id_attr_groups[];
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 5f21f71b2529..9f00b2a5a991 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -977,6 +977,7 @@ int blk_rq_map_kern(struct request_queue *, struct request *, void *,
int blk_rq_append_bio(struct request *rq, struct bio *bio);
void blk_execute_rq_nowait(struct request *rq, bool at_head,
rq_end_io_fn *end_io);
+bool blk_rq_is_poll(struct request *rq);
blk_status_t blk_execute_rq(struct request *rq, bool at_head);
struct req_iterator {
--
2.25.1
^ permalink raw reply related [flat|nested] 122+ messages in thread
* [PATCH 16/17] io_uring: add support for non-inline uring-cmd
[not found] ` <CGME20220308152727epcas5p20e605718dd99e97c94f9232d40d04d95@epcas5p2.samsung.com>
@ 2022-03-08 15:21 ` Kanchan Joshi
0 siblings, 0 replies; 122+ messages in thread
From: Kanchan Joshi @ 2022-03-08 15:21 UTC (permalink / raw)
To: axboe, hch, kbusch, asml.silence
Cc: io-uring, linux-nvme, linux-block, sbates, logang, pankydev8,
javier, mcgrof, a.manzanares, joshiiitr, anuj20.g
The lifetime of an inline command (within the sqe) is only up to
submission, and the maximum length is 80 bytes. This can be limiting
for certain commands.
Add an option to accept a command pointer via the same sqe->cmd field.
The user needs to set the IORING_URING_CMD_INDIRECT flag in
sqe->uring_cmd_flags.
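For illustration only (not part of this patch), an application could
set up an indirect uring-cmd SQE roughly as below; the exact encoding
of sqe->cmd as a 64-bit user pointer is an assumption here:
	struct nvme_passthru_cmd64 pcmd;	/* must stay valid until completion */
	sqe->opcode          = IORING_OP_URING_CMD;
	sqe->uring_cmd_flags = IORING_URING_CMD_INDIRECT;
	sqe->cmd_op          = NVME_IOCTL_IO64_CMD;
	sqe->cmd_len         = sizeof(pcmd);
	sqe->cmd             = (__u64)(uintptr_t)&pcmd;	/* pointer, not inline bytes */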
Signed-off-by: Kanchan Joshi <[email protected]>
Signed-off-by: Anuj Gupta <[email protected]>
---
fs/io_uring.c | 13 +++++++++++--
include/linux/io_uring.h | 1 +
include/uapi/linux/io_uring.h | 6 ++++++
3 files changed, 18 insertions(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 8bd9401f9964..d88c6601a556 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -4189,8 +4189,12 @@ static int io_uring_cmd_prep(struct io_kiocb *req,
{
struct io_ring_ctx *ctx = req->ctx;
struct io_uring_cmd *ioucmd = &req->uring_cmd;
+ u32 ucmd_flags = READ_ONCE(sqe->uring_cmd_flags);
- if (!req->file->f_op->async_cmd || !(req->ctx->flags & IORING_SETUP_SQE128))
+ if (!req->file->f_op->async_cmd)
+ return -EOPNOTSUPP;
+ if (!(req->ctx->flags & IORING_SETUP_SQE128) &&
+ !(ucmd_flags & IORING_URING_CMD_INDIRECT))
return -EOPNOTSUPP;
if (req->ctx->flags & IORING_SETUP_IOPOLL) {
ioucmd->flags = IO_URING_F_UCMD_POLLED;
@@ -4206,7 +4210,12 @@ static int io_uring_cmd_prep(struct io_kiocb *req,
ioucmd->flags |= IO_URING_F_UCMD_FIXEDBUFS;
}
- ioucmd->cmd = (void *) &sqe->cmd;
+ if (ucmd_flags & IORING_URING_CMD_INDIRECT) {
+ ioucmd->flags |= IO_URING_F_UCMD_INDIRECT;
+ ioucmd->cmd = (void *) sqe->cmd;
+ } else {
+ ioucmd->cmd = (void *) &sqe->cmd;
+ }
ioucmd->cmd_op = READ_ONCE(sqe->cmd_op);
ioucmd->cmd_len = READ_ONCE(sqe->cmd_len);
return 0;
diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
index 65db83d703b7..c534a6fcef4f 100644
--- a/include/linux/io_uring.h
+++ b/include/linux/io_uring.h
@@ -10,6 +10,7 @@ enum io_uring_cmd_flags {
IO_URING_F_UNLOCKED = 2,
IO_URING_F_UCMD_FIXEDBUFS = 4,
IO_URING_F_UCMD_POLLED = 8,
+ IO_URING_F_UCMD_INDIRECT = 16,
/* int's last bit, sign checks are usually faster than a bit test */
IO_URING_F_NONBLOCK = INT_MIN,
};
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index ee84be4b6be8..a4b9db37ecf1 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -47,6 +47,7 @@ struct io_uring_sqe {
__u32 rename_flags;
__u32 unlink_flags;
__u32 hardlink_flags;
+ __u32 uring_cmd_flags;
};
__u64 user_data; /* data to be passed back at completion time */
/* pack this to avoid bogus arm OABI complaints */
@@ -198,6 +199,11 @@ enum {
#define IORING_POLL_UPDATE_EVENTS (1U << 1)
#define IORING_POLL_UPDATE_USER_DATA (1U << 2)
+/*
+ * sqe->uring_cmd_flags
+ */
+#define IORING_URING_CMD_INDIRECT (1U << 0)
+
/*
* IO completion data structure (Completion Queue Entry)
*/
--
2.25.1
^ permalink raw reply related [flat|nested] 122+ messages in thread
* [PATCH 17/17] nvme: enable non-inline passthru commands
[not found] ` <CGME20220308152729epcas5p17e82d59c68076eb46b5ef658619d65e3@epcas5p1.samsung.com>
@ 2022-03-08 15:21 ` Kanchan Joshi
2022-03-10 8:36 ` Christoph Hellwig
2022-03-24 21:09 ` Clay Mayers
0 siblings, 2 replies; 122+ messages in thread
From: Kanchan Joshi @ 2022-03-08 15:21 UTC (permalink / raw)
To: axboe, hch, kbusch, asml.silence
Cc: io-uring, linux-nvme, linux-block, sbates, logang, pankydev8,
javier, mcgrof, a.manzanares, joshiiitr, anuj20.g
From: Anuj Gupta <[email protected]>
On submission, just fetch the command from the user-space pointer and
reuse everything else. On completion, update the result field inside
the passthru command.
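For illustration only (not part of this patch), a caller using an
indirect command can read the result straight from its own struct once
the CQE arrives; ring and pcmd below are the application's own ring and
command from the previous patch:
	struct io_uring_cqe *cqe;
	io_uring_wait_cqe(&ring, &cqe);
	if (cqe->res == 0)
		printf("nvme result: %llu\n", (unsigned long long)pcmd.result);
	io_uring_cqe_seen(&ring, cqe);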
Signed-off-by: Anuj Gupta <[email protected]>
Signed-off-by: Kanchan Joshi <[email protected]>
---
drivers/nvme/host/ioctl.c | 29 +++++++++++++++++++++++++----
1 file changed, 25 insertions(+), 4 deletions(-)
diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index 701feaecabbe..ddb7e5864be6 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -65,6 +65,14 @@ static void nvme_pt_task_cb(struct io_uring_cmd *ioucmd)
}
kfree(pdu->meta);
+ if (ioucmd->flags & IO_URING_F_UCMD_INDIRECT) {
+ struct nvme_passthru_cmd64 __user *ptcmd64 = ioucmd->cmd;
+ u64 result = le64_to_cpu(nvme_req(req)->result.u64);
+
+ if (put_user(result, &ptcmd64->result))
+ status = -EFAULT;
+ }
+
io_uring_cmd_done(ioucmd, status);
}
@@ -143,6 +151,13 @@ static inline bool nvme_is_fixedb_passthru(struct io_uring_cmd *ioucmd)
return ((ioucmd) && (ioucmd->flags & IO_URING_F_UCMD_FIXEDBUFS));
}
+static inline bool is_inline_rw(struct io_uring_cmd *ioucmd, struct nvme_command *cmd)
+{
+ return ((ioucmd->flags & IO_URING_F_UCMD_INDIRECT) ||
+ (cmd->common.opcode == nvme_cmd_write ||
+ cmd->common.opcode == nvme_cmd_read));
+}
+
static int nvme_submit_user_cmd(struct request_queue *q,
struct nvme_command *cmd, u64 ubuffer,
unsigned bufflen, void __user *meta_buffer, unsigned meta_len,
@@ -193,8 +208,7 @@ static int nvme_submit_user_cmd(struct request_queue *q,
}
}
if (ioucmd) { /* async dispatch */
- if (cmd->common.opcode == nvme_cmd_write ||
- cmd->common.opcode == nvme_cmd_read) {
+ if (is_inline_rw(ioucmd, cmd)) {
if (bio && is_polling_enabled(ioucmd, req)) {
ioucmd->bio = bio;
bio->bi_opf |= REQ_POLLED;
@@ -204,7 +218,7 @@ static int nvme_submit_user_cmd(struct request_queue *q,
blk_execute_rq_nowait(req, 0, nvme_end_async_pt);
return 0;
} else {
- /* support only read and write for now. */
+ /* support only read and write for inline */
ret = -EINVAL;
goto out_meta;
}
@@ -372,7 +386,14 @@ static int nvme_user_cmd64(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
} else {
if (ioucmd->cmd_len != sizeof(struct nvme_passthru_cmd64))
return -EINVAL;
- cptr = (struct nvme_passthru_cmd64 *)ioucmd->cmd;
+ if (ioucmd->flags & IO_URING_F_UCMD_INDIRECT) {
+ ucmd = (struct nvme_passthru_cmd64 __user *)ioucmd->cmd;
+ if (copy_from_user(&cmd, ucmd, sizeof(cmd)))
+ return -EFAULT;
+ cptr = &cmd;
+ } else {
+ cptr = (struct nvme_passthru_cmd64 *)ioucmd->cmd;
+ }
}
if (cptr->flags & NVME_HIPRI)
rq_flags |= REQ_POLLED;
--
2.25.1
^ permalink raw reply related [flat|nested] 122+ messages in thread
* Re: [PATCH 13/17] nvme: allow user passthrough commands to poll
2022-03-08 15:21 ` [PATCH 13/17] nvme: allow user passthrough commands to poll Kanchan Joshi
@ 2022-03-08 17:08 ` Keith Busch
2022-03-09 7:03 ` Kanchan Joshi
0 siblings, 1 reply; 122+ messages in thread
From: Keith Busch @ 2022-03-08 17:08 UTC (permalink / raw)
To: Kanchan Joshi
Cc: axboe, hch, asml.silence, io-uring, linux-nvme, linux-block,
sbates, logang, pankydev8, javier, mcgrof, a.manzanares,
joshiiitr, anuj20.g
On Tue, Mar 08, 2022 at 08:51:01PM +0530, Kanchan Joshi wrote:
> if (copy_from_user(&io, uio, sizeof(io)))
> return -EFAULT;
> - if (io.flags)
> - return -EINVAL;
> + if (io.flags & NVME_HIPRI)
> + rq_flags |= REQ_POLLED;
I'm pretty sure we can repurpose this previously reserved field for this
kind of special handling without an issue now, but we should continue
returning EINVAL if any unknown flags are set. I have no idea what, if
any, new flags may be defined later, so we shouldn't let a future
application think an older driver honored something we are not handling.
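Something along these lines:
	if (io.flags & ~NVME_HIPRI)
		return -EINVAL;
	if (io.flags & NVME_HIPRI)
		rq_flags |= REQ_POLLED;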
^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [PATCH 13/17] nvme: allow user passthrough commands to poll
2022-03-08 17:08 ` Keith Busch
@ 2022-03-09 7:03 ` Kanchan Joshi
2022-03-11 6:49 ` Christoph Hellwig
0 siblings, 1 reply; 122+ messages in thread
From: Kanchan Joshi @ 2022-03-09 7:03 UTC (permalink / raw)
To: Keith Busch
Cc: Kanchan Joshi, Jens Axboe, Christoph Hellwig, Pavel Begunkov,
io-uring, linux-nvme, linux-block, sbates, logang, Pankaj Raghav,
Javier González, Luis Chamberlain, Adam Manzanares,
Anuj Gupta
On Tue, Mar 8, 2022 at 10:39 PM Keith Busch <[email protected]> wrote:
>
> On Tue, Mar 08, 2022 at 08:51:01PM +0530, Kanchan Joshi wrote:
> > if (copy_from_user(&io, uio, sizeof(io)))
> > return -EFAULT;
> > - if (io.flags)
> > - return -EINVAL;
> > + if (io.flags & NVME_HIPRI)
> > + rq_flags |= REQ_POLLED;
>
> I'm pretty sure we can repurpose this previously reserved field for this
> kind of special handling without an issue now, but we should continue
> returning EINVAL if any unknown flags are set. I have no idea what, if
> any, new flags may be defined later, so we shouldn't let a future
> application think an older driver honored something we are not handling.
Would it be better if we don't try to pass NVME_HIPRI by any means
(flags or rsvd1/rsvd2)? That would mean not enabling sync-polling and
killing this patch.
We have another flag, IO_URING_F_UCMD_POLLED, in ioucmd->flags, and we
can use that instead to enable only the async polling. What do you
think?
^ permalink raw reply [flat|nested] 122+ messages in thread
* RE: [PATCH 05/17] nvme: wire-up support for async-passthru on char-device.
2022-03-08 15:20 ` [PATCH 05/17] nvme: wire-up support for async-passthru on char-device Kanchan Joshi
@ 2022-03-10 0:02 ` Clay Mayers
2022-03-10 8:32 ` Kanchan Joshi
2022-03-11 7:01 ` Christoph Hellwig
` (3 subsequent siblings)
4 siblings, 1 reply; 122+ messages in thread
From: Clay Mayers @ 2022-03-10 0:02 UTC (permalink / raw)
To: Kanchan Joshi, [email protected], [email protected], [email protected],
[email protected]
Cc: [email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected], [email protected],
[email protected], [email protected], [email protected],
[email protected]
> From: Linux-nvme <[email protected]> On Behalf Of
> Kanchan Joshi
> Sent: Tuesday, March 8, 2022 7:21 AM
> To: [email protected]; [email protected]; [email protected];
> [email protected]
> Cc: [email protected]; [email protected]; linux-
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected]
> Subject: [PATCH 05/17] nvme: wire-up support for async-passthru on char-
> device.
>
> Introduce handler for fops->async_cmd(), implementing async passthru
> on char device (/dev/ngX). The handler supports NVME_IOCTL_IO64_CMD
> for
> read and write commands. Returns failure for other commands.
> This is low overhead path for processing the inline commands housed
> inside io_uring's sqe. Neither the commmand is fetched via
> copy_from_user, nor the result (inside passthru command) is updated via
> put_user.
>
> Signed-off-by: Kanchan Joshi <[email protected]>
> Signed-off-by: Anuj Gupta <[email protected]>
> ---
> drivers/nvme/host/core.c | 1 +
> drivers/nvme/host/ioctl.c | 205 ++++++++++++++++++++++++++++------
> drivers/nvme/host/multipath.c | 1 +
> drivers/nvme/host/nvme.h | 3 +
> 4 files changed, 178 insertions(+), 32 deletions(-)
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 159944499c4f..3fe8f5901cd9 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -3667,6 +3667,7 @@ static const struct file_operations
> nvme_ns_chr_fops = {
> .release = nvme_ns_chr_release,
> .unlocked_ioctl = nvme_ns_chr_ioctl,
> .compat_ioctl = compat_ptr_ioctl,
> + .async_cmd = nvme_ns_chr_async_cmd,
> };
>
> static int nvme_add_ns_cdev(struct nvme_ns *ns)
> diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
> index 5c9cd9695519..1df270b47af5 100644
> --- a/drivers/nvme/host/ioctl.c
> +++ b/drivers/nvme/host/ioctl.c
> @@ -18,6 +18,76 @@ static void __user *nvme_to_user_ptr(uintptr_t ptrval)
> ptrval = (compat_uptr_t)ptrval;
> return (void __user *)ptrval;
> }
> +/*
> + * This overlays struct io_uring_cmd pdu.
> + * Expect build errors if this grows larger than that.
> + */
> +struct nvme_uring_cmd_pdu {
> + u32 meta_len;
> + union {
> + struct bio *bio;
> + struct request *req;
> + };
> + void *meta; /* kernel-resident buffer */
> + void __user *meta_buffer;
> +} __packed;
> +
> +static struct nvme_uring_cmd_pdu *nvme_uring_cmd_pdu(struct
> io_uring_cmd *ioucmd)
> +{
> + return (struct nvme_uring_cmd_pdu *)&ioucmd->pdu;
> +}
> +
> +static void nvme_pt_task_cb(struct io_uring_cmd *ioucmd)
> +{
> + struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd);
> + struct request *req = pdu->req;
> + int status;
> + struct bio *bio = req->bio;
> +
> + if (nvme_req(req)->flags & NVME_REQ_CANCELLED)
> + status = -EINTR;
> + else
> + status = nvme_req(req)->status;
> +
> + /* we can free request */
> + blk_mq_free_request(req);
> + blk_rq_unmap_user(bio);
> +
> + if (!status && pdu->meta_buffer) {
> + if (copy_to_user(pdu->meta_buffer, pdu->meta, pdu-
> >meta_len))
> + status = -EFAULT;
> + }
> + kfree(pdu->meta);
> +
> + io_uring_cmd_done(ioucmd, status);
> +}
> +
> +static void nvme_end_async_pt(struct request *req, blk_status_t err)
> +{
> + struct io_uring_cmd *ioucmd = req->end_io_data;
> + struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd);
> + /* extract bio before reusing the same field for request */
> + struct bio *bio = pdu->bio;
> +
> + pdu->req = req;
> + req->bio = bio;
> + /* this takes care of setting up task-work */
> + io_uring_cmd_complete_in_task(ioucmd, nvme_pt_task_cb);
> +}
> +
> +static void nvme_setup_uring_cmd_data(struct request *rq,
> + struct io_uring_cmd *ioucmd, void *meta,
> + void __user *meta_buffer, u32 meta_len)
> +{
> + struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd);
> +
> + /* to free bio on completion, as req->bio will be null at that time */
> + pdu->bio = rq->bio;
> + pdu->meta = meta;
> + pdu->meta_buffer = meta_buffer;
> + pdu->meta_len = meta_len;
> + rq->end_io_data = ioucmd;
> +}
>
> static void *nvme_add_user_metadata(struct bio *bio, void __user *ubuf,
> unsigned len, u32 seed, bool write)
> @@ -56,7 +126,8 @@ static void *nvme_add_user_metadata(struct bio
> *bio, void __user *ubuf,
> static int nvme_submit_user_cmd(struct request_queue *q,
> struct nvme_command *cmd, void __user *ubuffer,
> unsigned bufflen, void __user *meta_buffer, unsigned
> meta_len,
> - u32 meta_seed, u64 *result, unsigned timeout)
> + u32 meta_seed, u64 *result, unsigned timeout,
> + struct io_uring_cmd *ioucmd)
> {
> bool write = nvme_is_write(cmd);
> struct nvme_ns *ns = q->queuedata;
> @@ -64,9 +135,15 @@ static int nvme_submit_user_cmd(struct
> request_queue *q,
> struct request *req;
> struct bio *bio = NULL;
> void *meta = NULL;
> + unsigned int rq_flags = 0;
> + blk_mq_req_flags_t blk_flags = 0;
> int ret;
>
> - req = nvme_alloc_request(q, cmd, 0, 0);
> + if (ioucmd && (ioucmd->flags & IO_URING_F_NONBLOCK)) {
> + rq_flags |= REQ_NOWAIT;
> + blk_flags |= BLK_MQ_REQ_NOWAIT;
> + }
> + req = nvme_alloc_request(q, cmd, blk_flags, rq_flags);
> if (IS_ERR(req))
> return PTR_ERR(req);
>
> @@ -92,6 +169,19 @@ static int nvme_submit_user_cmd(struct
> request_queue *q,
> req->cmd_flags |= REQ_INTEGRITY;
> }
> }
> + if (ioucmd) { /* async dispatch */
> + if (cmd->common.opcode == nvme_cmd_write ||
> + cmd->common.opcode == nvme_cmd_read) {
> + nvme_setup_uring_cmd_data(req, ioucmd, meta,
> meta_buffer,
> + meta_len);
> + blk_execute_rq_nowait(req, 0, nvme_end_async_pt);
> + return 0;
> + } else {
> + /* support only read and write for now. */
> + ret = -EINVAL;
> + goto out_meta;
> + }
> + }
>
> ret = nvme_execute_passthru_rq(req);
> if (result)
> @@ -100,6 +190,7 @@ static int nvme_submit_user_cmd(struct
> request_queue *q,
> if (copy_to_user(meta_buffer, meta, meta_len))
> ret = -EFAULT;
> }
> + out_meta:
> kfree(meta);
> out_unmap:
> if (bio)
> @@ -170,7 +261,8 @@ static int nvme_submit_io(struct nvme_ns *ns, struct
> nvme_user_io __user *uio)
>
> return nvme_submit_user_cmd(ns->queue, &c,
> nvme_to_user_ptr(io.addr), length,
> - metadata, meta_len, lower_32_bits(io.slba), NULL,
> 0);
> + metadata, meta_len, lower_32_bits(io.slba), NULL, 0,
> + NULL);
> }
>
> static bool nvme_validate_passthru_nsid(struct nvme_ctrl *ctrl,
> @@ -224,7 +316,7 @@ static int nvme_user_cmd(struct nvme_ctrl *ctrl,
> struct nvme_ns *ns,
> status = nvme_submit_user_cmd(ns ? ns->queue : ctrl->admin_q, &c,
> nvme_to_user_ptr(cmd.addr), cmd.data_len,
> nvme_to_user_ptr(cmd.metadata),
> cmd.metadata_len,
> - 0, &result, timeout);
> + 0, &result, timeout, NULL);
>
> if (status >= 0) {
> if (put_user(result, &ucmd->result))
> @@ -235,45 +327,53 @@ static int nvme_user_cmd(struct nvme_ctrl *ctrl,
> struct nvme_ns *ns,
> }
>
> static int nvme_user_cmd64(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
> - struct nvme_passthru_cmd64 __user *ucmd)
> + struct nvme_passthru_cmd64 __user *ucmd,
> + struct io_uring_cmd *ioucmd)
> {
> - struct nvme_passthru_cmd64 cmd;
> + struct nvme_passthru_cmd64 cmd, *cptr;
> struct nvme_command c;
> unsigned timeout = 0;
> int status;
>
> if (!capable(CAP_SYS_ADMIN))
> return -EACCES;
> - if (copy_from_user(&cmd, ucmd, sizeof(cmd)))
> - return -EFAULT;
> - if (cmd.flags)
> + if (!ioucmd) {
> + if (copy_from_user(&cmd, ucmd, sizeof(cmd)))
> + return -EFAULT;
> + cptr = &cmd;
> + } else {
> + if (ioucmd->cmd_len != sizeof(struct nvme_passthru_cmd64))
> + return -EINVAL;
> + cptr = (struct nvme_passthru_cmd64 *)ioucmd->cmd;
> + }
> + if (cptr->flags)
> return -EINVAL;
> - if (!nvme_validate_passthru_nsid(ctrl, ns, cmd.nsid))
> + if (!nvme_validate_passthru_nsid(ctrl, ns, cptr->nsid))
> return -EINVAL;
>
> memset(&c, 0, sizeof(c));
> - c.common.opcode = cmd.opcode;
> - c.common.flags = cmd.flags;
> - c.common.nsid = cpu_to_le32(cmd.nsid);
> - c.common.cdw2[0] = cpu_to_le32(cmd.cdw2);
> - c.common.cdw2[1] = cpu_to_le32(cmd.cdw3);
> - c.common.cdw10 = cpu_to_le32(cmd.cdw10);
> - c.common.cdw11 = cpu_to_le32(cmd.cdw11);
> - c.common.cdw12 = cpu_to_le32(cmd.cdw12);
> - c.common.cdw13 = cpu_to_le32(cmd.cdw13);
> - c.common.cdw14 = cpu_to_le32(cmd.cdw14);
> - c.common.cdw15 = cpu_to_le32(cmd.cdw15);
> -
> - if (cmd.timeout_ms)
> - timeout = msecs_to_jiffies(cmd.timeout_ms);
> + c.common.opcode = cptr->opcode;
> + c.common.flags = cptr->flags;
> + c.common.nsid = cpu_to_le32(cptr->nsid);
> + c.common.cdw2[0] = cpu_to_le32(cptr->cdw2);
> + c.common.cdw2[1] = cpu_to_le32(cptr->cdw3);
> + c.common.cdw10 = cpu_to_le32(cptr->cdw10);
> + c.common.cdw11 = cpu_to_le32(cptr->cdw11);
> + c.common.cdw12 = cpu_to_le32(cptr->cdw12);
> + c.common.cdw13 = cpu_to_le32(cptr->cdw13);
> + c.common.cdw14 = cpu_to_le32(cptr->cdw14);
> + c.common.cdw15 = cpu_to_le32(cptr->cdw15);
> +
> + if (cptr->timeout_ms)
> + timeout = msecs_to_jiffies(cptr->timeout_ms);
>
> status = nvme_submit_user_cmd(ns ? ns->queue : ctrl->admin_q, &c,
> - nvme_to_user_ptr(cmd.addr), cmd.data_len,
> - nvme_to_user_ptr(cmd.metadata),
> cmd.metadata_len,
> - 0, &cmd.result, timeout);
> + nvme_to_user_ptr(cptr->addr), cptr->data_len,
> + nvme_to_user_ptr(cptr->metadata), cptr-
> >metadata_len,
> + 0, &cptr->result, timeout, ioucmd);
>
> - if (status >= 0) {
> - if (put_user(cmd.result, &ucmd->result))
> + if (!ioucmd && status >= 0) {
> + if (put_user(cptr->result, &ucmd->result))
> return -EFAULT;
> }
>
> @@ -296,7 +396,7 @@ static int nvme_ctrl_ioctl(struct nvme_ctrl *ctrl,
> unsigned int cmd,
> case NVME_IOCTL_ADMIN_CMD:
> return nvme_user_cmd(ctrl, NULL, argp);
> case NVME_IOCTL_ADMIN64_CMD:
> - return nvme_user_cmd64(ctrl, NULL, argp);
> + return nvme_user_cmd64(ctrl, NULL, argp, NULL);
> default:
> return sed_ioctl(ctrl->opal_dev, cmd, argp);
> }
> @@ -340,7 +440,7 @@ static int nvme_ns_ioctl(struct nvme_ns *ns,
> unsigned int cmd,
> case NVME_IOCTL_SUBMIT_IO:
> return nvme_submit_io(ns, argp);
> case NVME_IOCTL_IO64_CMD:
> - return nvme_user_cmd64(ns->ctrl, ns, argp);
> + return nvme_user_cmd64(ns->ctrl, ns, argp, NULL);
> default:
> return -ENOTTY;
> }
> @@ -369,6 +469,33 @@ long nvme_ns_chr_ioctl(struct file *file, unsigned int
> cmd, unsigned long arg)
> return __nvme_ioctl(ns, cmd, (void __user *)arg);
> }
>
> +static int nvme_ns_async_ioctl(struct nvme_ns *ns, struct io_uring_cmd
> *ioucmd)
> +{
> + int ret;
> +
> + BUILD_BUG_ON(sizeof(struct nvme_uring_cmd_pdu) >
> sizeof(ioucmd->pdu));
> +
> + switch (ioucmd->cmd_op) {
> + case NVME_IOCTL_IO64_CMD:
> + ret = nvme_user_cmd64(ns->ctrl, ns, NULL, ioucmd);
> + break;
> + default:
> + ret = -ENOTTY;
> + }
> +
> + if (ret >= 0)
> + ret = -EIOCBQUEUED;
> + return ret;
> +}
ret can equal -EAGAIN, which will cause io_uring to reissue the cmd
from a worker thread. This can happen when ioucmd->flags has
IO_URING_F_NONBLOCK set, which makes nvme_alloc_request() return
-EAGAIN when there are no tags available.
Either -EAGAIN needs to be remapped, or REQ_F_NOWAIT needs to be
force-set on the io_uring cmd request in patch 3 (the 2nd option is
untested).
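Purely as an illustration of the first option (the errno chosen here
is arbitrary):
	ret = nvme_user_cmd64(ns->ctrl, ns, NULL, ioucmd);
	/* don't hand -EAGAIN back to io_uring; it would reissue via io-wq */
	if (ret == -EAGAIN)
		ret = -EBUSY;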
> +
> +int nvme_ns_chr_async_cmd(struct io_uring_cmd *ioucmd)
> +{
> + struct nvme_ns *ns = container_of(file_inode(ioucmd->file)->i_cdev,
> + struct nvme_ns, cdev);
> +
> + return nvme_ns_async_ioctl(ns, ioucmd);
> +}
> +
> #ifdef CONFIG_NVME_MULTIPATH
> static int nvme_ns_head_ctrl_ioctl(struct nvme_ns *ns, unsigned int cmd,
> void __user *argp, struct nvme_ns_head *head, int srcu_idx)
> @@ -412,6 +539,20 @@ int nvme_ns_head_ioctl(struct block_device *bdev,
> fmode_t mode,
> return ret;
> }
>
> +int nvme_ns_head_chr_async_cmd(struct io_uring_cmd *ioucmd)
> +{
> + struct cdev *cdev = file_inode(ioucmd->file)->i_cdev;
> + struct nvme_ns_head *head = container_of(cdev, struct
> nvme_ns_head, cdev);
> + int srcu_idx = srcu_read_lock(&head->srcu);
> + struct nvme_ns *ns = nvme_find_path(head);
> + int ret = -EWOULDBLOCK;
-EWOULDBLOCK has the same value as -EAGAIN, so the same issue is here
as with nvme_ns_async_ioctl() returning it.
> +
> + if (ns)
> + ret = nvme_ns_async_ioctl(ns, ioucmd);
> + srcu_read_unlock(&head->srcu, srcu_idx);
> + return ret;
> +}
> +
> long nvme_ns_head_chr_ioctl(struct file *file, unsigned int cmd,
> unsigned long arg)
> {
> @@ -480,7 +621,7 @@ long nvme_dev_ioctl(struct file *file, unsigned int
> cmd,
> case NVME_IOCTL_ADMIN_CMD:
> return nvme_user_cmd(ctrl, NULL, argp);
> case NVME_IOCTL_ADMIN64_CMD:
> - return nvme_user_cmd64(ctrl, NULL, argp);
> + return nvme_user_cmd64(ctrl, NULL, argp, NULL);
> case NVME_IOCTL_IO_CMD:
> return nvme_dev_user_cmd(ctrl, argp);
> case NVME_IOCTL_RESET:
> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> index f8bf6606eb2f..1d798d09456f 100644
> --- a/drivers/nvme/host/multipath.c
> +++ b/drivers/nvme/host/multipath.c
> @@ -459,6 +459,7 @@ static const struct file_operations
> nvme_ns_head_chr_fops = {
> .release = nvme_ns_head_chr_release,
> .unlocked_ioctl = nvme_ns_head_chr_ioctl,
> .compat_ioctl = compat_ptr_ioctl,
> + .async_cmd = nvme_ns_head_chr_async_cmd,
> };
>
> static int nvme_add_ns_head_cdev(struct nvme_ns_head *head)
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index b32f4e2c68fd..e6a30543d7c8 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -16,6 +16,7 @@
> #include <linux/rcupdate.h>
> #include <linux/wait.h>
> #include <linux/t10-pi.h>
> +#include <linux/io_uring.h>
>
> #include <trace/events/block.h>
>
> @@ -752,6 +753,8 @@ long nvme_ns_head_chr_ioctl(struct file *file,
> unsigned int cmd,
> unsigned long arg);
> long nvme_dev_ioctl(struct file *file, unsigned int cmd,
> unsigned long arg);
> +int nvme_ns_chr_async_cmd(struct io_uring_cmd *ioucmd);
> +int nvme_ns_head_chr_async_cmd(struct io_uring_cmd *ioucmd);
> int nvme_getgeo(struct block_device *bdev, struct hd_geometry *geo);
>
> extern const struct attribute_group *nvme_ns_id_attr_groups[];
> --
> 2.25.1
On 5.10 with our version of this patch, I've seen that returning -EAGAIN to
io_uring results in poisoned bios and crashed kernel threads (NULL current->mm)
while constructing the async passthrough request. I looked at
git://git.kernel.dk/linux-block and git://git.infradead.org/nvme.git
and as best as I can tell, the same thing will happen.
^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [PATCH 00/17] io_uring passthru over nvme
2022-03-08 15:20 ` [PATCH 00/17] io_uring passthru over nvme Kanchan Joshi
` (16 preceding siblings ...)
[not found] ` <CGME20220308152729epcas5p17e82d59c68076eb46b5ef658619d65e3@epcas5p1.samsung.com>
@ 2022-03-10 8:29 ` Christoph Hellwig
2022-03-10 10:05 ` Kanchan Joshi
17 siblings, 1 reply; 122+ messages in thread
From: Christoph Hellwig @ 2022-03-10 8:29 UTC (permalink / raw)
To: Kanchan Joshi
Cc: axboe, hch, kbusch, asml.silence, io-uring, linux-nvme,
linux-block, sbates, logang, pankydev8, javier, mcgrof,
a.manzanares, joshiiitr, anuj20.g
What branch is this against?
Do you have a git tree available?
On Tue, Mar 08, 2022 at 08:50:48PM +0530, Kanchan Joshi wrote:
> This is a streamlined series with new way of doing uring-cmd, and connects
> nvme-passthrough (over char device /dev/ngX) to it.
> uring-cmd enables using io_uring for any arbitrary command (ioctl,
> fsctl etc.) exposed by the command provider (e.g. driver, fs etc.).
>
> To store the command inline within the sqe, Jens added an option to setup
> the ring with 128-byte SQEs.This gives 80 bytes of space (16 bytes at
> the end of the first sqe + 64 bytes in the second sqe). With inline
> command in sqe, the application avoids explicit allocation and, in the
> kernel, we avoid doing copy_from_user. Command-opcode, length etc.
> are stored in per-op fields of io_uring_sqe.
>
> Non-inline submission (when command is a user-space pointer rather than
> housed inside sqe) is also supported.
>
> io_uring sends this command down by newly introduced ->async_cmd()
> handler in file_operations. The handler does what is required to
> submit, and indicates queued completion.The infra has been added to
> process the completion when it arrives.
>
> Overall the patches wire up the following capabilities for this path:
> - async
> - fixed-buffer
> - plugging
> - bio-cache
> - sync and async polling.
>
> This scales well. 512b randread perf (KIOPS) comparing
> uring-passthru-over-char (/dev/ng0n1) to
> uring-over-block(/dev/nvme0n1)
>
> QD uring pt uring-poll pt-poll
> 8 538 589 831 902
> 64 967 1131 1351 1378
> 256 1043 1230 1376 1429
>
> Testing/perf is done with this custom fio that turnes regular-io into
> passthru-io on supplying "uring_cmd=1" option.
> https://github.com/joshkan/fio/tree/big-sqe-pt.v1
>
> Example command-line:
> fio -iodepth=256 -rw=randread -ioengine=io_uring -bs=512 -numjobs=1
> -runtime=60 -group_reporting -iodepth_batch_submit=64
> -iodepth_batch_complete_min=1 -iodepth_batch_complete_max=64
> -fixedbufs=1 -hipri=1 -sqthread_poll=0 -filename=/dev/ng0n1
> -name=io_uring_256 -uring_cmd=1
>
>
> Anuj Gupta (3):
> io_uring: prep for fixed-buffer enabled uring-cmd
> nvme: enable passthrough with fixed-buffer
> nvme: enable non-inline passthru commands
>
> Jens Axboe (5):
> io_uring: add support for 128-byte SQEs
> fs: add file_operations->async_cmd()
> io_uring: add infra and support for IORING_OP_URING_CMD
> io_uring: plug for async bypass
> block: wire-up support for plugging
>
> Kanchan Joshi (5):
> nvme: wire-up support for async-passthru on char-device.
> io_uring: add support for uring_cmd with fixed-buffer
> block: factor out helper for bio allocation from cache
> nvme: enable bio-cache for fixed-buffer passthru
> io_uring: add support for non-inline uring-cmd
>
> Keith Busch (2):
> nvme: modify nvme_alloc_request to take an additional parameter
> nvme: allow user passthrough commands to poll
>
> Pankaj Raghav (2):
> io_uring: add polling support for uring-cmd
> nvme: wire-up polling for uring-passthru
>
> block/bio.c | 43 ++--
> block/blk-map.c | 45 +++++
> block/blk-mq.c | 93 ++++-----
> drivers/nvme/host/core.c | 21 +-
> drivers/nvme/host/ioctl.c | 336 +++++++++++++++++++++++++++-----
> drivers/nvme/host/multipath.c | 2 +
> drivers/nvme/host/nvme.h | 11 +-
> drivers/nvme/host/pci.c | 4 +-
> drivers/nvme/target/passthru.c | 2 +-
> fs/io_uring.c | 188 ++++++++++++++++--
> include/linux/bio.h | 1 +
> include/linux/blk-mq.h | 4 +
> include/linux/fs.h | 2 +
> include/linux/io_uring.h | 43 ++++
> include/uapi/linux/io_uring.h | 21 +-
> include/uapi/linux/nvme_ioctl.h | 4 +
> 16 files changed, 689 insertions(+), 131 deletions(-)
>
> --
> 2.25.1
---end quoted text---
^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [PATCH 08/17] nvme: enable passthrough with fixed-buffer
2022-03-08 15:20 ` [PATCH 08/17] nvme: enable passthrough " Kanchan Joshi
@ 2022-03-10 8:32 ` Christoph Hellwig
2022-03-11 6:43 ` Christoph Hellwig
2022-03-14 12:18 ` Ming Lei
2 siblings, 0 replies; 122+ messages in thread
From: Christoph Hellwig @ 2022-03-10 8:32 UTC (permalink / raw)
To: Kanchan Joshi
Cc: axboe, hch, kbusch, asml.silence, io-uring, linux-nvme,
linux-block, sbates, logang, pankydev8, javier, mcgrof,
a.manzanares, joshiiitr, anuj20.g
On Tue, Mar 08, 2022 at 08:50:56PM +0530, Kanchan Joshi wrote:
> +/* Unlike blk_rq_map_user () this is only for fixed-buffer async passthrough. */
> +int blk_rq_map_user_fixedb(struct request_queue *q, struct request *rq,
> + u64 ubuf, unsigned long len, gfp_t gfp_mask,
> + struct io_uring_cmd *ioucmd)
> +{
This doesn't belong into a patch titled nvme. Also please add a proper
kernel-doc comment.
> +EXPORT_SYMBOL(blk_rq_map_user_fixedb);
EXPORT_SYMBOL_GPL, please.
> +static inline bool nvme_is_fixedb_passthru(struct io_uring_cmd *ioucmd)
> +{
> + return ((ioucmd) && (ioucmd->flags & IO_URING_F_UCMD_FIXEDBUFS));
> +}
No need for the outer and first set of inner braces.
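i.e. something like:
	static inline bool nvme_is_fixedb_passthru(struct io_uring_cmd *ioucmd)
	{
		return ioucmd && (ioucmd->flags & IO_URING_F_UCMD_FIXEDBUFS);
	}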
> +
> static int nvme_submit_user_cmd(struct request_queue *q,
> - struct nvme_command *cmd, void __user *ubuffer,
> + struct nvme_command *cmd, u64 ubuffer,
> unsigned bufflen, void __user *meta_buffer, unsigned meta_len,
> u32 meta_seed, u64 *result, unsigned timeout,
> struct io_uring_cmd *ioucmd)
> @@ -152,8 +157,12 @@ static int nvme_submit_user_cmd(struct request_queue *q,
> nvme_req(req)->flags |= NVME_REQ_USERCMD;
>
> if (ubuffer && bufflen) {
> - ret = blk_rq_map_user(q, req, NULL, ubuffer, bufflen,
> - GFP_KERNEL);
> + if (likely(nvme_is_fixedb_passthru(ioucmd)))
> + ret = blk_rq_map_user_fixedb(q, req, ubuffer, bufflen,
> + GFP_KERNEL, ioucmd);
> + else
> + ret = blk_rq_map_user(q, req, NULL, nvme_to_user_ptr(ubuffer),
Overly long line.
^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [PATCH 05/17] nvme: wire-up support for async-passthru on char-device.
2022-03-10 0:02 ` Clay Mayers
@ 2022-03-10 8:32 ` Kanchan Joshi
0 siblings, 0 replies; 122+ messages in thread
From: Kanchan Joshi @ 2022-03-10 8:32 UTC (permalink / raw)
To: Clay Mayers
Cc: Kanchan Joshi, [email protected], [email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected], [email protected],
[email protected], [email protected], [email protected],
[email protected]
On Thu, Mar 10, 2022 at 5:32 AM Clay Mayers <[email protected]> wrote:
>
> > From: Linux-nvme <[email protected]> On Behalf Of
> > Kanchan Joshi
> > Sent: Tuesday, March 8, 2022 7:21 AM
> > To: [email protected]; [email protected]; [email protected];
> > [email protected]
> > Cc: [email protected]; [email protected]; linux-
> > [email protected]; [email protected]; [email protected];
> > [email protected]; [email protected]; [email protected];
> > [email protected]; [email protected]; [email protected]
> > Subject: [PATCH 05/17] nvme: wire-up support for async-passthru on char-
> > device.
> >
> > Introduce handler for fops->async_cmd(), implementing async passthru
> > on char device (/dev/ngX). The handler supports NVME_IOCTL_IO64_CMD
> > for
> > read and write commands. Returns failure for other commands.
> > This is low overhead path for processing the inline commands housed
> > inside io_uring's sqe. Neither the commmand is fetched via
> > copy_from_user, nor the result (inside passthru command) is updated via
> > put_user.
> >
> > Signed-off-by: Kanchan Joshi <[email protected]>
> > Signed-off-by: Anuj Gupta <[email protected]>
> > ---
> > drivers/nvme/host/core.c | 1 +
> > drivers/nvme/host/ioctl.c | 205 ++++++++++++++++++++++++++++------
> > drivers/nvme/host/multipath.c | 1 +
> > drivers/nvme/host/nvme.h | 3 +
> > 4 files changed, 178 insertions(+), 32 deletions(-)
> >
> > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> > index 159944499c4f..3fe8f5901cd9 100644
> > --- a/drivers/nvme/host/core.c
> > +++ b/drivers/nvme/host/core.c
> > @@ -3667,6 +3667,7 @@ static const struct file_operations
> > nvme_ns_chr_fops = {
> > .release = nvme_ns_chr_release,
> > .unlocked_ioctl = nvme_ns_chr_ioctl,
> > .compat_ioctl = compat_ptr_ioctl,
> > + .async_cmd = nvme_ns_chr_async_cmd,
> > };
> >
> > static int nvme_add_ns_cdev(struct nvme_ns *ns)
> > diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
> > index 5c9cd9695519..1df270b47af5 100644
> > --- a/drivers/nvme/host/ioctl.c
> > +++ b/drivers/nvme/host/ioctl.c
> > @@ -18,6 +18,76 @@ static void __user *nvme_to_user_ptr(uintptr_t ptrval)
> > ptrval = (compat_uptr_t)ptrval;
> > return (void __user *)ptrval;
> > }
> > +/*
> > + * This overlays struct io_uring_cmd pdu.
> > + * Expect build errors if this grows larger than that.
> > + */
> > +struct nvme_uring_cmd_pdu {
> > + u32 meta_len;
> > + union {
> > + struct bio *bio;
> > + struct request *req;
> > + };
> > + void *meta; /* kernel-resident buffer */
> > + void __user *meta_buffer;
> > +} __packed;
> > +
> > +static struct nvme_uring_cmd_pdu *nvme_uring_cmd_pdu(struct
> > io_uring_cmd *ioucmd)
> > +{
> > + return (struct nvme_uring_cmd_pdu *)&ioucmd->pdu;
> > +}
> > +
> > +static void nvme_pt_task_cb(struct io_uring_cmd *ioucmd)
> > +{
> > + struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd);
> > + struct request *req = pdu->req;
> > + int status;
> > + struct bio *bio = req->bio;
> > +
> > + if (nvme_req(req)->flags & NVME_REQ_CANCELLED)
> > + status = -EINTR;
> > + else
> > + status = nvme_req(req)->status;
> > +
> > + /* we can free request */
> > + blk_mq_free_request(req);
> > + blk_rq_unmap_user(bio);
> > +
> > + if (!status && pdu->meta_buffer) {
> > + if (copy_to_user(pdu->meta_buffer, pdu->meta, pdu-
> > >meta_len))
> > + status = -EFAULT;
> > + }
> > + kfree(pdu->meta);
> > +
> > + io_uring_cmd_done(ioucmd, status);
> > +}
> > +
> > +static void nvme_end_async_pt(struct request *req, blk_status_t err)
> > +{
> > + struct io_uring_cmd *ioucmd = req->end_io_data;
> > + struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd);
> > + /* extract bio before reusing the same field for request */
> > + struct bio *bio = pdu->bio;
> > +
> > + pdu->req = req;
> > + req->bio = bio;
> > + /* this takes care of setting up task-work */
> > + io_uring_cmd_complete_in_task(ioucmd, nvme_pt_task_cb);
> > +}
> > +
> > +static void nvme_setup_uring_cmd_data(struct request *rq,
> > + struct io_uring_cmd *ioucmd, void *meta,
> > + void __user *meta_buffer, u32 meta_len)
> > +{
> > + struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd);
> > +
> > + /* to free bio on completion, as req->bio will be null at that time */
> > + pdu->bio = rq->bio;
> > + pdu->meta = meta;
> > + pdu->meta_buffer = meta_buffer;
> > + pdu->meta_len = meta_len;
> > + rq->end_io_data = ioucmd;
> > +}
> >
> > static void *nvme_add_user_metadata(struct bio *bio, void __user *ubuf,
> > unsigned len, u32 seed, bool write)
> > @@ -56,7 +126,8 @@ static void *nvme_add_user_metadata(struct bio
> > *bio, void __user *ubuf,
> > static int nvme_submit_user_cmd(struct request_queue *q,
> > struct nvme_command *cmd, void __user *ubuffer,
> > unsigned bufflen, void __user *meta_buffer, unsigned
> > meta_len,
> > - u32 meta_seed, u64 *result, unsigned timeout)
> > + u32 meta_seed, u64 *result, unsigned timeout,
> > + struct io_uring_cmd *ioucmd)
> > {
> > bool write = nvme_is_write(cmd);
> > struct nvme_ns *ns = q->queuedata;
> > @@ -64,9 +135,15 @@ static int nvme_submit_user_cmd(struct
> > request_queue *q,
> > struct request *req;
> > struct bio *bio = NULL;
> > void *meta = NULL;
> > + unsigned int rq_flags = 0;
> > + blk_mq_req_flags_t blk_flags = 0;
> > int ret;
> >
> > - req = nvme_alloc_request(q, cmd, 0, 0);
> > + if (ioucmd && (ioucmd->flags & IO_URING_F_NONBLOCK)) {
> > + rq_flags |= REQ_NOWAIT;
> > + blk_flags |= BLK_MQ_REQ_NOWAIT;
> > + }
> > + req = nvme_alloc_request(q, cmd, blk_flags, rq_flags);
> > if (IS_ERR(req))
> > return PTR_ERR(req);
> >
> > @@ -92,6 +169,19 @@ static int nvme_submit_user_cmd(struct
> > request_queue *q,
> > req->cmd_flags |= REQ_INTEGRITY;
> > }
> > }
> > + if (ioucmd) { /* async dispatch */
> > + if (cmd->common.opcode == nvme_cmd_write ||
> > + cmd->common.opcode == nvme_cmd_read) {
> > + nvme_setup_uring_cmd_data(req, ioucmd, meta, meta_buffer,
> > + meta_len);
> > + blk_execute_rq_nowait(req, 0, nvme_end_async_pt);
> > + return 0;
> > + } else {
> > + /* support only read and write for now. */
> > + ret = -EINVAL;
> > + goto out_meta;
> > + }
> > + }
> >
> > ret = nvme_execute_passthru_rq(req);
> > if (result)
> > @@ -100,6 +190,7 @@ static int nvme_submit_user_cmd(struct
> > request_queue *q,
> > if (copy_to_user(meta_buffer, meta, meta_len))
> > ret = -EFAULT;
> > }
> > + out_meta:
> > kfree(meta);
> > out_unmap:
> > if (bio)
> > @@ -170,7 +261,8 @@ static int nvme_submit_io(struct nvme_ns *ns, struct
> > nvme_user_io __user *uio)
> >
> > return nvme_submit_user_cmd(ns->queue, &c,
> > nvme_to_user_ptr(io.addr), length,
> > - metadata, meta_len, lower_32_bits(io.slba), NULL, 0);
> > + metadata, meta_len, lower_32_bits(io.slba), NULL, 0,
> > + NULL);
> > }
> >
> > static bool nvme_validate_passthru_nsid(struct nvme_ctrl *ctrl,
> > @@ -224,7 +316,7 @@ static int nvme_user_cmd(struct nvme_ctrl *ctrl,
> > struct nvme_ns *ns,
> > status = nvme_submit_user_cmd(ns ? ns->queue : ctrl->admin_q, &c,
> > nvme_to_user_ptr(cmd.addr), cmd.data_len,
> > nvme_to_user_ptr(cmd.metadata), cmd.metadata_len,
> > - 0, &result, timeout);
> > + 0, &result, timeout, NULL);
> >
> > if (status >= 0) {
> > if (put_user(result, &ucmd->result))
> > @@ -235,45 +327,53 @@ static int nvme_user_cmd(struct nvme_ctrl *ctrl,
> > struct nvme_ns *ns,
> > }
> >
> > static int nvme_user_cmd64(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
> > - struct nvme_passthru_cmd64 __user *ucmd)
> > + struct nvme_passthru_cmd64 __user *ucmd,
> > + struct io_uring_cmd *ioucmd)
> > {
> > - struct nvme_passthru_cmd64 cmd;
> > + struct nvme_passthru_cmd64 cmd, *cptr;
> > struct nvme_command c;
> > unsigned timeout = 0;
> > int status;
> >
> > if (!capable(CAP_SYS_ADMIN))
> > return -EACCES;
> > - if (copy_from_user(&cmd, ucmd, sizeof(cmd)))
> > - return -EFAULT;
> > - if (cmd.flags)
> > + if (!ioucmd) {
> > + if (copy_from_user(&cmd, ucmd, sizeof(cmd)))
> > + return -EFAULT;
> > + cptr = &cmd;
> > + } else {
> > + if (ioucmd->cmd_len != sizeof(struct nvme_passthru_cmd64))
> > + return -EINVAL;
> > + cptr = (struct nvme_passthru_cmd64 *)ioucmd->cmd;
> > + }
> > + if (cptr->flags)
> > return -EINVAL;
> > - if (!nvme_validate_passthru_nsid(ctrl, ns, cmd.nsid))
> > + if (!nvme_validate_passthru_nsid(ctrl, ns, cptr->nsid))
> > return -EINVAL;
> >
> > memset(&c, 0, sizeof(c));
> > - c.common.opcode = cmd.opcode;
> > - c.common.flags = cmd.flags;
> > - c.common.nsid = cpu_to_le32(cmd.nsid);
> > - c.common.cdw2[0] = cpu_to_le32(cmd.cdw2);
> > - c.common.cdw2[1] = cpu_to_le32(cmd.cdw3);
> > - c.common.cdw10 = cpu_to_le32(cmd.cdw10);
> > - c.common.cdw11 = cpu_to_le32(cmd.cdw11);
> > - c.common.cdw12 = cpu_to_le32(cmd.cdw12);
> > - c.common.cdw13 = cpu_to_le32(cmd.cdw13);
> > - c.common.cdw14 = cpu_to_le32(cmd.cdw14);
> > - c.common.cdw15 = cpu_to_le32(cmd.cdw15);
> > -
> > - if (cmd.timeout_ms)
> > - timeout = msecs_to_jiffies(cmd.timeout_ms);
> > + c.common.opcode = cptr->opcode;
> > + c.common.flags = cptr->flags;
> > + c.common.nsid = cpu_to_le32(cptr->nsid);
> > + c.common.cdw2[0] = cpu_to_le32(cptr->cdw2);
> > + c.common.cdw2[1] = cpu_to_le32(cptr->cdw3);
> > + c.common.cdw10 = cpu_to_le32(cptr->cdw10);
> > + c.common.cdw11 = cpu_to_le32(cptr->cdw11);
> > + c.common.cdw12 = cpu_to_le32(cptr->cdw12);
> > + c.common.cdw13 = cpu_to_le32(cptr->cdw13);
> > + c.common.cdw14 = cpu_to_le32(cptr->cdw14);
> > + c.common.cdw15 = cpu_to_le32(cptr->cdw15);
> > +
> > + if (cptr->timeout_ms)
> > + timeout = msecs_to_jiffies(cptr->timeout_ms);
> >
> > status = nvme_submit_user_cmd(ns ? ns->queue : ctrl->admin_q, &c,
> > - nvme_to_user_ptr(cmd.addr), cmd.data_len,
> > - nvme_to_user_ptr(cmd.metadata), cmd.metadata_len,
> > - 0, &cmd.result, timeout);
> > + nvme_to_user_ptr(cptr->addr), cptr->data_len,
> > + nvme_to_user_ptr(cptr->metadata), cptr->metadata_len,
> > + 0, &cptr->result, timeout, ioucmd);
> >
> > - if (status >= 0) {
> > - if (put_user(cmd.result, &ucmd->result))
> > + if (!ioucmd && status >= 0) {
> > + if (put_user(cptr->result, &ucmd->result))
> > return -EFAULT;
> > }
> >
> > @@ -296,7 +396,7 @@ static int nvme_ctrl_ioctl(struct nvme_ctrl *ctrl,
> > unsigned int cmd,
> > case NVME_IOCTL_ADMIN_CMD:
> > return nvme_user_cmd(ctrl, NULL, argp);
> > case NVME_IOCTL_ADMIN64_CMD:
> > - return nvme_user_cmd64(ctrl, NULL, argp);
> > + return nvme_user_cmd64(ctrl, NULL, argp, NULL);
> > default:
> > return sed_ioctl(ctrl->opal_dev, cmd, argp);
> > }
> > @@ -340,7 +440,7 @@ static int nvme_ns_ioctl(struct nvme_ns *ns,
> > unsigned int cmd,
> > case NVME_IOCTL_SUBMIT_IO:
> > return nvme_submit_io(ns, argp);
> > case NVME_IOCTL_IO64_CMD:
> > - return nvme_user_cmd64(ns->ctrl, ns, argp);
> > + return nvme_user_cmd64(ns->ctrl, ns, argp, NULL);
> > default:
> > return -ENOTTY;
> > }
> > @@ -369,6 +469,33 @@ long nvme_ns_chr_ioctl(struct file *file, unsigned int
> > cmd, unsigned long arg)
> > return __nvme_ioctl(ns, cmd, (void __user *)arg);
> > }
> >
> > +static int nvme_ns_async_ioctl(struct nvme_ns *ns, struct io_uring_cmd *ioucmd)
> > +{
> > + int ret;
> > +
> > + BUILD_BUG_ON(sizeof(struct nvme_uring_cmd_pdu) > sizeof(ioucmd->pdu));
> > +
> > + switch (ioucmd->cmd_op) {
> > + case NVME_IOCTL_IO64_CMD:
> > + ret = nvme_user_cmd64(ns->ctrl, ns, NULL, ioucmd);
> > + break;
> > + default:
> > + ret = -ENOTTY;
> > + }
> > +
> > + if (ret >= 0)
> > + ret = -EIOCBQUEUED;
> > + return ret;
> > +}
>
> ret can equal -EAGAIN, which will cause io_uring to reissue the cmd
> from a worker thread. This can happen when ioucmd->flags has
> IO_URING_F_NONBLOCK set causing nvme_alloc_request() to return
> -EAGAIN when there are no tags available.
>
> Either -EAGAIN needs to be remapped or force set REQ_F_NOWAIT in the
> io_uring cmd request in patch 3 (the 2nd option is untested).
This patch already handles that: it sets REQ_NOWAIT and BLK_MQ_REQ_NOWAIT
when the IO_URING_F_NONBLOCK flag is set.
Here:
+ if (ioucmd && (ioucmd->flags & IO_URING_F_NONBLOCK)) {
+ rq_flags |= REQ_NOWAIT;
+ blk_flags |= BLK_MQ_REQ_NOWAIT;
+ }
+ req = nvme_alloc_request(q, cmd, blk_flags, rq_flags);
And if -EAGAIN reaches io_uring, we don't try to set up the worker;
instead we return the error to userspace for retry.
Here is the relevant fragment from Patch 3:
+ ioucmd->flags |= issue_flags;
+ ret = file->f_op->async_cmd(ioucmd);
+ /* queued async, consumer will call io_uring_cmd_done() when complete */
+ if (ret == -EIOCBQUEUED)
+ return 0;
+ io_uring_cmd_done(ioucmd, ret);
> > +
> > +int nvme_ns_chr_async_cmd(struct io_uring_cmd *ioucmd)
> > +{
> > + struct nvme_ns *ns = container_of(file_inode(ioucmd->file)->i_cdev,
> > + struct nvme_ns, cdev);
> > +
> > + return nvme_ns_async_ioctl(ns, ioucmd);
> > +}
> > +
> > #ifdef CONFIG_NVME_MULTIPATH
> > static int nvme_ns_head_ctrl_ioctl(struct nvme_ns *ns, unsigned int cmd,
> > void __user *argp, struct nvme_ns_head *head, int srcu_idx)
> > @@ -412,6 +539,20 @@ int nvme_ns_head_ioctl(struct block_device *bdev,
> > fmode_t mode,
> > return ret;
> > }
> >
> > +int nvme_ns_head_chr_async_cmd(struct io_uring_cmd *ioucmd)
> > +{
> > + struct cdev *cdev = file_inode(ioucmd->file)->i_cdev;
> > + struct nvme_ns_head *head = container_of(cdev, struct nvme_ns_head, cdev);
> > + int srcu_idx = srcu_read_lock(&head->srcu);
> > + struct nvme_ns *ns = nvme_find_path(head);
> > + int ret = -EWOULDBLOCK;
>
> -EWOULDBLOCK has the same value as -EAGAIN so the same issue
> Is here as with nvme_ns_async_ioctl() returning it.
Same as above.
> > +
> > + if (ns)
> > + ret = nvme_ns_async_ioctl(ns, ioucmd);
> > + srcu_read_unlock(&head->srcu, srcu_idx);
> > + return ret;
> > +}
> > +
> > long nvme_ns_head_chr_ioctl(struct file *file, unsigned int cmd,
> > unsigned long arg)
> > {
> > @@ -480,7 +621,7 @@ long nvme_dev_ioctl(struct file *file, unsigned int
> > cmd,
> > case NVME_IOCTL_ADMIN_CMD:
> > return nvme_user_cmd(ctrl, NULL, argp);
> > case NVME_IOCTL_ADMIN64_CMD:
> > - return nvme_user_cmd64(ctrl, NULL, argp);
> > + return nvme_user_cmd64(ctrl, NULL, argp, NULL);
> > case NVME_IOCTL_IO_CMD:
> > return nvme_dev_user_cmd(ctrl, argp);
> > case NVME_IOCTL_RESET:
> > diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> > index f8bf6606eb2f..1d798d09456f 100644
> > --- a/drivers/nvme/host/multipath.c
> > +++ b/drivers/nvme/host/multipath.c
> > @@ -459,6 +459,7 @@ static const struct file_operations
> > nvme_ns_head_chr_fops = {
> > .release = nvme_ns_head_chr_release,
> > .unlocked_ioctl = nvme_ns_head_chr_ioctl,
> > .compat_ioctl = compat_ptr_ioctl,
> > + .async_cmd = nvme_ns_head_chr_async_cmd,
> > };
> >
> > static int nvme_add_ns_head_cdev(struct nvme_ns_head *head)
> > diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> > index b32f4e2c68fd..e6a30543d7c8 100644
> > --- a/drivers/nvme/host/nvme.h
> > +++ b/drivers/nvme/host/nvme.h
> > @@ -16,6 +16,7 @@
> > #include <linux/rcupdate.h>
> > #include <linux/wait.h>
> > #include <linux/t10-pi.h>
> > +#include <linux/io_uring.h>
> >
> > #include <trace/events/block.h>
> >
> > @@ -752,6 +753,8 @@ long nvme_ns_head_chr_ioctl(struct file *file,
> > unsigned int cmd,
> > unsigned long arg);
> > long nvme_dev_ioctl(struct file *file, unsigned int cmd,
> > unsigned long arg);
> > +int nvme_ns_chr_async_cmd(struct io_uring_cmd *ioucmd);
> > +int nvme_ns_head_chr_async_cmd(struct io_uring_cmd *ioucmd);
> > int nvme_getgeo(struct block_device *bdev, struct hd_geometry *geo);
> >
> > extern const struct attribute_group *nvme_ns_id_attr_groups[];
> > --
> > 2.25.1
>
> On 5.10 with our version of this patch, I've seen that returning -EAGAIN to
> io_uring results in poisoned bios and crashed kernel threads (NULL current->mm)
> while constructing the async pass through request. I looked at
> git://git.kernel.dk/linux-block and git://git.infradead.org/nvme.git
> and as best as I can tell, the same thing will happen.
As the snippet above shows, any error except -EIOCBQUEUED is simply
posted in the CQE during submission itself. So I do not see why -EAGAIN
should cause trouble, at least in this patchset.
FWIW, I tested by forcefully returning -EAGAIN from nvme, and also the
tag-saturation case (which also returns -EAGAIN), and did not see that
sort of issue.
Please take this series for a spin.
Kernel: for-next in linux-block, on top of commit 9e9d83faa
("io_uring: Remove unneeded test in io_run_task_work_sig")
* Re: [PATCH 09/17] io_uring: plug for async bypass
2022-03-08 15:20 ` [PATCH 09/17] io_uring: plug for async bypass Kanchan Joshi
@ 2022-03-10 8:33 ` Christoph Hellwig
2022-03-14 14:33 ` Ming Lei
2022-03-11 17:15 ` Luis Chamberlain
1 sibling, 1 reply; 122+ messages in thread
From: Christoph Hellwig @ 2022-03-10 8:33 UTC (permalink / raw)
To: Kanchan Joshi
Cc: axboe, hch, kbusch, asml.silence, io-uring, linux-nvme,
linux-block, sbates, logang, pankydev8, javier, mcgrof,
a.manzanares, joshiiitr, anuj20.g
On Tue, Mar 08, 2022 at 08:50:57PM +0530, Kanchan Joshi wrote:
> From: Jens Axboe <[email protected]>
>
> Enable .plug for uring-cmd.
This should go into the patch adding the
IORING_OP_URING_CMD/IORING_OP_URING_CMD_FIXED.
* Re: [PATCH 10/17] block: wire-up support for plugging
2022-03-08 15:20 ` [PATCH 10/17] block: wire-up support for plugging Kanchan Joshi
@ 2022-03-10 8:34 ` Christoph Hellwig
2022-03-10 12:40 ` Kanchan Joshi
0 siblings, 1 reply; 122+ messages in thread
From: Christoph Hellwig @ 2022-03-10 8:34 UTC (permalink / raw)
To: Kanchan Joshi
Cc: axboe, hch, kbusch, asml.silence, io-uring, linux-nvme,
linux-block, sbates, logang, pankydev8, javier, mcgrof,
a.manzanares, joshiiitr, anuj20.g
On Tue, Mar 08, 2022 at 08:50:58PM +0530, Kanchan Joshi wrote:
> From: Jens Axboe <[email protected]>
>
> Add support to use plugging if it is enabled, else use default path.
The subject and this comment don't really explain what is done, and
also don't mention at all why it is done.
* Re: [PATCH 11/17] block: factor out helper for bio allocation from cache
2022-03-08 15:20 ` [PATCH 11/17] block: factor out helper for bio allocation from cache Kanchan Joshi
@ 2022-03-10 8:35 ` Christoph Hellwig
2022-03-10 12:25 ` Kanchan Joshi
0 siblings, 1 reply; 122+ messages in thread
From: Christoph Hellwig @ 2022-03-10 8:35 UTC (permalink / raw)
To: Kanchan Joshi
Cc: axboe, hch, kbusch, asml.silence, io-uring, linux-nvme,
linux-block, sbates, logang, pankydev8, javier, mcgrof,
a.manzanares, joshiiitr, anuj20.g
On Tue, Mar 08, 2022 at 08:50:59PM +0530, Kanchan Joshi wrote:
> +struct bio *bio_alloc_kiocb(struct kiocb *kiocb, unsigned short nr_vecs,
> + struct bio_set *bs)
> +{
> + if (!(kiocb->ki_flags & IOCB_ALLOC_CACHE))
> + return bio_alloc_bioset(GFP_KERNEL, nr_vecs, bs);
> +
> + return bio_from_cache(nr_vecs, bs);
> +}
> EXPORT_SYMBOL_GPL(bio_alloc_kiocb);
If we go down this route we might want to just kill the bio_alloc_kiocb
wrapper.
* Re: [PATCH 17/17] nvme: enable non-inline passthru commands
2022-03-08 15:21 ` [PATCH 17/17] nvme: enable non-inline passthru commands Kanchan Joshi
@ 2022-03-10 8:36 ` Christoph Hellwig
2022-03-10 11:50 ` Kanchan Joshi
2022-03-24 21:09 ` Clay Mayers
1 sibling, 1 reply; 122+ messages in thread
From: Christoph Hellwig @ 2022-03-10 8:36 UTC (permalink / raw)
To: Kanchan Joshi
Cc: axboe, hch, kbusch, asml.silence, io-uring, linux-nvme,
linux-block, sbates, logang, pankydev8, javier, mcgrof,
a.manzanares, joshiiitr, anuj20.g
On Tue, Mar 08, 2022 at 08:51:05PM +0530, Kanchan Joshi wrote:
> From: Anuj Gupta <[email protected]>
>
> On submission,just fetch the commmand from userspace pointer and reuse
> everything else. On completion, update the result field inside the
> passthru command.
What is that supposed to mean? What is the reason to do it. Remember
to always document the why in commit logs.
>
> +static inline bool is_inline_rw(struct io_uring_cmd *ioucmd, struct nvme_command *cmd)
Overly long line.
* Re: [PATCH 00/17] io_uring passthru over nvme
2022-03-10 8:29 ` [PATCH 00/17] io_uring passthru over nvme Christoph Hellwig
@ 2022-03-10 10:05 ` Kanchan Joshi
2022-03-11 16:43 ` Luis Chamberlain
0 siblings, 1 reply; 122+ messages in thread
From: Kanchan Joshi @ 2022-03-10 10:05 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Kanchan Joshi, Jens Axboe, Keith Busch, Pavel Begunkov, io-uring,
linux-nvme, linux-block, sbates, logang, Pankaj Raghav,
Javier González, Luis Chamberlain, Adam Manzanares,
Anuj Gupta
On Thu, Mar 10, 2022 at 1:59 PM Christoph Hellwig <[email protected]> wrote:
>
> What branch is this against?
Sorry, I missed that in the cover letter.
Two options:
(a) https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-big-sqe
first patch ("128 byte sqe support") is already there.
(b) for-next (linux-block), series will fit on top of commit 9e9d83faa
("io_uring: Remove unneeded test in io_run_task_work_sig")
> Do you have a git tree available?
Not at the moment.
@Jens: Please see if it is possible to move patches to your
io_uring-big-sqe branch (and maybe rename that to big-sqe-pt.v1).
Thanks.
* Re: [PATCH 17/17] nvme: enable non-inline passthru commands
2022-03-10 8:36 ` Christoph Hellwig
@ 2022-03-10 11:50 ` Kanchan Joshi
2022-03-10 14:19 ` Christoph Hellwig
0 siblings, 1 reply; 122+ messages in thread
From: Kanchan Joshi @ 2022-03-10 11:50 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Kanchan Joshi, Jens Axboe, Keith Busch, Pavel Begunkov, io-uring,
linux-nvme, linux-block, sbates, logang, Pankaj Raghav,
Javier González, Luis Chamberlain, Adam Manzanares,
Anuj Gupta
On Thu, Mar 10, 2022 at 2:06 PM Christoph Hellwig <[email protected]> wrote:
>
> On Tue, Mar 08, 2022 at 08:51:05PM +0530, Kanchan Joshi wrote:
> > From: Anuj Gupta <[email protected]>
> >
> > On submission,just fetch the commmand from userspace pointer and reuse
> > everything else. On completion, update the result field inside the
> > passthru command.
>
> What is that supposed to mean? What is the reason to do it. Remember
> to always document the why in commit logs.
I covered some of it in patch 6, but yes, I need to expand the
reasoning here. I felt that in retrospect as well.
So there are two ways/modes of submitting commands:
Mode 1: inline into the SQE. This is the default way: the passthru
command is placed inside a big SQE, which has 80 bytes of space.
The only problem is that the passthru command has a 'result' field
(structure below for quick reference) which is embedded by value and
not a pointer (unlike the addr and metadata fields).
struct nvme_passthru_cmd64 {
__u8 opcode;
__u8 flags;
__u16 rsvd1;
__u32 nsid;
__u32 cdw2;
__u32 cdw3;
__u64 metadata;
__u64 addr;
__u32 metadata_len;
__u32 data_len;
__u32 cdw10;
__u32 cdw11;
__u32 cdw12;
__u32 cdw13;
__u32 cdw14;
__u32 cdw15;
__u32 timeout_ms;
__u32 rsvd2;
__u64 result;
};
For the sync ioctl, we always update this result field by doing
put_user on completion.
For the async ioctl, since the command lives inside the SQE, its
lifetime is only up to submission. The SQE may get reused post
submission, leaving no way to update the "result" field on completion.
Had this field been a pointer, we could have saved it on submission and
updated it on completion. But that would require redesigning this
structure and adding a new ioctl in nvme.
Coming back: even though the sync ioctl always updates this result to
user-space, only a few nvme I/O commands (e.g. zone-append, copy,
zone-mgmt-send) can return this additional result (spec-wise).
Therefore in nvme, when we are dealing with inline-SQE commands from
io_uring, we never attempt to update the result. And since we don't
update the result, we limit support to read/write passthru commands
only, and fail any other command during submission itself (Patch 2).
Mode 2: non-inline/indirect submission (a pointer to the command goes
into the SQE).
User-space places a pointer to the passthru command in the SQE, along
with a flag telling io_uring that the command is not inline.
For this, in nvme (this patch) we always update the 'result' on
completion and therefore can support all passthru commands.
Hope this makes the reasoning clear?
Plumbing-wise, non-inline support does not create churn (almost all the
infra of inline-command handling is reused). The extra cost is a
copy_from_user on submission and a put_user on completion.
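To make the two modes concrete, here is a rough user-space sketch (not
the final ABI; the SQE field names used for the command payload, i.e.
cmd, cmd_op and cmd_len, are assumptions based on this series):

static void prep_pt_sqe(struct io_uring_sqe *sqe, int fd,
			struct nvme_passthru_cmd64 *pt, bool inline_cmd)
{
	sqe->opcode = IORING_OP_URING_CMD;
	sqe->fd = fd;				/* /dev/ngXnY char device */
	sqe->cmd_op = NVME_IOCTL_IO64_CMD;

	if (inline_cmd) {
		/* Mode 1: ring created with 128-byte SQEs; the 80-byte
		 * command is copied into the extra SQE space. */
		memcpy(sqe->cmd, pt, sizeof(*pt));
	} else {
		/* Mode 2: only a pointer goes into the SQE; *pt must
		 * stay valid until the completion arrives. */
		sqe->addr = (__u64)(uintptr_t)pt;
		sqe->cmd_len = sizeof(*pt);
	}
}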
> > +static inline bool is_inline_rw(struct io_uring_cmd *ioucmd, struct nvme_command *cmd)
>
> Overly long line.
Under 100, but sure, can fold it under 80.
--
Kanchan
* Re: [PATCH 11/17] block: factor out helper for bio allocation from cache
2022-03-10 8:35 ` Christoph Hellwig
@ 2022-03-10 12:25 ` Kanchan Joshi
2022-03-24 6:30 ` Christoph Hellwig
0 siblings, 1 reply; 122+ messages in thread
From: Kanchan Joshi @ 2022-03-10 12:25 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Kanchan Joshi, Jens Axboe, Keith Busch, Pavel Begunkov, io-uring,
linux-nvme, linux-block, sbates, logang, Pankaj Raghav,
Javier González, Luis Chamberlain, Adam Manzanares,
Anuj Gupta
On Thu, Mar 10, 2022 at 2:05 PM Christoph Hellwig <[email protected]> wrote:
>
> On Tue, Mar 08, 2022 at 08:50:59PM +0530, Kanchan Joshi wrote:
> > +struct bio *bio_alloc_kiocb(struct kiocb *kiocb, unsigned short nr_vecs,
> > + struct bio_set *bs)
> > +{
> > + if (!(kiocb->ki_flags & IOCB_ALLOC_CACHE))
> > + return bio_alloc_bioset(GFP_KERNEL, nr_vecs, bs);
> > +
> > + return bio_from_cache(nr_vecs, bs);
> > +}
> > EXPORT_SYMBOL_GPL(bio_alloc_kiocb);
>
> If we go down this route we might want to just kill the bio_alloc_kiocb
> wrapper.
Fine, will kill that in v2.
* Re: [PATCH 10/17] block: wire-up support for plugging
2022-03-10 8:34 ` Christoph Hellwig
@ 2022-03-10 12:40 ` Kanchan Joshi
2022-03-14 14:40 ` Ming Lei
0 siblings, 1 reply; 122+ messages in thread
From: Kanchan Joshi @ 2022-03-10 12:40 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Kanchan Joshi, Jens Axboe, Keith Busch, Pavel Begunkov, io-uring,
linux-nvme, linux-block, sbates, logang, Pankaj Raghav,
Javier González, Luis Chamberlain, Adam Manzanares,
Anuj Gupta
On Thu, Mar 10, 2022 at 2:04 PM Christoph Hellwig <[email protected]> wrote:
>
> On Tue, Mar 08, 2022 at 08:50:58PM +0530, Kanchan Joshi wrote:
> > From: Jens Axboe <[email protected]>
> >
> > Add support to use plugging if it is enabled, else use default path.
>
> The subject and this comment don't really explain what is done, and
> also don't mention at all why it is done.
Missed that, will fix it up. But plugging gave a very good boost to
IOPS, especially when comparing this with io_uring's block-I/O path,
which keeps .plug enabled. Patch 9 (which enables .plug for uring-cmd)
and this one go hand in hand.
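(For reference, "plugging" here is just the usual block-layer plug
around the submission batch, roughly:

	struct blk_plug plug;

	blk_start_plug(&plug);
	/* submit the whole batch of passthru requests */
	blk_finish_plug(&plug);

so that requests are accumulated and handed to the driver as a batch
rather than dispatched one by one.)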
* Re: [PATCH 17/17] nvme: enable non-inline passthru commands
2022-03-10 11:50 ` Kanchan Joshi
@ 2022-03-10 14:19 ` Christoph Hellwig
2022-03-10 18:43 ` Kanchan Joshi
0 siblings, 1 reply; 122+ messages in thread
From: Christoph Hellwig @ 2022-03-10 14:19 UTC (permalink / raw)
To: Kanchan Joshi
Cc: Christoph Hellwig, Kanchan Joshi, Jens Axboe, Keith Busch,
Pavel Begunkov, io-uring, linux-nvme, linux-block, sbates, logang,
Pankaj Raghav, Javier González, Luis Chamberlain,
Adam Manzanares, Anuj Gupta
On Thu, Mar 10, 2022 at 05:20:13PM +0530, Kanchan Joshi wrote:
> In sync ioctl, we always update this result field by doing put_user on
> completion.
> For async ioctl, since command is inside the the sqe, its lifetime is
> only upto submission. SQE may get reused post submission, leaving no
> way to update the "result" field on completion. Had this field been a
> pointer, we could have saved this on submission and updated on
> completion. But that would require redesigning this structure and
> adding newer ioctl in nvme.
Why would it required adding an ioctl to nvme? The whole io_uring
async_cmd infrastructure is completely independent from ioctls.
> Coming back, even though sync-ioctl alway updates this result to
> user-space, only a few nvme io commands (e.g. zone-append, copy,
> zone-mgmt-send) can return this additional result (spec-wise).
> Therefore in nvme, when we are dealing with inline-sqe commands from
> io_uring, we never attempt to update the result. And since we don't
> update the result, we limit support to only read/write passthru
> commands. And fail any other command during submission itself (Patch
> 2).
Yikes. That is outright horrible. Passthrough needs to be
command-agnostic and future-proof to any newly added nvme command.
> > Overly long line.
>
> Under 100, but sure, can fold it under 80.
You can only use 100 sparingly if it makes the code more readable. Which
I know is fuzzy, and in practice never does. Certainly not in nvme and
block code.
* Re: [PATCH 17/17] nvme: enable non-inline passthru commands
2022-03-10 14:19 ` Christoph Hellwig
@ 2022-03-10 18:43 ` Kanchan Joshi
2022-03-11 6:27 ` Christoph Hellwig
0 siblings, 1 reply; 122+ messages in thread
From: Kanchan Joshi @ 2022-03-10 18:43 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Kanchan Joshi, Jens Axboe, Keith Busch, Pavel Begunkov, io-uring,
linux-nvme, linux-block, sbates, logang, Pankaj Raghav,
Javier González, Luis Chamberlain, Adam Manzanares,
Anuj Gupta
On Thu, Mar 10, 2022 at 7:49 PM Christoph Hellwig <[email protected]> wrote:
>
> On Thu, Mar 10, 2022 at 05:20:13PM +0530, Kanchan Joshi wrote:
> > In sync ioctl, we always update this result field by doing put_user on
> > completion.
> > For async ioctl, since command is inside the the sqe, its lifetime is
> > only upto submission. SQE may get reused post submission, leaving no
> > way to update the "result" field on completion. Had this field been a
> > pointer, we could have saved this on submission and updated on
> > completion. But that would require redesigning this structure and
> > adding newer ioctl in nvme.
>
> Why would it required adding an ioctl to nvme? The whole io_uring
> async_cmd infrastructure is completely independent from ioctls.
io_uring certainly is not peeking into the ioctl and its command
structure; it only offers the facility to store that ioctl command
inline in its SQE.
The problem is that the inline facility does not go well with this
particular nvme-passthru ioctl (NVME_IOCTL_IO64_CMD).
That's because this ioctl requires the additional "__u64 result;"
inside "struct nvme_passthru_cmd64" to be updated on completion.
To update that during completion, we need, at the very least, the
result field to be a pointer, "__u64 result_ptr", inside struct
nvme_passthru_cmd64.
Do you see a way to do that without adding a new passthru ioctl in nvme?
> > Coming back, even though sync-ioctl alway updates this result to
> > user-space, only a few nvme io commands (e.g. zone-append, copy,
> > zone-mgmt-send) can return this additional result (spec-wise).
> > Therefore in nvme, when we are dealing with inline-sqe commands from
> > io_uring, we never attempt to update the result. And since we don't
> > update the result, we limit support to only read/write passthru
> > commands. And fail any other command during submission itself (Patch
> > 2).
>
> Yikes. That is outright horrible. passthrough needs to be command
> agnostic and future proof to any newly added nvme command.
This patch (along with patch 16) does exactly that: it makes
passthrough command-agnostic and future-proof. All nvme commands will
work with it.
The only difference is that the application needs to pass a pointer to
the ioctl command rather than placing it inline inside the SQE.
Overall, I think at the io_uring infra level both submission modes make
sense: big-SQE based inline submission (more efficient for commands
<= 80 bytes) and normal-SQE based non-inline/indirect submission.
At the nvme level, we have to pick one (depending on the ioctl in
hand). Currently we are playing with both, constructing a fast path
(for all commands) and an even faster path (only for read/write
commands).
Should we (at the nvme level) instead opt for only the indirect mode
(because it works for all commands), or should we build a way to enable
the inline mode for all commands?
> > > Overly long line.
> >
> > Under 100, but sure, can fold it under 80.
>
> You can only use 100 sparingly if it makes the code more readable. Which
> I know is fuzzy, and in practice never does. Certainly not in nvme and
> block code.
Clears up, thanks.
--
Kanchan
* Re: [PATCH 03/17] io_uring: add infra and support for IORING_OP_URING_CMD
2022-03-08 15:20 ` [PATCH 03/17] io_uring: add infra and support for IORING_OP_URING_CMD Kanchan Joshi
@ 2022-03-11 1:51 ` Luis Chamberlain
2022-03-11 2:43 ` Jens Axboe
0 siblings, 1 reply; 122+ messages in thread
From: Luis Chamberlain @ 2022-03-11 1:51 UTC (permalink / raw)
To: Kanchan Joshi, jmorris, serge, ast, daniel, andrii, kafai,
songliubraving, yhs, john.fastabend, kpsingh,
linux-security-module
Cc: axboe, hch, kbusch, asml.silence, io-uring, linux-nvme,
linux-block, sbates, logang, pankydev8, javier, a.manzanares,
joshiiitr, anuj20.g
On Tue, Mar 08, 2022 at 08:50:51PM +0530, Kanchan Joshi wrote:
> From: Jens Axboe <[email protected]>
>
> This is a file private kind of request. io_uring doesn't know what's
> in this command type, it's for the file_operations->async_cmd()
> handler to deal with.
>
> Signed-off-by: Jens Axboe <[email protected]>
> Signed-off-by: Kanchan Joshi <[email protected]>
> ---
<-- snip -->
> +static int io_uring_cmd(struct io_kiocb *req, unsigned int issue_flags)
> +{
> + struct file *file = req->file;
> + int ret;
> + struct io_uring_cmd *ioucmd = &req->uring_cmd;
> +
> + ioucmd->flags |= issue_flags;
> + ret = file->f_op->async_cmd(ioucmd);
I think we're going to have to add a security_file_async_cmd() check
before this call here. Because otherwise we're enabling to, for
example, bypass security_file_ioctl() for example using the new
iouring-cmd interface.
Or is this already thought out with the existing security_uring_*() stuff?
Luis
* Re: [PATCH 03/17] io_uring: add infra and support for IORING_OP_URING_CMD
2022-03-11 1:51 ` Luis Chamberlain
@ 2022-03-11 2:43 ` Jens Axboe
2022-03-11 17:11 ` Luis Chamberlain
0 siblings, 1 reply; 122+ messages in thread
From: Jens Axboe @ 2022-03-11 2:43 UTC (permalink / raw)
To: Luis Chamberlain, Kanchan Joshi, jmorris, serge, ast, daniel,
andrii, kafai, songliubraving, yhs, john.fastabend, kpsingh,
linux-security-module
Cc: hch, kbusch, asml.silence, io-uring, linux-nvme, linux-block,
sbates, logang, pankydev8, javier, a.manzanares, joshiiitr,
anuj20.g
On 3/10/22 6:51 PM, Luis Chamberlain wrote:
> On Tue, Mar 08, 2022 at 08:50:51PM +0530, Kanchan Joshi wrote:
>> From: Jens Axboe <[email protected]>
>>
>> This is a file private kind of request. io_uring doesn't know what's
>> in this command type, it's for the file_operations->async_cmd()
>> handler to deal with.
>>
>> Signed-off-by: Jens Axboe <[email protected]>
>> Signed-off-by: Kanchan Joshi <[email protected]>
>> ---
>
> <-- snip -->
>
>> +static int io_uring_cmd(struct io_kiocb *req, unsigned int issue_flags)
>> +{
>> + struct file *file = req->file;
>> + int ret;
>> + struct io_uring_cmd *ioucmd = &req->uring_cmd;
>> +
>> + ioucmd->flags |= issue_flags;
>> + ret = file->f_op->async_cmd(ioucmd);
>
> I think we're going to have to add a security_file_async_cmd() check
> before this call here. Because otherwise we're enabling to, for
> example, bypass security_file_ioctl() for example using the new
> iouring-cmd interface.
>
> Or is this already thought out with the existing security_uring_*() stuff?
Unless the request sets .audit_skip, it'll be included already in terms
of logging. But I'd prefer not to lodge this in with ioctls, unless
we're going to be doing actual ioctls.
But definitely something to keep in mind and make sure that we're under
the right umbrella in terms of auditing and security.
--
Jens Axboe
* Re: [PATCH 17/17] nvme: enable non-inline passthru commands
2022-03-10 18:43 ` Kanchan Joshi
@ 2022-03-11 6:27 ` Christoph Hellwig
2022-03-22 17:10 ` Kanchan Joshi
0 siblings, 1 reply; 122+ messages in thread
From: Christoph Hellwig @ 2022-03-11 6:27 UTC (permalink / raw)
To: Kanchan Joshi
Cc: Christoph Hellwig, Kanchan Joshi, Jens Axboe, Keith Busch,
Pavel Begunkov, io-uring, linux-nvme, linux-block, sbates, logang,
Pankaj Raghav, Javier González, Luis Chamberlain,
Adam Manzanares, Anuj Gupta
On Fri, Mar 11, 2022 at 12:13:24AM +0530, Kanchan Joshi wrote:
> Problem is, the inline facility does not go very well with this
> particular nvme-passthru ioctl (NVME_IOCTL_IO64_CMD).
And it doesn't have to, because there is absolutely no need to reuse
the existing structures! Quite to the contrary, trying to reuse the
structure and opcode makes things confusing as hell.
> And that's because this ioctl requires additional "__u64 result;" to
> be updated within "struct nvme_passthru_cmd64".
> To update that during completion, we need, at the least, the result
> field to be a pointer "__u64 result_ptr" inside the struct
> nvme_passthru_cmd64.
> Do you see that is possible without adding a new passthru ioctl in nvme?
We don't need a new passthrough ioctl in nvme. We need to decouple the
uring cmd properly. And properly in this case means not to add a
result pointer, but to drop the result from the _input_ structure
entirely, and instead optionally support a larger CQ entry that contains
it, just like the first patch does for the SQ.
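Roughly like this (a sketch only; the doubled-CQE layout and the
big_cqe field name are assumptions, mirroring what the 128-byte SQE
patch does on the submission side):

	struct io_uring_cqe {
		__u64	user_data;
		__s32	res;
		__u32	flags;
		/* only valid when the ring is set up with big CQEs;
		 * room for e.g. the nvme passthru result */
		__u64	big_cqe[2];
	};

The driver could then hand the 64-bit result back at completion time
(e.g. through an extended io_uring_cmd_done()) and io_uring would place
it in the extra CQE space, so nothing has to be written back into the
user-provided command structure.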
* Re: [PATCH 04/17] nvme: modify nvme_alloc_request to take an additional parameter
2022-03-08 15:20 ` [PATCH 04/17] nvme: modify nvme_alloc_request to take an additional parameter Kanchan Joshi
@ 2022-03-11 6:38 ` Christoph Hellwig
0 siblings, 0 replies; 122+ messages in thread
From: Christoph Hellwig @ 2022-03-11 6:38 UTC (permalink / raw)
To: Kanchan Joshi
Cc: axboe, hch, kbusch, asml.silence, io-uring, linux-nvme,
linux-block, sbates, logang, pankydev8, javier, mcgrof,
a.manzanares, joshiiitr, anuj20.g
On Tue, Mar 08, 2022 at 08:50:52PM +0530, Kanchan Joshi wrote:
> From: Keith Busch <[email protected]>
>
> This is a prep patch. It modifies nvme_alloc_request to take an
> additional parameter, allowing request flags to be passed.
I don't think we need more parameters to nvme_alloc_request.
In fact I think we're probably better off removing nvme_alloc_request
as a prep cleanup, given that it is just two function calls anyway.
* Re: [PATCH 08/17] nvme: enable passthrough with fixed-buffer
2022-03-08 15:20 ` [PATCH 08/17] nvme: enable passthrough " Kanchan Joshi
2022-03-10 8:32 ` Christoph Hellwig
@ 2022-03-11 6:43 ` Christoph Hellwig
2022-03-14 13:06 ` Kanchan Joshi
2022-03-14 12:18 ` Ming Lei
2 siblings, 1 reply; 122+ messages in thread
From: Christoph Hellwig @ 2022-03-11 6:43 UTC (permalink / raw)
To: Kanchan Joshi
Cc: axboe, hch, kbusch, asml.silence, io-uring, linux-nvme,
linux-block, sbates, logang, pankydev8, javier, mcgrof,
a.manzanares, joshiiitr, anuj20.g
> +int blk_rq_map_user_fixedb(struct request_queue *q, struct request *rq,
> + u64 ubuf, unsigned long len, gfp_t gfp_mask,
> + struct io_uring_cmd *ioucmd)
Looking at this a bit more, I don't think this is a good interface or
works at all for that matter.
> +{
> + struct iov_iter iter;
> + size_t iter_count, nr_segs;
> + struct bio *bio;
> + int ret;
> +
> + /*
> + * Talk to io_uring to obtain BVEC iterator for the buffer.
> + * And use that iterator to form bio/request.
> + */
> + ret = io_uring_cmd_import_fixed(ubuf, len, rq_data_dir(rq), &iter,
> + ioucmd);
Instead of pulling the io-uring dependency into blk-map.c we could just
pass the iter to a helper function and have that as the block layer
abstraction if we really want one. But:
> + if (unlikely(ret < 0))
> + return ret;
> + iter_count = iov_iter_count(&iter);
> + nr_segs = iter.nr_segs;
> +
> + if (!iter_count || (iter_count >> 9) > queue_max_hw_sectors(q))
> + return -EINVAL;
> + if (nr_segs > queue_max_segments(q))
> + return -EINVAL;
> + /* no iovecs to alloc, as we already have a BVEC iterator */
> + bio = bio_alloc(gfp_mask, 0);
> + if (!bio)
> + return -ENOMEM;
> +
> + ret = bio_iov_iter_get_pages(bio, &iter);
I can't see how this works at all. Block drivers have a lot more
requirements than just total size and number of segments. Very typical
is a limit on the size of each segment, and for nvme we also have the
weird virtual boundary for the PRPs. None of that is being checked here.
You really need to use bio_add_pc_page or open code the equivalent checks
for passthrough I/O.
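I.e. something along these lines instead of bio_iov_iter_get_pages()
(a sketch only; it assumes the bvec iterator starts at offset 0 and
leaves out error unwinding):

	const struct bio_vec *bv = iter.bvec;
	unsigned long i;

	for (i = 0; i < iter.nr_segs; i++) {
		/*
		 * bio_add_pc_page() honours the queue's max sectors,
		 * max segments and virt boundary limits, which
		 * bio_iov_iter_get_pages() does not.
		 */
		if (bio_add_pc_page(q, bio, bv[i].bv_page, bv[i].bv_len,
				    bv[i].bv_offset) != bv[i].bv_len)
			return -EINVAL;
	}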
> + if (likely(nvme_is_fixedb_passthru(ioucmd)))
> + ret = blk_rq_map_user_fixedb(q, req, ubuffer, bufflen,
> + GFP_KERNEL, ioucmd);
And I'm also really worried about only supporting fixed buffers. Fixed
buffers are a really nice benchmarketing feature, but without supporting
arbitrary buffers this is rather useless in real life.
* Re: [PATCH 12/17] nvme: enable bio-cache for fixed-buffer passthru
2022-03-08 15:21 ` [PATCH 12/17] nvme: enable bio-cache for fixed-buffer passthru Kanchan Joshi
@ 2022-03-11 6:48 ` Christoph Hellwig
2022-03-14 18:18 ` Kanchan Joshi
0 siblings, 1 reply; 122+ messages in thread
From: Christoph Hellwig @ 2022-03-11 6:48 UTC (permalink / raw)
To: Kanchan Joshi
Cc: axboe, hch, kbusch, asml.silence, io-uring, linux-nvme,
linux-block, sbates, logang, pankydev8, javier, mcgrof,
a.manzanares, joshiiitr, anuj20.g
On Tue, Mar 08, 2022 at 08:51:00PM +0530, Kanchan Joshi wrote:
> Since we do submission/completion in task, we can have this up.
> Add a bio-set for nvme as we need that for bio-cache.
Well, passthrough I/O should just use kmalloced bios anyway, as there
is no need for the mempool to start with. Take a look at the existing
code in blk-map.c.
* Re: [PATCH 13/17] nvme: allow user passthrough commands to poll
2022-03-09 7:03 ` Kanchan Joshi
@ 2022-03-11 6:49 ` Christoph Hellwig
0 siblings, 0 replies; 122+ messages in thread
From: Christoph Hellwig @ 2022-03-11 6:49 UTC (permalink / raw)
To: Kanchan Joshi
Cc: Keith Busch, Kanchan Joshi, Jens Axboe, Christoph Hellwig,
Pavel Begunkov, io-uring, linux-nvme, linux-block, sbates, logang,
Pankaj Raghav, Javier González, Luis Chamberlain,
Adam Manzanares, Anuj Gupta
On Wed, Mar 09, 2022 at 12:33:33PM +0530, Kanchan Joshi wrote:
> Would it be better if we don't try to pass NVME_HIPRI by any means
> (flags or rsvd1/rsvd2), and that means not enabling sync-polling and
> killing this patch.
> We have another flag "IO_URING_F_UCMD_POLLED" in ioucmd->flags, and we
> can use that instead to enable only the async polling. What do you
> think?
Yes, polling should be a io_uring level feature.
* Re: [PATCH 14/17] io_uring: add polling support for uring-cmd
2022-03-08 15:21 ` [PATCH 14/17] io_uring: add polling support for uring-cmd Kanchan Joshi
@ 2022-03-11 6:50 ` Christoph Hellwig
2022-03-14 10:16 ` Kanchan Joshi
0 siblings, 1 reply; 122+ messages in thread
From: Christoph Hellwig @ 2022-03-11 6:50 UTC (permalink / raw)
To: Kanchan Joshi
Cc: axboe, hch, kbusch, asml.silence, io-uring, linux-nvme,
linux-block, sbates, logang, pankydev8, javier, mcgrof,
a.manzanares, joshiiitr, anuj20.g
On Tue, Mar 08, 2022 at 08:51:02PM +0530, Kanchan Joshi wrote:
> + if (req->opcode == IORING_OP_URING_CMD ||
> + req->opcode == IORING_OP_URING_CMD_FIXED) {
> + /* uring_cmd structure does not contain kiocb struct */
> + struct kiocb kiocb_uring_cmd;
> +
> + kiocb_uring_cmd.private = req->uring_cmd.bio;
> + kiocb_uring_cmd.ki_filp = req->uring_cmd.file;
> + ret = req->uring_cmd.file->f_op->iopoll(&kiocb_uring_cmd,
> + &iob, poll_flags);
> + } else {
> + ret = kiocb->ki_filp->f_op->iopoll(kiocb, &iob,
> + poll_flags);
> + }
This is just completely broken. You absolutely do need the iocb
in struct uring_cmd for ->iopoll to work.
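I.e. (a sketch, field sizes are guesses) struct io_uring_cmd would
embed the real iocb instead of io_uring fabricating one on the stack:

	struct io_uring_cmd {
		struct kiocb	kiocb;	/* ki_filp and ->private (the bio)
					 * stay valid for ->iopoll() */
		u32		cmd_op;
		u32		flags;
		u8		pdu[28];	/* driver-private bytes */
	};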
* Re: [PATCH 05/17] nvme: wire-up support for async-passthru on char-device.
2022-03-08 15:20 ` [PATCH 05/17] nvme: wire-up support for async-passthru on char-device Kanchan Joshi
2022-03-10 0:02 ` Clay Mayers
@ 2022-03-11 7:01 ` Christoph Hellwig
2022-03-14 16:23 ` Kanchan Joshi
2022-03-11 17:56 ` Luis Chamberlain
` (2 subsequent siblings)
4 siblings, 1 reply; 122+ messages in thread
From: Christoph Hellwig @ 2022-03-11 7:01 UTC (permalink / raw)
To: Kanchan Joshi
Cc: axboe, hch, kbusch, asml.silence, io-uring, linux-nvme,
linux-block, sbates, logang, pankydev8, javier, mcgrof,
a.manzanares, joshiiitr, anuj20.g
On Tue, Mar 08, 2022 at 08:50:53PM +0530, Kanchan Joshi wrote:
> +/*
> + * This overlays struct io_uring_cmd pdu.
> + * Expect build errors if this grows larger than that.
> + */
> +struct nvme_uring_cmd_pdu {
> + u32 meta_len;
> + union {
> + struct bio *bio;
> + struct request *req;
> + };
> + void *meta; /* kernel-resident buffer */
> + void __user *meta_buffer;
> +} __packed;
Why is this marked __packed?
In general I'd be much more happy if the meta elements were an
io_uring-level feature handled outside the driver and typesafe in
struct io_uring_cmd, with just a normal private data pointer for the
actual user, which would remove all the crazy casting.
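Something along these lines, perhaps (a sketch only, names made up):

	struct io_uring_cmd {
		struct file	*file;
		u32		cmd_op;
		u32		flags;
		/* metadata handled generically by io_uring, not the driver */
		void __user	*meta_buffer;
		void		*meta;
		u32		meta_len;
		/* driver stashes its own state here, no pdu[] casting */
		void		*private;
	};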
> +static void nvme_end_async_pt(struct request *req, blk_status_t err)
> +{
> + struct io_uring_cmd *ioucmd = req->end_io_data;
> + struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd);
> + /* extract bio before reusing the same field for request */
> + struct bio *bio = pdu->bio;
> +
> + pdu->req = req;
> + req->bio = bio;
> + /* this takes care of setting up task-work */
> + io_uring_cmd_complete_in_task(ioucmd, nvme_pt_task_cb);
This is a bit silly. First we defer the actual request I/O completion
from the block layer to a different CPU or softirq, and then we have
another callback here. I think it would be much more useful if we
could find a way to enhance blk_mq_complete_request so that it could
directly complete in a given task. That would also be really nice for,
say, normal synchronous direct I/O.
> + if (ioucmd) { /* async dispatch */
> + if (cmd->common.opcode == nvme_cmd_write ||
> + cmd->common.opcode == nvme_cmd_read) {
No, we can't just check for specific commands in the passthrough handler.
> + nvme_setup_uring_cmd_data(req, ioucmd, meta, meta_buffer,
> + meta_len);
> + blk_execute_rq_nowait(req, 0, nvme_end_async_pt);
> + return 0;
> + } else {
> + /* support only read and write for now. */
> + ret = -EINVAL;
> + goto out_meta;
> + }
Please always handle the error in the first branch and don't bother
with an else after a goto or return.
> +static int nvme_ns_async_ioctl(struct nvme_ns *ns, struct io_uring_cmd *ioucmd)
> +{
> + int ret;
> +
> + BUILD_BUG_ON(sizeof(struct nvme_uring_cmd_pdu) > sizeof(ioucmd->pdu));
> +
> + switch (ioucmd->cmd_op) {
> + case NVME_IOCTL_IO64_CMD:
> + ret = nvme_user_cmd64(ns->ctrl, ns, NULL, ioucmd);
> + break;
> + default:
> + ret = -ENOTTY;
> + }
> +
> + if (ret >= 0)
> + ret = -EIOCBQUEUED;
That's a weird way to handle the returns. Just return -EIOCBQUEUED
directly from the handler (which as said before should be split from
the ioctl handler anyway).
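I.e. roughly (a sketch; nvme_uring_cmd_io() stands for a split-out
async-only handler, the name is made up):

	switch (ioucmd->cmd_op) {
	case NVME_IOCTL_IO64_CMD:
		/* queues the request and returns -EIOCBQUEUED itself */
		return nvme_uring_cmd_io(ns->ctrl, ns, ioucmd);
	default:
		return -ENOTTY;
	}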
* Re: [PATCH 00/17] io_uring passthru over nvme
2022-03-10 10:05 ` Kanchan Joshi
@ 2022-03-11 16:43 ` Luis Chamberlain
2022-03-11 23:35 ` Adam Manzanares
2022-03-13 5:10 ` Kanchan Joshi
0 siblings, 2 replies; 122+ messages in thread
From: Luis Chamberlain @ 2022-03-11 16:43 UTC (permalink / raw)
To: Kanchan Joshi
Cc: Christoph Hellwig, Kanchan Joshi, Jens Axboe, Keith Busch,
Pavel Begunkov, io-uring, linux-nvme, linux-block, sbates, logang,
Pankaj Raghav, Javier González, Adam Manzanares, Anuj Gupta
On Thu, Mar 10, 2022 at 03:35:02PM +0530, Kanchan Joshi wrote:
> On Thu, Mar 10, 2022 at 1:59 PM Christoph Hellwig <[email protected]> wrote:
> >
> > What branch is this against?
> Sorry I missed that in the cover.
> Two options -
> (a) https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-big-sqe
> first patch ("128 byte sqe support") is already there.
> (b) for-next (linux-block), series will fit on top of commit 9e9d83faa
> ("io_uring: Remove unneeded test in io_run_task_work_sig")
>
> > Do you have a git tree available?
> Not at the moment.
>
> @Jens: Please see if it is possible to move patches to your
> io_uring-big-sqe branch (and maybe rename that to big-sqe-pt.v1).
Since Jens might be busy, I've put up a tree with all this stuff:
https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux-next.git/log/?h=20220311-io-uring-cmd
It is based on option (b) mentioned above, I took linux-block for-next
and reset the tree to commit "io_uring: Remove unneeded test in
io_run_task_work_sig" before applying the series.
Luis
* Re: [PATCH 03/17] io_uring: add infra and support for IORING_OP_URING_CMD
2022-03-11 2:43 ` Jens Axboe
@ 2022-03-11 17:11 ` Luis Chamberlain
2022-03-11 18:47 ` Paul Moore
2022-03-14 16:25 ` Casey Schaufler
0 siblings, 2 replies; 122+ messages in thread
From: Luis Chamberlain @ 2022-03-11 17:11 UTC (permalink / raw)
To: Jens Axboe, Paul Moore
Cc: Kanchan Joshi, jmorris, serge, ast, daniel, andrii, kafai,
songliubraving, yhs, john.fastabend, kpsingh,
linux-security-module, hch, kbusch, asml.silence, io-uring,
linux-nvme, linux-block, sbates, logang, pankydev8, javier,
a.manzanares, joshiiitr, anuj20.g
On Thu, Mar 10, 2022 at 07:43:04PM -0700, Jens Axboe wrote:
> On 3/10/22 6:51 PM, Luis Chamberlain wrote:
> > On Tue, Mar 08, 2022 at 08:50:51PM +0530, Kanchan Joshi wrote:
> >> From: Jens Axboe <[email protected]>
> >>
> >> This is a file private kind of request. io_uring doesn't know what's
> >> in this command type, it's for the file_operations->async_cmd()
> >> handler to deal with.
> >>
> >> Signed-off-by: Jens Axboe <[email protected]>
> >> Signed-off-by: Kanchan Joshi <[email protected]>
> >> ---
> >
> > <-- snip -->
> >
> >> +static int io_uring_cmd(struct io_kiocb *req, unsigned int issue_flags)
> >> +{
> >> + struct file *file = req->file;
> >> + int ret;
> >> + struct io_uring_cmd *ioucmd = &req->uring_cmd;
> >> +
> >> + ioucmd->flags |= issue_flags;
> >> + ret = file->f_op->async_cmd(ioucmd);
> >
> > I think we're going to have to add a security_file_async_cmd() check
> > before this call here. Because otherwise we're enabling to, for
> > example, bypass security_file_ioctl() for example using the new
> > iouring-cmd interface.
> >
> > Or is this already thought out with the existing security_uring_*() stuff?
>
> Unless the request sets .audit_skip, it'll be included already in terms
> of logging.
Neat.
> But I'd prefer not to lodge this in with ioctls, unless
> we're going to be doing actual ioctls.
Oh sure, I have been an advocate of ensuring folks don't conflate
async_cmd with ioctl. However, it *can* let subsystems enable ioctl
passthrough, and each of those subsystems needs to vet this on its own
terms. I'd hate to see or hear about LSM surprises later.
> But definitely something to keep in mind and make sure that we're under
> the right umbrella in terms of auditing and security.
Paul, how about something like this for starters (it should probably
be squashed into this series so it's not a separate commit)?
From f3ddbe822374cc1c7002bd795c1ae486d370cbd1 Mon Sep 17 00:00:00 2001
From: Luis Chamberlain <[email protected]>
Date: Fri, 11 Mar 2022 08:55:50 -0800
Subject: [PATCH] lsm,io_uring: add LSM hooks for the new async_cmd file op
io-uring is extending the struct file_operations to allow a new
command which each subsystem can use to enable command passthrough.
Add an LSM hook specific to the command passthrough path, which enables
LSMs to inspect the command details.
Signed-off-by: Luis Chamberlain <[email protected]>
---
fs/io_uring.c | 5 +++++
include/linux/lsm_hook_defs.h | 1 +
include/linux/lsm_hooks.h | 3 +++
include/linux/security.h | 5 +++++
security/security.c | 4 ++++
5 files changed, 18 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 3f6eacc98e31..1c4e6b2cb61a 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -4190,6 +4190,11 @@ static int io_uring_cmd_prep(struct io_kiocb *req,
struct io_ring_ctx *ctx = req->ctx;
struct io_uring_cmd *ioucmd = &req->uring_cmd;
u32 ucmd_flags = READ_ONCE(sqe->uring_cmd_flags);
+ int ret;
+
+ ret = security_uring_async_cmd(ioucmd);
+ if (ret)
+ return ret;
if (!req->file->f_op->async_cmd)
return -EOPNOTSUPP;
diff --git a/include/linux/lsm_hook_defs.h b/include/linux/lsm_hook_defs.h
index 819ec92dc2a8..4a20f8e6b295 100644
--- a/include/linux/lsm_hook_defs.h
+++ b/include/linux/lsm_hook_defs.h
@@ -404,4 +404,5 @@ LSM_HOOK(int, 0, perf_event_write, struct perf_event *event)
#ifdef CONFIG_IO_URING
LSM_HOOK(int, 0, uring_override_creds, const struct cred *new)
LSM_HOOK(int, 0, uring_sqpoll, void)
+LSM_HOOK(int, 0, uring_async_cmd, struct io_uring_cmd *ioucmd)
#endif /* CONFIG_IO_URING */
diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index 3bf5c658bc44..21b18cf138c2 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -1569,6 +1569,9 @@
* Check whether the current task is allowed to spawn a io_uring polling
* thread (IORING_SETUP_SQPOLL).
*
+ * @uring_async_cmd:
+ * Check whether the file_operations async_cmd is allowed to run.
+ *
*/
union security_list_options {
#define LSM_HOOK(RET, DEFAULT, NAME, ...) RET (*NAME)(__VA_ARGS__);
diff --git a/include/linux/security.h b/include/linux/security.h
index 6d72772182c8..4d7f72813d75 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -2041,6 +2041,7 @@ static inline int security_perf_event_write(struct perf_event *event)
#ifdef CONFIG_SECURITY
extern int security_uring_override_creds(const struct cred *new);
extern int security_uring_sqpoll(void);
+extern int security_uring_async_cmd(struct io_uring_cmd *ioucmd);
#else
static inline int security_uring_override_creds(const struct cred *new)
{
@@ -2050,6 +2051,10 @@ static inline int security_uring_sqpoll(void)
{
return 0;
}
+static inline int security_uring_async_cmd(struct io_uring_cmd *ioucmd)
+{
+ return 0;
+}
#endif /* CONFIG_SECURITY */
#endif /* CONFIG_IO_URING */
diff --git a/security/security.c b/security/security.c
index 22261d79f333..ef96be2f953a 100644
--- a/security/security.c
+++ b/security/security.c
@@ -2640,4 +2640,8 @@ int security_uring_sqpoll(void)
{
return call_int_hook(uring_sqpoll, 0);
}
+int security_uring_async_cmd(struct io_uring_cmd *ioucmd)
+{
+ return call_int_hook(uring_async_cmd, 0, ioucmd);
+}
#endif /* CONFIG_IO_URING */
--
2.34.1
* Re: [PATCH 09/17] io_uring: plug for async bypass
2022-03-08 15:20 ` [PATCH 09/17] io_uring: plug for async bypass Kanchan Joshi
2022-03-10 8:33 ` Christoph Hellwig
@ 2022-03-11 17:15 ` Luis Chamberlain
1 sibling, 0 replies; 122+ messages in thread
From: Luis Chamberlain @ 2022-03-11 17:15 UTC (permalink / raw)
To: Kanchan Joshi
Cc: axboe, hch, kbusch, asml.silence, io-uring, linux-nvme,
linux-block, sbates, logang, pankydev8, javier, a.manzanares,
joshiiitr, anuj20.g
On Tue, Mar 08, 2022 at 08:50:57PM +0530, Kanchan Joshi wrote:
> From: Jens Axboe <[email protected]>
>
> Enable .plug for uring-cmd.
It would be wonderful if the commit log explained *why*.
Luis
* Re: [PATCH 05/17] nvme: wire-up support for async-passthru on char-device.
2022-03-08 15:20 ` [PATCH 05/17] nvme: wire-up support for async-passthru on char-device Kanchan Joshi
2022-03-10 0:02 ` Clay Mayers
2022-03-11 7:01 ` Christoph Hellwig
@ 2022-03-11 17:56 ` Luis Chamberlain
2022-03-11 18:53 ` Paul Moore
2022-03-13 21:53 ` Sagi Grimberg
2022-03-22 15:18 ` Clay Mayers
4 siblings, 1 reply; 122+ messages in thread
From: Luis Chamberlain @ 2022-03-11 17:56 UTC (permalink / raw)
To: Kanchan Joshi, Paul Moore, James Morris, Serge E. Hallyn,
Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
KP Singh, linux-security-module
Cc: axboe, hch, kbusch, asml.silence, io-uring, linux-nvme,
linux-block, sbates, logang, pankydev8, javier, a.manzanares,
joshiiitr, anuj20.g
On Tue, Mar 08, 2022 at 08:50:53PM +0530, Kanchan Joshi wrote:
> diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
> index 5c9cd9695519..1df270b47af5 100644
> --- a/drivers/nvme/host/ioctl.c
> +++ b/drivers/nvme/host/ioctl.c
> @@ -369,6 +469,33 @@ long nvme_ns_chr_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
> return __nvme_ioctl(ns, cmd, (void __user *)arg);
> }
>
> +static int nvme_ns_async_ioctl(struct nvme_ns *ns, struct io_uring_cmd *ioucmd)
> +{
> + int ret;
> +
> + BUILD_BUG_ON(sizeof(struct nvme_uring_cmd_pdu) > sizeof(ioucmd->pdu));
> +
> + switch (ioucmd->cmd_op) {
> + case NVME_IOCTL_IO64_CMD:
> + ret = nvme_user_cmd64(ns->ctrl, ns, NULL, ioucmd);
> + break;
> + default:
> + ret = -ENOTTY;
> + }
> +
> + if (ret >= 0)
> + ret = -EIOCBQUEUED;
> + return ret;
> +}
And here I think we'll need something like this:
diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index ddb7e5864be6..83529adf130d 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -5,6 +5,7 @@
*/
#include <linux/ptrace.h> /* for force_successful_syscall_return */
#include <linux/nvme_ioctl.h>
+#include <linux/security.h>
#include "nvme.h"
/*
@@ -524,6 +525,11 @@ static int nvme_ns_async_ioctl(struct nvme_ns *ns, struct io_uring_cmd *ioucmd)
BUILD_BUG_ON(sizeof(struct nvme_uring_cmd_pdu) > sizeof(ioucmd->pdu));
+ ret = security_file_ioctl(ioucmd->file, ioucmd->cmd_op,
+ (unsigned long) ioucmd->cmd);
+ if (ret)
+ return ret;
+
switch (ioucmd->cmd_op) {
case NVME_IOCTL_IO64_CMD:
ret = nvme_user_cmd64(ns->ctrl, ns, NULL, ioucmd);
* Re: [PATCH 03/17] io_uring: add infra and support for IORING_OP_URING_CMD
2022-03-11 17:11 ` Luis Chamberlain
@ 2022-03-11 18:47 ` Paul Moore
2022-03-11 20:57 ` Luis Chamberlain
2022-03-14 16:25 ` Casey Schaufler
1 sibling, 1 reply; 122+ messages in thread
From: Paul Moore @ 2022-03-11 18:47 UTC (permalink / raw)
To: Luis Chamberlain, Jens Axboe, linux-security-module, linux-audit
Cc: Kanchan Joshi, jmorris, serge, ast, daniel, andrii, kafai,
songliubraving, yhs, john.fastabend, kpsingh, hch, kbusch,
asml.silence, io-uring, linux-nvme, linux-block, sbates, logang,
pankydev8, javier, a.manzanares, joshiiitr, anuj20.g, selinux
On Fri, Mar 11, 2022 at 12:11 PM Luis Chamberlain <[email protected]> wrote:
> On Thu, Mar 10, 2022 at 07:43:04PM -0700, Jens Axboe wrote:
> > On 3/10/22 6:51 PM, Luis Chamberlain wrote:
> > > On Tue, Mar 08, 2022 at 08:50:51PM +0530, Kanchan Joshi wrote:
> > >> From: Jens Axboe <[email protected]>
> > >>
> > >> This is a file private kind of request. io_uring doesn't know what's
> > >> in this command type, it's for the file_operations->async_cmd()
> > >> handler to deal with.
> > >>
> > >> Signed-off-by: Jens Axboe <[email protected]>
> > >> Signed-off-by: Kanchan Joshi <[email protected]>
> > >> ---
> > >
> > > <-- snip -->
> > >
> > >> +static int io_uring_cmd(struct io_kiocb *req, unsigned int issue_flags)
> > >> +{
> > >> + struct file *file = req->file;
> > >> + int ret;
> > >> + struct io_uring_cmd *ioucmd = &req->uring_cmd;
> > >> +
> > >> + ioucmd->flags |= issue_flags;
> > >> + ret = file->f_op->async_cmd(ioucmd);
> > >
> > > I think we're going to have to add a security_file_async_cmd() check
> > > before this call here. Because otherwise we're enabling to, for
> > > example, bypass security_file_ioctl() for example using the new
> > > iouring-cmd interface.
> > >
> > > Or is this already thought out with the existing security_uring_*() stuff?
> >
> > Unless the request sets .audit_skip, it'll be included already in terms
> > of logging.
>
> Neat.
[NOTE: added the audit and SELinux lists to the To/CC line]
Neat, but I think we will need to augment things to support this new
passthrough mechanism.
The issue is that folks who look at audit logs need to be able to
piece together what happened on the system using just what they have
in the logs themselves. As things currently stand with this patchset,
the only bit of information they would have to go on would be
"uring_op=<IORING_OP_URING_CMD>" which isn't very informative :)
You'll see a similar issue in the newly proposed LSM hook below, we
need to be able to record information about not only the passthrough
command, e.g. io_uring_cmd::cmd_op, but also the underlying
device/handler so that we can put the passthrough command in the right
context (as far as I can tell io_uring_cmd::cmd_op is specific to the
device). We might be able to leverage file_operations::owner::name
for this, e.g. "uring_passthru_dev=nvme
uring_passthru_op=<NVME_IOCTL_IO64_CMD>".
> > But I'd prefer not to lodge this in with ioctls, unless
> > we're going to be doing actual ioctls.
>
> Oh sure, I have been an advocate to ensure folks don't conflate async_cmd
> with ioctl. However it *can* enable subsystems to enable ioctl
> passthrough, but each of those subsystems need to vet for this on their
> own terms. I'd hate to see / hear some LSM surprises later.
Same :) Thanks for bringing this up with us while the patches are
still in-progress/under-review, I think it makes for a much more
pleasant experience for everyone.
> > But definitely something to keep in mind and make sure that we're under
> > the right umbrella in terms of auditing and security.
>
> Paul, how about something like this for starters (and probably should
> be squashed into this series so its not a separate commit) ?
>
> From f3ddbe822374cc1c7002bd795c1ae486d370cbd1 Mon Sep 17 00:00:00 2001
> From: Luis Chamberlain <[email protected]>
> Date: Fri, 11 Mar 2022 08:55:50 -0800
> Subject: [PATCH] lsm,io_uring: add LSM hooks for the new async_cmd file op
>
> io-uring is extending the struct file_operations to allow a new
> command which each subsystem can use to enable command passthrough.
> Add an LSM hook specific to the command passthrough which enables LSMs
> to inspect the command details.
>
> Signed-off-by: Luis Chamberlain <[email protected]>
> ---
> fs/io_uring.c | 5 +++++
> include/linux/lsm_hook_defs.h | 1 +
> include/linux/lsm_hooks.h | 3 +++
> include/linux/security.h | 5 +++++
> security/security.c | 4 ++++
> 5 files changed, 18 insertions(+)
>
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 3f6eacc98e31..1c4e6b2cb61a 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -4190,6 +4190,11 @@ static int io_uring_cmd_prep(struct io_kiocb *req,
> struct io_ring_ctx *ctx = req->ctx;
> struct io_uring_cmd *ioucmd = &req->uring_cmd;
> u32 ucmd_flags = READ_ONCE(sqe->uring_cmd_flags);
> + int ret;
> +
> + ret = security_uring_async_cmd(ioucmd);
> + if (ret)
> + return ret;
As a quick aside, for the LSM/audit folks the lore link for the full
patchset is here:
https://lore.kernel.org/io-uring/CA+1E3rJ17F0Rz5UKUnW-LPkWDfPHXG5aeq-ocgNxHfGrxYtAuw@mail.gmail.com/T/#m605e2fb7caf33e8880683fe6b57ade4093ed0643
Similar to what was discussed above with respect to auditing, I think
we need to do some extra work here to make it easier for a LSM to put
the IO request in the proper context. We have io_uring_cmd::cmd_op
via the @ioucmd parameter, which is good, but we need to be able to
associate that with a driver to make sense of it. In the case of
audit we could simply use the module name string, which is probably
ideal as we would want a string anyway, but LSMs will likely want
something more machine friendly. That isn't to say we couldn't do a
strcmp() on the module name string, but for something that aims to
push performance as much as possible, doing a strcmp() on each
operation seems a little less than optimal ;)
--
paul-moore.com
* Re: [PATCH 05/17] nvme: wire-up support for async-passthru on char-device.
2022-03-11 17:56 ` Luis Chamberlain
@ 2022-03-11 18:53 ` Paul Moore
2022-03-11 21:02 ` Luis Chamberlain
0 siblings, 1 reply; 122+ messages in thread
From: Paul Moore @ 2022-03-11 18:53 UTC (permalink / raw)
To: Luis Chamberlain
Cc: Kanchan Joshi, James Morris, Serge E. Hallyn, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
Yonghong Song, John Fastabend, KP Singh, linux-security-module,
axboe, hch, kbusch, asml.silence, io-uring, linux-nvme,
linux-block, sbates, logang, pankydev8, javier, a.manzanares,
joshiiitr, anuj20.g
On Fri, Mar 11, 2022 at 12:56 PM Luis Chamberlain <[email protected]> wrote:
>
> On Tue, Mar 08, 2022 at 08:50:53PM +0530, Kanchan Joshi wrote:
> > diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
> > index 5c9cd9695519..1df270b47af5 100644
> > --- a/drivers/nvme/host/ioctl.c
> > +++ b/drivers/nvme/host/ioctl.c
> > @@ -369,6 +469,33 @@ long nvme_ns_chr_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
> > return __nvme_ioctl(ns, cmd, (void __user *)arg);
> > }
> >
> > +static int nvme_ns_async_ioctl(struct nvme_ns *ns, struct io_uring_cmd *ioucmd)
> > +{
> > + int ret;
> > +
> > + BUILD_BUG_ON(sizeof(struct nvme_uring_cmd_pdu) > sizeof(ioucmd->pdu));
> > +
> > + switch (ioucmd->cmd_op) {
> > + case NVME_IOCTL_IO64_CMD:
> > + ret = nvme_user_cmd64(ns->ctrl, ns, NULL, ioucmd);
> > + break;
> > + default:
> > + ret = -ENOTTY;
> > + }
> > +
> > + if (ret >= 0)
> > + ret = -EIOCBQUEUED;
> > + return ret;
> > +}
>
> And here I think we'll need something like this:
If we can promise that we will have a LSM hook for all of the
file_operations::async_cmd implementations that are security relevant
we could skip the LSM passthrough hook at the io_uring layer. It
would potentially make life easier in that we don't have to worry
about putting the passthrough op in the right context, but risks
missing a LSM hook control point (it will happen at some point and
*boom* CVE).
> diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
> index ddb7e5864be6..83529adf130d 100644
> --- a/drivers/nvme/host/ioctl.c
> +++ b/drivers/nvme/host/ioctl.c
> @@ -5,6 +5,7 @@
> */
> #include <linux/ptrace.h> /* for force_successful_syscall_return */
> #include <linux/nvme_ioctl.h>
> +#include <linux/security.h>
> #include "nvme.h"
>
> /*
> @@ -524,6 +525,11 @@ static int nvme_ns_async_ioctl(struct nvme_ns *ns, struct io_uring_cmd *ioucmd)
>
> BUILD_BUG_ON(sizeof(struct nvme_uring_cmd_pdu) > sizeof(ioucmd->pdu));
>
> + ret = security_file_ioctl(ioucmd->file, ioucmd->cmd_op,
> + (unsigned long) ioucmd->cmd);
> + if (ret)
> + return ret;
> +
> switch (ioucmd->cmd_op) {
> case NVME_IOCTL_IO64_CMD:
> ret = nvme_user_cmd64(ns->ctrl, ns, NULL, ioucmd);
--
paul-moore.com
* Re: [PATCH 03/17] io_uring: add infra and support for IORING_OP_URING_CMD
2022-03-11 18:47 ` Paul Moore
@ 2022-03-11 20:57 ` Luis Chamberlain
2022-03-11 21:03 ` Paul Moore
0 siblings, 1 reply; 122+ messages in thread
From: Luis Chamberlain @ 2022-03-11 20:57 UTC (permalink / raw)
To: Paul Moore
Cc: Jens Axboe, linux-security-module, linux-audit, Kanchan Joshi,
jmorris, serge, ast, daniel, andrii, kafai, songliubraving, yhs,
john.fastabend, kpsingh, hch, kbusch, asml.silence, io-uring,
linux-nvme, linux-block, sbates, logang, pankydev8, javier,
a.manzanares, joshiiitr, anuj20.g, selinux
On Fri, Mar 11, 2022 at 01:47:51PM -0500, Paul Moore wrote:
> On Fri, Mar 11, 2022 at 12:11 PM Luis Chamberlain <[email protected]> wrote:
> > On Thu, Mar 10, 2022 at 07:43:04PM -0700, Jens Axboe wrote:
> > > On 3/10/22 6:51 PM, Luis Chamberlain wrote:
> > > > On Tue, Mar 08, 2022 at 08:50:51PM +0530, Kanchan Joshi wrote:
> > > >> From: Jens Axboe <[email protected]>
> > > >>
> > > >> This is a file private kind of request. io_uring doesn't know what's
> > > >> in this command type, it's for the file_operations->async_cmd()
> > > >> handler to deal with.
> > > >>
> > > >> Signed-off-by: Jens Axboe <[email protected]>
> > > >> Signed-off-by: Kanchan Joshi <[email protected]>
> > > >> ---
> > > >
> > > > <-- snip -->
> > > >
> > > >> +static int io_uring_cmd(struct io_kiocb *req, unsigned int issue_flags)
> > > >> +{
> > > >> + struct file *file = req->file;
> > > >> + int ret;
> > > >> + struct io_uring_cmd *ioucmd = &req->uring_cmd;
> > > >> +
> > > >> + ioucmd->flags |= issue_flags;
> > > >> + ret = file->f_op->async_cmd(ioucmd);
> > > >
> > > > I think we're going to have to add a security_file_async_cmd() check
> > > > before this call here. Because otherwise we're enabling to, for
> > > > example, bypass security_file_ioctl() for example using the new
> > > > iouring-cmd interface.
> > > >
> > > > Or is this already thought out with the existing security_uring_*() stuff?
> > >
> > > Unless the request sets .audit_skip, it'll be included already in terms
> > > of logging.
> >
> > Neat.
>
> [NOTE: added the audit and SELinux lists to the To/CC line]
>
> Neat, but I think we will need to augment things to support this new
> passthrough mechanism.
That's what my spidey instincts told me.
> The issue is that folks who look at audit logs need to be able to
> piece together what happened on the system using just what they have
> in the logs themselves. As things currently stand with this patchset,
> the only bit of information they would have to go on would be
> "uring_op=<IORING_OP_URING_CMD>" which isn't very informative :)
>
> You'll see a similar issue in the newly proposed LSM hook below, we
> need to be able to record information about not only the passthrough
> command, e.g. io_uring_cmd::cmd_op, but also the underlying
> device/handler so that we can put the passthrough command in the right
> context (as far as I can tell io_uring_cmd::cmd_op is specific to the
> device). We might be able to leverage file_operations::owner::name
> for this, e.g. "uring_passthru_dev=nvme
> uring_passthru_op=<NVME_IOCTL_IO64_CMD>".
OK...
> > > But I'd prefer not to lodge this in with ioctls, unless
> > > we're going to be doing actual ioctls.
> >
> > Oh sure, I have been an advocate to ensure folks don't conflate async_cmd
> > with ioctl. However it *can* enable subsystems to enable ioctl
> > passthrough, but each of those subsystems need to vet for this on their
> > own terms. I'd hate to see / hear some LSM surprises later.
>
> Same :) Thanks for bringing this up with us while the patches are
> still in-progress/under-review, I think it makes for a much more
> pleasant experience for everyone.
Sure thing.
> > > But definitely something to keep in mind and make sure that we're under
> > > the right umbrella in terms of auditing and security.
> >
> > Paul, how about something like this for starters (and probably should
> > be squashed into this series so its not a separate commit) ?
> >
> > From f3ddbe822374cc1c7002bd795c1ae486d370cbd1 Mon Sep 17 00:00:00 2001
> > From: Luis Chamberlain <[email protected]>
> > Date: Fri, 11 Mar 2022 08:55:50 -0800
> > Subject: [PATCH] lsm,io_uring: add LSM hooks for the new async_cmd file op
> >
> > io-uring is extending the struct file_operations to allow a new
> > command which each subsystem can use to enable command passthrough.
> > Add an LSM hook specific to the command passthrough which enables LSMs
> > to inspect the command details.
> >
> > Signed-off-by: Luis Chamberlain <[email protected]>
> > ---
> > fs/io_uring.c | 5 +++++
> > include/linux/lsm_hook_defs.h | 1 +
> > include/linux/lsm_hooks.h | 3 +++
> > include/linux/security.h | 5 +++++
> > security/security.c | 4 ++++
> > 5 files changed, 18 insertions(+)
> >
> > diff --git a/fs/io_uring.c b/fs/io_uring.c
> > index 3f6eacc98e31..1c4e6b2cb61a 100644
> > --- a/fs/io_uring.c
> > +++ b/fs/io_uring.c
> > @@ -4190,6 +4190,11 @@ static int io_uring_cmd_prep(struct io_kiocb *req,
> > struct io_ring_ctx *ctx = req->ctx;
> > struct io_uring_cmd *ioucmd = &req->uring_cmd;
> > u32 ucmd_flags = READ_ONCE(sqe->uring_cmd_flags);
> > + int ret;
> > +
> > + ret = security_uring_async_cmd(ioucmd);
> > + if (ret)
> > + return ret;
>
> As a quick aside, for the LSM/audit folks the lore link for the full
> patchset is here:
> https://lore.kernel.org/io-uring/CA+1E3rJ17F0Rz5UKUnW-LPkWDfPHXG5aeq-ocgNxHfGrxYtAuw@mail.gmail.com/T/#m605e2fb7caf33e8880683fe6b57ade4093ed0643
>
> Similar to what was discussed above with respect to auditing, I think
> we need to do some extra work here to make it easier for a LSM to put
> the IO request in the proper context. We have io_uring_cmd::cmd_op
> via the @ioucmd parameter, which is good, but we need to be able to
> associate that with a driver to make sense of it.
It may not always be a driver, it can be built-in stuff.
> In the case of
> audit we could simply use the module name string, which is probably
> ideal as we would want a string anyway, but LSMs will likely want
> something more machine friendly. That isn't to say we couldn't do a
> strcmp() on the module name string, but for something that aims to
> push performance as much as possible, doing a strcmp() on each
> operation seems a little less than optimal ;)
Yes this is a super hot path...
Luis
* Re: [PATCH 05/17] nvme: wire-up support for async-passthru on char-device.
2022-03-11 18:53 ` Paul Moore
@ 2022-03-11 21:02 ` Luis Chamberlain
0 siblings, 0 replies; 122+ messages in thread
From: Luis Chamberlain @ 2022-03-11 21:02 UTC (permalink / raw)
To: Paul Moore
Cc: Kanchan Joshi, James Morris, Serge E. Hallyn, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
Yonghong Song, John Fastabend, KP Singh, linux-security-module,
axboe, hch, kbusch, asml.silence, io-uring, linux-nvme,
linux-block, sbates, logang, pankydev8, javier, a.manzanares,
joshiiitr, anuj20.g
On Fri, Mar 11, 2022 at 01:53:03PM -0500, Paul Moore wrote:
> On Fri, Mar 11, 2022 at 12:56 PM Luis Chamberlain <[email protected]> wrote:
> >
> > On Tue, Mar 08, 2022 at 08:50:53PM +0530, Kanchan Joshi wrote:
> > > diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
> > > index 5c9cd9695519..1df270b47af5 100644
> > > --- a/drivers/nvme/host/ioctl.c
> > > +++ b/drivers/nvme/host/ioctl.c
> > > @@ -369,6 +469,33 @@ long nvme_ns_chr_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
> > > return __nvme_ioctl(ns, cmd, (void __user *)arg);
> > > }
> > >
> > > +static int nvme_ns_async_ioctl(struct nvme_ns *ns, struct io_uring_cmd *ioucmd)
> > > +{
> > > + int ret;
> > > +
> > > + BUILD_BUG_ON(sizeof(struct nvme_uring_cmd_pdu) > sizeof(ioucmd->pdu));
> > > +
> > > + switch (ioucmd->cmd_op) {
> > > + case NVME_IOCTL_IO64_CMD:
> > > + ret = nvme_user_cmd64(ns->ctrl, ns, NULL, ioucmd);
> > > + break;
> > > + default:
> > > + ret = -ENOTTY;
> > > + }
> > > +
> > > + if (ret >= 0)
> > > + ret = -EIOCBQUEUED;
> > > + return ret;
> > > +}
> >
> > And here I think we'll need something like this:
>
> If we can promise that we will have a LSM hook for all of the
> file_operations::async_cmd implementations that are security relevant
> we could skip the LSM passthrough hook at the io_uring layer.
There is no way to ensure this unless perhaps we bake that into
the API somehow... Or have a registration system for setting the
respective file ops / LSM.
> It
> would potentially make life easier in that we don't have to worry
> about putting the passthrough op in the right context, but risks
> missing a LSM hook control point (it will happen at some point and
> *boom* CVE).
Precisely my concern. So we either open code this and ask folks
to do this, or I think we do a registration and require both the ops
and the LSM hook at registration.
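A rough sketch of what such a registration could look like (names and layout
are invented here purely to make the idea concrete; nothing like this exists
in the posted series):

#include <linux/errno.h>
#include <linux/io_uring.h>

/*
 * Hypothetical: a provider may only wire up ->async_cmd() together with
 * a security check, so the LSM hook cannot be forgotten.
 */
struct async_cmd_provider {
	const char	*name;		/* e.g. "nvme" */
	int (*async_cmd)(struct io_uring_cmd *ioucmd);
	int (*security_check)(struct io_uring_cmd *ioucmd);
};

static int async_cmd_register(const struct async_cmd_provider *prov)
{
	if (!prov->name || !prov->async_cmd || !prov->security_check)
		return -EINVAL;
	/* ... insert into a lookup structure consulted at issue time ... */
	return 0;
}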
I think this should be enough information to get Kanchan rolling
on the LSM side.
Luis
* Re: [PATCH 03/17] io_uring: add infra and support for IORING_OP_URING_CMD
2022-03-11 20:57 ` Luis Chamberlain
@ 2022-03-11 21:03 ` Paul Moore
0 siblings, 0 replies; 122+ messages in thread
From: Paul Moore @ 2022-03-11 21:03 UTC (permalink / raw)
To: Luis Chamberlain
Cc: Jens Axboe, linux-security-module, linux-audit, Kanchan Joshi,
jmorris, serge, ast, daniel, andrii, kafai, songliubraving, yhs,
john.fastabend, kpsingh, hch, kbusch, asml.silence, io-uring,
linux-nvme, linux-block, sbates, logang, pankydev8, javier,
a.manzanares, joshiiitr, anuj20.g, selinux
On Fri, Mar 11, 2022 at 3:57 PM Luis Chamberlain <[email protected]> wrote:
> On Fri, Mar 11, 2022 at 01:47:51PM -0500, Paul Moore wrote:
...
> > Similar to what was discussed above with respect to auditing, I think
> > we need to do some extra work here to make it easier for a LSM to put
> > the IO request in the proper context. We have io_uring_cmd::cmd_op
> > via the @ioucmd parameter, which is good, but we need to be able to
> > associate that with a driver to make sense of it.
>
> It may not always be a driver, it can be built-in stuff.
Good point, but I believe the argument still applies. LSMs are going
to need some way to put the cmd_op token in the proper context so that
security policy can be properly enforced.
--
paul-moore.com
* Re: [PATCH 00/17] io_uring passthru over nvme
2022-03-11 16:43 ` Luis Chamberlain
@ 2022-03-11 23:35 ` Adam Manzanares
2022-03-12 2:27 ` Adam Manzanares
2022-03-13 5:10 ` Kanchan Joshi
1 sibling, 1 reply; 122+ messages in thread
From: Adam Manzanares @ 2022-03-11 23:35 UTC (permalink / raw)
To: Luis Chamberlain
Cc: Kanchan Joshi, Christoph Hellwig, Kanchan Joshi, Jens Axboe,
Keith Busch, Pavel Begunkov, [email protected],
[email protected], [email protected],
[email protected], [email protected], Pankaj Raghav,
Javier González, Anuj Gupta, [email protected],
[email protected]
On Fri, Mar 11, 2022 at 08:43:24AM -0800, Luis Chamberlain wrote:
> On Thu, Mar 10, 2022 at 03:35:02PM +0530, Kanchan Joshi wrote:
> > On Thu, Mar 10, 2022 at 1:59 PM Christoph Hellwig <[email protected]> wrote:
> > >
> > > What branch is this against?
> > Sorry I missed that in the cover.
> > Two options -
> > (a) https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-big-sqe
> > first patch ("128 byte sqe support") is already there.
> > (b) for-next (linux-block), series will fit on top of commit 9e9d83faa
> > ("io_uring: Remove unneeded test in io_run_task_work_sig")
> >
> > > Do you have a git tree available?
> > Not at the moment.
> >
> > @Jens: Please see if it is possible to move patches to your
> > io_uring-big-sqe branch (and maybe rename that to big-sqe-pt.v1).
>
> Since Jens might be busy, I've put up a tree with all this stuff:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux-next.git/log/?h=20220311-io-uring-cmd
>
> It is based on option (b) mentioned above, I took linux-block for-next
> and reset the tree to commit "io_uring: Remove unneeded test in
> io_run_task_work_sig" before applying the series.
FYI I can be involved in testing this and have added some colleagues that can
help in this regard. We have been using some form of this work for several
months now and haven't had any issues. That being said some simple tests I have
are not currently working with the above git tree :). I will work to get this
resolved and post an update here.
>
> Luis
* Re: [PATCH 00/17] io_uring passthru over nvme
2022-03-11 23:35 ` Adam Manzanares
@ 2022-03-12 2:27 ` Adam Manzanares
2022-03-13 5:07 ` Kanchan Joshi
0 siblings, 1 reply; 122+ messages in thread
From: Adam Manzanares @ 2022-03-12 2:27 UTC (permalink / raw)
To: Luis Chamberlain
Cc: Kanchan Joshi, Christoph Hellwig, Kanchan Joshi, Jens Axboe,
Keith Busch, Pavel Begunkov, [email protected],
[email protected], [email protected],
[email protected], [email protected], Pankaj Raghav,
Javier González, Anuj Gupta, [email protected],
[email protected]
On Fri, Mar 11, 2022 at 03:35:04PM -0800, Adam Manzanares wrote:
> On Fri, Mar 11, 2022 at 08:43:24AM -0800, Luis Chamberlain wrote:
> > On Thu, Mar 10, 2022 at 03:35:02PM +0530, Kanchan Joshi wrote:
> > > On Thu, Mar 10, 2022 at 1:59 PM Christoph Hellwig <[email protected]> wrote:
> > > >
> > > > What branch is this against?
> > > Sorry I missed that in the cover.
> > > Two options -
> > > (a) https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-big-sqe
> > > first patch ("128 byte sqe support") is already there.
> > > (b) for-next (linux-block), series will fit on top of commit 9e9d83faa
> > > ("io_uring: Remove unneeded test in io_run_task_work_sig")
> > >
> > > > Do you have a git tree available?
> > > Not at the moment.
> > >
> > > @Jens: Please see if it is possible to move patches to your
> > > io_uring-big-sqe branch (and maybe rename that to big-sqe-pt.v1).
> >
> > Since Jens might be busy, I've put up a tree with all this stuff:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux-next.git/log/?h=20220311-io-uring-cmd
> >
> > It is based on option (b) mentioned above, I took linux-block for-next
> > and reset the tree to commit "io_uring: Remove unneeded test in
> > io_run_task_work_sig" before applying the series.
>
> FYI I can be involved in testing this and have added some colleagues that can
> help in this regard. We have been using some form of this work for several
> months now and haven't had any issues. That being said some simple tests I have
> are not currently working with the above git tree :). I will work to get this
> resolved and post an update here.
Sorry for the noise, I jumped up the stack too quickly with my tests. The
"simple test" actually depends on several pieces of SW not related to the
kernel.
>
> >
> > Luis
* Re: [PATCH 00/17] io_uring passthru over nvme
2022-03-12 2:27 ` Adam Manzanares
@ 2022-03-13 5:07 ` Kanchan Joshi
2022-03-14 20:30 ` Adam Manzanares
0 siblings, 1 reply; 122+ messages in thread
From: Kanchan Joshi @ 2022-03-13 5:07 UTC (permalink / raw)
To: Adam Manzanares
Cc: Luis Chamberlain, Christoph Hellwig, Kanchan Joshi, Jens Axboe,
Keith Busch, Pavel Begunkov, [email protected],
[email protected], [email protected],
[email protected], [email protected], Pankaj Raghav,
Javier González, Anuj Gupta, [email protected],
[email protected]
On Sat, Mar 12, 2022 at 7:57 AM Adam Manzanares
<[email protected]> wrote:
>
> On Fri, Mar 11, 2022 at 03:35:04PM -0800, Adam Manzanares wrote:
> > On Fri, Mar 11, 2022 at 08:43:24AM -0800, Luis Chamberlain wrote:
> > > On Thu, Mar 10, 2022 at 03:35:02PM +0530, Kanchan Joshi wrote:
> > > > On Thu, Mar 10, 2022 at 1:59 PM Christoph Hellwig <[email protected]> wrote:
> > > > >
> > > > > What branch is this against?
> > > > Sorry I missed that in the cover.
> > > > Two options -
> > > > (a) https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-big-sqe
> > > > first patch ("128 byte sqe support") is already there.
> > > > (b) for-next (linux-block), series will fit on top of commit 9e9d83faa
> > > > ("io_uring: Remove unneeded test in io_run_task_work_sig")
> > > >
> > > > > Do you have a git tree available?
> > > > Not at the moment.
> > > >
> > > > @Jens: Please see if it is possible to move patches to your
> > > > io_uring-big-sqe branch (and maybe rename that to big-sqe-pt.v1).
> > >
> > > Since Jens might be busy, I've put up a tree with all this stuff:
> > >
> > > https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux-next.git/log/?h=20220311-io-uring-cmd
> > >
> > > It is based on option (b) mentioned above, I took linux-block for-next
> > > and reset the tree to commit "io_uring: Remove unneeded test in
> > > io_run_task_work_sig" before applying the series.
> >
> > FYI I can be involved in testing this and have added some colleagues that can
> > help in this regard. We have been using some form of this work for several
> > months now and haven't had any issues. That being said some simple tests I have
> > are not currently working with the above git tree :). I will work to get this
> > resolved and post an update here.
>
> Sorry for the noise, I jumped up the stack too quickly with my tests. The
> "simple test" actually depends on several pieces of SW not related to the
> kernel.
Did you read the cover letter? It's not the same *user-interface* as
the previous series.
If you did not modify those out-of-kernel layers for the new
interface, you're bound to see what you saw.
If you did, please specify what the simple test was. I'll fix that in v2.
Otherwise, the throwaway remark "simple tests not working" only
implies this series is untested. Nothing could be further from the
truth.
Rather, this series is more robust than the previous one.
Let me expand a bit more on the testing part that's already there in the cover:
fio -iodepth=256 -rw=randread -ioengine=io_uring -bs=512 -numjobs=1
-runtime=60 -group_reporting -iodepth_batch_submit=64
-iodepth_batch_complete_min=1 -iodepth_batch_complete_max=64
-fixedbufs=1 -hipri=1 -sqthread_poll=0 -filename=/dev/ng0n1
-name=io_uring_256 -uring_cmd=1
When I reduce the above command-line to do single IO, I call that a simple test.
A simple test like that touches almost *everything* the patches build (i.e.
async, fixed-buffer, plugging, bio-cache, polling); see the example below.
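For reference, a cut-down invocation along those lines would be roughly
(illustrative only, mirroring the command above with all depths reduced to 1):

fio -iodepth=1 -rw=randread -ioengine=io_uring -bs=512 -numjobs=1
-runtime=10 -group_reporting -iodepth_batch_submit=1
-iodepth_batch_complete_min=1 -iodepth_batch_complete_max=1
-fixedbufs=1 -hipri=1 -sqthread_poll=0 -filename=/dev/ng0n1
-name=simple_test -uring_cmd=1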
And larger tests combine these knobs in various ways, QD ranging from
1, 2, 4... up to 256; on general and perf-optimized kernel configs; with
big-sqe and normal-sqe (pointer one). And all this is repeated on the
block interface (regular io) too, which covers the regression part.
Sure, I can add more tests for checking regression. But no, I don't
expect any simple test to fail. And that applies to Luis' tree as
well. Tried that too again.
--
Joshi
* Re: [PATCH 00/17] io_uring passthru over nvme
2022-03-11 16:43 ` Luis Chamberlain
2022-03-11 23:35 ` Adam Manzanares
@ 2022-03-13 5:10 ` Kanchan Joshi
1 sibling, 0 replies; 122+ messages in thread
From: Kanchan Joshi @ 2022-03-13 5:10 UTC (permalink / raw)
To: Luis Chamberlain
Cc: Christoph Hellwig, Kanchan Joshi, Jens Axboe, Keith Busch,
Pavel Begunkov, io-uring, linux-nvme, linux-block, sbates, logang,
Pankaj Raghav, Javier González, Adam Manzanares, Anuj Gupta
On Fri, Mar 11, 2022 at 10:13 PM Luis Chamberlain <[email protected]> wrote:
>
> On Thu, Mar 10, 2022 at 03:35:02PM +0530, Kanchan Joshi wrote:
> > On Thu, Mar 10, 2022 at 1:59 PM Christoph Hellwig <[email protected]> wrote:
> > >
> > > What branch is this against?
> > Sorry I missed that in the cover.
> > Two options -
> > (a) https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-big-sqe
> > first patch ("128 byte sqe support") is already there.
> > (b) for-next (linux-block), series will fit on top of commit 9e9d83faa
> > ("io_uring: Remove unneeded test in io_run_task_work_sig")
> >
> > > Do you have a git tree available?
> > Not at the moment.
> >
> > @Jens: Please see if it is possible to move patches to your
> > io_uring-big-sqe branch (and maybe rename that to big-sqe-pt.v1).
>
> Since Jens might be busy, I've put up a tree with all this stuff:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux-next.git/log/?h=20220311-io-uring-cmd
>
> It is based on option (b) mentioned above, I took linux-block for-next
> and reset the tree to commit "io_uring: Remove unneeded test in
> io_run_task_work_sig" before applying the series.
Thanks for putting this up.
--
Joshi
* Re: [PATCH 05/17] nvme: wire-up support for async-passthru on char-device.
2022-03-08 15:20 ` [PATCH 05/17] nvme: wire-up support for async-passthru on char-device Kanchan Joshi
` (2 preceding siblings ...)
2022-03-11 17:56 ` Luis Chamberlain
@ 2022-03-13 21:53 ` Sagi Grimberg
2022-03-14 17:54 ` Kanchan Joshi
2022-03-22 15:18 ` Clay Mayers
4 siblings, 1 reply; 122+ messages in thread
From: Sagi Grimberg @ 2022-03-13 21:53 UTC (permalink / raw)
To: Kanchan Joshi, axboe, hch, kbusch, asml.silence
Cc: io-uring, linux-nvme, linux-block, sbates, logang, pankydev8,
javier, mcgrof, a.manzanares, joshiiitr, anuj20.g
> +int nvme_ns_head_chr_async_cmd(struct io_uring_cmd *ioucmd)
> +{
> + struct cdev *cdev = file_inode(ioucmd->file)->i_cdev;
> + struct nvme_ns_head *head = container_of(cdev, struct nvme_ns_head, cdev);
> + int srcu_idx = srcu_read_lock(&head->srcu);
> + struct nvme_ns *ns = nvme_find_path(head);
> + int ret = -EWOULDBLOCK;
> +
> + if (ns)
> + ret = nvme_ns_async_ioctl(ns, ioucmd);
> + srcu_read_unlock(&head->srcu, srcu_idx);
> + return ret;
> +}
No one cares that this has no multipathing capabilities whatsoever,
despite being issued on the mpath device node?
I know we are not doing multipathing for userspace today, but this
feels like an alternative I/O interface for nvme; it seems a bit crippled
with zero multipathing capabilities...
* Re: [PATCH 14/17] io_uring: add polling support for uring-cmd
2022-03-11 6:50 ` Christoph Hellwig
@ 2022-03-14 10:16 ` Kanchan Joshi
2022-03-15 8:57 ` Christoph Hellwig
0 siblings, 1 reply; 122+ messages in thread
From: Kanchan Joshi @ 2022-03-14 10:16 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Kanchan Joshi, Jens Axboe, Keith Busch, Pavel Begunkov, io-uring,
linux-nvme, linux-block, sbates, logang, Pankaj Raghav,
Javier González, Luis Chamberlain, Adam Manzanares,
Anuj Gupta
On Fri, Mar 11, 2022 at 12:20 PM Christoph Hellwig <[email protected]> wrote:
>
> On Tue, Mar 08, 2022 at 08:51:02PM +0530, Kanchan Joshi wrote:
> > + if (req->opcode == IORING_OP_URING_CMD ||
> > + req->opcode == IORING_OP_URING_CMD_FIXED) {
> > + /* uring_cmd structure does not contain kiocb struct */
> > + struct kiocb kiocb_uring_cmd;
> > +
> > + kiocb_uring_cmd.private = req->uring_cmd.bio;
> > + kiocb_uring_cmd.ki_filp = req->uring_cmd.file;
> > + ret = req->uring_cmd.file->f_op->iopoll(&kiocb_uring_cmd,
> > + &iob, poll_flags);
> > + } else {
> > + ret = kiocb->ki_filp->f_op->iopoll(kiocb, &iob,
> > + poll_flags);
> > + }
>
> This is just completely broken. You absolutely do need the iocb
> in struct uring_cmd for ->iopoll to work.
But, after you did bio based polling, we need just the bio to poll.
iocb is a big structure (48 bytes), and if we try to place it in
struct io_uring_cmd, we will just blow up the cacheline in io_uring
(first one in io_kiocb).
So we just store that bio pointer in io_uring_cmd on submission
(please see patch 15).
And here on completion we pick that bio, and put that into this local
iocb, simply because ->iopoll needs it.
Do you see anything I am still missing here?
--
Kanchan
* Re: [PATCH 08/17] nvme: enable passthrough with fixed-buffer
2022-03-08 15:20 ` [PATCH 08/17] nvme: enable passthrough " Kanchan Joshi
2022-03-10 8:32 ` Christoph Hellwig
2022-03-11 6:43 ` Christoph Hellwig
@ 2022-03-14 12:18 ` Ming Lei
2022-03-14 13:09 ` Kanchan Joshi
2 siblings, 1 reply; 122+ messages in thread
From: Ming Lei @ 2022-03-14 12:18 UTC (permalink / raw)
To: Kanchan Joshi
Cc: axboe, hch, kbusch, asml.silence, io-uring, linux-nvme,
linux-block, sbates, logang, pankydev8, javier, mcgrof,
a.manzanares, joshiiitr, anuj20.g
On Tue, Mar 08, 2022 at 08:50:56PM +0530, Kanchan Joshi wrote:
> From: Anuj Gupta <[email protected]>
>
> Add support to carry out passthrough command with pre-mapped buffers.
>
> Signed-off-by: Anuj Gupta <[email protected]>
> Signed-off-by: Kanchan Joshi <[email protected]>
> ---
> block/blk-map.c | 45 +++++++++++++++++++++++++++++++++++++++
> drivers/nvme/host/ioctl.c | 27 ++++++++++++++---------
> include/linux/blk-mq.h | 2 ++
> 3 files changed, 64 insertions(+), 10 deletions(-)
>
> diff --git a/block/blk-map.c b/block/blk-map.c
> index 4526adde0156..027e8216e313 100644
> --- a/block/blk-map.c
> +++ b/block/blk-map.c
> @@ -8,6 +8,7 @@
> #include <linux/bio.h>
> #include <linux/blkdev.h>
> #include <linux/uio.h>
> +#include <linux/io_uring.h>
>
> #include "blk.h"
>
> @@ -577,6 +578,50 @@ int blk_rq_map_user(struct request_queue *q, struct request *rq,
> }
> EXPORT_SYMBOL(blk_rq_map_user);
>
> +/* Unlike blk_rq_map_user () this is only for fixed-buffer async passthrough. */
> +int blk_rq_map_user_fixedb(struct request_queue *q, struct request *rq,
> + u64 ubuf, unsigned long len, gfp_t gfp_mask,
> + struct io_uring_cmd *ioucmd)
> +{
> + struct iov_iter iter;
> + size_t iter_count, nr_segs;
> + struct bio *bio;
> + int ret;
> +
> + /*
> + * Talk to io_uring to obtain BVEC iterator for the buffer.
> + * And use that iterator to form bio/request.
> + */
> + ret = io_uring_cmd_import_fixed(ubuf, len, rq_data_dir(rq), &iter,
> + ioucmd);
> + if (unlikely(ret < 0))
> + return ret;
> + iter_count = iov_iter_count(&iter);
> + nr_segs = iter.nr_segs;
> +
> + if (!iter_count || (iter_count >> 9) > queue_max_hw_sectors(q))
> + return -EINVAL;
> + if (nr_segs > queue_max_segments(q))
> + return -EINVAL;
> + /* no iovecs to alloc, as we already have a BVEC iterator */
> + bio = bio_alloc(gfp_mask, 0);
> + if (!bio)
> + return -ENOMEM;
> +
> + ret = bio_iov_iter_get_pages(bio, &iter);
Here bio_iov_iter_get_pages() may not work as expected, since the code
needs to check the queue limits before adding a page to the bio, and we don't
run the split for passthrough bios. __bio_iov_append_get_pages() may be
generalized to cover this case; a sketch follows.
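For illustration only, the per-segment check could look something like this
(an untested sketch that glosses over iov_offset handling, partial segments
and splitting; the helper name is made up):

#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/uio.h>

/*
 * Hypothetical sketch: walk the BVEC iterator and add each segment via
 * bio_add_pc_page() so queue limits (max segments/size, virt boundary)
 * are enforced for the passthrough bio.
 */
static int bio_add_fixedb_pages(struct request_queue *q, struct bio *bio,
				struct iov_iter *iter)
{
	unsigned long i;

	for (i = 0; i < iter->nr_segs; i++) {
		const struct bio_vec *bv = &iter->bvec[i];

		if (bio_add_pc_page(q, bio, bv->bv_page, bv->bv_len,
				    bv->bv_offset) != bv->bv_len)
			return -EINVAL;
	}
	return 0;
}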
Thanks,
Ming
* Re: [PATCH 08/17] nvme: enable passthrough with fixed-buffer
2022-03-11 6:43 ` Christoph Hellwig
@ 2022-03-14 13:06 ` Kanchan Joshi
2022-03-15 8:55 ` Christoph Hellwig
0 siblings, 1 reply; 122+ messages in thread
From: Kanchan Joshi @ 2022-03-14 13:06 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Kanchan Joshi, Jens Axboe, Keith Busch, Pavel Begunkov, io-uring,
linux-nvme, linux-block, sbates, logang, Pankaj Raghav,
Javier González, Luis Chamberlain, Adam Manzanares,
Anuj Gupta
On Fri, Mar 11, 2022 at 12:13 PM Christoph Hellwig <[email protected]> wrote:
>
> > +int blk_rq_map_user_fixedb(struct request_queue *q, struct request *rq,
> > + u64 ubuf, unsigned long len, gfp_t gfp_mask,
> > + struct io_uring_cmd *ioucmd)
>
> Looking at this a bit more, I don't think this is a good interface or
> works at all for that matter.
>
> > +{
> > + struct iov_iter iter;
> > + size_t iter_count, nr_segs;
> > + struct bio *bio;
> > + int ret;
> > +
> > + /*
> > + * Talk to io_uring to obtain BVEC iterator for the buffer.
> > + * And use that iterator to form bio/request.
> > + */
> > + ret = io_uring_cmd_import_fixed(ubuf, len, rq_data_dir(rq), &iter,
> > + ioucmd);
>
> Instead of pulling the io-uring dependency into blk-map.c we could just
> pass the iter to a helper function and have that as the block layer
> abstraction if we really want one. But:
>
> > + if (unlikely(ret < 0))
> > + return ret;
> > + iter_count = iov_iter_count(&iter);
> > + nr_segs = iter.nr_segs;
> > +
> > + if (!iter_count || (iter_count >> 9) > queue_max_hw_sectors(q))
> > + return -EINVAL;
> > + if (nr_segs > queue_max_segments(q))
> > + return -EINVAL;
> > + /* no iovecs to alloc, as we already have a BVEC iterator */
> > + bio = bio_alloc(gfp_mask, 0);
> > + if (!bio)
> > + return -ENOMEM;
> > +
> > + ret = bio_iov_iter_get_pages(bio, &iter);
>
> I can't see how this works at all. block drivers have a lot more
> requirements than just total size and number of segments. Very typical
> is a limit on the size of each sector, and for nvme we also have the
> weird virtual boundary for the PRPs. None of that is being checked here.
> You really need to use bio_add_pc_page or open code the equivalent checks
> for passthrough I/O.
Indeed, I'm missing those checks. Will fix up.
> > + if (likely(nvme_is_fixedb_passthru(ioucmd)))
> > + ret = blk_rq_map_user_fixedb(q, req, ubuffer, bufflen,
> > + GFP_KERNEL, ioucmd);
>
> And I'm also really worried about only supporting fixed buffers. Fixed
> buffers are a really nice benchmarketing feature, but without supporting
> arbitrary buffers this is rather useless in real life.
Sorry, I did not get your point on arbitrary buffers.
The goal has been to match/surpass io_uring's block-io peak perf, so
pre-mapped buffers had to be added.
--
Kanchan
* Re: [PATCH 08/17] nvme: enable passthrough with fixed-buffer
2022-03-14 12:18 ` Ming Lei
@ 2022-03-14 13:09 ` Kanchan Joshi
0 siblings, 0 replies; 122+ messages in thread
From: Kanchan Joshi @ 2022-03-14 13:09 UTC (permalink / raw)
To: Ming Lei
Cc: Kanchan Joshi, Jens Axboe, Christoph Hellwig, Keith Busch,
Pavel Begunkov, io-uring, linux-nvme, linux-block, sbates, logang,
Pankaj Raghav, Javier González, Luis Chamberlain,
Adam Manzanares, Anuj Gupta
On Mon, Mar 14, 2022 at 5:49 PM Ming Lei <[email protected]> wrote:
>
> On Tue, Mar 08, 2022 at 08:50:56PM +0530, Kanchan Joshi wrote:
> > From: Anuj Gupta <[email protected]>
> >
> > Add support to carry out passthrough command with pre-mapped buffers.
> >
> > Signed-off-by: Anuj Gupta <[email protected]>
> > Signed-off-by: Kanchan Joshi <[email protected]>
> > ---
> > block/blk-map.c | 45 +++++++++++++++++++++++++++++++++++++++
> > drivers/nvme/host/ioctl.c | 27 ++++++++++++++---------
> > include/linux/blk-mq.h | 2 ++
> > 3 files changed, 64 insertions(+), 10 deletions(-)
> >
> > diff --git a/block/blk-map.c b/block/blk-map.c
> > index 4526adde0156..027e8216e313 100644
> > --- a/block/blk-map.c
> > +++ b/block/blk-map.c
> > @@ -8,6 +8,7 @@
> > #include <linux/bio.h>
> > #include <linux/blkdev.h>
> > #include <linux/uio.h>
> > +#include <linux/io_uring.h>
> >
> > #include "blk.h"
> >
> > @@ -577,6 +578,50 @@ int blk_rq_map_user(struct request_queue *q, struct request *rq,
> > }
> > EXPORT_SYMBOL(blk_rq_map_user);
> >
> > +/* Unlike blk_rq_map_user () this is only for fixed-buffer async passthrough. */
> > +int blk_rq_map_user_fixedb(struct request_queue *q, struct request *rq,
> > + u64 ubuf, unsigned long len, gfp_t gfp_mask,
> > + struct io_uring_cmd *ioucmd)
> > +{
> > + struct iov_iter iter;
> > + size_t iter_count, nr_segs;
> > + struct bio *bio;
> > + int ret;
> > +
> > + /*
> > + * Talk to io_uring to obtain BVEC iterator for the buffer.
> > + * And use that iterator to form bio/request.
> > + */
> > + ret = io_uring_cmd_import_fixed(ubuf, len, rq_data_dir(rq), &iter,
> > + ioucmd);
> > + if (unlikely(ret < 0))
> > + return ret;
> > + iter_count = iov_iter_count(&iter);
> > + nr_segs = iter.nr_segs;
> > +
> > + if (!iter_count || (iter_count >> 9) > queue_max_hw_sectors(q))
> > + return -EINVAL;
> > + if (nr_segs > queue_max_segments(q))
> > + return -EINVAL;
> > + /* no iovecs to alloc, as we already have a BVEC iterator */
> > + bio = bio_alloc(gfp_mask, 0);
> > + if (!bio)
> > + return -ENOMEM;
> > +
> > + ret = bio_iov_iter_get_pages(bio, &iter);
>
> Here bio_iov_iter_get_pages() may not work as expected since the code
> needs to check queue limit before adding page to bio and we don't run
> split for passthrough bio. __bio_iov_append_get_pages() may be generalized
> for covering this case.
Yes. That may just be the right thing to do. Thanks for the suggestion.
--
Kanchan
* Re: [PATCH 09/17] io_uring: plug for async bypass
2022-03-10 8:33 ` Christoph Hellwig
@ 2022-03-14 14:33 ` Ming Lei
2022-03-15 8:56 ` Christoph Hellwig
0 siblings, 1 reply; 122+ messages in thread
From: Ming Lei @ 2022-03-14 14:33 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Kanchan Joshi, axboe, kbusch, asml.silence, io-uring, linux-nvme,
linux-block, sbates, logang, pankydev8, javier, mcgrof,
a.manzanares, joshiiitr, anuj20.g
On Thu, Mar 10, 2022 at 09:33:03AM +0100, Christoph Hellwig wrote:
> On Tue, Mar 08, 2022 at 08:50:57PM +0530, Kanchan Joshi wrote:
> > From: Jens Axboe <[email protected]>
> >
> > Enable .plug for uring-cmd.
>
> This should go into the patch adding the
> IORING_OP_URING_CMD/IORING_OP_URING_CMD_FIXED.
Plug support for passthrough rq is added in the following patch, so
this one may be put after patch 'block: wire-up support for plugging'.
Thanks,
Ming
* Re: [PATCH 10/17] block: wire-up support for plugging
2022-03-10 12:40 ` Kanchan Joshi
@ 2022-03-14 14:40 ` Ming Lei
2022-03-21 7:02 ` Kanchan Joshi
0 siblings, 1 reply; 122+ messages in thread
From: Ming Lei @ 2022-03-14 14:40 UTC (permalink / raw)
To: Kanchan Joshi
Cc: Christoph Hellwig, Kanchan Joshi, Jens Axboe, Keith Busch,
Pavel Begunkov, io-uring, linux-nvme, linux-block, sbates, logang,
Pankaj Raghav, Javier González, Luis Chamberlain,
Adam Manzanares, Anuj Gupta
On Thu, Mar 10, 2022 at 06:10:08PM +0530, Kanchan Joshi wrote:
> On Thu, Mar 10, 2022 at 2:04 PM Christoph Hellwig <[email protected]> wrote:
> >
> > On Tue, Mar 08, 2022 at 08:50:58PM +0530, Kanchan Joshi wrote:
> > > From: Jens Axboe <[email protected]>
> > >
> > > Add support to use plugging if it is enabled, else use default path.
> >
> > The subject and this comment don't really explain what is done, and
> > also don't mention at all why it is done.
>
> Missed out, will fix up. But plugging gave a very good hike to IOPS.
But how does plugging improve IOPS here for passthrough requests? I don't
see plug->nr_ios being wired to data.nr_tags in blk_mq_alloc_request(),
which is called by nvme_submit_user_cmd().
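Just to make that point concrete, a hypothetical variant mirroring what
blk_mq_submit_bio() already does for regular bios might look like this
(sketch only; struct blk_mq_alloc_data and __blk_mq_alloc_requests() are
internal to block/blk-mq.c, so something like this would have to live there):

#include <linux/blk-mq.h>
#include <linux/blkdev.h>
#include <linux/sched.h>

static struct request *blk_mq_alloc_request_plugged(struct request_queue *q,
						    unsigned int opf,
						    blk_mq_req_flags_t flags)
{
	struct blk_plug *plug = current->plug;
	struct blk_mq_alloc_data data = {
		.q		= q,
		.flags		= flags,
		.cmd_flags	= opf,
		.nr_tags	= 1,
	};

	if (plug) {
		/* allocate a batch of tags and cache the spares in the plug */
		data.nr_tags	= plug->nr_ios;
		plug->nr_ios	= 1;
		data.cached_rq	= &plug->cached_rq;
	}
	return __blk_mq_alloc_requests(&data);
}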
Thanks,
Ming
* Re: [PATCH 05/17] nvme: wire-up support for async-passthru on char-device.
2022-03-11 7:01 ` Christoph Hellwig
@ 2022-03-14 16:23 ` Kanchan Joshi
2022-03-15 8:54 ` Christoph Hellwig
0 siblings, 1 reply; 122+ messages in thread
From: Kanchan Joshi @ 2022-03-14 16:23 UTC (permalink / raw)
To: Christoph Hellwig
Cc: axboe, kbusch, asml.silence, io-uring, linux-nvme, linux-block,
sbates, logang, pankydev8, javier, mcgrof, a.manzanares,
joshiiitr, anuj20.g
On Fri, Mar 11, 2022 at 08:01:48AM +0100, Christoph Hellwig wrote:
>On Tue, Mar 08, 2022 at 08:50:53PM +0530, Kanchan Joshi wrote:
>> +/*
>> + * This overlays struct io_uring_cmd pdu.
>> + * Expect build errors if this grows larger than that.
>> + */
>> +struct nvme_uring_cmd_pdu {
>> + u32 meta_len;
>> + union {
>> + struct bio *bio;
>> + struct request *req;
>> + };
>> + void *meta; /* kernel-resident buffer */
>> + void __user *meta_buffer;
>> +} __packed;
>
>Why is this marked __packed?
Did not like doing it, but had to.
If not packed, this takes 32 bytes of space, while the driver pdu in struct
io_uring_cmd can take at most 30 bytes. Packing the nvme pdu brought it down
to 28 bytes, which fits and gives 2 bytes back.
For quick reference -
struct io_uring_cmd {
struct file * file; /* 0 8 */
void * cmd; /* 8 8 */
union {
void * bio; /* 16 8 */
void (*driver_cb)(struct io_uring_cmd *); /* 16 8 */
}; /* 16 8 */
u32 flags; /* 24 4 */
u32 cmd_op; /* 28 4 */
u16 cmd_len; /* 32 2 */
u16 unused; /* 34 2 */
u8 pdu[28]; /* 36 28 */
/* size: 64, cachelines: 1, members: 8 */
};
The io_uring_cmd struct goes into the first cacheline of io_kiocb.
The last field is pdu, taking 28 bytes. It would be 30 if I removed the
unused field above it.
nvme-pdu after packing:
struct nvme_uring_cmd_pdu {
u32 meta_len; /* 0 4 */
union {
struct bio * bio; /* 4 8 */
struct request * req; /* 4 8 */
}; /* 4 8 */
void * meta; /* 12 8 */
void * meta_buffer; /* 20 8 */
/* size: 28, cachelines: 1, members: 4 */
/* last cacheline: 28 bytes */
} __attribute__((__packed__));
>In general I'd be much more happy if the meta elements were a
>io_uring-level feature handled outside the driver and typesafe in
>struct io_uring_cmd, with just a normal private data pointer for the
>actual user, which would remove all the crazy casting.
Not sure if I got your point.
+static struct nvme_uring_cmd_pdu *nvme_uring_cmd_pdu(struct io_uring_cmd *ioucmd)
+{
+ return (struct nvme_uring_cmd_pdu *)&ioucmd->pdu;
+}
+
+static void nvme_pt_task_cb(struct io_uring_cmd *ioucmd)
+{
+ struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd);
Do you mean crazy casting inside nvme_uring_cmd_pdu()?
Somehow this looks sane to me (perhaps because it used to be crazier
earlier).
And on moving the meta elements outside the driver, my worry is that it
reduces the scope of the uring-cmd infra and makes it nvme-passthru specific.
At this point uring-cmd is still a generic async ioctl/fsctl facility
which may find other users (than nvme-passthru) down the line.
The organization of fields within "struct io_uring_cmd" follows the rule
that a field is kept out (of the 28-byte pdu) only if it is accessed by both
io_uring and the driver.
>> +static void nvme_end_async_pt(struct request *req, blk_status_t err)
>> +{
>> + struct io_uring_cmd *ioucmd = req->end_io_data;
>> + struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd);
>> + /* extract bio before reusing the same field for request */
>> + struct bio *bio = pdu->bio;
>> +
>> + pdu->req = req;
>> + req->bio = bio;
>> + /* this takes care of setting up task-work */
>> + io_uring_cmd_complete_in_task(ioucmd, nvme_pt_task_cb);
>
>This is a bit silly. First we defer the actual request I/O completion
>from the block layer to a different CPU or softirq and then we have
>another callback here. I think it would be much more useful if we
>could find a way to enhance blk_mq_complete_request so that it could
>directly complet in a given task. That would also be really nice for
>say normal synchronous direct I/O.
I see, so there is room for adding some efficiency.
Hope it will be ok if I carry this out as a separate effort.
Since this is about touching blk_mq_complete_request at its heart, and
improving sync direct-io, it does not seem the best fit here and would slow
this series down.
FWIW, I ran the tests with counters inside blk_mq_complete_request_remote()
if (blk_mq_complete_need_ipi(rq)) {
blk_mq_complete_send_ipi(rq);
return true;
}
if (rq->q->nr_hw_queues == 1) {
blk_mq_raise_softirq(rq);
return true;
}
Deferring by IPI or softirq never occurred, neither for block nor for
char. Softirq is obvious since I was not running against scsi (or nvme with
a single queue). I could not spot whether this is really an overhead, at
least for nvme.
>> + if (ioucmd) { /* async dispatch */
>> + if (cmd->common.opcode == nvme_cmd_write ||
>> + cmd->common.opcode == nvme_cmd_read) {
>
>No we can't just check for specific commands in the passthrough handler.
Right. This is for inline-cmd approach.
Last two patches of the series undo this (for indirect-cmd).
I will do something about it.
>> + nvme_setup_uring_cmd_data(req, ioucmd, meta, meta_buffer,
>> + meta_len);
>> + blk_execute_rq_nowait(req, 0, nvme_end_async_pt);
>> + return 0;
>> + } else {
>> + /* support only read and write for now. */
>> + ret = -EINVAL;
>> + goto out_meta;
>> + }
>
>Please always handle the error in the first branch and don't bother with an
>else after a goto or return.
Yes, that'll be better.
>> +static int nvme_ns_async_ioctl(struct nvme_ns *ns, struct io_uring_cmd *ioucmd)
>> +{
>> + int ret;
>> +
>> + BUILD_BUG_ON(sizeof(struct nvme_uring_cmd_pdu) > sizeof(ioucmd->pdu));
>> +
>> + switch (ioucmd->cmd_op) {
>> + case NVME_IOCTL_IO64_CMD:
>> + ret = nvme_user_cmd64(ns->ctrl, ns, NULL, ioucmd);
>> + break;
>> + default:
>> + ret = -ENOTTY;
>> + }
>> +
>> + if (ret >= 0)
>> + ret = -EIOCBQUEUED;
>
>That's a weird way to handle the returns. Just return -EIOCBQUEUED
>directly from the handler (which as said before should be split from
>the ioctl handler anyway).
Indeed. That will make it cleaner.
* Re: [PATCH 03/17] io_uring: add infra and support for IORING_OP_URING_CMD
2022-03-11 17:11 ` Luis Chamberlain
2022-03-11 18:47 ` Paul Moore
@ 2022-03-14 16:25 ` Casey Schaufler
2022-03-14 16:32 ` Luis Chamberlain
1 sibling, 1 reply; 122+ messages in thread
From: Casey Schaufler @ 2022-03-14 16:25 UTC (permalink / raw)
To: Luis Chamberlain, Jens Axboe, Paul Moore
Cc: Kanchan Joshi, jmorris, serge, ast, daniel, andrii, kafai,
songliubraving, yhs, john.fastabend, kpsingh,
linux-security-module, hch, kbusch, asml.silence, io-uring,
linux-nvme, linux-block, sbates, logang, pankydev8, javier,
a.manzanares, joshiiitr, anuj20.g, Casey Schaufler
On 3/11/2022 9:11 AM, Luis Chamberlain wrote:
> On Thu, Mar 10, 2022 at 07:43:04PM -0700, Jens Axboe wrote:
>> On 3/10/22 6:51 PM, Luis Chamberlain wrote:
>>> On Tue, Mar 08, 2022 at 08:50:51PM +0530, Kanchan Joshi wrote:
>>>> From: Jens Axboe <[email protected]>
>>>>
>>>> This is a file private kind of request. io_uring doesn't know what's
>>>> in this command type, it's for the file_operations->async_cmd()
>>>> handler to deal with.
>>>>
>>>> Signed-off-by: Jens Axboe <[email protected]>
>>>> Signed-off-by: Kanchan Joshi <[email protected]>
>>>> ---
>>> <-- snip -->
>>>
>>>> +static int io_uring_cmd(struct io_kiocb *req, unsigned int issue_flags)
>>>> +{
>>>> + struct file *file = req->file;
>>>> + int ret;
>>>> + struct io_uring_cmd *ioucmd = &req->uring_cmd;
>>>> +
>>>> + ioucmd->flags |= issue_flags;
>>>> + ret = file->f_op->async_cmd(ioucmd);
>>> I think we're going to have to add a security_file_async_cmd() check
>>> before this call here. Because otherwise we're enabling to, for
>>> example, bypass security_file_ioctl() for example using the new
>>> iouring-cmd interface.
>>>
>>> Or is this already thought out with the existing security_uring_*() stuff?
>> Unless the request sets .audit_skip, it'll be included already in terms
>> of logging.
> Neat.
>
>> But I'd prefer not to lodge this in with ioctls, unless
>> we're going to be doing actual ioctls.
> Oh sure, I have been an advocate to ensure folks don't conflate async_cmd
> with ioctl. However it *can* enable subsystems to enable ioctl
> passthrough, but each of those subsystems need to vet for this on their
> own terms. I'd hate to see / hear some LSM surprises later.
>
>> But definitely something to keep in mind and make sure that we're under
>> the right umbrella in terms of auditing and security.
> Paul, how about something like this for starters (and probably should
> be squashed into this series so it's not a separate commit)?
>
> From f3ddbe822374cc1c7002bd795c1ae486d370cbd1 Mon Sep 17 00:00:00 2001
> From: Luis Chamberlain <[email protected]>
> Date: Fri, 11 Mar 2022 08:55:50 -0800
> Subject: [PATCH] lsm,io_uring: add LSM hooks for the new async_cmd file op
>
> io-uring is extending the struct file_operations to allow a new
> command which each subsystem can use to enable command passthrough.
> Add an LSM hook specific to the command passthrough which enables LSMs
> to inspect the command details.
>
> Signed-off-by: Luis Chamberlain <[email protected]>
> ---
> fs/io_uring.c | 5 +++++
> include/linux/lsm_hook_defs.h | 1 +
> include/linux/lsm_hooks.h | 3 +++
> include/linux/security.h | 5 +++++
> security/security.c | 4 ++++
> 5 files changed, 18 insertions(+)
>
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 3f6eacc98e31..1c4e6b2cb61a 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -4190,6 +4190,11 @@ static int io_uring_cmd_prep(struct io_kiocb *req,
> struct io_ring_ctx *ctx = req->ctx;
> struct io_uring_cmd *ioucmd = &req->uring_cmd;
> u32 ucmd_flags = READ_ONCE(sqe->uring_cmd_flags);
> + int ret;
> +
> + ret = security_uring_async_cmd(ioucmd);
> + if (ret)
> + return ret;
>
> if (!req->file->f_op->async_cmd)
> return -EOPNOTSUPP;
> diff --git a/include/linux/lsm_hook_defs.h b/include/linux/lsm_hook_defs.h
> index 819ec92dc2a8..4a20f8e6b295 100644
> --- a/include/linux/lsm_hook_defs.h
> +++ b/include/linux/lsm_hook_defs.h
> @@ -404,4 +404,5 @@ LSM_HOOK(int, 0, perf_event_write, struct perf_event *event)
> #ifdef CONFIG_IO_URING
> LSM_HOOK(int, 0, uring_override_creds, const struct cred *new)
> LSM_HOOK(int, 0, uring_sqpoll, void)
> +LSM_HOOK(int, 0, uring_async_cmd, struct io_uring_cmd *ioucmd)
> #endif /* CONFIG_IO_URING */
> diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
> index 3bf5c658bc44..21b18cf138c2 100644
> --- a/include/linux/lsm_hooks.h
> +++ b/include/linux/lsm_hooks.h
> @@ -1569,6 +1569,9 @@
> * Check whether the current task is allowed to spawn a io_uring polling
> * thread (IORING_SETUP_SQPOLL).
> *
> + * @uring_async_cmd:
> + * Check whether the file_operations async_cmd is allowed to run.
> + *
> */
> union security_list_options {
> #define LSM_HOOK(RET, DEFAULT, NAME, ...) RET (*NAME)(__VA_ARGS__);
> diff --git a/include/linux/security.h b/include/linux/security.h
> index 6d72772182c8..4d7f72813d75 100644
> --- a/include/linux/security.h
> +++ b/include/linux/security.h
> @@ -2041,6 +2041,7 @@ static inline int security_perf_event_write(struct perf_event *event)
> #ifdef CONFIG_SECURITY
> extern int security_uring_override_creds(const struct cred *new);
> extern int security_uring_sqpoll(void);
> +extern int security_uring_async_cmd(struct io_uring_cmd *ioucmd);
> #else
> static inline int security_uring_override_creds(const struct cred *new)
> {
> @@ -2050,6 +2051,10 @@ static inline int security_uring_sqpoll(void)
> {
> return 0;
> }
> +static inline int security_uring_async_cmd(struct io_uring_cmd *ioucmd)
> +{
> + return 0;
> +}
> #endif /* CONFIG_SECURITY */
> #endif /* CONFIG_IO_URING */
>
> diff --git a/security/security.c b/security/security.c
> index 22261d79f333..ef96be2f953a 100644
> --- a/security/security.c
> +++ b/security/security.c
> @@ -2640,4 +2640,8 @@ int security_uring_sqpoll(void)
> {
> return call_int_hook(uring_sqpoll, 0);
> }
> +int security_uring_async_cmd(struct io_uring_cmd *ioucmd)
> +{
> + return call_int_hook(uring_async_cmd, 0, ioucmd);
I don't have a good understanding of what information is in ioucmd.
I am afraid that there may not be enough for a security module to
make appropriate decisions in all cases. I am especially concerned
about the modules that use path hooks, but based on the issues we've
always had with ioctl and the like I fear for all cases.
> +}
> #endif /* CONFIG_IO_URING */
* Re: [PATCH 03/17] io_uring: add infra and support for IORING_OP_URING_CMD
2022-03-14 16:25 ` Casey Schaufler
@ 2022-03-14 16:32 ` Luis Chamberlain
2022-03-14 18:05 ` Casey Schaufler
0 siblings, 1 reply; 122+ messages in thread
From: Luis Chamberlain @ 2022-03-14 16:32 UTC (permalink / raw)
To: Casey Schaufler
Cc: Jens Axboe, Paul Moore, Kanchan Joshi, jmorris, serge, ast,
daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
kpsingh, linux-security-module, hch, kbusch, asml.silence,
io-uring, linux-nvme, linux-block, sbates, logang, pankydev8,
javier, a.manzanares, joshiiitr, anuj20.g
On Mon, Mar 14, 2022 at 09:25:35AM -0700, Casey Schaufler wrote:
> On 3/11/2022 9:11 AM, Luis Chamberlain wrote:
> > On Thu, Mar 10, 2022 at 07:43:04PM -0700, Jens Axboe wrote:
> > > On 3/10/22 6:51 PM, Luis Chamberlain wrote:
> > > > On Tue, Mar 08, 2022 at 08:50:51PM +0530, Kanchan Joshi wrote:
> > > > > From: Jens Axboe <[email protected]>
> > > > >
> > > > > This is a file private kind of request. io_uring doesn't know what's
> > > > > in this command type, it's for the file_operations->async_cmd()
> > > > > handler to deal with.
> > > > >
> > > > > Signed-off-by: Jens Axboe <[email protected]>
> > > > > Signed-off-by: Kanchan Joshi <[email protected]>
> > > > > ---
> > > > <-- snip -->
> > > >
> > > > > +static int io_uring_cmd(struct io_kiocb *req, unsigned int issue_flags)
> > > > > +{
> > > > > + struct file *file = req->file;
> > > > > + int ret;
> > > > > + struct io_uring_cmd *ioucmd = &req->uring_cmd;
> > > > > +
> > > > > + ioucmd->flags |= issue_flags;
> > > > > + ret = file->f_op->async_cmd(ioucmd);
> > > > I think we're going to have to add a security_file_async_cmd() check
> > > > before this call here. Because otherwise we're enabling to, for
> > > > example, bypass security_file_ioctl() for example using the new
> > > > iouring-cmd interface.
> > > >
> > > > Or is this already thought out with the existing security_uring_*() stuff?
> > > Unless the request sets .audit_skip, it'll be included already in terms
> > > of logging.
> > Neat.
> >
> > > But I'd prefer not to lodge this in with ioctls, unless
> > > we're going to be doing actual ioctls.
> > Oh sure, I have been an advocate to ensure folks don't conflate async_cmd
> > with ioctl. However it *can* enable subsystems to enable ioctl
> > passthrough, but each of those subsystems need to vet for this on their
> > own terms. I'd hate to see / hear some LSM surprises later.
> >
> > > But definitely something to keep in mind and make sure that we're under
> > > the right umbrella in terms of auditing and security.
> > Paul, how about something like this for starters (and probably should
> > be squashed into this series so it's not a separate commit)?
> >
> > From f3ddbe822374cc1c7002bd795c1ae486d370cbd1 Mon Sep 17 00:00:00 2001
> > From: Luis Chamberlain <[email protected]>
> > Date: Fri, 11 Mar 2022 08:55:50 -0800
> > Subject: [PATCH] lsm,io_uring: add LSM hooks for the new async_cmd file op
> >
> > io-uring is extending the struct file_operations to allow a new
> > command which each subsystem can use to enable command passthrough.
> > Add an LSM hook specific to the command passthrough which enables LSMs
> > to inspect the command details.
> >
> > Signed-off-by: Luis Chamberlain <[email protected]>
> > ---
> > fs/io_uring.c | 5 +++++
> > include/linux/lsm_hook_defs.h | 1 +
> > include/linux/lsm_hooks.h | 3 +++
> > include/linux/security.h | 5 +++++
> > security/security.c | 4 ++++
> > 5 files changed, 18 insertions(+)
> >
> > diff --git a/fs/io_uring.c b/fs/io_uring.c
> > index 3f6eacc98e31..1c4e6b2cb61a 100644
> > --- a/fs/io_uring.c
> > +++ b/fs/io_uring.c
> > @@ -4190,6 +4190,11 @@ static int io_uring_cmd_prep(struct io_kiocb *req,
> > struct io_ring_ctx *ctx = req->ctx;
> > struct io_uring_cmd *ioucmd = &req->uring_cmd;
> > u32 ucmd_flags = READ_ONCE(sqe->uring_cmd_flags);
> > + int ret;
> > +
> > + ret = security_uring_async_cmd(ioucmd);
> > + if (ret)
> > + return ret;
> > if (!req->file->f_op->async_cmd)
> > return -EOPNOTSUPP;
> > diff --git a/include/linux/lsm_hook_defs.h b/include/linux/lsm_hook_defs.h
> > index 819ec92dc2a8..4a20f8e6b295 100644
> > --- a/include/linux/lsm_hook_defs.h
> > +++ b/include/linux/lsm_hook_defs.h
> > @@ -404,4 +404,5 @@ LSM_HOOK(int, 0, perf_event_write, struct perf_event *event)
> > #ifdef CONFIG_IO_URING
> > LSM_HOOK(int, 0, uring_override_creds, const struct cred *new)
> > LSM_HOOK(int, 0, uring_sqpoll, void)
> > +LSM_HOOK(int, 0, uring_async_cmd, struct io_uring_cmd *ioucmd)
> > #endif /* CONFIG_IO_URING */
> > diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
> > index 3bf5c658bc44..21b18cf138c2 100644
> > --- a/include/linux/lsm_hooks.h
> > +++ b/include/linux/lsm_hooks.h
> > @@ -1569,6 +1569,9 @@
> > * Check whether the current task is allowed to spawn a io_uring polling
> > * thread (IORING_SETUP_SQPOLL).
> > *
> > + * @uring_async_cmd:
> > + * Check whether the file_operations async_cmd is allowed to run.
> > + *
> > */
> > union security_list_options {
> > #define LSM_HOOK(RET, DEFAULT, NAME, ...) RET (*NAME)(__VA_ARGS__);
> > diff --git a/include/linux/security.h b/include/linux/security.h
> > index 6d72772182c8..4d7f72813d75 100644
> > --- a/include/linux/security.h
> > +++ b/include/linux/security.h
> > @@ -2041,6 +2041,7 @@ static inline int security_perf_event_write(struct perf_event *event)
> > #ifdef CONFIG_SECURITY
> > extern int security_uring_override_creds(const struct cred *new);
> > extern int security_uring_sqpoll(void);
> > +extern int security_uring_async_cmd(struct io_uring_cmd *ioucmd);
> > #else
> > static inline int security_uring_override_creds(const struct cred *new)
> > {
> > @@ -2050,6 +2051,10 @@ static inline int security_uring_sqpoll(void)
> > {
> > return 0;
> > }
> > +static inline int security_uring_async_cmd(struct io_uring_cmd *ioucmd)
> > +{
> > + return 0;
> > +}
> > #endif /* CONFIG_SECURITY */
> > #endif /* CONFIG_IO_URING */
> > diff --git a/security/security.c b/security/security.c
> > index 22261d79f333..ef96be2f953a 100644
> > --- a/security/security.c
> > +++ b/security/security.c
> > @@ -2640,4 +2640,8 @@ int security_uring_sqpoll(void)
> > {
> > return call_int_hook(uring_sqpoll, 0);
> > }
> > +int security_uring_async_cmd(struct io_uring_cmd *ioucmd)
> > +{
> > + return call_int_hook(uring_async_cmd, 0, ioucmd);
>
> I don't have a good understanding of what information is in ioucmd.
> I am afraid that there may not be enough for a security module to
> make appropriate decisions in all cases. I am especially concerned
> about the modules that use path hooks, but based on the issues we've
> always had with ioctl and the like I fear for all cases.
As Paul pointed out, this particular LSM hook would not be needed if we can
somehow ensure users of the cmd path use their respective LSMs there. It
is not easy to force users to have the LSM hook be used; one idea
might be to have a registration mechanism which allows users to also
specify the LSM hook. But these can vary in arguments, so perhaps
what is needed is the LSM type in enum form, and internally we have a
mapping of these. That way we slowly itemize which cmds we *do* allow
for, vetting a respective LSM hook at the same time. Thoughts?
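Very roughly, the registration could look something like the sketch below
(purely illustrative; all names are made up and this is not part of the
series):

enum uring_cmd_lsm_class {
	URING_CMD_LSM_NONE,
	URING_CMD_LSM_FILE_IOCTL,	/* vetted like security_file_ioctl() */
	URING_CMD_LSM_SB_ADMIN,
};

struct uring_cmd_registration {
	u32 cmd_op;				/* the async_cmd opcode */
	enum uring_cmd_lsm_class lsm_class;	/* mapped internally to an LSM hook */
};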
Luis
^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [PATCH 05/17] nvme: wire-up support for async-passthru on char-device.
2022-03-13 21:53 ` Sagi Grimberg
@ 2022-03-14 17:54 ` Kanchan Joshi
2022-03-15 9:02 ` Sagi Grimberg
0 siblings, 1 reply; 122+ messages in thread
From: Kanchan Joshi @ 2022-03-14 17:54 UTC (permalink / raw)
To: Sagi Grimberg
Cc: Kanchan Joshi, Jens Axboe, Christoph Hellwig, Keith Busch,
Pavel Begunkov, io-uring, linux-nvme, linux-block, sbates, logang,
Pankaj Raghav, Javier González, Luis Chamberlain,
Adam Manzanares, Anuj Gupta
On Mon, Mar 14, 2022 at 3:23 AM Sagi Grimberg <[email protected]> wrote:
>
>
> > +int nvme_ns_head_chr_async_cmd(struct io_uring_cmd *ioucmd)
> > +{
> > + struct cdev *cdev = file_inode(ioucmd->file)->i_cdev;
> > + struct nvme_ns_head *head = container_of(cdev, struct nvme_ns_head, cdev);
> > + int srcu_idx = srcu_read_lock(&head->srcu);
> > + struct nvme_ns *ns = nvme_find_path(head);
> > + int ret = -EWOULDBLOCK;
> > +
> > + if (ns)
> > + ret = nvme_ns_async_ioctl(ns, ioucmd);
> > + srcu_read_unlock(&head->srcu, srcu_idx);
> > + return ret;
> > +}
>
> No one cares that this has no multipathing capabilities what-so-ever?
> despite being issued on the mpath device node?
>
> I know we are not doing multipathing for userspace today, but this
> feels like an alternative I/O interface for nvme, seems a bit cripled
> with zero multipathing capabilities...
Multipathing is on the radar, either in the first cut or in a
subsequent one. Thanks for bringing this up.
So the char-node (/dev/ngX) will be exposed to the host if we enable
controller passthru on the target side. And then the host can send
commands using uring-passthru in the same way.
May I know what the other requirements here are?
Bit of a shame that I missed adding that in the LSF proposal, but it's
correctable.
--
Kanchan
^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [PATCH 03/17] io_uring: add infra and support for IORING_OP_URING_CMD
2022-03-14 16:32 ` Luis Chamberlain
@ 2022-03-14 18:05 ` Casey Schaufler
2022-03-14 19:40 ` Luis Chamberlain
0 siblings, 1 reply; 122+ messages in thread
From: Casey Schaufler @ 2022-03-14 18:05 UTC (permalink / raw)
To: Luis Chamberlain
Cc: Jens Axboe, Paul Moore, Kanchan Joshi, jmorris, serge, ast,
daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
kpsingh, linux-security-module, hch, kbusch, asml.silence,
io-uring, linux-nvme, linux-block, sbates, logang, pankydev8,
javier, a.manzanares, joshiiitr, anuj20.g, Casey Schaufler
On 3/14/2022 9:32 AM, Luis Chamberlain wrote:
> On Mon, Mar 14, 2022 at 09:25:35AM -0700, Casey Schaufler wrote:
>> On 3/11/2022 9:11 AM, Luis Chamberlain wrote:
>>> On Thu, Mar 10, 2022 at 07:43:04PM -0700, Jens Axboe wrote:
>>>> On 3/10/22 6:51 PM, Luis Chamberlain wrote:
>>>>> On Tue, Mar 08, 2022 at 08:50:51PM +0530, Kanchan Joshi wrote:
>>>>>> From: Jens Axboe <[email protected]>
>>>>>>
>>>>>> This is a file private kind of request. io_uring doesn't know what's
>>>>>> in this command type, it's for the file_operations->async_cmd()
>>>>>> handler to deal with.
>>>>>>
>>>>>> Signed-off-by: Jens Axboe <[email protected]>
>>>>>> Signed-off-by: Kanchan Joshi <[email protected]>
>>>>>> ---
>>>>> <-- snip -->
>>>>>
>>>>>> +static int io_uring_cmd(struct io_kiocb *req, unsigned int issue_flags)
>>>>>> +{
>>>>>> + struct file *file = req->file;
>>>>>> + int ret;
>>>>>> + struct io_uring_cmd *ioucmd = &req->uring_cmd;
>>>>>> +
>>>>>> + ioucmd->flags |= issue_flags;
>>>>>> + ret = file->f_op->async_cmd(ioucmd);
>>>>> I think we're going to have to add a security_file_async_cmd() check
>>>>> before this call here. Because otherwise we're enabling to, for
>>>>> example, bypass security_file_ioctl() for example using the new
>>>>> iouring-cmd interface.
>>>>>
>>>>> Or is this already thought out with the existing security_uring_*() stuff?
>>>> Unless the request sets .audit_skip, it'll be included already in terms
>>>> of logging.
>>> Neat.
>>>
>>>> But I'd prefer not to lodge this in with ioctls, unless
>>>> we're going to be doing actual ioctls.
>>> Oh sure, I have been an advocate to ensure folks don't conflate async_cmd
>>> with ioctl. However it *can* enable subsystems to enable ioctl
>>> passthrough, but each of those subsystems need to vet for this on their
>>> own terms. I'd hate to see / hear some LSM surprises later.
>>>
>>>> But definitely something to keep in mind and make sure that we're under
>>>> the right umbrella in terms of auditing and security.
>>> Paul, how about something like this for starters (and probably should
>>> be squashed into this series so its not a separate commit) ?
>>>
>>> >From f3ddbe822374cc1c7002bd795c1ae486d370cbd1 Mon Sep 17 00:00:00 2001
>>> From: Luis Chamberlain <[email protected]>
>>> Date: Fri, 11 Mar 2022 08:55:50 -0800
>>> Subject: [PATCH] lsm,io_uring: add LSM hooks to for the new async_cmd file op
>>>
>>> io-uring is extending the struct file_operations to allow a new
>>> command which each subsystem can use to enable command passthrough.
>>> Add an LSM specific for the command passthrough which enables LSMs
>>> to inspect the command details.
>>>
>>> Signed-off-by: Luis Chamberlain <[email protected]>
>>> ---
>>> fs/io_uring.c | 5 +++++
>>> include/linux/lsm_hook_defs.h | 1 +
>>> include/linux/lsm_hooks.h | 3 +++
>>> include/linux/security.h | 5 +++++
>>> security/security.c | 4 ++++
>>> 5 files changed, 18 insertions(+)
>>>
>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>> index 3f6eacc98e31..1c4e6b2cb61a 100644
>>> --- a/fs/io_uring.c
>>> +++ b/fs/io_uring.c
>>> @@ -4190,6 +4190,11 @@ static int io_uring_cmd_prep(struct io_kiocb *req,
>>> struct io_ring_ctx *ctx = req->ctx;
>>> struct io_uring_cmd *ioucmd = &req->uring_cmd;
>>> u32 ucmd_flags = READ_ONCE(sqe->uring_cmd_flags);
>>> + int ret;
>>> +
>>> + ret = security_uring_async_cmd(ioucmd);
>>> + if (ret)
>>> + return ret;
>>> if (!req->file->f_op->async_cmd)
>>> return -EOPNOTSUPP;
>>> diff --git a/include/linux/lsm_hook_defs.h b/include/linux/lsm_hook_defs.h
>>> index 819ec92dc2a8..4a20f8e6b295 100644
>>> --- a/include/linux/lsm_hook_defs.h
>>> +++ b/include/linux/lsm_hook_defs.h
>>> @@ -404,4 +404,5 @@ LSM_HOOK(int, 0, perf_event_write, struct perf_event *event)
>>> #ifdef CONFIG_IO_URING
>>> LSM_HOOK(int, 0, uring_override_creds, const struct cred *new)
>>> LSM_HOOK(int, 0, uring_sqpoll, void)
>>> +LSM_HOOK(int, 0, uring_async_cmd, struct io_uring_cmd *ioucmd)
>>> #endif /* CONFIG_IO_URING */
>>> diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
>>> index 3bf5c658bc44..21b18cf138c2 100644
>>> --- a/include/linux/lsm_hooks.h
>>> +++ b/include/linux/lsm_hooks.h
>>> @@ -1569,6 +1569,9 @@
>>> * Check whether the current task is allowed to spawn a io_uring polling
>>> * thread (IORING_SETUP_SQPOLL).
>>> *
>>> + * @uring_async_cmd:
>>> + * Check whether the file_operations async_cmd is allowed to run.
>>> + *
>>> */
>>> union security_list_options {
>>> #define LSM_HOOK(RET, DEFAULT, NAME, ...) RET (*NAME)(__VA_ARGS__);
>>> diff --git a/include/linux/security.h b/include/linux/security.h
>>> index 6d72772182c8..4d7f72813d75 100644
>>> --- a/include/linux/security.h
>>> +++ b/include/linux/security.h
>>> @@ -2041,6 +2041,7 @@ static inline int security_perf_event_write(struct perf_event *event)
>>> #ifdef CONFIG_SECURITY
>>> extern int security_uring_override_creds(const struct cred *new);
>>> extern int security_uring_sqpoll(void);
>>> +extern int security_uring_async_cmd(struct io_uring_cmd *ioucmd);
>>> #else
>>> static inline int security_uring_override_creds(const struct cred *new)
>>> {
>>> @@ -2050,6 +2051,10 @@ static inline int security_uring_sqpoll(void)
>>> {
>>> return 0;
>>> }
>>> +static inline int security_uring_async_cmd(struct io_uring_cmd *ioucmd)
>>> +{
>>> + return 0;
>>> +}
>>> #endif /* CONFIG_SECURITY */
>>> #endif /* CONFIG_IO_URING */
>>> diff --git a/security/security.c b/security/security.c
>>> index 22261d79f333..ef96be2f953a 100644
>>> --- a/security/security.c
>>> +++ b/security/security.c
>>> @@ -2640,4 +2640,8 @@ int security_uring_sqpoll(void)
>>> {
>>> return call_int_hook(uring_sqpoll, 0);
>>> }
>>> +int security_uring_async_cmd(struct io_uring_cmd *ioucmd)
>>> +{
>>> + return call_int_hook(uring_async_cmd, 0, ioucmd);
>> I don't have a good understanding of what information is in ioucmd.
>> I am afraid that there may not be enough for a security module to
>> make appropriate decisions in all cases. I am especially concerned
>> about the modules that use path hooks, but based on the issues we've
>> always had with ioctl and the like I fear for all cases.
> As Paul pointed out, this particular LSM hook would not be needed if we can
> somehow ensure users of the cmd path use their respective LSMs there. It
> is not easy to force users to have the LSM hook to be used, one idea
> might be to have a registration mechanism which allows users to also
> specify the LSM hook, but these can vary in arguments, so perhaps then
> what is needed is the LSM type in enum form, and internally we have a
> mapping of these. That way we slowly itemize which cmds we *do* allow
> for, thus vetting at the same time a respective LSM hook. Thoughts?
tl;dr - Yuck.
I don't see how your registration mechanism would be easier than
getting "users of the cmd path" to use the LSM mechanism the way
everyone else does. What it would do is pass responsibility for
dealing with LSM to the io_uring core team. Experience has shown
that dealing with the security issues after the fact is much
harder than doing it up front, even when developers wail about
the burden. Sure, LSM is an unpleasant interface/mechanism, but
so is locking, and no one gets away without addressing that.
My $0.02.
>
> Luis
^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [PATCH 12/17] nvme: enable bio-cache for fixed-buffer passthru
2022-03-11 6:48 ` Christoph Hellwig
@ 2022-03-14 18:18 ` Kanchan Joshi
2022-03-15 8:57 ` Christoph Hellwig
0 siblings, 1 reply; 122+ messages in thread
From: Kanchan Joshi @ 2022-03-14 18:18 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Kanchan Joshi, Jens Axboe, Keith Busch, Pavel Begunkov, io-uring,
linux-nvme, linux-block, sbates, logang, Pankaj Raghav,
Javier González, Luis Chamberlain, Adam Manzanares,
Anuj Gupta
On Fri, Mar 11, 2022 at 12:18 PM Christoph Hellwig <[email protected]> wrote:
>
> On Tue, Mar 08, 2022 at 08:51:00PM +0530, Kanchan Joshi wrote:
> > Since we do submission/completion in task, we can have this up.
> > Add a bio-set for nvme as we need that for bio-cache.
>
> Well, passthrough I/O should just use kmalloced bios anyway, as there
> is no need for the mempool to start with. Take a look at the existing
> code in blk-map.c.
Yes, the only reason to switch from kmalloc to bio-set was being able
to use the bio-cache, towards the goal of matching the peak perf of
io_uring's block io path.
Is it too bad to go down this route? Is there any different way to
enable the bio-cache for passthru?
--
Kanchan
^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [PATCH 03/17] io_uring: add infra and support for IORING_OP_URING_CMD
2022-03-14 18:05 ` Casey Schaufler
@ 2022-03-14 19:40 ` Luis Chamberlain
0 siblings, 0 replies; 122+ messages in thread
From: Luis Chamberlain @ 2022-03-14 19:40 UTC (permalink / raw)
To: Casey Schaufler
Cc: Jens Axboe, Paul Moore, Kanchan Joshi, jmorris, serge, ast,
daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
kpsingh, linux-security-module, hch, kbusch, asml.silence,
io-uring, linux-nvme, linux-block, sbates, logang, pankydev8,
javier, a.manzanares, joshiiitr, anuj20.g
On Mon, Mar 14, 2022 at 11:05:41AM -0700, Casey Schaufler wrote:
> On 3/14/2022 9:32 AM, Luis Chamberlain wrote:
> > On Mon, Mar 14, 2022 at 09:25:35AM -0700, Casey Schaufler wrote:
> > > On 3/11/2022 9:11 AM, Luis Chamberlain wrote:
> > > > On Thu, Mar 10, 2022 at 07:43:04PM -0700, Jens Axboe wrote:
> > > > > On 3/10/22 6:51 PM, Luis Chamberlain wrote:
> > > > > > On Tue, Mar 08, 2022 at 08:50:51PM +0530, Kanchan Joshi wrote:
> > > > > > > From: Jens Axboe <[email protected]>
> > > > > > >
> > > > > > > This is a file private kind of request. io_uring doesn't know what's
> > > > > > > in this command type, it's for the file_operations->async_cmd()
> > > > > > > handler to deal with.
> > > > > > >
> > > > > > > Signed-off-by: Jens Axboe <[email protected]>
> > > > > > > Signed-off-by: Kanchan Joshi <[email protected]>
> > > > > > > ---
> > > > > > <-- snip -->
> > > > > >
> > > > > > > +static int io_uring_cmd(struct io_kiocb *req, unsigned int issue_flags)
> > > > > > > +{
> > > > > > > + struct file *file = req->file;
> > > > > > > + int ret;
> > > > > > > + struct io_uring_cmd *ioucmd = &req->uring_cmd;
> > > > > > > +
> > > > > > > + ioucmd->flags |= issue_flags;
> > > > > > > + ret = file->f_op->async_cmd(ioucmd);
> > > > > > I think we're going to have to add a security_file_async_cmd() check
> > > > > > before this call here. Because otherwise we're enabling to, for
> > > > > > example, bypass security_file_ioctl() for example using the new
> > > > > > iouring-cmd interface.
> > > > > >
> > > > > > Or is this already thought out with the existing security_uring_*() stuff?
> > > > > Unless the request sets .audit_skip, it'll be included already in terms
> > > > > of logging.
> > > > Neat.
> > > >
> > > > > But I'd prefer not to lodge this in with ioctls, unless
> > > > > we're going to be doing actual ioctls.
> > > > Oh sure, I have been an advocate to ensure folks don't conflate async_cmd
> > > > with ioctl. However it *can* enable subsystems to enable ioctl
> > > > passthrough, but each of those subsystems need to vet for this on their
> > > > own terms. I'd hate to see / hear some LSM surprises later.
> > > >
> > > > > But definitely something to keep in mind and make sure that we're under
> > > > > the right umbrella in terms of auditing and security.
> > > > Paul, how about something like this for starters (and probably should
> > > > be squashed into this series so its not a separate commit) ?
> > > >
> > > > >From f3ddbe822374cc1c7002bd795c1ae486d370cbd1 Mon Sep 17 00:00:00 2001
> > > > From: Luis Chamberlain <[email protected]>
> > > > Date: Fri, 11 Mar 2022 08:55:50 -0800
> > > > Subject: [PATCH] lsm,io_uring: add LSM hooks to for the new async_cmd file op
> > > >
> > > > io-uring is extending the struct file_operations to allow a new
> > > > command which each subsystem can use to enable command passthrough.
> > > > Add an LSM specific for the command passthrough which enables LSMs
> > > > to inspect the command details.
> > > >
> > > > Signed-off-by: Luis Chamberlain <[email protected]>
> > > > ---
> > > > fs/io_uring.c | 5 +++++
> > > > include/linux/lsm_hook_defs.h | 1 +
> > > > include/linux/lsm_hooks.h | 3 +++
> > > > include/linux/security.h | 5 +++++
> > > > security/security.c | 4 ++++
> > > > 5 files changed, 18 insertions(+)
> > > >
> > > > diff --git a/fs/io_uring.c b/fs/io_uring.c
> > > > index 3f6eacc98e31..1c4e6b2cb61a 100644
> > > > --- a/fs/io_uring.c
> > > > +++ b/fs/io_uring.c
> > > > @@ -4190,6 +4190,11 @@ static int io_uring_cmd_prep(struct io_kiocb *req,
> > > > struct io_ring_ctx *ctx = req->ctx;
> > > > struct io_uring_cmd *ioucmd = &req->uring_cmd;
> > > > u32 ucmd_flags = READ_ONCE(sqe->uring_cmd_flags);
> > > > + int ret;
> > > > +
> > > > + ret = security_uring_async_cmd(ioucmd);
> > > > + if (ret)
> > > > + return ret;
> > > > if (!req->file->f_op->async_cmd)
> > > > return -EOPNOTSUPP;
> > > > diff --git a/include/linux/lsm_hook_defs.h b/include/linux/lsm_hook_defs.h
> > > > index 819ec92dc2a8..4a20f8e6b295 100644
> > > > --- a/include/linux/lsm_hook_defs.h
> > > > +++ b/include/linux/lsm_hook_defs.h
> > > > @@ -404,4 +404,5 @@ LSM_HOOK(int, 0, perf_event_write, struct perf_event *event)
> > > > #ifdef CONFIG_IO_URING
> > > > LSM_HOOK(int, 0, uring_override_creds, const struct cred *new)
> > > > LSM_HOOK(int, 0, uring_sqpoll, void)
> > > > +LSM_HOOK(int, 0, uring_async_cmd, struct io_uring_cmd *ioucmd)
> > > > #endif /* CONFIG_IO_URING */
> > > > diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
> > > > index 3bf5c658bc44..21b18cf138c2 100644
> > > > --- a/include/linux/lsm_hooks.h
> > > > +++ b/include/linux/lsm_hooks.h
> > > > @@ -1569,6 +1569,9 @@
> > > > * Check whether the current task is allowed to spawn a io_uring polling
> > > > * thread (IORING_SETUP_SQPOLL).
> > > > *
> > > > + * @uring_async_cmd:
> > > > + * Check whether the file_operations async_cmd is allowed to run.
> > > > + *
> > > > */
> > > > union security_list_options {
> > > > #define LSM_HOOK(RET, DEFAULT, NAME, ...) RET (*NAME)(__VA_ARGS__);
> > > > diff --git a/include/linux/security.h b/include/linux/security.h
> > > > index 6d72772182c8..4d7f72813d75 100644
> > > > --- a/include/linux/security.h
> > > > +++ b/include/linux/security.h
> > > > @@ -2041,6 +2041,7 @@ static inline int security_perf_event_write(struct perf_event *event)
> > > > #ifdef CONFIG_SECURITY
> > > > extern int security_uring_override_creds(const struct cred *new);
> > > > extern int security_uring_sqpoll(void);
> > > > +extern int security_uring_async_cmd(struct io_uring_cmd *ioucmd);
> > > > #else
> > > > static inline int security_uring_override_creds(const struct cred *new)
> > > > {
> > > > @@ -2050,6 +2051,10 @@ static inline int security_uring_sqpoll(void)
> > > > {
> > > > return 0;
> > > > }
> > > > +static inline int security_uring_async_cmd(struct io_uring_cmd *ioucmd)
> > > > +{
> > > > + return 0;
> > > > +}
> > > > #endif /* CONFIG_SECURITY */
> > > > #endif /* CONFIG_IO_URING */
> > > > diff --git a/security/security.c b/security/security.c
> > > > index 22261d79f333..ef96be2f953a 100644
> > > > --- a/security/security.c
> > > > +++ b/security/security.c
> > > > @@ -2640,4 +2640,8 @@ int security_uring_sqpoll(void)
> > > > {
> > > > return call_int_hook(uring_sqpoll, 0);
> > > > }
> > > > +int security_uring_async_cmd(struct io_uring_cmd *ioucmd)
> > > > +{
> > > > + return call_int_hook(uring_async_cmd, 0, ioucmd);
> > > I don't have a good understanding of what information is in ioucmd.
> > > I am afraid that there may not be enough for a security module to
> > > make appropriate decisions in all cases. I am especially concerned
> > > about the modules that use path hooks, but based on the issues we've
> > > always had with ioctl and the like I fear for all cases.
> > As Paul pointed out, this particular LSM hook would not be needed if we can
> > somehow ensure users of the cmd path use their respective LSMs there. It
> > is not easy to force users to have the LSM hook to be used, one idea
> > might be to have a registration mechanism which allows users to also
> > specify the LSM hook, but these can vary in arguments, so perhaps then
> > what is needed is the LSM type in enum form, and internally we have a
> > mapping of these. That way we slowly itemize which cmds we *do* allow
> > for, thus vetting at the same time a respective LSM hook. Thoughts?
>
> tl;dr - Yuck.
>
> I don't see how your registration mechanism would be easier than
> getting "users of the cmd path" to use the LSM mechanism the way
> everyone else does. What it would do is pass responsibility for
> dealing with LSM to the io_uring core team.
Agreed, I was just trying to be proactive to help with the LSM stuff.
But indeed, that path would be complicated and I agree probably not
the most practical one.
> Experience has shown
> that dealing with the security issues after the fact is much
> harder than doing it up front, even when developers wail about
> the burden. Sure, LSM is an unpleasant interface/mechanism, but
> so is locking, and no one gets away without addressing that.
> My $0.02.
So putting the onus on those file_operations which embrace async_cmd to
take LSMs into account seems to be the way to go then, which aligns
with what Paul was suggesting.
Luis
^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [PATCH 00/17] io_uring passthru over nvme
2022-03-13 5:07 ` Kanchan Joshi
@ 2022-03-14 20:30 ` Adam Manzanares
0 siblings, 0 replies; 122+ messages in thread
From: Adam Manzanares @ 2022-03-14 20:30 UTC (permalink / raw)
To: Kanchan Joshi
Cc: Luis Chamberlain, Christoph Hellwig, Kanchan Joshi, Jens Axboe,
Keith Busch, Pavel Begunkov, [email protected],
[email protected], [email protected],
[email protected], [email protected], Pankaj Raghav,
Javier González, Anuj Gupta, [email protected],
[email protected]
On Sun, Mar 13, 2022 at 10:37:53AM +0530, Kanchan Joshi wrote:
> On Sat, Mar 12, 2022 at 7:57 AM Adam Manzanares
> <[email protected]> wrote:
> >
> > On Fri, Mar 11, 2022 at 03:35:04PM -0800, Adam Manzanares wrote:
> > > On Fri, Mar 11, 2022 at 08:43:24AM -0800, Luis Chamberlain wrote:
> > > > On Thu, Mar 10, 2022 at 03:35:02PM +0530, Kanchan Joshi wrote:
> > > > > On Thu, Mar 10, 2022 at 1:59 PM Christoph Hellwig <[email protected]> wrote:
> > > > > >
> > > > > > What branch is this against?
> > > > > Sorry I missed that in the cover.
> > > > > Two options -
> > > > > (a) https://urldefense.com/v3/__https://protect2.fireeye.com/v1/url?k=03500d22-5ccb341f-0351866d-0cc47a31309a-6f95e6932e414a1d&q=1&e=4ca7b05e-2fe6-40d9-bbcf-a4ed687eca9f&u=https*3A*2F*2Fgit.kernel.dk*2Fcgit*2Flinux-block*2Flog*2F*3Fh*3Dio_uring-big-sqe__;JSUlJSUlJSUl!!EwVzqGoTKBqv-0DWAJBm!FujuZ927K3fuIklgYjkWtodmdQnQyBqOw4Ge4M08DU_0oD5tPm0-wS2SZg0MDh8_2-U9$
> > > > > first patch ("128 byte sqe support") is already there.
> > > > > (b) for-next (linux-block), series will fit on top of commit 9e9d83faa
> > > > > ("io_uring: Remove unneeded test in io_run_task_work_sig")
> > > > >
> > > > > > Do you have a git tree available?
> > > > > Not at the moment.
> > > > >
> > > > > @Jens: Please see if it is possible to move patches to your
> > > > > io_uring-big-sqe branch (and maybe rename that to big-sqe-pt.v1).
> > > >
> > > > Since Jens might be busy, I've put up a tree with all this stuff:
> > > >
> > > > https://urldefense.com/v3/__https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux-next.git/log/?h=20220311-io-uring-cmd__;!!EwVzqGoTKBqv-0DWAJBm!FujuZ927K3fuIklgYjkWtodmdQnQyBqOw4Ge4M08DU_0oD5tPm0-wS2SZg0MDiTF0Q7F$
> > > >
> > > > It is based on option (b) mentioned above, I took linux-block for-next
> > > > and reset the tree to commit "io_uring: Remove unneeded test in
> > > > io_run_task_work_sig" before applying the series.
> > >
> > > FYI I can be involved in testing this and have added some colleagues that can
> > > help in this regard. We have been using some form of this work for several
> > > months now and haven't had any issues. That being said some simple tests I have
> > > are not currently working with the above git tree :). I will work to get this
> > > resolved and post an update here.
> >
> > Sorry for the noise, I jumped up the stack too quickly with my tests. The
> > "simple test" actually depends on several pieces of SW not related to the
> > kernel.
>
> Did you read the cover letter? It's not the same *user-interface* as
> the previous series.
> If you did not modify those out-of-kernel layers for the new
> interface, you're bound to see what you saw.
> If you did, please specify what the simple test was. I'll fix that in v2.
I got a little ahead of myself. Looking forward to leveraging this work in the
near future ;)
>
> Otherwise, the throwaway remark "simple tests not working" only
> implies this series is untested. Nothing could be further from the
> truth.
> Rather, this series is more robust than the previous one.
Excellent to hear that this series is robust. My intent was not to claim
this series was untested. It is clear I need to do more hw before making an
attempt to help with testing.
>
> Let me expand a bit more on the testing part that's already there in the cover:
>
> fio -iodepth=256 -rw=randread -ioengine=io_uring -bs=512 -numjobs=1
> -runtime=60 -group_reporting -iodepth_batch_submit=64
> -iodepth_batch_complete_min=1 -iodepth_batch_complete_max=64
> -fixedbufs=1 -hipri=1 -sqthread_poll=0 -filename=/dev/ng0n1
> -name=io_uring_256 -uring_cmd=1
>
> When I reduce the above command-line to do a single IO, I call that a simple test.
> A simple test touches almost *everything* that the patches build (i.e.
> async, fixed-buffer, plugging, bio-cache, polling).
> And larger tests combine these knobs in various ways, QD ranging from
> 1, 2, 4...up to 256; on general and perf-optimized kernel config; with
> big-sqe and normal-sqe (pointer one). And all this is repeated on the
> block interface (regular io) too, which covers the regression part.
> Sure, I can add more tests for checking regression. But no, I don't
> expect any simple test to fail. And that applies to Luis' tree as
> well. Tried that too again.
Looks like you have all of your bases covered. Keep up the good work.
>
>
> --
> Joshi
^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [PATCH 05/17] nvme: wire-up support for async-passthru on char-device.
2022-03-14 16:23 ` Kanchan Joshi
@ 2022-03-15 8:54 ` Christoph Hellwig
2022-03-16 7:27 ` Kanchan Joshi
0 siblings, 1 reply; 122+ messages in thread
From: Christoph Hellwig @ 2022-03-15 8:54 UTC (permalink / raw)
To: Kanchan Joshi
Cc: Christoph Hellwig, axboe, kbusch, asml.silence, io-uring,
linux-nvme, linux-block, sbates, logang, pankydev8, javier,
mcgrof, a.manzanares, joshiiitr, anuj20.g
On Mon, Mar 14, 2022 at 09:53:56PM +0530, Kanchan Joshi wrote:
>>> +struct nvme_uring_cmd_pdu {
>>> + u32 meta_len;
>>> + union {
>>> + struct bio *bio;
>>> + struct request *req;
>>> + };
>>> + void *meta; /* kernel-resident buffer */
>>> + void __user *meta_buffer;
>>> +} __packed;
>>
>> Why is this marked __packed?
> Did not like doing it, but had to.
> If not packed, this takes 32 bytes of space. While driver-pdu in struct
> io_uring_cmd can take max 30 bytes. Packing nvme-pdu brought it down to
> 28 bytes, which fits and gives 2 bytes back.
What if you move meta_len to the end? Even if we need the __packed,
that will avoid all the unaligned access to pointers, which on some
architectures will crash the kernel.
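Just to illustrate the idea (a sketch only, not the actual patch), with the
same fields but meta_len moved last:

struct nvme_uring_cmd_pdu {
	union {
		struct bio *bio;
		struct request *req;
	};
	void *meta;		/* kernel-resident buffer */
	void __user *meta_buffer;
	u32 meta_len;
} __packed;

All three pointers then sit at naturally aligned offsets (0, 8, 16 on
64-bit) and the struct still fits in 28 bytes even with __packed.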
> And on moving meta elements outside the driver, my worry is that it
> reduces scope of uring-cmd infra and makes it nvme passthru specific.
> At this point uring-cmd is still generic async ioctl/fsctl facility
> which may find other users (than nvme-passthru) down the line. Organization
> of fields within "struct io_uring_cmd" is around the rule
> that a field is kept out (of 28 bytes pdu) only if is accessed by both
> io_uring and driver.
We have plenty of other interfaces of that kind. Sockets are one case
already, and regular read/write with protection information will be
another one. So having some core infrastrucure for "secondary data"
seems very useful.
> I see, so there is room for adding some efficiency.
> Hope it will be ok if I carry this out as a separate effort.
> Since this is about touching blk_mq_complete_request at its heart, and
> improving sync-direct-io, this does not seem best-fit and slow this
> series down.
I'd really rather do this properly, especially as the current effort
adds new exported interfaces.
> Deferring by ipi or softirq never occured. Neither for block nor for
> char. Softirq is obvious since I was not running against scsi (or nvme with
> single queue). I could not spot whether this is really a overhead, at
> least for nvme.
This tends to kick in if you have fewer queues than cpu cores. Quite
common with either a high core count or a not very high end nvme
controller.
^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [PATCH 08/17] nvme: enable passthrough with fixed-buffer
2022-03-14 13:06 ` Kanchan Joshi
@ 2022-03-15 8:55 ` Christoph Hellwig
0 siblings, 0 replies; 122+ messages in thread
From: Christoph Hellwig @ 2022-03-15 8:55 UTC (permalink / raw)
To: Kanchan Joshi
Cc: Christoph Hellwig, Kanchan Joshi, Jens Axboe, Keith Busch,
Pavel Begunkov, io-uring, linux-nvme, linux-block, sbates, logang,
Pankaj Raghav, Javier González, Luis Chamberlain,
Adam Manzanares, Anuj Gupta
On Mon, Mar 14, 2022 at 06:36:17PM +0530, Kanchan Joshi wrote:
> > And I'm also really worried about only supporting fixed buffers. Fixed
> > buffers are a really nice benchmarketing feature, but without supporting
> > arbitrary buffers this is rather useless in real life.
>
> Sorry, I did not get your point on arbitrary buffers.
> The goal has been to match/surpass io_uring's block-io peak perf, so
> pre-mapped buffers had to be added.
I'm completely uninterested in adding code to the nvme driver that is
just intended for benchmarketing. The fixed buffers are nice for
benchmarking and a very small number of real use cases (e.g. fixed size
database log), but for this feature to be generally useful for the real
world we'll need to support arbitrary user memory.
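To be clear on what "arbitrary user memory" means here: mapping whatever
buffer the command points at on a per-I/O basis, roughly along these lines
(an illustrative sketch only; the helper name is made up):

static int map_user_buf(struct request *req, void __user *ubuf,
			unsigned long len)
{
	/* pin/map the user pages for this request, no pre-registration */
	return blk_rq_map_user(req->q, req, NULL, ubuf, len, GFP_KERNEL);
}

rather than requiring the buffer to have been registered up front as a
fixed buffer.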
^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [PATCH 09/17] io_uring: plug for async bypass
2022-03-14 14:33 ` Ming Lei
@ 2022-03-15 8:56 ` Christoph Hellwig
0 siblings, 0 replies; 122+ messages in thread
From: Christoph Hellwig @ 2022-03-15 8:56 UTC (permalink / raw)
To: Ming Lei
Cc: Christoph Hellwig, Kanchan Joshi, axboe, kbusch, asml.silence,
io-uring, linux-nvme, linux-block, sbates, logang, pankydev8,
javier, mcgrof, a.manzanares, joshiiitr, anuj20.g
On Mon, Mar 14, 2022 at 10:33:57PM +0800, Ming Lei wrote:
> Plug support for passthrough rq is added in the following patch, so
> this one may be put after patch 'block: wire-up support for plugging'.
Yes. And as already mentioned earlier, that other patch really needs
a much better title and description.
^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [PATCH 12/17] nvme: enable bio-cache for fixed-buffer passthru
2022-03-14 18:18 ` Kanchan Joshi
@ 2022-03-15 8:57 ` Christoph Hellwig
0 siblings, 0 replies; 122+ messages in thread
From: Christoph Hellwig @ 2022-03-15 8:57 UTC (permalink / raw)
To: Kanchan Joshi
Cc: Christoph Hellwig, Kanchan Joshi, Jens Axboe, Keith Busch,
Pavel Begunkov, io-uring, linux-nvme, linux-block, sbates, logang,
Pankaj Raghav, Javier González, Luis Chamberlain,
Adam Manzanares, Anuj Gupta
On Mon, Mar 14, 2022 at 11:48:43PM +0530, Kanchan Joshi wrote:
> Yes, the only reason to switch from kmalloc to bio-set was being able
> to use bio-cache.
> Towards the goal of matching peak perf of io_uring's block io path.
> Is it too bad to go down this route; Is there any different way to
> enable bio-cache for passthru.
How does this actually make a difference vs, say, a slab cache? Slab/slub
seems to be very fine-tuned for these kinds of patterns using per-cpu
caches.
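For comparison, the slab alternative would look roughly like the sketch
below (illustrative only; names are made up, bio_init() and error handling
are omitted):

static struct kmem_cache *pt_bio_cache;

static int pt_bio_cache_init(void)
{
	pt_bio_cache = kmem_cache_create("pt_bio", sizeof(struct bio), 0,
					 SLAB_HWCACHE_ALIGN, NULL);
	return pt_bio_cache ? 0 : -ENOMEM;
}

static struct bio *pt_bio_alloc(gfp_t gfp)
{
	/* slub already serves this from a per-cpu freelist */
	return kmem_cache_alloc(pt_bio_cache, gfp);
}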
^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [PATCH 14/17] io_uring: add polling support for uring-cmd
2022-03-14 10:16 ` Kanchan Joshi
@ 2022-03-15 8:57 ` Christoph Hellwig
2022-03-16 5:09 ` Kanchan Joshi
0 siblings, 1 reply; 122+ messages in thread
From: Christoph Hellwig @ 2022-03-15 8:57 UTC (permalink / raw)
To: Kanchan Joshi
Cc: Christoph Hellwig, Kanchan Joshi, Jens Axboe, Keith Busch,
Pavel Begunkov, io-uring, linux-nvme, linux-block, sbates, logang,
Pankaj Raghav, Javier González, Luis Chamberlain,
Adam Manzanares, Anuj Gupta
On Mon, Mar 14, 2022 at 03:46:08PM +0530, Kanchan Joshi wrote:
> But, after you did bio based polling, we need just the bio to poll.
> iocb is a big structure (48 bytes), and if we try to place it in
> struct io_uring_cmd, we will just blow up the cacheline in io_uring
> (first one in io_kiocb).
> So we just store that bio pointer in io_uring_cmd on submission
> (please see patch 15).
> And here on completion we pick that bio, and put that into this local
> iocb, simply because ->iopoll needs it.
> Do you see I am still missing anything here?
Yes. The VFS layer interface for polling is the kiocb. Don't break
it. The bio is just an implementation detail.
^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [PATCH 05/17] nvme: wire-up support for async-passthru on char-device.
2022-03-14 17:54 ` Kanchan Joshi
@ 2022-03-15 9:02 ` Sagi Grimberg
2022-03-16 9:21 ` Kanchan Joshi
0 siblings, 1 reply; 122+ messages in thread
From: Sagi Grimberg @ 2022-03-15 9:02 UTC (permalink / raw)
To: Kanchan Joshi
Cc: Kanchan Joshi, Jens Axboe, Christoph Hellwig, Keith Busch,
Pavel Begunkov, io-uring, linux-nvme, linux-block, sbates, logang,
Pankaj Raghav, Javier González, Luis Chamberlain,
Adam Manzanares, Anuj Gupta
>>> +int nvme_ns_head_chr_async_cmd(struct io_uring_cmd *ioucmd)
>>> +{
>>> + struct cdev *cdev = file_inode(ioucmd->file)->i_cdev;
>>> + struct nvme_ns_head *head = container_of(cdev, struct nvme_ns_head, cdev);
>>> + int srcu_idx = srcu_read_lock(&head->srcu);
>>> + struct nvme_ns *ns = nvme_find_path(head);
>>> + int ret = -EWOULDBLOCK;
>>> +
>>> + if (ns)
>>> + ret = nvme_ns_async_ioctl(ns, ioucmd);
>>> + srcu_read_unlock(&head->srcu, srcu_idx);
>>> + return ret;
>>> +}
>>
>> No one cares that this has no multipathing capabilities what-so-ever?
>> despite being issued on the mpath device node?
>>
>> I know we are not doing multipathing for userspace today, but this
>> feels like an alternative I/O interface for nvme, seems a bit cripled
>> with zero multipathing capabilities...
>
> Multipathing is on the radar. Either in the first cut or in
> subsequent. Thanks for bringing this up.
Good to know...
> So the char-node (/dev/ngX) will be exposed to the host if we enable
> controller passthru on the target side. And then the host can send
> commands using uring-passthru in the same way.
Not sure I follow...
> May I know what are the other requirements here.
Again, not sure I follow... The fundamental capability is to
requeue/failover I/O if there is no I/O capable path available...
> Bit of a shame that I missed adding that in the LSF proposal, but it's
> correctible.
^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [PATCH 14/17] io_uring: add polling support for uring-cmd
2022-03-15 8:57 ` Christoph Hellwig
@ 2022-03-16 5:09 ` Kanchan Joshi
2022-03-24 6:30 ` Christoph Hellwig
0 siblings, 1 reply; 122+ messages in thread
From: Kanchan Joshi @ 2022-03-16 5:09 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Kanchan Joshi, Jens Axboe, Keith Busch, Pavel Begunkov, io-uring,
linux-nvme, linux-block, sbates, logang, Pankaj Raghav,
Javier González, Luis Chamberlain, Adam Manzanares,
Anuj Gupta
On Tue, Mar 15, 2022 at 09:57:45AM +0100, Christoph Hellwig wrote:
>On Mon, Mar 14, 2022 at 03:46:08PM +0530, Kanchan Joshi wrote:
>> But, after you did bio based polling, we need just the bio to poll.
>> iocb is a big structure (48 bytes), and if we try to place it in
>> struct io_uring_cmd, we will just blow up the cacheline in io_uring
>> (first one in io_kiocb).
>> So we just store that bio pointer in io_uring_cmd on submission
>> (please see patch 15).
>> And here on completion we pick that bio, and put that into this local
>> iocb, simply because ->iopoll needs it.
>> Do you see I am still missing anything here?
>
>Yes. The VFS layer interface for polling is the kiocb. Don't break
>it. The bio is just an implementation detail.
So how about adding ->async_cmd_poll in file_operations (since this
corresponds to ->async_cmd)?
It will take a struct io_uring_cmd pointer as its parameter.
->iopoll and ->async_cmd_poll will differ in what they accept (kiocb
vs io_uring_cmd). The provider may use bio_poll, or whatever else is the
implementation detail.
For read/write, the submission interface took a kiocb, and the completion
interface (->iopoll) operated on the same.
For uring/async-cmd, the submission interface takes an io_uring_cmd, but
completion used the kiocb-based ->iopoll. The new ->async_cmd_poll should
settle this.
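Roughly like the below (just a sketch to show the shape of it;
nvme_uring_cmd_pdu() is assumed to be the helper that returns the driver
pdu from the io_uring_cmd):

/* new hook in file_operations, next to ->async_cmd */
int (*async_cmd_poll)(struct io_uring_cmd *ioucmd);

/* and the nvme char-device implementation could then be: */
static int nvme_ns_chr_async_cmd_poll(struct io_uring_cmd *ioucmd)
{
	struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd);

	if (!pdu->bio)
		return 0;
	return bio_poll(pdu->bio, NULL, 0);
}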
^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [PATCH 05/17] nvme: wire-up support for async-passthru on char-device.
2022-03-15 8:54 ` Christoph Hellwig
@ 2022-03-16 7:27 ` Kanchan Joshi
2022-03-24 6:22 ` Christoph Hellwig
0 siblings, 1 reply; 122+ messages in thread
From: Kanchan Joshi @ 2022-03-16 7:27 UTC (permalink / raw)
To: Christoph Hellwig
Cc: axboe, kbusch, asml.silence, io-uring, linux-nvme, linux-block,
sbates, logang, pankydev8, javier, mcgrof, a.manzanares,
joshiiitr, anuj20.g
On Tue, Mar 15, 2022 at 09:54:10AM +0100, Christoph Hellwig wrote:
>On Mon, Mar 14, 2022 at 09:53:56PM +0530, Kanchan Joshi wrote:
>>>> +struct nvme_uring_cmd_pdu {
>>>> + u32 meta_len;
>>>> + union {
>>>> + struct bio *bio;
>>>> + struct request *req;
>>>> + };
>>>> + void *meta; /* kernel-resident buffer */
>>>> + void __user *meta_buffer;
>>>> +} __packed;
>>>
>>> Why is this marked __packed?
>> Did not like doing it, but had to.
>> If not packed, this takes 32 bytes of space. While driver-pdu in struct
>> io_uring_cmd can take max 30 bytes. Packing nvme-pdu brought it down to
>> 28 bytes, which fits and gives 2 bytes back.
>
>What if you move meta_len to the end? Even if we need the __packed
>that will avoid all the unaligned access to pointers, which on some
>architectures will crash the kernel.
ah, right. Will move that to the end.
>> And on moving meta elements outside the driver, my worry is that it
>> reduces scope of uring-cmd infra and makes it nvme passthru specific.
>> At this point uring-cmd is still generic async ioctl/fsctl facility
>> which may find other users (than nvme-passthru) down the line. Organization
>> of fields within "struct io_uring_cmd" is around the rule
>> that a field is kept out (of 28 bytes pdu) only if is accessed by both
>> io_uring and driver.
>
>We have plenty of other interfaces of that kind. Sockets are one case
>already, and regular read/write with protection information will be
>another one. So having some core infrastrucure for "secondary data"
>seems very useful.
So what is the picture that you have in mind for struct io_uring_cmd?
Moving meta fields out makes it look like this -
@@ -28,7 +28,10 @@ struct io_uring_cmd {
u32 cmd_op;
u16 cmd_len;
u16 unused;
- u8 pdu[28]; /* available inline for free use */
+ void __user *meta_buffer; /* nvme pt specific */
+ u32 meta_len; /* nvme pt specific */
+ u8 pdu[16]; /* available inline for free use */
+
};
And the corresponding 16-byte nvme pdu -
struct nvme_uring_cmd_pdu {
- u32 meta_len;
union {
struct bio *bio;
struct request *req;
};
void *meta; /* kernel-resident buffer */
- void __user *meta_buffer;
} __packed;
I do not understand how this helps. Only the generic space (28 bytes)
got reduced to 16 bytes.
>> I see, so there is room for adding some efficiency.
>> Hope it will be ok if I carry this out as a separate effort.
>> Since this is about touching blk_mq_complete_request at its heart, and
>> improving sync-direct-io, this does not seem best-fit and slow this
>> series down.
>
>I'd really rather do this properly, especially as the current effort
>adds new exported interfaces.
Seems you are referring to io_uring_cmd_complete_in_task().
We would still need to use/export that even if we somehow manage to move
task-work trigger from nvme-function to blk_mq_complete_request.
io_uring's task-work infra is more baked than raw task-work infra.
It would not be good to repeat all that code elsewhere.
I tried the raw one in the first attempt, and Jens suggested moving to the
baked one. Here is the link that gave birth to this interface -
https://lore.kernel.org/linux-nvme/[email protected]/
>> Deferring by ipi or softirq never occured. Neither for block nor for
>> char. Softirq is obvious since I was not running against scsi (or nvme with
>> single queue). I could not spot whether this is really a overhead, at
>> least for nvme.
>
>This tends to kick in if you have fewer queues than cpu cores. Quite
>common with either a high core count or a not very high end nvme
>controller.
I will check that.
But switching (irq to task-work) is more generic and not about this series.
Triggering task-work anyway happens for regular read/write
completion too (in io_uring)...in the same return path involving
blk_mq_complete_request. For passthru, we are just triggering this
somewhat earlier in the completion path.
^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [PATCH 05/17] nvme: wire-up support for async-passthru on char-device.
2022-03-15 9:02 ` Sagi Grimberg
@ 2022-03-16 9:21 ` Kanchan Joshi
2022-03-16 10:56 ` Sagi Grimberg
0 siblings, 1 reply; 122+ messages in thread
From: Kanchan Joshi @ 2022-03-16 9:21 UTC (permalink / raw)
To: Sagi Grimberg
Cc: Kanchan Joshi, Jens Axboe, Christoph Hellwig, Keith Busch,
Pavel Begunkov, io-uring, linux-nvme, linux-block, sbates, logang,
Pankaj Raghav, Javier González, Luis Chamberlain,
Adam Manzanares, Anuj Gupta
On Tue, Mar 15, 2022 at 11:02:30AM +0200, Sagi Grimberg wrote:
>
>>>>+int nvme_ns_head_chr_async_cmd(struct io_uring_cmd *ioucmd)
>>>>+{
>>>>+ struct cdev *cdev = file_inode(ioucmd->file)->i_cdev;
>>>>+ struct nvme_ns_head *head = container_of(cdev, struct nvme_ns_head, cdev);
>>>>+ int srcu_idx = srcu_read_lock(&head->srcu);
>>>>+ struct nvme_ns *ns = nvme_find_path(head);
>>>>+ int ret = -EWOULDBLOCK;
>>>>+
>>>>+ if (ns)
>>>>+ ret = nvme_ns_async_ioctl(ns, ioucmd);
>>>>+ srcu_read_unlock(&head->srcu, srcu_idx);
>>>>+ return ret;
>>>>+}
>>>
>>>No one cares that this has no multipathing capabilities what-so-ever?
>>>despite being issued on the mpath device node?
>>>
>>>I know we are not doing multipathing for userspace today, but this
>>>feels like an alternative I/O interface for nvme, seems a bit cripled
>>>with zero multipathing capabilities...
>>
>>Multipathing is on the radar. Either in the first cut or in
>>subsequent. Thanks for bringing this up.
>
>Good to know...
>
>>So the char-node (/dev/ngX) will be exposed to the host if we enable
>>controller passthru on the target side. And then the host can send
>>commands using uring-passthru in the same way.
>
>Not sure I follow...
Doing this on target side:
echo -n /dev/nvme0 > /sys/kernel/config/nvmet/subsystems/testnqn/passthru/device_path
echo 1 > /sys/kernel/config/nvmet/subsystems/testnqn/passthru/enable
>>May I know what are the other requirements here.
>
>Again, not sure I follow... The fundamental capability is to
>requeue/failover I/O if there is no I/O capable path available...
That is covered I think, with nvme_find_path() at places including the
one you highlighted above.
^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [PATCH 05/17] nvme: wire-up support for async-passthru on char-device.
2022-03-16 9:21 ` Kanchan Joshi
@ 2022-03-16 10:56 ` Sagi Grimberg
2022-03-16 11:51 ` Kanchan Joshi
0 siblings, 1 reply; 122+ messages in thread
From: Sagi Grimberg @ 2022-03-16 10:56 UTC (permalink / raw)
To: Kanchan Joshi
Cc: Kanchan Joshi, Jens Axboe, Christoph Hellwig, Keith Busch,
Pavel Begunkov, io-uring, linux-nvme, linux-block, sbates, logang,
Pankaj Raghav, Javier González, Luis Chamberlain,
Adam Manzanares, Anuj Gupta
On 3/16/22 11:21, Kanchan Joshi wrote:
> On Tue, Mar 15, 2022 at 11:02:30AM +0200, Sagi Grimberg wrote:
>>
>>>>> +int nvme_ns_head_chr_async_cmd(struct io_uring_cmd *ioucmd)
>>>>> +{
>>>>> + struct cdev *cdev = file_inode(ioucmd->file)->i_cdev;
>>>>> + struct nvme_ns_head *head = container_of(cdev, struct
>>>>> nvme_ns_head, cdev);
>>>>> + int srcu_idx = srcu_read_lock(&head->srcu);
>>>>> + struct nvme_ns *ns = nvme_find_path(head);
>>>>> + int ret = -EWOULDBLOCK;
>>>>> +
>>>>> + if (ns)
>>>>> + ret = nvme_ns_async_ioctl(ns, ioucmd);
>>>>> + srcu_read_unlock(&head->srcu, srcu_idx);
>>>>> + return ret;
>>>>> +}
>>>>
>>>> No one cares that this has no multipathing capabilities what-so-ever?
>>>> despite being issued on the mpath device node?
>>>>
>>>> I know we are not doing multipathing for userspace today, but this
>>>> feels like an alternative I/O interface for nvme, seems a bit cripled
>>>> with zero multipathing capabilities...
>>>
>>> Multipathing is on the radar. Either in the first cut or in
>>> subsequent. Thanks for bringing this up.
>>
>> Good to know...
>>
>>> So the char-node (/dev/ngX) will be exposed to the host if we enable
>>> controller passthru on the target side. And then the host can send
>>> commands using uring-passthru in the same way.
>>
>> Not sure I follow...
>
> Doing this on target side:
> echo -n /dev/nvme0 >
> /sys/kernel/config/nvmet/subsystems/testnqn/passthru/device_path
> echo 1 > /sys/kernel/config/nvmet/subsystems/testnqn/passthru/enable
Cool, what does that have to do with what I asked?
>>> May I know what are the other requirements here.
>>
>> Again, not sure I follow... The fundamental capability is to
>> requeue/failover I/O if there is no I/O capable path available...
>
> That is covered I think, with nvme_find_path() at places including the
> one you highlighted above.
No it isn't. nvme_find_path is a simple function that retrieves an I/O
capable path which is not guaranteed to exist; it has nothing to do with
I/O requeue/failover.
Please see nvme_ns_head_submit_bio, nvme_failover_req,
nvme_requeue_work.
^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [PATCH 05/17] nvme: wire-up support for async-passthru on char-device.
2022-03-16 10:56 ` Sagi Grimberg
@ 2022-03-16 11:51 ` Kanchan Joshi
2022-03-16 13:52 ` Sagi Grimberg
0 siblings, 1 reply; 122+ messages in thread
From: Kanchan Joshi @ 2022-03-16 11:51 UTC (permalink / raw)
To: Sagi Grimberg
Cc: Kanchan Joshi, Jens Axboe, Christoph Hellwig, Keith Busch,
Pavel Begunkov, io-uring, linux-nvme, linux-block, sbates, logang,
Pankaj Raghav, Javier González, Luis Chamberlain,
Adam Manzanares, Anuj Gupta
> On 3/16/22 11:21, Kanchan Joshi wrote:
> > On Tue, Mar 15, 2022 at 11:02:30AM +0200, Sagi Grimberg wrote:
> >>
> >>>>> +int nvme_ns_head_chr_async_cmd(struct io_uring_cmd *ioucmd)
> >>>>> +{
> >>>>> + struct cdev *cdev = file_inode(ioucmd->file)->i_cdev;
> >>>>> + struct nvme_ns_head *head = container_of(cdev, struct
> >>>>> nvme_ns_head, cdev);
> >>>>> + int srcu_idx = srcu_read_lock(&head->srcu);
> >>>>> + struct nvme_ns *ns = nvme_find_path(head);
> >>>>> + int ret = -EWOULDBLOCK;
> >>>>> +
> >>>>> + if (ns)
> >>>>> + ret = nvme_ns_async_ioctl(ns, ioucmd);
> >>>>> + srcu_read_unlock(&head->srcu, srcu_idx);
> >>>>> + return ret;
> >>>>> +}
> >>>>
> >>>> No one cares that this has no multipathing capabilities what-so-ever?
> >>>> despite being issued on the mpath device node?
> >>>>
> >>>> I know we are not doing multipathing for userspace today, but this
> >>>> feels like an alternative I/O interface for nvme, seems a bit cripled
> >>>> with zero multipathing capabilities...
> >>>
> >>> Multipathing is on the radar. Either in the first cut or in
> >>> subsequent. Thanks for bringing this up.
> >>
> >> Good to know...
> >>
> >>> So the char-node (/dev/ngX) will be exposed to the host if we enable
> >>> controller passthru on the target side. And then the host can send
> >>> commands using uring-passthru in the same way.
> >>
> >> Not sure I follow...
> >
> > Doing this on target side:
> > echo -n /dev/nvme0 >
> > /sys/kernel/config/nvmet/subsystems/testnqn/passthru/device_path
> > echo 1 > /sys/kernel/config/nvmet/subsystems/testnqn/passthru/enable
>
> Cool, what does that have to do with what I asked?
Maybe nothing.
This was rather about how to set up nvmeof if the block interface does not
exist for the underlying nvme device.
> >>> May I know what are the other requirements here.
> >>
> >> Again, not sure I follow... The fundamental capability is to
> >> requeue/failover I/O if there is no I/O capable path available...
> >
> > That is covered I think, with nvme_find_path() at places including the
> > one you highlighted above.
>
> No it isn't. nvme_find_path is a simple function that retrieves an I/O
> capable path which is not guaranteed to exist, it has nothing to do with
> I/O requeue/failover.
Got it, thanks. Passthrough (sync or async) just returns the failure
to user-space in that case.
There is no attempt to retry/requeue as the block path does.
--
Kanchan
^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [PATCH 05/17] nvme: wire-up support for async-passthru on char-device.
2022-03-16 11:51 ` Kanchan Joshi
@ 2022-03-16 13:52 ` Sagi Grimberg
2022-03-16 14:35 ` Jens Axboe
0 siblings, 1 reply; 122+ messages in thread
From: Sagi Grimberg @ 2022-03-16 13:52 UTC (permalink / raw)
To: Kanchan Joshi
Cc: Kanchan Joshi, Jens Axboe, Christoph Hellwig, Keith Busch,
Pavel Begunkov, io-uring, linux-nvme, linux-block, sbates, logang,
Pankaj Raghav, Javier González, Luis Chamberlain,
Adam Manzanares, Anuj Gupta
>>>>>> No one cares that this has no multipathing capabilities what-so-ever?
>>>>>> despite being issued on the mpath device node?
>>>>>>
>>>>>> I know we are not doing multipathing for userspace today, but this
>>>>>> feels like an alternative I/O interface for nvme, seems a bit cripled
>>>>>> with zero multipathing capabilities...
[...]
> Got it, thanks. Passthrough (sync or async) just returns the failure
> to user-space if that fails.
> No attempt to retry/requeue as the block path does.
I know, and that was my original question: does no one care that this
interface completely lacks this capability? Maybe it is fine, but
it is not a trivial assumption given that this is designed to be more
than an interface to send admin/vs commands to the controller...
^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [PATCH 05/17] nvme: wire-up support for async-passthru on char-device.
2022-03-16 13:52 ` Sagi Grimberg
@ 2022-03-16 14:35 ` Jens Axboe
2022-03-16 14:50 ` Sagi Grimberg
0 siblings, 1 reply; 122+ messages in thread
From: Jens Axboe @ 2022-03-16 14:35 UTC (permalink / raw)
To: Sagi Grimberg, Kanchan Joshi
Cc: Kanchan Joshi, Christoph Hellwig, Keith Busch, Pavel Begunkov,
io-uring, linux-nvme, linux-block, sbates, logang, Pankaj Raghav,
Javier González, Luis Chamberlain, Adam Manzanares,
Anuj Gupta
On 3/16/22 7:52 AM, Sagi Grimberg wrote:
>
>>>>>>> No one cares that this has no multipathing capabilities what-so-ever?
>>>>>>> despite being issued on the mpath device node?
>>>>>>>
>>>>>>> I know we are not doing multipathing for userspace today, but this
>>>>>>> feels like an alternative I/O interface for nvme, seems a bit cripled
>>>>>>> with zero multipathing capabilities...
>
> [...]
>
>> Got it, thanks. Passthrough (sync or async) just returns the failure
>> to user-space if that fails.
>> No attempt to retry/requeue as the block path does.
>
> I know, and that was my original question, no one cares that this
> interface completely lacks this capability? Maybe it is fine, but
> it is not a trivial assumption given that this is designed to be more
> than an interface to send admin/vs commands to the controller...
Most people don't really care about or use multipath, so it's not a
primary goal. For passthrough, most request types should hit the
exact target. I would suggest that if someone cares about multipath for
specific commands, they be flagged as such.
--
Jens Axboe
^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [PATCH 05/17] nvme: wire-up support for async-passthru on char-device.
2022-03-16 14:35 ` Jens Axboe
@ 2022-03-16 14:50 ` Sagi Grimberg
2022-03-24 6:20 ` Christoph Hellwig
0 siblings, 1 reply; 122+ messages in thread
From: Sagi Grimberg @ 2022-03-16 14:50 UTC (permalink / raw)
To: Jens Axboe, Kanchan Joshi
Cc: Kanchan Joshi, Christoph Hellwig, Keith Busch, Pavel Begunkov,
io-uring, linux-nvme, linux-block, sbates, logang, Pankaj Raghav,
Javier González, Luis Chamberlain, Adam Manzanares,
Anuj Gupta
>> [...]
>>
>>> Got it, thanks. Passthrough (sync or async) just returns the failure
>>> to user-space if that fails.
>>> No attempt to retry/requeue as the block path does.
>>
>> I know, and that was my original question, no one cares that this
>> interface completely lacks this capability? Maybe it is fine, but
>> it is not a trivial assumption given that this is designed to be more
>> than an interface to send admin/vs commands to the controller...
>
> Most people don't really care about or use multipath, so it's not a
> primary goal.
This statement is generally correct. However, what application would be
interested in speaking raw nvme to a device and gaining performance that
is even higher than the block layer (which is great to begin with)?
First thing that comes to mind is a high-end storage array, where
dual-ported drives are considered to be the standard. I could argue the
same for a high-end oracle appliance or something like that... Although
in a lot of cases, each nvme port will connect to a different host...
What are the use-cases that need this interface that are the target
here? Don't remember seeing this come up in the cover-letter or previous
iterations...
> For passthrough, most of request types should hit the
> exact target, I would suggest that if someone cares about multipath for
> specific commands, that they be flagged as such.
What do you mean by "they be flagged as such"?
^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [PATCH 10/17] block: wire-up support for plugging
2022-03-14 14:40 ` Ming Lei
@ 2022-03-21 7:02 ` Kanchan Joshi
2022-03-23 1:27 ` Ming Lei
0 siblings, 1 reply; 122+ messages in thread
From: Kanchan Joshi @ 2022-03-21 7:02 UTC (permalink / raw)
To: Ming Lei
Cc: Kanchan Joshi, Christoph Hellwig, Jens Axboe, Keith Busch,
Pavel Begunkov, io-uring, linux-nvme, linux-block, sbates, logang,
Pankaj Raghav, Javier González, Luis Chamberlain,
Adam Manzanares, Anuj Gupta
On Mon, Mar 14, 2022 at 10:40:53PM +0800, Ming Lei wrote:
>On Thu, Mar 10, 2022 at 06:10:08PM +0530, Kanchan Joshi wrote:
>> On Thu, Mar 10, 2022 at 2:04 PM Christoph Hellwig <[email protected]> wrote:
>> >
>> > On Tue, Mar 08, 2022 at 08:50:58PM +0530, Kanchan Joshi wrote:
>> > > From: Jens Axboe <[email protected]>
>> > >
>> > > Add support to use plugging if it is enabled, else use default path.
>> >
>> > The subject and this comment don't really explain what is done, and
>> > also don't mention at all why it is done.
>>
>> Missed out, will fix up. But plugging gave a very good hike to IOPS.
>
>But how does plugging improve IOPS here for passthrough request? Not
>see plug->nr_ios is wired to data.nr_tags in blk_mq_alloc_request(),
>which is called by nvme_submit_user_cmd().
Yes, one tag at a time for each request, but none of the requests gets
dispatched; they are instead added to the plug. And when io_uring ends the
plug, the whole batch gets dispatched via ->queue_rqs (previously it went
via ->queue_rq, one request at a time).
Only .plug impact looks like this on passthru-randread:
KIOPS(depth_batch) 1_1 8_2 64_16 128_32
Without plug 159 496 784 785
With plug 159 525 991 1044
Hope it does clarify.
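To make the batching concrete, here is a minimal, standalone sketch of the
mechanism (not the series' actual code; in the series the plug is opened and
closed by io_uring around SQE submission, and nr_cmds/reqs[] below are just
placeholders for this example):

	struct blk_plug plug;
	int i;

	blk_start_plug(&plug);
	for (i = 0; i < nr_cmds; i++)
		/* added to the plug list instead of being dispatched */
		blk_execute_rq_nowait(reqs[i], 0, nvme_end_async_pt);
	/* the whole batch goes down here, via ->queue_rqs() where supported */
	blk_finish_plug(&plug);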
^ permalink raw reply [flat|nested] 122+ messages in thread
* RE: [PATCH 05/17] nvme: wire-up support for async-passthru on char-device.
2022-03-08 15:20 ` [PATCH 05/17] nvme: wire-up support for async-passthru on char-device Kanchan Joshi
` (3 preceding siblings ...)
2022-03-13 21:53 ` Sagi Grimberg
@ 2022-03-22 15:18 ` Clay Mayers
2022-03-22 16:57 ` Kanchan Joshi
4 siblings, 1 reply; 122+ messages in thread
From: Clay Mayers @ 2022-03-22 15:18 UTC (permalink / raw)
To: Kanchan Joshi, [email protected], [email protected], [email protected],
[email protected]
Cc: [email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected], [email protected],
[email protected], [email protected], [email protected],
[email protected]
> From: Kanchan Joshi
> Sent: Tuesday, March 8, 2022 7:21 AM
> To: [email protected]; [email protected]; [email protected];
> [email protected]
> Cc: [email protected]; [email protected]; linux-
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected]
> Subject: [PATCH 05/17] nvme: wire-up support for async-passthru on char-
> device.
>
<snip>
> +static void nvme_pt_task_cb(struct io_uring_cmd *ioucmd)
> +{
> + struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd);
> + struct request *req = pdu->req;
> + int status;
> + struct bio *bio = req->bio;
> +
> + if (nvme_req(req)->flags & NVME_REQ_CANCELLED)
> + status = -EINTR;
> + else
> + status = nvme_req(req)->status;
> +
> + /* we can free request */
> + blk_mq_free_request(req);
> + blk_rq_unmap_user(bio);
> +
> + if (!status && pdu->meta_buffer) {
> + if (copy_to_user(pdu->meta_buffer, pdu->meta, pdu->meta_len))
This copy is incorrectly called for writes.
> + status = -EFAULT;
> + }
> + kfree(pdu->meta);
> +
> + io_uring_cmd_done(ioucmd, status);
> +}
</snip>
^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [PATCH 05/17] nvme: wire-up support for async-passthru on char-device.
2022-03-22 15:18 ` Clay Mayers
@ 2022-03-22 16:57 ` Kanchan Joshi
0 siblings, 0 replies; 122+ messages in thread
From: Kanchan Joshi @ 2022-03-22 16:57 UTC (permalink / raw)
To: Clay Mayers
Cc: Kanchan Joshi, [email protected], [email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected], [email protected],
[email protected], [email protected], [email protected],
[email protected]
> > +
> > + if (!status && pdu->meta_buffer) {
> > + if (copy_to_user(pdu->meta_buffer, pdu->meta, pdu->meta_len))
>
> This copy is incorrectly called for writes.
Indeed. Will fix this up in v2, thanks.
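For illustration, one possible shape of the fix (a sketch only, not the
actual v2 change; it assumes checking the data direction via req_op() on the
driver-private request is sufficient here):

	static void nvme_pt_task_cb(struct io_uring_cmd *ioucmd)
	{
		struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd);
		struct request *req = pdu->req;
		struct bio *bio = req->bio;
		/* copy meta back only when the device wrote it (read direction) */
		bool meta_to_user = req_op(req) == REQ_OP_DRV_IN;
		int status;

		if (nvme_req(req)->flags & NVME_REQ_CANCELLED)
			status = -EINTR;
		else
			status = nvme_req(req)->status;

		blk_mq_free_request(req);
		blk_rq_unmap_user(bio);

		if (!status && meta_to_user && pdu->meta_buffer) {
			if (copy_to_user(pdu->meta_buffer, pdu->meta, pdu->meta_len))
				status = -EFAULT;
		}
		kfree(pdu->meta);

		io_uring_cmd_done(ioucmd, status);
	}

Note the direction is sampled before blk_mq_free_request(), since the request
is gone by the time the copy-back runs.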
--
Kanchan
^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [PATCH 17/17] nvme: enable non-inline passthru commands
2022-03-11 6:27 ` Christoph Hellwig
@ 2022-03-22 17:10 ` Kanchan Joshi
2022-03-24 6:32 ` Christoph Hellwig
2022-04-01 1:22 ` Jens Axboe
0 siblings, 2 replies; 122+ messages in thread
From: Kanchan Joshi @ 2022-03-22 17:10 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Kanchan Joshi, Jens Axboe, Keith Busch, Pavel Begunkov, io-uring,
linux-nvme, linux-block, sbates, logang, Pankaj Raghav,
Javier González, Luis Chamberlain, Adam Manzanares,
Anuj Gupta
On Fri, Mar 11, 2022 at 11:57 AM Christoph Hellwig <[email protected]> wrote:
> > And that's because this ioctl requires additional "__u64 result;" to
> > be updated within "struct nvme_passthru_cmd64".
> > To update that during completion, we need, at the least, the result
> > field to be a pointer "__u64 result_ptr" inside the struct
> > nvme_passthru_cmd64.
> > Do you see that is possible without adding a new passthru ioctl in nvme?
>
> We don't need a new passthrough ioctl in nvme.
Right. Maybe it is easier for applications if they get to use the same
ioctl opcode/structure that they know well already.
> We need to decouple the
> uring cmd properly. And properly in this case means not to add a
> result pointer, but to drop the result from the _input_ structure
> entirely, and instead optionally support a larger CQ entry that contains
> it, just like the first patch does for the SQ.
Creating a large CQE was my thought too. Gave that another stab.
Dealing with two types of CQE felt nasty to fit in liburing's api-set
(which is cqe-heavy).
Jens: Do you already have thoughts (go/no-go) for this route?
From all that we discussed, maybe the path forward could be this:
- inline-cmd/big-sqe is useful if paired with big-cqe. Drop big-sqe
for now if we cannot go the big-cqe route.
- use only indirect-cmd as this requires nothing special, just regular
sqe and cqe. We can support all passthru commands with a lot less
code. No new ioctl in nvme, so same semantics. For common commands
(i.e. read/write) we can still avoid updating the result (put_user
cost will go).
Please suggest if we should approach this any differently in v2.
Thanks,
--
Kanchan
^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [PATCH 10/17] block: wire-up support for plugging
2022-03-21 7:02 ` Kanchan Joshi
@ 2022-03-23 1:27 ` Ming Lei
2022-03-23 1:41 ` Jens Axboe
0 siblings, 1 reply; 122+ messages in thread
From: Ming Lei @ 2022-03-23 1:27 UTC (permalink / raw)
To: Kanchan Joshi
Cc: Kanchan Joshi, Christoph Hellwig, Jens Axboe, Keith Busch,
Pavel Begunkov, io-uring, linux-nvme, linux-block, sbates, logang,
Pankaj Raghav, Javier González, Luis Chamberlain,
Adam Manzanares, Anuj Gupta
On Mon, Mar 21, 2022 at 12:32:08PM +0530, Kanchan Joshi wrote:
> On Mon, Mar 14, 2022 at 10:40:53PM +0800, Ming Lei wrote:
> > On Thu, Mar 10, 2022 at 06:10:08PM +0530, Kanchan Joshi wrote:
> > > On Thu, Mar 10, 2022 at 2:04 PM Christoph Hellwig <[email protected]> wrote:
> > > >
> > > > On Tue, Mar 08, 2022 at 08:50:58PM +0530, Kanchan Joshi wrote:
> > > > > From: Jens Axboe <[email protected]>
> > > > >
> > > > > Add support to use plugging if it is enabled, else use default path.
> > > >
> > > > The subject and this comment don't really explain what is done, and
> > > > also don't mention at all why it is done.
> > >
> > > Missed out, will fix up. But plugging gave a very good hike to IOPS.
> >
> > But how does plugging improve IOPS here for passthrough request? Not
> > see plug->nr_ios is wired to data.nr_tags in blk_mq_alloc_request(),
> > which is called by nvme_submit_user_cmd().
>
> Yes, one tag at a time for each request, but none of the request gets
> dispatched and instead added to the plug. And when io_uring ends the
> plug, the whole batch gets dispatched via ->queue_rqs (otherwise it used
> to be via ->queue_rq, one request at a time).
>
> Only .plug impact looks like this on passthru-randread:
>
> KIOPS(depth_batch) 1_1 8_2 64_16 128_32
> Without plug 159 496 784 785
> With plug 159 525 991 1044
>
> Hope it does clarify.
OK, thanks for your confirmation, then the improvement should be from
batch submission only.
If cached request is enabled, I guess the number could be better.
Thanks,
Ming
^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [PATCH 10/17] block: wire-up support for plugging
2022-03-23 1:27 ` Ming Lei
@ 2022-03-23 1:41 ` Jens Axboe
2022-03-23 1:58 ` Jens Axboe
0 siblings, 1 reply; 122+ messages in thread
From: Jens Axboe @ 2022-03-23 1:41 UTC (permalink / raw)
To: Ming Lei, Kanchan Joshi
Cc: Kanchan Joshi, Christoph Hellwig, Keith Busch, Pavel Begunkov,
io-uring, linux-nvme, linux-block, sbates, logang, Pankaj Raghav,
Javier González, Luis Chamberlain, Adam Manzanares,
Anuj Gupta
On 3/22/22 7:27 PM, Ming Lei wrote:
> On Mon, Mar 21, 2022 at 12:32:08PM +0530, Kanchan Joshi wrote:
>> On Mon, Mar 14, 2022 at 10:40:53PM +0800, Ming Lei wrote:
>>> On Thu, Mar 10, 2022 at 06:10:08PM +0530, Kanchan Joshi wrote:
>>>> On Thu, Mar 10, 2022 at 2:04 PM Christoph Hellwig <[email protected]> wrote:
>>>>>
>>>>> On Tue, Mar 08, 2022 at 08:50:58PM +0530, Kanchan Joshi wrote:
>>>>>> From: Jens Axboe <[email protected]>
>>>>>>
>>>>>> Add support to use plugging if it is enabled, else use default path.
>>>>>
>>>>> The subject and this comment don't really explain what is done, and
>>>>> also don't mention at all why it is done.
>>>>
>>>> Missed out, will fix up. But plugging gave a very good hike to IOPS.
>>>
>>> But how does plugging improve IOPS here for passthrough request? Not
>>> see plug->nr_ios is wired to data.nr_tags in blk_mq_alloc_request(),
>>> which is called by nvme_submit_user_cmd().
>>
>> Yes, one tag at a time for each request, but none of the request gets
>> dispatched and instead added to the plug. And when io_uring ends the
>> plug, the whole batch gets dispatched via ->queue_rqs (otherwise it used
>> to be via ->queue_rq, one request at a time).
>>
>> Only .plug impact looks like this on passthru-randread:
>>
>> KIOPS(depth_batch) 1_1 8_2 64_16 128_32
>> Without plug 159 496 784 785
>> With plug 159 525 991 1044
>>
>> Hope it does clarify.
>
> OK, thanks for your confirmation, then the improvement should be from
> batch submission only.
>
> If cached request is enabled, I guess the number could be better.
Yes, my original test patch pre-dates being able to set a submit count,
it would definitely help improve this case too. The current win is
indeed just from being able to use ->queue_rqs() rather than single
submit.
--
Jens Axboe
^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [PATCH 10/17] block: wire-up support for plugging
2022-03-23 1:41 ` Jens Axboe
@ 2022-03-23 1:58 ` Jens Axboe
2022-03-23 2:10 ` Ming Lei
0 siblings, 1 reply; 122+ messages in thread
From: Jens Axboe @ 2022-03-23 1:58 UTC (permalink / raw)
To: Ming Lei, Kanchan Joshi
Cc: Kanchan Joshi, Christoph Hellwig, Keith Busch, Pavel Begunkov,
io-uring, linux-nvme, linux-block, sbates, logang, Pankaj Raghav,
Javier González, Luis Chamberlain, Adam Manzanares,
Anuj Gupta
On 3/22/22 7:41 PM, Jens Axboe wrote:
> On 3/22/22 7:27 PM, Ming Lei wrote:
>> On Mon, Mar 21, 2022 at 12:32:08PM +0530, Kanchan Joshi wrote:
>>> On Mon, Mar 14, 2022 at 10:40:53PM +0800, Ming Lei wrote:
>>>> On Thu, Mar 10, 2022 at 06:10:08PM +0530, Kanchan Joshi wrote:
>>>>> On Thu, Mar 10, 2022 at 2:04 PM Christoph Hellwig <[email protected]> wrote:
>>>>>>
>>>>>> On Tue, Mar 08, 2022 at 08:50:58PM +0530, Kanchan Joshi wrote:
>>>>>>> From: Jens Axboe <[email protected]>
>>>>>>>
>>>>>>> Add support to use plugging if it is enabled, else use default path.
>>>>>>
>>>>>> The subject and this comment don't really explain what is done, and
>>>>>> also don't mention at all why it is done.
>>>>>
>>>>> Missed out, will fix up. But plugging gave a very good hike to IOPS.
>>>>
>>>> But how does plugging improve IOPS here for passthrough request? Not
>>>> see plug->nr_ios is wired to data.nr_tags in blk_mq_alloc_request(),
>>>> which is called by nvme_submit_user_cmd().
>>>
>>> Yes, one tag at a time for each request, but none of the request gets
>>> dispatched and instead added to the plug. And when io_uring ends the
>>> plug, the whole batch gets dispatched via ->queue_rqs (otherwise it used
>>> to be via ->queue_rq, one request at a time).
>>>
>>> Only .plug impact looks like this on passthru-randread:
>>>
>>> KIOPS(depth_batch) 1_1 8_2 64_16 128_32
>>> Without plug 159 496 784 785
>>> With plug 159 525 991 1044
>>>
>>> Hope it does clarify.
>>
>> OK, thanks for your confirmation, then the improvement should be from
>> batch submission only.
>>
>> If cached request is enabled, I guess the number could be better.
>
> Yes, my original test patch pre-dates being able to set a submit count,
> it would definitely help improve this case too. The current win is
> indeed just from being able to use ->queue_rqs() rather than single
> submit.
Actually that is already there through io_uring, nothing extra is
needed.
--
Jens Axboe
^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [PATCH 10/17] block: wire-up support for plugging
2022-03-23 1:58 ` Jens Axboe
@ 2022-03-23 2:10 ` Ming Lei
2022-03-23 2:17 ` Jens Axboe
0 siblings, 1 reply; 122+ messages in thread
From: Ming Lei @ 2022-03-23 2:10 UTC (permalink / raw)
To: Jens Axboe
Cc: Kanchan Joshi, Kanchan Joshi, Christoph Hellwig, Keith Busch,
Pavel Begunkov, io-uring, linux-nvme, linux-block, sbates, logang,
Pankaj Raghav, Javier González, Luis Chamberlain,
Adam Manzanares, Anuj Gupta
On Tue, Mar 22, 2022 at 07:58:25PM -0600, Jens Axboe wrote:
> On 3/22/22 7:41 PM, Jens Axboe wrote:
> > On 3/22/22 7:27 PM, Ming Lei wrote:
> >> On Mon, Mar 21, 2022 at 12:32:08PM +0530, Kanchan Joshi wrote:
> >>> On Mon, Mar 14, 2022 at 10:40:53PM +0800, Ming Lei wrote:
> >>>> On Thu, Mar 10, 2022 at 06:10:08PM +0530, Kanchan Joshi wrote:
> >>>>> On Thu, Mar 10, 2022 at 2:04 PM Christoph Hellwig <[email protected]> wrote:
> >>>>>>
> >>>>>> On Tue, Mar 08, 2022 at 08:50:58PM +0530, Kanchan Joshi wrote:
> >>>>>>> From: Jens Axboe <[email protected]>
> >>>>>>>
> >>>>>>> Add support to use plugging if it is enabled, else use default path.
> >>>>>>
> >>>>>> The subject and this comment don't really explain what is done, and
> >>>>>> also don't mention at all why it is done.
> >>>>>
> >>>>> Missed out, will fix up. But plugging gave a very good hike to IOPS.
> >>>>
> >>>> But how does plugging improve IOPS here for passthrough request? Not
> >>>> see plug->nr_ios is wired to data.nr_tags in blk_mq_alloc_request(),
> >>>> which is called by nvme_submit_user_cmd().
> >>>
> >>> Yes, one tag at a time for each request, but none of the request gets
> >>> dispatched and instead added to the plug. And when io_uring ends the
> >>> plug, the whole batch gets dispatched via ->queue_rqs (otherwise it used
> >>> to be via ->queue_rq, one request at a time).
> >>>
> >>> Only .plug impact looks like this on passthru-randread:
> >>>
> >>> KIOPS(depth_batch) 1_1 8_2 64_16 128_32
> >>> Without plug 159 496 784 785
> >>> With plug 159 525 991 1044
> >>>
> >>> Hope it does clarify.
> >>
> >> OK, thanks for your confirmation, then the improvement should be from
> >> batch submission only.
> >>
> >> If cached request is enabled, I guess the number could be better.
> >
> > Yes, my original test patch pre-dates being able to set a submit count,
> > it would definitely help improve this case too. The current win is
> > indeed just from being able to use ->queue_rqs() rather than single
> > submit.
>
> Actually that is already there through io_uring, nothing extra is
> needed.
I meant that in this patchset plug->nr_ios isn't wired to data.nr_tags in
blk_mq_alloc_request(), which is called for pt request allocation, so it
looks like cached requests aren't enabled for pt commands?
Thanks,
Ming
^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [PATCH 10/17] block: wire-up support for plugging
2022-03-23 2:10 ` Ming Lei
@ 2022-03-23 2:17 ` Jens Axboe
0 siblings, 0 replies; 122+ messages in thread
From: Jens Axboe @ 2022-03-23 2:17 UTC (permalink / raw)
To: Ming Lei
Cc: Kanchan Joshi, Kanchan Joshi, Christoph Hellwig, Keith Busch,
Pavel Begunkov, io-uring, linux-nvme, linux-block, sbates, logang,
Pankaj Raghav, Javier González, Luis Chamberlain,
Adam Manzanares, Anuj Gupta
On 3/22/22 8:10 PM, Ming Lei wrote:
> On Tue, Mar 22, 2022 at 07:58:25PM -0600, Jens Axboe wrote:
>> On 3/22/22 7:41 PM, Jens Axboe wrote:
>>> On 3/22/22 7:27 PM, Ming Lei wrote:
>>>> On Mon, Mar 21, 2022 at 12:32:08PM +0530, Kanchan Joshi wrote:
>>>>> On Mon, Mar 14, 2022 at 10:40:53PM +0800, Ming Lei wrote:
>>>>>> On Thu, Mar 10, 2022 at 06:10:08PM +0530, Kanchan Joshi wrote:
>>>>>>> On Thu, Mar 10, 2022 at 2:04 PM Christoph Hellwig <[email protected]> wrote:
>>>>>>>>
>>>>>>>> On Tue, Mar 08, 2022 at 08:50:58PM +0530, Kanchan Joshi wrote:
>>>>>>>>> From: Jens Axboe <[email protected]>
>>>>>>>>>
>>>>>>>>> Add support to use plugging if it is enabled, else use default path.
>>>>>>>>
>>>>>>>> The subject and this comment don't really explain what is done, and
>>>>>>>> also don't mention at all why it is done.
>>>>>>>
>>>>>>> Missed out, will fix up. But plugging gave a very good hike to IOPS.
>>>>>>
>>>>>> But how does plugging improve IOPS here for passthrough request? Not
>>>>>> see plug->nr_ios is wired to data.nr_tags in blk_mq_alloc_request(),
>>>>>> which is called by nvme_submit_user_cmd().
>>>>>
>>>>> Yes, one tag at a time for each request, but none of the request gets
>>>>> dispatched and instead added to the plug. And when io_uring ends the
>>>>> plug, the whole batch gets dispatched via ->queue_rqs (otherwise it used
>>>>> to be via ->queue_rq, one request at a time).
>>>>>
>>>>> Only .plug impact looks like this on passthru-randread:
>>>>>
>>>>> KIOPS(depth_batch) 1_1 8_2 64_16 128_32
>>>>> Without plug 159 496 784 785
>>>>> With plug 159 525 991 1044
>>>>>
>>>>> Hope it does clarify.
>>>>
>>>> OK, thanks for your confirmation, then the improvement should be from
>>>> batch submission only.
>>>>
>>>> If cached request is enabled, I guess the number could be better.
>>>
>>> Yes, my original test patch pre-dates being able to set a submit count,
>>> it would definitely help improve this case too. The current win is
>>> indeed just from being able to use ->queue_rqs() rather than single
>>> submit.
>>
>> Actually that is already there through io_uring, nothing extra is
>> needed.
>
> I meant in this patchset that plug->nr_ios isn't wired to data.nr_tags in
> blk_mq_alloc_request(), which is called by pt request allocation, so
> looks cached request isn't enabled for pt commands?
My point is that it is, since submission is through io_uring which does
exactly that.
--
Jens Axboe
^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [PATCH 05/17] nvme: wire-up support for async-passthru on char-device.
2022-03-16 14:50 ` Sagi Grimberg
@ 2022-03-24 6:20 ` Christoph Hellwig
2022-03-24 10:42 ` Sagi Grimberg
0 siblings, 1 reply; 122+ messages in thread
From: Christoph Hellwig @ 2022-03-24 6:20 UTC (permalink / raw)
To: Sagi Grimberg
Cc: Jens Axboe, Kanchan Joshi, Kanchan Joshi, Christoph Hellwig,
Keith Busch, Pavel Begunkov, io-uring, linux-nvme, linux-block,
sbates, logang, Pankaj Raghav, Javier González,
Luis Chamberlain, Adam Manzanares, Anuj Gupta
On Wed, Mar 16, 2022 at 04:50:53PM +0200, Sagi Grimberg wrote:
>>> I know, and that was my original question, no one cares that this
>>> interface completely lacks this capability? Maybe it is fine, but
>>> it is not a trivial assumption given that this is designed to be more
>>> than an interface to send admin/vs commands to the controller...
>>
>> Most people don't really care about or use multipath, so it's not a
>> primary goal.
>
> This statement is generally correct. However what application would be
> interested in speaking raw nvme to a device and gaining performance that
> is even higher than the block layer (which is great to begin with)?
If passthrough is faster than the block I/O path we're doing something
wrong. At best it should be the same performance.
That being said multipathing is an integral part of the nvme driver
architecture, and the /dev/ngX devices. If we want to support uring
async commands on /dev/ngX it will have to support multipath.
^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [PATCH 05/17] nvme: wire-up support for async-passthru on char-device.
2022-03-16 7:27 ` Kanchan Joshi
@ 2022-03-24 6:22 ` Christoph Hellwig
2022-03-24 17:45 ` Kanchan Joshi
0 siblings, 1 reply; 122+ messages in thread
From: Christoph Hellwig @ 2022-03-24 6:22 UTC (permalink / raw)
To: Kanchan Joshi
Cc: Christoph Hellwig, axboe, kbusch, asml.silence, io-uring,
linux-nvme, linux-block, sbates, logang, pankydev8, javier,
mcgrof, a.manzanares, joshiiitr, anuj20.g
On Wed, Mar 16, 2022 at 12:57:27PM +0530, Kanchan Joshi wrote:
> So what is the picture that you have in mind for struct io_uring_cmd?
> Moving meta fields out makes it look like this -
> @@ -28,7 +28,10 @@ struct io_uring_cmd {
> u32 cmd_op;
> u16 cmd_len;
> u16 unused;
> - u8 pdu[28]; /* available inline for free use */
> + void __user *meta_buffer; /* nvme pt specific */
> + u32 meta_len; /* nvme pt specific */
> + u8 pdu[16]; /* available inline for free use */
> +
> };
> And corresponding nvme 16 byte pdu - struct nvme_uring_cmd_pdu {
> - u32 meta_len;
> union {
> struct bio *bio;
> struct request *req;
> };
> void *meta; /* kernel-resident buffer */
> - void __user *meta_buffer;
> } __packed;
No, I'd also move the meta field (and call it meta_buffer) to
struct io_uring_cmd, and replace the pdu array with a simple
void *private;
> We would still need to use/export that even if we somehow manage to move
> task-work trigger from nvme-function to blk_mq_complete_request.
> io_uring's task-work infra is more baked than raw task-work infra.
> It would not be good to repeat all that code elsewhere.
> I tried raw one in the first attempt, and Jens suggested to move to baked
> one. Here is the link that gave birth to this interface -
> https://lore.kernel.org/linux-nvme/[email protected]/
Ok. I can't say I'm a fan of where this is going.
^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [PATCH 11/17] block: factor out helper for bio allocation from cache
2022-03-10 12:25 ` Kanchan Joshi
@ 2022-03-24 6:30 ` Christoph Hellwig
2022-03-24 17:45 ` Kanchan Joshi
0 siblings, 1 reply; 122+ messages in thread
From: Christoph Hellwig @ 2022-03-24 6:30 UTC (permalink / raw)
To: Kanchan Joshi
Cc: Christoph Hellwig, Kanchan Joshi, Jens Axboe, Keith Busch,
Pavel Begunkov, io-uring, linux-nvme, linux-block, sbates, logang,
Pankaj Raghav, Javier González, Luis Chamberlain,
Adam Manzanares, Anuj Gupta
On Thu, Mar 10, 2022 at 05:55:02PM +0530, Kanchan Joshi wrote:
> On Thu, Mar 10, 2022 at 2:05 PM Christoph Hellwig <[email protected]> wrote:
> >
> > On Tue, Mar 08, 2022 at 08:50:59PM +0530, Kanchan Joshi wrote:
> > > +struct bio *bio_alloc_kiocb(struct kiocb *kiocb, unsigned short nr_vecs,
> > > + struct bio_set *bs)
> > > +{
> > > + if (!(kiocb->ki_flags & IOCB_ALLOC_CACHE))
> > > + return bio_alloc_bioset(GFP_KERNEL, nr_vecs, bs);
> > > +
> > > + return bio_from_cache(nr_vecs, bs);
> > > +}
> > > EXPORT_SYMBOL_GPL(bio_alloc_kiocb);
> >
> > If we go down this route we might want to just kill the bio_alloc_kiocb
> > wrapper.
>
> Fine, will kill that in v2.
As a headsup, Mike Snitzer has been doing something similar in the
"block/dm: use BIOSET_PERCPU_CACHE from bio_alloc_bioset"
series.
^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [PATCH 14/17] io_uring: add polling support for uring-cmd
2022-03-16 5:09 ` Kanchan Joshi
@ 2022-03-24 6:30 ` Christoph Hellwig
0 siblings, 0 replies; 122+ messages in thread
From: Christoph Hellwig @ 2022-03-24 6:30 UTC (permalink / raw)
To: Kanchan Joshi
Cc: Christoph Hellwig, Kanchan Joshi, Jens Axboe, Keith Busch,
Pavel Begunkov, io-uring, linux-nvme, linux-block, sbates, logang,
Pankaj Raghav, Javier González, Luis Chamberlain,
Adam Manzanares, Anuj Gupta
On Wed, Mar 16, 2022 at 10:39:05AM +0530, Kanchan Joshi wrote:
> So how about adding ->async_cmd_poll in file_operations (since this
> corresponds to ->async_cmd)?
> It will take struct io_uring_cmd pointer as parameter.
> Both ->iopoll and ->async_cmd_poll will differ in what they accept (kiocb
> vs io_uring_cmd). The provider may use bio_poll, or whatever else is the
> implementation detail.
That does sound way better than the current version at least.
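For reference, the hook being floated would sit next to ->async_cmd in
file_operations; the exact parameter lists below are assumptions, only meant
to show the shape:

	/* sketch only - names/signatures assumed for illustration */
	struct file_operations {
		/* ... existing members ... */
		int (*async_cmd)(struct io_uring_cmd *ioucmd);
		int (*async_cmd_poll)(struct io_uring_cmd *ioucmd);
	};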
^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [PATCH 17/17] nvme: enable non-inline passthru commands
2022-03-22 17:10 ` Kanchan Joshi
@ 2022-03-24 6:32 ` Christoph Hellwig
2022-03-25 13:39 ` Kanchan Joshi
2022-04-01 1:22 ` Jens Axboe
1 sibling, 1 reply; 122+ messages in thread
From: Christoph Hellwig @ 2022-03-24 6:32 UTC (permalink / raw)
To: Kanchan Joshi
Cc: Christoph Hellwig, Kanchan Joshi, Jens Axboe, Keith Busch,
Pavel Begunkov, io-uring, linux-nvme, linux-block, sbates, logang,
Pankaj Raghav, Javier González, Luis Chamberlain,
Adam Manzanares, Anuj Gupta
On Tue, Mar 22, 2022 at 10:40:27PM +0530, Kanchan Joshi wrote:
> On Fri, Mar 11, 2022 at 11:57 AM Christoph Hellwig <[email protected]> wrote:
> > > And that's because this ioctl requires additional "__u64 result;" to
> > > be updated within "struct nvme_passthru_cmd64".
> > > To update that during completion, we need, at the least, the result
> > > field to be a pointer "__u64 result_ptr" inside the struct
> > > nvme_passthru_cmd64.
> > > Do you see that is possible without adding a new passthru ioctl in nvme?
> >
> > We don't need a new passthrough ioctl in nvme.
> Right. Maybe it is easier for applications if they get to use the same
> ioctl opcode/structure that they know well already.
I disagree. Reusing the same opcode and/or structure for something
fundamentally different creates major confusion. Don't do it.
> >From all that we discussed, maybe the path forward could be this:
> - inline-cmd/big-sqe is useful if paired with big-cqe. Drop big-sqe
> for now if we cannot go the big-cqe route.
> - use only indirect-cmd as this requires nothing special, just regular
> sqe and cqe. We can support all passthru commands with a lot less
> code. No new ioctl in nvme, so same semantics. For common commands
> (i.e. read/write) we can still avoid updating the result (put_user
> cost will go).
>
> Please suggest if we should approach this any differently in v2.
Personally I think larger SQEs and CQEs are the only sensible interface
here. Everything else just feels like a horrible hack I would not want
to support in NVMe.
^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [PATCH 05/17] nvme: wire-up support for async-passthru on char-device.
2022-03-24 6:20 ` Christoph Hellwig
@ 2022-03-24 10:42 ` Sagi Grimberg
0 siblings, 0 replies; 122+ messages in thread
From: Sagi Grimberg @ 2022-03-24 10:42 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, Kanchan Joshi, Kanchan Joshi, Keith Busch,
Pavel Begunkov, io-uring, linux-nvme, linux-block, sbates, logang,
Pankaj Raghav, Javier González, Luis Chamberlain,
Adam Manzanares, Anuj Gupta
>>>> I know, and that was my original question, no one cares that this
>>>> interface completely lacks this capability? Maybe it is fine, but
>>>> it is not a trivial assumption given that this is designed to be more
>>>> than an interface to send admin/vs commands to the controller...
>>>
>>> Most people don't really care about or use multipath, so it's not a
>>> primary goal.
>>
>> This statement is generally correct. However what application would be
>> interested in speaking raw nvme to a device and gaining performance that
>> is even higher than the block layer (which is great to begin with)?
>
> If passthrough is faster than the block I/O path we're doing something
> wrong. At best it should be the same performance.
That is not what the changelog says.
> That being said multipathing is an integral part of the nvme driver
> architecture, and the /dev/ngX devices. If we want to support uring
> async commands on /dev/ngX it will have to support multipath.
Couldn't agree more...
^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [PATCH 05/17] nvme: wire-up support for async-passthru on char-device.
2022-03-24 6:22 ` Christoph Hellwig
@ 2022-03-24 17:45 ` Kanchan Joshi
0 siblings, 0 replies; 122+ messages in thread
From: Kanchan Joshi @ 2022-03-24 17:45 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Kanchan Joshi, Jens Axboe, Keith Busch, Pavel Begunkov, io-uring,
linux-nvme, linux-block, sbates, logang, Pankaj Raghav,
Javier González, Luis Chamberlain, Adam Manzanares,
Anuj Gupta
On Thu, Mar 24, 2022 at 11:52 AM Christoph Hellwig <[email protected]> wrote:
>
> On Wed, Mar 16, 2022 at 12:57:27PM +0530, Kanchan Joshi wrote:
> > So what is the picture that you have in mind for struct io_uring_cmd?
> > Moving meta fields out makes it look like this -
>
>
> > @@ -28,7 +28,10 @@ struct io_uring_cmd {
> > u32 cmd_op;
> > u16 cmd_len;
> > u16 unused;
> > - u8 pdu[28]; /* available inline for free use */
> > + void __user *meta_buffer; /* nvme pt specific */
> > + u32 meta_len; /* nvme pt specific */
> > + u8 pdu[16]; /* available inline for free use */
> > +
> > };
> > And corresponding nvme 16 byte pdu - struct nvme_uring_cmd_pdu {
> > - u32 meta_len;
> > union {
> > struct bio *bio;
> > struct request *req;
> > };
> > void *meta; /* kernel-resident buffer */
> > - void __user *meta_buffer;
> > } __packed;
>
> No, I'd also move the meta field (and call it meta_buffer) to
> struct io_uring_cmd, and replace the pdu array with a simple
>
> void *private;
That clears it up. We can go that route, but the tradeoff is that
while we clean up one cast in nvme, we end up making async-cmd way
too nvme-passthrough specific.
People have talked about using async-cmd for other use cases; Darrick
mentioned using for xfs-scrub, and Luis had some ideas (other than
nvme) too.
The pdu array of 28 bytes is being used to avoid fast path
allocations. It got reduced to 8 bytes, and that is fine for one
nvme-ioctl as we moved other fields out.
But for other use-cases, 8 bytes of generic space may not be enough to
help with fast-path allocations.
^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [PATCH 11/17] block: factor out helper for bio allocation from cache
2022-03-24 6:30 ` Christoph Hellwig
@ 2022-03-24 17:45 ` Kanchan Joshi
2022-03-25 5:38 ` Christoph Hellwig
0 siblings, 1 reply; 122+ messages in thread
From: Kanchan Joshi @ 2022-03-24 17:45 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Kanchan Joshi, Jens Axboe, Keith Busch, Pavel Begunkov, io-uring,
linux-nvme, linux-block, sbates, logang, Pankaj Raghav,
Javier González, Luis Chamberlain, Adam Manzanares,
Anuj Gupta
On Thu, Mar 24, 2022 at 12:00 PM Christoph Hellwig <[email protected]> wrote:
>
> On Thu, Mar 10, 2022 at 05:55:02PM +0530, Kanchan Joshi wrote:
> > On Thu, Mar 10, 2022 at 2:05 PM Christoph Hellwig <[email protected]> wrote:
> > >
> > > On Tue, Mar 08, 2022 at 08:50:59PM +0530, Kanchan Joshi wrote:
> > > > +struct bio *bio_alloc_kiocb(struct kiocb *kiocb, unsigned short nr_vecs,
> > > > + struct bio_set *bs)
> > > > +{
> > > > + if (!(kiocb->ki_flags & IOCB_ALLOC_CACHE))
> > > > + return bio_alloc_bioset(GFP_KERNEL, nr_vecs, bs);
> > > > +
> > > > + return bio_from_cache(nr_vecs, bs);
> > > > +}
> > > > EXPORT_SYMBOL_GPL(bio_alloc_kiocb);
> > >
> > > If we go down this route we might want to just kill the bio_alloc_kiocb
> > > wrapper.
> >
> > Fine, will kill that in v2.
>
> As a headsup, Mike Snitzer has been doing something similar in the
>
> "block/dm: use BIOSET_PERCPU_CACHE from bio_alloc_bioset"
>
> series.
Thanks, that can be reused here too. But to enable this feature - we
need to move to a bioset from bio_kmalloc in nvme, and you did not
seem fine with that.
^ permalink raw reply [flat|nested] 122+ messages in thread
* RE: [PATCH 17/17] nvme: enable non-inline passthru commands
2022-03-08 15:21 ` [PATCH 17/17] nvme: enable non-inline passthru commands Kanchan Joshi
2022-03-10 8:36 ` Christoph Hellwig
@ 2022-03-24 21:09 ` Clay Mayers
2022-03-24 23:36 ` Jens Axboe
1 sibling, 1 reply; 122+ messages in thread
From: Clay Mayers @ 2022-03-24 21:09 UTC (permalink / raw)
To: Kanchan Joshi, [email protected], [email protected], [email protected],
[email protected]
Cc: [email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected], [email protected],
[email protected], [email protected], [email protected],
[email protected]
> From: Kanchan Joshi
> Sent: Tuesday, March 8, 2022 7:21 AM
> To: [email protected]; [email protected]; [email protected];
> [email protected]
> Cc: [email protected]; [email protected]; linux-
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected]
> Subject: [PATCH 17/17] nvme: enable non-inline passthru commands
>
> From: Anuj Gupta <[email protected]>
>
> On submission, just fetch the command from userspace pointer and reuse
> everything else. On completion, update the result field inside the passthru
> command.
>
> Signed-off-by: Anuj Gupta <[email protected]>
> Signed-off-by: Kanchan Joshi <[email protected]>
> ---
> drivers/nvme/host/ioctl.c | 29 +++++++++++++++++++++++++----
> 1 file changed, 25 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c index
> 701feaecabbe..ddb7e5864be6 100644
> --- a/drivers/nvme/host/ioctl.c
> +++ b/drivers/nvme/host/ioctl.c
> @@ -65,6 +65,14 @@ static void nvme_pt_task_cb(struct io_uring_cmd
> *ioucmd)
> }
> kfree(pdu->meta);
>
> + if (ioucmd->flags & IO_URING_F_UCMD_INDIRECT) {
> + struct nvme_passthru_cmd64 __user *ptcmd64 = ioucmd->cmd;
> + u64 result = le64_to_cpu(nvme_req(req)->result.u64);
> +
> + if (put_user(result, &ptcmd64->result))
> + status = -EFAULT;
When the thread that submitted the io_uring_cmd has exited, the CB is
called by a system worker instead, so put_user() fails. The cqe is still
completed and the process sees a failed i/o status, but the i/o did not
fail. The same is true for the metadata being returned in patch 5.
I can't say if it's a requirement to support this case. It does break our
current prototype, but we can adjust.
> + }
> +
> io_uring_cmd_done(ioucmd, status);
> }
>
> @@ -143,6 +151,13 @@ static inline bool nvme_is_fixedb_passthru(struct
> io_uring_cmd *ioucmd)
> return ((ioucmd) && (ioucmd->flags &
> IO_URING_F_UCMD_FIXEDBUFS)); }
>
> +static inline bool is_inline_rw(struct io_uring_cmd *ioucmd, struct
> +nvme_command *cmd) {
> + return ((ioucmd->flags & IO_URING_F_UCMD_INDIRECT) ||
> + (cmd->common.opcode == nvme_cmd_write ||
> + cmd->common.opcode == nvme_cmd_read)); }
> +
> static int nvme_submit_user_cmd(struct request_queue *q,
> struct nvme_command *cmd, u64 ubuffer,
> unsigned bufflen, void __user *meta_buffer, unsigned
> meta_len, @@ -193,8 +208,7 @@ static int nvme_submit_user_cmd(struct
> request_queue *q,
> }
> }
> if (ioucmd) { /* async dispatch */
> - if (cmd->common.opcode == nvme_cmd_write ||
> - cmd->common.opcode == nvme_cmd_read) {
> + if (is_inline_rw(ioucmd, cmd)) {
> if (bio && is_polling_enabled(ioucmd, req)) {
> ioucmd->bio = bio;
> bio->bi_opf |= REQ_POLLED;
> @@ -204,7 +218,7 @@ static int nvme_submit_user_cmd(struct
> request_queue *q,
> blk_execute_rq_nowait(req, 0, nvme_end_async_pt);
> return 0;
> } else {
> - /* support only read and write for now. */
> + /* support only read and write for inline */
> ret = -EINVAL;
> goto out_meta;
> }
> @@ -372,7 +386,14 @@ static int nvme_user_cmd64(struct nvme_ctrl *ctrl,
> struct nvme_ns *ns,
> } else {
> if (ioucmd->cmd_len != sizeof(struct nvme_passthru_cmd64))
> return -EINVAL;
> - cptr = (struct nvme_passthru_cmd64 *)ioucmd->cmd;
> + if (ioucmd->flags & IO_URING_F_UCMD_INDIRECT) {
> + ucmd = (struct nvme_passthru_cmd64 __user
> *)ioucmd->cmd;
> + if (copy_from_user(&cmd, ucmd, sizeof(cmd)))
> + return -EFAULT;
> + cptr = &cmd;
> + } else {
> + cptr = (struct nvme_passthru_cmd64 *)ioucmd->cmd;
> + }
> }
> if (cptr->flags & NVME_HIPRI)
> rq_flags |= REQ_POLLED;
> --
> 2.25.1
^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [PATCH 17/17] nvme: enable non-inline passthru commands
2022-03-24 21:09 ` Clay Mayers
@ 2022-03-24 23:36 ` Jens Axboe
0 siblings, 0 replies; 122+ messages in thread
From: Jens Axboe @ 2022-03-24 23:36 UTC (permalink / raw)
To: Clay Mayers, Kanchan Joshi, [email protected], [email protected],
[email protected]
Cc: [email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected], [email protected],
[email protected], [email protected], [email protected],
[email protected]
On 3/24/22 3:09 PM, Clay Mayers wrote:
>> From: Kanchan Joshi
>> Sent: Tuesday, March 8, 2022 7:21 AM
>> To: [email protected]; [email protected]; [email protected];
>> [email protected]
>> Cc: [email protected]; [email protected]; linux-
>> [email protected]; [email protected]; [email protected];
>> [email protected]; [email protected]; [email protected];
>> [email protected]; [email protected]; [email protected]
>> Subject: [PATCH 17/17] nvme: enable non-inline passthru commands
>>
>> From: Anuj Gupta <[email protected]>
>>
>> On submission, just fetch the command from userspace pointer and reuse
>> everything else. On completion, update the result field inside the passthru
>> command.
>>
>> Signed-off-by: Anuj Gupta <[email protected]>
>> Signed-off-by: Kanchan Joshi <[email protected]>
>> ---
>> drivers/nvme/host/ioctl.c | 29 +++++++++++++++++++++++++----
>> 1 file changed, 25 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c index
>> 701feaecabbe..ddb7e5864be6 100644
>> --- a/drivers/nvme/host/ioctl.c
>> +++ b/drivers/nvme/host/ioctl.c
>> @@ -65,6 +65,14 @@ static void nvme_pt_task_cb(struct io_uring_cmd
>> *ioucmd)
>> }
>> kfree(pdu->meta);
>>
>> + if (ioucmd->flags & IO_URING_F_UCMD_INDIRECT) {
>> + struct nvme_passthru_cmd64 __user *ptcmd64 = ioucmd->cmd;
>> + u64 result = le64_to_cpu(nvme_req(req)->result.u64);
>> +
>> + if (put_user(result, &ptcmd64->result))
>> + status = -EFAULT;
>
> When the thread that submitted the io_uring_cmd has exited, the CB is
> called by a system worker instead so put_user() fails. The cqe is
> still completed and the process sees a failed i/o status, but the i/o
> did not fail. The same is true for meta data being returned in patch
> 5.
>
> I can't say if it's a requirement to support this case. It does break
> our current proto-type but we can adjust.
Just don't do that then - it's all very much task based. If the task
goes away and completions haven't been reaped, don't count on anything
sane happening in terms of them completing successfully or not.
The common case for this happening is offloading submit to a submit
thread, which is utterly pointless with io_uring anyway.
--
Jens Axboe
^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [PATCH 11/17] block: factor out helper for bio allocation from cache
2022-03-24 17:45 ` Kanchan Joshi
@ 2022-03-25 5:38 ` Christoph Hellwig
0 siblings, 0 replies; 122+ messages in thread
From: Christoph Hellwig @ 2022-03-25 5:38 UTC (permalink / raw)
To: Kanchan Joshi
Cc: Christoph Hellwig, Kanchan Joshi, Jens Axboe, Keith Busch,
Pavel Begunkov, io-uring, linux-nvme, linux-block, sbates, logang,
Pankaj Raghav, Javier González, Luis Chamberlain,
Adam Manzanares, Anuj Gupta
On Thu, Mar 24, 2022 at 11:15:20PM +0530, Kanchan Joshi wrote:
> Thanks, that can be reused here too. But to enable this feature - we
> need to move to a bioset from bio_kmalloc in nvme, and you did not
> seem fine with that.
Yeah, kmalloc already does percpu caches, so we don't even need it.
^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [PATCH 17/17] nvme: enable non-inline passthru commands
2022-03-24 6:32 ` Christoph Hellwig
@ 2022-03-25 13:39 ` Kanchan Joshi
2022-03-28 4:44 ` Kanchan Joshi
2022-03-30 13:02 ` Christoph Hellwig
0 siblings, 2 replies; 122+ messages in thread
From: Kanchan Joshi @ 2022-03-25 13:39 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Kanchan Joshi, Jens Axboe, Keith Busch, Pavel Begunkov, io-uring,
linux-nvme, linux-block, sbates, logang, Pankaj Raghav,
Javier González, Luis Chamberlain, Adam Manzanares,
Anuj Gupta
On Thu, Mar 24, 2022 at 07:32:18AM +0100, Christoph Hellwig wrote:
>On Tue, Mar 22, 2022 at 10:40:27PM +0530, Kanchan Joshi wrote:
>> On Fri, Mar 11, 2022 at 11:57 AM Christoph Hellwig <[email protected]> wrote:
>> > > And that's because this ioctl requires additional "__u64 result;" to
>> > > be updated within "struct nvme_passthru_cmd64".
>> > > To update that during completion, we need, at the least, the result
>> > > field to be a pointer "__u64 result_ptr" inside the struct
>> > > nvme_passthru_cmd64.
>> > > Do you see that is possible without adding a new passthru ioctl in nvme?
>> >
>> > We don't need a new passthrough ioctl in nvme.
>> Right. Maybe it is easier for applications if they get to use the same
>> ioctl opcode/structure that they know well already.
>
>I disagree. Reusing the same opcode and/or structure for something
>fundamentally different creates major confusion. Don't do it.
Ok. If you are open to taking the new opcode/struct route, that is all we
require to pair with big-sqe and have this sorted. How about this -
+/* same as nvme_passthru_cmd64 but expecting result field to be pointer */
+struct nvme_passthru_cmd64_ind {
+ __u8 opcode;
+ __u8 flags;
+ __u16 rsvd1;
+ __u32 nsid;
+ __u32 cdw2;
+ __u32 cdw3;
+ __u64 metadata;
+ __u64 addr;
+ __u32 metadata_len;
+ union {
+ __u32 data_len; /* for non-vectored io */
+ __u32 vec_cnt; /* for vectored io */
+ };
+ __u32 cdw10;
+ __u32 cdw11;
+ __u32 cdw12;
+ __u32 cdw13;
+ __u32 cdw14;
+ __u32 cdw15;
+ __u32 timeout_ms;
+ __u32 rsvd2;
+ __u64 presult; /* pointer to result */
+};
+
#define nvme_admin_cmd nvme_passthru_cmd
+#define NVME_IOCTL_IO64_CMD_IND _IOWR('N', 0x50, struct nvme_passthru_cmd64_ind)
Not heavy on code volume either, because only one statement (updating the
result) changes and we reuse everything else.
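For completeness, the completion side of this variant would have to chase the
user pointer, roughly as below (sketch only; the casts and helper use are
assumptions). Note the per-completion cost is an extra get_user()/put_user()
pair compared to filling a result into the CQE directly:

	/* sketch: write the NVMe result through the user-supplied pointer */
	struct nvme_passthru_cmd64_ind __user *ucmd =
		(struct nvme_passthru_cmd64_ind __user *)ioucmd->cmd;
	u64 result = le64_to_cpu(nvme_req(req)->result.u64);
	u64 presult;

	if (get_user(presult, &ucmd->presult))
		status = -EFAULT;
	else if (put_user(result, (u64 __user *)u64_to_user_ptr(presult)))
		status = -EFAULT;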
>> >From all that we discussed, maybe the path forward could be this:
>> - inline-cmd/big-sqe is useful if paired with big-cqe. Drop big-sqe
>> for now if we cannot go the big-cqe route.
>> - use only indirect-cmd as this requires nothing special, just regular
>> sqe and cqe. We can support all passthru commands with a lot less
>> code. No new ioctl in nvme, so same semantics. For common commands
>> (i.e. read/write) we can still avoid updating the result (put_user
>> cost will go).
>>
>> Please suggest if we should approach this any differently in v2.
>
>Personally I think larger SQEs and CQEs are the only sensible interface
>here. Everything else just feels like a horrible hack I would not want
>to support in NVMe.
So far we have gathered three choices:
(a) big-sqe + new opcode/struct in nvme
(b) big-sqe + big-cqe
(c) regular-sqe + regular-cqe
I can post an RFC on the big-cqe work if that is what it takes to evaluate
clearly which path to take. But really, the code is much larger compared
to choices (a) and (c). Differentiating one CQE from another does not
seem very maintenance-friendly, particularly in liburing.
For (c), I did not get what part feels like a horrible hack.
It is the same as how we do sync passthru - read the passthru command from
user-space memory, and update the result in it on completion.
But yes, (a) seems like the best option to me.
^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [PATCH 17/17] nvme: enable non-inline passthru commands
2022-03-25 13:39 ` Kanchan Joshi
@ 2022-03-28 4:44 ` Kanchan Joshi
2022-03-30 12:59 ` Christoph Hellwig
2022-03-30 13:02 ` Christoph Hellwig
1 sibling, 1 reply; 122+ messages in thread
From: Kanchan Joshi @ 2022-03-28 4:44 UTC (permalink / raw)
To: Kanchan Joshi
Cc: Christoph Hellwig, Jens Axboe, Keith Busch, Pavel Begunkov,
io-uring, linux-nvme, linux-block, sbates, logang, Pankaj Raghav,
Javier González, Luis Chamberlain, Adam Manzanares,
Anuj Gupta
> >I disagree. Reusing the same opcode and/or structure for something
> >fundamentally different creates major confusion. Don't do it.
>
> Ok. If you are open to take new opcode/struct route, that is all we
> require to pair with big-sqe and have this sorted. How about this -
>
> +/* same as nvme_passthru_cmd64 but expecting result field to be pointer */
> +struct nvme_passthru_cmd64_ind {
> + __u8 opcode;
> + __u8 flags;
> + __u16 rsvd1;
> + __u32 nsid;
> + __u32 cdw2;
> + __u32 cdw3;
> + __u64 metadata;
> + __u64 addr;
> + __u32 metadata_len;
> + union {
> + __u32 data_len; /* for non-vectored io */
> + __u32 vec_cnt; /* for vectored io */
> + };
> + __u32 cdw10;
> + __u32 cdw11;
> + __u32 cdw12;
> + __u32 cdw13;
> + __u32 cdw14;
> + __u32 cdw15;
> + __u32 timeout_ms;
> + __u32 rsvd2;
> + __u64 presult; /* pointer to result */
> +};
> +
> #define nvme_admin_cmd nvme_passthru_cmd
>
> +#define NVME_IOCTL_IO64_CMD_IND _IOWR('N', 0x50, struct nvme_passthru_cmd64_ind)
>
> Not heavy on code-volume too, because only one statement (updating
> result) changes and we reuse everything else.
>
> >> >From all that we discussed, maybe the path forward could be this:
> >> - inline-cmd/big-sqe is useful if paired with big-cqe. Drop big-sqe
> >> for now if we cannot go the big-cqe route.
> >> - use only indirect-cmd as this requires nothing special, just regular
> >> sqe and cqe. We can support all passthru commands with a lot less
> >> code. No new ioctl in nvme, so same semantics. For common commands
> >> (i.e. read/write) we can still avoid updating the result (put_user
> >> cost will go).
> >>
> >> Please suggest if we should approach this any differently in v2.
> >
> >Personally I think larger SQEs and CQEs are the only sensible interface
> >here. Everything else just feels like a horrible hack I would not want
> >to support in NVMe.
>
> So far we have gathered three choices:
>
> (a) big-sqe + new opcode/struct in nvme
> (b) big-sqe + big-cqe
> (c) regular-sqe + regular-cqe
>
> I can post a RFC on big-cqe work if that is what it takes to evaluate
> clearly what path to take. But really, the code is much more compared
> to choice (a) and (c). Differentiating one CQE with another does not
> seem very maintenance-friendly, particularly in liburing.
>
> For (c), I did not get what part feels like horrible hack.
> It is same as how we do sync passthru - read passthru command from
> user-space memory, and update result into that on completion.
> But yes, (a) seems like the best option to me.
Thinking a bit more on "(b) big-sqe + big-cqe". Will that also require
a new ioctl (other than NVME_IOCTL_IO64_CMD) in nvme? Because
semantics will be slightly different (i.e. not updating the result
inside the passthrough command but sending it out-of-band to
io_uring). Or am I just overthinking it.
^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [PATCH 17/17] nvme: enable non-inline passthru commands
2022-03-28 4:44 ` Kanchan Joshi
@ 2022-03-30 12:59 ` Christoph Hellwig
0 siblings, 0 replies; 122+ messages in thread
From: Christoph Hellwig @ 2022-03-30 12:59 UTC (permalink / raw)
To: Kanchan Joshi
Cc: Kanchan Joshi, Christoph Hellwig, Jens Axboe, Keith Busch,
Pavel Begunkov, io-uring, linux-nvme, linux-block, sbates, logang,
Pankaj Raghav, Javier González, Luis Chamberlain,
Adam Manzanares, Anuj Gupta
On Mon, Mar 28, 2022 at 10:14:13AM +0530, Kanchan Joshi wrote:
> Thinking a bit more on "(b) big-sqe + big-cqe". Will that also require
> a new ioctl (other than NVME_IOCTL_IO64_CMD) in nvme? Because
> semantics will be slightly different (i.e. not updating the result
> inside the passthrough command but sending it out-of-band to
> io_uring). Or am I just overthinking it.
Again, there should be absolutely no coupling between ioctls and
io_uring async cmds. The only thing trying to reuse structures or
constants does is to create a lot of confusion.
^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [PATCH 17/17] nvme: enable non-inline passthru commands
2022-03-25 13:39 ` Kanchan Joshi
2022-03-28 4:44 ` Kanchan Joshi
@ 2022-03-30 13:02 ` Christoph Hellwig
2022-03-30 13:14 ` Kanchan Joshi
2022-04-01 1:23 ` Jens Axboe
1 sibling, 2 replies; 122+ messages in thread
From: Christoph Hellwig @ 2022-03-30 13:02 UTC (permalink / raw)
To: Kanchan Joshi
Cc: Christoph Hellwig, Kanchan Joshi, Jens Axboe, Keith Busch,
Pavel Begunkov, io-uring, linux-nvme, linux-block, sbates, logang,
Pankaj Raghav, Javier González, Luis Chamberlain,
Adam Manzanares, Anuj Gupta
On Fri, Mar 25, 2022 at 07:09:21PM +0530, Kanchan Joshi wrote:
> Ok. If you are open to take new opcode/struct route, that is all we
> require to pair with big-sqe and have this sorted. How about this -
I would much, much, much prefer to support a bigger CQE. Having
a pointer in there just creates a fair amount of overhead and
really does not fit into the model nvme and io_uring use.
But yes, if we did not go down that route that would be the structure
that is needed.
^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [PATCH 17/17] nvme: enable non-inline passthru commands
2022-03-30 13:02 ` Christoph Hellwig
@ 2022-03-30 13:14 ` Kanchan Joshi
2022-04-01 1:25 ` Jens Axboe
2022-04-01 1:23 ` Jens Axboe
1 sibling, 1 reply; 122+ messages in thread
From: Kanchan Joshi @ 2022-03-30 13:14 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Kanchan Joshi, Jens Axboe, Keith Busch, Pavel Begunkov, io-uring,
linux-nvme, linux-block, sbates, logang, Pankaj Raghav,
Javier González, Luis Chamberlain, Adam Manzanares,
Anuj Gupta
On Wed, Mar 30, 2022 at 6:32 PM Christoph Hellwig <[email protected]> wrote:
>
> On Fri, Mar 25, 2022 at 07:09:21PM +0530, Kanchan Joshi wrote:
> > Ok. If you are open to take new opcode/struct route, that is all we
> > require to pair with big-sqe and have this sorted. How about this -
>
> I would much, much, much prefer to support a bigger CQE. Having
> a pointer in there just creates a fair amount of overhead and
> really does not fit into the model nvme and io_uring use.
Sure, will post the code with bigger-cqe first.
> But yes, if we did not go down that route that would be the structure
> that is needed.
Got it. Thanks for confirming.
^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [PATCH 17/17] nvme: enable non-inline passthru commands
2022-03-22 17:10 ` Kanchan Joshi
2022-03-24 6:32 ` Christoph Hellwig
@ 2022-04-01 1:22 ` Jens Axboe
2022-04-01 6:29 ` Kanchan Joshi
1 sibling, 1 reply; 122+ messages in thread
From: Jens Axboe @ 2022-04-01 1:22 UTC (permalink / raw)
To: Kanchan Joshi, Christoph Hellwig
Cc: Kanchan Joshi, Keith Busch, Pavel Begunkov, io-uring, linux-nvme,
linux-block, sbates, logang, Pankaj Raghav, Javier González,
Luis Chamberlain, Adam Manzanares, Anuj Gupta
On 3/22/22 11:10 AM, Kanchan Joshi wrote:
>> We need to decouple the
>> uring cmd properly. And properly in this case means not to add a
>> result pointer, but to drop the result from the _input_ structure
>> entirely, and instead optionally support a larger CQ entry that contains
>> it, just like the first patch does for the SQ.
>
> Creating a large CQE was my thought too. Gave that another stab.
> Dealing with two types of CQE felt nasty to fit in liburing's api-set
> (which is cqe-heavy).
>
> Jens: Do you already have thoughts (go/no-go) for this route?
Yes, I think we should just add support for 32-byte CQEs as well. The only
pondering I've done here is whether it makes sense to manage them separately,
or if you should just get both big sqe and cqe support in one setting.
For passthrough, you'd want both. But eg for zoned writes, you can make
do with a normal sized sqes and only do larger cqes.
I did actually benchmark big sqes in peak testing, and found them to
perform about the same, no noticeable difference. Which does make sense,
as normal IO with big sqe would only touch the normal sized sqe and
leave the other one unwritten and unread. Since they are cacheline
sized, there's no extra load there.
For big cqes, that's a bit different and I'd expect a bit of a
performance hit for that. We can currently fit 4 of them into a
cacheline, with the change it'd be 2. The same number of ops/sec would
hence touch twice as many cachelines for completions.
But I still think it's way better than having to copy back part of the
completion info out-of-band vs just doing it inline, and it's more
efficient too for that case for sure.
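As a rough illustration of the layout being discussed (hypothetical, not
taken from this series): the first 16 bytes stay as today's CQE and the extra
16 bytes carry command-specific payload such as the nvme passthrough result:

	/* hypothetical 32-byte CQE, for illustration only */
	struct io_uring_cqe32 {
		__u64	user_data;	/* sqe->user_data passed back */
		__s32	res;		/* operation result */
		__u32	flags;
		__u64	extra1;		/* e.g. nvme result (cqe dword0/1) */
		__u64	extra2;		/* spare / command specific */
	};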
> From all that we discussed, maybe the path forward could be this:
> - inline-cmd/big-sqe is useful if paired with big-cqe. Drop big-sqe
> for now if we cannot go the big-cqe route.
We should go big cqe for sure, it'll help clean up a bunch of things
too.
--
Jens Axboe
^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [PATCH 17/17] nvme: enable non-inline passthru commands
2022-03-30 13:02 ` Christoph Hellwig
2022-03-30 13:14 ` Kanchan Joshi
@ 2022-04-01 1:23 ` Jens Axboe
1 sibling, 0 replies; 122+ messages in thread
From: Jens Axboe @ 2022-04-01 1:23 UTC (permalink / raw)
To: Christoph Hellwig, Kanchan Joshi
Cc: Kanchan Joshi, Keith Busch, Pavel Begunkov, io-uring, linux-nvme,
linux-block, sbates, logang, Pankaj Raghav, Javier González,
Luis Chamberlain, Adam Manzanares, Anuj Gupta
On 3/30/22 7:02 AM, Christoph Hellwig wrote:
> On Fri, Mar 25, 2022 at 07:09:21PM +0530, Kanchan Joshi wrote:
>> Ok. If you are open to take new opcode/struct route, that is all we
>> require to pair with big-sqe and have this sorted. How about this -
>
> I would much, much, much prefer to support a bigger CQE. Having
> a pointer in there just creates a fair amount of overhead and
> really does not fit into the model nvme and io_uring use.
>
> But yes, if we did not go down that route that would be the structure
> that is needed.
IMHO doing 32-byte CQEs is the only sane choice here, I would not
entertain anything else.
--
Jens Axboe
^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [PATCH 17/17] nvme: enable non-inline passthru commands
2022-03-30 13:14 ` Kanchan Joshi
@ 2022-04-01 1:25 ` Jens Axboe
2022-04-01 2:33 ` Kanchan Joshi
0 siblings, 1 reply; 122+ messages in thread
From: Jens Axboe @ 2022-04-01 1:25 UTC (permalink / raw)
To: Kanchan Joshi, Christoph Hellwig
Cc: Kanchan Joshi, Keith Busch, Pavel Begunkov, io-uring, linux-nvme,
linux-block, sbates, logang, Pankaj Raghav, Javier González,
Luis Chamberlain, Adam Manzanares, Anuj Gupta
On 3/30/22 7:14 AM, Kanchan Joshi wrote:
> On Wed, Mar 30, 2022 at 6:32 PM Christoph Hellwig <[email protected]> wrote:
>>
>> On Fri, Mar 25, 2022 at 07:09:21PM +0530, Kanchan Joshi wrote:
>>> Ok. If you are open to take new opcode/struct route, that is all we
>>> require to pair with big-sqe and have this sorted. How about this -
>>
>> I would much, much, much prefer to support a bigger CQE. Having
>> a pointer in there just creates a fair amount of overhead and
>> really does not fit into the model nvme and io_uring use.
>
> Sure, will post the code with bigger-cqe first.
I can add the support, should be pretty trivial. And do the liburing
side as well, so we have a sane base.
Then I'd suggest to collapse a few of the patches in the series,
the ones that simply modify or fix gaps in previous ones. Order
the series so we build the support and then add nvme support
nicely on top of that.
I'll send out a message on a rebase big sqe/cqe branch, will do
that once 5.18-rc1 is released so we can get it updated to a
current tree as well.
--
Jens Axboe
^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [PATCH 17/17] nvme: enable non-inline passthru commands
2022-04-01 1:25 ` Jens Axboe
@ 2022-04-01 2:33 ` Kanchan Joshi
2022-04-01 2:44 ` Jens Axboe
0 siblings, 1 reply; 122+ messages in thread
From: Kanchan Joshi @ 2022-04-01 2:33 UTC (permalink / raw)
To: Jens Axboe
Cc: Christoph Hellwig, Kanchan Joshi, Keith Busch, Pavel Begunkov,
io-uring, linux-nvme, linux-block, sbates, logang, Pankaj Raghav,
Javier González, Luis Chamberlain, Adam Manzanares,
Anuj Gupta
On Fri, Apr 1, 2022 at 6:55 AM Jens Axboe <[email protected]> wrote:
>
> On 3/30/22 7:14 AM, Kanchan Joshi wrote:
> > On Wed, Mar 30, 2022 at 6:32 PM Christoph Hellwig <[email protected]> wrote:
> >>
> >> On Fri, Mar 25, 2022 at 07:09:21PM +0530, Kanchan Joshi wrote:
> >>> Ok. If you are open to take new opcode/struct route, that is all we
> >>> require to pair with big-sqe and have this sorted. How about this -
> >>
> >> I would much, much, much prefer to support a bigger CQE. Having
> >> a pointer in there just creates a fair amount of overhead and
> >> really does not fit into the model nvme and io_uring use.
> >
> > Sure, will post the code with bigger-cqe first.
>
> I can add the support, should be pretty trivial. And do the liburing
> side as well, so we have a sane base.
I will post the big-cqe based work today. It works with fio.
It does not deal with liburing (which seems tricky), but hopefully it
can help us move forward anyway.
> Then I'd suggest to collapse a few of the patches in the series,
> the ones that simply modify or fix gaps in previous ones. Order
> the series so we build the support and then add nvme support
> nicely on top of that.
I think we already did away with patches which were fixing only the
gaps. But yes, patches still add infra for features incrementally.
Do you mean having all io_uring infra (async, plug, poll) squashed
into a single io_uring patch?
On a related note, I was thinking of deferring fixed-buffer and
bio-cache support for now.
^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [PATCH 17/17] nvme: enable non-inline passthru commands
2022-04-01 2:33 ` Kanchan Joshi
@ 2022-04-01 2:44 ` Jens Axboe
2022-04-01 3:05 ` Jens Axboe
` (2 more replies)
0 siblings, 3 replies; 122+ messages in thread
From: Jens Axboe @ 2022-04-01 2:44 UTC (permalink / raw)
To: Kanchan Joshi
Cc: Christoph Hellwig, Kanchan Joshi, Keith Busch, Pavel Begunkov,
io-uring, linux-nvme, linux-block, sbates, logang, Pankaj Raghav,
Javier González, Luis Chamberlain, Adam Manzanares,
Anuj Gupta
On 3/31/22 8:33 PM, Kanchan Joshi wrote:
> On Fri, Apr 1, 2022 at 6:55 AM Jens Axboe <[email protected]> wrote:
>>
>> On 3/30/22 7:14 AM, Kanchan Joshi wrote:
>>> On Wed, Mar 30, 2022 at 6:32 PM Christoph Hellwig <[email protected]> wrote:
>>>>
>>>> On Fri, Mar 25, 2022 at 07:09:21PM +0530, Kanchan Joshi wrote:
>>>>> Ok. If you are open to take new opcode/struct route, that is all we
>>>>> require to pair with big-sqe and have this sorted. How about this -
>>>>
>>>> I would much, much, much prefer to support a bigger CQE. Having
>>>> a pointer in there just creates a fair amount of overhead and
>>>> really does not fit into the model nvme and io_uring use.
>>>
>>> Sure, will post the code with bigger-cqe first.
>>
>> I can add the support, should be pretty trivial. And do the liburing
>> side as well, so we have a sane base.
>
> I will post the big-cqe based work today. It works with fio.
> It does not deal with liburing (which seems tricky), but hopefully it
> can help us move forward anyway .
Let's compare then, since I just did the support too :-)
Some limitations in what I pushed:
1) Doesn't support the inline completion path. Undecided if this is
super important or not, the priority here for me was to not pollute the
general completion path.
2) Doesn't support overflow. That can certainly be done, only
complication here is that we need 2x64bit in the io_kiocb for that.
Perhaps something can get reused for that, not impossible. But figured
it wasn't important enough for a first run.
I also did the liburing support, but haven't pushed it yet. That's
another case where some care has to be taken to avoid making the general
path slower.
Oh, it's here, usual branch:
https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-big-sqe
and based on top of the pending 5.18 bits and the current 5.19 bits.
>> Then I'd suggest to collapse a few of the patches in the series,
>> the ones that simply modify or fix gaps in previous ones. Order
>> the series so we build the support and then add nvme support
>> nicely on top of that.
>
> I think we already did away with patches which were fixing only the
> gaps. But yes, patches still add infra for features incrementally.
> Do you mean having all io_uring infra (async, plug, poll) squashed
> into a single io_uring patch?
At least async and plug, I'll double check on the poll bit.
> On a related note, I was thinking of deferring fixed-buffer and
> bio-cache support for now.
Yes, I think that can be done as a round 2. Keep the current one
simpler.
--
Jens Axboe
^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [PATCH 17/17] nvme: enable non-inline passthru commands
2022-04-01 2:44 ` Jens Axboe
@ 2022-04-01 3:05 ` Jens Axboe
2022-04-01 6:32 ` Kanchan Joshi
2022-04-19 17:31 ` Kanchan Joshi
2 siblings, 0 replies; 122+ messages in thread
From: Jens Axboe @ 2022-04-01 3:05 UTC (permalink / raw)
To: Kanchan Joshi
Cc: Christoph Hellwig, Kanchan Joshi, Keith Busch, Pavel Begunkov,
io-uring, linux-nvme, linux-block, sbates, logang, Pankaj Raghav,
Javier González, Luis Chamberlain, Adam Manzanares,
Anuj Gupta
On 3/31/22 8:44 PM, Jens Axboe wrote:
> On 3/31/22 8:33 PM, Kanchan Joshi wrote:
>> On Fri, Apr 1, 2022 at 6:55 AM Jens Axboe <[email protected]> wrote:
>>>
>>> On 3/30/22 7:14 AM, Kanchan Joshi wrote:
>>>> On Wed, Mar 30, 2022 at 6:32 PM Christoph Hellwig <[email protected]> wrote:
>>>>>
>>>>> On Fri, Mar 25, 2022 at 07:09:21PM +0530, Kanchan Joshi wrote:
>>>>>> Ok. If you are open to take new opcode/struct route, that is all we
>>>>>> require to pair with big-sqe and have this sorted. How about this -
>>>>>
>>>>> I would much, much, much prefer to support a bigger CQE. Having
>>>>> a pointer in there just creates a fair amount of overhead and
>>>>> really does not fit into the model nvme and io_uring use.
>>>>
>>>> Sure, will post the code with bigger-cqe first.
>>>
>>> I can add the support, should be pretty trivial. And do the liburing
>>> side as well, so we have a sane base.
>>
>> I will post the big-cqe based work today. It works with fio.
>> It does not deal with liburing (which seems tricky), but hopefully it
>> can help us move forward anyway .
>
> Let's compare then, since I just did the support too :-)
>
> Some limitations in what I pushed:
>
> 1) Doesn't support the inline completion path. Undecided if this is
> super important or not, the priority here for me was to not pollute the
> general completion path.
>
> 2) Doesn't support overflow. That can certainly be done, only
> complication here is that we need 2x64bit in the io_kiocb for that.
> Perhaps something can get reused for that, not impossible. But figured
> it wasn't important enough for a first run.
>
> I also did the liburing support, but haven't pushed it yet. That's
> another case where some care has to be taken to avoid makig the general
> path slower.
>
> Oh, it's here, usual branch:
>
> https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-big-sqe
>
> and based on top of the pending 5.18 bits and the current 5.19 bits.
Do post your version too, would be interesting to compare. I just wired
mine up to NOP, hasn't seen any testing beyond just verifying that we do
pass back the extra data.
Added inline completion as well. Kind of interesting in that performance
actually seems to be _better_ with CQE32 for my initial testing, just
using NOP. More testing surely needed, will run it on actual hardware
too as I have a good idea what performance should look like there.
I also think it's currently broken for request deferral and timeouts,
but those are just minor tweaks that need to be made to account for the
cq head being doubly incremented on bigger CQEs.
--
Jens Axboe
^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [PATCH 17/17] nvme: enable non-inline passthru commands
2022-04-01 1:22 ` Jens Axboe
@ 2022-04-01 6:29 ` Kanchan Joshi
0 siblings, 0 replies; 122+ messages in thread
From: Kanchan Joshi @ 2022-04-01 6:29 UTC (permalink / raw)
To: Jens Axboe
Cc: Christoph Hellwig, Kanchan Joshi, Keith Busch, Pavel Begunkov,
io-uring, linux-nvme, linux-block, sbates, logang, Pankaj Raghav,
Javier González, Luis Chamberlain, Adam Manzanares,
Anuj Gupta
On Fri, Apr 1, 2022 at 6:52 AM Jens Axboe <[email protected]> wrote:
>
> On 3/22/22 11:10 AM, Kanchan Joshi wrote:
> >> We need to decouple the
> >> uring cmd properly. And properly in this case means not to add a
> >> result pointer, but to drop the result from the _input_ structure
> >> entirely, and instead optionally support a larger CQ entry that contains
> >> it, just like the first patch does for the SQ.
> >
> > Creating a large CQE was my thought too. Gave that another stab.
> > Dealing with two types of CQE felt nasty to fit in liburing's api-set
> > (which is cqe-heavy).
> >
> > Jens: Do you already have thoughts (go/no-go) for this route?
>
> Yes, I think we should just add support for 32-byte CQEs as well. Only
> pondering I've done here is if it makes sense to manage them separately,
> or if you should just get both big sqe and cqe support in one setting.
> For passthrough, you'd want both. But eg for zoned writes, you can make
> do with a normal sized sqes and only do larger cqes.
I had the same thought, that we may have other use-cases returning a
second result.
For now I am doing 32-byte cqe with the same big-sqe flag, but an
independent flag can be done easily.
Combinations are:
(a) big-sqe with big-cqe (for nvme-passthru)
(b) big-sqe without big-cqe (inline submission but not requiring second result)
(c) regular-sqe with big-cqe (for zone-append)
(d) regular-sqe with regular-cqe (for cases when inline submission is
not enough e.g. > 80 bytes of cmd)
At this point (d) seems rare. And the other three can be done with two flags.
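A rough sketch of how an application would ask for combination (a) at setup
time. This assumes a uapi header that already carries the new flags from this
series (IORING_SETUP_SQE128, IORING_SETUP_CQE32); setup_big_ring is just an
illustrative helper name, and the raw syscall is used since liburing support
is still in flux:

	#include <linux/io_uring.h>
	#include <string.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	/* combination (a): 128-byte SQEs paired with 32-byte CQEs */
	static int setup_big_ring(unsigned int entries, struct io_uring_params *p)
	{
		memset(p, 0, sizeof(*p));
		p->flags = IORING_SETUP_SQE128 | IORING_SETUP_CQE32;
		return syscall(__NR_io_uring_setup, entries, p);
	}

	/* combinations (b) and (c) would just drop one of the two flags */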
^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [PATCH 17/17] nvme: enable non-inline passthru commands
2022-04-01 2:44 ` Jens Axboe
2022-04-01 3:05 ` Jens Axboe
@ 2022-04-01 6:32 ` Kanchan Joshi
2022-04-19 17:31 ` Kanchan Joshi
2 siblings, 0 replies; 122+ messages in thread
From: Kanchan Joshi @ 2022-04-01 6:32 UTC (permalink / raw)
To: Jens Axboe
Cc: Christoph Hellwig, Kanchan Joshi, Keith Busch, Pavel Begunkov,
io-uring, linux-nvme, linux-block, sbates, logang, Pankaj Raghav,
Javier González, Luis Chamberlain, Adam Manzanares,
Anuj Gupta
On Fri, Apr 1, 2022 at 8:14 AM Jens Axboe <[email protected]> wrote:
> >>>>> Ok. If you are open to take new opcode/struct route, that is all we
> >>>>> require to pair with big-sqe and have this sorted. How about this -
> >>>>
> >>>> I would much, much, much prefer to support a bigger CQE. Having
> >>>> a pointer in there just creates a fair amount of overhead and
> >>>> really does not fit into the model nvme and io_uring use.
> >>>
> >>> Sure, will post the code with bigger-cqe first.
> >>
> >> I can add the support, should be pretty trivial. And do the liburing
> >> side as well, so we have a sane base.
> >
> > I will post the big-cqe based work today. It works with fio.
> > It does not deal with liburing (which seems tricky), but hopefully it
> > can help us move forward anyway .
>
> Let's compare then, since I just did the support too :-)
:-) awesome
> Some limitations in what I pushed:
>
> 1) Doesn't support the inline completion path. Undecided if this is
> super important or not, the priority here for me was to not pollute the
> general completion path.
>
> 2) Doesn't support overflow. That can certainly be done, only
> complication here is that we need 2x64bit in the io_kiocb for that.
> Perhaps something can get reused for that, not impossible. But figured
> it wasn't important enough for a first run.
My version has the overflow handling, but that part is not tested since
the situation did not occur naturally.
Maybe it requires slowing down completion-reaping (in user-space) to
trigger that.
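One rough way to force that path in a test (a sketch using stock liburing
NOPs just to show the idea of outpacing the CQ; the ring sizes and batch
counts are arbitrary, and the CQE32 variant would need a ring created with
the new flag once liburing grows support for it):

	#include <liburing.h>
	#include <stdio.h>

	int main(void)
	{
		struct io_uring ring;
		int batch, i;

		/* tiny ring: 8 SQ entries, CQ defaults to 2x that */
		if (io_uring_queue_init(8, &ring, 0))
			return 1;

		/* keep submitting NOPs without reaping, so completions
		 * eventually exceed the CQ size and hit the overflow path */
		for (batch = 0; batch < 8; batch++) {
			for (i = 0; i < 8; i++) {
				struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

				if (!sqe)
					break;
				io_uring_prep_nop(sqe);
			}
			io_uring_submit(&ring);
		}

		printf("submitted NOPs without reaping; overflow should have hit\n");
		io_uring_queue_exit(&ring);
		return 0;
	}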
> I also did the liburing support, but haven't pushed it yet. That's
> another case where some care has to be taken to avoid makig the general
> path slower.
>
> Oh, it's here, usual branch:
>
> https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-big-sqe
>
> and based on top of the pending 5.18 bits and the current 5.19 bits.
>
> >> Then I'd suggest to collapse a few of the patches in the series,
> >> the ones that simply modify or fix gaps in previous ones. Order
> >> the series so we build the support and then add nvme support
> >> nicely on top of that.
> >
> > I think we already did away with patches which were fixing only the
> > gaps. But yes, patches still add infra for features incrementally.
> > Do you mean having all io_uring infra (async, plug, poll) squashed
> > into a single io_uring patch?
>
> At least async and plug, I'll double check on the poll bit.
Sounds right, the plug should definitely go in the async one.
^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [PATCH 17/17] nvme: enable non-inline passthru commands
2022-04-01 2:44 ` Jens Axboe
2022-04-01 3:05 ` Jens Axboe
2022-04-01 6:32 ` Kanchan Joshi
@ 2022-04-19 17:31 ` Kanchan Joshi
2022-04-19 18:19 ` Jens Axboe
2 siblings, 1 reply; 122+ messages in thread
From: Kanchan Joshi @ 2022-04-19 17:31 UTC (permalink / raw)
To: Jens Axboe
Cc: Christoph Hellwig, Kanchan Joshi, Keith Busch, Pavel Begunkov,
io-uring, linux-nvme, linux-block, sbates, logang, Pankaj Raghav,
Javier González, Luis Chamberlain, Adam Manzanares,
Anuj Gupta
Hi Jens,
Few thoughts below toward the next version -
On Fri, Apr 1, 2022 at 8:14 AM Jens Axboe <[email protected]> wrote:
[snip]
> >>> Sure, will post the code with bigger-cqe first.
> >>
> >> I can add the support, should be pretty trivial. And do the liburing
> >> side as well, so we have a sane base.
> >
> > I will post the big-cqe based work today. It works with fio.
> > It does not deal with liburing (which seems tricky), but hopefully it
> > can help us move forward anyway .
>
> Let's compare then, since I just did the support too :-)
The major difference is generic support (rather than uring-cmd only) and
not touching the regular completion path. So the plan is to use your patch
for the next version with some bits added (e.g. overflow-handling and
avoiding the extra CQE tail increment). Hope that sounds fine.
We have things working on top of your current branch
"io_uring-big-sqe". Since SQE now has 8 bytes of free space (post
xattr merge) and CQE infra is different (post cqe-caching in ctx) -
things needed to be done a bit differently. But all this is now tested
better with liburing support/util (plan is to post that too).
^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [PATCH 17/17] nvme: enable non-inline passthru commands
2022-04-19 17:31 ` Kanchan Joshi
@ 2022-04-19 18:19 ` Jens Axboe
[not found] ` <CGME20220420152003epcas5p3991e6941773690bcb425fd9d817105c3@epcas5p3.samsung.com>
0 siblings, 1 reply; 122+ messages in thread
From: Jens Axboe @ 2022-04-19 18:19 UTC (permalink / raw)
To: Kanchan Joshi
Cc: Christoph Hellwig, Kanchan Joshi, Keith Busch, Pavel Begunkov,
io-uring, linux-nvme, linux-block, sbates, logang, Pankaj Raghav,
Javier González, Luis Chamberlain, Adam Manzanares,
Anuj Gupta
On 4/19/22 11:31 AM, Kanchan Joshi wrote:
> Hi Jens,
> Few thoughts below toward the next version -
>
> On Fri, Apr 1, 2022 at 8:14 AM Jens Axboe <[email protected]> wrote:
> [snip]
>>>>> Sure, will post the code with bigger-cqe first.
>>>>
>>>> I can add the support, should be pretty trivial. And do the liburing
>>>> side as well, so we have a sane base.
>>>
>>> I will post the big-cqe based work today. It works with fio.
>>> It does not deal with liburing (which seems tricky), but hopefully it
>>> can help us move forward anyway .
>>
>> Let's compare then, since I just did the support too :-)
>
> Major difference is generic support (rather than uring-cmd only) and
> not touching the regular completion path. So plan is to use your patch
> for the next version with some bits added (e.g. overflow-handling and
> avoiding extra CQE tail increment). Hope that sounds fine.
I'll sanitize my branch today or tomorrow, it has overflow and proper cq
ring management now, just hasn't been posted yet. So it should be
complete.
> We have things working on top of your current branch
> "io_uring-big-sqe". Since SQE now has 8 bytes of free space (post
> xattr merge) and CQE infra is different (post cqe-caching in ctx) -
> things needed to be done a bit differently. But all this is now tested
> better with liburing support/util (plan is to post that too).
Just still grab the 16 bytes, we don't care about addr3 for passthrough.
Should be no changes required there.
--
Jens Axboe
^ permalink raw reply [flat|nested] 122+ messages in thread
* Re: [PATCH 17/17] nvme: enable non-inline passthru commands
[not found] ` <CGME20220420152003epcas5p3991e6941773690bcb425fd9d817105c3@epcas5p3.samsung.com>
@ 2022-04-20 15:14 ` Kanchan Joshi
2022-04-20 15:28 ` Kanchan Joshi
0 siblings, 1 reply; 122+ messages in thread
From: Kanchan Joshi @ 2022-04-20 15:14 UTC (permalink / raw)
To: Jens Axboe
Cc: Kanchan Joshi, Christoph Hellwig, Keith Busch, Pavel Begunkov,
io-uring, linux-nvme, linux-block, sbates, logang, Pankaj Raghav,
Javier González, Luis Chamberlain, Adam Manzanares,
Anuj Gupta
[-- Attachment #1: Type: text/plain, Size: 20711 bytes --]
On Tue, Apr 19, 2022 at 12:19:01PM -0600, Jens Axboe wrote:
>On 4/19/22 11:31 AM, Kanchan Joshi wrote:
>> Hi Jens,
>> Few thoughts below toward the next version -
>>
>> On Fri, Apr 1, 2022 at 8:14 AM Jens Axboe <[email protected]> wrote:
>> [snip]
>>>>>> Sure, will post the code with bigger-cqe first.
>>>>>
>>>>> I can add the support, should be pretty trivial. And do the liburing
>>>>> side as well, so we have a sane base.
>>>>
>>>> I will post the big-cqe based work today. It works with fio.
>>>> It does not deal with liburing (which seems tricky), but hopefully it
>>>> can help us move forward anyway .
>>>
>>> Let's compare then, since I just did the support too :-)
>>
>> Major difference is generic support (rather than uring-cmd only) and
>> not touching the regular completion path. So plan is to use your patch
>> for the next version with some bits added (e.g. overflow-handling and
>> avoiding extra CQE tail increment). Hope that sounds fine.
>
>I'll sanitize my branch today or tomorrow, it has overflow and proper cq
>ring management now, just hasn't been posted yet. So it should be
>complete.
Ok, thanks.
Here is the revised patch that works for me (perhaps you don't need it, but
worth sharing if it saves any cycles). It would require one change in the
previous (big-sqe) patch:
enum io_uring_cmd_flags {
IO_URING_F_COMPLETE_DEFER = 1,
IO_URING_F_UNLOCKED = 2,
+ IO_URING_F_SQE128 = 4,
Subject: [PATCH 2/7] io_uring: add support for 32-byte CQEs
Normal CQEs are 16-bytes in length, which is fine for all the commands
we support. However, in preparation for supporting passthrough IO,
provide an option for setting up a ring with 32-byte CQEs and add a
helper for completing them.
Rather than always use the slower locked path, wire up use of the
deferred completion path that normal CQEs can take. This reuses the
hash list node for the storage we need to hold the two 64-bit values
that must be passed back.
Signed-off-by: Jens Axboe <[email protected]>
---
fs/io_uring.c | 206 ++++++++++++++++++++++++++++----
include/trace/events/io_uring.h | 18 ++-
include/uapi/linux/io_uring.h | 12 ++
3 files changed, 212 insertions(+), 24 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 5e2b7485f380..6c1a69ae74a4 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -206,6 +206,7 @@ enum io_uring_cmd_flags {
IO_URING_F_COMPLETE_DEFER = 1,
IO_URING_F_UNLOCKED = 2,
IO_URING_F_SQE128 = 4,
+ IO_URING_F_CQE32 = 8,
/* int's last bit, sign checks are usually faster than a bit test */
IO_URING_F_NONBLOCK = INT_MIN,
};
@@ -221,8 +222,8 @@ struct io_mapped_ubuf {
struct io_ring_ctx;
struct io_overflow_cqe {
- struct io_uring_cqe cqe;
struct list_head list;
+ struct io_uring_cqe cqe; /* this must be kept at end */
};
struct io_fixed_file {
@@ -954,7 +955,13 @@ struct io_kiocb {
atomic_t poll_refs;
struct io_task_work io_task_work;
/* for polled requests, i.e. IORING_OP_POLL_ADD and async armed poll */
- struct hlist_node hash_node;
+ union {
+ struct hlist_node hash_node;
+ struct {
+ u64 extra1;
+ u64 extra2;
+ };
+ };
/* internal polling, see IORING_FEAT_FAST_POLL */
struct async_poll *apoll;
/* opcode allocated if it needs to store data for async defer */
@@ -1902,6 +1909,40 @@ static inline struct io_uring_cqe *io_get_cqe(struct io_ring_ctx *ctx)
return __io_get_cqe(ctx);
}
+static noinline struct io_uring_cqe *__io_get_cqe32(struct io_ring_ctx *ctx)
+{
+ struct io_rings *rings = ctx->rings;
+ unsigned int off = ctx->cached_cq_tail & (ctx->cq_entries - 1);
+ unsigned int free, queued, len;
+
+ /* userspace may cheat modifying the tail, be safe and do min */
+ queued = min(__io_cqring_events(ctx), ctx->cq_entries);
+ free = ctx->cq_entries - queued;
+ /* we need a contiguous range, limit based on the current array offset */
+ len = min(free, ctx->cq_entries - off);
+ if (!len)
+ return NULL;
+
+ ctx->cached_cq_tail++;
+ /* double increment for 32 CQEs */
+ ctx->cqe_cached = &rings->cqes[off << 1];
+ ctx->cqe_sentinel = ctx->cqe_cached + (len << 1);
+ return ctx->cqe_cached;
+}
+
+static inline struct io_uring_cqe *io_get_cqe32(struct io_ring_ctx *ctx)
+{
+ struct io_uring_cqe *cqe32;
+ if (likely(ctx->cqe_cached < ctx->cqe_sentinel)) {
+ ctx->cached_cq_tail++;
+ cqe32 = ctx->cqe_cached;
+ } else
+ cqe32 = __io_get_cqe32(ctx);
+ /* double increment for 32b CQE*/
+ ctx->cqe_cached += 2;
+ return cqe32;
+}
+
static void io_eventfd_signal(struct io_ring_ctx *ctx)
{
struct io_ev_fd *ev_fd;
@@ -1977,15 +2018,21 @@ static bool __io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force)
posted = false;
spin_lock(&ctx->completion_lock);
while (!list_empty(&ctx->cq_overflow_list)) {
- struct io_uring_cqe *cqe = io_get_cqe(ctx);
+ struct io_uring_cqe *cqe;
struct io_overflow_cqe *ocqe;
+ /* copy more for big-cqe */
+ int cqeshift = ctx->flags & IORING_SETUP_CQE32 ? 1 : 0;
+ if (cqeshift)
+ cqe = io_get_cqe32(ctx);
+ else
+ cqe = io_get_cqe(ctx);
if (!cqe && !force)
break;
ocqe = list_first_entry(&ctx->cq_overflow_list,
struct io_overflow_cqe, list);
if (cqe)
- memcpy(cqe, &ocqe->cqe, sizeof(*cqe));
+ memcpy(cqe, &ocqe->cqe, sizeof(*cqe) << cqeshift);
else
io_account_cq_overflow(ctx);
@@ -2074,11 +2121,18 @@ static __cold void io_uring_drop_tctx_refs(struct task_struct *task)
}
static bool io_cqring_event_overflow(struct io_ring_ctx *ctx, u64 user_data,
- s32 res, u32 cflags)
+ s32 res, u32 cflags, u64 extra1,
+ u64 extra2)
{
struct io_overflow_cqe *ocqe;
+ int size = sizeof(*ocqe);
+ bool cqe32 = ctx->flags & IORING_SETUP_CQE32;
- ocqe = kmalloc(sizeof(*ocqe), GFP_ATOMIC | __GFP_ACCOUNT);
+ /* allocate more for 32b CQE */
+ if (cqe32)
+ size += sizeof(struct io_uring_cqe);
+
+ ocqe = kmalloc(size, GFP_ATOMIC | __GFP_ACCOUNT);
if (!ocqe) {
/*
* If we're in ring overflow flush mode, or in task cancel mode,
@@ -2097,6 +2151,10 @@ static bool io_cqring_event_overflow(struct io_ring_ctx *ctx, u64 user_data,
ocqe->cqe.user_data = user_data;
ocqe->cqe.res = res;
ocqe->cqe.flags = cflags;
+ if (cqe32) {
+ ocqe->cqe.b[0].extra1 = extra1;
+ ocqe->cqe.b[0].extra2 = extra2;
+ }
list_add_tail(&ocqe->list, &ctx->cq_overflow_list);
return true;
}
@@ -2118,7 +2176,35 @@ static inline bool __io_fill_cqe(struct io_ring_ctx *ctx, u64 user_data,
WRITE_ONCE(cqe->flags, cflags);
return true;
}
- return io_cqring_event_overflow(ctx, user_data, res, cflags);
+ return io_cqring_event_overflow(ctx, user_data, res, cflags, 0, 0);
+}
+
+static inline bool __io_fill_cqe32_req_filled(struct io_ring_ctx *ctx,
+ struct io_kiocb *req)
+{
+ struct io_uring_cqe *cqe;
+ u64 extra1 = req->extra1;
+ u64 extra2 = req->extra2;
+
+ trace_io_uring_complete(req->ctx, req, req->cqe.user_data,
+ req->cqe.res, req->cqe.flags, extra1,
+ extra2);
+
+ /*
+ * If we can't get a cq entry, userspace overflowed the
+ * submission (by quite a lot). Increment the overflow count in
+ * the ring.
+ */
+ cqe = io_get_cqe32(ctx);
+ if (likely(cqe)) {
+ memcpy(cqe, &req->cqe, sizeof(*cqe));
+ cqe->b[0].extra1 = extra1;
+ cqe->b[0].extra2 = extra2;
+ return true;
+ }
+ return io_cqring_event_overflow(ctx, req->cqe.user_data,
+ req->cqe.res, req->cqe.flags, extra1,
+ extra2);
}
static inline bool __io_fill_cqe_req_filled(struct io_ring_ctx *ctx,
@@ -2127,7 +2213,7 @@ static inline bool __io_fill_cqe_req_filled(struct io_ring_ctx *ctx,
struct io_uring_cqe *cqe;
trace_io_uring_complete(req->ctx, req, req->cqe.user_data,
- req->cqe.res, req->cqe.flags);
+ req->cqe.res, req->cqe.flags, 0, 0);
/*
* If we can't get a cq entry, userspace overflowed the
@@ -2140,12 +2226,13 @@ static inline bool __io_fill_cqe_req_filled(struct io_ring_ctx *ctx,
return true;
}
return io_cqring_event_overflow(ctx, req->cqe.user_data,
- req->cqe.res, req->cqe.flags);
+ req->cqe.res, req->cqe.flags, 0, 0);
}
static inline bool __io_fill_cqe_req(struct io_kiocb *req, s32 res, u32 cflags)
{
- trace_io_uring_complete(req->ctx, req, req->cqe.user_data, res, cflags);
+ trace_io_uring_complete(req->ctx, req, req->cqe.user_data, res, cflags,
+ 0, 0);
return __io_fill_cqe(req->ctx, req->cqe.user_data, res, cflags);
}
@@ -2159,22 +2246,52 @@ static noinline bool io_fill_cqe_aux(struct io_ring_ctx *ctx, u64 user_data,
s32 res, u32 cflags)
{
ctx->cq_extra++;
- trace_io_uring_complete(ctx, NULL, user_data, res, cflags);
+ trace_io_uring_complete(ctx, NULL, user_data, res, cflags, 0, 0);
return __io_fill_cqe(ctx, user_data, res, cflags);
}
-static void __io_req_complete_post(struct io_kiocb *req, s32 res,
- u32 cflags)
+static void __io_fill_cqe32_req(struct io_kiocb *req, s32 res, u32 cflags,
+ u64 extra1, u64 extra2)
{
struct io_ring_ctx *ctx = req->ctx;
+ struct io_uring_cqe *cqe;
+
+ if (WARN_ON_ONCE(!(ctx->flags & IORING_SETUP_CQE32)))
+ return;
+ if (req->flags & REQ_F_CQE_SKIP)
+ return;
+
+ trace_io_uring_complete(ctx, req, req->cqe.user_data, res, cflags,
+ extra1, extra2);
- if (!(req->flags & REQ_F_CQE_SKIP))
- __io_fill_cqe_req(req, res, cflags);
+ /*
+ * If we can't get a cq entry, userspace overflowed the
+ * submission (by quite a lot). Increment the overflow count in
+ * the ring.
+ */
+ cqe = io_get_cqe32(ctx);
+ if (likely(cqe)) {
+ WRITE_ONCE(cqe->user_data, req->cqe.user_data);
+ WRITE_ONCE(cqe->res, res);
+ WRITE_ONCE(cqe->flags, cflags);
+ WRITE_ONCE(cqe->b[0].extra1, extra1);
+ WRITE_ONCE(cqe->b[0].extra2, extra2);
+ return;
+ }
+
+ io_cqring_event_overflow(ctx, req->cqe.user_data, res, cflags, extra1,
+ extra2);
+}
+
+static void __io_req_complete_put(struct io_kiocb *req)
+{
/*
* If we're the last reference to this request, add to our locked
* free_list cache.
*/
if (req_ref_put_and_test(req)) {
+ struct io_ring_ctx *ctx = req->ctx;
+
if (req->flags & (REQ_F_LINK | REQ_F_HARDLINK)) {
if (req->flags & IO_DISARM_MASK)
io_disarm_next(req);
@@ -2197,6 +2314,33 @@ static void __io_req_complete_post(struct io_kiocb *req, s32 res,
}
}
+static void __io_req_complete_post(struct io_kiocb *req, s32 res,
+ u32 cflags)
+{
+ if (!(req->flags & REQ_F_CQE_SKIP))
+ __io_fill_cqe_req(req, res, cflags);
+ __io_req_complete_put(req);
+}
+
+static void io_req_complete_post32(struct io_kiocb *req, s32 res,
+ u32 cflags, u64 extra1, u64 extra2)
+{
+ struct io_ring_ctx *ctx = req->ctx;
+ bool posted = false;
+
+ spin_lock(&ctx->completion_lock);
+
+ if (!(req->flags & REQ_F_CQE_SKIP)) {
+ __io_fill_cqe32_req(req, res, cflags, extra1, extra2);
+ io_commit_cqring(ctx);
+ posted = true;
+ }
+ __io_req_complete_put(req);
+ spin_unlock(&ctx->completion_lock);
+ if (posted)
+ io_cqring_ev_posted(ctx);
+}
+
static void io_req_complete_post(struct io_kiocb *req, s32 res,
u32 cflags)
{
@@ -2226,6 +2370,19 @@ static inline void __io_req_complete(struct io_kiocb *req, unsigned issue_flags,
io_req_complete_post(req, res, cflags);
}
+static inline void __io_req_complete32(struct io_kiocb *req,
+ unsigned issue_flags, s32 res,
+ u32 cflags, u64 extra1, u64 extra2)
+{
+ if (issue_flags & IO_URING_F_COMPLETE_DEFER) {
+ io_req_complete_state(req, res, cflags);
+ req->extra1 = extra1;
+ req->extra2 = extra2;
+ } else {
+ io_req_complete_post32(req, res, cflags, extra1, extra2);
+ }
+}
+
static inline void io_req_complete(struct io_kiocb *req, s32 res)
{
__io_req_complete(req, 0, res, 0);
@@ -2779,8 +2936,12 @@ static void __io_submit_flush_completions(struct io_ring_ctx *ctx)
struct io_kiocb *req = container_of(node, struct io_kiocb,
comp_list);
- if (!(req->flags & REQ_F_CQE_SKIP))
+ if (req->flags & REQ_F_CQE_SKIP)
+ continue;
+ if (!(ctx->flags & IORING_SETUP_CQE32))
__io_fill_cqe_req_filled(ctx, req);
+ else
+ __io_fill_cqe32_req_filled(ctx, req);
}
io_commit_cqring(ctx);
@@ -9632,8 +9793,8 @@ static void *io_mem_alloc(size_t size)
return (void *) __get_free_pages(gfp, get_order(size));
}
-static unsigned long rings_size(unsigned sq_entries, unsigned cq_entries,
- size_t *sq_offset)
+static unsigned long rings_size(struct io_ring_ctx *ctx, unsigned sq_entries,
+ unsigned cq_entries, size_t *sq_offset)
{
struct io_rings *rings;
size_t off, sq_array_size;
@@ -9641,6 +9802,11 @@ static unsigned long rings_size(unsigned sq_entries, unsigned cq_entries,
off = struct_size(rings, cqes, cq_entries);
if (off == SIZE_MAX)
return SIZE_MAX;
+ if (ctx->flags & IORING_SETUP_CQE32) {
+ if ((off << 1) < off)
+ return SIZE_MAX;
+ off <<= 1;
+ }
#ifdef CONFIG_SMP
off = ALIGN(off, SMP_CACHE_BYTES);
@@ -11297,7 +11463,7 @@ static __cold int io_allocate_scq_urings(struct io_ring_ctx *ctx,
ctx->sq_entries = p->sq_entries;
ctx->cq_entries = p->cq_entries;
- size = rings_size(p->sq_entries, p->cq_entries, &sq_array_offset);
+ size = rings_size(ctx, p->sq_entries, p->cq_entries, &sq_array_offset);
if (size == SIZE_MAX)
return -EOVERFLOW;
@@ -11540,7 +11706,7 @@ static long io_uring_setup(u32 entries, struct io_uring_params __user *params)
IORING_SETUP_SQ_AFF | IORING_SETUP_CQSIZE |
IORING_SETUP_CLAMP | IORING_SETUP_ATTACH_WQ |
IORING_SETUP_R_DISABLED | IORING_SETUP_SUBMIT_ALL |
- IORING_SETUP_SQE128))
+ IORING_SETUP_SQE128 | IORING_SETUP_CQE32))
return -EINVAL;
return io_uring_create(entries, &p, params);
diff --git a/include/trace/events/io_uring.h b/include/trace/events/io_uring.h
index 8477414d6d06..2eb4f4e47de4 100644
--- a/include/trace/events/io_uring.h
+++ b/include/trace/events/io_uring.h
@@ -318,13 +318,16 @@ TRACE_EVENT(io_uring_fail_link,
* @user_data: user data associated with the request
* @res: result of the request
* @cflags: completion flags
+ * @extra1: extra 64-bit data for CQE32
+ * @extra2: extra 64-bit data for CQE32
*
*/
TRACE_EVENT(io_uring_complete,
- TP_PROTO(void *ctx, void *req, u64 user_data, int res, unsigned cflags),
+ TP_PROTO(void *ctx, void *req, u64 user_data, int res, unsigned cflags,
+ u64 extra1, u64 extra2),
- TP_ARGS(ctx, req, user_data, res, cflags),
+ TP_ARGS(ctx, req, user_data, res, cflags, extra1, extra2),
TP_STRUCT__entry (
__field( void *, ctx )
@@ -332,6 +335,8 @@ TRACE_EVENT(io_uring_complete,
__field( u64, user_data )
__field( int, res )
__field( unsigned, cflags )
+ __field( u64, extra1 )
+ __field( u64, extra2 )
),
TP_fast_assign(
@@ -340,12 +345,17 @@ TRACE_EVENT(io_uring_complete,
__entry->user_data = user_data;
__entry->res = res;
__entry->cflags = cflags;
+ __entry->extra1 = extra1;
+ __entry->extra2 = extra2;
),
- TP_printk("ring %p, req %p, user_data 0x%llx, result %d, cflags 0x%x",
+ TP_printk("ring %p, req %p, user_data 0x%llx, result %d, cflags 0x%x "
+ "extra1 %llu extra2 %llu ",
__entry->ctx, __entry->req,
__entry->user_data,
- __entry->res, __entry->cflags)
+ __entry->res, __entry->cflags,
+ (unsigned long long) __entry->extra1,
+ (unsigned long long) __entry->extra2)
);
/**
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 88a5c67d6666..1fe0ad3668d1 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -111,6 +111,7 @@ enum {
#define IORING_SETUP_R_DISABLED (1U << 6) /* start with ring disabled */
#define IORING_SETUP_SUBMIT_ALL (1U << 7) /* continue submit on error */
#define IORING_SETUP_SQE128 (1U << 8) /* SQEs are 128b */
+#define IORING_SETUP_CQE32 (1U << 9) /* CQEs are 32b */
enum {
IORING_OP_NOP,
@@ -200,6 +201,11 @@ enum {
#define IORING_POLL_UPDATE_EVENTS (1U << 1)
#define IORING_POLL_UPDATE_USER_DATA (1U << 2)
+struct io_uring_cqe_extra {
+ __u64 extra1;
+ __u64 extra2;
+};
+
/*
* IO completion data structure (Completion Queue Entry)
*/
@@ -207,6 +213,12 @@ struct io_uring_cqe {
__u64 user_data; /* sqe->data submission passed back */
__s32 res; /* result code for this event */
__u32 flags;
+
+ /*
+ * If the ring is initialized with IORING_SETUP_CQE32, then this field
+ * contains 16-bytes of padding, doubling the size of the CQE.
+ */
+ struct io_uring_cqe_extra b[0];
};
>> We have things working on top of your current branch
>> "io_uring-big-sqe". Since SQE now has 8 bytes of free space (post
>> xattr merge) and CQE infra is different (post cqe-caching in ctx) -
>> things needed to be done a bit differently. But all this is now tested
>> better with liburing support/util (plan is to post that too).
>
>Just still grab the 16 bytes, we don't care about addr3 for passthrough.
>Should be no changes required there.
I was thinking of uring-cmd in general, but then it also does not seem
to collide with xattr. Got your point.
The change was to remove the 8b "result" field from passthru-cmd, since the
32b CQE makes that part useless, and we are adding a new opcode in nvme
anyway. Maybe we should still reduce passthru-cmd to 64b (rather than 72),
not very sure.
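For completeness, consuming the second result with the uapi addition above
would look roughly like this. Only the field access is shown, since ring-head
accounting differed between the two prototypes in this thread; read_big_cqe
is an illustrative helper and the patched header is assumed:

	/* assumes the patched <linux/io_uring.h> above, which adds
	 * struct io_uring_cqe_extra and the trailing b[] member */
	#include <linux/io_uring.h>

	static inline void read_big_cqe(const struct io_uring_cqe *cqe,
					__u64 *res1, __u64 *res2)
	{
		/* 'cqe' must point at a completed entry on a ring created
		 * with IORING_SETUP_CQE32 */
		*res1 = cqe->b[0].extra1;	/* e.g. nvme passthru result */
		*res2 = cqe->b[0].extra2;
	}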
^ permalink raw reply related [flat|nested] 122+ messages in thread
* Re: [PATCH 17/17] nvme: enable non-inline passthru commands
2022-04-20 15:14 ` Kanchan Joshi
@ 2022-04-20 15:28 ` Kanchan Joshi
0 siblings, 0 replies; 122+ messages in thread
From: Kanchan Joshi @ 2022-04-20 15:28 UTC (permalink / raw)
To: Kanchan Joshi
Cc: Jens Axboe, Christoph Hellwig, Keith Busch, Pavel Begunkov,
io-uring, linux-nvme, linux-block, sbates, logang, Pankaj Raghav,
Javier González, Luis Chamberlain, Adam Manzanares,
Anuj Gupta
> >Just still grab the 16 bytes, we don't care about addr3 for passthrough.
> >Should be no changes required there.
> I was thinking of uring-cmd in general, but then also it does not seem
> to collide with xattr. Got your point.
> Measure was removing 8b "result" field from passthru-cmd, since 32b CQE
> makes that part useless, and we are adding new opcode in nvme
> anyway. Maybe we should still reduce passthu-cmd to 64b (rather than 72),
> not very sure.
Correction above: reduce passthru-cmd to 72b (rather than 80b).
--
Joshi
^ permalink raw reply [flat|nested] 122+ messages in thread
end of thread, other threads: [~2022-04-20 15:29 UTC | newest]
Thread overview: 122+ messages
[not found] <CGME20220308152651epcas5p1ebd2dc7fa01db43dd587c228a3695696@epcas5p1.samsung.com>
2022-03-08 15:20 ` [PATCH 00/17] io_uring passthru over nvme Kanchan Joshi
[not found] ` <CGME20220308152653epcas5p10c31f58cf6bff125cc0baa176b4d4fac@epcas5p1.samsung.com>
2022-03-08 15:20 ` [PATCH 01/17] io_uring: add support for 128-byte SQEs Kanchan Joshi
[not found] ` <CGME20220308152655epcas5p4ae47d715e1c15069e97152dcd283fd40@epcas5p4.samsung.com>
2022-03-08 15:20 ` [PATCH 02/17] fs: add file_operations->async_cmd() Kanchan Joshi
[not found] ` <CGME20220308152658epcas5p3929bd1fcf75edc505fec71901158d1b5@epcas5p3.samsung.com>
2022-03-08 15:20 ` [PATCH 03/17] io_uring: add infra and support for IORING_OP_URING_CMD Kanchan Joshi
2022-03-11 1:51 ` Luis Chamberlain
2022-03-11 2:43 ` Jens Axboe
2022-03-11 17:11 ` Luis Chamberlain
2022-03-11 18:47 ` Paul Moore
2022-03-11 20:57 ` Luis Chamberlain
2022-03-11 21:03 ` Paul Moore
2022-03-14 16:25 ` Casey Schaufler
2022-03-14 16:32 ` Luis Chamberlain
2022-03-14 18:05 ` Casey Schaufler
2022-03-14 19:40 ` Luis Chamberlain
[not found] ` <CGME20220308152700epcas5p4130d20119a3a250a2515217d6552f668@epcas5p4.samsung.com>
2022-03-08 15:20 ` [PATCH 04/17] nvme: modify nvme_alloc_request to take an additional parameter Kanchan Joshi
2022-03-11 6:38 ` Christoph Hellwig
[not found] ` <CGME20220308152702epcas5p1eb1880e024ac8b9531c85a82f31a4e78@epcas5p1.samsung.com>
2022-03-08 15:20 ` [PATCH 05/17] nvme: wire-up support for async-passthru on char-device Kanchan Joshi
2022-03-10 0:02 ` Clay Mayers
2022-03-10 8:32 ` Kanchan Joshi
2022-03-11 7:01 ` Christoph Hellwig
2022-03-14 16:23 ` Kanchan Joshi
2022-03-15 8:54 ` Christoph Hellwig
2022-03-16 7:27 ` Kanchan Joshi
2022-03-24 6:22 ` Christoph Hellwig
2022-03-24 17:45 ` Kanchan Joshi
2022-03-11 17:56 ` Luis Chamberlain
2022-03-11 18:53 ` Paul Moore
2022-03-11 21:02 ` Luis Chamberlain
2022-03-13 21:53 ` Sagi Grimberg
2022-03-14 17:54 ` Kanchan Joshi
2022-03-15 9:02 ` Sagi Grimberg
2022-03-16 9:21 ` Kanchan Joshi
2022-03-16 10:56 ` Sagi Grimberg
2022-03-16 11:51 ` Kanchan Joshi
2022-03-16 13:52 ` Sagi Grimberg
2022-03-16 14:35 ` Jens Axboe
2022-03-16 14:50 ` Sagi Grimberg
2022-03-24 6:20 ` Christoph Hellwig
2022-03-24 10:42 ` Sagi Grimberg
2022-03-22 15:18 ` Clay Mayers
2022-03-22 16:57 ` Kanchan Joshi
[not found] ` <CGME20220308152704epcas5p16610e1f50672b25fa1df5f7c5c261bb5@epcas5p1.samsung.com>
2022-03-08 15:20 ` [PATCH 06/17] io_uring: prep for fixed-buffer enabled uring-cmd Kanchan Joshi
[not found] ` <CGME20220308152707epcas5p430127761a7fd4bf90c2501eabe9ee96e@epcas5p4.samsung.com>
2022-03-08 15:20 ` [PATCH 07/17] io_uring: add support for uring_cmd with fixed-buffer Kanchan Joshi
[not found] ` <CGME20220308152709epcas5p1f9d274a0214dc462c22c278a72d8697c@epcas5p1.samsung.com>
2022-03-08 15:20 ` [PATCH 08/17] nvme: enable passthrough " Kanchan Joshi
2022-03-10 8:32 ` Christoph Hellwig
2022-03-11 6:43 ` Christoph Hellwig
2022-03-14 13:06 ` Kanchan Joshi
2022-03-15 8:55 ` Christoph Hellwig
2022-03-14 12:18 ` Ming Lei
2022-03-14 13:09 ` Kanchan Joshi
[not found] ` <CGME20220308152711epcas5p31de5d63f5de91fae94e61e5c857c0f13@epcas5p3.samsung.com>
2022-03-08 15:20 ` [PATCH 09/17] io_uring: plug for async bypass Kanchan Joshi
2022-03-10 8:33 ` Christoph Hellwig
2022-03-14 14:33 ` Ming Lei
2022-03-15 8:56 ` Christoph Hellwig
2022-03-11 17:15 ` Luis Chamberlain
[not found] ` <CGME20220308152714epcas5p4c5a0d16512fd7054c9a713ee28ede492@epcas5p4.samsung.com>
2022-03-08 15:20 ` [PATCH 10/17] block: wire-up support for plugging Kanchan Joshi
2022-03-10 8:34 ` Christoph Hellwig
2022-03-10 12:40 ` Kanchan Joshi
2022-03-14 14:40 ` Ming Lei
2022-03-21 7:02 ` Kanchan Joshi
2022-03-23 1:27 ` Ming Lei
2022-03-23 1:41 ` Jens Axboe
2022-03-23 1:58 ` Jens Axboe
2022-03-23 2:10 ` Ming Lei
2022-03-23 2:17 ` Jens Axboe
[not found] ` <CGME20220308152716epcas5p3d38d2372c184259f1a10c969f7e4396f@epcas5p3.samsung.com>
2022-03-08 15:20 ` [PATCH 11/17] block: factor out helper for bio allocation from cache Kanchan Joshi
2022-03-10 8:35 ` Christoph Hellwig
2022-03-10 12:25 ` Kanchan Joshi
2022-03-24 6:30 ` Christoph Hellwig
2022-03-24 17:45 ` Kanchan Joshi
2022-03-25 5:38 ` Christoph Hellwig
[not found] ` <CGME20220308152718epcas5p3afd2c8a628f4e9733572cbb39270989d@epcas5p3.samsung.com>
2022-03-08 15:21 ` [PATCH 12/17] nvme: enable bio-cache for fixed-buffer passthru Kanchan Joshi
2022-03-11 6:48 ` Christoph Hellwig
2022-03-14 18:18 ` Kanchan Joshi
2022-03-15 8:57 ` Christoph Hellwig
[not found] ` <CGME20220308152720epcas5p19653942458e160714444942ddb8b8579@epcas5p1.samsung.com>
2022-03-08 15:21 ` [PATCH 13/17] nvme: allow user passthrough commands to poll Kanchan Joshi
2022-03-08 17:08 ` Keith Busch
2022-03-09 7:03 ` Kanchan Joshi
2022-03-11 6:49 ` Christoph Hellwig
[not found] ` <CGME20220308152723epcas5p34460b4af720e515317f88dbb78295f06@epcas5p3.samsung.com>
2022-03-08 15:21 ` [PATCH 14/17] io_uring: add polling support for uring-cmd Kanchan Joshi
2022-03-11 6:50 ` Christoph Hellwig
2022-03-14 10:16 ` Kanchan Joshi
2022-03-15 8:57 ` Christoph Hellwig
2022-03-16 5:09 ` Kanchan Joshi
2022-03-24 6:30 ` Christoph Hellwig
[not found] ` <CGME20220308152725epcas5p36d1ce3269a47c1c22cc0d66bdc2b9eb3@epcas5p3.samsung.com>
2022-03-08 15:21 ` [PATCH 15/17] nvme: wire-up polling for uring-passthru Kanchan Joshi
[not found] ` <CGME20220308152727epcas5p20e605718dd99e97c94f9232d40d04d95@epcas5p2.samsung.com>
2022-03-08 15:21 ` [PATCH 16/17] io_uring: add support for non-inline uring-cmd Kanchan Joshi
[not found] ` <CGME20220308152729epcas5p17e82d59c68076eb46b5ef658619d65e3@epcas5p1.samsung.com>
2022-03-08 15:21 ` [PATCH 17/17] nvme: enable non-inline passthru commands Kanchan Joshi
2022-03-10 8:36 ` Christoph Hellwig
2022-03-10 11:50 ` Kanchan Joshi
2022-03-10 14:19 ` Christoph Hellwig
2022-03-10 18:43 ` Kanchan Joshi
2022-03-11 6:27 ` Christoph Hellwig
2022-03-22 17:10 ` Kanchan Joshi
2022-03-24 6:32 ` Christoph Hellwig
2022-03-25 13:39 ` Kanchan Joshi
2022-03-28 4:44 ` Kanchan Joshi
2022-03-30 12:59 ` Christoph Hellwig
2022-03-30 13:02 ` Christoph Hellwig
2022-03-30 13:14 ` Kanchan Joshi
2022-04-01 1:25 ` Jens Axboe
2022-04-01 2:33 ` Kanchan Joshi
2022-04-01 2:44 ` Jens Axboe
2022-04-01 3:05 ` Jens Axboe
2022-04-01 6:32 ` Kanchan Joshi
2022-04-19 17:31 ` Kanchan Joshi
2022-04-19 18:19 ` Jens Axboe
[not found] ` <CGME20220420152003epcas5p3991e6941773690bcb425fd9d817105c3@epcas5p3.samsung.com>
2022-04-20 15:14 ` Kanchan Joshi
2022-04-20 15:28 ` Kanchan Joshi
2022-04-01 1:23 ` Jens Axboe
2022-04-01 1:22 ` Jens Axboe
2022-04-01 6:29 ` Kanchan Joshi
2022-03-24 21:09 ` Clay Mayers
2022-03-24 23:36 ` Jens Axboe
2022-03-10 8:29 ` [PATCH 00/17] io_uring passthru over nvme Christoph Hellwig
2022-03-10 10:05 ` Kanchan Joshi
2022-03-11 16:43 ` Luis Chamberlain
2022-03-11 23:35 ` Adam Manzanares
2022-03-12 2:27 ` Adam Manzanares
2022-03-13 5:07 ` Kanchan Joshi
2022-03-14 20:30 ` Adam Manzanares
2022-03-13 5:10 ` Kanchan Joshi