* [PATCH V2 01/17] io_uring: add IO_URING_F_FUSED and prepare for supporting OP_FUSED_CMD
2023-03-07 14:15 [PATCH V2 00/17] io_uring/ublk: add IORING_OP_FUSED_CMD Ming Lei
@ 2023-03-07 14:15 ` Ming Lei
2023-03-07 14:15 ` [PATCH V2 02/17] io_uring: increase io_kiocb->flags into 64bit Ming Lei
` (17 subsequent siblings)
18 siblings, 0 replies; 42+ messages in thread
From: Ming Lei @ 2023-03-07 14:15 UTC (permalink / raw)
To: Jens Axboe, io-uring
Cc: linux-block, Miklos Szeredi, ZiyangZhang, Xiaoguang Wang,
Bernd Schubert, Ming Lei
Add the flag IO_URING_F_FUSED in preparation for supporting IORING_OP_FUSED_CMD,
which is still one type of IORING_OP_URING_CMD, so it is reasonable
to reuse ->uring_cmd() for handling fused commands.
A fused command carries one extra 64-byte SQE as payload, which is
handled as a slave request. The master uring command provides a
kernel buffer to the slave request.
Mark all existing drivers as not supporting fused commands, given that
support depends on whether the driver is capable of handling the slave
request.
Signed-off-by: Ming Lei <[email protected]>
---
drivers/block/ublk_drv.c | 6 ++++++
drivers/char/mem.c | 4 ++++
drivers/nvme/host/ioctl.c | 9 +++++++++
include/linux/io_uring.h | 7 +++++++
4 files changed, 26 insertions(+)
diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index d1d1c8d606c8..088d9efffe6b 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -1271,6 +1271,9 @@ static int ublk_ch_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
__func__, cmd->cmd_op, ub_cmd->q_id, tag,
ub_cmd->result);
+ if (issue_flags & IO_URING_F_FUSED)
+ return -EOPNOTSUPP;
+
if (ub_cmd->q_id >= ub->dev_info.nr_hw_queues)
goto out;
@@ -2169,6 +2172,9 @@ static int ublk_ctrl_uring_cmd(struct io_uring_cmd *cmd,
struct ublk_device *ub = NULL;
int ret = -EINVAL;
+ if (issue_flags & IO_URING_F_FUSED)
+ return -EOPNOTSUPP;
+
if (issue_flags & IO_URING_F_NONBLOCK)
return -EAGAIN;
diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index ffb101d349f0..134ba6665194 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -30,6 +30,7 @@
#include <linux/uio.h>
#include <linux/uaccess.h>
#include <linux/security.h>
+#include <linux/io_uring.h>
#ifdef CONFIG_IA64
# include <linux/efi.h>
@@ -482,6 +483,9 @@ static ssize_t splice_write_null(struct pipe_inode_info *pipe, struct file *out,
static int uring_cmd_null(struct io_uring_cmd *ioucmd, unsigned int issue_flags)
{
+ if (issue_flags & IO_URING_F_FUSED)
+ return -EOPNOTSUPP;
+
return 0;
}
diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index 723e7d5b778f..44a171bcaa90 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -773,6 +773,9 @@ int nvme_ns_chr_uring_cmd(struct io_uring_cmd *ioucmd, unsigned int issue_flags)
struct nvme_ns *ns = container_of(file_inode(ioucmd->file)->i_cdev,
struct nvme_ns, cdev);
+ if (issue_flags & IO_URING_F_FUSED)
+ return -EOPNOTSUPP;
+
return nvme_ns_uring_cmd(ns, ioucmd, issue_flags);
}
@@ -878,6 +881,9 @@ int nvme_ns_head_chr_uring_cmd(struct io_uring_cmd *ioucmd,
struct nvme_ns *ns = nvme_find_path(head);
int ret = -EINVAL;
+ if (issue_flags & IO_URING_F_FUSED)
+ return -EOPNOTSUPP;
+
if (ns)
ret = nvme_ns_uring_cmd(ns, ioucmd, issue_flags);
srcu_read_unlock(&head->srcu, srcu_idx);
@@ -915,6 +921,9 @@ int nvme_dev_uring_cmd(struct io_uring_cmd *ioucmd, unsigned int issue_flags)
struct nvme_ctrl *ctrl = ioucmd->file->private_data;
int ret;
+ if (issue_flags & IO_URING_F_FUSED)
+ return -EOPNOTSUPP;
+
/* IOPOLL not supported yet */
if (issue_flags & IO_URING_F_IOPOLL)
return -EOPNOTSUPP;
diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
index 934e5dd4ccc0..0676c5b4b5fe 100644
--- a/include/linux/io_uring.h
+++ b/include/linux/io_uring.h
@@ -20,6 +20,13 @@ enum io_uring_cmd_flags {
IO_URING_F_SQE128 = (1 << 8),
IO_URING_F_CQE32 = (1 << 9),
IO_URING_F_IOPOLL = (1 << 10),
+
+ /* for FUSED_CMD only */
+ IO_URING_F_FUSED_WRITE = (1 << 11), /* slave writes to buffer */
+ IO_URING_F_FUSED_READ = (1 << 12), /* slave reads from buffer */
+ /* driver incapable of FUSED_CMD should fail cmd when seeing F_FUSED */
+ IO_URING_F_FUSED = IO_URING_F_FUSED_WRITE |
+ IO_URING_F_FUSED_READ,
};
struct io_uring_cmd {
--
2.39.2
^ permalink raw reply related [flat|nested] 42+ messages in thread
* [PATCH V2 02/17] io_uring: increase io_kiocb->flags into 64bit
2023-03-07 14:15 [PATCH V2 00/17] io_uring/ublk: add IORING_OP_FUSED_CMD Ming Lei
2023-03-07 14:15 ` [PATCH V2 01/17] io_uring: add IO_URING_F_FUSED and prepare for supporting OP_FUSED_CMD Ming Lei
@ 2023-03-07 14:15 ` Ming Lei
2023-03-07 14:15 ` [PATCH V2 03/17] io_uring: add IORING_OP_FUSED_CMD Ming Lei
` (16 subsequent siblings)
18 siblings, 0 replies; 42+ messages in thread
From: Ming Lei @ 2023-03-07 14:15 UTC (permalink / raw)
To: Jens Axboe, io-uring
Cc: linux-block, Miklos Szeredi, ZiyangZhang, Xiaoguang Wang,
Bernd Schubert, Ming Lei
The 32-bit io_kiocb->flags has been used up, so extend it to 64 bits.
Signed-off-by: Ming Lei <[email protected]>
---
include/linux/io_uring_types.h | 3 ++-
io_uring/io_uring.c | 2 +-
2 files changed, 3 insertions(+), 2 deletions(-)
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 00689c12f6ab..66fd0f7f1b24 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -530,7 +530,8 @@ struct io_kiocb {
* and after selection it points to the buffer ID itself.
*/
u16 buf_index;
- unsigned int flags;
+ u32 __pad;
+ u64 flags;
struct io_cqe cqe;
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index fd1cc35a1c00..90a835215d04 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -4418,7 +4418,7 @@ static int __init io_uring_init(void)
BUILD_BUG_ON(SQE_COMMON_FLAGS >= (1 << 8));
BUILD_BUG_ON((SQE_VALID_FLAGS | SQE_COMMON_FLAGS) != SQE_VALID_FLAGS);
- BUILD_BUG_ON(__REQ_F_LAST_BIT > 8 * sizeof(int));
+ BUILD_BUG_ON(__REQ_F_LAST_BIT > 8 * sizeof(u64));
BUILD_BUG_ON(sizeof(atomic_t) != sizeof(u32));
--
2.39.2
* [PATCH V2 03/17] io_uring: add IORING_OP_FUSED_CMD
2023-03-07 14:15 [PATCH V2 00/17] io_uring/ublk: add IORING_OP_FUSED_CMD Ming Lei
2023-03-07 14:15 ` [PATCH V2 01/17] io_uring: add IO_URING_F_FUSED and prepare for supporting OP_FUSED_CMD Ming Lei
2023-03-07 14:15 ` [PATCH V2 02/17] io_uring: increase io_kiocb->flags into 64bit Ming Lei
@ 2023-03-07 14:15 ` Ming Lei
2023-03-07 14:15 ` [PATCH V2 04/17] io_uring: support OP_READ/OP_WRITE for fused slave request Ming Lei
` (15 subsequent siblings)
18 siblings, 0 replies; 42+ messages in thread
From: Ming Lei @ 2023-03-07 14:15 UTC (permalink / raw)
To: Jens Axboe, io-uring
Cc: linux-block, Miklos Szeredi, ZiyangZhang, Xiaoguang Wang,
Bernd Schubert, Ming Lei
Add IORING_OP_FUSED_CMD, a special URING_CMD which requires SQE128.
The 1st 64-byte SQE (master) is a URING_CMD, and the 2nd 64-byte
SQE (slave) is a normal OP. Any OP that needs to support being a slave
OP has to set io_issue_defs[op].fused_slave to 1, and its ->issue()
needs to retrieve the buffer from the master request's fused_cmd_kbuf.
The key points of the design/implementation are:
1) The master uring command produces and provides an immutable command
buffer (struct io_uring_bvec_buf) to the slave request, and the slave
OP can retrieve any part of this buffer via sqe->addr and sqe->len.
2) The master command is always completed after the slave request is
completed.
- Before the slave request is submitted, buffer ownership is
transferred to the slave request. After the slave request is completed,
buffer ownership is returned to the master request.
- This also guarantees correct SQE order, since the master
request takes over the slave request's LINK flag.
3) The master request is always completed by the driver, so the driver
knows when the slave request is done with the buffer.
The motivation is to support zero copy for fuse/ublk, in which the
device holds the IO request buffer, and IO handling is often a normal
IO OP (fs, net, ...). With IORING_OP_FUSED_CMD, this kind of zero copy
can be implemented easily and reliably.
Signed-off-by: Ming Lei <[email protected]>
---
include/linux/io_uring.h | 42 +++++-
include/linux/io_uring_types.h | 15 +++
include/uapi/linux/io_uring.h | 1 +
io_uring/Makefile | 2 +-
io_uring/fused_cmd.c | 232 +++++++++++++++++++++++++++++++++
io_uring/fused_cmd.h | 11 ++
io_uring/io_uring.c | 20 ++-
io_uring/io_uring.h | 3 +
io_uring/opdef.c | 12 ++
io_uring/opdef.h | 2 +
10 files changed, 334 insertions(+), 6 deletions(-)
create mode 100644 io_uring/fused_cmd.c
create mode 100644 io_uring/fused_cmd.h
diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
index 0676c5b4b5fe..f3a23e00b47c 100644
--- a/include/linux/io_uring.h
+++ b/include/linux/io_uring.h
@@ -4,6 +4,7 @@
#include <linux/sched.h>
#include <linux/xarray.h>
+#include <linux/bvec.h>
#include <uapi/linux/io_uring.h>
enum io_uring_cmd_flags {
@@ -29,6 +30,19 @@ enum io_uring_cmd_flags {
IO_URING_F_FUSED_READ,
};
+union io_uring_fused_cmd_data {
+ /*
+ * In case of slave request IOSQE_CQE_SKIP_SUCCESS, return slave
+ * result via master command; otherwise we simply return success
+ * if buffer is provided, and slave request will return its result
+ * via its CQE
+ */
+ s32 slave_res;
+
+ /* fused cmd private, driver do not touch it */
+ struct io_kiocb *__slave;
+};
+
struct io_uring_cmd {
struct file *file;
const void *cmd;
@@ -40,10 +54,31 @@ struct io_uring_cmd {
};
u32 cmd_op;
u32 flags;
- u8 pdu[32]; /* available inline for free use */
+
+ /* for fused command, the available pdu is a bit less */
+ union {
+ struct {
+ union io_uring_fused_cmd_data data;
+ u8 pdu[24]; /* available inline for free use */
+ } fused;
+ u8 pdu[32]; /* available inline for free use */
+ };
+};
+
+struct io_uring_bvec_buf {
+ unsigned long len;
+ unsigned int nr_bvecs;
+
+ /* offset in the 1st bvec */
+ unsigned int offset;
+ struct bio_vec *bvec;
+ struct bio_vec __bvec[];
};
#if defined(CONFIG_IO_URING)
+void io_fused_cmd_provide_kbuf(struct io_uring_cmd *ioucmd, bool locked,
+ const struct io_uring_bvec_buf *imu,
+ void (*complete_tw_cb)(struct io_uring_cmd *));
int io_uring_cmd_import_fixed(u64 ubuf, unsigned long len, int rw,
struct iov_iter *iter, void *ioucmd);
void io_uring_cmd_done(struct io_uring_cmd *cmd, ssize_t ret, ssize_t res2);
@@ -73,6 +108,11 @@ static inline void io_uring_free(struct task_struct *tsk)
__io_uring_free(tsk);
}
#else
+static inline void io_fused_cmd_provide_kbuf(struct io_uring_cmd *ioucmd,
+ bool locked, const struct io_uring_bvec_buf *fused_cmd_kbuf,
+ unsigned int len, void (*complete_tw_cb)(struct io_uring_cmd *))
+{
+}
static inline int io_uring_cmd_import_fixed(u64 ubuf, unsigned long len, int rw,
struct iov_iter *iter, void *ioucmd)
{
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 66fd0f7f1b24..088df87f6e8c 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -401,6 +401,7 @@ enum {
/* keep async read/write and isreg together and in order */
REQ_F_SUPPORT_NOWAIT_BIT,
REQ_F_ISREG_BIT,
+ REQ_F_FUSED_SLAVE_BIT,
/* not a real bit, just to check we're not overflowing the space */
__REQ_F_LAST_BIT,
@@ -470,6 +471,8 @@ enum {
REQ_F_CLEAR_POLLIN = BIT(REQ_F_CLEAR_POLLIN_BIT),
/* hashed into ->cancel_hash_locked, protected by ->uring_lock */
REQ_F_HASH_LOCKED = BIT(REQ_F_HASH_LOCKED_BIT),
+ /* slave request in fused cmd, won't be one uring cmd */
+ REQ_F_FUSED_SLAVE = BIT(REQ_F_FUSED_SLAVE_BIT),
};
typedef void (*io_req_tw_func_t)(struct io_kiocb *req, bool *locked);
@@ -552,6 +555,18 @@ struct io_kiocb {
* REQ_F_BUFFER_RING is set.
*/
struct io_buffer_list *buf_list;
+
+ /*
+ * store kernel (sub)buffer of fused master request which OP
+ * is IORING_OP_FUSED_CMD
+ */
+ const struct io_uring_bvec_buf *fused_cmd_kbuf;
+
+ /*
+ * store fused command master request for fuse slave request,
+ * which uses fuse master's kernel buffer for handling this OP
+ */
+ struct io_kiocb *fused_master_req;
};
union {
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 709de6d4feb2..f07d005ee898 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -223,6 +223,7 @@ enum io_uring_op {
IORING_OP_URING_CMD,
IORING_OP_SEND_ZC,
IORING_OP_SENDMSG_ZC,
+ IORING_OP_FUSED_CMD,
/* this goes last, obviously */
IORING_OP_LAST,
diff --git a/io_uring/Makefile b/io_uring/Makefile
index 8cc8e5387a75..5301077e61c5 100644
--- a/io_uring/Makefile
+++ b/io_uring/Makefile
@@ -7,5 +7,5 @@ obj-$(CONFIG_IO_URING) += io_uring.o xattr.o nop.o fs.o splice.o \
openclose.o uring_cmd.o epoll.o \
statx.o net.o msg_ring.o timeout.o \
sqpoll.o fdinfo.o tctx.o poll.o \
- cancel.o kbuf.o rsrc.o rw.o opdef.o notif.o
+ cancel.o kbuf.o rsrc.o rw.o opdef.o notif.o fused_cmd.o
obj-$(CONFIG_IO_WQ) += io-wq.o
diff --git a/io_uring/fused_cmd.c b/io_uring/fused_cmd.c
new file mode 100644
index 000000000000..9435e67fc47c
--- /dev/null
+++ b/io_uring/fused_cmd.c
@@ -0,0 +1,232 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/mm.h>
+#include <linux/slab.h>
+#include <linux/io_uring.h>
+
+#include <uapi/linux/io_uring.h>
+
+#include "io_uring.h"
+#include "opdef.h"
+#include "rsrc.h"
+#include "uring_cmd.h"
+#include "fused_cmd.h"
+
+static bool io_fused_slave_valid(const struct io_uring_sqe *sqe, u8 op)
+{
+ unsigned int sqe_flags = READ_ONCE(sqe->flags);
+
+ if (op == IORING_OP_FUSED_CMD || op == IORING_OP_URING_CMD)
+ return false;
+
+ if (sqe_flags & REQ_F_BUFFER_SELECT)
+ return false;
+
+ if (!io_issue_defs[op].fused_slave)
+ return false;
+
+ return true;
+}
+
+static inline void io_fused_cmd_update_link_flags(struct io_kiocb *req,
+ const struct io_kiocb *slave)
+{
+ /*
+ * We have to keep slave SQE in order, so update master link flags
+ * with slave request's given master command isn't completed until
+ * the slave request is done
+ */
+ if (slave->flags & (REQ_F_LINK | REQ_F_HARDLINK))
+ req->flags |= REQ_F_LINK;
+}
+
+int io_fused_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+ __must_hold(&req->ctx->uring_lock)
+{
+ struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
+ const struct io_uring_sqe *slave_sqe = sqe + 1;
+ struct io_ring_ctx *ctx = req->ctx;
+ struct io_kiocb *slave;
+ u8 slave_op;
+ int ret;
+
+ if (unlikely(!(ctx->flags & IORING_SETUP_SQE128)))
+ return -EINVAL;
+
+ if (unlikely(sqe->__pad1))
+ return -EINVAL;
+
+ ioucmd->flags = READ_ONCE(sqe->uring_cmd_flags);
+ if (unlikely(ioucmd->flags))
+ return -EINVAL;
+
+ slave_op = READ_ONCE(slave_sqe->opcode);
+ if (unlikely(!io_fused_slave_valid(slave_sqe, slave_op)))
+ return -EINVAL;
+
+ ioucmd->cmd = sqe->cmd;
+ ioucmd->cmd_op = READ_ONCE(sqe->cmd_op);
+ req->fused_cmd_kbuf = NULL;
+
+ /* take one extra reference for the slave request */
+ io_get_task_refs(1);
+
+ ret = -ENOMEM;
+ if (unlikely(!io_alloc_req(ctx, &slave)))
+ goto fail;
+
+ ret = io_init_req(ctx, slave, slave_sqe, true);
+ if (unlikely(ret))
+ goto fail_free_req;
+
+ io_fused_cmd_update_link_flags(req, slave);
+
+ ioucmd->fused.data.__slave = slave;
+
+ return 0;
+
+fail_free_req:
+ io_free_req(slave);
+fail:
+ current->io_uring->cached_refs += 1;
+ return ret;
+}
+
+static inline bool io_fused_slave_write_to_buf(u8 op)
+{
+ switch (op) {
+ case IORING_OP_READ:
+ case IORING_OP_READV:
+ case IORING_OP_READ_FIXED:
+ case IORING_OP_RECVMSG:
+ case IORING_OP_RECV:
+ return 1;
+ default:
+ return 0;
+ }
+}
+
+int io_fused_cmd(struct io_kiocb *req, unsigned int issue_flags)
+{
+ struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
+ const struct io_kiocb *slave = ioucmd->fused.data.__slave;
+ int ret = -EINVAL;
+
+ /*
+ * Pass buffer direction for driver to validate if the read/write
+ * is legal
+ */
+ if (io_fused_slave_write_to_buf(slave->opcode))
+ issue_flags |= IO_URING_F_FUSED_WRITE;
+ else
+ issue_flags |= IO_URING_F_FUSED_READ;
+
+ ret = io_uring_cmd(req, issue_flags);
+ if (ret != IOU_ISSUE_SKIP_COMPLETE)
+ io_free_req(ioucmd->fused.data.__slave);
+
+ return ret;
+}
+
+int io_import_kbuf_for_slave(unsigned long buf_off, unsigned int len, int dir,
+ struct iov_iter *iter, struct io_kiocb *slave)
+{
+ struct io_kiocb *req = slave->fused_master_req;
+ const struct io_uring_bvec_buf *kbuf;
+ unsigned long offset;
+
+ if (unlikely(!(slave->flags & REQ_F_FUSED_SLAVE) || !req))
+ return -EINVAL;
+
+ if (unlikely(!req->fused_cmd_kbuf))
+ return -EINVAL;
+
+ /* req->fused_cmd_kbuf is immutable */
+ kbuf = req->fused_cmd_kbuf;
+ offset = kbuf->offset;
+
+ if (!kbuf->bvec)
+ return -EINVAL;
+
+ if (unlikely(buf_off > kbuf->len))
+ return -EFAULT;
+
+ if (unlikely(len > kbuf->len - buf_off))
+ return -EFAULT;
+
+ /* don't use io_import_fixed which doesn't support multipage bvec */
+ offset += buf_off;
+ iov_iter_bvec(iter, dir, kbuf->bvec, kbuf->nr_bvecs, offset + len);
+
+ if (offset)
+ iov_iter_advance(iter, offset);
+
+ return 0;
+}
+
+/*
+ * Called when slave request is completed,
+ *
+ * Return back ownership of the fused_cmd kbuf to master request, and
+ * notify master request.
+ */
+void io_fused_cmd_return_kbuf(struct io_kiocb *slave)
+{
+ struct io_kiocb *req = slave->fused_master_req;
+ struct io_uring_cmd *ioucmd;
+
+ if (unlikely(!req || !(slave->flags & REQ_F_FUSED_SLAVE)))
+ return;
+
+ /* return back the buffer */
+ slave->fused_master_req = NULL;
+ ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
+ ioucmd->fused.data.__slave = NULL;
+
+ /* If slave OP skips CQE, return the result via master command */
+ if (slave->flags & REQ_F_CQE_SKIP)
+ ioucmd->fused.data.slave_res = slave->cqe.res;
+ else
+ ioucmd->fused.data.slave_res = 0;
+ io_uring_cmd_complete_in_task(ioucmd, ioucmd->task_work_cb);
+}
+
+/*
+ * This API needs to be called when master command has prepared
+ * FUSED_CMD buffer, and offset/len in ->fused.data is for retrieving
+ * sub-buffer in the command buffer, which is often figured out by
+ * command payload data.
+ *
+ * Master command is always completed after the slave request
+ * is completed, so driver has to set completion callback for
+ * getting notification.
+ *
+ * Ownership of the fused_cmd kbuf is transferred to slave request.
+ */
+void io_fused_cmd_provide_kbuf(struct io_uring_cmd *ioucmd, bool locked,
+ const struct io_uring_bvec_buf *fused_cmd_kbuf,
+ void (*complete_tw_cb)(struct io_uring_cmd *))
+{
+ struct io_kiocb *req = cmd_to_io_kiocb(ioucmd);
+ struct io_kiocb *slave = ioucmd->fused.data.__slave;
+
+ /*
+ * Once the fused slave request is completed, the driver will
+ * be notified by callback of complete_tw_cb
+ */
+ ioucmd->task_work_cb = complete_tw_cb;
+
+ /* now we get the buffer */
+ req->fused_cmd_kbuf = fused_cmd_kbuf;
+ slave->fused_master_req = req;
+
+ trace_io_uring_submit_sqe(slave, true);
+ if (locked)
+ io_req_task_submit(slave, &locked);
+ else
+ io_req_task_queue(slave);
+}
+EXPORT_SYMBOL_GPL(io_fused_cmd_provide_kbuf);
diff --git a/io_uring/fused_cmd.h b/io_uring/fused_cmd.h
new file mode 100644
index 000000000000..86ad87d1b0ec
--- /dev/null
+++ b/io_uring/fused_cmd.h
@@ -0,0 +1,11 @@
+// SPDX-License-Identifier: GPL-2.0
+#ifndef IOU_FUSED_CMD_H
+#define IOU_FUSED_CMD_H
+
+int io_fused_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
+int io_fused_cmd(struct io_kiocb *req, unsigned int issue_flags);
+void io_fused_cmd_return_kbuf(struct io_kiocb *slave);
+int io_import_kbuf_for_slave(unsigned long buf, unsigned int len, int dir,
+ struct iov_iter *iter, struct io_kiocb *slave);
+
+#endif
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 90a835215d04..46c060713928 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -91,6 +91,7 @@
#include "cancel.h"
#include "net.h"
#include "notif.h"
+#include "fused_cmd.h"
#include "timeout.h"
#include "poll.h"
@@ -110,7 +111,7 @@
#define IO_REQ_CLEAN_FLAGS (REQ_F_BUFFER_SELECTED | REQ_F_NEED_CLEANUP | \
REQ_F_POLLED | REQ_F_INFLIGHT | REQ_F_CREDS | \
- REQ_F_ASYNC_DATA)
+ REQ_F_ASYNC_DATA | REQ_F_FUSED_SLAVE)
#define IO_REQ_CLEAN_SLOW_FLAGS (REQ_F_REFCOUNT | REQ_F_LINK | REQ_F_HARDLINK |\
IO_REQ_CLEAN_FLAGS)
@@ -964,6 +965,9 @@ static void __io_req_complete_post(struct io_kiocb *req)
{
struct io_ring_ctx *ctx = req->ctx;
+ if (req->flags & REQ_F_FUSED_SLAVE)
+ io_fused_cmd_return_kbuf(req);
+
io_cq_lock(ctx);
if (!(req->flags & REQ_F_CQE_SKIP))
io_fill_cqe_req(ctx, req);
@@ -1848,6 +1852,8 @@ static void io_clean_op(struct io_kiocb *req)
spin_lock(&req->ctx->completion_lock);
io_put_kbuf_comp(req);
spin_unlock(&req->ctx->completion_lock);
+ } else if (req->flags & REQ_F_FUSED_SLAVE) {
+ io_fused_cmd_return_kbuf(req);
}
if (req->flags & REQ_F_NEED_CLEANUP) {
@@ -2156,8 +2162,8 @@ static void io_init_req_drain(struct io_kiocb *req)
}
}
-static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
- const struct io_uring_sqe *sqe)
+int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
+ const struct io_uring_sqe *sqe, bool slave)
__must_hold(&ctx->uring_lock)
{
const struct io_issue_def *def;
@@ -2210,6 +2216,12 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
}
}
+ if (slave) {
+ if (!def->fused_slave)
+ return -EINVAL;
+ req->flags |= REQ_F_FUSED_SLAVE;
+ }
+
if (!def->ioprio && sqe->ioprio)
return -EINVAL;
if (!def->iopoll && (ctx->flags & IORING_SETUP_IOPOLL))
@@ -2294,7 +2306,7 @@ static inline int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
struct io_submit_link *link = &ctx->submit_state.link;
int ret;
- ret = io_init_req(ctx, req, sqe);
+ ret = io_init_req(ctx, req, sqe, false);
if (unlikely(ret))
return io_submit_fail_init(sqe, req, ret);
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 2711865f1e19..a50c7e1f6e81 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -78,6 +78,9 @@ bool __io_alloc_req_refill(struct io_ring_ctx *ctx);
bool io_match_task_safe(struct io_kiocb *head, struct task_struct *task,
bool cancel_all);
+int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
+ const struct io_uring_sqe *sqe, bool slave);
+
#define io_lockdep_assert_cq_locked(ctx) \
do { \
if (ctx->flags & IORING_SETUP_IOPOLL) { \
diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index cca7c5b55208..63b90e8e65f8 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -33,6 +33,7 @@
#include "poll.h"
#include "cancel.h"
#include "rw.h"
+#include "fused_cmd.h"
static int io_no_issue(struct io_kiocb *req, unsigned int issue_flags)
{
@@ -428,6 +429,12 @@ const struct io_issue_def io_issue_defs[] = {
.prep = io_eopnotsupp_prep,
#endif
},
+ [IORING_OP_FUSED_CMD] = {
+ .needs_file = 1,
+ .plug = 1,
+ .prep = io_fused_cmd_prep,
+ .issue = io_fused_cmd,
+ },
};
@@ -648,6 +655,11 @@ const struct io_cold_def io_cold_defs[] = {
.fail = io_sendrecv_fail,
#endif
},
+ [IORING_OP_FUSED_CMD] = {
+ .name = "FUSED_CMD",
+ .async_size = uring_cmd_pdu_size(1),
+ .prep_async = io_uring_cmd_prep_async,
+ },
};
const char *io_uring_get_opcode(u8 opcode)
diff --git a/io_uring/opdef.h b/io_uring/opdef.h
index c22c8696e749..306f6fc48ed4 100644
--- a/io_uring/opdef.h
+++ b/io_uring/opdef.h
@@ -29,6 +29,8 @@ struct io_issue_def {
unsigned iopoll_queue : 1;
/* opcode specific path will handle ->async_data allocation if needed */
unsigned manual_alloc : 1;
+ /* can be slave op of fused command */
+ unsigned fused_slave : 1;
int (*issue)(struct io_kiocb *, unsigned int);
int (*prep)(struct io_kiocb *, const struct io_uring_sqe *);
--
2.39.2
* [PATCH V2 04/17] io_uring: support OP_READ/OP_WRITE for fused slave request
2023-03-07 14:15 [PATCH V2 00/17] io_uring/ublk: add IORING_OP_FUSED_CMD Ming Lei
` (2 preceding siblings ...)
2023-03-07 14:15 ` [PATCH V2 03/17] io_uring: add IORING_OP_FUSED_CMD Ming Lei
@ 2023-03-07 14:15 ` Ming Lei
2023-03-07 14:15 ` [PATCH V2 05/17] io_uring: support OP_SEND_ZC/OP_RECV " Ming Lei
` (14 subsequent siblings)
18 siblings, 0 replies; 42+ messages in thread
From: Ming Lei @ 2023-03-07 14:15 UTC (permalink / raw)
To: Jens Axboe, io-uring
Cc: linux-block, Miklos Szeredi, ZiyangZhang, Xiaoguang Wang,
Bernd Schubert, Ming Lei
Allow fused slave requests to support OP_READ/OP_WRITE, with the
buffer retrieved from the master request.
Once the slave request is completed, the master buffer is returned.
Signed-off-by: Ming Lei <[email protected]>
---
io_uring/opdef.c | 2 ++
io_uring/rw.c | 20 ++++++++++++++++++++
2 files changed, 22 insertions(+)
diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index 63b90e8e65f8..f044629e5475 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -235,6 +235,7 @@ const struct io_issue_def io_issue_defs[] = {
.ioprio = 1,
.iopoll = 1,
.iopoll_queue = 1,
+ .fused_slave = 1,
.prep = io_prep_rw,
.issue = io_read,
},
@@ -248,6 +249,7 @@ const struct io_issue_def io_issue_defs[] = {
.ioprio = 1,
.iopoll = 1,
.iopoll_queue = 1,
+ .fused_slave = 1,
.prep = io_prep_rw,
.issue = io_write,
},
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 4c233910e200..36d31a943317 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -19,6 +19,7 @@
#include "kbuf.h"
#include "rsrc.h"
#include "rw.h"
+#include "fused_cmd.h"
struct io_rw {
/* NOTE: kiocb has the file as the first member, so don't do it here */
@@ -371,6 +372,17 @@ static struct iovec *__io_import_iovec(int ddir, struct io_kiocb *req,
size_t sqe_len;
ssize_t ret;
+ /*
+ * SLAVE OP passes buffer offset from sqe->addr actually, since
+ * the fused cmd kbuf's mapped start address is zero.
+ */
+ if (req->flags & REQ_F_FUSED_SLAVE) {
+ ret = io_import_kbuf_for_slave(rw->addr, rw->len, ddir, iter, req);
+ if (ret)
+ return ERR_PTR(ret);
+ return NULL;
+ }
+
if (opcode == IORING_OP_READ_FIXED || opcode == IORING_OP_WRITE_FIXED) {
ret = io_import_fixed(ddir, iter, req->imu, rw->addr, rw->len);
if (ret)
@@ -428,11 +440,19 @@ static inline loff_t *io_kiocb_ppos(struct kiocb *kiocb)
*/
static ssize_t loop_rw_iter(int ddir, struct io_rw *rw, struct iov_iter *iter)
{
+ struct io_kiocb *req = cmd_to_io_kiocb(rw);
struct kiocb *kiocb = &rw->kiocb;
struct file *file = kiocb->ki_filp;
ssize_t ret = 0;
loff_t *ppos;
+ /*
+ * Fused slave req hasn't user buffer, so ->read/->write can't
+ * be supported
+ */
+ if (req->flags & REQ_F_FUSED_SLAVE)
+ return -EOPNOTSUPP;
+
/*
* Don't support polled IO through this interface, and we can't
* support non-blocking either. For the latter, this just causes
--
2.39.2
* [PATCH V2 05/17] io_uring: support OP_SEND_ZC/OP_RECV for fused slave request
2023-03-07 14:15 [PATCH V2 00/17] io_uring/ublk: add IORING_OP_FUSED_CMD Ming Lei
` (3 preceding siblings ...)
2023-03-07 14:15 ` [PATCH V2 04/17] io_uring: support OP_READ/OP_WRITE for fused slave request Ming Lei
@ 2023-03-07 14:15 ` Ming Lei
2023-03-09 7:46 ` kernel test robot
2023-03-09 17:22 ` kernel test robot
2023-03-07 14:15 ` [PATCH V2 06/17] block: ublk_drv: mark device as LIVE before adding disk Ming Lei
` (13 subsequent siblings)
18 siblings, 2 replies; 42+ messages in thread
From: Ming Lei @ 2023-03-07 14:15 UTC (permalink / raw)
To: Jens Axboe, io-uring
Cc: linux-block, Miklos Szeredi, ZiyangZhang, Xiaoguang Wang,
Bernd Schubert, Ming Lei
Allow fused slave requests to support OP_SEND_ZC/OP_RECV, with the
buffer retrieved from the master request.
Once the slave request is completed, the master buffer is returned.
Signed-off-by: Ming Lei <[email protected]>
---
io_uring/net.c | 23 +++++++++++++++++++++--
io_uring/opdef.c | 3 +++
2 files changed, 24 insertions(+), 2 deletions(-)
diff --git a/io_uring/net.c b/io_uring/net.c
index b7f190ca528e..9ec6a77b4e82 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -16,6 +16,7 @@
#include "net.h"
#include "notif.h"
#include "rsrc.h"
+#include "fused_cmd.h"
#if defined(CONFIG_NET)
struct io_shutdown {
@@ -378,7 +379,11 @@ int io_send(struct io_kiocb *req, unsigned int issue_flags)
if (unlikely(!sock))
return -ENOTSOCK;
- ret = import_ubuf(ITER_SOURCE, sr->buf, sr->len, &msg.msg_iter);
+ if (!(req->flags & REQ_F_FUSED_SLAVE))
+ ret = import_ubuf(ITER_SOURCE, sr->buf, sr->len, &msg.msg_iter);
+ else
+ ret = io_import_kbuf_for_slave((u64)sr->buf, sr->len,
+ ITER_SOURCE, &msg.msg_iter, req);
if (unlikely(ret))
return ret;
@@ -869,7 +874,11 @@ int io_recv(struct io_kiocb *req, unsigned int issue_flags)
sr->buf = buf;
}
- ret = import_ubuf(ITER_DEST, sr->buf, len, &msg.msg_iter);
+ if (!(req->flags & REQ_F_FUSED_SLAVE))
+ ret = import_ubuf(ITER_DEST, sr->buf, len, &msg.msg_iter);
+ else
+ ret = io_import_kbuf_for_slave((u64)sr->buf, sr->len, ITER_DEST,
+ &msg.msg_iter, req);
if (unlikely(ret))
goto out_free;
@@ -983,6 +992,9 @@ int io_send_zc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
if (zc->flags & IORING_RECVSEND_FIXED_BUF) {
unsigned idx = READ_ONCE(sqe->buf_index);
+ if (req->flags & REQ_F_FUSED_SLAVE)
+ return -EINVAL;
+
if (unlikely(idx >= ctx->nr_user_bufs))
return -EFAULT;
idx = array_index_nospec(idx, ctx->nr_user_bufs);
@@ -1119,8 +1131,15 @@ int io_send_zc(struct io_kiocb *req, unsigned int issue_flags)
if (unlikely(ret))
return ret;
msg.sg_from_iter = io_sg_from_iter;
+ } else if (req->flags & REQ_F_FUSED_SLAVE) {
+ ret = io_import_kbuf_for_slave((u64)zc->buf, zc->len,
+ ITER_SOURCE, &msg.msg_iter, req);
+ if (unlikely(ret))
+ return ret;
+ msg.sg_from_iter = io_sg_from_iter;
} else {
io_notif_set_extended(zc->notif);
+
ret = import_ubuf(ITER_SOURCE, zc->buf, zc->len, &msg.msg_iter);
if (unlikely(ret))
return ret;
diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index f044629e5475..0a9d39a9db16 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -271,6 +271,7 @@ const struct io_issue_def io_issue_defs[] = {
.audit_skip = 1,
.ioprio = 1,
.manual_alloc = 1,
+ .fused_slave = 1,
#if defined(CONFIG_NET)
.prep = io_sendmsg_prep,
.issue = io_send,
@@ -285,6 +286,7 @@ const struct io_issue_def io_issue_defs[] = {
.buffer_select = 1,
.audit_skip = 1,
.ioprio = 1,
+ .fused_slave = 1,
#if defined(CONFIG_NET)
.prep = io_recvmsg_prep,
.issue = io_recv,
@@ -411,6 +413,7 @@ const struct io_issue_def io_issue_defs[] = {
.audit_skip = 1,
.ioprio = 1,
.manual_alloc = 1,
+ .fused_slave = 1,
#if defined(CONFIG_NET)
.prep = io_send_zc_prep,
.issue = io_send_zc,
--
2.39.2
* Re: [PATCH V2 05/17] io_uring: support OP_SEND_ZC/OP_RECV for fused slave request
2023-03-07 14:15 ` [PATCH V2 05/17] io_uring: support OP_SEND_ZC/OP_RECV " Ming Lei
@ 2023-03-09 7:46 ` kernel test robot
2023-03-09 17:22 ` kernel test robot
1 sibling, 0 replies; 42+ messages in thread
From: kernel test robot @ 2023-03-09 7:46 UTC (permalink / raw)
To: Ming Lei, Jens Axboe, io-uring
Cc: oe-kbuild-all, linux-block, Miklos Szeredi, ZiyangZhang,
Xiaoguang Wang, Bernd Schubert, Ming Lei
Hi Ming,
I love your patch! Perhaps something to improve:
[auto build test WARNING on axboe-block/for-next]
[also build test WARNING on linus/master v6.3-rc1 next-20230309]
[cannot apply to char-misc/char-misc-testing char-misc/char-misc-next char-misc/char-misc-linus]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Ming-Lei/io_uring-add-IO_URING_F_FUSED-and-prepare-for-supporting-OP_FUSED_CMD/20230307-222928
base: https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git for-next
patch link: https://lore.kernel.org/r/20230307141520.793891-6-ming.lei%40redhat.com
patch subject: [PATCH V2 05/17] io_uring: support OP_SEND_ZC/OP_RECV for fused slave request
config: i386-randconfig-a003 (https://download.01.org/0day-ci/archive/20230309/[email protected]/config)
compiler: gcc-11 (Debian 11.3.0-8) 11.3.0
reproduce (this is a W=1 build):
# https://github.com/intel-lab-lkp/linux/commit/0a921da27026b3ba08aeceb432dd983480281344
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Ming-Lei/io_uring-add-IO_URING_F_FUSED-and-prepare-for-supporting-OP_FUSED_CMD/20230307-222928
git checkout 0a921da27026b3ba08aeceb432dd983480281344
# save the config file
mkdir build_dir && cp config build_dir/.config
make W=1 O=build_dir ARCH=i386 olddefconfig
make W=1 O=build_dir ARCH=i386 SHELL=/bin/bash
If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <[email protected]>
| Link: https://lore.kernel.org/oe-kbuild-all/[email protected]/
All warnings (new ones prefixed by >>):
In file included from include/linux/bits.h:6,
from include/linux/bitops.h:6,
from include/linux/kernel.h:22,
from io_uring/net.c:2:
include/vdso/bits.h:7:40: warning: left shift count >= width of type [-Wshift-count-overflow]
7 | #define BIT(nr) (UL(1) << (nr))
| ^~
include/linux/io_uring_types.h:475:35: note: in expansion of macro 'BIT'
475 | REQ_F_FUSED_SLAVE = BIT(REQ_F_FUSED_SLAVE_BIT),
| ^~~
io_uring/net.c: In function 'io_send':
>> io_uring/net.c:385:48: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
385 | ret = io_import_kbuf_for_slave((u64)sr->buf, sr->len,
| ^
io_uring/net.c: In function 'io_recv':
io_uring/net.c:880:48: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
880 | ret = io_import_kbuf_for_slave((u64)sr->buf, sr->len, ITER_DEST,
| ^
io_uring/net.c: In function 'io_send_zc':
io_uring/net.c:1135:48: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
1135 | ret = io_import_kbuf_for_slave((u64)zc->buf, zc->len,
| ^
vim +385 io_uring/net.c
343
344 int io_send(struct io_kiocb *req, unsigned int issue_flags)
345 {
346 struct sockaddr_storage __address;
347 struct io_sr_msg *sr = io_kiocb_to_cmd(req, struct io_sr_msg);
348 struct msghdr msg;
349 struct socket *sock;
350 unsigned flags;
351 int min_ret = 0;
352 int ret;
353
354 msg.msg_name = NULL;
355 msg.msg_control = NULL;
356 msg.msg_controllen = 0;
357 msg.msg_namelen = 0;
358 msg.msg_ubuf = NULL;
359
360 if (sr->addr) {
361 if (req_has_async_data(req)) {
362 struct io_async_msghdr *io = req->async_data;
363
364 msg.msg_name = &io->addr;
365 } else {
366 ret = move_addr_to_kernel(sr->addr, sr->addr_len, &__address);
367 if (unlikely(ret < 0))
368 return ret;
369 msg.msg_name = (struct sockaddr *)&__address;
370 }
371 msg.msg_namelen = sr->addr_len;
372 }
373
374 if (!(req->flags & REQ_F_POLLED) &&
375 (sr->flags & IORING_RECVSEND_POLL_FIRST))
376 return io_setup_async_addr(req, &__address, issue_flags);
377
378 sock = sock_from_file(req->file);
379 if (unlikely(!sock))
380 return -ENOTSOCK;
381
382 if (!(req->flags & REQ_F_FUSED_SLAVE))
383 ret = import_ubuf(ITER_SOURCE, sr->buf, sr->len, &msg.msg_iter);
384 else
> 385 ret = io_import_kbuf_for_slave((u64)sr->buf, sr->len,
386 ITER_SOURCE, &msg.msg_iter, req);
387 if (unlikely(ret))
388 return ret;
389
390 flags = sr->msg_flags;
391 if (issue_flags & IO_URING_F_NONBLOCK)
392 flags |= MSG_DONTWAIT;
393 if (flags & MSG_WAITALL)
394 min_ret = iov_iter_count(&msg.msg_iter);
395
396 msg.msg_flags = flags;
397 ret = sock_sendmsg(sock, &msg);
398 if (ret < min_ret) {
399 if (ret == -EAGAIN && (issue_flags & IO_URING_F_NONBLOCK))
400 return io_setup_async_addr(req, &__address, issue_flags);
401
402 if (ret > 0 && io_net_retry(sock, flags)) {
403 sr->len -= ret;
404 sr->buf += ret;
405 sr->done_io += ret;
406 req->flags |= REQ_F_PARTIAL_IO;
407 return io_setup_async_addr(req, &__address, issue_flags);
408 }
409 if (ret == -ERESTARTSYS)
410 ret = -EINTR;
411 req_set_fail(req);
412 }
413 if (ret >= 0)
414 ret += sr->done_io;
415 else if (sr->done_io)
416 ret = sr->done_io;
417 io_req_set_res(req, ret, 0);
418 return IOU_OK;
419 }
420
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests
* Re: [PATCH V2 05/17] io_uring: support OP_SEND_ZC/OP_RECV for fused slave request
2023-03-07 14:15 ` [PATCH V2 05/17] io_uring: support OP_SEND_ZC/OP_RECV " Ming Lei
2023-03-09 7:46 ` kernel test robot
@ 2023-03-09 17:22 ` kernel test robot
1 sibling, 0 replies; 42+ messages in thread
From: kernel test robot @ 2023-03-09 17:22 UTC (permalink / raw)
To: Ming Lei, Jens Axboe, io-uring
Cc: oe-kbuild-all, linux-block, Miklos Szeredi, ZiyangZhang,
Xiaoguang Wang, Bernd Schubert, Ming Lei
Hi Ming,
I love your patch! Perhaps something to improve:
[auto build test WARNING on axboe-block/for-next]
[also build test WARNING on linus/master v6.3-rc1 next-20230309]
[cannot apply to char-misc/char-misc-testing char-misc/char-misc-next char-misc/char-misc-linus]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Ming-Lei/io_uring-add-IO_URING_F_FUSED-and-prepare-for-supporting-OP_FUSED_CMD/20230307-222928
base: https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git for-next
patch link: https://lore.kernel.org/r/20230307141520.793891-6-ming.lei%40redhat.com
patch subject: [PATCH V2 05/17] io_uring: support OP_SEND_ZC/OP_RECV for fused slave request
config: sparc64-randconfig-s031-20230308 (https://download.01.org/0day-ci/archive/20230310/[email protected]/config)
compiler: sparc64-linux-gcc (GCC) 12.1.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# apt-get install sparse
# sparse version: v0.6.4-39-gce1a6720-dirty
# https://github.com/intel-lab-lkp/linux/commit/0a921da27026b3ba08aeceb432dd983480281344
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Ming-Lei/io_uring-add-IO_URING_F_FUSED-and-prepare-for-supporting-OP_FUSED_CMD/20230307-222928
git checkout 0a921da27026b3ba08aeceb432dd983480281344
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=sparc64 olddefconfig
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=sparc64 SHELL=/bin/bash
If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <[email protected]>
| Link: https://lore.kernel.org/oe-kbuild-all/[email protected]/
sparse warnings: (new ones prefixed by >>)
io_uring/net.c: note: in included file (through io_uring/io_uring.h):
io_uring/slist.h:116:29: sparse: sparse: no newline at end of file
io_uring/net.c: note: in included file (through io_uring/io_uring.h):
include/linux/io_uring_types.h:179:37: sparse: sparse: array of flexible structures
>> io_uring/net.c:385:49: sparse: sparse: cast removes address space '__user' of expression
io_uring/net.c:880:49: sparse: sparse: cast removes address space '__user' of expression
io_uring/net.c:1135:49: sparse: sparse: cast removes address space '__user' of expression
vim +/__user +385 io_uring/net.c
343
344 int io_send(struct io_kiocb *req, unsigned int issue_flags)
345 {
346 struct sockaddr_storage __address;
347 struct io_sr_msg *sr = io_kiocb_to_cmd(req, struct io_sr_msg);
348 struct msghdr msg;
349 struct socket *sock;
350 unsigned flags;
351 int min_ret = 0;
352 int ret;
353
354 msg.msg_name = NULL;
355 msg.msg_control = NULL;
356 msg.msg_controllen = 0;
357 msg.msg_namelen = 0;
358 msg.msg_ubuf = NULL;
359
360 if (sr->addr) {
361 if (req_has_async_data(req)) {
362 struct io_async_msghdr *io = req->async_data;
363
364 msg.msg_name = &io->addr;
365 } else {
366 ret = move_addr_to_kernel(sr->addr, sr->addr_len, &__address);
367 if (unlikely(ret < 0))
368 return ret;
369 msg.msg_name = (struct sockaddr *)&__address;
370 }
371 msg.msg_namelen = sr->addr_len;
372 }
373
374 if (!(req->flags & REQ_F_POLLED) &&
375 (sr->flags & IORING_RECVSEND_POLL_FIRST))
376 return io_setup_async_addr(req, &__address, issue_flags);
377
378 sock = sock_from_file(req->file);
379 if (unlikely(!sock))
380 return -ENOTSOCK;
381
382 if (!(req->flags & REQ_F_FUSED_SLAVE))
383 ret = import_ubuf(ITER_SOURCE, sr->buf, sr->len, &msg.msg_iter);
384 else
> 385 ret = io_import_kbuf_for_slave((u64)sr->buf, sr->len,
386 ITER_SOURCE, &msg.msg_iter, req);
387 if (unlikely(ret))
388 return ret;
389
390 flags = sr->msg_flags;
391 if (issue_flags & IO_URING_F_NONBLOCK)
392 flags |= MSG_DONTWAIT;
393 if (flags & MSG_WAITALL)
394 min_ret = iov_iter_count(&msg.msg_iter);
395
396 msg.msg_flags = flags;
397 ret = sock_sendmsg(sock, &msg);
398 if (ret < min_ret) {
399 if (ret == -EAGAIN && (issue_flags & IO_URING_F_NONBLOCK))
400 return io_setup_async_addr(req, &__address, issue_flags);
401
402 if (ret > 0 && io_net_retry(sock, flags)) {
403 sr->len -= ret;
404 sr->buf += ret;
405 sr->done_io += ret;
406 req->flags |= REQ_F_PARTIAL_IO;
407 return io_setup_async_addr(req, &__address, issue_flags);
408 }
409 if (ret == -ERESTARTSYS)
410 ret = -EINTR;
411 req_set_fail(req);
412 }
413 if (ret >= 0)
414 ret += sr->done_io;
415 else if (sr->done_io)
416 ret = sr->done_io;
417 io_req_set_res(req, ret, 0);
418 return IOU_OK;
419 }
420
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests
* [PATCH V2 06/17] block: ublk_drv: mark device as LIVE before adding disk
2023-03-07 14:15 [PATCH V2 00/17] io_uring/ublk: add IORING_OP_FUSED_CMD Ming Lei
` (4 preceding siblings ...)
2023-03-07 14:15 ` [PATCH V2 05/17] io_uring: support OP_SEND_ZC/OP_RECV " Ming Lei
@ 2023-03-07 14:15 ` Ming Lei
2023-03-08 3:48 ` Ziyang Zhang
2023-03-07 14:15 ` [PATCH V2 07/17] block: ublk_drv: add common exit handling Ming Lei
` (12 subsequent siblings)
18 siblings, 1 reply; 42+ messages in thread
From: Ming Lei @ 2023-03-07 14:15 UTC (permalink / raw)
To: Jens Axboe, io-uring
Cc: linux-block, Miklos Szeredi, ZiyangZhang, Xiaoguang Wang,
Bernd Schubert, Ming Lei
IO can be started before add_disk() returns, such as reading the partition table, so the monitor work has to be active by then to make forward progress.
So mark the device as LIVE before adding the disk, and change it back to
DEAD if add_disk() fails.
Reviewed-by: Ziyang Zhang <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
---
drivers/block/ublk_drv.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 088d9efffe6b..2778c14a8407 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -1605,17 +1605,18 @@ static int ublk_ctrl_start_dev(struct ublk_device *ub, struct io_uring_cmd *cmd)
set_bit(GD_SUPPRESS_PART_SCAN, &disk->state);
get_device(&ub->cdev_dev);
+ ub->dev_info.state = UBLK_S_DEV_LIVE;
ret = add_disk(disk);
if (ret) {
/*
* Has to drop the reference since ->free_disk won't be
* called in case of add_disk failure.
*/
+ ub->dev_info.state = UBLK_S_DEV_DEAD;
ublk_put_device(ub);
goto out_put_disk;
}
set_bit(UB_STATE_USED, &ub->state);
- ub->dev_info.state = UBLK_S_DEV_LIVE;
out_put_disk:
if (ret)
put_disk(disk);
--
2.39.2
* Re: [PATCH V2 06/17] block: ublk_drv: mark device as LIVE before adding disk
2023-03-07 14:15 ` [PATCH V2 06/17] block: ublk_drv: mark device as LIVE before adding disk Ming Lei
@ 2023-03-08 3:48 ` Ziyang Zhang
2023-03-08 7:44 ` Ming Lei
0 siblings, 1 reply; 42+ messages in thread
From: Ziyang Zhang @ 2023-03-08 3:48 UTC (permalink / raw)
To: Ming Lei; +Cc: linux-block, io-uring, Jens Axboe
On 2023/3/7 22:15, Ming Lei wrote:
> IO can be started before add_disk() returns, such as reading the partition table, so the monitor work has to be active by then to make forward progress.
>
> So mark the device as LIVE before adding the disk, and change it back to
> DEAD if add_disk() fails.
>
>
> Reviewed-by: Ziyang Zhang <[email protected]>
> Signed-off-by: Ming Lei <[email protected]>
> ---
Hi Ming,
Without this patch, if we fail to read the partition table (could this
happen?) and EIO is returned, then START_DEV may hang forever, right?
I may have encountered such an error before, and I think this bug was introduced
by:
bbae8d1f526b ("ublk_drv: consider recovery feature in aborting mechanism")
which changed the behavior of monitor_work. So shall we add a Fixes tag, such
as:
Fixes: bbae8d1f526b ("ublk_drv: consider recovery feature in aborting mechanism")
Regards,
Zhang
* Re: [PATCH V2 06/17] block: ublk_drv: mark device as LIVE before adding disk
2023-03-08 3:48 ` Ziyang Zhang
@ 2023-03-08 7:44 ` Ming Lei
0 siblings, 0 replies; 42+ messages in thread
From: Ming Lei @ 2023-03-08 7:44 UTC (permalink / raw)
To: Ziyang Zhang; +Cc: linux-block, io-uring, Jens Axboe, ming.lei
On Wed, Mar 08, 2023 at 11:48:15AM +0800, Ziyang Zhang wrote:
> On 2023/3/7 22:15, Ming Lei wrote:
> > IO can be started before add_disk() returns, such as reading the partition table, so the monitor work has to be active by then to make forward progress.
> >
> > So mark the device as LIVE before adding the disk, and change it back to
> > DEAD if add_disk() fails.
> >
> >
> > Reviewed-by: Ziyang Zhang <[email protected]>
> > Signed-off-by: Ming Lei <[email protected]>
> > ---
>
> Hi Ming,
>
> Without this patch, if we fail to read the partition table (could this
> happen?) and EIO is returned, then START_DEV may hang forever, right?
> I may have encountered such an error before, and I think this bug was introduced
> by:
> bbae8d1f526b ("ublk_drv: consider recovery feature in aborting mechanism")
> which changed the behavior of monitor_work. So shall we add a Fixes tag, such
> as:
> Fixes: bbae8d1f526b ("ublk_drv: consider recovery feature in aborting mechanism")
Even without the above commit, the monitor work still may not be started,
because ub->dev_info.state == UBLK_S_DEV_DEAD.
thanks,
Ming
* [PATCH V2 07/17] block: ublk_drv: add common exit handling
2023-03-07 14:15 [PATCH V2 00/17] io_uring/ublk: add IORING_OP_FUSED_CMD Ming Lei
` (5 preceding siblings ...)
2023-03-07 14:15 ` [PATCH V2 06/17] block: ublk_drv: mark device as LIVE before adding disk Ming Lei
@ 2023-03-07 14:15 ` Ming Lei
2023-03-14 17:15 ` kernel test robot
2023-03-07 14:15 ` [PATCH V2 08/17] block: ublk_drv: don't consider flush request in map/unmap io Ming Lei
` (11 subsequent siblings)
18 siblings, 1 reply; 42+ messages in thread
From: Ming Lei @ 2023-03-07 14:15 UTC (permalink / raw)
To: Jens Axboe, io-uring
Cc: linux-block, Miklos Szeredi, ZiyangZhang, Xiaoguang Wang,
Bernd Schubert, Ming Lei
Simplify exit handling a bit, and prepare for supporting fused command.
Reviewed-by: Ziyang Zhang <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
---
drivers/block/ublk_drv.c | 15 +++++++++------
1 file changed, 9 insertions(+), 6 deletions(-)
diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 2778c14a8407..c03728c13eea 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -655,14 +655,15 @@ static void ublk_complete_rq(struct request *req)
struct ublk_queue *ubq = req->mq_hctx->driver_data;
struct ublk_io *io = &ubq->ios[req->tag];
unsigned int unmapped_bytes;
+ int res = BLK_STS_OK;
/* failed read IO if nothing is read */
if (!io->res && req_op(req) == REQ_OP_READ)
io->res = -EIO;
if (io->res < 0) {
- blk_mq_end_request(req, errno_to_blk_status(io->res));
- return;
+ res = errno_to_blk_status(io->res);
+ goto exit;
}
/*
@@ -671,10 +672,8 @@ static void ublk_complete_rq(struct request *req)
*
* Both the two needn't unmap.
*/
- if (req_op(req) != REQ_OP_READ && req_op(req) != REQ_OP_WRITE) {
- blk_mq_end_request(req, BLK_STS_OK);
- return;
- }
+ if (req_op(req) != REQ_OP_READ && req_op(req) != REQ_OP_WRITE)
+ goto exit;
/* for READ request, writing data in iod->addr to rq buffers */
unmapped_bytes = ublk_unmap_io(ubq, req, io);
@@ -691,6 +690,10 @@ static void ublk_complete_rq(struct request *req)
blk_mq_requeue_request(req, true);
else
__blk_mq_end_request(req, BLK_STS_OK);
+
+ return;
+exit:
+ blk_mq_end_request(req, res);
}
/*
--
2.39.2
* Re: [PATCH V2 07/17] block: ublk_drv: add common exit handling
2023-03-07 14:15 ` [PATCH V2 07/17] block: ublk_drv: add common exit handling Ming Lei
@ 2023-03-14 17:15 ` kernel test robot
0 siblings, 0 replies; 42+ messages in thread
From: kernel test robot @ 2023-03-14 17:15 UTC (permalink / raw)
To: Ming Lei, Jens Axboe, io-uring
Cc: oe-kbuild-all, linux-block, Miklos Szeredi, ZiyangZhang,
Xiaoguang Wang, Bernd Schubert, Ming Lei
Hi Ming,
I love your patch! Perhaps something to improve:
[auto build test WARNING on axboe-block/for-next]
[also build test WARNING on char-misc/char-misc-testing char-misc/char-misc-next char-misc/char-misc-linus linus/master v6.3-rc2 next-20230314]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Ming-Lei/io_uring-add-IO_URING_F_FUSED-and-prepare-for-supporting-OP_FUSED_CMD/20230307-222928
base: https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git for-next
patch link: https://lore.kernel.org/r/20230307141520.793891-8-ming.lei%40redhat.com
patch subject: [PATCH V2 07/17] block: ublk_drv: add common exit handling
config: microblaze-randconfig-s033-20230308 (https://download.01.org/0day-ci/archive/20230315/[email protected]/config)
compiler: microblaze-linux-gcc (GCC) 12.1.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# apt-get install sparse
# sparse version: v0.6.4-39-gce1a6720-dirty
# https://github.com/intel-lab-lkp/linux/commit/53855edaeebdbd21c916cbb864ac45cb64def9cd
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Ming-Lei/io_uring-add-IO_URING_F_FUSED-and-prepare-for-supporting-OP_FUSED_CMD/20230307-222928
git checkout 53855edaeebdbd21c916cbb864ac45cb64def9cd
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=microblaze olddefconfig
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=microblaze SHELL=/bin/bash drivers/block/
If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <[email protected]>
| Link: https://lore.kernel.org/oe-kbuild-all/[email protected]/
sparse warnings: (new ones prefixed by >>)
>> drivers/block/ublk_drv.c:665:21: sparse: sparse: incorrect type in assignment (different base types) @@ expected int res @@ got restricted blk_status_t @@
drivers/block/ublk_drv.c:665:21: sparse: expected int res
drivers/block/ublk_drv.c:665:21: sparse: got restricted blk_status_t
>> drivers/block/ublk_drv.c:696:33: sparse: sparse: incorrect type in argument 2 (different base types) @@ expected restricted blk_status_t [usertype] error @@ got int res @@
drivers/block/ublk_drv.c:696:33: sparse: expected restricted blk_status_t [usertype] error
drivers/block/ublk_drv.c:696:33: sparse: got int res
vim +665 drivers/block/ublk_drv.c
651
652 /* todo: handle partial completion */
653 static void ublk_complete_rq(struct request *req)
654 {
655 struct ublk_queue *ubq = req->mq_hctx->driver_data;
656 struct ublk_io *io = &ubq->ios[req->tag];
657 unsigned int unmapped_bytes;
658 int res = BLK_STS_OK;
659
660 /* failed read IO if nothing is read */
661 if (!io->res && req_op(req) == REQ_OP_READ)
662 io->res = -EIO;
663
664 if (io->res < 0) {
> 665 res = errno_to_blk_status(io->res);
666 goto exit;
667 }
668
669 /*
670 * FLUSH, DISCARD or WRITE_ZEROES usually won't return bytes returned, so end them
671 * directly.
672 *
673 * Both the two needn't unmap.
674 */
675 if (req_op(req) != REQ_OP_READ && req_op(req) != REQ_OP_WRITE)
676 goto exit;
677
678 /* for READ request, writing data in iod->addr to rq buffers */
679 unmapped_bytes = ublk_unmap_io(ubq, req, io);
680
681 /*
682 * Extremely impossible since we got data filled in just before
683 *
684 * Re-read simply for this unlikely case.
685 */
686 if (unlikely(unmapped_bytes < io->res))
687 io->res = unmapped_bytes;
688
689 if (blk_update_request(req, BLK_STS_OK, io->res))
690 blk_mq_requeue_request(req, true);
691 else
692 __blk_mq_end_request(req, BLK_STS_OK);
693
694 return;
695 exit:
> 696 blk_mq_end_request(req, res);
697 }
698
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests
* [PATCH V2 08/17] block: ublk_drv: don't consider flush request in map/unmap io
2023-03-07 14:15 [PATCH V2 00/17] io_uring/ublk: add IORING_OP_FUSED_CMD Ming Lei
` (6 preceding siblings ...)
2023-03-07 14:15 ` [PATCH V2 07/17] block: ublk_drv: add common exit handling Ming Lei
@ 2023-03-07 14:15 ` Ming Lei
2023-03-08 3:50 ` Ziyang Zhang
2023-03-07 14:15 ` [PATCH V2 09/17] block: ublk_drv: add two helpers to clean up map/unmap request Ming Lei
` (10 subsequent siblings)
18 siblings, 1 reply; 42+ messages in thread
From: Ming Lei @ 2023-03-07 14:15 UTC (permalink / raw)
To: Jens Axboe, io-uring
Cc: linux-block, Miklos Szeredi, ZiyangZhang, Xiaoguang Wang,
Bernd Schubert, Ming Lei
A REQ_OP_FLUSH request never carries data, so don't consider it in
either ublk_map_io() or ublk_unmap_io().
Signed-off-by: Ming Lei <[email protected]>
---
drivers/block/ublk_drv.c | 10 +++-------
1 file changed, 3 insertions(+), 7 deletions(-)
diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index c03728c13eea..624ee4ca20e5 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -529,15 +529,13 @@ static int ublk_map_io(const struct ublk_queue *ubq, const struct request *req,
struct ublk_io *io)
{
const unsigned int rq_bytes = blk_rq_bytes(req);
+
/*
* no zero copy, we delay copy WRITE request data into ublksrv
* context and the big benefit is that pinning pages in current
* context is pretty fast, see ublk_pin_user_pages
*/
- if (req_op(req) != REQ_OP_WRITE && req_op(req) != REQ_OP_FLUSH)
- return rq_bytes;
-
- if (ublk_rq_has_data(req)) {
+ if (ublk_rq_has_data(req) && req_op(req) == REQ_OP_WRITE) {
struct ublk_map_data data = {
.ubq = ubq,
.rq = req,
@@ -772,9 +770,7 @@ static inline void __ublk_rq_task_work(struct request *req)
return;
}
- if (ublk_need_get_data(ubq) &&
- (req_op(req) == REQ_OP_WRITE ||
- req_op(req) == REQ_OP_FLUSH)) {
+ if (ublk_need_get_data(ubq) && (req_op(req) == REQ_OP_WRITE)) {
/*
* We have not handled UBLK_IO_NEED_GET_DATA command yet,
* so immepdately pass UBLK_IO_RES_NEED_GET_DATA to ublksrv
--
2.39.2
* [PATCH V2 09/17] block: ublk_drv: add two helpers to clean up map/unmap request
2023-03-07 14:15 [PATCH V2 00/17] io_uring/ublk: add IORING_OP_FUSED_CMD Ming Lei
` (7 preceding siblings ...)
2023-03-07 14:15 ` [PATCH V2 08/17] block: ublk_drv: don't consider flush request in map/unmap io Ming Lei
@ 2023-03-07 14:15 ` Ming Lei
2023-03-09 3:12 ` Ziyang Zhang
2023-03-07 14:15 ` [PATCH V2 10/17] block: ublk_drv: clean up several helpers Ming Lei
` (9 subsequent siblings)
18 siblings, 1 reply; 42+ messages in thread
From: Ming Lei @ 2023-03-07 14:15 UTC (permalink / raw)
To: Jens Axboe, io-uring
Cc: linux-block, Miklos Szeredi, ZiyangZhang, Xiaoguang Wang,
Bernd Schubert, Ming Lei
Add two helpers for checking if map/unmap is needed, since we may have
passthrough requests which need map or unmap in the future, such as for
supporting report zones.
Meanwhile, don't mark ublk_copy_user_pages() as inline, since this function
has grown fairly large now.
Signed-off-by: Ming Lei <[email protected]>
---
drivers/block/ublk_drv.c | 19 ++++++++++++++-----
1 file changed, 14 insertions(+), 5 deletions(-)
diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 624ee4ca20e5..115590bb60b2 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -488,8 +488,7 @@ static inline unsigned ublk_copy_io_pages(struct ublk_io_iter *data,
return done;
}
-static inline int ublk_copy_user_pages(struct ublk_map_data *data,
- bool to_vm)
+static int ublk_copy_user_pages(struct ublk_map_data *data, bool to_vm)
{
const unsigned int gup_flags = to_vm ? FOLL_WRITE : 0;
const unsigned long start_vm = data->io->addr;
@@ -525,6 +524,16 @@ static inline int ublk_copy_user_pages(struct ublk_map_data *data,
return done;
}
+static inline bool ublk_need_map_req(const struct request *req)
+{
+ return ublk_rq_has_data(req) && req_op(req) == REQ_OP_WRITE;
+}
+
+static inline bool ublk_need_unmap_req(const struct request *req)
+{
+ return ublk_rq_has_data(req) && req_op(req) == REQ_OP_READ;
+}
+
static int ublk_map_io(const struct ublk_queue *ubq, const struct request *req,
struct ublk_io *io)
{
@@ -535,7 +544,7 @@ static int ublk_map_io(const struct ublk_queue *ubq, const struct request *req,
* context and the big benefit is that pinning pages in current
* context is pretty fast, see ublk_pin_user_pages
*/
- if (ublk_rq_has_data(req) && req_op(req) == REQ_OP_WRITE) {
+ if (ublk_need_map_req(req)) {
struct ublk_map_data data = {
.ubq = ubq,
.rq = req,
@@ -556,7 +565,7 @@ static int ublk_unmap_io(const struct ublk_queue *ubq,
{
const unsigned int rq_bytes = blk_rq_bytes(req);
- if (req_op(req) == REQ_OP_READ && ublk_rq_has_data(req)) {
+ if (ublk_need_unmap_req(req)) {
struct ublk_map_data data = {
.ubq = ubq,
.rq = req,
@@ -770,7 +779,7 @@ static inline void __ublk_rq_task_work(struct request *req)
return;
}
- if (ublk_need_get_data(ubq) && (req_op(req) == REQ_OP_WRITE)) {
+ if (ublk_need_get_data(ubq) && ublk_need_map_req(req)) {
/*
* We have not handled UBLK_IO_NEED_GET_DATA command yet,
* so immepdately pass UBLK_IO_RES_NEED_GET_DATA to ublksrv
--
2.39.2
* [PATCH V2 10/17] block: ublk_drv: clean up several helpers
2023-03-07 14:15 [PATCH V2 00/17] io_uring/ublk: add IORING_OP_FUSED_CMD Ming Lei
` (8 preceding siblings ...)
2023-03-07 14:15 ` [PATCH V2 09/17] block: ublk_drv: add two helpers to clean up map/unmap request Ming Lei
@ 2023-03-07 14:15 ` Ming Lei
2023-03-09 3:12 ` Ziyang Zhang
2023-03-07 14:15 ` [PATCH V2 11/17] block: ublk_drv: cleanup 'struct ublk_map_data' Ming Lei
` (8 subsequent siblings)
18 siblings, 1 reply; 42+ messages in thread
From: Ming Lei @ 2023-03-07 14:15 UTC (permalink / raw)
To: Jens Axboe, io-uring
Cc: linux-block, Miklos Szeredi, ZiyangZhang, Xiaoguang Wang,
Bernd Schubert, Ming Lei
Convert the following pattern in several helpers:
if (Z)
return true;
return false;
into:
return Z;
Signed-off-by: Ming Lei <[email protected]>
---
drivers/block/ublk_drv.c | 18 +++++-------------
1 file changed, 5 insertions(+), 13 deletions(-)
diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 115590bb60b2..5b69e239dca3 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -298,9 +298,7 @@ static inline bool ublk_can_use_task_work(const struct ublk_queue *ubq)
static inline bool ublk_need_get_data(const struct ublk_queue *ubq)
{
- if (ubq->flags & UBLK_F_NEED_GET_DATA)
- return true;
- return false;
+ return ubq->flags & UBLK_F_NEED_GET_DATA;
}
static struct ublk_device *ublk_get_device(struct ublk_device *ub)
@@ -349,25 +347,19 @@ static inline int ublk_queue_cmd_buf_size(struct ublk_device *ub, int q_id)
static inline bool ublk_queue_can_use_recovery_reissue(
struct ublk_queue *ubq)
{
- if ((ubq->flags & UBLK_F_USER_RECOVERY) &&
- (ubq->flags & UBLK_F_USER_RECOVERY_REISSUE))
- return true;
- return false;
+ return (ubq->flags & UBLK_F_USER_RECOVERY) &&
+ (ubq->flags & UBLK_F_USER_RECOVERY_REISSUE);
}
static inline bool ublk_queue_can_use_recovery(
struct ublk_queue *ubq)
{
- if (ubq->flags & UBLK_F_USER_RECOVERY)
- return true;
- return false;
+ return ubq->flags & UBLK_F_USER_RECOVERY;
}
static inline bool ublk_can_use_recovery(struct ublk_device *ub)
{
- if (ub->dev_info.flags & UBLK_F_USER_RECOVERY)
- return true;
- return false;
+ return ub->dev_info.flags & UBLK_F_USER_RECOVERY;
}
static void ublk_free_disk(struct gendisk *disk)
--
2.39.2
* [PATCH V2 11/17] block: ublk_drv: cleanup 'struct ublk_map_data'
2023-03-07 14:15 [PATCH V2 00/17] io_uring/ublk: add IORING_OP_FUSED_CMD Ming Lei
` (9 preceding siblings ...)
2023-03-07 14:15 ` [PATCH V2 10/17] block: ublk_drv: clean up several helpers Ming Lei
@ 2023-03-07 14:15 ` Ming Lei
2023-03-09 3:16 ` Ziyang Zhang
2023-03-07 14:15 ` [PATCH V2 12/17] block: ublk_drv: cleanup ublk_copy_user_pages Ming Lei
` (7 subsequent siblings)
18 siblings, 1 reply; 42+ messages in thread
From: Ming Lei @ 2023-03-07 14:15 UTC (permalink / raw)
To: Jens Axboe, io-uring
Cc: linux-block, Miklos Szeredi, ZiyangZhang, Xiaoguang Wang,
Bernd Schubert, Ming Lei
'struct ublk_map_data' is passed to ublk_copy_user_pages()
for copying data between a userspace buffer and request pages.
Here what matters is the userspace buffer address/len and 'struct request',
so replace the ->io field with the user buffer address, and rename max_bytes
as len.
Meanwhile, remove the 'ubq' field from ublk_map_data, since it isn't used
any more.
Then the code becomes more readable.
Signed-off-by: Ming Lei <[email protected]>
---
drivers/block/ublk_drv.c | 27 ++++++++++++---------------
1 file changed, 12 insertions(+), 15 deletions(-)
diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 5b69e239dca3..4449e1013841 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -420,10 +420,9 @@ static const struct block_device_operations ub_fops = {
#define UBLK_MAX_PIN_PAGES 32
struct ublk_map_data {
- const struct ublk_queue *ubq;
const struct request *rq;
- const struct ublk_io *io;
- unsigned max_bytes;
+ unsigned long ubuf;
+ unsigned int len;
};
struct ublk_io_iter {
@@ -483,14 +482,14 @@ static inline unsigned ublk_copy_io_pages(struct ublk_io_iter *data,
static int ublk_copy_user_pages(struct ublk_map_data *data, bool to_vm)
{
const unsigned int gup_flags = to_vm ? FOLL_WRITE : 0;
- const unsigned long start_vm = data->io->addr;
+ const unsigned long start_vm = data->ubuf;
unsigned int done = 0;
struct ublk_io_iter iter = {
.pg_off = start_vm & (PAGE_SIZE - 1),
.bio = data->rq->bio,
.iter = data->rq->bio->bi_iter,
};
- const unsigned int nr_pages = round_up(data->max_bytes +
+ const unsigned int nr_pages = round_up(data->len +
(start_vm & (PAGE_SIZE - 1)), PAGE_SIZE) >> PAGE_SHIFT;
while (done < nr_pages) {
@@ -503,13 +502,13 @@ static int ublk_copy_user_pages(struct ublk_map_data *data, bool to_vm)
iter.pages);
if (iter.nr_pages <= 0)
return done == 0 ? iter.nr_pages : done;
- len = ublk_copy_io_pages(&iter, data->max_bytes, to_vm);
+ len = ublk_copy_io_pages(&iter, data->len, to_vm);
for (i = 0; i < iter.nr_pages; i++) {
if (to_vm)
set_page_dirty(iter.pages[i]);
put_page(iter.pages[i]);
}
- data->max_bytes -= len;
+ data->len -= len;
done += iter.nr_pages;
}
@@ -538,15 +537,14 @@ static int ublk_map_io(const struct ublk_queue *ubq, const struct request *req,
*/
if (ublk_need_map_req(req)) {
struct ublk_map_data data = {
- .ubq = ubq,
.rq = req,
- .io = io,
- .max_bytes = rq_bytes,
+ .ubuf = io->addr,
+ .len = rq_bytes,
};
ublk_copy_user_pages(&data, true);
- return rq_bytes - data.max_bytes;
+ return rq_bytes - data.len;
}
return rq_bytes;
}
@@ -559,17 +557,16 @@ static int ublk_unmap_io(const struct ublk_queue *ubq,
if (ublk_need_unmap_req(req)) {
struct ublk_map_data data = {
- .ubq = ubq,
.rq = req,
- .io = io,
- .max_bytes = io->res,
+ .ubuf = io->addr,
+ .len = io->res,
};
WARN_ON_ONCE(io->res > rq_bytes);
ublk_copy_user_pages(&data, false);
- return io->res - data.max_bytes;
+ return io->res - data.len;
}
return rq_bytes;
}
--
2.39.2
^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: [PATCH V2 11/17] block: ublk_drv: cleanup 'struct ublk_map_data'
2023-03-07 14:15 ` [PATCH V2 11/17] block: ublk_drv: cleanup 'struct ublk_map_data' Ming Lei
@ 2023-03-09 3:16 ` Ziyang Zhang
0 siblings, 0 replies; 42+ messages in thread
From: Ziyang Zhang @ 2023-03-09 3:16 UTC (permalink / raw)
To: Ming Lei, Jens Axboe, io-uring
Cc: linux-block, Miklos Szeredi, Xiaoguang Wang, Bernd Schubert
On 2023/3/7 22:15, Ming Lei wrote:
> 'struct ublk_map_data' is passed to ublk_copy_user_pages()
> for copying data between userspace buffer and request pages.
>
> Here what matters is userspace buffer address/len and 'struct request',
> so replace ->io field with user buffer address, and rename max_bytes
> as len.
>
> Meantime remove 'ubq' field from ublk_map_data, since it isn't used
> any more.
>
> Then code becomes more readable.
>
> Signed-off-by: Ming Lei <[email protected]>
> ---
Reviewed-by: Ziyang Zhang <[email protected]>
^ permalink raw reply [flat|nested] 42+ messages in thread
* [PATCH V2 12/17] block: ublk_drv: cleanup ublk_copy_user_pages
2023-03-07 14:15 [PATCH V2 00/17] io_uring/ublk: add IORING_OP_FUSED_CMD Ming Lei
` (10 preceding siblings ...)
2023-03-07 14:15 ` [PATCH V2 11/17] block: ublk_drv: cleanup 'struct ublk_map_data' Ming Lei
@ 2023-03-07 14:15 ` Ming Lei
2023-03-07 23:57 ` kernel test robot
2023-03-15 7:05 ` Ziyang Zhang
2023-03-07 14:15 ` [PATCH V2 13/17] block: ublk_drv: grab request reference when the request is handled by userspace Ming Lei
` (6 subsequent siblings)
18 siblings, 2 replies; 42+ messages in thread
From: Ming Lei @ 2023-03-07 14:15 UTC (permalink / raw)
To: Jens Axboe, io-uring
Cc: linux-block, Miklos Szeredi, ZiyangZhang, Xiaoguang Wang,
Bernd Schubert, Ming Lei
Clean up ublk_copy_user_pages() by using iov_iter; the code gets
simplified a lot and becomes much more readable than before.
Signed-off-by: Ming Lei <[email protected]>
---
drivers/block/ublk_drv.c | 112 +++++++++++++++++----------------------
1 file changed, 49 insertions(+), 63 deletions(-)
diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 4449e1013841..f07eb8a54357 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -419,49 +419,39 @@ static const struct block_device_operations ub_fops = {
#define UBLK_MAX_PIN_PAGES 32
-struct ublk_map_data {
- const struct request *rq;
- unsigned long ubuf;
- unsigned int len;
-};
-
struct ublk_io_iter {
struct page *pages[UBLK_MAX_PIN_PAGES];
- unsigned pg_off; /* offset in the 1st page in pages */
- int nr_pages; /* how many page pointers in pages */
struct bio *bio;
struct bvec_iter iter;
};
-static inline unsigned ublk_copy_io_pages(struct ublk_io_iter *data,
- unsigned max_bytes, bool to_vm)
+/* copy data between request pages and the pinned pages */
+static void ublk_copy_io_pages(struct ublk_io_iter *data,
+ size_t total, size_t pg_off, int dir)
{
- const unsigned total = min_t(unsigned, max_bytes,
- PAGE_SIZE - data->pg_off +
- ((data->nr_pages - 1) << PAGE_SHIFT));
unsigned done = 0;
unsigned pg_idx = 0;
while (done < total) {
struct bio_vec bv = bio_iter_iovec(data->bio, data->iter);
- const unsigned int bytes = min3(bv.bv_len, total - done,
- (unsigned)(PAGE_SIZE - data->pg_off));
+ unsigned int bytes = min3(bv.bv_len, (unsigned)total - done,
+ (unsigned)(PAGE_SIZE - pg_off));
void *bv_buf = bvec_kmap_local(&bv);
void *pg_buf = kmap_local_page(data->pages[pg_idx]);
- if (to_vm)
- memcpy(pg_buf + data->pg_off, bv_buf, bytes);
+ if (dir == ITER_DEST)
+ memcpy(pg_buf + pg_off, bv_buf, bytes);
else
- memcpy(bv_buf, pg_buf + data->pg_off, bytes);
+ memcpy(bv_buf, pg_buf + pg_off, bytes);
kunmap_local(pg_buf);
kunmap_local(bv_buf);
/* advance page array */
- data->pg_off += bytes;
- if (data->pg_off == PAGE_SIZE) {
+ pg_off += bytes;
+ if (pg_off == PAGE_SIZE) {
pg_idx += 1;
- data->pg_off = 0;
+ pg_off = 0;
}
done += bytes;
@@ -475,41 +465,40 @@ static inline unsigned ublk_copy_io_pages(struct ublk_io_iter *data,
data->iter = data->bio->bi_iter;
}
}
-
- return done;
}
-static int ublk_copy_user_pages(struct ublk_map_data *data, bool to_vm)
+/*
+ * Copy data between request pages and io_iter, and 'offset'
+ * is the start point of linear offset of request.
+ */
+static size_t ublk_copy_user_pages(const struct request *req,
+ struct iov_iter *uiter, int dir)
{
- const unsigned int gup_flags = to_vm ? FOLL_WRITE : 0;
- const unsigned long start_vm = data->ubuf;
- unsigned int done = 0;
struct ublk_io_iter iter = {
- .pg_off = start_vm & (PAGE_SIZE - 1),
- .bio = data->rq->bio,
- .iter = data->rq->bio->bi_iter,
+ .bio = req->bio,
+ .iter = req->bio->bi_iter,
};
- const unsigned int nr_pages = round_up(data->len +
- (start_vm & (PAGE_SIZE - 1)), PAGE_SIZE) >> PAGE_SHIFT;
-
- while (done < nr_pages) {
- const unsigned to_pin = min_t(unsigned, UBLK_MAX_PIN_PAGES,
- nr_pages - done);
- unsigned i, len;
-
- iter.nr_pages = get_user_pages_fast(start_vm +
- (done << PAGE_SHIFT), to_pin, gup_flags,
- iter.pages);
- if (iter.nr_pages <= 0)
- return done == 0 ? iter.nr_pages : done;
- len = ublk_copy_io_pages(&iter, data->len, to_vm);
- for (i = 0; i < iter.nr_pages; i++) {
- if (to_vm)
+ size_t done = 0;
+
+ while (iov_iter_count(uiter) && iter.bio) {
+ unsigned nr_pages;
+ size_t len, off;
+ int i;
+
+ len = iov_iter_get_pages2(uiter, iter.pages,
+ iov_iter_count(uiter),
+ UBLK_MAX_PIN_PAGES, &off);
+ if (len <= 0)
+ return done;
+
+ ublk_copy_io_pages(&iter, len, off, dir);
+ nr_pages = DIV_ROUND_UP(len + off, PAGE_SIZE);
+ for (i = 0; i < nr_pages; i++) {
+ if (dir == ITER_DEST)
set_page_dirty(iter.pages[i]);
put_page(iter.pages[i]);
}
- data->len -= len;
- done += iter.nr_pages;
+ done += len;
}
return done;
@@ -536,15 +525,14 @@ static int ublk_map_io(const struct ublk_queue *ubq, const struct request *req,
* context is pretty fast, see ublk_pin_user_pages
*/
if (ublk_need_map_req(req)) {
- struct ublk_map_data data = {
- .rq = req,
- .ubuf = io->addr,
- .len = rq_bytes,
- };
+ struct iov_iter iter;
+ struct iovec iov;
+ const int dir = ITER_DEST;
- ublk_copy_user_pages(&data, true);
+ import_single_range(dir, (void __user *)io->addr,
+ rq_bytes, &iov, &iter);
- return rq_bytes - data.len;
+ return ublk_copy_user_pages(req, &iter, dir);
}
return rq_bytes;
}
@@ -556,17 +544,15 @@ static int ublk_unmap_io(const struct ublk_queue *ubq,
const unsigned int rq_bytes = blk_rq_bytes(req);
if (ublk_need_unmap_req(req)) {
- struct ublk_map_data data = {
- .rq = req,
- .ubuf = io->addr,
- .len = io->res,
- };
+ struct iov_iter iter;
+ struct iovec iov;
+ const int dir = ITER_SOURCE;
WARN_ON_ONCE(io->res > rq_bytes);
- ublk_copy_user_pages(&data, false);
-
- return io->res - data.len;
+ import_single_range(dir, (void __user *)io->addr,
+ io->res, &iov, &iter);
+ return ublk_copy_user_pages(req, &iter, dir);
}
return rq_bytes;
}
--
2.39.2
^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: [PATCH V2 12/17] block: ublk_drv: cleanup ublk_copy_user_pages
2023-03-07 14:15 ` [PATCH V2 12/17] block: ublk_drv: cleanup ublk_copy_user_pages Ming Lei
@ 2023-03-07 23:57 ` kernel test robot
2023-03-15 7:05 ` Ziyang Zhang
1 sibling, 0 replies; 42+ messages in thread
From: kernel test robot @ 2023-03-07 23:57 UTC (permalink / raw)
To: Ming Lei, Jens Axboe, io-uring
Cc: oe-kbuild-all, linux-block, Miklos Szeredi, ZiyangZhang,
Xiaoguang Wang, Bernd Schubert, Ming Lei
Hi Ming,
I love your patch! Perhaps something to improve:
[auto build test WARNING on axboe-block/for-next]
[also build test WARNING on linus/master v6.3-rc1 next-20230307]
[cannot apply to char-misc/char-misc-testing char-misc/char-misc-next char-misc/char-misc-linus]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Ming-Lei/io_uring-add-IO_URING_F_FUSED-and-prepare-for-supporting-OP_FUSED_CMD/20230307-222928
base: https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git for-next
patch link: https://lore.kernel.org/r/20230307141520.793891-13-ming.lei%40redhat.com
patch subject: [PATCH V2 12/17] block: ublk_drv: cleanup ublk_copy_user_pages
config: m68k-allyesconfig (https://download.01.org/0day-ci/archive/20230308/[email protected]/config)
compiler: m68k-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/intel-lab-lkp/linux/commit/c375364124e3c63211e7edd23bb74d22a86d5194
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Ming-Lei/io_uring-add-IO_URING_F_FUSED-and-prepare-for-supporting-OP_FUSED_CMD/20230307-222928
git checkout c375364124e3c63211e7edd23bb74d22a86d5194
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=m68k olddefconfig
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=m68k SHELL=/bin/bash drivers/block/
If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <[email protected]>
| Link: https://lore.kernel.org/oe-kbuild-all/[email protected]/
All warnings (new ones prefixed by >>):
drivers/block/ublk_drv.c: In function 'ublk_map_io':
>> drivers/block/ublk_drv.c:532:42: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
532 | import_single_range(dir, (void __user *)io->addr,
| ^
drivers/block/ublk_drv.c: In function 'ublk_unmap_io':
drivers/block/ublk_drv.c:553:42: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
553 | import_single_range(dir, (void __user *)io->addr,
| ^
vim +532 drivers/block/ublk_drv.c
516
517 static int ublk_map_io(const struct ublk_queue *ubq, const struct request *req,
518 struct ublk_io *io)
519 {
520 const unsigned int rq_bytes = blk_rq_bytes(req);
521
522 /*
523 * no zero copy, we delay copy WRITE request data into ublksrv
524 * context and the big benefit is that pinning pages in current
525 * context is pretty fast, see ublk_pin_user_pages
526 */
527 if (ublk_need_map_req(req)) {
528 struct iov_iter iter;
529 struct iovec iov;
530 const int dir = ITER_DEST;
531
> 532 import_single_range(dir, (void __user *)io->addr,
533 rq_bytes, &iov, &iter);
534
535 return ublk_copy_user_pages(req, &iter, dir);
536 }
537 return rq_bytes;
538 }
539
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH V2 12/17] block: ublk_drv: cleanup ublk_copy_user_pages
2023-03-07 14:15 ` [PATCH V2 12/17] block: ublk_drv: cleanup ublk_copy_user_pages Ming Lei
2023-03-07 23:57 ` kernel test robot
@ 2023-03-15 7:05 ` Ziyang Zhang
1 sibling, 0 replies; 42+ messages in thread
From: Ziyang Zhang @ 2023-03-15 7:05 UTC (permalink / raw)
To: Ming Lei
Cc: linux-block, Jens Axboe, io-uring, Miklos Szeredi, Xiaoguang Wang,
Bernd Schubert
On 2023/3/7 22:15, Ming Lei wrote:
> Clean up ublk_copy_user_pages() by using iov iter, and code
> gets simplified a lot and becomes much more readable than before.
>
> Signed-off-by: Ming Lei <[email protected]>
> ---
Reviewed-by: Ziyang Zhang <[email protected]>
^ permalink raw reply [flat|nested] 42+ messages in thread
* [PATCH V2 13/17] block: ublk_drv: grab request reference when the request is handled by userspace
2023-03-07 14:15 [PATCH V2 00/17] io_uring/ublk: add IORING_OP_FUSED_CMD Ming Lei
` (11 preceding siblings ...)
2023-03-07 14:15 ` [PATCH V2 12/17] block: ublk_drv: cleanup ublk_copy_user_pages Ming Lei
@ 2023-03-07 14:15 ` Ming Lei
2023-03-15 5:20 ` kernel test robot
2023-03-07 14:15 ` [PATCH V2 14/17] block: ublk_drv: support to copy any part of request pages Ming Lei
` (5 subsequent siblings)
18 siblings, 1 reply; 42+ messages in thread
From: Ming Lei @ 2023-03-07 14:15 UTC (permalink / raw)
To: Jens Axboe, io-uring
Cc: linux-block, Miklos Szeredi, ZiyangZhang, Xiaoguang Wang,
Bernd Schubert, Ming Lei
Add a reference counter to the request pdu data, and hold this reference
for the request's lifetime. This way is always safe: in theory the ublk
request won't be completed until all fused commands are done, but since
the other end is userspace, an application can submit fused commands at
will.
Prepare for supporting zero copy, which needs to retrieve the request
buffer via fused command, so we have to guarantee that:
- a fused command can't succeed unless its request is still being handled
- once any fused command succeeds, this request can't be freed
until all fused commands on this request are done.
Signed-off-by: Ming Lei <[email protected]>
---
drivers/block/ublk_drv.c | 67 ++++++++++++++++++++++++++++++++++++++--
1 file changed, 64 insertions(+), 3 deletions(-)
diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index f07eb8a54357..ba252932951c 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -43,6 +43,7 @@
#include <asm/page.h>
#include <linux/task_work.h>
#include <linux/namei.h>
+#include <linux/kref.h>
#include <uapi/linux/ublk_cmd.h>
#define UBLK_MINORS (1U << MINORBITS)
@@ -62,6 +63,17 @@
struct ublk_rq_data {
struct llist_node node;
struct callback_head work;
+
+ /*
+ * Only for applying fused command to support zero copy:
+ *
+ * - if there is any fused command aiming at this request, not complete
+ * request until all fused commands are done
+ *
+ * - fused command has to fail unless this reference is grabbed
+ * successfully
+ */
+ struct kref ref;
};
struct ublk_uring_cmd_pdu {
@@ -180,6 +192,9 @@ struct ublk_params_header {
__u32 types;
};
+static inline void __ublk_complete_rq(struct request *req);
+static void ublk_complete_rq(struct kref *ref);
+
static dev_t ublk_chr_devt;
static struct class *ublk_chr_class;
@@ -288,6 +303,35 @@ static int ublk_apply_params(struct ublk_device *ub)
return 0;
}
+static inline bool ublk_support_zc(const struct ublk_queue *ubq)
+{
+ return ubq->flags & UBLK_F_SUPPORT_ZERO_COPY;
+}
+
+static inline bool ublk_get_req_ref(const struct ublk_queue *ubq,
+ struct request *req)
+{
+ if (ublk_support_zc(ubq)) {
+ struct ublk_rq_data *data = blk_mq_rq_to_pdu(req);
+
+ return kref_get_unless_zero(&data->ref);
+ }
+
+ return true;
+}
+
+static inline void ublk_put_req_ref(const struct ublk_queue *ubq,
+ struct request *req)
+{
+ if (ublk_support_zc(ubq)) {
+ struct ublk_rq_data *data = blk_mq_rq_to_pdu(req);
+
+ kref_put(&data->ref, ublk_complete_rq);
+ } else {
+ __ublk_complete_rq(req);
+ }
+}
+
static inline bool ublk_can_use_task_work(const struct ublk_queue *ubq)
{
if (IS_BUILTIN(CONFIG_BLK_DEV_UBLK) &&
@@ -632,13 +676,19 @@ static inline bool ubq_daemon_is_dying(struct ublk_queue *ubq)
}
/* todo: handle partial completion */
-static void ublk_complete_rq(struct request *req)
+static inline void __ublk_complete_rq(struct request *req)
{
struct ublk_queue *ubq = req->mq_hctx->driver_data;
struct ublk_io *io = &ubq->ios[req->tag];
unsigned int unmapped_bytes;
int res = BLK_STS_OK;
+ /* called from ublk_abort_queue() code path */
+ if (io->flags & UBLK_IO_FLAG_ABORTED) {
+ res = BLK_STS_IOERR;
+ goto exit;
+ }
+
/* failed read IO if nothing is read */
if (!io->res && req_op(req) == REQ_OP_READ)
io->res = -EIO;
@@ -678,6 +728,15 @@ static void ublk_complete_rq(struct request *req)
blk_mq_end_request(req, res);
}
+static void ublk_complete_rq(struct kref *ref)
+{
+ struct ublk_rq_data *data = container_of(ref, struct ublk_rq_data,
+ ref);
+ struct request *req = blk_mq_rq_from_pdu(data);
+
+ __ublk_complete_rq(req);
+}
+
/*
* Since __ublk_rq_task_work always fails requests immediately during
* exiting, __ublk_fail_req() is only called from abort context during
@@ -696,7 +755,7 @@ static void __ublk_fail_req(struct ublk_queue *ubq, struct ublk_io *io,
if (ublk_queue_can_use_recovery_reissue(ubq))
blk_mq_requeue_request(req, false);
else
- blk_mq_end_request(req, BLK_STS_IOERR);
+ ublk_put_req_ref(ubq, req);
}
}
@@ -732,6 +791,7 @@ static inline void __ublk_abort_rq(struct ublk_queue *ubq,
static inline void __ublk_rq_task_work(struct request *req)
{
struct ublk_queue *ubq = req->mq_hctx->driver_data;
+ struct ublk_rq_data *data = blk_mq_rq_to_pdu(req);
int tag = req->tag;
struct ublk_io *io = &ubq->ios[tag];
unsigned int mapped_bytes;
@@ -803,6 +863,7 @@ static inline void __ublk_rq_task_work(struct request *req)
mapped_bytes >> 9;
}
+ kref_init(&data->ref);
ubq_complete_io_cmd(io, UBLK_IO_RES_OK);
}
@@ -1013,7 +1074,7 @@ static void ublk_commit_completion(struct ublk_device *ub,
req = blk_mq_tag_to_rq(ub->tag_set.tags[qid], tag);
if (req && likely(!blk_should_fake_timeout(req->q)))
- ublk_complete_rq(req);
+ ublk_put_req_ref(ubq, req);
}
/*
--
2.39.2
^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: [PATCH V2 13/17] block: ublk_drv: grab request reference when the request is handled by userspace
2023-03-07 14:15 ` [PATCH V2 13/17] block: ublk_drv: grab request reference when the request is handled by userspace Ming Lei
@ 2023-03-15 5:20 ` kernel test robot
0 siblings, 0 replies; 42+ messages in thread
From: kernel test robot @ 2023-03-15 5:20 UTC (permalink / raw)
To: Ming Lei, Jens Axboe, io-uring
Cc: oe-kbuild-all, linux-block, Miklos Szeredi, ZiyangZhang,
Xiaoguang Wang, Bernd Schubert, Ming Lei
Hi Ming,
I love your patch! Perhaps something to improve:
[auto build test WARNING on axboe-block/for-next]
[also build test WARNING on char-misc/char-misc-testing char-misc/char-misc-next char-misc/char-misc-linus linus/master v6.3-rc2 next-20230314]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Ming-Lei/io_uring-add-IO_URING_F_FUSED-and-prepare-for-supporting-OP_FUSED_CMD/20230307-222928
base: https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git for-next
patch link: https://lore.kernel.org/r/20230307141520.793891-14-ming.lei%40redhat.com
patch subject: [PATCH V2 13/17] block: ublk_drv: grab request reference when the request is handled by userspace
config: microblaze-randconfig-s033-20230308 (https://download.01.org/0day-ci/archive/20230315/[email protected]/config)
compiler: microblaze-linux-gcc (GCC) 12.1.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# apt-get install sparse
# sparse version: v0.6.4-39-gce1a6720-dirty
# https://github.com/intel-lab-lkp/linux/commit/8e1a2115a8d58ff04cbc1aad6192805e29d0b63e
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Ming-Lei/io_uring-add-IO_URING_F_FUSED-and-prepare-for-supporting-OP_FUSED_CMD/20230307-222928
git checkout 8e1a2115a8d58ff04cbc1aad6192805e29d0b63e
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=microblaze olddefconfig
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=microblaze SHELL=/bin/bash drivers/block/
If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <[email protected]>
| Link: https://lore.kernel.org/oe-kbuild-all/[email protected]/
sparse warnings: (new ones prefixed by >>)
>> drivers/block/ublk_drv.c:688:21: sparse: sparse: incorrect type in assignment (different base types) @@ expected int res @@ got restricted blk_status_t [usertype] @@
drivers/block/ublk_drv.c:688:21: sparse: expected int res
drivers/block/ublk_drv.c:688:21: sparse: got restricted blk_status_t [usertype]
drivers/block/ublk_drv.c:697:21: sparse: sparse: incorrect type in assignment (different base types) @@ expected int res @@ got restricted blk_status_t @@
drivers/block/ublk_drv.c:697:21: sparse: expected int res
drivers/block/ublk_drv.c:697:21: sparse: got restricted blk_status_t
drivers/block/ublk_drv.c:728:33: sparse: sparse: incorrect type in argument 2 (different base types) @@ expected restricted blk_status_t [usertype] error @@ got int res @@
drivers/block/ublk_drv.c:728:33: sparse: expected restricted blk_status_t [usertype] error
drivers/block/ublk_drv.c:728:33: sparse: got int res
vim +688 drivers/block/ublk_drv.c
677
678 /* todo: handle partial completion */
679 static inline void __ublk_complete_rq(struct request *req)
680 {
681 struct ublk_queue *ubq = req->mq_hctx->driver_data;
682 struct ublk_io *io = &ubq->ios[req->tag];
683 unsigned int unmapped_bytes;
684 int res = BLK_STS_OK;
685
686 /* called from ublk_abort_queue() code path */
687 if (io->flags & UBLK_IO_FLAG_ABORTED) {
> 688 res = BLK_STS_IOERR;
689 goto exit;
690 }
691
692 /* failed read IO if nothing is read */
693 if (!io->res && req_op(req) == REQ_OP_READ)
694 io->res = -EIO;
695
696 if (io->res < 0) {
697 res = errno_to_blk_status(io->res);
698 goto exit;
699 }
700
701 /*
702 * FLUSH, DISCARD or WRITE_ZEROES usually won't return bytes returned, so end them
703 * directly.
704 *
705 * Both the two needn't unmap.
706 */
707 if (req_op(req) != REQ_OP_READ && req_op(req) != REQ_OP_WRITE)
708 goto exit;
709
710 /* for READ request, writing data in iod->addr to rq buffers */
711 unmapped_bytes = ublk_unmap_io(ubq, req, io);
712
713 /*
714 * Extremely impossible since we got data filled in just before
715 *
716 * Re-read simply for this unlikely case.
717 */
718 if (unlikely(unmapped_bytes < io->res))
719 io->res = unmapped_bytes;
720
721 if (blk_update_request(req, BLK_STS_OK, io->res))
722 blk_mq_requeue_request(req, true);
723 else
724 __blk_mq_end_request(req, BLK_STS_OK);
725
726 return;
727 exit:
728 blk_mq_end_request(req, res);
729 }
730
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests
^ permalink raw reply [flat|nested] 42+ messages in thread
* [PATCH V2 14/17] block: ublk_drv: support to copy any part of request pages
2023-03-07 14:15 [PATCH V2 00/17] io_uring/ublk: add IORING_OP_FUSED_CMD Ming Lei
` (12 preceding siblings ...)
2023-03-07 14:15 ` [PATCH V2 13/17] block: ublk_drv: grab request reference when the request is handled by userspace Ming Lei
@ 2023-03-07 14:15 ` Ming Lei
2023-03-07 14:15 ` [PATCH V2 15/17] block: ublk_drv: add read()/write() support for ublk char device Ming Lei
` (4 subsequent siblings)
18 siblings, 0 replies; 42+ messages in thread
From: Ming Lei @ 2023-03-07 14:15 UTC (permalink / raw)
To: Jens Axboe, io-uring
Cc: linux-block, Miklos Szeredi, ZiyangZhang, Xiaoguang Wang,
Bernd Schubert, Ming Lei
Add an 'offset' parameter to ublk_copy_user_pages(), so that it can be
used to copy any (linearly mapped) sub-buffer of the request.
Signed-off-by: Ming Lei <[email protected]>
---
drivers/block/ublk_drv.c | 26 ++++++++++++++++++++++----
1 file changed, 22 insertions(+), 4 deletions(-)
diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index ba252932951c..c0cb99cb0c3c 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -511,19 +511,37 @@ static void ublk_copy_io_pages(struct ublk_io_iter *data,
}
}
+static bool ublk_advance_io_iter(struct ublk_io_iter *iter, unsigned int offset)
+{
+ struct bio *bio = iter->bio;
+
+ for_each_bio(bio) {
+ if (bio->bi_iter.bi_size > offset) {
+ iter->bio = bio;
+ iter->iter = bio->bi_iter;
+ bio_advance_iter(iter->bio, &iter->iter, offset);
+ return true;
+ }
+ offset -= bio->bi_iter.bi_size;
+ }
+ return false;
+}
+
/*
* Copy data between request pages and io_iter, and 'offset'
* is the start point of linear offset of request.
*/
static size_t ublk_copy_user_pages(const struct request *req,
- struct iov_iter *uiter, int dir)
+ unsigned offset, struct iov_iter *uiter, int dir)
{
struct ublk_io_iter iter = {
.bio = req->bio,
- .iter = req->bio->bi_iter,
};
size_t done = 0;
+ if (!ublk_advance_io_iter(&iter, offset))
+ return 0;
+
while (iov_iter_count(uiter) && iter.bio) {
unsigned nr_pages;
size_t len, off;
@@ -576,7 +594,7 @@ static int ublk_map_io(const struct ublk_queue *ubq, const struct request *req,
import_single_range(dir, (void __user *)io->addr,
rq_bytes, &iov, &iter);
- return ublk_copy_user_pages(req, &iter, dir);
+ return ublk_copy_user_pages(req, 0, &iter, dir);
}
return rq_bytes;
}
@@ -596,7 +614,7 @@ static int ublk_unmap_io(const struct ublk_queue *ubq,
import_single_range(dir, (void __user *)io->addr,
io->res, &iov, &iter);
- return ublk_copy_user_pages(req, &iter, dir);
+ return ublk_copy_user_pages(req, 0, &iter, dir);
}
return rq_bytes;
}
--
2.39.2
^ permalink raw reply related [flat|nested] 42+ messages in thread
* [PATCH V2 15/17] block: ublk_drv: add read()/write() support for ublk char device
2023-03-07 14:15 [PATCH V2 00/17] io_uring/ublk: add IORING_OP_FUSED_CMD Ming Lei
` (13 preceding siblings ...)
2023-03-07 14:15 ` [PATCH V2 14/17] block: ublk_drv: support to copy any part of request pages Ming Lei
@ 2023-03-07 14:15 ` Ming Lei
2023-03-07 14:15 ` [PATCH V2 16/17] block: ublk_drv: don't check buffer in case of zero copy Ming Lei
` (3 subsequent siblings)
18 siblings, 0 replies; 42+ messages in thread
From: Ming Lei @ 2023-03-07 14:15 UTC (permalink / raw)
To: Jens Axboe, io-uring
Cc: linux-block, Miklos Szeredi, ZiyangZhang, Xiaoguang Wang,
Bernd Schubert, Ming Lei
We are going to support zero copy via the fused uring command, after
which userspace can't read from or write to the io buffer any more;
that is inflexible for applications:
1) some targets need to zero the buffer explicitly, such as when reading
an unmapped qcow2 cluster
2) some targets need to support passthrough commands, such as zoned
report zones, and still need to read/write the io buffer
Support pread()/pwrite() on the ublk char device for reading/writing the
request io buffer, so the ublk server can handle the above cases easily.
This also helps make zero copy the primary option; non-zero-copy becomes
a legacy code path, since the added read()/write() support covers the
non-zero-copy feature.
Signed-off-by: Ming Lei <[email protected]>
---
drivers/block/ublk_drv.c | 131 ++++++++++++++++++++++++++++++++++
include/uapi/linux/ublk_cmd.h | 31 +++++++-
2 files changed, 161 insertions(+), 1 deletion(-)
diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index c0cb99cb0c3c..452deec571f7 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -1318,6 +1318,36 @@ static void ublk_handle_need_get_data(struct ublk_device *ub, int q_id,
ublk_queue_cmd(ubq, req);
}
+static inline struct request *__ublk_check_and_get_req(struct ublk_device *ub,
+ struct ublk_queue *ubq, int tag, size_t offset)
+{
+ struct request *req;
+
+ if (!ublk_support_zc(ubq))
+ return NULL;
+
+ req = blk_mq_tag_to_rq(ub->tag_set.tags[ubq->q_id], tag);
+ if (!req)
+ return NULL;
+
+ if (!ublk_get_req_ref(ubq, req))
+ return NULL;
+
+ if (unlikely(!blk_mq_request_started(req) || req->tag != tag))
+ goto fail_put;
+
+ if (!ublk_rq_has_data(req))
+ goto fail_put;
+
+ if (offset > blk_rq_bytes(req))
+ goto fail_put;
+
+ return req;
+fail_put:
+ ublk_put_req_ref(ubq, req);
+ return NULL;
+}
+
static int ublk_ch_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
{
struct ublksrv_io_cmd *ub_cmd = (struct ublksrv_io_cmd *)cmd->cmd;
@@ -1422,11 +1452,112 @@ static int ublk_ch_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
return -EIOCBQUEUED;
}
+static inline bool ublk_check_ubuf_dir(const struct request *req,
+ int ubuf_dir)
+{
+ /* copy ubuf to request pages */
+ if (req_op(req) == REQ_OP_READ && ubuf_dir == ITER_SOURCE)
+ return true;
+
+ /* copy request pages to ubuf */
+ if (req_op(req) == REQ_OP_WRITE && ubuf_dir == ITER_DEST)
+ return true;
+
+ return false;
+}
+
+static struct request *ublk_check_and_get_req(struct kiocb *iocb,
+ struct iov_iter *iter, size_t *off, int dir)
+{
+ struct ublk_device *ub = iocb->ki_filp->private_data;
+ struct ublk_queue *ubq;
+ struct request *req;
+ size_t buf_off;
+ u16 tag, q_id;
+
+ if (!ub)
+ return ERR_PTR(-EACCES);
+
+ if (!user_backed_iter(iter))
+ return ERR_PTR(-EACCES);
+
+ if (ub->dev_info.state == UBLK_S_DEV_DEAD)
+ return ERR_PTR(-EACCES);
+
+ tag = ublk_pos_to_tag(iocb->ki_pos);
+ q_id = ublk_pos_to_hwq(iocb->ki_pos);
+ buf_off = ublk_pos_to_buf_offset(iocb->ki_pos);
+
+ if (q_id >= ub->dev_info.nr_hw_queues)
+ return ERR_PTR(-EINVAL);
+
+ ubq = ublk_get_queue(ub, q_id);
+ if (!ubq)
+ return ERR_PTR(-EINVAL);
+
+ if (tag >= ubq->q_depth)
+ return ERR_PTR(-EINVAL);
+
+ req = __ublk_check_and_get_req(ub, ubq, tag, buf_off);
+ if (!req)
+ return ERR_PTR(-EINVAL);
+
+ if (!req->mq_hctx || !req->mq_hctx->driver_data)
+ goto fail;
+
+ if (!ublk_check_ubuf_dir(req, dir))
+ goto fail;
+
+ *off = buf_off;
+ return req;
+fail:
+ ublk_put_req_ref(ubq, req);
+ return ERR_PTR(-EACCES);
+}
+
+static ssize_t ublk_ch_read_iter(struct kiocb *iocb, struct iov_iter *to)
+{
+ struct ublk_queue *ubq;
+ struct request *req;
+ size_t buf_off;
+ size_t ret;
+
+ req = ublk_check_and_get_req(iocb, to, &buf_off, ITER_DEST);
+ if (unlikely(IS_ERR(req)))
+ return PTR_ERR(req);
+
+ ret = ublk_copy_user_pages(req, buf_off, to, ITER_DEST);
+ ubq = req->mq_hctx->driver_data;
+ ublk_put_req_ref(ubq, req);
+
+ return ret;
+}
+
+static ssize_t ublk_ch_write_iter(struct kiocb *iocb, struct iov_iter *from)
+{
+ struct ublk_queue *ubq;
+ struct request *req;
+ size_t buf_off;
+ size_t ret;
+
+ req = ublk_check_and_get_req(iocb, from, &buf_off, ITER_SOURCE);
+ if (unlikely(IS_ERR(req)))
+ return PTR_ERR(req);
+
+ ret = ublk_copy_user_pages(req, buf_off, from, ITER_SOURCE);
+ ubq = req->mq_hctx->driver_data;
+ ublk_put_req_ref(ubq, req);
+
+ return ret;
+}
+
static const struct file_operations ublk_ch_fops = {
.owner = THIS_MODULE,
.open = ublk_ch_open,
.release = ublk_ch_release,
.llseek = no_llseek,
+ .read_iter = ublk_ch_read_iter,
+ .write_iter = ublk_ch_write_iter,
.uring_cmd = ublk_ch_uring_cmd,
.mmap = ublk_ch_mmap,
};
diff --git a/include/uapi/linux/ublk_cmd.h b/include/uapi/linux/ublk_cmd.h
index f6238ccc7800..d1a6b3dc0327 100644
--- a/include/uapi/linux/ublk_cmd.h
+++ b/include/uapi/linux/ublk_cmd.h
@@ -54,7 +54,36 @@
#define UBLKSRV_IO_BUF_OFFSET 0x80000000
/* tag bit is 12bit, so at most 4096 IOs for each queue */
-#define UBLK_MAX_QUEUE_DEPTH 4096
+#define UBLK_TAG_BITS 12
+#define UBLK_MAX_QUEUE_DEPTH (1U << UBLK_TAG_BITS)
+
+/* used for locating each io buffer for pread()/pwrite() on char device */
+#define UBLK_BUFS_SIZE_BITS 42
+#define UBLK_BUFS_SIZE_MASK ((1ULL << UBLK_BUFS_SIZE_BITS) - 1)
+#define UBLK_BUF_SIZE_BITS (UBLK_BUFS_SIZE_BITS - UBLK_TAG_BITS)
+#define UBLK_BUF_MAX_SIZE (1ULL << UBLK_BUF_SIZE_BITS)
+
+static inline __u16 ublk_pos_to_hwq(__u64 pos)
+{
+ return pos >> UBLK_BUFS_SIZE_BITS;
+}
+
+static inline __u32 ublk_pos_to_buf_offset(__u64 pos)
+{
+ return (pos & UBLK_BUFS_SIZE_MASK) & (UBLK_BUF_MAX_SIZE - 1);
+}
+
+static inline __u16 ublk_pos_to_tag(__u64 pos)
+{
+ return (pos & UBLK_BUFS_SIZE_MASK) >> UBLK_BUF_SIZE_BITS;
+}
+
+/* offset of single buffer, which has to be < UBLK_BUF_MAX_SIZE */
+static inline __u64 ublk_pos(__u16 q_id, __u16 tag, __u32 offset)
+{
+ return (((__u64)q_id) << UBLK_BUFS_SIZE_BITS) |
+ ((((__u64)tag) << UBLK_BUF_SIZE_BITS) + offset);
+}
/*
* zero copy requires 4k block size, and can remap ublk driver's io
--
2.39.2
^ permalink raw reply related [flat|nested] 42+ messages in thread
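The ublk_pos() encoding added to ublk_cmd.h above packs (q_id, tag, offset) into a single 64-bit char-device file position. The layout can be sanity-checked entirely in userspace; below is a minimal sketch that mirrors the uapi helpers with stdint types (constants copied from the hunk above, `__u16`/`__u32`/`__u64` swapped for their stdint equivalents):

```c
#include <assert.h>
#include <stdint.h>

/* Bit layout mirrored from the ublk_cmd.h hunk above */
#define UBLK_TAG_BITS		12
#define UBLK_BUFS_SIZE_BITS	42
#define UBLK_BUFS_SIZE_MASK	((1ULL << UBLK_BUFS_SIZE_BITS) - 1)
#define UBLK_BUF_SIZE_BITS	(UBLK_BUFS_SIZE_BITS - UBLK_TAG_BITS)
#define UBLK_BUF_MAX_SIZE	(1ULL << UBLK_BUF_SIZE_BITS)

static inline uint16_t ublk_pos_to_hwq(uint64_t pos)
{
	/* the queue id lives in the bits above the per-queue buffer space */
	return pos >> UBLK_BUFS_SIZE_BITS;
}

static inline uint16_t ublk_pos_to_tag(uint64_t pos)
{
	return (pos & UBLK_BUFS_SIZE_MASK) >> UBLK_BUF_SIZE_BITS;
}

static inline uint32_t ublk_pos_to_buf_offset(uint64_t pos)
{
	return (pos & UBLK_BUFS_SIZE_MASK) & (UBLK_BUF_MAX_SIZE - 1);
}

/* offset has to be < UBLK_BUF_MAX_SIZE */
static inline uint64_t ublk_pos(uint16_t q_id, uint16_t tag, uint32_t offset)
{
	return (((uint64_t)q_id) << UBLK_BUFS_SIZE_BITS) |
		((((uint64_t)tag) << UBLK_BUF_SIZE_BITS) + offset);
}
```

Encoding and decoding round-trip for any in-range (q_id, tag, offset) triple, which is what ublk_check_and_get_req() relies on when it takes the three fields back apart from iocb->ki_pos.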
* [PATCH V2 16/17] block: ublk_drv: don't check buffer in case of zero copy
2023-03-07 14:15 [PATCH V2 00/17] io_uring/ublk: add IORING_OP_FUSED_CMD Ming Lei
` (14 preceding siblings ...)
2023-03-07 14:15 ` [PATCH V2 15/17] block: ublk_drv: add read()/write() support for ublk char device Ming Lei
@ 2023-03-07 14:15 ` Ming Lei
2023-03-07 14:15 ` [PATCH V2 17/17] block: ublk_drv: apply io_uring FUSED_CMD for supporting " Ming Lei
` (2 subsequent siblings)
18 siblings, 0 replies; 42+ messages in thread
From: Ming Lei @ 2023-03-07 14:15 UTC (permalink / raw)
To: Jens Axboe, io-uring
Cc: linux-block, Miklos Szeredi, ZiyangZhang, Xiaoguang Wang,
Bernd Schubert, Ming Lei
In case of zero copy, the ublk server needn't pre-allocate IO buffers
and provide them to the driver any more.
Meantime don't set the buffer in case of zero copy any more; userspace
can use pread()/pwrite() to read from/write to the io request buffer,
which is easier and simpler from the userspace viewpoint.
Signed-off-by: Ming Lei <[email protected]>
---
drivers/block/ublk_drv.c | 23 ++++++++++++++---------
1 file changed, 14 insertions(+), 9 deletions(-)
diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 452deec571f7..2385cc3f8566 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -1409,25 +1409,30 @@ static int ublk_ch_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
if (io->flags & UBLK_IO_FLAG_OWNED_BY_SRV)
goto out;
/* FETCH_RQ has to provide IO buffer if NEED GET DATA is not enabled */
- if (!ub_cmd->addr && !ublk_need_get_data(ubq))
- goto out;
+ if (!ublk_support_zc(ubq)) {
+ if (!ub_cmd->addr && !ublk_need_get_data(ubq))
+ goto out;
+ io->addr = ub_cmd->addr;
+ }
io->cmd = cmd;
io->flags |= UBLK_IO_FLAG_ACTIVE;
- io->addr = ub_cmd->addr;
-
ublk_mark_io_ready(ub, ubq);
break;
case UBLK_IO_COMMIT_AND_FETCH_REQ:
req = blk_mq_tag_to_rq(ub->tag_set.tags[ub_cmd->q_id], tag);
+
+ if (!(io->flags & UBLK_IO_FLAG_OWNED_BY_SRV))
+ goto out;
/*
* COMMIT_AND_FETCH_REQ has to provide IO buffer if NEED GET DATA is
* not enabled or it is Read IO.
*/
- if (!ub_cmd->addr && (!ublk_need_get_data(ubq) || req_op(req) == REQ_OP_READ))
- goto out;
- if (!(io->flags & UBLK_IO_FLAG_OWNED_BY_SRV))
- goto out;
- io->addr = ub_cmd->addr;
+ if (!ublk_support_zc(ubq)) {
+ if (!ub_cmd->addr && (!ublk_need_get_data(ubq) ||
+ req_op(req) == REQ_OP_READ))
+ goto out;
+ io->addr = ub_cmd->addr;
+ }
io->flags |= UBLK_IO_FLAG_ACTIVE;
io->cmd = cmd;
ublk_commit_completion(ub, ub_cmd);
--
2.39.2
^ permalink raw reply related [flat|nested] 42+ messages in thread
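With the buffer checks relaxed as above, a zero-copy server no longer passes `addr` and instead reads WRITE payloads (and fills READ payloads) through pread()/pwrite() on the char device at the encoded buffer position. A hedged userspace sketch follows; `ublk_read_io_buf()` and the tmpfile round-trip are illustrative only, a real server would pass its /dev/ublkcN fd:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define UBLK_BUFS_SIZE_BITS	42
#define UBLK_BUF_SIZE_BITS	(UBLK_BUFS_SIZE_BITS - 12)

/* Encode (q_id, tag, offset) into the char-device file position */
static inline uint64_t ublk_pos(uint16_t q_id, uint16_t tag, uint32_t offset)
{
	return (((uint64_t)q_id) << UBLK_BUFS_SIZE_BITS) |
		((((uint64_t)tag) << UBLK_BUF_SIZE_BITS) + offset);
}

/* Hypothetical helper: read 'len' bytes of the io request buffer
 * identified by (q_id, tag), starting 'off' bytes into that buffer. */
static ssize_t ublk_read_io_buf(int cdev_fd, uint16_t q_id, uint16_t tag,
				void *buf, size_t len, uint32_t off)
{
	return pread(cdev_fd, buf, len, ublk_pos(q_id, tag, off));
}

/* Round-trip against a plain temp file standing in for the char
 * device; q_id 0 / tag 0 makes the encoded position equal 'off'. */
static int demo_roundtrip(void)
{
	char out[8] = {0};
	FILE *f = tmpfile();

	if (!f)
		return -1;
	if (pwrite(fileno(f), "abcdef", 6, 0) != 6)
		return -1;
	if (ublk_read_io_buf(fileno(f), 0, 0, out, 5, 1) != 5)
		return -1;
	return memcmp(out, "bcdef", 5);
}
```

The same position encoding serves both directions: pwrite() at the encoded position covers the NEED_GET_DATA-style "copy data in" case that this patch makes unnecessary.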
* [PATCH V2 17/17] block: ublk_drv: apply io_uring FUSED_CMD for supporting zero copy
2023-03-07 14:15 [PATCH V2 00/17] io_uring/ublk: add IORING_OP_FUSED_CMD Ming Lei
` (15 preceding siblings ...)
2023-03-07 14:15 ` [PATCH V2 16/17] block: ublk_drv: don't check buffer in case of zero copy Ming Lei
@ 2023-03-07 14:15 ` Ming Lei
2023-03-07 15:37 ` [PATCH V2 00/17] io_uring/ublk: add IORING_OP_FUSED_CMD Pavel Begunkov
2023-03-15 7:08 ` Ziyang Zhang
18 siblings, 0 replies; 42+ messages in thread
From: Ming Lei @ 2023-03-07 14:15 UTC (permalink / raw)
To: Jens Axboe, io-uring
Cc: linux-block, Miklos Szeredi, ZiyangZhang, Xiaoguang Wang,
Bernd Schubert, Ming Lei
Apply the io_uring fused command for supporting zero copy:
1) init the fused cmd buffer(io_mapped_buf) in ublk_map_io(),
and deinit it in ublk_unmap_io(); this buffer is immutable,
so it is fine to retrieve it from concurrent fused commands.
2) add the sub-command opcode UBLK_IO_FUSED_SUBMIT_IO for retrieving
this fused cmd(zero copy) buffer.
3) call io_fused_cmd_provide_kbuf() to provide the buffer to the slave
request; meantime set up the complete callback via this API. Once the
slave request is completed, the complete callback is called
for freeing the buffer and completing the uring fused command.
Also a request reference is held during the fused command's lifetime,
which guarantees that the request buffer won't be freed until all
fused commands are done.
Signed-off-by: Ming Lei <[email protected]>
---
drivers/block/ublk_drv.c | 190 ++++++++++++++++++++++++++++++++--
include/uapi/linux/ublk_cmd.h | 6 +-
2 files changed, 183 insertions(+), 13 deletions(-)
diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 2385cc3f8566..466a8cfd2b2a 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -74,10 +74,15 @@ struct ublk_rq_data {
* successfully
*/
struct kref ref;
+ bool allocated_bvec;
+ struct io_uring_bvec_buf buf[0];
};
struct ublk_uring_cmd_pdu {
- struct ublk_queue *ubq;
+ union {
+ struct ublk_queue *ubq;
+ struct request *req;
+ };
};
/*
@@ -566,6 +571,69 @@ static size_t ublk_copy_user_pages(const struct request *req,
return done;
}
+/*
+ * The built command buffer is immutable, so it is fine to feed it to
+ * concurrent io_uring fused commands
+ */
+static int ublk_init_zero_copy_buffer(struct request *rq)
+{
+ struct ublk_rq_data *data = blk_mq_rq_to_pdu(rq);
+ struct io_uring_bvec_buf *imu = data->buf;
+ struct req_iterator rq_iter;
+ unsigned int nr_bvecs = 0;
+ struct bio_vec *bvec;
+ unsigned int offset;
+ struct bio_vec bv;
+
+ if (!ublk_rq_has_data(rq))
+ goto exit;
+
+ rq_for_each_bvec(bv, rq, rq_iter)
+ nr_bvecs++;
+
+ if (!nr_bvecs)
+ goto exit;
+
+ if (rq->bio != rq->biotail) {
+ int idx = 0;
+
+ bvec = kvmalloc_array(nr_bvecs, sizeof(struct bio_vec),
+ GFP_NOIO);
+ if (!bvec)
+ return -ENOMEM;
+
+ offset = 0;
+ rq_for_each_bvec(bv, rq, rq_iter)
+ bvec[idx++] = bv;
+ data->allocated_bvec = true;
+ } else {
+ struct bio *bio = rq->bio;
+
+ offset = bio->bi_iter.bi_bvec_done;
+ bvec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
+ }
+ imu->bvec = bvec;
+ imu->nr_bvecs = nr_bvecs;
+ imu->offset = offset;
+ imu->len = blk_rq_bytes(rq);
+
+ return 0;
+exit:
+ imu->bvec = NULL;
+ return 0;
+}
+
+static void ublk_deinit_zero_copy_buffer(struct request *rq)
+{
+ struct ublk_rq_data *data = blk_mq_rq_to_pdu(rq);
+ struct io_uring_bvec_buf *imu = data->buf;
+
+ if (data->allocated_bvec) {
+ kvfree(imu->bvec);
+ data->allocated_bvec = false;
+ }
+}
+
static inline bool ublk_need_map_req(const struct request *req)
{
return ublk_rq_has_data(req) && req_op(req) == REQ_OP_WRITE;
@@ -576,11 +644,23 @@ static inline bool ublk_need_unmap_req(const struct request *req)
return ublk_rq_has_data(req) && req_op(req) == REQ_OP_READ;
}
-static int ublk_map_io(const struct ublk_queue *ubq, const struct request *req,
+static int ublk_map_io(const struct ublk_queue *ubq, struct request *req,
struct ublk_io *io)
{
const unsigned int rq_bytes = blk_rq_bytes(req);
+ if (ublk_support_zc(ubq)) {
+ int ret = ublk_init_zero_copy_buffer(req);
+
+ /*
+ * The only failure is -ENOMEM when allocating the fused
+ * cmd buffer, so return zero to let this req be requeued.
+ */
+ if (unlikely(ret))
+ return 0;
+ return rq_bytes;
+ }
+
/*
* no zero copy, we delay copy WRITE request data into ublksrv
* context and the big benefit is that pinning pages in current
@@ -600,11 +680,17 @@ static int ublk_map_io(const struct ublk_queue *ubq, const struct request *req,
}
static int ublk_unmap_io(const struct ublk_queue *ubq,
- const struct request *req,
+ struct request *req,
struct ublk_io *io)
{
const unsigned int rq_bytes = blk_rq_bytes(req);
+ if (ublk_support_zc(ubq)) {
+ ublk_deinit_zero_copy_buffer(req);
+
+ return rq_bytes;
+ }
+
if (ublk_need_unmap_req(req)) {
struct iov_iter iter;
struct iovec iov;
@@ -688,6 +774,12 @@ static inline struct ublk_uring_cmd_pdu *ublk_get_uring_cmd_pdu(
return (struct ublk_uring_cmd_pdu *)&ioucmd->pdu;
}
+static inline struct ublk_uring_cmd_pdu *ublk_get_uring_fused_cmd_pdu(
+ struct io_uring_cmd *ioucmd)
+{
+ return (struct ublk_uring_cmd_pdu *)&ioucmd->fused.pdu;
+}
+
static inline bool ubq_daemon_is_dying(struct ublk_queue *ubq)
{
return ubq->ubq_daemon->flags & PF_EXITING;
@@ -743,6 +835,7 @@ static inline void __ublk_complete_rq(struct request *req)
return;
exit:
+ ublk_deinit_zero_copy_buffer(req);
blk_mq_end_request(req, res);
}
@@ -1348,6 +1441,67 @@ static inline struct request *__ublk_check_and_get_req(struct ublk_device *ub,
return NULL;
}
+static void ublk_fused_cmd_done_cb(struct io_uring_cmd *cmd)
+{
+ struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_fused_cmd_pdu(cmd);
+ struct request *req = pdu->req;
+ struct ublk_queue *ubq = req->mq_hctx->driver_data;
+
+ ublk_put_req_ref(ubq, req);
+ io_uring_cmd_done(cmd, cmd->fused.data.slave_res, 0);
+}
+
+static inline bool ublk_check_fused_buf_dir(const struct request *req,
+ unsigned int flags)
+{
+ flags &= IO_URING_F_FUSED;
+
+ if (req_op(req) == REQ_OP_READ && flags == IO_URING_F_FUSED_WRITE)
+ return true;
+
+ if (req_op(req) == REQ_OP_WRITE && flags == IO_URING_F_FUSED_READ)
+ return true;
+
+ return false;
+}
+
+static int ublk_handle_fused_cmd(struct io_uring_cmd *cmd,
+ struct ublk_queue *ubq, int tag, unsigned int issue_flags)
+{
+ struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_fused_cmd_pdu(cmd);
+ struct ublk_device *ub = cmd->file->private_data;
+ struct ublk_rq_data *data;
+ struct request *req;
+
+ if (!ub)
+ return -EPERM;
+
+ if (!(issue_flags & IO_URING_F_FUSED))
+ goto exit;
+
+ req = __ublk_check_and_get_req(ub, ubq, tag, 0);
+ if (!req)
+ goto exit;
+
+ pr_devel("%s: qid %d tag %u request bytes %u, issue flags %x\n",
+ __func__, ubq->q_id, tag, blk_rq_bytes(req),
+ issue_flags);
+
+ if (!ublk_check_fused_buf_dir(req, issue_flags))
+ goto exit_put_ref;
+
+ pdu->req = req;
+ data = blk_mq_rq_to_pdu(req);
+ io_fused_cmd_provide_kbuf(cmd, !(issue_flags & IO_URING_F_UNLOCKED),
+ data->buf, ublk_fused_cmd_done_cb);
+ return -EIOCBQUEUED;
+
+exit_put_ref:
+ ublk_put_req_ref(ubq, req);
+exit:
+ return -EINVAL;
+}
+
static int ublk_ch_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
{
struct ublksrv_io_cmd *ub_cmd = (struct ublksrv_io_cmd *)cmd->cmd;
@@ -1363,7 +1517,8 @@ static int ublk_ch_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
__func__, cmd->cmd_op, ub_cmd->q_id, tag,
ub_cmd->result);
- if (issue_flags & IO_URING_F_FUSED)
+ if ((issue_flags & IO_URING_F_FUSED) &&
+ cmd_op != UBLK_IO_FUSED_SUBMIT_IO)
return -EOPNOTSUPP;
if (ub_cmd->q_id >= ub->dev_info.nr_hw_queues)
@@ -1373,7 +1528,12 @@ static int ublk_ch_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
if (!ubq || ub_cmd->q_id != ubq->q_id)
goto out;
- if (ubq->ubq_daemon && ubq->ubq_daemon != current)
+ /*
+ * The fused command reads the io buffer data structure only, so it
+ * is fine to be issued from another context.
+ */
+ if ((ubq->ubq_daemon && ubq->ubq_daemon != current) &&
+ (cmd_op != UBLK_IO_FUSED_SUBMIT_IO))
goto out;
if (tag >= ubq->q_depth)
@@ -1396,6 +1556,9 @@ static int ublk_ch_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
goto out;
switch (cmd_op) {
+ case UBLK_IO_FUSED_SUBMIT_IO:
+ return ublk_handle_fused_cmd(cmd, ubq, tag, issue_flags);
+
case UBLK_IO_FETCH_REQ:
/* UBLK_IO_FETCH_REQ is only allowed before queue is setup */
if (ublk_queue_ready(ubq)) {
@@ -1725,11 +1888,14 @@ static void ublk_align_max_io_size(struct ublk_device *ub)
static int ublk_add_tag_set(struct ublk_device *ub)
{
+ int zc = !!(ub->dev_info.flags & UBLK_F_SUPPORT_ZERO_COPY);
+ struct ublk_rq_data *data;
+
ub->tag_set.ops = &ublk_mq_ops;
ub->tag_set.nr_hw_queues = ub->dev_info.nr_hw_queues;
ub->tag_set.queue_depth = ub->dev_info.queue_depth;
ub->tag_set.numa_node = NUMA_NO_NODE;
- ub->tag_set.cmd_size = sizeof(struct ublk_rq_data);
+ ub->tag_set.cmd_size = struct_size(data, buf, zc);
ub->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
ub->tag_set.driver_data = ub;
return blk_mq_alloc_tag_set(&ub->tag_set);
@@ -1945,12 +2111,18 @@ static int ublk_ctrl_add_dev(struct io_uring_cmd *cmd)
*/
ub->dev_info.flags &= UBLK_F_ALL;
+ /*
+ * NEED_GET_DATA doesn't make sense any more when ZERO_COPY
+ * is requested. Another reason is that userspace can
+ * read/write the io request buffer by pread()/pwrite() at
+ * each io buffer's position.
+ */
+ if (ub->dev_info.flags & UBLK_F_SUPPORT_ZERO_COPY)
+ ub->dev_info.flags &= ~UBLK_F_NEED_GET_DATA;
+
if (!IS_BUILTIN(CONFIG_BLK_DEV_UBLK))
ub->dev_info.flags |= UBLK_F_URING_CMD_COMP_IN_TASK;
- /* We are not ready to support zero copy */
- ub->dev_info.flags &= ~UBLK_F_SUPPORT_ZERO_COPY;
-
ub->dev_info.nr_hw_queues = min_t(unsigned int,
ub->dev_info.nr_hw_queues, nr_cpu_ids);
ublk_align_max_io_size(ub);
diff --git a/include/uapi/linux/ublk_cmd.h b/include/uapi/linux/ublk_cmd.h
index d1a6b3dc0327..c4f3465399cf 100644
--- a/include/uapi/linux/ublk_cmd.h
+++ b/include/uapi/linux/ublk_cmd.h
@@ -44,6 +44,7 @@
#define UBLK_IO_FETCH_REQ 0x20
#define UBLK_IO_COMMIT_AND_FETCH_REQ 0x21
#define UBLK_IO_NEED_GET_DATA 0x22
+#define UBLK_IO_FUSED_SUBMIT_IO 0x23
/* only ABORT means that no re-fetch */
#define UBLK_IO_RES_OK 0
@@ -85,10 +86,7 @@ static inline __u64 ublk_pos(__u16 q_id, __u16 tag, __u32 offset)
((((__u64)tag) << UBLK_BUF_SIZE_BITS) + offset);
}
-/*
- * zero copy requires 4k block size, and can remap ublk driver's io
- * request into ublksrv's vm space
- */
+/* io_uring fused command based zero copy */
#define UBLK_F_SUPPORT_ZERO_COPY (1ULL << 0)
/*
--
2.39.2
^ permalink raw reply related [flat|nested] 42+ messages in thread
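ublk_init_zero_copy_buffer() above has two paths: a single-bio request borrows the bio's existing bvec table in place, while a multi-bio request gets a flat table allocated and copied, with `allocated_bvec` recording which case applies so that deinit knows whether to kvfree(). The borrow-vs-allocate pattern can be sketched as a userspace analogue using iovecs in place of bvecs (illustrative names, not kernel code):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>

/* Userspace analogue of ublk_init_zero_copy_buffer(): one segment is
 * borrowed directly, several segments are flattened into an allocated
 * table; 'allocated' mirrors data->allocated_bvec so teardown knows
 * whether to free. */
struct demo_kbuf {
	struct iovec *vec;
	int nr;
	int allocated;
	size_t len;
};

static int demo_build_kbuf(struct iovec *segs, int nr_segs,
			   struct demo_kbuf *kb)
{
	size_t total = 0;
	int i;

	if (nr_segs == 1) {
		kb->vec = segs;		/* borrow, nothing to free */
		kb->allocated = 0;
	} else {
		kb->vec = calloc(nr_segs, sizeof(*segs));
		if (!kb->vec)
			return -1;
		memcpy(kb->vec, segs, nr_segs * sizeof(*segs));
		kb->allocated = 1;
	}
	for (i = 0; i < nr_segs; i++)
		total += segs[i].iov_len;
	kb->nr = nr_segs;
	kb->len = total;
	return 0;
}

static void demo_free_kbuf(struct demo_kbuf *kb)
{
	if (kb->allocated)
		free(kb->vec);
	kb->allocated = 0;
}

/* exercise both the allocate path and the borrow path */
static int demo_kbuf_paths(void)
{
	char a[4096], b[512];
	struct iovec segs[2] = {
		{ .iov_base = a, .iov_len = sizeof(a) },
		{ .iov_base = b, .iov_len = sizeof(b) },
	};
	struct demo_kbuf kb;

	if (demo_build_kbuf(segs, 2, &kb) || !kb.allocated || kb.len != 4608)
		return -1;
	demo_free_kbuf(&kb);
	if (demo_build_kbuf(segs, 1, &kb) || kb.allocated || kb.vec != segs)
		return -1;
	return 0;
}
```

The kernel version additionally has to carry `bi_bvec_done` as the starting offset in the borrowed case, which is why `imu->offset` is set from the bio iterator rather than assumed zero.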
* Re: [PATCH V2 00/17] io_uring/ublk: add IORING_OP_FUSED_CMD
2023-03-07 14:15 [PATCH V2 00/17] io_uring/ublk: add IORING_OP_FUSED_CMD Ming Lei
` (16 preceding siblings ...)
2023-03-07 14:15 ` [PATCH V2 17/17] block: ublk_drv: apply io_uring FUSED_CMD for supporting " Ming Lei
@ 2023-03-07 15:37 ` Pavel Begunkov
2023-03-07 17:17 ` Pavel Begunkov
2023-03-08 1:08 ` Ming Lei
2023-03-15 7:08 ` Ziyang Zhang
18 siblings, 2 replies; 42+ messages in thread
From: Pavel Begunkov @ 2023-03-07 15:37 UTC (permalink / raw)
To: Ming Lei, Jens Axboe, io-uring
Cc: linux-block, Miklos Szeredi, ZiyangZhang, Xiaoguang Wang,
Bernd Schubert
On 3/7/23 14:15, Ming Lei wrote:
> Hello,
>
> Add IORING_OP_FUSED_CMD, it is one special URING_CMD, which has to
> be SQE128. The 1st SQE(master) is one 64byte URING_CMD, and the 2nd
> 64byte SQE(slave) is another normal 64byte OP. For any OP which needs
> to support slave OP, io_issue_defs[op].fused_slave needs to be set as 1,
> and its ->issue() can retrieve/import buffer from master request's
> fused_cmd_kbuf. The slave OP is actually submitted from kernel, part of
> this idea is from Xiaoguang's ublk ebpf patchset, but this patchset
> submits slave OP just like normal OP issued from userspace, that said,
> SQE order is kept, and batching handling is done too.
From a quick look through patches it all looks a bit complicated
and intrusive, all over generic hot paths. I think instead we
should be able to use registered buffer table as intermediary and
reuse splicing. Let me try it out
> Please see detailed design in commit log of the 3th patch, and one big
> point is how to handle buffer ownership.
>
> With this way, it is easy to support zero copy for ublk/fuse device.
>
> Basically userspace can specify any sub-buffer of the ublk block request
> buffer from the fused command just by setting 'offset/len'
> in the slave SQE for running slave OP. This way is flexible to implement
> io mapping: mirror, stripped, ...
>
> The 4th & 5th patches enable fused slave support for the following OPs:
>
> OP_READ/OP_WRITE
> OP_SEND/OP_RECV/OP_SEND_ZC
>
> The other ublk patches cleans ublk driver and implement fused command
> for supporting zero copy.
>
> Follows userspace code:
>
> https://github.com/ming1/ubdsrv/tree/fused-cmd-zc-v2
>
> All three(loop, nbd and qcow2) ublk targets have supported zero copy by passing:
>
> ublk add -t [loop|nbd|qcow2] -z ....
>
> Basic fs mount/kernel building and builtin test are done.
>
> Also add liburing test case for covering fused command based on miniublk
> of blktest:
>
> https://github.com/ming1/liburing/commits/fused_cmd_miniublk
>
> Performance improvement is obvious on memory bandwidth
> related workloads, such as, 1~2X improvement on 64K/512K BS
> IO test on loop with ramfs backing file.
>
> Any comments are welcome!
>
> V2:
> - don't resue io_mapped_ubuf (io_uring)
> - remove REQ_F_FUSED_MASTER_BIT (io_uring)
> - fix compile warning (io_uring)
> - rebase on v6.3-rc1 (io_uring)
> - grabbing io request reference when handling fused command
> - simplify ublk_copy_user_pages() by iov iterator
> - add read()/write() for userspace to read/write ublk io buffer, so
> that some corner cases(read zero, passthrough request(report zones)) can
> be handled easily in case of zero copy; this way also helps to switch to
> zero copy completely
> - misc cleanup
>
> Ming Lei (17):
> io_uring: add IO_URING_F_FUSED and prepare for supporting OP_FUSED_CMD
> io_uring: increase io_kiocb->flags into 64bit
> io_uring: add IORING_OP_FUSED_CMD
> io_uring: support OP_READ/OP_WRITE for fused slave request
> io_uring: support OP_SEND_ZC/OP_RECV for fused slave request
> block: ublk_drv: mark device as LIVE before adding disk
> block: ublk_drv: add common exit handling
> block: ublk_drv: don't consider flush request in map/unmap io
> block: ublk_drv: add two helpers to clean up map/unmap request
> block: ublk_drv: clean up several helpers
> block: ublk_drv: cleanup 'struct ublk_map_data'
> block: ublk_drv: cleanup ublk_copy_user_pages
> block: ublk_drv: grab request reference when the request is handled by
> userspace
> block: ublk_drv: support to copy any part of request pages
> block: ublk_drv: add read()/write() support for ublk char device
> block: ublk_drv: don't check buffer in case of zero copy
> block: ublk_drv: apply io_uring FUSED_CMD for supporting zero copy
>
> drivers/block/ublk_drv.c | 605 ++++++++++++++++++++++++++-------
> drivers/char/mem.c | 4 +
> drivers/nvme/host/ioctl.c | 9 +
> include/linux/io_uring.h | 49 ++-
> include/linux/io_uring_types.h | 18 +-
> include/uapi/linux/io_uring.h | 1 +
> include/uapi/linux/ublk_cmd.h | 37 +-
> io_uring/Makefile | 2 +-
> io_uring/fused_cmd.c | 232 +++++++++++++
> io_uring/fused_cmd.h | 11 +
> io_uring/io_uring.c | 22 +-
> io_uring/io_uring.h | 3 +
> io_uring/net.c | 23 +-
> io_uring/opdef.c | 17 +
> io_uring/opdef.h | 2 +
> io_uring/rw.c | 20 ++
> 16 files changed, 926 insertions(+), 129 deletions(-)
> create mode 100644 io_uring/fused_cmd.c
> create mode 100644 io_uring/fused_cmd.h
>
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 42+ messages in thread
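The cover letter above notes that the slave SQE can select any sub-buffer of the master's fused buffer via offset/len, which is what makes mappings like mirroring or striping cheap to express. A hypothetical sketch of the stripe arithmetic such a target would use, assuming a fixed chunk size across nr_files backing files (names and scheme are illustrative, not from the patchset):

```c
#include <assert.h>
#include <stdint.h>

/* Map a byte offset inside the ublk device onto (backing file index,
 * offset within that file), for a stripe unit of 'chunk' bytes laid
 * round-robin over 'nr_files' files. */
struct stripe_loc {
	int file;
	uint64_t off;
};

static struct stripe_loc stripe_map(uint64_t off, uint64_t chunk,
				    int nr_files)
{
	uint64_t unit = off / chunk;
	struct stripe_loc loc = {
		.file = (int)(unit % nr_files),
		.off = (unit / nr_files) * chunk + off % chunk,
	};
	return loc;
}
```

A striped target would then issue one slave SQE per stripe unit touched by the request, with each slave's offset/len picking out the matching sub-range of the master's buffer.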
* Re: [PATCH V2 00/17] io_uring/ublk: add IORING_OP_FUSED_CMD
2023-03-07 15:37 ` [PATCH V2 00/17] io_uring/ublk: add IORING_OP_FUSED_CMD Pavel Begunkov
@ 2023-03-07 17:17 ` Pavel Begunkov
2023-03-08 2:10 ` Ming Lei
2023-03-08 1:08 ` Ming Lei
1 sibling, 1 reply; 42+ messages in thread
From: Pavel Begunkov @ 2023-03-07 17:17 UTC (permalink / raw)
To: Ming Lei, Jens Axboe, io-uring
Cc: linux-block, Miklos Szeredi, ZiyangZhang, Xiaoguang Wang,
Bernd Schubert
On 3/7/23 15:37, Pavel Begunkov wrote:
> On 3/7/23 14:15, Ming Lei wrote:
>> Hello,
>>
>> Add IORING_OP_FUSED_CMD, it is one special URING_CMD, which has to
>> be SQE128. The 1st SQE(master) is one 64byte URING_CMD, and the 2nd
>> 64byte SQE(slave) is another normal 64byte OP. For any OP which needs
>> to support slave OP, io_issue_defs[op].fused_slave needs to be set as 1,
>> and its ->issue() can retrieve/import buffer from master request's
>> fused_cmd_kbuf. The slave OP is actually submitted from kernel, part of
>> this idea is from Xiaoguang's ublk ebpf patchset, but this patchset
>> submits slave OP just like normal OP issued from userspace, that said,
>> SQE order is kept, and batching handling is done too.
>
> From a quick look through patches it all looks a bit complicated
> and intrusive, all over generic hot paths. I think instead we
> should be able to use registered buffer table as intermediary and
> reuse splicing. Let me try it out
Here we go, isolated in a new opcode, and in the end it should work
with any file supporting splice. It's a quick prototype: it's incomplete
and there are many obvious fatal bugs. It also needs some optimisations,
improvements to how it is executed by io_uring, and extra pieces like
memcpy ops and fixed-buffer recv/send. I'll clean it up.
I used a test below, it essentially does zc recv.
https://github.com/isilence/liburing/commit/81fe705739af7d9b77266f9aa901c1ada870739d
From 87ad9e8e3aed683aa040fb4b9ae499f8726ba393 Mon Sep 17 00:00:00 2001
Message-Id: <87ad9e8e3aed683aa040fb4b9ae499f8726ba393.1678208911.git.asml.silence@gmail.com>
From: Pavel Begunkov <[email protected]>
Date: Tue, 7 Mar 2023 17:01:44 +0000
Subject: [POC 1/1] io_uring: splicing into reg buf table
EXTREMELY BUGGY! Not for inclusion.
Add a new operation called IORING_OP_SPLICE_FROM,
which splices from a file into the registered buffer table. This is
done in a zerocopy fashion with a caveat that the user won't have
direct access to the data, however it can use it with any io_uring
request supporting registered buffers.
Signed-off-by: Pavel Begunkov <[email protected]>
---
include/uapi/linux/io_uring.h | 1 +
io_uring/io_uring.c | 4 +-
io_uring/opdef.c | 10 ++++
io_uring/splice.c | 98 +++++++++++++++++++++++++++++++++++
io_uring/splice.h | 3 ++
5 files changed, 114 insertions(+), 2 deletions(-)
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 709de6d4feb2..a91ce1d2ebd7 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -223,6 +223,7 @@ enum io_uring_op {
IORING_OP_URING_CMD,
IORING_OP_SEND_ZC,
IORING_OP_SENDMSG_ZC,
+ IORING_OP_SPLICE_FROM,
/* this goes last, obviously */
IORING_OP_LAST,
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 7625597b5227..b7389a6ea190 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -2781,8 +2781,8 @@ static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
io_wait_rsrc_data(ctx->file_data);
mutex_lock(&ctx->uring_lock);
- if (ctx->buf_data)
- __io_sqe_buffers_unregister(ctx);
+ // if (ctx->buf_data)
+ // __io_sqe_buffers_unregister(ctx);
if (ctx->file_data)
__io_sqe_files_unregister(ctx);
io_cqring_overflow_kill(ctx);
diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index cca7c5b55208..28d4fa42676b 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -428,6 +428,13 @@ const struct io_issue_def io_issue_defs[] = {
.prep = io_eopnotsupp_prep,
#endif
},
+ [IORING_OP_SPLICE_FROM] = {
+ .needs_file = 1,
+ .unbound_nonreg_file = 1,
+ // .pollin = 1,
+ .prep = io_splice_from_prep,
+ .issue = io_splice_from,
+ }
};
@@ -648,6 +655,9 @@ const struct io_cold_def io_cold_defs[] = {
.fail = io_sendrecv_fail,
#endif
},
+ [IORING_OP_SPLICE_FROM] = {
+ .name = "SPLICE_FROM",
+ }
};
const char *io_uring_get_opcode(u8 opcode)
diff --git a/io_uring/splice.c b/io_uring/splice.c
index 2a4bbb719531..0467e9f46e99 100644
--- a/io_uring/splice.c
+++ b/io_uring/splice.c
@@ -8,11 +8,13 @@
#include <linux/namei.h>
#include <linux/io_uring.h>
#include <linux/splice.h>
+#include <linux/nospec.h>
#include <uapi/linux/io_uring.h>
#include "io_uring.h"
#include "splice.h"
+#include "rsrc.h"
struct io_splice {
struct file *file_out;
@@ -119,3 +121,99 @@ int io_splice(struct io_kiocb *req, unsigned int issue_flags)
io_req_set_res(req, ret, 0);
return IOU_OK;
}
+
+struct io_splice_from {
+ struct file *file;
+ loff_t off;
+ u64 len;
+};
+
+
+int io_splice_from_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+{
+ struct io_splice_from *sp = io_kiocb_to_cmd(req, struct io_splice_from);
+
+ if (unlikely(sqe->splice_flags || sqe->splice_fd_in || sqe->ioprio ||
+ sqe->addr || sqe->addr3))
+ return -EINVAL;
+
+ req->buf_index = READ_ONCE(sqe->buf_index);
+
+ sp->len = READ_ONCE(sqe->len);
+ if (unlikely(!sp->len))
+ return -EINVAL;
+
+ sp->off = READ_ONCE(sqe->off);
+ return 0;
+}
+
+int io_splice_from(struct io_kiocb *req, unsigned int issue_flags)
+{
+ struct io_splice_from *sp = io_kiocb_to_cmd(req, struct io_splice_from);
+ loff_t *ppos = (sp->off == -1) ? NULL : &sp->off;
+ struct io_mapped_ubuf *imu;
+ struct pipe_inode_info *pi;
+ struct io_ring_ctx *ctx;
+ unsigned int pipe_tail;
+ int ret, i, nr_pages;
+ u16 index;
+
+ if (!sp->file->f_op->splice_read)
+ return -ENOTSUPP;
+
+ pi = alloc_pipe_info();
+ if (!pi)
+ return -ENOMEM;
+ pi->readers = 1;
+
+ ret = sp->file->f_op->splice_read(sp->file, ppos, pi, sp->len, 0);
+ if (ret < 0)
+ goto done;
+
+ nr_pages = pipe_occupancy(pi->head, pi->tail);
+ imu = kvmalloc(struct_size(imu, bvec, nr_pages), GFP_KERNEL);
+ if (!imu)
+ goto done;
+
+ ret = 0;
+ pipe_tail = pi->tail;
+ for (i = 0; !pipe_empty(pi->head, pipe_tail); i++) {
+ unsigned int mask = pi->ring_size - 1; // kill mask
+ struct pipe_buffer *buf = &pi->bufs[pipe_tail & mask];
+
+ bvec_set_page(&imu->bvec[i], buf->page, buf->len, buf->offset);
+ ret += buf->len;
+ pipe_tail++;
+ }
+ if (WARN_ON_ONCE(i != nr_pages))
+ return -EFAULT;
+
+ ctx = req->ctx;
+ io_ring_submit_lock(ctx, issue_flags);
+ if (unlikely(req->buf_index >= ctx->nr_user_bufs)) {
+ /* TODO: cleanup pages */
+ ret = -EFAULT;
+ kvfree(imu);
+ goto done_unlock;
+ }
+ index = array_index_nospec(req->buf_index, ctx->nr_user_bufs);
+ if (ctx->user_bufs[index] != ctx->dummy_ubuf) {
+ /* TODO: cleanup pages */
+ kvfree(imu);
+ ret = -EFAULT;
+ goto done_unlock;
+ }
+
+ imu->ubuf = 0;
+ imu->ubuf_end = ret;
+ imu->nr_bvecs = nr_pages;
+ ctx->user_bufs[index] = imu;
+done_unlock:
+ io_ring_submit_unlock(ctx, issue_flags);
+done:
+ free_pipe_info(pi);
+ if (ret != sp->len)
+ req_set_fail(req);
+ io_req_set_res(req, ret, 0);
+ return IOU_OK;
+}
diff --git a/io_uring/splice.h b/io_uring/splice.h
index 542f94168ad3..abdf5ad8e8d2 100644
--- a/io_uring/splice.h
+++ b/io_uring/splice.h
@@ -5,3 +5,6 @@ int io_tee(struct io_kiocb *req, unsigned int issue_flags);
int io_splice_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
int io_splice(struct io_kiocb *req, unsigned int issue_flags);
+
+int io_splice_from_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
+int io_splice_from(struct io_kiocb *req, unsigned int issue_flags);
--
2.39.1
^ permalink raw reply related [flat|nested] 42+ messages in thread
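Pavel's POC above splices file data into a pipe and then publishes the resulting pipe pages through the registered buffer table. The page-moving primitive underneath is plain splice(2), which can be illustrated standalone in userspace as a file-to-file relay through an anonymous pipe (a sketch with error handling trimmed; the POC keeps the pages in the pipe instead of splicing them back out):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Move 'len' bytes from in_fd to out_fd through an anonymous pipe
 * with splice(2): pages are handed between the files and the pipe
 * rather than copied through a userspace buffer. */
static ssize_t splice_relay(int in_fd, int out_fd, size_t len)
{
	ssize_t total = 0;
	int p[2];

	if (pipe(p))
		return -1;
	while (len) {
		ssize_t n = splice(in_fd, NULL, p[1], NULL, len,
				   SPLICE_F_MOVE);
		if (n <= 0)
			break;
		if (splice(p[0], NULL, out_fd, NULL, n, SPLICE_F_MOVE) != n)
			break;
		total += n;
		len -= (size_t)n;
	}
	close(p[0]);
	close(p[1]);
	return total;
}

/* round-trip between two temp files */
static int demo_splice_copy(void)
{
	FILE *in = tmpfile(), *out = tmpfile();
	char buf[16] = {0};

	if (!in || !out)
		return -1;
	if (pwrite(fileno(in), "hello splice", 12, 0) != 12)
		return -1;
	if (splice_relay(fileno(in), fileno(out), 12) != 12)
		return -1;
	if (pread(fileno(out), buf, 12, 0) != 12)
		return -1;
	return memcmp(buf, "hello splice", 12);
}
```

Ming's per-call pipe allocation concern below applies here too: `pipe()` on every relay is exactly the cost a long-lived or direct pipe would avoid.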
* Re: [PATCH V2 00/17] io_uring/ublk: add IORING_OP_FUSED_CMD
2023-03-07 17:17 ` Pavel Begunkov
@ 2023-03-08 2:10 ` Ming Lei
2023-03-08 14:46 ` Pavel Begunkov
0 siblings, 1 reply; 42+ messages in thread
From: Ming Lei @ 2023-03-08 2:10 UTC (permalink / raw)
To: Pavel Begunkov
Cc: Jens Axboe, io-uring, linux-block, Miklos Szeredi, ZiyangZhang,
Xiaoguang Wang, Bernd Schubert, ming.lei
On Tue, Mar 07, 2023 at 05:17:04PM +0000, Pavel Begunkov wrote:
> On 3/7/23 15:37, Pavel Begunkov wrote:
> > On 3/7/23 14:15, Ming Lei wrote:
> > > Hello,
> > >
> > > Add IORING_OP_FUSED_CMD, it is one special URING_CMD, which has to
> > > be SQE128. The 1st SQE(master) is one 64byte URING_CMD, and the 2nd
> > > 64byte SQE(slave) is another normal 64byte OP. For any OP which needs
> > > to support slave OP, io_issue_defs[op].fused_slave needs to be set as 1,
> > > and its ->issue() can retrieve/import buffer from master request's
> > > fused_cmd_kbuf. The slave OP is actually submitted from kernel, part of
> > > this idea is from Xiaoguang's ublk ebpf patchset, but this patchset
> > > submits slave OP just like normal OP issued from userspace, that said,
> > > SQE order is kept, and batching handling is done too.
> >
> > From a quick look through patches it all looks a bit complicated
> > and intrusive, all over generic hot paths. I think instead we
> > should be able to use registered buffer table as intermediary and
> > reuse splicing. Let me try it out
>
> Here we go, isolated in a new opcode, and in the end should work
> with any file supporting splice. It's a quick prototype, it's lacking
> and there are many obvious fatal bugs. It also needs some optimisations,
> improvements on how executed by io_uring and extra stuff like
> memcpy ops and fixed buf recv/send. I'll clean it up.
>
> I used a test below, it essentially does zc recv.
>
> https://github.com/isilence/liburing/commit/81fe705739af7d9b77266f9aa901c1ada870739d
>
>
> From 87ad9e8e3aed683aa040fb4b9ae499f8726ba393 Mon Sep 17 00:00:00 2001
> Message-Id: <87ad9e8e3aed683aa040fb4b9ae499f8726ba393.1678208911.git.asml.silence@gmail.com>
> From: Pavel Begunkov <[email protected]>
> Date: Tue, 7 Mar 2023 17:01:44 +0000
> Subject: [POC 1/1] io_uring: splicing into reg buf table
>
> EXTREMELY BUGGY! Not for inclusion.
>
> Add a new operation called IORING_OP_SPLICE_FROM,
> which splices from a file into the registered buffer table. This is
> done in a zerocopy fashion with a caveat that the user won't have
> direct access to the data, however it can use it with any io_uring
> request supporting registered buffers.
>
> Signed-off-by: Pavel Begunkov <[email protected]>
> ---
> include/uapi/linux/io_uring.h | 1 +
> io_uring/io_uring.c | 4 +-
> io_uring/opdef.c | 10 ++++
> io_uring/splice.c | 98 +++++++++++++++++++++++++++++++++++
> io_uring/splice.h | 3 ++
> 5 files changed, 114 insertions(+), 2 deletions(-)
>
> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> index 709de6d4feb2..a91ce1d2ebd7 100644
> --- a/include/uapi/linux/io_uring.h
> +++ b/include/uapi/linux/io_uring.h
> @@ -223,6 +223,7 @@ enum io_uring_op {
> IORING_OP_URING_CMD,
> IORING_OP_SEND_ZC,
> IORING_OP_SENDMSG_ZC,
> + IORING_OP_SPLICE_FROM,
> /* this goes last, obviously */
> IORING_OP_LAST,
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index 7625597b5227..b7389a6ea190 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -2781,8 +2781,8 @@ static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
> io_wait_rsrc_data(ctx->file_data);
> mutex_lock(&ctx->uring_lock);
> - if (ctx->buf_data)
> - __io_sqe_buffers_unregister(ctx);
> + // if (ctx->buf_data)
> + // __io_sqe_buffers_unregister(ctx);
> if (ctx->file_data)
> __io_sqe_files_unregister(ctx);
> io_cqring_overflow_kill(ctx);
> diff --git a/io_uring/opdef.c b/io_uring/opdef.c
> index cca7c5b55208..28d4fa42676b 100644
> --- a/io_uring/opdef.c
> +++ b/io_uring/opdef.c
> @@ -428,6 +428,13 @@ const struct io_issue_def io_issue_defs[] = {
> .prep = io_eopnotsupp_prep,
> #endif
> },
> + [IORING_OP_SPLICE_FROM] = {
> + .needs_file = 1,
> + .unbound_nonreg_file = 1,
> + // .pollin = 1,
> + .prep = io_splice_from_prep,
> + .issue = io_splice_from,
> + }
> };
> @@ -648,6 +655,9 @@ const struct io_cold_def io_cold_defs[] = {
> .fail = io_sendrecv_fail,
> #endif
> },
> + [IORING_OP_SPLICE_FROM] = {
> + .name = "SPLICE_FROM",
> + }
> };
> const char *io_uring_get_opcode(u8 opcode)
> diff --git a/io_uring/splice.c b/io_uring/splice.c
> index 2a4bbb719531..0467e9f46e99 100644
> --- a/io_uring/splice.c
> +++ b/io_uring/splice.c
> @@ -8,11 +8,13 @@
> #include <linux/namei.h>
> #include <linux/io_uring.h>
> #include <linux/splice.h>
> +#include <linux/nospec.h>
> #include <uapi/linux/io_uring.h>
> #include "io_uring.h"
> #include "splice.h"
> +#include "rsrc.h"
> struct io_splice {
> struct file *file_out;
> @@ -119,3 +121,99 @@ int io_splice(struct io_kiocb *req, unsigned int issue_flags)
> io_req_set_res(req, ret, 0);
> return IOU_OK;
> }
> +
> +struct io_splice_from {
> + struct file *file;
> + loff_t off;
> + u64 len;
> +};
> +
> +
> +int io_splice_from_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
> +{
> + struct io_splice_from *sp = io_kiocb_to_cmd(req, struct io_splice_from);
> +
> + if (unlikely(sqe->splice_flags || sqe->splice_fd_in || sqe->ioprio ||
> + sqe->addr || sqe->addr3))
> + return -EINVAL;
> +
> + req->buf_index = READ_ONCE(sqe->buf_index);
> +
> + sp->len = READ_ONCE(sqe->len);
> + if (unlikely(!sp->len))
> + return -EINVAL;
> +
> + sp->off = READ_ONCE(sqe->off);
> + return 0;
> +}
> +
> +int io_splice_from(struct io_kiocb *req, unsigned int issue_flags)
> +{
> + struct io_splice_from *sp = io_kiocb_to_cmd(req, struct io_splice_from);
> + loff_t *ppos = (sp->off == -1) ? NULL : &sp->off;
> + struct io_mapped_ubuf *imu;
> + struct pipe_inode_info *pi;
> + struct io_ring_ctx *ctx;
> + unsigned int pipe_tail;
> + int ret, i, nr_pages;
> + u16 index;
> +
> + if (!sp->file->f_op->splice_read)
> + return -ENOTSUPP;
> +
> + pi = alloc_pipe_info();
The above should be replaced with a direct pipe; otherwise allocating a
pipe inode on every call really hurts performance.
> + if (!pi)
> + return -ENOMEM;
> + pi->readers = 1;
> +
> + ret = sp->file->f_op->splice_read(sp->file, ppos, pi, sp->len, 0);
> + if (ret < 0)
> + goto done;
> +
> + nr_pages = pipe_occupancy(pi->head, pi->tail);
> + imu = kvmalloc(struct_size(imu, bvec, nr_pages), GFP_KERNEL);
> + if (!imu)
> + goto done;
> +
> + ret = 0;
> + pipe_tail = pi->tail;
> + for (i = 0; !pipe_empty(pi->head, pipe_tail); i++) {
> + unsigned int mask = pi->ring_size - 1; // kill mask
> + struct pipe_buffer *buf = &pi->bufs[pipe_tail & mask];
> +
> + bvec_set_page(&imu->bvec[i], buf->page, buf->len, buf->offset);
> + ret += buf->len;
> + pipe_tail++;
> + }
> + if (WARN_ON_ONCE(i != nr_pages))
> + return -EFAULT;
> +
> + ctx = req->ctx;
> + io_ring_submit_lock(ctx, issue_flags);
> + if (unlikely(req->buf_index >= ctx->nr_user_bufs)) {
> + /* TODO: cleanup pages */
> + ret = -EFAULT;
> + kvfree(imu);
> + goto done_unlock;
> + }
> + index = array_index_nospec(req->buf_index, ctx->nr_user_bufs);
> + if (ctx->user_bufs[index] != ctx->dummy_ubuf) {
> + /* TODO: cleanup pages */
> + kvfree(imu);
> + ret = -EFAULT;
> + goto done_unlock;
> + }
> +
> + imu->ubuf = 0;
> + imu->ubuf_end = ret;
> + imu->nr_bvecs = nr_pages;
> + ctx->user_bufs[index] = imu;
Your patch looks like it transfers page ownership to an io_uring fixed
buffer, but unfortunately it can't be done this way: splice is meant
for moving data, not for transferring buffer ownership.
https://lore.kernel.org/linux-block/CAJfpeguQ3xn2-6svkkVXJ88tiVfcDd-eKi1evzzfvu305fMoyw@mail.gmail.com/
1) the pages are actually owned by the device side (ublk, here: sp->file), but we want
to loan them to normal io_uring OPs.
2) after normal io_uring OPs are done with these pages, they have to be
returned to sp->file, and that notification has to be done
explicitly, because the pages are owned by the sp->file side of splice_read().
3) the pages' RW direction has to be limited strictly; in the case of ublk/fuse,
device pages may only be read or written, depending on the direction of
the user IO request.
Also, IMO it isn't good to add a buffer to ctx->user_bufs[] for a single shot
and retrieve it again one shot later; with a fused command it can simply be
set via req->imu.
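The "moving data, not transferring ownership" point can be seen from userspace with the plain splice(2) syscall: what lands in the pipe is consumable data, while the source file keeps its pages. A minimal Linux-only sketch (this illustrates the existing splice semantics only; it is unrelated to the proposed OP):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 0 when the round trip succeeds: file data is spliced into a
 * pipe and read back, while the source file still holds its own copy. */
int splice_roundtrip(void)
{
    char path[] = "/tmp/splice_demo_XXXXXX";
    const char msg[] = "hello splice";
    char out[64] = {0}, again[64] = {0};
    loff_t off = 0;
    int fd, p[2];

    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    if (write(fd, msg, sizeof(msg)) != (ssize_t)sizeof(msg))
        return -1;
    if (pipe(p))
        return -1;

    /* splice() "moves" the file data into the pipe as pipe_buffers... */
    if (splice(fd, &off, p[1], NULL, sizeof(msg), 0) != (ssize_t)sizeof(msg))
        return -1;
    if (read(p[0], out, sizeof(out)) != (ssize_t)sizeof(msg))
        return -1;

    /* ...but ownership never left the file: its contents are intact. */
    if (pread(fd, again, sizeof(msg), 0) != (ssize_t)sizeof(msg))
        return -1;

    unlink(path);
    close(fd); close(p[0]); close(p[1]);
    return memcmp(out, msg, sizeof(msg)) || memcmp(again, msg, sizeof(msg));
}
```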
Thanks,
Ming
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH V2 00/17] io_uring/ublk: add IORING_OP_FUSED_CMD
2023-03-08 2:10 ` Ming Lei
@ 2023-03-08 14:46 ` Pavel Begunkov
2023-03-08 16:17 ` Ming Lei
0 siblings, 1 reply; 42+ messages in thread
From: Pavel Begunkov @ 2023-03-08 14:46 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, io-uring, linux-block, Miklos Szeredi, ZiyangZhang,
Xiaoguang Wang, Bernd Schubert
On 3/8/23 02:10, Ming Lei wrote:
> On Tue, Mar 07, 2023 at 05:17:04PM +0000, Pavel Begunkov wrote:
>> On 3/7/23 15:37, Pavel Begunkov wrote:
>>> On 3/7/23 14:15, Ming Lei wrote:
>>>> Hello,
>>>>
>>>> Add IORING_OP_FUSED_CMD, it is one special URING_CMD, which has to
>>>> be SQE128. The 1st SQE(master) is one 64byte URING_CMD, and the 2nd
>>>> 64byte SQE(slave) is another normal 64byte OP. For any OP which needs
>>>> to support slave OP, io_issue_defs[op].fused_slave needs to be set as 1,
>>>> and its ->issue() can retrieve/import buffer from master request's
>>>> fused_cmd_kbuf. The slave OP is actually submitted from kernel, part of
>>>> this idea is from Xiaoguang's ublk ebpf patchset, but this patchset
>>>> submits slave OP just like normal OP issued from userspace, that said,
>>>> SQE order is kept, and batching handling is done too.
>>>
>>> From a quick look through patches it all looks a bit complicated
>>> and intrusive, all over generic hot paths. I think instead we
>>> should be able to use registered buffer table as intermediary and
>>> reuse splicing. Let me try it out
>>
>> Here we go, isolated in a new opcode, and in the end should work
>> with any file supporting splice. It's a quick prototype, it's lacking
>> and there are many obvious fatal bugs. It also needs some optimisations,
>> improvements on how executed by io_uring and extra stuff like
>> memcpy ops and fixed buf recv/send. I'll clean it up.
>>
>> I used a test below, it essentially does zc recv.
>>
>> https://github.com/isilence/liburing/commit/81fe705739af7d9b77266f9aa901c1ada870739d
>>
[...]
>> +int io_splice_from(struct io_kiocb *req, unsigned int issue_flags)
>> +{
>> + struct io_splice_from *sp = io_kiocb_to_cmd(req, struct io_splice_from);
>> + loff_t *ppos = (sp->off == -1) ? NULL : &sp->off;
>> + struct io_mapped_ubuf *imu;
>> + struct pipe_inode_info *pi;
>> + struct io_ring_ctx *ctx;
>> + unsigned int pipe_tail;
>> + int ret, i, nr_pages;
>> + u16 index;
>> +
>> + if (!sp->file->f_op->splice_read)
>> + return -ENOTSUPP;
>> +
>> + pi = alloc_pipe_info();
>
> The above should be replaced with direct pipe, otherwise every time
> allocating one pipe inode really hurts performance.
We don't even need to allocate it dynamically; it could just be
on the stack. There is a long list of TODOs I can add, e.g.
polling support, retries, nowait, caching the imu and so on.
[...]
> Your patch looks like transferring pages ownership to io_uring fixed
> buffer, but unfortunately it can't be done in this way. splice is
> supposed for moving data, not transfer buffer ownership.
Borrowing rather than transferring. It's not obvious since it's
not implemented in the patch, but the buffer should eventually be
returned via splice's ->release callback.
> https://lore.kernel.org/linux-block/CAJfpeguQ3xn2-6svkkVXJ88tiVfcDd-eKi1evzzfvu305fMoyw@mail.gmail.com/
>
> 1) pages are actually owned by device side(ublk, here: sp->file), but we want to
> loan them to io_uring normal OPs.
>
> 2) after these pages are used by io_uring normal OPs, these pages have
> been returned back to sp->file, and the notification has to be done
> explicitly, because page is owned by sp->file of splice_read().
Right, see above, they're going to be returned back via ->release.
> 3) pages RW direction has to limited strictly, and in case of ublk/fuse,
> device pages can only be read or write which depends on user io request
> direction.
Yes, I know, and directions will be needed anyway for DMA mappings and
different p2p cases in the future, but again a bunch of things are
omitted here.
>
> Also IMO it isn't good to add one buffer to ctx->user_bufs[] oneshot and
> retrieve it oneshot, and it can be set via req->imu simply in one fused
> command.
That's one of the points though. It's nice, if not necessary (for a generic
feature), to be able to do multiple ops on the data. For instance, if we
have a memcpy request, we can link it to this splice / zc recv, memcpy
the necessary headers to userspace and let it decide how to proceed with
the data.
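To make the multi-consumer idea concrete: one hypothetical shape (the names and layout here are invented for illustration, nothing in the patch defines them) is a buffer in which a length-prefixed header is split off for the application to inspect, while the payload stays in place for a follow-up send:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout: buf[0] is the header length, the header bytes
 * follow, and the remaining bytes are payload left for another op. */
struct parsed {
    uint8_t hdr_len;       /* bytes the application consumed as header */
    const uint8_t *body;   /* payload still sitting in the buffer */
    size_t body_len;
};

/* Split a received buffer without copying the payload anywhere. */
int split_header(const uint8_t *buf, size_t len, struct parsed *out)
{
    if (len < 1 || (size_t)buf[0] + 1 > len)
        return -1;
    out->hdr_len = buf[0];
    out->body = buf + 1 + buf[0];
    out->body_len = len - 1 - buf[0];
    return 0;
}
```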
--
Pavel Begunkov
* Re: [PATCH V2 00/17] io_uring/ublk: add IORING_OP_FUSED_CMD
2023-03-08 14:46 ` Pavel Begunkov
@ 2023-03-08 16:17 ` Ming Lei
2023-03-08 16:54 ` Pavel Begunkov
0 siblings, 1 reply; 42+ messages in thread
From: Ming Lei @ 2023-03-08 16:17 UTC (permalink / raw)
To: Pavel Begunkov
Cc: Jens Axboe, io-uring, linux-block, Miklos Szeredi, ZiyangZhang,
Xiaoguang Wang, Bernd Schubert, ming.lei
On Wed, Mar 08, 2023 at 02:46:48PM +0000, Pavel Begunkov wrote:
> On 3/8/23 02:10, Ming Lei wrote:
> > On Tue, Mar 07, 2023 at 05:17:04PM +0000, Pavel Begunkov wrote:
> > > On 3/7/23 15:37, Pavel Begunkov wrote:
> > > > On 3/7/23 14:15, Ming Lei wrote:
> > > > > Hello,
> > > > >
> > > > > Add IORING_OP_FUSED_CMD, it is one special URING_CMD, which has to
> > > > > be SQE128. The 1st SQE(master) is one 64byte URING_CMD, and the 2nd
> > > > > 64byte SQE(slave) is another normal 64byte OP. For any OP which needs
> > > > > to support slave OP, io_issue_defs[op].fused_slave needs to be set as 1,
> > > > > and its ->issue() can retrieve/import buffer from master request's
> > > > > fused_cmd_kbuf. The slave OP is actually submitted from kernel, part of
> > > > > this idea is from Xiaoguang's ublk ebpf patchset, but this patchset
> > > > > submits slave OP just like normal OP issued from userspace, that said,
> > > > > SQE order is kept, and batching handling is done too.
> > > >
> > > > From a quick look through patches it all looks a bit complicated
> > > > and intrusive, all over generic hot paths. I think instead we
> > > > should be able to use registered buffer table as intermediary and
> > > > reuse splicing. Let me try it out
> > >
> > > Here we go, isolated in a new opcode, and in the end should work
> > > with any file supporting splice. It's a quick prototype, it's lacking
> > > and there are many obvious fatal bugs. It also needs some optimisations,
> > > improvements on how executed by io_uring and extra stuff like
> > > memcpy ops and fixed buf recv/send. I'll clean it up.
> > >
> > > I used a test below, it essentially does zc recv.
> > >
> > > https://github.com/isilence/liburing/commit/81fe705739af7d9b77266f9aa901c1ada870739d
> > >
> [...]
> > > +int io_splice_from(struct io_kiocb *req, unsigned int issue_flags)
> > > +{
> > > + struct io_splice_from *sp = io_kiocb_to_cmd(req, struct io_splice_from);
> > > + loff_t *ppos = (sp->off == -1) ? NULL : &sp->off;
> > > + struct io_mapped_ubuf *imu;
> > > + struct pipe_inode_info *pi;
> > > + struct io_ring_ctx *ctx;
> > > + unsigned int pipe_tail;
> > > + int ret, i, nr_pages;
> > > + u16 index;
> > > +
> > > + if (!sp->file->f_op->splice_read)
> > > + return -ENOTSUPP;
> > > +
> > > + pi = alloc_pipe_info();
> >
> > The above should be replaced with direct pipe, otherwise every time
> > allocating one pipe inode really hurts performance.
>
> We don't even need to alloc it dynanically, could be just
> on stack. There is a long list of TODOs I can add, e.g.
> polling support, retries, nowait, caching imu and so on.
>
> [...]
> > Your patch looks like transferring pages ownership to io_uring fixed
> > buffer, but unfortunately it can't be done in this way. splice is
> > supposed for moving data, not transfer buffer ownership.
>
> Borrowing rather than transferring. It's not obvious since it's
> not implemented in the patch, but the buffer should be eventually
> returned using the splice's ->release callback.
What is splice's ->release() callback? Is it the pipe buffer's
->release()? If so, there are at least the following two problems:
1) it requires the buffer to be kept in the pipe (so that its callback can be
called, using its private data to return the whole buffer) until it is consumed.
That turns this into a synchronous interface like the splice syscall; it can't
cross multiple io_uring OPs unless a per-buffer pipe inode is used.
2) the pipe buffer's get()/release() works at per-buffer/page level, but
we need to borrow the whole buffer, and the whole buffer may be used
by an arbitrary number of OPs; for example, one IO buffer has to be reused when
handling mirrored or striped targets. So when do we know the buffer can be released?
Basically the kernel can't know; only the application knows
when to release it.
Anyway, please post the whole patch; otherwise it is hard to see
the whole picture, and the devil is always in the details, especially
since Linus mentioned splice can't be used in this way.
>
> > https://lore.kernel.org/linux-block/CAJfpeguQ3xn2-6svkkVXJ88tiVfcDd-eKi1evzzfvu305fMoyw@mail.gmail.com/
> >
> > 1) pages are actually owned by device side(ublk, here: sp->file), but we want to
> > loan them to io_uring normal OPs.
> >
> > 2) after these pages are used by io_uring normal OPs, these pages have
> > been returned back to sp->file, and the notification has to be done
> > explicitly, because page is owned by sp->file of splice_read().
>
> Right, see above, they're going to be returned back via ->release.
How?
>
> > 3) pages RW direction has to limited strictly, and in case of ublk/fuse,
> > device pages can only be read or write which depends on user io request
> > direction.
>
> Yes, I know, and directions will be needed anyway for DMA mappings and
> different p2p cases in the future, but again a bunch of things is
> omitted here.
Please don't omit it; it is a fundamental security problem.
>
> >
> > Also IMO it isn't good to add one buffer to ctx->user_bufs[] oneshot and
> > retrieve it oneshot, and it can be set via req->imu simply in one fused
> > command.
>
> That's one of the points though. It's nice if not necessary (for a generic
> feature) to be able to do multiple ops on the data. For instance, if we
> have a memcpy request, we can link it to this splice / zc recv, memcpy
> necessary headers to the userspace and let it decide how to proceed with
> data.
I feel it could be a big problem for buffer borrowing to cross more than one
OP: when can the buffer be returned?
Memory copy can be done simply via the device's read/write interface; please see
patch 15.
Thanks,
Ming
* Re: [PATCH V2 00/17] io_uring/ublk: add IORING_OP_FUSED_CMD
2023-03-08 16:17 ` Ming Lei
@ 2023-03-08 16:54 ` Pavel Begunkov
2023-03-09 1:44 ` Ming Lei
0 siblings, 1 reply; 42+ messages in thread
From: Pavel Begunkov @ 2023-03-08 16:54 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, io-uring, linux-block, Miklos Szeredi, ZiyangZhang,
Xiaoguang Wang, Bernd Schubert
On 3/8/23 16:17, Ming Lei wrote:
> On Wed, Mar 08, 2023 at 02:46:48PM +0000, Pavel Begunkov wrote:
>> On 3/8/23 02:10, Ming Lei wrote:
>>> On Tue, Mar 07, 2023 at 05:17:04PM +0000, Pavel Begunkov wrote:
>>>> On 3/7/23 15:37, Pavel Begunkov wrote:
>>>>> On 3/7/23 14:15, Ming Lei wrote:
>>>>>> Hello,
>>>>>>
>>>>>> Add IORING_OP_FUSED_CMD, it is one special URING_CMD, which has to
>>>>>> be SQE128. The 1st SQE(master) is one 64byte URING_CMD, and the 2nd
>>>>>> 64byte SQE(slave) is another normal 64byte OP. For any OP which needs
>>>>>> to support slave OP, io_issue_defs[op].fused_slave needs to be set as 1,
>>>>>> and its ->issue() can retrieve/import buffer from master request's
>>>>>> fused_cmd_kbuf. The slave OP is actually submitted from kernel, part of
>>>>>> this idea is from Xiaoguang's ublk ebpf patchset, but this patchset
>>>>>> submits slave OP just like normal OP issued from userspace, that said,
>>>>>> SQE order is kept, and batching handling is done too.
>>>>>
>>>>> From a quick look through patches it all looks a bit complicated
>>>>> and intrusive, all over generic hot paths. I think instead we
>>>>> should be able to use registered buffer table as intermediary and
>>>>> reuse splicing. Let me try it out
>>>>
>>>> Here we go, isolated in a new opcode, and in the end should work
>>>> with any file supporting splice. It's a quick prototype, it's lacking
>>>> and there are many obvious fatal bugs. It also needs some optimisations,
>>>> improvements on how executed by io_uring and extra stuff like
>>>> memcpy ops and fixed buf recv/send. I'll clean it up.
>>>>
>>>> I used a test below, it essentially does zc recv.
>>>>
>>>> https://github.com/isilence/liburing/commit/81fe705739af7d9b77266f9aa901c1ada870739d
>>>>
>> [...]
>>>> +int io_splice_from(struct io_kiocb *req, unsigned int issue_flags)
>>>> +{
>>>> + struct io_splice_from *sp = io_kiocb_to_cmd(req, struct io_splice_from);
>>>> + loff_t *ppos = (sp->off == -1) ? NULL : &sp->off;
>>>> + struct io_mapped_ubuf *imu;
>>>> + struct pipe_inode_info *pi;
>>>> + struct io_ring_ctx *ctx;
>>>> + unsigned int pipe_tail;
>>>> + int ret, i, nr_pages;
>>>> + u16 index;
>>>> +
>>>> + if (!sp->file->f_op->splice_read)
>>>> + return -ENOTSUPP;
>>>> +
>>>> + pi = alloc_pipe_info();
>>>
>>> The above should be replaced with direct pipe, otherwise every time
>>> allocating one pipe inode really hurts performance.
>>
>> We don't even need to alloc it dynanically, could be just
>> on stack. There is a long list of TODOs I can add, e.g.
>> polling support, retries, nowait, caching imu and so on.
>>
>> [...]
>>> Your patch looks like transferring pages ownership to io_uring fixed
>>> buffer, but unfortunately it can't be done in this way. splice is
>>> supposed for moving data, not transfer buffer ownership.
>>
>> Borrowing rather than transferring. It's not obvious since it's
>> not implemented in the patch, but the buffer should be eventually
>> returned using the splice's ->release callback.
>
> What is the splice's ->release() callback? Is pipe buffer's
> release()? If yes, there is at least the following two problems:
Right
> 1) it requires the buffer to be saved(for calling its callback and use its private
> data to return back the whole buffer) in the pipe until it is consumed, which becomes
> one sync interface like splice syscall, and can't cross multiple io_uring OPs or
> per-buffer pipe inode is needed
We don't mix data from different sources, so it's reasonable to expect
that all buffers will have the same callback, which will then be saved
in struct io_mapped_ubuf. That's something that should definitely be
checked, and rejected if it happens.
> 2) pipe buffer's get()/release() works on per-buffer/page level, but
> we need to borrow the whole buffer, and the whole buffer could be used
Surely that can be improved.
> by arbitrary number of OPs, such as one IO buffer needs to be used for
> handling mirror or stripped targets, so when we know the buffer can be released?
There is a separate, efficient lifetime semantic for io_uring's registered
buffers, which doesn't involve any get/put. The buffer will be freed according
to it, i.e. when userspace asks for it to be removed and there are no more
inflight requests.
> And basically it can't be known by kernel, and only application knows
> when to release it.
>
> Anyway, please post the whole patch, otherwise it is hard to see
> the whole picture, and devil is always in details, especially Linus
> mentioned splice can't be used in this way.
Sure
>
>>
>>> https://lore.kernel.org/linux-block/CAJfpeguQ3xn2-6svkkVXJ88tiVfcDd-eKi1evzzfvu305fMoyw@mail.gmail.com/
>>>
>>> 1) pages are actually owned by device side(ublk, here: sp->file), but we want to
>>> loan them to io_uring normal OPs.
>>>
>>> 2) after these pages are used by io_uring normal OPs, these pages have
>>> been returned back to sp->file, and the notification has to be done
>>> explicitly, because page is owned by sp->file of splice_read().
>>
>> Right, see above, they're going to be returned back via ->release.
>
> How?
I admit I shouldn't have skipped it, even for a quick POC. It'll save
->release() in struct io_mapped_ubuf and call it when the buffer is
freed from io_uring's perspective, that is, when there are no more
requests using it and the user has requested its removal.
>>> 3) pages RW direction has to limited strictly, and in case of ublk/fuse,
>>> device pages can only be read or write which depends on user io request
>>> direction.
>>
>> Yes, I know, and directions will be needed anyway for DMA mappings and
>> different p2p cases in the future, but again a bunch of things is
>> omitted here.
>
> Please don't omitted it and it is one fundamental security problem.
That's not interesting for a design concept posted with a huge warning.
io_import_fixed() already takes a dir argument; we just need to check
it against the buffer's.
>>> Also IMO it isn't good to add one buffer to ctx->user_bufs[] oneshot and
>>> retrieve it oneshot, and it can be set via req->imu simply in one fused
>>> command.
>>
>> That's one of the points though. It's nice if not necessary (for a generic
>> feature) to be able to do multiple ops on the data. For instance, if we
>> have a memcpy request, we can link it to this splice / zc recv, memcpy
>> necessary headers to the userspace and let it decide how to proceed with
>> data.
>
> I feel it could be one big problem for buffer borrowing to cross more than one
> OPs, and when can the buffer be returned back?
Described above
> memory copy can be done simply by device's read/write interface, please see
> patch 15.
I don't think I understand how it looks in userspace; maybe it's
only applicable to ublk? But it seems that the concept of having one op
produce a buffer and another consume it doesn't go well with multi-use
in the general case, especially when stretched in time.
E.g. you recv data, some of which is an application protocol header
that should be looked at by the user, and the rest is data that might
be sent out somewhere else.
Am I wrong?
--
Pavel Begunkov
* Re: [PATCH V2 00/17] io_uring/ublk: add IORING_OP_FUSED_CMD
2023-03-08 16:54 ` Pavel Begunkov
@ 2023-03-09 1:44 ` Ming Lei
0 siblings, 0 replies; 42+ messages in thread
From: Ming Lei @ 2023-03-09 1:44 UTC (permalink / raw)
To: Pavel Begunkov
Cc: Jens Axboe, io-uring, linux-block, Miklos Szeredi, ZiyangZhang,
Xiaoguang Wang, Bernd Schubert, ming.lei
On Wed, Mar 08, 2023 at 04:54:45PM +0000, Pavel Begunkov wrote:
> On 3/8/23 16:17, Ming Lei wrote:
> > On Wed, Mar 08, 2023 at 02:46:48PM +0000, Pavel Begunkov wrote:
> > > On 3/8/23 02:10, Ming Lei wrote:
> > > > On Tue, Mar 07, 2023 at 05:17:04PM +0000, Pavel Begunkov wrote:
> > > > > On 3/7/23 15:37, Pavel Begunkov wrote:
> > > > > > On 3/7/23 14:15, Ming Lei wrote:
> > > > > > > Hello,
> > > > > > >
> > > > > > > Add IORING_OP_FUSED_CMD, it is one special URING_CMD, which has to
> > > > > > > be SQE128. The 1st SQE(master) is one 64byte URING_CMD, and the 2nd
> > > > > > > 64byte SQE(slave) is another normal 64byte OP. For any OP which needs
> > > > > > > to support slave OP, io_issue_defs[op].fused_slave needs to be set as 1,
> > > > > > > and its ->issue() can retrieve/import buffer from master request's
> > > > > > > fused_cmd_kbuf. The slave OP is actually submitted from kernel, part of
> > > > > > > this idea is from Xiaoguang's ublk ebpf patchset, but this patchset
> > > > > > > submits slave OP just like normal OP issued from userspace, that said,
> > > > > > > SQE order is kept, and batching handling is done too.
> > > > > >
> > > > > > From a quick look through patches it all looks a bit complicated
> > > > > > and intrusive, all over generic hot paths. I think instead we
> > > > > > should be able to use registered buffer table as intermediary and
> > > > > > reuse splicing. Let me try it out
> > > > >
> > > > > Here we go, isolated in a new opcode, and in the end should work
> > > > > with any file supporting splice. It's a quick prototype, it's lacking
> > > > > and there are many obvious fatal bugs. It also needs some optimisations,
> > > > > improvements on how executed by io_uring and extra stuff like
> > > > > memcpy ops and fixed buf recv/send. I'll clean it up.
> > > > >
> > > > > I used a test below, it essentially does zc recv.
> > > > >
> > > > > https://github.com/isilence/liburing/commit/81fe705739af7d9b77266f9aa901c1ada870739d
> > > > >
> > > [...]
> > > > > +int io_splice_from(struct io_kiocb *req, unsigned int issue_flags)
> > > > > +{
> > > > > + struct io_splice_from *sp = io_kiocb_to_cmd(req, struct io_splice_from);
> > > > > + loff_t *ppos = (sp->off == -1) ? NULL : &sp->off;
> > > > > + struct io_mapped_ubuf *imu;
> > > > > + struct pipe_inode_info *pi;
> > > > > + struct io_ring_ctx *ctx;
> > > > > + unsigned int pipe_tail;
> > > > > + int ret, i, nr_pages;
> > > > > + u16 index;
> > > > > +
> > > > > + if (!sp->file->f_op->splice_read)
> > > > > + return -ENOTSUPP;
> > > > > +
> > > > > + pi = alloc_pipe_info();
> > > >
> > > > The above should be replaced with direct pipe, otherwise every time
> > > > allocating one pipe inode really hurts performance.
> > >
> > > We don't even need to alloc it dynanically, could be just
> > > on stack. There is a long list of TODOs I can add, e.g.
> > > polling support, retries, nowait, caching imu and so on.
> > >
> > > [...]
> > > > Your patch looks like transferring pages ownership to io_uring fixed
> > > > buffer, but unfortunately it can't be done in this way. splice is
> > > > supposed for moving data, not transfer buffer ownership.
> > >
> > > Borrowing rather than transferring. It's not obvious since it's
> > > not implemented in the patch, but the buffer should be eventually
> > > returned using the splice's ->release callback.
> >
> > What is the splice's ->release() callback? Is pipe buffer's
> > release()? If yes, there is at least the following two problems:
>
> Right
>
> > 1) it requires the buffer to be saved(for calling its callback and use its private
> > data to return back the whole buffer) in the pipe until it is consumed, which becomes
> > one sync interface like splice syscall, and can't cross multiple io_uring OPs or
> > per-buffer pipe inode is needed
>
> We don't mix data from different sources, it's reasonable to expect
> that all buffers will have the same callback, then it'll be saved
> in struct io_mapped_ubuf. That's sth should definitely be checked and
> rejected if happens.
It seems reasonable to expect ->release() and ->confirm() to be the same, but
I am not sure pipe_buffer->flags and pipe_buffer->private are, and I
think pipe_buffer->private has to be saved too, since ->release() could use
it.
->splice_read() is a generic interface; if we support it in the new
io_uring OP, all existing ->splice_read() implementations have to be supported.
Trust me, I did think about this approach, and the main difference is adding
the buffer into the io_uring ctx fixed buffer table.
>
> > 2) pipe buffer's get()/release() works on per-buffer/page level, but
> > we need to borrow the whole buffer, and the whole buffer could be used
>
> Surely that can be improved.
>
> > by arbitrary number of OPs, such as one IO buffer needs to be used for
> > handling mirror or stripped targets, so when we know the buffer can be released?
>
> There is a separate efficient lifetime semantic for io_uring's registered
> buffers, which don't involve any get/put. It'll be freed according to it,
> i.e. when the userspace asks it to be removed and there are no more
> inflight requests.
Then one new OP needs to be added for removing the buffer explicitly, so
at least three OPs (SQEs) are involved in handling a single IO (one for adding
the buffer, one or more for consuming it, one for removing it), which can't be
efficient.
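For comparison, the lifetime rule described earlier in the thread (the buffer is freed only once userspace has requested removal and no requests are in flight) can be modeled in a few lines. This is a toy userspace model, not kernel code; every name in it is invented:

```c
#include <stddef.h>

/* Toy model of a registered-buffer table and the three-step lifecycle:
 * register (one SQE), use (one or more SQEs), unregister (one SQE). */
#define TABLE_SIZE 8

struct reg_buf {
    void *data;
    int inflight;      /* ops currently using the buffer */
    int remove_asked;  /* userspace requested removal */
    int released;      /* the ->release()-style callback has run */
};

static struct reg_buf table[TABLE_SIZE];

static void maybe_release(struct reg_buf *b)
{
    /* freed only when removal was requested AND no inflight users */
    if (b->remove_asked && b->inflight == 0 && !b->released) {
        b->released = 1;
        b->data = NULL;
    }
}

int buf_register(unsigned idx, void *data)       /* step 1: extra SQE */
{
    if (idx >= TABLE_SIZE || table[idx].data)
        return -1;
    table[idx] = (struct reg_buf){ .data = data };
    return 0;
}

int buf_use_begin(unsigned idx)                  /* step 2: consumer SQE(s) */
{
    if (idx >= TABLE_SIZE || !table[idx].data)
        return -1;
    table[idx].inflight++;
    return 0;
}

void buf_use_end(unsigned idx)
{
    table[idx].inflight--;
    maybe_release(&table[idx]);
}

int buf_unregister(unsigned idx)                 /* step 3: extra SQE */
{
    if (idx >= TABLE_SIZE)
        return -1;
    table[idx].remove_asked = 1;
    maybe_release(&table[idx]);
    return 0;
}
```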
>
> > And basically it can't be known by kernel, and only application knows
> > when to release it.
> >
> > Anyway, please post the whole patch, otherwise it is hard to see
> > the whole picture, and devil is always in details, especially Linus
> > mentioned splice can't be used in this way.
>
> Sure
>
> >
> > >
> > > > https://lore.kernel.org/linux-block/CAJfpeguQ3xn2-6svkkVXJ88tiVfcDd-eKi1evzzfvu305fMoyw@mail.gmail.com/
> > > >
> > > > 1) pages are actually owned by device side(ublk, here: sp->file), but we want to
> > > > loan them to io_uring normal OPs.
> > > >
> > > > 2) after these pages are used by io_uring normal OPs, these pages have
> > > > been returned back to sp->file, and the notification has to be done
> > > > explicitly, because page is owned by sp->file of splice_read().
> > >
> > > Right, see above, they're going to be returned back via ->release.
> >
> > How?
>
> I admit, I shouldn't have skipped it even for a quick POC. It'll save
> ->release() in struct io_mapped_ubuf and call it when the buffer is
> freed from io_uring perspective, that is there are no more requests
> using it and the user requested it to be removed.
It isn't enough to save only ->release() (->confirm, ->flags and ->private
matter too), and all existing ->splice_read() implementations have to be
considered.
Basically the io_mapped_ubuf becomes a per-buffer (thin) pipe, and the extra
allocation could be big.
>
> > > > 3) pages RW direction has to limited strictly, and in case of ublk/fuse,
> > > > device pages can only be read or write which depends on user io request
> > > > direction.
> > >
> > > Yes, I know, and directions will be needed anyway for DMA mappings and
> > > different p2p cases in the future, but again a bunch of things is
> > > omitted here.
> >
> > Please don't omitted it and it is one fundamental security problem.
>
> That's not interesting for a design concept with a huge warning.
> io_import_fixed() already takes a dir argument, we just need to check
> it against the buffer's one.
The dir in io_import_fixed() can't be validated; only the buffer
owner knows whether the buffer may be read or written. The main issue is that
for a ublk/fuse read IO request, the buffer may only be filled with data,
and must not be read out to somewhere else, otherwise kernel data could be
leaked to userspace.
And the check would have to be done in ->splice_read(), or new pipe_buffer->flags
would have to be added, which is exactly the topic we discussed before, and it
was NAKed by Linus.
https://lore.kernel.org/linux-block/CAJfpeguQ3xn2-6svkkVXJ88tiVfcDd-eKi1evzzfvu305fMoyw@mail.gmail.com/
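In the abstract, the validation being asked for amounts to comparing the access direction the buffer owner granted against the access the importing op wants. A minimal sketch with invented names (the real check would live wherever the buffer is imported):

```c
#include <errno.h>

/* Hypothetical direction check: the buffer owner records which access
 * it grants; an importer states what it wants to do with the pages. */
enum kbuf_dir {
    KBUF_DIR_WRITE_ONLY,  /* device read request: pages may only be filled */
    KBUF_DIR_READ_ONLY,   /* device write request: pages may only be read */
};

int kbuf_import_check(enum kbuf_dir granted, int op_reads_buffer)
{
    /* reading from a write-only buffer would leak kernel data */
    if (op_reads_buffer && granted == KBUF_DIR_WRITE_ONLY)
        return -EACCES;
    /* writing into a read-only buffer would corrupt device data */
    if (!op_reads_buffer && granted == KBUF_DIR_READ_ONLY)
        return -EACCES;
    return 0;
}
```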
>
>
> > > > Also IMO it isn't good to add one buffer to ctx->user_bufs[] oneshot and
> > > > retrieve it oneshot, and it can be set via req->imu simply in one fused
> > > > command.
> > >
> > > That's one of the points though. It's nice if not necessary (for a generic
> > > feature) to be able to do multiple ops on the data. For instance, if we
> > > have a memcpy request, we can link it to this splice / zc recv, memcpy
> > > necessary headers to the userspace and let it decide how to proceed with
> > > data.
> >
> > I feel it could be one big problem for buffer borrowing to cross more than one
> > OPs, and when can the buffer be returned back?
>
> Described above
>
> > memory copy can be done simply by device's read/write interface, please see
> > patch 15.
>
> I don't think I understand how it looks in the userspace, maybe it's
> only applicable to ublk? but it seems that the concept of having one op
> producing a buffer and another consuming it don't go well with multi
> use in general case, especially stretched in time.
So far the interface is only for ublk/fuse, and the use case for ublk
is generic. For example: we have a logical volume manager ublk driver, and
we need to mirror the IO request to multiple underlying devices, so the
buffer has to be consumed once for each device. The same holds for striped or
distributed network storage (the same request needs to be sent to more than
one remote machine).
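The mirroring case can be sketched in userspace terms: a single borrowed IO buffer is handed to each underlying target in turn, and only after the last consumer finishes could it be returned (the targets here are plain memory stand-ins, not real devices; all names are invented):

```c
#include <stddef.h>
#include <string.h>

/* Mirroring sketch: one borrowed buffer, several consumers. */
#define NR_MIRRORS 3

struct mirror_target {
    char data[64];
};

/* Writes the same buffer to every target; returns how many consumers
 * the buffer served, after which it could safely be returned. */
int mirror_write(const char *buf, size_t len,
                 struct mirror_target *tgts, int nr)
{
    for (int i = 0; i < nr; i++)
        memcpy(tgts[i].data, buf, len);   /* one consume per device */
    return nr;
}
```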
>
> E.g. you recv data, some of which is an application protocol header
> that should be looked at by the user and the rest is data that might
> be sent out somewhere else.
Yeah, see the example [1] in ublk/nbd: the application protocol is
received/sent by its own SQE, and IO data is handled in a standalone SQE.
Here we only deal with ublk block IO data, which isn't modified most of
the time.
[1] https://github.com/ming1/ubdsrv/blob/master/nbd/tgt_nbd.cpp
Thanks,
Ming
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH V2 00/17] io_uring/ublk: add IORING_OP_FUSED_CMD
2023-03-07 15:37 ` [PATCH V2 00/17] io_uring/ublk: add IORING_OP_FUSED_CMD Pavel Begunkov
2023-03-07 17:17 ` Pavel Begunkov
@ 2023-03-08 1:08 ` Ming Lei
2023-03-08 16:22 ` Pavel Begunkov
1 sibling, 1 reply; 42+ messages in thread
From: Ming Lei @ 2023-03-08 1:08 UTC (permalink / raw)
To: Pavel Begunkov
Cc: Jens Axboe, io-uring, linux-block, Miklos Szeredi, ZiyangZhang,
Xiaoguang Wang, Bernd Schubert, ming.lei
On Tue, Mar 07, 2023 at 03:37:21PM +0000, Pavel Begunkov wrote:
> On 3/7/23 14:15, Ming Lei wrote:
> > Hello,
> >
> > Add IORING_OP_FUSED_CMD, it is one special URING_CMD, which has to
> > be SQE128. The 1st SQE(master) is one 64byte URING_CMD, and the 2nd
> > 64byte SQE(slave) is another normal 64byte OP. For any OP which needs
> > to support slave OP, io_issue_defs[op].fused_slave needs to be set as 1,
> > and its ->issue() can retrieve/import buffer from master request's
> > fused_cmd_kbuf. The slave OP is actually submitted from kernel, part of
> > this idea is from Xiaoguang's ublk ebpf patchset, but this patchset
> > submits slave OP just like normal OP issued from userspace, that said,
> > SQE order is kept, and batching handling is done too.
>
> From a quick look through patches it all looks a bit complicated
> and intrusive, all over generic hot paths. I think instead we
Really? The main change to the generic hot paths is adding one 'true/false'
parameter to io_init_req(). Elsewhere, the change is just a check on
req->flags or issue_flags, which is basically zero cost.
> should be able to use registered buffer table as intermediary and
> reuse splicing. Let me try it out
I will take a look at your patch, but last time Linus pointed out that
splice isn't a good approach, since buffer ownership transfer is a big
issue when writing data to a page retrieved from a pipe.
https://lore.kernel.org/linux-block/CAJfpeguQ3xn2-6svkkVXJ88tiVfcDd-eKi1evzzfvu305fMoyw@mail.gmail.com/
Thanks,
Ming
* Re: [PATCH V2 00/17] io_uring/ublk: add IORING_OP_FUSED_CMD
2023-03-08 1:08 ` Ming Lei
@ 2023-03-08 16:22 ` Pavel Begunkov
2023-03-09 2:05 ` Ming Lei
0 siblings, 1 reply; 42+ messages in thread
From: Pavel Begunkov @ 2023-03-08 16:22 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, io-uring, linux-block, Miklos Szeredi, ZiyangZhang,
Xiaoguang Wang, Bernd Schubert
On 3/8/23 01:08, Ming Lei wrote:
> On Tue, Mar 07, 2023 at 03:37:21PM +0000, Pavel Begunkov wrote:
>> On 3/7/23 14:15, Ming Lei wrote:
>>> Hello,
>>>
>>> Add IORING_OP_FUSED_CMD, it is one special URING_CMD, which has to
>>> be SQE128. The 1st SQE(master) is one 64byte URING_CMD, and the 2nd
>>> 64byte SQE(slave) is another normal 64byte OP. For any OP which needs
>>> to support slave OP, io_issue_defs[op].fused_slave needs to be set as 1,
>>> and its ->issue() can retrieve/import buffer from master request's
>>> fused_cmd_kbuf. The slave OP is actually submitted from kernel, part of
>>> this idea is from Xiaoguang's ublk ebpf patchset, but this patchset
>>> submits slave OP just like normal OP issued from userspace, that said,
>>> SQE order is kept, and batching handling is done too.
>>
>> From a quick look through patches it all looks a bit complicated
>> and intrusive, all over generic hot paths. I think instead we
>
> Really? The main change to generic hot paths are adding one 'true/false'
> parameter to io_init_req(). For others, the change is just check on
> req->flags or issue_flags, which is basically zero cost.
Extra flag in io_init_req(), but it also exports that function, which
is internal, to non-core code. Additionally it un-inlines it,
and it even looks like it calls it recursively (max depth 2). From a quick
look, there is some hand-coded ->cached_refs manipulation, and it takes extra
space in generic sections of io_kiocb. It makes all cmd users
check for IO_URING_F_FUSED. There is also a two-way dependency b/w
requests, which never plays out well, e.g. I still hate how linked
timeouts stick out in generic paths.
Depending on SQE128 also doesn't seem right, though it can be dealt
with, e.g. sth like how it's done with linked requests.
>> should be able to use registered buffer table as intermediary and
>> reuse splicing. Let me try it out
>
> I will take a look at you patch, but last time, Linus has pointed out that
> splice isn't one good way, in which buffer ownership transferring is one big
> issue for writing data to page retrieved from pipe.
There are no real pipes; better to say io_uring replaces the pipe,
and the splice bits are used to get pages from a file. Though, there
will be some common problems. Thanks for the link, I'll need to
get through it first.
> https://lore.kernel.org/linux-block/CAJfpeguQ3xn2-6svkkVXJ88tiVfcDd-eKi1evzzfvu305fMoyw@mail.gmail.com/
--
Pavel Begunkov
* Re: [PATCH V2 00/17] io_uring/ublk: add IORING_OP_FUSED_CMD
2023-03-08 16:22 ` Pavel Begunkov
@ 2023-03-09 2:05 ` Ming Lei
0 siblings, 0 replies; 42+ messages in thread
From: Ming Lei @ 2023-03-09 2:05 UTC (permalink / raw)
To: Pavel Begunkov
Cc: Jens Axboe, io-uring, linux-block, Miklos Szeredi, ZiyangZhang,
Xiaoguang Wang, Bernd Schubert, ming.lei
On Wed, Mar 08, 2023 at 04:22:15PM +0000, Pavel Begunkov wrote:
> On 3/8/23 01:08, Ming Lei wrote:
> > On Tue, Mar 07, 2023 at 03:37:21PM +0000, Pavel Begunkov wrote:
> > > On 3/7/23 14:15, Ming Lei wrote:
> > > > Hello,
> > > >
> > > > Add IORING_OP_FUSED_CMD, it is one special URING_CMD, which has to
> > > > be SQE128. The 1st SQE(master) is one 64byte URING_CMD, and the 2nd
> > > > 64byte SQE(slave) is another normal 64byte OP. For any OP which needs
> > > > to support slave OP, io_issue_defs[op].fused_slave needs to be set as 1,
> > > > and its ->issue() can retrieve/import buffer from master request's
> > > > fused_cmd_kbuf. The slave OP is actually submitted from kernel, part of
> > > > this idea is from Xiaoguang's ublk ebpf patchset, but this patchset
> > > > submits slave OP just like normal OP issued from userspace, that said,
> > > > SQE order is kept, and batching handling is done too.
> > >
> > > From a quick look through patches it all looks a bit complicated
> > > and intrusive, all over generic hot paths. I think instead we
> >
> > Really? The main change to generic hot paths are adding one 'true/false'
> > parameter to io_init_req(). For others, the change is just check on
> > req->flags or issue_flags, which is basically zero cost.
>
> Extra flag in io_init_req() but also exporting it, which is an
> internal function, to non-core code. Additionally it un-inlines it
We can make it inline for core code only.
> and even looks recurse calls it (max depth 2). From a quick look,
The recursive call is only done for the fused command, and won't be an
issue for normal OPs.
> there is some hand coded ->cached_refs manipulations, it takes extra
> space in generic sections of io_kiocb.
Yeah, but it is still done for the fused command only. I think people
are happy to pay the cost for the benefit, and we do not cause trouble
for others.
> It makes all cmd users to
> check for IO_URING_F_FUSED. There is also a two-way dependency b/w
The check is zero cost, and it is just for avoiding adding a ->fused_cmd()
callback; otherwise the check can be killed.
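The per-driver pattern is the one visible in the patch at the top of the thread. Roughly, as a standalone sketch (the flag values here are assumed for illustration; the real bits live in include/linux/io_uring.h):

```c
#include <assert.h>
#include <errno.h>

/* Assumed flag values for illustration only; the real IO_URING_F_FUSED
 * bit is defined by the patchset in include/linux/io_uring.h. */
#define IO_URING_F_NONBLOCK	(1U << 1)
#define IO_URING_F_FUSED	(1U << 8)

/* Pattern each existing ->uring_cmd() implementation follows until it
 * gains real fused support: reject fused issues up front. */
static int demo_uring_cmd(unsigned int issue_flags)
{
	if (issue_flags & IO_URING_F_FUSED)
		return -EOPNOTSUPP;
	/* ... normal uring_cmd handling would go here ... */
	return 0;
}
```

Since the check is a single branch on an already-loaded flags word, it is effectively free for drivers that never see fused commands.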
> requests, which never plays out well, e.g. I still hate how linked
> timeouts stick out in generic paths.
I'd appreciate it if you could explain that in detail.
Yeah, part of the fused command's job is to submit one new IO and wait for its
completion. The slave request is actually invisible in the linked list; only
the fused command can be in the linked list.
>
> Depending on SQE128 also doesn't seem right, though it can be dealt
> with, e.g. sth like how it's done with links requests.
I thought about handling it with a linked request, but we need the fused
command to complete after the slave request is done, which becomes a deadlock
if the two are linked together.
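The deadlock can be seen on a toy wait-for model (illustrative only, not kernel code): fused semantics make the master wait for the slave's completion, while linking master to slave would make the slave wait for the master's completion, and the two edges close a cycle.

```c
#include <assert.h>

enum { MASTER, SLAVE, NREQ };

/* waits[a][b] != 0 means request a cannot complete until b completes. */
static int deadlocked(int waits[NREQ][NREQ])
{
	/* With two requests, a wait cycle exists iff each waits on the other. */
	return waits[MASTER][SLAVE] && waits[SLAVE][MASTER];
}
```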
SQE128 is a per-context feature: when we need to submit a SQE128 uring
command, the same ring is required to handle IO, so IMO it is a perfect
fit for this case, at least for ublk.
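As a sketch of the SQE128 layout being relied on (using a simplified 64-byte stand-in for struct io_uring_sqe, not the real field layout): with IORING_SETUP_SQE128 every submission slot is 128 bytes, the first 64 holding the master URING_CMD and the second 64 carrying the slave SQE as payload.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for the 64-byte struct io_uring_sqe
 * (fields abridged, padded out to exactly 64 bytes). */
struct sqe64 {
	uint8_t  opcode;
	uint8_t  flags;
	uint16_t ioprio;
	int32_t  fd;
	uint64_t off;
	uint64_t addr;
	uint32_t len;
	uint8_t  pad[36];
};

/* In a SQE128 ring, the slave SQE sits in the second half of the slot. */
static const struct sqe64 *fused_slave_sqe(const void *sqe128_slot)
{
	return (const struct sqe64 *)((const uint8_t *)sqe128_slot + 64);
}
```

This is why the feature needs no separate SQE for the slave OP: the payload half of the 128-byte slot already carries a full 64-byte SQE that the kernel can initialize a request from.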
>
> > > should be able to use registered buffer table as intermediary and
> > > reuse splicing. Let me try it out
> >
> > I will take a look at you patch, but last time, Linus has pointed out that
> > splice isn't one good way, in which buffer ownership transferring is one big
> > issue for writing data to page retrieved from pipe.
>
> There are no real pipes, better to say io_uring replaces a pipe,
> and splice bits are used to get pages from a file. Though, there
> will be some common problems. Thanks for the link, I'll need to
> get through it first, thanks for the link
Yeah, here the only value of the pipe is to reuse the ->splice_read()
interface, which is why I came up with the fused command for this job. I am
open to other approaches, if the problem can be solved reliably and efficiently.
Thanks,
Ming
* Re: [PATCH V2 00/17] io_uring/ublk: add IORING_OP_FUSED_CMD
2023-03-07 14:15 [PATCH V2 00/17] io_uring/ublk: add IORING_OP_FUSED_CMD Ming Lei
` (17 preceding siblings ...)
2023-03-07 15:37 ` [PATCH V2 00/17] io_uring/ublk: add IORING_OP_FUSED_CMD Pavel Begunkov
@ 2023-03-15 7:08 ` Ziyang Zhang
2023-03-15 7:54 ` Ming Lei
18 siblings, 1 reply; 42+ messages in thread
From: Ziyang Zhang @ 2023-03-15 7:08 UTC (permalink / raw)
To: Ming Lei
Cc: linux-block, io-uring, Jens Axboe, Miklos Szeredi, Xiaoguang Wang,
Bernd Schubert
On 2023/3/7 22:15, Ming Lei wrote:
> Hello,
>
> Add IORING_OP_FUSED_CMD, it is one special URING_CMD, which has to
> be SQE128. The 1st SQE(master) is one 64byte URING_CMD, and the 2nd
> 64byte SQE(slave) is another normal 64byte OP. For any OP which needs
> to support slave OP, io_issue_defs[op].fused_slave needs to be set as 1,
> and its ->issue() can retrieve/import buffer from master request's
> fused_cmd_kbuf. The slave OP is actually submitted from kernel, part of
> this idea is from Xiaoguang's ublk ebpf patchset, but this patchset
> submits slave OP just like normal OP issued from userspace, that said,
> SQE order is kept, and batching handling is done too.
>
> Please see detailed design in commit log of the 3th patch, and one big
> point is how to handle buffer ownership.
>
> With this way, it is easy to support zero copy for ublk/fuse device.
>
> Basically userspace can specify any sub-buffer of the ublk block request
> buffer from the fused command just by setting 'offset/len'
> in the slave SQE for running slave OP. This way is flexible to implement
> io mapping: mirror, stripped, ...
>
> The 4th & 5th patches enable fused slave support for the following OPs:
>
> OP_READ/OP_WRITE
> OP_SEND/OP_RECV/OP_SEND_ZC
>
> The other ublk patches cleans ublk driver and implement fused command
> for supporting zero copy.
>
> Follows userspace code:
>
> https://github.com/ming1/ubdsrv/tree/fused-cmd-zc-v2
>
> All three(loop, nbd and qcow2) ublk targets have supported zero copy by passing:
>
> ublk add -t [loop|nbd|qcow2] -z ....
>
> Basic fs mount/kernel building and builtin test are done.
>
> Also add liburing test case for covering fused command based on miniublk
> of blktest:
>
> https://github.com/ming1/liburing/commits/fused_cmd_miniublk
>
> Performance improvement is obvious on memory bandwidth
> related workloads, such as, 1~2X improvement on 64K/512K BS
> IO test on loop with ramfs backing file.
>
> Any comments are welcome!
>
Hi Ming,
Maybe we can split patches 06-12 into a separate cleanup patchset. I think
those patches can be merged first because they are not related to zero copy.
Regards,
Zhang
* Re: [PATCH V2 00/17] io_uring/ublk: add IORING_OP_FUSED_CMD
2023-03-15 7:08 ` Ziyang Zhang
@ 2023-03-15 7:54 ` Ming Lei
0 siblings, 0 replies; 42+ messages in thread
From: Ming Lei @ 2023-03-15 7:54 UTC (permalink / raw)
To: Ziyang Zhang
Cc: linux-block, io-uring, Jens Axboe, Miklos Szeredi, Xiaoguang Wang,
Bernd Schubert, ming.lei
On Wed, Mar 15, 2023 at 03:08:21PM +0800, Ziyang Zhang wrote:
> On 2023/3/7 22:15, Ming Lei wrote:
> > Hello,
> >
> > Add IORING_OP_FUSED_CMD, it is one special URING_CMD, which has to
> > be SQE128. The 1st SQE(master) is one 64byte URING_CMD, and the 2nd
> > 64byte SQE(slave) is another normal 64byte OP. For any OP which needs
> > to support slave OP, io_issue_defs[op].fused_slave needs to be set as 1,
> > and its ->issue() can retrieve/import buffer from master request's
> > fused_cmd_kbuf. The slave OP is actually submitted from kernel, part of
> > this idea is from Xiaoguang's ublk ebpf patchset, but this patchset
> > submits slave OP just like normal OP issued from userspace, that said,
> > SQE order is kept, and batching handling is done too.
> >
> > Please see detailed design in commit log of the 3th patch, and one big
> > point is how to handle buffer ownership.
> >
> > With this way, it is easy to support zero copy for ublk/fuse device.
> >
> > Basically userspace can specify any sub-buffer of the ublk block request
> > buffer from the fused command just by setting 'offset/len'
> > in the slave SQE for running slave OP. This way is flexible to implement
> > io mapping: mirror, stripped, ...
> >
> > The 4th & 5th patches enable fused slave support for the following OPs:
> >
> > OP_READ/OP_WRITE
> > OP_SEND/OP_RECV/OP_SEND_ZC
> >
> > The other ublk patches cleans ublk driver and implement fused command
> > for supporting zero copy.
> >
> > Follows userspace code:
> >
> > https://github.com/ming1/ubdsrv/tree/fused-cmd-zc-v2
> >
> > All three(loop, nbd and qcow2) ublk targets have supported zero copy by passing:
> >
> > ublk add -t [loop|nbd|qcow2] -z ....
> >
> > Basic fs mount/kernel building and builtin test are done.
> >
> > Also add liburing test case for covering fused command based on miniublk
> > of blktest:
> >
> > https://github.com/ming1/liburing/commits/fused_cmd_miniublk
> >
> > Performance improvement is obvious on memory bandwidth
> > related workloads, such as, 1~2X improvement on 64K/512K BS
> > IO test on loop with ramfs backing file.
> >
> > Any comments are welcome!
> >
>
>
> Hi Ming,
>
> Maybe we can split patch 06-12 to a separate cleanup pacthset. I think
> these patches can be merged first because they are not related to zero copy.
Hi Ziyang,
I think the fused command approach is doable, and the patchset is basically
ready to go now:
- the fused command approach won't affect normal uring OPs
- most of the change is focused in io_uring/fused_cmd.*
- the approach works reliably and efficiently, passes xfstests, and has been
verified on all three ublk targets (loop, nbd, qcow2); the improvement is
pretty obvious
so I'd rather put them in one series and aim at v6.4, then we can avoid conflicts.
However, if things don't move in the expected direction, we can switch to
another approach at that time, so it is too early to split out the cleanup
patchset now.
thanks,
Ming