* [RFC PATCH 1/4] io_uring/splice: support do_splice_direct
2022-11-03 8:50 [RFC PATCH 0/4] io_uring/splice: extend splice for supporting ublk zero copy Ming Lei
@ 2022-11-03 8:50 ` Ming Lei
2022-11-08 7:42 ` Christoph Hellwig
2022-11-03 8:50 ` [RFC PATCH 2/4] fs/splice: add helper of splice_dismiss_pipe() Ming Lei
` (3 subsequent siblings)
4 siblings, 1 reply; 13+ messages in thread
From: Ming Lei @ 2022-11-03 8:50 UTC (permalink / raw)
To: Jens Axboe, io-uring
Cc: linux-block, linux-kernel, linux-fsdevel, Miklos Szeredi,
Stefan Hajnoczi, ZiyangZhang, Ming Lei
do_splice_direct() has at least two advantages:
1) the extra pipe isn't required from user viewpoint, so userspace
code can be simplified, meantime easy to relax current pipe
limit since curret->splice_pipe is used for direct splice
2) in some situation, it isn't good to expose file data via
->splice_read() to userspace, such as the coming ublk driver's
zero copy support, request pages will be spliced to pipe for
supporting zero copy, and if it is READ, userspace may read
data of kernel pages, and direct splice can avoid this kind
of info leaks
Signed-off-by: Ming Lei <[email protected]>
---
fs/read_write.c | 5 +++--
include/linux/splice.h | 3 +++
io_uring/splice.c | 13 ++++++++++---
3 files changed, 16 insertions(+), 5 deletions(-)
diff --git a/fs/read_write.c b/fs/read_write.c
index 328ce8cf9a85..98869d15e884 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1253,7 +1253,7 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
goto fput_out;
file_start_write(out.file);
retval = do_splice_direct(in.file, &pos, out.file, &out_pos,
- count, fl);
+ count, fl | SPLICE_F_DIRECT);
file_end_write(out.file);
} else {
if (out.file->f_flags & O_NONBLOCK)
@@ -1389,7 +1389,8 @@ ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in,
size_t len, unsigned int flags)
{
return do_splice_direct(file_in, &pos_in, file_out, &pos_out,
- len > MAX_RW_COUNT ? MAX_RW_COUNT : len, 0);
+ len > MAX_RW_COUNT ? MAX_RW_COUNT : len,
+ SPLICE_F_DIRECT);
}
EXPORT_SYMBOL(generic_copy_file_range);
diff --git a/include/linux/splice.h b/include/linux/splice.h
index a55179fd60fc..9121624ad198 100644
--- a/include/linux/splice.h
+++ b/include/linux/splice.h
@@ -23,6 +23,9 @@
#define SPLICE_F_ALL (SPLICE_F_MOVE|SPLICE_F_NONBLOCK|SPLICE_F_MORE|SPLICE_F_GIFT)
+/* used for io_uring interface only */
+#define SPLICE_F_DIRECT (0x10) /* direct splice and user needn't provide pipe */
+
/*
* Passed to the actors
*/
diff --git a/io_uring/splice.c b/io_uring/splice.c
index 53e4232d0866..c11ea4cd1c7e 100644
--- a/io_uring/splice.c
+++ b/io_uring/splice.c
@@ -27,7 +27,8 @@ static int __io_splice_prep(struct io_kiocb *req,
const struct io_uring_sqe *sqe)
{
struct io_splice *sp = io_kiocb_to_cmd(req, struct io_splice);
- unsigned int valid_flags = SPLICE_F_FD_IN_FIXED | SPLICE_F_ALL;
+ unsigned int valid_flags = SPLICE_F_FD_IN_FIXED | SPLICE_F_ALL |
+ SPLICE_F_DIRECT;
sp->len = READ_ONCE(sqe->len);
sp->flags = READ_ONCE(sqe->splice_flags);
@@ -109,8 +110,14 @@ int io_splice(struct io_kiocb *req, unsigned int issue_flags)
poff_in = (sp->off_in == -1) ? NULL : &sp->off_in;
poff_out = (sp->off_out == -1) ? NULL : &sp->off_out;
- if (sp->len)
- ret = do_splice(in, poff_in, out, poff_out, sp->len, flags);
+ if (sp->len) {
+ if (flags & SPLICE_F_DIRECT)
+ ret = do_splice_direct(in, poff_in, out, poff_out,
+ sp->len, flags);
+ else
+ ret = do_splice(in, poff_in, out, poff_out, sp->len,
+ flags);
+ }
if (!(sp->flags & SPLICE_F_FD_IN_FIXED))
io_put_file(in);
--
2.31.1
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [RFC PATCH 1/4] io_uring/splice: support do_splice_direct
2022-11-03 8:50 ` [RFC PATCH 1/4] io_uring/splice: support do_splice_direct Ming Lei
@ 2022-11-08 7:42 ` Christoph Hellwig
0 siblings, 0 replies; 13+ messages in thread
From: Christoph Hellwig @ 2022-11-08 7:42 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, io-uring, linux-block, linux-kernel, linux-fsdevel,
Miklos Szeredi, Stefan Hajnoczi, ZiyangZhang
On Thu, Nov 03, 2022 at 04:50:01PM +0800, Ming Lei wrote:
> do_splice_direct() has at least two advantages:
>
> 1) the extra pipe isn't required from user viewpoint, so userspace
> code can be simplified, meantime easy to relax current pipe
> limit since curret->splice_pipe is used for direct splice
>
> 2) in some situation, it isn't good to expose file data via
> ->splice_read() to userspace, such as the coming ublk driver's
> zero copy support, request pages will be spliced to pipe for
> supporting zero copy, and if it is READ, userspace may read
> data of kernel pages, and direct splice can avoid this kind
> of info leaks
Please make this a separate opcode instead of overloading the splice
op with a flag that causes very different behavior and isn't supported
for the regular splice syscall.
^ permalink raw reply [flat|nested] 13+ messages in thread
* [RFC PATCH 2/4] fs/splice: add helper of splice_dismiss_pipe()
2022-11-03 8:50 [RFC PATCH 0/4] io_uring/splice: extend splice for supporting ublk zero copy Ming Lei
2022-11-03 8:50 ` [RFC PATCH 1/4] io_uring/splice: support do_splice_direct Ming Lei
@ 2022-11-03 8:50 ` Ming Lei
2022-11-03 8:50 ` [RFC PATCH 3/4] io_uring/splice: support splice from ->splice_read to ->splice_read Ming Lei
` (2 subsequent siblings)
4 siblings, 0 replies; 13+ messages in thread
From: Ming Lei @ 2022-11-03 8:50 UTC (permalink / raw)
To: Jens Axboe, io-uring
Cc: linux-block, linux-kernel, linux-fsdevel, Miklos Szeredi,
Stefan Hajnoczi, ZiyangZhang, Ming Lei
Add helper of splice_dismiss_pipe(), so cleanup iter_file_splice_write
a bit.
And this helper will be reused in the following patch for supporting
to consume pipe by ->splice_read().
Signed-off-by: Ming Lei <[email protected]>
---
fs/splice.c | 47 ++++++++++++++++++++++++++++++-----------------
1 file changed, 30 insertions(+), 17 deletions(-)
diff --git a/fs/splice.c b/fs/splice.c
index 0878b852b355..f8999ffe6215 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -282,6 +282,34 @@ void splice_shrink_spd(struct splice_pipe_desc *spd)
kfree(spd->partial);
}
+/* return if wakeup is needed */
+static bool splice_dismiss_pipe(struct pipe_inode_info *pipe, ssize_t bytes)
+{
+ unsigned int mask = pipe->ring_size - 1;
+ unsigned int tail = pipe->tail;
+ bool need_wakeup = false;
+
+ while (bytes) {
+ struct pipe_buffer *buf = &pipe->bufs[tail & mask];
+
+ if (bytes >= buf->len) {
+ bytes -= buf->len;
+ buf->len = 0;
+ pipe_buf_release(pipe, buf);
+ tail++;
+ pipe->tail = tail;
+ if (pipe->files)
+ need_wakeup = true;
+ } else {
+ buf->offset += bytes;
+ buf->len -= bytes;
+ bytes = 0;
+ }
+ }
+
+ return need_wakeup;
+}
+
/**
* generic_file_splice_read - splice data from file to a pipe
* @in: file to splice from
@@ -692,23 +720,8 @@ iter_file_splice_write(struct pipe_inode_info *pipe, struct file *out,
*ppos = sd.pos;
/* dismiss the fully eaten buffers, adjust the partial one */
- tail = pipe->tail;
- while (ret) {
- struct pipe_buffer *buf = &pipe->bufs[tail & mask];
- if (ret >= buf->len) {
- ret -= buf->len;
- buf->len = 0;
- pipe_buf_release(pipe, buf);
- tail++;
- pipe->tail = tail;
- if (pipe->files)
- sd.need_wakeup = true;
- } else {
- buf->offset += ret;
- buf->len -= ret;
- ret = 0;
- }
- }
+ if (splice_dismiss_pipe(pipe, ret))
+ sd.need_wakeup = true;
}
done:
kfree(array);
--
2.31.1
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [RFC PATCH 3/4] io_uring/splice: support splice from ->splice_read to ->splice_read
2022-11-03 8:50 [RFC PATCH 0/4] io_uring/splice: extend splice for supporting ublk zero copy Ming Lei
2022-11-03 8:50 ` [RFC PATCH 1/4] io_uring/splice: support do_splice_direct Ming Lei
2022-11-03 8:50 ` [RFC PATCH 2/4] fs/splice: add helper of splice_dismiss_pipe() Ming Lei
@ 2022-11-03 8:50 ` Ming Lei
2022-11-08 7:45 ` Christoph Hellwig
2022-11-03 8:50 ` [RFC PATCH 4/4] ublk_drv: support splice based read/write zero copy Ming Lei
2022-11-04 9:15 ` [RFC PATCH 0/4] io_uring/splice: extend splice for supporting ublk " Ziyang Zhang
4 siblings, 1 reply; 13+ messages in thread
From: Ming Lei @ 2022-11-03 8:50 UTC (permalink / raw)
To: Jens Axboe, io-uring
Cc: linux-block, linux-kernel, linux-fsdevel, Miklos Szeredi,
Stefan Hajnoczi, ZiyangZhang, Ming Lei
The 1st ->splice_read produces buffer to the pipe of
current->splice_pipe, and the 2nd ->splice_read consumes the buffer
in this pipe.
This way helps to support zero copy of read request for ublk and fuse.
Signed-off-by: Ming Lei <[email protected]>
---
fs/splice.c | 146 ++++++++++++++++++++++++++++++++++++--
include/linux/fs.h | 2 +
include/linux/pipe_fs_i.h | 9 +++
include/linux/splice.h | 11 +++
io_uring/splice.c | 13 ++--
5 files changed, 169 insertions(+), 12 deletions(-)
diff --git a/fs/splice.c b/fs/splice.c
index f8999ffe6215..cd5255f9ff13 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -330,17 +330,70 @@ ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
struct iov_iter to;
struct kiocb kiocb;
int ret;
+ struct bio_vec *array = NULL;
+ bool consumer;
+
+ /*
+ * So far, in->fops->splice_read() has to make sure the following
+ * simple page use model works.
+ *
+ * pipe->consumed_by_read is set by the in end of the pipe
+ */
+ if ((flags & SPLICE_F_READ_TO_READ) && pipe->consumed_by_read) {
+ unsigned int head, tail, mask;
+ int nbufs = pipe->max_usage;
+ size_t left = len;
+ int n;
+
+ if (WARN_ON_ONCE(!(flags & SPLICE_F_DIRECT)))
+ return -EINVAL;
+
+ head = pipe->head;
+ tail = pipe->tail;
+ mask = pipe->ring_size - 1;
+
+ array = kcalloc(nbufs, sizeof(struct bio_vec), GFP_KERNEL);
+ if (!array)
+ return -ENOMEM;
+
+ for (n = 0; !pipe_empty(head, tail) && left && n < nbufs;
+ tail++) {
+ struct pipe_buffer *buf = &pipe->bufs[tail & mask];
+ size_t this_len = buf->len;
+
+ /* zero-length bvecs are not supported, skip them */
+ if (!this_len)
+ continue;
+ this_len = min(this_len, left);
+
+ array[n].bv_page = buf->page;
+ array[n].bv_len = this_len;
+ array[n].bv_offset = buf->offset;
+ left -= this_len;
+ n++;
+ }
+
+ consumer = true;
+ iov_iter_bvec(&to, READ, array, n, len - left);
+ } else {
+ /* !consumer means one pipe buf producer */
+ consumer = false;
+ iov_iter_pipe(&to, READ, pipe, len);
+ }
- iov_iter_pipe(&to, READ, pipe, len);
init_sync_kiocb(&kiocb, in);
kiocb.ki_pos = *ppos;
ret = call_read_iter(in, &kiocb, &to);
if (ret > 0) {
*ppos = kiocb.ki_pos;
file_accessed(in);
+
+ if (consumer)
+ splice_dismiss_pipe(pipe, ret);
} else if (ret < 0) {
/* free what was emitted */
- pipe_discard_from(pipe, to.start_head);
+ if (!consumer)
+ pipe_discard_from(pipe, to.start_head);
/*
* callers of ->splice_read() expect -EAGAIN on
* "can't put anything in there", rather than -EFAULT.
@@ -349,6 +402,11 @@ ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
ret = -EAGAIN;
}
+ if (consumer) {
+ kfree(array);
+ pipe->consumed_by_read = false;
+ }
+
return ret;
}
EXPORT_SYMBOL(generic_file_splice_read);
@@ -782,7 +840,7 @@ static long do_splice_from(struct pipe_inode_info *pipe, struct file *out,
*/
static long do_splice_to(struct file *in, loff_t *ppos,
struct pipe_inode_info *pipe, size_t len,
- unsigned int flags)
+ unsigned int flags, bool consumer)
{
unsigned int p_space;
int ret;
@@ -790,8 +848,12 @@ static long do_splice_to(struct file *in, loff_t *ppos,
if (unlikely(!(in->f_mode & FMODE_READ)))
return -EBADF;
- /* Don't try to read more the pipe has space for. */
- p_space = pipe->max_usage - pipe_occupancy(pipe->head, pipe->tail);
+ if (consumer) /* read is consumer */
+ p_space = pipe_occupancy(pipe->head, pipe->tail);
+ else
+ /* Don't try to read more the pipe has space for. */
+ p_space = pipe->max_usage - pipe_occupancy(pipe->head,
+ pipe->tail);
len = min_t(size_t, len, p_space << PAGE_SHIFT);
ret = rw_verify_area(READ, in, ppos, len);
@@ -875,7 +937,7 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd,
size_t read_len;
loff_t pos = sd->pos, prev_pos = pos;
- ret = do_splice_to(in, &pos, pipe, len, flags);
+ ret = do_splice_to(in, &pos, pipe, len, flags, false);
if (unlikely(ret <= 0))
goto out_release;
@@ -992,6 +1054,76 @@ long do_splice_direct(struct file *in, loff_t *ppos, struct file *out,
}
EXPORT_SYMBOL(do_splice_direct);
+static int direct_splice_read_consumer_actor(struct pipe_inode_info *pipe,
+ struct splice_desc *sd)
+{
+ struct file *file = sd->u.file;
+
+ /* Pipe in side has to notify us by ->consumed_by_read */
+ if (!pipe->consumed_by_read)
+ return -EINVAL;
+
+ return do_splice_to(file, sd->opos, pipe, sd->total_len,
+ sd->flags, true);
+}
+
+/**
+ * do_splice_direct_read_consumer - splices data directly with producer/
+ * consumer model
+ * @in: file to splice from
+ * @ppos: input file offset
+ * @out: file to splice to
+ * @opos: output file offset
+ * @len: number of bytes to splice
+ * @flags: splice modifier flags, SPLICE_F_READ_TO_READ is required
+ *
+ * Description:
+ * For use by ublk or fuse to implement zero copy for READ request, and
+ * splice directly over internal pipe from device to file, and device's
+ * ->splice_read() produces pipe buffers, and file's ->splice_read()
+ * consumes the buffers.
+ *
+ */
+long do_splice_direct_read_consumer(struct file *in, loff_t *ppos,
+ struct file *out, loff_t *opos,
+ size_t len, unsigned int flags)
+{
+ struct splice_desc sd = {
+ .len = len,
+ .total_len = len,
+ .flags = flags,
+ .pos = *ppos,
+ .u.file = out,
+ .opos = opos,
+ };
+ long ret;
+
+ if (!(flags & (SPLICE_F_DIRECT | SPLICE_F_READ_TO_READ)))
+ return -EINVAL;
+
+ if (unlikely(!(out->f_mode & FMODE_READ)))
+ return -EBADF;
+
+ /*
+ * So far we just support F_READ_TO_READ if it is one plain
+ * file which ->splice_read points to generic_file_splice_read
+ */
+ if (out->f_op->splice_read != generic_file_splice_read)
+ return -EINVAL;
+
+ ret = rw_verify_area(READ, out, opos, len);
+ if (unlikely(ret < 0))
+ return ret;
+
+ ret = splice_direct_to_actor(in, &sd,
+ direct_splice_read_consumer_actor);
+ if (ret > 0)
+ *ppos = sd.pos;
+
+ return ret;
+}
+EXPORT_SYMBOL(do_splice_direct_read_consumer);
+
static int wait_for_space(struct pipe_inode_info *pipe, unsigned flags)
{
for (;;) {
@@ -1023,7 +1155,7 @@ long splice_file_to_pipe(struct file *in,
pipe_lock(opipe);
ret = wait_for_space(opipe, flags);
if (!ret)
- ret = do_splice_to(in, offset, opipe, len, flags);
+ ret = do_splice_to(in, offset, opipe, len, flags, false);
pipe_unlock(opipe);
if (ret > 0)
wakeup_pipe_readers(opipe);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e654435f1651..e5f84902f149 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3169,6 +3169,8 @@ extern ssize_t generic_splice_sendpage(struct pipe_inode_info *pipe,
struct file *out, loff_t *, size_t len, unsigned int flags);
extern long do_splice_direct(struct file *in, loff_t *ppos, struct file *out,
loff_t *opos, size_t len, unsigned int flags);
+extern long do_splice_direct_read_consumer(struct file *in, loff_t *ppos,
+ struct file *out, loff_t *opos, size_t len, unsigned int flags);
extern void
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index 6cb65df3e3ba..90c6ff8c82ef 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -72,6 +72,15 @@ struct pipe_inode_info {
unsigned int r_counter;
unsigned int w_counter;
bool poll_usage;
+
+ /*
+ * If SPLICE_F_READ_TO_READ is applied, in->fops->splice_read()
+ * should set this flag, so that out->fops->splice_read() can
+ * observe this flag, then consume buffers in the pipe.
+ *
+ * Used by do_splice_direct_read_consumer() only.
+ */
+ bool consumed_by_read;
struct page *tmp_page;
struct fasync_struct *fasync_readers;
struct fasync_struct *fasync_writers;
diff --git a/include/linux/splice.h b/include/linux/splice.h
index 9121624ad198..f48044e5e173 100644
--- a/include/linux/splice.h
+++ b/include/linux/splice.h
@@ -26,6 +26,17 @@
/* used for io_uring interface only */
#define SPLICE_F_DIRECT (0x10) /* direct splice and user needn't provide pipe */
+/*
+ * The usual splice is file-to-pipe and pipe-to-file, and this flag means the
+ * splice is file-to-pipe and file-to-pipe. Looks this way is stupid, but
+ * please understand from producer & consumer viewpoint, the 1st file-to-pipe
+ * is producer, and the 2nd file-to-pipe is consumer, so here the 2nd
+ * ->slice_read just consumes buffers stored in the pipe.
+ *
+ * And this flag is only valid in case of SPLICE_F_DIRECT.
+ */
+#define SPLICE_F_READ_TO_READ (0x20)
+
/*
* Passed to the actors
*/
diff --git a/io_uring/splice.c b/io_uring/splice.c
index c11ea4cd1c7e..df66d89f4f17 100644
--- a/io_uring/splice.c
+++ b/io_uring/splice.c
@@ -28,7 +28,7 @@ static int __io_splice_prep(struct io_kiocb *req,
{
struct io_splice *sp = io_kiocb_to_cmd(req, struct io_splice);
unsigned int valid_flags = SPLICE_F_FD_IN_FIXED | SPLICE_F_ALL |
- SPLICE_F_DIRECT;
+ SPLICE_F_DIRECT | SPLICE_F_READ_TO_READ;
sp->len = READ_ONCE(sqe->len);
sp->flags = READ_ONCE(sqe->splice_flags);
@@ -111,12 +111,15 @@ int io_splice(struct io_kiocb *req, unsigned int issue_flags)
poff_out = (sp->off_out == -1) ? NULL : &sp->off_out;
if (sp->len) {
- if (flags & SPLICE_F_DIRECT)
- ret = do_splice_direct(in, poff_in, out, poff_out,
- sp->len, flags);
- else
+ if (!(flags & (SPLICE_F_DIRECT | SPLICE_F_READ_TO_READ)))
ret = do_splice(in, poff_in, out, poff_out, sp->len,
flags);
+ else if (flags & SPLICE_F_READ_TO_READ)
+ ret = do_splice_direct_read_consumer(in, poff_in, out,
+ poff_out, sp->len, flags);
+ else
+ ret = do_splice_direct(in, poff_in, out, poff_out,
+ sp->len, flags);
}
if (!(sp->flags & SPLICE_F_FD_IN_FIXED))
--
2.31.1
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [RFC PATCH 3/4] io_uring/splice: support splice from ->splice_read to ->splice_read
2022-11-03 8:50 ` [RFC PATCH 3/4] io_uring/splice: support splice from ->splice_read to ->splice_read Ming Lei
@ 2022-11-08 7:45 ` Christoph Hellwig
2022-11-08 8:29 ` Ming Lei
0 siblings, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2022-11-08 7:45 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, io-uring, linux-block, linux-kernel, linux-fsdevel,
Miklos Szeredi, Stefan Hajnoczi, ZiyangZhang
On Thu, Nov 03, 2022 at 04:50:03PM +0800, Ming Lei wrote:
> The 1st ->splice_read produces buffer to the pipe of
> current->splice_pipe, and the 2nd ->splice_read consumes the buffer
> in this pipe.
This looks really ugly. I think you want Linus and Al to look over
it at very least.
Also, what is going to happen if your ->splice_read instance does not
support the flag to magically do something entirely different?
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [RFC PATCH 3/4] io_uring/splice: support splice from ->splice_read to ->splice_read
2022-11-08 7:45 ` Christoph Hellwig
@ 2022-11-08 8:29 ` Ming Lei
0 siblings, 0 replies; 13+ messages in thread
From: Ming Lei @ 2022-11-08 8:29 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, io-uring, linux-block, linux-kernel, linux-fsdevel,
Miklos Szeredi, Stefan Hajnoczi, ZiyangZhang
On Mon, Nov 07, 2022 at 11:45:06PM -0800, Christoph Hellwig wrote:
> On Thu, Nov 03, 2022 at 04:50:03PM +0800, Ming Lei wrote:
> > The 1st ->splice_read produces buffer to the pipe of
> > current->splice_pipe, and the 2nd ->splice_read consumes the buffer
> > in this pipe.
>
> This looks really ugly. I think you want Linus and Al to look over
> it at very least.
OK, I will Cc Linus and Al in V2.
It is just another case of pipe's producer/consumer model, IMO.
>
> Also, what is going to happen if your ->splice_read instance does not
> support the flag to magically do something entirely different?
If the ->splice_read() instance doesn't support this feature, then the new
added pipe flag won't be set, this API will return -EINVAL.
thanks,
Ming
^ permalink raw reply [flat|nested] 13+ messages in thread
* [RFC PATCH 4/4] ublk_drv: support splice based read/write zero copy
2022-11-03 8:50 [RFC PATCH 0/4] io_uring/splice: extend splice for supporting ublk zero copy Ming Lei
` (2 preceding siblings ...)
2022-11-03 8:50 ` [RFC PATCH 3/4] io_uring/splice: support splice from ->splice_read to ->splice_read Ming Lei
@ 2022-11-03 8:50 ` Ming Lei
2022-11-03 22:28 ` Bernd Schubert
2022-11-04 9:15 ` [RFC PATCH 0/4] io_uring/splice: extend splice for supporting ublk " Ziyang Zhang
4 siblings, 1 reply; 13+ messages in thread
From: Ming Lei @ 2022-11-03 8:50 UTC (permalink / raw)
To: Jens Axboe, io-uring
Cc: linux-block, linux-kernel, linux-fsdevel, Miklos Szeredi,
Stefan Hajnoczi, ZiyangZhang, Ming Lei
Pass ublk block IO request pages to kernel backend IO handling code via
pipe, and request page copy can be avoided. So far, the existed
pipe/splice mechanism works for handling write request only.
The initial idea of using splice for zero copy is from Miklos and Stefan.
Read request's zero copy requires pipe's change to allow one read end to
produce buffers for another read end to consume. The added SPLICE_F_READ_TO_READ
flag is for supporting this feature.
READ is handled by sending IORING_OP_SPLICE with SPLICE_F_DIRECT |
SPLICE_F_READ_TO_READ. WRITE is handled by sending IORING_OP_SPLICE with
SPLICE_F_DIRECT. Kernel internal pipe is used for simplifying userspace,
meantime potential info leak could be avoided.
Suggested-by: Miklos Szeredi <[email protected]>
Suggested-by: Stefan Hajnoczi <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
---
drivers/block/ublk_drv.c | 151 +++++++++++++++++++++++++++++++++-
include/uapi/linux/ublk_cmd.h | 34 +++++++-
2 files changed, 182 insertions(+), 3 deletions(-)
diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index f96cb01e9604..c9d061547877 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -42,6 +42,8 @@
#include <linux/mm.h>
#include <asm/page.h>
#include <linux/task_work.h>
+#include <linux/pipe_fs_i.h>
+#include <linux/splice.h>
#include <uapi/linux/ublk_cmd.h>
#define UBLK_MINORS (1U << MINORBITS)
@@ -51,7 +53,8 @@
| UBLK_F_URING_CMD_COMP_IN_TASK \
| UBLK_F_NEED_GET_DATA \
| UBLK_F_USER_RECOVERY \
- | UBLK_F_USER_RECOVERY_REISSUE)
+ | UBLK_F_USER_RECOVERY_REISSUE \
+ | UBLK_F_SPLICE_ZC)
/* All UBLK_PARAM_TYPE_* should be included here */
#define UBLK_PARAM_TYPE_ALL (UBLK_PARAM_TYPE_BASIC | UBLK_PARAM_TYPE_DISCARD)
@@ -61,6 +64,7 @@ struct ublk_rq_data {
struct callback_head work;
struct llist_node node;
};
+ atomic_t handled;
};
struct ublk_uring_cmd_pdu {
@@ -480,6 +484,9 @@ static int ublk_map_io(const struct ublk_queue *ubq, const struct request *req,
if (req_op(req) != REQ_OP_WRITE && req_op(req) != REQ_OP_FLUSH)
return rq_bytes;
+ if (ubq->flags & UBLK_F_SPLICE_ZC)
+ return rq_bytes;
+
if (ublk_rq_has_data(req)) {
struct ublk_map_data data = {
.ubq = ubq,
@@ -501,6 +508,9 @@ static int ublk_unmap_io(const struct ublk_queue *ubq,
{
const unsigned int rq_bytes = blk_rq_bytes(req);
+ if (ubq->flags & UBLK_F_SPLICE_ZC)
+ return rq_bytes;
+
if (req_op(req) == REQ_OP_READ && ublk_rq_has_data(req)) {
struct ublk_map_data data = {
.ubq = ubq,
@@ -858,6 +868,19 @@ static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx,
if (ublk_queue_can_use_recovery(ubq) && unlikely(ubq->force_abort))
return BLK_STS_IOERR;
+ if (ubq->flags & UBLK_F_SPLICE_ZC) {
+ struct ublk_rq_data *data = blk_mq_rq_to_pdu(rq);
+
+ atomic_set(&data->handled, 0);
+
+ /*
+ * Order write ->handled and write rq->state in
+ * blk_mq_start_request, the pair barrier is the one
+ * implied in atomic_inc_return() in ublk_splice_read
+ */
+ smp_wmb();
+ }
+
blk_mq_start_request(bd->rq);
if (unlikely(ubq_daemon_is_dying(ubq))) {
@@ -1299,13 +1322,137 @@ static int ublk_ch_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
return -EIOCBQUEUED;
}
+static void ublk_pipe_buf_release(struct pipe_inode_info *pipe,
+ struct pipe_buffer *buf)
+{
+}
+
+static const struct pipe_buf_operations ublk_pipe_buf_ops = {
+ .release = ublk_pipe_buf_release,
+};
+
+/*
+ * Pass request page reference to kernel backend IO handler via pipe
+ *
+ * ublk server has to handle backend IO via splice()
+ */
+static ssize_t ublk_splice_read(struct file *in, loff_t *ppos,
+ struct pipe_inode_info *pipe,
+ size_t len, unsigned int flags)
+{
+ struct ublk_device *ub = in->private_data;
+ struct req_iterator rq_iter;
+ struct bio_vec bv;
+ struct request *req;
+ struct ublk_queue *ubq;
+ u16 tag, q_id;
+ unsigned int done;
+ int ret, buf_offset;
+ struct ublk_rq_data *data;
+
+ if (!(flags & SPLICE_F_DIRECT))
+ return -EPERM;
+
+ /* No, we have to be the in side */
+ if (pipe->consumed_by_read)
+ return -EINVAL;
+
+ if (!ub)
+ return -EPERM;
+
+ tag = ublk_pos_to_tag(*ppos);
+ q_id = ublk_pos_to_hwq(*ppos);
+ buf_offset = ublk_pos_to_buf_offset(*ppos);
+
+ if (q_id >= ub->dev_info.nr_hw_queues)
+ return -EINVAL;
+
+ ubq = ublk_get_queue(ub, q_id);
+ if (!ubq)
+ return -EINVAL;
+
+ if (!(ubq->flags & UBLK_F_SPLICE_ZC))
+ return -EINVAL;
+
+ if (tag >= ubq->q_depth)
+ return -EINVAL;
+
+ req = blk_mq_tag_to_rq(ub->tag_set.tags[q_id], tag);
+ if (!req || !blk_mq_request_started(req))
+ return -EINVAL;
+
+ data = blk_mq_rq_to_pdu(req);
+ if (atomic_add_return(len, &data->handled) > blk_rq_bytes(req) || !len)
+ return -EACCES;
+
+ ret = -EINVAL;
+ if (!ublk_rq_has_data(req))
+ goto exit;
+
+ pr_devel("%s: qid %d tag %u offset %x, request bytes %u, len %llu\n",
+ __func__, tag, q_id, buf_offset, blk_rq_bytes(req),
+ (unsigned long long)len);
+
+ if (buf_offset + len > blk_rq_bytes(req))
+ goto exit;
+
+ if ((req_op(req) == REQ_OP_READ) &&
+ !(flags & SPLICE_F_READ_TO_READ))
+ goto exit;
+
+ if ((req_op(req) != REQ_OP_READ) &&
+ (flags & SPLICE_F_READ_TO_READ))
+ goto exit;
+
+ done = ret = 0;
+ /* todo: is iov_iter ready for handling multipage bvec? */
+ rq_for_each_segment(bv, req, rq_iter) {
+ struct pipe_buffer buf = {
+ .ops = &ublk_pipe_buf_ops,
+ .flags = 0,
+ .page = bv.bv_page,
+ .offset = bv.bv_offset,
+ .len = bv.bv_len,
+ };
+
+ if (buf_offset > 0) {
+ if (buf_offset >= bv.bv_len) {
+ buf_offset -= bv.bv_len;
+ continue;
+ } else {
+ buf.offset += buf_offset;
+ buf.len -= buf_offset;
+ buf_offset = 0;
+ }
+ }
+
+ ret = add_to_pipe(pipe, &buf);
+ if (unlikely(ret < 0))
+ break;
+ done += ret;
+ }
+
+ if (flags & SPLICE_F_READ_TO_READ)
+ pipe->consumed_by_read = true;
+
+ WARN_ON_ONCE(done > len);
+
+ if (done) {
+ *ppos += done;
+ ret = done;
+ }
+exit:
+ return ret;
+}
+
static const struct file_operations ublk_ch_fops = {
.owner = THIS_MODULE,
.open = ublk_ch_open,
.release = ublk_ch_release,
- .llseek = no_llseek,
+ .llseek = noop_llseek,
.uring_cmd = ublk_ch_uring_cmd,
.mmap = ublk_ch_mmap,
+ .splice_read = ublk_splice_read,
};
static void ublk_deinit_queue(struct ublk_device *ub, int q_id)
diff --git a/include/uapi/linux/ublk_cmd.h b/include/uapi/linux/ublk_cmd.h
index 8f88e3a29998..93d9ca7650ce 100644
--- a/include/uapi/linux/ublk_cmd.h
+++ b/include/uapi/linux/ublk_cmd.h
@@ -52,7 +52,36 @@
#define UBLKSRV_IO_BUF_OFFSET 0x80000000
/* tag bit is 12bit, so at most 4096 IOs for each queue */
-#define UBLK_MAX_QUEUE_DEPTH 4096
+#define UBLK_TAG_BITS 12
+#define UBLK_MAX_QUEUE_DEPTH (1U << UBLK_TAG_BITS)
+
+/* used in ->splice_read for supporting zero-copy */
+#define UBLK_BUFS_SIZE_BITS 42
+#define UBLK_BUFS_SIZE_MASK ((1ULL << UBLK_BUFS_SIZE_BITS) - 1)
+#define UBLK_BUF_SIZE_BITS (UBLK_BUFS_SIZE_BITS - UBLK_TAG_BITS)
+#define UBLK_BUF_MAX_SIZE (1ULL << UBLK_BUF_SIZE_BITS)
+
+static inline __u16 ublk_pos_to_hwq(__u64 pos)
+{
+ return pos >> UBLK_BUFS_SIZE_BITS;
+}
+
+static inline __u32 ublk_pos_to_buf_offset(__u64 pos)
+{
+ return (pos & UBLK_BUFS_SIZE_MASK) & (UBLK_BUF_MAX_SIZE - 1);
+}
+
+static inline __u16 ublk_pos_to_tag(__u64 pos)
+{
+ return (pos & UBLK_BUFS_SIZE_MASK) >> UBLK_BUF_SIZE_BITS;
+}
+
+/* offset of single buffer, which has to be < UBLK_BUX_MAX_SIZE */
+static inline __u64 ublk_pos(__u16 q_id, __u16 tag, __u32 offset)
+{
+ return (((__u64)q_id) << UBLK_BUFS_SIZE_BITS) |
+ ((((__u64)tag) << UBLK_BUF_SIZE_BITS) + offset);
+}
/*
* zero copy requires 4k block size, and can remap ublk driver's io
@@ -79,6 +108,9 @@
#define UBLK_F_USER_RECOVERY_REISSUE (1UL << 4)
+/* slice based write zero copy */
+#define UBLK_F_SPLICE_ZC (1UL << 5)
+
/* device state */
#define UBLK_S_DEV_DEAD 0
#define UBLK_S_DEV_LIVE 1
--
2.31.1
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [RFC PATCH 4/4] ublk_drv: support splice based read/write zero copy
2022-11-03 8:50 ` [RFC PATCH 4/4] ublk_drv: support splice based read/write zero copy Ming Lei
@ 2022-11-03 22:28 ` Bernd Schubert
2022-11-04 0:44 ` Ming Lei
0 siblings, 1 reply; 13+ messages in thread
From: Bernd Schubert @ 2022-11-03 22:28 UTC (permalink / raw)
To: Ming Lei, Jens Axboe, io-uring
Cc: linux-block, linux-kernel, linux-fsdevel, Miklos Szeredi,
Stefan Hajnoczi, ZiyangZhang
On 11/3/22 09:50, Ming Lei wrote:
> Pass ublk block IO request pages to kernel backend IO handling code via
> pipe, and request page copy can be avoided. So far, the existed
> pipe/splice mechanism works for handling write request only.
>
> The initial idea of using splice for zero copy is from Miklos and Stefan.
>
> Read request's zero copy requires pipe's change to allow one read end to
> produce buffers for another read end to consume. The added SPLICE_F_READ_TO_READ
> flag is for supporting this feature.
>
> READ is handled by sending IORING_OP_SPLICE with SPLICE_F_DIRECT |
> SPLICE_F_READ_TO_READ. WRITE is handled by sending IORING_OP_SPLICE with
> SPLICE_F_DIRECT. Kernel internal pipe is used for simplifying userspace,
> meantime potential info leak could be avoided.
Sorry to ask, do you have an ublk branch that gives an example how to
use this?
I still have several things to fix in my branches, but I got basic fuse
uring with copies working. Adding back splice would be next after
posting rfc patches. My initial assumption was that I needed to
duplicate everything splice does into the fuse .uring_cmd handler -
obviously there is a better way with your patches.
This week I have a few days off, by end of next week or the week after I
might have patches in an rfc state (one thing I'm going to ask about is
how do I know what is the next CQE in the kernel handler - ublk does
this with tags through mq, but I don't understand yet where the tag is
increased and what the relation between tag and right CQE order is).
This got modeled a bit after ublk, but then diverged a bit.
https://github.com/aakefbs/linux/tree/fuse-uring
https://github.com/aakefbs/libfuse/tree/uring
(Again, the branches are not ready by any means for review yet).
Thanks,
Bernd
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [RFC PATCH 4/4] ublk_drv: support splice based read/write zero copy
2022-11-03 22:28 ` Bernd Schubert
@ 2022-11-04 0:44 ` Ming Lei
2022-11-04 23:37 ` Bernd Schubert
0 siblings, 1 reply; 13+ messages in thread
From: Ming Lei @ 2022-11-04 0:44 UTC (permalink / raw)
To: Bernd Schubert
Cc: Jens Axboe, io-uring, linux-block, linux-kernel, linux-fsdevel,
Miklos Szeredi, Stefan Hajnoczi, ZiyangZhang
On Thu, Nov 03, 2022 at 11:28:29PM +0100, Bernd Schubert wrote:
>
>
> On 11/3/22 09:50, Ming Lei wrote:
> > Pass ublk block IO request pages to kernel backend IO handling code via
> > pipe, and request page copy can be avoided. So far, the existed
> > pipe/splice mechanism works for handling write request only.
> >
> > The initial idea of using splice for zero copy is from Miklos and Stefan.
> >
> > Read request's zero copy requires pipe's change to allow one read end to
> > produce buffers for another read end to consume. The added SPLICE_F_READ_TO_READ
> > flag is for supporting this feature.
> >
> > READ is handled by sending IORING_OP_SPLICE with SPLICE_F_DIRECT |
> > SPLICE_F_READ_TO_READ. WRITE is handled by sending IORING_OP_SPLICE with
> > SPLICE_F_DIRECT. Kernel internal pipe is used for simplifying userspace,
> > meantime potential info leak could be avoided.
>
>
> Sorry to ask, do you have an ublk branch that gives an example how to use
> this?
Follows the ublk splice-zc branch:
https://github.com/ming1/ubdsrv/commits/splice-zc
which is mentioned in cover letter, but I guess it should be added to
here too, sorry for that, so far only ublk-loop supports it by:
ublk add -t loop -f $BACKING -z
without '-z', ublk-loop is created with zero copy disabled.
>
> I still have several things to fix in my branches, but I got basic fuse
> uring with copies working. Adding back splice would be next after posting
> rfc patches. My initial assumption was that I needed to duplicate everything
> splice does into the fuse .uring_cmd handler - obviously there is a better
> way with your patches.
>
> This week I have a few days off, by end of next week or the week after I
> might have patches in an rfc state (one thing I'm going to ask about is how
> do I know what is the next CQE in the kernel handler - ublk does this with
> tags through mq, but I don't understand yet where the tag is increased and
> what the relation between tag and right CQE order is).
tag is one attribute of io request, which is originated from ublk
driver, and it is unique for each request among one queue. So ublksrv
won't change it at all, just use it, and ublk driver guarantees that
it is unique.
In ublkserv implementation, the tag info is set in cqe->user_data, so
we can retrieve the io request via tag part of cqe->user_data.
Also I may not understand your question of 'the relation between tag and right
CQE order', io_uring provides IOSQE_IO_DRAIN/IOSQE_IO_LINK for ordering
SQE, and ublksrv only applies IOSQE_IO_LINK in ublk-qcow2, so care to
explain it in a bit details about the "the relation between tag and right
CQE order"?
Thanks,
Ming
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [RFC PATCH 4/4] ublk_drv: support splice based read/write zero copy
2022-11-04 0:44 ` Ming Lei
@ 2022-11-04 23:37 ` Bernd Schubert
2022-11-07 1:05 ` Ming Lei
0 siblings, 1 reply; 13+ messages in thread
From: Bernd Schubert @ 2022-11-04 23:37 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, io-uring, linux-block, linux-kernel, linux-fsdevel,
Miklos Szeredi, Stefan Hajnoczi, ZiyangZhang
On 11/4/22 01:44, Ming Lei wrote:
> On Thu, Nov 03, 2022 at 11:28:29PM +0100, Bernd Schubert wrote:
>>
>>
>> On 11/3/22 09:50, Ming Lei wrote:
>>> Pass ublk block IO request pages to kernel backend IO handling code via
>>> pipe, and request page copy can be avoided. So far, the existed
>>> pipe/splice mechanism works for handling write request only.
>>>
>>> The initial idea of using splice for zero copy is from Miklos and Stefan.
>>>
>>> Read request's zero copy requires pipe's change to allow one read end to
>>> produce buffers for another read end to consume. The added SPLICE_F_READ_TO_READ
>>> flag is for supporting this feature.
>>>
>>> READ is handled by sending IORING_OP_SPLICE with SPLICE_F_DIRECT |
>>> SPLICE_F_READ_TO_READ. WRITE is handled by sending IORING_OP_SPLICE with
>>> SPLICE_F_DIRECT. Kernel internal pipe is used for simplifying userspace,
>>> meantime potential info leak could be avoided.
>>
>>
>> Sorry to ask, do you have an ublk branch that gives an example how to use
>> this?
>
> Follows the ublk splice-zc branch:
>
> https://github.com/ming1/ubdsrv/commits/splice-zc
>
> which is mentioned in cover letter, but I guess it should be added to
> here too, sorry for that, so far only ublk-loop supports it by:
>
> ublk add -t loop -f $BACKING -z
>
> without '-z', ublk-loop is created with zero copy disabled.
Ah, thanks a lot! And sorry, I had missed this part in the cover letter.
I will take a look on your new zero copy code on Monday.
>
>>
>> I still have several things to fix in my branches, but I got basic fuse
>> uring with copies working. Adding back splice would be next after posting
>> rfc patches. My initial assumption was that I needed to duplicate everything
>> splice does into the fuse .uring_cmd handler - obviously there is a better
>> way with your patches.
>>
>> This week I have a few days off, by end of next week or the week after I
>> might have patches in an rfc state (one thing I'm going to ask about is how
>> do I know what is the next CQE in the kernel handler - ublk does this with
>> tags through mq, but I don't understand yet where the tag is increased and
>> what the relation between tag and right CQE order is).
>
> tag is one attribute of io request, which is originated from ublk
> driver, and it is unique for each request among one queue. So ublksrv
> won't change it at all, just use it, and ublk driver guarantees that
> it is unique.
>
> In ublkserv implementation, the tag info is set in cqe->user_data, so
> we can retrieve the io request via tag part of cqe->user_data.
Yeah, this is the easy part I understood. At least I hope so :)
>
> Also I may not understand your question of 'the relation between tag and right
> CQE order', io_uring provides IOSQE_IO_DRAIN/IOSQE_IO_LINK for ordering
> SQE, and ublksrv only applies IOSQE_IO_LINK in ublk-qcow2, so care to
> explain it in a bit details about the "the relation between tag and right
> CQE order"?
For fuse (kernel) a vfs request comes in and I need to choose a command
in the ring queue. Right now this is just an atomic counter % queue_size
fuse_request_alloc_ring()
req_cnt = atomic_inc_return(&queue->req_cnt);
tag = req_cnt & (fc->ring.queue_depth - 1); /* cnt % queue_depth */
ring_req = &queue->ring_req[tag];
I might be wrong, but I think that can be compared a bit to
ublk_queue_rq(). Looks like ublk_queue_rq gets called in blk-mq context
and blk-mq seems to provide rq->tag, which then determines the command
in the ring queue - completion of commands is done in tag-order provided
by blk-mq? The part I didn't figure out yet is where the tag value gets set.
Also interesting is that there is no handler if the ring is already full
- like the ublk_io command is currently busy in ublksrv (user space).
Handled auto-magically with blk-mq?
This is one of the parts not handled in my fuse code yet and my current
plan is to have a request queue on top of the (per core) ring queues.
Similar to the existing fuse request queue, just not one, but per ring
queue and processed by the ring queue. Unless there is a better way -
which is another reason to understand how ublk handles this.
Thanks,
Bernd
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [RFC PATCH 4/4] ublk_drv: support splice based read/write zero copy
2022-11-04 23:37 ` Bernd Schubert
@ 2022-11-07 1:05 ` Ming Lei
0 siblings, 0 replies; 13+ messages in thread
From: Ming Lei @ 2022-11-07 1:05 UTC (permalink / raw)
To: Bernd Schubert
Cc: Jens Axboe, io-uring, linux-block, linux-kernel, linux-fsdevel,
Miklos Szeredi, Stefan Hajnoczi, ZiyangZhang
On Sat, Nov 05, 2022 at 12:37:21AM +0100, Bernd Schubert wrote:
>
>
> On 11/4/22 01:44, Ming Lei wrote:
> > On Thu, Nov 03, 2022 at 11:28:29PM +0100, Bernd Schubert wrote:
> > >
> > >
> > > On 11/3/22 09:50, Ming Lei wrote:
> > > > Pass ublk block IO request pages to kernel backend IO handling code via
> > > > pipe, and request page copy can be avoided. So far, the existed
> > > > pipe/splice mechanism works for handling write request only.
> > > >
> > > > The initial idea of using splice for zero copy is from Miklos and Stefan.
> > > >
> > > > Read request's zero copy requires pipe's change to allow one read end to
> > > > produce buffers for another read end to consume. The added SPLICE_F_READ_TO_READ
> > > > flag is for supporting this feature.
> > > >
> > > > READ is handled by sending IORING_OP_SPLICE with SPLICE_F_DIRECT |
> > > > SPLICE_F_READ_TO_READ. WRITE is handled by sending IORING_OP_SPLICE with
> > > > SPLICE_F_DIRECT. Kernel internal pipe is used for simplifying userspace,
> > > > meantime potential info leak could be avoided.
> > >
> > >
> > > Sorry to ask, do you have an ublk branch that gives an example how to use
> > > this?
> >
> > Follows the ublk splice-zc branch:
> >
> > https://github.com/ming1/ubdsrv/commits/splice-zc
> >
> > which is mentioned in cover letter, but I guess it should be added to
> > here too, sorry for that, so far only ublk-loop supports it by:
> >
> > ublk add -t loop -f $BACKING -z
> >
> > without '-z', ublk-loop is created with zero copy disabled.
>
> Ah, thanks a lot! And sorry, I had missed this part in the cover letter.
>
> I will take a look on your new zero copy code on Monday.
>
>
> >
> > >
> > > I still have several things to fix in my branches, but I got basic fuse
> > > uring with copies working. Adding back splice would be next after posting
> > > rfc patches. My initial assumption was that I needed to duplicate everything
> > > splice does into the fuse .uring_cmd handler - obviously there is a better
> > > way with your patches.
> > >
> > > This week I have a few days off, by end of next week or the week after I
> > > might have patches in an rfc state (one thing I'm going to ask about is how
> > > do I know what is the next CQE in the kernel handler - ublk does this with
> > > tags through mq, but I don't understand yet where the tag is increased and
> > > what the relation between tag and right CQE order is).
> >
> > tag is one attribute of io request, which is originated from ublk
> > driver, and it is unique for each request among one queue. So ublksrv
> > won't change it at all, just use it, and ublk driver guarantees that
> > it is unique.
> >
> > In ublkserv implementation, the tag info is set in cqe->user_data, so
> > we can retrieve the io request via tag part of cqe->user_data.
>
> Yeah, this is the easy part I understood. At least I hope so :)
>
> >
> > Also I may not understand your question of 'the relation between tag and right
> > CQE order', io_uring provides IOSQE_IO_DRAIN/IOSQE_IO_LINK for ordering
> > SQE, and ublksrv only applies IOSQE_IO_LINK in ublk-qcow2, so care to
> > explain it in a bit details about the "the relation between tag and right
> > CQE order"?
>
>
> For fuse (kernel) a vfs request comes in and I need to choose a command in
> the ring queue. Right now this is just an atomic counter % queue_size
>
> fuse_request_alloc_ring()
> req_cnt = atomic_inc_return(&queue->req_cnt);
> tag = req_cnt & (fc->ring.queue_depth - 1); /* cnt % queue_depth */
>
> ring_req = &queue->ring_req[tag];
>
>
>
> I might be wrong, but I think that can be compared a bit to ublk_queue_rq().
> Looks like ublk_queue_rq gets called in blk-mq context and blk-mq seems to
> provide rq->tag, which then determines the command in the ring queue -
> completion of commands is done in tag-order provided by blk-mq? The part I
The two are not related, blk-mq tag number means nothing wrt. io
handling order:
- tag is allocated via sbitmap, which may return tag number in any
order, you may think the returned number is just random
- blk-mq may re-order requests and dispatch them with any order
- once requests are issued to io_uring, userspace may handles these IOs
with any order
- after backend io is queued via io_uring or libaio or whatever to kernel, it
could be completed at any order
> didn't figure out yet is where the tag value gets set.
> Also interesting is that there is no handler if the ring is already full -
> like the ublk_io command is currently busy in ublksrv (user space). Handled
> auto-magically with blk-mq?
For ublk, the queue has fixed depth, so the pre-allocated io_uring size is
enough, and blk-mq can throttle IOs from the beginning if the max queue depth is
reached, so ublk needn't to worry about io_uring size/depth.
But fuse may have to consider request throttle.
Thanks,
Ming
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [RFC PATCH 0/4] io_uring/splice: extend splice for supporting ublk zero copy
2022-11-03 8:50 [RFC PATCH 0/4] io_uring/splice: extend splice for supporting ublk zero copy Ming Lei
` (3 preceding siblings ...)
2022-11-03 8:50 ` [RFC PATCH 4/4] ublk_drv: support splice based read/write zero copy Ming Lei
@ 2022-11-04 9:15 ` Ziyang Zhang
4 siblings, 0 replies; 13+ messages in thread
From: Ziyang Zhang @ 2022-11-04 9:15 UTC (permalink / raw)
To: Ming Lei
Cc: linux-block, io-uring, linux-kernel, linux-fsdevel, Jens Axboe,
Miklos Szeredi, Stefan Hajnoczi
On 2022/11/3 16:50, Ming Lei wrote:
> Hello Guys,
>
> This patch extends io_uring/splice by adding two flags(SPLICE_F_DIRECT &
> SPLICE_F_READ_TO_READ) for supporting ublk zero copy, and fuse could benefit
> from the change too.
>
> - SPLICE_F_DIRECT is for using do_splice_direct() to support zero copy
>
> - SPLICE_F_READ_TO_READ is for supporting ublk READ zero copy, the plain
> splice can support WRITE zero copy by:
>
> splice(ublkc_fd, ublkc_pos, pipe_wr_fd, NULL, len, flags)
> splice(pipe_rd_fd, NULL, backing_fd, backing_off, len, flags)
>
> but can't support READ zc. Extend splice to allow to splice from the 1st
> ->splice_read()(producer) to the 2nd ->splice_read()(consumer), then READ
> request pages reference can be transferred to backing IO code path.
>
> The initial idea is suggested by Miklos Szeredi & Stefan Hajnoczi.
>
> The patchset has been verified basically by ublk builtin tests(loop/008,
> loop/009, generic/003), and basic mount, git clone, kernel building, umount
> tests on ublk-loop[1] which is created by 'ublk add -t loop -f $backing -z'.
>
> The next step is to allow io_uring to run do_splice_direct*()
> in async style like normal async RW instead of offloading to
> iowq context, so that top performance can be reached, and that
> depends on current work.
>
> Any comments are welcome.
>
> [1] https://github.com/ming1/ubdsrv/commits/splice-zc
>
>
> Ming Lei (4):
> io_uring/splice: support do_splice_direct
> fs/splice: add helper of splice_dismiss_pipe()
> io_uring/splice: support splice from ->splice_read to ->splice_read
> ublk_drv: support splice based read/write zero copy
>
> drivers/block/ublk_drv.c | 151 +++++++++++++++++++++++++-
> fs/read_write.c | 5 +-
> fs/splice.c | 193 +++++++++++++++++++++++++++++-----
> include/linux/fs.h | 2 +
> include/linux/pipe_fs_i.h | 9 ++
> include/linux/splice.h | 14 +++
> include/uapi/linux/ublk_cmd.h | 34 +++++-
> io_uring/splice.c | 16 ++-
> 8 files changed, 392 insertions(+), 32 deletions(-)
>
Hi, Ming
I have quickly scanned the code. It seems like biovec pages are successfully
passed to the backend for READ/WRITE. I will learn your code and run some
tests. I will give more feedback next week.
Regards,
Zhang
^ permalink raw reply [flat|nested] 13+ messages in thread