* [PATCHSET RFC 0/5] file_operations based io_uring commands
@ 2021-01-27 21:25 Jens Axboe
2021-01-27 21:25 ` [PATCH 1/5] fs: add file_operations->uring_cmd() Jens Axboe
` (4 more replies)
0 siblings, 5 replies; 16+ messages in thread
From: Jens Axboe @ 2021-01-27 21:25 UTC (permalink / raw)
To: io-uring
Hi,
This is a concept I've been toying with; some (including myself) have
referred to it as a way to do ioctls over io_uring, and indeed it has
many similarities with that. The purpose of the patchset is to have a
file-private command type. io_uring doesn't know or care what is in it;
only the end target is able to interpret it. In that sense it's similar
to an ioctl, which shares the same trait. io_uring just provides all
the infrastructure to pass the commands back and forth, etc.
At the core of it, we overlay a struct type on top of io_uring_sqe.
That's basically like other commands, except this one is a bit more
brutal about how much of the sqe it claims. That leaves us with 32 bytes
that we can use, 8 bytes that we can't (as they overlap with
->user_data), and then 8 bytes that are usable again. Hence there are 40
bytes available for a command; anything larger than that should be
allocated as per usual for request types.
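For illustration, a target's command struct overlaid on the sqe might
look like this (a sketch only; all field names are made up, and only
the reserved slot is dictated by the ->user_data overlap):

struct example_uring_cmd {
	/* first 32 bytes are free for the target to define */
	__u16	op;		/* private sub-command */
	__u16	pad;
	__u32	flags;
	__u64	addr;
	__u64	unused[2];
	/* these 8 bytes overlap ->user_data, hands off */
	__u64	reserved;
	/* final 8 usable bytes */
	__u64	data2;
};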
The file_operations structure is augmented with a ->uring_cmd()
handler. A file type must implement this operation to support these
kinds of commands; otherwise the request fails with -EOPNOTSUPP. The
handler can either queue the operation async and signal completion when
done, or complete it inline (either successfully, or in error).
Proof of concept added for raw block and network commands, just to
show how it could work. Nothing exciting/interesting yet, just a
way to test it out.
This is very much sent out for comments/review of the concept itself.
There are a host of things that could be implemented with this, like
raw device access, new APIs (network zero copy additions), etc.
I'm not a huge fan of the command overlay, but I do like the fact that
we can do allocation-free commands as long as we stay at <= 40 bytes. So
maybe it's just a pill that we have to swallow. Or maybe there are
other great ideas for how to do this. It does mean that we overlay
much of the sqe, but most of that wouldn't make sense for private
commands anyway.
Anyway, comments welcome. This is kept on the io_uring list for now
until the details have been hashed out.
--
Jens Axboe
* [PATCH 1/5] fs: add file_operations->uring_cmd()
2021-01-27 21:25 [PATCHSET RFC 0/5] file_operations based io_uring commands Jens Axboe
@ 2021-01-27 21:25 ` Jens Axboe
2021-01-27 21:25 ` [PATCH 2/5] io_uring: add support for IORING_OP_URING_CMD Jens Axboe
` (3 subsequent siblings)
4 siblings, 0 replies; 16+ messages in thread
From: Jens Axboe @ 2021-01-27 21:25 UTC (permalink / raw)
To: io-uring; +Cc: Jens Axboe
This adds a file-private handler, similar to an ioctl but hopefully a
lot saner and more useful.
Signed-off-by: Jens Axboe <[email protected]>
---
include/linux/fs.h | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index fd47deea7c17..b596039966da 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1816,6 +1816,11 @@ struct dir_context {
#define REMAP_FILE_ADVISORY (REMAP_FILE_CAN_SHORTEN)
struct iov_iter;
+struct io_uring_cmd;
+
+enum io_uring_cmd_flags {
+ IO_URING_F_NONBLOCK = 1,
+};
struct file_operations {
struct module *owner;
@@ -1857,6 +1862,8 @@ struct file_operations {
struct file *file_out, loff_t pos_out,
loff_t len, unsigned int remap_flags);
int (*fadvise)(struct file *, loff_t, loff_t, int);
+
+ int (*uring_cmd)(struct io_uring_cmd *, enum io_uring_cmd_flags);
} __randomize_layout;
struct inode_operations {
--
2.30.0
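As a sketch of what implementing this hook looks like (a hypothetical
driver using the struct io_uring_cmd the next patch defines; all foo_*
names are made up), a handler that completes inline just returns the
result:

static int foo_uring_cmd(struct io_uring_cmd *cmd,
			 enum io_uring_cmd_flags flags)
{
	struct foo_cmd *fcmd = (struct foo_cmd *) &cmd->pdu;

	switch (fcmd->op) {
	case FOO_OP_GET_FEATURES:
		/* inline completion: the return value ends up in cqe->res */
		return foo_read_features(cmd->file->private_data);
	default:
		return -ENOTTY;
	}
}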
* [PATCH 2/5] io_uring: add support for IORING_OP_URING_CMD
2021-01-27 21:25 [PATCHSET RFC 0/5] file_operations based io_uring commands Jens Axboe
2021-01-27 21:25 ` [PATCH 1/5] fs: add file_operations->uring_cmd() Jens Axboe
@ 2021-01-27 21:25 ` Jens Axboe
2021-01-28 0:38 ` Darrick J. Wong
2021-01-27 21:25 ` [PATCH 3/5] block: wire up support for file_operations->uring_cmd() Jens Axboe
` (2 subsequent siblings)
4 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2021-01-27 21:25 UTC (permalink / raw)
To: io-uring; +Cc: Jens Axboe
This is a file private kind of request. io_uring doesn't know what's
in this command type, it's for the file_operations->uring_cmd()
handler to deal with.
Signed-off-by: Jens Axboe <[email protected]>
---
fs/io_uring.c | 59 +++++++++++++++++++++++++++++++++++
include/linux/io_uring.h | 12 +++++++
include/uapi/linux/io_uring.h | 1 +
3 files changed, 72 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 03748faa5295..55c2714a591e 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -712,6 +712,7 @@ struct io_kiocb {
struct io_shutdown shutdown;
struct io_rename rename;
struct io_unlink unlink;
+ struct io_uring_cmd uring_cmd;
/* use only after cleaning per-op data, see io_clean_op() */
struct io_completion compl;
};
@@ -805,6 +806,8 @@ struct io_op_def {
unsigned needs_async_data : 1;
/* should block plug */
unsigned plug : 1;
+ /* doesn't support personality */
+ unsigned no_personality : 1;
/* size of async data needed, if any */
unsigned short async_size;
unsigned work_flags;
@@ -998,6 +1001,11 @@ static const struct io_op_def io_op_defs[] = {
.work_flags = IO_WQ_WORK_MM | IO_WQ_WORK_FILES |
IO_WQ_WORK_FS | IO_WQ_WORK_BLKCG,
},
+ [IORING_OP_URING_CMD] = {
+ .needs_file = 1,
+ .no_personality = 1,
+ .work_flags = IO_WQ_WORK_MM | IO_WQ_WORK_BLKCG,
+ },
};
enum io_mem_account {
@@ -3797,6 +3805,47 @@ static int io_unlinkat(struct io_kiocb *req, bool force_nonblock)
return 0;
}
+static void io_uring_cmd_done(struct io_uring_cmd *cmd, ssize_t ret)
+{
+ struct io_kiocb *req = container_of(cmd, struct io_kiocb, uring_cmd);
+
+ if (ret < 0)
+ req_set_fail_links(req);
+ io_req_complete(req, ret);
+}
+
+static int io_uring_cmd_prep(struct io_kiocb *req,
+ const struct io_uring_sqe *sqe)
+{
+ struct io_uring_cmd *cmd = &req->uring_cmd;
+
+ if (!req->file->f_op->uring_cmd)
+ return -EOPNOTSUPP;
+
+ memcpy(&cmd->pdu, (void *) &sqe->off, sizeof(cmd->pdu));
+ cmd->done = io_uring_cmd_done;
+ return 0;
+}
+
+static int io_uring_cmd(struct io_kiocb *req, bool force_nonblock)
+{
+ enum io_uring_cmd_flags flags = 0;
+ struct file *file = req->file;
+ int ret;
+
+ if (force_nonblock)
+ flags |= IO_URING_F_NONBLOCK;
+
+ ret = file->f_op->uring_cmd(&req->uring_cmd, flags);
+ /* queued async, consumer will call ->done() when complete */
+ if (ret == -EIOCBQUEUED)
+ return 0;
+ else if (ret < 0)
+ req_set_fail_links(req);
+ io_req_complete(req, ret);
+ return 0;
+}
+
static int io_shutdown_prep(struct io_kiocb *req,
const struct io_uring_sqe *sqe)
{
@@ -6093,6 +6142,8 @@ static int io_req_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
return io_renameat_prep(req, sqe);
case IORING_OP_UNLINKAT:
return io_unlinkat_prep(req, sqe);
+ case IORING_OP_URING_CMD:
+ return io_uring_cmd_prep(req, sqe);
}
printk_once(KERN_WARNING "io_uring: unhandled opcode %d\n",
@@ -6351,6 +6402,9 @@ static int io_issue_sqe(struct io_kiocb *req, bool force_nonblock,
case IORING_OP_UNLINKAT:
ret = io_unlinkat(req, force_nonblock);
break;
+ case IORING_OP_URING_CMD:
+ ret = io_uring_cmd(req, force_nonblock);
+ break;
default:
ret = -EINVAL;
break;
@@ -6865,6 +6919,9 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
if (id) {
struct io_identity *iod;
+ if (io_op_defs[req->opcode].no_personality)
+ return -EINVAL;
+
iod = idr_find(&ctx->personality_idr, id);
if (unlikely(!iod))
return -EINVAL;
@@ -10260,6 +10317,8 @@ static int __init io_uring_init(void)
BUILD_BUG_SQE_ELEM(40, __u16, buf_index);
BUILD_BUG_SQE_ELEM(42, __u16, personality);
BUILD_BUG_SQE_ELEM(44, __s32, splice_fd_in);
+ BUILD_BUG_ON(offsetof(struct io_uring_sqe, user_data) !=
+ offsetof(struct io_uring_pdu, reserved));
BUILD_BUG_ON(ARRAY_SIZE(io_op_defs) != IORING_OP_LAST);
BUILD_BUG_ON(__REQ_F_LAST_BIT >= 8 * sizeof(int));
diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
index 35b2d845704d..e4e822d86e22 100644
--- a/include/linux/io_uring.h
+++ b/include/linux/io_uring.h
@@ -34,6 +34,18 @@ struct io_uring_task {
bool sqpoll;
};
+struct io_uring_pdu {
+ __u64 data[4]; /* available for free use */
+ __u64 reserved; /* can't be used by application! */
+	__u64 data2;	/* available for free use */
+};
+
+struct io_uring_cmd {
+ struct file *file;
+ struct io_uring_pdu pdu;
+ void (*done)(struct io_uring_cmd *, ssize_t);
+};
+
#if defined(CONFIG_IO_URING)
struct sock *io_uring_get_socket(struct file *file);
void __io_uring_task_cancel(void);
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index ac4e1738a9af..0a0de40a3a5c 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -137,6 +137,7 @@ enum {
IORING_OP_SHUTDOWN,
IORING_OP_RENAMEAT,
IORING_OP_UNLINKAT,
+ IORING_OP_URING_CMD,
/* this goes last, obviously */
IORING_OP_LAST,
--
2.30.0
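To sketch the async side of the contract (hypothetical driver, foo_*
names made up): the handler queues the command, returns -EIOCBQUEUED,
and the completion path calls ->done() with the result:

struct foo_req {
	struct io_uring_cmd *cmd;
	ssize_t result;
};

static void foo_hw_complete(struct foo_req *frq)
{
	/* e.g. called from IRQ context; posts the CQE with this result */
	frq->cmd->done(frq->cmd, frq->result);
}

static int foo_uring_cmd(struct io_uring_cmd *cmd,
			 enum io_uring_cmd_flags flags)
{
	struct foo_req *frq;

	frq = foo_alloc_req(cmd, flags & IO_URING_F_NONBLOCK);
	if (!frq)
		return -ENOMEM;
	foo_hw_queue(frq);
	return -EIOCBQUEUED;	/* ->done() will be called on completion */
}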
* [PATCH 3/5] block: wire up support for file_operations->uring_cmd()
2021-01-27 21:25 [PATCHSET RFC 0/5] file_operations based io_uring commands Jens Axboe
2021-01-27 21:25 ` [PATCH 1/5] fs: add file_operations->uring_cmd() Jens Axboe
2021-01-27 21:25 ` [PATCH 2/5] io_uring: add support for IORING_OP_URING_CMD Jens Axboe
@ 2021-01-27 21:25 ` Jens Axboe
2021-01-27 21:25 ` [PATCH 4/5] block: add example ioctl Jens Axboe
2021-01-27 21:25 ` [PATCH 5/5] net: wire up support for file_operations->uring_cmd() Jens Axboe
4 siblings, 0 replies; 16+ messages in thread
From: Jens Axboe @ 2021-01-27 21:25 UTC (permalink / raw)
To: io-uring; +Cc: Jens Axboe
Signed-off-by: Jens Axboe <[email protected]>
---
block/blk-mq.c | 11 +++++++++++
fs/block_dev.c | 10 ++++++++++
include/linux/blk-mq.h | 6 ++++++
include/linux/blkdev.h | 1 +
4 files changed, 28 insertions(+)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index f285a9123a8b..00114f838ac2 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3893,6 +3893,17 @@ int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
}
EXPORT_SYMBOL_GPL(blk_poll);
+int blk_uring_cmd(struct block_device *bdev, struct io_uring_cmd *cmd,
+ bool force_nonblock)
+{
+ struct request_queue *q = bdev_get_queue(bdev);
+
+ if (!q->mq_ops || !q->mq_ops->uring_cmd)
+ return -EINVAL;
+
+ return q->mq_ops->uring_cmd(q, cmd, force_nonblock);
+}
+
unsigned int blk_mq_rq_cpu(struct request *rq)
{
return rq->mq_ctx->cpu;
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 3b8963e228a1..c837912c1d72 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -34,6 +34,7 @@
#include <linux/part_stat.h>
#include <linux/uaccess.h>
#include <linux/suspend.h>
+#include <linux/io_uring.h>
#include "internal.h"
struct bdev_inode {
@@ -301,6 +302,14 @@ struct blkdev_dio {
static struct bio_set blkdev_dio_pool;
+static int blkdev_uring_cmd(struct io_uring_cmd *cmd,
+ enum io_uring_cmd_flags flags)
+{
+ struct block_device *bdev = I_BDEV(cmd->file->f_mapping->host);
+
+ return blk_uring_cmd(bdev, cmd, flags);
+}
+
static int blkdev_iopoll(struct kiocb *kiocb, bool wait)
{
struct block_device *bdev = I_BDEV(kiocb->ki_filp->f_mapping->host);
@@ -1825,6 +1834,7 @@ const struct file_operations def_blk_fops = {
.splice_read = generic_file_splice_read,
.splice_write = iter_file_splice_write,
.fallocate = blkdev_fallocate,
+ .uring_cmd = blkdev_uring_cmd,
};
/**
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index d705b174d346..ddc0e1f07548 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -380,6 +380,12 @@ struct blk_mq_ops {
*/
int (*map_queues)(struct blk_mq_tag_set *set);
+ /**
+ * @uring_cmd: queues requests through io_uring IORING_OP_URING_CMD
+ */
+ int (*uring_cmd)(struct request_queue *q, struct io_uring_cmd *cmd,
+ bool force_nonblock);
+
#ifdef CONFIG_BLK_DEBUG_FS
/**
* @show_rq: Used by the debugfs implementation to show driver-specific
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index f94ee3089e01..d5af592d73fe 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -960,6 +960,7 @@ int blk_status_to_errno(blk_status_t status);
blk_status_t errno_to_blk_status(int errno);
int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin);
+int blk_uring_cmd(struct block_device *bdev, struct io_uring_cmd *cmd, bool force_nonblock);
static inline struct request_queue *bdev_get_queue(struct block_device *bdev)
{
--
2.30.0
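No driver wires up ->uring_cmd in this patch; for illustration only,
hooking it up in a blk-mq driver would look roughly like the following
(all mydrv_* names are hypothetical):

static int mydrv_uring_cmd(struct request_queue *q, struct io_uring_cmd *cmd,
			   bool force_nonblock)
{
	struct mydrv_data *drv = q->queuedata;

	/* decode the private pdu; an inline result becomes cqe->res */
	return mydrv_handle_cmd(drv, (void *) &cmd->pdu, force_nonblock);
}

static const struct blk_mq_ops mydrv_mq_ops = {
	.queue_rq	= mydrv_queue_rq,
	.uring_cmd	= mydrv_uring_cmd,
};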
* [PATCH 4/5] block: add example ioctl
2021-01-27 21:25 [PATCHSET RFC 0/5] file_operations based io_uring commands Jens Axboe
` (2 preceding siblings ...)
2021-01-27 21:25 ` [PATCH 3/5] block: wire up support for file_operations->uring_cmd() Jens Axboe
@ 2021-01-27 21:25 ` Jens Axboe
2021-01-27 21:25 ` [PATCH 5/5] net: wire up support for file_operations->uring_cmd() Jens Axboe
4 siblings, 0 replies; 16+ messages in thread
From: Jens Axboe @ 2021-01-27 21:25 UTC (permalink / raw)
To: io-uring; +Cc: Jens Axboe
Example code, to issue BLKBSZGET through IORING_OP_URING_CMD:
struct block_uring_cmd {
__u16 op;
__u16 pad;
union {
__u32 size;
__u32 ioctl_cmd;
};
__u64 addr;
__u64 unused[2];
__u64 reserved; /* can never be used */
__u64 unused2;
};
static int get_bs(struct io_uring *ring, const char *dev)
{
struct io_uring_cqe *cqe;
struct io_uring_sqe *sqe;
struct block_uring_cmd *cmd;
int ret, fd;
fd = open(dev, O_RDONLY);
if (fd < 0) {
perror("open");
return 1;
}
sqe = io_uring_get_sqe(ring);
if (!sqe) {
fprintf(stderr, "get sqe failed\n");
goto err;
}
memset(sqe, 0, sizeof(*sqe));
sqe->opcode = IORING_OP_URING_CMD;
sqe->fd = fd;
	cmd = (void *) &sqe->off;	/* command payload overlays the sqe from ->off */
cmd->op = BLOCK_URING_OP_IOCTL;
cmd->ioctl_cmd = BLKBSZGET;
sqe->user_data = 0x1234;
ret = io_uring_submit(ring);
if (ret <= 0) {
fprintf(stderr, "sqe submit failed: %d\n", ret);
goto err;
}
ret = io_uring_wait_cqe(ring, &cqe);
if (ret < 0) {
fprintf(stderr, "wait completion %d\n", ret);
goto err;
}
	printf("bs=%d\n", cqe->res);
	io_uring_cqe_seen(ring, cqe);
	close(fd);
	return 0;
err:
	close(fd);
	return 1;
}
Signed-off-by: Jens Axboe <[email protected]>
---
fs/block_dev.c | 19 +++++++++++++++++++
include/linux/blkdev.h | 17 +++++++++++++++++
2 files changed, 36 insertions(+)
diff --git a/fs/block_dev.c b/fs/block_dev.c
index c837912c1d72..7cb1b24ebbb5 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -302,10 +302,29 @@ struct blkdev_dio {
static struct bio_set blkdev_dio_pool;
+static int blkdev_uring_ioctl(struct block_device *bdev,
+ struct block_uring_cmd *bcmd)
+{
+ switch (bcmd->ioctl_cmd) {
+ case BLKBSZGET:
+ return block_size(bdev);
+ default:
+ return -ENOTTY;
+ }
+}
+
static int blkdev_uring_cmd(struct io_uring_cmd *cmd,
enum io_uring_cmd_flags flags)
{
struct block_device *bdev = I_BDEV(cmd->file->f_mapping->host);
+ struct block_uring_cmd *bcmd = (struct block_uring_cmd *) &cmd->pdu;
+
+ switch (bcmd->op) {
+ case BLOCK_URING_OP_IOCTL:
+ return blkdev_uring_ioctl(bdev, bcmd);
+ default:
+ break;
+ }
return blk_uring_cmd(bdev, cmd, flags);
}
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index d5af592d73fe..48ac8ccbffe2 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -44,6 +44,23 @@ struct blk_queue_stats;
struct blk_stat_callback;
struct blk_keyslot_manager;
+enum {
+ BLOCK_URING_OP_IOCTL = 1,
+};
+
+struct block_uring_cmd {
+ __u16 op;
+ __u16 pad;
+ union {
+ __u32 size;
+ __u32 ioctl_cmd;
+ };
+ __u64 addr;
+ __u64 unused[2];
+ __u64 reserved; /* can never be used */
+ __u64 unused2;
+};
+
#define BLKDEV_MIN_RQ 4
#define BLKDEV_MAX_RQ 128 /* Default maximum */
--
2.30.0
* [PATCH 5/5] net: wire up support for file_operations->uring_cmd()
2021-01-27 21:25 [PATCHSET RFC 0/5] file_operations based io_uring commands Jens Axboe
` (3 preceding siblings ...)
2021-01-27 21:25 ` [PATCH 4/5] block: add example ioctl Jens Axboe
@ 2021-01-27 21:25 ` Jens Axboe
4 siblings, 0 replies; 16+ messages in thread
From: Jens Axboe @ 2021-01-27 21:25 UTC (permalink / raw)
To: io-uring; +Cc: Jens Axboe
Add support for SOCKET_URING_OP_SIOCINQ and SOCKET_URING_OP_SIOCOUTQ
for TCP, UDP, and raw sockets, on both IPv4 and IPv6.
Signed-off-by: Jens Axboe <[email protected]>
---
include/net/raw.h | 3 +++
include/net/sock.h | 3 +++
include/net/tcp.h | 2 ++
include/net/udp.h | 2 ++
include/uapi/linux/net.h | 12 ++++++++++++
net/ipv4/raw.c | 23 +++++++++++++++++++++++
net/ipv4/tcp.c | 33 +++++++++++++++++++++++++++++++++
net/ipv4/tcp_ipv4.c | 1 +
net/ipv4/udp.c | 14 ++++++++++++++
net/ipv6/raw.c | 1 +
net/ipv6/tcp_ipv6.c | 1 +
net/ipv6/udp.c | 1 +
12 files changed, 96 insertions(+)
diff --git a/include/net/raw.h b/include/net/raw.h
index 8ad8df594853..fbaa123c2458 100644
--- a/include/net/raw.h
+++ b/include/net/raw.h
@@ -82,4 +82,7 @@ static inline bool raw_sk_bound_dev_eq(struct net *net, int bound_dev_if,
#endif
}
+int raw_uring_cmd(struct sock *sk, struct sock_uring_cmd *scmd,
+ enum io_uring_cmd_flags flags);
+
#endif /* _RAW_H */
diff --git a/include/net/sock.h b/include/net/sock.h
index 129d200bccb4..d0fbd366887a 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1146,6 +1146,9 @@ struct proto {
int (*ioctl)(struct sock *sk, int cmd,
unsigned long arg);
+ int (*uring_cmd)(struct sock *sk,
+ struct sock_uring_cmd *scmd,
+ enum io_uring_cmd_flags flags);
int (*init)(struct sock *sk);
void (*destroy)(struct sock *sk);
void (*shutdown)(struct sock *sk, int how);
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 78d13c88720f..8174d9752e52 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -350,6 +350,8 @@ void tcp_twsk_destructor(struct sock *sk);
ssize_t tcp_splice_read(struct socket *sk, loff_t *ppos,
struct pipe_inode_info *pipe, size_t len,
unsigned int flags);
+int tcp_uring_cmd(struct sock *sk, struct sock_uring_cmd *scmd,
+ enum io_uring_cmd_flags flags);
void tcp_enter_quickack_mode(struct sock *sk, unsigned int max_quickacks);
static inline void tcp_dec_quickack_mode(struct sock *sk,
diff --git a/include/net/udp.h b/include/net/udp.h
index 877832bed471..362224fe83fe 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -326,6 +326,8 @@ struct sock *__udp6_lib_lookup(struct net *net,
struct sk_buff *skb);
struct sock *udp6_lib_lookup_skb(const struct sk_buff *skb,
__be16 sport, __be16 dport);
+int udp_uring_cmd(struct sock *sk, struct sock_uring_cmd *scmd,
+ enum io_uring_cmd_flags flags);
/* UDP uses skb->dev_scratch to cache as much information as possible and avoid
* possibly multiple cache miss on dequeue()
diff --git a/include/uapi/linux/net.h b/include/uapi/linux/net.h
index 4dabec6bd957..629e5df40858 100644
--- a/include/uapi/linux/net.h
+++ b/include/uapi/linux/net.h
@@ -55,4 +55,16 @@ typedef enum {
#define __SO_ACCEPTCON (1 << 16) /* performed a listen */
+enum {
+ SOCKET_URING_OP_SIOCINQ = 0,
+ SOCKET_URING_OP_SIOCOUTQ,
+};
+
+struct sock_uring_cmd {
+ __u16 op;
+ __u16 unused[13];
+ __u64 reserved; /* will be overwritten by core */
+ __u64 unused2;
+};
+
#endif /* _UAPI_LINUX_NET_H */
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 50a73178d63a..106200033a60 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -878,6 +878,28 @@ static int raw_getsockopt(struct sock *sk, int level, int optname,
return do_raw_getsockopt(sk, level, optname, optval, optlen);
}
+int raw_uring_cmd(struct sock *sk, struct sock_uring_cmd *scmd,
+ enum io_uring_cmd_flags flags)
+{
+ switch (scmd->op) {
+	case SOCKET_URING_OP_SIOCOUTQ:
+ return sk_wmem_alloc_get(sk);
+	case SOCKET_URING_OP_SIOCINQ: {
+ struct sk_buff *skb;
+ int amount = 0;
+
+ spin_lock_bh(&sk->sk_receive_queue.lock);
+ skb = skb_peek(&sk->sk_receive_queue);
+ if (skb)
+ amount = skb->len;
+ spin_unlock_bh(&sk->sk_receive_queue.lock);
+ return amount;
+ }
+ default:
+ return -EOPNOTSUPP;
+ }
+}
+
static int raw_ioctl(struct sock *sk, int cmd, unsigned long arg)
{
switch (cmd) {
@@ -956,6 +978,7 @@ struct proto raw_prot = {
.release_cb = ip4_datagram_release_cb,
.hash = raw_hash_sk,
.unhash = raw_unhash_sk,
+ .uring_cmd = raw_uring_cmd,
.obj_size = sizeof(struct raw_sock),
.useroffset = offsetof(struct raw_sock, filter),
.usersize = sizeof_field(struct raw_sock, filter),
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 32545ecf2ab1..3757106bc54c 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -602,6 +602,39 @@ __poll_t tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
}
EXPORT_SYMBOL(tcp_poll);
+int tcp_uring_cmd(struct sock *sk, struct sock_uring_cmd *scmd,
+ enum io_uring_cmd_flags flags)
+{
+ struct tcp_sock *tp = tcp_sk(sk);
+ bool slow;
+ int ret;
+
+ switch (scmd->op) {
+	case SOCKET_URING_OP_SIOCINQ:
+ if (sk->sk_state == TCP_LISTEN)
+ return -EINVAL;
+
+ slow = lock_sock_fast(sk);
+ ret = tcp_inq(sk);
+ unlock_sock_fast(sk, slow);
+ break;
+	case SOCKET_URING_OP_SIOCOUTQ:
+ if (sk->sk_state == TCP_LISTEN)
+ return -EINVAL;
+
+ if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV))
+ ret = 0;
+ else
+ ret = READ_ONCE(tp->write_seq) - tp->snd_una;
+ break;
+ default:
+ ret = -EOPNOTSUPP;
+ break;
+ }
+
+ return ret;
+}
+
int tcp_ioctl(struct sock *sk, int cmd, unsigned long arg)
{
struct tcp_sock *tp = tcp_sk(sk);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 777306b5bc22..ca3c2654a351 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2788,6 +2788,7 @@ struct proto tcp_prot = {
.disconnect = tcp_disconnect,
.accept = inet_csk_accept,
.ioctl = tcp_ioctl,
+ .uring_cmd = tcp_uring_cmd,
.init = tcp_v4_init_sock,
.destroy = tcp_v4_destroy_sock,
.shutdown = tcp_shutdown,
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 69ea76578abb..b3ccacf25300 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1676,6 +1676,19 @@ static int first_packet_length(struct sock *sk)
return res;
}
+int udp_uring_cmd(struct sock *sk, struct sock_uring_cmd *scmd,
+ enum io_uring_cmd_flags flags)
+{
+ switch (scmd->op) {
+	case SOCKET_URING_OP_SIOCOUTQ:
+ return sk_wmem_alloc_get(sk);
+	case SOCKET_URING_OP_SIOCINQ:
+ return max_t(int, 0, first_packet_length(sk));
+ default:
+ return -EOPNOTSUPP;
+ }
+}
+
/*
* IOCTL requests applicable to the UDP protocol
*/
@@ -2832,6 +2845,7 @@ struct proto udp_prot = {
.connect = ip4_datagram_connect,
.disconnect = udp_disconnect,
.ioctl = udp_ioctl,
+ .uring_cmd = udp_uring_cmd,
.init = udp_init_sock,
.destroy = udp_destroy_sock,
.setsockopt = udp_setsockopt,
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index 1f56d9aae589..50f1e8189482 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -1235,6 +1235,7 @@ struct proto rawv6_prot = {
.connect = ip6_datagram_connect_v6_only,
.disconnect = __udp_disconnect,
.ioctl = rawv6_ioctl,
+ .uring_cmd = raw_uring_cmd,
.init = rawv6_init_sk,
.setsockopt = rawv6_setsockopt,
.getsockopt = rawv6_getsockopt,
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 0e1509b02cb3..e86af3503a4b 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -2116,6 +2116,7 @@ struct proto tcpv6_prot = {
.disconnect = tcp_disconnect,
.accept = inet_csk_accept,
.ioctl = tcp_ioctl,
+ .uring_cmd = tcp_uring_cmd,
.init = tcp_v6_init_sock,
.destroy = tcp_v6_destroy_sock,
.shutdown = tcp_shutdown,
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index b9f3dfdd2383..881ae4a1cdda 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1701,6 +1701,7 @@ struct proto udpv6_prot = {
.connect = ip6_datagram_connect,
.disconnect = udp_disconnect,
.ioctl = udp_ioctl,
+ .uring_cmd = udp_uring_cmd,
.init = udp_init_sock,
.destroy = udpv6_destroy_sock,
.setsockopt = udpv6_setsockopt,
--
2.30.0
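Following the BLKBSZGET example from patch 4, an untested sketch of
issuing SOCKET_URING_OP_SIOCINQ from userspace with liburing (this
assumes the SOCKET_URING_OP_* constants added in this patch):

static int get_inq(struct io_uring *ring, int sockfd)
{
	struct io_uring_cqe *cqe;
	struct io_uring_sqe *sqe;
	struct sock_uring_cmd *scmd;
	int ret;

	sqe = io_uring_get_sqe(ring);
	if (!sqe)
		return 1;
	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = IORING_OP_URING_CMD;
	sqe->fd = sockfd;
	scmd = (void *) &sqe->off;	/* command overlays the sqe from ->off */
	scmd->op = SOCKET_URING_OP_SIOCINQ;
	sqe->user_data = 0x42;

	ret = io_uring_submit(ring);
	if (ret <= 0)
		return 1;
	ret = io_uring_wait_cqe(ring, &cqe);
	if (ret < 0)
		return 1;
	printf("bytes readable: %d\n", cqe->res);	/* result in cqe->res */
	io_uring_cqe_seen(ring, cqe);
	return 0;
}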
* Re: [PATCH 2/5] io_uring: add support for IORING_OP_URING_CMD
2021-01-27 21:25 ` [PATCH 2/5] io_uring: add support for IORING_OP_URING_CMD Jens Axboe
@ 2021-01-28 0:38 ` Darrick J. Wong
2021-01-28 1:45 ` Jens Axboe
0 siblings, 1 reply; 16+ messages in thread
From: Darrick J. Wong @ 2021-01-28 0:38 UTC (permalink / raw)
To: Jens Axboe; +Cc: io-uring
On Wed, Jan 27, 2021 at 02:25:38PM -0700, Jens Axboe wrote:
> This is a file private kind of request. io_uring doesn't know what's
> in this command type, it's for the file_operations->uring_cmd()
> handler to deal with.
>
> Signed-off-by: Jens Axboe <[email protected]>
> ---
> fs/io_uring.c | 59 +++++++++++++++++++++++++++++++++++
> include/linux/io_uring.h | 12 +++++++
> include/uapi/linux/io_uring.h | 1 +
> 3 files changed, 72 insertions(+)
>
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 03748faa5295..55c2714a591e 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -712,6 +712,7 @@ struct io_kiocb {
> struct io_shutdown shutdown;
> struct io_rename rename;
> struct io_unlink unlink;
> + struct io_uring_cmd uring_cmd;
> /* use only after cleaning per-op data, see io_clean_op() */
> struct io_completion compl;
> };
> @@ -805,6 +806,8 @@ struct io_op_def {
> unsigned needs_async_data : 1;
> /* should block plug */
> unsigned plug : 1;
> + /* doesn't support personality */
> + unsigned no_personality : 1;
> /* size of async data needed, if any */
> unsigned short async_size;
> unsigned work_flags;
> @@ -998,6 +1001,11 @@ static const struct io_op_def io_op_defs[] = {
> .work_flags = IO_WQ_WORK_MM | IO_WQ_WORK_FILES |
> IO_WQ_WORK_FS | IO_WQ_WORK_BLKCG,
> },
> + [IORING_OP_URING_CMD] = {
> + .needs_file = 1,
> + .no_personality = 1,
> + .work_flags = IO_WQ_WORK_MM | IO_WQ_WORK_BLKCG,
> + },
> };
>
> enum io_mem_account {
> @@ -3797,6 +3805,47 @@ static int io_unlinkat(struct io_kiocb *req, bool force_nonblock)
> return 0;
> }
>
> +static void io_uring_cmd_done(struct io_uring_cmd *cmd, ssize_t ret)
> +{
> + struct io_kiocb *req = container_of(cmd, struct io_kiocb, uring_cmd);
> +
> + if (ret < 0)
> + req_set_fail_links(req);
> + io_req_complete(req, ret);
> +}
> +
> +static int io_uring_cmd_prep(struct io_kiocb *req,
> + const struct io_uring_sqe *sqe)
> +{
> + struct io_uring_cmd *cmd = &req->uring_cmd;
> +
> + if (!req->file->f_op->uring_cmd)
> + return -EOPNOTSUPP;
> +
> + memcpy(&cmd->pdu, (void *) &sqe->off, sizeof(cmd->pdu));
Hmmm. struct io_uring_pdu is (by my count) 6x uint64_t (==48 bytes) in
size. This starts copying the pdu from byte 8 in struct io_uring_sqe,
and the sqe is 64 bytes in size.
I guess (having not played much with io_uring) that the stuff in the
first eight bytes of the sqe is header info that's common to all
io_uring operations, and hence not passed to io_uring_cmd*.
Assuming that I got that right, that means that the pdu information
doesn't actually go all the way to the end of the sqe, which currently
is just a bunch of padding. Was that intentional, or does this mean
that io_uring_pdu could actually be 8 bytes longer?
Also, I thought io_uring_sqe.user_data was supposed to coincide with
io_uring_pdu.reserved? They don't seem to...?
(I could be totally off here, fwiw.)
The reason why I'm counting bytes so stingily is that xfs_scrub issues
millions upon millions of ioctl calls to scrub an XFS. Wouldn't it be
nice if there was a way to submit a single userspace buffer to the
kernel and let it run every scrubber for that fs object in order? I
could cram all that data into the pdu struct ... if it had 56 bytes of
space.
If not, it wouldn't be a big deal to use one of the data[4] fields as a
pointer to a larger struct, but where's the fun in that? :)
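To make the byte-counting concrete, the kind of purely hypothetical
layout I have in mind, modeled loosely on struct xfs_scrub_metadata:

struct scrub_uring_cmd {
	__u32	sm_type;	/* which scrubber to run */
	__u32	sm_flags;
	__u64	sm_ino;		/* inode to scrub */
	__u32	sm_gen;
	__u32	sm_agno;
	/* ...32 more bytes of control data would still fit in 56 */
};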
Granted I'm programming speculatively in my head, not building an actual
prototype. There are all kinds of other questions I have, like, can a
uring command handler access the task struct or the userspace memory of
the process it was called from? What happens when the user is madly
pounding on ^C while uring commands are running? I should probably
figure out the answers to those questions and maybe even write/crib a
program first...
--D
> + cmd->done = io_uring_cmd_done;
> + return 0;
> +}
> +
> +static int io_uring_cmd(struct io_kiocb *req, bool force_nonblock)
> +{
> + enum io_uring_cmd_flags flags = 0;
> + struct file *file = req->file;
> + int ret;
> +
> + if (force_nonblock)
> + flags |= IO_URING_F_NONBLOCK;
> +
> + ret = file->f_op->uring_cmd(&req->uring_cmd, flags);
> + /* queued async, consumer will call ->done() when complete */
> + if (ret == -EIOCBQUEUED)
> + return 0;
> + else if (ret < 0)
> + req_set_fail_links(req);
> + io_req_complete(req, ret);
> + return 0;
> +}
> +
> static int io_shutdown_prep(struct io_kiocb *req,
> const struct io_uring_sqe *sqe)
> {
> @@ -6093,6 +6142,8 @@ static int io_req_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
> return io_renameat_prep(req, sqe);
> case IORING_OP_UNLINKAT:
> return io_unlinkat_prep(req, sqe);
> + case IORING_OP_URING_CMD:
> + return io_uring_cmd_prep(req, sqe);
> }
>
> printk_once(KERN_WARNING "io_uring: unhandled opcode %d\n",
> @@ -6351,6 +6402,9 @@ static int io_issue_sqe(struct io_kiocb *req, bool force_nonblock,
> case IORING_OP_UNLINKAT:
> ret = io_unlinkat(req, force_nonblock);
> break;
> + case IORING_OP_URING_CMD:
> + ret = io_uring_cmd(req, force_nonblock);
> + break;
> default:
> ret = -EINVAL;
> break;
> @@ -6865,6 +6919,9 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
> if (id) {
> struct io_identity *iod;
>
> + if (io_op_defs[req->opcode].no_personality)
> + return -EINVAL;
> +
> iod = idr_find(&ctx->personality_idr, id);
> if (unlikely(!iod))
> return -EINVAL;
> @@ -10260,6 +10317,8 @@ static int __init io_uring_init(void)
> BUILD_BUG_SQE_ELEM(40, __u16, buf_index);
> BUILD_BUG_SQE_ELEM(42, __u16, personality);
> BUILD_BUG_SQE_ELEM(44, __s32, splice_fd_in);
> + BUILD_BUG_ON(offsetof(struct io_uring_sqe, user_data) !=
> + offsetof(struct io_uring_pdu, reserved));
>
> BUILD_BUG_ON(ARRAY_SIZE(io_op_defs) != IORING_OP_LAST);
> BUILD_BUG_ON(__REQ_F_LAST_BIT >= 8 * sizeof(int));
> diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
> index 35b2d845704d..e4e822d86e22 100644
> --- a/include/linux/io_uring.h
> +++ b/include/linux/io_uring.h
> @@ -34,6 +34,18 @@ struct io_uring_task {
> bool sqpoll;
> };
>
> +struct io_uring_pdu {
> + __u64 data[4]; /* available for free use */
> + __u64 reserved; /* can't be used by application! */
> +	__u64 data2;	/* available for free use */
> +};
> +
> +struct io_uring_cmd {
> + struct file *file;
> + struct io_uring_pdu pdu;
> + void (*done)(struct io_uring_cmd *, ssize_t);
> +};
> +
> #if defined(CONFIG_IO_URING)
> struct sock *io_uring_get_socket(struct file *file);
> void __io_uring_task_cancel(void);
> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> index ac4e1738a9af..0a0de40a3a5c 100644
> --- a/include/uapi/linux/io_uring.h
> +++ b/include/uapi/linux/io_uring.h
> @@ -137,6 +137,7 @@ enum {
> IORING_OP_SHUTDOWN,
> IORING_OP_RENAMEAT,
> IORING_OP_UNLINKAT,
> + IORING_OP_URING_CMD,
>
> /* this goes last, obviously */
> IORING_OP_LAST,
> --
> 2.30.0
>
* Re: [PATCH 2/5] io_uring: add support for IORING_OP_URING_CMD
2021-01-28 0:38 ` Darrick J. Wong
@ 2021-01-28 1:45 ` Jens Axboe
2021-01-28 2:19 ` Jens Axboe
0 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2021-01-28 1:45 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: io-uring
On 1/27/21 5:38 PM, Darrick J. Wong wrote:
> On Wed, Jan 27, 2021 at 02:25:38PM -0700, Jens Axboe wrote:
>> This is a file private kind of request. io_uring doesn't know what's
>> in this command type, it's for the file_operations->uring_cmd()
>> handler to deal with.
>>
>> Signed-off-by: Jens Axboe <[email protected]>
>> ---
>> fs/io_uring.c | 59 +++++++++++++++++++++++++++++++++++
>> include/linux/io_uring.h | 12 +++++++
>> include/uapi/linux/io_uring.h | 1 +
>> 3 files changed, 72 insertions(+)
>>
>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>> index 03748faa5295..55c2714a591e 100644
>> --- a/fs/io_uring.c
>> +++ b/fs/io_uring.c
>> @@ -712,6 +712,7 @@ struct io_kiocb {
>> struct io_shutdown shutdown;
>> struct io_rename rename;
>> struct io_unlink unlink;
>> + struct io_uring_cmd uring_cmd;
>> /* use only after cleaning per-op data, see io_clean_op() */
>> struct io_completion compl;
>> };
>> @@ -805,6 +806,8 @@ struct io_op_def {
>> unsigned needs_async_data : 1;
>> /* should block plug */
>> unsigned plug : 1;
>> + /* doesn't support personality */
>> + unsigned no_personality : 1;
>> /* size of async data needed, if any */
>> unsigned short async_size;
>> unsigned work_flags;
>> @@ -998,6 +1001,11 @@ static const struct io_op_def io_op_defs[] = {
>> .work_flags = IO_WQ_WORK_MM | IO_WQ_WORK_FILES |
>> IO_WQ_WORK_FS | IO_WQ_WORK_BLKCG,
>> },
>> + [IORING_OP_URING_CMD] = {
>> + .needs_file = 1,
>> + .no_personality = 1,
>> + .work_flags = IO_WQ_WORK_MM | IO_WQ_WORK_BLKCG,
>> + },
>> };
>>
>> enum io_mem_account {
>> @@ -3797,6 +3805,47 @@ static int io_unlinkat(struct io_kiocb *req, bool force_nonblock)
>> return 0;
>> }
>>
>> +static void io_uring_cmd_done(struct io_uring_cmd *cmd, ssize_t ret)
>> +{
>> + struct io_kiocb *req = container_of(cmd, struct io_kiocb, uring_cmd);
>> +
>> + if (ret < 0)
>> + req_set_fail_links(req);
>> + io_req_complete(req, ret);
>> +}
>> +
>> +static int io_uring_cmd_prep(struct io_kiocb *req,
>> + const struct io_uring_sqe *sqe)
>> +{
>> + struct io_uring_cmd *cmd = &req->uring_cmd;
>> +
>> + if (!req->file->f_op->uring_cmd)
>> + return -EOPNOTSUPP;
>> +
>> + memcpy(&cmd->pdu, (void *) &sqe->off, sizeof(cmd->pdu));
>
> Hmmm. struct io_uring_pdu is (by my count) 6x uint64_t (==48 bytes) in
> size. This starts copying the pdu from byte 8 in struct io_uring_sqe,
> and the sqe is 64 bytes in size.
Correct
> I guess (having not played much with io_uring) that the stuff in the
> first eight bytes of the sqe are header info that's common to all
> io_uring operations, and hence not passed to io_uring_cmd*.
Exactly
> Assuming that I got that right, that means that the pdu information
> doesn't actually go all the way to the end of the sqe, which currently
> is just a bunch of padding. Was that intentional, or does this mean
> that io_uring_pdu could actually be 8 bytes longer?
Also correct. The reason is actually kind of stupid, and I think we
should just fix that up. struct io_uring_cmd should fit within the first
cacheline of io_kiocb, to avoid bloating that one. But with the members
in there, it ends up being 8 bytes too big, if we grab those 8 bytes.
What I think we should do is get rid of ->done, and just have drivers
call io_uring_cmd_done() instead. We can provide an empty hook for that.
Then we can reclaim the 8 bytes, and grow the io_uring_cmd to 56 bytes.
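Roughly, that change would leave us with (a sketch of what's described
above, not final code):

struct io_uring_cmd {
	struct file		*file;
	struct io_uring_pdu	pdu;	/* grows by the 8 bytes ->done used */
};

/* drivers call this directly instead of going through a ->done pointer */
void io_uring_cmd_done(struct io_uring_cmd *cmd, ssize_t ret);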
> Also, I thought io_uring_sqe.user_data was supposed to coincide with
> io_uring_pdu.reserved? They don't seem to...?
>
> (I could be totally off here, fwiw.)
I think you are, I even added a BUILD check for that:
BUILD_BUG_ON(offsetof(struct io_uring_sqe, user_data) !=
offsetof(struct io_uring_pdu, reserved));
to ensure that that is the case.
> The reason why I'm counting bytes so stingily is that xfs_scrub issues
> millions upon millions of ioctl calls to scrub an XFS. Wouldn't it be
> nice if there was a way to submit a single userspace buffer to the
> kernel and let it run every scrubber for that fs object in order? I
> could cram all that data into the pdu struct ... if it had 56 bytes of
> space.
For other purposes too: the bigger we can make the inline data, the
more likely we are to fit everything we need in there. I'm going to
make the change to bump it to 56 bytes.
> If not, it wouldn't be a big deal to use one of the data[4] fields as a
> pointer to a larger struct, but where's the fun in that? :)
Agree :-)
> Granted I'm programming speculatively in my head, not building an actual
> prototype. There are all kinds of other questions I have, like, can a
> uring command handler access the task struct or the userspace memory of
> the process it was called from? What happens when the user is madly
> pounding on ^C while uring commands are running? I should probably
> figure out the answers to those questions and maybe even write/crib a
> program first...
Well, either the program exits because it has no SIGINT handler, and
that signals the async task handling the request and cancels it. Or you
handle the signal, and then you need to do the cancelation on your own.
--
Jens Axboe
* Re: [PATCH 2/5] io_uring: add support for IORING_OP_URING_CMD
2021-01-28 1:45 ` Jens Axboe
@ 2021-01-28 2:19 ` Jens Axboe
2021-02-20 3:57 ` Stefan Metzmacher
0 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2021-01-28 2:19 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: io-uring
>> Assuming that I got that right, that means that the pdu information
>> doesn't actually go all the way to the end of the sqe, which currently
>> is just a bunch of padding. Was that intentional, or does this mean
>> that io_uring_pdu could actually be 8 bytes longer?
>
> Also correct. The reason is actually kind of stupid, and I think we
> should just fix that up. struct io_uring_cmd should fit within the first
> cacheline of io_kiocb, to avoid bloating that one. But with the members
> in there, it ends up being 8 bytes too big, if we grab those 8 bytes.
> What I think we should do is get rid of ->done, and just have drivers
> call io_uring_cmd_done() instead. We can provide an empty hook for that.
> Then we can reclaim the 8 bytes, and grow the io_uring_cmd to 56 bytes.
Pushed out that version:
https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-fops.v2
which gives you the full 56 bytes for the payload command.
--
Jens Axboe
* Re: [PATCH 2/5] io_uring: add support for IORING_OP_URING_CMD
2021-01-28 2:19 ` Jens Axboe
@ 2021-02-20 3:57 ` Stefan Metzmacher
2021-02-20 14:50 ` Jens Axboe
0 siblings, 1 reply; 16+ messages in thread
From: Stefan Metzmacher @ 2021-02-20 3:57 UTC (permalink / raw)
To: Jens Axboe, Darrick J. Wong; +Cc: io-uring
Hi Jens,
Am 28.01.21 um 03:19 schrieb Jens Axboe:
>>> Assuming that I got that right, that means that the pdu information
>>> doesn't actually go all the way to the end of the sqe, which currently
>>> is just a bunch of padding. Was that intentional, or does this mean
>>> that io_uring_pdu could actually be 8 bytes longer?
>>
>> Also correct. The reason is actually kind of stupid, and I think we
>> should just fix that up. struct io_uring_cmd should fit within the first
>> cacheline of io_kiocb, to avoid bloating that one. But with the members
>> in there, it ends up being 8 bytes too big, if we grab those 8 bytes.
>> What I think we should do is get rid of ->done, and just have drivers
>> call io_uring_cmd_done() instead. We can provide an empty hook for that.
>> Then we can reclaim the 8 bytes, and grow the io_uring_cmd to 56 bytes.
>
> Pushed out that version:
>
> https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-fops.v2
>
> which gives you the full 56 bytes for the payload command.
I think we only have 48 bytes for the payload.
I've rebased and improved your io_uring-fops.v2 on top of your io_uring-worker.v3.
See https://git.samba.org/?p=metze/linux/wip.git;a=shortlog;h=refs/heads/io_uring-fops
I've changed the layout like this:
struct io_uring_sqe {
__u8 opcode; /* type of operation for this sqe */
__u8 flags; /* IOSQE_ flags */
union {
__u16 ioprio; /* ioprio for the request */
__u16 cmd_personality; /* IORING_OP_URING_CMD */
};
__s32 fd; /* file descriptor to do IO on */
union {
__u64 off; /* offset into file */
__u64 addr2;
__u64 cmd_user_data; /* IORING_OP_URING_CMD: data to be passed back at completion time */
};
union {
__u64 addr; /* pointer to buffer or iovecs */
__u64 splice_off_in;
__u64 cmd_pdu_start; /* IORING_OP_URING_CMD: this is the start for the remaining 48 bytes */
};
And then use:
struct io_uring_cmd_pdu {
__u64 data[6]; /* 48 bytes available for free use */
};
So we effectively have this:
struct io_uring_cmd_sqe {
__u8 opcode; /* type of operation for this sqe */
__u8 flags; /* IOSQE_ flags */
__u16 cmd_personality; /* IORING_OP_URING_CMD */
__s32 fd; /* file descriptor to do IO on */
__u64 cmd_user_data; /* IORING_OP_URING_CMD: data to be passed back at completion time */
union {
__u64 cmd_pdu_start; /* IORING_OP_URING_CMD: this is the start for the remaining 48 bytes */
struct io_uring_cmd_pdu cmd_pdu;
};
}
I think it's saner to have a complete block of 48 bytes available for the payload
and move personality and user_data to the top if opcode is IORING_OP_URING_CMD
instead of having a hole that can't be touched.
I also finished the socket glue from struct file -> struct socket -> struct sock
I think it compiles, but I haven't done any tests.
What do you think?
metze
* Re: [PATCH 2/5] io_uring: add support for IORING_OP_URING_CMD
2021-02-20 3:57 ` Stefan Metzmacher
@ 2021-02-20 14:50 ` Jens Axboe
2021-02-20 16:45 ` Jens Axboe
0 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2021-02-20 14:50 UTC (permalink / raw)
To: Stefan Metzmacher, Darrick J. Wong; +Cc: io-uring
On 2/19/21 8:57 PM, Stefan Metzmacher wrote:
> Hi Jens,
>
> Am 28.01.21 um 03:19 schrieb Jens Axboe:
>>>> Assuming that I got that right, that means that the pdu information
>>>> doesn't actually go all the way to the end of the sqe, which currently
>>>> is just a bunch of padding. Was that intentional, or does this mean
>>>> that io_uring_pdu could actually be 8 bytes longer?
>>>
>>> Also correct. The reason is actually kind of stupid, and I think we
>>> should just fix that up. struct io_uring_cmd should fit within the first
>>> cacheline of io_kiocb, to avoid bloating that one. But with the members
>>> in there, it ends up being 8 bytes too big, if we grab those 8 bytes.
>>> What I think we should do is get rid of ->done, and just have drivers
>>> call io_uring_cmd_done() instead. We can provide an empty hook for that.
>>> Then we can reclaim the 8 bytes, and grow the io_uring_cmd to 56 bytes.
>>
>> Pushed out that version:
>>
>> https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-fops.v2
>>
>> which gives you the full 56 bytes for the payload command.
>
> I think we only have 48 bytes for the payload.
You are right, it's 64b minus 8 for the head, and 8 for user_data.
> I've rebased and improved your io_uring-fops.v2 on top of your io_uring-worker.v3.
Heh, I did that myself yesterday too, when I was folding in two fixes!
> See https://git.samba.org/?p=metze/linux/wip.git;a=shortlog;h=refs/heads/io_uring-fops
Had a quick look, and some good stuff in there. So thanks for that. One
question, though - if you look at mine, you'll see I moved the force_nonblock
to be using the issue_flags instead. You dropped that from the issue
path; we definitely need that to be able to punt the request if we're
in the nonblock/fast path.
>
> I've changed the layout like this:
>
> struct io_uring_sqe {
> __u8 opcode; /* type of operation for this sqe */
> __u8 flags; /* IOSQE_ flags */
> union {
> __u16 ioprio; /* ioprio for the request */
> __u16 cmd_personality; /* IORING_OP_URING_CMD */
> };
> __s32 fd; /* file descriptor to do IO on */
> union {
> __u64 off; /* offset into file */
> __u64 addr2;
> __u64 cmd_user_data; /* IORING_OP_URING_CMD: data to be passed back at completion time */
> };
> union {
> __u64 addr; /* pointer to buffer or iovecs */
> __u64 splice_off_in;
> __u64 cmd_pdu_start; /* IORING_OP_URING_CMD: this is the start for the remaining 48 bytes */
> };
>
> And then use:
>
> struct io_uring_cmd_pdu {
> __u64 data[6]; /* 48 bytes available for free use */
> };
>
> So we effectively have this:
>
> struct io_uring_cmd_sqe {
> __u8 opcode; /* type of operation for this sqe */
> __u8 flags; /* IOSQE_ flags */
> __u16 cmd_personality; /* IORING_OP_URING_CMD */
> __s32 fd; /* file descriptor to do IO on */
> __u64 cmd_user_data; /* IORING_OP_URING_CMD: data to be passed back at completion time */
> union {
> __u64 cmd_pdu_start; /* IORING_OP_URING_CMD: this is the start for the remaining 48 bytes */
> struct io_uring_cmd_pdu cmd_pdu;
> };
> }
>
> I think it's saner to have a complete block of 48 bytes available for the payload
> and move personality and user_data to the top if opcode is IORING_OP_URING_CMD
> instead of having a hole that can't be touched.
>
> I also finished the socket glue from struct file -> struct socket -> struct sock
>
> I think it compiles, but I haven't done any tests.
>
> What do you think?
I've been thinking along the same lines, because having a sparse sqe layout
for the uring cmd is a pain. I do think 'personality' is a bit too specific
to be part of the shared space, that should probably belong in the pdu
instead if the user needs it. One thing they all have in common is that they'd
need a sub-command, so why not make that a shared u16?
There's also the option of simply saying that the uring_cmd sqe is just
a different type, ala:
struct io_uring_cmd_sqe {
	__u8	opcode;		/* IORING_OP_URING_CMD */
__u8 flags;
__u16 target_op;
__s32 fd;
__u64 user_data;
	struct io_uring_cmd_pdu cmd_pdu;
};
which is essentially the same as your suggestion in terms of layout
(because that is the one that makes the most sense), we just don't try
and shoe-horn it into the existing sqe. As long as we overlap
opcode/flags, then init is fine. And past init, sqe is already consumed.
Haven't tried and wire that up yet, and it may just be that the simple
layout change you did is just easier to deal with. The important part
here is the layout, and I certainly think we should do that. There's
effectively 54 bytes of data there, if you include the target op and fd
as part of that space. 48 fully usable for whatever.
--
Jens Axboe
* Re: [PATCH 2/5] io_uring: add support for IORING_OP_URING_CMD
2021-02-20 14:50 ` Jens Axboe
@ 2021-02-20 16:45 ` Jens Axboe
2021-02-22 20:04 ` Stefan Metzmacher
0 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2021-02-20 16:45 UTC (permalink / raw)
To: Stefan Metzmacher, Darrick J. Wong; +Cc: io-uring
On 2/20/21 7:50 AM, Jens Axboe wrote:
> On 2/19/21 8:57 PM, Stefan Metzmacher wrote:
>> Hi Jens,
>>
>> Am 28.01.21 um 03:19 schrieb Jens Axboe:
>>>>> Assuming that I got that right, that means that the pdu information
>>>>> doesn't actually go all the way to the end of the sqe, which currently
>>>>> is just a bunch of padding. Was that intentional, or does this mean
>>>>> that io_uring_pdu could actually be 8 bytes longer?
>>>>
>>>> Also correct. The reason is actually kind of stupid, and I think we
>>>> should just fix that up. struct io_uring_cmd should fit within the first
>>>> cacheline of io_kiocb, to avoid bloating that one. But with the members
>>>> in there, it ends up being 8 bytes too big, if we grab those 8 bytes.
>>>> What I think we should do is get rid of ->done, and just have drivers
>>>> call io_uring_cmd_done() instead. We can provide an empty hook for that.
>>>> Then we can reclaim the 8 bytes, and grow the io_uring_cmd to 56 bytes.
>>>
>>> Pushed out that version:
>>>
>>> https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-fops.v2
>>>
>>> which gives you the full 56 bytes for the payload command.
>>
>> I think we only have 48 bytes for the payload.
>
> You are right, it's 64b minus 8 for the head, and 8 for user_data.
>
>> I've rebased and improved your io_uring-fops.v2 on top of your io_uring-worker.v3.
>
> Heh, I did that myself yesterday too, when I was folding in two fixes!
>
>> See https://git.samba.org/?p=metze/linux/wip.git;a=shortlog;h=refs/heads/io_uring-fops
>
> Had a quick look, and some good stuff in there. So thanks for that. One
> question, though - if you look at mine, you'll see I moved the force_nonblock
> to be using the issue_flags instead. You dropped that from the issue path,
> we definitely need that to be able to punt the request if we're in the
> nonblock/fast path.
>
>>
>> I've changed the layout like this:
>>
>> struct io_uring_sqe {
>> __u8 opcode; /* type of operation for this sqe */
>> __u8 flags; /* IOSQE_ flags */
>> union {
>> __u16 ioprio; /* ioprio for the request */
>> __u16 cmd_personality; /* IORING_OP_URING_CMD */
>> };
>> __s32 fd; /* file descriptor to do IO on */
>> union {
>> __u64 off; /* offset into file */
>> __u64 addr2;
>> __u64 cmd_user_data; /* IORING_OP_URING_CMD: data to be passed back at completion time */
>> };
>> union {
>> __u64 addr; /* pointer to buffer or iovecs */
>> __u64 splice_off_in;
>> __u64 cmd_pdu_start; /* IORING_OP_URING_CMD: this is the start for the remaining 48 bytes */
>> };
>>
>> And then use:
>>
>> struct io_uring_cmd_pdu {
>> __u64 data[6]; /* 48 bytes available for free use */
>> };
>>
>> So we effectively have this:
>>
>> struct io_uring_cmd_sqe {
>> __u8 opcode; /* type of operation for this sqe */
>> __u8 flags; /* IOSQE_ flags */
>> __u16 cmd_personality; /* IORING_OP_URING_CMD */
>> __s32 fd; /* file descriptor to do IO on */
>> __u64 cmd_user_data; /* IORING_OP_URING_CMD: data to be passed back at completion time */
>> union {
>> __u64 cmd_pdu_start; /* IORING_OP_URING_CMD: this is the start for the remaining 48 bytes */
>> struct io_uring_cmd_pdu cmd_pdu;
>> };
>> }
>>
>> I think it's saner to have a complete block of 48 bytes available for the payload
>> and move personality and user_data to the top if opcode is IORING_OP_URING_CMD
>> instead of having a hole that can't be touched.
>>
>> I also finished the socket glue from struct file -> struct socket -> struct sock
>>
>> I think it compiles, but I haven't done any tests.
>>
>> What do you think?
>
> I've been thinking along the same lines, because having a sparse sqe layout
> for the uring cmd is a pain. I do think 'personality' is a bit too specific
> to be part of the shared space, that should probably belong in the pdu
> instead if the user needs it. One thing they all have in common is that they'd
>> need a sub-command, so why not make that a shared u16?
>
> There's also the option of simply saying that the uring_cmd sqe is just
> a different type, ala:
>
> struct io_uring_cmd_sqe {
> 	__u8	opcode;		/* IORING_OP_URING_CMD */
> __u8 flags;
> __u16 target_op;
> __s32 fd;
> __u64 user_data;
> 	struct io_uring_cmd_pdu cmd_pdu;
> };
>
> which is essentially the same as your suggestion in terms of layout
> (because that is the one that makes the most sense), we just don't try
> and shoe-horn it into the existing sqe. As long as we overlap
> opcode/flags, then init is fine. And past init, sqe is already consumed.
>
> Haven't tried and wire that up yet, and it may just be that the simple
> layout change you did is just easier to deal with. The important part
> here is the layout, and I certainly think we should do that. There's
> effectively 54 bytes of data there, if you include the target op and fd
> as part of that space. 48 fully usable for whatever.
OK, folded in some of your stuff, and pushed out a new branch. Find it
here:
https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-fops.v3
I did notice while doing so that you put the issue flags in the cmd;
I've made them external again. It just seems cleaner to me - otherwise
you'd have to modify the command for reissue rather than just
pass in the flags directly.
I also retained struct file * in the cmd - that's a requirement for
the layout of io_kiocb, so might as well keep it in there and not
pass in the file. Plus that one won't ever change...
Since we just need that one branch in req init, I do think that your
suggestion of just modifying io_uring_sqe is the way to go. So that's
what the above branch does.
I tested the block side, and it works for getting the bs of the
device. That's all the testing that has been done so far :-)
Comments welcome! Would like to move this one forward and hopefully
target 5.13 for it.
--
Jens Axboe
* Re: [PATCH 2/5] io_uring: add support for IORING_OP_URING_CMD
2021-02-20 16:45 ` Jens Axboe
@ 2021-02-22 20:04 ` Stefan Metzmacher
2021-02-22 20:14 ` Jens Axboe
0 siblings, 1 reply; 16+ messages in thread
From: Stefan Metzmacher @ 2021-02-22 20:04 UTC (permalink / raw)
To: Jens Axboe, Darrick J. Wong; +Cc: io-uring
Hi Jens,
>> I've been thinking along the same lines, because having a sparse sqe layout
>> for the uring cmd is a pain. I do think 'personality' is a bit too specific
>> to be part of the shared space, that should probably belong in the pdu
>> instead if the user needs it. One thing they all have in common is that they'd
>> need a sub-command, so why not make that a shared u16?
>>
>> There's also the option of simply saying that the uring_cmd sqe is just
>> a different type, ala:
>>
>> struct io_uring_cmd_sqe {
>> 	__u8	opcode;		/* IORING_OP_URING_CMD */
>> __u8 flags;
>> __u16 target_op;
>> __s32 fd;
>> __u64 user_data;
>> 	struct io_uring_cmd_pdu cmd_pdu;
>> };
>>
>> which is essentially the same as your suggestion in terms of layout
>> (because that is the one that makes the most sense), we just don't try
>> and shoe-horn it into the existing sqe. As long as we overlap
>> opcode/flags, then init is fine. And past init, sqe is already consumed.
>>
>> Haven't tried and wire that up yet, and it may just be that the simple
>> layout change you did is just easier to deal with. The important part
>> here is the layout, and I certainly think we should do that. There's
>> effectively 54 bytes of data there, if you include the target op and fd
>> as part of that space. 48 fully usable for whatever.
>
> OK, folded in some of your stuff, and pushed out a new branch. Find it
> here:
>
> https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-fops.v3
>
> I did notice while doing so that you put the issue flags in the cmd,
> I've made them external again. Just seems cleaner to me, otherwise
> you'd have to modify the command for reissue rather than just
> pass in the flags directly.
I think the first two commits need more verbose comments, which clearly
document the uring_cmd() API.
Event before uring_cmd(), it's really not clear to me why we have
'enum io_uring_cmd_flags', as 'enum'.
As it seems to be use it as 'flags' (IO_URING_F_NONBLOCK|IO_URING_F_COMPLETE_DEFER).
With uring_cmd() it's not clear what the backend is supposed to do with these flags.
I'd assume that uring_cmd() would per definition never block and go async itself,
by returning -EIOCBQUEUED. And a single &req->uring_cmd is only ever passed once to
uring_cmd() without any retry.
It's also not clear if IOSQE_ASYNC should have any impact.
I think we also need a way to pass IORING_OP_ASYNC_CANCEL down.
> I also retained struct file * in the cmd - that's a requirement for
> the layout of io_kiocb, so might as well keep it in there and not
> pass in the file. Plus that one won't ever change...
Ah, ok.
> Since we just need that one branch in req init, I do think that your
> suggestion of just modifying io_uring_sqe is the way to go. So that's
> what the above branch does.
Thanks! I think it's much easier to handle the personality logic in
the core only.
For fixed files or fixed buffers I think helper functions like this would be needed:
struct file *io_uring_cmd_get_file(struct io_uring_cmd *cmd, int fd, bool fixed);
And similar functions for io_buffer_select or io_import_fixed.
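Sketching those "similar functions" (the names and signatures here are
hypothetical, nothing that exists in either branch yet):

/* hypothetical helpers following the io_uring_cmd_get_file() pattern */
void *io_uring_cmd_select_buffer(struct io_uring_cmd *cmd, __u16 bgid,
				 size_t *len);
int io_uring_cmd_import_fixed(struct io_uring_cmd *cmd, __u64 addr, size_t len,
			      int rw, struct iov_iter *iter);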
> I tested the block side, and it works for getting the bs of the
> device. That's all the testing that has been done so far :-)
I've added EXPORT_SYMBOL(io_uring_cmd_done); and split your net patch,
similar to the two block patches, so we can better isolate the core
from the first consumers.
See https://git.samba.org/?p=metze/linux/wip.git;a=shortlog;h=refs/heads/io_uring-fops.v3
> Comments welcome! Would like to move this one forward and hopefully
> target 5.13 for it.
Great!
metze
* Re: [PATCH 2/5] io_uring: add support for IORING_OP_URING_CMD
2021-02-22 20:04 ` Stefan Metzmacher
@ 2021-02-22 20:14 ` Jens Axboe
2021-02-23 8:14 ` Stefan Metzmacher
0 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2021-02-22 20:14 UTC (permalink / raw)
To: Stefan Metzmacher, Darrick J. Wong; +Cc: io-uring
On 2/22/21 1:04 PM, Stefan Metzmacher wrote:
> Hi Jens,
>
>>> I've been thinking along the same lines, because having a sparse sqe layout
>>> for the uring cmd is a pain. I do think 'personality' is a bit too specific
>>> to be part of the shared space, that should probably belong in the pdu
>>> instead if the user needs it. One thing they all have in common is that they'd
>>> need a sub-command, so why not make that a shared u16?
>>>
>>> There's also the option of simply saying that the uring_cmd sqe is just
>>> a different type, ala:
>>>
>>> struct io_uring_cmd_sqe {
>>> 	__u8	opcode;		/* IORING_OP_URING_CMD */
>>> 	__u8	flags;
>>> 	__u16	target_op;
>>> 	__s32	fd;
>>> 	__u64	user_data;
>>> 	struct io_uring_cmd_pdu	cmd_pdu;
>>> };
>>>
>>> which is essentially the same as your suggestion in terms of layout
>>> (because that is the one that makes the most sense); we just don't try
>>> to shoe-horn it into the existing sqe. As long as we overlap
>>> opcode/flags, then init is fine. And past init, sqe is already consumed.
>>>
>>> Haven't tried to wire that up yet, and it may just be that the simple
>>> layout change you did is easier to deal with. The important part
>>> here is the layout, and I certainly think we should do that. There's
>>> effectively 54 bytes of data there, if you include the target op and fd
>>> as part of that space. 48 fully usable for whatever.
>>
>> OK, folded in some of your stuff, and pushed out a new branch. Find it
>> here:
>>
>> https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-fops.v3
>>
>> I did notice while doing so that you put the issue flags in the cmd;
>> I've made them external again. Just seems cleaner to me, otherwise
>> you'd have to modify the command for reissue rather than just
>> pass in the flags directly.
>
> I think the first two commits need more verbose comments, which clearly
> document the uring_cmd() API.
Oh for sure, I just haven't gotten around to it yet :-)
> Even before uring_cmd(), it's really not clear to me why we have
> 'enum io_uring_cmd_flags' as an 'enum',
> as it seems to be used as 'flags' (IO_URING_F_NONBLOCK|IO_URING_F_COMPLETE_DEFER).
They could be unsigned int too, but not really a big deal imho.
> With uring_cmd() it's not clear what the backend is supposed to do
> with these flags.
IO_URING_F_NONBLOCK tells the lower layers that the operation should be
non-blocking, and if that isn't possible, then it must return -EAGAIN.
If that happens, then the operation will be retried from a context where
IO_URING_F_NONBLOCK isn't set.
IO_URING_F_COMPLETE_DEFER is just part of the flags that should be
passed to the completion side; the handler need not do anything else.
It's only used internally, but allows fast processing if the completion
occurs off the IO_URING_F_NONBLOCK path.
It'll get documented... But the above is also why it should get passed
in, rather than stuffed in the command itself.
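To make the intent concrete, here's a minimal sketch of a handler honoring
IO_URING_F_NONBLOCK (all the mydrv bits are made up purely for illustration,
and the handler signature is approximated):

static int mydrv_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
{
	struct mydrv_dev *dev = mydrv_from_file(cmd->file);

	if (issue_flags & IO_URING_F_NONBLOCK) {
		/* can't sleep here; punt back for a retry without the flag */
		if (!mutex_trylock(&dev->lock))
			return -EAGAIN;
	} else {
		mutex_lock(&dev->lock);
	}

	/* ... handle the command inline ... */

	mutex_unlock(&dev->lock);
	/* inline completion; the core turns the return value into the CQE */
	return 0;
}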
> I'd assume that uring_cmd() would by definition never block and would go
> async itself by returning -EIOCBQUEUED, and that a single &req->uring_cmd
> is only ever passed once to uring_cmd() without any retry.
No, -EIOCBQUEUED would mean "operation is queued, I'll call the
completion callback for it later". For example, you start the IO
operation and you'll get a notification (eg IRQ) later on which allows
you to complete it.
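In sketch form, with made-up driver plumbing again (only io_uring_cmd_done()
is from the actual branch):

static int mydrv_uring_cmd_async(struct io_uring_cmd *cmd, unsigned int issue_flags)
{
	/* hand the command to the hardware; completion arrives via IRQ */
	mydrv_hw_submit(cmd);
	return -EIOCBQUEUED;
}

/* later, from the driver's completion (eg IRQ) path */
static void mydrv_hw_complete(struct io_uring_cmd *cmd, long res)
{
	/* posts the CQE for the original request */
	io_uring_cmd_done(cmd, res);
}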
> It's also not clear if IOSQE_ASYNC should have any impact.
Handler doesn't need to care about that, it'll just mean that the
initial queue attempt will not have IO_URING_F_NONBLOCK set.
> I think we also need a way to pass IORING_OP_ASYNC_CANCEL down.
Cancelation indeed needs some thought. There are a few options:
1) Request completes sync, obviously no cancelation needed here as the
request is never stuck in a state that requires cancelation.
2) Request needs blocking context, and hence an async handler is doing
it. The regular cancelation already works for that; nothing is needed
here. It would probably be better handled with a cancel handler, though.
3) uring_cmd handler returns -EIOCBQUEUED. This is the only case that
needs active cancelation support. The only cases where that would
currently happen are things like block IO, where we don't support
cancelation to begin with (insert long rant on inadequate hw
support).
So tldr here is that 1+2 is already there, and 3 not being fixed leaves
us no different than the existing support for cancelation. IOW, I don't
think this is an initial requirement, support can always be expanded
later.
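If/when we do grow support for 3), one hypothetical shape for it could be a
cancel hook next to ->uring_cmd() (nothing like this exists in the branch,
just to illustrate):

/*
 * Hypothetical only: invoked for IORING_OP_ASYNC_CANCEL against a
 * command that previously returned -EIOCBQUEUED. The target either
 * cancels it (and completes the cmd with -ECANCELED through
 * io_uring_cmd_done()), or returns -EALREADY if it's too late.
 */
int (*uring_cmd_cancel)(struct io_uring_cmd *cmd);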
>> Since we just need that one branch in req init, I do think that your
>> suggestion of just modifying io_uring_sqe is the way to go. So that's
>> what the above branch does.
>
> Thanks! I think it's much easier to handle the personality logic in
> the core only.
>
> For fixed files or fixed buffers I think helper functions like this:
>
> struct file *io_uring_cmd_get_file(struct io_uring_cmd *cmd, int fd, bool fixed);
>
> And similar functions for io_buffer_select or io_import_fixed.
I did end up retaining that; at least in its current state it's like you
proposed. The only change is some packing on that very union, which should
not be necessary, but due to fun ARM reasons it is.
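IOW, the pdu union needs explicit packing to keep offsets identical across
ABIs, something along these lines (layout and sizes illustrative, not the
exact branch contents):

struct io_uring_cmd {
	struct file	*file;
	/* ... */
	union {
		/* a 64-bit member would bump the union's alignment to 8 on
		 * some ABIs and shift these bytes around, hence the packing */
		__u64	something_64bit;
		__u8	pdu[48];
	} __attribute__((packed));
};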
>> I tested the block side, and it works for getting the bs of the
>> device. That's all the testing that has been done so far :-)
>
> I've added EXPORT_SYMBOL(io_uring_cmd_done); and split your net patch,
> similar to the two block patches, so we can better isolate the core
> from the first consumers.
>
> See https://git.samba.org/?p=metze/linux/wip.git;a=shortlog;h=refs/heads/io_uring-fops.v3
Great, thanks! I'll take a look and fold it back. I'll also expand those
commit messages :-)
--
Jens Axboe
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 2/5] io_uring: add support for IORING_OP_URING_CMD
2021-02-22 20:14 ` Jens Axboe
@ 2021-02-23 8:14 ` Stefan Metzmacher
2021-02-23 13:21 ` Pavel Begunkov
0 siblings, 1 reply; 16+ messages in thread
From: Stefan Metzmacher @ 2021-02-23 8:14 UTC (permalink / raw)
To: Jens Axboe, Darrick J. Wong; +Cc: io-uring
Am 22.02.21 um 21:14 schrieb Jens Axboe:
> On 2/22/21 1:04 PM, Stefan Metzmacher wrote:
>> Hi Jens,
>>
>>>> I've been thinking along the same lines, because having a sparse sqe layout
>>>> for the uring cmd is a pain. I do think 'personality' is a bit too specific
>>>> to be part of the shared space; that should probably belong in the pdu
>>>> instead, if the user needs it. One thing they all have in common is that they'd
>>>> need a sub-command, so why not make that a u16?
>>>>
>>>> There's also the option of simply saying that the uring_cmd sqe is just
>>>> a different type, ala:
>>>>
>>>> struct io_uring_cmd_sqe {
>>>> 	__u8	opcode;		/* IORING_OP_URING_CMD */
>>>> 	__u8	flags;
>>>> 	__u16	target_op;
>>>> 	__s32	fd;
>>>> 	__u64	user_data;
>>>> 	struct io_uring_cmd_pdu	cmd_pdu;
>>>> };
>>>>
>>>> which is essentially the same as your suggestion in terms of layout
>>>> (because that is the one that makes the most sense); we just don't try
>>>> to shoe-horn it into the existing sqe. As long as we overlap
>>>> opcode/flags, then init is fine. And past init, sqe is already consumed.
>>>>
>>>> Haven't tried to wire that up yet, and it may just be that the simple
>>>> layout change you did is easier to deal with. The important part
>>>> here is the layout, and I certainly think we should do that. There's
>>>> effectively 54 bytes of data there, if you include the target op and fd
>>>> as part of that space. 48 fully usable for whatever.
>>>
>>> OK, folded in some of your stuff, and pushed out a new branch. Find it
>>> here:
>>>
>>> https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-fops.v3
>>>
>>> I did notice while doing so that you put the issue flags in the cmd;
>>> I've made them external again. Just seems cleaner to me, otherwise
>>> you'd have to modify the command for reissue rather than just
>>> pass in the flags directly.
>>
>> I think the first two commits need more verbose comments, which clearly
>> document the uring_cmd() API.
>
> Oh for sure, I just haven't gotten around to it yet :-)
>
>> Even before uring_cmd(), it's really not clear to me why we have
>> 'enum io_uring_cmd_flags' as an 'enum',
>> as it seems to be used as 'flags' (IO_URING_F_NONBLOCK|IO_URING_F_COMPLETE_DEFER).
>
> They could be unsigned int too, but not really a big deal imho.
>
>> With uring_cmd() it's not clear what the backend is supposed to do
>> with these flags.
>
> IO_URING_F_NONBLOCK tells the lower layers that the operation should be
> non-blocking, and if that isn't possible, then it must return -EAGAIN.
> If that happens, then the operation will be retried from a context where
> IO_URING_F_NONBLOCK isn't set.
>
> IO_URING_F_COMPLETE_DEFER is just part of the flags that should be
> passed to the completion side; the handler need not do anything else.
> It's only used internally, but allows fast processing if the completion
> occurs off the IO_URING_F_NONBLOCK path.
>
> It'll get documented... But the above is also why it should get passed
> in, rather than stuffed in the command itself.
Thanks!
>> I'd assume that uring_cmd() would by definition never block and would go
>> async itself by returning -EIOCBQUEUED, and that a single &req->uring_cmd
>> is only ever passed once to uring_cmd() without any retry.
>
> No, -EIOCBQUEUED would mean "operation is queued, I'll call the
> completion callback for it later".
That's what I meant by async here.
> For example, you start the IO operation and you'll get a notification (eg IRQ) later on which allows
> you to complete it.
Yes, it's up to the implementation of uring_cmd() to do the processing and waiting
in the background, based on timers, hardware events or whatever, and finally call
io_uring_cmd_done().
But with this:
ret = file->f_op->uring_cmd(&req->uring_cmd, issue_flags);
/* queued async, consumer will call io_uring_cmd_done() when complete */
if (ret == -EIOCBQUEUED)
	return 0;
io_uring_cmd_done(&req->uring_cmd, ret);
return 0;
I don't see where -EAGAIN would trigger a retry in an io-wq worker context.
Isn't -EAGAIN exposed to the cqe, similar to ret == -EAGAIN && req->flags & REQ_F_NOWAIT?
>> It's also not clear if IOSQE_ASYNC should have any impact.
>
> Handler doesn't need to care about that, it'll just mean that the
> initial queue attempt will not have IO_URING_F_NONBLOCK set.
Ok, because it's done from the io-wq worker, correct?
>> I think we also need a way to pass IORING_OP_ASYNC_CANCEL down.
>
> Cancelation indeed needs some thought. There are a few options:
>
> 1) Request completes sync, obviously no cancelation needed here as the
> request is never stuck in a state that requires cancelation.
>
> 2) Request needs blocking context, and hence an async handler is doing
> it. The regular cancelation already works for that; nothing is needed
> here. It would probably be better handled with a cancel handler, though.
>
> 3) uring_cmd handler returns -EIOCBQUEUED. This is the only case that
> needs active cancelation support. The only cases where that would
> currently happen are things like block IO, where we don't support
> cancelation to begin with (insert long rant on inadequate hw
> support).
>
> So tldr here is that 1+2 is already there, and 3 not being fixed leaves
> us no different than the existing support for cancelation. IOW, I don't
> think this is an initial requirement, support can always be expanded
> later.
Given that you list 2) here again, I get the impression that the logic should be:
ret = file->f_op->uring_cmd(&req->uring_cmd, issue_flags);
/* reschedule in io-wq worker again */
if (ret == -EAGAIN)
	return ret;
/* queued async, consumer will call io_uring_cmd_done() when complete */
if (ret == -EIOCBQUEUED)
	return 0;
io_uring_cmd_done(&req->uring_cmd, ret);
return 0;
With that, the above would make sense, and it seems to make the whole design
more flexible for the uring_cmd implementers.
However, my primary use case would be using the -EIOCBQUEUED logic.
And I think it would be good to have the IORING_OP_ASYNC_CANCEL logic in place for that,
as it would simplify the userspace logic to a single io_uring_opcode_supported(IORING_OP_URING_CMD) check.
I also noticed that some sendmsg/recvmsg implementations support -EIOCBQUEUED, e.g. _aead_recvmsg();
I guess it would be nice to support that for IORING_OP_SENDMSG and IORING_OP_RECVMSG as well.
It uses struct kiocb and iocb->ki_complete().
Would it make sense to also use struct kiocb and iocb->ki_complete() instead of
a custom io_uring_cmd_done()?
Maybe it would be possible to also have a common way to cancel a struct kiocb request...
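For comparison, the existing kiocb hook looks like this (from include/linux/fs.h):

	void (*ki_complete)(struct kiocb *iocb, long ret, long ret2);

So if struct io_uring_cmd embedded a kiocb (it currently doesn't; this is just
an assumption to illustrate the idea), completion could be routed like:

static void io_uring_cmd_ki_complete(struct kiocb *iocb, long ret, long ret2)
{
	struct io_uring_cmd *cmd = container_of(iocb, struct io_uring_cmd, iocb);

	io_uring_cmd_done(cmd, ret);
}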
>>> Since we just need that one branch in req init, I do think that your
>>> suggestion of just modifying io_uring_sqe is the way to go. So that's
>>> what the above branch does.
>>
>> Thanks! I think it's much easier to handle the personality logic in
>> the core only.
>>
>> For fixed files or fixed buffers I think helper functions like this would be useful:
>>
>> struct file *io_uring_cmd_get_file(struct io_uring_cmd *cmd, int fd, bool fixed);
>>
>> And similar functions for io_buffer_select or io_import_fixed.
>
> I did end up retaining that; at least in its current state it's like you
> proposed. The only change is some packing on that very union, which should
> not be necessary, but due to fun ARM reasons it is.
I noticed that, thanks!
Do you also think an io_uring_cmd_get_file() would be useful?
My uring_cmd() implementation would need a 2nd struct file in order to
do something similar to a splice operation. And it might be good to
also allow fixed files to be used.
Referencing fixed buffers may also be useful; I'm not 100% sure I'll need them,
but it would be good to be flexible and prototype various solutions.
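For the buffer side I'd imagine something mirroring io_import_fixed(); the
name and signature here are just my guess:

	int io_uring_cmd_import_fixed(struct io_uring_cmd *cmd, u64 ubuf,
				      unsigned long len, int rw,
				      struct iov_iter *iter);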
>>> I tested the block side, and it works for getting the bs of the
>>> device. That's all the testing that has been done so far :-)
>>
>> I've added EXPORT_SYMBOL(io_uring_cmd_done); and split your net patch,
>> similar to the two block patches, so we can better isolate the core
>> from the first consumers.
>>
>> See https://git.samba.org/?p=metze/linux/wip.git;a=shortlog;h=refs/heads/io_uring-fops.v3
>
> Great thanks, I'll take a look and fold back. I'll also expand those
> commit messages :-)
Thanks!
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 2/5] io_uring: add support for IORING_OP_URING_CMD
2021-02-23 8:14 ` Stefan Metzmacher
@ 2021-02-23 13:21 ` Pavel Begunkov
0 siblings, 0 replies; 16+ messages in thread
From: Pavel Begunkov @ 2021-02-23 13:21 UTC (permalink / raw)
To: Stefan Metzmacher, Jens Axboe, Darrick J. Wong; +Cc: io-uring
On 23/02/2021 08:14, Stefan Metzmacher wrote:
> Am 22.02.21 um 21:14 schrieb Jens Axboe:
>> On 2/22/21 1:04 PM, Stefan Metzmacher wrote:
>> For example, you start the IO operation and you'll get a notification (eg IRQ) later on which allows
>> you to complete it.
>
> Yes, it's up to the implementation of uring_cmd() to do the processing and waiting
> in the background, based on timers, hardware events or whatever, and finally call
> io_uring_cmd_done().
>
> But with this:
>
> ret = file->f_op->uring_cmd(&req->uring_cmd, issue_flags);
> /* queued async, consumer will call io_uring_cmd_done() when complete */
> if (ret == -EIOCBQUEUED)
> 	return 0;
> io_uring_cmd_done(&req->uring_cmd, ret);
> return 0;
>
> I don't see where -EAGAIN would trigger a retry in an io-wq worker context.
> Isn't -EAGAIN exposed to the cqe, similar to ret == -EAGAIN && req->flags & REQ_F_NOWAIT?
if (ret == -EAGAIN && (issue_flags & IO_URING_F_NONBLOCK))
	return -EAGAIN;
Yes, something like that. Apparently it was just forgotten.
>>> It's also not clear if IOSQE_ASYNC should have any impact.
>>
>> Handler doesn't need to care about that, it'll just mean that the
>> initial queue attempt will not have IO_URING_F_NONBLOCK set.
>
> Ok, because it's done from the io-wq worker, correct?
Currently, IO_URING_F_NONBLOCK is only left unset when executing from
io-wq, though that may change for any reason. Actually, an ASYNC req may
still get executed with IO_URING_F_NONBLOCK, but handlers should be
sane about that.
>> So tldr here is that 1+2 is already there, and 3 not being fixed leaves
>> us no different than the existing support for cancelation. IOW, I don't
>> think this is an initial requirement, support can always be expanded
>> later.
>
> Given that you list 2) here again, I get the impression that the logic should be:
>
> ret = file->f_op->uring_cmd(&req->uring_cmd, issue_flags);
> /* reschedule in io-wq worker again */
> if (ret == -EAGAIN)
> 	return ret;
Yes, kind of
> /* queued async, consumer will call io_uring_cmd_done() when complete */
> if (ret == -EIOCBQUEUED)
> 	return 0;
> io_uring_cmd_done(&req->uring_cmd, ret);
> return 0;
>
> With that, the above would make sense, and it seems to make the whole design
> more flexible for the uring_cmd implementers.
>
> However, my primary use case would be using the -EIOCBQUEUED logic.
> And I think it would be good to have the IORING_OP_ASYNC_CANCEL logic in place for that,
> as it would simplify the userspace logic to a single io_uring_opcode_supported(IORING_OP_URING_CMD) check.
>
> I also noticed that some sendmsg/recvmsg implementations support -EIOCBQUEUED, e.g. _aead_recvmsg();
> I guess it would be nice to support that for IORING_OP_SENDMSG and IORING_OP_RECVMSG as well.
> It uses struct kiocb and iocb->ki_complete().
It's just crypto stuff; IMHO, unless something more useful like TCP/UDP
sockets starts supporting it, it isn't worth it.
>
> Would it make sense to also use struct kiocb and iocb->ki_complete() instead of
> a custom io_uring_cmd_done()?
>
> Maybe it would be possible to also have a common way to cancel a struct kiocb request...
>
>>>> Since we just need that one branch in req init, I do think that your
>>>> suggestion of just modifying io_uring_sqe is the way to go. So that's
>>>> what the above branch does.
>>>
>>> Thanks! I think it's much easier to handle the personality logic in
>>> the core only.
>>>
>>> For fixed files or fixed buffers I think helper functions like this would be useful:
>>>
>>> struct file *io_uring_cmd_get_file(struct io_uring_cmd *cmd, int fd, bool fixed);
>>>
>>> And similar functions for io_buffer_select or io_import_fixed.
>>
>> I did end up retaining that; at least in its current state it's like you
>> proposed. The only change is some packing on that very union, which should
>> not be necessary, but due to fun ARM reasons it is.
>
> I noticed that, thanks!
>
> Do you also think an io_uring_cmd_get_file() would be useful?
> My uring_cmd() implementation would need a 2nd struct file in order to
> do something similar to a splice operation. And it might be good to
> also allow fixed files to be used.
>
> Referencing fixed buffers may also be useful; I'm not 100% sure I'll need them,
> but it would be good to be flexible and prototype various solutions.
Right, there should be a set of APIs for different purposes, including
getting resources. It should also be better integrated into the
prep/cleanup/etc. bits.
Even more, I think the approach should be modernised into two steps:
1. register your fd (or classes of fds) for further uring_cmd() use in
the io_uring ctx, 2. use those pre-registered fds/state for actually
issuing a command.
That would open the way for pre-allocating memory in advance,
pre-backing some stuff like iommu mappings, iovec/bvecs in specific
cases, pinning pages, and so on.
It would also get rid of virtual call chaining, e.g.
->uring_cmd() { blk_mq->uring_blk_mq() { ... } }
We will just get the final callback in the state.
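In hypothetical pseudo-API terms (none of these names exist, just sketching
the shape):

/* step 1: registered once per fd/class, carrying the pre-baked state */
struct io_cmd_state {
	struct file	*file;
	/* pre-allocated memory, pinned pages, iommu mappings, ... */
	int		(*issue)(struct io_cmd_state *st,
				 struct io_uring_cmd *cmd,
				 unsigned int issue_flags);
};

/* step 2: the sqe refers to the registered state index, and we call
 * st->issue() directly instead of bouncing through file->f_op->uring_cmd() */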
Hopefully I'll get to describing it in detail, or even hacking
it up.
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 16+ messages in thread
Thread overview: 16+ messages
2021-01-27 21:25 [PATCHSET RFC 0/5] file_operations based io_uring commands Jens Axboe
2021-01-27 21:25 ` [PATCH 1/5] fs: add file_operations->uring_cmd() Jens Axboe
2021-01-27 21:25 ` [PATCH 2/5] io_uring: add support for IORING_OP_URING_CMD Jens Axboe
2021-01-28 0:38 ` Darrick J. Wong
2021-01-28 1:45 ` Jens Axboe
2021-01-28 2:19 ` Jens Axboe
2021-02-20 3:57 ` Stefan Metzmacher
2021-02-20 14:50 ` Jens Axboe
2021-02-20 16:45 ` Jens Axboe
2021-02-22 20:04 ` Stefan Metzmacher
2021-02-22 20:14 ` Jens Axboe
2021-02-23 8:14 ` Stefan Metzmacher
2021-02-23 13:21 ` Pavel Begunkov
2021-01-27 21:25 ` [PATCH 3/5] block: wire up support for file_operations->uring_cmd() Jens Axboe
2021-01-27 21:25 ` [PATCH 4/5] block: add example ioctl Jens Axboe
2021-01-27 21:25 ` [PATCH 5/5] net: wire up support for file_operations->uring_cmd() Jens Axboe