public inbox for [email protected]
* [PATCHSET 0/6] Enable NO_OFFLOAD support
@ 2023-04-19 16:25 Jens Axboe
  2023-04-19 16:25 ` [PATCH 1/6] io_uring: grow struct io_kiocb 'flags' to a 64-bit value Jens Axboe
                   ` (6 more replies)
  0 siblings, 7 replies; 15+ messages in thread
From: Jens Axboe @ 2023-04-19 16:25 UTC (permalink / raw)
  To: io-uring; +Cc: luhongfei

Hi,

This series enables support for forcing no-offload for requests that
otherwise would have been punted to io-wq. In essence, it bypasses
the normal non-blocking issue in favor of just letting the issue block.
This is only done for requests that would've otherwise hit io-wq in
the offload path; anything pollable will still do non-blocking
issue. See patch 3 for details.

Patches 1-2 are just prep patches, patches 4-5 are reverts of earlier
async cleanups, and patch 6 enables this for requests that don't support
non-blocking issue at all.
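
For reference, a minimal userspace sketch of opting a whole ring into this
at setup time could look like the below. This assumes liburing headers
regenerated against this series; the flag value itself comes from patch 3.

    #include <liburing.h>

    #ifndef IORING_SETUP_NO_OFFLOAD
    #define IORING_SETUP_NO_OFFLOAD  (1U << 14)  /* added by this series */
    #endif

    /* Sketch only: kernels without this series fail setup with -EINVAL */
    static int setup_no_offload_ring(struct io_uring *ring, unsigned entries)
    {
            return io_uring_queue_init(entries, ring, IORING_SETUP_NO_OFFLOAD);
    }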

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 1/6] io_uring: grow struct io_kiocb 'flags' to a 64-bit value
  2023-04-19 16:25 [PATCHSET 0/6] Enable NO_OFFLOAD support Jens Axboe
@ 2023-04-19 16:25 ` Jens Axboe
  2023-04-19 16:25 ` [PATCH 2/6] io_uring: move poll_refs up a cacheline to fill a hole Jens Axboe
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 15+ messages in thread
From: Jens Axboe @ 2023-04-19 16:25 UTC (permalink / raw)
  To: io-uring; +Cc: luhongfei, Jens Axboe

We've run out of flags at this point, and none of the existing flags are
easily removable. Bump the flags field to a 64-bit value to add room for more.
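
For context (not part of this patch): BIT() expands to an unsigned long
constant, which is only 32 bits wide on 32-bit architectures, while
BIT_ULL() is always 64 bits wide. Roughly, from include/linux/bits.h:

    #define BIT(nr)      (UL(1) << (nr))   /* unsigned long width */
    #define BIT_ULL(nr)  (ULL(1) << (nr))  /* always 64 bits wide */

Hence the flag definitions below switch to BIT_ULL() along with the field
itself going to u64.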

Signed-off-by: Jens Axboe <[email protected]>
---
 include/linux/io_uring_types.h | 64 +++++++++++++++++-----------------
 io_uring/filetable.h           |  2 +-
 io_uring/io_uring.c            |  6 ++--
 3 files changed, 36 insertions(+), 36 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 1b2a20a42413..84f436cc6509 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -413,68 +413,68 @@ enum {
 
 enum {
 	/* ctx owns file */
-	REQ_F_FIXED_FILE	= BIT(REQ_F_FIXED_FILE_BIT),
+	REQ_F_FIXED_FILE	= BIT_ULL(REQ_F_FIXED_FILE_BIT),
 	/* drain existing IO first */
-	REQ_F_IO_DRAIN		= BIT(REQ_F_IO_DRAIN_BIT),
+	REQ_F_IO_DRAIN		= BIT_ULL(REQ_F_IO_DRAIN_BIT),
 	/* linked sqes */
-	REQ_F_LINK		= BIT(REQ_F_LINK_BIT),
+	REQ_F_LINK		= BIT_ULL(REQ_F_LINK_BIT),
 	/* doesn't sever on completion < 0 */
-	REQ_F_HARDLINK		= BIT(REQ_F_HARDLINK_BIT),
+	REQ_F_HARDLINK		= BIT_ULL(REQ_F_HARDLINK_BIT),
 	/* IOSQE_ASYNC */
-	REQ_F_FORCE_ASYNC	= BIT(REQ_F_FORCE_ASYNC_BIT),
+	REQ_F_FORCE_ASYNC	= BIT_ULL(REQ_F_FORCE_ASYNC_BIT),
 	/* IOSQE_BUFFER_SELECT */
-	REQ_F_BUFFER_SELECT	= BIT(REQ_F_BUFFER_SELECT_BIT),
+	REQ_F_BUFFER_SELECT	= BIT_ULL(REQ_F_BUFFER_SELECT_BIT),
 	/* IOSQE_CQE_SKIP_SUCCESS */
-	REQ_F_CQE_SKIP		= BIT(REQ_F_CQE_SKIP_BIT),
+	REQ_F_CQE_SKIP		= BIT_ULL(REQ_F_CQE_SKIP_BIT),
 
 	/* fail rest of links */
-	REQ_F_FAIL		= BIT(REQ_F_FAIL_BIT),
+	REQ_F_FAIL		= BIT_ULL(REQ_F_FAIL_BIT),
 	/* on inflight list, should be cancelled and waited on exit reliably */
-	REQ_F_INFLIGHT		= BIT(REQ_F_INFLIGHT_BIT),
+	REQ_F_INFLIGHT		= BIT_ULL(REQ_F_INFLIGHT_BIT),
 	/* read/write uses file position */
-	REQ_F_CUR_POS		= BIT(REQ_F_CUR_POS_BIT),
+	REQ_F_CUR_POS		= BIT_ULL(REQ_F_CUR_POS_BIT),
 	/* must not punt to workers */
-	REQ_F_NOWAIT		= BIT(REQ_F_NOWAIT_BIT),
+	REQ_F_NOWAIT		= BIT_ULL(REQ_F_NOWAIT_BIT),
 	/* has or had linked timeout */
-	REQ_F_LINK_TIMEOUT	= BIT(REQ_F_LINK_TIMEOUT_BIT),
+	REQ_F_LINK_TIMEOUT	= BIT_ULL(REQ_F_LINK_TIMEOUT_BIT),
 	/* needs cleanup */
-	REQ_F_NEED_CLEANUP	= BIT(REQ_F_NEED_CLEANUP_BIT),
+	REQ_F_NEED_CLEANUP	= BIT_ULL(REQ_F_NEED_CLEANUP_BIT),
 	/* already went through poll handler */
-	REQ_F_POLLED		= BIT(REQ_F_POLLED_BIT),
+	REQ_F_POLLED		= BIT_ULL(REQ_F_POLLED_BIT),
 	/* buffer already selected */
-	REQ_F_BUFFER_SELECTED	= BIT(REQ_F_BUFFER_SELECTED_BIT),
+	REQ_F_BUFFER_SELECTED	= BIT_ULL(REQ_F_BUFFER_SELECTED_BIT),
 	/* buffer selected from ring, needs commit */
-	REQ_F_BUFFER_RING	= BIT(REQ_F_BUFFER_RING_BIT),
+	REQ_F_BUFFER_RING	= BIT_ULL(REQ_F_BUFFER_RING_BIT),
 	/* caller should reissue async */
-	REQ_F_REISSUE		= BIT(REQ_F_REISSUE_BIT),
+	REQ_F_REISSUE		= BIT_ULL(REQ_F_REISSUE_BIT),
 	/* supports async reads/writes */
-	REQ_F_SUPPORT_NOWAIT	= BIT(REQ_F_SUPPORT_NOWAIT_BIT),
+	REQ_F_SUPPORT_NOWAIT	= BIT_ULL(REQ_F_SUPPORT_NOWAIT_BIT),
 	/* regular file */
-	REQ_F_ISREG		= BIT(REQ_F_ISREG_BIT),
+	REQ_F_ISREG		= BIT_ULL(REQ_F_ISREG_BIT),
 	/* has creds assigned */
-	REQ_F_CREDS		= BIT(REQ_F_CREDS_BIT),
+	REQ_F_CREDS		= BIT_ULL(REQ_F_CREDS_BIT),
 	/* skip refcounting if not set */
-	REQ_F_REFCOUNT		= BIT(REQ_F_REFCOUNT_BIT),
+	REQ_F_REFCOUNT		= BIT_ULL(REQ_F_REFCOUNT_BIT),
 	/* there is a linked timeout that has to be armed */
-	REQ_F_ARM_LTIMEOUT	= BIT(REQ_F_ARM_LTIMEOUT_BIT),
+	REQ_F_ARM_LTIMEOUT	= BIT_ULL(REQ_F_ARM_LTIMEOUT_BIT),
 	/* ->async_data allocated */
-	REQ_F_ASYNC_DATA	= BIT(REQ_F_ASYNC_DATA_BIT),
+	REQ_F_ASYNC_DATA	= BIT_ULL(REQ_F_ASYNC_DATA_BIT),
 	/* don't post CQEs while failing linked requests */
-	REQ_F_SKIP_LINK_CQES	= BIT(REQ_F_SKIP_LINK_CQES_BIT),
+	REQ_F_SKIP_LINK_CQES	= BIT_ULL(REQ_F_SKIP_LINK_CQES_BIT),
 	/* single poll may be active */
-	REQ_F_SINGLE_POLL	= BIT(REQ_F_SINGLE_POLL_BIT),
+	REQ_F_SINGLE_POLL	= BIT_ULL(REQ_F_SINGLE_POLL_BIT),
 	/* double poll may active */
-	REQ_F_DOUBLE_POLL	= BIT(REQ_F_DOUBLE_POLL_BIT),
+	REQ_F_DOUBLE_POLL	= BIT_ULL(REQ_F_DOUBLE_POLL_BIT),
 	/* request has already done partial IO */
-	REQ_F_PARTIAL_IO	= BIT(REQ_F_PARTIAL_IO_BIT),
+	REQ_F_PARTIAL_IO	= BIT_ULL(REQ_F_PARTIAL_IO_BIT),
 	/* fast poll multishot mode */
-	REQ_F_APOLL_MULTISHOT	= BIT(REQ_F_APOLL_MULTISHOT_BIT),
+	REQ_F_APOLL_MULTISHOT	= BIT_ULL(REQ_F_APOLL_MULTISHOT_BIT),
 	/* ->extra1 and ->extra2 are initialised */
-	REQ_F_CQE32_INIT	= BIT(REQ_F_CQE32_INIT_BIT),
+	REQ_F_CQE32_INIT	= BIT_ULL(REQ_F_CQE32_INIT_BIT),
 	/* recvmsg special flag, clear EPOLLIN */
-	REQ_F_CLEAR_POLLIN	= BIT(REQ_F_CLEAR_POLLIN_BIT),
+	REQ_F_CLEAR_POLLIN	= BIT_ULL(REQ_F_CLEAR_POLLIN_BIT),
 	/* hashed into ->cancel_hash_locked, protected by ->uring_lock */
-	REQ_F_HASH_LOCKED	= BIT(REQ_F_HASH_LOCKED_BIT),
+	REQ_F_HASH_LOCKED	= BIT_ULL(REQ_F_HASH_LOCKED_BIT),
 };
 
 typedef void (*io_req_tw_func_t)(struct io_kiocb *req, struct io_tw_state *ts);
@@ -535,7 +535,7 @@ struct io_kiocb {
 	 * and after selection it points to the buffer ID itself.
 	 */
 	u16				buf_index;
-	unsigned int			flags;
+	u64				flags;
 
 	struct io_cqe			cqe;
 
diff --git a/io_uring/filetable.h b/io_uring/filetable.h
index 351111ff8882..cfa32dcd77a1 100644
--- a/io_uring/filetable.h
+++ b/io_uring/filetable.h
@@ -21,7 +21,7 @@ int io_fixed_fd_remove(struct io_ring_ctx *ctx, unsigned int offset);
 int io_register_file_alloc_range(struct io_ring_ctx *ctx,
 				 struct io_uring_file_index_range __user *arg);
 
-unsigned int io_file_get_flags(struct file *file);
+u64 io_file_get_flags(struct file *file);
 
 static inline void io_file_bitmap_clear(struct io_file_table *table, int bit)
 {
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 3bca7a79efda..9568b5e4cf87 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1113,7 +1113,7 @@ __cold bool __io_alloc_req_refill(struct io_ring_ctx *ctx)
 
 static inline void io_dismantle_req(struct io_kiocb *req)
 {
-	unsigned int flags = req->flags;
+	u64 flags = req->flags;
 
 	if (unlikely(flags & IO_REQ_CLEAN_FLAGS))
 		io_clean_op(req);
@@ -1797,7 +1797,7 @@ static bool __io_file_supports_nowait(struct file *file, umode_t mode)
  * any file. For now, just ensure that anything potentially problematic is done
  * inline.
  */
-unsigned int io_file_get_flags(struct file *file)
+u64 io_file_get_flags(struct file *file)
 {
 	umode_t mode = file_inode(file)->i_mode;
 	unsigned int res = 0;
@@ -4544,7 +4544,7 @@ static int __init io_uring_init(void)
 	BUILD_BUG_ON(SQE_COMMON_FLAGS >= (1 << 8));
 	BUILD_BUG_ON((SQE_VALID_FLAGS | SQE_COMMON_FLAGS) != SQE_VALID_FLAGS);
 
-	BUILD_BUG_ON(__REQ_F_LAST_BIT > 8 * sizeof(int));
+	BUILD_BUG_ON(__REQ_F_LAST_BIT > 8 * sizeof(u64));
 
 	BUILD_BUG_ON(sizeof(atomic_t) != sizeof(u32));
 
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 2/6] io_uring: move poll_refs up a cacheline to fill a hole
  2023-04-19 16:25 [PATCHSET 0/6] Enable NO_OFFLOAD support Jens Axboe
  2023-04-19 16:25 ` [PATCH 1/6] io_uring: grow struct io_kiocb 'flags' to a 64-bit value Jens Axboe
@ 2023-04-19 16:25 ` Jens Axboe
  2023-04-20  0:50   ` Pavel Begunkov
  2023-04-19 16:25 ` [PATCH 3/6] io_uring: add support for NO_OFFLOAD Jens Axboe
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 15+ messages in thread
From: Jens Axboe @ 2023-04-19 16:25 UTC (permalink / raw)
  To: io-uring; +Cc: luhongfei, Jens Axboe

After bumping the flags to 64 bits, we now have two holes in io_kiocb.
The best candidate for moving is poll_refs, so as not to split the
task_work related items.

Signed-off-by: Jens Axboe <[email protected]>
---
 include/linux/io_uring_types.h | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 84f436cc6509..4dd54d2173e1 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -535,6 +535,9 @@ struct io_kiocb {
 	 * and after selection it points to the buffer ID itself.
 	 */
 	u16				buf_index;
+
+	atomic_t			poll_refs;
+
 	u64				flags;
 
 	struct io_cqe			cqe;
@@ -565,9 +568,8 @@ struct io_kiocb {
 		__poll_t apoll_events;
 	};
 	atomic_t			refs;
-	atomic_t			poll_refs;
-	struct io_task_work		io_task_work;
 	unsigned			nr_tw;
+	struct io_task_work		io_task_work;
 	/* for polled requests, i.e. IORING_OP_POLL_ADD and async armed poll */
 	union {
 		struct hlist_node	hash_node;
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 3/6] io_uring: add support for NO_OFFLOAD
  2023-04-19 16:25 [PATCHSET 0/6] Enable NO_OFFLOAD support Jens Axboe
  2023-04-19 16:25 ` [PATCH 1/6] io_uring: grow struct io_kiocb 'flags' to a 64-bit value Jens Axboe
  2023-04-19 16:25 ` [PATCH 2/6] io_uring: move poll_refs up a cacheline to fill a hole Jens Axboe
@ 2023-04-19 16:25 ` Jens Axboe
  2023-04-20  1:01   ` Pavel Begunkov
  2023-04-19 16:25 ` [PATCH 4/6] Revert "io_uring: always go async for unsupported fadvise flags" Jens Axboe
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 15+ messages in thread
From: Jens Axboe @ 2023-04-19 16:25 UTC (permalink / raw)
  To: io-uring; +Cc: luhongfei, Jens Axboe

Some applications don't necessarily care about io_uring not blocking for
request issue; they simply want to use io_uring for batched submission
of IO. However, io_uring will always do non-blocking issue, and for
some request types there's simply no support for doing non-blocking
issue, hence they get punted to io-wq unconditionally. If the
application doesn't care about issue potentially blocking, this causes
a performance slowdown, as thread offload is not nearly as efficient as
inline issue.

Add support for configuring the ring with IORING_SETUP_NO_OFFLOAD, and
add an IORING_ENTER_NO_OFFLOAD flag to io_uring_enter(2). If either one
of these is set, then io_uring will skip the non-blocking issue attempt
for any file that we cannot poll for readiness. The simplified io_uring
issue model looks as follows:

1) Non-blocking issue is attempted for IO. If successful, we're done for
   now.

2) Case 1 failed. Now we have two options:
   a) We can poll the file. We arm poll, and we're done for now until
      that triggers.
   b) File cannot be polled, we punt to io-wq which then does a
      blocking attempt.

If either of the NO_OFFLOAD flags is set, we should never hit case
2b. Instead, case 1 would issue the IO without the non-blocking flag
being set and perform an inline completion.
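
As a rough userspace sketch of the per-call opt-in (liburing has no helper
for passing extra enter flags here, so this uses the raw syscall; the
IORING_ENTER_NO_OFFLOAD value is the one added by this patch):

    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/io_uring.h>

    #ifndef IORING_ENTER_NO_OFFLOAD
    #define IORING_ENTER_NO_OFFLOAD  (1U << 5)  /* added by this patch */
    #endif

    /* Sketch: submit 'to_submit' SQEs, allowing issue to block inline */
    static int submit_no_offload(int ring_fd, unsigned int to_submit)
    {
            return syscall(__NR_io_uring_enter, ring_fd, to_submit, 0,
                           IORING_ENTER_NO_OFFLOAD, NULL, 0);
    }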

Signed-off-by: Jens Axboe <[email protected]>
---
 include/linux/io_uring_types.h |  3 +++
 include/uapi/linux/io_uring.h  |  7 +++++++
 io_uring/io_uring.c            | 26 ++++++++++++++++++++------
 io_uring/io_uring.h            |  2 +-
 io_uring/sqpoll.c              |  3 ++-
 5 files changed, 33 insertions(+), 8 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 4dd54d2173e1..c54f3fb7ab1a 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -403,6 +403,7 @@ enum {
 	REQ_F_APOLL_MULTISHOT_BIT,
 	REQ_F_CLEAR_POLLIN_BIT,
 	REQ_F_HASH_LOCKED_BIT,
+	REQ_F_NO_OFFLOAD_BIT,
 	/* keep async read/write and isreg together and in order */
 	REQ_F_SUPPORT_NOWAIT_BIT,
 	REQ_F_ISREG_BIT,
@@ -475,6 +476,8 @@ enum {
 	REQ_F_CLEAR_POLLIN	= BIT_ULL(REQ_F_CLEAR_POLLIN_BIT),
 	/* hashed into ->cancel_hash_locked, protected by ->uring_lock */
 	REQ_F_HASH_LOCKED	= BIT_ULL(REQ_F_HASH_LOCKED_BIT),
+	/* don't offload to io-wq, issue blocking if needed */
+	REQ_F_NO_OFFLOAD	= BIT_ULL(REQ_F_NO_OFFLOAD_BIT),
 };
 
 typedef void (*io_req_tw_func_t)(struct io_kiocb *req, struct io_tw_state *ts);
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 0716cb17e436..ea903a677ce9 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -173,6 +173,12 @@ enum {
  */
 #define IORING_SETUP_DEFER_TASKRUN	(1U << 13)
 
+/*
+ * Don't attempt non-blocking issue on file types that would otherwise
+ * punt to io-wq if they cannot be completed non-blocking.
+ */
+#define IORING_SETUP_NO_OFFLOAD		(1U << 14)
+
 enum io_uring_op {
 	IORING_OP_NOP,
 	IORING_OP_READV,
@@ -443,6 +449,7 @@ struct io_cqring_offsets {
 #define IORING_ENTER_SQ_WAIT		(1U << 2)
 #define IORING_ENTER_EXT_ARG		(1U << 3)
 #define IORING_ENTER_REGISTERED_RING	(1U << 4)
+#define IORING_ENTER_NO_OFFLOAD		(1U << 5)
 
 /*
  * Passed in for io_uring_setup(2). Copied back with updated info on success
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 9568b5e4cf87..04770b06de16 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1947,6 +1947,10 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
 	if (unlikely(!io_assign_file(req, def, issue_flags)))
 		return -EBADF;
 
+	if (req->flags & REQ_F_NO_OFFLOAD &&
+	    (!req->file || !file_can_poll(req->file)))
+		issue_flags &= ~IO_URING_F_NONBLOCK;
+
 	if (unlikely((req->flags & REQ_F_CREDS) && req->creds != current_cred()))
 		creds = override_creds(req->creds);
 
@@ -2337,7 +2341,7 @@ static __cold int io_submit_fail_init(const struct io_uring_sqe *sqe,
 }
 
 static inline int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
-			 const struct io_uring_sqe *sqe)
+			 const struct io_uring_sqe *sqe, bool no_offload)
 	__must_hold(&ctx->uring_lock)
 {
 	struct io_submit_link *link = &ctx->submit_state.link;
@@ -2385,6 +2389,9 @@ static inline int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 		return 0;
 	}
 
+	if (no_offload)
+		req->flags |= REQ_F_NO_OFFLOAD;
+
 	io_queue_sqe(req);
 	return 0;
 }
@@ -2466,7 +2473,7 @@ static bool io_get_sqe(struct io_ring_ctx *ctx, const struct io_uring_sqe **sqe)
 	return false;
 }
 
-int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr)
+int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, bool no_offload)
 	__must_hold(&ctx->uring_lock)
 {
 	unsigned int entries = io_sqring_entries(ctx);
@@ -2495,7 +2502,7 @@ int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr)
 		 * Continue submitting even for sqe failure if the
 		 * ring was setup with IORING_SETUP_SUBMIT_ALL
 		 */
-		if (unlikely(io_submit_sqe(ctx, req, sqe)) &&
+		if (unlikely(io_submit_sqe(ctx, req, sqe, no_offload)) &&
 		    !(ctx->flags & IORING_SETUP_SUBMIT_ALL)) {
 			left--;
 			break;
@@ -3524,7 +3531,8 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
 
 	if (unlikely(flags & ~(IORING_ENTER_GETEVENTS | IORING_ENTER_SQ_WAKEUP |
 			       IORING_ENTER_SQ_WAIT | IORING_ENTER_EXT_ARG |
-			       IORING_ENTER_REGISTERED_RING)))
+			       IORING_ENTER_REGISTERED_RING |
+			       IORING_ENTER_NO_OFFLOAD)))
 		return -EINVAL;
 
 	/*
@@ -3575,12 +3583,17 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
 
 		ret = to_submit;
 	} else if (to_submit) {
+		bool no_offload;
+
 		ret = io_uring_add_tctx_node(ctx);
 		if (unlikely(ret))
 			goto out;
 
+		no_offload = flags & IORING_ENTER_NO_OFFLOAD ||
+				ctx->flags & IORING_SETUP_NO_OFFLOAD;
+
 		mutex_lock(&ctx->uring_lock);
-		ret = io_submit_sqes(ctx, to_submit);
+		ret = io_submit_sqes(ctx, to_submit, no_offload);
 		if (ret != to_submit) {
 			mutex_unlock(&ctx->uring_lock);
 			goto out;
@@ -3969,7 +3982,8 @@ static long io_uring_setup(u32 entries, struct io_uring_params __user *params)
 			IORING_SETUP_R_DISABLED | IORING_SETUP_SUBMIT_ALL |
 			IORING_SETUP_COOP_TASKRUN | IORING_SETUP_TASKRUN_FLAG |
 			IORING_SETUP_SQE128 | IORING_SETUP_CQE32 |
-			IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN))
+			IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN |
+			IORING_SETUP_NO_OFFLOAD))
 		return -EINVAL;
 
 	return io_uring_create(entries, &p, params);
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 25515d69d205..c5c0db7232c0 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -76,7 +76,7 @@ int io_uring_alloc_task_context(struct task_struct *task,
 				struct io_ring_ctx *ctx);
 
 int io_poll_issue(struct io_kiocb *req, struct io_tw_state *ts);
-int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr);
+int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, bool no_offload);
 int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin);
 void io_free_batch_list(struct io_ring_ctx *ctx, struct io_wq_work_node *node);
 int io_req_prep_async(struct io_kiocb *req);
diff --git a/io_uring/sqpoll.c b/io_uring/sqpoll.c
index 9db4bc1f521a..9a9417bf9e3f 100644
--- a/io_uring/sqpoll.c
+++ b/io_uring/sqpoll.c
@@ -166,6 +166,7 @@ static inline bool io_sqd_events_pending(struct io_sq_data *sqd)
 
 static int __io_sq_thread(struct io_ring_ctx *ctx, bool cap_entries)
 {
+	bool no_offload = ctx->flags & IORING_SETUP_NO_OFFLOAD;
 	unsigned int to_submit;
 	int ret = 0;
 
@@ -190,7 +191,7 @@ static int __io_sq_thread(struct io_ring_ctx *ctx, bool cap_entries)
 		 */
 		if (to_submit && likely(!percpu_ref_is_dying(&ctx->refs)) &&
 		    !(ctx->flags & IORING_SETUP_R_DISABLED))
-			ret = io_submit_sqes(ctx, to_submit);
+			ret = io_submit_sqes(ctx, to_submit, no_offload);
 		mutex_unlock(&ctx->uring_lock);
 
 		if (to_submit && wq_has_sleeper(&ctx->sqo_sq_wait))
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 4/6] Revert "io_uring: always go async for unsupported fadvise flags"
  2023-04-19 16:25 [PATCHSET 0/6] Enable NO_OFFLOAD support Jens Axboe
                   ` (2 preceding siblings ...)
  2023-04-19 16:25 ` [PATCH 3/6] io_uring: add support for NO_OFFLOAD Jens Axboe
@ 2023-04-19 16:25 ` Jens Axboe
  2023-04-19 16:25 ` [PATCH 5/6] Revert "io_uring: for requests that require async, force it" Jens Axboe
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 15+ messages in thread
From: Jens Axboe @ 2023-04-19 16:25 UTC (permalink / raw)
  To: io-uring; +Cc: luhongfei, Jens Axboe

This reverts commit c31cc60fddd11134031e7f9eb76812353cfaac84.

In preparation for handling this a bit differently, revert this
cleanup.

Signed-off-by: Jens Axboe <[email protected]>
---
 io_uring/advise.c | 25 ++++++++++---------------
 1 file changed, 10 insertions(+), 15 deletions(-)

diff --git a/io_uring/advise.c b/io_uring/advise.c
index 7085804c513c..cf600579bffe 100644
--- a/io_uring/advise.c
+++ b/io_uring/advise.c
@@ -62,18 +62,6 @@ int io_madvise(struct io_kiocb *req, unsigned int issue_flags)
 #endif
 }
 
-static bool io_fadvise_force_async(struct io_fadvise *fa)
-{
-	switch (fa->advice) {
-	case POSIX_FADV_NORMAL:
-	case POSIX_FADV_RANDOM:
-	case POSIX_FADV_SEQUENTIAL:
-		return false;
-	default:
-		return true;
-	}
-}
-
 int io_fadvise_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
 	struct io_fadvise *fa = io_kiocb_to_cmd(req, struct io_fadvise);
@@ -84,8 +72,6 @@ int io_fadvise_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	fa->offset = READ_ONCE(sqe->off);
 	fa->len = READ_ONCE(sqe->len);
 	fa->advice = READ_ONCE(sqe->fadvise_advice);
-	if (io_fadvise_force_async(fa))
-		req->flags |= REQ_F_FORCE_ASYNC;
 	return 0;
 }
 
@@ -94,7 +80,16 @@ int io_fadvise(struct io_kiocb *req, unsigned int issue_flags)
 	struct io_fadvise *fa = io_kiocb_to_cmd(req, struct io_fadvise);
 	int ret;
 
-	WARN_ON_ONCE(issue_flags & IO_URING_F_NONBLOCK && io_fadvise_force_async(fa));
+	if (issue_flags & IO_URING_F_NONBLOCK) {
+		switch (fa->advice) {
+		case POSIX_FADV_NORMAL:
+		case POSIX_FADV_RANDOM:
+		case POSIX_FADV_SEQUENTIAL:
+			break;
+		default:
+			return -EAGAIN;
+		}
+	}
 
 	ret = vfs_fadvise(req->file, fa->offset, fa->len, fa->advice);
 	if (ret < 0)
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 5/6] Revert "io_uring: for requests that require async, force it"
  2023-04-19 16:25 [PATCHSET 0/6] Enable NO_OFFLOAD support Jens Axboe
                   ` (3 preceding siblings ...)
  2023-04-19 16:25 ` [PATCH 4/6] Revert "io_uring: always go async for unsupported fadvise flags" Jens Axboe
@ 2023-04-19 16:25 ` Jens Axboe
  2023-04-19 16:25 ` [PATCH 6/6] io_uring: mark opcodes that always need io-wq punt Jens Axboe
  2023-04-20  0:43 ` [PATCHSET 0/6] Enable NO_OFFLOAD support Pavel Begunkov
  6 siblings, 0 replies; 15+ messages in thread
From: Jens Axboe @ 2023-04-19 16:25 UTC (permalink / raw)
  To: io-uring; +Cc: luhongfei, Jens Axboe

This reverts commit aebb224fd4fc7352cd839ad90414c548387142fd.

In preparation for handling this a bit differently, revert this
cleanup.

Signed-off-by: Jens Axboe <[email protected]>
---
 io_uring/advise.c |  4 ++--
 io_uring/fs.c     | 20 ++++++++++----------
 io_uring/net.c    |  4 ++--
 io_uring/splice.c |  7 ++++---
 io_uring/statx.c  |  4 ++--
 io_uring/sync.c   | 14 ++++++--------
 io_uring/xattr.c  | 14 ++++++++------
 7 files changed, 34 insertions(+), 33 deletions(-)

diff --git a/io_uring/advise.c b/io_uring/advise.c
index cf600579bffe..449c6f14649f 100644
--- a/io_uring/advise.c
+++ b/io_uring/advise.c
@@ -39,7 +39,6 @@ int io_madvise_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	ma->addr = READ_ONCE(sqe->addr);
 	ma->len = READ_ONCE(sqe->len);
 	ma->advice = READ_ONCE(sqe->fadvise_advice);
-	req->flags |= REQ_F_FORCE_ASYNC;
 	return 0;
 #else
 	return -EOPNOTSUPP;
@@ -52,7 +51,8 @@ int io_madvise(struct io_kiocb *req, unsigned int issue_flags)
 	struct io_madvise *ma = io_kiocb_to_cmd(req, struct io_madvise);
 	int ret;
 
-	WARN_ON_ONCE(issue_flags & IO_URING_F_NONBLOCK);
+	if (issue_flags & IO_URING_F_NONBLOCK)
+		return -EAGAIN;
 
 	ret = do_madvise(current->mm, ma->addr, ma->len, ma->advice);
 	io_req_set_res(req, ret, 0);
diff --git a/io_uring/fs.c b/io_uring/fs.c
index f6a69a549fd4..7100c293c13a 100644
--- a/io_uring/fs.c
+++ b/io_uring/fs.c
@@ -74,7 +74,6 @@ int io_renameat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	}
 
 	req->flags |= REQ_F_NEED_CLEANUP;
-	req->flags |= REQ_F_FORCE_ASYNC;
 	return 0;
 }
 
@@ -83,7 +82,8 @@ int io_renameat(struct io_kiocb *req, unsigned int issue_flags)
 	struct io_rename *ren = io_kiocb_to_cmd(req, struct io_rename);
 	int ret;
 
-	WARN_ON_ONCE(issue_flags & IO_URING_F_NONBLOCK);
+	if (issue_flags & IO_URING_F_NONBLOCK)
+		return -EAGAIN;
 
 	ret = do_renameat2(ren->old_dfd, ren->oldpath, ren->new_dfd,
 				ren->newpath, ren->flags);
@@ -123,7 +123,6 @@ int io_unlinkat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 		return PTR_ERR(un->filename);
 
 	req->flags |= REQ_F_NEED_CLEANUP;
-	req->flags |= REQ_F_FORCE_ASYNC;
 	return 0;
 }
 
@@ -132,7 +131,8 @@ int io_unlinkat(struct io_kiocb *req, unsigned int issue_flags)
 	struct io_unlink *un = io_kiocb_to_cmd(req, struct io_unlink);
 	int ret;
 
-	WARN_ON_ONCE(issue_flags & IO_URING_F_NONBLOCK);
+	if (issue_flags & IO_URING_F_NONBLOCK)
+		return -EAGAIN;
 
 	if (un->flags & AT_REMOVEDIR)
 		ret = do_rmdir(un->dfd, un->filename);
@@ -170,7 +170,6 @@ int io_mkdirat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 		return PTR_ERR(mkd->filename);
 
 	req->flags |= REQ_F_NEED_CLEANUP;
-	req->flags |= REQ_F_FORCE_ASYNC;
 	return 0;
 }
 
@@ -179,7 +178,8 @@ int io_mkdirat(struct io_kiocb *req, unsigned int issue_flags)
 	struct io_mkdir *mkd = io_kiocb_to_cmd(req, struct io_mkdir);
 	int ret;
 
-	WARN_ON_ONCE(issue_flags & IO_URING_F_NONBLOCK);
+	if (issue_flags & IO_URING_F_NONBLOCK)
+		return -EAGAIN;
 
 	ret = do_mkdirat(mkd->dfd, mkd->filename, mkd->mode);
 
@@ -220,7 +220,6 @@ int io_symlinkat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	}
 
 	req->flags |= REQ_F_NEED_CLEANUP;
-	req->flags |= REQ_F_FORCE_ASYNC;
 	return 0;
 }
 
@@ -229,7 +228,8 @@ int io_symlinkat(struct io_kiocb *req, unsigned int issue_flags)
 	struct io_link *sl = io_kiocb_to_cmd(req, struct io_link);
 	int ret;
 
-	WARN_ON_ONCE(issue_flags & IO_URING_F_NONBLOCK);
+	if (issue_flags & IO_URING_F_NONBLOCK)
+		return -EAGAIN;
 
 	ret = do_symlinkat(sl->oldpath, sl->new_dfd, sl->newpath);
 
@@ -265,7 +265,6 @@ int io_linkat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	}
 
 	req->flags |= REQ_F_NEED_CLEANUP;
-	req->flags |= REQ_F_FORCE_ASYNC;
 	return 0;
 }
 
@@ -274,7 +273,8 @@ int io_linkat(struct io_kiocb *req, unsigned int issue_flags)
 	struct io_link *lnk = io_kiocb_to_cmd(req, struct io_link);
 	int ret;
 
-	WARN_ON_ONCE(issue_flags & IO_URING_F_NONBLOCK);
+	if (issue_flags & IO_URING_F_NONBLOCK)
+		return -EAGAIN;
 
 	ret = do_linkat(lnk->old_dfd, lnk->oldpath, lnk->new_dfd,
 				lnk->newpath, lnk->flags);
diff --git a/io_uring/net.c b/io_uring/net.c
index 89e839013837..e85a868290ec 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -91,7 +91,6 @@ int io_shutdown_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 		return -EINVAL;
 
 	shutdown->how = READ_ONCE(sqe->len);
-	req->flags |= REQ_F_FORCE_ASYNC;
 	return 0;
 }
 
@@ -101,7 +100,8 @@ int io_shutdown(struct io_kiocb *req, unsigned int issue_flags)
 	struct socket *sock;
 	int ret;
 
-	WARN_ON_ONCE(issue_flags & IO_URING_F_NONBLOCK);
+	if (issue_flags & IO_URING_F_NONBLOCK)
+		return -EAGAIN;
 
 	sock = sock_from_file(req->file);
 	if (unlikely(!sock))
diff --git a/io_uring/splice.c b/io_uring/splice.c
index 2a4bbb719531..53e4232d0866 100644
--- a/io_uring/splice.c
+++ b/io_uring/splice.c
@@ -34,7 +34,6 @@ static int __io_splice_prep(struct io_kiocb *req,
 	if (unlikely(sp->flags & ~valid_flags))
 		return -EINVAL;
 	sp->splice_fd_in = READ_ONCE(sqe->splice_fd_in);
-	req->flags |= REQ_F_FORCE_ASYNC;
 	return 0;
 }
 
@@ -53,7 +52,8 @@ int io_tee(struct io_kiocb *req, unsigned int issue_flags)
 	struct file *in;
 	long ret = 0;
 
-	WARN_ON_ONCE(issue_flags & IO_URING_F_NONBLOCK);
+	if (issue_flags & IO_URING_F_NONBLOCK)
+		return -EAGAIN;
 
 	if (sp->flags & SPLICE_F_FD_IN_FIXED)
 		in = io_file_get_fixed(req, sp->splice_fd_in, issue_flags);
@@ -94,7 +94,8 @@ int io_splice(struct io_kiocb *req, unsigned int issue_flags)
 	struct file *in;
 	long ret = 0;
 
-	WARN_ON_ONCE(issue_flags & IO_URING_F_NONBLOCK);
+	if (issue_flags & IO_URING_F_NONBLOCK)
+		return -EAGAIN;
 
 	if (sp->flags & SPLICE_F_FD_IN_FIXED)
 		in = io_file_get_fixed(req, sp->splice_fd_in, issue_flags);
diff --git a/io_uring/statx.c b/io_uring/statx.c
index abb874209caa..d8fc933d3f59 100644
--- a/io_uring/statx.c
+++ b/io_uring/statx.c
@@ -48,7 +48,6 @@ int io_statx_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	}
 
 	req->flags |= REQ_F_NEED_CLEANUP;
-	req->flags |= REQ_F_FORCE_ASYNC;
 	return 0;
 }
 
@@ -57,7 +56,8 @@ int io_statx(struct io_kiocb *req, unsigned int issue_flags)
 	struct io_statx *sx = io_kiocb_to_cmd(req, struct io_statx);
 	int ret;
 
-	WARN_ON_ONCE(issue_flags & IO_URING_F_NONBLOCK);
+	if (issue_flags & IO_URING_F_NONBLOCK)
+		return -EAGAIN;
 
 	ret = do_statx(sx->dfd, sx->filename, sx->flags, sx->mask, sx->buffer);
 	io_req_set_res(req, ret, 0);
diff --git a/io_uring/sync.c b/io_uring/sync.c
index 255f68c37e55..64e87ea2b8fb 100644
--- a/io_uring/sync.c
+++ b/io_uring/sync.c
@@ -32,8 +32,6 @@ int io_sfr_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	sync->off = READ_ONCE(sqe->off);
 	sync->len = READ_ONCE(sqe->len);
 	sync->flags = READ_ONCE(sqe->sync_range_flags);
-	req->flags |= REQ_F_FORCE_ASYNC;
-
 	return 0;
 }
 
@@ -43,7 +41,8 @@ int io_sync_file_range(struct io_kiocb *req, unsigned int issue_flags)
 	int ret;
 
 	/* sync_file_range always requires a blocking context */
-	WARN_ON_ONCE(issue_flags & IO_URING_F_NONBLOCK);
+	if (issue_flags & IO_URING_F_NONBLOCK)
+		return -EAGAIN;
 
 	ret = sync_file_range(req->file, sync->off, sync->len, sync->flags);
 	io_req_set_res(req, ret, 0);
@@ -63,7 +62,6 @@ int io_fsync_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 
 	sync->off = READ_ONCE(sqe->off);
 	sync->len = READ_ONCE(sqe->len);
-	req->flags |= REQ_F_FORCE_ASYNC;
 	return 0;
 }
 
@@ -74,7 +72,8 @@ int io_fsync(struct io_kiocb *req, unsigned int issue_flags)
 	int ret;
 
 	/* fsync always requires a blocking context */
-	WARN_ON_ONCE(issue_flags & IO_URING_F_NONBLOCK);
+	if (issue_flags & IO_URING_F_NONBLOCK)
+		return -EAGAIN;
 
 	ret = vfs_fsync_range(req->file, sync->off, end > 0 ? end : LLONG_MAX,
 				sync->flags & IORING_FSYNC_DATASYNC);
@@ -92,7 +91,6 @@ int io_fallocate_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	sync->off = READ_ONCE(sqe->off);
 	sync->len = READ_ONCE(sqe->addr);
 	sync->mode = READ_ONCE(sqe->len);
-	req->flags |= REQ_F_FORCE_ASYNC;
 	return 0;
 }
 
@@ -102,8 +100,8 @@ int io_fallocate(struct io_kiocb *req, unsigned int issue_flags)
 	int ret;
 
 	/* fallocate always requiring blocking context */
-	WARN_ON_ONCE(issue_flags & IO_URING_F_NONBLOCK);
-
+	if (issue_flags & IO_URING_F_NONBLOCK)
+		return -EAGAIN;
 	ret = vfs_fallocate(req->file, sync->mode, sync->off, sync->len);
 	if (ret >= 0)
 		fsnotify_modify(req->file);
diff --git a/io_uring/xattr.c b/io_uring/xattr.c
index e1c810e0b85a..6201a9f442c6 100644
--- a/io_uring/xattr.c
+++ b/io_uring/xattr.c
@@ -75,7 +75,6 @@ static int __io_getxattr_prep(struct io_kiocb *req,
 	}
 
 	req->flags |= REQ_F_NEED_CLEANUP;
-	req->flags |= REQ_F_FORCE_ASYNC;
 	return 0;
 }
 
@@ -110,7 +109,8 @@ int io_fgetxattr(struct io_kiocb *req, unsigned int issue_flags)
 	struct io_xattr *ix = io_kiocb_to_cmd(req, struct io_xattr);
 	int ret;
 
-	WARN_ON_ONCE(issue_flags & IO_URING_F_NONBLOCK);
+	if (issue_flags & IO_URING_F_NONBLOCK)
+		return -EAGAIN;
 
 	ret = do_getxattr(mnt_idmap(req->file->f_path.mnt),
 			req->file->f_path.dentry,
@@ -127,7 +127,8 @@ int io_getxattr(struct io_kiocb *req, unsigned int issue_flags)
 	struct path path;
 	int ret;
 
-	WARN_ON_ONCE(issue_flags & IO_URING_F_NONBLOCK);
+	if (issue_flags & IO_URING_F_NONBLOCK)
+		return -EAGAIN;
 
 retry:
 	ret = filename_lookup(AT_FDCWD, ix->filename, lookup_flags, &path, NULL);
@@ -173,7 +174,6 @@ static int __io_setxattr_prep(struct io_kiocb *req,
 	}
 
 	req->flags |= REQ_F_NEED_CLEANUP;
-	req->flags |= REQ_F_FORCE_ASYNC;
 	return 0;
 }
 
@@ -222,7 +222,8 @@ int io_fsetxattr(struct io_kiocb *req, unsigned int issue_flags)
 {
 	int ret;
 
-	WARN_ON_ONCE(issue_flags & IO_URING_F_NONBLOCK);
+	if (issue_flags & IO_URING_F_NONBLOCK)
+		return -EAGAIN;
 
 	ret = __io_setxattr(req, issue_flags, &req->file->f_path);
 	io_xattr_finish(req, ret);
@@ -236,7 +237,8 @@ int io_setxattr(struct io_kiocb *req, unsigned int issue_flags)
 	struct path path;
 	int ret;
 
-	WARN_ON_ONCE(issue_flags & IO_URING_F_NONBLOCK);
+	if (issue_flags & IO_URING_F_NONBLOCK)
+		return -EAGAIN;
 
 retry:
 	ret = filename_lookup(AT_FDCWD, ix->filename, lookup_flags, &path, NULL);
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 6/6] io_uring: mark opcodes that always need io-wq punt
  2023-04-19 16:25 [PATCHSET 0/6] Enable NO_OFFLOAD support Jens Axboe
                   ` (4 preceding siblings ...)
  2023-04-19 16:25 ` [PATCH 5/6] Revert "io_uring: for requests that require async, force it" Jens Axboe
@ 2023-04-19 16:25 ` Jens Axboe
  2023-04-20  0:43 ` [PATCHSET 0/6] Enable NO_OFFLOAD support Pavel Begunkov
  6 siblings, 0 replies; 15+ messages in thread
From: Jens Axboe @ 2023-04-19 16:25 UTC (permalink / raw)
  To: io-uring; +Cc: luhongfei, Jens Axboe

Add an opdef bit for them, and set it for the opcodes where we always
need an io-wq punt. With that done, such opcodes bypass the file_can_poll()
check and are issued blocking inline rather than offloaded when any of the
NO_OFFLOAD flags are set.

Signed-off-by: Jens Axboe <[email protected]>
---
 io_uring/io_uring.c |  2 +-
 io_uring/opdef.c    | 22 ++++++++++++++++++++--
 io_uring/opdef.h    |  2 ++
 3 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 04770b06de16..91045270b665 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1948,7 +1948,7 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
 		return -EBADF;
 
 	if (req->flags & REQ_F_NO_OFFLOAD &&
-	    (!req->file || !file_can_poll(req->file)))
+	    (!req->file || !file_can_poll(req->file) || def->always_iowq))
 		issue_flags &= ~IO_URING_F_NONBLOCK;
 
 	if (unlikely((req->flags & REQ_F_CREDS) && req->creds != current_cred()))
diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index cca7c5b55208..686d46001622 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -82,6 +82,7 @@ const struct io_issue_def io_issue_defs[] = {
 	[IORING_OP_FSYNC] = {
 		.needs_file		= 1,
 		.audit_skip		= 1,
+		.always_iowq		= 1,
 		.prep			= io_fsync_prep,
 		.issue			= io_fsync,
 	},
@@ -125,6 +126,7 @@ const struct io_issue_def io_issue_defs[] = {
 	[IORING_OP_SYNC_FILE_RANGE] = {
 		.needs_file		= 1,
 		.audit_skip		= 1,
+		.always_iowq		= 1,
 		.prep			= io_sfr_prep,
 		.issue			= io_sync_file_range,
 	},
@@ -202,6 +204,7 @@ const struct io_issue_def io_issue_defs[] = {
 	},
 	[IORING_OP_FALLOCATE] = {
 		.needs_file		= 1,
+		.always_iowq		= 1,
 		.prep			= io_fallocate_prep,
 		.issue			= io_fallocate,
 	},
@@ -221,6 +224,7 @@ const struct io_issue_def io_issue_defs[] = {
 	},
 	[IORING_OP_STATX] = {
 		.audit_skip		= 1,
+		.always_iowq		= 1,
 		.prep			= io_statx_prep,
 		.issue			= io_statx,
 	},
@@ -253,11 +257,13 @@ const struct io_issue_def io_issue_defs[] = {
 	[IORING_OP_FADVISE] = {
 		.needs_file		= 1,
 		.audit_skip		= 1,
+		.always_iowq		= 1,
 		.prep			= io_fadvise_prep,
 		.issue			= io_fadvise,
 	},
 	[IORING_OP_MADVISE] = {
 		.audit_skip		= 1,
+		.always_iowq		= 1,
 		.prep			= io_madvise_prep,
 		.issue			= io_madvise,
 	},
@@ -308,6 +314,7 @@ const struct io_issue_def io_issue_defs[] = {
 		.hash_reg_file		= 1,
 		.unbound_nonreg_file	= 1,
 		.audit_skip		= 1,
+		.always_iowq		= 1,
 		.prep			= io_splice_prep,
 		.issue			= io_splice,
 	},
@@ -328,11 +335,13 @@ const struct io_issue_def io_issue_defs[] = {
 		.hash_reg_file		= 1,
 		.unbound_nonreg_file	= 1,
 		.audit_skip		= 1,
+		.always_iowq		= 1,
 		.prep			= io_tee_prep,
 		.issue			= io_tee,
 	},
 	[IORING_OP_SHUTDOWN] = {
 		.needs_file		= 1,
+		.always_iowq		= 1,
 #if defined(CONFIG_NET)
 		.prep			= io_shutdown_prep,
 		.issue			= io_shutdown,
@@ -343,22 +352,27 @@ const struct io_issue_def io_issue_defs[] = {
 	[IORING_OP_RENAMEAT] = {
 		.prep			= io_renameat_prep,
 		.issue			= io_renameat,
+		.always_iowq		= 1,
 	},
 	[IORING_OP_UNLINKAT] = {
 		.prep			= io_unlinkat_prep,
 		.issue			= io_unlinkat,
+		.always_iowq		= 1,
 	},
 	[IORING_OP_MKDIRAT] = {
 		.prep			= io_mkdirat_prep,
 		.issue			= io_mkdirat,
+		.always_iowq		= 1,
 	},
 	[IORING_OP_SYMLINKAT] = {
 		.prep			= io_symlinkat_prep,
 		.issue			= io_symlinkat,
+		.always_iowq		= 1,
 	},
 	[IORING_OP_LINKAT] = {
 		.prep			= io_linkat_prep,
 		.issue			= io_linkat,
+		.always_iowq		= 1,
 	},
 	[IORING_OP_MSG_RING] = {
 		.needs_file		= 1,
@@ -367,20 +381,24 @@ const struct io_issue_def io_issue_defs[] = {
 		.issue			= io_msg_ring,
 	},
 	[IORING_OP_FSETXATTR] = {
-		.needs_file = 1,
+		.needs_file		= 1,
+		.always_iowq		= 1,
 		.prep			= io_fsetxattr_prep,
 		.issue			= io_fsetxattr,
 	},
 	[IORING_OP_SETXATTR] = {
+		.always_iowq		= 1,
 		.prep			= io_setxattr_prep,
 		.issue			= io_setxattr,
 	},
 	[IORING_OP_FGETXATTR] = {
-		.needs_file = 1,
+		.needs_file		= 1,
+		.always_iowq		= 1,
 		.prep			= io_fgetxattr_prep,
 		.issue			= io_fgetxattr,
 	},
 	[IORING_OP_GETXATTR] = {
+		.always_iowq		= 1,
 		.prep			= io_getxattr_prep,
 		.issue			= io_getxattr,
 	},
diff --git a/io_uring/opdef.h b/io_uring/opdef.h
index c22c8696e749..657a831249ff 100644
--- a/io_uring/opdef.h
+++ b/io_uring/opdef.h
@@ -29,6 +29,8 @@ struct io_issue_def {
 	unsigned		iopoll_queue : 1;
 	/* opcode specific path will handle ->async_data allocation if needed */
 	unsigned		manual_alloc : 1;
+	/* op always needs io-wq offload */
+	unsigned		always_iowq : 1;
 
 	int (*issue)(struct io_kiocb *, unsigned int);
 	int (*prep)(struct io_kiocb *, const struct io_uring_sqe *);
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCHSET 0/6] Enable NO_OFFLOAD support
  2023-04-19 16:25 [PATCHSET 0/6] Enable NO_OFFLOAD support Jens Axboe
                   ` (5 preceding siblings ...)
  2023-04-19 16:25 ` [PATCH 6/6] io_uring: mark opcodes that always need io-wq punt Jens Axboe
@ 2023-04-20  0:43 ` Pavel Begunkov
  2023-04-20 15:08   ` Jens Axboe
  6 siblings, 1 reply; 15+ messages in thread
From: Pavel Begunkov @ 2023-04-20  0:43 UTC (permalink / raw)
  To: Jens Axboe, io-uring; +Cc: luhongfei

On 4/19/23 17:25, Jens Axboe wrote:
> Hi,
> 
> This series enables support for forcing no-offload for requests that
> otherwise would have been punted to io-wq. In essence, it bypasses
> the normal non-blocking issue in favor of just letting the issue block.
> This is only done for requests that would've otherwise hit io-wq in
> the offload path, anything pollable will still be doing non-blocking
> issue. See patch 3 for details.

That's shooting ourselves in the foot.

1) It has never been easier to lock up userspace. Applications might be
able to deal with simple cases like read(pipe) + write(pipe), though even
that, in a complex enough framework, would cause debugging and associated
headaches.

Now let's assume that userspace submits nvme passthrough requests,
exhausts tags, and a request is left waiting there. To make forward
progress one of the previous requests has to complete, but completion only
queues task_work, which will never be run with DEFER_TASKRUN.

It's not enough for userspace to be careful; with DEFER_TASKRUN
there will always be a chance of getting locked up.

2) It's not limited only to requests we're submitting, but also affects
already queued async requests. Inline submission holds uring_lock, so
if an io-wq thread needs to grab a registered file for a request, it'll
io_ring_submit_lock() and wait until the submission ends. Same for
provided buffers and some other cases.

Even task exit will actively try to grab the lock.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/6] io_uring: move poll_refs up a cacheline to fill a hole
  2023-04-19 16:25 ` [PATCH 2/6] io_uring: move poll_refs up a cacheline to fill a hole Jens Axboe
@ 2023-04-20  0:50   ` Pavel Begunkov
  0 siblings, 0 replies; 15+ messages in thread
From: Pavel Begunkov @ 2023-04-20  0:50 UTC (permalink / raw)
  To: Jens Axboe, io-uring; +Cc: luhongfei

On 4/19/23 17:25, Jens Axboe wrote:
> After bumping the flags to 64-bits, we now have two holes in io_kiocb.
> The best candidate for moving is poll_refs, as not to split the task_work
> related items.
> 
> Signed-off-by: Jens Axboe <[email protected]>
> ---
>   include/linux/io_uring_types.h | 6 ++++--
>   1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
> index 84f436cc6509..4dd54d2173e1 100644
> --- a/include/linux/io_uring_types.h
> +++ b/include/linux/io_uring_types.h
> @@ -535,6 +535,9 @@ struct io_kiocb {
>   	 * and after selection it points to the buffer ID itself.
>   	 */
>   	u16				buf_index;
> +
> +	atomic_t			poll_refs;

Poll wake is often done by a random CPU, so it would be great to limit
cache sharing, i.e. ensure the 2nd cache line is never touched by
io_poll_wake(); it would only need the poll entry and the 3rd line with
io_task_work().

There is one place that doesn't follow that, i.e. clearing
REQ_F_[SINGLE,DOUBLE]_POLL in poll wake, but it's a place to improve.


> +
>   	u64				flags;
>   
>   	struct io_cqe			cqe;
> @@ -565,9 +568,8 @@ struct io_kiocb {
>   		__poll_t apoll_events;
>   	};
>   	atomic_t			refs;
> -	atomic_t			poll_refs;
> -	struct io_task_work		io_task_work;
>   	unsigned			nr_tw;
> +	struct io_task_work		io_task_work;
>   	/* for polled requests, i.e. IORING_OP_POLL_ADD and async armed poll */
>   	union {
>   		struct hlist_node	hash_node;

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/6] io_uring: add support for NO_OFFLOAD
  2023-04-19 16:25 ` [PATCH 3/6] io_uring: add support for NO_OFFLOAD Jens Axboe
@ 2023-04-20  1:01   ` Pavel Begunkov
  2023-04-20 15:03     ` Jens Axboe
  0 siblings, 1 reply; 15+ messages in thread
From: Pavel Begunkov @ 2023-04-20  1:01 UTC (permalink / raw)
  To: Jens Axboe, io-uring; +Cc: luhongfei

On 4/19/23 17:25, Jens Axboe wrote:
> Some applications don't necessarily care about io_uring not blocking for
> request issue, they simply want to use io_uring for batched submission
> of IO. However, io_uring will always do non-blocking issues, and for
> some request types, there's simply no support for doing non-blocking
> issue and hence they get punted to io-wq unconditionally. If the
> application doesn't care about issue potentially blocking, this causes
> a performance slowdown as thread offload is not nearly as efficient as
> inline issue.
> 
> Add support for configuring the ring with IORING_SETUP_NO_OFFLOAD, and
> add an IORING_ENTER_NO_OFFLOAD flag to io_uring_enter(2). If either one
> of these is set, then io_uring will ignore the non-block issue attempt
> for any file which we cannot poll for readiness. The simplified io_uring
> issue model looks as follows:
> 
> 1) Non-blocking issue is attempted for IO. If successful, we're done for
>     now.
> 
> 2) Case 1 failed. Now we have two options
>    	a) We can poll the file. We arm poll, and we're done for now
> 	   until that triggers.
>     	b) File cannot be polled, we punt to io-wq which then does a
> 	   blocking attempt.
> 
> If either of the NO_OFFLOAD flags are set, we should never hit case
> 2b. Instead, case 1 would issue the IO without the non-blocking flag
> being set and perform an inline completion.
> 
> Signed-off-by: Jens Axboe <[email protected]>
> ---
>   include/linux/io_uring_types.h |  3 +++
>   include/uapi/linux/io_uring.h  |  7 +++++++
>   io_uring/io_uring.c            | 26 ++++++++++++++++++++------
>   io_uring/io_uring.h            |  2 +-
>   io_uring/sqpoll.c              |  3 ++-
>   5 files changed, 33 insertions(+), 8 deletions(-)
> 
> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
> index 4dd54d2173e1..c54f3fb7ab1a 100644
> --- a/include/linux/io_uring_types.h
> +++ b/include/linux/io_uring_types.h
> @@ -403,6 +403,7 @@ enum {
>   	REQ_F_APOLL_MULTISHOT_BIT,
>   	REQ_F_CLEAR_POLLIN_BIT,
>   	REQ_F_HASH_LOCKED_BIT,
> +	REQ_F_NO_OFFLOAD_BIT,
>   	/* keep async read/write and isreg together and in order */
>   	REQ_F_SUPPORT_NOWAIT_BIT,
>   	REQ_F_ISREG_BIT,
> @@ -475,6 +476,8 @@ enum {
>   	REQ_F_CLEAR_POLLIN	= BIT_ULL(REQ_F_CLEAR_POLLIN_BIT),
>   	/* hashed into ->cancel_hash_locked, protected by ->uring_lock */
>   	REQ_F_HASH_LOCKED	= BIT_ULL(REQ_F_HASH_LOCKED_BIT),
> +	/* don't offload to io-wq, issue blocking if needed */
> +	REQ_F_NO_OFFLOAD	= BIT_ULL(REQ_F_NO_OFFLOAD_BIT),
>   };
>   
>   typedef void (*io_req_tw_func_t)(struct io_kiocb *req, struct io_tw_state *ts);
> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> index 0716cb17e436..ea903a677ce9 100644
> --- a/include/uapi/linux/io_uring.h
> +++ b/include/uapi/linux/io_uring.h
> @@ -173,6 +173,12 @@ enum {
>    */
>   #define IORING_SETUP_DEFER_TASKRUN	(1U << 13)
>   
> +/*
> + * Don't attempt non-blocking issue on file types that would otherwise
> + * punt to io-wq if they cannot be completed non-blocking.
> + */
> +#define IORING_SETUP_NO_OFFLOAD		(1U << 14)
> +
>   enum io_uring_op {
>   	IORING_OP_NOP,
>   	IORING_OP_READV,
> @@ -443,6 +449,7 @@ struct io_cqring_offsets {
>   #define IORING_ENTER_SQ_WAIT		(1U << 2)
>   #define IORING_ENTER_EXT_ARG		(1U << 3)
>   #define IORING_ENTER_REGISTERED_RING	(1U << 4)
> +#define IORING_ENTER_NO_OFFLOAD		(1U << 5)
>   
>   /*
>    * Passed in for io_uring_setup(2). Copied back with updated info on success
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index 9568b5e4cf87..04770b06de16 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -1947,6 +1947,10 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
>   	if (unlikely(!io_assign_file(req, def, issue_flags)))
>   		return -EBADF;
>   
> +	if (req->flags & REQ_F_NO_OFFLOAD &&
> +	    (!req->file || !file_can_poll(req->file)))
> +		issue_flags &= ~IO_URING_F_NONBLOCK;
> +
>   	if (unlikely((req->flags & REQ_F_CREDS) && req->creds != current_cred()))
>   		creds = override_creds(req->creds);
>   
> @@ -2337,7 +2341,7 @@ static __cold int io_submit_fail_init(const struct io_uring_sqe *sqe,
>   }
>   
>   static inline int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
> -			 const struct io_uring_sqe *sqe)
> +			 const struct io_uring_sqe *sqe, bool no_offload)
>   	__must_hold(&ctx->uring_lock)
>   {
>   	struct io_submit_link *link = &ctx->submit_state.link;
> @@ -2385,6 +2389,9 @@ static inline int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
>   		return 0;
>   	}
>   
> +	if (no_offload)
> +		req->flags |= REQ_F_NO_OFFLOAD;

Shouldn't it be a part of the initial "in syscall" submission
but not extended to tw? I'd say it should, otherwise it risks
making !DEFER_TASKRUN totally unpredictable. E.g. any syscall
can try to execute tw and get stuck waiting in there.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/6] io_uring: add support for NO_OFFLOAD
  2023-04-20  1:01   ` Pavel Begunkov
@ 2023-04-20 15:03     ` Jens Axboe
  2023-04-20 15:16       ` Pavel Begunkov
  0 siblings, 1 reply; 15+ messages in thread
From: Jens Axboe @ 2023-04-20 15:03 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring; +Cc: luhongfei

On 4/19/23 7:01 PM, Pavel Begunkov wrote:
> On 4/19/23 17:25, Jens Axboe wrote:
>> Some applications don't necessarily care about io_uring not blocking for
>> request issue, they simply want to use io_uring for batched submission
>> of IO. However, io_uring will always do non-blocking issues, and for
>> some request types, there's simply no support for doing non-blocking
>> issue and hence they get punted to io-wq unconditionally. If the
>> application doesn't care about issue potentially blocking, this causes
>> a performance slowdown as thread offload is not nearly as efficient as
>> inline issue.
>>
>> Add support for configuring the ring with IORING_SETUP_NO_OFFLOAD, and
>> add an IORING_ENTER_NO_OFFLOAD flag to io_uring_enter(2). If either one
>> of these is set, then io_uring will ignore the non-block issue attempt
>> for any file which we cannot poll for readiness. The simplified io_uring
>> issue model looks as follows:
>>
>> 1) Non-blocking issue is attempted for IO. If successful, we're done for
>>     now.
>>
>> 2) Case 1 failed. Now we have two options
>>        a) We can poll the file. We arm poll, and we're done for now
>>        until that triggers.
>>         b) File cannot be polled, we punt to io-wq which then does a
>>        blocking attempt.
>>
>> If either of the NO_OFFLOAD flags are set, we should never hit case
>> 2b. Instead, case 1 would issue the IO without the non-blocking flag
>> being set and perform an inline completion.
>>
>> Signed-off-by: Jens Axboe <[email protected]>
>> ---
>>   include/linux/io_uring_types.h |  3 +++
>>   include/uapi/linux/io_uring.h  |  7 +++++++
>>   io_uring/io_uring.c            | 26 ++++++++++++++++++++------
>>   io_uring/io_uring.h            |  2 +-
>>   io_uring/sqpoll.c              |  3 ++-
>>   5 files changed, 33 insertions(+), 8 deletions(-)
>>
>> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
>> index 4dd54d2173e1..c54f3fb7ab1a 100644
>> --- a/include/linux/io_uring_types.h
>> +++ b/include/linux/io_uring_types.h
>> @@ -403,6 +403,7 @@ enum {
>>       REQ_F_APOLL_MULTISHOT_BIT,
>>       REQ_F_CLEAR_POLLIN_BIT,
>>       REQ_F_HASH_LOCKED_BIT,
>> +    REQ_F_NO_OFFLOAD_BIT,
>>       /* keep async read/write and isreg together and in order */
>>       REQ_F_SUPPORT_NOWAIT_BIT,
>>       REQ_F_ISREG_BIT,
>> @@ -475,6 +476,8 @@ enum {
>>       REQ_F_CLEAR_POLLIN    = BIT_ULL(REQ_F_CLEAR_POLLIN_BIT),
>>       /* hashed into ->cancel_hash_locked, protected by ->uring_lock */
>>       REQ_F_HASH_LOCKED    = BIT_ULL(REQ_F_HASH_LOCKED_BIT),
>> +    /* don't offload to io-wq, issue blocking if needed */
>> +    REQ_F_NO_OFFLOAD    = BIT_ULL(REQ_F_NO_OFFLOAD_BIT),
>>   };
>>     typedef void (*io_req_tw_func_t)(struct io_kiocb *req, struct io_tw_state *ts);
>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
>> index 0716cb17e436..ea903a677ce9 100644
>> --- a/include/uapi/linux/io_uring.h
>> +++ b/include/uapi/linux/io_uring.h
>> @@ -173,6 +173,12 @@ enum {
>>    */
>>   #define IORING_SETUP_DEFER_TASKRUN    (1U << 13)
>>   +/*
>> + * Don't attempt non-blocking issue on file types that would otherwise
>> + * punt to io-wq if they cannot be completed non-blocking.
>> + */
>> +#define IORING_SETUP_NO_OFFLOAD        (1U << 14)
>> +
>>   enum io_uring_op {
>>       IORING_OP_NOP,
>>       IORING_OP_READV,
>> @@ -443,6 +449,7 @@ struct io_cqring_offsets {
>>   #define IORING_ENTER_SQ_WAIT        (1U << 2)
>>   #define IORING_ENTER_EXT_ARG        (1U << 3)
>>   #define IORING_ENTER_REGISTERED_RING    (1U << 4)
>> +#define IORING_ENTER_NO_OFFLOAD        (1U << 5)
>>     /*
>>    * Passed in for io_uring_setup(2). Copied back with updated info on success
>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
>> index 9568b5e4cf87..04770b06de16 100644
>> --- a/io_uring/io_uring.c
>> +++ b/io_uring/io_uring.c
>> @@ -1947,6 +1947,10 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
>>       if (unlikely(!io_assign_file(req, def, issue_flags)))
>>           return -EBADF;
>>   +    if (req->flags & REQ_F_NO_OFFLOAD &&
>> +        (!req->file || !file_can_poll(req->file)))
>> +        issue_flags &= ~IO_URING_F_NONBLOCK;
>> +
>>       if (unlikely((req->flags & REQ_F_CREDS) && req->creds != current_cred()))
>>           creds = override_creds(req->creds);
>>   @@ -2337,7 +2341,7 @@ static __cold int io_submit_fail_init(const struct io_uring_sqe *sqe,
>>   }
>>     static inline int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
>> -             const struct io_uring_sqe *sqe)
>> +             const struct io_uring_sqe *sqe, bool no_offload)
>>       __must_hold(&ctx->uring_lock)
>>   {
>>       struct io_submit_link *link = &ctx->submit_state.link;
>> @@ -2385,6 +2389,9 @@ static inline int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
>>           return 0;
>>       }
>>   +    if (no_offload)
>> +        req->flags |= REQ_F_NO_OFFLOAD;
> 
> Shouldn't it be a part of the initial "in syscall" submission
> but not extended to tw? I'd say it should, otherwise it risks
> making !DEFER_TASKRUN totally unpredictable. E.g. any syscall
> can try to execute tw and get stuck waiting in there.

Yeah, it should probably be ignored outside of io_uring_enter(2)
submissions. If we do that, we could drop the flag too (and the flags
extension).

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCHSET 0/6] Enable NO_OFFLOAD support
  2023-04-20  0:43 ` [PATCHSET 0/6] Enable NO_OFFLOAD support Pavel Begunkov
@ 2023-04-20 15:08   ` Jens Axboe
  2023-04-20 15:28     ` Pavel Begunkov
  0 siblings, 1 reply; 15+ messages in thread
From: Jens Axboe @ 2023-04-20 15:08 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring; +Cc: luhongfei

On 4/19/23 6:43 PM, Pavel Begunkov wrote:
> On 4/19/23 17:25, Jens Axboe wrote:
>> Hi,
>>
>> This series enables support for forcing no-offload for requests that
>> otherwise would have been punted to io-wq. In essence, it bypasses
>> the normal non-blocking issue in favor of just letting the issue block.
>> This is only done for requests that would've otherwise hit io-wq in
>> the offload path, anything pollable will still be doing non-blocking
>> issue. See patch 3 for details.
> 
> That's shooting ourselves in the leg.
> 
> 1) It has never been easier to lock up userspace. They might be able
> to deal with simple cases like read(pipe) + write(pipe), though even
> that in a complex enough framework would cause debugging and associated
> headache.
> 
> Now let's assume that the userspace submits nvme passthrough requests,
> it exhausts tags and a request is left waiting there. To progress
> forward one of the previous reqs should complete, but it's only putting
> task in tw, which will never be run with DEFER_TASKRUN.
> 
> It's not enough for the userspace to be careful, for DEFER_TASKRUN
> there will always be a chance to get locked .
> 
> 2) It's not limited only to requests we're submitting, but also
> already queued async requests. Inline submission holds uring_lock,
> and so if io-wq thread needs to grab a registered file for the
> request, it'll io_ring_submit_lock() and wait until the submission
> ends. Same for provided buffers and some other cases.
> 
> Even task exit will actively try to grab the lock.

One thing I pondered was making the inline submissions similar to io-wq
submissions - eg not holding uring_lock over them. To make that useful, I
suspect we'd want to prep all SQ entries upfront, and then drop the lock
for submission.

We'd also want to make this mutually exclusive with IOPOLL, obviously. It
doesn't make any sense to do for IOPOLL anyway, but it needs to be
explicitly disallowed.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/6] io_uring: add support for NO_OFFLOAD
  2023-04-20 15:03     ` Jens Axboe
@ 2023-04-20 15:16       ` Pavel Begunkov
  2023-04-20 15:56         ` Jens Axboe
  0 siblings, 1 reply; 15+ messages in thread
From: Pavel Begunkov @ 2023-04-20 15:16 UTC (permalink / raw)
  To: Jens Axboe, io-uring; +Cc: luhongfei

On 4/20/23 16:03, Jens Axboe wrote:
> On 4/19/23 7:01 PM, Pavel Begunkov wrote:
>> On 4/19/23 17:25, Jens Axboe wrote:
>>> Some applications don't necessarily care about io_uring not blocking for
>>> request issue, they simply want to use io_uring for batched submission
>>> of IO. However, io_uring will always do non-blocking issues, and for
>>> some request types, there's simply no support for doing non-blocking
>>> issue and hence they get punted to io-wq unconditionally. If the
>>> application doesn't care about issue potentially blocking, this causes
>>> a performance slowdown as thread offload is not nearly as efficient as
>>> inline issue.
>>>
>>> Add support for configuring the ring with IORING_SETUP_NO_OFFLOAD, and
>>> add an IORING_ENTER_NO_OFFLOAD flag to io_uring_enter(2). If either one
>>> of these is set, then io_uring will ignore the non-block issue attempt
>>> for any file which we cannot poll for readiness. The simplified io_uring
>>> issue model looks as follows:
>>>
>>> 1) Non-blocking issue is attempted for IO. If successful, we're done for
>>>      now.
>>>
>>> 2) Case 1 failed. Now we have two options
>>>         a) We can poll the file. We arm poll, and we're done for now
>>>         until that triggers.
>>>          b) File cannot be polled, we punt to io-wq which then does a
>>>         blocking attempt.
>>>
>>> If either of the NO_OFFLOAD flags are set, we should never hit case
>>> 2b. Instead, case 1 would issue the IO without the non-blocking flag
>>> being set and perform an inline completion.
>>>
>>> Signed-off-by: Jens Axboe <[email protected]>
>>> ---
>>>    include/linux/io_uring_types.h |  3 +++
>>>    include/uapi/linux/io_uring.h  |  7 +++++++
>>>    io_uring/io_uring.c            | 26 ++++++++++++++++++++------
>>>    io_uring/io_uring.h            |  2 +-
>>>    io_uring/sqpoll.c              |  3 ++-
>>>    5 files changed, 33 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
>>> index 4dd54d2173e1..c54f3fb7ab1a 100644
>>> --- a/include/linux/io_uring_types.h
>>> +++ b/include/linux/io_uring_types.h
>>> @@ -403,6 +403,7 @@ enum {
>>>        REQ_F_APOLL_MULTISHOT_BIT,
>>>        REQ_F_CLEAR_POLLIN_BIT,
>>>        REQ_F_HASH_LOCKED_BIT,
>>> +    REQ_F_NO_OFFLOAD_BIT,
>>>        /* keep async read/write and isreg together and in order */
>>>        REQ_F_SUPPORT_NOWAIT_BIT,
>>>        REQ_F_ISREG_BIT,
>>> @@ -475,6 +476,8 @@ enum {
>>>        REQ_F_CLEAR_POLLIN    = BIT_ULL(REQ_F_CLEAR_POLLIN_BIT),
>>>        /* hashed into ->cancel_hash_locked, protected by ->uring_lock */
>>>        REQ_F_HASH_LOCKED    = BIT_ULL(REQ_F_HASH_LOCKED_BIT),
>>> +    /* don't offload to io-wq, issue blocking if needed */
>>> +    REQ_F_NO_OFFLOAD    = BIT_ULL(REQ_F_NO_OFFLOAD_BIT),
>>>    };
>>>      typedef void (*io_req_tw_func_t)(struct io_kiocb *req, struct io_tw_state *ts);
>>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
>>> index 0716cb17e436..ea903a677ce9 100644
>>> --- a/include/uapi/linux/io_uring.h
>>> +++ b/include/uapi/linux/io_uring.h
>>> @@ -173,6 +173,12 @@ enum {
>>>     */
>>>    #define IORING_SETUP_DEFER_TASKRUN    (1U << 13)
>>>    +/*
>>> + * Don't attempt non-blocking issue on file types that would otherwise
>>> + * punt to io-wq if they cannot be completed non-blocking.
>>> + */
>>> +#define IORING_SETUP_NO_OFFLOAD        (1U << 14)
>>> +
>>>    enum io_uring_op {
>>>        IORING_OP_NOP,
>>>        IORING_OP_READV,
>>> @@ -443,6 +449,7 @@ struct io_cqring_offsets {
>>>    #define IORING_ENTER_SQ_WAIT        (1U << 2)
>>>    #define IORING_ENTER_EXT_ARG        (1U << 3)
>>>    #define IORING_ENTER_REGISTERED_RING    (1U << 4)
>>> +#define IORING_ENTER_NO_OFFLOAD        (1U << 5)
>>>      /*
>>>     * Passed in for io_uring_setup(2). Copied back with updated info on success
>>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
>>> index 9568b5e4cf87..04770b06de16 100644
>>> --- a/io_uring/io_uring.c
>>> +++ b/io_uring/io_uring.c
>>> @@ -1947,6 +1947,10 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
>>>        if (unlikely(!io_assign_file(req, def, issue_flags)))
>>>            return -EBADF;
>>>    +    if (req->flags & REQ_F_NO_OFFLOAD &&
>>> +        (!req->file || !file_can_poll(req->file)))
>>> +        issue_flags &= ~IO_URING_F_NONBLOCK;
>>> +
>>>        if (unlikely((req->flags & REQ_F_CREDS) && req->creds != current_cred()))
>>>            creds = override_creds(req->creds);
>>>    @@ -2337,7 +2341,7 @@ static __cold int io_submit_fail_init(const struct io_uring_sqe *sqe,
>>>    }
>>>      static inline int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
>>> -             const struct io_uring_sqe *sqe)
>>> +             const struct io_uring_sqe *sqe, bool no_offload)
>>>        __must_hold(&ctx->uring_lock)
>>>    {
>>>        struct io_submit_link *link = &ctx->submit_state.link;
>>> @@ -2385,6 +2389,9 @@ static inline int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
>>>            return 0;
>>>        }
>>>    +    if (no_offload)
>>> +        req->flags |= REQ_F_NO_OFFLOAD;
>>
>> Shouldn't it be a part of the initial "in syscall" submission
>> but not extended to tw? I'd say it should, otherwise it risks
>> making !DEFER_TASKRUN totally unpredictable. E.g. any syscall
>> can try to execute tw and get stuck waiting in there.
> 
> Yeah, it should probably be ignored outside of io_uring_enter(2)
> submissions. If we do that, we could drop the flag too (and the flags
> extension).

issue_flags instead of req->flags might be a better place for it

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 15+ messages in thread
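
Spelled out, the suggestion above would key the check in io_issue_sqe() off an
issue flag rather than req->flags. A rough sketch follows; IO_URING_F_NO_OFFLOAD
is a hypothetical name, and the condition simply mirrors the patch 3 hunk quoted
earlier, so only the io_uring_enter(2) submission path that sets the flag is
affected.

	static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
	{
		...
		/*
		 * Same test as the REQ_F_NO_OFFLOAD hunk, but scoped to the
		 * caller that passed the (hypothetical) issue flag.
		 */
		if ((issue_flags & IO_URING_F_NO_OFFLOAD) &&
		    (!req->file || !file_can_poll(req->file)))
			issue_flags &= ~IO_URING_F_NONBLOCK;
		...
	}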

* Re: [PATCHSET 0/6] Enable NO_OFFLOAD support
  2023-04-20 15:08   ` Jens Axboe
@ 2023-04-20 15:28     ` Pavel Begunkov
  0 siblings, 0 replies; 15+ messages in thread
From: Pavel Begunkov @ 2023-04-20 15:28 UTC (permalink / raw)
  To: Jens Axboe, io-uring; +Cc: luhongfei

On 4/20/23 16:08, Jens Axboe wrote:
> On 4/19/23 6:43 PM, Pavel Begunkov wrote:
>> On 4/19/23 17:25, Jens Axboe wrote:
>>> Hi,
>>>
>>> This series enables support for forcing no-offload for requests that
>>> otherwise would have been punted to io-wq. In essence, it bypasses
>>> the normal non-blocking issue in favor of just letting the issue block.
>>> This is only done for requests that would've otherwise hit io-wq in
>>> the offload path, anything pollable will still be doing non-blocking
>>> issue. See patch 3 for details.
>>
>> That's shooting ourselves in the leg.
>>
>> 1) It has never been easier to lock up userspace. They might be able
>> to deal with simple cases like read(pipe) + write(pipe), though even
>> that in a complex enough framework would cause debugging and associated
>> headache.
>>
>> Now let's assume that the userspace submits nvme passthrough requests,
>> it exhausts tags and a request is left waiting there. To progress
>> forward one of the previous reqs should complete, but it's only putting
>> task in tw, which will never be run with DEFER_TASKRUN.
>>
>> It's not enough for the userspace to be careful, for DEFER_TASKRUN
>> there will always be a chance to get locked up.
>>
>> 2) It's not limited only to requests we're submitting, but also
>> already queued async requests. Inline submission holds uring_lock,
>> and so if io-wq thread needs to grab a registered file for the
>> request, it'll io_ring_submit_lock() and wait until the submission
>> ends. Same for provided buffers and some other cases.
>>
>> Even task exit will actively try to grab the lock.
> 
> One thing I pondered was making the inline submissions similar to io-wq
> submissions - eg don't hold uring_lock over them. To make that useful, I
> suspect we'd want to prep all SQ entries upfront, and then drop the lock
> for submission.

That would need the completion caches (ctx->submit_state) to be changed,
either by allowing multiple of them or by limiting inline submission to
a single submitter by some other means. Also, that will probably bring
request refcounting back, and DEFER_TASKRUN would probably need to
retake the lock for execution unless there are magic tricks around it.
Not an easy task if we don't want to hurt performance.


> We'd also want to make this mutually exclusive with IOPOLL, obviously.
> Doesn't make any sense to do anyway for IOPOLL, but it needs to be
> explicitly disallowed.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 15+ messages in thread
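
To make point 1 above concrete, the userspace pattern below (liburing, purely
illustrative -- nvme_fd, the batch size and the elided uring_cmd payload are
all made up) is the shape of workload being described, assuming a ring created
with IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN and the proposed
no-offload behaviour in effect:

	#include <liburing.h>

	static void submit_batch(struct io_uring *ring, int nvme_fd, unsigned nr)
	{
		unsigned i;

		for (i = 0; i < nr; i++) {
			struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

			sqe->opcode = IORING_OP_URING_CMD;	/* nvme passthrough */
			sqe->fd = nvme_fd;
			/* ... struct nvme_uring_cmd payload elided ... */
		}

		/*
		 * If nr exceeds the device's free tags, one command blocks
		 * inline here waiting for a tag. Freeing a tag needs an
		 * earlier completion to be processed, but with DEFER_TASKRUN
		 * that completion sits in task_work that only this (blocked)
		 * task can run, so the submit never returns.
		 */
		io_uring_submit(ring);
	}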

* Re: [PATCH 3/6] io_uring: add support for NO_OFFLOAD
  2023-04-20 15:16       ` Pavel Begunkov
@ 2023-04-20 15:56         ` Jens Axboe
  0 siblings, 0 replies; 15+ messages in thread
From: Jens Axboe @ 2023-04-20 15:56 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring; +Cc: luhongfei

On 4/20/23 9:16 AM, Pavel Begunkov wrote:
> On 4/20/23 16:03, Jens Axboe wrote:
>> On 4/19/23 7:01 PM, Pavel Begunkov wrote:
>>> On 4/19/23 17:25, Jens Axboe wrote:
>>>> Some applications don't necessarily care about io_uring not blocking for
>>>> request issue, they simply want to use io_uring for batched submission
>>>> of IO. However, io_uring will always do non-blocking issues, and for
>>>> some request types, there's simply no support for doing non-blocking
>>>> issue and hence they get punted to io-wq unconditionally. If the
>>>> application doesn't care about issue potentially blocking, this causes
>>>> a performance slowdown as thread offload is not nearly as efficient as
>>>> inline issue.
>>>>
>>>> Add support for configuring the ring with IORING_SETUP_NO_OFFLOAD, and
>>>> add an IORING_ENTER_NO_OFFLOAD flag to io_uring_enter(2). If either one
>>>> of these is set, then io_uring will ignore the non-block issue attempt
>>>> for any file which we cannot poll for readiness. The simplified io_uring
>>>> issue model looks as follows:
>>>>
>>>> 1) Non-blocking issue is attempted for IO. If successful, we're done for
>>>>      now.
>>>>
>>>> 2) Case 1 failed. Now we have two options
>>>>         a) We can poll the file. We arm poll, and we're done for now
>>>>         until that triggers.
>>>>          b) File cannot be polled, we punt to io-wq which then does a
>>>>         blocking attempt.
>>>>
>>>> If either of the NO_OFFLOAD flags are set, we should never hit case
>>>> 2b. Instead, case 1 would issue the IO without the non-blocking flag
>>>> being set and perform an inline completion.
>>>>
>>>> Signed-off-by: Jens Axboe <[email protected]>
>>>> ---
>>>>    include/linux/io_uring_types.h |  3 +++
>>>>    include/uapi/linux/io_uring.h  |  7 +++++++
>>>>    io_uring/io_uring.c            | 26 ++++++++++++++++++++------
>>>>    io_uring/io_uring.h            |  2 +-
>>>>    io_uring/sqpoll.c              |  3 ++-
>>>>    5 files changed, 33 insertions(+), 8 deletions(-)
>>>>
>>>> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
>>>> index 4dd54d2173e1..c54f3fb7ab1a 100644
>>>> --- a/include/linux/io_uring_types.h
>>>> +++ b/include/linux/io_uring_types.h
>>>> @@ -403,6 +403,7 @@ enum {
>>>>        REQ_F_APOLL_MULTISHOT_BIT,
>>>>        REQ_F_CLEAR_POLLIN_BIT,
>>>>        REQ_F_HASH_LOCKED_BIT,
>>>> +    REQ_F_NO_OFFLOAD_BIT,
>>>>        /* keep async read/write and isreg together and in order */
>>>>        REQ_F_SUPPORT_NOWAIT_BIT,
>>>>        REQ_F_ISREG_BIT,
>>>> @@ -475,6 +476,8 @@ enum {
>>>>        REQ_F_CLEAR_POLLIN    = BIT_ULL(REQ_F_CLEAR_POLLIN_BIT),
>>>>        /* hashed into ->cancel_hash_locked, protected by ->uring_lock */
>>>>        REQ_F_HASH_LOCKED    = BIT_ULL(REQ_F_HASH_LOCKED_BIT),
>>>> +    /* don't offload to io-wq, issue blocking if needed */
>>>> +    REQ_F_NO_OFFLOAD    = BIT_ULL(REQ_F_NO_OFFLOAD_BIT),
>>>>    };
>>>>      typedef void (*io_req_tw_func_t)(struct io_kiocb *req, struct io_tw_state *ts);
>>>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
>>>> index 0716cb17e436..ea903a677ce9 100644
>>>> --- a/include/uapi/linux/io_uring.h
>>>> +++ b/include/uapi/linux/io_uring.h
>>>> @@ -173,6 +173,12 @@ enum {
>>>>     */
>>>>    #define IORING_SETUP_DEFER_TASKRUN    (1U << 13)
>>>>    +/*
>>>> + * Don't attempt non-blocking issue on file types that would otherwise
>>>> + * punt to io-wq if they cannot be completed non-blocking.
>>>> + */
>>>> +#define IORING_SETUP_NO_OFFLOAD        (1U << 14)
>>>> +
>>>>    enum io_uring_op {
>>>>        IORING_OP_NOP,
>>>>        IORING_OP_READV,
>>>> @@ -443,6 +449,7 @@ struct io_cqring_offsets {
>>>>    #define IORING_ENTER_SQ_WAIT        (1U << 2)
>>>>    #define IORING_ENTER_EXT_ARG        (1U << 3)
>>>>    #define IORING_ENTER_REGISTERED_RING    (1U << 4)
>>>> +#define IORING_ENTER_NO_OFFLOAD        (1U << 5)
>>>>      /*
>>>>     * Passed in for io_uring_setup(2). Copied back with updated info on success
>>>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
>>>> index 9568b5e4cf87..04770b06de16 100644
>>>> --- a/io_uring/io_uring.c
>>>> +++ b/io_uring/io_uring.c
>>>> @@ -1947,6 +1947,10 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
>>>>        if (unlikely(!io_assign_file(req, def, issue_flags)))
>>>>            return -EBADF;
>>>>    +    if (req->flags & REQ_F_NO_OFFLOAD &&
>>>> +        (!req->file || !file_can_poll(req->file)))
>>>> +        issue_flags &= ~IO_URING_F_NONBLOCK;
>>>> +
>>>>        if (unlikely((req->flags & REQ_F_CREDS) && req->creds != current_cred()))
>>>>            creds = override_creds(req->creds);
>>>>    @@ -2337,7 +2341,7 @@ static __cold int io_submit_fail_init(const struct io_uring_sqe *sqe,
>>>>    }
>>>>      static inline int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
>>>> -             const struct io_uring_sqe *sqe)
>>>> +             const struct io_uring_sqe *sqe, bool no_offload)
>>>>        __must_hold(&ctx->uring_lock)
>>>>    {
>>>>        struct io_submit_link *link = &ctx->submit_state.link;
>>>> @@ -2385,6 +2389,9 @@ static inline int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
>>>>            return 0;
>>>>        }
>>>>    +    if (no_offload)
>>>> +        req->flags |= REQ_F_NO_OFFLOAD;
>>>
>>> Shouldn't it be a part of the initial "in syscall" submission
>>> but not extended to tw? I'd say it should, otherwise it risks
>>> making !DEFER_TASKRUN totally unpredictable. E.g. any syscall
>>> can try to execute tw and get stuck waiting in there.
>>
>> Yeah, it should probably be ignored outside of io_uring_enter(2)
>> submissions. If we do that, we could drop the flag too (and the flags
>> extension).
> 
> issue_flags instead of req->flags might be a better place for it

For sure, if we're not carrying it in io_kiocb state, then an issue flag
is the way to go. FWIW, I already rebased and did that. And then I ran
the full test suite, with a modification to the queue init helpers that
sets IORING_SETUP_NO_OFFLOAD for all queue creations, except the ones
where IORING_SETUP_IOPOLL is set. Ran into one minor issue,
test/fallocate.c, which will trigger SIGXFSZ for the process itself
(rather than in io-wq, where it gets ignored) when exceeding the file
size. That's to be expected.

Outside of that, everything worked, nothing odd observed. Obviously not
a comprehensive test for potential issues, but it does show that we're
not THAT much in trouble here.

Didn't drop the uring_lock yet; outside of dropping the REQ_F_NO_OFFLOAD
flag and making it an issue flag, it's all pretty much the same as
before.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 15+ messages in thread
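
For reference, the ring-wide opt-in used by the modified test helpers would
look roughly like this from an application. The flag value is hard-coded from
patch 3 of this series since it is not in any released uapi header; the
per-enter IORING_ENTER_NO_OFFLOAD variant has no liburing wrapper and would go
through the raw io_uring_enter(2) syscall instead.

	#include <liburing.h>
	#include <stdio.h>

	/* proposed in this series, not yet in <linux/io_uring.h> */
	#define IORING_SETUP_NO_OFFLOAD		(1U << 14)

	int main(void)
	{
		struct io_uring_params p = { };
		struct io_uring ring;
		int ret;

		/* ring-wide opt-in; skipped for IOPOLL rings as discussed above */
		p.flags = IORING_SETUP_NO_OFFLOAD;
		ret = io_uring_queue_init_params(8, &ring, &p);
		if (ret < 0) {
			fprintf(stderr, "queue_init: %d\n", ret);
			return 1;
		}

		io_uring_queue_exit(&ring);
		return 0;
	}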

end of thread, other threads:[~2023-04-20 15:56 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-04-19 16:25 [PATCHSET 0/6] Enable NO_OFFLOAD support Jens Axboe
2023-04-19 16:25 ` [PATCH 1/6] io_uring: grow struct io_kiocb 'flags' to a 64-bit value Jens Axboe
2023-04-19 16:25 ` [PATCH 2/6] io_uring: move poll_refs up a cacheline to fill a hole Jens Axboe
2023-04-20  0:50   ` Pavel Begunkov
2023-04-19 16:25 ` [PATCH 3/6] io_uring: add support for NO_OFFLOAD Jens Axboe
2023-04-20  1:01   ` Pavel Begunkov
2023-04-20 15:03     ` Jens Axboe
2023-04-20 15:16       ` Pavel Begunkov
2023-04-20 15:56         ` Jens Axboe
2023-04-19 16:25 ` [PATCH 4/6] Revert "io_uring: always go async for unsupported fadvise flags" Jens Axboe
2023-04-19 16:25 ` [PATCH 5/6] Revert "io_uring: for requests that require async, force it" Jens Axboe
2023-04-19 16:25 ` [PATCH 6/6] io_uring: mark opcodes that always need io-wq punt Jens Axboe
2023-04-20  0:43 ` [PATCHSET 0/6] Enable NO_OFFLOAD support Pavel Begunkov
2023-04-20 15:08   ` Jens Axboe
2023-04-20 15:28     ` Pavel Begunkov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox