public inbox for [email protected]
 help / color / mirror / Atom feed
* [PATCH RFC 0/9] Launching processes with io_uring
@ 2024-12-09 23:43 Gabriel Krisman Bertazi
  2024-12-09 23:43 ` [PATCH RFC 1/9] io_uring: Drop __io_req_find_next_prep Gabriel Krisman Bertazi
                   ` (10 more replies)
  0 siblings, 11 replies; 25+ messages in thread
From: Gabriel Krisman Bertazi @ 2024-12-09 23:43 UTC (permalink / raw)
  To: axboe, asml.silence; +Cc: io-uring, josh, Gabriel Krisman Bertazi

During LPC 2022, Josh Triplett proposed io_uring_spawn as a mechanism to
fork and exec new processes through io_uring [1].  The goal, according
to him, was to have a very efficient mechanism to quickly execute tasks,
eliminating the multiple roundtrips to userspace required to fork,
perform multiple $PATH lookup and finally execve.  In addition, he
mentioned this would allow for a more simple implementation of
preparatory tasks, such as file redirection configuration, and handling
of stuff like posix_spawn_file_actions_t.

This RFC revives his original patchset.  I fixed all the pending issues
I found with task submission, including the issue blocking the work at
the time, a kernel corruption after a few spawns, converted the execve
command into execveat* variant, cleaned up the code and surely
introduced a few bugs of my own along the way.  At this point, I made it
an RFC because I have a few outstanding questions about the design, in
particular whether the CLONE context would be better implemented as a
special io-wq case to avoid the exposure of io_issue_sqe and
duplication of the dispatching logic.

I'm also providing the liburing support in a separate patchset,
including a testcase that exemplifies the $PATH lookup mechanism
proposed by Josh.

Thanks,

[1]  https://lwn.net/Articles/908268/

Gabriel Krisman Bertazi (6):
  io_uring: Drop __io_req_find_next_prep
  io_uring: Expose failed request helper in internal header
  kernel/fork: Don't inherit PF_USER_WORKER from parent
  fs/exec: Expose do_execveat symbol
  io_uring: Let commands run with current credentials
  io_uring: Let ->issue know if it was called from spawn thread

Josh Triplett (3):
  kernel/fork: Add helper to fork from io_uring
  io_uring: Introduce IORING_OP_CLONE
  io_uring: Introduce IORING_OP_EXEC command

 fs/exec.c                      |   2 +-
 include/linux/binfmts.h        |   5 +
 include/linux/io_uring_types.h |   3 +
 include/linux/sched/task.h     |   1 +
 include/uapi/linux/io_uring.h  |   3 +
 io_uring/Makefile              |   2 +-
 io_uring/io_uring.c            |  27 ++---
 io_uring/io_uring.h            |   8 ++
 io_uring/opdef.c               |  18 +++
 io_uring/opdef.h               |   2 +
 io_uring/spawn.c               | 195 +++++++++++++++++++++++++++++++++
 io_uring/spawn.h               |  13 +++
 kernel/fork.c                  |  21 ++++
 13 files changed, 279 insertions(+), 21 deletions(-)
 create mode 100644 io_uring/spawn.c
 create mode 100644 io_uring/spawn.h

-- 
2.47.0


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH RFC 1/9] io_uring: Drop __io_req_find_next_prep
  2024-12-09 23:43 [PATCH RFC 0/9] Launching processes with io_uring Gabriel Krisman Bertazi
@ 2024-12-09 23:43 ` Gabriel Krisman Bertazi
  2024-12-09 23:43 ` [PATCH RFC 2/9] io_uring: Expose failed request helper in internal header Gabriel Krisman Bertazi
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 25+ messages in thread
From: Gabriel Krisman Bertazi @ 2024-12-09 23:43 UTC (permalink / raw)
  To: axboe, asml.silence; +Cc: io-uring, josh, Gabriel Krisman Bertazi

This is only used inside io_req_find_next.  Inline it and drop the helper.

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
 io_uring/io_uring.c | 16 +++++-----------
 1 file changed, 5 insertions(+), 11 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index a8cbe674e5d6..57d8947ae69e 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -989,15 +989,6 @@ __cold void io_free_req(struct io_kiocb *req)
 	io_req_task_work_add(req);
 }
 
-static void __io_req_find_next_prep(struct io_kiocb *req)
-{
-	struct io_ring_ctx *ctx = req->ctx;
-
-	spin_lock(&ctx->completion_lock);
-	io_disarm_next(req);
-	spin_unlock(&ctx->completion_lock);
-}
-
 static inline struct io_kiocb *io_req_find_next(struct io_kiocb *req)
 {
 	struct io_kiocb *nxt;
@@ -1008,8 +999,11 @@ static inline struct io_kiocb *io_req_find_next(struct io_kiocb *req)
 	 * dependencies to the next request. In case of failure, fail the rest
 	 * of the chain.
 	 */
-	if (unlikely(req->flags & IO_DISARM_MASK))
-		__io_req_find_next_prep(req);
+	if (unlikely(req->flags & IO_DISARM_MASK)) {
+		spin_lock(&req->ctx->completion_lock);
+		io_disarm_next(req);
+		spin_unlock(&req->ctx->completion_lock);
+	}
 	nxt = req->link;
 	req->link = NULL;
 	return nxt;
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH RFC 2/9] io_uring: Expose failed request helper in internal header
  2024-12-09 23:43 [PATCH RFC 0/9] Launching processes with io_uring Gabriel Krisman Bertazi
  2024-12-09 23:43 ` [PATCH RFC 1/9] io_uring: Drop __io_req_find_next_prep Gabriel Krisman Bertazi
@ 2024-12-09 23:43 ` Gabriel Krisman Bertazi
  2024-12-09 23:43 ` [PATCH RFC 3/9] kernel/fork: Don't inherit PF_USER_WORKER from parent Gabriel Krisman Bertazi
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 25+ messages in thread
From: Gabriel Krisman Bertazi @ 2024-12-09 23:43 UTC (permalink / raw)
  To: axboe, asml.silence; +Cc: io-uring, josh, Gabriel Krisman Bertazi

In preparation to calling it from the clone command, expose this helper
in io_uring.h.

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
 io_uring/io_uring.c | 6 ------
 io_uring/io_uring.h | 6 ++++++
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 57d8947ae69e..a19f72755eaa 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -224,12 +224,6 @@ bool io_match_task_safe(struct io_kiocb *head, struct io_uring_task *tctx,
 	return matched;
 }
 
-static inline void req_fail_link_node(struct io_kiocb *req, int res)
-{
-	req_set_fail(req);
-	io_req_set_res(req, res, 0);
-}
-
 static inline void io_req_add_to_cache(struct io_kiocb *req, struct io_ring_ctx *ctx)
 {
 	wq_stack_add_head(&req->comp_list, &ctx->submit_state.free_list);
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 12abee607e4a..4dd051d29cb0 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -222,6 +222,12 @@ static inline void io_req_set_res(struct io_kiocb *req, s32 res, u32 cflags)
 	req->cqe.flags = cflags;
 }
 
+static inline void req_fail_link_node(struct io_kiocb *req, int res)
+{
+	req_set_fail(req);
+	io_req_set_res(req, res, 0);
+}
+
 static inline bool req_has_async_data(struct io_kiocb *req)
 {
 	return req->flags & REQ_F_ASYNC_DATA;
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH RFC 3/9] kernel/fork: Don't inherit PF_USER_WORKER from parent
  2024-12-09 23:43 [PATCH RFC 0/9] Launching processes with io_uring Gabriel Krisman Bertazi
  2024-12-09 23:43 ` [PATCH RFC 1/9] io_uring: Drop __io_req_find_next_prep Gabriel Krisman Bertazi
  2024-12-09 23:43 ` [PATCH RFC 2/9] io_uring: Expose failed request helper in internal header Gabriel Krisman Bertazi
@ 2024-12-09 23:43 ` Gabriel Krisman Bertazi
  2024-12-09 23:43 ` [PATCH RFC 4/9] fs/exec: Expose do_execveat symbol Gabriel Krisman Bertazi
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 25+ messages in thread
From: Gabriel Krisman Bertazi @ 2024-12-09 23:43 UTC (permalink / raw)
  To: axboe, asml.silence; +Cc: io-uring, josh, Gabriel Krisman Bertazi

Clear the PF_USER_WORKER bit of new tasks, instead of inheriting it from
the parent.  This allows PF_USER_WORKER tasks to fork regular tasks.

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
 kernel/fork.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/fork.c b/kernel/fork.c
index 1450b461d196..56baa320a720 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2228,6 +2228,8 @@ __latent_entropy struct task_struct *copy_process(
 	p->flags &= ~PF_KTHREAD;
 	if (args->kthread)
 		p->flags |= PF_KTHREAD;
+
+	p->flags &= ~PF_USER_WORKER;
 	if (args->user_worker) {
 		/*
 		 * Mark us a user worker, and block any signal that isn't
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH RFC 4/9] fs/exec: Expose do_execveat symbol
  2024-12-09 23:43 [PATCH RFC 0/9] Launching processes with io_uring Gabriel Krisman Bertazi
                   ` (2 preceding siblings ...)
  2024-12-09 23:43 ` [PATCH RFC 3/9] kernel/fork: Don't inherit PF_USER_WORKER from parent Gabriel Krisman Bertazi
@ 2024-12-09 23:43 ` Gabriel Krisman Bertazi
  2024-12-09 23:43 ` [PATCH RFC 5/9] kernel/fork: Add helper to fork from io_uring Gabriel Krisman Bertazi
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 25+ messages in thread
From: Gabriel Krisman Bertazi @ 2024-12-09 23:43 UTC (permalink / raw)
  To: axboe, asml.silence; +Cc: io-uring, josh, Gabriel Krisman Bertazi

In order to allow it to be called by io_uring code, expose do_execveat in
the header file.

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
 fs/exec.c               | 2 +-
 include/linux/binfmts.h | 5 +++++
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/exec.c b/fs/exec.c
index 98cb7ba9983c..1a03ae5b9941 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -2023,7 +2023,7 @@ static int do_execve(struct filename *filename,
 	return do_execveat_common(AT_FDCWD, filename, argv, envp, 0);
 }
 
-static int do_execveat(int fd, struct filename *filename,
+int do_execveat(int fd, struct filename *filename,
 		const char __user *const __user *__argv,
 		const char __user *const __user *__envp,
 		int flags)
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index e6c00e860951..baec14dfb7ca 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -141,4 +141,9 @@ extern ssize_t read_code(struct file *, unsigned long, loff_t, size_t);
 int kernel_execve(const char *filename,
 		  const char *const *argv, const char *const *envp);
 
+int do_execveat(int dfd, struct filename *filename,
+		const char __user *const __user *__argv,
+		const char __user *const __user *__envp,
+		int flags);
+
 #endif /* _LINUX_BINFMTS_H */
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH RFC 5/9] kernel/fork: Add helper to fork from io_uring
  2024-12-09 23:43 [PATCH RFC 0/9] Launching processes with io_uring Gabriel Krisman Bertazi
                   ` (3 preceding siblings ...)
  2024-12-09 23:43 ` [PATCH RFC 4/9] fs/exec: Expose do_execveat symbol Gabriel Krisman Bertazi
@ 2024-12-09 23:43 ` Gabriel Krisman Bertazi
  2024-12-09 23:43 ` [PATCH RFC 6/9] io_uring: Let commands run with current credentials Gabriel Krisman Bertazi
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 25+ messages in thread
From: Gabriel Krisman Bertazi @ 2024-12-09 23:43 UTC (permalink / raw)
  To: axboe, asml.silence; +Cc: io-uring, josh, Gabriel Krisman Bertazi

From: Josh Triplett <[email protected]>

Introduce a helper to fork a new process from io_uring.  This is
different from the io_uring io_worker in multiple ways: First, it can
return to userspace following a execve, doesn't have PF_USER_WORKER set,
and doesn't share most process structures with the  the parent.  It
shares the MM though, allowing the helper to do limited io_uring
operations.

The sole use of this is the io_uring OP_CLONE command, which prepares
the ground for EXECVE from io_uring.

Signed-off-by: Josh Triplett <[email protected]>
Co-developed-by: Gabriel Krisman Bertazi <[email protected]>
Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
 include/linux/sched/task.h |  1 +
 kernel/fork.c              | 19 +++++++++++++++++++
 2 files changed, 20 insertions(+)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 0f2aeb37bbb0..a76f05a886ad 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -98,6 +98,7 @@ extern pid_t kernel_clone(struct kernel_clone_args *kargs);
 struct task_struct *copy_process(struct pid *pid, int trace, int node,
 				 struct kernel_clone_args *args);
 struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node);
+struct task_struct *create_io_uring_spawn_task(int (*fn)(void *), void *arg);
 struct task_struct *fork_idle(int);
 extern pid_t kernel_thread(int (*fn)(void *), void *arg, const char *name,
 			    unsigned long flags);
diff --git a/kernel/fork.c b/kernel/fork.c
index 56baa320a720..fa983a0614ce 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2757,6 +2757,25 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
 	return copy_process(NULL, 0, node, &args);
 }
 
+/*
+ * This is like kernel_clone(), but shaved down and tailored for io_uring_spawn.
+ * It returns a created task, or an error pointer. The returned task is
+ * inactive, and the caller must fire it up through wake_up_new_task(p).
+ */
+struct task_struct *create_io_uring_spawn_task(int (*fn)(void *), void *arg)
+{
+	unsigned long flags = CLONE_CLEAR_SIGHAND;
+	struct kernel_clone_args args = {
+		.flags		= ((lower_32_bits(flags) | CLONE_VM |
+				    CLONE_UNTRACED) & ~CSIGNAL),
+		.exit_signal	= (lower_32_bits(flags) & CSIGNAL),
+		.fn		= fn,
+		.fn_arg		= arg,
+	};
+
+	return copy_process(NULL, 0, NUMA_NO_NODE, &args);
+}
+
 /*
  *  Ok, this is the main fork-routine.
  *
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH RFC 6/9] io_uring: Let commands run with current credentials
  2024-12-09 23:43 [PATCH RFC 0/9] Launching processes with io_uring Gabriel Krisman Bertazi
                   ` (4 preceding siblings ...)
  2024-12-09 23:43 ` [PATCH RFC 5/9] kernel/fork: Add helper to fork from io_uring Gabriel Krisman Bertazi
@ 2024-12-09 23:43 ` Gabriel Krisman Bertazi
  2024-12-11 14:48   ` Pavel Begunkov
  2024-12-09 23:43 ` [PATCH RFC 7/9] io_uring: Introduce IORING_OP_CLONE Gabriel Krisman Bertazi
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 25+ messages in thread
From: Gabriel Krisman Bertazi @ 2024-12-09 23:43 UTC (permalink / raw)
  To: axboe, asml.silence; +Cc: io-uring, josh, Gabriel Krisman Bertazi

IORING_OP_EXEC runs only from a custom handler and cannot rely on
overloaded credentials. This commit adds infrastructure to allow running
operations without overloading the credentials, i.e. not enabling the
REQ_F_CREDS flag.

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
 io_uring/io_uring.c | 2 +-
 io_uring/opdef.h    | 2 ++
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index a19f72755eaa..0fd8709401fc 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -457,7 +457,7 @@ static void io_prep_async_work(struct io_kiocb *req)
 	const struct io_issue_def *def = &io_issue_defs[req->opcode];
 	struct io_ring_ctx *ctx = req->ctx;
 
-	if (!(req->flags & REQ_F_CREDS)) {
+	if (!(req->flags & REQ_F_CREDS) && !def->ignore_creds) {
 		req->flags |= REQ_F_CREDS;
 		req->creds = get_current_cred();
 	}
diff --git a/io_uring/opdef.h b/io_uring/opdef.h
index 14456436ff74..94e9a2e3c028 100644
--- a/io_uring/opdef.h
+++ b/io_uring/opdef.h
@@ -27,6 +27,8 @@ struct io_issue_def {
 	unsigned		iopoll_queue : 1;
 	/* vectored opcode, set if 1) vectored, and 2) handler needs to know */
 	unsigned		vectored : 1;
+	/* io_uring must not overload credentials on async context. */
+	unsigned		ignore_creds : 1;
 
 	/* size of async data needed, if any */
 	unsigned short		async_size;
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH RFC 7/9] io_uring: Introduce IORING_OP_CLONE
  2024-12-09 23:43 [PATCH RFC 0/9] Launching processes with io_uring Gabriel Krisman Bertazi
                   ` (5 preceding siblings ...)
  2024-12-09 23:43 ` [PATCH RFC 6/9] io_uring: Let commands run with current credentials Gabriel Krisman Bertazi
@ 2024-12-09 23:43 ` Gabriel Krisman Bertazi
  2024-12-11 13:37   ` Pavel Begunkov
  2024-12-09 23:43 ` [PATCH RFC 8/9] io_uring: Let ->issue know if it was called from spawn thread Gabriel Krisman Bertazi
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 25+ messages in thread
From: Gabriel Krisman Bertazi @ 2024-12-09 23:43 UTC (permalink / raw)
  To: axboe, asml.silence; +Cc: io-uring, josh, Gabriel Krisman Bertazi

From: Josh Triplett <[email protected]>

This command spawns a short lived asynchronous context to execute
following linked operations.  Once the link is completed, the task
terminates.  This is specially useful to create new processes, by
linking an IORING_OP_EXEC at the end of the chain. In this case, the
task doesn't terminate, but returns to userspace, starting the new
process.

This is different from the existing io workqueues in a few ways: First,
it is completely separated from the io-wq code, and the task cannot be
reused by a future link; Second, the task doesn't share the FDT, and
other process structures with the rest of io_uring (except for the
memory map); Finally, because of the limited context, it doesn't support
executing requests asynchronously and requeing them. Every request must
complete at ->issue time, or fail.  It also doesn't support task_work
execution, for a similar reason.  The goal of this design allowing the
user to close file descriptors, release locks and do other cleanups
right before switching to a new process.

A big pitfall here, in my (Gabriel) opinion, is how this duplicates the
logic of io_uring linked request dispatching.  I'd suggest I merge this
into the io-wq code, as a special case of workqueue. But I'd like to get
feedback on this idea from the maintainers before moving further with the
implementation.

Signed-off-by: Josh Triplett <[email protected]>
Co-developed-by: Gabriel Krisman Bertazi <[email protected]>
Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
 include/uapi/linux/io_uring.h |   1 +
 io_uring/Makefile             |   2 +-
 io_uring/io_uring.c           |   3 +-
 io_uring/io_uring.h           |   2 +
 io_uring/opdef.c              |   9 +++
 io_uring/spawn.c              | 140 ++++++++++++++++++++++++++++++++++
 io_uring/spawn.h              |  10 +++
 7 files changed, 165 insertions(+), 2 deletions(-)
 create mode 100644 io_uring/spawn.c
 create mode 100644 io_uring/spawn.h

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 38f0d6b10eaf..82d8dae49645 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -278,6 +278,7 @@ enum io_uring_op {
 	IORING_OP_FTRUNCATE,
 	IORING_OP_BIND,
 	IORING_OP_LISTEN,
+	IORING_OP_CLONE,
 
 	/* this goes last, obviously */
 	IORING_OP_LAST,
diff --git a/io_uring/Makefile b/io_uring/Makefile
index 53167bef37d7..06ad15c07c57 100644
--- a/io_uring/Makefile
+++ b/io_uring/Makefile
@@ -12,7 +12,7 @@ obj-$(CONFIG_IO_URING)		+= io_uring.o opdef.o kbuf.o rsrc.o notif.o \
 					sqpoll.o xattr.o nop.o fs.o splice.o \
 					sync.o msg_ring.o advise.o openclose.o \
 					epoll.o statx.o timeout.o fdinfo.o \
-					cancel.o waitid.o register.o \
+					cancel.o waitid.o register.o spawn.o \
 					truncate.o memmap.o
 obj-$(CONFIG_IO_WQ)		+= io-wq.o
 obj-$(CONFIG_FUTEX)		+= futex.o
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 0fd8709401fc..b82ea1cc393f 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -97,6 +97,7 @@
 #include "uring_cmd.h"
 #include "msg_ring.h"
 #include "memmap.h"
+#include "spawn.h"
 
 #include "timeout.h"
 #include "poll.h"
@@ -1706,7 +1707,7 @@ static bool io_assign_file(struct io_kiocb *req, const struct io_issue_def *def,
 	return !!req->file;
 }
 
-static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
+int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
 {
 	const struct io_issue_def *def = &io_issue_defs[req->opcode];
 	const struct cred *creds = NULL;
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 4dd051d29cb0..302c8f92b812 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -497,4 +497,6 @@ static inline bool io_has_work(struct io_ring_ctx *ctx)
 	return test_bit(IO_CHECK_CQ_OVERFLOW_BIT, &ctx->check_cq) ||
 	       io_local_work_pending(ctx);
 }
+
+int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags);
 #endif
diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index 3de75eca1c92..1bab2e517e55 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -36,6 +36,7 @@
 #include "waitid.h"
 #include "futex.h"
 #include "truncate.h"
+#include "spawn.h"
 
 static int io_no_issue(struct io_kiocb *req, unsigned int issue_flags)
 {
@@ -515,6 +516,11 @@ const struct io_issue_def io_issue_defs[] = {
 		.prep			= io_eopnotsupp_prep,
 #endif
 	},
+	[IORING_OP_CLONE] = {
+		.audit_skip		= 1,
+		.prep			= io_clone_prep,
+		.issue			= io_clone,
+	},
 };
 
 const struct io_cold_def io_cold_defs[] = {
@@ -744,6 +750,9 @@ const struct io_cold_def io_cold_defs[] = {
 	[IORING_OP_LISTEN] = {
 		.name			= "LISTEN",
 	},
+	[IORING_OP_CLONE] = {
+		.name			= "CLONE",
+	},
 };
 
 const char *io_uring_get_opcode(u8 opcode)
diff --git a/io_uring/spawn.c b/io_uring/spawn.c
new file mode 100644
index 000000000000..1cd069bb6f59
--- /dev/null
+++ b/io_uring/spawn.c
@@ -0,0 +1,140 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Spawning a linked series of operations onto a dedicated task.
+ *
+ * Copyright (C) 2022 Josh Triplett
+ */
+
+#include <linux/binfmts.h>
+#include <linux/nospec.h>
+#include <linux/syscalls.h>
+
+#include "io_uring.h"
+#include "rsrc.h"
+#include "spawn.h"
+
+struct io_clone {
+	struct file *file_unused;
+	struct io_kiocb *link;
+};
+
+static void fail_link(struct io_kiocb *req)
+{
+	struct io_kiocb *nxt;
+
+	while (req) {
+		req_fail_link_node(req, -ECANCELED);
+		io_req_complete_defer(req);
+
+		nxt = req->link;
+		req->link = NULL;
+		req = nxt;
+	}
+}
+
+static int io_uring_spawn_task(void *data)
+{
+	struct io_kiocb *head = data;
+	struct io_clone *c = io_kiocb_to_cmd(head, struct io_clone);
+	struct io_ring_ctx *ctx = head->ctx;
+	struct io_kiocb *req, *next;
+	int err;
+
+	set_task_comm(current, "iou-spawn");
+
+	mutex_lock(&ctx->uring_lock);
+
+	for (req = c->link; req; req = next) {
+		int hardlink = req->flags & REQ_F_HARDLINK;
+
+		next = req->link;
+		req->link = NULL;
+		req->flags &= ~(REQ_F_HARDLINK | REQ_F_LINK);
+
+		if (!(req->flags & REQ_F_FAIL)) {
+			err = io_issue_sqe(req, IO_URING_F_COMPLETE_DEFER);
+			/*
+			 * We can't requeue a request from the spawn
+			 * context.  Fail the whole chain.
+			 */
+			if (err) {
+				req_fail_link_node(req, -ECANCELED);
+				io_req_complete_defer(req);
+			}
+		}
+		if (req->flags & REQ_F_FAIL) {
+			if (!hardlink) {
+				fail_link(next);
+				break;
+			}
+		}
+	}
+
+	io_submit_flush_completions(ctx);
+	percpu_ref_put(&ctx->refs);
+
+	mutex_unlock(&ctx->uring_lock);
+
+	force_exit_sig(SIGKILL);
+	return 0;
+}
+
+int io_clone_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+{
+	if (unlikely(sqe->fd || sqe->ioprio || sqe->addr2 || sqe->addr
+		     || sqe->len || sqe->rw_flags || sqe->buf_index
+		     || sqe->optlen || sqe->addr3))
+		return -EINVAL;
+
+	if (unlikely(!(req->flags & (REQ_F_HARDLINK|REQ_F_LINK))))
+		return -EINVAL;
+
+	if (unlikely(req->ctx->submit_state.link.head))
+		return -EINVAL;
+
+	return 0;
+}
+
+int io_clone(struct io_kiocb *req, unsigned int issue_flags)
+{
+	struct io_clone *c = io_kiocb_to_cmd(req, struct io_clone);
+	struct task_struct *tsk;
+
+	/* It is possible that we don't have any linked requests, depite
+	 * checking during ->prep().  It would be harmless to continue,
+	 * but we don't need even to create the worker thread in this
+	 * case.
+	 */
+	if (!req->link)
+		return IOU_OK;
+
+	/*
+	 * Prevent the context from going away before the spawned task
+	 * has had a chance to execute.  Dropped by io_uring_spawn_task.
+	 */
+	percpu_ref_get(&req->ctx->refs);
+
+	tsk = create_io_uring_spawn_task(io_uring_spawn_task, req);
+	if (IS_ERR(tsk)) {
+		percpu_ref_put(&req->ctx->refs);
+
+		req_set_fail(req);
+		io_req_set_res(req, PTR_ERR(tsk), 0);
+		return PTR_ERR(tsk);
+	}
+
+	/*
+	 * Steal the link from the io_uring dispatcher to have them
+	 * submitted through the new thread.  Note we can no longer fail
+	 * the clone, so the spawned task is responsible for completing
+	 * these requests.
+	 */
+	c->link = req->link;
+	req->flags &= ~(REQ_F_HARDLINK | REQ_F_LINK);
+	req->link = NULL;
+
+	wake_up_new_task(tsk);
+
+	return IOU_OK;
+
+}
diff --git a/io_uring/spawn.h b/io_uring/spawn.h
new file mode 100644
index 000000000000..9b7ddb776d1e
--- /dev/null
+++ b/io_uring/spawn.h
@@ -0,0 +1,10 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+
+/*
+ * Spawning a linked series of operations onto a dedicated task.
+ *
+ * Copyright © 2022 Josh Triplett
+ */
+
+int io_clone_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
+int io_clone(struct io_kiocb *req, unsigned int issue_flags);
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH RFC 8/9] io_uring: Let ->issue know if it was called from spawn thread
  2024-12-09 23:43 [PATCH RFC 0/9] Launching processes with io_uring Gabriel Krisman Bertazi
                   ` (6 preceding siblings ...)
  2024-12-09 23:43 ` [PATCH RFC 7/9] io_uring: Introduce IORING_OP_CLONE Gabriel Krisman Bertazi
@ 2024-12-09 23:43 ` Gabriel Krisman Bertazi
  2024-12-09 23:43 ` [PATCH RFC 9/9] io_uring: Introduce IORING_OP_EXEC command Gabriel Krisman Bertazi
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 25+ messages in thread
From: Gabriel Krisman Bertazi @ 2024-12-09 23:43 UTC (permalink / raw)
  To: axboe, asml.silence; +Cc: io-uring, josh, Gabriel Krisman Bertazi

IORING_OP_EXEC can only be called from the spawn task context.  Pass
that information via the issue_flags to let it be verified by the
IORING_OP_EXEC handler.

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
 include/linux/io_uring_types.h | 3 +++
 io_uring/spawn.c               | 2 +-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index c141ffec81fe..8717259ba715 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -38,6 +38,9 @@ enum io_uring_cmd_flags {
 	IO_URING_F_CANCEL		= (1 << 11),
 	IO_URING_F_COMPAT		= (1 << 12),
 	IO_URING_F_TASK_DEAD		= (1 << 13),
+
+	/* Set when issuing from spawn thread */
+	IO_URING_F_SPAWN		= (1 << 14),
 };
 
 struct io_wq_work_node {
diff --git a/io_uring/spawn.c b/io_uring/spawn.c
index 1cd069bb6f59..59d6ccf96f45 100644
--- a/io_uring/spawn.c
+++ b/io_uring/spawn.c
@@ -52,7 +52,7 @@ static int io_uring_spawn_task(void *data)
 		req->flags &= ~(REQ_F_HARDLINK | REQ_F_LINK);
 
 		if (!(req->flags & REQ_F_FAIL)) {
-			err = io_issue_sqe(req, IO_URING_F_COMPLETE_DEFER);
+			err = io_issue_sqe(req, IO_URING_F_COMPLETE_DEFER|IO_URING_F_SPAWN);
 			/*
 			 * We can't requeue a request from the spawn
 			 * context.  Fail the whole chain.
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH RFC 9/9] io_uring: Introduce IORING_OP_EXEC command
  2024-12-09 23:43 [PATCH RFC 0/9] Launching processes with io_uring Gabriel Krisman Bertazi
                   ` (7 preceding siblings ...)
  2024-12-09 23:43 ` [PATCH RFC 8/9] io_uring: Let ->issue know if it was called from spawn thread Gabriel Krisman Bertazi
@ 2024-12-09 23:43 ` Gabriel Krisman Bertazi
  2024-12-10 21:01   ` Josh Triplett
  2024-12-10 21:10 ` [PATCH RFC 0/9] Launching processes with io_uring Josh Triplett
  2024-12-11 14:02 ` Pavel Begunkov
  10 siblings, 1 reply; 25+ messages in thread
From: Gabriel Krisman Bertazi @ 2024-12-09 23:43 UTC (permalink / raw)
  To: axboe, asml.silence; +Cc: io-uring, josh, Gabriel Krisman Bertazi

From: Josh Triplett <[email protected]>

This command executes the equivalent of an execveat(2) in a previously
spawned io_uring context, causing the execution to return to a new
program indicated by the SQE.

As an io_uring command, it is special in a few ways, requiring some
quirks. First, it can only be executed from the spawned context linked
after the IORING_OP_CLONE command; In addition, the first successful
IORING_OP_EXEC command will terminate the link chain, causing
further operations to fail with -ECANCELED.

There are a few reason for the first limitation: First, it wouldn't make
much sense to execute IORING_OP_EXEC in an io-wq, as it would simply
mean "stealing" the worker thread from io_uring; It would also be
questionable to execute inline or in a task work, as it would terminate
the execution of the ring.  Another technical reason is that we'd
immediately deadlock (fixable), because we'd need to complete the
command and release the reference after returning from the execve, but
the context has already been invalidated by terminating the process.
All in all, considering io_uring's purpose to provide an asynchronous
interface, I'd (Gabriel) like to focus on the simple use-case first,
limiting it to the cloned context for now.

The second limitation is obvious.  We reject further operations on the
link after a successful exec because that is the boundary of the new
program.

There is a very interesting usecase that Josh mentioned for this
feature.  One can issue a series of hardlinked IORING_OP_EXEC using the
different $PATH components to search for the binary and try them in
sequence without returning to userspace.  This is exemplified in
the liburing testcase accompanying the patchset.

Signed-off-by: Josh Triplett <[email protected]>
Co-developed-by: Gabriel Krisman Bertazi <[email protected]>
Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
 include/uapi/linux/io_uring.h |  2 ++
 io_uring/opdef.c              |  9 ++++++
 io_uring/spawn.c              | 57 ++++++++++++++++++++++++++++++++++-
 io_uring/spawn.h              |  3 ++
 4 files changed, 70 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 82d8dae49645..1116ff8b5018 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -73,6 +73,7 @@ struct io_uring_sqe {
 		__u32		futex_flags;
 		__u32		install_fd_flags;
 		__u32		nop_flags;
+		__u32		execve_flags;
 	};
 	__u64	user_data;	/* data to be passed back at completion time */
 	/* pack this to avoid bogus arm OABI complaints */
@@ -279,6 +280,7 @@ enum io_uring_op {
 	IORING_OP_BIND,
 	IORING_OP_LISTEN,
 	IORING_OP_CLONE,
+	IORING_OP_EXEC,
 
 	/* this goes last, obviously */
 	IORING_OP_LAST,
diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index 1bab2e517e55..8cca077641d5 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -521,6 +521,12 @@ const struct io_issue_def io_issue_defs[] = {
 		.prep			= io_clone_prep,
 		.issue			= io_clone,
 	},
+	[IORING_OP_EXEC] = {
+		.audit_skip		= 1,
+		.ignore_creds		= 1,
+		.prep			= io_exec_prep,
+		.issue			= io_exec,
+	},
 };
 
 const struct io_cold_def io_cold_defs[] = {
@@ -753,6 +759,9 @@ const struct io_cold_def io_cold_defs[] = {
 	[IORING_OP_CLONE] = {
 		.name			= "CLONE",
 	},
+	[IORING_OP_EXEC] = {
+		.name			= "EXEC",
+	},
 };
 
 const char *io_uring_get_opcode(u8 opcode)
diff --git a/io_uring/spawn.c b/io_uring/spawn.c
index 59d6ccf96f45..d6d649f78906 100644
--- a/io_uring/spawn.c
+++ b/io_uring/spawn.c
@@ -18,6 +18,16 @@ struct io_clone {
 	struct io_kiocb *link;
 };
 
+struct io_exec {
+	struct file *file_unused;
+	const char __user *filename;
+	const char __user *const __user *argv;
+	const char __user *const __user *envp;
+
+	int dfd;
+	u32 flags;
+};
+
 static void fail_link(struct io_kiocb *req)
 {
 	struct io_kiocb *nxt;
@@ -38,6 +48,7 @@ static int io_uring_spawn_task(void *data)
 	struct io_clone *c = io_kiocb_to_cmd(head, struct io_clone);
 	struct io_ring_ctx *ctx = head->ctx;
 	struct io_kiocb *req, *next;
+	bool return_to_user = false;
 	int err;
 
 	set_task_comm(current, "iou-spawn");
@@ -67,6 +78,15 @@ static int io_uring_spawn_task(void *data)
 				fail_link(next);
 				break;
 			}
+		} else if (req->opcode == IORING_OP_EXEC) {
+			/*
+			 * Don't execute anything after the first
+			 * successful IORING_OP_EXEC.  Cancel the rest
+			 * of the link and allow userspace to return
+			 */
+			fail_link(next);
+			return_to_user = true;
+			break;
 		}
 	}
 
@@ -75,7 +95,9 @@ static int io_uring_spawn_task(void *data)
 
 	mutex_unlock(&ctx->uring_lock);
 
-	force_exit_sig(SIGKILL);
+	/* If there wasn't a successful exec, terminate the thread. */
+	if (!return_to_user)
+		force_exit_sig(SIGKILL);
 	return 0;
 }
 
@@ -138,3 +160,36 @@ int io_clone(struct io_kiocb *req, unsigned int issue_flags)
 	return IOU_OK;
 
 }
+
+int io_exec_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+{
+	struct io_exec *e = io_kiocb_to_cmd(req, typeof(*e));
+
+	if (unlikely(sqe->buf_index || sqe->len || sqe->file_index))
+		return -EINVAL;
+
+	e->dfd = READ_ONCE(sqe->fd);
+	e->filename = u64_to_user_ptr(READ_ONCE(sqe->addr));
+	e->argv = u64_to_user_ptr(READ_ONCE(sqe->addr2));
+	e->envp = u64_to_user_ptr(READ_ONCE(sqe->addr3));
+	e->flags = READ_ONCE(sqe->execve_flags);
+	return 0;
+}
+
+int io_exec(struct io_kiocb *req, unsigned int issue_flags)
+{
+	struct io_exec *e = io_kiocb_to_cmd(req, typeof(*e));
+	int ret;
+
+	if (!(issue_flags & IO_URING_F_SPAWN))
+		return -EINVAL;
+
+	ret = do_execveat(e->dfd, getname(e->filename),
+			  e->argv, e->envp, e->flags);
+	if (ret < 0) {
+		req_set_fail(req);
+		io_req_set_res(req, ret, 0);
+	}
+	return IOU_OK;
+
+}
diff --git a/io_uring/spawn.h b/io_uring/spawn.h
index 9b7ddb776d1e..93d9f0ae378c 100644
--- a/io_uring/spawn.h
+++ b/io_uring/spawn.h
@@ -8,3 +8,6 @@
 
 int io_clone_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
 int io_clone(struct io_kiocb *req, unsigned int issue_flags);
+
+int io_exec_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
+int io_exec(struct io_kiocb *req, unsigned int issue_flags);
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC 9/9] io_uring: Introduce IORING_OP_EXEC command
  2024-12-09 23:43 ` [PATCH RFC 9/9] io_uring: Introduce IORING_OP_EXEC command Gabriel Krisman Bertazi
@ 2024-12-10 21:01   ` Josh Triplett
  0 siblings, 0 replies; 25+ messages in thread
From: Josh Triplett @ 2024-12-10 21:01 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi; +Cc: axboe, asml.silence, io-uring

On Mon, Dec 09, 2024 at 06:43:11PM -0500, Gabriel Krisman Bertazi wrote:
> From: Josh Triplett <[email protected]>
> 
> This command executes the equivalent of an execveat(2) in a previously
> spawned io_uring context, causing the execution to return to a new
> program indicated by the SQE.
> 
> As an io_uring command, it is special in a few ways, requiring some
> quirks. First, it can only be executed from the spawned context linked
> after the IORING_OP_CLONE command; In addition, the first successful
> IORING_OP_EXEC command will terminate the link chain, causing
> further operations to fail with -ECANCELED.
> 
> There are a few reason for the first limitation: First, it wouldn't make
> much sense to execute IORING_OP_EXEC in an io-wq, as it would simply
> mean "stealing" the worker thread from io_uring; It would also be
> questionable to execute inline or in a task work, as it would terminate
> the execution of the ring.  Another technical reason is that we'd
> immediately deadlock (fixable), because we'd need to complete the
> command and release the reference after returning from the execve, but
> the context has already been invalidated by terminating the process.
> All in all, considering io_uring's purpose to provide an asynchronous
> interface, I'd (Gabriel) like to focus on the simple use-case first,
> limiting it to the cloned context for now.

This seems like a reasonable limitation for now. I'd eventually like to
handle things like "install these fds, do some other setup calls, then
execveat" as a ring submission (perhaps as a synchronous one), but
leaving that out for now seems reasonable.

The combination of clone and exec should probably get advertised as a
new capability. If we add exec-without-clone in the future, that can be
a second new capability.

The commit message should probably also document the rationale for dfd
not accepting a ring index (for now) rather than an installed fd. That
*also* seems like a perfectly reasonable limitation for now, just one
that needs documenting.

Otherwise, LGTM, and thank you again for updating this!

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC 0/9] Launching processes with io_uring
  2024-12-09 23:43 [PATCH RFC 0/9] Launching processes with io_uring Gabriel Krisman Bertazi
                   ` (8 preceding siblings ...)
  2024-12-09 23:43 ` [PATCH RFC 9/9] io_uring: Introduce IORING_OP_EXEC command Gabriel Krisman Bertazi
@ 2024-12-10 21:10 ` Josh Triplett
  2024-12-11 14:02 ` Pavel Begunkov
  10 siblings, 0 replies; 25+ messages in thread
From: Josh Triplett @ 2024-12-10 21:10 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi; +Cc: axboe, asml.silence, io-uring

On Mon, Dec 09, 2024 at 06:43:02PM -0500, Gabriel Krisman Bertazi wrote:
> During LPC 2022, Josh Triplett proposed io_uring_spawn as a mechanism to
> fork and exec new processes through io_uring [1].  The goal, according
> to him, was to have a very efficient mechanism to quickly execute tasks,
> eliminating the multiple roundtrips to userspace required to fork,
> perform multiple $PATH lookup and finally execve.  In addition, he
> mentioned this would allow for a more simple implementation of
> preparatory tasks, such as file redirection configuration, and handling
> of stuff like posix_spawn_file_actions_t.
> 
> This RFC revives his original patchset.  I fixed all the pending issues
> I found with task submission, including the issue blocking the work at
> the time, a kernel corruption after a few spawns, converted the execve
> command into execveat* variant, cleaned up the code and surely
> introduced a few bugs of my own along the way.  At this point, I made it
> an RFC because I have a few outstanding questions about the design, in
> particular whether the CLONE context would be better implemented as a
> special io-wq case to avoid the exposure of io_issue_sqe and
> duplication of the dispatching logic.

Thank you for updating and debugging this! Much appreciated.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC 7/9] io_uring: Introduce IORING_OP_CLONE
  2024-12-09 23:43 ` [PATCH RFC 7/9] io_uring: Introduce IORING_OP_CLONE Gabriel Krisman Bertazi
@ 2024-12-11 13:37   ` Pavel Begunkov
  2024-12-11 17:26     ` Josh Triplett
  0 siblings, 1 reply; 25+ messages in thread
From: Pavel Begunkov @ 2024-12-11 13:37 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi, axboe; +Cc: io-uring, josh

On 12/9/24 23:43, Gabriel Krisman Bertazi wrote:
> From: Josh Triplett <[email protected]>
> 
> This command spawns a short lived asynchronous context to execute
> following linked operations.  Once the link is completed, the task
> terminates.  This is specially useful to create new processes, by
> linking an IORING_OP_EXEC at the end of the chain. In this case, the
> task doesn't terminate, but returns to userspace, starting the new
> process.
> 
> This is different from the existing io workqueues in a few ways: First,
> it is completely separated from the io-wq code, and the task cannot be
> reused by a future link; Second, the task doesn't share the FDT, and
> other process structures with the rest of io_uring (except for the
> memory map); Finally, because of the limited context, it doesn't support
> executing requests asynchronously and requeing them. Every request must
> complete at ->issue time, or fail.  It also doesn't support task_work
> execution, for a similar reason.  The goal of this design allowing the
> user to close file descriptors, release locks and do other cleanups
> right before switching to a new process.
> 
> A big pitfall here, in my (Gabriel) opinion, is how this duplicates the
> logic of io_uring linked request dispatching.  I'd suggest I merge this
> into the io-wq code, as a special case of workqueue. But I'd like to get
> feedback on this idea from the maintainers before moving further with the
> implementation.
> 
> Signed-off-by: Josh Triplett <[email protected]>
> Co-developed-by: Gabriel Krisman Bertazi <[email protected]>
> Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
> ---
...
> +static int io_uring_spawn_task(void *data)
> +{
> +	struct io_kiocb *head = data;
> +	struct io_clone *c = io_kiocb_to_cmd(head, struct io_clone);
> +	struct io_ring_ctx *ctx = head->ctx;
> +	struct io_kiocb *req, *next;
> +	int err;
> +
> +	set_task_comm(current, "iou-spawn");
> +
> +	mutex_lock(&ctx->uring_lock);
> +
> +	for (req = c->link; req; req = next) {
> +		int hardlink = req->flags & REQ_F_HARDLINK;
> +
> +		next = req->link;
> +		req->link = NULL;
> +		req->flags &= ~(REQ_F_HARDLINK | REQ_F_LINK);

Do you allow linked timeouts? If so, it'd need to take the lock.

Also, the current link impl assumes that the list only modified
when all refs to the head request are put, you can't just do it
without dropping refs first or adjusting the rest of core link
handling.

> +
> +		if (!(req->flags & REQ_F_FAIL)) {
> +			err = io_issue_sqe(req, IO_URING_F_COMPLETE_DEFER);

There should never be non IO_URING_F_NONBLOCK calls with ->uring_lock.

I'd even say that opcode handling shouldn't have any business with
digging so deep into internal infra, submitting requests, flushing
caches, processing links and so on. It complicates things.

Take defer taskrun, io_submit_flush_completions() will not wake
waiters, and we probably have or will have a bunch of optimisations
that it can break.

Also, do you block somewhere all other opcodes? If it's indeed
an under initialised task then it's not safe to run most of them,
and you'd never know in what way, unfortunately. An fs write
might need a net namespace, a send/recv might decide to touch
fs_struct and so on.

> +			/*
> +			 * We can't requeue a request from the spawn
> +			 * context.  Fail the whole chain.
> +			 */
> +			if (err) {
> +				req_fail_link_node(req, -ECANCELED);
> +				io_req_complete_defer(req);
> +			}
> +		}
> +		if (req->flags & REQ_F_FAIL) {
> +			if (!hardlink) {
> +				fail_link(next);
> +				break;
> +			}
> +		}
> +	}
> +
> +	io_submit_flush_completions(ctx);

Ordering with

> +	percpu_ref_put(&ctx->refs);
> +
> +	mutex_unlock(&ctx->uring_lock);
> +
> +	force_exit_sig(SIGKILL);
> +	return 0;
> +}
> +
> +int io_clone_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
> +{
> +	if (unlikely(sqe->fd || sqe->ioprio || sqe->addr2 || sqe->addr
> +		     || sqe->len || sqe->rw_flags || sqe->buf_index
> +		     || sqe->optlen || sqe->addr3))
> +		return -EINVAL;
> +
> +	if (unlikely(!(req->flags & (REQ_F_HARDLINK|REQ_F_LINK))))
> +		return -EINVAL;
> +
> +	if (unlikely(req->ctx->submit_state.link.head))
> +		return -EINVAL;
> +
> +	return 0;
> +}
> +
> +int io_clone(struct io_kiocb *req, unsigned int issue_flags)
> +{
> +	struct io_clone *c = io_kiocb_to_cmd(req, struct io_clone);
> +	struct task_struct *tsk;
> +
> +	/* It is possible that we don't have any linked requests, depite
> +	 * checking during ->prep().  It would be harmless to continue,
> +	 * but we don't need even to create the worker thread in this
> +	 * case.
> +	 */
> +	if (!req->link)
> +		return IOU_OK;
> +
> +	/*
> +	 * Prevent the context from going away before the spawned task
> +	 * has had a chance to execute.  Dropped by io_uring_spawn_task.
> +	 */
> +	percpu_ref_get(&req->ctx->refs);
> +
> +	tsk = create_io_uring_spawn_task(io_uring_spawn_task, req);
> +	if (IS_ERR(tsk)) {
> +		percpu_ref_put(&req->ctx->refs);
> +
> +		req_set_fail(req);
> +		io_req_set_res(req, PTR_ERR(tsk), 0);
> +		return PTR_ERR(tsk);
> +	}
> +
> +	/*
> +	 * Steal the link from the io_uring dispatcher to have them
> +	 * submitted through the new thread.  Note we can no longer fail
> +	 * the clone, so the spawned task is responsible for completing
> +	 * these requests.
> +	 */
> +	c->link = req->link;
> +	req->flags &= ~(REQ_F_HARDLINK | REQ_F_LINK);
> +	req->link = NULL;
> +
> +	wake_up_new_task(tsk);

I assume from here onwards io_uring_spawn_task() might be
running in parallel. You're passing the req inside but free
it below by returning OK. Is there anything that prevents
io_uring_spawn_task() from accessing an already deallocated
request?

> +
> +	return IOU_OK;
> +
> +}
> diff --git a/io_uring/spawn.h b/io_uring/spawn.h
> new file mode 100644
> index 000000000000..9b7ddb776d1e
> --- /dev/null
> +++ b/io_uring/spawn.h
> @@ -0,0 +1,10 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later */
> +
> +/*
> + * Spawning a linked series of operations onto a dedicated task.
> + *
> + * Copyright © 2022 Josh Triplett
> + */
> +
> +int io_clone_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
> +int io_clone(struct io_kiocb *req, unsigned int issue_flags);

-- 
Pavel Begunkov


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC 0/9] Launching processes with io_uring
  2024-12-09 23:43 [PATCH RFC 0/9] Launching processes with io_uring Gabriel Krisman Bertazi
                   ` (9 preceding siblings ...)
  2024-12-10 21:10 ` [PATCH RFC 0/9] Launching processes with io_uring Josh Triplett
@ 2024-12-11 14:02 ` Pavel Begunkov
  2024-12-11 17:34   ` Josh Triplett
                     ` (2 more replies)
  10 siblings, 3 replies; 25+ messages in thread
From: Pavel Begunkov @ 2024-12-11 14:02 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi, axboe; +Cc: io-uring, josh

On 12/9/24 23:43, Gabriel Krisman Bertazi wrote:
> During LPC 2022, Josh Triplett proposed io_uring_spawn as a mechanism to
> fork and exec new processes through io_uring [1].  The goal, according
> to him, was to have a very efficient mechanism to quickly execute tasks,
> eliminating the multiple roundtrips to userspace required to fork,
> perform multiple $PATH lookup and finally execve.  In addition, he
> mentioned this would allow for a more simple implementation of
> preparatory tasks, such as file redirection configuration, and handling
> of stuff like posix_spawn_file_actions_t.
> 
> This RFC revives his original patchset.  I fixed all the pending issues
> I found with task submission, including the issue blocking the work at
> the time, a kernel corruption after a few spawns, converted the execve
> command into execveat* variant, cleaned up the code and surely
> introduced a few bugs of my own along the way.  At this point, I made it
> an RFC because I have a few outstanding questions about the design, in
> particular whether the CLONE context would be better implemented as a
> special io-wq case to avoid the exposure of io_issue_sqe and
> duplication of the dispatching logic.
> 
> I'm also providing the liburing support in a separate patchset,
> including a testcase that exemplifies the $PATH lookup mechanism
> proposed by Josh.

Sorry to say but the series is rather concerning.

1) It creates a special path that tries to mimick the core
path, but not without a bunch of troubles and in quite a
special way.

2) There would be a special set of ops that can only be run
from that special path.

3) And I don't believe that path can ever be allowed to run
anything but these ops from (2) and maybe a very limited subset
of normal ops like nop requests but no read/write/send/etc. (?)

4) And it all requires links, which already a bad sign for
a bunch of reasons.

At this point it raises a question why it even needs io_uring
infra? I don't think it's really helping you. E.g. why not do it
as a list of operation in a custom format instead of links? That
can be run by a single io_uring request or can even be a normal
syscall.

struct clone_op ops = { { CLONE },
         { SET_CRED, cred_id }, ...,
         { EXEC, path }};

Makes me wonder about a different ways of handling. E.g. why should
it be run in the created task context (apart from final exec)? Can
requests be run as normal by the original task, each will take the
half created and not yet launched task as a parameter (in some form),
modify it, and the final exec would launch it?

-- 
Pavel Begunkov


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC 6/9] io_uring: Let commands run with current credentials
  2024-12-09 23:43 ` [PATCH RFC 6/9] io_uring: Let commands run with current credentials Gabriel Krisman Bertazi
@ 2024-12-11 14:48   ` Pavel Begunkov
  0 siblings, 0 replies; 25+ messages in thread
From: Pavel Begunkov @ 2024-12-11 14:48 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi, axboe; +Cc: io-uring, josh

On 12/9/24 23:43, Gabriel Krisman Bertazi wrote:
> IORING_OP_EXEC runs only from a custom handler and cannot rely on
> overloaded credentials. This commit adds infrastructure to allow running
> operations without overloading the credentials, i.e. not enabling the
> REQ_F_CREDS flag.
> 
> Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
> ---
>   io_uring/io_uring.c | 2 +-
>   io_uring/opdef.h    | 2 ++
>   2 files changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index a19f72755eaa..0fd8709401fc 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -457,7 +457,7 @@ static void io_prep_async_work(struct io_kiocb *req)
>   	const struct io_issue_def *def = &io_issue_defs[req->opcode];
>   	struct io_ring_ctx *ctx = req->ctx;
>   
> -	if (!(req->flags & REQ_F_CREDS)) {
> +	if (!(req->flags & REQ_F_CREDS) && !def->ignore_creds)

It's not the only place setting creds, see io_init_req().

-- 
Pavel Begunkov


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC 7/9] io_uring: Introduce IORING_OP_CLONE
  2024-12-11 13:37   ` Pavel Begunkov
@ 2024-12-11 17:26     ` Josh Triplett
  2024-12-17 11:03       ` Pavel Begunkov
  0 siblings, 1 reply; 25+ messages in thread
From: Josh Triplett @ 2024-12-11 17:26 UTC (permalink / raw)
  To: Pavel Begunkov; +Cc: Gabriel Krisman Bertazi, axboe, io-uring

On Wed, Dec 11, 2024 at 01:37:40PM +0000, Pavel Begunkov wrote:
> Also, do you block somewhere all other opcodes? If it's indeed
> an under initialised task then it's not safe to run most of them,
> and you'd never know in what way, unfortunately. An fs write
> might need a net namespace, a send/recv might decide to touch
> fs_struct and so on.

I would not expect the new task to be under-initialised, beyond the fact
that it doesn't have a userspace yet (e.g. it can't return to userspace
without exec-ing first); if it is, that'd be a bug. It *should* be
possible to do almost any reasonable opcode. For instance, reasonable
possibilities include "write a byte to a pipe, open a file,
install/rearrange some file descriptors, then exec".

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC 0/9] Launching processes with io_uring
  2024-12-11 14:02 ` Pavel Begunkov
@ 2024-12-11 17:34   ` Josh Triplett
  2024-12-13 20:13   ` Gabriel Krisman Bertazi
  2025-01-18 22:33   ` Askar Safin
  2 siblings, 0 replies; 25+ messages in thread
From: Josh Triplett @ 2024-12-11 17:34 UTC (permalink / raw)
  To: Pavel Begunkov; +Cc: Gabriel Krisman Bertazi, axboe, io-uring

On Wed, Dec 11, 2024 at 02:02:14PM +0000, Pavel Begunkov wrote:
> 1) It creates a special path that tries to mimick the core
> path, but not without a bunch of troubles and in quite a
> special way.
> 
> 2) There would be a special set of ops that can only be run
> from that special path.

The goal would be for the exec op to work just fine from the normal
path, too, for processes that want to do the equivalent of "do several
syscalls then exec to replace myself", rather than doing a fork/exec.
The current implementation defers supporting exec on a non-clone ring,
but I'd expect that limitation to be lifted in the future.

> 3) And I don't believe that path can ever be allowed to run
> anything but these ops from (2) and maybe a very limited subset
> of normal ops like nop requests but no read/write/send/etc. (?)

I would ideally expect it to be able to run almost *any* op, in the
context of the new process: write, send, open, accept, connect,
unlinkat, FIXED_FD_INSTALL, ring messaging, ...

> 4) And it all requires links, which already a bad sign for
> a bunch of reasons.

In theory you don't *have* to have everything linked for a batch of
operations like this, as long as it's clear what to run in the new task.

> At this point it raises a question why it even needs io_uring
> infra? I don't think it's really helping you. E.g. why not do it
> as a list of operation in a custom format instead of links?

Because, as mentioned above, the intention *is* to support almost any
io_uring operation, not just a tiny subset.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC 0/9] Launching processes with io_uring
  2024-12-11 14:02 ` Pavel Begunkov
  2024-12-11 17:34   ` Josh Triplett
@ 2024-12-13 20:13   ` Gabriel Krisman Bertazi
  2024-12-17 16:10     ` Pavel Begunkov
  2025-01-18 22:33   ` Askar Safin
  2 siblings, 1 reply; 25+ messages in thread
From: Gabriel Krisman Bertazi @ 2024-12-13 20:13 UTC (permalink / raw)
  To: Pavel Begunkov; +Cc: axboe, io-uring, josh


Hi Pavel,

Pavel Begunkov <[email protected]> writes:
> On 12/9/24 23:43, Gabriel Krisman Bertazi wrote:

> Sorry to say but the series is rather concerning.
>
> 1) It creates a special path that tries to mimick the core
> path, but not without a bunch of troubles and in quite a
> special way.

I fully agree this is one of the main problem with the series.  I'm
interested in how we can merge this implementation into the existing
io_uring paths.  My idea, which I hinted in the cover letter, is to have
a flavor of io-wq that executes one linked sequence and then terminates.
When a work is queued there, the newly spawned worker thread will live
only until the end of that link.  This wq is only used to execute the
link following a IORING_OP_CLONE and the user can pass CLONE_ flags to
determine how it is created.  This allows the user to create a detached
file descriptor table in the worker thread, for instance.

It'd allows us to reuse the dispatching infrastructure of io-wq, hide
io_uring internals from the OP_CLONE implementation, and
enable, if I understand correctly, the workarounds to execute
task_works.  We'd need to ensure nothing from the link gets
executed outside of this context.

> 2) There would be a special set of ops that can only be run
> from that special path.

There are problems with cancellations and timeouts, that I'd expect to
be more solvable when reusing the io-wq code.  But this task is
executing from a cloned context, so we have a copy of the parent
context, and share the same memory map.  It should be safe to do IO to
open file descriptors, wake futexes and pretty much anything that
doesn't touch io_uring itself.  There are oddities, like the fact the fd
table is split from the parent task while the io_uring direct
descriptors are shared.  That needs to be handled with more sane
semantics.

> At this point it raises a question why it even needs io_uring
> infra? I don't think it's really helping you. E.g. why not do it
> as a list of operation in a custom format instead of links? That
> can be run by a single io_uring request or can even be a normal
> syscall.
>
> struct clone_op ops = { { CLONE },
>         { SET_CRED, cred_id }, ...,
>         { EXEC, path }};
>
>
> Makes me wonder about a different ways of handling. E.g. why should
> it be run in the created task context (apart from final exec)? Can
> requests be run as normal by the original task, each will take the
> half created and not yet launched task as a parameter (in some form),
> modify it, and the final exec would launch it?

A single operation would be a very complex operation doing many things
at once , and much less flexible.  This approach is flexible: you
can combine any (in theory) io_uring operation to obtain the desired
behavior.

-- 
Gabriel Krisman Bertazi

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC 7/9] io_uring: Introduce IORING_OP_CLONE
  2024-12-11 17:26     ` Josh Triplett
@ 2024-12-17 11:03       ` Pavel Begunkov
  2024-12-17 19:14         ` Josh Triplett
  0 siblings, 1 reply; 25+ messages in thread
From: Pavel Begunkov @ 2024-12-17 11:03 UTC (permalink / raw)
  To: Josh Triplett; +Cc: Gabriel Krisman Bertazi, axboe, io-uring

On 12/11/24 17:26, Josh Triplett wrote:
> On Wed, Dec 11, 2024 at 01:37:40PM +0000, Pavel Begunkov wrote:
>> Also, do you block somewhere all other opcodes? If it's indeed
>> an under initialised task then it's not safe to run most of them,
>> and you'd never know in what way, unfortunately. An fs write
>> might need a net namespace, a send/recv might decide to touch
>> fs_struct and so on.
> 
> I would not expect the new task to be under-initialised, beyond the fact
> that it doesn't have a userspace yet (e.g. it can't return to userspace

I see, that's good. What it takes to setup a userspace? and is
it expensive? I remember there were good numbers at the time and
I'm to see where the performance improvement comes from. Is it
because the page table is shared? In other word what's the
difference comparing to spinning a new (user space) thread and
executing the rest with a new io_uring instance from it?


> without exec-ing first); if it is, that'd be a bug. It *should* be
> possible to do almost any reasonable opcode. For instance, reasonable
> possibilities include "write a byte to a pipe, open a file,
> install/rearrange some file descriptors, then exec".

-- 
Pavel Begunkov


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC 0/9] Launching processes with io_uring
  2024-12-13 20:13   ` Gabriel Krisman Bertazi
@ 2024-12-17 16:10     ` Pavel Begunkov
  2024-12-30 23:38       ` Gabriel Krisman Bertazi
  0 siblings, 1 reply; 25+ messages in thread
From: Pavel Begunkov @ 2024-12-17 16:10 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi; +Cc: axboe, io-uring, josh

On 12/13/24 20:13, Gabriel Krisman Bertazi wrote:
> 
> Hi Pavel,
> 
> Pavel Begunkov <[email protected]> writes:
>> On 12/9/24 23:43, Gabriel Krisman Bertazi wrote:
> 
>> Sorry to say but the series is rather concerning.
>>
>> 1) It creates a special path that tries to mimick the core
>> path, but not without a bunch of troubles and in quite a
>> special way.
> 
> I fully agree this is one of the main problem with the series.  I'm
> interested in how we can merge this implementation into the existing
> io_uring paths.  My idea, which I hinted in the cover letter, is to have
> a flavor of io-wq that executes one linked sequence and then terminates.
> When a work is queued there, the newly spawned worker thread will live
> only until the end of that link.  This wq is only used to execute the
> link following a IORING_OP_CLONE and the user can pass CLONE_ flags to
> determine how it is created.  This allows the user to create a detached
> file descriptor table in the worker thread, for instance.
> 
> It'd allows us to reuse the dispatching infrastructure of io-wq, hide
> io_uring internals from the OP_CLONE implementation, and
> enable, if I understand correctly, the workarounds to execute
> task_works.  We'd need to ensure nothing from the link gets
> executed outside of this context.

One problem with io-wq is that it's not guaranteed that it's able to
serve all types of requests. Though it's limited to multishots atm,
which you might not need, but the situation might change. And there
is no guarantee that the request is completed by the time it returns
from ->issue(), it might even change hands from inside the callback
via task_work or by any other mean.

It also sounds like you want the cloned task to be a normal
io_uring submmiter in terms of infra even though it can't
initiate a syscall, which also sounds a bit like an SQPOLL task.

And do we really need to execute everything from the new task
context, or ops can take a task as an argument and run whenever
while final exec could be special cased inside the callback?


>> 2) There would be a special set of ops that can only be run
>> from that special path.
> 
> There are problems with cancellations and timeouts, that I'd expect to
> be more solvable when reusing the io-wq code.  But this task is
> executing from a cloned context, so we have a copy of the parent
> context, and share the same memory map.  It should be safe to do IO to
> open file descriptors, wake futexes and pretty much anything that
> doesn't touch io_uring itself.  There are oddities, like the fact the fd
> table is split from the parent task while the io_uring direct
> descriptors are shared.  That needs to be handled with more sane
> semantics.
> 
>> At this point it raises a question why it even needs io_uring
>> infra? I don't think it's really helping you. E.g. why not do it
>> as a list of operation in a custom format instead of links? That
>> can be run by a single io_uring request or can even be a normal
>> syscall.
>>
>> struct clone_op ops = { { CLONE },
>>          { SET_CRED, cred_id }, ...,
>>          { EXEC, path }};
>>
>>
>> Makes me wonder about a different ways of handling. E.g. why should
>> it be run in the created task context (apart from final exec)? Can
>> requests be run as normal by the original task, each will take the
>> half created and not yet launched task as a parameter (in some form),
>> modify it, and the final exec would launch it?
> 
> A single operation would be a very complex operation doing many things
> at once , and much less flexible.  This approach is flexible: you
> can combine any (in theory) io_uring operation to obtain the desired
> behavior.

Ok. And links are not flexible enough for it either. Think of
error handling, passing results from one request to another and
more complex relations. Unless chains are supposed to be very
short and simple, it'd need to be able to return back to user
space (the one issuing requests) for error handling.

-- 
Pavel Begunkov


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC 7/9] io_uring: Introduce IORING_OP_CLONE
  2024-12-17 11:03       ` Pavel Begunkov
@ 2024-12-17 19:14         ` Josh Triplett
  0 siblings, 0 replies; 25+ messages in thread
From: Josh Triplett @ 2024-12-17 19:14 UTC (permalink / raw)
  To: Pavel Begunkov; +Cc: Gabriel Krisman Bertazi, axboe, io-uring

On Tue, Dec 17, 2024 at 11:03:27AM +0000, Pavel Begunkov wrote:
> On 12/11/24 17:26, Josh Triplett wrote:
> > On Wed, Dec 11, 2024 at 01:37:40PM +0000, Pavel Begunkov wrote:
> > > Also, do you block somewhere all other opcodes? If it's indeed
> > > an under initialised task then it's not safe to run most of them,
> > > and you'd never know in what way, unfortunately. An fs write
> > > might need a net namespace, a send/recv might decide to touch
> > > fs_struct and so on.
> > 
> > I would not expect the new task to be under-initialised, beyond the fact
> > that it doesn't have a userspace yet (e.g. it can't return to userspace
> 
> I see, that's good. What it takes to setup a userspace? and is
> it expensive? I remember there were good numbers at the time and
> I'm to see where the performance improvement comes from. Is it
> because the page table is shared? In other word what's the
> difference comparing to spinning a new (user space) thread and
> executing the rest with a new io_uring instance from it?

The goal is to provide all the advantages of `vfork` (and then some),
but without the incredibly unsafe vfork limitations.

Or, to look at it a different way, posix_spawn but with all the power of
io_uring available rather than a handful of "spawn attributes".

> > without exec-ing first); if it is, that'd be a bug. It *should* be
> > possible to do almost any reasonable opcode. For instance, reasonable
> > possibilities include "write a byte to a pipe, open a file,
> > install/rearrange some file descriptors, then exec".

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC 0/9] Launching processes with io_uring
  2024-12-17 16:10     ` Pavel Begunkov
@ 2024-12-30 23:38       ` Gabriel Krisman Bertazi
  2024-12-31 14:35         ` Pavel Begunkov
  0 siblings, 1 reply; 25+ messages in thread
From: Gabriel Krisman Bertazi @ 2024-12-30 23:38 UTC (permalink / raw)
  To: Pavel Begunkov; +Cc: axboe, io-uring, josh

Pavel Begunkov <[email protected]> writes:

Hi Pavel,

Sorry for the delay. I took the chance to stay offline for a while on
the christmas week.

>> I fully agree this is one of the main problem with the series.  I'm
>> interested in how we can merge this implementation into the existing
>> io_uring paths.  My idea, which I hinted in the cover letter, is to have
>> a flavor of io-wq that executes one linked sequence and then terminates.
>> When a work is queued there, the newly spawned worker thread will live
>> only until the end of that link.  This wq is only used to execute the
>> link following a IORING_OP_CLONE and the user can pass CLONE_ flags to
>> determine how it is created.  This allows the user to create a detached
>> file descriptor table in the worker thread, for instance.
>> It'd allows us to reuse the dispatching infrastructure of io-wq, hide
>> io_uring internals from the OP_CLONE implementation, and
>> enable, if I understand correctly, the workarounds to execute
>> task_works.  We'd need to ensure nothing from the link gets
>> executed outside of this context.
>
> One problem with io-wq is that it's not guaranteed that it's able to
> serve all types of requests. Though it's limited to multishots atm,
> which you might not need, but the situation might change. And there
> is no guarantee that the request is completed by the time it returns
> from ->issue(), it might even change hands from inside the callback
> via task_work or by any other mean.

Multishot is the least of my concerns for this feature, tbh.  I don't
see how it could be useful in the context of spawning a new thread, so
in terms of finding sane semantics, we could just reject them at
submission time if linked from a CLONE.
>
> It also sounds like you want the cloned task to be a normal
> io_uring submmiter in terms of infra even though it can't
> initiate a syscall, which also sounds a bit like an SQPOLL task.
>
> And do we really need to execute everything from the new task
> context, or ops can take a task as an argument and run whenever
> while final exec could be special cased inside the callback?

Wouldn't this be similar to the original design of the io-wq/sqpoll, which
attempted to impersonate the submitter task and resulted in some issues?
Executing directly from the new task is much simpler than trying to do the
operations on the context of another thread.

>>> requests be run as normal by the original task, each will take the
>>> half created and not yet launched task as a parameter (in some form),
>>> modify it, and the final exec would launch it?
>> A single operation would be a very complex operation doing many things
>> at once , and much less flexible.  This approach is flexible: you
>> can combine any (in theory) io_uring operation to obtain the desired
>> behavior.
>
> Ok. And links are not flexible enough for it either. Think of
> error handling, passing results from one request to another and
> more complex relations. Unless chains are supposed to be very
> short and simple, it'd need to be able to return back to user
> space (the one issuing requests) for error handling.

We are posting the completions to the submitter ring.  If a request
fails, we kill the context, but the user is notified of what operation
failed and need to resubmit the entire link with a new spawn.

We could avoid links by letting the spawned task linger, and provide a
way for the user to submit more operations to be executed by this
specific context.  The new task exists until an op to kill the worker is
issued or when the execve command executes.  This would allow the user
to keep multiple partially initialized contexts around for quick
userspace thread dispatching.  we could provide a mechanism to clone
these pre-initialized tasks.

Perhaps it is a stupid idea, but not new - i've seen it discussed
before: I'm thinking of a workload like a thread scheduler in userspace
or a network server.  It could keep a partially initialized worker
thread that never and, when needed, the main task would duplicate it with
another uring command and make the copy return to userspace, mitigating
the cost of copying and reinitializing the task.

-- 
Gabriel Krisman Bertazi

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC 0/9] Launching processes with io_uring
  2024-12-30 23:38       ` Gabriel Krisman Bertazi
@ 2024-12-31 14:35         ` Pavel Begunkov
  0 siblings, 0 replies; 25+ messages in thread
From: Pavel Begunkov @ 2024-12-31 14:35 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi; +Cc: axboe, io-uring, josh

On 12/30/24 23:38, Gabriel Krisman Bertazi wrote:
> Pavel Begunkov <[email protected]> writes:
> 
> Hi Pavel,
> 
> Sorry for the delay. I took the chance to stay offline for a while on
> the christmas week.

No worries at all

>>> I fully agree this is one of the main problem with the series.  I'm
>>> interested in how we can merge this implementation into the existing
>>> io_uring paths.  My idea, which I hinted in the cover letter, is to have
>>> a flavor of io-wq that executes one linked sequence and then terminates.
>>> When a work is queued there, the newly spawned worker thread will live
>>> only until the end of that link.  This wq is only used to execute the
>>> link following a IORING_OP_CLONE and the user can pass CLONE_ flags to
>>> determine how it is created.  This allows the user to create a detached
>>> file descriptor table in the worker thread, for instance.
>>> It'd allows us to reuse the dispatching infrastructure of io-wq, hide
>>> io_uring internals from the OP_CLONE implementation, and
>>> enable, if I understand correctly, the workarounds to execute
>>> task_works.  We'd need to ensure nothing from the link gets
>>> executed outside of this context.
>>
>> One problem with io-wq is that it's not guaranteed that it's able to
>> serve all types of requests. Though it's limited to multishots atm,
>> which you might not need, but the situation might change. And there
>> is no guarantee that the request is completed by the time it returns
>> from ->issue(), it might even change hands from inside the callback
>> via task_work or by any other mean.
> 
> Multishot is the least of my concerns for this feature, tbh.  I don't
> see how it could be useful in the context of spawning a new thread, so
> in terms of finding sane semantics, we could just reject them at
> submission time if linked from a CLONE.

Right, for multishot that is, but the point here is that io-wq
is an internal detail, and it might get to a point where relying
solely on iowq way of execution would fence you off operations
that you care about. Regardless, it would need to follow common
rules like request refcounting and lifetime

>> It also sounds like you want the cloned task to be a normal
>> io_uring submmiter in terms of infra even though it can't
>> initiate a syscall, which also sounds a bit like an SQPOLL task.
>>
>> And do we really need to execute everything from the new task
>> context, or ops can take a task as an argument and run whenever
>> while final exec could be special cased inside the callback?
> 
> Wouldn't this be similar to the original design of the io-wq/sqpoll, which
> attempted to impersonate the submitter task and resulted in some issues?

The problem there was that it was a kthread, which are not
prepared to run what is basically a random syscall path, and
no amount of whack-a-mole'ing with likes of kthread_use_mm()
was ever enough.

That's the same reason why I'm saying that unless the the new
task is completely initialised and has all resources as far as
the kernel execution goes, there would be no way to run a random
io_uring request from it. "Initialised" would mean here having
->mm and all other task resources to anything valid.

> Executing directly from the new task is much simpler than trying to do the
> operations on the context of another thread.

I agree, if that's about executing e.g. a read request, but if
I'd doubt it if it's about setting a ns to a new not yet running
task.

>>>> requests be run as normal by the original task, each will take the
>>>> half created and not yet launched task as a parameter (in some form),
>>>> modify it, and the final exec would launch it?
>>> A single operation would be a very complex operation doing many things
>>> at once , and much less flexible.  This approach is flexible: you
>>> can combine any (in theory) io_uring operation to obtain the desired
>>> behavior.
>>
>> Ok. And links are not flexible enough for it either. Think of
>> error handling, passing results from one request to another and
>> more complex relations. Unless chains are supposed to be very
>> short and simple, it'd need to be able to return back to user
>> space (the one issuing requests) for error handling.
> 
> We are posting the completions to the submitter ring.  If a request
> fails, we kill the context, but the user is notified of what operation
> failed and need to resubmit the entire link with a new spawn.

That's fine in case of a short simple chain, but there is a
talk of random requests. Let's say a read op fails and you
need to redo it, that's not only expensive but might even be
impossible. E.g. it already consumed data from a pipe / socket.

But the biggest problem is flexibility, any non-trivial relation
between request would need some user processing. Take Josh's
example of linking 2+ exec requests, which instead of relying
on some special status of exec requests can be well implemented
by returning the control back to user, and that would be much
more flexible at that.

> We could avoid links by letting the spawned task linger, and provide a
> way for the user to submit more operations to be executed by this
> specific context.  The new task exists until an op to kill the worker is
> issued or when the execve command executes.  This would allow the user
> to keep multiple partially initialized contexts around for quick
> userspace thread dispatching.  we could provide a mechanism to clone
> these pre-initialized tasks.
> 
> Perhaps it is a stupid idea, but not new - i've seen it discussed
> before: I'm thinking of a workload like a thread scheduler in userspace
> or a network server.  It could keep a partially initialized worker
> thread that never and, when needed, the main task would duplicate it with
> another uring command and make the copy return to userspace, mitigating
> the cost of copying and reinitializing the task.
> 

-- 
Pavel Begunkov


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC 0/9] Launching processes with io_uring
  2024-12-11 14:02 ` Pavel Begunkov
  2024-12-11 17:34   ` Josh Triplett
  2024-12-13 20:13   ` Gabriel Krisman Bertazi
@ 2025-01-18 22:33   ` Askar Safin
  2025-01-19  3:04     ` Pavel Begunkov
  2 siblings, 1 reply; 25+ messages in thread
From: Askar Safin @ 2025-01-18 22:33 UTC (permalink / raw)
  To: asml.silence; +Cc: axboe, io-uring, josh, krisman

Pavel Begunkov:
> At this point it raises a question why it even needs io_uring
> infra? I don't think it's really helping you. E.g. why not do it
> as a list of operation in a custom format instead of links? That
> can be run by a single io_uring request or can even be a normal
> syscall.

> Makes me wonder about a different ways of handling. E.g. why should
> it be run in the created task context (apart from final exec)? Can
> requests be run as normal by the original task, each will take the
> half created and not yet launched task as a parameter (in some form),
> modify it, and the final exec would launch it?

I totally agree. I think API should look like this:

===
// This may be PID fd or something completely different
int fd = create_task ();

task_manipulate (fd, OP_CHDIR, "/");
task_manipulate (fd, OP_CLOSE, 0);
task_manipulate (fd, OP_OPEN, "/dev/null", O_RDONLY, 0666);

task_execve (fd, "/bin/true", argv, envp);
===

We will have ability to do proper error handling.

Paper "A fork() in the road" says the same thing.
( https://www.microsoft.com/en-us/research/uploads/prod/2019/04/fork-hotos19.pdf ).

From that paper:

> While a spawnlike API is preferred for most instances of starting a new
> program, for full generality it requires a flag, parameter, or
> helper function controlling every possible aspect of process
> state. It is infeasible for a single OS API to give complete
> control over the initial state of a new process. In Unix today,
> the only fallback for advanced use-cases remains code executed
> after fork, but clean-slate designs [e.g., 40, 43] have
> demonstrated an alternative model where system calls that
> modify per-process state are not constrained to merely the
> current process, but rather can manipulate any process to
> which the caller has access. This yields the flexibility and
> orthogonality of the fork/exec model, without most of its
> drawbacks: a new process starts as an empty address space,
> and an advanced user may manipulate it in a piecemeal fashion,
> populating its address-space and kernel context prior to
> execution, without needing to clone the parent nor run code
> in the context of the child. ExOS [43] implemented fork in
> user-mode atop such a primitive. Retrofitting cross-process
> APIs into Unix seems at first glance challenging, but may
> also be productive for future research

So, yes, such APIs already exist in research operating systems.

Moreover, as well as I understand, Windows NT native API also
creates processes the same way.

Askar Safin

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC 0/9] Launching processes with io_uring
  2025-01-18 22:33   ` Askar Safin
@ 2025-01-19  3:04     ` Pavel Begunkov
  0 siblings, 0 replies; 25+ messages in thread
From: Pavel Begunkov @ 2025-01-19  3:04 UTC (permalink / raw)
  To: Askar Safin; +Cc: axboe, io-uring, josh, krisman

On 1/18/25 22:33, Askar Safin wrote:
> Pavel Begunkov:
>> At this point it raises a question why it even needs io_uring
>> infra? I don't think it's really helping you. E.g. why not do it
>> as a list of operation in a custom format instead of links? That
>> can be run by a single io_uring request or can even be a normal
>> syscall.
> 
>> Makes me wonder about a different ways of handling. E.g. why should
>> it be run in the created task context (apart from final exec)? Can
>> requests be run as normal by the original task, each will take the
>> half created and not yet launched task as a parameter (in some form),
>> modify it, and the final exec would launch it?
> 
> I totally agree. I think API should look like this:
> 
> ===
> // This may be PID fd or something completely different
> int fd = create_task ();
> 
> task_manipulate (fd, OP_CHDIR, "/");
> task_manipulate (fd, OP_CLOSE, 0);
> task_manipulate (fd, OP_OPEN, "/dev/null", O_RDONLY, 0666);
> 
> task_execve (fd, "/bin/true", argv, envp);
> ===

That's one way of doing it. The api would be nice, it'd fit into
io_uring without extra tricks, and wouldn't even hard io_uring.
Though that could take a good amount of plumbing.

I also wonder, if copying the page table is a performance problem, why
CLONE_VM + exec is not an option? It's flexible, sounds like
posix_spawn does exactly that under the hood, and people tried it
and managed to outperform posix_spawn.

-- 
Pavel Begunkov


^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2025-01-19  3:03 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-12-09 23:43 [PATCH RFC 0/9] Launching processes with io_uring Gabriel Krisman Bertazi
2024-12-09 23:43 ` [PATCH RFC 1/9] io_uring: Drop __io_req_find_next_prep Gabriel Krisman Bertazi
2024-12-09 23:43 ` [PATCH RFC 2/9] io_uring: Expose failed request helper in internal header Gabriel Krisman Bertazi
2024-12-09 23:43 ` [PATCH RFC 3/9] kernel/fork: Don't inherit PF_USER_WORKER from parent Gabriel Krisman Bertazi
2024-12-09 23:43 ` [PATCH RFC 4/9] fs/exec: Expose do_execveat symbol Gabriel Krisman Bertazi
2024-12-09 23:43 ` [PATCH RFC 5/9] kernel/fork: Add helper to fork from io_uring Gabriel Krisman Bertazi
2024-12-09 23:43 ` [PATCH RFC 6/9] io_uring: Let commands run with current credentials Gabriel Krisman Bertazi
2024-12-11 14:48   ` Pavel Begunkov
2024-12-09 23:43 ` [PATCH RFC 7/9] io_uring: Introduce IORING_OP_CLONE Gabriel Krisman Bertazi
2024-12-11 13:37   ` Pavel Begunkov
2024-12-11 17:26     ` Josh Triplett
2024-12-17 11:03       ` Pavel Begunkov
2024-12-17 19:14         ` Josh Triplett
2024-12-09 23:43 ` [PATCH RFC 8/9] io_uring: Let ->issue know if it was called from spawn thread Gabriel Krisman Bertazi
2024-12-09 23:43 ` [PATCH RFC 9/9] io_uring: Introduce IORING_OP_EXEC command Gabriel Krisman Bertazi
2024-12-10 21:01   ` Josh Triplett
2024-12-10 21:10 ` [PATCH RFC 0/9] Launching processes with io_uring Josh Triplett
2024-12-11 14:02 ` Pavel Begunkov
2024-12-11 17:34   ` Josh Triplett
2024-12-13 20:13   ` Gabriel Krisman Bertazi
2024-12-17 16:10     ` Pavel Begunkov
2024-12-30 23:38       ` Gabriel Krisman Bertazi
2024-12-31 14:35         ` Pavel Begunkov
2025-01-18 22:33   ` Askar Safin
2025-01-19  3:04     ` Pavel Begunkov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox