public inbox for [email protected]
 help / color / mirror / Atom feed
* [PATCHSET RFC 0/6] Add io_uring support for futex wait/wake
@ 2023-06-09 18:31 Jens Axboe
  2023-06-09 18:31 ` [PATCH 1/6] futex: abstract out futex_op_to_flags() helper Jens Axboe
                   ` (5 more replies)
  0 siblings, 6 replies; 16+ messages in thread
From: Jens Axboe @ 2023-06-09 18:31 UTC (permalink / raw)
  To: io-uring; +Cc: andres

Hi,

Sending this just to the io_uring list for now so we can iron out
details, questions, concerns, etc. before going a bit broader to get
the futex parts reviewed. Those are pretty straightforward though,
and try not to get too entangled in futex internals.

Anyway, this adds support for IORING_OP_FUTEX_{WAIT,WAKE}. The
WAKE part is pretty trivial, as we can just call into the wake side
of things. For wait, obviously we cannot, as we don't want to block
on waiting on a futex. When futex currently does a wait, it queues up
the futex_q in question and then does a sync wait/schedule on that.
The futex handler futex_wake_mark() is responsible for waking the
task that is synchronously sleeping on that futex_q. This default
handler is hardwired, and we simply add a wake handler in futex_q
for this instead and change the hardwired futex_q->task to be a
generic data piece for the handler.

With that, we can queue up a futex_q and get a callback when it
would have woken, which lets us sanely implement async WAIT
support without blocking.

Notable omissions in this code so far:

- We don't support timeouts with futex wait. We could definitely
  add this support. Outside of some complications with racing with
  wake (and cancelation), it is certainly doable. The main question
  here is whether we need it at all. And if we do, can we get by with
  just using linked timeouts for this? That's the io_uring idiomatic
  way of achieving this goal (see the sketch after this list). That
  said, I may just go ahead and add it if I can solve the races in a
  clean fashion, because at that point it seems like the right thing
  to do.

- No PI support. This can certainly get added later.
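
For reference, a minimal (untested) sketch of the linked timeout
approach, assuming the io_uring_prep_futex_wait() helper and opcode
defines from the demo program at the end of this mail; ring and futex
are placeholders:

	struct __kernel_timespec ts = { .tv_sec = 1, };
	struct io_uring_sqe *sqe;

	/* queue the futex wait with a timeout linked behind it */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_futex_wait(sqe, futex, 0);
	sqe->flags |= IOSQE_IO_LINK;
	sqe->user_data = 1;

	/* fires after 1 second, completing the wait with -ECANCELED */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_link_timeout(sqe, &ts, 0);
	sqe->user_data = 2;

	io_uring_submit(&ring);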

Code can also be found here:

git://git.kernel.dk/linux io_uring-futex

or on cgit:

https://git.kernel.dk/cgit/linux/log/?h=io_uring-futex

Very simple sample code below showing how to do a wait and wake.
Obviously not that exciting, just a brief demo.


#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <linux/futex.h>
#include <liburing.h>

#define	IORING_OP_FUTEX_WAIT (IORING_OP_SENDMSG_ZC + 1)
#define IORING_OP_FUTEX_WAKE (IORING_OP_FUTEX_WAIT + 1)

static void io_uring_prep_futex_wait(struct io_uring_sqe *sqe, void *futex,
				     int val)
{
	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = IORING_OP_FUTEX_WAIT;
	sqe->fd = FUTEX_WAIT;
	sqe->addr = (unsigned long) futex;
	sqe->len = val;
	sqe->file_index = FUTEX_BITSET_MATCH_ANY;
}

static void io_uring_prep_futex_wake(struct io_uring_sqe *sqe, void *futex,
				     int val)
{
	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = IORING_OP_FUTEX_WAKE;
	sqe->fd = FUTEX_WAKE;
	sqe->addr = (unsigned long) futex;
	sqe->len = val;
	sqe->file_index = FUTEX_BITSET_MATCH_ANY;
}

int main(int argc, char *argv[])
{
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	struct io_uring ring;
	unsigned int *futex;
	int ret, i;

	futex = malloc(sizeof(*futex));
	*futex = 0;

	io_uring_queue_init(8, &ring, 0);

	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_futex_wait(sqe, futex, 0);
	sqe->user_data = 1;
	
	io_uring_submit(&ring);

	*futex = 1;
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_futex_wake(sqe, futex, 1);
	sqe->user_data = 2;

	io_uring_submit(&ring);

	for (i = 0; i < 2; i++) {
		ret = io_uring_wait_cqe(&ring, &cqe);
		if (ret)
			return 1;

		io_uring_cqe_seen(&ring, cqe);
	}

	return 0;
}
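
To build the demo against liburing (the file name is just an example;
it needs a kernel with this series applied to actually complete):

	gcc -Wall -o futex-demo futex-demo.c -luring
	./futex-demo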

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 1/6] futex: abstract out futex_op_to_flags() helper
  2023-06-09 18:31 [PATCHSET RFC 0/6] Add io_uring support for futex wait/wake Jens Axboe
@ 2023-06-09 18:31 ` Jens Axboe
  2023-06-09 18:31 ` [PATCH 2/6] futex: factor out the futex wake handling Jens Axboe
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 16+ messages in thread
From: Jens Axboe @ 2023-06-09 18:31 UTC (permalink / raw)
  To: io-uring; +Cc: andres, Jens Axboe

Rather than needing to duplicate this for the io_uring hook of futexes,
abstract out a helper.

No functional changes intended in this patch.

Signed-off-by: Jens Axboe <[email protected]>
---
 kernel/futex/futex.h    | 15 +++++++++++++++
 kernel/futex/syscalls.c | 11 ++---------
 2 files changed, 17 insertions(+), 9 deletions(-)

diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index b5379c0e6d6d..d2949fca37d1 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -291,4 +291,19 @@ extern int futex_unlock_pi(u32 __user *uaddr, unsigned int flags);
 
 extern int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int trylock);
 
+static inline bool futex_op_to_flags(int op, int cmd, unsigned int *flags)
+{
+	if (!(op & FUTEX_PRIVATE_FLAG))
+		*flags |= FLAGS_SHARED;
+
+	if (op & FUTEX_CLOCK_REALTIME) {
+		*flags |= FLAGS_CLOCKRT;
+		if (cmd != FUTEX_WAIT_BITSET && cmd != FUTEX_WAIT_REQUEUE_PI &&
+		    cmd != FUTEX_LOCK_PI2)
+			return false;
+	}
+
+	return true;
+}
+
 #endif /* _FUTEX_H */
diff --git a/kernel/futex/syscalls.c b/kernel/futex/syscalls.c
index a8074079b09e..75ca8c41cc94 100644
--- a/kernel/futex/syscalls.c
+++ b/kernel/futex/syscalls.c
@@ -88,15 +88,8 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
 	int cmd = op & FUTEX_CMD_MASK;
 	unsigned int flags = 0;
 
-	if (!(op & FUTEX_PRIVATE_FLAG))
-		flags |= FLAGS_SHARED;
-
-	if (op & FUTEX_CLOCK_REALTIME) {
-		flags |= FLAGS_CLOCKRT;
-		if (cmd != FUTEX_WAIT_BITSET && cmd != FUTEX_WAIT_REQUEUE_PI &&
-		    cmd != FUTEX_LOCK_PI2)
-			return -ENOSYS;
-	}
+	if (!futex_op_to_flags(op, cmd, &flags))
+		return -ENOSYS;
 
 	switch (cmd) {
 	case FUTEX_WAIT:
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 2/6] futex: factor out the futex wake handling
  2023-06-09 18:31 [PATCHSET RFC 0/6] Add io_uring support for futex wait/wake Jens Axboe
  2023-06-09 18:31 ` [PATCH 1/6] futex: abstract out futex_op_to_flags() helper Jens Axboe
@ 2023-06-09 18:31 ` Jens Axboe
  2023-06-09 18:31 ` [PATCH 3/6] futex: assign default futex_q->wake_data at insertion time Jens Axboe
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 16+ messages in thread
From: Jens Axboe @ 2023-06-09 18:31 UTC (permalink / raw)
  To: io-uring; +Cc: andres, Jens Axboe

In preparation for having another waker that isn't futex_wake_mark(),
add a wake handler to futex_q and rename the futex_q->task field to just
be wake_data. futex_wake_mark() is set as the default wake handler.

Signed-off-by: Jens Axboe <[email protected]>
---
 kernel/futex/core.c     | 2 +-
 kernel/futex/futex.h    | 3 ++-
 kernel/futex/requeue.c  | 7 ++++---
 kernel/futex/waitwake.c | 8 ++++----
 4 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 514e4582b863..6223cce3d876 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -556,7 +556,7 @@ void __futex_queue(struct futex_q *q, struct futex_hash_bucket *hb)
 
 	plist_node_init(&q->list, prio);
 	plist_add(&q->list, &hb->chain);
-	q->task = current;
+	q->wake_data = current;
 }
 
 /**
diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index d2949fca37d1..1b7dd5266dd2 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -96,7 +96,6 @@ struct futex_pi_state {
 struct futex_q {
 	struct plist_node list;
 
-	struct task_struct *task;
 	spinlock_t *lock_ptr;
 	union futex_key key;
 	struct futex_pi_state *pi_state;
@@ -107,6 +106,8 @@ struct futex_q {
 #ifdef CONFIG_PREEMPT_RT
 	struct rcuwait requeue_wait;
 #endif
+	void (*wake)(struct wake_q_head *wake_q, struct futex_q *);
+	void *wake_data;
 } __randomize_layout;
 
 extern const struct futex_q futex_q_init;
diff --git a/kernel/futex/requeue.c b/kernel/futex/requeue.c
index cba8b1a6a4cc..6aee8408c341 100644
--- a/kernel/futex/requeue.c
+++ b/kernel/futex/requeue.c
@@ -61,6 +61,7 @@ const struct futex_q futex_q_init = {
 	.key		= FUTEX_KEY_INIT,
 	.bitset		= FUTEX_BITSET_MATCH_ANY,
 	.requeue_state	= ATOMIC_INIT(Q_REQUEUE_PI_NONE),
+	.wake		= futex_wake_mark,
 };
 
 /**
@@ -234,7 +235,7 @@ void requeue_pi_wake_futex(struct futex_q *q, union futex_key *key,
 
 	/* Signal locked state to the waiter */
 	futex_requeue_pi_complete(q, 1);
-	wake_up_state(q->task, TASK_NORMAL);
+	wake_up_state(q->wake_data, TASK_NORMAL);
 }
 
 /**
@@ -316,7 +317,7 @@ futex_proxy_trylock_atomic(u32 __user *pifutex, struct futex_hash_bucket *hb1,
 	 * the user space lock can be acquired then PI state is attached to
 	 * the new owner (@top_waiter->task) when @set_waiters is true.
 	 */
-	ret = futex_lock_pi_atomic(pifutex, hb2, key2, ps, top_waiter->task,
+	ret = futex_lock_pi_atomic(pifutex, hb2, key2, ps, top_waiter->wake_data,
 				   exiting, set_waiters);
 	if (ret == 1) {
 		/*
@@ -626,7 +627,7 @@ int futex_requeue(u32 __user *uaddr1, unsigned int flags, u32 __user *uaddr2,
 
 		ret = rt_mutex_start_proxy_lock(&pi_state->pi_mutex,
 						this->rt_waiter,
-						this->task);
+						this->wake_data);
 
 		if (ret == 1) {
 			/*
diff --git a/kernel/futex/waitwake.c b/kernel/futex/waitwake.c
index ba01b9408203..5151c83e2db8 100644
--- a/kernel/futex/waitwake.c
+++ b/kernel/futex/waitwake.c
@@ -114,7 +114,7 @@
  */
 void futex_wake_mark(struct wake_q_head *wake_q, struct futex_q *q)
 {
-	struct task_struct *p = q->task;
+	struct task_struct *p = q->wake_data;
 
 	if (WARN(q->pi_state || q->rt_waiter, "refusing to wake PI futex\n"))
 		return;
@@ -174,7 +174,7 @@ int futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
 			if (!(this->bitset & bitset))
 				continue;
 
-			futex_wake_mark(&wake_q, this);
+			this->wake(&wake_q, this);
 			if (++ret >= nr_wake)
 				break;
 		}
@@ -289,7 +289,7 @@ int futex_wake_op(u32 __user *uaddr1, unsigned int flags, u32 __user *uaddr2,
 				ret = -EINVAL;
 				goto out_unlock;
 			}
-			futex_wake_mark(&wake_q, this);
+			this->wake(&wake_q, this);
 			if (++ret >= nr_wake)
 				break;
 		}
@@ -303,7 +303,7 @@ int futex_wake_op(u32 __user *uaddr1, unsigned int flags, u32 __user *uaddr2,
 					ret = -EINVAL;
 					goto out_unlock;
 				}
-				futex_wake_mark(&wake_q, this);
+				this->wake(&wake_q, this);
 				if (++op_ret >= nr_wake2)
 					break;
 			}
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 3/6] futex: assign default futex_q->wake_data at insertion time
  2023-06-09 18:31 [PATCHSET RFC 0/6] Add io_uring support for futex wait/wake Jens Axboe
  2023-06-09 18:31 ` [PATCH 1/6] futex: abstract out futex_op_to_flags() helper Jens Axboe
  2023-06-09 18:31 ` [PATCH 2/6] futex: factor out the futex wake handling Jens Axboe
@ 2023-06-09 18:31 ` Jens Axboe
  2023-06-09 18:31 ` [PATCH 4/6] futex: add futex wait variant that takes a futex_q directly Jens Axboe
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 16+ messages in thread
From: Jens Axboe @ 2023-06-09 18:31 UTC (permalink / raw)
  To: io-uring; +Cc: andres, Jens Axboe

Rather than do it in the lower level queueing helper, move it to the
upper level one. This enables use of that helper with the caller setting
the wake handler data prior to calling it, rather than assuming that
futex_wake_mark() is the handler for this futex_q.

Signed-off-by: Jens Axboe <[email protected]>
---
 kernel/futex/core.c  | 1 -
 kernel/futex/futex.h | 1 +
 2 files changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 6223cce3d876..b9d8619c06fc 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -556,7 +556,6 @@ void __futex_queue(struct futex_q *q, struct futex_hash_bucket *hb)
 
 	plist_node_init(&q->list, prio);
 	plist_add(&q->list, &hb->chain);
-	q->wake_data = current;
 }
 
 /**
diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index 1b7dd5266dd2..8c12cef83d38 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -171,6 +171,7 @@ extern int futex_unqueue(struct futex_q *q);
 static inline void futex_queue(struct futex_q *q, struct futex_hash_bucket *hb)
 	__releases(&hb->lock)
 {
+	q->wake_data = current;
 	__futex_queue(q, hb);
 	spin_unlock(&hb->lock);
 }
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 4/6] futex: add futex wait variant that takes a futex_q directly
  2023-06-09 18:31 [PATCHSET RFC 0/6] Add io_uring support for futex wait/wake Jens Axboe
                   ` (2 preceding siblings ...)
  2023-06-09 18:31 ` [PATCH 3/6] futex: assign default futex_q->wake_data at insertion time Jens Axboe
@ 2023-06-09 18:31 ` Jens Axboe
  2023-06-09 18:31 ` [PATCH 5/6] io_uring: add support for futex wake and wait Jens Axboe
  2023-06-09 18:31 ` [PATCH 6/6] io_uring/futex: enable use of the allocation caches for futex_q Jens Axboe
  5 siblings, 0 replies; 16+ messages in thread
From: Jens Axboe @ 2023-06-09 18:31 UTC (permalink / raw)
  To: io-uring; +Cc: andres, Jens Axboe

For async triggering of the wait, we need to be able to pass in a
futex_q that is already set up. Add that helper.

Signed-off-by: Jens Axboe <[email protected]>
---
 kernel/futex/futex.h    |  3 +++
 kernel/futex/waitwake.c | 17 +++++++++++++++++
 2 files changed, 20 insertions(+)

diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index 8c12cef83d38..29bf78a1f475 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -156,6 +156,9 @@ extern void __futex_unqueue(struct futex_q *q);
 extern void __futex_queue(struct futex_q *q, struct futex_hash_bucket *hb);
 extern int futex_unqueue(struct futex_q *q);
 
+extern int futex_queue_wait(struct futex_q *q, u32 __user *uaddr,
+			    unsigned int flags, u32 val);
+
 /**
  * futex_queue() - Enqueue the futex_q on the futex_hash_bucket
  * @q:	The futex_q to enqueue
diff --git a/kernel/futex/waitwake.c b/kernel/futex/waitwake.c
index 5151c83e2db8..442dafdfa22a 100644
--- a/kernel/futex/waitwake.c
+++ b/kernel/futex/waitwake.c
@@ -706,3 +706,20 @@ static long futex_wait_restart(struct restart_block *restart)
 				restart->futex.val, tp, restart->futex.bitset);
 }
 
+int futex_queue_wait(struct futex_q *q, u32 __user *uaddr, unsigned int flags,
+		     u32 val)
+{
+	struct futex_hash_bucket *hb;
+	int ret;
+
+	if (!q->bitset)
+		return -EINVAL;
+
+	ret = futex_wait_setup(uaddr, val, flags, q, &hb);
+	if (ret)
+		return ret;
+
+	__futex_queue(q, hb);
+	spin_unlock(&hb->lock);
+	return 0;
+}
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 5/6] io_uring: add support for futex wake and wait
  2023-06-09 18:31 [PATCHSET RFC 0/6] Add io_uring support for futex wait/wake Jens Axboe
                   ` (3 preceding siblings ...)
  2023-06-09 18:31 ` [PATCH 4/6] futex: add futex wait variant that takes a futex_q directly Jens Axboe
@ 2023-06-09 18:31 ` Jens Axboe
  2023-06-12 16:06   ` Gabriel Krisman Bertazi
  2023-06-23 19:04   ` [PATCH 5/6] io_uring: add support for futex wake and wait Andres Freund
  2023-06-09 18:31 ` [PATCH 6/6] io_uring/futex: enable use of the allocation caches for futex_q Jens Axboe
  5 siblings, 2 replies; 16+ messages in thread
From: Jens Axboe @ 2023-06-09 18:31 UTC (permalink / raw)
  To: io-uring; +Cc: andres, Jens Axboe

Add support for FUTEX_WAKE/WAIT primitives.

IORING_OP_FUTEX_WAKE is a mix of FUTEX_WAKE and FUTEX_WAKE_BITSET, as
it does support passing in a bitset.

Similarly, IORING_OP_FUTEX_WAIT is a mix of FUTEX_WAIT and
FUTEX_WAIT_BITSET.

FUTEX_WAKE is straightforward, as we can always just do those inline.
FUTEX_WAIT will queue the futex with an appropriate callback, and
that callback will in turn post a CQE when it has triggered.

Cancelations are supported, both from the application point-of-view,
and also to be able to cancel pending waits if the ring exits before
all events have occurred.

This is just the barebones wait/wake support. Features to be added
later:

- We do not support the PI or requeue operations. The immediate use
  case doesn't need them; unsure if future support for these would be
  useful.

- Should we support futex wait with timeout? Not clear if this is
  necessary, as the usual io_uring linked timeouts could fill this
  purpose.

- Would be nice to support registered futexes, just like we do buffers.
  This would avoid mapping in user memory for each operation.

- Probably lots more that I just didn't think of.

Signed-off-by: Jens Axboe <[email protected]>
---
 include/linux/io_uring_types.h |   2 +
 include/uapi/linux/io_uring.h  |   3 +
 io_uring/Makefile              |   4 +-
 io_uring/cancel.c              |   5 +
 io_uring/cancel.h              |   4 +
 io_uring/futex.c               | 194 +++++++++++++++++++++++++++++++++
 io_uring/futex.h               |  26 +++++
 io_uring/io_uring.c            |   3 +
 io_uring/opdef.c               |  25 ++++-
 9 files changed, 264 insertions(+), 2 deletions(-)
 create mode 100644 io_uring/futex.c
 create mode 100644 io_uring/futex.h

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index f04ce513fadb..d796b578c129 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -273,6 +273,8 @@ struct io_ring_ctx {
 	struct io_wq_work_list	locked_free_list;
 	unsigned int		locked_free_nr;
 
+	struct hlist_head	futex_list;
+
 	const struct cred	*sq_creds;	/* cred used for __io_sq_thread() */
 	struct io_sq_data	*sq_data;	/* if using sq thread polling */
 
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index f222d263bc55..b1a151ab8150 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -65,6 +65,7 @@ struct io_uring_sqe {
 		__u32		xattr_flags;
 		__u32		msg_ring_flags;
 		__u32		uring_cmd_flags;
+		__u32		futex_flags;
 	};
 	__u64	user_data;	/* data to be passed back at completion time */
 	/* pack this to avoid bogus arm OABI complaints */
@@ -235,6 +236,8 @@ enum io_uring_op {
 	IORING_OP_URING_CMD,
 	IORING_OP_SEND_ZC,
 	IORING_OP_SENDMSG_ZC,
+	IORING_OP_FUTEX_WAIT,
+	IORING_OP_FUTEX_WAKE,
 
 	/* this goes last, obviously */
 	IORING_OP_LAST,
diff --git a/io_uring/Makefile b/io_uring/Makefile
index 8cc8e5387a75..2e4779bc550c 100644
--- a/io_uring/Makefile
+++ b/io_uring/Makefile
@@ -7,5 +7,7 @@ obj-$(CONFIG_IO_URING)		+= io_uring.o xattr.o nop.o fs.o splice.o \
 					openclose.o uring_cmd.o epoll.o \
 					statx.o net.o msg_ring.o timeout.o \
 					sqpoll.o fdinfo.o tctx.o poll.o \
-					cancel.o kbuf.o rsrc.o rw.o opdef.o notif.o
+					cancel.o kbuf.o rsrc.o rw.o opdef.o \
+					notif.o
 obj-$(CONFIG_IO_WQ)		+= io-wq.o
+obj-$(CONFIG_FUTEX)		+= futex.o
diff --git a/io_uring/cancel.c b/io_uring/cancel.c
index b4f5dfacc0c3..280fb83145d3 100644
--- a/io_uring/cancel.c
+++ b/io_uring/cancel.c
@@ -15,6 +15,7 @@
 #include "tctx.h"
 #include "poll.h"
 #include "timeout.h"
+#include "futex.h"
 #include "cancel.h"
 
 struct io_cancel {
@@ -98,6 +99,10 @@ int io_try_cancel(struct io_uring_task *tctx, struct io_cancel_data *cd,
 	if (ret != -ENOENT)
 		return ret;
 
+	ret = io_futex_cancel(ctx, cd, issue_flags);
+	if (ret != -ENOENT)
+		return ret;
+
 	spin_lock(&ctx->completion_lock);
 	if (!(cd->flags & IORING_ASYNC_CANCEL_FD))
 		ret = io_timeout_cancel(ctx, cd);
diff --git a/io_uring/cancel.h b/io_uring/cancel.h
index 6a59ee484d0c..6a2a38df7159 100644
--- a/io_uring/cancel.h
+++ b/io_uring/cancel.h
@@ -1,4 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0
+#ifndef IORING_CANCEL_H
+#define IORING_CANCEL_H
 
 #include <linux/io_uring_types.h>
 
@@ -21,3 +23,5 @@ int io_try_cancel(struct io_uring_task *tctx, struct io_cancel_data *cd,
 void init_hash_table(struct io_hash_table *table, unsigned size);
 
 int io_sync_cancel(struct io_ring_ctx *ctx, void __user *arg);
+
+#endif
diff --git a/io_uring/futex.c b/io_uring/futex.c
new file mode 100644
index 000000000000..a1d50145927a
--- /dev/null
+++ b/io_uring/futex.c
@@ -0,0 +1,194 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/io_uring.h>
+
+#include <uapi/linux/io_uring.h>
+
+#include "../kernel/futex/futex.h"
+#include "io_uring.h"
+#include "futex.h"
+
+struct io_futex {
+	struct file	*file;
+	u32 __user	*uaddr;
+	int		futex_op;
+	unsigned int	futex_val;
+	unsigned int	futex_flags;
+	unsigned int	futex_mask;
+	bool		has_timeout;
+	ktime_t		timeout;
+};
+
+static void io_futex_complete(struct io_kiocb *req, struct io_tw_state *ts)
+{
+	struct io_ring_ctx *ctx = req->ctx;
+
+	kfree(req->async_data);
+	io_tw_lock(ctx, ts);
+	hlist_del_init(&req->hash_node);
+	io_req_task_complete(req, ts);
+}
+
+static bool __io_futex_cancel(struct io_ring_ctx *ctx, struct io_kiocb *req)
+{
+	struct futex_q *q = req->async_data;
+
+	/* futex wake already done or in progress */
+	if (!futex_unqueue(q))
+		return false;
+
+	hlist_del_init(&req->hash_node);
+	io_req_set_res(req, -ECANCELED, 0);
+	req->io_task_work.func = io_futex_complete;
+	io_req_task_work_add(req);
+	return true;
+}
+
+int io_futex_cancel(struct io_ring_ctx *ctx, struct io_cancel_data *cd,
+		    unsigned int issue_flags)
+{
+	struct hlist_node *tmp;
+	struct io_kiocb *req;
+	int nr = 0;
+
+	if (cd->flags & (IORING_ASYNC_CANCEL_FD|IORING_ASYNC_CANCEL_FD_FIXED))
+		return 0;
+
+	io_ring_submit_lock(ctx, issue_flags);
+	hlist_for_each_entry_safe(req, tmp, &ctx->futex_list, hash_node) {
+		if (req->cqe.user_data != cd->data &&
+		    !(cd->flags & IORING_ASYNC_CANCEL_ANY))
+			continue;
+		if (__io_futex_cancel(ctx, req))
+			nr++;
+		if (!(cd->flags & IORING_ASYNC_CANCEL_ALL))
+			break;
+	}
+	io_ring_submit_unlock(ctx, issue_flags);
+
+	if (nr)
+		return nr;
+
+	return -ENOENT;
+}
+
+bool io_futex_remove_all(struct io_ring_ctx *ctx, struct task_struct *task,
+			 bool cancel_all)
+{
+	struct hlist_node *tmp;
+	struct io_kiocb *req;
+	bool found = false;
+
+	lockdep_assert_held(&ctx->uring_lock);
+
+	hlist_for_each_entry_safe(req, tmp, &ctx->futex_list, hash_node) {
+		if (!io_match_task_safe(req, task, cancel_all))
+			continue;
+		__io_futex_cancel(ctx, req);
+		found = true;
+	}
+
+	return found;
+}
+
+int io_futex_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+{
+	struct io_futex *iof = io_kiocb_to_cmd(req, struct io_futex);
+	struct __kernel_timespec __user *utime;
+	struct timespec64 t;
+
+	iof->futex_op = READ_ONCE(sqe->fd);
+	iof->uaddr = u64_to_user_ptr(READ_ONCE(sqe->addr));
+	iof->futex_val = READ_ONCE(sqe->len);
+	iof->has_timeout = false;
+	iof->futex_mask = READ_ONCE(sqe->file_index);
+	utime = u64_to_user_ptr(READ_ONCE(sqe->addr2));
+	if (utime) {
+		if (get_timespec64(&t, utime))
+			return -EFAULT;
+		iof->timeout = timespec64_to_ktime(t);
+		iof->timeout = ktime_add_safe(ktime_get(), iof->timeout);
+		iof->has_timeout = true;
+	}
+	iof->futex_flags = READ_ONCE(sqe->futex_flags);
+	if (iof->futex_flags & FUTEX_CMD_MASK)
+		return -EINVAL;
+
+	return 0;
+}
+
+static void io_futex_wake_fn(struct wake_q_head *wake_q, struct futex_q *q)
+{
+	struct io_kiocb *req = q->wake_data;
+
+	__futex_unqueue(q);
+	smp_store_release(&q->lock_ptr, NULL);
+
+	io_req_set_res(req, 0, 0);
+	req->io_task_work.func = io_futex_complete;
+	io_req_task_work_add(req);
+}
+
+int io_futex_wait(struct io_kiocb *req, unsigned int issue_flags)
+{
+	struct io_futex *iof = io_kiocb_to_cmd(req, struct io_futex);
+	struct io_ring_ctx *ctx = req->ctx;
+	unsigned int flags = 0;
+	struct futex_q *q;
+	int ret;
+
+	if (!futex_op_to_flags(FUTEX_WAIT, iof->futex_flags, &flags)) {
+		ret = -ENOSYS;
+		goto done;
+	}
+
+	q = kmalloc(sizeof(*q), GFP_NOWAIT);
+	if (!q) {
+		ret = -ENOMEM;
+		goto done;
+	}
+
+	req->async_data = q;
+	*q = futex_q_init;
+	q->bitset = iof->futex_mask;
+	q->wake = io_futex_wake_fn;
+	q->wake_data = req;
+
+	io_ring_submit_lock(ctx, issue_flags);
+	hlist_add_head(&req->hash_node, &ctx->futex_list);
+	io_ring_submit_unlock(ctx, issue_flags);
+
+	ret = futex_queue_wait(q, iof->uaddr, flags, iof->futex_val);
+	if (ret)
+		goto done;
+
+	return IOU_ISSUE_SKIP_COMPLETE;
+done:
+	if (ret < 0)
+		req_set_fail(req);
+	io_req_set_res(req, ret, 0);
+	return IOU_OK;
+}
+
+int io_futex_wake(struct io_kiocb *req, unsigned int issue_flags)
+{
+	struct io_futex *iof = io_kiocb_to_cmd(req, struct io_futex);
+	unsigned int flags = 0;
+	int ret;
+
+	if (!futex_op_to_flags(FUTEX_WAKE, iof->futex_flags, &flags)) {
+		ret = -ENOSYS;
+		goto done;
+	}
+
+	ret = futex_wake(iof->uaddr, flags, iof->futex_val, iof->futex_mask);
+done:
+	if (ret < 0)
+		req_set_fail(req);
+	io_req_set_res(req, ret, 0);
+	return IOU_OK;
+}
diff --git a/io_uring/futex.h b/io_uring/futex.h
new file mode 100644
index 000000000000..16add2c069cc
--- /dev/null
+++ b/io_uring/futex.h
@@ -0,0 +1,26 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "cancel.h"
+
+int io_futex_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
+int io_futex_wait(struct io_kiocb *req, unsigned int issue_flags);
+int io_futex_wake(struct io_kiocb *req, unsigned int issue_flags);
+
+#if defined(CONFIG_FUTEX)
+int io_futex_cancel(struct io_ring_ctx *ctx, struct io_cancel_data *cd,
+		    unsigned int issue_flags);
+bool io_futex_remove_all(struct io_ring_ctx *ctx, struct task_struct *task,
+			 bool cancel_all);
+#else
+static inline int io_futex_cancel(struct io_ring_ctx *ctx,
+				  struct io_cancel_data *cd,
+				  unsigned int issue_flags)
+{
+	return 0;
+}
+static inline bool io_futex_remove_all(struct io_ring_ctx *ctx,
+				       struct task_struct *task, bool cancel_all)
+{
+	return false;
+}
+#endif
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index a467064da1af..8270f37c312d 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -92,6 +92,7 @@
 #include "cancel.h"
 #include "net.h"
 #include "notif.h"
+#include "futex.h"
 
 #include "timeout.h"
 #include "poll.h"
@@ -336,6 +337,7 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 	INIT_LIST_HEAD(&ctx->tctx_list);
 	ctx->submit_state.free_list.next = NULL;
 	INIT_WQ_LIST(&ctx->locked_free_list);
+	INIT_HLIST_HEAD(&ctx->futex_list);
 	INIT_DELAYED_WORK(&ctx->fallback_work, io_fallback_req_func);
 	INIT_WQ_LIST(&ctx->submit_state.compl_reqs);
 	return ctx;
@@ -3309,6 +3311,7 @@ static __cold bool io_uring_try_cancel_requests(struct io_ring_ctx *ctx,
 	ret |= io_cancel_defer_files(ctx, task, cancel_all);
 	mutex_lock(&ctx->uring_lock);
 	ret |= io_poll_remove_all(ctx, task, cancel_all);
+	ret |= io_futex_remove_all(ctx, task, cancel_all);
 	mutex_unlock(&ctx->uring_lock);
 	ret |= io_kill_timeouts(ctx, task, cancel_all);
 	if (task)
diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index 3b9c6489b8b6..e6b03d7f82e5 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -33,6 +33,7 @@
 #include "poll.h"
 #include "cancel.h"
 #include "rw.h"
+#include "futex.h"
 
 static int io_no_issue(struct io_kiocb *req, unsigned int issue_flags)
 {
@@ -426,11 +427,26 @@ const struct io_issue_def io_issue_defs[] = {
 		.issue			= io_sendmsg_zc,
 #else
 		.prep			= io_eopnotsupp_prep,
+#endif
+	},
+	[IORING_OP_FUTEX_WAIT] = {
+#if defined(CONFIG_FUTEX)
+		.prep			= io_futex_prep,
+		.issue			= io_futex_wait,
+#else
+		.prep			= io_eopnotsupp_prep,
+#endif
+	},
+	[IORING_OP_FUTEX_WAKE] = {
+#if defined(CONFIG_FUTEX)
+		.prep			= io_futex_prep,
+		.issue			= io_futex_wake,
+#else
+		.prep			= io_eopnotsupp_prep,
 #endif
 	},
 };
 
-
 const struct io_cold_def io_cold_defs[] = {
 	[IORING_OP_NOP] = {
 		.name			= "NOP",
@@ -648,6 +664,13 @@ const struct io_cold_def io_cold_defs[] = {
 		.fail			= io_sendrecv_fail,
 #endif
 	},
+	[IORING_OP_FUTEX_WAIT] = {
+		.name			= "FUTEX_WAIT",
+	},
+
+	[IORING_OP_FUTEX_WAKE] = {
+		.name			= "FUTEX_WAKE",
+	},
 };
 
 const char *io_uring_get_opcode(u8 opcode)
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 6/6] io_uring/futex: enable use of the allocation caches for futex_q
  2023-06-09 18:31 [PATCHSET RFC 0/6] Add io_uring support for futex wait/wake Jens Axboe
                   ` (4 preceding siblings ...)
  2023-06-09 18:31 ` [PATCH 5/6] io_uring: add support for futex wake and wait Jens Axboe
@ 2023-06-09 18:31 ` Jens Axboe
  5 siblings, 0 replies; 16+ messages in thread
From: Jens Axboe @ 2023-06-09 18:31 UTC (permalink / raw)
  To: io-uring; +Cc: andres, Jens Axboe

We're under the ctx uring_lock for the issue and completion path anyway;
wire up the futex_q allocator so we can just recycle entries rather than
hit the allocator every time.

Signed-off-by: Jens Axboe <[email protected]>
---
 include/linux/io_uring_types.h |  1 +
 io_uring/futex.c               | 65 +++++++++++++++++++++++++++-------
 io_uring/futex.h               |  8 +++++
 io_uring/io_uring.c            |  2 ++
 4 files changed, 63 insertions(+), 13 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index d796b578c129..a7f03d8d879f 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -274,6 +274,7 @@ struct io_ring_ctx {
 	unsigned int		locked_free_nr;
 
 	struct hlist_head	futex_list;
+	struct io_alloc_cache	futex_cache;
 
 	const struct cred	*sq_creds;	/* cred used for __io_sq_thread() */
 	struct io_sq_data	*sq_data;	/* if using sq thread polling */
diff --git a/io_uring/futex.c b/io_uring/futex.c
index a1d50145927a..e0707723c689 100644
--- a/io_uring/futex.c
+++ b/io_uring/futex.c
@@ -9,6 +9,7 @@
 
 #include "../kernel/futex/futex.h"
 #include "io_uring.h"
+#include "rsrc.h"
 #include "futex.h"
 
 struct io_futex {
@@ -22,22 +23,48 @@ struct io_futex {
 	ktime_t		timeout;
 };
 
+struct io_futex_data {
+	union {
+		struct futex_q		q;
+		struct io_cache_entry	cache;
+	};
+};
+
+void io_futex_cache_init(struct io_ring_ctx *ctx)
+{
+	io_alloc_cache_init(&ctx->futex_cache, IO_NODE_ALLOC_CACHE_MAX,
+				sizeof(struct io_futex_data));
+}
+
+static void io_futex_cache_entry_free(struct io_cache_entry *entry)
+{
+	kfree(container_of(entry, struct io_futex_data, cache));
+}
+
+void io_futex_cache_free(struct io_ring_ctx *ctx)
+{
+	io_alloc_cache_free(&ctx->futex_cache, io_futex_cache_entry_free);
+}
+
 static void io_futex_complete(struct io_kiocb *req, struct io_tw_state *ts)
 {
+	struct io_futex_data *ifd = req->async_data;
 	struct io_ring_ctx *ctx = req->ctx;
 
-	kfree(req->async_data);
 	io_tw_lock(ctx, ts);
+	if (!io_alloc_cache_put(&ctx->futex_cache, &ifd->cache))
+		kfree(ifd);
+	req->async_data = NULL;
 	hlist_del_init(&req->hash_node);
 	io_req_task_complete(req, ts);
 }
 
 static bool __io_futex_cancel(struct io_ring_ctx *ctx, struct io_kiocb *req)
 {
-	struct futex_q *q = req->async_data;
+	struct io_futex_data *ifd = req->async_data;
 
 	/* futex wake already done or in progress */
-	if (!futex_unqueue(q))
+	if (!futex_unqueue(&ifd->q))
 		return false;
 
 	hlist_del_init(&req->hash_node);
@@ -133,12 +160,23 @@ static void io_futex_wake_fn(struct wake_q_head *wake_q, struct futex_q *q)
 	io_req_task_work_add(req);
 }
 
+static struct io_futex_data *io_alloc_ifd(struct io_ring_ctx *ctx)
+{
+	struct io_cache_entry *entry;
+
+	entry = io_alloc_cache_get(&ctx->futex_cache);
+	if (entry)
+		return container_of(entry, struct io_futex_data, cache);
+
+	return kmalloc(sizeof(struct io_futex_data), GFP_NOWAIT);
+}
+
 int io_futex_wait(struct io_kiocb *req, unsigned int issue_flags)
 {
 	struct io_futex *iof = io_kiocb_to_cmd(req, struct io_futex);
 	struct io_ring_ctx *ctx = req->ctx;
+	struct io_futex_data *ifd;
 	unsigned int flags = 0;
-	struct futex_q *q;
 	int ret;
 
 	if (!futex_op_to_flags(FUTEX_WAIT, iof->futex_flags, &flags)) {
@@ -146,23 +184,24 @@ int io_futex_wait(struct io_kiocb *req, unsigned int issue_flags)
 		goto done;
 	}
 
-	q = kmalloc(sizeof(*q), GFP_NOWAIT);
-	if (!q) {
+	io_ring_submit_lock(ctx, issue_flags);
+	ifd = io_alloc_ifd(ctx);
+	if (!ifd) {
+		io_ring_submit_unlock(ctx, issue_flags);
 		ret = -ENOMEM;
 		goto done;
 	}
 
-	req->async_data = q;
-	*q = futex_q_init;
-	q->bitset = iof->futex_mask;
-	q->wake = io_futex_wake_fn;
-	q->wake_data = req;
+	req->async_data = ifd;
+	ifd->q = futex_q_init;
+	ifd->q.bitset = iof->futex_mask;
+	ifd->q.wake = io_futex_wake_fn;
+	ifd->q.wake_data = req;
 
-	io_ring_submit_lock(ctx, issue_flags);
 	hlist_add_head(&req->hash_node, &ctx->futex_list);
 	io_ring_submit_unlock(ctx, issue_flags);
 
-	ret = futex_queue_wait(q, iof->uaddr, flags, iof->futex_val);
+	ret = futex_queue_wait(&ifd->q, iof->uaddr, flags, iof->futex_val);
 	if (ret)
 		goto done;
 
diff --git a/io_uring/futex.h b/io_uring/futex.h
index 16add2c069cc..e60d0abaf676 100644
--- a/io_uring/futex.h
+++ b/io_uring/futex.h
@@ -11,6 +11,8 @@ int io_futex_cancel(struct io_ring_ctx *ctx, struct io_cancel_data *cd,
 		    unsigned int issue_flags);
 bool io_futex_remove_all(struct io_ring_ctx *ctx, struct task_struct *task,
 			 bool cancel_all);
+void io_futex_cache_init(struct io_ring_ctx *ctx);
+void io_futex_cache_free(struct io_ring_ctx *ctx);
 #else
 static inline int io_futex_cancel(struct io_ring_ctx *ctx,
 				  struct io_cancel_data *cd,
@@ -23,4 +25,10 @@ static inline bool io_futex_remove_all(struct io_ring_ctx *ctx,
 {
 	return false;
 }
+static inline void io_futex_cache_init(struct io_ring_ctx *ctx)
+{
+}
+static inline void io_futex_cache_free(struct io_ring_ctx *ctx)
+{
+}
 #endif
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 8270f37c312d..7db2a139d110 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -318,6 +318,7 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 			    sizeof(struct async_poll));
 	io_alloc_cache_init(&ctx->netmsg_cache, IO_ALLOC_CACHE_MAX,
 			    sizeof(struct io_async_msghdr));
+	io_futex_cache_init(ctx);
 	init_completion(&ctx->ref_comp);
 	xa_init_flags(&ctx->personalities, XA_FLAGS_ALLOC1);
 	mutex_init(&ctx->uring_lock);
@@ -2917,6 +2918,7 @@ static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
 	io_eventfd_unregister(ctx);
 	io_alloc_cache_free(&ctx->apoll_cache, io_apoll_cache_free);
 	io_alloc_cache_free(&ctx->netmsg_cache, io_netmsg_cache_free);
+	io_futex_cache_free(ctx);
 	io_destroy_buffers(ctx);
 	mutex_unlock(&ctx->uring_lock);
 	if (ctx->sq_creds)
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH 5/6] io_uring: add support for futex wake and wait
  2023-06-09 18:31 ` [PATCH 5/6] io_uring: add support for futex wake and wait Jens Axboe
@ 2023-06-12 16:06   ` Gabriel Krisman Bertazi
  2023-06-12 20:37     ` Jens Axboe
  2023-06-23 19:04   ` [PATCH 5/6] io_uring: add support for futex wake and wait Andres Freund
  1 sibling, 1 reply; 16+ messages in thread
From: Gabriel Krisman Bertazi @ 2023-06-12 16:06 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, andres

Jens Axboe <[email protected]> writes:

> Add support for FUTEX_WAKE/WAIT primitives.

This is great.  I was so sure io_uring had this support already for some
reason.  I might have dreamed it.

The semantics are tricky, though. You might want to CC peterZ and tglx
directly.

> IORING_OP_FUTEX_WAKE is a mix of FUTEX_WAKE and FUTEX_WAKE_BITSET, as
> it does support passing in a bitset.

As far as I know, the _BITSET variant is not commonly used in the
current interface.  I haven't seen any code that really benefits from
it.

> Similarly, IORING_OP_FUTEX_WAIT is a mix of FUTEX_WAIT and
> FUTEX_WAIT_BITSET.

But it is definitely safe to have a single one, basically with the
_BITSET semantics.

> FUTEX_WAKE is straightforward, as we can always just do those inline.
> FUTEX_WAIT will queue the futex with an appropriate callback, and
> that callback will in turn post a CQE when it has triggered.

Even with an asynchronous model, it might make sense to halt execution
of further queued operations until futex completes.  I think
IOSQE_IO_DRAIN is a barrier only against the submission part, so it
wouldn't help.  Is there a way to ensure this ordering?

I know, it goes against the asynchronous nature of io_uring, but I think
it might be a valid use case. Say we extend FUTEX_WAIT with a way to
acquire the futex in kernel space.  Then, when the CQE returns, we know
the lock is acquired.  If we can queue dependencies on that (stronger
than the link semantics), we could queue operations to be executed once
the lock is taken. Makes sense?

> Cancelations are supported, both from the application point-of-view,
> and also to be able to cancel pending waits if the ring exits before
> all events have occurred.
>
> This is just the barebones wait/wake support. Features to be added
> later:

One item high on my wishlist would be the futexv semantics (wait on any
of a set of futexes).  It cannot be implemented by issuing several
FUTEX_WAIT.

-- 
Gabriel Krisman Bertazi

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 5/6] io_uring: add support for futex wake and wait
  2023-06-12 16:06   ` Gabriel Krisman Bertazi
@ 2023-06-12 20:37     ` Jens Axboe
  2023-06-12 23:00       ` Gabriel Krisman Bertazi
  0 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2023-06-12 20:37 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi; +Cc: io-uring, andres

On 6/12/23 10:06 AM, Gabriel Krisman Bertazi wrote:
> Jens Axboe <[email protected]> writes:
> 
>> Add support for FUTEX_WAKE/WAIT primitives.
> 
> This is great.  I was so sure io_uring had this support already for some
> reason.  I might have dreamed it.

I think you did :-)

> The semantics are tricky, though. You might want to CC peterZ and tglx
> directly.

For sure, I'll take it wider soon enough. Just wanted to iron out
io_uring details first.

>> IORING_OP_FUTEX_WAKE is a mix of FUTEX_WAKE and FUTEX_WAKE_BITSET, as
>> it does support passing in a bitset.
> 
> As far as I know, the _BITSET variant is not commonly used in the
> current interface.  I haven't seen any code that really benefits from
> it.

Since FUTEX_WAKE is a strict subset of FUTEX_WAKE_BITSET, it makes
little sense to not just support both imho.

>> Similarly, IORING_OP_FUTEX_WAIT is a mix of FUTEX_WAIT and
>> FUTEX_WAIT_BITSET.
> 
> But it is definitely safe to have a single one, basically with the
> _BITSET semantics.

Yep I think so.

>> FUTEX_WAKE is straightforward, as we can always just do those inline.
>> FUTEX_WAIT will queue the futex with an appropriate callback, and
>> that callback will in turn post a CQE when it has triggered.
> 
> Even with an asynchronous model, it might make sense to halt execution
> of further queued operations until futex completes.  I think
> IOSQE_IO_DRAIN is a barrier only against the submission part, so it
> wouldn't help.  Is there a way to ensure this ordering?

You'd use link for that - link whatever depends on the wake to the futex
wait. Or just queue it up once you reap the wait completion, when that
is posted because we got woken.
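
For illustration, an (untested) sketch of the first option, using the
wait helper from the cover letter; fd and buf are placeholders for
whatever depends on the wake:

	/* the read only starts once the futex wait completes */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_futex_wait(sqe, futex, 0);
	sqe->flags |= IOSQE_IO_LINK;
	sqe->user_data = 1;

	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
	sqe->user_data = 2;

	io_uring_submit(&ring);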

> I know, it goes against the asynchronous nature of io_uring, but I think
> it might be a valid use case. Say we extend FUTEX_WAIT with a way to
> acquire the futex in kernel space.  Then, when the CQE returns, we know
> the lock is acquired.  If we can queue dependencies on that (stronger
> than the link semantics), we could queue operations to be executed once
> the lock is taken. Makes sense?

It does, and acquiring it _may_ make sense indeed. But I'd rather punt
that to a later thing, and focus on getting the standard (and smaller)
primitives done first.

>> Cancelations are supported, both from the application point-of-view,
>> and also to be able to cancel pending waits if the ring exits before
>> all events have occurred.
>>
>> This is just the barebones wait/wake support. Features to be added
>> later:
> 
> One item high on my wishlist would be the futexv semantics (wait on any
> of a set of futexes).  It cannot be implemented by issuing several
> FUTEX_WAIT.

Yep, I do think that one is interesting enough to consider upfront.
Unfortunately the internal implementation of that does not look that
great, though I'm sure we can make that work. But would likely require
some futexv refactoring to make it work. I can take a look at it.

You could obviously do futexv with this patchset, just posting N futex
waits and canceling N-1 when you get woken by one. Though that's of
course not very pretty or nice to use, but design wise it would totally
work as you don't actually block on these with io_uring.
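
As a rough (untested) sketch of that emulation, with the ring, the
futexes[] array, and N as placeholders:

	/* queue one wait per futex, tagged with its index */
	for (i = 0; i < N; i++) {
		sqe = io_uring_get_sqe(&ring);
		io_uring_prep_futex_wait(sqe, &futexes[i], 0);
		sqe->user_data = i;
	}
	io_uring_submit(&ring);

	/* wait for the first wake */
	io_uring_wait_cqe(&ring, &cqe);
	woken = cqe->user_data;
	io_uring_cqe_seen(&ring, cqe);

	/* cancel the N-1 others by matching user_data */
	for (i = 0; i < N; i++) {
		if (i == woken)
			continue;
		sqe = io_uring_get_sqe(&ring);
		io_uring_prep_cancel64(sqe, i, 0);
		sqe->user_data = N + i;
	}
	io_uring_submit(&ring);

You'd still have to reap the cancelation CQEs (and any wakes that raced
with the cancels), which is a big part of why it's not pretty.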

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 5/6] io_uring: add support for futex wake and wait
  2023-06-12 20:37     ` Jens Axboe
@ 2023-06-12 23:00       ` Gabriel Krisman Bertazi
  2023-06-13  1:09         ` Jens Axboe
  0 siblings, 1 reply; 16+ messages in thread
From: Gabriel Krisman Bertazi @ 2023-06-12 23:00 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, andres

Jens Axboe <[email protected]> writes:

> On 6/12/23 10:06 AM, Gabriel Krisman Bertazi wrote:
>> Jens Axboe <[email protected]> writes:
>> 
>>> Add support for FUTEX_WAKE/WAIT primitives.
>> 
>> This is great.  I was so sure io_uring had this support already for some
>> reason.  I might have dreamed it.
>
> I think you did :-)

Premonitory!  Still, there should be better things to dream about than
kernel code.

>> Even with an asynchronous model, it might make sense to halt execution
>> of further queued operations until futex completes.  I think
>> IOSQE_IO_DRAIN is a barrier only against the submission part, so it
>> wouldn't help.  Is there a way to ensure this ordering?
>
> You'd use link for that - link whatever depends on the wake to the futex
> wait. Or just queue it up once you reap the wait completion, when that
> is posted because we got woken.

The challenge of linked requests, in my opinion, is that once a link
chain starts, everything needs to be linked together, and a single error
fails everything, which is ok when operations are related, but
not so much when doing IO to different files from the same ring.

>>> Cancelations are supported, both from the application point-of-view,
>>> and also to be able to cancel pending waits if the ring exits before
>>> all events have occurred.
>>>
>>> This is just the barebones wait/wake support. Features to be added
>>> later:
>> 
>> One item high on my wishlist would be the futexv semantics (wait on any
>> of a set of futexes).  It cannot be implemented by issuing several
>> FUTEX_WAIT.
>
> Yep, I do think that one is interesting enough to consider upfront.
> Unfortunately the internal implementation of that does not look that
> great, though I'm sure we can make that work. But would likely
> require some futexv refactoring to make it work. I can take a look at
> it.

No disagreement here.  To be fair, the main challenge was making the new
interface compatible with a futex being waited on/waked via the original
interface. At some point, we had a really nice design for a single
object, but we spent two years bikeshedding over the interface and ended
up merging something pretty similar to the proposal from two years
prior.

> You could obviously do futexv with this patchset, just posting N futex
> waits and canceling N-1 when you get woken by one. Though that's of
> course not very pretty or nice to use, but design wise it would totally
> work as you don't actually block on these with io_uring.

Yes, but at that point, I guess it'd make more sense to implement the
same semantics by polling over a set of eventfds or having a single
futex and doing dispatch in userspace.

thanks,

-- 
Gabriel Krisman Bertazi

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 5/6] io_uring: add support for futex wake and wait
  2023-06-12 23:00       ` Gabriel Krisman Bertazi
@ 2023-06-13  1:09         ` Jens Axboe
  2023-06-13  2:55           ` io_uring link semantics (was [PATCH 5/6] io_uring: add support for futex wake and wait) Gabriel Krisman Bertazi
  0 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2023-06-13  1:09 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi; +Cc: io-uring, andres

On 6/12/23 5:00 PM, Gabriel Krisman Bertazi wrote:
> Jens Axboe <[email protected]> writes:
> 
>> On 6/12/23 10:06 AM, Gabriel Krisman Bertazi wrote:
>>> Jens Axboe <[email protected]> writes:
>>>
>>>> Add support for FUTEX_WAKE/WAIT primitives.
>>>
>>> This is great.  I was so sure io_uring had this support already for some
>>> reason.  I might have dreamed it.
>>
>> I think you did :-)
> 
> Premonitory!  Still, there should be better things to dream about than
> kernel code.

I dunno, if it's io_uring related I'm a supporter.

>>> Even with an asynchronous model, it might make sense to halt execution
>>> of further queued operations until futex completes.  I think
>>> IOSQE_IO_DRAIN is a barrier only against the submission part, so it
>>> wouldn't help.  Is there a way to ensure this ordering?
>>
>> You'd use link for that - link whatever depends on the wake to the futex
>> wait. Or just queue it up once you reap the wait completion, when that
>> is posted because we got woken.
> 
> The challenge of linked requests, in my opinion, is that once a link
> chain starts, everything needs to be linked together, and a single error
> fails everything, which is ok when operations are related, but
> not so much when doing IO to different files from the same ring.

Not quite sure if you're misunderstanding links, or just have a
different use case in mind. You can certainly have several independent
chains of links.

>>>> Cancelations are supported, both from the application point-of-view,
>>>> and also to be able to cancel pending waits if the ring exits before
>>>> all events have occurred.
>>>>
>>>> This is just the barebones wait/wake support. Features to be added
>>>> later:
>>>
>>> One item high on my wishlist would be the futexv semantics (wait on any
>>> of a set of futexes).  It cannot be implemented by issuing several
>>> FUTEX_WAIT.
>>
>> Yep, I do think that one is interesting enough to consider upfront.
>> Unfortunately the internal implementation of that does not look that
>> great, though I'm sure we can make that work. But would likely
>> require some futexv refactoring to make it work. I can take a look at
>> it.
> 
> No disagreement here.  To be fair, the main challenge was making the new
> interface compatible with a futex being waited on/waked via the original
> interface. At some point, we had a really nice design for a single
> object, but we spent two years bikeshedding over the interface and ended
> up merging something pretty similar to the proposal from two years
> prior.

It turned out not to be too bad - here's a poc:

https://git.kernel.dk/cgit/linux/commit/?h=io_uring-futex&id=421b12df4ed0bb25c53afe496370bc2b70b04e15

needs a bit of splitting and cleaning, notably I think I need to redo
the futex_q->wake_data bit to make that cleaner with the current use
case and the async use case. With that, everything can just use
futex_queue() and the only difference really is that the sync variants
will do timer setup upfront and then sleep at the bottom, where the
async part just calls the meat of the function.

>> You could obviously do futexv with this patchset, just posting N futex
>> waits and canceling N-1 when you get woken by one. Though that's of
>> course not very pretty or nice to use, but design wise it would totally
>> work as you don't actually block on these with io_uring.
> 
> Yes, but at that point, i guess it'd make more sense to implement the
> same semantics by polling over a set of eventfds or having a single
> futex and doing dispatch in userspace.

Oh yeah, would not recommend the above approach. Just saying that you
COULD do that if you really wanted to, which is not something you could
do with futex before waitv. But kind of moot now that there's at least a
prototype.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 16+ messages in thread

* io_uring link semantics (was [PATCH 5/6] io_uring: add support for futex wake and wait)
  2023-06-13  1:09         ` Jens Axboe
@ 2023-06-13  2:55           ` Gabriel Krisman Bertazi
  0 siblings, 0 replies; 16+ messages in thread
From: Gabriel Krisman Bertazi @ 2023-06-13  2:55 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, andres

Jens Axboe <[email protected]> writes:

> On 6/12/23 5:00 PM, Gabriel Krisman Bertazi wrote:

>>>> Even with an asynchronous model, it might make sense to halt execution
>>>> of further queued operations until futex completes.  I think
>>>> IOSQE_IO_DRAIN is a barrier only against the submission part, so it
>>>> wouldn't help.  Is there a way to ensure this ordering?
>>>
>>> You'd use link for that - link whatever depends on the wake to the futex
>>> wait. Or just queue it up once you reap the wait completion, when that
>>> is posted because we got woken.
>> 
>> The challenge of linked requests, in my opinion, is that once a link
>> chain starts, everything needs to be link together, and a single error
>> fails everything, which is ok when operations are related, but
>> not so much when doing IO to different files from the same ring.
>
> Not quite sure if you're misunderstanding links, or just have a
> different use case in mind. You can certainly have several independent
> chains of links.

I might be. Or my use case might be bogus. Please, correct me if either
is the case.

My understanding is that a link is a sequence of commands all carrying
the IOSQE_IO_LINK flag.  io_uring guarantees the ordering within the
link and, if a previous command fails, the rest of the link chain is
aborted.

But, if I have independent commands to submit in between, i.e. on a
different fd, I might want an intermediary operation to not be dependent
on the rest of the link without breaking the chain.  Most of the time I
know ahead of time the entire chain, and I can batch the operations
together.  But, I think it might be a problem specifically for some
unbounded commands like FUTEX_WAIT and recv.  I want a specific
operation to depend on a recv, but I might not be able to specify ahead
of time all of the dependent operations. I'd need to wait for a recv
command to complete and only then issue the dependency, to guarantee
ordering, or I make sure that everything I put on the ring in the
meantime is part of one big link submitted sequentially.

A related issue/example that comes to mind would be two recvs/sends
against the same socket.  When doing a syscall, I know the first recv
will return ahead of the second because it is, well, synchronous.  On
io_uring, I think it must be a link.  I might be recv'ing a huge stream
from the network, and I can't tell if the packet is done on a single
recv.  I could have to issue a second recv but I either make it linked
ahead of time, or I need to wait for the first recv to complete, to only
then submit the second one.  The problem is the ordering of recvs; from
my understanding of the code, I cannot assure the first recv will
complete before the second, without a link.
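
For concreteness, a rough (untested) sketch of the linked workaround I
mean; sockfd and the buffers are placeholders:

	/* the link forces the second recv to start only after the
	 * first completes, preserving stream ordering */
	sqe = io_uring_get_sqe(ring);
	io_uring_prep_recv(sqe, sockfd, buf1, sizeof(buf1), 0);
	sqe->flags |= IOSQE_IO_LINK;
	sqe->user_data = 1;

	sqe = io_uring_get_sqe(ring);
	io_uring_prep_recv(sqe, sockfd, buf2, sizeof(buf2), 0);
	sqe->user_data = 2;

	io_uring_submit(ring);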

Sorry if I'm wrong and there are ways around it, but it is a sticking
point for me at the moment with using io_uring.

-- 
Gabriel Krisman Bertazi

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 5/6] io_uring: add support for futex wake and wait
  2023-06-09 18:31 ` [PATCH 5/6] io_uring: add support for futex wake and wait Jens Axboe
  2023-06-12 16:06   ` Gabriel Krisman Bertazi
@ 2023-06-23 19:04   ` Andres Freund
  2023-06-23 19:07     ` Jens Axboe
  1 sibling, 1 reply; 16+ messages in thread
From: Andres Freund @ 2023-06-23 19:04 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring

Hi,

I'd been chatting with Jens about this, so obviously I'm interested in the
feature...

On 2023-06-09 12:31:24 -0600, Jens Axboe wrote:
> Add support for FUTEX_WAKE/WAIT primitives.
> 
> IORING_OP_FUTEX_WAKE is a mix of FUTEX_WAKE and FUTEX_WAKE_BITSET, as
> it does support passing in a bitset.
> 
> Similarly, IORING_OP_FUTEX_WAIT is a mix of FUTEX_WAIT and
> FUTEX_WAIT_BITSET.

One thing I was wondering about is what happens when there are multiple
OP_FUTEX_WAITs queued for the same futex, and that futex gets woken up. I
don't really have an opinion about what would be best, just that it'd be
helpful to specify the behaviour.

Greetings,

Andres Freund

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 5/6] io_uring: add support for futex wake and wait
  2023-06-23 19:04   ` [PATCH 5/6] io_uring: add support for futex wake and wait Andres Freund
@ 2023-06-23 19:07     ` Jens Axboe
  2023-06-23 19:34       ` Andres Freund
  0 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2023-06-23 19:07 UTC (permalink / raw)
  To: Andres Freund; +Cc: io-uring

On 6/23/23 1:04 PM, Andres Freund wrote:
> Hi,
> 
> I'd been chatting with Jens about this, so obviously I'm interested in the
> feature...
> 
> On 2023-06-09 12:31:24 -0600, Jens Axboe wrote:
>> Add support for FUTEX_WAKE/WAIT primitives.
>>
>> IORING_OP_FUTEX_WAKE is a mix of FUTEX_WAKE and FUTEX_WAKE_BITSET, as
>> it does support passing in a bitset.
>>
>> Similarly, IORING_OP_FUTEX_WAIT is a mix of FUTEX_WAIT and
>> FUTEX_WAIT_BITSET.
> 
> One thing I was wondering about is what happens when there are multiple
> OP_FUTEX_WAITs queued for the same futex, and that futex gets woken up. I
> don't really have an opinion about what would be best, just that it'd be
> helpful to specify the behaviour.

Not sure I follow the question, can you elaborate?

If you have N futex waits on the same futex and someone does a wake
(with wakenum >= N), then they'd all wake and post a CQE. If fewer are
woken because the caller asked for less than N, then that number should
be woken.

IOW, should have the same semantics as "normal" futex waits.

Or maybe I'm totally missing what is being asked here...

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 5/6] io_uring: add support for futex wake and wait
  2023-06-23 19:07     ` Jens Axboe
@ 2023-06-23 19:34       ` Andres Freund
  2023-06-23 19:46         ` Jens Axboe
  0 siblings, 1 reply; 16+ messages in thread
From: Andres Freund @ 2023-06-23 19:34 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring

Hi,

On 2023-06-23 13:07:12 -0600, Jens Axboe wrote:
> On 6/23/23 1:04 PM, Andres Freund wrote:
> > Hi,
> >
> > I'd been chatting with Jens about this, so obviously I'm interested in the
> > feature...
> >
> > On 2023-06-09 12:31:24 -0600, Jens Axboe wrote:
> >> Add support for FUTEX_WAKE/WAIT primitives.
> >>
> >> IORING_OP_FUTEX_WAKE is a mix of FUTEX_WAKE and FUTEX_WAKE_BITSET, as
> >> it does support passing in a bitset.
> >>
> >> Similarly, IORING_OP_FUTEX_WAIT is a mix of FUTEX_WAIT and
> >> FUTEX_WAIT_BITSET.
> >
> > One thing I was wondering about is what happens when there are multiple
> > OP_FUTEX_WAITs queued for the same futex, and that futex gets woken up. I
> > don't really have an opinion about what would be best, just that it'd be
> > helpful to specify the behaviour.
>
> Not sure I follow the question, can you elaborate?
>
> If you have N futex waits on the same futex and someone does a wait
> (with wakenum >= N), then they'd all wake and post a CQE. If less are
> woken because the caller asked for less than N, than that number should
> be woken.
>
> IOW, should have the same semantics as "normal" futex waits.

With a normal futex wait you can't wait multiple times on the same futex in
one thread. But with the proposed io_uring interface, one can.

Basically, what is the defined behaviour for:

   sqe = io_uring_get_sqe(ring);
   io_uring_prep_futex_wait(sqe, futex, 0, FUTEX_BITSET_MATCH_ANY);

   sqe = io_uring_get_sqe(ring);
   io_uring_prep_futex_wait(sqe, futex, 0, FUTEX_BITSET_MATCH_ANY);

   io_uring_submit(ring)

when someone does:
   futex(FUTEX_WAKE, futex, 1, 0, 0, 0);
   or
   futex(FUTEX_WAKE, futex, INT_MAX, 0, 0, 0);

or the equivalent io_uring operation.

Is it an error? Will there always be two cqes queued? Will it depend on the
number of wakeups specified by the waker?  I'd assume the latter, but it'd be
good to specify that.

Greetings,

Andres Freund

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 5/6] io_uring: add support for futex wake and wait
  2023-06-23 19:34       ` Andres Freund
@ 2023-06-23 19:46         ` Jens Axboe
  0 siblings, 0 replies; 16+ messages in thread
From: Jens Axboe @ 2023-06-23 19:46 UTC (permalink / raw)
  To: Andres Freund; +Cc: io-uring

On 6/23/23 1:34 PM, Andres Freund wrote:
> Hi,
> 
> On 2023-06-23 13:07:12 -0600, Jens Axboe wrote:
>> On 6/23/23 1:04 PM, Andres Freund wrote:
>>> Hi,
>>>
>>> I'd been chatting with Jens about this, so obviously I'm interested in the
>>> feature...
>>>
>>> On 2023-06-09 12:31:24 -0600, Jens Axboe wrote:
>>>> Add support for FUTEX_WAKE/WAIT primitives.
>>>>
>>>> IORING_OP_FUTEX_WAKE is a mix of FUTEX_WAKE and FUTEX_WAKE_BITSET, as
>>>> it does support passing in a bitset.
>>>>
>>>> Similarly, IORING_OP_FUTEX_WAIT is a mix of FUTEX_WAIT and
>>>> FUTEX_WAIT_BITSET.
>>>
>>> One thing I was wondering about is what happens when there are multiple
>>> OP_FUTEX_WAITs queued for the same futex, and that futex gets woken up. I
>>> don't really have an opinion about what would be best, just that it'd be
>>> helpful to specify the behaviour.
>>
>> Not sure I follow the question, can you elaborate?
>>
>> If you have N futex waits on the same futex and someone does a wake
>> (with wakenum >= N), then they'd all wake and post a CQE. If fewer are
>> woken because the caller asked for less than N, then that number should
>> be woken.
>>
>> IOW, should have the same semantics as "normal" futex waits.
> 
> With a normal futex wait you can't wait multiple times on the same futex in
> one thread. But with the proposed io_uring interface, one can.

Right, but you could have N threads waiting on the same futex.

> Basically, what is the defined behaviour for:
> 
>    sqe = io_uring_get_sqe(ring);
>    io_uring_prep_futex_wait(sqe, futex, 0, FUTEX_BITSET_MATCH_ANY);
> 
>    sqe = io_uring_get_sqe(ring);
>    io_uring_prep_futex_wait(sqe, futex, 0, FUTEX_BITSET_MATCH_ANY);
> 
>    io_uring_submit(ring)
> 
> when someone does:
>    futex(FUTEX_WAKE, futex, 1, 0, 0, 0);
>    or
>    futex(FUTEX_WAKE, futex, INT_MAX, 0, 0, 0);
> 
> or the equivalent io_uring operation.

Waking with num=1 should wake just one of them; which one is really down
to the futex ordering, which depends on task priority (which would be the
same here) and queue order after that. So the first wake should wake the
first sqe queued.

Second one will wake all of them, in that order.

I'll put that in the test case.

> Is it an error? Will there always be two cqes queued? Will it depend
> on the number of wakeups specified by the waker?  I'd assume the
> latter, but it'd be good to specify that.

It's not an error, I would not want to police that. It will purely
depend on the number of wakes specified by the wake operation. If it's
1, one will be triggered. If it's INT_MAX, then both of them will
trigger. The first case will generate one CQE, the second will generate
both CQEs.
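
As a rough (untested) sketch against your example above, using the raw
syscall since glibc doesn't wrap futex:

	/* wake everything; with the two waits queued, both CQEs post */
	syscall(SYS_futex, futex, FUTEX_WAKE, INT_MAX, NULL, NULL, 0);

	for (i = 0; i < 2; i++) {
		io_uring_wait_cqe(ring, &cqe);
		printf("woken user_data=%llu res=%d\n",
		       (unsigned long long) cqe->user_data, cqe->res);
		io_uring_cqe_seen(ring, cqe);
	}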

No documentation has been written for the io_uring bits yet. But the
current branch is ready for wider posting, so I should get that written
up too...

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2023-06-23 19:47 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-06-09 18:31 [PATCHSET RFC 0/6] Add io_uring support for futex wait/wake Jens Axboe
2023-06-09 18:31 ` [PATCH 1/6] futex: abstract out futex_op_to_flags() helper Jens Axboe
2023-06-09 18:31 ` [PATCH 2/6] futex: factor out the futex wake handling Jens Axboe
2023-06-09 18:31 ` [PATCH 3/6] futex: assign default futex_q->wake_data at insertion time Jens Axboe
2023-06-09 18:31 ` [PATCH 4/6] futex: add futex wait variant that takes a futex_q directly Jens Axboe
2023-06-09 18:31 ` [PATCH 5/6] io_uring: add support for futex wake and wait Jens Axboe
2023-06-12 16:06   ` Gabriel Krisman Bertazi
2023-06-12 20:37     ` Jens Axboe
2023-06-12 23:00       ` Gabriel Krisman Bertazi
2023-06-13  1:09         ` Jens Axboe
2023-06-13  2:55           ` io_uring link semantics (was [PATCH 5/6] io_uring: add support for futex wake and wait) Gabriel Krisman Bertazi
2023-06-23 19:04   ` [PATCH 5/6] io_uring: add support for futex wake and wait Andres Freund
2023-06-23 19:07     ` Jens Axboe
2023-06-23 19:34       ` Andres Freund
2023-06-23 19:46         ` Jens Axboe
2023-06-09 18:31 ` [PATCH 6/6] io_uring/futex: enable use of the allocation caches for futex_q Jens Axboe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox