* [PATCH v4 0/3] io_uring: avoid ring quiesce in io_uring_register for eventfd opcodes
From: Usama Arif @ 2022-02-03 18:24 UTC
  To: io-uring, axboe, asml.silence, linux-kernel; +Cc: fam.zheng, Usama Arif

This is done by creating a new RCU-protected data structure (io_ev_fd) as
part of io_ring_ctx that holds the eventfd_ctx, with reads to the structure
protected by rcu_read_lock and writes (register/unregister calls) protected
by a mutex.

With the above approach, ring quiesce, which is much more expensive than
taking an RCU read lock, can be avoided. On the system tested,
io_uring_register with IORING_REGISTER_EVENTFD takes less than 1ms using
RCU, compared to 15ms before with ring quiesce.
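
For reference, the operation being timed above is just the eventfd
registration opcode of io_uring_register(2). A minimal userspace sketch of
it (illustrative only, not part of this series; assumes liburing):

#include <liburing.h>
#include <sys/eventfd.h>
#include <stdio.h>

int main(void)
{
	struct io_uring ring;
	int efd, ret;

	if (io_uring_queue_init(8, &ring, 0) < 0)
		return 1;
	efd = eventfd(0, 0);

	/* issues io_uring_register(IORING_REGISTER_EVENTFD); this is the
	 * call that previously required a full ring quiesce */
	ret = io_uring_register_eventfd(&ring, efd);
	if (ret < 0)
		fprintf(stderr, "register failed: %d\n", ret);

	io_uring_unregister_eventfd(&ring);
	io_uring_queue_exit(&ring);
	return 0;
}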

The first patch creates the RCU protected data structure and removes ring
quiesce for IORING_REGISTER_EVENTFD and IORING_UNREGISTER_EVENTFD.

The second patch builds on top of the first patch and removes ring quiesce
for IORING_REGISTER_EVENTFD_ASYNC.

---
v3->v4:
- Switch back to call_rcu, and use rcu_barrier in case io_eventfd_register
fails to make sure all RCU callbacks have finished.

v2->v3:
- Switched from call_rcu to synchronize_rcu in io_eventfd_unregister.

v1->v2:
- Added patch to remove eventfd from tracepoint (Patch 1) (Jens Axboe)
- Folded the code of io_should_trigger_evfd into io_eventfd_signal (Jens Axboe)

Usama Arif (3):
  io_uring: remove trace for eventfd
  io_uring: avoid ring quiesce while registering/unregistering eventfd
  io_uring: avoid ring quiesce for IORING_REGISTER_EVENTFD_ASYNC

 fs/io_uring.c                   | 127 +++++++++++++++++++++++---------
 include/trace/events/io_uring.h |  13 ++--
 2 files changed, 96 insertions(+), 44 deletions(-)

-- 
2.25.1



* [PATCH v4 1/3] io_uring: remove trace for eventfd
From: Usama Arif @ 2022-02-03 18:24 UTC
  To: io-uring, axboe, asml.silence, linux-kernel; +Cc: fam.zheng, Usama Arif

The information on whether an eventfd is registered is not very useful,
and keeping it would force the tracepoint to be enclosed in an
rcu_read_lock in a later patch that avoids ring quiesce for registering
eventfd.

Suggested-by: Jens Axboe <[email protected]>
Signed-off-by: Usama Arif <[email protected]>
---
 fs/io_uring.c                   |  3 +--
 include/trace/events/io_uring.h | 13 +++++--------
 2 files changed, 6 insertions(+), 10 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 2e04f718319d..21531609a9c6 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -11171,8 +11171,7 @@ SYSCALL_DEFINE4(io_uring_register, unsigned int, fd, unsigned int, opcode,
 	mutex_lock(&ctx->uring_lock);
 	ret = __io_uring_register(ctx, opcode, arg, nr_args);
 	mutex_unlock(&ctx->uring_lock);
-	trace_io_uring_register(ctx, opcode, ctx->nr_user_files, ctx->nr_user_bufs,
-							ctx->cq_ev_fd != NULL, ret);
+	trace_io_uring_register(ctx, opcode, ctx->nr_user_files, ctx->nr_user_bufs, ret);
 out_fput:
 	fdput(f);
 	return ret;
diff --git a/include/trace/events/io_uring.h b/include/trace/events/io_uring.h
index 7346f0164cf4..098beda7601a 100644
--- a/include/trace/events/io_uring.h
+++ b/include/trace/events/io_uring.h
@@ -57,10 +57,9 @@ TRACE_EVENT(io_uring_create,
  * @opcode:		describes which operation to perform
  * @nr_user_files:	number of registered files
  * @nr_user_bufs:	number of registered buffers
- * @cq_ev_fd:		whether eventfs registered or not
  * @ret:		return code
  *
- * Allows to trace fixed files/buffers/eventfds, that could be registered to
+ * Allows to trace fixed files/buffers, that could be registered to
  * avoid an overhead of getting references to them for every operation. This
  * event, together with io_uring_file_get, can provide a full picture of how
  * much overhead one can reduce via fixing.
@@ -68,16 +67,15 @@ TRACE_EVENT(io_uring_create,
 TRACE_EVENT(io_uring_register,
 
 	TP_PROTO(void *ctx, unsigned opcode, unsigned nr_files,
-			 unsigned nr_bufs, bool eventfd, long ret),
+			 unsigned nr_bufs, long ret),
 
-	TP_ARGS(ctx, opcode, nr_files, nr_bufs, eventfd, ret),
+	TP_ARGS(ctx, opcode, nr_files, nr_bufs, ret),
 
 	TP_STRUCT__entry (
 		__field(  void *,	ctx			)
 		__field(  unsigned,	opcode		)
 		__field(  unsigned,	nr_files	)
 		__field(  unsigned,	nr_bufs		)
-		__field(  bool,		eventfd		)
 		__field(  long,		ret			)
 	),
 
@@ -86,14 +84,13 @@ TRACE_EVENT(io_uring_register,
 		__entry->opcode		= opcode;
 		__entry->nr_files	= nr_files;
 		__entry->nr_bufs	= nr_bufs;
-		__entry->eventfd	= eventfd;
 		__entry->ret		= ret;
 	),
 
 	TP_printk("ring %p, opcode %d, nr_user_files %d, nr_user_bufs %d, "
-			  "eventfd %d, ret %ld",
+			  "ret %ld",
 			  __entry->ctx, __entry->opcode, __entry->nr_files,
-			  __entry->nr_bufs, __entry->eventfd, __entry->ret)
+			  __entry->nr_bufs, __entry->ret)
 );
 
 /**
-- 
2.25.1



* [PATCH v4 2/3] io_uring: avoid ring quiesce while registering/unregistering eventfd
From: Usama Arif @ 2022-02-03 18:24 UTC
  To: io-uring, axboe, asml.silence, linux-kernel; +Cc: fam.zheng, Usama Arif

This is done by creating a new RCU-protected data structure (io_ev_fd) as
part of io_ring_ctx that holds the eventfd_ctx.

The function io_eventfd_signal runs under rcu_read_lock, with a single
rcu_dereference of io_ev_fd, so that if another thread unregisters the
eventfd while io_eventfd_signal is still executing, the eventfd_signal it
has already started still completes successfully.

Registering and unregistering the eventfd is done under a mutex, so
concurrent register/unregister calls cannot race with each other.
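
In outline, this is the classic RCU publish/read pattern (a simplified,
illustrative sketch using the same names as the patch, not the exact
hunks below):

/* reader side: never blocks, races benignly with unregister */
rcu_read_lock();
ev_fd = rcu_dereference(ctx->io_ev_fd);
if (ev_fd)
	eventfd_signal(ev_fd->cq_ev_fd, 1);
rcu_read_unlock();

/* writer side: register/unregister serialised by ev_fd_lock */
mutex_lock(&ctx->ev_fd_lock);
ev_fd = rcu_dereference_protected(ctx->io_ev_fd,
				  lockdep_is_held(&ctx->ev_fd_lock));
if (ev_fd)
	call_rcu(&ev_fd->rcu, io_eventfd_put);	/* freed after readers drain */
mutex_unlock(&ctx->ev_fd_lock);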

With the above approach, ring quiesce, which is much more expensive than
taking an RCU read lock, can be avoided. On the system tested,
io_uring_register with IORING_REGISTER_EVENTFD takes less than 1ms using
RCU, compared to 15ms before with ring quiesce.

Signed-off-by: Usama Arif <[email protected]>
---
 fs/io_uring.c | 104 ++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 79 insertions(+), 25 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 21531609a9c6..7a8f4ac7a785 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -326,6 +326,12 @@ struct io_submit_state {
 	struct blk_plug		plug;
 };
 
+struct io_ev_fd {
+	struct eventfd_ctx	*cq_ev_fd;
+	struct io_ring_ctx	*ctx;
+	struct rcu_head		rcu;
+};
+
 struct io_ring_ctx {
 	/* const or read-mostly hot data */
 	struct {
@@ -399,7 +405,8 @@ struct io_ring_ctx {
 	struct {
 		unsigned		cached_cq_tail;
 		unsigned		cq_entries;
-		struct eventfd_ctx	*cq_ev_fd;
+		struct io_ev_fd	__rcu	*io_ev_fd;
+		struct mutex		ev_fd_lock;
 		struct wait_queue_head	cq_wait;
 		unsigned		cq_extra;
 		atomic_t		cq_timeouts;
@@ -1448,6 +1455,7 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 	xa_init_flags(&ctx->io_buffers, XA_FLAGS_ALLOC1);
 	xa_init_flags(&ctx->personalities, XA_FLAGS_ALLOC1);
 	mutex_init(&ctx->uring_lock);
+	mutex_init(&ctx->ev_fd_lock);
 	init_waitqueue_head(&ctx->cq_wait);
 	spin_lock_init(&ctx->completion_lock);
 	spin_lock_init(&ctx->timeout_lock);
@@ -1726,13 +1734,24 @@ static inline struct io_uring_cqe *io_get_cqe(struct io_ring_ctx *ctx)
 	return &rings->cqes[tail & mask];
 }
 
-static inline bool io_should_trigger_evfd(struct io_ring_ctx *ctx)
+static void io_eventfd_signal(struct io_ring_ctx *ctx)
 {
-	if (likely(!ctx->cq_ev_fd))
-		return false;
+	struct io_ev_fd *ev_fd;
+
+	rcu_read_lock();
+	/* rcu_dereference ctx->io_ev_fd once and use it both for checking and eventfd_signal */
+	ev_fd = rcu_dereference(ctx->io_ev_fd);
+
+	if (likely(!ev_fd))
+		goto out;
 	if (READ_ONCE(ctx->rings->cq_flags) & IORING_CQ_EVENTFD_DISABLED)
-		return false;
-	return !ctx->eventfd_async || io_wq_current_is_worker();
+		goto out;
+
+	if (!ctx->eventfd_async || io_wq_current_is_worker())
+		eventfd_signal(ev_fd->cq_ev_fd, 1);
+
+out:
+	rcu_read_unlock();
 }
 
 /*
@@ -1751,8 +1770,7 @@ static void io_cqring_ev_posted(struct io_ring_ctx *ctx)
 	 */
 	if (wq_has_sleeper(&ctx->cq_wait))
 		wake_up_all(&ctx->cq_wait);
-	if (io_should_trigger_evfd(ctx))
-		eventfd_signal(ctx->cq_ev_fd, 1);
+	io_eventfd_signal(ctx);
 }
 
 static void io_cqring_ev_posted_iopoll(struct io_ring_ctx *ctx)
@@ -1764,8 +1782,7 @@ static void io_cqring_ev_posted_iopoll(struct io_ring_ctx *ctx)
 		if (waitqueue_active(&ctx->cq_wait))
 			wake_up_all(&ctx->cq_wait);
 	}
-	if (io_should_trigger_evfd(ctx))
-		eventfd_signal(ctx->cq_ev_fd, 1);
+	io_eventfd_signal(ctx);
 }
 
 /* Returns true if there are no backlogged entries after the flush */
@@ -9353,35 +9370,70 @@ static int __io_sqe_buffers_update(struct io_ring_ctx *ctx,
 
 static int io_eventfd_register(struct io_ring_ctx *ctx, void __user *arg)
 {
+	struct io_ev_fd *ev_fd;
 	__s32 __user *fds = arg;
-	int fd;
+	int fd, ret;
 
-	if (ctx->cq_ev_fd)
-		return -EBUSY;
+	mutex_lock(&ctx->ev_fd_lock);
+	ret = -EBUSY;
+	if (rcu_dereference_protected(ctx->io_ev_fd, lockdep_is_held(&ctx->ev_fd_lock))) {
+		rcu_barrier();
+		if (rcu_dereference_protected(ctx->io_ev_fd, lockdep_is_held(&ctx->ev_fd_lock)))
+			goto out;
+	}
 
+	ret = -EFAULT;
 	if (copy_from_user(&fd, fds, sizeof(*fds)))
-		return -EFAULT;
+		goto out;
 
-	ctx->cq_ev_fd = eventfd_ctx_fdget(fd);
-	if (IS_ERR(ctx->cq_ev_fd)) {
-		int ret = PTR_ERR(ctx->cq_ev_fd);
+	ret = -ENOMEM;
+	ev_fd = kmalloc(sizeof(*ev_fd), GFP_KERNEL);
+	if (!ev_fd)
+		goto out;
 
-		ctx->cq_ev_fd = NULL;
-		return ret;
+	ev_fd->cq_ev_fd = eventfd_ctx_fdget(fd);
+	if (IS_ERR(ev_fd->cq_ev_fd)) {
+		ret = PTR_ERR(ev_fd->cq_ev_fd);
+		kfree(ev_fd);
+		goto out;
 	}
+	ev_fd->ctx = ctx;
 
-	return 0;
+	rcu_assign_pointer(ctx->io_ev_fd, ev_fd);
+	ret = 0;
+
+out:
+	mutex_unlock(&ctx->ev_fd_lock);
+	return ret;
+}
+
+static void io_eventfd_put(struct rcu_head *rcu)
+{
+	struct io_ev_fd *ev_fd = container_of(rcu, struct io_ev_fd, rcu);
+	struct io_ring_ctx *ctx = ev_fd->ctx;
+
+	eventfd_ctx_put(ev_fd->cq_ev_fd);
+	kfree(ev_fd);
+	rcu_assign_pointer(ctx->io_ev_fd, NULL);
 }
 
 static int io_eventfd_unregister(struct io_ring_ctx *ctx)
 {
-	if (ctx->cq_ev_fd) {
-		eventfd_ctx_put(ctx->cq_ev_fd);
-		ctx->cq_ev_fd = NULL;
-		return 0;
+	struct io_ev_fd *ev_fd;
+	int ret;
+
+	mutex_lock(&ctx->ev_fd_lock);
+	ev_fd = rcu_dereference_protected(ctx->io_ev_fd, lockdep_is_held(&ctx->ev_fd_lock));
+	if (ev_fd) {
+		call_rcu(&ev_fd->rcu, io_eventfd_put);
+		ret = 0;
+		goto out;
 	}
+	ret = -ENXIO;
 
-	return -ENXIO;
+out:
+	mutex_unlock(&ctx->ev_fd_lock);
+	return ret;
 }
 
 static void io_destroy_buffers(struct io_ring_ctx *ctx)
@@ -10960,6 +11012,8 @@ static bool io_register_op_must_quiesce(int op)
 	case IORING_REGISTER_FILES:
 	case IORING_UNREGISTER_FILES:
 	case IORING_REGISTER_FILES_UPDATE:
+	case IORING_REGISTER_EVENTFD:
+	case IORING_UNREGISTER_EVENTFD:
 	case IORING_REGISTER_PROBE:
 	case IORING_REGISTER_PERSONALITY:
 	case IORING_UNREGISTER_PERSONALITY:
-- 
2.25.1



* [PATCH v4 3/3] io_uring: avoid ring quiesce for IORING_REGISTER_EVENTFD_ASYNC
From: Usama Arif @ 2022-02-03 18:24 UTC
  To: io-uring, axboe, asml.silence, linux-kernel; +Cc: fam.zheng, Usama Arif

This is done using the RCU-protected data structure (io_ev_fd).
eventfd_async is moved from io_ring_ctx to io_ev_fd, which is RCU
protected, hence avoiding ring quiesce, which is much more expensive than
an RCU lock. The eventfd_async check in io_eventfd_signal already runs
under rcu_read_lock, so no extra RCU read-side critical section is
needed.
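
From userspace the async variant is selected purely by opcode. A minimal
sketch of the raw call (illustrative; ring_fd is assumed to be an already
set up io_uring fd, and liburing's io_uring_register_eventfd_async()
wraps the same thing):

#include <sys/eventfd.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/io_uring.h>

int efd = eventfd(0, 0);
/* nr_args must be 1, arg points at the eventfd's fd */
int ret = syscall(__NR_io_uring_register, ring_fd,
		  IORING_REGISTER_EVENTFD_ASYNC, &efd, 1);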

Signed-off-by: Usama Arif <[email protected]>
---
 fs/io_uring.c | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 7a8f4ac7a785..e287fc61879f 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -329,6 +329,7 @@ struct io_submit_state {
 struct io_ev_fd {
 	struct eventfd_ctx	*cq_ev_fd;
 	struct io_ring_ctx	*ctx;
+	unsigned int		eventfd_async: 1;
 	struct rcu_head		rcu;
 };
 
@@ -341,7 +342,6 @@ struct io_ring_ctx {
 		unsigned int		flags;
 		unsigned int		compat: 1;
 		unsigned int		drain_next: 1;
-		unsigned int		eventfd_async: 1;
 		unsigned int		restricted: 1;
 		unsigned int		off_timeout_used: 1;
 		unsigned int		drain_active: 1;
@@ -1747,7 +1747,7 @@ static void io_eventfd_signal(struct io_ring_ctx *ctx)
 	if (READ_ONCE(ctx->rings->cq_flags) & IORING_CQ_EVENTFD_DISABLED)
 		goto out;
 
-	if (!ctx->eventfd_async || io_wq_current_is_worker())
+	if (!ev_fd->eventfd_async || io_wq_current_is_worker())
 		eventfd_signal(ev_fd->cq_ev_fd, 1);
 
 out:
@@ -9368,7 +9368,8 @@ static int __io_sqe_buffers_update(struct io_ring_ctx *ctx,
 	return done ? done : err;
 }
 
-static int io_eventfd_register(struct io_ring_ctx *ctx, void __user *arg)
+static int io_eventfd_register(struct io_ring_ctx *ctx, void __user *arg,
+			       unsigned int eventfd_async)
 {
 	struct io_ev_fd *ev_fd;
 	__s32 __user *fds = arg;
@@ -9398,6 +9399,7 @@ static int io_eventfd_register(struct io_ring_ctx *ctx, void __user *arg)
 		goto out;
 	}
 	ev_fd->ctx = ctx;
+	ev_fd->eventfd_async = eventfd_async;
 
 	rcu_assign_pointer(ctx->io_ev_fd, ev_fd);
 	ret = 0;
@@ -11013,6 +11015,7 @@ static bool io_register_op_must_quiesce(int op)
 	case IORING_UNREGISTER_FILES:
 	case IORING_REGISTER_FILES_UPDATE:
 	case IORING_REGISTER_EVENTFD:
+	case IORING_REGISTER_EVENTFD_ASYNC:
 	case IORING_UNREGISTER_EVENTFD:
 	case IORING_REGISTER_PROBE:
 	case IORING_REGISTER_PERSONALITY:
@@ -11113,17 +11116,16 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
 		ret = io_register_files_update(ctx, arg, nr_args);
 		break;
 	case IORING_REGISTER_EVENTFD:
-	case IORING_REGISTER_EVENTFD_ASYNC:
 		ret = -EINVAL;
 		if (nr_args != 1)
 			break;
-		ret = io_eventfd_register(ctx, arg);
-		if (ret)
+		ret = io_eventfd_register(ctx, arg, 0);
+		break;
+	case IORING_REGISTER_EVENTFD_ASYNC:
+		ret = -EINVAL;
+		if (nr_args != 1)
 			break;
-		if (opcode == IORING_REGISTER_EVENTFD_ASYNC)
-			ctx->eventfd_async = 1;
-		else
-			ctx->eventfd_async = 0;
+		ret = io_eventfd_register(ctx, arg, 1);
 		break;
 	case IORING_UNREGISTER_EVENTFD:
 		ret = -EINVAL;
-- 
2.25.1



* Re: [PATCH v4 2/3] io_uring: avoid ring quiesce while registering/unregistering eventfd
From: Jens Axboe @ 2022-02-03 18:49 UTC
  To: Usama Arif, io-uring, asml.silence, linux-kernel; +Cc: fam.zheng

On 2/3/22 11:24 AM, Usama Arif wrote:
> -static inline bool io_should_trigger_evfd(struct io_ring_ctx *ctx)
> +static void io_eventfd_signal(struct io_ring_ctx *ctx)
>  {
> -	if (likely(!ctx->cq_ev_fd))
> -		return false;
> +	struct io_ev_fd *ev_fd;
> +
> +	rcu_read_lock();
> +	/* rcu_dereference ctx->io_ev_fd once and use it both for checking and eventfd_signal */
> +	ev_fd = rcu_dereference(ctx->io_ev_fd);
> +
> +	if (likely(!ev_fd))
> +		goto out;
>  	if (READ_ONCE(ctx->rings->cq_flags) & IORING_CQ_EVENTFD_DISABLED)
> -		return false;
> -	return !ctx->eventfd_async || io_wq_current_is_worker();
> +		goto out;
> +
> +	if (!ctx->eventfd_async || io_wq_current_is_worker())
> +		eventfd_signal(ev_fd->cq_ev_fd, 1);
> +
> +out:
> +	rcu_read_unlock();
>  }

This still needs what we discussed in v3, something ala:

/*
 * This will potentially race with eventfd registration, but that's
 * always going to be the case if there is IO inflight while an eventfd
 * descriptor is being registered.
 */
if (!rcu_dereference_raw(ctx->io_ev_fd))
	return;

rcu_read_lock();
...

which I think is cheap enough and won't hit sparse complaints.

> @@ -9353,35 +9370,70 @@ static int __io_sqe_buffers_update(struct io_ring_ctx *ctx,
>  
>  static int io_eventfd_register(struct io_ring_ctx *ctx, void __user *arg)
>  {
> +	struct io_ev_fd *ev_fd;
>  	__s32 __user *fds = arg;
> -	int fd;
> +	int fd, ret;
>  
> -	if (ctx->cq_ev_fd)
> -		return -EBUSY;
> +	mutex_lock(&ctx->ev_fd_lock);
> +	ret = -EBUSY;
> +	if (rcu_dereference_protected(ctx->io_ev_fd, lockdep_is_held(&ctx->ev_fd_lock))) {
> +		rcu_barrier();
> +		if (rcu_dereference_protected(ctx->io_ev_fd, lockdep_is_held(&ctx->ev_fd_lock)))
> +			goto out;
> +	}

I wonder if we can get away with assigning ctx->io_ev_fd to NULL when we
do the call_rcu(). The struct itself will remain valid as long as we're
under rcu_read_lock() protection, so I think we'd be fine? If we do
that, then we don't need any rcu_barrier() or synchronize_rcu() calls,
as we can register a new one while the previous one is still being
killed.

Hmm?

>  static int io_eventfd_unregister(struct io_ring_ctx *ctx)
>  {
> -	if (ctx->cq_ev_fd) {
> -		eventfd_ctx_put(ctx->cq_ev_fd);
> -		ctx->cq_ev_fd = NULL;
> -		return 0;
> +	struct io_ev_fd *ev_fd;
> +	int ret;
> +
> +	mutex_lock(&ctx->ev_fd_lock);
> +	ev_fd = rcu_dereference_protected(ctx->io_ev_fd, lockdep_is_held(&ctx->ev_fd_lock));
> +	if (ev_fd) {
> +		call_rcu(&ev_fd->rcu, io_eventfd_put);
> +		ret = 0;
> +		goto out;
>  	}
> +	ret = -ENXIO;
>  
> -	return -ENXIO;
> +out:
> +	mutex_unlock(&ctx->ev_fd_lock);
> +	return ret;
>  }

I also think that'd be cleaner without the goto:

{
	struct io_ev_fd *ev_fd;
	int ret;

	mutex_lock(&ctx->ev_fd_lock);
	ev_fd = rcu_dereference_protected(ctx->io_ev_fd,
					lockdep_is_held(&ctx->ev_fd_lock));
	if (ev_fd) {
		call_rcu(&ev_fd->rcu, io_eventfd_put);
		mutex_unlock(&ctx->ev_fd_lock);
		return 0;
	}

	mutex_unlock(&ctx->ev_fd_lock);
	return -ENXIO;
}

-- 
Jens Axboe



* Re: [External] Re: [PATCH v4 2/3] io_uring: avoid ring quiesce while registering/unregistering eventfd
From: Usama Arif @ 2022-02-03 19:05 UTC
  To: Jens Axboe, io-uring, asml.silence, linux-kernel; +Cc: fam.zheng



On 03/02/2022 18:49, Jens Axboe wrote:
> On 2/3/22 11:24 AM, Usama Arif wrote:
>> -static inline bool io_should_trigger_evfd(struct io_ring_ctx *ctx)
>> +static void io_eventfd_signal(struct io_ring_ctx *ctx)
>>   {
>> -	if (likely(!ctx->cq_ev_fd))
>> -		return false;
>> +	struct io_ev_fd *ev_fd;
>> +
>> +	rcu_read_lock();
>> +	/* rcu_dereference ctx->io_ev_fd once and use it both for checking and eventfd_signal */
>> +	ev_fd = rcu_dereference(ctx->io_ev_fd);
>> +
>> +	if (likely(!ev_fd))
>> +		goto out;
>>   	if (READ_ONCE(ctx->rings->cq_flags) & IORING_CQ_EVENTFD_DISABLED)
>> -		return false;
>> -	return !ctx->eventfd_async || io_wq_current_is_worker();
>> +		goto out;
>> +
>> +	if (!ctx->eventfd_async || io_wq_current_is_worker())
>> +		eventfd_signal(ev_fd->cq_ev_fd, 1);
>> +
>> +out:
>> +	rcu_read_unlock();
>>   }
> 
> This still needs what we discussed in v3, something ala:
> 
> /*
>   * This will potentially race with eventfd registration, but that's
>   * always going to be the case if there is IO inflight while an eventfd
>   * descriptor is being registered.
>   */
> if (!rcu_dereference_raw(ctx->io_ev_fd))
> 	return;
> 
> rcu_read_lock();

Hmm, so I am not so worried about the registration, but actually
worried about unregistration.
If, after the check and before the rcu_read_lock, the eventfd is
unregistered, won't we get a NULL pointer exception at
eventfd_signal(ev_fd->cq_ev_fd, 1)?

I guess checking for NULL twice would work, so something like this is
ok then?

static void io_eventfd_signal(struct io_ring_ctx *ctx)
{
	struct io_ev_fd *ev_fd;

	/* Return quickly if ctx->io_ev_fd doesn't exist */
	if (likely(!rcu_dereference_raw(ctx->io_ev_fd)))
		return;

	rcu_read_lock();
	/*
	 * rcu_dereference ctx->io_ev_fd once and use it both for checking
	 * and eventfd_signal
	 */
	ev_fd = rcu_dereference(ctx->io_ev_fd);

	/*
	 * Check again if ev_fd exists in case an io_eventfd_unregister call
	 * completed between the NULL check of ctx->io_ev_fd at the start of
	 * the function and rcu_read_lock.
	 */
	if (unlikely(!ev_fd))
		goto out;
	if (READ_ONCE(ctx->rings->cq_flags) & IORING_CQ_EVENTFD_DISABLED)
		goto out;

	if (!ev_fd->eventfd_async || io_wq_current_is_worker())
		eventfd_signal(ev_fd->cq_ev_fd, 1);

out:
	rcu_read_unlock();
}


> ...
> 
> which I think is cheap enough and won't hit sparse complaints.
> 
>> @@ -9353,35 +9370,70 @@ static int __io_sqe_buffers_update(struct io_ring_ctx *ctx,
>>   
>>   static int io_eventfd_register(struct io_ring_ctx *ctx, void __user *arg)
>>   {
>> +	struct io_ev_fd *ev_fd;
>>   	__s32 __user *fds = arg;
>> -	int fd;
>> +	int fd, ret;
>>   
>> -	if (ctx->cq_ev_fd)
>> -		return -EBUSY;
>> +	mutex_lock(&ctx->ev_fd_lock);
>> +	ret = -EBUSY;
>> +	if (rcu_dereference_protected(ctx->io_ev_fd, lockdep_is_held(&ctx->ev_fd_lock))) {
>> +		rcu_barrier();
>> +		if (rcu_dereference_protected(ctx->io_ev_fd, lockdep_is_held(&ctx->ev_fd_lock)))
>> +			goto out;
>> +	}
> 
> I wonder if we can get away with assigning ctx->io_ev_fd to NULL when we
> do the call_rcu(). The struct itself will remain valid as long as we're
> under rcu_read_lock() protection, so I think we'd be fine? If we do
> that, then we don't need any rcu_barrier() or synchronize_rcu() calls,
> as we can register a new one while the previous one is still being
> killed.
> 
> Hmm?
> 

We would have to remove the check that ctx->io_ev_fd != NULL. That
would also mean that 2 successive calls to io_eventfd_register, without
any unregister in between, would both succeed, which I don't think is
the right behaviour?

I think the likelihood of hitting the rcu_barrier itself is quite low,
so the cost is probably low as well.

>>   static int io_eventfd_unregister(struct io_ring_ctx *ctx)
>>   {
>> -	if (ctx->cq_ev_fd) {
>> -		eventfd_ctx_put(ctx->cq_ev_fd);
>> -		ctx->cq_ev_fd = NULL;
>> -		return 0;
>> +	struct io_ev_fd *ev_fd;
>> +	int ret;
>> +
>> +	mutex_lock(&ctx->ev_fd_lock);
>> +	ev_fd = rcu_dereference_protected(ctx->io_ev_fd, lockdep_is_held(&ctx->ev_fd_lock));
>> +	if (ev_fd) {
>> +		call_rcu(&ev_fd->rcu, io_eventfd_put);
>> +		ret = 0;
>> +		goto out;
>>   	}
>> +	ret = -ENXIO;
>>   
>> -	return -ENXIO;
>> +out:
>> +	mutex_unlock(&ctx->ev_fd_lock);
>> +	return ret;
>>   }
> 
> I also think that'd be cleaner without the goto:
> 
> {
> 	struct io_ev_fd *ev_fd;
> 	int ret;
> 
> 	mutex_lock(&ctx->ev_fd_lock);
> 	ev_fd = rcu_dereference_protected(ctx->io_ev_fd,
> 					lockdep_is_held(&ctx->ev_fd_lock));
> 	if (ev_fd) {
> 		call_rcu(&ev_fd->rcu, io_eventfd_put);
> 		mutex_unlock(&ctx->ev_fd_lock);
> 		return 0;
> 	}
> 
> 	mutex_unlock(&ctx->ev_fd_lock);
> 	return -ENXIO;
> }
> 
Thanks, will do this in the next patchset with the above
io_eventfd_signal changes, if those look ok as well?


* Re: [External] Re: [PATCH v4 2/3] io_uring: avoid ring quiesce while registering/unregistering eventfd
From: Jens Axboe @ 2022-02-03 19:12 UTC
  To: Usama Arif, io-uring, asml.silence, linux-kernel; +Cc: fam.zheng

On 2/3/22 12:05 PM, Usama Arif wrote:
> 
> 
> On 03/02/2022 18:49, Jens Axboe wrote:
>> On 2/3/22 11:24 AM, Usama Arif wrote:
>>> -static inline bool io_should_trigger_evfd(struct io_ring_ctx *ctx)
>>> +static void io_eventfd_signal(struct io_ring_ctx *ctx)
>>>   {
>>> -	if (likely(!ctx->cq_ev_fd))
>>> -		return false;
>>> +	struct io_ev_fd *ev_fd;
>>> +
>>> +	rcu_read_lock();
>>> +	/* rcu_dereference ctx->io_ev_fd once and use it both for checking and eventfd_signal */
>>> +	ev_fd = rcu_dereference(ctx->io_ev_fd);
>>> +
>>> +	if (likely(!ev_fd))
>>> +		goto out;
>>>   	if (READ_ONCE(ctx->rings->cq_flags) & IORING_CQ_EVENTFD_DISABLED)
>>> -		return false;
>>> -	return !ctx->eventfd_async || io_wq_current_is_worker();
>>> +		goto out;
>>> +
>>> +	if (!ctx->eventfd_async || io_wq_current_is_worker())
>>> +		eventfd_signal(ev_fd->cq_ev_fd, 1);
>>> +
>>> +out:
>>> +	rcu_read_unlock();
>>>   }
>>
>> This still needs what we discussed in v3, something ala:
>>
>> /*
>>   * This will potential race with eventfd registration, but that's
>>   * always going to be the case if there is IO inflight while an eventfd
>>   * descriptor is being registered.
>>   */
>> if (!rcu_dereference_raw(ctx->io_ev_fd))
>> 	return;
>>
>> rcu_read_lock();
> 
> Hmm, so I am not so worried about the registration, but actually
> worried about unregistration.
> If, after the check and before the rcu_read_lock, the eventfd is
> unregistered, won't we get a NULL pointer exception at
> eventfd_signal(ev_fd->cq_ev_fd, 1)?

You need to check it twice, that's a hard requirement. The first racy
check is safe because we don't care if we miss a notification, once
inside rcu_read_lock() it needs to be done properly of course. Like you
do below, that's how it should be done.

>> I wonder if we can get away with assigning ctx->io_ev_fd to NULL when we
>> do the call_rcu(). The struct itself will remain valid as long as we're
>> under rcu_read_lock() protection, so I think we'd be fine? If we do
>> that, then we don't need any rcu_barrier() or synchronize_rcu() calls,
>> as we can register a new one while the previous one is still being
>> killed.
>>
>> Hmm?
>>
> 
> We would have to remove the check that ctx->io_ev_fd != NULL. That
> would also mean that 2 successive calls to io_eventfd_register, without
> any unregister in between, would both succeed, which I don't think is
> the right behaviour?
> 
> I think the likelihood of hitting the rcu_barrier itself is quite low,
> so the cost is probably low as well.

Yeah it might very well be. To make what I suggested work, we'd need a
way to mark the io_ev_fd as going away. Which would be feasible, as we
know the memory will remain valid for us to check. So it could
definitely work, you'd just need a check for that.
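
One possible shape of that suggestion (a sketch only, assuming the
pointer is cleared under the mutex before call_rcu(); not the code in
this series):

static int io_eventfd_unregister(struct io_ring_ctx *ctx)
{
	struct io_ev_fd *ev_fd;

	mutex_lock(&ctx->ev_fd_lock);
	ev_fd = rcu_dereference_protected(ctx->io_ev_fd,
					lockdep_is_held(&ctx->ev_fd_lock));
	if (ev_fd) {
		/* unpublish first: a new registration can proceed at
		 * once, while RCU readers still see valid ev_fd memory
		 * until the grace period ends */
		rcu_assign_pointer(ctx->io_ev_fd, NULL);
		call_rcu(&ev_fd->rcu, io_eventfd_put);
		mutex_unlock(&ctx->ev_fd_lock);
		return 0;
	}
	mutex_unlock(&ctx->ev_fd_lock);
	return -ENXIO;
}

io_eventfd_put() would then only do eventfd_ctx_put() and kfree(),
without touching ctx->io_ev_fd.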

> Thanks, will do this in the next patchset with the above
> io_eventfd_signal changes, if those look ok as well?

The code you pasted looked good. Consider the "is unregistration in
progress" suggestion as well, as it would be nice to avoid any kind of
rcu synchronization if at all possible.

-- 
Jens Axboe



* Re: [External] Re: [PATCH v4 2/3] io_uring: avoid ring quiesce while registering/unregistering eventfd
From: Usama Arif @ 2022-02-03 23:37 UTC
  To: Jens Axboe, io-uring, asml.silence, linux-kernel; +Cc: fam.zheng



On 03/02/2022 19:12, Jens Axboe wrote:
> On 2/3/22 12:05 PM, Usama Arif wrote:
>>
>>
>> On 03/02/2022 18:49, Jens Axboe wrote:
>>> On 2/3/22 11:24 AM, Usama Arif wrote:
>>>> -static inline bool io_should_trigger_evfd(struct io_ring_ctx *ctx)
>>>> +static void io_eventfd_signal(struct io_ring_ctx *ctx)
>>>>    {
>>>> -	if (likely(!ctx->cq_ev_fd))
>>>> -		return false;
>>>> +	struct io_ev_fd *ev_fd;
>>>> +
>>>> +	rcu_read_lock();
>>>> +	/* rcu_dereference ctx->io_ev_fd once and use it both for checking and eventfd_signal */
>>>> +	ev_fd = rcu_dereference(ctx->io_ev_fd);
>>>> +
>>>> +	if (likely(!ev_fd))
>>>> +		goto out;
>>>>    	if (READ_ONCE(ctx->rings->cq_flags) & IORING_CQ_EVENTFD_DISABLED)
>>>> -		return false;
>>>> -	return !ctx->eventfd_async || io_wq_current_is_worker();
>>>> +		goto out;
>>>> +
>>>> +	if (!ctx->eventfd_async || io_wq_current_is_worker())
>>>> +		eventfd_signal(ev_fd->cq_ev_fd, 1);
>>>> +
>>>> +out:
>>>> +	rcu_read_unlock();
>>>>    }
>>>
>>> This still needs what we discussed in v3, something ala:
>>>
>>> /*
>>>    * This will potentially race with eventfd registration, but that's
>>>    * always going to be the case if there is IO inflight while an eventfd
>>>    * descriptor is being registered.
>>>    */
>>> if (!rcu_dereference_raw(ctx->io_ev_fd))
>>> 	return;
>>>
>>> rcu_read_lock();
>>
>> Hmm, so I am not so worried about the registration, but actually
>> worried about unregistration.
>> If, after the check and before the rcu_read_lock, the eventfd is
>> unregistered, won't we get a NULL pointer exception at
>> eventfd_signal(ev_fd->cq_ev_fd, 1)?
> 
> You need to check it twice, that's a hard requirement. The first racy
> check is safe because we don't care if we miss a notification, once
> inside rcu_read_lock() it needs to be done properly of course. Like you
> do below, that's how it should be done.
> 
>>> I wonder if we can get away with assigning ctx->io_ev_fd to NULL when we
>>> do the call_rcu(). The struct itself will remain valid as long as we're
>>> under rcu_read_lock() protection, so I think we'd be fine? If we do
>>> that, then we don't need any rcu_barrier() or synchronize_rcu() calls,
>>> as we can register a new one while the previous one is still being
>>> killed.
>>>
>>> Hmm?
>>>
>>
>> We would have to remove the check that ctx->io_ev_fd != NULL. That
>> would also mean that 2 successive calls to io_eventfd_register, without
>> any unregister in between, would both succeed, which I don't think is
>> the right behaviour?
>>
>> I think the likelihood of hitting the rcu_barrier itself is quite low,
>> so the cost is probably low as well.
> 
> Yeah it might very well be. To make what I suggested work, we'd need a
> way to mark the io_ev_fd as going away. Which would be feasible, as we
> know the memory will remain valid for us to check. So it could
> definitely work, you'd just need a check for that.
> 
>> Thanks, will do this in the next patchset with the above
>> io_eventfd_signal changes, if those look ok as well?
> 
> The code you pasted looked good. Consider the "is unregistration in
> progress" suggestion as well, as it would be nice to avoid any kind of
> rcu synchronization if at all possible.
> 


Thanks for the review comments! I think all of them should now be
addressed in v5. I also removed ring quiesce from io_uring_register, as
the remaining 2 opcodes don't need it (thanks Pavel for confirming
that!).

Regards,
Usama

