* [PATCH v3 0/3] napi tracking strategy
From: Olivier Langlois @ 2024-09-18 12:59 UTC
To: Jens Axboe, Pavel Begunkov, io-uring
The current napi tracking strategy induces a non-negligible overhead.
Every time a multishot poll is triggered or any poll is armed, if napi is
enabled on the ring, a lookup is performed to either add a new napi id into
the napi_list or update its timeout value.
For many scenarios, this is overkill as the napi id list will be pretty
much static most of the time. To address this common scenario, a new
abstraction has been created following the common Linux kernel idiom of
creating an abstract interface with a struct filled with function pointers.
The patch series consists of very minor fixes followed by the core of the
changes implementing the new feature.
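As an illustration, the intended userspace flow looks roughly like this
(sketch only: it assumes the uapi additions from patch 3 and liburing's
io_uring_register_napi() helper; setup_static_napi() and the error
handling are just for the example):

#include <errno.h>
#include <liburing.h>
#include <sys/socket.h>

/* enable static tracking on the ring, then pin one napi id */
static int setup_static_napi(struct io_uring *ring, int sockfd)
{
        struct io_uring_napi napi = {
                .busy_poll_to = 50,     /* busy poll timeout, in usecs */
                .prefer_busy_poll = 1,
                .opcode = IO_URING_NAPI_REGISTER_OP,
                .op_param = IO_URING_NAPI_TRACKING_STATIC,
        };
        unsigned int napi_id = 0;
        socklen_t len = sizeof(napi_id);
        int ret;

        ret = io_uring_register_napi(ring, &napi);
        if (ret)
                return ret;

        /* the socket's napi id, known once traffic has been received */
        if (getsockopt(sockfd, SOL_SOCKET, SO_INCOMING_NAPI_ID,
                       &napi_id, &len) < 0)
                return -errno;

        napi = (struct io_uring_napi) {
                .opcode = IO_URING_NAPI_STATIC_ADD_ID,
                .op_param = napi_id,
        };
        return io_uring_register_napi(ring, &napi);
}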
v3 changes:
- address minor comments in patch #3
- replace the double for loop hash macro with the single loop list macro to iterate the napi elements in patch #2
- add __cold attribute to common_tracking_show_fdinfo() and napi_show_fdinfo()
v2 changes:
- extract small changes from the core changes to ease minor fixes backport
- totally remove the io_napi_tracking_ops interface
Olivier Langlois (3):
io_uring/napi: protect concurrent io_napi_entry timeout accesses
io_uring/napi: fix io_napi_entry RCU accesses
io_uring/napi: add static napi tracking strategy
include/linux/io_uring_types.h | 2 +-
include/uapi/linux/io_uring.h | 32 +++++++-
io_uring/fdinfo.c | 41 ++++++++++
io_uring/napi.c | 140 +++++++++++++++++++++++++--------
io_uring/napi.h | 15 +++-
5 files changed, 192 insertions(+), 38 deletions(-)
--
2.46.1
* [PATCH v3 1/3] io_uring/napi: protect concurrent io_napi_entry timeout accesses
From: Olivier Langlois @ 2024-09-18 12:59 UTC
To: Jens Axboe, Pavel Begunkov, io-uring
The io_napi_entry timeout value can be updated while it is being accessed
from the poll functions. Wrap these concurrent accesses with the
READ_ONCE()/WRITE_ONCE() macros to avoid incorrect compiler optimizations.
Signed-off-by: Olivier Langlois <[email protected]>
---
io_uring/napi.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
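For context (illustrative, not part of the diff): a plain access to
e->timeout lets the compiler tear, fuse or re-read the value while another
CPU updates it; the annotated pair below constrains each access to a
single load/store. mark_stale() is a placeholder.

        /* writer: refresh the timeout of an existing entry */
        WRITE_ONCE(e->timeout, jiffies + NAPI_TIMEOUT);

        /* concurrent reader: stale check in the busy-poll paths */
        if (time_after(jiffies, READ_ONCE(e->timeout)))
                mark_stale(e);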
diff --git a/io_uring/napi.c b/io_uring/napi.c
index 1de1d4d62925..1dd2df9da468 100644
--- a/io_uring/napi.c
+++ b/io_uring/napi.c
@@ -60,7 +60,7 @@ void __io_napi_add(struct io_ring_ctx *ctx, struct socket *sock)
rcu_read_lock();
e = io_napi_hash_find(hash_list, napi_id);
if (e) {
- e->timeout = jiffies + NAPI_TIMEOUT;
+ WRITE_ONCE(e->timeout, jiffies + NAPI_TIMEOUT);
rcu_read_unlock();
return;
}
@@ -92,7 +92,7 @@ static void __io_napi_remove_stale(struct io_ring_ctx *ctx)
spin_lock(&ctx->napi_lock);
hash_for_each(ctx->napi_ht, i, e, node) {
- if (time_after(jiffies, e->timeout)) {
+ if (time_after(jiffies, READ_ONCE(e->timeout))) {
list_del(&e->list);
hash_del_rcu(&e->node);
kfree_rcu(e, rcu);
@@ -150,7 +150,7 @@ static bool __io_napi_do_busy_loop(struct io_ring_ctx *ctx,
napi_busy_loop_rcu(e->napi_id, loop_end, loop_end_arg,
ctx->napi_prefer_busy_poll, BUSY_POLL_BUDGET);
- if (time_after(jiffies, e->timeout))
+ if (time_after(jiffies, READ_ONCE(e->timeout)))
is_stale = true;
}
--
2.46.1
* [PATCH v3 2/3] io_uring/napi: fix io_napi_entry RCU accesses
From: Olivier Langlois @ 2024-09-18 12:59 UTC
To: Jens Axboe, Pavel Begunkov, io-uring
Correct 3 modifications of RCU-protected structures that were not using
the RCU list primitives to perform their updates.
Signed-off-by: Olivier Langlois <[email protected]>
---
io_uring/napi.c | 17 +++++++++++------
1 file changed, 11 insertions(+), 6 deletions(-)
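For context, a minimal sketch of the RCU list discipline this patch
restores; process_entry() is a placeholder. Writers serialize on
ctx->napi_lock, readers traverse locklessly:

        /* update side, called with ctx->napi_lock held */
        list_add_tail_rcu(&e->list, &ctx->napi_list);   /* insert */

        list_del_rcu(&e->list);                         /* remove */
        kfree_rcu(e, rcu);      /* freeing deferred past a grace period */

        /* read side, no lock taken */
        rcu_read_lock();
        list_for_each_entry_rcu(e, &ctx->napi_list, list)
                process_entry(e);
        rcu_read_unlock();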
diff --git a/io_uring/napi.c b/io_uring/napi.c
index 1dd2df9da468..6fc127e74f10 100644
--- a/io_uring/napi.c
+++ b/io_uring/napi.c
@@ -81,19 +81,24 @@ void __io_napi_add(struct io_ring_ctx *ctx, struct socket *sock)
}
hlist_add_tail_rcu(&e->node, hash_list);
- list_add_tail(&e->list, &ctx->napi_list);
+ list_add_tail_rcu(&e->list, &ctx->napi_list);
spin_unlock(&ctx->napi_lock);
}
static void __io_napi_remove_stale(struct io_ring_ctx *ctx)
{
struct io_napi_entry *e;
- unsigned int i;
spin_lock(&ctx->napi_lock);
- hash_for_each(ctx->napi_ht, i, e, node) {
+ /*
+ * list_for_each_entry_safe() is not required as long as:
+ * 1. list_del_rcu() does not reset the deleted node next pointer
+ * 2. kfree_rcu() delays the memory freeing until the next quiescent
+ * state
+ */
+ list_for_each_entry(e, &ctx->napi_list, list) {
if (time_after(jiffies, READ_ONCE(e->timeout))) {
- list_del(&e->list);
+ list_del_rcu(&e->list);
hash_del_rcu(&e->node);
kfree_rcu(e, rcu);
}
@@ -204,13 +209,13 @@ void io_napi_init(struct io_ring_ctx *ctx)
void io_napi_free(struct io_ring_ctx *ctx)
{
struct io_napi_entry *e;
- unsigned int i;
spin_lock(&ctx->napi_lock);
- hash_for_each(ctx->napi_ht, i, e, node) {
+ list_for_each_entry(e, &ctx->napi_list, list) {
hash_del_rcu(&e->node);
kfree_rcu(e, rcu);
}
+ INIT_LIST_HEAD_RCU(&ctx->napi_list);
spin_unlock(&ctx->napi_lock);
}
--
2.46.1
* [PATCH v3 3/3] io_uring/napi: add static napi tracking strategy
From: Olivier Langlois @ 2024-09-18 12:59 UTC
To: Jens Axboe, Pavel Begunkov, io-uring
add the static napi tracking strategy that allows the user to manually
manage the napi ids list to busy poll and offload the ring from
dynamically update the list.
Signed-off-by: Olivier Langlois <[email protected]>
---
include/linux/io_uring_types.h | 2 +-
include/uapi/linux/io_uring.h | 32 ++++++++-
io_uring/fdinfo.c | 41 ++++++++++++
io_uring/napi.c | 117 ++++++++++++++++++++++++++-------
io_uring/napi.h | 15 +++--
5 files changed, 178 insertions(+), 29 deletions(-)
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 3315005df117..f0a0cb2518bb 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -406,7 +406,7 @@ struct io_ring_ctx {
/* napi busy poll default timeout */
ktime_t napi_busy_poll_dt;
bool napi_prefer_busy_poll;
- bool napi_enabled;
+ u8 napi_track_mode;
DECLARE_HASHTABLE(napi_ht, 4);
#endif
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index adc2524fd8e3..10d9030c4242 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -728,12 +728,40 @@ struct io_uring_buf_status {
__u32 resv[8];
};
+enum io_uring_napi_op {
+ /* register/unregister backward compatible opcode */
+ IO_URING_NAPI_REGISTER_OP = 0,
+
+ /* opcodes to update napi_list when static tracking is used */
+ IO_URING_NAPI_STATIC_ADD_ID = 1,
+ IO_URING_NAPI_STATIC_DEL_ID = 2
+};
+
+enum io_uring_napi_tracking_strategy {
+ /* value must be 0 for backward compatibility */
+ IO_URING_NAPI_TRACKING_DYNAMIC = 0,
+ IO_URING_NAPI_TRACKING_STATIC = 1,
+ IO_URING_NAPI_TRACKING_INACTIVE = 255
+};
+
/* argument for IORING_(UN)REGISTER_NAPI */
struct io_uring_napi {
__u32 busy_poll_to;
__u8 prefer_busy_poll;
- __u8 pad[3];
- __u64 resv;
+
+ /* an io_uring_napi_op value */
+ __u8 opcode;
+ __u8 pad[2];
+
+ /*
+ * for IO_URING_NAPI_REGISTER_OP, it is a
+ * io_uring_napi_tracking_strategy value.
+ *
+ * for IO_URING_NAPI_STATIC_ADD_ID/IO_URING_NAPI_STATIC_DEL_ID
+ * it is the napi id to add/del from napi_list.
+ */
+ __u32 op_param;
+ __u32 resv;
};
/*
diff --git a/io_uring/fdinfo.c b/io_uring/fdinfo.c
index b1e0e0d85349..6f0e40e1469c 100644
--- a/io_uring/fdinfo.c
+++ b/io_uring/fdinfo.c
@@ -46,6 +46,46 @@ static __cold int io_uring_show_cred(struct seq_file *m, unsigned int id,
return 0;
}
+#ifdef CONFIG_NET_RX_BUSY_POLL
+static __cold void common_tracking_show_fdinfo(struct io_ring_ctx *ctx,
+ struct seq_file *m,
+ const char *tracking_strategy)
+{
+ seq_puts(m, "NAPI:\tenabled\n");
+ seq_printf(m, "napi tracking:\t%s\n", tracking_strategy);
+ seq_printf(m, "napi_busy_poll_dt:\t%llu\n", ctx->napi_busy_poll_dt);
+ if (ctx->napi_prefer_busy_poll)
+ seq_puts(m, "napi_prefer_busy_poll:\ttrue\n");
+ else
+ seq_puts(m, "napi_prefer_busy_poll:\tfalse\n");
+}
+
+static __cold void napi_show_fdinfo(struct io_ring_ctx *ctx,
+ struct seq_file *m)
+{
+ unsigned int mode = READ_ONCE(ctx->napi_track_mode);
+
+ switch (mode) {
+ case IO_URING_NAPI_TRACKING_INACTIVE:
+ seq_puts(m, "NAPI:\tdisabled\n");
+ break;
+ case IO_URING_NAPI_TRACKING_DYNAMIC:
+ common_tracking_show_fdinfo(ctx, m, "dynamic");
+ break;
+ case IO_URING_NAPI_TRACKING_STATIC:
+ common_tracking_show_fdinfo(ctx, m, "static");
+ break;
+ default:
+ seq_printf(m, "NAPI:\tunknown mode (%u)\n", mode);
+ }
+}
+#else
+static inline void napi_show_fdinfo(struct io_ring_ctx *ctx,
+ struct seq_file *m)
+{
+}
+#endif
+
/*
* Caller holds a reference to the file already, we don't need to do
* anything else to get an extra reference.
@@ -223,5 +263,6 @@ __cold void io_uring_show_fdinfo(struct seq_file *m, struct file *file)
}
spin_unlock(&ctx->completion_lock);
+ napi_show_fdinfo(ctx, m);
}
#endif
diff --git a/io_uring/napi.c b/io_uring/napi.c
index 6fc127e74f10..d98b87d346ca 100644
--- a/io_uring/napi.c
+++ b/io_uring/napi.c
@@ -38,22 +38,14 @@ static inline ktime_t net_to_ktime(unsigned long t)
return ns_to_ktime(t << 10);
}
-void __io_napi_add(struct io_ring_ctx *ctx, struct socket *sock)
+int __io_napi_add_id(struct io_ring_ctx *ctx, unsigned int napi_id)
{
struct hlist_head *hash_list;
- unsigned int napi_id;
- struct sock *sk;
struct io_napi_entry *e;
- sk = sock->sk;
- if (!sk)
- return;
-
- napi_id = READ_ONCE(sk->sk_napi_id);
-
/* Non-NAPI IDs can be rejected. */
if (napi_id < MIN_NAPI_ID)
- return;
+ return -EINVAL;
hash_list = &ctx->napi_ht[hash_min(napi_id, HASH_BITS(ctx->napi_ht))];
@@ -62,13 +54,13 @@ void __io_napi_add(struct io_ring_ctx *ctx, struct socket *sock)
if (e) {
WRITE_ONCE(e->timeout, jiffies + NAPI_TIMEOUT);
rcu_read_unlock();
- return;
+ return -EEXIST;
}
rcu_read_unlock();
e = kmalloc(sizeof(*e), GFP_NOWAIT);
if (!e)
- return;
+ return -ENOMEM;
e->napi_id = napi_id;
e->timeout = jiffies + NAPI_TIMEOUT;
@@ -77,12 +69,37 @@ void __io_napi_add(struct io_ring_ctx *ctx, struct socket *sock)
if (unlikely(io_napi_hash_find(hash_list, napi_id))) {
spin_unlock(&ctx->napi_lock);
kfree(e);
- return;
+ return -EEXIST;
}
hlist_add_tail_rcu(&e->node, hash_list);
list_add_tail_rcu(&e->list, &ctx->napi_list);
spin_unlock(&ctx->napi_lock);
+ return 0;
+}
+
+static int __io_napi_del_id(struct io_ring_ctx *ctx, unsigned int napi_id)
+{
+ struct hlist_head *hash_list;
+ struct io_napi_entry *e;
+
+ /* Non-NAPI IDs can be rejected. */
+ if (napi_id < MIN_NAPI_ID)
+ return -EINVAL;
+
+ hash_list = &ctx->napi_ht[hash_min(napi_id, HASH_BITS(ctx->napi_ht))];
+ spin_lock(&ctx->napi_lock);
+ e = io_napi_hash_find(hash_list, napi_id);
+ if (unlikely(!e)) {
+ spin_unlock(&ctx->napi_lock);
+ return -ENOENT;
+ }
+
+ list_del_rcu(&e->list);
+ hash_del_rcu(&e->node);
+ kfree_rcu(e, rcu);
+ spin_unlock(&ctx->napi_lock);
+ return 0;
}
static void __io_napi_remove_stale(struct io_ring_ctx *ctx)
@@ -141,8 +158,26 @@ static bool io_napi_busy_loop_should_end(void *data,
return false;
}
-static bool __io_napi_do_busy_loop(struct io_ring_ctx *ctx,
- void *loop_end_arg)
+/*
+ * never report stale entries
+ */
+static bool static_tracking_do_busy_loop(struct io_ring_ctx *ctx,
+ void *loop_end_arg)
+{
+ struct io_napi_entry *e;
+ bool (*loop_end)(void *, unsigned long) = NULL;
+
+ if (loop_end_arg)
+ loop_end = io_napi_busy_loop_should_end;
+
+ list_for_each_entry_rcu(e, &ctx->napi_list, list)
+ napi_busy_loop_rcu(e->napi_id, loop_end, loop_end_arg,
+ ctx->napi_prefer_busy_poll, BUSY_POLL_BUDGET);
+ return false;
+}
+
+static bool dynamic_tracking_do_busy_loop(struct io_ring_ctx *ctx,
+ void *loop_end_arg)
{
struct io_napi_entry *e;
bool (*loop_end)(void *, unsigned long) = NULL;
@@ -162,6 +197,14 @@ static bool __io_napi_do_busy_loop(struct io_ring_ctx *ctx,
return is_stale;
}
+static inline bool __io_napi_do_busy_loop(struct io_ring_ctx *ctx,
+ void *loop_end_arg)
+{
+ if (READ_ONCE(ctx->napi_track_mode) == IO_URING_NAPI_TRACKING_STATIC)
+ return static_tracking_do_busy_loop(ctx, loop_end_arg);
+ return dynamic_tracking_do_busy_loop(ctx, loop_end_arg);
+}
+
static void io_napi_blocking_busy_loop(struct io_ring_ctx *ctx,
struct io_wait_queue *iowq)
{
@@ -198,6 +241,7 @@ void io_napi_init(struct io_ring_ctx *ctx)
spin_lock_init(&ctx->napi_lock);
ctx->napi_prefer_busy_poll = false;
ctx->napi_busy_poll_dt = ns_to_ktime(sys_dt);
+ ctx->napi_track_mode = IO_URING_NAPI_TRACKING_INACTIVE;
}
/*
@@ -219,6 +263,24 @@ void io_napi_free(struct io_ring_ctx *ctx)
spin_unlock(&ctx->napi_lock);
}
+static int io_napi_register_napi(struct io_ring_ctx *ctx,
+ struct io_uring_napi *napi)
+{
+ switch (napi->op_param) {
+ case IO_URING_NAPI_TRACKING_DYNAMIC:
+ case IO_URING_NAPI_TRACKING_STATIC:
+ break;
+ default:
+ return -EINVAL;
+ }
+ /* clean the napi list for new settings */
+ io_napi_free(ctx);
+ WRITE_ONCE(ctx->napi_track_mode, napi->op_param);
+ WRITE_ONCE(ctx->napi_busy_poll_dt, napi->busy_poll_to * NSEC_PER_USEC);
+ WRITE_ONCE(ctx->napi_prefer_busy_poll, !!napi->prefer_busy_poll);
+ return 0;
+}
+
/*
* io_napi_register() - Register napi with io-uring
* @ctx: pointer to io-uring context structure
@@ -230,7 +292,8 @@ int io_register_napi(struct io_ring_ctx *ctx, void __user *arg)
{
const struct io_uring_napi curr = {
.busy_poll_to = ktime_to_us(ctx->napi_busy_poll_dt),
- .prefer_busy_poll = ctx->napi_prefer_busy_poll
+ .prefer_busy_poll = ctx->napi_prefer_busy_poll,
+ .op_param = ctx->napi_track_mode
};
struct io_uring_napi napi;
@@ -238,16 +301,26 @@ int io_register_napi(struct io_ring_ctx *ctx, void __user *arg)
return -EINVAL;
if (copy_from_user(&napi, arg, sizeof(napi)))
return -EFAULT;
- if (napi.pad[0] || napi.pad[1] || napi.pad[2] || napi.resv)
+ if (napi.pad[0] || napi.pad[1] || napi.resv)
return -EINVAL;
if (copy_to_user(arg, &curr, sizeof(curr)))
return -EFAULT;
- WRITE_ONCE(ctx->napi_busy_poll_dt, napi.busy_poll_to * NSEC_PER_USEC);
- WRITE_ONCE(ctx->napi_prefer_busy_poll, !!napi.prefer_busy_poll);
- WRITE_ONCE(ctx->napi_enabled, true);
- return 0;
+ switch (napi.opcode) {
+ case IO_URING_NAPI_REGISTER_OP:
+ return io_napi_register_napi(ctx, &napi);
+ case IO_URING_NAPI_STATIC_ADD_ID:
+ if (curr.op_param != IO_URING_NAPI_TRACKING_STATIC)
+ return -EINVAL;
+ return __io_napi_add_id(ctx, napi.op_param);
+ case IO_URING_NAPI_STATIC_DEL_ID:
+ if (curr.op_param != IO_URING_NAPI_TRACKING_STATIC)
+ return -EINVAL;
+ return __io_napi_del_id(ctx, napi.op_param);
+ default:
+ return -EINVAL;
+ }
}
/*
@@ -270,7 +343,7 @@ int io_unregister_napi(struct io_ring_ctx *ctx, void __user *arg)
WRITE_ONCE(ctx->napi_busy_poll_dt, 0);
WRITE_ONCE(ctx->napi_prefer_busy_poll, false);
- WRITE_ONCE(ctx->napi_enabled, false);
+ WRITE_ONCE(ctx->napi_track_mode, IO_URING_NAPI_TRACKING_INACTIVE);
return 0;
}
diff --git a/io_uring/napi.h b/io_uring/napi.h
index 27b88c3eb428..220574522484 100644
--- a/io_uring/napi.h
+++ b/io_uring/napi.h
@@ -15,7 +15,7 @@ void io_napi_free(struct io_ring_ctx *ctx);
int io_register_napi(struct io_ring_ctx *ctx, void __user *arg);
int io_unregister_napi(struct io_ring_ctx *ctx, void __user *arg);
-void __io_napi_add(struct io_ring_ctx *ctx, struct socket *sock);
+int __io_napi_add_id(struct io_ring_ctx *ctx, unsigned int napi_id);
void __io_napi_adjust_timeout(struct io_ring_ctx *ctx,
struct io_wait_queue *iowq, ktime_t to_wait);
@@ -54,13 +54,20 @@ static inline void io_napi_add(struct io_kiocb *req)
{
struct io_ring_ctx *ctx = req->ctx;
struct socket *sock;
+ struct sock *sk;
- if (!READ_ONCE(ctx->napi_enabled))
+ if (READ_ONCE(ctx->napi_track_mode) != IO_URING_NAPI_TRACKING_DYNAMIC)
return;
sock = sock_from_file(req->file);
- if (sock)
- __io_napi_add(ctx, sock);
+ if (!sock)
+ return;
+
+ sk = sock->sk;
+ if (!sk)
+ return;
+
+ __io_napi_add_id(ctx, READ_ONCE(sk->sk_napi_id));
}
#else
--
2.46.1
* Re: [PATCH v3 0/3] napi tracking strategy
From: Jens Axboe @ 2024-09-19 2:36 UTC
To: Olivier Langlois, Pavel Begunkov, io-uring
On 9/18/24 6:59 AM, Olivier Langlois wrote:
> The current napi tracking strategy induces a non-negligible overhead.
> Every time a multishot poll is triggered or any poll is armed, if napi is
> enabled on the ring, a lookup is performed to either add a new napi id into
> the napi_list or update its timeout value.
>
> For many scenarios, this is overkill as the napi id list will be pretty
> much static most of the time. To address this common scenario, a new
> abstraction has been created following the common Linux kernel idiom of
> creating an abstract interface with a struct filled with function pointers.
This paragraph seems to be a remnant of the v1 implementation?
--
Jens Axboe
* Re: [PATCH v3 3/3] io_uring/napi: add static napi tracking strategy
From: Jens Axboe @ 2024-09-19 2:59 UTC
To: Olivier Langlois, Pavel Begunkov, io-uring
On 9/18/24 6:59 AM, Olivier Langlois wrote:
> add the static napi tracking strategy that allows the user to manually
> manage the napi ids list to busy poll and offload the ring from
> dynamically update the list.
Add the static napi tracking strategy. That allows the user to manually
manage the napi ids list for busy polling, and eliminate the overhead of
dynamically updating the list from the fast path.
Maybe?
> index b1e0e0d85349..6f0e40e1469c 100644
> --- a/io_uring/fdinfo.c
> +++ b/io_uring/fdinfo.c
> @@ -46,6 +46,46 @@ static __cold int io_uring_show_cred(struct seq_file *m, unsigned int id,
> return 0;
> }
>
> +#ifdef CONFIG_NET_RX_BUSY_POLL
> +static __cold void common_tracking_show_fdinfo(struct io_ring_ctx *ctx,
> + struct seq_file *m,
> + const char *tracking_strategy)
> +{
> + seq_puts(m, "NAPI:\tenabled\n");
> + seq_printf(m, "napi tracking:\t%s\n", tracking_strategy);
> + seq_printf(m, "napi_busy_poll_dt:\t%llu\n", ctx->napi_busy_poll_dt);
> + if (ctx->napi_prefer_busy_poll)
> + seq_puts(m, "napi_prefer_busy_poll:\ttrue\n");
> + else
> + seq_puts(m, "napi_prefer_busy_poll:\tfalse\n");
> +}
> +
> +static __cold void napi_show_fdinfo(struct io_ring_ctx *ctx,
> + struct seq_file *m)
> +{
> + unsigned int mode = READ_ONCE(ctx->napi_track_mode);
> +
> + switch (mode) {
> + case IO_URING_NAPI_TRACKING_INACTIVE:
> + seq_puts(m, "NAPI:\tdisabled\n");
> + break;
> + case IO_URING_NAPI_TRACKING_DYNAMIC:
> + common_tracking_show_fdinfo(ctx, m, "dynamic");
> + break;
> + case IO_URING_NAPI_TRACKING_STATIC:
> + common_tracking_show_fdinfo(ctx, m, "static");
> + break;
> + default:
> + seq_printf(m, "NAPI:\tunknown mode (%u)\n", mode);
> + }
> +}
> +#else
> +static inline void napi_show_fdinfo(struct io_ring_ctx *ctx,
> + struct seq_file *m)
> +{
> +}
> +#endif
I think this code should go in napi.c, with the stub
CONFIG_NET_RX_BUSY_POLL in napi.h. Not a huge deal.
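Roughly, napi.h would then just carry:

        #ifdef CONFIG_NET_RX_BUSY_POLL
        void napi_show_fdinfo(struct io_ring_ctx *ctx, struct seq_file *m);
        #else
        static inline void napi_show_fdinfo(struct io_ring_ctx *ctx,
                                            struct seq_file *m)
        {
        }
        #endif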
This also conflicts with your previous napi patch adding fdinfo support.
What kernel is this patchset based on? You should rebase it onto
for-6.12/io_uring, then it should apply to the development branch going
forward too.
> diff --git a/io_uring/napi.c b/io_uring/napi.c
> index 6fc127e74f10..d98b87d346ca 100644
> --- a/io_uring/napi.c
> +++ b/io_uring/napi.c
> @@ -38,22 +38,14 @@ static inline ktime_t net_to_ktime(unsigned long t)
> return ns_to_ktime(t << 10);
> }
>
> -void __io_napi_add(struct io_ring_ctx *ctx, struct socket *sock)
> +int __io_napi_add_id(struct io_ring_ctx *ctx, unsigned int napi_id)
> {
> struct hlist_head *hash_list;
> - unsigned int napi_id;
> - struct sock *sk;
> struct io_napi_entry *e;
>
> - sk = sock->sk;
> - if (!sk)
> - return;
> -
> - napi_id = READ_ONCE(sk->sk_napi_id);
> -
> /* Non-NAPI IDs can be rejected. */
> if (napi_id < MIN_NAPI_ID)
> - return;
> + return -EINVAL;
>
> hash_list = &ctx->napi_ht[hash_min(napi_id, HASH_BITS(ctx->napi_ht))];
>
> @@ -62,13 +54,13 @@ void __io_napi_add(struct io_ring_ctx *ctx, struct socket *sock)
> if (e) {
> WRITE_ONCE(e->timeout, jiffies + NAPI_TIMEOUT);
> rcu_read_unlock();
> - return;
> + return -EEXIST;
> }
> rcu_read_unlock();
>
> e = kmalloc(sizeof(*e), GFP_NOWAIT);
> if (!e)
> - return;
> + return -ENOMEM;
>
> e->napi_id = napi_id;
> e->timeout = jiffies + NAPI_TIMEOUT;
> @@ -77,12 +69,37 @@ void __io_napi_add(struct io_ring_ctx *ctx, struct socket *sock)
> if (unlikely(io_napi_hash_find(hash_list, napi_id))) {
> spin_unlock(&ctx->napi_lock);
> kfree(e);
> - return;
> + return -EEXIST;
> }
You could abstract this out to a prep patch, having __io_napi_add()
return an error value. That would leave the meat of your patch simpler
and easier to review.
> +static int __io_napi_del_id(struct io_ring_ctx *ctx, unsigned int napi_id)
> +{
> + struct hlist_head *hash_list;
> + struct io_napi_entry *e;
> +
> + /* Non-NAPI IDs can be rejected. */
> + if (napi_id < MIN_NAPI_ID)
> + return -EINVAL;
> +
> + hash_list = &ctx->napi_ht[hash_min(napi_id, HASH_BITS(ctx->napi_ht))];
> + spin_lock(&ctx->napi_lock);
> + e = io_napi_hash_find(hash_list, napi_id);
> + if (unlikely(!e)) {
> + spin_unlock(&ctx->napi_lock);
> + return -ENOENT;
> + }
> +
> + list_del_rcu(&e->list);
> + hash_del_rcu(&e->node);
> + kfree_rcu(e, rcu);
> + spin_unlock(&ctx->napi_lock);
> + return 0;
> }
For new code, not a bad idea to use:
guard(spinlock)(&ctx->napi_lock);
Only one cleanup path here, but...
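As a sketch, the delete path would then collapse to:

        static int __io_napi_del_id(struct io_ring_ctx *ctx, unsigned int napi_id)
        {
                struct hlist_head *hash_list;
                struct io_napi_entry *e;

                /* Non-NAPI IDs can be rejected. */
                if (napi_id < MIN_NAPI_ID)
                        return -EINVAL;

                hash_list = &ctx->napi_ht[hash_min(napi_id, HASH_BITS(ctx->napi_ht))];

                /* lock is dropped automatically at every return below */
                guard(spinlock)(&ctx->napi_lock);

                e = io_napi_hash_find(hash_list, napi_id);
                if (unlikely(!e))
                        return -ENOENT;

                list_del_rcu(&e->list);
                hash_del_rcu(&e->node);
                kfree_rcu(e, rcu);
                return 0;
        }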
> static void __io_napi_remove_stale(struct io_ring_ctx *ctx)
> @@ -141,8 +158,26 @@ static bool io_napi_busy_loop_should_end(void *data,
> return false;
> }
>
> -static bool __io_napi_do_busy_loop(struct io_ring_ctx *ctx,
> - void *loop_end_arg)
> +/*
> + * never report stale entries
> + */
> +static bool static_tracking_do_busy_loop(struct io_ring_ctx *ctx,
> + void *loop_end_arg)
> +{
> + struct io_napi_entry *e;
> + bool (*loop_end)(void *, unsigned long) = NULL;
> +
> + if (loop_end_arg)
> + loop_end = io_napi_busy_loop_should_end;
> +
> + list_for_each_entry_rcu(e, &ctx->napi_list, list)
> + napi_busy_loop_rcu(e->napi_id, loop_end, loop_end_arg,
> + ctx->napi_prefer_busy_poll, BUSY_POLL_BUDGET);
> + return false;
> +}
> +
> +static bool dynamic_tracking_do_busy_loop(struct io_ring_ctx *ctx,
> + void *loop_end_arg)
> {
> struct io_napi_entry *e;
> bool (*loop_end)(void *, unsigned long) = NULL;
This is somewhat convoluted, but I think it'd be cleaner to have a prep
patch that just passes in both loop_end and loop_end_arg to the caller?
Some of this predates your changes here, but seems there's room for
cleaning this up. What do you think?
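Something like the below, with the caller resolving loop_end once (sketch
only):

        static bool dynamic_tracking_do_busy_loop(struct io_ring_ctx *ctx,
                                                  bool (*loop_end)(void *, unsigned long),
                                                  void *loop_end_arg)
        {
                struct io_napi_entry *e;
                bool is_stale = false;

                list_for_each_entry_rcu(e, &ctx->napi_list, list) {
                        napi_busy_loop_rcu(e->napi_id, loop_end, loop_end_arg,
                                           ctx->napi_prefer_busy_poll,
                                           BUSY_POLL_BUDGET);
                        if (time_after(jiffies, READ_ONCE(e->timeout)))
                                is_stale = true;
                }
                return is_stale;
        }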
> diff --git a/io_uring/napi.h b/io_uring/napi.h
> index 27b88c3eb428..220574522484 100644
> --- a/io_uring/napi.h
> +++ b/io_uring/napi.h
> @@ -54,13 +54,20 @@ static inline void io_napi_add(struct io_kiocb *req)
> {
> struct io_ring_ctx *ctx = req->ctx;
> struct socket *sock;
> + struct sock *sk;
>
> - if (!READ_ONCE(ctx->napi_enabled))
> + if (READ_ONCE(ctx->napi_track_mode) != IO_URING_NAPI_TRACKING_DYNAMIC)
> return;
>
> sock = sock_from_file(req->file);
> - if (sock)
> - __io_napi_add(ctx, sock);
> + if (!sock)
> + return;
> +
> + sk = sock->sk;
> + if (!sk)
> + return;
> +
> + __io_napi_add_id(ctx, READ_ONCE(sk->sk_napi_id));
> }
I like having this follow the expected outcome, which is that sock and
sk are valid.
sock = sock_from_file(req->file);
if (sock && sock->sk)
__io_napi_add_id(ctx, READ_ONCE(sock->sk->sk_napi_id));
or something like that. At least to me that's more readable.
--
Jens Axboe
* Re: [PATCH v3 0/3] napi tracking strategy
From: Olivier Langlois @ 2024-09-21 13:32 UTC
To: Jens Axboe, Pavel Begunkov, io-uring
On Wed, 2024-09-18 at 20:36 -0600, Jens Axboe wrote:
> On 9/18/24 6:59 AM, Olivier Langlois wrote:
> > The current napi tracking strategy induces a non-negligible
> > overhead.
> > Every time a multishot poll is triggered or any poll is armed, if
> > napi is enabled on the ring, a lookup is performed to either add a
> > new napi id into the napi_list or update its timeout value.
> >
> > For many scenarios, this is overkill as the napi id list will be
> > pretty
> > much static most of the time. To address this common scenario, a
> > new
> > abstraction has been created following the common Linux kernel
> > idiom of
> > creating an abstract interface with a struct filled with function
> > pointers.
>
> This paragraph seems to be a remnant of the v1 implementation?
>
You are right; at least the last sentence is. Does the cover letter get
recorded anywhere once the patch is accepted? Would that detail alone
justify the creation of a v4?
I am taking note of it and will correct this if there is a need for a v4.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH v3 0/3] napi tracking strategy
From: Jens Axboe @ 2024-09-21 13:46 UTC
To: Olivier Langlois, Pavel Begunkov, io-uring
On 9/21/24 7:32 AM, Olivier Langlois wrote:
> On Wed, 2024-09-18 at 20:36 -0600, Jens Axboe wrote:
>> On 9/18/24 6:59 AM, Olivier Langlois wrote:
>>> The current napi tracking strategy induces a non-negligible
>>> overhead.
>>> Every time a multishot poll is triggered or any poll is armed, if
>>> napi is enabled on the ring, a lookup is performed to either add a
>>> new napi id into the napi_list or update its timeout value.
>>>
>>> For many scenarios, this is overkill as the napi id list will be
>>> pretty
>>> much static most of the time. To address this common scenario, a
>>> new
>>> abstraction has been created following the common Linux kernel
>>> idiom of
>>> creating an abstract interface with a struct filled with function
>>> pointers.
>>
>> This paragraph seems to be a remnant of the v1 implementation?
>>
> you are right. At least the last sentence is. Is the cover letter
> injected somewhere in any way once the patch is accepted? Would that
> detail alone justify the creation of a v4?
>
> I am taking note of the detail and I will correct this if there is a
> need for a v4.
Right, just make a note of it if you need to send a v4 for other reasons;
the cover letter doesn't go anywhere in the git log, but it is linked off
the git commits via the Link tag in there. And since we now have this
discussion clarifying it as well, that is fine.
I'm doing conferences in Europe until the end of next week; I'll see if I
can give this version a thorough look during that time, or worst case at
the end of it.
--
Jens Axboe
* Re: [PATCH v3 3/3] io_uring/napi: add static napi tracking strategy
From: Olivier Langlois @ 2024-09-21 13:51 UTC
To: Jens Axboe, Pavel Begunkov, io-uring
On Wed, 2024-09-18 at 20:59 -0600, Jens Axboe wrote:
> On 9/18/24 6:59 AM, Olivier Langlois wrote:
> > add the static napi tracking strategy that allows the user to
> > manually
> > manage the napi ids list to busy poll and offload the ring from
> > dynamically update the list.
>
> Add the static napi tracking strategy. That allows the user to
> manually
> manage the napi ids list for busy polling, and eliminate the overhead
> of
> dynamically updating the list from the fast path.
>
> Maybe?
ok
>
> > index b1e0e0d85349..6f0e40e1469c 100644
> > --- a/io_uring/fdinfo.c
> > +++ b/io_uring/fdinfo.c
> > @@ -46,6 +46,46 @@ static __cold int io_uring_show_cred(struct
> > seq_file *m, unsigned int id,
> > return 0;
> > }
> >
> > +#ifdef CONFIG_NET_RX_BUSY_POLL
> > +static __cold void common_tracking_show_fdinfo(struct io_ring_ctx
> > *ctx,
> > + struct seq_file *m,
> > + const char
> > *tracking_strategy)
> > +{
> > + seq_puts(m, "NAPI:\tenabled\n");
> > + seq_printf(m, "napi tracking:\t%s\n", tracking_strategy);
> > + seq_printf(m, "napi_busy_poll_dt:\t%llu\n", ctx-
> > >napi_busy_poll_dt);
> > + if (ctx->napi_prefer_busy_poll)
> > + seq_puts(m, "napi_prefer_busy_poll:\ttrue\n");
> > + else
> > + seq_puts(m, "napi_prefer_busy_poll:\tfalse\n");
> > +}
> > +
> > +static __cold void napi_show_fdinfo(struct io_ring_ctx *ctx,
> > + struct seq_file *m)
> > +{
> > + unsigned int mode = READ_ONCE(ctx->napi_track_mode);
> > +
> > + switch (mode) {
> > + case IO_URING_NAPI_TRACKING_INACTIVE:
> > + seq_puts(m, "NAPI:\tdisabled\n");
> > + break;
> > + case IO_URING_NAPI_TRACKING_DYNAMIC:
> > + common_tracking_show_fdinfo(ctx, m, "dynamic");
> > + break;
> > + case IO_URING_NAPI_TRACKING_STATIC:
> > + common_tracking_show_fdinfo(ctx, m, "static");
> > + break;
> > + default:
> > + seq_printf(m, "NAPI:\tunknown mode (%u)\n", mode);
> > + }
> > +}
> > +#else
> > +static inline void napi_show_fdinfo(struct io_ring_ctx *ctx,
> > + struct seq_file *m)
> > +{
> > +}
> > +#endif
>
> I think this code should go in napi.c, with the stub
> CONFIG_NET_RX_BUSY_POLL in napi.h. Not a huge deal.
>
> This also conflicts with your previous napi patch adding fdinfo
> support.
> What kernel is this patchset based on? You should rebase it
> for-6.12/io_uring, then it should apply to the development branch
> going
> forward too.
The repo used is Linus' head. Yes, I was surprised not to see my own
patch there. The kernel dev process is still a bit obscure to me... but I
figured that the conflict would be very trivial to solve: basically,
drop the previous patch and replace it with this one.
But yes, I can rebase the patch onto your repo. Can you provide me the
URL of your repo?
About your suggestion to move that code into napi.c: I did think about it.
In fact, the code was there at some point before submission (or in v1),
but I concluded the exact opposite: that it would be cleaner to keep
the fdinfo stuff together instead of polluting the napi TU with it...
With that argument in mind, if you still prefer to see that code in the
napi TU, I will move it there.
>
> > diff --git a/io_uring/napi.c b/io_uring/napi.c
> > index 6fc127e74f10..d98b87d346ca 100644
> > --- a/io_uring/napi.c
> > +++ b/io_uring/napi.c
> > @@ -38,22 +38,14 @@ static inline ktime_t net_to_ktime(unsigned
> > long t)
> > return ns_to_ktime(t << 10);
> > }
> >
> > -void __io_napi_add(struct io_ring_ctx *ctx, struct socket *sock)
> > +int __io_napi_add_id(struct io_ring_ctx *ctx, unsigned int
> > napi_id)
> > {
> > struct hlist_head *hash_list;
> > - unsigned int napi_id;
> > - struct sock *sk;
> > struct io_napi_entry *e;
> >
> > - sk = sock->sk;
> > - if (!sk)
> > - return;
> > -
> > - napi_id = READ_ONCE(sk->sk_napi_id);
> > -
> > /* Non-NAPI IDs can be rejected. */
> > if (napi_id < MIN_NAPI_ID)
> > - return;
> > + return -EINVAL;
> >
> > hash_list = &ctx->napi_ht[hash_min(napi_id, HASH_BITS(ctx-
> > >napi_ht))];
> >
> > @@ -62,13 +54,13 @@ void __io_napi_add(struct io_ring_ctx *ctx,
> > struct socket *sock)
> > if (e) {
> > WRITE_ONCE(e->timeout, jiffies + NAPI_TIMEOUT);
> > rcu_read_unlock();
> > - return;
> > + return -EEXIST;
> > }
> > rcu_read_unlock();
> >
> > e = kmalloc(sizeof(*e), GFP_NOWAIT);
> > if (!e)
> > - return;
> > + return -ENOMEM;
> >
> > e->napi_id = napi_id;
> > e->timeout = jiffies + NAPI_TIMEOUT;
> > @@ -77,12 +69,37 @@ void __io_napi_add(struct io_ring_ctx *ctx,
> > struct socket *sock)
> > if (unlikely(io_napi_hash_find(hash_list, napi_id))) {
> > spin_unlock(&ctx->napi_lock);
> > kfree(e);
> > - return;
> > + return -EEXIST;
> > }
>
> You could abstract this out to a prep patch, having __io_napi_add()
> return an error value. That would leave the meat of your patch
> simpler
> and easier to review.
ok
>
> > +static int __io_napi_del_id(struct io_ring_ctx *ctx, unsigned int
> > napi_id)
> > +{
> > + struct hlist_head *hash_list;
> > + struct io_napi_entry *e;
> > +
> > + /* Non-NAPI IDs can be rejected. */
> > + if (napi_id < MIN_NAPI_ID)
> > + return -EINVAL;
> > +
> > + hash_list = &ctx->napi_ht[hash_min(napi_id, HASH_BITS(ctx-
> > >napi_ht))];
> > + spin_lock(&ctx->napi_lock);
> > + e = io_napi_hash_find(hash_list, napi_id);
> > + if (unlikely(!e)) {
> > + spin_unlock(&ctx->napi_lock);
> > + return -ENOENT;
> > + }
> > +
> > + list_del_rcu(&e->list);
> > + hash_del_rcu(&e->node);
> > + kfree_rcu(e, rcu);
> > + spin_unlock(&ctx->napi_lock);
> > + return 0;
> > }
>
> For new code, not a bad idea to use:
>
> guard(spinlock)(&ctx->napi_lock);
>
> Only one cleanup path here, but...
First time I hear about this... some sort of RAII, where the compiler
automatically adds the unlock call at every exit point?
OK, I'll look into it.
>
> > static void __io_napi_remove_stale(struct io_ring_ctx *ctx)
> > @@ -141,8 +158,26 @@ static bool io_napi_busy_loop_should_end(void
> > *data,
> > return false;
> > }
> >
> > -static bool __io_napi_do_busy_loop(struct io_ring_ctx *ctx,
> > - void *loop_end_arg)
> > +/*
> > + * never report stale entries
> > + */
> > +static bool static_tracking_do_busy_loop(struct io_ring_ctx *ctx,
> > + void *loop_end_arg)
> > +{
> > + struct io_napi_entry *e;
> > + bool (*loop_end)(void *, unsigned long) = NULL;
> > +
> > + if (loop_end_arg)
> > + loop_end = io_napi_busy_loop_should_end;
> > +
> > + list_for_each_entry_rcu(e, &ctx->napi_list, list)
> > + napi_busy_loop_rcu(e->napi_id, loop_end,
> > loop_end_arg,
> > + ctx->napi_prefer_busy_poll,
> > BUSY_POLL_BUDGET);
> > + return false;
> > +}
> > +
> > +static bool dynamic_tracking_do_busy_loop(struct io_ring_ctx *ctx,
> > + void *loop_end_arg)
> > {
> > struct io_napi_entry *e;
> > bool (*loop_end)(void *, unsigned long) = NULL;
>
> This is somewhat convoluted, but I think it'd be cleaner to have a
> prep
> patch that just passes in both loop_end and loop_end_arg to the
> caller?
> Some of this predates your changes here, but seems there's room for
> cleaning this up. What do you think?
I am not sure I see your point that it would make the code cleaner. TBH,
I did not fully understand what the point of the whole loop_end business
was, but since it was unrelated to my changes, I left it undisturbed.
I'll take a second look at it to see if I can clean this up a bit...
>
> > diff --git a/io_uring/napi.h b/io_uring/napi.h
> > index 27b88c3eb428..220574522484 100644
> > --- a/io_uring/napi.h
> > +++ b/io_uring/napi.h
> > @@ -54,13 +54,20 @@ static inline void io_napi_add(struct io_kiocb
> > *req)
> > {
> > struct io_ring_ctx *ctx = req->ctx;
> > struct socket *sock;
> > + struct sock *sk;
> >
> > - if (!READ_ONCE(ctx->napi_enabled))
> > + if (READ_ONCE(ctx->napi_track_mode) !=
> > IO_URING_NAPI_TRACKING_DYNAMIC)
> > return;
> >
> > sock = sock_from_file(req->file);
> > - if (sock)
> > - __io_napi_add(ctx, sock);
> > + if (!sock)
> > + return;
> > +
> > + sk = sock->sk;
> > + if (!sk)
> > + return;
> > +
> > + __io_napi_add_id(ctx, READ_ONCE(sk->sk_napi_id));
> > }
>
> I like having this follow the expected outcome, which is that sock
> and
> sk are valid.
>
> sock = sock_from_file(req->file);
> if (sock && sock->sk)
> __io_napi_add_id(ctx, READ_ONCE(sock->sk-
> >sk_napi_id));
>
> or something like that. At least to me that's more readable.
>
OK.
All those points will be addressed in v4.