* napi_busy_poll
From: Olivier Langlois @ 2022-02-08 14:58 UTC
To: io-uring

Hi,

I was wondering if integrating the NAPI busy poll for socket fds into
io_uring, the way select/poll/epoll already do it, has ever been
considered?

It seems to me that it could be an awesome feature when used along with
an sqpoll thread, and something not too difficult to add...

Greetings,
Olivier
* Re: napi_busy_poll
From: Jens Axboe @ 2022-02-08 17:05 UTC
To: Olivier Langlois, io-uring

On 2/8/22 7:58 AM, Olivier Langlois wrote:
> Hi,
>
> I was wondering if integrating the NAPI busy poll for socket fds into
> io_uring, the way select/poll/epoll already do it, has ever been
> considered?
>
> It seems to me that it could be an awesome feature when used along with
> an sqpoll thread, and something not too difficult to add...

Should be totally doable, and it's been brought up before; it just
needs someone to actually do it... Would love to see it.

-- 
Jens Axboe
* Re: napi_busy_poll
From: Hao Xu @ 2022-02-09 3:34 UTC
To: Jens Axboe, Olivier Langlois, io-uring

On 2022/2/9 1:05 AM, Jens Axboe wrote:
> On 2/8/22 7:58 AM, Olivier Langlois wrote:
>> I was wondering if integrating the NAPI busy poll for socket fds into
>> io_uring, the way select/poll/epoll already do it, has ever been
>> considered?
>
> Should be totally doable, and it's been brought up before; it just
> needs someone to actually do it... Would love to see it.
>
We've done some investigation before, and would like to have a try.

Regards,
Hao
* Re: napi_busy_poll
From: Olivier Langlois @ 2022-02-12 19:51 UTC
To: Hao Xu, Jens Axboe, io-uring

On Wed, 2022-02-09 at 11:34 +0800, Hao Xu wrote:
> We've done some investigation before, and would like to have a try.
>
Hao,

Let me know if I can help you with coding or testing. I have done some
very preliminary investigation too. It doesn't seem like it would be
very hard to implement, but I get confused by the small details.

For instance, the epoll implementation, unless there is something that
I don't understand, appears to have a serious limitation: it seems like
it would not work correctly if there are sockets associated with more
than one NAPI device in the fd set. As far as I am concerned, that
limitation would be OK, since in my setup I only use one device, but if
it was necessary to be better than the epoll implementation, I am not
sure at all how this could be addressed. I do not have enough kernel
dev experience to find easy solutions to those types of issues...

Worst case scenario, I guess that I could give it a shot creating a
good enough implementation for my needs and show it to the list to get
feedback...

Greetings,
Olivier
* Re: napi_busy_poll
From: Jens Axboe @ 2022-02-13 18:47 UTC
To: Olivier Langlois, Hao Xu, io-uring

On 2/12/22 12:51 PM, Olivier Langlois wrote:
> Worst case scenario, I guess that I could give it a shot creating a
> good enough implementation for my needs and show it to the list to get
> feedback...

That is, in fact, usually the best way to get feedback on a change! I
believe there's even a law named for it, though local to Linux I
attribute it to akpm. Hence I'd encourage you to do that; it's the best
and fastest way to make progress on the solution.

-- 
Jens Axboe
* Re: napi_busy_poll
From: Hao Xu @ 2022-02-14 17:13 UTC
To: Olivier Langlois, Jens Axboe, io-uring

On 2/13/22 03:51, Olivier Langlois wrote:
> Let me know if I can help you with coding or testing. I have done some
> very preliminary investigation too. It doesn't seem like it would be
> very hard to implement, but I get confused by the small details.
>
> For instance, the epoll implementation, unless there is something that
> I don't understand, appears to have a serious limitation: it seems like
> it would not work correctly if there are sockets associated with more
> than one NAPI device in the fd set.

Yes, it seems that epoll_wait only does busy polling for one NAPI. I
think it is because the busy polling there is just an optimization
(doing some polling before trapping into sleep), not a must-have, so
it's a kind of trade-off between polling and reacting to other events,
I guess. Not very sure about this either...

The io_uring implementation I'm thinking of is polling for every NAPI
involved.

Regards,
Hao

> limitation would be ok since in my setup I only use 1 device but if it
> was necessary to be better than the epoll implementation, I am not sure
> at all how this could be addressed. [...]
* Re: napi_busy_poll
From: Olivier Langlois @ 2022-02-15 8:37 UTC
To: Hao Xu, Jens Axboe, io-uring

On Tue, 2022-02-15 at 01:13 +0800, Hao Xu wrote:
> Yes, it seems that epoll_wait only does busy polling for one NAPI. I
> think it is because the busy polling there is just an optimization
> (doing some polling before trapping into sleep), not a must-have, so
> it's a kind of trade-off between polling and reacting to other events,
> I guess. Not very sure about this either...
>
> The io_uring implementation I'm thinking of is polling for every NAPI
> involved.

Hao,

I have found the explanation for the epoll oddity in
https://legacy.netdevconf.info/2.1/slides/apr6/dumazet-BUSY-POLLING-Netdev-2.1.pdf:

    Linux-4.12 changes: epoll() support was added by Sridhar Samudrala
    and Alexander Duyck, with the assumption that an application using
    epoll() and busy polling would first make sure that it would
    classify sockets based on their receive queue (NAPI ID), and use at
    least one epoll fd per receive queue. SO_INCOMING_NAPI_ID was added
    as a new socket option to retrieve this information, instead of
    relying on other mechanisms (CPU or NUMA identifications).

I have created a small toy implementation with some limitations:

1. It assumes a single napi_id per io_uring ctx, like what epoll does.
2. It does not detect when pending requests using supporting sockets
   are all gone.

That being said, I have not been able to make it work yet. For some
unknown reason, no valid napi_id is extracted from the sockets added to
the context, so napi_busy_loop() is never called.

I find that very strange, since prior to using io_uring my code was
using epoll, and the busy polling was working fine with my application
sockets. Something is escaping my comprehension. I must be tired and
this will become obvious...

In the meantime, here is what I have created so far. Feel free to play
with it and/or enhance it:

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 77b9c7e4793b..d3deca9b9ef5 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -63,6 +63,7 @@
 #include <net/sock.h>
 #include <net/af_unix.h>
 #include <net/scm.h>
+#include <net/busy_poll.h>
 #include <linux/anon_inodes.h>
 #include <linux/sched/mm.h>
 #include <linux/uaccess.h>
@@ -395,6 +396,10 @@ struct io_ring_ctx {
 	struct list_head	sqd_list;
 
 	unsigned long		check_cq_overflow;
+#ifdef CONFIG_NET_RX_BUSY_POLL
+	/* used to track busy poll napi_id */
+	unsigned int		napi_id;
+#endif
 
 	struct {
 		unsigned		cached_cq_tail;
@@ -6976,7 +6981,40 @@ static inline struct file *io_file_get_fixed(struct io_ring_ctx *ctx,
 	io_req_set_rsrc_node(req, ctx);
 	return file;
 }
+#ifdef CONFIG_NET_RX_BUSY_POLL
+/*
+ * Set busy poll NAPI ID from sk.
+ */
+static inline void io_set_busy_poll_napi_id(struct io_ring_ctx *ctx,
+					    struct file *file)
+{
+	unsigned int napi_id;
+	struct socket *sock;
+	struct sock *sk;
+
+	if (!net_busy_loop_on())
+		return;
+
+	sock = sock_from_file(file);
+	if (!sock)
+		return;
+
+	sk = sock->sk;
+	if (!sk)
+		return;
+
+	napi_id = READ_ONCE(sk->sk_napi_id);
+
+	/* Non-NAPI IDs can be rejected
+	 * or
+	 * Nothing to do if we already have this ID
+	 */
+	if (napi_id < MIN_NAPI_ID || napi_id == ctx->napi_id)
+		return;
+
+	/* record NAPI ID for use in next busy poll */
+	ctx->napi_id = napi_id;
+}
+#endif
 static struct file *io_file_get_normal(struct io_ring_ctx *ctx,
 				       struct io_kiocb *req, int fd)
 {
@@ -6985,8 +7023,14 @@ static struct file *io_file_get_normal(struct io_ring_ctx *ctx,
 	trace_io_uring_file_get(ctx, fd);
 
 	/* we don't allow fixed io_uring files */
-	if (file && unlikely(file->f_op == &io_uring_fops))
-		io_req_track_inflight(req);
+	if (file) {
+		if (unlikely(file->f_op == &io_uring_fops))
+			io_req_track_inflight(req);
+#ifdef CONFIG_NET_RX_BUSY_POLL
+		else
+			io_set_busy_poll_napi_id(ctx, file);
+#endif
+	}
 	return file;
 }
 
@@ -7489,7 +7533,22 @@ static inline void io_ring_clear_wakeup_flag(struct io_ring_ctx *ctx)
 				ctx->rings->sq_flags & ~IORING_SQ_NEED_WAKEUP);
 	spin_unlock(&ctx->completion_lock);
 }
+#ifdef CONFIG_NET_RX_BUSY_POLL
+/*
+ * Busy poll if globally on and supporting sockets found
+ */
+static inline bool io_napi_busy_loop(struct io_ring_ctx *ctx)
+{
+	unsigned int napi_id = ctx->napi_id;
+
+	if ((napi_id >= MIN_NAPI_ID) && net_busy_loop_on()) {
+		napi_busy_loop(napi_id, NULL, NULL, true,
+			       BUSY_POLL_BUDGET);
+		return true;
+	}
+	return false;
+}
+#endif
 static int __io_sq_thread(struct io_ring_ctx *ctx, bool cap_entries)
 {
 	unsigned int to_submit;
@@ -7518,7 +7577,10 @@ static int __io_sq_thread(struct io_ring_ctx *ctx, bool cap_entries)
 		    !(ctx->flags & IORING_SETUP_R_DISABLED))
 			ret = io_submit_sqes(ctx, to_submit);
 		mutex_unlock(&ctx->uring_lock);
-
+#ifdef CONFIG_NET_RX_BUSY_POLL
+		if (io_napi_busy_loop(ctx))
+			++ret;
+#endif
 		if (to_submit && wq_has_sleeper(&ctx->sqo_sq_wait))
 			wake_up(&ctx->sqo_sq_wait);
 		if (creds)
@@ -7649,6 +7711,9 @@ struct io_wait_queue {
 	struct io_ring_ctx *ctx;
 	unsigned cq_tail;
 	unsigned nr_timeouts;
+#ifdef CONFIG_NET_RX_BUSY_POLL
+	unsigned busy_poll_to;
+#endif
 };
 
 static inline bool io_should_wake(struct io_wait_queue *iowq)
@@ -7709,6 +7774,29 @@ static inline int io_cqring_wait_schedule(struct io_ring_ctx *ctx,
 	return !*timeout ? -ETIME : 1;
 }
 
+#ifdef CONFIG_NET_RX_BUSY_POLL
+static inline bool io_busy_loop_timeout(unsigned long start_time,
+					unsigned long bp_usec)
+{
+	if (bp_usec) {
+		unsigned long end_time = start_time + bp_usec;
+		unsigned long now = busy_loop_current_time();
+
+		return time_after(now, end_time);
+	}
+	return true;
+}
+
+static bool io_busy_loop_end(void *p, unsigned long start_time)
+{
+	struct io_wait_queue *iowq = p;
+
+	return io_busy_loop_timeout(start_time, iowq->busy_poll_to) ||
+	       io_run_task_work_sig() ||
+	       io_should_wake(iowq);
+}
+#endif
+
 /*
  * Wait until events become available, if we don't already have some. The
  * application must reap them itself, as they reside on the shared cq ring.
@@ -7729,12 +7817,33 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 		if (!io_run_task_work())
 			break;
 	} while (1);
-
+#ifdef CONFIG_NET_RX_BUSY_POLL
+	iowq.busy_poll_to = 0;
+#endif
 	if (uts) {
 		struct timespec64 ts;
 
 		if (get_timespec64(&ts, uts))
 			return -EFAULT;
+#ifdef CONFIG_NET_RX_BUSY_POLL
+		if (!(ctx->flags & IORING_SETUP_SQPOLL) &&
+		    (ctx->napi_id >= MIN_NAPI_ID) && net_busy_loop_on()) {
+			unsigned busy_poll_to =
+				READ_ONCE(sysctl_net_busy_poll);
+			struct timespec64 pollto =
+				ns_to_timespec64(1000 * busy_poll_to);
+
+			if (timespec64_compare(&ts, &pollto) > 0) {
+				ts = timespec64_sub(ts, pollto);
+				iowq.busy_poll_to = busy_poll_to;
+			} else {
+				iowq.busy_poll_to = timespec64_to_ns(&ts) / 1000;
+				ts.tv_sec = 0;
+				ts.tv_nsec = 0;
+			}
+		}
+#endif
 		timeout = timespec64_to_jiffies(&ts);
 	}
 
@@ -7759,6 +7868,11 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 	iowq.cq_tail = READ_ONCE(ctx->rings->cq.head) + min_events;
 
 	trace_io_uring_cqring_wait(ctx, min_events);
+#ifdef CONFIG_NET_RX_BUSY_POLL
+	if (iowq.busy_poll_to)
+		napi_busy_loop(ctx->napi_id, io_busy_loop_end, &iowq, true,
+			       BUSY_POLL_BUDGET);
+#endif
 	do {
 		/* if we can't even flush overflow, don't wait for more */
 		if (!io_cqring_overflow_flush(ctx)) {
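For what it's worth, a minimal userspace sketch of how this
io_cqring_wait() path would be driven; a sketch only, assuming liburing
and net.core.busy_poll > 0, untested:

#include <liburing.h>

/* sketch: waiting with a timeout is what hands a uts to the kernel,
 * which the patched io_cqring_wait() splits into a busy-poll budget
 * and a regular sleep; needs a socket with a valid napi_id, i.e.
 * traffic received from a real NIC, not loopback */
static int recv_with_busy_poll(struct io_uring *ring, int sockfd,
			       char *buf, unsigned len)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
	struct io_uring_cqe *cqe;
	struct __kernel_timespec ts = { .tv_sec = 0, .tv_nsec = 1000000 };
	int ret;

	io_uring_prep_recv(sqe, sockfd, buf, len, 0);
	io_uring_submit(ring);

	ret = io_uring_wait_cqe_timeout(ring, &cqe, &ts);
	if (ret < 0)
		return ret;
	ret = cqe->res;
	io_uring_cqe_seen(ring, cqe);
	return ret;
}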
* Re: napi_busy_poll
From: Olivier Langlois @ 2022-02-15 18:05 UTC
To: Hao Xu, Jens Axboe, io-uring

On Tue, 2022-02-15 at 03:37 -0500, Olivier Langlois wrote:
> That being said, I have not been able to make it work yet. For some
> unknown reason, no valid napi_id is extracted from the sockets added
> to the context, so napi_busy_loop() is never called.
>
> I find that very strange, since prior to using io_uring my code was
> using epoll, and the busy polling was working fine with my application
> sockets. Something is escaping my comprehension. I must be tired and
> this will become obvious...
>
The napi_id values associated with my sockets appear to be in the range
0 < napi_id < MIN_NAPI_ID.

From busy_loop.h:

/* 0 - Reserved to indicate value not set
 * 1..NR_CPUS - Reserved for sender_cpu
 * NR_CPUS+1..~0 - Region available for NAPI IDs
 */
#define MIN_NAPI_ID ((unsigned int)(NR_CPUS + 1))

I have found this:
https://lwn.net/Articles/619862/

hinting that busy_poll may be incompatible with RPS
(Documentation/networking/scaling.rst), which I may have discovered
*AFTER* my epoll -> io_uring transition (I don't recall exactly the
sequence of my learning process).

With my current knowledge, it makes little sense why busy polling would
not be possible with RPS. Also, what exactly a NAPI device is remains
quite nebulous to me... Looking into the Intel igb driver code, it
seems like one NAPI device is created for each interrupt vector/Rx
buffer of the device.

Bottom line, it seems like I have fallen into a new rabbit hole. It may
take me a day or two to figure it all out... You are welcome to
enlighten me if you know a thing or two about those topics... I am
kinda lost right now...
* Re: napi_busy_poll
From: Hao Xu @ 2022-02-16 3:12 UTC
To: Olivier Langlois, Jens Axboe, io-uring

On 2022/2/16 2:05 AM, Olivier Langlois wrote:
> The napi_id values associated with my sockets appear to be in the range
> 0 < napi_id < MIN_NAPI_ID.
>
> I have found this:
> https://lwn.net/Articles/619862/
>
> hinting that busy_poll may be incompatible with RPS
> (Documentation/networking/scaling.rst), which I may have discovered
> *AFTER* my epoll -> io_uring transition (I don't recall exactly the
> sequence of my learning process).

I read your code. I guess the thing is that sk->napi_id is set from
skb->napi_id, and the latter is set when the net device has received
some packets.

> With my current knowledge, it makes little sense why busy polling would
> not be possible with RPS. Also, what exactly a NAPI device is remains
> quite nebulous to me... Looking into the Intel igb driver code, it
> seems like one NAPI device is created for each interrupt vector/Rx
> buffer of the device.

AFAIK, yes, each Rx ring has its own NAPI.

> Bottom line, it seems like I have fallen into a new rabbit hole. It may
> take me a day or two to figure it all out... You are welcome to
> enlighten me if you know a thing or two about those topics... I am
> kinda lost right now...
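The mechanism, paraphrased from include/net/busy_poll.h of that era (a
sketch from memory, not an exact quote): the receive path copies the
skb's NAPI id into the socket, so a socket fed only through loopback
never gets a valid id.

/* paraphrased sketch of the helper the receive path uses; the driver
 * tags each skb with the id of the NAPI instance that received it */
static inline void sk_mark_napi_id(struct sock *sk, const struct sk_buff *skb)
{
#ifdef CONFIG_NET_RX_BUSY_POLL
	WRITE_ONCE(sk->sk_napi_id, skb->napi_id);
#endif
}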
* Re: napi_busy_poll
From: Olivier Langlois @ 2022-02-16 19:19 UTC
To: Hao Xu, Jens Axboe, io-uring

On Wed, 2022-02-16 at 11:12 +0800, Hao Xu wrote:
> I read your code. I guess the thing is that sk->napi_id is set from
> skb->napi_id, and the latter is set when the net device has received
> some packets.
>
My dive into the net/core code has been beneficial!

I have found out that the reason why I did not have a napi_id for my
sockets is that I have introduced a local SOCKS proxy into my setup. By
using the loopback device, this de facto removes NAPI from the picture.

After having fixed this issue, I have started to test my code. The
modified io_cqring_wait() code does not work. With a pending recv()
request, the moment napi_busy_loop() is called, the recv() request
fails with an EFAULT error.

I suspect this might be because io_busy_loop_end() is doing something
that is not allowed while inside napi_busy_loop(). The simpler code
change inside __io_sq_thread() might work, but I still have to validate
it.

I'll update later today or tomorrow with the latest results and
discoveries!

Greetings,
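For anyone reproducing this: the NAPI id the kernel recorded for a
socket can be checked from userspace with the SO_INCOMING_NAPI_ID
option mentioned in the netdev slides earlier in the thread. A minimal
sketch (untested):

#include <stdio.h>
#include <sys/socket.h>

/* returns the socket's NAPI id, or 0 if it never received traffic
 * from a NAPI-capable device (e.g. loopback only) */
static unsigned int socket_napi_id(int fd)
{
	unsigned int napi_id = 0;
	socklen_t len = sizeof(napi_id);

	if (getsockopt(fd, SOL_SOCKET, SO_INCOMING_NAPI_ID, &napi_id, &len))
		perror("getsockopt(SO_INCOMING_NAPI_ID)");
	return napi_id;
}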
* Re: napi_busy_poll
From: Hao Xu @ 2022-02-16 12:14 UTC
To: Olivier Langlois, Jens Axboe, io-uring

On 2022/2/16 2:05 AM, Olivier Langlois wrote:
> Bottom line, it seems like I have fallen into a new rabbit hole. It may
> take me a day or two to figure it all out... You are welcome to
> enlighten me if you know a thing or two about those topics... I am
> kinda lost right now...

Hi Olivier,

I've written something to express my idea; it would be great if you
could try it. It's totally untested and only does polling in the
sqthread; it won't be hard to expand it to cqring_wait. My original
idea was to poll all the NAPI devices, but that may not be efficient,
so for a request, just do NAPI polling for one NAPI.

There is still one problem: when to delete the polled NAPIs.

Regards,
Hao

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 538f90bd0508..2e32d5fe0641 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -63,6 +63,7 @@
 #include <net/sock.h>
 #include <net/af_unix.h>
 #include <net/scm.h>
+#include <net/busy_poll.h>
 #include <linux/anon_inodes.h>
 #include <linux/sched/mm.h>
 #include <linux/uaccess.h>
@@ -443,6 +444,7 @@ struct io_ring_ctx {
 		spinlock_t		rsrc_ref_lock;
 	};
 
+	struct list_head	napi_list;
 	/* Keep this last, we don't need it for the fast path */
 	struct {
 #if defined(CONFIG_UNIX)
@@ -1457,6 +1459,7 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 	INIT_WQ_LIST(&ctx->locked_free_list);
 	INIT_DELAYED_WORK(&ctx->fallback_work, io_fallback_req_func);
 	INIT_WQ_LIST(&ctx->submit_state.compl_reqs);
+	INIT_LIST_HEAD(&ctx->napi_list);
 	return ctx;
 err:
 	kfree(ctx->dummy_ubuf);
@@ -5419,6 +5422,70 @@ IO_NETOP_FN(send);
 IO_NETOP_FN(recv);
 #endif /* CONFIG_NET */
 
+#ifdef CONFIG_NET_RX_BUSY_POLL
+struct napi_entry {
+	struct list_head	list;
+	unsigned int		napi_id;
+};
+
+static void io_add_napi(struct file *file, struct io_ring_ctx *ctx)
+{
+	unsigned int napi_id;
+	struct socket *sock;
+	struct sock *sk;
+	struct napi_entry *ne;
+
+	if (!net_busy_loop_on())
+		return;
+
+	sock = sock_from_file(file);
+	if (!sock)
+		return;
+
+	sk = sock->sk;
+	if (!sk)
+		return;
+
+	napi_id = READ_ONCE(sk->sk_napi_id);
+	if (napi_id < MIN_NAPI_ID)
+		return;
+
+	list_for_each_entry(ne, &ctx->napi_list, list) {
+		if (ne->napi_id == napi_id)
+			return;
+	}
+
+	ne = kmalloc(sizeof(*ne), GFP_KERNEL);
+	if (!ne)
+		return;
+
+	ne->napi_id = napi_id;
+	list_add_tail(&ne->list, &ctx->napi_list);
+}
+
+static void io_napi_busy_loop(struct io_ring_ctx *ctx)
+{
+	struct napi_entry *ne;
+
+	if (list_empty(&ctx->napi_list) || !net_busy_loop_on())
+		return;
+
+	list_for_each_entry(ne, &ctx->napi_list, list)
+		napi_busy_loop(ne->napi_id, NULL, NULL, false,
+			       BUSY_POLL_BUDGET);
+}
+#else
+
+static inline void io_add_napi(struct file *file, struct io_ring_ctx *ctx)
+{
+	return;
+}
+
+static inline void io_napi_busy_loop(struct io_ring_ctx *ctx)
+{
+	return;
+}
+#endif /* CONFIG_NET_RX_BUSY_POLL */
+
 struct io_poll_table {
 	struct poll_table_struct pt;
 	struct io_kiocb *req;
@@ -5583,6 +5650,7 @@ static void io_poll_task_func(struct io_kiocb *req, bool *locked)
 	struct io_ring_ctx *ctx = req->ctx;
 	int ret;
 
+	io_add_napi(req->file, req->ctx);
 	ret = io_poll_check_events(req);
 	if (ret > 0)
 		return;
@@ -5608,6 +5676,7 @@ static void io_apoll_task_func(struct io_kiocb *req, bool *locked)
 	struct io_ring_ctx *ctx = req->ctx;
 	int ret;
 
+	io_add_napi(req->file, req->ctx);
 	ret = io_poll_check_events(req);
 	if (ret > 0)
 		return;
@@ -7544,6 +7613,9 @@ static int __io_sq_thread(struct io_ring_ctx *ctx, bool cap_entries)
 			wake_up(&ctx->sqo_sq_wait);
 		if (creds)
 			revert_creds(creds);
+#ifdef CONFIG_NET_RX_BUSY_POLL
+		io_napi_busy_loop(ctx);
+#endif
 	}
 
 	return ret;
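Since this version only polls from the sqthread, exercising it would
require an SQPOLL ring. A minimal setup sketch (assuming liburing;
untested):

#include <liburing.h>

/* sketch: an SQPOLL ring; with the patch above, the sq thread will
 * busy-poll every NAPI id collected from polled sockets on each
 * iteration of its loop */
static int setup_sqpoll_ring(struct io_uring *ring)
{
	struct io_uring_params params = {
		.flags = IORING_SETUP_SQPOLL,
		.sq_thread_idle = 2000,	/* ms before the thread sleeps */
	};

	return io_uring_queue_init_params(256, ring, &params);
}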
* Re: napi_busy_poll
From: Olivier Langlois @ 2022-02-17 20:28 UTC
To: Hao Xu, Jens Axboe, io-uring

On Wed, 2022-02-16 at 20:14 +0800, Hao Xu wrote:
> Hi Olivier,
> I've written something to express my idea; it would be great if you
> could try it. It's totally untested and only does polling in the
> sqthread; it won't be hard to expand it to cqring_wait. My original
> idea was to poll all the NAPI devices, but that may not be efficient,
> so for a request, just do NAPI polling for one NAPI.
>
> There is still one problem: when to delete the polled NAPIs.
>
Hi Hao,

I am going to give your patch a try, hopefully later today.

On my side, I have made a small change to my code and it started to
work.

I did remove the call to io_run_task_work() from io_busy_loop_end().
While inside napi_busy_loop(), preemption is disabled, local_bh too,
and the function acquires a napi_poll lock. I haven't been able to put
my finger on exactly why, but it appears that in that state the net
subsystem is not reentrant. Therefore, I did replace the call to
io_run_task_work() with a call to signal_pending(), just to know if
there are pending task works so that they can be handled outside the
napi_busy_loop.

I am not sure how a socket is assigned to a NAPI device, but I got a
few with my 50-sockets program (8 to be exact):

[2022-02-17 09:59:10] INFO WSBASE/client_established 706 LWS_CALLBACK_CLIENT_ESTABLISHED client 50(17), napi_id: 10
[2022-02-17 09:59:10] INFO WSBASE/client_established 696 LWS_CALLBACK_CLIENT_ESTABLISHED client 3(63), napi_id: 12
[2022-02-17 09:59:10] INFO WSBASE/client_established 696 LWS_CALLBACK_CLIENT_ESTABLISHED client 49(16), napi_id: 15
[2022-02-17 09:59:10] INFO WSBASE/client_established 696 LWS_CALLBACK_CLIENT_ESTABLISHED client 15(51), napi_id: 12
[2022-02-17 09:59:10] INFO WSBASE/client_established 696 LWS_CALLBACK_CLIENT_ESTABLISHED client 31(35), napi_id: 16
[2022-02-17 09:59:10] INFO WSBASE/client_established 696 LWS_CALLBACK_CLIENT_ESTABLISHED client 14(52), napi_id: 14
[2022-02-17 09:59:10] INFO WSBASE/client_established 696 LWS_CALLBACK_CLIENT_ESTABLISHED client 11(55), napi_id: 12
[2022-02-17 09:59:10] INFO WSBASE/client_established 696 LWS_CALLBACK_CLIENT_ESTABLISHED client 16(50), napi_id: 9
[2022-02-17 09:59:10] INFO WSBASE/client_established 696 LWS_CALLBACK_CLIENT_ESTABLISHED client 40(26), napi_id: 9
[2022-02-17 09:59:10] INFO WSBASE/client_established 696 LWS_CALLBACK_CLIENT_ESTABLISHED client 39(27), napi_id: 14
[2022-02-17 09:59:10] INFO WSBASE/client_established 696 LWS_CALLBACK_CLIENT_ESTABLISHED client 8(58), napi_id: 10
[2022-02-17 09:59:10] INFO WSBASE/client_established 696 LWS_CALLBACK_CLIENT_ESTABLISHED client 20(46), napi_id: 13
[2022-02-17 09:59:10] INFO WSBASE/client_established 696 LWS_CALLBACK_CLIENT_ESTABLISHED client 7(59), napi_id: 16
[2022-02-17 09:59:10] INFO WSBASE/client_established 696 LWS_CALLBACK_CLIENT_ESTABLISHED client 6(60), napi_id: 16
[2022-02-17 09:59:10] INFO WSBASE/client_established 696 LWS_CALLBACK_CLIENT_ESTABLISHED client 22(44), napi_id: 16
[2022-02-17 09:59:10] INFO WSBASE/client_established 696 LWS_CALLBACK_CLIENT_ESTABLISHED client 13(53), napi_id: 9
[2022-02-17 09:59:10] INFO WSBASE/client_established 696 LWS_CALLBACK_CLIENT_ESTABLISHED client 38(28), napi_id: 9
[2022-02-17 09:59:10] INFO WSBASE/client_established 696 LWS_CALLBACK_CLIENT_ESTABLISHED client 21(45), napi_id: 12
[2022-02-17 09:59:10] INFO WSBASE/client_established 696 LWS_CALLBACK_CLIENT_ESTABLISHED client 4(62), napi_id: 15
[2022-02-17 09:59:10] INFO WSBASE/client_established 696 LWS_CALLBACK_CLIENT_ESTABLISHED client 35(31), napi_id: 13
[2022-02-17 09:59:10] INFO WSBASE/client_established 696 LWS_CALLBACK_CLIENT_ESTABLISHED client 25(41), napi_id: 12
[2022-02-17 09:59:10] INFO WSBASE/client_established 696 LWS_CALLBACK_CLIENT_ESTABLISHED client 18(48), napi_id: 16
[2022-02-17 09:59:10] INFO WSBASE/client_established 696 LWS_CALLBACK_CLIENT_ESTABLISHED client 12(54), napi_id: 13
[2022-02-17 09:59:10] INFO WSBASE/client_established 696 LWS_CALLBACK_CLIENT_ESTABLISHED client 5(61), napi_id: 9
[2022-02-17 09:59:10] INFO WSBASE/client_established 696 LWS_CALLBACK_CLIENT_ESTABLISHED client 23(43), napi_id: 13
[2022-02-17 09:59:10] INFO WSBASE/client_established 696 LWS_CALLBACK_CLIENT_ESTABLISHED client 46(20), napi_id: 11
[2022-02-17 09:59:10] INFO WSBASE/client_established 696 LWS_CALLBACK_CLIENT_ESTABLISHED client 43(23), napi_id: 9
[2022-02-17 09:59:10] INFO WSBASE/client_established 696 LWS_CALLBACK_CLIENT_ESTABLISHED client 9(57), napi_id: 9
[2022-02-17 09:59:10] INFO WSBASE/client_established 696 LWS_CALLBACK_CLIENT_ESTABLISHED client 29(37), napi_id: 14
[2022-02-17 09:59:10] INFO WSBASE/client_established 696 LWS_CALLBACK_CLIENT_ESTABLISHED client 28(38), napi_id: 15
[2022-02-17 09:59:10] INFO WSBASE/client_established 696 LWS_CALLBACK_CLIENT_ESTABLISHED client 33(33), napi_id: 13
[2022-02-17 09:59:10] INFO WSBASE/client_established 696 LWS_CALLBACK_CLIENT_ESTABLISHED client 27(39), napi_id: 10
[2022-02-17 09:59:10] INFO WSBASE/client_established 696 LWS_CALLBACK_CLIENT_ESTABLISHED client 32(34), napi_id: 12
[2022-02-17 09:59:10] INFO WSBASE/client_established 696 LWS_CALLBACK_CLIENT_ESTABLISHED client 1(65), napi_id: 10
[2022-02-17 09:59:10] INFO WSBASE/client_established 696 LWS_CALLBACK_CLIENT_ESTABLISHED client 26(40), napi_id: 13
[2022-02-17 09:59:10] INFO WSBASE/client_established 696 LWS_CALLBACK_CLIENT_ESTABLISHED client 37(29), napi_id: 12
[2022-02-17 09:59:10] INFO WSBASE/client_established 696 LWS_CALLBACK_CLIENT_ESTABLISHED client 30(36), napi_id: 9
[2022-02-17 09:59:10] INFO WSBASE/client_established 696 LWS_CALLBACK_CLIENT_ESTABLISHED client 47(19), napi_id: 14
[2022-02-17 09:59:10] INFO WSBASE/client_established 696 LWS_CALLBACK_CLIENT_ESTABLISHED client 10(56), napi_id: 11
[2022-02-17 09:59:10] INFO WSBASE/client_established 696 LWS_CALLBACK_CLIENT_ESTABLISHED client 44(22), napi_id: 16
[2022-02-17 09:59:10] INFO WSBASE/client_established 696 LWS_CALLBACK_CLIENT_ESTABLISHED client 17(49), napi_id: 11
[2022-02-17 09:59:10] INFO WSBASE/client_established 696 LWS_CALLBACK_CLIENT_ESTABLISHED client 41(25), napi_id: 13
[2022-02-17 09:59:10] INFO WSBASE/client_established 696 LWS_CALLBACK_CLIENT_ESTABLISHED client 48(18), napi_id: 10
[2022-02-17 09:59:10] INFO WSBASE/client_established 696 LWS_CALLBACK_CLIENT_ESTABLISHED client 2(64), napi_id: 15
[2022-02-17 09:59:10] INFO WSBASE/client_established 696 LWS_CALLBACK_CLIENT_ESTABLISHED client 45(21), napi_id: 14
[2022-02-17 09:59:10] INFO WSBASE/client_established 696 LWS_CALLBACK_CLIENT_ESTABLISHED client 24(42), napi_id: 9
[2022-02-17 09:59:10] INFO WSBASE/client_established 696 LWS_CALLBACK_CLIENT_ESTABLISHED client 34(32), napi_id: 11
[2022-02-17 09:59:10] INFO WSBASE/client_established 696 LWS_CALLBACK_CLIENT_ESTABLISHED client 42(24), napi_id: 16
[2022-02-17 09:59:10] INFO WSBASE/client_established 696 LWS_CALLBACK_CLIENT_ESTABLISHED client 36(30), napi_id: 16
[2022-02-17 09:59:10] INFO WSBASE/client_established 696 LWS_CALLBACK_CLIENT_ESTABLISHED client 19(47), napi_id: 16
[2022-02-17 09:59:11] INFO WSBASE/client_established 696 LWS_CALLBACK_CLIENT_ESTABLISHED client 38(66), napi_id: 11

The first number is the thread id (706 and 696). I have 2 threads,
with 1 io_uring context per thread. Next you have the client id, and
the number in parentheses is the socket fd.

Based on the result, it appears that having a single napi_id per
context won't do it... I wish I could pick the brains of the people
having done things the way they have done it with the epoll
implementation.

I guess that I could create 8 distinct io_uring contexts with
IORING_SETUP_ATTACH_WQ, but it seems like a big burden placed on the
shoulders of users when a simple linked list with ref-counted elements
could do it...

I think that if I merge your patch with what I have done so far, we
could get something really cool!

Concerning the remaining problem about when to remove the napi_id, I
would say that a good place to do it would be when a request is
completed and discarded, if there was a refcount added to your
napi_entry struct.

The only thing that I hate about this idea is that in my scenario, the
sockets are going to be pretty much the same for the whole io_uring
context's existence. Therefore, the whole refcounting overhead is
useless and unneeded.

Here is the latest version of my effort:

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 77b9c7e4793b..ea2a3661c16f 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -63,6 +63,7 @@
 #include <net/sock.h>
 #include <net/af_unix.h>
 #include <net/scm.h>
+#include <net/busy_poll.h>
 #include <linux/anon_inodes.h>
 #include <linux/sched/mm.h>
 #include <linux/uaccess.h>
@@ -395,6 +396,10 @@ struct io_ring_ctx {
 	struct list_head	sqd_list;
 
 	unsigned long		check_cq_overflow;
+#ifdef CONFIG_NET_RX_BUSY_POLL
+	/* used to track busy poll napi_id */
+	unsigned int		napi_id;
+#endif
 
 	struct {
 		unsigned		cached_cq_tail;
@@ -6976,7 +6981,40 @@ static inline struct file *io_file_get_fixed(struct io_ring_ctx *ctx,
 	io_req_set_rsrc_node(req, ctx);
 	return file;
 }
+#ifdef CONFIG_NET_RX_BUSY_POLL
+/*
+ * Set busy poll NAPI ID from sk.
+ */
+static inline void io_set_busy_poll_napi_id(struct io_ring_ctx *ctx,
+					    struct file *file)
+{
+	unsigned int napi_id;
+	struct socket *sock;
+	struct sock *sk;
+
+	if (!net_busy_loop_on())
+		return;
+
+	sock = sock_from_file(file);
+	if (!sock)
+		return;
+
+	sk = sock->sk;
+	if (!sk)
+		return;
+
+	napi_id = READ_ONCE(sk->sk_napi_id);
+
+	/* Non-NAPI IDs can be rejected
+	 * or
+	 * Nothing to do if we already have this ID
+	 */
+	if (napi_id < MIN_NAPI_ID || napi_id == ctx->napi_id)
+		return;
+
+	/* record NAPI ID for use in next busy poll */
+	ctx->napi_id = napi_id;
+}
+#endif
 static struct file *io_file_get_normal(struct io_ring_ctx *ctx,
 				       struct io_kiocb *req, int fd)
 {
@@ -6985,8 +7023,14 @@ static struct file *io_file_get_normal(struct io_ring_ctx *ctx,
 	trace_io_uring_file_get(ctx, fd);
 
 	/* we don't allow fixed io_uring files */
-	if (file && unlikely(file->f_op == &io_uring_fops))
-		io_req_track_inflight(req);
+	if (file) {
+		if (unlikely(file->f_op == &io_uring_fops))
+			io_req_track_inflight(req);
+#ifdef CONFIG_NET_RX_BUSY_POLL
+		else
+			io_set_busy_poll_napi_id(ctx, file);
+#endif
+	}
 	return file;
 }
 
@@ -7489,7 +7533,22 @@ static inline void io_ring_clear_wakeup_flag(struct io_ring_ctx *ctx)
 				ctx->rings->sq_flags & ~IORING_SQ_NEED_WAKEUP);
 	spin_unlock(&ctx->completion_lock);
 }
+#ifdef CONFIG_NET_RX_BUSY_POLL
+/*
+ * Busy poll if globally on and supporting sockets found
+ */
+static inline bool io_napi_busy_loop(struct io_ring_ctx *ctx)
+{
+	unsigned int napi_id = ctx->napi_id;
+
+	if ((napi_id >= MIN_NAPI_ID) && net_busy_loop_on()) {
+		napi_busy_loop(napi_id, NULL, NULL, true,
+			       BUSY_POLL_BUDGET);
+		return true;
+	}
+	return false;
+}
+#endif
 static int __io_sq_thread(struct io_ring_ctx *ctx, bool cap_entries)
 {
 	unsigned int to_submit;
@@ -7518,7 +7577,10 @@ static int __io_sq_thread(struct io_ring_ctx *ctx, bool cap_entries)
 		    !(ctx->flags & IORING_SETUP_R_DISABLED))
 			ret = io_submit_sqes(ctx, to_submit);
 		mutex_unlock(&ctx->uring_lock);
-
+#ifdef CONFIG_NET_RX_BUSY_POLL
+		if (io_napi_busy_loop(ctx))
+			++ret;
+#endif
 		if (to_submit && wq_has_sleeper(&ctx->sqo_sq_wait))
 			wake_up(&ctx->sqo_sq_wait);
 		if (creds)
@@ -7649,6 +7711,9 @@ struct io_wait_queue {
 	struct io_ring_ctx *ctx;
 	unsigned cq_tail;
 	unsigned nr_timeouts;
+#ifdef CONFIG_NET_RX_BUSY_POLL
+	unsigned busy_poll_to;
+#endif
 };
 
 static inline bool io_should_wake(struct io_wait_queue *iowq)
@@ -7709,6 +7774,29 @@ static inline int io_cqring_wait_schedule(struct io_ring_ctx *ctx,
 	return !*timeout ? -ETIME : 1;
 }
 
+#ifdef CONFIG_NET_RX_BUSY_POLL
+static inline bool io_busy_loop_timeout(unsigned long start_time,
+					unsigned long bp_usec)
+{
+	if (bp_usec) {
+		unsigned long end_time = start_time + bp_usec;
+		unsigned long now = busy_loop_current_time();
+
+		return time_after(now, end_time);
+	}
+	return true;
+}
+
+static bool io_busy_loop_end(void *p, unsigned long start_time)
+{
+	struct io_wait_queue *iowq = p;
+
+	return io_busy_loop_timeout(start_time, iowq->busy_poll_to) ||
+	       signal_pending(current) ||
+	       io_should_wake(iowq);
+}
+#endif
+
 /*
  * Wait until events become available, if we don't already have some. The
  * application must reap them itself, as they reside on the shared cq ring.
@@ -7729,12 +7817,33 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 		if (!io_run_task_work())
 			break;
 	} while (1);
-
+#ifdef CONFIG_NET_RX_BUSY_POLL
+	iowq.busy_poll_to = 0;
+#endif
 	if (uts) {
 		struct timespec64 ts;
 
 		if (get_timespec64(&ts, uts))
 			return -EFAULT;
+#ifdef CONFIG_NET_RX_BUSY_POLL
+		if (!(ctx->flags & IORING_SETUP_SQPOLL) &&
+		    (ctx->napi_id >= MIN_NAPI_ID) && net_busy_loop_on()) {
+			unsigned busy_poll_to =
+				READ_ONCE(sysctl_net_busy_poll);
+			struct timespec64 pollto =
+				ns_to_timespec64(1000 * busy_poll_to);
+
+			if (timespec64_compare(&ts, &pollto) > 0) {
+				ts = timespec64_sub(ts, pollto);
+				iowq.busy_poll_to = busy_poll_to;
+			} else {
+				iowq.busy_poll_to = timespec64_to_ns(&ts) / 1000;
+				ts.tv_sec = 0;
+				ts.tv_nsec = 0;
+			}
+		}
+#endif
 		timeout = timespec64_to_jiffies(&ts);
 	}
 
@@ -7759,6 +7868,11 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 	iowq.cq_tail = READ_ONCE(ctx->rings->cq.head) + min_events;
 
 	trace_io_uring_cqring_wait(ctx, min_events);
+#ifdef CONFIG_NET_RX_BUSY_POLL
+	if (iowq.busy_poll_to)
+		napi_busy_loop(ctx->napi_id, io_busy_loop_end, &iowq, true,
+			       BUSY_POLL_BUDGET);
+#endif
	do {
 		/* if we can't even flush overflow, don't wait for more */
 		if (!io_cqring_overflow_flush(ctx)) {
* Re: napi_busy_poll
From: Hao Xu @ 2022-02-18 8:06 UTC
To: Olivier Langlois, Jens Axboe, io-uring

On 2/18/22 04:28, Olivier Langlois wrote:
> Hi Hao,
>
> I am going to give your patch a try, hopefully later today.
>
> [...]
>
> Concerning the remaining problem about when to remove the napi_id, I
> would say that a good place to do it would be when a request is
> completed and discarded, if there was a refcount added to your
> napi_entry struct.
>
> The only thing that I hate about this idea is that in my scenario, the
> sockets are going to be pretty much the same for the whole io_uring
> context's existence. Therefore, the whole refcounting overhead is
> useless and unneeded.

I remember that now all the completions are done in the original task
(this should be confirmed again), so it should be OK to just use a
simple 'unsigned int count' to track the number of users of a NAPI
entry, and do the deletion when the count reaches 0. For your scenario,
which has only one NAPI in an io_uring context, this won't be a big
overhead either. The only thing is that we may need to optimize the
NAPI lookup process, but I'm not sure that is necessary.

Regards,
Hao
* Re: napi_busy_poll
  2022-02-18  8:06             ` napi_busy_poll Hao Xu
@ 2022-02-19  7:14               ` Olivier Langlois
  2022-02-21  4:52                 ` napi_busy_poll Hao Xu
  0 siblings, 1 reply; 23+ messages in thread
From: Olivier Langlois @ 2022-02-19  7:14 UTC (permalink / raw)
To: Hao Xu, Jens Axboe, io-uring

On Fri, 2022-02-18 at 16:06 +0800, Hao Xu wrote:
>
>> Concerning the remaining problem about when to remove the napi_id, I
>> would say that a good place to do it would be when a request is
>> completed and discarded, if there was a refcount added to your
>> napi_entry struct.
>>
>> The only thing that I hate about this idea is that in my scenario,
>> the sockets are going to be pretty much the same for the whole
>> io_uring context existence. Therefore, the whole refcounting
>> overhead is useless and unneeded.
>
> I remember that now all the completion is done in the original task
> (should be confirmed again), so it should be OK to just use a simple
> 'unsigned int count' to track the number of users of a napi entry,
> and to do the deletion when the count reaches 0. For your scenario,
> which has only one napi per io_uring context, this won't be big
> overhead either. The only thing is we may need to optimize the napi
> lookup process, but I'm not sure if that is necessary.
>
Hi Hao,

You can forget about the refcount idea. What I hated about it was that
it was incurring a cost on all io_uring users, including the vast
majority of them that won't be using napi_busy_poll...

One thing that I know Pavel hates is when hot paths are littered with
a bunch of new code. I have been very careful about that in my design.

I think that I have found a much better approach to tackle the problem
of when to remove the napi_ids... I'll stop teasing about it... The
code is ready and tested... All I need to do is prepare the patch and
send it to the list for review.

net/core/dev.c uses a hash to store the napi structs. This could
probably be replicated easily in io_uring, but for now, imho, this is
a polishing detail only that can be revisited later, after the proof
of concept has been accepted.

So eager to share the patch... This is the next thing I will do before
going to bed, once I am done reading my emails...
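For reference, a minimal sketch of what replicating that hash in io_uring could look like, built on the generic linux/hashtable.h helpers; struct io_napi_entry, the table name, and the helper names are illustrative assumptions, not taken from any posted patch:

    #include <linux/hashtable.h>
    #include <linux/slab.h>

    /* Illustrative only: a hashed NAPI-ID registry keyed by napi_id. */
    struct io_napi_entry {
    	unsigned int		napi_id;
    	struct hlist_node	node;
    };

    static DEFINE_HASHTABLE(io_napi_hash, 4);	/* 16 buckets */

    static struct io_napi_entry *io_napi_lookup(unsigned int napi_id)
    {
    	struct io_napi_entry *ne;

    	/* walk only the bucket that napi_id hashes to */
    	hash_for_each_possible(io_napi_hash, ne, node, napi_id)
    		if (ne->napi_id == napi_id)
    			return ne;
    	return NULL;
    }

    static void io_napi_add(unsigned int napi_id)
    {
    	struct io_napi_entry *ne;

    	if (io_napi_lookup(napi_id))
    		return;		/* already tracked */
    	ne = kmalloc(sizeof(*ne), GFP_KERNEL);
    	if (!ne)
    		return;
    	ne->napi_id = napi_id;
    	hash_add(io_napi_hash, &ne->node, napi_id);
    }

The hash keeps lookups O(1) per bucket instead of scanning a whole list, which is the optimization the thread alludes to.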
* Re: napi_busy_poll
  2022-02-19  7:14               ` napi_busy_poll Olivier Langlois
@ 2022-02-21  4:52                 ` Hao Xu
  0 siblings, 0 replies; 23+ messages in thread
From: Hao Xu @ 2022-02-21  4:52 UTC (permalink / raw)
To: Olivier Langlois, Jens Axboe, io-uring

On 2022/2/19 3:14 PM, Olivier Langlois wrote:
> On Fri, 2022-02-18 at 16:06 +0800, Hao Xu wrote:
>>
>>> Concerning the remaining problem about when to remove the napi_id,
>>> I would say that a good place to do it would be when a request is
>>> completed and discarded, if there was a refcount added to your
>>> napi_entry struct.
>>>
>>> The only thing that I hate about this idea is that in my scenario,
>>> the sockets are going to be pretty much the same for the whole
>>> io_uring context existence. Therefore, the whole refcounting
>>> overhead is useless and unneeded.
>>
>> I remember that now all the completion is done in the original task
>> (should be confirmed again), so it should be OK to just use a simple
>> 'unsigned int count' to track the number of users of a napi entry,
>> and to do the deletion when the count reaches 0. For your scenario,
>> which has only one napi per io_uring context, this won't be big
>> overhead either. The only thing is we may need to optimize the napi
>> lookup process, but I'm not sure if that is necessary.
>>
> Hi Hao,
>
> You can forget about the refcount idea. What I hated about it was
> that it was incurring a cost on all io_uring users, including the
> vast majority of them that won't be using napi_busy_poll...

We can do the deletion at the end of each op (which we should do
anyway), e.g. in recv, or in poll_task_func/apoll_task_func. It won't
affect the main path, I guess.

> One thing that I know Pavel hates is when hot paths are littered with
> a bunch of new code. I have been very careful about that in my
> design.
>
> I think that I have found a much better approach to tackle the
> problem of when to remove the napi_ids... I'll stop teasing about
> it... The code is ready and tested... All I need to do is prepare the
> patch and send it to the list for review.
>
> net/core/dev.c uses a hash to store the napi structs. This could
> probably be replicated easily in io_uring, but for now, imho, this is
> a polishing detail only that can be revisited later, after the proof
> of concept has been accepted.

Exactly, that is not a high priority thing right now.

> So eager to share the patch... This is the next thing I will do
> before going to bed, once I am done reading my emails...
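A sketch of the counting scheme Hao describes, assuming all completions run in the original task context so a plain integer suffices; every name here is illustrative, not from any posted patch:

    struct io_napi_entry {
    	unsigned int	 napi_id;
    	unsigned int	 count;		/* users of this entry; no atomics
    					 * needed if all completions run in
    					 * the one submitting task */
    	struct list_head list;
    };

    /* Called at the end of each op, e.g. from the poll task funcs. */
    static void io_napi_put(struct io_napi_entry *ne)
    {
    	if (--ne->count == 0) {
    		list_del(&ne->list);
    		kfree(ne);
    	}
    }

Deletion then costs one decrement per completed request, which stays off the hot path for rings that never record a NAPI ID.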
* Re: napi_busy_poll
  2022-02-16 12:14         ` napi_busy_poll Hao Xu
  2022-02-17 20:28           ` napi_busy_poll Olivier Langlois
@ 2022-02-17 23:18           ` Olivier Langlois
  2022-02-17 23:25             ` napi_busy_poll Jens Axboe
  2022-02-18  7:21             ` napi_busy_poll Hao Xu
  2022-02-18  5:05           ` napi_busy_poll Olivier Langlois
  2 siblings, 2 replies; 23+ messages in thread
From: Olivier Langlois @ 2022-02-17 23:18 UTC (permalink / raw)
To: Hao Xu, Jens Axboe, io-uring

On Wed, 2022-02-16 at 20:14 +0800, Hao Xu wrote:
>
> Hi Olivier,
> I've written something to express my idea; it would be great if you
> can try it. It's totally untested and only does polling in sqthread;
> it won't be hard to expand it to cqring_wait. My original idea was to
> poll all the napi devices, but that seems not to be efficient, so for
> a request, just do napi polling for one napi.
> There is still one problem: when to delete the polled NAPIs.
>
I think that I have found an elegant solution to the remaining
problem!

Are you ok if I send out a patch to Jens that contains your code if I
put your name as a co-author?

> +	ne = kmalloc(sizeof(*ne), GFP_KERNEL);
> +	if (!ne)
> +		return;
> +
> +	list_add_tail(&ne->list, &ctx->napi_list);
> +}

ne->napi_id is not initialized before returning from the function!
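For clarity, the tail of that function could look as follows with the missing initialization added; this is a sketch based on Hao's draft, and napi_id is assumed to be the local variable already computed from sk->sk_napi_id earlier in the function:

    	ne = kmalloc(sizeof(*ne), GFP_KERNEL);
    	if (!ne)
    		return;
    	ne->napi_id = napi_id;	/* the initialization the draft was missing */
    	list_add_tail(&ne->list, &ctx->napi_list);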
* Re: napi_busy_poll
  2022-02-17 23:18           ` napi_busy_poll Olivier Langlois
@ 2022-02-17 23:25             ` Jens Axboe
  2022-02-18  7:21             ` napi_busy_poll Hao Xu
  1 sibling, 0 replies; 23+ messages in thread
From: Jens Axboe @ 2022-02-17 23:25 UTC (permalink / raw)
To: Olivier Langlois, Hao Xu, io-uring

On 2/17/22 4:18 PM, Olivier Langlois wrote:
> On Wed, 2022-02-16 at 20:14 +0800, Hao Xu wrote:
>>
>> Hi Olivier,
>> I've written something to express my idea; it would be great if you
>> can try it. It's totally untested and only does polling in sqthread;
>> it won't be hard to expand it to cqring_wait. My original idea was
>> to poll all the napi devices, but that seems not to be efficient, so
>> for a request, just do napi polling for one napi.
>> There is still one problem: when to delete the polled NAPIs.
>>
> I think that I have found an elegant solution to the remaining
> problem!
>
> Are you ok if I send out a patch to Jens that contains your code if I
> put your name as a co-author?

FWIW, we tend to use the Co-developed-by: tag for this kind of
situation. Just throwing it out there :-)

-- 
Jens Axboe
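For reference, kernel convention (Documentation/process/submitting-patches.rst) is that Co-developed-by: must be immediately followed by the co-author's Signed-off-by:, with the submitter's Signed-off-by: coming last in the chain; the addresses below are placeholders, not the authors' real ones:

    Co-developed-by: Hao Xu <hao.xu@example.com>
    Signed-off-by: Hao Xu <hao.xu@example.com>
    Signed-off-by: Olivier Langlois <olivier@example.com>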
* Re: napi_busy_poll
  2022-02-17 23:18           ` napi_busy_poll Olivier Langlois
  2022-02-17 23:25             ` napi_busy_poll Jens Axboe
@ 2022-02-18  7:21             ` Hao Xu
  1 sibling, 0 replies; 23+ messages in thread
From: Hao Xu @ 2022-02-18  7:21 UTC (permalink / raw)
To: Olivier Langlois, Jens Axboe, io-uring

On 2/18/22 07:18, Olivier Langlois wrote:
> On Wed, 2022-02-16 at 20:14 +0800, Hao Xu wrote:
>> Hi Olivier,
>> I've written something to express my idea; it would be great if you
>> can try it. It's totally untested and only does polling in sqthread;
>> it won't be hard to expand it to cqring_wait. My original idea was
>> to poll all the napi devices, but that seems not to be efficient, so
>> for a request, just do napi polling for one napi.
>> There is still one problem: when to delete the polled NAPIs.
>>
> I think that I have found an elegant solution to the remaining
> problem!
>
> Are you ok if I send out a patch to Jens that contains your code if I
> put your name as a co-author?

No problem, go ahead.

>>
>> +	ne = kmalloc(sizeof(*ne), GFP_KERNEL);
>> +	if (!ne)
>> +		return;
>> +
>> +	list_add_tail(&ne->list, &ctx->napi_list);
>> +}
> ne->napi_id is not initialized before returning from the function!

Yes, feel free to enhance this code and fix bugs.
* Re: napi_busy_poll
  2022-02-16 12:14         ` napi_busy_poll Hao Xu
  2022-02-17 20:28           ` napi_busy_poll Olivier Langlois
  2022-02-17 23:18           ` napi_busy_poll Olivier Langlois
@ 2022-02-18  5:05           ` Olivier Langlois
  2022-02-18  7:41             ` napi_busy_poll Hao Xu
  2 siblings, 1 reply; 23+ messages in thread
From: Olivier Langlois @ 2022-02-18  5:05 UTC (permalink / raw)
To: Hao Xu, Jens Axboe, io-uring

On Wed, 2022-02-16 at 20:14 +0800, Hao Xu wrote:
> @@ -5583,6 +5650,7 @@ static void io_poll_task_func(struct io_kiocb *req, bool *locked)
>  	struct io_ring_ctx *ctx = req->ctx;
>  	int ret;
>
> +	io_add_napi(req->file, req->ctx);
>  	ret = io_poll_check_events(req);
>  	if (ret > 0)
>  		return;
> @@ -5608,6 +5676,7 @@ static void io_apoll_task_func(struct io_kiocb *req, bool *locked)
>  	struct io_ring_ctx *ctx = req->ctx;
>  	int ret;
>
> +	io_add_napi(req->file, req->ctx);
>  	ret = io_poll_check_events(req);
>  	if (ret > 0)
>  		return;
>
I have a doubt about these call sites for adding the napi_id to the
list. AFAIK, these are the functions called when the desired events
are ready; therefore, it is too late for polling the device.

OTOH, my choice of doing it from io_file_get_normal() was perhaps a
poor one too, because it is premature.

Possibly the best location might be __io_arm_poll_handler()...
* Re: napi_busy_poll
  2022-02-18  5:05           ` napi_busy_poll Olivier Langlois
@ 2022-02-18  7:41             ` Hao Xu
  2022-02-19  7:02               ` napi_busy_poll Olivier Langlois
  0 siblings, 1 reply; 23+ messages in thread
From: Hao Xu @ 2022-02-18  7:41 UTC (permalink / raw)
To: Olivier Langlois, Jens Axboe, io-uring

On 2/18/22 13:05, Olivier Langlois wrote:
> On Wed, 2022-02-16 at 20:14 +0800, Hao Xu wrote:
>> @@ -5583,6 +5650,7 @@ static void io_poll_task_func(struct io_kiocb *req, bool *locked)
>>  	struct io_ring_ctx *ctx = req->ctx;
>>  	int ret;
>>
>> +	io_add_napi(req->file, req->ctx);
>>  	ret = io_poll_check_events(req);
>>  	if (ret > 0)
>>  		return;
>> @@ -5608,6 +5676,7 @@ static void io_apoll_task_func(struct io_kiocb *req, bool *locked)
>>  	struct io_ring_ctx *ctx = req->ctx;
>>  	int ret;
>>
>> +	io_add_napi(req->file, req->ctx);
>>  	ret = io_poll_check_events(req);
>>  	if (ret > 0)
>>  		return;
>>
> I have a doubt about these call sites for adding the napi_id to the
> list. AFAIK, these are the functions called when the desired events
> are ready; therefore, it is too late for polling the device. [1]
>
> OTOH, my choice of doing it from io_file_get_normal() was perhaps a
> poor one too, because it is premature.
>
> Possibly the best location might be __io_arm_poll_handler()...

Hi Olivier,
Have you tried just issuing one recv/pollin request and observing the
napi_id? From my understanding of the network stack, the napi_id of a
socket won't be valid until it gets some packets, because before that
moment busy_poll doesn't know which hw queue to poll.
In other words, the idea of NAPI polling is: the packets of a socket
can come from any hw queue of a net adapter, but we just poll the
queue which most recently received some data. So to get this piece of
info, some data must have arrived on one queue before doing the
busy_poll. Correct me if I'm wrong, since I'm also a newbie at network
stuff...
I was considering polling all the rx rings, but that seemed to be
inefficient in some tests done by my colleague.

For question [1] you mentioned, I think it's OK, since:
 - not all the data may be ready at that moment
 - the polling is not just for that request; there may be more data
   coming from the rx ring, since we usually use polling mode under
   high workload pressure. See the implementation of epoll busy_poll,
   which does the same thing.

Regards,
Hao
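One way to try that observation from userspace is the SO_INCOMING_NAPI_ID socket option (available since Linux 4.12); the helper below is an illustrative sketch, not part of any posted patch:

    #include <stdio.h>
    #include <sys/socket.h>

    /* Print the NAPI ID the kernel has recorded for a connected socket.
     * It stays 0 until at least one packet has arrived on a hw queue. */
    static void print_napi_id(int fd)
    {
    	unsigned int napi_id = 0;
    	socklen_t len = sizeof(napi_id);

    	if (getsockopt(fd, SOL_SOCKET, SO_INCOMING_NAPI_ID,
    		       &napi_id, &len) == 0)
    		printf("fd %d napi_id=%u\n", fd, napi_id);
    }

Calling this right after connect() and again after the first reply arrives would show the transition Hao describes.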
* Re: napi_busy_poll
  2022-02-18  7:41             ` napi_busy_poll Hao Xu
@ 2022-02-19  7:02               ` Olivier Langlois
  2022-02-21  5:03                 ` napi_busy_poll Hao Xu
  0 siblings, 1 reply; 23+ messages in thread
From: Olivier Langlois @ 2022-02-19  7:02 UTC (permalink / raw)
To: Hao Xu, Jens Axboe, io-uring

On Fri, 2022-02-18 at 15:41 +0800, Hao Xu wrote:
>
> Hi Olivier,
> Have you tried just issuing one recv/pollin request and observing the
> napi_id?

Hi Hao, not precisely, but you are 100% right about where the
association is done. It is when a packet is received that the
association is made. This happens in a few places, but the most likely
place where it can happen with my NIC (Intel igb) is inside
napi_gro_receive().

I do verify the socket napi_id once a WebSocket session is
established. At that point a lot of packets going back and forth have
been exchanged:

TCP handshake
TLS handshake
HTTP request requesting a WS upgrade

At that point, the napi_id has been assigned.

My problem was only that my socket packets were routed over the
loopback interface, which has no napi devices associated with it.

I did remove the local SOCKS proxy from my setup and NAPI ids started
to appear as expected.

> From my understanding of the network stack, the napi_id of a socket
> won't be valid until it gets some packets, because before that moment
> busy_poll doesn't know which hw queue to poll.
> In other words, the idea of NAPI polling is: the packets of a socket
> can come from any hw queue of a net adapter, but we just poll the
> queue which most recently received some data. So to get this piece of
> info, some data must have arrived on one queue before doing the
> busy_poll. Correct me if I'm wrong, since I'm also a newbie at
> network stuff...

I am now getting what you mean here. So there are 2 possible
approaches. You either:

1. add the napi_id when you are sure that it is available, after its
setting in the sock layer, but without knowing whether it will be
needed again by future requests, as it is too late to be of any use
for the current request (unless it is a MULTISHOT poll) (the add is
performed in io_poll_task_func() and io_apoll_task_func())

2. add the napi_id when the request poll is armed, where this
knowledge could be leveraged to handle the current request, knowing
that we might fail to get the id if it is the initial recv request
(the add would be performed in __io_arm_poll_handler())

TBH, I am not sure if there are arguments in favor of one approach
over the other... Maybe option #1 is the only one that makes napi busy
poll work correctly with MULTISHOT requests... I'll let you think
about this point... Your first choice might be the right one...

The other thing to consider when choosing the call site is locking...
When done from __io_arm_poll_handler(), uring_lock is acquired... I am
not sure that this is always the case with
io_poll_task_func()/io_apoll_task_func()...

I'll post v1 of the patch. My testing is showing that it works fine.
Race conditions are not an issue when the busy poll is performed by
the sqpoll thread, because the list modifications are exclusively
performed by that thread too.

But I think that there is a possible race condition where the
napi_list could be used from io_cqring_wait() while another thread
modifies the list. This is NOT done in my testing scenario, but it is
definitely something that could happen somewhere in the real world...

>
> I was considering polling all the rx rings, but that seemed to be
> inefficient in some tests done by my colleague.
This is definitely the simplest implementation, but I did not go as
far as testing it. There are too many unknown variables for it to be
viable, IMHO. I am not too sure how many napi devices there can be in
a typical server. I know that my test machine has 2 NICs, and one of
them is simply unconnected. If we were to loop through all the
devices, we would be polling wastefully on at least half of the
devices in the system. That does not sound like a very good approach.
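On the locking question raised in the message above: if the add ends up being done from a context where uring_lock is not guaranteed to be held, one option is a dedicated spinlock around the list. This is a sketch only; ctx->napi_lock and io_napi_find() are assumptions, not in any posted patch:

    	spin_lock(&ctx->napi_lock);
    	if (!io_napi_find(ctx, napi_id)) {	/* avoid duplicate entries */
    		ne->napi_id = napi_id;
    		list_add_tail(&ne->list, &ctx->napi_list);
    	} else {
    		kfree(ne);	/* another thread beat us to it */
    	}
    	spin_unlock(&ctx->napi_lock);

A dedicated lock keeps the add path independent of which call site is chosen, at the cost of one more lock on a rarely contended path.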
* Re: napi_busy_poll
  2022-02-19  7:02               ` napi_busy_poll Olivier Langlois
@ 2022-02-21  5:03                 ` Hao Xu
  2022-02-25  4:42                   ` napi_busy_poll Olivier Langlois
  0 siblings, 1 reply; 23+ messages in thread
From: Hao Xu @ 2022-02-21  5:03 UTC (permalink / raw)
To: Olivier Langlois, Jens Axboe, io-uring

On 2022/2/19 3:02 PM, Olivier Langlois wrote:
> On Fri, 2022-02-18 at 15:41 +0800, Hao Xu wrote:
>>
>> Hi Olivier,
>> Have you tried just issuing one recv/pollin request and observing
>> the napi_id?
>
> Hi Hao, not precisely, but you are 100% right about where the
> association is done. It is when a packet is received that the
> association is made. This happens in a few places, but the most
> likely place where it can happen with my NIC (Intel igb) is inside
> napi_gro_receive().

Yes: when a packet is received, skb->napi_id is set; when a batch of
them has been received, the skbs are delivered to the protocol layer
and sk->napi_id is set.

> I do verify the socket napi_id once a WebSocket session is
> established. At that point a lot of packets going back and forth have
> been exchanged:
>
> TCP handshake
> TLS handshake
> HTTP request requesting a WS upgrade
>
> At that point, the napi_id has been assigned.
>
> My problem was only that my socket packets were routed over the
> loopback interface, which has no napi devices associated with it.
>
> I did remove the local SOCKS proxy from my setup and NAPI ids started
> to appear as expected.
>
>> From my understanding of the network stack, the napi_id of a socket
>> won't be valid until it gets some packets, because before that
>> moment busy_poll doesn't know which hw queue to poll.
>> In other words, the idea of NAPI polling is: the packets of a socket
>> can come from any hw queue of a net adapter, but we just poll the
>> queue which most recently received some data. So to get this piece
>> of info, some data must have arrived on one queue before doing the
>> busy_poll. Correct me if I'm wrong, since I'm also a newbie at
>> network stuff...
>
> I am now getting what you mean here. So there are 2 possible
> approaches. You either:
>
> 1. add the napi_id when you are sure that it is available, after its
> setting in the sock layer, but without knowing whether it will be
> needed again by future requests, as it is too late to be of any use
> for the current request (unless it is a MULTISHOT poll) (the add is
> performed in io_poll_task_func() and io_apoll_task_func())
>
> 2. add the napi_id when the request poll is armed, where this
> knowledge could be leveraged to handle the current request, knowing
> that we might fail to get the id if it is the initial recv request
> (the add would be performed in __io_arm_poll_handler())

I explain this in the patch.

> TBH, I am not sure if there are arguments in favor of one approach
> over the other... Maybe option #1 is the only one that makes napi
> busy poll work correctly with MULTISHOT requests...
>
> I'll let you think about this point... Your first choice might be the
> right one...
>
> The other thing to consider when choosing the call site is locking...
> When done from __io_arm_poll_handler(), uring_lock is acquired... I
> am not sure that this is always the case with
> io_poll_task_func()/io_apoll_task_func()...
>
> I'll post v1 of the patch. My testing is showing that it works fine.
> Race conditions are not an issue when the busy poll is performed by
> the sqpoll thread, because the list modifications are exclusively
> performed by that thread too.
>
> But I think that there is a possible race condition where the
> napi_list could be used from io_cqring_wait() while another thread
> modifies the list. This is NOT done in my testing scenario, but it is
> definitely something that could happen somewhere in the real world...

Will there be any issue if we do the access with
list_for_each_entry_safe()? I think it is safe enough.

>
>> I was considering polling all the rx rings, but that seemed to be
>> inefficient in some tests done by my colleague.
>
> This is definitely the simplest implementation, but I did not go as
> far as testing it. There are too many unknown variables for it to be
> viable, IMHO. I am not too sure how many napi devices there can be in
> a typical server. I know that my test machine has 2 NICs, and one of
> them is simply unconnected. If we were to loop through all the
> devices, we would be polling wastefully on at least half of the
> devices in the system. That does not sound like a very good approach.
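A note of caution on that point: list_for_each_entry_safe() only protects the iterator against deletions made by the iterating thread itself; it does not make traversal safe against concurrent writers, so a lock (or confinement to a single thread) is still required. A sketch of a deletion pass, where ctx->napi_lock and expired() are illustrative placeholders:

    	struct io_napi_entry *ne, *tmp;

    	spin_lock(&ctx->napi_lock);
    	list_for_each_entry_safe(ne, tmp, &ctx->napi_list, list) {
    		if (expired(ne)) {
    			list_del(&ne->list);	/* safe: tmp holds the next node */
    			kfree(ne);
    		}
    	}
    	spin_unlock(&ctx->napi_lock);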
* Re: napi_busy_poll
  2022-02-21  5:03                 ` napi_busy_poll Hao Xu
@ 2022-02-25  4:42                   ` Olivier Langlois
  0 siblings, 0 replies; 23+ messages in thread
From: Olivier Langlois @ 2022-02-25  4:42 UTC (permalink / raw)
To: Hao Xu, Jens Axboe, io-uring

On Mon, 2022-02-21 at 13:03 +0800, Hao Xu wrote:
>>
>> But I think that there is a possible race condition where the
>> napi_list could be used from io_cqring_wait() while another thread
>> modifies the list. This is NOT done in my testing scenario, but it
>> is definitely something that could happen somewhere in the real
>> world...
>
> Will there be any issue if we do the access with
> list_for_each_entry_safe()? I think it is safe enough.

Hi Hao,

If napi_busy_poll is done exclusively from the sqpoll thread, all is
good, because all the napi_list manipulations are performed by the
sqpoll thread.

The issue arises if we want to offer napi_busy_poll to a task calling
io_uring_enter(). If the busy poll is performed from io_cqring_wait(),
as I propose in my patch, the napi_list could be updated by a
different thread calling io_uring_enter() to submit other requests.

This is the issue that v2 addresses, and it makes the code uglier. The
strategy is to splice the context napi_list into a local list in
io_cqring_wait(), and to assume that the most likely outcome is that,
once the busy poll is over, the only thing needed will be to move the
local list back into the context. If the context napi_list has been
updated in the meantime, the two lists are merged. This appears to be
the approach that minimizes the number of memory allocations.

Creating a benchmark program took more time than I originally
expected. I am going to run it, and if the gains from napi polling in
io_cqring_wait() aren't that good... maybe I will ditch
napi_busy_poll() support from io_cqring_wait(); that way, taking the
lock before adding napi ids would not be required anymore...

Here is what will be added in v2:

- Evaluate list_empty(&ctx->napi_list) outside io_napi_busy_loop() to
  keep __io_sq_thread() execution as fast as possible
- In io_cqring_wait(), move up the sig block to avoid needless
  computation if the block exits the function
- In io_cqring_wait(), protect ctx->napi_list from race conditions by
  splicing it into a local list
- In io_cqring_wait(), allow busy polling when uts is missing
- Fix kernel test robot issues
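A sketch of the splicing strategy described above, using the standard list helpers; the lock name and the surrounding structure are assumptions, not taken from the posted patches:

    	LIST_HEAD(local_napi_list);

    	spin_lock(&ctx->napi_lock);
    	/* take ownership of the current entries; ctx->napi_list is emptied */
    	list_splice_init(&ctx->napi_list, &local_napi_list);
    	spin_unlock(&ctx->napi_lock);

    	/* ... napi_busy_loop() over local_napi_list, lock not held ... */

    	spin_lock(&ctx->napi_lock);
    	/* move the entries back; anything added meanwhile is kept too */
    	list_splice(&local_napi_list, &ctx->napi_list);
    	spin_unlock(&ctx->napi_lock);

In the common case where no other thread touched the context list, the merge is just two pointer splices, with no allocation at all.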
end of thread, other threads:[~2022-02-25  4:42 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
2022-02-08 14:58 napi_busy_poll Olivier Langlois
2022-02-08 17:05 ` napi_busy_poll Jens Axboe
2022-02-09  3:34   ` napi_busy_poll Hao Xu
2022-02-12 19:51     ` napi_busy_poll Olivier Langlois
2022-02-13 18:47       ` napi_busy_poll Jens Axboe
2022-02-14 17:13       ` napi_busy_poll Hao Xu
2022-02-15  8:37         ` napi_busy_poll Olivier Langlois
2022-02-15 18:05           ` napi_busy_poll Olivier Langlois
2022-02-16  3:12             ` napi_busy_poll Hao Xu
2022-02-16 19:19               ` napi_busy_poll Olivier Langlois
2022-02-16 12:14         ` napi_busy_poll Hao Xu
2022-02-17 20:28           ` napi_busy_poll Olivier Langlois
2022-02-18  8:06             ` napi_busy_poll Hao Xu
2022-02-19  7:14               ` napi_busy_poll Olivier Langlois
2022-02-21  4:52                 ` napi_busy_poll Hao Xu
2022-02-17 23:18           ` napi_busy_poll Olivier Langlois
2022-02-17 23:25             ` napi_busy_poll Jens Axboe
2022-02-18  7:21             ` napi_busy_poll Hao Xu
2022-02-18  5:05           ` napi_busy_poll Olivier Langlois
2022-02-18  7:41             ` napi_busy_poll Hao Xu
2022-02-19  7:02               ` napi_busy_poll Olivier Langlois
2022-02-21  5:03                 ` napi_busy_poll Hao Xu
2022-02-25  4:42                   ` napi_busy_poll Olivier Langlois