* io_uring NAPI busy poll RCU is causing 50 context switches/second to my sqpoll thread
From: Olivier Langlois @ 2024-07-30 20:05 UTC (permalink / raw)
To: io-uring; +Cc: netdev
If you are interested in all the details, they are all here:
https://github.com/axboe/liburing/issues/1190

It seems like I like to write a lot when I am investigating a problem.
Pavel has been a great help in assisting me in understanding what was
happening.

Next, I came to question where the integration of RCU came from and I
found this:
https://lore.kernel.org/all/[email protected]/

I guess that in some use cases, being able to automatically manage
hundreds of NAPI devices that can suddenly all be swept away during a
device reconfiguration is something desirable to have for some...

but in my case, this is an excessively high price to pay for a
flexibility that I do not need at all.

I have a single NAPI device. Once I know what it is, it will practically
remain immutable until termination.

For that reason, I am thinking that offering some sort of polymorphic
NAPI device tracking strategy customization would be desirable.

The current one, the RCU-based one, I would call
dynamic_napi_tracking (rcu could be peppered into the name somewhere so
people know what the strategy is up to),

whereas the new one that I am imagining would be called
static_napi_tracking.

NAPI devices would be added/removed manually by the user through an
extended registration function.

For the sake of convenience, a clear_list operation could even be
offered.

The benefits of this new static tracking strategy would be numerous:
- it removes the need to invoke the heavy-duty RCU cavalry
- no need to scan the list to remove stale devices
- no need to search the list at each SQE submission to update the
  device timeout value

So is this a good idea in your opinion?
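
To make the idea a bit more concrete, here is a rough userspace sketch of
how an application with a single connection could feed its NAPI id to such
a static tracking interface. SO_INCOMING_NAPI_ID is a real socket option;
the registration opcode and payload struct below are invented purely for
illustration and are not an existing io_uring/liburing API:

/*
 * Sketch only: SO_INCOMING_NAPI_ID is a real socket option, but the
 * "static tracking" registration below is hypothetical; the opcode and
 * struct names are invented to illustrate the proposed interface.
 */
#include <stdint.h>
#include <stdio.h>
#include <sys/socket.h>

#ifndef SO_INCOMING_NAPI_ID
#define SO_INCOMING_NAPI_ID 56	/* from asm-generic/socket.h */
#endif

/* hypothetical payload for manually registering a NAPI device */
struct napi_static_entry {
	uint32_t napi_id;        /* value read via SO_INCOMING_NAPI_ID */
	uint32_t busy_poll_usec; /* busy poll budget for this device */
};

/* read the NAPI id of the device that delivered the last packet on sockfd */
static int get_incoming_napi_id(int sockfd, uint32_t *napi_id)
{
	socklen_t len = sizeof(*napi_id);

	return getsockopt(sockfd, SOL_SOCKET, SO_INCOMING_NAPI_ID,
			  napi_id, &len);
}

int main(void)
{
	int fd = socket(AF_INET, SOCK_DGRAM, 0);
	uint32_t id = 0;

	if (fd >= 0 && !get_incoming_napi_id(fd, &id))
		printf("napi id: %u (stays 0 until traffic is received)\n",
		       (unsigned)id);

	/*
	 * The extended registration function would then be something along
	 * these lines (hypothetical opcode, not an existing API):
	 *
	 *   struct napi_static_entry e = { .napi_id = id, .busy_poll_usec = 100 };
	 *   io_uring_register(ring_fd, IORING_REGISTER_NAPI_STATIC, &e, 1);
	 *
	 * with a matching unregister/clear_list counterpart.
	 */
	return 0;
}

With a single immutable device, the registered list would never change
after setup, which is exactly the use case described above.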

* Re: io_uring NAPI busy poll RCU is causing 50 context switches/second to my sqpoll thread
From: Pavel Begunkov @ 2024-07-30 20:25 UTC (permalink / raw)
To: Olivier Langlois, io-uring; +Cc: netdev

On 7/30/24 21:05, Olivier Langlois wrote:
> If you are interested in all the details, they are all here:
> https://github.com/axboe/liburing/issues/1190
>
> It seems like I like to write a lot when I am investigating a problem.
> Pavel has been a great help in assisting me in understanding what was
> happening.
>
> Next, I came to question where the integration of RCU came from and I
> found this:
> https://lore.kernel.org/all/[email protected]/
>
> I guess that in some use cases, being able to automatically manage
> hundreds of NAPI devices that can suddenly all be swept away during a
> device reconfiguration is something desirable to have for some...
>
> but in my case, this is an excessively high price to pay for a
> flexibility that I do not need at all.

Removing an entry or two once every minute is definitely not
going to take 50% CPU; the RCU machinery is running in the background
regardless of whether io_uring uses it or not, and it's pretty
cheap considering amortisation.

If anything, from your explanation it sounds more like the
scheduler makes a wrong decision and schedules out the sqpoll
thread even though it could continue to run, but that needs
confirmation. Does the CPU your SQPOLL task is pinned to stay
100% utilised?

> I have a single NAPI device. Once I know what it is, it will practically
> remain immutable until termination.
>
> For that reason, I am thinking that offering some sort of polymorphic
> NAPI device tracking strategy customization would be desirable.
>
> The current one, the RCU-based one, I would call
> dynamic_napi_tracking (rcu could be peppered into the name somewhere so
> people know what the strategy is up to),
>
> whereas the new one that I am imagining would be called
> static_napi_tracking.
>
> NAPI devices would be added/removed manually by the user through an
> extended registration function.
>
> For the sake of convenience, a clear_list operation could even be
> offered.
>
> The benefits of this new static tracking strategy would be numerous:
> - it removes the need to invoke the heavy-duty RCU cavalry
> - no need to scan the list to remove stale devices
> - no need to search the list at each SQE submission to update the
>   device timeout value
>
> So is this a good idea in your opinion?

I believe that's a good thing, I've been prototyping a similar
if not the same approach just today, i.e. the user [un]registers a napi
instance by the id you can get with SO_INCOMING_NAPI_ID.

--
Pavel Begunkov

* Re: io_uring NAPI busy poll RCU is causing 50 context switches/second to my sqpoll thread
From: Olivier Langlois @ 2024-07-30 23:14 UTC (permalink / raw)
To: Pavel Begunkov, io-uring; +Cc: netdev

On Tue, 2024-07-30 at 21:25 +0100, Pavel Begunkov wrote:
>
> Removing an entry or two once every minute is definitely not
> going to take 50% CPU; the RCU machinery is running in the background
> regardless of whether io_uring uses it or not, and it's pretty
> cheap considering amortisation.
>
> If anything, from your explanation it sounds more like the
> scheduler makes a wrong decision and schedules out the sqpoll
> thread even though it could continue to run, but that needs
> confirmation. Does the CPU your SQPOLL task is pinned to stay
> 100% utilised?

Here are the facts as they are documented in the github issue.

1. Despite thinking that I was doing NAPI busy polling, I was not,
because my ring was not receiving any sqe after its initial setup.

This is what the patch developed with your input,
https://lore.kernel.org/io-uring/382791dc97d208d88ee31e5ebb5b661a0453fb79.1722374371.git.olivier@trillion01.com/T/#u
is addressing.

(BTW, I should check if there is such a thing, but I would love to know
if the net code exposes a tracepoint when napi_busy_poll is called,
because it is very tricky to know whether it is done for real or not.)

2. The moment a second ring that was receiving a lot of sqes was
attached to the sqpoll thread, the NAPI busy loop started to be made
for real and the sqpoll cpu usage inexplicably dropped from 99% to 55%.

3. Here is my kernel cmdline:
hugepages=72 isolcpus=0,1,2 nohz_full=0,1,2 rcu_nocbs=0,1,2
rcu_nocb_poll irqaffinity=3 idle=nomwait processor.max_cstate=1
intel_idle.max_cstate=1 nmi_watchdog=0

There is absolutely nothing else on CPU0, where the sqpoll thread
affinity is set to run.

4. I got the idea of doing this:
echo common_pid == sqpoll_pid > /sys/kernel/tracing/events/sched/filter
echo 1 > /sys/kernel/tracing/events/sched/sched_switch/enable

and I have recorded over 1,000 context switches in 23 seconds with RCU
related kernel threads.

5. Just for the fun of checking it out, I disabled NAPI polling on my
io_uring rings and the sqpoll thread magically returned to 99% CPU
usage from 55%...

I am open to other explanations for what I have observed, but my current
conclusion is based on what I am able to see... the evidence appears
very convincing to me...

> >
> > So is this a good idea in your opinion?
>
> I believe that's a good thing, I've been prototyping a similar
> if not the same approach just today, i.e. the user [un]registers a napi
> instance by the id you can get with SO_INCOMING_NAPI_ID.
>

This is fantastic! I am super happy to see all this NAPI busy polling
feature interest and activity; it is a feature that I am very fond of
(along with io_uring).

I am looking forward to collaborating with you, Pavel, to make io_uring
the best NAPI busy polling goto solution!

Greetings,
Olivier

* Re: io_uring NAPI busy poll RCU is causing 50 context switches/second to my sqpoll thread
From: Pavel Begunkov @ 2024-07-31 0:33 UTC (permalink / raw)
To: Olivier Langlois, io-uring; +Cc: netdev

On 7/31/24 00:14, Olivier Langlois wrote:
> On Tue, 2024-07-30 at 21:25 +0100, Pavel Begunkov wrote:
>>
>> Removing an entry or two once every minute is definitely not
>> going to take 50% CPU; the RCU machinery is running in the background
>> regardless of whether io_uring uses it or not, and it's pretty
>> cheap considering amortisation.
>>
>> If anything, from your explanation it sounds more like the
>> scheduler makes a wrong decision and schedules out the sqpoll
>> thread even though it could continue to run, but that needs
>> confirmation. Does the CPU your SQPOLL task is pinned to stay
>> 100% utilised?
>
> Here are the facts as they are documented in the github issue.
>
> 1. Despite thinking that I was doing NAPI busy polling, I was not,
> because my ring was not receiving any sqe after its initial setup.
>
> This is what the patch developed with your input,
> https://lore.kernel.org/io-uring/382791dc97d208d88ee31e5ebb5b661a0453fb79.1722374371.git.olivier@trillion01.com/T/#u
> is addressing.
>
> (BTW, I should check if there is such a thing, but I would love to know
> if the net code exposes a tracepoint when napi_busy_poll is called,
> because it is very tricky to know whether it is done for real or not.)
>
> 2. The moment a second ring that was receiving a lot of sqes was
> attached to the sqpoll thread, the NAPI busy loop started to be made
> for real and the sqpoll cpu usage inexplicably dropped from 99% to 55%.
>
> 3. Here is my kernel cmdline:
> hugepages=72 isolcpus=0,1,2 nohz_full=0,1,2 rcu_nocbs=0,1,2
> rcu_nocb_poll irqaffinity=3 idle=nomwait processor.max_cstate=1
> intel_idle.max_cstate=1 nmi_watchdog=0
>
> There is absolutely nothing else on CPU0, where the sqpoll thread
> affinity is set to run.
>
> 4. I got the idea of doing this:
> echo common_pid == sqpoll_pid > /sys/kernel/tracing/events/sched/filter
> echo 1 > /sys/kernel/tracing/events/sched/sched_switch/enable
>
> and I have recorded over 1,000 context switches in 23 seconds with RCU
> related kernel threads.
>
> 5. Just for the fun of checking it out, I disabled NAPI polling on my
> io_uring rings and the sqpoll thread magically returned to 99% CPU
> usage from 55%...
>
> I am open to other explanations for what I have observed, but my current
> conclusion is based on what I am able to see... the evidence appears
> very convincing to me...

You're seeing something that doesn't make much sense to me, and we need
to understand what that is. There might be a bug _somewhere_, that's
always a possibility, but before saying that let's get a bit more data.

While the app is working, can you grab a profile and run mpstat for the
CPU on which you have the SQPOLL task?

perf record -g -C <CPU number> --all-kernel &
mpstat -u -P <CPU number> 5 10 &

And then as usual, time it so that you have some activity going on,
the mpstat interval may need adjustments, and perf report it as before.

--
Pavel Begunkov

* Re: io_uring NAPI busy poll RCU is causing 50 context switches/second to my sqpoll thread
From: Pavel Begunkov @ 2024-07-31 1:00 UTC (permalink / raw)
To: Olivier Langlois, io-uring; +Cc: netdev

On 7/31/24 01:33, Pavel Begunkov wrote:
> On 7/31/24 00:14, Olivier Langlois wrote:
>> On Tue, 2024-07-30 at 21:25 +0100, Pavel Begunkov wrote:
>>>
>>> Removing an entry or two once every minute is definitely not
>>> going to take 50% CPU; the RCU machinery is running in the background
>>> regardless of whether io_uring uses it or not, and it's pretty
>>> cheap considering amortisation.
>>>
>>> If anything, from your explanation it sounds more like the
>>> scheduler makes a wrong decision and schedules out the sqpoll
>>> thread even though it could continue to run, but that needs
>>> confirmation. Does the CPU your SQPOLL task is pinned to stay
>>> 100% utilised?
>>
>> Here are the facts as they are documented in the github issue.
>>
>> 1. Despite thinking that I was doing NAPI busy polling, I was not,
>> because my ring was not receiving any sqe after its initial setup.
>>
>> This is what the patch developed with your input,
>> https://lore.kernel.org/io-uring/382791dc97d208d88ee31e5ebb5b661a0453fb79.1722374371.git.olivier@trillion01.com/T/#u
>> is addressing.
>>
>> (BTW, I should check if there is such a thing, but I would love to know
>> if the net code exposes a tracepoint when napi_busy_poll is called,
>> because it is very tricky to know whether it is done for real or not.)
>>
>> 2. The moment a second ring that was receiving a lot of sqes was
>> attached to the sqpoll thread, the NAPI busy loop started to be made
>> for real and the sqpoll cpu usage inexplicably dropped from 99% to 55%.
>>
>> 3. Here is my kernel cmdline:
>> hugepages=72 isolcpus=0,1,2 nohz_full=0,1,2 rcu_nocbs=0,1,2
>> rcu_nocb_poll irqaffinity=3 idle=nomwait processor.max_cstate=1
>> intel_idle.max_cstate=1 nmi_watchdog=0
>>
>> There is absolutely nothing else on CPU0, where the sqpoll thread
>> affinity is set to run.
>>
>> 4. I got the idea of doing this:
>> echo common_pid == sqpoll_pid > /sys/kernel/tracing/events/sched/filter
>> echo 1 > /sys/kernel/tracing/events/sched/sched_switch/enable
>>
>> and I have recorded over 1,000 context switches in 23 seconds with RCU
>> related kernel threads.
>>
>> 5. Just for the fun of checking it out, I disabled NAPI polling on my
>> io_uring rings and the sqpoll thread magically returned to 99% CPU
>> usage from 55%...
>>
>> I am open to other explanations for what I have observed, but my current
>> conclusion is based on what I am able to see... the evidence appears
>> very convincing to me...
>
> You're seeing something that doesn't make much sense to me, and we need
> to understand what that is. There might be a bug _somewhere_, that's
> always a possibility, but before saying that let's get a bit more data.
>
> While the app is working, can you grab a profile and run mpstat for the
> CPU on which you have the SQPOLL task?
>
> perf record -g -C <CPU number> --all-kernel &
> mpstat -u -P <CPU number> 5 10 &
>
> And then as usual, time it so that you have some activity going on,
> the mpstat interval may need adjustments, and perf report it as before.

I forgot to add: ~50 switches/second for relatively brief RCU handling
is not much, not enough to take 50% of a CPU. I wonder if sqpoll was
still running but the napi busy polling time got accounted to softirq
because of disabled bh and you didn't include it, hence asking for the
CPU stats. Do you see any latency problems with that configuration?

--
Pavel Begunkov

* Re: io_uring NAPI busy poll RCU is causing 50 context switches/second to my sqpoll thread
From: Olivier Langlois @ 2024-08-01 23:05 UTC (permalink / raw)
To: Pavel Begunkov, io-uring; +Cc: netdev

On Wed, 2024-07-31 at 02:00 +0100, Pavel Begunkov wrote:
>
> I forgot to add: ~50 switches/second for relatively brief RCU handling
> is not much, not enough to take 50% of a CPU. I wonder if sqpoll was
> still running but the napi busy polling time got accounted to softirq
> because of disabled bh and you didn't include it, hence asking for the
> CPU stats. Do you see any latency problems with that configuration?
>
Pavel,

I am not sure if I will ever discover what this 50% CPU usage drop was
exactly.

When I did test
https://lore.kernel.org/io-uring/382791dc97d208d88ee31e5ebb5b661a0453fb79.1722374371.git.olivier@trillion01.com/T/#u
from this custom setup:
https://github.com/axboe/liburing/issues/1190#issuecomment-2258632731
the iou-sqp task cpu usage went back to 100%...

There were also my busy_poll config numbers that were inadequate. I went
from:

echo 1000 > /sys/class/net/enp39s0/napi_defer_hard_irqs
echo 500 > /sys/class/net/enp39s0/gro_flush_timeout

to:

echo 5000 > /sys/class/net/enp39s0/napi_defer_hard_irqs
# gro_flush_timeout unit is nanoseconds
echo 100000 > /sys/class/net/enp39s0/gro_flush_timeout

ksoftirqd has stopped being awakened to service the NET softirqs, but I
would guess that this might not be the cause either.

I have no more latency issues. After a lot of effort during the last 7
days, my system latency has improved by a good 10usec on average over
what it was last week... but knowing that it can be even better is
stopping me from letting go...

The sporadic CPU1 interrupt can introduce a 27usec delay, and this is
the difference between a win or a loss that is at stake...
https://lore.kernel.org/rcu/[email protected]/T/#m5abf9aa02ec7648c615885a6f8ebdebc57935c35

I want to get rid of that interrupt so badly that it is going to be a
great satisfaction when I have finally found the cause...

* Re: io_uring NAPI busy poll RCU is causing 50 context switches/second to my sqpoll thread
From: Olivier Langlois @ 2024-08-01 22:02 UTC (permalink / raw)
To: Pavel Begunkov, io-uring; +Cc: netdev

On Wed, 2024-07-31 at 01:33 +0100, Pavel Begunkov wrote:
>
> You're seeing something that doesn't make much sense to me, and we need
> to understand what that is. There might be a bug _somewhere_, that's
> always a possibility, but before saying that let's get a bit more data.
>
> While the app is working, can you grab a profile and run mpstat for the
> CPU on which you have the SQPOLL task?
>
> perf record -g -C <CPU number> --all-kernel &
> mpstat -u -P <CPU number> 5 10 &
>
> And then as usual, time it so that you have some activity going on,
> the mpstat interval may need adjustments, and perf report it as before.
>
First things first.

The other day, I put my foot in my mouth by saying that NAPI busy
polling was adding 50 context switches/second.

I was responsible for that behavior with the rcu_nocb_poll boot kernel
param. I have removed the option and the context switches went away...

I am clearly outside my comfort zone with this project; I am trying
things without fully understanding what I am doing, and I am making
errors and saying stuff that is incorrect.

On top of that, before mentioning io_uring's RCU usage, I did not
realize that net/core was already massively using RCU, including in
napi_busy_poll; therefore, the point that io_uring uses RCU before
calling napi_busy_poll seems very moot.

This is what I did the other day, and I wanted to apologize for having
said something incorrect.

That being said, it does not remove the possible merit of what I did
propose.

I really think that the current io_uring implementation of the napi
device tracking strategy is overkill for a lot of scenarios...

If some sort of abstract interface, like a mini struct net_device_ops
with 3-4 function pointers where the user could select between the
standard dynamic tracking or a manual lightweight tracking, was present,
that would be very cool... so cool...

I am definitely interested in running the profiler tools that you are
proposing... Most of my problems are resolved...

- I got rid of 99.9% of the NET_RX_SOFTIRQ
- I have reduced significantly the number of NET_TX_SOFTIRQ
  https://github.com/amzn/amzn-drivers/issues/316
- No more rcu context switches
- CPU2 is now nohz_full all the time
- The CPU1 local timer interrupt is raised once every 2-3 seconds from an
  unknown origin. Paul E. McKenney offered me his assistance on this
  issue:
  https://lore.kernel.org/rcu/[email protected]/t/#u

I am going to give perf record a second chance... but just keep in
mind that if it is not recording much, it is not because nothing is
happening. If perf relies on interrupts to operate properly, there are
close to 0 of them on my nohz_full CPU...

thx a lot for your help Pavel!

* Re: io_uring NAPI busy poll RCU is causing 50 context switches/second to my sqpoll thread
From: Pavel Begunkov @ 2024-08-02 15:22 UTC (permalink / raw)
To: Olivier Langlois, io-uring; +Cc: netdev

On 8/1/24 23:02, Olivier Langlois wrote:
> On Wed, 2024-07-31 at 01:33 +0100, Pavel Begunkov wrote:
>>
>> You're seeing something that doesn't make much sense to me, and we need
>> to understand what that is. There might be a bug _somewhere_, that's
>> always a possibility, but before saying that let's get a bit more data.
>>
>> While the app is working, can you grab a profile and run mpstat for the
>> CPU on which you have the SQPOLL task?
>>
>> perf record -g -C <CPU number> --all-kernel &
>> mpstat -u -P <CPU number> 5 10 &
>>
>> And then as usual, time it so that you have some activity going on,
>> the mpstat interval may need adjustments, and perf report it as before.
>>
> First things first.
>
> The other day, I put my foot in my mouth by saying that NAPI busy
> polling was adding 50 context switches/second.
>
> I was responsible for that behavior with the rcu_nocb_poll boot kernel
> param. I have removed the option and the context switches went away...
>
> I am clearly outside my comfort zone with this project; I am trying
> things without fully understanding what I am doing, and I am making
> errors and saying stuff that is incorrect.
>
> On top of that, before mentioning io_uring's RCU usage, I did not
> realize that net/core was already massively using RCU, including in
> napi_busy_poll; therefore, the point that io_uring uses RCU before
> calling napi_busy_poll seems very moot.
>
> This is what I did the other day, and I wanted to apologize for having
> said something incorrect.

No worries at all, you're pushing your configuration to extremes,
anyone would get lost in the options, and I'm getting curious what
you can squeeze from it. It's true that the current tracking scheme
might be overkill, but not because of mild RCU use.

> That being said, it does not remove the possible merit of what I did
> propose.
>
> I really think that the current io_uring implementation of the napi
> device tracking strategy is overkill for a lot of scenarios...
>
> If some sort of abstract interface, like a mini struct net_device_ops
> with 3-4 function pointers where the user could select between the
> standard dynamic tracking or a manual lightweight tracking, was present,
> that would be very cool... so cool...
>
> I am definitely interested in running the profiler tools that you are
> proposing... Most of my problems are resolved...
>
> - I got rid of 99.9% of the NET_RX_SOFTIRQ
> - I have reduced significantly the number of NET_TX_SOFTIRQ
>   https://github.com/amzn/amzn-drivers/issues/316
> - No more rcu context switches
> - CPU2 is now nohz_full all the time
> - The CPU1 local timer interrupt is raised once every 2-3 seconds from an
>   unknown origin. Paul E. McKenney offered me his assistance on this
>   issue:
>   https://lore.kernel.org/rcu/[email protected]/t/#u

And I was just going to propose to ask Paul, but it's great to
see you beat me to it.

> I am going to give perf record a second chance... but just keep in
> mind that if it is not recording much, it is not because nothing is
> happening. If perf relies on interrupts to operate properly, there are
> close to 0 of them on my nohz_full CPU...
>
> thx a lot for your help Pavel!

--
Pavel Begunkov

* Re: io_uring NAPI busy poll RCU is causing 50 context switches/second to my sqpoll thread
From: Olivier Langlois @ 2024-08-03 14:15 UTC (permalink / raw)
To: Pavel Begunkov, io-uring

On Fri, 2024-08-02 at 16:22 +0100, Pavel Begunkov wrote:
> >
> > I am definitely interested in running the profiler tools that you are
> > proposing... Most of my problems are resolved...
> >
> > - I got rid of 99.9% of the NET_RX_SOFTIRQ
> > - I have reduced significantly the number of NET_TX_SOFTIRQ
> >   https://github.com/amzn/amzn-drivers/issues/316
> > - No more rcu context switches
> > - CPU2 is now nohz_full all the time
> > - The CPU1 local timer interrupt is raised once every 2-3 seconds from an
> >   unknown origin. Paul E. McKenney offered me his assistance on this
> >   issue:
> >   https://lore.kernel.org/rcu/[email protected]/t/#u
>
> And I was just going to propose to ask Paul, but it's great to
> see you beat me to it.
>
My investigation has progressed... my cpu1 interrupts are nvme block
device interrupts.

I feel that for questions about block device drivers, this time, I am
ringing at the experts' door!

What is the meaning of an nvme interrupt?

I am assuming that it signals the completion of writing blocks to the
device... I am currently looking in the code to find the answer to this.

Next, it seems to me that there is an odd number of interrupts for the
device:

 63:         12          0          0          0  PCI-MSIX-0000:00:04.0   0-edge  nvme0q0
 64:          0      23336          0          0  PCI-MSIX-0000:00:04.0   1-edge  nvme0q1
 65:          0          0          0      33878  PCI-MSIX-0000:00:04.0   2-edge  nvme0q2

Why 3? Why not 4? One for each CPU...

If there were 4, I would have concluded that the driver has created a
queue for each CPU...

How are the queues associated with a certain request/task?

The file I/O is made by threads running on CPU3, so I find it
surprising that nvmeq1 is chosen...

One noteworthy detail is that the process main thread is on CPU1. In my
flawed mental model of 1 queue per CPU, there could be some sort of
magical association between a process file descriptor table and the
chosen block device queue, but this idea does not hold... What would
happen to processes running on CPU2...

* Re: io_uring NAPI busy poll RCU is causing 50 context switches/second to my sqpoll thread
From: Jens Axboe @ 2024-08-03 14:36 UTC (permalink / raw)
To: Olivier Langlois, Pavel Begunkov, io-uring

On 8/3/24 8:15 AM, Olivier Langlois wrote:
> On Fri, 2024-08-02 at 16:22 +0100, Pavel Begunkov wrote:
>>>
>>> I am definitely interested in running the profiler tools that you are
>>> proposing... Most of my problems are resolved...
>>>
>>> - I got rid of 99.9% of the NET_RX_SOFTIRQ
>>> - I have reduced significantly the number of NET_TX_SOFTIRQ
>>>   https://github.com/amzn/amzn-drivers/issues/316
>>> - No more rcu context switches
>>> - CPU2 is now nohz_full all the time
>>> - The CPU1 local timer interrupt is raised once every 2-3 seconds from an
>>>   unknown origin. Paul E. McKenney offered me his assistance on this
>>>   issue:
>>>   https://lore.kernel.org/rcu/[email protected]/t/#u
>>
>> And I was just going to propose to ask Paul, but it's great to
>> see you beat me to it.
>>
> My investigation has progressed... my cpu1 interrupts are nvme block
> device interrupts.
>
> I feel that for questions about block device drivers, this time, I am
> ringing at the experts' door!
>
> What is the meaning of an nvme interrupt?
>
> I am assuming that it signals the completion of writing blocks to the
> device... I am currently looking in the code to find the answer to this.
>
> Next, it seems to me that there is an odd number of interrupts for the
> device:
>
>  63:         12          0          0          0  PCI-MSIX-0000:00:04.0   0-edge  nvme0q0
>  64:          0      23336          0          0  PCI-MSIX-0000:00:04.0   1-edge  nvme0q1
>  65:          0          0          0      33878  PCI-MSIX-0000:00:04.0   2-edge  nvme0q2
>
> Why 3? Why not 4? One for each CPU...
>
> If there were 4, I would have concluded that the driver has created a
> queue for each CPU...
>
> How are the queues associated with a certain request/task?
>
> The file I/O is made by threads running on CPU3, so I find it
> surprising that nvmeq1 is chosen...
>
> One noteworthy detail is that the process main thread is on CPU1. In my
> flawed mental model of 1 queue per CPU, there could be some sort of
> magical association between a process file descriptor table and the
> chosen block device queue, but this idea does not hold... What would
> happen to processes running on CPU2...

The cpu <-> hw queue mappings for nvme devices depend on the topology
of the machine (number of CPUs, relation between thread siblings,
number of nodes, etc) and the number of queues available on the device
in question. If you have as many (or more) device side queues available
as the number of CPUs, then you'll have a queue per CPU. If you have
less, then multiple CPUs will share a queue.

You can check the mappings in /sys/kernel/debug/block/<device>/

In there you'll find a number of hctxN folders, each of these is a
hardware queue. hctx0/type tells you what kind of queue it is, and
inside the directory, you'll find which CPUs this queue is mapped to.
Example:

root@r7625 /s/k/d/b/nvme0n1# cat hctx1/type
default

"default" means it's a read/write queue, so it'll handle both reads and
writes.

root@r7625 /s/k/d/b/nvme0n1# ls hctx1/
active  cpu11/   dispatch       sched_tags         tags
busy    cpu266/  dispatch_busy  sched_tags_bitmap  tags_bitmap
cpu10/  ctx_map  flags          state              type

and we can see this hardware queue is mapped to cpu 10/11/266.

That ties into how these are mapped. It's pretty simple - if a task is
running on cpu 10/11/266 when it's queueing IO, then it'll use hw queue
1. This maps to the interrupts you found, but note that the admin queue
(which is not listed in these directories, as it's not an IO queue) is
the first one there. hctx0 is nvme0q1 in your /proc/interrupts list.

If IO is queued on hctx1, then it should complete on the interrupt
vector associated with nvme0q2.

--
Jens Axboe

* Re: io_uring NAPI busy poll RCU is causing 50 context switches/second to my sqpoll thread
From: Olivier Langlois @ 2024-08-03 16:50 UTC (permalink / raw)
To: Jens Axboe, Pavel Begunkov, io-uring

On Sat, 2024-08-03 at 08:36 -0600, Jens Axboe wrote:
> You can check the mappings in /sys/kernel/debug/block/<device>/
>
> In there you'll find a number of hctxN folders, each of these is a
> hardware queue. hctx0/type tells you what kind of queue it is, and
> inside the directory, you'll find which CPUs this queue is mapped to.
> Example:
>
> root@r7625 /s/k/d/b/nvme0n1# cat hctx1/type
> default
>
> "default" means it's a read/write queue, so it'll handle both reads and
> writes.
>
> root@r7625 /s/k/d/b/nvme0n1# ls hctx1/
> active  cpu11/   dispatch       sched_tags         tags
> busy    cpu266/  dispatch_busy  sched_tags_bitmap  tags_bitmap
> cpu10/  ctx_map  flags          state              type
>
> and we can see this hardware queue is mapped to cpu 10/11/266.
>
> That ties into how these are mapped. It's pretty simple - if a task is
> running on cpu 10/11/266 when it's queueing IO, then it'll use hw queue
> 1. This maps to the interrupts you found, but note that the admin queue
> (which is not listed in these directories, as it's not an IO queue) is
> the first one there. hctx0 is nvme0q1 in your /proc/interrupts list.
>
> If IO is queued on hctx1, then it should complete on the interrupt
> vector associated with nvme0q2.
>
Jens,

I knew there were nvme experts here! thx for your help.

# ls nvme0n1/hctx0/
active  busy  cpu0  cpu1  ctx_map  dispatch  dispatch_busy  flags
sched_tags  sched_tags_bitmap  state  tags  tags_bitmap  type

It means that some I/O that I am unaware of is initiated either from
cpu0 or cpu1...

It seems like the number of nvme queues is configurable... I'll try to
find out how to reduce it to 1... but my real problem is not really
which I/O queue is assigned to a request. It is the irq affinity
assigned to the queues...

I have found the function nvme_setup_irqs(), where the assignments
happen.

Considering that I have the boot param irqaffinity=3, I do not
understand how the admin queue and hctx0 irqs can be assigned to cpu 0
and cpu 1. It is as if the irqaffinity param had no effect on the MSI-X
interrupt affinity masks...

* Re: io_uring NAPI busy poll RCU is causing 50 context switches/second to my sqpoll thread
From: Olivier Langlois @ 2024-08-03 21:37 UTC (permalink / raw)
To: Jens Axboe, Pavel Begunkov, io-uring

On Sat, 2024-08-03 at 08:36 -0600, Jens Axboe wrote:
>
> You can check the mappings in /sys/kernel/debug/block/<device>/
>
> In there you'll find a number of hctxN folders, each of these is a
> hardware queue. hctx0/type tells you what kind of queue it is, and
> inside the directory, you'll find which CPUs this queue is mapped to.
> Example:
>
> root@r7625 /s/k/d/b/nvme0n1# cat hctx1/type
> default
>
> "default" means it's a read/write queue, so it'll handle both reads and
> writes.
>
> root@r7625 /s/k/d/b/nvme0n1# ls hctx1/
> active  cpu11/   dispatch       sched_tags         tags
> busy    cpu266/  dispatch_busy  sched_tags_bitmap  tags_bitmap
> cpu10/  ctx_map  flags          state              type
>
> and we can see this hardware queue is mapped to cpu 10/11/266.
>
> That ties into how these are mapped. It's pretty simple - if a task is
> running on cpu 10/11/266 when it's queueing IO, then it'll use hw queue
> 1. This maps to the interrupts you found, but note that the admin queue
> (which is not listed in these directories, as it's not an IO queue) is
> the first one there. hctx0 is nvme0q1 in your /proc/interrupts list.
>
> If IO is queued on hctx1, then it should complete on the interrupt
> vector associated with nvme0q2.
>
I have entered hacking territory, but I did not find any other way to
do it...

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 6cd9395ba9ec..70b7ca84ee21 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2299,7 +2299,7 @@ static unsigned int nvme_max_io_queues(struct nvme_dev *dev)
 	 */
 	if (dev->ctrl.quirks & NVME_QUIRK_SHARED_TAGS)
 		return 1;
-	return num_possible_cpus() + dev->nr_write_queues + dev->nr_poll_queues;
+	return 1 + dev->nr_write_queues + dev->nr_poll_queues;
 }
 
 static int nvme_setup_io_queues(struct nvme_dev *dev)

It works. I no longer have any IRQ on cpu1, as I wanted:

 63:          9          0          0          0  PCI-MSIX-0000:00:04.0   0-edge  nvme0q0
 64:          0          0          0       7533  PCI-MSIX-0000:00:04.0   1-edge  nvme0q1

# ls /sys/kernel/debug/block/nvme0n1/hctx0/
active  busy  cpu0  cpu1  cpu2  cpu3  ctx_map  dispatch  dispatch_busy
flags  sched_tags  sched_tags_bitmap  state  tags  tags_bitmap  type