* io_uring NAPI busy poll RCU is causing 50 context switches/second to my sqpoll thread
From: Olivier Langlois @ 2024-07-30 20:05 UTC (permalink / raw)
To: io-uring; +Cc: netdev
If you are interested in all the details, they are all here:
https://github.com/axboe/liburing/issues/1190

It seems like I like to write a lot when I am investigating a problem.
Pavel has been a great help in assisting me in understanding what was
happening.

Next, I came to question where the integration of RCU came from and I
found this:
https://lore.kernel.org/all/[email protected]/

I guess that in some use cases, being able to automatically manage
hundreds of NAPI devices that can suddenly all be swept away during a
device reconfiguration is something desirable to have for some...

but in my case, this is an excessively high price to pay for a
flexibility that I do not need at all.

I have a single NAPI device. Once I know what it is, it will practically
remain immutable until termination.

For that reason, I am thinking that offering some sort of polymorphic
NAPI device tracking strategy customization would be desirable.

The current one, the RCU-based one, I would call
dynamic_napi_tracking (rcu could be peppered into the name somewhere so
people know what the strategy is up to),

whereas the new one that I am imagining would be called
static_napi_tracking.

NAPI devices would be added/removed manually by the user through an
extended registration function.

For the sake of convenience, a clear_list operation could even be
offered.

The benefits of this new static tracking strategy would be numerous:
- it removes the need to invoke the heavy-duty RCU cavalry
- no need to scan the list to remove stale devices
- no need to search the list at each SQE submission to update the
  device timeout value

So is this a good idea in your opinion?
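
To make the idea a bit more concrete, here is a rough userspace sketch of
how an application with a single connection could feed its NAPI id to such
a static tracking interface. SO_INCOMING_NAPI_ID is a real socket option;
the registration opcode and payload struct below are invented purely for
illustration and are not an existing io_uring/liburing API:

/*
 * Sketch only: SO_INCOMING_NAPI_ID is a real socket option, but the
 * "static tracking" registration below is hypothetical; the opcode and
 * struct names are invented to illustrate the proposed interface.
 */
#include <stdint.h>
#include <stdio.h>
#include <sys/socket.h>

#ifndef SO_INCOMING_NAPI_ID
#define SO_INCOMING_NAPI_ID 56	/* from asm-generic/socket.h */
#endif

/* hypothetical payload for manually registering a NAPI device */
struct napi_static_entry {
	uint32_t napi_id;        /* value read via SO_INCOMING_NAPI_ID */
	uint32_t busy_poll_usec; /* busy poll budget for this device */
};

/* read the NAPI id of the device that delivered the last packet on sockfd */
static int get_incoming_napi_id(int sockfd, uint32_t *napi_id)
{
	socklen_t len = sizeof(*napi_id);

	return getsockopt(sockfd, SOL_SOCKET, SO_INCOMING_NAPI_ID,
			  napi_id, &len);
}

int main(void)
{
	int fd = socket(AF_INET, SOCK_DGRAM, 0);
	uint32_t id = 0;

	if (fd >= 0 && !get_incoming_napi_id(fd, &id))
		printf("napi id: %u (stays 0 until traffic is received)\n",
		       (unsigned)id);

	/*
	 * The extended registration function would then be something along
	 * these lines (hypothetical opcode, not an existing API):
	 *
	 *   struct napi_static_entry e = { .napi_id = id, .busy_poll_usec = 100 };
	 *   io_uring_register(ring_fd, IORING_REGISTER_NAPI_STATIC, &e, 1);
	 *
	 * with a matching unregister/clear_list counterpart.
	 */
	return 0;
}

With a single immutable device, the registered list would never change
after setup, which is exactly the use case described above.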

* Re: io_uring NAPI busy poll RCU is causing 50 context switches/second to my sqpoll thread
From: Pavel Begunkov @ 2024-07-30 20:25 UTC (permalink / raw)
To: Olivier Langlois, io-uring; +Cc: netdev

On 7/30/24 21:05, Olivier Langlois wrote:
> If you are interested in all the details, they are all here:
> https://github.com/axboe/liburing/issues/1190
>
> It seems like I like to write a lot when I am investigating a problem.
> Pavel has been a great help in assisting me in understanding what was
> happening.
>
> Next, I came to question where the integration of RCU came from and I
> found this:
> https://lore.kernel.org/all/[email protected]/
>
> I guess that in some use cases, being able to automatically manage
> hundreds of NAPI devices that can suddenly all be swept away during a
> device reconfiguration is something desirable to have for some...
>
> but in my case, this is an excessively high price to pay for a
> flexibility that I do not need at all.

Removing an entry or two once every minute is definitely not
going to take 50% CPU; the RCU machinery is running in the background
regardless of whether io_uring uses it or not, and it's pretty
cheap considering amortisation.

If anything, from your explanation it sounds more like the
scheduler makes a wrong decision and schedules out the sqpoll
thread even though it could continue to run, but that needs
confirmation. Does the CPU your SQPOLL task is pinned to stay
100% utilised?

> I have a single NAPI device. Once I know what it is, it will practically
> remain immutable until termination.
>
> For that reason, I am thinking that offering some sort of polymorphic
> NAPI device tracking strategy customization would be desirable.
>
> The current one, the RCU-based one, I would call
> dynamic_napi_tracking (rcu could be peppered into the name somewhere so
> people know what the strategy is up to),
>
> whereas the new one that I am imagining would be called
> static_napi_tracking.
>
> NAPI devices would be added/removed manually by the user through an
> extended registration function.
>
> For the sake of convenience, a clear_list operation could even be
> offered.
>
> The benefits of this new static tracking strategy would be numerous:
> - it removes the need to invoke the heavy-duty RCU cavalry
> - no need to scan the list to remove stale devices
> - no need to search the list at each SQE submission to update the
>   device timeout value
>
> So is this a good idea in your opinion?

I believe that's a good thing, I've been prototyping a similar
if not the same approach just today, i.e. the user [un]registers a napi
instance by the id you can get with SO_INCOMING_NAPI_ID.

--
Pavel Begunkov

* Re: io_uring NAPI busy poll RCU is causing 50 context switches/second to my sqpoll thread
From: Olivier Langlois @ 2024-07-30 23:14 UTC (permalink / raw)
To: Pavel Begunkov, io-uring; +Cc: netdev

On Tue, 2024-07-30 at 21:25 +0100, Pavel Begunkov wrote:
>
> Removing an entry or two once every minute is definitely not
> going to take 50% CPU; the RCU machinery is running in the background
> regardless of whether io_uring uses it or not, and it's pretty
> cheap considering amortisation.
>
> If anything, from your explanation it sounds more like the
> scheduler makes a wrong decision and schedules out the sqpoll
> thread even though it could continue to run, but that needs
> confirmation. Does the CPU your SQPOLL task is pinned to stay
> 100% utilised?

Here are the facts as they are documented in the github issue.

1. Despite thinking that I was doing NAPI busy polling, I was not,
because my ring was not receiving any sqe after its initial setup.

This is what the patch developed with your input,
https://lore.kernel.org/io-uring/382791dc97d208d88ee31e5ebb5b661a0453fb79.1722374371.git.olivier@trillion01.com/T/#u
is addressing.

(BTW, I should check if there is such a thing, but I would love to know
if the net code exposes a tracepoint when napi_busy_poll is called,
because it is very tricky to know whether it is done for real or not.)

2. The moment a second ring that was receiving a lot of sqes was
attached to the sqpoll thread, the NAPI busy loop started to be made
for real and the sqpoll cpu usage inexplicably dropped from 99% to 55%.

3. Here is my kernel cmdline:
hugepages=72 isolcpus=0,1,2 nohz_full=0,1,2 rcu_nocbs=0,1,2
rcu_nocb_poll irqaffinity=3 idle=nomwait processor.max_cstate=1
intel_idle.max_cstate=1 nmi_watchdog=0

There is absolutely nothing else on CPU0, where the sqpoll thread
affinity is set to run.

4. I got the idea of doing this:
echo common_pid == sqpoll_pid > /sys/kernel/tracing/events/sched/filter
echo 1 > /sys/kernel/tracing/events/sched/sched_switch/enable

and I have recorded over 1,000 context switches in 23 seconds with RCU
related kernel threads.

5. Just for the fun of checking it out, I disabled NAPI polling on my
io_uring rings and the sqpoll thread magically returned to 99% CPU
usage from 55%...

I am open to other explanations for what I have observed, but my current
conclusion is based on what I am able to see... the evidence appears
very convincing to me...

> >
> > So is this a good idea in your opinion?
>
> I believe that's a good thing, I've been prototyping a similar
> if not the same approach just today, i.e. the user [un]registers a napi
> instance by the id you can get with SO_INCOMING_NAPI_ID.
>

This is fantastic! I am super happy to see all this NAPI busy polling
feature interest and activity; it is a feature that I am very fond of
(along with io_uring).

I am looking forward to collaborating with you, Pavel, to make io_uring
the best NAPI busy polling goto solution!

Greetings,
Olivier

* Re: io_uring NAPI busy poll RCU is causing 50 context switches/second to my sqpoll thread
From: Pavel Begunkov @ 2024-07-31 0:33 UTC (permalink / raw)
To: Olivier Langlois, io-uring; +Cc: netdev

On 7/31/24 00:14, Olivier Langlois wrote:
> On Tue, 2024-07-30 at 21:25 +0100, Pavel Begunkov wrote:
>>
>> Removing an entry or two once every minute is definitely not
>> going to take 50% CPU; the RCU machinery is running in the background
>> regardless of whether io_uring uses it or not, and it's pretty
>> cheap considering amortisation.
>>
>> If anything, from your explanation it sounds more like the
>> scheduler makes a wrong decision and schedules out the sqpoll
>> thread even though it could continue to run, but that needs
>> confirmation. Does the CPU your SQPOLL task is pinned to stay
>> 100% utilised?
>
> Here are the facts as they are documented in the github issue.
>
> 1. Despite thinking that I was doing NAPI busy polling, I was not,
> because my ring was not receiving any sqe after its initial setup.
>
> This is what the patch developed with your input,
> https://lore.kernel.org/io-uring/382791dc97d208d88ee31e5ebb5b661a0453fb79.1722374371.git.olivier@trillion01.com/T/#u
> is addressing.
>
> (BTW, I should check if there is such a thing, but I would love to know
> if the net code exposes a tracepoint when napi_busy_poll is called,
> because it is very tricky to know whether it is done for real or not.)
>
> 2. The moment a second ring that was receiving a lot of sqes was
> attached to the sqpoll thread, the NAPI busy loop started to be made
> for real and the sqpoll cpu usage inexplicably dropped from 99% to 55%.
>
> 3. Here is my kernel cmdline:
> hugepages=72 isolcpus=0,1,2 nohz_full=0,1,2 rcu_nocbs=0,1,2
> rcu_nocb_poll irqaffinity=3 idle=nomwait processor.max_cstate=1
> intel_idle.max_cstate=1 nmi_watchdog=0
>
> There is absolutely nothing else on CPU0, where the sqpoll thread
> affinity is set to run.
>
> 4. I got the idea of doing this:
> echo common_pid == sqpoll_pid > /sys/kernel/tracing/events/sched/filter
> echo 1 > /sys/kernel/tracing/events/sched/sched_switch/enable
>
> and I have recorded over 1,000 context switches in 23 seconds with RCU
> related kernel threads.
>
> 5. Just for the fun of checking it out, I disabled NAPI polling on my
> io_uring rings and the sqpoll thread magically returned to 99% CPU
> usage from 55%...
>
> I am open to other explanations for what I have observed, but my current
> conclusion is based on what I am able to see... the evidence appears
> very convincing to me...

You're seeing something that doesn't make much sense to me, and we need
to understand what that is. There might be a bug _somewhere_, that's
always a possibility, but before saying that let's get a bit more data.

While the app is working, can you grab a profile and run mpstat for the
CPU on which you have the SQPOLL task?

perf record -g -C <CPU number> --all-kernel &
mpstat -u -P <CPU number> 5 10 &

And then as usual, time it so that you have some activity going on,
the mpstat interval may need adjustments, and perf report it as before.

--
Pavel Begunkov

* Re: io_uring NAPI busy poll RCU is causing 50 context switches/second to my sqpoll thread
From: Pavel Begunkov @ 2024-07-31 1:00 UTC (permalink / raw)
To: Olivier Langlois, io-uring; +Cc: netdev

On 7/31/24 01:33, Pavel Begunkov wrote:
> On 7/31/24 00:14, Olivier Langlois wrote:
>> On Tue, 2024-07-30 at 21:25 +0100, Pavel Begunkov wrote:
>>>
>>> Removing an entry or two once every minute is definitely not
>>> going to take 50% CPU; the RCU machinery is running in the background
>>> regardless of whether io_uring uses it or not, and it's pretty
>>> cheap considering amortisation.
>>>
>>> If anything, from your explanation it sounds more like the
>>> scheduler makes a wrong decision and schedules out the sqpoll
>>> thread even though it could continue to run, but that needs
>>> confirmation. Does the CPU your SQPOLL task is pinned to stay
>>> 100% utilised?
>>
>> Here are the facts as they are documented in the github issue.
>>
>> 1. Despite thinking that I was doing NAPI busy polling, I was not,
>> because my ring was not receiving any sqe after its initial setup.
>>
>> This is what the patch developed with your input,
>> https://lore.kernel.org/io-uring/382791dc97d208d88ee31e5ebb5b661a0453fb79.1722374371.git.olivier@trillion01.com/T/#u
>> is addressing.
>>
>> (BTW, I should check if there is such a thing, but I would love to know
>> if the net code exposes a tracepoint when napi_busy_poll is called,
>> because it is very tricky to know whether it is done for real or not.)
>>
>> 2. The moment a second ring that was receiving a lot of sqes was
>> attached to the sqpoll thread, the NAPI busy loop started to be made
>> for real and the sqpoll cpu usage inexplicably dropped from 99% to 55%.
>>
>> 3. Here is my kernel cmdline:
>> hugepages=72 isolcpus=0,1,2 nohz_full=0,1,2 rcu_nocbs=0,1,2
>> rcu_nocb_poll irqaffinity=3 idle=nomwait processor.max_cstate=1
>> intel_idle.max_cstate=1 nmi_watchdog=0
>>
>> There is absolutely nothing else on CPU0, where the sqpoll thread
>> affinity is set to run.
>>
>> 4. I got the idea of doing this:
>> echo common_pid == sqpoll_pid > /sys/kernel/tracing/events/sched/filter
>> echo 1 > /sys/kernel/tracing/events/sched/sched_switch/enable
>>
>> and I have recorded over 1,000 context switches in 23 seconds with RCU
>> related kernel threads.
>>
>> 5. Just for the fun of checking it out, I disabled NAPI polling on my
>> io_uring rings and the sqpoll thread magically returned to 99% CPU
>> usage from 55%...
>>
>> I am open to other explanations for what I have observed, but my current
>> conclusion is based on what I am able to see... the evidence appears
>> very convincing to me...
>
> You're seeing something that doesn't make much sense to me, and we need
> to understand what that is. There might be a bug _somewhere_, that's
> always a possibility, but before saying that let's get a bit more data.
>
> While the app is working, can you grab a profile and run mpstat for the
> CPU on which you have the SQPOLL task?
>
> perf record -g -C <CPU number> --all-kernel &
> mpstat -u -P <CPU number> 5 10 &
>
> And then as usual, time it so that you have some activity going on,
> the mpstat interval may need adjustments, and perf report it as before.

I forgot to add: ~50 switches/second for relatively brief RCU handling
is not much, not enough to take 50% of a CPU. I wonder if sqpoll was
still running but the napi busy polling time got accounted to softirq
because of disabled bh and you didn't include it, hence asking for the
CPU stats. Do you see any latency problems with that configuration?

--
Pavel Begunkov

* Re: io_uring NAPI busy poll RCU is causing 50 context switches/second to my sqpoll thread
From: Olivier Langlois @ 2024-08-01 23:05 UTC (permalink / raw)
To: Pavel Begunkov, io-uring; +Cc: netdev

On Wed, 2024-07-31 at 02:00 +0100, Pavel Begunkov wrote:
>
> I forgot to add: ~50 switches/second for relatively brief RCU handling
> is not much, not enough to take 50% of a CPU. I wonder if sqpoll was
> still running but the napi busy polling time got accounted to softirq
> because of disabled bh and you didn't include it, hence asking for the
> CPU stats. Do you see any latency problems with that configuration?
>
Pavel,

I am not sure if I will ever discover what this 50% CPU usage drop was
exactly.

When I did test
https://lore.kernel.org/io-uring/382791dc97d208d88ee31e5ebb5b661a0453fb79.1722374371.git.olivier@trillion01.com/T/#u
from this custom setup:
https://github.com/axboe/liburing/issues/1190#issuecomment-2258632731
the iou-sqp task cpu usage went back to 100%...

There were also my busy_poll config numbers that were inadequate. I went
from:

echo 1000 > /sys/class/net/enp39s0/napi_defer_hard_irqs
echo 500 > /sys/class/net/enp39s0/gro_flush_timeout

to:

echo 5000 > /sys/class/net/enp39s0/napi_defer_hard_irqs
# gro_flush_timeout unit is nanoseconds
echo 100000 > /sys/class/net/enp39s0/gro_flush_timeout

ksoftirqd has stopped being awakened to service the NET softirqs, but I
would guess that this might not be the cause either.

I have no more latency issues. After a lot of effort during the last 7
days, my system latency has improved by a good 10usec on average over
what it was last week... but knowing that it can be even better is
stopping me from letting go...

The sporadic CPU1 interrupt can introduce a 27usec delay, and this is
the difference between a win or a loss that is at stake...
https://lore.kernel.org/rcu/[email protected]/T/#m5abf9aa02ec7648c615885a6f8ebdebc57935c35

I want to get rid of that interrupt so badly that it is going to be a
great satisfaction when I have finally found the cause...

* Re: io_uring NAPI busy poll RCU is causing 50 context switches/second to my sqpoll thread
From: Olivier Langlois @ 2024-08-01 22:02 UTC (permalink / raw)
To: Pavel Begunkov, io-uring; +Cc: netdev

On Wed, 2024-07-31 at 01:33 +0100, Pavel Begunkov wrote:
>
> You're seeing something that doesn't make much sense to me, and we need
> to understand what that is. There might be a bug _somewhere_, that's
> always a possibility, but before saying that let's get a bit more data.
>
> While the app is working, can you grab a profile and run mpstat for the
> CPU on which you have the SQPOLL task?
>
> perf record -g -C <CPU number> --all-kernel &
> mpstat -u -P <CPU number> 5 10 &
>
> And then as usual, time it so that you have some activity going on,
> the mpstat interval may need adjustments, and perf report it as before.
>
First things first.

The other day, I put my foot in my mouth by saying that NAPI busy
polling was adding 50 context switches/second.

I was responsible for that behavior with the rcu_nocb_poll boot kernel
param. I have removed the option and the context switches went away...

I am clearly outside my comfort zone with this project; I am trying
things without fully understanding what I am doing, and I am making
errors and saying stuff that is incorrect.

On top of that, before mentioning io_uring's RCU usage, I did not
realize that net/core was already massively using RCU, including in
napi_busy_poll; therefore, the point that io_uring uses RCU before
calling napi_busy_poll seems very moot.

This is what I did the other day, and I wanted to apologize for having
said something incorrect.

That being said, it does not remove the possible merit of what I did
propose.

I really think that the current io_uring implementation of the napi
device tracking strategy is overkill for a lot of scenarios...

If some sort of abstract interface, like a mini struct net_device_ops
with 3-4 function pointers where the user could select between the
standard dynamic tracking or a manual lightweight tracking, was present,
that would be very cool... so cool...

I am definitely interested in running the profiler tools that you are
proposing... Most of my problems are resolved...

- I got rid of 99.9% of the NET_RX_SOFTIRQ
- I have reduced significantly the number of NET_TX_SOFTIRQ
  https://github.com/amzn/amzn-drivers/issues/316
- No more rcu context switches
- CPU2 is now nohz_full all the time
- The CPU1 local timer interrupt is raised once every 2-3 seconds from an
  unknown origin. Paul E. McKenney offered me his assistance on this
  issue:
  https://lore.kernel.org/rcu/[email protected]/t/#u

I am going to give perf record a second chance... but just keep in
mind that if it is not recording much, it is not because nothing is
happening. If perf relies on interrupts to operate properly, there are
close to 0 of them on my nohz_full CPU...

thx a lot for your help Pavel!

* Re: io_uring NAPI busy poll RCU is causing 50 context switches/second to my sqpoll thread
From: Pavel Begunkov @ 2024-08-02 15:22 UTC (permalink / raw)
To: Olivier Langlois, io-uring; +Cc: netdev

On 8/1/24 23:02, Olivier Langlois wrote:
> On Wed, 2024-07-31 at 01:33 +0100, Pavel Begunkov wrote:
>>
>> You're seeing something that doesn't make much sense to me, and we need
>> to understand what that is. There might be a bug _somewhere_, that's
>> always a possibility, but before saying that let's get a bit more data.
>>
>> While the app is working, can you grab a profile and run mpstat for the
>> CPU on which you have the SQPOLL task?
>>
>> perf record -g -C <CPU number> --all-kernel &
>> mpstat -u -P <CPU number> 5 10 &
>>
>> And then as usual, time it so that you have some activity going on,
>> the mpstat interval may need adjustments, and perf report it as before.
>>
> First things first.
>
> The other day, I put my foot in my mouth by saying that NAPI busy
> polling was adding 50 context switches/second.
>
> I was responsible for that behavior with the rcu_nocb_poll boot kernel
> param. I have removed the option and the context switches went away...
>
> I am clearly outside my comfort zone with this project; I am trying
> things without fully understanding what I am doing, and I am making
> errors and saying stuff that is incorrect.
>
> On top of that, before mentioning io_uring's RCU usage, I did not
> realize that net/core was already massively using RCU, including in
> napi_busy_poll; therefore, the point that io_uring uses RCU before
> calling napi_busy_poll seems very moot.
>
> This is what I did the other day, and I wanted to apologize for having
> said something incorrect.

No worries at all, you're pushing your configuration to extremes,
anyone would get lost in the options, and I'm getting curious what
you can squeeze from it. It's true that the current tracking scheme
might be overkill, but not because of mild RCU use.

> That being said, it does not remove the possible merit of what I did
> propose.
>
> I really think that the current io_uring implementation of the napi
> device tracking strategy is overkill for a lot of scenarios...
>
> If some sort of abstract interface, like a mini struct net_device_ops
> with 3-4 function pointers where the user could select between the
> standard dynamic tracking or a manual lightweight tracking, was present,
> that would be very cool... so cool...
>
> I am definitely interested in running the profiler tools that you are
> proposing... Most of my problems are resolved...
>
> - I got rid of 99.9% of the NET_RX_SOFTIRQ
> - I have reduced significantly the number of NET_TX_SOFTIRQ
>   https://github.com/amzn/amzn-drivers/issues/316
> - No more rcu context switches
> - CPU2 is now nohz_full all the time
> - The CPU1 local timer interrupt is raised once every 2-3 seconds from an
>   unknown origin. Paul E. McKenney offered me his assistance on this
>   issue:
>   https://lore.kernel.org/rcu/[email protected]/t/#u

And I was just going to propose to ask Paul, but it's great to
see you beat me to it.

> I am going to give perf record a second chance... but just keep in
> mind that if it is not recording much, it is not because nothing is
> happening. If perf relies on interrupts to operate properly, there are
> close to 0 of them on my nohz_full CPU...
>
> thx a lot for your help Pavel!

--
Pavel Begunkov

* Re: io_uring NAPI busy poll RCU is causing 50 context switches/second to my sqpoll thread
From: Olivier Langlois @ 2024-08-03 14:15 UTC (permalink / raw)
To: Pavel Begunkov, io-uring

On Fri, 2024-08-02 at 16:22 +0100, Pavel Begunkov wrote:
> >
> > I am definitely interested in running the profiler tools that you are
> > proposing... Most of my problems are resolved...
> >
> > - I got rid of 99.9% of the NET_RX_SOFTIRQ
> > - I have reduced significantly the number of NET_TX_SOFTIRQ
> >   https://github.com/amzn/amzn-drivers/issues/316
> > - No more rcu context switches
> > - CPU2 is now nohz_full all the time
> > - The CPU1 local timer interrupt is raised once every 2-3 seconds from an
> >   unknown origin. Paul E. McKenney offered me his assistance on this
> >   issue:
> >   https://lore.kernel.org/rcu/[email protected]/t/#u
>
> And I was just going to propose to ask Paul, but it's great to
> see you beat me to it.
>
My investigation has progressed... my cpu1 interrupts are nvme block
device interrupts.

I feel that for questions about block device drivers, this time, I am
ringing at the experts' door!

What is the meaning of an nvme interrupt?

I am assuming that it signals the completion of writing blocks to the
device... I am currently looking in the code to find the answer to this.

Next, it seems to me that there is an odd number of interrupts for the
device:

 63:         12          0          0          0  PCI-MSIX-0000:00:04.0   0-edge  nvme0q0
 64:          0      23336          0          0  PCI-MSIX-0000:00:04.0   1-edge  nvme0q1
 65:          0          0          0      33878  PCI-MSIX-0000:00:04.0   2-edge  nvme0q2

Why 3? Why not 4? One for each CPU...

If there were 4, I would have concluded that the driver has created a
queue for each CPU...

How are the queues associated with a certain request/task?

The file I/O is made by threads running on CPU3, so I find it
surprising that nvmeq1 is chosen...

One noteworthy detail is that the process main thread is on CPU1. In my
flawed mental model of 1 queue per CPU, there could be some sort of
magical association between a process file descriptor table and the
chosen block device queue, but this idea does not hold... What would
happen to processes running on CPU2...

* Re: io_uring NAPI busy poll RCU is causing 50 context switches/second to my sqpoll thread
From: Jens Axboe @ 2024-08-03 14:36 UTC (permalink / raw)
To: Olivier Langlois, Pavel Begunkov, io-uring

On 8/3/24 8:15 AM, Olivier Langlois wrote:
> On Fri, 2024-08-02 at 16:22 +0100, Pavel Begunkov wrote:
>>>
>>> I am definitely interested in running the profiler tools that you are
>>> proposing... Most of my problems are resolved...
>>>
>>> - I got rid of 99.9% of the NET_RX_SOFTIRQ
>>> - I have reduced significantly the number of NET_TX_SOFTIRQ
>>>   https://github.com/amzn/amzn-drivers/issues/316
>>> - No more rcu context switches
>>> - CPU2 is now nohz_full all the time
>>> - The CPU1 local timer interrupt is raised once every 2-3 seconds from an
>>>   unknown origin. Paul E. McKenney offered me his assistance on this
>>>   issue:
>>>   https://lore.kernel.org/rcu/[email protected]/t/#u
>>
>> And I was just going to propose to ask Paul, but it's great to
>> see you beat me to it.
>>
> My investigation has progressed... my cpu1 interrupts are nvme block
> device interrupts.
>
> I feel that for questions about block device drivers, this time, I am
> ringing at the experts' door!
>
> What is the meaning of an nvme interrupt?
>
> I am assuming that it signals the completion of writing blocks to the
> device... I am currently looking in the code to find the answer to this.
>
> Next, it seems to me that there is an odd number of interrupts for the
> device:
>
>  63:         12          0          0          0  PCI-MSIX-0000:00:04.0   0-edge  nvme0q0
>  64:          0      23336          0          0  PCI-MSIX-0000:00:04.0   1-edge  nvme0q1
>  65:          0          0          0      33878  PCI-MSIX-0000:00:04.0   2-edge  nvme0q2
>
> Why 3? Why not 4? One for each CPU...
>
> If there were 4, I would have concluded that the driver has created a
> queue for each CPU...
>
> How are the queues associated with a certain request/task?
>
> The file I/O is made by threads running on CPU3, so I find it
> surprising that nvmeq1 is chosen...
>
> One noteworthy detail is that the process main thread is on CPU1. In my
> flawed mental model of 1 queue per CPU, there could be some sort of
> magical association between a process file descriptor table and the
> chosen block device queue, but this idea does not hold... What would
> happen to processes running on CPU2...

The cpu <-> hw queue mappings for nvme devices depend on the topology
of the machine (number of CPUs, relation between thread siblings,
number of nodes, etc) and the number of queues available on the device
in question. If you have as many (or more) device side queues available
as the number of CPUs, then you'll have a queue per CPU. If you have
less, then multiple CPUs will share a queue.

You can check the mappings in /sys/kernel/debug/block/<device>/

In there you'll find a number of hctxN folders, each of these is a
hardware queue. hctx0/type tells you what kind of queue it is, and
inside the directory, you'll find which CPUs this queue is mapped to.
Example:

root@r7625 /s/k/d/b/nvme0n1# cat hctx1/type
default

"default" means it's a read/write queue, so it'll handle both reads and
writes.

root@r7625 /s/k/d/b/nvme0n1# ls hctx1/
active  cpu11/   dispatch       sched_tags         tags
busy    cpu266/  dispatch_busy  sched_tags_bitmap  tags_bitmap
cpu10/  ctx_map  flags          state              type

and we can see this hardware queue is mapped to cpu 10/11/266.

That ties into how these are mapped. It's pretty simple - if a task is
running on cpu 10/11/266 when it's queueing IO, then it'll use hw queue
1. This maps to the interrupts you found, but note that the admin queue
(which is not listed in these directories, as it's not an IO queue) is
the first one there. hctx0 is nvme0q1 in your /proc/interrupts list.

If IO is queued on hctx1, then it should complete on the interrupt
vector associated with nvme0q2.

--
Jens Axboe

* Re: io_uring NAPI busy poll RCU is causing 50 context switches/second to my sqpoll thread
From: Olivier Langlois @ 2024-08-03 16:50 UTC (permalink / raw)
To: Jens Axboe, Pavel Begunkov, io-uring

On Sat, 2024-08-03 at 08:36 -0600, Jens Axboe wrote:
> You can check the mappings in /sys/kernel/debug/block/<device>/
>
> In there you'll find a number of hctxN folders, each of these is a
> hardware queue. hctx0/type tells you what kind of queue it is, and
> inside the directory, you'll find which CPUs this queue is mapped to.
> Example:
>
> root@r7625 /s/k/d/b/nvme0n1# cat hctx1/type
> default
>
> "default" means it's a read/write queue, so it'll handle both reads and
> writes.
>
> root@r7625 /s/k/d/b/nvme0n1# ls hctx1/
> active  cpu11/   dispatch       sched_tags         tags
> busy    cpu266/  dispatch_busy  sched_tags_bitmap  tags_bitmap
> cpu10/  ctx_map  flags          state              type
>
> and we can see this hardware queue is mapped to cpu 10/11/266.
>
> That ties into how these are mapped. It's pretty simple - if a task is
> running on cpu 10/11/266 when it's queueing IO, then it'll use hw queue
> 1. This maps to the interrupts you found, but note that the admin queue
> (which is not listed in these directories, as it's not an IO queue) is
> the first one there. hctx0 is nvme0q1 in your /proc/interrupts list.
>
> If IO is queued on hctx1, then it should complete on the interrupt
> vector associated with nvme0q2.
>
Jens,

I knew there were nvme experts here! thx for your help.

# ls nvme0n1/hctx0/
active  busy  cpu0  cpu1  ctx_map  dispatch  dispatch_busy  flags
sched_tags  sched_tags_bitmap  state  tags  tags_bitmap  type

It means that some I/O that I am unaware of is initiated either from
cpu0 or cpu1...

It seems like the number of nvme queues is configurable... I'll try to
find out how to reduce it to 1... but my real problem is not really
which I/O queue is assigned to a request. It is the irq affinity
assigned to the queues...

I have found the function nvme_setup_irqs(), where the assignments
happen.

Considering that I have the boot param irqaffinity=3, I do not
understand how the admin queue and hctx0 irqs can be assigned to cpu 0
and cpu 1. It is as if the irqaffinity param had no effect on the MSI-X
interrupt affinity masks...

* Re: io_uring NAPI busy poll RCU is causing 50 context switches/second to my sqpoll thread
From: Olivier Langlois @ 2024-08-03 21:37 UTC (permalink / raw)
To: Jens Axboe, Pavel Begunkov, io-uring

On Sat, 2024-08-03 at 08:36 -0600, Jens Axboe wrote:
>
> You can check the mappings in /sys/kernel/debug/block/<device>/
>
> In there you'll find a number of hctxN folders, each of these is a
> hardware queue. hctx0/type tells you what kind of queue it is, and
> inside the directory, you'll find which CPUs this queue is mapped to.
> Example:
>
> root@r7625 /s/k/d/b/nvme0n1# cat hctx1/type
> default
>
> "default" means it's a read/write queue, so it'll handle both reads and
> writes.
>
> root@r7625 /s/k/d/b/nvme0n1# ls hctx1/
> active  cpu11/   dispatch       sched_tags         tags
> busy    cpu266/  dispatch_busy  sched_tags_bitmap  tags_bitmap
> cpu10/  ctx_map  flags          state              type
>
> and we can see this hardware queue is mapped to cpu 10/11/266.
>
> That ties into how these are mapped. It's pretty simple - if a task is
> running on cpu 10/11/266 when it's queueing IO, then it'll use hw queue
> 1. This maps to the interrupts you found, but note that the admin queue
> (which is not listed in these directories, as it's not an IO queue) is
> the first one there. hctx0 is nvme0q1 in your /proc/interrupts list.
>
> If IO is queued on hctx1, then it should complete on the interrupt
> vector associated with nvme0q2.
>
I have entered hacking territory, but I did not find any other way to
do it...

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 6cd9395ba9ec..70b7ca84ee21 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2299,7 +2299,7 @@ static unsigned int nvme_max_io_queues(struct nvme_dev *dev)
 	 */
 	if (dev->ctrl.quirks & NVME_QUIRK_SHARED_TAGS)
 		return 1;
-	return num_possible_cpus() + dev->nr_write_queues + dev->nr_poll_queues;
+	return 1 + dev->nr_write_queues + dev->nr_poll_queues;
 }
 
 static int nvme_setup_io_queues(struct nvme_dev *dev)

It works. I no longer have any IRQ on cpu1, as I wanted:

 63:          9          0          0          0  PCI-MSIX-0000:00:04.0   0-edge  nvme0q0
 64:          0          0          0       7533  PCI-MSIX-0000:00:04.0   1-edge  nvme0q1

# ls /sys/kernel/debug/block/nvme0n1/hctx0/
active  busy  cpu0  cpu1  cpu2  cpu3  ctx_map  dispatch  dispatch_busy
flags  sched_tags  sched_tags_bitmap  state  tags  tags_bitmap  type