From: Jens Axboe <[email protected]>
To: Olivier Langlois <[email protected]>,
	Pavel Begunkov <[email protected]>,
	[email protected]
Subject: Re: io_uring NAPI busy poll RCU is causing 50 context switches/second to my sqpoll thread
Date: Sat, 3 Aug 2024 08:36:12 -0600	[thread overview]
Message-ID: <[email protected]> (raw)
In-Reply-To: <[email protected]>

On 8/3/24 8:15 AM, Olivier Langlois wrote:
> On Fri, 2024-08-02 at 16:22 +0100, Pavel Begunkov wrote:
>>>
>>> I am definitely interested in running the profiler tools that you are
>>> proposing... Most of my problems are resolved...
>>>
>>> - I got rid of 99.9% of the NET_RX_SOFTIRQ
>>> - I have significantly reduced the number of NET_TX_SOFTIRQ
>>>    https://github.com/amzn/amzn-drivers/issues/316
>>> - No more rcu context switches
>>> - CPU2 is now nohz_full all the time
>>> - CPU1 local timer interrupt is raised once every 2-3 seconds from an
>>> unknown origin. Paul E. McKenney did offer me his assistance on this
>>> issue:
>>> https://lore.kernel.org/rcu/[email protected]/t/#u
>>
>> And I was just going to propose to ask Paul, but great to
>> see you beat me on that
>>
> My investigation has progressed... my cpu1 interrupts are nvme block
> device interrupts.
> 
> I feel that for questions about block device drivers, this time, I am
> ringing at the experts' door!
> 
> What is the meaning of an nvme interrupt?
> 
> I am assuming that this is to signal the completion of writing blocks
> to the device...
> I am currently looking in the code to find the answer to this.
> 
> Next, it seems to me that there is an odd number of interrupts for the
> device:
>  63:         12          0          0          0  PCI-MSIX-0000:00:04.0   0-edge      nvme0q0
>  64:          0      23336          0          0  PCI-MSIX-0000:00:04.0   1-edge      nvme0q1
>  65:          0          0          0      33878  PCI-MSIX-0000:00:04.0   2-edge      nvme0q2
> 
> Why 3? Why not 4, one for each CPU...
> 
> If there were 4, I would have concluded that the driver had created a
> queue for each CPU...
> 
> How are the queues associated with a given request/task?
> 
> The file I/O is done by threads running on CPU3, so I find it
> surprising that nvme0q1 is chosen...
> 
> One noteworthy detail is that the process's main thread is on CPU1. In my
> flawed mental model of 1 queue per CPU, there could be some sort of
> magical association between a process's file descriptor table and the
> chosen block device queue, but this idea does not hold... What would
> happen to processes running on CPU2...

The cpu <-> hw queue mappings for nvme devices depend on the topology of
the machine (number of CPUs, relation between thread siblings, number of
nodes, etc) and the number of queues available on the device in question.
If you have as many (or more) device side queues available as number of
CPUs, then you'll have a queue per CPU. If you have less, then multiple
CPUs will share a queue.
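
A quick way to sanity check that on your box (rough sketch; nothing assumed
beyond the nvme0 naming from your /proc/interrupts output):

nproc
grep -c nvme0q /proc/interrupts

If the second number, minus one for the first vector (more on that below), is
smaller than the first, your CPUs are sharing queues.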

You can check the mappings in /sys/kernel/debug/block/<device>/.

In there you'll find a number of hctxN folders; each of these is a
hardware queue. hctx0/type tells you what kind of queue it is, and
inside the directory you'll find which CPUs this queue is mapped to.
Example:

root@r7625 /s/k/d/b/nvme0n1# cat hctx1/type 
default

"default" means it's a read/write queue, so it'll handle both reads and
writes.

root@r7625 /s/k/d/b/nvme0n1# ls hctx1/
active  cpu11/   dispatch       sched_tags         tags
busy    cpu266/  dispatch_busy  sched_tags_bitmap  tags_bitmap
cpu10/  ctx_map  flags          state              type

and we can see this hardware queue is mapped to cpu 10/11/266.
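
If you want the whole picture in one go, something like this (a rough sketch,
assuming debugfs is mounted at /sys/kernel/debug and the device is nvme0n1)
should print each hardware queue's type and the CPUs it's mapped to:

for h in /sys/kernel/debug/block/nvme0n1/hctx*; do
    # basename strips the path, type is the queue kind, and the cpuN
    # entries are the CPUs this hardware queue serves
    echo "$(basename "$h"): $(cat "$h"/type), cpus: $(ls -d "$h"/cpu* | sed 's/.*cpu//' | tr '\n' ' ')"
done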

That ties into how these are mapped. It's pretty simple - if a task is
running on cpu 10/11/266 when it's queueing IO, then it'll use hw queue
1. This maps to the interrupts you found, but note that the admin queue
(which is not listed in these directories, as it's not an IO queue) is the
first one there. hctx0 is nvme0q1 in your /proc/interrupts list.

If IO is queued on hctx1, then it should complete on the interrupt
vector associated with nvme0q2.
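
If you want to double check that pairing, the IRQ affinity should line up
with the hctx cpu list (rough sketch; 65 is the IRQ number for nvme0q2
taken from your /proc/interrupts output):

cat /proc/irq/65/smp_affinity_list
ls -d /sys/kernel/debug/block/nvme0n1/hctx1/cpu*

With managed interrupts, the CPUs listed for the vector and the cpuN
entries under the matching hctx should be the same set.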

-- 
Jens Axboe



Thread overview: 12+ messages
2024-07-30 20:05 io_uring NAPI busy poll RCU is causing 50 context switches/second to my sqpoll thread Olivier Langlois
2024-07-30 20:25 ` Pavel Begunkov
2024-07-30 23:14   ` Olivier Langlois
2024-07-31  0:33     ` Pavel Begunkov
2024-07-31  1:00       ` Pavel Begunkov
2024-08-01 23:05         ` Olivier Langlois
2024-08-01 22:02       ` Olivier Langlois
2024-08-02 15:22         ` Pavel Begunkov
2024-08-03 14:15           ` Olivier Langlois
2024-08-03 14:36             ` Jens Axboe [this message]
2024-08-03 16:50               ` Olivier Langlois
2024-08-03 21:37               ` Olivier Langlois
