public inbox for gwml@vger.gnuweeb.org
 help / color / mirror / Atom feed
* Re: [GIT PULL v2] PCI changes for v6.17
       [not found] ` <175408424863.4088284.13236765550439476565.pr-tracker-bot@kernel.org>
@ 2025-08-07  3:34   ` Ammar Faizi
  2025-08-07  3:51     ` Lukas Wunner
  0 siblings, 1 reply; 18+ messages in thread
From: Ammar Faizi @ 2025-08-07  3:34 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Linus Torvalds, Linux PCI Mailing List, Linux Kernel Mailing List,
	Rob Herring, Lorenzo Pieralisi, Manivannan Sadhasivam,
	Krzysztof Wilczyński, Armando Budianto,
	Alviro Iskandar Setiawan, gwml

On Fri, 01 Aug 2025 21:37:28 +0000, pr-tracker-bot@kernel.org, wrote:
>
> The pull request you sent on Fri, 1 Aug 2025 09:22:54 -0500:
>
> > git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci.git tags/pci-v6.17-changes
>
> has been merged into torvalds/linux.git:
> https://git.kernel.org/torvalds/c/0bd0a41a5120f78685a132834865b0a631b9026a

Yesterday, I synced with Linus' tree, but couldn't boot. Crashed with
this call trace:

  https://gist.githubusercontent.com/ammarfaizi2/3ba41f13517be4bae70cde869347d259/raw/0ac09b3e1d90d51c3fed14ca9f837f45d7730f0a/crash.jpg

This morning, I synced with Linus' tree again, still the same result.

I suspect it's related to pci. I'm still bisecting. I've successfully
narrowed it down to this pci pull.

  0bd0a41a5120 (refs/bisect/bad) Merge tag 'pci-v6.17-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci
  db5f0c3e3e60 ring-buffer: Convert ring_buffer_write() to use guard(preempt_notrace)
  12d518961586 tracing: Use __free(kfree) in trace.c to remove gotos
  debe57fbe12c tracing: Add guard() around locks and mutexes in trace.c
  788fa4b47cdc tracing: Add guard(ring_buffer_nest)
  c89504a703fb tracing: Remove unneeded goto out logic
  877d94c74e4c (refs/bisect/good-877d94c74e4c6665d2af55c0154363b43b947e60) Merge tag 'linux-watchdog-6.17-rc1' of git://www.linux-watchdog.org/linux-watchd

Now, I am testing:

  769ce531faa6 (HEAD) Merge branch 'pci/controller/msi-parent'

I'll be back to it later. git bisect says:

  Bisecting: 65 revisions left to test after this (roughly 6 steps)

-- 
Ammar Faizi

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [GIT PULL v2] PCI changes for v6.17
  2025-08-07  3:34   ` [GIT PULL v2] PCI changes for v6.17 Ammar Faizi
@ 2025-08-07  3:51     ` Lukas Wunner
  2025-08-07  4:44       ` Nam Cao
  2025-08-07  4:54       ` Ammar Faizi
  0 siblings, 2 replies; 18+ messages in thread
From: Lukas Wunner @ 2025-08-07  3:51 UTC (permalink / raw)
  To: Ammar Faizi
  Cc: Bjorn Helgaas, Linus Torvalds, Linux PCI Mailing List,
	Linux Kernel Mailing List, Rob Herring, Lorenzo Pieralisi,
	Manivannan Sadhasivam, Krzysztof Wilczynski, Armando Budianto,
	Alviro Iskandar Setiawan, gwml, Nam Cao

On Thu, Aug 07, 2025 at 10:34:08AM +0700, Ammar Faizi wrote:
> On Fri, 01 Aug 2025 21:37:28 +0000, pr-tracker-bot@kernel.org, wrote:
> >
> > The pull request you sent on Fri, 1 Aug 2025 09:22:54 -0500:
> >
> > > git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci.git tags/pci-v6.17-changes
> >
> > has been merged into torvalds/linux.git:
> > https://git.kernel.org/torvalds/c/0bd0a41a5120f78685a132834865b0a631b9026a
> 
> Yesterday, I synced with Linus' tree, but couldn't boot. Crashed with
> this call trace:
> 
>   https://gist.githubusercontent.com/ammarfaizi2/3ba41f13517be4bae70cde869347d259/raw/0ac09b3e1d90d51c3fed14ca9f837f45d7730f0a/crash.jpg
> 
> This morning, I synced with Linus' tree again, still the same result.
> 
> I suspect it's related to pci. I'm still bisecting. I've successfully
> narrowed it down to this pci pull.
> 
>   0bd0a41a5120 (refs/bisect/bad) Merge tag 'pci-v6.17-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci
>   db5f0c3e3e60 ring-buffer: Convert ring_buffer_write() to use guard(preempt_notrace)
>   12d518961586 tracing: Use __free(kfree) in trace.c to remove gotos
>   debe57fbe12c tracing: Add guard() around locks and mutexes in trace.c
>   788fa4b47cdc tracing: Add guard(ring_buffer_nest)
>   c89504a703fb tracing: Remove unneeded goto out logic
>   877d94c74e4c (refs/bisect/good-877d94c74e4c6665d2af55c0154363b43b947e60) Merge tag 'linux-watchdog-6.17-rc1' of git://www.linux-watchdog.org/linux-watchd
> 
> Now, I am testing:
> 
>   769ce531faa6 (HEAD) Merge branch 'pci/controller/msi-parent'
> 
> I'll be back to it later. git bisect says:
> 
>   Bisecting: 65 revisions left to test after this (roughly 6 steps)

Kenneth reports early-stage reboots caused by d7d8ab87e3e
("PCI: vmd: Switch to msi_create_parent_irq_domain()"):

https://lore.kernel.org/all/dfa40e48-8840-4e61-9fda-25cdb3ad81c1@panix.com/

Perhaps you're witnessing the same issue?

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [GIT PULL v2] PCI changes for v6.17
  2025-08-07  3:51     ` Lukas Wunner
@ 2025-08-07  4:44       ` Nam Cao
  2025-08-07  4:54       ` Ammar Faizi
  1 sibling, 0 replies; 18+ messages in thread
From: Nam Cao @ 2025-08-07  4:44 UTC (permalink / raw)
  To: Lukas Wunner, Ammar Faizi
  Cc: Bjorn Helgaas, Linus Torvalds, Linux PCI Mailing List,
	Linux Kernel Mailing List, Rob Herring, Lorenzo Pieralisi,
	Manivannan Sadhasivam, Krzysztof Wilczynski, Armando Budianto,
	Alviro Iskandar Setiawan, gwml

Lukas Wunner <lukas@wunner.de> writes:

> On Thu, Aug 07, 2025 at 10:34:08AM +0700, Ammar Faizi wrote:
>> On Fri, 01 Aug 2025 21:37:28 +0000, pr-tracker-bot@kernel.org, wrote:
>> >
>> > The pull request you sent on Fri, 1 Aug 2025 09:22:54 -0500:
>> >
>> > > git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci.git tags/pci-v6.17-changes
>> >
>> > has been merged into torvalds/linux.git:
>> > https://git.kernel.org/torvalds/c/0bd0a41a5120f78685a132834865b0a631b9026a
>> 
>> Yesterday, I synced with Linus' tree, but couldn't boot. Crashed with
>> this call trace:
>> 
>>   https://gist.githubusercontent.com/ammarfaizi2/3ba41f13517be4bae70cde869347d259/raw/0ac09b3e1d90d51c3fed14ca9f837f45d7730f0a/crash.jpg
>> 
>> This morning, I synced with Linus' tree again, still the same result.
>> 
>> I suspect it's related to pci. I'm still bisecting. I've successfully
>> narrowed it down to this pci pull.
>> 
>>   0bd0a41a5120 (refs/bisect/bad) Merge tag 'pci-v6.17-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci
>>   db5f0c3e3e60 ring-buffer: Convert ring_buffer_write() to use guard(preempt_notrace)
>>   12d518961586 tracing: Use __free(kfree) in trace.c to remove gotos
>>   debe57fbe12c tracing: Add guard() around locks and mutexes in trace.c
>>   788fa4b47cdc tracing: Add guard(ring_buffer_nest)
>>   c89504a703fb tracing: Remove unneeded goto out logic
>>   877d94c74e4c (refs/bisect/good-877d94c74e4c6665d2af55c0154363b43b947e60) Merge tag 'linux-watchdog-6.17-rc1' of git://www.linux-watchdog.org/linux-watchd
>> 
>> Now, I am testing:
>> 
>>   769ce531faa6 (HEAD) Merge branch 'pci/controller/msi-parent'
>> 
>> I'll be back to it later. git bisect says:
>> 
>>   Bisecting: 65 revisions left to test after this (roughly 6 steps)
>
> Kenneth reports early-stage reboots caused by d7d8ab87e3e
> ("PCI: vmd: Switch to msi_create_parent_irq_domain()"):
>
> https://lore.kernel.org/all/dfa40e48-8840-4e61-9fda-25cdb3ad81c1@panix.com/
>
> Perhaps you're witnessing the same issue?

Thanks for the Cc. The backtrace does look like something that the
commit would cause.

Let me stare at it.

Nam

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [GIT PULL v2] PCI changes for v6.17
  2025-08-07  3:51     ` Lukas Wunner
  2025-08-07  4:44       ` Nam Cao
@ 2025-08-07  4:54       ` Ammar Faizi
  2025-08-07  5:03         ` Nam Cao
  1 sibling, 1 reply; 18+ messages in thread
From: Ammar Faizi @ 2025-08-07  4:54 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Nam Cao, Bjorn Helgaas, Linus Torvalds, Linux PCI Mailing List,
	Linux Kernel Mailing List, Rob Herring, Lorenzo Pieralisi,
	Manivannan Sadhasivam, Krzysztof Wilczynski, Armando Budianto,
	Alviro Iskandar Setiawan, gwml

On Thu, Aug 07, 2025 at 05:51:57AM +0200, Lukas Wunner wrote:
> Kenneth reports early-stage reboots caused by d7d8ab87e3e
> ("PCI: vmd: Switch to msi_create_parent_irq_domain()"):
> 
> https://lore.kernel.org/all/dfa40e48-8840-4e61-9fda-25cdb3ad81c1@panix.com/
> 
> Perhaps you're witnessing the same issue?

Confirmed, reverting that commit works on my machine. I'll try to
further diagnose it and report more details.

-- 
Ammar Faizi


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [GIT PULL v2] PCI changes for v6.17
  2025-08-07  4:54       ` Ammar Faizi
@ 2025-08-07  5:03         ` Nam Cao
  2025-08-07  5:13           ` Ammar Faizi
  0 siblings, 1 reply; 18+ messages in thread
From: Nam Cao @ 2025-08-07  5:03 UTC (permalink / raw)
  To: Ammar Faizi
  Cc: Lukas Wunner, Bjorn Helgaas, Linus Torvalds,
	Linux PCI Mailing List, Linux Kernel Mailing List, Rob Herring,
	Lorenzo Pieralisi, Manivannan Sadhasivam, Krzysztof Wilczynski,
	Armando Budianto, Alviro Iskandar Setiawan, gwml

On Thu, Aug 07, 2025 at 11:54:12AM +0700, Ammar Faizi wrote:
> On Thu, Aug 07, 2025 at 05:51:57AM +0200, Lukas Wunner wrote:
> > Kenneth reports early-stage reboots caused by d7d8ab87e3e
> > ("PCI: vmd: Switch to msi_create_parent_irq_domain()"):
> > 
> > https://lore.kernel.org/all/dfa40e48-8840-4e61-9fda-25cdb3ad81c1@panix.com/
> > 
> > Perhaps you're witnessing the same issue?
> 
> Confirmed, reverting that commit works on my machine. I'll try to
> further diagnose it and report more details.

Does the diff below help?

diff --git a/drivers/pci/controller/vmd.c b/drivers/pci/controller/vmd.c
index 9bbb0ff4cc15..b679c7f28f51 100644
--- a/drivers/pci/controller/vmd.c
+++ b/drivers/pci/controller/vmd.c
@@ -280,10 +280,12 @@ static int vmd_msi_alloc(struct irq_domain *domain, unsigned int virq,
 static void vmd_msi_free(struct irq_domain *domain, unsigned int virq,
 			 unsigned int nr_irqs)
 {
+	struct irq_data *irq_data;
 	struct vmd_irq *vmdirq;
 
 	for (int i = 0; i < nr_irqs; ++i) {
-		vmdirq = irq_get_chip_data(virq + i);
+		irq_data = irq_domain_get_irq_data(domain, virq + i);
+		vmdirq = irq_data->chip_data;
 
 		synchronize_srcu(&vmdirq->irq->srcu);
 

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [GIT PULL v2] PCI changes for v6.17
  2025-08-07  5:03         ` Nam Cao
@ 2025-08-07  5:13           ` Ammar Faizi
  2025-08-08 10:59             ` Ammar Faizi
  0 siblings, 1 reply; 18+ messages in thread
From: Ammar Faizi @ 2025-08-07  5:13 UTC (permalink / raw)
  To: Nam Cao
  Cc: Lukas Wunner, Bjorn Helgaas, Linus Torvalds,
	Linux PCI Mailing List, Linux Kernel Mailing List, Rob Herring,
	Lorenzo Pieralisi, Manivannan Sadhasivam, Krzysztof Wilczynski,
	Armando Budianto, Alviro Iskandar Setiawan, gwml

On Thu, Aug 07, 2025 at 07:03:50AM +0200, Nam Cao wrote:
> Does the diff below help?
> 
> diff --git a/drivers/pci/controller/vmd.c b/drivers/pci/controller/vmd.c
> index 9bbb0ff4cc15..b679c7f28f51 100644
> --- a/drivers/pci/controller/vmd.c
> +++ b/drivers/pci/controller/vmd.c
> @@ -280,10 +280,12 @@ static int vmd_msi_alloc(struct irq_domain *domain, unsigned int virq,
>  static void vmd_msi_free(struct irq_domain *domain, unsigned int virq,
>  			 unsigned int nr_irqs)
>  {
> +	struct irq_data *irq_data;
>  	struct vmd_irq *vmdirq;
>  
>  	for (int i = 0; i < nr_irqs; ++i) {
> -		vmdirq = irq_get_chip_data(virq + i);
> +		irq_data = irq_domain_get_irq_data(domain, virq + i);
> +		vmdirq = irq_data->chip_data;
>  
>  		synchronize_srcu(&vmdirq->irq->srcu);
>  

Yes, it works.

Tested-by: Ammar Faizi <ammarfaizi2@gnuweeb.org>

Thank you for the fix!

-- 
Ammar Faizi


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [GIT PULL v2] PCI changes for v6.17
  2025-08-07  5:13           ` Ammar Faizi
@ 2025-08-08 10:59             ` Ammar Faizi
  2025-08-08 11:19               ` Ammar Faizi
  0 siblings, 1 reply; 18+ messages in thread
From: Ammar Faizi @ 2025-08-08 10:59 UTC (permalink / raw)
  To: Nam Cao
  Cc: Lukas Wunner, Bjorn Helgaas, Linus Torvalds,
	Linux PCI Mailing List, Linux Kernel Mailing List, Rob Herring,
	Lorenzo Pieralisi, Manivannan Sadhasivam, Krzysztof Wilczynski,
	Armando Budianto, Alviro Iskandar Setiawan, gwml

On Thu, Aug 07, 2025 at 12:13:37PM +0700, Ammar Faizi wrote:
> On Thu, Aug 07, 2025 at 07:03:50AM +0200, Nam Cao wrote:
> > Does the diff below help?
> 
> Yes, it works.

So today, I synced with Linus' master branch again:

  37816488247d ("Merge tag 'net-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net")

and applied your fix on top of it.

I can boot, but I get this splat. Looking at the call trace, it seems
it's still related to pci, but different issue. The call trace is also
different from the previous one.

Let me know if you have something for me to test.

  [    1.017767] pci 10000:e0:1d.0: bridge window [mem 0x82000000-0x820fffff]: assigned
  [    1.018519] pci 10000:e0:1d.0: bridge window [io  size 0x1000]: can't assign; no space
  [    1.019268] pci 10000:e0:1d.0: bridge window [io  size 0x1000]: failed to assign
  [    1.020026] pci 10000:e0:1d.0: bridge window [io  size 0x1000]: can't assign; no space
  [    1.020789] pci 10000:e0:1d.0: bridge window [io  size 0x1000]: failed to assign
  [    1.021539] pci 10000:e1:00.0: BAR 0 [mem 0x82000000-0x82003fff 64bit]: assigned
  [    1.022317] pci 10000:e0:1d.0: PCI bridge to [bus e1]
  [    1.023091] pci 10000:e0:1d.0:   bridge window [mem 0x82000000-0x820fffff]
  [    1.023885] pci 10000:e1:00.0: VMD: Default LTR value set by driver
  [    1.024654] pci 10000:e1:00.0: can't override BIOS ASPM; OS doesn't have ASPM control
  [    1.025442] pcieport 10000:e0:1d.0: can't derive routing for PCI INT A
  [    1.026245] pcieport 10000:e0:1d.0: PCI INT A: no GSI
  [    1.027058] ------------[ cut here ]------------
  [    1.027849] WARNING: CPU: 0 PID: 166 at drivers/pci/controller/vmd.c:309 vmd_init_dev_msi_info+0x36/0x40 [vmd]
  [    1.028649] Modules linked in: hid_generic i2c_hid_acpi i2c_hid drm intel_lpss_pci i2c_i801 i2c_mux intel_lpss idma64 i2c_smbus vmd(+) hid video wmi pinctrl_tigerlake
  [    1.029467] CPU: 0 UID: 0 PID: 166 Comm: systemd-udevd Not tainted 6.16.0-afh-home-2025-08-08-g6026508bdb9d #10 PREEMPT(full)  fe08b908bb15b9ded6f7769c45f204194ebf7eef
  [    1.030301] Hardware name: HP HP Laptop 14s-dq2xxx/87FD, BIOS F.21 03/21/2022
  [    1.031122] RIP: 0010:vmd_init_dev_msi_info+0x36/0x40 [vmd]
  [    1.031953] Code: 48 89 cb e8 7c 49 4f e1 84 c0 74 1a 48 8b 53 20 48 c7 42 18 70 18 40 a0 48 8b 53 20 48 c7 42 20 e0 17 40 a0 5b c3 31 c0 5b c3 <0f> 0b 31 c0 c3 0f 1f 44 00 00 0f 1f 44 00 00 41 56 41 55 41 54 49
  [    1.032798] RSP: 0018:ffff888105a47860 EFLAGS: 00010297
  [    1.033642] RAX: ffff8881014d5d98 RBX: ffff8881014d5c00 RCX: ffff8881014d5d98
  [    1.034506] RDX: ffff888120a49400 RSI: ffff888120a49400 RDI: ffff88810132b0c8
  [    1.035359] RBP: 0000000000000001 R08: ffff888100d8de3a R09: 00000000ffffffff
  [    1.036206] R10: 0000000000000004 R11: 0000000000000005 R12: ffff8881019b7e40
  [    1.036561] usb 1-2: new full-speed USB device number 3 using xhci_hcd
  [    1.037058] R13: ffffffffa020c680 R14: ffff88810132b0c8 R15: ffff888120a49400
  [    1.038724] FS:  00007f25ede3a8c0(0000) GS:ffff8890f1a2d000(0000) knlGS:0000000000000000
  [    1.039559] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [    1.040400] CR2: 00007f25ee475ea0 CR3: 000000011dfd1003 CR4: 0000000000770ef0
  [    1.041265] PKRU: 55555554
  [    1.042121] Call Trace:
  [    1.042971]  <TASK>
  [    1.043803]  msi_create_device_irq_domain+0x1eb/0x290
  [    1.044645]  __pci_enable_msi_range+0x106/0x300
  [    1.045488]  pci_alloc_irq_vectors_affinity+0xc5/0x110
  [    1.046336]  pcie_portdrv_probe+0x24e/0x610
  [    1.047177]  ? kernfs_activate+0x48/0x60
  [    1.048015]  local_pci_probe+0x3c/0x80
  [    1.048854]  pci_device_probe+0xbc/0x1b0
  [    1.049693]  really_probe+0xcd/0x380
  [    1.050529]  ? driver_probe_device+0x90/0x90
  [    1.051353]  __driver_probe_device+0x78/0x150
  [    1.052182]  driver_probe_device+0x1f/0x90
  [    1.053000]  __device_attach_driver+0x76/0xf0
  [    1.053812]  bus_for_each_drv+0x69/0xa0
  [    1.054633]  __device_attach+0xaa/0x1a0
  [    1.055448]  pci_bus_add_device+0x4c/0x70
  [    1.056260]  pci_bus_add_devices+0x2c/0x70
  [    1.057063]  vmd_probe+0x81e/0xa20 [vmd 65bddf00234a3cddd21388091a077f038c9af2be]
  [    1.057881]  local_pci_probe+0x3c/0x80
  [    1.058682]  pci_device_probe+0xbc/0x1b0
  [    1.059489]  really_probe+0xcd/0x380
  [    1.060303]  ? __device_attach_driver+0xf0/0xf0
  [    1.061112]  __driver_probe_device+0x78/0x150
  [    1.061908]  driver_probe_device+0x1f/0x90
  [    1.062711]  __driver_attach+0xbf/0x1b0
  [    1.063500]  bus_for_each_dev+0x64/0xa0
  [    1.064303]  bus_add_driver+0x10a/0x230
  [    1.065100]  driver_register+0x55/0xf0
  [    1.065892]  ? vmd_drv_exit+0x9a0/0x9a0 [vmd 65bddf00234a3cddd21388091a077f038c9af2be]
  [    1.066673]  do_one_initcall+0x31/0x1e0
  [    1.067460]  do_init_module+0x60/0x260
  [    1.068264]  init_module_from_file+0x74/0x90
  [    1.069068]  idempotent_init_module+0xed/0x2c0
  [    1.069853]  __x64_sys_finit_module+0x65/0xd0
  [    1.070630]  do_syscall_64+0x56/0x260
  [    1.071394]  entry_SYSCALL_64_after_hwframe+0x29/0x31
  [    1.072152] RIP: 0033:0x7f25ee53288d
  [    1.072918] Code: 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 73 b5 0f 00 f7 d8 64 89 01 48
  [    1.073698] RSP: 002b:00007fff09cbd5b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
  [    1.074477] RAX: ffffffffffffffda RBX: 000055d9edf36030 RCX: 00007f25ee53288d
  [    1.075254] RDX: 0000000000000000 RSI: 00007f25ee6cc441 RDI: 0000000000000005
  [    1.076042] RBP: 0000000000020000 R08: 0000000000000000 R09: 00007fff09cbd6f0
  [    1.076827] R10: 0000000000000005 R11: 0000000000000246 R12: 00007f25ee6cc441
  [    1.077609] R13: 000055d9edf323b0 R14: 000055d9edf36120 R15: 000055d9edf35850
  [    1.078389]  </TASK>
  [    1.079166] ---[ end trace 0000000000000000 ]---

-- 
Ammar Faizi


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [GIT PULL v2] PCI changes for v6.17
  2025-08-08 10:59             ` Ammar Faizi
@ 2025-08-08 11:19               ` Ammar Faizi
  2025-08-08 13:34                 ` Ammar Faizi
  2025-08-08 18:07                 ` Nam Cao
  0 siblings, 2 replies; 18+ messages in thread
From: Ammar Faizi @ 2025-08-08 11:19 UTC (permalink / raw)
  To: Nam Cao
  Cc: Lukas Wunner, Bjorn Helgaas, Linus Torvalds,
	Linux PCI Mailing List, Linux Kernel Mailing List, Rob Herring,
	Lorenzo Pieralisi, Manivannan Sadhasivam, Krzysztof Wilczynski,
	Armando Budianto, Alviro Iskandar Setiawan, gwml

On Fri, Aug 08, 2025 at 05:59:23PM +0700, Ammar Faizi wrote:
> On Thu, Aug 07, 2025 at 12:13:37PM +0700, Ammar Faizi wrote:
> > On Thu, Aug 07, 2025 at 07:03:50AM +0200, Nam Cao wrote:
> > > Does the diff below help?
> > 
> > Yes, it works.
> 
> So today, I synced with Linus' master branch again:
> 
>   37816488247d ("Merge tag 'net-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net")
> 
> and applied your fix on top of it.
> 
> I can boot, but I get this splat. Looking at the call trace, it seems
> it's still related to pci, but different issue. The call trace is also
> different from the previous one.

It'll be a bit tricky to bisect this one. Because if I step back to
a bad commit post:

   d7d8ab87e3e ("PCI: vmd: Switch to msi_create_parent_irq_domain()")

I won't be able to boot to reach this new splat :/

I guess I need to apply the fix dirty for each bisection step. But I'll
also need to make sure the current step has the d7d8ab87e3e commit
anchestor before applying.

-- 
Ammar Faizi


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [GIT PULL v2] PCI changes for v6.17
  2025-08-08 11:19               ` Ammar Faizi
@ 2025-08-08 13:34                 ` Ammar Faizi
  2025-08-08 18:07                 ` Nam Cao
  1 sibling, 0 replies; 18+ messages in thread
From: Ammar Faizi @ 2025-08-08 13:34 UTC (permalink / raw)
  To: Nam Cao
  Cc: Lukas Wunner, Bjorn Helgaas, Linus Torvalds,
	Linux PCI Mailing List, Linux Kernel Mailing List, Rob Herring,
	Lorenzo Pieralisi, Manivannan Sadhasivam, Krzysztof Wilczynski,
	Armando Budianto, Alviro Iskandar Setiawan, gwml

On Fri, Aug 08, 2025 at 06:19:18PM +0700, Ammar Faizi wrote:
> It'll be a bit tricky to bisect this one. Because if I step back to
> a bad commit post:
> 
>    d7d8ab87e3e ("PCI: vmd: Switch to msi_create_parent_irq_domain()")
> 
> I won't be able to boot to reach this new splat :/
> 
> I guess I need to apply the fix dirty for each bisection step. But I'll
> also need to make sure the current step has the d7d8ab87e3e commit
> anchestor before applying.

Unfortunately, it turns out I don't have time to bisect it at the moment.
But if you have something for me to test, I'll give it a try.

I can continue bisecting it later, maybe after rc1 is out if the fix is
still pending.

For now, I revert the PCI merge entirely in my local tree and it
resolves the issue.

  0bd0a41a5120 ("Merge tag 'pci-v6.17-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pc")

  git revert -m 1 -s 0bd0a41a5120;

This is an interesting case for me because it is my first time reverting
a merge commit :-)

I'll probably need to revert that revert-commit later when the PCI fix
actually arrives in mainline before syncing.

Happy weekend!

-- 
Ammar Faizi


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [GIT PULL v2] PCI changes for v6.17
  2025-08-08 11:19               ` Ammar Faizi
  2025-08-08 13:34                 ` Ammar Faizi
@ 2025-08-08 18:07                 ` Nam Cao
  2025-08-08 22:52                   ` Ammar Faizi
  1 sibling, 1 reply; 18+ messages in thread
From: Nam Cao @ 2025-08-08 18:07 UTC (permalink / raw)
  To: Ammar Faizi
  Cc: Lukas Wunner, Bjorn Helgaas, Linus Torvalds,
	Linux PCI Mailing List, Linux Kernel Mailing List, Rob Herring,
	Lorenzo Pieralisi, Manivannan Sadhasivam, Krzysztof Wilczynski,
	Armando Budianto, Alviro Iskandar Setiawan, gwml

Ammar Faizi <ammarfaizi2@gnuweeb.org> writes:

> On Fri, Aug 08, 2025 at 05:59:23PM +0700, Ammar Faizi wrote:
>> On Thu, Aug 07, 2025 at 12:13:37PM +0700, Ammar Faizi wrote:
>> > On Thu, Aug 07, 2025 at 07:03:50AM +0200, Nam Cao wrote:
>> > > Does the diff below help?
>> > 
>> > Yes, it works.
>> 
>> So today, I synced with Linus' master branch again:
>> 
>>   37816488247d ("Merge tag 'net-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net")
>> 
>> and applied your fix on top of it.
>> 
>> I can boot, but I get this splat. Looking at the call trace, it seems
>> it's still related to pci, but different issue. The call trace is also
>> different from the previous one.
>
> It'll be a bit tricky to bisect this one. Because if I step back to
> a bad commit post:
>
>    d7d8ab87e3e ("PCI: vmd: Switch to msi_create_parent_irq_domain()")
>
> I won't be able to boot to reach this new splat :/
>
> I guess I need to apply the fix dirty for each bisection step. But I'll
> also need to make sure the current step has the d7d8ab87e3e commit
> anchestor before applying.

There is no point in bisecting before that commit, because the WARN_ON()
is added by that commit, so you wouldn't see anything before that.

The WARN_ONCE() tells us that some devices down the PCI tree are
allocating MSI, but VMD supports MSI-X only.

From the backtrace:
   msi_create_device_irq_domain+0x1eb/0x290
   __pci_enable_msi_range+0x106/0x300
   pci_alloc_irq_vectors_affinity+0xc5/0x110
   pcie_portdrv_probe+0x24e/0x610

It seems MSI-X are allocated first, but fail for some reason. Then
fallback to MSI, which triggers the WARN_ON().

So we need to figure out why MSI-X allocation fail.

I may need to ask you to insert a bunch of printk() to help me pinpoint
the problem. But let me stare at it first..

Nam

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [GIT PULL v2] PCI changes for v6.17
  2025-08-08 18:07                 ` Nam Cao
@ 2025-08-08 22:52                   ` Ammar Faizi
  2025-08-09  4:34                     ` Nam Cao
  0 siblings, 1 reply; 18+ messages in thread
From: Ammar Faizi @ 2025-08-08 22:52 UTC (permalink / raw)
  To: Nam Cao
  Cc: Thomas Gleixner, Lukas Wunner, Bjorn Helgaas, Linus Torvalds,
	Linux PCI Mailing List, Linux Kernel Mailing List, Rob Herring,
	Lorenzo Pieralisi, Manivannan Sadhasivam, Krzysztof Wilczynski,
	Armando Budianto, Alviro Iskandar Setiawan, gwml

+ Adding tglx as he's involved in reviewing that commit.

Context: https://lore.kernel.org/linux-pci/aJXYhfc%2F6DfcqfqF@linux.gnuweeb.org/

On Fri, Aug 08, 2025 at 08:07:03PM +0200, Nam Cao wrote:
> There is no point in bisecting before that commit, because the WARN_ON()
> is added by that commit, so you wouldn't see anything before that.

I didn't notice that. Good you pointed that out earlier.

> The WARN_ONCE() tells us that some devices down the PCI tree are
> allocating MSI, but VMD supports MSI-X only.
> 
> From the backtrace:
>    msi_create_device_irq_domain+0x1eb/0x290
>    __pci_enable_msi_range+0x106/0x300
>    pci_alloc_irq_vectors_affinity+0xc5/0x110
>    pcie_portdrv_probe+0x24e/0x610
> 
> It seems MSI-X are allocated first, but fail for some reason. Then
> fallback to MSI, which triggers the WARN_ON().
> 
> So we need to figure out why MSI-X allocation fail.
> 
> I may need to ask you to insert a bunch of printk() to help me pinpoint
> the problem. But let me stare at it first..

I can do that. Send me a git diff. I'll test it and back with the dmesg
output.

-- 
Ammar Faizi


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [GIT PULL v2] PCI changes for v6.17
  2025-08-08 22:52                   ` Ammar Faizi
@ 2025-08-09  4:34                     ` Nam Cao
  2025-08-09 13:28                       ` Ammar Faizi
  0 siblings, 1 reply; 18+ messages in thread
From: Nam Cao @ 2025-08-09  4:34 UTC (permalink / raw)
  To: Ammar Faizi
  Cc: Thomas Gleixner, Lukas Wunner, Bjorn Helgaas, Linus Torvalds,
	Linux PCI Mailing List, Linux Kernel Mailing List, Rob Herring,
	Lorenzo Pieralisi, Manivannan Sadhasivam, Krzysztof Wilczynski,
	Armando Budianto, Alviro Iskandar Setiawan, gwml, namcaov

On Sat, Aug 09, 2025 at 07:52:30AM +0900, Ammar Faizi wrote:
> I can do that. Send me a git diff. I'll test it and back with the dmesg
> output.

That would be very helpful, thanks!

Please bear with me, this may take a few iterations.

Let's first try the below.


diff --git a/drivers/pci/controller/vmd.c b/drivers/pci/controller/vmd.c
index 9bbb0ff4cc15..ce477c030990 100644
--- a/drivers/pci/controller/vmd.c
+++ b/drivers/pci/controller/vmd.c
@@ -304,11 +304,14 @@ static bool vmd_init_dev_msi_info(struct device *dev, struct irq_domain *domain,
 				  struct irq_domain *real_parent,
 				  struct msi_domain_info *info)
 {
+	pr_err("%s: bus_token=%d\n", __func__, (int)info->bus_token);
 	if (WARN_ON_ONCE(info->bus_token != DOMAIN_BUS_PCI_DEVICE_MSIX))
 		return false;
 
-	if (!msi_lib_init_dev_msi_info(dev, domain, real_parent, info))
+	if (!msi_lib_init_dev_msi_info(dev, domain, real_parent, info)) {
+		pr_err("%s:%d err=msi_lib_init\n", __func__, __LINE__);
 		return false;
+	}
 
 	info->chip->irq_enable		= vmd_pci_msi_enable;
 	info->chip->irq_disable		= vmd_pci_msi_disable;
diff --git a/drivers/pci/msi/api.c b/drivers/pci/msi/api.c
index 818d55fbad0d..e5f83036d76f 100644
--- a/drivers/pci/msi/api.c
+++ b/drivers/pci/msi/api.c
@@ -265,6 +265,7 @@ int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
 	}
 
 	if (flags & PCI_IRQ_MSIX) {
+		pci_err(dev, "%s:%d msix\n", __func__, __LINE__);
 		nvecs = __pci_enable_msix_range(dev, NULL, min_vecs, max_vecs,
 						affd, flags);
 		if (nvecs > 0)
@@ -272,6 +273,7 @@ int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
 	}
 
 	if (flags & PCI_IRQ_MSI) {
+		pci_err(dev, "%s:%d msi\n", __func__, __LINE__);
 		nvecs = __pci_enable_msi_range(dev, min_vecs, max_vecs, affd);
 		if (nvecs > 0)
 			return nvecs;
@@ -279,6 +281,7 @@ int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
 
 	/* use INTx IRQ if allowed */
 	if (flags & PCI_IRQ_INTX) {
+		pci_err(dev, "%s:%d INTx\n", __func__, __LINE__);
 		if (min_vecs == 1 && dev->irq) {
 			/*
 			 * Invoke the affinity spreading logic to ensure that
diff --git a/drivers/pci/msi/irqdomain.c b/drivers/pci/msi/irqdomain.c
index 0938ef7ebabf..6695508e9e53 100644
--- a/drivers/pci/msi/irqdomain.c
+++ b/drivers/pci/msi/irqdomain.c
@@ -13,8 +13,10 @@ int pci_msi_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
 	struct irq_domain *domain;
 
 	domain = dev_get_msi_domain(&dev->dev);
-	if (domain && irq_domain_is_hierarchy(domain))
+	if (domain && irq_domain_is_hierarchy(domain)) {
+		pci_err(dev, "%s:%d err=\n", __func__, __LINE__);
 		return msi_domain_alloc_irqs_all_locked(&dev->dev, MSI_DEFAULT_DOMAIN, nvec);
+	}
 
 	return pci_msi_legacy_setup_msi_irqs(dev, nvec, type);
 }
@@ -355,6 +357,7 @@ bool pci_msi_domain_supports(struct pci_dev *pdev, unsigned int feature_mask,
 	if (!domain || !irq_domain_is_hierarchy(domain)) {
 		if (IS_ENABLED(CONFIG_PCI_MSI_ARCH_FALLBACKS))
 			return mode == ALLOW_LEGACY;
+		pci_err(pdev, "%s:%d err\n", __func__, __LINE__);
 		return false;
 	}
 
@@ -364,6 +367,7 @@ bool pci_msi_domain_supports(struct pci_dev *pdev, unsigned int feature_mask,
 		 * msi_domain_info::flags is the authoritative source of
 		 * information.
 		 */
+		pci_err(pdev, "%s:%d not msi parent\n", __func__, __LINE__);
 		info = domain->host_data;
 		supported = info->flags;
 	} else {
@@ -374,6 +378,7 @@ bool pci_msi_domain_supports(struct pci_dev *pdev, unsigned int feature_mask,
 		 * per device domain because the parent is never
 		 * expanding the PCI/MSI functionality.
 		 */
+		pci_err(pdev, "%s:%d is msi parent\n", __func__, __LINE__);
 		supported = domain->msi_parent_ops->supported_flags;
 	}
 
diff --git a/drivers/pci/msi/msi.c b/drivers/pci/msi/msi.c
index 34d664139f48..028281d51fd0 100644
--- a/drivers/pci/msi/msi.c
+++ b/drivers/pci/msi/msi.c
@@ -31,19 +31,25 @@ static int pci_msi_supported(struct pci_dev *dev, int nvec)
 	struct pci_bus *bus;
 
 	/* MSI must be globally enabled and supported by the device */
-	if (!pci_msi_enable)
+	if (!pci_msi_enable) {
+		pci_err(dev, "%s:%d err=pci_msi_enable\n", __func__, __LINE__);
 		return 0;
+	}
 
-	if (!dev || dev->no_msi)
+	if (!dev || dev->no_msi) {
+		pci_err(dev, "%s:%d err=no_msi\n", __func__, __LINE__);
 		return 0;
+	}
 
 	/*
 	 * You can't ask to have 0 or less MSIs configured.
 	 *  a) it's stupid ..
 	 *  b) the list manipulation code assumes nvec >= 1.
 	 */
-	if (nvec < 1)
+	if (nvec < 1) {
+		pci_err(dev, "%s:%d err=nvec\n", __func__, __LINE__);
 		return 0;
+	}
 
 	/*
 	 * Any bridge which does NOT route MSI transactions from its
@@ -60,8 +66,10 @@ static int pci_msi_supported(struct pci_dev *dev, int nvec)
 	 * at probe time.
 	 */
 	for (bus = dev->bus; bus; bus = bus->parent)
-		if (bus->bus_flags & PCI_BUS_FLAGS_NO_MSI)
+		if (bus->bus_flags & PCI_BUS_FLAGS_NO_MSI) {
+			pci_err(dev, "%s:%d bus_flags\n", __func__, __LINE__);
 			return 0;
+		}
 
 	return 1;
 }
@@ -86,8 +94,10 @@ static int pcim_setup_msi_release(struct pci_dev *dev)
 		return 0;
 
 	ret = devm_add_action(&dev->dev, pcim_msi_release, dev);
-	if (ret)
+	if (ret) {
+		pci_err(dev, "%s:%d err=%d\n", __func__, __LINE__, ret);
 		return ret;
+	}
 
 	dev->is_msi_managed = true;
 	return 0;
@@ -671,17 +681,23 @@ static int __msix_setup_interrupts(struct pci_dev *__dev, struct msix_entry *ent
 	struct pci_dev *dev __free(free_msi_irqs) = __dev;
 
 	int ret = msix_setup_msi_descs(dev, entries, nvec, masks);
-	if (ret)
+	if (ret) {
+		pci_err(dev, "%s:%d err=%d\n", __func__, __LINE__, ret);
 		return ret;
+	}
 
 	ret = pci_msi_setup_msi_irqs(dev, nvec, PCI_CAP_ID_MSIX);
-	if (ret)
+	if (ret) {
+		pci_err(dev, "%s:%d err=%d\n", __func__, __LINE__, ret);
 		return ret;
+	}
 
 	/* Check if all MSI entries honor device restrictions */
 	ret = msi_verify_entries(dev);
-	if (ret)
+	if (ret) {
+		pci_err(dev, "%s:%d err=%d\n", __func__, __LINE__, ret);
 		return ret;
+	}
 
 	msix_update_entries(dev, entries);
 	retain_and_null_ptr(dev);
@@ -736,8 +752,10 @@ static int msix_capability_init(struct pci_dev *dev, struct msix_entry *entries,
 	}
 
 	ret = msix_setup_interrupts(dev, entries, nvec, affd);
-	if (ret)
+	if (ret) {
+		pci_err(dev, "%s:%d err=setup irqs\n", __func__, __LINE__);
 		goto out_disable;
+	}
 
 	/* Disable INTX */
 	pci_intx_for_msi(dev, 0);
@@ -793,30 +811,40 @@ int __pci_enable_msix_range(struct pci_dev *dev, struct msix_entry *entries, int
 {
 	int hwsize, rc, nvec = maxvec;
 
-	if (maxvec < minvec)
+	if (maxvec < minvec) {
+		pci_err(dev, "%s:%d err=%d\n", __func__, __LINE__, -ERANGE);
 		return -ERANGE;
+	}
 
 	if (dev->msi_enabled) {
-		pci_info(dev, "can't enable MSI-X (MSI already enabled)\n");
+		pci_err(dev, "can't enable MSI-X (MSI already enabled)\n");
 		return -EINVAL;
 	}
 
-	if (WARN_ON_ONCE(dev->msix_enabled))
+	if (WARN_ON_ONCE(dev->msix_enabled)) {
+		pci_err(dev, "%s:%d err=%d\n", __func__, __LINE__, -EINVAL);
 		return -EINVAL;
+	}
 
 	/* Check MSI-X early on irq domain enabled architectures */
-	if (!pci_msi_domain_supports(dev, MSI_FLAG_PCI_MSIX, ALLOW_LEGACY))
+	if (!pci_msi_domain_supports(dev, MSI_FLAG_PCI_MSIX, ALLOW_LEGACY)) {
+		pci_err(dev, "%s:%d err=ENOTSUPP\n", __func__, __LINE__);
 		return -ENOTSUPP;
+	}
 
 	if (!pci_msi_supported(dev, nvec) || dev->current_state != PCI_D0)
 		return -EINVAL;
 
 	hwsize = pci_msix_vec_count(dev);
-	if (hwsize < 0)
+	if (hwsize < 0) {
+		pci_err(dev, "%s:%d err=hwsize\n", __func__, __LINE__);
 		return hwsize;
+	}
 
-	if (!pci_msix_validate_entries(dev, entries, nvec))
+	if (!pci_msix_validate_entries(dev, entries, nvec)) {
+		pci_err(dev, "%s:%d err=validate entries\n", __func__, __LINE__);
 		return -EINVAL;
+	}
 
 	if (hwsize < nvec) {
 		/* Keep the IRQ virtual hackery working */
@@ -826,21 +854,27 @@ int __pci_enable_msix_range(struct pci_dev *dev, struct msix_entry *entries, int
 			nvec = hwsize;
 	}
 
-	if (nvec < minvec)
+	if (nvec < minvec) {
+		pci_err(dev, "%s:%d err=ENOSPC\n", __func__, __LINE__);
 		return -ENOSPC;
+	}
 
 	rc = pci_setup_msi_context(dev);
 	if (rc)
 		return rc;
 
-	if (!pci_setup_msix_device_domain(dev, hwsize))
+	if (!pci_setup_msix_device_domain(dev, hwsize)) {
+		pci_err(dev, "%s:%d err=%d\n", __func__, __LINE__, -ENODEV);
 		return -ENODEV;
+	}
 
 	for (;;) {
 		if (affd) {
 			nvec = irq_calc_affinity_vectors(minvec, nvec, affd);
-			if (nvec < minvec)
+			if (nvec < minvec) {
+				pci_err(dev, "%s:%d err=%d\n", __func__, __LINE__, -ENOSPC);
 				return -ENOSPC;
+			}
 		}
 
 		rc = msix_capability_init(dev, entries, nvec, affd);
diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c
index d1b68c18444f..44d1c0cd79fa 100644
--- a/drivers/pci/pcie/portdrv.c
+++ b/drivers/pci/pcie/portdrv.c
@@ -190,6 +190,7 @@ static int pcie_init_service_irqs(struct pci_dev *dev, int *irqs, int mask)
 		goto intx_irq;
 
 	/* Try to use MSI-X or MSI if supported */
+	pr_err("%s:%d\n", __func__, __LINE__);
 	if (pcie_port_enable_irq_vec(dev, irqs, mask) == 0)
 		return 0;
 
diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c
index dc473faadcc8..1771b6a24765 100644
--- a/kernel/irq/irqdomain.c
+++ b/kernel/irq/irqdomain.c
@@ -229,13 +229,17 @@ static struct irq_domain *__irq_domain_create(const struct irq_domain_info *info
 
 	if (WARN_ON((info->size && info->direct_max) ||
 		    (!IS_ENABLED(CONFIG_IRQ_DOMAIN_NOMAP) && info->direct_max) ||
-		    (info->direct_max && info->direct_max != info->hwirq_max)))
+		    (info->direct_max && info->direct_max != info->hwirq_max))) {
+		pr_err("%s:%d\n err", __func__, __LINE__);
 		return ERR_PTR(-EINVAL);
+	}
 
 	domain = kzalloc_node(struct_size(domain, revmap, info->size),
 			      GFP_KERNEL, of_node_to_nid(to_of_node(info->fwnode)));
-	if (!domain)
+	if (!domain) {
+		pr_err("%s:%d\n err", __func__, __LINE__);
 		return ERR_PTR(-ENOMEM);
+	}
 
 	err = irq_domain_set_name(domain, info);
 	if (err) {
@@ -328,14 +332,18 @@ static struct irq_domain *__irq_domain_instantiate(const struct irq_domain_info
 
 	if (info->dgc_info) {
 		err = irq_domain_alloc_generic_chips(domain, info->dgc_info);
-		if (err)
+		if (err) {
+			pr_err("%s:%d\n err=%d", __func__, __LINE__, err);
 			goto err_domain_free;
+		}
 	}
 
 	if (info->init) {
 		err = info->init(domain);
-		if (err)
+		if (err) {
+			pr_err("%s:%d\n err=%d", __func__, __LINE__, err);
 			goto err_domain_gc_remove;
+		}
 	}
 
 	__irq_domain_publish(domain);
diff --git a/kernel/irq/msi.c b/kernel/irq/msi.c
index 9b09ad3f9914..b86ed78be575 100644
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -316,11 +316,14 @@ int msi_setup_device_data(struct device *dev)
 		return 0;
 
 	md = devres_alloc(msi_device_data_release, sizeof(*md), GFP_KERNEL);
-	if (!md)
+	if (!md) {
+		dev_err(dev, "%s:%d err=ENOMEM\n", __func__, __LINE__);
 		return -ENOMEM;
+	}
 
 	ret = msi_sysfs_create_group(dev);
 	if (ret) {
+		dev_err(dev, "%s:%d err=%d\n", __func__, __LINE__, ret);
 		devres_free(md);
 		return ret;
 	}
@@ -1035,16 +1038,22 @@ bool msi_create_device_irq_domain(struct device *dev, unsigned int domid,
 	const struct msi_parent_ops *pops;
 	struct fwnode_handle *fwnode;
 
-	if (!irq_domain_is_msi_parent(parent))
+	if (!irq_domain_is_msi_parent(parent)) {
+		dev_err(dev, "%s:%d err=not parent\n", __func__, __LINE__);
 		return false;
+	}
 
-	if (domid >= MSI_MAX_DEVICE_IRQDOMAINS)
+	if (domid >= MSI_MAX_DEVICE_IRQDOMAINS) {
+		dev_err(dev, "%s:%d err=max domain\n", __func__, __LINE__);
 		return false;
+	}
 
 	struct msi_domain_template *bundle __free(kfree) =
 		kmemdup(template, sizeof(*bundle), GFP_KERNEL);
-	if (!bundle)
+	if (!bundle) {
+		dev_err(dev, "%s:%d err=mem\n", __func__, __LINE__);
 		return false;
+	}
 
 	bundle->info.hwsize = hwsize;
 	bundle->info.chip = &bundle->chip;
@@ -1077,23 +1086,32 @@ bool msi_create_device_irq_domain(struct device *dev, unsigned int domid,
 	if (!fwnode)
 		return false;
 
-	if (msi_setup_device_data(dev))
+	if (msi_setup_device_data(dev)) {
+		dev_err(dev, "%s:%d err=setup\n", __func__, __LINE__);
 		return false;
+	}
 
 	guard(msi_descs_lock)(dev);
-	if (WARN_ON_ONCE(msi_get_device_domain(dev, domid)))
+	if (WARN_ON_ONCE(msi_get_device_domain(dev, domid))) {
+		dev_err(dev, "%s:%d err=get domain\n", __func__, __LINE__);
 		return false;
+	}
 
-	if (!pops->init_dev_msi_info(dev, parent, parent, &bundle->info))
+	if (!pops->init_dev_msi_info(dev, parent, parent, &bundle->info)) {
+		dev_err(dev, "%s:%d err=init_msi_info\n", __func__, __LINE__);
 		return false;
+	}
 
 	domain = __msi_create_irq_domain(fwnode, &bundle->info, IRQ_DOMAIN_FLAG_MSI_DEVICE, parent);
-	if (!domain)
+	if (!domain) {
+		dev_err(dev, "%s:%d err=create domain\n", __func__, __LINE__);
 		return false;
+	}
 
 	dev->msi.data->__domains[domid].domain = domain;
 
 	if (msi_domain_prepare_irqs(domain, dev, hwsize, &bundle->alloc_info)) {
+		pr_err("%s:%d\n err=prepare_irqs", __func__, __LINE__);
 		dev->msi.data->__domains[domid].domain = NULL;
 		irq_domain_remove(domain);
 		return false;

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [GIT PULL v2] PCI changes for v6.17
  2025-08-09  4:34                     ` Nam Cao
@ 2025-08-09 13:28                       ` Ammar Faizi
  2025-08-09 14:49                         ` Nam Cao
  0 siblings, 1 reply; 18+ messages in thread
From: Ammar Faizi @ 2025-08-09 13:28 UTC (permalink / raw)
  To: Nam Cao
  Cc: Thomas Gleixner, Lukas Wunner, Bjorn Helgaas, Linus Torvalds,
	Linux PCI Mailing List, Linux Kernel Mailing List, Rob Herring,
	Lorenzo Pieralisi, Manivannan Sadhasivam, Krzysztof Wilczynski,
	Armando Budianto, Alviro Iskandar Setiawan, gwml, namcaov

On Sat, Aug 09, 2025 at 06:34:29AM +0200, Nam Cao wrote:
> On Sat, Aug 09, 2025 at 07:52:30AM +0900, Ammar Faizi wrote:
> > I can do that. Send me a git diff. I'll test it and back with the dmesg
> > output.
> 
> That would be very helpful, thanks!
> 
> Please bear with me, this may take a few iterations.
> 
> Let's first try the below.

I just got home from a family outing. A bit slow response.

Here's the result:

  https://gist.github.com/ammarfaizi2/ef5f98123ed3868f8d64ed41662edd63#file-dmesg_pci_debug_001-txt-L853

-- 
Ammar Faizi


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [GIT PULL v2] PCI changes for v6.17
  2025-08-09 13:28                       ` Ammar Faizi
@ 2025-08-09 14:49                         ` Nam Cao
  2025-08-09 15:15                           ` Ammar Faizi
  0 siblings, 1 reply; 18+ messages in thread
From: Nam Cao @ 2025-08-09 14:49 UTC (permalink / raw)
  To: Ammar Faizi
  Cc: Thomas Gleixner, Lukas Wunner, Bjorn Helgaas, Linus Torvalds,
	Linux PCI Mailing List, Linux Kernel Mailing List, Rob Herring,
	Lorenzo Pieralisi, Manivannan Sadhasivam, Krzysztof Wilczynski,
	Armando Budianto, Alviro Iskandar Setiawan, gwml, namcaov

On Sat, Aug 09, 2025 at 08:28:39PM +0700, Ammar Faizi wrote:
> On Sat, Aug 09, 2025 at 06:34:29AM +0200, Nam Cao wrote:
> > On Sat, Aug 09, 2025 at 07:52:30AM +0900, Ammar Faizi wrote:
> > > I can do that. Send me a git diff. I'll test it and back with the dmesg
> > > output.
> > 
> > That would be very helpful, thanks!
> > 
> > Please bear with me, this may take a few iterations.
> > 
> > Let's first try the below.
> 
> I just got home from a family outing. A bit slow response.
> 
> Here's the result:
> 
>   https://gist.github.com/ammarfaizi2/ef5f98123ed3868f8d64ed41662edd63#file-dmesg_pci_debug_001-txt-L853

Thanks! Here's the problem:

    [    1.037223] pcieport 10000:e0:1d.0: __pci_enable_msix_range:840 err=hwsize

The PCIe port driver enables interrupt, trying MSI-X first. However, the
device does not support MSI-X, so it tries MSI instead, which triggers
the WARN_ON() in VMD driver.

What's strange is that, the VMD doc says:

    "Intel VMD only supports MSIx Interrupts from child devices and
    therefore the BIOS must enable PCIe Hot Plug and MSIx interrups"

Is it lying to us?

Can you	please try:

    Revert d5c647b08ee0 ("PCI: vmd: Fix wrong kfree() in vmd_msi_free()")
    Revert d7d8ab87e3e7 ("PCI: vmd: Switch to msi_create_parent_irq_domain()")

So that the driver is back to the original state before I touched it.

And apply the diff below. This will tell us if my commit breaks the driver
somehow, or VMD has been allowing MSI all this time.


diff --git a/drivers/pci/controller/vmd.c b/drivers/pci/controller/vmd.c
index d9b893bf4e45..e99d8cefb78d 100644
--- a/drivers/pci/controller/vmd.c
+++ b/drivers/pci/controller/vmd.c
@@ -255,9 +255,15 @@ static int vmd_msi_init(struct irq_domain *domain, struct msi_domain_info *info,
 			msi_alloc_info_t *arg)
 {
 	struct msi_desc *desc = arg->desc;
+	struct pci_dev *pci_dev = msi_desc_to_pci_dev(desc);
 	struct vmd_dev *vmd = vmd_from_bus(msi_desc_to_pci_dev(desc)->bus);
 	struct vmd_irq *vmdirq = kzalloc(sizeof(*vmdirq), GFP_KERNEL);
 
+	if (!pci_msix_vec_count(pci_dev))
+		pr_err("But VMD only supports MSIx Interrupts from child devices!\n");
+	else
+		pr_err("MSI-X, looking good...\n");
+	dump_stack();
 	if (!vmdirq)
 		return -ENOMEM;
 

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [GIT PULL v2] PCI changes for v6.17
  2025-08-09 14:49                         ` Nam Cao
@ 2025-08-09 15:15                           ` Ammar Faizi
  2025-08-09 15:32                             ` Nam Cao
  0 siblings, 1 reply; 18+ messages in thread
From: Ammar Faizi @ 2025-08-09 15:15 UTC (permalink / raw)
  To: Nam Cao
  Cc: Thomas Gleixner, Lukas Wunner, Bjorn Helgaas, Linus Torvalds,
	Linux PCI Mailing List, Linux Kernel Mailing List, Rob Herring,
	Lorenzo Pieralisi, Manivannan Sadhasivam, Krzysztof Wilczynski,
	Armando Budianto, Alviro Iskandar Setiawan, gwml, namcaov

On Sat, Aug 09, 2025 at 04:49:27PM +0200, Nam Cao wrote:
> Thanks! Here's the problem:
> 
>     [    1.037223] pcieport 10000:e0:1d.0: __pci_enable_msix_range:840 err=hwsize
> 
> The PCIe port driver enables interrupt, trying MSI-X first. However, the
> device does not support MSI-X, so it tries MSI instead, which triggers
> the WARN_ON() in VMD driver.
> 
> What's strange is that, the VMD doc says:
> 
>     "Intel VMD only supports MSIx Interrupts from child devices and
>     therefore the BIOS must enable PCIe Hot Plug and MSIx interrups"
> 
> Is it lying to us?
> 
> Can you	please try:
> 
>     Revert d5c647b08ee0 ("PCI: vmd: Fix wrong kfree() in vmd_msi_free()")
>     Revert d7d8ab87e3e7 ("PCI: vmd: Switch to msi_create_parent_irq_domain()")
> 
> So that the driver is back to the original state before I touched it.
> 
> And apply the diff below. This will tell us if my commit breaks the driver
> somehow, or VMD has been allowing MSI all this time.

Here's the result after reverting those two commits and applied the diff.

  https://gist.github.com/ammarfaizi2/03c7a9c0fec2a11f206931f1b7790709#file-dmesg_pci_debug_002-txt

Let's see if this one is enough for you to diagnose the problem.

Note that the previous printk() diff has to be discarded to avoid
conflict in the reverts. If that's still needed, send me a fixed
diff after clean reverts.

-- 
Ammar Faizi


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [GIT PULL v2] PCI changes for v6.17
  2025-08-09 15:15                           ` Ammar Faizi
@ 2025-08-09 15:32                             ` Nam Cao
  2025-08-09 16:03                               ` Ammar Faizi
  0 siblings, 1 reply; 18+ messages in thread
From: Nam Cao @ 2025-08-09 15:32 UTC (permalink / raw)
  To: Ammar Faizi
  Cc: Thomas Gleixner, Lukas Wunner, Bjorn Helgaas, Linus Torvalds,
	Linux PCI Mailing List, Linux Kernel Mailing List, Rob Herring,
	Lorenzo Pieralisi, Manivannan Sadhasivam, Krzysztof Wilczynski,
	Armando Budianto, Alviro Iskandar Setiawan, gwml, namcaov

Ammar Faizi <ammarfaizi2@gnuweeb.org> writes:
> Here's the result after reverting those two commits and applied the diff.
>
>   https://gist.github.com/ammarfaizi2/03c7a9c0fec2a11f206931f1b7790709#file-dmesg_pci_debug_002-txt
>
> Let's see if this one is enough for you to diagnose the problem.

Thanks, I think the problem is clear now. 

The diff I sent you has a mistake, it should be
    if (pci_msix_vec_count(pci_dev) < 0)
not
    if (!pci_msix_vec_count(pci_dev))

So the log is wrong, it printed "MSI-X, looking good...". It should have
printed the other one.

But no need to re-run it, the backtrace is enough.

    MSI-X, looking good... <-------- wrong log
    CPU: 3 UID: 0 PID: 183 Comm: systemd-udevd Not tainted 6.16.0-afh2-dbg-2025-08-09-gb622ab28bcac #13 PREEMPT(full)  28137b57996795286f6544f071ec852674a057d4
    Hardware name: HP HP Laptop 14s-dq2xxx/87FD, BIOS F.21 03/21/2022
    Call Trace:
     <TASK>
    dump_stack_lvl
    vmd_msi_init
    msi_domain_alloc
    irq_domain_alloc_irqs_locked
    __irq_domain_alloc_irqs
    __msi_domain_alloc_irqs
    msi_domain_alloc_irqs_all_locked
    __msi_capability_init
    __pci_enable_msi_range
    pci_alloc_irq_vectors_affinity
    pcie_portdrv_probe

So unlike what VMD doc says, it actually can have non-MSI-X children devices!

Please discard the reverts and the diff I sent you, and try the diff
below. I believe your machine will work now.

diff --git a/drivers/pci/controller/vmd.c b/drivers/pci/controller/vmd.c
index b679c7f28f51..1bd5bf4a6097 100644
--- a/drivers/pci/controller/vmd.c
+++ b/drivers/pci/controller/vmd.c
@@ -306,9 +306,6 @@ static bool vmd_init_dev_msi_info(struct device *dev, struct irq_domain *domain,
 				  struct irq_domain *real_parent,
 				  struct msi_domain_info *info)
 {
-	if (WARN_ON_ONCE(info->bus_token != DOMAIN_BUS_PCI_DEVICE_MSIX))
-		return false;
-
 	if (!msi_lib_init_dev_msi_info(dev, domain, real_parent, info))
 		return false;
 

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [GIT PULL v2] PCI changes for v6.17
  2025-08-09 15:32                             ` Nam Cao
@ 2025-08-09 16:03                               ` Ammar Faizi
  2025-08-09 16:28                                 ` Nam Cao
  0 siblings, 1 reply; 18+ messages in thread
From: Ammar Faizi @ 2025-08-09 16:03 UTC (permalink / raw)
  To: Nam Cao
  Cc: Thomas Gleixner, Lukas Wunner, Bjorn Helgaas, Linus Torvalds,
	Linux PCI Mailing List, Linux Kernel Mailing List, Rob Herring,
	Lorenzo Pieralisi, Manivannan Sadhasivam, Krzysztof Wilczynski,
	Armando Budianto, Alviro Iskandar Setiawan, gwml, namcaov

On Sat, Aug 09, 2025 at 05:32:16PM +0200, Nam Cao wrote:
> So unlike what VMD doc says, it actually can have non-MSI-X children devices!

If that's the conclusion, then Intel VMD doc also needs fixing :/

> Please discard the reverts and the diff I sent you, and try the diff
> below. I believe your machine will work now.

Yes, I can confirm it's now clean. Just to verify both sides, here is
the last result:

  https://gist.github.com/ammarfaizi2/72578d2b4cc385fbdb5faee69013d530

If that one fix is final, then:

Tested-by: Ammar Faizi <ammarfaizi2@gnuweeb.org>

Thanks for the debugging work.

It's probably too late to get the fix in mainline before rc1. But if it
can go upstream sooner, that would be great.

-- 
Ammar Faizi


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [GIT PULL v2] PCI changes for v6.17
  2025-08-09 16:03                               ` Ammar Faizi
@ 2025-08-09 16:28                                 ` Nam Cao
  0 siblings, 0 replies; 18+ messages in thread
From: Nam Cao @ 2025-08-09 16:28 UTC (permalink / raw)
  To: Ammar Faizi
  Cc: Thomas Gleixner, Lukas Wunner, Bjorn Helgaas, Linus Torvalds,
	Linux PCI Mailing List, Linux Kernel Mailing List, Rob Herring,
	Lorenzo Pieralisi, Manivannan Sadhasivam, Krzysztof Wilczynski,
	Armando Budianto, Alviro Iskandar Setiawan, gwml, namcaov

Ammar Faizi <ammarfaizi2@gnuweeb.org> writes:

> On Sat, Aug 09, 2025 at 05:32:16PM +0200, Nam Cao wrote:
>> So unlike what VMD doc says, it actually can have non-MSI-X children devices!
>
> If that's the conclusion, then Intel VMD doc also needs fixing :/

Other possibilities are problem with your BIOS (likely), or problem with
Linux's PCI enumeration (unlikely).

But without hardware, I cannot investigate this further. Now that we
know my commit didn't make the driver any worse, I am done here.

>> Please discard the reverts and the diff I sent you, and try the diff
>> below. I believe your machine will work now.
>
> Yes, I can confirm it's now clean. Just to verify both sides, here is
> the last result:
>
>   https://gist.github.com/ammarfaizi2/72578d2b4cc385fbdb5faee69013d530
>
> If that one fix is final, then:
>
> Tested-by: Ammar Faizi <ammarfaizi2@gnuweeb.org>
>
> Thanks for the debugging work.

Thanks for running tests, it would be impossible to figure out otherwise.

> It's probably too late to get the fix in mainline before rc1. But if it
> can go upstream sooner, that would be great.

I don't think PCI maintainers are available at the moment, so I will
send the patch next Monday. Time to enjoy my weekends..

Nam

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2025-08-09 16:28 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20250801142254.GA3496192@bhelgaas>
     [not found] ` <175408424863.4088284.13236765550439476565.pr-tracker-bot@kernel.org>
2025-08-07  3:34   ` [GIT PULL v2] PCI changes for v6.17 Ammar Faizi
2025-08-07  3:51     ` Lukas Wunner
2025-08-07  4:44       ` Nam Cao
2025-08-07  4:54       ` Ammar Faizi
2025-08-07  5:03         ` Nam Cao
2025-08-07  5:13           ` Ammar Faizi
2025-08-08 10:59             ` Ammar Faizi
2025-08-08 11:19               ` Ammar Faizi
2025-08-08 13:34                 ` Ammar Faizi
2025-08-08 18:07                 ` Nam Cao
2025-08-08 22:52                   ` Ammar Faizi
2025-08-09  4:34                     ` Nam Cao
2025-08-09 13:28                       ` Ammar Faizi
2025-08-09 14:49                         ` Nam Cao
2025-08-09 15:15                           ` Ammar Faizi
2025-08-09 15:32                             ` Nam Cao
2025-08-09 16:03                               ` Ammar Faizi
2025-08-09 16:28                                 ` Nam Cao

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox