public inbox for [email protected]
* PROBLEM: io_uring hang causing uninterruptible sleep state on 6.6.59
@ 2024-11-03 23:47 Andrew Marshall
  2024-11-03 23:53 ` Jens Axboe
  0 siblings, 1 reply; 11+ messages in thread
From: Andrew Marshall @ 2024-11-03 23:47 UTC
  To: Jens Axboe; +Cc: io-uring

Hi,

I and others (see the downstream report below) are seeing io_uring occasionally hang on 6.6.59 LTS. If the process is killed, it remains stuck in uninterruptible sleep ("D"). This failure can be fairly reliably reproduced via Node.js with `npm ci` in at least some projects; disabling that tool’s use of io_uring via its configuration causes it to succeed. I have identified what seems to be the problematic commit on linux-6.6.y (f4ce3b5).

Summary of Kernel version triaging:

- 6.6.56: succeeds
- 6.6.57: fails
- 6.6.58: fails
- 6.6.59: fails
- 6.6.59 (with f4ce3b5 reverted): succeeds
- 6.11.6: succeeds

System logs upon failure indicate hung task:

kernel: INFO: task npm ci:47920 blocked for more than 245 seconds.
kernel:       Tainted: P           O       6.6.58 #1-NixOS
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: task:npm ci          state:D stack:0     pid:47920 ppid:47710  flags:0x00004006
kernel: Call Trace:
kernel:  <TASK>
kernel:  __schedule+0x3fc/0x1430
kernel:  ? sysvec_apic_timer_interrupt+0xe/0x90
kernel:  schedule+0x5e/0xe0
kernel:  schedule_preempt_disabled+0x15/0x30
kernel:  __mutex_lock.constprop.0+0x3a2/0x6b0
kernel:  io_uring_del_tctx_node+0x61/0xf0
kernel:  io_uring_clean_tctx+0x5c/0xc0
kernel:  io_uring_cancel_generic+0x198/0x350
kernel:  ? srso_return_thunk+0x5/0x5f
kernel:  ? timerqueue_del+0x2e/0x50
kernel:  ? __pfx_autoremove_wake_function+0x10/0x10
kernel:  do_exit+0x167/0xad0
kernel:  ? __pfx_hrtimer_wakeup+0x10/0x10
kernel:  do_group_exit+0x31/0x80
kernel:  get_signal+0xa60/0xa60
kernel:  arch_do_signal_or_restart+0x3e/0x280
kernel:  exit_to_user_mode_prepare+0x1d4/0x230
kernel:  syscall_exit_to_user_mode+0x1b/0x50
kernel:  do_syscall_64+0x45/0x90
kernel:  entry_SYSCALL_64_after_hwframe+0x78/0xe2

For more details, see the downstream bug report in Node.js: https://github.com/nodejs/node/issues/55587

I identified f4ce3b5d26ce149e77e6b8e8f2058aa80e5b034e as the likely problematic commit simply by browsing git log. As indicated above, reverting it atop 6.6.59 results in success. Since it is passing on 6.11.6, I suspect there is some missing backport to 6.6.x, or some other semantic merge conflict. Unfortunately I do not have a compact, minimal reproducer, but I can provide my large one (it is testing a larger build process in a VM) if needed; there are some additional details in the above-linked downstream bug report, though. I hope that having identified the problematic commit is enough for someone with more context to go off of. Happy to provide more information if needed.


Thanks,
Andrew
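
The "state:D" in the hung-task report above can also be checked from userspace by reading the state field of /proc/<pid>/stat. The following is a minimal sketch, not part of the original report: the helper names proc_stat_state() and proc_stat_is_uninterruptible() are hypothetical, and the caller is assumed to have already read the stat line into a string. Because the command name here is "npm ci" (it contains a space), the sketch locates the state by searching for the last ')' rather than splitting on spaces.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper: return the one-character task state from a
 * /proc/<pid>/stat line ("pid (comm) state ppid ..."), or '\0' if the
 * line cannot be parsed.
 */
char proc_stat_state(const char *stat_line)
{
    const char *close;

    assert(stat_line != NULL);

    /* comm may contain spaces or ')' itself, so use the last ')'. */
    close = strrchr(stat_line, ')');
    if (close == NULL || close[1] != ' ' || close[2] == '\0')
        return '\0';

    /* The state character follows ") ". */
    return close[2];
}

/* Hypothetical helper: nonzero if the task is in uninterruptible sleep. */
int proc_stat_is_uninterruptible(const char *stat_line)
{
    return proc_stat_state(stat_line) == 'D';
}
```

Given a made-up line such as "47920 (npm ci) D 47710 ...", proc_stat_state() returns 'D'. A task that stays in that state after being killed matches the symptom described in this report.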

* Re: PROBLEM: io_uring hang causing uninterruptible sleep state on 6.6.59
  2024-11-03 23:47 PROBLEM: io_uring hang causing uninterruptible sleep state on 6.6.59 Andrew Marshall
@ 2024-11-03 23:53 ` Jens Axboe
  2024-11-03 23:58   ` Jens Axboe
  2024-11-04  0:01   ` Keith Busch
  0 siblings, 2 replies; 11+ messages in thread
From: Jens Axboe @ 2024-11-03 23:53 UTC
  To: Andrew Marshall; +Cc: io-uring

On 11/3/24 4:47 PM, Andrew Marshall wrote:
> Hi,
> 
> I, and others (see downstream report below), are encountering io_uring
> at times hanging on 6.6.59 LTS. If the process is killed, the process
> remains stuck in sleep uninterruptible ("D"). This failure can be
> fairly reliably reproduced via Node.js with `npm ci` in at least some
> projects; disabling that tool’s use of io_uring via its
> configuration causes it to succeed. I have identified what seems to be
> the problematic commit on linux-6.6.y (f4ce3b5).
> 
> Summary of Kernel version triaging:
> 
> - 6.6.56: succeeds
> - 6.6.57: fails
> - 6.6.58: fails
> - 6.6.59: fails
> - 6.6.59 (with f4ce3b5 reverted): succeeds
> - 6.11.6: succeeds
> 
> System logs upon failure indicate hung task:
> 
> kernel: INFO: task npm ci:47920 blocked for more than 245 seconds.
> kernel:       Tainted: P           O       6.6.58 #1-NixOS
> kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> kernel: task:npm ci          state:D stack:0     pid:47920 ppid:47710  flags:0x00004006
> kernel: Call Trace:
> kernel:  <TASK>
> kernel:  __schedule+0x3fc/0x1430
> kernel:  ? sysvec_apic_timer_interrupt+0xe/0x90
> kernel:  schedule+0x5e/0xe0
> kernel:  schedule_preempt_disabled+0x15/0x30
> kernel:  __mutex_lock.constprop.0+0x3a2/0x6b0
> kernel:  io_uring_del_tctx_node+0x61/0xf0
> kernel:  io_uring_clean_tctx+0x5c/0xc0
> kernel:  io_uring_cancel_generic+0x198/0x350
> kernel:  ? srso_return_thunk+0x5/0x5f
> kernel:  ? timerqueue_del+0x2e/0x50
> kernel:  ? __pfx_autoremove_wake_function+0x10/0x10
> kernel:  do_exit+0x167/0xad0
> kernel:  ? __pfx_hrtimer_wakeup+0x10/0x10
> kernel:  do_group_exit+0x31/0x80
> kernel:  get_signal+0xa60/0xa60
> kernel:  arch_do_signal_or_restart+0x3e/0x280
> kernel:  exit_to_user_mode_prepare+0x1d4/0x230
> kernel:  syscall_exit_to_user_mode+0x1b/0x50
> kernel:  do_syscall_64+0x45/0x90
> kernel:  entry_SYSCALL_64_after_hwframe+0x78/0xe2
> 
> For more details, see the downstream bug report in Node.js: https://github.com/nodejs/node/issues/55587
> 
> I identified f4ce3b5d26ce149e77e6b8e8f2058aa80e5b034e as the likely
> problematic commit simply by browsing git log. As indicated above;
> reverting that atop 6.6.59 results in success. Since it is passing on
> 6.11.6, I suspect there is some missing backport to 6.6.x, or some
> other semantic merge conflict. Unfortunately I do not have a compact,
> minimal reproducer, but can provide my large one (it is testing a
> larger build process in a VM) if needed; there are some additional
> details in the above-linked downstream bug report, though. I hope that
> having identified the problematic commit is enough for someone with
> more context to go off of. Happy to provide more information if
> needed.

Don't worry about not having a reproducer; having the backport commit
pinpointed will do just fine. I'll take a look at this.

-- 
Jens Axboe

* Re: PROBLEM: io_uring hang causing uninterruptible sleep state on 6.6.59
  2024-11-03 23:53 ` Jens Axboe
@ 2024-11-03 23:58   ` Jens Axboe
  2024-11-04  0:01   ` Keith Busch
  1 sibling, 0 replies; 11+ messages in thread
From: Jens Axboe @ 2024-11-03 23:58 UTC
  To: Andrew Marshall; +Cc: io-uring

On 11/3/24 4:53 PM, Jens Axboe wrote:
> On 11/3/24 4:47 PM, Andrew Marshall wrote:
>> Hi,
>>
>> I, and others (see downstream report below), are encountering io_uring
>> at times hanging on 6.6.59 LTS. If the process is killed, the process
>> remains stuck in sleep uninterruptible ("D"). This failure can be
>> fairly reliably reproduced via Node.js with `npm ci` in at least some
>> projects; disabling that tool’s use of io_uring via its
>> configuration causes it to succeed. I have identified what seems to be
>> the problematic commit on linux-6.6.y (f4ce3b5).
>>
>> Summary of Kernel version triaging:
>>
>> - 6.6.56: succeeds
>> - 6.6.57: fails
>> - 6.6.58: fails
>> - 6.6.59: fails
>> - 6.6.59 (with f4ce3b5 reverted): succeeds
>> - 6.11.6: succeeds
>>
>> System logs upon failure indicate hung task:
>>
>> kernel: INFO: task npm ci:47920 blocked for more than 245 seconds.
>> kernel:       Tainted: P           O       6.6.58 #1-NixOS
>> kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> kernel: task:npm ci          state:D stack:0     pid:47920 ppid:47710  flags:0x00004006
>> kernel: Call Trace:
>> kernel:  <TASK>
>> kernel:  __schedule+0x3fc/0x1430
>> kernel:  ? sysvec_apic_timer_interrupt+0xe/0x90
>> kernel:  schedule+0x5e/0xe0
>> kernel:  schedule_preempt_disabled+0x15/0x30
>> kernel:  __mutex_lock.constprop.0+0x3a2/0x6b0
>> kernel:  io_uring_del_tctx_node+0x61/0xf0
>> kernel:  io_uring_clean_tctx+0x5c/0xc0
>> kernel:  io_uring_cancel_generic+0x198/0x350
>> kernel:  ? srso_return_thunk+0x5/0x5f
>> kernel:  ? timerqueue_del+0x2e/0x50
>> kernel:  ? __pfx_autoremove_wake_function+0x10/0x10
>> kernel:  do_exit+0x167/0xad0
>> kernel:  ? __pfx_hrtimer_wakeup+0x10/0x10
>> kernel:  do_group_exit+0x31/0x80
>> kernel:  get_signal+0xa60/0xa60
>> kernel:  arch_do_signal_or_restart+0x3e/0x280
>> kernel:  exit_to_user_mode_prepare+0x1d4/0x230
>> kernel:  syscall_exit_to_user_mode+0x1b/0x50
>> kernel:  do_syscall_64+0x45/0x90
>> kernel:  entry_SYSCALL_64_after_hwframe+0x78/0xe2
>>
>> For more details, see the downstream bug report in Node.js: https://github.com/nodejs/node/issues/55587
>>
>> I identified f4ce3b5d26ce149e77e6b8e8f2058aa80e5b034e as the likely
>> problematic commit simply by browsing git log. As indicated above;
>> reverting that atop 6.6.59 results in success. Since it is passing on
>> 6.11.6, I suspect there is some missing backport to 6.6.x, or some
>> other semantic merge conflict. Unfortunately I do not have a compact,
>> minimal reproducer, but can provide my large one (it is testing a
>> larger build process in a VM) if needed; there are some additional
>> details in the above-linked downstream bug report, though. I hope that
>> having identified the problematic commit is enough for someone with
>> more context to go off of. Happy to provide more information if
>> needed.
> 
> Don't worry about not having a reproducer, having the backport commit
> pin pointed will do just fine. I'll take a look at this.

Ah, that looks pretty dumb, in fact. The below should fix it. However,
it's worth noting that this will only happen if there's overflow going
on, and presumably only if the overflow list is quite long. That does
indicate a problem with the user of the ring; generally, overflow should
not be seen at all. This is entirely independent of the backport being
buggy, but I wanted to bring it up as it is cause for concern on the
application side.

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 39d8d1fc5c2b..aa7c67a037e7 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -711,9 +711,11 @@ static void __io_cqring_overflow_flush(struct io_ring_ctx *ctx)
 		 */
 		if (need_resched()) {
 			io_cq_unlock_post(ctx);
-			mutex_unlock(&ctx->uring_lock);
+			if (ctx->flags & IORING_SETUP_IOPOLL)
+				mutex_unlock(&ctx->uring_lock);
 			cond_resched();
-			mutex_lock(&ctx->uring_lock);
+			if (ctx->flags & IORING_SETUP_IOPOLL)
+				mutex_lock(&ctx->uring_lock);
 			io_cq_lock(ctx);
 		}
 	}

-- 
Jens Axboe
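
One way to read the quick patch above is as a lock-balance problem. The following is a userspace toy, not kernel code, and all names in it (toy_lock, toy_flush_resched, toy_overflow_flush) are hypothetical. It assumes that the 6.6.y caller takes uring_lock only for IOPOLL rings, as the pre-change code in the 6.1 patch later in this thread suggests, while the backported resched step drops and retakes the lock unconditionally. Under that assumption a non-IOPOLL ring would leave the flush with the lock held and nobody to release it, which would be consistent with the exit path blocking in io_uring_del_tctx_node().

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for ctx->uring_lock: records state instead of blocking. */
struct toy_lock {
    bool held;        /* is the lock currently held? */
    int bad_unlocks;  /* unlock calls made while the lock was not held */
};

static void toy_lock_acquire(struct toy_lock *l)
{
    /* A real mutex_lock() would block here if the lock were held. */
    assert(!l->held);
    l->held = true;
}

static void toy_lock_release(struct toy_lock *l)
{
    if (!l->held)
        l->bad_unlocks++;  /* unbalanced unlock */
    l->held = false;
}

/*
 * Model of the resched step inside the flush loop.
 * relock_always == true models the unconditional unlock/lock pair;
 * false models the quick patch, which only touches the lock for IOPOLL.
 */
static void toy_flush_resched(struct toy_lock *l, bool iopoll,
                              bool relock_always)
{
    if (relock_always || iopoll)
        toy_lock_release(l);
    /* cond_resched() would run here in the kernel */
    if (relock_always || iopoll)
        toy_lock_acquire(l);
}

/*
 * Model of the caller. caller_always_locks == false models a caller that
 * takes the lock only for IOPOLL rings; true models "always lock".
 * Returns the final lock state so the caller can inspect the balance.
 */
struct toy_lock toy_overflow_flush(bool iopoll, bool caller_always_locks,
                                   bool relock_always)
{
    struct toy_lock l = { false, 0 };
    bool take = caller_always_locks || iopoll;

    if (take)
        toy_lock_acquire(&l);
    toy_flush_resched(&l, iopoll, relock_always);
    if (take)
        toy_lock_release(&l);
    return l;
}
```

In this model, toy_overflow_flush(false, false, true) ends with held == true and bad_unlocks == 1, which is the leaked lock. Making the relock conditional (relock_always == false) or making the caller always lock (caller_always_locks == true) both end with the lock released and no bad unlocks.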

* Re: PROBLEM: io_uring hang causing uninterruptible sleep state on 6.6.59
  2024-11-03 23:53 ` Jens Axboe
  2024-11-03 23:58   ` Jens Axboe
@ 2024-11-04  0:01   ` Keith Busch
  2024-11-04  0:06     ` Jens Axboe
  1 sibling, 1 reply; 11+ messages in thread
From: Keith Busch @ 2024-11-04  0:01 UTC
  To: Jens Axboe; +Cc: Andrew Marshall, io-uring

On Sun, Nov 03, 2024 at 04:53:27PM -0700, Jens Axboe wrote:
> On 11/3/24 4:47 PM, Andrew Marshall wrote:
> > I identified f4ce3b5d26ce149e77e6b8e8f2058aa80e5b034e as the likely
> > problematic commit simply by browsing git log. As indicated above;
> > reverting that atop 6.6.59 results in success. Since it is passing on
> > 6.11.6, I suspect there is some missing backport to 6.6.x, or some
> > other semantic merge conflict. Unfortunately I do not have a compact,
> > minimal reproducer, but can provide my large one (it is testing a
> > larger build process in a VM) if needed; there are some additional
> > details in the above-linked downstream bug report, though. I hope that
> > having identified the problematic commit is enough for someone with
> > more context to go off of. Happy to provide more information if
> > needed.
> 
> Don't worry about not having a reproducer, having the backport commit
> pin pointed will do just fine. I'll take a look at this.

I think stable is missing:

  6b231248e97fc3 ("io_uring: consolidate overflow flushing")

* Re: PROBLEM: io_uring hang causing uninterruptible sleep state on 6.6.59
  2024-11-04  0:01   ` Keith Busch
@ 2024-11-04  0:06     ` Jens Axboe
  2024-11-04  2:38       ` Stable backport (was "Re: PROBLEM: io_uring hang causing uninterruptible sleep state on 6.6.59") Jens Axboe
  0 siblings, 1 reply; 11+ messages in thread
From: Jens Axboe @ 2024-11-04  0:06 UTC
  To: Keith Busch; +Cc: Andrew Marshall, io-uring

On 11/3/24 5:01 PM, Keith Busch wrote:
> On Sun, Nov 03, 2024 at 04:53:27PM -0700, Jens Axboe wrote:
>> On 11/3/24 4:47 PM, Andrew Marshall wrote:
>>> I identified f4ce3b5d26ce149e77e6b8e8f2058aa80e5b034e as the likely
>>> problematic commit simply by browsing git log. As indicated above;
>>> reverting that atop 6.6.59 results in success. Since it is passing on
>>> 6.11.6, I suspect there is some missing backport to 6.6.x, or some
>>> other semantic merge conflict. Unfortunately I do not have a compact,
>>> minimal reproducer, but can provide my large one (it is testing a
>>> larger build process in a VM) if needed; there are some additional
>>> details in the above-linked downstream bug report, though. I hope that
>>> having identified the problematic commit is enough for someone with
>>> more context to go off of. Happy to provide more information if
>>> needed.
>>
>> Don't worry about not having a reproducer, having the backport commit
>> pin pointed will do just fine. I'll take a look at this.
> 
> I think stable is missing:
> 
>   6b231248e97fc3 ("io_uring: consolidate overflow flushing")

I think you need to go back further than that, this one already
unconditionally holds ->uring_lock around overflow flushing...

-- 
Jens Axboe


* Stable backport (was "Re: PROBLEM: io_uring hang causing uninterruptible sleep state on 6.6.59")
  2024-11-04  0:06     ` Jens Axboe
@ 2024-11-04  2:38       ` Jens Axboe
  2024-11-04  4:25         ` Andrew Marshall
  2024-11-06  6:05         ` Greg Kroah-Hartman
  0 siblings, 2 replies; 11+ messages in thread
From: Jens Axboe @ 2024-11-04  2:38 UTC
  To: Keith Busch; +Cc: Andrew Marshall, io-uring, Greg Kroah-Hartman, stable

On 11/3/24 5:06 PM, Jens Axboe wrote:
> On 11/3/24 5:01 PM, Keith Busch wrote:
>> On Sun, Nov 03, 2024 at 04:53:27PM -0700, Jens Axboe wrote:
>>> On 11/3/24 4:47 PM, Andrew Marshall wrote:
>>>> I identified f4ce3b5d26ce149e77e6b8e8f2058aa80e5b034e as the likely
>>>> problematic commit simply by browsing git log. As indicated above;
>>>> reverting that atop 6.6.59 results in success. Since it is passing on
>>>> 6.11.6, I suspect there is some missing backport to 6.6.x, or some
>>>> other semantic merge conflict. Unfortunately I do not have a compact,
>>>> minimal reproducer, but can provide my large one (it is testing a
>>>> larger build process in a VM) if needed; there are some additional
>>>> details in the above-linked downstream bug report, though. I hope that
>>>> having identified the problematic commit is enough for someone with
>>>> more context to go off of. Happy to provide more information if
>>>> needed.
>>>
>>> Don't worry about not having a reproducer, having the backport commit
>>> pin pointed will do just fine. I'll take a look at this.
>>
>> I think stable is missing:
>>
>>   6b231248e97fc3 ("io_uring: consolidate overflow flushing")
> 
> I think you need to go back further than that, this one already
> unconditionally holds ->uring_lock around overflow flushing...

Took a look, it's this one:

commit 8d09a88ef9d3cb7d21d45c39b7b7c31298d23998
Author: Pavel Begunkov <[email protected]>
Date:   Wed Apr 10 02:26:54 2024 +0100

    io_uring: always lock __io_cqring_overflow_flush

Greg/stable, can you pick this one for 6.6-stable? It picks
cleanly.

For 6.1, which is the other stable of that age that has the backport,
the attached patch will do the trick.

With that, I believe it should be sorted. Hopefully that can make
6.6.60 and 6.1.116.

-- 
Jens Axboe

[-- Attachment #2: 0001-io_uring-always-lock-__io_cqring_overflow_flush.patch --]
[-- Type: text/x-patch, Size: 1966 bytes --]

From 3f1c33f03386c481caf2044a836f3ca611094098 Mon Sep 17 00:00:00 2001
From: Pavel Begunkov <[email protected]>
Date: Wed, 10 Apr 2024 02:26:54 +0100
Subject: [PATCH] io_uring: always lock __io_cqring_overflow_flush

Commit 8d09a88ef9d3cb7d21d45c39b7b7c31298d23998 upstream.

Conditional locking is never great, in case of
__io_cqring_overflow_flush(), which is a slow path, it's not justified.
Don't handle IOPOLL separately, always grab uring_lock for overflow
flushing.

Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/162947df299aa12693ac4b305dacedab32ec7976.1712708261.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>
---
 io_uring/io_uring.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index f902b161f02c..92c1aa8f3501 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -593,6 +593,8 @@ static bool __io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force)
 	bool all_flushed;
 	size_t cqe_size = sizeof(struct io_uring_cqe);
 
+	lockdep_assert_held(&ctx->uring_lock);
+
 	if (!force && __io_cqring_events(ctx) == ctx->cq_entries)
 		return false;
 
@@ -647,12 +649,9 @@ static bool io_cqring_overflow_flush(struct io_ring_ctx *ctx)
 	bool ret = true;
 
 	if (test_bit(IO_CHECK_CQ_OVERFLOW_BIT, &ctx->check_cq)) {
-		/* iopoll syncs against uring_lock, not completion_lock */
-		if (ctx->flags & IORING_SETUP_IOPOLL)
-			mutex_lock(&ctx->uring_lock);
+		mutex_lock(&ctx->uring_lock);
 		ret = __io_cqring_overflow_flush(ctx, false);
-		if (ctx->flags & IORING_SETUP_IOPOLL)
-			mutex_unlock(&ctx->uring_lock);
+		mutex_unlock(&ctx->uring_lock);
 	}
 
 	return ret;
@@ -1405,6 +1404,8 @@ static int io_iopoll_check(struct io_ring_ctx *ctx, long min)
 	int ret = 0;
 	unsigned long check_cq;
 
+	lockdep_assert_held(&ctx->uring_lock);
+
 	if (!io_allowed_run_tw(ctx))
 		return -EEXIST;
 
-- 
2.45.2


* Re: Stable backport (was "Re: PROBLEM: io_uring hang causing uninterruptible sleep state on 6.6.59")
  2024-11-04  2:38       ` Stable backport (was "Re: PROBLEM: io_uring hang causing uninterruptible sleep state on 6.6.59") Jens Axboe
@ 2024-11-04  4:25         ` Andrew Marshall
  2024-11-04 13:17           ` Andrew Marshall
  2024-11-06  6:05         ` Greg Kroah-Hartman
  1 sibling, 1 reply; 11+ messages in thread
From: Andrew Marshall @ 2024-11-04  4:25 UTC
  To: Jens Axboe, Keith Busch; +Cc: io-uring, Greg Kroah-Hartman, stable

On Sun, Nov 3, 2024, at 21:38, Jens Axboe wrote:
> On 11/3/24 5:06 PM, Jens Axboe wrote:
>> On 11/3/24 5:01 PM, Keith Busch wrote:
>>> On Sun, Nov 03, 2024 at 04:53:27PM -0700, Jens Axboe wrote:
>>>> On 11/3/24 4:47 PM, Andrew Marshall wrote:
>>>>> I identified f4ce3b5d26ce149e77e6b8e8f2058aa80e5b034e as the likely
>>>>> problematic commit simply by browsing git log. As indicated above;
>>>>> reverting that atop 6.6.59 results in success. Since it is passing on
>>>>> 6.11.6, I suspect there is some missing backport to 6.6.x, or some
>>>>> other semantic merge conflict. Unfortunately I do not have a compact,
>>>>> minimal reproducer, but can provide my large one (it is testing a
>>>>> larger build process in a VM) if needed; there are some additional
>>>>> details in the above-linked downstream bug report, though. I hope that
>>>>> having identified the problematic commit is enough for someone with
>>>>> more context to go off of. Happy to provide more information if
>>>>> needed.
>>>>
>>>> Don't worry about not having a reproducer, having the backport commit
>>>> pin pointed will do just fine. I'll take a look at this.
>>>
>>> I think stable is missing:
>>>
>>>   6b231248e97fc3 ("io_uring: consolidate overflow flushing")
>> 
>> I think you need to go back further than that, this one already
>> unconditionally holds ->uring_lock around overflow flushing...
>
> Took a look, it's this one:
>
> commit 8d09a88ef9d3cb7d21d45c39b7b7c31298d23998
> Author: Pavel Begunkov <[email protected]>
> Date:   Wed Apr 10 02:26:54 2024 +0100
>
>     io_uring: always lock __io_cqring_overflow_flush
>
> Greg/stable, can you pick this one for 6.6-stable? It picks
> cleanly.
>
> For 6.1, which is the other stable of that age that has the backport,
> the attached patch will do the trick.
>
> With that, I believe it should be sorted. Hopefully that can make
> 6.6.60 and 6.1.116.
>
> -- 
> Jens Axboe
> Attachments:
> * 0001-io_uring-always-lock-__io_cqring_overflow_flush.patch

Cherry-picking 6b231248e97fc3 onto 6.6.59, I can confirm it passes my reproducer (run a few times). Your first quick patch also passed, for what it’s worth. Thanks for the quick responses!

* Re: Stable backport (was "Re: PROBLEM: io_uring hang causing uninterruptible sleep state on 6.6.59")
  2024-11-04  4:25         ` Andrew Marshall
@ 2024-11-04 13:17           ` Andrew Marshall
  2024-11-04 15:58             ` Jens Axboe
  0 siblings, 1 reply; 11+ messages in thread
From: Andrew Marshall @ 2024-11-04 13:17 UTC
  To: Jens Axboe, Keith Busch; +Cc: io-uring, Greg Kroah-Hartman, stable

On Sun, Nov 3, 2024, at 23:25, Andrew Marshall wrote:
> On Sun, Nov 3, 2024, at 21:38, Jens Axboe wrote:
>> On 11/3/24 5:06 PM, Jens Axboe wrote:
>>> On 11/3/24 5:01 PM, Keith Busch wrote:
>>>> On Sun, Nov 03, 2024 at 04:53:27PM -0700, Jens Axboe wrote:
>>>>> On 11/3/24 4:47 PM, Andrew Marshall wrote:
>>>>>> I identified f4ce3b5d26ce149e77e6b8e8f2058aa80e5b034e as the likely
>>>>>> problematic commit simply by browsing git log. As indicated above;
>>>>>> reverting that atop 6.6.59 results in success. Since it is passing on
>>>>>> 6.11.6, I suspect there is some missing backport to 6.6.x, or some
>>>>>> other semantic merge conflict. Unfortunately I do not have a compact,
>>>>>> minimal reproducer, but can provide my large one (it is testing a
>>>>>> larger build process in a VM) if needed; there are some additional
>>>>>> details in the above-linked downstream bug report, though. I hope that
>>>>>> having identified the problematic commit is enough for someone with
>>>>>> more context to go off of. Happy to provide more information if
>>>>>> needed.
>>>>>
>>>>> Don't worry about not having a reproducer, having the backport commit
>>>>> pin pointed will do just fine. I'll take a look at this.
>>>>
>>>> I think stable is missing:
>>>>
>>>>   6b231248e97fc3 ("io_uring: consolidate overflow flushing")
>>> 
>>> I think you need to go back further than that, this one already
>>> unconditionally holds ->uring_lock around overflow flushing...
>>
>> Took a look, it's this one:
>>
>> commit 8d09a88ef9d3cb7d21d45c39b7b7c31298d23998
>> Author: Pavel Begunkov <[email protected]>
>> Date:   Wed Apr 10 02:26:54 2024 +0100
>>
>>     io_uring: always lock __io_cqring_overflow_flush
>>
>> Greg/stable, can you pick this one for 6.6-stable? It picks
>> cleanly.
>>
>> For 6.1, which is the other stable of that age that has the backport,
>> the attached patch will do the trick.
>>
>> With that, I believe it should be sorted. Hopefully that can make
>> 6.6.60 and 6.1.116.
>>
>> -- 
>> Jens Axboe
>> Attachments:
>> * 0001-io_uring-always-lock-__io_cqring_overflow_flush.patch
>
> Cherry-picking 6b231248e97fc3 onto 6.6.59, I can confirm it passes my 
> reproducer (run a few times). Your first quick patch also passed, for 
> what it’s worth. Thanks for the quick responses!

Correction: I cherry-picked and tested 8d09a88ef9d3cb7d21d45c39b7b7c31298d23998 (which was the change you identified), not 6b231248e97fc3. Apologies for any confusion.

* Re: Stable backport (was "Re: PROBLEM: io_uring hang causing uninterruptible sleep state on 6.6.59")
  2024-11-04 13:17           ` Andrew Marshall
@ 2024-11-04 15:58             ` Jens Axboe
  0 siblings, 0 replies; 11+ messages in thread
From: Jens Axboe @ 2024-11-04 15:58 UTC
  To: Andrew Marshall, Keith Busch; +Cc: io-uring, Greg Kroah-Hartman, stable

On 11/4/24 6:17 AM, Andrew Marshall wrote:
> On Sun, Nov 3, 2024, at 23:25, Andrew Marshall wrote:
>> On Sun, Nov 3, 2024, at 21:38, Jens Axboe wrote:
>>> On 11/3/24 5:06 PM, Jens Axboe wrote:
>>>> On 11/3/24 5:01 PM, Keith Busch wrote:
>>>>> On Sun, Nov 03, 2024 at 04:53:27PM -0700, Jens Axboe wrote:
>>>>>> On 11/3/24 4:47 PM, Andrew Marshall wrote:
>>>>>>> I identified f4ce3b5d26ce149e77e6b8e8f2058aa80e5b034e as the likely
>>>>>>> problematic commit simply by browsing git log. As indicated above;
>>>>>>> reverting that atop 6.6.59 results in success. Since it is passing on
>>>>>>> 6.11.6, I suspect there is some missing backport to 6.6.x, or some
>>>>>>> other semantic merge conflict. Unfortunately I do not have a compact,
>>>>>>> minimal reproducer, but can provide my large one (it is testing a
>>>>>>> larger build process in a VM) if needed; there are some additional
>>>>>>> details in the above-linked downstream bug report, though. I hope that
>>>>>>> having identified the problematic commit is enough for someone with
>>>>>>> more context to go off of. Happy to provide more information if
>>>>>>> needed.
>>>>>>
>>>>>> Don't worry about not having a reproducer, having the backport commit
>>>>>> pin pointed will do just fine. I'll take a look at this.
>>>>>
>>>>> I think stable is missing:
>>>>>
>>>>>   6b231248e97fc3 ("io_uring: consolidate overflow flushing")
>>>>
>>>> I think you need to go back further than that, this one already
>>>> unconditionally holds ->uring_lock around overflow flushing...
>>>
>>> Took a look, it's this one:
>>>
>>> commit 8d09a88ef9d3cb7d21d45c39b7b7c31298d23998
>>> Author: Pavel Begunkov <[email protected]>
>>> Date:   Wed Apr 10 02:26:54 2024 +0100
>>>
>>>     io_uring: always lock __io_cqring_overflow_flush
>>>
>>> Greg/stable, can you pick this one for 6.6-stable? It picks
>>> cleanly.
>>>
>>> For 6.1, which is the other stable of that age that has the backport,
>>> the attached patch will do the trick.
>>>
>>> With that, I believe it should be sorted. Hopefully that can make
>>> 6.6.60 and 6.1.116.
>>>
>>> -- 
>>> Jens Axboe
>>> Attachments:
>>> * 0001-io_uring-always-lock-__io_cqring_overflow_flush.patch
>>
>> Cherry-picking 6b231248e97fc3 onto 6.6.59, I can confirm it passes my 
>> reproducer (run a few times). Your first quick patch also passed, for 
>> what it’s worth. Thanks for the quick responses!
> 
> Correction: I cherry-picked and tested
> 8d09a88ef9d3cb7d21d45c39b7b7c31298d23998 (which was the change you
> identified), not 6b231248e97fc3. Apologies for any confusion.

Thanks for clarifying, so it's as expected. Hopefully -stable can pick
this backport up soonish, so the next stable release will be sorted.
Thanks for reporting the issue!

-- 
Jens Axboe

* Re: Stable backport (was "Re: PROBLEM: io_uring hang causing uninterruptible sleep state on 6.6.59")
  2024-11-04  2:38       ` Stable backport (was "Re: PROBLEM: io_uring hang causing uninterruptible sleep state on 6.6.59") Jens Axboe
  2024-11-04  4:25         ` Andrew Marshall
@ 2024-11-06  6:05         ` Greg Kroah-Hartman
  2024-11-06 14:11           ` Jens Axboe
  1 sibling, 1 reply; 11+ messages in thread
From: Greg Kroah-Hartman @ 2024-11-06  6:05 UTC
  To: Jens Axboe; +Cc: Keith Busch, Andrew Marshall, io-uring, stable

On Sun, Nov 03, 2024 at 07:38:30PM -0700, Jens Axboe wrote:
> On 11/3/24 5:06 PM, Jens Axboe wrote:
> > On 11/3/24 5:01 PM, Keith Busch wrote:
> >> On Sun, Nov 03, 2024 at 04:53:27PM -0700, Jens Axboe wrote:
> >>> On 11/3/24 4:47 PM, Andrew Marshall wrote:
> >>>> I identified f4ce3b5d26ce149e77e6b8e8f2058aa80e5b034e as the likely
> >>>> problematic commit simply by browsing git log. As indicated above;
> >>>> reverting that atop 6.6.59 results in success. Since it is passing on
> >>>> 6.11.6, I suspect there is some missing backport to 6.6.x, or some
> >>>> other semantic merge conflict. Unfortunately I do not have a compact,
> >>>> minimal reproducer, but can provide my large one (it is testing a
> >>>> larger build process in a VM) if needed; there are some additional
> >>>> details in the above-linked downstream bug report, though. I hope that
> >>>> having identified the problematic commit is enough for someone with
> >>>> more context to go off of. Happy to provide more information if
> >>>> needed.
> >>>
> >>> Don't worry about not having a reproducer, having the backport commit
> >>> pin pointed will do just fine. I'll take a look at this.
> >>
> >> I think stable is missing:
> >>
> >>   6b231248e97fc3 ("io_uring: consolidate overflow flushing")
> > 
> > I think you need to go back further than that, this one already
> > unconditionally holds ->uring_lock around overflow flushing...
> 
> Took a look, it's this one:
> 
> commit 8d09a88ef9d3cb7d21d45c39b7b7c31298d23998
> Author: Pavel Begunkov <[email protected]>
> Date:   Wed Apr 10 02:26:54 2024 +0100
> 
>     io_uring: always lock __io_cqring_overflow_flush
> 
> Greg/stable, can you pick this one for 6.6-stable? It picks
> cleanly.
> 
> For 6.1, which is the other stable of that age that has the backport,
> the attached patch will do the trick.
> 
> With that, I believe it should be sorted. Hopefully that can make
> 6.6.60 and 6.1.116.

Now queued up, thanks.

greg k-h

* Re: Stable backport (was "Re: PROBLEM: io_uring hang causing uninterruptible sleep state on 6.6.59")
  2024-11-06  6:05         ` Greg Kroah-Hartman
@ 2024-11-06 14:11           ` Jens Axboe
  0 siblings, 0 replies; 11+ messages in thread
From: Jens Axboe @ 2024-11-06 14:11 UTC
  To: Greg Kroah-Hartman; +Cc: Keith Busch, Andrew Marshall, io-uring, stable

On 11/5/24 11:05 PM, Greg Kroah-Hartman wrote:
> On Sun, Nov 03, 2024 at 07:38:30PM -0700, Jens Axboe wrote:
>> On 11/3/24 5:06 PM, Jens Axboe wrote:
>>> On 11/3/24 5:01 PM, Keith Busch wrote:
>>>> On Sun, Nov 03, 2024 at 04:53:27PM -0700, Jens Axboe wrote:
>>>>> On 11/3/24 4:47 PM, Andrew Marshall wrote:
>>>>>> I identified f4ce3b5d26ce149e77e6b8e8f2058aa80e5b034e as the likely
>>>>>> problematic commit simply by browsing git log. As indicated above;
>>>>>> reverting that atop 6.6.59 results in success. Since it is passing on
>>>>>> 6.11.6, I suspect there is some missing backport to 6.6.x, or some
>>>>>> other semantic merge conflict. Unfortunately I do not have a compact,
>>>>>> minimal reproducer, but can provide my large one (it is testing a
>>>>>> larger build process in a VM) if needed; there are some additional
>>>>>> details in the above-linked downstream bug report, though. I hope that
>>>>>> having identified the problematic commit is enough for someone with
>>>>>> more context to go off of. Happy to provide more information if
>>>>>> needed.
>>>>>
>>>>> Don't worry about not having a reproducer, having the backport commit
>>>>> pin pointed will do just fine. I'll take a look at this.
>>>>
>>>> I think stable is missing:
>>>>
>>>>   6b231248e97fc3 ("io_uring: consolidate overflow flushing")
>>>
>>> I think you need to go back further than that, this one already
>>> unconditionally holds ->uring_lock around overflow flushing...
>>
>> Took a look, it's this one:
>>
>> commit 8d09a88ef9d3cb7d21d45c39b7b7c31298d23998
>> Author: Pavel Begunkov <[email protected]>
>> Date:   Wed Apr 10 02:26:54 2024 +0100
>>
>>     io_uring: always lock __io_cqring_overflow_flush
>>
>> Greg/stable, can you pick this one for 6.6-stable? It picks
>> cleanly.
>>
>> For 6.1, which is the other stable of that age that has the backport,
>> the attached patch will do the trick.
>>
>> With that, I believe it should be sorted. Hopefully that can make
>> 6.6.60 and 6.1.116.
> 
> Now queued up, thanks.

Thanks Greg!

-- 
Jens Axboe
