public inbox for [email protected]
 help / color / mirror / Atom feed
* Re: [bug report] WARNING: CPU: 5 PID: 679 at io_uring/io_uring.c:2835 io_ring_exit_work+0x2b6/0x2e0
       [not found]         ` <[email protected]>
@ 2024-04-16 12:14           ` Pavel Begunkov
  2024-04-16 12:24             ` Jens Axboe
  0 siblings, 1 reply; 5+ messages in thread
From: Pavel Begunkov @ 2024-04-16 12:14 UTC (permalink / raw)
  To: Jens Axboe, Ming Lei, Changhui Zhong; +Cc: Linux Block Devices, io-uring

On 4/16/24 12:38, Jens Axboe wrote:
> On 4/16/24 4:00 AM, Ming Lei wrote:
>> On Tue, Apr 16, 2024 at 10:26:16AM +0800, Changhui Zhong wrote:
>>>>>
>>>>> I can't reproduce this here, fwiw. Ming, something you've seen?
>>>>
>>>> I just tested against the latest for-next/block(-rc4 based), and still can't
>>>> reproduce it. There was such an RH internal report before, and maybe it's not
>>>> ublk related.
>>>>
>>>> Changhui, if the issue can be reproduced in your machine, care to share
>>>> your machine for me to investigate a bit?
>>>>
>>>> Thanks,
>>>> Ming
>>>>
>>>
>>> I can still reproduce this issue on my machine,
>>> and I shared the machine with Ming; he can do more investigation for this issue.
>>>
>>> [ 1244.207092] running generic/006
>>> [ 1246.456896] blk_print_req_error: 77 callbacks suppressed
>>> [ 1246.456907] I/O error, dev ublkb1, sector 2395864 op 0x1:(WRITE)
>>> flags 0x8800 phys_seg 1 prio class 0
>>
>> The failure is actually triggered when recovering the qcow2 target in generic/005,
>> since ublkb0 isn't removed successfully in generic/005.
>>
>> git-bisect shows that the 1st bad commit is cca6571381a0 ("io_uring/rw:
>> cleanup retry path").
>>
>> And I don't see any issue on the uring command side, so the trouble seems
>> to be on the normal io_uring rw side over an XFS file, and not related to
>> the block device.
> 
> Indeed, I can reproduce it on XFS as well. I'll take a look.

Looking at this patch, that io_rw_should_reissue() path is for when
we failed via the kiocb callback but came there off of the submission
path, so when we unwind back it finds the flag, preps and resubmits
the req. If that's not the case but we still return "true", it'd leak
the request, which would explain why exit_work gets stuck.
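
For illustration, here is a rough sketch of the control flow being
described; the names (my_req, my_should_reissue(), MY_REQ_F_REISSUE)
are made up and this is not the actual io_uring code:

static void rw_complete_cb(struct my_req *req, int res)
{
        if (res < 0 && my_should_reissue(req)) {
                /* Only safe if the submission path is still on the
                 * stack: on unwind it sees the flag, preps the request
                 * again and resubmits it.  If we got here from anywhere
                 * else, nothing ever picks the request back up -- it is
                 * leaked, and io_ring_exit_work() then waits on its
                 * reference forever. */
                req->flags |= MY_REQ_F_REISSUE;
                return;
        }
        my_req_complete(req, res);
}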

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [bug report] WARNING: CPU: 5 PID: 679 at io_uring/io_uring.c:2835 io_ring_exit_work+0x2b6/0x2e0
  2024-04-16 12:14           ` [bug report] WARNING: CPU: 5 PID: 679 at io_uring/io_uring.c:2835 io_ring_exit_work+0x2b6/0x2e0 Pavel Begunkov
@ 2024-04-16 12:24             ` Jens Axboe
  2024-04-16 12:40               ` Pavel Begunkov
  0 siblings, 1 reply; 5+ messages in thread
From: Jens Axboe @ 2024-04-16 12:24 UTC (permalink / raw)
  To: Pavel Begunkov, Ming Lei, Changhui Zhong; +Cc: Linux Block Devices, io-uring

On 4/16/24 6:14 AM, Pavel Begunkov wrote:
> On 4/16/24 12:38, Jens Axboe wrote:
>> On 4/16/24 4:00 AM, Ming Lei wrote:
>>> On Tue, Apr 16, 2024 at 10:26:16AM +0800, Changhui Zhong wrote:
>>>>>>
>>>>>> I can't reproduce this here, fwiw. Ming, something you've seen?
>>>>>
>>>>> I just tested against the latest for-next/block(-rc4 based), and still can't
>>>>> reproduce it. There was such an RH internal report before, and maybe it's not
>>>>> ublk related.
>>>>>
>>>>> Changhui, if the issue can be reproduced in your machine, care to share
>>>>> your machine for me to investigate a bit?
>>>>>
>>>>> Thanks,
>>>>> Ming
>>>>>
>>>>
>>>> I can still reproduce this issue on my machine,
>>>> and I shared the machine with Ming; he can do more investigation for this issue.
>>>>
>>>> [ 1244.207092] running generic/006
>>>> [ 1246.456896] blk_print_req_error: 77 callbacks suppressed
>>>> [ 1246.456907] I/O error, dev ublkb1, sector 2395864 op 0x1:(WRITE)
>>>> flags 0x8800 phys_seg 1 prio class 0
>>>
>>> The failure is actually triggered when recovering the qcow2 target in generic/005,
>>> since ublkb0 isn't removed successfully in generic/005.
>>>
>>> git-bisect shows that the 1st bad commit is cca6571381a0 ("io_uring/rw:
>>> cleanup retry path").
>>>
>>> And I don't see any issue on the uring command side, so the trouble seems
>>> to be on the normal io_uring rw side over an XFS file, and not related to
>>> the block device.
>>
>> Indeed, I can reproduce it on XFS as well. I'll take a look.
> 
> Looking at this patch, that io_rw_should_reissue() path is for when
> we failed via the kiocb callback but came there off of the submission
> path, so when we unwind back it finds the flag, preps and resubmits
> the req. If that's not the case but we still return "true", it'd leak
> the request, which would explain why exit_work gets stuck.

Yep, this is what happens. I have a test patch that just punts any
reissue to task_work, and it'll insert to io-wq from there. Before, we would
fail it even though we didn't have to, but that check was killed, and
then it just lingers for this case and it's lost.
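
A hedged sketch of that approach, with hypothetical helpers
(my_task_work_add(), my_queue_iowq()) rather than the actual test
patch: the completion callback always defers the reissue to task_work,
and the task_work handler hands the request to io-wq for the retry.

static void rw_reissue_tw(struct my_req *req)
{
        /* Runs later in task context, so it is safe to queue the
         * request to the io-wq worker pool for a blocking retry. */
        my_queue_iowq(req);
}

static void rw_complete_cb(struct my_req *req, int res)
{
        if (res < 0 && my_should_reissue(req)) {
                /* Don't rely on the submission path unwinding and
                 * resubmitting; always defer the retry instead. */
                my_task_work_add(req, rw_reissue_tw);
                return;
        }
        my_req_complete(req, res);
}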

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [bug report] WARNING: CPU: 5 PID: 679 at io_uring/io_uring.c:2835 io_ring_exit_work+0x2b6/0x2e0
  2024-04-16 12:24             ` Jens Axboe
@ 2024-04-16 12:40               ` Pavel Begunkov
  2024-04-16 12:51                 ` Jens Axboe
  0 siblings, 1 reply; 5+ messages in thread
From: Pavel Begunkov @ 2024-04-16 12:40 UTC (permalink / raw)
  To: Jens Axboe, Ming Lei, Changhui Zhong; +Cc: Linux Block Devices, io-uring

On 4/16/24 13:24, Jens Axboe wrote:
> On 4/16/24 6:14 AM, Pavel Begunkov wrote:
>> On 4/16/24 12:38, Jens Axboe wrote:
>>> On 4/16/24 4:00 AM, Ming Lei wrote:
>>>> On Tue, Apr 16, 2024 at 10:26:16AM +0800, Changhui Zhong wrote:
>>>>>>>
>>>>>>> I can't reproduce this here, fwiw. Ming, something you've seen?
>>>>>>
>>>>>> I just tested against the latest for-next/block(-rc4 based), and still can't
>>>>>> reproduce it. There was such an RH internal report before, and maybe it's not
>>>>>> ublk related.
>>>>>>
>>>>>> Changhui, if the issue can be reproduced in your machine, care to share
>>>>>> your machine for me to investigate a bit?
>>>>>>
>>>>>> Thanks,
>>>>>> Ming
>>>>>>
>>>>>
>>>>> I can still reproduce this issue on my machine,
>>>>> and I shared the machine with Ming; he can do more investigation for this issue.
>>>>>
>>>>> [ 1244.207092] running generic/006
>>>>> [ 1246.456896] blk_print_req_error: 77 callbacks suppressed
>>>>> [ 1246.456907] I/O error, dev ublkb1, sector 2395864 op 0x1:(WRITE)
>>>>> flags 0x8800 phys_seg 1 prio class 0
>>>>
>>>> The failure is actually triggered when recovering the qcow2 target in generic/005,
>>>> since ublkb0 isn't removed successfully in generic/005.
>>>>
>>>> git-bisect shows that the 1st bad commit is cca6571381a0 ("io_uring/rw:
>>>> cleanup retry path").
>>>>
>>>> And I don't see any issue on the uring command side, so the trouble seems
>>>> to be on the normal io_uring rw side over an XFS file, and not related to
>>>> the block device.
>>>
>>> Indeed, I can reproduce it on XFS as well. I'll take a look.
>>
>> Looking at this patch, that io_rw_should_reissue() path is for when
>> we failed via the kiocb callback but came there off of the submission
>> path, so when we unwind back it finds the flag, preps and resubmits
>> the req. If that's not the case but we still return "true", it'd leak
>> the request, which would explain why exit_work gets stuck.
> 
> Yep, this is what happens. I have a test patch that just punts any
> reissue to task_work, and it'll insert to io-wq from there. Before, we would
> fail it even though we didn't have to, but that check was killed, and
> then it just lingers for this case and it's lost.

Sounds good, but let me note that while unwinding, block/fs/etc
could try to revert the iter, so it might not be safe to initiate
async IO from the callback as is

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [bug report] WARNING: CPU: 5 PID: 679 at io_uring/io_uring.c:2835 io_ring_exit_work+0x2b6/0x2e0
  2024-04-16 12:40               ` Pavel Begunkov
@ 2024-04-16 12:51                 ` Jens Axboe
  2024-04-16 13:08                   ` Pavel Begunkov
  0 siblings, 1 reply; 5+ messages in thread
From: Jens Axboe @ 2024-04-16 12:51 UTC (permalink / raw)
  To: Pavel Begunkov, Ming Lei, Changhui Zhong; +Cc: Linux Block Devices, io-uring

On 4/16/24 6:40 AM, Pavel Begunkov wrote:
> On 4/16/24 13:24, Jens Axboe wrote:
>> On 4/16/24 6:14 AM, Pavel Begunkov wrote:
>>> On 4/16/24 12:38, Jens Axboe wrote:
>>>> On 4/16/24 4:00 AM, Ming Lei wrote:
>>>>> On Tue, Apr 16, 2024 at 10:26:16AM +0800, Changhui Zhong wrote:
>>>>>>>>
>>>>>>>> I can't reproduce this here, fwiw. Ming, something you've seen?
>>>>>>>
>>>>>>> I just tested against the latest for-next/block(-rc4 based), and still can't
>>>>>>> reproduce it. There was such an RH internal report before, and maybe it's not
>>>>>>> ublk related.
>>>>>>>
>>>>>>> Changhui, if the issue can be reproduced in your machine, care to share
>>>>>>> your machine for me to investigate a bit?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Ming
>>>>>>>
>>>>>>
>>>>>> I can still reproduce this issue on my machine,
>>>>>> and I shared the machine with Ming; he can do more investigation for this issue.
>>>>>>
>>>>>> [ 1244.207092] running generic/006
>>>>>> [ 1246.456896] blk_print_req_error: 77 callbacks suppressed
>>>>>> [ 1246.456907] I/O error, dev ublkb1, sector 2395864 op 0x1:(WRITE)
>>>>>> flags 0x8800 phys_seg 1 prio class 0
>>>>>
>>>>> The failure is actually triggered when recovering the qcow2 target in generic/005,
>>>>> since ublkb0 isn't removed successfully in generic/005.
>>>>>
>>>>> git-bisect shows that the 1st bad commit is cca6571381a0 ("io_uring/rw:
>>>>> cleanup retry path").
>>>>>
>>>>> And I don't see any issue on the uring command side, so the trouble seems
>>>>> to be on the normal io_uring rw side over an XFS file, and not related to
>>>>> the block device.
>>>>
>>>> Indeed, I can reproduce it on XFS as well. I'll take a look.
>>>
>>> Looking at this patch, that io_rw_should_reissue() path is for when
>>> we failed via the kiocb callback but came there off of the submission
>>> path, so when we unwind back it finds the flag, preps and resubmits
>>> the req. If that's not the case but we still return "true", it'd leak
>>> the request, which would explain why exit_work gets stuck.
>>
>> Yep, this is what happens. I have a test patch that just punts any
>> reissue to task_work, and it'll insert to io-wq from there. Before, we would
>> fail it even though we didn't have to, but that check was killed, and
>> then it just lingers for this case and it's lost.
> 
> Sounds good, but let me note that while unwinding, block/fs/etc
> could try to revert the iter, so it might not be safe to initiate
> async IO from the callback as is

Good point, we may just want to do the iov iter revert before sending it
to io-wq for retry. Seems prudent, and can't hurt.
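
As a minimal sketch of that (assumed fields orig_count and iter on a
hypothetical request type; iov_iter_revert() and iov_iter_count() are
the real helpers):

static void rw_punt_to_iowq(struct my_req *req)
{
        struct iov_iter *iter = &req->iter;

        /* Put the iter back to its pre-submission state so the io-wq
         * retry starts clean.  This assumes the iter was only advanced,
         * not also truncated -- which is exactly the subtlety raised in
         * the follow-up below. */
        iov_iter_revert(iter, req->orig_count - iov_iter_count(iter));

        my_queue_iowq(req);
}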

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [bug report] WARNING: CPU: 5 PID: 679 at io_uring/io_uring.c:2835 io_ring_exit_work+0x2b6/0x2e0
  2024-04-16 12:51                 ` Jens Axboe
@ 2024-04-16 13:08                   ` Pavel Begunkov
  0 siblings, 0 replies; 5+ messages in thread
From: Pavel Begunkov @ 2024-04-16 13:08 UTC (permalink / raw)
  To: Jens Axboe, Ming Lei, Changhui Zhong; +Cc: Linux Block Devices, io-uring

On 4/16/24 13:51, Jens Axboe wrote:
> On 4/16/24 6:40 AM, Pavel Begunkov wrote:
>> On 4/16/24 13:24, Jens Axboe wrote:
>>> On 4/16/24 6:14 AM, Pavel Begunkov wrote:
>>>> On 4/16/24 12:38, Jens Axboe wrote:
>>>>> On 4/16/24 4:00 AM, Ming Lei wrote:
>>>>>> On Tue, Apr 16, 2024 at 10:26:16AM +0800, Changhui Zhong wrote:
>>>>>>>>>
>>>>>>>>> I can't reproduce this here, fwiw. Ming, something you've seen?
>>>>>>>>
>>>>>>>> I just tested against the latest for-next/block(-rc4 based), and still can't
>>>>>>>> reproduce it. There was such an RH internal report before, and maybe it's not
>>>>>>>> ublk related.
>>>>>>>>
>>>>>>>> Changhui, if the issue can be reproduced in your machine, care to share
>>>>>>>> your machine for me to investigate a bit?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Ming
>>>>>>>>
>>>>>>>
>>>>>>> I can still reproduce this issue on my machine,
>>>>>>> and I shared the machine with Ming; he can do more investigation for this issue.
>>>>>>>
>>>>>>> [ 1244.207092] running generic/006
>>>>>>> [ 1246.456896] blk_print_req_error: 77 callbacks suppressed
>>>>>>> [ 1246.456907] I/O error, dev ublkb1, sector 2395864 op 0x1:(WRITE)
>>>>>>> flags 0x8800 phys_seg 1 prio class 0
>>>>>>
>>>>>> The failure is actually triggered when recovering the qcow2 target in generic/005,
>>>>>> since ublkb0 isn't removed successfully in generic/005.
>>>>>>
>>>>>> git-bisect shows that the 1st bad commit is cca6571381a0 ("io_uring/rw:
>>>>>> cleanup retry path").
>>>>>>
>>>>>> And I don't see any issue on the uring command side, so the trouble seems
>>>>>> to be on the normal io_uring rw side over an XFS file, and not related to
>>>>>> the block device.
>>>>>
>>>>> Indeed, I can reproduce it on XFS as well. I'll take a look.
>>>>
>>>> Looking at this patch, that io_rw_should_reissue() path is for when
>>>> we failed via the kiocb callback but came there off of the submission
>>>> path, so when we unwind back it finds the flag, preps and resubmits
>>>> the req. If that's not the case but we still return "true", it'd leak
>>>> the request, which would explain why exit_work gets stuck.
>>>
>>> Yep, this is what happens. I have a test patch that just punts any
>>> reissue to task_work, and it'll insert to io-wq from there. Before, we would
>>> fail it even though we didn't have to, but that check was killed, and
>>> then it just lingers for this case and it's lost.
>>
>> Sounds good, but let me note that while unwinding, block/fs/etc
>> could try to revert the iter, so it might not be safe to initiate
>> async IO from the callback as is
> 
> Good point, we may just want to do the iov iter revert before sending it
> to io-wq for retry. Seems prudent, and can't hurt.

To be more precise, the case I worry about is like this:

{fs,block}_read_iter()          |
-> iter_truncate();             |
-> kiocb->callback();           |
--> restore iter                |
--> queue async IO              |
                                 | start IO async()
                                 | -> or restore iter here
                                 | -> iter_truncate() / etc.
-> iter_reexpand() // unwind    |

At the iter_reexpand(), it's already re-initialised, and
re-expanding it would likely corrupt it.
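
The same hazard in code form, as a hedged sketch: fs_read_iter() and
do_read() are hypothetical, while iov_iter_truncate() and
iov_iter_reexpand() are the real helpers.

static ssize_t fs_read_iter(struct kiocb *kiocb, struct iov_iter *iter)
{
        size_t old_count = iov_iter_count(iter);
        ssize_t ret;

        /* fs/block limits this attempt to a shorter length */
        iov_iter_truncate(iter, old_count / 2);

        ret = do_read(kiocb, iter);
        if (ret < 0) {
                /* The completion callback may restore the iter and
                 * kick off an async retry right here ... */
                kiocb->ki_complete(kiocb, ret);
        }

        /* ... but this unwind path still re-expands the iter.  If the
         * callback (or the async retry) has already re-initialised it,
         * the reexpand is applied to the fresh state and corrupts it. */
        iov_iter_reexpand(iter, old_count);
        return ret;
}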

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2024-04-16 13:08 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CAGVVp+WzC1yKiLHf8z0PnNWutse7BgY9HuwgQwwsvT4UYbUZXQ@mail.gmail.com>
     [not found] ` <[email protected]>
     [not found]   ` <Zh3TjqD1763LzXUj@fedora>
     [not found]     ` <CAGVVp+X81FhOHH0E3BwcsVBYsAAOoAPXpTX5D_BbRH4jqjeTJg@mail.gmail.com>
     [not found]       ` <Zh5MSQVk54tN7Xx4@fedora>
     [not found]         ` <[email protected]>
2024-04-16 12:14           ` [bug report] WARNING: CPU: 5 PID: 679 at io_uring/io_uring.c:2835 io_ring_exit_work+0x2b6/0x2e0 Pavel Begunkov
2024-04-16 12:24             ` Jens Axboe
2024-04-16 12:40               ` Pavel Begunkov
2024-04-16 12:51                 ` Jens Axboe
2024-04-16 13:08                   ` Pavel Begunkov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox