* Re: [PATCH 1/1] block: Manage bio references so the bio persists until necessary
2020-02-04 20:59 ` Jens Axboe
@ 2020-02-04 22:41 ` Bijan Mottahedeh
2020-02-24 23:32 ` Bijan Mottahedeh
1 sibling, 0 replies; 10+ messages in thread
From: Bijan Mottahedeh @ 2020-02-04 22:41 UTC (permalink / raw)
To: Jens Axboe, Christoph Hellwig; +Cc: linux-block, io-uring
On 2/4/2020 12:59 PM, Jens Axboe wrote:
> On 2/4/20 12:51 AM, Christoph Hellwig wrote:
>> On Mon, Feb 03, 2020 at 01:07:48PM -0800, Bijan Mottahedeh wrote:
>>> My concern is with the code below for the single bio async case:
>>>
>>> qc = submit_bio(bio);
>>>
>>> if (polled)
>>> WRITE_ONCE(iocb->ki_cookie, qc);
>>>
>>> The bio/dio can be freed before the cookie is written, which is what I'm
>>> seeing, and I thought this may lead to a scenario where that iocb request
>>> could be completed, freed, reallocated, and resubmitted in io_uring layer;
>>> i.e., I thought the cookie could be written into the wrong iocb.
>> I think we do have a potential use after free of the iocb here.
>> But taking a bio reference isn't going to help with that, as the iocb
>> and bio/dio life times are unrelated.
>>
>> I vaguely remember having that discussion with Jens a while ago, and
>> tried to pass a pointer to the qc to submit_bio so that we can set
>> it at submission time, but he came up with a reason why that might not
>> be required. I'd have to dig out all notes unless Jens remembers
>> better.
> Don't remember that either, so I'd have to dig out emails! But looking
> at it now, for the async case with io_uring, the iocb is embedded in the
> io_kiocb from io_uring. We hold two references to the io_kiocb, one for
> submit and one for completion. Hence even if the bio completes
> immediately and someone else finds the completion before the application
> doing this submit, we still hold the submission reference to the
> io_kiocb. Hence I don't really see how we can end up with a
> use-after-free situation here.
>
> IIRC, Bijan had traces showing this can happen, KASAN complaining about
> it. Which makes me think that I'm missing a case here, though I don't
> immediately see what it is.
>
> Bijan, could you post your trace again? I can't seem to find it.
>
Trace included below; note the two different warnings:
[ 90.785364] WARNING: CPU: 3 PID: 4731 at lib/refcount.c:28
refcount_warn_saturate+0xa1/0xf0
[ 90.838168] WARNING: CPU: 2 PID: 4714 at kernel/dma/debug.c:1051
check_unmap+0x459/0x870
[ 90.959286] WARNING: CPU: 8 PID: 4716 at kernel/dma/debug.c:1014
check_unmap+0x145/0x870
This is the bug I was originally chasing when I suspected the single-bio
async code above, but it doesn't seem to be related.
It happens pretty reliably, though not necessarily with both warnings
at the same time. Let me know if you want me to tweak anything.
--bijan
-----------------------------------------------------------------------------------------
[ 90.774034] nvme 0000:00:04.0: dma_pool_free prp list page,
000000008f36ce00 (bad vaddr)/0x00000000830ae000
[ 90.780098] ------------[ cut here ]------------
[ 90.782388] refcount_t: underflow; use-after-free.
[ 90.785364] WARNING: CPU: 3 PID: 4731 at lib/refcount.c:28
refcount_warn_saturate+0xa1/0xf0
[ 90.788181] Modules linked in: xfs dm_mod sr_mod cdrom sd_mod
crc32c_intel nvme ata_piix nvme_core e1000 t10_pi libata virtio_pci
9pnet_virtio virtio_ring virtio 9p 9pnet
[ 90.794794] CPU: 3 PID: 4731 Comm: fio Not tainted
5.5.0-next-20200203-bij #610
[ 90.797396] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
[ 90.801519] RIP: 0010:refcount_warn_saturate+0xa1/0xf0
[ 90.803549] Code: 4c 28 01 01 e8 d0 8f c4 ff 0f 0b c3 80 3d 14 4c 28
01 00 75 59 48 c7 c7 68 15 17 82 31 c0 c6 05 02 4c 28 01 01 e8 af 8f c4
ff <0f> 0b c3 80 3d f2 4b 28 01 00 75 38 48 c7 c7 90 15 17 82 31 c0 c6
[ 90.812441] RSP: 0018:ffffc900024d3cb8 EFLAGS: 00010286
[ 90.813911] RAX: 0000000000000000 RBX: ffff888226f2b700 RCX:
0000000000000000
[ 90.816595] RDX: ffff888233bfab40 RSI: ffff888233bea408 RDI:
ffff888233bea408
[ 90.819218] RBP: ffff8882286d6e78 R08: ffff88822cba30c0 R09:
00000000fffffff0
[ 90.822529] R10: 0000000000000001 R11: 0000000000000000 R12:
ffffe8fffd600580
[ 90.825082] R13: ffff888226f2b700 R14: ffff8882286d6e78 R15:
ffff888228abda80
[ 90.827679] FS: 00007fbecf69a740(0000) GS:ffff888233a00000(0000)
knlGS:0000000000000000
[ 90.829474] ------------[ cut here ]------------
[ 90.830949] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 90.830953] CR2: 00007f1037f54000 CR3: 00000001f46b4001 CR4:
0000000000160ee0
[ 90.832778] DMA-API: nvme 0000:00:04.0: device driver frees DMA sg
list with different entry count [map count=2] [unmap count=1]
[ 90.834876] Call Trace:
[ 90.838168] WARNING: CPU: 2 PID: 4714 at kernel/dma/debug.c:1051
check_unmap+0x459/0x870
[ 90.843447] blk_mq_complete_request+0x47/0xf0
[ 90.844292] Modules linked in: xfs dm_mod sr_mod cdrom sd_mod
crc32c_intel nvme ata_piix nvme_core e1000 t10_pi libata virtio_pci
9pnet_virtio virtio_ring virtio 9p 9pnet
[ 90.848498] nvme_poll+0x23a/0x2c0 [nvme]
[ 90.850475] CPU: 2 PID: 4714 Comm: fio Not tainted
5.5.0-next-20200203-bij #610
[ 90.857373] blk_poll+0x24f/0x320
[ 90.858513] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
[ 90.858519] RIP: 0010:check_unmap+0x459/0x870
[ 90.860929] ? __lock_acquire.isra.27+0x526/0x6c0
[ 90.862191] Code: d1 03 00 00 4c 8b 37 e9 c9 03 00 00 49 c7 c6 f3 8f
14 82 4c 89 f6 44 89 f9 4c 89 f2 48 c7 c7 f8 8c 12 82 31 c0 e8 57 2c f7
ff <0f> 0b 48 c7 c7 9c 84 12 82 31 c0 e8 cb 73 fe ff 8b 73 4c 48 8d 7b
[ 90.868222] ? find_held_lock+0x38/0xa0
[ 90.871415] RSP: 0018:ffffc9000158bad8 EFLAGS: 00010086
[ 90.874370] io_iopoll_getevents+0x11c/0x2e0
[ 90.882129] RAX: 0000000000000000 RBX: ffff8882311afe80 RCX:
0000000000000001
[ 90.882132] RDX: 0000000000000001 RSI: 00000000d97dcdbe RDI:
ffff8882337ea400
[ 90.885032] ? kvm_sched_clock_read+0xd/0x20
[ 90.887837] RBP: ffffc9000158bb40 R08: ffff8881fe1bd5b8 R09:
00000000b4d5c0f0
[ 90.889983] ? __x64_sys_io_uring_enter+0x1fb/0x420
[ 90.892835] R10: 0000000000000001 R11: 0000000000000000 R12:
ffffffff834f6bd8
[ 90.896551] ? __x64_sys_io_uring_enter+0x259/0x420
[ 90.898401] R13: 0000000000000282 R14: ffff8882307b82a0 R15:
0000000000000002
[ 90.898405] FS: 00007fbecf69a740(0000) GS:ffff888233600000(0000)
knlGS:0000000000000000
[ 90.901417] __io_iopoll_check+0x75/0xa0
[ 90.903627] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 90.903629] CR2: 00007f4f8806b1f8 CR3: 00000001f89de005 CR4:
0000000000160ee0
[ 90.906273] __x64_sys_io_uring_enter+0x269/0x420
[ 90.909268] Call Trace:
[ 90.911805] ? __x64_sys_io_uring_enter+0x83/0x420
[ 90.911818] do_syscall_64+0x63/0x1a0
[ 90.915022] debug_dma_unmap_sg+0xff/0x140
[ 90.916977] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 90.918894] ? kvm_sched_clock_read+0xd/0x20
[ 90.921652] RIP: 0033:0x7fbece5e62cd
[ 90.923601] ? __lock_acquire.isra.27+0x526/0x6c0
[ 90.925060] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa
48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f
05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 8b 6b 2c 00 f7 d8 64 89 01 48
[ 90.927287] nvme_unmap_data+0x109/0x250 [nvme]
[ 90.929843] RSP: 002b:00007ffd6a00b8b8 EFLAGS: 00000246 ORIG_RAX:
00000000000001aa
[ 90.932078] nvme_pci_complete_rq+0xda/0xf0 [nvme]
[ 90.933924] RAX: ffffffffffffffda RBX: 00007fbead343700 RCX:
00007fbece5e62cd
[ 90.933926] RDX: 0000000000000000 RSI: 0000000000000001 RDI:
0000000000000003
[ 90.935992] blk_mq_complete_request+0x47/0xf0
[ 90.937856] RBP: 00007fbead343700 R08: 0000000000000000 R09:
0000000000000000
[ 90.937859] R10: 0000000000000001 R11: 0000000000000246 R12:
00000000024ce730
[ 90.941177] nvme_poll+0x23a/0x2c0 [nvme]
[ 90.949896] R13: 0000000000004329 R14: 0000000000000001 R15:
000000000250e728
[ 90.949912] ---[ end trace 6f10eae6c97ebbfe ]---
[ 90.951749] blk_poll+0x24f/0x320
[ 90.959132] nvme 0000:00:04.0: dma_pool_free prp list page,
000000005fe36fa7 (bad vaddr)/0x0000000083911000
[ 90.959246] ? __lock_acquire.isra.27+0x526/0x6c0
[ 90.959263] ------------[ cut here ]------------
[ 90.959267] DMA-API: nvme 0000:00:04.0: device driver tries to free
DMA memory it has not allocated [device address=0x6b6b6b6b6b6b6b6b]
[size=1802201963 bytes]
[ 90.959286] WARNING: CPU: 8 PID: 4716 at kernel/dma/debug.c:1014
check_unmap+0x145/0x870
[ 90.959288] Modules linked in: xfs dm_mod sr_mod cdrom sd_mod
crc32c_intel nvme ata_piix nvme_core e1000 t10_pi libata virtio_pci
9pnet_virtio virtio_ring virtio 9p 9pnet
[ 90.959305] CPU: 8 PID: 4716 Comm: fio Tainted: G W
5.5.0-next-20200203-bij #610
[ 90.959307] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
[ 90.959310] RIP: 0010:check_unmap+0x145/0x870
[ 90.959312] Code: 00 48 8b 1f e9 59 06 00 00 48 c7 c3 f3 8f 14 82 48
89 de 4d 89 e8 4c 89 e1 48 89 da 48 c7 c7 98 8a 12 82 31 c0 e8 6b 2f f7
ff <0f> 0b 8b 15 63 49 5e 01 85 d2 0f 85 08 07 00 00 8b 05 c1 b1 40 01
[ 90.959314] RSP: 0018:ffffc90002463ad8 EFLAGS: 00010286
[ 90.959316] RAX: 0000000000000000 RBX: ffff8882307b82a0 RCX:
0000000000000001
[ 90.959317] RDX: 0000000000000001 RSI: 00000000fea2a5f1 RDI:
ffff888234fea400
[ 90.959318] RBP: ffffc90002463b40 R08: ffff8881fe1bb0c0 R09:
00000000fffffff0
[ 90.959319] R10: 0000000000000001 R11: 0000000000000000 R12:
6b6b6b6b6b6b6b6b
[ 90.959320] R13: 000000006b6b6b6b R14: 0000000000000002 R15:
0000000000000001
[ 90.959322] FS: 00007fbecf69a740(0000) GS:ffff888234e00000(0000)
knlGS:0000000000000000
[ 90.959323] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 90.959324] CR2: 000055ec5f1b4f70 CR3: 00000001fa136001 CR4:
0000000000160ee0
[ 90.959329] Call Trace:
[ 90.959335] ? do_raw_spin_unlock+0x83/0x90
[ 90.959342] debug_dma_unmap_sg+0xff/0x140
[ 90.959358] ? kvm_sched_clock_read+0xd/0x20
[ 90.959363] ? __lock_acquire.isra.27+0x526/0x6c0
[ 90.959370] nvme_unmap_data+0x109/0x250 [nvme]
[ 90.959378] nvme_pci_complete_rq+0xda/0xf0 [nvme]
[ 90.959384] blk_mq_complete_request+0x47/0xf0
[ 90.959389] nvme_poll+0x23a/0x2c0 [nvme]
[ 90.959396] blk_poll+0x24f/0x320
[ 90.959399] ? __lock_acquire.isra.27+0x526/0x6c0
[ 90.959406] ? find_held_lock+0x38/0xa0
[ 90.959416] io_iopoll_getevents+0x11c/0x2e0
[ 90.959425] ? __mutex_unlock_slowpath+0x14e/0x250
[ 90.959429] ? __x64_sys_io_uring_enter+0x259/0x420
[ 90.959438] __io_iopoll_check+0x75/0xa0
[ 90.959444] __x64_sys_io_uring_enter+0x269/0x420
[ 90.959445] ? __x64_sys_io_uring_enter+0x83/0x420
[ 90.959458] do_syscall_64+0x63/0x1a0
[ 90.959462] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 90.959465] RIP: 0033:0x7fbece5e62cd
[ 90.959467] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa
48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f
05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 8b 6b 2c 00 f7 d8 64 89 01 48
[ 90.959468] RSP: 002b:00007ffd6a00b8b8 EFLAGS: 00000246 ORIG_RAX:
00000000000001aa
[ 90.959470] RAX: ffffffffffffffda RBX: 00007fbead217100 RCX:
00007fbece5e62cd
[ 90.959471] RDX: 0000000000000000 RSI: 0000000000000001 RDI:
0000000000000003
[ 90.959472] RBP: 00007fbead217100 R08: 0000000000000000 R09:
0000000000000000
[ 90.959473] R10: 0000000000000001 R11: 0000000000000246 R12:
00000000024ce5b0
[ 90.959474] R13: 000000000000465a R14: 0000000000000001 R15:
000000000252de28
[ 90.959489] ---[ end trace 6f10eae6c97ebbff ]---
[ 90.959500] general protection fault, probably for non-canonical
address 0x6b6b6b6b6b6b7b63: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
[ 90.959501] CPU: 8 PID: 4716 Comm: fio Tainted: G W
5.5.0-next-20200203-bij #610
[ 90.959502] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
[ 90.959504] RIP: 0010:nvme_unmap_data+0x1f8/0x250 [nvme]
[ 90.959505] Code: 75 0b 0f b7 83 c2 00 00 00 48 c1 e0 05 80 bb 40 01
00 00 00 48 63 cd 48 8d 14 ca 48 8b 34 02 74 09 4c 8b b6 f0 0f 00 00 eb
04 <4e> 8b 34 2e 49 8b bf 58 02 00 00 4c 89 e2 83 c5 01 e8 42 2e 15 e1
[ 90.959506] RSP: 0018:ffffc90002463c38 EFLAGS: 00010246
[ 90.959508] RAX: 0000000000000020 RBX: ffff888226f58400 RCX:
0000000000000000
[ 90.959508] RDX: ffff888083742000 RSI: 6b6b6b6b6b6b6b6b RDI:
ffff888083742000
[ 90.959509] RBP: 0000000000000000 R08: 0000000000000000 R09:
0000000000000001
[ 90.959510] R10: 0000000000000001 R11: 0000000000000000 R12:
000000011a06b000
[ 90.959510] R13: 0000000000000ff8 R14: ffff8882316020b0 R15:
ffff888228a00000
[ 90.959511] FS: 00007fbecf69a740(0000) GS:ffff888234e00000(0000)
knlGS:0000000000000000
[ 90.959512] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 90.959513] CR2: 000055ec5f1b4f70 CR3: 00000001fa136001 CR4:
0000000000160ee0
[ 90.959513] Call Trace:
[ 90.959517] nvme_pci_complete_rq+0xda/0xf0 [nvme]
[ 90.959519] blk_mq_complete_request+0x47/0xf0
[ 90.959522] nvme_poll+0x23a/0x2c0 [nvme]
[ 90.959525] blk_poll+0x24f/0x320
[ 90.959526] ? __lock_acquire.isra.27+0x526/0x6c0
[ 90.959529] ? find_held_lock+0x38/0xa0
[ 90.959532] io_iopoll_getevents+0x11c/0x2e0
[ 90.959535] ? __mutex_unlock_slowpath+0x14e/0x250
[ 90.959537] ? __x64_sys_io_uring_enter+0x259/0x420
[ 90.959540] __io_iopoll_check+0x75/0xa0
[ 90.959543] __x64_sys_io_uring_enter+0x269/0x420
[ 90.959544] ? __x64_sys_io_uring_enter+0x83/0x420
[ 90.959548] do_syscall_64+0x63/0x1a0
[ 90.959550] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 90.959551] RIP: 0033:0x7fbece5e62cd
[ 90.959552] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa
48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f
05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 8b 6b 2c 00 f7 d8 64 89 01 48
[ 90.959552] RSP: 002b:00007ffd6a00b8b8 EFLAGS: 00000246 ORIG_RAX:
00000000000001aa
[ 90.959554] RAX: ffffffffffffffda RBX: 00007fbead217100 RCX:
00007fbece5e62cd
[ 90.959554] RDX: 0000000000000000 RSI: 0000000000000001 RDI:
0000000000000003
[ 90.959566] RBP: 00007fbead217100 R08: 0000000000000000 R09:
0000000000000000
[ 90.959567] R10: 0000000000000001 R11: 0000000000000246 R12:
00000000024ce5b0
[ 90.959568] R13: 000000000000465a R14: 0000000000000001 R15:
000000000252de28
[ 90.959572] Modules linked in: xfs dm_mod sr_mod cdrom sd_mod
crc32c_intel nvme ata_piix nvme_core e1000 t10_pi libata virtio_pci
9pnet_virtio virtio_ring virtio 9p 9pnet
[ 90.959588] ---[ end trace 6f10eae6c97ebc00 ]---
[ 90.959590] RIP: 0010:nvme_unmap_data+0x1f8/0x250 [nvme]
[ 90.959591] Code: 75 0b 0f b7 83 c2 00 00 00 48 c1 e0 05 80 bb 40 01
00 00 00 48 63 cd 48 8d 14 ca 48 8b 34 02 74 09 4c 8b b6 f0 0f 00 00 eb
04 <4e> 8b 34 2e 49 8b bf 58 02 00 00 4c 89 e2 83 c5 01 e8 42 2e 15 e1
[ 90.959592] RSP: 0018:ffffc90002463c38 EFLAGS: 00010246
[ 90.959593] RAX: 0000000000000020 RBX: ffff888226f58400 RCX:
0000000000000000
[ 90.959595] RDX: ffff888083742000 RSI: 6b6b6b6b6b6b6b6b RDI:
ffff888083742000
[ 90.959595] RBP: 0000000000000000 R08: 0000000000000000 R09:
0000000000000001
[ 90.959596] R10: 0000000000000001 R11: 0000000000000000 R12:
000000011a06b000
[ 90.959597] R13: 0000000000000ff8 R14: ffff8882316020b0 R15:
ffff888228a00000
[ 90.959598] FS: 00007fbecf69a740(0000) GS:ffff888234e00000(0000)
knlGS:0000000000000000
[ 90.959599] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 90.959599] CR2: 000055ec5f1b4f70 CR3: 00000001fa136001 CR4:
0000000000160ee0
[ 90.959600] Kernel panic - not syncing: Fatal exception
[ 91.154129] ? find_held_lock+0x38/0xa0
[ 91.154817] io_iopoll_getevents+0x11c/0x2e0
[ 91.155574] ? kvm_sched_clock_read+0xd/0x20
[ 91.156335] ? __x64_sys_io_uring_enter+0x1fb/0x420
[ 91.157203] ? __x64_sys_io_uring_enter+0x259/0x420
[ 91.158075] __io_iopoll_check+0x75/0xa0
[ 91.158781] __x64_sys_io_uring_enter+0x269/0x420
[ 91.159624] ? __x64_sys_io_uring_enter+0x83/0x420
[ 91.160491] do_syscall_64+0x63/0x1a0
[ 91.161178] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 91.162090] RIP: 0033:0x7fbece5e62cd
[ 91.162742] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa
48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f
05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 8b 6b 2c 00 f7 d8 64 89 01 48
[ 91.166010] RSP: 002b:00007ffd6a00b8b8 EFLAGS: 00000246 ORIG_RAX:
00000000000001aa
[ 91.167342] RAX: ffffffffffffffda RBX: 00007fbead1e5000 RCX:
00007fbece5e62cd
[ 91.168597] RDX: 0000000000000000 RSI: 0000000000000001 RDI:
0000000000000003
[ 91.169851] RBP: 00007fbead1e5000 R08: 0000000000000000 R09:
0000000000000000
[ 91.171105] R10: 0000000000000001 R11: 0000000000000246 R12:
00000000024ce570
[ 91.172360] R13: 0000000000004689 R14: 0000000000000001 R15:
00000000024e0c68
[ 91.173618] ---[ end trace 6f10eae6c97ebc01 ]---
[ 91.174439] DMA-API: Mapped at:
[ 91.175011] debug_dma_map_sg+0x4e/0x340
[ 91.175717] nvme_queue_rq+0x41f/0xc99 [nvme]
[ 91.176495] __blk_mq_try_issue_directly+0x120/0x1f0
[ 91.177381] blk_mq_try_issue_directly+0x57/0xb0
[ 91.178206] blk_mq_make_request+0x4fe/0x650
[ 91.179024] Kernel Offset: disabled
[ 91.179671] ---[ end Kernel panic - not syncing: Fatal exception ]---
* Re: [PATCH 1/1] block: Manage bio references so the bio persists until necessary
2020-02-04 20:59 ` Jens Axboe
2020-02-04 22:41 ` Bijan Mottahedeh
@ 2020-02-24 23:32 ` Bijan Mottahedeh
1 sibling, 0 replies; 10+ messages in thread
From: Bijan Mottahedeh @ 2020-02-24 23:32 UTC (permalink / raw)
To: Jens Axboe, Christoph Hellwig; +Cc: linux-block, io-uring, MATTHEW_WILCOX
On 2/4/2020 12:59 PM, Jens Axboe wrote:
> On 2/4/20 12:51 AM, Christoph Hellwig wrote:
>> On Mon, Feb 03, 2020 at 01:07:48PM -0800, Bijan Mottahedeh wrote:
>>> My concern is with the code below for the single bio async case:
>>>
>>> qc = submit_bio(bio);
>>>
>>> if (polled)
>>> WRITE_ONCE(iocb->ki_cookie, qc);
>>>
>>> The bio/dio can be freed before the cookie is written, which is what I'm
>>> seeing, and I thought this may lead to a scenario where that iocb request
>>> could be completed, freed, reallocated, and resubmitted in io_uring layer;
>>> i.e., I thought the cookie could be written into the wrong iocb.
>> I think we do have a potential use after free of the iocb here.
>> But taking a bio reference isn't going to help with that, as the iocb
>> and bio/dio life times are unrelated.
>>
>> I vaguely remember having that discussion with Jens a while ago, and
>> tried to pass a pointer to the qc to submit_bio so that we can set
>> it at submission time, but he came up with a reason why that might not
>> be required. I'd have to dig out all notes unless Jens remembers
>> better.
> Don't remember that either, so I'd have to dig out emails! But looking
> at it now, for the async case with io_uring, the iocb is embedded in the
> io_kiocb from io_uring. We hold two references to the io_kiocb, one for
> submit and one for completion. Hence even if the bio completes
> immediately and someone else finds the completion before the application
> doing this submit, we still hold the submission reference to the
> io_kiocb. Hence I don't really see how we can end up with a
> use-after-free situation here.
>
> IIRC, Bijan had traces showing this can happen, KASAN complaining about
> it. Which makes me think that I'm missing a case here, though I don't
> immediately see what it is.
>
> Bijan, could you post your trace again? I can't seem to find it.
>
I think the problem may be in the nvme driver's handling of multiple
pollers sharing the same CQ: nvme_poll() drops cq_poll_lock before
completing the CQEs it collected with nvme_process_cq():
nvme_poll()
{
	...
	spin_lock(&nvmeq->cq_poll_lock);
	found = nvme_process_cq(nvmeq, &start, &end, -1);
	spin_unlock(&nvmeq->cq_poll_lock);

	nvme_complete_cqes(nvmeq, start, end);
	...
}
Furthermore, nvme_process_cq() rings the CQ doorbell after collecting
the CQEs but before processing them:
static inline int nvme_process_cq(struct nvme_queue *nvmeq, u16 *start,
				  u16 *end, unsigned int tag)
{
	...
	while (nvme_cqe_pending(nvmeq)) {
		...
		nvme_update_cq_head(nvmeq);
	}
	...
	nvme_ring_cq_doorbell(nvmeq);
	return found;
}
Each poller effectively tells the controller that the CQ is empty when
it rings the CQ doorbell. That is fine with a single poller, but with
many of them I think enough tags can be freed and reissued that the CQ
could be overrun.
In one specific example:

- Poller 1 finds a CQ full of entries in nvme_process_cq()
- While Poller 1 processes its CQEs, more pollers find CQE ranges to
  process: Pollers 2-4 start processing additional, non-overlapping
  CQE ranges
- Poller 5 finds a CQE range that overlaps with Poller 1's

  CQ size: 1024

                   Poller 1      2      3      4      5
  CQ start index         10      9    214    401    708
  CQ end index            9    214    401    708     77
  CQ start phase          1      0      0      0      0
  CQ end phase            0      0      0      0      1
Poller 1 finds that the CQ phase has flipped when it reaches CQE 821,
and indeed the phase has flipped because of Poller 5. If I'm
interpreting this data correctly, it suggests that the ranges of
Pollers 1 and 5 overlap (a quick arithmetic check of the table is
sketched below). After that I start seeing errors.
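To make that concrete, here is a small, stand-alone check of the
numbers in the table. This is only my own analysis helper, not driver
code; the one assumption is that the five ranges are consecutive,
i.e. each poller starts where the previous one ended, which is what
the start/end indices show:

#include <stdio.h>

#define CQ_SIZE 1024

/* number of slots in the half-open ring range [start, end) */
static unsigned int range_len(unsigned int start, unsigned int end)
{
	return (end + CQ_SIZE - start) % CQ_SIZE;
}

int main(void)
{
	const unsigned int start[] = { 10,   9, 214, 401, 708 };
	const unsigned int end[]   = {  9, 214, 401, 708,  77 };
	unsigned int i, total = 0;

	for (i = 0; i < 5; i++) {
		unsigned int len = range_len(start[i], end[i]);

		printf("poller %u: %4u CQEs\n", i + 1, len);
		total += len;
	}

	/*
	 * The total exceeds two full passes of the CQ, which matches
	 * the two phase flips in the table; Poller 5's tail therefore
	 * wraps back into slots 10..76, slots that Poller 1 collected
	 * but may not have completed yet.
	 */
	printf("total %u, slots past two passes: %u\n",
	       total, total - 2 * CQ_SIZE);
	return 0;
}

It reports 1023, 205, 187, 307 and 393 CQEs for the five pollers, 2115
in total, i.e. 67 slots more than two complete passes of the 1024-entry
CQ.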
A simpler theoretical example with two threads, suggested by Matthew
Wilcox:

- Thread 1 submits enough I/O to fill the CQ.
- Thread 1 then processes two CQEs, so two block-layer tags become
  available.
- Thread 1 is preempted by thread 2.
- Thread 2 submits two I/Os.
- Thread 2 processes the two CQEs which it owns.
- Thread 2 submits two more I/Os.
- Those CQEs overwrite the next two CQEs that will be processed by
  thread 1.

Two of thread 1's I/Os will not receive a completion, and two of
thread 2's I/Os will receive two completions.
Just as a workaround, I held cq_poll_lock while completing the CQEs and
saw no errors; a rough sketch of that change is below.
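For reference, this is roughly what I tested against the 5.5-era
nvme_poll() shown above. It just moves nvme_complete_cqes() under
cq_poll_lock, so take it as a sketch of the workaround rather than a
proposed fix:

static int nvme_poll(struct blk_mq_hw_ctx *hctx)
{
	struct nvme_queue *nvmeq = hctx->driver_data;
	u16 start, end;
	bool found;

	if (!nvme_cqe_pending(nvmeq))
		return 0;

	spin_lock(&nvmeq->cq_poll_lock);
	found = nvme_process_cq(nvmeq, &start, &end, -1);
	/*
	 * Workaround: finish handling the collected CQEs before
	 * dropping the lock, so other pollers on the same CQ cannot
	 * run while we are still walking [start, end).
	 */
	nvme_complete_cqes(nvmeq, start, end);
	spin_unlock(&nvmeq->cq_poll_lock);

	return found;
}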
Does that make sense?
Thanks.
--bijan