* [bug report] ublk: re-issued blk-mq request may reference a freed io_uring context
@ 2022-08-08 7:45 Ziyang Zhang
0 siblings, 0 replies; only message in thread
From: Ziyang Zhang @ 2022-08-08 7:45 UTC (permalink / raw)
To: Ming Lei; +Cc: linux-block, io-uring, Xiaoguang Wang
Hi, all
We are running test on ublk to verify its stability. While running
tests/generic/002 in ublksrv[1] which kills ublksrv daemon while
running fio at the same time, we find a null-deference error.
[1] https://github.com/ming1/ubdsrv
***********************
DMESG:
***********************
[ 1396.090924] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 1396.098770] ublk_ctrl_uring_cmd: cmd done ret 0 cmd_op 2, dev id 0 qid 65535
[ 1396.105714] #PF: supervisor write access in kernel mode
[ 1396.105716] #PF: error_code(0x0002) - not-present page
[ 1396.105718] PGD 800000013c762067 P4D 800000013c762067 PUD 634c8d067 PMD 0
[ 1396.132122] Oops: 0002 [#1] PREEMPT SMP PTI
[ 1396.136613] CPU: 33 PID: 27341 Comm: kworker/33:2 Kdump: loaded Tainted: G S OE 5.19.0 #121
[ 1396.146382] Hardware name: Inventec K900G3-10G/B900G3, BIOS A2.20 06/23/2017
[ 1396.154069] Workqueue: events io_fallback_req_func
[ 1396.159169] RIP: 0010:percpu_ref_get_many+0x23/0x30
[ 1396.164360] Code: 0f 1f 80 00 00 00 00 55 48 89 f5 53 48 89 fb e8 83 93 be ff 48 8b 03 a8 03 75 0b 65 48 01 28 5b 5d e9 b1 f5 be ff 48 8b 43 08 <f0> 48 01 28 5b 5d e9 a2 f5 be ff 66 90 41 54 49 89 d4 55 48 89 f5
[ 1396.183754] RSP: 0018:ffffae20e37c7e48 EFLAGS: 00010206
[ 1396.189301] RAX: 0000000000000000 RBX: ffff9e966730d800 RCX: 0000000000000000
[ 1396.196754] RDX: ffff9e9651ca9180 RSI: 0000000000000001 RDI: ffff9e966730d800
[ 1396.204214] RBP: 0000000000000001 R08: 61626c6c61665f6f R09: 665f7165725f6b63
[ 1396.211671] R10: 666562203a636e75 R11: 7265746920657269 R12: ffff9e965da36700
[ 1396.219143] R13: ffff9e965da36788 R14: 0000000000000000 R15: ffff9ed4bf876905
[ 1396.226607] FS: 0000000000000000(0000) GS:ffff9ed4bf840000(0000) knlGS:0000000000000000
[ 1396.235026] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1396.241110] CR2: 0000000000000000 CR3: 000000068cf9c001 CR4: 00000000003706e0
[ 1396.248587] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1396.256054] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1396.263523] Call Trace:
[ 1396.266313] <TASK>
[ 1396.268747] io_fallback_req_func+0x61/0x126
[ 1396.273354] process_one_work+0x1df/0x3b0
[ 1396.277705] worker_thread+0x49/0x2e0
[ 1396.281712] ? rescuer_thread+0x390/0x390
[ 1396.286060] kthread+0xe5/0x110
[ 1396.289534] ? kthread_complete_and_exit+0x20/0x20
[ 1396.294644] ret_from_fork+0x1f/0x30
[ 1396.298551] </TASK>
[ 1396.301055] Modules linked in: ublk_drv(OE) tcp_diag(E) udp_diag(E) inet_diag(E) ip6_tables(E) iptable_filter(E) ebtable_nat(E) ebtables(E) binfmt_misc(E) intel_rapl_msr(E) intel_rapl_common(E) iosf_mbi(E) x86_pkg_temp_thermal(E) coretemp(E) bonding(E) tls(E) kvm_intel(E) kvm(E) irqbypass(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) aesni_intel(E) iTCO_wdt(E) crypto_simd(E) iTCO_vendor_support(E) cryptd(E) mxm_wmi(E) mei_me(E) i2c_i801(E) lpc_ich(E) pcspkr(E) sg(E) i2c_smbus(E) mfd_core(E) mei(E) acpi_ipmi(E) ipmi_si(E) ipmi_devintf(E) ipmi_msghandler(E) wmi(E) acpi_power_meter(E) ip_tables(E) ast(E) i2c_algo_bit(E) drm_vram_helper(E) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) fb_sys_fops(E) drm_ttm_helper(E) ttm(E) crc32c_intel(E) drm(E) nvme(E) i2c_core(E) ixgbe(E) nvme_core(E) mdio(E) dca(E) sd_mod(E) t10_pi(E) crc64_rocksoft(E) crc64(E) ahci(E) libahci(E) libata(E) [last unloaded: ublk_drv]
[ 1396.386209] CR2: 0000000000000000
***********************
DESCRIPTION 1:
***********************
Here in percpu_ref_get_many():atomic_long_add(nr, &ref->data->count),
ref->data is actually freed because io_uring's ctx->refs is freed at that
time. We analyze ublk_drv.c and find a reason, assume there is only one
tag per blk-mq queue:
CPU 0 CPU 1 CPU2 CPU3
__ublk_rq_task_work()
io_uring_cmd_done(io->cmd)
****task crash****
io_ring_exit_work()
/* Now io_uring ctx can be freed
since all ioucmd(only one) are done */
ublk_daemon_monitor_work()
blk_mq_end_request(req, BLK_STS_IOERR);
ublk_queue_rq() /* fio re-issue req because of IO error */
ubq_daemon_is_dying() /* NOT SEE PF_EXITING */
io_uring_cmd_complete_in_task(io->cmd) /* the io_uring ctx is freed! */
io_fallback_req_func() /* io_uring fallback */
percpu_ref_get_many() /* null-deref! */
***********************
DESCRIPTION 2:
***********************
I describe the above data flow here again:
(1) After the ioucmd are io_uring_cmd_done(), a crash happens.
(2) The io_uring context is freed (at least part
of it, e.g. ctx->refs) because we don't reference it anymore.
(3) Then, monitor_work aborts the blk-mq request and fio re-tries
because of the I/O error.
(4) After that, ublk_queue_rq() is called for the re-issued request
but it does not see PF_EXITING!(Since ublk_queue_rq() is called
in another task different from ubq_daemon or ioucmd's task)
(5) io_uring_cmd_complete_in_task() is called, and the ioucmd is
still the old one because the tag does not change.
(6) io_fallback_req_func(), and ref->data is NULL while calling
atomic_long_add(nr, &ref->data->count).
The root cause is that:
(1) io_uring_cmd_complete_in_task() does not
immediately try to call task_work_add(), but puts the ioucmd's
task_work into a llist for batching. If task_work_add() is called
immediately in ublk_queue_rq(), this situation can be avoided.
(2) In ublk_drv.c:ublk_queue_rq(), check on PF_EXITING is added:
if (unlikely(ubq_daemon_is_dying(ubq))) {
fail:
mod_delayed_work(system_wq, &ubq->dev->monitor_work, 0);
return BLK_STS_IOERR;
}
However, this check on PF_EXITING is unsafe since ublk_queue_rq() runs in
another task other than ubq_daemon(ioucmd's task). ublk_queue_rq() may not
see PF_EXITING after the ubq_daemon(ioucmd's task) crashes. With this
check, the above null-deref error rarely happens. But if we comment it out,
We are more likely to trigger this error.
***********************
HOW TO REPRODUCE:
***********************
(1) Forbid check on PF_EXITING. In ublk_drv.c:ublk_queue_rq(), comment out:
if (unlikely(ubq_daemon_is_dying(ubq))) {
fail:
mod_delayed_work(system_wq, &ubq->dev->monitor_work, 0);
return BLK_STS_IOERR;
}
This makes sense because this check is unsafe and we are more likely to re-produce
this error without this check.
(2) Then, run:
make test T=generic/002
It may fail after running several times.
Please point out my mistake because this bug is really complicated for me :)
Regards,
Zhang
^ permalink raw reply [flat|nested] only message in thread
only message in thread, other threads:[~2022-08-08 7:46 UTC | newest]
Thread overview: (only message) (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2022-08-08 7:45 [bug report] ublk: re-issued blk-mq request may reference a freed io_uring context Ziyang Zhang
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox