From: Stefan Metzmacher <[email protected]>
To: Steven Rostedt <[email protected]>
Cc: Ingo Molnar <[email protected]>,
[email protected],
io-uring <[email protected]>,
"[email protected]" <[email protected]>
Subject: Re: sched_waking vs. set_event_pid crash (Re: Tracing busy processes/threads freezes/stalls the whole machine)
Date: Fri, 23 Jul 2021 12:35:09 +0200 [thread overview]
Message-ID: <[email protected]> (raw)
In-Reply-To: <[email protected]>
Hi Steve,
>> After some days of training:
>> https://training.linuxfoundation.org/training/linux-kernel-debugging-and-security/
>> I was able to get much closer to the problem :-)
>
> Well, knowing what to look for was not going to be easy.
>
> And I'm sure you were shocked to see what I posted as a fix ;-)
Not really, such problems typically just have one-line fixes :-)
> Assuming this does fix your issue, I sent out a real patch with the
> explanation of what happened in the change log, so that you can see why
> that change was your issue.
Yes, it does the trick, thanks very much!
>> In order to reproduce it and get reliable kexec crash dumps,
>> I needed to give the VM at least 3 cores.
>
> Yes, it did require having this run on multiple CPUs to have a race
> condition trigger, and with only two cores it would have been hard to
> hit. I ran it on 8 cores and it triggered rather easily.
I mean I was able to trigger the problem with 2 cores, but then
the machine was completely stuck, without triggering the kexec reboot
to get the crash dump. I don't understand why, as the additional CPU
is not really involved at all...
Any idea why kexec could be so unreliable?
(even with 3 or 6 CPUs it sometimes just didn't trigger)
>> While running './io-uring_cp-forever link-cp.c file' (from:
>> https://github.com/metze-samba/liburing/commits/io_uring-cp-forever )
>> in one window, the following simple sequence triggered the problem in most cases:
>>
>> echo 1 > /sys/kernel/tracing/events/sched/sched_waking/enable
>> echo 1 > /sys/kernel/tracing/set_event_pid
>
> All it took to hit the race easily was something busy doing a lot of
> wakeups while set_event_pid was being written. As I stated, I
> triggered it by running hackbench instead of the io-uring code. In
> fact, this bug had nothing to do with io-uring; you were only
> triggering it because you were putting enough load on the system to
> make the race happen often.
Yes, it was just the example that made it most reliable to trigger for me.
>> pid_list = rcu_dereference_raw(tr->filtered_pids);
>
> That "tr" comes from the trace_event_file that is passed in by the
> "data" field of the callback. Hence, this callback got the data field
> of the event_filter_pid_sched_wakeup_probe_pre() callback that is
> called before all events when the set_event_pid file is set. That
> means the "tr" being dereferenced was not the "tr" you were looking for.
Yes, that's what I assumed, but I didn't find the reason for it.
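For anyone following along, here is a rough sketch of the path the crash pointed at (simplified and partly hypothetical; only rcu_dereference_raw(tr->filtered_pids), the tr member of trace_event_file, and the inlining into trace_event_buffer_reserve() are taken from the messages above, and pid_filter_check is a made-up name):

/* Hedged sketch, not the kernel's exact code: a helper, inlined into
 * trace_event_buffer_reserve(), fetches "tr" from the trace_event_file
 * it was handed as the opaque "data" cookie of a callback. */
static bool pid_filter_check(void *data)	/* hypothetical name */
{
	struct trace_event_file *file = data;
	struct trace_array *tr = file->tr;	/* bogus if "data" was
						 * actually meant for a
						 * different callback */
	struct trace_pid_list *pid_list;

	pid_list = rcu_dereference_raw(tr->filtered_pids);
	if (!pid_list)
		return false;

	/* with a bogus "tr" above, this dereference is what crashes */
	return !trace_find_filtered_pid(pid_list, current->pid);
}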
>>
>> It seems it got inlined within trace_event_buffer_reserve()
>>
>> The strangest thing I found so far is this:
>>
>> crash> sym global_trace
>> ffffffffbcdb7e80 (d) global_trace
>> crash> list ftrace_trace_arrays
>> ffffffffbcdb7e70
>> ffffffffbcdb7e80
>>
>> trace_array has size 7672, but ffffffffbcdb7e70 is only 16 bytes away from
>> ffffffffbcdb7e80.
>
> ftrace_trace_arrays is a list_head, and I doubt you created any
> instances, thus the list head has only one instance, and that is
> global_trace. Hence, it points to global_trace and itself. It just so
> happens that a list_head is 16 bytes.
Ah, that explains a lot:
crash> struct list_head -o ftrace_trace_arrays
struct list_head {
[ffffffffbcdb7e70] struct list_head *next;
[ffffffffbcdb7e78] struct list_head *prev;
}
SIZE: 16
crash> list ftrace_trace_arrays
ffffffffbcdb7e70
ffffffffbcdb7e80
I guess this is what I want:
crash> list -H ftrace_trace_arrays
ffffffffbcdb7e80
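For completeness, a list_head really is just two pointers (this matches the definition in include/linux/types.h):

struct list_head {
	struct list_head *next;
	struct list_head *prev;
};

The head node lives outside any trace_array, which is why, without -H, crash printed the head's own address (ffffffffbcdb7e70) as if it were an entry.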
>> Also this:
>>
>> crash> struct trace_array.events -o global_trace
>> struct trace_array {
>> [ffffffffbcdb9be0] struct list_head events;
>> }
>> crash> list -s trace_event_file.tr -o trace_event_file.list ffffffffbcdb9be0
>> ffffffffbcdb9be0
>> tr = 0xffffffffbcdb7b20
>> ffff8ccb45cdfb00
>> tr = 0xffffffffbcdb7e80
>> ffff8ccb45cdf580
>> tr = 0xffffffffbcdb7e80
>> ffff8ccb45cdfe18
>> tr = 0xffffffffbcdb7e80
>> ...
>> tr = 0xffffffffbcdb7e80
>>
>> The first one 0xffffffffbcdb7b20 is only 864 bytes away from 0xffffffffbcdb7e80
>
> I'm thinking it is confused by hitting the ftrace_trace_arrays
> list_head itself.
Yes, I needed the -H here too:
list -s trace_event_file.tr -o trace_event_file.list -H ffffffffbcdb9be0
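In kernel C, that crash invocation corresponds roughly to this traversal (a sketch; the member names tr, events, and list are taken from the crash commands above):

/* list_for_each_entry() starts *after* the head node, which is what
 * "list -H" mirrors; without -H, the head itself gets misinterpreted
 * as a trace_event_file. */
struct trace_event_file *file;

list_for_each_entry(file, &tr->events, list)
	pr_info("tr = %px\n", file->tr);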
>> It would be really great if you (or anyone else on the lists) could have a closer look
>> in order to get the problem fixed :-)
>
> Once I triggered it and started looking at what was happening, it
> didn't take me too long to figure out where the culprit was.
Great!
>>
>> I've learned a lot this week and tried hard to find the problem myself,
>> but I have to move back to other work for now.
>>
>
> I'm glad you learned a lot. There's a lot of complex code in there, and
> it's getting more complex, as you can see with the static_calls.
>
> This is because tracing tries hard to avoid the heisenbug effect. You
> know, you see a bug, turn on tracing, and then that bug goes away!
>
> Thus it pulls out all the tricks it can to be as unintrusive on the
> system as it can be. And that causes things to get complex really quickly.
I'll have a look; I may need something similar for my smbdirect socket
driver in the future.
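For reference (and as a note to self), my understanding is that the static_call API from include/linux/static_call.h boils down to something like this minimal sketch:

#include <linux/static_call.h>

static int default_handler(int arg)
{
	return arg;
}

/* Define a patchable call site, initially bound to default_handler. */
DEFINE_STATIC_CALL(my_call, default_handler);

static int hot_path(int arg)
{
	/* Compiles to a direct call that is patched in place, so the
	 * fast path avoids an indirect branch. */
	return static_call(my_call)(arg);
}

static void set_handler(int (*fn)(int))
{
	/* Rewrite the call site(s) at runtime. */
	static_call_update(my_call, fn);
}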
> Cheers, and thanks for keeping up on this bug!
Thanks for fixing the bug!
Now I can finally use:
trace-cmd record -e all -P $(pidof io_uring-cp-forever)
But that doesn't include the iou-wrk-* threads,
and the '-c' option seems to only work with forking.
Is there a way to specify "trace *all* threads of the given pid"?
(Note the threads are coming and going, so it's not possible to
specify -P more than once.)
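(One possible workaround, untested and assuming the ftrace event-fork option behaves as documented, would be to skip -P and drive set_event_pid directly:

# untested sketch: seed set_event_pid with the main pid and let the
# event-fork option add newly created children/threads automatically
echo $(pidof io_uring-cp-forever) > /sys/kernel/tracing/set_event_pid
echo 1 > /sys/kernel/tracing/options/event-fork
echo 1 > /sys/kernel/tracing/events/enable

though that still wouldn't catch iou-wrk-* threads that already exist.)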
Thanks!
metze
Thread overview: 17+ messages
2021-04-22 14:26 Tracing busy processes/threads freezes/stalls the whole machine Stefan Metzmacher
2021-04-22 14:55 ` Stefan Metzmacher
2021-05-04 12:26 ` Stefan Metzmacher
2021-05-04 13:24 ` Steven Rostedt
2021-05-04 13:28 ` Stefan Metzmacher
2021-05-04 13:32 ` Steven Rostedt
2021-05-04 13:35 ` Steven Rostedt
2021-05-05 9:50 ` Stefan Metzmacher
2021-05-31 15:39 ` Stefan Metzmacher
2021-07-19 17:07 ` Stefan Metzmacher
2021-07-22 22:43 ` sched_waking vs. set_event_pid crash (Re: Tracing busy processes/threads freezes/stalls the whole machine) Stefan Metzmacher
2021-07-23 1:41 ` Steven Rostedt
2021-07-23 2:51 ` Steven Rostedt
2021-07-23 10:35 ` Stefan Metzmacher [this message]
2021-07-23 11:29 ` Steven Rostedt
2021-07-23 11:53 ` Stefan Metzmacher
2021-07-23 12:41 ` Steven Rostedt