[PATCH] tracepoints: Update static_call before tp

public inbox for [email protected]
 help / color / mirror / Atom feed

* [PATCH] tracepoints: Update static_call before tp_funcs when adding a tracepoint
@ 2021-07-23  2:33 Steven Rostedt
  2021-07-26 15:46 ` Mathieu Desnoyers
  0 siblings, 1 reply; 8+ messages in thread
From: Steven Rostedt @ 2021-07-23  2:33 UTC (permalink / raw)
  To: LKML, Linux Trace Devel
  Cc: Mathieu Desnoyers, Ingo Molnar, Andrew Morton, Stefan Metzmacher,
	io-uring, Peter Zijlstra

From: "Steven Rostedt (VMware)" <[email protected]>

Because of the significant overhead that retpolines pose on indirect
calls, the tracepoint code was updated to use the new "static_calls" that
can modify the running code to directly call a function instead of using
an indirect caller, and this function can be changed at runtime.

In the tracepoint code that calls all the registered callbacks that are
attached to a tracepoint, the following is done:

	it_func_ptr = rcu_dereference_raw((&__tracepoint_##name)->funcs);
	if (it_func_ptr) {
		__data = (it_func_ptr)->data;
		static_call(tp_func_##name)(__data, args);
	}

If there's just a single callback, the static_call is updated to just call
that callback directly. Once another handler is added, then the static
caller is updated to call the iterator, that simply loops over all the
funcs in the array and calls each of the callbacks like the old method
using indirect calling.

The issue was discovered with a race between updating the funcs array and
updating the static_call. The funcs array was updated first and then the
static_call was updated. This is not an issue as long as the first element
in the old array is the same as the first element in the new array. But
that assumption is incorrect, because callbacks also have a priority
field, and if there's a callback added that has a higher priority than the
callback on the old array, then it will become the first callback in the
new array. This means that it is possible to call the old callback with
the new callback data element, which can cause a kernel panic.

	static_call = callback1()
	funcs[] = {callback1,data1};
	callback2 has higher priority than callback1

	CPU 1				CPU 2
	-----				-----

   new_funcs = {callback2,data2},
               {callback1,data1}

   rcu_assign_pointer(tp->funcs, new_funcs);

  /*
   * Now tp->funcs has the new array
   * but the static_call still calls callback1
   */

				it_func_ptr = tp->funcs [ new_funcs ]
				data = it_func_ptr->data [ data2 ]
				static_call(callback1, data);

				/* Now callback1 is called with
				 * callback2's data */

				[ KERNEL PANIC ]

   update_static_call(iterator);

To prevent this from happening, always switch the static_call to the
iterator before assigning the tp->funcs to the new array. The iterator will
always properly match the callback with its data.

To trigger this bug:

  In one terminal:

    while :; do hackbench 50; done

  In another terminal

    echo 1 > /sys/kernel/tracing/events/sched/sched_waking/enable
    while :; do
        echo 1 > /sys/kernel/tracing/set_event_pid;
        sleep 0.5
        echo 0 > /sys/kernel/tracing/set_event_pid;
        sleep 0.5
   done

And it doesn't take long to crash. This is because the set_event_pid adds
a callback to the sched_waking tracepoint with a high priority, which will
be called before the sched_waking trace event callback is called.

Note, the removal to a single callback updates the array first, before
changing the static_call to single callback, which is the proper order as
the first element in the array is the same as what the static_call is
being changed to.

Link: https://lore.kernel.org/io-uring/[email protected]/

Cc: [email protected]
Fixes: d25e37d89dd2f ("tracepoint: Optimize using static_call()")
Reported-by: Stefan Metzmacher <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
---
 kernel/tracepoint.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
index 976bf8ce8039..fc32821f8240 100644
--- a/kernel/tracepoint.c
+++ b/kernel/tracepoint.c
@@ -299,8 +299,8 @@ static int tracepoint_add_func(struct tracepoint *tp,
 	 * a pointer to it.  This array is referenced by __DO_TRACE from
 	 * include/linux/tracepoint.h using rcu_dereference_sched().
 	 */
-	rcu_assign_pointer(tp->funcs, tp_funcs);
 	tracepoint_update_call(tp, tp_funcs, false);
+	rcu_assign_pointer(tp->funcs, tp_funcs);
 	static_key_enable(&tp->key);

 	release_probes(old);
-- 
2.31.1

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH] tracepoints: Update static_call before tp_funcs when adding a tracepoint
  2021-07-23  2:33 [PATCH] tracepoints: Update static_call before tp_funcs when adding a tracepoint Steven Rostedt
@ 2021-07-26 15:46 ` Mathieu Desnoyers
  2021-07-26 16:56   ` Steven Rostedt
  0 siblings, 1 reply; 8+ messages in thread
From: Mathieu Desnoyers @ 2021-07-26 15:46 UTC (permalink / raw)
  To: rostedt
  Cc: linux-kernel, linux-trace-devel, Ingo Molnar, Andrew Morton,
	Stefan Metzmacher, io-uring, Peter Zijlstra

----- On Jul 22, 2021, at 10:33 PM, rostedt [email protected] wrote:

> From: "Steven Rostedt (VMware)" <[email protected]>
> 
> Because of the significant overhead that retpolines pose on indirect
> calls, the tracepoint code was updated to use the new "static_calls" that
> can modify the running code to directly call a function instead of using
> an indirect caller, and this function can be changed at runtime.
> 
> In the tracepoint code that calls all the registered callbacks that are
> attached to a tracepoint, the following is done:
> 
>	it_func_ptr = rcu_dereference_raw((&__tracepoint_##name)->funcs);
>	if (it_func_ptr) {
>		__data = (it_func_ptr)->data;
>		static_call(tp_func_##name)(__data, args);
>	}
> 
> If there's just a single callback, the static_call is updated to just call
> that callback directly. Once another handler is added, then the static
> caller is updated to call the iterator, that simply loops over all the
> funcs in the array and calls each of the callbacks like the old method
> using indirect calling.
> 
> The issue was discovered with a race between updating the funcs array and
> updating the static_call. The funcs array was updated first and then the
> static_call was updated. This is not an issue as long as the first element
> in the old array is the same as the first element in the new array. But
> that assumption is incorrect, because callbacks also have a priority
> field, and if there's a callback added that has a higher priority than the
> callback on the old array, then it will become the first callback in the
> new array. This means that it is possible to call the old callback with
> the new callback data element, which can cause a kernel panic.
> 

[...]

Looking into the various transitions, I suspect the issue runs much deeper than
this.

The sequence of transitions (number of probes) I'm considering is:

0->1
1->2
2->1
1->0
0->1
1->2

I come to three conclusions:

Where we have:

tracepoint_remove_func()

                tracepoint_update_call(tp, tp_funcs,
                                       tp_funcs[0].func != old[0].func);

We should be comparing .data rather than .func, because the same callback
can be registered twice with different data, and what we care about here
is that the data of array element 0 is unchanged to skip rcu sync.

My second conclusion is that it's odd that transition 1->0 leaves the
prior function call in place even after it's been removed. When we go
back to 0->1, that function call may still be called even though the
function is not there anymore. And there is no RCU synchronization on
these transitions, so those are all possible scenarios.

My third conclusion is that we'd need synchronize RCU whenever tp_funcs[0].data
changes for transitions 1->2, 2->1, and 1->2 because the priorities don't guarantee
that the first callback stays in the first position, and we also need to rcu sync
unconditionally on transition 1->0. We currently only have sync RCU on transition
from 2->1 when tp_funcs[0].func changes, which is bogus in many ways.

Basically, transitions from the iterator to a specific function should be handled
with care (making sure the tp_funcs array is updated and rcu-sync is done), except
in the specific case where the prior tp->funcs was NULL, which skips the function
call. And unless there is a rcu-sync between the state transitions, we need to consider
all prior states as additional original state as well. Therefore, in a 1->0->1
transition sequence, it's very much possible that the old function ends up observing
the new callback's data unless we add some rcu sync in between.

Thoughts ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] tracepoints: Update static_call before tp_funcs when adding a tracepoint
  2021-07-26 15:46 ` Mathieu Desnoyers
@ 2021-07-26 16:56   ` Steven Rostedt
  2021-07-26 17:39     ` Mathieu Desnoyers
  0 siblings, 1 reply; 8+ messages in thread
From: Steven Rostedt @ 2021-07-26 16:56 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, linux-trace-devel, Ingo Molnar, Andrew Morton,
	Stefan Metzmacher, io-uring, Peter Zijlstra

On Mon, 26 Jul 2021 11:46:41 -0400 (EDT)
Mathieu Desnoyers <[email protected]> wrote:
> 
> [...]
> 
> Looking into the various transitions, I suspect the issue runs much deeper than
> this.
> 
> The sequence of transitions (number of probes) I'm considering is:
> 
> 0->1
> 1->2
> 2->1
> 1->0
> 0->1
> 1->2
> 
> I come to three conclusions:
> 
> Where we have:
> 
> tracepoint_remove_func()
> 
>                 tracepoint_update_call(tp, tp_funcs,
>                                        tp_funcs[0].func != old[0].func);
> 
> We should be comparing .data rather than .func, because the same callback
> can be registered twice with different data, and what we care about here
> is that the data of array element 0 is unchanged to skip rcu sync.

I guess we could do that, as you are right, we are worried about
passing the wrong data to the wrong function. If the function is the
same, at least it wont crash the kernel as the function can handle that
data. But, it could miss the callback that is staying while calling the
one that is going instead.

Unlikely to happen, but in theory it is enough to fix.

> 
> My second conclusion is that it's odd that transition 1->0 leaves the
> prior function call in place even after it's been removed. When we go
> back to 0->1, that function call may still be called even though the
> function is not there anymore. And there is no RCU synchronization on
> these transitions, so those are all possible scenarios.

How so? When doing this transition we have:

	tracepoint_update_call(tp, tp_funcs, false);
	rcu_assign_pointer(tp->funcs, tp_funcs);
	static_key_enable(&tp->key);

Where that tracepoint_update_call() will reinstall the iterator, and
that's a full memory barrier. It even sends IPIs to all other CPUs to
make sure all CPUs are synchronized before continuing.

By the time we get to static_key_enable(), there will not be any CPUs
that see the old function. And the process of updating a static_key
also does the same kind of synchronization.

> 
> My third conclusion is that we'd need synchronize RCU whenever tp_funcs[0].data
> changes for transitions 1->2, 2->1, and 1->2 because the priorities don't guarantee
> that the first callback stays in the first position, and we also need to rcu sync
> unconditionally on transition 1->0. We currently only have sync RCU on transition
> from 2->1 when tp_funcs[0].func changes, which is bogus in many ways.

Going from 1 to 2, there's no issue. We switch to the iterator, which
is the old method anyway. It looks directly at the array and matches
the data with the func for each element of that array, and the data
read initially (before calling the iterator) is ignored.

> 
> Basically, transitions from the iterator to a specific function should be handled
> with care (making sure the tp_funcs array is updated and rcu-sync is done), except
> in the specific case where the prior tp->funcs was NULL, which skips the function
> call. And unless there is a rcu-sync between the state transitions, we need to consider
> all prior states as additional original state as well. Therefore, in a 1->0->1
> transition sequence, it's very much possible that the old function ends up observing
> the new callback's data unless we add some rcu sync in between.

I disagree with the last part, as I explained above.

But I do agree that comparing data is probably the better check.

-- Steve

> 
> Thoughts ?
> 
> Thanks,
> 
> Mathieu
> 


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] tracepoints: Update static_call before tp_funcs when adding a tracepoint
  2021-07-26 16:56   ` Steven Rostedt
@ 2021-07-26 17:39     ` Mathieu Desnoyers
  2021-07-26 18:49       ` Steven Rostedt
  0 siblings, 1 reply; 8+ messages in thread
From: Mathieu Desnoyers @ 2021-07-26 17:39 UTC (permalink / raw)
  To: rostedt
  Cc: linux-kernel, linux-trace-devel, Ingo Molnar, Andrew Morton,
	Stefan Metzmacher, io-uring, Peter Zijlstra, paulmck

----- On Jul 26, 2021, at 12:56 PM, rostedt [email protected] wrote:

> On Mon, 26 Jul 2021 11:46:41 -0400 (EDT)
> Mathieu Desnoyers <[email protected]> wrote:
[...]
> 
>> 
>> My second conclusion is that it's odd that transition 1->0 leaves the
>> prior function call in place even after it's been removed. When we go
>> back to 0->1, that function call may still be called even though the
>> function is not there anymore. And there is no RCU synchronization on
>> these transitions, so those are all possible scenarios.
> 
> How so? When doing this transition we have:
> 
>	tracepoint_update_call(tp, tp_funcs, false);
>	rcu_assign_pointer(tp->funcs, tp_funcs);
>	static_key_enable(&tp->key);
> 
> Where that tracepoint_update_call() will reinstall the iterator, and
> that's a full memory barrier. It even sends IPIs to all other CPUs to
> make sure all CPUs are synchronized before continuing.
> 
> By the time we get to static_key_enable(), there will not be any CPUs
> that see the old function. And the process of updating a static_key
> also does the same kind of synchronization.

Actually, my explanation was inaccurate. The issue is that the _new_ callback
may see the _old_ data.

Considering __DO_TRACE_CALL:

        do {                                                            \
                struct tracepoint_func *it_func_ptr;                    \
                void *__data;                                           \
                it_func_ptr =                                           \
                        rcu_dereference_raw((&__tracepoint_##name)->funcs); \
                if (it_func_ptr) {                                      \
                        __data = (it_func_ptr)->data;                   \

----> [ delayed here on one CPU (e.g. vcpu preempted by the host) ]

                        static_call(tp_func_##name)(__data, args);      \
                }                                                       \
        } while (0)

It has loaded the tp->funcs of the old callback (so it will try to use the old
data).

AFAIU, none of the synchronization mechanisms you refer to here (memory barrier,
IPIs..) will change the fact that this CPU may still be delayed across the entire
1->0->1 transition sequence, and may end up calling the new callback with the
old data. Unless an explicit RCU-sync is done.

> 
>> 
>> My third conclusion is that we'd need synchronize RCU whenever tp_funcs[0].data
>> changes for transitions 1->2, 2->1, and 1->2 because the priorities don't
>> guarantee
>> that the first callback stays in the first position, and we also need to rcu
>> sync
>> unconditionally on transition 1->0. We currently only have sync RCU on
>> transition
>> from 2->1 when tp_funcs[0].func changes, which is bogus in many ways.
> 
> Going from 1 to 2, there's no issue. We switch to the iterator, which
> is the old method anyway. It looks directly at the array and matches
> the data with the func for each element of that array, and the data
> read initially (before calling the iterator) is ignored.

This relies on ordering guarantees between RCU assign/dereference and static_call
updates/call. It may well be the case, but I'm asking anyway.

Are we guaranteed of the following ordering ?

CPU A                             CPU B

                                  static_call_update()
y = rcu_dereference(x)            rcu_assign_pointer(x, ...)
do_static_call(y)                 

That load of "x" should never happen after the CPU fetches the new static call
instruction.

Also, I suspect that transition 2->1 needs an unconditional rcu-sync because you
may have a sequence of 3->2->1 (or 1->2->1) where the element 0 data is unchanged
between 2->1, but was changed from 3->2 (or from 1->2), which may be observed by the
static call.

Thanks,

Mathieu

> 
>> 
>> Basically, transitions from the iterator to a specific function should be
>> handled
>> with care (making sure the tp_funcs array is updated and rcu-sync is done),
>> except
>> in the specific case where the prior tp->funcs was NULL, which skips the
>> function
>> call. And unless there is a rcu-sync between the state transitions, we need to
>> consider
>> all prior states as additional original state as well. Therefore, in a 1->0->1
>> transition sequence, it's very much possible that the old function ends up
>> observing
>> the new callback's data unless we add some rcu sync in between.
> 
> I disagree with the last part, as I explained above.
> 
> But I do agree that comparing data is probably the better check.
> 
> -- Steve
> 
>> 
>> Thoughts ?
>> 
>> Thanks,
>> 
>> Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] tracepoints: Update static_call before tp_funcs when adding a tracepoint
  2021-07-26 17:39     ` Mathieu Desnoyers
@ 2021-07-26 18:49       ` Steven Rostedt
  2021-07-26 19:12         ` Mathieu Desnoyers
  2021-07-27 11:44         ` Peter Zijlstra
  0 siblings, 2 replies; 8+ messages in thread
From: Steven Rostedt @ 2021-07-26 18:49 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, linux-trace-devel, Ingo Molnar, Andrew Morton,
	Stefan Metzmacher, io-uring, Peter Zijlstra, paulmck

On Mon, 26 Jul 2021 13:39:18 -0400 (EDT)
Mathieu Desnoyers <[email protected]> wrote:

> ----- On Jul 26, 2021, at 12:56 PM, rostedt [email protected] wrote:
> 
> > On Mon, 26 Jul 2021 11:46:41 -0400 (EDT)
> > Mathieu Desnoyers <[email protected]> wrote:  
> [...]
> >   
> >> 
> >> My second conclusion is that it's odd that transition 1->0 leaves the
> >> prior function call in place even after it's been removed. When we go
> >> back to 0->1, that function call may still be called even though the
> >> function is not there anymore. And there is no RCU synchronization on
> >> these transitions, so those are all possible scenarios.  
> > 
> > How so? When doing this transition we have:
> > 
> >	tracepoint_update_call(tp, tp_funcs, false);
> >	rcu_assign_pointer(tp->funcs, tp_funcs);
> >	static_key_enable(&tp->key);
> > 
> > Where that tracepoint_update_call() will reinstall the iterator, and
> > that's a full memory barrier. It even sends IPIs to all other CPUs to
> > make sure all CPUs are synchronized before continuing.
> > 
> > By the time we get to static_key_enable(), there will not be any CPUs
> > that see the old function. And the process of updating a static_key
> > also does the same kind of synchronization.  
> 
> Actually, my explanation was inaccurate. The issue is that the _new_ callback
> may see the _old_ data.
> 
> Considering __DO_TRACE_CALL:
> 
>         do {                                                            \
>                 struct tracepoint_func *it_func_ptr;                    \
>                 void *__data;                                           \
>                 it_func_ptr =                                           \
>                         rcu_dereference_raw((&__tracepoint_##name)->funcs); \
>                 if (it_func_ptr) {                                      \
>                         __data = (it_func_ptr)->data;                   \
> 
> ----> [ delayed here on one CPU (e.g. vcpu preempted by the host) ]  
> 
>                         static_call(tp_func_##name)(__data, args);      \
>                 }                                                       \
>         } while (0)
> 
> It has loaded the tp->funcs of the old callback (so it will try to use the old
> data).
> 
> AFAIU, none of the synchronization mechanisms you refer to here (memory barrier,
> IPIs..) will change the fact that this CPU may still be delayed across the entire
> 1->0->1 transition sequence, and may end up calling the new callback with the
> old data. Unless an explicit RCU-sync is done.

OK. I see the issue you are saying. And this came from my assumption
that the tracepoint code did a synchronization when unregistering the
last callback. But of course it wont because that would make a lot of
back to back synchronizations of a large number of tracepoints being
unregistered at once.

And doing it for all 0->1 or 1->0 or even a 1->0->1 can be costly.

One way to handle this is when going from 1->0, set off a worker that
will do the synchronization asynchronously, and if a 0->1 comes in,
have that block until the synchronization is complete. This should
work, and not have too much of an overhead.

If one 1->0 starts the synchronization, and one or more 1->0
transitions happen, it will be recorded where the worker will do
another synchronization, to make sure all 1->0 have went through a full
synchronization before a 0->1 can happen.

If a 0->1 comes in while a synchronization is happening, it will note
the current "number" for the synchronizations (if another one is
queued, it will wait for one more), before it can begin. As locks will
be held while waiting for synchronizations to finish, we don't need to
worry about another 1->0 coming in while a 0->1 is waiting.

Shouldn't be too hard to implement.

static unsigned long sync_number;
static unsigned long last_sync_number;

Going from 1->0:

	/* tracepoint_mutex held */

	mutex_lock(&sync_mutex);
	sync_number++;
	mutex_unlock(&sync_mutex);
	queue_worker(sync_worker);


sync_worker()

  again:
	run = false;

	mutex_lock(&sync_mutex);
	if (last_sync_number != sync_number) {
		last_sync_number = sync_number;
		run = true;
	}
	mutex_unlock(&sync_mutex);

	if (!run)
		return;

	tracepoint_synchronization_unregister();

	wake_up(waiters);

	goto again;


Going from 0->1

 again:
	mutex_lock(&sync_mutex);
	if (last_sync_number != sync_number) {
		prepare_to_wait(waiters);
		block = true;
	}
	mutex_unlock(&sync_mutex);

	/* tracepoint_mutex held, sync_number will not increase */
	if (block) {
		schedule();
		/* in case we were woken up by an exiting worker */
		goto again;
	}
	
> 
> >   
> >> 
> >> My third conclusion is that we'd need synchronize RCU whenever tp_funcs[0].data
> >> changes for transitions 1->2, 2->1, and 1->2 because the priorities don't
> >> guarantee
> >> that the first callback stays in the first position, and we also need to rcu
> >> sync
> >> unconditionally on transition 1->0. We currently only have sync RCU on
> >> transition
> >> from 2->1 when tp_funcs[0].func changes, which is bogus in many ways.  
> > 
> > Going from 1 to 2, there's no issue. We switch to the iterator, which
> > is the old method anyway. It looks directly at the array and matches
> > the data with the func for each element of that array, and the data
> > read initially (before calling the iterator) is ignored.  
> 
> This relies on ordering guarantees between RCU assign/dereference and static_call
> updates/call. It may well be the case, but I'm asking anyway.
> 
> Are we guaranteed of the following ordering ?
> 
> CPU A                             CPU B
> 
>                                   static_call_update()

The static_call_update() triggers an IPI on all CPUs that perform a full memory barrier.

That is, nothing on any CPU will cross the static_call_update().

> y = rcu_dereference(x)            rcu_assign_pointer(x, ...)
> do_static_call(y)                 
> 
> That load of "x" should never happen after the CPU fetches the new static call
> instruction.

The 'y' will always be the new static call (which is the iterator in
this case), and it doesn't matter which x it read, because the iterator
will read the array just like it was done without static calls.

> 
> Also, I suspect that transition 2->1 needs an unconditional rcu-sync because you
> may have a sequence of 3->2->1 (or 1->2->1) where the element 0 data is unchanged
> between 2->1, but was changed from 3->2 (or from 1->2), which may be observed by the
> static call.


I'll agree that we need to add synchronization between 1->0->1, but you
have not convinced me on this second part.

-- Steve

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] tracepoints: Update static_call before tp_funcs when adding a tracepoint
  2021-07-26 18:49       ` Steven Rostedt
@ 2021-07-26 19:12         ` Mathieu Desnoyers
  2021-07-27 11:44         ` Peter Zijlstra
  1 sibling, 0 replies; 8+ messages in thread
From: Mathieu Desnoyers @ 2021-07-26 19:12 UTC (permalink / raw)
  To: rostedt
  Cc: linux-kernel, linux-trace-devel, Ingo Molnar, Andrew Morton,
	Stefan Metzmacher, io-uring, Peter Zijlstra, paulmck

----- On Jul 26, 2021, at 2:49 PM, rostedt [email protected] wrote:

> On Mon, 26 Jul 2021 13:39:18 -0400 (EDT)
> Mathieu Desnoyers <[email protected]> wrote:
> 
>> ----- On Jul 26, 2021, at 12:56 PM, rostedt [email protected] wrote:
>> 
>> > On Mon, 26 Jul 2021 11:46:41 -0400 (EDT)
>> > Mathieu Desnoyers <[email protected]> wrote:
>> [...]
>> >   

[...]

>> 
>> AFAIU, none of the synchronization mechanisms you refer to here (memory barrier,
>> IPIs..) will change the fact that this CPU may still be delayed across the
>> entire
>> 1->0->1 transition sequence, and may end up calling the new callback with the
>> old data. Unless an explicit RCU-sync is done.
> 
> OK. I see the issue you are saying. And this came from my assumption
> that the tracepoint code did a synchronization when unregistering the
> last callback. But of course it wont because that would make a lot of
> back to back synchronizations of a large number of tracepoints being
> unregistered at once.
> 

Something along the lines of the work approach you propose should work.

>> 
>> >   
>> >> 
>> >> My third conclusion is that we'd need synchronize RCU whenever tp_funcs[0].data
>> >> changes for transitions 1->2, 2->1, and 1->2 because the priorities don't
>> >> guarantee
>> >> that the first callback stays in the first position, and we also need to rcu
>> >> sync
>> >> unconditionally on transition 1->0. We currently only have sync RCU on
>> >> transition
>> >> from 2->1 when tp_funcs[0].func changes, which is bogus in many ways.
>> > 
>> > Going from 1 to 2, there's no issue. We switch to the iterator, which
>> > is the old method anyway. It looks directly at the array and matches
>> > the data with the func for each element of that array, and the data
>> > read initially (before calling the iterator) is ignored.
>> 
>> This relies on ordering guarantees between RCU assign/dereference and
>> static_call
>> updates/call. It may well be the case, but I'm asking anyway.
>> 
>> Are we guaranteed of the following ordering ?
>> 
>> CPU A                             CPU B
>> 
>>                                   static_call_update()
> 
> The static_call_update() triggers an IPI on all CPUs that perform a full memory
> barrier.
> 
> That is, nothing on any CPU will cross the static_call_update().
> 
>> y = rcu_dereference(x)            rcu_assign_pointer(x, ...)
>> do_static_call(y)
>> 
>> That load of "x" should never happen after the CPU fetches the new static call
>> instruction.
> 
> The 'y' will always be the new static call (which is the iterator in
> this case), and it doesn't matter which x it read, because the iterator
> will read the array just like it was done without static calls.

Here by "y" I mean the pointer loaded by rcu_dereference, which is then passed to the
call. I do not mean the call per se.

I suspect there is something like a control dependency between loading "x" through
rcu_dereference and passing the loaded pointer as argument to the static call (and
the instruction fetch of the static call). But I don't recall reading any documentation
of this specific ordering.

> 
>> 
>> Also, I suspect that transition 2->1 needs an unconditional rcu-sync because you
>> may have a sequence of 3->2->1 (or 1->2->1) where the element 0 data is
>> unchanged
>> between 2->1, but was changed from 3->2 (or from 1->2), which may be observed by
>> the
>> static call.
> 
> 
> I'll agree that we need to add synchronization between 1->0->1, but you
> have not convinced me on this second part.

In addition to 1->0->1, there are actually 2 other scenarios here:

Transition 1->2 to which the ordering question between RCU and static call is
related.

Transition 2->1 would need an unconditional rcu-sync, because of transitions
3->2->1 and 1->2->1 which can lead the static call to observe wrong data if the
rcu_dereference happens when there are e.g. 3 callbacks registered, and then the
static call to the function (single callback) is called on the 3-callbacks array.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] tracepoints: Update static_call before tp_funcs when adding a tracepoint
  2021-07-26 18:49       ` Steven Rostedt
  2021-07-26 19:12         ` Mathieu Desnoyers
@ 2021-07-27 11:44         ` Peter Zijlstra
  2021-07-27 13:46           ` Mathieu Desnoyers
  1 sibling, 1 reply; 8+ messages in thread
From: Peter Zijlstra @ 2021-07-27 11:44 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Mathieu Desnoyers, linux-kernel, linux-trace-devel, Ingo Molnar,
	Andrew Morton, Stefan Metzmacher, io-uring, paulmck

On Mon, Jul 26, 2021 at 02:49:03PM -0400, Steven Rostedt wrote:
> OK. I see the issue you are saying. And this came from my assumption
> that the tracepoint code did a synchronization when unregistering the
> last callback. But of course it wont because that would make a lot of
> back to back synchronizations of a large number of tracepoints being
> unregistered at once.
> 
> And doing it for all 0->1 or 1->0 or even a 1->0->1 can be costly.
> 
> One way to handle this is when going from 1->0, set off a worker that
> will do the synchronization asynchronously, and if a 0->1 comes in,
> have that block until the synchronization is complete. This should
> work, and not have too much of an overhead.
> 
> If one 1->0 starts the synchronization, and one or more 1->0
> transitions happen, it will be recorded where the worker will do
> another synchronization, to make sure all 1->0 have went through a full
> synchronization before a 0->1 can happen.
> 
> If a 0->1 comes in while a synchronization is happening, it will note
> the current "number" for the synchronizations (if another one is
> queued, it will wait for one more), before it can begin. As locks will
> be held while waiting for synchronizations to finish, we don't need to
> worry about another 1->0 coming in while a 0->1 is waiting.

Wouldn't get_state_synchronize_rcu() and cond_synchronize_rcu() get you
what you need?

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] tracepoints: Update static_call before tp_funcs when adding a tracepoint
  2021-07-27 11:44         ` Peter Zijlstra
@ 2021-07-27 13:46           ` Mathieu Desnoyers
  0 siblings, 0 replies; 8+ messages in thread
From: Mathieu Desnoyers @ 2021-07-27 13:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: rostedt, linux-kernel, linux-trace-devel, Ingo Molnar,
	Andrew Morton, Stefan Metzmacher, io-uring, paulmck

----- On Jul 27, 2021, at 7:44 AM, Peter Zijlstra [email protected] wrote:

> On Mon, Jul 26, 2021 at 02:49:03PM -0400, Steven Rostedt wrote:
>> OK. I see the issue you are saying. And this came from my assumption
>> that the tracepoint code did a synchronization when unregistering the
>> last callback. But of course it wont because that would make a lot of
>> back to back synchronizations of a large number of tracepoints being
>> unregistered at once.
>> 
>> And doing it for all 0->1 or 1->0 or even a 1->0->1 can be costly.
>> 
>> One way to handle this is when going from 1->0, set off a worker that
>> will do the synchronization asynchronously, and if a 0->1 comes in,
>> have that block until the synchronization is complete. This should
>> work, and not have too much of an overhead.
>> 
>> If one 1->0 starts the synchronization, and one or more 1->0
>> transitions happen, it will be recorded where the worker will do
>> another synchronization, to make sure all 1->0 have went through a full
>> synchronization before a 0->1 can happen.
>> 
>> If a 0->1 comes in while a synchronization is happening, it will note
>> the current "number" for the synchronizations (if another one is
>> queued, it will wait for one more), before it can begin. As locks will
>> be held while waiting for synchronizations to finish, we don't need to
>> worry about another 1->0 coming in while a 0->1 is waiting.
> 
> Wouldn't get_state_synchronize_rcu() and cond_synchronize_rcu() get you
> what you need?

Indeed, snapshotting the state and conditionally waiting for a grace period
if none happened since the snapshot appears to be the intent here. Using
get_state+cond_sync should allow us to do this without any additional worker
thread.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2021-07-27 13:46 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2021-07-23  2:33 [PATCH] tracepoints: Update static_call before tp_funcs when adding a tracepoint Steven Rostedt
2021-07-26 15:46 ` Mathieu Desnoyers
2021-07-26 16:56   ` Steven Rostedt
2021-07-26 17:39     ` Mathieu Desnoyers
2021-07-26 18:49       ` Steven Rostedt
2021-07-26 19:12         ` Mathieu Desnoyers
2021-07-27 11:44         ` Peter Zijlstra
2021-07-27 13:46           ` Mathieu Desnoyers

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox