* [PATCH 0/2] task_put batching
@ 2020-07-18  8:32 Pavel Begunkov
  2020-07-18  8:32 ` [PATCH 1/2] tasks: add put_task_struct_many() Pavel Begunkov
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Pavel Begunkov @ 2020-07-18  8:32 UTC (permalink / raw)
  To: Jens Axboe, io-uring, linux-kernel

For my somewhat exaggerated test case, perf continues to show high CPU
consumption by io_dismantle_req(), and hence by its caller
io_iopoll_complete(). Even though the patch doesn't yield a throughput
increase for my setup, probably because the effect is hidden behind
polling, it definitely improves the relative percentages. And the
difference should only grow with an increasing number of CPUs. Another
reason to have this is that the atomics may affect other parallel tasks
(e.g. ones that don't use io_uring).

before:
io_iopoll_complete: 5.29%
io_dismantle_req:   2.16%

after:
io_iopoll_complete: 3.39%
io_dismantle_req:   0.465%

Pavel Begunkov (2):
  tasks: add put_task_struct_many()
  io_uring: batch put_task_struct()

 fs/io_uring.c              | 28 ++++++++++++++++++++++++++--
 include/linux/sched/task.h |  6 ++++++
 2 files changed, 32 insertions(+), 2 deletions(-)

-- 
2.24.0

^ permalink raw reply	[flat|nested] 12+ messages in thread
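The idea of the series in a nutshell, as a rough sketch: only
put_task_struct() and put_task_struct_many() are real helpers here; the
two wrapper functions and their arguments are purely illustrative.

/* Before: one atomic op (refcount_dec_and_test()) per reference. */
static void put_refs_one_by_one(struct task_struct *task, int nr)
{
	int i;

	for (i = 0; i < nr; i++)
		put_task_struct(task);		/* nr atomic ops */
}

/* After: a single atomic op (refcount_sub_and_test()) for the run. */
static void put_refs_batched(struct task_struct *task, int nr)
{
	put_task_struct_many(task, nr);		/* 1 atomic op */
}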
* [PATCH 1/2] tasks: add put_task_struct_many()
  2020-07-18  8:32 [PATCH 0/2] task_put batching Pavel Begunkov
@ 2020-07-18  8:32 ` Pavel Begunkov
  2020-07-18  8:32 ` [PATCH 2/2] io_uring: batch put_task_struct() Pavel Begunkov
  2020-07-18 14:37 ` [PATCH 0/2] task_put batching Jens Axboe
  2 siblings, 0 replies; 12+ messages in thread
From: Pavel Begunkov @ 2020-07-18  8:32 UTC (permalink / raw)
  To: Jens Axboe, io-uring, linux-kernel

put_task_struct_many() is like put_task_struct(), but drops several
references at once. Useful for batching.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 include/linux/sched/task.h | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 38359071236a..1301077f9c24 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -126,6 +126,12 @@ static inline void put_task_struct(struct task_struct *t)
 		__put_task_struct(t);
 }
 
+static inline void put_task_struct_many(struct task_struct *t, int nr)
+{
+	if (refcount_sub_and_test(nr, &t->usage))
+		__put_task_struct(t);
+}
+
 void put_task_struct_rcu_user(struct task_struct *task);
 
 #ifdef CONFIG_ARCH_WANTS_DYNAMIC_TASK_STRUCT
-- 
2.24.0

^ permalink raw reply related	[flat|nested] 12+ messages in thread
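A minimal caller-side sketch of the intended use: coalesce runs of
requests that share a task and drop their references in one call. The
put_req_tasks() wrapper and its arguments are hypothetical, not part of
the patch; io_kiocb and req->task are from fs/io_uring.c.

/* Hypothetical caller: drop one task ref per completed request,
 * batching consecutive requests that belong to the same task. */
static void put_req_tasks(struct io_kiocb **reqs, int nr)
{
	struct task_struct *task = NULL;
	int refs = 0, i;

	for (i = 0; i < nr; i++) {
		if (reqs[i]->task != task) {
			if (task)
				put_task_struct_many(task, refs);
			task = reqs[i]->task;
			refs = 0;
		}
		refs++;
	}
	if (task)
		put_task_struct_many(task, refs);
}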
* [PATCH 2/2] io_uring: batch put_task_struct()
  2020-07-18  8:32 [PATCH 0/2] task_put batching Pavel Begunkov
  2020-07-18  8:32 ` [PATCH 1/2] tasks: add put_task_struct_many() Pavel Begunkov
@ 2020-07-18  8:32 ` Pavel Begunkov
  2020-07-18 14:37 ` [PATCH 0/2] task_put batching Jens Axboe
  2 siblings, 0 replies; 12+ messages in thread
From: Pavel Begunkov @ 2020-07-18  8:32 UTC (permalink / raw)
  To: Jens Axboe, io-uring, linux-kernel

As every iopoll request holds a task ref, it becomes expensive to put
them one by one; instead, we can put several at once by integrating
that into io_req_free_batch().

Signed-off-by: Pavel Begunkov <[email protected]>
---
 fs/io_uring.c | 28 ++++++++++++++++++++++++++--
 1 file changed, 26 insertions(+), 2 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 57e1f26b6a6b..b52aa0d8b09d 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1543,7 +1543,6 @@ static void io_dismantle_req(struct io_kiocb *req)
 		kfree(req->io);
 	if (req->file)
 		io_put_file(req, req->file, (req->flags & REQ_F_FIXED_FILE));
-	__io_put_req_task(req);
 	io_req_clean_work(req);
 
 	if (req->flags & REQ_F_INFLIGHT) {
@@ -1563,6 +1562,7 @@ static void __io_free_req(struct io_kiocb *req)
 	struct io_ring_ctx *ctx;
 
 	io_dismantle_req(req);
+	__io_put_req_task(req);
 	ctx = req->ctx;
 	if (likely(!io_is_fallback_req(req)))
 		kmem_cache_free(req_cachep, req);
@@ -1806,8 +1806,18 @@ static void io_free_req(struct io_kiocb *req)
 struct req_batch {
 	void *reqs[IO_IOPOLL_BATCH];
 	int to_free;
+
+	struct task_struct	*task;
+	int			task_refs;
 };
 
+static inline void io_init_req_batch(struct req_batch *rb)
+{
+	rb->to_free = 0;
+	rb->task_refs = 0;
+	rb->task = NULL;
+}
+
 static void __io_req_free_batch_flush(struct io_ring_ctx *ctx,
 				      struct req_batch *rb)
 {
@@ -1821,6 +1831,10 @@ static void io_req_free_batch_finish(struct io_ring_ctx *ctx,
 {
 	if (rb->to_free)
 		__io_req_free_batch_flush(ctx, rb);
+	if (rb->task) {
+		put_task_struct_many(rb->task, rb->task_refs);
+		rb->task = NULL;
+	}
 }
 
 static void io_req_free_batch(struct req_batch *rb, struct io_kiocb *req)
@@ -1832,6 +1846,16 @@ static void io_req_free_batch(struct req_batch *rb, struct io_kiocb *req)
 	if (req->flags & REQ_F_LINK_HEAD)
 		io_queue_next(req);
 
+	if (req->flags & REQ_F_TASK_PINNED) {
+		if (req->task != rb->task && rb->task) {
+			put_task_struct_many(rb->task, rb->task_refs);
+			rb->task = req->task;
+			rb->task_refs = 0;
+		}
+		rb->task_refs++;
+		req->flags &= ~REQ_F_TASK_PINNED;
+	}
+
 	io_dismantle_req(req);
 	rb->reqs[rb->to_free++] = req;
 	if (unlikely(rb->to_free == ARRAY_SIZE(rb->reqs)))
@@ -1977,7 +2001,7 @@ static void io_iopoll_complete(struct io_ring_ctx *ctx, unsigned int *nr_events,
 	/* order with ->result store in io_complete_rw_iopoll() */
 	smp_rmb();
 
-	rb.to_free = 0;
+	io_init_req_batch(&rb);
 	while (!list_empty(done)) {
 		int cflags = 0;
 
-- 
2.24.0

^ permalink raw reply related	[flat|nested] 12+ messages in thread
* Re: [PATCH 0/2] task_put batching
  2020-07-18  8:32 [PATCH 0/2] task_put batching Pavel Begunkov
  2020-07-18  8:32 ` [PATCH 1/2] tasks: add put_task_struct_many() Pavel Begunkov
  2020-07-18  8:32 ` [PATCH 2/2] io_uring: batch put_task_struct() Pavel Begunkov
@ 2020-07-18 14:37 ` Jens Axboe
  2020-07-19 11:15   ` Pavel Begunkov
  2020-07-20 15:22   ` Pavel Begunkov
  2 siblings, 2 replies; 12+ messages in thread
From: Jens Axboe @ 2020-07-18 14:37 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring, linux-kernel

On 7/18/20 2:32 AM, Pavel Begunkov wrote:
> For my somewhat exaggerated test case, perf continues to show high CPU
> consumption by io_dismantle_req(), and hence by its caller
> io_iopoll_complete(). Even though the patch doesn't yield a throughput
> increase for my setup, probably because the effect is hidden behind
> polling, it definitely improves the relative percentages. And the
> difference should only grow with an increasing number of CPUs. Another
> reason to have this is that the atomics may affect other parallel tasks
> (e.g. ones that don't use io_uring).
>
> before:
> io_iopoll_complete: 5.29%
> io_dismantle_req:   2.16%
>
> after:
> io_iopoll_complete: 3.39%
> io_dismantle_req:   0.465%

Still not seeing a win here, but it's clean and it _should_ work. For
some reason, the cycles saved on the task ref put end up showing as
growth in fput_many(). Which doesn't (on the surface) make a lot of
sense, but may just mean that we have some weird side effects.

I have applied it, thanks.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [PATCH 0/2] task_put batching
  2020-07-18 14:37 ` [PATCH 0/2] task_put batching Jens Axboe
@ 2020-07-19 11:15   ` Pavel Begunkov
  2020-07-19 18:49     ` Jens Axboe
  0 siblings, 1 reply; 12+ messages in thread
From: Pavel Begunkov @ 2020-07-19 11:15 UTC (permalink / raw)
  To: Jens Axboe, io-uring, linux-kernel

On 18/07/2020 17:37, Jens Axboe wrote:
> On 7/18/20 2:32 AM, Pavel Begunkov wrote:
>> For my somewhat exaggerated test case, perf continues to show high CPU
>> consumption by io_dismantle_req(), and hence by its caller
>> io_iopoll_complete(). Even though the patch doesn't yield a throughput
>> increase for my setup, probably because the effect is hidden behind
>> polling, it definitely improves the relative percentages. And the
>> difference should only grow with an increasing number of CPUs. Another
>> reason to have this is that the atomics may affect other parallel tasks
>> (e.g. ones that don't use io_uring).
>>
>> before:
>> io_iopoll_complete: 5.29%
>> io_dismantle_req:   2.16%
>>
>> after:
>> io_iopoll_complete: 3.39%
>> io_dismantle_req:   0.465%
>
> Still not seeing a win here, but it's clean and it _should_ work. For

Well, if this thing is useful, it'd be hard to quantify, because active
polling would hide it. I think it'd need to apply a lot of isolated
pressure on cache synchronisation (e.g. spamming it with barriers), or
to create and measure an atomic-heavy task pinned to another core. Not
worth the effort IMHO.

Just out of curiosity, let me ask how you test it:
- is it a VM?
- how many cores and threads do you use?
- how many io_uring instances do you have? One per thread?
- does it all go to a single NVMe SSD?

> some reason, the cycles saved on the task ref put end up showing as
> growth in fput_many(). Which doesn't (on the surface) make a lot of
> sense, but may just mean that we have some weird side effects.

I'll take a look and see whether I can reproduce it.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [PATCH 0/2] task_put batching
  2020-07-19 11:15   ` Pavel Begunkov
@ 2020-07-19 18:49     ` Jens Axboe
  2020-07-20 14:18       ` Pavel Begunkov
  0 siblings, 1 reply; 12+ messages in thread
From: Jens Axboe @ 2020-07-19 18:49 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring, linux-kernel

On 7/19/20 5:15 AM, Pavel Begunkov wrote:
> On 18/07/2020 17:37, Jens Axboe wrote:
>> On 7/18/20 2:32 AM, Pavel Begunkov wrote:
>>> For my somewhat exaggerated test case, perf continues to show high CPU
>>> consumption by io_dismantle_req(), and hence by its caller
>>> io_iopoll_complete(). Even though the patch doesn't yield a throughput
>>> increase for my setup, probably because the effect is hidden behind
>>> polling, it definitely improves the relative percentages. And the
>>> difference should only grow with an increasing number of CPUs. Another
>>> reason to have this is that the atomics may affect other parallel tasks
>>> (e.g. ones that don't use io_uring).
>>>
>>> before:
>>> io_iopoll_complete: 5.29%
>>> io_dismantle_req:   2.16%
>>>
>>> after:
>>> io_iopoll_complete: 3.39%
>>> io_dismantle_req:   0.465%
>>
>> Still not seeing a win here, but it's clean and it _should_ work. For
>
> Well, if this thing is useful, it'd be hard to quantify, because active
> polling would hide it. I think it'd need to apply a lot of isolated

It should be very visible in my setup, as we're CPU limited, not device
limited. Hence it's very easy to show CPU gains, as they directly
translate into improved performance.

> pressure on cache synchronisation (e.g. spamming it with barriers), or
> to create and measure an atomic-heavy task pinned to another core. Not
> worth the effort IMHO.
>
> Just out of curiosity, let me ask how you test it:
> - is it a VM?
> - how many cores and threads do you use?
> - how many io_uring instances do you have? One per thread?
> - does it all go to a single NVMe SSD?

It's not a VM, it's a normal box. I'm using just one CPU, one thread,
and just one NVMe device. That's my go-to test for seeing if we
reclaimed some CPU cycles.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [PATCH 0/2] task_put batching
  2020-07-19 18:49     ` Jens Axboe
@ 2020-07-20 14:18       ` Pavel Begunkov
  0 siblings, 0 replies; 12+ messages in thread
From: Pavel Begunkov @ 2020-07-20 14:18 UTC (permalink / raw)
  To: Jens Axboe, io-uring, linux-kernel

On 19/07/2020 21:49, Jens Axboe wrote:
> On 7/19/20 5:15 AM, Pavel Begunkov wrote:
>> On 18/07/2020 17:37, Jens Axboe wrote:
>>> On 7/18/20 2:32 AM, Pavel Begunkov wrote:
>>>> For my somewhat exaggerated test case, perf continues to show high CPU
>>>> consumption by io_dismantle_req(), and hence by its caller
>>>> io_iopoll_complete(). Even though the patch doesn't yield a throughput
>>>> increase for my setup, probably because the effect is hidden behind
>>>> polling, it definitely improves the relative percentages. And the
>>>> difference should only grow with an increasing number of CPUs. Another
>>>> reason to have this is that the atomics may affect other parallel tasks
>>>> (e.g. ones that don't use io_uring).
>>>>
>>>> before:
>>>> io_iopoll_complete: 5.29%
>>>> io_dismantle_req:   2.16%
>>>>
>>>> after:
>>>> io_iopoll_complete: 3.39%
>>>> io_dismantle_req:   0.465%
>>>
>>> Still not seeing a win here, but it's clean and it _should_ work. For
>>
>> Well, if this thing is useful, it'd be hard to quantify, because active
>> polling would hide it. I think it'd need to apply a lot of isolated
>
> It should be very visible in my setup, as we're CPU limited, not device
> limited. Hence it's very easy to show CPU gains, as they directly
> translate into improved performance.

IIRC, atomics on x86-64 don't hurt too much in a single thread.
Disregarding this patch, it would be good to have a multi-threaded
benchmark to keep an eye on scalability.

>> pressure on cache synchronisation (e.g. spamming it with barriers), or
>> to create and measure an atomic-heavy task pinned to another core. Not
>> worth the effort IMHO.
>>
>> Just out of curiosity, let me ask how you test it:
>> - is it a VM?
>> - how many cores and threads do you use?
>> - how many io_uring instances do you have? One per thread?
>> - does it all go to a single NVMe SSD?
>
> It's not a VM, it's a normal box. I'm using just one CPU, one thread,
> and just one NVMe device. That's my go-to test for seeing if we
> reclaimed some CPU cycles.

Got it, thanks

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [PATCH 0/2] task_put batching
  2020-07-18 14:37 ` [PATCH 0/2] task_put batching Jens Axboe
  2020-07-19 11:15   ` Pavel Begunkov
@ 2020-07-20 15:22   ` Pavel Begunkov
  2020-07-20 15:49     ` Jens Axboe
  1 sibling, 1 reply; 12+ messages in thread
From: Pavel Begunkov @ 2020-07-20 15:22 UTC (permalink / raw)
  To: Jens Axboe, io-uring, linux-kernel

On 18/07/2020 17:37, Jens Axboe wrote:
> On 7/18/20 2:32 AM, Pavel Begunkov wrote:
>> For my somewhat exaggerated test case, perf continues to show high CPU
>> consumption by io_dismantle_req(), and hence by its caller
>> io_iopoll_complete(). Even though the patch doesn't yield a throughput
>> increase for my setup, probably because the effect is hidden behind
>> polling, it definitely improves the relative percentages. And the
>> difference should only grow with an increasing number of CPUs. Another
>> reason to have this is that the atomics may affect other parallel tasks
>> (e.g. ones that don't use io_uring).
>>
>> before:
>> io_iopoll_complete: 5.29%
>> io_dismantle_req:   2.16%
>>
>> after:
>> io_iopoll_complete: 3.39%
>> io_dismantle_req:   0.465%
>
> Still not seeing a win here, but it's clean and it _should_ work. For
> some reason, the cycles saved on the task ref put end up showing as
> growth in fput_many(). Which doesn't (on the surface) make a lot of
> sense, but may just mean that we have some weird side effects.

It grows because the patch is garbage: the second condition is always
false. See the diff. Could you please drop both patches?

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 87a772eee0c4..2f02f85269eb 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1847,8 +1847,9 @@ static void io_req_free_batch(struct req_batch *rb, struct io_kiocb *req)
 		io_queue_next(req);
 
 	if (req->flags & REQ_F_TASK_PINNED) {
-		if (req->task != rb->task && rb->task) {
-			put_task_struct_many(rb->task, rb->task_refs);
+		if (req->task != rb->task) {
+			if (rb->task)
+				put_task_struct_many(rb->task, rb->task_refs);
 			rb->task = req->task;
 			rb->task_refs = 0;
 		}

-- 
Pavel Begunkov

^ permalink raw reply related	[flat|nested] 12+ messages in thread
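A walk-through of why the original check never fires, assuming rb was
freshly set up by io_init_req_batch():

/* After io_init_req_batch(): rb->task == NULL, rb->task_refs == 0.
 *
 * First request:
 *   req->task != rb->task  -> true  (req->task is non-NULL)
 *   ... && rb->task        -> false (rb->task is still NULL)
 *
 * The && short-circuits, rb->task is never assigned, every later
 * iteration sees the same NULL, and io_req_free_batch_finish() never
 * calls put_task_struct_many() -- the task references leak.
 */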
* Re: [PATCH 0/2] task_put batching
  2020-07-20 15:22   ` Pavel Begunkov
@ 2020-07-20 15:49     ` Jens Axboe
  2020-07-20 16:06       ` Pavel Begunkov
  0 siblings, 1 reply; 12+ messages in thread
From: Jens Axboe @ 2020-07-20 15:49 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring, linux-kernel

On 7/20/20 9:22 AM, Pavel Begunkov wrote:
> On 18/07/2020 17:37, Jens Axboe wrote:
>> On 7/18/20 2:32 AM, Pavel Begunkov wrote:
>>> For my somewhat exaggerated test case, perf continues to show high CPU
>>> consumption by io_dismantle_req(), and hence by its caller
>>> io_iopoll_complete(). Even though the patch doesn't yield a throughput
>>> increase for my setup, probably because the effect is hidden behind
>>> polling, it definitely improves the relative percentages. And the
>>> difference should only grow with an increasing number of CPUs. Another
>>> reason to have this is that the atomics may affect other parallel tasks
>>> (e.g. ones that don't use io_uring).
>>>
>>> before:
>>> io_iopoll_complete: 5.29%
>>> io_dismantle_req:   2.16%
>>>
>>> after:
>>> io_iopoll_complete: 3.39%
>>> io_dismantle_req:   0.465%
>>
>> Still not seeing a win here, but it's clean and it _should_ work. For
>> some reason, the cycles saved on the task ref put end up showing as
>> growth in fput_many(). Which doesn't (on the surface) make a lot of
>> sense, but may just mean that we have some weird side effects.
>
> It grows because the patch is garbage: the second condition is always
> false. See the diff. Could you please drop both patches?

Hah, indeed. With this on top, it looks the way it should in terms of
performance and profiles.

I can just fold this into the existing one, if you'd like.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [PATCH 0/2] task_put batching
  2020-07-20 15:49     ` Jens Axboe
@ 2020-07-20 16:06       ` Pavel Begunkov
  2020-07-20 16:11         ` Jens Axboe
  0 siblings, 1 reply; 12+ messages in thread
From: Pavel Begunkov @ 2020-07-20 16:06 UTC (permalink / raw)
  To: Jens Axboe, io-uring, linux-kernel

On 20/07/2020 18:49, Jens Axboe wrote:
> On 7/20/20 9:22 AM, Pavel Begunkov wrote:
>> On 18/07/2020 17:37, Jens Axboe wrote:
>>> On 7/18/20 2:32 AM, Pavel Begunkov wrote:
>>>> For my somewhat exaggerated test case, perf continues to show high CPU
>>>> consumption by io_dismantle_req(), and hence by its caller
>>>> io_iopoll_complete(). Even though the patch doesn't yield a throughput
>>>> increase for my setup, probably because the effect is hidden behind
>>>> polling, it definitely improves the relative percentages. And the
>>>> difference should only grow with an increasing number of CPUs. Another
>>>> reason to have this is that the atomics may affect other parallel tasks
>>>> (e.g. ones that don't use io_uring).
>>>>
>>>> before:
>>>> io_iopoll_complete: 5.29%
>>>> io_dismantle_req:   2.16%
>>>>
>>>> after:
>>>> io_iopoll_complete: 3.39%
>>>> io_dismantle_req:   0.465%
>>>
>>> Still not seeing a win here, but it's clean and it _should_ work. For
>>> some reason, the cycles saved on the task ref put end up showing as
>>> growth in fput_many(). Which doesn't (on the surface) make a lot of
>>> sense, but may just mean that we have some weird side effects.
>>
>> It grows because the patch is garbage: the second condition is always
>> false. See the diff. Could you please drop both patches?
>
> Hah, indeed. With this on top, it looks the way it should in terms of
> performance and profiles.

It just shows that it doesn't really matter for a single-threaded app,
as expected. Worth throwing some contention at it, though. I'll think
about finding some time to get/borrow a multi-threaded one.

> I can just fold this into the existing one, if you'd like.

Would be nice. I'm going to double-check the counter and re-measure
anyway. BTW, how did you find it? A tool or a proc file would be
awesome.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [PATCH 0/2] task_put batching
  2020-07-20 16:06       ` Pavel Begunkov
@ 2020-07-20 16:11         ` Jens Axboe
  2020-07-20 16:42           ` Pavel Begunkov
  0 siblings, 1 reply; 12+ messages in thread
From: Jens Axboe @ 2020-07-20 16:11 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring, linux-kernel

On 7/20/20 10:06 AM, Pavel Begunkov wrote:
> On 20/07/2020 18:49, Jens Axboe wrote:
>> On 7/20/20 9:22 AM, Pavel Begunkov wrote:
>>> On 18/07/2020 17:37, Jens Axboe wrote:
>>>> On 7/18/20 2:32 AM, Pavel Begunkov wrote:
>>>>> For my somewhat exaggerated test case, perf continues to show high CPU
>>>>> consumption by io_dismantle_req(), and hence by its caller
>>>>> io_iopoll_complete(). Even though the patch doesn't yield a throughput
>>>>> increase for my setup, probably because the effect is hidden behind
>>>>> polling, it definitely improves the relative percentages. And the
>>>>> difference should only grow with an increasing number of CPUs. Another
>>>>> reason to have this is that the atomics may affect other parallel tasks
>>>>> (e.g. ones that don't use io_uring).
>>>>>
>>>>> before:
>>>>> io_iopoll_complete: 5.29%
>>>>> io_dismantle_req:   2.16%
>>>>>
>>>>> after:
>>>>> io_iopoll_complete: 3.39%
>>>>> io_dismantle_req:   0.465%
>>>>
>>>> Still not seeing a win here, but it's clean and it _should_ work. For
>>>> some reason, the cycles saved on the task ref put end up showing as
>>>> growth in fput_many(). Which doesn't (on the surface) make a lot of
>>>> sense, but may just mean that we have some weird side effects.
>>>
>>> It grows because the patch is garbage: the second condition is always
>>> false. See the diff. Could you please drop both patches?
>>
>> Hah, indeed. With this on top, it looks the way it should in terms of
>> performance and profiles.
>
> It just shows that it doesn't really matter for a single-threaded app,
> as expected. Worth throwing some contention at it, though. I'll think
> about finding some time to get/borrow a multi-threaded one.

But it kind of did matter here: it ended up being mostly a wash in
terms of perf, as my testing reported. With the incremental applied,
it's up a bit over where it was before the task put batching.

>> I can just fold this into the existing one, if you'd like.
>
> Would be nice. I'm going to double-check the counter and re-measure
> anyway. BTW, how did you find it? A tool or a proc file would be
> awesome.

For this kind of testing, I just use t/io_uring out of fio. It's
probably the lowest overhead kind of tool:

# sudo taskset -c 0 t/io_uring -b512 -p1 /dev/nvme2n1

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [PATCH 0/2] task_put batching
  2020-07-20 16:11         ` Jens Axboe
@ 2020-07-20 16:42           ` Pavel Begunkov
  0 siblings, 0 replies; 12+ messages in thread
From: Pavel Begunkov @ 2020-07-20 16:42 UTC (permalink / raw)
  To: Jens Axboe, io-uring, linux-kernel

On 20/07/2020 19:11, Jens Axboe wrote:
> On 7/20/20 10:06 AM, Pavel Begunkov wrote:
>> On 20/07/2020 18:49, Jens Axboe wrote:
>>> On 7/20/20 9:22 AM, Pavel Begunkov wrote:
>>>> On 18/07/2020 17:37, Jens Axboe wrote:
>>>>> On 7/18/20 2:32 AM, Pavel Begunkov wrote:
>>>>>> For my somewhat exaggerated test case, perf continues to show high CPU
>>>>>> consumption by io_dismantle_req(), and hence by its caller
>>>>>> io_iopoll_complete(). Even though the patch doesn't yield a throughput
>>>>>> increase for my setup, probably because the effect is hidden behind
>>>>>> polling, it definitely improves the relative percentages. And the
>>>>>> difference should only grow with an increasing number of CPUs. Another
>>>>>> reason to have this is that the atomics may affect other parallel tasks
>>>>>> (e.g. ones that don't use io_uring).
>>>>>>
>>>>>> before:
>>>>>> io_iopoll_complete: 5.29%
>>>>>> io_dismantle_req:   2.16%
>>>>>>
>>>>>> after:
>>>>>> io_iopoll_complete: 3.39%
>>>>>> io_dismantle_req:   0.465%
>>>>>
>>>>> Still not seeing a win here, but it's clean and it _should_ work. For
>>>>> some reason, the cycles saved on the task ref put end up showing as
>>>>> growth in fput_many(). Which doesn't (on the surface) make a lot of
>>>>> sense, but may just mean that we have some weird side effects.
>>>>
>>>> It grows because the patch is garbage: the second condition is always
>>>> false. See the diff. Could you please drop both patches?
>>>
>>> Hah, indeed. With this on top, it looks the way it should in terms of
>>> performance and profiles.
>>
>> It just shows that it doesn't really matter for a single-threaded app,
>> as expected. Worth throwing some contention at it, though. I'll think
>> about finding some time to get/borrow a multi-threaded one.
>
> But it kind of did matter here: it ended up being mostly a wash in
> terms of perf, as my testing reported. With the incremental applied,
> it's up a bit over where it was before the task put batching.

Hmm, I need to get used to the sensitivity of your box, that's a good
one! Do you mean that the buggy version, without the atomics batched,
was on par with not having the patch at all, but the fixed/updated one
is a bit faster? Sounds like micro binary differences, e.g. slightly
altered jumps.

It'd also be interesting to know what degree of coalescing in
io_iopoll_complete() you manage to get with that.

>>> I can just fold this into the existing one, if you'd like.
>>
>> Would be nice. I'm going to double-check the counter and re-measure
>> anyway. BTW, how did you find it? A tool or a proc file would be
>> awesome.
>
> For this kind of testing, I just use t/io_uring out of fio. It's
> probably the lowest overhead kind of tool:
>
> # sudo taskset -c 0 t/io_uring -b512 -p1 /dev/nvme2n1

I use io_uring-bench.c from time to time, but didn't know it continued
living under fio/t/. Thanks! I also put it under cshield for more
consistency, but it looks like io-wq ignores that.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 12+ messages in thread
end of thread, other threads: [~2020-07-20 16:44 UTC | newest]

Thread overview: 12+ messages
2020-07-18  8:32 [PATCH 0/2] task_put batching Pavel Begunkov
2020-07-18  8:32 ` [PATCH 1/2] tasks: add put_task_struct_many() Pavel Begunkov
2020-07-18  8:32 ` [PATCH 2/2] io_uring: batch put_task_struct() Pavel Begunkov
2020-07-18 14:37 ` [PATCH 0/2] task_put batching Jens Axboe
2020-07-19 11:15   ` Pavel Begunkov
2020-07-19 18:49     ` Jens Axboe
2020-07-20 14:18       ` Pavel Begunkov
2020-07-20 15:22   ` Pavel Begunkov
2020-07-20 15:49     ` Jens Axboe
2020-07-20 16:06       ` Pavel Begunkov
2020-07-20 16:11         ` Jens Axboe
2020-07-20 16:42           ` Pavel Begunkov