[PATCH 6.4 800/800] io_uring: Use io_schedule* in cqring wait

public inbox for [email protected]
 help / color / mirror / Atom feed

* [PATCH 6.4 800/800] io_uring: Use io_schedule* in cqring wait
       [not found] <[email protected]>
@ 2023-07-16 19:50 ` Greg Kroah-Hartman
  2023-07-23  9:39   ` Oleksandr Natalenko
  0 siblings, 1 reply; 14+ messages in thread
From: Greg Kroah-Hartman @ 2023-07-16 19:50 UTC (permalink / raw)
  To: stable
  Cc: Greg Kroah-Hartman, patches, Pavel Begunkov, io-uring,
	linux-kernel, Andres Freund, Jens Axboe

From: Andres Freund <[email protected]>

commit 8a796565cec3601071cbbd27d6304e202019d014 upstream.

I observed poor performance of io_uring compared to synchronous IO. That
turns out to be caused by deeper CPU idle states entered with io_uring,
due to io_uring using plain schedule(), whereas synchronous IO uses
io_schedule().

The losses due to this are substantial. On my cascade lake workstation,
t/io_uring from the fio repository e.g. yields regressions between 20%
and 40% with the following command:
./t/io_uring -r 5 -X0 -d 1 -s 1 -c 1 -p 0 -S$use_sync -R 0 /mnt/t2/fio/write.0.0

This is repeatable with different filesystems, using raw block devices
and using different block devices.

Use io_schedule_prepare() / io_schedule_finish() in
io_cqring_wait_schedule() to address the difference.

After that using io_uring is on par or surpassing synchronous IO (using
registered files etc makes it reliably win, but arguably is a less fair
comparison).

There are other calls to schedule() in io_uring/, but none immediately
jump out to be similarly situated, so I did not touch them. Similarly,
it's possible that mutex_lock_io() should be used, but it's not clear if
there are cases where that matters.

Cc: [email protected] # 5.10+
Cc: Pavel Begunkov <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Andres Freund <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[axboe: minor style fixup]
Signed-off-by: Jens Axboe <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
 io_uring/io_uring.c |   15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -2575,6 +2575,8 @@ int io_run_task_work_sig(struct io_ring_
 static inline int io_cqring_wait_schedule(struct io_ring_ctx *ctx,
 					  struct io_wait_queue *iowq)
 {
+	int token, ret;
+
 	if (unlikely(READ_ONCE(ctx->check_cq)))
 		return 1;
 	if (unlikely(!llist_empty(&ctx->work_llist)))
@@ -2585,11 +2587,20 @@ static inline int io_cqring_wait_schedul
 		return -EINTR;
 	if (unlikely(io_should_wake(iowq)))
 		return 0;
+
+	/*
+	 * Use io_schedule_prepare/finish, so cpufreq can take into account
+	 * that the task is waiting for IO - turns out to be important for low
+	 * QD IO.
+	 */
+	token = io_schedule_prepare();
+	ret = 0;
 	if (iowq->timeout == KTIME_MAX)
 		schedule();
 	else if (!schedule_hrtimeout(&iowq->timeout, HRTIMER_MODE_ABS))
-		return -ETIME;
-	return 0;
+		ret = -ETIME;
+	io_schedule_finish(token);
+	return ret;
 }
 
 /*



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 6.4 800/800] io_uring: Use io_schedule* in cqring wait
  2023-07-16 19:50 ` [PATCH 6.4 800/800] io_uring: Use io_schedule* in cqring wait Greg Kroah-Hartman
@ 2023-07-23  9:39   ` Oleksandr Natalenko
  2023-07-23 10:50     ` Greg Kroah-Hartman
                       ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Oleksandr Natalenko @ 2023-07-23  9:39 UTC (permalink / raw)
  To: stable
  Cc: Greg Kroah-Hartman, Pavel Begunkov, io-uring, linux-kernel,
	Andres Freund, Jens Axboe

[-- Attachment #1: Type: text/plain, Size: 3328 bytes --]

Hello.

On neděle 16. července 2023 21:50:53 CEST Greg Kroah-Hartman wrote:
> From: Andres Freund <[email protected]>
> 
> commit 8a796565cec3601071cbbd27d6304e202019d014 upstream.
> 
> I observed poor performance of io_uring compared to synchronous IO. That
> turns out to be caused by deeper CPU idle states entered with io_uring,
> due to io_uring using plain schedule(), whereas synchronous IO uses
> io_schedule().
> 
> The losses due to this are substantial. On my cascade lake workstation,
> t/io_uring from the fio repository e.g. yields regressions between 20%
> and 40% with the following command:
> ./t/io_uring -r 5 -X0 -d 1 -s 1 -c 1 -p 0 -S$use_sync -R 0 /mnt/t2/fio/write.0.0
> 
> This is repeatable with different filesystems, using raw block devices
> and using different block devices.
> 
> Use io_schedule_prepare() / io_schedule_finish() in
> io_cqring_wait_schedule() to address the difference.
> 
> After that using io_uring is on par or surpassing synchronous IO (using
> registered files etc makes it reliably win, but arguably is a less fair
> comparison).
> 
> There are other calls to schedule() in io_uring/, but none immediately
> jump out to be similarly situated, so I did not touch them. Similarly,
> it's possible that mutex_lock_io() should be used, but it's not clear if
> there are cases where that matters.
> 
> Cc: [email protected] # 5.10+
> Cc: Pavel Begunkov <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Signed-off-by: Andres Freund <[email protected]>
> Link: https://lore.kernel.org/r/[email protected]
> [axboe: minor style fixup]
> Signed-off-by: Jens Axboe <[email protected]>
> Signed-off-by: Greg Kroah-Hartman <[email protected]>
> ---
>  io_uring/io_uring.c |   15 +++++++++++++--
>  1 file changed, 13 insertions(+), 2 deletions(-)
> 
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -2575,6 +2575,8 @@ int io_run_task_work_sig(struct io_ring_
>  static inline int io_cqring_wait_schedule(struct io_ring_ctx *ctx,
>  					  struct io_wait_queue *iowq)
>  {
> +	int token, ret;
> +
>  	if (unlikely(READ_ONCE(ctx->check_cq)))
>  		return 1;
>  	if (unlikely(!llist_empty(&ctx->work_llist)))
> @@ -2585,11 +2587,20 @@ static inline int io_cqring_wait_schedul
>  		return -EINTR;
>  	if (unlikely(io_should_wake(iowq)))
>  		return 0;
> +
> +	/*
> +	 * Use io_schedule_prepare/finish, so cpufreq can take into account
> +	 * that the task is waiting for IO - turns out to be important for low
> +	 * QD IO.
> +	 */
> +	token = io_schedule_prepare();
> +	ret = 0;
>  	if (iowq->timeout == KTIME_MAX)
>  		schedule();
>  	else if (!schedule_hrtimeout(&iowq->timeout, HRTIMER_MODE_ABS))
> -		return -ETIME;
> -	return 0;
> +		ret = -ETIME;
> +	io_schedule_finish(token);
> +	return ret;
>  }
>  
>  /*

Reportedly, this caused a regression as reported in [1] [2] [3]. Not only v6.4.4 is affected, v6.1.39 is affected too.

Reverting this commit fixes the issue.

Please check.

Thanks.

[1] https://bbs.archlinux.org/viewtopic.php?id=287343
[2] https://bugzilla.kernel.org/show_bug.cgi?id=217700
[3] https://bugzilla.kernel.org/show_bug.cgi?id=217699

-- 
Oleksandr Natalenko (post-factum)

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 6.4 800/800] io_uring: Use io_schedule* in cqring wait
  2023-07-23  9:39   ` Oleksandr Natalenko
@ 2023-07-23 10:50     ` Greg Kroah-Hartman
  2023-07-23 10:55       ` Genes Lists
  2023-07-23 10:56       ` Oleksandr Natalenko
  2023-07-23 12:11     ` Linux regression tracking #adding (Thorsten Leemhuis)
  2023-07-23 15:31     ` Jens Axboe
  2 siblings, 2 replies; 14+ messages in thread
From: Greg Kroah-Hartman @ 2023-07-23 10:50 UTC (permalink / raw)
  To: Oleksandr Natalenko
  Cc: stable, Pavel Begunkov, io-uring, linux-kernel, Andres Freund,
	Jens Axboe

On Sun, Jul 23, 2023 at 11:39:42AM +0200, Oleksandr Natalenko wrote:
> Hello.
> 
> On neděle 16. července 2023 21:50:53 CEST Greg Kroah-Hartman wrote:
> > From: Andres Freund <[email protected]>
> > 
> > commit 8a796565cec3601071cbbd27d6304e202019d014 upstream.
> > 
> > I observed poor performance of io_uring compared to synchronous IO. That
> > turns out to be caused by deeper CPU idle states entered with io_uring,
> > due to io_uring using plain schedule(), whereas synchronous IO uses
> > io_schedule().
> > 
> > The losses due to this are substantial. On my cascade lake workstation,
> > t/io_uring from the fio repository e.g. yields regressions between 20%
> > and 40% with the following command:
> > ./t/io_uring -r 5 -X0 -d 1 -s 1 -c 1 -p 0 -S$use_sync -R 0 /mnt/t2/fio/write.0.0
> > 
> > This is repeatable with different filesystems, using raw block devices
> > and using different block devices.
> > 
> > Use io_schedule_prepare() / io_schedule_finish() in
> > io_cqring_wait_schedule() to address the difference.
> > 
> > After that using io_uring is on par or surpassing synchronous IO (using
> > registered files etc makes it reliably win, but arguably is a less fair
> > comparison).
> > 
> > There are other calls to schedule() in io_uring/, but none immediately
> > jump out to be similarly situated, so I did not touch them. Similarly,
> > it's possible that mutex_lock_io() should be used, but it's not clear if
> > there are cases where that matters.
> > 
> > Cc: [email protected] # 5.10+
> > Cc: Pavel Begunkov <[email protected]>
> > Cc: [email protected]
> > Cc: [email protected]
> > Signed-off-by: Andres Freund <[email protected]>
> > Link: https://lore.kernel.org/r/[email protected]
> > [axboe: minor style fixup]
> > Signed-off-by: Jens Axboe <[email protected]>
> > Signed-off-by: Greg Kroah-Hartman <[email protected]>
> > ---
> >  io_uring/io_uring.c |   15 +++++++++++++--
> >  1 file changed, 13 insertions(+), 2 deletions(-)
> > 
> > --- a/io_uring/io_uring.c
> > +++ b/io_uring/io_uring.c
> > @@ -2575,6 +2575,8 @@ int io_run_task_work_sig(struct io_ring_
> >  static inline int io_cqring_wait_schedule(struct io_ring_ctx *ctx,
> >  					  struct io_wait_queue *iowq)
> >  {
> > +	int token, ret;
> > +
> >  	if (unlikely(READ_ONCE(ctx->check_cq)))
> >  		return 1;
> >  	if (unlikely(!llist_empty(&ctx->work_llist)))
> > @@ -2585,11 +2587,20 @@ static inline int io_cqring_wait_schedul
> >  		return -EINTR;
> >  	if (unlikely(io_should_wake(iowq)))
> >  		return 0;
> > +
> > +	/*
> > +	 * Use io_schedule_prepare/finish, so cpufreq can take into account
> > +	 * that the task is waiting for IO - turns out to be important for low
> > +	 * QD IO.
> > +	 */
> > +	token = io_schedule_prepare();
> > +	ret = 0;
> >  	if (iowq->timeout == KTIME_MAX)
> >  		schedule();
> >  	else if (!schedule_hrtimeout(&iowq->timeout, HRTIMER_MODE_ABS))
> > -		return -ETIME;
> > -	return 0;
> > +		ret = -ETIME;
> > +	io_schedule_finish(token);
> > +	return ret;
> >  }
> >  
> >  /*
> 
> Reportedly, this caused a regression as reported in [1] [2] [3]. Not only v6.4.4 is affected, v6.1.39 is affected too.
> 
> Reverting this commit fixes the issue.
> 
> Please check.

Is this also an issue in 6.5-rc2?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 6.4 800/800] io_uring: Use io_schedule* in cqring wait
  2023-07-23 10:50     ` Greg Kroah-Hartman
@ 2023-07-23 10:55       ` Genes Lists
  2023-07-23 10:56       ` Oleksandr Natalenko
  1 sibling, 0 replies; 14+ messages in thread
From: Genes Lists @ 2023-07-23 10:55 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Oleksandr Natalenko
  Cc: stable, Pavel Begunkov, io-uring, linux-kernel, Andres Freund,
	Jens Axboe

On 7/23/23 06:50, Greg Kroah-Hartman wrote:
> On Sun, Jul 23, 2023 at 11:39:42AM +0200, Oleksandr Natalenko wrote:
>> Hello.
>>
>> On neděle 16. července 2023 21:50:53 CEST Greg Kroah-Hartman wrote:
>>> From: Andres Freund <[email protected]>
>>>
>>> commit 8a796565cec3601071cbbd27d6304e202019d014 upstream.
>>>
>>> I observed poor performance of io_uring compared to synchronous IO. That
...
>>
>> Reportedly, this caused a regression as reported in [1] [2] [3]. Not only v6.4.4 is affected, v6.1.39 is affected too.
>>
>> Reverting this commit fixes the issue.
>>
>> Please check.
> 
> Is this also an issue in 6.5-rc2?
> 
> thanks,
> 
> greg k-h

Yes - I can confirm this issue is in 6.5-rc2 and with Linus' commit 
c2782531397f5cb19ca3f8f9c17727f1cdf5bee8.


gene


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 6.4 800/800] io_uring: Use io_schedule* in cqring wait
  2023-07-23 10:50     ` Greg Kroah-Hartman
  2023-07-23 10:55       ` Genes Lists
@ 2023-07-23 10:56       ` Oleksandr Natalenko
  1 sibling, 0 replies; 14+ messages in thread
From: Oleksandr Natalenko @ 2023-07-23 10:56 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: stable, Pavel Begunkov, io-uring, linux-kernel, Andres Freund,
	Jens Axboe

[-- Attachment #1: Type: text/plain, Size: 3792 bytes --]

On neděle 23. července 2023 12:50:30 CEST Greg Kroah-Hartman wrote:
> On Sun, Jul 23, 2023 at 11:39:42AM +0200, Oleksandr Natalenko wrote:
> > Hello.
> > 
> > On neděle 16. července 2023 21:50:53 CEST Greg Kroah-Hartman wrote:
> > > From: Andres Freund <[email protected]>
> > > 
> > > commit 8a796565cec3601071cbbd27d6304e202019d014 upstream.
> > > 
> > > I observed poor performance of io_uring compared to synchronous IO. That
> > > turns out to be caused by deeper CPU idle states entered with io_uring,
> > > due to io_uring using plain schedule(), whereas synchronous IO uses
> > > io_schedule().
> > > 
> > > The losses due to this are substantial. On my cascade lake workstation,
> > > t/io_uring from the fio repository e.g. yields regressions between 20%
> > > and 40% with the following command:
> > > ./t/io_uring -r 5 -X0 -d 1 -s 1 -c 1 -p 0 -S$use_sync -R 0 /mnt/t2/fio/write.0.0
> > > 
> > > This is repeatable with different filesystems, using raw block devices
> > > and using different block devices.
> > > 
> > > Use io_schedule_prepare() / io_schedule_finish() in
> > > io_cqring_wait_schedule() to address the difference.
> > > 
> > > After that using io_uring is on par or surpassing synchronous IO (using
> > > registered files etc makes it reliably win, but arguably is a less fair
> > > comparison).
> > > 
> > > There are other calls to schedule() in io_uring/, but none immediately
> > > jump out to be similarly situated, so I did not touch them. Similarly,
> > > it's possible that mutex_lock_io() should be used, but it's not clear if
> > > there are cases where that matters.
> > > 
> > > Cc: [email protected] # 5.10+
> > > Cc: Pavel Begunkov <[email protected]>
> > > Cc: [email protected]
> > > Cc: [email protected]
> > > Signed-off-by: Andres Freund <[email protected]>
> > > Link: https://lore.kernel.org/r/[email protected]
> > > [axboe: minor style fixup]
> > > Signed-off-by: Jens Axboe <[email protected]>
> > > Signed-off-by: Greg Kroah-Hartman <[email protected]>
> > > ---
> > >  io_uring/io_uring.c |   15 +++++++++++++--
> > >  1 file changed, 13 insertions(+), 2 deletions(-)
> > > 
> > > --- a/io_uring/io_uring.c
> > > +++ b/io_uring/io_uring.c
> > > @@ -2575,6 +2575,8 @@ int io_run_task_work_sig(struct io_ring_
> > >  static inline int io_cqring_wait_schedule(struct io_ring_ctx *ctx,
> > >  					  struct io_wait_queue *iowq)
> > >  {
> > > +	int token, ret;
> > > +
> > >  	if (unlikely(READ_ONCE(ctx->check_cq)))
> > >  		return 1;
> > >  	if (unlikely(!llist_empty(&ctx->work_llist)))
> > > @@ -2585,11 +2587,20 @@ static inline int io_cqring_wait_schedul
> > >  		return -EINTR;
> > >  	if (unlikely(io_should_wake(iowq)))
> > >  		return 0;
> > > +
> > > +	/*
> > > +	 * Use io_schedule_prepare/finish, so cpufreq can take into account
> > > +	 * that the task is waiting for IO - turns out to be important for low
> > > +	 * QD IO.
> > > +	 */
> > > +	token = io_schedule_prepare();
> > > +	ret = 0;
> > >  	if (iowq->timeout == KTIME_MAX)
> > >  		schedule();
> > >  	else if (!schedule_hrtimeout(&iowq->timeout, HRTIMER_MODE_ABS))
> > > -		return -ETIME;
> > > -	return 0;
> > > +		ret = -ETIME;
> > > +	io_schedule_finish(token);
> > > +	return ret;
> > >  }
> > >  
> > >  /*
> > 
> > Reportedly, this caused a regression as reported in [1] [2] [3]. Not only v6.4.4 is affected, v6.1.39 is affected too.
> > 
> > Reverting this commit fixes the issue.
> > 
> > Please check.
> 
> Is this also an issue in 6.5-rc2?

As per [1], yes.

[1] https://bugzilla.kernel.org/show_bug.cgi?id=217699#c4

> 
> thanks,
> 
> greg k-h
> 


-- 
Oleksandr Natalenko (post-factum)

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 6.4 800/800] io_uring: Use io_schedule* in cqring wait
  2023-07-23  9:39   ` Oleksandr Natalenko
  2023-07-23 10:50     ` Greg Kroah-Hartman
@ 2023-07-23 12:11     ` Linux regression tracking #adding (Thorsten Leemhuis)
  2023-07-23 17:35       ` Linux regression tracking #update (Thorsten Leemhuis)
  2023-07-23 15:31     ` Jens Axboe
  2 siblings, 1 reply; 14+ messages in thread
From: Linux regression tracking #adding (Thorsten Leemhuis) @ 2023-07-23 12:11 UTC (permalink / raw)
  To: Oleksandr Natalenko, stable
  Cc: Greg Kroah-Hartman, Pavel Begunkov, io-uring, linux-kernel,
	Andres Freund, Jens Axboe, Linux kernel regressions list

[CCing the regression list, as it should be in the loop for regressions:
https://docs.kernel.org/admin-guide/reporting-regressions.html]

[TLDR: I'm adding this report to the list of tracked Linux kernel
regressions; the text you find below is based on a few templates
paragraphs you might have encountered already in similar form.
See link in footer if these mails annoy you.]

On 23.07.23 11:39, Oleksandr Natalenko wrote:
> On neděle 16. července 2023 21:50:53 CEST Greg Kroah-Hartman wrote:
>> From: Andres Freund <[email protected]>
>>
>> commit 8a796565cec3601071cbbd27d6304e202019d014 upstream.
>>
>> I observed poor performance of io_uring compared to synchronous IO. That
>> turns out to be caused by deeper CPU idle states entered with io_uring,
>> due to io_uring using plain schedule(), whereas synchronous IO uses
>> io_schedule().
>>
>> The losses due to this are substantial. On my cascade lake workstation,
>> t/io_uring from the fio repository e.g. yields regressions between 20%
>> and 40% with the following command:
>> ./t/io_uring -r 5 -X0 -d 1 -s 1 -c 1 -p 0 -S$use_sync -R 0 /mnt/t2/fio/write.0.0
>>
> 
> Reportedly, this caused a regression as reported in [1] [2] [3]. Not only v6.4.4 is affected, v6.1.39 is affected too.
> 
> Reverting this commit fixes the issue.
> 
> Please check.
> 
> Thanks.
> 
> [1] https://bbs.archlinux.org/viewtopic.php?id=287343
> [2] https://bugzilla.kernel.org/show_bug.cgi?id=217700
> [3] https://bugzilla.kernel.org/show_bug.cgi?id=217699

Thanks for the report. To be sure the issue doesn't fall through the
cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
tracking bot:

#regzbot introduced 8a796565cec360107 ^
https://bbs.archlinux.org/viewtopic.php?id=287343
https://bugzilla.kernel.org/show_bug.cgi?id=217700
https://bugzilla.kernel.org/show_bug.cgi?id=217699
#regzbot title block: io_uring: high iowait rates and stalls
#regzbot ignore-activity

This isn't a regression? This issue or a fix for it are already
discussed somewhere else? It was fixed already? You want to clarify when
the regression started to happen? Or point out I got the title or
something else totally wrong? Then just reply and tell me -- ideally
while also telling regzbot about it, as explained by the page listed in
the footer of this mail.

Developers: When fixing the issue, remember to add 'Link:' tags pointing
to the report (the parent of this mail). See page linked in footer for
details.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 6.4 800/800] io_uring: Use io_schedule* in cqring wait
  2023-07-23  9:39   ` Oleksandr Natalenko
  2023-07-23 10:50     ` Greg Kroah-Hartman
  2023-07-23 12:11     ` Linux regression tracking #adding (Thorsten Leemhuis)
@ 2023-07-23 15:31     ` Jens Axboe
  2023-07-23 17:43       ` Genes Lists
  2 siblings, 1 reply; 14+ messages in thread
From: Jens Axboe @ 2023-07-23 15:31 UTC (permalink / raw)
  To: Oleksandr Natalenko, stable
  Cc: Greg Kroah-Hartman, Pavel Begunkov, io-uring, linux-kernel,
	Andres Freund

On 7/23/23 3:39?AM, Oleksandr Natalenko wrote:
> Hello.
> 
> On ned?le 16. ?ervence 2023 21:50:53 CEST Greg Kroah-Hartman wrote:
>> From: Andres Freund <[email protected]>
>>
>> commit 8a796565cec3601071cbbd27d6304e202019d014 upstream.
>>
>> I observed poor performance of io_uring compared to synchronous IO. That
>> turns out to be caused by deeper CPU idle states entered with io_uring,
>> due to io_uring using plain schedule(), whereas synchronous IO uses
>> io_schedule().
>>
>> The losses due to this are substantial. On my cascade lake workstation,
>> t/io_uring from the fio repository e.g. yields regressions between 20%
>> and 40% with the following command:
>> ./t/io_uring -r 5 -X0 -d 1 -s 1 -c 1 -p 0 -S$use_sync -R 0 /mnt/t2/fio/write.0.0
>>
>> This is repeatable with different filesystems, using raw block devices
>> and using different block devices.
>>
>> Use io_schedule_prepare() / io_schedule_finish() in
>> io_cqring_wait_schedule() to address the difference.
>>
>> After that using io_uring is on par or surpassing synchronous IO (using
>> registered files etc makes it reliably win, but arguably is a less fair
>> comparison).
>>
>> There are other calls to schedule() in io_uring/, but none immediately
>> jump out to be similarly situated, so I did not touch them. Similarly,
>> it's possible that mutex_lock_io() should be used, but it's not clear if
>> there are cases where that matters.
>>
>> Cc: [email protected] # 5.10+
>> Cc: Pavel Begunkov <[email protected]>
>> Cc: [email protected]
>> Cc: [email protected]
>> Signed-off-by: Andres Freund <[email protected]>
>> Link: https://lore.kernel.org/r/[email protected]
>> [axboe: minor style fixup]
>> Signed-off-by: Jens Axboe <[email protected]>
>> Signed-off-by: Greg Kroah-Hartman <[email protected]>
>> ---
>>  io_uring/io_uring.c |   15 +++++++++++++--
>>  1 file changed, 13 insertions(+), 2 deletions(-)
>>
>> --- a/io_uring/io_uring.c
>> +++ b/io_uring/io_uring.c
>> @@ -2575,6 +2575,8 @@ int io_run_task_work_sig(struct io_ring_
>>  static inline int io_cqring_wait_schedule(struct io_ring_ctx *ctx,
>>  					  struct io_wait_queue *iowq)
>>  {
>> +	int token, ret;
>> +
>>  	if (unlikely(READ_ONCE(ctx->check_cq)))
>>  		return 1;
>>  	if (unlikely(!llist_empty(&ctx->work_llist)))
>> @@ -2585,11 +2587,20 @@ static inline int io_cqring_wait_schedul
>>  		return -EINTR;
>>  	if (unlikely(io_should_wake(iowq)))
>>  		return 0;
>> +
>> +	/*
>> +	 * Use io_schedule_prepare/finish, so cpufreq can take into account
>> +	 * that the task is waiting for IO - turns out to be important for low
>> +	 * QD IO.
>> +	 */
>> +	token = io_schedule_prepare();
>> +	ret = 0;
>>  	if (iowq->timeout == KTIME_MAX)
>>  		schedule();
>>  	else if (!schedule_hrtimeout(&iowq->timeout, HRTIMER_MODE_ABS))
>> -		return -ETIME;
>> -	return 0;
>> +		ret = -ETIME;
>> +	io_schedule_finish(token);
>> +	return ret;
>>  }
>>  
>>  /*
> 
> Reportedly, this caused a regression as reported in [1] [2] [3]. Not only v6.4.4 is affected, v6.1.39 is affected too.
> 
> Reverting this commit fixes the issue.
> 
> Please check.
> 
> Thanks.
> 
> [1] https://bbs.archlinux.org/viewtopic.php?id=287343
> [2] https://bugzilla.kernel.org/show_bug.cgi?id=217700
> [3] https://bugzilla.kernel.org/show_bug.cgi?id=217699

Just read the first one, but this is very much expected. It's now just
correctly reflecting that one thread is waiting on IO. IO wait being
100% doesn't mean that one core is running 100% of the time, it just
means it's WAITING on IO 100% of the time.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 6.4 800/800] io_uring: Use io_schedule* in cqring wait
  2023-07-23 12:11     ` Linux regression tracking #adding (Thorsten Leemhuis)
@ 2023-07-23 17:35       ` Linux regression tracking #update (Thorsten Leemhuis)
  0 siblings, 0 replies; 14+ messages in thread
From: Linux regression tracking #update (Thorsten Leemhuis) @ 2023-07-23 17:35 UTC (permalink / raw)
  To: Oleksandr Natalenko, stable
  Cc: Greg Kroah-Hartman, Pavel Begunkov, io-uring, linux-kernel,
	Andres Freund, Jens Axboe, Linux kernel regressions list

On 23.07.23 14:11, Linux regression tracking #adding (Thorsten Leemhuis)
wrote:
> On 23.07.23 11:39, Oleksandr Natalenko wrote:
>> On neděle 16. července 2023 21:50:53 CEST Greg Kroah-Hartman wrote:
>>> From: Andres Freund <[email protected]>
>>>
>>> commit 8a796565cec3601071cbbd27d6304e202019d014 upstream.
>>>
>>> I observed poor performance of io_uring compared to synchronous IO. That
>>> turns out to be caused by deeper CPU idle states entered with io_uring,
>>> due to io_uring using plain schedule(), whereas synchronous IO uses
>>> io_schedule().

> #regzbot introduced 8a796565cec360107 ^
> https://bbs.archlinux.org/viewtopic.php?id=287343
> https://bugzilla.kernel.org/show_bug.cgi?id=217700
> https://bugzilla.kernel.org/show_bug.cgi?id=217699
> #regzbot title block: io_uring: high iowait rates and stalls
> #regzbot ignore-activity

Apparently expected behavior:
https://lore.kernel.org/lkml/[email protected]/

#regzbot resolve: notabug: on a closer look it turned out that's
expected behavior

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.

#regzbot ignore-activity

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 6.4 800/800] io_uring: Use io_schedule* in cqring wait
  2023-07-23 15:31     ` Jens Axboe
@ 2023-07-23 17:43       ` Genes Lists
  2023-07-23 18:06         ` Oleksandr Natalenko
  0 siblings, 1 reply; 14+ messages in thread
From: Genes Lists @ 2023-07-23 17:43 UTC (permalink / raw)
  To: Jens Axboe, Oleksandr Natalenko, stable
  Cc: Greg Kroah-Hartman, Pavel Begunkov, io-uring, linux-kernel,
	Andres Freund

On 7/23/23 11:31, Jens Axboe wrote:
...
> Just read the first one, but this is very much expected. It's now just
> correctly reflecting that one thread is waiting on IO. IO wait being
> 100% doesn't mean that one core is running 100% of the time, it just
> means it's WAITING on IO 100% of the time.
> 

Seems reasonable thank you.

Question - do you expect the iowait to stay high for a freshly created 
mariadb doing nothing (as far as I can tell anyway) until process 
exited? Or Would you think it would drop in this case prior to the 
process exiting.

For example I tried the following - is the output what you expect?

Create a fresh mariab with no databases - monitor the core showing the 
iowaits with:

    mpstat -P ALL 2 100

# rm -f /var/lib/mysql/*
# mariadb-install-db --user=mysql --basedir=/usr --datadir=/var/lib/mysql

# systemctl start mariadb      (iowaits -> 100%) 
 

# iotop -bo |grep maria        (shows no output, iowait stays 100%)

(this persists until mariadb process exits)
 

# systemctl stop mariadb       (iowait drops to 0%) 






^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 6.4 800/800] io_uring: Use io_schedule* in cqring wait
  2023-07-23 17:43       ` Genes Lists
@ 2023-07-23 18:06         ` Oleksandr Natalenko
  2023-07-23 18:58           ` Andres Freund
  2023-07-24 15:47           ` Jens Axboe
  0 siblings, 2 replies; 14+ messages in thread
From: Oleksandr Natalenko @ 2023-07-23 18:06 UTC (permalink / raw)
  To: Jens Axboe, stable, Genes Lists
  Cc: Greg Kroah-Hartman, Pavel Begunkov, io-uring, linux-kernel,
	Andres Freund

[-- Attachment #1: Type: text/plain, Size: 1510 bytes --]

Hello.

On neděle 23. července 2023 19:43:50 CEST Genes Lists wrote:
> On 7/23/23 11:31, Jens Axboe wrote:
> ...
> > Just read the first one, but this is very much expected. It's now just
> > correctly reflecting that one thread is waiting on IO. IO wait being
> > 100% doesn't mean that one core is running 100% of the time, it just
> > means it's WAITING on IO 100% of the time.
> > 
> 
> Seems reasonable thank you.
> 
> Question - do you expect the iowait to stay high for a freshly created 
> mariadb doing nothing (as far as I can tell anyway) until process 
> exited? Or Would you think it would drop in this case prior to the 
> process exiting.
> 
> For example I tried the following - is the output what you expect?
> 
> Create a fresh mariab with no databases - monitor the core showing the 
> iowaits with:
> 
>     mpstat -P ALL 2 100
> 
> # rm -f /var/lib/mysql/*
> # mariadb-install-db --user=mysql --basedir=/usr --datadir=/var/lib/mysql
> 
> # systemctl start mariadb      (iowaits -> 100%) 
>  
> 
> # iotop -bo |grep maria        (shows no output, iowait stays 100%)
> 
> (this persists until mariadb process exits)
>  
> 
> # systemctl stop mariadb       (iowait drops to 0%) 

This is a visible userspace behaviour change with no changes in the userspace itself, so we cannot just ignore it. If for some reason this is how it should be now, how do we explain it to MariaDB devs to get this fixed?

Thanks.

-- 
Oleksandr Natalenko (post-factum)

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 6.4 800/800] io_uring: Use io_schedule* in cqring wait
  2023-07-23 18:06         ` Oleksandr Natalenko
@ 2023-07-23 18:58           ` Andres Freund
  2023-07-23 19:44             ` Genes Lists
  2023-07-24 15:47           ` Jens Axboe
  1 sibling, 1 reply; 14+ messages in thread
From: Andres Freund @ 2023-07-23 18:58 UTC (permalink / raw)
  To: Oleksandr Natalenko
  Cc: Jens Axboe, stable, Genes Lists, Greg Kroah-Hartman,
	Pavel Begunkov, io-uring, linux-kernel

Hi,

On 2023-07-23 20:06:19 +0200, Oleksandr Natalenko wrote:
> On neděle 23. července 2023 19:43:50 CEST Genes Lists wrote:
> > On 7/23/23 11:31, Jens Axboe wrote:
> > ...
> > > Just read the first one, but this is very much expected. It's now just
> > > correctly reflecting that one thread is waiting on IO. IO wait being
> > > 100% doesn't mean that one core is running 100% of the time, it just
> > > means it's WAITING on IO 100% of the time.
> > >
> >
> > Seems reasonable thank you.
> >
> > Question - do you expect the iowait to stay high for a freshly created
> > mariadb doing nothing (as far as I can tell anyway) until process
> > exited? Or Would you think it would drop in this case prior to the
> > process exiting.
> >
> > For example I tried the following - is the output what you expect?
> >
> > Create a fresh mariab with no databases - monitor the core showing the
> > iowaits with:
> >
> >     mpstat -P ALL 2 100
> >
> > # rm -f /var/lib/mysql/*
> > # mariadb-install-db --user=mysql --basedir=/usr --datadir=/var/lib/mysql
> >
> > # systemctl start mariadb      (iowaits -> 100%)
> >
> >
> > # iotop -bo |grep maria        (shows no output, iowait stays 100%)
> >
> > (this persists until mariadb process exits)
> >
> >
> > # systemctl stop mariadb       (iowait drops to 0%)
>
> This is a visible userspace behaviour change with no changes in the
> userspace itself, so we cannot just ignore it. If for some reason this is
> how it should be now, how do we explain it to MariaDB devs to get this
> fixed?

Just to confirm I understand: Your concern is how it looks in mpstat, not
performance or anything like that?

As far as I can tell, mariadb submits a bunch of IOs, which all have
completed:
...
        mariadbd  438034 [000] 67593.094595:       io_uring:io_uring_submit_req: ring 0xffff8887878ac800, req 0xffff888929df2400, user_data 0x55d5eaf29488, opcode READV, flags 0x0, sq_thread 0
        mariadbd  438034 [000] 67593.094604:       io_uring:io_uring_submit_req: ring 0xffff8887878ac800, req 0xffff888929df2500, user_data 0x55d5eaf29520, opcode READV, flags 0x0, sq_thread 0
        mariadbd  438034 [000] 67593.094690:         io_uring:io_uring_complete: ring 0xffff8887878ac800, req 0xffff888929df2400, user_data 0x55d5eaf29488, result 16384, cflags 0x0 extra1 0 extra2 0 
        mariadbd  438034 [000] 67593.094699:         io_uring:io_uring_complete: ring 0xffff8887878ac800, req 0xffff888929df2500, user_data 0x55d5eaf29520, result 16384, cflags 0x0 extra1 0 extra2 0 

Then waits for io_uring events:
        mariadbd  438032 [003] 67593.095282:      io_uring:io_uring_cqring_wait: ring 0xffff8887878ac800, min_events 1

There won't be any completions without further IO being submitted.

io_uring could have logic to somehow report a different state in such a case
(where there'll not be any completions before new IOs being submitted), but
that'd likely not be free.

Greetings,

Andres Freund

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 6.4 800/800] io_uring: Use io_schedule* in cqring wait
  2023-07-23 18:58           ` Andres Freund
@ 2023-07-23 19:44             ` Genes Lists
  0 siblings, 0 replies; 14+ messages in thread
From: Genes Lists @ 2023-07-23 19:44 UTC (permalink / raw)
  To: Andres Freund, Oleksandr Natalenko
  Cc: Jens Axboe, stable, Greg Kroah-Hartman, Pavel Begunkov, io-uring,
	linux-kernel

On 7/23/23 14:58, Andres Freund wrote:

> Just to confirm I understand: Your concern is how it looks in mpstat, not
> performance or anything like that?

Right - there is no performance issue or any other issue to my knowledge 
and cores are idle.

So, as you said, its just a small reporting item - which is now quite clear.

thank you.

gene

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 6.4 800/800] io_uring: Use io_schedule* in cqring wait
  2023-07-23 18:06         ` Oleksandr Natalenko
  2023-07-23 18:58           ` Andres Freund
@ 2023-07-24 15:47           ` Jens Axboe
  2023-07-24 18:04             ` Genes Lists
  1 sibling, 1 reply; 14+ messages in thread
From: Jens Axboe @ 2023-07-24 15:47 UTC (permalink / raw)
  To: Oleksandr Natalenko, stable, Genes Lists
  Cc: Greg Kroah-Hartman, Pavel Begunkov, io-uring, linux-kernel,
	Andres Freund

On 7/23/23 12:06?PM, Oleksandr Natalenko wrote:
> Hello.
> 
> On ned?le 23. ?ervence 2023 19:43:50 CEST Genes Lists wrote:
>> On 7/23/23 11:31, Jens Axboe wrote:
>> ...
>>> Just read the first one, but this is very much expected. It's now just
>>> correctly reflecting that one thread is waiting on IO. IO wait being
>>> 100% doesn't mean that one core is running 100% of the time, it just
>>> means it's WAITING on IO 100% of the time.
>>>
>>
>> Seems reasonable thank you.
>>
>> Question - do you expect the iowait to stay high for a freshly created 
>> mariadb doing nothing (as far as I can tell anyway) until process 
>> exited? Or Would you think it would drop in this case prior to the 
>> process exiting.
>>
>> For example I tried the following - is the output what you expect?
>>
>> Create a fresh mariab with no databases - monitor the core showing the 
>> iowaits with:
>>
>>     mpstat -P ALL 2 100
>>
>> # rm -f /var/lib/mysql/*
>> # mariadb-install-db --user=mysql --basedir=/usr --datadir=/var/lib/mysql
>>
>> # systemctl start mariadb      (iowaits -> 100%) 
>>  
>>
>> # iotop -bo |grep maria        (shows no output, iowait stays 100%)
>>
>> (this persists until mariadb process exits)
>>  
>>
>> # systemctl stop mariadb       (iowait drops to 0%) 
> 
> This is a visible userspace behaviour change with no changes in the
> userspace itself, so we cannot just ignore it. If for some reason this
> is how it should be now, how do we explain it to MariaDB devs to get
> this fixed?

It's not a behavioural change, it's a reporting change. There's no
functionality changing here. That said, I do think we should narrow it a
bit so we're only marked as in iowait if the task waiting has pending
IO. That should still satisfy the initial problem, and it won't flag
iowait on mariadb like cases where they have someone else just
perpetually waiting on requests.

As a side effect, this also removes the flush that wasn't at all
necessary on io_uring.

If folks are able to test the below, that would be appreciated.


diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 89a611541bc4..f4591b912ea8 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -2493,11 +2493,20 @@ int io_run_task_work_sig(struct io_ring_ctx *ctx)
 	return 0;
 }
 
+static bool current_pending_io(void)
+{
+	struct io_uring_task *tctx = current->io_uring;
+
+	if (!tctx)
+		return false;
+	return percpu_counter_read_positive(&tctx->inflight);
+}
+
 /* when returns >0, the caller should retry */
 static inline int io_cqring_wait_schedule(struct io_ring_ctx *ctx,
 					  struct io_wait_queue *iowq)
 {
-	int token, ret;
+	int io_wait, ret;
 
 	if (unlikely(READ_ONCE(ctx->check_cq)))
 		return 1;
@@ -2511,17 +2520,19 @@ static inline int io_cqring_wait_schedule(struct io_ring_ctx *ctx,
 		return 0;
 
 	/*
-	 * Use io_schedule_prepare/finish, so cpufreq can take into account
-	 * that the task is waiting for IO - turns out to be important for low
-	 * QD IO.
+	 * Mark us as being in io_wait if we have pending requests, so cpufreq
+	 * can take into account that the task is waiting for IO - turns out
+	 * to be important for low QD IO.
 	 */
-	token = io_schedule_prepare();
+	io_wait = current->in_iowait;
+	if (current_pending_io())
+		current->in_iowait = 1;
 	ret = 0;
 	if (iowq->timeout == KTIME_MAX)
 		schedule();
 	else if (!schedule_hrtimeout(&iowq->timeout, HRTIMER_MODE_ABS))
 		ret = -ETIME;
-	io_schedule_finish(token);
+	current->in_iowait = io_wait;
 	return ret;
 }
 

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH 6.4 800/800] io_uring: Use io_schedule* in cqring wait
  2023-07-24 15:47           ` Jens Axboe
@ 2023-07-24 18:04             ` Genes Lists
  0 siblings, 0 replies; 14+ messages in thread
From: Genes Lists @ 2023-07-24 18:04 UTC (permalink / raw)
  To: Jens Axboe, Oleksandr Natalenko, stable
  Cc: Greg Kroah-Hartman, Pavel Begunkov, io-uring, linux-kernel,
	Andres Freund

On 7/24/23 11:47, Jens Axboe wrote:

> It's not a behavioural change, it's a reporting change. There's no
> functionality changing here. That said, I do think we should narrow it a
> bit so we're only marked as in iowait if the task waiting has pending
> IO. That should still satisfy the initial problem, and it won't flag
> iowait on mariadb like cases where they have someone else just
> perpetually waiting on requests.
> 
> As a side effect, this also removes the flush that wasn't at all
> necessary on io_uring.
> 
> If folks are able to test the below, that would be appreciated.
> 
> 
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index 89a611541bc4..f4591b912ea8 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -2493,11 +2493,20 @@ int io_run_task_work_sig(struct io_ring_ctx *ctx)
>   	return 0;
>   }
>   
> +static bool current_pending_io(void)
> +{
> +	struct io_uring_task *tctx = current->io_uring;
> +
> +	if (!tctx)
> +		return false;
> +	return percpu_counter_read_positive(&tctx->inflight);
> +}
> +
>   /* when returns >0, the caller should retry */
>   static inline int io_cqring_wait_schedule(struct io_ring_ctx *ctx,
>   					  struct io_wait_queue *iowq)
>   {
> -	int token, ret;
> +	int io_wait, ret;
>   
>   	if (unlikely(READ_ONCE(ctx->check_cq)))
>   		return 1;
> @@ -2511,17 +2520,19 @@ static inline int io_cqring_wait_schedule(struct io_ring_ctx *ctx,
>   		return 0;
>   
>   	/*
> -	 * Use io_schedule_prepare/finish, so cpufreq can take into account
> -	 * that the task is waiting for IO - turns out to be important for low
> -	 * QD IO.
> +	 * Mark us as being in io_wait if we have pending requests, so cpufreq
> +	 * can take into account that the task is waiting for IO - turns out
> +	 * to be important for low QD IO.
>   	 */
> -	token = io_schedule_prepare();
> +	io_wait = current->in_iowait;
> +	if (current_pending_io())
> +		current->in_iowait = 1;
>   	ret = 0;
>   	if (iowq->timeout == KTIME_MAX)
>   		schedule();
>   	else if (!schedule_hrtimeout(&iowq->timeout, HRTIMER_MODE_ABS))
>   		ret = -ETIME;
> -	io_schedule_finish(token);
> +	current->in_iowait = io_wait;
>   	return ret;
>   }
>   
> 
Tested on top of 6.4.6 stable - and working great - iowait stats now 
look like they always did.

thank you

gene

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2023-07-24 18:04 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <[email protected]>
2023-07-16 19:50 ` [PATCH 6.4 800/800] io_uring: Use io_schedule* in cqring wait Greg Kroah-Hartman
2023-07-23  9:39   ` Oleksandr Natalenko
2023-07-23 10:50     ` Greg Kroah-Hartman
2023-07-23 10:55       ` Genes Lists
2023-07-23 10:56       ` Oleksandr Natalenko
2023-07-23 12:11     ` Linux regression tracking #adding (Thorsten Leemhuis)
2023-07-23 17:35       ` Linux regression tracking #update (Thorsten Leemhuis)
2023-07-23 15:31     ` Jens Axboe
2023-07-23 17:43       ` Genes Lists
2023-07-23 18:06         ` Oleksandr Natalenko
2023-07-23 18:58           ` Andres Freund
2023-07-23 19:44             ` Genes Lists
2023-07-24 15:47           ` Jens Axboe
2023-07-24 18:04             ` Genes Lists

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox