public inbox for [email protected]
* Very low write throughput on file opened with O_SYNC/O_DSYNC
@ 2020-08-17 11:46 Dmitry Shulyak
  2020-08-17 11:58 ` Dmitry Shulyak
  2020-08-17 14:29 ` Jens Axboe
  0 siblings, 2 replies; 9+ messages in thread
From: Dmitry Shulyak @ 2020-08-17 11:46 UTC (permalink / raw)
  To: io-uring

Hi everyone,

I noticed in iotop that all writes are executed by the same thread
(io_wqe_worker-0). This is a significant problem when using files
opened with the flags in the subject. That is not the case with reads:
requests are multiplexed over many threads (note the different thread
name, io_wqe_worker-1). The problem is not specific to O_SYNC: in the
general case I can get higher throughput with a thread pool and regular
system calls, but with O_SYNC specifically the throughput is the same
as if I were writing from a single thread.

The setup is always the same: one ring per thread with a shared worker
pool (IORING_SETUP_ATTACH_WQ), and a high submission rate. It is
possible to get around this performance issue by using separate worker
pools, but then I have to load-balance the workload across many rings
to get the performance gains.
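
For reference, each per-thread ring is created roughly like this (a
minimal liburing sketch; the queue depth and the main_ring/ring names
are illustrative):

#include <string.h>
#include <liburing.h>

/* Sketch: create a ring whose requests are served by the io-wq worker
 * pool of an existing ring (main_ring), instead of by a new pool. */
static int attach_ring(struct io_uring *main_ring, struct io_uring *ring)
{
	struct io_uring_params p;

	memset(&p, 0, sizeof(p));
	p.flags = IORING_SETUP_ATTACH_WQ;	/* share the worker pool */
	p.wq_fd = main_ring->ring_fd;		/* ring that owns the pool */
	return io_uring_queue_init_params(128, ring, &p);
}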

I thought it might have something to do with the IOSQE_ASYNC flag,
but setting it had no effect.

Is this expected behavior? Are there any other solutions besides
creating many rings with isolated worker pools?


* Re: Very low write throughput on file opened with O_SYNC/O_DSYNC
  2020-08-17 11:46 Very low write throughput on file opened with O_SYNC/O_DSYNC Dmitry Shulyak
@ 2020-08-17 11:58 ` Dmitry Shulyak
  2020-08-17 14:29 ` Jens Axboe
  1 sibling, 0 replies; 9+ messages in thread
From: Dmitry Shulyak @ 2020-08-17 11:58 UTC (permalink / raw)
  To: io-uring

Forgot to mention that this is on kernel 5.7.12, using writev/read for
the operations.


* Re: Very low write throughput on file opened with O_SYNC/O_DSYNC
  2020-08-17 11:46 Very low write throughput on file opened with O_SYNC/O_DSYNC Dmitry Shulyak
  2020-08-17 11:58 ` Dmitry Shulyak
@ 2020-08-17 14:29 ` Jens Axboe
  2020-08-17 15:49   ` Dmitry Shulyak
  1 sibling, 1 reply; 9+ messages in thread
From: Jens Axboe @ 2020-08-17 14:29 UTC (permalink / raw)
  To: Dmitry Shulyak, io-uring

This is done on purpose, as buffered writes end up being serialized
on the inode mutex anyway. So if you spread the load over multiple
workers, you generally just waste resources. Specifically, writes to
the same inode are serialized by io-wq; it doesn't attempt to run them
in parallel.
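
The hashing happens in io_prep_async_work(); the relevant logic looks
roughly like this:

	if (req->flags & REQ_F_ISREG) {
		/* regular-file requests that hash by inode (e.g. writes)
		 * run on one io-wq worker at a time */
		if (def->hash_reg_file)
			io_wq_hash_work(&req->work, file_inode(req->file));
	}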

What kind of performance are you seeing with io_uring vs your own
thread pool that doesn't serialize writes? On what fs and what kind
of storage?

-- 
Jens Axboe

* Re: Very low write throughput on file opened with O_SYNC/O_DSYNC
  2020-08-17 14:29 ` Jens Axboe
@ 2020-08-17 15:49   ` Dmitry Shulyak
  2020-08-17 16:17     ` Jens Axboe
  0 siblings, 1 reply; 9+ messages in thread
From: Dmitry Shulyak @ 2020-08-17 15:49 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring

With 48 threads I am getting 200 MB/s, and about the same with 48
separate uring instances.
With a single uring instance (or with a shared pool): 60 MB/s.
The fs is ext4, the device an SSD.


* Re: Very low write throughput on file opened with O_SYNC/O_DSYNC
  2020-08-17 15:49   ` Dmitry Shulyak
@ 2020-08-17 16:17     ` Jens Axboe
  2020-08-18 16:09       ` Dmitry Shulyak
  0 siblings, 1 reply; 9+ messages in thread
From: Jens Axboe @ 2020-08-17 16:17 UTC (permalink / raw)
  To: Dmitry Shulyak; +Cc: io-uring

You could try something like this kernel addition:

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 4b102d9ad846..8909a1d37801 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1152,7 +1152,7 @@ static void io_prep_async_work(struct io_kiocb *req)
 	io_req_init_async(req);
 
 	if (req->flags & REQ_F_ISREG) {
-		if (def->hash_reg_file)
+		if (def->hash_reg_file && !(req->flags & REQ_F_FORCE_ASYNC))
 			io_wq_hash_work(&req->work, file_inode(req->file));
 	} else {
 		if (def->unbound_nonreg_file)

and then set IOSQE_ASYNC on your writes. That'll parallelize them in
terms of execution.
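
On the submission side that would look something like this (liburing
sketch; fd, iov, and offset are placeholders):

	struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

	io_uring_prep_writev(sqe, fd, &iov, 1, offset);
	sqe->flags |= IOSQE_ASYNC;	/* force async punt; with the patch
					 * above, no per-inode hashing */
	io_uring_submit(&ring);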

-- 
Jens Axboe

* Re: Very low write throughput on file opened with O_SYNC/O_DSYNC
  2020-08-17 16:17     ` Jens Axboe
@ 2020-08-18 16:09       ` Dmitry Shulyak
  2020-08-18 16:42         ` Jens Axboe
  0 siblings, 1 reply; 9+ messages in thread
From: Dmitry Shulyak @ 2020-08-18 16:09 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring

It worked, but there are some issues.
With O_DSYNC and even a moderate submission rate, threads get stuck in
some CPU-bound task (99.9% CPU consumption) and make very slow
progress. Did you expect that? It must be something specific to uring;
I can't reproduce this condition by writing from 2048 threads.

* Re: Very low write throughput on file opened with O_SYNC/O_DSYNC
  2020-08-18 16:09       ` Dmitry Shulyak
@ 2020-08-18 16:42         ` Jens Axboe
  2020-08-19  7:55           ` Dmitry Shulyak
  0 siblings, 1 reply; 9+ messages in thread
From: Jens Axboe @ 2020-08-18 16:42 UTC (permalink / raw)
  To: Dmitry Shulyak; +Cc: io-uring

Do you have a reproducer I can run? I'm curious what that CPU spin
is; that obviously shouldn't happen.

-- 
Jens Axboe

* Re: Very low write throughput on file opened with O_SYNC/O_DSYNC
  2020-08-18 16:42         ` Jens Axboe
@ 2020-08-19  7:55           ` Dmitry Shulyak
  2020-08-21 13:43             ` Dmitry Shulyak
  0 siblings, 1 reply; 9+ messages in thread
From: Dmitry Shulyak @ 2020-08-19  7:55 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring

Reproducer with liburing:
https://github.com/dshulyak/liburing/blob/async-repro/test/async-repro.c
I am using a self-written Go library for interacting with uring, but I
can reproduce the same issue consistently with that snippet.


* Re: Very low write throughput on file opened with O_SYNC/O_DSYNC
  2020-08-19  7:55           ` Dmitry Shulyak
@ 2020-08-21 13:43             ` Dmitry Shulyak
  0 siblings, 0 replies; 9+ messages in thread
From: Dmitry Shulyak @ 2020-08-21 13:43 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring

I noticed that I made a mistake in the C reproducer: it doesn't reap
completions correctly, but it was still sufficient to reproduce the
problem. The idea is to submit writes at a high rate, waiting only
when necessary and at the end.
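
In sketch form, the submission loop is essentially this (illustrative,
not the exact reproducer; NR_WRITES, BS, and iov are placeholders):

	unsigned inflight = 0;

	for (unsigned i = 0; i < NR_WRITES; i++) {
		struct io_uring_sqe *sqe;

		while ((sqe = io_uring_get_sqe(&ring)) == NULL) {
			/* SQ ring full: flush and reap one completion */
			struct io_uring_cqe *cqe;

			io_uring_submit(&ring);
			io_uring_wait_cqe(&ring, &cqe);
			io_uring_cqe_seen(&ring, cqe);
			inflight--;
		}
		io_uring_prep_writev(sqe, fd, &iov, 1, (off_t)i * BS);
		sqe->flags |= IOSQE_ASYNC;
		inflight++;
	}
	io_uring_submit(&ring);

	/* drain the remaining completions at the end */
	while (inflight--) {
		struct io_uring_cqe *cqe;

		io_uring_wait_cqe(&ring, &cqe);
		io_uring_cqe_seen(&ring, cqe);
	}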

Anyway, I am not so interested in making it work on my computer;
rather, I would like to ask what your thoughts are on serializing
writes in the kernel. It looks like it is sometimes strictly worse,
and it might be a problem for integration into runtimes (e.g. libuv,
Node.js, Go). At the same time, it is very easy to serialize in user
space if necessary, as in the sketch below.
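
For example, writes to one file can be chained with IOSQE_IO_LINK so
they execute one after another (a sketch; n, iovs, and offs are
placeholders, and it assumes the SQ ring has room for n entries):

	for (unsigned i = 0; i < n; i++) {
		struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

		io_uring_prep_writev(sqe, fd, &iovs[i], 1, offs[i]);
		if (i + 1 < n)
			sqe->flags |= IOSQE_IO_LINK;	/* next SQE waits for this one */
	}
	io_uring_submit(&ring);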

