public inbox for [email protected]
* io_uring performance with block sizes > 128k
@ 2020-03-02 23:55 Bijan Mottahedeh
  2020-03-02 23:57 ` Jens Axboe
  0 siblings, 1 reply; 4+ messages in thread
From: Bijan Mottahedeh @ 2020-03-02 23:55 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring

I'm seeing a sizeable drop in perf with polled fio tests for block sizes > 128k:

filename=/dev/nvme0n1
rw=randread
direct=1
time_based=1
randrepeat=1
gtod_reduce=1

fio --readonly --ioengine=io_uring --iodepth 1024 --fixedbufs --hipri --numjobs=16
fio --readonly --ioengine=pvsync2 --iodepth 1024 --hipri --numjobs=16


Compared with the pvsync2 engine, the only major difference I could see 
was the dio path, __blkdev_direct_IO() for io_uring vs. 
__blkdev_direct_IO_simple() for pvsync2 because of the is_sync_kiocb() 
check.


static ssize_t
blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
{
         ...
         if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
                 return __blkdev_direct_IO_simple(iocb, iter, nr_pages);

         return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES));
}
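
As a rough userspace illustration of the check quoted above, the sketch below models which path a request of a given size would take. The 4 KiB page size, the BIO_MAX_PAGES value of 256, page-aligned buffers, and all `sketch_` names are assumptions made for the example; none of them come from this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed values, not taken from the thread. */
#define SKETCH_PAGE_SIZE     4096u
#define SKETCH_BIO_MAX_PAGES 256u

enum sketch_dio_path { SKETCH_DIO_SIMPLE, SKETCH_DIO_FULL };

/* Pages needed for a page-aligned buffer of len bytes (rounds up). */
static unsigned int sketch_nr_pages(size_t len)
{
	return (unsigned int)((len + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE);
}

/* Same shape as the quoted check: a sync kiocb with few enough pages
 * takes the simple path, everything else takes the full path. */
static enum sketch_dio_path sketch_pick_path(bool is_sync, size_t len)
{
	assert(len > 0);
	if (is_sync && sketch_nr_pages(len) <= SKETCH_BIO_MAX_PAGES)
		return SKETCH_DIO_SIMPLE;
	return SKETCH_DIO_FULL;
}
```

Under these assumptions, 128k, 144k and 256k come to 32, 36 and 64 pages, all well below the page limit, so the page count alone would not change the path at 128k; only the is_sync_kiocb() result separates the two engines here.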

As an experiment, I hacked the io_uring code to force it through the 
_simple() path. I get better numbers for bs > 128k with the hack. The 
variance is fairly high, but the baseline drop at bs > 128k seems consistent:


# baseline
READ: bw=3167MiB/s (3321MB/s), 186MiB/s-208MiB/s (196MB/s-219MB/s)   #128k
READ: bw=898MiB/s (941MB/s), 51.2MiB/s-66.1MiB/s (53.7MB/s-69.3MB/s) #144k
READ: bw=1576MiB/s (1652MB/s), 81.8MiB/s-109MiB/s (85.8MB/s-114MB/s) #256k

# hack
READ: bw=2705MiB/s (2836MB/s), 157MiB/s-174MiB/s (165MB/s-183MB/s) #128k
READ: bw=2901MiB/s (3042MB/s), 174MiB/s-194MiB/s (183MB/s-204MB/s) #144k
READ: bw=4194MiB/s (4398MB/s), 252MiB/s-271MiB/s (265MB/s-284MB/s) #256k
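
To compare the runs per request rather than per byte, the aggregate bandwidth can be converted to requests per second. This is only a back-of-the-envelope helper; the function name is made up for the example, and it assumes fio's MiB/s figure and a fixed block size per run.

```c
#include <assert.h>

/* Convert aggregate bandwidth in MiB/s into requests per second for a
 * fixed block size given in KiB (1 MiB = 1024 KiB). */
static double sketch_iops(double bw_mib_s, unsigned int bs_kib)
{
	assert(bs_kib > 0);
	return bw_mib_s * 1024.0 / bs_kib;
}
```

With the numbers above, the baseline goes from about 25.3k requests/s at 128k to about 6.4k at 144k, while the hack goes from about 21.6k to about 20.6k to about 16.8k at 128k, 144k and 256k.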


--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1972,12 +1972,12 @@ static int io_prep_rw(struct io_kiocb *req, const struct
                         return -EOPNOTSUPP;

                 kiocb->ki_flags |= IOCB_HIPRI;
-               kiocb->ki_complete = io_complete_rw_iopoll;
+               kiocb->ki_complete = NULL;
                 req->result = 0;
         } else {
                 if (kiocb->ki_flags & IOCB_HIPRI)
                         return -EINVAL;
-               kiocb->ki_complete = io_complete_rw;
+               kiocb->ki_complete = NULL;
         }

         req->rw.addr = READ_ONCE(sqe->addr);
@@ -2005,7 +2005,12 @@ static inline void io_rw_done(struct kiocb *kiocb, ssize_
                 ret = -EINTR;
                 /* fall through */
         default:
-               kiocb->ki_complete(kiocb, ret, 0);
+               if (kiocb->ki_complete)
+                       kiocb->ki_complete(kiocb, ret, 0);
+               else if (kiocb->ki_flags & IOCB_HIPRI)
+                       io_complete_rw_iopoll(kiocb, ret, 0);
+               else
+                       io_complete_rw(kiocb, ret, 0);
         }
  }


With the baseline version, perf top shows a significant amount of time 
spent on lock contention.  I *think* it is nvmeq->sq_lock.

Does that make sense?  I do realize the hack defeats the purpose of 
io_uring, but I thought it might provide some clues as to what is going 
on.  Let me know if there is something else I can try.

Thanks.

--bijan




* Re: io_uring performance with block sizes > 128k
  2020-03-02 23:55 io_uring performance with block sizes > 128k Bijan Mottahedeh
@ 2020-03-02 23:57 ` Jens Axboe
  2020-03-03  5:01   ` Jens Axboe
  0 siblings, 1 reply; 4+ messages in thread
From: Jens Axboe @ 2020-03-02 23:57 UTC (permalink / raw)
  To: Bijan Mottahedeh; +Cc: io-uring

On 3/2/20 4:55 PM, Bijan Mottahedeh wrote:
> I'm seeing a sizeable drop in perf with polled fio tests for block sizes > 128k:
> 
> # baseline
> READ: bw=3167MiB/s (3321MB/s), 186MiB/s-208MiB/s (196MB/s-219MB/s)   #128k
> READ: bw=898MiB/s (941MB/s), 51.2MiB/s-66.1MiB/s (53.7MB/s-69.3MB/s) #144k
> READ: bw=1576MiB/s (1652MB/s), 81.8MiB/s-109MiB/s (85.8MB/s-114MB/s) #256k
> 
> # hack
> READ: bw=2705MiB/s (2836MB/s), 157MiB/s-174MiB/s (165MB/s-183MB/s) #128k
> READ: bw=2901MiB/s (3042MB/s), 174MiB/s-194MiB/s (183MB/s-204MB/s) #144k
> READ: bw=4194MiB/s (4398MB/s), 252MiB/s-271MiB/s (265MB/s-284MB/s) #256k

A quick guess would be that the IO is being split above 128K, and hence
the polling only catches one of the parts?
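
A small sketch of the arithmetic behind that guess: if requests were capped at some per-I/O limit, anything larger would break into several pieces. The 128 KiB limit and the `sketch_` names are hypothetical; the thread does not confirm that such a limit applies on this device.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-request limit; not confirmed anywhere in the thread. */
#define SKETCH_MAX_IO_BYTES (128u * 1024u)

/* Number of pieces an I/O of len bytes breaks into under max_bytes. */
static size_t sketch_split_count(size_t len, size_t max_bytes)
{
	assert(max_bytes > 0);
	return (len + max_bytes - 1) / max_bytes;
}

/* Size in bytes of the last piece of that I/O. */
static size_t sketch_tail_bytes(size_t len, size_t max_bytes)
{
	assert(max_bytes > 0 && len > 0);
	size_t rem = len % max_bytes;
	return rem ? rem : max_bytes;
}
```

Under that assumed limit, a 128k request stays whole, a 144k request becomes 128k + 16k, and a 256k request becomes 128k + 128k, which would line up with the throughput cliff appearing right above 128k.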

-- 
Jens Axboe



* Re: io_uring performance with block sizes > 128k
  2020-03-02 23:57 ` Jens Axboe
@ 2020-03-03  5:01   ` Jens Axboe
  2020-03-03 20:23     ` Bijan Mottahedeh
  0 siblings, 1 reply; 4+ messages in thread
From: Jens Axboe @ 2020-03-03  5:01 UTC (permalink / raw)
  To: Bijan Mottahedeh; +Cc: io-uring

On 3/2/20 4:57 PM, Jens Axboe wrote:
> A quick guess would be that the IO is being split above 128K, and hence
> the polling only catches one of the parts?

Can you try and see if this makes a difference?


diff --git a/fs/io_uring.c b/fs/io_uring.c
index 571b510ef0e7..cf7599a2c503 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1725,8 +1725,10 @@ static int io_do_iopoll(struct io_ring_ctx *ctx, unsigned int *nr_events,
 		if (ret < 0)
 			break;
 
+#if 0
 		if (ret && spin)
 			spin = false;
+#endif
 		ret = 0;
 	}
 

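For readers following along, here is a toy userspace model of what the `#if 0` changes in that loop. It is not the kernel code: the function name, the `results` array standing in for per-request poll return values, and the assumption that `spin` starts out true are all made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * results[i] stands in for the return value of polling request i:
 * < 0 is an error, 0 means nothing found, > 0 means completions found.
 * Returns how many polls were issued with spin == true.
 * stop_spin_on_hit == true models the unpatched loop; false models the
 * loop with the "spin = false" lines compiled out.
 */
static size_t sketch_spinning_polls(const int *results, size_t n,
				    bool stop_spin_on_hit)
{
	bool spin = true; /* assumed starting value */
	size_t spinning = 0;

	for (size_t i = 0; i < n; i++) {
		if (spin)
			spinning++;
		if (results[i] < 0)
			break;
		if (results[i] && spin && stop_spin_on_hit)
			spin = false;
	}
	return spinning;
}
```

In this model, the unpatched loop stops spinning after the first poll that finds completions, while the patched loop keeps spinning on every remaining request in the list.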
-- 
Jens Axboe



* Re: io_uring performance with block sizes > 128k
  2020-03-03  5:01   ` Jens Axboe
@ 2020-03-03 20:23     ` Bijan Mottahedeh
  0 siblings, 0 replies; 4+ messages in thread
From: Bijan Mottahedeh @ 2020-03-03 20:23 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring

On 3/2/2020 9:01 PM, Jens Axboe wrote:
> On 3/2/20 4:57 PM, Jens Axboe wrote:
>> A quick guess would be that the IO is being split above 128K, and hence
>> the polling only catches one of the parts?
> Can you try and see if this makes a difference?
>
>
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 571b510ef0e7..cf7599a2c503 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -1725,8 +1725,10 @@ static int io_do_iopoll(struct io_ring_ctx *ctx, unsigned int *nr_events,
>   		if (ret < 0)
>   			break;
>   
> +#if 0
>   		if (ret && spin)
>   			spin = false;
> +#endif
>   		ret = 0;
>   	}
>   
>
I didn't see a difference.

If the request is split into two bios, is REQ_F_IOPOLL_COMPLETED set 
only when the 2nd bio completes?

I think you mentioned before that the request is split with 
__blk_queue_split(), but I haven't yet been able to see exactly how that 
happens.  I see that the request size in nvme_queue_rq() is the same as 
the original (e.g. 256k); is that expected?

