* io_uring performance with block sizes > 128k
From: Bijan Mottahedeh @ 2020-03-02 23:55 UTC (permalink / raw)
To: Jens Axboe; +Cc: io-uring
I'm seeing a sizeable drop in perf with polled fio tests for block
sizes > 128k:
filename=/dev/nvme0n1
rw=randread
direct=1
time_based=1
randrepeat=1
gtod_reduce=1
fio --readonly --ioengine=io_uring --iodepth 1024 --fixedbufs --hipri --numjobs=16
fio --readonly --ioengine=pvsync2 --iodepth 1024 --hipri --numjobs=16
Compared with the pvsync2 engine, the only major difference I could see
was the dio path, __blkdev_direct_IO() for io_uring vs.
__blkdev_direct_IO_simple() for pvsync2 because of the is_sync_kiocb()
check.
static ssize_t
blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
{
	...
	if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
		return __blkdev_direct_IO_simple(iocb, iter, nr_pages);

	return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES));
}
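To make the condition above concrete, here is a small userspace sketch (not kernel code) of the page-count arithmetic. It assumes 4 KiB pages, page-aligned buffers, and a BIO_MAX_PAGES value of 256; pages_for() and takes_simple_path() are hypothetical helper names. Under those assumptions 128k, 144k and 256k need 32, 36 and 64 pages, so the page-count test passes for all three and only the is_sync_kiocb() result decides which path is taken.

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE     4096u  /* assumption: 4 KiB pages */
#define SKETCH_BIO_MAX_PAGES 256u   /* assumption: value of BIO_MAX_PAGES */

/* Pages needed for a page-aligned buffer of len bytes. */
static unsigned int pages_for(size_t len)
{
	return (unsigned int)((len + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE);
}

/* Mirrors the condition in blkdev_direct_IO() quoted above. */
static bool takes_simple_path(bool is_sync, size_t len)
{
	return is_sync && pages_for(len) <= SKETCH_BIO_MAX_PAGES;
}
```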
Just for an experiment, I hacked io_uring code to force it through the
_simple() path and I get better numbers though the variance is fairly
high, but the drop at bs > 128k seems consistent:
# baseline
READ: bw=3167MiB/s (3321MB/s), 186MiB/s-208MiB/s (196MB/s-219MB/s) #128k
READ: bw=898MiB/s (941MB/s), 51.2MiB/s-66.1MiB/s (53.7MB/s-69.3MB/s) #144k
READ: bw=1576MiB/s (1652MB/s), 81.8MiB/s-109MiB/s (85.8MB/s-114MB/s) #256k
# hack
READ: bw=2705MiB/s (2836MB/s), 157MiB/s-174MiB/s (165MB/s-183MB/s) #128k
READ: bw=2901MiB/s (3042MB/s), 174MiB/s-194MiB/s (183MB/s-204MB/s) #144k
READ: bw=4194MiB/s (4398MB/s), 252MiB/s-271MiB/s (265MB/s-284MB/s) #256k
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1972,12 +1972,12 @@ static int io_prep_rw(struct io_kiocb *req, const struct
return -EOPNOTSUPP;
kiocb->ki_flags |= IOCB_HIPRI;
- kiocb->ki_complete = io_complete_rw_iopoll;
+ kiocb->ki_complete = NULL;
req->result = 0;
} else {
if (kiocb->ki_flags & IOCB_HIPRI)
return -EINVAL;
- kiocb->ki_complete = io_complete_rw;
+ kiocb->ki_complete = NULL;
}
req->rw.addr = READ_ONCE(sqe->addr);
@@ -2005,7 +2005,12 @@ static inline void io_rw_done(struct kiocb *kiocb, ssize_
ret = -EINTR;
/* fall through */
default:
- kiocb->ki_complete(kiocb, ret, 0);
+ if (kiocb->ki_complete)
+ kiocb->ki_complete(kiocb, ret, 0);
+ else if (kiocb->ki_flags & IOCB_HIPRI)
+ io_complete_rw_iopoll(kiocb, ret, 0);
+ else
+ io_complete_rw(kiocb, ret, 0);
}
}
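For readers following the hack: is_sync_kiocb() simply tests whether ki_complete is NULL, so clearing the callback makes blkdev_direct_IO() treat the request as synchronous, and io_rw_done() then has to choose the completion handler itself. Below is a rough userspace model of that dispatch. All sketch_* names and the flag value are hypothetical stand-ins, not the kernel's definitions.

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_IOCB_HIPRI 0x1  /* stand-in flag value, not the kernel's */

struct sketch_kiocb {
	int flags;
	void (*complete)(struct sketch_kiocb *k, long res);
	int handler;  /* records which handler ran: 1 = iopoll, 2 = regular */
	long res;
};

static void sketch_complete_iopoll(struct sketch_kiocb *k, long res)
{
	k->handler = 1;
	k->res = res;
}

static void sketch_complete_rw(struct sketch_kiocb *k, long res)
{
	k->handler = 2;
	k->res = res;
}

/* Same test that is_sync_kiocb() performs: no completion callback set. */
static bool sketch_is_sync(const struct sketch_kiocb *k)
{
	return k->complete == NULL;
}

/* Fallback dispatch modeled on the hacked io_rw_done() above. */
static void sketch_rw_done(struct sketch_kiocb *k, long res)
{
	if (k->complete)
		k->complete(k, res);
	else if (k->flags & SKETCH_IOCB_HIPRI)
		sketch_complete_iopoll(k, res);
	else
		sketch_complete_rw(k, res);
}
```

With the callback cleared, a polled request still ends up in the iopoll handler and a non-polled one in the regular handler; only the dio path selection changes.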
With the baseline version, perf top shows a significant amount of time
for lock contention. I *think* it is nvmeq->sq_lock.
Does that make sense? I do realize the hack defeats the purpose of
io_uring, but I thought it might provide some clues as to what is going
on. Let me know if there is something else I can try.
Thanks.
--bijan
* Re: io_uring performance with block sizes > 128k
From: Jens Axboe @ 2020-03-02 23:57 UTC (permalink / raw)
To: Bijan Mottahedeh; +Cc: io-uring
On 3/2/20 4:55 PM, Bijan Mottahedeh wrote:
> I'm seeing a sizeable drop in perf with polled fio tests for block sizes
> > 128k:
>
> filename=/dev/nvme0n1
> rw=randread
> direct=1
> time_based=1
> randrepeat=1
> gtod_reduce=1
>
> fio --readonly --ioengine=io_uring --iodepth 1024 --fixedbufs --hipri
> --numjobs=16
> fio --readonly --ioengine=pvsync2 --iodepth 1024 --hipri --numjobs=16
>
>
> Compared with the pvsync2 engine, the only major difference I could see
> was the dio path, __blkdev_direct_IO() for io_uring vs.
> __blkdev_direct_IO_simple() for pvsync2 because of the is_sync_kiocb()
> check.
>
>
> static ssize_t
> blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
> {
> ...
> if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
> return __blkdev_direct_IO_simple(iocb, iter, nr_pages);
>
> return __blkdev_direct_IO(iocb, iter, min(nr_pages,
> BIO_MAX_PAGES));
> }
>
> Just for an experiment, I hacked io_uring code to force it through the
> _simple() path and I get better numbers though the variance is fairly
> high, but the drop at bs > 128k seems consistent:
>
>
> # baseline
> READ: bw=3167MiB/s (3321MB/s), 186MiB/s-208MiB/s (196MB/s-219MB/s) #128k
> READ: bw=898MiB/s (941MB/s), 51.2MiB/s-66.1MiB/s (53.7MB/s-69.3MB/s) #144k
> READ: bw=1576MiB/s (1652MB/s), 81.8MiB/s-109MiB/s (85.8MB/s-114MB/s) #256k
>
> # hack
> READ: bw=2705MiB/s (2836MB/s), 157MiB/s-174MiB/s (165MB/s-183MB/s) #128k
> READ: bw=2901MiB/s (3042MB/s), 174MiB/s-194MiB/s (183MB/s-204MB/s) #144k
> READ: bw=4194MiB/s (4398MB/s), 252MiB/s-271MiB/s (265MB/s-284MB/s) #256k
A quick guess would be that the IO is being split above 128K, and hence
the polling only catches one of the parts?
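To illustrate the guess only: if a single request were capped at 128 KiB (an assumption for this sketch, nothing in the thread confirms the actual limit on this device), then anything larger would be issued in more than one piece. The helper names below are hypothetical.

```c
#include <stddef.h>

#define SKETCH_MAX_IO_BYTES (128u * 1024u)  /* assumed per-request limit */

/* Number of pieces an I/O of len bytes is issued as under the assumed limit. */
static size_t split_count(size_t len)
{
	return (len + SKETCH_MAX_IO_BYTES - 1) / SKETCH_MAX_IO_BYTES;
}

/* Size in bytes of the final piece (0 for a zero-length I/O). */
static size_t last_piece_bytes(size_t len)
{
	size_t rem = len % SKETCH_MAX_IO_BYTES;

	if (rem)
		return rem;
	return len ? SKETCH_MAX_IO_BYTES : 0;
}
```

Under that assumption a 144k read becomes 128k + 16k and a 256k read becomes 128k + 128k, while 128k stays whole, which would match where the drop starts.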
--
Jens Axboe
* Re: io_uring performance with block sizes > 128k
From: Jens Axboe @ 2020-03-03 5:01 UTC (permalink / raw)
To: Bijan Mottahedeh; +Cc: io-uring
On 3/2/20 4:57 PM, Jens Axboe wrote:
> On 3/2/20 4:55 PM, Bijan Mottahedeh wrote:
>> I'm seeing a sizeable drop in perf with polled fio tests for block sizes
>> > 128k:
>>
>> filename=/dev/nvme0n1
>> rw=randread
>> direct=1
>> time_based=1
>> randrepeat=1
>> gtod_reduce=1
>>
>> fio --readonly --ioengine=io_uring --iodepth 1024 --fixedbufs --hipri
>> --numjobs=16
>> fio --readonly --ioengine=pvsync2 --iodepth 1024 --hipri --numjobs=16
>>
>>
>> Compared with the pvsync2 engine, the only major difference I could see
>> was the dio path, __blkdev_direct_IO() for io_uring vs.
>> __blkdev_direct_IO_simple() for pvsync2 because of the is_sync_kiocb()
>> check.
>>
>>
>> static ssize_t
>> blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>> {
>> ...
>> if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
>> return __blkdev_direct_IO_simple(iocb, iter, nr_pages);
>>
>> return __blkdev_direct_IO(iocb, iter, min(nr_pages,
>> BIO_MAX_PAGES));
>> }
>>
>> Just for an experiment, I hacked io_uring code to force it through the
>> _simple() path and I get better numbers though the variance is fairly
>> high, but the drop at bs > 128k seems consistent:
>>
>>
>> # baseline
>> READ: bw=3167MiB/s (3321MB/s), 186MiB/s-208MiB/s (196MB/s-219MB/s) #128k
>> READ: bw=898MiB/s (941MB/s), 51.2MiB/s-66.1MiB/s (53.7MB/s-69.3MB/s) #144k
>> READ: bw=1576MiB/s (1652MB/s), 81.8MiB/s-109MiB/s (85.8MB/s-114MB/s) #256k
>>
>> # hack
>> READ: bw=2705MiB/s (2836MB/s), 157MiB/s-174MiB/s (165MB/s-183MB/s) #128k
>> READ: bw=2901MiB/s (3042MB/s), 174MiB/s-194MiB/s (183MB/s-204MB/s) #144k
>> READ: bw=4194MiB/s (4398MB/s), 252MiB/s-271MiB/s (265MB/s-284MB/s) #256k
>
> A quick guess would be that the IO is being split above 128K, and hence
> the polling only catches one of the parts?
Can you try and see if this makes a difference?
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 571b510ef0e7..cf7599a2c503 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1725,8 +1725,10 @@ static int io_do_iopoll(struct io_ring_ctx *ctx, unsigned int *nr_events,
if (ret < 0)
break;
+#if 0
if (ret && spin)
spin = false;
+#endif
ret = 0;
}
--
Jens Axboe
* Re: io_uring performance with block sizes > 128k
From: Bijan Mottahedeh @ 2020-03-03 20:23 UTC (permalink / raw)
To: Jens Axboe; +Cc: io-uring
On 3/2/2020 9:01 PM, Jens Axboe wrote:
> On 3/2/20 4:57 PM, Jens Axboe wrote:
>> On 3/2/20 4:55 PM, Bijan Mottahedeh wrote:
>>> I'm seeing a sizeable drop in perf with polled fio tests for block sizes
>>> > 128k:
>>>
>>> filename=/dev/nvme0n1
>>> rw=randread
>>> direct=1
>>> time_based=1
>>> randrepeat=1
>>> gtod_reduce=1
>>>
>>> fio --readonly --ioengine=io_uring --iodepth 1024 --fixedbufs --hipri
>>> --numjobs=16
>>> fio --readonly --ioengine=pvsync2 --iodepth 1024 --hipri --numjobs=16
>>>
>>>
>>> Compared with the pvsync2 engine, the only major difference I could see
>>> was the dio path, __blkdev_direct_IO() for io_uring vs.
>>> __blkdev_direct_IO_simple() for pvsync2 because of the is_sync_kiocb()
>>> check.
>>>
>>>
>>> static ssize_t
>>> blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>>> {
>>> ...
>>> if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES)
>>> return __blkdev_direct_IO_simple(iocb, iter, nr_pages);
>>>
>>> return __blkdev_direct_IO(iocb, iter, min(nr_pages,
>>> BIO_MAX_PAGES));
>>> }
>>>
>>> Just for an experiment, I hacked io_uring code to force it through the
>>> _simple() path and I get better numbers though the variance is fairly
>>> high, but the drop at bs > 128k seems consistent:
>>>
>>>
>>> # baseline
>>> READ: bw=3167MiB/s (3321MB/s), 186MiB/s-208MiB/s (196MB/s-219MB/s) #128k
>>> READ: bw=898MiB/s (941MB/s), 51.2MiB/s-66.1MiB/s (53.7MB/s-69.3MB/s) #144k
>>> READ: bw=1576MiB/s (1652MB/s), 81.8MiB/s-109MiB/s (85.8MB/s-114MB/s) #256k
>>>
>>> # hack
>>> READ: bw=2705MiB/s (2836MB/s), 157MiB/s-174MiB/s (165MB/s-183MB/s) #128k
>>> READ: bw=2901MiB/s (3042MB/s), 174MiB/s-194MiB/s (183MB/s-204MB/s) #144k
>>> READ: bw=4194MiB/s (4398MB/s), 252MiB/s-271MiB/s (265MB/s-284MB/s) #256k
>> A quick guess would be that the IO is being split above 128K, and hence
>> the polling only catches one of the parts?
> Can you try and see if this makes a difference?
>
>
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 571b510ef0e7..cf7599a2c503 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -1725,8 +1725,10 @@ static int io_do_iopoll(struct io_ring_ctx *ctx, unsigned int *nr_events,
> if (ret < 0)
> break;
>
> +#if 0
> if (ret && spin)
> spin = false;
> +#endif
> ret = 0;
> }
>
>
I didn't see a difference.
If the request is split into two bios, is REQ_F_IOPOLL_COMPLETED set
only when the 2nd bio completes?
I think you mentioned before that the request is split with
__blk_queue_split(), but I haven't yet been able to see exactly how that
happens. I see that the request size in nvme_queue_rq() is the same as
the original (e.g. 256k); is that expected?