ublk-qcow2: ublk-qcow2 is available

* ublk-qcow2: ublk-qcow2 is available
@ 2022-09-30  9:24 Ming Lei
  2022-10-03 19:53 ` Stefan Hajnoczi
  2022-10-04  5:43 ` Manuel Bentele
  0 siblings, 2 replies; 44+ messages in thread
From: Ming Lei @ 2022-09-30  9:24 UTC (permalink / raw)
  To: io-uring, linux-block, linux-kernel
  Cc: Kirill Tkhai, Manuel Bentele, Stefan Hajnoczi

Hello,

ublk-qcow2 is available now.

So far it provides basic read/write functionality; compression and snapshots
aren't supported yet. The target/backend implementation is based entirely
on io_uring and shares the same io_uring with the ublk IO command
handler, just as ublk-loop does.

The main motivations for ublk-qcow2 are as follows:

- building one complicated target from scratch helps the libublksrv APIs/functions
  become mature/stable more quickly, since qcow2 is complicated and places more
  requirements on libublksrv than the simple targets (loop, null)

- there have been several attempts at implementing a qcow2 driver in the kernel,
  such as ``qloop`` [2], ``dm-qcow2`` [3] and ``in kernel qcow2(ro)`` [4], so
  ublk-qcow2 might be useful for covering requirements in this field

- performance comparison with qemu-nbd; ever since ublksrv was started, my
  first thought has been to evaluate the performance of the ublk/io_uring
  backend by writing a ublk-qcow2 target

- helping to abstract common building blocks or design patterns for writing new
  ublk targets/backends

So far it basically passes the xfstests (XFS) suite when the ublk-qcow2 block
device is used as TEST_DEV, and a kernel build workload has been verified too.
In addition, a soft update approach is applied to metadata flushing, so
metadata integrity is guaranteed; 'make test T=qcow2/040' covers this kind of
test, and only cluster leaks are reported during it.

The performance looks much better than that of qemu-nbd; see the
details in the commit log[1], README[5] and STATUS[6]. The tests cover both
empty and pre-allocated images. For example, with a pre-allocated qcow2
image (8GB):

- qemu-nbd (make test T=qcow2/002)
	randwrite(4k): jobs 1, iops 24605
	randread(4k): jobs 1, iops 30938
	randrw(4k): jobs 1, iops read 13981 write 14001
	rw(512k): jobs 1, iops read 724 write 728

- ublk-qcow2 (make test T=qcow2/022)
	randwrite(4k): jobs 1, iops 104481
	randread(4k): jobs 1, iops 114937
	randrw(4k): jobs 1, iops read 53630 write 53577
	rw(512k): jobs 1, iops read 1412 write 1423
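
The ratio between the two result sets can be worked out directly from the
figures above. The following is only a small sketch of that arithmetic: the
struct and function names are made up for illustration, and the numbers are
copied verbatim from the two lists. It works out to roughly 4x for the 4k
random workloads and roughly 2x for the 512k sequential one.

```c
#include <stdio.h>

/* IOPS copied from the qcow2/002 (qemu-nbd) and qcow2/022 (ublk-qcow2) runs. */
struct result {
    const char *name;
    double qemu_nbd_iops;
    double ublk_iops;
};

static const struct result results[] = {
    { "randwrite(4k)",    24605.0, 104481.0 },
    { "randread(4k)",     30938.0, 114937.0 },
    { "randrw(4k) read",  13981.0,  53630.0 },
    { "randrw(4k) write", 14001.0,  53577.0 },
    { "rw(512k) read",      724.0,   1412.0 },
    { "rw(512k) write",     728.0,   1423.0 },
};

#define NR_RESULTS (sizeof(results) / sizeof(results[0]))

/* Speedup of ublk-qcow2 over qemu-nbd for one workload. */
static double speedup(const struct result *r)
{
    return r->ublk_iops / r->qemu_nbd_iops;
}

/* Print one "workload  ratio" line per row of the table. */
static void print_speedups(void)
{
    for (size_t i = 0; i < NR_RESULTS; i++)
        printf("%-18s %.2fx\n", results[i].name, speedup(&results[i]));
}
```

Calling print_speedups() from a main() prints one ratio per workload.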

Also, ublk-qcow2 aligns the queue's chunk_sectors limit with qcow2's cluster
size, which is 64KB by default. This simplifies backend IO handling, but the
limit could be increased to 512K or another more suitable size to improve
sequential IO performance; that just requires one coroutine to handle more
than one IO.
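
To illustrate the alignment described above, here is a minimal sketch of the
arithmetic, assuming 512-byte sectors and a power-of-two cluster size. The
helper names are hypothetical and are not taken from ublksrv or the kernel.
With chunk_sectors equal to one cluster, no request crosses a cluster
boundary; with a 512K chunk and 64KB clusters, one request can span up to
eight clusters.

```c
#include <stdint.h>

#define SECTOR_SHIFT 9 /* assumption: 512-byte sectors */

/* chunk_sectors value matching a cluster size in bytes, e.g. 64KB -> 128. */
static uint32_t cluster_to_chunk_sectors(uint32_t cluster_bytes)
{
    return cluster_bytes >> SECTOR_SHIFT;
}

/*
 * Largest number of sectors a request starting at `sector` may cover
 * without crossing a chunk boundary. chunk_sectors must be a power of two.
 */
static uint32_t max_sectors_in_chunk(uint64_t sector, uint32_t chunk_sectors)
{
    return chunk_sectors - (uint32_t)(sector & (chunk_sectors - 1));
}

/* Number of clusters touched by a request of nr_sectors (> 0) sectors. */
static uint64_t clusters_spanned(uint64_t sector, uint32_t nr_sectors,
                                 uint32_t cluster_sectors)
{
    uint64_t first = sector / cluster_sectors;
    uint64_t last = (sector + nr_sectors - 1) / cluster_sectors;

    return last - first + 1;
}
```

A request limited by max_sectors_in_chunk() with a one-cluster chunk always
maps to exactly one cluster, which is why the backend handling stays simple.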

[1] https://github.com/ming1/ubdsrv/commit/9faabbec3a92ca83ddae92335c66eabbeff654e7
[2] https://upcommons.upc.edu/bitstream/handle/2099.1/9619/65757.pdf?sequence=1&isAllowed=y
[3] https://lwn.net/Articles/889429/
[4] https://lab.ks.uni-freiburg.de/projects/kernel-qcow2/repository
[5] https://github.com/ming1/ubdsrv/blob/master/qcow2/README.rst
[6] https://github.com/ming1/ubdsrv/blob/master/qcow2/STATUS.rst

Thanks,
Ming

* Re: ublk-qcow2: ublk-qcow2 is available
  2022-09-30  9:24 ublk-qcow2: ublk-qcow2 is available Ming Lei
@ 2022-10-03 19:53 ` Stefan Hajnoczi
  2022-10-03 23:57   ` Denis V. Lunev
  2022-10-04  9:43   ` Ming Lei
  2022-10-04  5:43 ` Manuel Bentele
  1 sibling, 2 replies; 44+ messages in thread
From: Stefan Hajnoczi @ 2022-10-03 19:53 UTC (permalink / raw)
  To: Ming Lei
  Cc: io-uring, linux-block, linux-kernel, Kirill Tkhai, Manuel Bentele,
	qemu-devel, Kevin Wolf, rjones, Xie Yongji, Denis V. Lunev,
	Stefano Garzarella

On Fri, Sep 30, 2022 at 05:24:11PM +0800, Ming Lei wrote:
> ublk-qcow2 is available now.

Cool, thanks for sharing!

> 
> So far it provides basic read/write function, and compression and snapshot
> aren't supported yet. The target/backend implementation is completely
> based on io_uring, and share the same io_uring with ublk IO command
> handler, just like what ublk-loop does.
> 
> Follows the main motivations of ublk-qcow2:
> 
> - building one complicated target from scratch helps libublksrv APIs/functions
>   become mature/stable more quickly, since qcow2 is complicated and needs more
>   requirement from libublksrv compared with other simple ones(loop, null)
> 
> - there are several attempts of implementing qcow2 driver in kernel, such as
>   ``qloop`` [2], ``dm-qcow2`` [3] and ``in kernel qcow2(ro)`` [4], so ublk-qcow2
>   might be useful for covering requirements in this field
> 
> - performance comparison with qemu-nbd, and it was my 1st thought to evaluate
>   performance of ublk/io_uring backend by writing one ublk-qcow2 since ublksrv
>   is started
> 
> - help to abstract common building block or design pattern for writing new ublk
>   target/backend
> 
> So far it basically passes xfstest(XFS) test by using ublk-qcow2 block
> device as TEST_DEV, and kernel building workload is verified too. Also
> soft update approach is applied in meta flushing, and meta data
> integrity is guaranteed, 'make test T=qcow2/040' covers this kind of
> test, and only cluster leak is reported during this test.
> 
> The performance data looks much better compared with qemu-nbd, see
> details in commit log[1], README[5] and STATUS[6]. And the test covers both
> empty image and pre-allocated image, for example of pre-allocated qcow2
> image(8GB):
> 
> - qemu-nbd (make test T=qcow2/002)

Single queue?

> 	randwrite(4k): jobs 1, iops 24605
> 	randread(4k): jobs 1, iops 30938
> 	randrw(4k): jobs 1, iops read 13981 write 14001
> 	rw(512k): jobs 1, iops read 724 write 728

Please try qemu-storage-daemon's VDUSE export type as well. The
command-line should be similar to this:

  # modprobe virtio_vdpa # attaches vDPA devices to host kernel
  # modprobe vduse
  # qemu-storage-daemon \
      --blockdev file,filename=test.qcow2,cache.direct=on|off,aio=native,node-name=file \
      --blockdev qcow2,file=file,node-name=qcow2 \
      --object iothread,id=iothread0 \
      --export vduse-blk,id=vduse0,name=vduse0,num-queues=$(nproc),node-name=qcow2,writable=on,iothread=iothread0
  # vdpa dev add name vduse0 mgmtdev vduse

A virtio-blk device should appear and xfstests can be run on it
(typically /dev/vda unless you already have other virtio-blk devices).

Afterwards you can destroy the device using:

  # vdpa dev del vduse0

> 
> - ublk-qcow2 (make test T=qcow2/022)

There are a lot of other factors not directly related to NBD vs ublk. In
order to get an apples-to-apples comparison with qemu-* a ublk export
type is needed in qemu-storage-daemon. That way the only difference is
the ublk interface and the rest of the code path is identical, making it
possible to compare NBD, VDUSE, ublk, etc more precisely.

I think that comparison is interesting before comparing different qcow2
implementations because qcow2 sits on top of too much other code. It's
hard to know what should be attributed to configuration differences,
implementation differences, or fundamental differences that cannot be
overcome (this is the interesting part!).

> 	randwrite(4k): jobs 1, iops 104481
> 	randread(4k): jobs 1, iops 114937
> 	randrw(4k): jobs 1, iops read 53630 write 53577
> 	rw(512k): jobs 1, iops read 1412 write 1423
> 
> Also ublk-qcow2 aligns queue's chunk_sectors limit with qcow2's cluster size,
> which is 64KB at default, this way simplifies backend io handling, but
> it could be increased to 512K or more proper size for improving sequential
> IO perf, just need one coroutine to handle more than one IOs.
> 
> 
> [1] https://github.com/ming1/ubdsrv/commit/9faabbec3a92ca83ddae92335c66eabbeff654e7
> [2] https://upcommons.upc.edu/bitstream/handle/2099.1/9619/65757.pdf?sequence=1&isAllowed=y
> [3] https://lwn.net/Articles/889429/
> [4] https://lab.ks.uni-freiburg.de/projects/kernel-qcow2/repository
> [5] https://github.com/ming1/ubdsrv/blob/master/qcow2/README.rst
> [6] https://github.com/ming1/ubdsrv/blob/master/qcow2/STATUS.rst
> 
> Thanks,
> Ming
> 

* Re: ublk-qcow2: ublk-qcow2 is available
  2022-10-03 19:53 ` Stefan Hajnoczi
@ 2022-10-03 23:57   ` Denis V. Lunev
  2022-10-05 15:11     ` Stefan Hajnoczi
  2022-10-04  9:43   ` Ming Lei
  1 sibling, 1 reply; 44+ messages in thread
From: Denis V. Lunev @ 2022-10-03 23:57 UTC (permalink / raw)
  To: Stefan Hajnoczi, Ming Lei
  Cc: io-uring, linux-block, linux-kernel, Kirill Tkhai, Manuel Bentele,
	qemu-devel, Kevin Wolf, rjones, Xie Yongji, Stefano Garzarella

On 10/3/22 21:53, Stefan Hajnoczi wrote:
> On Fri, Sep 30, 2022 at 05:24:11PM +0800, Ming Lei wrote:
>> ublk-qcow2 is available now.
> Cool, thanks for sharing!
yep

>> So far it provides basic read/write function, and compression and snapshot
>> aren't supported yet. The target/backend implementation is completely
>> based on io_uring, and share the same io_uring with ublk IO command
>> handler, just like what ublk-loop does.
>>
>> Follows the main motivations of ublk-qcow2:
>>
>> - building one complicated target from scratch helps libublksrv APIs/functions
>>    become mature/stable more quickly, since qcow2 is complicated and needs more
>>    requirement from libublksrv compared with other simple ones(loop, null)
>>
>> - there are several attempts of implementing qcow2 driver in kernel, such as
>>    ``qloop`` [2], ``dm-qcow2`` [3] and ``in kernel qcow2(ro)`` [4], so ublk-qcow2
>>    might be useful for covering requirements in this field
There is one important thing to keep in mind about all partly-userspace
implementations though:
* any single allocation made in the context of the
    userspace daemon can go through try_to_free_pages() in the
    kernel, and that reclaim may trigger an operation
    which requires action from the userspace daemon, which
    is itself inside the kernel at that moment.
* the probability of this is higher in an overcommitted
    environment

This was our main motivation for favoring the in-kernel
implementation.

>> - performance comparison with qemu-nbd, and it was my 1st thought to evaluate
>>    performance of ublk/io_uring backend by writing one ublk-qcow2 since ublksrv
>>    is started
>>
>> - help to abstract common building block or design pattern for writing new ublk
>>    target/backend
>>
>> So far it basically passes xfstest(XFS) test by using ublk-qcow2 block
>> device as TEST_DEV, and kernel building workload is verified too. Also
>> soft update approach is applied in meta flushing, and meta data
>> integrity is guaranteed, 'make test T=qcow2/040' covers this kind of
>> test, and only cluster leak is reported during this test.
>>
>> The performance data looks much better compared with qemu-nbd, see
>> details in commit log[1], README[5] and STATUS[6]. And the test covers both
>> empty image and pre-allocated image, for example of pre-allocated qcow2
>> image(8GB):
>>
>> - qemu-nbd (make test T=qcow2/002)
> Single queue?
>
>> 	randwrite(4k): jobs 1, iops 24605
>> 	randread(4k): jobs 1, iops 30938
>> 	randrw(4k): jobs 1, iops read 13981 write 14001
>> 	rw(512k): jobs 1, iops read 724 write 728
> Please try qemu-storage-daemon's VDUSE export type as well. The
> command-line should be similar to this:
>
>    # modprobe virtio_vdpa # attaches vDPA devices to host kernel
>    # modprobe vduse
>    # qemu-storage-daemon \
>        --blockdev file,filename=test.qcow2,cache.direct=on|off,aio=native,node-name=file \
>        --blockdev qcow2,file=file,node-name=qcow2 \
>        --object iothread,id=iothread0 \
>        --export vduse-blk,id=vduse0,name=vduse0,num-queues=$(nproc),node-name=qcow2,writable=on,iothread=iothread0
>    # vdpa dev add name vduse0 mgmtdev vduse
>
> A virtio-blk device should appear and xfstests can be run on it
> (typically /dev/vda unless you already have other virtio-blk devices).
>
> Afterwards you can destroy the device using:
>
>    # vdpa dev del vduse0
but I believe this would still be limited by the single thread doing AIO in
qemu-storage-daemon.


>> - ublk-qcow2 (make test T=qcow2/022)
> There are a lot of other factors not directly related to NBD vs ublk. In
> order to get an apples-to-apples comparison with qemu-* a ublk export
> type is needed in qemu-storage-daemon. That way only the difference is
> the ublk interface and the rest of the code path is identical, making it
> possible to compare NBD, VDUSE, ublk, etc more precisely.
>
> I think that comparison is interesting before comparing different qcow2
> implementations because qcow2 sits on top of too much other code. It's
> hard to know what should be accounted to configuration differences,
> implementation differences, or fundamental differences that cannot be
> overcome (this is the interesting part!).
>
>> 	randwrite(4k): jobs 1, iops 104481
>> 	randread(4k): jobs 1, iops 114937
>> 	randrw(4k): jobs 1, iops read 53630 write 53577
>> 	rw(512k): jobs 1, iops read 1412 write 1423
>>
>> Also ublk-qcow2 aligns queue's chunk_sectors limit with qcow2's cluster size,
>> which is 64KB at default, this way simplifies backend io handling, but
>> it could be increased to 512K or more proper size for improving sequential
>> IO perf, just need one coroutine to handle more than one IOs.
>>
>>
>> [1] https://github.com/ming1/ubdsrv/commit/9faabbec3a92ca83ddae92335c66eabbeff654e7
>> [2] https://upcommons.upc.edu/bitstream/handle/2099.1/9619/65757.pdf?sequence=1&isAllowed=y
>> [3] https://lwn.net/Articles/889429/
>> [4] https://lab.ks.uni-freiburg.de/projects/kernel-qcow2/repository
>> [5] https://github.com/ming1/ubdsrv/blob/master/qcow2/README.rst
>> [6] https://github.com/ming1/ubdsrv/blob/master/qcow2/STATUS.rst

interesting...

Den

* Re: ublk-qcow2: ublk-qcow2 is available
  2022-10-03 23:57   ` Denis V. Lunev
@ 2022-10-05 15:11     ` Stefan Hajnoczi
  2022-10-06 10:26       ` Ming Lei
  0 siblings, 1 reply; 44+ messages in thread
From: Stefan Hajnoczi @ 2022-10-05 15:11 UTC (permalink / raw)
  To: Denis V. Lunev
  Cc: Ming Lei, io-uring, linux-block, linux-kernel, Kirill Tkhai,
	Manuel Bentele, qemu-devel, Kevin Wolf, rjones, Xie Yongji,
	Stefano Garzarella, Josef Bacik

On Tue, Oct 04, 2022 at 01:57:50AM +0200, Denis V. Lunev wrote:
> On 10/3/22 21:53, Stefan Hajnoczi wrote:
> > On Fri, Sep 30, 2022 at 05:24:11PM +0800, Ming Lei wrote:
> > > ublk-qcow2 is available now.
> > Cool, thanks for sharing!
> yep
> 
> > > So far it provides basic read/write function, and compression and snapshot
> > > aren't supported yet. The target/backend implementation is completely
> > > based on io_uring, and share the same io_uring with ublk IO command
> > > handler, just like what ublk-loop does.
> > > 
> > > Follows the main motivations of ublk-qcow2:
> > > 
> > > - building one complicated target from scratch helps libublksrv APIs/functions
> > >    become mature/stable more quickly, since qcow2 is complicated and needs more
> > >    requirement from libublksrv compared with other simple ones(loop, null)
> > > 
> > > - there are several attempts of implementing qcow2 driver in kernel, such as
> > >    ``qloop`` [2], ``dm-qcow2`` [3] and ``in kernel qcow2(ro)`` [4], so ublk-qcow2
> > >    might be useful for covering requirements in this field
> There is one important thing to keep in mind about all partly-userspace
> implementations though:
> * any single allocation happened in the context of the
>    userspace daemon through try_to_free_pages() in
>    kernel has a possibility to trigger the operation,
>    which will require userspace daemon action, which
>    is inside the kernel now.
> * the probability of this is higher in the overcommitted
>    environment
> 
> This was the main motivation of us in favor for the in-kernel
> implementation.

CCed Josef Bacik because the Linux NBD driver has dealt with memory
reclaim hangs in the past.

Josef: Any thoughts on userspace block drivers (whether NBD or ublk) and
how to avoid hangs in memory reclaim?

Stefan

* Re: ublk-qcow2: ublk-qcow2 is available
  2022-10-05 15:11     ` Stefan Hajnoczi
@ 2022-10-06 10:26       ` Ming Lei
  2022-10-06 13:59         ` Stefan Hajnoczi
  0 siblings, 1 reply; 44+ messages in thread
From: Ming Lei @ 2022-10-06 10:26 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Denis V. Lunev, io-uring, linux-block, linux-kernel, Kirill Tkhai,
	Manuel Bentele, qemu-devel, Kevin Wolf, rjones, Xie Yongji,
	Stefano Garzarella, Josef Bacik

On Wed, Oct 05, 2022 at 11:11:32AM -0400, Stefan Hajnoczi wrote:
> On Tue, Oct 04, 2022 at 01:57:50AM +0200, Denis V. Lunev wrote:
> > On 10/3/22 21:53, Stefan Hajnoczi wrote:
> > > On Fri, Sep 30, 2022 at 05:24:11PM +0800, Ming Lei wrote:
> > > > ublk-qcow2 is available now.
> > > Cool, thanks for sharing!
> > yep
> > 
> > > > So far it provides basic read/write function, and compression and snapshot
> > > > aren't supported yet. The target/backend implementation is completely
> > > > based on io_uring, and share the same io_uring with ublk IO command
> > > > handler, just like what ublk-loop does.
> > > > 
> > > > Follows the main motivations of ublk-qcow2:
> > > > 
> > > > - building one complicated target from scratch helps libublksrv APIs/functions
> > > >    become mature/stable more quickly, since qcow2 is complicated and needs more
> > > >    requirement from libublksrv compared with other simple ones(loop, null)
> > > > 
> > > > - there are several attempts of implementing qcow2 driver in kernel, such as
> > > >    ``qloop`` [2], ``dm-qcow2`` [3] and ``in kernel qcow2(ro)`` [4], so ublk-qcow2
> > > >    might be useful for covering requirements in this field
> > There is one important thing to keep in mind about all partly-userspace
> > implementations though:
> > * any single allocation happened in the context of the
> >    userspace daemon through try_to_free_pages() in
> >    kernel has a possibility to trigger the operation,
> >    which will require userspace daemon action, which
> >    is inside the kernel now.
> > * the probability of this is higher in the overcommitted
> >    environment
> > 
> > This was the main motivation of us in favor for the in-kernel
> > implementation.
> 
> CCed Josef Bacik because the Linux NBD driver has dealt with memory
> reclaim hangs in the past.
> 
> Josef: Any thoughts on userspace block drivers (whether NBD or ublk) and
> how to avoid hangs in memory reclaim?

If I remember correctly, there have been no new reports since the last
NBD (TCMU) deadlock in memory reclaim was addressed by 8d19f1c8e193
("prctl: PR_{G,S}ET_IO_FLUSHER to support controlling memory reclaim").
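
For reference, the prctl introduced by that commit can be invoked roughly as
follows. This is a minimal sketch and not code from ublksrv or any NBD/TCMU
daemon: the helper names are hypothetical, and the fallback constants 57 and
58 are the values from linux/prctl.h (Linux 5.6 and later) for toolchains
whose headers predate them. Both calls fail with EPERM unless the caller has
CAP_SYS_RESOURCE.

```c
#include <errno.h>
#include <sys/prctl.h>

/* Fallbacks for older userspace headers; values from linux/prctl.h. */
#ifndef PR_SET_IO_FLUSHER
#define PR_SET_IO_FLUSHER 57
#endif
#ifndef PR_GET_IO_FLUSHER
#define PR_GET_IO_FLUSHER 58
#endif

/*
 * Mark the calling thread as an IO flusher. Returns 0 on success or a
 * negative errno (-EPERM without CAP_SYS_RESOURCE, -EINVAL on kernels
 * that do not know the option).
 */
static int mark_self_io_flusher(void)
{
    if (prctl(PR_SET_IO_FLUSHER, 1, 0, 0, 0) < 0)
        return -errno;
    return 0;
}

/* Returns 1 if the IO flusher state is set, 0 if not, negative errno on error. */
static int is_io_flusher(void)
{
    int ret = prctl(PR_GET_IO_FLUSHER, 0, 0, 0, 0);

    return ret < 0 ? -errno : ret;
}
```

A daemon would typically call mark_self_io_flusher() once in each IO
handling thread before it starts serving requests.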


Thanks, 
Ming

* Re: ublk-qcow2: ublk-qcow2 is available
  2022-10-06 10:26       ` Ming Lei
@ 2022-10-06 13:59         ` Stefan Hajnoczi
  2022-10-06 15:09           ` Ming Lei
  0 siblings, 1 reply; 44+ messages in thread
From: Stefan Hajnoczi @ 2022-10-06 13:59 UTC (permalink / raw)
  To: Denis V. Lunev
  Cc: Ming Lei, io-uring, linux-block, linux-kernel, Kirill Tkhai,
	Manuel Bentele, qemu-devel, Kevin Wolf, rjones, Xie Yongji,
	Stefano Garzarella, Josef Bacik

On Thu, Oct 06, 2022 at 06:26:15PM +0800, Ming Lei wrote:
> On Wed, Oct 05, 2022 at 11:11:32AM -0400, Stefan Hajnoczi wrote:
> > On Tue, Oct 04, 2022 at 01:57:50AM +0200, Denis V. Lunev wrote:
> > > On 10/3/22 21:53, Stefan Hajnoczi wrote:
> > > > On Fri, Sep 30, 2022 at 05:24:11PM +0800, Ming Lei wrote:
> > > > > ublk-qcow2 is available now.
> > > > Cool, thanks for sharing!
> > > yep
> > > 
> > > > > So far it provides basic read/write function, and compression and snapshot
> > > > > aren't supported yet. The target/backend implementation is completely
> > > > > based on io_uring, and share the same io_uring with ublk IO command
> > > > > handler, just like what ublk-loop does.
> > > > > 
> > > > > Follows the main motivations of ublk-qcow2:
> > > > > 
> > > > > - building one complicated target from scratch helps libublksrv APIs/functions
> > > > >    become mature/stable more quickly, since qcow2 is complicated and needs more
> > > > >    requirement from libublksrv compared with other simple ones(loop, null)
> > > > > 
> > > > > - there are several attempts of implementing qcow2 driver in kernel, such as
> > > > >    ``qloop`` [2], ``dm-qcow2`` [3] and ``in kernel qcow2(ro)`` [4], so ublk-qcow2
> > > > >    might be useful for covering requirements in this field
> > > There is one important thing to keep in mind about all partly-userspace
> > > implementations though:
> > > * any single allocation happened in the context of the
> > >    userspace daemon through try_to_free_pages() in
> > >    kernel has a possibility to trigger the operation,
> > >    which will require userspace daemon action, which
> > >    is inside the kernel now.
> > > * the probability of this is higher in the overcommitted
> > >    environment
> > > 
> > > This was the main motivation of us in favor for the in-kernel
> > > implementation.
> > 
> > CCed Josef Bacik because the Linux NBD driver has dealt with memory
> > reclaim hangs in the past.
> > 
> > Josef: Any thoughts on userspace block drivers (whether NBD or ublk) and
> > how to avoid hangs in memory reclaim?
> 
> If I remember correctly, there isn't new report after the last NBD(TCMU) deadlock
> in memory reclaim was addressed by 8d19f1c8e193 ("prctl: PR_{G,S}ET_IO_FLUSHER
> to support controlling memory reclaim").

Denis: I'm trying to understand the problem you described. Is this
correct:

Due to memory pressure, the kernel reclaims pages and submits a write to
a ublk block device. The userspace process attempts to allocate memory
in order to service the write request, but it gets stuck because there
is no memory available. As a result reclaim gets stuck, the system is
unable to free more memory and therefore it hangs?

Stefan

* Re: ublk-qcow2: ublk-qcow2 is available
  2022-10-06 13:59         ` Stefan Hajnoczi
@ 2022-10-06 15:09           ` Ming Lei
  2022-10-06 18:29             ` Stefan Hajnoczi
  0 siblings, 1 reply; 44+ messages in thread
From: Ming Lei @ 2022-10-06 15:09 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Denis V. Lunev, io-uring, linux-block, linux-kernel, Kirill Tkhai,
	Manuel Bentele, qemu-devel, Kevin Wolf, rjones, Xie Yongji,
	Stefano Garzarella, Josef Bacik

On Thu, Oct 06, 2022 at 09:59:40AM -0400, Stefan Hajnoczi wrote:
> On Thu, Oct 06, 2022 at 06:26:15PM +0800, Ming Lei wrote:
> > On Wed, Oct 05, 2022 at 11:11:32AM -0400, Stefan Hajnoczi wrote:
> > > On Tue, Oct 04, 2022 at 01:57:50AM +0200, Denis V. Lunev wrote:
> > > > On 10/3/22 21:53, Stefan Hajnoczi wrote:
> > > > > On Fri, Sep 30, 2022 at 05:24:11PM +0800, Ming Lei wrote:
> > > > > > ublk-qcow2 is available now.
> > > > > Cool, thanks for sharing!
> > > > yep
> > > > 
> > > > > > So far it provides basic read/write function, and compression and snapshot
> > > > > > aren't supported yet. The target/backend implementation is completely
> > > > > > based on io_uring, and share the same io_uring with ublk IO command
> > > > > > handler, just like what ublk-loop does.
> > > > > > 
> > > > > > Follows the main motivations of ublk-qcow2:
> > > > > > 
> > > > > > - building one complicated target from scratch helps libublksrv APIs/functions
> > > > > >    become mature/stable more quickly, since qcow2 is complicated and needs more
> > > > > >    requirement from libublksrv compared with other simple ones(loop, null)
> > > > > > 
> > > > > > - there are several attempts of implementing qcow2 driver in kernel, such as
> > > > > >    ``qloop`` [2], ``dm-qcow2`` [3] and ``in kernel qcow2(ro)`` [4], so ublk-qcow2
> > > > > >    might be useful for covering requirements in this field
> > > > There is one important thing to keep in mind about all partly-userspace
> > > > implementations though:
> > > > * any single allocation happened in the context of the
> > > >    userspace daemon through try_to_free_pages() in
> > > >    kernel has a possibility to trigger the operation,
> > > >    which will require userspace daemon action, which
> > > >    is inside the kernel now.
> > > > * the probability of this is higher in the overcommitted
> > > >    environment
> > > > 
> > > > This was the main motivation of us in favor for the in-kernel
> > > > implementation.
> > > 
> > > CCed Josef Bacik because the Linux NBD driver has dealt with memory
> > > reclaim hangs in the past.
> > > 
> > > Josef: Any thoughts on userspace block drivers (whether NBD or ublk) and
> > > how to avoid hangs in memory reclaim?
> > 
> > If I remember correctly, there isn't new report after the last NBD(TCMU) deadlock
> > in memory reclaim was addressed by 8d19f1c8e193 ("prctl: PR_{G,S}ET_IO_FLUSHER
> > to support controlling memory reclaim").
> 
> Denis: I'm trying to understand the problem you described. Is this
> correct:
> 
> Due to memory pressure, the kernel reclaims pages and submits a write to
> a ublk block device. The userspace process attempts to allocate memory
> in order to service the write request, but it gets stuck because there
> is no memory available. As a result reclaim gets stuck, the system is
> unable to free more memory and therefore it hangs?

The process should be killed in this situation if PR_SET_IO_FLUSHER
is applied, since the page allocation is done in the VM fault handler.

Firstly, in theory the userspace part should provide a forward progress
guarantee in the code path that handles IO, for example by reserving/mlocking
pages for such situations. However, this issue isn't unique to nbd or ublk;
all userspace block devices carry this potential risk, and vduse
is no exception, IMO.

Secondly, with proper/enough swap space, I think it is hard to trigger
this kind of issue.

Finally, the ublk driver has added user recovery commands for recovering from
a crash, and ublksrv will support them soon.

Thanks,
Ming

* Re: ublk-qcow2: ublk-qcow2 is available
  2022-10-06 15:09           ` Ming Lei
@ 2022-10-06 18:29             ` Stefan Hajnoczi
  2022-10-07 11:21               ` Ming Lei
  0 siblings, 1 reply; 44+ messages in thread
From: Stefan Hajnoczi @ 2022-10-06 18:29 UTC (permalink / raw)
  To: Ming Lei
  Cc: Denis V. Lunev, io-uring, linux-block, linux-kernel, Kirill Tkhai,
	Manuel Bentele, qemu-devel, Kevin Wolf, rjones, Xie Yongji,
	Stefano Garzarella, Josef Bacik, Mike Christie

On Thu, Oct 06, 2022 at 11:09:48PM +0800, Ming Lei wrote:
> On Thu, Oct 06, 2022 at 09:59:40AM -0400, Stefan Hajnoczi wrote:
> > On Thu, Oct 06, 2022 at 06:26:15PM +0800, Ming Lei wrote:
> > > On Wed, Oct 05, 2022 at 11:11:32AM -0400, Stefan Hajnoczi wrote:
> > > > On Tue, Oct 04, 2022 at 01:57:50AM +0200, Denis V. Lunev wrote:
> > > > > On 10/3/22 21:53, Stefan Hajnoczi wrote:
> > > > > > On Fri, Sep 30, 2022 at 05:24:11PM +0800, Ming Lei wrote:
> > > > > > > ublk-qcow2 is available now.
> > > > > > Cool, thanks for sharing!
> > > > > yep
> > > > > 
> > > > > > > So far it provides basic read/write function, and compression and snapshot
> > > > > > > aren't supported yet. The target/backend implementation is completely
> > > > > > > based on io_uring, and share the same io_uring with ublk IO command
> > > > > > > handler, just like what ublk-loop does.
> > > > > > > 
> > > > > > > Follows the main motivations of ublk-qcow2:
> > > > > > > 
> > > > > > > - building one complicated target from scratch helps libublksrv APIs/functions
> > > > > > >    become mature/stable more quickly, since qcow2 is complicated and needs more
> > > > > > >    requirement from libublksrv compared with other simple ones(loop, null)
> > > > > > > 
> > > > > > > - there are several attempts of implementing qcow2 driver in kernel, such as
> > > > > > >    ``qloop`` [2], ``dm-qcow2`` [3] and ``in kernel qcow2(ro)`` [4], so ublk-qcow2
> > > > > > >    might be useful for covering requirements in this field
> > > > > There is one important thing to keep in mind about all partly-userspace
> > > > > implementations though:
> > > > > * any single allocation happened in the context of the
> > > > >    userspace daemon through try_to_free_pages() in
> > > > >    kernel has a possibility to trigger the operation,
> > > > >    which will require userspace daemon action, which
> > > > >    is inside the kernel now.
> > > > > * the probability of this is higher in the overcommitted
> > > > >    environment
> > > > > 
> > > > > This was the main motivation of us in favor for the in-kernel
> > > > > implementation.
> > > > 
> > > > CCed Josef Bacik because the Linux NBD driver has dealt with memory
> > > > reclaim hangs in the past.
> > > > 
> > > > Josef: Any thoughts on userspace block drivers (whether NBD or ublk) and
> > > > how to avoid hangs in memory reclaim?
> > > 
> > > If I remember correctly, there isn't new report after the last NBD(TCMU) deadlock
> > > in memory reclaim was addressed by 8d19f1c8e193 ("prctl: PR_{G,S}ET_IO_FLUSHER
> > > to support controlling memory reclaim").
> > 
> > Denis: I'm trying to understand the problem you described. Is this
> > correct:
> > 
> > Due to memory pressure, the kernel reclaims pages and submits a write to
> > a ublk block device. The userspace process attempts to allocate memory
> > in order to service the write request, but it gets stuck because there
> > is no memory available. As a result reclaim gets stuck, the system is
> > unable to free more memory and therefore it hangs?
> 
> The process should be killed in this situation if PR_SET_IO_FLUSHER
> is applied since the page allocation is done in VM fault handler.

Thanks for mentioning PR_SET_IO_FLUSHER. There is more info in commit
8d19f1c8e1937baf74e1962aae9f90fa3aeab463 ("prctl: PR_{G,S}ET_IO_FLUSHER
to support controlling memory reclaim").

It requires CAP_SYS_RESOURCE :/. This makes me wonder whether
unprivileged ublk will ever be possible.

I think this addresses Denis' concern about hangs, but it doesn't solve
them because I/O will fail. The real solution is probably what you
mentioned...

> Firstly in theory the userspace part should provide forward progress
> guarantee in code path for handling IO, such as reserving/mlock pages
> for such situation. However, this issue isn't unique for nbd or ublk,
> all userspace block device should have such potential risk, and vduse
> is no exception, IMO.

...here. Userspace needs to minimize memory allocations in the I/O code
path and reserve sufficient resources to make forward progress.
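
One way to read "reserve sufficient resources" is a buffer pool that is
allocated, touched and mlock()ed before any IO is served, so that the IO path
itself never calls the allocator. The sketch below only illustrates that
idea: every name in it is hypothetical, the pool size and buffer size are
arbitrary assumptions, and it does not describe how ublksrv or QEMU manage
their buffers. mlock() can fail when RLIMIT_MEMLOCK is low, so the sketch
records that instead of treating it as fatal.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define POOL_BUFS   8            /* assumption: one buffer per queue slot */
#define POOL_BUF_SZ (64 * 1024)  /* assumption: one 64KB cluster per buffer */

struct io_buf_pool {
    void  *all[POOL_BUFS];       /* every buffer, kept for teardown */
    void  *free_list[POOL_BUFS]; /* stack of currently unused buffers */
    size_t nr_all;
    size_t nr_free;
    int    locked;               /* 1 if every buffer was mlock()ed */
};

/* Unlock and free every buffer owned by the pool. */
static void pool_destroy(struct io_buf_pool *p)
{
    for (size_t i = 0; i < p->nr_all; i++) {
        munlock(p->all[i], POOL_BUF_SZ);
        free(p->all[i]);
    }
    p->nr_all = p->nr_free = 0;
}

/* Allocate, fault in and try to pin all buffers up front. */
static int pool_init(struct io_buf_pool *p)
{
    p->nr_all = p->nr_free = 0;
    p->locked = 1;
    for (size_t i = 0; i < POOL_BUFS; i++) {
        void *buf = NULL;

        if (posix_memalign(&buf, 4096, POOL_BUF_SZ) != 0) {
            pool_destroy(p);
            return -1;
        }
        memset(buf, 0, POOL_BUF_SZ);       /* touch pages now, not in the IO path */
        if (mlock(buf, POOL_BUF_SZ) != 0)  /* may fail under RLIMIT_MEMLOCK */
            p->locked = 0;
        p->all[p->nr_all++] = buf;
        p->free_list[p->nr_free++] = buf;
    }
    return 0;
}

/* IO path: take a buffer without allocating; NULL means "wait and retry". */
static void *pool_get(struct io_buf_pool *p)
{
    return p->nr_free ? p->free_list[--p->nr_free] : NULL;
}

/* IO path: hand a buffer back to the pool. */
static void pool_put(struct io_buf_pool *p, void *buf)
{
    if (p->nr_free < POOL_BUFS)
        p->free_list[p->nr_free++] = buf;
}
```

When pool_get() returns NULL the caller has to defer the request until a
buffer is returned, which bounds memory use instead of allocating under
pressure.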

Stefan

* Re: ublk-qcow2: ublk-qcow2 is available
  2022-10-06 18:29             ` Stefan Hajnoczi
@ 2022-10-07 11:21               ` Ming Lei
  0 siblings, 0 replies; 44+ messages in thread
From: Ming Lei @ 2022-10-07 11:21 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Denis V. Lunev, io-uring, linux-block, linux-kernel, Kirill Tkhai,
	Manuel Bentele, qemu-devel, Kevin Wolf, rjones, Xie Yongji,
	Stefano Garzarella, Josef Bacik, Mike Christie

On Thu, Oct 06, 2022 at 02:29:55PM -0400, Stefan Hajnoczi wrote:
> On Thu, Oct 06, 2022 at 11:09:48PM +0800, Ming Lei wrote:
> > On Thu, Oct 06, 2022 at 09:59:40AM -0400, Stefan Hajnoczi wrote:
> > > On Thu, Oct 06, 2022 at 06:26:15PM +0800, Ming Lei wrote:
> > > > On Wed, Oct 05, 2022 at 11:11:32AM -0400, Stefan Hajnoczi wrote:
> > > > > On Tue, Oct 04, 2022 at 01:57:50AM +0200, Denis V. Lunev wrote:
> > > > > > On 10/3/22 21:53, Stefan Hajnoczi wrote:
> > > > > > > On Fri, Sep 30, 2022 at 05:24:11PM +0800, Ming Lei wrote:
> > > > > > > > ublk-qcow2 is available now.
> > > > > > > Cool, thanks for sharing!
> > > > > > yep
> > > > > > 
> > > > > > > > So far it provides basic read/write function, and compression and snapshot
> > > > > > > > aren't supported yet. The target/backend implementation is completely
> > > > > > > > based on io_uring, and share the same io_uring with ublk IO command
> > > > > > > > handler, just like what ublk-loop does.
> > > > > > > > 
> > > > > > > > Follows the main motivations of ublk-qcow2:
> > > > > > > > 
> > > > > > > > - building one complicated target from scratch helps libublksrv APIs/functions
> > > > > > > >    become mature/stable more quickly, since qcow2 is complicated and needs more
> > > > > > > >    requirement from libublksrv compared with other simple ones(loop, null)
> > > > > > > > 
> > > > > > > > - there are several attempts of implementing qcow2 driver in kernel, such as
> > > > > > > >    ``qloop`` [2], ``dm-qcow2`` [3] and ``in kernel qcow2(ro)`` [4], so ublk-qcow2
> > > > > > > >    might be useful for covering requirements in this field
> > > > > > There is one important thing to keep in mind about all partly-userspace
> > > > > > implementations though:
> > > > > > * any single allocation happened in the context of the
> > > > > >    userspace daemon through try_to_free_pages() in
> > > > > >    kernel has a possibility to trigger the operation,
> > > > > >    which will require userspace daemon action, which
> > > > > >    is inside the kernel now.
> > > > > > * the probability of this is higher in the overcommitted
> > > > > >    environment
> > > > > > 
> > > > > > This was the main motivation of us in favor for the in-kernel
> > > > > > implementation.
> > > > > 
> > > > > CCed Josef Bacik because the Linux NBD driver has dealt with memory
> > > > > reclaim hangs in the past.
> > > > > 
> > > > > Josef: Any thoughts on userspace block drivers (whether NBD or ublk) and
> > > > > how to avoid hangs in memory reclaim?
> > > > 
> > > > If I remember correctly, there isn't new report after the last NBD(TCMU) deadlock
> > > > in memory reclaim was addressed by 8d19f1c8e193 ("prctl: PR_{G,S}ET_IO_FLUSHER
> > > > to support controlling memory reclaim").
> > > 
> > > Denis: I'm trying to understand the problem you described. Is this
> > > correct:
> > > 
> > > Due to memory pressure, the kernel reclaims pages and submits a write to
> > > a ublk block device. The userspace process attempts to allocate memory
> > > in order to service the write request, but it gets stuck because there
> > > is no memory available. As a result reclaim gets stuck, the system is
> > > unable to free more memory and therefore it hangs?
> > 
> > The process should be killed in this situation if PR_SET_IO_FLUSHER
> > is applied since the page allocation is done in VM fault handler.
> 
> Thanks for mentioning PR_SET_IO_FLUSHER. There is more info in commit
> 8d19f1c8e1937baf74e1962aae9f90fa3aeab463 ("prctl: PR_{G,S}ET_IO_FLUSHER
> to support controlling memory reclaim").
> 
> It requires CAP_SYS_RESOURCE :/. This makes me wonder whether
> unprivileged ublk will ever be possible.

IMO, it shouldn't be a blocker; there are several options for us:

- unprivileged ublk can simply not call it. If such an IO hang is triggered,
ublksrv is capable of detecting the problem, and can then kill & recover the
device.

- apply the PR_SET_IO_FLUSHER state to the current task from the kernel side,
in ublk_ch_uring_cmd(UBLK_IO_FETCH_REQ)

- ...
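
For reference, the userspace side of the prctl is tiny. Below is a minimal
sketch in Python via ctypes (Linux only; ublksrv itself is C/C++). The helper
name set_io_flusher is made up for illustration; the constants are the ones
from include/uapi/linux/prctl.h.

```python
import ctypes

# From include/uapi/linux/prctl.h (available since Linux 5.6).
PR_SET_IO_FLUSHER = 57
PR_GET_IO_FLUSHER = 58


def set_io_flusher(enable: bool = True):
    """Put the calling task into (or out of) the IO flusher state.

    Returns (True, 0) on success or (False, errno) on failure.
    """
    libc = ctypes.CDLL(None, use_errno=True)
    # The unused arguments must be zero, otherwise the kernel returns EINVAL.
    ret = libc.prctl(
        PR_SET_IO_FLUSHER,
        ctypes.c_ulong(int(enable)),
        ctypes.c_ulong(0),
        ctypes.c_ulong(0),
        ctypes.c_ulong(0),
    )
    if ret != 0:
        return False, ctypes.get_errno()
    return True, 0
```

An EPERM result here means CAP_SYS_RESOURCE is missing, which is exactly the
unprivileged case being discussed.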

> 
> I think this addresses Denis' concern about hangs, but it doesn't solve
> them because I/O will fail. The real solution is probably what you
> mentioned...

So far I haven't seen a real report yet, and it may never be an issue if a
proper swap device/file is configured.


Thanks, 
Ming


* Re: ublk-qcow2: ublk-qcow2 is available
  2022-10-03 19:53 ` Stefan Hajnoczi
  2022-10-03 23:57   ` Denis V. Lunev
@ 2022-10-04  9:43   ` Ming Lei
  2022-10-04 13:53     ` Stefan Hajnoczi
  1 sibling, 1 reply; 44+ messages in thread
From: Ming Lei @ 2022-10-04  9:43 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: io-uring, linux-block, linux-kernel, Kirill Tkhai, Manuel Bentele,
	qemu-devel, Kevin Wolf, rjones, Xie Yongji, Denis V. Lunev,
	Stefano Garzarella

On Mon, Oct 03, 2022 at 03:53:41PM -0400, Stefan Hajnoczi wrote:
> On Fri, Sep 30, 2022 at 05:24:11PM +0800, Ming Lei wrote:
> > ublk-qcow2 is available now.
> 
> Cool, thanks for sharing!
> 
> > 
> > So far it provides basic read/write function, and compression and snapshot
> > aren't supported yet. The target/backend implementation is completely
> > based on io_uring, and share the same io_uring with ublk IO command
> > handler, just like what ublk-loop does.
> > 
> > Follows the main motivations of ublk-qcow2:
> > 
> > - building one complicated target from scratch helps libublksrv APIs/functions
> >   become mature/stable more quickly, since qcow2 is complicated and needs more
> >   requirement from libublksrv compared with other simple ones(loop, null)
> > 
> > - there are several attempts of implementing qcow2 driver in kernel, such as
> >   ``qloop`` [2], ``dm-qcow2`` [3] and ``in kernel qcow2(ro)`` [4], so ublk-qcow2
> >   might be useful for covering requirements in this field
> > 
> > - performance comparison with qemu-nbd, and it was my 1st thought to evaluate
> >   performance of ublk/io_uring backend by writing one ublk-qcow2 since ublksrv
> >   is started
> > 
> > - help to abstract common building block or design pattern for writing new ublk
> >   target/backend
> > 
> > So far it basically passes xfstest(XFS) test by using ublk-qcow2 block
> > device as TEST_DEV, and kernel building workload is verified too. Also
> > soft update approach is applied in meta flushing, and meta data
> > integrity is guaranteed, 'make test T=qcow2/040' covers this kind of
> > test, and only cluster leak is reported during this test.
> > 
> > The performance data looks much better compared with qemu-nbd, see
> > details in commit log[1], README[5] and STATUS[6]. And the test covers both
> > empty image and pre-allocated image, for example of pre-allocated qcow2
> > image(8GB):
> > 
> > - qemu-nbd (make test T=qcow2/002)
> 
> Single queue?

Yeah.

> 
> > 	randwrite(4k): jobs 1, iops 24605
> > 	randread(4k): jobs 1, iops 30938
> > 	randrw(4k): jobs 1, iops read 13981 write 14001
> > 	rw(512k): jobs 1, iops read 724 write 728
> 
> Please try qemu-storage-daemon's VDUSE export type as well. The
> command-line should be similar to this:
> 
>   # modprobe virtio_vdpa # attaches vDPA devices to host kernel

I couldn't find the virtio_vdpa module even though I enabled all of the
following options:

        --- vDPA drivers                                 
          <M>   vDPA device simulator core               
          <M>     vDPA simulator for networking device   
          <M>     vDPA simulator for block device        
          <M>   VDUSE (vDPA Device in Userspace) support 
          <M>   Intel IFC VF vDPA driver                 
          <M>   Virtio PCI bridge vDPA driver            
          <M>   vDPA driver for Alibaba ENI

BTW, my test environment is a VM, and the data I shared was collected inside
the VM too. Can virtio_vdpa be used inside a VM?

>   # modprobe vduse
>   # qemu-storage-daemon \
>       --blockdev file,filename=test.qcow2,cache.direct=on|off,aio=native,node-name=file \
>       --blockdev qcow2,file=file,node-name=qcow2 \
>       --object iothread,id=iothread0 \
>       --export vduse-blk,id=vduse0,name=vduse0,num-queues=$(nproc),node-name=qcow2,writable=on,iothread=iothread0
>   # vdpa dev add name vduse0 mgmtdev vduse
> 
> A virtio-blk device should appear and xfstests can be run on it
> (typically /dev/vda unless you already have other virtio-blk devices).
> 
> Afterwards you can destroy the device using:
> 
>   # vdpa dev del vduse0
> 
> > 
> > - ublk-qcow2 (make test T=qcow2/022)
> 
> There are a lot of other factors not directly related to NBD vs ublk. In
> order to get an apples-to-apples comparison with qemu-* a ublk export
> type is needed in qemu-storage-daemon. That way only the difference is
> the ublk interface and the rest of the code path is identical, making it
> possible to compare NBD, VDUSE, ublk, etc more precisely.

Maybe not true.

ublk-qcow2 uses io_uring to handle all backend IO (including meta IO), and
so far a single io_uring/pthread handles all qcow2 IOs and IO commands.


thanks, 
Ming


* Re: ublk-qcow2: ublk-qcow2 is available
  2022-10-04  9:43   ` Ming Lei
@ 2022-10-04 13:53     ` Stefan Hajnoczi
  2022-10-05  4:18       ` Ming Lei
  2022-10-06 10:14       ` Richard W.M. Jones
  0 siblings, 2 replies; 44+ messages in thread
From: Stefan Hajnoczi @ 2022-10-04 13:53 UTC (permalink / raw)
  To: Ming Lei
  Cc: Stefan Hajnoczi, io-uring, linux-block, linux-kernel,
	Kirill Tkhai, Manuel Bentele, qemu-devel, Kevin Wolf, rjones,
	Xie Yongji, Denis V. Lunev, Stefano Garzarella

On Tue, 4 Oct 2022 at 05:44, Ming Lei <[email protected]> wrote:
>
> On Mon, Oct 03, 2022 at 03:53:41PM -0400, Stefan Hajnoczi wrote:
> > On Fri, Sep 30, 2022 at 05:24:11PM +0800, Ming Lei wrote:
> > > ublk-qcow2 is available now.
> >
> > Cool, thanks for sharing!
> >
> > >
> > > So far it provides basic read/write function, and compression and snapshot
> > > aren't supported yet. The target/backend implementation is completely
> > > based on io_uring, and share the same io_uring with ublk IO command
> > > handler, just like what ublk-loop does.
> > >
> > > Follows the main motivations of ublk-qcow2:
> > >
> > > - building one complicated target from scratch helps libublksrv APIs/functions
> > >   become mature/stable more quickly, since qcow2 is complicated and needs more
> > >   requirement from libublksrv compared with other simple ones(loop, null)
> > >
> > > - there are several attempts of implementing qcow2 driver in kernel, such as
> > >   ``qloop`` [2], ``dm-qcow2`` [3] and ``in kernel qcow2(ro)`` [4], so ublk-qcow2
> > >   might be useful for covering requirements in this field
> > >
> > > - performance comparison with qemu-nbd, and it was my 1st thought to evaluate
> > >   performance of ublk/io_uring backend by writing one ublk-qcow2 since ublksrv
> > >   is started
> > >
> > > - help to abstract common building block or design pattern for writing new ublk
> > >   target/backend
> > >
> > > So far it basically passes xfstest(XFS) test by using ublk-qcow2 block
> > > device as TEST_DEV, and kernel building workload is verified too. Also
> > > soft update approach is applied in meta flushing, and meta data
> > > integrity is guaranteed, 'make test T=qcow2/040' covers this kind of
> > > test, and only cluster leak is reported during this test.
> > >
> > > The performance data looks much better compared with qemu-nbd, see
> > > details in commit log[1], README[5] and STATUS[6]. And the test covers both
> > > empty image and pre-allocated image, for example of pre-allocated qcow2
> > > image(8GB):
> > >
> > > - qemu-nbd (make test T=qcow2/002)
> >
> > Single queue?
>
> Yeah.
>
> >
> > >     randwrite(4k): jobs 1, iops 24605
> > >     randread(4k): jobs 1, iops 30938
> > >     randrw(4k): jobs 1, iops read 13981 write 14001
> > >     rw(512k): jobs 1, iops read 724 write 728
> >
> > Please try qemu-storage-daemon's VDUSE export type as well. The
> > command-line should be similar to this:
> >
> >   # modprobe virtio_vdpa # attaches vDPA devices to host kernel
>
> I couldn't find the virtio_vdpa module even though I enabled all of the
> following options:
>
>         --- vDPA drivers
>           <M>   vDPA device simulator core
>           <M>     vDPA simulator for networking device
>           <M>     vDPA simulator for block device
>           <M>   VDUSE (vDPA Device in Userspace) support
>           <M>   Intel IFC VF vDPA driver
>           <M>   Virtio PCI bridge vDPA driver
>           <M>   vDPA driver for Alibaba ENI
>
> BTW, my test environment is a VM, and the data I shared was collected inside
> the VM too. Can virtio_vdpa be used inside a VM?

I hope Xie Yongji can help explain how to benchmark VDUSE.

virtio_vdpa is available inside guests too. Please check that
VIRTIO_VDPA ("vDPA driver for virtio devices") is enabled in "Virtio
drivers" menu.
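
If it helps, a quick way to check is to look the symbol up in the kernel
config. A minimal Python sketch, assuming the config of the running kernel is
readable at /boot/config-<release> (distro dependent; some kernels expose
/proc/config.gz instead). kconfig_value is a made-up helper name.

```python
import os
import re


def kconfig_value(config_text: str, symbol: str):
    """Return the value of a Kconfig symbol ('y', 'm', ...) or None if unset."""
    match = re.search(rf"^{re.escape(symbol)}=(\S+)$", config_text, re.MULTILINE)
    return match.group(1) if match else None


def running_kernel_config_path() -> str:
    """Conventional location of the running kernel's config on many distros."""
    return f"/boot/config-{os.uname().release}"


if __name__ == "__main__":
    path = running_kernel_config_path()
    if os.path.exists(path):
        with open(path) as f:
            value = kconfig_value(f.read(), "CONFIG_VIRTIO_VDPA")
        print("CONFIG_VIRTIO_VDPA =", value)
    else:
        print(f"{path} not found; try /proc/config.gz")
```

A result of None means the option is not set, which would explain modprobe
not finding the module.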

>
> >   # modprobe vduse
> >   # qemu-storage-daemon \
> > >       --blockdev file,filename=test.qcow2,cache.direct=on|off,aio=native,node-name=file \
> >       --blockdev qcow2,file=file,node-name=qcow2 \
> >       --object iothread,id=iothread0 \
> >       --export vduse-blk,id=vduse0,name=vduse0,num-queues=$(nproc),node-name=qcow2,writable=on,iothread=iothread0
> >   # vdpa dev add name vduse0 mgmtdev vduse
> >
> > A virtio-blk device should appear and xfstests can be run on it
> > (typically /dev/vda unless you already have other virtio-blk devices).
> >
> > Afterwards you can destroy the device using:
> >
> >   # vdpa dev del vduse0
> >
> > >
> > > - ublk-qcow2 (make test T=qcow2/022)
> >
> > There are a lot of other factors not directly related to NBD vs ublk. In
> > order to get an apples-to-apples comparison with qemu-* a ublk export
> > type is needed in qemu-storage-daemon. That way only the difference is
> > the ublk interface and the rest of the code path is identical, making it
> > possible to compare NBD, VDUSE, ublk, etc more precisely.
>
> Maybe not true.
>
> ublk-qcow2 uses io_uring to handle all backend IO (including meta IO), and
> so far a single io_uring/pthread handles all qcow2 IOs and IO commands.

qemu-nbd doesn't use io_uring to handle the backend IO, so we don't
know whether the benchmark demonstrates that ublk is faster than NBD,
that the ublk-qcow2 implementation is faster than qemu-nbd's qcow2,
whether there are miscellaneous implementation differences between
ublk-qcow2 and qemu-nbd (like using the same io_uring context for both
ublk and backend IO), or something else.

I'm suggesting measuring changes to just 1 variable at a time.
Otherwise it's hard to reach a conclusion about the root cause of the
performance difference. Let's learn why ublk-qcow2 performs well.

Stefan


* Re: ublk-qcow2: ublk-qcow2 is available
  2022-10-04 13:53     ` Stefan Hajnoczi
@ 2022-10-05  4:18       ` Ming Lei
  2022-10-05 12:21         ` Stefan Hajnoczi
  2022-10-08  8:43         ` Ziyang Zhang
  2022-10-06 10:14       ` Richard W.M. Jones
  1 sibling, 2 replies; 44+ messages in thread
From: Ming Lei @ 2022-10-05  4:18 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Stefan Hajnoczi, io-uring, linux-block, linux-kernel,
	Kirill Tkhai, Manuel Bentele, qemu-devel, Kevin Wolf, rjones,
	Xie Yongji, Denis V. Lunev, Stefano Garzarella

On Tue, Oct 04, 2022 at 09:53:32AM -0400, Stefan Hajnoczi wrote:
> On Tue, 4 Oct 2022 at 05:44, Ming Lei <[email protected]> wrote:
> >
> > On Mon, Oct 03, 2022 at 03:53:41PM -0400, Stefan Hajnoczi wrote:
> > > On Fri, Sep 30, 2022 at 05:24:11PM +0800, Ming Lei wrote:
> > > > ublk-qcow2 is available now.
> > >
> > > Cool, thanks for sharing!
> > >
> > > >
> > > > So far it provides basic read/write function, and compression and snapshot
> > > > aren't supported yet. The target/backend implementation is completely
> > > > based on io_uring, and share the same io_uring with ublk IO command
> > > > handler, just like what ublk-loop does.
> > > >
> > > > Follows the main motivations of ublk-qcow2:
> > > >
> > > > - building one complicated target from scratch helps libublksrv APIs/functions
> > > >   become mature/stable more quickly, since qcow2 is complicated and needs more
> > > >   requirement from libublksrv compared with other simple ones(loop, null)
> > > >
> > > > - there are several attempts of implementing qcow2 driver in kernel, such as
> > > >   ``qloop`` [2], ``dm-qcow2`` [3] and ``in kernel qcow2(ro)`` [4], so ublk-qcow2
> > > >   might be useful for covering requirements in this field
> > > >
> > > > - performance comparison with qemu-nbd, and it was my 1st thought to evaluate
> > > >   performance of ublk/io_uring backend by writing one ublk-qcow2 since ublksrv
> > > >   is started
> > > >
> > > > - help to abstract common building block or design pattern for writing new ublk
> > > >   target/backend
> > > >
> > > > So far it basically passes xfstest(XFS) test by using ublk-qcow2 block
> > > > device as TEST_DEV, and kernel building workload is verified too. Also
> > > > soft update approach is applied in meta flushing, and meta data
> > > > integrity is guaranteed, 'make test T=qcow2/040' covers this kind of
> > > > test, and only cluster leak is reported during this test.
> > > >
> > > > The performance data looks much better compared with qemu-nbd, see
> > > > details in commit log[1], README[5] and STATUS[6]. And the test covers both
> > > > empty image and pre-allocated image, for example of pre-allocated qcow2
> > > > image(8GB):
> > > >
> > > > - qemu-nbd (make test T=qcow2/002)
> > >
> > > Single queue?
> >
> > Yeah.
> >
> > >
> > > >     randwrite(4k): jobs 1, iops 24605
> > > >     randread(4k): jobs 1, iops 30938
> > > >     randrw(4k): jobs 1, iops read 13981 write 14001
> > > >     rw(512k): jobs 1, iops read 724 write 728
> > >
> > > Please try qemu-storage-daemon's VDUSE export type as well. The
> > > command-line should be similar to this:
> > >
> > >   # modprobe virtio_vdpa # attaches vDPA devices to host kernel
> >
> > I couldn't find the virtio_vdpa module even though I enabled all of the
> > following options:
> >
> >         --- vDPA drivers
> >           <M>   vDPA device simulator core
> >           <M>     vDPA simulator for networking device
> >           <M>     vDPA simulator for block device
> >           <M>   VDUSE (vDPA Device in Userspace) support
> >           <M>   Intel IFC VF vDPA driver
> >           <M>   Virtio PCI bridge vDPA driver
> >           <M>   vDPA driver for Alibaba ENI
> >
> > BTW, my test environment is a VM, and the data I shared was collected inside
> > the VM too. Can virtio_vdpa be used inside a VM?
> 
> I hope Xie Yongji can help explain how to benchmark VDUSE.
> 
> virtio_vdpa is available inside guests too. Please check that
> VIRTIO_VDPA ("vDPA driver for virtio devices") is enabled in "Virtio
> drivers" menu.
> 
> >
> > >   # modprobe vduse
> > >   # qemu-storage-daemon \
> > > >       --blockdev file,filename=test.qcow2,cache.direct=on|off,aio=native,node-name=file \
> > >       --blockdev qcow2,file=file,node-name=qcow2 \
> > >       --object iothread,id=iothread0 \
> > >       --export vduse-blk,id=vduse0,name=vduse0,num-queues=$(nproc),node-name=qcow2,writable=on,iothread=iothread0
> > >   # vdpa dev add name vduse0 mgmtdev vduse
> > >
> > > A virtio-blk device should appear and xfstests can be run on it
> > > (typically /dev/vda unless you already have other virtio-blk devices).
> > >
> > > Afterwards you can destroy the device using:
> > >
> > >   # vdpa dev del vduse0
> > >
> > > >
> > > > - ublk-qcow2 (make test T=qcow2/022)
> > >
> > > There are a lot of other factors not directly related to NBD vs ublk. In
> > > order to get an apples-to-apples comparison with qemu-* a ublk export
> > > type is needed in qemu-storage-daemon. That way only the difference is
> > > the ublk interface and the rest of the code path is identical, making it
> > > possible to compare NBD, VDUSE, ublk, etc more precisely.
> >
> > Maybe not true.
> >
> > ublk-qcow2 uses io_uring to handle all backend IO (including meta IO), and
> > so far a single io_uring/pthread handles all qcow2 IOs and IO commands.
> 
> qemu-nbd doesn't use io_uring to handle the backend IO, so we don't

I tried to enable it by passing --aio=io_uring when setting up qemu-nbd, but
it did not succeed.

> know whether the benchmark demonstrates that ublk is faster than NBD,
> that the ublk-qcow2 implementation is faster than qemu-nbd's qcow2,
> whether there are miscellaneous implementation differences between
> ublk-qcow2 and qemu-nbd (like using the same io_uring context for both
> ublk and backend IO), or something else.

The theory shouldn't be too complicated:

1) io_uring passthrough (pt) communication is faster than a socket; io commands
are carried over io_uring pt commands, which should be faster than virtio
communication too.

2) io_uring IO handling is faster than libaio, which is what the qemu-nbd
test uses, and all qcow2 backend IO (including meta IO) is handled
by io_uring.

https://github.com/ming1/ubdsrv/blob/master/tests/common/qcow2_common

3) ublk uses a single io_uring to handle all io commands and qcow2
backend IOs, so batched handling is common, and it is easy to see
dozens of IOs/io commands, or even more, handled in a single syscall.

> 
> I'm suggesting measuring changes to just 1 variable at a time.
> Otherwise it's hard to reach a conclusion about the root cause of the
> performance difference. Let's learn why ublk-qcow2 performs well.

It turns out the latest Fedora 37 beta doesn't support vdpa yet, so I built
qemu from the latest github tree, and it finally works. The test kernel
is the v6.0 release.

The test results follow. All three devices are set up with a single
queue, and all tests run with a single job, still inside one VM. The
test images are stored on XFS on a virtio-scsi backed SSD.

The 1st group tests all three block devices backed by an empty
qcow2 image.

The 2nd group tests all three block devices backed by a pre-allocated
qcow2 image.

Except for big sequential IO (512K), there is still a sizable gap between
vdpa-virtio-blk and ublk.

1. run fio on block device over empty qcow2 image
1) qemu-nbd
running qcow2/001
run perf test on empty qcow2 image via nbd
	fio (nbd(/mnt/data/ublk_null_8G_nYbgF.qcow2), libaio, bs 4k, dio, hw queues:1)...
	randwrite: jobs 1, iops 8549
	randread: jobs 1, iops 34829
	randrw: jobs 1, iops read 11363 write 11333
	rw(512k): jobs 1, iops read 590 write 597


2) ublk-qcow2
running qcow2/021
run perf test on empty qcow2 image via ublk
	fio (ublk/qcow2( -f /mnt/data/ublk_null_8G_s761j.qcow2), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0).
	randwrite: jobs 1, iops 16086
	randread: jobs 1, iops 172720
	randrw: jobs 1, iops read 35760 write 35702
	rw(512k): jobs 1, iops read 1140 write 1149

3) vdpa-virtio-blk
running debug/test_dev
run io test on specified device
	fio (vdpa(/dev/vdc), libaio, bs 4k, dio, hw queues:1)...
	randwrite: jobs 1, iops 8626
	randread: jobs 1, iops 126118
	randrw: jobs 1, iops read 17698 write 17665
	rw(512k): jobs 1, iops read 1023 write 1031


2. run fio on block device over pre-allocated qcow2 image
1) qemu-nbd
running qcow2/002
run perf test on pre-allocated qcow2 image via nbd
	fio (nbd(/mnt/data/ublk_data_8G_sc0SB.qcow2), libaio, bs 4k, dio, hw queues:1)...
	randwrite: jobs 1, iops 21439
	randread: jobs 1, iops 30336
	randrw: jobs 1, iops read 11476 write 11449
	rw(512k): jobs 1, iops read 718 write 722

2) ublk-qcow2
running qcow2/022
run perf test on pre-allocated qcow2 image via ublk
	fio (ublk/qcow2( -f /mnt/data/ublk_data_8G_yZiaJ.qcow2), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0).
	randwrite: jobs 1, iops 98757
	randread: jobs 1, iops 110246
	randrw: jobs 1, iops read 47229 write 47161
	rw(512k): jobs 1, iops read 1416 write 1427

3) vdpa-virtio-blk
running debug/test_dev
run io test on specified device
	fio (vdpa(/dev/vdc), libaio, bs 4k, dio, hw queues:1)...
	randwrite: jobs 1, iops 47317
	randread: jobs 1, iops 74092
	randrw: jobs 1, iops read 27196 write 27234
	rw(512k): jobs 1, iops read 1447 write 1458
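
To make the gaps easier to read, the ratios can be computed directly from the
numbers above. A small Python sketch; only the 4k randwrite and randread IOPS
are copied in, and the dict layout and the helper name speedup are just for
illustration.

```python
# IOPS copied from the fio output above (4k, single job, single queue).
RESULTS = {
    "empty": {
        "qemu-nbd": {"randwrite": 8549, "randread": 34829},
        "ublk-qcow2": {"randwrite": 16086, "randread": 172720},
        "vdpa-virtio-blk": {"randwrite": 8626, "randread": 126118},
    },
    "pre-allocated": {
        "qemu-nbd": {"randwrite": 21439, "randread": 30336},
        "ublk-qcow2": {"randwrite": 98757, "randread": 110246},
        "vdpa-virtio-blk": {"randwrite": 47317, "randread": 74092},
    },
}


def speedup(image: str, test: str, baseline: str, target: str = "ublk-qcow2") -> float:
    """Ratio of target IOPS to baseline IOPS for one image type and test."""
    group = RESULTS[image]
    return group[target][test] / group[baseline][test]


if __name__ == "__main__":
    for image in RESULTS:
        for test in ("randwrite", "randread"):
            for baseline in ("qemu-nbd", "vdpa-virtio-blk"):
                ratio = speedup(image, test, baseline)
                print(f"{image:14s} {test:10s} ublk vs {baseline:16s} {ratio:.2f}x")
```

For the pre-allocated image this gives roughly 4.6x over qemu-nbd and roughly
2.1x over vdpa-virtio-blk on 4k randwrite.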


thanks,
Ming


* Re: ublk-qcow2: ublk-qcow2 is available
  2022-10-05  4:18       ` Ming Lei
@ 2022-10-05 12:21         ` Stefan Hajnoczi
  2022-10-05 12:38           ` Denis V. Lunev
  2022-10-06 11:24           ` Ming Lei
  2022-10-08  8:43         ` Ziyang Zhang
  1 sibling, 2 replies; 44+ messages in thread
From: Stefan Hajnoczi @ 2022-10-05 12:21 UTC (permalink / raw)
  To: Ming Lei
  Cc: Stefan Hajnoczi, io-uring, linux-block, linux-kernel,
	Kirill Tkhai, Manuel Bentele, qemu-devel, Kevin Wolf, rjones,
	Xie Yongji, Denis V. Lunev, Stefano Garzarella

On Wed, 5 Oct 2022 at 00:19, Ming Lei <[email protected]> wrote:
>
> On Tue, Oct 04, 2022 at 09:53:32AM -0400, Stefan Hajnoczi wrote:
> > On Tue, 4 Oct 2022 at 05:44, Ming Lei <[email protected]> wrote:
> > >
> > > On Mon, Oct 03, 2022 at 03:53:41PM -0400, Stefan Hajnoczi wrote:
> > > > On Fri, Sep 30, 2022 at 05:24:11PM +0800, Ming Lei wrote:
> > > > > ublk-qcow2 is available now.
> > > >
> > > > Cool, thanks for sharing!
> > > >
> > > > >
> > > > > So far it provides basic read/write function, and compression and snapshot
> > > > > aren't supported yet. The target/backend implementation is completely
> > > > > based on io_uring, and share the same io_uring with ublk IO command
> > > > > handler, just like what ublk-loop does.
> > > > >
> > > > > Follows the main motivations of ublk-qcow2:
> > > > >
> > > > > - building one complicated target from scratch helps libublksrv APIs/functions
> > > > >   become mature/stable more quickly, since qcow2 is complicated and needs more
> > > > >   requirement from libublksrv compared with other simple ones(loop, null)
> > > > >
> > > > > - there are several attempts of implementing qcow2 driver in kernel, such as
> > > > >   ``qloop`` [2], ``dm-qcow2`` [3] and ``in kernel qcow2(ro)`` [4], so ublk-qcow2
> > > > >   might be useful for covering requirements in this field
> > > > >
> > > > > - performance comparison with qemu-nbd, and it was my 1st thought to evaluate
> > > > >   performance of ublk/io_uring backend by writing one ublk-qcow2 since ublksrv
> > > > >   is started
> > > > >
> > > > > - help to abstract common building block or design pattern for writing new ublk
> > > > >   target/backend
> > > > >
> > > > > So far it basically passes xfstest(XFS) test by using ublk-qcow2 block
> > > > > device as TEST_DEV, and kernel building workload is verified too. Also
> > > > > soft update approach is applied in meta flushing, and meta data
> > > > > integrity is guaranteed, 'make test T=qcow2/040' covers this kind of
> > > > > test, and only cluster leak is reported during this test.
> > > > >
> > > > > The performance data looks much better compared with qemu-nbd, see
> > > > > details in commit log[1], README[5] and STATUS[6]. And the test covers both
> > > > > empty image and pre-allocated image, for example of pre-allocated qcow2
> > > > > image(8GB):
> > > > >
> > > > > - qemu-nbd (make test T=qcow2/002)
> > > >
> > > > Single queue?
> > >
> > > Yeah.
> > >
> > > >
> > > > >     randwrite(4k): jobs 1, iops 24605
> > > > >     randread(4k): jobs 1, iops 30938
> > > > >     randrw(4k): jobs 1, iops read 13981 write 14001
> > > > >     rw(512k): jobs 1, iops read 724 write 728
> > > >
> > > > Please try qemu-storage-daemon's VDUSE export type as well. The
> > > > command-line should be similar to this:
> > > >
> > > >   # modprobe virtio_vdpa # attaches vDPA devices to host kernel
> > >
> > > I couldn't find the virtio_vdpa module even though I enabled all of the
> > > following options:
> > >
> > >         --- vDPA drivers
> > >           <M>   vDPA device simulator core
> > >           <M>     vDPA simulator for networking device
> > >           <M>     vDPA simulator for block device
> > >           <M>   VDUSE (vDPA Device in Userspace) support
> > >           <M>   Intel IFC VF vDPA driver
> > >           <M>   Virtio PCI bridge vDPA driver
> > >           <M>   vDPA driver for Alibaba ENI
> > >
> > > BTW, my test environment is a VM, and the data I shared was collected inside
> > > the VM too. Can virtio_vdpa be used inside a VM?
> >
> > I hope Xie Yongji can help explain how to benchmark VDUSE.
> >
> > virtio_vdpa is available inside guests too. Please check that
> > VIRTIO_VDPA ("vDPA driver for virtio devices") is enabled in "Virtio
> > drivers" menu.
> >
> > >
> > > >   # modprobe vduse
> > > >   # qemu-storage-daemon \
> > > > >       --blockdev file,filename=test.qcow2,cache.direct=on|off,aio=native,node-name=file \
> > > >       --blockdev qcow2,file=file,node-name=qcow2 \
> > > >       --object iothread,id=iothread0 \
> > > >       --export vduse-blk,id=vduse0,name=vduse0,num-queues=$(nproc),node-name=qcow2,writable=on,iothread=iothread0
> > > >   # vdpa dev add name vduse0 mgmtdev vduse
> > > >
> > > > A virtio-blk device should appear and xfstests can be run on it
> > > > (typically /dev/vda unless you already have other virtio-blk devices).
> > > >
> > > > Afterwards you can destroy the device using:
> > > >
> > > >   # vdpa dev del vduse0
> > > >
> > > > >
> > > > > - ublk-qcow2 (make test T=qcow2/022)
> > > >
> > > > There are a lot of other factors not directly related to NBD vs ublk. In
> > > > order to get an apples-to-apples comparison with qemu-* a ublk export
> > > > type is needed in qemu-storage-daemon. That way only the difference is
> > > > the ublk interface and the rest of the code path is identical, making it
> > > > possible to compare NBD, VDUSE, ublk, etc more precisely.
> > >
> > > Maybe not true.
> > >
> > > ublk-qcow2 uses io_uring to handle all backend IO (including meta IO), and
> > > so far a single io_uring/pthread handles all qcow2 IOs and IO commands.
> >
> > qemu-nbd doesn't use io_uring to handle the backend IO, so we don't
>
> I tried to enable it by passing --aio=io_uring when setting up qemu-nbd, but
> it did not succeed.
>
> > know whether the benchmark demonstrates that ublk is faster than NBD,
> > that the ublk-qcow2 implementation is faster than qemu-nbd's qcow2,
> > whether there are miscellaneous implementation differences between
> > ublk-qcow2 and qemu-nbd (like using the same io_uring context for both
> > ublk and backend IO), or something else.
>
> The theory shouldn't be too complicated:
>
> 1) io_uring passthrough (pt) communication is faster than a socket; io commands
> are carried over io_uring pt commands, which should be faster than virtio
> communication too.
>
> 2) io_uring IO handling is faster than libaio, which is what the qemu-nbd
> test uses, and all qcow2 backend IO (including meta IO) is handled
> by io_uring.
>
> https://github.com/ming1/ubdsrv/blob/master/tests/common/qcow2_common
>
> 3) ublk uses a single io_uring to handle all io commands and qcow2
> backend IOs, so batched handling is common, and it is easy to see
> dozens of IOs/io commands, or even more, handled in a single syscall.

I agree with the theory but theory has to be tested through
experiments in order to validate it. We can all learn from systematic
performance analysis - there might even be bottlenecks in ublk that
can be solved to improve performance further.

> >
> > I'm suggesting measuring changes to just 1 variable at a time.
> > Otherwise it's hard to reach a conclusion about the root cause of the
> > performance difference. Let's learn why ublk-qcow2 performs well.
>
> It turns out the latest Fedora 37 beta doesn't support vdpa yet, so I built
> qemu from the latest github tree, and it finally works. The test kernel
> is the v6.0 release.
>
> The test results follow. All three devices are set up with a single
> queue, and all tests run with a single job, still inside one VM. The
> test images are stored on XFS on a virtio-scsi backed SSD.
>
> The 1st group tests all three block devices backed by an empty
> qcow2 image.
>
> The 2nd group tests all three block devices backed by a pre-allocated
> qcow2 image.
>
> Except for big sequential IO (512K), there is still a sizable gap between
> vdpa-virtio-blk and ublk.
>
> 1. run fio on block device over empty qcow2 image
> 1) qemu-nbd
> running qcow2/001
> run perf test on empty qcow2 image via nbd
>         fio (nbd(/mnt/data/ublk_null_8G_nYbgF.qcow2), libaio, bs 4k, dio, hw queues:1)...
>         randwrite: jobs 1, iops 8549
>         randread: jobs 1, iops 34829
>         randrw: jobs 1, iops read 11363 write 11333
>         rw(512k): jobs 1, iops read 590 write 597
>
>
> 2) ublk-qcow2
> running qcow2/021
> run perf test on empty qcow2 image via ublk
>         fio (ublk/qcow2( -f /mnt/data/ublk_null_8G_s761j.qcow2), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0).
>         randwrite: jobs 1, iops 16086
>         randread: jobs 1, iops 172720
>         randrw: jobs 1, iops read 35760 write 35702
>         rw(512k): jobs 1, iops read 1140 write 1149
>
> 3) vdpa-virtio-blk
> running debug/test_dev
> run io test on specified device
>         fio (vdpa(/dev/vdc), libaio, bs 4k, dio, hw queues:1)...
>         randwrite: jobs 1, iops 8626
>         randread: jobs 1, iops 126118
>         randrw: jobs 1, iops read 17698 write 17665
>         rw(512k): jobs 1, iops read 1023 write 1031
>
>
> 2. run fio on block device over pre-allocated qcow2 image
> 1) qemu-nbd
> running qcow2/002
> run perf test on pre-allocated qcow2 image via nbd
>         fio (nbd(/mnt/data/ublk_data_8G_sc0SB.qcow2), libaio, bs 4k, dio, hw queues:1)...
>         randwrite: jobs 1, iops 21439
>         randread: jobs 1, iops 30336
>         randrw: jobs 1, iops read 11476 write 11449
>         rw(512k): jobs 1, iops read 718 write 722
>
> 2) ublk-qcow2
> running qcow2/022
> run perf test on pre-allocated qcow2 image via ublk
>         fio (ublk/qcow2( -f /mnt/data/ublk_data_8G_yZiaJ.qcow2), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0).
>         randwrite: jobs 1, iops 98757
>         randread: jobs 1, iops 110246
>         randrw: jobs 1, iops read 47229 write 47161
>         rw(512k): jobs 1, iops read 1416 write 1427
>
> 3) vdpa-virtio-blk
> running debug/test_dev
> run io test on specified device
>         fio (vdpa(/dev/vdc), libaio, bs 4k, dio, hw queues:1)...
>         randwrite: jobs 1, iops 47317
>         randread: jobs 1, iops 74092
>         randrw: jobs 1, iops read 27196 write 27234
>         rw(512k): jobs 1, iops read 1447 write 1458
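
To put the quoted 4k numbers side by side, here is a minimal sketch that
computes how ublk-qcow2 compares with the other two devices. The IOPS
values are copied from the fio output above; the dictionary layout and the
ratio_vs() helper are only illustrative names.

```python
# IOPS copied from the quoted fio results (bs 4k, 1 job, 1 hw queue).
RESULTS = {
    "empty": {
        "qemu-nbd": {"randwrite": 8549, "randread": 34829},
        "ublk-qcow2": {"randwrite": 16086, "randread": 172720},
        "vdpa-virtio-blk": {"randwrite": 8626, "randread": 126118},
    },
    "pre-allocated": {
        "qemu-nbd": {"randwrite": 21439, "randread": 30336},
        "ublk-qcow2": {"randwrite": 98757, "randread": 110246},
        "vdpa-virtio-blk": {"randwrite": 47317, "randread": 74092},
    },
}


def ratio_vs(results, image, workload, baseline, target="ublk-qcow2"):
    """Return target IOPS divided by baseline IOPS for one image/workload."""
    group = results[image]
    return group[target][workload] / group[baseline][workload]


if __name__ == "__main__":
    for image in RESULTS:
        for workload in ("randwrite", "randread"):
            for baseline in ("qemu-nbd", "vdpa-virtio-blk"):
                r = ratio_vs(RESULTS, image, workload, baseline)
                print(f"{image:13s} {workload:9s} ublk/{baseline}: {r:.2f}x")
```

These ratios only restate the single-queue, single-job numbers above; they
say nothing about which part of the stack the difference comes from.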

Thanks for including VDUSE results! ublk looks great here and is worth
considering even in cases where NBD or VDUSE is already being used.

Stefan

* Re: ublk-qcow2: ublk-qcow2 is available
  2022-10-05 12:21         ` Stefan Hajnoczi
@ 2022-10-05 12:38           ` Denis V. Lunev
  2022-10-06 11:24           ` Ming Lei
  1 sibling, 0 replies; 44+ messages in thread
From: Denis V. Lunev @ 2022-10-05 12:38 UTC (permalink / raw)
  To: Stefan Hajnoczi, Ming Lei
  Cc: Stefan Hajnoczi, io-uring, linux-block, linux-kernel,
	Kirill Tkhai, Manuel Bentele, qemu-devel, Kevin Wolf, rjones,
	Xie Yongji, Stefano Garzarella, Andrey Zhadchenko

On 10/5/22 14:21, Stefan Hajnoczi wrote:
> Thanks for including VDUSE results! ublk looks great here and worth
> considering even in cases where NBD or VDUSE is already being used.
>
> Stefan
+ Andrey Zhadchenko

* Re: ublk-qcow2: ublk-qcow2 is available
  2022-10-05 12:21         ` Stefan Hajnoczi
  2022-10-05 12:38           ` Denis V. Lunev
@ 2022-10-06 11:24           ` Ming Lei
  2022-10-07 10:04             ` Yongji Xie
  1 sibling, 1 reply; 44+ messages in thread
From: Ming Lei @ 2022-10-06 11:24 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Stefan Hajnoczi, io-uring, linux-block, linux-kernel,
	Kirill Tkhai, Manuel Bentele, qemu-devel, Kevin Wolf, rjones,
	Xie Yongji, Denis V. Lunev, Stefano Garzarella

On Wed, Oct 05, 2022 at 08:21:45AM -0400, Stefan Hajnoczi wrote:
> I agree with the theory but theory has to be tested through
> experiments in order to validate it. We can all learn from systematic
> performance analysis - there might even be bottlenecks in ublk that
> can be solved to improve performance further.

Indeed, one thing is that ublk uses get_user_pages() to retrieve the user pages
for copying data. This may add latency for big chunk IO, since the
latency of get_user_pages() should increase linearly with nr_pages.
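
A small sketch of that linear relationship, assuming 4 KiB pages: it counts
how many pages a buffer spans and multiplies by a per-page cost. The
per_page_us parameter is a hypothetical placeholder, not a measured value,
and the helper names are made up for illustration.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def pages_spanned(offset, length, page_size=PAGE_SIZE):
    """Number of pages touched by `length` bytes starting at byte `offset`."""
    if length <= 0:
        return 0
    first = offset // page_size
    last = (offset + length - 1) // page_size
    return last - first + 1


def pin_cost_us(length, per_page_us, offset=0):
    """Linear cost model: pinning time grows with the number of pages."""
    return pages_spanned(offset, length) * per_page_us


if __name__ == "__main__":
    for size in (4 * 1024, 512 * 1024):
        print(f"{size // 1024}K IO spans {pages_spanned(0, size)} page(s)")
```

With page-aligned buffers, a 4k IO spans one page while a 512k IO spans 128,
so any per-page cost is paid 128 times for the big sequential case.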

I looked into the vduse code a bit too. vduse still needs the page copy,
but lots of bounce pages are allocated and cached for the whole device
lifetime, which avoids the latency of retrieving & allocating
pages at runtime at the cost of extra memory consumption. Correct me
if this is wrong, Xie Yongji or anyone?
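
The caching idea can be sketched as a pool of buffers allocated once and
reused for every copy. This is a userspace toy under my own assumptions, not
the vduse implementation; BouncePool and copy_through_pool() are
hypothetical names.

```python
class BouncePool:
    """Toy pool of fixed-size bounce buffers, allocated once and reused."""

    def __init__(self, nr_buffers, buf_size):
        self.buf_size = buf_size
        # All allocation happens up front, trading memory for latency.
        self._free = [bytearray(buf_size) for _ in range(nr_buffers)]
        self.allocations = nr_buffers

    def get(self):
        if not self._free:
            raise RuntimeError("bounce pool exhausted")
        return self._free.pop()

    def put(self, buf):
        self._free.append(buf)


def copy_through_pool(pool, payload):
    """Copy payload into a pooled buffer, read it back, release the buffer."""
    n = len(payload)
    if n > pool.buf_size:
        raise ValueError("payload larger than bounce buffer")
    buf = pool.get()
    try:
        buf[:n] = payload  # the data copy is still required
        return bytes(buf[:n])
    finally:
        pool.put(buf)  # buffer goes back to the cache, never freed


if __name__ == "__main__":
    pool = BouncePool(nr_buffers=2, buf_size=4096)
    for _ in range(100):
        copy_through_pool(pool, b"x" * 512)
    print("allocations after 100 copies:", pool.allocations)
```

The copy itself remains; what the cache removes is the per-IO cost of
finding or allocating pages, which is the trade-off described above.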

ublk has code to deal with device idle, and it may apply a similar
cache approach intelligently in the future.

But I think the final solution here could be applying zero copy to
avoid the big chunk copy, or using a hardware engine.


Thanks,
Ming

* Re: ublk-qcow2: ublk-qcow2 is available
  2022-10-06 11:24           ` Ming Lei
@ 2022-10-07 10:04             ` Yongji Xie
  2022-10-07 10:51               ` Ming Lei
  0 siblings, 1 reply; 44+ messages in thread
From: Yongji Xie @ 2022-10-07 10:04 UTC (permalink / raw)
  To: Ming Lei
  Cc: Stefan Hajnoczi, Stefan Hajnoczi, io-uring, linux-block,
	linux-kernel, Kirill Tkhai, Manuel Bentele, qemu-devel,
	Kevin Wolf, rjones, Denis V. Lunev, Stefano Garzarella

On Thu, Oct 6, 2022 at 7:24 PM Ming Lei <[email protected]> wrote:
> Indeed, one thing is that ublk uses get_user_pages() to retrieve the user pages
> for copying data. This may add latency for big chunk IO, since the
> latency of get_user_pages() should increase linearly with nr_pages.
>
> I looked into the vduse code a bit too. vduse still needs the page copy,
> but lots of bounce pages are allocated and cached for the whole device
> lifetime, which avoids the latency of retrieving & allocating
> pages at runtime at the cost of extra memory consumption. Correct me
> if this is wrong, Xie Yongji or anyone?
>

Yes, you are right. Another way is to register preallocated
userspace memory as the bounce buffer.

Thanks,
Yongji

* Re: ublk-qcow2: ublk-qcow2 is available
  2022-10-07 10:04             ` Yongji Xie
@ 2022-10-07 10:51               ` Ming Lei
  2022-10-07 11:21                 ` Yongji Xie
  0 siblings, 1 reply; 44+ messages in thread
From: Ming Lei @ 2022-10-07 10:51 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Stefan Hajnoczi, Stefan Hajnoczi, io-uring, linux-block,
	linux-kernel, Kirill Tkhai, Manuel Bentele, qemu-devel,
	Kevin Wolf, rjones, Denis V. Lunev, Stefano Garzarella

On Fri, Oct 07, 2022 at 06:04:29PM +0800, Yongji Xie wrote:
> On Thu, Oct 6, 2022 at 7:24 PM Ming Lei <[email protected]> wrote:
> >
> > On Wed, Oct 05, 2022 at 08:21:45AM -0400, Stefan Hajnoczi wrote:
> > > On Wed, 5 Oct 2022 at 00:19, Ming Lei <[email protected]> wrote:
> > > >
> > > > On Tue, Oct 04, 2022 at 09:53:32AM -0400, Stefan Hajnoczi wrote:
> > > > > On Tue, 4 Oct 2022 at 05:44, Ming Lei <[email protected]> wrote:
> > > > > >
> > > > > > On Mon, Oct 03, 2022 at 03:53:41PM -0400, Stefan Hajnoczi wrote:
> > > > > > > On Fri, Sep 30, 2022 at 05:24:11PM +0800, Ming Lei wrote:
> > > > > > > > ublk-qcow2 is available now.
> > > > > > >
> > > > > > > Cool, thanks for sharing!
> > > > > > >
> > > > > > > >
> > > > > > > > So far it provides basic read/write function, and compression and snapshot
> > > > > > > > aren't supported yet. The target/backend implementation is completely
> > > > > > > > based on io_uring, and share the same io_uring with ublk IO command
> > > > > > > > handler, just like what ublk-loop does.
> > > > > > > >
> > > > > > > > Follows the main motivations of ublk-qcow2:
> > > > > > > >
> > > > > > > > - building one complicated target from scratch helps libublksrv APIs/functions
> > > > > > > >   become mature/stable more quickly, since qcow2 is complicated and needs more
> > > > > > > >   requirement from libublksrv compared with other simple ones(loop, null)
> > > > > > > >
> > > > > > > > - there are several attempts of implementing qcow2 driver in kernel, such as
> > > > > > > >   ``qloop`` [2], ``dm-qcow2`` [3] and ``in kernel qcow2(ro)`` [4], so ublk-qcow2
> > > > > > > >   might useful be for covering requirement in this field
> > > > > > > >
> > > > > > > > - performance comparison with qemu-nbd, and it was my 1st thought to evaluate
> > > > > > > >   performance of ublk/io_uring backend by writing one ublk-qcow2 since ublksrv
> > > > > > > >   is started
> > > > > > > >
> > > > > > > > - help to abstract common building block or design pattern for writing new ublk
> > > > > > > >   target/backend
> > > > > > > >
> > > > > > > > So far it basically passes xfstest(XFS) test by using ublk-qcow2 block
> > > > > > > > device as TEST_DEV, and kernel building workload is verified too. Also
> > > > > > > > soft update approach is applied in meta flushing, and meta data
> > > > > > > > integrity is guaranteed, 'make test T=qcow2/040' covers this kind of
> > > > > > > > test, and only cluster leak is reported during this test.
> > > > > > > >
> > > > > > > > The performance data looks much better compared with qemu-nbd, see
> > > > > > > > details in commit log[1], README[5] and STATUS[6]. And the test covers both
> > > > > > > > empty image and pre-allocated image, for example of pre-allocated qcow2
> > > > > > > > image(8GB):
> > > > > > > >
> > > > > > > > - qemu-nbd (make test T=qcow2/002)
> > > > > > >
> > > > > > > Single queue?
> > > > > >
> > > > > > Yeah.
> > > > > >
> > > > > > >
> > > > > > > >     randwrite(4k): jobs 1, iops 24605
> > > > > > > >     randread(4k): jobs 1, iops 30938
> > > > > > > >     randrw(4k): jobs 1, iops read 13981 write 14001
> > > > > > > >     rw(512k): jobs 1, iops read 724 write 728
> > > > > > >
> > > > > > > Please try qemu-storage-daemon's VDUSE export type as well. The
> > > > > > > command-line should be similar to this:
> > > > > > >
> > > > > > >   # modprobe virtio_vdpa # attaches vDPA devices to host kernel
> > > > > >
> > > > > > Not found virtio_vdpa module even though I enabled all the following
> > > > > > options:
> > > > > >
> > > > > >         --- vDPA drivers
> > > > > >           <M>   vDPA device simulator core
> > > > > >           <M>     vDPA simulator for networking device
> > > > > >           <M>     vDPA simulator for block device
> > > > > >           <M>   VDUSE (vDPA Device in Userspace) support
> > > > > >           <M>   Intel IFC VF vDPA driver
> > > > > >           <M>   Virtio PCI bridge vDPA driver
> > > > > >           <M>   vDPA driver for Alibaba ENI
> > > > > >
> > > > > > BTW, my test environment is VM and the shared data is done in VM too, and
> > > > > > can virtio_vdpa be used inside VM?
> > > > >
> > > > > I hope Xie Yongji can help explain how to benchmark VDUSE.
> > > > >
> > > > > virtio_vdpa is available inside guests too. Please check that
> > > > > VIRTIO_VDPA ("vDPA driver for virtio devices") is enabled in "Virtio
> > > > > drivers" menu.
> > > > >
> > > > > >
> > > > > > >   # modprobe vduse
> > > > > > >   # qemu-storage-daemon \
> > > > > > >       --blockdev file,filename=test.qcow2,cache.direct=on|off,aio=native,node-name=file \
> > > > > > >       --blockdev qcow2,file=file,node-name=qcow2 \
> > > > > > >       --object iothread,id=iothread0 \
> > > > > > >       --export vduse-blk,id=vduse0,name=vduse0,num-queues=$(nproc),node-name=qcow2,writable=on,iothread=iothread0
> > > > > > >   # vdpa dev add name vduse0 mgmtdev vduse
> > > > > > >
> > > > > > > A virtio-blk device should appear and xfstests can be run on it
> > > > > > > (typically /dev/vda unless you already have other virtio-blk devices).
> > > > > > >
> > > > > > > Afterwards you can destroy the device using:
> > > > > > >
> > > > > > >   # vdpa dev del vduse0
> > > > > > >
> > > > > > > >
> > > > > > > > - ublk-qcow2 (make test T=qcow2/022)
> > > > > > >
> > > > > > > There are a lot of other factors not directly related to NBD vs ublk. In
> > > > > > > order to get an apples-to-apples comparison with qemu-* a ublk export
> > > > > > > type is needed in qemu-storage-daemon. That way the only difference is
> > > > > > > the ublk interface and the rest of the code path is identical, making it
> > > > > > > possible to compare NBD, VDUSE, ublk, etc more precisely.
> > > > > >
> > > > > > Maybe not true.
> > > > > >
> > > > > > ublk-qcow2 uses io_uring to handle all backend IO(include meta IO) completely,
> > > > > > and so far single io_uring/pthread is for handling all qcow2 IOs and IO
> > > > > > command.
> > > > >
> > > > > qemu-nbd doesn't use io_uring to handle the backend IO, so we don't
> > > >
> > > > I tried to use it via --aio=io_uring for setting up qemu-nbd, but did not succeed.
> > > >
> > > > > know whether the benchmark demonstrates that ublk is faster than NBD,
> > > > > that the ublk-qcow2 implementation is faster than qemu-nbd's qcow2,
> > > > > whether there are miscellaneous implementation differences between
> > > > > ublk-qcow2 and qemu-nbd (like using the same io_uring context for both
> > > > > ublk and backend IO), or something else.
> > > >
> > > > The theory shouldn't be too complicated:
> > > >
> > > > 1) io_uring passthrough (pt) communication is faster than a socket; io commands
> > > > are carried over io_uring pt commands, which should be faster than virtio
> > > > communication too.
> > > >
> > > > 2) io_uring io handling is faster than libaio, which is what the
> > > > qemu-nbd test uses, and all qcow2 backend io (including meta io) is handled
> > > > by io_uring.
> > > >
> > > > https://github.com/ming1/ubdsrv/blob/master/tests/common/qcow2_common
> > > >
> > > > 3) ublk uses a single io_uring to handle all io commands and qcow2
> > > > backend IOs, so batching is common, and it is easy to see
> > > > dozens of IOs/io commands handled in a single syscall, or even more.
> > >
> > > I agree with the theory but theory has to be tested through
> > > experiments in order to validate it. We can all learn from systematic
> > > performance analysis - there might even be bottlenecks in ublk that
> > > can be solved to improve performance further.
> >
> > Indeed, one thing is that ublk uses get_user_pages() to retrieve user pages
> > for copying data. This may add latency for big chunk IO, since the
> > latency of get_user_pages() should increase linearly with nr_pages.
> >
> > I looked into the vduse code a bit too. vduse still needs the page copy,
> > but lots of bounce pages are allocated and cached for the whole device
> > lifetime. This avoids the latency of retrieving & allocating
> > pages at runtime, at the cost of extra memory consumption. Correct me
> > if this is wrong, Xie Yongji or anyone?
> >
> 
> Yes, you are right. Another way is registering the preallocated
> userspace memory as bounce buffer.

Thanks for the clarification.

IMO, the page consumption is too high for vduse: each vdpa device
has one vduse_iova_domain, which may allocate up to 64K bounce pages,
and these pages aren't freed until the device is freed.

But it is one solution for implementing a generic userspace device (not
limited to block devices), and the idea seems great.
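
To put the memory cost in concrete terms, here is a minimal Python sketch of
the arithmetic. It assumes a 4 KiB page size (not stated in this thread) and
takes the 64K-page upper bound at face value; bounce_bytes() and to_mib() are
hypothetical helper names used only for this illustration.

```python
# Rough upper bound on memory held by bounce pages per device.
# Assumptions (not from the thread): 4 KiB pages; the 64K-page figure
# is taken at face value as a worst case.
PAGE_SIZE = 4096


def bounce_bytes(nr_pages, page_size=PAGE_SIZE):
    """Bytes consumed by nr_pages bounce pages of page_size bytes each."""
    return nr_pages * page_size


def to_mib(nbytes):
    """Convert a byte count to MiB."""
    return nbytes / (1024 * 1024)


per_device = bounce_bytes(64 * 1024)
print(f"per device: {to_mib(per_device):.0f} MiB")  # per device: 256 MiB

# The worst case scales linearly with the number of devices.
for devices in (1, 4, 16):
    print(f"{devices} device(s): {to_mib(devices * per_device):.0f} MiB")
```

Under these assumptions the worst case is 256 MiB per device, held for the
whole device lifetime.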




Thanks,
Ming

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: ublk-qcow2: ublk-qcow2 is available
  2022-10-07 10:51               ` Ming Lei
@ 2022-10-07 11:21                 ` Yongji Xie
  2022-10-07 11:23                   ` Ming Lei
  0 siblings, 1 reply; 44+ messages in thread
From: Yongji Xie @ 2022-10-07 11:21 UTC (permalink / raw)
  To: Ming Lei
  Cc: Stefan Hajnoczi, Stefan Hajnoczi, io-uring, linux-block,
	linux-kernel, Kirill Tkhai, Manuel Bentele, qemu-devel,
	Kevin Wolf, rjones, Denis V. Lunev, Stefano Garzarella

On Fri, Oct 7, 2022 at 6:51 PM Ming Lei <[email protected]> wrote:
>
> On Fri, Oct 07, 2022 at 06:04:29PM +0800, Yongji Xie wrote:
> > On Thu, Oct 6, 2022 at 7:24 PM Ming Lei <[email protected]> wrote:
> > >
> > > On Wed, Oct 05, 2022 at 08:21:45AM -0400, Stefan Hajnoczi wrote:
> > > > On Wed, 5 Oct 2022 at 00:19, Ming Lei <[email protected]> wrote:
> > > > >
> > > > > On Tue, Oct 04, 2022 at 09:53:32AM -0400, Stefan Hajnoczi wrote:
> > > > > > On Tue, 4 Oct 2022 at 05:44, Ming Lei <[email protected]> wrote:
> > > > > > >
> > > > > > > On Mon, Oct 03, 2022 at 03:53:41PM -0400, Stefan Hajnoczi wrote:
> > > > > > > > On Fri, Sep 30, 2022 at 05:24:11PM +0800, Ming Lei wrote:
> > > > > > > > > ublk-qcow2 is available now.
> > > > > > > >
> > > > > > > > Cool, thanks for sharing!
> > > > > > > >
> > > > > > > > >
> > > > > > > > > So far it provides basic read/write function, and compression and snapshot
> > > > > > > > > aren't supported yet. The target/backend implementation is completely
> > > > > > > > > based on io_uring, and share the same io_uring with ublk IO command
> > > > > > > > > handler, just like what ublk-loop does.
> > > > > > > > >
> > > > > > > > > Follows the main motivations of ublk-qcow2:
> > > > > > > > >
> > > > > > > > > - building one complicated target from scratch helps libublksrv APIs/functions
> > > > > > > > >   become mature/stable more quickly, since qcow2 is complicated and needs more
> > > > > > > > >   requirement from libublksrv compared with other simple ones(loop, null)
> > > > > > > > >
> > > > > > > > > - there are several attempts of implementing qcow2 driver in kernel, such as
> > > > > > > > >   ``qloop`` [2], ``dm-qcow2`` [3] and ``in kernel qcow2(ro)`` [4], so ublk-qcow2
> > > > > > > > >   might be useful for covering requirements in this field
> > > > > > > > >
> > > > > > > > > - performance comparison with qemu-nbd, and it was my 1st thought to evaluate
> > > > > > > > >   performance of ublk/io_uring backend by writing one ublk-qcow2 since ublksrv
> > > > > > > > >   is started
> > > > > > > > >
> > > > > > > > > - help to abstract common building block or design pattern for writing new ublk
> > > > > > > > >   target/backend
> > > > > > > > >
> > > > > > > > > So far it basically passes xfstest(XFS) test by using ublk-qcow2 block
> > > > > > > > > device as TEST_DEV, and kernel building workload is verified too. Also
> > > > > > > > > soft update approach is applied in meta flushing, and meta data
> > > > > > > > > integrity is guaranteed, 'make test T=qcow2/040' covers this kind of
> > > > > > > > > test, and only cluster leak is reported during this test.
> > > > > > > > >
> > > > > > > > > The performance data looks much better compared with qemu-nbd, see
> > > > > > > > > details in commit log[1], README[5] and STATUS[6]. And the test covers both
> > > > > > > > > empty image and pre-allocated image, for example of pre-allocated qcow2
> > > > > > > > > image(8GB):
> > > > > > > > >
> > > > > > > > > - qemu-nbd (make test T=qcow2/002)
> > > > > > > >
> > > > > > > > Single queue?
> > > > > > >
> > > > > > > Yeah.
> > > > > > >
> > > > > > > >
> > > > > > > > >     randwrite(4k): jobs 1, iops 24605
> > > > > > > > >     randread(4k): jobs 1, iops 30938
> > > > > > > > >     randrw(4k): jobs 1, iops read 13981 write 14001
> > > > > > > > >     rw(512k): jobs 1, iops read 724 write 728
> > > > > > > >
> > > > > > > > Please try qemu-storage-daemon's VDUSE export type as well. The
> > > > > > > > command-line should be similar to this:
> > > > > > > >
> > > > > > > >   # modprobe virtio_vdpa # attaches vDPA devices to host kernel
> > > > > > >
> > > > > > > Not found virtio_vdpa module even though I enabled all the following
> > > > > > > options:
> > > > > > >
> > > > > > >         --- vDPA drivers
> > > > > > >           <M>   vDPA device simulator core
> > > > > > >           <M>     vDPA simulator for networking device
> > > > > > >           <M>     vDPA simulator for block device
> > > > > > >           <M>   VDUSE (vDPA Device in Userspace) support
> > > > > > >           <M>   Intel IFC VF vDPA driver
> > > > > > >           <M>   Virtio PCI bridge vDPA driver
> > > > > > >           <M>   vDPA driver for Alibaba ENI
> > > > > > >
> > > > > > > BTW, my test environment is VM and the shared data is done in VM too, and
> > > > > > > can virtio_vdpa be used inside VM?
> > > > > >
> > > > > > I hope Xie Yongji can help explain how to benchmark VDUSE.
> > > > > >
> > > > > > virtio_vdpa is available inside guests too. Please check that
> > > > > > VIRTIO_VDPA ("vDPA driver for virtio devices") is enabled in "Virtio
> > > > > > drivers" menu.
> > > > > >
> > > > > > >
> > > > > > > >   # modprobe vduse
> > > > > > > >   # qemu-storage-daemon \
> > > > > > > >       --blockdev file,filename=test.qcow2,cache.direct=on|off,aio=native,node-name=file \
> > > > > > > >       --blockdev qcow2,file=file,node-name=qcow2 \
> > > > > > > >       --object iothread,id=iothread0 \
> > > > > > > >       --export vduse-blk,id=vduse0,name=vduse0,num-queues=$(nproc),node-name=qcow2,writable=on,iothread=iothread0
> > > > > > > >   # vdpa dev add name vduse0 mgmtdev vduse
> > > > > > > >
> > > > > > > > A virtio-blk device should appear and xfstests can be run on it
> > > > > > > > (typically /dev/vda unless you already have other virtio-blk devices).
> > > > > > > >
> > > > > > > > Afterwards you can destroy the device using:
> > > > > > > >
> > > > > > > >   # vdpa dev del vduse0
> > > > > > > >
> > > > > > > > >
> > > > > > > > > - ublk-qcow2 (make test T=qcow2/022)
> > > > > > > >
> > > > > > > > There are a lot of other factors not directly related to NBD vs ublk. In
> > > > > > > > order to get an apples-to-apples comparison with qemu-* a ublk export
> > > > > > > > type is needed in qemu-storage-daemon. That way the only difference is
> > > > > > > > the ublk interface and the rest of the code path is identical, making it
> > > > > > > > possible to compare NBD, VDUSE, ublk, etc more precisely.
> > > > > > >
> > > > > > > Maybe not true.
> > > > > > >
> > > > > > > ublk-qcow2 uses io_uring to handle all backend IO(include meta IO) completely,
> > > > > > > and so far single io_uring/pthread is for handling all qcow2 IOs and IO
> > > > > > > command.
> > > > > >
> > > > > > qemu-nbd doesn't use io_uring to handle the backend IO, so we don't
> > > > >
> > > > > I tried to use it via --aio=io_uring for setting up qemu-nbd, but did not succeed.
> > > > >
> > > > > > know whether the benchmark demonstrates that ublk is faster than NBD,
> > > > > > that the ublk-qcow2 implementation is faster than qemu-nbd's qcow2,
> > > > > > whether there are miscellaneous implementation differences between
> > > > > > ublk-qcow2 and qemu-nbd (like using the same io_uring context for both
> > > > > > ublk and backend IO), or something else.
> > > > >
> > > > > The theory shouldn't be too complicated:
> > > > >
> > > > > 1) io_uring passthrough (pt) communication is faster than a socket; io commands
> > > > > are carried over io_uring pt commands, which should be faster than virtio
> > > > > communication too.
> > > > >
> > > > > 2) io_uring io handling is faster than libaio, which is what the
> > > > > qemu-nbd test uses, and all qcow2 backend io (including meta io) is handled
> > > > > by io_uring.
> > > > >
> > > > > https://github.com/ming1/ubdsrv/blob/master/tests/common/qcow2_common
> > > > >
> > > > > 3) ublk uses a single io_uring to handle all io commands and qcow2
> > > > > backend IOs, so batching is common, and it is easy to see
> > > > > dozens of IOs/io commands handled in a single syscall, or even more.
> > > >
> > > > I agree with the theory but theory has to be tested through
> > > > experiments in order to validate it. We can all learn from systematic
> > > > performance analysis - there might even be bottlenecks in ublk that
> > > > can be solved to improve performance further.
> > >
> > > Indeed, one thing is that ublk uses get_user_pages() to retrieve user pages
> > > for copying data. This may add latency for big chunk IO, since the
> > > latency of get_user_pages() should increase linearly with nr_pages.
> > >
> > > I looked into the vduse code a bit too. vduse still needs the page copy,
> > > but lots of bounce pages are allocated and cached for the whole device
> > > lifetime. This avoids the latency of retrieving & allocating
> > > pages at runtime, at the cost of extra memory consumption. Correct me
> > > if this is wrong, Xie Yongji or anyone?
> > >
> >
> > Yes, you are right. Another way is registering the preallocated
> > userspace memory as bounce buffer.
>
> Thanks for the clarification.
>
> IMO, the page consumption is too high for vduse: each vdpa device
> has one vduse_iova_domain, which may allocate up to 64K bounce pages,
> and these pages aren't freed until the device is freed.
>

Yes. Actually, in our initial design this can be mitigated by a
memory reclaim mechanism and zero-copy support. We could even let
multiple vdpa devices share one iova domain.

Thanks,
Yongji

> But it is one solution for implementing a generic userspace device (not
> limited to block devices), and the idea seems great.
>
>
>
>
> Thanks,
> Ming

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: ublk-qcow2: ublk-qcow2 is available
  2022-10-07 11:21                 ` Yongji Xie
@ 2022-10-07 11:23                   ` Ming Lei
  0 siblings, 0 replies; 44+ messages in thread
From: Ming Lei @ 2022-10-07 11:23 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Stefan Hajnoczi, Stefan Hajnoczi, io-uring, linux-block,
	linux-kernel, Kirill Tkhai, Manuel Bentele, qemu-devel,
	Kevin Wolf, rjones, Denis V. Lunev, Stefano Garzarella

On Fri, Oct 07, 2022 at 07:21:51PM +0800, Yongji Xie wrote:
> On Fri, Oct 7, 2022 at 6:51 PM Ming Lei <[email protected]> wrote:
> >
> > On Fri, Oct 07, 2022 at 06:04:29PM +0800, Yongji Xie wrote:
> > > On Thu, Oct 6, 2022 at 7:24 PM Ming Lei <[email protected]> wrote:
> > > >
> > > > On Wed, Oct 05, 2022 at 08:21:45AM -0400, Stefan Hajnoczi wrote:
> > > > > On Wed, 5 Oct 2022 at 00:19, Ming Lei <[email protected]> wrote:
> > > > > >
> > > > > > On Tue, Oct 04, 2022 at 09:53:32AM -0400, Stefan Hajnoczi wrote:
> > > > > > > On Tue, 4 Oct 2022 at 05:44, Ming Lei <[email protected]> wrote:
> > > > > > > >
> > > > > > > > On Mon, Oct 03, 2022 at 03:53:41PM -0400, Stefan Hajnoczi wrote:
> > > > > > > > > On Fri, Sep 30, 2022 at 05:24:11PM +0800, Ming Lei wrote:
> > > > > > > > > > ublk-qcow2 is available now.
> > > > > > > > >
> > > > > > > > > Cool, thanks for sharing!
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > So far it provides basic read/write function, and compression and snapshot
> > > > > > > > > > aren't supported yet. The target/backend implementation is completely
> > > > > > > > > > based on io_uring, and share the same io_uring with ublk IO command
> > > > > > > > > > handler, just like what ublk-loop does.
> > > > > > > > > >
> > > > > > > > > > Follows the main motivations of ublk-qcow2:
> > > > > > > > > >
> > > > > > > > > > - building one complicated target from scratch helps libublksrv APIs/functions
> > > > > > > > > >   become mature/stable more quickly, since qcow2 is complicated and needs more
> > > > > > > > > >   requirement from libublksrv compared with other simple ones(loop, null)
> > > > > > > > > >
> > > > > > > > > > - there are several attempts of implementing qcow2 driver in kernel, such as
> > > > > > > > > >   ``qloop`` [2], ``dm-qcow2`` [3] and ``in kernel qcow2(ro)`` [4], so ublk-qcow2
> > > > > > > > > >   might be useful for covering requirements in this field
> > > > > > > > > >
> > > > > > > > > > - performance comparison with qemu-nbd, and it was my 1st thought to evaluate
> > > > > > > > > >   performance of ublk/io_uring backend by writing one ublk-qcow2 since ublksrv
> > > > > > > > > >   is started
> > > > > > > > > >
> > > > > > > > > > - help to abstract common building block or design pattern for writing new ublk
> > > > > > > > > >   target/backend
> > > > > > > > > >
> > > > > > > > > > So far it basically passes xfstest(XFS) test by using ublk-qcow2 block
> > > > > > > > > > device as TEST_DEV, and kernel building workload is verified too. Also
> > > > > > > > > > soft update approach is applied in meta flushing, and meta data
> > > > > > > > > > integrity is guaranteed, 'make test T=qcow2/040' covers this kind of
> > > > > > > > > > test, and only cluster leak is reported during this test.
> > > > > > > > > >
> > > > > > > > > > The performance data looks much better compared with qemu-nbd, see
> > > > > > > > > > details in commit log[1], README[5] and STATUS[6]. And the test covers both
> > > > > > > > > > empty image and pre-allocated image, for example of pre-allocated qcow2
> > > > > > > > > > image(8GB):
> > > > > > > > > >
> > > > > > > > > > - qemu-nbd (make test T=qcow2/002)
> > > > > > > > >
> > > > > > > > > Single queue?
> > > > > > > >
> > > > > > > > Yeah.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >     randwrite(4k): jobs 1, iops 24605
> > > > > > > > > >     randread(4k): jobs 1, iops 30938
> > > > > > > > > >     randrw(4k): jobs 1, iops read 13981 write 14001
> > > > > > > > > >     rw(512k): jobs 1, iops read 724 write 728
> > > > > > > > >
> > > > > > > > > Please try qemu-storage-daemon's VDUSE export type as well. The
> > > > > > > > > command-line should be similar to this:
> > > > > > > > >
> > > > > > > > >   # modprobe virtio_vdpa # attaches vDPA devices to host kernel
> > > > > > > >
> > > > > > > > Not found virtio_vdpa module even though I enabled all the following
> > > > > > > > options:
> > > > > > > >
> > > > > > > >         --- vDPA drivers
> > > > > > > >           <M>   vDPA device simulator core
> > > > > > > >           <M>     vDPA simulator for networking device
> > > > > > > >           <M>     vDPA simulator for block device
> > > > > > > >           <M>   VDUSE (vDPA Device in Userspace) support
> > > > > > > >           <M>   Intel IFC VF vDPA driver
> > > > > > > >           <M>   Virtio PCI bridge vDPA driver
> > > > > > > >           <M>   vDPA driver for Alibaba ENI
> > > > > > > >
> > > > > > > > BTW, my test environment is VM and the shared data is done in VM too, and
> > > > > > > > can virtio_vdpa be used inside VM?
> > > > > > >
> > > > > > > I hope Xie Yongji can help explain how to benchmark VDUSE.
> > > > > > >
> > > > > > > virtio_vdpa is available inside guests too. Please check that
> > > > > > > VIRTIO_VDPA ("vDPA driver for virtio devices") is enabled in "Virtio
> > > > > > > drivers" menu.
> > > > > > >
> > > > > > > >
> > > > > > > > >   # modprobe vduse
> > > > > > > > >   # qemu-storage-daemon \
> > > > > > > > >       --blockdev file,filename=test.qcow2,cache.direct=on|off,aio=native,node-name=file \
> > > > > > > > >       --blockdev qcow2,file=file,node-name=qcow2 \
> > > > > > > > >       --object iothread,id=iothread0 \
> > > > > > > > >       --export vduse-blk,id=vduse0,name=vduse0,num-queues=$(nproc),node-name=qcow2,writable=on,iothread=iothread0
> > > > > > > > >   # vdpa dev add name vduse0 mgmtdev vduse
> > > > > > > > >
> > > > > > > > > A virtio-blk device should appear and xfstests can be run on it
> > > > > > > > > (typically /dev/vda unless you already have other virtio-blk devices).
> > > > > > > > >
> > > > > > > > > Afterwards you can destroy the device using:
> > > > > > > > >
> > > > > > > > >   # vdpa dev del vduse0
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > - ublk-qcow2 (make test T=qcow2/022)
> > > > > > > > >
> > > > > > > > > There are a lot of other factors not directly related to NBD vs ublk. In
> > > > > > > > > order to get an apples-to-apples comparison with qemu-* a ublk export
> > > > > > > > > type is needed in qemu-storage-daemon. That way the only difference is
> > > > > > > > > the ublk interface and the rest of the code path is identical, making it
> > > > > > > > > possible to compare NBD, VDUSE, ublk, etc more precisely.
> > > > > > > >
> > > > > > > > Maybe not true.
> > > > > > > >
> > > > > > > > ublk-qcow2 uses io_uring to handle all backend IO(include meta IO) completely,
> > > > > > > > and so far single io_uring/pthread is for handling all qcow2 IOs and IO
> > > > > > > > command.
> > > > > > >
> > > > > > > qemu-nbd doesn't use io_uring to handle the backend IO, so we don't
> > > > > >
> > > > > > I tried to use it via --aio=io_uring for setting up qemu-nbd, but did not succeed.
> > > > > >
> > > > > > > know whether the benchmark demonstrates that ublk is faster than NBD,
> > > > > > > that the ublk-qcow2 implementation is faster than qemu-nbd's qcow2,
> > > > > > > whether there are miscellaneous implementation differences between
> > > > > > > ublk-qcow2 and qemu-nbd (like using the same io_uring context for both
> > > > > > > ublk and backend IO), or something else.
> > > > > >
> > > > > > The theory shouldn't be too complicated:
> > > > > >
> > > > > > 1) io_uring passthrough (pt) communication is faster than a socket; io commands
> > > > > > are carried over io_uring pt commands, which should be faster than virtio
> > > > > > communication too.
> > > > > >
> > > > > > 2) io_uring io handling is faster than libaio, which is what the
> > > > > > qemu-nbd test uses, and all qcow2 backend io (including meta io) is handled
> > > > > > by io_uring.
> > > > > >
> > > > > > https://github.com/ming1/ubdsrv/blob/master/tests/common/qcow2_common
> > > > > >
> > > > > > 3) ublk uses a single io_uring to handle all io commands and qcow2
> > > > > > backend IOs, so batching is common, and it is easy to see
> > > > > > dozens of IOs/io commands handled in a single syscall, or even more.
> > > > >
> > > > > I agree with the theory but theory has to be tested through
> > > > > experiments in order to validate it. We can all learn from systematic
> > > > > performance analysis - there might even be bottlenecks in ublk that
> > > > > can be solved to improve performance further.
> > > >
> > > > Indeed, one thing is that ublk uses get_user_pages() to retrieve user pages
> > > > for copying data. This may add latency for big chunk IO, since the
> > > > latency of get_user_pages() should increase linearly with nr_pages.
> > > >
> > > > I looked into the vduse code a bit too. vduse still needs the page copy,
> > > > but lots of bounce pages are allocated and cached for the whole device
> > > > lifetime. This avoids the latency of retrieving & allocating
> > > > pages at runtime, at the cost of extra memory consumption. Correct me
> > > > if this is wrong, Xie Yongji or anyone?
> > > >
> > >
> > > Yes, you are right. Another way is registering the preallocated
> > > userspace memory as bounce buffer.
> >
> > Thanks for the clarification.
> >
> > IMO, the page consumption is too high for vduse: each vdpa device
> > has one vduse_iova_domain, which may allocate up to 64K bounce pages,
> > and these pages aren't freed until the device is freed.
> >
> 
> Yes. Actually, in our initial design this can be mitigated by a
> memory reclaim mechanism and zero-copy support. We could even let
> multiple vdpa devices share one iova domain.

I think zero copy is great, especially for big chunk IO requests.
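
To illustrate why big chunk IO benefits most, here is a minimal Python sketch
that counts how many pages a buffer touches, since the per-request page
handling cost discussed earlier in this thread grows with nr_pages. It
assumes a 4 KiB page size; pages_spanned() is a hypothetical helper for this
example, not a kernel or libublksrv function.

```python
# Count the pages a buffer spans. Assumption: 4 KiB pages.
PAGE_SIZE = 4096


def pages_spanned(addr, length, page_size=PAGE_SIZE):
    """Number of pages touched by the byte range [addr, addr + length)."""
    if length == 0:
        return 0
    first = addr // page_size
    last = (addr + length - 1) // page_size
    return last - first + 1


print(pages_spanned(0, 4 * 1024))      # 1 page for an aligned 4k request
print(pages_spanned(0, 512 * 1024))    # 128 pages for an aligned 512k request
print(pages_spanned(512, 512 * 1024))  # 129 pages when the buffer is unaligned
```

With these numbers, a 512k request touches 128 times as many pages as a 4k
request, so any per-page pin or copy cost is paid 128 times for it.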

Thanks,
Ming

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: ublk-qcow2: ublk-qcow2 is available
  2022-10-05  4:18       ` Ming Lei
  2022-10-05 12:21         ` Stefan Hajnoczi
@ 2022-10-08  8:43         ` Ziyang Zhang
  2022-10-12 14:22           ` Stefan Hajnoczi
  1 sibling, 1 reply; 44+ messages in thread
From: Ziyang Zhang @ 2022-10-08  8:43 UTC (permalink / raw)
  To: Ming Lei, Stefan Hajnoczi
  Cc: Stefan Hajnoczi, io-uring, linux-block, linux-kernel,
	Denis V. Lunev, Xiaoguang Wang

On 2022/10/5 12:18, Ming Lei wrote:
> On Tue, Oct 04, 2022 at 09:53:32AM -0400, Stefan Hajnoczi wrote:
>> On Tue, 4 Oct 2022 at 05:44, Ming Lei <[email protected]> wrote:
>>>
>>> On Mon, Oct 03, 2022 at 03:53:41PM -0400, Stefan Hajnoczi wrote:
>>>> On Fri, Sep 30, 2022 at 05:24:11PM +0800, Ming Lei wrote:
>>>>> ublk-qcow2 is available now.
>>>>
>>>> Cool, thanks for sharing!
>>>>
>>>>>
>>>>> So far it provides basic read/write function, and compression and snapshot
>>>>> aren't supported yet. The target/backend implementation is completely
>>>>> based on io_uring, and share the same io_uring with ublk IO command
>>>>> handler, just like what ublk-loop does.
>>>>>
>>>>> Follows the main motivations of ublk-qcow2:
>>>>>
>>>>> - building one complicated target from scratch helps libublksrv APIs/functions
>>>>>   become mature/stable more quickly, since qcow2 is complicated and needs more
>>>>>   requirement from libublksrv compared with other simple ones(loop, null)
>>>>>
>>>>> - there are several attempts of implementing qcow2 driver in kernel, such as
>>>>>   ``qloop`` [2], ``dm-qcow2`` [3] and ``in kernel qcow2(ro)`` [4], so ublk-qcow2
>>>>>   might be useful for covering requirements in this field
>>>>>
>>>>> - performance comparison with qemu-nbd, and it was my 1st thought to evaluate
>>>>>   performance of ublk/io_uring backend by writing one ublk-qcow2 since ublksrv
>>>>>   is started
>>>>>
>>>>> - help to abstract common building block or design pattern for writing new ublk
>>>>>   target/backend
>>>>>
>>>>> So far it basically passes xfstest(XFS) test by using ublk-qcow2 block
>>>>> device as TEST_DEV, and kernel building workload is verified too. Also
>>>>> soft update approach is applied in meta flushing, and meta data
>>>>> integrity is guaranteed, 'make test T=qcow2/040' covers this kind of
>>>>> test, and only cluster leak is reported during this test.
>>>>>
>>>>> The performance data looks much better compared with qemu-nbd, see
>>>>> details in commit log[1], README[5] and STATUS[6]. And the test covers both
>>>>> empty image and pre-allocated image, for example of pre-allocated qcow2
>>>>> image(8GB):
>>>>>
>>>>> - qemu-nbd (make test T=qcow2/002)
>>>>
>>>> Single queue?
>>>
>>> Yeah.
>>>
>>>>
>>>>>     randwrite(4k): jobs 1, iops 24605
>>>>>     randread(4k): jobs 1, iops 30938
>>>>>     randrw(4k): jobs 1, iops read 13981 write 14001
>>>>>     rw(512k): jobs 1, iops read 724 write 728
>>>>
>>>> Please try qemu-storage-daemon's VDUSE export type as well. The
>>>> command-line should be similar to this:
>>>>
>>>>   # modprobe virtio_vdpa # attaches vDPA devices to host kernel
>>>
>>> Not found virtio_vdpa module even though I enabled all the following
>>> options:
>>>
>>>         --- vDPA drivers
>>>           <M>   vDPA device simulator core
>>>           <M>     vDPA simulator for networking device
>>>           <M>     vDPA simulator for block device
>>>           <M>   VDUSE (vDPA Device in Userspace) support
>>>           <M>   Intel IFC VF vDPA driver
>>>           <M>   Virtio PCI bridge vDPA driver
>>>           <M>   vDPA driver for Alibaba ENI
>>>
>>> BTW, my test environment is VM and the shared data is done in VM too, and
>>> can virtio_vdpa be used inside VM?
>>
>> I hope Xie Yongji can help explain how to benchmark VDUSE.
>>
>> virtio_vdpa is available inside guests too. Please check that
>> VIRTIO_VDPA ("vDPA driver for virtio devices") is enabled in "Virtio
>> drivers" menu.
>>
>>>
>>>>   # modprobe vduse
>>>>   # qemu-storage-daemon \
>>>>       --blockdev file,filename=test.qcow2,cache.direct=on|off,aio=native,node-name=file \
>>>>       --blockdev qcow2,file=file,node-name=qcow2 \
>>>>       --object iothread,id=iothread0 \
>>>>       --export vduse-blk,id=vduse0,name=vduse0,num-queues=$(nproc),node-name=qcow2,writable=on,iothread=iothread0
>>>>   # vdpa dev add name vduse0 mgmtdev vduse
>>>>
>>>> A virtio-blk device should appear and xfstests can be run on it
>>>> (typically /dev/vda unless you already have other virtio-blk devices).
>>>>
>>>> Afterwards you can destroy the device using:
>>>>
>>>>   # vdpa dev del vduse0
>>>>
>>>>>
>>>>> - ublk-qcow2 (make test T=qcow2/022)
>>>>
>>>> There are a lot of other factors not directly related to NBD vs ublk. In
>>>> order to get an apples-to-apples comparison with qemu-* a ublk export
>>>> type is needed in qemu-storage-daemon. That way the only difference is
>>>> the ublk interface and the rest of the code path is identical, making it
>>>> possible to compare NBD, VDUSE, ublk, etc more precisely.
>>>
>>> Maybe not true.
>>>
>>> ublk-qcow2 uses io_uring to handle all backend IO(include meta IO) completely,
>>> and so far single io_uring/pthread is for handling all qcow2 IOs and IO
>>> command.
>>
>> qemu-nbd doesn't use io_uring to handle the backend IO, so we don't
> 
> I tried to use it via --aio=io_uring for setting up qemu-nbd, but did not succeed.
> 
>> know whether the benchmark demonstrates that ublk is faster than NBD,
>> that the ublk-qcow2 implementation is faster than qemu-nbd's qcow2,
>> whether there are miscellaneous implementation differences between
>> ublk-qcow2 and qemu-nbd (like using the same io_uring context for both
>> ublk and backend IO), or something else.
> 
> The theory shouldn't be too complicated:
> 
> 1) io uring passthough(pt) communication is fast than socket, and io command
> is carried over io_uring pt commands, and should be fast than virio
> communication too.
> 
> 2) io uring io handling is fast than libaio which is taken in the
> test on qemu-nbd, and all qcow2 backend io(include meta io) is handled
> by io_uring.
> 
> https://github.com/ming1/ubdsrv/blob/master/tests/common/qcow2_common
> 
> 3) ublk uses one single io_uring to handle all io commands and qcow2
> backend IOs, so batching handling is common, and it is easy to see
> dozens of IOs/io commands handled in single syscall, or even more.
> 
>>
>> I'm suggesting measuring changes to just 1 variable at a time.
>> Otherwise it's hard to reach a conclusion about the root cause of the
>> performance difference. Let's learn why ublk-qcow2 performs well.
> 
> Turns out the latest Fedora 37-beta doesn't support vdpa yet, so I built
> qemu from the latest github tree, and finally it starts to work. And test kernel
> is v6.0 release.
> 
> Follows the test result, and all three devices are setup as single
> queue, and all tests are run in single job, still done in one VM, and
> the test images are stored on XFS/virito-scsi backed SSD.
> 
> The 1st group tests all three block device which is backed by empty
> qcow2 image.
> 
> The 2nd group tests all the three block devices backed by pre-allocated
> qcow2 image.
> 
> Except for big sequential IO(512K), there is still not small gap between
> vdpa-virtio-blk and ublk.
> 
> 1. run fio on block device over empty qcow2 image
> 1) qemu-nbd
> running qcow2/001
> run perf test on empty qcow2 image via nbd
> 	fio (nbd(/mnt/data/ublk_null_8G_nYbgF.qcow2), libaio, bs 4k, dio, hw queues:1)...
> 	randwrite: jobs 1, iops 8549
> 	randread: jobs 1, iops 34829
> 	randrw: jobs 1, iops read 11363 write 11333
> 	rw(512k): jobs 1, iops read 590 write 597
> 
> 
> 2) ublk-qcow2
> running qcow2/021
> run perf test on empty qcow2 image via ublk
> 	fio (ublk/qcow2( -f /mnt/data/ublk_null_8G_s761j.qcow2), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0).
> 	randwrite: jobs 1, iops 16086
> 	randread: jobs 1, iops 172720
> 	randrw: jobs 1, iops read 35760 write 35702
> 	rw(512k): jobs 1, iops read 1140 write 1149
> 
> 3) vdpa-virtio-blk
> running debug/test_dev
> run io test on specified device
> 	fio (vdpa(/dev/vdc), libaio, bs 4k, dio, hw queues:1)...
> 	randwrite: jobs 1, iops 8626
> 	randread: jobs 1, iops 126118
> 	randrw: jobs 1, iops read 17698 write 17665
> 	rw(512k): jobs 1, iops read 1023 write 1031
> 
> 
> 2. run fio on block device over pre-allocated qcow2 image
> 1) qemu-nbd
> running qcow2/002
> run perf test on pre-allocated qcow2 image via nbd
> 	fio (nbd(/mnt/data/ublk_data_8G_sc0SB.qcow2), libaio, bs 4k, dio, hw queues:1)...
> 	randwrite: jobs 1, iops 21439
> 	randread: jobs 1, iops 30336
> 	randrw: jobs 1, iops read 11476 write 11449
> 	rw(512k): jobs 1, iops read 718 write 722
> 
> 2) ublk-qcow2
> running qcow2/022
> run perf test on pre-allocated qcow2 image via ublk
> 	fio (ublk/qcow2( -f /mnt/data/ublk_data_8G_yZiaJ.qcow2), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0).
> 	randwrite: jobs 1, iops 98757
> 	randread: jobs 1, iops 110246
> 	randrw: jobs 1, iops read 47229 write 47161
> 	rw(512k): jobs 1, iops read 1416 write 1427
> 
> 3) vdpa-virtio-blk
> running debug/test_dev
> run io test on specified device
> 	fio (vdpa(/dev/vdc), libaio, bs 4k, dio, hw queues:1)...
> 	randwrite: jobs 1, iops 47317
> 	randread: jobs 1, iops 74092
> 	randrw: jobs 1, iops read 27196 write 27234
> 	rw(512k): jobs 1, iops read 1447 write 1458
> 
> 

Hi All,

We are interested in VDUSE vs. ublk too, and I have tested both with a
null_blk backend. Let me share some results here.

I set up ublk with:
  ublk add -t loop -f /dev/nullb0 -d QUEUE_DEPTH -q NR_QUEUE

I set up VDUSE with:
  qemu-storage-daemon \
       --chardev socket,id=charmonitor,path=/tmp/qmp.sock,server=on,wait=off \
       --monitor chardev=charmonitor \
       --blockdev driver=host_device,cache.direct=on,filename=/dev/nullb0,node-name=disk0 \
       --export vduse-blk,id=test,node-name=disk0,name=vduse_test,writable=on,num-queues=NR_QUEUE,queue-size=QUEUE_DEPTH

Here QUEUE_DEPTH is 1, 32 or 128 and NR_QUEUE is 1 or 4.

Notes:
(1) VDUSE requires QUEUE_DEPTH >= 2, so I could not set QUEUE_DEPTH to 1 for it.
(2) I used qemu 7.1.0-rc3, which supports vduse-blk.
(3) I did not use the ublk null target, so both sides do I/O to /dev/nullb0
    and the comparison is fair.
(4) I ran fio with direct=1 and bs=4k.
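
For anyone who wants to reproduce the setup, the parameter combinations
can be enumerated with a short script. This is only a sketch: it reuses
the two command lines above verbatim, substitutes QUEUE_DEPTH and
NR_QUEUE, and skips the VDUSE case that note (1) rules out. The helper
names (ublk_cmd, vduse_cmd, build_matrix) are made up for illustration,
and it lists every combination of the values above, which is more than
the three configurations reported below.

```python
from itertools import product

# Values quoted in the message: QUEUE_DEPTH is 1, 32 or 128; NR_QUEUE is 1 or 4.
QUEUE_DEPTHS = (1, 32, 128)
NR_QUEUES = (1, 4)


def ublk_cmd(depth, queues):
    """Command line for the ublk loop target, copied from the message."""
    return f"ublk add -t loop -f /dev/nullb0 -d {depth} -q {queues}"


def vduse_cmd(depth, queues):
    """Command line for the vduse-blk export, or None when depth < 2 (note 1)."""
    if depth < 2:
        return None
    return (
        "qemu-storage-daemon "
        "--chardev socket,id=charmonitor,path=/tmp/qmp.sock,server=on,wait=off "
        "--monitor chardev=charmonitor "
        "--blockdev driver=host_device,cache.direct=on,"
        "filename=/dev/nullb0,node-name=disk0 "
        "--export vduse-blk,id=test,node-name=disk0,name=vduse_test,"
        f"writable=on,num-queues={queues},queue-size={depth}"
    )


def build_matrix():
    """Return (depth, queues, ublk command, vduse command or None) tuples."""
    return [
        (depth, queues, ublk_cmd(depth, queues), vduse_cmd(depth, queues))
        for depth, queues in product(QUEUE_DEPTHS, NR_QUEUES)
    ]


for depth, queues, ublk, vduse in build_matrix():
    print(f"# depth={depth} queues={queues}")
    print(ublk)
    print(vduse if vduse is not None else "# vduse: skipped, needs queue-size >= 2")
```

The script only prints the commands; running them still needs root, the
ublk and vduse modules, and a configured /dev/nullb0.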

------------------------------
1 job, 1 iodepth, latency (usec)
                vduse   ublk
seq-read        22.55   11.15
rand-read       22.49   11.17
seq-write       25.67   10.25
rand-write      24.13   10.16
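
As a rough cross-check of these latencies (not an extra measurement):
with one job and iodepth 1 only one request is in flight, so the
achievable IOPS is about 1 / latency. The sketch below assumes the
numbers above are full per-request round-trip latencies and ignores
fio's own per-request overhead; the function names are made up for
illustration. The vduse latency comes out at roughly 2.0x to 2.5x the
ublk latency.

```python
# Latencies in microseconds, copied from the table above (1 job, iodepth 1).
LATENCY_USEC = {
    "seq-read":   {"vduse": 22.55, "ublk": 11.15},
    "rand-read":  {"vduse": 22.49, "ublk": 11.17},
    "seq-write":  {"vduse": 25.67, "ublk": 10.25},
    "rand-write": {"vduse": 24.13, "ublk": 10.16},
}


def qd1_iops(latency_usec):
    """IOPS implied by a per-request latency when one request is in flight."""
    return 1_000_000 / latency_usec


def latency_ratio(row):
    """How many times longer a vduse request takes than a ublk request."""
    return row["vduse"] / row["ublk"]


for name, row in LATENCY_USEC.items():
    print(
        f"{name:<11} "
        f"vduse ~{qd1_iops(row['vduse']) / 1000:5.1f}k IOPS  "
        f"ublk ~{qd1_iops(row['ublk']) / 1000:5.1f}k IOPS  "
        f"latency ratio {latency_ratio(row):.2f}x"
    )
```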

------------------------------
1 job, 32 iodepth, IOPS (k)
                vduse   ublk
seq-read        166     207
rand-read       150     204
seq-write       131     359
rand-write      129     363

------------------------------
4 jobs, 128 iodepth, IOPS (k)
                vduse   ublk
seq-read        318     984
rand-read       307     929
seq-write       221     924
rand-write      217     917
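
To make the two IOPS tables easier to compare, the ublk/vduse ratio can
be computed per workload. This is just arithmetic on the numbers above,
with a made-up helper name (speedup). ublk comes out roughly 1.2x to
2.8x ahead at 1 job / 32 iodepth and roughly 3.0x to 4.2x ahead at
4 jobs / 128 iodepth, with the larger gaps on writes.

```python
# IOPS in thousands, copied from the two tables above as (vduse, ublk).
IOPS_K = {
    "1 job, 32 iodepth": {
        "seq-read": (166, 207),
        "rand-read": (150, 204),
        "seq-write": (131, 359),
        "rand-write": (129, 363),
    },
    "4 jobs, 128 iodepth": {
        "seq-read": (318, 984),
        "rand-read": (307, 929),
        "seq-write": (221, 924),
        "rand-write": (217, 917),
    },
}


def speedup(vduse_kiops, ublk_kiops):
    """ublk throughput as a multiple of vduse throughput."""
    return ublk_kiops / vduse_kiops


for config, rows in IOPS_K.items():
    print(config)
    for name, (vduse, ublk) in rows.items():
        print(f"  {name:<11} vduse {vduse:4d}k  ublk {ublk:4d}k  "
              f"ublk/vduse {speedup(vduse, ublk):.2f}x")
```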

Regards,
Zhang

* Re: ublk-qcow2: ublk-qcow2 is available
  2022-10-08  8:43         ` Ziyang Zhang
@ 2022-10-12 14:22           ` Stefan Hajnoczi
  2022-10-13  6:48             ` Yongji Xie
  0 siblings, 1 reply; 44+ messages in thread
From: Stefan Hajnoczi @ 2022-10-12 14:22 UTC (permalink / raw)
  To: Ziyang Zhang
  Cc: Ming Lei, Stefan Hajnoczi, io-uring, linux-block, linux-kernel,
	Denis V. Lunev, Xiaoguang Wang, Xie Yongji

On Sat, 8 Oct 2022 at 04:43, Ziyang Zhang <[email protected]> wrote:
>
> Hi All,
>
> We are interested in VDUSE vs UBLK, too. And I have tested them with nullblk backend.
> Let me share some results here.
>
> I setup UBLK with:
>   ublk add -t loop -f /dev/nullb0 -d QUEUE_DEPTH -q NR_QUEUE
>
> I setup VDUSE with:
>   qemu-storage-daemon \
>        --chardev socket,id=charmonitor,path=/tmp/qmp.sock,server=on,wait=off \
>        --monitor chardev=charmonitor \
>        --blockdev driver=host_device,cache.direct=on,filename=/dev/nullb0,node-name=disk0 \
>        --export vduse-blk,id=test,node-name=disk0,name=vduse_test,writable=on,num-queues=NR_QUEUE,queue-size=QUEUE_DEPTH
>
> Here QUEUE_DEPTH is 1, 32 or 128 and NR_QUEUE is 1 or 4.
>
> Note:
> (1) VDUSE requires QUEUE_DEPTH >= 2. I cannot setup QUEUE_DEPTH to 1.
> (2) I use qemu 7.1.0-rc3. It supports vduse-blk.
> (3) I do not use ublk null target so that the test is fair.
> (4) I setup fio with direct=1, bs=4k.
>
> ------------------------------
> 1 job 1 iodepth, lat (usec)
>                 vduse   ublk
> seq-read        22.55   11.15
> rand-read       22.49   11.17
> seq-write       25.67   10.25
> rand-write      24.13   10.16

Thanks for sharing. Any idea what the bottlenecks are for vduse and ublk?

Stefan

>
> ------------------------------
> 1 job 32 iodepth, iops (k)
>                 vduse   ublk
> seq-read        166     207
> rand-read       150     204
> seq-write       131     359
> rand-write      129     363
>
> ------------------------------
> 4job 128 iodepth, iops (k)
>
>                 vduse   ublk
> seq-read        318     984
> rand-read       307     929
> seq-write       221     924
> rand-write      217     917
>
> Regards,
> Zhang

* Re: ublk-qcow2: ublk-qcow2 is available
  2022-10-12 14:22           ` Stefan Hajnoczi
@ 2022-10-13  6:48             ` Yongji Xie
  2022-10-13 16:02               ` Stefan Hajnoczi
  2022-10-14 12:56               ` Ming Lei
  0 siblings, 2 replies; 44+ messages in thread
From: Yongji Xie @ 2022-10-13  6:48 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Ziyang Zhang, Ming Lei, Stefan Hajnoczi, io-uring, linux-block,
	linux-kernel, Denis V. Lunev, Xiaoguang Wang

On Wed, Oct 12, 2022 at 10:22 PM Stefan Hajnoczi <[email protected]> wrote:
>
> On Sat, 8 Oct 2022 at 04:43, Ziyang Zhang <[email protected]> wrote:
> >
> > Hi All,
> >
> > We are interested in VDUSE vs UBLK, too. And I have tested them with nullblk backend.
> > Let me share some results here.
> >
> > I setup UBLK with:
> >   ublk add -t loop -f /dev/nullb0 -d QUEUE_DEPTH -q NR_QUEUE
> >
> > I setup VDUSE with:
> >   qemu-storage-daemon \
> >        --chardev socket,id=charmonitor,path=/tmp/qmp.sock,server=on,wait=off \
> >        --monitor chardev=charmonitor \
> >        --blockdev driver=host_device,cache.direct=on,filename=/dev/nullb0,node-name=disk0 \
> >        --export vduse-blk,id=test,node-name=disk0,name=vduse_test,writable=on,num-queues=NR_QUEUE,queue-size=QUEUE_DEPTH
> >
> > Here QUEUE_DEPTH is 1, 32 or 128 and NR_QUEUE is 1 or 4.
> >
> > Note:
> > (1) VDUSE requires QUEUE_DEPTH >= 2. I cannot setup QUEUE_DEPTH to 1.
> > (2) I use qemu 7.1.0-rc3. It supports vduse-blk.
> > (3) I do not use ublk null target so that the test is fair.
> > (4) I setup fio with direct=1, bs=4k.
> >
> > ------------------------------
> > 1 job 1 iodepth, lat (usec)
> >                 vduse   ublk
> > seq-read        22.55   11.15
> > rand-read       22.49   11.17
> > seq-write       25.67   10.25
> > rand-write      24.13   10.16
>
> Thanks for sharing. Any idea what the bottlenecks are for vduse and ublk?
>

I think one reason for the latency gap in sync I/O is that vduse uses a
workqueue in the I/O completion path, while ublk doesn't.

One bottleneck for async I/O in vduse is that vduse does the memcpy
inside the critical section of the virtqueue's spinlock in the
virtio-blk driver. That hurts performance heavily when
virtio_queue_rq() and virtblk_done() run concurrently. It can be
mitigated by the advance DMA mapping feature [1] or IRQ binding
support [2].

[1] https://lwn.net/Articles/886029/
[2] https://www.spinics.net/lists/kvm/msg236244.html
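
To illustrate why the memcpy placement matters, here is a crude
back-of-the-envelope model, not a measurement: if every request has to
pass through one lock, the lock hold time caps throughput at
1 / hold_time no matter how many CPUs submit and complete requests.
The two cost constants are hypothetical placeholders chosen only to
show the shape of the effect, and the function name is made up.

```python
def lock_bound_kiops(critical_section_usec):
    """Upper bound (thousands of IOPS) on requests funnelled through one lock.

    Only one context can hold the lock at a time, so at most
    1 / critical_section_usec requests per microsecond can pass through it.
    """
    if critical_section_usec <= 0:
        raise ValueError("critical section time must be positive")
    return 1_000.0 / critical_section_usec


# HYPOTHETICAL per-request costs in microseconds, for illustration only.
RING_UPDATE_USEC = 0.25   # assumed virtqueue bookkeeping under the lock
MEMCPY_4K_USEC = 0.75     # assumed cost of copying one 4k buffer

copy_inside = lock_bound_kiops(RING_UPDATE_USEC + MEMCPY_4K_USEC)
copy_outside = lock_bound_kiops(RING_UPDATE_USEC)

print(f"memcpy inside the lock:  bound ~{copy_inside:.0f}k IOPS")
print(f"memcpy outside the lock: bound ~{copy_outside:.0f}k IOPS")
```

Whatever the real costs are, shrinking the work done under the lock
raises this ceiling, which is the direction both [1] and [2] aim at.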

Thanks,
Yongji

> Stefan
>
> >
> > ------------------------------
> > 1 job 32 iodepth, iops (k)
> >                 vduse   ublk
> > seq-read        166     207
> > rand-read       150     204
> > seq-write       131     359
> > rand-write      129     363
> >
> > ------------------------------
> > 4job 128 iodepth, iops (k)
> >
> >                 vduse   ublk
> > seq-read        318     984
> > rand-read       307     929
> > seq-write       221     924
> > rand-write      217     917
> >
> > Regards,
> > Zhang

* Re: ublk-qcow2: ublk-qcow2 is available
  2022-10-13  6:48             ` Yongji Xie
@ 2022-10-13 16:02               ` Stefan Hajnoczi
  2022-10-14 12:56               ` Ming Lei
  1 sibling, 0 replies; 44+ messages in thread
From: Stefan Hajnoczi @ 2022-10-13 16:02 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Stefan Hajnoczi, Ziyang Zhang, Ming Lei, io-uring, linux-block,
	linux-kernel, Denis V. Lunev, Xiaoguang Wang

On Thu, Oct 13, 2022 at 02:48:04PM +0800, Yongji Xie wrote:
> On Wed, Oct 12, 2022 at 10:22 PM Stefan Hajnoczi <[email protected]> wrote:
> >
> > On Sat, 8 Oct 2022 at 04:43, Ziyang Zhang <[email protected]> wrote:
> > >
> > > On 2022/10/5 12:18, Ming Lei wrote:
> > > > On Tue, Oct 04, 2022 at 09:53:32AM -0400, Stefan Hajnoczi wrote:
> > > >> On Tue, 4 Oct 2022 at 05:44, Ming Lei <[email protected]> wrote:
> > > >>>
> > > >>> On Mon, Oct 03, 2022 at 03:53:41PM -0400, Stefan Hajnoczi wrote:
> > > >>>> On Fri, Sep 30, 2022 at 05:24:11PM +0800, Ming Lei wrote:
> > > >>>>> ublk-qcow2 is available now.
> > > >>>>
> > > >>>> Cool, thanks for sharing!
> > > >>>>
> > > >>>>>
> > > >>>>> So far it provides basic read/write function, and compression and snapshot
> > > >>>>> aren't supported yet. The target/backend implementation is completely
> > > >>>>> based on io_uring, and share the same io_uring with ublk IO command
> > > >>>>> handler, just like what ublk-loop does.
> > > >>>>>
> > > >>>>> Follows the main motivations of ublk-qcow2:
> > > >>>>>
> > > >>>>> - building one complicated target from scratch helps libublksrv APIs/functions
> > > >>>>>   become mature/stable more quickly, since qcow2 is complicated and needs more
> > > >>>>>   requirement from libublksrv compared with other simple ones(loop, null)
> > > >>>>>
> > > >>>>> - there are several attempts of implementing qcow2 driver in kernel, such as
> > > >>>>>   ``qloop`` [2], ``dm-qcow2`` [3] and ``in kernel qcow2(ro)`` [4], so ublk-qcow2
> > > >>>>>   might useful be for covering requirement in this field
> > > >>>>>
> > > >>>>> - performance comparison with qemu-nbd, and it was my 1st thought to evaluate
> > > >>>>>   performance of ublk/io_uring backend by writing one ublk-qcow2 since ublksrv
> > > >>>>>   is started
> > > >>>>>
> > > >>>>> - help to abstract common building block or design pattern for writing new ublk
> > > >>>>>   target/backend
> > > >>>>>
> > > >>>>> So far it basically passes xfstest(XFS) test by using ublk-qcow2 block
> > > >>>>> device as TEST_DEV, and kernel building workload is verified too. Also
> > > >>>>> soft update approach is applied in meta flushing, and meta data
> > > >>>>> integrity is guaranteed, 'make test T=qcow2/040' covers this kind of
> > > >>>>> test, and only cluster leak is reported during this test.
> > > >>>>>
> > > >>>>> The performance data looks much better compared with qemu-nbd, see
> > > >>>>> details in commit log[1], README[5] and STATUS[6]. And the test covers both
> > > >>>>> empty image and pre-allocated image, for example of pre-allocated qcow2
> > > >>>>> image(8GB):
> > > >>>>>
> > > >>>>> - qemu-nbd (make test T=qcow2/002)
> > > >>>>
> > > >>>> Single queue?
> > > >>>
> > > >>> Yeah.
> > > >>>
> > > >>>>
> > > >>>>>     randwrite(4k): jobs 1, iops 24605
> > > >>>>>     randread(4k): jobs 1, iops 30938
> > > >>>>>     randrw(4k): jobs 1, iops read 13981 write 14001
> > > >>>>>     rw(512k): jobs 1, iops read 724 write 728
> > > >>>>
> > > >>>> Please try qemu-storage-daemon's VDUSE export type as well. The
> > > >>>> command-line should be similar to this:
> > > >>>>
> > > >>>>   # modprobe virtio_vdpa # attaches vDPA devices to host kernel
> > > >>>
> > > >>> Not found virtio_vdpa module even though I enabled all the following
> > > >>> options:
> > > >>>
> > > >>>         --- vDPA drivers
> > > >>>           <M>   vDPA device simulator core
> > > >>>           <M>     vDPA simulator for networking device
> > > >>>           <M>     vDPA simulator for block device
> > > >>>           <M>   VDUSE (vDPA Device in Userspace) support
> > > >>>           <M>   Intel IFC VF vDPA driver
> > > >>>           <M>   Virtio PCI bridge vDPA driver
> > > >>>           <M>   vDPA driver for Alibaba ENI
> > > >>>
> > > >>> BTW, my test environment is VM and the shared data is done in VM too, and
> > > >>> can virtio_vdpa be used inside VM?
> > > >>
> > > >> I hope Xie Yongji can help explain how to benchmark VDUSE.
> > > >>
> > > >> virtio_vdpa is available inside guests too. Please check that
> > > >> VIRTIO_VDPA ("vDPA driver for virtio devices") is enabled in "Virtio
> > > >> drivers" menu.
> > > >>
> > > >>>
> > > >>>>   # modprobe vduse
> > > >>>>   # qemu-storage-daemon \
> > > >>>>       --blockdev file,filename=test.qcow2,cache.direct=of|off,aio=native,node-name=file \
> > > >>>>       --blockdev qcow2,file=file,node-name=qcow2 \
> > > >>>>       --object iothread,id=iothread0 \
> > > >>>>       --export vduse-blk,id=vduse0,name=vduse0,num-queues=$(nproc),node-name=qcow2,writable=on,iothread=iothread0
> > > >>>>   # vdpa dev add name vduse0 mgmtdev vduse
> > > >>>>
> > > >>>> A virtio-blk device should appear and xfstests can be run on it
> > > >>>> (typically /dev/vda unless you already have other virtio-blk devices).
> > > >>>>
> > > >>>> Afterwards you can destroy the device using:
> > > >>>>
> > > >>>>   # vdpa dev del vduse0
> > > >>>>
> > > >>>>>
> > > >>>>> - ublk-qcow2 (make test T=qcow2/022)
> > > >>>>
> > > >>>> There are a lot of other factors not directly related to NBD vs ublk. In
> > > >>>> order to get an apples-to-apples comparison with qemu-* a ublk export
> > > >>>> type is needed in qemu-storage-daemon. That way only the difference is
> > > >>>> the ublk interface and the rest of the code path is identical, making it
> > > >>>> possible to compare NBD, VDUSE, ublk, etc more precisely.
> > > >>>
> > > >>> Maybe not true.
> > > >>>
> > > >>> ublk-qcow2 uses io_uring to handle all backend IO(include meta IO) completely,
> > > >>> and so far single io_uring/pthread is for handling all qcow2 IOs and IO
> > > >>> command.
> > > >>
> > > >> qemu-nbd doesn't use io_uring to handle the backend IO, so we don't
> > > >
> > > > > I tried to use it via --aio=io_uring for setting up qemu-nbd, but did not succeed.
> > > >
> > > >> know whether the benchmark demonstrates that ublk is faster than NBD,
> > > >> that the ublk-qcow2 implementation is faster than qemu-nbd's qcow2,
> > > >> whether there are miscellaneous implementation differences between
> > > >> ublk-qcow2 and qemu-nbd (like using the same io_uring context for both
> > > >> ublk and backend IO), or something else.
> > > >
> > > > The theory shouldn't be too complicated:
> > > >
> > > > > 1) io_uring passthrough(pt) communication is faster than socket, and io command
> > > > > is carried over io_uring pt commands, and should be faster than virtio
> > > > > communication too.
> > > > >
> > > > > 2) io_uring io handling is faster than libaio, which is used in the
> > > > > test on qemu-nbd, and all qcow2 backend io(including meta io) is handled
> > > > > by io_uring.
> > > >
> > > > https://github.com/ming1/ubdsrv/blob/master/tests/common/qcow2_common
> > > >
> > > > 3) ublk uses one single io_uring to handle all io commands and qcow2
> > > > backend IOs, so batching handling is common, and it is easy to see
> > > > dozens of IOs/io commands handled in single syscall, or even more.
> > > >
> > > >>
> > > >> I'm suggesting measuring changes to just 1 variable at a time.
> > > >> Otherwise it's hard to reach a conclusion about the root cause of the
> > > >> performance difference. Let's learn why ublk-qcow2 performs well.
> > > >
> > > > Turns out the latest Fedora 37-beta doesn't support vdpa yet, so I built
> > > > qemu from the latest github tree, and finally it starts to work. And test kernel
> > > > is v6.0 release.
> > > >
> > > > Follows the test result, and all three devices are setup as single
> > > > queue, and all tests are run in single job, still done in one VM, and
> > > > > the test images are stored on XFS/virtio-scsi backed SSD.
> > > >
> > > > The 1st group tests all three block device which is backed by empty
> > > > qcow2 image.
> > > >
> > > > The 2nd group tests all the three block devices backed by pre-allocated
> > > > qcow2 image.
> > > >
> > > > > Except for big sequential IO(512K), there is still a sizable gap between
> > > > > vdpa-virtio-blk and ublk.
> > > >
> > > > 1. run fio on block device over empty qcow2 image
> > > > 1) qemu-nbd
> > > > running qcow2/001
> > > > run perf test on empty qcow2 image via nbd
> > > >       fio (nbd(/mnt/data/ublk_null_8G_nYbgF.qcow2), libaio, bs 4k, dio, hw queues:1)...
> > > >       randwrite: jobs 1, iops 8549
> > > >       randread: jobs 1, iops 34829
> > > >       randrw: jobs 1, iops read 11363 write 11333
> > > >       rw(512k): jobs 1, iops read 590 write 597
> > > >
> > > >
> > > > 2) ublk-qcow2
> > > > running qcow2/021
> > > > run perf test on empty qcow2 image via ublk
> > > >       fio (ublk/qcow2( -f /mnt/data/ublk_null_8G_s761j.qcow2), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0).
> > > >       randwrite: jobs 1, iops 16086
> > > >       randread: jobs 1, iops 172720
> > > >       randrw: jobs 1, iops read 35760 write 35702
> > > >       rw(512k): jobs 1, iops read 1140 write 1149
> > > >
> > > > 3) vdpa-virtio-blk
> > > > running debug/test_dev
> > > > run io test on specified device
> > > >       fio (vdpa(/dev/vdc), libaio, bs 4k, dio, hw queues:1)...
> > > >       randwrite: jobs 1, iops 8626
> > > >       randread: jobs 1, iops 126118
> > > >       randrw: jobs 1, iops read 17698 write 17665
> > > >       rw(512k): jobs 1, iops read 1023 write 1031
> > > >
> > > >
> > > > 2. run fio on block device over pre-allocated qcow2 image
> > > > 1) qemu-nbd
> > > > running qcow2/002
> > > > run perf test on pre-allocated qcow2 image via nbd
> > > >       fio (nbd(/mnt/data/ublk_data_8G_sc0SB.qcow2), libaio, bs 4k, dio, hw queues:1)...
> > > >       randwrite: jobs 1, iops 21439
> > > >       randread: jobs 1, iops 30336
> > > >       randrw: jobs 1, iops read 11476 write 11449
> > > >       rw(512k): jobs 1, iops read 718 write 722
> > > >
> > > > 2) ublk-qcow2
> > > > running qcow2/022
> > > > run perf test on pre-allocated qcow2 image via ublk
> > > >       fio (ublk/qcow2( -f /mnt/data/ublk_data_8G_yZiaJ.qcow2), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0).
> > > >       randwrite: jobs 1, iops 98757
> > > >       randread: jobs 1, iops 110246
> > > >       randrw: jobs 1, iops read 47229 write 47161
> > > >       rw(512k): jobs 1, iops read 1416 write 1427
> > > >
> > > > 3) vdpa-virtio-blk
> > > > running debug/test_dev
> > > > run io test on specified device
> > > >       fio (vdpa(/dev/vdc), libaio, bs 4k, dio, hw queues:1)...
> > > >       randwrite: jobs 1, iops 47317
> > > >       randread: jobs 1, iops 74092
> > > >       randrw: jobs 1, iops read 27196 write 27234
> > > >       rw(512k): jobs 1, iops read 1447 write 1458
> > > >
> > > >
> > >
> > > Hi All,
> > >
> > > We are interested in VDUSE vs UBLK, too. And I have tested them with nullblk backend.
> > > Let me share some results here.
> > >
> > > I setup UBLK with:
> > >   ublk add -t loop -f /dev/nullb0 -d QUEUE_DEPTH -q NR_QUEUE
> > >
> > > I setup VDUSE with:
> > >   qemu-storage-daemon \
> > >        --chardev socket,id=charmonitor,path=/tmp/qmp.sock,server=on,wait=off \
> > >        --monitor chardev=charmonitor \
> > >        --blockdev driver=host_device,cache.direct=on,filename=/dev/nullb0,node-name=disk0 \
> > >        --export vduse-blk,id=test,node-name=disk0,name=vduse_test,writable=on,num-queues=NR_QUEUE,queue-size=QUEUE_DEPTH
> > >
> > > Here QUEUE_DEPTH is 1, 32 or 128 and NR_QUEUE is 1 or 4.
> > >
> > > Note:
> > > (1) VDUSE requires QUEUE_DEPTH >= 2. I cannot setup QUEUE_DEPTH to 1.
> > > (2) I use qemu 7.1.0-rc3. It supports vduse-blk.
> > > (3) I do not use ublk null target so that the test is fair.
> > > (4) I setup fio with direct=1, bs=4k.
> > >
> > > ------------------------------
> > > > 1 job 1 iodepth, lat (usec)
> > >                 vduse   ublk
> > > seq-read        22.55   11.15
> > > rand-read       22.49   11.17
> > > seq-write       25.67   10.25
> > > rand-write      24.13   10.16
> >
> > Thanks for sharing. Any idea what the bottlenecks are for vduse and ublk?
> >
> 
> I think one reason for the latency gap of sync I/O is that vduse uses
> workqueue in the I/O completion path but ublk doesn't.
> 
> And one bottleneck for the async I/O in vduse is that vduse will do
> memcpy inside the critical section of virtqueue's spinlock in the
> virtio-blk driver. That will hurt the performance heavily when
> virtio_queue_rq() and virtblk_done() run concurrently. And it can be
> mitigated by the advance DMA mapping feature [1] or irq binding
> support [2].
> 
> [1] https://lwn.net/Articles/886029/
> [2] https://www.spinics.net/lists/kvm/msg236244.html

Thanks!

Stefan


* Re: ublk-qcow2: ublk-qcow2 is available
  2022-10-13  6:48             ` Yongji Xie
  2022-10-13 16:02               ` Stefan Hajnoczi
@ 2022-10-14 12:56               ` Ming Lei
  2022-10-17 11:11                 ` Yongji Xie
  1 sibling, 1 reply; 44+ messages in thread
From: Ming Lei @ 2022-10-14 12:56 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Stefan Hajnoczi, Ziyang Zhang, Stefan Hajnoczi, io-uring,
	linux-block, linux-kernel, Denis V. Lunev, Xiaoguang Wang

On Thu, Oct 13, 2022 at 02:48:04PM +0800, Yongji Xie wrote:
> On Wed, Oct 12, 2022 at 10:22 PM Stefan Hajnoczi <[email protected]> wrote:
> >
> > On Sat, 8 Oct 2022 at 04:43, Ziyang Zhang <[email protected]> wrote:
> > >
> > > On 2022/10/5 12:18, Ming Lei wrote:
> > > > On Tue, Oct 04, 2022 at 09:53:32AM -0400, Stefan Hajnoczi wrote:
> > > >> On Tue, 4 Oct 2022 at 05:44, Ming Lei <[email protected]> wrote:
> > > >>>
> > > >>> On Mon, Oct 03, 2022 at 03:53:41PM -0400, Stefan Hajnoczi wrote:
> > > >>>> On Fri, Sep 30, 2022 at 05:24:11PM +0800, Ming Lei wrote:
> > > >>>>> ublk-qcow2 is available now.
> > > >>>>
> > > >>>> Cool, thanks for sharing!
> > > >>>>
> > > >>>>>
> > > >>>>> So far it provides basic read/write function, and compression and snapshot
> > > >>>>> aren't supported yet. The target/backend implementation is completely
> > > >>>>> based on io_uring, and share the same io_uring with ublk IO command
> > > >>>>> handler, just like what ublk-loop does.
> > > >>>>>
> > > >>>>> Follows the main motivations of ublk-qcow2:
> > > >>>>>
> > > >>>>> - building one complicated target from scratch helps libublksrv APIs/functions
> > > >>>>>   become mature/stable more quickly, since qcow2 is complicated and needs more
> > > >>>>>   requirement from libublksrv compared with other simple ones(loop, null)
> > > >>>>>
> > > >>>>> - there are several attempts of implementing qcow2 driver in kernel, such as
> > > >>>>>   ``qloop`` [2], ``dm-qcow2`` [3] and ``in kernel qcow2(ro)`` [4], so ublk-qcow2
> > > > >>>>>   might be useful for covering requirements in this field
> > > >>>>>
> > > >>>>> - performance comparison with qemu-nbd, and it was my 1st thought to evaluate
> > > >>>>>   performance of ublk/io_uring backend by writing one ublk-qcow2 since ublksrv
> > > >>>>>   is started
> > > >>>>>
> > > >>>>> - help to abstract common building block or design pattern for writing new ublk
> > > >>>>>   target/backend
> > > >>>>>
> > > >>>>> So far it basically passes xfstest(XFS) test by using ublk-qcow2 block
> > > >>>>> device as TEST_DEV, and kernel building workload is verified too. Also
> > > >>>>> soft update approach is applied in meta flushing, and meta data
> > > >>>>> integrity is guaranteed, 'make test T=qcow2/040' covers this kind of
> > > >>>>> test, and only cluster leak is reported during this test.
> > > >>>>>
> > > >>>>> The performance data looks much better compared with qemu-nbd, see
> > > >>>>> details in commit log[1], README[5] and STATUS[6]. And the test covers both
> > > >>>>> empty image and pre-allocated image, for example of pre-allocated qcow2
> > > >>>>> image(8GB):
> > > >>>>>
> > > >>>>> - qemu-nbd (make test T=qcow2/002)
> > > >>>>
> > > >>>> Single queue?
> > > >>>
> > > >>> Yeah.
> > > >>>
> > > >>>>
> > > >>>>>     randwrite(4k): jobs 1, iops 24605
> > > >>>>>     randread(4k): jobs 1, iops 30938
> > > >>>>>     randrw(4k): jobs 1, iops read 13981 write 14001
> > > >>>>>     rw(512k): jobs 1, iops read 724 write 728
> > > >>>>
> > > >>>> Please try qemu-storage-daemon's VDUSE export type as well. The
> > > >>>> command-line should be similar to this:
> > > >>>>
> > > >>>>   # modprobe virtio_vdpa # attaches vDPA devices to host kernel
> > > >>>
> > > >>> Not found virtio_vdpa module even though I enabled all the following
> > > >>> options:
> > > >>>
> > > >>>         --- vDPA drivers
> > > >>>           <M>   vDPA device simulator core
> > > >>>           <M>     vDPA simulator for networking device
> > > >>>           <M>     vDPA simulator for block device
> > > >>>           <M>   VDUSE (vDPA Device in Userspace) support
> > > >>>           <M>   Intel IFC VF vDPA driver
> > > >>>           <M>   Virtio PCI bridge vDPA driver
> > > >>>           <M>   vDPA driver for Alibaba ENI
> > > >>>
> > > >>> BTW, my test environment is VM and the shared data is done in VM too, and
> > > >>> can virtio_vdpa be used inside VM?
> > > >>
> > > >> I hope Xie Yongji can help explain how to benchmark VDUSE.
> > > >>
> > > >> virtio_vdpa is available inside guests too. Please check that
> > > >> VIRTIO_VDPA ("vDPA driver for virtio devices") is enabled in "Virtio
> > > >> drivers" menu.
> > > >>
> > > >>>
> > > >>>>   # modprobe vduse
> > > >>>>   # qemu-storage-daemon \
> > > >>>>       --blockdev file,filename=test.qcow2,cache.direct=of|off,aio=native,node-name=file \
> > > >>>>       --blockdev qcow2,file=file,node-name=qcow2 \
> > > >>>>       --object iothread,id=iothread0 \
> > > >>>>       --export vduse-blk,id=vduse0,name=vduse0,num-queues=$(nproc),node-name=qcow2,writable=on,iothread=iothread0
> > > >>>>   # vdpa dev add name vduse0 mgmtdev vduse
> > > >>>>
> > > >>>> A virtio-blk device should appear and xfstests can be run on it
> > > >>>> (typically /dev/vda unless you already have other virtio-blk devices).
> > > >>>>
> > > >>>> Afterwards you can destroy the device using:
> > > >>>>
> > > >>>>   # vdpa dev del vduse0
> > > >>>>
> > > >>>>>
> > > >>>>> - ublk-qcow2 (make test T=qcow2/022)
> > > >>>>
> > > >>>> There are a lot of other factors not directly related to NBD vs ublk. In
> > > >>>> order to get an apples-to-apples comparison with qemu-* a ublk export
> > > >>>> type is needed in qemu-storage-daemon. That way only the difference is
> > > >>>> the ublk interface and the rest of the code path is identical, making it
> > > >>>> possible to compare NBD, VDUSE, ublk, etc more precisely.
> > > >>>
> > > >>> Maybe not true.
> > > >>>
> > > >>> ublk-qcow2 uses io_uring to handle all backend IO(include meta IO) completely,
> > > >>> and so far single io_uring/pthread is for handling all qcow2 IOs and IO
> > > >>> command.
> > > >>
> > > >> qemu-nbd doesn't use io_uring to handle the backend IO, so we don't
> > > >
> > > > > I tried to use it via --aio=io_uring for setting up qemu-nbd, but did not succeed.
> > > >
> > > >> know whether the benchmark demonstrates that ublk is faster than NBD,
> > > >> that the ublk-qcow2 implementation is faster than qemu-nbd's qcow2,
> > > >> whether there are miscellaneous implementation differences between
> > > >> ublk-qcow2 and qemu-nbd (like using the same io_uring context for both
> > > >> ublk and backend IO), or something else.
> > > >
> > > > The theory shouldn't be too complicated:
> > > >
> > > > > 1) io_uring passthrough(pt) communication is faster than socket, and io command
> > > > > is carried over io_uring pt commands, and should be faster than virtio
> > > > > communication too.
> > > > >
> > > > > 2) io_uring io handling is faster than libaio, which is used in the
> > > > > test on qemu-nbd, and all qcow2 backend io(including meta io) is handled
> > > > > by io_uring.
> > > >
> > > > https://github.com/ming1/ubdsrv/blob/master/tests/common/qcow2_common
> > > >
> > > > 3) ublk uses one single io_uring to handle all io commands and qcow2
> > > > backend IOs, so batching handling is common, and it is easy to see
> > > > dozens of IOs/io commands handled in single syscall, or even more.
> > > >
> > > >>
> > > >> I'm suggesting measuring changes to just 1 variable at a time.
> > > >> Otherwise it's hard to reach a conclusion about the root cause of the
> > > >> performance difference. Let's learn why ublk-qcow2 performs well.
> > > >
> > > > Turns out the latest Fedora 37-beta doesn't support vdpa yet, so I built
> > > > qemu from the latest github tree, and finally it starts to work. And test kernel
> > > > is v6.0 release.
> > > >
> > > > Follows the test result, and all three devices are setup as single
> > > > queue, and all tests are run in single job, still done in one VM, and
> > > > > the test images are stored on XFS/virtio-scsi backed SSD.
> > > >
> > > > The 1st group tests all three block device which is backed by empty
> > > > qcow2 image.
> > > >
> > > > The 2nd group tests all the three block devices backed by pre-allocated
> > > > qcow2 image.
> > > >
> > > > > Except for big sequential IO(512K), there is still a sizable gap between
> > > > > vdpa-virtio-blk and ublk.
> > > >
> > > > 1. run fio on block device over empty qcow2 image
> > > > 1) qemu-nbd
> > > > running qcow2/001
> > > > run perf test on empty qcow2 image via nbd
> > > >       fio (nbd(/mnt/data/ublk_null_8G_nYbgF.qcow2), libaio, bs 4k, dio, hw queues:1)...
> > > >       randwrite: jobs 1, iops 8549
> > > >       randread: jobs 1, iops 34829
> > > >       randrw: jobs 1, iops read 11363 write 11333
> > > >       rw(512k): jobs 1, iops read 590 write 597
> > > >
> > > >
> > > > 2) ublk-qcow2
> > > > running qcow2/021
> > > > run perf test on empty qcow2 image via ublk
> > > >       fio (ublk/qcow2( -f /mnt/data/ublk_null_8G_s761j.qcow2), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0).
> > > >       randwrite: jobs 1, iops 16086
> > > >       randread: jobs 1, iops 172720
> > > >       randrw: jobs 1, iops read 35760 write 35702
> > > >       rw(512k): jobs 1, iops read 1140 write 1149
> > > >
> > > > 3) vdpa-virtio-blk
> > > > running debug/test_dev
> > > > run io test on specified device
> > > >       fio (vdpa(/dev/vdc), libaio, bs 4k, dio, hw queues:1)...
> > > >       randwrite: jobs 1, iops 8626
> > > >       randread: jobs 1, iops 126118
> > > >       randrw: jobs 1, iops read 17698 write 17665
> > > >       rw(512k): jobs 1, iops read 1023 write 1031
> > > >
> > > >
> > > > 2. run fio on block device over pre-allocated qcow2 image
> > > > 1) qemu-nbd
> > > > running qcow2/002
> > > > run perf test on pre-allocated qcow2 image via nbd
> > > >       fio (nbd(/mnt/data/ublk_data_8G_sc0SB.qcow2), libaio, bs 4k, dio, hw queues:1)...
> > > >       randwrite: jobs 1, iops 21439
> > > >       randread: jobs 1, iops 30336
> > > >       randrw: jobs 1, iops read 11476 write 11449
> > > >       rw(512k): jobs 1, iops read 718 write 722
> > > >
> > > > 2) ublk-qcow2
> > > > running qcow2/022
> > > > run perf test on pre-allocated qcow2 image via ublk
> > > >       fio (ublk/qcow2( -f /mnt/data/ublk_data_8G_yZiaJ.qcow2), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0).
> > > >       randwrite: jobs 1, iops 98757
> > > >       randread: jobs 1, iops 110246
> > > >       randrw: jobs 1, iops read 47229 write 47161
> > > >       rw(512k): jobs 1, iops read 1416 write 1427
> > > >
> > > > 3) vdpa-virtio-blk
> > > > running debug/test_dev
> > > > run io test on specified device
> > > >       fio (vdpa(/dev/vdc), libaio, bs 4k, dio, hw queues:1)...
> > > >       randwrite: jobs 1, iops 47317
> > > >       randread: jobs 1, iops 74092
> > > >       randrw: jobs 1, iops read 27196 write 27234
> > > >       rw(512k): jobs 1, iops read 1447 write 1458
> > > >
> > > >
> > >
> > > Hi All,
> > >
> > > We are interested in VDUSE vs UBLK, too. And I have tested them with nullblk backend.
> > > Let me share some results here.
> > >
> > > I setup UBLK with:
> > >   ublk add -t loop -f /dev/nullb0 -d QUEUE_DEPTH -q NR_QUEUE
> > >
> > > I setup VDUSE with:
> > >   qemu-storage-daemon \
> > >        --chardev socket,id=charmonitor,path=/tmp/qmp.sock,server=on,wait=off \
> > >        --monitor chardev=charmonitor \
> > >        --blockdev driver=host_device,cache.direct=on,filename=/dev/nullb0,node-name=disk0 \
> > >        --export vduse-blk,id=test,node-name=disk0,name=vduse_test,writable=on,num-queues=NR_QUEUE,queue-size=QUEUE_DEPTH
> > >
> > > Here QUEUE_DEPTH is 1, 32 or 128 and NR_QUEUE is 1 or 4.
> > >
> > > Note:
> > > (1) VDUSE requires QUEUE_DEPTH >= 2. I cannot setup QUEUE_DEPTH to 1.
> > > (2) I use qemu 7.1.0-rc3. It supports vduse-blk.
> > > (3) I do not use ublk null target so that the test is fair.
> > > (4) I setup fio with direct=1, bs=4k.
> > >
> > > ------------------------------
> > > > 1 job 1 iodepth, lat (usec)
> > >                 vduse   ublk
> > > seq-read        22.55   11.15
> > > rand-read       22.49   11.17
> > > seq-write       25.67   10.25
> > > rand-write      24.13   10.16
> >
> > Thanks for sharing. Any idea what the bottlenecks are for vduse and ublk?
> >
> 
> I think one reason for the latency gap of sync I/O is that vduse uses
> workqueue in the I/O completion path but ublk doesn't.
> 
> And one bottleneck for the async I/O in vduse is that vduse will do
> memcpy inside the critical section of virtqueue's spinlock in the
> virtio-blk driver. That will hurt the performance heavily when
> virtio_queue_rq() and virtblk_done() run concurrently. And it can be
> mitigated by the advance DMA mapping feature [1] or irq binding
> support [2].

Hi Yongji,

Yeah, that is the cost you pay for virtio. For a userspace block device,
or other sorts of userspace devices, cmd completion is driven by
userspace, so I am not sure such an 'irq' is needed. I am not even sure
the virtio ring is a good choice for this use case, given that io_uring
has proved very efficient (it should be better than the virtio ring, IMO).

ublk uses io_uring passthrough (pt) cmds to handle both io submission and
completion, and it turns out the extra latency can be pretty small.
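
For a rough sense of scale, the 1 job / 1 iodepth latencies quoted above
can be turned into per-workload gaps and ratios. This is only a sketch in
Python: the numbers are copied from Ziyang's table, while the names
`LAT_USEC` and `latency_gap` are made up for illustration.

```python
# Latencies in usec, copied from the quoted "1 job 1 iodepth" table:
# workload -> (vduse, ublk)
LAT_USEC = {
    "seq-read":   (22.55, 11.15),
    "rand-read":  (22.49, 11.17),
    "seq-write":  (25.67, 10.25),
    "rand-write": (24.13, 10.16),
}


def latency_gap(vduse, ublk):
    """Return (absolute gap in usec, vduse/ublk ratio) for one workload."""
    return vduse - ublk, vduse / ublk


for name, (vduse, ublk) in LAT_USEC.items():
    gap, ratio = latency_gap(vduse, ublk)
    print(f"{name:<11} gap {gap:5.2f} usec  ratio {ratio:.2f}x")
```

On these four data points the vduse latency is between about 2.0x and
2.5x the ublk latency. That says nothing about other queue depths or
backends.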

BTW, on an unrelated topic, I saw the following words in
Documentation/userspace-api/vduse.rst:

```
Note that only virtio block device is supported by VDUSE framework now,
which can reduce security risks when the userspace process that implements
the data path is run by an unprivileged user.
```

But when I tried to start qemu-storage-daemon to create a vdpa-virtio
block device as a normal unprivileged user, 'Permission denied' was still
returned. Can you explain a bit how to start such a process as an
unprivileged user? Or maybe I misunderstood the above words; please let
me know.
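
One possible thing to check, as an assumption rather than a confirmed
cause of the error, is whether the unprivileged user can open the VDUSE
control node, which vduse.rst names as /dev/vduse/control. A minimal
Python sketch, where `describe_access` is a hypothetical helper:

```python
import os
import stat

# Control node named in Documentation/userspace-api/vduse.rst.
# Assumption: it lives at this path on the test system.
VDUSE_CONTROL = "/dev/vduse/control"


def describe_access(path):
    """Summarize whether the current user can open `path` read/write."""
    if not os.path.exists(path):
        return f"{path}: not present (is the vduse module loaded?)"
    st = os.stat(path)
    mode = stat.filemode(st.st_mode)  # e.g. 'crw-------'
    ok = os.access(path, os.R_OK | os.W_OK)
    verdict = "read/write OK" if ok else "no read/write access"
    return f"{path}: {mode} uid={st.st_uid} gid={st.st_gid}: {verdict}"


print(describe_access(VDUSE_CONTROL))
```

Even if this reports read/write access, other steps such as
`vdpa dev add` may need privileges of their own, so treat it as a first
check only.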


thanks,
Ming

* Re: ublk-qcow2: ublk-qcow2 is available
  2022-10-14 12:56               ` Ming Lei
@ 2022-10-17 11:11                 ` Yongji Xie
  2022-10-18  6:59                   ` Ming Lei
  0 siblings, 1 reply; 44+ messages in thread
From: Yongji Xie @ 2022-10-17 11:11 UTC (permalink / raw)
  To: Ming Lei
  Cc: Stefan Hajnoczi, Ziyang Zhang, Stefan Hajnoczi, io-uring,
	linux-block, linux-kernel, Denis V. Lunev, Xiaoguang Wang

On Fri, Oct 14, 2022 at 8:57 PM Ming Lei <[email protected]> wrote:
>
> On Thu, Oct 13, 2022 at 02:48:04PM +0800, Yongji Xie wrote:
> > On Wed, Oct 12, 2022 at 10:22 PM Stefan Hajnoczi <[email protected]> wrote:
> > >
> > > On Sat, 8 Oct 2022 at 04:43, Ziyang Zhang <[email protected]> wrote:
> > > >
> > > > On 2022/10/5 12:18, Ming Lei wrote:
> > > > > On Tue, Oct 04, 2022 at 09:53:32AM -0400, Stefan Hajnoczi wrote:
> > > > >> On Tue, 4 Oct 2022 at 05:44, Ming Lei <[email protected]> wrote:
> > > > >>>
> > > > >>> On Mon, Oct 03, 2022 at 03:53:41PM -0400, Stefan Hajnoczi wrote:
> > > > >>>> On Fri, Sep 30, 2022 at 05:24:11PM +0800, Ming Lei wrote:
> > > > >>>>> ublk-qcow2 is available now.
> > > > >>>>
> > > > >>>> Cool, thanks for sharing!
> > > > >>>>
> > > > >>>>>
> > > > >>>>> So far it provides basic read/write function, and compression and snapshot
> > > > >>>>> aren't supported yet. The target/backend implementation is completely
> > > > >>>>> based on io_uring, and share the same io_uring with ublk IO command
> > > > >>>>> handler, just like what ublk-loop does.
> > > > >>>>>
> > > > >>>>> Follows the main motivations of ublk-qcow2:
> > > > >>>>>
> > > > >>>>> - building one complicated target from scratch helps libublksrv APIs/functions
> > > > >>>>>   become mature/stable more quickly, since qcow2 is complicated and needs more
> > > > >>>>>   requirement from libublksrv compared with other simple ones(loop, null)
> > > > >>>>>
> > > > >>>>> - there are several attempts of implementing qcow2 driver in kernel, such as
> > > > >>>>>   ``qloop`` [2], ``dm-qcow2`` [3] and ``in kernel qcow2(ro)`` [4], so ublk-qcow2
> > > > > >>>>>   might be useful for covering requirements in this field
> > > > >>>>>
> > > > >>>>> - performance comparison with qemu-nbd, and it was my 1st thought to evaluate
> > > > >>>>>   performance of ublk/io_uring backend by writing one ublk-qcow2 since ublksrv
> > > > >>>>>   is started
> > > > >>>>>
> > > > >>>>> - help to abstract common building block or design pattern for writing new ublk
> > > > >>>>>   target/backend
> > > > >>>>>
> > > > >>>>> So far it basically passes xfstest(XFS) test by using ublk-qcow2 block
> > > > >>>>> device as TEST_DEV, and kernel building workload is verified too. Also
> > > > >>>>> soft update approach is applied in meta flushing, and meta data
> > > > >>>>> integrity is guaranteed, 'make test T=qcow2/040' covers this kind of
> > > > >>>>> test, and only cluster leak is reported during this test.
> > > > >>>>>
> > > > >>>>> The performance data looks much better compared with qemu-nbd, see
> > > > >>>>> details in commit log[1], README[5] and STATUS[6]. And the test covers both
> > > > >>>>> empty image and pre-allocated image, for example of pre-allocated qcow2
> > > > >>>>> image(8GB):
> > > > >>>>>
> > > > >>>>> - qemu-nbd (make test T=qcow2/002)
> > > > >>>>
> > > > >>>> Single queue?
> > > > >>>
> > > > >>> Yeah.
> > > > >>>
> > > > >>>>
> > > > >>>>>     randwrite(4k): jobs 1, iops 24605
> > > > >>>>>     randread(4k): jobs 1, iops 30938
> > > > >>>>>     randrw(4k): jobs 1, iops read 13981 write 14001
> > > > >>>>>     rw(512k): jobs 1, iops read 724 write 728
> > > > >>>>
> > > > >>>> Please try qemu-storage-daemon's VDUSE export type as well. The
> > > > >>>> command-line should be similar to this:
> > > > >>>>
> > > > >>>>   # modprobe virtio_vdpa # attaches vDPA devices to host kernel
> > > > >>>
> > > > >>> Not found virtio_vdpa module even though I enabled all the following
> > > > >>> options:
> > > > >>>
> > > > >>>         --- vDPA drivers
> > > > >>>           <M>   vDPA device simulator core
> > > > >>>           <M>     vDPA simulator for networking device
> > > > >>>           <M>     vDPA simulator for block device
> > > > >>>           <M>   VDUSE (vDPA Device in Userspace) support
> > > > >>>           <M>   Intel IFC VF vDPA driver
> > > > >>>           <M>   Virtio PCI bridge vDPA driver
> > > > >>>           <M>   vDPA driver for Alibaba ENI
> > > > >>>
> > > > >>> BTW, my test environment is VM and the shared data is done in VM too, and
> > > > >>> can virtio_vdpa be used inside VM?
> > > > >>
> > > > >> I hope Xie Yongji can help explain how to benchmark VDUSE.
> > > > >>
> > > > >> virtio_vdpa is available inside guests too. Please check that
> > > > >> VIRTIO_VDPA ("vDPA driver for virtio devices") is enabled in "Virtio
> > > > >> drivers" menu.
> > > > >>
> > > > >>>
> > > > >>>>   # modprobe vduse
> > > > >>>>   # qemu-storage-daemon \
> > > > >>>>       --blockdev file,filename=test.qcow2,cache.direct=of|off,aio=native,node-name=file \
> > > > >>>>       --blockdev qcow2,file=file,node-name=qcow2 \
> > > > >>>>       --object iothread,id=iothread0 \
> > > > >>>>       --export vduse-blk,id=vduse0,name=vduse0,num-queues=$(nproc),node-name=qcow2,writable=on,iothread=iothread0
> > > > >>>>   # vdpa dev add name vduse0 mgmtdev vduse
> > > > >>>>
> > > > >>>> A virtio-blk device should appear and xfstests can be run on it
> > > > >>>> (typically /dev/vda unless you already have other virtio-blk devices).
> > > > >>>>
> > > > >>>> Afterwards you can destroy the device using:
> > > > >>>>
> > > > >>>>   # vdpa dev del vduse0
> > > > >>>>
> > > > >>>>>
> > > > >>>>> - ublk-qcow2 (make test T=qcow2/022)
> > > > >>>>
> > > > >>>> There are a lot of other factors not directly related to NBD vs ublk. In
> > > > >>>> order to get an apples-to-apples comparison with qemu-* a ublk export
> > > > >>>> type is needed in qemu-storage-daemon. That way only the difference is
> > > > >>>> the ublk interface and the rest of the code path is identical, making it
> > > > >>>> possible to compare NBD, VDUSE, ublk, etc more precisely.
> > > > >>>
> > > > >>> Maybe not true.
> > > > >>>
> > > > >>> ublk-qcow2 uses io_uring to handle all backend IO(include meta IO) completely,
> > > > >>> and so far single io_uring/pthread is for handling all qcow2 IOs and IO
> > > > >>> command.
> > > > >>
> > > > >> qemu-nbd doesn't use io_uring to handle the backend IO, so we don't
> > > > >
> > > > > > I tried to use it via --aio=io_uring for setting up qemu-nbd, but did not succeed.
> > > > >
> > > > >> know whether the benchmark demonstrates that ublk is faster than NBD,
> > > > >> that the ublk-qcow2 implementation is faster than qemu-nbd's qcow2,
> > > > >> whether there are miscellaneous implementation differences between
> > > > >> ublk-qcow2 and qemu-nbd (like using the same io_uring context for both
> > > > >> ublk and backend IO), or something else.
> > > > >
> > > > > The theory shouldn't be too complicated:
> > > > >
> > > > > > 1) io_uring passthrough(pt) communication is faster than socket, and io command
> > > > > > is carried over io_uring pt commands, and should be faster than virtio
> > > > > > communication too.
> > > > > >
> > > > > > 2) io_uring io handling is faster than libaio, which is used in the
> > > > > > test on qemu-nbd, and all qcow2 backend io(including meta io) is handled
> > > > > > by io_uring.
> > > > >
> > > > > https://github.com/ming1/ubdsrv/blob/master/tests/common/qcow2_common
> > > > >
> > > > > 3) ublk uses one single io_uring to handle all io commands and qcow2
> > > > > backend IOs, so batching handling is common, and it is easy to see
> > > > > dozens of IOs/io commands handled in single syscall, or even more.
> > > > >
> > > > >>
> > > > >> I'm suggesting measuring changes to just 1 variable at a time.
> > > > >> Otherwise it's hard to reach a conclusion about the root cause of the
> > > > >> performance difference. Let's learn why ublk-qcow2 performs well.
> > > > >
> > > > > Turns out the latest Fedora 37-beta doesn't support vdpa yet, so I built
> > > > > qemu from the latest github tree, and finally it starts to work. And test kernel
> > > > > is v6.0 release.
> > > > >
> > > > > Follows the test result, and all three devices are setup as single
> > > > > queue, and all tests are run in single job, still done in one VM, and
> > > > > the test images are stored on XFS/virito-scsi backed SSD.
> > > > >
> > > > > The 1st group tests all three block device which is backed by empty
> > > > > qcow2 image.
> > > > >
> > > > > The 2nd group tests all the three block devices backed by pre-allocated
> > > > > qcow2 image.
> > > > >
> > > > > Except for big sequential IO(512K), there is still not small gap between
> > > > > vdpa-virtio-blk and ublk.
> > > > >
> > > > > 1. run fio on block device over empty qcow2 image
> > > > > 1) qemu-nbd
> > > > > running qcow2/001
> > > > > run perf test on empty qcow2 image via nbd
> > > > >       fio (nbd(/mnt/data/ublk_null_8G_nYbgF.qcow2), libaio, bs 4k, dio, hw queues:1)...
> > > > >       randwrite: jobs 1, iops 8549
> > > > >       randread: jobs 1, iops 34829
> > > > >       randrw: jobs 1, iops read 11363 write 11333
> > > > >       rw(512k): jobs 1, iops read 590 write 597
> > > > >
> > > > >
> > > > > 2) ublk-qcow2
> > > > > running qcow2/021
> > > > > run perf test on empty qcow2 image via ublk
> > > > >       fio (ublk/qcow2( -f /mnt/data/ublk_null_8G_s761j.qcow2), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0).
> > > > >       randwrite: jobs 1, iops 16086
> > > > >       randread: jobs 1, iops 172720
> > > > >       randrw: jobs 1, iops read 35760 write 35702
> > > > >       rw(512k): jobs 1, iops read 1140 write 1149
> > > > >
> > > > > 3) vdpa-virtio-blk
> > > > > running debug/test_dev
> > > > > run io test on specified device
> > > > >       fio (vdpa(/dev/vdc), libaio, bs 4k, dio, hw queues:1)...
> > > > >       randwrite: jobs 1, iops 8626
> > > > >       randread: jobs 1, iops 126118
> > > > >       randrw: jobs 1, iops read 17698 write 17665
> > > > >       rw(512k): jobs 1, iops read 1023 write 1031
> > > > >
> > > > >
> > > > > 2. run fio on block device over pre-allocated qcow2 image
> > > > > 1) qemu-nbd
> > > > > running qcow2/002
> > > > > run perf test on pre-allocated qcow2 image via nbd
> > > > >       fio (nbd(/mnt/data/ublk_data_8G_sc0SB.qcow2), libaio, bs 4k, dio, hw queues:1)...
> > > > >       randwrite: jobs 1, iops 21439
> > > > >       randread: jobs 1, iops 30336
> > > > >       randrw: jobs 1, iops read 11476 write 11449
> > > > >       rw(512k): jobs 1, iops read 718 write 722
> > > > >
> > > > > 2) ublk-qcow2
> > > > > running qcow2/022
> > > > > run perf test on pre-allocated qcow2 image via ublk
> > > > >       fio (ublk/qcow2( -f /mnt/data/ublk_data_8G_yZiaJ.qcow2), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0).
> > > > >       randwrite: jobs 1, iops 98757
> > > > >       randread: jobs 1, iops 110246
> > > > >       randrw: jobs 1, iops read 47229 write 47161
> > > > >       rw(512k): jobs 1, iops read 1416 write 1427
> > > > >
> > > > > 3) vdpa-virtio-blk
> > > > > running debug/test_dev
> > > > > run io test on specified device
> > > > >       fio (vdpa(/dev/vdc), libaio, bs 4k, dio, hw queues:1)...
> > > > >       randwrite: jobs 1, iops 47317
> > > > >       randread: jobs 1, iops 74092
> > > > >       randrw: jobs 1, iops read 27196 write 27234
> > > > >       rw(512k): jobs 1, iops read 1447 write 1458
> > > > >
> > > > >
> > > >
> > > > Hi All,
> > > >
> > > > We are interested in VDUSE vs UBLK, too. And I have tested them with nullblk backend.
> > > > Let me share some results here.
> > > >
> > > > I setup UBLK with:
> > > >   ublk add -t loop -f /dev/nullb0 -d QUEUE_DEPTH -q NR_QUEUE
> > > >
> > > > I setup VDUSE with:
> > > >   qemu-storage-daemon \
> > > >        --chardev socket,id=charmonitor,path=/tmp/qmp.sock,server=on,wait=off \
> > > >        --monitor chardev=charmonitor \
> > > >        --blockdev driver=host_device,cache.direct=on,filename=/dev/nullb0,node-name=disk0 \
> > > >        --export vduse-blk,id=test,node-name=disk0,name=vduse_test,writable=on,num-queues=NR_QUEUE,queue-size=QUEUE_DEPTH
> > > >
> > > > Here QUEUE_DEPTH is 1, 32 or 128 and NR_QUEUE is 1 or 4.
> > > >
> > > > Note:
> > > > (1) VDUSE requires QUEUE_DEPTH >= 2. I cannot setup QUEUE_DEPTH to 1.
> > > > (2) I use qemu 7.1.0-rc3. It supports vduse-blk.
> > > > (3) I do not use ublk null target so that the test is fair.
> > > > (4) I setup fio with direct=1, bs=4k.
> > > >
> > > > ------------------------------
> > > > 1 job 1 iodepth, lat (usec)
> > > >                 vduse   ublk
> > > > seq-read        22.55   11.15
> > > > rand-read       22.49   11.17
> > > > seq-write       25.67   10.25
> > > > rand-write      24.13   10.16
> > >
> > > Thanks for sharing. Any idea what the bottlenecks are for vduse and ublk?
> > >
> >
> > I think one reason for the latency gap of sync I/O is that vduse uses
> > workqueue in the I/O completion path but ublk doesn't.
> >
> > And one bottleneck for the async I/O in vduse is that vduse will do
> > memcpy inside the critical section of virtqueue's spinlock in the
> > virtio-blk driver. That will hurt the performance heavily when
> > virtio_queue_rq() and virtblk_done() run concurrently. And it can be
> > mitigated by the advance DMA mapping feature [1] or irq binding
> > support [2].
>
> Hi Yongji,
>
> Yeah, that is the cost you paid for virtio. Wrt. userspace block device
> or other sort of userspace devices, cmd completion is driven by
> userspace, not sure if one such 'irq' is needed.

I'm not sure; it could be an optional feature in the future if needed.

> Even not sure if virtio
> ring is one good choice for such use case, given io_uring has been proved
> as very efficient(should be better than virtio ring, IMO).
>

Since vduse is aimed at creating a generic userspace device framework,
virtio should be the right way IMO. And with the vdpa framework, the
userspace device can serve both virtual machines and containers.

Regarding the performance issue, I actually can't measure how much of
the performance loss is due to the difference between the virtio ring and
io_uring, but I think it should be very small. The main costs come from
the two bottlenecks I mentioned before, which could be mitigated in the
future.

> ublk uses io_uring pt cmd for handling both io submission and completion,
> turns out the extra latency can be pretty small.
>
> BTW, one un-related topic, I saw the following words in
> Documentation/userspace-api/vduse.rst:
>
> ```
> Note that only virtio block device is supported by VDUSE framework now,
> which can reduce security risks when the userspace process that implements
> the data path is run by an unprivileged user.
> ```
>
> But when I tried to start qemu-storage-daemon for creating vdpa-virtio
> block by nor unprivileged user, 'Permission denied' is still returned,
> can you explain a bit how to start such process by unprivileged user?
> Or maybe I misunderstood the above words, please let me know.
>

Currently vduse should only allow privileged users by default. But a
sysadmin can change the permissions of the vduse char device or pass
the device fd to an unprivileged process, IIUC.
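
A minimal sketch of the fd-passing idea, assuming a Unix system and
Python 3.9+ (for socket.send_fds/recv_fds). It only shows the generic
SCM_RIGHTS mechanism: a pipe stands in for the device fd, and nothing
here is specific to VDUSE or to how qemu-storage-daemon would accept
such an fd.

```python
import os
import socket


def send_fd(sock: socket.socket, fd: int) -> None:
    # Ship one open file descriptor over a Unix domain socket (SCM_RIGHTS).
    socket.send_fds(sock, [b"fd"], [fd])


def recv_fd(sock: socket.socket) -> int:
    # Receive at most one descriptor; the kernel installs a new fd number
    # in the receiving process that refers to the same open file.
    _msg, fds, _flags, _addr = socket.recv_fds(sock, 16, 1)
    return fds[0]


def demo() -> bytes:
    # In practice the two ends would live in a privileged and an
    # unprivileged process; a socketpair keeps the sketch self-contained.
    privileged, unprivileged = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    read_end, write_end = os.pipe()  # hypothetical stand-in for a device fd

    send_fd(privileged, write_end)
    received = recv_fd(unprivileged)

    # The receiver can use the fd without ever opening the path itself.
    os.write(received, b"hello")
    data = os.read(read_end, 5)

    for fd in (read_end, write_end, received):
        os.close(fd)
    privileged.close()
    unprivileged.close()
    return data


if __name__ == "__main__":
    print(demo())
```

The point of this design is that the permission check happens when the
privileged side opens the file; the receiver only inherits the already
opened descriptor.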

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: ublk-qcow2: ublk-qcow2 is available
  2022-10-17 11:11                 ` Yongji Xie
@ 2022-10-18  6:59                   ` Ming Lei
  2022-10-18 13:17                     ` Yongji Xie
  2022-10-21  6:28                     ` Jason Wang
  0 siblings, 2 replies; 44+ messages in thread
From: Ming Lei @ 2022-10-18  6:59 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Stefan Hajnoczi, Ziyang Zhang, Stefan Hajnoczi, io-uring,
	linux-block, linux-kernel, Denis V. Lunev, Xiaoguang Wang

On Mon, Oct 17, 2022 at 07:11:59PM +0800, Yongji Xie wrote:
> On Fri, Oct 14, 2022 at 8:57 PM Ming Lei <[email protected]> wrote:
> >
> > On Thu, Oct 13, 2022 at 02:48:04PM +0800, Yongji Xie wrote:
> > > On Wed, Oct 12, 2022 at 10:22 PM Stefan Hajnoczi <[email protected]> wrote:
> > > >
> > > > On Sat, 8 Oct 2022 at 04:43, Ziyang Zhang <[email protected]> wrote:
> > > > >
> > > > > On 2022/10/5 12:18, Ming Lei wrote:
> > > > > > On Tue, Oct 04, 2022 at 09:53:32AM -0400, Stefan Hajnoczi wrote:
> > > > > >> On Tue, 4 Oct 2022 at 05:44, Ming Lei <[email protected]> wrote:
> > > > > >>>
> > > > > >>> On Mon, Oct 03, 2022 at 03:53:41PM -0400, Stefan Hajnoczi wrote:
> > > > > >>>> On Fri, Sep 30, 2022 at 05:24:11PM +0800, Ming Lei wrote:
> > > > > >>>>> ublk-qcow2 is available now.
> > > > > >>>>
> > > > > >>>> Cool, thanks for sharing!
> > > > > >>>>
> > > > > >>>>>
> > > > > >>>>> So far it provides basic read/write function, and compression and snapshot
> > > > > >>>>> aren't supported yet. The target/backend implementation is completely
> > > > > >>>>> based on io_uring, and share the same io_uring with ublk IO command
> > > > > >>>>> handler, just like what ublk-loop does.
> > > > > >>>>>
> > > > > >>>>> Follows the main motivations of ublk-qcow2:
> > > > > >>>>>
> > > > > >>>>> - building one complicated target from scratch helps libublksrv APIs/functions
> > > > > >>>>>   become mature/stable more quickly, since qcow2 is complicated and needs more
> > > > > >>>>>   requirement from libublksrv compared with other simple ones(loop, null)
> > > > > >>>>>
> > > > > >>>>> - there are several attempts of implementing qcow2 driver in kernel, such as
> > > > > >>>>>   ``qloop`` [2], ``dm-qcow2`` [3] and ``in kernel qcow2(ro)`` [4], so ublk-qcow2
> > > > > >>>>>   might useful be for covering requirement in this field
> > > > > >>>>>
> > > > > >>>>> - performance comparison with qemu-nbd, and it was my 1st thought to evaluate
> > > > > >>>>>   performance of ublk/io_uring backend by writing one ublk-qcow2 since ublksrv
> > > > > >>>>>   is started
> > > > > >>>>>
> > > > > >>>>> - help to abstract common building block or design pattern for writing new ublk
> > > > > >>>>>   target/backend
> > > > > >>>>>
> > > > > >>>>> So far it basically passes xfstest(XFS) test by using ublk-qcow2 block
> > > > > >>>>> device as TEST_DEV, and kernel building workload is verified too. Also
> > > > > >>>>> soft update approach is applied in meta flushing, and meta data
> > > > > >>>>> integrity is guaranteed, 'make test T=qcow2/040' covers this kind of
> > > > > >>>>> test, and only cluster leak is reported during this test.
> > > > > >>>>>
> > > > > >>>>> The performance data looks much better compared with qemu-nbd, see
> > > > > >>>>> details in commit log[1], README[5] and STATUS[6]. And the test covers both
> > > > > >>>>> empty image and pre-allocated image, for example of pre-allocated qcow2
> > > > > >>>>> image(8GB):
> > > > > >>>>>
> > > > > >>>>> - qemu-nbd (make test T=qcow2/002)
> > > > > >>>>
> > > > > >>>> Single queue?
> > > > > >>>
> > > > > >>> Yeah.
> > > > > >>>
> > > > > >>>>
> > > > > >>>>>     randwrite(4k): jobs 1, iops 24605
> > > > > >>>>>     randread(4k): jobs 1, iops 30938
> > > > > >>>>>     randrw(4k): jobs 1, iops read 13981 write 14001
> > > > > >>>>>     rw(512k): jobs 1, iops read 724 write 728
> > > > > >>>>
> > > > > >>>> Please try qemu-storage-daemon's VDUSE export type as well. The
> > > > > >>>> command-line should be similar to this:
> > > > > >>>>
> > > > > >>>>   # modprobe virtio_vdpa # attaches vDPA devices to host kernel
> > > > > >>>
> > > > > >>> Not found virtio_vdpa module even though I enabled all the following
> > > > > >>> options:
> > > > > >>>
> > > > > >>>         --- vDPA drivers
> > > > > >>>           <M>   vDPA device simulator core
> > > > > >>>           <M>     vDPA simulator for networking device
> > > > > >>>           <M>     vDPA simulator for block device
> > > > > >>>           <M>   VDUSE (vDPA Device in Userspace) support
> > > > > >>>           <M>   Intel IFC VF vDPA driver
> > > > > >>>           <M>   Virtio PCI bridge vDPA driver
> > > > > >>>           <M>   vDPA driver for Alibaba ENI
> > > > > >>>
> > > > > >>> BTW, my test environment is VM and the shared data is done in VM too, and
> > > > > >>> can virtio_vdpa be used inside VM?
> > > > > >>
> > > > > >> I hope Xie Yongji can help explain how to benchmark VDUSE.
> > > > > >>
> > > > > >> virtio_vdpa is available inside guests too. Please check that
> > > > > >> VIRTIO_VDPA ("vDPA driver for virtio devices") is enabled in "Virtio
> > > > > >> drivers" menu.
> > > > > >>
> > > > > >>>
> > > > > >>>>   # modprobe vduse
> > > > > >>>>   # qemu-storage-daemon \
> > > > > >>>>       --blockdev file,filename=test.qcow2,cache.direct=of|off,aio=native,node-name=file \
> > > > > >>>>       --blockdev qcow2,file=file,node-name=qcow2 \
> > > > > >>>>       --object iothread,id=iothread0 \
> > > > > >>>>       --export vduse-blk,id=vduse0,name=vduse0,num-queues=$(nproc),node-name=qcow2,writable=on,iothread=iothread0
> > > > > >>>>   # vdpa dev add name vduse0 mgmtdev vduse
> > > > > >>>>
> > > > > >>>> A virtio-blk device should appear and xfstests can be run on it
> > > > > >>>> (typically /dev/vda unless you already have other virtio-blk devices).
> > > > > >>>>
> > > > > >>>> Afterwards you can destroy the device using:
> > > > > >>>>
> > > > > >>>>   # vdpa dev del vduse0
> > > > > >>>>
> > > > > >>>>>
> > > > > >>>>> - ublk-qcow2 (make test T=qcow2/022)
> > > > > >>>>
> > > > > >>>> There are a lot of other factors not directly related to NBD vs ublk. In
> > > > > >>>> order to get an apples-to-apples comparison with qemu-* a ublk export
> > > > > >>>> type is needed in qemu-storage-daemon. That way only the difference is
> > > > > >>>> the ublk interface and the rest of the code path is identical, making it
> > > > > >>>> possible to compare NBD, VDUSE, ublk, etc more precisely.
> > > > > >>>
> > > > > >>> Maybe not true.
> > > > > >>>
> > > > > >>> ublk-qcow2 uses io_uring to handle all backend IO(include meta IO) completely,
> > > > > >>> and so far single io_uring/pthread is for handling all qcow2 IOs and IO
> > > > > >>> command.
> > > > > >>
> > > > > >> qemu-nbd doesn't use io_uring to handle the backend IO, so we don't
> > > > > >
> > > > > > I tried to use it via --aio=io_uring for setting up qemu-nbd, but not succeed.
> > > > > >
> > > > > >> know whether the benchmark demonstrates that ublk is faster than NBD,
> > > > > >> that the ublk-qcow2 implementation is faster than qemu-nbd's qcow2,
> > > > > >> whether there are miscellaneous implementation differences between
> > > > > >> ublk-qcow2 and qemu-nbd (like using the same io_uring context for both
> > > > > >> ublk and backend IO), or something else.
> > > > > >
> > > > > > The theory shouldn't be too complicated:
> > > > > >
> > > > > > 1) io uring passthough(pt) communication is fast than socket, and io command
> > > > > > is carried over io_uring pt commands, and should be fast than virio
> > > > > > communication too.
> > > > > >
> > > > > > 2) io uring io handling is fast than libaio which is taken in the
> > > > > > test on qemu-nbd, and all qcow2 backend io(include meta io) is handled
> > > > > > by io_uring.
> > > > > >
> > > > > > https://github.com/ming1/ubdsrv/blob/master/tests/common/qcow2_common
> > > > > >
> > > > > > 3) ublk uses one single io_uring to handle all io commands and qcow2
> > > > > > backend IOs, so batching handling is common, and it is easy to see
> > > > > > dozens of IOs/io commands handled in single syscall, or even more.
> > > > > >
> > > > > >>
> > > > > >> I'm suggesting measuring changes to just 1 variable at a time.
> > > > > >> Otherwise it's hard to reach a conclusion about the root cause of the
> > > > > >> performance difference. Let's learn why ublk-qcow2 performs well.
> > > > > >
> > > > > > Turns out the latest Fedora 37-beta doesn't support vdpa yet, so I built
> > > > > > qemu from the latest github tree, and finally it starts to work. And test kernel
> > > > > > is v6.0 release.
> > > > > >
> > > > > > Follows the test result, and all three devices are setup as single
> > > > > > queue, and all tests are run in single job, still done in one VM, and
> > > > > > the test images are stored on XFS/virito-scsi backed SSD.
> > > > > >
> > > > > > The 1st group tests all three block device which is backed by empty
> > > > > > qcow2 image.
> > > > > >
> > > > > > The 2nd group tests all the three block devices backed by pre-allocated
> > > > > > qcow2 image.
> > > > > >
> > > > > > Except for big sequential IO(512K), there is still not small gap between
> > > > > > vdpa-virtio-blk and ublk.
> > > > > >
> > > > > > 1. run fio on block device over empty qcow2 image
> > > > > > 1) qemu-nbd
> > > > > > running qcow2/001
> > > > > > run perf test on empty qcow2 image via nbd
> > > > > >       fio (nbd(/mnt/data/ublk_null_8G_nYbgF.qcow2), libaio, bs 4k, dio, hw queues:1)...
> > > > > >       randwrite: jobs 1, iops 8549
> > > > > >       randread: jobs 1, iops 34829
> > > > > >       randrw: jobs 1, iops read 11363 write 11333
> > > > > >       rw(512k): jobs 1, iops read 590 write 597
> > > > > >
> > > > > >
> > > > > > 2) ublk-qcow2
> > > > > > running qcow2/021
> > > > > > run perf test on empty qcow2 image via ublk
> > > > > >       fio (ublk/qcow2( -f /mnt/data/ublk_null_8G_s761j.qcow2), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0).
> > > > > >       randwrite: jobs 1, iops 16086
> > > > > >       randread: jobs 1, iops 172720
> > > > > >       randrw: jobs 1, iops read 35760 write 35702
> > > > > >       rw(512k): jobs 1, iops read 1140 write 1149
> > > > > >
> > > > > > 3) vdpa-virtio-blk
> > > > > > running debug/test_dev
> > > > > > run io test on specified device
> > > > > >       fio (vdpa(/dev/vdc), libaio, bs 4k, dio, hw queues:1)...
> > > > > >       randwrite: jobs 1, iops 8626
> > > > > >       randread: jobs 1, iops 126118
> > > > > >       randrw: jobs 1, iops read 17698 write 17665
> > > > > >       rw(512k): jobs 1, iops read 1023 write 1031
> > > > > >
> > > > > >
> > > > > > 2. run fio on block device over pre-allocated qcow2 image
> > > > > > 1) qemu-nbd
> > > > > > running qcow2/002
> > > > > > run perf test on pre-allocated qcow2 image via nbd
> > > > > >       fio (nbd(/mnt/data/ublk_data_8G_sc0SB.qcow2), libaio, bs 4k, dio, hw queues:1)...
> > > > > >       randwrite: jobs 1, iops 21439
> > > > > >       randread: jobs 1, iops 30336
> > > > > >       randrw: jobs 1, iops read 11476 write 11449
> > > > > >       rw(512k): jobs 1, iops read 718 write 722
> > > > > >
> > > > > > 2) ublk-qcow2
> > > > > > running qcow2/022
> > > > > > run perf test on pre-allocated qcow2 image via ublk
> > > > > >       fio (ublk/qcow2( -f /mnt/data/ublk_data_8G_yZiaJ.qcow2), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0).
> > > > > >       randwrite: jobs 1, iops 98757
> > > > > >       randread: jobs 1, iops 110246
> > > > > >       randrw: jobs 1, iops read 47229 write 47161
> > > > > >       rw(512k): jobs 1, iops read 1416 write 1427
> > > > > >
> > > > > > 3) vdpa-virtio-blk
> > > > > > running debug/test_dev
> > > > > > run io test on specified device
> > > > > >       fio (vdpa(/dev/vdc), libaio, bs 4k, dio, hw queues:1)...
> > > > > >       randwrite: jobs 1, iops 47317
> > > > > >       randread: jobs 1, iops 74092
> > > > > >       randrw: jobs 1, iops read 27196 write 27234
> > > > > >       rw(512k): jobs 1, iops read 1447 write 1458
> > > > > >
> > > > > >
> > > > >
> > > > > Hi All,
> > > > >
> > > > > We are interested in VDUSE vs UBLK, too. And I have tested them with nullblk backend.
> > > > > Let me share some results here.
> > > > >
> > > > > I setup UBLK with:
> > > > >   ublk add -t loop -f /dev/nullb0 -d QUEUE_DEPTH -q NR_QUEUE
> > > > >
> > > > > I setup VDUSE with:
> > > > >   qemu-storage-daemon \
> > > > >        --chardev socket,id=charmonitor,path=/tmp/qmp.sock,server=on,wait=off \
> > > > >        --monitor chardev=charmonitor \
> > > > >        --blockdev driver=host_device,cache.direct=on,filename=/dev/nullb0,node-name=disk0 \
> > > > >        --export vduse-blk,id=test,node-name=disk0,name=vduse_test,writable=on,num-queues=NR_QUEUE,queue-size=QUEUE_DEPTH
> > > > >
> > > > > Here QUEUE_DEPTH is 1, 32 or 128 and NR_QUEUE is 1 or 4.
> > > > >
> > > > > Note:
> > > > > (1) VDUSE requires QUEUE_DEPTH >= 2. I cannot setup QUEUE_DEPTH to 1.
> > > > > (2) I use qemu 7.1.0-rc3. It supports vduse-blk.
> > > > > (3) I do not use ublk null target so that the test is fair.
> > > > > (4) I setup fio with direct=1, bs=4k.
> > > > >
> > > > > ------------------------------
> > > > > > 1 job 1 iodepth, lat (usec)
> > > > >                 vduse   ublk
> > > > > seq-read        22.55   11.15
> > > > > rand-read       22.49   11.17
> > > > > seq-write       25.67   10.25
> > > > > rand-write      24.13   10.16
> > > >
> > > > Thanks for sharing. Any idea what the bottlenecks are for vduse and ublk?
> > > >
> > >
> > > I think one reason for the latency gap of sync I/O is that vduse uses
> > > workqueue in the I/O completion path but ublk doesn't.
> > >
> > > And one bottleneck for the async I/O in vduse is that vduse will do
> > > memcpy inside the critical section of virtqueue's spinlock in the
> > > virtio-blk driver. That will hurt the performance heavily when
> > > virtio_queue_rq() and virtblk_done() run concurrently. And it can be
> > > mitigated by the advance DMA mapping feature [1] or irq binding
> > > support [2].
> >
> > Hi Yongji,
> >
> > Yeah, that is the cost you paid for virtio. Wrt. userspace block device
> > or other sort of userspace devices, cmd completion is driven by
> > userspace, not sure if one such 'irq' is needed.
> 
> I'm not sure, it can be an optional feature in the future if needed.
> 
> > Even not sure if virtio
> > ring is one good choice for such use case, given io_uring has been proved
> > as very efficient(should be better than virtio ring, IMO).
> >
> 
> Since vduse is aimed at creating a generic userspace device framework,
> virtio should be the right way IMO.

OK, it is the right way, but it may not be the most efficient one.

> And with the vdpa framework, the
> userspace device can serve both virtual machines and containers.

virtio is good for VMs, but I'm not sure it is good enough for other
cases.

> 
> Regarding the performance issue, actually I can't measure how much of
> the performance loss is due to the difference between virtio ring and
> iouring. But I think it should be very small. The main costs come from
> the two bottlenecks I mentioned before which could be mitigated in the
> future.

As I understand it, there are at least two places where the virtio ring is
less efficient than io_uring:

1) io_uring uses a standalone submission queue (SQ) and completion queue (CQ),
so there is no contention between submission and completion; a virtio queue
requires the per-vq lock in both submission and completion.

2) io_uring can use a single io_uring_enter() system call for both
submitting and completing, so one context switch is enough. Together
with the natural batching of both submission and completion, it is
observed that dozens of IOs, or even more than one hundred, can be
covered by a single syscall. virtio requires one notification for submission
and another for completion, so it looks like at least two context switches
are required for handling one IO (or one batch of IOs).
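
To put rough numbers on point 2), here is a toy counting model, not a
measurement. It assumes each io_uring_enter() call both submits and reaps
one full batch, and that virtio needs exactly one submission notification
and one completion notification per batch of the same size. Real behaviour
depends on notification suppression, polling and actual batch sizes, and
the function names are made up for this sketch.

```python
import math


def io_uring_syscalls(n_ios: int, batch: int) -> int:
    # One io_uring_enter() per batch covers both submission and completion.
    return math.ceil(n_ios / batch)


def virtio_notifications(n_ios: int, batch: int) -> int:
    # One submission notification plus one completion notification per batch.
    return 2 * math.ceil(n_ios / batch)


if __name__ == "__main__":
    total = 128
    for batch in (1, 32, 128):
        print(
            f"batch={batch:3d}: "
            f"io_uring_enter calls={io_uring_syscalls(total, batch):3d}, "
            f"virtio notifications={virtio_notifications(total, batch):3d}"
        )
```

Under these assumptions the ratio stays at 2x regardless of batch size;
the larger practical difference would come from how often each path can
actually form big batches.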

> 
> > ublk uses io_uring pt cmd for handling both io submission and completion,
> > turns out the extra latency can be pretty small.
> >
> > BTW, one un-related topic, I saw the following words in
> > Documentation/userspace-api/vduse.rst:
> >
> > ```
> > Note that only virtio block device is supported by VDUSE framework now,
> > which can reduce security risks when the userspace process that implements
> > the data path is run by an unprivileged user.
> > ```
> >
> > But when I tried to start qemu-storage-daemon for creating vdpa-virtio
> > block by nor unprivileged user, 'Permission denied' is still returned,
> > can you explain a bit how to start such process by unprivileged user?
> > Or maybe I misunderstood the above words, please let me know.
> >
> 
> Currently vduse should only allow privileged users by default. But
> sysadmin can change the permission of the vduse char device or pass
> the device fd to an unprivileged process IIUC.

I'd appreciate it if you could provide more detailed steps for the above.

BTW, I made /dev/vduse/control accessible to a normal user, but
qemu-storage-daemon still returns 'Permission denied'. And if the
char dev is /dev/vduse/vduse0N, which is created by qemu-storage-daemon,
how can qemu-storage-daemon be switched to an unprivileged user after
/dev/vduse/vduse0N is created?



Thanks,
Ming

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: ublk-qcow2: ublk-qcow2 is available
  2022-10-18  6:59                   ` Ming Lei
@ 2022-10-18 13:17                     ` Yongji Xie
  2022-10-18 14:54                       ` Stefan Hajnoczi
  2022-10-21  6:28                     ` Jason Wang
  1 sibling, 1 reply; 44+ messages in thread
From: Yongji Xie @ 2022-10-18 13:17 UTC (permalink / raw)
  To: Ming Lei
  Cc: Stefan Hajnoczi, Ziyang Zhang, Stefan Hajnoczi, io-uring,
	linux-block, linux-kernel, Denis V. Lunev, Xiaoguang Wang

On Tue, Oct 18, 2022 at 2:59 PM Ming Lei <[email protected]> wrote:
>
> On Mon, Oct 17, 2022 at 07:11:59PM +0800, Yongji Xie wrote:
> > On Fri, Oct 14, 2022 at 8:57 PM Ming Lei <[email protected]> wrote:
> > >
> > > On Thu, Oct 13, 2022 at 02:48:04PM +0800, Yongji Xie wrote:
> > > > On Wed, Oct 12, 2022 at 10:22 PM Stefan Hajnoczi <[email protected]> wrote:
> > > > >
> > > > > On Sat, 8 Oct 2022 at 04:43, Ziyang Zhang <[email protected]> wrote:
> > > > > >
> > > > > > On 2022/10/5 12:18, Ming Lei wrote:
> > > > > > > On Tue, Oct 04, 2022 at 09:53:32AM -0400, Stefan Hajnoczi wrote:
> > > > > > >> On Tue, 4 Oct 2022 at 05:44, Ming Lei <[email protected]> wrote:
> > > > > > >>>
> > > > > > >>> On Mon, Oct 03, 2022 at 03:53:41PM -0400, Stefan Hajnoczi wrote:
> > > > > > >>>> On Fri, Sep 30, 2022 at 05:24:11PM +0800, Ming Lei wrote:
> > > > > > >>>>> ublk-qcow2 is available now.
> > > > > > >>>>
> > > > > > >>>> Cool, thanks for sharing!
> > > > > > >>>>
> > > > > > >>>>>
> > > > > > >>>>> So far it provides basic read/write function, and compression and snapshot
> > > > > > >>>>> aren't supported yet. The target/backend implementation is completely
> > > > > > >>>>> based on io_uring, and share the same io_uring with ublk IO command
> > > > > > >>>>> handler, just like what ublk-loop does.
> > > > > > >>>>>
> > > > > > >>>>> Follows the main motivations of ublk-qcow2:
> > > > > > >>>>>
> > > > > > >>>>> - building one complicated target from scratch helps libublksrv APIs/functions
> > > > > > >>>>>   become mature/stable more quickly, since qcow2 is complicated and needs more
> > > > > > >>>>>   requirement from libublksrv compared with other simple ones(loop, null)
> > > > > > >>>>>
> > > > > > >>>>> - there are several attempts of implementing qcow2 driver in kernel, such as
> > > > > > >>>>>   ``qloop`` [2], ``dm-qcow2`` [3] and ``in kernel qcow2(ro)`` [4], so ublk-qcow2
> > > > > > >>>>>   might useful be for covering requirement in this field
> > > > > > >>>>>
> > > > > > >>>>> - performance comparison with qemu-nbd, and it was my 1st thought to evaluate
> > > > > > >>>>>   performance of ublk/io_uring backend by writing one ublk-qcow2 since ublksrv
> > > > > > >>>>>   is started
> > > > > > >>>>>
> > > > > > >>>>> - help to abstract common building block or design pattern for writing new ublk
> > > > > > >>>>>   target/backend
> > > > > > >>>>>
> > > > > > >>>>> So far it basically passes xfstest(XFS) test by using ublk-qcow2 block
> > > > > > >>>>> device as TEST_DEV, and kernel building workload is verified too. Also
> > > > > > >>>>> soft update approach is applied in meta flushing, and meta data
> > > > > > >>>>> integrity is guaranteed, 'make test T=qcow2/040' covers this kind of
> > > > > > >>>>> test, and only cluster leak is reported during this test.
> > > > > > >>>>>
> > > > > > >>>>> The performance data looks much better compared with qemu-nbd, see
> > > > > > >>>>> details in commit log[1], README[5] and STATUS[6]. And the test covers both
> > > > > > >>>>> empty image and pre-allocated image, for example of pre-allocated qcow2
> > > > > > >>>>> image(8GB):
> > > > > > >>>>>
> > > > > > >>>>> - qemu-nbd (make test T=qcow2/002)
> > > > > > >>>>
> > > > > > >>>> Single queue?
> > > > > > >>>
> > > > > > >>> Yeah.
> > > > > > >>>
> > > > > > >>>>
> > > > > > >>>>>     randwrite(4k): jobs 1, iops 24605
> > > > > > >>>>>     randread(4k): jobs 1, iops 30938
> > > > > > >>>>>     randrw(4k): jobs 1, iops read 13981 write 14001
> > > > > > >>>>>     rw(512k): jobs 1, iops read 724 write 728
> > > > > > >>>>
> > > > > > >>>> Please try qemu-storage-daemon's VDUSE export type as well. The
> > > > > > >>>> command-line should be similar to this:
> > > > > > >>>>
> > > > > > >>>>   # modprobe virtio_vdpa # attaches vDPA devices to host kernel
> > > > > > >>>
> > > > > > >>> Not found virtio_vdpa module even though I enabled all the following
> > > > > > >>> options:
> > > > > > >>>
> > > > > > >>>         --- vDPA drivers
> > > > > > >>>           <M>   vDPA device simulator core
> > > > > > >>>           <M>     vDPA simulator for networking device
> > > > > > >>>           <M>     vDPA simulator for block device
> > > > > > >>>           <M>   VDUSE (vDPA Device in Userspace) support
> > > > > > >>>           <M>   Intel IFC VF vDPA driver
> > > > > > >>>           <M>   Virtio PCI bridge vDPA driver
> > > > > > >>>           <M>   vDPA driver for Alibaba ENI
> > > > > > >>>
> > > > > > >>> BTW, my test environment is VM and the shared data is done in VM too, and
> > > > > > >>> can virtio_vdpa be used inside VM?
> > > > > > >>
> > > > > > >> I hope Xie Yongji can help explain how to benchmark VDUSE.
> > > > > > >>
> > > > > > >> virtio_vdpa is available inside guests too. Please check that
> > > > > > >> VIRTIO_VDPA ("vDPA driver for virtio devices") is enabled in "Virtio
> > > > > > >> drivers" menu.
> > > > > > >>
> > > > > > >>>
> > > > > > >>>>   # modprobe vduse
> > > > > > >>>>   # qemu-storage-daemon \
> > > > > > >>>>       --blockdev file,filename=test.qcow2,cache.direct=of|off,aio=native,node-name=file \
> > > > > > >>>>       --blockdev qcow2,file=file,node-name=qcow2 \
> > > > > > >>>>       --object iothread,id=iothread0 \
> > > > > > >>>>       --export vduse-blk,id=vduse0,name=vduse0,num-queues=$(nproc),node-name=qcow2,writable=on,iothread=iothread0
> > > > > > >>>>   # vdpa dev add name vduse0 mgmtdev vduse
> > > > > > >>>>
> > > > > > >>>> A virtio-blk device should appear and xfstests can be run on it
> > > > > > >>>> (typically /dev/vda unless you already have other virtio-blk devices).
> > > > > > >>>>
> > > > > > >>>> Afterwards you can destroy the device using:
> > > > > > >>>>
> > > > > > >>>>   # vdpa dev del vduse0
> > > > > > >>>>
> > > > > > >>>>>
> > > > > > >>>>> - ublk-qcow2 (make test T=qcow2/022)
> > > > > > >>>>
> > > > > > >>>> There are a lot of other factors not directly related to NBD vs ublk. In
> > > > > > >>>> order to get an apples-to-apples comparison with qemu-* a ublk export
> > > > > > >>>> type is needed in qemu-storage-daemon. That way only the difference is
> > > > > > >>>> the ublk interface and the rest of the code path is identical, making it
> > > > > > >>>> possible to compare NBD, VDUSE, ublk, etc more precisely.
> > > > > > >>>
> > > > > > >>> Maybe not true.
> > > > > > >>>
> > > > > > >>> ublk-qcow2 uses io_uring to handle all backend IO(include meta IO) completely,
> > > > > > >>> and so far single io_uring/pthread is for handling all qcow2 IOs and IO
> > > > > > >>> command.
> > > > > > >>
> > > > > > >> qemu-nbd doesn't use io_uring to handle the backend IO, so we don't
> > > > > > >
> > > > > > > I tried to use it via --aio=io_uring for setting up qemu-nbd, but not succeed.
> > > > > > >
> > > > > > >> know whether the benchmark demonstrates that ublk is faster than NBD,
> > > > > > >> that the ublk-qcow2 implementation is faster than qemu-nbd's qcow2,
> > > > > > >> whether there are miscellaneous implementation differences between
> > > > > > >> ublk-qcow2 and qemu-nbd (like using the same io_uring context for both
> > > > > > >> ublk and backend IO), or something else.
> > > > > > >
> > > > > > > The theory shouldn't be too complicated:
> > > > > > >
> > > > > > > 1) io uring passthough(pt) communication is fast than socket, and io command
> > > > > > > is carried over io_uring pt commands, and should be fast than virio
> > > > > > > communication too.
> > > > > > >
> > > > > > > 2) io uring io handling is fast than libaio which is taken in the
> > > > > > > test on qemu-nbd, and all qcow2 backend io(include meta io) is handled
> > > > > > > by io_uring.
> > > > > > >
> > > > > > > https://github.com/ming1/ubdsrv/blob/master/tests/common/qcow2_common
> > > > > > >
> > > > > > > 3) ublk uses one single io_uring to handle all io commands and qcow2
> > > > > > > backend IOs, so batching handling is common, and it is easy to see
> > > > > > > dozens of IOs/io commands handled in single syscall, or even more.
> > > > > > >
> > > > > > >>
> > > > > > >> I'm suggesting measuring changes to just 1 variable at a time.
> > > > > > >> Otherwise it's hard to reach a conclusion about the root cause of the
> > > > > > >> performance difference. Let's learn why ublk-qcow2 performs well.
> > > > > > >
> > > > > > > Turns out the latest Fedora 37-beta doesn't support vdpa yet, so I built
> > > > > > > qemu from the latest github tree, and finally it starts to work. And test kernel
> > > > > > > is v6.0 release.
> > > > > > >
> > > > > > > Follows the test result, and all three devices are setup as single
> > > > > > > queue, and all tests are run in single job, still done in one VM, and
> > > > > > > the test images are stored on XFS/virito-scsi backed SSD.
> > > > > > >
> > > > > > > The 1st group tests all three block device which is backed by empty
> > > > > > > qcow2 image.
> > > > > > >
> > > > > > > The 2nd group tests all the three block devices backed by pre-allocated
> > > > > > > qcow2 image.
> > > > > > >
> > > > > > > Except for big sequential IO(512K), there is still not small gap between
> > > > > > > vdpa-virtio-blk and ublk.
> > > > > > >
> > > > > > > 1. run fio on block device over empty qcow2 image
> > > > > > > 1) qemu-nbd
> > > > > > > running qcow2/001
> > > > > > > run perf test on empty qcow2 image via nbd
> > > > > > >       fio (nbd(/mnt/data/ublk_null_8G_nYbgF.qcow2), libaio, bs 4k, dio, hw queues:1)...
> > > > > > >       randwrite: jobs 1, iops 8549
> > > > > > >       randread: jobs 1, iops 34829
> > > > > > >       randrw: jobs 1, iops read 11363 write 11333
> > > > > > >       rw(512k): jobs 1, iops read 590 write 597
> > > > > > >
> > > > > > >
> > > > > > > 2) ublk-qcow2
> > > > > > > running qcow2/021
> > > > > > > run perf test on empty qcow2 image via ublk
> > > > > > >       fio (ublk/qcow2( -f /mnt/data/ublk_null_8G_s761j.qcow2), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0).
> > > > > > >       randwrite: jobs 1, iops 16086
> > > > > > >       randread: jobs 1, iops 172720
> > > > > > >       randrw: jobs 1, iops read 35760 write 35702
> > > > > > >       rw(512k): jobs 1, iops read 1140 write 1149
> > > > > > >
> > > > > > > 3) vdpa-virtio-blk
> > > > > > > running debug/test_dev
> > > > > > > run io test on specified device
> > > > > > >       fio (vdpa(/dev/vdc), libaio, bs 4k, dio, hw queues:1)...
> > > > > > >       randwrite: jobs 1, iops 8626
> > > > > > >       randread: jobs 1, iops 126118
> > > > > > >       randrw: jobs 1, iops read 17698 write 17665
> > > > > > >       rw(512k): jobs 1, iops read 1023 write 1031
> > > > > > >
> > > > > > >
> > > > > > > 2. run fio on block device over pre-allocated qcow2 image
> > > > > > > 1) qemu-nbd
> > > > > > > running qcow2/002
> > > > > > > run perf test on pre-allocated qcow2 image via nbd
> > > > > > >       fio (nbd(/mnt/data/ublk_data_8G_sc0SB.qcow2), libaio, bs 4k, dio, hw queues:1)...
> > > > > > >       randwrite: jobs 1, iops 21439
> > > > > > >       randread: jobs 1, iops 30336
> > > > > > >       randrw: jobs 1, iops read 11476 write 11449
> > > > > > >       rw(512k): jobs 1, iops read 718 write 722
> > > > > > >
> > > > > > > 2) ublk-qcow2
> > > > > > > running qcow2/022
> > > > > > > run perf test on pre-allocated qcow2 image via ublk
> > > > > > >       fio (ublk/qcow2( -f /mnt/data/ublk_data_8G_yZiaJ.qcow2), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0).
> > > > > > >       randwrite: jobs 1, iops 98757
> > > > > > >       randread: jobs 1, iops 110246
> > > > > > >       randrw: jobs 1, iops read 47229 write 47161
> > > > > > >       rw(512k): jobs 1, iops read 1416 write 1427
> > > > > > >
> > > > > > > 3) vdpa-virtio-blk
> > > > > > > running debug/test_dev
> > > > > > > run io test on specified device
> > > > > > >       fio (vdpa(/dev/vdc), libaio, bs 4k, dio, hw queues:1)...
> > > > > > >       randwrite: jobs 1, iops 47317
> > > > > > >       randread: jobs 1, iops 74092
> > > > > > >       randrw: jobs 1, iops read 27196 write 27234
> > > > > > >       rw(512k): jobs 1, iops read 1447 write 1458
> > > > > > >
> > > > > > >
> > > > > >
> > > > > > Hi All,
> > > > > >
> > > > > > We are interested in VDUSE vs UBLK, too. And I have tested them with nullblk backend.
> > > > > > Let me share some results here.
> > > > > >
> > > > > > I setup UBLK with:
> > > > > >   ublk add -t loop -f /dev/nullb0 -d QUEUE_DEPTH -q NR_QUEUE
> > > > > >
> > > > > > I setup VDUSE with:
> > > > > >   qemu-storage-daemon \
> > > > > >        --chardev socket,id=charmonitor,path=/tmp/qmp.sock,server=on,wait=off \
> > > > > >        --monitor chardev=charmonitor \
> > > > > >        --blockdev driver=host_device,cache.direct=on,filename=/dev/nullb0,node-name=disk0 \
> > > > > >        --export vduse-blk,id=test,node-name=disk0,name=vduse_test,writable=on,num-queues=NR_QUEUE,queue-size=QUEUE_DEPTH
> > > > > >
> > > > > > Here QUEUE_DEPTH is 1, 32 or 128 and NR_QUEUE is 1 or 4.
> > > > > >
> > > > > > Note:
> > > > > > (1) VDUSE requires QUEUE_DEPTH >= 2. I cannot setup QUEUE_DEPTH to 1.
> > > > > > (2) I use qemu 7.1.0-rc3. It supports vduse-blk.
> > > > > > (3) I do not use ublk null target so that the test is fair.
> > > > > > (4) I setup fio with direct=1, bs=4k.
> > > > > >
> > > > > > ------------------------------
> > > > > > > 1 job 1 iodepth, lat (usec)
> > > > > >                 vduse   ublk
> > > > > > seq-read        22.55   11.15
> > > > > > rand-read       22.49   11.17
> > > > > > seq-write       25.67   10.25
> > > > > > rand-write      24.13   10.16
> > > > >
> > > > > Thanks for sharing. Any idea what the bottlenecks are for vduse and ublk?
> > > > >
> > > >
> > > > I think one reason for the latency gap of sync I/O is that vduse uses
> > > > workqueue in the I/O completion path but ublk doesn't.
> > > >
> > > > And one bottleneck for the async I/O in vduse is that vduse will do
> > > > memcpy inside the critical section of virtqueue's spinlock in the
> > > > virtio-blk driver. That will hurt the performance heavily when
> > > > virtio_queue_rq() and virtblk_done() run concurrently. And it can be
> > > > mitigated by the advance DMA mapping feature [1] or irq binding
> > > > support [2].
> > >
> > > Hi Yongji,
> > >
> > > Yeah, that is the cost you paid for virtio. Wrt. userspace block device
> > > or other sort of userspace devices, cmd completion is driven by
> > > userspace, not sure if one such 'irq' is needed.
> >
> > I'm not sure, it can be an optional feature in the future if needed.
> >
> > > Even not sure if virtio
> > > ring is one good choice for such use case, given io_uring has been proved
> > > as very efficient(should be better than virtio ring, IMO).
> > >
> >
> > Since vduse is aimed at creating a generic userspace device framework,
> > virtio should be the right way IMO.
>
> OK, it is the right way, but may not be the effective one.
>

Maybe, but I think we can try to optimize it.

> > And with the vdpa framework, the
> > userspace device can serve both virtual machines and containers.
>
> virtio is good for VM, but not sure it is good enough for other
> cases.
>
> >
> > Regarding the performance issue, actually I can't measure how much of
> > the performance loss is due to the difference between virtio ring and
> > iouring. But I think it should be very small. The main costs come from
> > the two bottlenecks I mentioned before which could be mitigated in the
> > future.
>
> Per my understanding, at least there are two places where virtio ring is
> less efficient than io_uring:
>

I might have misunderstood what you mean by virtio ring before. My
previous understanding of the virtio ring does not include the
virtio-blk driver.

> 1) io_uring uses standalone submission queue(SQ) and completion queue(CQ),
> so no contention exists between submission and completion; but virtio queue
> requires per-vq lock in both submission and completion.
>

Yes, this is the bottleneck of the virtio-blk driver, even in the VM
case. We are also trying to optimize this lock.

One way to mitigate it is to make submission and completion happen on
the same core.

> 2) io_uring can use single system call of io_uring_enter() for both
> submitting and completing, so one context switch is enough, together
> with natural batch processing for both submission and completion, and
> it is observed that dozens or more than one hundred of IOs can be
> covered in single syscall; virtio requires one notification for submission and
> another one for completion, looks at least two context switch are required
> for handling one IO(s).
>

I'm not sure I get your point here. It looks like vduse doesn't need
any syscall in the submission path. And in the completion path, we can
also batch the work and handle several I/Os in a single syscall.

> >
> > > ublk uses io_uring pt cmd for handling both io submission and completion,
> > > turns out the extra latency can be pretty small.
> > >
> > > BTW, one un-related topic, I saw the following words in
> > > Documentation/userspace-api/vduse.rst:
> > >
> > > ```
> > > Note that only virtio block device is supported by VDUSE framework now,
> > > which can reduce security risks when the userspace process that implements
> > > the data path is run by an unprivileged user.
> > > ```
> > >
> > > But when I tried to start qemu-storage-daemon for creating vdpa-virtio
> > > block by nor unprivileged user, 'Permission denied' is still returned,
> > > can you explain a bit how to start such process by unprivileged user?
> > > Or maybe I misunderstood the above words, please let me know.
> > >
> >
> > Currently vduse should only allow privileged users by default. But
> > sysadmin can change the permission of the vduse char device or pass
> > the device fd to an unprivileged process IIUC.
>
> I appreciate if you may provide a bit detailed steps for the above?
>

For example:

1. A privileged process creates a vduse device named "test" via
/dev/vduse/control.

2. The privileged process changes the permission of /dev/vduse/test.

3. An unprivileged process opens /dev/vduse/test to handle the I/O.

> BTW, I changed the privilege of /dev/vduse/control to normal user, but
> qemu-storage-daemon still returns 'Permission denied'. And if the
> char dev is /dev/vduse/vduse0N, which is created by qemu-storage-daemon,
> so how to change user of qemu-storage-daemon to unprivileged after
> /dev/vduse/vduse0N is created?
>

I think qemu-storage-daemon doesn't support unprivileged users in the
current implementation. To support that, one extra privileged process
is needed for device creation.
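
The fd-passing variant of that split could look roughly like the
sketch below. It is only an illustration, not qemu-storage-daemon or
VDUSE code: a temporary file stands in for /dev/vduse/test, both ends
run in one process over a socketpair, and the helper names and the
b"vduse-fd" tag are made up. It assumes Python 3.9+ for
socket.send_fds()/socket.recv_fds().

```python
import os
import socket
import tempfile


def send_device_fd(sock: socket.socket, path: str) -> None:
    """Privileged side: open the device node and hand the fd to a peer."""
    fd = os.open(path, os.O_RDWR)
    try:
        # SCM_RIGHTS installs a duplicate of the descriptor in the receiver.
        socket.send_fds(sock, [b"vduse-fd"], [fd])
    finally:
        # The in-flight copy stays valid after the sender closes its own fd.
        os.close(fd)


def recv_device_fd(sock: socket.socket) -> int:
    """Unprivileged side: receive an already-open fd, no open() needed."""
    msg, fds, _flags, _addr = socket.recv_fds(sock, 1024, 1)
    if msg != b"vduse-fd" or len(fds) != 1:
        raise RuntimeError("unexpected handoff message")
    return fds[0]


def demo() -> bytes:
    # Hypothetical stand-in for /dev/vduse/test: a plain temporary file.
    with tempfile.NamedTemporaryFile() as stand_in:
        stand_in.write(b"hello")
        stand_in.flush()
        privileged, unprivileged = socket.socketpair(
            socket.AF_UNIX, socket.SOCK_STREAM
        )
        with privileged, unprivileged:
            send_device_fd(privileged, stand_in.name)
            fd = recv_device_fd(unprivileged)
            try:
                return os.pread(fd, 5, 0)
            finally:
                os.close(fd)
```

In a real deployment the two ends would be separate processes joined
by a named AF_UNIX socket, and the receiver would never need
permission on the device node itself.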

Thanks,
Yongji


* Re: ublk-qcow2: ublk-qcow2 is available
  2022-10-18 13:17                     ` Yongji Xie
@ 2022-10-18 14:54                       ` Stefan Hajnoczi
  2022-10-19  9:09                         ` Ming Lei
  2022-10-21  5:33                         ` Yongji Xie
  0 siblings, 2 replies; 44+ messages in thread
From: Stefan Hajnoczi @ 2022-10-18 14:54 UTC (permalink / raw)
  To: Yongji Xie, Michael S. Tsirkin
  Cc: Ming Lei, Ziyang Zhang, Stefan Hajnoczi, io-uring, linux-block,
	linux-kernel, Denis V. Lunev, Xiaoguang Wang

On Tue, 18 Oct 2022 at 09:17, Yongji Xie <[email protected]> wrote:
>
> On Tue, Oct 18, 2022 at 2:59 PM Ming Lei <[email protected]> wrote:
> >
> > On Mon, Oct 17, 2022 at 07:11:59PM +0800, Yongji Xie wrote:
> > > On Fri, Oct 14, 2022 at 8:57 PM Ming Lei <[email protected]> wrote:
> > > >
> > > > On Thu, Oct 13, 2022 at 02:48:04PM +0800, Yongji Xie wrote:
> > > > > On Wed, Oct 12, 2022 at 10:22 PM Stefan Hajnoczi <[email protected]> wrote:
> > > > > >
> > > > > > On Sat, 8 Oct 2022 at 04:43, Ziyang Zhang <[email protected]> wrote:
> > > > > > >
> > > > > > > On 2022/10/5 12:18, Ming Lei wrote:
> > > > > > > > On Tue, Oct 04, 2022 at 09:53:32AM -0400, Stefan Hajnoczi wrote:
> > > > > > > >> On Tue, 4 Oct 2022 at 05:44, Ming Lei <[email protected]> wrote:
> > > > > > > >>>
> > > > > > > >>> On Mon, Oct 03, 2022 at 03:53:41PM -0400, Stefan Hajnoczi wrote:
> > > > > > > >>>> On Fri, Sep 30, 2022 at 05:24:11PM +0800, Ming Lei wrote:
> > > > > > > >>>>> ublk-qcow2 is available now.
> > > > > > > >>>>
> > > > > > > >>>> Cool, thanks for sharing!
> > > > > > > >>>>
> > > > > > > >>>>>
> > > > > > > >>>>> So far it provides basic read/write function, and compression and snapshot
> > > > > > > >>>>> aren't supported yet. The target/backend implementation is completely
> > > > > > > >>>>> based on io_uring, and share the same io_uring with ublk IO command
> > > > > > > >>>>> handler, just like what ublk-loop does.
> > > > > > > >>>>>
> > > > > > > >>>>> Follows the main motivations of ublk-qcow2:
> > > > > > > >>>>>
> > > > > > > >>>>> - building one complicated target from scratch helps libublksrv APIs/functions
> > > > > > > >>>>>   become mature/stable more quickly, since qcow2 is complicated and needs more
> > > > > > > >>>>>   requirement from libublksrv compared with other simple ones(loop, null)
> > > > > > > >>>>>
> > > > > > > >>>>> - there are several attempts of implementing qcow2 driver in kernel, such as
> > > > > > > >>>>>   ``qloop`` [2], ``dm-qcow2`` [3] and ``in kernel qcow2(ro)`` [4], so ublk-qcow2
> > > > > > > >>>>>   might useful be for covering requirement in this field
> > > > > > > >>>>>
> > > > > > > >>>>> - performance comparison with qemu-nbd, and it was my 1st thought to evaluate
> > > > > > > >>>>>   performance of ublk/io_uring backend by writing one ublk-qcow2 since ublksrv
> > > > > > > >>>>>   is started
> > > > > > > >>>>>
> > > > > > > >>>>> - help to abstract common building block or design pattern for writing new ublk
> > > > > > > >>>>>   target/backend
> > > > > > > >>>>>
> > > > > > > >>>>> So far it basically passes xfstest(XFS) test by using ublk-qcow2 block
> > > > > > > >>>>> device as TEST_DEV, and kernel building workload is verified too. Also
> > > > > > > >>>>> soft update approach is applied in meta flushing, and meta data
> > > > > > > >>>>> integrity is guaranteed, 'make test T=qcow2/040' covers this kind of
> > > > > > > >>>>> test, and only cluster leak is reported during this test.
> > > > > > > >>>>>
> > > > > > > >>>>> The performance data looks much better compared with qemu-nbd, see
> > > > > > > >>>>> details in commit log[1], README[5] and STATUS[6]. And the test covers both
> > > > > > > >>>>> empty image and pre-allocated image, for example of pre-allocated qcow2
> > > > > > > >>>>> image(8GB):
> > > > > > > >>>>>
> > > > > > > >>>>> - qemu-nbd (make test T=qcow2/002)
> > > > > > > >>>>
> > > > > > > >>>> Single queue?
> > > > > > > >>>
> > > > > > > >>> Yeah.
> > > > > > > >>>
> > > > > > > >>>>
> > > > > > > >>>>>     randwrite(4k): jobs 1, iops 24605
> > > > > > > >>>>>     randread(4k): jobs 1, iops 30938
> > > > > > > >>>>>     randrw(4k): jobs 1, iops read 13981 write 14001
> > > > > > > >>>>>     rw(512k): jobs 1, iops read 724 write 728
> > > > > > > >>>>
> > > > > > > >>>> Please try qemu-storage-daemon's VDUSE export type as well. The
> > > > > > > >>>> command-line should be similar to this:
> > > > > > > >>>>
> > > > > > > >>>>   # modprobe virtio_vdpa # attaches vDPA devices to host kernel
> > > > > > > >>>
> > > > > > > >>> Not found virtio_vdpa module even though I enabled all the following
> > > > > > > >>> options:
> > > > > > > >>>
> > > > > > > >>>         --- vDPA drivers
> > > > > > > >>>           <M>   vDPA device simulator core
> > > > > > > >>>           <M>     vDPA simulator for networking device
> > > > > > > >>>           <M>     vDPA simulator for block device
> > > > > > > >>>           <M>   VDUSE (vDPA Device in Userspace) support
> > > > > > > >>>           <M>   Intel IFC VF vDPA driver
> > > > > > > >>>           <M>   Virtio PCI bridge vDPA driver
> > > > > > > >>>           <M>   vDPA driver for Alibaba ENI
> > > > > > > >>>
> > > > > > > >>> BTW, my test environment is VM and the shared data is done in VM too, and
> > > > > > > >>> can virtio_vdpa be used inside VM?
> > > > > > > >>
> > > > > > > >> I hope Xie Yongji can help explain how to benchmark VDUSE.
> > > > > > > >>
> > > > > > > >> virtio_vdpa is available inside guests too. Please check that
> > > > > > > >> VIRTIO_VDPA ("vDPA driver for virtio devices") is enabled in "Virtio
> > > > > > > >> drivers" menu.
> > > > > > > >>
> > > > > > > >>>
> > > > > > > >>>>   # modprobe vduse
> > > > > > > >>>>   # qemu-storage-daemon \
> > > > > > > >>>>       --blockdev file,filename=test.qcow2,cache.direct=of|off,aio=native,node-name=file \
> > > > > > > >>>>       --blockdev qcow2,file=file,node-name=qcow2 \
> > > > > > > >>>>       --object iothread,id=iothread0 \
> > > > > > > >>>>       --export vduse-blk,id=vduse0,name=vduse0,num-queues=$(nproc),node-name=qcow2,writable=on,iothread=iothread0
> > > > > > > >>>>   # vdpa dev add name vduse0 mgmtdev vduse
> > > > > > > >>>>
> > > > > > > >>>> A virtio-blk device should appear and xfstests can be run on it
> > > > > > > >>>> (typically /dev/vda unless you already have other virtio-blk devices).
> > > > > > > >>>>
> > > > > > > >>>> Afterwards you can destroy the device using:
> > > > > > > >>>>
> > > > > > > >>>>   # vdpa dev del vduse0
> > > > > > > >>>>
> > > > > > > >>>>>
> > > > > > > >>>>> - ublk-qcow2 (make test T=qcow2/022)
> > > > > > > >>>>
> > > > > > > >>>> There are a lot of other factors not directly related to NBD vs ublk. In
> > > > > > > >>>> order to get an apples-to-apples comparison with qemu-* a ublk export
> > > > > > > >>>> type is needed in qemu-storage-daemon. That way only the difference is
> > > > > > > >>>> the ublk interface and the rest of the code path is identical, making it
> > > > > > > >>>> possible to compare NBD, VDUSE, ublk, etc more precisely.
> > > > > > > >>>
> > > > > > > >>> Maybe not true.
> > > > > > > >>>
> > > > > > > >>> ublk-qcow2 uses io_uring to handle all backend IO(include meta IO) completely,
> > > > > > > >>> and so far single io_uring/pthread is for handling all qcow2 IOs and IO
> > > > > > > >>> command.
> > > > > > > >>
> > > > > > > >> qemu-nbd doesn't use io_uring to handle the backend IO, so we don't
> > > > > > > >
> > > > > > > > I tried to use it via --aio=io_uring for setting up qemu-nbd, but not succeed.
> > > > > > > >
> > > > > > > >> know whether the benchmark demonstrates that ublk is faster than NBD,
> > > > > > > >> that the ublk-qcow2 implementation is faster than qemu-nbd's qcow2,
> > > > > > > >> whether there are miscellaneous implementation differences between
> > > > > > > >> ublk-qcow2 and qemu-nbd (like using the same io_uring context for both
> > > > > > > >> ublk and backend IO), or something else.
> > > > > > > >
> > > > > > > > The theory shouldn't be too complicated:
> > > > > > > >
> > > > > > > > 1) io uring passthough(pt) communication is fast than socket, and io command
> > > > > > > > is carried over io_uring pt commands, and should be fast than virio
> > > > > > > > communication too.
> > > > > > > >
> > > > > > > > 2) io uring io handling is fast than libaio which is taken in the
> > > > > > > > test on qemu-nbd, and all qcow2 backend io(include meta io) is handled
> > > > > > > > by io_uring.
> > > > > > > >
> > > > > > > > https://github.com/ming1/ubdsrv/blob/master/tests/common/qcow2_common
> > > > > > > >
> > > > > > > > 3) ublk uses one single io_uring to handle all io commands and qcow2
> > > > > > > > backend IOs, so batching handling is common, and it is easy to see
> > > > > > > > dozens of IOs/io commands handled in single syscall, or even more.
> > > > > > > >
> > > > > > > >>
> > > > > > > >> I'm suggesting measuring changes to just 1 variable at a time.
> > > > > > > >> Otherwise it's hard to reach a conclusion about the root cause of the
> > > > > > > >> performance difference. Let's learn why ublk-qcow2 performs well.
> > > > > > > >
> > > > > > > > Turns out the latest Fedora 37-beta doesn't support vdpa yet, so I built
> > > > > > > > qemu from the latest github tree, and finally it starts to work. And test kernel
> > > > > > > > is v6.0 release.
> > > > > > > >
> > > > > > > > Follows the test result, and all three devices are setup as single
> > > > > > > > queue, and all tests are run in single job, still done in one VM, and
> > > > > > > > the test images are stored on XFS/virito-scsi backed SSD.
> > > > > > > >
> > > > > > > > The 1st group tests all three block device which is backed by empty
> > > > > > > > qcow2 image.
> > > > > > > >
> > > > > > > > The 2nd group tests all the three block devices backed by pre-allocated
> > > > > > > > qcow2 image.
> > > > > > > >
> > > > > > > > Except for big sequential IO(512K), there is still not small gap between
> > > > > > > > vdpa-virtio-blk and ublk.
> > > > > > > >
> > > > > > > > 1. run fio on block device over empty qcow2 image
> > > > > > > > 1) qemu-nbd
> > > > > > > > running qcow2/001
> > > > > > > > run perf test on empty qcow2 image via nbd
> > > > > > > >       fio (nbd(/mnt/data/ublk_null_8G_nYbgF.qcow2), libaio, bs 4k, dio, hw queues:1)...
> > > > > > > >       randwrite: jobs 1, iops 8549
> > > > > > > >       randread: jobs 1, iops 34829
> > > > > > > >       randrw: jobs 1, iops read 11363 write 11333
> > > > > > > >       rw(512k): jobs 1, iops read 590 write 597
> > > > > > > >
> > > > > > > >
> > > > > > > > 2) ublk-qcow2
> > > > > > > > running qcow2/021
> > > > > > > > run perf test on empty qcow2 image via ublk
> > > > > > > >       fio (ublk/qcow2( -f /mnt/data/ublk_null_8G_s761j.qcow2), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0).
> > > > > > > >       randwrite: jobs 1, iops 16086
> > > > > > > >       randread: jobs 1, iops 172720
> > > > > > > >       randrw: jobs 1, iops read 35760 write 35702
> > > > > > > >       rw(512k): jobs 1, iops read 1140 write 1149
> > > > > > > >
> > > > > > > > 3) vdpa-virtio-blk
> > > > > > > > running debug/test_dev
> > > > > > > > run io test on specified device
> > > > > > > >       fio (vdpa(/dev/vdc), libaio, bs 4k, dio, hw queues:1)...
> > > > > > > >       randwrite: jobs 1, iops 8626
> > > > > > > >       randread: jobs 1, iops 126118
> > > > > > > >       randrw: jobs 1, iops read 17698 write 17665
> > > > > > > >       rw(512k): jobs 1, iops read 1023 write 1031
> > > > > > > >
> > > > > > > >
> > > > > > > > 2. run fio on block device over pre-allocated qcow2 image
> > > > > > > > 1) qemu-nbd
> > > > > > > > running qcow2/002
> > > > > > > > run perf test on pre-allocated qcow2 image via nbd
> > > > > > > >       fio (nbd(/mnt/data/ublk_data_8G_sc0SB.qcow2), libaio, bs 4k, dio, hw queues:1)...
> > > > > > > >       randwrite: jobs 1, iops 21439
> > > > > > > >       randread: jobs 1, iops 30336
> > > > > > > >       randrw: jobs 1, iops read 11476 write 11449
> > > > > > > >       rw(512k): jobs 1, iops read 718 write 722
> > > > > > > >
> > > > > > > > 2) ublk-qcow2
> > > > > > > > running qcow2/022
> > > > > > > > run perf test on pre-allocated qcow2 image via ublk
> > > > > > > >       fio (ublk/qcow2( -f /mnt/data/ublk_data_8G_yZiaJ.qcow2), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0).
> > > > > > > >       randwrite: jobs 1, iops 98757
> > > > > > > >       randread: jobs 1, iops 110246
> > > > > > > >       randrw: jobs 1, iops read 47229 write 47161
> > > > > > > >       rw(512k): jobs 1, iops read 1416 write 1427
> > > > > > > >
> > > > > > > > 3) vdpa-virtio-blk
> > > > > > > > running debug/test_dev
> > > > > > > > run io test on specified device
> > > > > > > >       fio (vdpa(/dev/vdc), libaio, bs 4k, dio, hw queues:1)...
> > > > > > > >       randwrite: jobs 1, iops 47317
> > > > > > > >       randread: jobs 1, iops 74092
> > > > > > > >       randrw: jobs 1, iops read 27196 write 27234
> > > > > > > >       rw(512k): jobs 1, iops read 1447 write 1458
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > > Hi All,
> > > > > > >
> > > > > > > We are interested in VDUSE vs UBLK, too. And I have tested them with nullblk backend.
> > > > > > > Let me share some results here.
> > > > > > >
> > > > > > > I setup UBLK with:
> > > > > > >   ublk add -t loop -f /dev/nullb0 -d QUEUE_DEPTH -q NR_QUEUE
> > > > > > >
> > > > > > > I setup VDUSE with:
> > > > > > >   qemu-storage-daemon \
> > > > > > >        --chardev socket,id=charmonitor,path=/tmp/qmp.sock,server=on,wait=off \
> > > > > > >        --monitor chardev=charmonitor \
> > > > > > >        --blockdev driver=host_device,cache.direct=on,filename=/dev/nullb0,node-name=disk0 \
> > > > > > >        --export vduse-blk,id=test,node-name=disk0,name=vduse_test,writable=on,num-queues=NR_QUEUE,queue-size=QUEUE_DEPTH
> > > > > > >
> > > > > > > Here QUEUE_DEPTH is 1, 32 or 128 and NR_QUEUE is 1 or 4.
> > > > > > >
> > > > > > > Note:
> > > > > > > (1) VDUSE requires QUEUE_DEPTH >= 2. I cannot setup QUEUE_DEPTH to 1.
> > > > > > > (2) I use qemu 7.1.0-rc3. It supports vduse-blk.
> > > > > > > (3) I do not use ublk null target so that the test is fair.
> > > > > > > (4) I setup fio with direct=1, bs=4k.
> > > > > > >
> > > > > > > ------------------------------
> > > > > > > > 1 job 1 iodepth, lat (usec)
> > > > > > >                 vduse   ublk
> > > > > > > seq-read        22.55   11.15
> > > > > > > rand-read       22.49   11.17
> > > > > > > seq-write       25.67   10.25
> > > > > > > rand-write      24.13   10.16
> > > > > >
> > > > > > Thanks for sharing. Any idea what the bottlenecks are for vduse and ublk?
> > > > > >
> > > > >
> > > > > I think one reason for the latency gap of sync I/O is that vduse uses
> > > > > workqueue in the I/O completion path but ublk doesn't.
> > > > >
> > > > > And one bottleneck for the async I/O in vduse is that vduse will do
> > > > > memcpy inside the critical section of virtqueue's spinlock in the
> > > > > virtio-blk driver. That will hurt the performance heavily when
> > > > > virtio_queue_rq() and virtblk_done() run concurrently. And it can be
> > > > > mitigated by the advance DMA mapping feature [1] or irq binding
> > > > > support [2].
> > > >
> > > > Hi Yongji,
> > > >
> > > > Yeah, that is the cost you paid for virtio. Wrt. userspace block device
> > > > or other sort of userspace devices, cmd completion is driven by
> > > > userspace, not sure if one such 'irq' is needed.
> > >
> > > I'm not sure, it can be an optional feature in the future if needed.
> > >
> > > > Even not sure if virtio
> > > > ring is one good choice for such use case, given io_uring has been proved
> > > > as very efficient(should be better than virtio ring, IMO).
> > > >
> > >
> > > Since vduse is aimed at creating a generic userspace device framework,
> > > virtio should be the right way IMO.
> >
> > OK, it is the right way, but may not be the effective one.
> >
>
> Maybe, but I think we can try to optimize it.
>
> > > And with the vdpa framework, the
> > > userspace device can serve both virtual machines and containers.
> >
> > virtio is good for VM, but not sure it is good enough for other
> > cases.
> >
> > >
> > > Regarding the performance issue, actually I can't measure how much of
> > > the performance loss is due to the difference between virtio ring and
> > > iouring. But I think it should be very small. The main costs come from
> > > the two bottlenecks I mentioned before which could be mitigated in the
> > > future.
> >
> > Per my understanding, at least there are two places where virtio ring is
> > less efficient than io_uring:
> >
>
> I might have misunderstood what you mean by virtio ring before. My
> previous understanding of the virtio ring does not include the
> virtio-blk driver.
>
> > 1) io_uring uses standalone submission queue(SQ) and completion queue(CQ),
> > so no contention exists between submission and completion; but virtio queue
> > requires per-vq lock in both submission and completion.
> >
>
> Yes, this is the bottleneck of the virtio-blk driver, even in the VM
> case. We are also trying to optimize this lock.
>
> One way to mitigate it is making submission and completion happen in
> the same core.

QEMU sizes virtio-blk device num-queues to match the vCPU count. The
virtio-blk driver is a blk-mq driver, so submissions and completions
for a given virtqueue should already be processed by the same vCPU.

Unless the device is misconfigured or the guest software chooses a
custom vq:vCPU mapping, there should be no vq lock contention between
vCPUs.
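
As a toy illustration of that mapping argument (not the actual blk-mq
mapping code: the modulo assignment and both function names are
assumptions made for the example), one can check which queues end up
shared between CPUs:

```python
def map_cpus_to_queues(nr_cpus: int, nr_queues: int) -> dict:
    """Assign each CPU to a queue; modulo is a simplification for the demo."""
    return {cpu: cpu % nr_queues for cpu in range(nr_cpus)}


def cpus_sharing_a_queue(mapping: dict) -> dict:
    """Return only the queues that more than one CPU submits to."""
    users = {}
    for cpu, queue in mapping.items():
        users.setdefault(queue, []).append(cpu)
    return {queue: cpus for queue, cpus in users.items() if len(cpus) > 1}
```

With four CPUs and four queues no queue is shared, so there is nothing
for a per-vq lock to contend on across CPUs. With a single queue, all
four CPUs map to queue 0.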

I can think of a reason why submission and completion require
coordination: descriptors are occupied until completion. The
submission logic chooses free descriptors from the table. The
completion logic returns free descriptors so they can be used in
future submissions.
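
A minimal model of that coupling, assuming a plain free list (the
class and method names are hypothetical, and this is not the vring
implementation):

```python
class DescriptorTable:
    """Toy table whose descriptors stay occupied until completion."""

    def __init__(self, size: int) -> None:
        self.free = list(range(size))  # touched by submit() and complete()
        self.in_flight = {}            # descriptor index -> request

    def submit(self, request: str) -> int:
        if not self.free:
            raise BufferError("no free descriptors; wait for a completion")
        idx = self.free.pop()
        self.in_flight[idx] = request
        return idx

    def complete(self, idx: int) -> str:
        request = self.in_flight.pop(idx)
        self.free.append(idx)  # only now can a later submit() reuse idx
        return request
```

Both paths modify the same free list, which is why they need to be
serialized in some way when they can run concurrently.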

Other ring designs expose the submission ring head AND tail index so
that it's clear which submissions have been processed by the other
side. Once processed, the descriptors are no longer occupied and can
be reused for future submissions immediately. This means that
submission and completion do not share state.
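
For comparison, here is a sketch of a submission ring with exposed
head and tail indices. The names are hypothetical and it models no
particular device; the point is only that a slot becomes reusable as
soon as the other side has consumed it, before the request completes.

```python
class SubmissionRing:
    """Toy ring: producer advances tail, consumer advances head."""

    def __init__(self, size: int) -> None:
        self.slots = [None] * size
        self.head = 0  # written only by the consumer
        self.tail = 0  # written only by the producer

    def push(self, request: str) -> None:
        if self.tail - self.head == len(self.slots):
            raise BufferError("ring full; consumer has not caught up")
        self.slots[self.tail % len(self.slots)] = request
        self.tail += 1

    def consume(self):
        """Fetch the next request; the slot is free right afterwards."""
        if self.head == self.tail:
            return None
        request = self.slots[self.head % len(self.slots)]
        self.head += 1
        return request
```

No completion handling appears here at all: completions would travel
on a separate ring with its own pair of indices.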

This is for the split virtqueue layout. For the packed layout I think
there is a similar dependency because descriptors are used for both
submission and completion.

I have CCed Michael Tsirkin in case he has any thoughts on the
independence of submission and completion in the vring design.

BTW I have written about the differences between the VIRTIO, NVMe, and
io_uring descriptor ring designs here:
https://blog.vmsplice.net/2022/06/comparing-virtio-nvme-and-iouring-queue.html

Stefan


* Re: ublk-qcow2: ublk-qcow2 is available
  2022-10-18 14:54                       ` Stefan Hajnoczi
@ 2022-10-19  9:09                         ` Ming Lei
  2022-10-24 16:11                           ` Stefan Hajnoczi
  2022-10-21  5:33                         ` Yongji Xie
  1 sibling, 1 reply; 44+ messages in thread
From: Ming Lei @ 2022-10-19  9:09 UTC
  To: Stefan Hajnoczi
  Cc: Yongji Xie, Michael S. Tsirkin, Ziyang Zhang, Stefan Hajnoczi,
	io-uring, linux-block, linux-kernel, Denis V. Lunev,
	Xiaoguang Wang

On Tue, Oct 18, 2022 at 10:54:45AM -0400, Stefan Hajnoczi wrote:
> On Tue, 18 Oct 2022 at 09:17, Yongji Xie <[email protected]> wrote:
> >
> > On Tue, Oct 18, 2022 at 2:59 PM Ming Lei <[email protected]> wrote:
> > >
> > > On Mon, Oct 17, 2022 at 07:11:59PM +0800, Yongji Xie wrote:
> > > > On Fri, Oct 14, 2022 at 8:57 PM Ming Lei <[email protected]> wrote:
> > > > >
> > > > > On Thu, Oct 13, 2022 at 02:48:04PM +0800, Yongji Xie wrote:
> > > > > > On Wed, Oct 12, 2022 at 10:22 PM Stefan Hajnoczi <[email protected]> wrote:
> > > > > > >
> > > > > > > On Sat, 8 Oct 2022 at 04:43, Ziyang Zhang <[email protected]> wrote:
> > > > > > > >
> > > > > > > > On 2022/10/5 12:18, Ming Lei wrote:
> > > > > > > > > On Tue, Oct 04, 2022 at 09:53:32AM -0400, Stefan Hajnoczi wrote:
> > > > > > > > >> On Tue, 4 Oct 2022 at 05:44, Ming Lei <[email protected]> wrote:
> > > > > > > > >>>
> > > > > > > > >>> On Mon, Oct 03, 2022 at 03:53:41PM -0400, Stefan Hajnoczi wrote:
> > > > > > > > >>>> On Fri, Sep 30, 2022 at 05:24:11PM +0800, Ming Lei wrote:
> > > > > > > > >>>>> ublk-qcow2 is available now.
> > > > > > > > >>>>
> > > > > > > > >>>> Cool, thanks for sharing!
> > > > > > > > >>>>
> > > > > > > > >>>>>
> > > > > > > > >>>>> So far it provides basic read/write function, and compression and snapshot
> > > > > > > > >>>>> aren't supported yet. The target/backend implementation is completely
> > > > > > > > >>>>> based on io_uring, and share the same io_uring with ublk IO command
> > > > > > > > >>>>> handler, just like what ublk-loop does.
> > > > > > > > >>>>>
> > > > > > > > >>>>> Follows the main motivations of ublk-qcow2:
> > > > > > > > >>>>>
> > > > > > > > >>>>> - building one complicated target from scratch helps libublksrv APIs/functions
> > > > > > > > >>>>>   become mature/stable more quickly, since qcow2 is complicated and needs more
> > > > > > > > >>>>>   requirement from libublksrv compared with other simple ones(loop, null)
> > > > > > > > >>>>>
> > > > > > > > >>>>> - there are several attempts of implementing qcow2 driver in kernel, such as
> > > > > > > > >>>>>   ``qloop`` [2], ``dm-qcow2`` [3] and ``in kernel qcow2(ro)`` [4], so ublk-qcow2
> > > > > > > > >>>>>   might useful be for covering requirement in this field
> > > > > > > > >>>>>
> > > > > > > > >>>>> - performance comparison with qemu-nbd, and it was my 1st thought to evaluate
> > > > > > > > >>>>>   performance of ublk/io_uring backend by writing one ublk-qcow2 since ublksrv
> > > > > > > > >>>>>   is started
> > > > > > > > >>>>>
> > > > > > > > >>>>> - help to abstract common building block or design pattern for writing new ublk
> > > > > > > > >>>>>   target/backend
> > > > > > > > >>>>>
> > > > > > > > >>>>> So far it basically passes xfstest(XFS) test by using ublk-qcow2 block
> > > > > > > > >>>>> device as TEST_DEV, and kernel building workload is verified too. Also
> > > > > > > > >>>>> soft update approach is applied in meta flushing, and meta data
> > > > > > > > >>>>> integrity is guaranteed, 'make test T=qcow2/040' covers this kind of
> > > > > > > > >>>>> test, and only cluster leak is reported during this test.
> > > > > > > > >>>>>
> > > > > > > > >>>>> The performance data looks much better compared with qemu-nbd, see
> > > > > > > > >>>>> details in commit log[1], README[5] and STATUS[6]. And the test covers both
> > > > > > > > >>>>> empty image and pre-allocated image, for example of pre-allocated qcow2
> > > > > > > > >>>>> image(8GB):
> > > > > > > > >>>>>
> > > > > > > > >>>>> - qemu-nbd (make test T=qcow2/002)
> > > > > > > > >>>>
> > > > > > > > >>>> Single queue?
> > > > > > > > >>>
> > > > > > > > >>> Yeah.
> > > > > > > > >>>
> > > > > > > > >>>>
> > > > > > > > >>>>>     randwrite(4k): jobs 1, iops 24605
> > > > > > > > >>>>>     randread(4k): jobs 1, iops 30938
> > > > > > > > >>>>>     randrw(4k): jobs 1, iops read 13981 write 14001
> > > > > > > > >>>>>     rw(512k): jobs 1, iops read 724 write 728
> > > > > > > > >>>>
> > > > > > > > >>>> Please try qemu-storage-daemon's VDUSE export type as well. The
> > > > > > > > >>>> command-line should be similar to this:
> > > > > > > > >>>>
> > > > > > > > >>>>   # modprobe virtio_vdpa # attaches vDPA devices to host kernel
> > > > > > > > >>>
> > > > > > > > >>> Not found virtio_vdpa module even though I enabled all the following
> > > > > > > > >>> options:
> > > > > > > > >>>
> > > > > > > > >>>         --- vDPA drivers
> > > > > > > > >>>           <M>   vDPA device simulator core
> > > > > > > > >>>           <M>     vDPA simulator for networking device
> > > > > > > > >>>           <M>     vDPA simulator for block device
> > > > > > > > >>>           <M>   VDUSE (vDPA Device in Userspace) support
> > > > > > > > >>>           <M>   Intel IFC VF vDPA driver
> > > > > > > > >>>           <M>   Virtio PCI bridge vDPA driver
> > > > > > > > >>>           <M>   vDPA driver for Alibaba ENI
> > > > > > > > >>>
> > > > > > > > >>> BTW, my test environment is VM and the shared data is done in VM too, and
> > > > > > > > >>> can virtio_vdpa be used inside VM?
> > > > > > > > >>
> > > > > > > > >> I hope Xie Yongji can help explain how to benchmark VDUSE.
> > > > > > > > >>
> > > > > > > > >> virtio_vdpa is available inside guests too. Please check that
> > > > > > > > >> VIRTIO_VDPA ("vDPA driver for virtio devices") is enabled in "Virtio
> > > > > > > > >> drivers" menu.
> > > > > > > > >>
> > > > > > > > >>>
> > > > > > > > >>>>   # modprobe vduse
> > > > > > > > >>>>   # qemu-storage-daemon \
> > > > > > > > > >>>>       --blockdev file,filename=test.qcow2,cache.direct=on|off,aio=native,node-name=file \
> > > > > > > > >>>>       --blockdev qcow2,file=file,node-name=qcow2 \
> > > > > > > > >>>>       --object iothread,id=iothread0 \
> > > > > > > > >>>>       --export vduse-blk,id=vduse0,name=vduse0,num-queues=$(nproc),node-name=qcow2,writable=on,iothread=iothread0
> > > > > > > > >>>>   # vdpa dev add name vduse0 mgmtdev vduse
> > > > > > > > >>>>
> > > > > > > > >>>> A virtio-blk device should appear and xfstests can be run on it
> > > > > > > > >>>> (typically /dev/vda unless you already have other virtio-blk devices).
> > > > > > > > >>>>
> > > > > > > > >>>> Afterwards you can destroy the device using:
> > > > > > > > >>>>
> > > > > > > > >>>>   # vdpa dev del vduse0
> > > > > > > > >>>>
> > > > > > > > >>>>>
> > > > > > > > >>>>> - ublk-qcow2 (make test T=qcow2/022)
> > > > > > > > >>>>
> > > > > > > > >>>> There are a lot of other factors not directly related to NBD vs ublk. In
> > > > > > > > >>>> order to get an apples-to-apples comparison with qemu-* a ublk export
> > > > > > > > >>>> type is needed in qemu-storage-daemon. That way only the difference is
> > > > > > > > >>>> the ublk interface and the rest of the code path is identical, making it
> > > > > > > > >>>> possible to compare NBD, VDUSE, ublk, etc more precisely.
> > > > > > > > >>>
> > > > > > > > >>> Maybe not true.
> > > > > > > > >>>
> > > > > > > > >>> ublk-qcow2 uses io_uring to handle all backend IO(include meta IO) completely,
> > > > > > > > >>> and so far single io_uring/pthread is for handling all qcow2 IOs and IO
> > > > > > > > >>> command.
> > > > > > > > >>
> > > > > > > > >> qemu-nbd doesn't use io_uring to handle the backend IO, so we don't
> > > > > > > > >
> > > > > > > > > I tried to use it via --aio=io_uring for setting up qemu-nbd, but not succeed.
> > > > > > > > >
> > > > > > > > >> know whether the benchmark demonstrates that ublk is faster than NBD,
> > > > > > > > >> that the ublk-qcow2 implementation is faster than qemu-nbd's qcow2,
> > > > > > > > >> whether there are miscellaneous implementation differences between
> > > > > > > > >> ublk-qcow2 and qemu-nbd (like using the same io_uring context for both
> > > > > > > > >> ublk and backend IO), or something else.
> > > > > > > > >
> > > > > > > > > The theory shouldn't be too complicated:
> > > > > > > > >
> > > > > > > > > > 1) io_uring passthrough (pt) communication is faster than socket, and io command
> > > > > > > > > > is carried over io_uring pt commands, and should be faster than virtio
> > > > > > > > > > communication too.
> > > > > > > > > >
> > > > > > > > > > 2) io_uring io handling is faster than libaio, which is used in the
> > > > > > > > > > test on qemu-nbd, and all qcow2 backend io (including meta io) is handled
> > > > > > > > > > by io_uring.
> > > > > > > > >
> > > > > > > > > https://github.com/ming1/ubdsrv/blob/master/tests/common/qcow2_common
> > > > > > > > >
> > > > > > > > > 3) ublk uses one single io_uring to handle all io commands and qcow2
> > > > > > > > > backend IOs, so batching handling is common, and it is easy to see
> > > > > > > > > dozens of IOs/io commands handled in single syscall, or even more.
> > > > > > > > >
> > > > > > > > >>
> > > > > > > > >> I'm suggesting measuring changes to just 1 variable at a time.
> > > > > > > > >> Otherwise it's hard to reach a conclusion about the root cause of the
> > > > > > > > >> performance difference. Let's learn why ublk-qcow2 performs well.
> > > > > > > > >
> > > > > > > > > Turns out the latest Fedora 37-beta doesn't support vdpa yet, so I built
> > > > > > > > > qemu from the latest github tree, and finally it starts to work. And test kernel
> > > > > > > > > is v6.0 release.
> > > > > > > > >
> > > > > > > > > Follows the test result, and all three devices are setup as single
> > > > > > > > > queue, and all tests are run in single job, still done in one VM, and
> > > > > > > > > > the test images are stored on XFS/virtio-scsi backed SSD.
> > > > > > > > >
> > > > > > > > > The 1st group tests all three block device which is backed by empty
> > > > > > > > > qcow2 image.
> > > > > > > > >
> > > > > > > > > The 2nd group tests all the three block devices backed by pre-allocated
> > > > > > > > > qcow2 image.
> > > > > > > > >
> > > > > > > > > Except for big sequential IO(512K), there is still not small gap between
> > > > > > > > > vdpa-virtio-blk and ublk.
> > > > > > > > >
> > > > > > > > > 1. run fio on block device over empty qcow2 image
> > > > > > > > > 1) qemu-nbd
> > > > > > > > > running qcow2/001
> > > > > > > > > run perf test on empty qcow2 image via nbd
> > > > > > > > >       fio (nbd(/mnt/data/ublk_null_8G_nYbgF.qcow2), libaio, bs 4k, dio, hw queues:1)...
> > > > > > > > >       randwrite: jobs 1, iops 8549
> > > > > > > > >       randread: jobs 1, iops 34829
> > > > > > > > >       randrw: jobs 1, iops read 11363 write 11333
> > > > > > > > >       rw(512k): jobs 1, iops read 590 write 597
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > 2) ublk-qcow2
> > > > > > > > > running qcow2/021
> > > > > > > > > run perf test on empty qcow2 image via ublk
> > > > > > > > >       fio (ublk/qcow2( -f /mnt/data/ublk_null_8G_s761j.qcow2), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0).
> > > > > > > > >       randwrite: jobs 1, iops 16086
> > > > > > > > >       randread: jobs 1, iops 172720
> > > > > > > > >       randrw: jobs 1, iops read 35760 write 35702
> > > > > > > > >       rw(512k): jobs 1, iops read 1140 write 1149
> > > > > > > > >
> > > > > > > > > 3) vdpa-virtio-blk
> > > > > > > > > running debug/test_dev
> > > > > > > > > run io test on specified device
> > > > > > > > >       fio (vdpa(/dev/vdc), libaio, bs 4k, dio, hw queues:1)...
> > > > > > > > >       randwrite: jobs 1, iops 8626
> > > > > > > > >       randread: jobs 1, iops 126118
> > > > > > > > >       randrw: jobs 1, iops read 17698 write 17665
> > > > > > > > >       rw(512k): jobs 1, iops read 1023 write 1031
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > 2. run fio on block device over pre-allocated qcow2 image
> > > > > > > > > 1) qemu-nbd
> > > > > > > > > running qcow2/002
> > > > > > > > > run perf test on pre-allocated qcow2 image via nbd
> > > > > > > > >       fio (nbd(/mnt/data/ublk_data_8G_sc0SB.qcow2), libaio, bs 4k, dio, hw queues:1)...
> > > > > > > > >       randwrite: jobs 1, iops 21439
> > > > > > > > >       randread: jobs 1, iops 30336
> > > > > > > > >       randrw: jobs 1, iops read 11476 write 11449
> > > > > > > > >       rw(512k): jobs 1, iops read 718 write 722
> > > > > > > > >
> > > > > > > > > 2) ublk-qcow2
> > > > > > > > > running qcow2/022
> > > > > > > > > run perf test on pre-allocated qcow2 image via ublk
> > > > > > > > >       fio (ublk/qcow2( -f /mnt/data/ublk_data_8G_yZiaJ.qcow2), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0).
> > > > > > > > >       randwrite: jobs 1, iops 98757
> > > > > > > > >       randread: jobs 1, iops 110246
> > > > > > > > >       randrw: jobs 1, iops read 47229 write 47161
> > > > > > > > >       rw(512k): jobs 1, iops read 1416 write 1427
> > > > > > > > >
> > > > > > > > > 3) vdpa-virtio-blk
> > > > > > > > > running debug/test_dev
> > > > > > > > > run io test on specified device
> > > > > > > > >       fio (vdpa(/dev/vdc), libaio, bs 4k, dio, hw queues:1)...
> > > > > > > > >       randwrite: jobs 1, iops 47317
> > > > > > > > >       randread: jobs 1, iops 74092
> > > > > > > > >       randrw: jobs 1, iops read 27196 write 27234
> > > > > > > > >       rw(512k): jobs 1, iops read 1447 write 1458
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > > > Hi All,
> > > > > > > >
> > > > > > > > We are interested in VDUSE vs UBLK, too. And I have tested them with nullblk backend.
> > > > > > > > Let me share some results here.
> > > > > > > >
> > > > > > > > I setup UBLK with:
> > > > > > > >   ublk add -t loop -f /dev/nullb0 -d QUEUE_DEPTH -q NR_QUEUE
> > > > > > > >
> > > > > > > > I setup VDUSE with:
> > > > > > > >   qemu-storage-daemon \
> > > > > > > >        --chardev socket,id=charmonitor,path=/tmp/qmp.sock,server=on,wait=off \
> > > > > > > >        --monitor chardev=charmonitor \
> > > > > > > >        --blockdev driver=host_device,cache.direct=on,filename=/dev/nullb0,node-name=disk0 \
> > > > > > > >        --export vduse-blk,id=test,node-name=disk0,name=vduse_test,writable=on,num-queues=NR_QUEUE,queue-size=QUEUE_DEPTH
> > > > > > > >
> > > > > > > > Here QUEUE_DEPTH is 1, 32 or 128 and NR_QUEUE is 1 or 4.
> > > > > > > >
> > > > > > > > Note:
> > > > > > > > (1) VDUSE requires QUEUE_DEPTH >= 2. I cannot setup QUEUE_DEPTH to 1.
> > > > > > > > (2) I use qemu 7.1.0-rc3. It supports vduse-blk.
> > > > > > > > (3) I do not use ublk null target so that the test is fair.
> > > > > > > > (4) I setup fio with direct=1, bs=4k.
> > > > > > > >
> > > > > > > > ------------------------------
> > > > > > > > > 1 job 1 iodepth, lat (usec)
> > > > > > > >                 vduse   ublk
> > > > > > > > seq-read        22.55   11.15
> > > > > > > > rand-read       22.49   11.17
> > > > > > > > seq-write       25.67   10.25
> > > > > > > > rand-write      24.13   10.16
> > > > > > >
> > > > > > > Thanks for sharing. Any idea what the bottlenecks are for vduse and ublk?
> > > > > > >
> > > > > >
> > > > > > I think one reason for the latency gap of sync I/O is that vduse uses
> > > > > > workqueue in the I/O completion path but ublk doesn't.
> > > > > >
> > > > > > And one bottleneck for the async I/O in vduse is that vduse will do
> > > > > > memcpy inside the critical section of virtqueue's spinlock in the
> > > > > > virtio-blk driver. That will hurt the performance heavily when
> > > > > > virtio_queue_rq() and virtblk_done() run concurrently. And it can be
> > > > > > mitigated by the advance DMA mapping feature [1] or irq binding
> > > > > > support [2].
> > > > >
> > > > > Hi Yongji,
> > > > >
> > > > > Yeah, that is the cost you paid for virtio. Wrt. userspace block device
> > > > > or other sort of userspace devices, cmd completion is driven by
> > > > > userspace, not sure if one such 'irq' is needed.
> > > >
> > > > I'm not sure, it can be an optional feature in the future if needed.
> > > >
> > > > > Even not sure if virtio
> > > > > ring is one good choice for such use case, given io_uring has been proved
> > > > > as very efficient(should be better than virtio ring, IMO).
> > > > >
> > > >
> > > > Since vduse is aimed at creating a generic userspace device framework,
> > > > virtio should be the right way IMO.
> > >
> > > OK, it is the right way, but may not be the effective one.
> > >
> >
> > Maybe, but I think we can try to optimize it.
> >
> > > > And with the vdpa framework, the
> > > > userspace device can serve both virtual machines and containers.
> > >
> > > virtio is good for VM, but not sure it is good enough for other
> > > cases.
> > >
> > > >
> > > > Regarding the performance issue, actually I can't measure how much of
> > > > the performance loss is due to the difference between virtio ring and
> > > > iouring. But I think it should be very small. The main costs come from
> > > > the two bottlenecks I mentioned before which could be mitigated in the
> > > > future.
> > >
> > > Per my understanding, at least there are two places where virtio ring is
> > > less efficient than io_uring:
> > >
> >
> > I might have misunderstood what you mean by virtio ring before. My
> > previous understanding of the virtio ring does not include the
> > virtio-blk driver.
> >
> > > 1) io_uring uses standalone submission queue(SQ) and completion queue(CQ),
> > > so no contention exists between submission and completion; but virtio queue
> > > requires per-vq lock in both submission and completion.
> > >
> >
> > Yes, this is the bottleneck of the virtio-blk driver, even in the VM
> > case. We are also trying to optimize this lock.
> >
> > One way to mitigate it is making submission and completion happen in
> > the same core.
> 
> QEMU sizes virtio-blk device num-queues to match the vCPU count. The

num-queues is configurable on the qemu-storage-daemon command line, and
a single queue is the common case; more queues often mean more
resources.

> virtio-blk driver is a blk-mq driver, so submissions and completions
> for a given virtqueue should already be processed by the same vCPU.
> 
> Unless the device is misconfigured or the guest software chooses a
> custom vq:vCPU mapping, there should be no vq lock contention between
> vCPUs.

A single queue, or nr_queue less than nr_cpus, can't be considered
misconfigured. In that case every vCPU can submit requests, but only one
or a few vCPUs complete all of them.
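
As a rough illustration of that case, the sketch below spreads CPUs over
queues round-robin and lists the queues that end up shared. This is only an
assumption for illustration: the real blk-mq mapping is more involved, and
the function names here are hypothetical.

```python
def map_cpus_to_queues(nr_cpus, nr_queues):
    """Return {queue: [cpus]} using a round-robin mapping (a simplification,
    not the actual blk-mq CPU-to-queue mapping)."""
    mapping = {q: [] for q in range(nr_queues)}
    for cpu in range(nr_cpus):
        mapping[cpu % nr_queues].append(cpu)
    return mapping


def shared_queues(mapping):
    """Queues used by more than one CPU, where submission on one CPU can
    run concurrently with completion on another and contend on the vq lock."""
    return [q for q, cpus in mapping.items() if len(cpus) > 1]


if __name__ == "__main__":
    for nr_queues in (1, 2, 4):
        mapping = map_cpus_to_queues(4, nr_queues)
        print(nr_queues, mapping, shared_queues(mapping))
```

With 4 CPUs, only the nr_queues == 4 case leaves no queue shared in this
model.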

> 
> I can think of a reason why submission and completion require
> coordination: descriptors are occupied until completion. The
> submission logic chooses free descriptors from the table. The
> completion logic returns free descriptors so they can be used in
> future submissions.

Shared descriptors are fundamental to the virtio ring design, and that
looks like the reason why the vq spinlock is needed on both the
submission and completion sides.

> 
> Other ring designs expose the submission ring head AND tail index so
> that it's clear which submissions have been processed by the other
> side. Once processed, the descriptors are no longer occupied and can
> be reused for future submissions immediately. This means that
> submission and completion do not share state.
> 
> This is for the split virtqueue layout. For the packed layout I think
> there is a similar dependency because descriptors are used for both
> submission and completion.
> 
> I have CCed Michael Tsirkin in case he has any thoughts on the
> independence of submission and completion in the vring design.
> 
> BTW I have written about difference in the VIRTIO, NVMe, and io_uring
> descriptor ring designs here:
> https://blog.vmsplice.net/2022/06/comparing-virtio-nvme-and-iouring-queue.html

Besides the ring design, notification could be another difference.


Thanks,
Ming


* Re: ublk-qcow2: ublk-qcow2 is available
  2022-10-19  9:09                         ` Ming Lei
@ 2022-10-24 16:11                           ` Stefan Hajnoczi
  0 siblings, 0 replies; 44+ messages in thread
From: Stefan Hajnoczi @ 2022-10-24 16:11 UTC
  To: Ming Lei
  Cc: Yongji Xie, Michael S. Tsirkin, Ziyang Zhang, Stefan Hajnoczi,
	io-uring, linux-block, linux-kernel, Denis V. Lunev,
	Xiaoguang Wang

On Wed, 19 Oct 2022 at 05:09, Ming Lei <[email protected]> wrote:
>
> On Tue, Oct 18, 2022 at 10:54:45AM -0400, Stefan Hajnoczi wrote:
> > On Tue, 18 Oct 2022 at 09:17, Yongji Xie <[email protected]> wrote:
> > >
> > > On Tue, Oct 18, 2022 at 2:59 PM Ming Lei <[email protected]> wrote:
> > > >
> > > > On Mon, Oct 17, 2022 at 07:11:59PM +0800, Yongji Xie wrote:
> > > > > On Fri, Oct 14, 2022 at 8:57 PM Ming Lei <[email protected]> wrote:
> > > > > >
> > > > > > On Thu, Oct 13, 2022 at 02:48:04PM +0800, Yongji Xie wrote:
> > > > > > > On Wed, Oct 12, 2022 at 10:22 PM Stefan Hajnoczi <[email protected]> wrote:
> > > > > > > >
> > > > > > > > On Sat, 8 Oct 2022 at 04:43, Ziyang Zhang <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > On 2022/10/5 12:18, Ming Lei wrote:
> > > > > > > > > > On Tue, Oct 04, 2022 at 09:53:32AM -0400, Stefan Hajnoczi wrote:
> > > > > > > > > >> On Tue, 4 Oct 2022 at 05:44, Ming Lei <[email protected]> wrote:
> > > > > > > > > >>>
> > > > > > > > > >>> On Mon, Oct 03, 2022 at 03:53:41PM -0400, Stefan Hajnoczi wrote:
> > > > > > > > > >>>> On Fri, Sep 30, 2022 at 05:24:11PM +0800, Ming Lei wrote:
> > > > > > > > > >>>>> ublk-qcow2 is available now.
> > > > > > > > > >>>>
> > > > > > > > > >>>> Cool, thanks for sharing!
> > > > > > > > > >>>>
> > > > > > > > > >>>>>
> > > > > > > > > >>>>> So far it provides basic read/write function, and compression and snapshot
> > > > > > > > > >>>>> aren't supported yet. The target/backend implementation is completely
> > > > > > > > > >>>>> based on io_uring, and share the same io_uring with ublk IO command
> > > > > > > > > >>>>> handler, just like what ublk-loop does.
> > > > > > > > > >>>>>
> > > > > > > > > >>>>> Follows the main motivations of ublk-qcow2:
> > > > > > > > > >>>>>
> > > > > > > > > >>>>> - building one complicated target from scratch helps libublksrv APIs/functions
> > > > > > > > > >>>>>   become mature/stable more quickly, since qcow2 is complicated and needs more
> > > > > > > > > >>>>>   requirement from libublksrv compared with other simple ones(loop, null)
> > > > > > > > > >>>>>
> > > > > > > > > >>>>> - there are several attempts of implementing qcow2 driver in kernel, such as
> > > > > > > > > >>>>>   ``qloop`` [2], ``dm-qcow2`` [3] and ``in kernel qcow2(ro)`` [4], so ublk-qcow2
> > > > > > > > > >>>>>   might useful be for covering requirement in this field
> > > > > > > > > >>>>>
> > > > > > > > > >>>>> - performance comparison with qemu-nbd, and it was my 1st thought to evaluate
> > > > > > > > > >>>>>   performance of ublk/io_uring backend by writing one ublk-qcow2 since ublksrv
> > > > > > > > > >>>>>   is started
> > > > > > > > > >>>>>
> > > > > > > > > >>>>> - help to abstract common building block or design pattern for writing new ublk
> > > > > > > > > >>>>>   target/backend
> > > > > > > > > >>>>>
> > > > > > > > > >>>>> So far it basically passes xfstest(XFS) test by using ublk-qcow2 block
> > > > > > > > > >>>>> device as TEST_DEV, and kernel building workload is verified too. Also
> > > > > > > > > >>>>> soft update approach is applied in meta flushing, and meta data
> > > > > > > > > >>>>> integrity is guaranteed, 'make test T=qcow2/040' covers this kind of
> > > > > > > > > >>>>> test, and only cluster leak is reported during this test.
> > > > > > > > > >>>>>
> > > > > > > > > >>>>> The performance data looks much better compared with qemu-nbd, see
> > > > > > > > > >>>>> details in commit log[1], README[5] and STATUS[6]. And the test covers both
> > > > > > > > > >>>>> empty image and pre-allocated image, for example of pre-allocated qcow2
> > > > > > > > > >>>>> image(8GB):
> > > > > > > > > >>>>>
> > > > > > > > > >>>>> - qemu-nbd (make test T=qcow2/002)
> > > > > > > > > >>>>
> > > > > > > > > >>>> Single queue?
> > > > > > > > > >>>
> > > > > > > > > >>> Yeah.
> > > > > > > > > >>>
> > > > > > > > > >>>>
> > > > > > > > > >>>>>     randwrite(4k): jobs 1, iops 24605
> > > > > > > > > >>>>>     randread(4k): jobs 1, iops 30938
> > > > > > > > > >>>>>     randrw(4k): jobs 1, iops read 13981 write 14001
> > > > > > > > > >>>>>     rw(512k): jobs 1, iops read 724 write 728
> > > > > > > > > >>>>
> > > > > > > > > >>>> Please try qemu-storage-daemon's VDUSE export type as well. The
> > > > > > > > > >>>> command-line should be similar to this:
> > > > > > > > > >>>>
> > > > > > > > > >>>>   # modprobe virtio_vdpa # attaches vDPA devices to host kernel
> > > > > > > > > >>>
> > > > > > > > > >>> Not found virtio_vdpa module even though I enabled all the following
> > > > > > > > > >>> options:
> > > > > > > > > >>>
> > > > > > > > > >>>         --- vDPA drivers
> > > > > > > > > >>>           <M>   vDPA device simulator core
> > > > > > > > > >>>           <M>     vDPA simulator for networking device
> > > > > > > > > >>>           <M>     vDPA simulator for block device
> > > > > > > > > >>>           <M>   VDUSE (vDPA Device in Userspace) support
> > > > > > > > > >>>           <M>   Intel IFC VF vDPA driver
> > > > > > > > > >>>           <M>   Virtio PCI bridge vDPA driver
> > > > > > > > > >>>           <M>   vDPA driver for Alibaba ENI
> > > > > > > > > >>>
> > > > > > > > > >>> BTW, my test environment is VM and the shared data is done in VM too, and
> > > > > > > > > >>> can virtio_vdpa be used inside VM?
> > > > > > > > > >>
> > > > > > > > > >> I hope Xie Yongji can help explain how to benchmark VDUSE.
> > > > > > > > > >>
> > > > > > > > > >> virtio_vdpa is available inside guests too. Please check that
> > > > > > > > > >> VIRTIO_VDPA ("vDPA driver for virtio devices") is enabled in "Virtio
> > > > > > > > > >> drivers" menu.
> > > > > > > > > >>
> > > > > > > > > >>>
> > > > > > > > > >>>>   # modprobe vduse
> > > > > > > > > >>>>   # qemu-storage-daemon \
> > > > > > > > > > >>>>       --blockdev file,filename=test.qcow2,cache.direct=on|off,aio=native,node-name=file \
> > > > > > > > > >>>>       --blockdev qcow2,file=file,node-name=qcow2 \
> > > > > > > > > >>>>       --object iothread,id=iothread0 \
> > > > > > > > > >>>>       --export vduse-blk,id=vduse0,name=vduse0,num-queues=$(nproc),node-name=qcow2,writable=on,iothread=iothread0
> > > > > > > > > >>>>   # vdpa dev add name vduse0 mgmtdev vduse
> > > > > > > > > >>>>
> > > > > > > > > >>>> A virtio-blk device should appear and xfstests can be run on it
> > > > > > > > > >>>> (typically /dev/vda unless you already have other virtio-blk devices).
> > > > > > > > > >>>>
> > > > > > > > > >>>> Afterwards you can destroy the device using:
> > > > > > > > > >>>>
> > > > > > > > > >>>>   # vdpa dev del vduse0
> > > > > > > > > >>>>
> > > > > > > > > >>>>>
> > > > > > > > > >>>>> - ublk-qcow2 (make test T=qcow2/022)
> > > > > > > > > >>>>
> > > > > > > > > >>>> There are a lot of other factors not directly related to NBD vs ublk. In
> > > > > > > > > >>>> order to get an apples-to-apples comparison with qemu-* a ublk export
> > > > > > > > > >>>> type is needed in qemu-storage-daemon. That way only the difference is
> > > > > > > > > >>>> the ublk interface and the rest of the code path is identical, making it
> > > > > > > > > >>>> possible to compare NBD, VDUSE, ublk, etc more precisely.
> > > > > > > > > >>>
> > > > > > > > > >>> Maybe not true.
> > > > > > > > > >>>
> > > > > > > > > >>> ublk-qcow2 uses io_uring to handle all backend IO(include meta IO) completely,
> > > > > > > > > >>> and so far single io_uring/pthread is for handling all qcow2 IOs and IO
> > > > > > > > > >>> command.
> > > > > > > > > >>
> > > > > > > > > >> qemu-nbd doesn't use io_uring to handle the backend IO, so we don't
> > > > > > > > > >
> > > > > > > > > > I tried to use it via --aio=io_uring for setting up qemu-nbd, but not succeed.
> > > > > > > > > >
> > > > > > > > > >> know whether the benchmark demonstrates that ublk is faster than NBD,
> > > > > > > > > >> that the ublk-qcow2 implementation is faster than qemu-nbd's qcow2,
> > > > > > > > > >> whether there are miscellaneous implementation differences between
> > > > > > > > > >> ublk-qcow2 and qemu-nbd (like using the same io_uring context for both
> > > > > > > > > >> ublk and backend IO), or something else.
> > > > > > > > > >
> > > > > > > > > > The theory shouldn't be too complicated:
> > > > > > > > > >
> > > > > > > > > > > 1) io_uring passthrough (pt) communication is faster than socket, and io command
> > > > > > > > > > > is carried over io_uring pt commands, and should be faster than virtio
> > > > > > > > > > > communication too.
> > > > > > > > > > >
> > > > > > > > > > > 2) io_uring io handling is faster than libaio, which is used in the
> > > > > > > > > > > test on qemu-nbd, and all qcow2 backend io (including meta io) is handled
> > > > > > > > > > > by io_uring.
> > > > > > > > > >
> > > > > > > > > > https://github.com/ming1/ubdsrv/blob/master/tests/common/qcow2_common
> > > > > > > > > >
> > > > > > > > > > 3) ublk uses one single io_uring to handle all io commands and qcow2
> > > > > > > > > > backend IOs, so batching handling is common, and it is easy to see
> > > > > > > > > > dozens of IOs/io commands handled in single syscall, or even more.
> > > > > > > > > >
> > > > > > > > > >>
> > > > > > > > > >> I'm suggesting measuring changes to just 1 variable at a time.
> > > > > > > > > >> Otherwise it's hard to reach a conclusion about the root cause of the
> > > > > > > > > >> performance difference. Let's learn why ublk-qcow2 performs well.
> > > > > > > > > >
> > > > > > > > > > Turns out the latest Fedora 37-beta doesn't support vdpa yet, so I built
> > > > > > > > > > qemu from the latest github tree, and finally it starts to work. And test kernel
> > > > > > > > > > is v6.0 release.
> > > > > > > > > >
> > > > > > > > > > Follows the test result, and all three devices are setup as single
> > > > > > > > > > queue, and all tests are run in single job, still done in one VM, and
> > > > > > > > > > > the test images are stored on XFS/virtio-scsi backed SSD.
> > > > > > > > > >
> > > > > > > > > > The 1st group tests all three block device which is backed by empty
> > > > > > > > > > qcow2 image.
> > > > > > > > > >
> > > > > > > > > > The 2nd group tests all the three block devices backed by pre-allocated
> > > > > > > > > > qcow2 image.
> > > > > > > > > >
> > > > > > > > > > Except for big sequential IO(512K), there is still not small gap between
> > > > > > > > > > vdpa-virtio-blk and ublk.
> > > > > > > > > >
> > > > > > > > > > 1. run fio on block device over empty qcow2 image
> > > > > > > > > > 1) qemu-nbd
> > > > > > > > > > running qcow2/001
> > > > > > > > > > run perf test on empty qcow2 image via nbd
> > > > > > > > > >       fio (nbd(/mnt/data/ublk_null_8G_nYbgF.qcow2), libaio, bs 4k, dio, hw queues:1)...
> > > > > > > > > >       randwrite: jobs 1, iops 8549
> > > > > > > > > >       randread: jobs 1, iops 34829
> > > > > > > > > >       randrw: jobs 1, iops read 11363 write 11333
> > > > > > > > > >       rw(512k): jobs 1, iops read 590 write 597
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > 2) ublk-qcow2
> > > > > > > > > > running qcow2/021
> > > > > > > > > > run perf test on empty qcow2 image via ublk
> > > > > > > > > >       fio (ublk/qcow2( -f /mnt/data/ublk_null_8G_s761j.qcow2), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0).
> > > > > > > > > >       randwrite: jobs 1, iops 16086
> > > > > > > > > >       randread: jobs 1, iops 172720
> > > > > > > > > >       randrw: jobs 1, iops read 35760 write 35702
> > > > > > > > > >       rw(512k): jobs 1, iops read 1140 write 1149
> > > > > > > > > >
> > > > > > > > > > 3) vdpa-virtio-blk
> > > > > > > > > > running debug/test_dev
> > > > > > > > > > run io test on specified device
> > > > > > > > > >       fio (vdpa(/dev/vdc), libaio, bs 4k, dio, hw queues:1)...
> > > > > > > > > >       randwrite: jobs 1, iops 8626
> > > > > > > > > >       randread: jobs 1, iops 126118
> > > > > > > > > >       randrw: jobs 1, iops read 17698 write 17665
> > > > > > > > > >       rw(512k): jobs 1, iops read 1023 write 1031
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > 2. run fio on block device over pre-allocated qcow2 image
> > > > > > > > > > 1) qemu-nbd
> > > > > > > > > > running qcow2/002
> > > > > > > > > > run perf test on pre-allocated qcow2 image via nbd
> > > > > > > > > >       fio (nbd(/mnt/data/ublk_data_8G_sc0SB.qcow2), libaio, bs 4k, dio, hw queues:1)...
> > > > > > > > > >       randwrite: jobs 1, iops 21439
> > > > > > > > > >       randread: jobs 1, iops 30336
> > > > > > > > > >       randrw: jobs 1, iops read 11476 write 11449
> > > > > > > > > >       rw(512k): jobs 1, iops read 718 write 722
> > > > > > > > > >
> > > > > > > > > > 2) ublk-qcow2
> > > > > > > > > > running qcow2/022
> > > > > > > > > > run perf test on pre-allocated qcow2 image via ublk
> > > > > > > > > >       fio (ublk/qcow2( -f /mnt/data/ublk_data_8G_yZiaJ.qcow2), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0).
> > > > > > > > > >       randwrite: jobs 1, iops 98757
> > > > > > > > > >       randread: jobs 1, iops 110246
> > > > > > > > > >       randrw: jobs 1, iops read 47229 write 47161
> > > > > > > > > >       rw(512k): jobs 1, iops read 1416 write 1427
> > > > > > > > > >
> > > > > > > > > > 3) vdpa-virtio-blk
> > > > > > > > > > running debug/test_dev
> > > > > > > > > > run io test on specified device
> > > > > > > > > >       fio (vdpa(/dev/vdc), libaio, bs 4k, dio, hw queues:1)...
> > > > > > > > > >       randwrite: jobs 1, iops 47317
> > > > > > > > > >       randread: jobs 1, iops 74092
> > > > > > > > > >       randrw: jobs 1, iops read 27196 write 27234
> > > > > > > > > >       rw(512k): jobs 1, iops read 1447 write 1458
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Hi All,
> > > > > > > > >
> > > > > > > > > We are interested in VDUSE vs UBLK, too. And I have tested them with nullblk backend.
> > > > > > > > > Let me share some results here.
> > > > > > > > >
> > > > > > > > > I setup UBLK with:
> > > > > > > > >   ublk add -t loop -f /dev/nullb0 -d QUEUE_DEPTH -q NR_QUEUE
> > > > > > > > >
> > > > > > > > > I setup VDUSE with:
> > > > > > > > >   qemu-storage-daemon \
> > > > > > > > >        --chardev socket,id=charmonitor,path=/tmp/qmp.sock,server=on,wait=off \
> > > > > > > > >        --monitor chardev=charmonitor \
> > > > > > > > >        --blockdev driver=host_device,cache.direct=on,filename=/dev/nullb0,node-name=disk0 \
> > > > > > > > >        --export vduse-blk,id=test,node-name=disk0,name=vduse_test,writable=on,num-queues=NR_QUEUE,queue-size=QUEUE_DEPTH
> > > > > > > > >
> > > > > > > > > Here QUEUE_DEPTH is 1, 32 or 128 and NR_QUEUE is 1 or 4.
> > > > > > > > >
> > > > > > > > > Note:
> > > > > > > > > (1) VDUSE requires QUEUE_DEPTH >= 2. I cannot setup QUEUE_DEPTH to 1.
> > > > > > > > > (2) I use qemu 7.1.0-rc3. It supports vduse-blk.
> > > > > > > > > (3) I do not use ublk null target so that the test is fair.
> > > > > > > > > (4) I setup fio with direct=1, bs=4k.
> > > > > > > > >
> > > > > > > > > ------------------------------
> > > > > > > > > > 1 job 1 iodepth, lat (usec)
> > > > > > > > >                 vduse   ublk
> > > > > > > > > seq-read        22.55   11.15
> > > > > > > > > rand-read       22.49   11.17
> > > > > > > > > seq-write       25.67   10.25
> > > > > > > > > rand-write      24.13   10.16
> > > > > > > >
> > > > > > > > Thanks for sharing. Any idea what the bottlenecks are for vduse and ublk?
> > > > > > > >
> > > > > > >
> > > > > > > I think one reason for the latency gap of sync I/O is that vduse uses
> > > > > > > workqueue in the I/O completion path but ublk doesn't.
> > > > > > >
> > > > > > > And one bottleneck for the async I/O in vduse is that vduse will do
> > > > > > > memcpy inside the critical section of virtqueue's spinlock in the
> > > > > > > virtio-blk driver. That will hurt the performance heavily when
> > > > > > > virtio_queue_rq() and virtblk_done() run concurrently. And it can be
> > > > > > > mitigated by the advance DMA mapping feature [1] or irq binding
> > > > > > > support [2].
> > > > > >
> > > > > > Hi Yongji,
> > > > > >
> > > > > > Yeah, that is the cost you paid for virtio. Wrt. userspace block device
> > > > > > or other sort of userspace devices, cmd completion is driven by
> > > > > > userspace, not sure if one such 'irq' is needed.
> > > > >
> > > > > I'm not sure, it can be an optional feature in the future if needed.
> > > > >
> > > > > > Even not sure if virtio
> > > > > > ring is one good choice for such use case, given io_uring has been proved
> > > > > > as very efficient(should be better than virtio ring, IMO).
> > > > > >
> > > > >
> > > > > Since vduse is aimed at creating a generic userspace device framework,
> > > > > virtio should be the right way IMO.
> > > >
> > > > OK, it is the right way, but may not be the effective one.
> > > >
> > >
> > > Maybe, but I think we can try to optimize it.
> > >
> > > > > And with the vdpa framework, the
> > > > > userspace device can serve both virtual machines and containers.
> > > >
> > > > virtio is good for VM, but not sure it is good enough for other
> > > > cases.
> > > >
> > > > >
> > > > > Regarding the performance issue, actually I can't measure how much of
> > > > > the performance loss is due to the difference between virtio ring and
> > > > > iouring. But I think it should be very small. The main costs come from
> > > > > the two bottlenecks I mentioned before which could be mitigated in the
> > > > > future.
> > > >
> > > > Per my understanding, at least there are two places where virtio ring is
> > > > less efficient than io_uring:
> > > >
> > >
> > > I might have misunderstood what you mean by virtio ring before. My
> > > previous understanding of the virtio ring does not include the
> > > virtio-blk driver.
> > >
> > > > 1) io_uring uses standalone submission queue(SQ) and completion queue(CQ),
> > > > so no contention exists between submission and completion; but virtio queue
> > > > requires per-vq lock in both submission and completion.
> > > >
> > >
> > > Yes, this is the bottleneck of the virtio-blk driver, even in the VM
> > > case. We are also trying to optimize this lock.
> > >
> > > One way to mitigate it is making submission and completion happen in
> > > the same core.
> >
> > QEMU sizes virtio-blk device num-queues to match the vCPU count. The
>
> num-queues is configurable via qemu-storage-daemon command line, and
> single queue is usually common case, more queues often means more
> resources.

Sorry, I didn't make a distinction between the running VM case and the
qemu-storage-daemon case where there is no VM. I described the VM case
here because "submission and completion happen in the same core" as
suggested there.

For qemu-storage-daemon there is no automatic sizing of num-queues.
It's up to the user to decide that manually. qemu-storage-daemon
currently does not fully take advantage of SMP. All queues are
processed by a single thread in qemu-storage-daemon. In the future it
should be possible to assign queues to specific threads (there is
ongoing "multi-queue QEMU block layer" work to do this).

The resource requirements of virtqueues aren't very large:
1. Minimum 4 KB of memory for a packed vring.
2. 1 eventfd for submission notification.
3. 1 eventfd for completion notification.

Having ~64 queues is not a big resource commitment.
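
As a rough check of the numbers above, the per-queue cost can simply be
multiplied out. The sketch below uses only the figures from this mail (a
4 KB minimum packed vring and two eventfds per queue). The function name
and the dictionary keys are made up for illustration, and real usage will
be higher once the queue size grows beyond the minimum.

```python
# Back-of-the-envelope estimate of virtqueue resource usage.
# Assumption: every queue uses the minimum 4 KB packed vring plus one
# submission eventfd and one completion eventfd, as listed above.
VRING_MIN_BYTES = 4 * 1024
EVENTFDS_PER_QUEUE = 2


def virtqueue_resources(num_queues):
    """Return the minimum memory and eventfd count for num_queues queues."""
    if num_queues < 1:
        raise ValueError("num_queues must be at least 1")
    return {
        "vring_bytes": num_queues * VRING_MIN_BYTES,
        "eventfds": num_queues * EVENTFDS_PER_QUEUE,
    }


if __name__ == "__main__":
    res = virtqueue_resources(64)
    print("vring memory: %d KB" % (res["vring_bytes"] // 1024))
    print("eventfds:     %d" % res["eventfds"])
```

For 64 queues this comes to 256 KB of vring memory and 128 eventfds.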

>
> > virtio-blk driver is a blk-mq driver, so submissions and completions
> > for a given virtqueue should already be processed by the same vCPU.
> >
> > Unless the device is misconfigured or the guest software chooses a
> > custom vq:vCPU mapping, there should be no vq lock contention between
> > vCPUs.
>
> Single queue or nr_queue less than nr_cpus can't be thought as mis-configured,
> so every vCPU can submit request, but only one or a few vCPUs complete all.

Yes.

>
> >
> > I can think of a reason why submission and completion require
> > coordination: descriptors are occupied until completion. The
> > submission logic chooses free descriptors from the table. The
> > completion logic returns free descriptors so they can be used in
> > future submissions.
>
> Shared descriptors is one fundamental design of virtio ring, and
> looks the reason why vq spin lock is needed in both sides.
>
> >
> > Other ring designs expose the submission ring head AND tail index so
> > that it's clear which submissions have been processed by the other
> > side. Once processed, the descriptors are no longer occupied and can
> > be reused for future submissions immediately. This means that
> > submission and completion do not share state.
> >
> > This is for the split virtqueue layout. For the packed layout I think
> > there is a similar dependency because descriptors are used for both
> > submission and completion.
> >
> > I have CCed Michael Tsirkin in case he has any thoughts on the
> > independence of submission and completion in the vring design.
> >
> > BTW I have written about difference in the VIRTIO, NVMe, and io_uring
> > descriptor ring designs here:
> > https://blog.vmsplice.net/2022/06/comparing-virtio-nvme-and-iouring-queue.html
>
> Except for ring, notification could be another difference.

Yes, the io_uring_enter(2) syscall takes over the control flow of the
current thread and can perform both submission and completion work.

VIRTIO/vhost-user has separate submission and completion
notifications, although they are typically implemented as eventfds
that can be processed with io_uring too.

Stefan

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: ublk-qcow2: ublk-qcow2 is available
  2022-10-18 14:54                       ` Stefan Hajnoczi
  2022-10-19  9:09                         ` Ming Lei
@ 2022-10-21  5:33                         ` Yongji Xie
  2022-10-21  6:30                           ` Jason Wang
  1 sibling, 1 reply; 44+ messages in thread
From: Yongji Xie @ 2022-10-21  5:33 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Michael S. Tsirkin, Ming Lei, Ziyang Zhang, Stefan Hajnoczi,
	io-uring, linux-block, linux-kernel, Denis V. Lunev,
	Xiaoguang Wang

On Tue, Oct 18, 2022 at 10:54 PM Stefan Hajnoczi <[email protected]> wrote:
>
> On Tue, 18 Oct 2022 at 09:17, Yongji Xie <[email protected]> wrote:
> >
> > On Tue, Oct 18, 2022 at 2:59 PM Ming Lei <[email protected]> wrote:
> > >
> > > On Mon, Oct 17, 2022 at 07:11:59PM +0800, Yongji Xie wrote:
> > > > On Fri, Oct 14, 2022 at 8:57 PM Ming Lei <[email protected]> wrote:
> > > > >
> > > > > On Thu, Oct 13, 2022 at 02:48:04PM +0800, Yongji Xie wrote:
> > > > > > On Wed, Oct 12, 2022 at 10:22 PM Stefan Hajnoczi <[email protected]> wrote:
> > > > > > >
> > > > > > > On Sat, 8 Oct 2022 at 04:43, Ziyang Zhang <[email protected]> wrote:
> > > > > > > >
> > > > > > > > On 2022/10/5 12:18, Ming Lei wrote:
> > > > > > > > > On Tue, Oct 04, 2022 at 09:53:32AM -0400, Stefan Hajnoczi wrote:
> > > > > > > > >> On Tue, 4 Oct 2022 at 05:44, Ming Lei <[email protected]> wrote:
> > > > > > > > >>>
> > > > > > > > >>> On Mon, Oct 03, 2022 at 03:53:41PM -0400, Stefan Hajnoczi wrote:
> > > > > > > > >>>> On Fri, Sep 30, 2022 at 05:24:11PM +0800, Ming Lei wrote:
> > > > > > > > >>>>> ublk-qcow2 is available now.
> > > > > > > > >>>>
> > > > > > > > >>>> Cool, thanks for sharing!
> > > > > > > > >>>>
> > > > > > > > >>>>>
> > > > > > > > >>>>> So far it provides basic read/write function, and compression and snapshot
> > > > > > > > >>>>> aren't supported yet. The target/backend implementation is completely
> > > > > > > > >>>>> based on io_uring, and share the same io_uring with ublk IO command
> > > > > > > > >>>>> handler, just like what ublk-loop does.
> > > > > > > > >>>>>
> > > > > > > > >>>>> Follows the main motivations of ublk-qcow2:
> > > > > > > > >>>>>
> > > > > > > > >>>>> - building one complicated target from scratch helps libublksrv APIs/functions
> > > > > > > > >>>>>   become mature/stable more quickly, since qcow2 is complicated and needs more
> > > > > > > > >>>>>   requirement from libublksrv compared with other simple ones(loop, null)
> > > > > > > > >>>>>
> > > > > > > > >>>>> - there are several attempts of implementing qcow2 driver in kernel, such as
> > > > > > > > >>>>>   ``qloop`` [2], ``dm-qcow2`` [3] and ``in kernel qcow2(ro)`` [4], so ublk-qcow2
> > > > > > > > > >>>>>   might be useful for covering requirement in this field
> > > > > > > > >>>>>
> > > > > > > > >>>>> - performance comparison with qemu-nbd, and it was my 1st thought to evaluate
> > > > > > > > >>>>>   performance of ublk/io_uring backend by writing one ublk-qcow2 since ublksrv
> > > > > > > > >>>>>   is started
> > > > > > > > >>>>>
> > > > > > > > >>>>> - help to abstract common building block or design pattern for writing new ublk
> > > > > > > > >>>>>   target/backend
> > > > > > > > >>>>>
> > > > > > > > >>>>> So far it basically passes xfstest(XFS) test by using ublk-qcow2 block
> > > > > > > > >>>>> device as TEST_DEV, and kernel building workload is verified too. Also
> > > > > > > > >>>>> soft update approach is applied in meta flushing, and meta data
> > > > > > > > >>>>> integrity is guaranteed, 'make test T=qcow2/040' covers this kind of
> > > > > > > > >>>>> test, and only cluster leak is reported during this test.
> > > > > > > > >>>>>
> > > > > > > > >>>>> The performance data looks much better compared with qemu-nbd, see
> > > > > > > > >>>>> details in commit log[1], README[5] and STATUS[6]. And the test covers both
> > > > > > > > >>>>> empty image and pre-allocated image, for example of pre-allocated qcow2
> > > > > > > > >>>>> image(8GB):
> > > > > > > > >>>>>
> > > > > > > > >>>>> - qemu-nbd (make test T=qcow2/002)
> > > > > > > > >>>>
> > > > > > > > >>>> Single queue?
> > > > > > > > >>>
> > > > > > > > >>> Yeah.
> > > > > > > > >>>
> > > > > > > > >>>>
> > > > > > > > >>>>>     randwrite(4k): jobs 1, iops 24605
> > > > > > > > >>>>>     randread(4k): jobs 1, iops 30938
> > > > > > > > >>>>>     randrw(4k): jobs 1, iops read 13981 write 14001
> > > > > > > > >>>>>     rw(512k): jobs 1, iops read 724 write 728
> > > > > > > > >>>>
> > > > > > > > >>>> Please try qemu-storage-daemon's VDUSE export type as well. The
> > > > > > > > >>>> command-line should be similar to this:
> > > > > > > > >>>>
> > > > > > > > >>>>   # modprobe virtio_vdpa # attaches vDPA devices to host kernel
> > > > > > > > >>>
> > > > > > > > >>> Not found virtio_vdpa module even though I enabled all the following
> > > > > > > > >>> options:
> > > > > > > > >>>
> > > > > > > > >>>         --- vDPA drivers
> > > > > > > > >>>           <M>   vDPA device simulator core
> > > > > > > > >>>           <M>     vDPA simulator for networking device
> > > > > > > > >>>           <M>     vDPA simulator for block device
> > > > > > > > >>>           <M>   VDUSE (vDPA Device in Userspace) support
> > > > > > > > >>>           <M>   Intel IFC VF vDPA driver
> > > > > > > > >>>           <M>   Virtio PCI bridge vDPA driver
> > > > > > > > >>>           <M>   vDPA driver for Alibaba ENI
> > > > > > > > >>>
> > > > > > > > >>> BTW, my test environment is VM and the shared data is done in VM too, and
> > > > > > > > >>> can virtio_vdpa be used inside VM?
> > > > > > > > >>
> > > > > > > > >> I hope Xie Yongji can help explain how to benchmark VDUSE.
> > > > > > > > >>
> > > > > > > > >> virtio_vdpa is available inside guests too. Please check that
> > > > > > > > >> VIRTIO_VDPA ("vDPA driver for virtio devices") is enabled in "Virtio
> > > > > > > > >> drivers" menu.
> > > > > > > > >>
> > > > > > > > >>>
> > > > > > > > >>>>   # modprobe vduse
> > > > > > > > >>>>   # qemu-storage-daemon \
> > > > > > > > > >>>>       --blockdev file,filename=test.qcow2,cache.direct=on|off,aio=native,node-name=file \
> > > > > > > > >>>>       --blockdev qcow2,file=file,node-name=qcow2 \
> > > > > > > > >>>>       --object iothread,id=iothread0 \
> > > > > > > > >>>>       --export vduse-blk,id=vduse0,name=vduse0,num-queues=$(nproc),node-name=qcow2,writable=on,iothread=iothread0
> > > > > > > > >>>>   # vdpa dev add name vduse0 mgmtdev vduse
> > > > > > > > >>>>
> > > > > > > > >>>> A virtio-blk device should appear and xfstests can be run on it
> > > > > > > > >>>> (typically /dev/vda unless you already have other virtio-blk devices).
> > > > > > > > >>>>
> > > > > > > > >>>> Afterwards you can destroy the device using:
> > > > > > > > >>>>
> > > > > > > > >>>>   # vdpa dev del vduse0
> > > > > > > > >>>>
> > > > > > > > >>>>>
> > > > > > > > >>>>> - ublk-qcow2 (make test T=qcow2/022)
> > > > > > > > >>>>
> > > > > > > > >>>> There are a lot of other factors not directly related to NBD vs ublk. In
> > > > > > > > >>>> order to get an apples-to-apples comparison with qemu-* a ublk export
> > > > > > > > >>>> type is needed in qemu-storage-daemon. That way only the difference is
> > > > > > > > >>>> the ublk interface and the rest of the code path is identical, making it
> > > > > > > > >>>> possible to compare NBD, VDUSE, ublk, etc more precisely.
> > > > > > > > >>>
> > > > > > > > >>> Maybe not true.
> > > > > > > > >>>
> > > > > > > > >>> ublk-qcow2 uses io_uring to handle all backend IO(include meta IO) completely,
> > > > > > > > >>> and so far single io_uring/pthread is for handling all qcow2 IOs and IO
> > > > > > > > >>> command.
> > > > > > > > >>
> > > > > > > > >> qemu-nbd doesn't use io_uring to handle the backend IO, so we don't
> > > > > > > > >
> > > > > > > > > I tried to use it via --aio=io_uring for setting up qemu-nbd, but not succeed.
> > > > > > > > >
> > > > > > > > >> know whether the benchmark demonstrates that ublk is faster than NBD,
> > > > > > > > >> that the ublk-qcow2 implementation is faster than qemu-nbd's qcow2,
> > > > > > > > >> whether there are miscellaneous implementation differences between
> > > > > > > > >> ublk-qcow2 and qemu-nbd (like using the same io_uring context for both
> > > > > > > > >> ublk and backend IO), or something else.
> > > > > > > > >
> > > > > > > > > The theory shouldn't be too complicated:
> > > > > > > > >
> > > > > > > > > > 1) io_uring passthrough (pt) communication is faster than a socket, and the io
> > > > > > > > > > command is carried over io_uring pt commands, so it should be faster than virtio
> > > > > > > > > > communication too.
> > > > > > > > >
> > > > > > > > > > 2) io_uring io handling is faster than libaio, which is what the
> > > > > > > > > > test on qemu-nbd uses, and all qcow2 backend io (including meta io) is handled
> > > > > > > > > > by io_uring.
> > > > > > > > >
> > > > > > > > > https://github.com/ming1/ubdsrv/blob/master/tests/common/qcow2_common
> > > > > > > > >
> > > > > > > > > 3) ublk uses one single io_uring to handle all io commands and qcow2
> > > > > > > > > backend IOs, so batching handling is common, and it is easy to see
> > > > > > > > > dozens of IOs/io commands handled in single syscall, or even more.
> > > > > > > > >
> > > > > > > > >>
> > > > > > > > >> I'm suggesting measuring changes to just 1 variable at a time.
> > > > > > > > >> Otherwise it's hard to reach a conclusion about the root cause of the
> > > > > > > > >> performance difference. Let's learn why ublk-qcow2 performs well.
> > > > > > > > >
> > > > > > > > > Turns out the latest Fedora 37-beta doesn't support vdpa yet, so I built
> > > > > > > > > qemu from the latest github tree, and finally it starts to work. And test kernel
> > > > > > > > > is v6.0 release.
> > > > > > > > >
> > > > > > > > > Follows the test result, and all three devices are setup as single
> > > > > > > > > queue, and all tests are run in single job, still done in one VM, and
> > > > > > > > > > the test images are stored on XFS/virtio-scsi backed SSD.
> > > > > > > > >
> > > > > > > > > The 1st group tests all three block device which is backed by empty
> > > > > > > > > qcow2 image.
> > > > > > > > >
> > > > > > > > > The 2nd group tests all the three block devices backed by pre-allocated
> > > > > > > > > qcow2 image.
> > > > > > > > >
> > > > > > > > > > Except for big sequential IO (512K), there is still a noticeable gap between
> > > > > > > > > > vdpa-virtio-blk and ublk.
> > > > > > > > >
> > > > > > > > > 1. run fio on block device over empty qcow2 image
> > > > > > > > > 1) qemu-nbd
> > > > > > > > > running qcow2/001
> > > > > > > > > run perf test on empty qcow2 image via nbd
> > > > > > > > >       fio (nbd(/mnt/data/ublk_null_8G_nYbgF.qcow2), libaio, bs 4k, dio, hw queues:1)...
> > > > > > > > >       randwrite: jobs 1, iops 8549
> > > > > > > > >       randread: jobs 1, iops 34829
> > > > > > > > >       randrw: jobs 1, iops read 11363 write 11333
> > > > > > > > >       rw(512k): jobs 1, iops read 590 write 597
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > 2) ublk-qcow2
> > > > > > > > > running qcow2/021
> > > > > > > > > run perf test on empty qcow2 image via ublk
> > > > > > > > >       fio (ublk/qcow2( -f /mnt/data/ublk_null_8G_s761j.qcow2), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0).
> > > > > > > > >       randwrite: jobs 1, iops 16086
> > > > > > > > >       randread: jobs 1, iops 172720
> > > > > > > > >       randrw: jobs 1, iops read 35760 write 35702
> > > > > > > > >       rw(512k): jobs 1, iops read 1140 write 1149
> > > > > > > > >
> > > > > > > > > 3) vdpa-virtio-blk
> > > > > > > > > running debug/test_dev
> > > > > > > > > run io test on specified device
> > > > > > > > >       fio (vdpa(/dev/vdc), libaio, bs 4k, dio, hw queues:1)...
> > > > > > > > >       randwrite: jobs 1, iops 8626
> > > > > > > > >       randread: jobs 1, iops 126118
> > > > > > > > >       randrw: jobs 1, iops read 17698 write 17665
> > > > > > > > >       rw(512k): jobs 1, iops read 1023 write 1031
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > 2. run fio on block device over pre-allocated qcow2 image
> > > > > > > > > 1) qemu-nbd
> > > > > > > > > running qcow2/002
> > > > > > > > > run perf test on pre-allocated qcow2 image via nbd
> > > > > > > > >       fio (nbd(/mnt/data/ublk_data_8G_sc0SB.qcow2), libaio, bs 4k, dio, hw queues:1)...
> > > > > > > > >       randwrite: jobs 1, iops 21439
> > > > > > > > >       randread: jobs 1, iops 30336
> > > > > > > > >       randrw: jobs 1, iops read 11476 write 11449
> > > > > > > > >       rw(512k): jobs 1, iops read 718 write 722
> > > > > > > > >
> > > > > > > > > 2) ublk-qcow2
> > > > > > > > > running qcow2/022
> > > > > > > > > run perf test on pre-allocated qcow2 image via ublk
> > > > > > > > >       fio (ublk/qcow2( -f /mnt/data/ublk_data_8G_yZiaJ.qcow2), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0).
> > > > > > > > >       randwrite: jobs 1, iops 98757
> > > > > > > > >       randread: jobs 1, iops 110246
> > > > > > > > >       randrw: jobs 1, iops read 47229 write 47161
> > > > > > > > >       rw(512k): jobs 1, iops read 1416 write 1427
> > > > > > > > >
> > > > > > > > > 3) vdpa-virtio-blk
> > > > > > > > > running debug/test_dev
> > > > > > > > > run io test on specified device
> > > > > > > > >       fio (vdpa(/dev/vdc), libaio, bs 4k, dio, hw queues:1)...
> > > > > > > > >       randwrite: jobs 1, iops 47317
> > > > > > > > >       randread: jobs 1, iops 74092
> > > > > > > > >       randrw: jobs 1, iops read 27196 write 27234
> > > > > > > > >       rw(512k): jobs 1, iops read 1447 write 1458
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > > > Hi All,
> > > > > > > >
> > > > > > > > We are interested in VDUSE vs UBLK, too. And I have tested them with nullblk backend.
> > > > > > > > Let me share some results here.
> > > > > > > >
> > > > > > > > I setup UBLK with:
> > > > > > > >   ublk add -t loop -f /dev/nullb0 -d QUEUE_DEPTH -q NR_QUEUE
> > > > > > > >
> > > > > > > > I setup VDUSE with:
> > > > > > > >   qemu-storage-daemon \
> > > > > > > >        --chardev socket,id=charmonitor,path=/tmp/qmp.sock,server=on,wait=off \
> > > > > > > >        --monitor chardev=charmonitor \
> > > > > > > >        --blockdev driver=host_device,cache.direct=on,filename=/dev/nullb0,node-name=disk0 \
> > > > > > > >        --export vduse-blk,id=test,node-name=disk0,name=vduse_test,writable=on,num-queues=NR_QUEUE,queue-size=QUEUE_DEPTH
> > > > > > > >
> > > > > > > > Here QUEUE_DEPTH is 1, 32 or 128 and NR_QUEUE is 1 or 4.
> > > > > > > >
> > > > > > > > Note:
> > > > > > > > (1) VDUSE requires QUEUE_DEPTH >= 2. I cannot setup QUEUE_DEPTH to 1.
> > > > > > > > (2) I use qemu 7.1.0-rc3. It supports vduse-blk.
> > > > > > > > (3) I do not use ublk null target so that the test is fair.
> > > > > > > > (4) I setup fio with direct=1, bs=4k.
> > > > > > > >
> > > > > > > > ------------------------------
> > > > > > > > > 1 job 1 iodepth, lat (usec)
> > > > > > > >                 vduse   ublk
> > > > > > > > seq-read        22.55   11.15
> > > > > > > > rand-read       22.49   11.17
> > > > > > > > seq-write       25.67   10.25
> > > > > > > > rand-write      24.13   10.16
> > > > > > >
> > > > > > > Thanks for sharing. Any idea what the bottlenecks are for vduse and ublk?
> > > > > > >
> > > > > >
> > > > > > I think one reason for the latency gap of sync I/O is that vduse uses
> > > > > > workqueue in the I/O completion path but ublk doesn't.
> > > > > >
> > > > > > And one bottleneck for the async I/O in vduse is that vduse will do
> > > > > > memcpy inside the critical section of virtqueue's spinlock in the
> > > > > > virtio-blk driver. That will hurt the performance heavily when
> > > > > > virtio_queue_rq() and virtblk_done() run concurrently. And it can be
> > > > > > mitigated by the advance DMA mapping feature [1] or irq binding
> > > > > > support [2].
> > > > >
> > > > > Hi Yongji,
> > > > >
> > > > > Yeah, that is the cost you paid for virtio. Wrt. userspace block device
> > > > > or other sort of userspace devices, cmd completion is driven by
> > > > > userspace, not sure if one such 'irq' is needed.
> > > >
> > > > I'm not sure, it can be an optional feature in the future if needed.
> > > >
> > > > > Even not sure if virtio
> > > > > ring is one good choice for such use case, given io_uring has been proved
> > > > > as very efficient(should be better than virtio ring, IMO).
> > > > >
> > > >
> > > > Since vduse is aimed at creating a generic userspace device framework,
> > > > virtio should be the right way IMO.
> > >
> > > OK, it is the right way, but may not be the effective one.
> > >
> >
> > Maybe, but I think we can try to optimize it.
> >
> > > > And with the vdpa framework, the
> > > > userspace device can serve both virtual machines and containers.
> > >
> > > virtio is good for VM, but not sure it is good enough for other
> > > cases.
> > >
> > > >
> > > > Regarding the performance issue, actually I can't measure how much of
> > > > the performance loss is due to the difference between virtio ring and
> > > > iouring. But I think it should be very small. The main costs come from
> > > > the two bottlenecks I mentioned before which could be mitigated in the
> > > > future.
> > >
> > > Per my understanding, at least there are two places where virtio ring is
> > > less efficient than io_uring:
> > >
> >
> > I might have misunderstood what you mean by virtio ring before. My
> > previous understanding of the virtio ring does not include the
> > virtio-blk driver.
> >
> > > 1) io_uring uses standalone submission queue(SQ) and completion queue(CQ),
> > > so no contention exists between submission and completion; but virtio queue
> > > requires per-vq lock in both submission and completion.
> > >
> >
> > Yes, this is the bottleneck of the virtio-blk driver, even in the VM
> > case. We are also trying to optimize this lock.
> >
> > One way to mitigate it is making submission and completion happen in
> > the same core.
>
> QEMU sizes virtio-blk device num-queues to match the vCPU count. The
> virtio-blk driver is a blk-mq driver, so submissions and completions
> for a given virtqueue should already be processed by the same vCPU.
>
> Unless the device is misconfigured or the guest software chooses a
> custom vq:vCPU mapping, there should be no vq lock contention between
> vCPUs.
>
> I can think of a reason why submission and completion require
> coordination: descriptors are occupied until completion. The
> submission logic chooses free descriptors from the table. The
> completion logic returns free descriptors so they can be used in
> future submissions.
>

Yes, we need to maintain a head pointer of the free descriptors in
both submission and completion path.
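
A toy model may help show why that head pointer forces coordination. This
is only a simplified sketch of a descriptor table with a linked free list,
not the kernel's virtio_ring code; the class and attribute names are
invented for the example. The point is that submit() and complete() both
write free_head, so if they run on different CPUs they need a common lock.

```python
class SplitRingModel:
    """Toy descriptor table whose free entries form a linked list."""

    def __init__(self, size):
        if size < 1:
            raise ValueError("size must be at least 1")
        # next_free[i] is the free descriptor that follows i; -1 ends the chain.
        self.next_free = [i + 1 for i in range(size)]
        self.next_free[-1] = -1
        self.free_head = 0
        self.num_free = size

    def submit(self):
        """Submission path: take the descriptor at the head of the free list."""
        if self.free_head == -1:
            raise RuntimeError("no free descriptors")
        desc = self.free_head
        self.free_head = self.next_free[desc]  # shared state is written here
        self.num_free -= 1
        return desc

    def complete(self, desc):
        """Completion path: put the descriptor back at the head of the free list."""
        self.next_free[desc] = self.free_head
        self.free_head = desc  # and the same shared state is written here
        self.num_free += 1


if __name__ == "__main__":
    ring = SplitRingModel(4)
    first = ring.submit()
    second = ring.submit()
    ring.complete(first)
    print(first, second, ring.free_head, ring.num_free)
```

In this model a completed descriptor becomes the next one handed out, so
the two paths cannot advance independently.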

> Other ring designs expose the submission ring head AND tail index so
> that it's clear which submissions have been processed by the other
> side. Once processed, the descriptors are no longer occupied and can
> be reused for future submissions immediately. This means that
> submission and completion do not share state.
>
> This is for the split virtqueue layout. For the packed layout I think
> there is a similar dependency because descriptors are used for both
> submission and completion.
>
> I have CCed Michael Tsirkin in case he has any thoughts on the
> independence of submission and completion in the vring design.
>
> BTW I have written about difference in the VIRTIO, NVMe, and io_uring
> descriptor ring designs here:
> https://blog.vmsplice.net/2022/06/comparing-virtio-nvme-and-iouring-queue.html
>

Good to know that!

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: ublk-qcow2: ublk-qcow2 is available
  2022-10-21  5:33                         ` Yongji Xie
@ 2022-10-21  6:30                           ` Jason Wang
  2022-10-25  8:17                             ` Yongji Xie
  0 siblings, 1 reply; 44+ messages in thread
From: Jason Wang @ 2022-10-21  6:30 UTC (permalink / raw)
  To: Yongji Xie, Stefan Hajnoczi
  Cc: Michael S. Tsirkin, Ming Lei, Ziyang Zhang, Stefan Hajnoczi,
	io-uring, linux-block, linux-kernel, Denis V. Lunev,
	Xiaoguang Wang


On 2022/10/21 13:33, Yongji Xie wrote:
> On Tue, Oct 18, 2022 at 10:54 PM Stefan Hajnoczi <[email protected]> wrote:
>> On Tue, 18 Oct 2022 at 09:17, Yongji Xie <[email protected]> wrote:
>>> On Tue, Oct 18, 2022 at 2:59 PM Ming Lei <[email protected]> wrote:
>>>> On Mon, Oct 17, 2022 at 07:11:59PM +0800, Yongji Xie wrote:
>>>>> On Fri, Oct 14, 2022 at 8:57 PM Ming Lei <[email protected]> wrote:
>>>>>> On Thu, Oct 13, 2022 at 02:48:04PM +0800, Yongji Xie wrote:
>>>>>>> On Wed, Oct 12, 2022 at 10:22 PM Stefan Hajnoczi <[email protected]> wrote:
>>>>>>>> On Sat, 8 Oct 2022 at 04:43, Ziyang Zhang <[email protected]> wrote:
>>>>>>>>> On 2022/10/5 12:18, Ming Lei wrote:
>>>>>>>>>> On Tue, Oct 04, 2022 at 09:53:32AM -0400, Stefan Hajnoczi wrote:
>>>>>>>>>>> On Tue, 4 Oct 2022 at 05:44, Ming Lei <[email protected]> wrote:
>>>>>>>>>>>> On Mon, Oct 03, 2022 at 03:53:41PM -0400, Stefan Hajnoczi wrote:
>>>>>>>>>>>>> On Fri, Sep 30, 2022 at 05:24:11PM +0800, Ming Lei wrote:
>>>>>>>>>>>>>> ublk-qcow2 is available now.
>>>>>>>>>>>>> Cool, thanks for sharing!
>>>>>>>>>>>>>
>>>>>>>>>>>>>> So far it provides basic read/write function, and compression and snapshot
>>>>>>>>>>>>>> aren't supported yet. The target/backend implementation is completely
>>>>>>>>>>>>>> based on io_uring, and share the same io_uring with ublk IO command
>>>>>>>>>>>>>> handler, just like what ublk-loop does.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The main motivations for ublk-qcow2 are:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - building one complicated target from scratch helps libublksrv APIs/functions
>>>>>>>>>>>>>>    become mature/stable more quickly, since qcow2 is complicated and places more
>>>>>>>>>>>>>>    requirements on libublksrv than the simple targets (loop, null)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - there have been several attempts at implementing a qcow2 driver in the kernel,
>>>>>>>>>>>>>>    such as ``qloop`` [2], ``dm-qcow2`` [3] and ``in kernel qcow2(ro)`` [4], so
>>>>>>>>>>>>>>    ublk-qcow2 might be useful for covering the requirements in this field
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - performance comparison with qemu-nbd; ever since ublksrv was started, my first
>>>>>>>>>>>>>>    thought was to evaluate the performance of the ublk/io_uring backend by
>>>>>>>>>>>>>>    writing a ublk-qcow2 target
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - helping to abstract common building blocks or design patterns for writing new
>>>>>>>>>>>>>>    ublk targets/backends
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So far it basically passes the xfstests (XFS) run using a ublk-qcow2 block
>>>>>>>>>>>>>> device as TEST_DEV, and a kernel build workload has been verified too. A
>>>>>>>>>>>>>> soft-update approach is applied when flushing metadata, so metadata
>>>>>>>>>>>>>> integrity is guaranteed; 'make test T=qcow2/040' covers this kind of
>>>>>>>>>>>>>> test, and only cluster leaks are reported during it.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The performance data looks much better than qemu-nbd's; see the
>>>>>>>>>>>>>> details in the commit log[1], README[5] and STATUS[6]. The tests cover both
>>>>>>>>>>>>>> an empty image and a pre-allocated image. For example, with a pre-allocated
>>>>>>>>>>>>>> qcow2 image (8GB):
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - qemu-nbd (make test T=qcow2/002)
>>>>>>>>>>>>> Single queue?
>>>>>>>>>>>> Yeah.
>>>>>>>>>>>>
>>>>>>>>>>>>>>      randwrite(4k): jobs 1, iops 24605
>>>>>>>>>>>>>>      randread(4k): jobs 1, iops 30938
>>>>>>>>>>>>>>      randrw(4k): jobs 1, iops read 13981 write 14001
>>>>>>>>>>>>>>      rw(512k): jobs 1, iops read 724 write 728
>>>>>>>>>>>>> Please try qemu-storage-daemon's VDUSE export type as well. The
>>>>>>>>>>>>> command-line should be similar to this:
>>>>>>>>>>>>>
>>>>>>>>>>>>>    # modprobe virtio_vdpa # attaches vDPA devices to host kernel
>>>>>>>>>>>> The virtio_vdpa module was not found even though I enabled all of the
>>>>>>>>>>>> following options:
>>>>>>>>>>>>
>>>>>>>>>>>>          --- vDPA drivers
>>>>>>>>>>>>            <M>   vDPA device simulator core
>>>>>>>>>>>>            <M>     vDPA simulator for networking device
>>>>>>>>>>>>            <M>     vDPA simulator for block device
>>>>>>>>>>>>            <M>   VDUSE (vDPA Device in Userspace) support
>>>>>>>>>>>>            <M>   Intel IFC VF vDPA driver
>>>>>>>>>>>>            <M>   Virtio PCI bridge vDPA driver
>>>>>>>>>>>>            <M>   vDPA driver for Alibaba ENI
>>>>>>>>>>>>
>>>>>>>>>>>> BTW, my test environment is a VM and the data I shared was collected in that
>>>>>>>>>>>> VM too; can virtio_vdpa be used inside a VM?
>>>>>>>>>>> I hope Xie Yongji can help explain how to benchmark VDUSE.
>>>>>>>>>>>
>>>>>>>>>>> virtio_vdpa is available inside guests too. Please check that
>>>>>>>>>>> VIRTIO_VDPA ("vDPA driver for virtio devices") is enabled in "Virtio
>>>>>>>>>>> drivers" menu.
>>>>>>>>>>>
>>>>>>>>>>>>>    # modprobe vduse
>>>>>>>>>>>>>    # qemu-storage-daemon \
>>>>>>>>>>>>>        --blockdev file,filename=test.qcow2,cache.direct=on|off,aio=native,node-name=file \
>>>>>>>>>>>>>        --blockdev qcow2,file=file,node-name=qcow2 \
>>>>>>>>>>>>>        --object iothread,id=iothread0 \
>>>>>>>>>>>>>        --export vduse-blk,id=vduse0,name=vduse0,num-queues=$(nproc),node-name=qcow2,writable=on,iothread=iothread0
>>>>>>>>>>>>>    # vdpa dev add name vduse0 mgmtdev vduse
>>>>>>>>>>>>>
>>>>>>>>>>>>> A virtio-blk device should appear and xfstests can be run on it
>>>>>>>>>>>>> (typically /dev/vda unless you already have other virtio-blk devices).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Afterwards you can destroy the device using:
>>>>>>>>>>>>>
>>>>>>>>>>>>>    # vdpa dev del vduse0
>>>>>>>>>>>>>
>>>>>>>>>>>>>> - ublk-qcow2 (make test T=qcow2/022)
>>>>>>>>>>>>> There are a lot of other factors not directly related to NBD vs ublk. In
>>>>>>>>>>>>> order to get an apples-to-apples comparison with qemu-* a ublk export
>>>>>>>>>>>>> type is needed in qemu-storage-daemon. That way the only difference is
>>>>>>>>>>>>> the ublk interface and the rest of the code path is identical, making it
>>>>>>>>>>>>> possible to compare NBD, VDUSE, ublk, etc more precisely.
>>>>>>>>>>>> Maybe not true.
>>>>>>>>>>>>
>>>>>>>>>>>> ublk-qcow2 uses io_uring to handle all backend IO (including meta IO),
>>>>>>>>>>>> and so far a single io_uring/pthread handles all qcow2 IOs and IO
>>>>>>>>>>>> commands.
>>>>>>>>>>> qemu-nbd doesn't use io_uring to handle the backend IO, so we don't
>>>>>>>>>> I tried to enable it via --aio=io_uring when setting up qemu-nbd, but did not succeed.
>>>>>>>>>>
>>>>>>>>>>> know whether the benchmark demonstrates that ublk is faster than NBD,
>>>>>>>>>>> that the ublk-qcow2 implementation is faster than qemu-nbd's qcow2,
>>>>>>>>>>> whether there are miscellaneous implementation differences between
>>>>>>>>>>> ublk-qcow2 and qemu-nbd (like using the same io_uring context for both
>>>>>>>>>>> ublk and backend IO), or something else.
>>>>>>>>>> The theory shouldn't be too complicated:
>>>>>>>>>>
>>>>>>>>>> 1) io_uring passthrough (pt) communication is faster than a socket; io commands
>>>>>>>>>> are carried over io_uring pt commands, which should be faster than virtio
>>>>>>>>>> communication too.
>>>>>>>>>>
>>>>>>>>>> 2) io_uring IO handling is faster than libaio, which is what the qemu-nbd
>>>>>>>>>> test uses, and all qcow2 backend IO (including meta IO) is handled
>>>>>>>>>> by io_uring.
>>>>>>>>>>
>>>>>>>>>> https://github.com/ming1/ubdsrv/blob/master/tests/common/qcow2_common
>>>>>>>>>>
>>>>>>>>>> 3) ublk uses one single io_uring to handle all io commands and qcow2
>>>>>>>>>> backend IOs, so batched handling is common, and it is easy to see
>>>>>>>>>> dozens of IOs/io commands, or even more, handled in a single syscall.
>>>>>>>>>>
>>>>>>>>>>> I'm suggesting measuring changes to just 1 variable at a time.
>>>>>>>>>>> Otherwise it's hard to reach a conclusion about the root cause of the
>>>>>>>>>>> performance difference. Let's learn why ublk-qcow2 performs well.
>>>>>>>>>> It turns out the latest Fedora 37 beta doesn't support vdpa yet, so I built
>>>>>>>>>> qemu from the latest GitHub tree, and finally it works. The test kernel
>>>>>>>>>> is the v6.0 release.
>>>>>>>>>>
>>>>>>>>>> The test results follow. All three devices are set up with a single
>>>>>>>>>> queue, all tests run as a single job, still inside one VM, and
>>>>>>>>>> the test images are stored on XFS on a virtio-scsi backed SSD.
>>>>>>>>>>
>>>>>>>>>> The 1st group tests all three block devices backed by an empty
>>>>>>>>>> qcow2 image.
>>>>>>>>>>
>>>>>>>>>> The 2nd group tests all three block devices backed by a pre-allocated
>>>>>>>>>> qcow2 image.
>>>>>>>>>>
>>>>>>>>>> Except for big sequential IO (512K), there is still a sizable gap between
>>>>>>>>>> vdpa-virtio-blk and ublk.
>>>>>>>>>>
>>>>>>>>>> 1. run fio on block device over empty qcow2 image
>>>>>>>>>> 1) qemu-nbd
>>>>>>>>>> running qcow2/001
>>>>>>>>>> run perf test on empty qcow2 image via nbd
>>>>>>>>>>        fio (nbd(/mnt/data/ublk_null_8G_nYbgF.qcow2), libaio, bs 4k, dio, hw queues:1)...
>>>>>>>>>>        randwrite: jobs 1, iops 8549
>>>>>>>>>>        randread: jobs 1, iops 34829
>>>>>>>>>>        randrw: jobs 1, iops read 11363 write 11333
>>>>>>>>>>        rw(512k): jobs 1, iops read 590 write 597
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 2) ublk-qcow2
>>>>>>>>>> running qcow2/021
>>>>>>>>>> run perf test on empty qcow2 image via ublk
>>>>>>>>>>        fio (ublk/qcow2( -f /mnt/data/ublk_null_8G_s761j.qcow2), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0).
>>>>>>>>>>        randwrite: jobs 1, iops 16086
>>>>>>>>>>        randread: jobs 1, iops 172720
>>>>>>>>>>        randrw: jobs 1, iops read 35760 write 35702
>>>>>>>>>>        rw(512k): jobs 1, iops read 1140 write 1149
>>>>>>>>>>
>>>>>>>>>> 3) vdpa-virtio-blk
>>>>>>>>>> running debug/test_dev
>>>>>>>>>> run io test on specified device
>>>>>>>>>>        fio (vdpa(/dev/vdc), libaio, bs 4k, dio, hw queues:1)...
>>>>>>>>>>        randwrite: jobs 1, iops 8626
>>>>>>>>>>        randread: jobs 1, iops 126118
>>>>>>>>>>        randrw: jobs 1, iops read 17698 write 17665
>>>>>>>>>>        rw(512k): jobs 1, iops read 1023 write 1031
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 2. run fio on block device over pre-allocated qcow2 image
>>>>>>>>>> 1) qemu-nbd
>>>>>>>>>> running qcow2/002
>>>>>>>>>> run perf test on pre-allocated qcow2 image via nbd
>>>>>>>>>>        fio (nbd(/mnt/data/ublk_data_8G_sc0SB.qcow2), libaio, bs 4k, dio, hw queues:1)...
>>>>>>>>>>        randwrite: jobs 1, iops 21439
>>>>>>>>>>        randread: jobs 1, iops 30336
>>>>>>>>>>        randrw: jobs 1, iops read 11476 write 11449
>>>>>>>>>>        rw(512k): jobs 1, iops read 718 write 722
>>>>>>>>>>
>>>>>>>>>> 2) ublk-qcow2
>>>>>>>>>> running qcow2/022
>>>>>>>>>> run perf test on pre-allocated qcow2 image via ublk
>>>>>>>>>>        fio (ublk/qcow2( -f /mnt/data/ublk_data_8G_yZiaJ.qcow2), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0).
>>>>>>>>>>        randwrite: jobs 1, iops 98757
>>>>>>>>>>        randread: jobs 1, iops 110246
>>>>>>>>>>        randrw: jobs 1, iops read 47229 write 47161
>>>>>>>>>>        rw(512k): jobs 1, iops read 1416 write 1427
>>>>>>>>>>
>>>>>>>>>> 3) vdpa-virtio-blk
>>>>>>>>>> running debug/test_dev
>>>>>>>>>> run io test on specified device
>>>>>>>>>>        fio (vdpa(/dev/vdc), libaio, bs 4k, dio, hw queues:1)...
>>>>>>>>>>        randwrite: jobs 1, iops 47317
>>>>>>>>>>        randread: jobs 1, iops 74092
>>>>>>>>>>        randrw: jobs 1, iops read 27196 write 27234
>>>>>>>>>>        rw(512k): jobs 1, iops read 1447 write 1458
>>>>>>>>>>
>>>>>>>>>>
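
To put the single-queue fio numbers above in relative terms, here is a small Python sketch that divides each device's IOPS by qemu-nbd's for the pre-allocated image. The figures are copied from the results quoted above; the dictionary layout and the helper name `relative_iops` are only for illustration and are not part of the ublksrv test suite.

```python
# IOPS for the pre-allocated qcow2 image (bs 4k, 1 job, 1 hw queue),
# copied from the fio results quoted above.
PREALLOC_IOPS = {
    "qemu-nbd":        {"randwrite": 21439, "randread": 30336},
    "ublk-qcow2":      {"randwrite": 98757, "randread": 110246},
    "vdpa-virtio-blk": {"randwrite": 47317, "randread": 74092},
}


def relative_iops(results, baseline):
    """Return each device's IOPS divided by the baseline device's IOPS."""
    base = results[baseline]
    return {
        dev: {test: iops / base[test] for test, iops in tests.items()}
        for dev, tests in results.items()
    }


if __name__ == "__main__":
    for dev, tests in relative_iops(PREALLOC_IOPS, "qemu-nbd").items():
        summary = ", ".join(f"{t} {r:.2f}x" for t, r in tests.items())
        print(f"{dev:16s} {summary}")
```

On these numbers ublk-qcow2 reaches roughly 4.6x the random-write and 3.6x the random-read IOPS of qemu-nbd, and roughly 2.1x and 1.5x those of vdpa-virtio-blk. The ratios only quantify the gap; they do not say which of the factors listed earlier is responsible for it.
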
>>>>>>>>> Hi All,
>>>>>>>>>
>>>>>>>>> We are interested in VDUSE vs UBLK, too, and I have tested them with a null_blk backend.
>>>>>>>>> Let me share some results here.
>>>>>>>>>
>>>>>>>>> I set up UBLK with:
>>>>>>>>>    ublk add -t loop -f /dev/nullb0 -d QUEUE_DEPTH -q NR_QUEUE
>>>>>>>>>
>>>>>>>>> I set up VDUSE with:
>>>>>>>>>    qemu-storage-daemon \
>>>>>>>>>         --chardev socket,id=charmonitor,path=/tmp/qmp.sock,server=on,wait=off \
>>>>>>>>>         --monitor chardev=charmonitor \
>>>>>>>>>         --blockdev driver=host_device,cache.direct=on,filename=/dev/nullb0,node-name=disk0 \
>>>>>>>>>         --export vduse-blk,id=test,node-name=disk0,name=vduse_test,writable=on,num-queues=NR_QUEUE,queue-size=QUEUE_DEPTH
>>>>>>>>>
>>>>>>>>> Here QUEUE_DEPTH is 1, 32 or 128 and NR_QUEUE is 1 or 4.
>>>>>>>>>
>>>>>>>>> Note:
>>>>>>>>> (1) VDUSE requires QUEUE_DEPTH >= 2, so I cannot set QUEUE_DEPTH to 1.
>>>>>>>>> (2) I use qemu 7.1.0-rc3, which supports vduse-blk.
>>>>>>>>> (3) I do not use the ublk null target, so that the test is fair.
>>>>>>>>> (4) I set up fio with direct=1, bs=4k.
>>>>>>>>>
>>>>>>>>> ------------------------------
>>>>>>>>> 1 job, 1 iodepth, lat (usec)
>>>>>>>>>                  vduse   ublk
>>>>>>>>> seq-read        22.55   11.15
>>>>>>>>> rand-read       22.49   11.17
>>>>>>>>> seq-write       25.67   10.25
>>>>>>>>> rand-write      24.13   10.16
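
With one job and iodepth 1 only a single request is in flight, so the latency figures above translate almost directly into throughput: IOPS is at most 1,000,000 divided by the latency in microseconds. The sketch below works that out, assuming the table reports fio's mean per-request latency; the names `LATENCY_USEC` and `qd1_iops` are made up for this example.

```python
# Mean latency in microseconds (1 job, iodepth 1), copied from the table above.
LATENCY_USEC = {
    "seq-read":   {"vduse": 22.55, "ublk": 11.15},
    "rand-read":  {"vduse": 22.49, "ublk": 11.17},
    "seq-write":  {"vduse": 25.67, "ublk": 10.25},
    "rand-write": {"vduse": 24.13, "ublk": 10.16},
}


def qd1_iops(latency_usec):
    """Upper bound on IOPS when exactly one request is in flight."""
    return 1_000_000 / latency_usec


if __name__ == "__main__":
    for test, lat in LATENCY_USEC.items():
        ratio = lat["vduse"] / lat["ublk"]
        print(f"{test:10s} vduse ~{qd1_iops(lat['vduse']):.0f} iops, "
              f"ublk ~{qd1_iops(lat['ublk']):.0f} iops, "
              f"latency ratio {ratio:.2f}")
```

For seq-read this gives roughly 44k IOPS for vduse against roughly 90k for ublk, i.e. about a 2x latency gap at queue depth 1.
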
>>>>>>>> Thanks for sharing. Any idea what the bottlenecks are for vduse and ublk?
>>>>>>>>
>>>>>>> I think one reason for the latency gap of sync I/O is that vduse uses
>>>>>>> workqueue in the I/O completion path but ublk doesn't.
>>>>>>>
>>>>>>> And one bottleneck for the async I/O in vduse is that vduse will do
>>>>>>> memcpy inside the critical section of virtqueue's spinlock in the
>>>>>>> virtio-blk driver. That will hurt the performance heavily when
>>>>>>> virtio_queue_rq() and virtblk_done() run concurrently. And it can be
>>>>>>> mitigated by the advance DMA mapping feature [1] or irq binding
>>>>>>> support [2].
>>>>>> Hi Yongji,
>>>>>>
>>>>>> Yeah, that is the cost you pay for virtio. Wrt. userspace block devices
>>>>>> or other sorts of userspace devices, cmd completion is driven by
>>>>>> userspace, so I'm not sure such an 'irq' is needed.
>>>>> I'm not sure, it can be an optional feature in the future if needed.
>>>>>
>>>>>> I'm not even sure the virtio
>>>>>> ring is a good choice for this use case, given that io_uring has proved
>>>>>> very efficient (it should be better than the virtio ring, IMO).
>>>>>>
>>>>> Since vduse is aimed at creating a generic userspace device framework,
>>>>> virtio should be the right way IMO.
>>>> OK, it is the right way, but may not be the effective one.
>>>>
>>> Maybe, but I think we can try to optimize it.
>>>
>>>>> And with the vdpa framework, the
>>>>> userspace device can serve both virtual machines and containers.
>>>> virtio is good for VMs, but I'm not sure it is good enough for other
>>>> cases.
>>>>
>>>>> Regarding the performance issue, actually I can't measure how much of
>>>>> the performance loss is due to the difference between virtio ring and
>>>>> iouring. But I think it should be very small. The main costs come from
>>>>> the two bottlenecks I mentioned before which could be mitigated in the
>>>>> future.
>>>> Per my understanding, at least there are two places where virtio ring is
>>>> less efficient than io_uring:
>>>>
>>> I might have misunderstood what you mean by virtio ring before. My
>>> previous understanding of the virtio ring does not include the
>>> virtio-blk driver.
>>>
>>>> 1) io_uring uses standalone submission queue(SQ) and completion queue(CQ),
>>>> so no contention exists between submission and completion; but virtio queue
>>>> requires per-vq lock in both submission and completion.
>>>>
>>> Yes, this is the bottleneck of the virtio-blk driver, even in the VM
>>> case. We are also trying to optimize this lock.
>>>
>>> One way to mitigate it is making submission and completion happen in
>>> the same core.
>> QEMU sizes virtio-blk device num-queues to match the vCPU count. The
>> virtio-blk driver is a blk-mq driver, so submissions and completions
>> for a given virtqueue should already be processed by the same vCPU.
>>
>> Unless the device is misconfigured or the guest software chooses a
>> custom vq:vCPU mapping, there should be no vq lock contention between
>> vCPUs.
>>
>> I can think of a reason why submission and completion require
>> coordination: descriptors are occupied until completion. The
>> submission logic chooses free descriptors from the table. The
>> completion logic returns free descriptors so they can be used in
>> future submissions.
>>
> Yes, we need to maintain a head pointer of the free descriptors in
> both submission and completion path.


Not necessarily after IN_ORDER?

Thanks


>
>> Other ring designs expose the submission ring head AND tail index so
>> that it's clear which submissions have been processed by the other
>> side. Once processed, the descriptors are no longer occupied and can
>> be reused for future submissions immediately. This means that
>> submission and completion do not share state.
>>
>> This is for the split virtqueue layout. For the packed layout I think
>> there is a similar dependency because descriptors are used for both
>> submission and completion.
>>
>> I have CCed Michael Tsirkin in case he has any thoughts on the
>> independence of submission and completion in the vring design.
>>
>> BTW I have written about the differences between the VIRTIO, NVMe, and io_uring
>> descriptor ring designs here:
>> https://blog.vmsplice.net/2022/06/comparing-virtio-nvme-and-iouring-queue.html
>>
> Good to know that!
>
> Thanks,
> Yongji
>
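
The coordination described above (the submission path takes descriptors from a free list and the completion path returns them) can be sketched with a toy model. This is a simplified illustration in Python, not the kernel's vring code; the class and field names (`FreeListDescTable`, `free_head`, `num_free`) are chosen for the example.

```python
class FreeListDescTable:
    """Toy descriptor table with a linked free list (not the kernel's code)."""

    def __init__(self, size):
        # next_free[i] is the descriptor that follows i on the free list.
        self.next_free = list(range(1, size)) + [None]
        self.free_head = 0          # written by BOTH submit() and complete()
        self.num_free = size

    def submit(self):
        """Submission path: pop one descriptor off the free list."""
        if self.num_free == 0:
            raise RuntimeError("no free descriptors")
        idx = self.free_head
        self.free_head = self.next_free[idx]
        self.num_free -= 1
        return idx

    def complete(self, idx):
        """Completion path: push the descriptor back onto the free list."""
        self.next_free[idx] = self.free_head
        self.free_head = idx
        self.num_free += 1


if __name__ == "__main__":
    table = FreeListDescTable(4)
    a, b = table.submit(), table.submit()
    table.complete(a)               # completions may arrive in any order
    print("in flight:", b, "next free:", table.free_head)
```

Because `submit()` and `complete()` both modify `free_head` and `num_free`, running them concurrently requires mutual exclusion, which is the role the per-virtqueue lock plays in the discussion above.
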


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: ublk-qcow2: ublk-qcow2 is available
  2022-10-21  6:30                           ` Jason Wang
@ 2022-10-25  8:17                             ` Yongji Xie
  2022-10-25 12:02                               ` Stefan Hajnoczi
  0 siblings, 1 reply; 44+ messages in thread
From: Yongji Xie @ 2022-10-25  8:17 UTC (permalink / raw)
  To: Jason Wang
  Cc: Stefan Hajnoczi, Michael S. Tsirkin, Ming Lei, Ziyang Zhang,
	Stefan Hajnoczi, io-uring, linux-block, linux-kernel,
	Denis V. Lunev, Xiaoguang Wang

On Fri, Oct 21, 2022 at 2:30 PM Jason Wang <[email protected]> wrote:
>
>
> On 2022/10/21 13:33, Yongji Xie wrote:
> > Yes, we need to maintain a head pointer of the free descriptors in
> > both submission and completion path.
>
>
> Not necessarily after IN_ORDER?
>

Sounds like a good idea.

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: ublk-qcow2: ublk-qcow2 is available
  2022-10-25  8:17                             ` Yongji Xie
@ 2022-10-25 12:02                               ` Stefan Hajnoczi
  2022-10-28 13:33                                 ` Yongji Xie
  2022-11-01  2:36                                 ` Jason Wang
  0 siblings, 2 replies; 44+ messages in thread
From: Stefan Hajnoczi @ 2022-10-25 12:02 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Jason Wang, Michael S. Tsirkin, Ming Lei, Ziyang Zhang,
	Stefan Hajnoczi, io-uring, linux-block, linux-kernel,
	Denis V. Lunev, Xiaoguang Wang

On Tue, 25 Oct 2022 at 04:17, Yongji Xie <[email protected]> wrote:
>
> On Fri, Oct 21, 2022 at 2:30 PM Jason Wang <[email protected]> wrote:
> >
> >
> > > On 2022/10/21 13:33, Yongji Xie wrote:
> > > On Tue, Oct 18, 2022 at 10:54 PM Stefan Hajnoczi <[email protected]> wrote:
> > >> On Tue, 18 Oct 2022 at 09:17, Yongji Xie <[email protected]> wrote:
> > >>> On Tue, Oct 18, 2022 at 2:59 PM Ming Lei <[email protected]> wrote:
> > >>>> On Mon, Oct 17, 2022 at 07:11:59PM +0800, Yongji Xie wrote:
> > >>>>> On Fri, Oct 14, 2022 at 8:57 PM Ming Lei <[email protected]> wrote:
> > >>>>>> On Thu, Oct 13, 2022 at 02:48:04PM +0800, Yongji Xie wrote:
> > >>>>>>> On Wed, Oct 12, 2022 at 10:22 PM Stefan Hajnoczi <[email protected]> wrote:
> > >>>>>>>> On Sat, 8 Oct 2022 at 04:43, Ziyang Zhang <[email protected]> wrote:
> > >>>>>>>>> On 2022/10/5 12:18, Ming Lei wrote:
> > >>>>>>>>>> On Tue, Oct 04, 2022 at 09:53:32AM -0400, Stefan Hajnoczi wrote:
> > >>>>>>>>>>> On Tue, 4 Oct 2022 at 05:44, Ming Lei <[email protected]> wrote:
> > >>>>>>>>>>>> On Mon, Oct 03, 2022 at 03:53:41PM -0400, Stefan Hajnoczi wrote:
> > >>>>>>>>>>>>> On Fri, Sep 30, 2022 at 05:24:11PM +0800, Ming Lei wrote:
> > >>>>>>>>>>>>>> ublk-qcow2 is available now.
> > >>>>>>>>>>>>> Cool, thanks for sharing!
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> So far it provides basic read/write function, and compression and snapshot
> > >>>>>>>>>>>>>> aren't supported yet. The target/backend implementation is completely
> > >>>>>>>>>>>>>> based on io_uring, and share the same io_uring with ublk IO command
> > >>>>>>>>>>>>>> handler, just like what ublk-loop does.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Follows the main motivations of ublk-qcow2:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> - building one complicated target from scratch helps libublksrv APIs/functions
> > >>>>>>>>>>>>>>    become mature/stable more quickly, since qcow2 is complicated and needs more
> > >>>>>>>>>>>>>>    requirement from libublksrv compared with other simple ones(loop, null)
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> - there are several attempts of implementing qcow2 driver in kernel, such as
> > >>>>>>>>>>>>>>    ``qloop`` [2], ``dm-qcow2`` [3] and ``in kernel qcow2(ro)`` [4], so ublk-qcow2
> > >>>>>>>>>>>>>>    might be useful for covering requirements in this field
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> - performance comparison with qemu-nbd, and it was my 1st thought to evaluate
> > >>>>>>>>>>>>>>    performance of ublk/io_uring backend by writing one ublk-qcow2 since ublksrv
> > >>>>>>>>>>>>>>    is started
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> - help to abstract common building block or design pattern for writing new ublk
> > >>>>>>>>>>>>>>    target/backend
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> So far it basically passes xfstest(XFS) test by using ublk-qcow2 block
> > >>>>>>>>>>>>>> device as TEST_DEV, and kernel building workload is verified too. Also
> > >>>>>>>>>>>>>> soft update approach is applied in meta flushing, and meta data
> > >>>>>>>>>>>>>> integrity is guaranteed, 'make test T=qcow2/040' covers this kind of
> > >>>>>>>>>>>>>> test, and only cluster leak is reported during this test.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> The performance data looks much better compared with qemu-nbd, see
> > >>>>>>>>>>>>>> details in commit log[1], README[5] and STATUS[6]. And the test covers both
> > >>>>>>>>>>>>>> empty image and pre-allocated image, for example of pre-allocated qcow2
> > >>>>>>>>>>>>>> image(8GB):
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> - qemu-nbd (make test T=qcow2/002)
> > >>>>>>>>>>>>> Single queue?
> > >>>>>>>>>>>> Yeah.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>>>      randwrite(4k): jobs 1, iops 24605
> > >>>>>>>>>>>>>>      randread(4k): jobs 1, iops 30938
> > >>>>>>>>>>>>>>      randrw(4k): jobs 1, iops read 13981 write 14001
> > >>>>>>>>>>>>>>      rw(512k): jobs 1, iops read 724 write 728
> > >>>>>>>>>>>>> Please try qemu-storage-daemon's VDUSE export type as well. The
> > >>>>>>>>>>>>> command-line should be similar to this:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>    # modprobe virtio_vdpa # attaches vDPA devices to host kernel
> > >>>>>>>>>>>> Not found virtio_vdpa module even though I enabled all the following
> > >>>>>>>>>>>> options:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>          --- vDPA drivers
> > >>>>>>>>>>>>            <M>   vDPA device simulator core
> > >>>>>>>>>>>>            <M>     vDPA simulator for networking device
> > >>>>>>>>>>>>            <M>     vDPA simulator for block device
> > >>>>>>>>>>>>            <M>   VDUSE (vDPA Device in Userspace) support
> > >>>>>>>>>>>>            <M>   Intel IFC VF vDPA driver
> > >>>>>>>>>>>>            <M>   Virtio PCI bridge vDPA driver
> > >>>>>>>>>>>>            <M>   vDPA driver for Alibaba ENI
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> BTW, my test environment is VM and the shared data is done in VM too, and
> > >>>>>>>>>>>> can virtio_vdpa be used inside VM?
> > >>>>>>>>>>> I hope Xie Yongji can help explain how to benchmark VDUSE.
> > >>>>>>>>>>>
> > >>>>>>>>>>> virtio_vdpa is available inside guests too. Please check that
> > >>>>>>>>>>> VIRTIO_VDPA ("vDPA driver for virtio devices") is enabled in "Virtio
> > >>>>>>>>>>> drivers" menu.
> > >>>>>>>>>>>
> > >>>>>>>>>>>>>    # modprobe vduse
> > >>>>>>>>>>>>>    # qemu-storage-daemon \
> > >>>>>>>>>>>>>        --blockdev file,filename=test.qcow2,cache.direct=on|off,aio=native,node-name=file \
> > >>>>>>>>>>>>>        --blockdev qcow2,file=file,node-name=qcow2 \
> > >>>>>>>>>>>>>        --object iothread,id=iothread0 \
> > >>>>>>>>>>>>>        --export vduse-blk,id=vduse0,name=vduse0,num-queues=$(nproc),node-name=qcow2,writable=on,iothread=iothread0
> > >>>>>>>>>>>>>    # vdpa dev add name vduse0 mgmtdev vduse
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> A virtio-blk device should appear and xfstests can be run on it
> > >>>>>>>>>>>>> (typically /dev/vda unless you already have other virtio-blk devices).
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Afterwards you can destroy the device using:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>    # vdpa dev del vduse0
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> - ublk-qcow2 (make test T=qcow2/022)
> > >>>>>>>>>>>>> There are a lot of other factors not directly related to NBD vs ublk. In
> > >>>>>>>>>>>>> order to get an apples-to-apples comparison with qemu-* a ublk export
> > >>>>>>>>>>>>> type is needed in qemu-storage-daemon. That way only the difference is
> > >>>>>>>>>>>>> the ublk interface and the rest of the code path is identical, making it
> > >>>>>>>>>>>>> possible to compare NBD, VDUSE, ublk, etc more precisely.
> > >>>>>>>>>>>> Maybe not true.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> ublk-qcow2 uses io_uring to handle all backend IO(include meta IO) completely,
> > >>>>>>>>>>>> and so far single io_uring/pthread is for handling all qcow2 IOs and IO
> > >>>>>>>>>>>> command.
> > >>>>>>>>>>> qemu-nbd doesn't use io_uring to handle the backend IO, so we don't
> > >>>>>>>>>> I tried to use it via --aio=io_uring for setting up qemu-nbd, but not succeed.
> > >>>>>>>>>>
> > >>>>>>>>>>> know whether the benchmark demonstrates that ublk is faster than NBD,
> > >>>>>>>>>>> that the ublk-qcow2 implementation is faster than qemu-nbd's qcow2,
> > >>>>>>>>>>> whether there are miscellaneous implementation differences between
> > >>>>>>>>>>> ublk-qcow2 and qemu-nbd (like using the same io_uring context for both
> > >>>>>>>>>>> ublk and backend IO), or something else.
> > >>>>>>>>>> The theory shouldn't be too complicated:
> > >>>>>>>>>>
> > >>>>>>>>>> 1) io_uring passthrough(pt) communication is faster than socket, and io command
> > >>>>>>>>>> is carried over io_uring pt commands, and should be faster than virtio
> > >>>>>>>>>> communication too.
> > >>>>>>>>>>
> > >>>>>>>>>> 2) io_uring io handling is faster than libaio, which is used in the
> > >>>>>>>>>> test on qemu-nbd, and all qcow2 backend io(including meta io) is handled
> > >>>>>>>>>> by io_uring.
> > >>>>>>>>>>
> > >>>>>>>>>> https://github.com/ming1/ubdsrv/blob/master/tests/common/qcow2_common
> > >>>>>>>>>>
> > >>>>>>>>>> 3) ublk uses one single io_uring to handle all io commands and qcow2
> > >>>>>>>>>> backend IOs, so batching handling is common, and it is easy to see
> > >>>>>>>>>> dozens of IOs/io commands handled in single syscall, or even more.
> > >>>>>>>>>>
> > >>>>>>>>>>> I'm suggesting measuring changes to just 1 variable at a time.
> > >>>>>>>>>>> Otherwise it's hard to reach a conclusion about the root cause of the
> > >>>>>>>>>>> performance difference. Let's learn why ublk-qcow2 performs well.
> > >>>>>>>>>> Turns out the latest Fedora 37-beta doesn't support vdpa yet, so I built
> > >>>>>>>>>> qemu from the latest github tree, and finally it starts to work. And test kernel
> > >>>>>>>>>> is v6.0 release.
> > >>>>>>>>>>
> > >>>>>>>>>> Follows the test result, and all three devices are setup as single
> > >>>>>>>>>> queue, and all tests are run in single job, still done in one VM, and
> > >>>>>>>>>> the test images are stored on XFS/virtio-scsi backed SSD.
> > >>>>>>>>>>
> > >>>>>>>>>> The 1st group tests all three block device which is backed by empty
> > >>>>>>>>>> qcow2 image.
> > >>>>>>>>>>
> > >>>>>>>>>> The 2nd group tests all the three block devices backed by pre-allocated
> > >>>>>>>>>> qcow2 image.
> > >>>>>>>>>>
> > >>>>>>>>>> Except for big sequential IO(512K), there is still a sizable gap between
> > >>>>>>>>>> vdpa-virtio-blk and ublk.
> > >>>>>>>>>>
> > >>>>>>>>>> 1. run fio on block device over empty qcow2 image
> > >>>>>>>>>> 1) qemu-nbd
> > >>>>>>>>>> running qcow2/001
> > >>>>>>>>>> run perf test on empty qcow2 image via nbd
> > >>>>>>>>>>        fio (nbd(/mnt/data/ublk_null_8G_nYbgF.qcow2), libaio, bs 4k, dio, hw queues:1)...
> > >>>>>>>>>>        randwrite: jobs 1, iops 8549
> > >>>>>>>>>>        randread: jobs 1, iops 34829
> > >>>>>>>>>>        randrw: jobs 1, iops read 11363 write 11333
> > >>>>>>>>>>        rw(512k): jobs 1, iops read 590 write 597
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> 2) ublk-qcow2
> > >>>>>>>>>> running qcow2/021
> > >>>>>>>>>> run perf test on empty qcow2 image via ublk
> > >>>>>>>>>>        fio (ublk/qcow2( -f /mnt/data/ublk_null_8G_s761j.qcow2), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0).
> > >>>>>>>>>>        randwrite: jobs 1, iops 16086
> > >>>>>>>>>>        randread: jobs 1, iops 172720
> > >>>>>>>>>>        randrw: jobs 1, iops read 35760 write 35702
> > >>>>>>>>>>        rw(512k): jobs 1, iops read 1140 write 1149
> > >>>>>>>>>>
> > >>>>>>>>>> 3) vdpa-virtio-blk
> > >>>>>>>>>> running debug/test_dev
> > >>>>>>>>>> run io test on specified device
> > >>>>>>>>>>        fio (vdpa(/dev/vdc), libaio, bs 4k, dio, hw queues:1)...
> > >>>>>>>>>>        randwrite: jobs 1, iops 8626
> > >>>>>>>>>>        randread: jobs 1, iops 126118
> > >>>>>>>>>>        randrw: jobs 1, iops read 17698 write 17665
> > >>>>>>>>>>        rw(512k): jobs 1, iops read 1023 write 1031
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> 2. run fio on block device over pre-allocated qcow2 image
> > >>>>>>>>>> 1) qemu-nbd
> > >>>>>>>>>> running qcow2/002
> > >>>>>>>>>> run perf test on pre-allocated qcow2 image via nbd
> > >>>>>>>>>>        fio (nbd(/mnt/data/ublk_data_8G_sc0SB.qcow2), libaio, bs 4k, dio, hw queues:1)...
> > >>>>>>>>>>        randwrite: jobs 1, iops 21439
> > >>>>>>>>>>        randread: jobs 1, iops 30336
> > >>>>>>>>>>        randrw: jobs 1, iops read 11476 write 11449
> > >>>>>>>>>>        rw(512k): jobs 1, iops read 718 write 722
> > >>>>>>>>>>
> > >>>>>>>>>> 2) ublk-qcow2
> > >>>>>>>>>> running qcow2/022
> > >>>>>>>>>> run perf test on pre-allocated qcow2 image via ublk
> > >>>>>>>>>>        fio (ublk/qcow2( -f /mnt/data/ublk_data_8G_yZiaJ.qcow2), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0).
> > >>>>>>>>>>        randwrite: jobs 1, iops 98757
> > >>>>>>>>>>        randread: jobs 1, iops 110246
> > >>>>>>>>>>        randrw: jobs 1, iops read 47229 write 47161
> > >>>>>>>>>>        rw(512k): jobs 1, iops read 1416 write 1427
> > >>>>>>>>>>
> > >>>>>>>>>> 3) vdpa-virtio-blk
> > >>>>>>>>>> running debug/test_dev
> > >>>>>>>>>> run io test on specified device
> > >>>>>>>>>>        fio (vdpa(/dev/vdc), libaio, bs 4k, dio, hw queues:1)...
> > >>>>>>>>>>        randwrite: jobs 1, iops 47317
> > >>>>>>>>>>        randread: jobs 1, iops 74092
> > >>>>>>>>>>        randrw: jobs 1, iops read 27196 write 27234
> > >>>>>>>>>>        rw(512k): jobs 1, iops read 1447 write 1458
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>> Hi All,
> > >>>>>>>>>
> > >>>>>>>>> We are interested in VDUSE vs UBLK, too. And I have tested them with nullblk backend.
> > >>>>>>>>> Let me share some results here.
> > >>>>>>>>>
> > >>>>>>>>> I setup UBLK with:
> > >>>>>>>>>    ublk add -t loop -f /dev/nullb0 -d QUEUE_DEPTH -q NR_QUEUE
> > >>>>>>>>>
> > >>>>>>>>> I setup VDUSE with:
> > >>>>>>>>>    qemu-storage-daemon \
> > >>>>>>>>>         --chardev socket,id=charmonitor,path=/tmp/qmp.sock,server=on,wait=off \
> > >>>>>>>>>         --monitor chardev=charmonitor \
> > >>>>>>>>>         --blockdev driver=host_device,cache.direct=on,filename=/dev/nullb0,node-name=disk0 \
> > >>>>>>>>>         --export vduse-blk,id=test,node-name=disk0,name=vduse_test,writable=on,num-queues=NR_QUEUE,queue-size=QUEUE_DEPTH
> > >>>>>>>>>
> > >>>>>>>>> Here QUEUE_DEPTH is 1, 32 or 128 and NR_QUEUE is 1 or 4.
> > >>>>>>>>>
> > >>>>>>>>> Note:
> > >>>>>>>>> (1) VDUSE requires QUEUE_DEPTH >= 2. I cannot setup QUEUE_DEPTH to 1.
> > >>>>>>>>> (2) I use qemu 7.1.0-rc3. It supports vduse-blk.
> > >>>>>>>>> (3) I do not use ublk null target so that the test is fair.
> > >>>>>>>>> (4) I setup fio with direct=1, bs=4k.
> > >>>>>>>>>
> > >>>>>>>>> ------------------------------
> > >>>>>>>>> 1 job 1 iodepth, lat (usec)
> > >>>>>>>>>                  vduse   ublk
> > >>>>>>>>> seq-read        22.55   11.15
> > >>>>>>>>> rand-read       22.49   11.17
> > >>>>>>>>> seq-write       25.67   10.25
> > >>>>>>>>> rand-write      24.13   10.16
> > >>>>>>>> Thanks for sharing. Any idea what the bottlenecks are for vduse and ublk?
> > >>>>>>>>
> > >>>>>>> I think one reason for the latency gap of sync I/O is that vduse uses
> > >>>>>>> workqueue in the I/O completion path but ublk doesn't.
> > >>>>>>>
> > >>>>>>> And one bottleneck for the async I/O in vduse is that vduse will do
> > >>>>>>> memcpy inside the critical section of virtqueue's spinlock in the
> > >>>>>>> virtio-blk driver. That will hurt the performance heavily when
> > >>>>>>> virtio_queue_rq() and virtblk_done() run concurrently. And it can be
> > >>>>>>> mitigated by the advance DMA mapping feature [1] or irq binding
> > >>>>>>> support [2].
> > >>>>>> Hi Yongji,
> > >>>>>>
> > >>>>>> Yeah, that is the cost you paid for virtio. Wrt. userspace block device
> > >>>>>> or other sort of userspace devices, cmd completion is driven by
> > >>>>>> userspace, not sure if one such 'irq' is needed.
> > >>>>> I'm not sure, it can be an optional feature in the future if needed.
> > >>>>>
> > >>>>>> Even not sure if virtio
> > >>>>>> ring is one good choice for such use case, given io_uring has been proved
> > >>>>>> as very efficient(should be better than virtio ring, IMO).
> > >>>>>>
> > >>>>> Since vduse is aimed at creating a generic userspace device framework,
> > >>>>> virtio should be the right way IMO.
> > >>>> OK, it is the right way, but may not be the effective one.
> > >>>>
> > >>> Maybe, but I think we can try to optimize it.
> > >>>
> > >>>>> And with the vdpa framework, the
> > >>>>> userspace device can serve both virtual machines and containers.
> > >>>> virtio is good for VM, but not sure it is good enough for other
> > >>>> cases.
> > >>>>
> > >>>>> Regarding the performance issue, actually I can't measure how much of
> > >>>>> the performance loss is due to the difference between virtio ring and
> > >>>>> iouring. But I think it should be very small. The main costs come from
> > >>>>> the two bottlenecks I mentioned before which could be mitigated in the
> > >>>>> future.
> > >>>> Per my understanding, at least there are two places where virtio ring is
> > >>>> less efficient than io_uring:
> > >>>>
> > >>> I might have misunderstood what you mean by virtio ring before. My
> > >>> previous understanding of the virtio ring does not include the
> > >>> virtio-blk driver.
> > >>>
> > >>>> 1) io_uring uses standalone submission queue(SQ) and completion queue(CQ),
> > >>>> so no contention exists between submission and completion; but virtio queue
> > >>>> requires per-vq lock in both submission and completion.
> > >>>>
> > >>> Yes, this is the bottleneck of the virtio-blk driver, even in the VM
> > >>> case. We are also trying to optimize this lock.
> > >>>
> > >>> One way to mitigate it is making submission and completion happen in
> > >>> the same core.
> > >> QEMU sizes virtio-blk device num-queues to match the vCPU count. The
> > >> virtio-blk driver is a blk-mq driver, so submissions and completions
> > >> for a given virtqueue should already be processed by the same vCPU.
> > >>
> > >> Unless the device is misconfigured or the guest software chooses a
> > >> custom vq:vCPU mapping, there should be no vq lock contention between
> > >> vCPUs.
> > >>
> > >> I can think of a reason why submission and completion require
> > >> coordination: descriptors are occupied until completion. The
> > >> submission logic chooses free descriptors from the table. The
> > >> completion logic returns free descriptors so they can be used in
> > >> future submissions.
> > >>
> > > Yes, we need to maintain a head pointer of the free descriptors in
> > > both submission and completion path.
> >
> >
> > Not necessarily after IN_ORDER?
> >
>
> Sounds like a good idea.

Submission and completion are still not 100% independent with IN_ORDER
because descriptors are still in use until completion. It may not be
necessary to keep a freelist, but you cannot actually use the
descriptors for new submissions until existing requests complete. Is
that correct?
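
To make the question concrete, here is a toy model of an in-order
descriptor ring. It is only a sketch under my own assumptions (fixed-size
ring, strictly in-order completion); the class and method names are
hypothetical and this is not virtio code. It shows that two counters can
replace the freelist, while a full ring still blocks submission until
something completes:

```python
class InOrderRing:
    """Toy descriptor ring with in-order completion (hypothetical, not virtio code)."""

    def __init__(self, size):
        self.size = size
        self.submitted = 0  # total descriptors ever submitted (submission path only)
        self.completed = 0  # total descriptors ever completed (completion path only)

    def in_flight(self):
        # Descriptors that are still occupied by outstanding requests.
        return self.submitted - self.completed

    def submit(self):
        """Return the slot used for a new request, or None if the ring is full."""
        if self.in_flight() == self.size:
            return None  # every descriptor is still in use until completion
        slot = self.submitted % self.size
        self.submitted += 1
        return slot

    def complete(self):
        """Complete the oldest in-flight request and return the slot it frees."""
        if self.in_flight() == 0:
            raise RuntimeError("nothing in flight")
        slot = self.completed % self.size
        self.completed += 1
        return slot
```

No freelist is kept here, but submit() still has to read the counter
that complete() advances, so the two paths are not fully independent.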

Anyway, independent submission and completion rings aren't perfect
either because independent submission introduces a new point of
communication: the device must tell the driver when submitted
descriptors have been processed. That means the driver must access a
hardware register on the device or the device must DMA to RAM. So it
involves extra bus traffic that is not necessary if descriptors are in
use until completion. io_uring gets away with it because the
io_uring_enter(2) syscall is synchronous and can therefore return the
number of consumed sq elements for free.

There are ways to minimize that cost:
1. The driver only needs to fetch the device's sq index when it has
run out of sq ring space.
2. The device can include sq index updates with completions. This is
what NVMe does with the CQE SQ Head Pointer field, but the
disadvantage is that the driver has no way of determining the sq index
until a completion occurs.
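
Both ideas can be sketched in a few lines. This is a toy model under my
own assumptions, not NVMe or virtio code: Device, DriverSQ and all method
names are hypothetical, and register_reads merely counts how often the
driver pays for an explicit read of the device's sq index.

```python
class Device:
    """Toy device; sq_head stands in for a register the driver can read."""

    def __init__(self):
        self.sq_head = 0         # number of sq entries the device has consumed
        self.register_reads = 0  # how many times the driver had to ask

    def consume(self, n):
        self.sq_head += n

    def read_sq_head(self):
        self.register_reads += 1  # models the extra bus traffic
        return self.sq_head


class DriverSQ:
    """Driver-side view of a submission ring with a cached device sq index."""

    def __init__(self, device, size):
        self.device = device
        self.size = size
        self.tail = 0         # entries submitted by the driver so far
        self.cached_head = 0  # last sq index the driver learned about

    def _free(self):
        return self.size - (self.tail - self.cached_head)

    def submit(self):
        if self._free() == 0:
            # Idea 1: fetch the device's sq index only when out of ring space.
            self.cached_head = self.device.read_sq_head()
            if self._free() == 0:
                return False  # ring really is full
        self.tail += 1
        return True

    def on_completion(self, sq_head_from_completion):
        # Idea 2: the completion carries the sq index, so no register read.
        self.cached_head = sq_head_from_completion
```

In this model a driver that receives completions regularly never calls
read_sq_head() at all; the explicit read only happens when the ring
looks full.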

Stefan

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: ublk-qcow2: ublk-qcow2 is available
  2022-10-25 12:02                               ` Stefan Hajnoczi
@ 2022-10-28 13:33                                 ` Yongji Xie
  2022-11-01  2:36                                 ` Jason Wang
  1 sibling, 0 replies; 44+ messages in thread
From: Yongji Xie @ 2022-10-28 13:33 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Jason Wang, Michael S. Tsirkin, Ming Lei, Ziyang Zhang,
	Stefan Hajnoczi, io-uring, linux-block, linux-kernel,
	Denis V. Lunev, Xiaoguang Wang

On Tue, Oct 25, 2022 at 8:02 PM Stefan Hajnoczi <[email protected]> wrote:
>
> On Tue, 25 Oct 2022 at 04:17, Yongji Xie <[email protected]> wrote:
> >
> > On Fri, Oct 21, 2022 at 2:30 PM Jason Wang <[email protected]> wrote:
> > >
> > >
> > > > On 2022/10/21 13:33, Yongji Xie wrote:
> > > > On Tue, Oct 18, 2022 at 10:54 PM Stefan Hajnoczi <[email protected]> wrote:
> > > >> On Tue, 18 Oct 2022 at 09:17, Yongji Xie <[email protected]> wrote:
> > > >>> On Tue, Oct 18, 2022 at 2:59 PM Ming Lei <[email protected]> wrote:
> > > >>>> On Mon, Oct 17, 2022 at 07:11:59PM +0800, Yongji Xie wrote:
> > > >>>>> On Fri, Oct 14, 2022 at 8:57 PM Ming Lei <[email protected]> wrote:
> > > >>>>>> On Thu, Oct 13, 2022 at 02:48:04PM +0800, Yongji Xie wrote:
> > > >>>>>>> On Wed, Oct 12, 2022 at 10:22 PM Stefan Hajnoczi <[email protected]> wrote:
> > > >>>>>>>> On Sat, 8 Oct 2022 at 04:43, Ziyang Zhang <[email protected]> wrote:
> > > >>>>>>>>> On 2022/10/5 12:18, Ming Lei wrote:
> > > >>>>>>>>>> On Tue, Oct 04, 2022 at 09:53:32AM -0400, Stefan Hajnoczi wrote:
> > > >>>>>>>>>>> On Tue, 4 Oct 2022 at 05:44, Ming Lei <[email protected]> wrote:
> > > >>>>>>>>>>>> On Mon, Oct 03, 2022 at 03:53:41PM -0400, Stefan Hajnoczi wrote:
> > > >>>>>>>>>>>>> On Fri, Sep 30, 2022 at 05:24:11PM +0800, Ming Lei wrote:
> > > >>>>>>>>>>>>>> ublk-qcow2 is available now.
> > > >>>>>>>>>>>>> Cool, thanks for sharing!
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> So far it provides basic read/write function, and compression and snapshot
> > > >>>>>>>>>>>>>> aren't supported yet. The target/backend implementation is completely
> > > >>>>>>>>>>>>>> based on io_uring, and share the same io_uring with ublk IO command
> > > >>>>>>>>>>>>>> handler, just like what ublk-loop does.
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> Follows the main motivations of ublk-qcow2:
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> - building one complicated target from scratch helps libublksrv APIs/functions
> > > >>>>>>>>>>>>>>    become mature/stable more quickly, since qcow2 is complicated and needs more
> > > >>>>>>>>>>>>>>    requirement from libublksrv compared with other simple ones(loop, null)
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> - there are several attempts of implementing qcow2 driver in kernel, such as
> > > >>>>>>>>>>>>>>    ``qloop`` [2], ``dm-qcow2`` [3] and ``in kernel qcow2(ro)`` [4], so ublk-qcow2
> > > >>>>>>>>>>>>>>    might be useful for covering requirements in this field
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> - performance comparison with qemu-nbd, and it was my 1st thought to evaluate
> > > >>>>>>>>>>>>>>    performance of ublk/io_uring backend by writing one ublk-qcow2 since ublksrv
> > > >>>>>>>>>>>>>>    is started
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> - help to abstract common building block or design pattern for writing new ublk
> > > >>>>>>>>>>>>>>    target/backend
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> So far it basically passes xfstest(XFS) test by using ublk-qcow2 block
> > > >>>>>>>>>>>>>> device as TEST_DEV, and kernel building workload is verified too. Also
> > > >>>>>>>>>>>>>> soft update approach is applied in meta flushing, and meta data
> > > >>>>>>>>>>>>>> integrity is guaranteed, 'make test T=qcow2/040' covers this kind of
> > > >>>>>>>>>>>>>> test, and only cluster leak is reported during this test.
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> The performance data looks much better compared with qemu-nbd, see
> > > >>>>>>>>>>>>>> details in commit log[1], README[5] and STATUS[6]. And the test covers both
> > > >>>>>>>>>>>>>> empty image and pre-allocated image, for example of pre-allocated qcow2
> > > >>>>>>>>>>>>>> image(8GB):
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> - qemu-nbd (make test T=qcow2/002)
> > > >>>>>>>>>>>>> Single queue?
> > > >>>>>>>>>>>> Yeah.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>>>      randwrite(4k): jobs 1, iops 24605
> > > >>>>>>>>>>>>>>      randread(4k): jobs 1, iops 30938
> > > >>>>>>>>>>>>>>      randrw(4k): jobs 1, iops read 13981 write 14001
> > > >>>>>>>>>>>>>>      rw(512k): jobs 1, iops read 724 write 728
> > > >>>>>>>>>>>>> Please try qemu-storage-daemon's VDUSE export type as well. The
> > > >>>>>>>>>>>>> command-line should be similar to this:
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>    # modprobe virtio_vdpa # attaches vDPA devices to host kernel
> > > >>>>>>>>>>>> Not found virtio_vdpa module even though I enabled all the following
> > > >>>>>>>>>>>> options:
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>          --- vDPA drivers
> > > >>>>>>>>>>>>            <M>   vDPA device simulator core
> > > >>>>>>>>>>>>            <M>     vDPA simulator for networking device
> > > >>>>>>>>>>>>            <M>     vDPA simulator for block device
> > > >>>>>>>>>>>>            <M>   VDUSE (vDPA Device in Userspace) support
> > > >>>>>>>>>>>>            <M>   Intel IFC VF vDPA driver
> > > >>>>>>>>>>>>            <M>   Virtio PCI bridge vDPA driver
> > > >>>>>>>>>>>>            <M>   vDPA driver for Alibaba ENI
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> BTW, my test environment is VM and the shared data is done in VM too, and
> > > >>>>>>>>>>>> can virtio_vdpa be used inside VM?
> > > >>>>>>>>>>> I hope Xie Yongji can help explain how to benchmark VDUSE.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> virtio_vdpa is available inside guests too. Please check that
> > > >>>>>>>>>>> VIRTIO_VDPA ("vDPA driver for virtio devices") is enabled in "Virtio
> > > >>>>>>>>>>> drivers" menu.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>>>    # modprobe vduse
> > > >>>>>>>>>>>>>    # qemu-storage-daemon \
> > > >>>>>>>>>>>>>        --blockdev file,filename=test.qcow2,cache.direct=on|off,aio=native,node-name=file \
> > > >>>>>>>>>>>>>        --blockdev qcow2,file=file,node-name=qcow2 \
> > > >>>>>>>>>>>>>        --object iothread,id=iothread0 \
> > > >>>>>>>>>>>>>        --export vduse-blk,id=vduse0,name=vduse0,num-queues=$(nproc),node-name=qcow2,writable=on,iothread=iothread0
> > > >>>>>>>>>>>>>    # vdpa dev add name vduse0 mgmtdev vduse
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> A virtio-blk device should appear and xfstests can be run on it
> > > >>>>>>>>>>>>> (typically /dev/vda unless you already have other virtio-blk devices).
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Afterwards you can destroy the device using:
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>    # vdpa dev del vduse0
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> - ublk-qcow2 (make test T=qcow2/022)
> > > >>>>>>>>>>>>> There are a lot of other factors not directly related to NBD vs ublk. In
> > > >>>>>>>>>>>>> order to get an apples-to-apples comparison with qemu-* a ublk export
> > > >>>>>>>>>>>>> type is needed in qemu-storage-daemon. That way only the difference is
> > > >>>>>>>>>>>>> the ublk interface and the rest of the code path is identical, making it
> > > >>>>>>>>>>>>> possible to compare NBD, VDUSE, ublk, etc more precisely.
> > > >>>>>>>>>>>> Maybe not true.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> ublk-qcow2 uses io_uring to handle all backend IO(include meta IO) completely,
> > > >>>>>>>>>>>> and so far single io_uring/pthread is for handling all qcow2 IOs and IO
> > > >>>>>>>>>>>> command.
> > > >>>>>>>>>>> qemu-nbd doesn't use io_uring to handle the backend IO, so we don't
> > > >>>>>>>>>> I tried to use it via --aio=io_uring for setting up qemu-nbd, but not succeed.
> > > >>>>>>>>>>
> > > >>>>>>>>>>> know whether the benchmark demonstrates that ublk is faster than NBD,
> > > >>>>>>>>>>> that the ublk-qcow2 implementation is faster than qemu-nbd's qcow2,
> > > >>>>>>>>>>> whether there are miscellaneous implementation differences between
> > > >>>>>>>>>>> ublk-qcow2 and qemu-nbd (like using the same io_uring context for both
> > > >>>>>>>>>>> ublk and backend IO), or something else.
> > > >>>>>>>>>> The theory shouldn't be too complicated:
> > > >>>>>>>>>>
> > > >>>>>>>>>> 1) io_uring passthrough(pt) communication is faster than socket, and io command
> > > >>>>>>>>>> is carried over io_uring pt commands, and should be faster than virtio
> > > >>>>>>>>>> communication too.
> > > >>>>>>>>>>
> > > >>>>>>>>>> 2) io_uring io handling is faster than libaio, which is used in the
> > > >>>>>>>>>> test on qemu-nbd, and all qcow2 backend io(including meta io) is handled
> > > >>>>>>>>>> by io_uring.
> > > >>>>>>>>>>
> > > >>>>>>>>>> https://github.com/ming1/ubdsrv/blob/master/tests/common/qcow2_common
> > > >>>>>>>>>>
> > > >>>>>>>>>> 3) ublk uses one single io_uring to handle all io commands and qcow2
> > > >>>>>>>>>> backend IOs, so batching handling is common, and it is easy to see
> > > >>>>>>>>>> dozens of IOs/io commands handled in single syscall, or even more.
> > > >>>>>>>>>>
> > > >>>>>>>>>>> I'm suggesting measuring changes to just 1 variable at a time.
> > > >>>>>>>>>>> Otherwise it's hard to reach a conclusion about the root cause of the
> > > >>>>>>>>>>> performance difference. Let's learn why ublk-qcow2 performs well.
> > > >>>>>>>>>> Turns out the latest Fedora 37-beta doesn't support vdpa yet, so I built
> > > >>>>>>>>>> qemu from the latest github tree, and finally it starts to work. And test kernel
> > > >>>>>>>>>> is v6.0 release.
> > > >>>>>>>>>>
> > > >>>>>>>>>> Follows the test result, and all three devices are setup as single
> > > >>>>>>>>>> queue, and all tests are run in single job, still done in one VM, and
> > > >>>>>>>>>> the test images are stored on XFS/virtio-scsi backed SSD.
> > > >>>>>>>>>>
> > > >>>>>>>>>> The 1st group tests all three block device which is backed by empty
> > > >>>>>>>>>> qcow2 image.
> > > >>>>>>>>>>
> > > >>>>>>>>>> The 2nd group tests all the three block devices backed by pre-allocated
> > > >>>>>>>>>> qcow2 image.
> > > >>>>>>>>>>
> > > >>>>>>>>>> Except for big sequential IO(512K), there is still a sizable gap between
> > > >>>>>>>>>> vdpa-virtio-blk and ublk.
> > > >>>>>>>>>>
> > > >>>>>>>>>> 1. run fio on block device over empty qcow2 image
> > > >>>>>>>>>> 1) qemu-nbd
> > > >>>>>>>>>> running qcow2/001
> > > >>>>>>>>>> run perf test on empty qcow2 image via nbd
> > > >>>>>>>>>>        fio (nbd(/mnt/data/ublk_null_8G_nYbgF.qcow2), libaio, bs 4k, dio, hw queues:1)...
> > > >>>>>>>>>>        randwrite: jobs 1, iops 8549
> > > >>>>>>>>>>        randread: jobs 1, iops 34829
> > > >>>>>>>>>>        randrw: jobs 1, iops read 11363 write 11333
> > > >>>>>>>>>>        rw(512k): jobs 1, iops read 590 write 597
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>> 2) ublk-qcow2
> > > >>>>>>>>>> running qcow2/021
> > > >>>>>>>>>> run perf test on empty qcow2 image via ublk
> > > >>>>>>>>>>        fio (ublk/qcow2( -f /mnt/data/ublk_null_8G_s761j.qcow2), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0).
> > > >>>>>>>>>>        randwrite: jobs 1, iops 16086
> > > >>>>>>>>>>        randread: jobs 1, iops 172720
> > > >>>>>>>>>>        randrw: jobs 1, iops read 35760 write 35702
> > > >>>>>>>>>>        rw(512k): jobs 1, iops read 1140 write 1149
> > > >>>>>>>>>>
> > > >>>>>>>>>> 3) vdpa-virtio-blk
> > > >>>>>>>>>> running debug/test_dev
> > > >>>>>>>>>> run io test on specified device
> > > >>>>>>>>>>        fio (vdpa(/dev/vdc), libaio, bs 4k, dio, hw queues:1)...
> > > >>>>>>>>>>        randwrite: jobs 1, iops 8626
> > > >>>>>>>>>>        randread: jobs 1, iops 126118
> > > >>>>>>>>>>        randrw: jobs 1, iops read 17698 write 17665
> > > >>>>>>>>>>        rw(512k): jobs 1, iops read 1023 write 1031
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>> 2. run fio on block device over pre-allocated qcow2 image
> > > >>>>>>>>>> 1) qemu-nbd
> > > >>>>>>>>>> running qcow2/002
> > > >>>>>>>>>> run perf test on pre-allocated qcow2 image via nbd
> > > >>>>>>>>>>        fio (nbd(/mnt/data/ublk_data_8G_sc0SB.qcow2), libaio, bs 4k, dio, hw queues:1)...
> > > >>>>>>>>>>        randwrite: jobs 1, iops 21439
> > > >>>>>>>>>>        randread: jobs 1, iops 30336
> > > >>>>>>>>>>        randrw: jobs 1, iops read 11476 write 11449
> > > >>>>>>>>>>        rw(512k): jobs 1, iops read 718 write 722
> > > >>>>>>>>>>
> > > >>>>>>>>>> 2) ublk-qcow2
> > > >>>>>>>>>> running qcow2/022
> > > >>>>>>>>>> run perf test on pre-allocated qcow2 image via ublk
> > > >>>>>>>>>>        fio (ublk/qcow2( -f /mnt/data/ublk_data_8G_yZiaJ.qcow2), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0).
> > > >>>>>>>>>>        randwrite: jobs 1, iops 98757
> > > >>>>>>>>>>        randread: jobs 1, iops 110246
> > > >>>>>>>>>>        randrw: jobs 1, iops read 47229 write 47161
> > > >>>>>>>>>>        rw(512k): jobs 1, iops read 1416 write 1427
> > > >>>>>>>>>>
> > > >>>>>>>>>> 3) vdpa-virtio-blk
> > > >>>>>>>>>> running debug/test_dev
> > > >>>>>>>>>> run io test on specified device
> > > >>>>>>>>>>        fio (vdpa(/dev/vdc), libaio, bs 4k, dio, hw queues:1)...
> > > >>>>>>>>>>        randwrite: jobs 1, iops 47317
> > > >>>>>>>>>>        randread: jobs 1, iops 74092
> > > >>>>>>>>>>        randrw: jobs 1, iops read 27196 write 27234
> > > >>>>>>>>>>        rw(512k): jobs 1, iops read 1447 write 1458
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>> Hi All,
> > > >>>>>>>>>
> > > >>>>>>>>> We are interested in VDUSE vs UBLK, too. And I have tested them with nullblk backend.
> > > >>>>>>>>> Let me share some results here.
> > > >>>>>>>>>
> > > >>>>>>>>> I setup UBLK with:
> > > >>>>>>>>>    ublk add -t loop -f /dev/nullb0 -d QUEUE_DEPTH -q NR_QUEUE
> > > >>>>>>>>>
> > > >>>>>>>>> I setup VDUSE with:
> > > >>>>>>>>>    qemu-storage-daemon \
> > > >>>>>>>>>         --chardev socket,id=charmonitor,path=/tmp/qmp.sock,server=on,wait=off \
> > > >>>>>>>>>         --monitor chardev=charmonitor \
> > > >>>>>>>>>         --blockdev driver=host_device,cache.direct=on,filename=/dev/nullb0,node-name=disk0 \
> > > >>>>>>>>>         --export vduse-blk,id=test,node-name=disk0,name=vduse_test,writable=on,num-queues=NR_QUEUE,queue-size=QUEUE_DEPTH
> > > >>>>>>>>>
> > > >>>>>>>>> Here QUEUE_DEPTH is 1, 32 or 128 and NR_QUEUE is 1 or 4.
> > > >>>>>>>>>
> > > >>>>>>>>> Note:
> > > >>>>>>>>> (1) VDUSE requires QUEUE_DEPTH >= 2. I cannot setup QUEUE_DEPTH to 1.
> > > >>>>>>>>> (2) I use qemu 7.1.0-rc3. It supports vduse-blk.
> > > >>>>>>>>> (3) I do not use ublk null target so that the test is fair.
> > > >>>>>>>>> (4) I setup fio with direct=1, bs=4k.
> > > >>>>>>>>>
> > > >>>>>>>>> ------------------------------
> > > >>>>>>>>> 1 job 1 iodepth, lat (usec)
> > > >>>>>>>>>                  vduse   ublk
> > > >>>>>>>>> seq-read        22.55   11.15
> > > >>>>>>>>> rand-read       22.49   11.17
> > > >>>>>>>>> seq-write       25.67   10.25
> > > >>>>>>>>> rand-write      24.13   10.16
> > > >>>>>>>> Thanks for sharing. Any idea what the bottlenecks are for vduse and ublk?
> > > >>>>>>>>
> > > >>>>>>> I think one reason for the latency gap of sync I/O is that vduse uses
> > > >>>>>>> workqueue in the I/O completion path but ublk doesn't.
> > > >>>>>>>
> > > >>>>>>> And one bottleneck for the async I/O in vduse is that vduse will do
> > > >>>>>>> memcpy inside the critical section of virtqueue's spinlock in the
> > > >>>>>>> virtio-blk driver. That will hurt the performance heavily when
> > > >>>>>>> virtio_queue_rq() and virtblk_done() run concurrently. And it can be
> > > >>>>>>> mitigated by the advance DMA mapping feature [1] or irq binding
> > > >>>>>>> support [2].
> > > >>>>>> Hi Yongji,
> > > >>>>>>
> > > >>>>>> Yeah, that is the cost you paid for virtio. Wrt. userspace block device
> > > >>>>>> or other sort of userspace devices, cmd completion is driven by
> > > >>>>>> userspace, not sure if one such 'irq' is needed.
> > > >>>>> I'm not sure, it can be an optional feature in the future if needed.
> > > >>>>>
> > > >>>>>> Even not sure if virtio
> > > >>>>>> ring is one good choice for such use case, given io_uring has been proved
> > > >>>>>> as very efficient(should be better than virtio ring, IMO).
> > > >>>>>>
> > > >>>>> Since vduse is aimed at creating a generic userspace device framework,
> > > >>>>> virtio should be the right way IMO.
> > > >>>> OK, it is the right way, but may not be the effective one.
> > > >>>>
> > > >>> Maybe, but I think we can try to optimize it.
> > > >>>
> > > >>>>> And with the vdpa framework, the
> > > >>>>> userspace device can serve both virtual machines and containers.
> > > >>>> virtio is good for VM, but not sure it is good enough for other
> > > >>>> cases.
> > > >>>>
> > > >>>>> Regarding the performance issue, actually I can't measure how much of
> > > >>>>> the performance loss is due to the difference between virtio ring and
> > > >>>>> iouring. But I think it should be very small. The main costs come from
> > > >>>>> the two bottlenecks I mentioned before which could be mitigated in the
> > > >>>>> future.
> > > >>>> Per my understanding, at least there are two places where virtio ring is
> > > >>>> less efficient than io_uring:
> > > >>>>
> > > >>> I might have misunderstood what you mean by virtio ring before. My
> > > >>> previous understanding of the virtio ring does not include the
> > > >>> virtio-blk driver.
> > > >>>
> > > >>>> 1) io_uring uses standalone submission queue(SQ) and completion queue(CQ),
> > > >>>> so no contention exists between submission and completion; but virtio queue
> > > >>>> requires per-vq lock in both submission and completion.
> > > >>>>
> > > >>> Yes, this is the bottleneck of the virtio-blk driver, even in the VM
> > > >>> case. We are also trying to optimize this lock.
> > > >>>
> > > >>> One way to mitigate it is making submission and completion happen in
> > > >>> the same core.
> > > >> QEMU sizes virtio-blk device num-queues to match the vCPU count. The
> > > >> virtio-blk driver is a blk-mq driver, so submissions and completions
> > > >> for a given virtqueue should already be processed by the same vCPU.
> > > >>
> > > >> Unless the device is misconfigured or the guest software chooses a
> > > >> custom vq:vCPU mapping, there should be no vq lock contention between
> > > >> vCPUs.
> > > >>
> > > >> I can think of a reason why submission and completion require
> > > >> coordination: descriptors are occupied until completion. The
> > > >> submission logic chooses free descriptors from the table. The
> > > >> completion logic returns free descriptors so they can be used in
> > > >> future submissions.
> > > >>
> > > > Yes, we need to maintain a head pointer of the free descriptors in
> > > > both submission and completion path.
> > >
> > >
> > > Not necessarily after IN_ORDER?
> > >
> >
> > Sounds like a good idea.
>
> Submission and completion are still not 100% independent with IN_ORDER
> because descriptors are still in use until completion. It may not be
> necessary to keep a freelist, but you cannot actually use the
> descriptors for new submissions until existing requests complete. Is
> that correct?
>

Yes. But we can get rid of the per-vq lock at least.

> Anyway, independent submission and completion rings aren't perfect
> either because independent submission introduces a new point of
> communication: the device must tell the driver when submitted
> descriptors have been processed. That means the driver must access a
> hardware register on the device or the device must DMA to RAM. So it
> involves extra bus traffic that is not necessary if descriptors are in
> use until completion. io_uring gets away with it because the
> io_uring_enter(2) syscall is synchronous and can therefore return the
> number of consumed sq elements for free.
>
> There are ways to minimize that cost:
> 1. The driver only needs to fetch the device's sq index when it has
> run out of sq ring space.
> 2. The device can include sq index updates with completions. This is
> what NVMe does with the CQE SQ Head Pointer field, but the
> disadvantage is that the driver has no way of determining the sq index
> until a completion occurs.
>

It seems that the per-vq lock is still needed with this approach if
IN_ORDER is not supported: we still need to maintain a list of the free
descriptors, since out-of-order completion may occur.
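
The difference between the two cases can be sketched with a toy model. This
is only an illustration under simplifying assumptions, not the virtio-blk
driver's data structures: FreeListQueue and InOrderQueue are hypothetical
names, descriptors are plain integers, and locking and memory ordering are
left out.

```python
from collections import deque


class FreeListQueue:
    """Out-of-order completion: free descriptors live in a shared pool."""

    def __init__(self, size):
        self.free = deque(range(size))  # mutated by submit AND complete
        self.in_flight = set()

    def submit(self):
        if not self.free:
            return None  # no free descriptor, caller must wait
        desc = self.free.popleft()
        self.in_flight.add(desc)
        return desc

    def complete(self, desc):
        # Any in-flight descriptor may finish first, so it has to be
        # handed back to the shared pool explicitly.
        self.in_flight.remove(desc)
        self.free.append(desc)


class InOrderQueue:
    """In-order completion: two monotonic counters replace the pool."""

    def __init__(self, size):
        self.size = size
        self.next_avail = 0  # written only by the submission path
        self.last_used = 0   # written only by the completion path

    def submit(self):
        if self.next_avail - self.last_used == self.size:
            return None  # ring full until an older request completes
        desc = self.next_avail % self.size
        self.next_avail += 1
        return desc

    def complete(self):
        if self.last_used == self.next_avail:
            raise RuntimeError("nothing in flight")
        desc = self.last_used % self.size
        self.last_used += 1
        return desc
```

In the free-list model both paths modify the same pool, which is where a
per-vq lock comes from. In the in-order model each counter has a single
writer, but a full ring still blocks submission until the oldest request
completes.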

Thanks,
Yongji

* Re: ublk-qcow2: ublk-qcow2 is available
  2022-10-25 12:02                               ` Stefan Hajnoczi
  2022-10-28 13:33                                 ` Yongji Xie
@ 2022-11-01  2:36                                 ` Jason Wang
  2022-11-02 19:13                                   ` Stefan Hajnoczi
  1 sibling, 1 reply; 44+ messages in thread
From: Jason Wang @ 2022-11-01  2:36 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Yongji Xie, Michael S. Tsirkin, Ming Lei, Ziyang Zhang,
	Stefan Hajnoczi, io-uring, linux-block, linux-kernel,
	Denis V. Lunev, Xiaoguang Wang

On Tue, Oct 25, 2022 at 8:02 PM Stefan Hajnoczi <[email protected]> wrote:
>
> On Tue, 25 Oct 2022 at 04:17, Yongji Xie <[email protected]> wrote:
> >
> > On Fri, Oct 21, 2022 at 2:30 PM Jason Wang <[email protected]> wrote:
> > >
> > >
> > > On 2022/10/21 13:33, Yongji Xie wrote:
> > > > On Tue, Oct 18, 2022 at 10:54 PM Stefan Hajnoczi <[email protected]> wrote:
> > > >> On Tue, 18 Oct 2022 at 09:17, Yongji Xie <[email protected]> wrote:
> > > >>> On Tue, Oct 18, 2022 at 2:59 PM Ming Lei <[email protected]> wrote:
> > > >>>> On Mon, Oct 17, 2022 at 07:11:59PM +0800, Yongji Xie wrote:
> > > >>>>> On Fri, Oct 14, 2022 at 8:57 PM Ming Lei <[email protected]> wrote:
> > > >>>>>> On Thu, Oct 13, 2022 at 02:48:04PM +0800, Yongji Xie wrote:
> > > >>>>>>> On Wed, Oct 12, 2022 at 10:22 PM Stefan Hajnoczi <[email protected]> wrote:
> > > >>>>>>>> On Sat, 8 Oct 2022 at 04:43, Ziyang Zhang <[email protected]> wrote:
> > > >>>>>>>>> On 2022/10/5 12:18, Ming Lei wrote:
> > > >>>>>>>>>> On Tue, Oct 04, 2022 at 09:53:32AM -0400, Stefan Hajnoczi wrote:
> > > >>>>>>>>>>> On Tue, 4 Oct 2022 at 05:44, Ming Lei <[email protected]> wrote:
> > > >>>>>>>>>>>> On Mon, Oct 03, 2022 at 03:53:41PM -0400, Stefan Hajnoczi wrote:
> > > >>>>>>>>>>>>> On Fri, Sep 30, 2022 at 05:24:11PM +0800, Ming Lei wrote:
> > > >>>>>>>>>>>>>> ublk-qcow2 is available now.
> > > >>>>>>>>>>>>> Cool, thanks for sharing!
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> So far it provides basic read/write function, and compression and snapshot
> > > >>>>>>>>>>>>>> aren't supported yet. The target/backend implementation is completely
> > > >>>>>>>>>>>>>> based on io_uring, and share the same io_uring with ublk IO command
> > > >>>>>>>>>>>>>> handler, just like what ublk-loop does.
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> Follows the main motivations of ublk-qcow2:
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> - building one complicated target from scratch helps libublksrv APIs/functions
> > > >>>>>>>>>>>>>>    become mature/stable more quickly, since qcow2 is complicated and needs more
> > > >>>>>>>>>>>>>>    requirement from libublksrv compared with other simple ones(loop, null)
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> - there are several attempts of implementing qcow2 driver in kernel, such as
> > > >>>>>>>>>>>>>>    ``qloop`` [2], ``dm-qcow2`` [3] and ``in kernel qcow2(ro)`` [4], so ublk-qcow2
> > > >>>>>>>>>>>>>>    might be useful for covering requirements in this field
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> - performance comparison with qemu-nbd, and it was my 1st thought to evaluate
> > > >>>>>>>>>>>>>>    performance of ublk/io_uring backend by writing one ublk-qcow2 since ublksrv
> > > >>>>>>>>>>>>>>    is started
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> - help to abstract common building block or design pattern for writing new ublk
> > > >>>>>>>>>>>>>>    target/backend
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> So far it basically passes xfstest(XFS) test by using ublk-qcow2 block
> > > >>>>>>>>>>>>>> device as TEST_DEV, and kernel building workload is verified too. Also
> > > >>>>>>>>>>>>>> soft update approach is applied in meta flushing, and meta data
> > > >>>>>>>>>>>>>> integrity is guaranteed, 'make test T=qcow2/040' covers this kind of
> > > >>>>>>>>>>>>>> test, and only cluster leak is reported during this test.
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> The performance data looks much better compared with qemu-nbd, see
> > > >>>>>>>>>>>>>> details in commit log[1], README[5] and STATUS[6]. And the test covers both
> > > >>>>>>>>>>>>>> empty image and pre-allocated image, for example of pre-allocated qcow2
> > > >>>>>>>>>>>>>> image(8GB):
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> - qemu-nbd (make test T=qcow2/002)
> > > >>>>>>>>>>>>> Single queue?
> > > >>>>>>>>>>>> Yeah.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>>>      randwrite(4k): jobs 1, iops 24605
> > > >>>>>>>>>>>>>>      randread(4k): jobs 1, iops 30938
> > > >>>>>>>>>>>>>>      randrw(4k): jobs 1, iops read 13981 write 14001
> > > >>>>>>>>>>>>>>      rw(512k): jobs 1, iops read 724 write 728
> > > >>>>>>>>>>>>> Please try qemu-storage-daemon's VDUSE export type as well. The
> > > >>>>>>>>>>>>> command-line should be similar to this:
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>    # modprobe virtio_vdpa # attaches vDPA devices to host kernel
> > > >>>>>>>>>>>> Not found virtio_vdpa module even though I enabled all the following
> > > >>>>>>>>>>>> options:
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>          --- vDPA drivers
> > > >>>>>>>>>>>>            <M>   vDPA device simulator core
> > > >>>>>>>>>>>>            <M>     vDPA simulator for networking device
> > > >>>>>>>>>>>>            <M>     vDPA simulator for block device
> > > >>>>>>>>>>>>            <M>   VDUSE (vDPA Device in Userspace) support
> > > >>>>>>>>>>>>            <M>   Intel IFC VF vDPA driver
> > > >>>>>>>>>>>>            <M>   Virtio PCI bridge vDPA driver
> > > >>>>>>>>>>>>            <M>   vDPA driver for Alibaba ENI
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> BTW, my test environment is VM and the shared data is done in VM too, and
> > > >>>>>>>>>>>> can virtio_vdpa be used inside VM?
> > > >>>>>>>>>>> I hope Xie Yongji can help explain how to benchmark VDUSE.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> virtio_vdpa is available inside guests too. Please check that
> > > >>>>>>>>>>> VIRTIO_VDPA ("vDPA driver for virtio devices") is enabled in "Virtio
> > > >>>>>>>>>>> drivers" menu.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>>>    # modprobe vduse
> > > >>>>>>>>>>>>>    # qemu-storage-daemon \
> > > >>>>>>>>>>>>>        --blockdev file,filename=test.qcow2,cache.direct=on|off,aio=native,node-name=file \
> > > >>>>>>>>>>>>>        --blockdev qcow2,file=file,node-name=qcow2 \
> > > >>>>>>>>>>>>>        --object iothread,id=iothread0 \
> > > >>>>>>>>>>>>>        --export vduse-blk,id=vduse0,name=vduse0,num-queues=$(nproc),node-name=qcow2,writable=on,iothread=iothread0
> > > >>>>>>>>>>>>>    # vdpa dev add name vduse0 mgmtdev vduse
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> A virtio-blk device should appear and xfstests can be run on it
> > > >>>>>>>>>>>>> (typically /dev/vda unless you already have other virtio-blk devices).
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Afterwards you can destroy the device using:
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>    # vdpa dev del vduse0
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> - ublk-qcow2 (make test T=qcow2/022)
> > > >>>>>>>>>>>>> There are a lot of other factors not directly related to NBD vs ublk. In
> > > >>>>>>>>>>>>> order to get an apples-to-apples comparison with qemu-* a ublk export
> > > >>>>>>>>>>>>> type is needed in qemu-storage-daemon. That way only the difference is
> > > >>>>>>>>>>>>> the ublk interface and the rest of the code path is identical, making it
> > > >>>>>>>>>>>>> possible to compare NBD, VDUSE, ublk, etc more precisely.
> > > >>>>>>>>>>>> Maybe not true.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> ublk-qcow2 uses io_uring to handle all backend IO (including meta IO),
> > > >>>>>>>>>>>> and so far a single io_uring/pthread handles all qcow2 IOs and IO
> > > >>>>>>>>>>>> commands.
> > > >>>>>>>>>>> qemu-nbd doesn't use io_uring to handle the backend IO, so we don't
> > > >>>>>>>>>> I tried to use it via --aio=io_uring for setting up qemu-nbd, but did not succeed.
> > > >>>>>>>>>>
> > > >>>>>>>>>>> know whether the benchmark demonstrates that ublk is faster than NBD,
> > > >>>>>>>>>>> that the ublk-qcow2 implementation is faster than qemu-nbd's qcow2,
> > > >>>>>>>>>>> whether there are miscellaneous implementation differences between
> > > >>>>>>>>>>> ublk-qcow2 and qemu-nbd (like using the same io_uring context for both
> > > >>>>>>>>>>> ublk and backend IO), or something else.
> > > >>>>>>>>>> The theory shouldn't be too complicated:
> > > >>>>>>>>>>
> > > >>>>>>>>>> 1) io_uring passthrough (pt) communication is faster than a socket, and io
> > > >>>>>>>>>> commands are carried over io_uring pt commands, which should be faster than
> > > >>>>>>>>>> virtio communication too.
> > > >>>>>>>>>>
> > > >>>>>>>>>> 2) io_uring IO handling is faster than libaio, which is what the
> > > >>>>>>>>>> test on qemu-nbd uses, and all qcow2 backend IO (including meta IO) is
> > > >>>>>>>>>> handled by io_uring.
> > > >>>>>>>>>>
> > > >>>>>>>>>> https://github.com/ming1/ubdsrv/blob/master/tests/common/qcow2_common
> > > >>>>>>>>>>
> > > >>>>>>>>>> 3) ublk uses one single io_uring to handle all io commands and qcow2
> > > >>>>>>>>>> backend IOs, so batching handling is common, and it is easy to see
> > > >>>>>>>>>> dozens of IOs/io commands handled in single syscall, or even more.
> > > >>>>>>>>>>
> > > >>>>>>>>>>> I'm suggesting measuring changes to just 1 variable at a time.
> > > >>>>>>>>>>> Otherwise it's hard to reach a conclusion about the root cause of the
> > > >>>>>>>>>>> performance difference. Let's learn why ublk-qcow2 performs well.
> > > >>>>>>>>>> Turns out the latest Fedora 37-beta doesn't support vdpa yet, so I built
> > > >>>>>>>>>> qemu from the latest github tree, and finally it starts to work. And test kernel
> > > >>>>>>>>>> is v6.0 release.
> > > >>>>>>>>>>
> > > >>>>>>>>>> The test results follow. All three devices are set up with a single
> > > >>>>>>>>>> queue, and all tests run as a single job, still inside one VM, with
> > > >>>>>>>>>> the test images stored on an XFS/virtio-scsi backed SSD.
> > > >>>>>>>>>>
> > > >>>>>>>>>> The 1st group tests all three block devices backed by an empty
> > > >>>>>>>>>> qcow2 image.
> > > >>>>>>>>>>
> > > >>>>>>>>>> The 2nd group tests all three block devices backed by a pre-allocated
> > > >>>>>>>>>> qcow2 image.
> > > >>>>>>>>>>
> > > >>>>>>>>>> Except for big sequential IO (512K), there is still a sizable gap between
> > > >>>>>>>>>> vdpa-virtio-blk and ublk.
> > > >>>>>>>>>>
> > > >>>>>>>>>> 1. run fio on block device over empty qcow2 image
> > > >>>>>>>>>> 1) qemu-nbd
> > > >>>>>>>>>> running qcow2/001
> > > >>>>>>>>>> run perf test on empty qcow2 image via nbd
> > > >>>>>>>>>>        fio (nbd(/mnt/data/ublk_null_8G_nYbgF.qcow2), libaio, bs 4k, dio, hw queues:1)...
> > > >>>>>>>>>>        randwrite: jobs 1, iops 8549
> > > >>>>>>>>>>        randread: jobs 1, iops 34829
> > > >>>>>>>>>>        randrw: jobs 1, iops read 11363 write 11333
> > > >>>>>>>>>>        rw(512k): jobs 1, iops read 590 write 597
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>> 2) ublk-qcow2
> > > >>>>>>>>>> running qcow2/021
> > > >>>>>>>>>> run perf test on empty qcow2 image via ublk
> > > >>>>>>>>>>        fio (ublk/qcow2( -f /mnt/data/ublk_null_8G_s761j.qcow2), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0).
> > > >>>>>>>>>>        randwrite: jobs 1, iops 16086
> > > >>>>>>>>>>        randread: jobs 1, iops 172720
> > > >>>>>>>>>>        randrw: jobs 1, iops read 35760 write 35702
> > > >>>>>>>>>>        rw(512k): jobs 1, iops read 1140 write 1149
> > > >>>>>>>>>>
> > > >>>>>>>>>> 3) vdpa-virtio-blk
> > > >>>>>>>>>> running debug/test_dev
> > > >>>>>>>>>> run io test on specified device
> > > >>>>>>>>>>        fio (vdpa(/dev/vdc), libaio, bs 4k, dio, hw queues:1)...
> > > >>>>>>>>>>        randwrite: jobs 1, iops 8626
> > > >>>>>>>>>>        randread: jobs 1, iops 126118
> > > >>>>>>>>>>        randrw: jobs 1, iops read 17698 write 17665
> > > >>>>>>>>>>        rw(512k): jobs 1, iops read 1023 write 1031
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>> 2. run fio on block device over pre-allocated qcow2 image
> > > >>>>>>>>>> 1) qemu-nbd
> > > >>>>>>>>>> running qcow2/002
> > > >>>>>>>>>> run perf test on pre-allocated qcow2 image via nbd
> > > >>>>>>>>>>        fio (nbd(/mnt/data/ublk_data_8G_sc0SB.qcow2), libaio, bs 4k, dio, hw queues:1)...
> > > >>>>>>>>>>        randwrite: jobs 1, iops 21439
> > > >>>>>>>>>>        randread: jobs 1, iops 30336
> > > >>>>>>>>>>        randrw: jobs 1, iops read 11476 write 11449
> > > >>>>>>>>>>        rw(512k): jobs 1, iops read 718 write 722
> > > >>>>>>>>>>
> > > >>>>>>>>>> 2) ublk-qcow2
> > > >>>>>>>>>> running qcow2/022
> > > >>>>>>>>>> run perf test on pre-allocated qcow2 image via ublk
> > > >>>>>>>>>>        fio (ublk/qcow2( -f /mnt/data/ublk_data_8G_yZiaJ.qcow2), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0).
> > > >>>>>>>>>>        randwrite: jobs 1, iops 98757
> > > >>>>>>>>>>        randread: jobs 1, iops 110246
> > > >>>>>>>>>>        randrw: jobs 1, iops read 47229 write 47161
> > > >>>>>>>>>>        rw(512k): jobs 1, iops read 1416 write 1427
> > > >>>>>>>>>>
> > > >>>>>>>>>> 3) vdpa-virtio-blk
> > > >>>>>>>>>> running debug/test_dev
> > > >>>>>>>>>> run io test on specified device
> > > >>>>>>>>>>        fio (vdpa(/dev/vdc), libaio, bs 4k, dio, hw queues:1)...
> > > >>>>>>>>>>        randwrite: jobs 1, iops 47317
> > > >>>>>>>>>>        randread: jobs 1, iops 74092
> > > >>>>>>>>>>        randrw: jobs 1, iops read 27196 write 27234
> > > >>>>>>>>>>        rw(512k): jobs 1, iops read 1447 write 1458
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>> Hi All,
> > > >>>>>>>>>
> > > >>>>>>>>> We are interested in VDUSE vs UBLK, too. And I have tested them with nullblk backend.
> > > >>>>>>>>> Let me share some results here.
> > > >>>>>>>>>
> > > >>>>>>>>> I setup UBLK with:
> > > >>>>>>>>>    ublk add -t loop -f /dev/nullb0 -d QUEUE_DEPTH -q NR_QUEUE
> > > >>>>>>>>>
> > > >>>>>>>>> I setup VDUSE with:
> > > >>>>>>>>>    qemu-storage-daemon \
> > > >>>>>>>>>         --chardev socket,id=charmonitor,path=/tmp/qmp.sock,server=on,wait=off \
> > > >>>>>>>>>         --monitor chardev=charmonitor \
> > > >>>>>>>>>         --blockdev driver=host_device,cache.direct=on,filename=/dev/nullb0,node-name=disk0 \
> > > >>>>>>>>>         --export vduse-blk,id=test,node-name=disk0,name=vduse_test,writable=on,num-queues=NR_QUEUE,queue-size=QUEUE_DEPTH
> > > >>>>>>>>>
> > > >>>>>>>>> Here QUEUE_DEPTH is 1, 32 or 128 and NR_QUEUE is 1 or 4.
> > > >>>>>>>>>
> > > >>>>>>>>> Note:
> > > >>>>>>>>> (1) VDUSE requires QUEUE_DEPTH >= 2. I cannot setup QUEUE_DEPTH to 1.
> > > >>>>>>>>> (2) I use qemu 7.1.0-rc3. It supports vduse-blk.
> > > >>>>>>>>> (3) I do not use ublk null target so that the test is fair.
> > > >>>>>>>>> (4) I setup fio with direct=1, bs=4k.
> > > >>>>>>>>>
> > > >>>>>>>>> ------------------------------
> > > >>>>>>>>> 1 job 1 iodepth, lat (usec)
> > > >>>>>>>>>                  vduse   ublk
> > > >>>>>>>>> seq-read        22.55   11.15
> > > >>>>>>>>> rand-read       22.49   11.17
> > > >>>>>>>>> seq-write       25.67   10.25
> > > >>>>>>>>> rand-write      24.13   10.16
> > > >>>>>>>> Thanks for sharing. Any idea what the bottlenecks are for vduse and ublk?
> > > >>>>>>>>
> > > >>>>>>> I think one reason for the latency gap of sync I/O is that vduse uses
> > > >>>>>>> workqueue in the I/O completion path but ublk doesn't.
> > > >>>>>>>
> > > >>>>>>> And one bottleneck for the async I/O in vduse is that vduse will do
> > > >>>>>>> memcpy inside the critical section of virtqueue's spinlock in the
> > > >>>>>>> virtio-blk driver. That will hurt the performance heavily when
> > > >>>>>>> virtio_queue_rq() and virtblk_done() run concurrently. And it can be
> > > >>>>>>> mitigated by the advance DMA mapping feature [1] or irq binding
> > > >>>>>>> support [2].
> > > >>>>>> Hi Yongji,
> > > >>>>>>
> > > >>>>>> Yeah, that is the cost you paid for virtio. Wrt. userspace block device
> > > >>>>>> or other sort of userspace devices, cmd completion is driven by
> > > >>>>>> userspace, not sure if one such 'irq' is needed.
> > > >>>>> I'm not sure, it can be an optional feature in the future if needed.
> > > >>>>>
> > > >>>>>> Even not sure if virtio
> > > >>>>>> ring is one good choice for such use case, given io_uring has been proved
> > > >>>>>> as very efficient(should be better than virtio ring, IMO).
> > > >>>>>>
> > > >>>>> Since vduse is aimed at creating a generic userspace device framework,
> > > >>>>> virtio should be the right way IMO.
> > > >>>> OK, it is the right way, but may not be the effective one.
> > > >>>>
> > > >>> Maybe, but I think we can try to optimize it.
> > > >>>
> > > >>>>> And with the vdpa framework, the
> > > >>>>> userspace device can serve both virtual machines and containers.
> > > >>>> virtio is good for VM, but not sure it is good enough for other
> > > >>>> cases.
> > > >>>>
> > > >>>>> Regarding the performance issue, actually I can't measure how much of
> > > >>>>> the performance loss is due to the difference between virtio ring and
> > > >>>>> iouring. But I think it should be very small. The main costs come from
> > > >>>>> the two bottlenecks I mentioned before which could be mitigated in the
> > > >>>>> future.
> > > >>>> Per my understanding, at least there are two places where virtio ring is
> > > >>>> less efficient than io_uring:
> > > >>>>
> > > >>> I might have misunderstood what you mean by virtio ring before. My
> > > >>> previous understanding of the virtio ring does not include the
> > > >>> virtio-blk driver.
> > > >>>
> > > >>>> 1) io_uring uses standalone submission queue(SQ) and completion queue(CQ),
> > > >>>> so no contention exists between submission and completion; but virtio queue
> > > >>>> requires per-vq lock in both submission and completion.
> > > >>>>
> > > >>> Yes, this is the bottleneck of the virtio-blk driver, even in the VM
> > > >>> case. We are also trying to optimize this lock.
> > > >>>
> > > >>> One way to mitigate it is making submission and completion happen in
> > > >>> the same core.
> > > >> QEMU sizes virtio-blk device num-queues to match the vCPU count. The
> > > >> virtio-blk driver is a blk-mq driver, so submissions and completions
> > > >> for a given virtqueue should already be processed by the same vCPU.
> > > >>
> > > >> Unless the device is misconfigured or the guest software chooses a
> > > >> custom vq:vCPU mapping, there should be no vq lock contention between
> > > >> vCPUs.
> > > >>
> > > >> I can think of a reason why submission and completion require
> > > >> coordination: descriptors are occupied until completion. The
> > > >> submission logic chooses free descriptors from the table. The
> > > >> completion logic returns free descriptors so they can be used in
> > > >> future submissions.
> > > >>
> > > > Yes, we need to maintain a head pointer of the free descriptors in
> > > > both submission and completion path.
> > >
> > >
> > > Not necessarily after IN_ORDER?
> > >
> >
> > Sounds like a good idea.
>
> Submission and completion are still not 100% independent with IN_ORDER
> because descriptors are still in use until completion. It may not be
> necessary to keep a freelist, but you cannot actually use the
> descriptors for new submissions until existing requests complete. Is
> that correct?

Yes.


>
> Anyway, independent submission and completion rings aren't perfect
> either because independent submission introduces a new point of
> communication: the device must tell the driver when submitted
> descriptors have been processed. That means the driver must access a
> hardware register on the device or the device must DMA to RAM. So it
> involves extra bus traffic that is not necessary if descriptors are in
> use until completion. io_uring gets away with it because the
> io_uring_enter(2) syscall is synchronous and can therefore return the
> number of consumed sq elements for free.

Note that this is a syscall interface not a device/driver API, so
technically if it's useful, it could be added to virtio as well.

>
> There are ways to minimize that cost:
> 1. The driver only needs to fetch the device's sq index when it has
> run out of sq ring space.
> 2. The device can include sq index updates with completions. This is
> what NVMe does with the CQE SQ Head Pointer field, but the
> disadvantage is that the driver has no way of determining the sq index
> until a completion occurs.

Probably, but as replied in another thread, based on the numbers
measured from the networking test, I think the current virtio layout
should be sufficient for block I/O but might not fit for cases like
NFV.
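
For reference, the two cost-reduction options quoted above can be sketched
with a toy model. This is only an illustration under stated assumptions:
Device and LazyDriver are hypothetical names, not virtio or NVMe APIs, and
the "reads" counter merely stands in for the extra bus traffic of fetching
the device's sq index.

```python
class Device:
    """Toy device that consumes submission queue (sq) entries."""

    def __init__(self):
        self.sq_head = 0  # number of sq entries consumed so far
        self.reads = 0    # how often the driver had to fetch sq_head

    def consume(self, count):
        self.sq_head += count

    def read_sq_head(self):
        # Stands in for a register read or a DMA'd index lookup.
        self.reads += 1
        return self.sq_head


class LazyDriver:
    """Driver that keeps a cached copy of the device's sq index."""

    def __init__(self, device, ring_size):
        self.device = device
        self.ring_size = ring_size
        self.sq_tail = 0      # entries submitted so far
        self.cached_head = 0  # last known device sq index

    def submit(self):
        if self.sq_tail - self.cached_head == self.ring_size:
            # Option 1: refresh the index only when the ring looks full.
            self.cached_head = self.device.read_sq_head()
            if self.sq_tail - self.cached_head == self.ring_size:
                return False  # really full
        self.sq_tail += 1
        return True

    def on_completion(self, sq_head_in_completion):
        # Option 2: the completion carries the sq index, so no extra
        # fetch is needed, but the update only arrives with a completion.
        self.cached_head = sq_head_in_completion
```

With a ring of 4 entries, the first four submissions in this model need no
fetch at all. The fetch happens only once the ring appears full, and an
index delivered with a completion avoids it entirely.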

Thanks

>
> Stefan
>


* Re: ublk-qcow2: ublk-qcow2 is available
  2022-11-01  2:36                                 ` Jason Wang
@ 2022-11-02 19:13                                   ` Stefan Hajnoczi
  2022-11-04  6:55                                     ` Jason Wang
  0 siblings, 1 reply; 44+ messages in thread
From: Stefan Hajnoczi @ 2022-11-02 19:13 UTC (permalink / raw)
  To: Jason Wang
  Cc: Stefan Hajnoczi, Yongji Xie, Michael S. Tsirkin, Ming Lei,
	Ziyang Zhang, io-uring, linux-block, linux-kernel, Denis V. Lunev,
	Xiaoguang Wang

On Tue, Nov 01, 2022 at 10:36:29AM +0800, Jason Wang wrote:
> On Tue, Oct 25, 2022 at 8:02 PM Stefan Hajnoczi <[email protected]> wrote:
> >
> > On Tue, 25 Oct 2022 at 04:17, Yongji Xie <[email protected]> wrote:
> > >
> > > On Fri, Oct 21, 2022 at 2:30 PM Jason Wang <[email protected]> wrote:
> > > >
> > > >
> > > > > On 2022/10/21 13:33, Yongji Xie wrote:
> > > > > On Tue, Oct 18, 2022 at 10:54 PM Stefan Hajnoczi <[email protected]> wrote:
> > > > >> On Tue, 18 Oct 2022 at 09:17, Yongji Xie <[email protected]> wrote:
> > > > >>> On Tue, Oct 18, 2022 at 2:59 PM Ming Lei <[email protected]> wrote:
> > > > >>>> On Mon, Oct 17, 2022 at 07:11:59PM +0800, Yongji Xie wrote:
> > > > >>>>> On Fri, Oct 14, 2022 at 8:57 PM Ming Lei <[email protected]> wrote:
> > > > >>>>>> On Thu, Oct 13, 2022 at 02:48:04PM +0800, Yongji Xie wrote:
> > > > >>>>>>> On Wed, Oct 12, 2022 at 10:22 PM Stefan Hajnoczi <[email protected]> wrote:
> > > > >>>>>>>> On Sat, 8 Oct 2022 at 04:43, Ziyang Zhang <[email protected]> wrote:
> > > > >>>>>>>>> On 2022/10/5 12:18, Ming Lei wrote:
> > > > >>>>>>>>>> On Tue, Oct 04, 2022 at 09:53:32AM -0400, Stefan Hajnoczi wrote:
> > > > >>>>>>>>>>> On Tue, 4 Oct 2022 at 05:44, Ming Lei <[email protected]> wrote:
> > > > >>>>>>>>>>>> On Mon, Oct 03, 2022 at 03:53:41PM -0400, Stefan Hajnoczi wrote:
> > > > >>>>>>>>>>>>> On Fri, Sep 30, 2022 at 05:24:11PM +0800, Ming Lei wrote:
> > There are ways to minimize that cost:
> > 1. The driver only needs to fetch the device's sq index when it has
> > run out of sq ring space.
> > 2. The device can include sq index updates with completions. This is
> > what NVMe does with the CQE SQ Head Pointer field, but the
> > disadvantage is that the driver has no way of determining the sq index
> > until a completion occurs.
> 
> Probably, but as replied in another thread, based on the numbers
> measured from the networking test, I think the current virtio layout
> should be sufficient for block I/O but might not fit for cases like
> NFV.

I remember that the Linux virtio_net driver doesn't rely on vq spinlocks
because CPU affinity and the NAPI architecture ensure that everything is
CPU-local. There is no need to protect the freelist explicitly because
the functions cannot race.

Maybe virtio_blk can learn from virtio_net...

Stefan

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: ublk-qcow2: ublk-qcow2 is available
  2022-11-02 19:13                                   ` Stefan Hajnoczi
@ 2022-11-04  6:55                                     ` Jason Wang
  0 siblings, 0 replies; 44+ messages in thread
From: Jason Wang @ 2022-11-04  6:55 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Stefan Hajnoczi, Yongji Xie, Michael S. Tsirkin, Ming Lei,
	Ziyang Zhang, io-uring, linux-block, linux-kernel, Denis V. Lunev,
	Xiaoguang Wang

On Thu, Nov 3, 2022 at 3:13 AM Stefan Hajnoczi <[email protected]> wrote:
>
> On Tue, Nov 01, 2022 at 10:36:29AM +0800, Jason Wang wrote:
> > On Tue, Oct 25, 2022 at 8:02 PM Stefan Hajnoczi <[email protected]> wrote:
> > >
> > > On Tue, 25 Oct 2022 at 04:17, Yongji Xie <[email protected]> wrote:
> > > >
> > > > On Fri, Oct 21, 2022 at 2:30 PM Jason Wang <[email protected]> wrote:
> > > > >
> > > > >
> > > > > > On 2022/10/21 13:33, Yongji Xie wrote:
> > > > > > On Tue, Oct 18, 2022 at 10:54 PM Stefan Hajnoczi <[email protected]> wrote:
> > > > > >> On Tue, 18 Oct 2022 at 09:17, Yongji Xie <[email protected]> wrote:
> > > > > >>> On Tue, Oct 18, 2022 at 2:59 PM Ming Lei <[email protected]> wrote:
> > > > > >>>> On Mon, Oct 17, 2022 at 07:11:59PM +0800, Yongji Xie wrote:
> > > > > >>>>> On Fri, Oct 14, 2022 at 8:57 PM Ming Lei <[email protected]> wrote:
> > > > > >>>>>> On Thu, Oct 13, 2022 at 02:48:04PM +0800, Yongji Xie wrote:
> > > > > >>>>>>> On Wed, Oct 12, 2022 at 10:22 PM Stefan Hajnoczi <[email protected]> wrote:
> > > > > >>>>>>>> On Sat, 8 Oct 2022 at 04:43, Ziyang Zhang <[email protected]> wrote:
> > > > > >>>>>>>>> On 2022/10/5 12:18, Ming Lei wrote:
> > > > > >>>>>>>>>> On Tue, Oct 04, 2022 at 09:53:32AM -0400, Stefan Hajnoczi wrote:
> > > > > >>>>>>>>>>> On Tue, 4 Oct 2022 at 05:44, Ming Lei <[email protected]> wrote:
> > > > > >>>>>>>>>>>> On Mon, Oct 03, 2022 at 03:53:41PM -0400, Stefan Hajnoczi wrote:
> > > > > >>>>>>>>>>>>> On Fri, Sep 30, 2022 at 05:24:11PM +0800, Ming Lei wrote:
> > > There are ways to minimize that cost:
> > > 1. The driver only needs to fetch the device's sq index when it has
> > > run out of sq ring space.
> > > 2. The device can include sq index updates with completions. This is
> > > what NVMe does with the CQE SQ Head Pointer field, but the
> > > disadvantage is that the driver has no way of determining the sq index
> > > until a completion occurs.
> >
> > Probably, but as replied in another thread, based on the numbers
> > measured from the networking test, I think the current virtio layout
> > should be sufficient for block I/O but might not fit for cases like
> > NFV.
>
> I remember that the Linux virtio_net driver doesn't rely on vq spinlocks
> because CPU affinity and the NAPI architecture ensure that everything is
> CPU-local. There is no need to protect the freelist explicitly because
> the functions cannot race.
>
> Maybe virtio_blk can learn from virtio_net...

That only works for RX, where both add and get can be done in NAPI
context. It is not the case for TX (or for virtio-blk).

Actually, if the free_list is the only thing that needs to be
serialized, there is no need for a lock at all. We could try switching
to ptr_ring instead.
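
To illustrate the idea (a toy model only, not the kernel API): a
ptr_ring-style ring lets one producer and one consumer work without a
shared lock because an empty slot is marked with NULL, so each side only
updates its own index. The class and method names below are made up for
illustration, None stands in for NULL, and the memory barriers a real
implementation needs are ignored:

```python
class ToyPtrRing:
    """Toy single-producer/single-consumer ring; an empty slot holds None."""

    def __init__(self, size):
        self.queue = [None] * size
        self.producer = 0  # only the producer side updates this index
        self.consumer = 0  # only the consumer side updates this index

    def produce(self, item):
        """Store item in the next slot; return False if the ring is full."""
        if item is None:
            raise ValueError("None is reserved as the empty-slot marker")
        if self.queue[self.producer] is not None:
            return False  # slot still occupied, so the ring is full
        self.queue[self.producer] = item
        self.producer = (self.producer + 1) % len(self.queue)
        return True

    def consume(self):
        """Take the oldest item, or return None if the ring is empty."""
        item = self.queue[self.consumer]
        if item is None:
            return None
        self.queue[self.consumer] = None  # hand the slot back to the producer
        self.consumer = (self.consumer + 1) % len(self.queue)
        return item


# Hypothetical use as a descriptor free list: the completion path
# produces freed descriptor ids, the submission path consumes them.
free_list = ToyPtrRing(4)
for desc_id in range(4):
    free_list.produce(desc_id)
print(free_list.consume())
```

With more than one producer or more than one consumer, each side would
still need its own serialization; the point is only that submission and
completion no longer contend on the same lock.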

Thanks

>
> Stefan


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: ublk-qcow2: ublk-qcow2 is available
  2022-10-18  6:59                   ` Ming Lei
  2022-10-18 13:17                     ` Yongji Xie
@ 2022-10-21  6:28                     ` Jason Wang
  1 sibling, 0 replies; 44+ messages in thread
From: Jason Wang @ 2022-10-21  6:28 UTC (permalink / raw)
  To: Ming Lei, Yongji Xie
  Cc: Stefan Hajnoczi, Ziyang Zhang, Stefan Hajnoczi, io-uring,
	linux-block, linux-kernel, Denis V. Lunev, Xiaoguang Wang


On 2022/10/18 14:59, Ming Lei wrote:
> On Mon, Oct 17, 2022 at 07:11:59PM +0800, Yongji Xie wrote:
>> On Fri, Oct 14, 2022 at 8:57 PM Ming Lei <[email protected]> wrote:
>>> On Thu, Oct 13, 2022 at 02:48:04PM +0800, Yongji Xie wrote:
>>>> On Wed, Oct 12, 2022 at 10:22 PM Stefan Hajnoczi <[email protected]> wrote:
>>>>> On Sat, 8 Oct 2022 at 04:43, Ziyang Zhang <[email protected]> wrote:
>>>>>> On 2022/10/5 12:18, Ming Lei wrote:
>>>>>>> On Tue, Oct 04, 2022 at 09:53:32AM -0400, Stefan Hajnoczi wrote:
>>>>>>>> On Tue, 4 Oct 2022 at 05:44, Ming Lei <[email protected]> wrote:
>>>>>>>>> On Mon, Oct 03, 2022 at 03:53:41PM -0400, Stefan Hajnoczi wrote:
>>>>>>>>>> On Fri, Sep 30, 2022 at 05:24:11PM +0800, Ming Lei wrote:
>>>>>>>>>>> ublk-qcow2 is available now.
>>>>>>>>>> Cool, thanks for sharing!
>>>>>>>>>>
>>>>>>>>>>> So far it provides basic read/write function, and compression and snapshot
>>>>>>>>>>> aren't supported yet. The target/backend implementation is completely
>>>>>>>>>>> based on io_uring, and share the same io_uring with ublk IO command
>>>>>>>>>>> handler, just like what ublk-loop does.
>>>>>>>>>>>
>>>>>>>>>>> Follows the main motivations of ublk-qcow2:
>>>>>>>>>>>
>>>>>>>>>>> - building one complicated target from scratch helps libublksrv APIs/functions
>>>>>>>>>>>    become mature/stable more quickly, since qcow2 is complicated and needs more
>>>>>>>>>>>    requirement from libublksrv compared with other simple ones(loop, null)
>>>>>>>>>>>
>>>>>>>>>>> - there are several attempts of implementing qcow2 driver in kernel, such as
>>>>>>>>>>>    ``qloop`` [2], ``dm-qcow2`` [3] and ``in kernel qcow2(ro)`` [4], so ublk-qcow2
>>>>>>>>>>>    might useful be for covering requirement in this field
>>>>>>>>>>>
>>>>>>>>>>> - performance comparison with qemu-nbd, and it was my 1st thought to evaluate
>>>>>>>>>>>    performance of ublk/io_uring backend by writing one ublk-qcow2 since ublksrv
>>>>>>>>>>>    is started
>>>>>>>>>>>
>>>>>>>>>>> - help to abstract common building block or design pattern for writing new ublk
>>>>>>>>>>>    target/backend
>>>>>>>>>>>
>>>>>>>>>>> So far it basically passes xfstest(XFS) test by using ublk-qcow2 block
>>>>>>>>>>> device as TEST_DEV, and kernel building workload is verified too. Also
>>>>>>>>>>> soft update approach is applied in meta flushing, and meta data
>>>>>>>>>>> integrity is guaranteed, 'make test T=qcow2/040' covers this kind of
>>>>>>>>>>> test, and only cluster leak is reported during this test.
>>>>>>>>>>>
>>>>>>>>>>> The performance data looks much better compared with qemu-nbd, see
>>>>>>>>>>> details in commit log[1], README[5] and STATUS[6]. And the test covers both
>>>>>>>>>>> empty image and pre-allocated image, for example of pre-allocated qcow2
>>>>>>>>>>> image(8GB):
>>>>>>>>>>>
>>>>>>>>>>> - qemu-nbd (make test T=qcow2/002)
>>>>>>>>>> Single queue?
>>>>>>>>> Yeah.
>>>>>>>>>
>>>>>>>>>>>      randwrite(4k): jobs 1, iops 24605
>>>>>>>>>>>      randread(4k): jobs 1, iops 30938
>>>>>>>>>>>      randrw(4k): jobs 1, iops read 13981 write 14001
>>>>>>>>>>>      rw(512k): jobs 1, iops read 724 write 728
>>>>>>>>>> Please try qemu-storage-daemon's VDUSE export type as well. The
>>>>>>>>>> command-line should be similar to this:
>>>>>>>>>>
>>>>>>>>>>    # modprobe virtio_vdpa # attaches vDPA devices to host kernel
>>>>>>>>> Not found virtio_vdpa module even though I enabled all the following
>>>>>>>>> options:
>>>>>>>>>
>>>>>>>>>          --- vDPA drivers
>>>>>>>>>            <M>   vDPA device simulator core
>>>>>>>>>            <M>     vDPA simulator for networking device
>>>>>>>>>            <M>     vDPA simulator for block device
>>>>>>>>>            <M>   VDUSE (vDPA Device in Userspace) support
>>>>>>>>>            <M>   Intel IFC VF vDPA driver
>>>>>>>>>            <M>   Virtio PCI bridge vDPA driver
>>>>>>>>>            <M>   vDPA driver for Alibaba ENI
>>>>>>>>>
>>>>>>>>> BTW, my test environment is VM and the shared data is done in VM too, and
>>>>>>>>> can virtio_vdpa be used inside VM?
>>>>>>>> I hope Xie Yongji can help explain how to benchmark VDUSE.
>>>>>>>>
>>>>>>>> virtio_vdpa is available inside guests too. Please check that
>>>>>>>> VIRTIO_VDPA ("vDPA driver for virtio devices") is enabled in "Virtio
>>>>>>>> drivers" menu.
>>>>>>>>
>>>>>>>>>>    # modprobe vduse
>>>>>>>>>>    # qemu-storage-daemon \
>>>>>>>>>>        --blockdev file,filename=test.qcow2,cache.direct=on|off,aio=native,node-name=file \
>>>>>>>>>>        --blockdev qcow2,file=file,node-name=qcow2 \
>>>>>>>>>>        --object iothread,id=iothread0 \
>>>>>>>>>>        --export vduse-blk,id=vduse0,name=vduse0,num-queues=$(nproc),node-name=qcow2,writable=on,iothread=iothread0
>>>>>>>>>>    # vdpa dev add name vduse0 mgmtdev vduse
>>>>>>>>>>
>>>>>>>>>> A virtio-blk device should appear and xfstests can be run on it
>>>>>>>>>> (typically /dev/vda unless you already have other virtio-blk devices).
>>>>>>>>>>
>>>>>>>>>> Afterwards you can destroy the device using:
>>>>>>>>>>
>>>>>>>>>>    # vdpa dev del vduse0
>>>>>>>>>>
>>>>>>>>>>> - ublk-qcow2 (make test T=qcow2/022)
>>>>>>>>>> There are a lot of other factors not directly related to NBD vs ublk. In
>>>>>>>>>> order to get an apples-to-apples comparison with qemu-* a ublk export
>>>>>>>>>> type is needed in qemu-storage-daemon. That way only the difference is
>>>>>>>>>> the ublk interface and the rest of the code path is identical, making it
>>>>>>>>>> possible to compare NBD, VDUSE, ublk, etc more precisely.
>>>>>>>>> Maybe not true.
>>>>>>>>>
>>>>>>>>> ublk-qcow2 uses io_uring to handle all backend IO(include meta IO) completely,
>>>>>>>>> and so far single io_uring/pthread is for handling all qcow2 IOs and IO
>>>>>>>>> command.
>>>>>>>> qemu-nbd doesn't use io_uring to handle the backend IO, so we don't
>>>>>>> I tried to use it via --aio=io_uring for setting up qemu-nbd, but not succeed.
>>>>>>>
>>>>>>>> know whether the benchmark demonstrates that ublk is faster than NBD,
>>>>>>>> that the ublk-qcow2 implementation is faster than qemu-nbd's qcow2,
>>>>>>>> whether there are miscellaneous implementation differences between
>>>>>>>> ublk-qcow2 and qemu-nbd (like using the same io_uring context for both
>>>>>>>> ublk and backend IO), or something else.
>>>>>>> The theory shouldn't be too complicated:
>>>>>>>
>>>>>>> 1) io_uring passthrough (pt) communication is faster than a socket, and io
>>>>>>> commands are carried over io_uring pt commands, which should be faster than
>>>>>>> virtio communication too.
>>>>>>>
>>>>>>> 2) io_uring io handling is faster than libaio, which is used in the
>>>>>>> test on qemu-nbd, and all qcow2 backend io (including meta io) is handled
>>>>>>> by io_uring.
>>>>>>>
>>>>>>> https://github.com/ming1/ubdsrv/blob/master/tests/common/qcow2_common
>>>>>>>
>>>>>>> 3) ublk uses one single io_uring to handle all io commands and qcow2
>>>>>>> backend IOs, so batching handling is common, and it is easy to see
>>>>>>> dozens of IOs/io commands handled in single syscall, or even more.
>>>>>>>
>>>>>>>> I'm suggesting measuring changes to just 1 variable at a time.
>>>>>>>> Otherwise it's hard to reach a conclusion about the root cause of the
>>>>>>>> performance difference. Let's learn why ublk-qcow2 performs well.
>>>>>>> Turns out the latest Fedora 37-beta doesn't support vdpa yet, so I built
>>>>>>> qemu from the latest github tree, and finally it starts to work. And test kernel
>>>>>>> is v6.0 release.
>>>>>>>
>>>>>>> The test results follow. All three devices are set up with a single
>>>>>>> queue, and all tests are run as a single job, still inside one VM, with
>>>>>>> the test images stored on an XFS/virtio-scsi backed SSD.
>>>>>>>
>>>>>>> The 1st group tests all three block devices backed by an empty
>>>>>>> qcow2 image.
>>>>>>>
>>>>>>> The 2nd group tests all three block devices backed by a pre-allocated
>>>>>>> qcow2 image.
>>>>>>>
>>>>>>> Except for big sequential IO (512K), there is still a considerable gap
>>>>>>> between vdpa-virtio-blk and ublk.
>>>>>>>
>>>>>>> 1. run fio on block device over empty qcow2 image
>>>>>>> 1) qemu-nbd
>>>>>>> running qcow2/001
>>>>>>> run perf test on empty qcow2 image via nbd
>>>>>>>        fio (nbd(/mnt/data/ublk_null_8G_nYbgF.qcow2), libaio, bs 4k, dio, hw queues:1)...
>>>>>>>        randwrite: jobs 1, iops 8549
>>>>>>>        randread: jobs 1, iops 34829
>>>>>>>        randrw: jobs 1, iops read 11363 write 11333
>>>>>>>        rw(512k): jobs 1, iops read 590 write 597
>>>>>>>
>>>>>>>
>>>>>>> 2) ublk-qcow2
>>>>>>> running qcow2/021
>>>>>>> run perf test on empty qcow2 image via ublk
>>>>>>>        fio (ublk/qcow2( -f /mnt/data/ublk_null_8G_s761j.qcow2), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0).
>>>>>>>        randwrite: jobs 1, iops 16086
>>>>>>>        randread: jobs 1, iops 172720
>>>>>>>        randrw: jobs 1, iops read 35760 write 35702
>>>>>>>        rw(512k): jobs 1, iops read 1140 write 1149
>>>>>>>
>>>>>>> 3) vdpa-virtio-blk
>>>>>>> running debug/test_dev
>>>>>>> run io test on specified device
>>>>>>>        fio (vdpa(/dev/vdc), libaio, bs 4k, dio, hw queues:1)...
>>>>>>>        randwrite: jobs 1, iops 8626
>>>>>>>        randread: jobs 1, iops 126118
>>>>>>>        randrw: jobs 1, iops read 17698 write 17665
>>>>>>>        rw(512k): jobs 1, iops read 1023 write 1031
>>>>>>>
>>>>>>>
>>>>>>> 2. run fio on block device over pre-allocated qcow2 image
>>>>>>> 1) qemu-nbd
>>>>>>> running qcow2/002
>>>>>>> run perf test on pre-allocated qcow2 image via nbd
>>>>>>>        fio (nbd(/mnt/data/ublk_data_8G_sc0SB.qcow2), libaio, bs 4k, dio, hw queues:1)...
>>>>>>>        randwrite: jobs 1, iops 21439
>>>>>>>        randread: jobs 1, iops 30336
>>>>>>>        randrw: jobs 1, iops read 11476 write 11449
>>>>>>>        rw(512k): jobs 1, iops read 718 write 722
>>>>>>>
>>>>>>> 2) ublk-qcow2
>>>>>>> running qcow2/022
>>>>>>> run perf test on pre-allocated qcow2 image via ublk
>>>>>>>        fio (ublk/qcow2( -f /mnt/data/ublk_data_8G_yZiaJ.qcow2), libaio, bs 4k, dio, hw queues:1, uring_comp: 0, get_data: 0).
>>>>>>>        randwrite: jobs 1, iops 98757
>>>>>>>        randread: jobs 1, iops 110246
>>>>>>>        randrw: jobs 1, iops read 47229 write 47161
>>>>>>>        rw(512k): jobs 1, iops read 1416 write 1427
>>>>>>>
>>>>>>> 3) vdpa-virtio-blk
>>>>>>> running debug/test_dev
>>>>>>> run io test on specified device
>>>>>>>        fio (vdpa(/dev/vdc), libaio, bs 4k, dio, hw queues:1)...
>>>>>>>        randwrite: jobs 1, iops 47317
>>>>>>>        randread: jobs 1, iops 74092
>>>>>>>        randrw: jobs 1, iops read 27196 write 27234
>>>>>>>        rw(512k): jobs 1, iops read 1447 write 1458
>>>>>>>
>>>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> We are interested in VDUSE vs UBLK, too. And I have tested them with nullblk backend.
>>>>>> Let me share some results here.
>>>>>>
>>>>>> I setup UBLK with:
>>>>>>    ublk add -t loop -f /dev/nullb0 -d QUEUE_DEPTH -q NR_QUEUE
>>>>>>
>>>>>> I setup VDUSE with:
>>>>>>    qemu-storage-daemon \
>>>>>>         --chardev socket,id=charmonitor,path=/tmp/qmp.sock,server=on,wait=off \
>>>>>>         --monitor chardev=charmonitor \
>>>>>>         --blockdev driver=host_device,cache.direct=on,filename=/dev/nullb0,node-name=disk0 \
>>>>>>         --export vduse-blk,id=test,node-name=disk0,name=vduse_test,writable=on,num-queues=NR_QUEUE,queue-size=QUEUE_DEPTH
>>>>>>
>>>>>> Here QUEUE_DEPTH is 1, 32 or 128 and NR_QUEUE is 1 or 4.
>>>>>>
>>>>>> Note:
>>>>>> (1) VDUSE requires QUEUE_DEPTH >= 2. I cannot setup QUEUE_DEPTH to 1.
>>>>>> (2) I use qemu 7.1.0-rc3. It supports vduse-blk.
>>>>>> (3) I do not use ublk null target so that the test is fair.
>>>>>> (4) I setup fio with direct=1, bs=4k.
>>>>>>
>>>>>> ------------------------------
>>>>>> 1 job 1 iodepth, lat (usec)
>>>>>>                  vduse   ublk
>>>>>> seq-read        22.55   11.15
>>>>>> rand-read       22.49   11.17
>>>>>> seq-write       25.67   10.25
>>>>>> rand-write      24.13   10.16
>>>>> Thanks for sharing. Any idea what the bottlenecks are for vduse and ublk?
>>>>>
>>>> I think one reason for the latency gap of sync I/O is that vduse uses
>>>> workqueue in the I/O completion path but ublk doesn't.
>>>>
>>>> And one bottleneck for the async I/O in vduse is that vduse will do
>>>> memcpy inside the critical section of virtqueue's spinlock in the
>>>> virtio-blk driver. That will hurt the performance heavily when
>>>> virtio_queue_rq() and virtblk_done() run concurrently. And it can be
>>>> mitigated by the advance DMA mapping feature [1] or irq binding
>>>> support [2].
>>> Hi Yongji,
>>>
>>> Yeah, that is the cost you paid for virtio. Wrt. userspace block device
>>> or other sort of userspace devices, cmd completion is driven by
>>> userspace, not sure if one such 'irq' is needed.
>> I'm not sure, it can be an optional feature in the future if needed.
>>
>>> Even not sure if virtio
>>> ring is one good choice for such use case, given io_uring has been proved
>>> as very efficient(should be better than virtio ring, IMO).
>>>
>> Since vduse is aimed at creating a generic userspace device framework,
>> virtio should be the right way IMO.
> OK, it is the right way, but may not be the effective one.
>
>> And with the vdpa framework, the
>> userspace device can serve both virtual machines and containers.
> virtio is good for VM, but not sure it is good enough for other
> cases.


Well, virtio is not limited to virtualization; it has been widely used in
production on bare metal, in containers, in automotive and even at the
edge for years. A lot of vendors have shipped their software or hardware
virtio/vDPA products.


>
>> Regarding the performance issue, actually I can't measure how much of
>> the performance loss is due to the difference between virtio ring and
>> iouring. But I think it should be very small. The main costs come from
>> the two bottlenecks I mentioned before which could be mitigated in the
>> future.
> Per my understanding, at least there are two places where virtio ring is
> less efficient than io_uring:
>
> 1) io_uring uses standalone submission queue(SQ) and completion queue(CQ),
> so no contention exists between submission and completion; but virtio queue
> requires per-vq lock in both submission and completion.


Virtio is not limited to its current queue layout. I proposed an SQ/CQ
model for the spec in the past, but vendors complained about a third
format arriving immediately after the second. Maybe it's time to revisit
that, but it needs to be fully benchmarked and proven first.


>
> 2) io_uring can use single system call of io_uring_enter() for both
> submitting and completing, so one context switch is enough, together
> with natural batch processing for both submission and completion, and
> it is observed that dozens or more than one hundred of IOs can be
> covered in single syscall; virtio requires one notification for submission and
> another one for completion,


You can queue several buffers before kicking the virtqueue, and with a
polling device and driver you don't need any kick/notification at all. I
don't see much difference here.
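
As a rough illustration of the batching point (a toy model, not the
virtio driver API; ToyVirtqueue, add_buf, kick and submit are
hypothetical names for this sketch), the number of notifications drops
with the batch size when the driver queues several buffers per kick:

```python
class ToyVirtqueue:
    """Toy model: buffers are queued, the device is notified only on kick()."""

    def __init__(self):
        self.pending = []    # buffers added but not yet exposed to the device
        self.processed = []  # buffers the device has picked up
        self.kicks = 0       # number of notifications sent

    def add_buf(self, buf):
        self.pending.append(buf)

    def kick(self):
        """Notify the device once for everything queued so far."""
        if not self.pending:
            return  # nothing new, so no notification is needed
        self.kicks += 1
        self.processed.extend(self.pending)
        self.pending.clear()


def submit(vq, bufs, batch_size):
    """Queue bufs, kicking once per batch_size buffers and once at the end."""
    for count, buf in enumerate(bufs, start=1):
        vq.add_buf(buf)
        if count % batch_size == 0:
            vq.kick()
    vq.kick()  # flush a partial final batch, if any


for batch_size in (1, 8, 32):
    vq = ToyVirtqueue()
    submit(vq, range(64), batch_size)
    print(batch_size, vq.kicks)
```

With a polling device the kick count would be zero regardless of batch
size, which this model does not try to show.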


>   so it looks like at least two context switches are required
> for handling one IO(s).


For virtio, the queue layout or ring design should not be the bottleneck,
at least for block devices. I can give you some numbers measured in PPS
(since network traffic is more sensitive to queue layout than block):

1) vDPA vendors can achieve 30Mpps or even higher
2) software userspace virtio backends like vhost-user can do almost the
same or even higher

This is a strong hint that the virtio ring should be sufficient for block.
For NFV/wire-speed cases like 100G we do need more work on optimizing the
queue/descriptor format.
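
As a back-of-the-envelope check (simple arithmetic only; the block figure
is the highest 4k randread IOPS in the ublk-qcow2 results quoted above,
and comparing packets with block IOs is itself an assumption), the
per-operation time budgets work out as follows:

```python
def ns_per_op(ops_per_second):
    """Average time budget per operation, in nanoseconds."""
    return 1e9 / ops_per_second


ring_rate = 30e6     # 30 Mpps, the vDPA/vhost-user figure above
block_rate = 172720  # 4k randread IOPS from the quoted ublk-qcow2 run

print(f"ring:  {ns_per_op(ring_rate):.1f} ns per packet")
print(f"block: {ns_per_op(block_rate) / 1000:.2f} us per IO")
print(f"ratio: {ring_rate / block_rate:.0f}x")
```

The ring already sustains an operation rate roughly two orders of
magnitude above these block IOPS numbers, so the descriptor format alone
is unlikely to explain the gap.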

Thanks


>
>>> ublk uses io_uring pt cmd for handling both io submission and completion,
>>> turns out the extra latency can be pretty small.
>>>
>>> BTW, one un-related topic, I saw the following words in
>>> Documentation/userspace-api/vduse.rst:
>>>
>>> ```
>>> Note that only virtio block device is supported by VDUSE framework now,
>>> which can reduce security risks when the userspace process that implements
>>> the data path is run by an unprivileged user.
>>> ```
>>>
>>> But when I tried to start qemu-storage-daemon to create a vdpa-virtio
>>> block device as an unprivileged user, 'Permission denied' was still returned.
>>> Can you explain a bit how to start such a process as an unprivileged user?
>>> Or maybe I misunderstood the above words; please let me know.
>>>
>> Currently vduse should only allow privileged users by default. But
>> sysadmin can change the permission of the vduse char device or pass
>> the device fd to an unprivileged process IIUC.
> I'd appreciate it if you could provide more detailed steps for the above.
>
> BTW, I changed the permissions of /dev/vduse/control to allow a normal user,
> but qemu-storage-daemon still returns 'Permission denied'. And if the
> char dev is /dev/vduse/vduse0N, which is created by qemu-storage-daemon,
> how can the user of qemu-storage-daemon be changed to an unprivileged one
> after /dev/vduse/vduse0N is created?
>
>
>
> Thanks,
> Ming
>


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: ublk-qcow2: ublk-qcow2 is available
  2022-10-04 13:53     ` Stefan Hajnoczi
  2022-10-05  4:18       ` Ming Lei
@ 2022-10-06 10:14       ` Richard W.M. Jones
  2022-10-12 14:15         ` Stefan Hajnoczi
  1 sibling, 1 reply; 44+ messages in thread
From: Richard W.M. Jones @ 2022-10-06 10:14 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Ming Lei, Stefan Hajnoczi, io-uring, linux-block, linux-kernel,
	Kirill Tkhai, Manuel Bentele, qemu-devel, Kevin Wolf, Xie Yongji,
	Denis V. Lunev, Stefano Garzarella

On Tue, Oct 04, 2022 at 09:53:32AM -0400, Stefan Hajnoczi wrote:
> qemu-nbd doesn't use io_uring to handle the backend IO,

Would this be fixed by your (not yet upstream) libblkio driver for
qemu?

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-p2v converts physical machines to virtual machines.  Boot with a
live CD or over the network (PXE) and turn machines into KVM guests.
http://libguestfs.org/virt-v2v


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: ublk-qcow2: ublk-qcow2 is available
  2022-10-06 10:14       ` Richard W.M. Jones
@ 2022-10-12 14:15         ` Stefan Hajnoczi
  2022-10-13  1:50           ` Ming Lei
  0 siblings, 1 reply; 44+ messages in thread
From: Stefan Hajnoczi @ 2022-10-12 14:15 UTC (permalink / raw)
  To: Richard W.M. Jones
  Cc: Ming Lei, Stefan Hajnoczi, io-uring, linux-block, linux-kernel,
	Kirill Tkhai, Manuel Bentele, qemu-devel, Kevin Wolf, Xie Yongji,
	Denis V. Lunev, Stefano Garzarella

On Thu, 6 Oct 2022 at 06:14, Richard W.M. Jones <[email protected]> wrote:
>
> On Tue, Oct 04, 2022 at 09:53:32AM -0400, Stefan Hajnoczi wrote:
> > qemu-nbd doesn't use io_uring to handle the backend IO,
>
> Would this be fixed by your (not yet upstream) libblkio driver for
> qemu?

I was wrong, qemu-nbd has syntax to use io_uring:

  $ qemu-nbd ... --image-opts driver=file,filename=test.img,aio=io_uring

The new libblkio driver will also support io_uring, but QEMU's
built-in io_uring support is already available and can be used as
shown above.

Stefan

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: ublk-qcow2: ublk-qcow2 is available
  2022-10-12 14:15         ` Stefan Hajnoczi
@ 2022-10-13  1:50           ` Ming Lei
  2022-10-13 16:01             ` Stefan Hajnoczi
  0 siblings, 1 reply; 44+ messages in thread
From: Ming Lei @ 2022-10-13  1:50 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Richard W.M. Jones, Stefan Hajnoczi, io-uring, linux-block,
	linux-kernel, Kirill Tkhai, Manuel Bentele, qemu-devel,
	Kevin Wolf, Xie Yongji, Denis V. Lunev, Stefano Garzarella

On Wed, Oct 12, 2022 at 10:15:28AM -0400, Stefan Hajnoczi wrote:
> On Thu, 6 Oct 2022 at 06:14, Richard W.M. Jones <[email protected]> wrote:
> >
> > On Tue, Oct 04, 2022 at 09:53:32AM -0400, Stefan Hajnoczi wrote:
> > > qemu-nbd doesn't use io_uring to handle the backend IO,
> >
> > Would this be fixed by your (not yet upstream) libblkio driver for
> > qemu?
> 
> I was wrong, qemu-nbd has syntax to use io_uring:
> 
>   $ qemu-nbd ... --image-opts driver=file,filename=test.img,aio=io_uring

Yeah, I saw the option. Previously, when I tried io_uring via:

qemu-nbd -c /dev/nbd11 -n --aio=io_uring $my_file

it complained 'qemu-nbd: Invalid aio mode 'io_uring'' even though
'qemu-nbd --help' says that io_uring is supported.

I just tried it again today on Fedora 37, and it now works with
--aio=io_uring, but the IOPS is basically the same as with --aio=native,
and the IO trace shows that io_uring is used by qemu-nbd.


Thanks,
Ming

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: ublk-qcow2: ublk-qcow2 is available
  2022-10-13  1:50           ` Ming Lei
@ 2022-10-13 16:01             ` Stefan Hajnoczi
  0 siblings, 0 replies; 44+ messages in thread
From: Stefan Hajnoczi @ 2022-10-13 16:01 UTC (permalink / raw)
  To: Ming Lei
  Cc: Stefan Hajnoczi, Richard W.M. Jones, io-uring, linux-block,
	linux-kernel, Kirill Tkhai, Manuel Bentele, qemu-devel,
	Kevin Wolf, Xie Yongji, Denis V. Lunev, Stefano Garzarella

On Thu, Oct 13, 2022 at 09:50:55AM +0800, Ming Lei wrote:
> On Wed, Oct 12, 2022 at 10:15:28AM -0400, Stefan Hajnoczi wrote:
> > On Thu, 6 Oct 2022 at 06:14, Richard W.M. Jones <[email protected]> wrote:
> > >
> > > On Tue, Oct 04, 2022 at 09:53:32AM -0400, Stefan Hajnoczi wrote:
> > > > qemu-nbd doesn't use io_uring to handle the backend IO,
> > >
> > > Would this be fixed by your (not yet upstream) libblkio driver for
> > > qemu?
> > 
> > I was wrong, qemu-nbd has syntax to use io_uring:
> > 
> >   $ qemu-nbd ... --image-opts driver=file,filename=test.img,aio=io_uring
> 
> Yeah, I saw the option, previously when I tried io_uring via:
> 
> qemu-nbd -c /dev/nbd11 -n --aio=io_uring $my_file
> 
> It complains that 'qemu-nbd: Invalid aio mode 'io_uring'' even though
> that 'qemu-nbd --help' does say that io_uring is supported.
> 
> Today just tried it on Fedora 37, looks it starts working with
> --aio=io_uring, but the IOPS is basically same with --aio=native, and
> IO trace shows that io_uring is used by qemu-nbd.

Okay, similar performance to Linux AIO is expected. That's what we've
seen with io_uring in QEMU. QEMU doesn't use io_uring in polling mode,
so it's similar to what we get with Linux AIO.

Stefan

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: ublk-qcow2: ublk-qcow2 is available
  2022-09-30  9:24 ublk-qcow2: ublk-qcow2 is available Ming Lei
  2022-10-03 19:53 ` Stefan Hajnoczi
@ 2022-10-04  5:43 ` Manuel Bentele
  1 sibling, 0 replies; 44+ messages in thread
From: Manuel Bentele @ 2022-10-04  5:43 UTC (permalink / raw)
  To: Ming Lei, io-uring, linux-block, linux-kernel
  Cc: Kirill Tkhai, Stefan Hajnoczi, Simon Rettberg,
	Dirk von Suchodoletz

Hi all,

thanks for the notification. I want to note that the official "in kernel
qcow2 (ro)" project was renamed to "xloop" and is now maintained on
GitHub [1]. So far we are successfully using xloop to implement our use
case explained in [2].

It seems we now have a technical alternative for getting file-format
specific functionality out of the kernel. When I presented the "in kernel
qcow2 (ro)" project idea on the mailing list [3], there was a discussion
about whether file formats like qcow2 should be implemented in the kernel
or not. Now, this question should be obsolete.

[1] https://github.com/bwLehrpool/xloop
[2] https://www.spinics.net/lists/linux-block/msg44858.html
[3] https://www.spinics.net/lists/linux-block/msg39538.html

Regards,
Manuel

On 9/30/22 11:24, Ming Lei wrote:
> Hello,
>
> ublk-qcow2 is available now.
>
> So far it provides basic read/write function, and compression and snapshot
> aren't supported yet. The target/backend implementation is completely
> based on io_uring, and share the same io_uring with ublk IO command
> handler, just like what ublk-loop does.
>
> Follows the main motivations of ublk-qcow2:
>
> - building one complicated target from scratch helps libublksrv APIs/functions
>   become mature/stable more quickly, since qcow2 is complicated and needs more
>   requirement from libublksrv compared with other simple ones(loop, null)
>
> - there are several attempts of implementing qcow2 driver in kernel, such as
>   ``qloop`` [2], ``dm-qcow2`` [3] and ``in kernel qcow2(ro)`` [4], so ublk-qcow2
>   might useful be for covering requirement in this field
>
> - performance comparison with qemu-nbd, and it was my 1st thought to evaluate
>   performance of ublk/io_uring backend by writing one ublk-qcow2 since ublksrv
>   is started
>
> - help to abstract common building block or design pattern for writing new ublk
>   target/backend
>
> So far it basically passes xfstest(XFS) test by using ublk-qcow2 block
> device as TEST_DEV, and kernel building workload is verified too. Also
> soft update approach is applied in meta flushing, and meta data
> integrity is guaranteed, 'make test T=qcow2/040' covers this kind of
> test, and only cluster leak is reported during this test.
>
> The performance data looks much better compared with qemu-nbd, see
> details in commit log[1], README[5] and STATUS[6]. And the test covers both
> empty image and pre-allocated image, for example of pre-allocated qcow2
> image(8GB):
>
> - qemu-nbd (make test T=qcow2/002)
> 	randwrite(4k): jobs 1, iops 24605
> 	randread(4k): jobs 1, iops 30938
> 	randrw(4k): jobs 1, iops read 13981 write 14001
> 	rw(512k): jobs 1, iops read 724 write 728
>
> - ublk-qcow2 (make test T=qcow2/022)
> 	randwrite(4k): jobs 1, iops 104481
> 	randread(4k): jobs 1, iops 114937
> 	randrw(4k): jobs 1, iops read 53630 write 53577
> 	rw(512k): jobs 1, iops read 1412 write 1423
>
> Also ublk-qcow2 aligns queue's chunk_sectors limit with qcow2's cluster size,
> which is 64KB at default, this way simplifies backend io handling, but
> it could be increased to 512K or more proper size for improving sequential
> IO perf, just need one coroutine to handle more than one IOs.
>
>
> [1] https://github.com/ming1/ubdsrv/commit/9faabbec3a92ca83ddae92335c66eabbeff654e7
> [2] https://upcommons.upc.edu/bitstream/handle/2099.1/9619/65757.pdf?sequence=1&isAllowed=y
> [3] https://lwn.net/Articles/889429/
> [4] https://lab.ks.uni-freiburg.de/projects/kernel-qcow2/repository
> [5] https://github.com/ming1/ubdsrv/blob/master/qcow2/README.rst
> [6] https://github.com/ming1/ubdsrv/blob/master/qcow2/STATUS.rst
>
> Thanks,
> Ming

^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2022-11-04  6:57 UTC | newest]

Thread overview: 44+ messages
2022-09-30  9:24 ublk-qcow2: ublk-qcow2 is available Ming Lei
2022-10-03 19:53 ` Stefan Hajnoczi
2022-10-03 23:57   ` Denis V. Lunev
2022-10-05 15:11     ` Stefan Hajnoczi
2022-10-06 10:26       ` Ming Lei
2022-10-06 13:59         ` Stefan Hajnoczi
2022-10-06 15:09           ` Ming Lei
2022-10-06 18:29             ` Stefan Hajnoczi
2022-10-07 11:21               ` Ming Lei
2022-10-04  9:43   ` Ming Lei
2022-10-04 13:53     ` Stefan Hajnoczi
2022-10-05  4:18       ` Ming Lei
2022-10-05 12:21         ` Stefan Hajnoczi
2022-10-05 12:38           ` Denis V. Lunev
2022-10-06 11:24           ` Ming Lei
2022-10-07 10:04             ` Yongji Xie
2022-10-07 10:51               ` Ming Lei
2022-10-07 11:21                 ` Yongji Xie
2022-10-07 11:23                   ` Ming Lei
2022-10-08  8:43         ` Ziyang Zhang
2022-10-12 14:22           ` Stefan Hajnoczi
2022-10-13  6:48             ` Yongji Xie
2022-10-13 16:02               ` Stefan Hajnoczi
2022-10-14 12:56               ` Ming Lei
2022-10-17 11:11                 ` Yongji Xie
2022-10-18  6:59                   ` Ming Lei
2022-10-18 13:17                     ` Yongji Xie
2022-10-18 14:54                       ` Stefan Hajnoczi
2022-10-19  9:09                         ` Ming Lei
2022-10-24 16:11                           ` Stefan Hajnoczi
2022-10-21  5:33                         ` Yongji Xie
2022-10-21  6:30                           ` Jason Wang
2022-10-25  8:17                             ` Yongji Xie
2022-10-25 12:02                               ` Stefan Hajnoczi
2022-10-28 13:33                                 ` Yongji Xie
2022-11-01  2:36                                 ` Jason Wang
2022-11-02 19:13                                   ` Stefan Hajnoczi
2022-11-04  6:55                                     ` Jason Wang
2022-10-21  6:28                     ` Jason Wang
2022-10-06 10:14       ` Richard W.M. Jones
2022-10-12 14:15         ` Stefan Hajnoczi
2022-10-13  1:50           ` Ming Lei
2022-10-13 16:01             ` Stefan Hajnoczi
2022-10-04  5:43 ` Manuel Bentele
