public inbox for [email protected]
 help / color / mirror / Atom feed
* [LSF/MM/BPF TOPIC] ublk & io_uring: ublk zero copy support
@ 2023-04-29  2:18 Ming Lei
  2023-05-05 21:57 ` Bernd Schubert
  0 siblings, 1 reply; 4+ messages in thread
From: Ming Lei @ 2023-04-29  2:18 UTC (permalink / raw)
  To: Jens Axboe, Pavel Begunkov, Miklos Szeredi, Bernd Schubert,
	Christoph Hellwig, Ziyang Zhang, Xiaoguang Wang
  Cc: ming.lei, lsf-pc, io-uring, linux-block

Hello,

ublk zero copy has been observed to improve large-chunk (64KB+) sequential IO
performance a lot: for example, IOPS of ublk-loop over tmpfs increases by 1~2X[1],
and Jens observed that IOPS of ublk-qcow2 can be increased by ~1X[2]. It also
saves memory bandwidth.

So this is an important performance improvement.

So far there are three proposals:

1) splice based

- pages spliced via ->splice_read() can't be written to

ublk READ requests can't be handled because the spliced pages can't be written
to, and extending splice for ublk zero copy isn't a good solution[3].

- it is very hard to meet ublk's requirements wrt. request buffer lifetime

splice/pipe focuses on page reference lifetime, but ublk zero copy cares about
the ublk request buffer lifetime. It is very inefficient to respect the request
buffer lifetime via every pipe buffer's ->release(), which requires all pipe
buffers and the pipe itself to be kept around while the ublk server handles the
IO. That means a dedicated ``pipe_inode_info`` has to be allocated at runtime
for each provided buffer, and the pipe has to be populated with the pages of
the ublk request buffer.
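
To make the lifetime issue concrete, here is a rough, purely illustrative
sketch (not taken from any posted patches; ``ublk_zc_buf``,
``ublk_zc_pipe_buf_release()`` and ``ublk_zc_complete_io()`` are made-up names)
of what tracking the request buffer lifetime through pipe buffer ->release()
would imply:

#include <linux/pipe_fs_i.h>
#include <linux/refcount.h>

/* hypothetical per-request state; one dedicated pipe (alloc_pipe_info())
 * would have to be set up at runtime for each provided buffer and be
 * populated with the pages of the ublk request buffer */
struct ublk_zc_buf {
	refcount_t		refs;	/* one reference per pipe buffer */
	struct pipe_inode_info	*pipe;
};

static void ublk_zc_complete_io(struct ublk_zc_buf *zc);	/* hypothetical */

/* called when a consumer releases one pipe buffer; the ublk request
 * buffer may only be completed/reused once every pipe buffer built from
 * it has been released, so the pipe and its buffers must outlive the IO */
static void ublk_zc_pipe_buf_release(struct pipe_inode_info *pipe,
				     struct pipe_buffer *buf)
{
	struct ublk_zc_buf *zc = (struct ublk_zc_buf *)buf->private;

	if (refcount_dec_and_test(&zc->refs))
		ublk_zc_complete_io(zc);
}

static const struct pipe_buf_operations ublk_zc_pipe_buf_ops = {
	.release = ublk_zc_pipe_buf_release,
};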

IMO, splice isn't a good way to go here, from both the correctness and the
performance viewpoint.

2) io_uring register buffer based

- the main idea is to register a runtime buffer in the fast IO path, and
  unregister it after the buffer has been used by the following OPs

- the main problem is the bad performance caused by the io_uring link model

registering the buffer has to be one OP, and the same goes for unregistering
it; the following normal OPs (such as FS IO) have to depend on the
buffer-registering OP, so io_uring links have to be used.

It is normal to have more than one normal OP depending on the
buffer-registering OP, so all of these OPs (registering the buffer, the normal
(FS IO) OPs and unregistering the buffer) have to be linked together. The
normal (FS IO) OPs then have to be submitted one by one, which is slow, because
there is often no dependency among the FS OPs themselves. Basically the
io_uring link model does not support this kind of 1:N dependency.

No one has posted code demonstrating this approach yet.
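
For illustration only, the chain would roughly have to look like the following
liburing-style sketch; ``io_uring_prep_ublk_buf_register()`` and
``io_uring_prep_ublk_buf_unregister()`` are hypothetical stand-ins for the
proposed register/unregister OPs, and ``ring``, ``ublk_fd``, ``backing_fd``,
``tag``, ``buf_idx`` and the len/off pairs come from the ublk server's context:

struct io_uring_sqe *sqe;

sqe = io_uring_get_sqe(&ring);
io_uring_prep_ublk_buf_register(sqe, ublk_fd, tag, buf_idx);	/* hypothetical OP */
sqe->flags |= IOSQE_IO_LINK;

/* the two reads are independent of each other, but both depend on the
 * register OP, so the whole chain has to be linked and the reads end up
 * running one after the other; addr is left NULL because the buffer
 * would be the kernel-registered ublk request buffer */
sqe = io_uring_get_sqe(&ring);
io_uring_prep_read_fixed(sqe, backing_fd, NULL, len1, off1, buf_idx);
sqe->flags |= IOSQE_IO_LINK;

sqe = io_uring_get_sqe(&ring);
io_uring_prep_read_fixed(sqe, backing_fd, NULL, len2, off2, buf_idx);
sqe->flags |= IOSQE_IO_LINK;

sqe = io_uring_get_sqe(&ring);
io_uring_prep_ublk_buf_unregister(sqe, ublk_fd, tag, buf_idx);	/* hypothetical OP */

io_uring_submit(&ring);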

3) io_uring fused command[1]

- fused command extends current io_uring usage by allowing the following FS
OPs (called secondary OPs) to be submitted after the primary command provides
the buffer, and the primary command isn't completed until all secondary OPs
are done.

This solves the problem in 2), and meanwhile avoids the buffer registration
cost in both the submission and the completion fast path: because the primary
command isn't completed until all secondary OPs are done, there is no need to
write/read the buffer into/from a per-context global data structure.

Meanwhile the buffer lifetime problem is addressed simply, so correctness is
guaranteed, and performance is pretty good: IOPS of 4k IO even improves a
little in some workloads, or at least no perf regression is observed for
small-size IO.

A fused command can logically be thought of as one single request, except that
it consists of more than one SQE (all sharing the same link flag); that is why
it is named a fused command.
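
Schematically, from the ublk server's point of view, a fused command submission
could look like the sketch below; the prep helpers are placeholders and do not
reflect the patchset's actual uAPI, only the SQE grouping and the completion
ordering described above:

sqe = io_uring_get_sqe(&ring);
io_uring_prep_fused_primary(sqe, ublk_fd, tag);		/* placeholder: provides the buffer */
sqe->flags |= IOSQE_IO_LINK;

sqe = io_uring_get_sqe(&ring);
io_uring_prep_fused_secondary_read(sqe, backing_fd, len1, off1);	/* placeholder */
sqe->flags |= IOSQE_IO_LINK;

sqe = io_uring_get_sqe(&ring);
io_uring_prep_fused_secondary_write(sqe, backing_fd, len2, off2);	/* placeholder */
sqe->flags |= IOSQE_IO_LINK;

io_uring_submit(&ring);

/* the CQE of the primary command is only posted after both secondary OPs
 * complete, so the ublk request buffer stays valid exactly as long as it
 * is used, without a register/unregister round trip in the fast path */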

- the only concern is that fused command introduces a new kind of io_uring
usage, but so far there have been no comments on what is bad about this kind
of new usage/interface, or why.

I propose this topic and want to discuss how to move forward with this
feature.


[1] https://lore.kernel.org/linux-block/[email protected]/
[2] https://lore.kernel.org/linux-block/[email protected]/
[3] https://lore.kernel.org/linux-block/CAHk-=wgJsi7t7YYpuo6ewXGnHz2nmj67iWR6KPGoz5TBu34mWQ@mail.gmail.com/


Thanks,
Ming


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [LSF/MM/BPF TOPIC] ublk & io_uring: ublk zero copy support
  2023-04-29  2:18 [LSF/MM/BPF TOPIC] ublk & io_uring: ublk zero copy support Ming Lei
@ 2023-05-05 21:57 ` Bernd Schubert
  2023-05-06  1:38   ` Ming Lei
  0 siblings, 1 reply; 4+ messages in thread
From: Bernd Schubert @ 2023-05-05 21:57 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, Pavel Begunkov, Miklos Szeredi,
	Christoph Hellwig, Ziyang Zhang, Xiaoguang Wang
  Cc: lsf-pc, io-uring, linux-block

Hi Ming,

On 4/29/23 04:18, Ming Lei wrote:
> Hello,
> 
> ublk zero copy has been observed to improve large-chunk (64KB+) sequential IO
> performance a lot: for example, IOPS of ublk-loop over tmpfs increases by 1~2X[1],
> and Jens observed that IOPS of ublk-qcow2 can be increased by ~1X[2]. It also
> saves memory bandwidth.
> 
> So this is an important performance improvement.
> 
> So far there are three proposals:

looks like there is no dedicated session. Could we still have a 
discussion in a free slot, if possible?

Thanks,
Bernd



^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [LSF/MM/BPF TOPIC] ublk & io_uring: ublk zero copy support
  2023-05-05 21:57 ` Bernd Schubert
@ 2023-05-06  1:38   ` Ming Lei
  2023-05-08  2:16     ` Pavel Begunkov
  0 siblings, 1 reply; 4+ messages in thread
From: Ming Lei @ 2023-05-06  1:38 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Jens Axboe, Pavel Begunkov, Miklos Szeredi, Christoph Hellwig,
	Ziyang Zhang, Xiaoguang Wang, lsf-pc, io-uring, linux-block

On Fri, May 05, 2023 at 09:57:47PM +0000, Bernd Schubert wrote:
> Hi Ming,
> 
> On 4/29/23 04:18, Ming Lei wrote:
> > Hello,
> > 
> > ublk zero copy has been observed to improve large-chunk (64KB+) sequential IO
> > performance a lot: for example, IOPS of ublk-loop over tmpfs increases by 1~2X[1],
> > and Jens observed that IOPS of ublk-qcow2 can be increased by ~1X[2]. It also
> > saves memory bandwidth.
> > 
> > So this is an important performance improvement.
> > 
> > So far there are three proposals:
> 
> looks like there is no dedicated session. Could we still have a 
> discussion in a free slot, if possible?

Sure, and we can invite Pavel to the talk too if he is at this LSF/MM.


thanks,
Ming


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [LSF/MM/BPF TOPIC] ublk & io_uring: ublk zero copy support
  2023-05-06  1:38   ` Ming Lei
@ 2023-05-08  2:16     ` Pavel Begunkov
  0 siblings, 0 replies; 4+ messages in thread
From: Pavel Begunkov @ 2023-05-08  2:16 UTC (permalink / raw)
  To: Ming Lei, Bernd Schubert
  Cc: Jens Axboe, Miklos Szeredi, Christoph Hellwig, Ziyang Zhang,
	Xiaoguang Wang, lsf-pc, io-uring, linux-block

On 5/6/23 02:38, Ming Lei wrote:
> On Fri, May 05, 2023 at 09:57:47PM +0000, Bernd Schubert wrote:
>> Hi Ming,
>>
>> On 4/29/23 04:18, Ming Lei wrote:
>>> Hello,
>>>
>>> ublk zero copy has been observed to improve large-chunk (64KB+) sequential IO
>>> performance a lot: for example, IOPS of ublk-loop over tmpfs increases by 1~2X[1],
>>> and Jens observed that IOPS of ublk-qcow2 can be increased by ~1X[2]. It also
>>> saves memory bandwidth.
>>>
>>> So this is an important performance improvement.
>>>
>>> So far there are three proposals:
>>
>> looks like there is no dedicated session. Could we still have a
>> discussion in a free slot, if possible?
> 
> Sure, and we can invite Pavel to the talk too if he is at this LSF/MM.

I'd love to go, but regretfully I can't make it.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2023-05-08  2:22 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-04-29  2:18 [LSF/MM/BPF TOPIC] ublk & io_uring: ublk zero copy support Ming Lei
2023-05-05 21:57 ` Bernd Schubert
2023-05-06  1:38   ` Ming Lei
2023-05-08  2:16     ` Pavel Begunkov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox