* SQPOLL / uring_cmd_iopoll
@ 2023-04-21 22:09 Bernd Schubert
From: Bernd Schubert @ 2023-04-21 22:09 UTC
To: [email protected]; +Cc: Ming Lei, Jens Axboe, Pavel Begunkov
Hello,
I was wondering if I could set up SQPOLL for fuse/IORING_OP_URING_CMD
and what the latency win would be. Now I'm a bit confused about what
the f_op->uring_cmd_iopoll() function is supposed to do.
Is it just there to check if SQEs can be completed as CQEs? In rw.c
io_do_iopoll() it looks like this. I don't follow all code paths in
__io_sq_thread yet, but it looks like it already checks if the ring
has new entries:
to_submit = io_sqring_entries(ctx);
...
ret = io_submit_sqes(ctx, to_submit);
--> it will eventually call into ->uring_cmd() ?
And then io_do_iopoll -> file->f_op->uring_cmd_iopoll is supposed to
check for available CQ entries and submit these? I.e., do I just
return 1 when the request is ready? And also ensure that
req->iopoll_completed is set?
I'm also not sure what I should do with struct io_comp_batch *; I don't
have a struct request *req_list anywhere in my fuse-uring changes, it
seems to be blk-mq specific? So should I just ignore that parameter?
Btw, this might be useful for ublk as well?
Thanks,
Bernd
* Re: SQPOLL / uring_cmd_iopoll
From: Jens Axboe @ 2023-04-22 2:13 UTC
To: Bernd Schubert, [email protected]; +Cc: Ming Lei, Pavel Begunkov
On 4/21/23 4:09 PM, Bernd Schubert wrote:
> Hello,
>
> I was wondering if I could set up SQPOLL for fuse/IORING_OP_URING_CMD
> and what the latency win would be. Now I'm a bit confused about what
> the f_op->uring_cmd_iopoll() function is supposed to do.
Certainly, you can use SQPOLL with anything. Whether or not it'd be a
win depends a lot on what you're doing, rate of IO, etc.
IOPOLL and SQPOLL are two different things. SQPOLL has a kernel-side
submission thread that polls for new SQ entries and submits them when it
sees them. IOPOLL is a method for avoiding sleeping while waiting on CQ
entries; instead, it polls the target for completion. That's where
->uring_cmd_iopoll() comes in: that's the hook for polling for uring
commands. For normal fs-path read/write requests, ->uring_iopoll() is
the hook that performs the same kind of action.
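For reference, both hooks live in struct file_operations. Roughly (a
sketch; the exact signatures vary by kernel version, so double-check
your tree):

	struct file_operations {
		...
		/* issue an IORING_OP_URING_CMD passthrough command */
		int (*uring_cmd)(struct io_uring_cmd *ioucmd,
				 unsigned int issue_flags);
		/* poll a previously issued uring_cmd for completion */
		int (*uring_cmd_iopoll)(struct io_uring_cmd *ioucmd,
					struct io_comp_batch *iob,
					unsigned int poll_flags);
		...
	};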
> Is it just there to check if SQEs can be completed as CQEs? In rw.c
Not sure I follow what you're trying to convey here, maybe you can
expand on that? And maybe some of the confusion here is because of
mixing up SQPOLL and IOPOLL?
> io_do_iopoll() it looks like this. I don't follow all code paths in
> __io_sq_thread yet, but it looks like it already checks if the ring
> has new entries:
>
> to_submit = io_sqring_entries(ctx);
> ...
> ret = io_submit_sqes(ctx, to_submit);
>
> --> it will eventually call into ->uring_cmd() ?
The SQPOLL thread will pull off new SQEs, and those will then at some
point hit ->issue(), which is an opcode-dependent method for issuing the
actual request. Once it's been issued, if the ring is IOPOLL, then
io_iopoll_req_issued() will get called which adds the request to an
internal poll list. When someone does io_uring_enter(2) to wait for
events on a ring with IOPOLL, it will iterate that list and call
->uring_cmd_iopoll() for uring_cmd requests, and ->uring_iopoll() for
"normal" requests.
If the ring is using SQPOLL|IOPOLL, then the SQPOLL thread is also the
one that does the polling. See __io_sq_thread() -> io_do_iopoll().
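Simplified, one iteration of that loop does roughly the following (a
sketch assembled from the call chain above, not the literal kernel
code):

	to_submit = io_sqring_entries(ctx);
	if (to_submit)
		/* -> ->issue() -> ->uring_cmd() */
		ret = io_submit_sqes(ctx, to_submit);
	if ((ctx->flags & IORING_SETUP_IOPOLL) &&
	    !wq_list_empty(&ctx->iopoll_list))
		/* -> ->uring_cmd_iopoll() / ->uring_iopoll() */
		io_do_iopoll(ctx, true);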
> And then io_do_iopoll -> file->f_op->uring_cmd_iopoll is supposed to
> check for available CQ entries and submit these? I.e., do I just
> return 1 when the request is ready? And also ensure that
> req->iopoll_completed is set?
The callback polls for a completion on the target side, which will mark
it as ->iopoll_completed = true. That still leaves the requests on the
iopoll list, and io_do_iopoll() will spot that and post CQEs for them.
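So on the driver side the callback boils down to a readiness check,
something like this (a minimal sketch; the mydev_* names and the done
flag are made up, not from a real driver):

	static int mydev_uring_cmd_iopoll(struct io_uring_cmd *cmd,
					  struct io_comp_batch *iob,
					  unsigned int poll_flags)
	{
		struct mydev_cmd *mc = (struct mydev_cmd *)cmd->pdu;

		/* target not done yet: io_do_iopoll() keeps polling */
		if (!READ_ONCE(mc->done))
			return 0;

		/*
		 * The completion path (io_uring_cmd_done()) has already
		 * set req->iopoll_completed; just report readiness.
		 */
		return 1;
	}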
> I'm also not sure what I should do with struct io_comp_batch *; I don't
> have a struct request *req_list anywhere in my fuse-uring changes, it
> seems to be blk-mq specific? So should I just ignore that parameter?
Hard to say since the above is a bit confusing and I haven't seen your
code, but you can always start off just passing NULL. That's fine and
just doesn't do any completion batching. The latter may or may not be
useful for your case, but in any case, it's fine to pass NULL.
> Btw, this might be useful for ublk as well?
Not sure what "this" is :-)
--
Jens Axboe
* Re: SQPOLL / uring_cmd_iopoll
From: Ming Lei @ 2023-04-22 12:55 UTC
To: Bernd Schubert
Cc: [email protected], Jens Axboe, Pavel Begunkov, ming.lei
On Fri, Apr 21, 2023 at 10:09:36PM +0000, Bernd Schubert wrote:
> Hello,
>
> I was wondering if I could set up SQPOLL for fuse/IORING_OP_URING_CMD
> and what the latency win would be. Now I'm a bit confused about what
> the f_op->uring_cmd_iopoll() function is supposed to do.
>
> Is it just there to check if SQEs can be completed as CQEs? In rw.c
> io_do_iopoll() it looks like this. I don't follow all code paths in
> __io_sq_thread yet, but it looks like it already checks if the ring
> has new entries:
>
> to_submit = io_sqring_entries(ctx);
> ...
> ret = io_submit_sqes(ctx, to_submit);
>
> --> it will eventually call into ->uring_cmd() ?
>
> And then io_do_iopoll -> file->f_op->uring_cmd_iopoll is supposed to
> check for available CQ entries and submit these? I.e., do I just
> return 1 when the request is ready? And also ensure that
> req->iopoll_completed is set?
>
>
> I'm also not sure what I should do with struct io_comp_batch *; I don't
> have a struct request *req_list anywhere in my fuse-uring changes, it
> seems to be blk-mq specific? So should I just ignore that parameter?
>
>
> Btw, this might be useful for ublk as well?
For the in-tree ublk driver, we need to copy data inside ->uring_cmd()
between the block request pages and the user buffer, so SQPOLL may not
be doable because it isn't efficient for the kthread to copy in the
remote task's mm space. However, the ublk user copy feature [1] (posted
recently) doesn't need the copy in ->uring_cmd() any more, so SQPOLL
becomes possible for ublk uring cmd.
Also, for uring cmd alone, IOPOLL may not be doable given that we don't
know when a request from /dev/ublkb* will arrive, maybe never. But if
there is any target IO pending, IOPOLL can be used, and the change
should be trivial.
[1] https://lore.kernel.org/linux-block/ZEFeYsQ%[email protected]/T/#t
Thanks,
Ming
* Re: SQPOLL / uring_cmd_iopoll
From: Bernd Schubert @ 2023-04-22 13:40 UTC
To: Jens Axboe, [email protected]; +Cc: Ming Lei, Pavel Begunkov
On 4/22/23 04:13, Jens Axboe wrote:
> On 4/21/23 4:09 PM, Bernd Schubert wrote:
>> Hello,
>>
>> I was wondering if I could set up SQPOLL for fuse/IORING_OP_URING_CMD
>> and what the latency win would be. Now I'm a bit confused about what
>> the f_op->uring_cmd_iopoll() function is supposed to do.
>
> Certainly, you can use SQPOLL with anything. Whether or not it'd be a
> win depends a lot on what you're doing, rate of IO, etc.
>
> IOPOLL and SQPOLL are two different things. SQPOLL has a kernel-side
> submission thread that polls for new SQ entries and submits them when it
> sees them. IOPOLL is a method for avoiding sleeping while waiting on CQ
> entries; instead, it polls the target for completion. That's where
> ->uring_cmd_iopoll() comes in: that's the hook for polling for uring
> commands. For normal fs-path read/write requests, ->uring_iopoll() is
> the hook that performs the same kind of action.
>
>> Is it just there to check if SQEs can be completed as CQEs? In rw.c
>
> Not sure I follow what you're trying to convey here, maybe you can
> expand on that? And maybe some of the confusion here is because of
> mixing up SQPOLL and IOPOLL?
Thanks a lot for your help! I was confused about when
f_op->uring_cmd_iopoll gets called, i.e. whether I needed to check
myself whether the SQE had been submitted, or whether it is for CQE
completion. You already resolved my confusion with your comments below.
>
>> io_do_iopoll() it looks like this. I don't follow all code paths in
>> __io_sq_thread yet, but it looks like it already checks if the ring
>> has new entries:
>>
>> to_submit = io_sqring_entries(ctx);
>> ...
>> ret = io_submit_sqes(ctx, to_submit);
>>
>> --> it will eventually call into ->uring_cmd() ?
>
> The SQPOLL thread will pull off new SQEs, and those will then at some
> point hit ->issue(), which is an opcode-dependent method for issuing the
> actual request. Once it's been issued, if the ring is IOPOLL, then
> io_iopoll_req_issued() will get called which adds the request to an
> internal poll list. When someone does io_uring_enter(2) to wait for
> events on a ring with IOPOLL, it will iterate that list and call
> ->uring_cmd_iopoll() for uring_cmd requests, and ->uring_iopoll() for
> "normal" requests.
Thanks, that basically answered my SQE confusion: io_uring does it all itself.
>
> If the ring is using SQPOLL|IOPOLL, then the SQPOLL thread is also the
> one that does the polling. See __io_sq_thread() -> io_do_iopoll().
>
>> And then io_do_iopoll -> file->f_op->uring_cmd_iopoll is supposed to
>> check for available CQ entries and submit these? I.e., do I just
>> return 1 when the request is ready? And also ensure that
>> req->iopoll_completed is set?
>
> The callback polls for a completion on the target side, which will mark
> it as ->iopoll_completed = true. That still leaves the requests on the
> iopoll list, and io_do_iopoll() will spot that and post CQEs for them.
>
>> I'm also not sure what I should do with struct io_comp_batch *; I don't
>> have a struct request *req_list anywhere in my fuse-uring changes, it
>> seems to be blk-mq specific? So should I just ignore that parameter?
>
> Hard to say since the above is a bit confusing and I haven't seen your
> code, but you can always start off just passing NULL. That's fine and
> just doesn't do any completion batching. The latter may or may not be
> useful for your case, but in any case, it's fine to pass NULL.
I had sent the patches to fsdevel and, given that they are mostly fuse
related, didn't add you to CC:
https://lwn.net/Articles/926773/
The code is also here
https://github.com/bsbernd/linux/tree/fuse-uring-for-6.2
https://github.com/bsbernd/libfuse/tree/uring
I got SQPOLL working using this simple function:

/**
 * This is called for requests when the ring is configured with
 * IORING_SETUP_IOPOLL.
 */
int fuse_uring_cmd_poll(struct io_uring_cmd *cmd, struct io_comp_batch *iob,
			unsigned int poll_flags)
{
	/*
	 * Not much to be done here: when IORING_SETUP_IOPOLL is set,
	 * io_uring_cmd_done() already sets req->iopoll_completed.
	 * The caller (io_do_iopoll) checks this flag and won't enter
	 * this function at all in that case. So when we do get called,
	 * we just return 0 to tell the caller that the cmd is not
	 * ready yet.
	 */
	return 0;
}
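For completeness, the userspace side only needs the setup flags; roughly
(a liburing sketch; QUEUE_DEPTH and the idle time are placeholders):

	struct io_uring_params p = { 0 };
	struct io_uring ring;

	p.flags = IORING_SETUP_SQPOLL | IORING_SETUP_IOPOLL;
	p.sq_thread_idle = 1000;	/* ms before the SQPOLL thread idles */

	int ret = io_uring_queue_init_params(QUEUE_DEPTH, &ring, &p);
	if (ret < 0)
		/* fall back to a plain ring without SQPOLL/IOPOLL */
		ret = io_uring_queue_init(QUEUE_DEPTH, &ring, 0);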
I just gave it a quick file creation/removal run with single-threaded
bonnie++, and performance is actually lower than before: around 8000
creates/s without IORING_SETUP_SQPOLL (adding IORING_SETUP_IOPOLL
doesn't help either) and about 5000 creates/s with IORING_SETUP_SQPOLL.
With plain /dev/fuse it is about 2000 creates/s.
The main improvement comes from ensuring that request submission
(application) and request handling (ring/thread) happen on the same
core. I'm running into some scheduler issues, which I work around for
now using migrate_disable()/migrate_enable() before/after the fuse
request waitq wait; without that, performance for metadata requests is
similar to plain /dev/fuse.
I will soon post an RFC v2 series with more benchmark results including
read and write.
>
>> Btw, this might be useful for ublk as well?
>
> Not sure what "this" is :-)
Ming already replied.
Thanks,
Bernd
* Re: SQPOLL / uring_cmd_iopoll
From: Jens Axboe @ 2023-04-22 14:08 UTC
To: Ming Lei, Bernd Schubert; +Cc: [email protected], Pavel Begunkov
On 4/22/23 6:55 AM, Ming Lei wrote:
> On Fri, Apr 21, 2023 at 10:09:36PM +0000, Bernd Schubert wrote:
>> Hello,
>>
>> I was wondering if I could set up SQPOLL for fuse/IORING_OP_URING_CMD
>> and what the latency win would be. Now I'm a bit confused about what
>> the f_op->uring_cmd_iopoll() function is supposed to do.
>>
>> Is it just there to check if SQEs can be completed as CQEs? In rw.c
>> io_do_iopoll() it looks like this. I don't follow all code paths in
>> __io_sq_thread yet, but it looks like it already checks if the ring
>> has new entries:
>>
>> to_submit = io_sqring_entries(ctx);
>> ...
>> ret = io_submit_sqes(ctx, to_submit);
>>
>> --> it will eventually call into ->uring_cmd() ?
>>
>> And then io_do_iopoll -> file->f_op->uring_cmd_iopoll is supposed to
>> check for available CQ entries and submit these? I.e., do I just
>> return 1 when the request is ready? And also ensure that
>> req->iopoll_completed is set?
>>
>>
>> I'm also not sure what I should do with struct io_comp_batch *; I don't
>> have a struct request *req_list anywhere in my fuse-uring changes, it
>> seems to be blk-mq specific? So should I just ignore that parameter?
>>
>>
>> Btw, this might be useful for ublk as well?
>
> For the in-tree ublk driver, we need to copy data inside ->uring_cmd()
> between the block request pages and the user buffer, so SQPOLL may not
> be doable because it isn't efficient for the kthread to copy in the
> remote task's mm space. However, the ublk user copy feature [1] (posted
> recently) doesn't need the copy in ->uring_cmd() any more, so SQPOLL
> becomes possible for ublk uring cmd.
That hasn't been true for a long time, and isn't even true in
5.10-stable anymore or anything newer. The SQPOLL thread is not a
kthread, and it doesn't need to do anything to copy the data that the
inline submission wouldn't also do. There is no "remote task mm". The
cost would be the same, outside of caching effects.
--
Jens Axboe
* Re: SQPOLL / uring_cmd_iopoll
From: Ming Lei @ 2023-04-23 1:06 UTC
To: Jens Axboe
Cc: Bernd Schubert, [email protected], Pavel Begunkov,
ming.lei
On Sat, Apr 22, 2023 at 08:08:41AM -0600, Jens Axboe wrote:
> On 4/22/23 6:55 AM, Ming Lei wrote:
> > On Fri, Apr 21, 2023 at 10:09:36PM +0000, Bernd Schubert wrote:
> >> Hello,
> >>
> >> I was wondering if I could set up SQPOLL for fuse/IORING_OP_URING_CMD
> >> and what the latency win would be. Now I'm a bit confused about what
> >> the f_op->uring_cmd_iopoll() function is supposed to do.
> >>
> >> Is it just there to check if SQEs can be completed as CQEs? In rw.c
> >> io_do_iopoll() it looks like this. I don't follow all code paths in
> >> __io_sq_thread yet, but it looks like it already checks if the ring
> >> has new entries:
> >>
> >> to_submit = io_sqring_entries(ctx);
> >> ...
> >> ret = io_submit_sqes(ctx, to_submit);
> >>
> >> --> it will eventually call into ->uring_cmd() ?
> >>
> >> And then io_do_iopoll -> file->f_op->uring_cmd_iopoll is supposed to
> >> check for available CQ entries and submit these? I.e., do I just
> >> return 1 when the request is ready? And also ensure that
> >> req->iopoll_completed is set?
> >>
> >>
> >> I'm also not sure what I should do with struct io_comp_batch *; I don't
> >> have a struct request *req_list anywhere in my fuse-uring changes, it
> >> seems to be blk-mq specific? So should I just ignore that parameter?
> >>
> >>
> >> Btw, this might be useful for ublk as well?
> >
> > For the in-tree ublk driver, we need to copy data inside ->uring_cmd()
> > between the block request pages and the user buffer, so SQPOLL may not
> > be doable because it isn't efficient for the kthread to copy in the
> > remote task's mm space. However, the ublk user copy feature [1] (posted
> > recently) doesn't need the copy in ->uring_cmd() any more, so SQPOLL
> > becomes possible for ublk uring cmd.
>
> That hasn't been true for a long time, and isn't even true in
> 5.10-stable anymore or anything newer. The SQPOLL thread is not a
> kthread, and it doesn't need to do anything to copy the data that the
> inline submission wouldn't also do. There is no "remote task mm". The
> cost would be the same, outside of caching effects.
OK, thanks for the clarification; create_io_thread() does pass
CLONE_VM, so there isn't a remote task mm problem.
However, ublk still can't use SETUP_SQPOLL so far; the problem is that
the ublk driver has to be bound to the user task for canceling pending
commands when the ctx is gone [1]. Once this issue is solved,
SETUP_SQPOLL should work just fine.
Given that fuse takes a similar approach to ublk, I believe fuse has a
similar limitation.
Actually, I was working on adding notifiers to io_uring [2] to address
this issue, so that the driver needn't use the trick of tracking
io_uring context destruction. I just hit one request double-free issue
(the same request freed twice in io_submit_flush_completions() <-
io_fallback_req_func()), only in the DEFER_TASKRUN case, even though
the driver calls io_uring_cmd_done() just once. Will investigate the
issue further.
[1] https://lore.kernel.org/linux-fsdevel/[email protected]/
[2] https://github.com/ming1/linux/commits/for-6.4/io_uring_block
Thanks,
Ming
* Re: SQPOLL / uring_cmd_iopoll
From: Bernd Schubert @ 2023-04-24 12:55 UTC
To: Jens Axboe, [email protected]
Cc: Ming Lei, Pavel Begunkov, Miklos Szeredi
On 4/22/23 15:40, Bernd Schubert wrote:
> On 4/22/23 04:13, Jens Axboe wrote:
>
> I just gave it a quick file creation/removal run with single-threaded
> bonnie++, and performance is actually lower than before: around 8000
> creates/s without IORING_SETUP_SQPOLL (adding IORING_SETUP_IOPOLL
> doesn't help either) and about 5000 creates/s with IORING_SETUP_SQPOLL.
> With plain /dev/fuse it is about 2000 creates/s.
> The main improvement comes from ensuring that request submission
> (application) and request handling (ring/thread) happen on the same
> core. I'm running into some scheduler issues, which I work around for
> now using migrate_disable()/migrate_enable() before/after the fuse
> request waitq wait; without that, performance for metadata requests is
> similar to plain /dev/fuse.
Btw, I have an idea: for sync requests the initial thread could do the
polling itself. Right now it is:

app -> vfs/fuse -> fill ring req -> io_uring_cmd_done -> waitq-wait

This could become:

app -> vfs/fuse -> fill ring req -> io_uring_cmd_done -> poll ring for SQ

Obviously, it is a bit more complex when there are multiple requests on
the same ring.
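In kernel pseudocode the sync case might look like this (just a rough
sketch; fuse_ring_send() is a made-up name, and a real version would
need a fallback to the waitq, timeouts, etc.):

	/* fill ring req + io_uring_cmd_done() to wake the userspace side */
	fuse_ring_send(ring_ent, req);

	/* spin-poll for the reply instead of waitq-wait */
	while (!test_bit(FR_FINISHED, &req->flags)) {
		cpu_relax();
		if (need_resched())
			cond_resched();
	}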
For async it could be:

app page-cached write
  -> vfs/fuse -> fill ring req, with max 1MB requests
  -> io_uring_cmd_done
  -> check if there are completed events
  -> back to app, possibly the next request
  -> async poll task/thread (similar to SQPOLL, if the ring was not
     polled for some amount of time)
I will investigate if this is feasible once I'm done with other changes.
Bernd