* SQPOLL / uring_cmd_iopoll

From: Bernd Schubert @ 2023-04-21 22:09 UTC
To: [email protected]; +Cc: Ming Lei, Jens Axboe, Pavel Begunkov
Hello,
I was wondering if I could set up SQPOLL for fuse/IORING_OP_URING_CMD
and what the latency win would be. Now I am a bit confused about what
the f_op->uring_cmd_iopoll() function is supposed to do.
Is it just there to check if SQEs can be completed as CQEs? In rw.c
io_do_iopoll() it looks like this. I don't follow all code paths in
__io_sq_thread yet, but it looks like it already checks whether the
ring has new entries:
    to_submit = io_sqring_entries(ctx);
    ...
    ret = io_submit_sqes(ctx, to_submit);
--> it will eventually call into ->uring_cmd()?
And then io_do_iopoll -> file->f_op->uring_cmd_iopoll is supposed to
check for available CQ entries and submit these? I.e. I just return
1 when the request is ready? And also ensure that
req->iopoll_completed is set?
I'm also not sure what I should do with struct io_comp_batch *. I don't
have struct request *req_list anywhere in my fuse-uring changes; it
seems to be blk-mq specific. So should I just ignore that parameter?
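
To make it concrete, this is roughly the daemon-side setup I have in
mind (untested liburing sketch; FUSE_URING_CMD_FETCH and the SQE details
are placeholders for my fuse-uring bits, not existing uAPI):

    #include <liburing.h>
    #include <string.h>
    #include <errno.h>

    /* Placeholder for the fuse-uring fetch command; not a real opcode. */
    #define FUSE_URING_CMD_FETCH	1

    static int setup_ring(struct io_uring *ring)
    {
    	struct io_uring_params p;

    	memset(&p, 0, sizeof(p));
    	/* SQPOLL: a kernel thread submits new SQEs; IOPOLL: completions are
    	 * polled via f_op->uring_cmd_iopoll() instead of waited for. */
    	p.flags = IORING_SETUP_SQPOLL | IORING_SETUP_IOPOLL;
    	p.sq_thread_idle = 2000;	/* ms before the SQPOLL thread idles */

    	return io_uring_queue_init_params(64, ring, &p);
    }

    static int queue_fetch(struct io_uring *ring, int fuse_dev_fd)
    {
    	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

    	if (!sqe)
    		return -EBUSY;

    	memset(sqe, 0, sizeof(*sqe));
    	sqe->opcode = IORING_OP_URING_CMD;
    	sqe->fd = fuse_dev_fd;
    	sqe->cmd_op = FUSE_URING_CMD_FETCH;	/* placeholder command */
    	sqe->user_data = 1;

    	/* With SQPOLL this only wakes the kernel thread if it went idle. */
    	return io_uring_submit(ring);
    }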
Btw, this might be useful for ublk as well?
Thanks,
Bernd
* Re: SQPOLL / uring_cmd_iopoll

From: Jens Axboe @ 2023-04-22 2:13 UTC
To: Bernd Schubert, [email protected]; +Cc: Ming Lei, Pavel Begunkov

On 4/21/23 4:09 PM, Bernd Schubert wrote:
> Hello,
>
> I was wondering if I could set up SQPOLL for fuse/IORING_OP_URING_CMD
> and what the latency win would be. Now I am a bit confused about what
> the f_op->uring_cmd_iopoll() function is supposed to do.

Certainly, you can use SQPOLL with anything. Whether or not it'd be a
win depends a lot on what you're doing, rate of IO, etc.

IOPOLL and SQPOLL are two different things. SQPOLL has a kernel side
submission thread that polls for new SQ entries and submits them when it
sees them. IOPOLL is a method for avoiding sleeping while waiting on CQ
entries, where it will instead poll the target for completion. That's
where ->uring_cmd_iopoll() comes in, that's the hook for polling for
uring commands. For normal fs path read/write requests, ->uring_iopoll()
is the hook that performs the same kind of action.

> Is it just there to check if SQEs can be completed as CQEs? In rw.c

Not sure I follow what you're trying to convey here, maybe you can
expand on that? And maybe some of the confusion here is because of
mixing up SQPOLL and IOPOLL?

> io_do_iopoll() it looks like this. I don't follow all code paths in
> __io_sq_thread yet, but it looks like it already checks whether the
> ring has new entries:
>
>     to_submit = io_sqring_entries(ctx);
>     ...
>     ret = io_submit_sqes(ctx, to_submit);
>
> --> it will eventually call into ->uring_cmd()?

The SQPOLL thread will pull off new SQEs, and those will then at some
point hit ->issue() which is an opcode dependent method for issuing the
actual request. Once it's been issued, if the ring is IOPOLL, then
io_iopoll_req_issued() will get called which adds the request to an
internal poll list. When someone does io_uring_enter(2) to wait for
events on a ring with IOPOLL, it will iterate that list and call
->uring_cmd_iopoll() for uring_cmd requests, and ->uring_iopoll() for
"normal" requests.

If the ring is using SQPOLL|IOPOLL, then the SQPOLL thread is also the
one that does the polling. See __io_sq_thread() -> io_do_iopoll().

> And then io_do_iopoll -> file->f_op->uring_cmd_iopoll is supposed to
> check for available CQ entries and submit these? I.e. I just return
> 1 when the request is ready? And also ensure that
> req->iopoll_completed is set?

The callback polls for a completion on the target side, which will mark
it as ->iopoll_completed = true. That still leaves them on the iopoll
list, and io_do_iopoll() will spot that and post CQEs for them.

> I'm also not sure what I should do with struct io_comp_batch *. I don't
> have struct request *req_list anywhere in my fuse-uring changes; it
> seems to be blk-mq specific. So should I just ignore that parameter?

Hard to say since the above is a bit confusing and I haven't seen your
code, but you can always start off just passing NULL. That's fine and
just doesn't do any completion batching. The latter may or may not be
useful for your case, but in any case, it's fine to pass NULL.

> Btw, this might be useful for ublk as well?

Not sure what "this" is :-)

--
Jens Axboe
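
To put the two hooks side by side, a minimal sketch of the driver-side
wiring could look like this (hypothetical char device driver, not fuse's
or ublk's actual code; error handling and the device-specific parts are
omitted):

    #include <linux/fs.h>
    #include <linux/module.h>
    #include <linux/io_uring.h>

    /* ->uring_cmd handles submission of the command; ->uring_cmd_iopoll is
     * only called for rings set up with IORING_SETUP_IOPOLL and polls for
     * the command's completion. */
    static int foo_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
    {
    	/* Stash the command and complete it later via io_uring_cmd_done();
    	 * -EIOCBQUEUED tells io_uring the command is in flight. */
    	return -EIOCBQUEUED;
    }

    static int foo_uring_cmd_iopoll(struct io_uring_cmd *cmd,
    				struct io_comp_batch *iob,
    				unsigned int poll_flags)
    {
    	/* Check the device for a completion; when one is found, complete
    	 * the command (which sets req->iopoll_completed) and return > 0 so
    	 * io_do_iopoll() knows progress was made. Return 0 if nothing is
    	 * ready yet. The io_comp_batch can simply be ignored here; it is
    	 * only used for blk-mq style completion batching. */
    	return 0;
    }

    static const struct file_operations foo_fops = {
    	.owner			= THIS_MODULE,
    	.uring_cmd		= foo_uring_cmd,
    	.uring_cmd_iopoll	= foo_uring_cmd_iopoll,
    };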
* Re: SQPOLL / uring_cmd_iopoll

From: Bernd Schubert @ 2023-04-22 13:40 UTC
To: Jens Axboe, [email protected]; +Cc: Ming Lei, Pavel Begunkov

On 4/22/23 04:13, Jens Axboe wrote:
> On 4/21/23 4:09 PM, Bernd Schubert wrote:
>> Hello,
>>
>> I was wondering if I could set up SQPOLL for fuse/IORING_OP_URING_CMD
>> and what the latency win would be. Now I am a bit confused about what
>> the f_op->uring_cmd_iopoll() function is supposed to do.
>
> Certainly, you can use SQPOLL with anything. Whether or not it'd be a
> win depends a lot on what you're doing, rate of IO, etc.
>
> IOPOLL and SQPOLL are two different things. SQPOLL has a kernel side
> submission thread that polls for new SQ entries and submits them when it
> sees them. IOPOLL is a method for avoiding sleeping while waiting on CQ
> entries, where it will instead poll the target for completion. That's
> where ->uring_cmd_iopoll() comes in, that's the hook for polling for
> uring commands. For normal fs path read/write requests, ->uring_iopoll()
> is the hook that performs the same kind of action.
>
>> Is it just there to check if SQEs can be completed as CQEs? In rw.c
>
> Not sure I follow what you're trying to convey here, maybe you can
> expand on that? And maybe some of the confusion here is because of
> mixing up SQPOLL and IOPOLL?

Thanks a lot for your help! I was confused about when
f_op->uring_cmd_iopoll gets called, and whether I needed to check myself
if the SQE was submitted or if this is for CQE submission. You already
resolved my confusion with your comments below.

>> io_do_iopoll() it looks like this. I don't follow all code paths in
>> __io_sq_thread yet, but it looks like it already checks whether the
>> ring has new entries:
>>
>>     to_submit = io_sqring_entries(ctx);
>>     ...
>>     ret = io_submit_sqes(ctx, to_submit);
>>
>> --> it will eventually call into ->uring_cmd()?
>
> The SQPOLL thread will pull off new SQEs, and those will then at some
> point hit ->issue() which is an opcode dependent method for issuing the
> actual request. Once it's been issued, if the ring is IOPOLL, then
> io_iopoll_req_issued() will get called which adds the request to an
> internal poll list. When someone does io_uring_enter(2) to wait for
> events on a ring with IOPOLL, it will iterate that list and call
> ->uring_cmd_iopoll() for uring_cmd requests, and ->uring_iopoll() for
> "normal" requests.

Thanks, that basically answered my SQE confusion - it does all of it
itself.

> If the ring is using SQPOLL|IOPOLL, then the SQPOLL thread is also the
> one that does the polling. See __io_sq_thread() -> io_do_iopoll().
>
>> And then io_do_iopoll -> file->f_op->uring_cmd_iopoll is supposed to
>> check for available CQ entries and submit these? I.e. I just return
>> 1 when the request is ready? And also ensure that
>> req->iopoll_completed is set?
>
> The callback polls for a completion on the target side, which will mark
> it as ->iopoll_completed = true. That still leaves them on the iopoll
> list, and io_do_iopoll() will spot that and post CQEs for them.
>
>> I'm also not sure what I should do with struct io_comp_batch *. I don't
>> have struct request *req_list anywhere in my fuse-uring changes; it
>> seems to be blk-mq specific. So should I just ignore that parameter?
>
> Hard to say since the above is a bit confusing and I haven't seen your
> code, but you can always start off just passing NULL. That's fine and
> just doesn't do any completion batching. The latter may or may not be
> useful for your case, but in any case, it's fine to pass NULL.

I had sent patches to fsdevel and, given it is mostly fuse related,
didn't add you to CC

https://lwn.net/Articles/926773/

The code is also here

https://github.com/bsbernd/linux/tree/fuse-uring-for-6.2
https://github.com/bsbernd/libfuse/tree/uring

I got SQPOLL working using this simple function

/**
 * This is called for requests when the ring is configured with
 * IORING_SETUP_IOPOLL.
 */
int fuse_uring_cmd_poll(struct io_uring_cmd *cmd, struct io_comp_batch *iob,
			unsigned int poll_flags)
{
	/* Not much to be done here, when IORING_SETUP_IOPOLL is set
	 * io_uring_cmd_done() already sets req->iopoll_completed.
	 * The caller (io_do_iopoll) already checks this flag
	 * and won't enter this function at all then.
	 * When we get called we just need to return 0 and tell the
	 * caller that the cmd is not ready yet.
	 */
	return 0;
}

Just gave it a quick file creation/removal run with single threaded
bonnie++ and performance is actually lower than before: around 8000
creates/s without IORING_SETUP_SQPOLL (adding IORING_SETUP_IOPOLL
doesn't help either) and about 5000 creates/s with IORING_SETUP_SQPOLL.
With plain /dev/fuse it is about 2000 creates/s.

The main improvement comes from ensuring that request submission
(application) and request handling (ring/thread) are on the same core.
I'm running into some scheduler issues, which I work around for now
using migrate_disable()/migrate_enable() before/after the fuse request
waitq wait; without that, performance for metadata requests is similar
to plain /dev/fuse. I will soon post an RFC v2 series with more
benchmark results including read and write.

>> Btw, this might be useful for ublk as well?
>
> Not sure what "this" is :-)

Ming already replied.

Thanks,
Bernd
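
P.S.: the core-pinning workaround mentioned above is roughly this shape
(sketch only; fuse_send_and_wait() is a stand-in name for the real
fill-ring-request-and-wait path, not an actual function):

    #include <linux/preempt.h>

    /*
     * Keep the submitting task on its current CPU across the fuse request
     * round trip, so the ring thread that handles the request stays on the
     * same core. fuse_send_and_wait() is a placeholder.
     */
    static ssize_t fuse_uring_send_pinned(struct fuse_mount *fm,
    				      struct fuse_req *req)
    {
    	ssize_t err;

    	migrate_disable();
    	err = fuse_send_and_wait(fm, req);	/* placeholder */
    	migrate_enable();

    	return err;
    }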
* Re: SQPOLL / uring_cmd_iopoll

From: Bernd Schubert @ 2023-04-24 12:55 UTC
To: Jens Axboe, [email protected]
Cc: Ming Lei, Pavel Begunkov, Miklos Szeredi

On 4/22/23 15:40, Bernd Schubert wrote:
> On 4/22/23 04:13, Jens Axboe wrote:
>
> Just gave it a quick file creation/removal run with single threaded
> bonnie++ and performance is actually lower than before: around 8000
> creates/s without IORING_SETUP_SQPOLL (adding IORING_SETUP_IOPOLL
> doesn't help either) and about 5000 creates/s with IORING_SETUP_SQPOLL.
> With plain /dev/fuse it is about 2000 creates/s.
>
> The main improvement comes from ensuring that request submission
> (application) and request handling (ring/thread) are on the same core.
> I'm running into some scheduler issues, which I work around for now
> using migrate_disable()/migrate_enable() before/after the fuse request
> waitq wait; without that, performance for metadata requests is similar
> to plain /dev/fuse.

Btw, I have an idea. For sync requests the initial thread could do the
polling itself. Right now it is:

    app -> vfs/fuse -> fill ring req -> io_uring_cmd_done -> waitq-wait

which could become

    app -> vfs/fuse -> fill ring req -> io_uring_cmd_done -> poll ring for SQ

Obviously, it is a bit more complex when there are multiple requests on
the same ring.

For async it could be

    app page cached write -> vfs/fuse -> fill ring req, with max 1MB requests
        -> io_uring_cmd_done -> check if there are completed events
        -> back to app, possibly next request
        -> async poll task/thread (similar to SQPOLL, if the ring was not
           polled for some amount of time)

I will investigate if this is feasible once I'm done with other changes.

Bernd
* Re: SQPOLL / uring_cmd_iopoll

From: Ming Lei @ 2023-04-22 12:55 UTC
To: Bernd Schubert
Cc: [email protected], Jens Axboe, Pavel Begunkov, ming.lei

On Fri, Apr 21, 2023 at 10:09:36PM +0000, Bernd Schubert wrote:
> Hello,
>
> I was wondering if I could set up SQPOLL for fuse/IORING_OP_URING_CMD
> and what the latency win would be. Now I am a bit confused about what
> the f_op->uring_cmd_iopoll() function is supposed to do.
>
> Is it just there to check if SQEs can be completed as CQEs? In rw.c
> io_do_iopoll() it looks like this. I don't follow all code paths in
> __io_sq_thread yet, but it looks like it already checks whether the
> ring has new entries:
>
>     to_submit = io_sqring_entries(ctx);
>     ...
>     ret = io_submit_sqes(ctx, to_submit);
>
> --> it will eventually call into ->uring_cmd()?
>
> And then io_do_iopoll -> file->f_op->uring_cmd_iopoll is supposed to
> check for available CQ entries and submit these? I.e. I just return
> 1 when the request is ready? And also ensure that
> req->iopoll_completed is set?
>
> I'm also not sure what I should do with struct io_comp_batch *. I don't
> have struct request *req_list anywhere in my fuse-uring changes; it
> seems to be blk-mq specific. So should I just ignore that parameter?
>
> Btw, this might be useful for ublk as well?

For the in-tree ublk driver, we need to copy data inside ->uring_cmd()
between block request pages and the user buffer, so SQPOLL may not be
done because it isn't efficient for the kthread to copy on the remote
task mm space. However, the ublk user copy feature[1] (posted recently)
doesn't need the copy in ->uring_cmd() any more, so SQPOLL becomes
possible for ublk uring cmd.

Also, for uring cmd only, IOPOLL may not be doable given we don't know
when a request from /dev/ublkb* is coming, maybe never. But if there is
any target IO pending, IOPOLL can be used, and the change should be
trivial.

[1] https://lore.kernel.org/linux-block/ZEFeYsQ%[email protected]/T/#t

Thanks,
Ming
* Re: SQPOLL / uring_cmd_iopoll

From: Jens Axboe @ 2023-04-22 14:08 UTC
To: Ming Lei, Bernd Schubert; +Cc: [email protected], Pavel Begunkov

On 4/22/23 6:55 AM, Ming Lei wrote:
> On Fri, Apr 21, 2023 at 10:09:36PM +0000, Bernd Schubert wrote:
>> Hello,
>>
>> I was wondering if I could set up SQPOLL for fuse/IORING_OP_URING_CMD
>> and what the latency win would be. Now I am a bit confused about what
>> the f_op->uring_cmd_iopoll() function is supposed to do.
>>
>> Is it just there to check if SQEs can be completed as CQEs? In rw.c
>> io_do_iopoll() it looks like this. I don't follow all code paths in
>> __io_sq_thread yet, but it looks like it already checks whether the
>> ring has new entries:
>>
>>     to_submit = io_sqring_entries(ctx);
>>     ...
>>     ret = io_submit_sqes(ctx, to_submit);
>>
>> --> it will eventually call into ->uring_cmd()?
>>
>> And then io_do_iopoll -> file->f_op->uring_cmd_iopoll is supposed to
>> check for available CQ entries and submit these? I.e. I just return
>> 1 when the request is ready? And also ensure that
>> req->iopoll_completed is set?
>>
>> I'm also not sure what I should do with struct io_comp_batch *. I don't
>> have struct request *req_list anywhere in my fuse-uring changes; it
>> seems to be blk-mq specific. So should I just ignore that parameter?
>>
>> Btw, this might be useful for ublk as well?
>
> For the in-tree ublk driver, we need to copy data inside ->uring_cmd()
> between block request pages and the user buffer, so SQPOLL may not be
> done because it isn't efficient for the kthread to copy on the remote
> task mm space. However, the ublk user copy feature[1] (posted recently)
> doesn't need the copy in ->uring_cmd() any more, so SQPOLL becomes
> possible for ublk uring cmd.

That hasn't been true for a long time, and isn't even true in
5.10-stable anymore or anything newer. The SQPOLL thread is not a
kthread, and it doesn't need to do anything to copy the data that the
inline submission wouldn't also do. There is no "remote task mm". The
cost would be the same, outside of caching effects.

--
Jens Axboe
* Re: SQPOLL / uring_cmd_iopoll

From: Ming Lei @ 2023-04-23 1:06 UTC
To: Jens Axboe
Cc: Bernd Schubert, [email protected], Pavel Begunkov, ming.lei

On Sat, Apr 22, 2023 at 08:08:41AM -0600, Jens Axboe wrote:
> On 4/22/23 6:55 AM, Ming Lei wrote:
>> On Fri, Apr 21, 2023 at 10:09:36PM +0000, Bernd Schubert wrote:
>>> Hello,
>>>
>>> I was wondering if I could set up SQPOLL for fuse/IORING_OP_URING_CMD
>>> and what the latency win would be. Now I am a bit confused about what
>>> the f_op->uring_cmd_iopoll() function is supposed to do.
>>>
>>> Is it just there to check if SQEs can be completed as CQEs? In rw.c
>>> io_do_iopoll() it looks like this. I don't follow all code paths in
>>> __io_sq_thread yet, but it looks like it already checks whether the
>>> ring has new entries:
>>>
>>>     to_submit = io_sqring_entries(ctx);
>>>     ...
>>>     ret = io_submit_sqes(ctx, to_submit);
>>>
>>> --> it will eventually call into ->uring_cmd()?
>>>
>>> And then io_do_iopoll -> file->f_op->uring_cmd_iopoll is supposed to
>>> check for available CQ entries and submit these? I.e. I just return
>>> 1 when the request is ready? And also ensure that
>>> req->iopoll_completed is set?
>>>
>>> I'm also not sure what I should do with struct io_comp_batch *. I don't
>>> have struct request *req_list anywhere in my fuse-uring changes; it
>>> seems to be blk-mq specific. So should I just ignore that parameter?
>>>
>>> Btw, this might be useful for ublk as well?
>>
>> For the in-tree ublk driver, we need to copy data inside ->uring_cmd()
>> between block request pages and the user buffer, so SQPOLL may not be
>> done because it isn't efficient for the kthread to copy on the remote
>> task mm space. However, the ublk user copy feature[1] (posted recently)
>> doesn't need the copy in ->uring_cmd() any more, so SQPOLL becomes
>> possible for ublk uring cmd.
>
> That hasn't been true for a long time, and isn't even true in
> 5.10-stable anymore or anything newer. The SQPOLL thread is not a
> kthread, and it doesn't need to do anything to copy the data that the
> inline submission wouldn't also do. There is no "remote task mm". The
> cost would be the same, outside of caching effects.

OK, thanks for the clarification, and create_io_thread() does pass
CLONE_VM, so there isn't a remote task mm problem.

However, ublk still can't use SETUP_SQPOLL so far, and the problem is
that the ublk driver has to be bound to the user task for canceling
pending commands when the ctx is gone[1]. When this issue is solved,
SETUP_SQPOLL should work just fine. Given fuse takes a similar approach
to ublk, I believe fuse has a similar limit too.

Actually I was working on adding notifiers in io_uring[2] for addressing
this issue, so that drivers needn't use the trick of tracking io_uring
context destruction. I just see one request double-free issue (the same
request is freed twice in io_submit_flush_completions <-
io_fallback_req_func()), only in the DEFER_TASKRUN case, although the
driver calls io_uring_cmd_done() just once. I will investigate the issue
further.

[1] https://lore.kernel.org/linux-fsdevel/[email protected]/
[2] https://github.com/ming1/linux/commits/for-6.4/io_uring_block

Thanks,
Ming