* IORING_OP_POLL_ADD slower than linux-aio IOCB_CMD_POLL
@ 2022-04-19 11:07 Avi Kivity
  2022-04-19 11:38 ` Jens Axboe
  0 siblings, 1 reply; 23+ messages in thread
From: Avi Kivity @ 2022-04-19 11:07 UTC (permalink / raw)
To: io-uring

A simple webserver shows about 5% loss compared to linux-aio.


I expect the loss is due to an optimization that io_uring lacks -
inline completion vs workqueue completion:

static int aio_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
		void *key)
{
	struct poll_iocb *req = container_of(wait, struct poll_iocb, wait);
	struct aio_kiocb *iocb = container_of(req, struct aio_kiocb, poll);
	__poll_t mask = key_to_poll(key);
	unsigned long flags;

	/* for instances that support it check for an event match first: */
	if (mask && !(mask & req->events))
		return 0;

	/*
	 * Complete the request inline if possible.  This requires that three
	 * conditions be met:
	 *   1. An event mask must have been passed.  If a plain wakeup was done
	 *	instead, then mask == 0 and we have to call vfs_poll() to get
	 *	the events, so inline completion isn't possible.
	 *   2. The completion work must not have already been scheduled.
	 *   3. ctx_lock must not be busy.  We have to use trylock because we
	 *	already hold the waitqueue lock, so this inverts the normal
	 *	locking order.  Use irqsave/irqrestore because not all
	 *	filesystems (e.g. fuse) call this function with IRQs disabled,
	 *	yet IRQs have to be disabled before ctx_lock is obtained.
	 */
	if (mask && !req->work_scheduled &&
	    spin_trylock_irqsave(&iocb->ki_ctx->ctx_lock, flags)) {
		struct kioctx *ctx = iocb->ki_ctx;

		list_del_init(&req->wait.entry);
		list_del(&iocb->ki_list);
		iocb->ki_res.res = mangle_poll(mask);
		if (iocb->ki_eventfd && !eventfd_signal_allowed()) {
			iocb = NULL;
			INIT_WORK(&req->work, aio_poll_put_work);
			schedule_work(&req->work);
		}
		spin_unlock_irqrestore(&ctx->ctx_lock, flags);
		if (iocb)
			iocb_put(iocb);
	} else {

^ permalink raw reply [flat|nested] 23+ messages in thread
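(For context, a minimal sketch -- not from the original mail -- of the two
readiness-poll submission paths being compared: liburing's IORING_OP_POLL_ADD
versus linux-aio's IOCB_CMD_POLL. The helper names are invented for
illustration, the aio context is assumed to already be set up with io_setup(),
and error handling plus the surrounding event loops are omitted.)

#include <liburing.h>
#include <linux/aio_abi.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <poll.h>
#include <string.h>

/* io_uring side: one-shot readiness poll, result arrives as a CQE */
static void poll_add_uring(struct io_uring *ring, int fd)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	io_uring_prep_poll_add(sqe, fd, POLLIN);
	io_uring_submit(ring);
}

/* linux-aio side: IOCB_CMD_POLL, result arrives via io_getevents() */
static void poll_add_aio(aio_context_t ctx, int fd)
{
	struct iocb cb, *cbs[1] = { &cb };

	memset(&cb, 0, sizeof(cb));
	cb.aio_lio_opcode = IOCB_CMD_POLL;
	cb.aio_fildes = fd;
	cb.aio_buf = POLLIN;		/* requested events go in aio_buf */
	syscall(SYS_io_submit, ctx, 1, cbs);
}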
* Re: IORING_OP_POLL_ADD slower than linux-aio IOCB_CMD_POLL 2022-04-19 11:07 IORING_OP_POLL_ADD slower than linux-aio IOCB_CMD_POLL Avi Kivity @ 2022-04-19 11:38 ` Jens Axboe 2022-04-19 11:57 ` Avi Kivity 0 siblings, 1 reply; 23+ messages in thread From: Jens Axboe @ 2022-04-19 11:38 UTC (permalink / raw) To: Avi Kivity, io-uring On 4/19/22 5:07 AM, Avi Kivity wrote: > A simple webserver shows about 5% loss compared to linux-aio. > > > I expect the loss is due to an optimization that io_uring lacks - > inline completion vs workqueue completion: I don't think that's it, io_uring never punts to a workqueue for completions. The aio inline completions is more of a hack because it needs to do that, as always using a workqueue would lead to bad performance and higher overhead. So if there's a difference in performance, it's something else and we need to look at that. But your report is pretty lacking! What kernel are you running? Do you have a test case of sorts? For a performance oriented network setup, I'd normally not consider data readiness poll replacements to be that interesting, my recommendation would be to use async send/recv for that instead. That's how io_uring is supposed to be used, in a completion based model. -- Jens Axboe ^ permalink raw reply [flat|nested] 23+ messages in thread
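(A minimal sketch of the completion-based model recommended above, assuming
liburing: the recv itself is submitted as an SQE and its result comes back as
a CQE, instead of polling for readiness and then calling recv(2). The helper
names are illustrative; error handling and resubmission are omitted.)

#include <liburing.h>

static void submit_recv(struct io_uring *ring, int sockfd,
			void *buf, unsigned len)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	io_uring_prep_recv(sqe, sockfd, buf, len, 0);
	io_uring_sqe_set_data(sqe, buf);	/* tag to identify the request later */
	io_uring_submit(ring);
}

static int reap_one(struct io_uring *ring)
{
	struct io_uring_cqe *cqe;
	int ret = io_uring_wait_cqe(ring, &cqe);

	if (ret < 0)
		return ret;
	ret = cqe->res;				/* bytes received, or -errno */
	io_uring_cqe_seen(ring, cqe);
	return ret;
}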
* Re: IORING_OP_POLL_ADD slower than linux-aio IOCB_CMD_POLL 2022-04-19 11:38 ` Jens Axboe @ 2022-04-19 11:57 ` Avi Kivity 2022-04-19 12:04 ` Jens Axboe 0 siblings, 1 reply; 23+ messages in thread From: Avi Kivity @ 2022-04-19 11:57 UTC (permalink / raw) To: Jens Axboe, io-uring On 19/04/2022 14.38, Jens Axboe wrote: > On 4/19/22 5:07 AM, Avi Kivity wrote: >> A simple webserver shows about 5% loss compared to linux-aio. >> >> >> I expect the loss is due to an optimization that io_uring lacks - >> inline completion vs workqueue completion: > I don't think that's it, io_uring never punts to a workqueue for > completions. I measured this: Performance counter stats for 'system wide': 1,273,756 io_uring:io_uring_task_add 12.288597765 seconds time elapsed Which exactly matches with the number of requests sent. If that's the wrong counter to measure, I'm happy to try again with the correct counter. > The aio inline completions is more of a hack because it > needs to do that, as always using a workqueue would lead to bad > performance and higher overhead. > > So if there's a difference in performance, it's something else and we > need to look at that. But your report is pretty lacking! What kernel are > you running? 5.17.2-300.fc36.x86_64 > Do you have a test case of sorts? Seastar's httpd, running on a single core, against wrk -c 1000 -t 4 http://localhost:10000/. Instructions: git clone --recursive -b io_uring https://github.com/avikivity/seastar cd seastar sudo ./install-dependencies.sh # after carefully verifying it, of course ./configure.py --mode release ninja -C build/release apps/httpd/httpd ./build/release/apps/httpd/httpd --smp 1 [--reactor-backing io_uring|linux-aio|epoll] and run wrk againt it. > For a performance oriented network setup, I'd normally not consider data > readiness poll replacements to be that interesting, my recommendation > would be to use async send/recv for that instead. That's how io_uring is > supposed to be used, in a completion based model. > That's true. Still, an existing system that evolved around poll will take some time and effort to migrate, and have slower IORING_OP_POLL means it cannot benefit from io_uring's many other advantages if it fears a regression from that difference. Note that it's not just a matter of converting poll+recvmsg to IORING_OP_RECVMSG. If you support many connections, one must migrate to internal buffer selection, otherwise the memory load with a large number of idle connections is high. The end result is wonderful but the road there is long. ^ permalink raw reply [flat|nested] 23+ messages in thread
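(A minimal sketch of the "internal buffer selection" mentioned above, assuming
liburing and the classic provide-buffers interface: a pool is registered once
and the kernel only picks a buffer when data actually arrives, so idle
connections don't each pin a receive buffer. The group id and sizes are
arbitrary; reaping the provide-buffers completion and error handling are
omitted.)

#include <liburing.h>
#include <stdlib.h>

#define BGID		7	/* arbitrary buffer group id */
#define BUF_SIZE	4096
#define NR_BUFS		1024

static void *provide_buffers(struct io_uring *ring)
{
	void *pool = malloc((size_t)NR_BUFS * BUF_SIZE);
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	io_uring_prep_provide_buffers(sqe, pool, BUF_SIZE, NR_BUFS, BGID, 0);
	io_uring_submit(ring);
	return pool;
}

static void submit_recv_select(struct io_uring *ring, int sockfd)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	/* no buffer passed here - the kernel chooses one from group BGID */
	io_uring_prep_recv(sqe, sockfd, NULL, BUF_SIZE, 0);
	sqe->flags |= IOSQE_BUFFER_SELECT;
	sqe->buf_group = BGID;
	io_uring_submit(ring);
	/*
	 * On completion, if IORING_CQE_F_BUFFER is set, the chosen buffer id
	 * is cqe->flags >> IORING_CQE_BUFFER_SHIFT.
	 */
}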
* Re: IORING_OP_POLL_ADD slower than linux-aio IOCB_CMD_POLL 2022-04-19 11:57 ` Avi Kivity @ 2022-04-19 12:04 ` Jens Axboe 2022-04-19 12:21 ` Avi Kivity 0 siblings, 1 reply; 23+ messages in thread From: Jens Axboe @ 2022-04-19 12:04 UTC (permalink / raw) To: Avi Kivity, io-uring On 4/19/22 5:57 AM, Avi Kivity wrote: > > On 19/04/2022 14.38, Jens Axboe wrote: >> On 4/19/22 5:07 AM, Avi Kivity wrote: >>> A simple webserver shows about 5% loss compared to linux-aio. >>> >>> >>> I expect the loss is due to an optimization that io_uring lacks - >>> inline completion vs workqueue completion: >> I don't think that's it, io_uring never punts to a workqueue for >> completions. > > > I measured this: > > > > Performance counter stats for 'system wide': > > 1,273,756 io_uring:io_uring_task_add > > 12.288597765 seconds time elapsed > > Which exactly matches with the number of requests sent. If that's the > wrong counter to measure, I'm happy to try again with the correct > counter. io_uring_task_add() isn't a workqueue, it's task_work. So that is expected. >> The aio inline completions is more of a hack because it >> needs to do that, as always using a workqueue would lead to bad >> performance and higher overhead. >> >> So if there's a difference in performance, it's something else and we >> need to look at that. But your report is pretty lacking! What kernel are >> you running? > > > 5.17.2-300.fc36.x86_64 OK, that sounds fine. >> Do you have a test case of sorts? > > > Seastar's httpd, running on a single core, against wrk -c 1000 -t 4 http://localhost:10000/. > > > Instructions: > > git clone --recursive -b io_uring https://github.com/avikivity/seastar > > cd seastar > > sudo ./install-dependencies.sh # after carefully verifying it, of course > > ./configure.py --mode release > > ninja -C build/release apps/httpd/httpd > > ./build/release/apps/httpd/httpd --smp 1 [--reactor-backing io_uring|linux-aio|epoll] > > > and run wrk againt it. Thanks, I'll give that a spin! >> For a performance oriented network setup, I'd normally not consider data >> readiness poll replacements to be that interesting, my recommendation >> would be to use async send/recv for that instead. That's how io_uring is >> supposed to be used, in a completion based model. >> > > That's true. Still, an existing system that evolved around poll will > take some time and effort to migrate, and have slower IORING_OP_POLL > means it cannot benefit from io_uring's many other advantages if it > fears a regression from that difference. I'd like to separate the two - should the OP_POLL work as well, most certainly. Do I think it's largely a useless way to run it, also yes :-) > Note that it's not just a matter of converting poll+recvmsg to > IORING_OP_RECVMSG. If you support many connections, one must migrate > to internal buffer selection, otherwise the memory load with a large > number of idle connections is high. The end result is wonderful but > the road there is long. Totally agree. My point is just that to take full advantage of it, you need to be using that kind of model and quick conversions aren't really expected to yield much of a performance win. They are also not supposed to run slower, so that does need some attention if that's the case here. -- Jens Axboe ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: IORING_OP_POLL_ADD slower than linux-aio IOCB_CMD_POLL 2022-04-19 12:04 ` Jens Axboe @ 2022-04-19 12:21 ` Avi Kivity 2022-04-19 12:31 ` Jens Axboe 0 siblings, 1 reply; 23+ messages in thread From: Avi Kivity @ 2022-04-19 12:21 UTC (permalink / raw) To: Jens Axboe, io-uring On 19/04/2022 15.04, Jens Axboe wrote: > On 4/19/22 5:57 AM, Avi Kivity wrote: >> On 19/04/2022 14.38, Jens Axboe wrote: >>> On 4/19/22 5:07 AM, Avi Kivity wrote: >>>> A simple webserver shows about 5% loss compared to linux-aio. >>>> >>>> >>>> I expect the loss is due to an optimization that io_uring lacks - >>>> inline completion vs workqueue completion: >>> I don't think that's it, io_uring never punts to a workqueue for >>> completions. >> >> I measured this: >> >> >> >> Performance counter stats for 'system wide': >> >> 1,273,756 io_uring:io_uring_task_add >> >> 12.288597765 seconds time elapsed >> >> Which exactly matches with the number of requests sent. If that's the >> wrong counter to measure, I'm happy to try again with the correct >> counter. > io_uring_task_add() isn't a workqueue, it's task_work. So that is > expected. Ah, and it should be fine. I'll try 'perf diff' again (I ran it but didn't reach any conclusive results and assumed non-systemwide runs weren't measuring workqueues (and systemwide runs generated too much noise on my workstation)). >>> Do you have a test case of sorts? >> >> Seastar's httpd, running on a single core, against wrk -c 1000 -t 4 http://localhost:10000/. >> >> >> Instructions: >> >> git clone --recursive -b io_uring https://github.com/avikivity/seastar >> >> cd seastar >> >> sudo ./install-dependencies.sh # after carefully verifying it, of course >> >> ./configure.py --mode release >> >> ninja -C build/release apps/httpd/httpd >> >> ./build/release/apps/httpd/httpd --smp 1 [--reactor-backing io_uring|linux-aio|epoll] >> >> >> and run wrk againt it. > Thanks, I'll give that a spin! Thanks. You may need ./configure --c++-dialect=c++17 if your C++ compiler is too old. > >>> For a performance oriented network setup, I'd normally not consider data >>> readiness poll replacements to be that interesting, my recommendation >>> would be to use async send/recv for that instead. That's how io_uring is >>> supposed to be used, in a completion based model. >>> >> That's true. Still, an existing system that evolved around poll will >> take some time and effort to migrate, and have slower IORING_OP_POLL >> means it cannot benefit from io_uring's many other advantages if it >> fears a regression from that difference. > I'd like to separate the two - should the OP_POLL work as well, most > certainly. Do I think it's largely a useless way to run it, also yes :-) Agree. > >> Note that it's not just a matter of converting poll+recvmsg to >> IORING_OP_RECVMSG. If you support many connections, one must migrate >> to internal buffer selection, otherwise the memory load with a large >> number of idle connections is high. The end result is wonderful but >> the road there is long. > Totally agree. My point is just that to take full advantage of it, you > need to be using that kind of model and quick conversions aren't really > expected to yield much of a performance win. They are also not supposed > to run slower, so that does need some attention if that's the case here. > We're in agreement, but I'd like to clarify the quick conversion is intended to win from other aspects of io_uring, with the deeper change coming later. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: IORING_OP_POLL_ADD slower than linux-aio IOCB_CMD_POLL 2022-04-19 12:21 ` Avi Kivity @ 2022-04-19 12:31 ` Jens Axboe 2022-04-19 15:21 ` Jens Axboe 0 siblings, 1 reply; 23+ messages in thread From: Jens Axboe @ 2022-04-19 12:31 UTC (permalink / raw) To: Avi Kivity, io-uring On 4/19/22 6:21 AM, Avi Kivity wrote: > On 19/04/2022 15.04, Jens Axboe wrote: >> On 4/19/22 5:57 AM, Avi Kivity wrote: >>> On 19/04/2022 14.38, Jens Axboe wrote: >>>> On 4/19/22 5:07 AM, Avi Kivity wrote: >>>>> A simple webserver shows about 5% loss compared to linux-aio. >>>>> >>>>> >>>>> I expect the loss is due to an optimization that io_uring lacks - >>>>> inline completion vs workqueue completion: >>>> I don't think that's it, io_uring never punts to a workqueue for >>>> completions. >>> >>> I measured this: >>> >>> >>> >>> Performance counter stats for 'system wide': >>> >>> 1,273,756 io_uring:io_uring_task_add >>> >>> 12.288597765 seconds time elapsed >>> >>> Which exactly matches with the number of requests sent. If that's the >>> wrong counter to measure, I'm happy to try again with the correct >>> counter. >> io_uring_task_add() isn't a workqueue, it's task_work. So that is >> expected. > > > Ah, and it should be fine. I'll try 'perf diff' again (I ran it but > didn't reach any conclusive results and assumed non-systemwide runs > weren't measuring workqueues (and systemwide runs generated too much > noise on my workstation)). Might help to run the client from another box? But I did a quick run just to see if it worked after setting it up, and it seems mostly dominated by seastar itself. But nothing conclusive, was more of a "let's see if it works". I'll take an actual look in a bit. One thing that did stick out is excessive fget/fput, which is usually solved by using direct descriptors with io_uring. You can use accept_direct, for example, and then IOSQE_FIXED_FILE with the send/recv. But it looks like you are just using io_uring for polling, which also eliminates being able to do that. You can retain the readiness based model and still use io_uring for the actual data transfer, at least that would enable you to use some features that would help reduce overhead. Note that this is separate from the actual poll being slower. I might try and write a basic benchmark that is just poll and see what comes up. Would help do more targeted development and profiling. >>>> Do you have a test case of sorts? >>> >>> Seastar's httpd, running on a single core, against wrk -c 1000 -t 4 http://localhost:10000/. >>> >>> >>> Instructions: >>> >>> git clone --recursive -b io_uring https://github.com/avikivity/seastar >>> >>> cd seastar >>> >>> sudo ./install-dependencies.sh # after carefully verifying it, of course >>> >>> ./configure.py --mode release >>> >>> ninja -C build/release apps/httpd/httpd >>> >>> ./build/release/apps/httpd/httpd --smp 1 [--reactor-backing io_uring|linux-aio|epoll] s/backing/backend >>>> For a performance oriented network setup, I'd normally not consider data >>>> readiness poll replacements to be that interesting, my recommendation >>>> would be to use async send/recv for that instead. That's how io_uring is >>>> supposed to be used, in a completion based model. >>>> >>> That's true. Still, an existing system that evolved around poll will >>> take some time and effort to migrate, and have slower IORING_OP_POLL >>> means it cannot benefit from io_uring's many other advantages if it >>> fears a regression from that difference. 
>> I'd like to separate the two - should the OP_POLL work as well, most >> certainly. Do I think it's largely a useless way to run it, also yes :-) > > > Agree. > > >> >>> Note that it's not just a matter of converting poll+recvmsg to >>> IORING_OP_RECVMSG. If you support many connections, one must migrate >>> to internal buffer selection, otherwise the memory load with a large >>> number of idle connections is high. The end result is wonderful but >>> the road there is long. >> >> Totally agree. My point is just that to take full advantage of it, >> you need to be using that kind of model and quick conversions aren't >> really expected to yield much of a performance win. They are also not >> supposed to run slower, so that does need some attention if that's >> the case here. >> > > We're in agreement, but I'd like to clarify the quick conversion is > intended to win from other aspects of io_uring, with the deeper change > coming later. Ideally, yes. -- Jens Axboe ^ permalink raw reply [flat|nested] 23+ messages in thread
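(A minimal sketch of the direct-descriptor route mentioned above, assuming a
recent liburing and kernel: accepted sockets land straight in io_uring's
fixed-file table, and later requests reference the slot with IOSQE_FIXED_FILE,
avoiding the per-request fget()/fput(). The table size and slot management are
illustrative only.)

#include <liburing.h>

static void setup_file_table(struct io_uring *ring)
{
	/* reserve 4096 empty slots for direct descriptors */
	io_uring_register_files_sparse(ring, 4096);
}

static void submit_accept_direct(struct io_uring *ring, int listen_fd, int slot)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	/* the accepted socket is installed into fixed-file slot "slot" */
	io_uring_prep_accept_direct(sqe, listen_fd, NULL, NULL, 0, slot);
	io_uring_submit(ring);
}

static void submit_recv_fixed(struct io_uring *ring, int slot,
			      void *buf, unsigned len)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	io_uring_prep_recv(sqe, slot, buf, len, 0);
	sqe->flags |= IOSQE_FIXED_FILE;	/* "slot" is a table index, not an fd */
	io_uring_submit(ring);
}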
* Re: IORING_OP_POLL_ADD slower than linux-aio IOCB_CMD_POLL
  2022-04-19 12:31 ` Jens Axboe
@ 2022-04-19 15:21 ` Jens Axboe
  2022-04-19 15:51   ` Avi Kivity
  2022-04-19 17:14   ` Jens Axboe
  0 siblings, 2 replies; 23+ messages in thread
From: Jens Axboe @ 2022-04-19 15:21 UTC (permalink / raw)
To: Avi Kivity, io-uring

On 4/19/22 6:31 AM, Jens Axboe wrote:
> On 4/19/22 6:21 AM, Avi Kivity wrote:
>> On 19/04/2022 15.04, Jens Axboe wrote:
>>> On 4/19/22 5:57 AM, Avi Kivity wrote:
>>>> On 19/04/2022 14.38, Jens Axboe wrote:
>>>>> On 4/19/22 5:07 AM, Avi Kivity wrote:
>>>>>> A simple webserver shows about 5% loss compared to linux-aio.
>>>>>>
>>>>>>
>>>>>> I expect the loss is due to an optimization that io_uring lacks -
>>>>>> inline completion vs workqueue completion:
>>>>> I don't think that's it, io_uring never punts to a workqueue for
>>>>> completions.
>>>>
>>>> I measured this:
>>>>
>>>>
>>>>
>>>> Performance counter stats for 'system wide':
>>>>
>>>> 1,273,756 io_uring:io_uring_task_add
>>>>
>>>> 12.288597765 seconds time elapsed
>>>>
>>>> Which exactly matches with the number of requests sent. If that's the
>>>> wrong counter to measure, I'm happy to try again with the correct
>>>> counter.
>>> io_uring_task_add() isn't a workqueue, it's task_work. So that is
>>> expected.

Might actually be implicated. Not because it's a async worker, but
because I think we might be losing some affinity in this case. Looking
at traces, we're definitely bouncing between the poll completion side
and then execution the completion.

Can you try this hack? It's against -git + for-5.19/io_uring. If you let
me know what base you prefer, I can do a version against that. I see
about a 3% win with io_uring with this, and was slower before against
linux-aio as you saw as well.


diff --git a/fs/io_uring.c b/fs/io_uring.c
index caa5b673f8f5..f3da6c9a9635 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -6303,6 +6303,25 @@ static void io_apoll_task_func(struct io_kiocb *req, bool *locked)
 		io_req_complete_failed(req, ret);
 }
 
+static bool __io_poll_execute_direct(struct io_kiocb *req, int mask, int events)
+{
+	struct io_ring_ctx *ctx = req->ctx;
+
+	if (ctx->has_evfd || req->flags & REQ_F_INFLIGHT ||
+	    req->opcode != IORING_OP_POLL_ADD)
+		return false;
+	if (!spin_trylock(&ctx->completion_lock))
+		return false;
+
+	req->cqe.res = mangle_poll(mask & events);
+	hash_del(&req->hash_node);
+	__io_req_complete_post(req, req->cqe.res, 0);
+	io_commit_cqring(ctx);
+	spin_unlock(&ctx->completion_lock);
+	io_cqring_ev_posted(ctx);
+	return true;
+}
+
 static void __io_poll_execute(struct io_kiocb *req, int mask, int events)
 {
 	req->cqe.res = mask;
@@ -6384,7 +6403,8 @@ static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
 		else
 			req->flags &= ~REQ_F_SINGLE_POLL;
 	}
-		__io_poll_execute(req, mask, poll->events);
+		if (!__io_poll_execute_direct(req, mask, poll->events))
+			__io_poll_execute(req, mask, poll->events);
 	}
 	return 1;
 }

-- 
Jens Axboe

^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: IORING_OP_POLL_ADD slower than linux-aio IOCB_CMD_POLL 2022-04-19 15:21 ` Jens Axboe @ 2022-04-19 15:51 ` Avi Kivity 2022-04-19 17:14 ` Jens Axboe 1 sibling, 0 replies; 23+ messages in thread From: Avi Kivity @ 2022-04-19 15:51 UTC (permalink / raw) To: Jens Axboe, io-uring On 19/04/2022 18.21, Jens Axboe wrote: > On 4/19/22 6:31 AM, Jens Axboe wrote: >> On 4/19/22 6:21 AM, Avi Kivity wrote: >>> On 19/04/2022 15.04, Jens Axboe wrote: >>>> On 4/19/22 5:57 AM, Avi Kivity wrote: >>>>> On 19/04/2022 14.38, Jens Axboe wrote: >>>>>> On 4/19/22 5:07 AM, Avi Kivity wrote: >>>>>>> A simple webserver shows about 5% loss compared to linux-aio. >>>>>>> >>>>>>> >>>>>>> I expect the loss is due to an optimization that io_uring lacks - >>>>>>> inline completion vs workqueue completion: >>>>>> I don't think that's it, io_uring never punts to a workqueue for >>>>>> completions. >>>>> I measured this: >>>>> >>>>> >>>>> >>>>> Performance counter stats for 'system wide': >>>>> >>>>> 1,273,756 io_uring:io_uring_task_add >>>>> >>>>> 12.288597765 seconds time elapsed >>>>> >>>>> Which exactly matches with the number of requests sent. If that's the >>>>> wrong counter to measure, I'm happy to try again with the correct >>>>> counter. >>>> io_uring_task_add() isn't a workqueue, it's task_work. So that is >>>> expected. > Might actually be implicated. Not because it's a async worker, but > because I think we might be losing some affinity in this case. Looking > at traces, we're definitely bouncing between the poll completion side > and then execution the completion. What affinity are we losing? Maybe it's TWA_SIGNAL, which causes the poll notification (which could happen while httpd is running) to interrupt httpd? Although it should happen during tcp processing in softirq, which should be co-located with httpd and therefore httpd wasn't running? > > Can you try this hack? It's against -git + for-5.19/io_uring. If you let > me know what base you prefer, I can do a version against that. I see > about a 3% win with io_uring with this, and was slower before against > linux-aio as you saw as well. > Sure, I'll try it against for-5.19/io_uring, though I don't doubt you fixed it. > diff --git a/fs/io_uring.c b/fs/io_uring.c > index caa5b673f8f5..f3da6c9a9635 100644 > --- a/fs/io_uring.c > +++ b/fs/io_uring.c > @@ -6303,6 +6303,25 @@ static void io_apoll_task_func(struct io_kiocb *req, bool *locked) > io_req_complete_failed(req, ret); > } > > +static bool __io_poll_execute_direct(struct io_kiocb *req, int mask, int events) > +{ > + struct io_ring_ctx *ctx = req->ctx; > + > + if (ctx->has_evfd || req->flags & REQ_F_INFLIGHT || > + req->opcode != IORING_OP_POLL_ADD) > + return false; > + if (!spin_trylock(&ctx->completion_lock)) > + return false; This looks as if it's losing some affinity, before all completions were co-located with httpd and now some are not. So maybe it's the TWA_SIGNAL thing. 
> + > + req->cqe.res = mangle_poll(mask & events); > + hash_del(&req->hash_node); > + __io_req_complete_post(req, req->cqe.res, 0); > + io_commit_cqring(ctx); > + spin_unlock(&ctx->completion_lock); > + io_cqring_ev_posted(ctx); > + return true; > +} > + > static void __io_poll_execute(struct io_kiocb *req, int mask, int events) > { > req->cqe.res = mask; > @@ -6384,7 +6403,8 @@ static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync, > else > req->flags &= ~REQ_F_SINGLE_POLL; > } > - __io_poll_execute(req, mask, poll->events); > + if (!__io_poll_execute_direct(req, mask, poll->events)) > + __io_poll_execute(req, mask, poll->events); > } > return 1; > } > ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: IORING_OP_POLL_ADD slower than linux-aio IOCB_CMD_POLL
  2022-04-19 15:21 ` Jens Axboe
  2022-04-19 15:51   ` Avi Kivity
@ 2022-04-19 17:14   ` Jens Axboe
  2022-04-19 19:41     ` Avi Kivity
  2022-06-15 10:12     ` Avi Kivity
  1 sibling, 2 replies; 23+ messages in thread
From: Jens Axboe @ 2022-04-19 17:14 UTC (permalink / raw)
To: Avi Kivity, io-uring

On 4/19/22 9:21 AM, Jens Axboe wrote:
> On 4/19/22 6:31 AM, Jens Axboe wrote:
>> On 4/19/22 6:21 AM, Avi Kivity wrote:
>>> On 19/04/2022 15.04, Jens Axboe wrote:
>>>> On 4/19/22 5:57 AM, Avi Kivity wrote:
>>>>> On 19/04/2022 14.38, Jens Axboe wrote:
>>>>>> On 4/19/22 5:07 AM, Avi Kivity wrote:
>>>>>>> A simple webserver shows about 5% loss compared to linux-aio.
>>>>>>>
>>>>>>>
>>>>>>> I expect the loss is due to an optimization that io_uring lacks -
>>>>>>> inline completion vs workqueue completion:
>>>>>> I don't think that's it, io_uring never punts to a workqueue for
>>>>>> completions.
>>>>>
>>>>> I measured this:
>>>>>
>>>>>
>>>>>
>>>>> Performance counter stats for 'system wide':
>>>>>
>>>>> 1,273,756 io_uring:io_uring_task_add
>>>>>
>>>>> 12.288597765 seconds time elapsed
>>>>>
>>>>> Which exactly matches with the number of requests sent. If that's the
>>>>> wrong counter to measure, I'm happy to try again with the correct
>>>>> counter.
>>>> io_uring_task_add() isn't a workqueue, it's task_work. So that is
>>>> expected.
>
> Might actually be implicated. Not because it's a async worker, but
> because I think we might be losing some affinity in this case. Looking
> at traces, we're definitely bouncing between the poll completion side
> and then execution the completion.
>
> Can you try this hack? It's against -git + for-5.19/io_uring. If you let
> me know what base you prefer, I can do a version against that. I see
> about a 3% win with io_uring with this, and was slower before against
> linux-aio as you saw as well.

Another thing to try - get rid of the IPI for TWA_SIGNAL, which I
believe may be the underlying cause of it.


diff --git a/fs/io-wq.c b/fs/io-wq.c
index 32aeb2c581c5..59987dd212d8 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -871,7 +871,7 @@ static bool io_wq_for_each_worker(struct io_wqe *wqe,
 
 static bool io_wq_worker_wake(struct io_worker *worker, void *data)
 {
-	set_notify_signal(worker->task);
+	set_notify_signal(worker->task, true);
 	wake_up_process(worker->task);
 	return false;
 }
@@ -991,7 +991,7 @@ static bool __io_wq_worker_cancel(struct io_worker *worker,
 {
 	if (work && match->fn(work, match->data)) {
 		work->flags |= IO_WQ_WORK_CANCEL;
-		set_notify_signal(worker->task);
+		set_notify_signal(worker->task, true);
 		return true;
 	}
 
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 3c8b34876744..ac1f14973e09 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -359,10 +359,10 @@ static inline void clear_notify_signal(void)
  * Called to break out of interruptible wait loops, and enter the
  * exit_to_user_mode_loop().
  */
-static inline void set_notify_signal(struct task_struct *task)
+static inline void set_notify_signal(struct task_struct *task, bool need_ipi)
 {
 	if (!test_and_set_tsk_thread_flag(task, TIF_NOTIFY_SIGNAL) &&
-	    !wake_up_state(task, TASK_INTERRUPTIBLE))
+	    !wake_up_state(task, TASK_INTERRUPTIBLE) && need_ipi)
 		kick_process(task);
 }
 
diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
index 5d03a2ad1066..bff53f539933 100644
--- a/kernel/livepatch/transition.c
+++ b/kernel/livepatch/transition.c
@@ -367,7 +367,7 @@ static void klp_send_signals(void)
 			 * Send fake signal to all non-kthread tasks which are
 			 * still not migrated.
 			 */
-			set_notify_signal(task);
+			set_notify_signal(task, true);
 		}
 	}
 	read_unlock(&tasklist_lock);
diff --git a/kernel/task_work.c b/kernel/task_work.c
index c59e1a49bc40..47d7024dc499 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -51,7 +51,7 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
 		set_notify_resume(task);
 		break;
 	case TWA_SIGNAL:
-		set_notify_signal(task);
+		set_notify_signal(task, false);
 		break;
 	default:
 		WARN_ON_ONCE(1);

-- 
Jens Axboe

^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: IORING_OP_POLL_ADD slower than linux-aio IOCB_CMD_POLL 2022-04-19 17:14 ` Jens Axboe @ 2022-04-19 19:41 ` Avi Kivity 2022-04-19 19:58 ` Jens Axboe 2022-06-15 10:12 ` Avi Kivity 1 sibling, 1 reply; 23+ messages in thread From: Avi Kivity @ 2022-04-19 19:41 UTC (permalink / raw) To: Jens Axboe, io-uring On 19/04/2022 20.14, Jens Axboe wrote: > On 4/19/22 9:21 AM, Jens Axboe wrote: >> On 4/19/22 6:31 AM, Jens Axboe wrote: >>> On 4/19/22 6:21 AM, Avi Kivity wrote: >>>> On 19/04/2022 15.04, Jens Axboe wrote: >>>>> On 4/19/22 5:57 AM, Avi Kivity wrote: >>>>>> On 19/04/2022 14.38, Jens Axboe wrote: >>>>>>> On 4/19/22 5:07 AM, Avi Kivity wrote: >>>>>>>> A simple webserver shows about 5% loss compared to linux-aio. >>>>>>>> >>>>>>>> >>>>>>>> I expect the loss is due to an optimization that io_uring lacks - >>>>>>>> inline completion vs workqueue completion: >>>>>>> I don't think that's it, io_uring never punts to a workqueue for >>>>>>> completions. >>>>>> I measured this: >>>>>> >>>>>> >>>>>> >>>>>> Performance counter stats for 'system wide': >>>>>> >>>>>> 1,273,756 io_uring:io_uring_task_add >>>>>> >>>>>> 12.288597765 seconds time elapsed >>>>>> >>>>>> Which exactly matches with the number of requests sent. If that's the >>>>>> wrong counter to measure, I'm happy to try again with the correct >>>>>> counter. >>>>> io_uring_task_add() isn't a workqueue, it's task_work. So that is >>>>> expected. >> Might actually be implicated. Not because it's a async worker, but >> because I think we might be losing some affinity in this case. Looking >> at traces, we're definitely bouncing between the poll completion side >> and then execution the completion. >> >> Can you try this hack? It's against -git + for-5.19/io_uring. If you let >> me know what base you prefer, I can do a version against that. I see >> about a 3% win with io_uring with this, and was slower before against >> linux-aio as you saw as well. > Another thing to try - get rid of the IPI for TWA_SIGNAL, which I > believe may be the underlying cause of it. > Won't it delay notification until the next io_uring_enter? Or does io_uring only guarantee completions when you call it (and earlier completions are best-effort?) For Seastar it's not a problem, asking about the general io_uring completion philosophy. I'll try it tomorrow (also the other patch). ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: IORING_OP_POLL_ADD slower than linux-aio IOCB_CMD_POLL 2022-04-19 19:41 ` Avi Kivity @ 2022-04-19 19:58 ` Jens Axboe 2022-04-20 11:55 ` Avi Kivity 0 siblings, 1 reply; 23+ messages in thread From: Jens Axboe @ 2022-04-19 19:58 UTC (permalink / raw) To: Avi Kivity, io-uring On 4/19/22 1:41 PM, Avi Kivity wrote: > > On 19/04/2022 20.14, Jens Axboe wrote: >> On 4/19/22 9:21 AM, Jens Axboe wrote: >>> On 4/19/22 6:31 AM, Jens Axboe wrote: >>>> On 4/19/22 6:21 AM, Avi Kivity wrote: >>>>> On 19/04/2022 15.04, Jens Axboe wrote: >>>>>> On 4/19/22 5:57 AM, Avi Kivity wrote: >>>>>>> On 19/04/2022 14.38, Jens Axboe wrote: >>>>>>>> On 4/19/22 5:07 AM, Avi Kivity wrote: >>>>>>>>> A simple webserver shows about 5% loss compared to linux-aio. >>>>>>>>> >>>>>>>>> >>>>>>>>> I expect the loss is due to an optimization that io_uring lacks - >>>>>>>>> inline completion vs workqueue completion: >>>>>>>> I don't think that's it, io_uring never punts to a workqueue for >>>>>>>> completions. >>>>>>> I measured this: >>>>>>> >>>>>>> >>>>>>> >>>>>>> Performance counter stats for 'system wide': >>>>>>> >>>>>>> 1,273,756 io_uring:io_uring_task_add >>>>>>> >>>>>>> 12.288597765 seconds time elapsed >>>>>>> >>>>>>> Which exactly matches with the number of requests sent. If that's the >>>>>>> wrong counter to measure, I'm happy to try again with the correct >>>>>>> counter. >>>>>> io_uring_task_add() isn't a workqueue, it's task_work. So that is >>>>>> expected. >>> Might actually be implicated. Not because it's a async worker, but >>> because I think we might be losing some affinity in this case. Looking >>> at traces, we're definitely bouncing between the poll completion side >>> and then execution the completion. >>> >>> Can you try this hack? It's against -git + for-5.19/io_uring. If you let >>> me know what base you prefer, I can do a version against that. I see >>> about a 3% win with io_uring with this, and was slower before against >>> linux-aio as you saw as well. >> Another thing to try - get rid of the IPI for TWA_SIGNAL, which I >> believe may be the underlying cause of it. >> > > Won't it delay notification until the next io_uring_enter? Or does > io_uring only guarantee completions when you call it (and earlier > completions are best-effort?) Only if it needs to reschedule, it'll still enter the kernel if not. Or if it's waiting in the kernel, it'll still run the task work as the TIF_NOTIFY_SIGNAL will get that job done. So actually not sure if we ever need the IPI, doesn't seem like we do. > I'll try it tomorrow (also the other patch). Thanks! -- Jens Axboe ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: IORING_OP_POLL_ADD slower than linux-aio IOCB_CMD_POLL 2022-04-19 19:58 ` Jens Axboe @ 2022-04-20 11:55 ` Avi Kivity 2022-04-20 12:09 ` Jens Axboe 0 siblings, 1 reply; 23+ messages in thread From: Avi Kivity @ 2022-04-20 11:55 UTC (permalink / raw) To: Jens Axboe, io-uring On 19/04/2022 22.58, Jens Axboe wrote: > >> I'll try it tomorrow (also the other patch). > Thanks! > With the new kernel, I get io_uring_setup(200, {flags=0, sq_thread_cpu=0, sq_thread_idle=0, sq_entries=256, cq_entries=512, features=IORING_FEAT_SINGLE_MMAP|IORING_FEAT_NODROP|IORING_FEAT_SUBMIT_STABLE|IORING_FEAT_RW_CUR_POS|IORING_FEAT_CUR_PERSONALITY|IORING_FEAT_FAST_POLL|IORING_FEAT_POLL_32BITS|IORING_FEAT_SQPOLL_NONFIXED|IORING_FEAT_EXT_ARG|IORING_FEAT_NATIVE_WORKERS|IORING_FEAT_RSRC_TAGS|0x1800, sq_off={head=0, tail=64, ring_mask=256, ring_entries=264, flags=276, dropped=272, array=8512}, cq_off={head=128, tail=192, ring_mask=260, ring_entries=268, overflow=284, cqes=320, flags=280}}) = 7 mmap(NULL, 9536, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE, 7, 0) = -1 EACCES (Permission denied) ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: IORING_OP_POLL_ADD slower than linux-aio IOCB_CMD_POLL 2022-04-20 11:55 ` Avi Kivity @ 2022-04-20 12:09 ` Jens Axboe 2022-04-21 9:05 ` Avi Kivity 0 siblings, 1 reply; 23+ messages in thread From: Jens Axboe @ 2022-04-20 12:09 UTC (permalink / raw) To: Avi Kivity, io-uring On 4/20/22 5:55 AM, Avi Kivity wrote: > > On 19/04/2022 22.58, Jens Axboe wrote: >> >>> I'll try it tomorrow (also the other patch). >> Thanks! >> > > With the new kernel, I get > > > io_uring_setup(200, {flags=0, sq_thread_cpu=0, sq_thread_idle=0, sq_entries=256, cq_entries=512, features=IORING_FEAT_SINGLE_MMAP|IORING_FEAT_NODROP|IORING_FEAT_SUBMIT_STABLE|IORING_FEAT_RW_CUR_POS|IORING_FEAT_CUR_PERSONALITY|IORING_FEAT_FAST_POLL|IORING_FEAT_POLL_32BITS|IORING_FEAT_SQPOLL_NONFIXED|IORING_FEAT_EXT_ARG|IORING_FEAT_NATIVE_WORKERS|IORING_FEAT_RSRC_TAGS|0x1800, sq_off={head=0, tail=64, ring_mask=256, ring_entries=264, flags=276, dropped=272, array=8512}, cq_off={head=128, tail=192, ring_mask=260, ring_entries=268, overflow=284, cqes=320, flags=280}}) = 7 > mmap(NULL, 9536, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE, 7, 0) = -1 EACCES (Permission denied) That looks odd, and not really related at all? -- Jens Axboe ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: IORING_OP_POLL_ADD slower than linux-aio IOCB_CMD_POLL 2022-04-20 12:09 ` Jens Axboe @ 2022-04-21 9:05 ` Avi Kivity 0 siblings, 0 replies; 23+ messages in thread From: Avi Kivity @ 2022-04-21 9:05 UTC (permalink / raw) To: Jens Axboe, io-uring On 20/04/2022 15.09, Jens Axboe wrote: > On 4/20/22 5:55 AM, Avi Kivity wrote: >> On 19/04/2022 22.58, Jens Axboe wrote: >>>> I'll try it tomorrow (also the other patch). >>> Thanks! >>> >> With the new kernel, I get >> >> >> io_uring_setup(200, {flags=0, sq_thread_cpu=0, sq_thread_idle=0, sq_entries=256, cq_entries=512, features=IORING_FEAT_SINGLE_MMAP|IORING_FEAT_NODROP|IORING_FEAT_SUBMIT_STABLE|IORING_FEAT_RW_CUR_POS|IORING_FEAT_CUR_PERSONALITY|IORING_FEAT_FAST_POLL|IORING_FEAT_POLL_32BITS|IORING_FEAT_SQPOLL_NONFIXED|IORING_FEAT_EXT_ARG|IORING_FEAT_NATIVE_WORKERS|IORING_FEAT_RSRC_TAGS|0x1800, sq_off={head=0, tail=64, ring_mask=256, ring_entries=264, flags=276, dropped=272, array=8512}, cq_off={head=128, tail=192, ring_mask=260, ring_entries=268, overflow=284, cqes=320, flags=280}}) = 7 >> mmap(NULL, 9536, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE, 7, 0) = -1 EACCES (Permission denied) > That looks odd, and not really related at all? > It was selinux. So perhaps there is a regression somewhere that breaks selinux+io_uring. The reason I encountered selinux was that I ran it in a virtual machine (nvidia + custom kernel = ...). Now the problem is that the results in the VM are very inconsistent, perhaps due to incorrect topology exposed to the guest. So I can't provide meaningful results from the patches, at least until I find a more workable setup. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: IORING_OP_POLL_ADD slower than linux-aio IOCB_CMD_POLL 2022-04-19 17:14 ` Jens Axboe 2022-04-19 19:41 ` Avi Kivity @ 2022-06-15 10:12 ` Avi Kivity 2022-06-15 10:48 ` Pavel Begunkov 1 sibling, 1 reply; 23+ messages in thread From: Avi Kivity @ 2022-06-15 10:12 UTC (permalink / raw) To: Jens Axboe, io-uring On 19/04/2022 20.14, Jens Axboe wrote: > On 4/19/22 9:21 AM, Jens Axboe wrote: >> On 4/19/22 6:31 AM, Jens Axboe wrote: >>> On 4/19/22 6:21 AM, Avi Kivity wrote: >>>> On 19/04/2022 15.04, Jens Axboe wrote: >>>>> On 4/19/22 5:57 AM, Avi Kivity wrote: >>>>>> On 19/04/2022 14.38, Jens Axboe wrote: >>>>>>> On 4/19/22 5:07 AM, Avi Kivity wrote: >>>>>>>> A simple webserver shows about 5% loss compared to linux-aio. >>>>>>>> >>>>>>>> >>>>>>>> I expect the loss is due to an optimization that io_uring lacks - >>>>>>>> inline completion vs workqueue completion: >>>>>>> I don't think that's it, io_uring never punts to a workqueue for >>>>>>> completions. >>>>>> I measured this: >>>>>> >>>>>> >>>>>> >>>>>> Performance counter stats for 'system wide': >>>>>> >>>>>> 1,273,756 io_uring:io_uring_task_add >>>>>> >>>>>> 12.288597765 seconds time elapsed >>>>>> >>>>>> Which exactly matches with the number of requests sent. If that's the >>>>>> wrong counter to measure, I'm happy to try again with the correct >>>>>> counter. >>>>> io_uring_task_add() isn't a workqueue, it's task_work. So that is >>>>> expected. >> Might actually be implicated. Not because it's a async worker, but >> because I think we might be losing some affinity in this case. Looking >> at traces, we're definitely bouncing between the poll completion side >> and then execution the completion. >> >> Can you try this hack? It's against -git + for-5.19/io_uring. If you let >> me know what base you prefer, I can do a version against that. I see >> about a 3% win with io_uring with this, and was slower before against >> linux-aio as you saw as well. > Another thing to try - get rid of the IPI for TWA_SIGNAL, which I > believe may be the underlying cause of it. > Resurrecting an old thread. I have a question about timeliness of completions. Let's assume a request has completed. From the patch, it appears that io_uring will only guarantee that a completion appears on the completion ring if the thread has entered kernel mode since the completion happened. So user-space polling of the completion ring can cause unbounded delays. If this is correct (it's not unreasonable, but should be documented), then there should also be a simple way to force a kernel entry. But how to do this using liburing? IIUC if I the following apply: 1. I have no pending sqes 2. There are pending completions 3. There is a completed event for which a completion has not been appended to the completion queue ring Then io_uring_wait_cqe() will elide io_uring_enter() and the completed-but-not-reported event will be delayed. 
> diff --git a/fs/io-wq.c b/fs/io-wq.c > index 32aeb2c581c5..59987dd212d8 100644 > --- a/fs/io-wq.c > +++ b/fs/io-wq.c > @@ -871,7 +871,7 @@ static bool io_wq_for_each_worker(struct io_wqe *wqe, > > static bool io_wq_worker_wake(struct io_worker *worker, void *data) > { > - set_notify_signal(worker->task); > + set_notify_signal(worker->task, true); > wake_up_process(worker->task); > return false; > } > @@ -991,7 +991,7 @@ static bool __io_wq_worker_cancel(struct io_worker *worker, > { > if (work && match->fn(work, match->data)) { > work->flags |= IO_WQ_WORK_CANCEL; > - set_notify_signal(worker->task); > + set_notify_signal(worker->task, true); > return true; > } > > diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h > index 3c8b34876744..ac1f14973e09 100644 > --- a/include/linux/sched/signal.h > +++ b/include/linux/sched/signal.h > @@ -359,10 +359,10 @@ static inline void clear_notify_signal(void) > * Called to break out of interruptible wait loops, and enter the > * exit_to_user_mode_loop(). > */ > -static inline void set_notify_signal(struct task_struct *task) > +static inline void set_notify_signal(struct task_struct *task, bool need_ipi) > { > if (!test_and_set_tsk_thread_flag(task, TIF_NOTIFY_SIGNAL) && > - !wake_up_state(task, TASK_INTERRUPTIBLE)) > + !wake_up_state(task, TASK_INTERRUPTIBLE) && need_ipi) > kick_process(task); > } > > diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c > index 5d03a2ad1066..bff53f539933 100644 > --- a/kernel/livepatch/transition.c > +++ b/kernel/livepatch/transition.c > @@ -367,7 +367,7 @@ static void klp_send_signals(void) > * Send fake signal to all non-kthread tasks which are > * still not migrated. > */ > - set_notify_signal(task); > + set_notify_signal(task, true); > } > } > read_unlock(&tasklist_lock); > diff --git a/kernel/task_work.c b/kernel/task_work.c > index c59e1a49bc40..47d7024dc499 100644 > --- a/kernel/task_work.c > +++ b/kernel/task_work.c > @@ -51,7 +51,7 @@ int task_work_add(struct task_struct *task, struct callback_head *work, > set_notify_resume(task); > break; > case TWA_SIGNAL: > - set_notify_signal(task); > + set_notify_signal(task, false); > break; > default: > WARN_ON_ONCE(1); > ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: IORING_OP_POLL_ADD slower than linux-aio IOCB_CMD_POLL 2022-06-15 10:12 ` Avi Kivity @ 2022-06-15 10:48 ` Pavel Begunkov 2022-06-15 11:04 ` Avi Kivity 0 siblings, 1 reply; 23+ messages in thread From: Pavel Begunkov @ 2022-06-15 10:48 UTC (permalink / raw) To: Avi Kivity, Jens Axboe, io-uring On 6/15/22 11:12, Avi Kivity wrote: > > On 19/04/2022 20.14, Jens Axboe wrote: >> On 4/19/22 9:21 AM, Jens Axboe wrote: >>> On 4/19/22 6:31 AM, Jens Axboe wrote: >>>> On 4/19/22 6:21 AM, Avi Kivity wrote: >>>>> On 19/04/2022 15.04, Jens Axboe wrote: >>>>>> On 4/19/22 5:57 AM, Avi Kivity wrote: >>>>>>> On 19/04/2022 14.38, Jens Axboe wrote: >>>>>>>> On 4/19/22 5:07 AM, Avi Kivity wrote: >>>>>>>>> A simple webserver shows about 5% loss compared to linux-aio. >>>>>>>>> >>>>>>>>> >>>>>>>>> I expect the loss is due to an optimization that io_uring lacks - >>>>>>>>> inline completion vs workqueue completion: >>>>>>>> I don't think that's it, io_uring never punts to a workqueue for >>>>>>>> completions. >>>>>>> I measured this: >>>>>>> >>>>>>> >>>>>>> >>>>>>> Performance counter stats for 'system wide': >>>>>>> >>>>>>> 1,273,756 io_uring:io_uring_task_add >>>>>>> >>>>>>> 12.288597765 seconds time elapsed >>>>>>> >>>>>>> Which exactly matches with the number of requests sent. If that's the >>>>>>> wrong counter to measure, I'm happy to try again with the correct >>>>>>> counter. >>>>>> io_uring_task_add() isn't a workqueue, it's task_work. So that is >>>>>> expected. >>> Might actually be implicated. Not because it's a async worker, but >>> because I think we might be losing some affinity in this case. Looking >>> at traces, we're definitely bouncing between the poll completion side >>> and then execution the completion. >>> >>> Can you try this hack? It's against -git + for-5.19/io_uring. If you let >>> me know what base you prefer, I can do a version against that. I see >>> about a 3% win with io_uring with this, and was slower before against >>> linux-aio as you saw as well. >> Another thing to try - get rid of the IPI for TWA_SIGNAL, which I >> believe may be the underlying cause of it. >> > > Resurrecting an old thread. I have a question about timeliness of completions. Let's assume a request has completed. From the patch, it appears that io_uring will only guarantee that a completion appears on the completion ring if the thread has entered kernel mode since the completion happened. So user-space polling of the completion ring can cause unbounded delays. Right, but polling the CQ is a bad pattern, io_uring_{wait,peek}_cqe/etc. will do the polling vs syscalling dance for you. For larger audience, I'll remind that it's an opt-in feature > If this is correct (it's not unreasonable, but should be documented), then there should also be a simple way to force a kernel entry. But how to do this using liburing? IIUC if I the following apply: > > > 1. I have no pending sqes > > 2. There are pending completions > > 3. There is a completed event for which a completion has not been appended to the completion queue ring > > > Then io_uring_wait_cqe() will elide io_uring_enter() and the completed-but-not-reported event will be delayed. One way is to process all CQEs and then it'll try to enter the kernel and do the job. Another way is to also set IORING_SETUP_TASKRUN_FLAG, then when there is work that requires to enter the kernel io_uring will set IORING_SQ_TASKRUN in sq_flags. 
Actually, if I'm not mistaken, io_uring has some automagic handling
of it internally:

https://github.com/axboe/liburing/blob/master/src/queue.c#L36

-- 
Pavel Begunkov

^ permalink raw reply [flat|nested] 23+ messages in thread
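(A minimal sketch of the pattern described above, assuming liburing: drain
whatever is already visible with io_uring_peek_cqe(), and only fall back to
io_uring_wait_cqe() -- which enters the kernel and therefore also lets pending
task_work post its CQEs -- once the ring looks empty. The helper names are
illustrative.)

#include <liburing.h>

static void handle_cqe(struct io_uring_cqe *cqe)
{
	/* application completion logic goes here */
}

static void reactor_iteration(struct io_uring *ring)
{
	struct io_uring_cqe *cqe;

	/* fast path: consume everything already visible in user space */
	while (io_uring_peek_cqe(ring, &cqe) == 0) {
		handle_cqe(cqe);
		io_uring_cqe_seen(ring, cqe);
	}

	/*
	 * slow path: block until something arrives; entering the kernel here
	 * is what allows deferred task_work to run and post its completions
	 */
	if (io_uring_wait_cqe(ring, &cqe) == 0) {
		handle_cqe(cqe);
		io_uring_cqe_seen(ring, cqe);
	}
}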
* Re: IORING_OP_POLL_ADD slower than linux-aio IOCB_CMD_POLL 2022-06-15 10:48 ` Pavel Begunkov @ 2022-06-15 11:04 ` Avi Kivity 2022-06-15 11:07 ` Avi Kivity 2022-06-15 11:30 ` Pavel Begunkov 0 siblings, 2 replies; 23+ messages in thread From: Avi Kivity @ 2022-06-15 11:04 UTC (permalink / raw) To: Pavel Begunkov, Jens Axboe, io-uring On 15/06/2022 13.48, Pavel Begunkov wrote: > On 6/15/22 11:12, Avi Kivity wrote: >> >> On 19/04/2022 20.14, Jens Axboe wrote: >>> On 4/19/22 9:21 AM, Jens Axboe wrote: >>>> On 4/19/22 6:31 AM, Jens Axboe wrote: >>>>> On 4/19/22 6:21 AM, Avi Kivity wrote: >>>>>> On 19/04/2022 15.04, Jens Axboe wrote: >>>>>>> On 4/19/22 5:57 AM, Avi Kivity wrote: >>>>>>>> On 19/04/2022 14.38, Jens Axboe wrote: >>>>>>>>> On 4/19/22 5:07 AM, Avi Kivity wrote: >>>>>>>>>> A simple webserver shows about 5% loss compared to linux-aio. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> I expect the loss is due to an optimization that io_uring >>>>>>>>>> lacks - >>>>>>>>>> inline completion vs workqueue completion: >>>>>>>>> I don't think that's it, io_uring never punts to a workqueue for >>>>>>>>> completions. >>>>>>>> I measured this: >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> Performance counter stats for 'system wide': >>>>>>>> >>>>>>>> 1,273,756 io_uring:io_uring_task_add >>>>>>>> >>>>>>>> 12.288597765 seconds time elapsed >>>>>>>> >>>>>>>> Which exactly matches with the number of requests sent. If >>>>>>>> that's the >>>>>>>> wrong counter to measure, I'm happy to try again with the correct >>>>>>>> counter. >>>>>>> io_uring_task_add() isn't a workqueue, it's task_work. So that is >>>>>>> expected. >>>> Might actually be implicated. Not because it's a async worker, but >>>> because I think we might be losing some affinity in this case. Looking >>>> at traces, we're definitely bouncing between the poll completion side >>>> and then execution the completion. >>>> >>>> Can you try this hack? It's against -git + for-5.19/io_uring. If >>>> you let >>>> me know what base you prefer, I can do a version against that. I see >>>> about a 3% win with io_uring with this, and was slower before against >>>> linux-aio as you saw as well. >>> Another thing to try - get rid of the IPI for TWA_SIGNAL, which I >>> believe may be the underlying cause of it. >>> >> >> Resurrecting an old thread. I have a question about timeliness of >> completions. Let's assume a request has completed. From the patch, it >> appears that io_uring will only guarantee that a completion appears >> on the completion ring if the thread has entered kernel mode since >> the completion happened. So user-space polling of the completion ring >> can cause unbounded delays. > > Right, but polling the CQ is a bad pattern, io_uring_{wait,peek}_cqe/etc. > will do the polling vs syscalling dance for you. Can you be more explicit? I don't think peek is enough. If there is a cqe pending, it will return it, but will not cause compeleted-but-unqueued events to generate completions. And wait won't enter the kernel if a cqe is pending, IIUC. > > For larger audience, I'll remind that it's an opt-in feature > I don't understand - what is an opt-in feature? > >> If this is correct (it's not unreasonable, but should be documented), >> then there should also be a simple way to force a kernel entry. But >> how to do this using liburing? IIUC if I the following apply: >> >> >> 1. I have no pending sqes >> >> 2. There are pending completions >> >> 3. 
There is a completed event for which a completion has not been >> appended to the completion queue ring >> >> >> Then io_uring_wait_cqe() will elide io_uring_enter() and the >> completed-but-not-reported event will be delayed. > > One way is to process all CQEs and then it'll try to enter the > kernel and do the job. > > Another way is to also set IORING_SETUP_TASKRUN_FLAG, then when > there is work that requires to enter the kernel io_uring will > set IORING_SQ_TASKRUN in sq_flags. > Actually, I'm not mistaken io_uring has some automagic handling > of it internally > > https://github.com/axboe/liburing/blob/master/src/queue.c#L36 > > > ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: IORING_OP_POLL_ADD slower than linux-aio IOCB_CMD_POLL 2022-06-15 11:04 ` Avi Kivity @ 2022-06-15 11:07 ` Avi Kivity 2022-06-15 11:38 ` Pavel Begunkov 2022-06-15 11:30 ` Pavel Begunkov 1 sibling, 1 reply; 23+ messages in thread From: Avi Kivity @ 2022-06-15 11:07 UTC (permalink / raw) To: Pavel Begunkov, Jens Axboe, io-uring Sorry, hit send too quiclky. > >> >>> If this is correct (it's not unreasonable, but should be >>> documented), then there should also be a simple way to force a >>> kernel entry. But how to do this using liburing? IIUC if I the >>> following apply: >>> >>> >>> 1. I have no pending sqes >>> >>> 2. There are pending completions >>> >>> 3. There is a completed event for which a completion has not been >>> appended to the completion queue ring >>> >>> >>> Then io_uring_wait_cqe() will elide io_uring_enter() and the >>> completed-but-not-reported event will be delayed. >> >> One way is to process all CQEs and then it'll try to enter the >> kernel and do the job. But I don't want it to wait. I want it to generate pending completions, and return immediately even if no completions were generated. I have some background computations I'm happy to perform if no events are pending, but I would like those events to be generated promptly. >> >> Another way is to also set IORING_SETUP_TASKRUN_FLAG, then when >> there is work that requires to enter the kernel io_uring will >> set IORING_SQ_TASKRUN in sq_flags. >> Actually, I'm not mistaken io_uring has some automagic handling >> of it internally >> >> https://github.com/axboe/liburing/blob/master/src/queue.c#L36 >> Is there documentation about this flag? >> >> ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: IORING_OP_POLL_ADD slower than linux-aio IOCB_CMD_POLL 2022-06-15 11:07 ` Avi Kivity @ 2022-06-15 11:38 ` Pavel Begunkov 2022-06-15 12:21 ` Jens Axboe 0 siblings, 1 reply; 23+ messages in thread From: Pavel Begunkov @ 2022-06-15 11:38 UTC (permalink / raw) To: Avi Kivity, Jens Axboe, io-uring On 6/15/22 12:07, Avi Kivity wrote: > Sorry, hit send too quiclky. >>>> If this is correct (it's not unreasonable, but should be documented), then there should also be a simple way to force a kernel entry. But how to do this using liburing? IIUC if I the following apply: >>>> >>>> >>>> 1. I have no pending sqes >>>> >>>> 2. There are pending completions >>>> >>>> 3. There is a completed event for which a completion has not been appended to the completion queue ring >>>> >>>> >>>> Then io_uring_wait_cqe() will elide io_uring_enter() and the completed-but-not-reported event will be delayed. >>> >>> One way is to process all CQEs and then it'll try to enter the >>> kernel and do the job. > > > But I don't want it to wait. I want it to generate pending completions, and return immediately even if no completions were generated. I have some background computations I'm happy to perform if no events are pending, but I would like those events to be generated promptly. Ok, then you may want IORING_SETUP_TASKRUN_FLAG described below. btw IORING_SETUP_TASKRUN_FLAG has some additional overhead. >>> Another way is to also set IORING_SETUP_TASKRUN_FLAG, then when >>> there is work that requires to enter the kernel io_uring will >>> set IORING_SQ_TASKRUN in sq_flags. >>> Actually, I'm not mistaken io_uring has some automagic handling >>> of it internally >>> >>> https://github.com/axboe/liburing/blob/master/src/queue.c#L36 >>> > > Is there documentation about this flag? Unfortunately, I don't see any. Here is a link to the kernel commit if it helps: https://git.kernel.dk/cgit/linux-block/commit/?h=ef060ea9e4fd3b763e7060a3af0a258d2d5d7c0d I think it's more interesting what support liburing has, but would need to look up in the code. Maybe Jens remembers and can tell. -- Pavel Begunkov ^ permalink raw reply [flat|nested] 23+ messages in thread
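(A minimal sketch of the opt-in setup being discussed, assuming a 5.19+ kernel
and a liburing new enough to know these flags: IORING_SETUP_COOP_TASKRUN defers
the task_work IPI, and IORING_SETUP_TASKRUN_FLAG makes the kernel raise
IORING_SQ_TASKRUN when an io_uring_enter() is needed to post pending
completions. Per the queue.c link above, liburing's peek/wait helpers check
that flag themselves, so a non-blocking reactor tick like the one below still
sees completions promptly. Ring size and helper names are illustrative.)

#include <liburing.h>

static int setup_ring(struct io_uring *ring)
{
	struct io_uring_params p = { 0 };

	p.flags = IORING_SETUP_COOP_TASKRUN | IORING_SETUP_TASKRUN_FLAG;
	return io_uring_queue_init_params(256, ring, &p);
}

/* One non-blocking reactor tick; returns the number of CQEs handled. */
static int reactor_tick(struct io_uring *ring)
{
	struct io_uring_cqe *cqe;
	int handled = 0;

	/*
	 * io_uring_peek_cqe() notices IORING_SQ_TASKRUN and performs the
	 * io_uring_enter(GETEVENTS) itself when required, so deferred
	 * task_work is flushed before "no completions" is reported.
	 */
	while (io_uring_peek_cqe(ring, &cqe) == 0) {
		/* handle cqe->user_data / cqe->res here */
		io_uring_cqe_seen(ring, cqe);
		handled++;
	}
	return handled;		/* 0 -> free to run background work */
}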
* Re: IORING_OP_POLL_ADD slower than linux-aio IOCB_CMD_POLL 2022-06-15 11:38 ` Pavel Begunkov @ 2022-06-15 12:21 ` Jens Axboe 2022-06-15 13:43 ` Avi Kivity 0 siblings, 1 reply; 23+ messages in thread From: Jens Axboe @ 2022-06-15 12:21 UTC (permalink / raw) To: Pavel Begunkov, Avi Kivity, io-uring On 6/15/22 5:38 AM, Pavel Begunkov wrote: >>>> Another way is to also set IORING_SETUP_TASKRUN_FLAG, then when >>>> there is work that requires to enter the kernel io_uring will >>>> set IORING_SQ_TASKRUN in sq_flags. >>>> Actually, I'm not mistaken io_uring has some automagic handling >>>> of it internally >>>> >>>> https://github.com/axboe/liburing/blob/master/src/queue.c#L36 >>>> >> >> Is there documentation about this flag? > > Unfortunately, I don't see any. > Here is a link to the kernel commit if it helps: > > https://git.kernel.dk/cgit/linux-block/commit/?h=ef060ea9e4fd3b763e7060a3af0a258d2d5d7c0d > > I think it's more interesting what support liburing has, > but would need to look up in the code. Maybe Jens > remembers and can tell. It is documented, in the io_uring.2 man page in liburing: https://git.kernel.dk/cgit/liburing/tree/man/io_uring_setup.2#n200 -- Jens Axboe ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: IORING_OP_POLL_ADD slower than linux-aio IOCB_CMD_POLL 2022-06-15 12:21 ` Jens Axboe @ 2022-06-15 13:43 ` Avi Kivity 0 siblings, 0 replies; 23+ messages in thread From: Avi Kivity @ 2022-06-15 13:43 UTC (permalink / raw) To: Jens Axboe, Pavel Begunkov, io-uring On 15/06/2022 15.21, Jens Axboe wrote: > On 6/15/22 5:38 AM, Pavel Begunkov wrote: >>>>> Another way is to also set IORING_SETUP_TASKRUN_FLAG, then when >>>>> there is work that requires to enter the kernel io_uring will >>>>> set IORING_SQ_TASKRUN in sq_flags. >>>>> Actually, I'm not mistaken io_uring has some automagic handling >>>>> of it internally >>>>> >>>>> https://github.com/axboe/liburing/blob/master/src/queue.c#L36 >>>>> >>> Is there documentation about this flag? >> Unfortunately, I don't see any. >> Here is a link to the kernel commit if it helps: >> >> https://git.kernel.dk/cgit/linux-block/commit/?h=ef060ea9e4fd3b763e7060a3af0a258d2d5d7c0d >> >> I think it's more interesting what support liburing has, >> but would need to look up in the code. Maybe Jens >> remembers and can tell. > It is documented, in the io_uring.2 man page in liburing: > > https://git.kernel.dk/cgit/liburing/tree/man/io_uring_setup.2#n200 > Perfect. Thanks! ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: IORING_OP_POLL_ADD slower than linux-aio IOCB_CMD_POLL 2022-06-15 11:04 ` Avi Kivity 2022-06-15 11:07 ` Avi Kivity @ 2022-06-15 11:30 ` Pavel Begunkov 2022-06-15 11:36 ` Avi Kivity 1 sibling, 1 reply; 23+ messages in thread From: Pavel Begunkov @ 2022-06-15 11:30 UTC (permalink / raw) To: Avi Kivity, Jens Axboe, io-uring On 6/15/22 12:04, Avi Kivity wrote: > > On 15/06/2022 13.48, Pavel Begunkov wrote: >> On 6/15/22 11:12, Avi Kivity wrote: >>> >>> On 19/04/2022 20.14, Jens Axboe wrote: >>>> On 4/19/22 9:21 AM, Jens Axboe wrote: >>>>> On 4/19/22 6:31 AM, Jens Axboe wrote: >>>>>> On 4/19/22 6:21 AM, Avi Kivity wrote: >>>>>>> On 19/04/2022 15.04, Jens Axboe wrote: >>>>>>>> On 4/19/22 5:57 AM, Avi Kivity wrote: >>>>>>>>> On 19/04/2022 14.38, Jens Axboe wrote: >>>>>>>>>> On 4/19/22 5:07 AM, Avi Kivity wrote: >>>>>>>>>>> A simple webserver shows about 5% loss compared to linux-aio. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> I expect the loss is due to an optimization that io_uring lacks - >>>>>>>>>>> inline completion vs workqueue completion: >>>>>>>>>> I don't think that's it, io_uring never punts to a workqueue for >>>>>>>>>> completions. >>>>>>>>> I measured this: >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> Performance counter stats for 'system wide': >>>>>>>>> >>>>>>>>> 1,273,756 io_uring:io_uring_task_add >>>>>>>>> >>>>>>>>> 12.288597765 seconds time elapsed >>>>>>>>> >>>>>>>>> Which exactly matches with the number of requests sent. If that's the >>>>>>>>> wrong counter to measure, I'm happy to try again with the correct >>>>>>>>> counter. >>>>>>>> io_uring_task_add() isn't a workqueue, it's task_work. So that is >>>>>>>> expected. >>>>> Might actually be implicated. Not because it's a async worker, but >>>>> because I think we might be losing some affinity in this case. Looking >>>>> at traces, we're definitely bouncing between the poll completion side >>>>> and then execution the completion. >>>>> >>>>> Can you try this hack? It's against -git + for-5.19/io_uring. If you let >>>>> me know what base you prefer, I can do a version against that. I see >>>>> about a 3% win with io_uring with this, and was slower before against >>>>> linux-aio as you saw as well. >>>> Another thing to try - get rid of the IPI for TWA_SIGNAL, which I >>>> believe may be the underlying cause of it. >>>> >>> >>> Resurrecting an old thread. I have a question about timeliness of completions. Let's assume a request has completed. From the patch, it appears that io_uring will only guarantee that a completion appears on the completion ring if the thread has entered kernel mode since the completion happened. So user-space polling of the completion ring can cause unbounded delays. >> >> Right, but polling the CQ is a bad pattern, io_uring_{wait,peek}_cqe/etc. >> will do the polling vs syscalling dance for you. > > > Can you be more explicit? > > > I don't think peek is enough. If there is a cqe pending, it will return it, but will not cause compeleted-but-unqueued events to generate completions. > > > And wait won't enter the kernel if a cqe is pending, IIUC. Right, usually it won't, but works if you eventually end up waiting, e.g. by waiting for all expected cqes. >> For larger audience, I'll remind that it's an opt-in feature >> > > I don't understand - what is an opt-in feature? The behaviour that you worry about when CQEs are not posted until you do syscall, it's only so if you set IORING_SETUP_COOP_TASKRUN. 
>>> If this is correct (it's not unreasonable, but should be documented), then there should also be a simple way to force a kernel entry. But how to do this using liburing? IIUC if I the following apply: >>> >>> >>> 1. I have no pending sqes >>> >>> 2. There are pending completions >>> >>> 3. There is a completed event for which a completion has not been appended to the completion queue ring >>> >>> >>> Then io_uring_wait_cqe() will elide io_uring_enter() and the completed-but-not-reported event will be delayed. >> >> One way is to process all CQEs and then it'll try to enter the >> kernel and do the job. >> >> Another way is to also set IORING_SETUP_TASKRUN_FLAG, then when >> there is work that requires to enter the kernel io_uring will >> set IORING_SQ_TASKRUN in sq_flags. >> Actually, I'm not mistaken io_uring has some automagic handling >> of it internally >> >> https://github.com/axboe/liburing/blob/master/src/queue.c#L36 >> >> >> -- Pavel Begunkov ^ permalink raw reply [flat|nested] 23+ messages in thread
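To make the above concrete, here is a minimal sketch of the loop shape being described - drain what is already visible, then genuinely wait - assuming a ring set up with IORING_SETUP_COOP_TASKRUN and using only standard liburing calls. Error handling is trimmed and this is only an illustration, not Seastar's actual reactor code.

#include <liburing.h>

/* Sketch: consume everything already posted with peek, and fall back to
 * io_uring_wait_cqe() when the CQ looks empty. With IORING_SETUP_COOP_TASKRUN
 * the wait/peek path is what ends up entering the kernel and running the
 * deferred task_work, so "completed but not yet posted" events get flushed
 * instead of being stranded by pure user-space polling of the CQ ring. */
static int reap_events(struct io_uring *ring, unsigned inflight)
{
        struct io_uring_cqe *cqe;
        int ret;

        while (inflight) {
                /* Fast path: CQEs that are already visible in the ring. */
                while (io_uring_peek_cqe(ring, &cqe) == 0) {
                        /* ... dispatch on cqe->user_data, check cqe->res ... */
                        io_uring_cqe_seen(ring, cqe);
                        inflight--;
                }
                if (!inflight)
                        break;
                /* Slow path: wait. This is the "eventually end up waiting"
                 * case that lets the kernel post any deferred completions. */
                ret = io_uring_wait_cqe(ring, &cqe);
                if (ret < 0)
                        return ret;
                /* ... dispatch ... */
                io_uring_cqe_seen(ring, cqe);
                inflight--;
        }
        return 0;
}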
* Re: IORING_OP_POLL_ADD slower than linux-aio IOCB_CMD_POLL 2022-06-15 11:30 ` Pavel Begunkov @ 2022-06-15 11:36 ` Avi Kivity 0 siblings, 0 replies; 23+ messages in thread From: Avi Kivity @ 2022-06-15 11:36 UTC (permalink / raw) To: Pavel Begunkov, Jens Axboe, io-uring On 15/06/2022 14.30, Pavel Begunkov wrote: > On 6/15/22 12:04, Avi Kivity wrote: >> >> On 15/06/2022 13.48, Pavel Begunkov wrote: >>> On 6/15/22 11:12, Avi Kivity wrote: >>>> >>>> On 19/04/2022 20.14, Jens Axboe wrote: >>>>> On 4/19/22 9:21 AM, Jens Axboe wrote: >>>>>> On 4/19/22 6:31 AM, Jens Axboe wrote: >>>>>>> On 4/19/22 6:21 AM, Avi Kivity wrote: >>>>>>>> On 19/04/2022 15.04, Jens Axboe wrote: >>>>>>>>> On 4/19/22 5:57 AM, Avi Kivity wrote: >>>>>>>>>> On 19/04/2022 14.38, Jens Axboe wrote: >>>>>>>>>>> On 4/19/22 5:07 AM, Avi Kivity wrote: >>>>>>>>>>>> A simple webserver shows about 5% loss compared to linux-aio. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> I expect the loss is due to an optimization that io_uring >>>>>>>>>>>> lacks - >>>>>>>>>>>> inline completion vs workqueue completion: >>>>>>>>>>> I don't think that's it, io_uring never punts to a workqueue >>>>>>>>>>> for >>>>>>>>>>> completions. >>>>>>>>>> I measured this: >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Performance counter stats for 'system wide': >>>>>>>>>> >>>>>>>>>> 1,273,756 io_uring:io_uring_task_add >>>>>>>>>> >>>>>>>>>> 12.288597765 seconds time elapsed >>>>>>>>>> >>>>>>>>>> Which exactly matches with the number of requests sent. If >>>>>>>>>> that's the >>>>>>>>>> wrong counter to measure, I'm happy to try again with the >>>>>>>>>> correct >>>>>>>>>> counter. >>>>>>>>> io_uring_task_add() isn't a workqueue, it's task_work. So that is >>>>>>>>> expected. >>>>>> Might actually be implicated. Not because it's a async worker, but >>>>>> because I think we might be losing some affinity in this case. >>>>>> Looking >>>>>> at traces, we're definitely bouncing between the poll completion >>>>>> side >>>>>> and then execution the completion. >>>>>> >>>>>> Can you try this hack? It's against -git + for-5.19/io_uring. If >>>>>> you let >>>>>> me know what base you prefer, I can do a version against that. I see >>>>>> about a 3% win with io_uring with this, and was slower before >>>>>> against >>>>>> linux-aio as you saw as well. >>>>> Another thing to try - get rid of the IPI for TWA_SIGNAL, which I >>>>> believe may be the underlying cause of it. >>>>> >>>> >>>> Resurrecting an old thread. I have a question about timeliness of >>>> completions. Let's assume a request has completed. From the patch, >>>> it appears that io_uring will only guarantee that a completion >>>> appears on the completion ring if the thread has entered kernel >>>> mode since the completion happened. So user-space polling of the >>>> completion ring can cause unbounded delays. >>> >>> Right, but polling the CQ is a bad pattern, >>> io_uring_{wait,peek}_cqe/etc. >>> will do the polling vs syscalling dance for you. >> >> >> Can you be more explicit? >> >> >> I don't think peek is enough. If there is a cqe pending, it will >> return it, but will not cause compeleted-but-unqueued events to >> generate completions. >> >> >> And wait won't enter the kernel if a cqe is pending, IIUC. > > Right, usually it won't, but works if you eventually end up > waiting, e.g. by waiting for all expected cqes. > > >>> For larger audience, I'll remind that it's an opt-in feature >>> >> >> I don't understand - what is an opt-in feature? 
> > The behaviour you worry about, where CQEs are not posted until you do a syscall, only applies if you set IORING_SETUP_COOP_TASKRUN. > Ah! I wasn't aware of this new flag. This is exactly what I want - either ask for timely completions, or optimize for throughput. Of course, it puts me in a dilemma because I want both, but that's my problem. Thanks! ^ permalink raw reply [flat|nested] 23+ messages in thread
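Spelling out the two setups being weighed here, as a sketch: flag availability assumes a 5.19+ kernel and a matching liburing, and the function names are only for illustration.

#include <liburing.h>

/* Timely completions (default): poll-driven task_work is signalled like a
 * signal delivery, so completions become visible promptly even while the
 * task stays in user space - at the cost of interrupting it (the IPI
 * discussed earlier in the thread). */
static int setup_timely(struct io_uring *ring, unsigned entries)
{
        return io_uring_queue_init(entries, ring, 0);
}

/* Throughput-oriented: completions may be deferred until the task next
 * enters the kernel. IORING_SETUP_TASKRUN_FLAG additionally makes the
 * kernel advertise that state via IORING_SQ_TASKRUN so user space can
 * tell when an io_uring_enter() is worthwhile. */
static int setup_cooperative(struct io_uring *ring, unsigned entries)
{
        return io_uring_queue_init(entries, ring,
                                   IORING_SETUP_COOP_TASKRUN |
                                   IORING_SETUP_TASKRUN_FLAG);
}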
end of thread, other threads:[~2022-06-15 13:43 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
2022-04-19 11:07 IORING_OP_POLL_ADD slower than linux-aio IOCB_CMD_POLL Avi Kivity
2022-04-19 11:38 ` Jens Axboe
2022-04-19 11:57 ` Avi Kivity
2022-04-19 12:04 ` Jens Axboe
2022-04-19 12:21 ` Avi Kivity
2022-04-19 12:31 ` Jens Axboe
2022-04-19 15:21 ` Jens Axboe
2022-04-19 15:51 ` Avi Kivity
2022-04-19 17:14 ` Jens Axboe
2022-04-19 19:41 ` Avi Kivity
2022-04-19 19:58 ` Jens Axboe
2022-04-20 11:55 ` Avi Kivity
2022-04-20 12:09 ` Jens Axboe
2022-04-21 9:05 ` Avi Kivity
2022-06-15 10:12 ` Avi Kivity
2022-06-15 10:48 ` Pavel Begunkov
2022-06-15 11:04 ` Avi Kivity
2022-06-15 11:07 ` Avi Kivity
2022-06-15 11:38 ` Pavel Begunkov
2022-06-15 12:21 ` Jens Axboe
2022-06-15 13:43 ` Avi Kivity
2022-06-15 11:30 ` Pavel Begunkov
2022-06-15 11:36 ` Avi Kivity