* Re: [PATCH] io_uring: reduce latency by reissueing the operation
From: Pavel Begunkov @ 2021-06-10  9:03 UTC
To: Olivier Langlois, Jens Axboe, io-uring, linux-kernel

On 6/9/21 11:08 PM, Olivier Langlois wrote:
> It is quite frequent that when an operation fails and returns EAGAIN, the data becomes available between that failure and the call to vfs_poll() done by io_arm_poll_handler().
>
> Detecting the situation and reissuing the operation is much faster than going ahead and pushing the operation to the io-wq.

The poll stuff is not perfect and can definitely be improved, but there are drawbacks: with this one, fairness may suffer with higher submit batching, making latency worse for all but one request.

I'll get to it and another poll-related email later, probably next week.

> Signed-off-by: Olivier Langlois <[email protected]>
> ---
>  fs/io_uring.c | 15 ++++++++++-----
>  1 file changed, 10 insertions(+), 5 deletions(-)
>
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 42380ed563c4..98cf3e323d5e 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -5138,15 +5138,16 @@ static __poll_t __io_arm_poll_handler(struct io_kiocb *req,
>  	return mask;
>  }
>
> -static bool io_arm_poll_handler(struct io_kiocb *req)
> +static bool io_arm_poll_handler(struct io_kiocb *req, __poll_t *ret)
>  {
>  	const struct io_op_def *def = &io_op_defs[req->opcode];
>  	struct io_ring_ctx *ctx = req->ctx;
>  	struct async_poll *apoll;
>  	struct io_poll_table ipt;
> -	__poll_t mask, ret;
> +	__poll_t mask;
>  	int rw;
>
> +	*ret = 0;
>  	if (!req->file || !file_can_poll(req->file))
>  		return false;
>  	if (req->flags & REQ_F_POLLED)
> @@ -5184,9 +5185,9 @@ static bool io_arm_poll_handler(struct io_kiocb *req)
>
>  	ipt.pt._qproc = io_async_queue_proc;
>
> -	ret = __io_arm_poll_handler(req, &apoll->poll, &ipt, mask,
> +	*ret = __io_arm_poll_handler(req, &apoll->poll, &ipt, mask,
>  					io_async_wake);
> -	if (ret || ipt.error) {
> +	if (*ret || ipt.error) {
>  		io_poll_remove_double(req);
>  		spin_unlock_irq(&ctx->completion_lock);
>  		return false;
> @@ -6410,7 +6411,9 @@ static void __io_queue_sqe(struct io_kiocb *req)
>  {
>  	struct io_kiocb *linked_timeout = io_prep_linked_timeout(req);
>  	int ret;
> +	__poll_t poll_ret;
>
> +issue_sqe:
>  	ret = io_issue_sqe(req, IO_URING_F_NONBLOCK|IO_URING_F_COMPLETE_DEFER);
>
>  	/*
> @@ -6430,7 +6433,9 @@ static void __io_queue_sqe(struct io_kiocb *req)
>  			io_put_req(req);
>  		}
>  	} else if (ret == -EAGAIN && !(req->flags & REQ_F_NOWAIT)) {
> -		if (!io_arm_poll_handler(req)) {
> +		if (!io_arm_poll_handler(req, &poll_ret)) {
> +			if (poll_ret)
> +				goto issue_sqe;
>  			/*
>  			 * Queued up for async execution, worker will release
>  			 * submit reference when the iocb is actually submitted.

--
Pavel Begunkov
* Re: [PATCH] io_uring: reduce latency by reissueing the operation
From: Pavel Begunkov @ 2021-06-10 15:51 UTC
To: Olivier Langlois, Jens Axboe, io-uring, linux-kernel

On 6/10/21 4:38 PM, Olivier Langlois wrote:
> On Thu, 2021-06-10 at 10:03 +0100, Pavel Begunkov wrote:
>> On 6/9/21 11:08 PM, Olivier Langlois wrote:
>>> It is quite frequent that when an operation fails and returns EAGAIN, the data becomes available between that failure and the call to vfs_poll() done by io_arm_poll_handler().
>>>
>>> Detecting the situation and reissuing the operation is much faster than going ahead and pushing the operation to the io-wq.
>>
>> The poll stuff is not perfect and can definitely be improved, but there are drawbacks: with this one, fairness may suffer with higher submit batching, making latency worse for all but one request.
>>
>> I'll get to it and another poll-related email later, probably next week.
>>
> Hi Pavel,
>
> I am looking forward to seeing the improved solution that you come up with. However, I want to bring one detail to your attention in case it went unnoticed.
>
> If io_arm_poll_handler() returns false because vfs_poll() returns a non-zero value, reissuing the sqe will be attempted at most once, because req->flags will have REQ_F_POLLED set and the second time io_arm_poll_handler() is called it will immediately return false.
>
> With this detail in mind, I honestly did not think this would make the function unfair to the other requests in a batch submission, compared to the cost of pushing the request to io-wq, which possibly includes creating an io worker thread.
>
> Does this detail change your verdict? If not, I would really be interested to know more about your fairness concern.

Right, but it still stalls other requests, and IIRC there are people already unhappy with the syscall taking too long. Consider io_req_task_queue(): it adds more overhead but delays execution to syscall exit.

In any case, it would be great to have numbers, e.g. to see whether io_req_task_queue() is good enough, how often your problem takes place, and how much it gives us.

--
Pavel Begunkov
* Re: [PATCH] io_uring: reduce latency by reissueing the operation
From: Olivier Langlois @ 2021-06-10 17:56 UTC
To: Pavel Begunkov, Jens Axboe, io-uring, linux-kernel

On Thu, 2021-06-10 at 16:51 +0100, Pavel Begunkov wrote:
> Right, but it still stalls other requests, and IIRC there are people already unhappy with the syscall taking too long. Consider io_req_task_queue(): it adds more overhead but delays execution to syscall exit.
>
> In any case, it would be great to have numbers, e.g. to see whether io_req_task_queue() is good enough, how often your problem takes place, and how much it gives us.
>
I will get you more data later, but I did run a quick test that lasted 81 seconds with a single TCP connection.

The number of times the sqe got reissued was 57.

I'll instrument the code a bit to answer the following questions:

1. What is the ratio of reissued read sqes to total read sqes?
2. What is the average execution time of __io_queue_sqe() for a read sqe when data is already available vs. when the sqe is reissued?
3. What is the average execution time when the sqe is pushed to async processing when it could have been reissued?

With that info, I think we will be in a better position to evaluate whether the patch is good.

Can you think of other numbers that would be useful for evaluating the patch's performance?
* Re: [PATCH] io_uring: reduce latency by reissueing the operation
From: Pavel Begunkov @ 2021-06-10 19:32 UTC
To: Olivier Langlois, Jens Axboe, io-uring, linux-kernel

On 6/10/21 6:56 PM, Olivier Langlois wrote:
> On Thu, 2021-06-10 at 16:51 +0100, Pavel Begunkov wrote:
>> Right, but it still stalls other requests, and IIRC there are people already unhappy with the syscall taking too long. Consider io_req_task_queue(): it adds more overhead but delays execution to syscall exit.
>>
>> In any case, it would be great to have numbers, e.g. to see whether io_req_task_queue() is good enough, how often your problem takes place, and how much it gives us.
>>
> I will get you more data later, but I did run a quick test that lasted 81 seconds with a single TCP connection.
>
> The number of times the sqe got reissued was 57.
>
> I'll instrument the code a bit to answer the following questions:
>
> 1. What is the ratio of reissued read sqes to total read sqes?
> 2. What is the average execution time of __io_queue_sqe() for a read sqe when data is already available vs. when the sqe is reissued?
> 3. What is the average execution time when the sqe is pushed to async processing when it could have been reissued?
>
> With that info, I think we will be in a better position to evaluate whether the patch is good.
>
> Can you think of other numbers that would be useful for evaluating the patch's performance?

If throughput + latency (avg + several nines) are better (or there is any other measurable improvement), that's a good enough argument for me, but I'm not sure what test case you're looking at. Single threaded? Does it saturate your CPU?

--
Pavel Begunkov
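For reference, a rough sketch of the io_req_task_queue() alternative Pavel mentions above, written against the 5.13-era fs/io_uring.c the patch targets. It is an illustration of the idea only, not a proposed or tested change, and it assumes the poll_ret convention introduced by the v1 patch:

	/* Hypothetical variant of the EAGAIN branch in __io_queue_sqe():
	 * instead of re-running io_issue_sqe() inline when vfs_poll()
	 * reports readiness, defer the re-issue to task work so it runs
	 * at syscall exit and does not stall the rest of the submit batch.
	 */
	} else if (ret == -EAGAIN && !(req->flags & REQ_F_NOWAIT)) {
		if (!io_arm_poll_handler(req, &poll_ret)) {
			if (poll_ret)
				io_req_task_queue(req);   /* re-issue at syscall exit */
			else
				io_queue_async_work(req); /* punt to io-wq */
		}
	}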
* Re: [PATCH] io_uring: reduce latency by reissueing the operation
From: Olivier Langlois @ 2021-06-11  3:55 UTC
To: Pavel Begunkov, Jens Axboe, io-uring, linux-kernel

On Thu, 2021-06-10 at 20:32 +0100, Pavel Begunkov wrote:
> On 6/10/21 6:56 PM, Olivier Langlois wrote:
>>
>> Can you think of other numbers that would be useful for evaluating the patch's performance?
>
> If throughput + latency (avg + several nines) are better (or there is any other measurable improvement), that's a good enough argument for me, but I'm not sure what test case you're looking at. Single threaded? Does it saturate your CPU?
>
I don't know what the ideal answers to your last 2 questions are ;-)

I have several possible configurations for my application.

The most complex one is a 2-thread setup, each thread having its own io_uring instance, where one of the threads manages 50-85 TCP connections over which JSON streams encapsulated in the WebSocket protocol are received.

That more complex setup also uses IORING_SETUP_ATTACH_WQ to share the sqpoll thread between the 2 instances.

In that more complex config, the sqpoll thread runs at 85-95% of its dedicated CPU.

For the patch performance testing I used the simplest config: single thread, 1 TCP connection, no sqpoll.

To process an incoming 5Mbps stream, it takes about 5% of the CPU.

Here is the testing methodology: add 2 fields to struct io_rw:
	u64 startTs;
	u8  readType;

startTs is set with ktime_get_raw_fast_ns() in io_read_prep().

readType is set to:
	0: data ready
	1: use fast poll (we ignore those)
	2: reissue
	3: async

readType is used to know what type of measurement it is at the recording point.

End time is measured at 3 recording points:
1. In __io_queue_sqe() when io_issue_sqe() returns 0
2. In __io_queue_sqe() after the io_queue_async_work() call
3. In io_wq_submit_work() after the while loop

So I took 4 measurements:

1. The time it takes to process a read request when data is already available
2. The time it takes to process a request by calling io_issue_sqe() twice after vfs_poll() indicated that data was available
3. The time it takes to execute io_queue_async_work()
4. The time it takes to complete a read request asynchronously

Before presenting the results, I want to mention that 2.25% of the total number of my read requests end up in the situation where the read() syscall returned EAGAIN but data became available by the time vfs_poll() got called.

My expectation was that reissuing an sqe could be on par with, or a bit more expensive than, placing it on io-wq for async processing, and that this would put the patch in a gray zone with pros and cons in terms of performance.

The reality is instead super nice (numbers in nsec):

ready data (baseline)
	avg	3657.94182918628
	min	580
	max	20098
	stddev	1213.15975908162

reissue completion
	avg	7882.67567567568
	min	2316
	max	28811
	stddev	1982.79172973284

insert io-wq time
	avg	8983.82276995305
	min	3324
	max	87816
	stddev	2551.60056552038

async completion time
	avg	24670.4758861127
	min	10758
	max	102612
	stddev	3483.92416873804

Conclusion:
On average, reissuing the sqe with the patched code is 1.1 usec faster, and in the worst-case scenario 59 usec faster, than placing the request on io-wq.

On average, completion by reissuing the sqe with the patched code is 16.79 usec faster, and in the worst-case scenario 73.8 usec faster, than async completion.

One important detail to mention about the async completion time: in this test, the ONLY way a request can end up being completed async is if vfs_poll() reports that the file is ready. Otherwise, the request ends up being processed with the io_uring fast poll feature.

So there does not seem to be any downside to the patch. TBH, at the initial patch submission I only used my intuition to evaluate the merit of my patch, but thanks to your healthy skepticism, Pavel, this forced me to actually measure the patch, and it appears to incontestably improve performance both for the reissued sqe and for all the other sqes found in a batch submission.

Hopefully, the results will please you as much as they have pleased me!

Greetings,
Olivier
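For clarity, here is a minimal sketch of the instrumentation described in the message above, assuming a 5.13-era fs/io_uring.c. The added fields reuse the names from the description; the recording helper and its use of trace_printk() are illustrative choices, not part of the posted patch:

	struct io_rw {
		struct kiocb	kiocb;
		/* existing fields unchanged */
		u64		startTs;	/* set in io_read_prep() via ktime_get_raw_fast_ns() */
		u8		readType;	/* 0=data ready, 1=fast poll, 2=reissue, 3=async */
	};

	/* hypothetical helper called at each of the three recording points */
	static void io_rw_record_lat(struct io_kiocb *req)
	{
		u64 delta = ktime_get_raw_fast_ns() - req->rw.startTs;

		/* emit type + latency; post-process the trace to compute
		 * avg/min/max/stddev per readType
		 */
		trace_printk("readType=%u lat=%llu\n", req->rw.readType, delta);
	}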
* Re: [PATCH] io_uring: reduce latency by reissueing the operation
From: Pavel Begunkov @ 2021-06-17 18:10 UTC
To: Olivier Langlois, Jens Axboe, io-uring, linux-kernel

On 6/11/21 4:55 AM, Olivier Langlois wrote:
> On Thu, 2021-06-10 at 20:32 +0100, Pavel Begunkov wrote:
>> On 6/10/21 6:56 PM, Olivier Langlois wrote:
>>>
>>> Can you think of other numbers that would be useful for evaluating the patch's performance?
>>
>> If throughput + latency (avg + several nines) are better (or there is any other measurable improvement), that's a good enough argument for me, but I'm not sure what test case you're looking at. Single threaded? Does it saturate your CPU?
>>
> I don't know what the ideal answers to your last 2 questions are ;-)
>
> I have several possible configurations for my application.
>
> The most complex one is a 2-thread setup, each thread having its own io_uring instance, where one of the threads manages 50-85 TCP connections over which JSON streams encapsulated in the WebSocket protocol are received.
>
> That more complex setup also uses IORING_SETUP_ATTACH_WQ to share the sqpoll thread between the 2 instances.
>
> In that more complex config, the sqpoll thread runs at 85-95% of its dedicated CPU.
>
> For the patch performance testing I used the simplest config: single thread, 1 TCP connection, no sqpoll.

Queue depth (QD) 1, right?

> To process an incoming 5Mbps stream, it takes about 5% of the CPU.

I see, under-utilised, and so your main concern here is latency.

> Here is the testing methodology: add 2 fields to struct io_rw:
> 	u64 startTs;
> 	u8  readType;
>
> startTs is set with ktime_get_raw_fast_ns() in io_read_prep().
>
> readType is set to:
> 	0: data ready
> 	1: use fast poll (we ignore those)
> 	2: reissue
> 	3: async
>
> readType is used to know what type of measurement it is at the recording point.
>
> End time is measured at 3 recording points:
> 1. In __io_queue_sqe() when io_issue_sqe() returns 0
> 2. In __io_queue_sqe() after the io_queue_async_work() call
> 3. In io_wq_submit_work() after the while loop
>
> So I took 4 measurements:
>
> 1. The time it takes to process a read request when data is already available
> 2. The time it takes to process a request by calling io_issue_sqe() twice after vfs_poll() indicated that data was available
> 3. The time it takes to execute io_queue_async_work()
> 4. The time it takes to complete a read request asynchronously
>
> Before presenting the results, I want to mention that 2.25% of the total number of my read requests end up in the situation where the read() syscall returned EAGAIN but data became available by the time vfs_poll() got called.
>
> My expectation was that reissuing an sqe could be on par with, or a bit more expensive than, placing it on io-wq for async processing, and that this would put the patch in a gray zone with pros and cons in terms of performance.
>
> The reality is instead super nice (numbers in nsec):
>
> ready data (baseline)
> 	avg	3657.94182918628
> 	min	580
> 	max	20098
> 	stddev	1213.15975908162
>
> reissue completion
> 	avg	7882.67567567568
> 	min	2316
> 	max	28811
> 	stddev	1982.79172973284
>
> insert io-wq time
> 	avg	8983.82276995305
> 	min	3324
> 	max	87816
> 	stddev	2551.60056552038
>
> async completion time
> 	avg	24670.4758861127
> 	min	10758
> 	max	102612
> 	stddev	3483.92416873804
>
> Conclusion:
> On average, reissuing the sqe with the patched code is 1.1 usec faster, and in the worst-case scenario 59 usec faster, than placing the request on io-wq.
>
> On average, completion by reissuing the sqe with the patched code is 16.79 usec faster, and in the worst-case scenario 73.8 usec faster, than async completion.

Hah, you approached it fundamentally. I'm trying to make sure I get it; correct me if I am mistaken.

1) It's the avg completion for those 2.5%, not for all requests.

2) Do they return an equivalent number of bytes? And what is the read/recv size (e.g. buffer size)?

Because in theory it can be that during the somewhat small delay for punting to io-wq, more data has arrived, and so the async completion pulls more data, which takes more time. In that case the time difference should also account for the difference in the amount of data read.

3) Curious, why read and not recv, as you're working with sockets?

4) Did you do any userspace measurements? And a question to everyone in general: do we have any good net benchmarking tool that works with io_uring? Like netperf? Hopefully spitting out a latency distribution.

Also, not particularly about the reissue stuff, but a note to myself: 59us is a lot, so I wonder where the overhead comes from. Definitely not the io-wq queueing (i.e. putting into a list).
- waking a worker?
- creating a new worker? Do we manage workers sanely, e.g. not keep them constantly recreated and dying back?
- scheduling a worker?

Olivier, for how long did you run the test? >1 min?

> One important detail to mention about the async completion time: in this test, the ONLY way a request can end up being completed async is if vfs_poll() reports that the file is ready. Otherwise, the request ends up being processed with the io_uring fast poll feature.
>
> So there does not seem to be any downside to the patch. TBH, at the initial patch submission I only used my intuition to evaluate the merit of my patch, but thanks to your healthy skepticism, Pavel, this forced me to actually measure the patch, and it appears to incontestably improve performance both for the reissued sqe and for all the other sqes found in a batch submission.

It would be interesting to see what the difference would be if this were done through io_req_task_work_add(), and what the percentage of such reqs would be for a high-QD workload. But regardless, I don't expect any harm, so it sounds good to me.

I agree with Jens' comment about the return value. I think it will go in quickly once resubmitted with the adjustment.

> Hopefully, the results will please you as much as they have pleased me!

--
Pavel Begunkov
* Re: [PATCH] io_uring: reduce latency by reissueing the operation
From: Olivier Langlois @ 2021-06-18 22:45 UTC
To: Pavel Begunkov, Jens Axboe, io-uring, linux-kernel

On Thu, 2021-06-17 at 19:10 +0100, Pavel Begunkov wrote:
>> For the patch performance testing I used the simplest config: single thread, 1 TCP connection, no sqpoll.
>
> Queue depth (QD) 1, right?

Since my io_uring usage is wrapped into a library and some parameters are fixed and adjusted to be able to handle the largest use-case scenario, QD was set to 256 for that test.

There are also a few accessory fds, such as 2 eventfds, that are polled to interrupt the framework event loop, but in practice they were silent during the whole testing period.

But no big batch submissions for sure. At most maybe 2 sqes per submission: 1 to provide back the buffer and the other to reinsert the read operation.

>> To process an incoming 5Mbps stream, it takes about 5% of the CPU.
>
> I see, under-utilised, and so your main concern here is latency.

YES! That is my main concern. That is THE reason why you now have me as an io_uring power user. My app was previously using epoll through libev. The idea of eliminating all those syscalls is what attracted me.

>> Here is the testing methodology: add 2 fields to struct io_rw:
>> 	u64 startTs;
>> 	u8  readType;
>>
>> startTs is set with ktime_get_raw_fast_ns() in io_read_prep().
>>
>> readType is set to:
>> 	0: data ready
>> 	1: use fast poll (we ignore those)
>> 	2: reissue
>> 	3: async
>>
>> readType is used to know what type of measurement it is at the recording point.
>>
>> End time is measured at 3 recording points:
>> 1. In __io_queue_sqe() when io_issue_sqe() returns 0
>> 2. In __io_queue_sqe() after the io_queue_async_work() call
>> 3. In io_wq_submit_work() after the while loop
>>
>> So I took 4 measurements:
>>
>> 1. The time it takes to process a read request when data is already available
>> 2. The time it takes to process a request by calling io_issue_sqe() twice after vfs_poll() indicated that data was available
>> 3. The time it takes to execute io_queue_async_work()
>> 4. The time it takes to complete a read request asynchronously
>>
>> Before presenting the results, I want to mention that 2.25% of the total number of my read requests end up in the situation where the read() syscall returned EAGAIN but data became available by the time vfs_poll() got called.
>>
>> My expectation was that reissuing an sqe could be on par with, or a bit more expensive than, placing it on io-wq for async processing, and that this would put the patch in a gray zone with pros and cons in terms of performance.
>>
>> The reality is instead super nice (numbers in nsec):
>>
>> ready data (baseline)
>> 	avg	3657.94182918628
>> 	min	580
>> 	max	20098
>> 	stddev	1213.15975908162
>>
>> reissue completion
>> 	avg	7882.67567567568
>> 	min	2316
>> 	max	28811
>> 	stddev	1982.79172973284
>>
>> insert io-wq time
>> 	avg	8983.82276995305
>> 	min	3324
>> 	max	87816
>> 	stddev	2551.60056552038
>>
>> async completion time
>> 	avg	24670.4758861127
>> 	min	10758
>> 	max	102612
>> 	stddev	3483.92416873804
>>
>> Conclusion:
>> On average, reissuing the sqe with the patched code is 1.1 usec faster, and in the worst-case scenario 59 usec faster, than placing the request on io-wq.
>>
>> On average, completion by reissuing the sqe with the patched code is 16.79 usec faster, and in the worst-case scenario 73.8 usec faster, than async completion.
>
> Hah, you approached it fundamentally. I'm trying to make sure I get it; correct me if I am mistaken.
>
> 1) It's the avg completion for those 2.5%, not for all requests.

Correct. Otherwise the executed path is identical to what it was prior to the patch.

> 2) Do they return an equivalent number of bytes? And what is the read/recv size (e.g. buffer size)?

Nothing escapes your eagle-eyed attention, Pavel...

I set my read size to an arbitrarily big size (20KB) just to be sure that, most of the time, I never end up with partial reads and more syscalls than I could get away with given a big enough buffer size.

TBH, I didn't pay that much attention to this detail. Off the top of my head, I would say that the average size is all over the place. It can go from 150 bytes up to 15KB, but I would think that the average must be between 1-2 MTUs (around 2500 bytes).

That being said, the average read size must be spread equally between the packets going through the regular path and those taking the new shortcut, so I believe the conclusion should still hold despite not having considered this aspect in the test.

> Because in theory it can be that during the somewhat small delay for punting to io-wq, more data has arrived, and so the async completion pulls more data, which takes more time. In that case the time difference should also account for the difference in the amount of data read.

Good point. It did not even occur to me to consider this aspect, but how many more packets would the network stack have had time to receive in an extra 16 usec period? (I am not on one of those crazy fiber-optic 200Gbps Mellanox cards....) 1, 2, 3, 4? We aren't talking multiple extra MBs to copy here...

> 3) Curious, why read and not recv, as you're working with sockets?

I learned network programming with the classic Stevens book. As far as I remember from the book, the only benefit of recv() over read() is if you need to specify one of the funky flags that recv() allows you to pass; read() doesn't give access to that functionality.

If there is a performance benefit to using recv() over read() for TCP fds, that is something I am not aware of, and if you confirm that it is the case, it would be very easy for me to change my read calls to recv() ones...

Now that you ask the question, maybe read() is implemented with recv(), but AFAIK the native network functions are sendmsg and recvmsg, so neither read() nor recv() would have an edge over the other in that department...

While we are talking about read() vs recv(), I am curious about something too. While working on my other patch (store back the buffer in case of failure), I noticed that the buffer address and bid aren't stored in the same fields: io_put_recv_kbuf() vs io_put_rw_kbuf().

I didn't figure out why those values aren't stored in the same io_kiocb fields for recv operations... Why is that?

> 4) Did you do any userspace measurements? And a question to everyone in general: do we have any good net benchmarking tool that works with io_uring? Like netperf? Hopefully spitting out a latency distribution.

No, I haven't.

> Also, not particularly about the reissue stuff, but a note to myself: 59us is a lot, so I wonder where the overhead comes from. Definitely not the io-wq queueing (i.e. putting into a list).
> - waking a worker?
> - creating a new worker? Do we manage workers sanely, e.g. not keep them constantly recreated and dying back?
> - scheduling a worker?

Creating a new worker is for sure not free, but I would remove that cause from the suspect list since in my scenario it was a one-shot event.

The first measurement was not even significantly higher than all the other measurements.

> Olivier, for how long did you run the test? >1 min?

Much more than 1 minute. I would say somewhere between 20-25 minutes.

I wanted a big enough sample size for those 2.5% special-path events so that the conclusion could be statistically significant.
* Re: [PATCH] io_uring: reduce latency by reissueing the operation
From: Pavel Begunkov @ 2021-06-20 20:55 UTC
To: Olivier Langlois, Jens Axboe, io-uring, linux-kernel

On 6/18/21 11:45 PM, Olivier Langlois wrote:
> On Thu, 2021-06-17 at 19:10 +0100, Pavel Begunkov wrote:
>>> For the patch performance testing I used the simplest config: single thread, 1 TCP connection, no sqpoll.
>>
>> Queue depth (QD) 1, right?
>
> Since my io_uring usage is wrapped into a library and some parameters are fixed and adjusted to be able to handle the largest use-case scenario, QD was set to 256 for that test.
>
> There are also a few accessory fds, such as 2 eventfds, that are polled to interrupt the framework event loop, but in practice they were silent during the whole testing period.
>
> But no big batch submissions for sure. At most maybe 2 sqes per submission: 1 to provide back the buffer and the other to reinsert the read operation.

I see

[...]

>> 2) Do they return an equivalent number of bytes? And what is the read/recv size (e.g. buffer size)?
>
> Nothing escapes your eagle-eyed attention, Pavel...

It may all sound over-scrutinising, but I'm just used to analysing performance magic and seeing how edges may be polished. Not a requirement for the patch.

> I set my read size to an arbitrarily big size (20KB) just to be sure that, most of the time, I never end up with partial reads and more syscalls than I could get away with given a big enough buffer size.
>
> TBH, I didn't pay that much attention to this detail. Off the top of my head, I would say that the average size is all over the place. It can go from 150 bytes up to 15KB, but I would think that the average must be between 1-2 MTUs (around 2500 bytes).
>
> That being said, the average read size must be spread equally between the packets going through the regular path and those taking the new shortcut, so I believe the conclusion should still hold despite not having considered this aspect in the test.
>
>> Because in theory it can be that during the somewhat small delay for punting to io-wq, more data has arrived, and so the async completion pulls more data, which takes more time. In that case the time difference should also account for the difference in the amount of data read.
>
> Good point. It did not even occur to me to consider this aspect, but how many more packets would the network stack have had time to receive in an extra 16 usec period? (I am not on one of those crazy fiber-optic 200Gbps Mellanox cards....) 1, 2, 3, 4? We aren't talking multiple extra MBs to copy here...

Say 1-2. I need to check, but I think that while processing them and copying to userspace another one might arrive, and so you have a full 20KB instead of the 4KB that would have been copied inline. Plus io-wq overhead, 16us wouldn't be unreasonable then. But that's "what if" thinking.

>> 3) Curious, why read and not recv, as you're working with sockets?
>
> I learned network programming with the classic Stevens book. As far as I remember from the book, the only benefit of recv() over read() is if you need to specify one of the funky flags that recv() allows you to pass; read() doesn't give access to that functionality.
>
> If there is a performance benefit to using recv() over read() for TCP fds, that is something I am not aware of, and if you confirm that it is the case, it would be very easy for me to change my read calls to recv() ones...
>
> Now that you ask the question, maybe read() is implemented with recv(),

It all sinks into the common code rather early.

> but AFAIK the native network functions are sendmsg and recvmsg, so neither read() nor recv() would have an edge over the other in that department...

For the io_uring part, e.g. recv is slimmer than recvmsg and doesn't need to copy extra. Read can be more expensive on the io_uring side because it may copy/alloc extra stuff, plus there is additional logic in the io_read() part for generality. But I don't expect it to make much of a difference; never tested it, though.

> While we are talking about read() vs recv(), I am curious about something too. While working on my other patch (store back the buffer in case of failure), I noticed that the buffer address and bid aren't stored in the same fields: io_put_recv_kbuf() vs io_put_rw_kbuf().
>
> I didn't figure out why those values aren't stored in the same io_kiocb fields for recv operations... Why is that?

Just because of how it was done. It could use some cleaning up, e.g. I don't like rewriting req->rw.addr with a selected buffer. In general, the first 64B (one cacheline) of io_kiocb (i.e. the request) is taken by per-opcode data, and we try to fit everything related to a particular opcode there and not spill into the generic parts of the struct. Another concern, in general, is keeping everything tight enough and shuffled right so it doesn't read extra cachelines in the hot path.

>> 4) Did you do any userspace measurements? And a question to everyone in general: do we have any good net benchmarking tool that works with io_uring? Like netperf? Hopefully spitting out a latency distribution.
>
> No, I haven't.

Given what was said, I'd expect the mean to stay about the same and the elevated ~99th percentile to be reduced by the patch, which is also great. Latency is always the hardest part.

>> Also, not particularly about the reissue stuff, but a note to myself: 59us is a lot, so I wonder where the overhead comes from. Definitely not the io-wq queueing (i.e. putting into a list).
>> - waking a worker?
>> - creating a new worker? Do we manage workers sanely, e.g. not keep them constantly recreated and dying back?
>> - scheduling a worker?
>
> Creating a new worker is for sure not free, but I would remove that cause from the suspect list since in my scenario it was a one-shot event.

Not sure what you mean, but speculating: io-wq may not have an optimal policy for recycling worker threads, leading to recreating/removing more of them than needed. Depends on bugs, use cases and so on.

> The first measurement was not even significantly higher than all the other measurements.

You get a huge max for the io-wq case. Obviously nothing can be concluded from the max alone; we'd need a latency distribution and probably longer runs, but I'm still curious where it's coming from. Just keeping an eye on it in general.

>> Olivier, for how long did you run the test? >1 min?
>
> Much more than 1 minute. I would say somewhere between 20-25 minutes.
>
> I wanted a big enough sample size for those 2.5% special-path events so that the conclusion could be statistically significant.

Great. If io-wq worker creation doesn't work right, then it's because of policies and so on.

--
Pavel Begunkov
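For anyone wanting to reproduce the read() vs recv() comparison from userspace, the switch is essentially a one-liner with liburing. This is only a sketch; buffer selection and error handling are left out, and the zero flags value is an assumption:

	#include <liburing.h>
	#include <stddef.h>

	/* queue one receive on a connected TCP socket */
	static void queue_rx(struct io_uring *ring, int sockfd, void *buf, size_t len)
	{
		struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

		/* what the test application was doing: */
		/* io_uring_prep_read(sqe, sockfd, buf, len, 0); */

		/* the slimmer variant discussed above: */
		io_uring_prep_recv(sqe, sockfd, buf, len, 0);

		io_uring_submit(ring);
	}

Both variants complete with the number of bytes received in cqe->res; recv additionally accepts MSG_* flags and skips the generic read import logic on the kernel side.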
* Re: [PATCH] io_uring: reduce latency by reissueing the operation
From: Olivier Langlois @ 2021-06-20 21:31 UTC
To: Pavel Begunkov, Jens Axboe, io-uring, linux-kernel

On Sun, 2021-06-20 at 21:55 +0100, Pavel Begunkov wrote:
> On 6/18/21 11:45 PM, Olivier Langlois wrote:
>
> For the io_uring part, e.g. recv is slimmer than recvmsg and doesn't need to copy extra. Read can be more expensive on the io_uring side because it may copy/alloc extra stuff, plus there is additional logic in the io_read() part for generality. But I don't expect it to make much of a difference; never tested it, though.

That is super interesting. The way I see it after getting your explanations is that in the worst-case scenario there won't be any difference, but in the best case I could see a small speed gain.

I made the switch yesterday evening. One of the metrics I monitor the most is my system's reaction time to incoming packets. I will let you know whether switching to recv() is beneficial in that regard.

>>> Also, not particularly about the reissue stuff, but a note to myself: 59us is a lot, so I wonder where the overhead comes from. Definitely not the io-wq queueing (i.e. putting into a list).
>>> - waking a worker?
>>> - creating a new worker? Do we manage workers sanely, e.g. not keep them constantly recreated and dying back?
>>> - scheduling a worker?
>>
>> Creating a new worker is for sure not free, but I would remove that cause from the suspect list since in my scenario it was a one-shot event.
>
> Not sure what you mean, but speculating: io-wq may not have an optimal policy for recycling worker threads, leading to recreating/removing more of them than needed. Depends on bugs, use cases and so on.

Since I absolutely don't use the async workers feature, I was obsessed by the fact that I was seeing an io worker created. This is the root of why I ended up writing the patch.

My understanding of how io worker lifetimes are managed is that one remains present once created.

In my scenario, once that single persistent io worker thread is created, no others are ever created, so this is a one-shot cost.

I was prepared to eliminate the first measurement to be as fair as possible and not pollute the async performance result with a one-time-only thread creation cost, but to my surprise... the thread creation cost was not visible in the first measurement...

From that, and maybe this is an erroneous shortcut, I do not feel that thread creation is the bottleneck.

>> The first measurement was not even significantly higher than all the other measurements.
>
> You get a huge max for the io-wq case. Obviously nothing can be concluded from the max alone; we'd need a latency distribution and probably longer runs, but I'm still curious where it's coming from. Just keeping an eye on it in general.

Maybe it is scheduling...

I'll keep this mystery in the back of my mind in case I end up with a way to find out where the time is spent...
* Re: [PATCH] io_uring: reduce latency by reissueing the operation
From: Pavel Begunkov @ 2021-06-20 22:04 UTC
To: Olivier Langlois, Jens Axboe, io-uring, linux-kernel

On 6/20/21 10:31 PM, Olivier Langlois wrote:
> On Sun, 2021-06-20 at 21:55 +0100, Pavel Begunkov wrote:
[...]
>>> Creating a new worker is for sure not free, but I would remove that cause from the suspect list since in my scenario it was a one-shot event.
>>
>> Not sure what you mean, but speculating: io-wq may not have an optimal policy for recycling worker threads, leading to recreating/removing more of them than needed. Depends on bugs, use cases and so on.
>
> Since I absolutely don't use the async workers feature, I was obsessed by the fact that I was seeing an io worker created. This is the root of why I ended up writing the patch.
>
> My understanding of how io worker lifetimes are managed is that one remains present once created.

There was one(?) such worker, and others should be able to eventually die off if there is nothing to do. Something may have changed with recent changes; I should update myself on that.

> In my scenario, once that single persistent io worker thread is created, no others are ever created, so this is a one-shot cost.

Good to know, thanks for confirming.

> I was prepared to eliminate the first measurement to be as fair as possible and not pollute the async performance result with a one-time-only thread creation cost, but to my surprise... the thread creation cost was not visible in the first measurement...
>
> From that, and maybe this is an erroneous shortcut, I do not feel that thread creation is the bottleneck.
>
>>> The first measurement was not even significantly higher than all the other measurements.
>>
>> You get a huge max for the io-wq case. Obviously nothing can be concluded from the max alone; we'd need a latency distribution and probably longer runs, but I'm still curious where it's coming from. Just keeping an eye on it in general.
>
> Maybe it is scheduling...
>
> I'll keep this mystery in the back of my mind in case I end up with a way to find out where the time is spent...

--
Pavel Begunkov
* Re: [PATCH] io_uring: reduce latency by reissueing the operation
From: Jens Axboe @ 2021-06-16 12:48 UTC
To: Olivier Langlois, Pavel Begunkov, io-uring, linux-kernel

On 6/9/21 4:08 PM, Olivier Langlois wrote:
> It is quite frequent that when an operation fails and returns EAGAIN, the data becomes available between that failure and the call to vfs_poll() done by io_arm_poll_handler().
>
> Detecting the situation and reissuing the operation is much faster than going ahead and pushing the operation to the io-wq.

I think this is obviously the right thing to do, but I'm not too crazy about the 'ret' pointer passed in. We could either add a proper return type instead of the bool and use that, or put the poll-or-queue-async into a helper that only needs a bool return, and use that return value to decide whether to re-issue or not.

Care to send an updated variant?

--
Jens Axboe
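One possible shape for the helper Jens suggests, sketched here only to illustrate the idea; the name and exact return convention are assumptions, not what was eventually merged:

	/* Fold poll-arming and io-wq queueing into one helper so the caller
	 * only gets back a bool meaning "re-issue this request now".
	 */
	static bool io_queue_async(struct io_kiocb *req)
	{
		__poll_t poll_ret;

		if (io_arm_poll_handler(req, &poll_ret))
			return false;		/* poll armed, retried on wakeup */
		if (poll_ret)
			return true;		/* file already ready, re-issue inline */
		io_queue_async_work(req);	/* otherwise punt to io-wq */
		return false;
	}

	/* caller, in the EAGAIN branch of __io_queue_sqe():
	 *
	 *	} else if (ret == -EAGAIN && !(req->flags & REQ_F_NOWAIT)) {
	 *		if (io_queue_async(req))
	 *			goto issue_sqe;
	 *	}
	 */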
* Re: [PATCH] io_uring: reduce latency by reissueing the operation
From: Olivier Langlois @ 2021-06-18 21:38 UTC
To: Jens Axboe, Pavel Begunkov, io-uring, linux-kernel

On Wed, 2021-06-16 at 06:48 -0600, Jens Axboe wrote:
> On 6/9/21 4:08 PM, Olivier Langlois wrote:
>> It is quite frequent that when an operation fails and returns EAGAIN, the data becomes available between that failure and the call to vfs_poll() done by io_arm_poll_handler().
>>
>> Detecting the situation and reissuing the operation is much faster than going ahead and pushing the operation to the io-wq.
>
> I think this is obviously the right thing to do, but I'm not too crazy about the 'ret' pointer passed in. We could either add a proper return type instead of the bool and use that, or put the poll-or-queue-async into a helper that only needs a bool return, and use that return value to decide whether to re-issue or not.
>
> Care to send an updated variant?
>
No problem! It is going to be a pleasure to send an updated version with the requested change!

I will take that opportunity to try out my new patch-sending setup ;-)