* Re: [PATCH] io_uring: reduce latency by reissueing the operation
From: Pavel Begunkov @ 2021-06-10  9:03 UTC
To: Olivier Langlois, Jens Axboe, io-uring, linux-kernel

On 6/9/21 11:08 PM, Olivier Langlois wrote:
> It is quite frequent that when an operation fails and returns EAGAIN, the data becomes available between that failure and the call to vfs_poll() done by io_arm_poll_handler().
>
> Detecting the situation and reissuing the operation is much faster than going ahead and pushing the operation to the io-wq.

The poll stuff is not perfect and can definitely be improved, but there are drawbacks: with this one, fairness may suffer with higher submit batching, making latency worse for all but one request.

I'll get to it and another poll-related email later, probably next week.

> Signed-off-by: Olivier Langlois <[email protected]>
> ---
>  fs/io_uring.c | 15 ++++++++++-----
>  1 file changed, 10 insertions(+), 5 deletions(-)
>
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 42380ed563c4..98cf3e323d5e 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -5138,15 +5138,16 @@ static __poll_t __io_arm_poll_handler(struct io_kiocb *req,
>  	return mask;
>  }
>
> -static bool io_arm_poll_handler(struct io_kiocb *req)
> +static bool io_arm_poll_handler(struct io_kiocb *req, __poll_t *ret)
>  {
>  	const struct io_op_def *def = &io_op_defs[req->opcode];
>  	struct io_ring_ctx *ctx = req->ctx;
>  	struct async_poll *apoll;
>  	struct io_poll_table ipt;
> -	__poll_t mask, ret;
> +	__poll_t mask;
>  	int rw;
>
> +	*ret = 0;
>  	if (!req->file || !file_can_poll(req->file))
>  		return false;
>  	if (req->flags & REQ_F_POLLED)
> @@ -5184,9 +5185,9 @@ static bool io_arm_poll_handler(struct io_kiocb *req)
>
>  	ipt.pt._qproc = io_async_queue_proc;
>
> -	ret = __io_arm_poll_handler(req, &apoll->poll, &ipt, mask,
> +	*ret = __io_arm_poll_handler(req, &apoll->poll, &ipt, mask,
>  					io_async_wake);
> -	if (ret || ipt.error) {
> +	if (*ret || ipt.error) {
>  		io_poll_remove_double(req);
>  		spin_unlock_irq(&ctx->completion_lock);
>  		return false;
> @@ -6410,7 +6411,9 @@ static void __io_queue_sqe(struct io_kiocb *req)
>  {
>  	struct io_kiocb *linked_timeout = io_prep_linked_timeout(req);
>  	int ret;
> +	__poll_t poll_ret;
>
> +issue_sqe:
>  	ret = io_issue_sqe(req, IO_URING_F_NONBLOCK|IO_URING_F_COMPLETE_DEFER);
>
>  	/*
> @@ -6430,7 +6433,9 @@ static void __io_queue_sqe(struct io_kiocb *req)
>  			io_put_req(req);
>  		}
>  	} else if (ret == -EAGAIN && !(req->flags & REQ_F_NOWAIT)) {
> -		if (!io_arm_poll_handler(req)) {
> +		if (!io_arm_poll_handler(req, &poll_ret)) {
> +			if (poll_ret)
> +				goto issue_sqe;
>  			/*
>  			 * Queued up for async execution, worker will release
>  			 * submit reference when the iocb is actually submitted.

--
Pavel Begunkov
* Re: [PATCH] io_uring: reduce latency by reissueing the operation
From: Pavel Begunkov @ 2021-06-10 15:51 UTC
To: Olivier Langlois, Jens Axboe, io-uring, linux-kernel

On 6/10/21 4:38 PM, Olivier Langlois wrote:
> On Thu, 2021-06-10 at 10:03 +0100, Pavel Begunkov wrote:
>> On 6/9/21 11:08 PM, Olivier Langlois wrote:
>>> It is quite frequent that when an operation fails and returns EAGAIN, the data becomes available between that failure and the call to vfs_poll() done by io_arm_poll_handler().
>>>
>>> Detecting the situation and reissuing the operation is much faster than going ahead and pushing the operation to the io-wq.
>>
>> The poll stuff is not perfect and can definitely be improved, but there are drawbacks: with this one, fairness may suffer with higher submit batching, making latency worse for all but one request.
>>
>> I'll get to it and another poll-related email later, probably next week.
>>
> Hi Pavel,
>
> I am looking forward to seeing the improved solution that you come up with. However, I want to bring one detail to your attention in case it went unnoticed.
>
> If io_arm_poll_handler() returns false because vfs_poll() returns a non-zero value, reissuing the sqe will be attempted at most once, because req->flags will have REQ_F_POLLED set and the second time io_arm_poll_handler() is called it will immediately return false.
>
> With this detail in mind, I honestly did not think this would make the function unfair to the other requests in a batch submission, compared to the cost of pushing the request to io-wq, which possibly includes creating an io worker thread.
>
> Does this detail change your verdict? If not, I would really be interested to know more about your fairness concern.

Right, but it still stalls other requests, and IIRC there are people already unhappy with the syscall taking too long. Consider io_req_task_queue(): it adds more overhead but delays execution to syscall exit.

In any case, it would be great to have numbers, e.g. to see whether io_req_task_queue() is good enough, how often your problem takes place, and how much it gives us.

--
Pavel Begunkov
* Re: [PATCH] io_uring: reduce latency by reissueing the operation
From: Olivier Langlois @ 2021-06-10 17:56 UTC
To: Pavel Begunkov, Jens Axboe, io-uring, linux-kernel

On Thu, 2021-06-10 at 16:51 +0100, Pavel Begunkov wrote:
> Right, but it still stalls other requests, and IIRC there are people already unhappy with the syscall taking too long. Consider io_req_task_queue(): it adds more overhead but delays execution to syscall exit.
>
> In any case, it would be great to have numbers, e.g. to see whether io_req_task_queue() is good enough, how often your problem takes place, and how much it gives us.
>
I will get you more data later, but I did run a quick test that lasted 81 seconds with a single TCP connection.

The number of times the sqe got reissued was 57.

I'll instrument the code a bit to answer the following questions:

1. What is the ratio of reissued read sqes to total read sqes?
2. What is the average execution time of __io_queue_sqe() for a read sqe when data is already available vs. when the sqe is reissued?
3. What is the average execution time when the sqe is pushed to async processing when it could have been reissued?

With that info, I think we will be in a better position to evaluate whether the patch is good.

Can you think of other numbers that would be useful for evaluating the patch's performance?
* Re: [PATCH] io_uring: reduce latency by reissueing the operation
From: Pavel Begunkov @ 2021-06-10 19:32 UTC
To: Olivier Langlois, Jens Axboe, io-uring, linux-kernel

On 6/10/21 6:56 PM, Olivier Langlois wrote:
> On Thu, 2021-06-10 at 16:51 +0100, Pavel Begunkov wrote:
>> Right, but it still stalls other requests, and IIRC there are people already unhappy with the syscall taking too long. Consider io_req_task_queue(): it adds more overhead but delays execution to syscall exit.
>>
>> In any case, it would be great to have numbers, e.g. to see whether io_req_task_queue() is good enough, how often your problem takes place, and how much it gives us.
>>
> I will get you more data later, but I did run a quick test that lasted 81 seconds with a single TCP connection.
>
> The number of times the sqe got reissued was 57.
>
> I'll instrument the code a bit to answer the following questions:
>
> 1. What is the ratio of reissued read sqes to total read sqes?
> 2. What is the average execution time of __io_queue_sqe() for a read sqe when data is already available vs. when the sqe is reissued?
> 3. What is the average execution time when the sqe is pushed to async processing when it could have been reissued?
>
> With that info, I think we will be in a better position to evaluate whether the patch is good.
>
> Can you think of other numbers that would be useful for evaluating the patch's performance?

If throughput + latency (avg + several nines) are better (or there is any other measurable improvement), that's a good enough argument for me, but I'm not sure what test case you're looking at. Single threaded? Does it saturate your CPU?

--
Pavel Begunkov
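For reference, a rough sketch of the io_req_task_queue() alternative Pavel mentions above, written against the 5.13-era fs/io_uring.c the patch targets. It is an illustration of the idea only, not a proposed or tested change, and it assumes the poll_ret convention introduced by the v1 patch:

	/* Hypothetical variant of the EAGAIN branch in __io_queue_sqe():
	 * instead of re-running io_issue_sqe() inline when vfs_poll()
	 * reports readiness, defer the re-issue to task work so it runs
	 * at syscall exit and does not stall the rest of the submit batch.
	 */
	} else if (ret == -EAGAIN && !(req->flags & REQ_F_NOWAIT)) {
		if (!io_arm_poll_handler(req, &poll_ret)) {
			if (poll_ret)
				io_req_task_queue(req);   /* re-issue at syscall exit */
			else
				io_queue_async_work(req); /* punt to io-wq */
		}
	}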
* Re: [PATCH] io_uring: reduce latency by reissueing the operation
From: Olivier Langlois @ 2021-06-11  3:55 UTC
To: Pavel Begunkov, Jens Axboe, io-uring, linux-kernel

On Thu, 2021-06-10 at 20:32 +0100, Pavel Begunkov wrote:
> On 6/10/21 6:56 PM, Olivier Langlois wrote:
>>
>> Can you think of other numbers that would be useful for evaluating the patch's performance?
>
> If throughput + latency (avg + several nines) are better (or there is any other measurable improvement), that's a good enough argument for me, but I'm not sure what test case you're looking at. Single threaded? Does it saturate your CPU?
>
I don't know what the ideal answers to your last 2 questions are ;-)

I have several possible configurations for my application.

The most complex one is a 2-thread setup, each thread having its own io_uring instance, where one of the threads manages 50-85 TCP connections over which JSON streams encapsulated in the WebSocket protocol are received.

That more complex setup also uses IORING_SETUP_ATTACH_WQ to share the sqpoll thread between the 2 instances.

In that more complex config, the sqpoll thread runs at 85-95% of its dedicated CPU.

For the patch performance testing I used the simplest config: single thread, 1 TCP connection, no sqpoll.

To process an incoming 5Mbps stream, it takes about 5% of the CPU.

Here is the testing methodology: add 2 fields to struct io_rw:
	u64 startTs;
	u8  readType;

startTs is set with ktime_get_raw_fast_ns() in io_read_prep().

readType is set to:
	0: data ready
	1: use fast poll (we ignore those)
	2: reissue
	3: async

readType is used to know what type of measurement it is at the recording point.

End time is measured at 3 recording points:
1. In __io_queue_sqe() when io_issue_sqe() returns 0
2. In __io_queue_sqe() after the io_queue_async_work() call
3. In io_wq_submit_work() after the while loop

So I took 4 measurements:

1. The time it takes to process a read request when data is already available
2. The time it takes to process a request by calling io_issue_sqe() twice after vfs_poll() indicated that data was available
3. The time it takes to execute io_queue_async_work()
4. The time it takes to complete a read request asynchronously

Before presenting the results, I want to mention that 2.25% of the total number of my read requests end up in the situation where the read() syscall returned EAGAIN but data became available by the time vfs_poll() got called.

My expectation was that reissuing an sqe could be on par with, or a bit more expensive than, placing it on io-wq for async processing, and that this would put the patch in a gray zone with pros and cons in terms of performance.

The reality is instead super nice (numbers in nsec):

ready data (baseline)
	avg	3657.94182918628
	min	580
	max	20098
	stddev	1213.15975908162

reissue completion
	avg	7882.67567567568
	min	2316
	max	28811
	stddev	1982.79172973284

insert io-wq time
	avg	8983.82276995305
	min	3324
	max	87816
	stddev	2551.60056552038

async completion time
	avg	24670.4758861127
	min	10758
	max	102612
	stddev	3483.92416873804

Conclusion:
On average, reissuing the sqe with the patched code is 1.1 usec faster, and in the worst-case scenario 59 usec faster, than placing the request on io-wq.

On average, completion by reissuing the sqe with the patched code is 16.79 usec faster, and in the worst-case scenario 73.8 usec faster, than async completion.

One important detail to mention about the async completion time: in this test, the ONLY way a request can end up being completed async is if vfs_poll() reports that the file is ready. Otherwise, the request ends up being processed with the io_uring fast poll feature.

So there does not seem to be any downside to the patch. TBH, at the initial patch submission I only used my intuition to evaluate the merit of my patch, but thanks to your healthy skepticism, Pavel, this forced me to actually measure the patch, and it appears to incontestably improve performance both for the reissued sqe and for all the other sqes found in a batch submission.

Hopefully, the results will please you as much as they have pleased me!

Greetings,
Olivier
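For clarity, here is a minimal sketch of the instrumentation described in the message above, assuming a 5.13-era fs/io_uring.c. The added fields reuse the names from the description; the recording helper and its use of trace_printk() are illustrative choices, not part of the posted patch:

	struct io_rw {
		struct kiocb	kiocb;
		/* existing fields unchanged */
		u64		startTs;	/* set in io_read_prep() via ktime_get_raw_fast_ns() */
		u8		readType;	/* 0=data ready, 1=fast poll, 2=reissue, 3=async */
	};

	/* hypothetical helper called at each of the three recording points */
	static void io_rw_record_lat(struct io_kiocb *req)
	{
		u64 delta = ktime_get_raw_fast_ns() - req->rw.startTs;

		/* emit type + latency; post-process the trace to compute
		 * avg/min/max/stddev per readType
		 */
		trace_printk("readType=%u lat=%llu\n", req->rw.readType, delta);
	}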
* Re: [PATCH] io_uring: reduce latency by reissueing the operation
From: Pavel Begunkov @ 2021-06-17 18:10 UTC
To: Olivier Langlois, Jens Axboe, io-uring, linux-kernel

On 6/11/21 4:55 AM, Olivier Langlois wrote:
> On Thu, 2021-06-10 at 20:32 +0100, Pavel Begunkov wrote:
>> On 6/10/21 6:56 PM, Olivier Langlois wrote:
>>>
>>> Can you think of other numbers that would be useful for evaluating the patch's performance?
>>
>> If throughput + latency (avg + several nines) are better (or there is any other measurable improvement), that's a good enough argument for me, but I'm not sure what test case you're looking at. Single threaded? Does it saturate your CPU?
>>
> I don't know what the ideal answers to your last 2 questions are ;-)
>
> I have several possible configurations for my application.
>
> The most complex one is a 2-thread setup, each thread having its own io_uring instance, where one of the threads manages 50-85 TCP connections over which JSON streams encapsulated in the WebSocket protocol are received.
>
> That more complex setup also uses IORING_SETUP_ATTACH_WQ to share the sqpoll thread between the 2 instances.
>
> In that more complex config, the sqpoll thread runs at 85-95% of its dedicated CPU.
>
> For the patch performance testing I used the simplest config: single thread, 1 TCP connection, no sqpoll.

Queue depth (QD) 1, right?

> To process an incoming 5Mbps stream, it takes about 5% of the CPU.

I see, under-utilised, and so your main concern here is latency.

> Here is the testing methodology: add 2 fields to struct io_rw:
> 	u64 startTs;
> 	u8  readType;
>
> startTs is set with ktime_get_raw_fast_ns() in io_read_prep().
>
> readType is set to:
> 	0: data ready
> 	1: use fast poll (we ignore those)
> 	2: reissue
> 	3: async
>
> readType is used to know what type of measurement it is at the recording point.
>
> End time is measured at 3 recording points:
> 1. In __io_queue_sqe() when io_issue_sqe() returns 0
> 2. In __io_queue_sqe() after the io_queue_async_work() call
> 3. In io_wq_submit_work() after the while loop
>
> So I took 4 measurements:
>
> 1. The time it takes to process a read request when data is already available
> 2. The time it takes to process a request by calling io_issue_sqe() twice after vfs_poll() indicated that data was available
> 3. The time it takes to execute io_queue_async_work()
> 4. The time it takes to complete a read request asynchronously
>
> Before presenting the results, I want to mention that 2.25% of the total number of my read requests end up in the situation where the read() syscall returned EAGAIN but data became available by the time vfs_poll() got called.
>
> My expectation was that reissuing an sqe could be on par with, or a bit more expensive than, placing it on io-wq for async processing, and that this would put the patch in a gray zone with pros and cons in terms of performance.
>
> The reality is instead super nice (numbers in nsec):
>
> ready data (baseline)
> 	avg	3657.94182918628
> 	min	580
> 	max	20098
> 	stddev	1213.15975908162
>
> reissue completion
> 	avg	7882.67567567568
> 	min	2316
> 	max	28811
> 	stddev	1982.79172973284
>
> insert io-wq time
> 	avg	8983.82276995305
> 	min	3324
> 	max	87816
> 	stddev	2551.60056552038
>
> async completion time
> 	avg	24670.4758861127
> 	min	10758
> 	max	102612
> 	stddev	3483.92416873804
>
> Conclusion:
> On average, reissuing the sqe with the patched code is 1.1 usec faster, and in the worst-case scenario 59 usec faster, than placing the request on io-wq.
>
> On average, completion by reissuing the sqe with the patched code is 16.79 usec faster, and in the worst-case scenario 73.8 usec faster, than async completion.

Hah, you approached it fundamentally. I'm trying to make sure I get it; correct me if I am mistaken.

1) It's the avg completion for those 2.5%, not for all requests.

2) Do they return an equivalent number of bytes? And what is the read/recv size (e.g. buffer size)?

Because in theory it can be that during the somewhat small delay for punting to io-wq, more data has arrived, and so the async completion pulls more data, which takes more time. In that case the time difference should also account for the difference in the amount of data read.

3) Curious, why read and not recv, as you're working with sockets?

4) Did you do any userspace measurements? And a question to everyone in general: do we have any good net benchmarking tool that works with io_uring? Like netperf? Hopefully spitting out a latency distribution.

Also, not particularly about the reissue stuff, but a note to myself: 59us is a lot, so I wonder where the overhead comes from. Definitely not the io-wq queueing (i.e. putting into a list).
- waking a worker?
- creating a new worker? Do we manage workers sanely, e.g. not keep them constantly recreated and dying back?
- scheduling a worker?

Olivier, for how long did you run the test? >1 min?

> One important detail to mention about the async completion time: in this test, the ONLY way a request can end up being completed async is if vfs_poll() reports that the file is ready. Otherwise, the request ends up being processed with the io_uring fast poll feature.
>
> So there does not seem to be any downside to the patch. TBH, at the initial patch submission I only used my intuition to evaluate the merit of my patch, but thanks to your healthy skepticism, Pavel, this forced me to actually measure the patch, and it appears to incontestably improve performance both for the reissued sqe and for all the other sqes found in a batch submission.

It would be interesting to see what the difference would be if this were done through io_req_task_work_add(), and what the percentage of such reqs would be for a high-QD workload. But regardless, I don't expect any harm, so it sounds good to me.

I agree with Jens' comment about the return value. I think it will go in quickly once resubmitted with the adjustment.

> Hopefully, the results will please you as much as they have pleased me!

--
Pavel Begunkov
* Re: [PATCH] io_uring: reduce latency by reissueing the operation
From: Olivier Langlois @ 2021-06-18 22:45 UTC
To: Pavel Begunkov, Jens Axboe, io-uring, linux-kernel

On Thu, 2021-06-17 at 19:10 +0100, Pavel Begunkov wrote:
>> For the patch performance testing I used the simplest config: single thread, 1 TCP connection, no sqpoll.
>
> Queue depth (QD) 1, right?

Since my io_uring usage is wrapped into a library and some parameters are fixed and adjusted to be able to handle the largest use-case scenario, QD was set to 256 for that test.

There are also a few accessory fds, such as 2 eventfds, that are polled to interrupt the framework event loop, but in practice they were silent during the whole testing period.

But no big batch submissions for sure. At most maybe 2 sqes per submission: 1 to provide back the buffer and the other to reinsert the read operation.

>> To process an incoming 5Mbps stream, it takes about 5% of the CPU.
>
> I see, under-utilised, and so your main concern here is latency.

YES! That is my main concern. That is THE reason why you now have me as an io_uring power user. My app was previously using epoll through libev. The idea of eliminating all those syscalls is what attracted me.

>> Here is the testing methodology: add 2 fields to struct io_rw:
>> 	u64 startTs;
>> 	u8  readType;
>>
>> startTs is set with ktime_get_raw_fast_ns() in io_read_prep().
>>
>> readType is set to:
>> 	0: data ready
>> 	1: use fast poll (we ignore those)
>> 	2: reissue
>> 	3: async
>>
>> readType is used to know what type of measurement it is at the recording point.
>>
>> End time is measured at 3 recording points:
>> 1. In __io_queue_sqe() when io_issue_sqe() returns 0
>> 2. In __io_queue_sqe() after the io_queue_async_work() call
>> 3. In io_wq_submit_work() after the while loop
>>
>> So I took 4 measurements:
>>
>> 1. The time it takes to process a read request when data is already available
>> 2. The time it takes to process a request by calling io_issue_sqe() twice after vfs_poll() indicated that data was available
>> 3. The time it takes to execute io_queue_async_work()
>> 4. The time it takes to complete a read request asynchronously
>>
>> Before presenting the results, I want to mention that 2.25% of the total number of my read requests end up in the situation where the read() syscall returned EAGAIN but data became available by the time vfs_poll() got called.
>>
>> My expectation was that reissuing an sqe could be on par with, or a bit more expensive than, placing it on io-wq for async processing, and that this would put the patch in a gray zone with pros and cons in terms of performance.
>>
>> The reality is instead super nice (numbers in nsec):
>>
>> ready data (baseline)
>> 	avg	3657.94182918628
>> 	min	580
>> 	max	20098
>> 	stddev	1213.15975908162
>>
>> reissue completion
>> 	avg	7882.67567567568
>> 	min	2316
>> 	max	28811
>> 	stddev	1982.79172973284
>>
>> insert io-wq time
>> 	avg	8983.82276995305
>> 	min	3324
>> 	max	87816
>> 	stddev	2551.60056552038
>>
>> async completion time
>> 	avg	24670.4758861127
>> 	min	10758
>> 	max	102612
>> 	stddev	3483.92416873804
>>
>> Conclusion:
>> On average, reissuing the sqe with the patched code is 1.1 usec faster, and in the worst-case scenario 59 usec faster, than placing the request on io-wq.
>>
>> On average, completion by reissuing the sqe with the patched code is 16.79 usec faster, and in the worst-case scenario 73.8 usec faster, than async completion.
>
> Hah, you approached it fundamentally. I'm trying to make sure I get it; correct me if I am mistaken.
>
> 1) It's the avg completion for those 2.5%, not for all requests.

Correct. Otherwise the executed path is identical to what it was prior to the patch.

> 2) Do they return an equivalent number of bytes? And what is the read/recv size (e.g. buffer size)?

Nothing escapes your eagle-eyed attention, Pavel...

I set my read size to an arbitrarily big size (20KB) just to be sure that, most of the time, I never end up with partial reads and more syscalls than I could get away with given a big enough buffer size.

TBH, I didn't pay that much attention to this detail. Off the top of my head, I would say that the average size is all over the place. It can go from 150 bytes up to 15KB, but I would think that the average must be between 1-2 MTUs (around 2500 bytes).

That being said, the average read size must be spread equally between the packets going through the regular path and those taking the new shortcut, so I believe the conclusion should still hold despite not having considered this aspect in the test.

> Because in theory it can be that during the somewhat small delay for punting to io-wq, more data has arrived, and so the async completion pulls more data, which takes more time. In that case the time difference should also account for the difference in the amount of data read.

Good point. It did not even occur to me to consider this aspect, but how many more packets would the network stack have had time to receive in an extra 16 usec period? (I am not on one of those crazy fiber-optic 200Gbps Mellanox cards....) 1, 2, 3, 4? We aren't talking multiple extra MBs to copy here...

> 3) Curious, why read and not recv, as you're working with sockets?

I learned network programming with the classic Stevens book. As far as I remember from the book, the only benefit of recv() over read() is if you need to specify one of the funky flags that recv() allows you to pass; read() doesn't give access to that functionality.

If there is a performance benefit to using recv() over read() for TCP fds, that is something I am not aware of, and if you confirm that it is the case, it would be very easy for me to change my read calls to recv() ones...

Now that you ask the question, maybe read() is implemented with recv(), but AFAIK the native network functions are sendmsg and recvmsg, so neither read() nor recv() would have an edge over the other in that department...

While we are talking about read() vs recv(), I am curious about something too. While working on my other patch (store back the buffer in case of failure), I noticed that the buffer address and bid aren't stored in the same fields: io_put_recv_kbuf() vs io_put_rw_kbuf().

I didn't figure out why those values aren't stored in the same io_kiocb fields for recv operations... Why is that?

> 4) Did you do any userspace measurements? And a question to everyone in general: do we have any good net benchmarking tool that works with io_uring? Like netperf? Hopefully spitting out a latency distribution.

No, I haven't.

> Also, not particularly about the reissue stuff, but a note to myself: 59us is a lot, so I wonder where the overhead comes from. Definitely not the io-wq queueing (i.e. putting into a list).
> - waking a worker?
> - creating a new worker? Do we manage workers sanely, e.g. not keep them constantly recreated and dying back?
> - scheduling a worker?

Creating a new worker is for sure not free, but I would remove that cause from the suspect list since in my scenario it was a one-shot event.

The first measurement was not even significantly higher than all the other measurements.

> Olivier, for how long did you run the test? >1 min?

Much more than 1 minute. I would say somewhere between 20-25 minutes.

I wanted a big enough sample size for those 2.5% special-path events so that the conclusion could be statistically significant.
* Re: [PATCH] io_uring: reduce latency by reissueing the operation
From: Pavel Begunkov @ 2021-06-20 20:55 UTC
To: Olivier Langlois, Jens Axboe, io-uring, linux-kernel

On 6/18/21 11:45 PM, Olivier Langlois wrote:
> On Thu, 2021-06-17 at 19:10 +0100, Pavel Begunkov wrote:
>>> For the patch performance testing I used the simplest config: single thread, 1 TCP connection, no sqpoll.
>>
>> Queue depth (QD) 1, right?
>
> Since my io_uring usage is wrapped into a library and some parameters are fixed and adjusted to be able to handle the largest use-case scenario, QD was set to 256 for that test.
>
> There are also a few accessory fds, such as 2 eventfds, that are polled to interrupt the framework event loop, but in practice they were silent during the whole testing period.
>
> But no big batch submissions for sure. At most maybe 2 sqes per submission: 1 to provide back the buffer and the other to reinsert the read operation.

I see

[...]

>> 2) Do they return an equivalent number of bytes? And what is the read/recv size (e.g. buffer size)?
>
> Nothing escapes your eagle-eyed attention, Pavel...

It may all sound over-scrutinising, but I'm just used to analysing performance magic and seeing how edges may be polished. Not a requirement for the patch.

> I set my read size to an arbitrarily big size (20KB) just to be sure that, most of the time, I never end up with partial reads and more syscalls than I could get away with given a big enough buffer size.
>
> TBH, I didn't pay that much attention to this detail. Off the top of my head, I would say that the average size is all over the place. It can go from 150 bytes up to 15KB, but I would think that the average must be between 1-2 MTUs (around 2500 bytes).
>
> That being said, the average read size must be spread equally between the packets going through the regular path and those taking the new shortcut, so I believe the conclusion should still hold despite not having considered this aspect in the test.
>
>> Because in theory it can be that during the somewhat small delay for punting to io-wq, more data has arrived, and so the async completion pulls more data, which takes more time. In that case the time difference should also account for the difference in the amount of data read.
>
> Good point. It did not even occur to me to consider this aspect, but how many more packets would the network stack have had time to receive in an extra 16 usec period? (I am not on one of those crazy fiber-optic 200Gbps Mellanox cards....) 1, 2, 3, 4? We aren't talking multiple extra MBs to copy here...

Say 1-2. I need to check, but I think that while processing them and copying to userspace another one might arrive, and so you have a full 20KB instead of the 4KB that would have been copied inline. Plus io-wq overhead, 16us wouldn't be unreasonable then. But that's "what if" thinking.

>> 3) Curious, why read and not recv, as you're working with sockets?
>
> I learned network programming with the classic Stevens book. As far as I remember from the book, the only benefit of recv() over read() is if you need to specify one of the funky flags that recv() allows you to pass; read() doesn't give access to that functionality.
>
> If there is a performance benefit to using recv() over read() for TCP fds, that is something I am not aware of, and if you confirm that it is the case, it would be very easy for me to change my read calls to recv() ones...
>
> Now that you ask the question, maybe read() is implemented with recv(),

It all sinks into the common code rather early.

> but AFAIK the native network functions are sendmsg and recvmsg, so neither read() nor recv() would have an edge over the other in that department...

For the io_uring part, e.g. recv is slimmer than recvmsg and doesn't need to copy extra. Read can be more expensive on the io_uring side because it may copy/alloc extra stuff, plus there is additional logic in the io_read() part for generality. But I don't expect it to make much of a difference; never tested it, though.

> While we are talking about read() vs recv(), I am curious about something too. While working on my other patch (store back the buffer in case of failure), I noticed that the buffer address and bid aren't stored in the same fields: io_put_recv_kbuf() vs io_put_rw_kbuf().
>
> I didn't figure out why those values aren't stored in the same io_kiocb fields for recv operations... Why is that?

Just because of how it was done. It could use some cleaning up, e.g. I don't like rewriting req->rw.addr with a selected buffer. In general, the first 64B (one cacheline) of io_kiocb (i.e. the request) is taken by per-opcode data, and we try to fit everything related to a particular opcode there and not spill into the generic parts of the struct. Another concern, in general, is keeping everything tight enough and shuffled right so it doesn't read extra cachelines in the hot path.

>> 4) Did you do any userspace measurements? And a question to everyone in general: do we have any good net benchmarking tool that works with io_uring? Like netperf? Hopefully spitting out a latency distribution.
>
> No, I haven't.

Given what was said, I'd expect the mean to stay about the same and the elevated ~99th percentile to be reduced by the patch, which is also great. Latency is always the hardest part.

>> Also, not particularly about the reissue stuff, but a note to myself: 59us is a lot, so I wonder where the overhead comes from. Definitely not the io-wq queueing (i.e. putting into a list).
>> - waking a worker?
>> - creating a new worker? Do we manage workers sanely, e.g. not keep them constantly recreated and dying back?
>> - scheduling a worker?
>
> Creating a new worker is for sure not free, but I would remove that cause from the suspect list since in my scenario it was a one-shot event.

Not sure what you mean, but speculating: io-wq may not have an optimal policy for recycling worker threads, leading to recreating/removing more of them than needed. Depends on bugs, use cases and so on.

> The first measurement was not even significantly higher than all the other measurements.

You get a huge max for the io-wq case. Obviously nothing can be concluded from the max alone; we'd need a latency distribution and probably longer runs, but I'm still curious where it's coming from. Just keeping an eye on it in general.

>> Olivier, for how long did you run the test? >1 min?
>
> Much more than 1 minute. I would say somewhere between 20-25 minutes.
>
> I wanted a big enough sample size for those 2.5% special-path events so that the conclusion could be statistically significant.

Great. If io-wq worker creation doesn't work right, then it's because of policies and so on.

--
Pavel Begunkov
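For anyone wanting to reproduce the read() vs recv() comparison from userspace, the switch is essentially a one-liner with liburing. This is only a sketch; buffer selection and error handling are left out, and the zero flags value is an assumption:

	#include <liburing.h>
	#include <stddef.h>

	/* queue one receive on a connected TCP socket */
	static void queue_rx(struct io_uring *ring, int sockfd, void *buf, size_t len)
	{
		struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

		/* what the test application was doing: */
		/* io_uring_prep_read(sqe, sockfd, buf, len, 0); */

		/* the slimmer variant discussed above: */
		io_uring_prep_recv(sqe, sockfd, buf, len, 0);

		io_uring_submit(ring);
	}

Both variants complete with the number of bytes received in cqe->res; recv additionally accepts MSG_* flags and skips the generic read import logic on the kernel side.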
* Re: [PATCH] io_uring: reduce latency by reissueing the operation
From: Olivier Langlois @ 2021-06-20 21:31 UTC
To: Pavel Begunkov, Jens Axboe, io-uring, linux-kernel

On Sun, 2021-06-20 at 21:55 +0100, Pavel Begunkov wrote:
> On 6/18/21 11:45 PM, Olivier Langlois wrote:
>
> For the io_uring part, e.g. recv is slimmer than recvmsg and doesn't need to copy extra. Read can be more expensive on the io_uring side because it may copy/alloc extra stuff, plus there is additional logic in the io_read() part for generality. But I don't expect it to make much of a difference; never tested it, though.

That is super interesting. The way I see it after getting your explanations is that in the worst-case scenario there won't be any difference, but in the best case I could see a small speed gain.

I made the switch yesterday evening. One of the metrics I monitor the most is my system's reaction time to incoming packets. I will let you know whether switching to recv() is beneficial in that regard.

>>> Also, not particularly about the reissue stuff, but a note to myself: 59us is a lot, so I wonder where the overhead comes from. Definitely not the io-wq queueing (i.e. putting into a list).
>>> - waking a worker?
>>> - creating a new worker? Do we manage workers sanely, e.g. not keep them constantly recreated and dying back?
>>> - scheduling a worker?
>>
>> Creating a new worker is for sure not free, but I would remove that cause from the suspect list since in my scenario it was a one-shot event.
>
> Not sure what you mean, but speculating: io-wq may not have an optimal policy for recycling worker threads, leading to recreating/removing more of them than needed. Depends on bugs, use cases and so on.

Since I absolutely don't use the async workers feature, I was obsessed by the fact that I was seeing an io worker created. This is the root of why I ended up writing the patch.

My understanding of how io worker lifetimes are managed is that one remains present once created.

In my scenario, once that single persistent io worker thread is created, no others are ever created, so this is a one-shot cost.

I was prepared to eliminate the first measurement to be as fair as possible and not pollute the async performance result with a one-time-only thread creation cost, but to my surprise... the thread creation cost was not visible in the first measurement...

From that, and maybe this is an erroneous shortcut, I do not feel that thread creation is the bottleneck.

>> The first measurement was not even significantly higher than all the other measurements.
>
> You get a huge max for the io-wq case. Obviously nothing can be concluded from the max alone; we'd need a latency distribution and probably longer runs, but I'm still curious where it's coming from. Just keeping an eye on it in general.

Maybe it is scheduling...

I'll keep this mystery in the back of my mind in case I end up with a way to find out where the time is spent...
* Re: [PATCH] io_uring: reduce latency by reissueing the operation
From: Pavel Begunkov @ 2021-06-20 22:04 UTC
To: Olivier Langlois, Jens Axboe, io-uring, linux-kernel

On 6/20/21 10:31 PM, Olivier Langlois wrote:
> On Sun, 2021-06-20 at 21:55 +0100, Pavel Begunkov wrote:
[...]
>>> Creating a new worker is for sure not free, but I would remove that cause from the suspect list since in my scenario it was a one-shot event.
>>
>> Not sure what you mean, but speculating: io-wq may not have an optimal policy for recycling worker threads, leading to recreating/removing more of them than needed. Depends on bugs, use cases and so on.
>
> Since I absolutely don't use the async workers feature, I was obsessed by the fact that I was seeing an io worker created. This is the root of why I ended up writing the patch.
>
> My understanding of how io worker lifetimes are managed is that one remains present once created.

There was one(?) such worker, and others should be able to eventually die off if there is nothing to do. Something may have changed with recent changes; I should update myself on that.

> In my scenario, once that single persistent io worker thread is created, no others are ever created, so this is a one-shot cost.

Good to know, thanks for confirming.

> I was prepared to eliminate the first measurement to be as fair as possible and not pollute the async performance result with a one-time-only thread creation cost, but to my surprise... the thread creation cost was not visible in the first measurement...
>
> From that, and maybe this is an erroneous shortcut, I do not feel that thread creation is the bottleneck.
>
>>> The first measurement was not even significantly higher than all the other measurements.
>>
>> You get a huge max for the io-wq case. Obviously nothing can be concluded from the max alone; we'd need a latency distribution and probably longer runs, but I'm still curious where it's coming from. Just keeping an eye on it in general.
>
> Maybe it is scheduling...
>
> I'll keep this mystery in the back of my mind in case I end up with a way to find out where the time is spent...

--
Pavel Begunkov
* Re: [PATCH] io_uring: reduce latency by reissueing the operation
From: Jens Axboe @ 2021-06-16 12:48 UTC
To: Olivier Langlois, Pavel Begunkov, io-uring, linux-kernel

On 6/9/21 4:08 PM, Olivier Langlois wrote:
> It is quite frequent that when an operation fails and returns EAGAIN, the data becomes available between that failure and the call to vfs_poll() done by io_arm_poll_handler().
>
> Detecting the situation and reissuing the operation is much faster than going ahead and pushing the operation to the io-wq.

I think this is obviously the right thing to do, but I'm not too crazy about the 'ret' pointer passed in. We could either add a proper return type instead of the bool and use that, or put the poll-or-queue-async into a helper that only needs a bool return, and use that return value to decide whether to re-issue or not.

Care to send an updated variant?

--
Jens Axboe
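One possible shape for the helper Jens suggests, sketched here only to illustrate the idea; the name and exact return convention are assumptions, not what was eventually merged:

	/* Fold poll-arming and io-wq queueing into one helper so the caller
	 * only gets back a bool meaning "re-issue this request now".
	 */
	static bool io_queue_async(struct io_kiocb *req)
	{
		__poll_t poll_ret;

		if (io_arm_poll_handler(req, &poll_ret))
			return false;		/* poll armed, retried on wakeup */
		if (poll_ret)
			return true;		/* file already ready, re-issue inline */
		io_queue_async_work(req);	/* otherwise punt to io-wq */
		return false;
	}

	/* caller, in the EAGAIN branch of __io_queue_sqe():
	 *
	 *	} else if (ret == -EAGAIN && !(req->flags & REQ_F_NOWAIT)) {
	 *		if (io_queue_async(req))
	 *			goto issue_sqe;
	 *	}
	 */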
* Re: [PATCH] io_uring: reduce latency by reissueing the operation
From: Olivier Langlois @ 2021-06-18 21:38 UTC
To: Jens Axboe, Pavel Begunkov, io-uring, linux-kernel

On Wed, 2021-06-16 at 06:48 -0600, Jens Axboe wrote:
> On 6/9/21 4:08 PM, Olivier Langlois wrote:
>> It is quite frequent that when an operation fails and returns EAGAIN, the data becomes available between that failure and the call to vfs_poll() done by io_arm_poll_handler().
>>
>> Detecting the situation and reissuing the operation is much faster than going ahead and pushing the operation to the io-wq.
>
> I think this is obviously the right thing to do, but I'm not too crazy about the 'ret' pointer passed in. We could either add a proper return type instead of the bool and use that, or put the poll-or-queue-async into a helper that only needs a bool return, and use that return value to decide whether to re-issue or not.
>
> Care to send an updated variant?
>
No problem! It is going to be a pleasure to send an updated version with the requested change!

I will take that opportunity to try out my new patch-sending setup ;-)