On 6/22/20 8:50 AM, Jens Axboe wrote:
> On 6/22/20 7:29 AM, Xuan Zhuo wrote:
>> Hi Jens,
>>
>> I found a problem that I think is worth solving, but the change may be
>> relatively large, so I would like to ask you and everyone else for
>> opinions, or to hear other ideas about this issue.
>>
>> Problem description:
>> ====================
>> In sq thread mode, io_commit_cqring and __io_cqring_fill_event account
>> for a relatively large share of CPU time. The reason is the large
>> number of calls to smp_store_release and WRITE_ONCE. These two
>> operations are relatively slow, and we call smp_store_release every
>> time we post a cqe, so the sheer number of calls makes the problem
>> very prominent.
>>
>> My test environment is qemu, using io_uring in sq thread mode to
>> receive a large number of udp packets at about 800000 pps. The
>> application submits 100 sqes for recv at startup, and every time it
>> reaps a cqe it submits another sqe. The perf top result of the sq
>> thread is as follows:
>>
>>   17.97%  [kernel]  [k] copy_user_generic_unrolled
>>   13.92%  [kernel]  [k] io_commit_cqring
>>   11.04%  [kernel]  [k] __io_cqring_fill_event
>>   10.33%  [kernel]  [k] udp_recvmsg
>>    5.94%  [kernel]  [k] skb_release_data
>>    4.31%  [kernel]  [k] udp_rmem_release
>>    2.68%  [kernel]  [k] __check_object_size
>>    2.24%  [kernel]  [k] __slab_free
>>    2.22%  [kernel]  [k] _raw_spin_lock_bh
>>    2.21%  [kernel]  [k] kmem_cache_free
>>    2.13%  [kernel]  [k] free_pcppages_bulk
>>    1.83%  [kernel]  [k] io_submit_sqes
>>    1.38%  [kernel]  [k] page_frag_free
>>    1.31%  [kernel]  [k] inet_recvmsg
>>
>> io_commit_cqring and __io_cqring_fill_event together account for
>> 24.96%, which is far too much; syscall overhead is not normally this
>> high, so we should solve this.
>>
>> Solution:
>> =========
>> My idea is that when the nr handled by io_submit_sqes is large, we do
>> not call io_cqring_add_event directly. Instead we put each completed
>> req on a queue, then call __io_cqring_fill_event for every queued req
>> and call io_commit_cqring just once at the end of io_submit_sqes. A
>> simple local test of this looks good.
>
> I think the solution here is to defer the cq ring filling + commit to
> the caller instead of doing it deep down the stack; I think that's a
> nice win in general. To do that, we need to be able to do it after
> io_submit_sqes() has been called. We can either do that inline, by
> passing down a list or struct that allows the caller to place the
> request there instead of filling the event, or out-of-band by having
> e.g. a percpu struct that allows the same thing. In both cases, the
> actual call site would do something ala:
>
>     if (comp_list && successful_completion) {
>         req->result = ret;
>         list_add_tail(&req->list, comp_list);
>     } else {
>         io_cqring_add_event(req, ret);
>         if (!successful_completion)
>             req_set_fail_links(req);
>         io_put_req(req);
>     }
>
> and then have the caller iterate the list and fill completions, if it's
> non-empty on return.
>
> I don't think this is necessarily hard, but to do it nicely it will
> touch a bunch of code and hence mean quite a bit of churn. I do think
> the reward is worth it though, as this applies to the "normal"
> submission path as well, not just the SQPOLL variant.

Something like this series. I'd be interested to hear if it makes your
specific test case any better.

Patches are against my for-5.9/io_uring branch.

-- 
Jens Axboe
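
For illustration, a minimal sketch of the caller-side flush described
above might look like the following. The function name
io_flush_comp_list() and the exact plumbing of comp_list are
hypothetical, assumptions for the sake of the example rather than code
taken from the actual series:

    /*
     * Hypothetical flush helper: drain the deferred-completion list
     * built up during io_submit_sqes(), fill a cqe for each queued
     * request, and commit the ring once for the whole batch.
     */
    static void io_flush_comp_list(struct io_ring_ctx *ctx,
                                   struct list_head *comp_list)
    {
        struct io_kiocb *req, *tmp;

        if (list_empty(comp_list))
            return;

        spin_lock_irq(&ctx->completion_lock);
        list_for_each_entry_safe(req, tmp, comp_list, list) {
            list_del(&req->list);
            /* result was stashed in req->result at the call site */
            __io_cqring_fill_event(req, req->result);
            io_put_req(req);
        }
        /* one smp_store_release() of the cq tail for the batch */
        io_commit_cqring(ctx);
        spin_unlock_irq(&ctx->completion_lock);

        /* wake anyone waiting on cq ring events */
        io_cqring_ev_posted(ctx);
    }

The win is that the smp_store_release() inside io_commit_cqring() now
runs once per batch of submissions rather than once per completion,
which is exactly the overhead the perf profile above points at.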