From: Pavel Begunkov <[email protected]>
To: Ming Lei <[email protected]>
Cc: Jens Axboe <[email protected]>,
[email protected], [email protected],
Kevin Wolf <[email protected]>
Subject: Re: [PATCH V3 5/9] io_uring: support SQE group
Date: Sun, 16 Jun 2024 20:13:26 +0100 [thread overview]
Message-ID: <[email protected]> (raw)
In-Reply-To: <ZmpPONHc8GajjoEm@fedora>
On 6/13/24 02:45, Ming Lei wrote:
> On Mon, Jun 10, 2024 at 03:53:51AM +0100, Pavel Begunkov wrote:
>> On 5/11/24 01:12, Ming Lei wrote:
>>> An SQE group is defined as a chain of SQEs starting with the first SQE that
>>> has IOSQE_SQE_GROUP set, and ending with the first subsequent SQE that
>>> doesn't have it set; it is similar to a chain of linked SQEs.
>>
>> The main concern stays the same: it adds overhead to nearly every
>> single hot function I can think of, as well as lots of
>> complexity.
>
> Almost every SQE group change is guarded by REQ_F_SQE_GROUP, so I am
> not clear on what the added overhead is.
Yes, and there are a dozen such checks in the hot path.
>> Another minor issue is REQ_F_INFLIGHT, as explained before,
>> cancellation has to be able to find all REQ_F_INFLIGHT
>> requests. Requests you add to a group can have that flag
>> but are not discoverable by core io_uring code.
>
> OK, we can deal with it by setting REQ_F_INFLIGHT on the leader if the
> flag is set for any member, since all members are guaranteed to
> be drained when the leader is completed. Will do it in V4.
Or fail if you see one, that's also fine. REQ_F_INFLIGHT is
only set for POLL requests polling another io_uring.
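FWIW, a minimal sketch of that propagation (a hypothetical helper, not
in the patch; it assumes it runs once the member list is set up, walks
members via ->grp_link as the patch does, and reuses the existing
io_req_track_inflight() so the inflight accounting stays correct):

static void io_group_mark_inflight(struct io_kiocb *lead)
{
	struct io_kiocb *member;

	/* let cancellation discover the whole group via its leader */
	for (member = lead->grp_link; member; member = member->grp_link) {
		if (member->flags & REQ_F_INFLIGHT) {
			io_req_track_inflight(lead);
			break;
		}
	}
}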
>> Another note: I'll be looking deeper into this patch; there
>> is too much random tossing around of requests / refcounting
>> and other dependencies, as well as odd intertwinings with
>> other parts.
>
> The only thing wrt. request refcounting is for io-wq, since a request
> reference is grabbed when the req is handled in io-wq context, and the
> group leader needs to be completed after all members are done. That
> is the only special change wrt. request refcounting.
I rather mean refcounting the group leader, even if it's not
atomic.
>>> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
>>> index 7a6b190c7da7..62311b0f0e0b 100644
>>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
>>> index c184c9a312df..b87c5452de43 100644
>>> --- a/io_uring/io_uring.c
>>> +++ b/io_uring/io_uring.c
...
>>> }
>>> }
>>> +static inline bool need_queue_group_members(struct io_kiocb *req)
>>> +{
>>> + return req_is_group_leader(req) && req->grp_link;
>>> +}
>>> +
>>> +/* Can only be called after this request is issued */
>>> +static inline struct io_kiocb *get_group_leader(struct io_kiocb *req)
>>> +{
>>> + if (req->flags & REQ_F_SQE_GROUP) {
>>> + if (req_is_group_leader(req))
>>> + return req;
>>> + return req->grp_link;
>>
>> I'm missing something: it seems io_group_sqe() adds all
>> requests of a group into a singly linked list via ->grp_link,
>> but here we return it as the leader. Confused.
>
> ->grp_link stores the singly linked member list for the group leader, and
> the same field stores the group leader pointer for group member requests.
> For the latter, we can add a union field to make the code more readable.
> Will do that in V4.
So you're repurposing it in io_queue_group_members(). Since
it has different meanings at different stages of execution,
it warrants a comment (unless there is one I missed).
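For example, a sketch of such a union in struct io_kiocb (field names
are illustrative only, not from the patch):

	union {
		/* group leader: head of the singly linked member list */
		struct io_kiocb		*grp_head;
		/* group member: points back at the group leader once
		 * io_queue_group_members() has repurposed the field */
		struct io_kiocb		*grp_leader;
	};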
>>> + }
>>> + return NULL;
>>> +}
>>> +
>>> +void io_cancel_group_members(struct io_kiocb *req, bool ignore_cqes)
>>> +{
>>> + struct io_kiocb *member = req->grp_link;
>>> +
>>> + while (member) {
>>> + struct io_kiocb *next = member->grp_link;
>>> +
>>> + if (ignore_cqes)
>>> + member->flags |= REQ_F_CQE_SKIP;
>>> + if (!(member->flags & REQ_F_FAIL)) {
>>> + req_set_fail(member);
>>> + io_req_set_res(member, -ECANCELED, 0);
>>> + }
>>> + member = next;
>>> + }
>>> +}
>>> +
>>> +void io_queue_group_members(struct io_kiocb *req, bool async)
>>> +{
>>> + struct io_kiocb *member = req->grp_link;
>>> +
>>> + if (!member)
>>> + return;
>>> +
>>> + while (member) {
>>> + struct io_kiocb *next = member->grp_link;
>>> +
>>> + member->grp_link = req;
>>> + if (async)
>>> + member->flags |= REQ_F_FORCE_ASYNC;
>>> +
>>> + if (unlikely(member->flags & REQ_F_FAIL)) {
>>> + io_req_task_queue_fail(member, member->cqe.res);
>>> + } else if (member->flags & REQ_F_FORCE_ASYNC) {
>>> + io_req_task_queue(member);
>>> + } else {
>>> + io_queue_sqe(member);
io_req_queue_tw_complete() please, just like links deal
with it, so it's executed in a well-known context without
jumping ahead of other requests.
>>> + }
>>> + member = next;
>>> + }
>>> + req->grp_link = NULL;
>>> +}
>>> +
>>> +static inline bool __io_complete_group_req(struct io_kiocb *req,
>>> + struct io_kiocb *lead)
>>> +{
>>> + WARN_ON_ONCE(!(req->flags & REQ_F_SQE_GROUP));
>>> +
>>> + if (WARN_ON_ONCE(lead->grp_refs <= 0))
>>> + return false;
>>> +
>>> + /*
>>> + * Set linked leader as failed if any member is failed, so
>>> + * the remained link chain can be terminated
>>> + */
>>> + if (unlikely((req->flags & REQ_F_FAIL) &&
>>> + ((lead->flags & IO_REQ_LINK_FLAGS) && lead->link)))
>>> + req_set_fail(lead);
>>> + return !--lead->grp_refs;
>>> +}
>>> +
>>> +/* Complete group request and collect completed leader for freeing */
>>> +static inline void io_complete_group_req(struct io_kiocb *req,
>>> + struct io_wq_work_list *grp_list)
>>> +{
>>> + struct io_kiocb *lead = get_group_leader(req);
>>> +
>>> + if (__io_complete_group_req(req, lead)) {
>>> + req->flags &= ~REQ_F_SQE_GROUP;
>>> + lead->flags &= ~REQ_F_SQE_GROUP_LEADER;
>>> + if (!(lead->flags & REQ_F_CQE_SKIP))
>>> + io_req_commit_cqe(lead, lead->ctx->lockless_cq);
>>> +
>>> + if (req != lead) {
>>> + /*
>>> + * Add leader to free list if it isn't there
>>> + * otherwise clearing group flag for freeing it
>>> + * in current batch
>>> + */
>>> + if (!(lead->flags & REQ_F_SQE_GROUP))
>>> + wq_list_add_tail(&lead->comp_list, grp_list);
>>> + else
>>> + lead->flags &= ~REQ_F_SQE_GROUP;
>>> + }
>>> + } else if (req != lead) {
>>> + req->flags &= ~REQ_F_SQE_GROUP;
>>> + } else {
>>> + /*
>>> + * Leader's group flag clearing is delayed until it is
>>> + * removed from free list
>>> + */
>>> + }
>>> +}
>>> +
>>> static void io_req_complete_post(struct io_kiocb *req, unsigned issue_flags)
>>> {
>>> struct io_ring_ctx *ctx = req->ctx;
>>> @@ -1427,6 +1545,17 @@ static void io_free_batch_list(struct io_ring_ctx *ctx,
>>> comp_list);
>>> if (unlikely(req->flags & IO_REQ_CLEAN_SLOW_FLAGS)) {
>>> + /*
>>> + * Group leader may be removed twice, don't free it
>>> + * if group flag isn't cleared, when some members
>>> + * aren't completed yet
>>> + */
>>> + if (req->flags & REQ_F_SQE_GROUP) {
>>> + node = req->comp_list.next;
>>> + req->flags &= ~REQ_F_SQE_GROUP;
>>> + continue;
>>> + }
>>> +
>>> if (req->flags & REQ_F_REFCOUNT) {
>>> node = req->comp_list.next;
>>> if (!req_ref_put_and_test(req))
>>> @@ -1459,6 +1588,7 @@ void __io_submit_flush_completions(struct io_ring_ctx *ctx)
>>> __must_hold(&ctx->uring_lock)
>>> {
>>> struct io_submit_state *state = &ctx->submit_state;
>>> + struct io_wq_work_list grp_list = {NULL};
>>> struct io_wq_work_node *node;
>>> __io_cq_lock(ctx);
>>> @@ -1468,9 +1598,15 @@ void __io_submit_flush_completions(struct io_ring_ctx *ctx)
>>> if (!(req->flags & REQ_F_CQE_SKIP))
>>> io_req_commit_cqe(req, ctx->lockless_cq);
>>> +
>>> + if (req->flags & REQ_F_SQE_GROUP)
>>
>> Same note about hot path
>>
>>> + io_complete_group_req(req, &grp_list);
>>> }
>>> __io_cq_unlock_post(ctx);
>>> + if (!wq_list_empty(&grp_list))
>>> + __wq_list_splice(&grp_list, state->compl_reqs.first);
>>
>> What's the point of splicing it here insted of doing all
>> that under REQ_F_SQE_GROUP above?
>
> As mentioned, the group leader can't be completed until all members are
> done, so any leaders in the current list have to be moved to this
> local list for deferred completion. That should be the only tricky
> part of the whole SQE group implementation.
>
>>
>>> +
>>> if (!wq_list_empty(&ctx->submit_state.compl_reqs)) {
>>> io_free_batch_list(ctx, state->compl_reqs.first);
>>> INIT_WQ_LIST(&state->compl_reqs);
...
>>> @@ -1863,6 +2012,8 @@ void io_wq_submit_work(struct io_wq_work *work)
>>> }
>>> }
>>> + if (need_queue_group_members(req))
>>> + io_queue_group_members(req, true);
>>> do {
>>> ret = io_issue_sqe(req, issue_flags);
>>> if (ret != -EAGAIN)
>>> @@ -1977,6 +2128,9 @@ static inline void io_queue_sqe(struct io_kiocb *req)
>>> */
>>> if (unlikely(ret))
>>> io_queue_async(req, ret);
>>> +
>>> + if (need_queue_group_members(req))
>>> + io_queue_group_members(req, false);
>>
>> Request ownership is considered to be handed off at this
>> point and requests should not be touched. Only on ret==0 from
>> io_issue_sqe() is it still ours; otherwise it's handed off
>> somewhere by io_queue_async().
>
> Yes, you are right.
>
> And it has been fixed in my local tree:
>
> @@ -2154,8 +2154,7 @@ static inline void io_queue_sqe(struct io_kiocb *req)
> */
> if (unlikely(ret))
> io_queue_async(req, ret);
> -
> - if (need_queue_group_members(req))
> + else if (need_queue_group_members(req))
> io_queue_group_members(req, false);
> }
In the else branch you don't own the request anymore
and shouldn't be poking into it.

It looks like you're trying to do io_queue_group_members()
at the point where the request would previously get completed.
It's not the right place, and apart from the whack-a-mole'd
io_wq_submit_work(), io_poll_issue() is also missed.
__io_submit_flush_completions() / io_free_batch_list() seems
more appropriate, and you already have a chunk with a GROUP
check in there handling the leader appearing there twice.
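Roughly along these lines (an illustrative, untested sketch of the
idea; the hypothetical helper would be called from io_free_batch_list()
where the GROUP special case already lives, and it only uses the
patch's own req_is_group_leader()/io_queue_group_members()):

static void io_queue_group_members_deferred(struct io_kiocb *req)
{
	/* a leader still carrying unqueued members: kick them here,
	 * in one well-known context covering every issue path */
	if (req_is_group_leader(req) && req->grp_link)
		io_queue_group_members(req, false);
}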
>>> }
>>> static void io_queue_sqe_fallback(struct io_kiocb *req)
...
>>> @@ -2232,7 +2443,7 @@ static inline int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
>>> const struct io_uring_sqe *sqe)
>>> __must_hold(&ctx->uring_lock)
>>> {
>>> - struct io_submit_link *link = &ctx->submit_state.link;
>>> + struct io_submit_state *state = &ctx->submit_state;
>>> int ret;
>>> ret = io_init_req(ctx, req, sqe);
>>> @@ -2241,9 +2452,17 @@ static inline int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
>>> trace_io_uring_submit_req(req);
>>> - if (unlikely(link->head || (req->flags & (IO_REQ_LINK_FLAGS |
>>> - REQ_F_FORCE_ASYNC | REQ_F_FAIL)))) {
>>> - req = io_link_sqe(link, req);
>>> + if (unlikely(state->group.head ||
>>
>> A note rather to myself and for the future: all these checks,
>> including links and groups, can be folded under one common if.
>
> Sorry, I may not get the idea; can you provide an example?
To be clear, I'm not suggesting you do it.
Simplifying:
init_req() {
	if (req->flags & (GROUP|LINK))
		ctx->assembling = true;
}

io_submit_sqe() {
	init_req();
	if (ctx->assembling) {
		check_groups_and_links();
		if (done)
			ctx->assembling = false;
	}
}
>
> We need different logic for groups and links; meanwhile, the group
> has to be handled before linking, since only the group leader
> can be linked.
--
Pavel Begunkov