From: Pavel Begunkov <[email protected]>
To: Ming Lei <[email protected]>
Cc: Jens Axboe <[email protected]>,
[email protected], [email protected],
Kevin Wolf <[email protected]>
Subject: Re: [PATCH V3 5/9] io_uring: support SQE group
Date: Sun, 16 Jun 2024 20:13:26 +0100 [thread overview]
Message-ID: <[email protected]> (raw)
In-Reply-To: <ZmpPONHc8GajjoEm@fedora>
On 6/13/24 02:45, Ming Lei wrote:
> On Mon, Jun 10, 2024 at 03:53:51AM +0100, Pavel Begunkov wrote:
>> On 5/11/24 01:12, Ming Lei wrote:
>>> An SQE group is defined as a chain of SQEs starting with the first SQE that
>>> has IOSQE_SQE_GROUP set, and ending with the first subsequent SQE that
>>> doesn't have it set; it is similar to a chain of linked SQEs.
>>
>> The main concern stays the same: it adds overhead to nearly every
>> single hot function I can think of, as well as lots of
>> complexity.
>
> Almost every SQE group change is guarded by REQ_F_SQE_GROUP, so I am
> not clear on what the added overhead is.
Yes, and there are a dozen such checks in the hot path.
>> Another minor issue is REQ_F_INFLIGHT, as explained before,
>> cancellation has to be able to find all REQ_F_INFLIGHT
>> requests. Requests you add to a group can have that flag
>> but are not discoverable by core io_uring code.
>
> OK, we can deal with it by setting REQ_F_INFLIGHT on the leader if the
> flag is set for any member, since all members are guaranteed to
> be drained when the leader is completed. Will do it in V4.
Or fail if you see one, that's also fine. REQ_F_INFLIGHT is
only set for POLL requests polling another io_uring.
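FWIW, a minimal sketch of that propagation (a hypothetical helper, not
in the patch; it assumes it runs once the member list is set up, walks
members via ->grp_link as the patch does, and reuses the existing
io_req_track_inflight() so the inflight accounting stays correct):

static void io_group_mark_inflight(struct io_kiocb *lead)
{
	struct io_kiocb *member;

	/* let cancellation discover the whole group via its leader */
	for (member = lead->grp_link; member; member = member->grp_link) {
		if (member->flags & REQ_F_INFLIGHT) {
			io_req_track_inflight(lead);
			break;
		}
	}
}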
>> Another note: I'll be looking deeper into this patch; there
>> is too much random tossing around of requests / refcounting
>> and other dependencies, as well as odd intertwinings with
>> other parts.
>
> The only thing wrt. request refcounting is for io-wq, since a request
> reference is grabbed when the req is handled in io-wq context, and the
> group leader needs to be completed after all members are done. That
> is the only special change wrt. request refcounting.
I rather mean refcounting the group leader, even if it's not
atomic.
>>> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
>>> index 7a6b190c7da7..62311b0f0e0b 100644
>>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
>>> index c184c9a312df..b87c5452de43 100644
>>> --- a/io_uring/io_uring.c
>>> +++ b/io_uring/io_uring.c
...
>>> }
>>> }
>>> +static inline bool need_queue_group_members(struct io_kiocb *req)
>>> +{
>>> + return req_is_group_leader(req) && req->grp_link;
>>> +}
>>> +
>>> +/* Can only be called after this request is issued */
>>> +static inline struct io_kiocb *get_group_leader(struct io_kiocb *req)
>>> +{
>>> + if (req->flags & REQ_F_SQE_GROUP) {
>>> + if (req_is_group_leader(req))
>>> + return req;
>>> + return req->grp_link;
>>
>> I'm missing something: it seems io_group_sqe() adds all
>> requests of a group into a singly linked list via ->grp_link,
>> but here we return it as the leader. Confused.
>
> ->grp_link stores the singly linked member list for the group leader, and
> the same field stores the group leader pointer for group member requests.
> For the latter, we can add a union field to make the code more readable.
> Will do that in V4.
So you're repurposing it in io_queue_group_members(). Since
it has different meanings at different stages of execution,
it warrants a comment (unless there is one I missed).
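For example, a sketch of such a union in struct io_kiocb (field names
are illustrative only, not from the patch):

	union {
		/* group leader: head of the singly linked member list */
		struct io_kiocb		*grp_head;
		/* group member: points back at the group leader once
		 * io_queue_group_members() has repurposed the field */
		struct io_kiocb		*grp_leader;
	};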
>>> + }
>>> + return NULL;
>>> +}
>>> +
>>> +void io_cancel_group_members(struct io_kiocb *req, bool ignore_cqes)
>>> +{
>>> + struct io_kiocb *member = req->grp_link;
>>> +
>>> + while (member) {
>>> + struct io_kiocb *next = member->grp_link;
>>> +
>>> + if (ignore_cqes)
>>> + member->flags |= REQ_F_CQE_SKIP;
>>> + if (!(member->flags & REQ_F_FAIL)) {
>>> + req_set_fail(member);
>>> + io_req_set_res(member, -ECANCELED, 0);
>>> + }
>>> + member = next;
>>> + }
>>> +}
>>> +
>>> +void io_queue_group_members(struct io_kiocb *req, bool async)
>>> +{
>>> + struct io_kiocb *member = req->grp_link;
>>> +
>>> + if (!member)
>>> + return;
>>> +
>>> + while (member) {
>>> + struct io_kiocb *next = member->grp_link;
>>> +
>>> + member->grp_link = req;
>>> + if (async)
>>> + member->flags |= REQ_F_FORCE_ASYNC;
>>> +
>>> + if (unlikely(member->flags & REQ_F_FAIL)) {
>>> + io_req_task_queue_fail(member, member->cqe.res);
>>> + } else if (member->flags & REQ_F_FORCE_ASYNC) {
>>> + io_req_task_queue(member);
>>> + } else {
>>> + io_queue_sqe(member);
io_req_queue_tw_complete() please, just like links deal
with it, so it's executed in a well-known context without
jumping ahead of other requests.
>>> + }
>>> + member = next;
>>> + }
>>> + req->grp_link = NULL;
>>> +}
>>> +
>>> +static inline bool __io_complete_group_req(struct io_kiocb *req,
>>> + struct io_kiocb *lead)
>>> +{
>>> + WARN_ON_ONCE(!(req->flags & REQ_F_SQE_GROUP));
>>> +
>>> + if (WARN_ON_ONCE(lead->grp_refs <= 0))
>>> + return false;
>>> +
>>> + /*
>>> + * Set linked leader as failed if any member is failed, so
>>> + * the remained link chain can be terminated
>>> + */
>>> + if (unlikely((req->flags & REQ_F_FAIL) &&
>>> + ((lead->flags & IO_REQ_LINK_FLAGS) && lead->link)))
>>> + req_set_fail(lead);
>>> + return !--lead->grp_refs;
>>> +}
>>> +
>>> +/* Complete group request and collect completed leader for freeing */
>>> +static inline void io_complete_group_req(struct io_kiocb *req,
>>> + struct io_wq_work_list *grp_list)
>>> +{
>>> + struct io_kiocb *lead = get_group_leader(req);
>>> +
>>> + if (__io_complete_group_req(req, lead)) {
>>> + req->flags &= ~REQ_F_SQE_GROUP;
>>> + lead->flags &= ~REQ_F_SQE_GROUP_LEADER;
>>> + if (!(lead->flags & REQ_F_CQE_SKIP))
>>> + io_req_commit_cqe(lead, lead->ctx->lockless_cq);
>>> +
>>> + if (req != lead) {
>>> + /*
>>> + * Add leader to free list if it isn't there
>>> + * otherwise clearing group flag for freeing it
>>> + * in current batch
>>> + */
>>> + if (!(lead->flags & REQ_F_SQE_GROUP))
>>> + wq_list_add_tail(&lead->comp_list, grp_list);
>>> + else
>>> + lead->flags &= ~REQ_F_SQE_GROUP;
>>> + }
>>> + } else if (req != lead) {
>>> + req->flags &= ~REQ_F_SQE_GROUP;
>>> + } else {
>>> + /*
>>> + * Leader's group flag clearing is delayed until it is
>>> + * removed from free list
>>> + */
>>> + }
>>> +}
>>> +
>>> static void io_req_complete_post(struct io_kiocb *req, unsigned issue_flags)
>>> {
>>> struct io_ring_ctx *ctx = req->ctx;
>>> @@ -1427,6 +1545,17 @@ static void io_free_batch_list(struct io_ring_ctx *ctx,
>>> comp_list);
>>> if (unlikely(req->flags & IO_REQ_CLEAN_SLOW_FLAGS)) {
>>> + /*
>>> + * Group leader may be removed twice, don't free it
>>> + * if group flag isn't cleared, when some members
>>> + * aren't completed yet
>>> + */
>>> + if (req->flags & REQ_F_SQE_GROUP) {
>>> + node = req->comp_list.next;
>>> + req->flags &= ~REQ_F_SQE_GROUP;
>>> + continue;
>>> + }
>>> +
>>> if (req->flags & REQ_F_REFCOUNT) {
>>> node = req->comp_list.next;
>>> if (!req_ref_put_and_test(req))
>>> @@ -1459,6 +1588,7 @@ void __io_submit_flush_completions(struct io_ring_ctx *ctx)
>>> __must_hold(&ctx->uring_lock)
>>> {
>>> struct io_submit_state *state = &ctx->submit_state;
>>> + struct io_wq_work_list grp_list = {NULL};
>>> struct io_wq_work_node *node;
>>> __io_cq_lock(ctx);
>>> @@ -1468,9 +1598,15 @@ void __io_submit_flush_completions(struct io_ring_ctx *ctx)
>>> if (!(req->flags & REQ_F_CQE_SKIP))
>>> io_req_commit_cqe(req, ctx->lockless_cq);
>>> +
>>> + if (req->flags & REQ_F_SQE_GROUP)
>>
>> Same note about hot path
>>
>>> + io_complete_group_req(req, &grp_list);
>>> }
>>> __io_cq_unlock_post(ctx);
>>> + if (!wq_list_empty(&grp_list))
>>> + __wq_list_splice(&grp_list, state->compl_reqs.first);
>>
>> What's the point of splicing it here insted of doing all
>> that under REQ_F_SQE_GROUP above?
>
> As mentioned, the group leader can't be completed until all members are
> done, so any leaders in the current list have to be moved to this
> local list for deferred completion. That should be the only tricky
> part of the whole SQE group implementation.
>
>>
>>> +
>>> if (!wq_list_empty(&ctx->submit_state.compl_reqs)) {
>>> io_free_batch_list(ctx, state->compl_reqs.first);
>>> INIT_WQ_LIST(&state->compl_reqs);
...
>>> @@ -1863,6 +2012,8 @@ void io_wq_submit_work(struct io_wq_work *work)
>>> }
>>> }
>>> + if (need_queue_group_members(req))
>>> + io_queue_group_members(req, true);
>>> do {
>>> ret = io_issue_sqe(req, issue_flags);
>>> if (ret != -EAGAIN)
>>> @@ -1977,6 +2128,9 @@ static inline void io_queue_sqe(struct io_kiocb *req)
>>> */
>>> if (unlikely(ret))
>>> io_queue_async(req, ret);
>>> +
>>> + if (need_queue_group_members(req))
>>> + io_queue_group_members(req, false);
>>
>> Request ownership is considered to be handed off at this
>> point and requests should not be touched. Only on ret==0 from
>> io_issue_sqe() is it still ours; otherwise it's handed off
>> somewhere by io_queue_async().
>
> Yes, you are right.
>
> And it has been fixed in my local tree:
>
> @@ -2154,8 +2154,7 @@ static inline void io_queue_sqe(struct io_kiocb *req)
> */
> if (unlikely(ret))
> io_queue_async(req, ret);
> -
> - if (need_queue_group_members(req))
> + else if (need_queue_group_members(req))
> io_queue_group_members(req, false);
> }
In the else branch you don't own the request anymore
and shouldn't be poking into it.

It looks like you're trying to do io_queue_group_members()
at the point where the request would previously get completed.
It's not the right place, and apart from the whack-a-mole'd
io_wq_submit_work(), io_poll_issue() is also missed.
__io_submit_flush_completions() / io_free_batch_list() seems
more appropriate, and you already have a chunk with a GROUP
check in there handling the leader appearing there twice.
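Roughly along these lines (an illustrative, untested sketch of the
idea; the hypothetical helper would be called from io_free_batch_list()
where the GROUP special case already lives, and it only uses the
patch's own req_is_group_leader()/io_queue_group_members()):

static void io_queue_group_members_deferred(struct io_kiocb *req)
{
	/* a leader still carrying unqueued members: kick them here,
	 * in one well-known context covering every issue path */
	if (req_is_group_leader(req) && req->grp_link)
		io_queue_group_members(req, false);
}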
>>> }
>>> static void io_queue_sqe_fallback(struct io_kiocb *req)
...
>>> @@ -2232,7 +2443,7 @@ static inline int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
>>> const struct io_uring_sqe *sqe)
>>> __must_hold(&ctx->uring_lock)
>>> {
>>> - struct io_submit_link *link = &ctx->submit_state.link;
>>> + struct io_submit_state *state = &ctx->submit_state;
>>> int ret;
>>> ret = io_init_req(ctx, req, sqe);
>>> @@ -2241,9 +2452,17 @@ static inline int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
>>> trace_io_uring_submit_req(req);
>>> - if (unlikely(link->head || (req->flags & (IO_REQ_LINK_FLAGS |
>>> - REQ_F_FORCE_ASYNC | REQ_F_FAIL)))) {
>>> - req = io_link_sqe(link, req);
>>> + if (unlikely(state->group.head ||
>>
>> A note rather to myself and for the future: all these checks,
>> including links and groups, can be folded under one common if.
>
> Sorry, I may not get the idea; can you provide an example?
To be clear, I'm not suggesting you do it.
Simplifying:
init_req() {
	if (req->flags & (GROUP|LINK))
		ctx->assembling = true;
}

io_submit_sqe() {
	init_req();
	if (ctx->assembling) {
		check_groups_and_links();
		if (done)
			ctx->assembling = false;
	}
}
>
> We need different logic for groups and links; meanwhile, the group
> has to be handled before linking, since only the group leader
> can be linked.
--
Pavel Begunkov