From: Pavel Begunkov <[email protected]>
To: Ming Lei <[email protected]>, Jens Axboe <[email protected]>,
[email protected]
Cc: [email protected], Kevin Wolf <[email protected]>
Subject: Re: [PATCH V3 5/9] io_uring: support SQE group
Date: Mon, 10 Jun 2024 03:53:51 +0100
Message-ID: <[email protected]>
In-Reply-To: <[email protected]>
On 5/11/24 01:12, Ming Lei wrote:
> SQE group is defined as one chain of SQEs starting with the first SQE that
> has IOSQE_SQE_GROUP set, and ending with the first subsequent SQE that
> doesn't have it set, similar to a chain of linked SQEs.
The main concern stays the same: it adds overhead to nearly every
single hot function I can think of, as well as lots of complexity.
Another minor issue is REQ_F_INFLIGHT: as explained before,
cancellation has to be able to find all REQ_F_INFLIGHT
requests. Requests you add to a group can carry that flag
but are not discoverable by the core io_uring code.
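For illustration, one shape this could take, as a sketch only: the
helper name is made up, and it assumes the singly linked ->grp_link
member list from this patch and the current req->task field.

	/*
	 * Hypothetical helper: let cancellation's task matching walk
	 * group members, so REQ_F_INFLIGHT requests parked behind a
	 * leader stay discoverable.
	 */
	static bool io_group_match_inflight(struct io_kiocb *lead,
					    struct task_struct *task)
	{
		struct io_kiocb *member = lead->grp_link;

		while (member) {
			if ((member->flags & REQ_F_INFLIGHT) &&
			    member->task == task)
				return true;
			member = member->grp_link;
		}
		return false;
	}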
Another note: I'll be looking deeper into this patch; there is
too much random tossing around of requests / refcounting and
other dependencies, as well as odd intertwining with other parts.
> Unlike linked SQEs, where each SQE is issued only after the previous one
> completes, all SQEs in one group are submitted in parallel, so there is no
> dependency among the SQEs of a group.
>
> The 1st SQE is the group leader, and the other SQEs are group members. The
> whole group shares the IOSQE_IO_LINK and IOSQE_IO_DRAIN flags of the group
> leader, and the two flags are ignored on group members.
>
> When the group is part of a link chain, it isn't submitted until the
> previous SQE or group completes, and the following SQE or group can't be
> started until this group completes. Failure of any group member fails the
> group leader, so the link chain can be terminated.
>
> When IOSQE_IO_DRAIN is set on the group leader, all requests in this group
> and all previously submitted requests are drained. Since IOSQE_IO_DRAIN can
> be set on the group leader only, we honor IO_DRAIN by always completing the
> group leader as the last one in the group.
>
> Working together with IOSQE_IO_LINK, SQE groups provide a flexible way to
> support N:M dependencies, such as:
>
> - group A is chained with group B together
> - group A has N SQEs
> - group B has M SQEs
>
> then the M SQEs in group B depend on the N SQEs in group A.
>
> N:M dependencies can support some interesting use cases in an efficient way:
>
> 1) read from multiple files, then write the read data into a single file
>
> 2) read from a single file, then write the read data into multiple files
>
> 3) write the same data to multiple files, then read it back from each file
> and verify that the correct data was written
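For concreteness, a minimal userspace sketch of use case 1 with N=2
reads and M=1 write, as I read the description above. Illustrative
only: IOSQE_SQE_GROUP is the flag proposed by this series, not an
existing liburing definition, and error handling is elided.

	#include <liburing.h>

	/* two parallel reads grouped together, one write linked after the group */
	static int submit_group_example(struct io_uring *ring, int fd_a, int fd_b,
					int fd_out, void *buf_a, void *buf_b,
					unsigned int len)
	{
		struct io_uring_sqe *sqe;

		/* group leader; its IOSQE_IO_LINK applies to the whole group */
		sqe = io_uring_get_sqe(ring);
		io_uring_prep_read(sqe, fd_a, buf_a, len, 0);
		sqe->flags |= IOSQE_SQE_GROUP | IOSQE_IO_LINK;

		/* last member: leaving IOSQE_SQE_GROUP clear terminates the group */
		sqe = io_uring_get_sqe(ring);
		io_uring_prep_read(sqe, fd_b, buf_b, len, 0);

		/* linked after the group: issued only once both reads complete */
		sqe = io_uring_get_sqe(ring);
		io_uring_prep_write(sqe, fd_out, buf_a, len, 0);

		return io_uring_submit(ring);
	}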
>
> Also, IOSQE_SQE_GROUP takes the last bit in sqe->flags, but sqe->flags can
> still be extended under a uring context flag, such as by using __pad3 for
> non-uring_cmd OPs and part of uring_cmd_flags for the uring_cmd OP.
>
> Suggested-by: Kevin Wolf <[email protected]>
> Signed-off-by: Ming Lei <[email protected]>
> ---
> include/linux/io_uring_types.h | 12 ++
> include/uapi/linux/io_uring.h | 4 +
> io_uring/io_uring.c | 255 +++++++++++++++++++++++++++++++--
> io_uring/io_uring.h | 16 +++
> io_uring/timeout.c | 2 +
> 5 files changed, 277 insertions(+), 12 deletions(-)
>
> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
> index 7a6b190c7da7..62311b0f0e0b 100644
> --- a/include/linux/io_uring_types.h
> +++ b/include/linux/io_uring_types.h
> @@ -666,6 +674,10 @@ struct io_kiocb {
> u64 extra1;
> u64 extra2;
> } big_cqe;
> +
> + /* all SQE group members linked here for group lead */
> + struct io_kiocb *grp_link;
> + int grp_refs;
> };
>
> struct io_overflow_cqe {
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index c184c9a312df..b87c5452de43 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -109,7 +109,8 @@
> IOSQE_IO_HARDLINK | IOSQE_ASYNC)
>
> #define SQE_VALID_FLAGS (SQE_COMMON_FLAGS | IOSQE_BUFFER_SELECT | \
> - IOSQE_IO_DRAIN | IOSQE_CQE_SKIP_SUCCESS)
> + IOSQE_IO_DRAIN | IOSQE_CQE_SKIP_SUCCESS | \
> + IOSQE_SQE_GROUP)
>
> #define IO_REQ_CLEAN_FLAGS (REQ_F_BUFFER_SELECTED | REQ_F_NEED_CLEANUP | \
> REQ_F_POLLED | REQ_F_INFLIGHT | REQ_F_CREDS | \
> @@ -915,6 +916,13 @@ static __always_inline void io_req_commit_cqe(struct io_kiocb *req,
> {
> struct io_ring_ctx *ctx = req->ctx;
>
> + /*
> + * For group leader, cqe has to be committed after all members are
> + * committed, when the request becomes normal one.
> + */
> + if (unlikely(req_is_group_leader(req)))
> + return;
The copy of this check inlined into flush_completions should
maintain a proper fast path, e.g.:
	if (req->flags & (CQE_SKIP | GROUP)) {
		if (req->flags & CQE_SKIP)
			continue;
		if (req->flags & GROUP) {
			...
		}
	}
> +
> if (unlikely(!io_fill_cqe_req(ctx, req))) {
> if (lockless_cq) {
> spin_lock(&ctx->completion_lock);
> @@ -926,6 +934,116 @@ static __always_inline void io_req_commit_cqe(struct io_kiocb *req,
> }
> }
>
> +static inline bool need_queue_group_members(struct io_kiocb *req)
> +{
> + return req_is_group_leader(req) && req->grp_link;
> +}
> +
> +/* Can only be called after this request is issued */
> +static inline struct io_kiocb *get_group_leader(struct io_kiocb *req)
> +{
> + if (req->flags & REQ_F_SQE_GROUP) {
> + if (req_is_group_leader(req))
> + return req;
> + return req->grp_link;
I'm missing something: io_group_sqe() seems to add all requests
of a group into a singly linked list via ->grp_link, but here we
return it as the leader. Confused.
> + }
> + return NULL;
> +}
> +
> +void io_cancel_group_members(struct io_kiocb *req, bool ignore_cqes)
> +{
> + struct io_kiocb *member = req->grp_link;
> +
> + while (member) {
> + struct io_kiocb *next = member->grp_link;
> +
> + if (ignore_cqes)
> + member->flags |= REQ_F_CQE_SKIP;
> + if (!(member->flags & REQ_F_FAIL)) {
> + req_set_fail(member);
> + io_req_set_res(member, -ECANCELED, 0);
> + }
> + member = next;
> + }
> +}
> +
> +void io_queue_group_members(struct io_kiocb *req, bool async)
> +{
> + struct io_kiocb *member = req->grp_link;
> +
> + if (!member)
> + return;
> +
> + while (member) {
> + struct io_kiocb *next = member->grp_link;
> +
> + member->grp_link = req;
> + if (async)
> + member->flags |= REQ_F_FORCE_ASYNC;
> +
> + if (unlikely(member->flags & REQ_F_FAIL)) {
> + io_req_task_queue_fail(member, member->cqe.res);
> + } else if (member->flags & REQ_F_FORCE_ASYNC) {
> + io_req_task_queue(member);
> + } else {
> + io_queue_sqe(member);
> + }
> + member = next;
> + }
> + req->grp_link = NULL;
> +}
> +
> +static inline bool __io_complete_group_req(struct io_kiocb *req,
> + struct io_kiocb *lead)
> +{
> + WARN_ON_ONCE(!(req->flags & REQ_F_SQE_GROUP));
> +
> + if (WARN_ON_ONCE(lead->grp_refs <= 0))
> + return false;
> +
> + /*
> + * Set linked leader as failed if any member is failed, so
> + * the remained link chain can be terminated
> + */
> + if (unlikely((req->flags & REQ_F_FAIL) &&
> + ((lead->flags & IO_REQ_LINK_FLAGS) && lead->link)))
> + req_set_fail(lead);
> + return !--lead->grp_refs;
> +}
> +
> +/* Complete group request and collect completed leader for freeing */
> +static inline void io_complete_group_req(struct io_kiocb *req,
> + struct io_wq_work_list *grp_list)
> +{
> + struct io_kiocb *lead = get_group_leader(req);
> +
> + if (__io_complete_group_req(req, lead)) {
> + req->flags &= ~REQ_F_SQE_GROUP;
> + lead->flags &= ~REQ_F_SQE_GROUP_LEADER;
> + if (!(lead->flags & REQ_F_CQE_SKIP))
> + io_req_commit_cqe(lead, lead->ctx->lockless_cq);
> +
> + if (req != lead) {
> + /*
> + * Add leader to free list if it isn't there
> + * otherwise clearing group flag for freeing it
> + * in current batch
> + */
> + if (!(lead->flags & REQ_F_SQE_GROUP))
> + wq_list_add_tail(&lead->comp_list, grp_list);
> + else
> + lead->flags &= ~REQ_F_SQE_GROUP;
> + }
> + } else if (req != lead) {
> + req->flags &= ~REQ_F_SQE_GROUP;
> + } else {
> + /*
> + * Leader's group flag clearing is delayed until it is
> + * removed from free list
> + */
> + }
> +}
> +
> static void io_req_complete_post(struct io_kiocb *req, unsigned issue_flags)
> {
> struct io_ring_ctx *ctx = req->ctx;
> @@ -1427,6 +1545,17 @@ static void io_free_batch_list(struct io_ring_ctx *ctx,
> comp_list);
>
> if (unlikely(req->flags & IO_REQ_CLEAN_SLOW_FLAGS)) {
> + /*
> + * Group leader may be removed twice, don't free it
> + * if group flag isn't cleared, when some members
> + * aren't completed yet
> + */
> + if (req->flags & REQ_F_SQE_GROUP) {
> + node = req->comp_list.next;
> + req->flags &= ~REQ_F_SQE_GROUP;
> + continue;
> + }
> +
> if (req->flags & REQ_F_REFCOUNT) {
> node = req->comp_list.next;
> if (!req_ref_put_and_test(req))
> @@ -1459,6 +1588,7 @@ void __io_submit_flush_completions(struct io_ring_ctx *ctx)
> __must_hold(&ctx->uring_lock)
> {
> struct io_submit_state *state = &ctx->submit_state;
> + struct io_wq_work_list grp_list = {NULL};
> struct io_wq_work_node *node;
>
> __io_cq_lock(ctx);
> @@ -1468,9 +1598,15 @@ void __io_submit_flush_completions(struct io_ring_ctx *ctx)
>
> if (!(req->flags & REQ_F_CQE_SKIP))
> io_req_commit_cqe(req, ctx->lockless_cq);
> +
> + if (req->flags & REQ_F_SQE_GROUP)
Same note as above about the hot path.
> + io_complete_group_req(req, &grp_list);
> }
> __io_cq_unlock_post(ctx);
>
> + if (!wq_list_empty(&grp_list))
> + __wq_list_splice(&grp_list, state->compl_reqs.first);
What's the point of splicing it here instead of doing all of
that under the REQ_F_SQE_GROUP check above?
> +
> if (!wq_list_empty(&ctx->submit_state.compl_reqs)) {
> io_free_batch_list(ctx, state->compl_reqs.first);
> INIT_WQ_LIST(&state->compl_reqs);
> @@ -1677,8 +1813,12 @@ static u32 io_get_sequence(struct io_kiocb *req)
> struct io_kiocb *cur;
>
> /* need original cached_sq_head, but it was increased for each req */
> - io_for_each_link(cur, req)
> - seq--;
> + io_for_each_link(cur, req) {
> + if (req_is_group_leader(cur))
> + seq -= cur->grp_refs;
> + else
> + seq--;
> + }
> return seq;
> }
>
> @@ -1793,11 +1933,20 @@ struct io_wq_work *io_wq_free_work(struct io_wq_work *work)
> struct io_kiocb *nxt = NULL;
>
> if (req_ref_put_and_test(req)) {
> - if (req->flags & IO_REQ_LINK_FLAGS)
> - nxt = io_req_find_next(req);
> + /*
> + * CQEs have been posted in io_req_complete_post() except
> + * for group leader, and we can't advance the link for
> + * group leader until its CQE is posted.
> + *
> + * TODO: try to avoid defer and complete leader in io_wq
> + * context directly
> + */
> + if (!req_is_group_leader(req)) {
> + req->flags |= REQ_F_CQE_SKIP;
> + if (req->flags & IO_REQ_LINK_FLAGS)
> + nxt = io_req_find_next(req);
> + }
>
> - /* we have posted CQEs in io_req_complete_post() */
> - req->flags |= REQ_F_CQE_SKIP;
> io_free_req(req);
> }
> return nxt ? &nxt->work : NULL;
> @@ -1863,6 +2012,8 @@ void io_wq_submit_work(struct io_wq_work *work)
> }
> }
>
> + if (need_queue_group_members(req))
> + io_queue_group_members(req, true);
> do {
> ret = io_issue_sqe(req, issue_flags);
> if (ret != -EAGAIN)
> @@ -1977,6 +2128,9 @@ static inline void io_queue_sqe(struct io_kiocb *req)
> */
> if (unlikely(ret))
> io_queue_async(req, ret);
> +
> + if (need_queue_group_members(req))
> + io_queue_group_members(req, false);
Request ownership is considered handed off at this point and the
request should not be touched. Only with ret == 0 from
io_issue_sqe() is it still ours; otherwise it has already been
handed off somewhere by io_queue_async().
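To spell out the lifetime rule at play here as comments
(illustrative only, not a proposed fix):

	ret = io_issue_sqe(req, IO_URING_F_NONBLOCK | IO_URING_F_COMPLETE_DEFER);
	if (unlikely(ret))
		io_queue_async(req, ret);	/* req handed off here too */

	/*
	 * By this point the request may already belong to io-wq or task
	 * work, or may have been completed and freed; touching req->flags
	 * or req->grp_link is only safe while we still own the request.
	 */
	if (need_queue_group_members(req))
		io_queue_group_members(req, false);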
> }
>
> static void io_queue_sqe_fallback(struct io_kiocb *req)
> @@ -2142,6 +2296,56 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
> return def->prep(req, sqe);
> }
>
> +static struct io_kiocb *io_group_sqe(struct io_submit_link *group,
> + struct io_kiocb *req)
> +{
> + /*
> + * Group chain is similar with link chain: starts with 1st sqe with
> + * REQ_F_SQE_GROUP, and ends with the 1st sqe without REQ_F_SQE_GROUP
> + */
> + if (group->head) {
> + struct io_kiocb *lead = group->head;
> +
> + /* members can't be in link chain, can't be drained */
> + req->flags &= ~(IO_REQ_LINK_FLAGS | REQ_F_IO_DRAIN);
> + lead->grp_refs += 1;
> + group->last->grp_link = req;
> + group->last = req;
> +
> + if (req->flags & REQ_F_SQE_GROUP)
> + return NULL;
> +
> + req->grp_link = NULL;
> + req->flags |= REQ_F_SQE_GROUP;
> + group->head = NULL;
> + return lead;
> + } else if (req->flags & REQ_F_SQE_GROUP) {
> + group->head = req;
> + group->last = req;
> + req->grp_refs = 1;
> + req->flags |= REQ_F_SQE_GROUP_LEADER;
> + return NULL;
> + } else {
> + return req;
> + }
> +}
> +
> +static __cold struct io_kiocb *io_submit_fail_group(
> + struct io_submit_link *link, struct io_kiocb *req)
> +{
> + struct io_kiocb *lead = link->head;
> +
> + /*
> + * Instead of failing eagerly, continue assembling the group link
> + * if applicable and mark the leader with REQ_F_FAIL. The group
> + * flushing code should find the flag and handle the rest
> + */
> + if (lead && (lead->flags & IO_REQ_LINK_FLAGS) && !(lead->flags & REQ_F_FAIL))
> + req_fail_link_node(lead, -ECANCELED);
> +
> + return io_group_sqe(link, req);
> +}
> +
> static __cold int io_submit_fail_link(struct io_submit_link *link,
> struct io_kiocb *req, int ret)
> {
> @@ -2180,11 +2384,18 @@ static __cold int io_submit_fail_init(const struct io_uring_sqe *sqe,
> {
> struct io_ring_ctx *ctx = req->ctx;
> struct io_submit_link *link = &ctx->submit_state.link;
> + struct io_submit_link *group = &ctx->submit_state.group;
>
> trace_io_uring_req_failed(sqe, req, ret);
>
> req_fail_link_node(req, ret);
>
> + if (group->head || (req->flags & REQ_F_SQE_GROUP)) {
> + req = io_submit_fail_group(group, req);
> + if (!req)
> + return 0;
> + }
> +
> /* cover both linked and non-linked request */
> return io_submit_fail_link(link, req, ret);
> }
> @@ -2232,7 +2443,7 @@ static inline int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
> const struct io_uring_sqe *sqe)
> __must_hold(&ctx->uring_lock)
> {
> - struct io_submit_link *link = &ctx->submit_state.link;
> + struct io_submit_state *state = &ctx->submit_state;
> int ret;
>
> ret = io_init_req(ctx, req, sqe);
> @@ -2241,9 +2452,17 @@ static inline int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
>
> trace_io_uring_submit_req(req);
>
> - if (unlikely(link->head || (req->flags & (IO_REQ_LINK_FLAGS |
> - REQ_F_FORCE_ASYNC | REQ_F_FAIL)))) {
> - req = io_link_sqe(link, req);
> + if (unlikely(state->group.head ||
A note, more to myself and for the future: all these checks,
including links and groups, can be folded under one common if, e.g.:
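A sketch only; io_slow_path_sqe() is a hypothetical combined helper:

	if (unlikely(state->group.head || state->link.head ||
		     (req->flags & (REQ_F_SQE_GROUP | IO_REQ_LINK_FLAGS |
				    REQ_F_FORCE_ASYNC | REQ_F_FAIL)))) {
		req = io_slow_path_sqe(state, req);
		if (!req)
			return 0;
	}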
> + (req->flags & REQ_F_SQE_GROUP))) {
> + req = io_group_sqe(&state->group, req);
> + if (!req)
> + return 0;
> + }
> +
> + if (unlikely(state->link.head ||
> + (req->flags & (IO_REQ_LINK_FLAGS | REQ_F_FORCE_ASYNC |
> + REQ_F_FAIL)))) {
> + req = io_link_sqe(&state->link, req);
> if (!req)
> return 0;
> }
> @@ -2258,6 +2477,17 @@ static void io_submit_state_end(struct io_ring_ctx *ctx)
> {
> struct io_submit_state *state = &ctx->submit_state;
>
> + /* the last member must set REQ_F_SQE_GROUP */
> + if (unlikely(state->group.head)) {
> + struct io_kiocb *lead = state->group.head;
> +
> + state->group.last->grp_link = NULL;
> + if (lead->flags & IO_REQ_LINK_FLAGS)
> + io_link_sqe(&state->link, lead);
> + else
> + io_queue_sqe_fallback(lead);
> + }
> +
> if (unlikely(state->link.head))
> io_queue_sqe_fallback(state->link.head);
> /* flush only after queuing links as they can generate completions */
> @@ -2277,6 +2507,7 @@ static void io_submit_state_start(struct io_submit_state *state,
> state->submit_nr = max_ios;
> /* set only head, no need to init link_last in advance */
> state->link.head = NULL;
> + state->group.head = NULL;
> }
>
> static void io_commit_sqring(struct io_ring_ctx *ctx)
> diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
> index 624ca9076a50..b11db3bdd8d8 100644
> --- a/io_uring/io_uring.h
> +++ b/io_uring/io_uring.h
> @@ -67,6 +67,8 @@ void io_req_defer_failed(struct io_kiocb *req, s32 res);
> bool io_post_aux_cqe(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags);
> bool io_req_post_cqe(struct io_kiocb *req, s32 res, u32 cflags);
> void __io_commit_cqring_flush(struct io_ring_ctx *ctx);
> +void io_queue_group_members(struct io_kiocb *req, bool async);
> +void io_cancel_group_members(struct io_kiocb *req, bool ignore_cqes);
>
> struct file *io_file_get_normal(struct io_kiocb *req, int fd);
> struct file *io_file_get_fixed(struct io_kiocb *req, int fd,
> @@ -342,6 +344,16 @@ static inline void io_tw_lock(struct io_ring_ctx *ctx, struct io_tw_state *ts)
> lockdep_assert_held(&ctx->uring_lock);
> }
>
> +static inline bool req_is_group_leader(struct io_kiocb *req)
> +{
> + return req->flags & REQ_F_SQE_GROUP_LEADER;
> +}
> +
> +static inline bool req_is_group_member(struct io_kiocb *req)
> +{
> + return !req_is_group_leader(req) && (req->flags & REQ_F_SQE_GROUP);
> +}
> +
> /*
> * Don't complete immediately but use deferred completion infrastructure.
> * Protected by ->uring_lock and can only be used either with
> @@ -355,6 +367,10 @@ static inline void io_req_complete_defer(struct io_kiocb *req)
> lockdep_assert_held(&req->ctx->uring_lock);
>
> wq_list_add_tail(&req->comp_list, &state->compl_reqs);
> +
> + /* members may not be issued when leader is completed */
> + if (unlikely(req_is_group_leader(req) && req->grp_link))
> + io_queue_group_members(req, false);
> }
>
> static inline void io_commit_cqring_flush(struct io_ring_ctx *ctx)
--
Pavel Begunkov