From: Pavel Begunkov <[email protected]>
To: Ming Lei <[email protected]>, Jens Axboe <[email protected]>,
[email protected], [email protected]
Cc: Kevin Wolf <[email protected]>
Subject: Re: [PATCH V5 4/8] io_uring: support SQE group
Date: Tue, 27 Aug 2024 16:18:26 +0100
Message-ID: <[email protected]>
In-Reply-To: <[email protected]>
On 8/8/24 17:24, Ming Lei wrote:
> SQE group is defined as one chain of SQEs starting with the first SQE that
> has IOSQE_SQE_GROUP set, and ending with the first subsequent SQE that
> doesn't have it set; it is similar to a chain of linked SQEs.
>
> Unlike linked SQEs, where each SQE is issued only after the previous one
> completes, all SQEs in one group can be submitted in parallel. To simplify
> the initial implementation, all members are queued after the leader
> completes; this may change in the future so that the leader and members
> are issued concurrently.
>
> The first SQE is the group leader, and the other SQEs are group members.
> The whole group shares a single IOSQE_IO_LINK and IOSQE_IO_DRAIN taken from
> the group leader; the two flags can't be set for group members.
>
> When the group is part of a link chain, it isn't submitted until the
> previous SQE or group completes, and the following SQE or group can't
> start until this group completes. Failure of any group member fails the
> group leader, so the link chain can be terminated.
>
> When IOSQE_IO_DRAIN is set on the group leader, all requests in this group
> and all previously submitted requests are drained. Since IOSQE_IO_DRAIN can
> be set on the group leader only, we honor IO_DRAIN by always completing the
> group leader as the last request in the group. It is also natural, from the
> application's viewpoint, to post the leader's CQE last.
>
> Working together with IOSQE_IO_LINK, SQE group provides a flexible way to
> support N:M dependencies, such as:
>
> - group A is chained with group B together
> - group A has N SQEs
> - group B has M SQEs
>
> then the M SQEs in group B depend on the N SQEs in group A.
>
> N:M dependencies enable some interesting use cases efficiently:
>
> 1) read from multiple files, then write the read data into a single file
>
> 2) read from a single file, and write the read data into multiple files
>
> 3) write the same data to multiple files, then read it back from each file
> and verify that the correct data was written
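For illustration, use case 1) above could look roughly like this from
userspace (just a sketch against liburing, assuming the flag gets exported
to userspace as IOSQE_SQE_GROUP as-is; ring, src_fds, bufs, dst_fd etc. are
placeholders):

	struct io_uring_sqe *sqe;
	int i;

	/* group A: N reads issued in parallel; the leader carries
	 * IOSQE_IO_LINK on behalf of the whole group */
	for (i = 0; i < N; i++) {
		sqe = io_uring_get_sqe(&ring);
		io_uring_prep_read(sqe, src_fds[i], bufs[i], buf_len, 0);
		if (i < N - 1)
			sqe->flags |= IOSQE_SQE_GROUP; /* last member omits it */
		if (i == 0)
			sqe->flags |= IOSQE_IO_LINK;
	}

	/* linked after group A: a single write consuming the read data */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_write(sqe, dst_fd, out_buf, out_len, 0);

	io_uring_submit(&ring);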
>
> IOSQE_SQE_GROUP also takes the last bit in sqe->flags, but sqe->flags can
> still be extended via a uring context flag, e.g. by using __pad3 for
> non-uring_cmd ops and part of uring_cmd_flags for the uring_cmd op.
>
> Suggested-by: Kevin Wolf <[email protected]>
> Signed-off-by: Ming Lei <[email protected]>
> ---
> include/linux/io_uring_types.h | 18 +++
...
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index f112e9fa90d8..45a292445b18 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
...
> @@ -875,6 +877,116 @@ static __always_inline void io_req_commit_cqe(struct io_ring_ctx *ctx,
> }
> }
...
> +static void io_queue_group_members(struct io_kiocb *req)
> +{
> + struct io_kiocb *member = req->grp_link;
> +
> + if (!member)
> + return;
> +
> + req->grp_link = NULL;
> + while (member) {
> + struct io_kiocb *next = member->grp_link;
> +
> + member->grp_leader = req;
> + if (unlikely(member->flags & REQ_F_FAIL)) {
> + io_req_task_queue_fail(member, member->cqe.res);
> + } else if (unlikely(req->flags & REQ_F_FAIL)) {
> + io_req_task_queue_fail(member, -ECANCELED);
> + } else {
> + io_req_task_queue(member);
> + }
> + member = next;
> + }
> +}
> +
> +static inline bool __io_complete_group_member(struct io_kiocb *req,
> + struct io_kiocb *lead)
> +{
I think it'd be better if you inlined this function; it only
obfuscates things as it stands.
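Folded into io_complete_group_member() it might read along these lines
(just a sketch, reusing this patch's helpers and the REQ_F_FAIL tweak
suggested below):

	static void io_complete_group_member(struct io_kiocb *req)
	{
		struct io_kiocb *lead = get_group_leader(req);

		if (WARN_ON_ONCE(!(req->flags & REQ_F_SQE_GROUP) ||
				 lead->grp_refs <= 0))
			return;

		/* member CQE needs to be posted first */
		if (!(req->flags & REQ_F_CQE_SKIP))
			io_req_commit_cqe(req->ctx, req);

		req->flags &= ~REQ_F_SQE_GROUP;
		if (unlikely(req->flags & REQ_F_FAIL))
			req_set_fail(lead);

		if (!--lead->grp_refs) {
			/* keep SQE_GROUP on the last member so the leader
			 * can be retrieved and freed from it */
			req->flags |= REQ_F_SQE_GROUP;
			if (!(lead->flags & REQ_F_CQE_SKIP))
				io_req_commit_cqe(lead->ctx, lead);
		} else if (leader_is_the_last(lead)) {
			/* leader degenerates to a plain req if it's last */
			lead->flags &= ~(REQ_F_SQE_GROUP |
					 REQ_F_SQE_GROUP_LEADER);
		}
	}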
> + if (WARN_ON_ONCE(lead->grp_refs <= 0))
> + return false;
> +
> + req->flags &= ~REQ_F_SQE_GROUP;
I'm getting completely lost as to when and why REQ_F_SQE_GROUP and
REQ_F_SQE_GROUP_LEADER get cleared and set back. Is there any rule?
> +	/*
> +	 * Set the linked leader as failed if any member fails, so the
> +	 * remaining link chain can be terminated
> +	 */
> + if (unlikely((req->flags & REQ_F_FAIL) &&
> + ((lead->flags & IO_REQ_LINK_FLAGS) && lead->link)))
> + req_set_fail(lead);
	if (req->flags & REQ_F_FAIL)
		req_set_fail(lead);

REQ_F_FAIL is not specific to links; if a request fails, we need
to mark it as such.
> + return !--lead->grp_refs;
> +}
> +
> +static inline bool leader_is_the_last(struct io_kiocb *lead)
> +{
> + return lead->grp_refs == 1 && (lead->flags & REQ_F_SQE_GROUP);
> +}
> +
> +static void io_complete_group_member(struct io_kiocb *req)
> +{
> + struct io_kiocb *lead = get_group_leader(req);
> +
> + if (WARN_ON_ONCE(!(req->flags & REQ_F_SQE_GROUP)))
> + return;
> +
> + /* member CQE needs to be posted first */
> + if (!(req->flags & REQ_F_CQE_SKIP))
> + io_req_commit_cqe(req->ctx, req);
> +
> + if (__io_complete_group_member(req, lead)) {
> + /*
> + * SQE_GROUP flag is kept for the last member, so the leader
> + * can be retrieved & freed from this last member
> + */
> + req->flags |= REQ_F_SQE_GROUP;
> + if (!(lead->flags & REQ_F_CQE_SKIP))
> + io_req_commit_cqe(lead->ctx, lead);
> + } else if (leader_is_the_last(lead)) {
> + /* leader will degenerate to plain req if it is the last */
> + lead->flags &= ~(REQ_F_SQE_GROUP | REQ_F_SQE_GROUP_LEADER);
What is this chunk about?
> + }
> +}
> +
> +static void io_complete_group_leader(struct io_kiocb *req)
> +{
> + WARN_ON_ONCE(req->grp_refs <= 0);
> + req->flags &= ~REQ_F_SQE_GROUP;
> + req->grp_refs -= 1;
> + WARN_ON_ONCE(req->grp_refs == 0);
Why not combine these two WARN_ON_ONCE() into one?
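E.g. a single check before the decrement covers both:

	WARN_ON_ONCE(req->grp_refs <= 1);
	req->grp_refs -= 1;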
> +
> + /* TODO: queue members with leader in parallel */
No TODOs, please.
> + if (req->grp_link)
> + io_queue_group_members(req);
> +}
This is under a spinlock; we really don't want to do too much here,
like potentially queueing a ton of task work items.
io_queue_group_members() could move into io_free_batch_list().
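Perhaps something like this (sketch only, the placement is the point):

	/* in io_free_batch_list(), i.e. outside the CQ lock */
	if (req_is_group_leader(req) && req->grp_link)
		io_queue_group_members(req);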
> +
> static void io_req_complete_post(struct io_kiocb *req, unsigned issue_flags)
> {
> struct io_ring_ctx *ctx = req->ctx;
> @@ -890,7 +1002,8 @@ static void io_req_complete_post(struct io_kiocb *req, unsigned issue_flags)
> * Handle special CQ sync cases via task_work. DEFER_TASKRUN requires
> * the submitter task context, IOPOLL protects with uring_lock.
> */
> - if (ctx->task_complete || (ctx->flags & IORING_SETUP_IOPOLL)) {
> + if (ctx->task_complete || (ctx->flags & IORING_SETUP_IOPOLL) ||
> + req_is_group_leader(req)) {
> req->io_task_work.func = io_req_task_complete;
> io_req_task_work_add(req);
> return;
> @@ -1388,11 +1501,33 @@ static void io_free_batch_list(struct io_ring_ctx *ctx,
> comp_list);
>
> if (unlikely(req->flags & IO_REQ_CLEAN_SLOW_FLAGS)) {
> + if (req->flags & (REQ_F_SQE_GROUP |
> + REQ_F_SQE_GROUP_LEADER)) {
> + struct io_kiocb *leader;
> +
> + /* Leader is freed via the last member */
> + if (req_is_group_leader(req)) {
> + node = req->comp_list.next;
> + continue;
> + }
> +
> +					/*
> +					 * Only the last member keeps the GROUP
> +					 * flag; free the leader and this member
> +					 * together
> +					 */
> + leader = get_group_leader(req);
> + leader->flags &= ~REQ_F_SQE_GROUP_LEADER;
> + req->flags &= ~REQ_F_SQE_GROUP;
> + wq_stack_add_head(&leader->comp_list,
> + &req->comp_list);
That's quite hacky, but at least we can replace it with
task work if it gets in the way later on.
> + }
> +
> if (req->flags & REQ_F_REFCOUNT) {
> node = req->comp_list.next;
> if (!req_ref_put_and_test(req))
> continue;
> }
> +
> if ((req->flags & REQ_F_POLLED) && req->apoll) {
> struct async_poll *apoll = req->apoll;
>
> @@ -1427,8 +1562,19 @@ void __io_submit_flush_completions(struct io_ring_ctx *ctx)
> struct io_kiocb *req = container_of(node, struct io_kiocb,
> comp_list);
>
> - if (!(req->flags & REQ_F_CQE_SKIP))
> - io_req_commit_cqe(ctx, req);
> + if (unlikely(req->flags & (REQ_F_CQE_SKIP | REQ_F_SQE_GROUP))) {
> + if (req->flags & REQ_F_SQE_GROUP) {
> + if (req_is_group_leader(req))
> + io_complete_group_leader(req);
> + else
> + io_complete_group_member(req);
> + continue;
> + }
> +
> + if (req->flags & REQ_F_CQE_SKIP)
> + continue;
> + }
> + io_req_commit_cqe(ctx, req);
> }
> __io_cq_unlock_post(ctx);
>
...
> @@ -2101,6 +2251,62 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
> return def->prep(req, sqe);
> }
>
> +static struct io_kiocb *io_group_sqe(struct io_submit_link *group,
> + struct io_kiocb *req)
> +{
> +	/*
> +	 * A group chain is similar to a link chain: it starts with the 1st sqe
> +	 * with REQ_F_SQE_GROUP set, and ends with the 1st sqe that doesn't
> +	 * have it set
> +	 */
> + if (group->head) {
> + struct io_kiocb *lead = group->head;
> +
> + /* members can't be in link chain, can't be drained */
> + if (req->flags & (IO_REQ_LINK_FLAGS | REQ_F_IO_DRAIN))
> + req_fail_link_node(lead, -EINVAL);
That should fail the entire link (if any) as well.

I have even more doubts that we want to mix links and groups at all.
Apart from nuances like this one, which would be quite hard to track,
the semantics of IOSQE_CQE_SKIP_SUCCESS are unclear, and it doesn't
work with IORING_OP_LINK_TIMEOUT either.
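For the first point, something along these lines perhaps (sketch only;
assumes the enclosing link head is reachable via the ctx submit state):

	if (req->flags & (IO_REQ_LINK_FLAGS | REQ_F_IO_DRAIN)) {
		struct io_submit_link *link = &req->ctx->submit_state.link;

		req_fail_link_node(lead, -EINVAL);
		/* also fail the link this group is a part of, if any */
		if (link->head)
			req_fail_link_node(link->head, -ECANCELED);
	}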
> +
> + lead->grp_refs += 1;
> + group->last->grp_link = req;
> + group->last = req;
> +
> + if (req->flags & REQ_F_SQE_GROUP)
> + return NULL;
> +
> + req->grp_link = NULL;
> + req->flags |= REQ_F_SQE_GROUP;
> + group->head = NULL;
> + if (lead->flags & REQ_F_FAIL) {
> + io_queue_sqe_fallback(lead);
Say the group is in the middle of a link: this will complete the group
and continue assembling / executing the link, when it should have
failed it and honoured the request order.
> + return NULL;
> + }
> + return lead;
> + } else if (req->flags & REQ_F_SQE_GROUP) {
> + group->head = req;
> + group->last = req;
> + req->grp_refs = 1;
> + req->flags |= REQ_F_SQE_GROUP_LEADER;
> + return NULL;
> + } else {
We shouldn't be able to get here without the flag set, e.g.:

	if (WARN_ON_ONCE(!(req->flags & REQ_F_SQE_GROUP)))
		return req;
	group->head = req;
	group->last = req;
	req->grp_refs = 1;
	req->flags |= REQ_F_SQE_GROUP_LEADER;
	return NULL;
> + return req;
> + }
> +}
> +
> +static __cold struct io_kiocb *io_submit_fail_group(
> + struct io_submit_link *link, struct io_kiocb *req)
> +{
> + struct io_kiocb *lead = link->head;
> +
> + /*
> + * Instead of failing eagerly, continue assembling the group link
> + * if applicable and mark the leader with REQ_F_FAIL. The group
> + * flushing code should find the flag and handle the rest
> + */
> + if (lead && (lead->flags & IO_REQ_LINK_FLAGS) && !(lead->flags & REQ_F_FAIL))
> + req_fail_link_node(lead, -ECANCELED);
Same as above, you don't need to check for IO_REQ_LINK_FLAGS here; a
failed leader should be marked regardless, and io_queue_sqe_fallback()
will take care of it.
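I.e. (sketch):

	static __cold struct io_kiocb *io_submit_fail_group(
			struct io_submit_link *link, struct io_kiocb *req)
	{
		struct io_kiocb *lead = link->head;

		if (lead && !(lead->flags & REQ_F_FAIL))
			req_fail_link_node(lead, -ECANCELED);
		return io_group_sqe(link, req);
	}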
> +
> + return io_group_sqe(link, req);
> +}
> +
--
Pavel Begunkov