From: Pavel Begunkov <[email protected]>
To: Ming Lei <[email protected]>, Jens Axboe <[email protected]>,
	[email protected], [email protected]
Cc: Kevin Wolf <[email protected]>
Subject: Re: [PATCH V5 4/8] io_uring: support SQE group
Date: Tue, 27 Aug 2024 16:18:26 +0100
Message-ID: <[email protected]>
In-Reply-To: <[email protected]>

On 8/8/24 17:24, Ming Lei wrote:
> SQE group is defined as one chain of SQEs starting with the first SQE that
> has IOSQE_SQE_GROUP set, and ending with the first subsequent SQE that
> doesn't have it set; it is similar to a chain of linked SQEs.
> 
> Unlike linked SQEs, where each SQE is issued only after the previous one
> completes, all SQEs in one group can be submitted in parallel. To simplify
> the initial implementation, all members are queued after the leader
> completes; this may change in the future so that the leader and members
> are issued concurrently.
> 
> The first SQE is the group leader, and the other SQEs are group members. The
> whole group shares a single IOSQE_IO_LINK and IOSQE_IO_DRAIN taken from the
> group leader; these two flags can't be set for group members.
> 
> When the group is part of a link chain, the group isn't submitted until the
> previous SQE or group has completed, and the following SQE or group can't be
> started until this group has completed. Failure of any group member fails the
> group leader, so the link chain can be terminated.
> 
> When IOSQE_IO_DRAIN is set on the group leader, all requests in this group
> and all previously submitted requests are drained. Since IOSQE_IO_DRAIN can
> be set on the group leader only, we respect IO_DRAIN by always completing the
> group leader as the last request in the group. It is also natural, from the
> application's viewpoint, to post the leader's CQE last.
> 
> Working together with IOSQE_IO_LINK, SQE group provides a flexible way to
> support N:M dependencies, such as:
> 
> - group A is chained with group B together
> - group A has N SQEs
> - group B has M SQEs
> 
> then M SQEs in group B depend on N SQEs in group A.
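
To make this concrete: a minimal userspace sketch of chaining a group of
parallel reads (group A) to a single write (group B) might look like the
following. It assumes liburing helpers plus the IOSQE_SQE_GROUP flag proposed
by this series; the fds, buffer layout and error handling are illustrative
only and are not taken from the patch.

#include <liburing.h>

/* sketch: N reads issued in parallel, one write that runs after all of them */
static int queue_reads_then_write(struct io_uring *ring,
				  int src_fds[], unsigned nr_src,
				  void *buf, size_t len, int dst_fd)
{
	struct io_uring_sqe *sqe;
	unsigned i;

	/* group A: leader + members, the reads may run in parallel */
	for (i = 0; i < nr_src; i++) {
		sqe = io_uring_get_sqe(ring);
		if (!sqe)
			return -EAGAIN;
		/* each file lands in its own slice of the shared buffer */
		io_uring_prep_read(sqe, src_fds[i], (char *)buf + i * len, len, 0);
		/* every SQE except the last one carries IOSQE_SQE_GROUP */
		if (i != nr_src - 1)
			sqe->flags |= IOSQE_SQE_GROUP;
		/* only the leader carries IOSQE_IO_LINK, shared by the group */
		if (i == 0)
			sqe->flags |= IOSQE_IO_LINK;
	}

	/* "group B" degenerates to one request: it starts only after group A */
	sqe = io_uring_get_sqe(ring);
	if (!sqe)
		return -EAGAIN;
	io_uring_prep_write(sqe, dst_fd, buf, nr_src * len, 0);

	return io_uring_submit(ring);
}

This corresponds to use case 1) below.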
> 
> N:M dependencies support some interesting use cases in an efficient way:
> 
> 1) read from multiple files, then write the read data into single file
> 
> 2) read from single file, and write the read data into multiple files
> 
> 3) write the same data into multiple files, then read it back from those
> files and verify that the correct data was written
> 
> Also, IOSQE_SQE_GROUP takes the last bit in sqe->flags, but we can still
> extend sqe->flags with a uring context flag, for example by using __pad3 for
> non-uring_cmd OPs and part of uring_cmd_flags for the uring_cmd OP.
> 
> Suggested-by: Kevin Wolf <[email protected]>
> Signed-off-by: Ming Lei <[email protected]>
> ---
>   include/linux/io_uring_types.h |  18 +++
...
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index f112e9fa90d8..45a292445b18 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
...
> @@ -875,6 +877,116 @@ static __always_inline void io_req_commit_cqe(struct io_ring_ctx *ctx,
>   	}
>   }
...
> +static void io_queue_group_members(struct io_kiocb *req)
> +{
> +	struct io_kiocb *member = req->grp_link;
> +
> +	if (!member)
> +		return;
> +
> +	req->grp_link = NULL;
> +	while (member) {
> +		struct io_kiocb *next = member->grp_link;
> +
> +		member->grp_leader = req;
> +		if (unlikely(member->flags & REQ_F_FAIL)) {
> +			io_req_task_queue_fail(member, member->cqe.res);
> +		} else if (unlikely(req->flags & REQ_F_FAIL)) {
> +			io_req_task_queue_fail(member, -ECANCELED);
> +		} else {
> +			io_req_task_queue(member);
> +		}
> +		member = next;
> +	}
> +}
> +
> +static inline bool __io_complete_group_member(struct io_kiocb *req,
> +			     struct io_kiocb *lead)
> +{

I think it'd be better if you inlined this function; it only
obfuscates things.
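
For illustration, the inlined version could look roughly like this (untested
sketch, just folding the helper into its only caller and collapsing the two
WARNs into one):

static void io_complete_group_member(struct io_kiocb *req)
{
	struct io_kiocb *lead = get_group_leader(req);

	if (WARN_ON_ONCE(!(req->flags & REQ_F_SQE_GROUP) ||
			 lead->grp_refs <= 0))
		return;

	/* member CQE needs to be posted first */
	if (!(req->flags & REQ_F_CQE_SKIP))
		io_req_commit_cqe(req->ctx, req);

	req->flags &= ~REQ_F_SQE_GROUP;
	if (unlikely((req->flags & REQ_F_FAIL) &&
		     (lead->flags & IO_REQ_LINK_FLAGS) && lead->link))
		req_set_fail(lead);

	if (!--lead->grp_refs) {
		/* keep GROUP on the last member so the leader is freed with it */
		req->flags |= REQ_F_SQE_GROUP;
		if (!(lead->flags & REQ_F_CQE_SKIP))
			io_req_commit_cqe(lead->ctx, lead);
	} else if (leader_is_the_last(lead)) {
		/* leader degenerates to a plain req if it is the last one */
		lead->flags &= ~(REQ_F_SQE_GROUP | REQ_F_SQE_GROUP_LEADER);
	}
}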

> +	if (WARN_ON_ONCE(lead->grp_refs <= 0))
> +		return false;
> +
> +	req->flags &= ~REQ_F_SQE_GROUP;

I'm getting completely lost as to when and why it clears and then sets
back REQ_F_SQE_GROUP and REQ_F_SQE_GROUP_LEADER. Is there any
rule?

> +	/*
> +	 * Set linked leader as failed if any member is failed, so
> +	 * the remained link chain can be terminated
> +	 */
> +	if (unlikely((req->flags & REQ_F_FAIL) &&
> +		     ((lead->flags & IO_REQ_LINK_FLAGS) && lead->link)))
> +		req_set_fail(lead);

if (req->flags & REQ_F_FAIL)
	req_set_fail(lead);

REQ_F_FAIL is not specific to links; if a request fails we need
to mark it as such.


> +	return !--lead->grp_refs;
> +}
> +
> +static inline bool leader_is_the_last(struct io_kiocb *lead)
> +{
> +	return lead->grp_refs == 1 && (lead->flags & REQ_F_SQE_GROUP);
> +}
> +
> +static void io_complete_group_member(struct io_kiocb *req)
> +{
> +	struct io_kiocb *lead = get_group_leader(req);
> +
> +	if (WARN_ON_ONCE(!(req->flags & REQ_F_SQE_GROUP)))
> +		return;
> +
> +	/* member CQE needs to be posted first */
> +	if (!(req->flags & REQ_F_CQE_SKIP))
> +		io_req_commit_cqe(req->ctx, req);
> +
> +	if (__io_complete_group_member(req, lead)) {
> +		/*
> +		 * SQE_GROUP flag is kept for the last member, so the leader
> +		 * can be retrieved & freed from this last member
> +		 */
> +		req->flags |= REQ_F_SQE_GROUP;
> +		if (!(lead->flags & REQ_F_CQE_SKIP))
> +			io_req_commit_cqe(lead->ctx, lead);
> +	} else if (leader_is_the_last(lead)) {
> +		/* leader will degenerate to plain req if it is the last */
> +		lead->flags &= ~(REQ_F_SQE_GROUP | REQ_F_SQE_GROUP_LEADER);

What's this chunk about?

> +	}
> +}
> +
> +static void io_complete_group_leader(struct io_kiocb *req)
> +{
> +	WARN_ON_ONCE(req->grp_refs <= 0);
> +	req->flags &= ~REQ_F_SQE_GROUP;
> +	req->grp_refs -= 1;
> +	WARN_ON_ONCE(req->grp_refs == 0);

Why not combine these WARN_ON_ONCE into one?

> +
> +	/* TODO: queue members with leader in parallel */

no todos, please

> +	if (req->grp_link)
> +		io_queue_group_members(req);
> +}

It's under a spinlock; we really don't want to do too much here,
like potentially queueing a ton of task work.
io_queue_group_members() can move into io_free_batch_list().
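Roughly, e.g. (untested, just to show where the call could live):

/* in io_free_batch_list(), i.e. after the CQ lock has been dropped */
if (req_is_group_leader(req)) {
	/* queue the members only once we're off the completion lock */
	io_queue_group_members(req);
	node = req->comp_list.next;
	continue;
}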

> +
>   static void io_req_complete_post(struct io_kiocb *req, unsigned issue_flags)
>   {
>   	struct io_ring_ctx *ctx = req->ctx;
> @@ -890,7 +1002,8 @@ static void io_req_complete_post(struct io_kiocb *req, unsigned issue_flags)
>   	 * Handle special CQ sync cases via task_work. DEFER_TASKRUN requires
>   	 * the submitter task context, IOPOLL protects with uring_lock.
>   	 */
> -	if (ctx->task_complete || (ctx->flags & IORING_SETUP_IOPOLL)) {
> +	if (ctx->task_complete || (ctx->flags & IORING_SETUP_IOPOLL) ||
> +	    req_is_group_leader(req)) {
>   		req->io_task_work.func = io_req_task_complete;
>   		io_req_task_work_add(req);
>   		return;
> @@ -1388,11 +1501,33 @@ static void io_free_batch_list(struct io_ring_ctx *ctx,
>   						    comp_list);
>   
>   		if (unlikely(req->flags & IO_REQ_CLEAN_SLOW_FLAGS)) {
> +			if (req->flags & (REQ_F_SQE_GROUP |
> +					  REQ_F_SQE_GROUP_LEADER)) {
> +				struct io_kiocb *leader;
> +
> +				/* Leader is freed via the last member */
> +				if (req_is_group_leader(req)) {
> +					node = req->comp_list.next;
> +					continue;
> +				}
> +
> +				/*
> +				 * Only the last member keeps GROUP flag,
> +				 * free leader and this member together
> +				 */
> +				leader = get_group_leader(req);
> +				leader->flags &= ~REQ_F_SQE_GROUP_LEADER;
> +				req->flags &= ~REQ_F_SQE_GROUP;
> +				wq_stack_add_head(&leader->comp_list,
> +						  &req->comp_list);

That's quite hacky, but at least we can replace it with
task work if it gets in the way later on.

> +			}
> +
>   			if (req->flags & REQ_F_REFCOUNT) {
>   				node = req->comp_list.next;
>   				if (!req_ref_put_and_test(req))
>   					continue;
>   			}
> +
>   			if ((req->flags & REQ_F_POLLED) && req->apoll) {
>   				struct async_poll *apoll = req->apoll;
>   
> @@ -1427,8 +1562,19 @@ void __io_submit_flush_completions(struct io_ring_ctx *ctx)
>   		struct io_kiocb *req = container_of(node, struct io_kiocb,
>   					    comp_list);
>   
> -		if (!(req->flags & REQ_F_CQE_SKIP))
> -			io_req_commit_cqe(ctx, req);
> +		if (unlikely(req->flags & (REQ_F_CQE_SKIP | REQ_F_SQE_GROUP))) {
> +			if (req->flags & REQ_F_SQE_GROUP) {
> +				if (req_is_group_leader(req))
> +					io_complete_group_leader(req);
> +				else
> +					io_complete_group_member(req);
> +				continue;
> +			}
> +
> +			if (req->flags & REQ_F_CQE_SKIP)
> +				continue;
> +		}
> +		io_req_commit_cqe(ctx, req);
>   	}
>   	__io_cq_unlock_post(ctx);
>   
...
> @@ -2101,6 +2251,62 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
>   	return def->prep(req, sqe);
>   }
>   
> +static struct io_kiocb *io_group_sqe(struct io_submit_link *group,
> +				     struct io_kiocb *req)
> +{
> +	/*
> +	 * Group chain is similar with link chain: starts with 1st sqe with
> +	 * REQ_F_SQE_GROUP, and ends with the 1st sqe without REQ_F_SQE_GROUP
> +	 */
> +	if (group->head) {
> +		struct io_kiocb *lead = group->head;
> +
> +		/* members can't be in link chain, can't be drained */
> +		if (req->flags & (IO_REQ_LINK_FLAGS | REQ_F_IO_DRAIN))
> +			req_fail_link_node(lead, -EINVAL);

That should fail the entire link (if any) as well.

I have even more doubts that we want to mix links and groups at all. Apart
from nuances like this, which would be quite hard to track, the semantics
of IOSQE_CQE_SKIP_SUCCESS are unclear, and it also doesn't work with
IORING_OP_LINK_TIMEOUT.

> +
> +		lead->grp_refs += 1;
> +		group->last->grp_link = req;
> +		group->last = req;
> +
> +		if (req->flags & REQ_F_SQE_GROUP)
> +			return NULL;
> +
> +		req->grp_link = NULL;
> +		req->flags |= REQ_F_SQE_GROUP;
> +		group->head = NULL;
> +		if (lead->flags & REQ_F_FAIL) {
> +			io_queue_sqe_fallback(lead);

Let's say the group was in the middle of a link; it'll
complete that group and continue assembling / executing
the link, when it should've failed it and honoured the
request order.


> +			return NULL;
> +		}
> +		return lead;
> +	} else if (req->flags & REQ_F_SQE_GROUP) {
> +		group->head = req;
> +		group->last = req;
> +		req->grp_refs = 1;
> +		req->flags |= REQ_F_SQE_GROUP_LEADER;
> +		return NULL;
> +	} else {

We shouldn't be able to get here.

if (WARN_ONCE(!(req->flags & GROUP)))
	...
group->head = req;
...

> +		return req;
> +	}
> +}
> +
> +static __cold struct io_kiocb *io_submit_fail_group(
> +		struct io_submit_link *link, struct io_kiocb *req)
> +{
> +	struct io_kiocb *lead = link->head;
> +
> +	/*
> +	 * Instead of failing eagerly, continue assembling the group link
> +	 * if applicable and mark the leader with REQ_F_FAIL. The group
> +	 * flushing code should find the flag and handle the rest
> +	 */
> +	if (lead && (lead->flags & IO_REQ_LINK_FLAGS) && !(lead->flags & REQ_F_FAIL))
> +		req_fail_link_node(lead, -ECANCELED);

Same as above, you don't need to check for IO_REQ_LINK_FLAGS.
io_queue_sqe_fallback()
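
i.e. something along the lines of (sketch):

if (lead && !(lead->flags & REQ_F_FAIL))
	req_fail_link_node(lead, -ECANCELED);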


> +
> +	return io_group_sqe(link, req);
> +}
> +

-- 
Pavel Begunkov


Thread overview: 19+ messages
2024-08-08 16:24 [PATCH V5 0/8] io_uring: support sqe group and provide group kbuf Ming Lei
2024-08-08 16:24 ` [PATCH V5 1/8] io_uring: add io_link_req() helper Ming Lei
2024-08-08 16:24 ` [PATCH V5 2/8] io_uring: add io_submit_fail_link() helper Ming Lei
2024-08-08 16:24 ` [PATCH V5 3/8] io_uring: add helper of io_req_commit_cqe() Ming Lei
2024-08-08 16:24 ` [PATCH V5 4/8] io_uring: support SQE group Ming Lei
2024-08-27 15:18   ` Pavel Begunkov [this message]
2024-08-29  4:29     ` Ming Lei
2024-09-06 17:15       ` Pavel Begunkov
2024-09-07  9:36         ` Ming Lei
2024-09-10 13:12           ` Pavel Begunkov
2024-09-10 15:04             ` Ming Lei
2024-09-10 20:31               ` Pavel Begunkov
2024-09-11  1:28                 ` Ming Lei
2024-08-08 16:24 ` [PATCH V5 5/8] io_uring: support sqe group with members depending on leader Ming Lei
2024-08-08 16:24 ` [PATCH V5 6/8] io_uring: support providing sqe group buffer Ming Lei
2024-08-08 16:24 ` [PATCH V5 7/8] io_uring/uring_cmd: support provide group kernel buffer Ming Lei
2024-08-08 16:24 ` [PATCH V5 8/8] ublk: support provide io buffer Ming Lei
2024-08-17  4:16 ` [PATCH V5 0/8] io_uring: support sqe group and provide group kbuf Ming Lei
2024-08-17 19:48   ` Pavel Begunkov
