public inbox for io-uring@vger.kernel.org
 help / color / mirror / Atom feed
From: Gabriel Krisman Bertazi <krisman@suse.de>
To: Jens Axboe <axboe@kernel.dk>
Cc: io-uring@vger.kernel.org,  dvyukov@google.com
Subject: Re: [PATCH 1/2] io_uring/mpscq: add lockless multi-producer, single-consumer FIFO queue
Date: Thu, 11 Jun 2026 12:49:24 -0400	[thread overview]
Message-ID: <87ldcldnu3.fsf@mailhost.krisman.be> (raw)
In-Reply-To: <20260611160553.1486640-2-axboe@kernel.dk> (Jens Axboe's message of "Thu, 11 Jun 2026 09:58:41 -0600")

Jens Axboe <axboe@kernel.dk> writes:

> Local task_work is currently using llists for managing the work,
> but that's a LIFO type of list. This means that running this task_work
> needs to reverse the list first, to ensure fairness in running the
> queued items.
>
> Add a lockless FIFO queued, based on Dmitry Vyukov's intrusive MPSC
> node-based queue algorithm, modified with an externally held consumer
> cursor and conditional stub reinsertion. See comments in the header.
>
> Producers are wait-free: a push is a single xchg() on the queue tail,
> which serializes concurrent producers and defines the FIFO order, plus
> a store linking the node to its predecessor. There are no cmpxchg retry
> loops, and pushing is safe from any context, including hardirq.
>
> The cost of linked list FIFO ordering is that a push publishes the node
> in two steps - the xchg() makes it visible as the new tail before the
> subsequent store links it into the chain that is reachable from the
> head. A consumer hitting that window gets a NULL from mpscq_pop() while
> mpscq_empty() reports false, and must retry later rather than treat the
> queue as empty. The window is two instructions wide, but a producer can
> get preempted inside it, so the consumer must not busy wait on it.
>
> The consumer side supports a single consumer at a time, with callers
> providing their own serialization. A stub node, which also defines the
> empty state (tail == stub), allows the consumer to detach the final
> node without racing against producer link stores: that node is only
> handed out once the stub has been cmpxchg'ed back in as the tail. This
> also guarantees that the previous tail returned by mpscq_push() cannot
> get freed before that push has linked it, making it always valid for
> comparisons.
>
> The consumer cursor is deliberately not part of the queue struct - the
> caller owns it and passes it to mpscq_pop(). This is done to separate
> the consumer and producers cacheline.The cursor is written for

Interesting stuff!  The commit message is truncated here, though.

>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  include/linux/io_uring_types.h |  12 ++++
>  io_uring/mpscq.h               | 121 +++++++++++++++++++++++++++++++++

There's nothing io_uring specific here.  Perhaps put in lib/ directly
some a wider audience can review  and use?

>  2 files changed, 133 insertions(+)
>  create mode 100644 io_uring/mpscq.h
>
> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
> index aa4d5477f859..85e12b4884a5 100644
> --- a/include/linux/io_uring_types.h
> +++ b/include/linux/io_uring_types.h
> @@ -55,6 +55,18 @@ struct io_wq_work_list {
>  	struct io_wq_work_node *last;
>  };
>  
> +/*
> + * Lockless multi-producer, single-consumer FIFO queue, see
> + * io_uring/mpscq.h for the implementation and rules. Defined here so
> + * that it can be embedded in io_ring_ctx. This is the producer side
> + * only - the consumer cursor is kept separately, on a cacheline that
> + * isn't dirtied by the producers.
> + */
> +struct mpscq {
> +	struct llist_node	*tail;		/* producers */
> +	struct llist_node	stub;
> +};
> +
>  struct io_wq_work {
>  	struct io_wq_work_node list;
>  	atomic_t flags;
> diff --git a/io_uring/mpscq.h b/io_uring/mpscq.h
> new file mode 100644
> index 000000000000..12172cef8394
> --- /dev/null
> +++ b/io_uring/mpscq.h
> @@ -0,0 +1,121 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef IOU_MPSCQ_H
> +#define IOU_MPSCQ_H
> +
> +/*
> + * mpscq - lockless multi-producer, single-consumer FIFO queue
> + *
> + * Unlike llist, which is LIFO ordered and hence needs an O(n)
> + * llist_reverse_order() pass before entries can be processed in queue order,
> + * this queue hands out nodes in the order they were pushed.
> + *
> + * The consumer cursor is held by the caller rather than in the queue struct
> + * (see below), and with the stub reinsertion done as a single cmpxchg attempt
> + * instead of an unconditional push, keeping tail == stub a reliable empty test
> + * while a producer is in the middle of a push.
> + *
> + * Producers may run in any context (task, softirq, hardirq) and are wait-free:
> + * a push is one xchg() plus one store, with no retry loops. FIFO order between
> + * producers is the order in which the xchg() on ->tail serializes them.
> + *
> + * The price for linked-list FIFO is that a push publishes the node in two
> + * steps: the xchg() makes it the new tail, and the subsequent store links it to
> + * its predecessor. In between, the tail end of the queue is not yet reachable
> + * from the head. mpscq_pop() detects this and returns NULL, while mpscq_empty()
> + * reports false. The consumer must not treat such a NULL as "queue empty" - it
> + * should retry later. The window is two instructions wide, but a producer can
> + * be preempted inside it, so the consumer must not spin on it while holding
> + * resources the producer might need to make progress.
> + *
> + * The consumer side only supports a single consumer at a time, callers must
> + * provide their own serialization for it. The stub node is what allows the
> + * consumer to detach the final node without racing with the link stores of
> + * producers. This scheme also guarantees that the previous tail returned by
> + * mpscq_push() cannot be freed by the consumer until the push that returned it
> + * has linked it, hence it's always safe to compare against (but not
> + * dereference, unless the caller otherwise guarantees its lifetime).
> + *
> + * The queue struct only holds the producer side. The consumer keeps its cursor
> + * (the oldest not yet handed out node) externally and passes it to mpscq_pop(),
> + * so that it can be placed on a different cacheline: the cursor is written for
> + * every pop, and having it share a line with ->tail would have the consumer
> + * invalidating the line that producers need for every push.
> + */
> +static inline void mpscq_init(struct mpscq *q, struct llist_node **headp)
> +{
> +	q->tail = *headp = &q->stub;
> +	q->stub.next = NULL;
> +}
> +
> +/*
> + * Returns true if the queue holds no entries that mpscq_pop() hasn't handed out
> + * yet. May be called from any context. Note that !empty doesn't guarantee that
> + * mpscq_pop() will return an entry yet, see the in-flight producer window
> + * above.
> + */
> +static inline bool mpscq_empty(struct mpscq *q)
> +{
> +	return READ_ONCE(q->tail) == &q->stub;
> +}
> +
> +/*
> + * Push a node onto the queue. Safe against concurrent pushes from any context,
> + * and against the (single) consumer. Returns the previous tail node, which is
> + * &q->stub if and only if the queue was empty before this push.
> + */
> +static inline struct llist_node *mpscq_push(struct mpscq *q,
> +					    struct llist_node *node)
> +{
> +	struct llist_node *prev;
> +
> +	node->next = NULL;
> +	/*
> +	 * xchg() implies a full barrier, so the initialization of the
> +	 * entry (including ->next above) is visible before the node can
> +	 * be reached, either via ->tail or via ->next chasing from the
> +	 * head once the store below has linked it.
> +	 */
> +	prev = xchg(&q->tail, node);
> +	WRITE_ONCE(prev->next, node);
> +	return prev;
> +}
> +
> +/*
> + * Pop the oldest node off the queue, or return NULL if no node is available.
> + * NULL is returned both when the queue is empty and when a producer has
> + * published a node via ->tail but hasn't linked it yet; use mpscq_empty() to
> + * tell the two apart. Single consumer only, with headp being the consumer
> + * cursor that mpscq_init() set up.
> + */
> +static inline struct llist_node *mpscq_pop(struct mpscq *q,
> +					   struct llist_node **headp)
> +{
> +	struct llist_node *head = *headp;
> +	struct llist_node *next = READ_ONCE(head->next);
> +
> +	if (head == &q->stub) {
> +		if (!next)
> +			return NULL;
> +		*headp = next;
> +		head = next;
> +		next = READ_ONCE(head->next);
> +	}
> +	if (next) {
> +		*headp = next;
> +		return head;
> +	}
> +	/*
> +	 * 'head' is the last linked node, it can only be handed out once the
> +	 * stub has taken its place as the tail. If the cmpxchg fails, a
> +	 * producer has made a new node the tail but hasn't linked it to 'head'
> +	 * yet - bail and let the caller retry.
> +	 */
> +	q->stub.next = NULL;
> +	if (try_cmpxchg(&q->tail, &head, &q->stub)) {
> +		*headp = &q->stub;
> +		return head;
> +	}
> +	return NULL;
> +}
> +
> +#endif /* IOU_MPSCQ_H */

-- 
Gabriel Krisman Bertazi

  reply	other threads:[~2026-06-11 16:49 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-11 15:58 [PATCHSET 0/2] Add lockless MPSC FIFO queue for task work Jens Axboe
2026-06-11 15:58 ` [PATCH 1/2] io_uring/mpscq: add lockless multi-producer, single-consumer FIFO queue Jens Axboe
2026-06-11 16:49   ` Gabriel Krisman Bertazi [this message]
2026-06-11 16:58     ` Jens Axboe
2026-06-12  1:13   ` Caleb Sander Mateos
2026-06-12  2:21     ` Jens Axboe
2026-06-12  2:41       ` Caleb Sander Mateos
2026-06-11 15:58 ` [PATCH 2/2] io_uring: switch local task_work to a mpscq Jens Axboe
2026-06-12  1:14   ` Caleb Sander Mateos
2026-06-12  2:23     ` Jens Axboe
2026-06-12  5:24       ` Caleb Sander Mateos
2026-06-12 12:21         ` Jens Axboe
2026-06-12 15:11           ` Jens Axboe
2026-06-15 17:55             ` Caleb Sander Mateos
2026-06-15 18:00               ` Jens Axboe
2026-06-16 20:21                 ` Caleb Sander Mateos

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87ldcldnu3.fsf@mailhost.krisman.be \
    --to=krisman@suse.de \
    --cc=axboe@kernel.dk \
    --cc=dvyukov@google.com \
    --cc=io-uring@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox