public inbox for io-uring@vger.kernel.org
* [PATCH v3 0/5] io_uring cmd for tx timestamps
@ 2025-06-12  9:09 Pavel Begunkov
  2025-06-12  9:09 ` [PATCH v3 1/5] net: timestamp: add helper returning skb's tx tstamp Pavel Begunkov
                   ` (5 more replies)
  0 siblings, 6 replies; 14+ messages in thread
From: Pavel Begunkov @ 2025-06-12  9:09 UTC (permalink / raw)
  To: io-uring, Vadim Fedorenko
  Cc: asml.silence, netdev, Eric Dumazet, Kuniyuki Iwashima,
	Paolo Abeni, Willem de Bruijn, David S . Miller, Jakub Kicinski,
	Richard Cochran, Stanislav Fomichev, Jason Xing

Vadim Fedorenko suggested adding an alternative API for receiving
tx timestamps through io_uring. The series introduces an io_uring socket
cmd for fetching tx timestamps. It is a polled multishot request,
i.e. it internally polls the socket for POLLERR and posts timestamps
when they arrive. For the API description see Patch 5.

It reuses existing timestamp infra and takes them from the socket's
error queue. For networking people the important parts are Patch 1,
and io_uring_cmd_timestamp() from Patch 5 walking the error queue.

It should be reasonable to take it through the io_uring tree once
we have consensus, but let me know if there are any concerns.

v3: Add a flag to distinguish sw vs hw timestamps. skb_get_tx_timestamp()
    from Patch 1 now returns an indication of that, and in Patch 5
    it's converted into an io_uring CQE bit flag.

v2: remove (rx) false timestamp handling
    fix skipping already queued events on request submission
    constify socket in a helper

Pavel Begunkov (5):
  net: timestamp: add helper returning skb's tx tstamp
  io_uring/poll: introduce io_arm_apoll()
  io_uring/cmd: allow multishot polled commands
  io_uring: add mshot helper for posting CQE32
  io_uring/netcmd: add tx timestamping cmd support

 include/net/sock.h            |  9 ++++
 include/uapi/linux/io_uring.h |  9 ++++
 io_uring/cmd_net.c            | 82 +++++++++++++++++++++++++++++++++++
 io_uring/io_uring.c           | 40 +++++++++++++++++
 io_uring/io_uring.h           |  1 +
 io_uring/poll.c               | 44 +++++++++++--------
 io_uring/poll.h               |  1 +
 io_uring/uring_cmd.c          | 34 +++++++++++++++
 io_uring/uring_cmd.h          |  7 +++
 net/socket.c                  | 45 +++++++++++++++++++
 10 files changed, 255 insertions(+), 17 deletions(-)

-- 
2.49.0



* [PATCH v3 1/5] net: timestamp: add helper returning skb's tx tstamp
  2025-06-12  9:09 [PATCH v3 0/5] io_uring cmd for tx timestamps Pavel Begunkov
@ 2025-06-12  9:09 ` Pavel Begunkov
  2025-06-12 21:20   ` Willem de Bruijn
  2025-06-12  9:09 ` [PATCH v3 2/5] io_uring/poll: introduce io_arm_apoll() Pavel Begunkov
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 14+ messages in thread
From: Pavel Begunkov @ 2025-06-12  9:09 UTC (permalink / raw)
  To: io-uring, Vadim Fedorenko
  Cc: asml.silence, netdev, Eric Dumazet, Kuniyuki Iwashima,
	Paolo Abeni, Willem de Bruijn, David S . Miller, Jakub Kicinski,
	Richard Cochran, Stanislav Fomichev, Jason Xing

Add a helper function skb_get_tx_timestamp() that returns a tx timestamp
associated with an skb from an error queue.
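
For illustration only, the selection order can be modelled in userspace
roughly as below. This is a simplified sketch, not the kernel code:
struct model_skb and model_get_tx_timestamp() are hypothetical names,
the SOF_* values are copied from linux/net_tstamp.h, and the netdev and
PHC conversion paths of the real helper are left out.

```c
#include <errno.h>
#include <stdint.h>

/* Flag values as defined in include/uapi/linux/net_tstamp.h. */
#define SOF_TIMESTAMPING_SOFTWARE	(1U << 4)
#define SOF_TIMESTAMPING_RAW_HARDWARE	(1U << 6)

/* Return values introduced by this patch. */
enum {
	NET_TIMESTAMP_ORIGIN_SW = 0,
	NET_TIMESTAMP_ORIGIN_HW = 1,
};

/* Hypothetical stand-in for the two timestamps an skb can carry. */
struct model_skb {
	int64_t sw_ns;	/* models skb->tstamp, 0 means "not set" */
	int64_t hw_ns;	/* models skb_hwtstamps(skb)->hwtstamp */
};

/*
 * Prefer a software timestamp when it is wanted and present, otherwise
 * fall back to a hardware one. Returns the origin or -ENOENT.
 */
int model_get_tx_timestamp(const struct model_skb *skb, uint32_t tsflags,
			   int64_t *ns)
{
	if ((tsflags & SOF_TIMESTAMPING_SOFTWARE) && skb->sw_ns) {
		*ns = skb->sw_ns;
		return NET_TIMESTAMP_ORIGIN_SW;
	}
	if ((tsflags & SOF_TIMESTAMPING_RAW_HARDWARE) && skb->hw_ns) {
		*ns = skb->hw_ns;
		return NET_TIMESTAMP_ORIGIN_HW;
	}
	return -ENOENT;
}
```

A non-negative return tells the caller which clock produced the value,
which is what Patch 5 turns into a CQE flag.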

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 include/net/sock.h |  9 +++++++++
 net/socket.c       | 45 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 54 insertions(+)

diff --git a/include/net/sock.h b/include/net/sock.h
index 92e7c1aae3cc..0b96196d8a34 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2677,6 +2677,15 @@ void __sock_recv_timestamp(struct msghdr *msg, struct sock *sk,
 void __sock_recv_wifi_status(struct msghdr *msg, struct sock *sk,
 			     struct sk_buff *skb);
 
+enum {
+	NET_TIMESTAMP_ORIGIN_SW		= 0,
+	NET_TIMESTAMP_ORIGIN_HW		= 1,
+};
+
+bool skb_has_tx_timestamp(struct sk_buff *skb, const struct sock *sk);
+int skb_get_tx_timestamp(struct sk_buff *skb, struct sock *sk,
+			 struct timespec64 *ts);
+
 static inline void
 sock_recv_timestamp(struct msghdr *msg, struct sock *sk, struct sk_buff *skb)
 {
diff --git a/net/socket.c b/net/socket.c
index 9a0e720f0859..9bb618c32d65 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -843,6 +843,51 @@ static void put_ts_pktinfo(struct msghdr *msg, struct sk_buff *skb,
 		 sizeof(ts_pktinfo), &ts_pktinfo);
 }
 
+bool skb_has_tx_timestamp(struct sk_buff *skb, const struct sock *sk)
+{
+	const struct sock_exterr_skb *serr = SKB_EXT_ERR(skb);
+	u32 tsflags = READ_ONCE(sk->sk_tsflags);
+
+	if (serr->ee.ee_errno != ENOMSG ||
+	   serr->ee.ee_origin != SO_EE_ORIGIN_TIMESTAMPING)
+		return false;
+
+	/* software time stamp available and wanted */
+	if ((tsflags & SOF_TIMESTAMPING_SOFTWARE) && skb->tstamp)
+		return true;
+	/* hardware time stamps available and wanted */
+	return (tsflags & SOF_TIMESTAMPING_RAW_HARDWARE) &&
+		skb_hwtstamps(skb)->hwtstamp;
+}
+
+int skb_get_tx_timestamp(struct sk_buff *skb, struct sock *sk,
+			  struct timespec64 *ts)
+{
+	u32 tsflags = READ_ONCE(sk->sk_tsflags);
+	ktime_t hwtstamp;
+	int if_index = 0;
+
+	if ((tsflags & SOF_TIMESTAMPING_SOFTWARE) &&
+	    ktime_to_timespec64_cond(skb->tstamp, ts))
+		return NET_TIMESTAMP_ORIGIN_SW;
+
+	if (!(tsflags & SOF_TIMESTAMPING_RAW_HARDWARE) ||
+	    skb_is_swtx_tstamp(skb, false))
+		return -ENOENT;
+
+	if (skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP_NETDEV)
+		hwtstamp = get_timestamp(sk, skb, &if_index);
+	else
+		hwtstamp = skb_hwtstamps(skb)->hwtstamp;
+
+	if (tsflags & SOF_TIMESTAMPING_BIND_PHC)
+		hwtstamp = ptp_convert_timestamp(&hwtstamp,
+						READ_ONCE(sk->sk_bind_phc));
+	if (!ktime_to_timespec64_cond(hwtstamp, ts))
+		return -ENOENT;
+	return NET_TIMESTAMP_ORIGIN_HW;
+}
+
 /*
  * called from sock_recv_timestamp() if sock_flag(sk, SOCK_RCVTSTAMP)
  */
-- 
2.49.0



* [PATCH v3 2/5] io_uring/poll: introduce io_arm_apoll()
  2025-06-12  9:09 [PATCH v3 0/5] io_uring cmd for tx timestamps Pavel Begunkov
  2025-06-12  9:09 ` [PATCH v3 1/5] net: timestamp: add helper returning skb's tx tstamp Pavel Begunkov
@ 2025-06-12  9:09 ` Pavel Begunkov
  2025-06-12  9:09 ` [PATCH v3 3/5] io_uring/cmd: allow multishot polled commands Pavel Begunkov
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 14+ messages in thread
From: Pavel Begunkov @ 2025-06-12  9:09 UTC (permalink / raw)
  To: io-uring, Vadim Fedorenko
  Cc: asml.silence, netdev, Eric Dumazet, Kuniyuki Iwashima,
	Paolo Abeni, Willem de Bruijn, David S . Miller, Jakub Kicinski,
	Richard Cochran, Stanislav Fomichev, Jason Xing

In preparation for allowing commands to do file polling, add a helper
that takes the desired poll event mask and arms it for polling. We won't
be able to use io_arm_poll_handler() with IORING_OP_URING_CMD as it
tries to infer the mask from the opcode data, and we can't unify it
across all commands.
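
As a rough userspace model of the split (a sketch only, with
hypothetical model_* names and the epoll constants from sys/epoll.h
standing in for the kernel's poll bits), the opcode-specific part
computes the mask and the generic part only adds the arming bits:

```c
#include <stdbool.h>
#include <stdint.h>
#include <sys/epoll.h>

#ifndef EPOLLEXCLUSIVE
#define EPOLLEXCLUSIVE (1U << 28)	/* fallback for old libc headers */
#endif

/* Hypothetical stand-in for the relevant io_issue_def fields. */
struct model_issue_def {
	bool pollin;
	bool pollout;
	bool poll_exclusive;
};

/* Models io_arm_apoll(): only adds bits that apply to every caller. */
uint32_t model_arm_apoll_mask(uint32_t mask, bool multishot)
{
	mask |= EPOLLET;
	if (!multishot)
		mask |= EPOLLONESHOT;
	return mask;
}

/*
 * Models the mask inference left in io_arm_poll_handler().
 * Returns 0 when the opcode does not support polling at all.
 */
uint32_t model_opcode_mask(const struct model_issue_def *def,
			   bool clear_pollin)
{
	uint32_t mask = EPOLLPRI | EPOLLERR;

	if (!def->pollin && !def->pollout)
		return 0;
	if (def->pollin) {
		mask |= EPOLLIN | EPOLLRDNORM;
		if (clear_pollin)
			mask &= ~(uint32_t)EPOLLIN;
	} else {
		mask |= EPOLLOUT | EPOLLWRNORM;
	}
	if (def->poll_exclusive)
		mask |= EPOLLEXCLUSIVE;
	return mask;
}
```

A command that knows its own mask, e.g. EPOLLERR for the error queue,
can then skip the inference step and go straight to the generic part.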

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 io_uring/poll.c | 44 +++++++++++++++++++++++++++-----------------
 io_uring/poll.h |  1 +
 2 files changed, 28 insertions(+), 17 deletions(-)

diff --git a/io_uring/poll.c b/io_uring/poll.c
index 0526062e2f81..c7e9fb34563d 100644
--- a/io_uring/poll.c
+++ b/io_uring/poll.c
@@ -669,33 +669,18 @@ static struct async_poll *io_req_alloc_apoll(struct io_kiocb *req,
 	return apoll;
 }
 
-int io_arm_poll_handler(struct io_kiocb *req, unsigned issue_flags)
+int io_arm_apoll(struct io_kiocb *req, unsigned issue_flags, __poll_t mask)
 {
-	const struct io_issue_def *def = &io_issue_defs[req->opcode];
 	struct async_poll *apoll;
 	struct io_poll_table ipt;
-	__poll_t mask = POLLPRI | POLLERR | EPOLLET;
 	int ret;
 
-	if (!def->pollin && !def->pollout)
-		return IO_APOLL_ABORTED;
+	mask |= EPOLLET;
 	if (!io_file_can_poll(req))
 		return IO_APOLL_ABORTED;
 	if (!(req->flags & REQ_F_APOLL_MULTISHOT))
 		mask |= EPOLLONESHOT;
 
-	if (def->pollin) {
-		mask |= EPOLLIN | EPOLLRDNORM;
-
-		/* If reading from MSG_ERRQUEUE using recvmsg, ignore POLLIN */
-		if (req->flags & REQ_F_CLEAR_POLLIN)
-			mask &= ~EPOLLIN;
-	} else {
-		mask |= EPOLLOUT | EPOLLWRNORM;
-	}
-	if (def->poll_exclusive)
-		mask |= EPOLLEXCLUSIVE;
-
 	apoll = io_req_alloc_apoll(req, issue_flags);
 	if (!apoll)
 		return IO_APOLL_ABORTED;
@@ -712,6 +697,31 @@ int io_arm_poll_handler(struct io_kiocb *req, unsigned issue_flags)
 	return IO_APOLL_OK;
 }
 
+int io_arm_poll_handler(struct io_kiocb *req, unsigned issue_flags)
+{
+	const struct io_issue_def *def = &io_issue_defs[req->opcode];
+	__poll_t mask = POLLPRI | POLLERR;
+
+	if (!def->pollin && !def->pollout)
+		return IO_APOLL_ABORTED;
+	if (!io_file_can_poll(req))
+		return IO_APOLL_ABORTED;
+
+	if (def->pollin) {
+		mask |= EPOLLIN | EPOLLRDNORM;
+
+		/* If reading from MSG_ERRQUEUE using recvmsg, ignore POLLIN */
+		if (req->flags & REQ_F_CLEAR_POLLIN)
+			mask &= ~EPOLLIN;
+	} else {
+		mask |= EPOLLOUT | EPOLLWRNORM;
+	}
+	if (def->poll_exclusive)
+		mask |= EPOLLEXCLUSIVE;
+
+	return io_arm_apoll(req, issue_flags, mask);
+}
+
 /*
  * Returns true if we found and killed one or more poll requests
  */
diff --git a/io_uring/poll.h b/io_uring/poll.h
index 27e2db2ed4ae..c8438286dfa0 100644
--- a/io_uring/poll.h
+++ b/io_uring/poll.h
@@ -41,6 +41,7 @@ int io_poll_remove(struct io_kiocb *req, unsigned int issue_flags);
 struct io_cancel_data;
 int io_poll_cancel(struct io_ring_ctx *ctx, struct io_cancel_data *cd,
 		   unsigned issue_flags);
+int io_arm_apoll(struct io_kiocb *req, unsigned issue_flags, __poll_t mask);
 int io_arm_poll_handler(struct io_kiocb *req, unsigned issue_flags);
 bool io_poll_remove_all(struct io_ring_ctx *ctx, struct io_uring_task *tctx,
 			bool cancel_all);
-- 
2.49.0



* [PATCH v3 3/5] io_uring/cmd: allow multishot polled commands
  2025-06-12  9:09 [PATCH v3 0/5] io_uring cmd for tx timestamps Pavel Begunkov
  2025-06-12  9:09 ` [PATCH v3 1/5] net: timestamp: add helper returning skb's tx tstamp Pavel Begunkov
  2025-06-12  9:09 ` [PATCH v3 2/5] io_uring/poll: introduce io_arm_apoll() Pavel Begunkov
@ 2025-06-12  9:09 ` Pavel Begunkov
  2025-06-12  9:09 ` [PATCH v3 4/5] io_uring: add mshot helper for posting CQE32 Pavel Begunkov
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 14+ messages in thread
From: Pavel Begunkov @ 2025-06-12  9:09 UTC (permalink / raw)
  To: io-uring, Vadim Fedorenko
  Cc: asml.silence, netdev, Eric Dumazet, Kuniyuki Iwashima,
	Paolo Abeni, Willem de Bruijn, David S . Miller, Jakub Kicinski,
	Richard Cochran, Stanislav Fomichev, Jason Xing

Some commands like timestamping in the next patch can make use of
multishot polling, i.e. REQ_F_APOLL_MULTISHOT. Add support for that,
which is condensed in a single helper called io_cmd_poll_multishot().

The user who wants to continue with a request in a multishot mode must
call the function, and only if it returns 0 the user is free to proceed.
Apart from normal terminal errors, it can also end up with -EIOCBQUEUED,
in which case the user must forward it to the core io_uring. It's
forbidden to use task work while the request is executing in a multishot
mode.
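
The calling convention can be sketched with a small userspace model.
Everything named model_* here is hypothetical and only mirrors the
contract described above; EIOCBQUEUED is a kernel-internal errno, so
its value is defined locally.

```c
#include <errno.h>
#include <stdbool.h>

#ifndef EIOCBQUEUED
#define EIOCBQUEUED 529	/* kernel-internal, absent from userspace errno.h */
#endif

/* Hypothetical stand-in for the request state that matters here. */
struct model_req {
	bool multishot;		/* models REQ_F_APOLL_MULTISHOT */
	bool file_can_poll;	/* models io_arm_apoll() succeeding */
};

/* Models io_cmd_poll_multishot(): 0 means "already armed, go on". */
int model_poll_multishot(struct model_req *req)
{
	if (req->multishot)
		return 0;
	req->multishot = true;
	return req->file_can_poll ? -EIOCBQUEUED : -ECANCELED;
}

/* The pattern a command handler is expected to follow. */
int model_cmd_issue(struct model_req *req, int *events_posted)
{
	int ret = model_poll_multishot(req);

	if (ret)
		return ret;	/* forward to the core as is */
	(*events_posted)++;	/* do the actual multishot work */
	return -EAGAIN;		/* ask to be polled again */
}
```

The first issue only arms polling and posts nothing; work happens on
the subsequent poll-driven invocations.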

The API is not foolproof, hence it's not exported to modules nor exposed
in public headers.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 io_uring/uring_cmd.c | 23 +++++++++++++++++++++++
 io_uring/uring_cmd.h |  3 +++
 2 files changed, 26 insertions(+)

diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
index 9ad0ea5398c2..02cec6231831 100644
--- a/io_uring/uring_cmd.c
+++ b/io_uring/uring_cmd.c
@@ -12,6 +12,7 @@
 #include "alloc_cache.h"
 #include "rsrc.h"
 #include "uring_cmd.h"
+#include "poll.h"
 
 void io_cmd_cache_free(const void *entry)
 {
@@ -136,6 +137,9 @@ void __io_uring_cmd_do_in_task(struct io_uring_cmd *ioucmd,
 {
 	struct io_kiocb *req = cmd_to_io_kiocb(ioucmd);
 
+	if (WARN_ON_ONCE(req->flags & REQ_F_APOLL_MULTISHOT))
+		return;
+
 	ioucmd->task_work_cb = task_work_cb;
 	req->io_task_work.func = io_uring_cmd_work;
 	__io_req_task_work_add(req, flags);
@@ -158,6 +162,9 @@ void io_uring_cmd_done(struct io_uring_cmd *ioucmd, ssize_t ret, u64 res2,
 {
 	struct io_kiocb *req = cmd_to_io_kiocb(ioucmd);
 
+	if (WARN_ON_ONCE(req->flags & REQ_F_APOLL_MULTISHOT))
+		return;
+
 	io_uring_cmd_del_cancelable(ioucmd, issue_flags);
 
 	if (ret < 0)
@@ -305,3 +312,19 @@ void io_uring_cmd_issue_blocking(struct io_uring_cmd *ioucmd)
 
 	io_req_queue_iowq(req);
 }
+
+int io_cmd_poll_multishot(struct io_uring_cmd *cmd,
+			  unsigned int issue_flags, __poll_t mask)
+{
+	struct io_kiocb *req = cmd_to_io_kiocb(cmd);
+	int ret;
+
+	if (likely(req->flags & REQ_F_APOLL_MULTISHOT))
+		return 0;
+
+	req->flags |= REQ_F_APOLL_MULTISHOT;
+	mask &= ~EPOLLONESHOT;
+
+	ret = io_arm_apoll(req, issue_flags, mask);
+	return ret == IO_APOLL_OK ? -EIOCBQUEUED : -ECANCELED;
+}
diff --git a/io_uring/uring_cmd.h b/io_uring/uring_cmd.h
index a6dad47afc6b..50a6ccb831df 100644
--- a/io_uring/uring_cmd.h
+++ b/io_uring/uring_cmd.h
@@ -18,3 +18,6 @@ bool io_uring_try_cancel_uring_cmd(struct io_ring_ctx *ctx,
 				   struct io_uring_task *tctx, bool cancel_all);
 
 void io_cmd_cache_free(const void *entry);
+
+int io_cmd_poll_multishot(struct io_uring_cmd *cmd,
+			  unsigned int issue_flags, __poll_t mask);
-- 
2.49.0



* [PATCH v3 4/5] io_uring: add mshot helper for posting CQE32
  2025-06-12  9:09 [PATCH v3 0/5] io_uring cmd for tx timestamps Pavel Begunkov
                   ` (2 preceding siblings ...)
  2025-06-12  9:09 ` [PATCH v3 3/5] io_uring/cmd: allow multishot polled commands Pavel Begunkov
@ 2025-06-12  9:09 ` Pavel Begunkov
  2025-06-12  9:09 ` [PATCH v3 5/5] io_uring/netcmd: add tx timestamping cmd support Pavel Begunkov
  2025-06-12  9:15 ` [PATCH v3 0/5] io_uring cmd for tx timestamps Pavel Begunkov
  5 siblings, 0 replies; 14+ messages in thread
From: Pavel Begunkov @ 2025-06-12  9:09 UTC (permalink / raw)
  To: io-uring, Vadim Fedorenko
  Cc: asml.silence, netdev, Eric Dumazet, Kuniyuki Iwashima,
	Paolo Abeni, Willem de Bruijn, David S . Miller, Jakub Kicinski,
	Richard Cochran, Stanislav Fomichev, Jason Xing

Add a helper for posting 32 byte CQEs in a multishot mode and add a cmd
helper on top. As it specifically works with requests, the helper ignores
the passed in cqe->user_data and sets it to the one stored in the
request.

The command helper is only valid with multishot requests.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 io_uring/io_uring.c  | 40 ++++++++++++++++++++++++++++++++++++++++
 io_uring/io_uring.h  |  1 +
 io_uring/uring_cmd.c | 11 +++++++++++
 io_uring/uring_cmd.h |  4 ++++
 4 files changed, 56 insertions(+)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 98a701fc56cc..4352cf209450 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -793,6 +793,21 @@ bool io_cqe_cache_refill(struct io_ring_ctx *ctx, bool overflow)
 	return true;
 }
 
+static bool io_fill_cqe_aux32(struct io_ring_ctx *ctx,
+			      struct io_uring_cqe src_cqe[2])
+{
+	struct io_uring_cqe *cqe;
+
+	if (WARN_ON_ONCE(!(ctx->flags & IORING_SETUP_CQE32)))
+		return false;
+	if (unlikely(!io_get_cqe(ctx, &cqe)))
+		return false;
+
+	memcpy(cqe, src_cqe, 2 * sizeof(*cqe));
+	trace_io_uring_complete(ctx, NULL, cqe);
+	return true;
+}
+
 static bool io_fill_cqe_aux(struct io_ring_ctx *ctx, u64 user_data, s32 res,
 			      u32 cflags)
 {
@@ -904,6 +919,31 @@ bool io_req_post_cqe(struct io_kiocb *req, s32 res, u32 cflags)
 	return posted;
 }
 
+/*
+ * A helper for multishot requests posting additional CQEs.
+ * Should only be used from a task_work including IO_URING_F_MULTISHOT.
+ */
+bool io_req_post_cqe32(struct io_kiocb *req, struct io_uring_cqe cqe[2])
+{
+	struct io_ring_ctx *ctx = req->ctx;
+	bool posted;
+
+	lockdep_assert(!io_wq_current_is_worker());
+	lockdep_assert_held(&ctx->uring_lock);
+
+	cqe[0].user_data = req->cqe.user_data;
+	if (!ctx->lockless_cq) {
+		spin_lock(&ctx->completion_lock);
+		posted = io_fill_cqe_aux32(ctx, cqe);
+		spin_unlock(&ctx->completion_lock);
+	} else {
+		posted = io_fill_cqe_aux32(ctx, cqe);
+	}
+
+	ctx->submit_state.cq_flush = true;
+	return posted;
+}
+
 static void io_req_complete_post(struct io_kiocb *req, unsigned issue_flags)
 {
 	struct io_ring_ctx *ctx = req->ctx;
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index d59c12277d58..1263af818c47 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -81,6 +81,7 @@ void io_req_defer_failed(struct io_kiocb *req, s32 res);
 bool io_post_aux_cqe(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags);
 void io_add_aux_cqe(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags);
 bool io_req_post_cqe(struct io_kiocb *req, s32 res, u32 cflags);
+bool io_req_post_cqe32(struct io_kiocb *req, struct io_uring_cqe src_cqe[2]);
 void __io_commit_cqring_flush(struct io_ring_ctx *ctx);
 
 void io_req_track_inflight(struct io_kiocb *req);
diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
index 02cec6231831..b228b84a510f 100644
--- a/io_uring/uring_cmd.c
+++ b/io_uring/uring_cmd.c
@@ -328,3 +328,14 @@ int io_cmd_poll_multishot(struct io_uring_cmd *cmd,
 	ret = io_arm_apoll(req, issue_flags, mask);
 	return ret == IO_APOLL_OK ? -EIOCBQUEUED : -ECANCELED;
 }
+
+bool io_uring_cmd_post_mshot_cqe32(struct io_uring_cmd *cmd,
+				   unsigned int issue_flags,
+				   struct io_uring_cqe cqe[2])
+{
+	struct io_kiocb *req = cmd_to_io_kiocb(cmd);
+
+	if (WARN_ON_ONCE(!(issue_flags & IO_URING_F_MULTISHOT)))
+		return false;
+	return io_req_post_cqe32(req, cqe);
+}
diff --git a/io_uring/uring_cmd.h b/io_uring/uring_cmd.h
index 50a6ccb831df..9e11da10ecab 100644
--- a/io_uring/uring_cmd.h
+++ b/io_uring/uring_cmd.h
@@ -17,6 +17,10 @@ void io_uring_cmd_cleanup(struct io_kiocb *req);
 bool io_uring_try_cancel_uring_cmd(struct io_ring_ctx *ctx,
 				   struct io_uring_task *tctx, bool cancel_all);
 
+bool io_uring_cmd_post_mshot_cqe32(struct io_uring_cmd *cmd,
+				   unsigned int issue_flags,
+				   struct io_uring_cqe cqe[2]);
+
 void io_cmd_cache_free(const void *entry);
 
 int io_cmd_poll_multishot(struct io_uring_cmd *cmd,
-- 
2.49.0



* [PATCH v3 5/5] io_uring/netcmd: add tx timestamping cmd support
  2025-06-12  9:09 [PATCH v3 0/5] io_uring cmd for tx timestamps Pavel Begunkov
                   ` (3 preceding siblings ...)
  2025-06-12  9:09 ` [PATCH v3 4/5] io_uring: add mshot helper for posting CQE32 Pavel Begunkov
@ 2025-06-12  9:09 ` Pavel Begunkov
  2025-06-12 14:12   ` Jens Axboe
  2025-06-12 21:35   ` Willem de Bruijn
  2025-06-12  9:15 ` [PATCH v3 0/5] io_uring cmd for tx timestamps Pavel Begunkov
  5 siblings, 2 replies; 14+ messages in thread
From: Pavel Begunkov @ 2025-06-12  9:09 UTC (permalink / raw)
  To: io-uring, Vadim Fedorenko
  Cc: asml.silence, netdev, Eric Dumazet, Kuniyuki Iwashima,
	Paolo Abeni, Willem de Bruijn, David S . Miller, Jakub Kicinski,
	Richard Cochran, Stanislav Fomichev, Jason Xing

Add a new socket command which returns tx time stamps to the user. It
provides an alternative to the existing error queue recvmsg interface.
The command works in a polled multishot mode, which means io_uring will
poll the socket and keep posting timestamps until the request is
cancelled or fails in any other way (e.g. with no space in the CQ). It
reuses the net infra and grabs timestamps from the socket's error queue.

The command requires IORING_SETUP_CQE32. All non-final CQEs (marked with
IORING_CQE_F_MORE) have cqe->res set to the tskey, and the upper 16 bits
of cqe->flags keep tstype (i.e. offset by IORING_CQE_BUFFER_SHIFT). The
time value is stored in the upper part of the extended CQE. The final
completion won't have IORING_CQR_F_MORE and will have cqe->res storing
0/error.
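
A possible userspace decoder for these CQEs, as a sketch only: it
follows the definitions added to the uapi header in this patch
(IORING_CQE_F_TIMESTAMP_HW, IORING_TIMESTAMP_TSTYPE_SHIFT), while
struct ts_cqe32, struct tx_tstamp and decode_tx_tstamp() are
hypothetical names that exist only for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Existing values from include/uapi/linux/io_uring.h. */
#define IORING_CQE_F_MORE		(1U << 1)
#define IORING_CQE_BUFFER_SHIFT		16
/* Definitions added by this patch. */
#define IORING_CQE_F_TIMESTAMP_HW	((uint32_t)1 << IORING_CQE_BUFFER_SHIFT)
#define IORING_TIMESTAMP_TSTYPE_SHIFT	(IORING_CQE_BUFFER_SHIFT + 1)

/* Local view of a 32-byte CQE: 16-byte base plus struct io_timespec. */
struct ts_cqe32 {
	uint64_t user_data;
	int32_t  res;
	uint32_t flags;
	uint64_t tv_sec;
	uint64_t tv_nsec;
};

/* Hypothetical decoded form, not part of the uAPI. */
struct tx_tstamp {
	uint32_t tskey;
	uint32_t tstype;
	bool     hw;
	uint64_t tv_sec;
	uint64_t tv_nsec;
};

/*
 * Returns true and fills *out for a timestamp CQE (F_MORE set).
 * Returns false for the final CQE, where res holds 0 or -errno.
 */
bool decode_tx_tstamp(const struct ts_cqe32 *cqe, struct tx_tstamp *out)
{
	if (!(cqe->flags & IORING_CQE_F_MORE))
		return false;

	out->tskey = (uint32_t)cqe->res;
	out->tstype = cqe->flags >> IORING_TIMESTAMP_TSTYPE_SHIFT;
	out->hw = (cqe->flags & IORING_CQE_F_TIMESTAMP_HW) != 0;
	out->tv_sec = cqe->tv_sec;
	out->tv_nsec = cqe->tv_nsec;
	return true;
}
```

A consumer would call this for every CQE carrying the request's
user_data and treat a false return as the end of the multishot request.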

Suggested-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 include/uapi/linux/io_uring.h |  9 ++++
 io_uring/cmd_net.c            | 82 +++++++++++++++++++++++++++++++++++
 2 files changed, 91 insertions(+)

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index cfd17e382082..5c89e6f6d624 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -968,6 +968,15 @@ enum io_uring_socket_op {
 	SOCKET_URING_OP_SIOCOUTQ,
 	SOCKET_URING_OP_GETSOCKOPT,
 	SOCKET_URING_OP_SETSOCKOPT,
+	SOCKET_URING_OP_TX_TIMESTAMP,
+};
+
+#define IORING_CQE_F_TIMESTAMP_HW	((__u32)1 << IORING_CQE_BUFFER_SHIFT)
+#define IORING_TIMESTAMP_TSTYPE_SHIFT	(IORING_CQE_BUFFER_SHIFT + 1)
+
+struct io_timespec {
+	__u64		tv_sec;
+	__u64		tv_nsec;
 };
 
 /* Zero copy receive refill queue entry */
diff --git a/io_uring/cmd_net.c b/io_uring/cmd_net.c
index e99170c7d41a..bc2d33ea2db3 100644
--- a/io_uring/cmd_net.c
+++ b/io_uring/cmd_net.c
@@ -1,5 +1,6 @@
 #include <asm/ioctls.h>
 #include <linux/io_uring/net.h>
+#include <linux/errqueue.h>
 #include <net/sock.h>
 
 #include "uring_cmd.h"
@@ -51,6 +52,85 @@ static inline int io_uring_cmd_setsockopt(struct socket *sock,
 				  optlen);
 }
 
+static bool io_process_timestamp_skb(struct io_uring_cmd *cmd, struct sock *sk,
+				     struct sk_buff *skb, unsigned issue_flags)
+{
+	struct sock_exterr_skb *serr = SKB_EXT_ERR(skb);
+	struct io_uring_cqe cqe[2];
+	struct io_timespec *iots;
+	struct timespec64 ts;
+	u32 tstype, tskey;
+	int ret;
+
+	BUILD_BUG_ON(sizeof(struct io_uring_cqe) != sizeof(struct io_timespec));
+
+	ret = skb_get_tx_timestamp(skb, sk, &ts);
+	if (ret < 0)
+		return false;
+
+	tskey = serr->ee.ee_data;
+	tstype = serr->ee.ee_info;
+
+	cqe->user_data = 0;
+	cqe->res = tskey;
+	cqe->flags = IORING_CQE_F_MORE;
+	cqe->flags |= tstype << IORING_TIMESTAMP_TSTYPE_SHIFT;
+	if (ret == NET_TIMESTAMP_ORIGIN_HW)
+		cqe->flags |= IORING_CQE_F_TIMESTAMP_HW;
+
+	iots = (struct io_timespec *)&cqe[1];
+	iots->tv_sec = ts.tv_sec;
+	iots->tv_nsec = ts.tv_nsec;
+	return io_uring_cmd_post_mshot_cqe32(cmd, issue_flags, cqe);
+}
+
+static int io_uring_cmd_timestamp(struct socket *sock,
+				  struct io_uring_cmd *cmd,
+				  unsigned int issue_flags)
+{
+	struct sock *sk = sock->sk;
+	struct sk_buff_head *q = &sk->sk_error_queue;
+	struct sk_buff *skb, *tmp;
+	struct sk_buff_head list;
+	int ret;
+
+	if (!(issue_flags & IO_URING_F_CQE32))
+		return -EINVAL;
+	ret = io_cmd_poll_multishot(cmd, issue_flags, EPOLLERR);
+	if (unlikely(ret))
+		return ret;
+
+	if (skb_queue_empty_lockless(q))
+		return -EAGAIN;
+	__skb_queue_head_init(&list);
+
+	scoped_guard(spinlock_irq, &q->lock) {
+		skb_queue_walk_safe(q, skb, tmp) {
+			/* don't support skbs with payload */
+			if (!skb_has_tx_timestamp(skb, sk) || skb->len)
+				continue;
+			__skb_unlink(skb, q);
+			__skb_queue_tail(&list, skb);
+		}
+	}
+
+	while (1) {
+		skb = skb_peek(&list);
+		if (!skb)
+			break;
+		if (!io_process_timestamp_skb(cmd, sk, skb, issue_flags))
+			break;
+		__skb_dequeue(&list);
+		consume_skb(skb);
+	}
+
+	if (unlikely(!skb_queue_empty(&list))) {
+		scoped_guard(spinlock_irqsave, &q->lock)
+			skb_queue_splice(q, &list);
+	}
+	return -EAGAIN;
+}
+
 int io_uring_cmd_sock(struct io_uring_cmd *cmd, unsigned int issue_flags)
 {
 	struct socket *sock = cmd->file->private_data;
@@ -76,6 +156,8 @@ int io_uring_cmd_sock(struct io_uring_cmd *cmd, unsigned int issue_flags)
 		return io_uring_cmd_getsockopt(sock, cmd, issue_flags);
 	case SOCKET_URING_OP_SETSOCKOPT:
 		return io_uring_cmd_setsockopt(sock, cmd, issue_flags);
+	case SOCKET_URING_OP_TX_TIMESTAMP:
+		return io_uring_cmd_timestamp(sock, cmd, issue_flags);
 	default:
 		return -EOPNOTSUPP;
 	}
-- 
2.49.0



* Re: [PATCH v3 0/5] io_uring cmd for tx timestamps
  2025-06-12  9:09 [PATCH v3 0/5] io_uring cmd for tx timestamps Pavel Begunkov
                   ` (4 preceding siblings ...)
  2025-06-12  9:09 ` [PATCH v3 5/5] io_uring/netcmd: add tx timestamping cmd support Pavel Begunkov
@ 2025-06-12  9:15 ` Pavel Begunkov
  5 siblings, 0 replies; 14+ messages in thread
From: Pavel Begunkov @ 2025-06-12  9:15 UTC (permalink / raw)
  To: io-uring, Vadim Fedorenko
  Cc: netdev, Eric Dumazet, Kuniyuki Iwashima, Paolo Abeni,
	Willem de Bruijn, David S . Miller, Jakub Kicinski,
	Richard Cochran, Stanislav Fomichev, Jason Xing

On 6/12/25 10:09, Pavel Begunkov wrote:
> Vadim Fedorenko suggested to add an alternative API for receiving
> tx timestamps through io_uring. The series introduces io_uring socket
> cmd for fetching tx timestamps, which is a polled multishot request,
> i.e. internally polling the socket for POLLERR and posts timestamps
> when they're arrives. For the API description see Patch 5.
> 
> It reuses existing timestamp infra and takes them from the socket's
> error queue. For networking people the important parts are Patch 1,
> and io_uring_cmd_timestamp() from Patch 5 walking the error queue.
> 
> It should be reasonable to take it through the io_uring tree once
> we have consensus, but let me know if there are any concerns.
> 
> v3: Add a flag to distinguish sw vs hw timestamp. skb_get_tx_timestamp()
>      from Patch 1 now returns the indication of that, and in Patch 5
>      it's converted into a io_uring CQE bit flag.

FWIW, it's a relatively small change, but I dropped all review tags.

Also I pruned the test I've been using (derived from the tx-timestamp
selftest). Pushed it here:

https://github.com/isilence/liburing/tree/tx-timestamp

-- 
Pavel Begunkov



* Re: [PATCH v3 5/5] io_uring/netcmd: add tx timestamping cmd support
  2025-06-12  9:09 ` [PATCH v3 5/5] io_uring/netcmd: add tx timestamping cmd support Pavel Begunkov
@ 2025-06-12 14:12   ` Jens Axboe
  2025-06-12 14:26     ` Pavel Begunkov
  2025-06-12 21:35   ` Willem de Bruijn
  1 sibling, 1 reply; 14+ messages in thread
From: Jens Axboe @ 2025-06-12 14:12 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring, Vadim Fedorenko
  Cc: netdev, Eric Dumazet, Kuniyuki Iwashima, Paolo Abeni,
	Willem de Bruijn, David S . Miller, Jakub Kicinski,
	Richard Cochran, Stanislav Fomichev, Jason Xing

On 6/12/25 3:09 AM, Pavel Begunkov wrote:
> Add a new socket command which returns tx time stamps to the user. It
> provide an alternative to the existing error queue recvmsg interface.
> The command works in a polled multishot mode, which means io_uring will
> poll the socket and keep posting timestamps until the request is
> cancelled or fails in any other way (e.g. with no space in the CQ). It
> reuses the net infra and grabs timestamps from the socket's error queue.
> 
> The command requires IORING_SETUP_CQE32. All non-final CQEs (marked with
> IORING_CQE_F_MORE) have cqe->res set to the tskey, and the upper 16 bits
> of cqe->flags keep tstype (i.e. offset by IORING_CQE_BUFFER_SHIFT). The
> timevalue is store in the upper part of the extended CQE. The final
> completion won't have IORING_CQR_F_MORE and will have cqe->res storing
                        ^^^^^^^^^^^^^^^^^

Pointed this out before, but this typo is still there.

> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> index cfd17e382082..5c89e6f6d624 100644
> --- a/include/uapi/linux/io_uring.h
> +++ b/include/uapi/linux/io_uring.h
> @@ -968,6 +968,15 @@ enum io_uring_socket_op {
>  	SOCKET_URING_OP_SIOCOUTQ,
>  	SOCKET_URING_OP_GETSOCKOPT,
>  	SOCKET_URING_OP_SETSOCKOPT,
> +	SOCKET_URING_OP_TX_TIMESTAMP,
> +};
> +
> +#define IORING_CQE_F_TIMESTAMP_HW	((__u32)1 << IORING_CQE_BUFFER_SHIFT)
> +#define IORING_TIMESTAMP_TSTYPE_SHIFT	(IORING_CQE_BUFFER_SHIFT + 1)

Don't completely follow this, would at the very least need a comment.
Whether it's a HW or SW timestamp is flagged in the upper 16 bits, just
like a provided buffer ID. But since we don't use buffer IDs here, then
it's up for grabs. Do we have other commands that use the upper flags
space for command private flags?

The above makes sense, but then what is IORING_TIMESTAMP_TSTYPE_SHIFT?

> diff --git a/io_uring/cmd_net.c b/io_uring/cmd_net.c
> index e99170c7d41a..bc2d33ea2db3 100644
> --- a/io_uring/cmd_net.c
> +++ b/io_uring/cmd_net.c
> @@ -1,5 +1,6 @@
>  #include <asm/ioctls.h>
>  #include <linux/io_uring/net.h>
> +#include <linux/errqueue.h>
>  #include <net/sock.h>
>  
>  #include "uring_cmd.h"
> @@ -51,6 +52,85 @@ static inline int io_uring_cmd_setsockopt(struct socket *sock,
>  				  optlen);
>  }
>  
> +static bool io_process_timestamp_skb(struct io_uring_cmd *cmd, struct sock *sk,
> +				     struct sk_buff *skb, unsigned issue_flags)
> +{
> +	struct sock_exterr_skb *serr = SKB_EXT_ERR(skb);
> +	struct io_uring_cqe cqe[2];
> +	struct io_timespec *iots;
> +	struct timespec64 ts;
> +	u32 tstype, tskey;
> +	int ret;
> +
> +	BUILD_BUG_ON(sizeof(struct io_uring_cqe) != sizeof(struct io_timespec));
> +
> +	ret = skb_get_tx_timestamp(skb, sk, &ts);
> +	if (ret < 0)
> +		return false;
> +
> +	tskey = serr->ee.ee_data;
> +	tstype = serr->ee.ee_info;
> +
> +	cqe->user_data = 0;
> +	cqe->res = tskey;
> +	cqe->flags = IORING_CQE_F_MORE;
> +	cqe->flags |= tstype << IORING_TIMESTAMP_TSTYPE_SHIFT;
> +	if (ret == NET_TIMESTAMP_ORIGIN_HW)
> +		cqe->flags |= IORING_CQE_F_TIMESTAMP_HW;
> +
> +	iots = (struct io_timespec *)&cqe[1];
> +	iots->tv_sec = ts.tv_sec;
> +	iots->tv_nsec = ts.tv_nsec;
> +	return io_uring_cmd_post_mshot_cqe32(cmd, issue_flags, cqe);
> +}

Might help if you just commented here too on the use of the
TSTYPE_SHIFT.

-- 
Jens Axboe


* Re: [PATCH v3 5/5] io_uring/netcmd: add tx timestamping cmd support
  2025-06-12 14:12   ` Jens Axboe
@ 2025-06-12 14:26     ` Pavel Begunkov
  2025-06-12 14:31       ` Jens Axboe
  0 siblings, 1 reply; 14+ messages in thread
From: Pavel Begunkov @ 2025-06-12 14:26 UTC (permalink / raw)
  To: Jens Axboe, io-uring, Vadim Fedorenko
  Cc: netdev, Eric Dumazet, Kuniyuki Iwashima, Paolo Abeni,
	Willem de Bruijn, David S . Miller, Jakub Kicinski,
	Richard Cochran, Stanislav Fomichev, Jason Xing

On 6/12/25 15:12, Jens Axboe wrote:
> On 6/12/25 3:09 AM, Pavel Begunkov wrote:
>> Add a new socket command which returns tx time stamps to the user. It
>> provide an alternative to the existing error queue recvmsg interface.
>> The command works in a polled multishot mode, which means io_uring will
>> poll the socket and keep posting timestamps until the request is
>> cancelled or fails in any other way (e.g. with no space in the CQ). It
>> reuses the net infra and grabs timestamps from the socket's error queue.
>>
>> The command requires IORING_SETUP_CQE32. All non-final CQEs (marked with
>> IORING_CQE_F_MORE) have cqe->res set to the tskey, and the upper 16 bits
>> of cqe->flags keep tstype (i.e. offset by IORING_CQE_BUFFER_SHIFT). The
>> timevalue is store in the upper part of the extended CQE. The final
>> completion won't have IORING_CQR_F_MORE and will have cqe->res storing
>                          ^^^^^^^^^^^^^^^^^
> 
> Pointed this out before, but this typo is still there.

Forgot about that one

> 
>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
>> index cfd17e382082..5c89e6f6d624 100644
>> --- a/include/uapi/linux/io_uring.h
>> +++ b/include/uapi/linux/io_uring.h
>> @@ -968,6 +968,15 @@ enum io_uring_socket_op {
>>   	SOCKET_URING_OP_SIOCOUTQ,
>>   	SOCKET_URING_OP_GETSOCKOPT,
>>   	SOCKET_URING_OP_SETSOCKOPT,
>> +	SOCKET_URING_OP_TX_TIMESTAMP,
>> +};
>> +
>> +#define IORING_CQE_F_TIMESTAMP_HW	((__u32)1 << IORING_CQE_BUFFER_SHIFT)
>> +#define IORING_TIMESTAMP_TSTYPE_SHIFT	(IORING_CQE_BUFFER_SHIFT + 1)
> 
> Don't completely follow this, would at the very least need a comment.
> Whether it's a HW or SW timestamp is flagged in the upper 16 bits, just
> like a provided buffer ID. But since we don't use buffer IDs here, then
> it's up for grabs. Do we have other commands that use the upper flags
> space for command private flags?

Probably not, but the place is better than the lower half, which
has common flags like F_MORE, especially since the patch is already
using it to store the type.

> The above makes sense, but then what is IORING_TIMESTAMP_TSTYPE_SHIFT?

It's the shift for where the timestamp type is stored; HW vs SW is
not a timestamp type. I don't get the question.
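
For reference, a minimal userspace sketch of how the upper flag bits could
be decoded under this layout. The macro values are copied from the patch
and from the uapi header (IORING_CQE_BUFFER_SHIFT is 16, IORING_CQE_F_MORE
is bit 1). The helper names ts_cqe_is_hw() and ts_cqe_tstype() are made up
for illustration and are not part of the series.

```c
#include <assert.h>
#include <stdint.h>

/* Values mirrored from include/uapi/linux/io_uring.h and the patch. */
#define IORING_CQE_F_MORE		(1U << 1)
#define IORING_CQE_BUFFER_SHIFT		16
#define IORING_CQE_F_TIMESTAMP_HW	((uint32_t)1 << IORING_CQE_BUFFER_SHIFT)
#define IORING_TIMESTAMP_TSTYPE_SHIFT	(IORING_CQE_BUFFER_SHIFT + 1)

/* Hypothetical helper: 1 if the timestamp came from hardware, else 0. */
static inline int ts_cqe_is_hw(uint32_t flags)
{
	return !!(flags & IORING_CQE_F_TIMESTAMP_HW);
}

/* Hypothetical helper: the tstype sits above the HW bit, so shift it down. */
static inline uint32_t ts_cqe_tstype(uint32_t flags)
{
	return flags >> IORING_TIMESTAMP_TSTYPE_SHIFT;
}
```

With flags = IORING_CQE_F_MORE | IORING_CQE_F_TIMESTAMP_HW |
(2 << IORING_TIMESTAMP_TSTYPE_SHIFT), ts_cqe_is_hw() is nonzero and
ts_cqe_tstype() gives 2, so the HW bit and the type never overlap.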

-- 
Pavel Begunkov



* Re: [PATCH v3 5/5] io_uring/netcmd: add tx timestamping cmd support
  2025-06-12 14:26     ` Pavel Begunkov
@ 2025-06-12 14:31       ` Jens Axboe
  2025-06-12 15:01         ` Pavel Begunkov
  0 siblings, 1 reply; 14+ messages in thread
From: Jens Axboe @ 2025-06-12 14:31 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring, Vadim Fedorenko
  Cc: netdev, Eric Dumazet, Kuniyuki Iwashima, Paolo Abeni,
	Willem de Bruijn, David S . Miller, Jakub Kicinski,
	Richard Cochran, Stanislav Fomichev, Jason Xing

On 6/12/25 8:26 AM, Pavel Begunkov wrote:
>>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
>>> index cfd17e382082..5c89e6f6d624 100644
>>> --- a/include/uapi/linux/io_uring.h
>>> +++ b/include/uapi/linux/io_uring.h
>>> @@ -968,6 +968,15 @@ enum io_uring_socket_op {
>>>       SOCKET_URING_OP_SIOCOUTQ,
>>>       SOCKET_URING_OP_GETSOCKOPT,
>>>       SOCKET_URING_OP_SETSOCKOPT,
>>> +    SOCKET_URING_OP_TX_TIMESTAMP,
>>> +};
>>> +
>>> +#define IORING_CQE_F_TIMESTAMP_HW    ((__u32)1 << IORING_CQE_BUFFER_SHIFT)
>>> +#define IORING_TIMESTAMP_TSTYPE_SHIFT    (IORING_CQE_BUFFER_SHIFT + 1)
>>
>> Don't completely follow this, would at the very least need a comment.
>> Whether it's a HW or SW timestamp is flagged in the upper 16 bits, just
>> like a provided buffer ID. But since we don't use buffer IDs here, then
>> it's up for grabs. Do we have other commands that use the upper flags
>> space for command private flags?
> 
> Probably not, but the place is better than the lower half, which
> has common flags like F_MORE, especially since the patch is already
> using it to store the type.

Just pondering whether it should be formalized, but probably no point as
each opcode should be free to use the space as it wants.

>> The above makes sense, but then what is IORING_TIMESTAMP_TSTYPE_SHIFT?
> 
> It's a shift for where the timestamp type is stored, HW vs SW is
> not a timestamp type. I don't get the question.

Please add a spec-like comment on top of it explaining the usage of the
upper bits in the flags field, then. I try to keep the io_uring.h uapi
header pretty well commented and documented.

-- 
Jens Axboe


* Re: [PATCH v3 5/5] io_uring/netcmd: add tx timestamping cmd support
  2025-06-12 14:31       ` Jens Axboe
@ 2025-06-12 15:01         ` Pavel Begunkov
  0 siblings, 0 replies; 14+ messages in thread
From: Pavel Begunkov @ 2025-06-12 15:01 UTC (permalink / raw)
  To: Jens Axboe, io-uring, Vadim Fedorenko
  Cc: netdev, Eric Dumazet, Kuniyuki Iwashima, Paolo Abeni,
	Willem de Bruijn, David S . Miller, Jakub Kicinski,
	Richard Cochran, Stanislav Fomichev, Jason Xing

On 6/12/25 15:31, Jens Axboe wrote:
> On 6/12/25 8:26 AM, Pavel Begunkov wrote:
>>>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
>>>> index cfd17e382082..5c89e6f6d624 100644
>>>> --- a/include/uapi/linux/io_uring.h
>>>> +++ b/include/uapi/linux/io_uring.h
>>>> @@ -968,6 +968,15 @@ enum io_uring_socket_op {
>>>>        SOCKET_URING_OP_SIOCOUTQ,
>>>>        SOCKET_URING_OP_GETSOCKOPT,
>>>>        SOCKET_URING_OP_SETSOCKOPT,
>>>> +    SOCKET_URING_OP_TX_TIMESTAMP,
>>>> +};
>>>> +
>>>> +#define IORING_CQE_F_TIMESTAMP_HW    ((__u32)1 << IORING_CQE_BUFFER_SHIFT)
>>>> +#define IORING_TIMESTAMP_TSTYPE_SHIFT    (IORING_CQE_BUFFER_SHIFT + 1)
>>>
>>> Don't completely follow this, would at the very least need a comment.
>>> Whether it's a HW or SW timestamp is flagged in the upper 16 bits, just
>>> like a provided buffer ID. But since we don't use buffer IDs here, then
>>> it's up for grabs. Do we have other commands that use the upper flags
>>> space for command private flags?
>>
>> Probably not, but the place is better than the lower half, which
>> has common flags like F_MORE, especially since the patch is already
>> using it to store the type.
> 
> Just pondering whether it should be formalized, but probably no point as
> each opcode should be free to use the space as it wants.

Right, that's what I insisted on a long time ago: all fields except
user_data are opcode-specific, even if some flags are reused for the
user's convenience. There is no need to reserve the upper half of
flags for provided buffers when the majority of opcodes don't
care about the feature.

>>> The above makes sense, but then what is IORING_TIMESTAMP_TSTYPE_SHIFT?
>>
>> It's a shift for where the timestamp type is stored, HW vs SW is
>> not a timestamp type. I don't get the question.
> 
> Please add a spec-like comment on top of it explaining the usage of the
> upper bits in the flags field, then. I try to keep the io_uring.h uapi
> header pretty well commented and documented.

Ok

-- 
Pavel Begunkov



* Re: [PATCH v3 1/5] net: timestamp: add helper returning skb's tx tstamp
  2025-06-12  9:09 ` [PATCH v3 1/5] net: timestamp: add helper returning skb's tx tstamp Pavel Begunkov
@ 2025-06-12 21:20   ` Willem de Bruijn
  0 siblings, 0 replies; 14+ messages in thread
From: Willem de Bruijn @ 2025-06-12 21:20 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring, Vadim Fedorenko
  Cc: asml.silence, netdev, Eric Dumazet, Kuniyuki Iwashima,
	Paolo Abeni, Willem de Bruijn, David S . Miller, Jakub Kicinski,
	Richard Cochran, Stanislav Fomichev, Jason Xing

Pavel Begunkov wrote:
> Add a helper function skb_get_tx_timestamp() that returns a tx timestamp
> associated with an skb from an queue queue.

(minor) repeated queue
 
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
>  include/net/sock.h |  9 +++++++++
>  net/socket.c       | 45 +++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 54 insertions(+)
> 
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 92e7c1aae3cc..0b96196d8a34 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -2677,6 +2677,15 @@ void __sock_recv_timestamp(struct msghdr *msg, struct sock *sk,
>  void __sock_recv_wifi_status(struct msghdr *msg, struct sock *sk,
>  			     struct sk_buff *skb);
>  
> +enum {
> +	NET_TIMESTAMP_ORIGIN_SW		= 0,
> +	NET_TIMESTAMP_ORIGIN_HW		= 1,
> +};
> +
> +bool skb_has_tx_timestamp(struct sk_buff *skb, const struct sock *sk);
> +int skb_get_tx_timestamp(struct sk_buff *skb, struct sock *sk,
> +			 struct timespec64 *ts);
> +
>  static inline void
>  sock_recv_timestamp(struct msghdr *msg, struct sock *sk, struct sk_buff *skb)
>  {
> diff --git a/net/socket.c b/net/socket.c
> index 9a0e720f0859..9bb618c32d65 100644
> --- a/net/socket.c
> +++ b/net/socket.c
> @@ -843,6 +843,51 @@ static void put_ts_pktinfo(struct msghdr *msg, struct sk_buff *skb,
>  		 sizeof(ts_pktinfo), &ts_pktinfo);
>  }
>  
> +bool skb_has_tx_timestamp(struct sk_buff *skb, const struct sock *sk)
> +{
> +	const struct sock_exterr_skb *serr = SKB_EXT_ERR(skb);
> +	u32 tsflags = READ_ONCE(sk->sk_tsflags);
> +
> +	if (serr->ee.ee_errno != ENOMSG ||
> +	   serr->ee.ee_origin != SO_EE_ORIGIN_TIMESTAMPING)
> +		return false;
> +
> +	/* software time stamp available and wanted */
> +	if ((tsflags & SOF_TIMESTAMPING_SOFTWARE) && skb->tstamp)
> +		return true;
> +	/* hardware time stamps available and wanted */
> +	return (tsflags & SOF_TIMESTAMPING_RAW_HARDWARE) &&
> +		skb_hwtstamps(skb)->hwtstamp;
> +}
> +
> +int skb_get_tx_timestamp(struct sk_buff *skb, struct sock *sk,
> +			  struct timespec64 *ts)
> +{
> +	u32 tsflags = READ_ONCE(sk->sk_tsflags);
> +	ktime_t hwtstamp;
> +	int if_index = 0;
> +
> +	if ((tsflags & SOF_TIMESTAMPING_SOFTWARE) &&
> +	    ktime_to_timespec64_cond(skb->tstamp, ts))
> +		return NET_TIMESTAMP_ORIGIN_SW;
> +
> +	if (!(tsflags & SOF_TIMESTAMPING_RAW_HARDWARE) ||
> +	    skb_is_swtx_tstamp(skb, false))
> +		return -ENOENT;
> +
> +	if (skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP_NETDEV)
> +		hwtstamp = get_timestamp(sk, skb, &if_index);
> +	else
> +		hwtstamp = skb_hwtstamps(skb)->hwtstamp;
> +
> +	if (tsflags & SOF_TIMESTAMPING_BIND_PHC)
> +		hwtstamp = ptp_convert_timestamp(&hwtstamp,
> +						READ_ONCE(sk->sk_bind_phc));
> +	if (!ktime_to_timespec64_cond(hwtstamp, ts))
> +		return -ENOENT;

(minor) consider an empty line between the branch and final return stmt.
> +	return NET_TIMESTAMP_ORIGIN_HW;
> +}
> +
>  /*
>   * called from sock_recv_timestamp() if sock_flag(sk, SOCK_RCVTSTAMP)
>   */
> -- 
> 2.49.0
> 
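
The return convention of skb_get_tx_timestamp() can be modelled in a few
lines of plain C. This is only a sketch of the decision order visible in
the hunk above: struct tstamp_model and model_get_tx_timestamp() are
hypothetical names, the PHC conversion and the netdev path are left out,
and it assumes that an skb carrying a software tx stamp never reports a
hardware one (which is how the skb_is_swtx_tstamp() check is read here).

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

enum {
	NET_TIMESTAMP_ORIGIN_SW = 0,
	NET_TIMESTAMP_ORIGIN_HW = 1,
};

/* Hypothetical stand-in for the socket flags and skb state involved. */
struct tstamp_model {
	int	want_sw;	/* SOF_TIMESTAMPING_SOFTWARE requested */
	int	want_hw;	/* SOF_TIMESTAMPING_RAW_HARDWARE requested */
	int64_t	sw_ns;		/* software stamp, 0 if absent */
	int64_t	hw_ns;		/* hardware stamp, 0 if absent */
};

/* Returns the origin (>= 0) and fills *out_ns, or -ENOENT if nothing fits. */
static int model_get_tx_timestamp(const struct tstamp_model *m, int64_t *out_ns)
{
	/* A software stamp wins when it is both wanted and present. */
	if (m->want_sw && m->sw_ns) {
		*out_ns = m->sw_ns;
		return NET_TIMESTAMP_ORIGIN_SW;
	}
	/* Assumed: a present software stamp rules out the hardware path. */
	if (!m->want_hw || m->sw_ns)
		return -ENOENT;
	if (!m->hw_ns)
		return -ENOENT;

	*out_ns = m->hw_ns;
	return NET_TIMESTAMP_ORIGIN_HW;
}
```

A caller only needs to distinguish a negative return from the two origin
values, which is what lets Patch 5 turn the result into a single CQE bit.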




* Re: [PATCH v3 5/5] io_uring/netcmd: add tx timestamping cmd support
  2025-06-12  9:09 ` [PATCH v3 5/5] io_uring/netcmd: add tx timestamping cmd support Pavel Begunkov
  2025-06-12 14:12   ` Jens Axboe
@ 2025-06-12 21:35   ` Willem de Bruijn
  2025-06-13 18:29     ` Pavel Begunkov
  1 sibling, 1 reply; 14+ messages in thread
From: Willem de Bruijn @ 2025-06-12 21:35 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring, Vadim Fedorenko
  Cc: asml.silence, netdev, Eric Dumazet, Kuniyuki Iwashima,
	Paolo Abeni, Willem de Bruijn, David S . Miller, Jakub Kicinski,
	Richard Cochran, Stanislav Fomichev, Jason Xing

Pavel Begunkov wrote:
> Add a new socket command which returns tx time stamps to the user. It
> provide an alternative to the existing error queue recvmsg interface.
> The command works in a polled multishot mode, which means io_uring will
> poll the socket and keep posting timestamps until the request is
> cancelled or fails in any other way (e.g. with no space in the CQ). It
> reuses the net infra and grabs timestamps from the socket's error queue.
> 
> The command requires IORING_SETUP_CQE32. All non-final CQEs (marked with
> IORING_CQE_F_MORE) have cqe->res set to the tskey, and the upper 16 bits
> of cqe->flags keep tstype (i.e. offset by IORING_CQE_BUFFER_SHIFT). The
> timevalue is store in the upper part of the extended CQE. The final
> completion won't have IORING_CQR_F_MORE and will have cqe->res storing
> 0/error.
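
As a rough illustration of the CQE layout described above, a userspace
consumer might split a 32-byte completion as sketched below. The struct
cqe32_model mirrors the uapi struct io_uring_cqe with its two extra
64-bit words; struct tx_tstamp and decode_tx_tstamp() are hypothetical
names, and decoding of the upper flag bits is deliberately left out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define IORING_CQE_F_MORE	(1U << 1)

struct io_timespec {
	uint64_t tv_sec;
	uint64_t tv_nsec;
};

/* Model of struct io_uring_cqe on a ring created with IORING_SETUP_CQE32. */
struct cqe32_model {
	uint64_t user_data;
	int32_t  res;
	uint32_t flags;
	uint64_t big_cqe[2];	/* upper half of the extended CQE */
};

/* Hypothetical decoded form. */
struct tx_tstamp {
	uint32_t	   tskey;
	struct io_timespec ts;
};

/*
 * Returns 1 and fills *out for a timestamp CQE. Returns 0 for the final
 * completion, where cqe->res holds 0 or a negative error instead.
 */
static int decode_tx_tstamp(const struct cqe32_model *cqe, struct tx_tstamp *out)
{
	if (!(cqe->flags & IORING_CQE_F_MORE))
		return 0;

	out->tskey = (uint32_t)cqe->res;
	/* The time value occupies the second 16 bytes of the CQE. */
	memcpy(&out->ts, cqe->big_cqe, sizeof(out->ts));
	return 1;
}
```

A completion loop would call this for every CQE of the request and stop
re-arming once it returns 0.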
> 
> Suggested-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
>  include/uapi/linux/io_uring.h |  9 ++++
>  io_uring/cmd_net.c            | 82 +++++++++++++++++++++++++++++++++++
>  2 files changed, 91 insertions(+)
> 
> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> index cfd17e382082..5c89e6f6d624 100644
> --- a/include/uapi/linux/io_uring.h
> +++ b/include/uapi/linux/io_uring.h
> @@ -968,6 +968,15 @@ enum io_uring_socket_op {
>  	SOCKET_URING_OP_SIOCOUTQ,
>  	SOCKET_URING_OP_GETSOCKOPT,
>  	SOCKET_URING_OP_SETSOCKOPT,
> +	SOCKET_URING_OP_TX_TIMESTAMP,
> +};
> +
> +#define IORING_CQE_F_TIMESTAMP_HW	((__u32)1 << IORING_CQE_BUFFER_SHIFT)
> +#define IORING_TIMESTAMP_TSTYPE_SHIFT	(IORING_CQE_BUFFER_SHIFT + 1)
> +

Perhaps instead of these shifts define an actual struct, e.g.,
io_uring_cqe_tstamp.

One question is the number of bits to reserve for the tstype.
Currently only 2 are needed. But that can grow. The current
approach conveniently leaves that open.

Alternatively, perhaps make the dependency between the shifts more
obvious:

+#define IORING_TIMESTAMP_HW_SHIFT	IORING_CQE_BUFFER_SHIFT
+#define IORING_TIMESTAMP_TYPE_SHIFT	(IORING_CQE_BUFFER_SHIFT + 1)

+#define IORING_CQE_F_TSTAMP_HW		((__u32)1 << IORING_TIMESTAMP_HW_SHIFT)
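
A small sketch of how the flag word could be packed with these names,
mirroring what io_process_timestamp_skb() does in the patch. The helper
ts_cqe_pack_flags() is hypothetical, and the static assertion only
documents the assumed relationship between the two shifts.

```c
#include <assert.h>
#include <stdint.h>

#define IORING_CQE_F_MORE		(1U << 1)
#define IORING_CQE_BUFFER_SHIFT		16

#define IORING_TIMESTAMP_HW_SHIFT	IORING_CQE_BUFFER_SHIFT
#define IORING_TIMESTAMP_TYPE_SHIFT	(IORING_CQE_BUFFER_SHIFT + 1)
#define IORING_CQE_F_TSTAMP_HW		((uint32_t)1 << IORING_TIMESTAMP_HW_SHIFT)

/* The type field is assumed to start right above the HW bit. */
_Static_assert(IORING_TIMESTAMP_TYPE_SHIFT == IORING_TIMESTAMP_HW_SHIFT + 1,
	       "tstype must sit directly above the HW flag");

/* Hypothetical helper: build cqe->flags for a non-final timestamp CQE. */
static inline uint32_t ts_cqe_pack_flags(uint32_t tstype, int is_hw)
{
	uint32_t flags = IORING_CQE_F_MORE;

	flags |= tstype << IORING_TIMESTAMP_TYPE_SHIFT;
	if (is_hw)
		flags |= IORING_CQE_F_TSTAMP_HW;
	return flags;
}
```

For example, ts_cqe_pack_flags(1, 1) yields 0x30002: bit 1 for
IORING_CQE_F_MORE, bit 16 for the HW flag and bit 17 for tstype 1.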

> +struct io_timespec {
> +	__u64		tv_sec;
> +	__u64		tv_nsec;
>  };
>  
>  /* Zero copy receive refill queue entry */
> diff --git a/io_uring/cmd_net.c b/io_uring/cmd_net.c
> index e99170c7d41a..bc2d33ea2db3 100644
> --- a/io_uring/cmd_net.c
> +++ b/io_uring/cmd_net.c
> @@ -1,5 +1,6 @@
>  #include <asm/ioctls.h>
>  #include <linux/io_uring/net.h>
> +#include <linux/errqueue.h>
>  #include <net/sock.h>
>  
>  #include "uring_cmd.h"
> @@ -51,6 +52,85 @@ static inline int io_uring_cmd_setsockopt(struct socket *sock,
>  				  optlen);
>  }
>  
> +static bool io_process_timestamp_skb(struct io_uring_cmd *cmd, struct sock *sk,
> +				     struct sk_buff *skb, unsigned issue_flags)
> +{
> +	struct sock_exterr_skb *serr = SKB_EXT_ERR(skb);
> +	struct io_uring_cqe cqe[2];
> +	struct io_timespec *iots;
> +	struct timespec64 ts;
> +	u32 tstype, tskey;
> +	int ret;
> +
> +	BUILD_BUG_ON(sizeof(struct io_uring_cqe) != sizeof(struct io_timespec));
> +
> +	ret = skb_get_tx_timestamp(skb, sk, &ts);
> +	if (ret < 0)
> +		return false;
> +
> +	tskey = serr->ee.ee_data;
> +	tstype = serr->ee.ee_info;
> +
> +	cqe->user_data = 0;
> +	cqe->res = tskey;
> +	cqe->flags = IORING_CQE_F_MORE;
> +	cqe->flags |= tstype << IORING_TIMESTAMP_TSTYPE_SHIFT;
> +	if (ret == NET_TIMESTAMP_ORIGIN_HW)
> +		cqe->flags |= IORING_CQE_F_TIMESTAMP_HW;
> +
> +	iots = (struct io_timespec *)&cqe[1];
> +	iots->tv_sec = ts.tv_sec;
> +	iots->tv_nsec = ts.tv_nsec;
> +	return io_uring_cmd_post_mshot_cqe32(cmd, issue_flags, cqe);
> +}
> +
> +static int io_uring_cmd_timestamp(struct socket *sock,
> +				  struct io_uring_cmd *cmd,
> +				  unsigned int issue_flags)
> +{
> +	struct sock *sk = sock->sk;
> +	struct sk_buff_head *q = &sk->sk_error_queue;
> +	struct sk_buff *skb, *tmp;
> +	struct sk_buff_head list;
> +	int ret;
> +
> +	if (!(issue_flags & IO_URING_F_CQE32))
> +		return -EINVAL;
> +	ret = io_cmd_poll_multishot(cmd, issue_flags, EPOLLERR);
> +	if (unlikely(ret))
> +		return ret;
> +
> +	if (skb_queue_empty_lockless(q))
> +		return -EAGAIN;
> +	__skb_queue_head_init(&list);
> +
> +	scoped_guard(spinlock_irq, &q->lock) {
> +		skb_queue_walk_safe(q, skb, tmp) {
> +			/* don't support skbs with payload */
> +			if (!skb_has_tx_timestamp(skb, sk) || skb->len)
> +				continue;
> +			__skb_unlink(skb, q);
> +			__skb_queue_tail(&list, skb);
> +		}
> +	}
> +
> +	while (1) {
> +		skb = skb_peek(&list);
> +		if (!skb)
> +			break;
> +		if (!io_process_timestamp_skb(cmd, sk, skb, issue_flags))
> +			break;
> +		__skb_dequeue(&list);
> +		consume_skb(skb);
> +	}
> +
> +	if (!unlikely(skb_queue_empty(&list))) {
> +		scoped_guard(spinlock_irqsave, &q->lock)
> +			skb_queue_splice(q, &list);
> +	}
> +	return -EAGAIN;
> +}
> +
>  int io_uring_cmd_sock(struct io_uring_cmd *cmd, unsigned int issue_flags)
>  {
>  	struct socket *sock = cmd->file->private_data;
> @@ -76,6 +156,8 @@ int io_uring_cmd_sock(struct io_uring_cmd *cmd, unsigned int issue_flags)
>  		return io_uring_cmd_getsockopt(sock, cmd, issue_flags);
>  	case SOCKET_URING_OP_SETSOCKOPT:
>  		return io_uring_cmd_setsockopt(sock, cmd, issue_flags);
> +	case SOCKET_URING_OP_TX_TIMESTAMP:
> +		return io_uring_cmd_timestamp(sock, cmd, issue_flags);
>  	default:
>  		return -EOPNOTSUPP;
>  	}
> -- 
> 2.49.0
> 




* Re: [PATCH v3 5/5] io_uring/netcmd: add tx timestamping cmd support
  2025-06-12 21:35   ` Willem de Bruijn
@ 2025-06-13 18:29     ` Pavel Begunkov
  0 siblings, 0 replies; 14+ messages in thread
From: Pavel Begunkov @ 2025-06-13 18:29 UTC (permalink / raw)
  To: Willem de Bruijn, io-uring, Vadim Fedorenko
  Cc: netdev, Eric Dumazet, Kuniyuki Iwashima, Paolo Abeni,
	Willem de Bruijn, David S . Miller, Jakub Kicinski,
	Richard Cochran, Stanislav Fomichev, Jason Xing

On 6/12/25 22:35, Willem de Bruijn wrote:
> Pavel Begunkov wrote:
>> Add a new socket command which returns tx time stamps to the user. It
>> provide an alternative to the existing error queue recvmsg interface.
>> The command works in a polled multishot mode, which means io_uring will
>> poll the socket and keep posting timestamps until the request is
>> cancelled or fails in any other way (e.g. with no space in the CQ). It
>> reuses the net infra and grabs timestamps from the socket's error queue.
>>
>> The command requires IORING_SETUP_CQE32. All non-final CQEs (marked with
>> IORING_CQE_F_MORE) have cqe->res set to the tskey, and the upper 16 bits
>> of cqe->flags keep tstype (i.e. offset by IORING_CQE_BUFFER_SHIFT). The
>> timevalue is store in the upper part of the extended CQE. The final
>> completion won't have IORING_CQR_F_MORE and will have cqe->res storing
>> 0/error.
>>
>> Suggested-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
>> ---
>>   include/uapi/linux/io_uring.h |  9 ++++
>>   io_uring/cmd_net.c            | 82 +++++++++++++++++++++++++++++++++++
>>   2 files changed, 91 insertions(+)
>>
>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
>> index cfd17e382082..5c89e6f6d624 100644
>> --- a/include/uapi/linux/io_uring.h
>> +++ b/include/uapi/linux/io_uring.h
>> @@ -968,6 +968,15 @@ enum io_uring_socket_op {
>>   	SOCKET_URING_OP_SIOCOUTQ,
>>   	SOCKET_URING_OP_GETSOCKOPT,
>>   	SOCKET_URING_OP_SETSOCKOPT,
>> +	SOCKET_URING_OP_TX_TIMESTAMP,
>> +};
>> +
>> +#define IORING_CQE_F_TIMESTAMP_HW	((__u32)1 << IORING_CQE_BUFFER_SHIFT)
>> +#define IORING_TIMESTAMP_TSTYPE_SHIFT	(IORING_CQE_BUFFER_SHIFT + 1)
>> +
> 
> Perhaps instead of these shifts define an actual struct, e.g.,
> io_uring_cqe_tstamp.

That wouldn't be pretty since there is a generic io_uring field in
there: it would need to be cast between types and explained to the
user as aliased, and you would still need to pack the type somehow.
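
To make the aliasing concern concrete, here is a sketch of what such a
typed view might look like. Both structs are models written for this
illustration (cqe_model follows the first 16 bytes of the uapi
struct io_uring_cqe; cqe_tstamp_model is hypothetical and not proposed
anywhere in the series).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* First 16 bytes of struct io_uring_cqe. */
struct cqe_model {
	uint64_t user_data;
	int32_t  res;
	uint32_t flags;
};

/* Hypothetical typed view of the same bytes. */
struct cqe_tstamp_model {
	uint64_t user_data;
	uint32_t tskey;		/* aliases res */
	uint32_t flags;		/* still generic: F_MORE lives in the low half */
};

/* The view is only usable if it overlays the generic CQE exactly. */
_Static_assert(sizeof(struct cqe_tstamp_model) == sizeof(struct cqe_model),
	       "views must have the same size");
_Static_assert(offsetof(struct cqe_tstamp_model, tskey) ==
	       offsetof(struct cqe_model, res), "tskey must alias res");
_Static_assert(offsetof(struct cqe_tstamp_model, flags) ==
	       offsetof(struct cqe_model, flags), "flags must stay in place");
```

Even with the typed view, flags remains a shared 32-bit word, so the
tstype and HW bit would still have to be packed with shifts.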

> One question is the number of bits to reserve for the tstype.
> Currently only 2 are needed. But that can grow. The current
> approach conveniently leaves that open.
> 
> Alternatively, perhaps make the dependency between the shifts more
> obvious:
> 
> +#define IORING_TIMESTAMP_HW_SHIFT	IORING_CQE_BUFFER_SHIFT
> +#define IORING_TIMESTAMP_TYPE_SHIFT	(IORING_CQE_BUFFER_SHIFT + 1)
> 
> +#define IORING_CQE_F_TSTAMP_HW		((__u32)1 << IORING_TIMESTAMP_HW_SHIFT)

Let's do that

-- 
Pavel Begunkov




Thread overview: 14+ messages
2025-06-12  9:09 [PATCH v3 0/5] io_uring cmd for tx timestamps Pavel Begunkov
2025-06-12  9:09 ` [PATCH v3 1/5] net: timestamp: add helper returning skb's tx tstamp Pavel Begunkov
2025-06-12 21:20   ` Willem de Bruijn
2025-06-12  9:09 ` [PATCH v3 2/5] io_uring/poll: introduce io_arm_apoll() Pavel Begunkov
2025-06-12  9:09 ` [PATCH v3 3/5] io_uring/cmd: allow multishot polled commands Pavel Begunkov
2025-06-12  9:09 ` [PATCH v3 4/5] io_uring: add mshot helper for posting CQE32 Pavel Begunkov
2025-06-12  9:09 ` [PATCH v3 5/5] io_uring/netcmd: add tx timestamping cmd support Pavel Begunkov
2025-06-12 14:12   ` Jens Axboe
2025-06-12 14:26     ` Pavel Begunkov
2025-06-12 14:31       ` Jens Axboe
2025-06-12 15:01         ` Pavel Begunkov
2025-06-12 21:35   ` Willem de Bruijn
2025-06-13 18:29     ` Pavel Begunkov
2025-06-12  9:15 ` [PATCH v3 0/5] io_uring cmd for tx timestamps Pavel Begunkov
